    Jeff P
    Hi, is anybody available to discuss sparklyr/sparklyr#2534? It is a bit difficult to express in words and requires Kafka to replicate. This is blocking me from going further, so I am curious whether any of you have insights.
    Jeff P
    @HikaGenji note to self: problem solved, see the ticket for more info

    Hi, someone please help me fix the error below. I have another working setup on Hadoop 2 (EMR 5.x). Now I am testing EMR 6 with a new Spark home, /usr/lib/spark6/. I compared both setups and everything looks good to me. Is there any specific setting I need to check?

    sc <- spark_connect(master = "yarn", spark_home = "/usr/lib/spark6", deploymode = "cluster", enableHiveSupport = TRUE)
    Error in force(code) :
    Failed while connecting to sparklyr to port (8880) for sessionid (32486): Gateway in localhost:8880 did not respond.
    Path: /usr/lib/spark6/bin/spark-submit
    Parameters: --class, sparklyr.Shell, '/opt/R/3.6.0/lib64/R/library/sparklyr/java/sparklyr-2.4-2.11.jar', 8880, 32486
    Log: /tmp/RtmpijZOtA/filee69e18f188dc_spark.log

    ---- Output Log ----
    Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
    at sparklyr.Shell$.main(shell.scala:9)
    at sparklyr.Shell.main(shell.scala)

    Yitao Li
    @bdharang The error is caused by a Scala version incompatibility -- I still need to look into whether there is a reasonable way to fix it, but for now you can work around it by passing version = "3.0.0-preview" to force sparklyr to load jar files compiled with Scala 2.12.
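    A minimal sketch of that workaround, reusing the Spark home from the error above; the version string is what tells sparklyr to pick its jars built against Scala 2.12 (the connection itself obviously needs a live YARN cluster):

    ```r
    library(sparklyr)

    # Hypothetical connection mirroring the EMR 6 setup above;
    # "3.0.0-preview" makes sparklyr load sparklyr-3.0-2.12.jar
    # (Scala 2.12) instead of the Scala 2.11 build.
    sc <- spark_connect(
      master = "yarn",
      spark_home = "/usr/lib/spark6",
      version = "3.0.0-preview"
    )
    ```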
    Maher Daoud
    Hi guys, I'm working on a project where we need to apply a huge number of mutations to the data. I'm using the sparklyr package with the dplyr mutate function.
    The problem is that when applying a huge number of mutations using mutate, I get the following error:
    "Error: org.apache.spark.sql.AnalysisException: cannot resolve 'GOOGLE_CITY_DESC' given input columns:"
    It seems there is a limitation on using the mutate function.
    Any help, please?
    Yitao Li
    @maherdaoud I'm not aware of any built-in limitation with dplyr.
    ^^ dplyr is just an R interface that translates all your data manipulation verbs to Spark SQL, so it has the same capabilities as Spark SQL itself. I suggest double-checking the schema of your result right before the failing dplyr::mutate to see if the column names are what you expect.
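    One hedged way to do that sanity check, where `sdf` stands in for the intermediate Spark dataframe just before the failing mutate (both the name and the pipeline are placeholders):

    ```r
    library(sparklyr)
    library(dplyr)

    # `sdf` is a hypothetical tbl_spark built up by the preceding mutates.
    sdf %>% colnames()      # column names as dplyr sees them
    sdf %>% sdf_schema()    # name/type pairs reported by Spark itself
    ```

    If a column such as GOOGLE_CITY_DESC is missing from either listing, the problem is upstream of the mutate that errors out.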
    Maher Daoud
    Yitao Li
    @maherdaoud wow that's really interesting. I didn't realize too many mutate calls would fail in this way. My guess is it might be because the resulting SQL query becomes too long. Anyway, I'll create a GitHub issue for it.
    Oh nevermind I saw the one you created already.
    Maher Daoud
    Any workaround suggestion?
    Yitao Li
    I'll need to look into it. I think what happens is that all dplyr::mutate calls are evaluated lazily, so when you accumulate too many of them that are yet to be "materialized" in the backend, you end up with a very long SQL query. But I'm not exactly sure whether the SQL query being too long is the root cause of this problem; that was just a guess off the top of my head.
    Maher Daoud
    I think you are right. It seems each mutate call creates a new subquery with a new name; after too many subqueries it can't generate more names for the last one.
    Yitao Li
    If I read correctly, https://dplyr.tidyverse.org/reference/compute.html essentially says compute() forces the SQL query you have accumulated so far to be evaluated, so that might help.
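    A sketch of how compute() could be interleaved with a long mutate chain, under the assumption that materializing intermediate results keeps each generated SQL query short (`sdf`, the table name, and the column names are all placeholders):

    ```r
    library(dplyr)

    # Materialize into a temporary Spark table every so often so the
    # lazily accumulated SQL query does not grow without bound.
    sdf_stage1 <- sdf %>%
      mutate(x2 = x * 2, x3 = x * 3) %>%   # ...many more mutates here...
      compute("stage1_tmp")                # forces evaluation at this point

    result <- sdf_stage1 %>%
      mutate(x4 = x2 + x3)                 # chain continues from the temp table
    ```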
    Maher Daoud
    Let me try, I'll be back, and thanks for your great support bro :>
    Maher Daoud
    I tried both "compute" and "collapse"; it takes a long time without finishing. I applied it to 60,000 rows of a Spark dataframe.
    Yitao Li
    If it appears to be hanging forever, then I would also do the following:
    <my_spark_dataframe> %>% dplyr::mutate(...) %>% ... %>% dplyr::show_query() and then sanity-check the query is OK and then try running the query directly to see how long that takes
    Maher Daoud
    running the query directly using dbGetQuery?
    Yitao Li
    Yes either that or just launch a spark-sql shell (from $SPARK_HOME/bin/spark-sql), copy-paste the query, and then run
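    For reference, the debugging loop described above might look like this (the path depends on your Spark installation):

    ```shell
    # Launch the interactive SQL shell that ships with Spark...
    $SPARK_HOME/bin/spark-sql
    # ...then paste the query printed by dplyr::show_query() at the prompt
    # and time how long it takes with the R layer out of the picture.
    ```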
    Maher Daoud
    Yes, but in fact this will not help; as you know, it takes too much time after adding the "compute" or "collapse" calls.
    I think I need to wait until this bug is solved or a workaround is found :(
    Yitao Li
    I was suggesting doing it for debugging purposes: if running that query from the spark-sql shell directly returns results fast enough (i.e., bypassing the R layer entirely), then we can conclude the slowness is caused by some bug in one of the R packages.
    Maher Daoud
    If we are talking about spark-sql, it's very fast; I also tried the query directly using dbGetQuery and it was fast too.
    But we shouldn't forget the main issue mentioned above.
    Yitao Li
    @maherdaoud Hey I just commented on sparklyr/sparklyr#2589 with some good news for you. Let me know what you think.
    Thanks for raising this issue for sparklyr BTW. It's a really good catch! Even though in the end it appears not to be a sparklyr problem :D
    Maher Daoud
    What great news! Let me check and I will get back to you with my feedback. Again, many thanks for your great efforts.
    Niels Jespersen
    I just cannot get sparklyr to work on Windows in local mode. No matter what I do, the only error message I see is "Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, : Gateway in localhost:8880 did not respond.". R 3.6.1, R 4.0.2, Spark 2.0.1, Spark 2.4.5, Java 8. No filepaths involved (Java, Spark) contain spaces. Log files are created, but empty. winutils.exe is copied to where Spark/Hadoop wants it (at least the error message saying it misses winutils goes away). A Java process is created. Any hints for further investigation? Has anyone made this work recently?
    Yitao Li

    @njesp You can try the following to print the spark-submit log to the console to see what's failing:

    options(sparklyr.log.console = TRUE)
    sc <- spark_connect(master = "local")

    The spark-submit log usually ends up in a text file, but the path to that file is highly system-dependent and can also be influenced by your local config... so rather than spending time figuring out where it might be, it's easier to just have options(sparklyr.log.console = TRUE) while troubleshooting.

    Niels Jespersen
    @yl790 Thank you for replying. Today it suddenly works, at least on my notebook. Tomorrow I will try again on my workstation at work. options(sparklyr.log.console = TRUE) has an effect when running R in a console, but it seems that RStudio eats the log messages somehow. I will get back if I still have problems on my workstation tomorrow. Once again, thank you for helping.
    Niels Jespersen
    @yl790 Now in the office, working behind a proxy. Logging to the console works when running R from a console. It's actually spark.sas7bdat that causes the trouble, as it depends on Maven's ability to fetch jars over the internet. Here is the log.

    @yl790 Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, :
    Gateway in localhost:8880 did not respond.

    :: resolution report :: resolve 84419ms :: artifacts dl 0ms

        :: modules in use:
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        |      default     |   1   |   0   |   0   |   0   ||   0   |   0   |

    :: problems summary ::
    :::: WARNINGS
    module not found: saurfang#spark-sas7bdat;1.1.5-s_2.11

        ==== local-m2-cache: tried
          -- artifact saurfang#spark-sas7bdat;1.1.5-s_2.11!spark-sas7bdat.jar:
        ==== local-ivy-cache: tried
          -- artifact saurfang#spark-sas7bdat;1.1.5-s_2.11!spark-sas7bdat.jar:
        ==== central: tried
          -- artifact saurfang#spark-sas7bdat;1.1.5-s_2.11!spark-sas7bdat.jar:
        ==== spark-packages: tried
          -- artifact saurfang#spark-sas7bdat;1.1.5-s_2.11!spark-sas7bdat.jar:
                ::          UNRESOLVED DEPENDENCIES         ::
                :: saurfang#spark-sas7bdat;1.1.5-s_2.11: not found

    :::: ERRORS
    Server access error at url https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.pom (java.net.ConnectException: Connection timed out: connect)

        Server access error at url https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.jar (java.net.ConnectException: Connection timed out: connect)
        Server access error at url https://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.pom (java.net.ConnectException: Connection timed out: connect)
        Server access error at url https://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.jar (java.net.ConnectException: Connection timed out: connect)

    Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: saurfang#spark-sas7bdat;1.1.5-s_2.11: not found]
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1302)
    at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:304)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSub

    Niels Jespersen
    @yl790 Well, I made it work. I let it run on my internet-connected notebook, copied the jars in ~/.ivy2 to a folder on my office PC, and added a spark.jars.ivy entry to spark-defaults.conf pointing to that ivy2 folder. Now it runs and loads the spark.sas7bdat jars. Thank you for your help.
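    For anyone in the same situation, the offline workaround described above amounts to a single line in spark-defaults.conf (the folder path is illustrative):

    ```
    # Point Spark's Ivy resolution at a local folder of pre-downloaded jars
    # copied from an internet-connected machine's ~/.ivy2
    spark.jars.ivy    C:/Users/me/offline-ivy2
    ```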

    Hi, I tried to connect via sparklyr to our Spark cluster in yarn-cluster mode, but the connection fails after 30 seconds. Looking at the logs I see the following behaviour. Everything looks quite normal until the application starts:

    20/07/24 13:21:01 INFO Client: Submitting application application_1595494790876_0069 to ResourceManager
    20/07/24 13:21:01 INFO YarnClientImpl: Submitted application application_1595494790876_0069
    20/07/24 13:21:02 INFO Client: Application report for application_1595494790876_0069 (state: ACCEPTED)
    20/07/24 13:21:02 INFO Client: 
             client token: N/A
             diagnostics: AM container is launched, waiting for AM container to Register with RM
             ApplicationMaster host: N/A
             ApplicationMaster RPC port: -1
             queue: default
             start time: 1595596856971
             final status: UNDEFINED
             tracking URL: http://lm:8088/proxy/application_1595494790876_0069/
             user: hadoop
    20/07/24 13:21:03 INFO Client: Application report for application_1595494790876_0069 (state: ACCEPTED)
    20/07/24 13:21:04 INFO Client: Application report for application_1595494790876_0069 (state: ACCEPTED)

    The last message then repeats continuously. After a while I get this message in the logs.

    20/07/24 13:22:03 WARN sparklyr: Gateway (35459) Failed to get network interface of gateway server socketnull

    Any idea what could go wrong? I guess a lot of things... especially since we are in a quite restricted network. It was already quite a pain to reach this point. The client can only see the workers' port 9868. I have now also opened port 8880, since I thought maybe the sparklyr gateway on the node tries to communicate with the client and fails. But this didn't change anything.

    Yuan Zhao
    Has anyone run into this issue? I've googled a bit, but still no luck:
    sc <- spark_connect(
      master = "",
      version = "2.4.4",
      method = "livy", config = livy_config(
        driver_memory = "2G",
        driver_cores = 2,
        executor_memory = "4G",
        executor_cores = 2,
        num_executors = 4
    ))
    Error in livy_validate_http_response("Failed to create livy session", : Failed to create livy session (Client error: (400) Bad Request):
     {"msg":"java.net.URISyntaxException: Illegal character in scheme name at index 1: 
    c(\"https://github.com/sparklyr/sparklyr/blob/feature/sparklyr-1.3.0/inst/java/sparklyr-2.4-2.11.jar?raw=true\", \"https://github.com/sparklyr/sparklyr/blob/feature/sparklyr-1.3.0/inst/java/sparklyr-2.4-2.12.jar?raw=true\")"}
    1. spark_connect(master = "", version = "2.4.4", 
     .     method = "livy", config = livy_config(config, driver_memory = "2G", 
     .         driver_cores = 2, executor_memory = "4G", executor_cores = 2, 
     .         num_executors = 4))
    2. livy_connection(master, config, app_name, version, hadoop_version, 
     .     extensions, scala_version = scala_version)
    3. livy_create_session(master, config)
    4. livy_validate_http_response("Failed to create livy session", 
     .     req)
    5. stop(message, " (", httpStatus$message, "): ", httpContent)
    Yuan Zhao
    Just filed a bug report: sparklyr/sparklyr#2641
    Yitao Li
    @yuanzhaoYZ Hey thanks for the bug report!! Seems to be something worth looking into. I can't think of anything that would explain that error off the top of my head.
    Also, sparklyr currently only has Livy test coverage for Spark 2.3, and it is possible Spark 2.3 has been the only version intended to work with both sparklyr and Livy so far. Hopefully it shouldn't be difficult to do the same for Spark 2.4 though.
    Yitao Li
    @yuanzhaoYZ sparklyr/sparklyr#2641 is resolved now.
    Yuan Zhao
    @yl790 Awesome, thanks for fixing this one up so fast!
    Jordan Bentley
    is it at all possible to pass an R callback into Scala? So I can call a Scala method from R and have that Scala method call back to R in the middle of its execution?
    I'm fairly confident the answer is 'no' but I'm holding on to a little hope that I'm wrong

    I was trying to connect to spark locally using:

    conf <- spark_config()
    conf$`sparklyr.connect.cores.local` <- 12
    conf$`sparklyr.cores.local` <- 4
    conf$`sparklyr.shell.deploy-mode` <- "client"
    conf$`sparklyr.shell.driver-memory` <- "32G"
    conf$`spark.executor.cores` <- 1
    conf$`spark.executor.memory` <- "2G"
    conf$`sparklyr.verbose` <- TRUE
    conf$`sparklyr.log.console` <- TRUE
    conf$`spark.executor.instances` <- 4
    conf$spark.sql.shuffle.partitions <- 5
    conf$`spark.dynamicAllocation.enabled` <- FALSE
    sc <- spark_connect(master = "local",
                        config = conf,
                        spark_home =  Sys.getenv("SPARK_HOME"),
                        log = "console", version = "3.0.0")

    but then I figured that master "local" does not create executors, only the driver. So I tried to run on yarn-client; however, I get the following error message:

    d:\spark\bin\spark-submit2.cmd --driver-memory 32G --class sparklyr.Shell "C:\Users\B2623385\Documents\R\win-library\3.6\sparklyr\java\sparklyr-3.0-2.12.jar" 8880 23210
    Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId,  :
      Gateway in localhost:8880 did not respond.

    I am using a shared Windows server at work that has 16 cores and 64 GB.

    I am trying to get dplyr::summarize_all to work on a Spark dataframe, as described in chapter 3 of "Mastering Spark with R". Is this possible? I posted the question on Stack Overflow, see: https://stackoverflow.com/questions/64032888/how-to-get-dplyrsummarize-all-to-work-on-a-sparkdataframe-using-databricks
    Regan McDonald
    Does anyone distribute a versioned R environment to sparklyr session worker nodes rather than install the same version of R on all Spark nodes?
    Hello - a question... with the new sparklyr version tag (which is now required for Livy connections), I am noticing that my classes don't load on the remote system anymore; it throws a ClassNotFoundException. I've been adding these classes via conf$livy.jars <- c( ... ) as before. What I did notice is that on the YARN side these jar files are no longer listed as part of the launch script. Is there a different way to specify jar files to be loaded as part of the Livy config?