    Rob Linger
    @th3walkingdud3
    I will check it out, thanks for the quick reply @HikaGenji
    Rob Linger
    @th3walkingdud3
    Screen Shot 2020-05-05 at 9.23.06 AM.png
    Rob Linger
    @th3walkingdud3
    I have not had any luck deserializing data streaming from Kafka using spark_read_kafka. Does anyone have any resources or code examples for this use case? The provided example on the sparklyr site reads and immediately writes back to Kafka, which works fine, but I need some example code for processing incoming data before writing it back out to Kafka.
    Jeff P
    @HikaGenji
    Essentially you can treat the result of stream_read_kafka just like a dataframe with dplyr verbs
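    In practice, the stream returned by stream_read_kafka can be piped through ordinary dplyr verbs before being written back out. A minimal sketch, assuming an existing connection sc with the Kafka package loaded, a broker at localhost:9092, and hypothetical topic names:

    library(sparklyr)
    library(dplyr)

    stream <- stream_read_kafka(
      sc,
      options = list(
        kafka.bootstrap.servers = "localhost:9092",   # assumed broker address
        subscribe = "input-topic"                     # hypothetical topic
      )
    ) %>%
      # Kafka delivers `value` as binary; cast it to a string before processing
      mutate(value = as.character(value)) %>%
      # any dplyr verbs can go here, e.g. filtering or deriving new columns
      filter(value != "") %>%
      mutate(value = toupper(value)) %>%
      stream_write_kafka(
        options = list(
          kafka.bootstrap.servers = "localhost:9092",
          topic = "output-topic"                      # hypothetical topic
        )
      )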
    Kumar G
    @abdkumar

    @JakeRuss I'm trying to connect to a remote Cassandra cluster using host, port, username, and password:
    conf <- spark_config()
    conf[["spark.cassandra.connection.ssl.enabled"]] = TRUE
    conf[["spark.cassandra.connection.host"]] = cassandra_host
    conf[["spark.cassandra.connection.port"]] = cassandra_port
    conf[["spark.cassandra.auth.username"]] = cassandra_username
    conf[["spark.cassandra.auth.password"]] = cassandra_password
    conf[["sparklyr.defaultPackages"]] <- c("org.apache.hadoop:hadoop-aws:2.7.3", "datastax:spark-cassandra-connector:2.0.0-RC1-s_2.11")

    sc <- spark_connect(master = "local", version = "2.2.0", spark_home = spark_path, config = conf)

    df <- spark_read_source(
      sc,
      name = "emp",
      source = "org.apache.spark.sql.cassandra",
      options = list(keyspace = "temp", table = "category_distribution"),
      memory = FALSE
    )

    but this is not working. Please suggest a solution.

    tarun9450
    @tarun9450
    Hi All, can you please help me with this error- "Failed while connecting to sparklyr to port (8880) for sessionid (26038): Gateway in localhost:8880 did not respond."
    tarun9450
    @tarun9450
    library(sparklyr)
    sc <- spark_connect(master = "local", spark_version = "2.4.5")

    Error in force(code) :
    Failed while connecting to sparklyr to port (8880) for sessionid (52016): Gateway in localhost:8880 did not respond.
    Path: C:\Users\Tarun_Gupta2\AppData\Local\spark\spark-2.4.5-bin-hadoop2.7\bin\spark-submit2.cmd
    Parameters: --class, sparklyr.Shell, "C:\Users\TarunGupta2\Documents\R\win-library\3.6\sparklyr\java\sparklyr-2.4-2.11.jar", 8880, 52016
    Log: C:\Users\TARUN~1\AppData\Local\Temp\Rtmpw9ZV82\filea70487da97_spark.log

    ---- Output Log ----
    /Java/jdk1.8.0_251\bin\java was unexpected at this time.

    ---- Error Log ----

    Yitao Li
    @yitao-li
    @tarun9450 Do you have a space in your JAVA_HOME environment variable?
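    The "/Java/jdk1.8.0_251\bin\java was unexpected at this time." line in the output log is consistent with a space in the Java path breaking spark-submit2.cmd on Windows. A quick check from R, with the short-path workaround shown only as an assumption:

    java_home <- Sys.getenv("JAVA_HOME")
    grepl(" ", java_home)   # TRUE means the path contains a space
    # e.g. point JAVA_HOME at a space-free (8.3 short) path before spark_connect():
    # Sys.setenv(JAVA_HOME = "C:/Progra~1/Java/jdk1.8.0_251")   # hypothetical path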
    Jeff P
    @HikaGenji
    Hi, is anybody available to discuss sparklyr/sparklyr#2534? It's a bit difficult to express in words and requires Kafka to replicate. This is blocking me from going further, so I am curious to see if any of you have insights.
    Jeff P
    @HikaGenji
    @HikaGenji note to self: problem solved, see the ticket for more info
    bdharang
    @bdharang

    Hi, can someone please help me fix the error below? I have another working setup on Hadoop 2 (EMR 5.x). Now I am testing EMR 6 with a new Spark home, /usr/lib/spark6/. I compared the settings of both and everything looks good to me. Is there any specific setting I need to check?

    sc <- spark_connect(master = "yarn", spark_home = "/usr/lib/spark6", deploymode = "cluster", enableHiveSupport = TRUE)
    Error in force(code) :
    Failed while connecting to sparklyr to port (8880) for sessionid (32486): Gateway in localhost:8880 did not respond.
    Path: /usr/lib/spark6/bin/spark-submit
    Parameters: --class, sparklyr.Shell, '/opt/R/3.6.0/lib64/R/library/sparklyr/java/sparklyr-2.4-2.11.jar', 8880, 32486
    Log: /tmp/RtmpijZOtA/filee69e18f188dc_spark.log

    ---- Output Log ----
    Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
    at sparklyr.Shell$.main(shell.scala:9)
    at sparklyr.Shell.main(shell.scala)

    Yitao Li
    @yitao-li
    @bdharang The error is caused by a Scala version incompatibility -- I still need to look into whether there is a reasonable way to fix it, but for now you can work around it by passing version = "3.0.0-preview" to force sparklyr to load jar files compiled with Scala 2.12
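    A minimal sketch of that workaround, assuming the same EMR 6 setup as above; the "3.0.0-preview" jars are compiled against Scala 2.12:

    library(sparklyr)
    sc <- spark_connect(
      master = "yarn",
      spark_home = "/usr/lib/spark6",
      version = "3.0.0-preview",   # forces the Scala 2.12 sparklyr jar
      config = spark_config()
    )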
    Maher Daoud
    @maherdaoud
    Hi guys, I'm working on a project where we need to apply a huge number of mutations to the data, and I'm using the sparklyr package with the dplyr mutate function.
    The problem is that when applying a huge number of mutations using mutate, I get the following error:
    "Error: org.apache.spark.sql.AnalysisException: cannot resolve 'GOOGLE_CITY_DESC' given input columns:"
    It seems there is a limitation on using the mutate function.
    Any help, please?
    Yitao Li
    @yitao-li
    @maherdaoud I'm not aware of any built-in limitation with dplyr.
    ^^ dplyr is just an R interface translating all your data manipulation verbs to Spark SQL, so it has the same capabilities as Spark SQL itself. I suggest double-checking the schema of your result before the failing dplyr::mutate and seeing if the column names are what you expected
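    A minimal sketch of that sanity check, where df stands in for the intermediate tbl_spark right before the failing mutate (hypothetical name):

    library(dplyr)
    colnames(df)               # column names as dplyr sees them
    sparklyr::sdf_schema(df)   # column names and Spark SQL types
    df %>% show_query()        # the SQL accumulated so far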
    Maher Daoud
    @maherdaoud
    Yitao Li
    @yitao-li
    @maherdaoud wow that's really interesting. I didn't realize too many mutate calls would fail in this way. My guess is it might be because of the resulting SQL query becoming too long. Anyways I'll create a github issue for it.
    Oh nevermind I saw the one you created already.
    Maher Daoud
    @maherdaoud
    Any workaround suggestion?
    Yitao Li
    @yitao-li
    I'll need to look into it. I think what happens is that all dplyr::mutate calls are evaluated lazily, so when you accumulate too many of them that are yet to be "materialized" in the backend, you end up with a long SQL query. But I'm not exactly sure whether the SQL query being too long is the root cause of this problem; that was just a guess off the top of my head.
    Maher Daoud
    @maherdaoud
    I think you are right. It seems each mutate call creates a new subquery with a new name, and after too many subqueries it can't generate more names for the last one.
    Yitao Li
    @yitao-li
    If I read correctly, https://dplyr.tidyverse.org/reference/compute.html essentially says compute() forces the SQL query you have accumulated so far to be evaluated, so that might help
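    A minimal sketch of breaking up a long chain with compute(), which materializes the query accumulated so far as a temporary Spark table; df, x, and the derived column names are hypothetical:

    library(dplyr)
    df2 <- df %>%
      mutate(a = x + 1, b = x * 2) %>%   # first batch of mutations
      compute("df_step1")                # force evaluation, cache as "df_step1"
    df3 <- df2 %>%
      mutate(c = a / b)                  # continue from the materialized result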
    Maher Daoud
    @maherdaoud
    let me try, I will be back, and thanks for your great support bro :>
    Maher Daoud
    @maherdaoud
    I tried both the "compute" and "collapse" functions; it takes a long time without finishing. I applied it on 60,000 rows of a spark_dataframe
    Yitao Li
    @yitao-li
    If it appears to be hanging forever, then I would also do the following:
    <my_spark_dataframe> %>% dplyr::mutate(...) %>% ... %>% dplyr::show_query() and then sanity-check the query is OK and then try running the query directly to see how long that takes
    Maher Daoud
    @maherdaoud
    okay
    Maher Daoud
    @maherdaoud
    running the query directly using dbGetQuery?
    Yitao Li
    @yitao-li
    Yes, either that or just launch a spark-sql shell (from $SPARK_HOME/bin/spark-sql), copy-paste the query, and then run it
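    For the first option, one way to capture the generated SQL and run it through dbGetQuery (a sketch, assuming df is the lazy tbl_spark in question and sc is the existing connection):

    qry <- dbplyr::remote_query(df)                 # same SQL that show_query() prints
    res <- DBI::dbGetQuery(sc, as.character(qry))   # run it, bypassing the dplyr layer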
    Maher Daoud
    @maherdaoud
    Yes, but in fact this will not help; as you know, it takes too much time after adding the "compute" or "collapse" functions.
    I think I need to wait until this bug is solved or a workaround is found :(
    Yitao Li
    @yitao-li
    I was more suggesting doing it for debugging purposes: if running that query from the spark-sql shell directly returns results fast enough (i.e., bypassing the R layer entirely), then we can conclude the slowness is caused by some bug in one of the R packages.
    Maher Daoud
    @maherdaoud
    If we are talking about the spark-sql query, it's very fast through the R layer, and I also tried it directly using dbGetQuery and it was fast.
    But we shouldn't forget the main issue mentioned above
    Yitao Li
    @yitao-li
    @maherdaoud Hey I just commented on sparklyr/sparklyr#2589 with some good news for you. Let me know what you think.
    Thanks for raising this issue for sparklyr BTW. It's a really good catch! Even though in the end it appears not to be a sparklyr problem :D
    Maher Daoud
    @maherdaoud
    What great news! Let me check and I will get back to you with my feedback. Again, many thanks for your great efforts
    Niels Jespersen
    @njesp
    I just cannot get sparklyr to work on Windows in local mode. No matter what I do, the only error message I see is "Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, : Gateway in localhost:8880 did not respond.". R 3.6.1, R 4.0.2, Spark 2.0.1, Spark 2.4.5, Java 8. No file paths involved (Java, Spark) contain spaces. Log files are created, but empty. winutils.exe is copied to where Spark/Hadoop wants it (at least the error message saying it misses winutils goes away). A Java process is created. Any hints for further investigation? Has anyone made this work recently?
    Yitao Li
    @yitao-li

    @njesp You can try the following to print spark-submit log to console to see what's failing:

    library(sparklyr)
    options(sparklyr.log.console = TRUE)
    sc <- spark_connect(master = "local")

    The spark-submit log usually ends up in a text file, but the path to that file is highly system-dependent and could also be influenced by your local config... so rather than spending time figuring out where it might be, it's just easier to have options(sparklyr.log.console = TRUE) while troubleshooting

    Niels Jespersen
    @njesp
    @yl790 Thank you for replying. Today it suddenly works, at least on my notebook. Tomorrow I will try again on my workstation at work. options(sparklyr.log.console = TRUE) has an effect when running R in a console, but it seems that RStudio eats the log messages somehow. I will get back if I still have problems on my workstation at work tomorrow. Once again, thank you for helping
    Niels Jespersen
    @njesp
    @yl790 Now in the office, working behind a proxy. Logging to the console now works when running R from a console. It's actually spark.sas7bdat that causes trouble, as it depends on Maven's ability to collect jars across the internet. Here is the log.

    @yl790 Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, :
    Gateway in localhost:8880 did not respond.

    :: resolution report :: resolve 84419ms :: artifacts dl 0ms

        :: modules in use:
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
        ---------------------------------------------------------------------

    :: problems summary ::
    :::: WARNINGS
    module not found: saurfang#spark-sas7bdat;1.1.5-s_2.11

        ==== local-m2-cache: tried
    
          file:/C:/Users/njn/.m2/repository/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.pom
    
          -- artifact saurfang#spark-sas7bdat;1.1.5-s_2.11!spark-sas7bdat.jar:
    
          file:/C:/Users/njn/.m2/repository/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.jar
    
        ==== local-ivy-cache: tried
    
          C:\Users\njn\.ivy2\local\saurfang\spark-sas7bdat\1.1.5-s_2.11\ivys\ivy.xml
    
          -- artifact saurfang#spark-sas7bdat;1.1.5-s_2.11!spark-sas7bdat.jar:
    
          C:\Users\njn\.ivy2\local\saurfang\spark-sas7bdat\1.1.5-s_2.11\jars\spark-sas7bdat.jar
    
        ==== central: tried
    
          https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.pom
    
          -- artifact saurfang#spark-sas7bdat;1.1.5-s_2.11!spark-sas7bdat.jar:
    
          https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.jar
    
        ==== spark-packages: tried
    
          https://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.pom
    
          -- artifact saurfang#spark-sas7bdat;1.1.5-s_2.11!spark-sas7bdat.jar:
    
          https://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.jar
    
                ::::::::::::::::::::::::::::::::::::::::::::::
    
                ::          UNRESOLVED DEPENDENCIES         ::
    
                ::::::::::::::::::::::::::::::::::::::::::::::
    
                :: saurfang#spark-sas7bdat;1.1.5-s_2.11: not found
    
                ::::::::::::::::::::::::::::::::::::::::::::::

    :::: ERRORS
    Server access error at url https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.pom (java.net.ConnectException: Connection timed out: connect)

        Server access error at url https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.jar (java.net.ConnectException: Connection timed out: connect)
    
        Server access error at url https://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.pom (java.net.ConnectException: Connection timed out: connect)
    
        Server access error at url https://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.jar (java.net.ConnectException: Connection timed out: connect)

    :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
    Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: saurfang#spark-sas7bdat;1.1.5-s_2.11: not found]
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1302)
    at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:304)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSub

    Niels Jespersen
    @njesp
    @yl790 Well, I made it work. I let it run on my internet-connected notebook, copied the jars in ~/.ivy2 to a folder on my office PC, and added a spark.jars.ivy entry to spark-defaults.conf pointing to that folder. Now it runs and loads the spark.sas7bdat jars. Thank you for your help.
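    For reference, the offline setup amounts to pointing Spark's Ivy directory at the copied folder. The same thing can be done from sparklyr's config instead of spark-defaults.conf; the path below is only an example:

    library(sparklyr)
    conf <- spark_config()
    conf[["spark.jars.ivy"]] <- "C:/Users/njn/ivy2-offline"   # hypothetical folder holding the copied jars
    sc <- spark_connect(master = "local", config = conf)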
    markuszinser
    @markuszinser

    Hi, I tried to connect via sparklyr to our Spark cluster in yarn-cluster mode, but the connection fails after 30 seconds. When looking at the logs I see the following behaviour. Everything looks quite normal until the application starts:

    20/07/24 13:21:01 INFO Client: Submitting application application_1595494790876_0069 to ResourceManager
    20/07/24 13:21:01 INFO YarnClientImpl: Submitted application application_1595494790876_0069
    20/07/24 13:21:02 INFO Client: Application report for application_1595494790876_0069 (state: ACCEPTED)
    20/07/24 13:21:02 INFO Client: 
             client token: N/A
             diagnostics: AM container is launched, waiting for AM container to Register with RM
             ApplicationMaster host: N/A
             ApplicationMaster RPC port: -1
             queue: default
             start time: 1595596856971
             final status: UNDEFINED
             tracking URL: http://lm:8088/proxy/application_1595494790876_0069/
             user: hadoop
    20/07/24 13:21:03 INFO Client: Application report for application_1595494790876_0069 (state: ACCEPTED)
    20/07/24 13:21:04 INFO Client: Application report for application_1595494790876_0069 (state: ACCEPTED)

    The last message then repeats continuously. After a while I get the following message in the logs.

    20/07/24 13:22:03 WARN sparklyr: Gateway (35459) Failed to get network interface of gateway server socketnull

    Any idea what could go wrong? I guess a lot of things... especially since we are in a quite restricted network. It was already quite a pain to reach this point. The client only sees the workers' port 9868. I now also opened port 8880, since I thought maybe the sparklyr gateway on the node tries to communicate with the client and fails, but this didn't change anything.

    Yuan Zhao
    @yuanzhaoYZ
    Has anyone run into this issue? I've googled a bit, but still no luck
    sc <- spark_connect(
      master = "http://192.168.0.6:8998",
      version = "2.4.4",
      method = "livy", config = livy_config(
        driver_memory = "2G",
        driver_cores = 2,
        executor_memory = "4G",
        executor_cores = 2,
        num_executors = 4
      ))
    Error in livy_validate_http_response("Failed to create livy session", : Failed to create livy session (Client error: (400) Bad Request):
    
     {"msg":"java.net.URISyntaxException: Illegal character in scheme name at index 1: 
    c(\"https://github.com/sparklyr/sparklyr/blob/feature/sparklyr-1.3.0/inst/java/sparklyr-2.4-2.11.jar?raw=true\", \"https://github.com/sparklyr/sparklyr/blob/feature/sparklyr-1.3.0/inst/java/sparklyr-2.4-2.12.jar?raw=true\")"}
    
    
    Traceback:
    1. spark_connect(master = "http://192.168.0.6:8998", version = "2.4.4", 
     .     method = "livy", config = livy_config(config, driver_memory = "2G", 
     .         driver_cores = 2, executor_memory = "4G", executor_cores = 2, 
     .         num_executors = 4))
    2. livy_connection(master, config, app_name, version, hadoop_version, 
     .     extensions, scala_version = scala_version)
    3. livy_create_session(master, config)
    4. livy_validate_http_response("Failed to create livy session", 
     .     req)
    5. stop(message, " (", httpStatus$message, "): ", httpContent)
    Yuan Zhao
    @yuanzhaoYZ
    Just filed a bug report: sparklyr/sparklyr#2641