    Dave Kincaid
    @dkincaid
    I'll open an issue then. I thought maybe there was some simple solution I was overlooking. Thanks.
    Javier Luraschi
    @javierluraschi
    Thanks for the GitHub issue, should be enough to reproduce on our end and send that PR as well. Let us know how far you get and we will take it from there :)
    Dave Kincaid
    @dkincaid
    I'm not sure I can be much help. I don't really know Scala at all and it looks like that would be needed. I did create the issue #2441. I'll take a look this weekend and see if I can understand what's happening in there, but I'm not optimistic that I'll be able to figure it out.
    Javier Luraschi
    @javierluraschi
    No worries, let us take a look. Would you nag us early next week if you don’t see progress?
    I’ll try to take a look at it tomorrow or ask Yitao for help
    Javier Luraschi
    @javierluraschi
    Ah, yes, a fix was needed in the serializer. Here is the PR, @dkincaid: sparklyr/sparklyr#2442. We will merge within the next few hours after tests finish running. Excited to see how you get all this working in sparknlp!
    Dave Kincaid
    @dkincaid
    Wow! That's great! Thank you so much. I'll check it out this weekend
    Javier Luraschi
    @javierluraschi
    NP! Merged now. Try remotes::install_github("sparklyr/sparklyr")
    Jeff P
    @HikaGenji
    Hi, are sliding windows supported in sparklyr? Looking at this, it seems they are not: sparklyr/sparklyr#2231
    Rob Linger
    @th3walkingdud3
    Does anyone have any documentation or a set of repos for dealing with json objects from kafka? I can read from and write to kafka using the example code, but I have not had much luck with any type of processing.
    Jeff P
    @HikaGenji
    @th3walkingdud3 you can use the Spark SQL from_json function to parse the JSON string into columns. You need to provide it with the JSON schema; you can find examples here: https://stackoverflow.com/questions/50373104/spark-sql-from-json-documentation
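    A minimal sketch of what that could look like in sparklyr, assuming a local connection and a toy table; the schema string, table, and column names below are placeholders:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")

    # Toy table with a JSON payload column (stands in for the Kafka "value" field)
    json_tbl <- copy_to(sc, data.frame(
      value = c('{"id": 1, "name": "a"}', '{"id": 2, "name": "b"}'),
      stringsAsFactors = FALSE
    ), "json_tbl", overwrite = TRUE)

    # Provide the schema as a DDL string and let from_json expand it into columns
    parsed <- sdf_sql(sc, "
      SELECT parsed.*
      FROM (SELECT from_json(value, 'id INT, name STRING') AS parsed FROM json_tbl) t
    ")

    parsed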
    Jeff P
    @HikaGenji
    For how to use SQL in sparklyr, I personally like this resource: https://sparkfromr.com/constructing-sql-and-executing-it-with-spark.html
    Rob Linger
    @th3walkingdud3
    I will check it out, thanks for the quick reply @HikaGenji
    Rob Linger
    @th3walkingdud3
    I have not had any luck deserializing data streaming from Kafka using stream_read_kafka. Does anyone have any resources or code examples for this use case? The provided example on the sparklyr site reads and immediately writes back to Kafka, which works fine, but I need some example code for processing incoming data prior to writing it back out to Kafka.
    Jeff P
    @HikaGenji
    Essentially you can treat the result of stream_read_kafka just like a dataframe with dplyr verbs
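    For example, a rough sketch of the read-transform-write pattern, assuming a local broker at localhost:9092 and placeholder topic names:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local", version = "2.4.5", packages = "kafka")

    read_options  <- list(kafka.bootstrap.servers = "localhost:9092", subscribe = "in-topic")
    write_options <- list(kafka.bootstrap.servers = "localhost:9092", topic = "out-topic")

    stream_read_kafka(sc, options = read_options) %>%
      # Kafka delivers value as binary; cast it to string before any processing
      mutate(value = as.character(value)) %>%
      # any further dplyr verbs (filter, select, more mutates, ...) can go here
      filter(!is.na(value)) %>%
      stream_write_kafka(options = write_options)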
    Kumar G
    @abdkumar

    @JakeRuss I'm trying to connect to a remote Cassandra instance using host, port, username, and password.
    conf <- spark_config()
    conf[["spark.cassandra.connection.ssl.enabled"]] = TRUE
    conf[["spark.cassandra.connection.host"]] = cassandra_host
    conf[["spark.cassandra.connection.port"]] = cassandra_port
    conf[["spark.cassandra.auth.username"]] = cassandra_username
    conf[["spark.cassandra.auth.password"]] = cassandra_password
    conf[["sparklyr.defaultPackages"]] <- c("org.apache.hadoop:hadoop-aws:2.7.3", "datastax:spark-cassandra-connector:2.0.0-RC1-s_2.11")

    sc <- spark_connect(master = "local", version = "2.2.0", spark_home = spark_path, config = conf)

    df <- spark_read_source(
      sc,
      name = "emp",
      source = "org.apache.spark.sql.cassandra",
      options = list(keyspace = "temp", table = "category_distribution"),
      memory = FALSE
    )

    but this is not working. Please suggest a solution.

    tarun9450
    @tarun9450
    Hi all, can you please help me with this error: "Failed while connecting to sparklyr to port (8880) for sessionid (26038): Gateway in localhost:8880 did not respond."
    tarun9450
    @tarun9450
    library(sparklyr)
    sc <- spark_connect(master = "local", spark_version = "2.4.5")

    Error in force(code) :
    Failed while connecting to sparklyr to port (8880) for sessionid (52016): Gateway in localhost:8880 did not respond.
    Path: C:\Users\Tarun_Gupta2\AppData\Local\spark\spark-2.4.5-bin-hadoop2.7\bin\spark-submit2.cmd
    Parameters: --class, sparklyr.Shell, "C:\Users\TarunGupta2\Documents\R\win-library\3.6\sparklyr\java\sparklyr-2.4-2.11.jar", 8880, 52016
    Log: C:\Users\TARUN~1\AppData\Local\Temp\Rtmpw9ZV82\filea70487da97_spark.log

    ---- Output Log ----
    /Java/jdk1.8.0_251\bin\java was unexpected at this time.

    ---- Error Log ----

    Yitao Li
    @yitao-li
    @tarun9450 Do you have a space in your JAVA_HOME environment variable?
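    For reference, a common workaround on Windows is to point JAVA_HOME at a path without spaces before connecting; the JDK path below is just a placeholder:

    # Use a JDK installed outside "Program Files", or an equivalent path with no spaces
    Sys.setenv(JAVA_HOME = "C:\\Java\\jdk1.8.0_251")

    library(sparklyr)
    sc <- spark_connect(master = "local", version = "2.4.5")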
    Jeff P
    @HikaGenji
    Hi, is anybody available to discuss sparklyr/sparklyr#2534? It is a bit difficult to express in words and requires Kafka to replicate. This is blocking me from going further, so I am curious to see if any of you has insights.
    Jeff P
    @HikaGenji
    @HikaGenji note to self: problem solved, see the ticket for more info
    bdharang
    @bdharang

    Hi, could someone please help me fix the error below? I have another working setup on Hadoop 2 (EMR 5.x). Now I am testing EMR 6 with a new Spark home, /usr/lib/spark6/. I compared both settings and everything looks good to me. Is there any specific setting I need to check?

    sc <- spark_connect(master = "yarn", spark_home = "/usr/lib/spark6", deploymode = "cluster", enableHiveSupport = TRUE)
    Error in force(code) :
    Failed while connecting to sparklyr to port (8880) for sessionid (32486): Gateway in localhost:8880 did not respond.
    Path: /usr/lib/spark6/bin/spark-submit
    Parameters: --class, sparklyr.Shell, '/opt/R/3.6.0/lib64/R/library/sparklyr/java/sparklyr-2.4-2.11.jar', 8880, 32486
    Log: /tmp/RtmpijZOtA/filee69e18f188dc_spark.log

    ---- Output Log ----
    Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
    at sparklyr.Shell$.main(shell.scala:9)
    at sparklyr.Shell.main(shell.scala)

    Yitao Li
    @yitao-li
    @bdharang The error is caused by a Scala version incompatibility -- I still need to look into whether there is a reasonable way to fix it, but for now you can work around it by passing version = "3.0.0-preview" to force sparklyr to load jar files compiled with Scala 2.12
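    In code, that workaround could look like the following; the other arguments are placeholders copied from the earlier snippet:

    library(sparklyr)

    # Force sparklyr to use jars built against Scala 2.12 by declaring the
    # Spark version as 3.0.0-preview
    sc <- spark_connect(
      master = "yarn",
      spark_home = "/usr/lib/spark6",
      version = "3.0.0-preview"
    )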
    Maher Daoud
    @maherdaoud
    Hi guys, I'm working on a project where we need to apply a huge number of mutations to the data. I'm using the sparklyr package with dplyr's mutate function.
    The problem is that when applying a huge number of mutations using mutate, I get the following error:
    "Error: org.apache.spark.sql.AnalysisException: cannot resolve 'GOOGLE_CITY_DESC' given input columns:"
    It seems there is a limitation on using the mutate function.
    Any help, please?
    Yitao Li
    @yitao-li
    @maherdaoud I'm not aware of any built-in limitation with dplyr.
    ^^ dplyr is just an R interface that translates all your data manipulation verbs to Spark SQL, so it has the same capabilities as Spark SQL itself. I suggest double-checking the schema of your result before the failing dplyr::mutate to see if the column names are what you expected
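    A quick way to sanity-check that, assuming a placeholder pipeline (the table and column names below are made up):

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    cities <- copy_to(sc, data.frame(CITY = c("amman", "dubai"), stringsAsFactors = FALSE),
                      "cities", overwrite = TRUE)

    # The pipeline up to (but not including) the mutate that fails
    upstream <- cities %>%
      mutate(GOOGLE_CITY_DESC = toupper(CITY))

    # Confirm the column the next mutate refers to is actually there
    colnames(upstream)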
    Yitao Li
    @yitao-li
    @maherdaoud Wow, that's really interesting. I didn't realize too many mutate calls would fail in this way. My guess is it might be because the resulting SQL query becomes too long. Anyway, I'll create a GitHub issue for it.
    Oh, never mind, I saw the one you created already.
    Maher Daoud
    @maherdaoud
    Any workaround suggestion?
    Yitao Li
    @yitao-li
    I'll need to look into it. I think what happens is that all dplyr::mutate calls are evaluated lazily, so when you accumulate too many of them that are yet to be "materialized" in the backend, you end up with a long SQL query. But I'm not exactly sure whether the SQL query being too long is the root cause of this problem; it was just a guess off the top of my head.
    Maher Daoud
    @maherdaoud
    I think you are right. It seems each mutate call creates a new subquery with a new name, and after too many subqueries it can't generate more names for the last one.
    Yitao Li
    @yitao-li
    If I read it correctly, https://dplyr.tidyverse.org/reference/compute.html essentially says compute() forces the SQL query you have accumulated so far to be evaluated, so that might help
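    For example, a sketch of breaking the chain into stages with compute(), using a made-up table and columns:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    df <- copy_to(sc, data.frame(x = 1:10), "df", overwrite = TRUE)

    staged <- df %>%
      mutate(y = x * 2) %>%
      mutate(z = y + 1) %>%
      compute("stage_1") %>%   # materializes the query so far as a temporary table
      mutate(w = z - x) %>%
      compute("stage_2")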
    Maher Daoud
    @maherdaoud
    Let me try; I will be back. Thanks for your great support, bro :>
    Maher Daoud
    @maherdaoud
    I tried both the "compute" and "collapse" functions; it takes a long time without finishing. I applied it to 60,000 rows of a Spark dataframe.
    Yitao Li
    @yitao-li
    If it appears to be hanging forever, then I would also do the following:
    <my_spark_dataframe> %>% dplyr::mutate(...) %>% ... %>% dplyr::show_query(), then sanity-check that the query is OK, and then try running the query directly to see how long that takes
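    A concrete version of that debugging step, with a toy table standing in for the real one:

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    df <- copy_to(sc, data.frame(x = 1:5), "df", overwrite = TRUE)

    pipeline <- df %>% mutate(y = x * 2) %>% mutate(z = y + 1)

    pipeline %>% show_query()              # print the SQL the pipeline would run

    query <- dbplyr::sql_render(pipeline)  # the same SQL as a string
    DBI::dbGetQuery(sc, query)             # run it directly to time the query itself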
    Maher Daoud
    @maherdaoud
    okay
    Maher Daoud
    @maherdaoud
    running the query directly using dbGetQuery?
    Yitao Li
    @yitao-li
    Yes, either that or just launch a spark-sql shell (from $SPARK_HOME/bin/spark-sql), copy-paste the query, and run it
    Maher Daoud
    @maherdaoud
    Yes, but in fact this will not help; as you know, it takes too much time after adding the "compute" or "collapse" functions.
    I think I need to wait until this bug is solved or a workaround is found :(
    Yitao Li
    @yitao-li
    I was suggesting it more for debugging purposes: if running that query directly from the spark-sql shell returns results fast enough (i.e., bypassing the R layer entirely), then we can conclude the slowness was caused by some bug in one of the R packages.
    Maher Daoud
    @maherdaoud
    If we are talking about spark-sql, it's very fast via the R layer. I also tried it directly using dbGetQuery and it was fast.
    But we shouldn't forget the main issue mentioned above.
    Yitao Li
    @yitao-li
    @maherdaoud Hey I just commented on sparklyr/sparklyr#2589 with some good news for you. Let me know what you think.