    Maher Daoud
    @maherdaoud
    I tried both the "compute" and "collapse" functions; it takes a long time without finishing. I applied it to a 60,000-row Spark dataframe.
    Yitao Li
    @yitao-li
    If it appears to be hanging forever, then I would also do the following:
    <my_spark_dataframe> %>% dplyr::mutate(...) %>% ... %>% dplyr::show_query(), then sanity-check that the query is OK, and then try running the query directly to see how long that takes.
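    For example, a minimal sketch of that check (the data and the transformation are placeholders):

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)  # placeholder data

    mtcars_tbl %>%
      mutate(power_to_weight = hp / wt) %>%
      show_query()  # prints the SQL that dbplyr generates for Spark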
    Maher Daoud
    @maherdaoud
    okay
    Maher Daoud
    @maherdaoud
    running the query directly using dbGetQuery?
    Yitao Li
    @yitao-li
    Yes, either that, or just launch a spark-sql shell (from $SPARK_HOME/bin/spark-sql), copy-paste the query, and run it.
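    For the DBI route, a rough sketch continuing the example above (mtcars_tbl and sc are from that sketch; dbplyr::sql_render() captures the same SQL that show_query() prints):

    library(DBI)
    library(dplyr)

    sql <- mtcars_tbl %>%
      mutate(power_to_weight = hp / wt) %>%
      dbplyr::sql_render()

    system.time(res <- dbGetQuery(sc, as.character(sql)))  # time the raw SQL, bypassing the lazy tbl machinery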
    Maher Daoud
    @maherdaoud
    Yes, but in fact this will not help; as you know, it takes too much time after adding the "compute" or "collapse" functions.
    I think I need to wait until this bug is solved or a workaround is found :(
    Yitao Li
    @yitao-li
    I was more suggesting doing it for debugging purposes: if running that query directly from the spark-sql shell returns results fast enough (i.e., we do that to bypass the R layer entirely), then we can conclude the slowness is caused by some bug in one of the R packages.
    Maher Daoud
    @maherdaoud
    If we are talking about spark-sql, it's very fast using the R layer. I also tried it directly using dbGetQuery and it was fast.
    But we shouldn't forget the main issue I mentioned.
    Yitao Li
    @yitao-li
    @maherdaoud Hey I just commented on sparklyr/sparklyr#2589 with some good news for you. Let me know what you think.
    Thanks for raising this issue for sparklyr BTW. It's a really good catch! Even though in the end it appears not to be a sparklyr problem : D
    Maher Daoud
    @maherdaoud
    What great news! Let me check and I will get back to you with my feedback. Again, many thanks for your great efforts.
    Niels Jespersen
    @njesp
    I just cannot get sparklyr to work on Windows in local mode. No matter what I do, the only error message I see is "Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, : Gateway in localhost:8880 did not respond.". R 3.6.1, R 4.0.2, Spark 2.0.1, Spark 2.4.5, Java 8. No filepaths involved (Java, Spark) contain spaces. Logfiles are created, but empty. winutils.exe is copied to where Spark/Hadoop wants it (at least the error message saying it misses winutils goes away). A Java process is created. Any hints for further investigation? Has anyone made this work recently?
    Yitao Li
    @yitao-li

    @njesp You can try the following to print spark-submit log to console to see what's failing:

    library(sparklyr)
    options(sparklyr.log.console = TRUE)
    sc <- spark_connect(master = "local")

    The spark-submit log usually ends up in a text file, but the path to that file is highly system-dependent and could also be influenced by your local config... so rather than spending time figuring out where it might be, it's just easier to have options(sparklyr.log.console = TRUE) while troubleshooting.

    Niels Jespersen
    @njesp
    @yl790 Thank you for replying. Today it suddenly works, at least on my notebook. Tomorrow I will try again on my workstation at work. options(sparklyr.log.console = TRUE) has an effect when running R in a console, but it seems that RStudio eats the log messages somehow. I will get back if I still have problems on my workstation at work tomorrow. Once again, thank you for helping.
    Niels Jespersen
    @njesp
    @yl790 Now in the office, working behind a proxy. Logging to the console now works when running R from a console. It's actually spark.sas7bdat that causes the trouble, as it depends on Maven's ability to fetch jars over the internet. Here is the log.

    @yl790 Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, :
    Gateway in localhost:8880 did not respond.

    :: resolution report :: resolve 84419ms :: artifacts dl 0ms

        :: modules in use:
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
        ---------------------------------------------------------------------

    :: problems summary ::
    :::: WARNINGS
    module not found: saurfang#spark-sas7bdat;1.1.5-s_2.11

        ==== local-m2-cache: tried
    
          file:/C:/Users/njn/.m2/repository/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.pom
    
          -- artifact saurfang#spark-sas7bdat;1.1.5-s_2.11!spark-sas7bdat.jar:
    
          file:/C:/Users/njn/.m2/repository/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.jar
    
        ==== local-ivy-cache: tried
    
          C:\Users\njn\.ivy2\local\saurfang\spark-sas7bdat\1.1.5-s_2.11\ivys\ivy.xml
    
          -- artifact saurfang#spark-sas7bdat;1.1.5-s_2.11!spark-sas7bdat.jar:
    
          C:\Users\njn\.ivy2\local\saurfang\spark-sas7bdat\1.1.5-s_2.11\jars\spark-sas7bdat.jar
    
        ==== central: tried
    
          https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.pom
    
          -- artifact saurfang#spark-sas7bdat;1.1.5-s_2.11!spark-sas7bdat.jar:
    
          https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.jar
    
        ==== spark-packages: tried
    
          https://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.pom
    
          -- artifact saurfang#spark-sas7bdat;1.1.5-s_2.11!spark-sas7bdat.jar:
    
          https://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.jar
    
                ::::::::::::::::::::::::::::::::::::::::::::::
    
                ::          UNRESOLVED DEPENDENCIES         ::
    
                ::::::::::::::::::::::::::::::::::::::::::::::
    
                :: saurfang#spark-sas7bdat;1.1.5-s_2.11: not found
    
                ::::::::::::::::::::::::::::::::::::::::::::::

    :::: ERRORS
    Server access error at url https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.pom (java.net.ConnectException: Connection timed out: connect)

        Server access error at url https://repo1.maven.org/maven2/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.jar (java.net.ConnectException: Connection timed out: connect)
    
        Server access error at url https://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.pom (java.net.ConnectException: Connection timed out: connect)
    
        Server access error at url https://dl.bintray.com/spark-packages/maven/saurfang/spark-sas7bdat/1.1.5-s_2.11/spark-sas7bdat-1.1.5-s_2.11.jar (java.net.ConnectException: Connection timed out: connect)

    :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
    Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: saurfang#spark-sas7bdat;1.1.5-s_2.11: not found]
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1302)
    at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:304)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSub

    Niels Jespersen
    @njesp
    @yl790 Well, I made it work. I let it run on my internet-connected notebook, copied the jars in ~/.ivy2 to a folder on my office PC, and added a spark.jars.ivy entry to spark-defaults.conf pointing to that ivy2 folder. Now it runs and loads the spark.sas7bdat jars. Thank you for your help.
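    A minimal sketch of that kind of workaround, set from R instead of spark-defaults.conf (the folder path is illustrative; spark.jars.ivy is a standard Spark property):

    library(sparklyr)

    conf <- spark_config()
    conf$spark.jars.ivy <- "C:/Users/njn/offline-ivy2"  # folder pre-populated from the online machine

    sc <- spark_connect(master = "local", config = conf)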
    markuszinser
    @markuszinser

    Hi, I tried to connect via sparklyr to our Spark cluster in yarn-cluster mode, but the connection fails after 30 seconds. When looking at the logs I see the following behaviour: everything looks quite normal until the application starts:

    20/07/24 13:21:01 INFO Client: Submitting application application_1595494790876_0069 to ResourceManager
    20/07/24 13:21:01 INFO YarnClientImpl: Submitted application application_1595494790876_0069
    20/07/24 13:21:02 INFO Client: Application report for application_1595494790876_0069 (state: ACCEPTED)
    20/07/24 13:21:02 INFO Client: 
             client token: N/A
             diagnostics: AM container is launched, waiting for AM container to Register with RM
             ApplicationMaster host: N/A
             ApplicationMaster RPC port: -1
             queue: default
             start time: 1595596856971
             final status: UNDEFINED
             tracking URL: http://lm:8088/proxy/application_1595494790876_0069/
             user: hadoop
    20/07/24 13:21:03 INFO Client: Application report for application_1595494790876_0069 (state: ACCEPTED)
    20/07/24 13:21:04 INFO Client: Application report for application_1595494790876_0069 (state: ACCEPTED)

    The last message then repeats continuously. After a while I get the following message in the logs.

    20/07/24 13:22:03 WARN sparklyr: Gateway (35459) Failed to get network interface of gateway server socketnull

    Any idea what could be going wrong? I guess a lot of things... especially since we are in a quite restricted network. It was already quite a pain to reach this point. The client can only see the workers' port 9868. I have now also opened port 8880, since I thought maybe the sparklyr gateway on the node tries to communicate with the client and fails, but this didn't change anything.

    Yuan Zhao
    @yuanzhaoYZ
    Has anyone run into this issue? I've googled a bit, but still no luck:
    sc <- spark_connect(
      master = "http://192.168.0.6:8998",
      version = "2.4.4",
      method = "livy", config = livy_config(
        driver_memory = "2G",
        driver_cores = 2,
        executor_memory = "4G",
        executor_cores = 2,
        num_executors = 4
      ))
    Error in livy_validate_http_response("Failed to create livy session", : Failed to create livy session (Client error: (400) Bad Request):
    
     {"msg":"java.net.URISyntaxException: Illegal character in scheme name at index 1: 
    c(\"https://github.com/sparklyr/sparklyr/blob/feature/sparklyr-1.3.0/inst/java/sparklyr-2.4-2.11.jar?raw=true\", \"https://github.com/sparklyr/sparklyr/blob/feature/sparklyr-1.3.0/inst/java/sparklyr-2.4-2.12.jar?raw=true\")"}
    
    
    Traceback:
    1. spark_connect(master = "http://192.168.0.6:8998", version = "2.4.4", 
     .     method = "livy", config = livy_config(config, driver_memory = "2G", 
     .         driver_cores = 2, executor_memory = "4G", executor_cores = 2, 
     .         num_executors = 4))
    2. livy_connection(master, config, app_name, version, hadoop_version, 
     .     extensions, scala_version = scala_version)
    3. livy_create_session(master, config)
    4. livy_validate_http_response("Failed to create livy session", 
     .     req)
    5. stop(message, " (", httpStatus$message, "): ", httpContent)
    Yuan Zhao
    @yuanzhaoYZ
    Just filed a bug report: sparklyr/sparklyr#2641
    Yitao Li
    @yitao-li
    @yuanzhaoYZ Hey thanks for the bug report!! Seems to be something worth looking into. I can't think of anything that would explain that error off the top of my head.
    Also, sparklyr currently only has Livy test coverage for Spark 2.3, and it is possible Spark 2.3 has been the only version intended to work with both sparklyr and Livy so far. Hopefully it shouldn't be difficult to do the same for Spark 2.4, though.
    Yitao Li
    @yitao-li
    @yuanzhaoYZ sparklyr/sparklyr#2641 is resolved now.
    Yuan Zhao
    @yuanzhaoYZ
    @yl790 Awesome, thanks for fixing this one up so fast!
    Jordan Bentley
    @jbentleyEG
    Is it at all possible to pass an R callback into Scala, so that I can call a Scala method from R and have that Scala method call back into R in the middle of its execution?
    I'm fairly confident the answer is 'no', but I'm holding on to a little hope that I'm wrong.
    gustavomrg
    @gustavomrg

    I was trying to connect to spark locally using:

    conf <- spark_config()
    
    conf$`sparklyr.connect.cores.local` <- 12
    conf$`sparklyr.cores.local` <- 4
    conf$sparklyr.shell.deploy-mode <- "client"
    conf$`sparklyr.shell.driver-memory` <- "32G"
    conf$`spark.executor.cores` <- 1
    conf$`spark.executor.memory` <- "2G"
    conf$`sparklyr.verbose` <- TRUE
    conf$`sparklyr.log.console` <- TRUE
    conf$`spark.executor.instances` <- 4
    conf$spark.sql.shuffle.partitions <- 5
    conf$`spark.dynamicAllocation.enabled` <- FALSE
    
    sc <- spark_connect(master = "local",
                        config = conf,
                        spark_home =  Sys.getenv("SPARK_HOME"),
                        log = "console", version = "3.0.0")

    but then I realized that master "local" does not create executors, only the driver. So I tried to run in yarn-client mode; however, I get the following error message:

    d:\spark\bin\spark-submit2.cmd --driver-memory 32G --class sparklyr.Shell "C:\Users\B2623385\Documents\R\win-library\3.6\sparklyr\java\sparklyr-3.0-2.12.jar" 8880 23210
    Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId,  :
      Gateway in localhost:8880 did not respond.

    I am using a shared Windows server at work that has 16 cores and 64 GB of RAM.
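    For reference, a minimal sketch of the sort of yarn-client connection attempted (values are illustrative, it assumes HADOOP_CONF_DIR/YARN_CONF_DIR point at the cluster configuration, and it is not a confirmed fix for the gateway error):

    library(sparklyr)

    conf <- spark_config()
    conf$spark.executor.instances <- 4
    conf$spark.executor.cores <- 1
    conf$spark.executor.memory <- "2G"

    sc <- spark_connect(master = "yarn-client", config = conf, version = "3.0.0")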

    nathan-alexander-mck
    @nathan-alexander-mck
    I am trying to get dplyr::summarize_all to work on a Spark dataframe, as described in chapter 3 of "Mastering Spark with R". Is this possible? I posted the question on Stack Overflow; see: https://stackoverflow.com/questions/64032888/how-to-get-dplyrsummarize-all-to-work-on-a-sparkdataframe-using-databricks
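    For context, a minimal sketch of the kind of call I mean (not a confirmed answer; summarise_all() is translated to SQL by dbplyr, so in principle an aggregation Spark understands, such as mean over numeric columns, should work):

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")
    mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

    mtcars_tbl %>%
      summarise_all(mean)  # translates to one AVG(...) per column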
    Regan McDonald
    @fifthpostulate
    Does anyone distribute a versioned R environment to sparklyr session worker nodes rather than install the same version of R on all Spark nodes?
    Ying
    @ying1
    Hello, a question... with the new sparklyr version tag (which is now required for Livy connections), I am noticing that my classes no longer load on the remote system; it throws a ClassNotFoundException, though I've been adding these classes via conf$livy.jars <- c( ... ) as before. What I did notice is that, on the YARN side, these jar files are no longer listed as part of the launch script. Is there a different way to specify jar files to be loaded as part of the Livy config?
    Ying
    @ying1
    It looks like the conf$livy.jars handling has been updated to compute a bunch of other jars... but the code is not working properly. :(
    Ying
    @ying1
    I set conf$sparklyr.livy.sources <- TRUE and continue to use conf$livy.jars, and that seems to push the correct Livy settings to the Livy server. But then there is an issue with Failed to initialize livy connection: Failed to execute Livy statement with error: <console>:24: error: not found: value LivyUtils ... not sure why?
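    For reference, a sketch of the combination described above (the jar path and the Livy URL are placeholders):

    library(sparklyr)

    conf <- spark_config()
    conf$sparklyr.livy.sources <- TRUE
    conf$livy.jars <- c("hdfs:///jars/my-classes.jar")  # custom classes, as before

    sc <- spark_connect(master = "http://livy-host:8998",
                        method = "livy",
                        config = livy_config(conf))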
    Jake Russ
    @JakeRuss
    I am seeking advice on how to structure sparklyr calls and queries mapped over a list of dates. I posted over at https://community.rstudio.com/t/seeking-better-practice-for-sparklyr-purrr-map-to-iterate-query-over-a-list/85171 and wanted to draw attention to it here in case any of you sparklyr experts could comment there. Many thanks.
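    One possible shape for the iteration, as a rough sketch (the table and column names are invented, and it assumes an existing connection sc):

    library(sparklyr)
    library(dplyr)
    library(purrr)

    dates <- seq(as.Date("2020-01-01"), as.Date("2020-01-07"), by = "day")

    results <- map(dates, function(d) {
      tbl(sc, "events") %>%            # "events" is a placeholder Spark table
        filter(event_date == !!d) %>%
        summarise(n = n()) %>%
        collect()
    })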
    ®γσ ξηg
    @englianhu

    sc <- spark_connect(master = 'local')
    Error in start_shell(master = master, spark_home = spark_home, spark_version = version, : Failed to find 'spark-submit2.cmd' under 'C:\Users\Owner\AppData\Local\spark\spark-3.0.0-bin-hadoop2.7', please verify SPARK_HOME.

    I faced an issue and raised via sparklyr/sparklyr#2769

    ®γσ ξηg
    @englianhu


    Solved!!!
    Steps:
    1) Download Spark from https://spark.apache.org/downloads.html
    2) Extract the zipped file to 'C:/Users/scibr/AppData/Local/spark/spark-3.0.1-bin-hadoop3.2'.
    3) Manually point sparklyr at the latest version: spark_home_set('C:/Users/scibr/AppData/Local/spark/spark-3.0.1-bin-hadoop3.2')
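    In R, the steps above amount to roughly this sketch (path as given in step 2):

    library(sparklyr)

    spark_home_set('C:/Users/scibr/AppData/Local/spark/spark-3.0.1-bin-hadoop3.2')
    sc <- spark_connect(master = 'local')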

    Regan McDonald
    @fifthpostulate
    Has anyone run into "java.lang.SecurityException: class "io.netty.buffer.ArrowBuf"'s signer information does not match signer information of other classes in the same package" when trying to use arrow with sparklyr?
    ajp97
    @ajp97

    Hi everyone. Spark newbie here. I got the following error in class and haven't been able to solve it:
    Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, :
    Gateway in localhost:8880 did not respond.

    Try running options(sparklyr.log.console = TRUE) followed by sc <- spark_connect(...) for more debugging info.

    I saw @javierluraschi's answer here: sparklyr/sparklyr#801, but this fix hasn't proven effective for me. Any kind of help would be deeply appreciated.

    Thanks a lot!

    lidyaann1
    @lidyaann1
    Hi, I am trying to connect to a standalone EMR cluster from RStudio using sparklyr and Livy, but I keep getting Error in livy_connection(master, config, app_name, version, hadoop_version, :
    Failed to launch livy session, session status is shutting_down
    Maher Daoud
    @maherdaoud
    Guys, I hope all of you are doing well. I'm stuck with the following error when I try to run a basic GraphFrames example: Error: java.lang.ClassNotFoundException: org.graphframes.GraphFrame
    Gisela
    @giselamorrone

    Hi everyone! I'm trying to migrate a script to sparklyr, and I can't find the equivalent of spread and gather. My code looks something like:

    ltv_curves %>%
          spread(
            key = !!as.name(column_to_fill),
            value = grosstotal
          ) %>%
          gather(
            key = !!as.name(column_to_fill),
            value = grosstotal,
            -ignore_columns
          )

    Anyone around that can help with this?
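    One possible direction, as a hedged sketch: recent sparklyr releases (1.4+) implement tidyr's pivot verbs for Spark dataframes, which replace spread()/gather(). Names below come from the snippet above; ltv_curves would be a Spark tbl, column_to_fill a column name as a string, and ignore_columns a character vector:

    library(sparklyr)
    library(dplyr)
    library(tidyr)

    ltv_curves %>%
      pivot_wider(names_from = all_of(column_to_fill), values_from = grosstotal) %>%
      pivot_longer(cols = -all_of(ignore_columns),
                   names_to = column_to_fill, values_to = "grosstotal")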

    Zachary Barry
    @ZackBarry

    I'm getting an error using spark_apply when connecting to a Kubernetes cluster. I can run sdf_len(sc, 10) just fine but running sdf_len(sc, 10) %>% spark_apply(function(df) I(df)) returns the following error:

    Error: java.io.FileNotFoundException: File file:/var/folders/jf/lqnngxkj0x75cdmv_xjygfq40000gq/T/RtmpuaCX4s/packages/packages.8599.tar does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
        at org.apache.spark.SparkContext.addFile(SparkContext.scala:1534)
        at org.apache.spark.SparkContext.addFile(SparkContext.scala:1498)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at sparklyr.Invoke.invoke(invoke.scala:147)
        at sparklyr.StreamHandler.handleMethodCall(stream.scala:136)
        at sparklyr.StreamHandler.read(stream.scala:61)
        at sparklyr.BackendHandler.$anonfun$channelRead0$1(handler.scala:58)
        at scala.util.control.Breaks.breakable(Breaks.scala:42)
        at sparklyr.BackendHandler.channelRead0(handler.scala:39)
        at sparklyr.BackendHandler.channelRead0(handler.scala:14)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:321)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:295)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)

    Spark 3.0.1, Scala 2.12, sparklyr 1.5.2

    Noman Bukhari
    @nmnbkhr
    Hi, I am running some code in R; when I run spark_apply I get an exception.
    I need help on that.
    rink1135
    @rink1135
    Hello, I am trying to connect to Spark and am getting an error when connecting to the port. Not sure what is happening; I have tried many things to get it working.

    sc <- spark_connect(master = "local", version = "2.3")#connect to this local cluster
    Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, :
    Gateway in localhost:8880 did not respond.

    Try running options(sparklyr.log.console = TRUE) followed by sc <- spark_connect(...) for more debugging info.

    This is the error I get.