    Francisco Romero
    @fraroco
    java.lang.Exception: sparklyr worker rscript failure with status 255, check worker logs for details.
    at sparklyr.Rscript.init(rscript.scala:106)
    at sparklyr.WorkerApply$$anon$2.run(workerapply.scala:116)
    I am using the Databricks method
    Javier Luraschi
    @javierluraschi
    If you are using SparkR this is not the right channel… if you are using sparklyr then we can definitely help. To copy large data frames, you could try this feature: rstudio/sparklyr#1762, which attempts to copy data incrementally, as in:
    # specify a callback that uploads subsets of the dataset instead of loading all at once
    iris_tbl <- copy_to(sc, function() iris, overwrite = TRUE)
    That said, it would be best to copy data using HDFS or whatever Databricks provides, since those tools were designed with scalability in mind; copy_to() was originally designed to be a convenience function for copying secondary tables that are not necessarily large.
    Olaf
    @randomgambit
    @javierluraschi how's it going? I am using sparklyr again 😉 One quick question I had is whether sparklyr is affected by this nasty bug: https://issues.apache.org/jira/plugins/servlet/mobile#issue/ARROW-3543
    It is my understanding that spark_apply is powered by arrow. What do you think?
    Olaf
    @randomgambit
    Also, unrelated: do you know how I can process a (string) timestamp with milliseconds in Hive? I can't use lubridate via spark_apply, and using Hive's unix_timestamp gets rid of the milliseconds. Any ideas?
    Olaf
    @randomgambit
    @javierluraschi summoning
    🤣
    Javier Luraschi
    @javierluraschi
    Lol, sorry I’ve been on and off for a while so I need to catch up here!
    let me check...
    If you are using Arrow, I would hope that bug is not present there… but you could check their JIRA; if it is, open a bug, etc.
    Honestly, timezone changes are so painful that converting all your times to unix-time and handling the conversions yourself might be the best approach here.
    The problem is that Spark uses the JVM's time zone, which could be different from R's time zone, etc.
    I don't have much advice here apart from being careful and using unix-time integers if needed.
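    For example, here is a minimal sketch that keeps millisecond precision without unix_timestamp(); the column ts and the sample data are hypothetical, and it relies on to_timestamp() and the timestamp-to-double cast being passed through by dplyr to Spark SQL:
    library(sparklyr)
    library(dplyr)
    sc <- spark_connect(master = "local")
    
    # hypothetical data: string timestamps with millisecond precision
    ts_local <- data.frame(ts = c("2019-08-01 12:00:00.123",
                                  "2019-08-01 12:00:01.456"))
    events <- copy_to(sc, ts_local)
    
    events %>%
      mutate(
        secs = unix_timestamp(ts),                       # truncates to whole seconds
        epoch_ms = as.numeric(to_timestamp(ts)) * 1000   # cast to double keeps the fraction
      )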
    BTW, the RStudio folks working on TensorFlow and Spark are creating a YouTube channel… mostly targeting new users, but we should build up to more interesting functionality. I'm also hoping that long term we have some dedicated time to live stream on Twitch and take questions, solve problems together, etc.
    Here is the tweet from the first video… I’ll try to make them better over time:
    Javier Luraschi
    @javierluraschi
    I also forgot to mention that the Spark with R book is already available for pre-sale! Here is the reference:
    Jordan Bentley
    @jbentleyEG
    Just pre-ordered a copy :)
    I've been trying to look through the source and figure this out but haven't been able to: is it possible to send arbitrary Scala code to a Spark connection through sparklyr?
    I have a weird development pattern where I mostly work in R notebooks but want to port any reusable code I write back into a Scala library, so that it is accessible through Scala (most of our production) and sparklyr, as well as potentially PySpark down the road.
    And if I could execute Scala to define functions inside a notebook, that would be incredibly useful.
    Javier Luraschi
    @javierluraschi
    @jbentleyEG yes, there are two ways: (1) for simple Scala commands you can use invoke(); (2) for complex scripts we have compile_package_jars(), which allows you to build an R + Scala package that you can then reuse in Spark.
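    For instance, a minimal sketch of route (1); the file.txt passed to textFile() is a hypothetical local file:
    library(sparklyr)
    sc <- spark_connect(master = "local")
    
    # call a static JVM method directly
    invoke_static(sc, "java.lang.Math", "hypot", 3, 4)   # returns 5
    
    # chain method calls on JVM objects
    spark_context(sc) %>%
      invoke("textFile", "file.txt", 1L) %>%
      invoke("count")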
    Jordan Bentley
    @jbentleyEG
    I'm looking more for something that can run more complex commands on-the-fly, including defining methods.
    Put another way: can I access the Scala REPL through sparklyr?
    Jordan Bentley
    @jbentleyEG
    Update: I think I have a partial solution working, by writing a method that takes in a string and executes it through ToolBox, and then loading that jar when I start Spark.
    Javier Luraschi
    @javierluraschi
    @jbentleyEG that sounds quite interesting! If you have code to share, we could take a look and consider getting it into sparklyr.
    Francisco Palomares
    @iPhaco96
    Hi, has anyone worked with Plumber, PostgreSQL, and JSON?
    Javier Luraschi
    @javierluraschi
    @iPhaco96 try http://community.rstudio.com, this chat is meant to be for Spark + R topics.
    Olaf
    @randomgambit
    @javierluraschi thanks! makes sense. timezones are hell
    minero-de-datos
    @minero-de-datos
    @javierluraschi I really like your awesome new sparklyr book. The code my_file <- spark_read_csv(sc, "my-file", path = "./testfile.csv") results in the error Error: org.apache.spark.sql.AnalysisException: Invalid view name: my-file;. Without the dash, it works perfectly.
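    A minimal fix is to use an underscore in the Spark table name, e.g.:
    # dashes are not valid in Spark view names; underscores are safe
    my_file <- spark_read_csv(sc, "my_file", path = "./testfile.csv")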
    Jozef
    @jozefhajnala
    Hello everyone, not sure if this is the right place to ask. Could someone please provide an update on the status of development with regards to using the R package arrow with sparklyr? We are looking to improve the performance of spark_apply(), and the benchmarks here look great. However, even though the GitHub readme instructs to use install.packages("arrow"), I could not find the package on CRAN. Any links/updates on the current status would be appreciated! Thanks.
    Javier Luraschi
    @javierluraschi
    @jozefhajnala you are right, arrow is not yet on CRAN… it should be there any day. In the meantime, you can install it with remotes::install_github("apache/arrow/r", ref = "apache-arrow-0.14.1")
    Jozef
    @jozefhajnala

    @javierluraschi thanks, I see arrow is now on CRAN, congratulations! Using the CRAN version, I come across errors with simple operations, for example:

    library(dplyr)   # for %>% and collect()
    
    config <- sparklyr::spark_config()
    sc <- sparklyr::spark_connect(master = "local", config = config)
    mtcars_sp <- dplyr::copy_to(sc, datasets::mtcars, overwrite = TRUE)
    
    # Works fine
    if ("arrow" %in% .packages()) detach("package:arrow")
    mtcars_sp %>% sparklyr::spark_apply(function(df) df) %>% collect()
    
    # Error
    library(arrow)
    mtcars_sp %>% sparklyr::spark_apply(function(df) df) %>% collect()

    Looking at the worker log, this seems to be relevant:

    ERROR sparklyr: RScript (6891) terminated unexpectedly: object 'as_tibble' not found

    Relevant sessionInfo():
    R version 3.6.0, x86_64-redhat-linux-gnu (64-bit) Packages: arrow_0.14.1 dplyr_0.8.3 sparklyr_1.0.1

    Jozef
    @jozefhajnala
    With regards to the above, upgrading sparklyr to the latest released version (1.0.2) resolved the problem. Maybe it would be worth mentioning such fixes in the NEWS file? See also the SO question
    Javier Luraschi
    @javierluraschi
    Ah, yeah, that would make sense; feel free to send a PR. There have been a bunch of minor arrow fixes, and since it has not been officially released, we've kept them out of the NEWS file.
    Nikunj Maheshwari
    @nik-maheshwari
    Hi. Is there a reason that Spark/sparklyr does not have a PLS (partial least squares) model?
    Benjamin White
    @benmwhite
    @nik-maheshwari That's at the Spark MLlib level rather than the sparklyr level. It's tough to say exactly why specific models aren't included, but they seem to prioritize ones with high usage rates and parallelized implementations. The Spark Jira board would probably be the best place to suggest PLS.
    My guess for why PLS isn't included already is just low demand
    Nikunj Maheshwari
    @nik-maheshwari
    @benmwhite Yes, I agree that it could be due to low demand. I have started to implement it in Spark, but I can also put it up on the Jira board; thanks for the suggestion.
    Nikunj Maheshwari
    @nik-maheshwari
    Are sparklyr and future fully compatible? See my issue here: HenrikBengtsson/future#331
    Nikunj Maheshwari
    @nik-maheshwari
    Thanks for the comments. Integrating the two could be really good for increasing Shiny apps' responsiveness.
    Nikunj Maheshwari
    @nik-maheshwari
    Hi all. Does sparklyr come with any hashing function? Something to check whether two large files you read in are identical. Currently, I am using spark_apply to apply a digest function on all partitions of the tbl_spark, collect the hashes into a data.frame, and apply digest to that again. It works, but it is extremely slow.
    Javier Luraschi
    @javierluraschi
    Use the arrow package if you want to speed up spark_apply()
    Otherwise, you could create a single string with the entire file using something like mutate(contents = concat_ws(",", collect_list(column_name))) and then use sha2(), also via dplyr, to compute the SHA-2 hash
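    A rough sketch of that approach (column_name, the table name, and the file path are placeholders; note that collect_list() does not guarantee row order, so sort first if the hash must be order-stable):
    library(sparklyr)
    library(dplyr)
    
    sc <- spark_connect(master = "local")
    file_tbl <- spark_read_csv(sc, "my_file", path = "./testfile.csv")
    
    file_tbl %>%
      summarise(contents = concat_ws(",", collect_list(column_name))) %>%
      mutate(hash = sha2(contents, 256L)) %>%   # SHA-256 computed inside Spark
      pull(hash)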
    Nikunj Maheshwari
    @nik-maheshwari
    Ok thanks. I will give both of them a go.