    Francisco Palomares
    @iPhaco96
    Hi, Has anyone worked with Plumber, Postgresql and JSON?
    Javier Luraschi
    @javierluraschi
    @iPhaco96 try http://community.rstudio.com, this chat is meant for Spark + R topics.
    Olaf
    @randomgambit
    @javierluraschi thanks! makes sense. timezones are hell
    minero-de-datos
    @minero-de-datos
    @javierluraschi I really like your awesome new sparklyr book. The code my_file <- spark_read_csv(sc, "my-file", path = "./testfile.csv") results in an error: Error: org.apache.spark.sql.AnalysisException: Invalid view name: my-file;. Without the dash, it works perfectly.
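    For anyone hitting the same thing, a minimal dash-free variant of that call (which Spark accepts as a view name) would be:

    my_file <- spark_read_csv(sc, "my_file", path = "./testfile.csv")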
    Jozef
    @jozefhajnala
    Hello everyone, not sure if this is the right place to ask. Could someone please provide an update on the status of development with regards to using the R package arrow with sparklyr? We are looking to improve the performance of spark_apply and the benchmarks here look great. However, even though the GitHub README instructs to use install.packages("arrow"), I could not find the package on CRAN. Any links/updates on the current status would be appreciated! Thanks.
    Javier Luraschi
    @javierluraschi
    @jozefhajnala you are right, arrow is not yet on CRAN… should be there any day. In the meantime, you can install it with remotes::install_github("apache/arrow", ref = "apache-arrow-0.14.1")
    Jozef
    @jozefhajnala

    @javierluraschi thanks, I see arrow is now on CRAN, congratulations! Using the CRAN version, I come across errors with simple operations, example:

    library(dplyr)  # provides %>% and collect()
    config <- sparklyr::spark_config()
    sc <- sparklyr::spark_connect(master = "local", config = config)
    mtcars_sp <- dplyr::copy_to(sc, datasets::mtcars, overwrite = TRUE)
    
    # Works fine
    if ("arrow" %in% .packages()) detach("package:arrow")
    mtcars_sp %>% sparklyr::spark_apply(function(df) df) %>% collect()
    
    # Error
    library(arrow)
    mtcars_sp %>% sparklyr::spark_apply(function(df) df) %>% collect()

    Looking at the worker log, this seems to be relevant:

    ERROR sparklyr: RScript (6891) terminated unexpectedly: object 'as_tibble' not found

    Relevant sessioninfo:
    R version 3.6.0, x86_64-redhat-linux-gnu (64-bit) Packages: arrow_0.14.1 dplyr_0.8.3 sparklyr_1.0.1

    Jozef
    @jozefhajnala
    With regards to the above, upgrading sparklyr to the latest released version (1.0.2) resolved the problem. Maybe it would be worth mentioning such fixes in the NEWS? See also the SO question
    Javier Luraschi
    @javierluraschi
    Ah, yeah that would make sense, feel free to send a PR. There have been a bunch of minor arrow fixes, and since it has not been officially released, we've kept them out of the NEWS file.
    Nikunj Maheshwari
    @nik-maheshwari
    Hi. Is there a reason that Spark/sparklyr does not have a PLS (partial least squares) model?
    Benjamin White
    @benmwhite
    @nik-maheshwari That's at the Spark MLlib level rather than the sparklyr level. It's tough to say exactly why specific models aren't included, but they seem to prioritize ones with high usage rates and parallelized implementations. The Spark Jira board would probably be the best place to suggest PLS.
    My guess for why PLS isn't included already is just low demand
    Nikunj Maheshwari
    @nik-maheshwari
    @benmwhite Yes, I agree that it could be due to low demand. I have started to implement it in Spark, but I can also put it up on the Jira board, thanks for that suggestion.
    Nikunj Maheshwari
    @nik-maheshwari
    Are sparklyr and future fully compatible? See my issue here - HenrikBengtsson/future#331
    Nikunj Maheshwari
    @nik-maheshwari
    Thanks for the comments. Integrating the two could be really good for increasing Shiny apps' responsiveness.
    Nikunj Maheshwari
    @nik-maheshwari
    Hi all. Does sparklyr come with any hashing function? Something to compare whether two large files you read in are identical. Currently, I am using spark_apply to apply the digest function on all partitions of a tbl_spark, collect the hashes into a data.frame and apply digest to it again. It works, but is extremely slow.
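    A minimal sketch of that current approach (assuming a Spark tbl named my_tbl and the digest package installed on the workers; slow because every partition round-trips through an R worker):

    library(dplyr)
    # hash each partition in an R worker, then hash the collected hashes
    partition_hashes <- my_tbl %>%
      sparklyr::spark_apply(function(df) data.frame(hash = digest::digest(df))) %>%
      collect()
    overall_hash <- digest::digest(partition_hashes)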
    Javier Luraschi
    @javierluraschi
    Use the arrow package if you want to speed up spark_apply()
    Otherwise, you could create a single string with the entire file using something like mutate(contents = concat_ws(",", collect_list(column_name))) and then use sha2() through dplyr to compute the SHA-2 hash.
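    A minimal sketch of that approach, assuming the column of interest is called value in a Spark tbl my_tbl (concat_ws, collect_list and sha2 are Spark SQL functions that dplyr passes through to Spark):

    library(dplyr)
    my_tbl %>%
      summarise(contents = concat_ws(",", collect_list(value))) %>%  # one string per table
      mutate(hash = sha2(contents, 256L)) %>%                        # SHA-256 of that string
      pull(hash)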
    Nikunj Maheshwari
    @nik-maheshwari
    Ok thanks. I will give both of them a go.
    Ying
    @ying1
    slightly confused about sparklyr::spark_read_table(sc, name). The description of "name" at https://www.rdocumentation.org/packages/sparklyr/versions/1.0.2/topics/spark_read_table says: "The name to assign to the newly generated table." But if I am attempting to read a Hive table, I would set name to the Hive table... so is it because this Hive table is then loaded into Spark, and that name is referenced in Spark? (sorry about the overwrite bit - I noticed that it was removed as of 0.9.2)
    Ying
    @ying1
    Just checking. And if I need to reference a different database, I would use tbl_change_db ?
    Javier Luraschi
    @javierluraschi
    The name parameter is not very accurate for spark_read_table()… all the read functions share the same parameters/docs; in this case, you want to interpret name as the table to read (not write to).
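    For example, assuming an existing Hive table called flights and a hypothetical database other_db:

    # point the session at a different database first, if needed (as asked above)
    sparklyr::tbl_change_db(sc, "other_db")
    # "name" here is the existing table to read
    flights_tbl <- sparklyr::spark_read_table(sc, name = "flights")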
    Ying
    @ying1
    Ok. cool. Thank you!
    Nikunj Maheshwari
    @nik-maheshwari
    Hi all. Can I have 2 Spark connections on 2 different R processes at the same time? I am in an R session, and I start 2 additional R workers. On one of these workers, I am sending out some Spark jobs via future(). Scenario 1 - I also have a Spark connection running in my R session: the future() calls always break. Scenario 2 - I don't have a Spark connection in my R session: future() runs fine.
    Nikunj Maheshwari
    @nik-maheshwari
    So short answer, I can't on Windows.
    Javier Luraschi
    @javierluraschi
    You can indeed have two connections in sparklyr, but no support for the latter yet. To set up the connections:
    sc1 <- spark_connect(master = "local", app_name = "app1")
    sc2 <- spark_connect(master = "local", app_name = "app2")
    Nikunj Maheshwari
    @nik-maheshwari

    You can indeed have two connections in sparklyr, but no support for the latter yet. To set up the connections:

    I tried it, but got an error box when I executed the 2nd spark_connect command saying "R code execution error". Then a bunch of errors in the console in red starting with "Error: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;"

    Should I open this up as an issue?
    Javier Luraschi
    @javierluraschi
    Feel free to open an issue; however, this worked for me… so please include operating system, R version, sessionInfo(), etc:
    > sc1 <- spark_connect(master = "local", app_name = "app1")
    * Using Spark: 2.3.3
    > sc2 <- spark_connect(master = "local", app_name = "app2")
    * Using Spark: 2.3.3
    Nikunj Maheshwari
    @nik-maheshwari
    Hmmm. I am using Spark 2.3.2
    I will open up an issue today with the required info
    Also tried on Spark 2.4.0, with the same error
    Nikunj Maheshwari
    @nik-maheshwari
    On a similar note, assuming I am able to start two Spark connections, can I access Spark tables loaded in sc1 from sc2?
    Javier Luraschi
    @javierluraschi
    No, you can’t. Unless you save the tables to disk, say HDFS, and then reload them in the other connection.
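    A minimal sketch of that workaround (hypothetical HDFS path; assumes tbl_sc1 was created on connection sc1):

    # persist from the first connection, then reload from the second
    sparklyr::spark_write_parquet(tbl_sc1, path = "hdfs:///tmp/shared_tbl")
    tbl_sc2 <- sparklyr::spark_read_parquet(sc2, name = "shared_tbl",
                                            path = "hdfs:///tmp/shared_tbl")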
    Vineel
    @vineel258
    @here I am trying to run RStudio as a Docker container on an EC2 instance with an IAM role attached to it; however, when I run spark_read_csv to read from S3, it fails with a 403.
    here is the example code I am using to achieve this:
    library(sparklyr)
    spark_install(version = "2.3.2", hadoop_version = "2.7")
    conf <- spark_config()
    conf$sparklyr.defaultPackages <- c("com.databricks:spark-csv_2.10:1.5.0",
                                       "com.amazonaws:aws-java-sdk-pom:1.10.34",
                                       "org.apache.hadoop:hadoop-aws:2.7.3")
    conf$`sparklyr.cores.local` <- 8
    conf$`sparklyr.shell.driver-memory` <- "32G"
    conf$spark.memory.fraction <- 0.9
    sc <- spark_connect(master = "local", 
                        version = "2.3.2",
                        config = conf)
    ctx <- sparklyr::spark_context(sc)
    jsc <- invoke_static(sc, 
                         "org.apache.spark.api.java.JavaSparkContext", 
                         "fromSparkContext", 
                         ctx)
    hconf <- jsc %>% invoke("hadoopConfiguration")
    hconf %>% invoke("set", "com.amazonaws.services.s3a.enableV4", "true")
    hconf %>% invoke("set", "fs.s3a.fast.upload", "true")
    hconf %>% invoke("set", "fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hconf %>% invoke("set","fs.s3a.access.key", "accesskey") 
    hconf %>% invoke("set","fs.s3a.secret.key", "secretkey")
    sparklyr::spark_connection_is_open(sc=sc)
    curr_csv <- spark_read_csv(sc, name = "csv_ex",
                               path = "s3a://test_dir/csv_read_test")
    Vineel
    @vineel258
    can someone point out what's missing here?
    Javier Luraschi
    @javierluraschi
    Could you share the stacktrace of this error?
    Vineel
    @vineel258
    @javierluraschi here is the stacktrace
    Error: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: XXXXXXXXX, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: EXXXXXXXXXX
      at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
      at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
      at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
      at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
      at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
      at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
      at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
      at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
      at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
      at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:718)
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:390)
      at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:390)
      at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
      at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
      at scala.collection.immutable.List.foreach(List.scala:381)
      at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
      at scala.collection.immutable.List.flatMap(List.scala:344)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:389)
      at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at sparklyr.Invoke.invoke(invoke.scala:147)
      at sparklyr.StreamHandler.handleMethodCall(stream.scala:123)
      at sparklyr.StreamHandler.read(stream.scala:66)
      at sparklyr.BackendHandler.channelRead0(handler.scala:51)
      at sparklyr.BackendHandler.channelRead0(handler.scala:4)
      at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
      at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
      at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      at io.nett
    Javier Luraschi
    @javierluraschi
    I would try specifying the credentials as environment variables using Sys.setenv() and then reconnect, see https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html
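    A minimal sketch of that suggestion (placeholder values; AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard variables the AWS SDK reads, set them before spark_connect()):

    Sys.setenv(
      AWS_ACCESS_KEY_ID     = "your-access-key",
      AWS_SECRET_ACCESS_KEY = "your-secret-key"
      # add AWS_SESSION_TOKEN as well if you are using temporary credentials
    )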
    Ying
    @ying1
    What happens when running sdf_collect? We've been seeing some out-of-memory errors when running this through Livy :( .. and it wasn't too much data (40M of data). Any thoughts as to what we need to check?
    Benjamin
    @bmkor
    Has anyone successfully used spark_connect on a k8s cluster?
    Benjamin
    @bmkor
    I successfully created the spark-pi-driver pod on k8s, but failed to connect, as the log below shows:
    19/10/16 02:37:30 INFO sparklyr: Gateway (34847) is terminating backend since no client has connected after 60 seconds to 10.244.9.17/8880.