Hi, everyone! My name is Phillip Hwang, and sparklyr makes up a huge part of my Master's thesis. I thought about opening a GitHub issue, but because this project uses data provided by the Department of Justice, it's probably not a good idea to post it online. Javier told me to come here as a last resort, so here I am! If you could guide me in the right direction, I would be eternally grateful.
I'm trying to run analytics with sparklyr on 1100 tables in HBase. Each table corresponds to time-series data from a specific photovoltaic power plant somewhere in the world. We want to analyze this data and build models with sparklyr (on a cluster, not in local mode) to predict when it's time to replace solar panels for maximum renewable-energy efficiency. In short: get the data out of HBase and into Spark. What would be the best way to do this? Here are the approaches I've been considering:
1. My lab has an R wrapper package around a Python package that downloads the data for a specific table and power plant. After digging a few layers deeper, I realized it works by pulling data from HBase, writing it to a CSV file, reading that CSV back into R, and storing it in a data frame. Before inspecting the package, I was using this wrapper to build a data frame and then calling sdf_import to create a Spark DataFrame. Very inefficient.
2. Why not cut out the middleman from option 1? Take the CSV file, put it into HDFS, and run spark_read_csv on it (see the sketch after this list).
3. How about using REST APIs through Livy to pull data directly from HBase into Spark?
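Here is a minimal sketch of what I mean by option 2; the HDFS path, table name, and master are made up for illustration:

library(sparklyr)

sc <- spark_connect(master = "yarn")  # cluster mode, as described above

plant_tbl <- spark_read_csv(
  sc,
  name = "plant_0001",                          # hypothetical table name
  path = "hdfs:///data/plants/plant_0001.csv",  # hypothetical HDFS path
  header = TRUE,
  infer_schema = TRUE,
  memory = FALSE  # avoid caching all 1100 tables up front
)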
Any other thoughts? I thought the best way to do this would be to ask the experts. Thank you so much for reading this, and I hope you have a fantastic week!
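One option is to skip the CSV round trip entirely and read the HBase tables into Spark through the Hortonworks Spark HBase Connector (SHC), loaded as a Spark package. Something along these lines should work: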
sc <- spark_connect(
  master = "local",
  version = "2.3",
  config = list(
    # pull in the Hortonworks Spark HBase Connector (SHC) as a Spark package
    sparklyr.connect.packages = "com.hortonworks:shc-core:1.1.1-2.1-s_2.11",
    sparklyr.shell.repositories = "http://repo.hortonworks.com/content/groups/public/",
    # ship the HBase client config so Spark can find the cluster
    sparklyr.shell.files = "/etc/hbase/conf/hbase-site.xml"
  )
)

spark_read_source(
  sc,
  name = "<table>",
  source = "org.apache.spark.sql.execution.datasources.hbase",
  options = list("HBaseTableCatalog.tableCatalog" = "<catalog>"),
  memory = FALSE  # define the table lazily instead of caching it
)
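Here <catalog> is a JSON string mapping HBase column families and qualifiers to Spark column names. A hypothetical catalog for one power-plant table (namespace, column names, and types invented for illustration) would look like:

catalog <- '{
  "table": {"namespace": "default", "name": "plant_0001"},
  "rowkey": "key",
  "columns": {
    "key":       {"cf": "rowkey", "col": "key",       "type": "string"},
    "timestamp": {"cf": "d",      "col": "timestamp", "type": "string"},
    "power_kw":  {"cf": "d",      "col": "power_kw",  "type": "double"}
  }
}'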
Hi, I am using sparklyr and connecting to Spark in local mode. A Spark job hangs intermittently: the Spark Web UI reports that the application is still running, but the sparklyr Java process and the R worker process are stuck.
sparklyr version is 1.5.2, Spark is 3.0.2, Arrow is 3.0.
I have turned on logging using
conf[["sparklyr.log.console"]] <- TRUE
In a successful run, I see
INFO sparklyr: RScript (2121) updating 6896317 rows using 690 row batches
INFO sparklyr: RScript (2121) finished apply
INFO sparklyr: RScript (2121) finished
INFO sparklyr: Session (2121) is shutting down with expected SocketException,java.net.SocketException: Socket closed)
INFO sparklyr: Worker (2121) completed R process
INFO sparklyr: Worker (2121) completed wait using lock for RScript
INFO sparklyr: Worker (2121) is returning RDD iterator with 6896317 rows
INFO sparklyr: Session (2121) is terminating backend
In a run that hangs, the last message I see is:
INFO sparklyr: RScript (9315) updating 6896317 rows using 690 row batches
I never see the 'finished apply' message that appears in the successful case.
Any suggestions on what could be going on?
It seems to point to some memory or resource issue. I would like this job to fail gracefully rather than hang. Are there any settings we can set for that? I was hoping to get an OOME (java.lang.OutOfMemoryError) instead of a silent hang.
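In case it helps, these are the knobs I have been experimenting with; a hedged sketch, and the values are guesses rather than recommendations:

library(sparklyr)

conf <- spark_config()
conf[["sparklyr.log.console"]] <- TRUE
# In local mode everything runs inside the driver JVM, so a larger heap makes a
# real OutOfMemoryError more likely than a silent stall (value is a guess).
conf[["sparklyr.shell.driver-memory"]] <- "8G"
# Lowered from the 120s default so a stuck stage surfaces as a failure sooner.
conf[["spark.network.timeout"]] <- "60s"

sc <- spark_connect(master = "local", version = "3.0.2", config = conf)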