Hi, everyone! My name is Phillip Hwang and SparklyR makes up a huge part of my Masters Thesis. I thought about opening up a Github issue, but because this project uses data provided by the Department of Justice, it's probably not a good idea to post it online. Javier told me to come here as a last resort, so here I am! If you guys could help guide me in the right direction, I would be eternally grateful.
I'm trying to run analytics with SparklyR on 1100 tables in HBase. Each one of these tables correspond to time series data taken from a specific photovoltaic powerplant somewhere in the world. We want to analyze this and build models with SparklyR (cluster, not local) to predict when it would be time to replace these solar panels for maximum renewable energy efficiency. We want to take data that's in HBase then analyze it with SparklyR. What would be the best way to do this? Here are the ways that I've been thinking of doing this:
My lab has an R wrapper package for a python package that downloads the data of a specific table and powerplant. After looking a few layers deeper, I realized that it's done by pulling data from HBase, writing it into a csv file, reading that csv file with R, and storing it into a dataframe. Before analyzing this package in-depth, I was using this wrapper to make a dataframe, then using sdf_import to create a spark dataframe. Very inefficient.
Why not cut out the middleman from 1? Take the csv file, put it into HDFS, and run spark_read_csv on it?
How about using REST APIs through Livy to draw data directly from HBase and put it into Spark?
Any other thoughts? I thought the best way to do this would be to ask the experts. Thank you so much for reading this, and I hope you have a fantastic week!
Sincerely,
sc <- spark_connect(master = "local", version = "2.3", config = list(
sparklyr.connect.packages = "com.hortonworks:shc-core:1.1.1-2.1-s_2.11",
sparklyr.shell.repositories = "http://repo.hortonworks.com/content/groups/public/",
sparklyr.shell.files = "/etc/hbase/conf/hbase-site.xml"))
spark_read_source(
sc,
name = "<table>",
source = "org.apache.spark.sql.execution.datasources.hbase",
options = list("HBaseTableCatalog.tableCatalog" = "<catalog>"),
memory = FALSE)