    Suk Hyun Hwang (Phillip)
    @civiliangame

    Hi, everyone! My name is Phillip Hwang, and sparklyr makes up a huge part of my Master's thesis. I thought about opening a GitHub issue, but because this project uses data provided by the Department of Justice, it's probably not a good idea to post it publicly. Javier told me to come here as a last resort, so here I am! If you could help point me in the right direction, I would be eternally grateful.

    I'm trying to run analytics with sparklyr on 1,100 tables in HBase. Each of these tables corresponds to time-series data from a specific photovoltaic power plant somewhere in the world. We want to analyze this data and build models with sparklyr (on a cluster, not local) to predict when it's time to replace solar panels for maximum renewable-energy efficiency. In short, we want to take data that's in HBase and analyze it with sparklyr. What would be the best way to do this? Here are the approaches I've been considering:

    1. My lab has an R wrapper package around a Python package that downloads the data for a specific table and power plant. After looking a few layers deeper, I realized that it works by pulling data from HBase, writing it into a CSV file, reading that CSV file with R, and storing it in a data frame. Before examining this package in depth, I was using the wrapper to build a data frame and then calling sdf_import() to create a Spark DataFrame. Very inefficient.

    2. Why not cut out the middleman from (1)? Take the CSV file, put it into HDFS, and run spark_read_csv() on it? (See the sketch after this list.)

    3. How about using REST APIs through Livy to pull data directly from HBase into Spark?

    4. Any other thoughts? I thought the best way to do this would be to ask the experts. Thank you so much for reading this, and I hope you have a fantastic week!
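    A minimal sketch of option (2), assuming the CSV has already been copied into HDFS (e.g. with hdfs dfs -put); the HDFS path, table name, and YARN master below are placeholders, not from the original discussion:

    library(sparklyr)

    # Connect to the cluster (placeholder master; adjust to your deployment)
    sc <- spark_connect(master = "yarn-client")

    # Read the CSV straight from HDFS into a Spark DataFrame,
    # without ever materializing it in R
    plant <- spark_read_csv(
      sc,
      name = "plant_0001",                          # hypothetical table name
      path = "hdfs:///data/plants/plant_0001.csv",  # hypothetical HDFS path
      header = TRUE,
      infer_schema = TRUE,
      memory = FALSE)                               # avoid eagerly caching a large table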

    Sincerely,

    Phillip
    Javier Luraschi
    @javierluraschi
    @civiliangame Right, (1) is not recommended; (2) is OK but a bit inefficient, since you need to copy the data manually; (3) is the recommended approach, but don't use Livy for this: you want a Spark-HBase connector.
    For (3), you should be able to use sparklyr the same way you would with the Cassandra connector, but using the Spark-HBase connector from Hortonworks instead. Something like the following might just work...
    Javier Luraschi
    @javierluraschi
    library(sparklyr)

    # Pull in the Hortonworks Spark-HBase connector (shc-core) from the
    # Hortonworks repository, and ship the HBase site config to the cluster
    sc <- spark_connect(master = "local", version = "2.3", config = list(
      sparklyr.connect.packages = "com.hortonworks:shc-core:1.1.1-2.1-s_2.11",
      sparklyr.shell.repositories = "http://repo.hortonworks.com/content/groups/public/",
      sparklyr.shell.files = "/etc/hbase/conf/hbase-site.xml"))

    # Map the HBase table into Spark through the connector's data source;
    # memory = FALSE avoids eagerly caching the table into memory
    spark_read_source(
      sc,
      name = "<table>",
      source = "org.apache.spark.sql.execution.datasources.hbase",
      options = list("HBaseTableCatalog.tableCatalog" = "<catalog>"),
      memory = FALSE)
    Added this suggestion into sparklyr/sparklyr#720
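    For reference, a sketch of what the <catalog> value might look like: shc-core expects a JSON catalog mapping HBase column families and qualifiers to Spark SQL columns. The namespace, table name, column family "d", and columns below are made-up placeholders:

    # Hypothetical shc-core catalog for one power-plant table; adapt the
    # column family and qualifiers to the real schema
    catalog <- '{
      "table": {"namespace": "default", "name": "plant_0001"},
      "rowkey": "key",
      "columns": {
        "rowkey":   {"cf": "rowkey", "col": "key",      "type": "string"},
        "ts":       {"cf": "d",      "col": "ts",       "type": "string"},
        "power_kw": {"cf": "d",      "col": "power_kw", "type": "double"}
      }
    }'

    plant <- spark_read_source(
      sc,
      name = "plant_0001",
      source = "org.apache.spark.sql.execution.datasources.hbase",
      options = list("HBaseTableCatalog.tableCatalog" = catalog),
      memory = FALSE)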
    Suk Hyun Hwang (Phillip)
    @civiliangame
    Wow, thanks for the reply!!! @javierluraschi I really do appreciate it.
    Two quick questions:
    1. Do I need Spark version 2.3+ to run this?
    2. Do I have to change anything in the sparklyr.connect.packages, sparklyr.shell.repositories, or sparklyr.shell.files lines?
    Javier Luraschi
    @javierluraschi
    Any version that the extension supports... this shc-core build targets Apache Spark 2.1.1.
    The sparklyr.connect.packages and sparklyr.shell.repositories lines should not need a change; sparklyr.shell.files is the path to your HBase configuration file, so you would need to customize that one.
    I'm not an expert on HBase and have never done this myself, so it probably won't work with the exact instructions I mentioned, but it should be close enough to investigate and get working.
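    Since the goal is 1,100 tables, a hypothetical follow-up sketch: register each HBase table lazily in a loop. The build_catalog() helper and the plant_0001-style naming scheme are inventions for illustration:

    # Hypothetical helper that builds an shc-core catalog string per table;
    # the column family and columns are placeholders to adapt to the real schema
    build_catalog <- function(table_name) {
      sprintf('{
        "table": {"namespace": "default", "name": "%s"},
        "rowkey": "key",
        "columns": {
          "rowkey":   {"cf": "rowkey", "col": "key",      "type": "string"},
          "ts":       {"cf": "d",      "col": "ts",       "type": "string"},
          "power_kw": {"cf": "d",      "col": "power_kw", "type": "double"}
        }
      }', table_name)
    }

    table_names <- sprintf("plant_%04d", 1:1100)  # made-up naming scheme

    # Register every table as a lazy Spark table; memory = FALSE avoids caching
    plants <- lapply(table_names, function(tbl) {
      spark_read_source(
        sc,
        name = tbl,
        source = "org.apache.spark.sql.execution.datasources.hbase",
        options = list("HBaseTableCatalog.tableCatalog" = build_catalog(tbl)),
        memory = FALSE)
    })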