    Suk Hyun Hwang (Phillip)

    Hi, everyone! My name is Phillip Hwang and sparklyr makes up a huge part of my master's thesis. I thought about opening a GitHub issue, but because this project uses data provided by the Department of Justice, it's probably not a good idea to post it online. Javier told me to come here as a last resort, so here I am! If you could help guide me in the right direction, I would be eternally grateful.

    I'm trying to run analytics with sparklyr on 1100 tables in HBase. Each of these tables corresponds to time series data from a specific photovoltaic power plant somewhere in the world. We want to analyze this data and build models with sparklyr (cluster, not local) to predict when it's time to replace solar panels for maximum renewable-energy efficiency. In short: the data lives in HBase, and we want to analyze it with sparklyr. What would be the best way to do this? Here are the approaches I've been considering:

    1. My lab has an R wrapper package around a Python package that downloads the data for a specific table and power plant. After digging a few layers deeper, I realized it works by pulling data from HBase, writing it to a CSV file, reading that CSV file with R, and storing it in a data frame. Before inspecting this package in depth, I was using the wrapper to build a data frame and then calling sdf_import to create a Spark DataFrame. Very inefficient.

    2. Why not cut out the middleman from (1)? Take the CSV file, put it into HDFS, and run spark_read_csv on it?

    3. How about using REST APIs through Livy to draw data directly from HBase and put it into Spark?

    4. Any other thoughts? I thought the best way to do this would be to ask the experts. Thank you so much for reading this, and I hope you have a fantastic week!
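    To make option (2) concrete, this is roughly what I had in mind; the HDFS path and table name are just placeholders:

    ```r
    library(sparklyr)

    # Assumes the CSV was already copied into HDFS, e.g.:
    #   hdfs dfs -put plant_0001.csv /data/plants/
    sc <- spark_connect(master = "yarn-client")
    plant_tbl <- spark_read_csv(
      sc,
      name   = "plant_0001",
      path   = "hdfs:///data/plants/plant_0001.csv",
      memory = FALSE  # avoid eagerly caching 1100 tables' worth of data
    )
    ```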


    Javier Luraschi
    @civiliangame Right, (1) is not recommended; (2) is OK but a bit inefficient since you need to copy data manually; (3) is the recommended approach, but do not use Livy for this: you want to use a Spark-HBase connector.
    For (3), you should be able to use sparklyr with something like this:
    But instead of connecting to Cassandra, you would use the Spark-HBase connector from Hortonworks:
    Something like the following might just work...
    Javier Luraschi
    sc <- spark_connect(master = "local", version = "2.3", config = list(
      sparklyr.connect.packages = "com.hortonworks:shc-core:1.1.1-2.1-s_2.11",
      sparklyr.shell.repositories = "http://repo.hortonworks.com/content/groups/public/",
      sparklyr.shell.files = "/etc/hbase/conf/hbase-site.xml"))
    hbase_tbl <- spark_read_source(sc,
      name = "<table>",
      source = "org.apache.spark.sql.execution.datasources.hbase",
      options = list("HBaseTableCatalog.tableCatalog" = "<catalog>"),
      memory = FALSE)
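    For reference, the `<catalog>` above is SHC's JSON table catalog, which maps HBase column families and qualifiers to Spark SQL columns. A rough sketch; the namespace, column family, and column names below are made up for illustration:

    ```r
    # Hypothetical catalog for a time-series table; the "ts" family and
    # the column names are examples only.
    catalog <- '{
      "table": {"namespace": "default", "name": "plant_0001"},
      "rowkey": "key",
      "columns": {
        "rowkey":    {"cf": "rowkey", "col": "key",       "type": "string"},
        "timestamp": {"cf": "ts",     "col": "timestamp", "type": "string"},
        "power_kw":  {"cf": "ts",     "col": "power_kw",  "type": "double"}
      }
    }'
    ```

    You would then pass this string as the "HBaseTableCatalog.tableCatalog" option.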
    Added this suggestion into sparklyr/sparklyr#720
    Suk Hyun Hwang (Phillip)
    Wow, thanks for the reply!!! @javierluraschi I really do appreciate it.
    Two quick questions:
    1. Do I need Spark version 2.3+ to run this?
    2. Do I have to change anything in lines 2, 3, or 4?
    Javier Luraschi
    Any version that the extension supports...
    Apache Spark 2.1.1
    Lines (2) and (3) should not need a change; line (4) is the path to your HBase configuration file, so you would need to customize it.
    I’m not an expert on hbase and have never done this myself, so it probably won’t work with the exact instructions I mentioned but should be close enough to investigate and get it working.
    Suk Hyun Hwang (Phillip)
    Hi, Javier! Sorry, I just saw this but I wanted to thank you nonetheless for your quick and helpful reply :)
    Please let me know if there's anything I can do for you at all! I really do appreciate your help
    I also work as CTO of a tech startup that specializes in automating outreach. If, for any reason, you need to reach a lot of people quickly, please let me know and I'd be glad to be of service :)
    Mark Hamilton
    Hey folks, anyone have any idea how I can pass "spark.jars.repositories" in addition to sparklyr.defaultPackages? Can't seem to get this working.
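    This is roughly what I've been trying; the repository URL and package coordinates are placeholders:

    ```r
    library(sparklyr)
    conf <- spark_config()
    # sparklyr.shell.* entries become spark-submit flags, so this should
    # act like --repositories (which is what spark.jars.repositories controls):
    conf$sparklyr.shell.repositories <- "http://repo.example.com/maven"
    conf$sparklyr.defaultPackages <- c("com.example:some-pkg:1.0.0")
    sc <- spark_connect(master = "local", config = conf)
    ```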
    Yitao Li
    Dear contributors of sparklyr,
    I will present sparklyr during the LFAI annual project review on Aug 26th, and would like to take this opportunity to acknowledge all individuals and organizations who have contributed to sparklyr in the past. Can you please send me the official name and logo (in Scalable Vector Graphics format, if possible) of your organization at your earliest convenience? My email is yitao@rstudio.com .
    Thanks in advance!
    Sunitha OSS

    Hi, I am using sparklyr, connecting to Spark in local mode. A Spark job is hanging intermittently: the Spark Web UI reports that the application is still running, but the sparklyr Java process and the R worker process are stuck.
    sparklyr version is 1.5.2, Spark is v3.0.2, Arrow is 3.0.

    I have turned on logging using
    conf[["sparklyr.log.console"]] <- TRUE
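    i.e., roughly this setup (master and version shown as I use them here):

    ```r
    library(sparklyr)
    conf <- spark_config()
    conf[["sparklyr.log.console"]] <- TRUE  # stream sparklyr logs to the R console
    sc <- spark_connect(master = "local", version = "3.0.2", config = conf)
    ```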

    In a successful run, I see
    INFO sparklyr: RScript (2121) updating 6896317 rows using 690 row batches
    INFO sparklyr: RScript (2121) finished apply
    INFO sparklyr: RScript (2121) finished
    INFO sparklyr: Session (2121) is shutting down with expected SocketException,java.net.SocketException: Socket closed)
    INFO sparklyr: Worker (2121) completed R process
    INFO sparklyr: Worker (2121) completed wait using lock for RScript
    INFO sparklyr: Worker (2121) is returning RDD iterator with 6896317 rows
    INFO sparklyr: Session (2121) is terminating backend

    In a hung run, this is the last message I see:
    INFO sparklyr: RScript (9315) updating 6896317 rows using 690 row batches

    I do not see the ‘finished apply’ message as in the successful case.

    Any suggestions on what could be going on?

    It seems to point to some memory/resource issue. I would like this job to fail gracefully rather than hang. Are there any settings we can set? I was hoping to get an OOME (OutOfMemoryError).
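    For reference, these are the kinds of settings I have been considering; the values are guesses, not a confirmed fix:

    ```r
    library(sparklyr)
    conf <- spark_config()
    conf[["sparklyr.shell.driver-memory"]] <- "16G"  # local mode: everything runs in the driver JVM
    conf[["spark.network.timeout"]] <- "600s"        # surface network/peer errors instead of waiting indefinitely
    sc <- spark_connect(master = "local", config = conf)
    ```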