These are chat archives for thunder-project/thunder

Jun 2015
Jun 30 2015 00:18
@freeman-lab I couldn't find a discussion of this, but what are the suggested Java memory options for running thunder locally (Fedora with 128 GB RAM, 12 threads on a Xeon)? I keep getting java.lang.OutOfMemoryError: Java heap space when clustering 80 MB of data. Thank you
andrew giessel
Jun 30 2015 00:19
@aandreev0 I’ve used 4 GB, which seems to work pretty well; hold on for the options
@aandreev0: try this _JAVA_OPTIONS="-Xms512m -Xmx4g" IPYTHON_OPTS="notebook" pyspark --driver-memory 4G
obviously the first two are just env vars set on the command line; the last, --driver-memory, is a pyspark option
Jun 30 2015 03:48
@andrewgiessel thanks!
Jeremy Freeman
Jun 30 2015 15:33
@naory cool, that sounds like a neat use case, definitely want to support it! currently that would be possible only by converting the DataFrame to an RDD of (tuple, 1d array) pairs, and then directly initializing a Series object with the RDD
but we'd love to support a more direct route, maybe directly from a DataFrame (so we can let Spark itself handle the Hive loading)
mind opening an issue with a detailed description of the use case? especially the DataFrame schema and how you're loading it
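A minimal sketch of the conversion described above, turning each row into a (tuple, 1d array) pair. The column names (x, y, t0, t1, t2) and the helper row_to_record are hypothetical, since the actual DataFrame schema is not given in the chat. The Spark and thunder calls appear only as comments, are untested, and assume a Series constructor that accepts such an RDD.

```python
import numpy as np


def row_to_record(row, key_fields, value_fields):
    """Turn one row (a mapping of column name -> value) into a (tuple, 1d array) pair."""
    # The key is a tuple of integer coordinates taken from the key columns.
    key = tuple(int(row[f]) for f in key_fields)
    # The value is a 1d float array built from the remaining columns, in order.
    values = np.asarray([row[f] for f in value_fields], dtype='float64')
    return key, values


# Hypothetical schema: two coordinate columns and three time points.
rows = [
    {'x': 0, 'y': 0, 't0': 1.0, 't1': 2.0, 't2': 3.0},
    {'x': 0, 'y': 1, 't0': 4.0, 't1': 5.0, 't2': 6.0},
]
records = [row_to_record(r, ['x', 'y'], ['t0', 't1', 't2']) for r in rows]

# With Spark and thunder available, the same helper could be mapped over the
# DataFrame's underlying RDD (assumption, not verified against thunder's API):
#   rdd = df.rdd.map(lambda row: row_to_record(row.asDict(), ['x', 'y'], ['t0', 't1', 't2']))
#   series = Series(rdd)
```

Keeping the per-row conversion in a plain function means it can be checked locally on a few dictionaries before it is handed to Spark.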