These are chat archives for locationtech/geomesa

21st Apr 2016
spereirag
@spereirag
Apr 21 2016 15:32
Hello guys, quick question: I have a small dataset (<1 GB) that I would like to cache entirely so I can process it with Spark. Is it currently possible to do this?
Anthony Fox
@anthonyccri
Apr 21 2016 15:35
spereirag: definitely
what are you trying to do with your data set?
spereirag
@spereirag
Apr 21 2016 15:40
Awesome! Right now I'm just doing basic calculations: density, counts per date range, etc. Is there a code sample that shows how to implement the caching? I'm modifying the CountByDay.scala example in geomesa-compute and am able to read my data from Accumulo.
Anthony Fox
@anthonyccri
Apr 21 2016 15:40
nice
so, do you want to load the data into spark and then run multiple operations on it?
spereirag
@spereirag
Apr 21 2016 15:42
Exactly. For example, calculate densities using different grid sizes.
Anthony Fox
@anthonyccri
Apr 21 2016 15:42
gotcha
so, i can suggest some ways to do it with the existing capability but i've actually been thinking about an improvement to core geomesa that would make that kind of operation much easier
so, in order to do a count per grid cell, you'll want to map a function over the geometry in the data set that emits (gridCellId, 1)
and then reduce by summing up the tuples that have the same gridCellId
does that make sense?
you can use the Z2 class in sfcurve to compute a grid cell id on any resolution you specify
you can also do a z3 to do grid+time aggregations but there's an additional wrinkle there which is that you have to prepend with the coarse date first
so (day, z3gridcell, 1) as the output from the maps
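A minimal sketch of that map/reduce pattern, assuming queryRDD is an RDD[SimpleFeature] of point geometries; the hand-rolled cell id here just stands in for sfcurve's Z2, and the resolution is an arbitrary choice:
import com.vividsolutions.jts.geom.Point
import org.opengis.feature.simple.SimpleFeature

val resolution = 1024  // cells per axis; pick whatever grid size you want

def gridCell(f: SimpleFeature): (Int, Int) = {
  val pt = f.getDefaultGeometry.asInstanceOf[Point]
  val x = (((pt.getX + 180.0) / 360.0) * resolution).toInt  // lon -> column
  val y = (((pt.getY +  90.0) / 180.0) * resolution).toInt  // lat -> row
  (x, y)
}

// emit (gridCellId, 1) and sum the counts per cell
val cellCounts = queryRDD
  .map(f => (gridCell(f), 1))
  .reduceByKey(_ + _)

// for grid + time, prepend the coarse date to the key, e.g. ((day, cell), 1),
// where dayOf is a hypothetical accessor for the feature's date attribute:
// queryRDD.map(f => ((dayOf(f), gridCell(f)), 1)).reduceByKey(_ + _)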
spereirag
@spereirag
Apr 21 2016 15:47
Yes, there is an example in the tutorials showing something similar:
val discretized = queryRDD.map { f =>
  (GeoHash(f.getDefaultGeometry.asInstanceOf[Point], 25), 1)
}

val density = discretized
  .groupBy { case (gh, _)    => gh }
  .map     { case (gh, iter) => (gh.bbox.envelope, iter.size) }
But for the caching, is it just a matter of calling queryRDD.cache()?
Anthony Fox
@anthonyccri
Apr 21 2016 15:56
yes, i think that will load your data into memory
then, the resulting rdd will be available for multiple operations
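A rough sketch of that cache-then-reuse pattern, reusing the GeoHash discretization from the snippet above; the two precisions are arbitrary and the imports assume the GeoMesa 1.2-era package layout:
import com.vividsolutions.jts.geom.Point
import org.locationtech.geomesa.utils.geohash.GeoHash

// mark the queried features for caching; they are loaded on the first action
val cached = queryRDD.cache()

// run several aggregations against the same in-memory data
val coarse = cached
  .map(f => (GeoHash(f.getDefaultGeometry.asInstanceOf[Point], 20), 1))
  .reduceByKey(_ + _)
val fine = cached
  .map(f => (GeoHash(f.getDefaultGeometry.asInstanceOf[Point], 30), 1))
  .reduceByKey(_ + _)

coarse.count()  // the first action materializes the cache
fine.count()    // later actions reuse the cached partitions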
spereirag
@spereirag
Apr 21 2016 15:59
Cool! I'm gonna run some tests
Anthony Fox
@anthonyccri
Apr 21 2016 17:36
spereirag: did it work?
Anthony Fox
@anthonyccri
Apr 21 2016 17:42
forgot to @ mention you, @spereirag
spereirag
@spereirag
Apr 21 2016 17:57
Yes, it does seem to work. The improvement isn't much, but I guess with a dataset this small that's expected. I still have to properly tune the cluster. Thanks a lot, @anthonyccri
Anthony Fox
@anthonyccri
Apr 21 2016 17:57
@spereirag no problem
how big is your cluster?
you may want to do a 'repartition' to spread the data around your cpus
Anthony Fox
@anthonyccri
Apr 21 2016 18:04
@spereirag and then cache (after the repartition)
spereirag
@spereirag
Apr 21 2016 18:08
It's a 10-node cluster (12 GB each). By repartition, do you mean using mapPartitions over the queryRDD?
Anthony Fox
@anthonyccri
Apr 21 2016 18:08
no, there's actually a method called 'repartition'
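A short sketch of repartition followed by cache; the partition count is a guess (10 nodes x ~8 cores assumed), and repartition itself forces a shuffle, so it only pays off when later stages reuse the cached data:
// spread the rows across the cluster before caching; 80 partitions is a guess
val spread = queryRDD
  .repartition(80)
  .cache()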
spereirag
@spereirag
Apr 21 2016 18:33
It seems to take a little longer using repartition, haha :)
Anthony Fox
@anthonyccri
Apr 21 2016 19:06
oy
that's no good