Kaushik Roy
@kaushikCanada
how to join a spatialrdd consisting of polygons with a shapefile consisting of polygons? which join to use?
thank you james
I have been struggling with geospark a lot, and looking for some light in the dark
hehe
  1. is there any noticeable difference between rdd join and sql joins as of the current version? I have been following this project for quite some time and have seen a lot of improvements added over the last 4 years.
  2. does the order of the variables in the join matter? which side should I put the shapefile and the data on? is the behaviour the same for rdd and sql?
  3. there are a bunch of joins explained in the doc. how do I know where to use which join? For example, so far we are making an rtree out of the data and then joining the shapefiles using that. do we do an rtree in this case too?
  4. can the solution be saved in a parquet? I read somewhere that the data will be saved as wkt. can we use spark-jts library functions in the DSL?
  5. can I run the joins without partitioning? suppose I partition the data into ontario counties/regions. could that maybe help with the joins with shapefiles?
  6. ST_GeometryType says since 1.2.1, but I don't see it anywhere.
Jia Yu
@jiayuasu
@kaushikCanada If the range query window is an RDD, then what you need is a spatial join query. See the doc: https://datasystemslab.github.io/GeoSpark/tutorial/rdd/#write-a-spatial-join-query
@kaushikCanada 1. The RDD join is faster than the SQL join in the current version.
The order of the join matters because the “contains” predicate has different definitions for the left and right datasets.
The result of a spatial join can be converted back to a Spark DataFrame using the GeoSpark Adapter API: https://datasystemslab.github.io/GeoSpark/tutorial/sql/#spatialpairrdd-to-dataframe
In a distributed data system, everything needs to be partitioned. You have probably misunderstood the partitioning in GeoSpark.
After converting the results to a DataFrame, you can store the data in Parquet format.
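To tie those answers together, here is a minimal sketch of the RDD-path join against a shapefile, following the linked tutorials; the paths, grid type, and flag values are illustrative assumptions, not the only valid choices:

```scala
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession
import org.datasyslab.geospark.enums.{GridType, IndexType}
import org.datasyslab.geospark.formatMapper.shapefileParser.ShapefileReader
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geosparksql.utils.Adapter

val spark = SparkSession.builder().master("local[*]").appName("polygon-join").getOrCreate()
val sc = new JavaSparkContext(spark.sparkContext)

// two polygon datasets, both read from (hypothetical) shapefile directories
val objectRDD = ShapefileReader.readToPolygonRDD(sc, "/path/to/data/polygons")
val queryWindowRDD = ShapefileReader.readToPolygonRDD(sc, "/path/to/shapefile/dir")
objectRDD.analyze()
queryWindowRDD.analyze()

// both sides must share one spatial partitioner before a join
objectRDD.spatialPartitioning(GridType.KDBTREE)
queryWindowRDD.spatialPartitioning(objectRDD.getPartitioner)

// optional R-tree on the partitioned data (the rtree mentioned in question 3)
objectRDD.buildIndex(IndexType.RTREE, true)

val usingIndex = true
val considerBoundaryIntersection = true // true behaves like "intersects", false like "contains"
val pairRDD = JoinQuery.SpatialJoinQueryFlat(objectRDD, queryWindowRDD, usingIndex, considerBoundaryIntersection)

// back to a DataFrame via the Adapter API, then out to Parquet (questions 4 and 5)
val df = Adapter.toDf(pairRDD, spark)
df.write.parquet("/tmp/polygon_join_result.parquet")
```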
Jia Yu
@jiayuasu
ST_GeometryType is in 1.2.1-SNAPSHOT. To use an API from a SNAPSHOT version, please follow the instructions here: https://datasystemslab.github.io/GeoSpark/download/GeoSpark-All-Modules-Maven-Central-Coordinates/#snapshot-versions
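With sbt, that typically amounts to adding a snapshot resolver plus the SNAPSHOT coordinate; the artifact suffix and repository URL below are assumptions to verify against the linked page:

```scala
// build.sbt -- SNAPSHOT builds are not on Maven Central, so add a snapshot repo
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
libraryDependencies += "org.datasyslab" % "geospark-sql_2.3" % "1.2.1-SNAPSHOT"
```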
geoHeil
@geoHeil
@jiayuasu why exactly is the RDD join faster than the DataFrame join again?
Jia Yu
@jiayuasu
Just because you can easily control the number of partitions, the spatial partitioning method, and the indexing method.
In SpatialSQL, if the join is not configured well, the performance may decrease significantly
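For comparison, the SQL side exposes similar knobs through the session config; a sketch, with parameter names as documented for GeoSpark 1.2 (worth double-checking against your version):

```scala
import org.apache.spark.sql.SparkSession
import org.datasyslab.geosparksql.utils.GeoSparkSQLRegistrator

val spark = SparkSession.builder()
  .master("local[*]").appName("geospark-sql-tuning")
  .config("geospark.global.index", "true")     // build indexes for range/join queries
  .config("geospark.join.gridtype", "kdbtree") // spatial partitioning grid used by joins
  .config("geospark.join.numpartition", "200") // number of spatial partitions
  .getOrCreate()
GeoSparkSQLRegistrator.registerAll(spark)
```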
geoHeil
@geoHeil
I see. So it is not necessarily faster, it just forces people to think about parallelism / indexing / partitioning. Is this correct?
Or why do you say easy? When setting the properties in spark-sql, isn't partitioning and indexing just as simple?
Kaushik Roy
@kaushikCanada
can anyone shine some light on DataSystemsLab/GeoSpark#400?
James Hughes
@jnh5y
Have you tried
Hints.putSystemDefault(Hints.FORCE_LONGITUDE_FIRST_AXIS_ORDER, true)
?
Not sure if that'll address it completely
Kaushik Roy
@kaushikCanada
where should I write this line?
inside the spark program?
where do i get the Hints object?
James Hughes
@jnh5y
the class is org.geotools.factory.Hints; I mentioned it since your issue said you had read the docs on GeoTools's referencing (for example: https://docs.geotools.org/latest/userguide/library/referencing/faq.html)
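For placement, a minimal sketch (the object name and setup are illustrative): set the hint once, before the first CRS lookup runs, e.g. at the top of the Spark driver's main:

```scala
import org.apache.spark.sql.SparkSession
import org.geotools.factory.Hints

object AxisOrderDemo {
  def main(args: Array[String]): Unit = {
    // must run before the first CRS decode / ST_Transform call touches GeoTools
    Hints.putSystemDefault(Hints.FORCE_LONGITUDE_FIRST_AXIS_ORDER, true)

    val spark = SparkSession.builder().appName("axis-order-demo").getOrCreate()
    // ... register GeoSpark and run queries as usual ...
    spark.stop()
  }
}
```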
Simon Staudenmann
@simon-staudenmann_gitlab
Is there any way to do a KNN join for two PointRDDs?
dbarrett3
@dbarrett3
Hi, trying to work out the distance in meters between two lat/longs.
SELECT ST_Distance(ST_Transform(ST_PointFromText('51,0', ','), 'epsg:4326', 'epsg:3857'), ST_Transform(ST_PointFromText('53,0', ','), 'epsg:4326', 'epsg:3857'))/1000 AS km
gives the expected 222 km, but
SELECT ST_Distance(ST_Transform(ST_PointFromText('53.24,-0.51', ','), 'epsg:4326', 'epsg:3857'), ST_Transform(ST_PointFromText('53.26,-0.48', ','), 'epsg:4326', 'epsg:3857'))/1000 AS km
returns 4 km instead of the expected 3 km. What am I doing wrong?
Jia Yu
@jiayuasu
@simon-staudenmann_gitlab Currently, we don’t have KNN join
@dbarrett3 Is this because of a precision issue? Perhaps it was rounded up to 4 km?
dbarrett3
@dbarrett3
Thanks @jiayuasu - do you happen to know where I might find any further information on this precision issue and whether it is something that might get fixed?
Jia Yu
@jiayuasu
@dbarrett3 Does it return integer-typed data? If so, this is simply because your SQL function cast the result to an integer...
dbarrett3
@dbarrett3
@jiayuasu I don't think it's rounding, as the actual value returned is 4.013785055328752 and it should be something like 2.99 km
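A quick haversine sanity check (plain Scala, independent of GeoSpark) reproduces both numbers and hints at an axis-order mix-up rather than rounding: if ST_PointFromText reads the pair as x,y = lon,lat, then '53.24,-0.51' becomes a point near the equator, and the two points come out roughly 4.01 km apart. This is an inference from the arithmetic, not a confirmed diagnosis:

```scala
object DistanceCheck {
  // great-circle distance in km between two (lat, lon) points, haversine formula
  def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val r = 6371.0
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    r * 2 * math.asin(math.sqrt(a))
  }

  def main(args: Array[String]): Unit = {
    // pairs read as written, lat first: ~2.99 km (the expected answer)
    println(haversineKm(53.24, -0.51, 53.26, -0.48))
    // same pairs read as (lon, lat), i.e. x first: ~4.01 km (the value ST_Distance returned)
    println(haversineKm(-0.51, 53.24, -0.48, 53.26))
  }
}
```

If that is what is happening, writing the points lon-first should change the result; note that a planar EPSG:3857 distance at 53°N is still inflated by roughly 1/cos(53°) ≈ 1.66, so a locally appropriate metric CRS (or a great-circle formula) is needed for meter-accurate distances.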
Igor Stravinksy
@kloit.chess_gitlab
Is there built-in functionality to write to KML?
Jia Yu
@jiayuasu
@kloit.chess_gitlab Currently, no support for KML
Duncan Davis
@The_TaNaHaRa_twitter

import org.datasyslab.geosparksql.utils.{Adapter, GeoSparkSQLRegistrator}
fails with "object geosparksql is not a member of package org.datasyslab" (same for import org.datasyslab.geosparksql.utils).
I have geospark importing great, but trying to get the sql import working I feel like I am just missing some name format.
any thoughts? import org.datasyslab.geospark_sql.utils.{Adapter, GeoSparkSQLRegistrator} doesn't work either
Jia Yu
@jiayuasu
@The_TaNaHaRa_twitter Please make sure you add the correct Maven coordinate to your project: https://datasystemslab.github.io/GeoSpark/download/GeoSpark-All-Modules-Maven-Central-Coordinates/
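For reference, a minimal build.sbt sketch; the _2.3 Spark-line suffix and the 1.2.0 version are assumptions to match against your own setup:

```scala
// build.sbt -- pick the geospark-sql suffix matching your Spark line (2.1 / 2.2 / 2.3)
libraryDependencies ++= Seq(
  "org.datasyslab" % "geospark" % "1.2.0",
  "org.datasyslab" % "geospark-sql_2.3" % "1.2.0"
)
```

The geospark core jar alone does not contain the geosparksql package, so the import only resolves once geospark-sql is on the classpath.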
geoHeil
@geoHeil
jar -tf geospark-1.2.0.jar | grep geotools shows many class entries for geotools classes. Is Geospark including the transitive dependency in the fat jar? What is the reason for this? Why don't you publish a jar where the dependencies are only loaded via Maven (not a fat jar)?
geoHeil
@geoHeil
@jiayuasu would you mind if I shade the geotools dependencies? From looking at several issues, it looks like you prefer to keep it batteries-included, as merging the geotools jars is complicated
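In case it helps the discussion, shading would look something like the following sbt-assembly rule; the shaded package prefix is an arbitrary placeholder:

```scala
// build.sbt, with the sbt-assembly plugin enabled: relocate GeoTools classes
// into a private namespace so they cannot clash with another GeoTools version
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.geotools.**" -> "shaded.org.geotools.@1").inAll
)
```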
James Hughes
@jnh5y
(@geoHeil asked in the GeoMesa gitter and I'm suggesting that GeoSpark might be better served by an additional 'library' jar which is not a fat jar.)
geoHeil
@geoHeil
Instead of some discussions in the gitter channels of geospark and geomesa, perhaps this is the better place to continue the discussion: DataSystemsLab/GeoSpark#253
Frank Dekervel
@kervel
Hello, geospark-viz 1.2.0 contains a non-shaded copy of the Amazon AWS Java SDK, which makes it conflict with a lot of other stuff. Is there a good reason for this? Otherwise I'll file a bug.
James Hughes
@jnh5y
@geoHeil I'll look more at #253 when I get a chance; it may be later this week
BaurRitto
@BaurRitto
Hi everyone!
I have an error: 'JavaPackage' object is not callable
on the line spark._jvm
geoHeil
@geoHeil
What were the most important changes for the 1.3 / 1.3.1 release? Somehow, I cannot find a changelog.
Jia Yu
@jiayuasu
The changelog will be out soon. I just recently got too busy