    Kaushik Iska
    @iskakaushik
    Hi @harsha2010, thanks for the lib. Over at harsha2010/magellan#128 you mentioned a PR, but I was not able to find one. Are you planning to work on it anytime soon? If not, I can take a crack at it. Thanks!
    Uwe Zenker
    @waras2017

    Hello @harsha2010, thank you for the great lib!

    I'm trying to get the following Magellan example running:

    URL: https://github.com/harsha2010/magellan/wiki/Scala-API

    polygons.select(point(lit(0.5), lit(0.5)).within($"polygon")).show(false)

    This works!

    Now I'm trying to use the "plain" SQL version, in order to be able to write SQL
    statements over JDBC via the Spark Thrift Server:

    polygons.createOrReplaceTempView("tbl_poly")

    spark.sql("select point(0.5, 0.5).within(polygon) from tbl_poly").show(false)

    This does not work. What is the correct way to do this?

    Thanks a lot
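    One possible explanation, not confirmed by the maintainer in this log: Magellan's within is exposed through the Scala DSL as a Catalyst expression, and point(0.5, 0.5).within(polygon) is not valid Spark SQL syntax in any case, so the expression may simply not be visible to the SQL parser. A hedged workaround sketch follows; the view name tbl_poly_checked is made up for illustration. The idea is to evaluate the predicate with the DataFrame API and expose only the result to JDBC clients:

        import org.apache.spark.sql.functions._
        import org.apache.spark.sql.magellan.dsl.expressions._

        // Evaluate the spatial predicate in the DataFrame API ...
        polygons
          .withColumn("contains_pt", point(lit(0.5), lit(0.5)).within($"polygon"))
          .createOrReplaceTempView("tbl_poly_checked")

        // ... so a plain SQL client only sees an ordinary boolean column.
        spark.sql("select contains_pt from tbl_poly_checked").show(false)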

    balakrishnan097332
    @balakrishnan097332
    I am trying a simple use case of loading GeoJSON with Magellan in a Databricks notebook, but I am not sure why the code is failing; a screenshot of the code and error message is attached. GeoJSON source: https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson
    Ram Sriharsha
    @harsha2010
    Hey! Looks like the Databricks runtime ships with a version of json4s that conflicts with the one we use. Let me see if I can shade it .. what version of DBR are you running this on?
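    For reference, a hedged sketch of what "shade it" could look like in an sbt-assembly build; the rename pattern and target package here are illustrative, not Magellan's actual build config:

        // build.sbt -- relocate the bundled json4s classes so they cannot
        // clash with the copy shipped in the Databricks runtime.
        assemblyShadeRules in assembly := Seq(
          ShadeRule.rename("org.json4s.**" -> "shaded.org.json4s.@1").inAll
        )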
    balakrishnan097332
    @balakrishnan097332
    Hey Ram. Thanks for your response. My version says, 'Latest stable (Scala 2.11)'
    balakrishnan097332
    @balakrishnan097332
    Hi Sri, can you please let me know if there is anything I can do to override it with a different version in a Databricks notebook?
    Brandon Geise
    @bdgeise

    I'm currently testing a very simple case and seeing results I wouldn't expect.
    Creating a DataFrame from an Array:

    ("US", "TX", "2018-12-08 00:00:00", 12.0123, "ios", 2, 32.813548, -96.835159)

    The coordinates are in the Dallas, TX area

    I then load the Texas geojson polygon.

      val filteringDS = spark.read.format("magellan")
        .option("magellan.index", "true")
        .option("magellan.index.precision", "15")
        .option("type", "geojson").load(filterFilePath)
        .cache()

      filteringDS.count()

    So far, so good.

    But when doing

      val filtered = df1
        .withColumn("locationPoint", point(col("longitude"), col("latitude")))
        .join(filteringDS)
        .where(col("locationPoint") within col("polygon"))

    If I injectRules I get 0 results, but if I don't injectRules I get the proper results. Has anyone come across this before?

    @harsha2010 Any thoughts on the above?
    Using 1.0.5-s_2.11 here too
    Ram Sriharsha
    @harsha2010
    Are these points in lat/long coords? Is the geometry in lat/long?
    Without injectRules, no assumption is made about the coordinate system. When you inject the indexing rule, it assumes lat/long.
    That might be the issue.
    If that's not the case, can you post some sample data so I can check?
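    To make the coordinate-order point concrete, a minimal sketch, assuming the usual DSL import and that injectRules is applied before the join (df1 and filteringDS are the DataFrames from the example above):

        import org.apache.spark.sql.functions._
        import org.apache.spark.sql.magellan.dsl.expressions._

        // Enable the spatial-join optimization; the injected indexing rule
        // assumes points and geometries are in lat/long.
        magellan.Utils.injectRules(spark)

        // Points are built as point(x, y), i.e. point(longitude, latitude).
        val filtered = df1
          .withColumn("locationPoint", point(col("longitude"), col("latitude")))
          .join(filteringDS)
          .where(col("locationPoint") within col("polygon"))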
    Brandon Geise
    @bdgeise
    The order in the example above is lat/long. The point is created as (long, lat).
    Would you prefer an issue for this?
    Ram Sriharsha
    @harsha2010
    Can you create an issue and attach this example? Also where did you get the TX geojson from? If you can link that geometry as well I can take a look today
    Brandon Geise
    @bdgeise
    Sure, can do. I can upload the Texas geojson to the issue.
    harsha2010/magellan#237. Thanks for looking @harsha2010. Hopefully not something completely dumb that I'm missing here.
    Deep Learning
    @DeepLearning007_gitlab
    Similar to @bdgeise: when I do not injectRules, the join completes in under a second (3 points in ~112 polygons), but once I do injectRules, the join takes 3.4 minutes to complete (though I get correct results). I am using version 1.0.5-s_2.11 on Spark 2.2.0. Let me know if you want me to paste the shapefile.
    Ram Sriharsha
    @harsha2010
    The latter case is different .. you might need to specify the granularity of the index by using index($"polygon", size) .. by default size=6, which leads to a geohash with six characters. That might be too granular for you, and more of the time might be spent creating the index than necessary.
    Also, if you have only 3 points to join, it will always be faster not to index, unless your geometry happens to have a lot of edges (O(10k) or more).
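    A hedged illustration of the suggestion, assuming the DSL import from the examples above; polygonsDF stands in for the polygon DataFrame, and the precision value is illustrative and should be tuned per dataset:

        // A smaller precision gives fewer, larger index cells, so less time
        // goes into building the index; with only a handful of points,
        // skipping the index entirely can be faster still, as noted above.
        val coarselyIndexed = polygonsDF.withColumn("index", $"polygon" index 15)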
    Ram Sriharsha
    @harsha2010
    incidentally just out of curiosity what is the polygon dataset you are trying to join?
    Deep Learning
    @DeepLearning007_gitlab
    Thanks @harsha2010. I was just trying to get my hands dirty. I am using one of the datasets mentioned here before: https://opendata.arcgis.com/datasets/58b0dfa605d5459b80bf08082999b27c_0.zip
    Fabio Fonseca
    @draconar_twitter
    Hi, I'm trying to read a SQL table containing WKT and transform it into polygons.
    What should I do?
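    One possible route, sketched here with heavy assumptions: magellan.WKTParser.parseAll is the parser the library uses internally (it appears in a stack trace later in this log), so wrapping it in a UDF is one way to turn a WKT string column into shapes. The names wktTable and wkt_column are made up, and whether parseAll is intended as public API is not confirmed here:

        import magellan.WKTParser
        import org.apache.spark.sql.functions.udf

        // Parse a WKT string into a Magellan shape; parseAll throws on
        // malformed input, so clean the column first if needed.
        val parseWkt = udf { s: String => WKTParser.parseAll(s) }

        val withShapes = wktTable.withColumn("shape", parseWkt($"wkt_column"))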
    Derek Hazard, Ph.D.
    @derekhazard
    Hi @harsha2010, I was wondering if there is a way to save polygon data, after it is indexed, out to an ORC-format Hive table (or another format in Hadoop), to avoid recalculating the indices each time I need to reverse-geocode lat/long to US states, for example. It seems like a waste to recompute each time if I can just save the data, and it would speed up the time to completion for each run. I was also wondering: if I use the inject rules command and compute the index using polygonDF = df.withColumn("index", $"polygon" index 15), will the package automatically use a spatial join when I run pointDF.join(polygonDF).where($"point" within $"polygon")? Does it have to recompute the indexing during the join? Just wondering if there is an advantage to precomputing the polygon indices, other than the ability to print the Z-order curves and geohashes.
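    A hedged sketch of the precompute-and-reuse pattern being asked about, with caching as a within-session stand-in; whether the injected rules will pick up a precomputed "index" column, or whether Magellan's shape columns survive a round trip through ORC, is exactly the open question above:

        // Precompute the index once and keep it resident for repeated joins
        // in the same session.
        val polygonIndexed = polygonDF.withColumn("index", $"polygon" index 15).cache()
        polygonIndexed.count() // materialize the cache

        // Repeated reverse-geocoding runs reuse the cached, indexed polygons.
        val states = pointDF.join(polygonIndexed).where($"point" within $"polygon")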
    Zaki
    @zakipatel
    Hi... does the Magellan spatial join (point within polygon) work with any version of the Databricks Runtime env? If so, please can you share specifics. Many thanks!
    @harsha2010 if you have any suggestions, I'd be very thankful!
    Zaki
    @zakipatel
    I got it to work on this Databricks Runtime env: "3.5 LTS (includes Apache Spark 2.2.1, Scala 2.11)", yay!
    Padma Chitturi
    @ChitturiPadma

    Hi @harsha2010, I have a JSON file containing polygon coordinates, latitude, longitude and location attributes. I can read the file as below:
    spark.read.json("/$PATH/geodata.json")
    This is what the data looks like:

    POLYGON ((46.2084961 12.7046505, 50.0976563 13.3254849, 54.4921875 15.7076628, 59.0625 19.1451682, 62.0947266 24.2870269, 65.9838867 24.126702, 71.6638184 19.6425875, 75.6793213 8.8307952, 72.3339844 7.2316987, 72.0153809 -0.6811363, 74.1055298 -1.3347181, 74.3595886 4.0916716, 74.4756317 7.3720005, 76.0168076 8.5042911, 79.6453857 5.4957039, 82.8733063 6.7123456, 81.2443256 10.8041364, 81.6163588 14.9634427, 86.5587044 18.6277019, 89.9849367 20.6192087, 92.9724669 18.092706, 91.7412472 13.5300215, 92.0898056 7.6245674, 93.3961487 4.2560294, 96.4242554 -2.0814512, 108.28125 -12.2111802, 121.6186523 -11.2322864, 124.1235352 -11.0328639, 126.9250488 -9.3000173, 131.3250732 -8.7995826, 134.395752 -7.3297784, 134.9947643 -6.9542026, 135.3520775 -6.359999, 134.9900436 -4.7536769, 136.5950775 -5.1716436, 137.759757 -5.6545084, 138.4630108 -7.0755106, 137.716198 -7.296747, 137.4969864 -8.4567515, 137.7616882 -8.6652028, 139.8312378 -8.6299031, 141.026001 -9.2214046, 141.0177612 -6.8998424, 140.9609413 -6.91633, 140.9041214 -6.8837376, 140.8467865 -6.7735458, 140.8423233 -6.6831066, 140.8433533 -6.5933328, 140.915451 -6.4133955, 140.961113 -6.3118008, 140.9992218 -6.3221233, 140.9902954 -2.4766454, 140.8447266 0.4833927, 137.4609375 7.0572824, 127.1337891 12.6403383, 125.3979492 21.6982655, 135.1208496 30.8645102, 142.5311279 35.2635619, 144.2285156 40.7139558, 159.8291016 48.5166043, 165.7836914 53.9560855, 167.6953125 59.1759282, 174.9462891 60.7645257, 174.9902344 69.9980521, 163.5974121 70.0130783, 152.1276855 70.0093227, 132.0666504 69.9980521, 63.1054688 69.9961731, 30.9814453 69.9227586, 28.8061523 69.1664656, 28.861084 68.0999557, 29.5916748 67.6813003, 29.5175171 66.9585645, 30.2714539 65.808686, 30.8242035 63.8341592, 31.759758 63.071523, 31.6123009 62.4801722, 28.0229473 60.3423469, 28.2937002 59.3970991, 28.134841 58.0092765, 28.6791068 56.3712681, 30.9981072 55.940353, 33.0872798 53.525341, 32.0563352 52.8349893, 32.2000426 52.458868, 34.3812713 52.4306525, 36.5705185 50.6397541, 40.9170952 49.6893342, 40.2339382 47.5730177, 37.5193129 46.9184795, 35.3683295 44.3942519, 29.4819945 42.1810128, 27.5970199 41.9068729, 27.2328325 42.1256131, 26.3726217 41.8119252, 26.6956571 41.3615814, 26.309013 41.2492571, 26.3487132 40.9908661, 25.9666877 40.7046424, 25.9815433 40.3146418, 25.6448364 40.2208299, 25.6297303 40.0686136, 25.9283578 40.0807579, 25.9111284 39.8887846, 26.0030124 39.4353727, 26.4504522 39.4165187, 26.6696727 39.1448726, 26.7950033 38.8427157, 26.2876079 38.7718849, 26.357717 38.4818121, 26.2342634 38.4336875, 26.1906031 38.220914, 26.4987905 38.0794351, 27.2003092 37.9365929, 27.1064593 37.691381, 26.9901708 37.695681, 26.9553477 37.6499345, 27.0779622 37.617121, 27.1638895 37.3508507, 27.0830584 37.1577172, 27.2345067 36.9457762, 27.4426461 36.939465, 27.3398209 36.7971892, 27.3264313 36.6772306, 27.8517151 36.4676807, 28.407898 36.6750278, 29.1851807 36.2221188, 31.2890625 35.1738083, 34.1455078 31.297328, 34.9584961 29.5256704, 34.4641113 27.9119126, 37.7764893 22.4897195, 41.7617798 16.4769132, 42.5898743 14.439335, 43.2861328 12.7475163, 44.0332031 12.3721974, 46.2084961 12.7046505))|34.047863|100.6196518|Asia|

    Columns are: polygonobject (String), latitude (double), longitude (String) and search_term (String). I see there is support for the GeoJSON format in Magellan. How can I analyse this JSON file? Is there any tool internal to Spark that converts JSON to GeoJSON? Any pointers would help me a lot!

    bpurdy1645
    @bpurdy1645
    I am looking to implement Magellan within my AWS EMR PySpark application. Is there any documentation that showcases where/how to bootstrap Magellan onto a working EMR cluster, and any examples using PySpark (if that is supported - SparkSQL or the DataFrame API)? Any assistance would be GREATLY appreciated!
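    A hedged pointer, not official documentation: Magellan is published on Spark Packages, so one assumed way to attach it to an EMR cluster (PySpark included) is via the package coordinates at submit time; they must match the cluster's Scala/Spark build, e.g. the 1.0.5-s_2.11 version mentioned earlier in this log:

        spark-submit --packages harsha2010:magellan:1.0.5-s_2.11 your_app.py

    The same coordinates can be set as spark.jars.packages in spark-defaults via an EMR configuration classification.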
    Ghost
    @ghost~59b805c5d73408ce4f757471
    Hi, I am getting the below error when trying to check if a point falls within the polygons of CA provinces .. any help on this please ... Here is the error:
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.lang.RuntimeException: Extra(...99 47.0634893610001), [traced - not evaluated])
      at magellan.WKTParser$.parseAll(WKTParser.scala:97)
      at magellan.WKTParser.parseAll(WKTParser.scala)
      at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
      at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
      at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
      at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
      at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
      at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    Here are the complete details in issue #239 .. any response is greatly appreciated .. Thanks
    Ghost
    @ghost~59b805c5d73408ce4f757471
    @harsha2010: any suggestions for harsha2010/magellan#239?
    PawaritL
    @PawaritL
    hi can anyone explain how LineStrings are indexed? (i.e. in order to perform predicate pushdown when doing an 'intersects' query)
    slenain
    @slenain
    Hello, has anybody succeeded in using Databricks (with Spark 2.4.1) and Magellan? I get a lot of errors, and I can't downgrade the Spark version.
    ChrisZcu
    @ChrisZcu
    Hello, why can't I use the dependency through Maven? Nothing gets resolved.
    awadhesh singh
    @awadhesh14
    Hi @harsha2010, is there some explanation of how partitioning is done in Magellan?
    Christos
    @Charmatzis
    @awadhesh14 Hi, yes, Magellan uses a Z-order curve to partition the data: https://en.wikipedia.org/wiki/Z-order_curve
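    To connect that to the API, a minimal hedged sketch (DSL import assumed; polygons is a DataFrame with a "polygon" column, and the precision is illustrative): the index operator exposes the Z-order curve cells covering each geometry, which is what the partitioning works on.

        import org.apache.spark.sql.magellan.dsl.expressions._

        // Each geometry is bucketed into the Z-order curve cells that cover
        // it; these cell ids are what Magellan partitions and joins on.
        val withCurves = polygons.withColumn("index", $"polygon" index 30)
        withCurves.select("index").show(false)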
    Uwe Zenker
    @waras2017

    Hi, I'm still trying to get the following Magellan example running:

    URL: https://github.com/harsha2010/magellan/wiki/Scala-API

    polygons.select(point(lit(0.5), lit(0.5)).within($"polygon")).show(false)

    This works!

    Now I'm trying to use the "plain" SQL version, in order to be able to write SQL
    statements over JDBC via the Spark Thrift Server:

    polygons.createOrReplaceTempView("tbl_poly")

    spark.sql("select point(0.5, 0.5).within(polygon) from tbl_poly").show(false)

    This does not work. What is the correct way to do this?
    I found another GIS lib for Spark (GeoSpark: http://geospark.datasyslab.org/) which provides SQL functions (e.g. ST_Within - "select year(utc) as year, month(utc) as month, latitude, longitude, pt.platform_name, geo.name as area from tbl_platform_tracks pt, tbl_geometry geo where year(utc) == 2018 and month(utc) == 6 and (geo.name == 'north sea' or geo.name == 'center point') and ST_Within(ST_Point(cast(pt.longitude as Decimal(24,20)), cast(pt.latitude as Decimal(24,20))), geo.geometry) order by year, month limit 200"). But I would like to use Magellan instead.
    Thanks a lot

    Alan Garcia
    @alan51_gitlab
    Hey all, I'm using sbt version 1.4.3 and tried to build a jar from source locally. Running sbt assembly in the source root directory ran successfully for a while and then landed on the following NullPointerException:
    [error] (compile:compile) java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "this.text" is null
    Has anyone run into this before? Appreciate any pointers.