Where communities thrive

  • Join over 1.5M+ people
  • Join over 100K+ communities
  • Free without limits
  • Create your own community
Repo info
    Hi all, I am wondering about the Spark SQL implementation in Sedona. From the documentation, it writes that Spark SQL is an API built on top of the SRDD abstraction. When I look into the source code, it seems that SparkSQL operations are separately written (e.g., the UDTs, predicates, and other functions). Am I correct about it? Or is SQL part based on the core part?
    9 replies
    Does someone wish to implement the R lwgeom::st_splitfunction in sedona?
    4 replies

    when I follow the official document to launch spark like this

    ./bin/spark-shell --packages org.apache.sedona:sedona-python-adapter-3.0_2.12:1.1.1-incubating,org.apache.sedona:sedona-viz-3.0_2.12:1.1.1-incubating,org.datasyslab:geotools-wrapper:1.1.0-25.2

    I found the ST_SubDivideExplode function will fail like this:

    %%sql nn --preview
    SELECT ST_SubDivideExplode(ST_GeomFromText("LINESTRING(0 0, 85 85, 100 100, 120 120, 21 21, 10 10, 5 5)"), 5)


    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 49.0 failed 4 times, most recent failure: Lost task 0.3 in stage 49.0 (TID 997, zw02-data-hdp-dn9258.mt, executor 9): java.lang.NoClassDefFoundError: org/geotools/geometry/jts/JTS
        at org.apache.spark.sql.sedona_sql.expressions.subdivide.GeometrySubDivider$.getIntersectionGeometries(GeometrySubDivider.scala:124)
        at org.apache.spark.sql.sedona_sql.expressions.subdivide.GeometrySubDivider$.subDivideRecursive(GeometrySubDivider.scala:88)
        at org.apache.spark.sql.sedona_sql.expressions.subdivide.GeometrySubDivider$.subDividePrecise(GeometrySubDivider.scala:137)
        at org.apache.spark.sql.sedona_sql.expressions.subdivide.GeometrySubDivider$.subDivide(GeometrySubDivider.scala:144)
        at org.apache.spark.sql.sedona_sql.expressions.ST_SubDivideExplode.eval(Functions.scala:1413)
        at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
        at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:222)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:873)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:873)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.scheduler.Task.run(Task.scala:129)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:467)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1415)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:470)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Could anyone tell me how to get a difference set/collection ?
    e.g. ST_DIFF (A:geometry, B:geometry) for polygonLeftDf and polygonRightDf.
    9 replies
    Hi people, I'm trying something very similar to this example https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb , trying to run the Spatial Join using RDD API. Apparently works ok, but I cannot add the other non-spatial columns in the Adapter.toDF. It seems that "rdd.fieldNames" is returning an error
    16 replies

    @Imbruced Hey Pawel.
    I got the same issue as @harryprince did when performing ST_Intersection

    Caused by: org.locationtech.jts.geom.TopologyException: found non-noded intersection between LINESTRING ( 117.62675807275772 37.19592923355748, 117.62680869437253 37.195931665434756 ) and LINESTRING ( 117.62683537409399 37.19593294589055, 117.62680869437253 37.195931665434756 ) [ (117.62680869437253, 37.195931665434756, NaN) ]

    I'd like to write a new udf for intersection perform.
    The idea is to try catch TopologyException and assign the record with null.

    5 replies

    Hi, is it possible to write GeometryType() column into a mdsys.sdo_geometry column in oracle database?
    writing directly cause java.lang.IllegalArgumentException: Can't get JDBC type for array<tinyint>.

    I'm using sedona-python-adapter-2.4_2.11-1.1.0-incubating.jar and ojdbc6.jar

    4 replies
    Daniel Haviv

    I'm trying to run the following query but for some reason it ends up running with one task, causing the job a major slowdown as it's no parallelized:

    select t.*, gs.geohash, cs.english as start_city, ce.english as end_city
    from trips_step1 t 
    join public.geohash_unique gs on t.geohash_start = gs.geohash
    join public.cities cs on (gs.ObjId = cs.ObjId and st_Contains(st_geomFromText(cs.geometry),st_Point(t.lon_start,t.lat_start)) )
    join public.geohash_unique ge on t.geohash_end = ge.geohash
    join public.cities ce on (ge.ObjId = ce.ObjId and st_Contains(st_geomFromText(ce.geometry),st_Point(t.lon_end,t.lat_end)) )
    where 1 = 1
    and ST_GeometryType(st_geomFromText(cs.geometry)) != 'ST_GeometryCollection'
    and ST_GeometryType(st_geomFromText(ce.geometry)) != 'ST_GeometryCollection'

    This is the DAG visualization

    Screen Shot 2021-12-27 at 12.27.09.png
    40 replies
    Daniel Haviv
    Daniel Haviv
    Hey again :)
    https://sedona.apache.org/download/compile/ gives 404, can anyone please direct me to the build instructions?
    1 reply
    Hi all, this a more theoretical than practical question. I'm working with a process that joins some shapefiles, and group the areas and intersections by ID codes in one of this areas. The end product is just a table that could fit a csv. However, when I save the last file (with no spatial features) it takes a while just to save the table. Is this the best format to save the result? Or it may be better to use some other functions to save this? I'm currently using: resultadofinal.write.mode('overwrite').option("header",True).csv(output_path + 'table3')
    3 replies
    hello, everyone,
    is there someone know Sedona support GeoPackage file format or not?
    1 reply
    Noe Alejandro Perez Dominguez

    Hello, could anyone point me towards some documentation on how to get EMR (v6.4.0) Zeppelin notebook working with Sedona? I have tried adding the .json config to the bootstrap script in the helium folder of the nodes. That does make apache-sedona show up as a an available visualisation but when I enable it and create a notebook, the globe symbol does not appear after doing a %sql query.
    This is the json config I am using:

      "type": "VISUALIZATION",
      "name": "apache-sedona",
      "description": "Zeppelin visualization support for Sedona",
      "artifact": "apache-sedona@1.0.1-incubating",
      "license": "BSD-2-Clause",
      "icon": "<i class='fa fa-globe'></i>"

    I also add the required .jar dependencies to the spark interpreter.


    Hi there,
    I'm following along with the Sedona docs (https://sedona.apache.org/tutorial/core-python/#read-from-shapefile) and am trying to read in a shapefile with this simple line of code but keep receiving an error.

    rdd = ShapefileReader.readToGeometryRDD(spark.sparkContext, 's3a://path/to/Hail_Risk_OptHotSpotAnalysis.shp')

    Py4JJavaError: An error occurred while calling z:org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader.readToGeometryRDD. : java.lang.ArrayIndexOutOfBoundsException: 0

    I've successfully read this same file into a geopandas df and converted to a spark df, so I don't think there is an issue with the file itself. Any thoughts?

    Paweł Kociński
    @mmilett14 did you try to put files in some directory ?
    3 replies
    and in the code rdd = ShapefileReader.readToGeometryRDD(spark.sparkContext, 's3a://path/to/roads')
    Itamar Landsman

    Hi all,
    I'm trying to run over a dataframe where every row has a start lon/lat and end lon/lat and produce two new columns indicating if the start/end locations are within a polygon.
    my code is running locally on my pc with small number of rows, but when trying to go EMR, my code failes with:

    22/01/05 17:21:11 WARN JoinQuery: UseIndex is true, but no index exists. Will build index on the fly.
    [Stage 17:========================================>            (151 + 49) / 200]22/01/05 17:21:28 WARN TaskSetManager: Lost task 152.0 in stage 17.0 (TID 3454, ip-172-11-3-240.ec2.internal, executor 19): scala.MatchError: null
        at org.apache.spark.sql.sedona_sql.expressions.ST_Point.eval(Constructors.scala:266)
        at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryBase$$anonfun$toSpatialRDD$1.apply(TraitJoinQueryBase.scala:45)
        at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryBase$$anonfun$toSpatialRDD$1.apply(TraitJoinQueryBase.scala:44)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
        at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
        at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
        at scala.collection.AbstractIterator.aggregate(Iterator.scala:1334)
        at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$24.apply(RDD.scala:1167)
        at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$24.apply(RDD.scala:1167)
        at org.apache.spark.SparkContext$$anonfun$36.apply(SparkContext.scala:2157)
        at org.apache.spark.SparkContext$$anonfun$36.apply(SparkContext.scala:2157)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

    Any Idea?

    7 replies

    Hi team.
    I'm trying to speed up the join operation between two big dataframes using SQL API.
    I'm wondering if there is a need to using spatialpartition like this in RDD API


    If the answer is yes,so is the common practice is to transform the dataframe to rdd,then set spatial partition?
    Or just using repartion on the two dataframes without manully spatial partition setting?

    Karthick Narendran
    Hi Team, Currently, I'm using the gdal_translate command to resize the input GeoTiff file and load it into a Dataframe. I'd like to know if there is a way to resize the GeoTiff using Sedona? I searched through the docs & tutorials, unable to find it, or I'm missing it. TIA
    2 replies
    Karthick Narendran
    There are broken hyperlinks under https://github.com/apache/incubator-sedona/tree/master/docs/tutorial that needs fixing. I'd want to check if anyone is working on this, else I can submit a PR.
    2 replies
    Karthick Narendran
    Hi, I'm trying to run val BandDF = spark.sql("select RS_AddBand(Band1, Band2) as sumOfBands from GeotiffDataframe") from the Raster operators document (http://sedona.incubator.apache.org/api/sql/Raster-operators/) on an Azure Databricks cluster and got the below exception. I tried replacing the function with "RS_Add" as well, but still the same. However, RS_GetBand or RS_Array doesn't have this problem. Let me know if there is anything I'm missing?
    AnalysisException: Undefined function: 'RS_AddBand'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7
    11 replies

    Hi all!
    I have two dataframes, one is with areas, another - with addresses.
    The task is to match addresses with containing areas.
    What is important, there is also a key by which I can join dataframes.
    The pseudo-code looks like this:

      .join(addresses, addresses.area_id === areas.id, "inner")
      .filter(expr("ST_Contains(area, address)"))

    Areas dataframe is small (hundreds of records), but addresses is big (70 mln records).
    However, result of join by key is thousands of records, after filter it should be even less.
    Spark job works super slow. In fact, I waited for a few hours, then killed it.
    In the plan I see that RangeJoin optimization is applied with extra condition addresses.area_id === areas.id.
    I assume that extra condition is applied after calculation for each area and address.
    Could anyone confirm my assumption? Are there any workarounds? Should it be improved, so that extra condition is applied first?
    The best workaround I can think of right now is to use UDF instead of ST_Contains. In this case spark is not able to optimize the query, so it first joins by key and then applies ST_Contains on a small number of records.

    27 replies
    I got an issue when trying to search the records in left dataframe not intersecting right dataframe.
    When performing select("*") operation,I cannot generate expected records in output csv file.
    The number of output file somehow increases and include the undesired records I don't want.
    Does anyone have any idea?
    While performing select("geom_res") or select("geom_res","some attributes"),it works well.
    val leftDfWithId = spatialLeftDf.withColumn("UniqueID", monotonically_increasing_id)
    val intersect_tmp_df: DataFrame = sparkSession.sql(
        |SELECT leftDfWithId.*,rightdf.geom_right,ST_DIFF(leftDfWithId.geom_left,rightdf.geom_right) as diff FROM leftDfWithId,rightdf WHERE ST_Intersects(leftDfWithId.geom_left, rightdf.geom_right)
    val a_not_inter_b = leftDfWithId.join(intersect_tmp_df.select("UniqueIDIntersect").distinct(), $"UniqueID" === $"UniqueIDIntersect", "left")
        .withColumnRenamed("geom_left", "geom_res")
    val df_tmp = a_not_inter_b.select("*")   //  select operation
    val df_res = df_tmp
        .selectExpr("ST_AsText(geom_res) as wkt", "*")
        .drop("geom_res")    //  df_res
        .option("header", true)
    2 replies

    Hey all, thanks for your work, i really appreciate it!
    I am trying to do some spatial clustering with ELKI and therefore i aggregated spatial data and it works fine that far. Now i want to do some better spatial partitioning than a simple grid and used the rdd functions for spatial partitioning. I am able to get the geometry with the partition index:
    pointsRDD.spatialPartitionedRDD.rdd.mapPartitionsWithIndex{case (i: Int,rows: Iterator[org.locationtech.jts.geom.Geometry]) => rows.map{ a=> (i,a.toString) } }.toDF("i","partition_number").show(400,200)

    But how can I get the other Fields of the spatialRDD ? i need the identifier columns. Thanks in advance!

    1 reply
    I notice that there's a merged PR for Scala 2.13 builds of Sedona. Is there a timeline for when 2.13 builds will get published to maven?
    1 reply
    Bram Weterings

    I'm trying to use sedona functions in sql within a udtf on Databricks and i'm running in to some issues. This does not work:

         ConvertGeoFromWKT(WKT STRING COMMENT 'WKT') 
       PointWKT STRING,
       WKB binary,
       XCoordinate STRING,
       YCoordinate STRING,
       GeoType STRING
       COMMENT 'Vertaal wkt naar verschillende formats'
         ST_Centroid(ST_GeomFromWKT(WKT)) as PointWKT,
         ST_AsEWKB(ST_GeomFromWKT(WKT)) as WKB,
         ST_X(ST_Centroid(ST_GeomFromWKT(WKT))) as XCoordinate,
         ST_Y(ST_Centroid(ST_GeomFromWKT(WKT))) as YCoordinate,
         ST_GeometryType(ST_GeomFromWKT(WKT)) as GeoType

    I get:
    Error in SQL statement: AnalysisException: Not allow to create a permanent function ConvertGeoFromWKT by referencing a temporary function `ST_Centroid`.

    Creating it as a temporary functions work, i can also use that function in a temporary view (as in, it creates the view)

    create or replace global temporary view testview
      LATERAL ConvertGeoFromWKT(table.Geometry)

    When i use the view, i get the error
    AnalysisException: could not resolve `ConvertGeoFromWKT` to a table-valued function;

    How can i create a function with sedona functions that i can use in a (temporary) view?

    5 replies
    Can sedona be integrated with flink?
    2 replies
    Ashic Mahtab

    I'm looking to write a spark udf (or rather pandas_udf) that returns a shapely.geometry.point.Point. What would the return "type" be? My udf currently looks like:

    def myUdf(latitude: pd.Series, longitude: pd.Series) -> pd.Series:
        # some processing to get tuples
        points = tups.map(ShapelyPoint)
        return points

    Testing the code, I can see that the pandas return type is pandas series of ShapelyPoint (my alias for shapely.geometry.point.Point). With this, I get:

    Invalid return type with scalar Pandas UDFs: GeometryType is not supported

    Any pointers on what the return type should be to get a column of points would be appreciated.

    3 replies
    Ashic Mahtab
    This worked:
    @F.pandas_udf(GeometryType(), F.PandasUDFType.SCALAR_ITER)
    @Imbruced looks like it just needed the second param to pandas_udf
    Ashic Mahtab
    @Imbruced might have jumped the gun.. seemed to be working, but wasn't actually. basic udf does work.
    Ashic Mahtab

    i'm joining a fairly large dataset of Points with a 5MB dataset of polygons like this:

    gdf.withColumnRenamed('geometry', 'point').join(F.broadcast(locations_gdf), on=F.expr("ST_Contains(geometry, point)"), how='left')

    If I sepcify how='left', the resulting datafrrame.count() doesn't even finish in 30 minutes+. If I say how='inner', it finishes in about 90 seconds. If this expected? I could of course just take the inner join results, do a left anti join with the ids of the gdf dataframe and union what's missing, but was wondering if there's a more straighforward option.

    3 replies
    gdf is the large dataset and locations_gdf is the 5MB one with polygons.
    Good evening all, something changed in the last ~week that I'm having trouble debugging. I'm installing Sedona 1.2 SNAPSHOTS on Databricks (9.1 LTS Photon Apache Spark 3.1.2, Scala 2.12) via the init script as described in the docs. I have double checked that that the spark.sql.extensions and serializers match the guidance on the spark configuration AND that the SedonaRegistrator executed without error. 90% of the functions seem to work, EXCEPT for selected sql commands that I know worked last week. For instance, ST_SubDivideExplode, where I get an unhelpful py4j error. I think that somehow I haven't registered all the sql functions correctly. Maybe I'm missing setting spark.sql.catalog.spark_catalog is correctly match with Sedona? Any thoughts?
    Paweł Kociński
    @jwebster42 Hi, which sedona version did you add ? What kind of exception was raised ? Can you send the stacktrace ?
    This message was deleted
    @mjohns-databricks @Imbruced - that's exactly the place where they failed. I tested a bunch of other functions this weekend and ST_SubDivideExplode was the only one that I could make... well... explode. :D. I unfortunately do not have access to DBR 7.3 due to a quirk in how my organization's sys admin set things up, so for now, I will cheat and write a UDF that does the same thing on a dataframe and come back to the sedona optimized versions later. Do you know if DBR 10 has this same challenge?
    Pedro Mano Fernandes
    Does anybody have a working example of build.sbt with sedona:1.1.1-incubating and spark-core:2.4?
    I'm getting Incompatible Jackson version: 2.10.0 😔
    Pedro Mano Fernandes
    ✔ Got it! I forgot to exclude jackson in wololo dependency
    @jwebster42 -- Ok, so we looked harder at what has changed. New + better info -- It actually is that DBR 9.1 is Spark 3.1 and in Spark 3.2 (DBR 10) Generators class was changed in to now include nodePatterns field. So, since Sedona pom.xml (master / 1.2 SNAPSHOT) specifies Spark 3.2 at the build, that missing field is the issue you are hitting when you run on DBR 9. You could either build Sedona against Spark 3.1 or you could most likely update to DBR 10. We are still looking into it but that is where things are currently.
    1 reply
    Jean-Denis Giguère
    Greetings! I opened yesterday the issue https://issues.apache.org/jira/browse/SEDONA-79. I could have some time to do experimentation or improve the issue documentation if someone could give me some feedback on this issue.
    7 replies

    Hello everybody, I'm a beginner with sedona and I try to use Adapter.toSpatialRdd to convert dataframe to spatial rdd but i have this error : An error occurred while calling z:org.apache.sedona.sql.utils.Adapter.toSpatialRdd.
    : java.lang.NoClassDefFoundError: org/opengis/referencing/FactoryException

    In addition to the original libraries of spark I installed those one : sedona-python-adapter-3.0_2.12-1.1.1-incubating.jar / geotools-wrapper-geotools-24.0-sources.jar / jts-core-1.18.2.jar

    Did someone have a solution ?

    3 replies
    hello , i have trouble opening a shapefile . i am facing the following error when i am loading the shp : py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader.readToGeometryRDD.
    : java.lang.ArrayIndexOutOfBoundsException: 0
    I am using pyspark 2.3.2 , the 5 files composing the shape are in the same directory (.shp, .dbf, .shx, .prj, .cpg all in small case) without any other fil in the directory. I successfully loaded smaller shapefile but this one and other larger shapefile are refusing to load. In the meantime I successfully loaded this same shapefile on the binder instance of sedona (after uploading the file there). file is ~500Mo large. Any idea ? thanks in advance.
    2 replies
    Pedro Mano Fernandes
    if I perform a SparkSQL distance join query, do I still have to do preparation tasks (like performing RDD spatial partitioning prior to dataframe join)?
    8 replies