    Smith, James David

    I think maybe I found a bug with st_transform?

    Here is a point in PostGIS which is in CRS:4326. I transform it to 3857:

    SELECT st_astext(st_transform(st_setsrid(st_point(-0.063, 51.5), 4326), 3857))
    POINT(-7013.127919976235 6710219.083220741)

    The same in Sedona:

    SELECT st_transform(st_setsrid(st_pointfromtext("-0.063, 51.5", ","), 4326), 'epsg:4326','epsg:3857') geom
    POINT (5732953.775853588 -7013.129333153518)

    The result is not exactly the same, which I guess is fine, but more importantly st_transform in Sedona has switched the X and Y?
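    (Not Sedona code, just a sanity check: EPSG:3857 is the plain spherical Mercator formula, so the expected X/Y for that point can be recomputed by hand with nothing but the math module.)

```python
import math

R = 6378137.0  # radius used by EPSG:3857 (the WGS84 semi-major axis)

def lonlat_to_webmercator(lon, lat):
    """Forward spherical (web) Mercator: lon/lat in degrees -> metres."""
    x = math.radians(lon) * R
    y = math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)) * R
    return x, y

# The point from the message above; x should come out near -7013.13 and
# y near 6710219.08, matching the PostGIS answer (X from lon, Y from lat).
x, y = lonlat_to_webmercator(-0.063, 51.5)
```

If Sedona returns those two numbers swapped, it is treating the input as lat/lon rather than lon/lat.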

    10 replies

    Hi! I need to union multiple geometries in a single one. That brings me to a pretty common and frequent problem:

    org.locationtech.jts.geom.TopologyException: found non-noded intersection between LINESTRING ( 38.73214659317092 55.117539741026164, 38.7324924930679 55.1174096032198 ) and LINESTRING ( 38.73224185154213 55.117503901999974, 38.7318966522058 55.1175031436374 ) [ (38.73224185154213, 55.117503901999974, NaN) ]

    Of course I found many posts with exactly the same exception (most of them on https://gis.stackexchange.com/) and found out that in many cases the reason is high precision and almost coinciding, but still not identical, lines in the geometries.
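    A common mitigation for that is reducing coordinate precision before the union. A plain-Python sketch of the idea (snap is a hypothetical helper, not a Sedona function): push every vertex onto a fixed grid, so almost-coinciding vertices become truly identical and the overlay can node them.

```python
def snap(coords, grid=1e-6):
    """Snap (x, y) pairs to a regular grid of the given cell size."""
    return [(round(x / grid) * grid, round(y / grid) * grid) for x, y in coords]

# Two vertices that differ only past the 6th decimal collapse to one point:
a, b = snap([(38.732246, 55.117504), (38.7322460000001, 55.1175040000001)])
```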

    25 replies

    Hi, friends! I'm so sorry for so many questions from me over the last few days. I highly appreciate your help!
    The next problem is the ST_Area function results: it seems like Sedona does not calculate them correctly (while GeoSpark does). But in fact I'm not sure that the core problem is in ST_Area; maybe it's the wrong ST_Transform results.
    Here is the example:

    import com.vividsolutions.jts.geom.Geometry //for geospark
    import org.locationtech.jts.geom.Geometry // for sedona
    import scala.math.floor
    // ancillary functions
    def stx(geom: Geometry): Double = geom.getCoordinate.x // body reconstructed from context: returns the point's X
    val st_x = spark.udf.register("stx", stx _)
    // UDF determining local EPSG by longitude, used in ST_Transform
    def getEPSG(x: String): String = {
      "epsg:" + (floor((x.toDouble + 180)/6).toInt % 60 + 32601).toString
    }
    val epsg = spark.udf.register("epsg", getEPSG _)
    sql("select 'POLYGON((47.7327773009663 41.615196099211,47.7383215644648 41.6176978515387,47.7331100700674 41.6205451750929,47.7275657100572 41.6180428924801,47.7327773009663 41.615196099211))' as geo_wkt").
      withColumn("epsg", expr("epsg(stx(ST_Centroid(ST_GeomFromWKT(geo_wkt))))")).
      withColumn("area", expr("ST_Area(ST_Transform(ST_GeomFromWKT(geo_wkt), 'epsg:4326', epsg))")).
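    For reference, the zone arithmetic in getEPSG checks out; the same formula in plain Python:

```python
from math import floor

def get_epsg(lon: float) -> str:
    """Northern-hemisphere UTM zone for a longitude: EPSG 32601..32660."""
    return "epsg:" + str(int(floor((lon + 180) / 6)) % 60 + 32601)

# The example polygon's centroid longitude (~47.73) falls in UTM zone 38N,
# i.e. epsg:32638, so the area discrepancy must come from the transform itself.
zone = get_epsg(47.733)
```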

    Result in geospark (the area value is correct, checked it with PostGIS):

    scala> res21.show
    |             geo_wkt|      epsg|              area|

    And here is the result with sedona:

    |             geo_wkt|      epsg|             area|

    And applying ST_FlipCoordinates after ST_Transform does not change anything at all

        sql("select 'POLYGON((47.7327773009663 41.615196099211,47.7383215644648 41.6176978515387,47.7331100700674 41.6205451750929,47.7275657100572 41.6180428924801,47.7327773009663 41.615196099211))' as geo_wkt").
        withColumn("epsg", expr("epsg(stx(ST_Centroid(ST_GeomFromWKT(geo_wkt))))")).
        withColumn("area", expr("ST_Area(ST_FlipCoordinates(ST_Transform(ST_GeomFromWKT(geo_wkt), 'epsg:4326', epsg)))")).
    |             geo_wkt|      epsg|             area|

    Am I doing something wrong? Or have I discovered another bug?

    15 replies
    Will Apache Sedona support Delta Lake, Hudi, Iceberg?
    2 replies
    RIP KevinRadleman
    Hi Sedona devs, I'm trying to understand the parameters in https://sedona.apache.org/api/sql/Parameter/
    specifically sedona.join.numpartition, sedona.join.indexbuildside, sedona.join.spatitionside
    I tried increasing sedona.join.numpartition = 200 but I only see 21 executors used for the join task.
    Any examples, docs with more explanation would be greatly appreciated.
    2 replies

    Hi! Please tell me what's wrong with my submit now =)
    I get org.opengis.referencing.NoSuchAuthorityCodeException: Authority "AUTO" is unknown or doesn't match the supplied hints. Maybe it is defined in an unreachable JAR file? when trying to submit my application. The app code was tested in spark-shell, where it works.
    I add the Sedona libs to spark-shell this way:

    spark-shell --master yarn --name anyname --jars /home/nkochetov/lib/sedona/geotools-wrapper-1.1.0-25.2.jar,/home/nkochetov/lib/sedona/sedona-python-adapter-2.4_2.11-1.2.0-incubating.jar --conf...

    The application I try to submit is built with sbt.
    Sedona libs are added to the project in file build.sbt this way:

    libraryDependencies ++= Seq(
    . . .
      // https://mvnrepository.com/artifact/org.datasyslab/geotools-wrapper
      "org.datasyslab" % "geotools-wrapper" % "1.1.0-25.2",
      // https://mvnrepository.com/artifact/org.apache.sedona/sedona-python-adapter-2.4
      "org.apache.sedona" %% "sedona-python-adapter-2.4" % "1.2.0-incubating",
    . . .
    13 replies
    I'm sorry for answering here instead of in the question thread. I've read the last two messages where someone asked for help with a TopologyException occurring with ST_Union_Aggr. Unfortunately now I can find neither those threads nor the nickname of the person who asked the question... and it drives me mad.
    So, with apologies to all the other members of this chatroom: the person who asked, please look at my approach here. It makes ST_Union_Aggr a little safer by catching the first TopologyException and trying to fix it with a tiny buffer. If buffering doesn't help, it will fail anyway. But in my case that was enough to do the trick.
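    The shape of that "catch and retry with a buffer" approach, as a generic plain-Python sketch (union_fn and buffer_fn stand in for ST_Union_Aggr and ST_Buffer; all names here are hypothetical, not Sedona APIs):

```python
def safe_union(geoms, union_fn, buffer_fn, eps=1e-9):
    """Try the union as-is; on a topology error, retry with buffered inputs."""
    try:
        return union_fn(geoms)
    except Exception:
        # A tiny buffer often nodes the almost-coinciding lines; if it
        # doesn't, the second call fails just like the first would have.
        return union_fn([buffer_fn(g, eps) for g in geoms])
```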
    Hi! I have a question about performance and optimization. My goal is to perform an ST_Contains operation on two data frames: one contains points, the second polygons. I'm using the RDD core API on a Databricks cluster. The problem is that the computation gets stuck on the last few jobs and is not able to finish. Have you got any ideas what I should change? Maybe the partitioning algorithm or indexing is not proper?
    5 replies
    Hi guys, do you know what exactly the sedona.join.numpartition parameter is? Does it set the number of spatial partitions? Another question: how does it relate to a spatial partitioning algorithm like KDBTREE? Is there any recommendation on how to set these two parameters together? And finally, can you set these parameters simply with spark.conf.set()? In my case (on a Databricks cluster) it doesn't work when set in code (notebook); it only works when set directly in the cluster configuration. Thank you for the replies!
    1 reply
    Hi again ) Is it possible to execute ST_Union_Aggr with HashAggregate instead of SortAggregate?
    Siddhartha Khatsuriya
    Hey all, I am trying to read NetCDF on Spark and I am new to Sedona. Is there any example I can take as a reference? I prefer Python but Java works as well. Thanks for any suggestions you all might have.
    1 reply

    Howdy! When trying to use toDf on a SpatialJoinQueryFlat result I'm getting the following error:

    scala> val spatial_join_result_df = Adapter.toDf(spatial_join_result, Seq("county_code", "county_name"), Seq("osm_id", "code", "fclass"), spark)
    <console>:28: error: type mismatch;
     found   : org.apache.spark.api.java.JavaPairRDD[org.locationtech.jts.geom.Polygon,org.locationtech.jts.geom.Point]
     required: org.apache.spark.api.java.JavaPairRDD[org.locationtech.jts.geom.Geometry,org.locationtech.jts.geom.Geometry]
    Note: org.locationtech.jts.geom.Polygon <: org.locationtech.jts.geom.Geometry, but class JavaPairRDD is invariant in type K.
    You may wish to define K as +K instead. (SLS 4.5)
    Note: org.locationtech.jts.geom.Point <: org.locationtech.jts.geom.Geometry, but class JavaPairRDD is invariant in type V.
    You may wish to define V as +V instead. (SLS 4.5)
           val spatial_join_result_df = Adapter.toDf(spatial_join_result, Seq("county_code", "county_name"), Seq("osm_id", "code", "fclass"), spark)

    Any ideas how to troubleshoot or what I am doing wrong?

    5 replies
    Hi all! Is there a way to clip a raster with a polygon using Sedona?
    Anybody knows if Sedona supports Spark 3.2.x?
    2 replies
    Abhishek Vij
    I am trying to install Spark 3.2.x with Sedona but I run into an issue with conda forcing the Spark version down to 2.4.0. I am using the following command to create a new environment "conda create -n scale_spatial_120 apache-sedona[spark]"
    3 replies
    Conda creates the following plan to install sedona 1.2.0
    Mehmet Kalich
    Hey there, would anyone be able to direct me to a tutorial or any materials in using Apache Sedona with AWS EMR? I have SSH'd into the master node and spark-submitted our application using the following command:
    this is the error message we are getting back:
    6 replies
    The cluster is running on Spark 2.4.8, I'm wondering if there is some kind of dependency version mismatch that is happening?

    Hello, sorry for the noob question. I am starting to use PySpark and Sedona. I have set up the Spark cluster, I think correctly, as simple Spark code works, but when adding Sedona and running a basic script against it I get this error.

    Traceback (most recent call last):
      File "/home/hadoop/scratch3.py", line 7, in <module>
        SedonaRegistrator.registerAll(session)
      File "/home/hadoop/.local/lib/python3.7/site-packages/sedona/register/geo_registrator.py", line 43, in registerAll
        cls.register(spark)
      File "/home/hadoop/.local/lib/python3.7/site-packages/sedona/register/geo_registrator.py", line 48, in register
        return spark._jvm.SedonaSQLRegistrator.registerAll(spark._jsparkSession)
    TypeError: 'JavaPackage' object is not callable

    I'm running this:

    from pyspark.sql import SparkSession
    from sedona.register import SedonaRegistrator
    from sedona.utils import SedonaKryoRegistrator, KryoSerializer

    if __name__ == "__main__":
        session = SparkSession.builder.appName("sedonaapp").getOrCreate()
        sc = session.sparkContext

        df = session.sql("""SELECT st_GeomFromWKT('POINT(6.0 52.0)') as geom""")
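    (In case it helps: this TypeError usually means the Sedona jars never made it onto the JVM classpath. A hedged sketch of pulling them in at session build time, reusing the package coordinates that appear later in this log; adjust versions to your cluster's Spark/Scala line:)

```python
from pyspark.sql import SparkSession

# Assumption: Spark 3.0-line / Scala 2.12 artifacts, as used elsewhere in
# this thread; the jars are resolved from Maven when the session starts.
session = (SparkSession.builder
    .appName("sedonaapp")
    .config("spark.jars.packages",
            "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.0-incubating,"
            "org.datasyslab:geotools-wrapper:1.1.0-25.2")
    .getOrCreate())
```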
    12 replies
    Hello, after a lot of trials I'm still hitting a lot of issues getting a simple script using Sedona to work:
    Traceback (most recent call last):
    File "/home/hadoop/scratch5.py", line 19, in <module>
    File "/home/hadoop/.local/lib/python3.7/site-packages/sedona/register/geo_registrator.py", line 41, in registerAll
    spark.sql("SELECT 1 as geom").count()
    File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 586, in count
    File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in call
    File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 131, in deco
    File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o72.count.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: org.apache.spark.SparkException: Failed to register classes with Kryo
    org.apache.spark.SparkException: Failed to register classes with Kryo

    Hello sedona/geospark team! Thank you for making such a fine package!

    I am struggling with a performance issue and I was hoping you could provide feedback.

    I am using sedona in AWS EMR (6.6.0) (spark 3.2.0) with the Python API. My hardware is 16 core i9-11950H @2.6Ghz, 64GB ram

    This is my current benchmark program:

    import os
    import copy
    import pyspark.sql.functions as f
    from pyspark.sql import SparkSession
    from pyspark import SparkConf
    from sedona.register import SedonaRegistrator
    from sedona.utils import SedonaKryoRegistrator, KryoSerializer
    import geopandas as gpd
    import time
    total_memory_gb = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / (1024 ** 3)
    memory_gb = int(total_memory_gb * 0.66)
    spark_config = [
        ("spark.sql.sources.partitionOverwriteMode", "dynamic"),
        ("spark.sql.adaptive.advisoryPartitionSizeInBytes", "%d" % (int(64 * 1024**2))),
        ("spark.sql.adaptive.enabled", "true"),
        ("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false"),
        ("spark.driver.memory", "{}g".format(memory_gb)),
        ("spark.driver.maxResultSize", "0"),
        ("spark.jars.packages", "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.0-incubating,"
                                "org.datasyslab:geotools-wrapper:1.1.0-25.2"),
        ("spark.serializer", KryoSerializer.getName),
        ("spark.kryo.registrator", SedonaKryoRegistrator.getName),
        # ("sedona.global.indextype", "rtree"),
    ]
    spark = SparkSession.builder \
        .master("local[*]") \
        .config(conf=SparkConf().setAll(spark_config)) \
        .appName('sedona-benchmark') \
        .getOrCreate()
    if not SedonaRegistrator.registerAll(spark):
        raise ImportError('failed to register sedona functions')
    start = time.time()
    path = '/home/me/datasets/data_10mil_rows.orc'
    df = spark.read.format('orc').load(path).repartition(1000)
    df = df.withColumn('wkt', f.format_string("POINT(%s %s)", f.col('lon'), f.col('lat')))
    dfg = df.withColumn("geometry", f.expr("st_geomFromWkt(wkt)")).drop('wkt')
    polygon_df = gpd.read_file("/home/me/datasets/ne_10m_admin_0_countries/ne_10m_admin_0_countries.shp")[["geometry"]]
    polygon_df = polygon_df.explode()  # make it all single polygons instead of mix of single / multipolygon
    polygon = spark.createDataFrame(polygon_df).withColumnRenamed("geometry", "poly")
    filter_result = dfg.join(f.broadcast(polygon),
                                          f.expr("ST_Contains(poly, geometry)")).drop('poly', 'geometry')
    filter_result.write.format('orc').mode('overwrite').partitionBy('source', 'year', 'month', 'day').save(
        path + '-polyfiltered')
    elapsed = time.time() - start
    print("seconds elapsed: ", elapsed)

    When I run this with a polygon set from naturalearth of the world countries (10m resolution) (~4k polygons, ~550k vertices) it takes 1 hour to complete. Running against the 110m resolution of countries from NE is under 1 minute to complete (288 polygons, 10642 vertices). OK, obviously it's going to run faster with simpler polygons. However, running against another polygon from one of my business tasks (~33k polygons, ~245k vertices) it runs in under 1 minute. That task seems on the same scale as the high resolution naturalearth polygons. (please correct me if I'm wrong). All of these tasks are doing the same broadcast index spatial join against 10 million records with POINT geometries attached.

    The main difference in these tasks appears to be the complexity of the polygons (the country polygons are fewer and more detailed) and that the country polygons share borders with each other, whereas my business polygon set has significant spatial distance between polygons. Maybe the difference is because the business-polygon set gets better spatial partitioning? I tried both Quadtree and Rtree partitioning, but it had no noticeable effect.
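    (A possible explanation for part of the gap: the exact containment test that runs after any index prefilter is linear in the polygon's vertex count per candidate pair, so few-but-very-detailed polygons can dominate the join cost. The classic ray-casting check, in plain Python purely for illustration, makes the O(vertices) cost visible:)

```python
def point_in_polygon(pt, poly):
    """Ray-casting containment test: O(n) in the number of polygon vertices."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Count edge crossings of a horizontal ray going right from pt
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside
```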

    (post too long... see below for part 2)

    3 replies

    Each time the query is run the polygons are small enough to broadcast, resulting in a BroadcastIndexJoin. I have also tried using python RDD API very similar to this example https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb . However the RDD API seemed slow (maybe serialization overhead because I need to convert back to dataframe and write to ORC on disk). Can you use "broadcast" with the RDD API?

    Is there something I am missing in my implementation? Can anyone suggest why the high-resolution country polygons take so much longer than my similarly (ish?) sized business-polygons?

    16 replies
    Feedback is much appreciated! Thank you for reading.
    Mehmet Kalich
    Has anyone else had any issues using the Kryo Serializer when using Pyspark with Sedona? I've followed the setup instructions on the Sedona website but can't get past this error. I'm wondering if I need to manually register the classes.
    I'm using emr-6.6.0 and Spark 3.2.0. I can see when using Scala there is a need to create an array and register the classes with Kryo, but I can't find the same instructions for python.
    8 replies

    I’m trying to follow the example code, there is this code:
    counties = spark.\
    option("delimiter", "|").\
    option("header", "true").\

    Where can I download this counties.csv file? Thanks for any tips.

    Hersh Gupta
    Hi Sedona folks, I'm using the Sedona 1.2.1 R package from the GitHub repo, and I'm getting a Number of partitions must be >= 0 error when running a Spark SQL join against a dataframe with one row. It looks like it was resolved in apache/incubator-sedona#207 but doesn't seem to have been updated in the R package. I'm wondering if there are plans to resolve this error for R? I couldn't find a relevant ticket in the Sedona JIRA. Thanks!
    3 replies
    Shamanth K M
    hey there, I'm trying to join two polygons which are in TSV files in WKT format using Scala. Both files have 1 polygon each. The error we are getting is: spatialRDD.spatialPartitioning(GridType.QUADTREE)
    java.lang.IllegalArgumentException: [Sedona] Number of partitions 2 cannot be larger than half of total records num 1,
    any insights on how to join two polygons?
    1 reply
    Brendon Daugherty

    Hey all!

    I've been experimenting with using sedona; one thing I was hoping to do is extract values for a list of points from a raster (geotiff). However, I've been struggling to figure out how to do so using sedona's SparkSQL api.

    In R, I would do something like stars::st_extract(my_raster_obj, at = a_list_of_points)

    Does anyone know how I might do this? Sorry if this is the wrong place to ask!

    I've loaded the raster successfully using the Geotiff Dataframe reader, but haven't seen anything in the raster frame operators that looks promising. Is there something else I should be looking at?

    2 replies
    Tom Kilgore

    Hey Sedona team, thanks so much for all your great work on this project. I was hoping you could provide some guidance on what's causing the registerAll function to fail.

    When calling SedonaRegistrator.registerAll(self._spark) I'm getting the following error:
    An error occurred while calling z:org.apache.sedona.sql.utils.SedonaSQLRegistrator.registerAll. : java.lang.NoClassDefFoundError: org/locationtech/jts/geom/Geometry

    I'm using Spark/EMR versions:

    EMR 6.7.0
    Spark 3.2.1
    PySpark 3.2.1

    Sedona related jars



    4 replies

    I have a job that needs to query large amounts of spatial data. Does anyone have a good idea for partitioning data so that Spark can do dynamic partition pruning on disk before I run my intersection query on my points?

    Here's what I'm thinking so far:

    My records on disk are POINT() geometries (1 per row) and I want to do ST_Contains() with a set of query polygons. Maybe I partition the points by H3 or geohash on disk, then do a "pre-query" where I intersect the polygons with my H3/geohash partition polygons so I can run my true query with .select([list_of_intersecting_geohashes])
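    The prefilter idea above only needs an encoder for the partition column; a self-contained plain-Python geohash encoder for illustration (the partition value would be a fixed-length prefix of the hash):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, precision=6):
    """Encode lat/lon to a geohash by interleaving lon/lat bisection bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, ch, even = 0, 0, True
    out = []
    while len(out) < precision:
        if even:  # even bits refine longitude
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                ch, lon_lo = ch * 2 + 1, mid
            else:
                ch, lon_hi = ch * 2, mid
        else:  # odd bits refine latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                ch, lat_lo = ch * 2 + 1, mid
            else:
                ch, lat_hi = ch * 2, mid
        even = not even
        bits += 1
        if bits == 5:  # every 5 bits becomes one base-32 character
            out.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(out)
```

Points sharing a prefix share a rectangular cell, so intersecting the query polygons with those cells first gives the partition list to `.select(...)` on.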

    does that sound reasonable or is there some built-in partitioning on disk that I can take advantage of that I have totally overlooked?

    8 replies
    Hugh Saalmans
    Just saw the Sedona v1.2.1 release note on Twitter; however I can't see the v1.2.1 files in Maven?
    5 replies
    Matthias De Geyter
    I'm having trouble reading some shapefiles with Sedona. The problem is also described here: https://stackoverflow.com/questions/71273008/why-is-apache-sedona-not-reading-this-shapefile-properly (I'm not the thread starter, but replied).
    There are no error messages and the schema is read correctly, but I just get an empty DataFrame. Loading the files in QGIS and exporting them with a specific geometry type instead of "Automatic" seems to fix it.
    Anyone an idea what this could be? Files are free for use, e.g. https://geoftp.ibge.gov.br/organizacao_do_territorio/malhas_territoriais/malhas_municipais/municipio_2010/ro/ro_municipios.zip
    or https://downloadagiv.blob.core.windows.net/wegenregister/Wegenregister_SHAPE_20220616+-+correctie.zip
    4 replies
    SedonaViz can assemble any customized styles. Is there an example of a point, line, or polygon being rendered using a custom style?
    Douglas Dennis
    I was wondering if there are plans or any interest in supporting the spark Dataframe API, such as in pyspark.sql.functions?
    4 replies
    Krzysztof Karski

    I am trying to parse a directory with a nested structure of geojson files using the GeoJsonReader

    SpatialRDD<Geometry> rdd = GeoJsonReader.readToGeometryRDD(
            path.toString(), true, true);

    Where path.toString() points at a directory. With regular SparkSQL this works but with Sedona I get the following error:

    is a directory, which is not supported by the record reader when `mapreduce.input.fileinputformat.input.dir.recursive` is false.

    Searching I am getting advice to set

    context.conf().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

    but this doesn't make a difference.
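    (One thing worth checking, as a hedged guess: the key has to land on the Hadoop configuration of the context that creates the RDD, not on the SQL conf. From PySpark that would look like:)

```python
# Hedged sketch: set the flag on the Hadoop configuration before building
# the SpatialRDD, so the record reader sees it when listing the directory.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "mapreduce.input.fileinputformat.input.dir.recursive", "true")
```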

    9 replies
    Krzysztof Karski

    Getting a ClassNotFoundException on

    ClassNotFoundException: geoparquet.DefaultSource

    When I try to...

    set.write().format("geoparquet").save(path.toString() + "/parquet/postalcodes.parquet");

    My pom.xml dependencies are

    6 replies
    Hi! Is there any way to get the geodesic area using sedona?
    3 replies

    I'm working on an EMR notebook in AWS and my goal is to calculate distances between the elements of two shapes.

    At first, I set the config and install apache-sedona:

    %%configure -f
    {"conf": {"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
              "spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
              "spark.jars.packages": "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.1-incubating,org.datasyslab:geotools-wrapper:1.1.0-25.2",
              "spark.pyspark.python": "python3",
              "spark.pyspark.virtualenv.enabled": "true",
              "spark.pyspark.virtualenv.type": "native",
              "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"}}
    from sedona.register import SedonaRegistrator

    After this, I load the shapes and I calculate distances using:

    distances = spark.sql("""select id, ST_Distance(building_shapes.geometry, seas_shape.geometry) as distance
        from building_shapes, seas_shape""")

    It works, but the problem is that the performance is very poor: it takes 10 minutes for 1,000 polygons in building_shapes and 40 polygons in seas_shape.

    Am I forgetting something important to improve the performance?

    14 replies
    hi everyone, I am using st_dumppoints, however I don't get enough points generated. Can it be invoked nested? Or is there any custom function to generate points in a geometry [polygon/multipolygon]?
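    ST_DumpPoints only emits the vertices a geometry already has, so getting "more" points means densifying the edges first. The interpolation itself is simple; a plain-Python sketch for one edge (densify is a hypothetical helper, not a Sedona function):

```python
def densify(p1, p2, n):
    """Return n+1 evenly spaced points from p1 to p2, endpoints included."""
    (x1, y1), (x2, y2) = p1, p2
    return [(x1 + (x2 - x1) * i / n, y1 + (y2 - y1) * i / n)
            for i in range(n + 1)]

# Splitting one edge into 4 segments yields 5 points along it:
pts = densify((0.0, 0.0), (1.0, 0.0), 4)
```

Applying this per edge of the exterior ring (e.g. via a UDF) before ST_DumpPoints gives as many points as needed.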
    6 replies
    Skip Woolley
    Hi everyone, is there an option for loading GeoTIFFs into Spark using apache.sedona for R? Even a workaround would be great.
    15 replies
    Skip Woolley
    6 replies
    @jiayuasu I get the following error. I'm trying to load the raster from an NFS using a standalone cluster.
    Kristin Cowalcijk
    Hi, I've noticed that the semantics of ST_Contains is actually "covers" defined by OGC, and ST_Within used to be "coveredBy" but changed to "within" after resolving SEDONA-118. Should we change the semantics of ST_Contains to "contains" and add additional UDFs for ST_Covers and ST_CoveredBy?
    1 reply