jterry64
@jterry64
oh haha
Simeon H.K. Fitch
@metasim
the problem is that PySpark's Arrow integration doesn't support UDTs (and arrays?)
(definitely not multidimensional arrays)
You'd want to use @pandas_udf, but it just a) doesn't work with TileUDT, and b) is still slow.
If you create the aggregate in Scala, exposing it in Python is just a couple of boilerplate steps.
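For context, a minimal sketch of that boilerplate on the Python side, assuming a hypothetical Scala object com.example.agg.MyAggregates with a method myTileAgg(col: Column): Column on the JVM classpath (all names here are placeholders, not RasterFrames API):

# Hypothetical wrapper; `com.example.agg.MyAggregates.myTileAgg` is a placeholder.
from pyspark.sql.column import Column, _to_java_column

def my_tile_agg(spark, tile_col):
    # Scala objects compile to classes with static forwarders, so the
    # method is reachable directly through the py4j gateway.
    jvm = spark.sparkContext._jvm
    return Column(jvm.com.example.agg.MyAggregates.myTileAgg(_to_java_column(tile_col)))

# Usage: df.agg(my_tile_agg(spark, df.tile).alias("agg"))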
jterry64
@jterry64
oh ok, maybe I'll go that route then. Thanks!
Yuri de Abreu
@yurigba

I am having problems with rasterframes finding GDAL (once again).

I have installed gdal through conda, and gdalinfo --version returns:

GDAL 2.4.4, released 2020/01/08

Also whereis gdalinfo returns:

gdalinfo: /usr/bin/gdalinfo /home/yuri/anaconda3/envs/condaenv/bin/gdalinfo /usr/share/man/man1/gdalinfo.1.gz

Yet when I run gdal_version() in python, it states 'not available'. What could this be?

Mandal
@hificoders
I have a CSV file containing latitude and longitude columns. The number of records in the CSV file is huge, say 1 million. I'm trying to extract raster information from a tif file; the code is given below, and the problem is that it's very slow. @pomadchin suggested I check with the rasterframes folks on how we can improve performance. @metasim or any folks, kindly guide me on how to get good performance from the code below.
package main.scala.sample
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.locationtech.rasterframes._
import org.locationtech.rasterframes.datasource.raster._
import org.locationtech.rasterframes.encoders.CatalystSerializer._
import geotrellis.raster._
import geotrellis.vector.Extent
import org.locationtech.jts.geom.Point
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.Row

object SparkSQLExample {

  def main(args: Array[String]) {

    val spark = SparkSession.builder()
      .master("local[1]").appName("RasterFrames")
      .withKryoSerialization.getOrCreate().withRasterFrames


    val example = "1.tif"
    val rf = spark.read.raster.from(example).load()

    import spark.implicits._
    val pointDFCsv = spark.read.format("csv")
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("Book1.csv")

    val rf_value_at_point = udf((extentEnc: Row, tile: Tile, point: Point) => {
      val extent = extentEnc.to[Extent]
      Raster(tile, extent).getDoubleValueAtPoint(point)
    })
    val points = pointDFCsv

    val result = points
      .withColumnRenamed("lon", "x")
      .withColumnRenamed("lat", "y")
      .join(rf)
      .where(st_intersects(rf_geometry($"proj_raster"), st_makePoint($"x", $"y")))
      .select(rf_value_at_point(rf_extent($"proj_raster"), rf_tile($"proj_raster"), st_makePoint($"x", $"y")) as "value")

    result.coalesce(300).write.format("csv").option("header", "false").mode("append").save("/op/")

  }
}
Simeon H.K. Fitch
@metasim
@hificoders The problem is likely that you need to spatially partition on lat/lon before you do the join.
I'm unfortunately on a schedule crunch and don't have the time right now to write up an example, but here's a sketch.
You first want to use the withSpatialIndex option (from the SpatialIndexOptionsSupport trait) on spark.read.raster.
Simeon H.K. Fitch
@metasim
(^^^ gitter screwed up the link, but it's in that trait)
The second is to use rf_xz2_index (an org.apache.spark.sql.TypedColumn[Any, Long]) on the CSV data.
Simeon H.K. Fitch
@metasim
Add st_makePoint($"lon", $"lat") as "position" to the CSV data frame.
Follow that with a .repartitionByRange(rf_xz2_index($"position")).
This should ensure the raster data and the point data land on the same executor before the join happens.
Also make sure your input raster is in COG format.
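Putting those pieces together, a rough sketch of the restructured job (untested; it assumes the session, imports, and column names from the code posted above):

// Rough, untested sketch building on the job above.
val rf = spark.read.raster
  .withSpatialIndex()          // adds a spatial_index column to the raster side
  .from("1.tif")
  .load()

val points = pointDFCsv
  .withColumn("position", st_makePoint($"lon", $"lat"))
  // co-locate the points with the rasters they fall in before the join
  .repartitionByRange(rf_xz2_index($"position"))

val result = points
  .join(rf)
  .where(st_intersects(rf_geometry($"proj_raster"), $"position"))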
Simeon H.K. Fitch
@metasim

The issue is that you really have to become adept at debugging jobs in the Spark UI, learning to read whether they are imbalanced, blocked, repeating reads, etc. It's unfortunately a bit of a black art, and not much fun.

To be honest, 1M rows isn't that big in this territory. Unless your input tiff is huge (again, make sure it's a COG-formatted GeoTIFF), this job may be faster outside of Spark. It's in a cluster environment where you get the bigger wins.

(Huge as in Gigabytes)
It may be faster to do it with just GeoTrellis' raster package (without Spark)
Downside is you can't use the DataFrame API.
(or Python, but that doesn't seem to be an issue here)
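If you go that route, a minimal non-Spark sketch using GeoTrellis' raster API (assuming a single-band GeoTIFF that fits in memory, and points already in the raster's CRS; the coordinates are placeholders):

// Non-Spark sketch; same getDoubleValueAtPoint call the UDF above uses.
import geotrellis.raster._
import geotrellis.raster.io.geotiff.SinglebandGeoTiff
import geotrellis.vector.Point

val raster: Raster[Tile] = SinglebandGeoTiff("1.tif").raster
val value: Double = raster.getDoubleValueAtPoint(Point(77.59, 12.97))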
abhishekkrbaliase
@abhishekkrbaliase

I need help with two requirements; how can these be achieved? I have the tile in the ndvi column below.

  1. Get the number of cells in a tile having a value less than 0.
  2. Set Tile A's cell value where the corresponding cell of Tile B is less than 0.

@metasim Can you please help?

from pyrasterframes.rasterfunctions import *

# NDVI = (NIR - Red) / (NIR + Red), i.e. (B08 - B04) / (B08 + B04)
df = df.withColumn('ndvi', rf_normalized_difference(df.B08, df.B04))

df.printSchema()

root
|-- B04_path: string (nullable = false)
|-- B08_path: string (nullable = false)
|-- B04: struct (nullable = true)
| |-- tile_context: struct (nullable = true)
| | |-- extent: struct (nullable = false)
| | | |-- xmin: double (nullable = false)
| | | |-- ymin: double (nullable = false)
| | | |-- xmax: double (nullable = false)
| | | |-- ymax: double (nullable = false)
| | |-- crs: struct (nullable = false)
| | | |-- crsProj4: string (nullable = false)
| |-- tile: tile (nullable = false)
|-- B08: struct (nullable = true)
| |-- tile_context: struct (nullable = true)
| | |-- extent: struct (nullable = false)
| | | |-- xmin: double (nullable = false)
| | | |-- ymin: double (nullable = false)
| | | |-- xmax: double (nullable = false)
| | | |-- ymax: double (nullable = false)
| | |-- crs: struct (nullable = false)
| | | |-- crsProj4: string (nullable = false)
| |-- tile: tile (nullable = false)
|-- ndvi: struct (nullable = true)
| |-- tile_context: struct (nullable = true)
| | |-- extent: struct (nullable = false)
| | | |-- xmin: double (nullable = false)
| | | |-- ymin: double (nullable = false)
| | | |-- xmax: double (nullable = false)
| | | |-- ymax: double (nullable = false)
| | |-- crs: struct (nullable = false)
| | | |-- crsProj4: string (nullable = false)
| |-- tile: tile (nullable = false)


Mandal
@hificoders

@metasim Following your suggestion I have written the code below, and running it produces this error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`position`' given input columns: [y, id, proj_raster_path, spatial_index, x, proj_raster];;
'RepartitionByExpression [rf_xz2_index('position, rf_crs('position), 18) ASC NULLS FIRST], 200

13 replies
Final Code
package main.scala.sample
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.locationtech.rasterframes._
import org.locationtech.rasterframes.datasource.raster._
import org.locationtech.rasterframes.encoders.CatalystSerializer._
import geotrellis.raster._
import geotrellis.vector.Extent
import org.locationtech.jts.geom.Point
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.Row
import org.apache.spark.sql.TypedColumn


object SparkSQLExample {

  def main(args: Array[String]) {

    val spark = SparkSession.builder()
      .master("local[1]").appName("RasterFrames")
      .withKryoSerialization.getOrCreate().withRasterFrames
    // spark.sparkContext.setLogLevel("ERROR")
    //org.apache.spark.sql.SQLTypes.init(spark.sqlContext)

    val ds = spark.read.raster.withSpatialIndex(20)
    val example = "1.tif"
    val rf = ds.from(example).load()

    import spark.implicits._
    val pointDFCsv = spark.read.format("csv")
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .load("/input.csv")

    val rf_value_at_point = udf((extentEnc: Row, tile: Tile, point: Point) => {
      val extent = extentEnc.to[Extent]
      Raster(tile, extent).getDoubleValueAtPoint(point)
    })

    val points = pointDFCsv

    val result = points
      .withColumnRenamed("lon", "x")
      .withColumnRenamed("lat", "y")
      .join(rf)
      .where(st_intersects(rf_geometry($"proj_raster"), st_makePoint($"x", $"y") as "position"))
      .repartitionByRange(rf_xz2_index($"position"))
      .select(rf_value_at_point(rf_extent($"proj_raster"), rf_tile($"proj_raster"), st_makePoint($"x", $"y")) as "value")

    result.coalesce(20).write.format("csv").option("header", "false").mode("append").save("/op/")
  }
}
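The likely cause of the AnalysisException above: as "position" inside where() only aliases an expression within that predicate; it never adds a position column to the data frame, so the subsequent repartitionByRange cannot resolve it. A sketch of the fix (mirroring the corrected chain that appears later in this thread):

// Sketch: create the position column before repartitioning on it.
val result = points
  .withColumnRenamed("lon", "x")
  .withColumnRenamed("lat", "y")
  .withColumn("position", st_makePoint($"x", $"y"))
  .repartitionByRange(rf_xz2_index($"position"))
  .join(rf)
  .where(st_intersects(rf_geometry($"proj_raster"), $"position"))
  .select(rf_value_at_point(rf_extent($"proj_raster"), rf_tile($"proj_raster"), $"position") as "value")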
rokek
@rokek
Hello, does rasterframes support pyspark 3.0? I have a requirement to support both apache-sedona and pyrasterframes, but there is a dependency conflict because apache-sedona requires pyspark 3.0 and pyrasterframes requires pyspark 2.4.7.
Simeon H.K. Fitch
@metasim
@rokek Work is being done on it here: locationtech/rasterframes#554
Unfortunately, Spark 3.x made a number of breaking changes to the encoding api, so it's a non-trivial upgrade done on limited volunteer time. We could certainly use help from folks that have experience in the nuances of Spark's encoding system.
Simeon H.K. Fitch
@metasim
@abhishekkrbaliase use rf_local_less to create a mask and then apply rf_mask_by_value.
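A minimal sketch of that in pyrasterframes (column names follow the schema posted above; the mask-value conventions are worth double-checking against the docs):

# Sketch; assumes `df` with `ndvi` and `B04` tile columns as above.
from pyspark.sql.functions import col
from pyrasterframes.rasterfunctions import rf_local_less, rf_mask_by_value, rf_tile_sum

df = (df
  .withColumn('neg_mask', rf_local_less(df.ndvi, 0.0))      # 0/1 tile: 1 where ndvi < 0
  .withColumn('neg_count', rf_tile_sum(col('neg_mask')))    # 1) count of cells below 0
  # 2) set B04 cells to NoData wherever the mask cell equals 1 (ndvi < 0)
  .withColumn('B04_masked', rf_mask_by_value(df.B04, col('neg_mask'), 1)))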
abhishekkrbaliase
@abhishekkrbaliase
@metasim thanks.. had that figured out!
By the way, on my local machine I had followed the steps below:
  1. conda create -n RasterFramesSpark -c conda-forge python=3.6 GDAL=2.4.4 rasterio[s3] boto3 geopandas numpy
  2. conda activate RasterFramesSpark
  3. pip install pyrasterframes
    However, when I invoke the code below, it always gives "not available". Can I be pointed in the right direction on how to make GDAL available? Due to this I am unable to read raster data from 's3' or 'hdfs'.
    from pyrasterframes.utils import gdal_version
    print(gdal_version())
rokek
@rokek
@metasim I see, thanks! I don't have that experience (yet) but might be willing to pitch in if it would be worth your team's time to show me where to begin.
Simeon H.K. Fitch
@metasim
@rokek Wish I could give you a timeline. It's all volunteer time, so it's as we can get to it.
rokek
@rokek
@metasim I understand, thanks!
Grigory
@pomadchin
Hey, I can give you a pointer to the EMR bootstrap only: https://github.com/pomadchin/vlm-performance/blob/feature/gt-3.x/emr/bootstrap.sh
That’s only about GDAL though ^
Mandal
@hificoders
@metasim or anyone in this forum, can you help? I'm getting an error when using rf_mk_crs:
not found: value rf_mk_crs      .repartitionByRange(rf_xz2_index($"position",rf_mk_crs("epsg:4326")))
val result = points
      .withColumnRenamed("lon", "x") // use correct names here
      .withColumnRenamed("lat", "y")
      .withColumn("position", st_makePoint($"x", $"y"))
      .repartitionByRange(rf_xz2_index($"position", rf_mk_crs("epsg:4326")))
      .join(rf)
      .where(st_intersects(rf_geometry($"proj_raster"), $"position"))
      .select(rf_value_at_point(rf_extent($"proj_raster"), rf_tile($"proj_raster"), $"position") as "value")
abhishekkrbaliase
@abhishekkrbaliase

After setting up GDAL as per https://github.com/pomadchin/vlm-performance/blob/feature/gt-3.x/emr/bootstrap.sh, I am still unable to read from HDFS.

df2 = spark.read.raster("gdal://vsihdfs/hdfs://ip-15-0-113-178.ap-south-1.compute.internal:8020/user/hadoop/data/testB01.tiff")
df2.printSchema()
root
|-- proj_raster_path: string (nullable = false)
|-- proj_raster: struct (nullable = true)
| |-- tile_context: struct (nullable = true)
| | |-- extent: struct (nullable = false)
| | | |-- xmin: double (nullable = false)
| | | |-- ymin: double (nullable = false)
| | | |-- xmax: double (nullable = false)
| | | |-- ymax: double (nullable = false)
| | |-- crs: struct (nullable = false)
| | | |-- crsProj4: string (nullable = false)
| |-- tile: tile (nullable = false)

df2.select("proj_raster.tile").show(5)
[Stage 4:==============================================> (170 + 4) / 200][3 of 1000] FAILURE(3) CPLE_OpenFailed(4) "Open failed." /vsihdfs/hdfs://ip-15-0-113-178.ap-south-1.compute.internal:8020/user/hadoop/data/testB01.tiff: No such file or directory
[4 of 1000] FAILURE(3) CPLE_OpenFailed(4) "Open failed." /vsihdfs/hdfs://ip-15-0-113-178.ap-south-1.compute.internal:8020/user/hadoop/data/testB01.tiff: No such file or directory
21/08/09 16:28:53 ERROR Executor: Exception in task 186.0 in stage 4.0 (TID 403)
java.lang.IllegalArgumentException: Error fetching data for one of: GDALRasterSource(gdal://vsihdfs/hdfs://ip-15-0-113-178.ap-south-1.compute.internal:8020/user/hadoop/data/testB01.tiff)
at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:81)
at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:95)
at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:92)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:227)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:297)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:266)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 4
at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143)
at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRast

Simeon H.K. Fitch
@metasim
@abhishekkrbaliase Please paste your output in code fences. It makes it hard to read when you don't.
abhishekkrbaliase
@abhishekkrbaliase

@metasim Here you go

df2 = spark.read.raster("gdal://vsihdfs/hdfs://ip-15-0-113-178.ap-south-1.compute.internal:8020/user/hadoop/data/testB01.tiff")
df2.printSchema()

This gives

root
|-- proj_raster_path: string (nullable = false)
|-- proj_raster: struct (nullable = true)
| |-- tile_context: struct (nullable = true)
| | |-- extent: struct (nullable = false)
| | | |-- xmin: double (nullable = false)
| | | |-- ymin: double (nullable = false)
| | | |-- xmax: double (nullable = false)
| | | |-- ymax: double (nullable = false)
| | |-- crs: struct (nullable = false)
| | | |-- crsProj4: string (nullable = false)
| |-- tile: tile (nullable = false)

However, when I try to view some data in the dataframe, it fails:

df2.select("proj_raster.tile").show(5)

[Stage 4:==============================================> (170 + 4) / 200][3 of 1000] FAILURE(3) CPLE_OpenFailed(4) "Open failed." /vsihdfs/hdfs://ip-15-0-113-178.ap-south-1.compute.internal:8020/user/hadoop/data/testB01.tiff: No such file or directory
[4 of 1000] FAILURE(3) CPLE_OpenFailed(4) "Open failed." /vsihdfs/hdfs://ip-15-0-113-178.ap-south-1.compute.internal:8020/user/hadoop/data/testB01.tiff: No such file or directory
21/08/09 16:28:53 ERROR Executor: Exception in task 186.0 in stage 4.0 (TID 403)
java.lang.IllegalArgumentException: Error fetching data for one of: GDALRasterSource(gdal://vsihdfs/hdfs://ip-15-0-113-178.ap-south-1.compute.internal:8020/user/hadoop/data/testB01.tiff)
at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:81)
at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:95)
at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:92)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:227)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:297)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:266)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: geotrellis.raster.gdal.MalformedDataException: Unable to construct a RasterExtent from the Transformation given. GDAL Error Code: 4
at geotrellis.raster.gdal.GDALDataset$.rasterExtent$extension1(GDALDataset.scala:143)
at geotrellis.raster.gdal.GDALRasterSource.gridExtent$lzycompute(GDALRast

abhishekkrbaliase
@abhishekkrbaliase
Can someone help me with the syntax to read tif / jp2 files from s3 and hdfs?
Simeon H.K. Fitch
@metasim
@abhishekkrbaliase Did you see this?
https://rasterframes.io/raster-read.html#uri-formats
Your URL is incorrectly formatted.
abhishekkrbaliase
@abhishekkrbaliase

@metasim I even tried

rf = spark.read.raster('gdal://vsis3/my-bucket/data/try/outfile_new.tiff')
rf.show()

But got: Reading 'gdal://vsis3/dataplatform-prod-apps/data/try/outfile_new.tiff' not supported
Error fetching data for one of:
at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:81)
at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:95)
at org.apache.spark.sql.execution.GenerateExec$$anonfun$1$$anonfun$3.apply(GenerateExec.scala:92)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:227)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$3.apply(ShuffleExchangeExec.scala:283)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$3.apply(ShuffleExchangeExec.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:95)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException: Reading 'gdal://vsis3/dataplatform-prod-apps/data/try/outfile_new.tiff' not supported
at org.locationtech.rasterframes.ref.RFRasterSource$$anonfun$apply$1.apply(RFRasterSource.scala:123)

abhishekkrbaliase
@abhishekkrbaliase

I have tried:
gdalinfo /vsis3/dataplatform-prod-apps/data/try/B08.jp2
It gives me the appropriate details. However, when I try with rasterframes, it says:

rf1 = spark.read.raster('/vsis3/dataplatform-prod-apps/data/try/B08.jp2')
crs = rf1.select(rf_crs("proj_raster").alias("value")).first()
Caused by: java.lang.IllegalArgumentException: requirement failed: Can only read /vsis3/dataplatform-prod-apps/data/try/B08.jp2 if GDAL is available

I am running on CentOS, and GDAL is available in PySpark:

from pyrasterframes.utils import gdal_version
print(gdal_version())
GDAL 2.4.4, released 2020/01/08

@pomadchin Can you help?

Simeon H.K. Fitch
@metasim
Did you try 'gdal://vsis3//dataplatform-prod-apps/data/try/B08.jp2'?
If you look here, you'll see that when using GDAL (via the gdal:// scheme specifier), the driver name has to be terminated with two / characters.
Also make sure GDAL is installed on all your cluster nodes, and libgdal.so is in the LD_LIBRARY_PATH or java.library.path of every worker node's JVM environment.
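For example, the earlier read attempt would then become (using the format described above; see also the scheme discussion below):

rf = spark.read.raster('gdal://vsis3//dataplatform-prod-apps/data/try/B08.jp2')
rf.select(rf_crs('proj_raster').alias('crs')).first()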
Grigory
@pomadchin
@abhishekkrbaliase @metasim shouldn't that be just gdal+s3://path/to/file.jp2, or does RF have slightly different behavior?
Simeon H.K. Fitch
@metasim
I don't remember which version of GDALRasterSource we're using right now. I think we wrote the URL parsing stuff in RF before GT had settled on the "scheme1+scheme2" syntax, so there's likely some legacy stuff hanging around. Also, at the time, the extra slash stuff seemed to be imposed by something in the GDAL chain. Could certainly benefit from a rewrite.
It's been long enough that I don't remember the details, but do remember it was more painful than it should have been.