@metasim got it, thanks :)
Hello, I'm attempting to use pyrasterframes from the pyspark shell. I set the py-files config to the zip file I downloaded. I've tried zip files for both 2.11 (which fails because I'm using Spark 3) and 2.12, which fails to find the GeoTrellis dependency. Digging a little deeper, it looks as if GeoTrellis does not run on Spark 3. If that's the case, does rasterframes also not run on Spark 3?
@kembles5 GeoTrellis does run on Spark 3. That's not to say that I know what problem you're running into, but GT 3.6.0 and up are Spark 3 compatible.
You may want to check which version is included in the zip file you want to use.
I've been running rasterframes 0.10.1 (in Scala, though) in spark 3.1.2 without a problem.
@jpolchlo good to hear that it runs on spark 3. I'm getting the following module not found: org.locationtech.geotrellis#geotrellis-spark_2.12;3.6.1-SNAPSHOT
which may be related to an issue on the geotrellis side
glancing at the GeoTrellis gitter, there's mention today of a build/dependency issue
which i do not get if I use the 2.12-0.9.1 zip file
sorry, meant 2.11-0.9.1
@kembles5 OK, that makes sense. Bintray shut down, so our snapshots are being hosted elsewhere. I think https://repo.eclipse.org/content/repositories/geotrellis-snapshots/org/locationtech/geotrellis/ ? I'm not 100% on how to use this information to solve your problem. Hopefully one of the rasterframes engineers can speak to that.
In the past I've used the --repositories flag to spark-submit to download packages from nonstandard locations. Perhaps there's a way to do that with pyspark?
@jpolchlo thank you. I'll see if I can add a --repositories flag. FWIW, here's the command I'm running:
pyspark --archives hub_env.tar.gz --py-files pyrasterframes_2.12-0.10.0-python.zip --packages org.locationtech.rasterframes:rasterframes_2.12:0.10.0,org.locationtech.rasterframes:pyrasterframes_2.12:0.10.0,org.locationtech.rasterframes:rasterframes-datasource_2.12:0.10.0 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrator=org.locationtech.rasterframes.util.RFKryoRegistrator --conf spark.kryoserializer.buffer.max=500m

hey @kembles5 I am not sure why you need a snapshot; the most recent GT release is 3.6.2 and the most recent RF release is 0.10.1, which depends on GT 3.6.1 transitively;

RF versions prior to 0.10.0 are not compatible with Spark 3.x

if you still need GT snapshots, then they are available on the Maven nexus, i.e.: https://oss.sonatype.org/content/repositories/snapshots/org/locationtech/geotrellis/geotrellis-spark_2.12/

Check out the GT README badges: https://github.com/locationtech/geotrellis#geotrellis


Thanks @pomadchin. I tried rasterframes 0.10.1 and the GT errors were resolved. Now there's only one module not found.

com.github.everit-org.json-schema#org.everit.json.schema;1.12.2: not found

I'm simply going through the getting started guide for rasterframes (https://rasterframes.io/getting-started.html) and trying to follow the "using pyspark shell" section, and I get the above error. The getting started guide doesn't appear to work with 0.10.0 or 0.10.1.

@kembles5 yea, unfortunately we use a schema validator that is published here: https://jitpack.io
^ try to add https://jitpack.io as a resolver => it should work then
@pomadchin thanks for the response. Forgive me for the noob questions :). I'm a Python developer and have very little understanding of what's happening behind the scenes with the RF zip file. What would I need to add to the command from the "using pyspark shell" section to add a resolver?
@kembles5 I think you need to add --repositories https://jitpack.io. From the command line, run pyspark --help for more details.
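Putting the suggestions above together, the earlier pyspark invocation with JitPack added as an extra resolver might look like the sketch below. The 0.10.1 artifact coordinates and the zip filename are assumptions based on the version advice in this thread, not verified:

```shell
pyspark \
  --repositories https://jitpack.io \
  --py-files pyrasterframes_2.12-0.10.1-python.zip \
  --packages org.locationtech.rasterframes:rasterframes_2.12:0.10.1,org.locationtech.rasterframes:pyrasterframes_2.12:0.10.1,org.locationtech.rasterframes:rasterframes-datasource_2.12:0.10.1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.locationtech.rasterframes.util.RFKryoRegistrator \
  --conf spark.kryoserializer.buffer.max=500m
```

--repositories takes a comma-separated list, so additional resolvers (e.g. a GeoTrellis snapshot repo) can be appended to the same flag.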

Hi @jpolchlo, thank you. I'm able to get to the pyspark shell now, but get the following error when I run spark = spark.withRasterFrames(). From what I've read this looks like a Scala version mismatch, but I verified I'm using Spark 3.1, which uses Scala 2.12.

: java.lang.NoSuchMethodError: shapeless.DefaultSymbolicLabelling$.instance(Lshapeless/HList;)Lshapeless/DefaultSymbolicLabelling;
    at org.locationtech.rasterframes.encoders.StandardEncoders.spatialKeyEncoder(StandardEncoders.scala:68)
    at org.locationtech.rasterframes.encoders.StandardEncoders.spatialKeyEncoder$(StandardEncoders.scala:68)

@kembles5 Yes, that looks like some kind of version mismatch, but the origin of those can be hard to track down. Are you setting up a very vanilla, minimal environment here, or are there other dependencies getting thrown in here? I haven't worked too much with pyrasterframes, so I'm possibly not extremely useful here. I wonder if @metasim has any advice?
Thanks @jpolchlo. Looks like the latest version of Spark is using shapeless 2.3.7 (per pom.xml from the master branch), which differs from the version that 0.10.1 depends on. Is there another version of the rasterframes zip file (maybe a shaded version) that I should be using with Spark 3? I'm using pyrasterframes_2.12-0.10.0-python.zip
@kembles5 @jpolchlo one option is to use an assembly jar with shaded deps; another is to upgrade shapeless on the Spark classpath
I think we need to move to Spark 3.2.x to get rid of this error and be fully compatible with everything
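On the assembly-jar option: a later message in this log shows that the pip-installed pyrasterframes package ships an assembly jar under site-packages/pyrasterframes/jars/. A hedged sketch of pointing pyspark at it directly, where the jar filename and path layout are assumptions based on that message:

```shell
# Locate the jars directory bundled inside the pip-installed package.
RF_JARS=$(python -c "import os, pyrasterframes; print(os.path.join(os.path.dirname(pyrasterframes.__file__), 'jars'))")

# Hand the shaded assembly to pyspark so its bundled shapeless wins over
# whatever is on the Spark classpath.
pyspark \
  --jars "$RF_JARS"/pyrasterframes-assembly-0.10.1.jar \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.locationtech.rasterframes.util.RFKryoRegistrator
```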
@pomadchin I can't upgrade shapeless in the Spark classpath and I can't move to Spark 3.2. That leaves using the assembly jar with shaded deps. Where can I get this jar? I've searched high and low. I also tried to build the project but ran into the following error (macOS, OpenJDK 18):
java.lang.ClassCastException: class java.lang.UnsupportedOperationException cannot be cast to class xsbti.FullReload (java.lang.UnsupportedOperationException is in module java.base of loader 'bootstrap'; xsbti.FullReload is in unnamed module of loader 'app')
sbt script version: 1.6.2
@kembles5 what OS are you on?
@pomadchin macOS Monterey 12.2.1
That's my personal laptop, but at work and where I'm using spark, I'm on rhel7
@pomadchin downgraded java to version 11 and sbt worked. Any instructions on how to build the assembly with shaded dependencies would be appreciated. Thanks for all your help
I think the pip-installed RF is already shaded
I would recommend you follow the RF quick start instructions
a good question is how to deploy it on a cluster with no k8s; that's a question for @metasim, but I'm pretty sure it is doable via conda
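For reference, the RF quick start mentioned above boils down to something like the following sketch. It assumes pip install pyrasterframes has been run in the active Python environment and that a local Spark is available; create_rf_spark_session is the helper the pyrasterframes docs use:

```python
# Minimal pyrasterframes session per the quick start: the helper configures
# Kryo serialization and attaches the bundled (shaded) assembly jar, which
# sidesteps classpath conflicts like the shapeless mismatch above.
import pyrasterframes
from pyrasterframes.utils import create_rf_spark_session

spark = create_rf_spark_session()
print(spark.version)
```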
@pomadchin Thanks for the info. The getting started guide is where I'm having issues, specifically the "Using pyspark shell" section. I tried following the guide using 0.10.0 and 0.10.1 on Spark 3.1. Both initially failed with "modules not found", and adding --repositories https://jitpack.io had no effect. 0.10.1 eventually started working despite no changes, but 0.10.0 continues to have issues finding the GeoTrellis module. For 0.10.1 I then ran into the shapeless conflict.
hello, when I use pyrasterframes to write a GeoTIFF I set raster_dimensions=(5000, 5000), and it fails with an OutOfMemoryError. The raster data I'm trying to write is not that large, just 16 MB, and my Spark worker memory is 2 GB; why isn't that sufficient?
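One likely factor: the 16 MB figure is the compressed on-disk size, while writing a GeoTIFF at raster_dimensions=(5000, 5000) materializes the full uncompressed mosaic in memory. A back-of-envelope sketch, where the cell type and band count are assumptions rather than details from the question:

```python
# Estimate the uncompressed size of a 5000 x 5000 raster; the cell type
# (float32) and band count are assumed, not taken from the original question.
cols, rows = 5000, 5000
bytes_per_cell = 4   # float32
bands = 3            # hypothetical band count

uncompressed_bytes = cols * rows * bytes_per_cell * bands
print(uncompressed_bytes)              # 300000000
print(uncompressed_bytes / 1024 ** 2)  # ~286 MiB before JVM overhead
```

A few hundred MiB of raw cells can exhaust a 2 GB worker once JVM object overhead, Kryo buffers, and shuffle copies are layered on top, so the compressed file size is a poor predictor of the memory needed to write it.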
Just checking if anyone has come up with a good bootstrap script/configuration/example to follow for running pyrasterframes on AWS EMR? I've still never been able to do it successfully.
Hey @jterry64 I think it can be some version of https://github.com/pomadchin/vlm-performance/blob/feature/gt-3.x/emr/bootstrap.sh but with an extra RasterFrames deps
Woo, I did finally get it to work! This ended up working as a bootstrap script:
set -ex

# Install Conda
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
sudo sh Miniconda3-latest-Linux-x86_64.sh -b -p /usr/local/miniconda

source ~/.bashrc
export PATH=/usr/local/miniconda/bin:$PATH

# Install GDAL
sudo /usr/local/miniconda/bin/conda config --add channels conda-forge
sudo /usr/local/miniconda/bin/conda install -c conda-forge libnetcdf gdal=3.5.0 -y
sudo /usr/local/miniconda/bin/pip install pyrasterframes geopandas boto3 s3fs

echo "export PATH=/usr/local/miniconda/bin:$PATH" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=/usr/local/miniconda/lib/:/usr/local/lib:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib" >> ~/.bashrc
echo "export PROJ_LIB=/usr/local/miniconda/share/proj" >> ~/.bashrc
echo "export PYSPARK_PYTHON=/usr/local/miniconda/bin/python" >> ~/.bashrc
echo "export PYSPARK_DRIVER_PYTHON=/usr/local/miniconda/bin/python" >> ~/.bashrc

But now to hammer you with more questions... I've been using the polygonal summary method in geotrellis over blocks from rasters, and it seems like geotrellis has an optimization where it just rasterizes the parts of the polygon that are within the extent of the raster. Is there a good way to replicate this behavior with rasterframes?

I.e. if I do a big join of a bunch of polygons and raster blocks, and now I want to rasterize the polygons to use as a mask, how do I rasterize so that the zone raster is aligned with just that block? In the example in the documentation, it just rasterizes using the dimensions of the raster, but I'm unclear how this actually aligns correctly with raster block: https://rasterframes.io/zonal-algebra.html
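If I read the zonal-algebra docs correctly, the alignment comes from rf_geometry(tile): it returns the tile's extent as a geometry in the raster CRS, so rasterizing the (already reprojected) polygon against that extent at the tile's own dimensions yields a zone raster covering exactly that block. A sketch, where df and the column names ('geom', 'zone_id', 'tile') are assumptions about the joined DataFrame:

```python
# Sketch: build a per-block zone mask after joining polygons to raster blocks.
# `df`, 'geom', 'zone_id', and 'tile' are hypothetical names.
from pyrasterframes.rasterfunctions import rf_rasterize, rf_geometry, rf_dimensions

dims = rf_dimensions("tile")  # struct column with cols/rows fields

with_zones = df.withColumn(
    "zone_tile",
    # Rasterize the polygon only over this tile's extent, at this tile's
    # cols x rows, so the zone raster is pixel-aligned with the block.
    rf_rasterize("geom", rf_geometry("tile"), "zone_id",
                 dims["cols"], dims["rows"]),
)
```

Because rasterization is bounded by each tile's extent, this effectively replicates the GeoTrellis optimization of only rasterizing the portion of the polygon inside the raster block.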

Could someone please help me solve a problem with spark-submit?
/export/server/spark/bin/spark-submit \
--master yarn \
--num-executors 6 \
--jars /export/server/anaconda3/envs/pyspark/lib/python3.8/site-packages/pyrasterframes/jars/pyrasterframes-assembly-0.10.1.jar \
Using spark-submit causes the following error:
Traceback (most recent call last):
File "/tmp/pycharm_project_255/raster_slope.py", line 9, in <module>
spark = (SparkSession.builder
File "/export/server/anaconda3/envs/pyspark/lib/python3.8/site-packages/pyrasterframes/__init__.py", line 44, in _rf_init
spark_session.rasterframes = RFContext(spark_session)
File "/export/server/anaconda3/envs/pyspark/lib/python3.8/site-packages/pyrasterframes/rf_context.py", line 45, in __init__
self._jrfctx = self._jvm.org.locationtech.rasterframes.py.PyRFContext(jsess)
File "/export/server/spark/python/lib/py4j-", line 1573, in __call__
File "/export/server/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/export/server/spark/python/lib/py4j-", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.locationtech.rasterframes.py.PyRFContext.
: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.objects.Invoke$.apply$default$5()Z
at frameless.RecordEncoder.$anonfun$toCatalyst$2(RecordEncoder.scala:154)
at scala.collection.immutable.List.map(List.scala:293)
at frameless.RecordEncoder.toCatalyst(RecordEncoder.scala:153)
at frameless.TypedExpressionEncoder$.apply(TypedExpressionEncoder.scala:28)
at org.locationtech.rasterframes.encoders.TypedEncoders.typedExpressionEncoder(TypedEncoders.scala:22)
at org.locationtech.rasterframes.encoders.TypedEncoders.typedExpressionEncoder$(TypedEncoders.scala:22)
at org.locationtech.rasterframes.package$.typedExpressionEncoder(package.scala:39)
at org.locationtech.rasterframes.encoders.StandardEncoders.spatialKeyEncoder(StandardEncoders.scala:68)
at org.locationtech.rasterframes.encoders.StandardEncoders.spatialKeyEncoder$(StandardEncoders.scala:68)
at org.locationtech.rasterframes.package$.spatialKeyEncoder$lzycompute(package.scala:39)
at org.locationtech.rasterframes.package$.spatialKeyEncoder(package.scala:39)
at org.locationtech.rasterframes.StandardColumns.$init$(StandardColumns.scala:42)
at org.locationtech.rasterframes.package$.<init>(package.scala:39)
at org.locationtech.rasterframes.package$.<clinit>(package.scala)
at org.locationtech.rasterframes.py.PyRFContext.<init>(PyRFContext.scala:49)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:748)
Running the project directly doesn't cause any error. But I want to set num-executors.
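On setting the executor count without spark-submit flags: the --num-executors CLI flag corresponds to the spark.executor.instances conf key, so it can also be set when building the session in code (a sketch; applies to YARN mode with dynamic allocation disabled, and does not address the NoSuchMethodError above):

```python
# Equivalent of `--num-executors 6` expressed as a conf key when the
# session is built in code rather than via spark-submit flags.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.executor.instances", "6")
    .getOrCreate()
)
```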
Eugene Cheipesh

Trying to debug a GDAL reading problem that results in the following exception (on EMR with GDAL 3.1.2 installed)

Caused by: java.lang.UnsupportedOperationException: Reading 'gdal://vsis3/bucket/some.tif not supported
  at org.locationtech.rasterframes.ref.RFRasterSource$.$anonfun$apply$1(RFRasterSource.scala:119)
  at scala.compat.java8.functionConverterImpls.AsJavaFunction.apply(FunctionConverters.scala:262)
  at com.github.benmanes.caffeine.cache.LocalCache.lambda$statsAware$0(LocalCache.java:139)


In spark-shell on master with the job jar I’m able to reproduce but not explain:

Raster Reads:

val url = "gdal://vsis3/bucket/some.tif"

scala> val rs = RFRasterSource(new java.net.URI(url))
rs: org.locationtech.rasterframes.ref.RFRasterSource = GDALRasterSource(gdal://vsis3/...)

scala> rs.read(GridBounds(0,0,10,10), List(0))
res1: geotrellis.raster.Raster[geotrellis.raster.MultibandTile] = Raster(ArrayMultibandTile(11,11,1,float32ud-3.4028234663852886E38),Extent(4031670.65908466, 3215321.1233700267, 4031725.65908466, 3215376.1233700267))

scala> RFRasterSource.IsGDAL.unapply(new java.net.URI(url))
res2: Boolean = true

scala> spark.read.raster.from(url).load().show()
22/06/09 19:26:23 WARN TaskSetManager: Lost task 986.0 in stage 1.0 (TID 988) (ip-172-31-19-30.eu-west-1.compute.internal executor 7): java.lang.IllegalArgumentException: Error fetching data for one of:
    at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:83)
    at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$3(GenerateExec.scala:95)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:222)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:275)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$10(ShuffleExchangeExec.scala:400)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.UnsupportedOperationException: Reading 'gdal://vsis3/...