set -ex

# Install Conda
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
sudo sh Miniconda3-latest-Linux-x86_64.sh -b -p /usr/local/miniconda
source ~/.bashrc
export PATH=/usr/local/miniconda/bin:$PATH

# Install GDAL and the Python dependencies from conda-forge
sudo /usr/local/miniconda/bin/conda config --add channels conda-forge
sudo /usr/local/miniconda/bin/conda install -c conda-forge libnetcdf gdal=3.5.0 -y
sudo /usr/local/miniconda/bin/pip install pyrasterframes geopandas boto3 s3fs

# Persist the environment for later sessions: PATH, native library search path,
# PROJ data files, and the Python interpreter PySpark should use
echo "export PATH=/usr/local/miniconda/bin:$PATH" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=/usr/local/miniconda/lib/:/usr/local/lib:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib" >> ~/.bashrc
echo "export PROJ_LIB=/usr/local/miniconda/share/proj" >> ~/.bashrc
echo "export PYSPARK_PYTHON=/usr/local/miniconda/bin/python" >> ~/.bashrc
echo "export PYSPARK_DRIVER_PYTHON=/usr/local/miniconda/bin/python" >> ~/.bashrc
But now to hammer you with more questions... I've been using the polygonal summary method in geotrellis over blocks from rasters, and it seems like geotrellis has an optimization where it rasterizes only the parts of the polygon that fall within the extent of the raster. Is there a good way to replicate this behavior with rasterframes?
I.e. if I do a big join of a bunch of polygons and raster blocks, and now I want to rasterize the polygons to use as a mask, how do I rasterize so that the zone raster is aligned with just that block? The example in the documentation rasterizes using the dimensions of the raster, but I'm unclear how this actually aligns correctly with the raster block: https://rasterframes.io/zonal-algebra.html
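For reference, here's the shape of what I'm after, following the pattern from that page (a minimal sketch in pyrasterframes; DataFrame and column names are illustrative, and I'm assuming the polygons are already in the tiles' CRS):

from pyspark.sql.functions import lit
from pyrasterframes.rasterfunctions import (
    rf_rasterize, rf_extent, rf_dimensions, rf_mask, st_geometry, st_intersects
)

# 'polygons' has a 'geometry' column and 'tiles' has a 'proj_raster' column;
# both names are illustrative.
joined = polygons.join(
    tiles, st_intersects("geometry", st_geometry(rf_extent("proj_raster"))))

dims = rf_dimensions("proj_raster")
masked = (
    joined
    # Burn the polygon into a tile with the same extent and grid dimensions
    # as the raster block, so the zone raster is pixel-aligned with it
    .withColumn("zone", rf_rasterize(
        "geometry", st_geometry(rf_extent("proj_raster")), lit(1),
        dims.cols, dims.rows))
    # Keep raster cells only where the zone tile has data
    .withColumn("masked", rf_mask("proj_raster", "zone"))
)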
Trying to debug a GDAL reading problem that results in the following exception (on EMR, with GDAL 3.1.2 installed):
Caused by: java.lang.UnsupportedOperationException: Reading 'gdal://vsis3/bucket/some.tif not supported
at org.locationtech.rasterframes.ref.RFRasterSource$.$anonfun$apply$1(RFRasterSource.scala:119)
at scala.compat.java8.functionConverterImpls.AsJavaFunction.apply(FunctionConverters.scala:262)
at com.github.benmanes.caffeine.cache.LocalCache.lambda$statsAware$0(LocalCache.java:139)
In spark-shell on the master node with the job jar, I'm able to reproduce it but not explain it.
A direct raster read works:
val url = "gdal://vsis3/bucket/some.tif”
scala> val rs = RFRasterSource(new java.net.URI(url))
rs: org.locationtech.rasterframes.ref.RFRasterSource = GDALRasterSource(gdal://vsis3/...)
scala> rs.read(GridBounds(0,0,10,10), List(0))
res1: geotrellis.raster.Raster[geotrellis.raster.MultibandTile] = Raster(ArrayMultibandTile(11,11,1,float32ud-3.4028234663852886E38),Extent(4031670.65908466, 3215321.1233700267, 4031725.65908466, 3215376.1233700267))
scala> RFRasterSource.IsGDAL.unapply(new java.net.URI(url))
res2: Boolean = true
Yet the DataFrame-based read fails:
scala> spark.read.raster.from(url).load().show()
22/06/09 19:26:23 WARN TaskSetManager: Lost task 986.0 in stage 1.0 (TID 988) (ip-172-31-19-30.eu-west-1.compute.internal executor 7): java.lang.IllegalArgumentException: Error fetching data for one of:
at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:83)
at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$3(GenerateExec.scala:95)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:222)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:275)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.$anonfun$prepareShuffleDependency$10(ShuffleExchangeExec.scala:400)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.UnsupportedOperationException: Reading 'gdal://vsis3/...
@echeipesh That surprises me too. I've not done any RasterFrames work for some time, so I haven't been tracking it. Debugging performance issues in Spark is just so damn difficult. Is the Java version different? If I had $$ to work on it, I'd put in dropwizard/prometheus metrics and maybe even tracing on tile ops, in hopes the instrumentation would help.
I've been wondering if we should do a new release with just this flag flipped:
https://github.com/locationtech/rasterframes/blob/6bd49e093cd80dd78999d2898e66e76f47d07463/core/src/main/resources/reference.conf#L3
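(Side note for anyone who wants to try the flipped behavior without waiting on a release: that file is read through Typesafe Config, so the setting should be overridable with a JVM system property. A minimal sketch, assuming the key at that line is rasterframes.prefer-gdal and that create_rf_spark_session passes keyword args through as Spark config; verify both against the linked source. Driver-side JVM options generally must be set before the driver JVM starts, e.g. in spark-defaults.conf.)

from pyrasterframes.utils import create_rf_spark_session

# 'rasterframes.prefer-gdal' is an assumed key; check reference.conf for
# the actual name. This shows the executor-side override only.
spark = create_rf_spark_session(**{
    "spark.executor.extraJavaOptions": "-Drasterframes.prefer-gdal=false",
})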
@pomadchin Hi, how do I convert a RasterFrames DataFrame to a geotrellis RDD[(K, V)] with Metadata[M]?
val df = spark.read.raster
.withLazyTiles(true)
.withSpatialIndex(2)
.withTileDimensions(512, 512)
.load(filePath)
df.printSchema()
df.asLayer.show(false)
I get this error:
Caused by: java.lang.IllegalArgumentException: requirement failed: A RasterFrameLayer requires a column identified as a spatial key
at scala.Predef$.require(Predef.scala:281)
at org.locationtech.rasterframes.extensions.DataFrameMethods.asLayer(DataFrameMethods.scala:234)
at org.locationtech.rasterframes.extensions.DataFrameMethods.asLayer$(DataFrameMethods.scala:229)
at org.locationtech.rasterframes.extensions.Implicits$WithDataFrameMethods.asLayer(Implicits.scala:59)
def apply(left: DataFrame, right: DataFrame, leftExtent: Column, leftCRS: Column,
          rightExtent: Column, rightCRS: Column, resampleMethod: GTResampleMethod,
          fallbackDimensions: Option[Dimensions[Int]]): DataFrame = {
  val leftGeom = st_geometry(leftExtent)
  val rightGeomReproj = st_reproject(st_geometry(rightExtent), rightCRS, leftCRS)
  val joinExpr = new Column(SpatialRelation.Intersects(leftGeom.expr, rightGeomReproj.expr))
  apply(left, right, joinExpr, leftExtent, leftCRS, rightExtent, rightCRS, resampleMethod, fallbackDimensions)
}
.......
  .withColumn(id, monotonically_increasing_id())
  // 2. Perform the left-outer join
  .join(right, joinExprs, joinType = "left")
  // 3. Group by the unique ID, reestablishing the LHS count
  .groupBy(col(id))
......
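(From the user side, that machinery is reachable as a one-liner; a minimal sketch in pyrasterframes, with paths and DataFrame names illustrative:)

from pyrasterframes.utils import create_rf_spark_session

spark = create_rf_spark_session()

# Placeholder paths; any two overlapping rasters will do.
left = spark.read.raster("s3://bucket/scene_a.tif")
right = spark.read.raster("s3://bucket/scene_b.tif")

# raster_join intersects extents (reprojecting the right-hand geometry into
# the left-hand CRS, as in the apply() above) and resamples right-hand tiles
# onto the left-hand tile grid.
joined = left.raster_join(right)
joined.printSchema()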
Hello, I'm new to rasterframes and encountering the error below while running a zonal statistics prototype using pyrasterframes in an AWS EMR Serverless environment. I'm a bit lost on the Java error and hoping someone can point me in the right direction.
The environment:
EMR v6.6 (the only version available in EMR Serverless)
Spark v3.2.0
Pyrasterframes v0.10.1
And here is the error:
Traceback (most recent call last):
File "/tmp/spark-515422fd-4ed2-4ca6-9f2d-160cd422df73/zonal_statistics.py", line 25, in <module>
spark = create_rf_spark_session()
File "/home/hadoop/rasterframes/lib/python3.8/site-packages/pyrasterframes/utils.py", line 88, in create_rf_spark_session
spark.withRasterFrames()
File "/home/hadoop/rasterframes/lib/python3.8/site-packages/pyrasterframes/__init__.py", line 44, in _rf_init
spark_session.rasterframes = RFContext(spark_session)
File "/home/hadoop/rasterframes/lib/python3.8/site-packages/pyrasterframes/rf_context.py", line 45, in __init__
self._jrfctx = self._jvm.org.locationtech.rasterframes.py.PyRFContext(jsess)
File "/usr/lib/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1573, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.locationtech.rasterframes.py.PyRFContext.
: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.objects.Invoke$.apply$default$5()Z
at frameless.RecordEncoder.$anonfun$toCatalyst$2(RecordEncoder.scala:154)
at scala.collection.immutable.List.map(List.scala:293)
at frameless.RecordEncoder.toCatalyst(RecordEncoder.scala:153)
at frameless.TypedExpressionEncoder$.apply(TypedExpressionEncoder.scala:28)
at org.locationtech.rasterframes.encoders.TypedEncoders.typedExpressionEncoder(TypedEncoders.scala:22)
at org.locationtech.rasterframes.encoders.TypedEncoders.typedExpressionEncoder$(TypedEncoders.scala:22)
at org.locationtech.rasterframes.package$.typedExpressionEncoder(package.scala:39)
at org.locationtech.rasterframes.encoders.StandardEncoders.spatialKeyEncoder(StandardEncoders.scala:68)
at org.locationtech.rasterframes.encoders.StandardEncoders.spatialKeyEncoder$(StandardEncoders.scala:68)
at org.locationtech.rasterframes.package$.spatialKeyEncoder$lzycompute(package.scala:39)
at org.locationtech.rasterframes.package$.spatialKeyEncoder(package.scala:39)
at org.locationtech.rasterframes.StandardColumns.$init$(StandardColumns.scala:42)
at org.locationtech.rasterframes.package$.<init>(package.scala:39)
at org.locationtech.rasterframes.package$.<clinit>(package.scala)
at org.locationtech.rasterframes.py.PyRFContext.<init>(PyRFContext.scala:49)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)