Simeon H.K. Fitch
@tosen1990 Yes, sure it works with 0.8.4, but assumes test environment where images can be loaded from the classpath. Try replacing the href function with this:
  def href(name: String) =  "https://raw.githubusercontent.com/locationtech/rasterframes/develop/core/src/test/resources/" + name
Hold up... I think I see an issue...
Simeon H.K. Fitch
@tosen1990 Here's a fixed version:
Here's what I did to test it:
docker run -p 8888:8888 s22s/rasterframes-notebook:0.8.4
open http://localhost:8888
# From jupyter in browser, open a terminal session for the following:
wget https://gist.githubusercontent.com/metasim/32795c419c60f9b9e2ace539ba44eaeb/raw/b2a71a70b6dfc65190a197a35b1904aac5a07e1e/ClassificationRasterSource.scala
unzip /usr/local/rasterframes/pyrasterframes-0.8.4-py3-none-any.whl
/usr/local/spark/bin/spark-shell --jars ./pyrasterframes/jars/pyrasterframes-assembly-0.8.4.jar 
# From scala REPL:
:load ClassificationRasterSource.scala
# After job finishes, open "classified.png"
Apologies for the messed up example before.
If you need the other example updated let me know, but I might have to wait until tomorrow to do it.
@vpipkt Thanks for your hint, Jason. It works now.

Hey @metasim, the classification program finally works fine.
I'm wondering why you deleted the cross-validation part in your previous version.
It seems like it doesn't matter much.
Also, I added the evaluation to your newer version:

  // Configure how we're going to evaluate our model's performance.
  val evaluator = new MulticlassClassificationEvaluator()

  // Push the "go" button
  val model = pipeline.fit(abt)

  // Score the original data set, including cells
  // without target values.
  val prediction_df  = model.transform(abt)
  val accuracy: Double = evaluator.evaluate(prediction_df)
  println("accuracy: " + accuracy )

The accuracy is 1.0, different from the one in the Python version in the RasterFrames documentation.
I guess this is because I don't mask the cloud pixels.

Jason T Brown
@tosen1990, @metasim and I were pair programming on it some. I think the cross validation was resulting in folds having all the label values null. This was causing an error in the fitting.
So the choice was to reduce complexity in the example. Alternative would be to filter for nulls in the dataframe, before passing to fit.
@tosen1990 As far as the evaluation, your evaluator is using the f1 metric, try accuracy for a direct comparison of the two results.
Another note, the python doc page uses a different set of data from the latest scala gist
So certainly expect different results in that case
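To make the metric difference concrete, here is a minimal, self-contained sketch. The toy label/prediction values and column names are illustrative stand-ins, not the actual `abt` data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("evaluator-metric-demo")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for model.transform(abt): 3 of 4 predictions are correct.
val predictionDf = Seq((1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0))
  .toDF("label", "prediction")

// MulticlassClassificationEvaluator defaults to the "f1" metric; set
// "accuracy" explicitly to compare directly with the Python docs.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

val accuracy = evaluator.evaluate(predictionDf) // 0.75 for this toy data

println(s"accuracy: $accuracy")
spark.stop()
```

With the default `"f1"` metric left in place, the same call would return a different number, which is why the two results above aren't directly comparable.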
@vpipkt Got it. Thanks for your clear explanation.
Jason T Brown
You're very welcome!
Would anyone add the GeoJSON rasterize part to the classification program (Scala version), like in the Python version?
It has taken me about a day, but I still haven't made any progress.
  val test: DataFrame = spark.read.geojson.load("L8-Labels-Elkton-VA.geojson")
  val label_df: DataFrame = test
    .select($"id", st_reproject(rf_geometry($"geometry"), LatLng, LatLng).alias("geometry"))
  val df_joined = abt.join(label_df, st_intersects(st_geometry($"extent"), $"geometry"))
  val df_labeled: DataFrame = df_joined.withColumn("label",
    rf_rasterize($"geometry", st_geometry($"extent"), $"id", $"dims.cols", $"dims.rows"))
1: spark.read.geojson.load throws the exception:
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:66)
Caused by: java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:262)
    at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:52)
    at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:54)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
    at org.apache.spark.rdd.ZippedWithIndexRDD.<init>(ZippedWithIndexRDD.scala:44)
    at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1304)
    at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1304)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.zipWithIndex(RDD.scala:1303)
    at org.locationtech.rasterframes.datasource.geojson.GeoJsonDataSource$GeoJsonRelation.<init>(GeoJsonDataSource.scala:77)
    at org.locationtech.rasterframes.datasource.geojson.GeoJsonDataSource.createRelation(GeoJsonDataSource.scala:55)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.air.ebds.organize.algorithm.classification.ClassificationWithGeoJson$.delayedEndpoint$org$air$ebds$organize$algorithm$classification$ClassificationWithGeoJson$1(ClassificationWithGeoJson.scala:54)
    at org.air.ebds.organize.algorithm.classification.ClassificationWithGeoJson$delayedInit$body.apply(ClassificationWithGeoJson.scala:22)
    at sc
2: val crses: Array[Row] = abt.select("crs.crsProj4").distinct().collect()
How do I get the right CRS instance from this?
Jason T Brown
As for item 2, you should use the function rf_crs($"proj_raster"); the result is a CRS object that you can then pass into st_reproject.
Don't forget to change your type annotation to val crses: Array[CRS].
You can then pass crses.head into st_reproject.
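Putting that advice together, a sketch might look like this. Assumptions: `abt` has a `proj_raster` column and `geojsonDf` is the loaded GeoJSON DataFrame from the snippet above; the function name is hypothetical:

```scala
import geotrellis.proj4.{CRS, LatLng}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.locationtech.rasterframes._

// Sketch: pull the raster's CRS out as a typed value with rf_crs, then use
// it as the destination CRS when reprojecting the GeoJSON label geometries.
def reprojectLabels(spark: SparkSession, abt: DataFrame, geojsonDf: DataFrame): DataFrame = {
  import spark.implicits._

  // Typed CRS values instead of Row("crs.crsProj4") strings.
  val crses: Array[CRS] = abt
    .select(rf_crs($"proj_raster"))
    .distinct()
    .as[CRS]
    .collect()

  // Reproject LatLng -> raster CRS. (The snippet above reprojected
  // LatLng -> LatLng, which is a no-op.)
  geojsonDf.select(
    $"id",
    st_reproject($"geometry", LatLng, crses.head).alias("geometry")
  )
}
```

The key change is that the destination CRS comes from the raster itself rather than being hard-coded, so the label geometries line up with the tile extents at join time.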
Just a guess here about number 1 (read.geojson)... there may be some Guava library conflict in the assembly JAR. Is this running on a Spark local master?
Yep. Just running it in Spark local mode.
Jason T Brown
Yeah, there may be some build dependency problems with Hadoop versions and Guava dependencies: https://stackoverflow.com/questions/36427291/illegalaccesserror-to-guavas-stopwatch-from-org-apache-hadoop-mapreduce-lib-inp
I don't really have the expertise to do the surgery on the build system to manage the Guava versions, but that would seem to be what is needed.
Simeon H.K. Fitch
I think it goes away with Hadoop 2.8.
Still haven't fixed it; I'll work on it tomorrow.
@vpipkt @metasim Luckily, I fixed this by adding case PathList("com", "google", xs @ _*) => MergeStrategy.first to assembly.sbt.
The Guava dependency version in my build project is 15.0.
Thanks for your help. I have to admit sbt is too hard to master.
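For anyone hitting the same Stopwatch IllegalAccessError, the workaround in context might look like this. This is a sketch of an assembly.sbt based on the merge strategy described above; adapt it to your sbt-assembly version:

```scala
// assembly.sbt (sbt-assembly) -- keep the first copy of any duplicate
// com.google.* classes so conflicting Guava versions don't collide.
assemblyMergeStrategy in assembly := {
  case PathList("com", "google", xs @ _*) => MergeStrategy.first
  case other =>
    // Fall back to the default strategy for everything else.
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(other)
}
```

Note this only resolves the packaging conflict by picking one Guava copy; whether the chosen version is actually compatible with your Hadoop version still depends on your dependency tree.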
Simeon H.K. Fitch
@tosen1990 Would you be willing to file a ticket with that info, and we can get it into an upcoming release?
@metasim I'd like to do that tomorrow. Time to hit the sack.
Simeon H.K. Fitch
Thanks for working through this all
Simeon H.K. Fitch
:balloon: :balloon: RasterFrames 0.8.5 is Released! :balloon: :balloon:
Several nice goodies in this one: https://rasterframes.io/release-notes.html#0-8-5
Artifacts released to Maven Central, PyPI, & Docker Hub.
Jason T Brown
Spatial indexing in the raster reader is a big one. In the Python API, look at help(spark.read.raster) for more.
Michał Gołębiewski

Hi guys, I'm trying to update RasterFrames to the new version via the Docker image, but I get some errors:

Exception                                 Traceback (most recent call last)
<ipython-input-1-d7e73cfb9fde> in <module>
---> 30 spark = create_rf_spark_session()

/opt/conda/lib/python3.7/site-packages/pyrasterframes/utils.py in create_rf_spark_session(master, **kwargs)
     93              .config('spark.jars', jar_path)
     94              .withKryoSerialization()
---> 95              .config(conf=conf)  # user can override the defaults
     96              .getOrCreate())

/usr/local/spark/python/pyspark/sql/session.py in getOrCreate(self)
    171                     for key, value in self._options.items():
    172                         sparkConf.set(key, value)
--> 173                     sc = SparkContext.getOrCreate(sparkConf)
    174                     # This SparkContext may be an existing one.
    175                     for key, value in self._options.items():

/usr/local/spark/python/pyspark/context.py in getOrCreate(cls, conf)
    365         with SparkContext._lock:
    366             if SparkContext._active_spark_context is None:
--> 367                 SparkContext(conf=conf or SparkConf())
    368             return SparkContext._active_spark_context

/usr/local/spark/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    131                     " note this option will be removed in Spark 3.0")
--> 133         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    134         try:
    135             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/usr/local/spark/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    314         with SparkContext._lock:
    315             if not SparkContext._gateway:
--> 316                 SparkContext._gateway = gateway or launch_gateway(conf)
    317                 SparkContext._jvm = SparkContext._gateway.jvm

/usr/local/spark/python/pyspark/java_gateway.py in launch_gateway(conf)
     44     :return: a JVM gateway
     45     """
---> 46     return _launch_gateway(conf)

/usr/local/spark/python/pyspark/java_gateway.py in _launch_gateway(conf, insecure)
    107             if not os.path.isfile(conn_info_file):
--> 108                 raise Exception("Java gateway process exited before sending its port number")
    110             with open(conn_info_file, "rb") as info:

Exception: Java gateway process exited before sending its port number

This error seems to happen when I run:

from IPython.display import display
import pyrasterframes.rf_ipython
from shapely.geometry import MultiPolygon
import pandas as pd
import geopandas, os, re, json, numpy, glob
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *
from pyspark.sql.functions import *
from zipfile import ZipFile
from pyspark import SparkFiles
from pyspark.sql.functions import col
from pyspark.sql import functions as F
from pyspark.sql.types import DateType, TimestampType
from functools import reduce
from pyspark.sql import DataFrame
from osgeo import osr, gdal
import xml.etree.ElementTree as et
from xml.dom import minidom

spark = create_rf_spark_session()
Jason T Brown
and this is with version 0.8.5? To be clear, are you using the RF notebook image?
Jason T Brown
@mjgolebiewski I tried this again, creating the Spark session with a much smaller set of imports, and it worked successfully.
I will have a look and see if I can figure out where the conflict is.
Michał Gołębiewski
@vpipkt hi, i am using latest image:
s22s/rasterframes-notebook latest cbc6ce228c8e 2 days ago 5.23GB
Jason T Brown
OK, that is the same as 0.8.5.
So when I comment out the line import geopandas, os, re, json, numpy, glob, it is able to create the Spark session.
Unsure why, but investigating further.
You may try creating the Spark session first, then doing that import, and see if it works.
@mjgolebiewski it seems to be the geopandas import, but I don't know what the root cause is. I'll have a look.
Michał Gołębiewski
@vpipkt thank you so much, its working now :smile_cat:
Jason T Brown
that is a real stumper... see issue #452
Simeon H.K. Fitch
:balloon: :balloon: RasterFrames 0.9.0-RC2 is Released! :balloon: :balloon:
This update synchronizes the 0.9.0 release with the 0.8.5 features and upgrades GeoTrellis to 3.2.0.
@mjgolebiewski I'd suggest using s22s/rasterframes-notebook:0.8.5 instead of latest, as it's not a stable reference. (It's now pointing to 0.9.0-RC2).
Jason T Brown
@mjgolebiewski the fix is in for issue #452 ... you should be able to work around this by creating the Spark session before importing geopandas or rtree. The version of rtree in the container has some bad code for resolving library locations that incorrectly changes the PATH environment variable, which breaks pyspark.