input to function coalesce should all be the same type, but it's....
spark.read.raster and using a join operation, which should be much faster than the…
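A minimal sketch of that pattern (hypothetical paths; the CRS/extent join conditions mirror the snippet further down this thread):

```python
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import rf_crs, rf_extent

spark = create_rf_spark_session()

# Read two bands as separate raster DataFrames (paths are hypothetical).
b02 = spark.read.raster('s3://bucket/scene/B02_10m.tif').withColumnRenamed('proj_raster', 'b02')
b03 = spark.read.raster('s3://bucket/scene/B03_10m.tif').withColumnRenamed('proj_raster', 'b03')

# Align the bands row-wise by joining on matching CRS and extent.
df = b02.join(b03, [rf_crs(b02.b02) == rf_crs(b03.b03),
                    rf_extent(b02.b02) == rf_extent(b03.b03)])
```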
@vpipkt okay, so:
['b02_10m_path', 'b03_10m_path', 'b04_10m_path', 'b08_10m_path', 'b02_10m', 'b03_10m', 'b04_10m', 'b08_10m', 'spatial_index', 'b02', 'b03', 'b04', 'b08']
rf_joined = raster_10m.raster_join(raster_20m)
rf_joined.printSchema()
schema is here (too long for gitter chat): https://hastebin.com/oyafoyuvep.m
loadAndSetNodata - it fixed it, but it doesn't seem to change anything
tile_dimensions of bands 10m/20m/60m: sometimes it works, other times I get this error:
Py4JJavaError: An error occurred while calling o105._dfToHTML. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 205 in stage 146.0 failed 4 times, most recent failure: Lost task 205.3 in stage 146.0 (TID 9369, dp-09, executor 7): geotrellis.raster.GeoAttrsError: Cannot combine rasters with different dimensions.150x150 does not match 300x300
cond_1020 = [rf_crs(raster_10m.b02) == rf_crs(raster_20m.b11),
             rf_extent(raster_10m.b02) == rf_extent(raster_20m.b11)]
cond_60 = [rf_crs(raster_10m.b02) == rf_crs(raster_60m.b01),
           rf_extent(raster_10m.b02) == rf_extent(raster_60m.b01)]
rf_1020 = raster_10m.join(raster_20m, cond_1020)
rf_joined = rf_1020.join(broadcast(raster_60m), cond_60)
rf_resample the 20m and 60m bands, as discussed in issue 450. Basically, within a single row all tiles should have the same dimensions
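A minimal sketch of that fix, assuming rf_resample takes an upsampling factor (the b11 column name comes from the join snippet above):

```python
from pyspark.sql.functions import col
from pyrasterframes.rasterfunctions import rf_resample

# Upsample the 20m tiles by 2x (150x150 -> 300x300) so every tile in a
# joined row has the same dimensions as the 10m tiles.
raster_20m = raster_20m.withColumn('b11', rf_resample(col('b11'), 2.0))
```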
gdalinfo on such files? If so, there's a mechanism to pass parameters down to the GDAL bindings.
GDALRasterSource, do you know if there is a way to read from S3 via the AWS "assumed role" approach outlined in the docs above (as opposed to env vars or…
@metasim @sllynn if the job is launched on the AWS cluster, then all the credentials would be retrieved from the instance metadata
in all other cases, only explicit credentials (~/.aws/ config files or env vars) AFAIK; also see https://github.com/OSGeo/gdal/blob/f87673d2ac225e117fd6f6d5b32443cab1c7460b/gdal/port/cpl_aws.cpp
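One workaround for the assumed-role case (an assumption on my part, not a documented GDAL feature): call STS yourself and hand GDAL the temporary credentials through the env vars it already understands, including AWS_SESSION_TOKEN:

```python
import os
import boto3

# Assume the role explicitly via STS, then expose the temporary credentials
# through the standard AWS env vars that GDAL's /vsis3/ handler reads.
creds = boto3.client('sts').assume_role(
    RoleArn='arn:aws:iam::123456789012:role/raster-reader',  # hypothetical role
    RoleSessionName='rasterframes-read',
)['Credentials']

os.environ['AWS_ACCESS_KEY_ID'] = creds['AccessKeyId']
os.environ['AWS_SECRET_ACCESS_KEY'] = creds['SecretAccessKey']
os.environ['AWS_SESSION_TOKEN'] = creds['SessionToken']
```

Note the variables would need to be present on the executors too (e.g. via spark.executorEnv.*), not just the driver.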
GridExtent(Extent(-2.0E7, 2000000.0, -7250000.0, 1.16E7),CellSize(5.0,5.0),2550000x1920000)), but when loading with RasterFrames via spark.read.raster.from(path).load(), the driver spins at 100% CPU for ~30 mins (no activity on the workers) until it eventually raises a variety of executor-timeout, OutOfMemoryError: GC overhead limit exceeded, and bad-datanode errors, and finally:
org.apache.spark.SparkException: Job aborted. ->
Reason: Executor heartbeat timed out after 708872 ms.
The raster is ~20GB (compressed) and I'm using r5.2xl driver/worker instances.
@basilveerman To me it sounds like a lot of shuffling is happening, which could be caused by a few things. What does the Spark UI "Executors" page show in terms of shuffle reads/writes and/or GC time? Select the stage the job failed on and look at the task time distribution. Is it heavily skewed to the right (longer times)? Are you doing any groupBys? If so, you want to use the lazy_tiles feature and make sure you repartition by your join key right after the call to…
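As a rough sketch of the repartition suggestion (spatial_index is the join key from the schema above; the partition count is a placeholder to tune):

```python
# Repartition on the join key so tiles destined for the same output row land
# in the same partitions, cutting down the shuffle during the join/groupBy.
raster_10m = raster_10m.repartition(200, 'spatial_index')
raster_20m = raster_20m.repartition(200, 'spatial_index')
rf_joined = raster_10m.raster_join(raster_20m)
```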
Another thing to check is using the tool(s) referenced here to confirm your GeoTIFFs are "Cloud Optimized". If they aren't, then every executor is going to end up reading in a complete copy of the raster, which could easily send the JVM into GC hell.
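For example, with rio-cogeo (one such validator; I'm assuming it's among the tools referenced, since the link didn't survive):

```python
from rio_cogeo.cogeo import cog_validate

# cog_validate returns (is_valid, errors, warnings) for a given GeoTIFF.
is_valid, errors, warnings = cog_validate('scene/B02_10m.tif')  # hypothetical path
print(is_valid, errors, warnings)
```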
spark.read.raster.from(<s3_path>).load(). I attempted again with r5.12xlarge instances (the file is 200G uncompressed, so it should fit entirely in memory on a single node) and ran into the same errors. It's also using JVMGeoTiffRasterSource rather than GDAL; could that be an issue? I'll add GDAL to the AMI we use regardless.
load should return immediately, and no reading/processing should happen until an action does, so something odd seems to be going on. I'd turn on lazy_tiles and add a load to see what happens.
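A minimal sketch of that experiment, assuming the Python reader's lazy_tiles flag:

```python
# With lazy_tiles=True the reader defers tile I/O until an action runs, so
# the read itself should return quickly; a small limit keeps the test cheap.
df = spark.read.raster('s3://bucket/big_raster.tif', lazy_tiles=True)  # hypothetical path
df.limit(5).show()
```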