Jason T Brown
@vpipkt
I suspect there is a slight mismatch between the schemas in the left DataFrame going into the 2nd raster_join; at that point it is rf_joined.
Note this in the stack trace: the inputs to function coalesce should all be the same type, but it's...
Jason T Brown
@vpipkt
You might try something like this at that line, if you are up for building it yourself:
coalesce(rf_dimensions(left.tileColumns.map(unresolved): _*))
Although in all of this, I am not clear how the issue arises on YARN but not in local mode.
Jason T Brown
@vpipkt
And taking the discussion up another level: @mjgolebiewski, if you are trying to arrive at a single DataFrame with the 10m, 20m, and 60m bands of Sentinel-2, I suggest reviewing this issue: https://github.com/locationtech/rasterframes/issues/450#issuecomment-575648336
In particular, I have some commentary there about using multiple spark.read.raster calls and a join operation, which should be much faster than the raster_join.
Jason T Brown
@vpipkt
The Gitter compiler failed me, but here comes a PR to fix this issue. You can also probably work around it by using rf_proj_raster on all the left-hand side tiles before the 2nd raster_join.
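A minimal sketch of that workaround (a sketch only: the band names are illustrative, the rf_joined/raster_60m names come from the surrounding discussion, and whether the columns need unwrapping with rf_tile first is an assumption):

```python
from pyrasterframes.rasterfunctions import rf_proj_raster, rf_tile, rf_extent, rf_crs

left = rf_joined
for band in ['b02', 'b03', 'b04', 'b08']:  # illustrative left-hand tile columns
    # Rebuild each column as a uniform proj_raster struct so that the types
    # feeding the second raster_join's coalesce all line up.
    left = left.withColumn(band, rf_proj_raster(rf_tile(left[band]),
                                                rf_extent(left[band]),
                                                rf_crs(left[band])))

rf_joined_2 = left.raster_join(raster_60m)
```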
anhfit
@anhfit
@vpipkt I just tried the line rf = spark.read.raster('file:///home/jovyan/examples/data/LC81270452015182LGN00_B2.tif') to read a file transferred to the Docker instance, but it does not work! Please help me.
Michał Gołębiewski
@mjgolebiewski

@vpipkt okay, so:
raster_10m.columns returns

['b02_10m_path', 'b03_10m_path', 'b04_10m_path', 'b08_10m_path', 'b02_10m', 'b03_10m', 'b04_10m', 'b08_10m', 'spatial_index', 'b02', 'b03', 'b04', 'b08']
rf_joined = raster_10m.raster_join(raster_20m)
rf_joined.printSchema()

schema is here (too long for gitter chat): https://hastebin.com/oyafoyuvep.m

Thanks for the catch in loadAndSetNodata - that fixed it, but it doesn't seem to change anything.
Michał Gołębiewski
@mjgolebiewski
Also, thanks for the tip about the Sentinel-2 bands. If it's more efficient in the Sentinel-2 case then it will help a lot.
Michał Gołębiewski
@mjgolebiewski
I tried this approach and I get mixed results. I chose tile_dimensions of 300/150/50 for the 10m/20m/60m bands; sometimes it works, other times I get this error:
Py4JJavaError: An error occurred while calling o105._dfToHTML.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 205 in stage 146.0 failed 4 times, most recent failure: Lost task 205.3 in stage 146.0 (TID 9369, dp-09, executor 7): geotrellis.raster.GeoAttrsError: Cannot combine rasters with different dimensions.150x150 does not match 300x300
Michał Gołębiewski
@mjgolebiewski
But that may be due to me trying zonal statistics on this joined data... should I upsample the 20m and 60m bands first? Or is the join enough?
from pyspark.sql.functions import broadcast

# Join the three resolutions on matching CRS and extent instead of raster_join:
cond_1020 = [rf_crs(raster_10m.b02) == rf_crs(raster_20m.b11),
             rf_extent(raster_10m.b02) == rf_extent(raster_20m.b11)]

cond_60 = [rf_crs(raster_10m.b02) == rf_crs(raster_60m.b01),
           rf_extent(raster_10m.b02) == rf_extent(raster_60m.b01)]

rf_1020 = raster_10m.join(raster_20m, cond_1020)
rf_joined = rf_1020.join(broadcast(raster_60m), cond_60)
tieuthienvn1987
@tieuthienvn1987
(image attachment)
Jason T Brown
@vpipkt
@mjgolebiewski yes, the join on CRS and extent will be much more efficient. I merged the changes fixing the coalesce issue and tagged you on the PR and issue for your perusal. Hope this helps.
And yes, you need to rf_resample the 20m and 60m bands as discussed in issue 450. Basically, within a single row all tiles should have the same dimensions
(dimensions can be non-homogeneous across rows).
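A minimal sketch of that resampling step, using the tile_dimensions and band names from the messages above (assuming the rf_resample(tile, factor) form, where a factor greater than one upsamples):

```python
from pyrasterframes.rasterfunctions import rf_resample

# With tile_dimensions of 300/150/50 for the 10m/20m/60m reads, factors of
# 2.0 and 6.0 bring the 20m and 60m tiles up to 300x300 to match the 10m tiles.
rf_uniform = (rf_joined
              .withColumn('b11', rf_resample('b11', 2.0))   # 20m band
              .withColumn('b01', rf_resample('b01', 6.0)))  # 60m band
```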
anhfit
@anhfit
@vpipkt I am running the provided supervised-learning example, but I changed the data source to my local Landsat imagery. I am now stuck on the part that creates the mask (Masking Poor Quality Cells), because there is no "scl" column with my Landsat 8 data. Can you suggest a solution?
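One possible substitute, sketched under stated assumptions: Landsat 8 Collection 1 ships a BQA quality band rather than SCL, and rf_mask_by_values can mask on its codes. The column names and BQA values below are placeholders to verify against your product's documentation:

```python
from pyrasterframes.rasterfunctions import rf_mask_by_values

# Read the data band and the BQA band side by side from a two-column catalog.
df = spark.read.raster(catalog, catalog_col_names=['B4', 'BQA'])

# Set cells to NoData where BQA is in the list. 1 = fill; 2800/2804/2808/2812
# are high-confidence cloud in Collection 1 -- verify for your scenes.
masked = df.withColumn('B4_masked',
                       rf_mask_by_values('B4', 'BQA', [1, 2800, 2804, 2808, 2812]))
```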
tieuthienvn1987
@tieuthienvn1987
I installed GDAL 2.4.4 and it reported success, but I get "not available" from:
from pyrasterframes.utils import gdal_version
print(gdal_version())
I also get these warnings when working with GeoTIFF files:
20/08/25 16:18:16 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
20/08/25 16:18:27 WARN GDALRasterSource$: GDAL native bindings are not available. Falling back to JVM-based reader for GeoTIFF format.
Jason T Brown
@vpipkt
@tieuthienvn1987 the two WARN statements you are getting are expected given that GDAL is not available. You should find that you are able to read GeoTIFF files just fine.
@tieuthienvn1987 if you do need GDAL (e.g. to read file formats other than GTiff), let us know about the environment you are using. Local, Spark cluster, etc.? OS? Using conda, a virtual environment, etc.?
tieuthienvn1987
@tieuthienvn1987
@vpipkt The environment is local. I read a raster catalog successfully and calculated and displayed NDVI successfully, but when I write it with df.write.geotiff() I have to wait a long time and still have no result. I used conda to install GDAL.
Jason T Brown
@vpipkt
@tieuthienvn1987 the RasterFrames GeoTIFF writer does not depend on GDAL, so the job time is probably more a factor of the volume of data coming back to the driver; GeoTIFF writing has to be done on a single machine. You may check the Spark UI for some clues about job progress.
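If the issue is sheer volume coming back to the driver, the writer can downsample. A small sketch, assuming the raster_dimensions option of the pyrasterframes GeoTIFF writer (the path and sizes here are arbitrary):

```python
# Collects the scene to the driver and writes a single GeoTIFF no larger
# than 2048x2048, downsampling as needed.
df.write.geotiff('ndvi.tif', crs='EPSG:4326', raster_dimensions=(2048, 2048))
```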
Stuart Lynn
@sllynn
If I'm using the raster reader to pick up TIFF files from a non-public S3 bucket, is there any alternative to having AWS credentials on the cluster nodes? Could I, for example, read the files using the Spark binary-files reader and unpack them somehow once they're in a DataFrame?
Stuart Lynn
@sllynn
@vpipkt It’s actually your SAIS 2020 notebook that I’m trying to work with. Did you have to set those credentials up to read from the Astraea requester-pays bucket?
Simeon H.K. Fitch
@metasim

@sllynn
> read the files using the spark binary-files reader

How would this happen without AWS credentials? The process should be the same regardless.

Stuart Lynn
@sllynn
Via the Spark S3 reader, which calls STS, rather than using long-lived credentials.
Simeon H.K. Fitch
@metasim
Gotcha. I'm not familiar with that technique. Is there a way to do something similar using GDAL on the command line? IOW, is there some combination of settings that would allow you to call gdalinfo on such a file? If so, there's a mechanism to pass parameters down to the GDAL bindings.
Simeon H.K. Fitch
@metasim

@sllynn
> via the spark s3 reader which calls STS

Can you point me to some docs on how this works?

Jason T Brown
@vpipkt
@sllynn to answer your question: yes, I did have to set up the AWS credentials to run the example. I ran it in local mode, so only on a single machine.
That, or use InstanceProfileCredentialsProvider with an instance profile on the cluster nodes.
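A rough sketch of both options via Hadoop's S3A settings (property names assume the hadoop-aws S3A connector; check them against your Hadoop version):

```python
hconf = spark.sparkContext._jsc.hadoopConfiguration()

# Option 1: explicit keys on the cluster.
hconf.set('fs.s3a.access.key', access_key)
hconf.set('fs.s3a.secret.key', secret_key)

# Option 2: let the nodes' instance profile supply credentials.
hconf.set('fs.s3a.aws.credentials.provider',
          'com.amazonaws.auth.InstanceProfileCredentialsProvider')
```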
Simeon H.K. Fitch
@metasim
@sllynn Thank you.
In reality, this is at the GeoTrellis level. I'll ask @pomadchin here (we might need to move to #geotrellis): with GDALRasterSource, do you know if there is a way to read from S3 via the AWS "assumed role" approach outlined in the docs above (as opposed to env vars or ~/.aws/credentials)?
Grigory
@pomadchin

@metasim @sllynn if the job is launched on an AWS cluster, then all the credentials will be retrieved from the instance metadata.

In all other cases, only explicit credentials (~/.aws/ or env vars) work, AFAIK; also see https://github.com/OSGeo/gdal/blob/f87673d2ac225e117fd6f6d5b32443cab1c7460b/gdal/port/cpl_aws.cpp

Simeon H.K. Fitch
@metasim
@pomadchin :bow:
Basil Veerman
@basilveerman
Hi, does anyone have guidance on instance sizing when working with large GeoTIFFs? Raster metadata can be loaded with geotrellis.raster.geotiff.GeoTiffRasterSource (e.g. extent: GridExtent(Extent(-2.0E7, 2000000.0, -7250000.0, 1.16E7),CellSize(5.0,5.0),2550000x1920000)), but when loading with RasterFrames via spark.read.raster.from(path).load(), the driver spins at 100% CPU for ~30 minutes (no activity on workers) until it eventually raises a variety of errors: executor timeouts, OutOfMemoryError: GC overhead limit exceeded, bad datanode, and finally org.apache.spark.SparkException: Job aborted. -> Reason: Executor heartbeat timed out after 708872 ms. The raster is ~20GB (compressed) and I am using r5.2xl driver/worker instances.
Simeon H.K. Fitch
@metasim

@basilveerman To me it sounds like a lot of shuffling is happening, which could be caused by a few things. What does the Spark UI "Executors" page show in terms of shuffle reads/writes and/or GC time? Select the stage that the job failed on and take a look at the task time distribution. Is it heavily skewed to the right (longer times)? Are you doing any joins or groupBys? If so, you want to use the lazy_tiles feature and make sure you repartition by your join key right after the call to spark.read.raster.

Another thing to check is using the tool(s) referenced here to confirm your GeoTIFFs are "Cloud Optimized". If they aren't, then every executor is going to end up reading in a complete copy of the raster, which could easily send the JVM into GC hell.
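A compact sketch of those two suggestions together (the join-key column name is illustrative; a spatial_index column exists only if the read requested one):

```python
# Keep tiles lazy so the read returns references rather than pixels, then
# repartition on the join key before any join/groupBy.
df = (spark.read.raster(uri, lazy_tiles=True)
      .repartition(1000, 'spatial_index'))
```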

Jason T Brown
@vpipkt
A strong second to making sure your file(s) are "cloud optimized", especially with internal tiling.
Basil Veerman
@basilveerman
@metasim I'm not performing any operations yet, just spark.read.raster.from(<s3_path>).load(). I attempted again with r5.12xlarge instances (the file is 200G uncompressed, so it should fit entirely in memory on a single node) and ran into the same errors. It's also using JVMGeoTiffRasterSource rather than GDAL; could that be an issue? I'll add GDAL to the AMI we use regardless.
The files are not explicitly cloud optimized (no internal overviews, not internally tiled). I am, however, able to open them directly with Python/rasterio and read random small 4-cell windows within milliseconds. I'll see if I can rewrite them as cloud optimized, but given the current issues, working with them is proving difficult.
Jason T Brown
@vpipkt
Don't overlook the lazy_tiles option in the reader. It seems that this might explain the difference between the GeoTiffRasterSource and read...
Simeon H.K. Fitch
@metasim
@basilveerman Sounds like a bug somewhere; not sure how to winnow it down unless we have an example image to work with. Off the cuff, calling load should return immediately, and no reading/processing should happen until an action occurs, so something odd seems to be going on. I'd turn on lazy_tiles and add a repartition(1000) after the load to see what happens.
With GDAL installed and working with RF, theoretically the read patterns in RF should be the same as Rasterio's.
But it's possible there's a bug.
Jason T Brown
@vpipkt
I am unsure, at that file size, what the performance and resource differences might be between the JVM GeoTiff reader and GDAL...
Simeon H.K. Fitch
@metasim
Yeah, good point. Doesn't smell right, especially on a monstrous machine like that.
@basilveerman How much memory did you give the driver?