msmahasm
@msmahasm:matrix.org
[m]
yeah, here it is
Simeon H.K. Fitch
@metasim
OK, it looks like you have a RasterFrameLayer, which has a TileLayerMetadata from GeoTrellis.
Which means all of your tiles are in the same CRS.
Regardless, you still have in each row the CRS and associated tile extent. So those two things tell you where on the map the tile is.
I may be misunderstanding your question.
The units of the extent column are defined by the crs column.
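For example, a quick way to see this (a sketch in Python, assuming the default proj_raster column name and an existing RasterFrames-enabled session):

from pyrasterframes.rasterfunctions import rf_crs, rf_extent

# Each row carries the tile's CRS and its extent, expressed in that CRS's units.
rf.select(rf_crs("proj_raster"), rf_extent("proj_raster")).show(5, truncate=False)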
msmahasm
@msmahasm:matrix.org
[m]
Thanks! Yeah, I have lat/long values for the centroids of building footprints (in a separate table), and I want to tag the associated Zip Code onto the building footprint data from the TIFF file (the rf DataFrame above)
please share your thoughts / ideas for the above!
msmahasm
@msmahasm:matrix.org
[m]

Hi @metasim, the expectation is that we need to process the TIFF file layer by layer to tag each building with a Zip Code.

I'm facing the following challenges:

  1. While reading the .tiff file, there is no information related to Zip Code / location. Is there any way to get that information?
  2. How can I process both datasets without doing a "Cartesian" join to tag the buildings or centroids against the TIFF file / raster?

Kindly help on this!

Simeon H.K. Fitch
@metasim
@msmahasm:matrix.org Sounds like an interesting project, and certainly one that my company Astraea could help with. Unfortunately, your questions ask for more than I'm able to provide on a gratis basis. If you're interested in commercial help, please check out our website and contact form. We'd love to get connected.
msmahasm
@msmahasm:matrix.org
[m]
Thanks @metasim, and thanks again for your response! I will get back to you!
Michał Gołębiewski
@mjgolebiewski
hello, is RasterFrames capable of processing LIDAR data?
Simeon H.K. Fitch
@metasim
@mjgolebiewski Never tried, to be honest. I don't know enough about point-cloud processing to know whether the tiled data model of RasterFrames is helpful or not. I'd start with GeoTrellis' Point Cloud module (https://github.com/geotrellis/geotrellis-pointcloud) and, if it has what you need, maybe there's a pathway to integrating it into Spark's DataFrame model.
Michał Gołębiewski
@mjgolebiewski
@metasim thanks, ill look into it :)
it's a bit experimental, though; any feedback and contributions are appreciated
Simeon H.K. Fitch
@metasim
Oh, awesome! Didn't know about that!
Grigory
@pomadchin
@metasim hehe, I hope that at some point https://github.com/PDAL/java will have more STARS :D :D
LIDAR is not that popular for some reason
Michał Gołębiewski
@mjgolebiewski
@pomadchin thanks, that's great news. I assume there is no way I could use it with Python?
Grigory
@pomadchin
@mjgolebiewski why? I think you can; just add the necessary packages and call them through the DataFrames API
it is not tested, though, since we didn't have Python in our plans
Yuri de Abreu
@yurigba
Just out of curiosity: I have seen that the spatial indexing feature is experimental and may be removed... I could only get good performance by using it. Why is it considered something that could be removed in the future? Are there alternatives to spatial indexing for distributing data over the nodes?
Grigory
@pomadchin
hey @yurigba, what is going to be removed, btw? and where did you see it?
Simeon H.K. Fitch
@metasim

@yurigba

Just out of curiosity, I have seen that the spatial indexing feature is experimental and may be removed... I only could get good performance by using it. Why it is considered something that can be removed in the future? Are there alternatives to the implementation of spatial indexing for distributing data over the nodes?

The rationale--which admittedly is not very good--is that if you don't know what you're doing, turning it on can make your jobs run even longer. It's more of an "ergonomics" issue. To be honest, you are the first to even mention it, so perhaps it should not be marked as experimental... definitely interested in your feedback.

FWIW, the underlying functions are not experimental.
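For reference, here's roughly how it's turned on in Python (a sketch; the path and partition count are placeholders, and spark is an existing RasterFrames-enabled session):

# Range-partition tiles on a space-filling spatial index at read time.
rf = spark.read.raster("scene.tif", spatial_index_partitions=800)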
Yuri de Abreu
@yurigba

@metasim I have been using RasterFrames in a development cluster (which means it has far fewer resources than the cluster that will ultimately be used). We are migrating to the full-fledged environment now. However, we were trying to do machine learning runs in the development cluster, and when we turned on spatial indexing, performance increased a lot (this was five months ago). The run consisted of joining the RasterFrame with GeoJSON labels of a small area before turning it into vectorized pixels. We were using GLAD ARD satellite imagery, which consists of seven-band TIFF rasters.

We are progressively increasing our ability to do big data runs; in a few months we will be able to scale up the algorithm to see how it fares, and we will be using USGS Level-2 Landsat 8 data and Sentinel-2 L2A data.

By using spatial indexing as-is, we saw a good performance increase. But when setting the spatial_index_partitions argument manually, it was hard to get good performance. I was trying to understand how it works; when I go back to the big data environment, I will go more in depth.

msmahasm
@msmahasm:matrix.org
[m]

Hi @metasim, I have a quick question: is there any first-class function / approach in RasterFrames to extract a value from a raster using lat & long?

https://stackoverflow.com/questions/56913645/rasterframes-extracting-location-information-problem/67686934#67686934

"rf.where(st_intersects(rf_geometry($"proj_raster"), point))
.select(rf_value_at_point(rf_extent($"proj_raster"), rf_tile($"proj_raster"), point) as "value")
.show(false)"
msmahasm
@msmahasm:matrix.org
[m]
the above code works well, but it ends up as a Cartesian product when there are more than a million records
Simeon H.K. Fitch
@metasim
@msmahasm:matrix.org Interesting. Is point in the same CRS as the raster? You normally need to do something like st_reproject(point, rf_mk_crs("epsg:4326"), rf_crs($"proj_raster"))
Assuming you have multiple images and each has a different CRS.
I ask because, in the event the CRSs are completely different, the resultant size may just be an anomaly of incompatible units.
Do you have a join happening somewhere further upstream?
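For completeness, the whole lookup in Python might look like the sketch below (column names and the point variable are assumed; the function names mirror the Scala ones above, via pyrasterframes.rasterfunctions):

from pyspark.sql.functions import col
from pyrasterframes.rasterfunctions import (st_reproject, st_intersects, rf_mk_crs,
    rf_crs, rf_geometry, rf_extent, rf_tile, rf_value_at_point)

# `point` is whatever geometry column/literal you already have, in lat/long.
# Reproject it into each tile's CRS before intersecting and reading.
pt = st_reproject(point, rf_mk_crs("epsg:4326"), rf_crs(col("proj_raster")))

result = (rf.where(st_intersects(rf_geometry(col("proj_raster")), pt))
            .select(rf_value_at_point(rf_extent(col("proj_raster")),
                                      rf_tile(col("proj_raster")), pt).alias("value")))
result.show(truncate=False)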
Simeon H.K. Fitch
@metasim

@yurigba

But when setting the spatial_index_partitions argument manually, it was hard to get good performance. I was trying to understand how it works; when I go back to the big data environment, I will go more in depth.

What you describe is definitely the right use case for adding spatial index partitioning, so keep using it. I'll make a note to remove the deprecation warning. Here's what's happening under the hood:
As you know, to be able to fit rasters into Spark memory, they are carved up into tiles. The RasterSourceRelation can do this in two modes: lazy reads and eager reads. With eager reads, the cell data is loaded into memory as the raster is being carved up into tiles, making subsequent shuffling (repartitioning) very expensive. With lazy reads (aka lazy tiles), we only construct and store enough information to read the subset of cells we need at a later time. So when we add the spatial index and range partition on it, the resultant shuffle is over very lightweight data, and the cell reads are delayed until finally needed (including any subsequent filtering you do!).

Depending on how big the vector data is, consider marking it as broadcast. If not, you'll need to make sure you add a spatial index to the vector data as well and range repartition so the vector data is "near" the raster data.

One thing to also note about Sentinel-2 L2A data from EU Central... it's in JP2K, and the standard GDAL drivers for it do not support "range reads" (the ability to read only a subset of the raster). As such, lazy tiles can be even more expensive than eager tiles, depending on how the job is constructed. In our commercial Astraea EarthAI platform we have a proprietary JP2K driver enabling range reads, but (as far as I know) there is no open-source driver for JP2K with the same performance as the GeoTIFF/COG driver.
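To make that concrete, a minimal Python sketch (reader option names per pyrasterframes; the file names, partition count, and label source are placeholders):

from pyspark.sql.functions import broadcast
from pyrasterframes.rasterfunctions import st_intersects, rf_geometry

# Lazy tiles (the default): only lightweight tile references get shuffled by the
# range partitioning on the spatial index; cell reads happen after later filtering.
rf = spark.read.raster("scene.tif", lazy_tiles=True, spatial_index_partitions=200)

# Small vector side: broadcast it so the raster side never has to shuffle.
labels = spark.read.geojson("labels.geojson")
joined = rf.join(broadcast(labels),
                 st_intersects(rf_geometry(rf["proj_raster"]), labels["geometry"]))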

Yuri de Abreu
@yurigba

@metasim I see. There is really a lot going on, and it makes a lot of sense now. The hard part is understanding when an operation will degrade performance while lazy reading. If I understand correctly, RasterFrames has the ability to lazily read and get all the needed spatial information from any GDAL-readable format, but the problem arises when you cannot do range reads, forcing it to load the entire raster into memory and overloading the workers if the job isn't constructed with this in mind. I think spatial indexing would mitigate this too, since when using range partitioning the odds of tiles belonging to the same raster increase dramatically.

Spatial indexing makes a lot of sense for Sentinel data due to the different resolutions of each band... I suppose the takeaway is that for JP2K there is no problem in lazy reading, as long as I keep in mind that in the end the worker will need to load the entire raster into memory anyway.

Could you tell me an example of spatial indexing actually degrading performance?

Simeon H.K. Fitch
@metasim

@yurigba

The hard part is understanding when an operation will degrade performance while lazy reading.

Exactly... we've struggled to figure out a way to do this that's both intuitive and not prone to really messing up performance.

If I understand correctly, RasterFrames has the ability to lazily read and get all the needed spatial information from any GDAL-readable format, but the problem arises when you cannot do range reads, forcing it to load the entire raster into memory and overloading the workers if the job isn't constructed with this in mind.

Precisely! Being able to read the header without reading the whole image is key to making lazy reads useful. One option might be to support a mode whereby RasterDataSource attempts to figure out which mode would be better. Haven't been sure if it's worth the extra complexity.

We also try to cache these lookups so even reads from separate threads are minimized.

Could you tell me an example of spatial indexing actually degrading performance?

It's generally application-specific. If memory serves, we had a case in a longitudinal time-series analysis where it was better to partition by some other property. But it was a colleague who was doing the work, and I've unfortunately forgotten the details.

If you don't need spatial partitioning, there's some extra cost up front, because I think Spark samples the index space to construct the range-partitioning scheme (how the index space is divided between nodes). Also, if your index-space usage is very unbalanced, I could see it being a problem, but I'm guessing here.

There are also cases where subsequent filtering (say, a spatial join with vector data) creates very unbalanced partitioning, and if you don't repartition after that you're leaving CPUs unused. But that's not really a spatial partitioning problem.
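In code, that last point is just an explicit repartition after the skewed step (the partition count here is a placeholder to match to your cluster):

# Rebalance after a selective spatial join so downstream stages use all CPUs.
balanced = joined.repartition(200)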

Yuri de Abreu
@yurigba

@metasim I don't think the extra complexity is worth it either. To be able to do range reads, and thinking from a data-engineering pipeline perspective, I'd just add a preprocessing layer to get the needed bands, save them in GeoTIFF/COG format, and be done with it (of course, this has a high cost in disk space). But it would be nice to have a warning message so that users could be aware of these issues!

Also, I have learned a lot of useful information, so thank you for sharing this!

Simeon H.K. Fitch
@metasim
:+1: :)
Lepitrust
@lepitrust

Hi, I have a problem with RasterFrames, Spark, and Hadoop, both in cluster mode.
I test the configuration with:

pyspark --master local[*] --py-files pyrasterframes_2.11-0.9.1-python.zip --packages org.locationtech.rasterframes:rasterframes_2.11:0.9.1,org.locationtech.rasterframes:pyrasterframes_2.11:0.9.1,org.locationtech.rasterframes:rasterframes-datasource_2.11:0.9.1 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrator=org.locationtech.rasterframes.util.RFKryoRegistrator --conf spark.kryoserializer.buffer.max=500m --conf org.locationtech.rasterframes.rfConfig.prefer-gdal=true

this works fine:

df = spark.read.format("GeoTIFF").load("hdfs://namenode:8020/user/pippo/raster/beam_17_0400.tiff")
print(df.head())

This does not!

df= spark.read.raster("hdfs://namenode:8020/user/pippo/raster/beam_17_0400.tiff")
print(df.head())

The error is "Error fetching data for one of: HadoopGeoTiffRasterSource(hdfs://namenode:8020/user/pippo/raster/beam_17_0400.tiff)"
Can anyone help me?
Hadoop version 2.7.4
Spark version 2.4.7
Rasterframes 0.9.1

Lepitrust
@lepitrust
Hi, I figured out why it didn't work.
On every node running Spark and Hadoop, you must configure core-site.xml: set the properties fs.default.name and fs.defaultFS to your hdfs:// address (for me, hdfs://namenode:8020/). The default is file:///, which is local...
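For reference, the relevant core-site.xml entries look something like this (namenode address per my setup; treat it as a template):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode:8020/</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode:8020/</value>
</property>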
Bye
Simeon H.K. Fitch
@metasim
Great! :+1:
Michał Gołębiewski
@mjgolebiewski
has anyone tried exporting a GeoTrellis layer with pyspark, and could you provide me with code?
Simeon H.K. Fitch
@metasim
@mjgolebiewski Export to what format?
Michał Gołębiewski
@mjgolebiewski
@metasim oh, sorry - i meant exporting a DataFrame to a GeoTrellis layer.
silnykamcio
@silnykamcio
@mjgolebiewski Such syntax works for me:
rf_dataframe.write.option('layer', 'layer_name').option('zoom', zoom_level).geotrellis(path='path_to_catalog')
Simeon H.K. Fitch
@metasim
Thanks @silnykamcio !
msmahasm
@msmahasm:matrix.org
[m]

hi, can anyone tell me how to extract only the valid tile data? For example, from the $proj_raster below:

object
tile_context: null
tile: RasterRefTile(RasterRef(GDALRasterSource(file:/zip.tiff),0,Some(Extent(-100.01743333343231, 29.758829999979092, -99.99610000009906, 29.78016333331234)),Some(GridBounds(296960,235264,297215,235519))))

i want to extract only the valid raster data; since the grid bounds contain a lot of no-data values, it is taking too long to execute.