wxmimperio
@imperio-wxm
@pomadchin Hi, I added printouts of start and length to the source code and found a lot of repeated reads. When I use the S3 datasource, that means a lot of extra network IO.
By the way, can increasing the parallelism or the clip length improve reading efficiency?
Grigory
@pomadchin
@imperio-wxm no, only increasing parallelism is a thing to go with. I am actually very confused: what parallelism exactly are you increasing?
there are multiple levels on which you can approach it though
the clip length is impossible to increase; it reads exactly the ranges it needs for the segments it needs
Could you show the code + logs that concern you?
but tldr: work with COGs / with tiled TIFFs - it will reduce the size of the chunks and the amount of overlap
if your logs are observed on a cluster - no surprise that once in a while you see the same set of ranges - that should be the beginning of the file; tags always have to be read before initiating the segment reads
you may try to cache the TIFF metadata so you’re not rereading it every time (a rough sketch of that idea is below)
however, it is a bit hard to follow you; I need more details to give a better response;
could you share some code and S3 range read logs? is it on a single machine or is it distributed? do rereads of the same range happen within a single machine or across multiple machines?
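something like this is all I mean by caching - the names are illustrative, not a GeoTrellis API; the point is just to reuse one RasterSource per URI on each JVM so the TIFF tags are only fetched once per executor:

import geotrellis.raster.RasterSource
import geotrellis.raster.geotiff.GeoTiffRasterSource
import scala.collection.concurrent.TrieMap

// hypothetical per-JVM cache: the header ranges that show up repeatedly in the
// logs are only fetched the first time a given URI is touched on this JVM
object RasterSourceCache {
  private val cache = TrieMap.empty[String, RasterSource]
  def get(uri: String): RasterSource =
    cache.getOrElseUpdate(uri, GeoTiffRasterSource(uri))
}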
wxmimperio
@imperio-wxm

@pomadchin Hi, my code is simple:

val sourceRDD: RDD[RasterSource] =
  sc.parallelize(files, 50)
    .filter(url => url.endsWith(".tiff"))
    .map(uri => GeoTiffRasterSource(uri): RasterSource)

val summary = RasterSummary.fromRDD(sourceRDD)
val LayoutLevel(zoom, layout) = summary.levelFor(layoutScheme)
val contextRDD = RasterSourceRDD.tiledLayerRDD(sourceRDD, layout, KeyExtractor.spatialKeyExtractor, rasterSummary = Some(summary))

contextRDD.foreach { case (key, tile) =>
  println(tile.size)
}

I increase parallelism like this, but only 11 tasks actually do the reading in the foreach:

sc.parallelize(files, 50)

I added logging to the S3 source code. Clearly, a lot of the same ranges are being read repeatedly. There are too many logs, so I only excerpted part of them:
[screenshots of the S3 range read logs]

And I am testing locally, not in cluster mode.
Grigory
@pomadchin
yea, so don’t worry too much about it
this is all good, there is no workaround

the problem here is that you have a layout, right? and you want to tile the scene according to some layout definition

but what is the tiling scheme of every TIFF you have?

in almost every case you are going to hit the situation where tiles of your layout need multiple segments, each only partially
but as I said, we benchmarked it; it does not really affect you much, if at all
Grigory
@pomadchin
there is one thing though: if the cluster / local parallelism is not enough, you can add more threads to the S3Client thread pool (see the sketch below): https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/asynchronous.html#advanced-operations
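roughly what that page describes, assuming you construct the async client yourself; the thread counts are placeholders to tune, and how the client is wired into your reader code depends on your setup:

import java.util.concurrent.Executors
import software.amazon.awssdk.core.client.config.{ClientAsyncConfiguration, SdkAdvancedAsyncClientOption}
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient
import software.amazon.awssdk.services.s3.S3AsyncClient

// raise both the HTTP-level concurrency and the future-completion executor size
val s3Async: S3AsyncClient =
  S3AsyncClient.builder()
    .httpClientBuilder(NettyNioAsyncHttpClient.builder().maxConcurrency(64))
    .asyncConfiguration(
      ClientAsyncConfiguration.builder()
        .advancedOption(
          SdkAdvancedAsyncClientOption.FUTURE_COMPLETION_EXECUTOR,
          Executors.newFixedThreadPool(32)
        )
        .build()
    )
    .build()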
wxmimperio
@imperio-wxm
@pomadchin Thanks for your answer. I just want to read the data as fast as possible; I don't know why, when I assign a parallelism of 50, only 12 tasks actually read.
[screenshot]
Grigory
@pomadchin
@imperio-wxm most likely you don’t have enough resources (: try smaller executors: 1-2 CPUs each and ~2 GB each, e.g. the sketch below
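illustrative settings only; tune them to your hardware:

import org.apache.spark.SparkConf

// smaller executors so more of them (and thus more concurrent read tasks)
// fit on the same machines; the values are examples, not recommendations
val conf = new SparkConf()
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "2g")
  .set("spark.default.parallelism", "50")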
wxmimperio
@imperio-wxm
@pomadchin Thanks, I will try it.
zmx110
@zmx110
@pomadchin hi, I changed Scala from 2.12.14 to 2.12.10. IDEA error: unread block data; Spark error: Failed to register classes with Kryo. How can I deal with them? Thank you!
Grigory
@pomadchin
@imperio-wxm you’re welcome! let me know how it goes
@zmx110 hey, do you have a full stack trace?
zmx110
@zmx110
Sorry, I don't know how to output it
Grigory
@pomadchin
@zmx110 just copy and paste here
but I’d recommend always using the most up-to-date Scala version
wxmimperio
@imperio-wxm

@pomadchin
Hi, what is the index order of MultibandTile bands?

// I read three tiffs:
// LC09_L2SP_148033_20220512_20220514_02_T1_SR_B2.TIF
// LC09_L2SP_148033_20220512_20220514_02_T1_SR_B3.TIF
// LC09_L2SP_148033_20220512_20220514_02_T1_SR_B4.TIF
val sourceRDD: RDD[RasterSource] = sc.parallelize(files, 10)
      .map(uri => {
        GeoTiffRasterSource(uri): RasterSource
      }).cache()

Is multibandTile.band(0) = SR_B2, multibandTile.band(1) = SR_B3, multibandTile.band(2) = SR_B4?
How do I reliably select the band I want by index? What is the mapping between index and band?

Grigory
@pomadchin
hey @imperio-wxm in your particular example all the multibands have a single band
a RasterSource points to a single file, and it treats every TIFF as multiband
it will require a little bit of work on your side to figure out how you want to load such TIFFs into memory; a rough sketch of one option is below
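for instance, if the three files share the same grid, you can read them yourself and stack the bands in an order you pick (resampling and error handling omitted):

import geotrellis.raster.MultibandTile
import geotrellis.raster.geotiff.GeoTiffRasterSource

// stack single-band TIFFs in an explicit order, so band(0)/band(1)/band(2)
// are B2/B3/B4 by construction
val uris = Seq(
  "LC09_L2SP_148033_20220512_20220514_02_T1_SR_B2.TIF", // -> band(0)
  "LC09_L2SP_148033_20220512_20220514_02_T1_SR_B3.TIF", // -> band(1)
  "LC09_L2SP_148033_20220512_20220514_02_T1_SR_B4.TIF"  // -> band(2)
)
val bands = uris.map(uri => GeoTiffRasterSource(uri).read().get.tile.band(0))
val stacked: MultibandTile = MultibandTile(bands: _*)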
Frank Dekervel
@kervel
hello, i'm still struggling with porting my old code to new GT. now i'm completely on latest GT, but the reprojection of the layer is off (the tiles are more or less where they are supposed to be but they seem to be rotated). this is basically my entire code now (down from much more)
val rastersources = layer_filenames.mapPartitions { partition =>
  S3Utils.configureGeotrellis(s3confBroadcast.value)
  partition.map(x => GeoTiffRasterSource(x))
}
val epsg4326 = CRS.fromEpsgCode(4326)
val rastersources_webmercator = rastersources.map(x => x.reproject(epsg4326, method = Max).reproject(WebMercator, method = Max))
val summary = RasterSummary.fromRDD(rastersources_webmercator)
val layoutScheme = ZoomedLayoutScheme(WebMercator)
val LayoutLevel(maxZoom, layoutDefinition) = summary.levelFor(layoutScheme)
val layer_webmercator = RasterSourceRDD.tiledLayerRDD(rastersources_webmercator, layoutDefinition, KeyExtractor.spatialKeyExtractor, rasterSummary = Some(summary))
val layer_webmercator_singleband = ContextRDD(layer_webmercator.map(x => (x._1, x._2.band(0))), layer_webmercator.metadata)
Pyramid.upLevels(layer_webmercator_singleband, layoutScheme, maxZoom, Max) { (rdd, z) =>
  val layerId = LayerId(layerName, z)
  if (store.layerExists(layerId)) deleter.delete(layerId)
  writer.write(layerId, rdd, ZCurveKeyIndexMethod)
}
[screenshot of the resulting tiles]
Grigory
@pomadchin
hey @kervel do you have a minimized example? it could be a bug in the old gt that it was not rotated
RasterSource(tiff).tileToLayout(...).read(key) <- the min example
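spelled out a bit more (the URI, zoom level, and key here are placeholders):

import geotrellis.layer._
import geotrellis.proj4.WebMercator
import geotrellis.raster.geotiff.GeoTiffRasterSource

// tile a single source to a layout and read a single key back
val layout = ZoomedLayoutScheme(WebMercator).levelForZoom(12).layout
val tiled =
  GeoTiffRasterSource("s3://bucket/scene.tif")
    .reproject(WebMercator)
    .tileToLayout(layout)
val raster = tiled.read(SpatialKey(2048, 1024)) // Option[Raster[MultibandTile]]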
wxmimperio
@imperio-wxm
@pomadchin Hi, if I need to select the B2 band in a map operation, is it band(0), band(1), or band(2)?
Grigory
@pomadchin
@imperio-wxm in your code it is band(0) for all TIFFs
wxmimperio
@imperio-wxm
@pomadchin Hi, so what determines the band order? Is it the order in which I pass the list of datasource files? If the file order is B10, B6, B1, then band(0) = B10, band(1) = B6, band(2) = B1? I ask because I don't know the mapping between index and band. In rasterframes, for example, executing select(col('B2')) guarantees that I select B2; of course, that requires defining a schema mapping.
wxmimperio
@imperio-wxm
@pomadchin There are indeed two scenarios: one is a single TIFF file containing multiple bands, the other is a list of single-band TIFF files. When read into an RDD, both are represented by MultibandTile, with the bands stored as an Array[Tile]. What I want to know is the order of this Array. A single TIFF file is easy to understand: it's the order the bands were written to the file. But for a list of single-band TIFFs, is it the order in which the RDD is read?
Grigory
@pomadchin
Hey @imperio-wxm - in the code that you sent before, I don’t see how you craft multibands at all: they are all single bands
but in short - the way you build the MultibandTileLayerRDD determines the ordering; that’s all
If you read TIFFs with RasterSource, in the LC8 case they are all single-banded
My next assumption is that you do nothing about the ordering, just tile all TIFFs to the layout, and during tiling the TIFFs get squashed into real multibands because in your case multiple tiles are assigned to the same key
^ if that’s true, then there is no way you can know anything about the ordering; it is random and determined by the cluster data distribution and the ordering of tiles in the partitioning operation result
To fix it you need to come up with your own techniques for that, index tiles somehow, etc. - see the sketch below
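for example, you could tile each band into the same layout separately and then join by key, so the band order is fixed by construction; this is a rough sketch, assuming both bands come from the same scene and grid - the names and the NDVI math are illustrative, and a SparkContext sc is assumed to be in scope as in your snippets:

import geotrellis.layer._
import geotrellis.raster._
import geotrellis.raster.geotiff.GeoTiffRasterSource
import geotrellis.spark._
import org.apache.spark.rdd.RDD

val b4Uris = Seq("LC09_L2SP_148033_20220512_20220514_02_T1_SR_B4.TIF")
val b5Uris = Seq("LC09_L2SP_148033_20220512_20220514_02_T1_SR_B5.TIF")

// one layout computed from all sources, so both bands align to the same grid
val allSources = sc.parallelize(b4Uris ++ b5Uris).map(GeoTiffRasterSource(_): RasterSource)
val summary = RasterSummary.fromRDD(allSources)
val LayoutLevel(_, layout) = summary.levelFor(FloatingLayoutScheme(256))

// tile one band's files to the shared layout and keep only that band
def tileBand(uris: Seq[String]): RDD[(SpatialKey, Tile)] =
  RasterSourceRDD
    .tiledLayerRDD(sc.parallelize(uris).map(GeoTiffRasterSource(_): RasterSource), layout, KeyExtractor.spatialKeyExtractor)
    .mapValues(_.band(0))

val red = tileBand(b4Uris) // always B4
val nir = tileBand(b5Uris) // always B5

// join by SpatialKey: the left side is B4 and the right side is B5 by construction
val ndvi: RDD[(SpatialKey, Tile)] = red.join(nir).mapValues { case (r, n) =>
  // NDVI = (NIR - Red) / (NIR + Red), computed per cell
  n.convert(DoubleConstantNoDataCellType).combineDouble(r.convert(DoubleConstantNoDataCellType)) {
    (nv, rv) => (nv - rv) / (nv + rv)
  }
}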
wxmimperio
@imperio-wxm
@pomadchin Yes, what I mean is reading multiple single-band files: after converting to an RDD they get squashed into a MultibandTile under the same key, so how do I choose the correct band? I don't know whether 0 means B2 or B3. For example, if I want to implement the NDVI calculation myself, it will load the two single-band files B4 and B5 instead of a pre-composited multi-band file, and then I don't know whether band(0) is B4 or B5.
wxmimperio
@imperio-wxm
val sourceRdd: RDD[RasterSource] = sc.parallelize(
  Seq("LC09_L2SP_148033_20220512_20220514_02_T1_SR_B4.TIF", "LC09_L2SP_148033_20220512_20220514_02_T1_SR_B5.TIF"),
  2
).map(uri => {
  GeoTiffRasterSource(uri): RasterSource
}).cache()

val summary: RasterSummary[Unit] = RasterSummary.fromRDD(sourceRdd)
val LayoutLevel(_, layout) = summary.levelFor(FloatingLayoutScheme(256))
val contextRDD = RasterSourceRDD.tiledLayerRDD(sourceRdd, layout, KeyExtractor.spatialKeyExtractor, rasterSummary = Some(summary)).persist(StorageLevel.DISK_ONLY)

contextRDD.map {
  case (key, tile) =>
    // bandCount == 2
    println(tile.bandCount)
    // B4 or B5 ?
    println(tile.band(0))
}.collect()
Grigory
@pomadchin
^ yea it is non-deterministic if you approach it this way (:
wxmimperio
@imperio-wxm
@pomadchin I found a similar example in the documentation, where the assumed band 4 and band 5 are, by this logic, actually undefined? That seems unreasonable to me. Are there any best practices for this? https://geotrellis.readthedocs.io/en/latest/guide/spark.html#region-query-and-ndvi-calculation