rfecher
@rfecher
I don't believe you need to care about IndexDependent... or RowMerging...
I don't think using index sort/partition keys as the data ID would be a good idea (they're not guaranteed unique, plus they're already in the key, so one thing the data ID is there for is to absolutely guarantee uniqueness of a key)
with 3-5 dimensions you start to get into extreme unlikeliness for overlapping keys anyways
Grigory
@pomadchin
Thanks @rfecher, makes sense; so I will try to derive some unique string based on the input entry (: thanks!
I also thought of deriving it based on some information in the entry and on the index :o
~ get the partition key from the index by passing all dims in + some kind of identifying information from the entry
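A minimal Python sketch of the first idea above, deriving a data ID from the entry itself. The field names (`source_uri`, `timestamp_ms`, `dim_value`) and the `"scene-001"` value are hypothetical stand-ins for whatever identifies an entry, and nothing here calls a GeoWave API; it only shows a deterministic ID that stays distinct even when two entries end up with the same index keys.

```python
import hashlib
import json


def derive_data_id(source_uri: str, timestamp_ms: int, dim_value: float) -> str:
    """Derive a deterministic data ID from identifying fields of an entry.

    All three parameters are hypothetical; substitute whatever uniquely
    identifies an entry in your own data.
    """
    # A JSON list keeps the field boundaries unambiguous before hashing.
    payload = json.dumps([source_uri, timestamp_ms, dim_value]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Two entries that differ only in the 4th-dimension value.
id_21 = derive_data_id("scene-001", 1401926400000, 21.0)
id_22 = derive_data_id("scene-001", 1401926400000, 22.0)
```

Entries with identical field values get identical IDs, so the chosen fields have to be enough to tell entries apart.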
rfecher
@rfecher
and in answer to another question you had: to just see what keys your index would generate for a row, you can call index.getIndexStrategy().getInsertionIds(<BasicNumericDataSet>)
Grigory
@pomadchin
:+1: nice
rfecher
@rfecher
BasicNumericDataSet just wraps NumericData (which can be a range or single value) per dimension in the same order as the dimensions defined in your index
basically, what your NumericDimensionField in the CommonIndexModel returns from its getNumericData() method gets passed to the index strategy's getInsertionIds() method, and the result ultimately gets written as the partition and sort keys in the data store
Grigory
@pomadchin
@rfecher hmmm, you know, it looks like I'm getting the same insertion IDs even though the dims are different:
// for instance I have these dims:
// pseudocode here
val bounds = List(NumericRange [min=-82.0, max=-60.0], NumericRange [min=25.0, max=34.0], NumericRange [min=1.4019264E12, max=1.4019264E12], NumericRange [min=21.0, max=21.0])

val keys = index.getIndexStrategy().getInsertionIds(bounds).getFirstPartitionAndSortKeyPair
//>  keys.getLeft: List(4, 50, 48, 49, 52)
//> keys.getRight: List(122, -55)

// but I get the same result for
val bounds2 = List(NumericRange [min=-82.0, max=-60.0], NumericRange [min=25.0, max=34.0], NumericRange [min=1.4019264E12, max=1.4019264E12], NumericRange [min=22.0, max=22.0])

val keys2 = index.getIndexStrategy().getInsertionIds(bounds2).getFirstPartitionAndSortKeyPair
//>  keys2.getLeft: List(4, 50, 48, 49, 52)
//> keys2.getRight: List(122, -55)
is something wrong in my index configuration (i.e. the 4th dimension set incorrectly)?
rfecher
@rfecher
hmm, what's the min/max on that 4th dimension?
Grigory
@pomadchin
the first two are spatial, the 3rd is temporal and the 4th is just a custom BasicDimensionDefinition(minValue, maxValue)
rfecher
@rfecher
ie. how does it normalize the value 21 or 22
Grigory
@pomadchin
I think right now it is 21 / 100
i.e. 0.21 would be the normalized value for 21
Yep, double-checked - I limited this dimension definition to the 0 to 100 range
rfecher
@rfecher
hmm, I don't know the math on it, try calling XZOrderSFC.getId(<array of 4 doubles, normalized values in each dimension for the example above>)
oh, I mean 8 doubles
pairwise min and max for each dimension (normalized)
Grigory
@pomadchin
Hmmm will try it in an hour; thanks
Grigory
@pomadchin
ah I was quicker than I thought:
XZHierarchicalIndexStrategy::mins: [-82.0, 25.0, 1.3392E10, 21.0]
XZHierarchicalIndexStrategy::maxes: [-60.0, 34.0, 1.3392E10, 21.0]
XZHierarchicalIndexStrategy::xzId: [0, 0, 2, -124, -18, -18, -18, -15]

XZHierarchicalIndexStrategy::mins: [-82.0, 25.0, 1.3392E10, 22.0]
XZHierarchicalIndexStrategy::maxes: [-60.0, 34.0, 1.3392E10, 22.0]
XZHierarchicalIndexStrategy::xzId: [0, 0, 2, -124, -18, -18, -18, -15]
rfecher
@rfecher
and I really suspect that what you're seeing here is actually the math, but it would be best to double-check .... what we've found in our benchmarking is that while XZ is great in that it guarantees a single key given extents, it really loses specificity when the extents are highly irregular across dimensions (so it's great for polygons in lat/lon, but when you add to it, for example, an insertion time range which has no strong relation to lat/lon, i.e. a highly oblong hyper-rectangle, it really loses specificity in indexing)
we measured it and put out some of these numbers to give an idea here: http://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1027&context=foss4g
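A rough Python sketch of why the 4th dimension gets lost in the example above. It is a simplified stand-in for an XZ-style curve, not GeoWave's XZOrderSFC: it assumes one grid level is chosen for the whole box from its widest normalized side, and that longitude spans -180..180 and latitude -90..90. The function names and `max_level` are hypothetical. Time is left out because it is a single instant and does not affect the level choice.

```python
import math


def normalize(value: float, lo: float, hi: float) -> float:
    """Map value from [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)


def coarse_cell(mins, maxes, max_level=16):
    """Pick one grid level for the whole box; return (level, cell indices).

    Simplified assumption: the level is the deepest one whose cell width
    (0.5 ** level) is still at least the widest side of the box.
    """
    widest = max(hi - lo for lo, hi in zip(mins, maxes))
    if widest == 0:
        level = max_level  # a pure point can use the finest level
    else:
        level = min(max_level, math.floor(math.log2(1.0 / widest)))
    cells_per_dim = 2 ** level
    return level, tuple(int(lo * cells_per_dim) for lo in mins)


lon = (normalize(-82.0, -180, 180), normalize(-60.0, -180, 180))
lat = (normalize(25.0, -90, 90), normalize(34.0, -90, 90))
d21 = normalize(21.0, 0, 100)
d22 = normalize(22.0, 0, 100)

# Boxes with a spatial extent: the wide lat/lon sides force a coarse level.
key_box_21 = coarse_cell([lon[0], lat[0], d21], [lon[1], lat[1], d21])
key_box_22 = coarse_cell([lon[0], lat[0], d22], [lon[1], lat[1], d22])

# Pure points at the box corner: the finest level separates 21 from 22.
key_pt_21 = coarse_cell([lon[0], lat[0], d21], [lon[0], lat[0], d21])
key_pt_22 = coarse_cell([lon[0], lat[0], d22], [lon[0], lat[0], d22])
```

Under these assumptions both boxes land at level 4, where a cell is 1/16 of each axis, so 0.21 and 0.22 fall in the same cell; the two points get different cells.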
Grigory
@pomadchin
aouch
so what would you recommend here?
another curve?
rfecher
@rfecher
yeah, so it looks like the keys are expected ... well, it's really hard to say holistically, there are a lot of tradeoffs going on - I think I'd recommend a tiered index with max duplication set to something you find reasonable given your storage constraints
but the index specificity of XZ-order may not be a huge problem either if your query constraints are sufficiently tight in each dimension
remember there are 3 other dimensions that would help constrain query results in addition to that 4th dimension where you see the loss in specificity
Grigory
@pomadchin
Hm, yeah; it looks like it works tbh; if I add a dataId, it will filter everything correctly (I think)
So in this case it will lose information about the 4th dim and will do a scan through all selected rows?
rfecher
@rfecher
of course it depends on data distribution, but given reasonable constraints in all 4 dimensions you should be fine - and understand that we do "fine-grained" intersection in addition to the SFC key space, so you're not going to get false positives back all the way to your client, they will be filtered out within GeoWave
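A small Python sketch of the two-stage idea described above: a coarse key match followed by a fine-grained check on the actual values. This is an illustration under assumptions, not GeoWave's implementation; `Entry`, `coarse_key` and `query` are hypothetical names, and the coarse key here is just a tuple of grid cells.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    data_id: str
    coarse_key: tuple  # stand-in for the partition + sort key
    dim_value: float   # actual 4th-dimension value stored with the row


def query(entries, coarse_key, lo, hi):
    """Scan rows sharing the coarse key, then filter on the real value."""
    # Stage 1: key-space match; may include false positives.
    candidates = [e for e in entries if e.coarse_key == coarse_key]
    # Stage 2: fine-grained check removes them before results are returned.
    return [e for e in candidates if lo <= e.dim_value <= hi]


entries = [
    Entry("a", (4, 10, 3), 21.0),
    Entry("b", (4, 10, 3), 22.0),  # same coarse key as "a"
    Entry("c", (7, 2, 9), 21.0),   # different coarse key
]
hits = query(entries, (4, 10, 3), 21.0, 21.0)
```

Here `hits` contains only entry "a": entry "b" is scanned because it shares the key, but it is dropped by the value check.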
Grigory
@pomadchin
gotcha
that is cool
Okay, I'd like to do this the proper way now: so you recommend trying TieredSFCIndexFactory?
rfecher
@rfecher
but if you're likely to have extremely loose constraints in the other dimensions and expecting tight constraints in the 4th dimension to filter out all your keys you'd probably want to look at the tiered approach
do you only have ranges/extents in your spatial dimensions on insertion?
ie. no time ranges going in, and no range on that 4th dimension on insertion
Grigory
@pomadchin
eh, I have a time dimension but we can bound it (maybe :D)
rfecher
@rfecher
to be clear though, I am talking about the insertion value, not query
Grigory
@pomadchin
The key right now looks like this: (extent, timestamp, mydim)
rfecher
@rfecher
what I saw in your example is a single time for the entry, i.e. the image was collected at a certain time, as opposed to, for example, a track you want to index, which represents a start time and an end time
Grigory
@pomadchin
Ah, yes; I have a single time value per entry (at least for now)
rfecher
@rfecher
at least if you want to index the track properly
Grigory
@pomadchin
hmm
rfecher
@rfecher
yeah, so I'd definitely lean towards tiered if duplicating up to 4 times is ok, you have extents in 2 dimensions - generally I think it's going to be faster for this use case, but you could benchmark too if you'd like
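A rough Python sketch of where "up to 4 times" can come from with extents in 2 dimensions. It assumes a plain uniform grid per tier (2 ** level cells per axis) and counts the cells a box overlaps; this is not GeoWave's actual tier selection, and `overlapped_cells` is a hypothetical helper. Longitude is assumed to span -180..180 and latitude -90..90, and time is left out because it is a single instant.

```python
import itertools


def overlapped_cells(mins, maxes, level):
    """List every grid cell at the given level that the box touches."""
    cells_per_dim = 2 ** level
    per_dim = [
        range(int(lo * cells_per_dim), int(hi * cells_per_dim) + 1)
        for lo, hi in zip(mins, maxes)
    ]
    return list(itertools.product(*per_dim))


# Normalized box from the example: lon -82..-60, lat 25..34, 4th dim = 21.
mins = [(-82.0 + 180) / 360, (25.0 + 90) / 180, 21.0 / 100]
maxes = [(-60.0 + 180) / 360, (34.0 + 90) / 180, 21.0 / 100]

cells_level_4 = overlapped_cells(mins, maxes, 4)
cells_level_5 = overlapped_cells(mins, maxes, 5)
```

At level 4 the box straddles 2 cells in longitude and 2 in latitude, giving 4 cells; one level finer it touches 3 x 3 = 9. A cap on duplication would therefore stop at the coarser tier for this box.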
Grigory
@pomadchin
How would it compute keys in this case? Would they be unique?
(In case of duplicates) or would there just be entries with the same partition key?