    jochenhebbrecht
    @jochenhebbrecht
    $ du -sh *
    72G    blocks
    61G    input.pbf
    Ow, that's weird - yours is giving a better result :-)
    not sure why blocks is bigger than 61 GB. I believe your output is more logical
    Ángel Cervera Claudio
    @angelcervera
    ls -al does not give the size of the folder. :)
    jochenhebbrecht
    @jochenhebbrecht
    ok, that's why :-)
    jochenhebbrecht
    @jochenhebbrecht
    I removed the blocks folder and started all over again (just to be sure), but the result stays the same: the size of folder 'blocks' is bigger than the PBF ...
    Ángel Cervera Claudio
    @angelcervera
    Thinking about how that could be possible. So we are moving the problem from Spark to the core library. Interesting.
    Something is different in that PBF compared to the "official" PBF files.
    jochenhebbrecht
    @jochenhebbrecht
    Ok. So what would you suggest to do next to figure out what's going wrong?
    Ángel Cervera Claudio
    @angelcervera
    Let me think.
    I'm going to write a small app to generate a report of all blocks and sizes. Then, we can see which one differs.
    jochenhebbrecht
    @jochenhebbrecht
    Thanks, seems like a good idea.
    In parallel, would it be possible to iterate over the blocks and search for the block with the missing id?
    Ángel Cervera Claudio
    @angelcervera
    yes, give me 30 minutes to grab a coffee and to write it.
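    (A minimal sketch of what that script could look like, assuming the core library's BlobTupleIterator and EntityIterator.fromBlob APIs; the object name and report format are made up for illustration, so check the actual script in the branch.)

    import java.io.FileInputStream
    import com.acervera.osm4scala.{BlobTupleIterator, EntityIterator}
    import com.acervera.osm4scala.model.NodeEntity

    // Hypothetical sketch: walk the PBF block by block, print every block's
    // declared size, and flag the block that contains the missing node id.
    object BlocksReport {
      def main(args: Array[String]): Unit = {
        val targetId = 5103977631L // the node that goes missing
        val pbf = new FileInputStream(args(0))
        try {
          BlobTupleIterator.fromPbf(pbf).zipWithIndex.foreach { case ((header, blob), idx) =>
            // Only "OSMData" blobs carry entities; the "OSMHeader" blob does not.
            val hasTarget = header.`type` == "OSMData" &&
              EntityIterator.fromBlob(blob).exists {
                case n: NodeEntity => n.id == targetId
                case _             => false
              }
            println(s"block=$idx type=${header.`type`} datasize=${header.datasize} hasTarget=$hasTarget")
          }
        } finally pbf.close()
      }
    }

    (Summing the datasize column would also show whether the per-block sizes add up to something close to the input file, which speaks to the folder-size question above.)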
    jochenhebbrecht
    @jochenhebbrecht
    Great, thank you! :-)
    Ángel Cervera Claudio
    @angelcervera
    But the size issue is even stranger.
    The sizes should be the same.
    jochenhebbrecht
    @jochenhebbrecht
    Yes, that's indeed very strange - I agree
    Just a second, I'll try to do the same on a smaller PBF from Geofabrik, see how that one behaves
    $ du -sh *
    437M    belgium-latest.osm.pbf
    479M    blocks
    so here it seems we also don't have equal sizes
    Ángel Cervera Claudio
    @angelcervera
    Where are you running this? Linux, right?
    Ángel Cervera Claudio
    @angelcervera
    Testing the script with the full planet, looking for 5103977631. It will take a little bit of time.
    jochenhebbrecht
    @jochenhebbrecht

    Where are you running this? Linux, right?

    Yes, correct

    $ cat /etc/lsb-release 
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=20.04
    DISTRIB_CODENAME=focal
    DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"

    Testing the script with the full planet, looking for 5103977631. It will take a little bit of time.

    Ok, no problem. We're very curious to figure out why we always lose the block when reading data with osm4scala (while we don't seem to lose it when using other open source tools)

    Ángel Cervera Claudio
    @angelcervera
    Me too. It is really weird. Probably a bug.
    Ángel Cervera Claudio
    @angelcervera
    Ok, the script is working. Let me change a few things and I'll send you the branch
    jochenhebbrecht
    @jochenhebbrecht
    Cool, looking forward to it :-)
    Ángel Cervera Claudio
    @angelcervera
    Also, I'm curious about why the sizes differ. Which filesystem are you using? I'm using ext4
    jochenhebbrecht
    @jochenhebbrecht

    I'm actually on a disk which is encrypted:

    Disk /dev/mapper/supersecretdisk: 952.36 GiB, 1022575509504 bytes, 1997217792 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: dos
    Disk identifier: 0x6ea01bdc

    I wouldn't pay too much attention to the different folder sizes; it might be something related to my laptop. I think we can trust your BlockExtractions code

    Ángel Cervera Claudio
    @angelcervera
    Branch: bugfix/#91_node_gone
    The script is at examples/blockswithidextraction
    jochenhebbrecht
    @jochenhebbrecht
    thanks, I'll pull it in :-)
    Ángel Cervera Claudio
    @angelcervera
    Disklabel type: dos??? Is that FAT32?
    Could you execute lsblk -f to be sure?
    jochenhebbrecht
    @jochenhebbrecht
    └─                   LVM2_member   iuCwqr-VOb9-hRsy-M7hs-ivWc-Q1eE-aT9itv
      └─                 crypto_LUKS   da518ecb-2b17-4e92-b1b2-70c43bf2a532
        └─supersecretdisk  ext4        c4c1eb31-f2e0-421e-b35a-b8293c911e97   667G   25% /
    it's ext4
    Ángel Cervera Claudio
    @angelcervera
    Yes, but it is under an LVM system. Maybe there is something related to that.
    Could you send me that block with the header?
    jochenhebbrecht
    @jochenhebbrecht
    Yup, the script is running! :-) I'll let you know if I find it
    Ángel Cervera Claudio
    @angelcervera
    Bug fixed and pushed to the branch. This evening I will update the documentation and create the release.
    jochenhebbrecht
    @jochenhebbrecht
    Thanks!!! Amazing :-)
    Ángel Cervera Claudio
    @angelcervera
    Thanks for the catch. ;)
    jochenhebbrecht
    @jochenhebbrecht
    What I'll do is create a custom build from your branch, and I'll try to reproduce it locally!
    Normally, the problem should not appear anymore
    Ángel Cervera Claudio
    @angelcervera
    yes, the problem should not appear anymore. If it appears, it should be a different issue. com.acervera.osm4scala.spark.OsmPbfFormatWithSplitsSpec contains the unit test that I created to reproduce the error; it was failing before the fix.
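    (A minimal sketch of the kind of verification query mentioned below, using the Spark connector's "osm.pbf" format; the app name and path are hypothetical.)

    import org.apache.spark.sql.SparkSession

    // Hypothetical verification run: after building the JAR from the bugfix
    // branch, read the planet file through the connector and look for the
    // node that used to go missing. Exactly one row back means the fix works.
    object VerifyFix {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("osm4scala-verify-91")
          .getOrCreate()

        spark.read
          .format("osm.pbf")                   // osm4scala Spark connector
          .load("/data/planet-latest.osm.pbf") // hypothetical path
          .filter("id = 5103977631")
          .show(false)

        spark.stop()
      }
    }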
    jochenhebbrecht
    @jochenhebbrecht
    You're a really good engineer. I love the way you handle software problems :-)
    Ángel Cervera Claudio
    @angelcervera
    Thanks! Have a good evening.
    jochenhebbrecht
    @jochenhebbrecht
    The JAR has been built. I'm currently running the query
    Ángel Cervera Claudio
    @angelcervera
    You are good as well. Finding this type of error means good testing. Where did you find it? Part of a sanity check?
    jochenhebbrecht
    @jochenhebbrecht
    Hehe, thanks! :-)
    I believe we both deserve a good weekend ;-)!
    Ángel Cervera Claudio
    @angelcervera
    Yes, we do!
    jochenhebbrecht
    @jochenhebbrecht

    Hi @angelcervera . Hope you're doing fine! I have a new question. Don't worry, no bug. Just trying to understand the library. It's a question about how you distribute work in the cluster.

    So PBFs most likely contain data in an ordered way: nodes << ways << relations
    Looking at how you distribute the work, it seems possible that, for example, a task gets only relation data. But what if the relations point to nodes which the task doesn't have? How does everything come together?

    Ángel Cervera Claudio
    @angelcervera
    In Spark, parallelization of all file-based sources is managed by Spark (usually by Hadoop), so it is not managed by osm4scala. The osm4scala Spark connector takes chunks of the file from Hadoop and processes them; it is Hadoop that generates the list of chunks. For your case, you will need to read and then filter or categorize the entities (or whatever you need) before joining them. If you are going to read the same file a few times (for example, to create three dataframes with nodes, ways and relations), it is important to persist it to avoid parsing the file three times.
    Remember that this will not take more time and resources, because in the cluster the original pbf will not be in memory and the vectorized data will be distributed and stored on the node that will process it.
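    (A minimal sketch of that read-once / persist / split pattern; the path is hypothetical, and the 0/1/2 type encoding is an assumption to verify against the connector's documented schema.)

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object SplitByType {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("osm-split").getOrCreate()

        // Parse the PBF once and persist the vectorized rows, so the three
        // dataframes below do not each re-parse the file.
        val all = spark.read
          .format("osm.pbf")
          .load("/data/belgium-latest.osm.pbf") // hypothetical path
          .persist(StorageLevel.MEMORY_AND_DISK)

        // Assumed type encoding: 0 = node, 1 = way, 2 = relation.
        val nodes     = all.filter("type = 0")
        val ways      = all.filter("type = 1")
        val relations = all.filter("type = 2")

        println(s"nodes=${nodes.count()} ways=${ways.count()} relations=${relations.count()}")
        spark.stop()
      }
    }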
    Ángel Cervera Claudio
    @angelcervera
    If you are using HDFS, for example, every chunk is (usually) a data block. That way, data and task are on the same node, so there is no data transfer penalty (data locality).
    jochenhebbrecht
    @jochenhebbrecht
    Thanks @angelcervera . Clear answer!