    andrewchambers
    @andrewchambers
    the client would alert you
    however, if you want checksums on your repository data too - I considered adding more, but you could also just use the btrfs or zfs file systems
    those filesystems verify checksums on read automatically for all data
    andrewchambers
    @andrewchambers
    for my personal backups I use btrfs so I can use btrfs scrub
    or I can just read the data
    and it will check for errors
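    (a scrub here is e.g. btrfs scrub start /mnt/backups followed by btrfs scrub status /mnt/backups to see the result - the mount point is illustrative, not bupstash-specific)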
    andrewchambers
    @andrewchambers
    That being said, in the future I might add a data storage mode that adds parity information and checksums as an extra
    for now I would advise using a filesystem with checksums, or RAID
    tionis
    @tionis:matrix.org
    [m]
    Ok, thanks
    nh2
    @nh2:matrix.org
    [m]

    @andrewchambers: can you remind me, what exactly is bupstash's behaviour with small files, and does it depend on if they are in the same dir or not?
    In https://github.com/andrewchambers/bupstash/issues/26#issuecomment-730882871 you write

    small files are packed into single chunks

    but that is the only reference I can find on the Internet for this feature

    andrewchambers
    @andrewchambers
    @nh2:matrix.org sure
    so bupstash packs small files within the same directory into a chunk
    bupstash internally uses directory boundaries as a sort of deduplication boundary
    that helps the dedup resync
    as a consequence, it must split the chunk at directory boundaries and can't pack more small files into it
    I am considering other heuristics that can group directories
    technically it would be possible to pack many directories into a single chunk
    it doesn't affect the repository format - it is something we can change
    bupstash concatenates all file data into a giant stream
    and dedups that by just splitting it into chunks
    our index is another stream that stores file names and their offsets into the data stream
    it is quite different to every other backup tool I have seen
    they tend to make a far more complex structure
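    A minimal sketch of that stream-plus-index layout; IndexEntry and locate are illustrative names, not bupstash's actual types:

    /// One entry in the index stream: where a file's bytes live in the
    /// concatenated data stream (which dedup splits into chunks).
    struct IndexEntry {
        path: std::path::PathBuf,
        offset: u64, // start of this file's data in the data stream
        size: u64,   // length of this file's data
    }

    /// Locating one file is just an index lookup yielding (offset, size).
    fn locate(index: &[IndexEntry], want: &std::path::Path) -> Option<(u64, u64)> {
        index
            .iter()
            .find(|e| e.path.as_path() == want)
            .map(|e| (e.offset, e.size))
    }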
    andrewchambers
    @andrewchambers
    I imagine it would be hard for them to apply the same trick
    of course we have read amplification when you fetch a single file
    since you might grab the chunk with X other files in it too
    originally bupstash simply stored a deduplicated tarball
    however that doesn't support --pick which I considered super important
    anyway - kind of a tangent :P I want to do more documentation/blog posts on the bupstash design
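    A hypothetical illustration of that read amplification: to read one file you must fetch every chunk its byte range touches, and those chunks may carry other files' data too. Assuming chunk_ends holds each chunk's exclusive end offset in the data stream, sorted ascending:

    /// Returns the range of chunk indices overlapping [offset, offset + size).
    fn chunks_for_range(chunk_ends: &[u64], offset: u64, size: u64) -> std::ops::Range<usize> {
        // First chunk whose end is past `offset` contains the first byte.
        let first = chunk_ends.partition_point(|&end| end <= offset);
        // First chunk whose end reaches `offset + size` contains the last byte.
        let last = chunk_ends.partition_point(|&end| end < offset + size);
        first..(last + 1).min(chunk_ends.len())
    }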
    tionis
    @tionis:matrix.org
    [m]
    So bupstash unpacks the tarball on the server side and repacks it into its own format for deduplication with random access?
    andrewchambers
    @andrewchambers
    no
    it just uses its own format for everything except for 'bupstash get' where it generates the tarball on the client side
    it just keeps the .tar name as a default for historic reasons :P
    nh2
    @nh2:matrix.org
    [m]

    @andrewchambers: OK. I think sorting makes sense and is required to some extent on many file systems, because e.g. on CephFS, the directory order can change all the time, so deduplication would not be very good without a sort there.

    The reason I'm asking is that currently I am trying to back up my production CephFS. This one has 200 million files in 1 flat directory. I rsynced the folder to a ZFS on local disks with metadata on SSDs so that I can experiment faster.
    Bupstash OOMs on the 200M dir, during the phase where it does statx() on all the contents (the previous getdents() phase succeeds on it).
    So I am wondering whether the sort isn't happening on just the filenames (getdents() output), but on the stat results. And what I could do about it.

    nh2
    @nh2:matrix.org
    [m]

    @andrewchambers: In
    https://github.com/andrewchambers/bupstash/blob/b0c7b88353ebc0a7d5dff1e202098275ddd9fce9/src/indexer.rs#L549-L551
    it looks like it's getting all stat()s into a big vector and then does the sort based on the file names.

    That seems to use more memory than necessary: sizeof(struct stat64) == 144 on Linux.

    Couldn't we get just the file names, sort those, and then get the metadata in a streaming-parallel fashion afterwards?

    If yes, there's a caveat:
    On some file systems like CephFS, stat() information is stored next to the getdents() information (at least I think so from recent learnings, see https://tracker.ceph.com/issues/57062).
    That means while statting in directory order is very fast, statting in sorted order is very slow because it leads to massive read amplification (each stat() has to fetch a whole CephFS directory fragment and then discard all but 1 entry).
    (Ceph recently had a patch merged to avoid the insane read amplification, see https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HVNYI6Y3V4BZN3S3UFEHUMRCJBIHUEAH/, but it would still cost the network roundtrip.)
    That means that the current approach is better for such file systems; CephFS is the only one I know that does this so far, so bupstash may still want to change its approach here for normal local file systems.

    However, the current approach just doesn't work for single large dirs, and eventually, even the file names won't fit into memory.
    So I think a more thorough solution for large dirs would make the

    let mut dir_ents = self.metadata_fetcher.parallel_get_metadata(dir_ent_paths);

    not return Vec<(PathBuf, std::io::Result<CombinedMetadata>)>, but instead use something that can page the results to disk (e.g. sqlite, rocksdb, whatever embedded kv store).

    A simpler approach than an embedded KV store that can sort on disk would be to just write the Vec<(PathBuf, std::io::Result<CombinedMetadata>)> on disk in getdents() order (a simple mmap() might suffice), and then instead of sorting that Vec, sort an array of indices into it only.
    That would require only 8 bytes per entry, instead of 144 + len(filename) + memory allocator fragmentation overhead, thus reducing RES memory usage by ~25x.
    In my case filenames are ~40 bytes on average, so the above is ~200 bytes.
    200M files * 200 bytes = 40 GB for dir_ents.
    With the index array, it would be 1.6 GB.
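    A sketch of that index-array sort (illustrative code, not bupstash's): the records stay in one flat buffer in getdents() order - possibly an mmap'd file - and only 8-byte indices into it get sorted:

    /// Returns the order in which to visit `entries` without moving them.
    fn sorted_order<T, K: Ord>(entries: &[T], key: impl Fn(&T) -> K) -> Vec<u64> {
        let mut order: Vec<u64> = (0..entries.len() as u64).collect();
        // Sorting moves 8-byte indices instead of ~200-byte records.
        order.sort_by_key(|&i| key(&entries[i as usize]));
        order
    }

    Iterating order then visits the entries in sorted order at a cost of 8 bytes per entry, matching the 1.6 GB figure above.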

    nh2
    @nh2:matrix.org
    [m]

    https://stdrs.dev/nightly/x86_64-unknown-linux-gnu/std/sys/unix/fs/struct.FileAttr.html has

    pub struct FileAttr {
        stat: stat64,
        statx_extra_fields: Option<StatxExtraFields>,
    }

    so its sizeof is probably even larger: 144 + at least 24 = 168 B
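    For what it's worth, the 144-byte figure can be checked directly on a given target (FileAttr itself is private to std, but the libc crate exposes the raw struct):

    fn main() {
        // Prints 144 on x86_64-unknown-linux-gnu (libc crate required).
        println!("{}", std::mem::size_of::<libc::stat64>());
    }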

    nh2
    @nh2:matrix.org
    [m]
    I wrote this up in andrewchambers/bupstash#314
    tionis
    @tionis:matrix.org
    [m]
    Hi, when doing
    bupstash get --pick sync/Pictures id=63cd4ab38415c6a2457109e7744e6d1a | tar -xvf -
    I get the following error:
    bupstash serve: hash tree ended before requested range
    bupstash get: hash tree ended before requested range
    tar: Unexpected EOF in archive
    tar: Unexpected EOF in archive
    tar: Error is not recoverable: exiting now
    Is that a common error?
    Corruption of the underlying storage, or something else?
    andrewchambers
    @andrewchambers
    @tionis:matrix.org this shouldn't happen - I am very interested
    are you able to get the data without a --pick?
    andrewchambers
    @andrewchambers
    the --pick code was kind of tricky, so it could be a bug in that; if it is, it's really high priority for me
    I have a stress tester for that code
    andrewchambers
    @andrewchambers
    @tionis:matrix.org andrewchambers/bupstash#315
    I created a ticket
    let me know if you can get the data without --pick
    also I am interested in your setup
    andrewchambers
    @andrewchambers
    @tionis:matrix.org I am most interested in whether you can get your data without --pick
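    (i.e. the same command from above minus the pick arguments: bupstash get id=63cd4ab38415c6a2457109e7744e6d1a | tar -xvf -)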