    andrewchambers
    @andrewchambers
for now I would advise using a filesystem with checksums or RAID
    tionis
    @tionis:matrix.org
    [m]
    Ok, thanks
    nh2
    @nh2:matrix.org
    [m]

    @andrewchambers: can you remind me, what exactly is bupstash's behaviour with small files, and does it depend on if they are in the same dir or not?
    In https://github.com/andrewchambers/bupstash/issues/26#issuecomment-730882871 you write

    small files are packed into single chunks

    but that is the only reference I can find on the Internet for this feature

    andrewchambers
    @andrewchambers
    @nh2:matrix.org sure
    so bupstash packs small files within the same directory into a chunk
    bupstash internally uses directory boundaries as a sort of deduplication boundary
    that helps the dedup resync
    as a consequence it means it must split the chunk at directory boundaries and isn't able to pack more small files into it
    I am considering other heuristics that can group directories
    technically it would be possible to pack many directories into a single chunk
    it doesn't affect the repository format - it is something we can change
    bupstash concatenates all file data into a giant stream
    and dedups that by just splitting it into chunks
    our index is another stream that stores file names and their offsets into the data stream
    it is quite different to every other backup tool I have seen
    they tend to make a far more complex structure
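    A minimal sketch of that layout, to make it concrete (struct names and fields here are illustrative assumptions, not bupstash's actual serialization format):

    // Illustrative only: one deduplicated data stream plus an index stream of
    // paths and offsets into it.

    /// One index entry: where a file's bytes live in the concatenated data stream.
    struct IndexEntry {
        path: std::path::PathBuf,
        offset: u64, // byte offset into the data stream
        size: u64,   // length of this file's data
    }

    /// The data stream is split into chunks (content-defined in practice) and
    /// each chunk is stored once, keyed by a hash of its contents.
    struct Chunk {
        address: [u8; 32], // hash of the chunk contents
        data: Vec<u8>,
    }

    /// Restoring a single file only needs the chunks covering this byte range;
    /// those chunks may also hold bytes of neighbouring files, which is the
    /// read amplification mentioned below.
    fn byte_range(entry: &IndexEntry) -> std::ops::Range<u64> {
        entry.offset..entry.offset + entry.size
    }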
    andrewchambers
    @andrewchambers
    I imagine it would be hard for them to apply the same trick
    of course we have read amplification when you fetch a single file
    since you might grab the chunk with X other files in it too
    originally bupstash simply stored a deduplicated tarball
    however that doesn't support --pick which I considered super important
    anyway - kind of a tangent :P I want to do more documentation/blog posts on the bupstash design
    tionis
    @tionis:matrix.org
    [m]
    So bupstash unpacks the tarball on the server side and repacks it into its own format for deduplication with random access?
    andrewchambers
    @andrewchambers
    no
    it just uses its own format for everything except for 'bupstash get' where it generates the tarball on the client side
    it just keeps the .tar name as a default for historic reasons :P
    nh2
    @nh2:matrix.org
    [m]

    @andrewchambers: OK. I think sorting makes sense and is required to some extent on many file systems, because e.g. on CephFS, the directory order can change all the time, so deduplication would not be very good without a sort there.

    The reason I'm asking is that currently I am trying to back up my production CephFS. This one has 200 million files in 1 flat directory. I rsynced the folder to a ZFS on local disks with metadata on SSDs so that I can experiment faster.
    Bupstash OOMs on the 200M dir, during the phase where it does statx() on all the contents (the previous getdents() phase succeeds on it).
    So I am wondering whether the sort happens not just on the filenames (getdents() output) but on the full stat results, and what I could do about it.

    nh2
    @nh2:matrix.org
    [m]

    @andrewchambers: In
    https://github.com/andrewchambers/bupstash/blob/b0c7b88353ebc0a7d5dff1e202098275ddd9fce9/src/indexer.rs#L549-L551
    it looks like it collects all stat() results into a big vector and then sorts it by file name.

    That seems to use more memory than necessary: sizeof(struct stat64) == 144 on Linux.

    Couldn't we get just the file names, sort those, and then get the metadata in a streaming-parallel fashion afterwards?

    If yes, there's a caveat:
    On some file systems like CephFS, stat() information is stored next to the getdents() information (at least I think so from recent learnings, see https://tracker.ceph.com/issues/57062).
    That means while statting in directory order is very fast, statting in sorted order is very slow because it leads to massive read amplification (each stat() has to fetch a whole CephFS directory fragment and then discard all but 1 entry).
    (Ceph recently had a patch merged to avoid the insane read amplification, see https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HVNYI6Y3V4BZN3S3UFEHUMRCJBIHUEAH/, but it would still cost the network roundtrip.)
    That means that the current approach is better for such file systems; CephFS is the only one I know that does this so far, so bupstash may still want to change its approach here for normal local file systems.

    However, the current approach just doesn't work for single large dirs, and eventually, even the file names won't fit into memory.
    So I think a more thorough solution for large dirs would make the

    let mut dir_ents = self.metadata_fetcher.parallel_get_metadata(dir_ent_paths);

    not return Vec<(PathBuf, std::io::Result<CombinedMetadata>)>, but instead use something that can page the results to disk (e.g. sqlite, rocksdb, whatever embedded kv store).

    A simpler approach than an embedded KV store that can sort on disk would be to just write the Vec<(PathBuf, std::io::Result<CombinedMetadata>)> on disk in getdents() order (a simple mmap() might suffice), and then instead of sorting that Vec, sort an array of indices into it only.
    That would require only 8 bytes per entry, instead of 144 + len(filename) + memory allocator fragmentation overhead, thus reducing RES memory usage by ~25x.
    In my case filenames are ~40 bytes on average, so the above is ~200 bytes.
    200M files * 200 bytes = 40 GB for dir_ents.
    With the index array, it would be 1.6 GB.
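    A minimal sketch of the index-array idea (illustrative, not bupstash code; the length-prefixed record layout is a hypothetical assumption):

    // `records` is the on-disk / mmap'd byte region holding the serialized
    // (file name, metadata) entries in getdents() order; `offsets[i]` is where
    // the i-th record starts.

    fn name_at(records: &[u8], off: usize) -> &[u8] {
        // Hypothetical record layout: 2-byte little-endian name length, then the name bytes.
        let len = u16::from_le_bytes([records[off], records[off + 1]]) as usize;
        &records[off + 2..off + 2 + len]
    }

    fn sorted_order(records: &[u8], offsets: &[usize]) -> Vec<usize> {
        // Only this Vec of indices (8 bytes each on 64-bit) lives in RAM; the
        // records themselves stay on disk / in the page cache.
        let mut order: Vec<usize> = (0..offsets.len()).collect();
        order.sort_by(|&a, &b| name_at(records, offsets[a]).cmp(name_at(records, offsets[b])));
        order
    }

    For 200M entries the index array is 200M * 8 B = 1.6 GB, matching the estimate above, while the ~200-byte records never have to be resident all at once.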

    nh2
    @nh2:matrix.org
    [m]

    https://stdrs.dev/nightly/x86_64-unknown-linux-gnu/std/sys/unix/fs/struct.FileAttr.html has

    pub struct FileAttr {
        stat: stat64,
        statx_extra_fields: Option<StatxExtraFields>,
    }

    so the sizeof of that is probably even larger: 144 + at least 24 = 168 B
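    A tiny sketch, assuming the libc crate, to check the 144-byte figure on a given target; FileAttr itself is private to std, so only the underlying stat64 is measured here:

    fn main() {
        // Prints 144 on x86_64 Linux; the Option<StatxExtraFields> comes on top of this.
        println!("sizeof stat64 = {}", std::mem::size_of::<libc::stat64>());
    }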

    nh2
    @nh2:matrix.org
    [m]
    I wrote this up in andrewchambers/bupstash#314
    tionis
    @tionis:matrix.org
    [m]
    Hi, when doing a bupstash get --pick sync/Pictures id=63cd4ab38415c6a2457109e7744e6d1a | tar -xvf - I get the following error:
    bupstash serve: hash tree ended before requested range
    bupstash get: hash tree ended before requested range
    tar: Unexpected EOF in archive
    tar: Unexpected EOF in archive
    tar: Error is not recoverable: exiting now
    Is that a common error?
    Corruption of the underlying storage, or something else?
    andrewchambers
    @andrewchambers
    @tionis:matrix.org this shouldn't happen, I am very interested
    are you able to get the data without a --pick?
    andrewchambers
    @andrewchambers
    the --pick code was kind of tricky, so it could be a bug in that; if it is, it is really high priority for me
    I have a stress tester for that code
    andrewchambers
    @andrewchambers
    @tionis:matrix.org andrewchambers/bupstash#315
    I created a ticket
    let me know if you can get the data without --pick
    also I am interested in your setup
    andrewchambers
    @andrewchambers
    @tionis:matrix.org I am most interested in if you can get your data without --pick
    nh2
    @nh2:matrix.org
    [m]

    @andrewchambers: btw, I am very convinced that you need prefix directories for the files in the bupstash repo.

    Here are my latest learnings:

    • ext4 will just return ENOSPC (no space left on device) by default, even at "small" directory sizes (e.g. 8M files on one of my servers). This is because it uses 32-bit hashes for its dir_index lookup feature (a default option). A hash collision when writing a new file immediately gives ENOSPC. This can be addressed with the large_dir option (not default), but that was buggy and led to corruption until recently (http://www.voxelsoft.com/2021/ext4_large_dir_corruption.html, great page btw, has some gems). Linux fixed it, but introduced another corruption, which I reported to the author of that article, so it is now also mentioned there (after "kernel developers apply naïve fix anyway"). This means a bupstash repo on ext4 will just fail.
    • Ceph has dirfrags intended to support large dirs, but still suffers from various issues: apparent locking between writes and reads, making reads stall.
    • ZFS: Seems to work OK.
    • XFS: Not tried.
    • Any large flat directory on Linux makes parallel file enumeration impossible, as getdents() is inherently sequential. This means rsyncing the repo elsewhere cannot be parallelised with flat dirs.

    For my CephFS storage with the 200M write-once files, I moved from a flat directory to a prefixed approach with 3 letters of base57 chars, thus 180k prefix dirs. Performance improved a lot due to avoiding the above-mentioned lock contention.

    Kopia uses a/abc/abce123... for its repo. I think this 1+3 approach (or 1+N in general) is even better because it further removes lock contention on the suffix dirs at very low cost (the 1-dirs are always cached in memory).
    Like bupstash, Kopia uses base16 chars (hex). But Kopia uses packfiles and thus has ~10x fewer files stored.
    So I think that 1+3 is too small for bupstash's many files, and 1+4 or 1+5 is better.

    Do you think prefix directories would be easy to implement?
    I think it would make the backup of my production servers viable.
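    A minimal sketch of such a 1+N layout for addressed files in the repository data dir (prefixed_path is a hypothetical helper, not bupstash's current layout):

    use std::path::{Path, PathBuf};

    /// e.g. n = 3 maps "abc123..." to data_dir/a/abc/abc123...
    fn prefixed_path(data_dir: &Path, hex_addr: &str, n: usize) -> PathBuf {
        data_dir
            .join(&hex_addr[..1]) // 1-char top-level dir, always hot in cache
            .join(&hex_addr[..n]) // n-char second-level dir sharing the prefix
            .join(hex_addr)       // full address as the file name
    }

    With hex addresses, 1+4 gives 16^4 = 65,536 second-level dirs (roughly 3,000 files per dir at 200M stored chunks); 1+5 gives about 1M.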

    piegames
    @piegames:matrix.org
    [m]
    nh2: That's really interesting, I wonder if it has some implications for Nix? (Because /nix/store is one huge flat structure)
    andrewchambers
    @andrewchambers
    @nh2:matrix.org yeah it's relatively easy
    also, I am eventually going to implement/open source my s3 plugin
    I have a version that doesn't use the data dir, instead using s3 - however it has a bunch of extra stuff relevant only to bupstash.io
    piegames
    @piegames:matrix.org
    [m]
    Oh well my /nix/store only has 95,056 files and even /nix/store/.links (by far the largest on my system) "only" has 1.5 M. I guess you have a somewhat skewed definition of "large folder" :D
    andrewchambers
    @andrewchambers
    in theory some of the OSS s3 storage engines can handle 200M objects in a flat namespace