@andrewchambers: can you remind me, what exactly is bupstash's behaviour with small files, and does it depend on if they are in the same dir or not?
In https://github.com/andrewchambers/bupstash/issues/26#issuecomment-730882871 you write "small files are packed into single chunks", but that is the only reference I can find on the Internet for this feature.
@andrewchambers: OK. I think sorting makes sense and is required to some extent on many file systems, because e.g. on CephFS, the directory order can change all the time, so deduplication would not be very good without a sort there.
The reason I'm asking is that currently I am trying to back up my production CephFS. This one has 200 million files in 1 flat directory. I rsynced the folder to a ZFS on local disks with metadata on SSDs so that I can experiment faster.
Bupstash OOMs on the 200M dir, during the phase where it does statx() on all the contents (the previous getdents() phase succeeds on it).
So I am wondering whether the sort happens not just on the filenames (the getdents() output) but on the stat results, and what I could do about it.
It looks like it collects all stat() results into a big vector and then sorts based on the file names. That seems to use an unnecessary amount of memory: sizeof(struct stat64) == 144 on Linux.
Couldn't we get just the file names, sort those, and then get the metadata in a streaming-parallel fashion afterwards?
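Roughly what I have in mind, as a minimal sketch (walk_dir_streaming, BATCH and the use of rayon are my placeholders, not bupstash code):

```rust
use std::fs;
use std::path::{Path, PathBuf};

use rayon::prelude::*; // assumption: rayon for the bounded-parallel stat phase

const BATCH: usize = 1024; // tunable; bounds how many stat results are in memory at once

fn walk_dir_streaming(
    dir: &Path,
    mut emit: impl FnMut(PathBuf, std::io::Result<fs::Metadata>),
) -> std::io::Result<()> {
    // Phase 1: names only (roughly len(filename) bytes per entry).
    let mut names: Vec<PathBuf> = fs::read_dir(dir)?
        .map(|entry| entry.map(|e| e.path()))
        .collect::<Result<_, _>>()?;
    names.sort(); // deterministic order, so deduplication stays stable across runs

    // Phase 2: stat in sorted order, one bounded batch at a time.
    for chunk in names.chunks(BATCH) {
        let metas: Vec<std::io::Result<fs::Metadata>> = chunk
            .par_iter()
            .map(|p| fs::symlink_metadata(p)) // lstat(); stand-in for the statx() bupstash does
            .collect();
        for (path, meta) in chunk.iter().cloned().zip(metas) {
            emit(path, meta);
        }
    }
    Ok(())
}
```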
If yes, there's a caveat:
On some file systems like CephFS, stat() information is stored next to the getdents() information (at least I think so from recent learnings, see https://tracker.ceph.com/issues/57062). That means that while statting in directory order is very fast, statting in sorted order is very slow because it leads to massive read amplification (each stat() has to fetch a whole CephFS directory fragment and then discard all but one entry).
(Ceph recently had a patch merged to avoid the insane read amplification, see https://email@example.com/thread/HVNYI6Y3V4BZN3S3UFEHUMRCJBIHUEAH/, but it would still cost the network roundtrip.)
That means that the current approach is better for such file systems; CephFS is the only one I know that does this so far, so bupstash may still want to change its approach here for normal local file systems.
However, the current approach just doesn't work for single large dirs, and eventually, even the file names won't fit into memory.
So I think a more thorough solution for large dirs would make

let mut dir_ents = self.metadata_fetcher.parallel_get_metadata(dir_ent_paths);

not return a Vec<(PathBuf, std::io::Result<CombinedMetadata>)>, but instead use something that can page the results to disk (e.g. sqlite, rocksdb, whatever embedded KV store).
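For illustration, a rough sketch of what that could look like, assuming sled as the embedded store (sqlite or rocksdb would serve the same purpose); spool_entry and replay_sorted are made-up names:

```rust
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

// e.g. let db = sled::open(scratch_dir.join("dirents"))?;  (scratch_dir is hypothetical)

fn spool_entry(db: &sled::Db, path: &Path, meta_blob: &[u8]) -> sled::Result<()> {
    // Key = file name bytes, value = the serialized metadata (placeholder blob here),
    // so nothing has to stay resident in RAM between the two phases.
    db.insert(path.as_os_str().as_bytes(), meta_blob)?;
    Ok(())
}

fn replay_sorted(db: &sled::Db) -> sled::Result<()> {
    // sled iterates in key order, i.e. already sorted by file name.
    for item in db.iter() {
        let (name, meta_blob) = item?;
        // ...hand (name, meta_blob) to the chunker/uploader here...
        let _ = (name, meta_blob);
    }
    Ok(())
}
```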
A simpler approach than an embedded KV store that can sort on disk would be to just write the Vec<(PathBuf, std::io::Result<CombinedMetadata>)> to disk in getdents() order (a simple mmap() might suffice), and then, instead of sorting that Vec, sort only an array of indices into it.
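A minimal sketch of that index-array trick (the record layout, name_at and sorted_offsets are made up for illustration, and memmap2 is an assumed dependency):

```rust
use std::fs::File;

use memmap2::Mmap; // assumption: memmap2 for the read-only mapping

const STAT_BLOB_LEN: usize = 144; // sizeof(struct stat64) on Linux, as above

// Record layout I'm assuming for the spool file, written in getdents() order:
// [u32 name_len][name bytes][STAT_BLOB_LEN bytes of metadata]
fn name_at(buf: &[u8], off: usize) -> &[u8] {
    let len = u32::from_le_bytes(buf[off..off + 4].try_into().unwrap()) as usize;
    &buf[off + 4..off + 4 + len]
}

fn sorted_offsets(spool: &File) -> std::io::Result<Vec<u64>> {
    let buf = unsafe { Mmap::map(spool)? };

    // One pass over the spool, keeping only an 8-byte offset per record in RAM.
    let mut offsets: Vec<u64> = Vec::new();
    let mut off = 0usize;
    while off < buf.len() {
        offsets.push(off as u64);
        let name_len = u32::from_le_bytes(buf[off..off + 4].try_into().unwrap()) as usize;
        off += 4 + name_len + STAT_BLOB_LEN;
    }

    // Sort the index array by the names it points at; the records themselves
    // stay on disk and are only paged in while they are being compared.
    offsets.sort_by(|&a, &b| name_at(&buf, a as usize).cmp(name_at(&buf, b as usize)));
    Ok(offsets)
}
```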
That would require only 8 bytes per entry, instead of 144 + len(filename) + memory allocator fragmentation overhead, thus reducing RES memory usage by ~25x. In my case filenames are ~40 bytes on average, so the above is ~200 bytes per entry. 200M files * 200 bytes = 40 GB for the current approach; with the index array, it would be 1.6 GB.
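Back-of-the-envelope check of those numbers (the 16 bytes of allocator overhead per entry is a rough guess on my part):

```rust
fn main() {
    let entries = 200_000_000f64;
    let per_entry_now = 144.0 + 40.0 + 16.0; // stat64 + avg filename + rough allocator overhead ≈ 200 B
    let per_index = 8.0;                     // one offset into the on-disk spool
    println!("current:     {:.1} GB", entries * per_entry_now / 1e9); // ~40.0 GB
    println!("index array: {:.1} GB", entries * per_index / 1e9);     // 1.6 GB
}
```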