@andrewchambers: OK. I think sorting makes sense and is required to some extent on many file systems, because e.g. on CephFS, the directory order can change all the time, so deduplication would not be very good without a sort there.
The reason I'm asking is that currently I am trying to back up my production CephFS. This one has 200 million files in 1 flat directory. I rsynced the folder to a ZFS on local disks with metadata on SSDs so that I can experiment faster.
Bupstash OOMs on the 200M dir, during the phase where it does `statx()` on all the contents (the previous `getdents()` phase succeeds on it).
So I am wondering whether the sort happens not just on the file names (the `getdents()` output) but on the full stat results, and what I could do about it.
@andrewchambers: In
https://github.com/andrewchambers/bupstash/blob/b0c7b88353ebc0a7d5dff1e202098275ddd9fce9/src/indexer.rs#L549-L551
it looks like it collects all `stat()` results into a big vector and then sorts it by file name.
That seems to use unnecessarily much memory: `sizeof(struct stat64) == 144` on Linux.
Couldn't we get just the file names, sort those, and then get the metadata in a streaming-parallel fashion afterwards?
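Roughly something like this sketch (names like `index_sorted` and `handle_batch` are made up, and the real code would go through bupstash's parallel metadata fetcher rather than a plain loop):

```rust
use std::fs;
use std::path::PathBuf;

// Sketch only: sort the names from getdents() first, then stat in bounded
// batches so only one batch of metadata is in memory at a time.
fn index_sorted(mut names: Vec<PathBuf>, batch_size: usize) {
    names.sort(); // only the names are held here, no stat data yet

    for chunk in names.chunks(batch_size) {
        // In bupstash this would be the parallel metadata fetcher;
        // a serial loop is just for illustration.
        let metas: Vec<(&PathBuf, std::io::Result<fs::Metadata>)> = chunk
            .iter()
            .map(|p| (p, fs::symlink_metadata(p)))
            .collect();
        handle_batch(&metas); // hypothetical consumer that writes index entries
    }
}

fn handle_batch(_batch: &[(&PathBuf, std::io::Result<fs::Metadata>)]) {
    // ... feed the batch into the index writer ...
}
```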
If yes, there's a caveat:
On some file systems like CephFS, the `stat()` information is stored next to the `getdents()` information (at least I think so from recent learnings, see https://tracker.ceph.com/issues/57062).
That means that while statting in directory order is very fast, statting in sorted order is very slow because it leads to massive read amplification (each `stat()` has to fetch a whole CephFS directory fragment and then discard all but 1 entry).
(Ceph recently had a patch merged to avoid the insane read amplification, see https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HVNYI6Y3V4BZN3S3UFEHUMRCJBIHUEAH/, but it would still cost the network round trip.)
So the current approach is better for such file systems; CephFS is the only one I know of that does this, so bupstash may still want to change its approach for normal local file systems.
However, the current approach just doesn't work for single large dirs, and eventually, even the file names won't fit into memory.
So I think a more thorough solution for large dirs would make

```rust
let mut dir_ents = self.metadata_fetcher.parallel_get_metadata(dir_ent_paths);
```

not return a `Vec<(PathBuf, std::io::Result<CombinedMetadata>)>`, but instead use something that can page the results to disk (e.g. sqlite, rocksdb, whatever embedded KV store).
A simpler approach than an embedded KV store that can sort on disk would be to just write the `Vec<(PathBuf, std::io::Result<CombinedMetadata>)>` to disk in `getdents()` order (a simple `mmap()` might suffice), and then, instead of sorting that `Vec`, sort only an array of indices into it.
That would require only 8 bytes per entry, instead of 144 + len(filename) + memory allocator fragmentation overhead, thus reducing RES memory usage by ~25x.
In my case file names are ~40 bytes on average, so the above is ~200 bytes per entry.
200M files * 200 bytes = 40 GB for `dir_ents`.
With the index array, it would be 1.6 GB.
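A minimal sketch of the index-sort part (in-memory here for brevity; in the real thing the entry table would live in the mmap'd file and the metadata type would be bupstash's `CombinedMetadata`):

```rust
use std::path::PathBuf;

// Hypothetical record, stored in getdents() order (on disk / mmap in the real version).
struct Entry {
    name: PathBuf,
    meta: std::io::Result<std::fs::Metadata>, // stand-in for CombinedMetadata
}

// Sort 8-byte indices by entry name instead of moving the entries themselves.
fn sorted_order(entries: &[Entry]) -> Vec<u64> {
    let mut order: Vec<u64> = (0..entries.len() as u64).collect();
    order.sort_unstable_by(|&a, &b| entries[a as usize].name.cmp(&entries[b as usize].name));
    order
}
```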
https://stdrs.dev/nightly/x86_64-unknown-linux-gnu/std/sys/unix/fs/struct.FileAttr.html has

```rust
pub struct FileAttr {
    stat: stat64,
    statx_extra_fields: Option<StatxExtraFields>,
}
```

so the sizeof of that is probably even larger: 144 + at least 24 = 168 B.
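For reference, the 144-byte part can be checked like this (x86_64 Linux, using the `libc` crate; the statx extras are private to std, so only the `stat64` half is measured here):

```rust
fn main() {
    // Prints 144 on x86_64 Linux/glibc; FileAttr's Option<StatxExtraFields> comes on top.
    println!("sizeof(struct stat64) = {}", std::mem::size_of::<libc::stat64>());
}
```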
With `bupstash get --pick sync/Pictures id=63cd4ab38415c6a2457109e7744e6d1a | tar -xvf -` I get the following error:

```
bupstash serve: hash tree ended before requested range
bupstash get: hash tree ended before requested range
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
```
@andrewchambers: btw, I am very convinced that you need prefix directories for the files in the bupstash repo.
Here are my latest learnings:
- `ext4` will just return `ENOSPC` (`no space left on device`) by default, even at "small" directory sizes (e.g. 8M files on one of my servers). This is because it uses 32-bit hashes for its `dir_index` lookup feature (a default option). A hash collision when writing a new file will immediately give `ENOSPC`. This can be addressed with the `large_dir` option (not default), but that was buggy and led to corruption until recently (http://www.voxelsoft.com/2021/ext4_large_dir_corruption.html, great page btw, has some gems). Linux fixed it, but introduced another corruption that I reported to the author of that article, so it is now also mentioned there (after "kernel developers apply naïve fix anyway"). This means a bupstash repo on ext4 will just fail.
- `getdents()` is inherently sequential. This means rsyncing the repo to somewhere else cannot be parallelised with flat dirs.
- For my CephFS storage with the 200M write-once files, I moved off a flat directory to a prefixed approach with 3 letters of base57 chars, thus ~180k prefix dirs. Performance improved a lot due to the avoidance of the above-mentioned lock contention.
Kopia uses `a/abc/abce123...` for its repo. I think this 1+3 approach (or 1+N in general) is even better because it further removes lock contention on the suffix dirs at very low cost (the 1-dirs are always cached in memory).
Like bupstash, Kopia uses base16 chars (hex). But Kopia uses packfiles and thus stores ~10x fewer files.
So I think that 1+3 is too small for bupstash's many files, and 1+4 or 1+5 would be better.
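To illustrate, a 1+4 layout for a hex-named repo file could look roughly like this (hypothetical helper, not bupstash's actual layout code; with hex names this gives 16^4 = 65,536 second-level dirs):

```rust
use std::path::{Path, PathBuf};

// "deadbeef1234..." -> "<repo>/d/dead/deadbeef1234..."
fn prefixed_path(repo_root: &Path, hex_name: &str) -> PathBuf {
    assert!(hex_name.len() >= 4, "expected a full hex object name");
    repo_root
        .join(&hex_name[..1])
        .join(&hex_name[..4])
        .join(hex_name)
}
```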
Do you think prefix directories would be easy to implement?
I think it would make the backup of my production servers viable.
`/nix/store` only has 95,056 files, and even `/nix/store/.links` (by far the largest on my system) "only" has 1.5 M. I guess you have a somewhat skewed definition of "large folder" :D