@adrian:lisas.de, I ran my experiment again. The used size and buff/cache size both increase during dumping, and stay at that high value until we kill the process. I ran echo 1 > /proc/sys/vm/drop_caches. This reduced the buff/cache space, but not the used space.
I think this is something other than the write buffers.
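Roughly the sequence I mean (a sketch, not my exact commands; the drop_caches write needs root):
free -m                               # note "used" and "buff/cache" before the dump
# ... run the criu dump here ...
free -m                               # both values have gone up
echo 1 > /proc/sys/vm/drop_caches     # drops the page cache only
free -m                               # buff/cache shrinks again, "used" does not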
Anyway, I have a minimal repro of the issue: a Python script (I ran it with python3 script.py):
from time import sleep
import os

a = 10000
b = 10000
# roughly 800 MB of Python objects, so the memory usage is easy to see
array_ab = [['?' for i in range(a)] for j in range(b)]

for var in list(range(5)):
    n = os.fork()
    if n > 0:
        print("Parent process: ", n)  # n is the forked child's pid
    else:
        break

while True:
    print("Hello")
    sleep(5)
    print("World")
The memory usage of this script is around 800 MB. After checkpointing with sudo criu dump -j -t <pid> --tcp-established --ghost-limit=9999999999 --leave-running --file-locks --images-dir /mnt/image0 and running echo 3 > /proc/sys/vm/drop_caches, the buff/cache goes down and the used still stays high.
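Put together, the full repro sequence I use (a sketch; script.py is the script above, the exact pid will differ, run as root where needed):
python3 script.py &                        # the repro script above, ~800 MB resident
pid=$(pgrep -f script.py | head -n 1)      # pick one of the forked processes
free -m                                    # baseline
sudo criu dump -j -t "$pid" --tcp-established --ghost-limit=9999999999 --leave-running --file-locks --images-dir /mnt/image0
echo 3 > /proc/sys/vm/drop_caches          # drop page cache and slab caches
free -m                                    # buff/cache goes down, "used" stays high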
--tcp-established --ghost-limit=9999999999 --file-locks because those options seem unnecessary
I'm running into issues with this code in pstree.c: https://github.com/checkpoint-restore/criu/blob/04f8368eaee2b29bb92ff0ba4f5c43501408d15e/criu/pstree.c#L372-L412
Basically I see Migrating process tree (SID 8->7), which results in 7 being added to some set of PIDs, followed by Migrating process tree (GID 8->7), which fails with Error (criu/pstree.c:404): Current gid 7 intersects with pid (255) in images because it actually collides with the 7 that was just added. The code doesn't make much sense to me; if the sid/gid are initially the same, they shouldn't really be considered as colliding after migration, should they?
I believe that in some cases the phrase "shouldn't be considered as colliding" can be arguable. Imagine that when you dump the process it has both sid=10 and pgid=20 external (no process with pid=10 or pid=20 in the dumped subtree of processes), but sid and pgid are different, and on restore you try to rewrite (10,20) with (30,30) because the process calling restore happened to have the same sid=pgid=30. Changing 10->30 and 20->30 can be considered wrong because we convert a separate process group into a session-initial process group; this may also affect signal behavior after restore. So I'm not sure; probably we should restrict restoring with --shell-job if the inherited sid/pgid topology does not match exactly.
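A quick way to see the kind of topology I mean (a sketch, assuming an interactive shell that is its own session leader):
sleep 300 &                               # a background job gets its own process group
ps -o pid,ppid,pgid,sid,comm -p $!        # inspect the job's ids
# typically pgid == the sleep's pid while sid == the shell's pid, so sid != pgid;
# rewriting both to the restorer's single sid=pgid value would merge this separate
# group into a session-initial one, which is the concern above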
Hello!
Do you have any experience in trying to checkpoint/restart MPI processes? In particular, I am stumbling across this error in the restore.log:
56227: Error (criu/files-reg.c:1831): Can't open file dev/shm/vader_segment.box.54c60001.0 on restore: No such file or directory
56227: Error (criu/files-reg.c:1767): Can't open file dev/shm/vader_segment.box.54c60001.0: No such file or directory
56227: Error (criu/mem.c:1383): `- Can't open vma
Error (criu/cr-restore.c:2397): Restoring FAILED.
I have --tcp-established enabled, and to me it seems like the problem is in restoring the open connection. For more context, the process I am trying to restart is an MPI process; I dump it using --leave-running and then restore it, but it does not seem to work and I get the error above. Any suggestion on how to investigate/fix this problem?
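The missing path looks to me like an Open MPI shared-memory backing file under /dev/shm (an assumption on my side); a quick way to check whether it is still present on the restore host:
ls -l /dev/shm/ | grep vader_segment      # does the segment file CRIU wants still exist?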
First of all thank you so much for the answer!
I read various issues on this topic (I have been investigating this for a few weeks), such as https://github.com/checkpoint-restore/criu/issues/1247#issuecomment-717903614, but what's not clear to me is this: I am actually trying to restore the process on the same host, so even if shared memory was actually used, I don't get why MPI cannot use the same files.
Anyway, I will try to solve this by:
A: loading the MPI nodes in separate Docker environments (yes, I have the Docker overhead, but I get CRIU to checkpoint a Docker image rather than an MPI process with all the related complications, which should be much better)
B: changing the local process communication - I will try setting the MCA btl to tcp (sketch below) and check what happens
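For option B, a hedged sketch assuming Open MPI (the application name and rank count are placeholders):
# force local communication over TCP (plus the self loopback) instead of the
# shared-memory BTL, so no /dev/shm/vader_segment.* files are involved
mpirun --mca btl self,tcp -np 4 ./my_mpi_app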
One container for each rank will not work.
Why will this not work? What I am trying to do is checkpoint/restore a single rank (I am just now reading you saying that it is impossible :( ), to enable checkpoint-restore in highly parallel workloads where I can just restart a rank in case the machine fails.