    alidhamieh
    @alidhamieh
    Internal source IP and internal destination IP of the src/dest containers were the same: 10.88.0.8
    alidhamieh
    @alidhamieh
    No, I am not sure about this statement: "Source IP and Internal destination IP of src/dest containers were the same 10.88.0.8". They seem not to be the same.
    alidhamieh
    @alidhamieh
    Yes, when migrating to a different machine the container gets a different internal IP. On the source of the migration:
    [tmp]$ sudo podman inspect get_counter_5 --format '{{.NetworkSettings.IPAddress}}'
    10.88.0.11
    And on the destination of the migration:
    [podman-dest ~]$ sudo podman inspect get_counter_5 --format '{{.NetworkSettings.IPAddress}}'
    10.88.0.10
    When I restart the container on the destination, it gets the 10.88.0.11 IP:
    [podman-dest ~]$ sudo podman restart get_counter_5
    40ef9ee1aefe0c2fe062b640fba870470667f23342ab330bab6ab9c0db46246d
    [podman-dest ~]$ sudo podman inspect get_counter_5 --format '{{.NetworkSettings.IPAddress}}'
    10.88.0.11
    alidhamieh
    @alidhamieh
    When doing checkpoint/restore on the same machine, the internal IP of the container stays the same after restore:
    [instance-9 tmp]$ sudo podman inspect get_counter_5 --format '{{.NetworkSettings.IPAddress}}'
    10.88.0.11
    [instance-9 tmp]$ curl 10.88.0.11:8088
    counter: 0
    [instance-9 tmp]$ sudo podman rm get_counter_5
    40ef9ee1aefe0c2fe062b640fba870470667f23342ab330bab6ab9c0db46246d
    [instance-9 tmp]$ sudo podman container restore --import=get_counter_5.tar.gz --tcp-established
    40ef9ee1aefe0c2fe062b640fba870470667f23342ab330bab6ab9c0db46246d
    [instance-9 tmp]$ curl 10.88.0.11:8088
    counter: 1
    alidhamieh
    @alidhamieh
    In summary, the internal IP of the container changed when migrating to a different host. I assume it uses the IP of whatever container was running there before the new version of the container. I do not know how it does that if the previous container was already removed with sudo podman rm get_counter_5 on the destination; it seems there is a cache somewhere. The newly migrated container runs on the previous internal IP, not on the internal IP of the source container. It only gets the IP of the migration source if I restart the container on the destination.
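    (For reference, a hedged guess at where that cache lives: podman's default CNI network uses the host-local IPAM plugin, which persists IP reservations on disk as files named after the IP, each containing the owning container ID. The sketch below lists them; the network name "podman" and the data directory are assumptions that may differ per setup.)

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* default data dir of the CNI host-local IPAM plugin; "podman" is
         * the default network name and an assumption here */
        const char *dir = "/var/lib/cni/networks/podman";
        DIR *d = opendir(dir);
        struct dirent *e;

        if (!d) {
            perror("opendir"); /* needs root; path may differ on your setup */
            return 1;
        }
        while ((e = readdir(d)) != NULL) {
            char path[512], id[128] = "";
            FILE *f;

            if (e->d_name[0] == '.')
                continue;
            /* each file is named after a reserved IP and holds the owner */
            snprintf(path, sizeof(path), "%s/%s", dir, e->d_name);
            f = fopen(path, "r");
            if (f) {
                if (fgets(id, sizeof(id), f))
                    id[strcspn(id, "\n")] = '\0';
                fclose(f);
            }
            printf("%-15s -> %s\n", e->d_name, id);
        }
        closedir(d);
        return 0;
    }

    (If a reservation survives podman rm here, a restored container could pick up an address that differs from the one on the source, which would match the behavior described above.)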
    alidhamieh
    @alidhamieh
    @adrian:lisas.de How did you migrate to a new host with the internal IPs of the source and destination containers staying the same? That is not the case in my setup. Do you know why?
    minhbq-99
    @minhbq-99
    Hi,
    I am trying to implement support for MAP_HUGETLB checkpointing/restoring.
    I added a new function in kerndat to get the hugetlb devs, which are later used to detect hugetlb mappings (a similar approach to shared anonymous memory and memfd). I also added a simple zdtm test for it. The zdtm test runs just fine on my Ubuntu 18.04. However, I get an error in the CI test: I cannot get the hugetlb devs' numbers. When I run the CI container with a shell and run that test again, it works. So weird.
    I allocate some hugetlb pages before the zdtm test in this hacky way:
    https://github.com/minhbq-99/criu/commit/3f88581718d4b8227a73ac700a143fe84b6fe245#diff-c9838f32b66301675f1aaba9ee24003ac88febf1f77cd993feba0db4b75d7ea6R2694
    Here is my branch: https://github.com/minhbq-99/criu/commits/hugepage_local
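    (For context, the kerndat probe being discussed boils down to something like the sketch below, which is illustrative rather than the branch's actual code: mmap an anonymous MAP_HUGETLB region and read its device number back from /proc/self/maps. It fails exactly when no hugepages are reserved, which matters later in this thread.)

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2UL << 20; /* one 2MB hugepage */
        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        char line[256];
        FILE *f;

        if (addr == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)"); /* fails if no hugepages reserved */
            return 1;
        }
        f = fopen("/proc/self/maps", "r");
        while (fgets(line, sizeof(line), f)) {
            unsigned long start, end;
            unsigned int maj, min;

            /* maps format: start-end perms offset dev inode path */
            if (sscanf(line, "%lx-%lx %*s %*s %x:%x",
                       &start, &end, &maj, &min) == 4 &&
                start == (unsigned long)addr)
                printf("hugetlb mapping dev = %x:%x\n", maj, min);
        }
        fclose(f);
        munmap(addr, len);
        return 0;
    }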
    Adrian Reber
    @adrian:lisas.de
    [m]
    @minhbq-99: looks like your changes broke the maps09 test. Most of our CI runs are on 20.04 or CentOS 8, so maybe the newer kernel does something differently
    minhbq-99
    @minhbq-99
    I tried running the test on 20.04 and got the same result (successful on the host, but it fails when CI runs in Docker; if I run that container manually with a shell, the test succeeds). Actually, while developing I sometimes observe this behavior: my kerndat function for getting the hugetlb devs' numbers fails, and then the failure disappears without me changing anything.
    maps09 is the new zdtm test I added for MAP_HUGETLB.
    Adrian Reber
    @adrian:lisas.de
    [m]
    Sorry, but I have no idea why. Maybe seccomp filtering? Kerndat depends on /run being a tmpfs, I think. So, not sure
    Alexander Mikhalitsyn
    @mihalicyn

    Hi @minhbq-99!
    Just a few guesses: the hugetlbfs mounts have no FS_USERNS_MOUNT flag (which means hugetlbfs cannot be mounted from inside a user namespace), but shmemfs has this fs_flag. I'm not sure that I've understood you correctly here:

    successful on the host, but it fails when CI runs in Docker; if I run that container manually with a shell, the test succeeds

    Did you manage to reproduce the problem on your machine? Can you show the command?

    Another possible reason is that hugetlbfs has to be mounted manually (as far as I can see), while shmemfs is mounted automatically at system start. Maybe you have to add a direct hugetlbfs mount to your kerndat initialization function? A minimal sketch follows.
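    (A minimal sketch of that suggestion, assuming an arbitrary scratch mountpoint; note that every hugetlbfs mount instance gets its own superblock and hence its own device number, so this shows the mount mechanics rather than the exact dev used by MAP_HUGETLB mappings.)

    #include <stdio.h>
    #include <sys/mount.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    int main(void)
    {
        const char *mnt = "/tmp/criu-hugetlb-probe"; /* illustrative path */
        struct stat st;

        mkdir(mnt, 0700); /* EEXIST is fine for this sketch */
        /* needs CAP_SYS_ADMIN in the initial userns, since hugetlbfs
         * lacks FS_USERNS_MOUNT (as noted above) */
        if (mount("none", mnt, "hugetlbfs", 0, NULL)) {
            perror("mount(hugetlbfs)");
            return 1;
        }
        if (stat(mnt, &st) == 0)
            printf("this hugetlbfs instance: dev %lx\n",
                   (unsigned long)st.st_dev);
        umount(mnt);
        return 0;
    }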

    minhbq-99
    @minhbq-99

    Just a few guesses: the hugetlbfs mounts have no FS_USERNS_MOUNT flag (which means hugetlbfs cannot be mounted from inside a user namespace), but shmemfs has this fs_flag.

    Interesting information. I got an error with user namespaces but had no idea why, so I had already disabled the uns test.

    I think I understand the problem. hugetlbfs is mounted by default, as far as I can see, but I need to set the number of hugepages in sysfs so that I can allocate them. Currently, I add some code to zdtm.py to set up some hugepages before running the zdtm tests. The kerndat function may get called before those hugepages are available, so I cannot allocate one and get the dev number. The reason for the inconsistent zdtm results is that kerndat gets cached, and the cached version on my host has the correct dev numbers, so the zdtm test there is always successful.

    Thank you @adrian:lisas.de @mihalicyn :)
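    (For anyone following along, reserving hugepages ahead of time amounts to a sysfs write like the sketch below; the 2048kB path assumes 2MB pages on x86_64, and the zdtm.py hack mentioned above does the equivalent in Python.)

    #include <stdio.h>

    int main(void)
    {
        /* path assumes 2MB hugepages on x86_64; other sizes differ */
        const char *path =
            "/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages";
        FILE *f = fopen(path, "w");

        if (!f) {
            perror("fopen"); /* needs root */
            return 1;
        }
        fprintf(f, "16\n"); /* reserve 16 hugepages; kernel may give fewer */
        fclose(f);
        return 0;
    }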

    alidhamieh
    @alidhamieh
    @adrian:lisas.de please let me know your thoughts on the issue elaborated in the chat above: the internal IP of the container changed when migrating to a different host.
    Alexander Mikhalitsyn
    @mihalicyn
    @minhbq-99 you are welcome ;) Yep, kdat is cached. And yep, the uns case will not work. It will only work if the hugetlb mount is created in the initial user namespace and the mount is inherited into the container's mount namespace from the host
    alidhamieh
    @alidhamieh
    Does CRIU re-send a TCP ACK to the client when restoring a previously established TCP connection?
    On restore I see:
    15:56:00.417689 IP 10.88.0.43.8080 > 73.132.70.25.54894: Flags [.], ack 3809915341, win 229, options [nop,nop,TS val 2159752066 ecr 0], length 0
    Adrian Reber
    @adrian:lisas.de
    [m]
    @alidhamieh: I am pretty sure CRIU does not, but maybe the TCP stack does it
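    (Background: CRIU restores established connections using the kernel's TCP repair mode, and an ACK like the one above is consistent with the stack emitting a window update when the socket leaves repair mode. A rough sketch of the mechanism, not CRIU's actual code:)

    #include <stdio.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    #ifndef TCP_REPAIR
    #define TCP_REPAIR 19 /* from linux/tcp.h, for older libc headers */
    #endif

    int main(void)
    {
        int sk = socket(AF_INET, SOCK_STREAM, 0);
        int on = 1, off = 0;

        /* repair mode (CAP_NET_ADMIN) lets addresses, sequence numbers
         * and queues be restored without emitting any packets */
        if (setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) < 0)
            perror("TCP_REPAIR on");

        /* ... bind/connect and restore sequences and queues here ... */

        /* leaving repair mode re-arms the socket; any ACK seen at this
         * point comes from the kernel TCP stack, not from CRIU itself */
        if (setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &off, sizeof(off)) < 0)
            perror("TCP_REPAIR off");
        return 0;
    }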
    alidhamieh
    @alidhamieh
    Got it.
    Can we eliminate this ACK on restore? Do you see this ACK on your setup?
    I want to eliminate it because it seems gCloud filters unsolicited egress ACKs.
    minhbq-99
    @minhbq-99
    Hi, I can see that hugetlb can be used with sysvipc shm, but I cannot find any way to determine whether a shm segment is backed by hugetlb or not.
    Does anyone have ideas for this problem?
    Pavel Tikhomirov
    @Snorch

    Hi @minhbq-99, afaics if we shmat a sysvipc shm segment with hugetlb backing, it looks the same as a hugetlb mapping created by mmap, meaning that it also has a /proc/pid/map_files/ file. But looking at your PR https://github.com/checkpoint-restore/criu/pull/1622/commits/3ec6dbfe29558c5067fdb7c04313f01743e694c7#diff-6f08d59ddde08ca75f7ccb0aac7f5ca6e011bd968b57d3de0ee7a1786f582763R238 I'm not sure that your dev comparison works even for mmaps. If I run a simple test https://gist.github.com/Snorch/ab5f86e5e8f3d7f9fecfd7eabdcadd7a:

    [root@fedora helpers]# ./shm-huge 
    shm_ptr = 0x7f8868400000
    map = 0x7f88689af000
    map2m = 0x7f8868200000

    All three different mappings have the same device:

    [root@fedora snorch]# stat /proc/136984/map_files/{7f8868400000,7f88689af000,7f8868200000}* | grep Dev
    Device: 16h/22d    Inode: 1055674     Links: 1
    Device: 16h/22d    Inode: 1055681     Links: 1
    Device: 16h/22d    Inode: 1055673     Links: 1

    This is on a pretty new 5.13.12-200.fc34.x86_64 kernel.

    Maybe I'm missing something, but I don't see a way to tell which hugepage type (16k/2m/1g) the mapping uses.
    Pavel Tikhomirov
    @Snorch
    Ah, I missed that we get the dev from /proc/pid/maps, not from stat. Then the dev does look like it indicates a hugepage, and everything is OK:
    [root@fedora helpers]# ./shm-huge 
    shm_ptr = 0x7f555e800000
    map = 0x7f555ed76000
    map2m = 0x7f555e600000
    
    [root@fedora helpers]# grep "7f555e800000\|7f555ed76000\|7f555e600000" /proc/158858/maps
    7f555e600000-7f555e800000 rw-s 00000000 00:0f 1088051                    /anon_hugepage (deleted)
    7f555e800000-7f555ea00000 rw-s 00000000 00:0f 65567                      /SYSV6129e7d0 (deleted)
    7f555ed76000-7f555ed77000 rw-s 00000000 00:01 73556                      /dev/zero (deleted)
    minhbq-99
    @minhbq-99
    Hi @Snorch, I use that dev number to detect hugetlb; with different page sizes (2MB, 1GB) we get different device numbers.
    For the mappings, the file path is used to differentiate between shm (/SYSV) and memfd (/memfd). I will update the pull request with my latest local branch.
    Pavel Tikhomirov
    @Snorch
    To conclude: sysvipc shm should be exactly the same as mmap'ed regions
    minhbq-99
    @minhbq-99
    The problem is that the shm segment may not be mmap'ed yet (nothing has called shmat on it), but I've come up with an idea: when collecting the shm keys, we use shmat to check whether the segment is hugetlb.
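    (That idea could look roughly like the sketch below, which is illustrative rather than the PR's code: temporarily shmat the segment, find the mapping in /proc/self/maps, and return its device number so the caller can compare it against the hugetlb devs cached in kdat.)

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/sysmacros.h>

    #ifndef SHM_HUGETLB
    #define SHM_HUGETLB 04000 /* from linux/shm.h */
    #endif

    /* attach the segment and read its backing dev from /proc/self/maps */
    static unsigned long shm_backing_dev(int shmid)
    {
        void *addr = shmat(shmid, NULL, SHM_RDONLY);
        unsigned long dev = 0;
        char line[256];
        FILE *f;

        if (addr == (void *)-1)
            return 0;
        f = fopen("/proc/self/maps", "r");
        while (fgets(line, sizeof(line), f)) {
            unsigned long start;
            unsigned int maj, min;

            if (sscanf(line, "%lx-%*x %*s %*s %x:%x",
                       &start, &maj, &min) == 3 &&
                start == (unsigned long)addr)
                dev = makedev(maj, min);
        }
        fclose(f);
        shmdt(addr);
        return dev; /* compare against the hugetlb devs cached in kdat */
    }

    int main(void)
    {
        /* a 2MB hugetlb segment; needs root and reserved hugepages */
        int shmid = shmget(IPC_PRIVATE, 2UL << 20,
                           IPC_CREAT | SHM_HUGETLB | 0600);

        if (shmid < 0) {
            perror("shmget");
            return 1;
        }
        printf("shm backing dev = %lx\n", shm_backing_dev(shmid));
        shmctl(shmid, IPC_RMID, NULL);
        return 0;
    }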
    Pavel Tikhomirov
    @Snorch
    yes
    One more thing here: you detect device numbers for hugetlb and cache them in kdat. Can't that be a problem if a new hugetlb dev appears?
    Pavel Tikhomirov
    @Snorch
    Probably we need to refresh the hugetlb device numbers on each run...
    minhbq-99
    @minhbq-99
    Yes, that's my solution: every time we load the kerndat cache, we need to collect the hugetlb devs again
    Pavel Tikhomirov
    @Snorch
    nice, thanks!
    minhbq-99
    @minhbq-99

    Hi, my pull request fails on a CentOS 7 user_namespace test case. The problem is in restoring hugetlb shmem mappings: when restoring shmem mappings we try to use memfd, and if we cannot, we open the map_files link of that mapping. In the case of CentOS 7 we fall back to opening the map_files link, and we don't have the CAP_SYS_ADMIN cap:

    https://elixir.bootlin.com/linux/v3.10/source/fs/proc/base.c#L1889

    With some debugging, I found that the restored process has CAP_SYS_ADMIN, but its cred->user_ns is at a lower level than init_user_ns. But why can the checkpoint process open the map_files link? I see that the checkpoint process's cred->user_ns is the same as init_user_ns. So why is there a difference in cred->user_ns between the checkpoint and restore processes?
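    (The failing step is essentially the sketch below; on the 3.10-era kernel linked above, opening a map_files entry requires CAP_SYS_ADMIN in the initial user namespace, which matches the symptom. The pid and address range are placeholders.)

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* placeholder pid/range; in CRIU this comes from the parsed vma */
        const char *path = "/proc/1234/map_files/7f8868400000-7f8868600000";
        int fd = open(path, O_RDONLY);

        if (fd < 0) {
            /* fails with EPERM unless the opener has CAP_SYS_ADMIN in
             * the initial user namespace on such kernels */
            perror("open(map_files)");
            return 1;
        }
        close(fd);
        return 0;
    }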

    minhbq-99
    @minhbq-99
    Hmm, I understand the problem. When checkpointing, the criu process in the root userns tries to checkpoint a process inside a userns, so the checkpoint code actually runs in the root userns. On the other hand, when restoring, the restore code is run by a process that is inside the userns
    Andrei Vagin
    @avagin
    @minhbq-99 you can look at userns_call
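    (userns_call is CRIU's internal helper for exactly this situation: it delegates a privileged operation to a process that stayed in the init userns and hands the result back. A generic sketch of the fd-passing half of that pattern, not CRIU's actual API:)

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* pass an open fd to a peer over an AF_UNIX socket via SCM_RIGHTS;
     * the receiver mirrors this with recvmsg() and CMSG_FIRSTHDR() */
    static int send_fd(int sk, int fd)
    {
        char c = 0;
        struct iovec iov = { .iov_base = &c, .iov_len = 1 };
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = cbuf,
            .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sk, &msg, 0) < 0 ? -1 : 0;
    }

    int main(void)
    {
        int sp[2];

        /* in the real flow, a helper forked before entering the userns
         * would hold one end and open privileged files on request */
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sp))
            return 1;
        if (send_fd(sp[0], STDIN_FILENO))
            perror("send_fd");
        close(sp[0]);
        close(sp[1]);
        return 0;
    }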
    Doraemon
    @dongliangde
    If a process that has been restored by criu-ns is frozen again, the following error occurs:
    Traceback (most recent call last):
      File "./criu-ns", line 231, in <module>
        res = wrap_dump()
      File "./criu-ns", line 200, in wrap_dump
        set_pidns(pid, pid_idx)
      File "./criu-ns", line 161, in set_pidns
        raise OSError(errno.ENOENT, 'Cannot find NSpid field in proc')
    FileNotFoundError: [Errno 2] Cannot find NSpid field in proc
    Adrian Reber
    @adrian:lisas.de
    [m]
    Which OS are you running it on? I think CentOS 7 does not have NSpid; everybody else should have it
    Doraemon
    @dongliangde

    Which OS are you running it on? I think CentOS 7 does not have NSpid

    It runs in Docker; the base image is Ubuntu

    Adrian Reber
    @adrian:lisas.de
    [m]
    which version of ubuntu?
    The command cat /proc/self/status | grep NSpid should work
    Doraemon
    @dongliangde

    which version of ubuntu?

    Ubuntu 20.04.3

    When criu restores directly it causes PID conflicts, so I restore through criu-ns in a new PID namespace. The process was restored successfully, but freezing it again causes this problem

    Doraemon
    @dongliangde

    The command cat /proc/self/status | grep NSpid should work

    Nothing is found by that command

    Adrian Reber
    @adrian:lisas.de
    [m]
    uname -a ?
    Doraemon
    @dongliangde

    uname -a ?

    Linux 8194e282c3c5 3.10.0-1062.el7.x86_64 #1 SMP Wed Aug 7 18:08:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux