    Shawn Tian
    @s2100
    @scarlett2018 please help; I have a question about a job submission that fails after the cluster is rebooted.
    Issue's here: microsoft/pai#3946
    Also, I want to set environment variables while submitting a job (v2), which works in v1.
    The issue's here: microsoft/pai#3956
    Thanks
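    (In the v2 protocol, one workaround sketch for environment variables, with a hypothetical MY_VAR and train.py, is to export them at the top of the task role's commands, since the commands are executed in a single shell session:)

    commands:
      - export MY_VAR=my_value   # hypothetical variable; the export persists for the commands below
      - python train.py          # train.py is a placeholder; it can read $MY_VAR at runtime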
    uba888
    @uba888
    How can I update to v0.16.0? I can't pull the image for the v0.16.0 tag.
    Scarlett Li
    @scarlett2018
    @uba888 - v0.16.0 is an internal release; we are still testing it internally and the image has not been formally released. Please stay tuned, and use previous versions for now.
    liuc-shaiic
    @liuc-shaiic

    Hi guys,
    I'm running a "Hello world" job but it fails with an "Illegal instruction" error. Could someone help me out? Thanks!

    My job config:

    protocolVersion: 2
    name: admin_1578374848565_a7bb39a_ca815832
    type: job
    jobRetryCount: 0
    prerequisites:
      - type: dockerimage
        uri: openpai/tensorflow-py36-cpu
        name: docker_image_0
    taskRoles:
      Task_role_1:
        instances: 1
        completion:
          minFailedInstances: 1
          minSucceededInstances: 1
        dockerImage: docker_image_0
        resourcePerInstance:
          gpu: 0
          cpu: 4
          memoryMB: 8192
        commands:
          - apt update
          - apt install -y git
          - 'git clone https://github.com/tensorflow/models'
          - cd models/research/slim
          - pip3 uninstall tensorflow
          - pip3 install contextlib2 Pillow tensorflow==1.14
          - >-
            python download_and_convert_data.py --dataset_name=cifar10
            --dataset_dir="/tmp/data"
        taskRetryCount: 0
    defaults:
      virtualCluster: default

    The error message is: /pai/bootstrap/docker_bootstrap.sh: line 206: 1514 Illegal instruction (core dumped) python download_and_convert_data.py --dataset_name=cifar10 --dataset_dir="/tmp/data"
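    (A common cause of "Illegal instruction" with prebuilt TensorFlow wheels is a CPU without AVX: official wheels from 1.6 onward are compiled with AVX enabled. A quick check you could add to the job's commands, with tensorflow==1.5 as a hypothetical non-AVX fallback:)

    commands:
      - grep -m1 -o 'avx[^ ]*' /proc/cpuinfo || echo "no AVX support"   # no match means AVX-built wheels will crash
      - pip3 install tensorflow==1.5                                    # last official wheel built without AVX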

    Scarlett Li
    @scarlett2018
    @liuc-shaiic - it's the Spring Festival season, so responses might be delayed. Could you open an issue on GitHub and provide more info about which PAI version you are using? PAI's latest version on master (not yet released) has a limitation in HiveD's support for "0 gpu"; my first impression is that you hit that limitation.
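    (If that limitation is the cause, i.e. you are on master with HiveD, a minimal workaround sketch is to request at least one GPU until "0 gpu" is supported:)

    resourcePerInstance:
      gpu: 1       # HiveD on master cannot yet schedule gpu: 0; request one GPU as a stopgap
      cpu: 4
      memoryMB: 8192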
    liuc-shaiic
    @liuc-shaiic
    Thanks, Scarlett Li, for your response! I have opened an issue: microsoft/pai#4163, and the PAI version is v0.14.0. Wishing you and the PAI team a happy Spring Festival!
    JosephKang
    @JosephKang
    Hello all,
    I am new to OpenPAI and am trying to deploy a v0.14.0 single-box as a test on my Ubuntu 16.04 VM (https://github.com/microsoft/pai/blob/master/docs/pai-management/doc/single-box.md).
    It seems that I am blocked on a kube-apiserver issue while executing paictl.py (https://github.com/microsoft/pai/blob/master/docs/pai-management/doc/how-to-bootup-k8s.md#command).

    The related information is listed below.
    ~/pai-config/kubernetes-configuration.yaml

    kubernetes:
      cluster-dns: 192.168.153.2
      load-balance-ip: 192.168.153.130
      service-cluster-ip-range: 10.254.0.0/16
      storage-backend: etcd3
      docker-registry: gcr.io/google_containers
      hyperkube-version: v1.9.9
      etcd-version: 3.2.17
      apiserver-version: v1.9.9
      kube-scheduler-version: v1.9.9
      kube-controller-manager-version: v1.9.9
      dashboard-version: v1.8.3
      etcd-data-path: "/var/etcd"

    info.log

    2020-04-23 08:24:00,334 [INFO] - deployment.clusterCmd : Begin to initialize PAI k8s cluster.
    2020-04-23 08:24:00,347 [WARNING] - deployment.k8sPaiLibrary.maintainlib.deploy : Begin to deploy a new cluster to your machine or vm.
    2020-04-23 08:24:00,348 [INFO] - deployment.k8sPaiLibrary.maintainlib.deploy : Begin to deploy k8s on host 192.168.153.130, the node role is [ master ]
    2020-04-23 08:24:00,390 [INFO] - paramiko.transport : Connected (version 2.0, client OpenSSH_7.2p2)
    2020-04-23 08:24:00,487 [INFO] - paramiko.transport : Authentication (password) successful!
    2020-04-23 08:24:00,658 [INFO] - deployment.k8sPaiLibrary.maintainlib.common : Executing the command on host [192.168.153.130]: getent passwd cfkang | cut -d: -f6
    2020-04-23 08:24:00,781 [INFO] - paramiko.transport : Connected (version 2.0, client OpenSSH_7.2p2)
    2020-04-23 08:24:00,874 [INFO] - paramiko.transport : Authentication (password) successful!
    2020-04-23 08:24:01,069 [INFO] - paramiko.transport : Connected (version 2.0, client OpenSSH_7.2p2)
    2020-04-23 08:24:01,165 [INFO] - paramiko.transport : Authentication (password) successful!
    2020-04-23 08:24:01,370 [INFO] - paramiko.transport.sftp : [chan 0] Opened sftp connection (server version 3)
    2020-04-23 08:24:01,377 [INFO] - paramiko.transport.sftp : [chan 0] sftp session closed.
    2020-04-23 08:24:01,387 [INFO] - paramiko.transport : Connected (version 2.0, client OpenSSH_7.2p2)
    2020-04-23 08:24:01,484 [INFO] - paramiko.transport : Authentication (password) successful!
    2020-04-23 08:24:01,683 [INFO] - deployment.k8sPaiLibrary.maintainlib.common : Executing the command on host [192.168.153.130]: tar -xvf master-deployment.tar
    2020-04-23 08:24:01,706 [INFO] - paramiko.transport : Connected (version 2.0, client OpenSSH_7.2p2)
    2020-04-23 08:24:01,797 [INFO] - paramiko.transport : Authentication (password) successful!
    2020-04-23 08:24:01,971 [INFO] - deployment.k8sPaiLibrary.maintainlib.common : Executing the command on host [192.168.153.130]: sudo ./master-deployment/hosts-check.sh 192.168.153.130
    2020-04-23 08:24:01,994 [INFO] - paramiko.transport : Connected (version 2.0, client OpenSSH_7.2p2)
    2020-04-23 08:24:02,086 [INFO] - paramiko.transport : Authentication (password) successful!
    2020-04-23 08:24:02,263 [INFO] - deployment.k8sPaiLibrary.maintainlib.common : Executing the command on host [192.168.153.130]: sudo ./master-deployment/docker-ce-install.sh master-deployment
    2020-04-23 08:24:02,298 [INFO] - paramiko.transport : Connected (version 2.0, client OpenSSH_7.2p2)
    2020-04-23 08:24:02,389 [INFO] - paramiko.transport : Authentication (password) successful!
    2020-04-23 08:24:02,569 [INFO] - deployment.k8sPaiLibrary.maintainlib.common : Executing the command on host [192.168.153.130]: sudo ./master-deployment/kubelet-start.sh master-deployment
    2020-04-23 08:24:02,786 [INFO] - deployment.k8sPaiLibrary.maintainlib.deploy : Successfully running master-deployment job on node 192.168.153.130!
    2020-04-23 08:24:02,786 [INFO] - deployment.k8sPaiLibrary.maintainlib.deploy : package cleaner is working on the folder of 192.168.153.130!
    2020-04-23 08:24:02,790 [INFO] - deployment.k8sPaiLibrary.maintainlib.deploy : package cleaner's work finished!
    2020-04-23 08:24:02,790 [INFO] - deployment.k8sPaiLibrary.maintainlib.deploy : remote host cleaner is working on the host of 192.168.153.130!
    2020-04-23 08:24:02,795 [INFO] - paramiko.transport : Connected (version 2.0, client OpenSSH_7.2p2)
    2020-04-23 08:24:02,848 [INFO] - paramiko.transport : Authentication (password) successful!
    2020-04-23 08:24:03,088 [INFO] - deployment.k8sPaiLibrary.maintainlib.common : Executing the command on host [192.168.153.130]: sudo rm -rf master-deployment*
    2020-04-23 08:24:03,100 [INFO] - deployment.k8sPaiLibrary.maintainlib.deploy : remote host cleaning job finished!
    2020-04-23 08:24:03,100 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Execute the script to install kubectl on your host!

    2020-04-23 08:24:03,116 [INFO] - deployment.k8sPaiLibrary.maintainlib.kubectl_install : Successfully install kubectl

    JosephKang
    @JosephKang

    kubelet instance log

    E0423 00:24:02.988923 19037 kubelet.go:1287] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
    I0423 00:24:02.990115 19037 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
    I0423 00:24:02.990296 19037 server.go:129] Starting to listen on 192.168.153.130:10250
    I0423 00:24:02.990692 19037 server.go:299] Adding debug handlers to kubelet server.
    I0423 00:24:02.998208 19037 server.go:149] Starting to listen read-only on 192.168.153.130:10255
    E0423 00:24:02.999643 19037 event.go:209] Unable to write event: 'Post http://192.168.153.130:8080/api/v1/namespaces/default/events: dial tcp 192.168.153.130:8080: getsockopt: connection refused' (may retry after sleeping)
    I0423 00:24:03.001780 19037 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.153.130
    I0423 00:24:03.001842 19037 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.153.130
    I0423 00:24:03.001866 19037 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.153.130
    I0423 00:24:03.003283 19037 manager.go:188] Starting Device Plugin manager
    E0423 00:24:03.003546 19037 container_manager_linux.go:583] [ContainerManager]: Fail to get rootfs information unable to find data for container /
    I0423 00:24:03.003551 19037 manager.go:444] Read checkpoint file /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint
    I0423 00:24:03.003694 19037 manager.go:219] Serving device plugin registration server on "/var/lib/kubelet/device-plugins/kubelet.sock"
    I0423 00:24:03.003740 19037 fs_resource_analyzer.go:66] Starting FS ResourceAnalyzer
    I0423 00:24:03.003765 19037 status_manager.go:140] Starting to sync pod status with apiserver
    I0423 00:24:03.003780 19037 kubelet.go:1778] Starting kubelet main sync loop.
    I0423 00:24:03.003787 19037 kubelet.go:1795] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 2562047h47m16.854775807s ago; threshold is 3m0s]
    I0423 00:24:03.003499 19037 container_manager_linux.go:425] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
    I0423 00:24:03.003926 19037 volume_manager.go:245] The desired_state_of_world populator starts
    I0423 00:24:03.003929 19037 volume_manager.go:247] Starting Kubelet Volume Manager
    I0423 00:24:03.010341 19037 factory.go:356] Registering Docker factory
    I0423 00:24:03.011253 19037 factory.go:136] Registering containerd factory
    I0423 00:24:03.011372 19037 factory.go:54] Registering systemd factory
    I0423 00:24:03.011837 19037 factory.go:86] Registering Raw factory
    I0423 00:24:03.012308 19037 manager.go:1178] Started watching for new ooms in manager
    I0423 00:24:03.014073 19037 manager.go:329] Starting recovery of all containers
    I0423 00:24:03.055501 19037 manager.go:334] Recovery completed
    E0423 00:24:03.092243 19037 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.153.130" not found
    I0423 00:24:03.104149 19037 kubelet.go:1857] SyncLoop (ADD, "file"): "kube-apiserver-192.168.153.130_kube-system(7ce60ec6f51925aba529a2fcfa5b6fc7), kube-controller-manager-192.168.153.130_kube-system(c5b5299c795b018e6d88d32ed2ebcab9), etcd-server-192.168.153.130_default(1509edd69790e8f5831efd3c898890ee), kube-scheduler-192.168.153.130_kube-system(9ce9cc74d855d92111d84c5caab5eac1)"
    I0423 00:24:03.104202 19037 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
    I0423 00:24:03.104160 19037 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
    I0423 00:24:03.106007 19037 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.153.130

    I0423 00:24:03.106019 19037 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.153.130

    Docker instance status:
    CONTAINER ID   IMAGE                                       COMMAND                  CREATED         STATUS         PORTS   NAMES
    50c6b9d4886e   gcr.io/google_containers/hyperkube:v1.9.9   "/hyperkube kubelet …"   7 seconds ago   Up 6 seconds           kubelet
    May I know how I can resolve this? Thanks.
    Scarlett Li
    @scarlett2018
    @ydye might know more about this ^^ @JosephKang
    JosephKang
    @JosephKang
    @scarlett2018, thank you for the heads-up. I will ping ydye.
    YundongYe
    @ydye
    I need more kubelet logs to track the issue; I can't find any useful detail in the log above.
    JosephKang
    @JosephKang
    @ydye FYI.
    Flag --require-kubeconfig has been deprecated, You no longer need to use --require-kubeconfig. This will be removed in a future version. Providing --kubeconfig enables API server mode, omitting --kubeconfig enables standalone mode unless --require-kubeconfig=true is also set. In the latter case, the legacy default kubeconfig path will be used until --require-kubeconfig is removed.
    I0423 00:24:02.930126 19037 server.go:182] Version: v1.9.9
    I0423 00:24:02.930196 19037 feature_gate.go:226] feature gates: &{{} map[DevicePlugins:true TaintBasedEvictions:true]}
    W0423 00:24:02.930321 19037 server.go:280] --require-kubeconfig is deprecated. Set --kubeconfig without using --require-kubeconfig.
    I0423 00:24:02.930413 19037 mount_linux.go:195] Detected OS without systemd
    W0423 00:24:02.930604 19037 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
    I0423 00:24:02.933336 19037 plugins.go:101] No cloud provider specified.
    I0423 00:24:02.933414 19037 server.go:303] No cloud provider specified: "" from the config file: ""
    I0423 00:24:02.934643 19037 manager.go:151] cAdvisor running in container: "/sys/fs/cgroup/cpu,cpuacct"
    I0423 00:24:02.950240 19037 fs.go:139] Filesystem UUIDs: map[6ce6282d-4bc3-4ac0-9777-cb1821f5e2df:/dev/dm-1 d6a678b0-51f5-448c-aafc-11439701d903:/dev/dm-0 0d44ff95-9fb0-4e83-ae1f-18676b4940fa:/dev/sda1]
    I0423 00:24:02.950273 19037 fs.go:140] Filesystem partitions: map[shm:{mountpoint:/rootfs/var/lib/docker/containers/d61825a0339ca61ded64c73994897ba1d624fd211bb5c0bbeb01102226f2cfb2/mounts/shm major:0 minor:43 fsType:tmpfs blockSize:0} tmpfs:{mountpoint:/sys/fs/cgroup major:0 minor:45 fsType:tmpfs blockSize:0} /dev/mapper/openapi1--vg-root:{mountpoint:/var/lib/docker major:252 minor:0 fsType:ext4 blockSize:0} /dev/sda1:{mountpoint:/rootfs/boot major:8 minor:1 fsType:ext2 blockSize:0}]
    I0423 00:24:02.951574 19037 manager.go:225] Machine: {NumCores:2 CpuFrequency:1992000 MemoryCapacity:4125179904 HugePages:[{PageSize:1048576 NumPages:0} {PageSize:2048 NumPages:0}] MachineID:6ce4e2a111a9449766f3aff65e9ecfca SystemUUID:CE0B4D56-32D7-7781-3015-48F3E50FF9F2 BootID:8a66116a-16f1-4698-aadc-415a341ea8ae Filesystems:[{Device:/dev/sda1 DeviceMajor:8 DeviceMinor:1 Capacity:754434048 Type:vfs Inodes:46848 HasInodes:true} {Device:shm DeviceMajor:0 DeviceMinor:43 Capacity:67108864 Type:vfs Inodes:503562 HasInodes:true} {Device:overlay DeviceMajor:0 DeviceMinor:42 Capacity:65688932352 Type:vfs Inodes:4087808 HasInodes:true} {Device:tmpfs DeviceMajor:0 DeviceMinor:45 Capacity:2062589952 Type:vfs Inodes:503562 HasInodes:true} {Device:/dev/mapper/openapi1--vg-root DeviceMajor:252 DeviceMinor:0 Capacity:65688932352 Type:vfs Inodes:4087808 HasInodes:true}] DiskMap:map[252:0:{Name:dm-0 Major:252 Minor:0 Size:66873982976 Scheduler:none} 252:1:{Name:dm-1 Major:252 Minor:1 Size:1023410176 Scheduler:none} 8:0:{Name:sda Major:8 Minor:0 Size:68719476736 Scheduler:deadline}] NetworkDevices:[{Name:ens33 MacAddress:00:0c:29:0f:f9:f2 Speed:1000 Mtu:1500}] Topology:[{Id:0 Memory:4125179904 Cores:[{Id:0 Threads:[0] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:8388608 Type:Unified Level:3}]} {Id:2 Memory:0 Cores:[{Id:0 Threads:[1] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:8388608 Type:Unified Level:3}]}] CloudProvider:Unknown InstanceType:Unknown InstanceID:None}
    I0423 00:24:02.952115 19037 manager.go:231] Version: {KernelVersion:4.4.0-177-generic ContainerOsVersion:Debian GNU/Linux 9 (stretch) DockerVersion:18.09.7 DockerAPIVersion:1.39 CadvisorVersion: CadvisorRevision:}
    I0423 00:24:02.952632 19037 server.go:431] --cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /
    I0423 00:24:02.953253 19037 container_manager_linux.go:242] container manager verified user specified cgroup-root exists: /
    I0423 00:24:02.953277 19037 container_manager_linux.go:247] Creating Container Manager object based on Node Config: {RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:docker CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[memory:{i:{value:3221225472 scale:0} d:{Dec:<nil>} s:3Gi Format:BinarySI}] HardEvictionThresholds:[]} ExperimentalQOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerReconcilePeriod:10s}
    I0423 00:24:02.953385 19037 container_manager_linux.go:266] Creating device plugin manager: true
    I0423 00:24:02.953391 19037 manager.go:96] Creating Device Plugin manager at /var/lib/kubelet/device-plugins/kubelet.sock
    I0423 00:24:02.953646 19037 server.go:696] Using root directory: /var/lib/kubelet
    I0423 00:24:02.953673 19037 kubelet.go:293] Adding manifest path: /etc/kubernetes/manifests
    I0423 00:24:02.953696 19037 file.go:52] Watching path "/etc/kubernetes/manifests"
    I0423 00:24:02.953702 19037 kubelet.go:318] Watching apiserver
    E0423 00:24:02.963045 19037 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get http://192.168.153.130:8080/api/v1/pods?fieldSelector=spec.nodeName%3D192.168.153.130&limit=500&resourceVersion=0: dial tcp 192.168.153.130:8080: getsockopt: connection refused
    E0423 00:24:02.963159 19037 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:482: Failed to list *v1.Node: Get http://192.168.153.130:8080/api/v1/nodes?fieldSelector=metadata.name%3D192.168.153.130&limit=500&resourceVersion=0: dial tcp 192.168.153.130:8080: getsockopt: connection refused
    E0423 00:24:02.963253 19037 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:473: Failed to list *v1.Service: Get http://192.168.153.130:8080/api/v1/services?limit=500&resourceVersion=0: dial tcp 192.168.153.130:8080: getsockopt: connection refused
    W0423 00:24:02.964611 19037 kubelet_network.go:139] Hairpin mode set to "promiscuous-bridge" but kubenet is not enabled, falling back to "hairpin-veth"
    I0423 00:24:02.964661 19037 kubelet.go:580] Hairpin mode set to "hairpin-veth"
    I0423 00:24:02.965236 19037 client.go:80] Connecting to docker on unix:///var/run/docker.sock
    I0423 00:24:02.965306 19037 client.go:109] Start docker client with request timeout=2m0s
    W0423 00:24:02.966204 19037 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
    I0423 00:24:02.968094 19037 docker_service.go:232] Docker cri networking managed by kubernetes.io/no-op
    I0423 00:24:02.971934 19037 docker_service.go:237] Docker Info: &{ID:P73F:NVJL:SX6H:7SV6:SA6F:LNL4:KUJE:563U:OHA4:VOFV:PWY3:WNSC Containers:1 ContainersRunning:1 ContainersPaused:0 ContainersStopped:0 Images:1 Driver:overlay2 DriverStatus:[[Backing Filesystem extfs] [Supports d_type true] [Native Overlay Diff true]] SystemStatus:[] Plugins:{Volume:[local] Network:[bridge host macvlan null overlay] Authorization:[] Log:[awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog]} MemoryLimit:true SwapLimit:false KernelMemory:true CPUCfsPeriod:true CPUCfsQuota:true CPUShares:true CPUSet:true IPv4Forwarding:true BridgeNfIptables:true BridgeNfIP6tables:true Debug:false NFd:28 OomKillDisable:true NGoroutines:43 SystemTime:2020-04-23T08:24:02.968719422+08:00 LoggingDriver:json-file CgroupDriver:cgroupfs NEventsListener:0 KernelVersion:4.4.0-177-generic OperatingSystem:Ubuntu 16.04.6 LTS OSType:linux Architecture:x86_64 IndexServerAddress:https://index.docker.io/v1/ RegistryConfig:0xc4201503f0 NCPU:2 MemTotal:4125179904 GenericResources:[] DockerRootDir:/var/lib/docker HTTPProxy: HTTPSProxy: NoProxy: Name:openapi1 Labels:[] ExperimentalBuild:false ServerVersion:18.09.7 ClusterStore: ClusterAdvertise: Runtimes:map[nvidia:{Path:/usr/bin/nvidia-container-runtime Args:[]} runc:{Path:runc Args:[]}] DefaultRuntime:runc Swarm:{NodeID: NodeAddr: LocalNodeState:inactive ControlAvailable:false Error: RemoteManagers:[] Nodes:0 Managers:0 Cluster:<nil>} LiveRestoreEnabled:false Isolation: InitBinary:docker-init ContainerdCommit:{ID: Expected:} RuncCommit:{ID:N/A Expected:N/A} InitCommit:{ID:v0.18.0 Expected:fec3683b971d9c3ef73f284f176672c44b448662} SecurityOptions:[name=apparmor name=seccomp,profile=default]}
    I0423 00:24:02.971988 19037 docker_service.go:250] Setting cgroupDriver to cgroupfs
    I0423 00:24:02.972018 19037 kubelet.go:657] Starting the GRPC server for the docker CRI shim.
    I0423 00:24:02.972027 19037 docker_server.go:51] Start dockershim grpc server
    I0423 00:24:02.982007 19037 remote_runtime.go:43] Connecting to runtime service unix:///var/run/dockershim.sock
    I0423 00:24:02.983018 19037 kuberuntime_manager.go:186] Container runtime docker initialized, version: 18.09.7, apiVersion: 1.39.0
    W0423 00:24:02.983121 19037 probe.go:215] Flexvolume plugin directory at /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ does not exist. Recreating.
    I0423 00:24:02.983304 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/aws-ebs"
    I0423 00:24:02.983311 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/empty-dir"
    I0423 00:24:02.983315 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/gce-pd"
    I0423 00:24:02.983318 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/git-repo"
    I0423 00:24:02.983322 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/host-path"
    I0423 00:24:02.983326 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/nfs"
    I0423 00:24:02.983329 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/secret"
    I0423 00:24:02.983333 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/iscsi"
    I0423 00:24:02.983337 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/glusterfs"
    I0423 00:24:02.983341 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/rbd"
    I0423 00:24:02.983345 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/cinder"
    I0423 00:24:02.983349 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/quobyte"
    I0423 00:24:02.983352 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/cephfs"
    I0423 00:24:02.983357 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/downward-api"
    I0423 00:24:02.983360 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/fc"
    I0423 00:24:02.983364 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/flocker"
    I0423 00:24:02.983367 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/azure-file"
    I0423 00:24:02.983371 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/configmap"
    I0423 00:24:02.983377 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/vsphere-volume"
    I0423 00:24:02.983381 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/azure-disk"
    I0423 00:24:02.983384 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/photon-pd"
    I0423 00:24:02.983388 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/projected"
    I0423 00:24:02.983391 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/portworx-volume"
    I0423 00:24:02.983395 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/scaleio"
    I0423 00:24:02.983407 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/local-volume"
    I0423 00:24:02.983411 19037 plugins.go:453] Loaded volume plugin "kubernetes.io/storageos"
    I0423 00:24:02.988749 19037 server.go:758] Started kubelet
    E0423 00:24:02.988923 19037 kubelet.go:1287] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
    I0423 00:24:02.990115 19037 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
    I0423 00:24:02.990296 19037 server.go:129] Starting to listen on 192.168.153.130:10250
    I0423 00:24:02.990692 19037 server.go:299] Adding debug handlers to kubelet server.
    I0423 00:24:02.998208 19037 server.go:149] Starting to listen read-only on 192.168.153.130:10255
    E0423 00:24:02.999643 19037 event.go:209] Unable to write event: 'Post http://192.168.153.130:8080/api/v1/namespaces/default/events: dial tcp 192.168.153.130:8080: getsockopt: connection refused' (may retry after sleeping)
    I0423 00:24:03.001780 19037 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.153.130
    I0423 00:24:03.001842 19037 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.153.130
    I0423 00:24:03.001866 19037 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.153.130
    I0423 00:24:03.003283 19037 manager.go:188] Starting Device Plugin manager
    E0423 00:24:03.003546 19037 container_manager_linux.go:583] [ContainerManager]: Fail to get rootfs information unable to find data for container /
    I0423 00:24:03.003551 19037 manager.go:444] Read checkpoint file /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint
    I0423 00:24:03.003694 19037 manager.go:219] Serving device plugin registration server on "/var/lib/kubelet/device-plugins/kubelet.sock"
    I0423 00:24:03.003740 19037 fs_resource_analyzer.go:66] Starting FS ResourceAnalyzer
    I0423 00:24:03.003765 19037 status_manager.go:140] Starting to sync pod status with apiserver
    I0423 00:24:03.003780 19037 kubelet.go:1778] Starting kubelet main sync loop.
    I0423 00:24:03.003787 19037 kubelet.go:1795] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 2562047h47m16.854775807s ago; threshold is 3m0s]
    I0423 00:24:03.003499 19037 container_manager_linux.go:425] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
    I0423 00:24:03.003926 19037 volume_manager.go:245] The desired_state_of_world populator starts
    I0423 00:24:03.003929 19037 volume_manager.go:247] Starting Kubelet Volume Manager
    I0423 00:24:03.010341 19037 factory.go:356] Registering Docker factory
    I0423 00:24:03.011253 19037 factory.go:136] Registering containerd factory
    I0423 00:24:03.011372 19037 factory.go:54] Registering systemd factory
    I0423 00:24:03.011837 19037 factory.go:86] Registering Raw factory
    I0423 00:24:03.012308 19037 manager.go:1178] Started watching for new ooms in manager
    I0423 00:24:03.014073 19037 manager.go:329] Starting recovery of all containers
    I0423 00:24:03.055501 19037 manager.go:334] Recovery completed
    E0423 00:24:03.092243 19037 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "192.168.153.130" not found
    I0423 00:24:03.104149 19037 kubelet.go:1857] SyncLoop (ADD, "file"): "kube-apiserver-192.168.153.130_kube-system(7ce60ec6f51925aba529a2fcfa5b6fc7), kube-controller-manager-192.168.153.130_kube-system(c5b5299c795b018e6d88d32ed2ebcab9), etcd-server-192.168.153.130_default(1509edd69790e8f5831efd3c898890ee), kube-scheduler-192.168.153.130_kube-system(9ce9cc74d855d92111d84c5caab5eac1)"
    I0423 00:24:03.104202 19037 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
    I0423 00:24:03.104160 19037 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
    I0423 00:24:03.106007 19037 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.153.130
    I0423 00:24:03.106019 19037 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.153.130
    I0423 00:24:03.106024 19037 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.153.130
    I0423 00:24:03.106344 19037 predicate.go:124] Predicate failed on Pod: kube-apiserver-192.168.153.130_kube-system(7ce60ec6f51925aba529a2fcfa5b6fc7), for reason: Node didn't have enough resource: memory, requested: 1073741824, used: 0, capacity: 903954432
    I0423 00:24:03.106404 19037 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
    I0423 00:24:03.106480 19037 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.153.130
    I0423 00:24:03.106526 19037 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.153.130
    I0423 00:24:03.106532 19037 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.153.130
    I0423 00:24:03.106538 19037 kubelet_node_status.go:82] Attempting to register node 192.168.153.130
    E0423 00:24:03.106843 19037 kubelet_node_status.go:106] Unable to register node "192.168.153.130" with API server: Post http://192.168.153.130:8080/api/v1/nodes: dial tcp 192.168.153.130:8080: getsockopt: connection refused
    W0423 00:24:03.106885 19037 status_manager.go:459] Failed to get status for pod "kube-apiserver-192.168.153.130_kube-system(7ce60ec6f51925aba529a2fcfa5b6fc7)": Get http://192.168.153.130:8080/api/v1/namespaces/kube-system/pods/kube-apiserver-192.168.153.130: dial tcp 192.168.153.130:8080: getsockopt: connection refused
    I0423 00:24:03.107635 19037 kubelet_node_status.go:434] Recording NodeHasSufficientDisk event message for node 192.168.153.130
    I0423 00:24:03.107647 19037 kubelet_node_status.go:434] Recording NodeHasSufficientMemory event message for node 192.168.153.130
    I0423 00:24:03.107652 19037 kubelet_node_status.go:434] Recording NodeHasNoDiskPressure event message for node 192.168.153.130
    I0423 00:24:03.107849 19037 predicate.go:124] Predicate failed on Pod: kube-controller-manager-192.168.153.130_kube-system(c5b5299c795b018e6d88d32ed2ebcab9), for reason: Node didn't have enough resource: memory, requested: 1073741824, used: 0, capacity: 903954432
    I0423 00:24:03.107934 19037 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
    There is a character limit in the chat room, so I only copied the first 112 lines of the docker log message. Please let me know if you need more information.
    YundongYe
    @ydye
    @JosephKang I0423 00:24:03.106344 19037 predicate.go:124] Predicate failed on Pod: kube-apiserver-192.168.153.130_kube-system(7ce60ec6f51925aba529a2fcfa5b6fc7), for reason: Node didn't have enough resource: memory, requested: 1073741824, used: 0, capacity: 903954432
    Node didn't have enough resource: memory, requested: 1073741824, used: 0, capacity: 903954432
    Set this value to false, clean up the environment, and then redeploy.
    The same issue exists in the OpenPAI services' deployment; you should also turn off that switch: https://github.com/microsoft/pai/blob/pai-0.14.y/examples/cluster-configuration/services-configuration.yaml#L28
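    (Concretely, the switch in question appears to be qos-switch; a minimal sketch assuming the 0.14 example layout, so check your generated files for the exact key location:)

    # ~/pai-config/kubernetes-configuration.yaml
    kubernetes:
      qos-switch: "false"      # stops kubelet from reserving memory for system daemons, which starves a 4GB VM

    # ~/pai-config/services-configuration.yaml
    cluster:
      common:
        qos-switch: "false"    # the service-side counterpart of the same switch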
    JosephKang
    @JosephKang
    @ydye Thanks.
    The Kubernetes cluster booted successfully after adopting your suggestion, and I can see the kube dashboard now. :)
    JosephKang
    @JosephKang

    Hello, I have an issue in Step 4, "Update cluster configuration into kubernetes": https://github.com/microsoft/pai/blob/master/docs/pai-management/doc/single-box.md
    cat ~/.kube/config
    apiVersion: v1
    kind: Config
    preferences: {}
    clusters:
    contexts:
    - context:
        cluster: kubernetes
        user: admin
      name: kubernetes
    current-context: kubernetes
    users:
    - name: developer

    python paictl.py config push -c ~/.kube/config

    /home/cfkang/pai/deployment/paiLibrary/common/file_handler.py:37: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
    cluster_data = yaml.load(f)
    2020-04-23 16:25:50,230 [INFO] - deployment.paiLibrary.common.kubernetes_handler : Couldn't find configmap named pai-external-storage-conf
    2020-04-23 16:25:50,231 [ERROR] - deployment.confStorage.external_version_control.external_config : Unable to get the external storage configuration from k8s cluster.
    2020-04-23 16:25:50,232 [ERROR] - deployment.confStorage.external_version_control.external_config : Please check the configmap named [pai-external-storage] in the namespace [default].

    Did I miss a configuration step, since the ~/.kube/config cannot be used due to the missing cluster-id?

    python paictl.py service start -c ~/.kube/config -n service-list
    /home/cfkang/pai/deployment/paiLibrary/common/file_handler.py:37: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
    cluster_data = yaml.load(f)
    2020-04-23 16:37:35,932 [INFO] - deployment.paiLibrary.paiService.service_management_start : Get the service-list to manage : ['service-list']
    2020-04-23 16:37:35,937 [INFO] - deployment.paiLibrary.common.kubernetes_handler : Couldn't find configmap named pai-cluster-id
    2020-04-23 16:37:35,938 [ERROR] - deployment.confStorage.download : No cluster_id found in your cluster, which should be done the first time you upload your configuration.
    2020-04-23 16:37:35,938 [ERROR] - deployment.confStorage.download : Please execute the command following!
    2020-04-23 16:37:35,938 [ERROR] - deployment.confStorage.download : paictl.py config push [-c /path/to/kubeconfig ] [-p /path/to/cluster/configuration | -e /path/to/external/storage/conf/path]
    2020-04-23 16:37:35,938 [ERROR] - deployment.confStorage.download : More detailed information, please refer to the following link.
    2020-04-23 16:37:35,938 [ERROR] - deployment.confStorage.download : https://github.com/Microsoft/pai/blob/master/docs/paictl/paictl-manual.md
    YundongYe
    @ydye
    python paictl.py config push -c /path/to/your/openpaiconfig
    JosephKang
    @JosephKang
    Can the ~/.kube/config generated by paictl (python paictl.py config generate -i /pai/deployment/quick-start/quick-start.yaml ...) be used as /path/to/your/openpaiconfig?
    JosephKang
    @JosephKang

    cat ~/.kube/config
    ...
    apiVersion: v1
    kind: Config
    preferences: {}
    clusters:
    contexts:
    - context:
        cluster: kubernetes
        user: admin
      name: kubernetes
    current-context: kubernetes
    users:
    - name: developer
    YundongYe
    @ydye
    Sorry, the command should be python paictl.py config push -p /path/to/your/openpaiconfig
    -c is optional; its default value is ~/.kube/config
    JosephKang
    @JosephKang
    Got it. It should be python paictl.py config push -p ~/pai-config/
    It works! Moving on to Step 5, "Start all OpenPAI services". Thanks @ydye
    Chenxiao Niu
    @ShawnNew
    Does anybody know how to solve this problem: "ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)."?
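    (That message usually comes from PyTorch DataLoader workers exhausting the container's default 64MB /dev/shm. If your PAI version supports extraContainerOptions, a sketch with a hypothetical task role name and an example size of 512MB:)

    taskRoles:
      train:                       # hypothetical task role name
        extraContainerOptions:
          shmMB: 512               # enlarge /dev/shm beyond the 64MB Docker default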
    AprLie
    @AprLie
    Hello, I cannot run the command "python paictl.py config generate -i /pai/deployment/quick-start/quick-start.yaml -o ~/pai-config -f"; it fails with "ModuleNotFoundError: No module named 'clusterObjectModel'". Any ideas?
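    (A likely cause, assuming the quick-start layout, is running paictl.py from outside the repo root, or from a checkout whose branch does not match the docs, so sibling packages such as clusterObjectModel are not on the import path:)

    cd /pai    # paictl.py resolves its packages relative to the repo root
    python paictl.py config generate -i /pai/deployment/quick-start/quick-start.yaml -o ~/pai-config -f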
    Xiaolul
    @xiaolul
    Hello, does OpenPAI still support single-box installation at the moment?
    winston-zhang-orz
    @winston-zhang-orz
    hi @scarlett2018
    Guikarist
    @guikarist
    Hi, you guys! Is there any way to use multiple virtual clusters in a single job?
    For example, our reinforcement learning (RL) actors prefer CPU VCs, while RL learners prefer GPU VCs.
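    (As far as I know, a job binds to exactly one virtual cluster via defaults.virtualCluster, so the usual pattern, sketched here with hypothetical VC names cpu-vc and gpu-vc, is to submit actors and learners as two separate jobs that communicate over the network:)

    # actors job
    defaults:
      virtualCluster: cpu-vc

    # learners job, submitted separately
    defaults:
      virtualCluster: gpu-vc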
    gaoyangcaiji
    @gaoyangcaiji
    I have made a PVC; why can't my OpenPAI dashboard see any storage? Does anyone know why? Thanks.
    poetryben888
    @poetryben88
    anybody?