Ward Vandewege
@cure
are there actual errors in the container logs?
Tom Schoonjans
@tschoonj
Hi Ward
regular srun commands work perfectly fine
from the crunch side of things, everything looks good
Ward Vandewege
@cure
great
Tom Schoonjans
@tschoonj
it appears that the crunch-dispatch-slurm service is explicitly trying to scancel the job after it has successfully completed
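One quick way to watch those dispatcher messages, assuming a systemd-managed install (the unit name below matches the package default):

journalctl -u crunch-dispatch-slurm -f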
Ward Vandewege
@cure
yeah those messages in the logs may just be noise that can be ignored
(and we can file a bug to get them fixed)
Tom Schoonjans
@tschoonj
they're not just noise I am afraid, as they drain the node :-(
Ward Vandewege
@cure
are the jobs completing OK from the arvados perspective?
Tom Schoonjans
@tschoonj
we are now manually setting them to idle after every crunch job
as far as I can tell they are
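For reference, "setting them to idle" by hand is typically done with scontrol; a minimal sketch, assuming the worker node names from the slurm.conf pasted further down:

scontrol update NodeName=slurm-worker-blue-dispatcher-[1-2] State=RESUME

RESUME returns a drained node to service without forcing a state Slurm disagrees with.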
Ward Vandewege
@cure
so it's maybe a slurm configuration issue then
Tom Schoonjans
@tschoonj
maybe indeed
have you seen this before where the crunch-dispatch-slurm service is actively trying to run scancel on finished jobs?
Ward Vandewege
@cure
I think that is normal
the part where that turns into draining of the node is weird
does that happen if you try using scancel manually, too?
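A hand test could look like the following; the sleep job is illustrative, and <jobid> stands for whatever id sbatch prints:

# submit a short throwaway job, cancel it manually,
# then check whether a node drained and why
sbatch --wrap "sleep 120"
scancel <jobid>
sinfo -R

sinfo -R lists drained/down nodes along with the reason slurmctld recorded, which should show why the scancel leads to a drain.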
Tom Schoonjans
@tschoonj
we haven't tried that yet, as we have only tested with short-running jobs
Ward Vandewege
@cure
hmm, maybe those scancels are not normal
Tom Schoonjans
@tschoonj
Here is our slurm.conf, is there something that looks fishy?
SlurmctldPort=6817
SlurmdPort=6818
SrunPortRange=60001-63000
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmdPidFile=/run/slurm/slurmd.pid
ProctrackType=proctrack/cgroup
CacheGroups=0
ReturnToService=2
TaskPlugin=task/cgroup
SlurmUser=slurm
SlurmdUser=root
SlurmctldHost=crunch-dispatcher-slurm-controller-blue-dispatcher
ClusterName=core-infra
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
#JobCompLoc=
JobAcctGatherType=jobacct_gather/cgroup
# ACCOUNTING
AccountingStorageHost=localhost
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
AccountingStoragePort=6819

# COMPUTE NODES
NodeName=DEFAULT
PartitionName=DEFAULT MaxTime=INFINITE State=UP

NodeName=slurm-worker-blue-dispatcher-[1-2] RealMemory=7961 CPUs=4 TmpDisk=29597 State=UNKNOWN
PartitionName=compute Nodes=slurm-worker-blue-dispatcher-[1-2] Default=YES Shared=YES
Ward Vandewege
@cure
I'm not spotting anything weird here
maybe in NodeName, you have State=UNKNOWN
hmm no we have that too
Ward Vandewege
@cure
we have this on one of our systems:
# COMPUTE NODES
NodeName=DEFAULT CPUs=20 SOCKETs=1 CORESPERSOCKET=20 THREADSPERCORE=1 State=UNKNOWN RealMemory=32109 Weight=32109
PartitionName=DEFAULT MaxTime=INFINITE State=UP
#PartitionName=compute Default=YES Shared=yes

NodeName=compute[0-63]

PartitionName=compute Nodes=compute[0-63] Default=YES Shared=yes
that's just the bit about compute nodes
Tom Schoonjans
@tschoonj
well, thanks for the help anyway. We will revisit this after the weekend. I will first try getting rid of the cgroup config, as I am not sure it is working properly on our VMs
Ward Vandewege
@cure
sounds good! Happy to help
Tom Schoonjans
@tschoonj
many thanks Ward!
Tom Schoonjans
@tschoonj

Hi, we got things working by dropping cgroups support in the Slurm config.

We now also got our first CWL workflow running with arvados-cwl-runner, but is it possible to do this through the REST API? It looks possible to create/delete/modify a workflow via the REST API, but I couldn't find a way to start one except through the aforementioned tool.
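"Dropping cgroups support" in slurm.conf generally means switching the cgroup-based plugins to their non-cgroup equivalents; a plausible sketch of the change (the exact lines used here are not shown in the conversation):

# replace the cgroup plugins from the earlier slurm.conf
ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
JobAcctGatherType=jobacct_gather/linux

All three values are standard Slurm plugin names; proctrack/pgid is another common non-cgroup choice.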

Andrey Kartashov
@portah
@tschoonj Can you share your experience running Arvados on Slurm?
Peter Amstutz
@tetron
@tschoonj there isn't a high-level API for submitting workflows. the way arvados-cwl-runner works is by constructing a container request that runs the workflow runner, and that actually manages the particular instance of the running workflow
@tschoonj so you can programmatically submit a container request that follows the same pattern, there's just more detail to account for
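A rough sketch of that pattern with the Arvados Python SDK, for the record: the container image, command line, and mount layout below mirror what arvados-cwl-runner itself submits, but the exact details vary by Arvados version, and the packed-workflow file name and inputs are made up for illustration:

import json
import arvados

api = arvados.api("v1")  # uses ARVADOS_API_HOST / ARVADOS_API_TOKEN from the environment

# a CWL workflow packed into a single JSON document, e.g. with "cwltool --pack"
with open("packed-workflow.json") as f:   # hypothetical local file
    workflow = json.load(f)

inputs = {"message": "hello"}             # hypothetical workflow inputs

cr = api.container_requests().create(body={"container_request": {
    "name": "workflow submitted via the API",
    "state": "Committed",                 # Committed = queue it for the dispatcher
    "priority": 500,
    "container_image": "arvados/jobs",    # image that ships arvados-cwl-runner
    "command": [
        "arvados-cwl-runner", "--local", "--api=containers",
        "/var/lib/cwl/workflow.json#main",
        "/var/lib/cwl/cwl.input.json",
    ],
    "cwd": "/var/spool/cwl",
    "output_path": "/var/spool/cwl",
    "mounts": {
        "/var/lib/cwl/workflow.json": {"kind": "json", "content": workflow},
        "/var/lib/cwl/cwl.input.json": {"kind": "json", "content": inputs},
        "/var/spool/cwl": {"kind": "collection", "writable": True},
        "stdout": {"kind": "file", "path": "/var/spool/cwl/cwl.output.json"},
    },
    # the runner needs API access so it can submit the workflow's step containers
    "runtime_constraints": {"API": True, "vcpus": 1, "ram": 1024 * 1024 * 1024},
}}).execute()

print(cr["uuid"], cr["container_uuid"])

Once the request is Committed, the dispatcher runs arvados-cwl-runner inside that container, and the runner manages the individual workflow steps, matching the behaviour Peter describes above.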
Ward Vandewege
@cure
@tetron maybe we should add a code example for that to our docs
Tom Schoonjans
@tschoonj
Thanks for the replies!
Ward Vandewege
@cure
is everything working OK otherwise @tschoonj ?
Tom Schoonjans
@tschoonj
yes, we did get our minimal crunch-dispatch-slurm setup working
Ward Vandewege
@cure
excellent
Andrey Kartashov
@portah
@tschoonj Do you have a recommendation for a newbie? or just follow the docs?
Tom Schoonjans
@tschoonj
just follow the docs
one thing we are still discussing is federation
is it possible with 2.2 to submit a CWL workflow to cluster A, which has Keep but no Crunch, and have that workflow get picked up by cluster B, which has no Keep but does have Crunch support?
Andrey Kartashov
@portah
Arvados federation? Or some integration with your current infrastructure?
Tom Schoonjans
@tschoonj
Arvados federation
Peter Amstutz
@tetron
@tschoonj cluster B still needs to have Keep for storing intermediate results
there's some federation token handling that we need to work on to make that case of "cluster B uses federated data from cluster A" work smoothly
Tom Schoonjans
@tschoonj
@tetron so this won't be possible even with 2.3, even with cluster B having its own Keep?
Peter Amstutz
@tetron
there might be a workaround
you can override the token it uses
actually