Rémi Lacroix
@RemiLacroix-IDRIS
Hello. As you might have read already, Jean Zay will be unavailable for a whole day on April 5th (from 7am to the late afternoon, or even later if something goes wrong) due to maintenance on IDRIS's electrical infrastructure. Also note that since there will be no electrical power in the building, the hotline will be closed that day.
2 replies
lucanest
@lucanest
Hi everyone, is there a way to install R on Jean Zay? Thanks
9 replies
Rémi Lacroix
@RemiLacroix-IDRIS
Hello. We are having some issues on Jean Zay since around 10pm yesterday. The Slurm controller is now back but most nodes are offline.
4 replies
Rémi Lacroix
@RemiLacroix-IDRIS
Jean Zay is back online, hopefully for good!
Ramana Sundararaman
@Sentient07
Hello. Since the restart, I'm unable to start either batch or interactive jobs. Is this due to node unavailability? How can I find out why my tasks aren't starting? Thank you
11 replies
grgkopanas
@grgkopanas
Hi, I am trying to set up my own conda environment on Jean Zay, following this documentation: http://www.idris.fr/jean-zay/gpu/jean-zay-gpu-python-env.html. I noticed that if I do that I probably won't get optimal performance. To work around this I thought of cloning one of the existing environments and adding my dependencies from there, but the clone fails: conda create --clone pytorch-gpu-1.8.1+py3.8 -p $SCRATCH/conda_envs/neural_catacaustics/ gives the following error: PermissionError: [Errno 13] Permission denied: '/gpfslocalsup/pub/anaconda-py3/2020.11/pkgs/qt-5.12.9-h1f2b2cb_0/info/paths.json'
6 replies
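A possible workaround, sketched below under the assumption that the permission error comes from conda writing to the read-only system package cache: redirect the package cache to a writable directory before cloning (the module name and target paths are placeholders, not a confirmed IDRIS recipe).

module load anaconda-py3                       # assumption: check `module avail` for the exact module name
export CONDA_PKGS_DIRS=$SCRATCH/conda_pkgs     # keep the package cache out of the read-only system directory
conda create --clone pytorch-gpu-1.8.1+py3.8 -p $SCRATCH/conda_envs/neural_catacaustics
conda activate $SCRATCH/conda_envs/neural_catacaustics
pip install --no-cache-dir <your-extra-dependency>   # then add your own packages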
nicolas-dufour
@nicolas-dufour
Hi, is it possible to have PyTorch 1.11, please? Thanks
1 reply
Chaithya G R
@chaithyagr
My machine at CEA is having some hardware issues and I would like to access Jean Zay from a different IP address. What is the procedure to get a new address authorized, and how long does it usually take? Thank you.
3 replies
Varun Kapoor
@kapoorlab
I am unable to load the Qt platform, most likely because of a missing library that needs sudo permission to install. Can I ask an admin to run this command: sudo apt install libxkbcommon-x11-0
12 replies
Varun Kapoor
@kapoorlab
image.png
This is to get Napari running in an interactive job; because this library is missing on the compute nodes, Napari cannot open (screenshot attached)
Varun Kapoor
@kapoorlab
image.png
grgkopanas
@grgkopanas
Hi, is it ok to run tensorboard on the node we ssh to submit jobs?
6 replies
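For reference, a common pattern (a sketch, not IDRIS-specific guidance) is to run TensorBoard on the remote side and reach it through an SSH tunnel from your local machine; the port, log directory and login below are placeholders.

# on Jean Zay (ideally on a compute or visu node rather than the login node):
tensorboard --logdir $SCRATCH/runs --host 127.0.0.1 --port 6006

# on your local machine, forward the port, then open http://localhost:6006
ssh -N -L 6006:127.0.0.1:6006 my_login@jean-zay.idris.fr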
Chaithya G R
@chaithyagr
Hello, is it possible to have a TensorFlow 2.8 environment? I would like to use some packages which need it. I could create a personal environment for it, but that would use up my inodes, and I am already under a strong space crunch.
5 replies
Ramana Sundararaman
@Sentient07
Hello. May I please know when the Jean Zay servers will be back today?
1 reply
Rémi Lacroix
@RemiLacroix-IDRIS
Jean Zay is back online. It seems that some jobs that were pending before the maintenance and started just after the end of the maintenance crashed. We suggest that you resubmit any job that failed, they should work now.
Ramana Sundararaman
@Sentient07
Hello I face this error https://dpaste.com/E42F8UWY2 when starting a job. Can @RemiLacroix-IDRIS please help me out? Thanks
8 replies
It seems the V100 nodes are okay; it's just the A100 ones that won't let me start a process
Stavros Diolatzis
@diolatzis

Hey everyone. After today's maintenance all my jobs crash with the following error:

Traceback (most recent call last):
  File "/gpfsdswork/projects/rech/zox/ufd68on/mesostylegan3/main/jobs_utils/genci/../../../main/train.py", line 398, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/gpfslocalsup/pub/anaconda-py3/2021.05/envs/pytorch-gpu-1.9.0+py3.9/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/gpfslocalsup/pub/anaconda-py3/2021.05/envs/pytorch-gpu-1.9.0+py3.9/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/gpfslocalsup/pub/anaconda-py3/2021.05/envs/pytorch-gpu-1.9.0+py3.9/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gpfslocalsup/pub/anaconda-py3/2021.05/envs/pytorch-gpu-1.9.0+py3.9/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/gpfsdswork/projects/rech/zox/ufd68on/mesostylegan3/main/jobs_utils/genci/../../../main/train.py", line 393, in main
    launch_training(c, opts)
  File "/gpfsdswork/projects/rech/zox/ufd68on/mesostylegan3/main/jobs_utils/genci/../../../main/train.py", line 107, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "/gpfslocalsup/pub/anaconda-py3/2021.05/envs/pytorch-gpu-1.9.0+py3.9/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/gpfslocalsup/pub/anaconda-py3/2021.05/envs/pytorch-gpu-1.9.0+py3.9/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/gpfslocalsup/pub/anaconda-py3/2021.05/envs/pytorch-gpu-1.9.0+py3.9/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 4 terminated with signal SIGKILL

The same jobs were running fine before today's maintenance. I am running a variant of StyleGAN on gpu_p2 with 8 GPUs. Any clues as to why this is happening?

9 replies
Chaithya G R
@chaithyagr
Just checking: is it possible to recover some files from $STORE which I deleted with rm -rf by mistake?
1 reply
Elias Ramzi
@elias-ramzi
Hi, I get the following error: slurmstepd: error: Detected 1 oom-kill event(s) in StepId=1311924.batch. Some of your processes may have been killed by the cgroup out-of-memory handler. for a job that should run fine within 16 GiB. I am not sure how to track down the cause. Thanks!
4 replies
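One way to investigate, sketched under the assumption that Slurm accounting is enabled: check the job's peak memory with sacct, then raise the request in the batch script (on many clusters memory scales with the number of reserved cores, so the exact directive to use is an assumption).

sacct -j 1311924 --format=JobID,State,Elapsed,ReqMem,MaxRSS   # was MaxRSS close to the limit?

# in the .slurm header, reserve more memory, e.g. via more cores per task:
#SBATCH --cpus-per-task=10
# or, where an explicit request is allowed on the partition:
##SBATCH --mem=32G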
Varun Kapoor
@kapoorlab
I have a VNC installation on Ubuntu and I am trying to launch a visualization on the visu node, which prompts me to connect to a URL with a password, but I do not know how to launch the VNC session from my Ubuntu machine, as I do not have the GUI shown on this page: http://www.idris.fr/eng/jean-zay/pre-post/jean-zay-outils-visu-noeuds-eng.html
10 replies
Charlie Saillard
@checkpt:matrix.org
Hello!
I have weird issues with torch.distributed.init_process_group().
For a multi-GPU multi-node job (4 nodes, 4 GPUs per node), init_process_group() hangs (nothing happens).
This is weird because the same script launched with 2 nodes and 2 GPUs per node on the same partition (gpu_13) works fine.
I saw it has happened to others (pytorch/pytorch#76069) but I couldn't find a solution so far.
I know this is more a torch issue than a Jean Zay one, but I thought maybe other people here have encountered it?
8 replies
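A debugging sketch, assuming NCCL over a Slurm launcher: make the rendezvous explicit and verbose in the sbatch script, so a hang can be traced to a wrong master address/port or a network-interface mismatch (the variable choices below are assumptions, not a confirmed fix).

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500                 # any free port, identical on all ranks
export NCCL_DEBUG=INFO                   # print NCCL init and transport details
export NCCL_ASYNC_ERROR_HANDLING=1       # fail fast instead of hanging forever
srun python train.py                     # ranks read the variables above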
nicolas-dufour
@nicolas-dufour
Hi, I had a project that got renewed but I can't use it anymore. When running idracct I don't see the project hours, and when submitting a job it fails with the error: IDRIS: account ipk@v100 is closed to regular jobs. Please check your budget using idracct command.
6 replies
Ramana Sundararaman
@Sentient07
Hello. I'm unable to connect to the Jean Zay front end. Is there a planned maintenance scheduled for today?
Rémi Lacroix
@RemiLacroix-IDRIS
Hello. Yes, there is a maintenance from 8am to 10:30am: http://www.idris.fr/status.html.
grgkopanas
@grgkopanas
Hi, is there a way to get the submission command line I used for each of my current jobs, so I know which one to cancel?
2 replies
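One way to list that, sketched with standard Slurm tools (the format string is just an example): squeue can print the command associated with each of your jobs, and scontrol shows it for a single job.

squeue -u $USER -o "%.10i %.20j %o"            # job id, name, and the submitted command/script
scontrol show job <jobid> | grep -i command    # full record for one job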
Elias Ramzi
@elias-ramzi

Hi everyone!
I have a job that is blocked. I tried to run a job using 8 GPUs on the gpu_p2 partition for 50 hours. Here is the relevant part of the header of my .slurm script:

#SBATCH --nodes=1
#SBATCH --partition=gpu_p2
#SBATCH --ntasks=8                   # number of MPI tasks
#SBATCH --ntasks-per-node=8          # number of MPI tasks per node
#SBATCH --gres=gpu:8                 # number of GPUs per node
#SBATCH --cpus-per-task=3            # number of cores per task
#SBATCH --hint=nomultithread         # we get physical cores not logical
#SBATCH --distribution=block:block   # we pin the tasks on contiguous cores
#SBATCH --time=50:00:00              # maximum execution time (HH:MM:SS)
#SBATCH --qos=qos_gpu-t4

The job seems to be blocked; it shows the following reason: QOSGrpGRES (this is currently my only waiting job). I checked this page: http://www.idris.fr/jean-zay/gpu/jean-zay-gpu-exec_partition_slurm.html but I did not find an explanation for this message.
Thanks for any help!

2 replies
Parcollet Titouan
@TParcollet
Hi there, I have two datasets that contain a lot of files (1M and 3M), and they are also quite heavy. For now they are stored in $SCRATCH, but I would like to compress them efficiently, maybe into multiple tar files. What is the recommended way of dealing with this on Jean Zay? I was thinking of WebDataset, but multiple .tar files would also make sense.
2 replies
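A possible approach, sketched below: pack the dataset into fixed-size shards (one tar per N files), which keeps the inode count low and fits WebDataset-style sequential reading; the shard size and paths are placeholders.

cd $SCRATCH/my_dataset
mkdir -p $SCRATCH/shards
ls | split -l 10000 - shardlist_                      # one file list per 10,000 files
for list in shardlist_*; do
    tar -cf "$SCRATCH/shards/${list}.tar" -T "$list"  # -T reads the file names from the list
done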
Parcollet Titouan
@TParcollet
Dear all, is the quota (total space, not inodes) of $SCRATCH shared across project members? If our project has 50 TB of $STORE, does that mean the project members share 5 TB in $SCRATCH? Thanks!
3 replies
Parcollet Titouan
@TParcollet
Hi again! If we already have a project under A11, can we request access to the A100 nodes? Thanks!
3 replies
Varun Kapoor
@kapoorlab
Hello, I am getting a 'Disk quota exceeded' error which is preventing me from creating new files or running my calculations, even though I am well below the 50 TB threshold on the STORE directory.
1 reply
ChaimaeHADROUCH
@ChaimaeHADROUCH
Hello, I have a weird problem; it has been going on for a week now. I have code that runs very well on AWS and on my local machine, but I get an error when running it on Jean Zay, even though the environment is fine and I have checked everything. Is there a way to debug on Jean Zay?
3 replies
Because what I do is run the script with sbatch ------.slurm and then cat errors.out to see the error I got
ChaimaeHADROUCH
@ChaimaeHADROUCH
Thank you; I have another question. If the training needs more than 20 hours, what should I do to save checkpoints, for example, or to resume training from the point where it stopped?
12 replies
As far as I know, Jean Zay gives us a maximum of 20 hours to run a job,
but what if we need more time than that to run our job?
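A common pattern (a sketch, not official IDRIS guidance, and train.slurm is a placeholder): have the training code save checkpoints regularly and resume from the latest one, then chain several jobs of at most 20h with Slurm dependencies so each job continues where the previous one stopped.

jid1=$(sbatch --parsable train.slurm)
jid2=$(sbatch --parsable --dependency=afterany:$jid1 train.slurm)
jid3=$(sbatch --parsable --dependency=afterany:$jid2 train.slurm)
# train.slurm must detect an existing checkpoint and resume from it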
ChaimaeHADROUCH
@ChaimaeHADROUCH
Like in my case, the training of my code needs more than 20 hours
4 replies
adeschemps
@adeschemps

Dumb question, but how am I supposed to access jean-zay.idris.fr from an Inria computer using the VPN? I tried

ssh uname_inria@transit.irisa.fr
ssh uname_jz@jean-zay.idris.fr

but the authentication fails (I changed my password during the initial connection on the front-end node, and I am using the new password). Is there anything I am doing wrong?

2 replies
I am currently in Canada, so not on the Inria internal network, but my understanding was that that is the point of transit.irisa.fr? Please excuse how clueless I am with networks and SSH connections...
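For what it's worth, a common way to express such a two-hop connection is a ProxyJump entry in ~/.ssh/config (the usernames and host alias are placeholders; whether transit.irisa.fr accepts being used as a jump host is an assumption).

# ~/.ssh/config
Host jean-zay
    HostName jean-zay.idris.fr
    User uname_jz
    ProxyJump uname_inria@transit.irisa.fr

# then a single command does both hops:
ssh jean-zay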
ChaimaeHADROUCH
@ChaimaeHADROUCH
Sorry @RemiLacroix-IDRIS, you did send me an email about the 100 hours that I used and the fact that my job is pending because I set the time limit to 100 hours,
but it was @mypey who recommended running jobs of up to 100h on QoS qos_gpu-t4
1 reply
So in this case I cannot run jobs over 20h, even if I need more time than that?
ChaimaeHADROUCH
@ChaimaeHADROUCH
Hi @RemiLacroix-IDRIS, do you have an idea how to launch TensorBoard on Jean Zay?
2 replies
Rémi Lacroix
@RemiLacroix-IDRIS
Hi. Just a reminder that Jean Zay will be unavailable until 2pm due to hardware maintenance.
4 replies
LTMeyer
@LTMeyer
I'm training a neural network using DDP on a custom dataset stored on disk. I'm puzzled because training is much faster on my personal computer with a single GPU than on JZ: 15 minutes per epoch on my laptop versus 50 minutes on JZ. The most obvious differences between the two setups are the use of DDP and the number of dataloader workers, which I increased on JZ. I assume the problem might be related to the number of workers; I set it to 5, although I'm running on a full node which offers 40 CPUs. I'm trying to avoid this common issue (pytorch/pytorch#13246) by using numpy arrays in my dataset instead of plain Python data structures. However, that issue causes memory over-consumption, not necessarily CPU slowness. Any ideas on what I should look at, and how? For example, how could I profile my dataloader to check for possible bottlenecks?
7 replies
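One lightweight way to check for an input-pipeline bottleneck (a sketch; the availability of py-spy on the cluster is an assumption): sample the running training process and see whether time is spent in the dataloader rather than in GPU compute.

pip install --user py-spy                                         # assuming user installs are allowed
py-spy dump --pid <training_pid>                                  # one-shot stack dump: are workers stuck on I/O?
py-spy record -o profile.svg --pid <training_pid> --duration 60   # flame graph over 60 seconds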
ChaimaeHADROUCH
@ChaimaeHADROUCH
Hi, I want to specify 4 workers for each of the train, val and test dataloaders; how can I do that with #SBATCH?
14 replies
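To clarify the relationship (a sketch, assuming each dataloader worker is a separate process that needs its own CPU core): the worker count is set on the PyTorch DataLoader itself, while #SBATCH only reserves enough CPU cores for those workers.

#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=13       # e.g. 4 train + 4 val + 4 test workers + the main process
#SBATCH --hint=nomultithread
# num_workers=4 is then passed to torch.utils.data.DataLoader in the Python code, not to Slurm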
ChaimaeHADROUCH
@ChaimaeHADROUCH
@LTMeyer actually I had the same problem; could you tell me how you resolved it, please?
My problem is that in my code I use 4 workers for the train dataloader, 4 for the val dataloader and 4 for the test dataloader, but the process runs the training with 4 workers on the train dataset and only 1 worker on the val and test datasets, even though I set all of them to 4. I don't understand why.
6 replies
CentofantiEze
@CentofantiEze
Hi everyone, I have a problem accessing Jean Zay. When I log in, it shows the banner with Jean Zay's welcome message as usual, but then the cursor keeps loading and I never get a prompt. Have you ever experienced this, and do you know how to solve it?
18 replies