These are chat archives for nextflow-io/nextflow

26th
Jun 2017
Phil Ewels
@ewels
Jun 26 2017 08:46
Morning! I have a tricky error for you @pditommaso ;)
Paolo Di Tommaso
@pditommaso
Jun 26 2017 08:47
finally some fun!
:)
Phil Ewels
@ewels
Jun 26 2017 08:48
This process works if (a) I supply a reference with params.bismark_index, or (b) I generate an index on our linux cluster
However, running locally with Docker, I get an error from Perl saying that it can't open the file (this line)
The log shows that it's in the correct directory and that it has found the filename, but I get the error No such file or directory
And yeah, specific to docker + generating an index. So quite a weird error.
Travis job with full log here
Any ideas why the script can't read this file in this docker run? Anything I can try to help with debugging?
Maxime Garcia
@MaxUlysse
Jun 26 2017 08:51
Is the file symlinked correctly ?
Phil Ewels
@ewels
Jun 26 2017 08:51
Yeah, I can read it fine on the command line (outside of docker)
Maxime Garcia
@MaxUlysse
Jun 26 2017 08:51
I remember having problems with bwa, because I was linking only the fasta file in my process, and not the indexes needed by bwa
Phil Ewels
@ewels
Jun 26 2017 08:52
Nope, that's not the problem here - Bismark is trying to read specifically that fasta file (I've had similar problems in the past too)
Maxime Garcia
@MaxUlysse
Jun 26 2017 08:53
OK
Phil Ewels
@ewels
Jun 26 2017 08:53
Also, if that were the case then it would also fail on the cluster
Maxime Garcia
@MaxUlysse
Jun 26 2017 08:53
Depending on how you write the path to the file
It was failing for me with Docker, but not on the cluster
Phil Ewels
@ewels
Jun 26 2017 08:54
Ah I see, but this is being written by a channel which is generating the index
Paolo Di Tommaso
@pditommaso
Jun 26 2017 08:54
it must be a problem with the docker mounts
Phil Ewels
@ewels
Jun 26 2017 08:54
Yup, that was my guess too - any ideas how to investigate?
Paolo Di Tommaso
@pditommaso
Jun 26 2017 08:54
use the docker command in the .command.run to run it in interactive mode
Phil Ewels
@ewels
Jun 26 2017 08:54
from .command.run:
docker run -i --memory 7168m -v /Users/philewels/GitHub/NGI-MethylSeq/tests/work:/Users/philewels/GitHub/NGI-MethylSeq/tests/work -v "$PWD":"$PWD" -w "$PWD" --entrypoint /bin/bash --name $NXF_BOXID ewels/ngi-methylseq -c "/bin/bash /Users/philewels/GitHub/NGI-MethylSeq/tests/work/de/0ab4b4cc2d69de126ebce2765b8dcf/.command.run.1"
ah ok, just run that whole command?
Paolo Di Tommaso
@pditommaso
Jun 26 2017 08:55
nope
remove the --entrypoint option, add bash and the -t option :)
docker run -it --memory 7168m \
 -v /Users/philewels/GitHub/NGI-MethylSeq/tests/work:/Users/philewels/GitHub/NGI-MethylSeq/tests/work \
 -v "$PWD":"$PWD" -w "$PWD" \
 ewels/ngi-methylseq bash
however I already see a problem
where is the input file symlinked from ?
I mean, where is the real file supposed to be located ?
Phil Ewels
@ewels
Jun 26 2017 08:58
The BismarkIndex folder is a symlink to the previous process folder (/Users/philewels/GitHub/NGI-MethylSeq/tests/work/fb/31fcd90800dfe06e763998b37f2ed8/BismarkIndex). Then the fasta file that's being read inside that folder is also a symlink to the user input (/Users/philewels/GitHub/NGI-MethylSeq/tests/test_data/ngi-bisulfite_test_set/references/WholeGenomeFasta/genome.fa)
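(A minimal sketch of the failure mode, with hypothetical paths and debian:stable as a stand-in image: a symlink inside the mounted work tree points outside it, so the target doesn't exist in the container.)
mkdir -p work/fb/1234/BismarkIndex refs
echo ">chr1" > refs/genome.fa
ln -s "$PWD/refs/genome.fa" work/fb/1234/BismarkIndex/genome.fa
# only work/ is mounted; refs/ is not, so the link dangles inside the container
docker run --rm -v "$PWD/work":"$PWD/work" -w "$PWD/work" debian:stable \
  cat "$PWD/work/fb/1234/BismarkIndex/genome.fa"
# -> cat: ...: No such file or directory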
Paolo Di Tommaso
@pditommaso
Jun 26 2017 08:59
Then the fasta file that's being read inside that folder is also a symlink to the user input (/Users/philewels/GitHub/NGI-MethylSeq/tests/test_data/ngi-bisulfite_test_set/references/WholeGenomeFasta/genome.fa)
then this is the problem
Phil Ewels
@ewels
Jun 26 2017 08:59
Yup, I get the No such file or directory error inside the docker image too
Paolo Di Tommaso
@pditommaso
Jun 26 2017 08:59
the only mount is /Users/philewels/GitHub/NGI-MethylSeq/tests/work
Phil Ewels
@ewels
Jun 26 2017 08:59
and also it's a symlink to.. yup
gotcha
So I can't provide an external input as an output from a process
Paolo Di Tommaso
@pditommaso
Jun 26 2017 09:00
how are you providing that file in the process ?
what line ?
Phil Ewels
@ewels
Jun 26 2017 09:01
I'm cheating ;)
Moving the input into a directory, then providing that directory as the output
I guess if I change that to a cp instead of mv then everything will work?
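(A sketch of that change inside the process script, file names hypothetical: cp dereferences the staged symlink and writes a real file, whereas mv just relocates the symlink.)
mkdir BismarkIndex
# mv would carry the dangling symlink along; cp materializes the contents
cp genome.fa BismarkIndex/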
Paolo Di Tommaso
@pditommaso
Jun 26 2017 09:02
uh, you should not modify the inputs ..
tho not sure that's the problem in this case
Phil Ewels
@ewels
Jun 26 2017 09:03
yeah, changing to cp fixed it :+1:
Paolo Di Tommaso
@pditommaso
Jun 26 2017 09:03
:ok:
Phil Ewels
@ewels
Jun 26 2017 09:04
Now that output file is no longer a softlink, so it's accessible within docker
ok great, thanks for the help!
Paolo Di Tommaso
@pditommaso
Jun 26 2017 09:04
welcome
Phil Ewels
@ewels
Jun 26 2017 09:05
Separate question whilst you're here - is the v0.25 release likely to be days, weeks or months away (just very roughly)?
Paolo Di Tommaso
@pditommaso
Jun 26 2017 09:05
hours!
Phil Ewels
@ewels
Jun 26 2017 09:05
even better!
Paolo Di Tommaso
@pditommaso
Jun 26 2017 09:05
working on that right now
Phil Ewels
@ewels
Jun 26 2017 09:05
ok awesome, I'll wait a bit before asking our sysadmins for an upgrade then :)
Maxime Garcia
@MaxUlysse
Jun 26 2017 09:05
\o/
great news
Paolo Di Tommaso
@pditommaso
Jun 26 2017 09:06
what a great team :)
Phil Ewels
@ewels
Jun 26 2017 09:21
:star2:
Alexander Mikheyev
@mikheyev
Jun 26 2017 10:13
@pditommaso continuing from our google groups conversation, there is no directory associated with the failed slurm task. https://groups.google.com/forum/#!topic/nextflow/F1swlIh3GMM
Alexander Mikheyev
@mikheyev
Jun 26 2017 10:19
I also checked a few other tasks with similar status (e.g., Jun-23 19:05:43.044 [Thread-2] DEBUG n.processor.TaskPollingMonitor - !! executor slurm > tasks to be completed: 40 -- first: TaskHandler[jobId: 13916883; id: 33; name: mapReads (33); status: SUBMITTED; exit: -; workDir: /home/s/sasha/src/bee-varroa/data/work/e0/8cecc99340e136517f674d57c7232a started: -; exited: -; ]), but these directories don't exist either.
Could this be due to some sort of temporary asynchrony between the head node where the nextflow script is launched and the remote node?
Paolo Di Tommaso
@pditommaso
Jun 26 2017 10:30
it looks very strange
I would suggest resuming the execution and monitoring some of the tasks reported in the log as SUBMITTED
and verifying why that directory does not exist
but I don't understand how this can happen - the work dir is created by NF on the head node
if it did not exist, the job would not be executed at all; instead it seems that those jobs run and then fail after a while
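(A quick way to do that check while a task still shows SUBMITTED, using the jobId and workDir from the log above; a sketch:)
squeue -j 13916883
# does the work dir actually exist on the head node?
ls -ld /home/s/sasha/src/bee-varroa/data/work/e0/8cecc99340e136517f674d57c7232a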
Alexander Mikheyev
@mikheyev
Jun 26 2017 10:46
OK, I re-ran another test case, just launching jobs that sleep for 30s and echo something to a file. Weirdly, only some of the jobs launch:
task_id    hash    native_id    name    status    exit    submit    duration    realtime    %cpu    rss    vmem    rchar    wchar
3    34/8ff460    13926539    mapReads (4)    COMPLETED    0    2017-06-26 19:43:59.650    34.8s    30s    0.0%    1.7 MB    202.2 MB    6.7 KB    0
2    85/53e66a    13926540    mapReads (6)    COMPLETED    0    2017-06-26 19:43:59.689    34.8s    30s    0.0%    1.7 MB    202.2 MB    6.7 KB    0
1    25/0eb4db    13926541    mapReads (1)    COMPLETED    0    2017-06-26 19:43:59.722    34.8s    30s    0.0%    1.7 MB    202.2 MB    6.7 KB    0
4    bd/326b6b    13926544    mapReads (5)    COMPLETED    0    2017-06-26 19:43:59.821    34.7s    30s    0.0%    1.7 MB    202.2 MB    6.7 KB    0
5    1e/201d0e    13926542    mapReads (3)    ABORTED    -    2017-06-26 19:43:59.756    -    -    -    -    -    -    -
6    d1/5254d3    13926543    mapReads (2)    ABORTED    -    2017-06-26 19:43:59.788    -    -    -    -    -    -    -
At this point the script just hangs. If I rerun with -resume, the workflow succeeds
I guess I should ask the cluster admins why the two jobs failed. But maybe nextflow could handle failures on our tetchy cluster better ;)
Phil Ewels
@ewels
Jun 26 2017 10:51

Hi @pditommaso - I got one successful travis test, but then I updated the docker image and now I have a different random docker error (on Travis only, not locally).

Status: Downloaded newer image for scilifelab/ngi-methylseq:latest
  docker: Error response from daemon: Duplicate mount point '/home/travis/build/SciLifeLab/NGI-MethylSeq/tests/work/91/d8ac4e6df23bb2b7272246ebad459a'.

Job here - any ideas?

Paolo Di Tommaso
@pditommaso
Jun 26 2017 10:53
Never seen before..
Alexander Mikheyev
@mikheyev
Jun 26 2017 10:59
@pditommaso OK, I just realized what's going on on the cluster side of things. For reasons no one understands, if you launch a script from the data storage volume on our cluster you get random bus errors. I get around it by storing my code on a different albeit much smaller partition, but I specifically had Nextflow use the data partition for the work folder, so that it could have space for its work files. However, that's also where Nextflow stores the scripts to launch its jobs, leading to the above-mentioned bus errors. However, is there any way to gracefully recover from such an error and ideally relaunch the job?
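(One way to confirm the flaky volume independently of Nextflow, with hypothetical paths; a sketch:)
cat > /path/to/data/probe.sh <<'EOF'
#!/bin/bash
sleep 30
echo ok
EOF
# submit repeatedly from the data partition and watch for random bus errors
for i in $(seq 1 10); do sbatch --wrap "bash /path/to/data/probe.sh"; done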
Paolo Di Tommaso
@pditommaso
Jun 26 2017 12:46
@mikheyev The problem is that NF is not aware of this error. The sbatch command returns a successful exit status.
you get random bus errors
what do you mean exactly? a SIGBUS error ?
@ewels Any progress with docker: Error response from daemon: Duplicate mount point ?
Phil Ewels
@ewels
Jun 26 2017 12:55
No, I have zero idea on this one
I was hoping that it was random and would fix itself (building remote docker image? something?)
but still getting the same message, on my fork and on the upstream repo :\
Paolo Di Tommaso
@pditommaso
Jun 26 2017 13:01
also, googling doesn't turn up much
Phil Ewels
@ewels
Jun 26 2017 13:01
no, exactly
Paolo Di Tommaso
@pditommaso
Jun 26 2017 13:01
what version of docker are you using ?
Phil Ewels
@ewels
Jun 26 2017 13:02
It's travis, so don't think it's a docker version issue
as it's specific to this repo
Maxime Garcia
@MaxUlysse
Jun 26 2017 13:03
Have you tried running the same test on your own machine ?
Phil Ewels
@ewels
Jun 26 2017 13:03
yeah, that works fine
Maxime Garcia
@MaxUlysse
Jun 26 2017 13:03
(Just to be sure)
OK
Definitely strange
Paolo Di Tommaso
@pditommaso
Jun 26 2017 13:03
I've seen the most crazy Docker issues across different versions
Phil Ewels
@ewels
Jun 26 2017 13:04
Hmm, could it be because I'm using -with-docker [image-id] as well as having process.container defined?
Paolo Di Tommaso
@pditommaso
Jun 26 2017 13:04
what if you force a new docker image by using a different tag instead of latest ?
no
eventually I would check the list of mounts in the docker command line created in the .command.run for the failing process
Phil Ewels
@ewels
Jun 26 2017 13:08
(to confirm - you're completely correct that -with-docker had no effect ;) )
(..it was something I'd recently added to just this repo, so was suspicious)
Ok, I've tried reverting the Dockerfile changes I made earlier - I think it was working before those
then I can at least isolate it to a specific change
Paolo Di Tommaso
@pditommaso
Jun 26 2017 13:13
why don't you use the unique container id of the previous image ?
Phil Ewels
@ewels
Jun 26 2017 13:14
ah cool, didn't know you could do that :+1: - sorry, thought I had to create a new tag and stuff and wasn't sure how to do that retrospectively
Paolo Di Tommaso
@pditommaso
Jun 26 2017 13:15
use that whenever possible ..
Phil Ewels
@ewels
Jun 26 2017 13:15
where do you find the hash?
Maxime Garcia
@MaxUlysse
Jun 26 2017 13:15
I do think it's a great practice
Paolo Di Tommaso
@pditommaso
Jun 26 2017 13:16
it's printed when you push or pull an image
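(For reference, a sketch of pulling by digest; the sha256 value is elided here:)
docker pull ewels/ngi-methylseq@sha256:<digest>
# list the digests of images already present locally
docker images --digests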
Maxime Garcia
@MaxUlysse
Jun 26 2017 13:16
I was planning to use that when fixing our next version
Phil Ewels
@ewels
Jun 26 2017 13:17
But these are automated builds, so I don't push myself. And I'd need the hash to pull it I think?
nm, found it
Paolo Di Tommaso
@pditommaso
Jun 26 2017 13:19
I never understood why they don't publish that id in the hub..
Phil Ewels
@ewels
Jun 26 2017 13:20
I rebuilt it locally to get the hash that works locally
different name, but hash should be the same, right..?
hmm
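(Worth noting: a local rebuild gets a brand-new image ID, so it won't match the registry digest. A sketch for comparing the two:)
# the local image ID changes on every rebuild
docker inspect --format '{{.Id}}' ewels/ngi-methylseq
# RepoDigests is only populated for images pushed to / pulled from a registry
docker inspect --format '{{.RepoDigests}}' ewels/ngi-methylseq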
@MaxUlysse - looks like you can configure dockerhub to create automatic tags when you tag a release on GitHub (named by the release). So you could have the NF config call the docker image using the internal version variable if you wanted..
Would be quite neat :)
Phil Ewels
@ewels
Jun 26 2017 13:25
hmm, nope I got the hash wrong. I'll try building in travis.
Docker version 17.03.1-ce, build c6d412e
ah, it already had that at the top:
docker version
Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:10:36 2017
 OS/Arch:      linux/amd64
Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:10:36 2017
 OS/Arch:      linux/amd64
 Experimental: false
Maxime Garcia
@MaxUlysse
Jun 26 2017 13:35
@ewels I tried configuring DockerHub that way, but I didn't manage to get it working, so I made an NF script to build and push all my images to docker hub
But yes, it would be quite neat if it worked properly
I'll look more into it
But I do need to fix more versions too
Phil Ewels
@ewels
Jun 26 2017 13:36
I'm wondering if it's worth ditching the dockerhub automated builds and just using travis to do them instead. May have more control that way..
(and may be quicker..? dockerhub takes forever to run)
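(A sketch of that approach; the credential variable names are hypothetical:)
docker build -t scilifelab/ngi-methylseq:latest .
docker login -u "$DOCKER_USER" -p "$DOCKER_PASS"
docker push scilifelab/ngi-methylseq:latest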
Hmm, same error when building the image on travis here
Maxime Garcia
@MaxUlysse
Jun 26 2017 13:37
Isn't your image too big to build on Travis ?
Phil Ewels
@ewels
Jun 26 2017 13:37
apparently not..
Maxime Garcia
@MaxUlysse
Jun 26 2017 13:38
I know that was the problem for some of our images (like snpEff), where downloading the db takes forever
Phil Ewels
@ewels
Jun 26 2017 13:41
No I hadn't..
Maxime Garcia
@MaxUlysse
Jun 26 2017 13:42
Don't think it's related, but still
Phil Ewels
@ewels
Jun 26 2017 13:42
Me neither, but thanks ;)
Phil Ewels
@ewels
Jun 26 2017 14:37
Hmmm, I wonder if it could be the Nextflow version that changed..
The last good build ran with NF v0.24
Paolo Di Tommaso
@pditommaso
Jun 26 2017 14:37
Ummm
Maxime Garcia
@MaxUlysse
Jun 26 2017 14:38
Have you tried reverting the NF version ?
Paolo Di Tommaso
@pditommaso
Jun 26 2017 14:38
what version are you using locally ?
Phil Ewels
@ewels
Jun 26 2017 14:38
Version 0.24.0 build 4235
will try to update locally and run the test
huh, nextflow self-update isn't pushing me to v0.25
Paolo Di Tommaso
@pditommaso
Jun 26 2017 14:41
do you have NXF_VER variable set ?
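(For reference, two quick checks; a sketch:)
echo "NXF_VER=$NXF_VER"            # if set, the launcher sticks to this version
NXF_VER=0.25.0 nextflow -version   # or pin a single run explicitly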
Maxime Garcia
@MaxUlysse
Jun 26 2017 14:41
strange, I just did it and got 0.25.0
Phil Ewels
@ewels
Jun 26 2017 14:41
nope
Paolo Di Tommaso
@pditommaso
Jun 26 2017 14:43
what's the output of this ?
curl -s get.nextflow.io | tail -n +20  | head
Phil Ewels
@ewels
Jun 26 2017 14:43
shows 0.25.0
NXF_VER=${NXF_VER:-'0.25.0'}
NXF_ORG=${NXF_ORG:-'nextflow-io'}
NXF_HOME=${NXF_HOME:-$HOME/.nextflow}
NXF_PROT=${NXF_PROT:-'https'}
NXF_BASE=${NXF_BASE:-$NXF_PROT://www.nextflow.io/releases}
NXF_CLI="$0 $@"

export NXF_CLI
export NXF_ORG
Paolo Di Tommaso
@pditommaso
Jun 26 2017 14:44
fine
Phil Ewels
@ewels
Jun 26 2017 14:44
np, will just do curl -fsSL get.nextflow.io | bash again
Hmm, no dice. Still runs perfectly locally, even with NF v0.25.0
Paolo Di Tommaso
@pditommaso
Jun 26 2017 14:47
could you check the docker command line for the offending process in your local run ?
Phil Ewels
@ewels
Jun 26 2017 14:49
Can do, but it's not always the same process that throws the error
In fact, this repo has two separate pipeline scripts and travis runs both - both give the same error
docker run -i --memory 7168m -v /Users/philewels/GitHub/NGI-MethylSeq/tests:/Users/philewels/GitHub/NGI-MethylSeq/tests:ro -v "$PWD":"$PWD" -w "$PWD" --entrypoint /bin/bash --name $NXF_BOXID ewels/ngi-methylseq -c "/bin/bash /Users/philewels/GitHub/NGI-MethylSeq/tests/work/74/273068d19c39344fc109ca28eb8430/.command.run.1"
Paolo Di Tommaso
@pditommaso
Jun 26 2017 14:51
and here /Users/philewels/GitHub/NGI-MethylSeq/tests is different from $PWD
right ?
Phil Ewels
@ewels
Jun 26 2017 14:51
nope
that's the dir that I'm running from
Paolo Di Tommaso
@pditommaso
Jun 26 2017 14:52
ok, but PWD should be a dir assigned by NF ..
no?
Phil Ewels
@ewels
Jun 26 2017 14:52
sorry, I don't understand
Paolo Di Tommaso
@pditommaso
Jun 26 2017 14:52
in the docker command line there are two mounts:
1) -v /Users/philewels/GitHub/NGI-MethylSeq/tests
2) -v "$PWD"
Phil Ewels
@ewels
Jun 26 2017 14:54
$ echo $PWD
/Users/philewels/GitHub/NGI-MethylSeq/tests
from where I launched the pipeline anyway
but the docker command above is from the working directory
I've deleted it now
Paolo Di Tommaso
@pditommaso
Jun 26 2017 14:54
I'm wondering if the $PWD is the same as /Users/philewels/GitHub/NGI-MethylSeq/tests
Phil Ewels
@ewels
Jun 26 2017 14:55
I guess that depends on how NF executes .command.run?
But not sure how that would be any different for this pipeline compared to the others we have
Paolo Di Tommaso
@pditommaso
Jun 26 2017 14:56
I'm starting to think it's related to the fact that NF now mounts the input paths as read-only
except $PWD, which is always added as read-write
Phil Ewels
@ewels
Jun 26 2017 14:57
The latest travis run on our RNA pipeline runs fine though
that was with v0.25
Paolo Di Tommaso
@pditommaso
Jun 26 2017 14:57
weird
Phil Ewels
@ewels
Jun 26 2017 14:58
I might restart the last job that worked
see if it fails this time
gah, I hate problems that are difficult to replicate :sweat:
Emilio Palumbo
@emi80
Jun 26 2017 15:01
I could replicate the issue on OSX with this command:
docker run --rm -ti -w $PWD -v $PWD:$PWD -v $PWD:$PWD:ro debian:stable
Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Tue Mar 28 00:40:02 2017
 OS/Arch:      darwin/amd64

Server:
 Version:      17.03.1-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Fri Mar 24 00:00:50 2017
 OS/Arch:      linux/amd64
 Experimental: true
Paolo Di Tommaso
@pditommaso
Jun 26 2017 15:02
so this happens when the same path is mounted twice, once read-only and the second time read-write
Phil Ewels
@ewels
Jun 26 2017 15:02
The previously successful travis job above just failed when I reran it
So can't be related to the underlying pipeline code. Must be NF + Docker image somehow.
Paolo Di Tommaso
@pditommaso
Jun 26 2017 15:03
does travis have SSH access ?
Phil Ewels
@ewels
Jun 26 2017 15:05
Afraid not
CircleCI does
Paolo Di Tommaso
@pditommaso
Jun 26 2017 15:06
uh, no :)
try this
set the following in the nextflow config for travis
docker.writableInputMounts = true
Phil Ewels
@ewels
Jun 26 2017 15:09
ewels/NGI-MethylSeq@d642459
hmm, now the travis test isn't even running :laughing:
Paolo Di Tommaso
@pditommaso
Jun 26 2017 15:11
move to Circle! :smile:
Phil Ewels
@ewels
Jun 26 2017 15:11
Yeah, I probably should.. We just use Travis for everything else, and it feels tidy to have everything in one place!
It's getting further than it got previously :+1:
Looks like it works :tada:
Paolo Di Tommaso
@pditommaso
Jun 26 2017 15:17
shit, now the hot potato is on our side
Phil Ewels
@ewels
Jun 26 2017 15:18
:laughing:
Paolo Di Tommaso
@pditommaso
Jun 26 2017 15:19
:)
Phil Ewels
@ewels
Jun 26 2017 15:19
But only this pipeline / docker image, and only on travis...
RNA pipeline travis run with v0.25 runs fine
Phil Ewels
@ewels
Jun 26 2017 15:38
Just broke it again, but with a few more debugging statements. Also told it to e-mail me, so that should contain a longer error log if the e-mail manages to make it through.
Paolo Di Tommaso
@pditommaso
Jun 26 2017 15:39
always travis ?
Phil Ewels
@ewels
Jun 26 2017 15:40
hmm?
ah, didn't work as workflow.onComplete doesn't fire with this error
Paolo Di Tommaso
@pditommaso
Jun 26 2017 15:40
a bad day ..
Phil Ewels
@ewels
Jun 26 2017 15:40
oh no, it did - it's just that that didn't work either
hah, yup ;) Ok, git reset and let's just stick with writableInputMounts = true so that I can go home :D
Thanks everyone for the debugging help! Sorry for all the spam.
Let me know if you have any ideas or if there's anything I can do to help debugging work..
Paolo Di Tommaso
@pditommaso
Jun 26 2017 15:47
I'm trying to find a patch