    Tiago Jesus
    @tiagofilipe12
    const generateNcbiRefGenomeUrlFromNcbiMetadata = task({
      input: '*.metadata.json',
      output: '*.urls.txt',
      name: 'From accessions in a file, generate ENA download URLs for FASTQ'
    }, ({ input }) => `cat ${input} | \
        jq -r '@text "\\(.ftppath_refseq)/\\(.assemblyaccession)_\\(.assemblyname)_genomic.fna.gz"'
        > ${input.replace(/\.metadata.json/, '.urls.txt')}`
    )
    gets the output properly written I think:
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR620/SRR620242
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR620/SRR620547
    Am I right?
    but... this task
    const generateEnaFastqUrlsFromRunsInNcbiMetadata = task({
      // NCBI SRAs takes longer to download and extract, so we use ENA
      input: '*.metadata.json',
      output: '*.urls.txt',
      name: 'From Runs accessions in a file, generate ENA download URLs for FASTQ'
    }, ({ input }) => `cat ${input} | jq -r '.runs.Run[] | .acc' | bash -c 'while read acc; do
        if [ \${#acc} == 9 ] ; then
          dir2=""
        else
          dir2=$(printf %03d \${acc:9:3})/
        fi
        echo ftp://ftp.sra.ebi.ac.uk/vol1/fastq/\${acc:0:6}/$dir2$acc
      done' > ${input.replace(/\.metadata.json/, '.urls.txt')}`
    )
    creates an empty output
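    for reference, the derivation the while loop is doing, as a rough plain-JS sketch (enaFastqUrl is a made-up name, not part of the pipeline):
    const enaFastqUrl = (acc) => {
      const prefix = acc.slice(0, 6)
      // 9-char run accessions sit directly under the 6-char prefix; longer ones
      // get an extra zero-padded directory built from the digits after position 9
      const dir2 = acc.length === 9 ? '' : acc.slice(9).padStart(3, '0') + '/'
      return `ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${prefix}/${dir2}${acc}`
    }
    // e.g. enaFastqUrl('SRR620242') === 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR620/SRR620242'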
    Tiago Jesus
    @tiagofilipe12
    also notice this
    "operationString": "cat /home/tiago/bin/bionode-watermill/examples/pipelines/tests/data/ebd8890/solenopsis.metadata.json | jq -r '.runs.Run[] | .acc' | bash -c 'while read acc; do\n    if [ ${#acc} == 9 ] ; then\n      dir2=\"\"\n    else\n      dir2=$(printf %03d ${acc:9:3})/\n    fi\n    echo ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${acc:0:6}/$dir2$acc\n  done' > /home/tiago/bin/bionode-watermill/examples/pipelines/tests/data/ebd8890/solenopsis.urls.txt"
    there are some do\n and then\n in the string
    Julian Mazzitelli
    @thejmazz
    maybe should have a \ at end of those?
    all that gets turned into child_process.spawn('cat', [everythingElseSplitBySpace], { shell: true })
    so maybe it's something about that while loop being executed that way, can try ./myscript.sh instead
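    roughly the issue (a sketch, not watermill's actual code) - the split on spaces plus rejoin flattens the newlines the while loop needs:
    const { spawn } = require('child_process')

    const op = `cat foo.metadata.json | bash -c 'while read acc; do
      echo "$acc"
    done' > foo.urls.txt`

    // roughly what happens now: split on whitespace, spawn then rejoins the args
    // with single spaces for the shell, so the newlines inside the quoted loop are gone
    const [cmd, ...args] = op.split(/\s+/)
    spawn(cmd, args, { shell: true })

    // passing the whole string through untouched keeps the script intact
    spawn(op, { shell: true })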
    Tiago Jesus
    @tiagofilipe12
    I would also suggest that
    It's also cleaner
    Julian Mazzitelli
    @thejmazz
    maybe wrapping it all inside a bash -c would work too
    Julian Mazzitelli
    @thejmazz
    hmm but you use ${input} inside your script, so by making a script file you'd lose that ability
    maybe can do
    cat ${input} | ./mythingy.sh | ...
    I sketched out a way to run inline scripts: bionode/bionode-watermill#89
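    roughly the idea (a sketch with a made-up helper, not the actual script() from that issue): dump the multi-line body to a temp file so the operation string stays a one-liner, while ${input} is still interpolated from JS:
    const fs = require('fs')
    const os = require('os')
    const path = require('path')

    // write a multi-line shell body to a temp script and return its path
    const toScript = (body) => {
      const file = path.join(os.tmpdir(), `watermill-op-${Date.now()}.sh`)
      fs.writeFileSync(file, `#!/usr/bin/env bash\nset -euo pipefail\n${body}\n`, { mode: 0o755 })
      return file
    }

    // e.g. ({ input }) => `cat ${input} | ${toScript("jq -r '.runs.Run[] | .acc'")} > urls.txt`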
    Julian Mazzitelli
    @thejmazz
    using my "script" function from that issue I still get the same error as Bruno:
    69840a4 :   traverse got undefined, returning
    error!:  TypeError: path.split is not a function
    Unhandled rejection (<{"threads":1,"container":null,"resume"...>, no stack trace)
    but the solenopsis.metadata.json is also empty
    Tiago Jesus
    @tiagofilipe12
    Can you see where the error occurs? I could not find the "error!" in the watermill repo. I mean, where is that being printed?
    Julian Mazzitelli
    @thejmazz
    idk, I'm gonna go through the pipeline commenting out each task to see first why the metadata.json is empty
    Julian Mazzitelli
    @thejmazz
    oh wait nvm, I don't even have bionode-ncbi on this box, that's probably why it's empty lol
    Julian Mazzitelli
    @thejmazz
    hmmm ok so separating out the stuff in junction and each works (albeit download can't resolve output for ref because it's looking for fastq.gz not fna) - but anyways each works on its own
    Julian Mazzitelli
    @thejmazz
    hmmmm, I thought maybe it was because *.urls.txt existed from two tasks and that was tripping it up, so I changed one to be *.readsurls.txt - but still the same problem
    Julian Mazzitelli
    @thejmazz
    so the trajectory for the task after the junction just has the junction node id, whose value is an array, but some items are null
    Julian Mazzitelli
    @thejmazz
    here it gets the vertexValue for the junction node, which includes undefineds (they must only show as null when stringified) https://github.com/bionode/bionode-watermill/blob/master/lib/lifecycle/resolve-input.js#L104
    also not sure if this junction node should have duplicates
    node:  22b6f83d6f6c101c188adc81936013f42e2b4a93ecea3b04051a38083f2baf8a
    [ undefined,
      { type: 'collection/add-output',
        name: 'Generate IDs file for solenopsis',
        resolvedOutput: '/home/ubuntu/watermill-test/data/70335b6/solenopsis.ids.txt',
        params: [ 279040, 280098 ] },
      { type: 'collection/add-output',
        name: 'Get reference genome metadata from NCBI',
        resolvedOutput: '/home/ubuntu/watermill-test/data/dbf22b0/solenopsis.metadata.json',
        params: {} },
      { type: 'collection/add-output',
        name: 'From accessions in a file, generate ENA download URLs for FASTQ',
        resolvedOutput: '/home/ubuntu/watermill-test/data/411bcc4/solenopsis.urls.txt',
        params: {} },
      undefined,
      { type: 'collection/add-output',
        name: 'Generate IDs file for solenopsis',
        resolvedOutput: '/home/ubuntu/watermill-test/data/70335b6/solenopsis.ids.txt',
        params: [ 279040, 280098 ] },
      { type: 'collection/add-output',
        name: 'Get reads metadata from NCBI, including all sequencing runs accessions',
        resolvedOutput: '/home/ubuntu/watermill-test/data/700b10a/solenopsis.metadata.json',
        params: {} },
      { type: 'collection/add-output',
        name: 'From Runs accessions in a file, generate ENA download URLs for FASTQ',
        resolvedOutput: '/home/ubuntu/watermill-test/data/46989a2/solenopsis.readurls.txt',
        params: {} } ]
    Julian Mazzitelli
    @thejmazz
    if I filter out all the undefineds that would be a quick fix, but @bmpvieira, unless you use *.refurls.txt and *.readurls.txt, the downloadUrls task will pick up the first one it finds when traversing within resolve-input (we need to make a design decision on whether we want to handle this sort of case - perhaps make duplicate instances of a task when multiple possible inputs are found)
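    the quick fix would look something like this in resolve-input (a sketch, assuming vertexValue is the array dumped above):
    // drop the undefined entries before traversal, so only real
    // collection/add-output records are checked against the task's input glob
    const candidates = vertexValue.filter(item => item !== undefined)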
    Julian Mazzitelli
    @thejmazz
    for some reason it is taking the params of the very first task as a UID and then, since that's not a key into the DAG, adding an undefined here:
    HERE
    HERE2
    VERTEX: [279040,280098], undefined
    VERTEX: 70335b6fdb33eb3e693d8d836f6e8593d001718fe19bf381c1e796a5b4f98193, [object Object]
    VERTEX: dbf22b04897d0fdbbf3a1f2912f264cdd3e535f5f6bf55767db2b1fd0f250be8, [object Object]
    VERTEX: 411bcc470157f2bc6f1ddb03e69c83ffc7935e8753a73bcbbd2e307f286feee8, [object Object]
    HERE2
    VERTEX: [279040,280098], undefined
    VERTEX: 70335b6fdb33eb3e693d8d836f6e8593d001718fe19bf381c1e796a5b4f98193, [object Object]
    VERTEX: 700b10ae0f04e178b80cab8cc6ed45765022e6862df5bc4be6bcb24fc5779c3e, [object Object]
    VERTEX: 46989a2a9871f5c19793afeb3ada12c8deb4e6a61c7da824fa3aff2b79ee154a, [object Object]
    which means that somehow [279040,280098] got into the trajectory of one of the tasks from taskToPromise in the junction orchestrator
    in mergeCtx:
    if (paramsString !== '{}') {
      newTrajection = [paramsString, uid]
    } else {
      newTrajection = [uid]
    }
    it was including the stringified params as a trajectory id - this must be a relic from when I was doing graph ids that way
    but that causes the undefined problem b/c there is no DAG node there
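    so the fix is basically (a sketch, not necessarily the exact diff):
    // trajectory entries must be DAG vertex keys, so only the uid goes in,
    // whether or not the task had params
    newTrajection = [uid]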
    Julian Mazzitelli
    @thejmazz
    PR submitted :D
    @bmpvieira even with the PR merged and released your pipeline won't work at the downloadUrls step - it needs different tasks for the ref and reads URLs
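    i.e. something along these lines (a sketch with placeholder operations, not the exact tasks from the pipeline):
    const { task } = require('bionode-watermill')

    const generateRefUrls = task({
      input: '*.metadata.json',
      output: '*.refurls.txt',
      name: 'Generate reference genome URLs'
    }, ({ input }) => `cat ${input} > ${input.replace(/\.metadata\.json$/, '.refurls.txt')}`)

    const generateReadUrls = task({
      input: '*.metadata.json',
      output: '*.readurls.txt',
      name: 'Generate reads URLs'
    }, ({ input }) => `cat ${input} > ${input.replace(/\.metadata\.json$/, '.readurls.txt')}`)

    // then two download tasks, one with input: '*.refurls.txt' and one with
    // input: '*.readurls.txt', instead of a single downloadUrls matching *.urls.txt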
    Tiago Jesus
    @tiagofilipe12
    What if the name of the task is different and we add the task name to the list of props that create the uid? Something like what I sketched during GSoC?
    Of course it is not ideal, but in many cases it will generate a different uid b/c many users like to put a description, like Bruno did.
    But yeah, better handling of these cases is required
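    e.g. (a hedged sketch; hashProps just stands in for however watermill actually builds the uid):
    const crypto = require('crypto')
    const hashProps = obj =>
      crypto.createHash('sha256').update(JSON.stringify(obj)).digest('hex')

    // add name to the props that get hashed, so tasks that differ only by their
    // description still end up with different uids
    const taskUid = ({ input, output, params, name }) =>
      hashProps({ input, output, params, name })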
    Julian Mazzitelli
    @thejmazz
    the uids were all unique
    you just never happened to make a pipeline with a task following a junction where a task inside the junction used params
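    i.e. a shape like this (made-up task names, just the gist of the triggering pipeline):
    const { task, join, junction } = require('bionode-watermill')

    // one branch of the junction uses params, the other doesn't
    const withParams = task({
      params: { ids: [279040, 280098] },
      output: '*.ids.txt',
      name: 'has params'
    }, ({ params }) => `echo "${params.ids}" > out.ids.txt`)

    const withoutParams = task({
      output: '*.metadata.json',
      name: 'no params'
    }, () => `echo '{}' > out.metadata.json`)

    // the task after the junction is where the stringified params showed up as a trajectory id
    const afterJunction = task({
      input: '*.ids.txt',
      output: '*.done.txt',
      name: 'after junction'
    }, ({ input }) => `cat ${input} > out.done.txt`)

    const pipeline = join(junction(withParams, withoutParams), afterJunction)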
    Tiago Jesus
    @tiagofilipe12
    Hmmm I see. But have you fixed it already?
    Julian Mazzitelli
    @thejmazz
    I didn't actually test b/c I finally tracked down the bug so late in the evening, but I'm pretty sure it will fix the issue XD
    Bruno Vieira
    @bmpvieira
    Thanks Julian