These are chat archives for nextflow-io/nextflow

4th
Oct 2017
Paolo Di Tommaso
@pditommaso
Oct 04 2017 07:07
Hi Michael, thanks for suggesting that however in NF already has a fairly generic mechanism for splitting text formats. Tho it does not support yet bed, gff and vcf I think it would be easier to implement them in the current mechanism than reengineering it to integrate another library.
Simone Baffelli
@baffelli
Oct 04 2017 08:22

Discussion about NF, Snakemake and CWL

For me NF >> snakemake. But it could be a function of the type of problem I tend to work on.

Unrelated question: I tend to write a lot of utility functions to go with my workflow. Is there an accepted practice to modularize that? A class perhaps?
Paolo Di Tommaso
@pditommaso
Oct 04 2017 08:29
we know here, write this there ;)
Simone Baffelli
@baffelli
Oct 04 2017 08:41
I did my homework :)
Paolo Di Tommaso
@pditommaso
Oct 04 2017 08:41
:clap: :clap:
Simone Baffelli
@baffelli
Oct 04 2017 08:56
I wrote an extensive defense of nextflow :grimacing:
Paolo Di Tommaso
@pditommaso
Oct 04 2017 08:57
tiple :clap:
Simone Baffelli
@baffelli
Oct 04 2017 09:10
I'm starting to enjoy writing again. Perhaps it is because I tend to be a bore :)
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:11
to write what? novels? NF code?
Simone Baffelli
@baffelli
Oct 04 2017 09:12
papers
or generally technical stuff
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:12
blog post !
Simone Baffelli
@baffelli
Oct 04 2017 09:14
I could maybe someday write one for nextflows blog, if you so whish
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:14
if you like it, you are welcome
Simone Baffelli
@baffelli
Oct 04 2017 09:15
it may be interesting to blog from a non-bioinformatics perspective
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:15
sounds good
Simone Baffelli
@baffelli
Oct 04 2017 09:22
back to the other question, is a class of static methods a good idea to modularize my helper functions?
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:23
I think so
in this way you can even write unit tests
Evan Floden
@evanfloden
Oct 04 2017 09:24
@baffelli Out of general interest, can I see an example of such a utility function?
Simone Baffelli
@baffelli
Oct 04 2017 09:24
Here you go:
//check empty binary
def checkEmptyBinary =
{
    it->
    if(it.size() > 0)
    {
    cpx = (new float[params.empty_slc_win_size])
    buf = ByteBuffer.wrap(it.bytes)
    buf.order(ByteOrder.BIG_ENDIAN)
    buf.position(0)
    buf.asFloatBuffer().get(cpx,0,params.empty_slc_win_size)
    pwr = cpx.collect{el->Math.abs(el)}
    avg = mean(pwr) 
    sd = std(pwr)
    mx = pwr.max()
    cv = avg/mx
    return mx > params.empty_slc_threshold
    }
    else
    {
        return false
    }
}
I need to check wether a binary is empty. I call it from .filter to skip processing empty files
Evan Floden
@evanfloden
Oct 04 2017 09:25
Cool, thanks a lot!
Simone Baffelli
@baffelli
Oct 04 2017 09:25
I tend to write lots of these utilities. Most very specific to my problem
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:30
fuck! real stuff !
Simone Baffelli
@baffelli
Oct 04 2017 09:31
why?
excuse my poor java style :wink:
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:32
kidding, because that's very unusual in genomics workflow
the ByteBuffer still scares me when I need to use (.. when?)
Simone Baffelli
@baffelli
Oct 04 2017 09:33
you bionformaticians only use textfiles :innocent:
I use BIG COMPLEX BINARY DATA
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:34
image.png
Simone Baffelli
@baffelli
Oct 04 2017 09:34
But tbh I try to save as much stuff as possible into a csv and analyze it using R
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:35
I see, I'm starting to thing a blog post about your work would be very interesting
Simone Baffelli
@baffelli
Oct 04 2017 09:35
You don't want to see my current pipeline :see_no_evil:
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:36
ahahah, but you don't need to publish the source code :)
Simone Baffelli
@baffelli
Oct 04 2017 09:37

For example this (currently disabled) pearl:

process stack{

    errorStrategy 'ignore'

    validExitStatus 0, 255, 127//Ignore warning of reference outside

        publishDir "${params.results}/stacked/${stack_name}/${n_stack}/${used_method}"

        tag {"${stack_name},${n_stack},${used_method}"}

        input:
            set file(unw_ls:"*.diff"), file(off_ls:"*.off"), val(master_id), val(slave_id), val(bl), val(method) from to_stack
            each ref_pt from ref_pix_stack//repeat it continously becuase ref_pix_stack is only sent once
            each ref_mli from ref_mli_stack
            each n_stack from (logSpace(params.n_stack_min,params.n_stack_max, params.n_stack_samples) as Integer[])

        when:
            (unw_ls as List).size() > n_stack

        output:
            set file(off_par), file(rate_m), file(sig_rate_m), file(sig_ph), 
            val(stack_id), val(used_method), val(n_stack), val(av_time) into stacked
             set file('rate.bmp'),  file('rate_std.bmp'), file('ph_std.bmp') into rate_ras
        shell:
            //we send one off par toghether with the stack for geocoding
            off_par = off_ls[0]
            used_method = method[0]
            bl_subset = (bl as List) [0..n_stack]
            unw_subset = (unw_ls as List) [0..n_stack]
            master_id_subset =  (master_id as List)[0..n_stack]
            slave_id_subset =  (slave_id as List)[0..n_stack]
            //Create a  list of ids
            datetimes = [master_id_subset, slave_id_subset].transpose()
            datetimes_names = datetimes.join("\n")
            bl_day  = bl_subset.collect{ item-> secondsToDay(item as long)}
            //construct the columns of the diff_tab file
            stack_text = createStackingTable(unw_subset as List, bl_day)
             // Save the name of the stacked id
            // Give it an unique id based on first master and last slave
            stack_id = [master_id_subset[0], slave_id_subset[-1]]
            //Total time covered by stack
            av_time = computeBaseline(master_id_subset[0], master_id_subset[-1])
            //The stack name corresponds to the first master id
            stack_name = "${master_id_subset[0]}"
            // //Get the radar frequency
            mli_par = ref_mli[1]
            //Get the frequency from the parameters
            freq = (mli_par.text =~ /radar_frequency:(.+)/)[0][1].tokenize()[0] as float
            freq = params.radar_frequency
            //phase to distance conversion
            phase_to_meter = (3e8/freq) / (4 * Math.PI)
            '''
            echo '!{stack_text}' > dt
            width=$(get_value !{off_ls[0]} interferogram_width)
            stacking dt ${width} rate sig_rate sig_ph !{ref_pt[0]} !{ref_pt[1]} - - - 0
            wd=$(get_value !{off_par} interferogram_width)
            float_math rate - rate_m ${wd} 2 - - - - !{phase_to_meter}
            float_math sig_rate - sig_rate_m ${wd} 2 - - - - !{phase_to_meter}
            float_math sig_ph - sig_ph_m ${wd} 2 - - - - !{phase_to_meter}
            visdt_pwr.py rate_m !{ref_mli[0]} ${wd} -2 2 -u rate.bmp -m spectral_r
            visdt_pwr.py sig_rate_m !{ref_mli[0]} ${wd} 0 2 -u rate_std.bmp -m cool
            visdt_pwr.py sig_ph_m !{ref_mli[0]} ${wd} 0 2 -u ph_std.bmp -m cool
            '''
}

Notice the elegant combination of: groovy, bash and python. Plust several filter,map, buffer here and there

Edgar
@edgano
Oct 04 2017 09:38
@baffelli nice utility function! :)
Simone Baffelli
@baffelli
Oct 04 2017 09:39
Feel free to reuse it. Just be aware that I need it to check complex valued data :grimacing:
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:39
polyglot is here to stay !
Simone Baffelli
@baffelli
Oct 04 2017 09:39
I presume complex values are not that common in bioinformatics?
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:41
well, there are a lot of complex data structures, but generally are managed at a lower level by specific tools
Simone Baffelli
@baffelli
Oct 04 2017 09:42
ahah no i mean complex as in "complex number"
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:42
ahh, no I'm not aware of that
Simone Baffelli
@baffelli
Oct 04 2017 09:43
xCx \in \mathbb{C}
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:43
yep
Simone Baffelli
@baffelli
Oct 04 2017 09:43
How cool is that, typing formulas in a chat
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:43
:)
Simone Baffelli
@baffelli
Oct 04 2017 09:45
time to use some complex maths in bioinformatics. Something like frequency domain bionformatics would sound cool
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:45
you want to apply for another phd, I would be happy yo introduce you to my PI ;)
Simone Baffelli
@baffelli
Oct 04 2017 09:46
No. But if he has a job for me, that could work
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:47
mm, when are you going to defend your phd ?
Simone Baffelli
@baffelli
Oct 04 2017 09:47
Next year I presume
Writing paper 2 out of 3
Paolo Di Tommaso
@pditommaso
Oct 04 2017 09:47
ok, drop an email when done ;)
Simone Baffelli
@baffelli
Oct 04 2017 09:48
Will do
Michael L Heuer
@heuermh
Oct 04 2017 15:03
@pditommaso Yes, I figured you'd say that, BED and GFF3 are simple enough. Having lost way too many hours of my life to VCF, though, I would encourage you not to write that from scratch. And not to use htsjdk.
Michael L Heuer
@heuermh
Oct 04 2017 15:10
@baffelli The study of regulatory networks uses the maths you're looking for, e.g. https://www.biorxiv.org/content/early/2016/04/20/049502
Simone Baffelli
@baffelli
Oct 04 2017 16:58
Interesting!
Any explaination for this weird bug?
null/modelList (No such file or directory)
Paolo Di Tommaso
@pditommaso
Oct 04 2017 16:58
too little ..
Simone Baffelli
@baffelli
Oct 04 2017 16:59
I'm using a exec directive with modelTextObj = new File("${task.workDir}/modelList")
and workDir becomes null, i cannot write to the file etc
and finally nextflow complains about the missing file in the (unexisting) path "null/modelList"
Luca Cozzuto
@lucacozzuto
Oct 04 2017 17:00
any way to know which process are still running? (I have 2 out of 700 processes still running and I don't know how to detect the missing ones)
Paolo Di Tommaso
@pditommaso
Oct 04 2017 17:00
@baffelli it works in my computer .. :)
Simone Baffelli
@baffelli
Oct 04 2017 17:00
that is really strange
it used to work
and during a test i overwrote the file
Paolo Di Tommaso
@pditommaso
Oct 04 2017 17:01
trace file ?
Simone Baffelli
@baffelli
Oct 04 2017 17:01
java.util.ConcurrentModificationException: null
I guest that should be the responsible
Paolo Di Tommaso
@pditommaso
Oct 04 2017 17:01
@lucacozzuto or SGE qstat ?
@baffelli ... umm, you need to check the error stack trace in the log file
Simone Baffelli
@baffelli
Oct 04 2017 17:02
*/
Channel.from([params.trend]).into{allModels}


process modelListToText{

    input:
        val(models) from allModels

    output:
        file(modelList) into modelText

    exec:
        println(models)
        modelTextObj = new File("${task.workDir}/modelList")
        modelTextObj << "model_name, model_formula\n"
        models.each{ m -> modelTextObj << "${m[0]},${m[1]}\n"}
}
I mean, that looks innocent enough
Luca Cozzuto
@lucacozzuto
Oct 04 2017 17:02
qstat tell me the process name
Simone Baffelli
@baffelli
Oct 04 2017 17:03
Oct-04 18:58:02.102 [Task monitor] DEBUG nextflow.Session - Unexpected error dumping DGA status java.util.ConcurrentModificationException: null
Paolo Di Tommaso
@pditommaso
Oct 04 2017 17:03
qstat -j <jobid> gives you all lot of info including the work dir that's is the NF process id
Luca Cozzuto
@lucacozzuto
Oct 04 2017 17:03
that is always the same
Simone Baffelli
@baffelli
Oct 04 2017 17:03
it is somehow related to the hashmap that I'm using to store a list of formulas
with names
Luca Cozzuto
@lucacozzuto
Oct 04 2017 17:04
thanks
it will be nice to NF way to check this
nextflow check pending etc
I'll think about it :)
Simone Baffelli
@baffelli
Oct 04 2017 17:04
:confused: just imagine, storing a list of models that are used together with a collected file that is passed to a dplyr pipeline which is called by nextflow
nested workflows :dizzy:
Paolo Di Tommaso
@pditommaso
Oct 04 2017 17:05
.... w/o error stack trace can't help
Simone Baffelli
@baffelli
Oct 04 2017 17:06
I think is related to hasmaps
Paolo Di Tommaso
@pditommaso
Oct 04 2017 17:07
try
def modelTextObj = new File("${task.workDir}/modelList")
Simone Baffelli
@baffelli
Oct 04 2017 17:07
I think that was not the problem. Rather something related to the hashmap
I will solve it :muscle:
Simone Baffelli
@baffelli
Oct 04 2017 17:17
solved
Paolo Di Tommaso
@pditommaso
Oct 04 2017 17:25
What was the pb?
Simone Baffelli
@baffelli
Oct 04 2017 18:10
I was tyring to use an hashmap
Simone Baffelli
@baffelli
Oct 04 2017 19:37
In two places at the same time