These are chat archives for nextflow-io/nextflow

17th
Oct 2017
Simone Baffelli
@baffelli
Oct 17 2017 06:02
Does the syntax file(name:"glob_pattern*") preserve the order of files?
Simone Baffelli
@baffelli
Oct 17 2017 06:38
A second question: how to use the custom sort option with groupTuple?
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:06
Does the syntax file(name:"glob_pattern*") preserve the order of files?
Simone Baffelli
@baffelli
Oct 17 2017 08:29
?
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:30
oops, I didn't press enter on my question
in what context? the file method or in a process?
Simone Baffelli
@baffelli
Oct 17 2017 08:31
in a process
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:31
input or output ?
Simone Baffelli
@baffelli
Oct 17 2017 08:31
I want the files to be sorted in the order they are input
oops sorry, input
should have said that
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:31
no, they are not sorted
Simone Baffelli
@baffelli
Oct 17 2017 08:32
:worried:
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:32
tho groupTuple can sort them
Simone Baffelli
@baffelli
Oct 17 2017 08:32
I suppose I can sort them
and that's where the second question comes in
going crazy since 8 am
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:33
:)
what's the second question ?
Simone Baffelli
@baffelli
Oct 17 2017 08:34

Suppose I have tuples [file, date, id]. I use groupTuple(by:2) to group them by id, but I want the grouped files sorted by date so that I can do:

groupTuple(by:2)
    .into{myChan}

process bla {
    input:
    set file(unw:"a*.unw"), val(id), val(dateList) from myChan
}

Can I use sort: something to sort them?

Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:35
you can use sort:true for natural sorting, or implement your own logic with a closure
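for example, a minimal sketch of the sort: true form (the channel contents here are made up):
Channel
    .from( [file('a2.unw'), '2017-10-02', 'id1'],
           [file('a1.unw'), '2017-10-01', 'id1'] )
    .groupTuple(by: 2, sort: true)   // natural sort of each collected list
    .println()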
Simone Baffelli
@baffelli
Oct 17 2017 08:36
and there I fail. The closure seems to receive every element individually; I was hoping to sort the grouped tuples after the fact. My understanding is that the comparator would receive elements of the form
[[file1, ..., fileN], [date1, ..., dateN], id]
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:37
it can be either a closure or a Comparator
a closure takes one argument, the comparator two
Simone Baffelli
@baffelli
Oct 17 2017 08:38
and here I'm lost. What are the arguments passed to the comparator?
the collected tuples?
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:39
the comparator is invoked by the sorting algorithm as many times as needed to compare each pair of elements to sort
eg
.groupTuple(by:2, sort: { a, b -> return /* your comparing logic here  */  } )
where it returns:
a negative integer, zero, or a positive integer as the first argument is less than, equal to, or greater than the second.
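for instance, a sketch that orders each collected list by the string form of its elements (the comparison logic is only illustrative; the spaceship operator <=> returns exactly that negative/zero/positive value):
.groupTuple(by: 2, sort: { a, b -> a.toString() <=> b.toString() })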
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:46
@amacbride I would answer .. yes, though I'm not sure I've understood your question
Simone Baffelli
@baffelli
Oct 17 2017 08:49
Yes, but which elements are sorted? The collected tuples?
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:49
yes
if you have different types, you need to make the comparator able to sort both
Simone Baffelli
@baffelli
Oct 17 2017 08:52
My goal is to sort each sublist according to the list of dates, so that the files are paired with the corresponding date
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:54
how does the sorting affect the executed process?
they are always the same files staged in the work dir, whatever the order ..
Simone Baffelli
@baffelli
Oct 17 2017 08:55
Because I am computing a weighted average of images using an AR model
So they must be passed in the correct order to the averaging function
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:55
ok, anyhow I think you need a different approach
Simone Baffelli
@baffelli
Oct 17 2017 08:55
In another case I am doing an animation
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:56
Toy Story like ? :D
Simone Baffelli
@baffelli
Oct 17 2017 08:56
No, to show how the data changes over time and how my method reduces the variance of the error 😂
Using ImageMagick 😌
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:57
cool
anyhow, instead of using the sort parameter, use a map
.groupTuple(by:2).map { files, dates, id -> /* whatever */ }
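for example, a rough sketch that re-pairs each file with its date, sorts the pairs by date and then splits them again (assuming the [files, dates, id] layout):
.groupTuple(by: 2)
    .map { files, dates, id ->
        // pair each file with its date, sort the pairs by date, then split them again
        def sorted = [files, dates].transpose().sort { it[1] }
        [ sorted.collect { it[0] }, sorted.collect { it[1] }, id ]
    }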
Simone Baffelli
@baffelli
Oct 17 2017 08:59
ah that makes sense
cool, I did not know that map could directly untuple sets
still, how to stage them in the right order?
Paolo Di Tommaso
@pditommaso
Oct 17 2017 08:59
map just takes the resulting items and you can transform them as you need
Simone Baffelli
@baffelli
Oct 17 2017 09:00
sort them by date
Paolo Di Tommaso
@pditommaso
Oct 17 2017 09:00
well, now files and dates are the complete collections, which you can rearrange as you need
Simone Baffelli
@baffelli
Oct 17 2017 09:00
yes
that makes sense
That should be the final product
make sure you are not prone to seizures
Paolo Di Tommaso
@pditommaso
Oct 17 2017 09:02
fascinating ..
Simone Baffelli
@baffelli
Oct 17 2017 09:02
horrible mess
Edgar
@edgano
Oct 17 2017 09:02
looks awesome! <3
Simone Baffelli
@baffelli
Oct 17 2017 09:03
that data is a horrendous mess
the bottom one is closer to what I want to see
and it's still not sorted by date
Simone Baffelli
@baffelli
Oct 17 2017 09:11
@edgano thanks
Simone Baffelli
@baffelli
Oct 17 2017 09:17
Hope you like method chains

def sortListsByList(Iterable first, Iterable second, Closure fun) {
    // pair the two lists element-wise, sort the pairs with the supplied closure,
    // then split them back into two parallel lists
    return [first, second]
        .transpose()
        .collectEntries()
        .sort(fun)
        .inject([]) { collector, element -> collector << [element.key, element.value] }
        .transpose()
}
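a possible call, just to illustrate the intent (file names and dates are made up, and the closure extracts the sorting key from each map entry):
def (files, dates) = sortListsByList(['b.unw', 'a.unw'], ['2017-10-02', '2017-10-01'], { it.value })
assert files == ['a.unw', 'b.unw']
assert dates == ['2017-10-01', '2017-10-02']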
Paolo Di Tommaso
@pditommaso
Oct 17 2017 09:19
I love it !
Simone Baffelli
@baffelli
Oct 17 2017 09:22
still not sure it is working properly
Paolo Di Tommaso
@pditommaso
Oct 17 2017 09:22
I don't have any clue :satisfied:
Simone Baffelli
@baffelli
Oct 17 2017 09:23
the idea would be to use that to sort a list by another list :smile:
Simone Baffelli
@baffelli
Oct 17 2017 10:16
still running into the old problem that defining that function at the top of the pipeline causes the caches to be invalidated
Simone Baffelli
@baffelli
Oct 17 2017 12:14
In case someone needs to sort a list by a given sublist
def sortListByList(Iterable listToSort, Iterable listOfSortingKeys, fun = null) {
    // default sorter: order the [element, key] pairs by their key
    def sorter = fun != null ? fun : { it -> it[1] }
    return [listToSort, listOfSortingKeys]
        .transpose()
        .sort { it -> sorter(it) }
        .inject([]) { collector, element -> collector << element[0] }
}

def sortListOfListsBySublist(Iterable listOfLists, Integer sublistIndex, fun = null) {
    def sortingKeys = listOfLists[sublistIndex]
    def sorted = listOfLists.collect { listElement ->
        try {
            return sortListByList(listElement, sortingKeys, fun)
        }
        // the sublist is a single value (e.g. the id), we cannot sort it
        catch(MissingMethodException e) {
            return listElement
        }
    }
    return sorted
}


lOl = [["fourth", "first", "third", "second"], [4, 1, 3, 2]]
sorted = sortListOfListsBySublist(lOl, 1)
assert sorted == [["first", "second", "third", "fourth"], [1, 2, 3, 4]]
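and a possible way to plug it into the earlier groupTuple channel (this assumes myChan carries grouped [files, dates, id] tuples, with the dates at index 1):
myChan
    .map { grouped -> sortListOfListsBySublist(grouped, 1) }   // 1 = index of the date sublist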
Simone Baffelli
@baffelli
Oct 17 2017 12:48
It works finally!
Paolo Di Tommaso
@pditommaso
Oct 17 2017 12:48
:tada: :tada: :tada:
matthieudumont
@matthieudumont
Oct 17 2017 13:12
Hi, I work with Felix Morency @ Imeka. I would like to know the standard procedure if we want to cite Nextflow in an article?
matthieudumont
@matthieudumont
Oct 17 2017 13:13
Thank you :)
Luca Cozzuto
@lucacozzuto
Oct 17 2017 13:21
you're welcome!
Paolo Di Tommaso
@pditommaso
Oct 17 2017 14:06
@matthieudumont looking forward to reading your paper
Félix C. Morency
@fmorency
Oct 17 2017 14:07
:)
Tim Diels
@timdiels
Oct 17 2017 14:47
Hi, I'd like to validate some input I have put in a channel. The following prints if there are errors, but I also need the rest of the pipeline to wait for this validation to complete and to quit if validation fails (i.e. if the channel isn't empty). The tap function calls tap on the channel, replaces the global var species with one channel of the tap, and returns the other.
tap(species)
    // Filter down to invalid species
    .filter { !it.orthofinder && it.phyml }
    // Print for each invalid
    .println { """\
        $it.name has PhyML set, but not OrthoFinder. PhyML requires
        OrthoFinder. Please set OrthoFinder as well or unset PhyML.
        """.stripIndent().trim()
    }
Paolo Di Tommaso
@pditommaso
Oct 17 2017 14:49
the validation is implemented by the logic in the filter ?
Tim Diels
@timdiels
Oct 17 2017 14:54
yes, if the body is true, it is invalid
Paolo Di Tommaso
@pditommaso
Oct 17 2017 14:56
you want to stop when the first invalid is matched or if there are none valid ?
Tim Diels
@timdiels
Oct 17 2017 14:57
Ideally stop when all invalids have been matched, so the user can fix all errors in one go. And I guess it's fine if the rest of the pipeline already starts while validation happens
Paolo Di Tommaso
@pditommaso
Oct 17 2017 15:00
a bit tricky, you could do
Tim Diels
@timdiels
Oct 17 2017 15:00

I was thinking of

invalidSpecies = tap(species).filter { ... }
tap(invalidSpecies).println { ... }
assert !invalidSpecies.count()

but thought there had to be a better way

Paolo Di Tommaso
@pditommaso
Oct 17 2017 15:02
int invalidCount = 0
your_channel
     .filter { def invalid = !it.orthofinder && it.phyml; if(invalid) { printInvalidMessage(it); invalidCount++ }; return !invalid }
     .tap { species }
     .subscribe onComplete: { if(invalidCount) error("error message") }
Tim Diels
@timdiels
Oct 17 2017 15:12
Thanks, I've gotten rusty
Simone Baffelli
@baffelli
Oct 17 2017 15:21
good, today I learned the importance of variable scoping when using methods in combination with map
it can mess things up badly :sweat_smile:
Paolo Di Tommaso
@pditommaso
Oct 17 2017 15:26
very good ;)
always use def for local variables
Simone Baffelli
@baffelli
Oct 17 2017 15:32
exactly
that was wreaking havoc on caching
because I had two variables named the same
inside two methods
but without def
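a tiny sketch of the hazard, with made-up names: without def the variable lands in the global script binding and is shared across calls, with def it stays local:
def labelA(x) {
    tmp = "A-$x"        // no def: 'tmp' goes into the global script binding
    return tmp
}
def labelB(x) {
    def tmp = "B-$x"    // def: 'tmp' is local to this call
    return tmp
}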
Paolo Di Tommaso
@pditommaso
Oct 17 2017 15:33
ah-ah
Francesco Strozzi
@fstrozzi
Oct 17 2017 15:33
why is an input definition like this, from the same CSV file we were discussing Friday, saving the URL string inside a file instead of downloading the files themselves?
set dbxref,sample_type,strand_specific,file(fastq_1),file(fastq_2) from encode_files_ch_1.splitCsv()
the CSV is split correctly
sure it's something on my side
Paolo Di Tommaso
@pditommaso
Oct 17 2017 15:35
oh, this would be a nice feature, this is not expected to work in this way
ah wait
Simone Baffelli
@baffelli
Oct 17 2017 15:35
the only thing I still need to sort out is how to get only one header when collecting CSVs
and then my pipeline is almost ready for "production"
Paolo Di Tommaso
@pditommaso
Oct 17 2017 15:36
you still need to convert the url to a remote file
Francesco Strozzi
@fstrozzi
Oct 17 2017 15:37
ah ok, how can I do that?
Paolo Di Tommaso
@pditommaso
Oct 17 2017 15:37
.. from encode_files_ch_1.splitCsv().map { ref,type,strand,read1,read2 -> [ ref,type,strand,file(read1),file(read2) ] }
but still, this will copy the FTP files to the S3 storage
in this case the best thing to do is
set dbxref,sample_type,strand_specific,fastq_1,fastq_2 from encode_files_ch_1.splitCsv()
then use wget to pull them
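a rough sketch of that approach (the process name and script body are only illustrative):
process pull_reads {
    input:
    set dbxref, sample_type, strand_specific, fastq_1, fastq_2 from encode_files_ch_1.splitCsv()

    """
    wget -q ${fastq_1}
    wget -q ${fastq_2}
    # ... then work on the downloaded files ...
    """
}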
Francesco Strozzi
@fstrozzi
Oct 17 2017 15:39
mmm, why is this different from assigning the variables with the values returned from splitCsv and declaring file() on those that need to be downloaded?
I mean, from a NF code point of view
yes the wget thing is the way I am doing it now, just exploring all the other possibilities ;)
Paolo Di Tommaso
@pditommaso
Oct 17 2017 15:40
if you declare an input as file, you still need to provide a matching object
the CSV field is just a string URL, hence you will need a map to convert it to a proper file object
this is something I want to change
Francesco Strozzi
@fstrozzi
Oct 17 2017 15:41
just curious :)
ok thanks for the answer!
Paolo Di Tommaso
@pditommaso
Oct 17 2017 15:41
the original idea was that you could feed a process with a data chunk as a string
and the process would be able to use it as a file, by the fact that it's declared as file
but it turned out that it's not such a useful use case
Francesco Strozzi
@fstrozzi
Oct 17 2017 15:51
mmm I see
amacbride
@amacbride
Oct 17 2017 16:28

@pditommaso Thanks. I also just went ahead and tried it and it seems to have worked. Let me see if I can explain more concretely:

I need to preserve some intermediate files. I have a process A that produces a big BAM file that is consumed downstream by other processes B, C, and D. I use a publishDir directive to preserve the BAM file from process A, and also declare it as an output of A & an input of B, C, D. The documentation for publishDir warns that the copy is asynchronous, so I just wanted to confirm that the processes B, C, D are actually operating on the symlink of the original file from the work directory.

Paolo Di Tommaso
@pditommaso
Oct 17 2017 17:20
process execution and publishDir are unrelated
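for the record, a minimal sketch of that wiring (process names and commands are invented): B reads the file staged from A's work dir via the output channel, regardless of when publishDir finishes its copy:
process A {
    publishDir 'results/bam', mode: 'copy'

    output:
    file 'sample.bam' into bam_for_B, bam_for_C

    """
    your_aligner_command > sample.bam
    """
}

process B {
    input:
    file bam from bam_for_B

    """
    your_downstream_command ${bam}
    """
}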
amacbride
@amacbride
Oct 17 2017 17:27
Perfect, thanks. Is it possible to set maxErrors globally in the config file? as in,
process { 
     queue = 'someq'
     maxErrors = -1
     maxRetries = 10
}
Paolo Di Tommaso
@pditommaso
Oct 17 2017 17:28
yes
amacbride
@amacbride
Oct 17 2017 17:28
awesome
Paolo Di Tommaso
@pditommaso
Oct 17 2017 17:37
@amacbride BTW the latest beta should solve the problem you were experiencing when downloading multiple remote files, if I'm not wrong
amacbride
@amacbride
Oct 17 2017 21:24
@pditommaso I'd like to try it, but as we're still on Java 7, I'll have to wait a bit. (Switching to 8 involves re-qualifying our whole system, so it's not trivial.)
I may be stuck at 0.25.x for a while if 0.26.x requires 8.