These are chat archives for nextflow-io/nextflow

13th Feb 2017
amacbride
@amacbride
Feb 13 2017 21:44

If I have a channel that consists of a series of tuples, is there a way to sort that channel by a particular tuple value, then emit it in particularly-sized chunks?

To be concrete: I have a channel that consists of a series of tuples (sample_name, sample_id, lane_id, etc.) that is currently sorted by lane_id as it comes out of the map function that produces it:

have:

[sample_name:sample1, sample_id:1, lane_id:L1, read1_path:read1, read2_path:read2]
[sample_name:sample2, sample_id:2, lane_id:L1, read1_path:read1, read2_path:read2]
[sample_name:sample1, sample_id:1, lane_id:L2, read1_path:read1, read2_path:read2]
[sample_name:sample2, sample_id:2, lane_id:L2, read1_path:read1, read2_path:read2]
[sample_name:sample1, sample_id:1, lane_id:L3, read1_path:read1, read2_path:read2]
[sample_name:sample2, sample_id:2, lane_id:L3, read1_path:read1, read2_path:read2]
[sample_name:sample1, sample_id:1, lane_id:L4, read1_path:read1, read2_path:read2]
[sample_name:sample2, sample_id:2, lane_id:L4, read1_path:read1, read2_path:read2]

What I'd like instead is to sort by sample_name or sample_id, and group every 4 items, so that the downstream consumer can continue instead of blocking until all items in this step are finished.

want:

[sample_name:sample1, sample_id:1, lane_id:L1, read1_path:read1, read2_path:read2]
[sample_name:sample1, sample_id:1, lane_id:L2, read1_path:read1, read2_path:read2]
[sample_name:sample1, sample_id:1, lane_id:L3, read1_path:read1, read2_path:read2]
[sample_name:sample1, sample_id:1, lane_id:L4, read1_path:read1, read2_path:read2]

[sample_name:sample2, sample_id:2, lane_id:L1, read1_path:read1, read2_path:read2]
[sample_name:sample2, sample_id:2, lane_id:L2, read1_path:read1, read2_path:read2]
[sample_name:sample2, sample_id:2, lane_id:L3, read1_path:read1, read2_path:read2]
[sample_name:sample2, sample_id:2, lane_id:L4, read1_path:read1, read2_path:read2]
amacbride
@amacbride
Feb 13 2017 21:50
(There are 64 separate alignments, for 16 samples with 4 lanes apiece -- the behavior I see currently is that this step blocks until all 64 are finished, instead of aligning each lane of the sample simultaneously, and proceeding to the next step when all the lanes for a particular sample are finished.)
Mike Smoot
@mes5k
Feb 13 2017 21:57
Do you want 8 tuples combined into 2? If so, then you'll probably want to use groupTuple with a sort closure.
Félix C. Morency
@fmorency
Feb 13 2017 21:59
.groupTuple(by: 1, size: 4)
or just
.groupTuple(size: 4)
It will group the channel by sample_name and emit a new tuple when size has been reached
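
(Editor's sketch) To make the "emit when size has been reached" behaviour concrete, here is a small plain-Python model of grouping a stream by key with a fixed group size. It is not Nextflow code and not how groupTuple is implemented; `group_by_key`, `key_index` and the sample records are hypothetical names chosen for illustration. It also keeps whole records per group, whereas groupTuple collects each non-key position into its own list.

```python
from collections import defaultdict


def group_by_key(records, key_index=0, size=4):
    """Yield (key, items) as soon as `size` records sharing a key have arrived."""
    pending = defaultdict(list)
    for record in records:
        key = record[key_index]
        pending[key].append(record)
        if len(pending[key]) == size:
            # The group is complete: hand it downstream without waiting for the rest.
            yield key, pending.pop(key)
    # Assumption of this sketch: incomplete groups left in `pending` are dropped.


# Records arrive ordered by lane, as in the question: (sample_name, sample_id, lane_id)
arrivals = [(f"sample{s}", s, f"L{lane}") for lane in range(1, 5) for s in (1, 2)]

groups = list(group_by_key(arrivals))
for name, items in groups:
    print(name, [lane for _, _, lane in items])
```

With this arrival order, the sample1 group is complete after the seventh record, so a consumer could start on it before the last sample2 record shows up.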
Félix C. Morency
@fmorency
Feb 13 2017 22:04
you will have to manipulate said channel to get your want: data
amacbride
@amacbride
Feb 13 2017 22:04
I don't actually want the tuples combined, but I want to sort the values (tuples) emitted by the channel by a particular tuple value.
@fmorency Could you elaborate? I'm not sure what "want data" is.
Félix C. Morency
@fmorency
Feb 13 2017 22:06
the block above where you wrote "want:"
amacbride
@amacbride
Feb 13 2017 22:07
Right -- but how would I accomplish that? I didn't see any operators that would sort the contents of a channel.
Mike Smoot
@mes5k
Feb 13 2017 22:09
In that case, you'd need to do a toSortedList and then a flatMap, but I'm not sure relying on the ordering of a channel is a good idea
Félix C. Morency
@fmorency
Feb 13 2017 22:09
You could use .groupTuple() and uncombine them afterward using another operator like (maybe) .flatMap()
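
(Editor's sketch) The "uncombine" step can be modelled in plain Python as follows. This assumes the grouped item has the shape key followed by one list per remaining tuple position, which is how groupTuple output is usually described; `ungroup` and the sample values are hypothetical, and this is not Nextflow code.

```python
def ungroup(grouped):
    """Turn (key, [a1, a2, ...], [b1, b2, ...]) back into (key, a1, b1), (key, a2, b2), ..."""
    key, *columns = grouped
    for values in zip(*columns):
        yield (key, *values)


# One grouped item for sample1: key, then the sample_id list and the lane_id list.
grouped = ("sample1", [1, 1, 1, 1], ["L1", "L2", "L3", "L4"])

rows = list(ungroup(grouped))
for row in rows:
    print(row)
```

In a pipeline, the equivalent closure would be passed to flatMap so that each per-lane record is emitted as a separate item.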
amacbride
@amacbride
Feb 13 2017 22:09
Essentially, I have a step after this one that wants to consume all four lanes of data for a particular sample, but I'd like it to start operating when those four are available, rather than waiting until all 64 steps of the alignment are complete.
Félix C. Morency
@fmorency
Feb 13 2017 22:09
the thing with toSortedList is you have to wait for all steps to be completed
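
(Editor's sketch) A plain-Python illustration of why a sort-based approach has to wait for everything: nothing can be emitted until the whole input has been read. The names `source`, `sorted_stream` and `seen` are hypothetical, and this only models the idea; it is not Nextflow code.

```python
seen = []  # records the upstream source has produced so far


def source():
    """Produce (sample_name, lane_id) records ordered by lane, as in the question."""
    for lane in range(1, 5):
        for s in (1, 2):
            record = (f"sample{s}", f"L{lane}")
            seen.append(record)
            yield record


def sorted_stream(records, key):
    """Emit records in sorted order; sorted() consumes the entire input first."""
    yield from sorted(records, key=key)


stream = sorted_stream(source(), key=lambda r: r[0])
first = next(stream)
print(first, "emitted after", len(seen), "records were read")
```

Here all eight records are read before the first sorted record comes out, which is the blocking behaviour described above.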
Mike Smoot
@mes5k
Feb 13 2017 22:11
right, then groupTuple with size it is!
amacbride
@amacbride
Feb 13 2017 22:11
That's fine -- it's quick (it's just pulling things from the filesystem and creating tuples with embedded filenames, and only then passing it to a computationally intensive step.)
I will play around -- thanks for the tips!