These are chat archives for nextflow-io/nextflow

19th
Jan 2018
Simone Baffelli
@baffelli
Jan 19 2018 07:56
@ewels I tried doing the same; it seems to work, but the problem is that one needs to generate as many dummy files or values as there are expected outputs of the process
Paolo Di Tommaso
@pditommaso
Jan 19 2018 07:58
the ability to access task.exitStatus as an output is already in master
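something along these lines should then work (an untested sketch; mayFail, exit_ch and the errorStrategy 'ignore' setting are just for illustration):

process mayFail {
    errorStrategy 'ignore'

    output:
    // capture the task exit code as a value output
    val task.exitStatus into exit_ch

    script:
    """
    exit 1
    """
}

exit_ch.println()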
Simone Baffelli
@baffelli
Jan 19 2018 07:59
:+1:
The only other issue that bothers me "a lot" is related to #378
I cannot create a file in Groovy and then use it in the subsequent shell command
Paolo Di Tommaso
@pditommaso
Jan 19 2018 08:01
shell or exec ?
Simone Baffelli
@baffelli
Jan 19 2018 08:02
in shell if I remember correctly
sometimes I want to take multiple input files from a collect-ed channel, write their paths to a .csv file and pass that csv to the command
what I do at the moment is convert the list of paths to a list of strings, join it with newlines and echo the result into a file in the shell command, roughly like this:
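(a stripped-down sketch; my_tool, samples_ch and the file names are placeholders)

process useFileList {
    input:
    file inputs from samples_ch.collect()

    output:
    file 'result.txt' into results_ch

    script:
    // join the staged input paths with newlines in Groovy,
    // then write them to files.csv from the shell command
    def paths = inputs.collect { it.toString() }.join('\n')
    """
    echo '${paths}' > files.csv
    my_tool --file-list files.csv > result.txt
    """
}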
Paolo Di Tommaso
@pditommaso
Jan 19 2018 08:05
this looks unrelated to #378 and it should be possible to handle it with NF
Simone Baffelli
@baffelli
Jan 19 2018 08:05
I remember having tried, but I could not make it work
because I could not set the path of the output file correctly
Paolo Di Tommaso
@pditommaso
Jan 19 2018 08:06
I need a test case
Simone Baffelli
@baffelli
Jan 19 2018 08:07
I will try to provide one
In a few minutes
Paolo Di Tommaso
@pditommaso
Jan 19 2018 08:08
I will have my breakfast :)
Simone Baffelli
@baffelli
Jan 19 2018 08:09
that's a late breakfast ;)
Tim Diels
@timdiels
Jan 19 2018 14:09
Any recommendations on scanning through files with ad-hoc queries vs making intermediate SQLite databases (kept read-only so they don't mess up the cache)? For queries such as SELECT * FROM table it's clear cut, but queries such as SELECT field1, field2 FROM table WHERE is_prop=1 AND species IN ('species1', 'species2') have me wondering.
Tim Diels
@timdiels
Jan 19 2018 15:39
I'm leaning towards making an SQLite file per table, to allow for sorting and some more efficient querying. Each process would then just query its own read-only table file, something like this:
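(a rough sketch assuming a move to NF; the table and field names are made up)

process filterSpecies {
    input:
    file db from Channel.fromPath('tables/expression.sqlite')

    output:
    file 'filtered.csv' into filtered_ch

    script:
    // query the read-only table file; sqlite3 -csv prints the rows as CSV
    """
    sqlite3 -csv $db "SELECT field1, field2 FROM expression WHERE is_prop=1 AND species IN ('species1', 'species2')" > filtered.csv
    """
}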
Paolo Di Tommaso
@pditommaso
Jan 19 2018 15:40
are the tables big?
Tim Diels
@timdiels
Jan 19 2018 15:43
the largest is 179M rows, 14.7GB
Paolo Di Tommaso
@pditommaso
Jan 19 2018 15:43
well, yes, it's big
and the SQL code is written as a process command?
Tim Diels
@timdiels
Jan 19 2018 15:46
I'm not sure whether that one should be a table in the first place though. Everything is in the database currently; some tables might be better replaced with references to files. Most tables are only a couple of million rows.
The pipeline is currently written in a custom engine, not yet Nextflow.
The SQL is scattered across a whole load of scripts
Paolo Di Tommaso
@pditommaso
Jan 19 2018 15:48
if you need random access to these records a DB makes sense, otherwise plain files can work too
Tim Diels
@timdiels
Jan 19 2018 15:51
>>> s.describe()
count     33.000000
mean       9.571667
std       31.355155
min        0.008000
25%        0.400000
50%        1.800000
75%        4.800000
max      179.000000
Table row sizes in millions of rows
Well, I'm making an overview of how all the tables are queried (though there's still an estimated 33h of work left on that); most of them so far are bulk access, but some do filter and sort...