These are chat archives for nextflow-io/nextflow

13th
May 2015
Paolo Di Tommaso
@pditommaso
May 13 2015 07:29
@andrewcstewart It may happen when the input file is a symlink to a file that is not mounted in the container
Andrew Stewart
@andrewcstewart
May 13 2015 16:55
hmm
may have uncovered a possible bug then
or something got goofed during an interrupted run
Andrew Stewart
@andrewcstewart
May 13 2015 18:10
ah
looks like it's further upstream
and that my original bam files actually just contain the string of the s3 address I set as their input
hrm, I think I'm using S3 inputs wrong :)
Paolo Di Tommaso
@pditommaso
May 13 2015 18:27
let me know what's wrong, when you realise that ;)
Andrew Stewart
@andrewcstewart
May 13 2015 18:46
ok..
s3 = Channel.from('s3://bucket/object')

process sayHello {

  echo true

  input:
    file(obj) from s3

  """
  cat $obj
  """
}
when I do that, the contents of the file are just the s3 path string
Paolo Di Tommaso
@pditommaso
May 13 2015 18:47
what version are you using?
Andrew Stewart
@andrewcstewart
May 13 2015 18:48
Version 0.13.5 build 2985
Paolo Di Tommaso
@pditommaso
May 13 2015 18:50
ah
Andrew Stewart
@andrewcstewart
May 13 2015 18:50
need to update?
Paolo Di Tommaso
@pditommaso
May 13 2015 18:50
no
it's
s3 = Channel.fromPath('s3://bucket/object')
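Channel.from emits the value as a plain string, while Channel.fromPath turns it into a file object that gets staged for the task, e.g. (a minimal sketch, bucket/object names are placeholders):

s3 = Channel.from('s3://bucket/object')      // emits the literal string 's3://bucket/object'
s3 = Channel.fromPath('s3://bucket/object')  // emits a file object, the S3 object is staged as a real file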
Andrew Stewart
@andrewcstewart
May 13 2015 18:50
(btw I got the repl working locally. very nice!)
ah ok
Paolo Di Tommaso
@pditommaso
May 13 2015 18:50
not Channel.from
I use it a lot to quickly test pieces of code
Andrew Stewart
@andrewcstewart
May 13 2015 18:51
yeah it's great for feeling out these new features
Paolo Di Tommaso
@pditommaso
May 13 2015 18:51
:+1:
Andrew Stewart
@andrewcstewart
May 13 2015 18:53
Hm, ok so I need to rethink how I'm handling my input. I was doing Channel.from("s3://somebucket/somepath") because I was using that string as an input into a process that then scans for all S3 objects 'under' that path
that process then outputs a set including a file(s3objectpath)
is there a way to use fromPath within the process (or the next process)?
maybe I could do some operator magic in between processes
Paolo Di Tommaso
@pditommaso
May 13 2015 18:56
you mean that a process outputs file names
and then you want a downstream process that handles them as S3 files?
Andrew Stewart
@andrewcstewart
May 13 2015 18:59
yeah
Paolo Di Tommaso
@pditommaso
May 13 2015 18:59
you can use a map operator to transform file name to file objects e.g.
s3Files = fileNames.map { file(it) }
Andrew Stewart
@andrewcstewart
May 13 2015 19:02
hm
let me frame the context
Paolo Di Tommaso
@pditommaso
May 13 2015 19:02
better
Andrew Stewart
@andrewcstewart
May 13 2015 19:02
s3buckets = Channel.from('s3://bucket')  // <--- just a string!

process scanBucket {

  input:
    val bucket from s3buckets

  output:
    stdout into s3paths

  """
  #!/usr/bin/env python
  import boto
  # scan $bucket for s3 objects
  # print out object s3-urls to stdout, along with metadata
  """
}

process processObjects {

  input:
    set val(meta1), val(meta2), file(obj) from s3paths

  """
  """
}
uh
Paolo Di Tommaso
@pditommaso
May 13 2015 19:03
arrow + UP, you can edit it :)
Andrew Stewart
@andrewcstewart
May 13 2015 19:03
ok something along these lines...
so what I think you're saying is that between those two processes I could do a map
(and actually.. 's3paths' in reality is a set with a couple of metadata values thrown in there)
Paolo Di Tommaso
@pditommaso
May 13 2015 19:04
you should be able to do
process processObjects {

  input:
    file(obj) from s3paths.map { file(it) }

  """
  """
 }
also the first s3buckets = Channel.from('s3://bucket') is useless
you can write directly
input:
val bucket from 's3://bucket'
Andrew Stewart
@andrewcstewart
May 13 2015 19:07
s3paths.map { [it.meta1, it.meta2, file(it.obj)] }
something like that?
Paolo Di Tommaso
@pditommaso
May 13 2015 19:07
yep
Andrew Stewart
@andrewcstewart
May 13 2015 19:07
(yeah I know.. I have in mind that I might use this as a factory for multiple buckets.. but for the example I can use a single bucket)
Paolo Di Tommaso
@pditommaso
May 13 2015 19:08
but why don't you scan the s3 bucket directly with nxf code?
Andrew Stewart
@andrewcstewart
May 13 2015 19:08
the last snippet I just wrote assumes a hash I believe? (still getting used to Groovy's data structures)
how do you mean?
Paolo Di Tommaso
@pditommaso
May 13 2015 19:08
by scan do you mean traverse a folder structure?
Andrew Stewart
@andrewcstewart
May 13 2015 19:08
(I'm scanning s3 objects and filtering based on s3 metadata fields)
Paolo Di Tommaso
@pditommaso
May 13 2015 19:08
ahh
ok
metadata are not yet supported
Andrew Stewart
@andrewcstewart
May 13 2015 19:09
I can paste the code if you want, I think it's kind of a neat pattern
assuming fastq files are being stored in s3 with metadata fields
Paolo Di Tommaso
@pditommaso
May 13 2015 19:10
it could be interesting, let me see it
Andrew Stewart
@andrewcstewart
May 13 2015 19:11
process s3load {

  // tag { bucket }

  input:
    val bucket from s3buckets

  output:
    stdout into s3paths

  """
  #!/usr/bin/env python

  import os, boto, re

  s3 = boto.connect_s3()
  bucket_url = "$bucket"
  bucket_name, has_subdir, subdir = bucket_url.replace("s3://", "").partition("/")
  bucket = s3.get_bucket(bucket_name)

  # keep only first-of-pair fastq objects under the given prefix
  objects = [obj for obj in bucket if subdir in obj.name and 'fastq' in obj.name and '_R1_' in obj.name]
  for obj in objects:
    key = bucket.get_key(obj.name)
    sample_name = key.get_metadata("sample-name")
    sample_id = key.get_metadata("sample-id")
    pair_id = key.get_metadata("pair-id")
    #sample_id = os.path.dirname(obj.name).rpartition('/')[-1]
    m = re.search(r'_(L\\d+)_', obj.name)
    #pair_id = m.group(1)
    read1 = "s3://%s/%s" % (obj.bucket.name, obj.name)
    read2 = read1.replace('_R1_', '_R2_')
    print "%s,%s,%s,%s" % (pair_id, sample_name, read1, read2)
  """
}
Paolo Di Tommaso
@pditommaso
May 13 2015 19:12
I see
Andrew Stewart
@andrewcstewart
May 13 2015 19:12
so basically I just use the s3 bucket's url to seed a search, filtering by just fastq files (and just the first of each read pair, to be matched later)
Paolo Di Tommaso
@pditommaso
May 13 2015 19:12
I definitely need to add tag support to the s3 client
Andrew Stewart
@andrewcstewart
May 13 2015 19:13
and actually, in that example I'm still parsing information from the file name itself rather than just the metadata
Paolo Di Tommaso
@pditommaso
May 13 2015 19:14
key.get_metadata("sample-name") isn't that S3 metadata?
Andrew Stewart
@andrewcstewart
May 13 2015 19:14
(because my fastq-to-s3 ingest process is still in development too). At some point you could name the fastq files whatever you want and, as long as the right metadata fields are set, the pipeline will pick them up
it is
but boto makes the rest of the metadata field name implicit
so really it's x-amz-meta-sample-name
Paolo Di Tommaso
@pditommaso
May 13 2015 19:15
the metadata are included in the HTTP headers?
Andrew Stewart
@andrewcstewart
May 13 2015 19:16
Yes I believe so.. but I haven't seen the actual raw HTTP calls directly
since I tend to use the boto library
Paolo Di Tommaso
@pditommaso
May 13 2015 19:16
I see
Andrew Stewart
@andrewcstewart
May 13 2015 19:16
(I assume there are similar Java libraries for those API bindings)
(so no need to reinvent the wheel)
Paolo Di Tommaso
@pditommaso
May 13 2015 19:18
of course, but in Nextflow that library is wrapped by a file system layer
Andrew Stewart
@andrewcstewart
May 13 2015 19:19
gotcha
Paolo Di Tommaso
@pditommaso
May 13 2015 19:19
it would be better to define it as a file system adaptor
this allows using it transparently like any other file
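i.e. an S3 path can be read like a local one, e.g. (a minimal sketch, path made up):

println file('s3://bucket/data.txt').text  // reads the remote object content as if it were a local file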
Andrew Stewart
@andrewcstewart
May 13 2015 19:19
that's pretty slick
I figure metadata support is a ways off, but the S3 IO itself in Nextflow gets me pretty close (close enough in fact)
Paolo Di Tommaso
@pditommaso
May 13 2015 19:20
yep, but it requires some tricks to handle specific features like metadata
Andrew Stewart
@andrewcstewart
May 13 2015 19:20
yeah, hence handling it at the process level via boto
Paolo Di Tommaso
@pditommaso
May 13 2015 19:21
good
Andrew Stewart
@andrewcstewart
May 13 2015 19:22
imho, I'd probably wait to observe common usage patterns before codifying that support in, ya know? (unless you have a pretty good idea of how that should be implemented.. it's not immediately obvious to me how one would interact with metadata fields in a file() method though)
file("s3://object/path").meta('x-amz-meta-field1') ?
Paolo Di Tommaso
@pditommaso
May 13 2015 19:24
it could be managed as an extra attribute
something like
file("s3://object/path").attributes.meta('x-amz-meta-field1')   ?
Andrew Stewart
@andrewcstewart
May 13 2015 19:25
ah
Paolo Di Tommaso
@pditommaso
May 13 2015 19:25
finally :)
Andrew Stewart
@andrewcstewart
May 13 2015 19:25
and then that could be used in all sorts of Operator logic ?
that would be neat
Paolo Di Tommaso
@pditommaso
May 13 2015 19:26
I think so
Andrew Stewart
@andrewcstewart
May 13 2015 19:30
awesome
I figured this out
exactly as you said, just do a map in between, remap all the vals to each other and throw the s3 path into a file()
I had to remember that file() as an input qualifier is not the same thing as the file() method in a channel operator
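concretely, something along these lines between the two processes (a minimal sketch, assuming the pair_id,sample_name,read1,read2 CSV lines printed by the s3load script above; channel and variable names are made up):

s3objects = s3paths
  .flatMap { it.trim().readLines() }           // one CSV record per item
  .map { line ->
    def (pairId, sampleName, read1, read2) = line.tokenize(',')
    [ pairId, sampleName, file(read1), file(read2) ]   // wrap the s3 urls into file objects
  }

process processObjects {
  input:
    set val(pairId), val(sampleName), file(read1), file(read2) from s3objects
  """
  cat $read1 $read2
  """
}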
Andrew Stewart
@andrewcstewart
May 13 2015 20:03
And testing on fastq files it works perfectly. 'cat input' is spamming the heck out of my screen just as I would expect it to :)
Thanks for the help @pditommaso
Paolo Di Tommaso
@pditommaso
May 13 2015 20:03
happy to know it works smoothly