    Pablo Cordero
    @dimenwarper
    cool thanks, yeah we've also been doing the relative path strings
    marius a. eriksen
    @mariusae
    it's not very satisfactory; I hope to have a much better answer very soon...
    gmguriarte
    @gmguriarte
    Hello, I've been having some trouble with installation. I am very unfamiliar with Docker, so I'm not sure what my next steps would need to be here: https://gist.github.com/gmguriarte/690518cb7865482e6746521c94e5b12e
    marius a. eriksen
    @mariusae
    @gmguriarte what is your version of Go?
    gmguriarte
    @gmguriarte
    hey marius, I'm on go1.11.4 linux/amd64
    I just tried a fresh installation on a new Ubuntu EC2 instance, here's what I got: https://gist.github.com/gmguriarte/6baac030cd1496416e75d7d857c96284
    marius a. eriksen
    @mariusae
    @gmguriarte hmm, looks like it's not using Go modules for some reason
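    (A sketch of one thing worth checking here: go1.11 only switches to module mode automatically outside GOPATH, so a checkout under GOPATH silently falls back to GOPATH mode. GO111MODULE is the standard Go env var; the checkout path and go.mod assumption are mine.)

    export GO111MODULE=on                          # force module-aware mode even inside GOPATH (go1.11 behavior)
    cd $GOPATH/src/github.com/grailbio/reflow      # assumed checkout location
    go build ./cmd/reflow                          # resolves dependencies via the repo's go.mod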
    Rafał Wicha
    @pancernik
    @mariusae Do you have space in the project for any open-source support? I really enjoyed listening about Reflow on Software Engineering Daily recently, and would like to help :)
    Brian Naughton
    @hgbrian

    Hi! I am having an issue that I don't know how to proceed with... Perhaps someone has ideas.
    I have a minimal reflow script that works locally (takes about a second) but fails on AWS.
    (The real script I took the minimal part from used to work on AWS, then stopped working. The problem is the same as with the minimal script.)

    I have no files in my cache (/tmp/flow and s3://hx-reflow-cache/objects are both empty).
    I run the script like this: reflow -cache=off -log=debug run debug.rf

    The script gets stuck forever in the GetSRAIds function, but I can't see what is going on under the hood.

    // dirs is the built-in directories module.
    val dirs = make("$/dirs")
    val rnaseq_image = "248340427436.dkr.ecr.us-east-1.amazonaws.com/rnaseq:latest"

    // This is the exec that gets stuck on AWS.
    func GetSRAIds(astr string) dir =
        exec(image := rnaseq_image, cpu := 1, mem := 4*GiB, disk := 100*GiB) (sraid_dir dir) {"
            touch {{sraid_dir}}/afile
        "}
    
    func CopyAllToS3(from_dir dir) =
        dirs.Copy(from_dir, "s3://hx-brian/debug")
    
    val paired_sraid_dir = GetSRAIds("asdf")
    
    @requires(cpu := 1, mem := 4*GiB, disk := 100*GiB)
    val Main = CopyAllToS3(paired_sraid_dir)

    Output:

    2019/04/12 15:47:53 reflow version 0.6.7 (go1.10)
    2019/04/12 15:47:53 reflowlet image grailbio/reflowlet:1531508213
    2019/04/12 15:47:53 run ID: 4a689c8e
    2019/04/12 15:47:54 evaluating program /home/brian/hxsync/apps/rnaseq/debug.rf
            (no params)
            (no arguments)
    2019/04/12 15:47:55 ec2cluster: pending{}
    2019/04/12 15:47:55 ec2cluster: allocate {mem:4.0GiB cpu:1 disk:100.0GiB}
    2019/04/12 15:47:55 ec2cluster: attempting to allocate from existing pool
    2019/04/12 15:47:55 ec2cluster: pending{}
    2019/04/12 15:47:55 accepted alloc ec2-34-222-222-8.us-west-2.compute.amazonaws.com:9000/0107955aaa9070b4
    2019/04/12 15:47:55 run state: eval alloc ec2-34-222-222-8.us-west-2.compute.amazonaws.com:9000/0107955aaa9070b4
    2019/04/12 15:47:55 evaluating with configuration: executor *client.clientAlloc transferer *repository.Manager flags cacheextern,nocache,nogc,norecomputeempty,topdown flowconfig hashv2 cachelookuptimeout 1m0s
    2019/04/12 15:47:55  ->  debug.GetSRAIds 54868e65 run    exec ...us-east-1.amazonaws.com/rnaseq:latest touch {{sraid_dir}}/afile
    2019/04/12 15:47:55 debug.GetSRAIds 54868e65 debug.rf:5:9:
            resources: {mem:4.0GiB cpu:1 disk:100.0GiB}
            sha256:fd5c6f3dfe670e81ada10b8e64c5583e500a7fca946b0be46c92036ff1515e24
            sha256:54868e6592f6fed89508ab431e5db8e2462630229e4c19cdc0eafde1454c4e0b
            ec2-34-222-222-8.us-west-2.compute.amazonaws.com:9000/0107955aaa9070b4/54868e6592f6fed89508ab431e5db8e2462630229e4c19cdc0eafde1454c4e0b
            248340427436.dkr.ecr.us-east-1.amazonaws.com/rnaseq:latest
            command:
                touch {{sraid_dir}}/afile
            where:
    ec2cluster: 1 instances: m3.large:1 (<=$0.1/hr), total{mem:7.0GiB cpu:2 disk:2.9TiB intel_avx:2}, waiting{}, pending{}
    4a689c8e: elapsed: 10s, running:1, completed: 0/1
      debug.GetSRAIds:  exec ...us-east-1.amazonaws.com/rnaseq:latest touch {{sraid_dir}}/afile  2m10s

    What is a reasonable way to debug this kind of problem? I don't really have a mental model of how it could be failing here, so I'm not sure what I'm supposed to be looking for.

    The things I have tried include: changing the file (i.e., now debug.rf), changing the GetSRAIds function to touch or echo, changing the image, and changing the number of CPUs to get a different AWS instance.

    Brian Naughton
    @hgbrian
    reflow ps -l
    9f0de509 debug.GetSRAIds 6:13PM 0:00 initializing 0B 0.0 0B [exec] ec2-34-222-222-8.us-west-2.compute.amazonaws.com:9000/dbb06f6bc224f0e9/9f0de50920195fd24e54ae66463ce51faa4ba915335db49241ab1e76aada8df7
    reflow shell ec2-34-222-222-8.us-west-2.compute.amazonaws.com:9000/dbb06f6bc224f0e9/9f0de50920195fd24e54ae66463ce51faa4ba915335db49241ab1e76aada8df7
    sha256:9f0de50920195fd24e54ae66463ce51faa4ba915335db49241ab1e76aada8df7: cannot shell into a non-running exec
    So I guess it is stuck initializing?
    Brian Naughton
    @hgbrian
    If I ssh into the EC2 instance and run docker exec -it 12345 /bin/sh, I can see:
    / # ps
    PID   USER     TIME   COMMAND
        1 root       0:00 /reflowlet -prefix /host -ec2cluster -ndigest 60 -config
       22 root       0:00 /bin/sh
       29 root       0:00 ps
    Brian Naughton
    @hgbrian
    some more info from -trace @mariusae
    (hxenv) brian@brian-8:~/hxsync/apps/rnaseq$ reflow run -nocacheextern -recomputeempty -trace debug.rf
    reflow: run ID: d3909ecf
    reflow: mutate flow 6520624e state FlowInit {} k deps 2158f738: FlowTODO
    reflow: mutate flow 2158f738 state FlowInit {} k deps 91585a14: FlowTODO
    reflow: mutate flow 91585a14 state FlowInit {} k deps 0638b774: FlowTODO
    reflow: mutate flow 0638b774 state FlowInit {} k deps 43d32c20: FlowTODO
    reflow: mutate flow 43d32c20 state FlowInit {} k deps 70363a09: FlowTODO
    reflow: mutate flow 70363a09 state FlowInit {} coerce deps 9f0de509: FlowTODO
    reflow: mutate flow 9f0de509 state FlowInit {mem:4.0GiB cpu:1 disk:100.0GiB} exec image 248340427436.dkr.ecr.us-east-1.amazonaws.com/clusterfudge:latest cmd "\n        touch %s\n    ": FlowNeedLookup
    reflow: mutate flow 9f0de509 state FlowNeedLookup {mem:4.0GiB cpu:1 disk:100.0GiB} exec image 248340427436.dkr.ecr.us-east-1.amazonaws.com/clusterfudge:latest cmd "\n        touch %s\n    ": FlowLookup
    reflow: mutate flow 9f0de509 state FlowLookup {mem:4.0GiB cpu:1 disk:100.0GiB} exec image 248340427436.dkr.ecr.us-east-1.amazonaws.com/clusterfudge:latest cmd "\n        touch %s\n    ": FlowTODO
    reflow: mutate flow 9f0de509 state FlowTODO {mem:4.0GiB cpu:1 disk:100.0GiB} exec image 248340427436.dkr.ecr.us-east-1.amazonaws.com/clusterfudge:latest cmd "\n        touch %s\n    ": map[]
    reflow: mutate flow 9f0de509 state FlowTODO {mem:4.0GiB cpu:1 disk:100.0GiB} exec image 248340427436.dkr.ecr.us-east-1.amazonaws.com/clusterfudge:latest cmd "\n        touch %s\n    ": FlowNeedTransfer
    reflow: mutate flow 9f0de509 state FlowNeedTransfer {mem:4.0GiB cpu:1 disk:100.0GiB} exec image 248340427436.dkr.ecr.us-east-1.amazonaws.com/clusterfudge:latest cmd "\n        touch %s\n    ": FlowTransfer
    reflow: mutate flow 9f0de509 state FlowTransfer {mem:4.0GiB cpu:1 disk:100.0GiB} exec image 248340427436.dkr.ecr.us-east-1.amazonaws.com/clusterfudge:latest cmd "\n        touch %s\n    ": FlowReady
    reflow: mutate flow 9f0de509 state FlowReady {mem:4.0GiB cpu:1 disk:100.0GiB} exec image 248340427436.dkr.ecr.us-east-1.amazonaws.com/clusterfudge:latest cmd "\n        touch %s\n    ": map[]
    reflow: mutate flow 9f0de509 state FlowReady {mem:4.0GiB cpu:1 disk:100.0GiB} exec image 248340427436.dkr.ecr.us-east-1.amazonaws.com/clusterfudge:latest cmd "\n        touch %s\n    ": FlowRunning, map[disk:1.073741824e+11 mem:4.294967296e+09 cpu:1]
    ec2cluster: 1 instances: m3.large:1 (<=$0.1/hr), total{mem:7.0GiB cpu:2 disk:2.9TiB intel_avx:2}, waiting{}, pending{}
    d3909ecf: elapsed: 3m0s, running:1, completed: 0/1
      debug.GetSRAIds:  exec ..st-1.amazonaws.com/clusterfudge:latest touch {{sraid_file}}  2m59s
    Brian Naughton
    @hgbrian
    It turns out debug.rf works with image "ubuntu", so it must be something about my image. The image is unchanged, though, so maybe it's a permissions thing...
    Brian Naughton
    @hgbrian
    OK, I think I have worked this out. It's an AWS issue, not a reflow issue, but I think I will still file an issue on reflow.
    Basically, the image I was using did not exist in the region I was logged into (not sure how this happened...).
    If I substitute in an image I know does not exist (any random string, really), then I get the same issue!
    In other words, reflow does not tell me the image does not exist; it just never finishes initializing.
    And for some reason there is a docker container running on the machine....
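    (A minimal repro sketch of the above, mirroring the debug.rf from earlier in the thread; the image name is deliberately bogus:)

    val missing_image = "this-image-does-not-exist:latest"

    // Never errors out; the exec just sits in "initializing" forever.
    val Main = exec(image := missing_image, cpu := 1, mem := 1*GiB) (out dir) {"
        touch {{out}}/afile
    "}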
    Brian Naughton
    @hgbrian
    filed here grailbio/reflow#111
    Brian Naughton
    @hgbrian
    I hit this issue too: grailbio/reflow#86
    The current system seems strange to me. I told reflow what I need for each step.
    As described, it sounds like reflow should just figure out the max exec size instead of making me do it?
    I assumed it did this, since up until now I have used Main with cpu:=1 (thanks @olgabot for flagging these issues before I see them)....
    marius a. eriksen
    @mariusae
    @hgbrian yeah, this should no longer be required with the new scheduler (reflow run -sched), which @prasadgopal is currently working to make the default.
    (And apologies for being AWOL here for the last little while -- I had been very heads down in another project!)
    (And I agree, @requires is/was a bit of a kludge. But the new scheduler removes the need for it.)
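    (As a sketch: with the new scheduler, the earlier debug.rf should be able to drop its @requires annotation entirely and be run as:)

    reflow run -sched debug.rf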
    Alper Yilmaz
    @alperyilmaz

    Sorry for asking such a simple question. I was trying to compile the latest version from GitHub, and after building the reflow binary and running it I get the following error:

    tls: no provider named github.com/grailbio/infra/tls.Authority (is package github.com/grailbio/infra/tls linked into the binary?)

    What am I doing wrong? Is there a trick for compiling the binary?

    I get different errors for different options:
    $ ./reflow setup-s3-repository test
    repository: no provider named github.com/grailbio/reflow/repository/s3.Repository (is package github.com/grailbio/reflow/repository/s3 linked into the binary?)
    
    $ ./reflow setup-ec2
    cluster: no provider named github.com/grailbio/reflow/ec2cluster.Cluster (is package github.com/grailbio/reflow/ec2cluster linked into the binary?)
    Meanwhile, $ ./reflow setup-dynamodb-assoc test runs without error. So, how can I link the missing parts into the binary?
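    (For context, a sketch of what "linked into the binary" likely means here: each provider package appears to register itself in an init() function, so a binary only knows about the providers its main package blank-imports. The registration-via-init mechanism is an inference from the error text; the import paths are the ones the errors name.)

    package main

    import (
        _ "github.com/grailbio/infra/tls"            // tls.Authority
        _ "github.com/grailbio/reflow/ec2cluster"    // ec2cluster.Cluster
        _ "github.com/grailbio/reflow/repository/s3" // s3.Repository
    )

    func main() {} // real main elided; the blank imports alone pull the providers in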
    prasadgopal
    @prasadgopal
    It is possible it is in a broken state. We will sync the repository sometime today, and that should fix the issue. Let me do that and then see if things are working. Sorry for the inconvenience. I'll ping you when I'm done syncing.
    Alper Yilmaz
    @alperyilmaz
    ok.. thanks for letting me know..
    prasadgopal
    @prasadgopal
    did you use github.com/grailbio/reflow/cmd/buildreflow to build your binary? or just go build?
    prasadgopal
    @prasadgopal
    I synced the repo now. Please sync, build, and retry the commands, and let me know how that goes.
    @alperyilmaz please use buildreflow to build the binary. I will update the doc to reflect that.
    Alper Yilmaz
    @alperyilmaz
    thx @prasadgopal for the quick fix. I cloned the repo and compiled as shown below, but I'm still getting the "linked into the binary" problem:
    $ cd cmd/reflow/
    $ go install github.com/grailbio/reflow/cmd/buildreflow
    $ buildreflow
    $ ./reflow version
    reflowlet: no provider named reflowletversion (is package  linked into the binary?)
    prasadgopal
    @prasadgopal
    Could you edit $HOME/.reflow/config.yaml and remove the reflowletversion line from it? Sorry, we have made some changes and haven't had a chance to migrate previous versions of the configuration.
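    (A sketch of the edit, assuming a config shaped like the one quoted later in this thread; only the stale line needs to go:)

    # $HOME/.reflow/config.yaml
    # reflowletversion: ...        <- delete this line
    assoc: dynamodbassoc,table=<your-table>
    repository: s3,bucket=<your-bucket>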
    Alper Yilmaz
    @alperyilmaz
    Actually, I'm very sorry: the old config file was still around, and reflow didn't have access to AWS credentials. I removed the config file and assigned the credentials to environment variables, and now it's working.
    prasadgopal
    @prasadgopal
    ok cool. let me know if you have any questions.
    Alper Yilmaz
    @alperyilmaz
    thx for help..
    Alper Yilmaz
    @alperyilmaz

    hi @prasadgopal, I have two questions (and sorry for bothering again).

    I was going over the example on the GitHub page, the one where align.rf aligns two fastq files to the human genome with bwa. It worked fine. Then I added the following lines to the .rf file, where the alignment result is copied to an S3 bucket:

            files := make("$/files")
            aligned := align(r1, r2)
            files.Copy(aligned, "s3://aws-genomics-flow/aligned.sam")

    I was expecting that only the transfer step would be executed, but it is aligning again. Is there a way to debug why it didn't use the cached result?

    My second question is about reflow info, which doesn't seem to be fully working. I can get information about a run, but when I try to get info about a process I get an invalid URI error.

    For example:

    $ reflow ps -l
    2a91fdfc align.reference 3:25AM 0:00 running 297.8MiB 1.0 4.4GiB bwa ec2-34-220-237-187.us-west-2.compute.amazonaws.com:9000/178a24a14af41df7/2a91fdfc48da7dbf15deac89a38b112c5e8e4a125143dcc16c7df1ac14676db9
    
    $ reflow info ec2-34-220-237-187.us-west-2.compute.amazonaws.com:9000/178a24a14af41df7/2a91fdfc48da7dbf15deac89a38b112c5e8e4a125143dcc16c7df1ac14676db9
    alloc 178a24a14af41df7: invalid URI
    prasadgopal
    @prasadgopal
    @alperyilmaz regarding the first question, it should align again. If you can get me the logs of the first and second invocations, I can debug further. Your logs should be stored under $HOME/.reflow/runs/<runid>.execlog; the runid is usually printed at the start of the run.
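    (As a sketch, pulling those logs would look something like this; <runid> is whatever reflow printed at the start of each run:)

    ls $HOME/.reflow/runs/                   # list run ids
    less $HOME/.reflow/runs/<runid>.execlog  # inspect one run's exec log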
    Alper Yilmaz
    @alperyilmaz
    I killed the second run. So, let me run it again and then provide the logs. What is the preferred way to share the log file? Should I use pastebin or a similar site? Not sure if I can attach files in here..
    prasadgopal
    @prasadgopal
    Regarding your second question, I agree the tooling is in somewhat of a broken state, and I'll look into it. We want to revamp the tooling quite a bit and hope to get to it soon.
    Can you put it in a public S3 bucket? Not sure if there are better ways to share it.
    Alper Yilmaz
    @alperyilmaz

    at the end of the simple bioinformatics workflow section, it says:

    Here we see that Reflow did not need to recompute the aligned file; it is instead retrieved from cache. The reference index generation is skipped altogether. Status lines that indicate "xfer" (instead of "run") means that Reflow is performing a cache transfer in place of running the computation. Reflow claims to have transferred a 13.2 GiB file to s3://marius-test-bucket/aligned.sam

    prasadgopal
    @prasadgopal
    My earlier comment should have been "shouldn't align again"
    Alper Yilmaz
    @alperyilmaz
    I thought the alignment and reference indexing steps would be skipped.
    Oh, ok. Sorry.
    Then please disregard my previous note.
    prasadgopal
    @prasadgopal
    Yes, it should reuse previously computed values.
    prasadgopal
    @prasadgopal
    @alperyilmaz one more quick check: what does reflow config -marshal show for assoc, repository, and cache?
    Alper Yilmaz
    @alperyilmaz

    hi @prasadgopal,
    here are the values:

    assoc: dynamodbassoc,table=aws-genomics-reflow
    repository: s3,bucket=aws-genomics-flow
    cache: "off"

    I'm guessing cache: "off" is the problem. Let me try with "on".

    Alper Yilmaz
    @alperyilmaz
    cache: "on" didn't work, it's expecting a provider name I guess. What are possible provider names?
    possible values are read, readwrite and write I guess.. Should I choose readwrite?
    Alper Yilmaz
    @alperyilmaz
    hi @prasadgopal, I selected "readwrite" and reran the whole thing (twice), and it worked: the second run used the cache.
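    (For reference, the working configuration from this exchange would look like the following; the table and bucket names are the ones quoted earlier in the thread:)

    assoc: dynamodbassoc,table=aws-genomics-reflow
    repository: s3,bucket=aws-genomics-flow
    cache: readwrite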