Samuel Lampa
@samuell
Both sound great (abstracting FS, and any improvements related to substreams) :)
Drayton Munster
@dwmunster
in the meantime, a copy/paste of my substream component gets the job done
Samuel Lampa
@samuell
IC. You needed to make multiple instances of it?
Drayton Munster
@dwmunster
Yes, because the first task drains all the IPs out of the substream's channel
so when the second task is created, that channel is closed
Samuel Lampa
@samuell
IC yea :|
windhooked
@windhooked
Probably missing something, but is there any good way to collect files produced by an executable? I'm running pdfimages, which produces an image for each page, and then I want to collect each file as soon as it has been written, for further processing. I have tried ls > {o:out} and then splitting the out file, but got stuck there too; it seems like a file splitter bug, producing one extra empty file. Any pointers appreciated.
Samuel Lampa
@samuell
Hi @windhooked and welcome to the chat!
Re. catching output files: do you know the pattern of the produced output files beforehand?
The default way for scipipe to catch output files is to let scipipe decide what the file name should be, by using the {o:outportname} placeholder in the command, given of course that the executable can take the destination filename as a parameter.

So, imagining that pdfimages had a flag -o for the output filename (and -i for input files), that would be:

proc := wf.NewProc("proc", "pdfimages -i {i:inport-name} -o {o:outport-name}")

etc.

windhooked
@windhooked
Hi @samuell, not exactly, since pdfimages generates files based on the given prefix.
ximages := wf.NewProc("pdfimages", "pdfimages -tiff -p \"{i:in}\" \"{i:in}\" && ls \"{i:in}\"*.tif > {o:tif_files} ")
The file splitter component almost did what was expected, but produced one empty file when lines per split == 1, which makes the next process break.
 components.NewFileSplitter(wf, "file_splitter", 1)
windhooked
@windhooked
globbing also does not work since the file glob pattern depends on the output path of the previous process
Samuel Lampa
@samuell
Hmm, yeah, the globbing component would need to take paths as parameters for this to work...
Re. the file splitter ... that sounds a bit like a bug(!)
The quickest workaround, if you are well versed in Go, would probably be to create some custom Go code that does what you want, with: http://scipipe.org/howtos/golang_components/ and/or http://scipipe.org/howtos/reusable_components/
windhooked
@windhooked
any other way to start a new task based on the contents of tif_files, you can think of?
Samuel Lampa
@samuell
Could you share a complete demo workflow code, that shows what you're after?
windhooked
@windhooked
I managed to do it with xargs, at the expense of concurrency
Samuel Lampa
@samuell
And, is there some documentation about the pdfimages command online?
windhooked
@windhooked
Here is something that kind of does what I want:
package main

import (
        sp "github.com/scipipe/scipipe"
        comp "github.com/scipipe/scipipe/components"
)

func main() {

        path := "./data/*/attachments/*.pdf"

        wf := sp.NewWorkflow("pdf2ocr", 20)

        globpdf := comp.NewFileGlobber(wf, "findpdf", path)

        // Extract TIFF images from each PDF, then list the produced files into {o:out}
        ximages := wf.NewProc("pdfimages", "pdfimages -tiff -p \"{i:in}\" \"{i:in}\" ; ls \"{i:in}\"*.tif > {o:out} ")
        ximages.In("in").From(globpdf.Out())

        // OCR each TIFF listed in the input file:
        // xargs -L 1 -I{} tesseract {} {} --psm 1 --oem 1 -l eng
        ocr := wf.NewProc("tesseract", "< \"{i:in}\" xargs -L 1 -I{} tesseract {} {}.ocr --psm 1 --oem 1 -l eng; ls \"{i:in}\"*.txt > {o:out}  ")
        ocr.In("in").From(ximages.Out("out"))

        wf.Run()
}

And, is there some documentation about the pdfimages command online?

https://linux.die.net/man/1/pdfimages

Samuel Lampa
@samuell
Thanks
Samuel Lampa
@samuell
I was trying something like this ... but running into errors: https://gist.github.com/samuell/56ac30625417b881427dfea7d2675d43
But it turns out, working around scipipe's own file handling mechanisms (like writing output to a temp directory, detecting already written files etc), means one has to write a lot of crazy workaround code to handle file paths.

The errors might be with the tesseract command ... I'm not too familiar with it yet, and might have changed too many things.

Samuel Lampa
@samuell
I'll be pondering a more general solution for the case where a lot of unspecified output files are created and need to be picked up. It is a problem that occurs pretty often.
windhooked
@windhooked
Thanks, that makes sense. I will give the sample a try. I just got stuck on the semantics of how to get the actual file name in a task context.
Samuel Lampa
@samuell
Yeah, in general the paths should be taken from the task struct, if at all possible (otherwise, path wrangling is needed)
windhooked
@windhooked
@samuell thank you for your suggestions. I have traced it down to relative path problems. How is the temporary directory created for a new task?
Samuel Lampa
@samuell

@windhooked Will check!

The temporary path is created as a subfolder under the main Go script's folder.

This is why, if you want to refer to existing files in the main directory, you need to prepend ../ to the paths. Output paths, though, are supposed to be written relative to the temporary directory, from where they will be moved to their final paths as part of scipipe's code for finalizing a task (to make file writes atomic).

Samuel Lampa
@samuell
Ah ...

Ok, one problem with the approach in my code suggestion:

sp.ExecCmd() will not be executed inside the task's temporary folder.

Thus, since input paths are prepended with ../ by SciPipe to support the normal operation (executing inside the temp folder), one will need to remove that ../ part manually ...

E.g., I got a bit further by adding this line just under the for ... line in your code:
tifFile = strings.Replace(tifFile, "../", "", 1)
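A slightly more defensive form of the same workaround, as a plain stdlib sketch (not scipipe API): strings.TrimPrefix only removes the prefix when it is actually present, so paths without the leading ../ pass through unchanged.

```go
package main

import (
	"fmt"
	"strings"
)

// stripParentPrefix removes a single leading "../" (as prepended by
// scipipe to input paths), leaving other paths untouched.
func stripParentPrefix(path string) string {
	return strings.TrimPrefix(path, "../")
}

func main() {
	fmt.Println(stripParentPrefix("../data/page-000.tif")) // data/page-000.tif
	fmt.Println(stripParentPrefix("data/page-000.tif"))    // data/page-000.tif
}
```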
kerkomen
@kerkomen
Hi @samuell, thanks a lot for the amazing tool!
I've just started using scipipe for some of my workflows, and I'm still figuring things out.
Sorry for a possibly naïve question: is there a way to automatically re-run the workflow when some files are updated (i.e. something Make-like)?
Samuel Lampa
@samuell
Hi @kerkomen , and thank you for the kind words! :)
Unfortunately, there is currently nothing really similar to how Make works.
Part of the reason is that workflow tools like SciPipe typically take a rather different approach to tracking changes than Make does. Often you want files with completely different content to have different file names as well (so, using the file naming mechanism to give them a different name, perhaps based on a parameter value that was changed, or so).
Then, if there is just a smaller change or fix in one's raw data that one wants to re-run the workflow with, what I usually do is simply a careful delete of all the files that will need to be re-computed based on that.
Samuel Lampa
@samuell
But it is an interesting idea, the Make-like functionality. Will think about it :)
kerkomen
@kerkomen

Thanks for the quick response!
Indeed, I was mainly wondering about changes in upstream files and quickly picking the workflow up again from the respective part of the pipeline; maybe having an argument to track timestamps (and/or hashes, which might be stored in audit files) of files, and only skip a step, Make-style, when the downstream file is younger than the upstream one...
On that note, a way to force the workflow to re-run would also be nice; I think manually deleting files for large workflows might be too error-prone. Works fine for small workflows, you're right!

Thanks again, looking forward to scipipe's progress!

Samuel Lampa
@samuell
:thumbsup: :)
Tom Deadman
@biotom
Hi, everybody!
I'm a newish Go developer with some scientific workflow experience. Are you looking for contributors, and, if so, is there a contributor guide or any small issues that might be suitable for someone new?
Samuel Lampa
@samuell

Hi @biotom! Welcome to the chat, and thank you for your question! Re. contributors: right now, I think bug fixes are the most important way of contributing.

It would also be great to have more support for various HPC systems etc., but I think a bit of architecture for easily plugging in remote runners is needed there, which might not be an easy first task (the architectural bits).

Feedback, input and suggestions are definitely welcome, but on the architectural side I'm quite opinionated, trying very hard to keep the core library maximally simple, without dependencies on other Go packages, which is good to keep in mind.
I realize it is high time for a contribution guidelines page, though
Samuel Lampa
@samuell
(Btw, sorry for the late reply ... I have been drowning in work for a deadline next week :P )