Samuel Lampa
@samuell
That would perhaps be more like "arrays" than substreams, but it would probably solve most of the same commonly occurring problems in workflow design.
I've been thinking that the fact that you can make both a single item and an array of the same type fulfil the same interface in Go, that could be one way to allow single-items and arrays to co-exist.
Drayton Munster
@dwmunster
The "bracket" idea would mean that substream components wouldn't have to read all incoming IPs before sending. I guess they don't have to now; it's just the one I've written that does.

I've been thinking that the fact that you can make both a single item and an array of the same type fulfil the same interface in Go, that could be one way to allow single-items and arrays to co-exist.

How's that look?

Samuel Lampa
@samuell

How's that look?

I made a simple example here, although it is kind of nonsensical:
https://play.golang.org/p/O-NzSQhPRRJ

Drayton Munster
@dwmunster
I see
Samuel Lampa
@samuell
(Nonsensical in the sense that fmt.Println() can print any value anyways, but it shows how to implement a common interface that both a single value type and array type fulfil).
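For readers who cannot open the playground link, here is a minimal sketch of the same idea in plain Go. It is not the linked example and not SciPipe code: the names `Pather`, `SingleIP`, `IPList` and `countPaths` are hypothetical, chosen only to show a single-value type and a slice type fulfilling one interface.

```go
package main

import "fmt"

// Pather is a hypothetical interface (not part of SciPipe): anything that
// can report the file paths it stands for.
type Pather interface {
	Paths() []string
}

// SingleIP stands in for one information packet (IP) holding one path.
type SingleIP string

// Paths returns the single path wrapped in a one-element slice.
func (s SingleIP) Paths() []string { return []string{string(s)} }

// IPList stands in for an array of packets.
type IPList []SingleIP

// Paths returns the paths of all packets in the list, in order.
func (l IPList) Paths() []string {
	paths := make([]string, 0, len(l))
	for _, ip := range l {
		paths = append(paths, ip.Paths()...)
	}
	return paths
}

// countPaths accepts either kind of value through the shared interface.
func countPaths(p Pather) int { return len(p.Paths()) }

func main() {
	fmt.Println(countPaths(SingleIP("a.txt")))        // one path
	fmt.Println(countPaths(IPList{"a.txt", "b.txt"})) // two paths
}
```

Code that only needs the paths can then take a `Pather` and not care whether it received one item or many.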
Drayton Munster
@dwmunster
An alternative is to use the ellipsis (...) operator for all the factory functions and just use slices everywhere internally
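A rough sketch of what the variadic alternative could look like, assuming a made-up `Port` type and `NewPort` factory (neither is SciPipe's API). The factory takes `...string`, so one path and many paths go through the same call, and only a slice is stored internally.

```go
package main

import "fmt"

// Port is a hypothetical stand-in for an in-port; it is not SciPipe's type.
type Port struct {
	paths []string
}

// NewPort is a variadic factory: callers may pass zero, one, or many paths,
// and the port always stores them as a slice.
func NewPort(paths ...string) *Port {
	// Copy so that later changes to the caller's slice do not leak in.
	return &Port{paths: append([]string(nil), paths...)}
}

func main() {
	single := NewPort("a.txt")
	many := NewPort("a.txt", "b.txt", "c.txt")
	fromSlice := NewPort([]string{"x.txt", "y.txt"}...) // spread an existing slice
	fmt.Println(len(single.paths), len(many.paths), len(fromSlice.paths))
}
```

The trade-off is that the single-item case is no longer distinguishable by type; every consumer has to handle a slice, even one of length one.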
Samuel Lampa
@samuell
Perhaps, yes
Haven't tried, but sounds like that would be great, unless it creates any other problems
Drayton Munster
@dwmunster

sounds like that would be great, unless it creates any other problems

see also: everything

Samuel Lampa
@samuell
:grin:
Drayton Munster
@dwmunster
I'm taking off tomorrow and might poke around at an IP refactor. Originally it was going to be to abstract away the FS bits, but that could probably be rolled in
Samuel Lampa
@samuell
Both sound great (abstracting FS, and any improvements related to substreams) :)
Drayton Munster
@dwmunster
in the meantime, a copy/paste of my substream component gets the job done
Samuel Lampa
@samuell
IC. You needed to make multiple instances of it?
Drayton Munster
@dwmunster
Yes, because the first task drains all the IPs out of the substream's channel
so when the second task is created, that channel is closed
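The behaviour described here is ordinary Go channel semantics and can be reproduced without SciPipe. In this small sketch a buffered channel stands in for the substream, and `drain` is a hypothetical helper: the first reader consumes everything, and a second reader of the same closed channel gets nothing.

```go
package main

import "fmt"

// drain reads everything from ch until it is closed and returns the values.
func drain(ch <-chan string) []string {
	var got []string
	for v := range ch {
		got = append(got, v)
	}
	return got
}

func main() {
	// A buffered channel standing in for a substream of three IPs.
	sub := make(chan string, 3)
	sub <- "a.tif"
	sub <- "b.tif"
	sub <- "c.tif"
	close(sub)

	first := drain(sub)  // consumes all three values
	second := drain(sub) // the channel is already closed and empty
	fmt.Println(len(first), len(second))
}
```

This is why each consumer needs its own copy of the stream (or its own component instance): a channel delivers each value to exactly one receiver.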
Samuel Lampa
@samuell
IC yea :|
windhooked
@windhooked
Probably missing something, but is there any good way to collect files produced by an executable? I'm running pdfimage, which produces an image for each page, and then want to collect each file as soon as it has been written, for further processing. I have tried ls > {o:out} and then splitting the out file, but got stuck there too: it seems like a file splitter bug, producing one extra empty file. Any pointers appreciated.
Samuel Lampa
@samuell
Hi @windhooked and welcome to the chat!
Re. catching output files: Do you know the naming pattern of the produced output files beforehand?
The default way for scipipe to catch output files is to have scipipe decide what the file name should be, by using the {o:outportname} placeholder in the command, of course given that the executable can take the destination filename as a parameter.

So, imagining that pdfimage has a flag, -o for output filename (and -i for input files), that would be:

proc := wf.NewProc("proc", "pdfimage -i  {i:inport-name} -o {o:outport-name}")

etc.

windhooked
@windhooked
Hi @samuell, not exactly since pdfimages generates files based on the given prefix.
ximages := wf.NewProc("pdfimages", "pdfimages -tiff -p \"{i:in}\" \"{i:in}\" && ls \"{i:in}\"*.tif > {o:tif_files} ")
the file splitter component almost did what was expected, but produced one empty file when lines per split == 1, which makes the next process break.
 components.NewFileSplitter(wf, "file_splitter", 1)
windhooked
@windhooked
globbing also does not work since the file glob pattern depends on the output path of the previous process
Samuel Lampa
@samuell
Hmm, yea, the globbing component would need to take paths as parameters, for this to work...
Reg. the file splitter ... sounds a bit like a bug(!)
The quickest workaround, if you are well versed in Go, would probably be to create some custom Go code that does what you want, with: http://scipipe.org/howtos/golang_components/ and/or http://scipipe.org/howtos/reusable_components/
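As one possible building block for such custom code, here is a small plain-Go sketch that turns an `ls`-style listing into a list of paths while skipping blank lines, so that a trailing newline never yields an empty final entry. It makes no claim about what causes the empty file in the file splitter; `nonEmptyLines` is a hypothetical helper and the file names are invented.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// nonEmptyLines returns the lines of listing with surrounding whitespace
// trimmed, skipping any line that is blank.
func nonEmptyLines(listing string) []string {
	var lines []string
	sc := bufio.NewScanner(strings.NewReader(listing))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line != "" {
			lines = append(lines, line)
		}
	}
	return lines
}

func main() {
	// Simulated output of an `ls` command, ending in a newline.
	listing := "doc.pdf-000.tif\ndoc.pdf-001.tif\n"
	for _, path := range nonEmptyLines(listing) {
		fmt.Println(path)
	}
}
```

By contrast, `strings.Split(listing, "\n")` on the same input returns three elements, the last one empty, which is an easy way to end up with one item too many.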
windhooked
@windhooked
Any other way you can think of to start a new task based on the contents of tif_files?
Samuel Lampa
@samuell
Could you share a complete demo workflow code, that shows what you're after?
windhooked
@windhooked
I managed to do it with xargs, at the expense of concurrency
Samuel Lampa
@samuell
And, is there some documentation about the pdfimages command online?
windhooked
@windhooked
Here is something that kind of does what I want.
package main

import (
        sp "github.com/scipipe/scipipe"
        comp "github.com/scipipe/scipipe/components"
)

func main() {

        // Glob pattern for the input PDFs
        path := "./data/*/attachments/*.pdf"

        wf := sp.NewWorkflow("pdf2ocr", 20)

        globpdf := comp.NewFileGlobber(wf, "findpdf", path)

        // Extract TIFF images next to each PDF, then list them into the out file
        ximages := wf.NewProc("pdfimages", "pdfimages -tiff -p \"{i:in}\" \"{i:in}\" ; ls \"{i:in}\"*.tif > {o:out} ") // {o:out}
        ximages.In("in").From(globpdf.Out())

        // < test xargs -L 1 -I{} tesseract {} {} --psm 1 --oem 1 -l eng
        ocr := wf.NewProc("tesseract", "< \"{i:in}\" xargs -L 1 -I{} tesseract {} {}.ocr --psm 1 --oem 1 -l eng; ls \"{i:in}\"*.txt > {o:out}  ") // {o:out}
        ocr.In("in").From(ximages.Out("out"))

        wf.Run()
}

And, is there some documentation about the pdfimages command online?

https://linux.die.net/man/1/pdfimages

Samuel Lampa
@samuell
Thanks
Samuel Lampa
@samuell
I was trying something like this ... but running into errors: https://gist.github.com/samuell/56ac30625417b881427dfea7d2675d43
But it turns out that working around scipipe's own file handling mechanisms (like writing output to a temp directory, detecting already written files, etc.) means one has to write a lot of crazy workaround code to handle file paths.

I was trying something like this ... but running into errors: https://gist.github.com/samuell/56ac30625417b881427dfea7d2675d43

The errors might be with the tesseract command ... I'm not too familiar with it yet, and might have changed too many things.

Samuel Lampa
@samuell
I'll be pondering a more general solution to the case when a lot of unspecified output files are created and need to be picked up. It is a problem that occurs pretty often.
windhooked
@windhooked
Thanks, that makes sense. I will give the sample a try. I just got stuck at the semantics of how to get the actual file name in a task context.
Samuel Lampa
@samuell
Yeah, in general the paths should be taken from the task struct, if at all possible (otherwise, path wrangling is needed)
windhooked
@windhooked
@samuell thank you for your suggestions, I have traced it down to relative path problems. how is the temporary directory created for a new task?
Samuel Lampa
@samuell

@windhooked Will check!

The temporary path is created as a subfolder under the main Go-script's folder.

This is why, if you want to refer to existing files in the main directory, you need to prepend ../ to the paths. Output paths though, are supposed to be written relative to the temporary directory, from where they will be moved into their final paths as part of scipipe's code for finalizing a task (to make file writes atomic).

Samuel Lampa
@samuell
Ah ...

Ok, one problem with the approach of my code suggestion:

sp.ExecCmd() will not be executed inside the task's temporary folder.

Thus, since input paths are prepended with ../ by SciPipe, to support the normal operation (executing inside the temp folder), one will need to remove that ../ part manually ...

E.g., I got a bit further by adding this line just under the for ... line in your code:
tifFile = strings.Replace(tifFile, "../", "", 1)
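A slightly stricter variant of that one-liner, as a sketch: `strings.TrimPrefix` only removes a leading "../", whereas `strings.Replace(..., 1)` removes the first "../" wherever it occurs in the path. The helper name `stripParentPrefix` and the example paths are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// stripParentPrefix removes one leading "../" from path, if present.
// A "../" that appears later in the path is left untouched.
func stripParentPrefix(path string) string {
	return strings.TrimPrefix(path, "../")
}

func main() {
	fmt.Println(stripParentPrefix("../data/doc.pdf-000.tif"))
	fmt.Println(stripParentPrefix("data/../doc.pdf-000.tif"))
}
```

For paths that always start with the prefix, both approaches give the same result; the difference only shows when "../" appears somewhere other than the start.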