    Ywen
    @YPares
    Oh damn...
    Torsten Scholak
    @tscholak
    code behaves differently when wrapped into dask delayed computations
    and it’s hard to figure out why
    Ywen
    @YPares
    Why was Dask chosen over, say, Spark?
    Torsten Scholak
    @tscholak
    dask is perceived to be more lightweight and more “pythonic”
    also, it was my idea to give it a try...
    the current alternative is a bash script ;)
    a bash script that splits up the work, dispatches and launches jobs on the cluster
    Ywen
    @YPares
    @tscholak Yes, that's exactly what we had too at my client, some custom solution to distribute bash/docker commands. We switched to celery last year, but now we think we would have fared better with a simple job queue (like RQ) on top of rabbit or redis
    Tim Pierson
    @o1lo01ol1o
    It's worth mentioning transient again here. However, it's currently rough in terms of developer UX and the materials are not super accessible. (It's basically ContT over IO with semantics similar to ListT. However, composing all the streaming in a distributed setting is never trivial.) The best source of current information is probably via the author in the gitter: https://gitter.im/Transient-Transient-Universe-HPlay/Lobby. I know he's working on a new release.
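    A minimal sketch of that "ContT over IO with semantics similar to ListT" idea, purely illustrative and not transient's actual code: running the continuation once per list element makes the rest of the computation branch over the list, which is ListT-style nondeterminism.
    ```haskell
    import Control.Monad.Trans.Class (lift)
    import Control.Monad.Trans.Cont (ContT (..), evalContT)

    -- Invoking the continuation once per element makes everything that
    -- follows in the do-block run once per element, like ListT.
    each :: [a] -> ContT () IO a
    each xs = ContT $ \k -> mapM_ k xs

    main :: IO ()
    main = evalContT $ do
      x <- each [1, 2 :: Int]
      y <- each "ab"
      lift $ print (x, y)  -- prints (1,'a'), (1,'b'), (2,'a'), (2,'b')
    ```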
    Ywen
    @YPares
    @o1lo01ol1o Will have a look :)
    Hi guys, I stumbled upon a GHC bug which might bite you too. Happily, the workaround (once you've found it ^^) is easy:
    tweag/porcupine#75
    Torsten Scholak
    @tscholak
    @YPares is there an Ormolu room somewhere?
    Mark Karpov
    @mrkkrp
    @tscholak Hi, Ormolu main developer here. We don't have a dedicated room right now. Maybe we should create one!
    Torsten Scholak
    @tscholak
    :+1:
    Ywen
    @YPares
    @tscholak Sorry, I forgot to reply ^^ Thanks @mrkkrp
    Tim Pierson
    @o1lo01ol1o
    @YPares I'm looking at porcupine for a couple of use cases. Is it possible to control how source data files are provided to the PTask? Say my PTask is a foldM and I need to stream all the source files? Similarly, if the output files from the PTask are (effectfully) written at intervals from the fold, is there a reasonable way to interface with a datasink?
    In one particular case, I've been using streamly to do the concurrent transformations, is there a recommended way to purely pass those to a datasink or do I need to setup a fifo queue in STM?
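    One way to picture the "fifo queue in STM" option: a bounded TBQueue between the concurrent producer and whatever feeds the sink. A minimal sketch using only stm and async; the producer and sink here are stand-in callbacks, not streamly or porcupine API:
    ```haskell
    import Control.Concurrent.Async (concurrently_)
    import Control.Concurrent.STM
      (atomically, newTBQueueIO, readTBQueue, writeTBQueue)

    -- Run a producer and a consumer concurrently, connected by a bounded
    -- queue; Nothing signals end of input.
    withFifo :: Int -> ((a -> IO ()) -> IO ()) -> (a -> IO ()) -> IO ()
    withFifo cap produce sink = do
      q <- newTBQueueIO (fromIntegral cap)
      let push x = atomically (writeTBQueue q (Just x))
          drain = do
            mx <- atomically (readTBQueue q)
            case mx of
              Nothing -> pure ()
              Just x  -> sink x >> drain
      concurrently_
        (produce push >> atomically (writeTBQueue q Nothing))
        drain
    ```
    In this picture, produce would wrap whatever the streamly pipeline does with each finished chunk, and sink would be the effectful write that ultimately feeds the datasink.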
    Torsten Scholak
    @tscholak
    bump, but with pipes instead of streamly
    Ywen
    @YPares
    Hi @o1lo01ol1o, by "stream all the source files" do you mean have a Stream (Of FileContent) m?
    Tim Pierson
    @o1lo01ol1o
    @YPares in one case, I have an IsStream s => s m ByteString; in another, I know that I have "files" that will need to be "streamed" :)
    Ywen
    @YPares
    If so, the simplest is to use loadDataStream with one VirtualFile, which will be considered to be repeated. If you run write-config-template you'll see that the path to your files by default includes a variable part ({index} for instance), where index is the LocVariable (just a String wrapper) you gave to loadDataStream
    writeDataStream does the same. It expects as input a Stream (Of (index, FileContent))
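    For reference, a sketch of the stream shapes involved, written with the streaming package; the String indices and the FileContent alias are made-up placeholders, and the porcupine plumbing around them is omitted:
    ```haskell
    import qualified Data.ByteString.Char8 as B8
    import Streaming (Of, Stream)
    import qualified Streaming.Prelude as S

    type FileContent = B8.ByteString  -- stand-in for whatever your files hold

    -- What you feed loadDataStream: one index per repetition of the
    -- VirtualFile (hypothetical date strings here).
    indices :: Monad m => Stream (Of String) m ()
    indices = S.each ["2019-01", "2019-02", "2019-03"]

    -- The shape writeDataStream expects as input: (index, content) pairs.
    outputs :: Monad m => Stream (Of (String, FileContent)) m ()
    outputs = S.map (\i -> (i, B8.pack ("contents for " ++ i))) indices
    ```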
    Ywen
    @YPares
    @o1lo01ol1o You tell me if that fits your needs :) if not we'll see what we can do
    The stream of indices can come from wherever you want (hardcoded in the source, read from the contents of another file, a getOption call, etc.)
    But if you don't have a notion of index, these high-level stream loaders/writers can't help you for now. All the use cases we've had so far dealt with indexed sources/sinks. It'd be doable, but you'd need to write a custom SerialsFor NoWrite (Stream (Of Stuff) m) to have a DataSource/Sink that directly outputs a stream
    Tim Pierson
    @o1lo01ol1o
    Ok. I'm still thinking about this and will have to play around more to see what's possible. Thanks @YPares
    Michel Kuhlmann
    @michelk
    I want to write a small application which periodically downloads river gauging station data from different URLs and stores it in a TimeScaleDB. Would you recommend using porcupine for that? Thanks for a short suggestion.
    Michał J. Gajda
    @mgajda
    @YPares Hi Yves, how are you doing?
    Ywen
    @YPares
    Hi people, sorry for the long silence. I'm now working directly as a NovaDiscovery ( https://www.novadiscovery.com/ ) employee, the company for which porcupine was originally developed. We're starting to work on a design to bring distributed jobs to porcupine, and smaller useful features have already been added and are waiting to be merged back into the public github repo. Expect some API breakage in the future, although the general picture shouldn't change much. Stay tuned :)
    Torsten Scholak
    @tscholak
    nice, congrats Yves!