    Dalton Hall
    @halldalton

    Hello, my company is very interested in Tuplex as an alternative to Pandas, Dask, Spark, etc. I am in charge of benchmarking it on our datasets and am eager to start using it.

    I installed Tuplex yesterday via pip and got version 0.3.0, but my kernel kept crashing with the known segfault error. I saw on GitHub that you have already made a hotfix and bumped the version to 0.3.1. I was wondering when that version will be available through pip, or whether there is another way to avoid the segfault. Could I just set my AWS credentials to a random string?

    Thanks in advance for your time and effort.

    Dalton Hall
    @halldalton
    Our company also publishes papers and blog posts about the technologies we benchmark. I will link both here when we complete them; they should serve as nice publicity for the library.
    Leonhard Spiegelberg
    @LeonhardFS
    Hello Dalton, thanks for your interest in Tuplex! We just released v0.3.1 on PyPI; it should no longer segfault due to missing AWS credentials. The Docker image tuplex/tuplex has also been updated with the latest (v0.3.1) version.
    We are very curious about your experience and results using Tuplex, and are looking forward to the links! Since Tuplex is still a very early-stage project, feel free to raise issues on GitHub if you encounter any problems running your pipelines, and we will try our best to get them addressed as quickly as possible.
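A quick way to confirm that an environment actually picked up the fixed release is to compare the installed version against 0.3.1. The sketch below uses only the standard library; the helper names (`release_tuple`, `is_at_least`, `installed_version`) are made up for illustration, and the comparison is deliberately simplified (it ignores pre-release and development suffixes).

```python
import re
from importlib import metadata


def release_tuple(version, width=3):
    """Return the leading numeric part, e.g. "0.3.1.dev2" -> (0, 3, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    # Pad short versions so that "0.3" compares equal to "0.3.0".
    parts += [0] * (width - len(parts))
    return tuple(parts)


def is_at_least(version, minimum):
    """Simplified check: suffixes such as ".dev2" are ignored."""
    return release_tuple(version) >= release_tuple(minimum)


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("tuplex")
if found is None:
    print("tuplex is not installed in this environment")
elif is_at_least(found, "0.3.1"):
    print(f"tuplex {found} is at or above 0.3.1")
else:
    print(f"tuplex {found} predates 0.3.1; upgrade with pip")
```

In a Jupyter notebook, the kernel needs a restart after upgrading before the new version is loaded.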
    Dalton Hall
    @halldalton

    Can I see an example of the syntax for an aggregateByKey method? I am working with financial data and want to aggregate by a security id.

    I can aggregate total volume by doing

    tups.aggregate(
        lambda a, b: a + b,
        lambda a, x: a + x["volume"],
        0.0,
    )

    which returns

    +-----------------+
    | Column_0        |
    +-----------------+
    | 504332951.00000 |
    +-----------------+

    But when I try to do it by the id

    tups.aggregateByKey(
        lambda a, b: a + b,
        lambda a, x: a + x["volume"],
        0.0,
        ["id"]
    )

    I get

    <function <lambda> at 0x7fe1e93d7710> is not a lambda function or its code could not be extracted
    <function <lambda> at 0x7fe1e93d7c20> is not a lambda function or its code could not be extracted
    
    
    Error occurred: map::at

    I figure this is due to a syntax error in my lambda functions.

    Also, thank you again for pushing the fix to pypi, I have been enjoying using the library so far
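To make the intent of the two lambdas concrete, here is a plain-Python sketch of what a keyed aggregation of this shape computes. It does not use Tuplex and says nothing about how Tuplex implements `aggregateByKey`; the helper names (`aggregate_partition`, `merge_partials`), the two-partition split, and the sample rows are all hypothetical. The second lambda folds one row into a running value, and the first lambda merges two partial results for the same key.

```python
def aggregate_partition(rows, aggregate, initial, key_columns):
    """Fold the rows of one partition into a per-key running value."""
    partials = {}
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        partials[key] = aggregate(partials.get(key, initial), row)
    return partials


def merge_partials(partials_list, combine):
    """Merge per-partition results, combining values that share a key."""
    merged = {}
    for partials in partials_list:
        for key, value in partials.items():
            merged[key] = combine(merged[key], value) if key in merged else value
    return merged


# Hypothetical sample data, split into two partitions.
partition_1 = [{"id": "A", "volume": 100.0}, {"id": "B", "volume": 50.0}]
partition_2 = [{"id": "A", "volume": 250.0}]

combine = lambda a, b: a + b                # merges two partial sums
aggregate = lambda a, x: a + x["volume"]    # adds one row to a partial sum

partials = [
    aggregate_partition(part, aggregate, 0.0, ["id"])
    for part in (partition_1, partition_2)
]
totals = merge_partials(partials, combine)
print(totals)  # {('A',): 350.0, ('B',): 50.0}
```

Under this reading, the lambdas in the question are well formed: the same pair already works with `aggregate`, which suggests the failure lies in how the functions are extracted rather than in their syntax.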
    Leonhard Spiegelberg
    @LeonhardFS
    I was able to reproduce this in a Jupyter notebook and will investigate the root cause. Are you using 1) Jupyter, 2) a .py file, or 3) an interactive shell for the workload?
    Dalton Hall
    @halldalton
    @LeonhardFS a jupyter notebook
    Dalton Hall
    @halldalton
    Thank you again for your response
    Leonhard Spiegelberg
    @LeonhardFS
    @halldalton we pushed a fix for this on the current master.
    Dalton Hall
    @halldalton
    Thanks! When it is in pypi, will it be the same version or will there be a version bump?
    Leonhard Spiegelberg
    @LeonhardFS
    There will be a version bump, because PyPI requires unique version identifiers and, for consistency reasons, does not allow replacing an existing release with one that has an identical version number :) In the meantime, we also push previews to test.pypi.org, though these are development releases.