    Will Eaton
    @wseaton
    just curious, do you have any insight as to why the try_from wasn't working above? if it did I'd have the full structarray -> rb -> polars pipeline working in arrow2
    Ritchie Vink
    @ritchie46
    Because the trait `From<RecordBatch>` is not implemented for `DataFrame` :). `From<Vec<RecordBatch>>` is, so try:
    let df = DataFrame::try_from(vec![rbs])?;
    9 replies
    yuuuxt
    @yuuuxt
    seeing pola-rs/polars#1210 PR, wondering if rank is going to be implemented? in my case it would be nice to have something similar to pandas' rank
    8 replies
    yuuuxt
    @yuuuxt
    [image attachment: 图片.png]
    (the pic above should be in the reply)
    speyejack
    @speyejack:matrix.org
    [m]
    Hi everyone, I was wondering if there was a way to access PyDataFrame from the rust API? I'm currently working on a rust/python hybrid program and it would be nice to use the data in rust then pass it back to python, or vice versa.
    1 reply
    Will Eaton
    @wseaton
    So I had a case where I needed to parse a file with a two-character delimiter, basically |, which I saw isn't supported. No biggie, I'll just trim the whitespace off all of the Utf8 fields in my dataframe from a list of them. So I have two questions:
    1) would it be helpful to add a method that returns all the series of a certain type in a df? and
    2) when you have your Vec<Series> of transformed series, what is the best way to replace them in the original frame? Is it safe to use something like fold(&mut df, |df, s| df.replace(s, col) )
    3 replies
    zys864
    @zys864
    [image attachment: image.png]
    Why doesn't polars-io's buffer support other data types, such as UInt8?
    1 reply
    When I use CsvReader to parse a CSV file into a DataFrame, I can't use a Schema that includes the DataType::UInt8 type. (I already turned the dtype-full feature on.)
    speyejack
    @speyejack:matrix.org
    [m]
    @ritchie46 do you know anything about exposing the python API to the public rust API?
    Benjamin Kay
    @benkay86
    Can I store arbitrary types in a polars series using Python? It seems like the Object data type should let me do this, but I cannot actually figure out how to construct a polars.Series using this type...
    Will Eaton
    @wseaton
    @benkay86 don't think you can without doing some type of binary serialization like pickling, or marshalling to JSON in the case of something like a namedtuple or a dataclass
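    A minimal sketch of the JSON-marshalling route Will mentions (the record and column name here are illustrative, not from the thread):
    import json
    
    import polars as pl
    
    # Some arbitrary, JSON-serializable metadata we want to keep in a column.
    record = {"name": "example", "values": [1, 2, 3]}
    
    # Marshal the object to a JSON string and store it in a Utf8 column.
    df = pl.DataFrame({"meta": [json.dumps(record)]})
    
    # Round-trip it back out of the frame.
    restored = json.loads(df["meta"][0])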
    Will Eaton
    @wseaton
    now I'm wondering what the use-case is, is it worth losing the ability to do arrow IPC? :)
    Benjamin Kay
    @benkay86
    Indeed, this would require serialization to a byte array.
    My (experimental) use case was going to be storing 3d image data (as ndarray matrices) in parquet files.
    Or simply to keep track of binary image data + columnar metadata in memory.
    Benjamin Kay
    @benkay86
    This is sort of possible by flattening the array and converting it to a list. Unfortunately, converting to a Python list is very inefficient with memory. I wonder if there's a more efficient way...
    import numpy as np
    import polars as pl
    
    # Multidimensional array
    arr = np.array([[[1,2,3], [4,5,6]], [[7,8,9], [10,11,12]]])
    
    # Make a data frame.
    df = pl.DataFrame(
        {
            "my_arrays": [arr.ravel().tolist()]
        }
    )
    
    # Get the array back out of the data frame.
    arr = np.asarray(df["my_arrays"][0]).reshape((2,2,3))
    Ritchie Vink
    @ritchie46

    Can I store arbitrary types in a polars series using Python? It seems like the Object data type should let me do this, but I cannot actually figure out how to construct a polars.Series using this type...

    If you create a Series that is not one of polars' dtypes, it will be an Object data type. This data type can be used, but cannot be used in serialization.
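    For example, a minimal sketch (the Point class is hypothetical):
    import polars as pl
    
    # A type that matches none of polars' native dtypes.
    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y
    
    # Such values fall back to the Object data type.
    s = pl.Series("points", [Point(1, 2), Point(3, 4)])
    print(s.dtype)  # Object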

    Rocky Chen
    @chenrocky

    What's the best way to "chain" together conditional filters in polars for python?

    Apologies if this is too long, but I'd like to provide an example of what I am facing:

    # example data
    d = {
        'col1': ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c", "c", "c", "c","c"],
        'col2': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
        'col3': [1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1]
    }
    
    df_pl = pl.DataFrame(data=d)

    This filter produces my expected output:

    # this returns what I expect
    df_pl.filter(
        (col("col2").is_between(0, 4))
    )

    This filter also produces my expected output:

    # this returns what I expect
    df_pl.filter(
        (col("col3") == 1)
    )

    But when I try to combine the two filters together using the and operator, I do not get my expected output:

    # this ends up returning rows where col2 is not 1, 2, or 3
    df_pl.filter(
        (
            (col("col2").is_between(0, 4)) and
            (col("col3") == 1)
        )
    )

    And, when I reverse the order of the conditions in the filter, I get a different unexpected output:

    # this ends up returning rows where col3 is not equal to 1
    df_pl.filter(
        (
            (col("col3") == 1) and
            (col("col2").is_between(0, 4))
        )
    )

    This block of code produces my expected output, but it doesn't look great imo:

    # this produces the desired output
    df_pl_filtered = df_pl.filter((col("col2").is_between(0, 4)))
    df_pl_filtered = df_pl_filtered.filter((col("col3") == 1))

    So again, what's the best way to "chain" together conditional filters in polars for python?

    And can anyone help me wrap my head around why reversing the combined conditions in the filter produces two different outputs?

    Ritchie Vink
    @ritchie46
    Try this:
    
    df_pl.filter(
        (
            (col("col2").is_between(0, 4)) &
            (col("col3") == 1)
        )
    )
    1 reply
    Python does not give an error when you use `and` on an object that does not implement any of that behavior. If anybody knows how I can throw errors on this behavior, please let me know. :)
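    To illustrate why the two orderings above differed (a sketch, assuming Expr had Python's default truthiness at the time of this discussion): `a and b` calls bool(a) and then returns one of the operands unchanged, so only a single expression ever reaches filter.
    from polars import col
    
    left = col("col2").is_between(0, 4)
    right = col("col3") == 1
    
    # bool(left) is truthy by default, so `left and right` evaluates to
    # `right`; the left-hand filter is silently dropped. Reversing the
    # order drops the other one instead.
    combined = left and right
    assert combined is right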
    Will Eaton
    @wseaton
    apparently numpy throws errors when doing elementwise comparison on matrices w/ and/or, but doesn't with bitwise & |
    may be able to borrow their logic for that
    Will Eaton
    @wseaton
    ah okay, so numpy handles this by throwing an error on bool(x) if the array has multiple elements.
    so adding some logic to __bool__ should work?
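    i.e. something like this minimal sketch (a hypothetical Expr class, mirroring numpy's ndarray.__bool__):
    class Expr:
        def __bool__(self):
            # Called by `and`, `or`, and `if`; raising here makes the
            # misuse loud instead of silently returning one operand.
            raise ValueError(
                "the truth value of an Expr is ambiguous; "
                "use the bitwise operators & or | instead of and/or"
            )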
    Ritchie Vink
    @ritchie46
    Good one @wseaton. Will use that. Thanks.
    Will Eaton
    @wseaton
    Happy to grab that one if you want :)
    1 reply
    Björn Malmgren
    @BjornM111
    Hey, I've been trying to do some really simple bitwise operations, basically just a bitwise AND on a Series, but haven't really found a way as the "&" operator is used for other things. Is there any way of doing actual bitwise operations on a Series?
    Björn Malmgren
    @BjornM111
    Oh, and just to be clear, I'm using the python api
    Ritchie Vink
    @ritchie46
    @BjornM111 You want to do a bitwise and on numeric values?
    Björn Malmgren
    @BjornM111
    Yes exactly
    Ritchie Vink
    @ritchie46
    Hmm... That makes sense. I will add support for that.
    Björn Malmgren
    @BjornM111
    Oh wonderful, thank you so much!
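    Until that lands, one possible workaround is a round trip through numpy (a sketch, assuming an integer Series; the names here are illustrative):
    import numpy as np
    import polars as pl
    
    s = pl.Series("flags", [0b1010, 0b0110, 0b1111])
    mask = 0b1100
    
    # Apply the bitwise AND element-wise in numpy, then wrap the
    # result back into a Series.
    result = pl.Series("flags_and", np.bitwise_and(s.to_numpy(), mask))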
    James Moore
    @lumost
    Hey All! quick question on the rust api - what's the correct method of dealing with nested objects within a Series? or is this not supported?
    1 reply
    Rocky Chen
    @chenrocky

    In python, what's a way to do the polars-equivalent of this?: https://stackoverflow.com/a/26721325/9564346

    I tried the following, but it doesn't seem like .rank works the same way here?:

    df[
        [
            col("*"),
    
            col("score")
            .rank()
            .over("user")
    #         .flatten()
            .alias("rank"),
        ]
    ]

    And I also tried:

    df.groupby("user").agg([pl.col("score").rank()])

    Both code blocks above give me the same error:

    PanicException: called `Option::unwrap()` on a `None` value

    (apologies for adding and deleting messages; I couldn't edit the old messages for some reason)

    6 replies
    Richard Janis Goldschmidt
    @SuperFluffy
    I am trying to call ChunkedArray::rand_normal in Rust, but I am getting `function or associated item not found in polars::prelude::ChunkedArray<_>`. Very confused why that's the case. polars is `0.16.0`.
    superfluffy
    @superfluffy:matrix.org
    [m]
    (Sorry for the messed up formatting)
    Richard Janis Goldschmidt
    @SuperFluffy
    Also, setting the type explicitly as ChunkedArray::<f64>::rand_normal yields the same result
    Nor does ChunkedArray::<Float64Type> work
    Richard Janis Goldschmidt
    @SuperFluffy
    figured it out: I need to set features = ["random"]. It's documented in polars-core, but not in polars.
    superfluffy
    @superfluffy:matrix.org
    [m]
    Next difficulty: can I construct a ChunkedArray<Float64Type> from a Vec<f64>? I actually want to initialize a chunked array with n constant values of 1.0.
    Ritchie Vink
    @ritchie46
    Something like this?
    
    (0..100).map(|_| Some(1.0)).collect::<Float64Chunked>()
    
    You can also create from slices.
    (On mobile, so cannot edit :/)
    J. Bruno Morgado
    @jbmorgado_gitlab

    Hi all.

    I am breaking my head trying to figure out how to use groupby and apply.

    Coming from Pandas, I was using:

    from scipy.stats import spearmanr
    
    def get_score(df):
        return spearmanr(df["prediction"], df["target"]).correlation
    
    correlations = df.groupby("era").apply(get_score)

    But in polars, this doesn't work.

    I tried several approaches, mainly around:

    correlations = df.groupby("era").apply(get_score)

    But they all fail and I am having trouble understanding the correct syntax.

    Any ideas?

    14 replies
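    One possible direction (a sketch, not the thread's answer; it assumes GroupBy.apply calls the function once per group's sub-DataFrame and concatenates the returned frames, a method newer polars versions expose as map_groups):
    from scipy.stats import spearmanr
    
    import polars as pl
    
    def get_score(group: pl.DataFrame) -> pl.DataFrame:
        # Compute the correlation for one era and return it as a
        # one-row frame so the results can be concatenated.
        corr = spearmanr(
            group["prediction"].to_numpy(), group["target"].to_numpy()
        ).correlation
        return pl.DataFrame({"era": [group["era"][0]], "score": [corr]})
    
    correlations = df.groupby("era").apply(get_score)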
    Lissa Hyacinth
    @lissahyacinth

    Hey all - I'm aware this isn't a direct use case of Polars, but as this is the only package I'm aware of that does PyArrow -> Rust in batches, this seems like the place that'd know.

    from polars.polars import PyDataFrame
    import pyarrow as pa
    import pandas as pd
    
    
    if __name__ == "__main__":
        a = pd.DataFrame.from_dict({'a': [1,2,3]})
        df = PyDataFrame.from_arrow_record_batches(
            pa.Table.from_pandas(a).to_batches(max_chunksize=1)
        )

    This will reliably crash with the error `thread '<unnamed>' panicked at 'the offset of the new Buffer cannot exceed the existing length', /github/home/.cargo/git/checkouts/arrow2-8a2ad61d97265680/0b37568/src/buffer/immutable.rs:99:9`, and I cannot figure out why. Polars avoids the issue by only allowing an entire PyArrow Table, but it seems other batch sizes should be viable too.

    3 replies
    Gert Hulselmans
    @ghuls
    @ritchie46 is there a way to get the length of a column inside an expression? I only managed to get the height from a materialized dataframe or the length of a materialized column, but this breaks chaining of commands.