    Sutou Kouhei
    @kou
    If you want to work on some of issues you created, I can help you. Please let me know.
    Konstantin Ilchenko
    @simpl1g
    Is it okay to use Arrow::Table as a dataframe tool? Or is it too low level, so that something on top of it with some syntactic sugar should be used?
    Sutou Kouhei
    @kou
    We will be able to use Arrow::Table or Arrow::DataFrame (not created yet) as a convenient dataframe tool in the future.
    But it's not convenient enough yet.
    Konstantin Ilchenko
    @simpl1g
    Is Arrow::DataFrame going to be something global for the Arrow ecosystem, and will it use C bindings?
    Also, do you have an estimate of when red-arrow 6.0 is going to be released?
    Konstantin Ilchenko
    @simpl1g
    The rover dataframe gem uses Numo::NArray internally to store columns. We can convert it to a tensor via the red-arrow-numo-narray gem, but is it possible to convert it to Arrow::Array or Arrow::Column?
    Kenta Murata
    @mrkn
    I guess rover's numeric vectors are partially convertible to Arrow arrays. From my understanding, rover's vectors don't manage bit vectors that locate missing values. This means rover's numeric vectors can be converted to Arrow arrays, but Arrow arrays with missing values cannot be converted to rover's vectors.
    I found that a rover integer vector cannot have missing values, so it cannot be made from an Arrow integer array with missing values.
    Konstantin Ilchenko
    @simpl1g

    @mrkn Yes, you're right, but currently I'm looking for an efficient way to convert from rover to Arrow in order to write Arrow/Parquet files.

    In the future, I guess Arrow::Table/Arrow::DataFrame should replace the rover/daru gems.

    Konstantin Ilchenko
    @simpl1g
    I added some examples of how to use the red-arrow gem: apache/arrow#11584
    Maybe you have tips or suggestions on what to improve or add.
    Sutou Kouhei
    @kou
    Apache Arrow C++ will implement a dataframe API.
    Design document: https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
    Arrow::DataFrame will be the bindings of it.
    The red-arrow 6.0.0's blocker is Homebrew.
    We need to update https://github.com/Homebrew/homebrew-core/blob/master/Formula/apache-arrow-glib.rb to 6.0.0 to release red-arrow 6.0.0.
    But nobody is working on this for now.
    Here is the instruction to update for this: https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingHomebrewpackages
    Konstantin Ilchenko
    @simpl1g
    Looks like Homebrew already has a pull request to update apache-arrow.
    Sutou Kouhei
    @kou
    I asked my colleague for it yesterday. But it's not yet completed.
    Homebrew/homebrew-core#88585
    dsisnero
    @dsisnero
    Another Rust dataframe library that uses Arrow, with bindings to Python: https://github.com/pola-rs/polars
    Mathieu Leduc-Hamel
    @mlhamel
    Has anyone found a way to query multiple Parquet files using red-parquet? The examples are minimal and I didn't see any way of doing it...
    Or even with red-arrow.
    Sutou Kouhei
    @kou
    Could you provide sample Parquet files and sample condition(s)?
    Mathieu Leduc-Hamel
    @mlhamel
    Ideally I would just use something like:
    Arrow::Table.load("/dev/shm/*.parquet")
    Sutou Kouhei
    @kou
    Then the following will work:
    require "arrow-dataset"
    Arrow::Table.load("file:///dev/shm/", format: :parquet)
    Mathieu Leduc-Hamel
    @mlhamel
    Thanks @kou that's exactly what i was searching for :raised_hands:
    Just out of curiosity, do you know if other types of filesystem can be used directly for loading files? Something like "gs://../" to use an S3 bucket or even GCS?
    Sutou Kouhei
    @kou
    You can use s3://.. for S3.
    GCS support will be available in Red Arrow 7.0.0.
    Konstantin Ilchenko
    @simpl1g
    I guess it should be
    Arrow::Table.load(URI("file:///dev/shm/"), format: :parquet)
    Sutou Kouhei
    @kou
    Ah, sorry. @simpl1g's version is correct.
    It may be better if we add support for Arrow::Table.load("file:///...") and Arrow::Table.load("/dev/shm/").
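
A minimal sketch of the URI form confirmed above. `dataset_uri` is a hypothetical helper (not part of red-arrow) that turns a local directory path into a `file://` URI object; it assumes the path contains no characters that need percent-escaping.

```ruby
require "uri"

# Hypothetical helper, not a red-arrow API: build a file:// URI object for a
# local directory. Assumes the path needs no percent-escaping (no spaces etc.).
def dataset_uri(directory)
  absolute = File.expand_path(directory) # absolute path without a trailing "/"
  # chomp handles the root directory ("/") so the result is "file:///".
  URI("file://#{absolute.chomp("/")}/")
end

puts dataset_uri("/dev/shm") # prints file:///dev/shm/
```

With `require "arrow-dataset"` in place, the result would be passed as `Arrow::Table.load(dataset_uri("/dev/shm"), format: :parquet)`, which matches the call confirmed in the messages above.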
    Konstantin Ilchenko
    @simpl1g

    Maybe it would be more interesting to support patterns

    Arrow::Table.load("/path/to/folder/part*.parquet", format: :parquet)

    Or even better S3 with patterns, we use this a lot in ClickHouse
    https://clickhouse.com/docs/en/engines/table-engines/integrations/s3/#wildcards-in-path
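
Until something like this is supported, local patterns can be expanded on the Ruby side. This is only a sketch: `load_each_matching` is a hypothetical helper (not a red-arrow API), it covers local paths only, and the loader is supplied by the caller as a block.

```ruby
# Hypothetical helper, not a red-arrow API: expand a glob pattern on the local
# filesystem and pass each matching path to the given loader block.
# Paths are sorted so the load order is deterministic.
def load_each_matching(pattern)
  Dir.glob(pattern).sort.map { |path| yield path }
end
```

For example, `load_each_matching("/path/to/folder/part*.parquet") { |path| Arrow::Table.load(path) }` should give one table per matching file, assuming red-parquet is loaded. Combining the tables and S3 wildcards are not covered here.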

    Nabil Servais
    @blackrez
    Hello, I would like to use DuckDB (with Arrow bindings) in a Rails application. I'm not an expert in Rails, but I used it a long time ago. Does anyone have advice for the integration?
    Konstantin Ilchenko
    @simpl1g

    @kou Hi, is there a way to list all available Compute Functions in red-arrow?

    And maybe you know how to get the value of the min_max function?

    f = Arrow::Function.find('min_max')
    f.execute([table['revenue'].data]).value
    #<Arrow::StructScalar:0x11fcc8df8 ptr=0x7ffd0ea762b0 {min:double = 0, max:double = 257.515735}>
    f.execute([table['revenue'].data]).value.value
    # []
    Konstantin Ilchenko
    @simpl1g
    And is it possible to pass a precision to the round function? I'm not sure how to pass RoundOptions.
    Benson Muite
    @bkmgit_gitlab
    It's not clear what you mean by precision. For C++, the available round options are listed at https://arrow.apache.org/docs/cpp/compute.html#rounding-functions
    Konstantin Ilchenko
    @simpl1g
    I mean how to pass ndigits option, from docs:
    Round to a number of digits where the ndigits option of RoundOptions specifies the rounding precision in terms of number of digits
    Currently when I apply Arrow::Function.find('round') it converts float to int, and I want to round 1.1111 to 1.11, so I need to pass ndigits = 2
    Sutou Kouhei
    @kou
    Ah, we haven't implemented listing all available compute functions yet. It's easy to implement.
    Could you create a JIRA issue?
    We need to implement the bindings of arrow::compute::RoundOptions. It's also easy to implement.
    Could you also create a JIRA issue for it?
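
Until RoundOptions bindings exist, one possible workaround is to round on the Ruby side. This is a sketch under the assumption that the column values have already been converted to a plain Ruby array, which copies the data out of Arrow; `round_values` is a hypothetical helper, not a red-arrow API.

```ruby
# Hypothetical workaround helper, not a red-arrow API: round plain Ruby numbers
# to `ndigits` decimal places, keeping missing values (nil) as nil.
def round_values(values, ndigits)
  values.map { |value| value&.round(ndigits) }
end

p round_values([1.1111, 2.3456, nil], 2) # prints [1.11, 2.35, nil]
```

This gives the 1.1111 to 1.11 behavior asked about above, but it leaves Arrow's compute layer, so it is likely slower than a native `round` call with `ndigits` would be.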
    Sutou Kouhei
    @kou
    f.execute([table['revenue'].data]).value.value should work.
    It's a bug in the current implementation...
    Could you create a JIRA issue for this too...?
    Thanks!
    Konstantin Ilchenko
    @simpl1g

    I'm currently trying to load many CSV files from a folder, and it fails if the type was not detected correctly in one file.
    A simplified example (from real-world data) would be:
    test/1.csv

    a,b
    1,0
    2,0

    test/2.csv

    a,b
    3,0
    4,0.99
    Arrow::Table.load(URI('file:///.../test/'), format: :csv)
    # [scanner][to-table]: Invalid: Could not open CSV input source 
    # Invalid: In CSV column #1: Row #3: CSV conversion error to int64: invalid value '0.99'

    Loading them separately and doing concatenate + unify_schemas also doesn't work

    table1.concatenate(table2, unify_schemas: true)
    # Invalid: Unable to merge: Field b has incompatible types: int64 vs double

    Are there any options to concatenate such files?

    We can do something like this for casting, but I'm not sure how to efficiently build new tables with the updated column:
    options = Arrow::CastOptions.new
    options.to_data_type = Arrow::DoubleDataType.new
    f = Arrow::Function.find('cast')
    f.execute([table1['b'].data], options)
    Sutou Kouhei
    @kou
    In C++, we can specify the schema explicitly. For this case, we can specify Arrow::Schema.new(a: :int64, b: :double).
    But there aren't bindings for it yet.
    Could you create a JIRA issue for this?
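
While an explicit schema cannot be passed yet, the files can be pre-scanned in plain Ruby to find out which columns need a wider type. This is a sketch using only the stdlib `csv` library; `cell_type` and `unified_column_types` are hypothetical helpers, and the three-type ranking (int64, double, string) is a simplifying assumption that ignores integer overflow, exponents, and other formats.

```ruby
require "csv"

# Rank of each candidate type; a column takes the widest type seen in any file.
TYPE_RANK = { int64: 0, double: 1, string: 2 }.freeze

# Hypothetical helper: classify a single CSV cell by its text.
def cell_type(value)
  return :int64 if value.match?(/\A-?\d+\z/)
  return :double if value.match?(/\A-?\d+\.\d+\z/)

  :string
end

# Hypothetical helper: scan every file and return column name => widest type.
def unified_column_types(paths)
  types = {}
  paths.each do |path|
    CSV.foreach(path, headers: true) do |row|
      row.each do |name, value|
        next if value.nil? # empty cells do not influence the type

        candidate = cell_type(value)
        current = types[name]
        types[name] = candidate if current.nil? || TYPE_RANK[candidate] > TYPE_RANK[current]
      end
    end
  end
  types
end
```

For test/1.csv and test/2.csv above, this reports column `a` as `:int64` and column `b` as `:double`, which identifies `b` as the column that needs the cast before the tables are concatenated.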
    dsisnero
    @dsisnero
    Does Ruby Arrow or GLib have a way to pass an Arrow array from Ruby to Python with zero copy using the C data interface? https://medium.com/@niklas.molin/0-copy-you-pyarrow-array-to-rust-23b138cb5bf2
    Sutou Kouhei
    @kou
    We can do it with https://github.com/red-data-tools/red-arrow-pycall but it doesn't use the C data interface. It uses the raw Python-related API in Apache Arrow C++.
    dsisnero
    @dsisnero
    What I'm more interested in is whether we can round-trip between different languages (Java, Rust, R, or Python) with zero copy using the C data interface.
    Sutou Kouhei
    @kou
    Red Arrow supports the C data interface.
    https://github.com/red-data-tools/red-arrow-duckdb uses it to round-trip from C++ without copying.
    We need bindings to use the C data interface with Java, Rust, R, Python and so on, because the C data interface works only within the same process.
    PyCall is a Python bindings library.
    Do you have any use case for C data interface?
    Wei Shi
    @weishi029
    Hi guys, thanks for your amazing work on red-arrow. We are considering using red-arrow to generate Parquet files in our application. Is there any example of generating Parquet from a SQL result? I tried a few ways, but they don't seem very efficient, and none of them can match loading CSV.
    Sutou Kouhei
    @kou
    How do you get SQL result? Active Record?
    Wei Shi
    @weishi029
    It is raw SQL with a few joins that returns Mysql::Result. We want to use streaming to keep the memory footprint low, as we can easily have more than 10k records.
    Currently we generate a CSV, use Arrow::CSVLoader to load it, and then convert it to Parquet.
    Sutou Kouhei
    @kou
    OK.
    https://github.com/red-data-tools/red-arrow-activerecord/blob/master/lib/arrow-activerecord/arrowable.rb may help you.
    It uses Active Record but you can use similar code for Mysql::Result.
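
One way to keep memory bounded is to consume the result in fixed-size slices and turn each slice into columns. This is only a sketch of the batching step in plain Ruby: `each_column_batch` is a hypothetical helper (not taken from red-arrow-activerecord), and it assumes the result object is Enumerable and yields one Array per row in column order.

```ruby
# Hypothetical helper, not a red-arrow API: consume rows in slices and yield
# each slice as a hash of column name => array of values.
# Assumes `rows` is Enumerable and yields one Array per row, in column order.
def each_column_batch(rows, column_names, batch_size: 1_000)
  rows.each_slice(batch_size) do |slice|
    columns = column_names.each_with_index.map do |name, index|
      [name, slice.map { |row| row[index] }]
    end
    yield columns.to_h
  end
end

# Tiny in-memory stand-in for a SQL result, split into batches of two rows.
rows = [[1, "a"], [2, "b"], [3, "c"]]
each_column_batch(rows, %w[id name], batch_size: 2) { |batch| p batch }
```

Only one slice is held at a time, so the footprint depends on `batch_size` rather than on the total row count. Handing each batch to red-arrow and appending it to a Parquet file is left out here.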