    Rick Cobb
    @cobbr2
    parquet-tools gives me Could not read footer: java.lang.RuntimeException: file:/tmp/myquicktest.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [82, 79, 87, 49]
    Sutou Kouhei
    @kou
    It writes an Arrow file.
    It's not a Parquet file.
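The byte values in the error message decode to ASCII: [80, 65, 82, 49] is "PAR1", the magic that ends a Parquet file, and [82, 79, 87, 49] is "ROW1", the last four bytes of the "ARROW1" magic that ends an Arrow file. A minimal plain-Ruby sketch of that check follows; the helper name `sniff_tail` is hypothetical and it uses no Arrow or Parquet library calls.

```ruby
# Byte values copied from the parquet-tools error message above.
PARQUET_MAGIC   = [80, 65, 82, 49].pack("C*") # "PAR1"
ARROW_FILE_TAIL = [82, 79, 87, 49].pack("C*") # "ROW1", the tail of "ARROW1"

# Hypothetical helper: classify a file by its last four bytes.
# Returns :parquet, :arrow, or :unknown.
def sniff_tail(path)
  tail = File.open(path, "rb") do |f|
    return :unknown if f.size < 4
    f.seek(-4, IO::SEEK_END)
    f.read(4)
  end
  case tail
  when PARQUET_MAGIC   then :parquet
  when ARROW_FILE_TAIL then :arrow
  else :unknown
  end
end
```

Calling `sniff_tail` on a file with a ".parquet" name that was actually written in the Arrow file format would return :arrow, which matches the mismatch parquet-tools reported.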
    Rick Cobb
    @cobbr2
    Yes, that's correct. How do I write a Parquet file?
    I presume there's some other output stream I should be using, but I'm having trouble discovering it; maybe it's a different writer?
    Found it, I think. Will post update if it works.
    Rick Cobb
    @cobbr2
    Well, I found the Parquet::ArrowFileWriter class, of course... but it doesn't support write_record_batch, only write_table. Which suggests I have to get my whole table into RAM at once in order to write it out.
    We've tried to avoid doing that by using CSV as an intermediate step, but that's proven way too fragile (we've even segfaulted Ruby 2.5.1 with those attempts).
    Sutou Kouhei
    @kou
    require "parquet"
    
    data_output_path = '/tmp/myquicktest.parquet'
    
    bow = {
      'wackamole': [[ 27, 2.3 ]],
      'quakamole': [[ 23, 3.5 ], [ 19, 5.7 ]]
    }
    schema = Arrow::Schema.new(term_id: :uint32,
                               score: :double)
    record_batches = bow.collect do |record_id, records|
      Arrow::RecordBatch.new(schema, records)
    end
    table = Arrow::Table.new(schema, record_batches)
    table.save(data_output_path)
    Ah, you don't want to create an Arrow::Table.
    Sutou Kouhei
    @kou
    require "parquet"
    
    data_output_path = '/tmp/myquicktest.parquet'
    
    bow = {
      'wackamole': [[ 27, 2.3 ]],
      'quakamole': [[ 23, 3.5 ], [ 19, 5.7 ]]
    }
    schema = Arrow::Schema.new(term_id: :uint32,
                               score: :double)
    Arrow::FileOutputStream.open(data_output_path, false) do |output|
      Parquet::ArrowFileWriter.open(schema, output) do |writer|
        bow.each do |record_id, records|
          record_batch = Arrow::RecordBatch.new(schema, records)
          writer.write_table(record_batch.to_table, 1024)
        end
      end
    end
    Rick Cobb
    @cobbr2
    Thank you! That's exactly what I needed.
    Didn't even occur to me that RecordBatch would support #to_table.
    Oscar Luza
    @NashL
    Hey, do you know where I can start learning? I’m a beginner in data science.
    Kenta Murata
    @mrkn
    @NashL What are you interested in?
    Oscar Luza
    @NashL
    Machine learning
    Kenta Murata
    @mrkn
    Rumale is a scikit-learn-like framework for Ruby.
    Ankane's blog is good! https://ankane.org/
    Oscar Luza
    @NashL
    Thank you so much
    Phil
    @QuakePhil

    hey folks! looking to generate a parquet file from thin air (rather than load an existing one as per https://github.com/apache/arrow/blob/master/ruby/red-parquet/README.md#usage )

    I know I can do something like this in python... https://gist.github.com/QuakePhil/5506f18f35b6cda4b966187e90f31386

    Can someone please point me to some documentation/clues how to do something like that in ruby?

    the code from above Nov 07 seems helpful.... would like any other tips though
    Kenta Murata
    @mrkn
    @QuakePhil You can create a new parquet file from Arrow::Table. See this test file https://github.com/apache/arrow/blob/master/ruby/red-parquet/test/test-arrow-table.rb
    By passing a filename with a “.parquet” extension to Arrow::Table#save, you can save an Arrow::Table object as a Parquet file.
    Phil
    @QuakePhil

    are there any ruby examples on this sort of stuff beyond the basic "load from file, then save to file" example in the README?

    what about a "load from array" or "load from array of arrays" or "load from object" etc?

    There aren't others.
    "load from array": Arrow::Table.new("column" => [1, 2, 3]).save("xxx.parquet")
    Phil
    @QuakePhil
    ok let me play around with that! thanks
    Sutou Kouhei
    @kou
    "load from array of arrays": Array::Table.new({"column" => :int32}, [[1, 2,3]]).save("xxx.parquet")
    "load from object": Nothing. What object do you use? An object of your own class or a built-in class?
    Phil
    @QuakePhil
    I was just throwing ideas out there... I suppose an object could be the array of arrays zipped into the schema
    Sutou Kouhei
    @kou
    OK. It may be better that we have #to_arrow protocol for this case.
    For example, https://github.com/red-data-tools/red-arrow-numo-narray uses #to_arrow for converting Numo::NArray objects to Apache Arrow objects.
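The "load from object" idea can be sketched in plain Ruby by flattening objects into the schema hash and the array of rows that the Arrow::Table.new(schema, rows) form in this thread accepts. This is only an illustration under assumptions: the `Term` struct and the `table_args` helper are hypothetical, and this is not the #to_arrow protocol itself.

```ruby
# Hypothetical record class, for illustration only.
Term = Struct.new(:term_id, :score)

# Hypothetical helper: turn a list of Struct objects into [schema, rows].
# `types` maps each member name to an Arrow type symbol chosen by the caller.
def table_args(objects, types)
  schema = types.transform_keys(&:to_s)
  rows = objects.map { |object| types.keys.map { |name| object[name] } }
  [schema, rows]
end

terms = [Term.new(27, 2.3), Term.new(23, 3.5)]
schema, rows = table_args(terms, term_id: :uint32, score: :double)
# schema is {"term_id" => :uint32, "score" => :double}
# rows is [[27, 2.3], [23, 3.5]]

# With red-arrow and red-parquet loaded, the result could then be passed on:
#   Arrow::Table.new(schema, rows).save("xxx.parquet")
```

Each row has exactly one value per schema column, which is the shape the array-of-arrays form expects.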
    Phil
    @QuakePhil

    "load from array of arrays": Array::Table.new({"column" => :int32}, [[1, 2,3]]).save("xxx.parquet")

    just got around to trying this, getting an error:

    ~/code/brew/lib/ruby/gems/2.5.0/gems/red-arrow-0.15.1/lib/arrow/record-batch-builder.rb:83:in `block (2 levels) in append_records': undefined method `<<' for nil:NilClass (NoMethodError)

    Phil
    @QuakePhil
    got this to work though
    Arrow::Table.new({"a" => :int32, "b" => :string}, [[1,'x'],[2,'y'],[3,'z']]).save("xxx.parquet")
    Sutou Kouhei
    @kou
    Ah, sorry.
    Phil
    @QuakePhil
    It's all good, I'm just new to this ruby parquet/arrow stuff, your help is appreciated 8)
    Phil
    @QuakePhil
    but that's why it would be nice to have these sorts of examples (not just load(file).save(file)) written down somewhere on the official page... it must be there somewhere, I'm probably just not finding it
    I have a hunch that most people don't read the documentation or the lexical grammar of a software package, and instead look for example use cases first
    Sutou Kouhei
    @kou
    I agree with it.
    We need more tutorial-like documentation.
    We have the location to place it: https://github.com/apache/arrow/tree/master/ruby/red-arrow/doc/text
    Phil
    @QuakePhil
    once I've loaded a parquet file with Arrow::Table.load how do I iterate it? I tried each, iteritems, to_pandas, to no avail
    Kenta Murata
    @mrkn
    You can convert an Arrow::Table to Python’s pyarrow.Table by to_python provided by red-arrow-pycall https://github.com/red-data-tools/red-arrow-pycall
    Or, using Arrow::Table#each_record_batch and then using Arrow::RecordBatch#each, you can iterate records in a table.
    Sutou Kouhei
    @kou
    What do you want to do by iterating it?
    If Apache Arrow provides the operation you want, you should use that operation. For example, Apache Arrow provides sum: https://github.com/apache/arrow/blob/master/c_glib/test/test-int32-array.rb#L56
    If you need to iterate all records, table.raw_records.each will be the fastest way.
    https://diary.kitaitimakoto.net/2019/12/21.html
    This Japanese article may be helpful for sum case.
    Phil
    @QuakePhil
    Thank you once again 8)