    Shirshanka Das
    @shirshanka
    @treff7es : do you use Parquet in your setup?
    Shirshanka Das
    @shirshanka
    @pritamsarkar86 : seems like Parquet has support for writing protobuf natively… if you use the ProtoParquetWriter (https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoParquetWriter.java)
    so one option might be to create a separate ProtoParquetWriter-based Gobblin writer (modeled on the ParquetDataWriter) or, better … figure out how to refactor the existing ParquetDataWriter (and Builder) to support this option
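    For reference, a minimal sketch of writing protobuf records through ProtoParquetWriter; MyEvent here is a hypothetical protobuf-generated class standing in for your event type, not something from Gobblin or parquet-mr:

    ```
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.proto.ProtoParquetWriter;

    public class ProtoWriteSketch {
      public static void main(String[] args) throws Exception {
        // Write a couple of protobuf messages straight to a Parquet file.
        // MyEvent is a hypothetical protobuf-generated message class.
        try (ProtoParquetWriter<MyEvent> writer =
            new ProtoParquetWriter<>(new Path("/tmp/events.parquet"), MyEvent.class)) {
          writer.write(MyEvent.newBuilder().setName("click").build());
          writer.write(MyEvent.newBuilder().setName("view").build());
        }
      }
    }
    ```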
    Shirshanka Das
    @shirshanka
    @pritamsarkar86 : looking specifically at your error in the converter, it seems like you might be importing two versions of the parquet GroupType class, one from org.apache.parquet.schema and one from parquet.schema … fixing the imports might fix your error
    Pritam Sarkar
    @pritamsarkar86
    Hi @shirshanka , that is right. Here is the problem: ParquetDataWriterBuilder, which comes as part of gobblin-parquet-0.14.0.jar, depends on MessageType and Group referenced from the parquet.example.data package. Now, for a protobuf schema to be converted into a Parquet schema, the helper classes are part of parquet-mr as you mentioned above, and they reference MessageType and Group from org.apache.parquet.example.data
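    To make the mismatch concrete, a hedged sketch of the parquet-mr helper side (MyEvent is again a hypothetical protobuf class): the converter hands back types from the relocated org.apache.parquet packages, while gobblin-parquet-0.14.0 is compiled against the old parquet.* packages.

    ```
    import org.apache.parquet.proto.ProtoSchemaConverter;
    import org.apache.parquet.schema.MessageType;

    public class SchemaMismatchSketch {
      public static void main(String[] args) {
        // Returns org.apache.parquet.schema.MessageType, not the pre-relocation
        // parquet.schema.MessageType that gobblin-parquet-0.14.0 expects.
        MessageType parquetSchema = new ProtoSchemaConverter().convert(MyEvent.class);
        System.out.println(parquetSchema);
      }
    }
    ```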
    Shirshanka Das
    @shirshanka
    so maybe bumping up the version of parquet and fixing the imports on the gobblin side is the right answer?
    Pritam Sarkar
    @pritamsarkar86
    yes
    Shirshanka Das
    @shirshanka
    what version of parquet should we be using
    Pritam Sarkar
    @pritamsarkar86
    If you are asking about the jar dependencies, I have not seen any issues from 1.7.0 all the way to 1.10.1. For the Parquet format version, I think we will have to support both v1 and v2
    Maybe I misunderstood the question
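    (For context, "v1 and v2" refers to the Parquet file-format writer versions. A hedged sketch of how parquet-mr 1.10.x lets you choose one; ExampleParquetWriter and the inline schema are illustrative, not Gobblin code:)

    ```
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.column.ParquetProperties.WriterVersion;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.example.data.simple.SimpleGroup;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class WriterVersionSketch {
      public static void main(String[] args) throws Exception {
        MessageType schema = MessageTypeParser.parseMessageType(
            "message event { required binary name (UTF8); }");
        // Pick the file-format writer version when building the writer.
        try (ParquetWriter<Group> writer = ExampleParquetWriter
            .builder(new Path("/tmp/v2.parquet"))
            .withType(schema)
            .withWriterVersion(WriterVersion.PARQUET_2_0) // or WriterVersion.PARQUET_1_0
            .build()) {
          Group g = new SimpleGroup(schema);
          g.add("name", "click");
          writer.write(g);
        }
      }
    }
    ```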
    Shirshanka Das
    @shirshanka
    yeah asking about jar dependencies
    if we upgrade the parquet version, would like to upgrade to something the community is using
    since we don’t use parquet at LinkedIn, I don’t know the appropriate version to use
    Pritam Sarkar
    @pritamsarkar86
    I see 1.10.1 having good usage
    Shirshanka Das
    @shirshanka
    ok
    we’ll need at least a couple of days to make the changes
    Tamas Nemeth
    @treff7es
    No, we don’t. :(
    Pritam Sarkar
    @pritamsarkar86
    @shirshanka that should be fine. Thank you.
    Tamas Nemeth
    @treff7es
    @shirshanka We never really got to even the ORC format, but nowadays I keep an eye on Iceberg, Hudi, and Delta Lake as storage formats; it would be cool to have some integration in Gobblin. Unfortunately our priorities are in other places, and fortunately Gobblin/ingestion is working fine. :)
    Shirshanka Das
    @shirshanka
    interesting, we recently rolled out native ORC in Gobblin (ditching the hive serde)
    and we are working on Iceberg integration currently (first for metadata)
    Tamas Nemeth
    @treff7es
    wow, I need to catch up on what is happening there, I wish I had more time. :)
    Shirshanka Das
    @shirshanka
    We’re hoping to write a blog soon on recent developments
    kchando
    @kchando
    @shirshanka : Curious to know about the native ORC in Gobblin. In the current Gobblin 0.14, is HiveSerdeConverter still the only way to convert to ORC? When you said "rolled out native ORC in Gobblin", did you have a new release version for that? Or would it be part of the next release, 0.15?
    Shirshanka Das
    @shirshanka
    @tilakpatidar : If I change the parquet writer to org.apache.parquet based types, any existing pipelines using the old parquet types will break. Do you have this enabled in production?

    @kchando : could be part of next release 0.15. We need to see how easy it is to contribute it back before we build the next release.

    Shirshanka Das
    @shirshanka
    @pritamsarkar86 : it seems like you can work around your issue if you depend on com.twitter:parquet-hadoop-bundle:1.5.0 in your code
    is that something you can do?
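    (In Gradle terms, the workaround being suggested would look roughly like this; the configuration name is an assumption about your build setup:)

    ```
    dependencies {
        // Pre-relocation bundle that still ships classes under the old parquet.* packages
        implementation 'com.twitter:parquet-hadoop-bundle:1.5.0'
    }
    ```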
    Pritam Sarkar
    @pritamsarkar86
    Hi @shirshanka , I tried that. Once the package issues are resolved, ParquetDataWriterBuilder crashes in CloseOnFlushWriterWrapper while executing this.writer = writerSupplier.get();. I could not get much deeper into it after that. My observation is that the Parquet writer flow needs more testing.
    Shirshanka Das
    @shirshanka
    Ok, are you building gobblin from source?
    or pulling from maven repo?
    Pritam Sarkar
    @pritamsarkar86
    Pulling 0.14.0 from the Maven repo
    Shirshanka Das
    @shirshanka
    Can you paste your job pull / conf file here?
    Pritam Sarkar
    @pritamsarkar86
    Config is very similar to what is described in the doc. Below is a snippet:

    ```
    #### JOB
    job.name=pull_event_logs
    job.group=logpullg
    job.description=Gobblin job to pull event logs from Kafka
    task.maxretries=0
    mr.job.max.mappers=2
    mapreduce.map.memory.mb=4096

    #### SOURCE
    source.class=com.dea.roku.data.consumers.fork.KafkaForkSource
    topic.whitelist=events
    bootstrap.with.offset=latest
    kafka.brokers=${env:KAFKA_BROKERS}

    #### EXTRACT
    fork.branches=1
    fork.operator.class=com.dea.roku.data.consumers.fork.events.DataForkOperator
    extract.namespace=org.apache.gobblin.extract.kafka
    extract.limit.enabled=true
    extract.limit.type=count
    extract.limit.count.limit=20000000

    #### WRITER
    writer.file.path.type=tablename
    writer.destination.type=HDFS
    writer.output.format=PARQUET
    writer.partitioner.class=com.dea.roku.data.consumers.writer.partitioner.events.FactPartitioner

    #### PUBLISHER
    data.publisher.type=org.apache.gobblin.publisher.TimePartitionedDataPublisher
    data.publisher.replace.final.dir=false

    #### Converter, writer, publisher configs per fork
    converter.classes.0=com.dea.roku.data.consumers.events.converter.EventConverter
    writer.partitioner.class.0=com.dea.roku.data.consumers.writer.partitioner.events.FactPartitioner
    writer.output.format.0=PARQUET
    writer.file.path.0=fact_events
    writer.fs.uri.0=${env:HADOOP_FS_URI}
    writer.builder.class.0=org.apache.gobblin.writer.ParquetDataWriterBuilder
    writer.staging.dir.0=/gobblin/events/task-staging
    writer.output.dir.0=/gobblin/events/task-output
    data.publisher.fs.uri.0=${env:S3_BUCKET_URI}
    data.publisher.final.dir.0=${env:ROOT_S3_FACT_FOLDER}
    ```

    EventConverter takes the protobuf events and converts them into ParquetGroup records to be written by ParquetDataWriterBuilder
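    A rough sketch of what such a converter could look like against the Gobblin 0.14.0 API; MyEvent, its fields, and the hard-coded schema are illustrative assumptions, not the actual EventConverter, and parquet's SimpleGroup stands in for Gobblin's ParquetGroup (both extend the same parquet.example.data.Group):

    ```
    import org.apache.gobblin.configuration.WorkUnitState;
    import org.apache.gobblin.converter.Converter;
    import org.apache.gobblin.converter.DataConversionException;
    import org.apache.gobblin.converter.SchemaConversionException;
    import org.apache.gobblin.converter.SingleRecordIterable;
    import parquet.example.data.Group;
    import parquet.example.data.simple.SimpleGroup;
    import parquet.schema.MessageType;
    import parquet.schema.MessageTypeParser;

    public class ProtoToParquetGroupConverter extends Converter<String, MessageType, MyEvent, Group> {

      @Override
      public MessageType convertSchema(String inputSchema, WorkUnitState workUnit)
          throws SchemaConversionException {
        // Illustrative fixed schema; a real converter would derive it from the proto descriptor.
        return MessageTypeParser.parseMessageType(
            "message event { required binary name (UTF8); required int64 ts; }");
      }

      @Override
      public Iterable<Group> convertRecord(MessageType schema, MyEvent record, WorkUnitState workUnit)
          throws DataConversionException {
        // Copy the hypothetical proto fields into a Group the Parquet writer can consume.
        Group group = new SimpleGroup(schema);
        group.add("name", record.getName());
        group.add("ts", record.getTs());
        return new SingleRecordIterable<>(group);
      }
    }
    ```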
    Shirshanka Das
    @shirshanka
    and what is the stack trace you get?
    Pritam Sarkar
    @pritamsarkar86
    Oops! Looks like my logs have been cleared. I have looked through the code a couple of times, so let me write down the stack.
    PartitionedDataWriter (106) => CloseOnFlushWriterWrapper (71) => this.writer = writerSupplier.get();
    Shirshanka Das
    @shirshanka
    and what is the error you get on this?
    is it an NPE? or something else?
    Shirshanka Das
    @shirshanka
    @pritamsarkar86 : I’m not able to reproduce your issue. Getting a real stack trace would help.
    softkumsh
    @softkumsh
    @shirshanka I am getting the below error; I am running on a Windows machine
    [screenshot attachment: image.png]
    Shirshanka Das
    @shirshanka
    @softkumsh are you building from the top-level directory using ./gradlew ?
    Pritam Sarkar
    @pritamsarkar86
    @shirshanka will send the stack by tomorrow for the parquet issue. Thanks.
    Shirshanka Das
    @shirshanka
    thanks @pritamsarkar86, on my local build, I’m able to generate parquet files from avro etc … not seeing the issue you described
    Pritam Sarkar
    @pritamsarkar86
    Please post your config
    Shirshanka Das
    @shirshanka
    I have some uncommitted changes related to a test source to generate in-memory json, so the source config won’t work for you