    Shirshanka Das
    @shirshanka
    what version of parquet should we be using
    Pritam Sarkar
    @pritamsarkar86
    If you are asking about the jar dependencies, I have not seen any issues from 1.7.0 all the way to 1.10.1. For the parquet format version, I think we will have to support both v1 and v2.
    Maybe I misunderstood the question
    Shirshanka Das
    @shirshanka
    yeah asking about jar dependencies
    if we upgrade the parquet version, would like to upgrade to something the community is using
    since we don’t use parquet at LinkedIn, I don’t know the appropriate version to use
    Pritam Sarkar
    @pritamsarkar86
    I see 1.10.1 having good usage
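(A sketch of what pinning to that version could look like in a Gradle build; the module list below is illustrative, not Gobblin's actual dependency set.)

```groovy
// Hypothetical pin to the 1.10.1 line discussed above, using the
// org.apache.parquet artifacts; adjust modules to whatever the writer pulls in.
dependencies {
    compile 'org.apache.parquet:parquet-hadoop:1.10.1'
    compile 'org.apache.parquet:parquet-column:1.10.1'
}
```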
    Shirshanka Das
    @shirshanka
    ok
    we’ll need at least a couple of days to make the changes
    Tamas Nemeth
    @treff7es
    No, we don’t. :(
    Pritam Sarkar
    @pritamsarkar86
    @shirshanka that should be fine. Thank you.
    Tamas Nemeth
    @treff7es
    @shirshanka We never really got even to the ORC format, but nowadays I keep an eye on Iceberg, Hudi, and Delta Lake as storage formats; it would be cool to have some integration in Gobblin. Unfortunately our priorities are elsewhere, and fortunately Gobblin/ingestion is working fine. :)
    Shirshanka Das
    @shirshanka
    interesting, we recently rolled out native ORC in Gobblin (ditching the hive serde)
    and we are working on Iceberg integration currently (first for metadata)
    Tamas Nemeth
    @treff7es
    wow, I need to catch up on what is happening there, I wish I had more time. :)
    Shirshanka Das
    @shirshanka
    We’re hoping to write a blog soon on recent developments
    kchando
    @kchando
    @shirshanka : Curious to know about the native ORC in Gobblin. In the current Gobblin 0.14, is HiveSerdeConverter still the only way to convert to ORC? When you said "rolled out native ORC in Gobblin", did you have a new release version for that? Or would it be part of the next release, 0.15?
    Shirshanka Das
    @shirshanka
    @tilakpatidar : If I change the parquet writer to org.apache.parquet based types, any existing pipelines using the old parquet types will break. Do you have this enabled in production?

    @kchando : it could be part of the next release, 0.15. We need to see how easy it is to contribute it back before we build the next release.

    Shirshanka Das
    @shirshanka
    @pritamsarkar86 : it seems like you can work around your issue if you depend on com.twitter:parquet-hadoop-bundle:1.5.0 in your code
    is that something you can do?
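(A sketch of that workaround in a Gradle build; only the coordinate itself comes from the suggestion above.)

```groovy
dependencies {
    // The legacy Twitter bundle keeps the old parquet.* package names on the
    // classpath, which is what the 0.14-era ParquetDataWriterBuilder expects.
    compile 'com.twitter:parquet-hadoop-bundle:1.5.0'
}
```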
    Pritam Sarkar
    @pritamsarkar86
    Hi @shirshanka , I tried that. Once the package issues are resolved, ParquetDataWriterBuilder crashes in CloseOnFlushWriterWrapper while executing this.writer = writerSupplier.get(). I could not get much deeper into it after that. My observation is that the Parquet writer flow needs more testing.
    Shirshanka Das
    @shirshanka
    Ok, are you building gobblin from source?
    or pulling from maven repo?
    Pritam Sarkar
    @pritamsarkar86
    Pulling from Maven 0.14.0
    Shirshanka Das
    @shirshanka
    Can you paste your job pull / conf file here?
    Pritam Sarkar
    @pritamsarkar86
    Config is very similar to what is described in the doc. Below is a snippet:

    ```
    #### JOB
    job.name=pull_event_logs
    job.group=logpullg
    job.description=Gobblin job to pull event logs from Kafka
    task.maxretries=0
    mr.job.max.mappers=2
    mapreduce.map.memory.mb=4096

    #### SOURCE
    source.class=com.dea.roku.data.consumers.fork.KafkaForkSource
    topic.whitelist=events
    bootstrap.with.offset=latest
    kafka.brokers=${env:KAFKA_BROKERS}

    #### EXTRACT
    fork.branches=1
    fork.operator.class=com.dea.roku.data.consumers.fork.events.DataForkOperator
    extract.namespace=org.apache.gobblin.extract.kafka
    extract.limit.enabled=true
    extract.limit.type=count
    extract.limit.count.limit=20000000

    #### WRITER
    writer.file.path.type=tablename
    writer.destination.type=HDFS
    writer.output.format=PARQUET
    writer.partitioner.class=com.dea.roku.data.consumers.writer.partitioner.events.FactPartitioner

    #### PUBLISHER
    data.publisher.type=org.apache.gobblin.publisher.TimePartitionedDataPublisher
    data.publisher.replace.final.dir=false

    #### Converter, writer, publisher configs per fork
    converter.classes.0=com.dea.roku.data.consumers.events.converter.EventConverter
    writer.partitioner.class.0=com.dea.roku.data.consumers.writer.partitioner.events.FactPartitioner
    writer.output.format.0=PARQUET
    writer.file.path.0=fact_events
    writer.fs.uri.0=${env:HADOOP_FS_URI}
    writer.builder.class.0=org.apache.gobblin.writer.ParquetDataWriterBuilder
    writer.staging.dir.0=/gobblin/events/task-staging
    writer.output.dir.0=/gobblin/events/task-output
    data.publisher.fs.uri.0=${env:S3_BUCKET_URI}
    data.publisher.final.dir.0=${env:ROOT_S3_FACT_FOLDER}
    ```

    EventConverter takes the protobuf events and converts them into ParquetGroup records to be written by ParquetDataWriterBuilder.
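(A minimal sketch of such a converter, assuming Gobblin's Converter API and the ParquetGroup type from the gobblin-parquet module; EventProto, its accessors, and the schema are hypothetical placeholders, not the actual EventConverter.)

```java
import org.apache.gobblin.configuration.WorkUnitState;
import org.apache.gobblin.converter.Converter;
import org.apache.gobblin.converter.DataConversionException;
import org.apache.gobblin.converter.SchemaConversionException;
import org.apache.gobblin.converter.SingleRecordIterable;
import org.apache.gobblin.converter.parquet.ParquetGroup;

import parquet.example.data.Group;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;

// Sketch only: "EventProto" stands in for a generated protobuf class with
// getId()/getTs() accessors; a real converter would derive the parquet schema
// from the protobuf descriptor instead of hard-coding it.
public class ProtoToParquetGroupConverter extends Converter<String, MessageType, EventProto, Group> {

  @Override
  public MessageType convertSchema(String inputSchema, WorkUnitState workUnit)
      throws SchemaConversionException {
    return MessageTypeParser.parseMessageType(
        "message event { required binary id (UTF8); required int64 ts; }");
  }

  @Override
  public Iterable<Group> convertRecord(MessageType outputSchema, EventProto inputRecord,
      WorkUnitState workUnit) throws DataConversionException {
    Group group = new ParquetGroup(outputSchema);
    group.add("id", inputRecord.getId());   // maps a protobuf string field
    group.add("ts", inputRecord.getTs());   // maps a protobuf int64 field
    return new SingleRecordIterable<>(group);
  }
}
```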
    Shirshanka Das
    @shirshanka
    and what is the stack trace you get?
    Pritam Sarkar
    @pritamsarkar86
    Oops! Looks like my logs were cleared. I looked into the code a couple of times, so let me write down the stack.
    PartitionedDataWriter (106) => CloseOnFlushWriterWrapper (71) => this.writer = writerSupplier.get();
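(A simplified sketch of the pattern at that call site, not Gobblin's actual CloseOnFlushWriterWrapper: the underlying writer is built lazily through a Supplier, so an exception thrown by a misconfigured writer builder surfaces at writerSupplier.get() on the first write rather than at job setup.)

```java
import java.io.IOException;
import java.util.function.Supplier;

// Illustrates why the crash appears at writerSupplier.get(): construction of
// the real writer is deferred, so builder failures are thrown from inside the
// wrapper, matching the PartitionedDataWriter -> wrapper stack quoted above.
class LazyWriterWrapper<D> {
  interface Writer<D2> {
    void write(D2 record) throws IOException;
  }

  private final Supplier<Writer<D>> writerSupplier;
  private Writer<D> writer;

  LazyWriterWrapper(Supplier<Writer<D>> writerSupplier) {
    this.writerSupplier = writerSupplier;
  }

  void write(D record) throws IOException {
    if (this.writer == null) {
      // A bad schema or a missing class on the classpath blows up here.
      this.writer = this.writerSupplier.get();
    }
    this.writer.write(record);
  }
}
```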
    Shirshanka Das
    @shirshanka
    and what is the error you get on this?
    is it an NPE? or something else?
    Shirshanka Das
    @shirshanka
    @pritamsarkar86 : I’m not able to reproduce your issue. Getting a real stack trace would help.
    softkumsh
    @softkumsh
    @shirshanka I am getting the below error; I am running on a Windows machine
    [image.png: screenshot of the build error]
    Shirshanka Das
    @shirshanka
    @softkumsh are you building from the top-level directory using ./gradlew ?
    Pritam Sarkar
    @pritamsarkar86
    @shirshanka will send the stack by tomorrow for the parquet issue. Thanks.
    Shirshanka Das
    @shirshanka
    thanks @pritamsarkar86, on my local build, I’m able to generate parquet files from avro etc … not seeing the issue you described
    Pritam Sarkar
    @pritamsarkar86
    Please post your config
    Shirshanka Das
    @shirshanka
    I have some uncommitted changes related to a test source to generate in-memory json, so the source config won’t work for you
    Pritam Sarkar
    @pritamsarkar86
    I was trying to directly write ParquetGroup using ParquetDataWriterBuilder, but it looks like you might be doing it via avro conversion first.
    I see.
    Shirshanka Das
    @shirshanka
    I’m basically enhancing the current test source in gobblin to generate in-mem json / avro etc
    so that it is easy to hook it up to any writer / converter pipeline
    dynamic generation of protobuf messages is a bit tricky since proto relies on code-gen from a .proto file to a Java class … unless I’m missing something
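(One partial escape hatch, sketched below: protobuf's DynamicMessage builds messages at runtime from a Descriptor, but the Descriptor still comes from protoc output such as a descriptor set, so the code-gen step moves rather than disappears. The file name, message name, and field are hypothetical.)

```java
import java.io.FileInputStream;

import com.google.protobuf.DescriptorProtos;
import com.google.protobuf.Descriptors;
import com.google.protobuf.DynamicMessage;

// Assumes a descriptor set generated beforehand with:
//   protoc --descriptor_set_out=event.desc event.proto
public class DynamicProtoSketch {
  public static void main(String[] args) throws Exception {
    DescriptorProtos.FileDescriptorSet set;
    try (FileInputStream in = new FileInputStream("event.desc")) {
      set = DescriptorProtos.FileDescriptorSet.parseFrom(in);
    }
    // Build a FileDescriptor for the first file; assumes event.proto has no imports.
    Descriptors.FileDescriptor file = Descriptors.FileDescriptor.buildFrom(
        set.getFile(0), new Descriptors.FileDescriptor[] {});
    Descriptors.Descriptor descriptor = file.findMessageTypeByName("Event");

    // No generated Java class is needed to construct a message dynamically.
    DynamicMessage msg = DynamicMessage.newBuilder(descriptor)
        .setField(descriptor.findFieldByName("id"), "event-123")
        .build();
    System.out.println(msg);
  }
}
```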
    Pritam Sarkar
    @pritamsarkar86
    I was trying to save that computation step by converting directly to ParquetGroup. I already have the Protobuf Java classes.
    Shirshanka Das
    @shirshanka
    yeah I will be committing a native proto writer for parquet as well