    Shirshanka Das
    @shirshanka
    We’re hoping to write a blog post soon on recent developments
    kchando
    @kchando
    @shirshanka : Curious about the native ORC support in Gobblin. In the current Gobblin 0.14, is HiveSerdeConverter still the only way to convert to ORC? When you said you "rolled out native ORC in Gobblin", was there a new release version for that, or would it be part of the next release, 0.15?
    Shirshanka Das
    @shirshanka
    @tilakpatidar : If I change the parquet writer to org.apache.parquet based types, any existing pipelines using the old parquet types will break. Do you have this enabled in production?
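    For context, a minimal sketch of the incompatibility in question, assuming the 0.14 Parquet writer is typed against the bundled Twitter fork of Parquet; the class and method below are hypothetical, and only the import paths reflect the two artifacts being compared:

    ```
    // Old types, from com.twitter:parquet-hadoop-bundle (packages under parquet.*):
    import parquet.example.data.Group;
    // New types, if the writer moved to org.apache.parquet artifacts:
    // import org.apache.parquet.example.data.Group;

    public class OldParquetTypesExample {
        // A converter or pipeline emitting the old Group type would no longer
        // match a writer rewritten against org.apache.parquet.example.data.Group.
        public void handOff(Group record) {
            // ... pass the record to the Parquet writer
        }
    }
    ```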

    @kchando : could be part of the next release, 0.15. We need to see how easy it is to contribute it back before we build the next release.

    Shirshanka Das
    @shirshanka
    @pritamsarkar86 : it seems like you can work around your issue if you depend on com.twitter:parquet-hadoop-bundle:1.5.0 in your code
    is that something you can do?
    Pritam Sarkar
    @pritamsarkar86
    Hi @shirshanka, I tried that. Once the package issues are resolved, ParquetDataWriterBuilder crashes in CloseOnFlushWriterWrapper while executing this.writer = writerSupplier.get();. I could not get much deeper into it after that. My observation is that the Parquet writer flow needs more testing.
    Shirshanka Das
    @shirshanka
    Ok, are you building gobblin from source?
    or pulling from maven repo?
    Pritam Sarkar
    @pritamsarkar86
    Pulling from Maven 0.14.0
    Shirshanka Das
    @shirshanka
    Can you paste your job pull / conf file here?
    Pritam Sarkar
    @pritamsarkar86
    Config is very similar to what is described in the doc. Below is a snippet:

    ```
    #### JOB
    job.name=pull_event_logs
    job.group=logpullg
    job.description=Gobblin job to pull event logs from Kafka
    task.maxretries=0
    mr.job.max.mappers=2
    mapreduce.map.memory.mb=4096

    #### SOURCE
    source.class=com.dea.roku.data.consumers.fork.KafkaForkSource
    topic.whitelist=events
    bootstrap.with.offset=latest
    kafka.brokers=${env:KAFKA_BROKERS}

    #### EXTRACT
    fork.branches=1
    fork.operator.class=com.dea.roku.data.consumers.fork.events.DataForkOperator
    extract.namespace=org.apache.gobblin.extract.kafka
    extract.limit.enabled=true
    extract.limit.type=count
    extract.limit.count.limit=20000000

    #### WRITER
    writer.file.path.type=tablename
    writer.destination.type=HDFS
    writer.output.format=PARQUET
    writer.partitioner.class=com.dea.roku.data.consumers.writer.partitioner.events.FactPartitioner

    #### PUBLISHER
    data.publisher.type=org.apache.gobblin.publisher.TimePartitionedDataPublisher
    data.publisher.replace.final.dir=false

    #### Converter, writer, publisher configs per fork
    converter.classes.0=com.dea.roku.data.consumers.events.converter.EventConverter
    writer.partitioner.class.0=com.dea.roku.data.consumers.writer.partitioner.events.FactPartitioner
    writer.output.format.0=PARQUET
    writer.file.path.0=fact_events
    writer.fs.uri.0=${env:HADOOP_FS_URI}
    writer.builder.class.0=org.apache.gobblin.writer.ParquetDataWriterBuilder
    writer.staging.dir.0=/gobblin/events/task-staging
    writer.output.dir.0=/gobblin/events/task-output
    data.publisher.fs.uri.0=${env:S3_BUCKET_URI}
    data.publisher.final.dir.0=${env:ROOT_S3_FACT_FOLDER}
    ```

    EventConverter takes the protobuf events and converts them into ParquetGroup records, which are then written with ParquetDataWriterBuilder.
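    For illustration, a minimal sketch of what such a converter could look like against the Gobblin 0.14 Converter API and the Twitter-fork Parquet example types; EventRecord, the field names, and the schema are hypothetical stand-ins, not taken from this chat:

    ```
    import org.apache.gobblin.configuration.WorkUnitState;
    import org.apache.gobblin.converter.Converter;
    import org.apache.gobblin.converter.DataConversionException;
    import org.apache.gobblin.converter.SchemaConversionException;
    import org.apache.gobblin.converter.SingleRecordIterable;

    import parquet.example.data.Group;
    import parquet.example.data.simple.SimpleGroup;
    import parquet.schema.MessageType;
    import parquet.schema.MessageTypeParser;

    import com.example.events.EventRecord;  // hypothetical protobuf-generated class

    public class EventConverterSketch extends Converter<String, MessageType, EventRecord, Group> {

        @Override
        public MessageType convertSchema(String inputSchema, WorkUnitState workUnit)
                throws SchemaConversionException {
            // Hard-coded example schema; a real converter would derive it from the proto descriptor.
            return MessageTypeParser.parseMessageType(
                "message event { required binary id (UTF8); required int64 ts; }");
        }

        @Override
        public Iterable<Group> convertRecord(MessageType outputSchema, EventRecord inputRecord,
                WorkUnitState workUnit) throws DataConversionException {
            // Copy the protobuf fields into a Parquet example Group for the writer.
            Group group = new SimpleGroup(outputSchema);
            group.add("id", inputRecord.getId());
            group.add("ts", inputRecord.getTs());
            return new SingleRecordIterable<>(group);
        }
    }
    ```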
    Shirshanka Das
    @shirshanka
    and what is the stack trace you get?
    Pritam Sarkar
    @pritamsarkar86
    Oops! Looks like my logs have been cleared. I have looked into the code a couple of times, so let me write down the stack from memory.
    PartitionedDataWriter (106) => CloseOnFlushWriterWrapper (71) => this.writer = writerSupplier.get();
    Shirshanka Das
    @shirshanka
    and what is the error you get on this?
    is it an NPE? or something else?
    Shirshanka Das
    @shirshanka
    @pritamsarkar86 : I’m not able to reproduce your issue. Getting a real stack trace would help.
    softkumsh
    @softkumsh
    @shirshanka I am getting the below error; I am running on a Windows machine.
    (screenshot attached: image.png)
    Shirshanka Das
    @shirshanka
    @softkumsh are you building from the top-level directory using ./gradlew ?
    Pritam Sarkar
    @pritamsarkar86
    @shirshanka will send the stack by tomorrow for the parquet issue. Thanks.
    Shirshanka Das
    @shirshanka
    thanks @pritamsarkar86, on my local build, I’m able to generate parquet files from avro etc … not seeing the issue you described
    Pritam Sarkar
    @pritamsarkar86
    Please post your config
    Shirshanka Das
    @shirshanka
    I have some uncommitted changes related to a test source to generate in-memory json, so the source config won’t work for you
    Pritam Sarkar
    @pritamsarkar86
    I was trying to directly write ParquetGroup using ParquetDataWriterBuilder, but it looks like you might be doing it via Avro conversion first.
    I see.
    Shirshanka Das
    @shirshanka
    I’m basically enhancing the current test source in gobblin to generate in-mem json / avro etc
    so that it is easy to hook it up to any writer / converter pipeline
    dynamic generation of protobuf messages is a bit tricky since proto relies on code-gen from proto file to java class .. unless I’m missing something
    Pritam Sarkar
    @pritamsarkar86
    I was trying to skip that conversion step by converting directly to ParquetGroup, since I already have the protobuf Java classes.
    Shirshanka Das
    @shirshanka
    yeah I will be committing a native proto writer for parquet as well
    so that you can go straight into the parquet writer with a Message class and not worry about converting
    however, your current approach of converting to Parquet Group and handing it off to the Parquet Writer should work (based on my experiments)
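    Not the Gobblin change being described above, just a rough standalone sketch of the general "straight from a protobuf Message into Parquet" idea, using parquet-protobuf's ProtoParquetWriter; EventRecord and its fields are hypothetical:

    ```
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.proto.ProtoParquetWriter;

    import com.example.events.EventRecord;  // hypothetical protobuf-generated class

    public class ProtoToParquetSketch {
        public static void main(String[] args) throws Exception {
            // Write generated protobuf Messages directly as Parquet,
            // with no intermediate Group / Avro conversion step.
            try (ProtoParquetWriter<EventRecord> writer =
                    new ProtoParquetWriter<>(new Path("/tmp/events.parquet"), EventRecord.class)) {
                writer.write(EventRecord.newBuilder().setId("abc").setTs(1L).build());
            }
        }
    }
    ```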
    Pritam Sarkar
    @pritamsarkar86
    I took the same approach for ORC that you took for Parquet (msg -> json -> avro -> parquet), and it works fine. I will try it for Parquet as well.
    For Parquet, which converters are you chaining?
    Shirshanka Das
    @shirshanka
    here is my config file: https://pastebin.com/8EKQ538H
    it includes a sequential test source, with config that won’t work in the default gobblin
    since they contain uncommitted changes
    Pritam Sarkar
    @pritamsarkar86
    Thanks @shirshanka
    Shirshanka Das
    @shirshanka
    @pritamsarkar86 : in the meantime if you can work on building gobblin locally and using the jars from there, it would help once I check in my changes
    Pritam Sarkar
    @pritamsarkar86
    ```
    at org.apache.gobblin.writer.PartitionedDataWriter$2.get(PartitionedDataWriter.java:148)
    at org.apache.gobblin.writer.PartitionedDataWriter$2.get(PartitionedDataWriter.java:141)
    at org.apache.gobblin.writer.CloseOnFlushWriterWrapper.<init>(CloseOnFlushWriterWrapper.java:71)
    at org.apache.gobblin.writer.PartitionedDataWriter.<init>(PartitionedDataWriter.java:140)
    at org.apache.gobblin.runtime.fork.Fork.buildWriter(Fork.java:534)
    at org.apache.gobblin.runtime.fork.Fork.buildWriterIfNotPresent(Fork.java:542)
    at org.apache.gobblin.runtime.fork.Fork.processRecord(Fork.java:502)
    at org.apache.gobblin.runtime.fork.AsynchronousFork.processRecord(AsynchronousFork.java:103)
    at org.apache.gobblin.runtime.fork.AsynchronousFork.processRecords(AsynchronousFork.java:86)
    at org.apache.gobblin.runtime.fork.Fork.run(Fork.java:243)
    at org.apache.gobblin.util.executors.MDCPropagatingRunnable.run(MDCPropagatingRunnable.java:39)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Thread
    ```
    @shirshanka this is the crash I was seeing
    Shirshanka Das
    @shirshanka
    @pritamsarkar86 : do you mind pasting the entire stack trace onto pastebin and sending the link here.. this stack trace doesn’t tell me the exception that was thrown
    Pritam Sarkar
    @pritamsarkar86
    Thanks @shirshanka, I will try this out. Meanwhile, we tried ORC and that worked without any issues.
    Shirshanka Das
    @shirshanka
    :+1: