Shirshanka Das
@shirshanka
Can you paste your job pull / conf file here?
Pritam Sarkar
@pritamsarkar86
Config is very similar to what is described in the doc. Below is a snippet:

#### JOB
job.name=pull_event_logs
job.group=logpullg
job.description=Gobblin job to pull event logs from Kafka
task.maxretries=0
mr.job.max.mappers=2
mapreduce.map.memory.mb=4096

#### SOURCE
source.class=com.dea.roku.data.consumers.fork.KafkaForkSource
topic.whitelist=events
bootstrap.with.offset=latest
kafka.brokers=${env:KAFKA_BROKERS}

#### EXTRACT
fork.branches=1
fork.operator.class=com.dea.roku.data.consumers.fork.events.DataForkOperator
extract.namespace=org.apache.gobblin.extract.kafka
extract.limit.enabled=true
extract.limit.type=count
extract.limit.count.limit=20000000

#### WRITER
writer.file.path.type=tablename
writer.destination.type=HDFS
writer.output.format=PARQUET
writer.partitioner.class=com.dea.roku.data.consumers.writer.partitioner.events.FactPartitioner

#### PUBLISHER
data.publisher.type=org.apache.gobblin.publisher.TimePartitionedDataPublisher
data.publisher.replace.final.dir=false

#### Converter, writer, publisher configs per fork
converter.classes.0=com.dea.roku.data.consumers.events.converter.EventConverter
writer.partitioner.class.0=com.dea.roku.data.consumers.writer.partitioner.events.FactPartitioner
writer.output.format.0=PARQUET
writer.file.path.0=fact_events
writer.fs.uri.0=${env:HADOOP_FS_URI}
writer.builder.class.0=org.apache.gobblin.writer.ParquetDataWriterBuilder
writer.staging.dir.0=/gobblin/events/task-staging
writer.output.dir.0=/gobblin/events/task-output
data.publisher.fs.uri.0=${env:S3_BUCKET_URI}
data.publisher.final.dir.0=${env:ROOT_S3_FACT_FOLDER}

EventConverter gets the protobuf events and converts them into ParquetGroup to write with ParquetDataWriterBuilder
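For context, a Gobblin converter takes each extracted record and emits zero or more output records; EventConverter does this for protobuf events, producing ParquetGroup instances for ParquetDataWriterBuilder. The sketch below is a simplified, self-contained illustration of that record-in, records-out shape. It deliberately uses plain JDK types (a Map standing in for the protobuf Message, a delimited String standing in for the ParquetGroup) and is not the real Gobblin or Parquet API:

```java
import java.util.Collections;
import java.util.Map;

public class ConverterSketch {
    // Simplified stand-in for Gobblin's converter contract:
    // one input record in, zero or more output records out.
    interface SingleRecordConverter<I, O> {
        Iterable<O> convertRecord(I input);
    }

    // Hypothetical event converter: flattens a parsed event (a Map standing
    // in for a protobuf Message) into a pipe-delimited row (standing in for
    // a ParquetGroup built against a Parquet MessageType schema).
    static class EventToRowConverter implements SingleRecordConverter<Map<String, String>, String> {
        @Override
        public Iterable<String> convertRecord(Map<String, String> event) {
            String row = event.getOrDefault("device", "unknown")
                    + "|" + event.getOrDefault("event_type", "unknown")
                    + "|" + event.getOrDefault("ts", "0");
            return Collections.singletonList(row);
        }
    }

    public static String convertOne(Map<String, String> event) {
        return new EventToRowConverter().convertRecord(event).iterator().next();
    }

    public static void main(String[] args) {
        System.out.println(convertOne(Map.of("device", "tv1", "event_type", "play", "ts", "1580000000")));
    }
}
```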
Shirshanka Das
@shirshanka
and what is the stack trace you get?
Pritam Sarkar
@pritamsarkar86
Oops! Looks like my logs were cleared. I have gone through the code a couple of times, so let me write down the call path from memory.
PartitionedDataWriter (106) => CloseOnFlushWriterWrapper (71) => this.writer = writerSupplier.get();
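That call path means PartitionedDataWriter hands a Supplier to CloseOnFlushWriterWrapper, whose constructor immediately calls writerSupplier.get(), so any exception thrown while building the underlying writer surfaces from inside the wrapper's constructor. A minimal standalone sketch of that pattern (simplified stand-ins, not the actual Gobblin classes):

```java
import java.util.function.Supplier;

public class WriterSupplierSketch {
    // Simplified stand-in for CloseOnFlushWriterWrapper: it materializes
    // the wrapped writer eagerly, in its constructor.
    static class WrapperWriter {
        final Object writer;

        WrapperWriter(Supplier<Object> writerSupplier) {
            // If building the real writer throws (bad config, missing class,
            // schema mismatch), the exception propagates from this line,
            // which is why the stack trace points at the wrapper's constructor.
            this.writer = writerSupplier.get();
        }
    }

    public static String failureMessage() {
        try {
            new WrapperWriter(() -> {
                throw new IllegalStateException("could not build Parquet writer");
            });
            return "no failure";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(failureMessage());
    }
}
```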
Shirshanka Das
@shirshanka
and what is the error you get on this?
is it an NPE? or something else?
Shirshanka Das
@shirshanka
@pritamsarkar86 : I’m not able to reproduce your issue. Getting a real stack trace would help.
softkumsh
@softkumsh
@shirshanka I am getting the below error; I am running on a Windows machine
Shirshanka Das
@shirshanka
@softkumsh are you building from the top-level directory using ./gradlew ?
Pritam Sarkar
@pritamsarkar86
@shirshanka will send the stack by tomorrow for the parquet issue. Thanks.
Shirshanka Das
@shirshanka
thanks @pritamsarkar86, on my local build, I’m able to generate parquet files from avro etc … not seeing the issue you described
Pritam Sarkar
@pritamsarkar86
Please post your config
Shirshanka Das
@shirshanka
I have some uncommitted changes related to a test source to generate in-memory json, so the source config won’t work for you
Pritam Sarkar
@pritamsarkar86
I was trying to directly write ParquetGroup using ParquetDataWriterBuilder, but it looks like you might be doing it via avro conversion first.
I see.
Shirshanka Das
@shirshanka
I’m basically enhancing the current test source in gobblin to generate in-mem json / avro etc
so that it is easy to hook it up to any writer / converter pipeline
dynamic generation of protobuf messages is a bit tricky since proto relies on code-gen from proto file to java class .. unless I’m missing something
Pritam Sarkar
@pritamsarkar86
I was trying to save that computation step via converting to ParquetGroup . I already have the Protobuf java classes.
Shirshanka Das
@shirshanka
yeah I will be committing a native proto writer for parquet as well
so that you can go straight into the parquet writer with a Message class and not worry about converting
however, your current approach of converting to Parquet Group and handing it off to the Parquet Writer should work (based on my experiments)
Pritam Sarkar
@pritamsarkar86
I took the same approach for ORC that you took for parquet (msg -> json -> avro -> parquet), and it works fine. Will try this out for Parquet as well.
For parquet, what converters are you chaining?
Shirshanka Das
@shirshanka
here is my config file: https://pastebin.com/8EKQ538H
it includes a sequential test source, with config that won’t work in the default gobblin
since they contain uncommitted changes
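As a general note (the pastebin link above is the authoritative config): Gobblin applies converter.classes as a comma-separated chain, left to right, so a json -> avro -> parquet pipeline would be declared roughly as below. The converter class names here are illustrative placeholders, not the ones from the actual config:

```properties
# Converters run in listed order; placeholder class names for illustration
converter.classes=com.example.JsonToAvroConverter,com.example.AvroToParquetGroupConverter
writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilder
writer.output.format=PARQUET
```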
Pritam Sarkar
@pritamsarkar86
Thanks @shirshanka
Shirshanka Das
@shirshanka
@pritamsarkar86 : in the meantime if you can work on building gobblin locally and using the jars from there, it would help once I check in my changes
Pritam Sarkar
@pritamsarkar86
at org.apache.gobblin.writer.PartitionedDataWriter$2.get(PartitionedDataWriter.java:148)
at org.apache.gobblin.writer.PartitionedDataWriter$2.get(PartitionedDataWriter.java:141)
at org.apache.gobblin.writer.CloseOnFlushWriterWrapper.<init>(CloseOnFlushWriterWrapper.java:71)
at org.apache.gobblin.writer.PartitionedDataWriter.<init>(PartitionedDataWriter.java:140)
at org.apache.gobblin.runtime.fork.Fork.buildWriter(Fork.java:534)
at org.apache.gobblin.runtime.fork.Fork.buildWriterIfNotPresent(Fork.java:542)
at org.apache.gobblin.runtime.fork.Fork.processRecord(Fork.java:502)
at org.apache.gobblin.runtime.fork.AsynchronousFork.processRecord(AsynchronousFork.java:103)
at org.apache.gobblin.runtime.fork.AsynchronousFork.processRecords(AsynchronousFork.java:86)
at org.apache.gobblin.runtime.fork.Fork.run(Fork.java:243)
at org.apache.gobblin.util.executors.MDCPropagatingRunnable.run(MDCPropagatingRunnable.java:39)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Thread
@shirshanka this is the crash I was seeing
Shirshanka Das
@shirshanka
@pritamsarkar86 : do you mind pasting the entire stack trace onto pastebin and sending the link here.. this stack trace doesn’t tell me the exception that was thrown
Shirshanka Das
@shirshanka
Pritam Sarkar
@pritamsarkar86
Thanks @shirshanka , I will try this out. Meanwhile we tried ORC and that worked without any issues.
Shirshanka Das
@shirshanka
:+1:
Shirshanka Das
@shirshanka
@pritamsarkar86 there isn’t a published release for this yet, we are considering building and releasing nightlies to help with earlier integration testing
Shirshanka Das
@shirshanka
@pritamsarkar86 : it would be great if you could file the issue you were facing with the Parquet writer on Gobblin JIRA so we could take a look in detail. Specifically the stack trace that contains the Exception being thrown.
tichyj
@tichyj
Hi, I'm trying to deploy gobblin in yarn mode. When I launched gobblin with "bin/gobblin-yarn.sh --verbose start" it started on yarn, but I saw an issue in the log while it was fetching metadata from kafka. It is using an old version of the kafka client (Kafka08ConsumerClient), and my cluster is running Cloudera Kafka 3.1.0, which is Kafka 1.0.1. Does that happen because the kafka client is too old? Do you have any experience with gobblin on yarn? Does it work? There is no documentation in the Deployment/YARN architecture section.
Stacktrace:
2020-01-28 01:54:00 PST WARN  [DefaultQuartzScheduler_Worker-1] org.apache.gobblin.kafka.client.Kafka08ConsumerClient  - Fetching topic metadata from broker czrtim1hr.oskarmobil.cz:9092 has failed 1 times.
java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
at kafka.utils.Utils$.read(Utils.scala:381)
at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
at kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
at kafka.network.BlockingChannel.receive(BlockingChannel.scala:111)
at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:79)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:68)
at kafka.consumer.SimpleConsumer.send(SimpleConsumer.scala:91)
at kafka.javaapi.consumer.SimpleConsumer.send(SimpleConsumer.scala:68)
at org.apache.gobblin.kafka.client.Kafka08ConsumerClient.fetchTopicMetadataFromBroker(Kafka08ConsumerClient.java:147)
at org.apache.gobblin.kafka.client.Kafka08ConsumerClient.getFilteredMetadataList(Kafka08ConsumerClient.java:131)
at org.apache.gobblin.kafka.client.Kafka08ConsumerClient.getTopics(Kafka08ConsumerClient.java:98)
at org.apache.gobblin.kafka.client.AbstractBaseKafkaConsumerClient.getFilteredTopics(AbstractBaseKafkaConsumerClient.java:78)
at org.apache.gobblin.source.extractor.extract.kafka.KafkaSource.getFilteredTopics(KafkaSource.java:765)
at org.apache.gobblin.source.extractor.extract.kafka.KafkaSource.getWorkunits(KafkaSource.java:212)
at org.apache.gobblin.runtime.SourceDecorator.getWorkunitStream(SourceDecorator.java:81)
at org.apache.gobblin.runtime.AbstractJobLauncher.launchJob(AbstractJobLauncher.java:408)
at org.apache.gobblin.cluster.GobblinHelixJobLauncher.launchJob(GobblinHelixJobLauncher.java:378)
at org.apache.gobblin.scheduler.JobScheduler.runJob(JobScheduler.java:487)
at org.apache.gobblin.cluster.HelixRetriggeringJobCallable.runJobLauncherLoop(HelixRetriggeringJobCallable.java:203)
at org.apache.gobblin.cluster.HelixRetriggeringJobCallable.call(HelixRetriggeringJobCallable.java:159)
at org.apache.gobblin.cluster.GobblinHelixJobScheduler.runJob(GobblinHelixJobScheduler.java:228)
at org.apache.gobblin.cluster.GobblinHelixJob.executeImpl(GobblinHelixJob.java:61)
at org.apache.gobblin.scheduler.BaseGobblinJob.execute(BaseGobblinJob.java:58)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
2020-01-28 01:54:01 PST INFO  [DefaultQuartzScheduler_Worker-1] kafka.utils.Logging$class  - Reconnect due to socket error: java.nio.channels.ClosedChannelException
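For what it's worth, the trace shows the job using the legacy Kafka08ConsumerClient (the old SimpleConsumer API), which predates a 1.0.x broker. Gobblin ships a newer consumer client module built on the modern KafkaConsumer, and switching is a configuration change along these lines. Treat this as a sketch: the exact property name and factory class should be verified against the Kafka ingestion docs for your Gobblin version:

```properties
# Sketch: select the new-consumer client instead of the 0.8 SimpleConsumer one.
# Verify the exact property and factory class names against your Gobblin version's docs.
source.kafka.consumerFactoryClass=org.apache.gobblin.kafka.client.Kafka09ConsumerClient$Factory
kafka.brokers=czrtim1hr.oskarmobil.cz:9092
```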
Chhaya Vankhede
@csvankhede

I am new to gobblin. I have downloaded incubator-gobblin-gobblin_0.11.0.tar.gz. While installing gobblin on Windows 10 by following the instructions in Getting Started, running ./gradlew :gobblin-distribution:buildDistributionTar gives me the result below:

FAILURE: Build failed with an exception.

* What went wrong:
Could not determine the dependencies of task ':gobblin-distribution:buildDistributionTar'.
> Configuration with name 'dataTemplate' not found.

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

How can I solve it?

Shirshanka Das
@shirshanka
@tichyj : Gobblin on Yarn does work, in fact we are running it at LinkedIn currently. Which version of gobblin are you using?
@csvankhede : can you build from gobblin incubator master on GitHub? https://github.com/apache/incubator-gobblin
Chhaya Vankhede
@csvankhede
@shirshanka Building from gobblin incubator master on GitHub, I am getting:
 > Configure project :gobblin-distribution
Using flavor:standard for project gobblin-distribution

FAILURE: Build failed with an exception.

* Where:
Build file 'C:\Users\csvan\incubator-gobblin\gobblin-restli\gobblin-flow-config-service\gobblin-flow-config-service-api\build.gradle' line: 1

* What went wrong:
Could not compile build file 'C:\Users\csvan\incubator-gobblin\gobblin-restli\gobblin-flow-config-service\gobblin-flow-config-service-api\build.gradle'.
> startup failed:
build file 'C:\Users\csvan\incubator-gobblin\gobblin-restli\gobblin-flow-config-service\gobblin-flow-config-service-api\build.gradle': 1: unexpected token: .. @ line 1, column 1.
../../api.gradle
^

1 error

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

Deprecated Gradle features were used in this build, making it incompatible with Gradle 5.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/4.9/userguide/command_line_interface.html#sec:command_line_warnings
Shirshanka Das
@shirshanka
this might be a windows specific build issue… do you have access to a linux / unix environment?
Chhaya Vankhede
@csvankhede
@shirshanka No
tichyj
@tichyj
@shirshanka Thanks for the answer. We are using Gobblin version 0.15.0. I cloned the repo and built it with "./gradlew build". Then I prepared application.conf and reference.conf in the conf/yarn directory, and placed the job configuration at the path given by gobblin.cluster.job.conf.path. It seems to work well (but only with job.lock.enabled=false), except that it fails when it tries to fetch metadata from kafka.