    Pritam Sarkar
    @pritamsarkar86
    Please post your config
    Shirshanka Das
    @shirshanka
    I have some uncommitted changes related to a test source to generate in-memory json, so the source config won’t work for you
    Pritam Sarkar
    @pritamsarkar86
    I was trying to directly write ParquetGroup using ParquetDataWriterBuilder, but it looks like you might be doing it via avro conversion first.
    I see.
    Shirshanka Das
    @shirshanka
    I’m basically enhancing the current test source in gobblin to generate in-mem json / avro etc
    so that it is easy to hook it up to any writer / converter pipeline
    dynamic generation of protobuf messages is a bit tricky since proto relies on code-gen from proto file to java class .. unless I’m missing something
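    For what it's worth, protobuf does support building messages at runtime via DynamicMessage, given a Descriptor (for example parsed from a file produced by protoc --descriptor_set_out). A minimal sketch; the descriptor file name and the Event/id names are purely illustrative:
    ```
    import com.google.protobuf.DescriptorProtos.FileDescriptorSet;
    import com.google.protobuf.Descriptors.Descriptor;
    import com.google.protobuf.Descriptors.FileDescriptor;
    import com.google.protobuf.DynamicMessage;

    import java.io.FileInputStream;

    public class DynamicProtoSketch {
      public static void main(String[] args) throws Exception {
        // Load a descriptor set written by: protoc --descriptor_set_out=events.desc events.proto
        FileDescriptorSet set;
        try (FileInputStream in = new FileInputStream("events.desc")) {
          set = FileDescriptorSet.parseFrom(in);
        }
        // Assumes a single self-contained .proto with no imports.
        FileDescriptor fd = FileDescriptor.buildFrom(set.getFile(0), new FileDescriptor[0]);
        Descriptor type = fd.findMessageTypeByName("Event");

        // Build a message without any generated java classes.
        DynamicMessage msg = DynamicMessage.newBuilder(type)
            .setField(type.findFieldByName("id"), 42L)
            .build();
        System.out.println(msg);
      }
    }
    ```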
    Pritam Sarkar
    @pritamsarkar86
    I was trying to save that conversion step by going straight to ParquetGroup. I already have the Protobuf java classes.
    Shirshanka Das
    @shirshanka
    yeah I will be committing a native proto writer for parquet as well
    so that you can go straight into the parquet writer with a Message class and not worry about converting
    however, your current approach of converting to Parquet Group and handing it off to the Parquet Writer should work (based on my experiments)
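    A minimal sketch of that Group-based route using parquet-mr's example API (schema, output path, and field values are illustrative, and this is hand-rolled rather than Gobblin's ParquetDataWriterBuilder wiring):
    ```
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.example.data.simple.SimpleGroupFactory;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class GroupWriteSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical schema standing in for one derived from the proto classes.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message Event { required int64 id; required binary name (UTF8); }");
        SimpleGroupFactory groups = new SimpleGroupFactory(schema);

        // Populate a Group (in practice, from the protobuf Message fields)
        // and hand it straight to the parquet writer.
        try (ParquetWriter<Group> writer = ExampleParquetWriter
            .builder(new Path("/tmp/events.parquet"))
            .withType(schema)
            .build()) {
          Group g = groups.newGroup().append("id", 42L).append("name", "sample");
          writer.write(g);
        }
      }
    }
    ```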
    Pritam Sarkar
    @pritamsarkar86
    I took the same approach for ORC that you took for parquet (msg -> json -> avro -> parquet), and it works fine. Will try this out for Parquet as well.
    For parquet, what converters are you chaining?
    Shirshanka Das
    @shirshanka
    here is my config file: https://pastebin.com/8EKQ538H
    it includes a sequential test source, with config that won’t work in the default gobblin
    since it relies on uncommitted changes
    Pritam Sarkar
    @pritamsarkar86
    Thanks @shirshanka
    Shirshanka Das
    @shirshanka
    @pritamsarkar86 : in the meantime if you can work on building gobblin locally and using the jars from there, it would help once I check in my changes
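    For reference, the local-build route might look like this (repository URL and the buildDistributionTar task name as they appear elsewhere in this thread; output locations may vary by version):
    ```
    git clone https://github.com/apache/incubator-gobblin.git
    cd incubator-gobblin
    ./gradlew :gobblin-distribution:buildDistributionTar
    # the distribution tarball with the built jars lands under the
    # gobblin-distribution module's build output
    ```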
    Pritam Sarkar
    @pritamsarkar86
    at org.apache.gobblin.writer.PartitionedDataWriter$2.get(PartitionedDataWriter.java:148)
    at org.apache.gobblin.writer.PartitionedDataWriter$2.get(PartitionedDataWriter.java:141)
    at org.apache.gobblin.writer.CloseOnFlushWriterWrapper.<init>(CloseOnFlushWriterWrapper.java:71)
    at org.apache.gobblin.writer.PartitionedDataWriter.<init>(PartitionedDataWriter.java:140)
    at org.apache.gobblin.runtime.fork.Fork.buildWriter(Fork.java:534)
    at org.apache.gobblin.runtime.fork.Fork.buildWriterIfNotPresent(Fork.java:542)
    at org.apache.gobblin.runtime.fork.Fork.processRecord(Fork.java:502)
    at org.apache.gobblin.runtime.fork.AsynchronousFork.processRecord(AsynchronousFork.java:103)
    at org.apache.gobblin.runtime.fork.AsynchronousFork.processRecords(AsynchronousFork.java:86)
    at org.apache.gobblin.runtime.fork.Fork.run(Fork.java:243)
    at org.apache.gobblin.util.executors.MDCPropagatingRunnable.run(MDCPropagatingRunnable.java:39)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Thread
    @shirshanka this is the crash I was seeing
    Shirshanka Das
    @shirshanka
    @pritamsarkar86 : do you mind pasting the entire stack trace onto pastebin and sending the link here.. this stack trace doesn’t tell me the exception that was thrown
    Pritam Sarkar
    @pritamsarkar86
    Thanks @shirshanka, I will try this out. Meanwhile we tried ORC and that worked without any issues.
    Shirshanka Das
    @shirshanka
    :+1:
    Shirshanka Das
    @shirshanka
    @pritamsarkar86 there isn’t a published release for this yet; we are considering building and releasing nightlies to help with earlier integration testing
    Shirshanka Das
    @shirshanka
    @pritamsarkar86 : it would be great if you could file the issue you were facing with the Parquet writer on Gobblin JIRA so we could take a look in detail. Specifically the stack trace that contains the Exception being thrown.
    tichyj
    @tichyj
    Hi, I'm trying to deploy gobblin in yarn mode. When I launched gobblin with the cmd "bin/gobblin-yarn.sh --verbose start" it started on yarn, but I saw an issue in the log while fetching metadata from kafka. It is using an old version of the kafka client (Kafka08ConsumerClient), and my cluster is running cloudera kafka 3.1.0, which is kafka 1.0.1. Does that happen because the kafka client is too old? Do you have any experience with gobblin on yarn? Does it work? There is no documentation in the Deployment/YARN architecture section.
    Stacktrace:
    2020-01-28 01:54:00 PST WARN  [DefaultQuartzScheduler_Worker-1] org.apache.gobblin.kafka.client.Kafka08ConsumerClient  - Fetching topic metadata from broker czrtim1hr.oskarmobil.cz:9092 has failed 1 times.
    java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
        at kafka.utils.Utils$.read(Utils.scala:381)
        at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
        at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
        at kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
        at kafka.network.BlockingChannel.receive(BlockingChannel.scala:111)
        at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:79)
        at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:68)
        at kafka.consumer.SimpleConsumer.send(SimpleConsumer.scala:91)
        at kafka.javaapi.consumer.SimpleConsumer.send(SimpleConsumer.scala:68)
        at org.apache.gobblin.kafka.client.Kafka08ConsumerClient.fetchTopicMetadataFromBroker(Kafka08ConsumerClient.java:147)
        at org.apache.gobblin.kafka.client.Kafka08ConsumerClient.getFilteredMetadataList(Kafka08ConsumerClient.java:131)
        at org.apache.gobblin.kafka.client.Kafka08ConsumerClient.getTopics(Kafka08ConsumerClient.java:98)
        at org.apache.gobblin.kafka.client.AbstractBaseKafkaConsumerClient.getFilteredTopics(AbstractBaseKafkaConsumerClient.java:78)
        at org.apache.gobblin.source.extractor.extract.kafka.KafkaSource.getFilteredTopics(KafkaSource.java:765)
        at org.apache.gobblin.source.extractor.extract.kafka.KafkaSource.getWorkunits(KafkaSource.java:212)
        at org.apache.gobblin.runtime.SourceDecorator.getWorkunitStream(SourceDecorator.java:81)
        at org.apache.gobblin.runtime.AbstractJobLauncher.launchJob(AbstractJobLauncher.java:408)
        at org.apache.gobblin.cluster.GobblinHelixJobLauncher.launchJob(GobblinHelixJobLauncher.java:378)
        at org.apache.gobblin.scheduler.JobScheduler.runJob(JobScheduler.java:487)
        at org.apache.gobblin.cluster.HelixRetriggeringJobCallable.runJobLauncherLoop(HelixRetriggeringJobCallable.java:203)
        at org.apache.gobblin.cluster.HelixRetriggeringJobCallable.call(HelixRetriggeringJobCallable.java:159)
        at org.apache.gobblin.cluster.GobblinHelixJobScheduler.runJob(GobblinHelixJobScheduler.java:228)
        at org.apache.gobblin.cluster.GobblinHelixJob.executeImpl(GobblinHelixJob.java:61)
        at org.apache.gobblin.scheduler.BaseGobblinJob.execute(BaseGobblinJob.java:58)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
    2020-01-28 01:54:01 PST INFO  [DefaultQuartzScheduler_Worker-1] kafka.utils.Logging$class  - Reconnect due to socket error: java.nio.channels.ClosedChannelException
    Chhaya Vankhede
    @csvankhede

    I am new to gobblin. I have downloaded incubator-gobblin-gobblin_0.11.0.tar.gz. While installing gobblin on Windows 10 following the instructions at getting started, running ./gradlew :gobblin-distribution:buildDistributionTar gives the result below.
    ```
    FAILURE: Build failed with an exception.

    * What went wrong:
    Could not determine the dependencies of task ':gobblin-distribution:buildDistributionTar'.
    > Configuration with name 'dataTemplate' not found.

    * Try:
    Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

    BUILD FAILED
    ```
    How can I solve it?

    Shirshanka Das
    @shirshanka
    @tichyj : Gobblin on Yarn does work, in fact we are running it at LinkedIn currently. Which version of gobblin are you using?
    @csvankhede : can you build from gobblin incubator master on GitHub? https://github.com/apache/incubator-gobblin
    Chhaya Vankhede
    @csvankhede
    @shirshanka Building from gobblin incubator master on GitHub, I am getting:
     > Configure project :gobblin-distribution
    Using flavor:standard for project gobblin-distribution
    
    FAILURE: Build failed with an exception.
    
    * Where:
    Build file 'C:\Users\csvan\incubator-gobblin\gobblin-restli\gobblin-flow-config-service\gobblin-flow-config-service-api\build.gradle' line: 1
    
    * What went wrong:
    Could not compile build file 'C:\Users\csvan\incubator-gobblin\gobblin-restli\gobblin-flow-config-service\gobblin-flow-config-service-api\build.gradle'.
    > startup failed:
      build file 'C:\Users\csvan\incubator-gobblin\gobblin-restli\gobblin-flow-config-service\gobblin-flow-config-service-api\build.gradle': 1: unexpected token: .. @ line 1, column 1.
         ../../api.gradle
         ^
    
      1 error
    
    
    * Try:
    Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
    
    * Get more help at https://help.gradle.org
    
    Deprecated Gradle features were used in this build, making it incompatible with Gradle 5.0.
    Use '--warning-mode all' to show the individual deprecation warnings.
    See https://docs.gradle.org/4.9/userguide/command_line_interface.html#sec:command_line_warnings
    Shirshanka Das
    @shirshanka
    this might be a windows specific build issue… do you have access to a linux / unix environment?
    Chhaya Vankhede
    @csvankhede
    @shirshanka No
    tichyj
    @tichyj
    @shirshanka Thanks for the answer. We are using Gobblin version 0.15.0. I have cloned the repo and built it with the ./gradlew build command. Then I prepared application.conf and reference.conf in the conf/yarn directory and prepared the job configuration at the path given by gobblin.cluster.job.conf.path. It seems to be working well (but only with job.lock.enabled=false), but when it tries to fetch metadata from kafka it fails.
    Shirshanka Das
    @shirshanka
    @tichyj : you can try building with kafka 0.9 and see if that solves your problem. you will have to customize the build flavor when you build gobblin
    Shirshanka Das
    @shirshanka
    @tichyj in the file : gobblin-distribution/gobblin-flavor-standard.gradle you can replace :gobblin-modules:gobblin-kafka-08 with :gobblin-modules:gobblin-kafka-09
    then gradle clean and rebuild
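    The flavor-file edit would look roughly like this (the surrounding module list is elided; only the kafka line changes):
    ```
    // gobblin-distribution/gobblin-flavor-standard.gradle (sketch)
    dependencies {
      // was: compile project(":gobblin-modules:gobblin-kafka-08")
      compile project(":gobblin-modules:gobblin-kafka-09")
      // ...other flavor modules unchanged
    }
    ```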
    priyanka0545
    @priyanka0545
    Hi, I am using Gobblin standalone to copy data from HDFS to S3. I get the following error. I believe it is performing schema validation. How can I disable that?
    Caused by: java.lang.NumberFormatException: For input string: "64M"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:589)
    at java.lang.Long.parseLong(Long.java:631)
    at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1349)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:133)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2816)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:98)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2853)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2835)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387)
    at org.apache.gobblin.runtime.FsDatasetStateStore.createStateStore(FsDatasetStateStore.java:107)
    at org.apache.gobblin.runtime.FsDatasetStateStoreFactory.createStateStore(FsDatasetStateStoreFactory.java:29)
    I have not specified any configuration to enable schema validation
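    The trace points at S3A configuration parsing rather than schema validation: S3AFileSystem.initialize reads a size setting with Configuration.getLong, and older hadoop-aws builds cannot parse the human-readable "64M" default that newer hadoop-common versions ship in core-default.xml. The clean fix is to align the hadoop-aws and hadoop-common versions; a workaround sketch, assuming fs.s3a.multipart.size is the offending key (it carries the "64M" default, though this trace doesn't name it):
    ```
    # override the human-readable default with a plain byte count (64 MB)
    # so the old S3AFileSystem's Configuration.getLong call can parse it
    fs.s3a.multipart.size=67108864
    ```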
    priyanka0545
    @priyanka0545
    Any suggestion on this would help.
    Here is the conf file
    priyanka0545
    @priyanka0545

    # Basic Properties

    job.name=GobblinHdfsToS3QuickStart
    job.group=Dunamis
    job.description=Gobblin quick start job for Hdfs to S3 ingestion
    job.lock.enabled=false

    jobconf.dir=/Users/<dir>/Documents/installations/gobblin/gobblin-jobs/job-config
    state.store.dir=s3a://<bucket>/state
    jobconf.monitor.interval=300000
    launcher.type=LOCAL
    job.schedule=0 0/1 * ?

    # Retry Properties

    workunit.retry.enabled=true
    workunit.retry.policy=always
    task.maxretries=2
    task.retry.intervalinsec=60
    job.max.failures=1

    # Task Execution Properties

    taskexecutor.threadpool.size=2
    tasktracker.threadpool.coresize=1
    tasktracker.threadpool.maxsize=1
    taskretry.threadpool.coresize=1
    taskretry.threadpool.maxsize=1
    task.status.reportintervalinms=10000

    # Metrics

    metrics.enabled=true
    metrics.report.interval=10000
    metrics.log.dir=/Users/<dir>/Documents/installations/gobblin/gobblin-jobs/work-dir/metrics
    metrics.reporting.file.enabled=true
    metrics.reporting.jmx.enabled=False

    # Source Properties

    fs.uri=hdfs://<IP>:8020
    source.class=org.apache.gobblin.source.extractor.hadoop.AvroFileSource
    extract.namespace=org.apache.gobblin.source.extractor.hadoop
    distcp.persist.dir=s3a://<bucket>/persist
    source.filebased.data.directory=hdfs://<ip>8020/user/root/file.csv
    source.filebased.fs.uri=hdfs://<ip>:8020

    # Distcp copy Properties

    gobblin.copy.simulate=true
    gobblin.copy.includeEmptyDirectories=false
    gobblin.copy.recursive.update=true
    gobblin.copy.split.enabled=true
    gobblin.copy.file.max.split.size=128000000

    # Working directories

    writer.output.dir=/Users/<dir>/Documents/installations/gobblin/gobblin-jobs/work-dir/task-output
    task.data.root.dir=/Users/<dir>/Documents/installations/gobblin/gobblin-jobs/work-dir/task-staging

    # Sink Properties

    writer.destination.type=HDFS
    writer.output.format=AVRO
    writer.builder.class=gobblin.writer.AvroDataWriterBuilder
    data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
    data.publisher.final.dir=s3a://<bucket>/finaloutput

    fs.s3a.awsAccessKeyId=<key>
    fs.s3a.awsSecretAccessKey=<secret>
    fs.s3a.buffer.dir=s3a://<bucket>/tmp
    data.publisher.metadata.output.dir=s3a://<bucket>/metadata_out
    writer.fs.uri=s3a://<bucket>/finaloutput
    state.store.fs.uri=s3a://<bucket>/state
    data.publisher.fs.uri=s3a://<bucket>/finaloutput

    priyanka0545
    @priyanka0545
    source.filebased.data.directory=hdfs://<ip>8020/user/root/file.avro, there is a typo in the conf
    Chhaya Vankhede
    @csvankhede
    @shirshanka I tried to build gobblin from incubator master in a linux environment. I am still getting the same error.
    > Configure project :gobblin-distribution
    Using flavor:standard for project gobblin-distribution
    
    FAILURE: Build failed with an exception.
    
    * Where:
    Build file '/mnt/c/users/csvan/incubator-gobblin/gobblin-restli/gobblin-flow-config-service/gobblin-flow-config-service-api/build.gradle' line: 1
    
    * What went wrong:
    Could not compile build file '/mnt/c/users/csvan/incubator-gobblin/gobblin-restli/gobblin-flow-config-service/gobblin-flow-config-service-api/build.gradle'.
    > startup failed:
      build file '/mnt/c/users/csvan/incubator-gobblin/gobblin-restli/gobblin-flow-config-service/gobblin-flow-config-service-api/build.gradle': 1: unexpected token: .. @ line 1, column 1.
         ../../api.gradle
         ^
    
      1 error
    
    
    * Try:
    Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
    
    * Get more help at https://help.gradle.org
    
    Deprecated Gradle features were used in this build, making it incompatible with Gradle 5.0.
    Use '--warning-mode all' to show the individual deprecation warnings.
    See https://docs.gradle.org/4.9/userguide/command_line_interface.html#sec:command_line_warnings
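    One observation on this failure: the /mnt/c/users/... path suggests the same Windows checkout mounted under WSL, and the build.gradle in question evidently contains only the text ../../api.gradle, which is what a git symlink looks like when checked out without symlink support. Assuming that diagnosis, a fresh clone with symlinks enabled on a native linux filesystem might get past it:
    ```
    # hypothetical recovery steps: enable symlinks, then re-clone
    # on a filesystem that supports them (i.e. not under /mnt/c)
    git config --global core.symlinks true
    git clone https://github.com/apache/incubator-gobblin.git
    cd incubator-gobblin
    ./gradlew :gobblin-distribution:buildDistributionTar
    ```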
    tichyj
    @tichyj
    @shirshanka I don't know why it doesn't work, but I have replaced it with :gobblin-modules:gobblin-kafka-09 in both files gobblin-distribution/gobblin-flavor-{standard,full}.gradle. After that I cleaned and rebuilt gobblin: ./gradlew clean && ./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain. The lib directory contains both gobblin-kafka-09-0.15.0.jar and gobblin-kafka-08-0.15.0.jar. I decided to remove gobblin-kafka-08-0.15.0.jar from the classpath. Then I launched bin/gobblin-yarn.sh --verbose start and the classpath contains only gobblin-kafka-09-0.15.0.jar, but there are the same errors/exceptions in the log.
    2020-01-30 05:32:00 PST WARN  [DefaultQuartzScheduler_Worker-1] org.apache.gobblin.kafka.client.Kafka08ConsumerClient  - Fetching topic metadata from broker czrtim1hr.oskarmobil.cz:9092 has failed 1 times.
    tichyj
    @tichyj
    @shirshanka What does this one in the file settings.gradle mean?
    // Disable jacoco for now as Kafka 0.8 is the default version and jacoco does not like the same classes
    // being declared in different modules
    def jacocoBlacklist =  new HashSet([
        "gobblin-modules:gobblin-kafka-09"
    ])
    Chhaya Vankhede
    @csvankhede
    How can I solve the following error for gobblin? I built gobblin successfully; now I am getting this error: Error: Could not find or load main class org.apache.gobblin.runtime.cli.GobblinCli
    Shirshanka Das
    @shirshanka
    @tichyj : the exception line indicates that you do have kafka-08 still being picked up (org.apache.gobblin.kafka.client.Kafka08ConsumerClient)
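    One quick way to confirm which kafka client jars actually reach the runtime classpath (the gobblin-dist/lib path is an assumption; adjust to your layout):
    ```
    # list the kafka jars shipped in the lib directory
    ls gobblin-dist/lib | grep kafka
    # check which jar provides the class named in the warning
    for j in gobblin-dist/lib/gobblin-kafka-0*.jar; do
      unzip -l "$j" | grep -q Kafka08ConsumerClient && echo "$j contains Kafka08ConsumerClient"
    done
    ```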