    nleeuskadi
    @nleeuskadi

    Hi all, I am new to Brooklin and trying to understand its concepts and advantages.
    I have 2 needs:

    • Geo proximity: for low-latency reasons, I use replication to move a subset of topics closer to the access location
    • Disaster Recovery
      For both cases the replication is done over a WAN.

    I have a few questions:

    1. What are the core advantages of using Brooklin instead of MirrorMaker 2? Performance? Low latency? Ease of use?
    2. What is the difference between the 2 types of connectors, kafkaConnector and kafkaMirrorMakerConnector?
      Which one would be more performant or more suitable for my geo proximity use-case?
    3. Does Brooklin rely on MirrorMaker 2 behind the scenes?
    4. Does Brooklin rely on a Kafka Connect cluster behind the scenes?
    5. Where should Brooklin instances be installed: on the source Kafka cluster side (i.e. in the source data center), or on the target Kafka cluster side (i.e. in the target data center)?
    6. Is it recommended to use a dedicated ZooKeeper cluster for Brooklin, or is it possible to use the same ZooKeeper cluster as Kafka's?
    7. How do I determine the needed number of Brooklin instances?
    8. In the Brooklin server log I can see messages like "INFO Queuing event HEARTBEAT", "INFO De-queuing event HEARTBEAT", or "INFO START: Handle event HEARTBEAT". What are they used for? Where are these HEARTBEAT events pushed (Kafka? ZooKeeper?)?

    Thank you very much for your help :)

    Cheers

    Abhijeetiyengar
    @Abhijeetiyengar

    Hi all,

    I have been writing a connector and task using the Brooklin framework to connect our in-house cache, which is a key-value data store, to a Kafka topic. I noticed that while saving a checkpoint within a given task, DatastreamProducerRecordBuilder takes a string for setSourceCheckpoint. As I understand it, Brooklin stores the checkpoint as a value keyed by the destination topic's partition (in my case the destination is Kafka). My problem is that I would have just 1 task taking data from multiple partitions of our source cache and putting it into 1 partition of the Kafka destination.
    How could I store checkpoints for all my source partitions?
    Is the expectation that every source partition must have a destination partition?
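    One workaround, since setSourceCheckpoint takes an opaque string: pack the offsets of every source partition into a single encoded string and parse it back when the task restarts. A minimal sketch (the encoding scheme and helper class below are hypothetical, not something Brooklin prescribes):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Hypothetical helper: packs per-source-partition offsets into the single
    // opaque string accepted by DatastreamProducerRecordBuilder.setSourceCheckpoint().
    public final class CompositeCheckpoint {
      // e.g. {cache-0=1234, cache-7=42} -> "cache-0=1234;cache-7=42"
      public static String encode(Map<String, Long> offsetsBySourcePartition) {
        return offsetsBySourcePartition.entrySet().stream()
            .map(e -> e.getKey() + "=" + e.getValue())
            .collect(Collectors.joining(";"));
      }

      // Parses the checkpoint string back into per-source-partition offsets.
      public static Map<String, Long> decode(String checkpoint) {
        Map<String, Long> offsets = new HashMap<>();
        if (checkpoint == null || checkpoint.isEmpty()) {
          return offsets;
        }
        for (String entry : checkpoint.split(";")) {
          String[] kv = entry.split("=", 2);
          offsets.put(kv[0], Long.parseLong(kv[1]));
        }
        return offsets;
      }
    }

    Because the task writes to a single destination partition, one composite checkpoint stored against that partition is enough to resume every source partition after a restart.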

    rgibbard
    @rgibbard
    Are there any updates on the status of the trigger-based Oracle CDC connector that is under development?
    Sanjay Kumar
    @sanjay24
    Hey guys, does anyone know if Brooklin supports exactly-once semantics?
    nitin456
    @nitin456
    @nleeuskadi Did you get the answer for MirrorMaker 2.0 vs Brooklin?
    nleeuskadi
    @nleeuskadi
    @nitin456 sadly no answer since I posted my question on April 17th. But I'm still interested in getting input. Thanks!
    skaur05
    @skaur05
    Hey, I am from Wayfair, and here is a blog we wrote about our journey to Brooklin; hopefully it helps you too. https://tech.wayfair.com/2020/06/scaling-kafka-mirroring-pipelines-at-wayfair/
    Sanjay Kumar
    @sanjay24
    @skaur05 thanks for sharing the blog. I’ve had similar experience working with Brooklin. However, I’ve also seen that it’s quite sensitive to network flaps. We are running producers in flushless mode. Could you share your experience dealing with network or source/destination cluster outages?
    skaur05
    @skaur05
    @sanjay24 we also use flushless mode. We mostly have Brooklin clusters set up near the target cluster, but in some cases the consumer and producer clusters are in separate DCs. Let me know what kind of errors you are getting. Network blips are expected in any system, so as long as the Brooklin service can come back up quickly, it is fine. Are the network flaps you observe in communication within the Brooklin cluster, the consumer cluster, or the producer cluster? Hopefully you have set up enough retries for the transport provider: brooklin.server.transportProvider.kafkaTransportProvider.retries
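    A minimal sketch of where that property lives in the Brooklin server properties, assuming the transport provider is registered under the name kafkaTransportProvider (the retry count is an illustrative value, not a recommendation):

    # Kafka producer retries for the transport provider named "kafkaTransportProvider"
    brooklin.server.transportProvider.kafkaTransportProvider.retries=100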
    Ahmed Abdul Hamid
    @ahmedahamid
    @nleeuskadi hello
    • No. Brooklin is actually intended to be a replacement for KMM for Kafka mirroring use-cases.
    • No. Kafka Connect is a totally different product.
    Ahmed Abdul Hamid
    @ahmedahamid
    • You can do it both ways. At LinkedIn, we prefer running Brooklin in the same DC as the destination/target Kafka cluster. We have observed better throughput with this setup in general.
    • Either way is fine. If you have lots of datastreams, a dedicated ZK would be better. Otherwise, you can use the same one as Kafka's.
    Ahmed Abdul Hamid
    @ahmedahamid
    • We typically shoot for 5-8 MB/s/host, with each host running 7-10 connector tasks and each task handling 200-1500 partitions, but that's just guidance since it heavily depends on the actual throughput of each partition
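    As a worked example with those numbers (the aggregate throughput is hypothetical, not from this thread): mirroring 120 MB/s at roughly 6 MB/s/host suggests about 120 / 6 = 20 Brooklin instances; at 7-10 tasks per host that is roughly 140-200 connector tasks, so the per-task partition range of 200-1500 should also be checked against the total number of partitions being mirrored.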
    Ahmed Abdul Hamid
    @ahmedahamid
    @rgibbard we're currently test-driving that. Are you interested in following the news of its development, knowing that it has a dependency on Oracle's Big Data Adapter?
    Ahmed Abdul Hamid
    @ahmedahamid
    @sanjay24 We haven't looked into what it would take to have Brooklin work in setups where it's mirroring Kafka clusters configured to operate under exactly-once semantics (assuming that's what you're referring to).
    We have mostly been running it with the usual at-least-once expectations.
    Revanth
    @revanthpobala
    I am trying to run Brooklin, following this wiki page: https://github.com/linkedin/brooklin/wiki/Streaming-Text-Files-to-Kafka
    When I execute this command to create a datastream, I get a connection refused exception:
    bin/brooklin-rest-client.sh -o CREATE -u http://localhost:32311/ -n first-file-datastream -s NOTICE -c file -p 1 -t kafkaTransportProvider -m '{"owner":"test-user"}'
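    If the CREATE call is refused, a quick first check (a sketch; 32311 is the REST port used in the wiki tutorial, and /datastream is assumed to be the datastream resource path) is whether anything is listening on that port at all:

    curl -v http://localhost:32311/datastream

    A connection refused at this point usually means the Brooklin server itself failed to start, so its log is the next place to look.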
    Ahmed Abdul Hamid
    @ahmedahamid
    @revanthpobala I've responded to the ticket you opened.
    Mario Alberto Romero Sandoval
    @mariors
    Hi, I am dealing with the SSL configuration for both the source and destination clusters on Brooklin. Is there any way to send those configurations in the POST message to the API? I mean, what would the ideal setup be when working with a single destination cluster but multiple sources using TLS?
    I found linkedin/brooklin#619 but I think it will only work with one source cluster
    Yαkηyα Δαβο
    @yakhyadabo
    Question: Has anyone tried to deploy a Brooklin cluster using Kubernetes?
    I did some research but found nothing.
    Sebastian Cheung CQF
    @scheung38
    Hello, I'd like to migrate an on-prem Kafka/ksqlDB data-ingestion cluster into Azure. What is the best practice, and are there any examples showing how easy or difficult this is? Is this categorized as mirroring?
    iftachby
    @iftachby

    Hello, I am trying to test Brooklin as a replacement for Kafka MM. I have 3 source clusters in different DCs and 2 aggregate clusters. (The aggregate clusters share a DC with 2 of the source clusters, so DCs A and B each have a source + aggregate cluster, and DC C has just a source cluster.) I am trying to mirror from all 3 DCs into each aggregate cluster. I created 2 Brooklin clusters, 1 in each DC that has an aggregate cluster, and on each Brooklin cluster a datastream per source cluster.

    It seems that each aggregate only gets messages from the source cluster in the same DC, i.e. the aggregate in DC A gets messages from source A, and the aggregate in DC B gets messages from the source in DC B.

    The output of bin/brooklin-rest-client.sh -o READALL (snippet of the connection strings):
    "source" : {
    "connectionString" : "kafka://kafka-source.service.A.consul:9092/(topicX|topicY)"
    },

    "source" : {
    "connectionString" : "kafka://kafka-source.service.B.consul:9092/(topicX|topicY)"
    },

    "source" : {
    "connectionString" : "kafka://kafka-source.service.C.consul:9092/(topicX|topicY)"
    },

    Of course, the connection string addresses are pingable from all machines.
    Any idea what I'm doing wrong?

    iftachby
    @iftachby
    Upon further inspection, it seems that by default Brooklin does not let more than 1 datastream write to the same topic? Is it possible to overcome this? In KMM we mirrored all 3 sources into 1 topic on each aggregate cluster, and I would like to keep it this way if possible
    iftachby
    @iftachby
    Hi. Can anyone please help me understand how to edit a running datastream (if that's possible)? I want to change the topics mirrored. Do I have to delete it and recreate it, or is there another way?
    Sanjay Kumar
    @sanjay24
    hi @ahmedahamid when are you releasing 1.0.3?
    ivorodrigues
    @ivorodrigues
    Hi all, I have a question:
    Is this output from the status API reporting negative lag?
    If so, how is it possible, and should I be worried about it?
    [
      {
        "key": {
          "topic": "my.topic",
          "partition": 0,
          "datastreamTaskPrefix": "my-kafka-cluster-dc1-dc2",
          "datastreamTaskName": "my-kafka-cluster-dc1-dc2_d83dfd5e-1e80-4422-9d76-cebd07a3d205",
          "connectorTaskStartTime": 1604657249726
        },
        "value": {
          "brokerOffset": 8979,
          "consumerOffset": 9108,
          "assignmentTime": 1604657254571,
          "lastRecordReceivedTimestamp": 1604657479629,
          "lastBrokerQueriedTime": 1604657460109,
          "lastNonEmptyPollTime": 1604657479816
        }
      },
      {
        "key": {
          "topic": "my-topic",
          "partition": 2,
          "datastreamTaskPrefix": "my-kafka-cluster-dc1-dc2",
          "datastreamTaskName": "my-kafka-cluster-dc1-dc2_d83dfd5e-1e80-4422-9d76-cebd07a3d205",
          "connectorTaskStartTime": 1604657249726
        },
        "value": {
          "brokerOffset": 8183,
          "consumerOffset": 8256,
          "assignmentTime": 1604657254571,
          "lastRecordReceivedTimestamp": 1604657479698,
          "lastBrokerQueriedTime": 1604657460268,
          "lastNonEmptyPollTime": 1604657479829
        }
      },
      {
        "key": {
          "topic": "my-topic",
          "partition": 1,
          "datastreamTaskPrefix": "my-kafka-cluster-dc1-dc2",
          "datastreamTaskName": "my-kafka-cluster-dc1-dc2_d83dfd5e-1e80-4422-9d76-cebd07a3d205",
          "connectorTaskStartTime": 1604657249726
        },
        "value": {
          "brokerOffset": 7700,
          "consumerOffset": 7773,
          "assignmentTime": 1604657254571,
          "lastRecordReceivedTimestamp": 1604657479566,
          "lastBrokerQueriedTime": 1604657460198,
          "lastNonEmptyPollTime": 1604657479753
        }
      }
    ]
    ivorodrigues
    @ivorodrigues
    To calculate the lag in this case, it's consumerOffset - brokerOffset, right?
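    For reference, consumer lag is conventionally computed the other way around, brokerOffset - consumerOffset (how far the consumer trails the broker). Worked through for partition 0 above: 8979 - 9108 = -129. A negative value like this is plausible when the two offsets are not sampled at the same moment; in the snippet, lastBrokerQueriedTime (1604657460109) predates lastRecordReceivedTimestamp (1604657479629), so the broker offset is stale relative to the consumer offset.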
    André Cardoso
    @cardosoa2

    Greetings mates,
    I am exploring the Brooklin framework as a solution for data replication at Fanduel.
    But I am getting this error message in the log:
    Caused by: org.apache.kafka.common.errors.RecordTooLargeException: The message is 1260380 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
    [2020-11-20 12:11:53,118] WARN Detect exception being thrown from callback for src partition: soccer.ly.agglomerated.events-6 while sending, metadata: null , exception: (com.linkedin.datastream.connectors.kafka.mirrormaker.KafkaMirrorMakerConnectorTask) com.linkedin.datastream.server.api.transport.SendFailedException: com.linkedin.datastream.common.DatastreamRuntimeException: org.apache.kafka.common.errors.RecordTooLargeException: The message is 1260380 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
    The message is self-explanatory; however, my difficulty right now is changing the producer's max request size property.
    I see that LiKafkaProducerFactory.java instantiates a producer with 100MB.
    How can I instantiate a consumer with the same size? Is it done via the configuration file, or by HTTP request when creating the datastream? Do you have any example request?

    Thank you in advance

    FYI: I am using the kafkaMirroringConnector
    André Cardoso
    @cardosoa2
    Current producer config in the logs:

    [2020-11-20 12:11:53,107] INFO ProducerConfig values:
        acks = 1
        batch.size = 16384
        bootstrap.servers = [....]
        buffer.memory = 33554432
        client.id = datastream-producer
        compression.type = none
        connections.max.idle.ms = 540000
        enable.idempotence = false
        interceptor.classes = []
        key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
        linger.ms = 0
        max.block.ms = 60000
        max.in.flight.requests.per.connection = 5
        max.request.size = 1048576
        metadata.max.age.ms = 300000

    The max.request.size is 1MB... shouldn't it be 100MB?
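    If the deployment follows the wiki-style configuration, where properties under the transport provider prefix are passed through to the Kafka producer, one sketch of an override (assuming the transport provider is named kafkaTransportProvider; 104857600 bytes = 100MB is an illustrative value) is:

    # server.properties: pass max.request.size through to the Kafka producer
    brooklin.server.transportProvider.kafkaTransportProvider.max.request.size=104857600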

    iftachby
    @iftachby
    Hi everyone, is it possible to tell Brooklin to mirror a Kafka topic from latest or from a specific offset?
    Sanjay Kumar
    @sanjay24
    @iftachby you can specify "system.auto.offset.reset": "latest" as metadata attribute while creating your datastream
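    For example, a mirroring datastream created with that attribute might look like this (a sketch following the shape of the quick-start command earlier in this thread; the datastream name, broker address, and topic are placeholders):

    bin/brooklin-rest-client.sh -o CREATE -u http://localhost:32311/ -n my-mirror-datastream -s "kafka://source-broker:9092/topicX" -c kafkaMirroringConnector -t kafkaTransportProvider -m '{"owner":"test-user","system.auto.offset.reset":"latest"}'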
    iftachby
    @iftachby
    @sanjay24 I already added the same parameter to my brooklin server.properties file and that helped. Thanks!
    Sanjay Kumar
    @sanjay24
    @iftachby Adding that in metadata can help you specify it at datastream level
    iftachby
    @iftachby
    @sanjay24 got it - thanks!
    iftachby
    @iftachby
    Hi everyone
    I wanted to ask if there are plans to release new versions of Brooklin (perhaps with a newer Kafka client? :) )
    Thanks!
    Mike Papetti
    @papetti23
    Is anyone using Brooklin to manage data in InfluxDB? We're well over cardinality limits and having to make trade-offs between memory, retention, and cardinality
    I feel like there might be an opportunity since Kafka is already in our stack, but I'm not quite sure if mirroring or CDC is the best pattern since the InfluxDB instances are sharded out
    Mike Papetti
    @papetti23
    Or Sybase? Anyone using Brooklin for an old-ass version of Sybase for change data capture?