Spark Kafka Integration using receiver with WAL - hadoop

I was reading below blog in Databricks
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
While explaining how Spark Kafka integration works using a receiver with a WAL, it says:
1. The Kafka data is continuously received by Kafka Receivers running in the Spark workers/executors. This used the high-level consumer API of Kafka.
2. The received data is stored in Spark’s worker/executor memory as well as to the WAL (replicated on HDFS). The Kafka Receiver updated Kafka’s offsets to Zookeeper only after the data has been persisted to the log.
Now my doubt is: how can a high-level consumer update the offset in ZooKeeper? As far as I understood, the high-level consumer does not handle offsets itself; that is handled via ZooKeeper, so once we read a message from Kafka the offset is updated in ZooKeeper automatically.
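For reference, the receiver-based stream from the blog is set up roughly like this (only a sketch of the 0.8 receiver API; the checkpoint path, topic name and group id are placeholders):

    import java.util.Collections;
    import java.util.Map;

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class ReceiverWithWal {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("kafka-receiver-wal");
            // Turn on the write-ahead log for receivers (Spark 1.2+)
            conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");

            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
            // The WAL files are written under the checkpoint directory on HDFS
            jssc.checkpoint("hdfs:///checkpoints/kafka-wal");

            // Receiver-based stream built on Kafka's high-level consumer
            Map<String, Integer> topics = Collections.singletonMap("events", 1);
            JavaPairReceiverInputDStream<String, String> stream =
                    KafkaUtils.createStream(jssc, "zk1:2181", "my-consumer-group", topics);

            stream.print();
            jssc.start();
            jssc.awaitTermination();
        }
    }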

When a consumer retrieves data from a particular topic in Kafka, it is the consumer's responsibility to commit the offsets. The old high-level consumer API commits offsets to ZooKeeper, either automatically (auto.commit.enable=true) or explicitly via commitOffsets(); the newer org.apache.kafka.clients.consumer.* API commits offsets to Kafka itself once you have processed the data from that topic.
In the receiver-based approach, Spark uses Kafka's high-level consumer API with auto-commit disabled, and the receiver itself updates the offsets in ZooKeeper only after the received data has been persisted to the WAL.
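To make that concrete, here is a rough sketch of the pattern using the old high-level consumer API (Kafka 0.8's kafka.consumer classes, which is what the receiver is built on). persistToWal() is a made-up stand-in for "store in executor memory and the HDFS-backed log":

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    public class ManualOffsetHighLevelConsumer {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zk1:2181");
            props.put("group.id", "wal-receiver-group");
            // Disable auto-commit so the application decides when the offset reaches ZooKeeper
            props.put("auto.commit.enable", "false");

            ConsumerConnector connector =
                    Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                    connector.createMessageStreams(Collections.singletonMap("events", 1));
            ConsumerIterator<byte[], byte[]> it = streams.get("events").get(0).iterator();

            while (it.hasNext()) {
                MessageAndMetadata<byte[], byte[]> msg = it.next();
                persistToWal(msg.message());   // persist first ...
                connector.commitOffsets();     // ... then let the offset reach ZooKeeper
            }
        }

        private static void persistToWal(byte[] payload) {
            // placeholder for storing the data durably before committing the offset
        }
    }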

Related

Kafka connect with EventStoreDB

I'm working on a small academic project: event sourcing with EventStoreDB and Apache Kafka as a broker. The idea is to get events from EventStoreDB and push them to Kafka for further distribution. I saw that Apache Kafka has connectors to different DB systems, but I didn't find any connector for EventStoreDB.
How can I create (or reuse an existing) Kafka connector for EventStoreDB, so these two systems would be able to transfer events in both directions, from Kafka to EventStoreDB and from EventStoreDB to Kafka?
There is no official Kafka Connect connector between Kafka and EventStoreDB, and I haven't heard of any unofficial one so far. Still, there is a tool called Replicator that enables replicating data from EventStoreDB to Kafka (https://replicator.eventstore.org/docs/features/sinks/kafka/). It's open source, so you can either use it or check the implementation.
For the EventStoreDB-to-Kafka direction, I recommend using the subscriptions mechanism: catch-up subscriptions if you need an ordering guarantee, persistent subscriptions if ordering is not critical: https://developers.eventstore.com/clients/grpc/subscriptions.html. The crucial part here is to define how to map EventStoreDB streams to Kafka topics and partitions. Typically you'd expect at least an ordering guarantee at the stream level, so events of a single stream should land in the same partition.
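The partition mapping itself is simple on the Kafka side. A minimal sketch, assuming you already receive (streamId, payload) pairs from an EventStoreDB catch-up subscription handler (the topic name is made up): keying by stream id means all events of one stream land in the same partition, preserving their order.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class EventStoreToKafkaForwarder implements AutoCloseable {

        private final KafkaProducer<String, String> producer;

        public EventStoreToKafkaForwarder(String bootstrapServers) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            this.producer = new KafkaProducer<>(props);
        }

        // Called from the subscription handler for every resolved event
        public void forward(String streamId, String eventJson) {
            // Key = EventStoreDB stream id -> the same stream always hits the same partition
            producer.send(new ProducerRecord<>("eventstore-events", streamId, eventJson));
        }

        @Override
        public void close() {
            producer.close();
        }
    }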
For the Kafka-to-EventStoreDB direction, you could either write your own pass-through service or try the HTTP sink connector (e.g. https://docs.confluent.io/kafka-connect-http/current/overview.html). EventStoreDB exposes an HTTP API (https://developers.eventstore.com/clients/http-api/v5/introduction/). Side note: this API (AtomPub-based) may be replaced with another HTTP API in the future, so the structure may change.
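If you go the pass-through route, a rough sketch with the plain consumer API and Java's HttpClient could look like this. The topic, stream-naming scheme, event type and localhost addresses are all assumptions, and you should double-check the ES-EventId/ES-EventType headers against the v5 HTTP API docs linked above:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import java.util.UUID;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class KafkaToEventStorePassThrough {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "eventstore-writer");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            HttpClient http = HttpClient.newHttpClient();

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("orders"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        // One EventStoreDB stream per Kafka key (an assumption; adjust as needed)
                        HttpRequest request = HttpRequest.newBuilder()
                                .uri(URI.create("http://localhost:2113/streams/orders-" + record.key()))
                                .header("Content-Type", "application/json")
                                .header("ES-EventType", "OrderMessage")
                                .header("ES-EventId", UUID.randomUUID().toString())
                                .POST(HttpRequest.BodyPublishers.ofString(record.value()))
                                .build();
                        http.send(request, HttpResponse.BodyHandlers.discarding());
                    }
                    consumer.commitSync(); // commit only after the batch has been written downstream
                }
            }
        }
    }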
You can use Event Store Replicator, which has a Kafka sink.
Keep in mind that it doesn't do anything with regard to event schemas, so things like Kafka Streams and KSQL might not work properly.
The sink was created solely for the purpose of pushing events to Kafka when Kafka is used as a message broker.

Spring Kafka JDBC Connector compatibility

Is the Kafka JDBC connector compatible with the Spring Kafka library?
I followed https://www.confluent.io/blog/kafka-connect-deep-dive-jdbc-source-connector/ and still have some confusion.
Let's say you want to consume from a Kafka topic and write to a JDBC database. Some of your options are:
Use a plain Kafka consumer to consume from the topic and use the JDBC API to write the consumed records to the database.
Use Spring Kafka to consume from the Kafka topic and Spring's JdbcTemplate or Spring Data to write them to the database (see the sketch after this answer).
Use Kafka Connect with the JDBC connector as a sink to read from the topic and write to a table.
So as you can see:
The Kafka JDBC connector is a specialised component that can only do one job.
The Kafka consumer is a very generic component which can do a lot of jobs, but you will be writing a lot of code. In fact, it is the foundational API that the other frameworks build on and specialise.
Spring Kafka simplifies it and lets you deal with Kafka records as Java objects, but doesn't tell you how to write those objects to your database.
So they are alternative solutions for the same task. Having said that, you may have a flow where different segments are controlled by different teams; for each segment any of them can be used, and the Kafka topic will act as the joining channel.
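A minimal sketch of option 2 (Spring Kafka plus JdbcTemplate), assuming a String-valued topic called orders and a table orders_raw, both made up; Spring Boot auto-configures the JdbcTemplate and the listener container from application properties:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.stereotype.Component;

    @Component
    public class OrdersToJdbcSink {

        private final JdbcTemplate jdbc;

        public OrdersToJdbcSink(JdbcTemplate jdbc) {
            this.jdbc = jdbc;
        }

        @KafkaListener(topics = "orders", groupId = "orders-db-writer")
        public void onMessage(ConsumerRecord<String, String> record) {
            // Persist the raw record; mapping the payload to proper domain columns is up to you
            jdbc.update("INSERT INTO orders_raw (record_key, payload) VALUES (?, ?)",
                    record.key(), record.value());
        }
    }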

Persist state of Kafka Producer within Spring Cloud/Boot

I want to implement a Kafka producer with Spring that observes a cloud storage and emits meta information about newly arrived files.
Until now we did that with a Kafka connector, but for various reasons we now have to do this with a plain Kafka producer.
Now I need to persist the state of the producer (e.g. the timestamp of the last committed file) in a kind of offset topic, like the connector did, but I did not find a reasonable approach to do that.
My current idea is to hold the state by committing it to a topic that the producer also consumes, and to acknowledge the last consumed state only when committing a new one. That way, if the Kubernetes pod of the producer dies and comes up again, it consumes the last (not yet acknowledged) state and so knows where it stopped.
But this idea seems a bit complex just to hold the state of a Kafka app. Is there a better approach for that?
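For what it's worth, one way to sketch the "state in a topic" idea without the acknowledge-the-previous-record dance is a log-compacted topic with a fixed key: write the new state after every successfully processed file, and on startup read the topic to the end and keep the last value for that key. A rough sketch; the topic name, key and lastCommittedFileTimestamp are made up:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class WatcherStateStore implements AutoCloseable {

        private final KafkaProducer<String, String> producer;

        public WatcherStateStore(String bootstrapServers) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            this.producer = new KafkaProducer<>(props);
        }

        // Call this after the metadata for a file has been successfully produced
        public void saveState(long lastCommittedFileTimestamp) throws Exception {
            // "producer-state" should be created with cleanup.policy=compact so that
            // only the latest value per key is retained
            producer.send(new ProducerRecord<>("producer-state", "cloud-storage-watcher",
                    Long.toString(lastCommittedFileTimestamp))).get();
        }

        @Override
        public void close() {
            producer.close();
        }
    }

On startup you would consume producer-state from the beginning and take the last value for the key to know where to resume.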

Does Spring Kafka producer guarantee delivery by default?

I wonder whether the Spring Kafka producer within Spring Boot guarantees delivery or not.
Does anybody know what happens if some random listener fails to receive a message? Would Spring Kafka retry sending the message?
There are some concepts here:
The producer produces events and sends them to the Kafka server. On the producer side you must take care of retries and similar settings yourself, in case Kafka has downtime or other error scenarios specific to your context (see the sketch below).
Consumers get partitions assigned to them by Kafka; each partition delivers events, and each event has an offset. Consumers poll for data from Kafka (they request data; Kafka does not push data to consumers). For every event that is delivered successfully, the consumer acknowledges it and the offset of that event is committed, so the next event, with a higher offset, will be delivered to the consumer. If a consumer goes down, its partitions are reassigned to other consumers, so you won't lose your data. If you have only one consumer, the data stays stored in Kafka and, when the consumer comes back, it will request data from the latest/earliest offset.
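To make the producer side concrete: delivery guarantees come mostly from producer configuration rather than from Spring itself. A sketch of a Spring Kafka producer factory with the usual reliability settings (the broker address is a placeholder):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.core.DefaultKafkaProducerFactory;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.kafka.core.ProducerFactory;

    @Configuration
    public class ProducerReliabilityConfig {

        @Bean
        public ProducerFactory<String, String> producerFactory() {
            Map<String, Object> props = new HashMap<>();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            // Wait for all in-sync replicas and retry transient failures without duplicates
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
            return new DefaultKafkaProducerFactory<>(props);
        }

        @Bean
        public KafkaTemplate<String, String> kafkaTemplate(ProducerFactory<String, String> pf) {
            return new KafkaTemplate<>(pf);
        }
    }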

Kafka consumer able to process tons of messages

In the general case a Kafka consumer could be anything that connects to Kafka and gets messages.
I'm interested in known Kafka consumers for several purposes:
1) process messages and save the result in a DB (Oracle)
2) process messages and save the result in files
What established Kafka consumers can you suggest?
Thanks.
You can use the Camus consumer for Kafka->HDFS. It is a MapReduce job that does distributed data loads out of Kafka.
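If you end up writing your own, the file case needs little more than the plain consumer API. A minimal sketch (topic name and output path are made up) that polls and appends each record value as a line:

    import java.io.BufferedWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class FileSinkConsumer {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "file-sink");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 BufferedWriter out = Files.newBufferedWriter(Paths.get("/tmp/events.log"),
                         StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                consumer.subscribe(Collections.singletonList("events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        out.write(record.value());
                        out.newLine();
                    }
                    out.flush(); // flush before the next poll (and auto offset commit) to avoid losing lines
                }
            }
        }
    }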
