Schedule creation of consumer in Kafka using kafka-go

I am new to Kafka and currently working with it. I am using kafka-go in Golang to create a producer and consumer. Currently I am able to create a producer, but I want a consumer to be created once a producer of a topic is created, and not every time; that is, for each topic, a consumer is created only once. Also, when there is a need to create more consumers for a topic to balance load, they should be created. Is there any way to schedule that, either through goroutines or Faktory?

You should not have a coupled producer/consumer; Kafka lets you have totally decoupled producers and consumers.
You can run your consumer even if the topic does not exist (Kafka will create it; you'll just get a leader-unavailable warning), and run your producer whenever you want.
Regarding scaling, the idea is that you create as many partitions as the number of consumers you might ever want to scale the topic to.
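For illustration, here is a minimal consumer-group sketch with the plain Java client (broker address, group ID, and topic name are placeholders; kafka-go offers the same semantics via a Reader configured with a GroupID). Start several processes like this with the same group.id, and Kafka splits the topic's partitions among them automatically:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DecoupledConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "my-consumer-group");       // same group => partitions are shared out
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // This works even before any producer exists; with auto topic creation
            // enabled, the broker creates the topic and logs a leader-unavailable
            // warning until a leader is elected.
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}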
Here is some reading about topic partition strategy:
https://blog.newrelic.com/engineering/effective-strategies-kafka-topic-partitioning/
There is a lot more reading about this on the web.
Yannick

Related

How can we check that there are no more events or messages left on the topic to consume?

Is there a way to check that there are no more events or messages left to consume on a topic in Spring Boot Kafka? In my scenario, I have a requirement where I receive data from two sources: one is a Kafka topic, and for the other I can get a complete dump of the data by connecting to some other DB. So after consuming all the messages from the Kafka topic, I need to compare the count of data I received from the topic with the data count I get from the DB.
Is it possible to do so? I know how to write the code in Spring Boot to start consuming events from a Kafka topic, and how to make a DB connection to get data from one DB table and insert it into another DB table.
See the documentation about detecting idle listener containers.
While efficient, one problem with asynchronous consumers is detecting when they are idle. You might want to take some action if no messages arrive for some period of time.
You can configure the listener container to publish a ListenerContainerIdleEvent when some time passes with no message delivery. While the container is idle, an event is published every idleEventInterval milliseconds.
...
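As a hedged sketch of what that can look like (the listener ID "eventsListener", topic name "events", and 60-second interval are illustrative, not from the question):

import org.springframework.context.annotation.Bean;
import org.springframework.context.event.EventListener;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.event.ListenerContainerIdleEvent;
import org.springframework.stereotype.Component;

@Component
public class TopicDrainDetector {

    // Publish a ListenerContainerIdleEvent after 60s with no records delivered.
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.getContainerProperties().setIdleEventInterval(60_000L);
        return factory;
    }

    @KafkaListener(id = "eventsListener", topics = "events")
    public void consume(String message) {
        // normal processing: count/store each record
    }

    // Fires while the listener is idle; a reasonable point to compare the
    // count consumed from the topic with the count from the DB dump.
    @EventListener(condition = "event.listenerId.startsWith('eventsListener')")
    public void onIdle(ListenerContainerIdleEvent event) {
        // run the Kafka-count vs. DB-count comparison here
    }
}

Note that idleness only means no records arrived during the interval, so treat it as a heuristic for "topic drained" rather than a guarantee.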

Correct Number of Partitions/Replicas for @RetryableTopic Retry Topics

Hello Stack Overflow community and anyone familiar with spring-kafka!
I am currently working on a project which leverages the @RetryableTopic feature from spring-kafka in order to reattempt delivery of failed messages. The listener annotated with @RetryableTopic consumes from a topic that has 50 partitions and 3 replicas. When the app receives a lot of traffic, it may be autoscaled up to 50 instances of the app (consumers) reading from those partitions. I read in the spring-kafka documentation that, by default, the retry topics that @RetryableTopic auto-creates are created with one partition and one replica, but you can change these values with autoCreateTopicsWith() in the configuration. From this, I have a few questions:
With the autoscaling in mind, is it recommended to just create the retry topics with the same number of partitions and replicas (50 & 3) as the original topic?
Is there some benefit to having differing numbers of partitions/replicas for the retry topics considering their default values are just one?
The retry topics should have at least as many partitions as the original (by default, records are sent to the same partition); otherwise you have to customize the destination resolution to avoid the warning log. See Destination resolver returned non-existent partition
50 partitions might be overkill unless you get a lot of retried records.
It's up to you how many replicas you want, but in general, yes, I would use the same number of replicas as the original.
Only you can decide what are the "correct" numbers.
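For reference, a sketch of setting those values via autoCreateTopicsWith() (assuming a recent spring-kafka and a String KafkaTemplate; the 50 and 3 simply mirror the original topic):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.retrytopic.RetryTopicConfiguration;
import org.springframework.kafka.retrytopic.RetryTopicConfigurationBuilder;

@Configuration
public class RetryTopicConfig {

    @Bean
    public RetryTopicConfiguration retryTopicConfiguration(KafkaTemplate<String, String> template) {
        return RetryTopicConfigurationBuilder.newInstance()
                // Create the auto-generated retry/DLT topics with the same
                // partition count and replication factor as the main topic.
                .autoCreateTopicsWith(50, (short) 3)
                .create(template);
    }
}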

Spring Boot Kafka: multiple microservice instances, concurrency and partitions

I have a question about publishing and reading messages in Kafka in a microservices architecture with multiple instances of the same microservice for writing and reading.
My main problem here is that the microservices that publish and read are configured with autoscaling, but a default number of instances of 1.
The point is that I have an entity, let's call it "Event", that is stored in the DB, and each entity has its own ID in the DB. When a specific command is executed on a specific entity (say, with entityID = ajsha87), a message must be published that will be read by a consumer. If each of these messages for the same entity is written to different partitions and consumed at the same time (a concurrency issue), I will have a lot of problems.
My question is whether, according to the entityID for example, I can set the partition to which all events of this specific entity will be published. For another entity with a different ID I don't care about the partition, but the messages for the same entity must always be published to the same partition, to avoid a consumer reading a message (2) published after a message (1) before it.
Is there any mechanism to do that, or do I have to randomly choose and store in the DB, each time I save the entity, the partition ID to which its messages will be published?
The same happens with consumers. Only one consumer can read a partition at a time, because otherwise consumer number 1 could read message (1) from partition (1), related to entity (ID=78198), and then another could read message (2) from partition (1), related to the same entity, and process message (2) before number one finishes.
Is there any mechanism to subscribe each instance to only one partition, in line with the microservice autoscaling?
Another option would be to dynamically assign a partition to each new publisher instance, but I don't know how to configure that dynamically, setting different partition IDs according to the microservice instance.
I am using Spring Boot, by the way.
Thanks for your answers and recommendations, and sorry if my English is not good enough.
If you use the hash partitioner as the partitioner in the producer config (this is the default partitioner in many libraries) and use the same key for the same entity (say, with entityID = ajsha87), Kafka will send all messages with the same key to the same partition.
If you are using a consumer group, one consumer instance takes responsibility for each partition, and all messages published to that partition are consumed by that instance only. The assignment can change if there is a rebalance when upscaling, but messages in the same partition will still be read by a single consumer instance.
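A minimal Spring Boot sketch of the producing side (the topic name "entity-events" is made up; what matters is passing the entity ID as the record key so the default partitioner hashes it to a fixed partition):

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class EventPublisher {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public EventPublisher(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publish(String entityId, String payload) {
        // With the entity ID as the key, every event for e.g. entityID = ajsha87
        // hashes to the same partition, so the single group consumer assigned to
        // that partition sees the entity's messages in publish order. No partition
        // IDs need to be stored in the DB.
        kafkaTemplate.send("entity-events", entityId, payload);
    }
}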

Kafka state-store on different scaled instances

I have 5 different machines, each running 5 scaled Spring Boot instances of a kafka-streams application. I am using a 50-partition compacted topic, plus 2-3 other topics, and each instance has a concurrency of 10. I am using Docker Swarm and Docker volumes. Using these topics, my Kafka Streams app does some flatMap, map, and join operations on KTables or KStreams.
props.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp/kafka-streams"); // note: /tmp does not survive container restarts
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 2);
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100);
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 10);
props.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);
If everything goes OK, nothing is wrong and there is no data loss in my application's .join() operations, but when one of my instances goes down, my join operations stop actually joining.
My question is: when the app is restarted or redeployed (and given that it's working inside a non-persistent container), its state is gone, right? Then my join operations don't work. When I redeploy my instance and repopulate my compacted topic from Elasticsearch with the latest entities, my join operations are OK again. So I think when my application starts on a new machine, my local state store is gone? But the Kafka documentation says:
If tasks run on a machine that fails and are restarted on another machine, Kafka Streams guarantees to restore their associated state stores to the content before the failure by replaying the corresponding changelog topics prior to resuming the processing on the newly started tasks. As a result, failure handling is completely transparent to the end user.
Note that the cost of task (re)initialization typically depends primarily on the time for restoring the state by replaying the state stores' associated changelog topics. To minimize this restoration time, users can configure their applications to have standby replicas of local states (i.e. fully replicated copies of the state). When a task migration happens, Kafka Streams then attempts to assign a task to an application instance where such a standby replica already exists in order to minimize the task (re)initialization cost. See num.standby.replicas at the Kafka Streams Configs Section.
(https://kafka.apache.org/0102/documentation/streams/architecture)
Does my downed instance refresh the Kafka state store when it comes back up? If so, why am I losing data? I have no idea :/ Or can it not reload the state store because of the committed offsets, since all my instances use the same applicationId?
Thanks!
The changelog topics are always read from the earliest offset, and they're compacted, so they don't lose data.
If you're joining non-compacted topics, then sure, you lose data, but that's not limited to Kafka Streams or your specific use case... You'll need to configure the topics to retain data for at least as long as you think it'll take to solve any issues with topic downtime. While the data is retained, you can always seek your consumer back to it.
If you want persistent storage, use a volume mount to your container (via Kubernetes, for example), or plug in a state store stored externally to the container, like Redis: https://github.com/andreas-schroeder/redisks
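For example, assuming the container mounts a persistent volume at /var/lib/kafka-streams (the path is illustrative), pointing the state directory there lets the local RocksDB state survive a redeploy, so only the changelog delta has to be replayed on startup:

// Assumes /var/lib/kafka-streams is a volume that outlives the container.
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");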

Azure Event Hub load distribution

I have an Event Hub solution where a lot of publishers publish data to the hub. Currently we are not using partitions. I would like a solution where multiple listeners/subscribers can consume these events in parallel. E.g.:
If there is an eventA and an eventB, can I have one listener receive eventA and another listener receive eventB, so that the load is distributed?
I have to do some compute on each event, so I want the computation distributed, not duplicated.
Yes, that's what partitions are for. For a given consumer group, there can be multiple readers splitting the work among them, but their maximum number is limited by the partition count.
Each consumer locks one or more partitions and is the only one working on events from those.
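As a hedged sketch with the Azure SDK for Java (the connection strings, hub name, and checkpoint container are placeholders), running one EventProcessorClient per instance load-balances partition ownership across all instances in the same consumer group:

import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.storage.blob.BlobContainerAsyncClient;
import com.azure.storage.blob.BlobContainerClientBuilder;

public class LoadBalancedListener {
    public static void main(String[] args) {
        // Blob container used to persist checkpoints and coordinate partition ownership.
        BlobContainerAsyncClient checkpoints = new BlobContainerClientBuilder()
                .connectionString("<storage-connection-string>")
                .containerName("checkpoints")
                .buildAsyncClient();

        EventProcessorClient processor = new EventProcessorClientBuilder()
                .connectionString("<event-hub-connection-string>", "<event-hub-name>")
                .consumerGroup("$Default")
                .checkpointStore(new BlobCheckpointStore(checkpoints))
                .processEvent(ctx -> {
                    // Each running instance owns a subset of partitions, so each event
                    // is computed on exactly one instance (barring failover replays).
                    System.out.printf("partition %s: %s%n",
                            ctx.getPartitionContext().getPartitionId(),
                            ctx.getEventData().getBodyAsString());
                    ctx.updateCheckpoint();
                })
                .processError(ctx -> System.err.println(ctx.getThrowable()))
                .buildEventProcessorClient();

        processor.start(); // run one of these per listener instance
    }
}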
