Kafka Topic and Partition allocation for consumer - go

I just started working with Kafka. I need to develop a consumer client using the sarama Go package. The client is supposed to be part of a consumer group and needs to read from two topics, A and B. For topic A, the client should read from whatever partitions a balance strategy allocates to it; for topic B, it needs to read from all partitions (B is effectively a broadcast topic).
Workflow:
Consumer group: xx.
I have two topics, A and B, with 6 partitions [0,1,2...5] each.
I have two consumers, C1 and C2, in xx; data should be read in such a way that:
C1 reads from A:[0,1,2] and from B:[0,1,2,3,4,5]
C2 reads from A:[3,4,5] and from B:[0,1,2,3,4,5]
Note: in case a new client is added, the partitions in A should be rebalanced, and all partitions in B should still be read by every client.
I tried implementing a custom balance strategy but failed. Please let me know whether this can be done and how to do it.

Within a single consumer group, Kafka assigns each partition to exactly one member, so multiple consumers in the same group can never listen to overlapping partitions. In other words, having every client read all partitions of topic B cannot be done through group assignment alone, regardless of the rebalancing strategy you pick for topic A; the consumption of B has to happen outside the shared group (for example, via a standalone partition consumer per client).

You need to implement both a partition consumer and a group consumer in your service.
1. Group Consumer
Use a group consumer to consume messages from topic "A". You can use the ConsumerGroup client in the Sarama library (your code implements the ConsumerGroupHandler interface). Both of your consumers, "C1" and "C2", need to subscribe to topic "A" as a group, using the same group ID.
2. Partition Consumer
Use partition consumers to consume messages from topic "B". Use the Consumer interface in the Sarama library for this. "C1" and "C2" need to subscribe to all of B's partitions at startup, and again after a rebalance; a sketch of the combined setup follows below.
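Sarama's types aside, the pattern itself is easiest to see in the plain Java kafka-clients API, so here is a minimal sketch of it in Java rather than sarama: one group-managed consumer subscribes to A, and one standalone consumer manually assigns itself every partition of B. Topic names "A" and "B" and the group "xx" come from the question; the broker address and everything else are assumptions.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class DualTopicClient {
    public static void main(String[] args) {
        String deser = "org.apache.kafka.common.serialization.StringDeserializer";

        // 1) Group-managed consumer for topic A: C1 and C2 both use group "xx",
        //    so the broker balances A's six partitions between them.
        Properties groupProps = new Properties();
        groupProps.put("bootstrap.servers", "localhost:9092");
        groupProps.put("group.id", "xx");
        groupProps.put("key.deserializer", deser);
        groupProps.put("value.deserializer", deser);
        KafkaConsumer<String, String> groupConsumer = new KafkaConsumer<>(groupProps);
        groupConsumer.subscribe(Collections.singletonList("A"));

        // 2) Standalone consumer for topic B: no group membership, manual
        //    assignment of every partition, so each client reads all of B.
        Properties soloProps = new Properties();
        soloProps.put("bootstrap.servers", "localhost:9092");
        soloProps.put("key.deserializer", deser);
        soloProps.put("value.deserializer", deser);
        KafkaConsumer<String, String> broadcastConsumer = new KafkaConsumer<>(soloProps);
        List<TopicPartition> allOfB = new ArrayList<>();
        for (PartitionInfo p : broadcastConsumer.partitionsFor("B")) {
            allOfB.add(new TopicPartition("B", p.partition()));
        }
        broadcastConsumer.assign(allOfB);

        // Poll each consumer from its own thread (KafkaConsumer is not
        // thread-safe). The group consumer rebalances A automatically;
        // the manually assigned consumer for B never rebalances.
    }
}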

Related

Spring Boot Kafka Listener concurrency

I am using Spring Boot @KafkaListener in my application. Let's assume I use the configuration below:
Topic Partitions : 2
spring.kafka.listener.concurrency : 2
group-id : TEST_GRP_ID
Acknowledgement : Manual
My questions are:
As per my knowledge, concurrency will create parallel threads to consume messages.
So if thread 1 consumes one batch of records and thread 2 consumes another batch, will the messages within each batch be processed sequentially and the offsets then committed?
If I have two instances of the microservice in my cloud environment (in production, more partitions and more instances), how will concurrency work? Will each instance create two parallel threads for my Kafka consumer?
How can I improve the performance of my consumer, i.e., make consumption and processing of the messages faster?
Your understanding is not too far from the truth. In fact, only one consumer per partition can exist for a given group. The concurrency number gives an approximate number of target consumers, and independently of the number of microservice instances, at most two consumers can be active if you have only two partitions in your topic.
So, to increase performance you need more than 2 partitions, or more topics to consume; then they can all be distributed evenly between your instances and their consumers.
See more info in the Kafka consumer docs: https://docs.confluent.io/platform/current/clients/consumer.html
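For reference, a minimal sketch of such a listener in Spring Kafka, matching the configuration in the question (concurrency 2, manual acknowledgment); the topic name and payload type are assumptions:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class TestListener {

    // concurrency = "2" creates two listener containers; with a 2-partition
    // topic, each container owns exactly one partition. The Acknowledgment
    // parameter requires manual ack mode (spring.kafka.listener.ack-mode: manual).
    @KafkaListener(topics = "test-topic", groupId = "TEST_GRP_ID", concurrency = "2")
    public void listen(ConsumerRecord<String, String> record, Acknowledgment ack) {
        // process the record ...
        ack.acknowledge();   // manual offset commit
    }
}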
✓ You have concurrency set to 2, which means 2 containers will be created for your listener.
✓ As you have 2 partitions in the topic, messages from both partitions will be consumed and processed in parallel.
✓ When you spin up one more instance with the same group name, the first thing that will happen is a group rebalance.
✓ Despite this event, since at any point in time only one consumer from a given consumer group can be assigned to a partition, in the end only 2 containers will be listening for messages and the other 2 containers will simply remain idle.
✓ In order to achieve more scalability, we need to add more partitions to the topic, thereby allowing more active listener containers (a sketch of doing this programmatically follows below).
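If you do decide to grow the topic, partitions can be added with the Kafka AdminClient; a minimal sketch, where the topic name, broker address, and partition counts are assumptions (and note that adding partitions changes the key-to-partition mapping for new records):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class AddPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Grow the topic from 2 to 4 partitions so that up to 4 listener
        // containers in the group can become active.
        try (AdminClient admin = AdminClient.create(props)) {
            admin.createPartitions(
                    Collections.singletonMap("test-topic", NewPartitions.increaseTo(4)))
                 .all().get();
        }
    }
}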

How do I get two topics that have the same partition key and number of partitions to land on the same consumer within a Kafka Streams application

I am trying to create a Kafka Streams service where:
1. I initialize a cache in a processor that is then updated by consuming messages from a topic, say "nodeStateChanged", partitioned on a key, let's say locationId.
2. I need to check the node state when I consume another topic, say "Report", again keyed by the same locationId. Effectively I am joining with the table created by nodeStateChanged.
How do I ensure that all the updates for nodeStateChanged land on the same instance as the Report topic, so that the lookup for a location is possible when a new report is received? Do 1 and 2 need to be created by the same topology, or is it okay to create two separate topologies that share the same APPLICATION_ID_CONFIG?
You don't need to do anything. Kafka Streams always co-partitions topics. I.e., if you have a sub-topology that reads from multiple topics with N partitions each, you get N tasks, and each task processes the corresponding partitions: task 0 processes partition 0 of both input topics, task 1 processes partition 1 of both input topics, etc.
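As an illustration, the lookup described in the question is just a stream-table join, which relies on exactly this co-partitioning; a minimal sketch, assuming String keys/values with default String serdes configured, and a hypothetical output topic "enrichedReports":

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// nodeStateChanged becomes a table keyed by locationId; because "Report"
// has the same key and partition count, each task can join locally.
KTable<String, String> nodeState = builder.table("nodeStateChanged");
KStream<String, String> reports = builder.stream("Report");

reports.join(nodeState, (report, state) -> state + "|" + report)
       .to("enrichedReports");   // hypothetical output topic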

Spring kafka multiple consumers result combination

I have a producer which produces some messages (10, for example).
There are n partitions and a consumer group with n consumers.
Kafka will distribute the messages among the consumers.
How do I combine the messages from all the consumers in one place so that I have all 10 messages?
I am using Kafka with Spring.
Create a consumer group with only one consumer; then you will get the records from all the partitions in one place.
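A minimal sketch of such a collector with Spring Kafka; the topic name and group id are placeholders. Since this listener is the only member of its group, it is assigned every partition and therefore sees all 10 messages:

import java.util.ArrayList;
import java.util.List;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class CombiningConsumer {

    private final List<String> collected = new ArrayList<>();

    // Sole member of "combiner-group": gets all n partitions, hence all messages.
    @KafkaListener(topics = "my-topic", groupId = "combiner-group")
    public void listen(String message) {
        collected.add(message);   // all 10 messages end up in one place
    }
}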

Kafka multiple consumer from different partitions

I have 4 partitions and 4 consumers (A, B, C, D, for example).
How do I configure which consumer reads from which partition, using consumer groups?
I am using Kafka with Spring Boot.
By default, Kafka will automatically assign the partitions; if you have 4 consumers in the same group, they will eventually get one partition each. There are properties to configure Kafka so it won't immediately do the allocation while you bring up your consumers.
You can also assign the partitions yourself.
Using
public ContainerProperties(TopicPartitionInitialOffset... topicPartitions)
if you are building the container yourself, or
@KafkaListener(id = "baz", topicPartitions = @TopicPartition(topic = "${topic}",
        partitions = "${partition}"))
if you are using @KafkaListener.

Is scalability applicable with Kafka Streams if each topic has a single partition

My understanding, as per the Kafka Streams documentation, is that:
the maximum number of possible parallel tasks equals the maximum number of partitions of a topic among all topics in a cluster.
I have around 60 topics in the Kafka cluster, each with a single partition only.
Is it possible to achieve scalability/parallelism with Kafka Streams for my Kafka cluster?
Do you want to do the same computation over all topics? For this, I would recommend introducing an extra topic with many partitions that you use to scale out:
// using new 1.0 API
StreamsBuilder builder = new StreamsBuilder();
KStream parallelizedStream = builder
    .stream(/* subscribe to all topics at once */)
    .through("topic-with-many-partitions");
// apply computation
parallelizedStream...
Note: You need to create the topic "topic-with-many-partitions" manually before starting your Streams application
Pro Tip:
The topic "topic-with-many-partitions" can have a very short retention time as it's only used for scaling and must not hold data long term.
Update
If you have 10 topics T1 to T10 with a single partition each, the program above will execute as follows (with TN being the dummy topic with 10 partitions):
T1-0  --+              +--> TN-0 --> T1_0
 ...    +--> T0_0 -----+--> ...
T10-0 --+              +--> TN-9 --> T1_9
The first part of your program only reads all 10 input topics and writes them back into the 10 partitions of TN. Afterwards, you get up to 10 parallel tasks, each processing one partition of TN. If you start 10 KafkaStreams instances, only one of them will run task T0_0, and each instance will also have one task T1_x running.
