I have 4 partitions and 4 consumers (A, B, C, D, for example).
How do I configure which consumer will read from which partition using consumer groups?
I am using Kafka with Spring Boot.
By default, Kafka will automatically assign the partitions; if you have 4 consumers in the same group, they will eventually get one partition each. There are properties to configure Kafka so it won't immediately do the allocation while you bring up your consumers.
You can also assign the partitions yourself.
Using

public ContainerProperties(TopicPartitionInitialOffset... topicPartitions)

if you are building the container yourself, or

@KafkaListener(id = "baz", topicPartitions = @TopicPartition(topic = "${topic}",
    partitions = "${partition}"))

if you are using @KafkaListener.
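A minimal listener sketch of the annotation approach (hedged example: the property names, partition numbers, and record types are placeholders, and it assumes spring-kafka on the classpath with a reachable broker):

```java
// Sketch only: topic name comes from a property; the partition list is a placeholder.
@KafkaListener(id = "baz", topicPartitions = @TopicPartition(
        topic = "${topic}",
        partitions = { "0", "1" }))
public void listen(ConsumerRecord<String, String> record) {
    // This listener receives only records from the partitions assigned above.
    System.out.println("partition " + record.partition() + ": " + record.value());
}
```

A listener declared this way uses manual assignment (assign()) rather than group subscription, so its partitions are never rebalanced away.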
I just started working on Kafka. I need to develop a consumer client using the sarama Go package. The client is supposed to be part of a consumer group and needs to read from two topics, A and B: it should read from whatever partitions of topic A are allocated to it by a balance strategy, and from all partitions of topic B (B is a kind of broadcast topic).
Workflow:
consumer group xx.
I have two topics A and B with 6 partitions [0,1,2...5] each.
I have two consumers C1 and C2 in xx; data should be read in such a way that:
C1 reads from A:[0,1,2] and from B:[0,1,2,3,4,5]
C2 reads from A:[3,4,5] and from B:[0,1,2,3,4,5]
note: if a new client is added, the partitions in A should be rebalanced and all partitions in B should still be read.
I tried implementing a custom balance strategy but failed. Please let me know if this can be done and how to do it.
Within a single consumer group, no two consumers can be assigned overlapping partitions. In other words, listening to all partitions of topic B cannot be done unless you move the C2 consumer to its own unique group, regardless of the rebalancing strategy used for topic A.
You need to implement both Partition Consumer and Group Consumer in your service.
1. Group Consumer
Use a group consumer to consume messages from topic "A". You can implement the ConsumerGroup interface in the Sarama library. Both of your consumers, "C1" and "C2", need to subscribe to topic "A" as a group (using the same group ID).
2. Partition Consumer
Use partition consumers to consume messages from topic "B". Use the Consumer interface in the Sarama library for this. "C1" and "C2" need to subscribe to all partitions of "B" at startup and again after a rebalance.
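The same split can be illustrated with the plain Java Kafka client (shown in Java for brevity; the Sarama analogues are ConsumerGroup.Consume for step 1 and one Consumer.ConsumePartition per partition for step 2). Broker address is a placeholder; topic names and group ID come from the question:

```java
// Sketch only: assumes kafka-clients on the classpath and a reachable broker.
Properties groupProps = new Properties();
groupProps.put("bootstrap.servers", "localhost:9092");
groupProps.put("group.id", "xx"); // shared group: partitions of A are balanced across C1/C2
groupProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
groupProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

// 1. Group consumer for topic A: the broker assigns this member a subset of partitions.
KafkaConsumer<String, String> groupConsumer = new KafkaConsumer<>(groupProps);
groupConsumer.subscribe(Collections.singletonList("A"));

// 2. Standalone consumer for topic B: no subscribe(), an explicit assign() of
// every partition, so each instance sees the whole "broadcast" topic.
Properties soloProps = new Properties();
soloProps.putAll(groupProps);
soloProps.remove("group.id"); // no group management for the broadcast reader
KafkaConsumer<String, String> broadcastConsumer = new KafkaConsumer<>(soloProps);
List<TopicPartition> bPartitions = new ArrayList<>();
for (PartitionInfo info : broadcastConsumer.partitionsFor("B"))
    bPartitions.add(new TopicPartition("B", info.partition()));
broadcastConsumer.assign(bPartitions);
```

Because the broadcast consumer uses assign(), it never participates in rebalances; after a rebalance of the group on topic A, only the group consumer's assignment changes.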
I am using Spring Boot @KafkaListener in my application. Let's assume I use the configuration below:
Topic Partitions : 2
spring.kafka.listener.concurrency : 2
group-id : TEST_GRP_ID
Acknowledgement : Manual
My questions are:
As far as I know, concurrency creates parallel threads to consume messages.
So if thread 1 consumes one batch of records and thread 2 consumes another, will the messages within each batch be processed sequentially and the offsets then committed?
If I have two instances of the microservice in my cloud environment (in production, more partitions and more instances), how will concurrency work? Will each instance create two parallel threads for my Kafka consumer?
How can I improve the performance of my consumer, i.e. consume and process messages faster?
Your understanding is not far from the truth. In fact, only one consumer per partition can exist for a given group. The concurrency number gives an approximate number of target consumers, and independently of the number of microservice instances, at most two consumers can exist if you have only two partitions in your topic.
So, to increase performance you need more than 2 partitions, or more topics to consume; then they can all be distributed evenly between your instances and their consumers.
See more info in Apache Kafka docs: https://docs.confluent.io/platform/current/clients/consumer.html
✓ You have concurrency set to 2, which means 2 containers will be created for your listener.
✓ As you have 2 partitions in the topic, messages from both partitions will be consumed and processed in parallel.
✓ When you spin up one more instance with the same group name, the first thing that happens is a group rebalance.
✓ Despite this event, since at any point in time only one consumer from a given consumer group can read a partition, in the end only 2 containers will be listening for messages and the other 2 containers remain idle.
✓ In order to achieve more scalability, we need to add more partitions to the topic, thereby allowing more active listener containers.
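The outcome above can be sketched with a small stdlib-only simulation (an illustration of the assignment result, not real Kafka assignor code; container names are made up):

```java
import java.util.*;

public class AssignmentDemo {
    /** Round-robin partitions over containers; surplus containers get nothing. */
    static Map<String, List<Integer>> assign(int partitions, List<String> containers) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String c : containers) result.put(c, new ArrayList<>());
        for (int p = 0; p < partitions; p++)
            result.get(containers.get(p % containers.size())).add(p);
        return result;
    }

    public static void main(String[] args) {
        // 2 instances x concurrency 2 = 4 containers, but only 2 partitions:
        // two containers each own a partition, the other two stay idle.
        List<String> containers = Arrays.asList("i1-c0", "i1-c1", "i2-c0", "i2-c1");
        System.out.println(assign(2, containers));
    }
}
```

However the 2 partitions are spread over the 4 containers, two containers always end up with an empty assignment, which is why adding instances beyond the partition count does not help.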
Let's say we have a Kafka Streams application which reads data from 2 source topics, customerA.orders and customerB.orders. Each topic has 3 partitions.
StreamsBuilder builder = new StreamsBuilder();
KStream stream1 = builder.stream("customerA.orders");
KStream stream2 = builder.stream("customerB.orders");
//Business logic which has stateless transformations.
When I run this application, 6 tasks are created, which is expected (since we have 3 partitions for each topic): current active tasks: [0_0, 0_1, 1_0, 0_2, 1_1, 1_2]
Since both topic names end with ".orders", I can use a regex to read data from the source topics, as shown below:
StreamsBuilder builder = new StreamsBuilder();
KStream stream1 = builder.stream(Pattern.compile(".*orders"));
But when I run this application using the regex, only 3 tasks are created instead of 6, even though we have 2 topics with 3 partitions each: current active tasks: [0_0, 0_1, 0_2]
The streams application is still getting messages from both topics.
Why is the number of tasks reduced when we use a regex for the source topics?
In the first version of the code, if you don't apply any operation such as a join, or share a state store between the two topics (more precisely, between the two Streams DSL pipelines built from the two KStreams), Kafka Streams creates 2 sub-topologies, so you get a separate task for each partition of each topic. These 2 sub-topologies process in parallel.
When your application subscribes to multiple topics as one KStream, it creates a single task for the partitions of the input topics that share the same partition number, i.e. the topics are co-partitioned (so partition 0 of topic 1 and partition 0 of topic 2 are consumed by the same task), and one particular task only processes one message from one of its subscribed partitions at a time.
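The task counts above follow from how task IDs are formed; a stdlib-only simulation (an illustration of the `<sub-topology>_<partition>` naming, not the Kafka Streams API) makes the difference visible:

```java
import java.util.*;

public class TaskIdDemo {
    /** Builds task IDs "<subTopology>_<partition>" for co-partitioned input groups. */
    static List<String> taskIds(int subTopologies, int partitionsPerTopic) {
        List<String> ids = new ArrayList<>();
        for (int s = 0; s < subTopologies; s++)
            for (int p = 0; p < partitionsPerTopic; p++)
                ids.add(s + "_" + p);
        return ids;
    }

    public static void main(String[] args) {
        // Two separate builder.stream(...) calls -> two sub-topologies -> 6 tasks.
        System.out.println(taskIds(2, 3)); // [0_0, 0_1, 0_2, 1_0, 1_1, 1_2]
        // One regex subscription -> one sub-topology over co-partitioned topics -> 3 tasks.
        System.out.println(taskIds(1, 3)); // [0_0, 0_1, 0_2]
    }
}
```

With the regex subscription, task 0_0 reads partition 0 of both topics, which is why only 3 tasks appear while messages from both topics are still consumed.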
I have a producer which produces some messages (10, for example).
There are n partitions and a consumer group with n consumers.
Kafka will distribute the messages among the consumers.
How do I combine the messages of all the consumers in one place so that I have all 10 messages?
I am using Kafka with Spring.
Create a consumer group with only one consumer; then you will get the records from all the partitions in one place.
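A minimal sketch with the plain Java client (the broker address, group ID, and topic name are placeholders; it assumes kafka-clients on the classpath and a running broker). A group with a single member is assigned all partitions, so one poll loop collects every message:

```java
// Sketch only: a single-member group receives records from every partition.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "single-member-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
    List<String> all = new ArrayList<>();
    while (all.size() < 10) {                                  // the 10 expected messages
        for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500)))
            all.add(r.value());
    }
}
```

Note that this trades parallelism for ordering in one place: with one consumer, the messages are no longer processed concurrently across partitions.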
My understanding, as per the Kafka Streams documentation:
The maximum number of parallel tasks equals the maximum number of partitions of a topic among all topics in the cluster.
I have around 60 topics in the Kafka cluster. Each topic has a single partition only.
Is it possible to achieve scalability/parallelism with Kafka Streams for my Kafka cluster?
Do you want to do the same computation over all topics? For this, I would recommend introducing an extra topic with many partitions that you use to scale out:
// using the new 1.0 API
StreamsBuilder builder = new StreamsBuilder();
KStream parallelizedStream = builder
.stream(/* subscribe to all topics at once*/)
.through("topic-with-many-partitions");
// apply computation
parallelizedStream...
Note: You need to create the topic "topic-with-many-partitions" manually before starting your Streams application
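For example, with the command-line tools (the broker address and replication factor are illustrative; older Kafka versions take `--zookeeper` instead of `--bootstrap-server`):

```shell
# Create the repartition topic manually before starting the Streams app.
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic topic-with-many-partitions \
  --partitions 10 \
  --replication-factor 1
```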
Pro Tip:
The topic "topic-with-many-partitions" can have a very short retention time, as it is only used for scaling and does not need to hold data long term.
Update
If you have 10 topics T1 to T10 with a single partition each, the program from above will execute as follows (with TN being the dummy topic with 10 partitions):

T1-0  --+           +--> TN-0 --> T1_0
 ...    +--> T0_0 --+--> ...  --> ...
T10-0 --+           +--> TN-9 --> T1_9

The first part of your program will only read all 10 input topics and write them back into the 10 partitions of TN. Afterwards, you can get up to 10 parallel tasks, each processing one input partition. If you start 10 KafkaStreams instances, only one of them will execute T0_0, and each will also have one T1_x task running.