Let's say we have a Kafka Streams application that reads data from 2 source topics, customerA.orders and customerB.orders. Each topic has 3 partitions.
StreamsBuilder builder = new StreamsBuilder();
KStream stream1 = builder.stream("customerA.orders");
KStream stream2 = builder.stream("customerB.orders");
// Business logic consisting of stateless transformations.
When I run this application, 6 tasks are created, which is expected (since we have 3 partitions for each topic): current active tasks: [0_0, 0_1, 1_0, 0_2, 1_1, 1_2]
Since both topic names end with ".orders", I can use a regex to read data from the source topics, as shown below:
StreamsBuilder builder = new StreamsBuilder();
KStream stream1 = builder.stream(Pattern.compile(".*orders"));
But when I run this application using the regex, only 3 tasks are created instead of 6, even though we have 2 topics with 3 partitions each: current active tasks: [0_0, 0_1, 0_2]
The Streams application still receives messages from both topics.
Why is the number of tasks reduced when we use a regex for the source topics?
In the first version, as long as you don't apply an operation such as a join, or share a state store, between the two topics (more precisely, between the Streams DSL code of the two KStreams), Kafka Streams creates 2 sub-topologies, so you get a separate task for each partition of each topic. The two sub-topologies are processed in parallel.
When your application subscribes to multiple topics in a single KStream, it creates one task per partition number, shared by the partitions of the input topics that have that number, as if the topics were co-partitioned (so partition 0 of topic 1 and partition 0 of topic 2 are consumed by the same task). A given task processes only one message at a time, taken from one of its subscribed partitions.
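To make the task counts concrete, here is a minimal plain-Java sketch. It does not call Kafka Streams; TaskModel and taskIds are hypothetical names. It assumes that each sub-topology gets one task per partition number, up to the largest partition count among its source topics, and that task IDs have the form <subTopology>_<partition>, as in the log output from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model class; not part of the Kafka Streams API.
class TaskModel {

    // Each list element describes one sub-topology as a map from
    // source topic name to partition count (assumed inputs).
    static List<String> taskIds(List<Map<String, Integer>> subTopologies) {
        List<String> ids = new ArrayList<>();
        for (int sub = 0; sub < subTopologies.size(); sub++) {
            // Tasks per sub-topology = largest partition count of its sources.
            int maxPartitions = 0;
            for (int partitions : subTopologies.get(sub).values()) {
                maxPartitions = Math.max(maxPartitions, partitions);
            }
            for (int p = 0; p < maxPartitions; p++) {
                ids.add(sub + "_" + p);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Two separate builder.stream(...) calls, nothing shared: two sub-topologies.
        System.out.println(taskIds(List.of(
                Map.of("customerA.orders", 3),
                Map.of("customerB.orders", 3))));
        // prints [0_0, 0_1, 0_2, 1_0, 1_1, 1_2]

        // One pattern subscription: both topics feed a single sub-topology.
        System.out.println(taskIds(List.of(
                Map.of("customerA.orders", 3, "customerB.orders", 3))));
        // prints [0_0, 0_1, 0_2]
    }
}
```

In a real application, printing builder.build().describe() shows how many sub-topologies were actually created.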
Related
I am using Spring Boot @KafkaListener in my application. Let's assume I use the configuration below:
Topic Partitions : 2
spring.kafka.listener.concurrency : 2
group-id : TEST_GRP_ID
Acknowledgement : Manual
My questions are:
As I understand it, concurrency creates parallel threads to consume messages.
So if thread 1 consumes a batch of records and thread 2 consumes a batch of records, will the messages be processed sequentially before the offsets are committed?
If I have two instances of the microservice in my cloud environment (in production, more partitions and more instances), how will concurrency work? Will each instance create two parallel threads for my Kafka consumer?
How can I improve the performance of my consumer, that is, consume and process messages faster?
Your understanding is not too far from the truth. In fact, only one consumer per partition can exist within a given group. The concurrency number gives an approximate target number of consumers. And regardless of the number of microservice instances, at most two consumers can be active if your topic has only two partitions.
So, to increase performance you need more than 2 partitions, or more topics to consume; they can then all be distributed evenly between your instances and their consumers.
See more info in Apache Kafka docs: https://docs.confluent.io/platform/current/clients/consumer.html
✓ You have concurrency set to 2, which means 2 containers will be created for your listener.
✓ Since the topic has 2 partitions, messages from both partitions will be consumed and processed in parallel.
✓ When you spin up another instance with the same group name, the first thing that happens is a group rebalance.
✓ Even after the rebalance, only one consumer from a given consumer group can be assigned to a partition at any point in time, so in the end only 2 containers will be receiving messages and the other 2 containers will remain idle.
✓ To scale further, add more partitions to the topic so that more listener containers can be active.
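The arithmetic behind these points can be sketched in plain Java. This is only a model, not Spring Kafka code: ConsumerModel and its methods are hypothetical names, and it assumes all consumers share one group and subscribe to a single topic.

```java
// Hypothetical model class; it does not use Spring Kafka or the Kafka client.
class ConsumerModel {

    // Consumers in the group that can own at least one partition:
    // a partition is assigned to at most one consumer of the group.
    static int activeConsumers(int partitions, int instances, int concurrency) {
        return Math.min(partitions, instances * concurrency);
    }

    // Consumers that end up with no partition after the rebalance.
    static int idleConsumers(int partitions, int instances, int concurrency) {
        return instances * concurrency - activeConsumers(partitions, instances, concurrency);
    }

    public static void main(String[] args) {
        // 2 partitions, 1 instance, concurrency 2
        System.out.println(activeConsumers(2, 1, 2) + " active, " + idleConsumers(2, 1, 2) + " idle");
        // prints 2 active, 0 idle

        // 2 partitions, 2 instances, concurrency 2
        System.out.println(activeConsumers(2, 2, 2) + " active, " + idleConsumers(2, 2, 2) + " idle");
        // prints 2 active, 2 idle
    }
}
```

Raising the partition count in the second call to 4 would give 4 active consumers and none idle, which is the scaling advice above.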
I am trying to create a Kafka Streams service where:
1. I initialize a cache in a processor, which is then updated by consuming messages from a topic, say "nodeStateChanged", with a partition key of, say, locationId.
2. I need to check the node state when I consume another topic, say "Report", again keyed by the same locationId. Effectively I am joining with the table created from nodeStateChanged.
How do I ensure that all the updates for nodeStateChanged land on the same instance as the Report topic, so that the lookup for a location is possible when a new report is received? Do 1 and 2 need to be created by the same topology, or is it okay to create two separate topologies that share the same APPLICATION_ID_CONFIG?
You don't need to do anything. Kafka Streams will always co-partition topics. That is, if you have a sub-topology that reads from multiple topics with N partitions each, you get N tasks, and each task processes the corresponding partitions: task 0 processes partition 0 of both input topics, task 1 processes partition 1 of both input topics, and so on.
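Here is a small plain-Java sketch of why the lookup works under that scheme. It is a model only: CoPartitioningModel and partitionFor are hypothetical names, and String.hashCode is a stand-in for the hash that Kafka's default partitioner applies to the serialized key. The point it illustrates rests on two assumptions: both topics have the same number of partitions, and the producers partition both topics by the same key in the same way.

```java
// Hypothetical model class; String.hashCode stands in for the real key hash.
class CoPartitioningModel {

    // Maps a key to a partition number in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 4;            // assumed: same count for both topics
        String locationId = "location-42"; // made-up key for illustration

        int statePartition = partitionFor(locationId, partitions);  // nodeStateChanged
        int reportPartition = partitionFor(locationId, partitions); // Report

        // Same key, same partition count, same hash: same partition number,
        // and therefore the same task owns both the state update and the report.
        System.out.println(statePartition == reportPartition); // prints true
    }
}
```

If the two topics had different partition counts, the same key could map to different partition numbers and the lookup would miss.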
I have 4 partitions and 4 consumers (A, B, C, D, for example).
How do I configure which consumer reads from which partition when using consumer groups?
I am using Kafka with Spring Boot.
By default, Kafka assigns the partitions automatically; if you have 4 consumers in the same group, they will eventually get one partition each. There are properties to configure Kafka so that it does not perform the assignment immediately while you bring up your consumers.
You can also assign the partitions yourself.
Using
public ContainerProperties(TopicPartitionInitialOffset... topicPartitions)
if you are building the container yourself, or
@KafkaListener(id = "baz", topicPartitions = @TopicPartition(topic = "${topic}",
partitions = "${partition}"))
if you are using @KafkaListener.
KTable<Key1, GenericRecord> primaryTable = createKTable(key1, kstream, statestore-name);
KTable<Key2, GenericRecord> childTable1 = createKTable(key1, kstream, statestore-name);
KTable<Key3, GenericRecord> childTable2 = createKTable(key1, kstream, statestore-name);
primaryTable.leftJoin(childTable1, (primary, child1) -> compositeObject)
.leftJoin(childTable2, (compositeObject, child2) -> compositeObject, Materialized.as("compositeobject-statestore"))
.toStream().to("composite-topics");
For my application, I am using KTable-KTable joins so that whenever data is received on the primary or a child stream, it can be set on compositeObject, which has setters and getters for all three tables. The three incoming streams have different keys, but when creating the KTables I make the keys the same for all three.
All my topics have a single partition. When I run the application on a single instance, everything works fine: I can see compositeObject populated with data from all three tables.
All interactive queries also work fine when passing the recordID and the local state store name.
But when I run two instances of the same application, I see compositeObject with primary and child1 data, but child2 remains empty. Even an interactive query against the state store does not return anything.
I am using the spring-cloud-stream-kafka-streams libraries to write the code.
Please suggest why it is not being set and what the right solution would be.
Kafka Streams' scaling model is coupled to the number of input topic partitions. Thus, if your input topics have a single partition, you cannot scale out. The number of input topic partitions determines your maximum parallelism.
Thus, you would need to create new topics with more partitions.
My understanding, as per the Kafka Streams documentation:
The maximum number of parallel tasks is equal to the maximum number of partitions of a topic among all topics in a cluster.
I have around 60 topics in my Kafka cluster. Each topic has only a single partition.
Is it possible to achieve scalability/parallelism with Kafka Streams for my Kafka cluster?
Do you want to do the same computation over all topics? For this, I would recommend introducing an extra topic with many partitions that you use to scale out:
// using new 1.0 API
StreamsBuilder builder = new StreamsBuilder();
KStream parallelizedStream = builder
.stream(/* subscribe to all topics at once*/)
.through("topic-with-many-partitions");
// apply computation
parallelizedStream...
Note: You need to create the topic "topic-with-many-partitions" manually before starting your Streams application.
Pro Tip:
The topic "topic-with-many-partitions" can have a very short retention time, as it is only used for scaling and does not need to hold data long term.
Update
If you have 10 topics T1 to T10 with a single partition each, the program above will execute as follows (with TN being the dummy topic with 10 partitions):
T1-0  --+           +--> TN-0 --> T1_0
...   --+--> T0_0 --+--> ...  --> ...
T10-0 --+           +--> TN-9 --> T1_9
The first part of your program only reads all 10 input topics and writes the data back into the 10 partitions of TN. Afterwards, you can get up to 10 parallel tasks, each processing one input partition. If you start 10 KafkaStreams instances, only one will execute T0_0, and each will also have one T1_x running.
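The last sentence can be illustrated with a rough plain-Java sketch that spreads the 11 tasks over 10 instances. The round-robin rule used here is an assumption made for illustration, not the actual Kafka Streams assignment algorithm, and TaskSpreadModel and spread are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model class; round-robin is a simplification, not the real assignor.
class TaskSpreadModel {

    // Hands out tasks to instances 0..instances-1 in round-robin order.
    static Map<Integer, List<String>> spread(List<String> tasks, int instances) {
        Map<Integer, List<String>> result = new TreeMap<>();
        for (int i = 0; i < instances; i++) {
            result.put(i, new ArrayList<>());
        }
        for (int t = 0; t < tasks.size(); t++) {
            result.get(t % instances).add(tasks.get(t));
        }
        return result;
    }

    public static void main(String[] args) {
        // One task for the first part (T0_0), ten for the second (T1_0 .. T1_9).
        List<String> tasks = new ArrayList<>();
        tasks.add("0_0");
        for (int p = 0; p < 10; p++) {
            tasks.add("1_" + p);
        }

        Map<Integer, List<String>> assignment = spread(tasks, 10);
        System.out.println(assignment.get(0)); // prints [0_0, 1_9]
        System.out.println(assignment.get(1)); // prints [1_0]
    }
}
```

Under this simplified rule, every instance ends up with exactly one T1_x task, and a single instance additionally runs T0_0.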