My understanding, as per the Kafka Streams documentation, is that the maximum possible number of parallel tasks equals the maximum number of partitions of a topic among all topics in the cluster.
I have around 60 topics in the Kafka cluster, each with a single partition only.
Is it possible to achieve scalability/parallelism with Kafka Streams for my Kafka cluster?
Do you want to do the same computation over all topics? For this, I would recommend introducing an extra topic with many partitions that you use to scale out:
// using the new 1.0 API
StreamsBuilder builder = new StreamsBuilder();
KStream parallelizedStream = builder
    .stream(/* subscribe to all topics at once */)
    .through("topic-with-many-partitions");
// apply computation
parallelizedStream...
Note: You need to create the topic "topic-with-many-partitions" manually before starting your Streams application.
Pro Tip:
The topic "topic-with-many-partitions" can have a very short retention time, as it's only used for scaling and does not need to hold data long term.
Update
If you have 10 topics T1 to T10 with a single partition each, the program from above will execute as follows (with TN being the dummy topic with 10 partitions):
T1-0  --+                 +--> TN-0 --> T1_0
...     +--> T0_0 --------+--> ...       ...
T10-0 --+                 +--> TN-9 --> T1_9
The first part of your program will only read all 10 input topics and write the data back into the 10 partitions of TN. Afterwards, you can get up to 10 parallel tasks, each processing one TN partition. If you start 10 KafkaStreams instances, only one will execute T0_0, and each will also have one T1_x task running.
Related
I have used datalake connectors to sink data from a topic, and those allowed me to specify
a number of records
an interval
So that essentially meant the connector would flush the data as soon as either condition was satisfied, whichever came first.
E.g. this, with the properties specified here. In there you can see the properties flush.size and rotate.interval.ms (or rotate.schedule.interval.ms); a sketch of such a configuration follows below.
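For illustration only, here is a hedged sketch of the kind of datalake sink configuration described above, written as a plain Java map; the connector class, topic name, and values are assumed examples, not taken from the question:

import java.util.Map;

public class DatalakeSinkConfigSketch {
    // Illustrative values only; connector class and topic are assumptions for the sketch.
    static final Map<String, String> SINK_CONFIG = Map.of(
        "connector.class", "io.confluent.connect.s3.S3SinkConnector", // assumed datalake connector
        "topics", "my-topic",                                          // assumed topic name
        "flush.size", "1000",            // flush after 1000 records ...
        "rotate.interval.ms", "60000"    // ... or roughly every 60 s, whichever comes first
    );
}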
I am trying to achieve the same using the JDBC sink connector specified here, but I only see batch.size.
The problem is that sometimes during the day messages arrive rather infrequently, and so the sinking of the data to the destination (in this case an Azure SQL Server DB) does not happen until batch.size is reached.
Is there a way to make the connector sink the data when either batch.size is reached or a certain time interval has elapsed, whichever comes first?
I have gone through this very interesting discussion, but I can't find a way to use it to fulfill the requirements I have.
Also, I have seen that there is the tasks.max property, which essentially spawns multiple tasks in parallel to sink the data. So, if my topic has 4 partitions and I have tasks.max set to 4, and my batch.size is 10, does that mean each task will only sink the data once 10 messages have arrived in its assigned partition?
Happy to elaborate if there are any questions.
I am using Spring Boot @KafkaListener in my application. Let's assume I use the below configuration:
Topic Partitions : 2
spring.kafka.listener.concurrency : 2
group-id : TEST_GRP_ID
Acknowledgement : Manual
My questions are:
As far as I know, concurrency creates parallel threads to consume messages.
So if thread 1 consumes a batch of records and thread 2 consumes another batch, will the messages within each batch be processed sequentially and then the offsets committed?
If I have two instances of the microservice in my cloud environment (in production there will be more partitions and more instances), how will concurrency work? Will each instance create two parallel threads for my Kafka consumer?
How can I improve the performance of my consumer, i.e. make consumption and processing of the messages faster?
Your understanding is not too far from the truth. In fact, only one consumer per partition can exist for a given group. The concurrency number gives an approximate number of target consumers, and independently of the number of microservice instances, at most two consumers can be active if you have only two partitions in your topic.
So, to increase performance you need to have more than 2 partitions, or more topics to consume; then they can all be distributed evenly between your instances and their consumers.
See more info in the Confluent Kafka consumer docs: https://docs.confluent.io/platform/current/clients/consumer.html
✓ You have concurrency set to 2, which means 2 listener containers will be created for your listener.
✓ As you have 2 partitions in the topic, messages from both partitions will be consumed and processed in parallel.
✓ When you spin up one more instance with the same group name, the first thing that happens is a group rebalance.
✓ Despite this event, since at any point in time only one consumer from a given consumer group can be assigned to a partition, in the end only 2 containers will be listening for messages and the other 2 containers simply remain idle.
✓ In order to achieve more scalability, we need to add more partitions to the topic, so that we can have more active listener containers (see the sketch below).
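As a rough illustration of the setup discussed above, here is a minimal listener sketch with concurrency 2 and manual acknowledgement; the topic name and the processing logic are assumptions, and it presumes the container factory's ack mode is set to MANUAL as in the question's configuration:

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class TestListener {

    // concurrency = "2" creates two listener containers; with a 2-partition topic
    // each container gets one partition. Containers beyond the partition count
    // would simply stay idle.
    @KafkaListener(topics = "test-topic",            // assumed topic name
                   groupId = "TEST_GRP_ID",
                   concurrency = "2")
    public void listen(String message, Acknowledgment ack) {
        // process the message ...
        ack.acknowledge();                            // manual offset commit
    }
}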
Let's say we have a Kafka Streams application which is reading data from 2 source topics, customerA.orders and customerB.orders. Each topic has 3 partitions.
StreamsBuilder builder = new StreamsBuilder();
KStream stream1 = builder.stream("customerA.orders");
KStream stream2 = builder.stream("customerB.orders");
// Business logic with stateless transformations.
When I run this application, 6 tasks are created, which is expected (since we have 3 partitions for each topic): current active tasks: [0_0, 0_1, 1_0, 0_2, 1_1, 1_2]
Since both topic names end with ".orders", I can use a regex to read data from the source topics as shown below:
StreamsBuilder builder = new StreamsBuilder();
KStream stream1 = builder.stream(Pattern.compile(".*orders"));
But when I run this application using the regex, only 3 tasks are created instead of 6, even though we have 2 topics with 3 partitions each: current active tasks: [0_0, 0_1, 0_2]. The Streams application is still getting messages from both topics.
Why is the number of tasks reduced when we use a regex for the source topics?
In the first snippet, if you don't apply any operation like a join, or share the same state store between the two topics (more precisely, between the Streams DSL code of the two KStreams), Kafka Streams creates 2 sub-topologies, so you get a separate task for each topic partition and the 2 sub-topologies process in parallel.
When your application subscribes to multiple topics in one KStream (e.g. via the regex pattern), a single sub-topology is created, and partitions of the input topics with the same partition number are assigned to the same task, i.e. they are treated as co-partitioned (so partition 0 of topic 1 and partition 0 of topic 2 are consumed by the same task), and a given task only processes one message from one of its subscribed partitions at a time.
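One way to see this difference for yourself is to print the topology description for both variants; the sketch below assumes default serdes and adds a trivial stateless filter just so there is some processing, and its output would show two sub-topologies for the first variant and one for the regex variant:

import java.util.regex.Pattern;
import org.apache.kafka.streams.StreamsBuilder;

public class TopologyCompare {
    public static void main(String[] args) {
        // Variant 1: two separate sources -> two sub-topologies -> 6 tasks (3 + 3)
        StreamsBuilder twoSources = new StreamsBuilder();
        twoSources.stream("customerA.orders").filter((k, v) -> v != null);
        twoSources.stream("customerB.orders").filter((k, v) -> v != null);
        System.out.println(twoSources.build().describe());

        // Variant 2: one regex source -> one sub-topology -> 3 tasks
        StreamsBuilder regexSource = new StreamsBuilder();
        regexSource.stream(Pattern.compile(".*orders")).filter((k, v) -> v != null);
        System.out.println(regexSource.build().describe());
    }
}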
I am trying to create a Kafka Streams service where
1. I initialize a cache in a processor, which is then updated by consuming messages from a topic, say "nodeStateChanged", keyed by a partition key, say locationId.
2. I need to check the node state when I consume another topic, say "Report", again keyed by the same locationId. Effectively I am joining against the table created from nodeStateChanged.
How do I ensure that all the updates for nodeStateChanged land on the same instance as the corresponding Report partition, so that the lookup for a location is possible when a new report is received? Do 1 and 2 need to be created by the same topology, or is it okay to create two separate topologies that share the same APPLICATION_ID_CONFIG?
You don't need to do anything; Kafka Streams will always co-partition topics. I.e., if you have a sub-topology that reads from multiple topics with N partitions each, you get N tasks and each task processes the corresponding partitions: task 0 processes partition zero of both input topics, task 1 processes partition one of both input topics, etc.
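For orientation, here is a minimal single-topology sketch of the lookup described in points 1 and 2, assuming String keys and values; the joiner and the output topic "enriched-reports" are placeholders, not part of the question:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ReportEnricher {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        Consumed<String, String> asStrings = Consumed.with(Serdes.String(), Serdes.String());

        // Table of the latest node state per locationId (topic names from the question)
        KTable<String, String> nodeState = builder.table("nodeStateChanged", asStrings);

        // Reports keyed by the same locationId; the stream-table join runs in the same
        // task as the matching nodeStateChanged partition, so the lookup is always local.
        KStream<String, String> reports = builder.stream("Report", asStrings);
        reports.join(nodeState, (report, state) -> state + ":" + report)  // placeholder joiner
               .to("enriched-reports", Produced.with(Serdes.String(), Serdes.String())); // assumed output topic
    }
}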
We are trying to replace Apache Storm with Apache Spark Streaming.
In Storm, we partitioned the stream based on "Customer ID" so that messages with a given range of customer IDs would be routed to the same bolt (worker).
We do this because each worker caches customer details (from the DB).
So we split into 4 partitions and each bolt (worker) handles 1/4 of the entire range.
I did see a comparison of Spark and Storm, and this being a limitation of Spark.
I am hoping we have a solution to this in Spark Streaming.
When using Kafka, one way to address this problem is to partition your data on the producer side. As you have probably seen, Kafka messages have a key, and you may use that key to partition the data among partitions.
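Here is a minimal sketch of that producer-side keying, assuming a hypothetical "customer-events" topic and String serialization; with the default partitioner, all records with the same customer ID hash to the same partition:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CustomerKeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String customerId = "customer-42";                                  // example key
            // The key determines the partition, so this customer's messages always
            // land on the same partition (and hence the same consumer/executor).
            producer.send(new ProducerRecord<>("customer-events", customerId, "some payload"));
        }
    }
}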
Using the Kafka receiver, you create one receiver per partition. Upon start of the Streaming job, the receivers will be distributed over several executors.
This means that every executor (JVM) will be receiving data only for the partitions it has been assigned. This results in the same ID going to the same executor for the lifetime of the receiver, and enables the effective local caching intended in the question.