Kafka Streams - Multiple Joins and Number of threads in a single instance - apache-kafka-streams

I have a use case that requires multiple joins on two topics.
Let's say I have topic A (2 partitions) and topic B (2 partitions), and I am running a single instance of a Kafka Streams application.
I need to find breaks, left misses and right misses between the two topics, so I am performing the following 3 operations:
A.join(B)
B.leftJoin(A)
A.leftJoin(B)
As per the documentation, two tasks (max(2,2)) will be created for each topology, for a total of 6 tasks, i.e.:
1. A.join(B) - two tasks created - each task is assigned two partitions
2. B.leftJoin(A) - two tasks created - each task is assigned two partitions
3. A.leftJoin(B) - two tasks created - each task is assigned two partitions
Since I am running a single instance, to scale up I am planning to configure num.stream.threads=6 so that each thread will be assigned one task.
Is my understanding correct? Please correct me if I am mistaken.
Thanks in Advance.
Regards,
Sathish

From the Confluent documentation:
The default implementation provided by Kafka Streams is
DefaultPartitionGrouper, which assigns each task with at most one
partition for each of the source topic partitions; therefore, the
generated number of tasks is equal to the largest number of partitions
among the input topics. [1]
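So if you aren't overriding the partition.grouper config, the number of tasks should be 2, not 6: the task count follows the partitions of the input topics, not the number of join operations that read them. With only 2 tasks, any stream threads beyond 2 on a single instance would stay idle.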
Links:
[1] http://docs.confluent.io/current/streams/developer-guide.html#optional-configuration-parameters
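To make the arithmetic concrete, here is a minimal plain-Java sketch. It is not Kafka Streams code, and the class and method names are made up for illustration. It assumes that all three joins read the same two topics inside a single sub-topology, and it spreads tasks over threads round-robin, which is only a simplification of the real assignor.

```java
import java.util.ArrayList;
import java.util.List;

final class StreamThreadSketch {

    // Tasks for one sub-topology: the largest partition count among its input topics.
    static int taskCount(int... partitionCounts) {
        int max = 0;
        for (int p : partitionCounts) {
            max = Math.max(max, p);
        }
        return max;
    }

    // Spread tasks over threads round-robin (a simplification, see above).
    // Task ids are written as <sub-topology>_<partition>, e.g. "0_1".
    static List<List<String>> assign(int tasks, int threads) {
        List<List<String>> perThread = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            perThread.add(new ArrayList<>());
        }
        for (int task = 0; task < tasks; task++) {
            perThread.get(task % threads).add("0_" + task);
        }
        return perThread;
    }

    // Number of threads that ended up without any task.
    static long idleThreads(List<List<String>> perThread) {
        return perThread.stream().filter(List::isEmpty).count();
    }

    public static void main(String[] args) {
        int tasks = taskCount(2, 2);                      // topics A and B
        List<List<String>> assignment = assign(tasks, 6); // num.stream.threads=6
        System.out.println("tasks = " + tasks);
        System.out.println("assignment = " + assignment);
        System.out.println("idle threads = " + idleThreads(assignment));
    }
}
```

Under these assumptions, num.stream.threads=6 leaves four of the six threads without a task, so two threads are enough for this single instance.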

Related

Flink how are partitions of a stream associated with the parallelism?

I am new to Flink and I'm trying to understand a few things. I've got a theory which I am trying to confirm. So it goes like that:
1. Parallelism refers to how many parallel "machines" (could be threads or different machines as I understand, correct me if I'm wrong) will run my job.
2. Flink by default will partition the stream in a round-robin manner to take advantage of the job's parallelism.
3. If the programmer defines a partitioning strategy (for example with keyBy) then this strategy will be followed instead of the default round-robin.
4. If the parallelism is set to 1 then partitioning the stream will not have any effect on the processing speed as the whole stream will end up being processed by the same machine. In this case, the only benefit of partitioning a stream (with keyBy) is that the stream can be processed in keyed context.
5. keyBy guarantees that the elements with the same key (same group) will be processed by the same "machine" but it doesn't mean that this machine will only process elements of this group. It could process elements from other groups as well but it processes each group as if it is the only one, independently from the others.
6. Setting a parallelism of 3 while the maximum number of partitions that my partition strategy can spawn is 2, is kind of meaningless as only 2 of the 3 "machines" will end up processing the two partitions.
Can somebody tell me if those points are correct? Correct me if I'm wrong please.
Thank you in advance for your time
I think you've got it. To expand on point 6: If your job uses a keyBy to do repartitioning, as in
    source
        .keyBy(...)
        .window(...)
        .sinkTo(...)
then in a case where the source is a Kafka topic with only 2 partitions, the source operator will only have 2 active instances, but for the window and sink all 3 instances will have meaningful work to do (assuming there are enough distinct keys).
Also, while we don't talk about it much, there's horizontal parallelism you can exploit as well. For example, in the job outlined above, the source task will run in one Java thread, and the task with the window and sink will run in another thread. (These are separate tasks because the keyBy forces a network shuffle.) If you give each task slot enough hardware resources, then these tasks will be able to run more-or-less independently (there's a bit of coupling, since they're in the same JVM).
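The keyBy behaviour discussed above (same key always goes to the same parallel instance, one instance may serve several keys, and too few distinct keys leave instances idle) can be sketched in plain Java. This is not Flink code: the class name is hypothetical, and the hash-modulo routing is only a stand-in, since Flink actually maps keys to key groups with its own hash before assigning them to subtasks.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class KeyByRoutingSketch {

    // Stand-in routing rule (assumption): hash of the key modulo the parallelism.
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Group the distinct keys by the subtask that would process them.
    static Map<Integer, Set<String>> route(List<String> keys, int parallelism) {
        Map<Integer, Set<String>> bySubtask = new TreeMap<>();
        for (String key : keys) {
            bySubtask
                .computeIfAbsent(subtaskFor(key, parallelism), i -> new TreeSet<>())
                .add(key);
        }
        return bySubtask;
    }

    public static void main(String[] args) {
        // Only two distinct keys with parallelism 3: at most two subtasks get data.
        System.out.println(route(Arrays.asList("a", "b", "a", "b"), 3));
        // More distinct keys: one subtask can serve several keys.
        System.out.println(route(Arrays.asList("a", "b", "c", "d", "e", "f"), 3));
    }
}
```

With only two distinct keys, at most two of the three subtasks ever receive elements, which is the situation point 6 describes; with six keys in this toy routing, every subtask serves two keys.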

What is the maximum number of tasks for Kafka Streams on multiple topics with different partition counts

Suppose we have two input topics. Topic1 has 2 partitions, and Topic2 has 4 partitions.
We create the Kafka Streams application with 1 thread.
Question: what is the maximum number of instances of the stream application that we can run such that every instance is assigned a partition?
As far as I understand, it is decided by the maximum partition count among the input topics, that is, 4.
What I want to achieve is 6, that is, the sum of the partition counts of all topics. Is this doable? Thanks.
The parallelism of a Streams application is defined by the number of partitions in the input topic(s); you are correct. You cannot change this. A workaround would be to work with an intermediate repartition topic: you repartition the input topic into a new topic with 6 partitions, and then do the actual work with a parallelism of 6.
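As a rough illustration of why the limit is 4 and not 6, here is a plain-Java sketch of the grouping rule described in the documentation: task i gets partition i of every input topic that has such a partition. It is not the Kafka Streams implementation, the class name is hypothetical, and it assumes both topics feed the same sub-topology.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionGroupingSketch {

    // Task i receives partition i of every input topic that has a partition i.
    static List<List<String>> group(Map<String, Integer> partitionsByTopic) {
        int tasks = 0;
        for (int count : partitionsByTopic.values()) {
            tasks = Math.max(tasks, count);
        }
        List<List<String>> perTask = new ArrayList<>();
        for (int task = 0; task < tasks; task++) {
            List<String> assigned = new ArrayList<>();
            for (Map.Entry<String, Integer> e : partitionsByTopic.entrySet()) {
                if (task < e.getValue()) {
                    assigned.add(e.getKey() + "-" + task);
                }
            }
            perTask.add(assigned);
        }
        return perTask;
    }

    public static void main(String[] args) {
        Map<String, Integer> topics = new LinkedHashMap<>();
        topics.put("Topic1", 2);
        topics.put("Topic2", 4);
        // One inner list per task, holding the topic partitions it reads.
        System.out.println(group(topics));
    }
}
```

In this sketch the first two tasks read one partition from each topic and the last two read only from Topic2, so there are four units of work to hand out, and a fifth or sixth single-threaded instance would have nothing to do.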

Apache storm: why and how to choose number of tasks per executor?

According to the official documentation:
How many instances to create for a spout/bolt. A task runs on a thread with zero or more other tasks for the same spout/bolt. The number of tasks for a spout/bolt is always the same throughout the lifetime of a topology, but the number of executors (threads) for a spout/bolt can change over time. This allows a topology to scale to more or less resources without redeploying the topology or violating the constraints of Storm (such as a fields grouping guaranteeing that the same value goes to the same task)
My questions are:
Under what circumstances would I choose to run multiple tasks in one executor?
If I do use multiple tasks in one executor, what might be reasons that I would choose different number of tasks per executor between my spout and my bolt (such as 2 tasks per bolt executor but only 1 task per spout executor)?
I thought https://stackoverflow.com/a/47714449/8845188 was a fine answer, but I'll try to reword it as examples:
The number of tasks for a component (e.g. spout or bolt) is set in stone when you submit the topology, while the number of executors can be changed without redeploying the topology. The number of executors is always less than or equal to the number of tasks for a component.
Question 1
You wouldn't normally have a reason to choose running e.g. 2 tasks in 1 executor, but if you currently have a low load but expect a high load later, you may choose to submit the topology with a high number of tasks but a low number of executors. You could of course just submit the topology with as many executors as you expect to need, but using many threads when you only need a few is inefficient due to context switching and/or potential resource contention.
For example, let's say you submit your topology so the spout has 4 tasks and 4 executors (one per). When your load increases, you can't scale further because 4 is the maximum number of executors you can have. You now have to redeploy the topology in order to scale with the load.
Let's say instead you submit your topology so the spout has 32 tasks and 4 executors (8 per). When the load increases, you can increase the number of executors to 32, even though you started out with only 4. You can do this scaling up without redeploying the topology.
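The two scenarios above can be modelled with a few lines of plain Java. This is only a sketch of the counting, not Storm code: the class name is hypothetical and the even split of tasks over executors is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ExecutorScalingSketch {

    // Spread a fixed number of tasks over the executors as evenly as possible.
    // Rejects executor counts above the task count, mirroring #executors <= #tasks.
    static List<Integer> tasksPerExecutor(int tasks, int executors) {
        if (executors < 1 || executors > tasks) {
            throw new IllegalArgumentException("need 1 <= executors <= tasks");
        }
        List<Integer> counts = new ArrayList<>();
        for (int e = 0; e < executors; e++) {
            // The first (tasks % executors) executors take one extra task.
            counts.add(tasks / executors + (e < tasks % executors ? 1 : 0));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(tasksPerExecutor(32, 4));  // low load: 8 tasks per executor
        System.out.println(tasksPerExecutor(32, 32)); // high load: same 32 tasks, 1 per executor
        try {
            tasksPerExecutor(4, 8);                   // 4 tasks cannot back 8 executors
        } catch (IllegalArgumentException ex) {
            System.out.println("cannot scale: " + ex.getMessage());
        }
    }
}
```

In Storm itself the task count is fixed at submit time (for example via setNumTasks on the component declarer), while the executor count is what the rebalance command changes later.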
Question 2
Let's say your topology has a spout A, and a bolt B. Let's say bolt B does some heavyweight work (e.g. can do 10 tuples per executor per second), while the spout is lightweight (e.g. can do 1000 tuples per executor per second). Let's say your load is initially 20 messages per second into the topology, but you expect that to grow.
In this case it makes sense that you might configure your spout with 1 executor and 1 task, since it's likely to be idle most of the time. At the same time you want to configure your bolt with a high number of tasks so you can scale the number of executors for it, and at least 2-3 executors to start.
Config#TOPOLOGY_TASKS -> How many tasks to create per component.
A task performs the actual data processing and is run within its parent executor’s thread of execution. Each spout or bolt that you implement in your code executes as many tasks across the cluster.
The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads <= #tasks.
By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread (which is usually what you want anyways).
Also be aware that:
The number of executor threads can be changed after the topology has been started.
The number of tasks of a topology is static.
There is another situation in which running several tasks on one executor can make sense.
Let's suppose you have 2 tasks of the same bolt running on a single executor (thread). Let's suppose you are calling a relatively long-running (1 second, maybe) database subroutine and the result is needed before proceeding further.
Case 1 - Your database call runs on the executor thread, which pauses for a while, and you gain nothing by running 2 tasks.
Case 2 - You refactor your database call code to spawn a new thread and execute there. In this case, your main executor thread does not hang and is able to start processing the second bolt task while the newly spawned thread fetches data from the database.
Unless you introduce your own parallelism within the component, I see no performance gain and no reason to run multiple tasks, apart from the maintenance reasons mentioned in other answers.
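The difference between Case 1 and Case 2 can be shown with a small plain-Java sketch. It is not Storm code: the class name, the fake lookup and its 200 ms delay are all made up for illustration, and it says nothing about how a bolt should emit or ack from another thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class OffloadedLookupSketch {

    // Stand-in for the slow database call; the delay is an arbitrary choice.
    static String slowLookup(String key) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "row-for-" + key;
    }

    // Case 1: the calling thread waits for each lookup in turn.
    static List<String> blocking(List<String> keys) {
        List<String> rows = new ArrayList<>();
        for (String key : keys) {
            rows.add(slowLookup(key));
        }
        return rows;
    }

    // Case 2: lookups are handed to pool threads, so they can overlap.
    static List<String> offloaded(List<String> keys, ExecutorService pool) {
        List<CompletableFuture<String>> pending = new ArrayList<>();
        for (String key : keys) {
            pending.add(CompletableFuture.supplyAsync(() -> slowLookup(key), pool));
        }
        List<String> rows = new ArrayList<>();
        for (CompletableFuture<String> future : pending) {
            rows.add(future.join()); // collect results in submission order
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("task-1", "task-2");
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            long t0 = System.nanoTime();
            blocking(keys);
            long t1 = System.nanoTime();
            offloaded(keys, pool);
            long t2 = System.nanoTime();
            System.out.println("blocking ms:  " + (t1 - t0) / 1_000_000);
            System.out.println("offloaded ms: " + (t2 - t1) / 1_000_000);
        } finally {
            pool.shutdown();
        }
    }
}
```

Both variants return the same rows; only the offloaded one lets the two waits overlap, which is the kind of extra parallelism the last paragraph refers to.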

Creating threads in Storm Bolt

I want to fire multiple web requests in parallel and then aggregate the data in a Storm topology. Which of the following ways is preferred?
1) Create multiple threads within a bolt
2) Create multiple bolts and create a merging bolt to aggregate the data.
I would like to create multiple threads within a bolt because merging data in another bolt is not a simple process. But I found some concerns about that on the internet:
https://mail-archives.apache.org/mod_mbox/storm-user/201311.mbox/%3CCAAYLz+pUZ44GNsNNJ9O5hjTr2rZLW=CKM=FGvcfwBnw613r1qQ#mail.gmail.com%3E
but I didn't get a clear reason why not to create multiple threads. Any pointers will help.
On a side note, does that mean I should not use Java 8's parallel streams either, as described in https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html?
Increase the number of tasks for the bolt; it's like spawning multiple instances of the same bolt. Also increase the number of executors (threads) to handle them evenly.
Make sure #executors <= #tasks. Storm will do the rest for you.

What is the relevance of specifying the number of tasks in Storm

I just wanted to know what the actual relevance of tasks in Storm is with respect to output or performance, since they have nothing to do with parallelism. Will choosing more than 1 task for a component change the output? What will the flow be then? And if I choose a number of tasks greater than the number of executors, how does that change the flow or the output (here I am just taking the basic word count example)?
It would be very helpful if anybody could explain this to me, with or without an example.
For example:
I have a topology with 3 bolts and 1 spout, and I have specified only 2 worker ports, which means that all 4 components (1 spout and 3 bolts) will run on these workers only. Now, if I specify 2 executors for the 1st bolt, it means that 2 threads of that bolt will run in parallel. If I now set the number of tasks to 3, what difference will this make to output or performance?
And if I have specified a fields grouping, will the grouping be spread across different executors (please correct me if I'm wrong)?
Did you read this article? https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
To pick up your example: If you set #tasks=3 and specify 2 executors, then with fieldsGrouping the data will be partitioned into 3 substreams (= #tasks). 2 substreams go to one executor and the third to the second executor. However, using 3 tasks and 2 executors allows you to increase the number of executors to 3 using the rebalance command.
As long as you do not want to increase the number of executors during execution, #tasks should be equal to #executors (i.e., just don't specify #tasks).
For your example (if you don't want to change the parallelism at runtime), you will most likely get an imbalanced workload across the two executors (one executor processes 33% of the data, the other 66%). However, this is only a problem in this special case and not in general. If you assume you have 4 tasks, each executor processes 2 substreams and no imbalance occurs.
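The 33%/66% split can be checked with a short plain-Java calculation. This is a sketch, not Storm code: the class name is hypothetical, every substream is assumed to carry the same amount of data, and the contiguous mapping of tasks to executors is an assumption for illustration.

```java
import java.util.Arrays;

final class FieldsGroupingShareSketch {

    // Fraction of the stream handled by each executor, assuming one
    // equally sized substream per task.
    static double[] loadShare(int tasks, int executors) {
        double[] share = new double[executors];
        for (int task = 0; task < tasks; task++) {
            // Contiguous split (assumption): neighbouring tasks share an executor.
            int executor = (int) ((long) task * executors / tasks);
            share[executor] += 1.0 / tasks;
        }
        return share;
    }

    public static void main(String[] args) {
        // 3 tasks on 2 executors: one executor carries two substreams.
        System.out.println(Arrays.toString(loadShare(3, 2)));
        // 4 tasks on 2 executors: two substreams each.
        System.out.println(Arrays.toString(loadShare(4, 2)));
        // After rebalancing 3 tasks to 3 executors: one substream each.
        System.out.println(Arrays.toString(loadShare(3, 3)));
    }
}
```

With 3 tasks on 2 executors one executor carries two thirds of the data; with 4 tasks, or after rebalancing to 3 executors, the shares are equal.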
