In my use case, I have the two options below for building my Kafka Streams topology.
Topics the data is streamed from:
Topic A - 6 partitions
Topic B - 6 partitions
Topic C - 6 partitions
Intermediate Topic:
Topic INTERMEDIATE - 6 partitions
Option 1: with two topologies
Topology 1:
A.leftJoin(B).to(INTERMEDIATE)
Topology 2:
INTERMEDIATE.leftJoin(C).to(ResultTopic)
Option 2: with single topology
A.leftJoin(B).leftJoin(C).to(ResultTopic)
I know both produce the same output; my question is:
Option 1 creates 12 tasks: max-partitions(A, B) + max-partitions(INTERMEDIATE, C)
Option 2 creates just 6 tasks: max-partitions(A, B, C)
At a high level, Option 2 looks like the better solution because I can configure fewer threads to handle the tasks, but in my use case I am building one big single topology:
A.leftJoin(B).leftJoin(C).leftJoin(D).leftJoin(E).leftJoin(F).to(ResultTopic)
In this case, again only 6 tasks are created even though there are more topics. Is chaining the topics into a single topology a good solution just to end up with fewer tasks?
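For concreteness, here is a minimal sketch of both options in Scala against the Kafka Streams DSL; the String types, the default String serdes, the 5-minute join window, and the value-concatenating joiner are placeholder assumptions, not my actual logic:
import java.time.Duration
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.{JoinWindows, ValueJoiner}

val joiner: ValueJoiner[String, String, String] =
  (left: String, right: String) => s"$left|$right"   // placeholder join logic
val window = JoinWindows.of(Duration.ofMinutes(5))   // placeholder window size

// Option 2: one chained (sub-)topology over A, B, C -> max(6, 6, 6) = 6 tasks
val builder2 = new StreamsBuilder()
val a = builder2.stream[String, String]("A")
val b = builder2.stream[String, String]("B")
val c = builder2.stream[String, String]("C")
a.leftJoin[String, String](b, joiner, window)
  .leftJoin[String, String](c, joiner, window)
  .to("ResultTopic")

// Option 1: write to INTERMEDIATE and read it back -> two (sub-)topologies -> 6 + 6 = 12 tasks
val builder1 = new StreamsBuilder()
val a1 = builder1.stream[String, String]("A")
val b1 = builder1.stream[String, String]("B")
a1.leftJoin[String, String](b1, joiner, window).to("INTERMEDIATE")
builder1.stream[String, String]("INTERMEDIATE")
  .leftJoin[String, String](builder1.stream[String, String]("C"), joiner, window)
  .to("ResultTopic")
Both versions produce the same result; they differ only in the task count and in the extra produce/consume hop through INTERMEDIATE.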
Related
Suppose we have two input topics. Topic1 has 2 partitions, and Topic2 has 4 partitions.
We create the Kafka Streams application with the number of threads set to 1.
Question: what is the maximum number of application instances we can run such that all of them are assigned a partition?
As per my understanding, it is decided by the maximum partition count among the input topics, i.e. 4,
while what I want to achieve is 6, i.e. the sum of all topics' partitions. Do you know whether this is doable? Thanks.
The parallelism of a Streams application is defined by the number of partitions of the input topic(s); you are correct, and you cannot change this. A workaround is to use an intermediate repartition topic: you repartition the input topic into a new topic with 6 partitions, and then do the actual work with a parallelism of 6.
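A rough sketch of that workaround, assuming String keys/values (with matching default serdes) and Kafka Streams 2.6+ where the DSL has a repartition() operator; on older versions you would write to a manually created 6-partition topic with to() and read it back with stream():
import org.apache.kafka.streams.StreamsBuilder
import org.apache.kafka.streams.kstream.Repartitioned

val builder = new StreamsBuilder()

// redistribute the 2-partition input over an internal repartition topic with 6 partitions
val topic1 = builder
  .stream[String, String]("Topic1")
  .repartition(Repartitioned.numberOfPartitions[String, String](6))

// everything downstream of topic1 can now run with up to 6 tasks;
// do the same for Topic2 if its processing should also fan out to 6 tasks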
I have a use case that requires multiple joins on two topics.
Let's say I have topic A (2 partitions) and topic B (2 partitions), and I am running a single instance of the Kafka Streams application.
I need to find breaks, left misses, and right misses between the two topics, so I am performing the following 3 operations:
A.join(B)
B.leftJoin(A)
A.leftJoin(B)
As per the documentation, two tasks (max(2, 2)) will be created for each topology, for a total of 6 tasks, i.e.:
1. A.join(B) - two tasks created - each task is assigned two partitions
2. B.leftJoin(A) - two tasks created - each task is assigned two partitions
3. A.leftJoin(B) - two tasks created - each task is assigned two partitions
Since I am running a single instance, to scale up I am planning to configure num.stream.threads=6 so that each thread is assigned one task.
Is my above understanding correct? Please correct me if I am mistaken.
Thanks in Advance.
Regards,
Sathish
From the Confluent documentation:
The default implementation provided by Kafka Streams is DefaultPartitionGrouper, which assigns each task with at most one partition for each of the source topic partitions; therefore, the generated number of tasks is equal to the largest number of partitions among the input topics. [1]
So if you aren't overriding partition.grouper config, the number of tasks should be 2.
Links:
[1] http://docs.confluent.io/current/streams/developer-guide.html#optional-configuration-parameters
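If you want to verify this yourself, you can print the topology description and size num.stream.threads accordingly. A rough sketch in Scala (the application id and bootstrap server are placeholders; the three joins from the question would be declared on the builder):
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

val builder = new StreamsBuilder()
// ... declare A.join(B), B.leftJoin(A) and A.leftJoin(B) on this builder ...

val topology = builder.build()
println(topology.describe())   // lists the sub-topologies; each gets as many tasks as the max partition count of its input topics

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "break-detection-app")   // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")     // placeholder
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "2")                 // one thread per expected task

new KafkaStreams(topology, props).start()
If the three joins are built from the same two KStream instances, they share their source nodes and end up in one sub-topology, which is why the task count is 2 rather than 6.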
I am trying to explore Spark Streaming with Kafka as the source. As per this link, createDirectStream has 1:1 parallelism between Kafka partitions and Spark partitions. So this would mean that if there is a Kafka topic with 3 partitions, then 3 Spark executors would run in parallel, each reading a partition.
Questions
Suppose I have a window operation after the data is read. Does the window operation apply the window across partitions or within one partition? I.e., let's say my batch interval is 10s and my window interval is 50s. Does the window accumulate 50s of data across partitions (if each partition has 10 records per 50s, does the window hold 30 records), or 50s of data per partition in parallel (if each partition has 10 records per 50s, does the window hold 10 records)?
pseudo code:
rdd = createDirectStream(...)
rdd.window()
rdd.saveAsTextFile() // Does this write 30 records in 1 file, or 3 files with 10 records per file?
Suppose I have this...
Pseudo code:
rdd = createDirectStream()
rdd.action1()
rdd.window()
rdd.action2()
Let's say I have 3 Kafka partitions and 3 executors (each reading a partition). This spins up 2 jobs, as there are 2 actions. Each Spark executor would have a partition of the RDD, and action1 is applied in parallel. Now for action2, would the same set of executors be used (otherwise, the data has to be read from Kafka again, which is not good)?
Q) If there is a Kafka topic with 3 partitions, then 3 Spark executors would run in parallel, each reading a partition.
In more specific terms, there will be 3 tasks submitted to the Spark cluster, one for each partition. Where these tasks execute depends on your cluster topology and locality settings, but in general you can consider that these 3 tasks will run in parallel.
Q) Suppose I have a window operation after the data is read. Does the window operation apply window across partitions or within one partition?
The fundamental model of Spark, and by transitivity of Spark Streaming, is that operations are declared on an abstraction (RDD/Dataset for Spark, DStream for Spark Streaming), and at the execution level those operations are applied in a distributed fashion, using the native partitioning of the data.
(I'm not sure about the distinction the question makes between "across partitions" and "within one partition".) The window is preserved per partition, and the operation(s) are applied according to their own semantics. For example, a map operation is applied per partition, while a count operation is first applied to each partition and then consolidated into one result.
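As a tiny illustration in the same pseudo style as the question (windowed stands for a windowed DStream and transform for some hypothetical per-record function):
windowed.map(record => transform(record))   // windowed, transform: placeholders; applied independently on each partition, in parallel
windowed.count().print()                    // counted per partition first, then consolidated into one total per window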
Regarding the pseudo code:
val dstream = createDirectStream(..., Seconds(30))
dstream.window(Seconds(600)) // this does nothing as the new dstream is not referenced any further
val windowDstream = dstream.window(timePeriod) // this creates a new Windowed DStream based on the base DStream
dstream.saveAsTextFiles() // this writes using the original streaming interval (30 seconds). It will write 1 logical file in the distributed file system with 3 partitions
windowDstream.saveAsTextFiles() // this writes using the windowed interval (600 seconds). It will write 1 logical file in the distributed file system with 3 partitions.
Given this code (note naming changes!):
val dstream = createDirectStream(...)
dstream.action1()
val windowDStream = dstream.window(...)
windowDStream.action2()
for action2, would the same set of executors be used (otherwise, the data has to be read from Kafka again - not good)?
In the case of the Direct Stream model, the RDDs at each interval do not contain any data, only offsets (offset-start, offset-end). It is only when an action is applied that the data is read.
A windowed DStream over a direct stream is, therefore, just a series of offsets: window(1-3) = (offset1-start, offset1-end), (offset2-start, offset2-end), (offset3-start, offset3-end). When an action is applied to that window, these offsets are fetched from Kafka and the operation is applied. This is not "bad" as implied in the question: it prevents us from having to store intermediate data for long periods of time and lets us preserve the operation semantics on the data.
So, yes, the data will be read again, and that's a good thing.
I am exploring Apache Storm. I know that there is no way of determining what tasks get mapped to which node. I wanted to know if there is any way to even guess which executors are grouped together. For instance, consider a linear chain topology with 1 spout and 2 bolts:
Spout -> Bolt1 -> Bolt2
If there is a 3-node cluster and numWorkers = 3, with a combined parallelism of 9 (3 spout executors + 2 x 3 bolt executors), is there any way of determining how the executors are grouped? I have read that the default scheduler distributes the load evenly in a round-robin manner. Does this mean that each worker will have one instance each of:
S -> B1 -> B2 executors?
For the default scheduler, you are right. If you have 3 workers, each worker will get assigned one instance of your Spout, Bolt1, and Bolt2.
The order in which the default scheduler assigns executors to workers is round robin, as you stated correctly. In more detail, the round-robin assignment for each logical operator happens for all of its executors before the scheduler considers the next logical operator. However, the order of the logical operators themselves is not fixed. See the code here for more details: https://github.com/apache/storm/tree/0.9.x-branch/storm-core/src/clj/backtype/storm/scheduler
If you want to influence this behavior, you can provide a custom scheduler. See an example here: https://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/
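As a sketch of the setup in the question (Scala, with 0.9.x-era backtype.storm package names; newer Storm versions use org.apache.storm instead, and MySpout, Bolt1, Bolt2 as well as the shuffle groupings are placeholders):
import backtype.storm.Config
import backtype.storm.topology.TopologyBuilder

val parallelism: Integer = 3
val builder = new TopologyBuilder()
builder.setSpout("spout", new MySpout(), parallelism)                       // MySpout: placeholder; 3 spout executors
builder.setBolt("bolt1", new Bolt1(), parallelism).shuffleGrouping("spout") // Bolt1: placeholder; 3 executors
builder.setBolt("bolt2", new Bolt2(), parallelism).shuffleGrouping("bolt1") // Bolt2: placeholder; 3 executors

val conf = new Config()
conf.setNumWorkers(3)   // 9 executors assigned round-robin over 3 workers:
                        // each worker gets one spout, one Bolt1 and one Bolt2 executor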
I just wanted to know the actual relevance of tasks in Storm with respect to output or performance, since they do not have anything to do with parallelism. Does choosing more than 1 task for a component change the output in any way, and what will the flow be then? Also, if I choose number of tasks > number of executors, how does that make a difference in the flow or the output (here I am just taking the basic word-count example)?
It would be very helpful if anybody could explain this to me, with or without an example.
For example, say:
I have a topology with 1 spout and 3 bolts, and I have specified only 2 worker ports, which means that all 4 components (1 spout and 3 bolts) will run on these workers only. Now I have specified 2 executors for the 1st bolt, which means 2 threads of that bolt will run in parallel. If I now set the number of tasks to 3, what difference will this make to the output or performance?
And if I have used fields grouping, will the grouping be spread across the different executors (please correct me if I am wrong)?
Did you read this article? https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
To pick up your example: if you set #tasks=3 and specify 2 executors, then with fieldsGrouping the data will be partitioned into 3 substreams (= #tasks); 2 substreams go to one executor and the third goes to the second executor. However, using 3 tasks and 2 executors allows you to increase the number of executors to 3 later via the rebalance command.
As long as you do not want to increase the number of executors during execution, #tasks should be equal to #executors (i.e., just don't specify #tasks).
For your example (if you don't want to change the parallelism at runtime), you will most likely get an imbalanced workload across the two executors (one executor processes 33% of the data, the other 66%). However, this is only a problem in this special case and not in general. If you assume you have 4 tasks, each executor processes 2 substreams and no imbalance occurs.
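To make this concrete, a sketch with 0.9.x-era backtype.storm package names (SentenceSpout, SplitBolt and CountBolt are hypothetical word-count components):
import backtype.storm.topology.TopologyBuilder
import backtype.storm.tuple.Fields

val two: Integer = 2
val three: Integer = 3
val builder = new TopologyBuilder()

builder.setSpout("spout", new SentenceSpout(), two)                     // SentenceSpout: placeholder
builder.setBolt("split", new SplitBolt(), two).shuffleGrouping("spout") // SplitBolt: placeholder
builder.setBolt("count", new CountBolt(), two)   // CountBolt: placeholder; 2 executors (threads) ...
  .setNumTasks(three)                            // ... running 3 tasks, so fieldsGrouping hashes words into 3 substreams
  .fieldsGrouping("split", new Fields("word"))

// later, the parallelism of "count" can be raised up to its task count (but never beyond it), e.g.:
//   storm rebalance <topology-name> -e count=3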