Apache Kafka Streams: State Store and Topic Partition Assignment

I would like to understand some details on how state stores and topic partitions are assigned to Stream Processing applications and their tasks.
Let's say I have a 4-partition Topic (tA).
I also have 4 instances (i0,i1,i2,i3) of the same application.id (myApp) running on 4 different machines and streaming records from tA.
The streaming engine will allocate one partition to each application instance. For the sake of argument, let's say the partition allocation is: p0->i0, p1->i1, p2->i2 and p3->i3.
Also assume my streaming application instances all create their state stores SS0, SS1, SS2, SS3. So basically, SS0 will hold records (keys) corresponding to p0, SS1 to p1, etc.
Now suppose i0 and i1 go down, and i2 and i3 get reassigned the additional partitions p0 and p1 respectively.
Will the corresponding state stores that held p0 and p1 keys also get reassigned along with those partitions?
In short, my question is: do partitions and state stores get associated with each other so that during reassignment they move together?
That is, is it guaranteed that we will never have the case where the task that gets p0 ends up with SS1?

A task reads from one specific partition (or from a set of partitions of different topics), and a task also maintains a specific state store. Tasks are the units that are moved around during a rebalance.
In your example, the Kafka Streams app will have 4 tasks, t0..t3. Task t0 will read from partition p0, t1 from p1, etc. Each task will maintain its own state store. That means, task t0 will maintain state store SS0, t1 will maintain SS1 and so on.
Let's assume instance i0 executes task t0, i1 executes t1, etc. When instances i0 and i1 go down, tasks t0 and t1 are redistributed to instances i2 and i3. Now, i2 will execute t0 as well as t2, and i3 will execute t1 as well as t3. Since the state stores are part of a task, they migrate with it. If the instance to which a task is migrated does not hold up-to-date data for the task's state, the state store is restored on that instance from the state's changelog topic on the Kafka brokers. Note that a task can also maintain multiple state stores, for instance when it contains multiple stateful operations.
Since a task is bound to its input partitions and its state stores, you will never run into the situation where a task reads from a different partition or maintains a different state store after a migration to a different instance.
You can find more details about tasks and state stores under the following links:
https://www.confluent.io/blog/how-to-tune-rocksdb-kafka-streams-state-stores-performance/
https://kafka.apache.org/28/documentation/streams/architecture
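For illustration, here is a minimal Java sketch (not part of the original answer) of a Kafka Streams app reading from the 4-partition topic tA. The store name "counts-store", the broker address and the counting logic are assumptions made up for this example; the point is that the topology yields one task per input partition, and each task owns the shard of the store for exactly its partition, which is why state travels with the task during a rebalance.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;

public class PartitionedCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "myApp");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // One task is created per partition of tA (4 tasks here); each task owns the
        // shard of "counts-store" that backs exactly its own input partition.
        builder.<String, String>stream("tA")
               .groupByKey()
               .count(Materialized.as("counts-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

Running four instances of this program with the same application.id spreads the 4 tasks (and their store shards) across the instances; stopping two of them triggers exactly the task migration and changelog-based restore described above.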

Related

Flink: how are partitions of a stream associated with the parallelism?

I am new to Flink and I'm trying to understand a few things. I've got a theory which I am trying to confirm. It goes like this:
1. Parallelism refers to how many parallel "machines" (these could be threads or different machines, as I understand it; correct me if I'm wrong) will run my job.
2. Flink by default will partition the stream in a round-robin manner to take advantage of the job's parallelism.
3. If the programmer defines a partitioning strategy (for example with keyBy), then this strategy will be followed instead of the default round-robin.
4. If the parallelism is set to 1, then partitioning the stream will not have any effect on the processing speed, as the whole stream will end up being processed by the same machine. In this case, the only benefit of partitioning a stream (with keyBy) is that the stream can be processed in a keyed context.
5. keyBy guarantees that elements with the same key (same group) will be processed by the same "machine", but it doesn't mean that this machine will only process elements of this group. It could process elements from other groups as well, but it processes each group as if it were the only one, independently from the others.
6. Setting a parallelism of 3 while the maximum number of partitions that my partitioning strategy can spawn is 2 is kind of meaningless, as only 2 of the 3 "machines" will end up processing the two partitions.
Can somebody tell me if those points are correct? Correct me if I'm wrong please.
Thank you in advance for your time
I think you've got it. To expand on point 6: If your job uses a keyBy to do repartitioning, as in
source
.keyBy(...)
.window(...)
.sinkTo(...)
then in a case where the source is a Kafka topic with only 2 partitions,
the source operator will only have 2 active instances, but for the window and sink all 3 instances will have meaningful work to do (assuming there are enough distinct keys).
Also, while we don't talk about it much, there's also horizontal parallelism you can exploit. For example, in the job outlined above, the source task will run in one Java thread, and the task with the window and sink will run in another thread. (These are separate tasks because the keyBy forces a network shuffle.) If you give each task slot enough hardware resources, then these tasks will be able to run more-or-less independently (there's a bit of coupling, since they're in the same JVM).
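To make that concrete, here is a minimal Java sketch of such a job (not from the original answer); the topic name, broker address, window size and toy reduce function are assumptions for illustration. With parallelism 3 over a 2-partition topic, only 2 source subtasks receive data, while keyBy redistributes records so that all 3 window subtasks can do meaningful work.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class KeyedWindowJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(3); // 3 parallel subtasks per operator

        // Hypothetical 2-partition topic: only 2 of the 3 source subtasks will get a split.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("two-partition-topic")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> stream =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");

        stream
            .keyBy(value -> value)                                       // hash-partitions by key across all 3 subtasks
            .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
            .reduce((a, b) -> a + b)                                     // toy aggregation per key and window
            .print();                                                    // stands in for sinkTo(...)

        env.execute("keyed-window-example");
    }
}

Because the keyBy forces a network shuffle, the source and the window/print operators end up in separate tasks, which is the horizontal parallelism mentioned above.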

Kafka Streams - Multiple Joins and Number of threads in a single instance

I have a use case to do multiple joins on the two topics,
Let's say I have topic A (2 partitions) and topic B (2 partitions), and I am running a single instance of my Kafka Streams application.
I have a use case to find breaks, left misses and right misses between the two topics, so I am performing the following 3 operations:
A.join(B)
B.leftJoin(A)
A.leftJoin(B)
As per the documentation, two tasks (max(2,2)) will be created for each topology, for a total of 6 tasks, i.e.:
1. A.join(B) - two tasks created; each task is assigned two partitions
2. B.leftJoin(A) - two tasks created; each task is assigned two partitions
3. A.leftJoin(B) - two tasks created; each task is assigned two partitions
Since I am running a single instance, to scale up I am planning to configure num.stream.threads=6, so that each thread will be assigned one task.
Is my above understanding correct? Please correct me if I am mistaken.
Thanks in Advance.
Regards,
Sathish
From the Confluent documentation:
The default implementation provided by Kafka Streams is DefaultPartitionGrouper, which assigns each task with at most one partition for each of the source topic partitions; therefore, the generated number of tasks is equal to the largest number of partitions among the input topics. [1]
So if you aren't overriding the partition.grouper config, the number of tasks should be 2.
Links:
[1] http://docs.confluent.io/current/streams/developer-guide.html#optional-configuration-parameters
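For reference, here is a minimal Java sketch of the three joins (not from the question or the answer); the topic names, the 5-minute JoinWindows, the value joiners and the output topics are assumptions for illustration. Since each input topic has 2 partitions, the join topology ends up with 2 tasks, so num.stream.threads=2 already lets every task run on its own thread within the single instance; any extra threads would simply sit idle.

import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

public class BreaksAndMissesApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "breaks-and-misses");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // 2 partitions per input topic -> 2 tasks; 2 threads run them all in parallel.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> a = builder.stream("topic-A");
        KStream<String, String> b = builder.stream("topic-B");
        JoinWindows window = JoinWindows.of(Duration.ofMinutes(5)); // assumed join window

        // Breaks: keys matched on both sides within the window.
        a.join(b, (va, vb) -> va + "|" + vb, window).to("breaks");
        // Misses: records on one side with no counterpart on the other (null join partner).
        a.leftJoin(b, (va, vb) -> vb == null ? va : null, window)
         .filter((k, v) -> v != null).to("miss-on-B");
        b.leftJoin(a, (vb, va) -> va == null ? vb : null, window)
         .filter((k, v) -> v != null).to("miss-on-A");

        new KafkaStreams(builder.build(), props).start();
    }
}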

Window operation on Spark streaming from Kafka

I am trying to explore Spark Streaming with Kafka as the source. As per this link, createDirectStream has 1:1 parallelism between Kafka partitions and Spark. So this would mean that if there is a Kafka topic with 3 partitions, then 3 Spark executors would run in parallel, each reading a partition.
Questions
Suppose I have a window operation after the data is read. Does the window operation apply the window across partitions or within one partition? Let's say my batch interval is 10s and the window interval is 50s. Does the window accumulate 50s of data across partitions (if each partition has 10 records over 50s, does the window hold 30 records), or 50s of data per partition in parallel (if each partition has 10 records over 50s, does the window hold 10 records)?
Pseudo code:
rdd = createDirectStream(...)
rdd.window()
rdd.saveAsTextFile() // Does this write 30 records in 1 file, or 3 files with 10 records per file?
Suppose I have this...
Pseudo code:
rdd = createDirectStream()
rdd.action1()
rdd.window()
rdd.action2()
Let's say I have 3 Kafka partitions and 3 executors (each reading one partition of the topic). This spins up 2 jobs, as there are 2 actions. Each Spark executor would have a partition of the RDD, and action1 is applied in parallel. Now for action2, would the same set of executors be used (otherwise, the data has to be read from Kafka again - not good)?
Q) if there is a Kafka topic with 3 partitions then 3 spark executors would run parallel, each reading a partition.
In more specific terms, there will be 3 tasks submitted to the Spark cluster, one for each partition. Where these tasks execute depends on your cluster topology and locality settings, but in general you can consider that these 3 tasks will run in parallel.
Q) Suppose I have a window operation after the data is read. Does the window operation apply window across partitions or within one partition?
The fundamental model of Spark, and by transitivity of Spark Streaming, is that operations are declared on an abstraction (RDDs/Datasets for Spark, DStreams for Spark Streaming) and, at the execution level, those operations are applied in a distributed fashion, using the native partitioning of the data.
(I'm not sure about the distinction the question makes between "across partitions or within one partition". The window will be preserved per partition. The operation(s) will be applied according to their own semantics. For example, a map operation will be applied per partition, while a count operation will first be applied to each partition and then consolidated into one result.)
Regarding the pseudo code:
val dstream = createDirectStream(..., Seconds(30))
dstream.window(Seconds(600)) // this does nothing as the new dstream is not referenced any further
val windowDstream = dstream.window(timePeriod) // this creates a new Windowed DStream based on the base DStream
dstream.saveAsTextFiles() // this writes using the original streaming interval (30 seconds). It will write 1 logical file in the distributed file system with 3 partitions
windowDstream.saveAsTextFiles() // this writes using the windowed interval (600 seconds). It will write 1 logical file in the distributed file system with 3 partitions.
Given this code (note naming changes!):
val dstream = createDirectStream(...)
dstream.action1()
val windowDStream = dstream.window(...)
windowDStream.action2()
for action2, would the same set of executors be used (otherwise, the data has to be read from Kafka again - not good)?
In the case of Direct Stream model, the RDDs at each interval do not contain any data, only offsets (offset-start, offset-end). It's only when an action is applied that the data is read.
A windowed DStream over a direct stream is, therefore, just a series of offsets: window(1-3) = (offset1-start, offset1-end), (offset2-start, offset2-end), (offset3-start, offset3-end). When an action is applied to that window, these offsets are fetched from Kafka and the operation is applied. This is not "bad", as implied in the question: it prevents us from having to store intermediate data for long periods of time and preserves the operation semantics on the data.
So, yes, the data will be read again, and that's a good thing.
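As a complement, here is a hedged Java sketch of how that wiring might look with the spark-streaming-kafka-0-10 direct stream; the topic name, group id, output path and intervals are assumptions for illustration (the 10s batch and 50s window mirror the question). The windowed DStream still consists of partitioned RDDs, and whether results are combined across partitions depends on the action applied: saveAsTextFile writes one part-file per RDD partition, while a count would consolidate to a single value.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class WindowedDirectStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("windowed-direct-stream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10)); // 10s batches

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        kafkaParams.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        kafkaParams.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, "windowed-example");

        // Direct stream: one Spark partition (and one task) per Kafka partition of the topic.
        JavaInputDStream<ConsumerRecord<String, String>> direct = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Collections.singletonList("three-partition-topic"), kafkaParams));

        // 50s window sliding every 10s; the window is defined on the DStream, not per partition.
        JavaDStream<String> windowed = direct.map(ConsumerRecord::value)
                                             .window(Durations.seconds(50), Durations.seconds(10));

        // Each windowed batch becomes one output directory with one part-file per RDD partition.
        windowed.foreachRDD((rdd, time) ->
                rdd.saveAsTextFile("/tmp/windowed-" + time.milliseconds()));

        jssc.start();
        jssc.awaitTermination();
    }
}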

How does YARN manage the extra resources in Hadoop?

Consider there are 3 top-level queues, q1, q2 and q3, under the Capacity Scheduler.
When users of q1 and q2 submit their jobs to their respective queues, they are guaranteed to get their allocated resources. Now the resources which are not utilized by q3 have to be utilized by q1 and q2. What factors does YARN consider while dividing the extra resources? Which of q1 and q2 gets preference?
Every queue in the Capacity Scheduler has 2 important properties (which are defined in terms of percentage of total resources available), which determine the scheduling:
Guaranteed capacity of the queue (determined by configuration "yarn.scheduler.capacity.{queue-path}.capacity")
Maximum capacity to which the queue can grow (determined by the configuration "yarn.scheduler.capacity.{queue-path}.maximum-capacity"). This puts an upper limit on the resource utilization by a queue. The queue cannot grow beyond this limit.
The Capacity Scheduler organizes the queues in a hierarchical fashion.
Queues are of 2 types, "parent" and "leaf" queues. Jobs can only be submitted to the leaf queues.
The "ROOT" queue is the parent of all the other queues.
Each parent queue sorts its child queues based on demand (what is the current used capacity of the queue? Is it under-served or over-served?).
For each queue, the ratio (used capacity / total cluster capacity) gives an indication of the utilization of the queue. The parent queue always gives priority to the most under-served child queue.
When free resources are given to a parent queue, the resources are recursively distributed to the child queues, depending on the current used capacity of each queue.
Within a leaf queue, the distribution of capacity can happen based on certain user limits (e.g. the configuration parameter yarn.scheduler.capacity.{queue-path}.minimum-user-limit-percent determines the minimum queue capacity that each user is guaranteed to get).
In your example, for the sake of simplicity, let's assume that the queues q1, q2 and q3 are directly present under "ROOT". As mentioned earlier, the parent queue keeps the queues sorted based on their utilization.
Since q3 is not utilized at all, the parent can distribute the un-utilized resources of q3, between q1 and q2.
The available resources are distributed based on the following factors:
If both q1 and q2 have enough resources to continue scheduling their jobs, then there is no need to distribute the available resources from q3.
If both q1 and q2 have hit their maximum capacity ("yarn.scheduler.capacity.{queue-path}.maximum-capacity"; this configuration limits the elasticity of the queues, so a queue cannot demand more than the percentage configured by this parameter), then the free resources are not allotted.
If either q1 or q2 is under-served, then the free resources are allotted to the under-served queue.
If both q1 and q2 are under-served, then the most under-served queue is given top priority.
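As an illustration (not from the original answer), a capacity-scheduler.xml along the following lines would express the two properties discussed above for the three queues; the percentages are made up and only need to respect that the children of root sum to 100.

<!-- capacity-scheduler.xml, illustrative values only -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>q1,q2,q3</value>
  </property>
  <property>
    <!-- guaranteed share: q1 gets 40% of the cluster -->
    <name>yarn.scheduler.capacity.root.q1.capacity</name>
    <value>40</value>
  </property>
  <property>
    <!-- elasticity cap: q1 may borrow idle capacity up to 70% -->
    <name>yarn.scheduler.capacity.root.q1.maximum-capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.q2.capacity</name>
    <value>40</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.q2.maximum-capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.q3.capacity</name>
    <value>20</value>
  </property>
</configuration>

With q3 idle, q1 and q2 can each grow beyond their 40% guarantee toward the 70% maximum-capacity cap, and the parent offers the spare capacity first to whichever of them is currently more under-served relative to its guarantee.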

Conceptual questions about map reduce

I've been doing a lot of reading about Map Reduce and I had the following questions that I can't seem to find the answers to:
Everyone points to the word-count example. But why do we need the map-reduce paradigm for word counts over a really big corpus? I'm not sure how having one machine read from a really huge stream and maintain the word counts all in memory is worse than having a number of connected machines split the counting task amongst themselves and aggregate it again. Finally, at the end, there will still be one place where all the counts will be maintained, right?
Are mapper and reducer machines physically different? Or can the mapping and reducing happen on the same machine?
Suppose my stream is the following three sentences:
a b c
b c d
b c
So, the word-count mapper will generate key-value pairs as:
a 1
b 1
c 1
b 1
c 1
d 1
b 1
c 1
And now it will pass these key-value pairs to the next stage, right? I have the following questions:
- Is this next stage the reducer?
- Can a mapper send the first b 1 and the second b 1 tuples to different nodes? If yes, then do the counts get aggregated in the next phase? If no, then why not? Wouldn't that be counter-intuitive?
Finally, at the end of a map-reduce job, the final output is all aggregated at a single machine, right? If yes, doesn't this make the entire process too expensive, computationally?
Word count is the easiest to explain; that is why you see it more often. It has become the "Hello World" example of the Hadoop framework.
Yes, map and reduce can run on the same machine or on different machines. Reduce starts only after all maps complete.
All occurrences of a key go to the same reducer.
(So the answer to your question "Can a mapper send the first b 1 and second b 1 tuples to different nodes?" is no.)
Also, it's not right to say the entire processing is expensive.
The Map-Reduce paradigm can process/solve/analyze problems which would be almost impossible to process on a single machine (the reason it's called BIG data).
And now with MapReduce this is possible on commodity (read: cheaper) hardware; that is why it is widely accepted.
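To make the mechanics concrete, here is the canonical Hadoop word-count job as a minimal Java sketch (it follows the standard example that ships with Hadoop rather than code from this thread). The mapper emits (word, 1) pairs from its local input split, the shuffle routes all pairs with the same key to a single reducer, and the optional combiner performs the kind of local pre-aggregation (the "b 2, c 2, d 1" partial result) that the next answer describes.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        // Emits (word, 1) for every token in the input split local to this mapper.
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // All pairs with the same key arrive at the same reducer, which sums them up.
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The reducers' output lands in HDFS as one part-file per reducer, so even the final counts stay distributed rather than being collected on a single machine.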
The Map-Reduce (MR) paradigm was created by Google, and Google uses it for word count (or, in their special case, for creating inverted indices, but that is pretty similar conceptually). You can try to use MR for many other things (and people do), but it often isn't really useful. In fact, many companies use MR for a special version of word count. When Spotify analyses their logs and reports which songs were listened to how often, it is basically word count, just with TB of logs.
The end result doesn't land on only one machine in Hadoop, but again in HDFS, which is distributed. And then you can perform another MR algorithm on that result, and so on.
In Hadoop you have different kinds of nodes, but as far as we have tested MR, all nodes were storing data as well as performing map and reduce jobs. The reason for performing the map and reduce jobs directly on the machines where the data is stored is data locality and therefore lower network traffic. You can afterwards combine the reduced results and reduce them again.
For instance when Machine 1 has
a b c
and Machine 2 has
b c d
b c
Then Machine 2 would map and reduce the data and only send
b 2
c 2
d 1
over the wire. However, Machine 2 actually wouldn't send the data anywhere; this result would rather be saved as a preliminary result in HDFS, where other machines can access it.
That was specific to Hadoop; I think it helps to understand the Map-Reduce paradigm when you also look at other usage scenarios. The NoSQL databases Couchbase and CouchDB use Map-Reduce to create views. This means that you can analyse data and compute sums, mins, maxes, counts, and so on. These MR jobs run on all the nodes of such a database cluster and the results are stored in the database again, all of this without Hadoop and HDFS.
