How commit interval & cache max buffer behave in a sub-topology with multiple state store? - apache-kafka-streams

Taken from the book: Mastering Kafka Streams and ksqlDB
Kafka Streams uses a depth-first strategy when processing data. When a new record is received, it is routed through each stream processor in the topology before another record is processed.
Also from the documentation: https://docs.confluent.io/platform/current/streams/architecture.html#
Kafka Streams does not use a backpressure mechanism because it does not need one. Using a depth-first processing strategy, each record consumed from Kafka will go through the whole processor (sub-)topology for processing and for (possibly) being written back to Kafka before the next record will be processed.
My question here is:
1 ) When a sub-topology (i.e. task) has multiple stateful processor (KTable Operations) with state store
1-A) Do each stateful processor make use of the cache memory? If so, given that the cache memory is distributed evenly across the Threads and a Thread execute the Tasks (I guess in sequence), how is that memory distributed/allocated for the cache of each stateful processor of the task being executed ?
1-B) If each stateful Processors make use of a cache, how do the Depth-First processing happens? If I have let say 3 stateful processors in a row, how do a record go through the 3 processor ? I understand that a commit interval met or a cache will trigger forwarding, flushing and saving to the changelog topic of a state store of a stateful processor. However where there are 3 a row, I get confused as to how this would work.
1-B-1) Will record be forward only on commit interval met or cache full between processor ? And therefore, hypothetically the time for a record to traverse a sub-topology (if the caches of processor never get full), would be commit interval multiplied by 3 ?

Related

Delay Kafka processor reading from source topic

I have a topology that consists of two source topics which are read and processed by two different processors in a Kafka Streams app. The one processor A reads its corresponding topic and creates a persistent local store which is shared with the other processor B in the topology.
My issue is that I need somehow after a restart to pause processor B processing for a very small amount of time and give processor A the time to read some events from its topic updating its local store before processor B starts with its processing.
Since both processors belong to the same sub-topology I can't use Thread.sleep in init() for example because this will cause the whole app to stall.
So is there a way to make processor B in topology wait/stall for a very small amount of time when restarting the application before starting reading from the source topic and begin processing events?
Processing order is base on record timestamps. Hence, if the timestamps of the record processed by A are smaller than the timestamps of the record processed by B, those "A records" will be processed first.
Explicitly pausing one side does not make sense, as it may violate the processing order. Just make sure that your input data is properly timestamped and you don't have to worry about manual pausing.

Achieve concurrency in Kafka consumers

We are working on parallelising our Kafka consumer to process more number of records to handle the Peak load. One way, we are already doing is through spinning up as many consumers as many partitions within the same consumer group.
Our Consumer deals with making an API call which is synchronous as of now. We felt making this API call asynchronous will make our consumer handle more load. Hence, we are trying to making the API call Asynchronous and in its response we are increasing the offset. However we are seeing an issue with this:
By making the API call Asynchronous, we may get the response for the last record first and none of the previous record's API calls haven't initiated or done by then. If we commit the offset as soon as we receive the response of the last record, the offset would get changed to the last record. In the meantime if the consumer restarts or partition rebalances, we will not receive any record before the last record we committed the offset as. With this, we will miss out the unprocessed records.
As of now we already have 25 partitions. We are looking forward to understand if someone have achieved parallelism without increasing the partitions or increasing the partitions is the only way to achieve parallelism (to avoid offset issues).
First, you need to decouple (if only at first) the reading of the messages from the processing of these messages. Next look at how many concurrent calls you can make to your API as it doesn't make any sense to call it more frequently than the server can handle, asynchronously or not. If the number of concurrent API calls is roughly equal to the number of partitions you have in your topic, then it doesn't make sense to call the API asynchronously.
If the number of partitions is significantly less than the max number of possible concurrent API calls then you have a few choices. You could try to make the max number of concurrent API calls with fewer threads (one per consumer) by calling the API's asynchronously as you suggest, or you can create more threads and make your calls synchronously. Of course, then you get into the problem of how can your consumers hand their work off to a greater number of shared threads, but that's exactly what streaming execution platforms like Flink or Storm do for you. Streaming platforms (like Flink) that offer checkpoint processing can also handle your problem of how to handle offset commits when messages are processed out of order. You could roll your own checkpoint processing and roll your own shared thread management, but you'd have to really want to avoid using a streaming execution platform.
Finally, you might have more consumers than max possible concurrent API calls, but then I'd suggest that you just have fewer consumers and share partitions, not API calling threads.
And, of course, you can always change the number of your topic partitions to make your preferred option above more feasible.
Either way, to answer your specific question you want to look at how Flink does checkpoint processing with Kafka offset commits. To oversimplify (because I don't think you want to roll your own), the kafka consumers have to remember not only the offsets they just committed, but they have to hold on to the previous committed offsets, and that defines a block of messages flowing though your application. Either that block of messages in its entirety is processed all the way through or you need to rollback the processing state of each thread to the point where the last message in the previous block was processed. Again, that's a major oversimplification, but that's kinda how it's done.
You have to look at kafka batch processing. In a nutshell: you can setup huge batch.size with a little number (or even single) of partitions. As far, as whole batch of messages consumed at consumer side (i.e. in ram memory) - you can parallelize this messages in any way you want.
I would really like to share links, but their number rolls over the web hole.
UPDATE
In terms of committing offsets - you can do this for whole batch.
In general, kafka doesn't achieve target performance requirements by abusing partitions number, but rather relying on batch processing.
I already saw a lot of projects, suffering from partitions scaling (you may see issues later, during rebalancing for example). The rule of thumb - look at every available batch setting first.

In Nifi, what is the difference between FirstInFirstOutPrioritizer and OldestFlowFileFirstPrioritizer

User guide https://nifi.apache.org/docs/nifi-docs/html/user-guide.html has the below details on prioritizers, could you please help me understand how these are different and provide any real time example.
FirstInFirstOutPrioritizer: Given two FlowFiles, the one that reached the connection first will be processed first.
OldestFlowFileFirstPrioritizer: Given two FlowFiles, the one that is oldest in the dataflow will be processed first. 'This is the default scheme that is used if no prioritizers are selected.'
Imagine two processors A and B that are both connected to a funnel, and then the funnel connects to processor C.
Scenario 1 - The connection between the funnel and processor C has first-in-first-out prioritizer.
In this case, the flow files in the queue between the funnel and connection C will be processed strictly based on the order they reached the queue.
Scenario 2 - The connection between the funnel and processor C has oldest-flow-file-first prioritizer.
In this case, there could already be flow files in the queue between the funnel and connection C, but one of the processors transfers a flow to that queue that is older than all the flow files in that queue, it will jump to the front.
You could imagine that some flow files come from a different portion of the flow that takes way longer to process than other flow files, but they both end up funneled into the same queue, so these flow files from the longer processing part are considered older.
Apache NiFi handles data from many disparate sources and can route it through a number of different processors. Let's use the following example (ignore the processor types, just focus on the titles):
First, the relative rate of incoming data can be different depending on the source/ingestion point. In this case, the database poll is being done once per minute, while the HTTP poll is every 5 seconds, and the file tailing is every second. So even if a database record is 59 seconds "older" than another, if they are captured in the same execution of the processor, they will enter NiFi at the same time and the flowfile(s) (depending on splitting) will have the same origin time.
If some data coming into the system "is dirty", it gets routed to a processor which "cleans" it. This processor takes 3 seconds to execute.
If both the clean relationship and the success relationship from "Clean Data" went directly to "Process Data", you wouldn't be able to control the order that those flowfiles were processed. However, because there is a funnel that merges those queues, you can choose a prioritizer on the queued queue, and control that order. Do you want the first flowfile to enter that queue processed first, or do you want flowfiles that entered NiFi earlier to be processed first, even if they entered this specific queue after a newer flowfile?
This is a contrived example, but you can apply this to disaster recovery situations where some data was missed for a time window and is now being recovered, or a flow that processes time-sensitive data and the insights aren't valid after a certain period of time has passed. If using backpressure or getting data in large (slow) batches, you can see how in some cases, oldest first is less valuable and vice versa.

multiple consumers per kinesis shard

I read you can have multiple consumer apps per kinesis stream.
http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html
however, I heard you can only have on consumer per shard. Is this true? I don't find any documentation to support this, and can't imagine how that could be if multiple consumers are reading from the same stream. Certainly, it doesn't mean the producer needs to repeat content in different shards for different consumers.
Kinesis Client Library starts threads in the background, each listens to 1 shard in the stream. You cannot connect to a shard over multiple threads, that is by-design.
http://docs.aws.amazon.com/kinesis/latest/dev/kinesis-record-processor-scaling.html
For example, if your application is running on one EC2 instance, and
is processing one Amazon Kinesis stream that has four shards. This one
instance has one KCL worker and four record processors (one record
processor for every shard). These four record processors run in
parallel within the same process.
In the explanation above, the term "KCL worker" refers to a Kinesis consumer application. Not the threads.
But below, the same "KCL worker" term refers to a "Worker" thread in the application; which is a runnable.
Typically, when you use the KCL,
you should ensure that the number of instances does not exceed the
number of shards (except for failure standby purposes). Each shard is
processed by exactly one KCL worker and has exactly one corresponding
record processor, so you never need multiple instances to process one
shard.
See the Worker.java class in KCL source.
Late to the party, but the answer is that you can have multiple consumers per kinesis shard. A KCL instance will only start one process per shard, but you can have another KCL instance consuming the same stream (and shard), assuming the second one has permission.
There are limits, though, as laid out in the docs, including:
Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second.
If you want a stream with multiple consumers where each message will be processed once, you're probably better off with something like Amazon Simple Queue Service.
to keep it simple, you can have multiple/different lambda functions get triggered on kinesis data. this way your both the lambdas are going to get all the data from the kinesis. The downside is that now you will have to increase the throughput at the kinesis level which is going to pricey. Use SQS instead for your use case.

Storm as a replacement for Multi-threaded Consumer/Producer approach to process high volumes?

We have a existing setup where upstream systems send messages to us on a Message Queue and we process these messages.The content is xml and we simply unmarshal.This unmarshalling step is followed by a write to db (to put relevant values onto relevant columns).
The system is set to interface with many more upstream systems and our volumes are going to increase to a peak size of 40mm per day.
Our current way of processing is have listeners on the queues and then have a multiple threads of producers and consumers which do the unmarshalling and subsequent db write.
My question : Can this process fit into the Storm use case scenario?
I mean can MQ be my spout and I have 2 bolts one to unmarshal and this then becomes the spout for the next bolt which does the write to db?
If yes,what is the benefit that I can derive? Is it a goodbye to cumbersome multi threaded producer/worker pattern of code.
If its as simple as the above then where/why would one want to resort to the conventional multi threaded approach to producer/consumer scenario
My point being is there a data volume/frequency at which Storm starts to shine when compared to the conventional approach.
PS : I'm very new to this and trying to get a hang of this and want to ascertain if the line of thinking is right
Regards,
CVM
Definitely this scenario can fit into a storm topology. The spouts can pull from MQ and the bolts can handle the unmarshalling and subsequent processing.
The major benefit over conventional multi threaded pattern is the ability to add more worker nodes as the load increases. This is not so easy with traditional producer consumer patterns.
Specific data volume number is a very broad question since it depends on a large number of factors like hardware etc.

Resources