Low-latency (sub-second) Kafka + Spark Structured Streaming tuning - performance

I'm trying to build a stateful Kafka + Spark Structured Streaming application with low latency. By low latency I mean a couple of hundred milliseconds per job.
The Spark app reads data from a Kafka topic whose partition count is twice the number of executor cores, then processes it and writes the output to another Kafka topic. Data is produced into the input topic at 100 records/s, with a record size of approximately 2 KB. The DAG of the job indicates that the stage that includes reading from the Kafka source takes 0.5 s.
This stage basically transforms the data from Kafka into a dataset of a custom case class; the second stage then applies groupByKey and flatMapGroupsWithState. The shuffle write time in the web UI is 0 ms (which is expected to be small, because the shuffled data size is around 10~20 KB). So AFAIK the only time-consuming operation should be reading from Kafka.
I've read that Kafka can perform much better than this; the end-to-end latency can be below 100 ms.
The Kafka broker is not heavily loaded. I don't know if it's related to the question, but the whole application runs on a Kubernetes cluster. Pictures of this stage and of the whole query are attached in case they help.
Sorry, I cannot post the code. Is there anything I can try?
Best Regards

Today I found that some tasks of the first stage take 0.5 s while others do not, which looked very suspicious to me. So I looked further into the Kafka settings. There's a consumer configuration called fetch.max.wait.ms, which by default lets a fetch request wait up to 500 ms for new messages before returning. After reducing this setting, everything works fine. More info here: fetch.max.wait.ms
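For anyone wanting to try the same thing, here is a minimal sketch of how the setting could be passed to the Kafka source, assuming PySpark. Kafka consumer properties are handed to the source with a "kafka." prefix. The broker address, the topic name, the helper name and the 50 ms value are placeholders of mine, not values from the original application.

```python
def kafka_source_options(bootstrap_servers, topic, fetch_max_wait_ms):
    """Build the option map for Spark's Kafka source.

    Kafka consumer properties are forwarded to the consumer when they
    carry the "kafka." prefix. Option values are passed as strings.
    """
    if fetch_max_wait_ms < 0:
        raise ValueError("fetch.max.wait.ms must be non-negative")
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "kafka.fetch.max.wait.ms": str(fetch_max_wait_ms),
    }


# Placeholder values; pick a wait time that suits your own latency budget.
options = kafka_source_options("broker:9092", "events", 50)

# With an existing SparkSession named `spark` (not created here):
# df = spark.readStream.format("kafka").options(**options).load()
```

A lower value trades latency for more frequent fetch requests against the broker, so it is worth watching broker load after the change.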

Related

Flink: how data is split in parallel tasks

I have a job with parallelism 2; it gets data from a kafka topic and, after keying, it handles timers in a stateful function.
I observed that sometimes one parallelized instance gets stuck: as a result timers do not trigger until a new message arrives, moving forward the current watermark for that parallel instance.
How does Flink split data between parallel instances?
Is there a metric to explore to get a quick view of how messages are split? (in percent or a count)
Apart from reducing parallelism to 1, is there any other tip to solve this issue?
Thanks
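With the Kafka source, it depends on the number of partitions: each partition is read by exactly one parallel source instance. So setting the parallelism higher than the number of partitions leaves some instances without data, and an instance that never receives data never advances its watermark, which holds back the overall watermark. In your case, as you mentioned it only gets stuck sometimes, probably one of the partitions didn't receive data for a while, which again stops the watermark.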
To solve this issue, you can use withIdleness with your watermark strategy, more details can be found in the docs.
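To illustrate why idleness helps, here is a small plain-Python model (not Flink code) of how a downstream operator combines watermarks: it takes the minimum over its inputs and skips inputs marked idle. The function name and the timestamps are made up for the example.

```python
def operator_watermark(partition_watermarks, idle):
    """Combine per-partition watermarks the way a downstream operator does:
    take the minimum over all inputs that are not marked idle.

    partition_watermarks: dict partition -> latest watermark in ms, or None
        if that partition has not produced a watermark yet.
    idle: set of partitions currently marked idle.
    Returns None when the watermark cannot advance.
    """
    active = [wm for p, wm in partition_watermarks.items() if p not in idle]
    if not active or any(wm is None for wm in active):
        return None
    return min(active)


# Partition 1 stopped receiving data at t=1000 while partition 0 kept going.
watermarks = {0: 5000, 1: 1000}

stuck = operator_watermark(watermarks, idle=set())   # held back by partition 1
advanced = operator_watermark(watermarks, idle={1})  # partition 1 is ignored
print(stuck, advanced)  # 1000 5000
```

In the Flink Java API the corresponding call would look roughly like WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(20)).withIdleness(Duration.ofMinutes(1)); the two durations are only examples.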

Looking for a real time streaming solution

We have a spark-streaming micro batch process which consumes data from kafka topic with 20 partitions. The data in the partitions are independent and can be processed independently. The current problem is the micro batch waits for processing to be complete in all 20 partitions before starting next micro batch. So if one partition completes processing in 10 seconds and other partition takes 2 mins then the first partition will have to wait for 110 seconds before consuming next offset.
I am looking for a streaming solution where we can process the 20 partitions independently, without one partition having to wait for the others to finish. The streaming solution should consume data from each partition and advance offsets at its own rate, independent of the other partitions.
Does anyone have a suggestion on which streaming architecture would allow me to achieve my goal?
Any of Flink (AFAIK), KStreams, and Akka Streams will be able to progress through the partitions independently: none of them does Spark-style batching unless you explicitly opt in.
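To put numbers on the difference, here is a rough plain-Python sketch that compares the two scheduling models, using the 10 s and 2 min figures from the question. The function names and the choice of three rounds are mine, and the model assumes every round takes the same time per partition.

```python
def microbatch_finish_times(durations, rounds):
    """A round starts only after the slowest partition of the previous
    round has finished (a barrier between micro-batches)."""
    batch_time = max(durations.values())
    return {p: (rounds - 1) * batch_time + d for p, d in durations.items()}


def independent_finish_times(durations, rounds):
    """Each partition starts its next round as soon as its own previous
    round is done."""
    return {p: d * rounds for p, d in durations.items()}


# Seconds of processing per round, taken from the question.
durations = {"fast": 10, "slow": 120}

with_barrier = microbatch_finish_times(durations, rounds=3)
independent = independent_finish_times(durations, rounds=3)
idle_per_round = max(durations.values()) - durations["fast"]

print(with_barrier)    # {'fast': 250, 'slow': 360}
print(independent)     # {'fast': 30, 'slow': 360}
print(idle_per_round)  # 110
```

The slow partition finishes at the same time either way; what changes is how long the fast partition sits idle.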
Flink is similar to Spark in that it has a job server model; KStreams and Akka are both libraries that you just integrate into your project and deploy like any other JVM application (e.g. you can build a container and run on a scheduler like kubernetes). I personally prefer the latter approach: it generally means less infrastructure to worry about and less of an impedance mismatch to integrate with observability tooling used elsewhere.
Flink is an especially good choice when it comes to time-window based processing and joins.
KStreams fundamentally models everything as a transformation from one kafka topic to another: the topic topology is managed by KStreams, but there can be some gotchas there (especially if you're dealing with anything time-seriesy).
Akka is the most general and (in some senses) the least opinionated of the toolkits: you will have to make more decisions with less handholding (I'm saying this as someone who could probably fairly be called an Akka cheerleader); as a pure stream processing library, it may not be the ideal choice (though in terms of resource consumption, being able to more explicitly manage backpressure (basically, what happens when data comes in faster than it can be processed) may make it more efficient than the alternatives). I'd probably tend to only choose it if you were going to also take advantage of cluster sharded (and almost certainly event-sourced) actors: the benefit of doing that is that you can completely decouple your processing parallelism from the number of input Kafka partitions (e.g. you may be able to deploy 40 instances of processing and have each working on half of the data from Kafka).

Can we use Hadoop MapReduce for real-time data process?

Hadoop MapReduce and its ecosystem (like Hive) are usually used for batch processing. But I would like to know whether there is any way to use Hadoop MapReduce for real-time data processing, for example live results or live tweets.
If not, what are the alternatives for real-time data processing or analysis?
Real-time App with Map-Reduce
Let’s try to implement a real-time App using Hadoop. To understand the scenario, let’s consider a temperature sensor. Assuming the sensor continues to work, we will keep getting the new readings. So data will never stop.
We should not wait for data to finish, as it will never happen. Then maybe we should continue to do analysis periodically (e.g. every hour). We can run Spark every hour and get the last hour data.
What if every hour, we need the last 24 hours analysis? Should we reprocess the last 24 hours data every hour? Maybe we can calculate the hourly data, store it, and use them to calculate 24 hours data from. It will work, but I will have to write code to do it.
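The "calculate hourly data, store it, and combine" idea can be sketched as follows. This is only an illustration in plain Python with a hypothetical class name; it keeps one (sum, count) pair per hour so the raw readings never have to be reprocessed.

```python
from collections import deque


class RollingDailyAverage:
    """Keep one (sum, count) pair per hour and combine the most recent
    `hours` of them, so raw readings are never reprocessed."""

    def __init__(self, hours=24):
        # The deque drops the oldest hour automatically once it is full.
        self.partials = deque(maxlen=hours)

    def add_hour(self, readings):
        """Store the partial aggregate for one finished hour."""
        self.partials.append((sum(readings), len(readings)))

    def average(self):
        """Average over all readings in the retained hours."""
        total = sum(s for s, _ in self.partials)
        count = sum(c for _, c in self.partials)
        return total / count if count else None


# A 2-hour window keeps the example short; use hours=24 for a full day.
window = RollingDailyAverage(hours=2)
window.add_hour([10, 20])
window.add_hour([30])
first = window.average()   # (10 + 20 + 30) / 3
window.add_hour([50, 70])  # the oldest hour falls out of the window
second = window.average()  # (30 + 50 + 70) / 3
print(first, second)       # 20.0 50.0
```

Storing sums and counts rather than hourly averages matters: averaging the averages would weight an hour with one reading the same as an hour with a thousand.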
Our problems have just begun. Let us list a few requirements that complicate our problem.
What if the temperature sensor is placed inside a nuclear plant and our code creates alarms? Creating alarms after one hour has elapsed may not be the best way to handle it. Can we get alerts within 1 second?
What if you want the readings calculated at the hour boundary, while it takes a few seconds for data to arrive at the storage? Now you cannot start the job at your boundary; you need to watch the disk and trigger the job when data has arrived for the hour boundary.
Well, you can run Hadoop fast. Will the job finish within 1 second? Can we write the data to the disk, read the data, process it, produce the results, and recombine them with the other 23 hours of data in one second? Now things start to get tight.
The reason you start to feel the friction is that you are not using the right tool for the job. You are using a flat screwdriver when you have an Allen-wrench screw.
Stream Processing
The right tool for this kind of problem is called “Stream Processing”. Here “Stream” refers to the data stream: the sequence of data that will continue to come. “Stream Processing” can watch the data as it comes in, process it, and respond to it in milliseconds.
The following are reasons to move beyond batch processing (Hadoop/Spark), our comfort zone, and consider stream processing.
Some data naturally comes as a never-ending stream of events. To do batch processing, you need to store it, cut it off at some point, and process the data. Then you have to do the next batch and worry about aggregating across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. You can have conditions, look at multiple levels of focus (will discuss this when we get to windows), and also easily look at data from multiple streams simultaneously.
With streaming, you can respond to events faster. You can produce a result within milliseconds of receiving an event (update). With batch this often takes minutes.
Stream processing naturally fits time series data and detecting patterns over time. For example, if you are trying to detect the length of a web session in a never-ending stream (this is an example of trying to detect a sequence), it is very hard to do with batches, as some sessions will fall into two batches. Stream processing can handle this easily. If you take a step back and consider, most continuous data series are time series data. For example, almost all IoT data are time series data. Hence, it makes sense to use a programming model that fits naturally.
Batch lets the data build up and tries to process it all at once, while stream processing handles data as it comes in and hence spreads the processing over time. Hence stream processing can work with a lot less hardware than batch processing.
Sometimes data is huge and it is not even possible to store it. Stream processing lets you handle large fire-hose-style data and retain only the useful bits.
Finally, there is a lot of streaming data available (e.g. customer transactions, activities, website visits), and it will grow faster with IoT use cases (all kinds of sensors). Streaming is a much more natural model to think about and program those use cases.
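The web-session example above can be made concrete with a small sketch. It is plain Python rather than any particular streaming engine, the class name is hypothetical, and it assumes events arrive in timestamp order. State is kept per user across events, so a session that would straddle two batches is still measured as one.

```python
class SessionTracker:
    """Track session length per user over an unbounded event stream.
    A session ends once a user is silent for longer than `gap` seconds."""

    def __init__(self, gap):
        self.gap = gap
        self.open = {}  # user -> (session start, time of last event)

    def on_event(self, user, ts):
        """Feed one event; return the length of the session it closed, if any."""
        closed = None
        if user in self.open:
            start, last = self.open[user]
            if ts - last > self.gap:
                closed = last - start  # previous session is over
                start = ts             # this event opens a new session
            self.open[user] = (start, ts)
        else:
            self.open[user] = (ts, ts)
        return closed


tracker = SessionTracker(gap=30)
# Events at 0, 10 and 20 s form one session, even if a batch job had cut
# the data at 15 s. The event at 100 s arrives after the gap and closes it.
results = [tracker.on_event("u1", ts) for ts in (0, 10, 20, 100)]
print(results)  # [None, None, None, 20]
```

A real engine would also close sessions on a timer rather than waiting for the next event, and would handle out-of-order data.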
In HDP 3.1, Hive-Kafka integration was introduced for working with real-time data. For more info, see the docs: Apache Hive-Kafka Integration
You can add Apache Druid to a Hadoop cluster to process OLAP queries on event data, and you can use Hive and Kafka with Druid.
Hadoop/Spark shines at handling large volumes of data and batch processing on them, but when your use case revolves around real-time analytics requirements, Kafka Streams and Druid are good options to consider.
Here's a good reference for understanding a similar use case:
https://www.youtube.com/watch?v=3NEQV5mjKfY
Hortonworks also provides HDF Stack (https://hortonworks.com/products/data-platforms/hdf/) which works best with use cases related to data in motion.
The Kafka and Druid documentation is a good place to understand the strengths of both technologies. Here are the links:
Kafka: https://kafka.apache.org/documentation/streams/
Druid: http://druid.io/docs/latest/design/index.html#when-to-use-druid

Storm topology processing slowing down gradually

I have been reading about Apache Storm and tried a few examples from storm-starter. I also learnt how to tune a topology and how to scale it to perform fast enough to meet the required throughput.
I have created an example topology with acking enabled, and I am able to process 3K-5K messages per second. It performs really fast for the initial 10 to 15 min, or around 1 to 2 million messages, and then it starts slowing down. On the Storm UI, I can see the overall latency going up gradually without coming back down; after a while the processing drops to only a few hundred messages a second. I get exactly the same behavior for all the topologies I tried. The simplest one just reads from Kafka using KafkaSpout, sends the message to a transform bolt that parses it, and sends it to Kafka again using KafkaBolt. The parser is very fast; it takes less than a millisecond to parse a message. I tried a few options such as increasing/decreasing the parallelism, changing the buffer sizes, etc., but got the same behavior. Please help me find the reason for the gradual slowdown in the topology. Here is the config I am using:
1 Nimbus machine (4 CPU) 24GB RAM
2 Supervisor machines (8CPU) and using 1 thread per core with 24GB RAM
4 Node kafka cluster running on above 2 supervisor machines (each topic has 4 partitions)
KafkaSpout(2 parallelism)-->TransformerBolt(8)-->KafkaBolt(2)
topology.executor.receive.buffer.size: 65536
topology.executor.send.buffer.size: 65536
topology.spout.max.batch.size: 65536
topology.transfer.buffer.size: 32
topology.receiver.buffer.size: 8
topology.max.spout.pending: 250
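As a rough way to read these numbers: topology.max.spout.pending caps the un-acked tuples per spout task, so by Little's law the throughput ceiling is the number of tuples in flight divided by the complete latency. The sketch below assumes one task per spout executor, and the 100 ms figure is an assumed latency for illustration. It shows why throughput collapses as latency climbs; it does not say what makes latency climb.

```python
def max_in_flight(max_spout_pending, spout_tasks):
    """topology.max.spout.pending is applied per spout task."""
    return max_spout_pending * spout_tasks


def throughput_ceiling(in_flight, complete_latency_ms):
    """Little's law: tuples per second = tuples in flight / latency."""
    return in_flight * 1000 / complete_latency_ms


in_flight = max_in_flight(250, 2)              # values from the config above
fast = throughput_ceiling(in_flight, 100)      # assumed 100 ms complete latency
slow = throughput_ceiling(in_flight, 100_000)  # 100 s complete latency
print(in_flight, fast, slow)  # 500 5000.0 5.0
```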
At the start
After few minutes
After 45 min - latency started going up
After 80 min - Latency will keep going up and will go till 100 sec by the time it reaches 8 to 10mil messages
Visual VM screenshot
Threads
Pay attention to the capacity metric on RT_LEFT_BOLT, it is very close to 1; which explains why your topology is slowing down.
From the Storm documentation:
The Storm UI has also been made significantly more useful. There are new stats "#executed", "execute latency", and "capacity" tracked for all bolts. The "capacity" metric is very useful and tells you what % of the time in the last 10 minutes the bolt spent executing tuples. If this value is close to 1, then the bolt is "at capacity" and is a bottleneck in your topology. The solution to at-capacity bolts is to increase the parallelism of that bolt.
Therefore, your solution is to add more executors (and tasks) to that bolt (RT_LEFT_BOLT). Another thing you can do is reduce the number of executors on RT_RIGHT_BOLT; its capacity indicates you don't need that many executors, and probably 1 or 2 can do the job.
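For reference, the capacity figure is the number of tuples executed multiplied by the average execute latency, divided by the length of the measurement window. Here is a small sketch with made-up numbers. The executors_needed helper and its 0.5 target are my own rule of thumb, assuming load spreads evenly over executors; they are not something Storm provides.

```python
import math


def bolt_capacity(executed, execute_latency_ms, window_ms=10 * 60 * 1000):
    """Fraction of the window that was spent executing tuples."""
    return executed * execute_latency_ms / window_ms


def executors_needed(capacity, current_executors, target=0.5):
    """Hypothetical sizing rule: scale executors so that capacity drops
    to `target`, assuming tuples spread evenly across executors."""
    return math.ceil(current_executors * capacity / target)


# Made-up numbers: 570,000 tuples at 1 ms each within a 10-minute window.
capacity = bolt_capacity(executed=570_000, execute_latency_ms=1.0)
suggested = executors_needed(capacity, current_executors=2)
print(capacity, suggested)  # 0.95 4
```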
The issue was due to the GC settings with the new-generation parameters: the JVM was not using the allocated heap completely, so the internal Storm queues were getting full and running out of memory. The strange thing was that Storm did not throw an out-of-memory error; it just stalled. With the help of VisualVM I was able to trace it down.

Storm as a replacement for Multi-threaded Consumer/Producer approach to process high volumes?

We have an existing setup where upstream systems send messages to us on a message queue and we process these messages. The content is XML and we simply unmarshal it. This unmarshalling step is followed by a write to the DB (to put the relevant values into the relevant columns).
The system is set to interface with many more upstream systems, and our volumes are going to increase to a peak of 40 million messages per day.
Our current way of processing is to have listeners on the queues and then multiple producer and consumer threads which do the unmarshalling and the subsequent DB write.
My question: can this process fit into the Storm use-case scenario?
I mean, can the MQ be my spout, with 2 bolts: one to unmarshal, which then acts as the spout for the next bolt, which does the write to the DB?
If yes, what is the benefit that I can derive? Is it goodbye to the cumbersome multi-threaded producer/worker pattern of code?
If it's as simple as the above, then where/why would one want to resort to the conventional multi-threaded approach to the producer/consumer scenario?
My point being: is there a data volume/frequency at which Storm starts to shine when compared to the conventional approach?
PS: I'm very new to this, trying to get the hang of it, and want to ascertain whether this line of thinking is right.
Regards,
CVM
Definitely this scenario can fit into a storm topology. The spouts can pull from MQ and the bolts can handle the unmarshalling and subsequent processing.
The major benefit over conventional multi threaded pattern is the ability to add more worker nodes as the load increases. This is not so easy with traditional producer consumer patterns.
The specific data volume is a very broad question, since it depends on a large number of factors such as hardware.
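For comparison, here is a minimal sketch of the conventional single-process version of that pipeline, written in Python with threads and queues. The message format, the function names and the list standing in for the database are all made up. The two stages correspond to the two bolts; what Storm adds is running the same stages across many worker nodes instead of inside one process.

```python
import queue
import threading
import xml.etree.ElementTree as ET

STOP = object()  # sentinel that tells a worker to shut down


def unmarshal_worker(inbox, outbox):
    """First stage (the unmarshalling bolt): parse XML into a dict."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            break
        root = ET.fromstring(msg)
        outbox.put({child.tag: child.text for child in root})


def db_writer(inbox, table):
    """Second stage (the DB bolt): a list stands in for the database."""
    while True:
        row = inbox.get()
        if row is STOP:
            break
        table.append(row)


def run_pipeline(messages, parsers=4):
    raw, parsed, table = queue.Queue(), queue.Queue(), []
    workers = [threading.Thread(target=unmarshal_worker, args=(raw, parsed))
               for _ in range(parsers)]
    writer = threading.Thread(target=db_writer, args=(parsed, table))
    for t in workers + [writer]:
        t.start()
    for m in messages:   # plays the role of the spout reading from the MQ
        raw.put(m)
    for _ in workers:    # one sentinel per parser thread
        raw.put(STOP)
    for t in workers:
        t.join()
    parsed.put(STOP)     # all parsers are done, so the writer can stop
    writer.join()
    return table


rows = run_pipeline(["<order><id>%d</id><qty>2</qty></order>" % i
                     for i in range(5)])
print(sorted(r["id"] for r in rows))  # ['0', '1', '2', '3', '4']
```

Scaling this version means adding threads on one machine; once that machine is saturated, the queueing and shutdown logic has to be rebuilt across processes, which is the part Storm takes over.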

Resources