Flink: how data is split in parallel tasks - parallel-processing

I have a job with parallelism 2; it reads data from a Kafka topic and, after keying, handles timers in a stateful function.
I observed that one parallel instance sometimes gets stuck: its timers do not fire until a new message arrives and advances the current watermark for that instance.
How does Flink split data between parallel instances?
Is there a metric to explore to get a quick view of how messages are split? (in percent or a count)
Apart from reducing parallelism to 1, is there any other tip to solve this issue?
Thanks

With the Kafka source, the split depends on the number of partitions: each partition is read by exactly one source subtask. If the parallelism is higher than the number of partitions, some subtasks receive no data and emit no watermarks, which can stop the downstream watermark from moving forward. In your case, since it only gets stuck sometimes, one of the partitions probably received no data for a while, which stalls the watermark in the same way.
To solve this issue, you can use withIdleness with your watermark strategy; more details can be found in the docs.
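To make the stall easier to picture, here is a minimal plain-Python sketch of the idea, not Flink code. It assumes the downstream watermark is the minimum over all input partitions, and that an idleness timeout excludes partitions that have been quiet for too long. The function name, the partition names, and the numbers are all made up for illustration.

```python
from typing import Dict, Optional


def combined_watermark(
    partition_watermarks: Dict[str, int],
    last_event_wallclock: Dict[str, float],
    now: float,
    idle_timeout: Optional[float] = None,
) -> Optional[int]:
    """Return the minimum watermark over the partitions considered active.

    Without an idle timeout every partition counts, so a single quiet
    partition holds the result back. With a timeout, partitions that have
    seen no event for at least `idle_timeout` seconds are ignored.
    """
    active = []
    for partition, watermark in partition_watermarks.items():
        quiet_for = now - last_event_wallclock[partition]
        is_idle = idle_timeout is not None and quiet_for >= idle_timeout
        if not is_idle:
            active.append(watermark)
    # None means every partition is idle, so there is nothing to advance.
    return min(active) if active else None


# Hypothetical state: p0 keeps receiving events, p1 went quiet 120 s ago.
watermarks = {"p0": 10_000, "p1": 4_000}
last_seen = {"p0": 999.0, "p1": 880.0}
now = 1000.0

stalled = combined_watermark(watermarks, last_seen, now)
advancing = combined_watermark(watermarks, last_seen, now, idle_timeout=60.0)
print(stalled, advancing)
```

In the first call the quiet partition pins the watermark, so timers that depend on it do not fire. In the second call the quiet partition is skipped and the watermark follows the active one.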

Related

About the effect of parallelism in StormCrawler

I am currently working on a StormCrawler-based project. We have a fixed and limited amount of bandwidth for fetching pages from the web. We have 8 workers with a large parallelism hint (i.e. 50) for the different bolts in the topology, so a lot of threads are created for fetching pages. Is there any relation between the increasing number of fetch_error and the increased parallelism_hint in the project? How can I determine a good value for the parallelism_hint in StormCrawler?
The parallelism hint is not something that should be applied to all bolts indiscriminately.
Ideally, you need one instance of FetcherBolt per worker, so in your case 8. As you've probably read in the WIKI or seen in the conf, the FetcherBolt handles internal threads for fetching. This is determined by the config fetcher.threads.number which is set to 50 in the archetypes' configurations (assuming this is what you used as a starting point).
Using too many FetcherBolt instances is counterproductive. It is better to change the value of fetcher.threads.number instead. If you have 50 Fetcher instances with a default number of threads of 50, that would give you 2500 fetching threads which might be too much for your available bandwidth.
As I mentioned before, you want 1 FetcherBolt per worker; the number of internal fetching threads per bolt depends on your bandwidth. There is no hard rule for this: it depends on your situation.
One constant I have observed however is the ratio of parsing bolts to Fetcher bolts; usually, 4 parsers per fetcher works fine. Run Storm in deployed mode and check the capacity value for the parser bolts in the UI. If the value is 1 or above, try using more instances and see if it affects the capacity.
In any case, not all bolts need the same level of parallelism.
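The arithmetic behind this advice can be written down as a short sketch. The helper names are made up; the figures (8 workers, 50 threads per fetcher, 4 parsers per fetcher) are the ones mentioned above, and the parser ratio is only the rule of thumb from this answer, not a StormCrawler setting.

```python
def total_fetch_threads(fetcher_instances: int, threads_per_fetcher: int) -> int:
    """Each FetcherBolt instance runs its own pool of fetching threads."""
    return fetcher_instances * threads_per_fetcher


def suggested_parser_instances(fetcher_instances: int, parsers_per_fetcher: int = 4) -> int:
    """Starting point only: check the capacity value in the Storm UI afterwards."""
    return fetcher_instances * parsers_per_fetcher


# A parallelism hint of 50 on the fetcher, with fetcher.threads.number = 50.
current = total_fetch_threads(fetcher_instances=50, threads_per_fetcher=50)

# One FetcherBolt per worker (8 workers), same thread setting.
recommended = total_fetch_threads(fetcher_instances=8, threads_per_fetcher=50)

parsers = suggested_parser_instances(fetcher_instances=8)
print(current, recommended, parsers)
```

From there, the thread count per fetcher is the knob to tune against the available bandwidth.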

Achieve concurrency in Kafka consumers

We are working on parallelising our Kafka consumer so that it can process more records and handle the peak load. One thing we are already doing is spinning up as many consumers as there are partitions, all within the same consumer group.
Our consumer makes an API call, which is synchronous as of now. We felt that making this API call asynchronous would let our consumer handle more load. Hence, we are trying to make the API call asynchronous and advance the offset when its response arrives. However, we are seeing an issue with this:
Because the API call is asynchronous, we may get the response for the last record first, before the calls for the earlier records have even been initiated or completed. If we commit the offset as soon as we receive the response for the last record, the committed offset moves to the last record. If the consumer then restarts or the partitions rebalance, we will not receive any record before the committed offset again, so the unprocessed records are lost.
As of now we already have 25 partitions. We would like to understand whether someone has achieved parallelism without increasing the partitions, or whether increasing the partitions is the only way to achieve parallelism (to avoid offset issues).
First, you need to decouple (if only at first) the reading of the messages from the processing of these messages. Next look at how many concurrent calls you can make to your API as it doesn't make any sense to call it more frequently than the server can handle, asynchronously or not. If the number of concurrent API calls is roughly equal to the number of partitions you have in your topic, then it doesn't make sense to call the API asynchronously.
If the number of partitions is significantly less than the max number of possible concurrent API calls then you have a few choices. You could try to make the max number of concurrent API calls with fewer threads (one per consumer) by calling the APIs asynchronously as you suggest, or you can create more threads and make your calls synchronously. Of course, then you get into the problem of how your consumers can hand their work off to a greater number of shared threads, but that's exactly what streaming execution platforms like Flink or Storm do for you. Streaming platforms (like Flink) that offer checkpoint processing can also handle your problem of how to handle offset commits when messages are processed out of order. You could roll your own checkpoint processing and roll your own shared thread management, but you'd have to really want to avoid using a streaming execution platform.
Finally, you might have more consumers than max possible concurrent API calls, but then I'd suggest that you just have fewer consumers and share partitions, not API calling threads.
And, of course, you can always change the number of your topic partitions to make your preferred option above more feasible.
Either way, to answer your specific question you want to look at how Flink does checkpoint processing with Kafka offset commits. To oversimplify (because I don't think you want to roll your own), the Kafka consumers have to remember not only the offsets they just committed, but they have to hold on to the previous committed offsets, and that defines a block of messages flowing through your application. Either that block of messages in its entirety is processed all the way through or you need to roll back the processing state of each thread to the point where the last message in the previous block was processed. Again, that's a major oversimplification, but that's kinda how it's done.
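If you did roll your own, one simpler technique than full checkpointing (and not what Flink does) is to commit only up to the highest contiguous offset whose API call has completed. Here is a minimal sketch for a single partition, in plain Python with no Kafka client involved. `OffsetTracker` and its methods are hypothetical names, and the sketch follows the Kafka convention that the committed offset is the next offset to read.

```python
class OffsetTracker:
    """Tracks out-of-order completions for one partition."""

    def __init__(self, first_offset: int) -> None:
        # Next offset that is NOT yet known to be fully processed.
        self._next_to_commit = first_offset
        # Offsets completed ahead of the contiguous prefix.
        self._done = set()

    def mark_done(self, offset: int) -> None:
        """Record that the asynchronous call for `offset` has finished."""
        self._done.add(offset)
        # Advance over every offset that is now part of the prefix.
        while self._next_to_commit in self._done:
            self._done.remove(self._next_to_commit)
            self._next_to_commit += 1

    @property
    def committable(self) -> int:
        """Offset that is safe to commit (the next offset to read)."""
        return self._next_to_commit


tracker = OffsetTracker(first_offset=0)
history = []
# Responses arrive out of order: the last record (4) answers first.
for finished in [4, 2, 0, 1, 3]:
    tracker.mark_done(finished)
    history.append(tracker.committable)
print(history)
```

With this scheme a restart or rebalance re-delivers any record whose call was still in flight, so the API call has to tolerate being repeated.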
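You have to look at Kafka batch processing. In a nutshell: you can set up a huge batch size with a small number of partitions (or even a single one). Note that batch.size itself is a producer setting; on the consumer side, the batch is shaped by settings such as max.poll.records and fetch.min.bytes. As soon as a whole batch of messages has been consumed on the consumer side (i.e. is held in RAM), you can parallelize the processing of these messages in any way you want.
I would really like to share links, but there are too many of them scattered across the web.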
UPDATE
In terms of committing offsets: you can do this once for the whole batch.
In general, Kafka doesn't reach its performance targets by inflating the number of partitions, but rather by relying on batch processing.
I have already seen a lot of projects suffer from scaling through partitions (you may see issues later, during rebalancing for example). The rule of thumb: look at every available batch setting first.
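As a rough illustration of the batch approach, the sketch below processes one already-fetched batch with a thread pool and returns a single offset to commit once every record is done. It is plain Python with no Kafka client. `process_batch` and `fake_api_call` are hypothetical names, and the batch is assumed to be a list of `(offset, value)` pairs from one partition, sorted by offset.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(records, handler, max_workers=8):
    """Run `handler` on every value in parallel, then return the offset to commit.

    Returns None for an empty batch. If any handler call raises, the
    exception propagates and nothing should be committed.
    """
    if not records:
        return None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every call and re-raises the first failure.
        list(pool.map(handler, [value for _, value in records]))
    # Kafka convention: commit the offset of the next record to read.
    return records[-1][0] + 1


handled = []


def fake_api_call(value):
    # Stand-in for the real synchronous API call.
    handled.append(value.upper())


batch = [(100, "a"), (101, "b"), (102, "c")]
commit_offset = process_batch(batch, fake_api_call)
print(sorted(handled), commit_offset)
```

Because the commit happens only after the whole batch has finished, the order in which individual calls complete inside the batch no longer matters.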

Can we use Hadoop MapReduce for real-time data process?

We usually use Hadoop MapReduce and its ecosystem (like Hive) for batch processing. But I would like to know whether there is any way to use Hadoop MapReduce for real-time data processing, for example live results or live tweets.
If not, what are the alternatives for real-time data processing or analysis?
Real-time App with Map-Reduce
Let’s try to implement a real-time App using Hadoop. To understand the scenario, let’s consider a temperature sensor. Assuming the sensor continues to work, we will keep getting the new readings. So data will never stop.
We should not wait for the data to finish, as that will never happen. Then maybe we should run the analysis periodically (e.g. every hour). We can run Spark every hour and get the last hour's data.
What if, every hour, we need an analysis of the last 24 hours? Should we reprocess the last 24 hours of data every hour? Maybe we can calculate the hourly data, store it, and use it to calculate the 24-hour data. It will work, but I will have to write code to do it.
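The "calculate hourly, store, and combine" idea could look roughly like the sketch below. It is plain Python rather than a Hadoop or Spark job, and the names `HourlySummary`, `summarize`, and `RollingDay` are made up. It assumes the statistics of interest (here mean and maximum) can be rebuilt from per-hour partial results.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class HourlySummary:
    count: int
    total: float
    maximum: float


def summarize(readings):
    """Reduce one hour of raw readings (non-empty) to a small summary."""
    return HourlySummary(len(readings), sum(readings), max(readings))


class RollingDay:
    """Keeps the most recent `hours` summaries and combines them on demand."""

    def __init__(self, hours: int = 24) -> None:
        # deque(maxlen=...) drops the oldest hour automatically.
        self._window = deque(maxlen=hours)

    def add_hour(self, summary: HourlySummary) -> None:
        self._window.append(summary)

    def mean(self) -> float:
        count = sum(s.count for s in self._window)
        return sum(s.total for s in self._window) / count

    def maximum(self) -> float:
        return max(s.maximum for s in self._window)


day = RollingDay()
# 25 hours of fake data: hour h has a single reading of h degrees.
for hour in range(25):
    day.add_hour(summarize([float(hour)]))
print(day.mean(), day.maximum())
```

Each hourly run only touches one hour of raw data; the 24-hour figures come from the stored summaries.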
Our problems have just begun. Let us list a few requirements that complicate our problem.
What if the temperature sensor is placed inside a nuclear plant and our code creates alarms? Creating alarms after one hour has elapsed may not be the best way to handle it. Can we get alerts within 1 second?
What if you want the readings calculated at the hour boundary, while it takes a few seconds for the data to arrive at the storage? Now you cannot start the job at the boundary; you need to watch the disk and trigger the job when the data for the hour boundary has arrived.
Well, you can run Hadoop fast. Will the job finish within 1 second? Can we write the data to the disk, read the data, process it, produce the results, and recombine them with the other 23 hours of data in one second? Now things start to get tight.
The reason you start to feel the friction is that you are not using the right tool for the job. You are using a flat screwdriver when you have an Allen-wrench screw.
Stream Processing
The right tool for this kind of problem is called “Stream Processing”. Here “Stream” refers to the data stream: the sequence of data that will continue to come. “Stream Processing” can watch the data as it comes in, process it, and respond to it in milliseconds.
The following are reasons to move beyond batch processing (Hadoop/Spark), our comfort zone, and consider stream processing.
Some data naturally comes as a never-ending stream of events. To do batch processing, you need to store it, cut it off at some point, and process the data. Then you have to do the next batch and worry about aggregating across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. You can have conditions, look at multiple levels of focus (will discuss this when we get to windows), and also easily look at data from multiple streams simultaneously.
With streaming, you can respond to events faster. You can produce a result within milliseconds of receiving an event (update). With batch this often takes minutes.
Stream processing naturally fits time series data and detecting patterns over time. For example, if you are trying to detect the length of a web session in a never-ending stream (this is an example of trying to detect a sequence), it is very hard to do with batches, as some sessions will fall into two batches. Stream processing can handle this easily. If you take a step back and consider, most continuous data series are time series data. For example, almost all IoT data is time series data. Hence, it makes sense to use a programming model that fits naturally.
Batch lets the data build up and tries to process it all at once, while stream processing handles data as it comes in and hence spreads the processing over time. Hence stream processing can work with a lot less hardware than batch processing.
Sometimes the data is huge and it is not even possible to store it. Stream processing lets you handle large fire-hose-style data and retain only the useful bits.
Finally, there is a lot of streaming data available (e.g. customer transactions, activities, website visits) and it will grow faster with IoT use cases (all kinds of sensors). Streaming is a much more natural model to think about and program those use cases.
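The web-session example from the list above can be sketched in a few lines. This is plain Python over an in-memory iterable, not a stream processing framework. It assumes events arrive in timestamp order, and that a session ends when a user is silent for longer than a chosen gap. The function name and the sample events are made up.

```python
def session_lengths(events, gap):
    """Yield (user, session_length) each time a session closes.

    `events` is an iterable of (user, timestamp) pairs in timestamp order.
    A session closes when the same user's next event is more than `gap`
    time units after the previous one.
    """
    open_sessions = {}  # user -> (session_start, last_seen)
    for user, ts in events:
        if user in open_sessions:
            start, last = open_sessions[user]
            if ts - last > gap:
                # The silence was too long: emit the old session, start a new one.
                yield user, last - start
                open_sessions[user] = (ts, ts)
            else:
                open_sessions[user] = (start, ts)
        else:
            open_sessions[user] = (ts, ts)
    # Only reached for finite input: flush whatever is still open.
    for user, (start, last) in open_sessions.items():
        yield user, last - start


events = [("a", 0), ("b", 5), ("a", 10), ("a", 100), ("a", 110), ("b", 200)]
closed = list(session_lengths(events, gap=30))
print(closed)
```

The state kept per user is tiny, and a result is available as soon as a session closes, with no batch boundary to cut a session in two.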
In HDP 3.1, Hive-Kafka integration was introduced for working with real-time data. For more info, see the docs: Apache Hive-Kafka Integration
You can add Apache Druid to a Hadoop cluster to process OLAP queries on event data, and you can use Hive and Kafka with Druid.
Hadoop/Spark shines at batch processing of large volumes of data, but when your use case revolves around real-time analytics, Kafka Streams and Druid are good options to consider.
Here is a good reference for understanding a similar use case:
https://www.youtube.com/watch?v=3NEQV5mjKfY
Hortonworks also provides HDF Stack (https://hortonworks.com/products/data-platforms/hdf/) which works best with use cases related to data in motion.
The Kafka and Druid documentation is a good place to understand the strengths of both technologies. Here are the documentation links:
Kafka: https://kafka.apache.org/documentation/streams/
Druid: http://druid.io/docs/latest/design/index.html#when-to-use-druid

Storm Trident and Spark Streaming: distributed batch locking

After doing lots of reading and building a POC we are still unsure as to whether Storm Trident or Spark Streaming can handle our use case:
We have an inbound stream of sensor data for millions of devices (which have unique identifiers).
We need to perform aggregation of this stream on a per device level. The aggregation will read data that has already been processed (and persisted) in previous batches.
Key point: When we process data for a particular device we need to ensure that no other processes are processing data for that particular device. This is because the outcome of our processing will affect the downstream processing for that device. Effectively we need a distributed lock.
In addition the event device data needs to be processed in the order that the events occurred.
Essentially we can’t have two batches for the same device being processed at the same time.
Can trident/spark streaming handle our use case?
Any advice appreciated.
Since you have unique IDs, can you divide them up? For example, divide the ID by 10 and, depending on the remainder, send the event to a different processing box. This should also take care of processing each device's events in order, as they will all be sent to the same box. I believe Storm/Trident allows you to guarantee in-order processing. I am not sure about Spark, but I would be surprised if it doesn't.
Pretty awesome problem to solve, I have to say though.
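A minimal sketch of the divide-by-remainder idea, in plain Python rather than Storm or Spark code. The function names and the sample events are made up, and integer device IDs are assumed; for other ID types a stable hash of the ID would play the same role.

```python
def worker_for(device_id: int, num_workers: int) -> int:
    """All events of one device map to the same worker index."""
    return device_id % num_workers


def route(events, num_workers):
    """Distribute (device_id, payload) events while keeping arrival order per worker."""
    queues = [[] for _ in range(num_workers)]
    for device_id, payload in events:
        queues[worker_for(device_id, num_workers)].append((device_id, payload))
    return queues


events = [(17, "e1"), (23, "e1"), (17, "e2"), (40, "e1"), (17, "e3")]
queues = route(events, num_workers=10)
print(queues[7], queues[3], queues[0])
```

Since a device is only ever handled by one worker, and that worker sees its events in arrival order, no distributed lock is needed for per-device processing. In Storm, a fields grouping on the device ID gives the same kind of routing.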

Storm as a replacement for Multi-threaded Consumer/Producer approach to process high volumes?

We have an existing setup where upstream systems send messages to us on a message queue and we process these messages. The content is XML and we simply unmarshal it. This unmarshalling step is followed by a write to the db (to put the relevant values into the relevant columns).
The system is set to interface with many more upstream systems and our volumes are going to increase to a peak size of 40mm per day.
Our current way of processing is to have listeners on the queues and then multiple producer and consumer threads which do the unmarshalling and the subsequent db write.
My question: can this process fit into the Storm use case scenario?
I mean, can the MQ be my spout, with 2 bolts: one to unmarshal, whose output then feeds the next bolt, which does the write to the db?
If yes, what is the benefit that I can derive? Is it a goodbye to the cumbersome multi-threaded producer/worker pattern of code?
If it's as simple as the above, then where and why would one want to resort to the conventional multi-threaded approach to the producer/consumer scenario?
My point being: is there a data volume or frequency at which Storm starts to shine when compared to the conventional approach?
PS: I'm very new to this, trying to get the hang of it, and want to ascertain whether this line of thinking is right.
Regards,
CVM
Definitely this scenario can fit into a storm topology. The spouts can pull from MQ and the bolts can handle the unmarshalling and subsequent processing.
The major benefit over conventional multi threaded pattern is the ability to add more worker nodes as the load increases. This is not so easy with traditional producer consumer patterns.
A specific data volume is hard to give, since it depends on a large number of factors such as hardware.
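To show the shape of such a topology, here is a single-process analogy in plain Python. It is not Storm code and has none of Storm's distribution or fault tolerance. The stage names mirror the spout and bolts described above. The `<order>` message format, the `orders` table, and the in-memory SQLite database are made up for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def spout(messages):
    """Stand-in for reading from the message queue."""
    for raw in messages:
        yield raw


def unmarshal_bolt(raw_messages):
    """Parse each XML message into the column values to store."""
    for raw in raw_messages:
        root = ET.fromstring(raw)
        yield root.findtext("id"), root.findtext("amount")


def db_write_bolt(rows, connection):
    """Write the unmarshalled rows to the database."""
    connection.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", rows)
    connection.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT, amount TEXT)")

queue = [
    "<order><id>1</id><amount>9.50</amount></order>",
    "<order><id>2</id><amount>3.00</amount></order>",
]
db_write_bolt(unmarshal_bolt(spout(queue)), conn)
stored = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
print(stored)
```

In a real topology each stage would run as many parallel instances as needed, spread over worker nodes, which is where the scaling benefit comes from.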

Resources