stream processing watermark heuristics - spark-streaming

How accurate are watermark estimates in stream processing engines such as Apache Beam or Spark Streaming?
My data source is files on GCS/S3, but I use the event time attached to each event as the timestamp for the windowing function. Any ideas on how these stream processing engines calculate this heuristic or estimate, and is there a way to measure how bad the estimate was?
My use case: I have several servers producing event logs on GCS/S3, and I read these files in a streaming fashion from my stream processing engine. Events can be delayed by filesystem outages and failures, or by servers being unable to flush log events for a couple of hours. Correctness is an important aspect of my pipeline when aggregating events, so I am curious how this watermark estimate is computed.

Generally speaking, the watermark is determined by the source. When a source announces a watermark of T, it is saying "I don't expect any more records with event time earlier than T". The streaming engine can then proceed to close the related windows, etc. Some events may still arrive with a timestamp less than T, and those are considered "late". In Apache Beam, you have control over how such late events are handled as well. Sources in Apache Beam provide a watermark by implementing the getWatermark() method (the documentation there is quite helpful too).
In your case, the critical part is knowing how delayed these files could be. You mentioned a couple of hours. A simple heuristic would be to keep the watermark at 'latest event time - 2 hours'. Based on the expected distribution of delays, you could reduce that to 10 minutes to get most of the benefit and treat events delayed beyond that as 'late'.
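The 'latest event time minus a fixed delay' heuristic, together with one way of measuring how often it turns out to be wrong, can be sketched as below. This is a minimal illustration in plain Python, not the Beam or Spark API; the class name `HeuristicWatermark` and the parameter `allowed_delay` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HeuristicWatermark:
    """Watermark = latest event time seen so far minus a fixed allowed delay."""
    allowed_delay: float                      # seconds, e.g. 2 * 3600 or 600
    max_event_time: float = float("-inf")
    total: int = 0                            # events observed
    late: int = 0                             # events that arrived behind the watermark
    lateness: List[float] = field(default_factory=list)

    def watermark(self) -> float:
        return self.max_event_time - self.allowed_delay

    def observe(self, event_time: float) -> bool:
        """Record one event; return True if it arrived behind the watermark."""
        self.total += 1
        wm = self.watermark()
        is_late = event_time < wm
        if is_late:
            self.late += 1
            self.lateness.append(wm - event_time)  # how far behind it was
        self.max_event_time = max(self.max_event_time, event_time)
        return is_late

    def late_fraction(self) -> float:
        return self.late / self.total if self.total else 0.0


# Example: a 10-minute allowed delay, event times in seconds.
wm = HeuristicWatermark(allowed_delay=600)
flags = [wm.observe(t) for t in [1000, 1100, 2000, 1300, 1950]]
```

Tracking `late_fraction()` and the distribution in `lateness` over real traffic gives a rough picture of how bad a chosen delay is, which can then guide whether 10 minutes or 2 hours is the better trade-off.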

Related

Flink: how data is split in parallel tasks

I have a job with parallelism 2; it gets data from a kafka topic and, after keying, it handles timers in a stateful function.
I observed that sometimes one parallelized instance gets stuck: as a result timers do not trigger until a new message arrives, moving forward the current watermark for that parallel instance.
How does Flink split data between parallel instances?
Is there a metric to explore to get a quick view of how messages are split? (in percent or a count)
Apart from reducing parallelism to 1, is there any other tip to solve this issue?
Thanks
With the Kafka source, it depends on the number of partitions. Setting the parallelism higher than the number of partitions will stop the watermark from moving forward, because the source instances that are assigned no partition never produce a watermark. In your case, since it only gets stuck sometimes, one of the partitions probably received no data for a while, which also stops the watermark.
To solve this issue, you can use withIdleness with your watermark strategy; more details can be found in the docs.
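Why one quiet partition holds everything back, and what marking it idle changes, can be illustrated with a toy model. This is a simplified sketch in plain Python, not Flink code; `PartitionedWatermark` and `idle_timeout` are hypothetical names, and the model assumes the downstream watermark is the minimum over all non-idle partitions.

```python
from typing import Dict, Iterable, Optional


class PartitionedWatermark:
    """Toy model: the operator watermark is the minimum over its input partitions."""

    def __init__(self, partitions: Iterable[str], idle_timeout: float, start: float = 0.0):
        self.idle_timeout = idle_timeout
        self.watermarks: Dict[str, float] = {p: float("-inf") for p in partitions}
        self.last_seen: Dict[str, float] = {p: start for p in self.watermarks}

    def on_event(self, partition: str, event_time: float, now: float) -> None:
        self.watermarks[partition] = max(self.watermarks[partition], event_time)
        self.last_seen[partition] = now

    def current(self, now: float, use_idleness: bool = True) -> Optional[float]:
        """Return the combined watermark, or None if every partition is idle."""
        if use_idleness:
            # Partitions silent for longer than idle_timeout are ignored.
            active = [p for p in self.watermarks
                      if now - self.last_seen[p] <= self.idle_timeout]
        else:
            active = list(self.watermarks)
        if not active:
            return None
        return min(self.watermarks[p] for p in active)


pw = PartitionedWatermark(["p0", "p1"], idle_timeout=5)
pw.on_event("p0", 100, now=0)
pw.on_event("p1", 90, now=0)
pw.on_event("p0", 200, now=10)                 # p1 has been silent since now=0
stuck = pw.current(now=10, use_idleness=False)  # held back by p1
moving = pw.current(now=10, use_idleness=True)  # p1 treated as idle
```

Without idleness the combined watermark stays at the silent partition's last value, so timers do not fire; once the silent partition is ignored, the watermark follows the active one.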

How to design a fair processing stream for sources of different rate?

I have a system with events coming from different sources. Some sources have a very high rate while some have low rate.
The number of sources is not constant, a new source could join or existing source could close.
I would like to process those events with some fairness, so that each source has an equal opportunity to process its events and high rate sources do not starve the low rate sources.
The current solution is to write all events to db and for each source schedule a periodic task that tries to fetch a batch of events coming from that source from db and process them. The batch is limited by size and how far back is the query trying to fetch (for high rate sources there might be some losses, as in events written to db but not fetched for processing).
Is there a better, more streamlined solution, that doesn't rely on batching?
I tried thinking of using Kafka to distribute events between partitions, but didn't have anything specific that doesn't rely on fetching data from db in batches.
A solution that scales could be great, but if not, it's OK to have some losses for high rate sources, as long as the solution provides fairness and optimizes resource usage

Can we use Hadoop MapReduce for real-time data process?

We usually use Hadoop MapReduce and its ecosystem (Hive and so on) for batch processing. But I would like to know whether there is any way to use Hadoop MapReduce for real-time data processing, for example live results or live tweets.
If not, what are the alternatives for real-time data processing or analysis?
Real-time App with Map-Reduce
Let’s try to implement a real-time App using Hadoop. To understand the scenario, let’s consider a temperature sensor. Assuming the sensor continues to work, we will keep getting the new readings. So data will never stop.
We should not wait for data to finish, as it will never happen. Then maybe we should continue to do analysis periodically (e.g. every hour). We can run Spark every hour and get the last hour data.
What if, every hour, we need an analysis of the last 24 hours? Should we reprocess the last 24 hours of data every hour? Maybe we can calculate the hourly results, store them, and compute the 24-hour result from those. It will work, but I will have to write code to do it.
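The idea of storing hourly results and combining them into a 24-hour figure can be sketched as follows. This is a minimal in-memory illustration, assuming the analysis is an average temperature; the class name `RollingDailyAverage` is hypothetical, and a real batch job would persist the hourly aggregates rather than keep them in memory.

```python
from collections import deque
from typing import Deque, Iterable, Optional, Tuple


class RollingDailyAverage:
    """Keep one (sum, count) pair per hour and combine the most recent ones."""

    def __init__(self, hours: int = 24):
        # deque(maxlen=...) drops the oldest hour automatically.
        self.buckets: Deque[Tuple[float, int]] = deque(maxlen=hours)

    def add_hour(self, readings: Iterable[float]) -> None:
        values = list(readings)
        self.buckets.append((sum(values), len(values)))

    def average(self) -> Optional[float]:
        total = sum(s for s, _ in self.buckets)
        count = sum(c for _, c in self.buckets)
        return total / count if count else None


# Small window of 2 "hours" to keep the example short.
r = RollingDailyAverage(hours=2)
r.add_hour([10, 20])
r.add_hour([30])
first = r.average()
r.add_hour([50, 70])   # the [10, 20] hour falls out of the window
second = r.average()
```

Storing sums and counts rather than hourly averages matters here: averaging the averages would weight an hour with one reading the same as an hour with thousands.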
Our problems have just begun. Let us list a few requirements that complicate our problem.
What if the temperature sensor is placed inside a nuclear plant and our code creates alarms? Creating alarms after one hour has elapsed may not be the best way to handle it. Can we get alerts within 1 second?
What if you want the readings calculated at the hour boundary, while it takes a few seconds for data to arrive at the storage? Now you cannot start the job at the boundary; you need to watch the disk and trigger the job once the data for the hour boundary has arrived.
Well, you can run Hadoop fast. Will the job finish within 1 second? Can we write the data to disk, read it back, process it, produce the results, and recombine them with the other 23 hours of data in one second? Now things start to get tight.
The reason you start to feel the friction is that you are not using the right tool for the job. You are using a flat screwdriver when you have an Allen-wrench screw.
Stream Processing
The right tool for this kind of problem is called “Stream Processing”. Here “Stream” refers to the data stream: the sequence of data that will continue to come. “Stream Processing” can watch the data as it comes in, process it, and respond to it in milliseconds.
The following are reasons to move beyond batch processing (Hadoop/Spark), our comfort zone, and consider stream processing.
Some data naturally comes as a never-ending stream of events. To do batch processing, you need to store it, cut it off at some time, and process the data. Then you have to do the next batch and worry about aggregating across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. You can have conditions, look at multiple levels of focus (we will discuss this when we get to windows), and also easily look at data from multiple streams simultaneously.
With streaming, you can respond to events faster. You can produce a result within milliseconds of receiving an event (update). With batch this often takes minutes.
Stream processing naturally fits time series data and detecting patterns over time. For example, if you are trying to detect the length of a web session in a never-ending stream (this is an example of trying to detect a sequence), it is very hard to do with batches, as some sessions will fall into two batches. Stream processing can handle this easily. If you take a step back and consider, most continuous data series are time series data. For example, almost all IoT data is time series data. Hence, it makes sense to use a programming model that fits naturally.
Batch processing lets the data build up and tries to process it all at once, while stream processing handles data as it comes in and so spreads the processing over time. Hence stream processing can work with a lot less hardware than batch processing.
Sometimes the data is huge and it is not even possible to store it. Stream processing lets you handle large fire-hose style data and retain only the useful bits.
Finally, there is a lot of streaming data available (e.g. customer transactions, activities, website visits), and it will grow faster with IoT use cases (all kinds of sensors). Streaming is a much more natural model for thinking about and programming those use cases.
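The web-session example mentioned above can be sketched as a small streaming computation. This is a minimal illustration in plain Python rather than any particular engine's API; the function name `sessions` and the `gap` parameter are hypothetical, and it assumes each user's events arrive in timestamp order and that a session ends after `gap` seconds of inactivity.

```python
from typing import Dict, Iterable, Iterator, Tuple


def sessions(events: Iterable[Tuple[str, float]],
             gap: float) -> Iterator[Tuple[str, float, float]]:
    """Yield (user, start, end) each time a user's session closes.

    State is kept per user as the stream flows, so a session is never
    cut in two by a batch boundary.
    """
    open_sessions: Dict[str, Tuple[float, float]] = {}  # user -> (start, last_seen)
    for user, ts in events:
        if user in open_sessions:
            start, last = open_sessions[user]
            if ts - last > gap:
                yield (user, start, last)       # previous session is over
                open_sessions[user] = (ts, ts)  # start a new one
            else:
                open_sessions[user] = (start, ts)
        else:
            open_sessions[user] = (ts, ts)
    # A real stream never ends; this final flush exists only for the demo.
    for user, (start, last) in open_sessions.items():
        yield (user, start, last)


stream = [("a", 0), ("b", 5), ("a", 10), ("b", 20), ("a", 100)]
result = list(sessions(stream, gap=30))
```

The session length is simply `end - start` for each yielded tuple.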
In HDP 3.1, Hive-Kafka integration was introduced for working with real-time data. For more info, see the docs: Apache Hive-Kafka Integration
You can add Apache Druid to a Hadoop cluster to process OLAP queries on event data, and you can use Hive and Kafka with Druid.
Hadoop/Spark shines when handling large volumes of data and batch processing, but when your use case revolves around real-time analytics requirements, Kafka Streams and Druid are good options to consider.
Here is a good reference for understanding a similar use case:
https://www.youtube.com/watch?v=3NEQV5mjKfY
Hortonworks also provides HDF Stack (https://hortonworks.com/products/data-platforms/hdf/) which works best with use cases related to data in motion.
The Kafka and Druid documentation is a good place to understand the strengths of both technologies. Here are the documentation links:
Kafka: https://kafka.apache.org/documentation/streams/
Druid: http://druid.io/docs/latest/design/index.html#when-to-use-druid

Storm Trident and Spark Streaming: distributed batch locking

After doing lots of reading and building a POC we are still unsure as to whether Storm Trident or Spark Streaming can handle our use case:
We have an inbound stream of sensor data for millions of devices (which have unique identifiers).
We need to perform aggregation of this stream on a per device level. The aggregation will read data that has already been processed (and persisted) in previous batches.
Key point: When we process data for a particular device we need to ensure that no other processes are processing data for that particular device. This is because the outcome of our processing will affect the downstream processing for that device. Effectively we need a distributed lock.
In addition the event device data needs to be processed in the order that the events occurred.
Essentially we can’t have two batches for the same device being processed at the same time.
Can trident/spark streaming handle our use case?
Any advice appreciated.
Since you have unique IDs, can you divide them up? Simply divide the ID by 10, for example, and depending on the remainder, send the events to different processing boxes. This should also take care of making sure each device's events are processed in order, as they will all be sent to the same box. I believe Storm/Trident allows you to guarantee in-order processing. Not sure about Spark, but I would be surprised if it doesn't.
Pretty awesome problem to solve, I have to say though.
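The remainder-based routing suggested above can be sketched as follows. This is a minimal illustration in plain Python, not Storm or Spark code; `partition_for` and `route` are hypothetical names, and it assumes device IDs are integers and events arrive in the order they occurred.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # (device_id, payload)


def partition_for(device_id: int, num_workers: int) -> int:
    """All events for one device map to the same worker."""
    return device_id % num_workers


def route(events: Iterable[Event], num_workers: int) -> Dict[int, List[Event]]:
    """Append each event to its worker's queue, preserving arrival order."""
    queues: Dict[int, List[Event]] = defaultdict(list)
    for device_id, payload in events:
        queues[partition_for(device_id, num_workers)].append((device_id, payload))
    return dict(queues)


incoming = [(17, "a"), (23, "b"), (17, "c"), (40, "d")]
queues = route(incoming, num_workers=10)
```

Because a device is only ever handled by one worker, and that worker consumes its queue sequentially, no two batches for the same device run at the same time and no separate distributed lock is needed for this part.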

How to dedupe parallel event streams

Is there a standard approach for deduping parallel event streams? Before I attempt to reinvent the wheel, I want to know if this problem has some known approaches.
My client component will be communicating with two servers. Each one is providing a near real-time event stream (~1 second). The events may occasionally be out of order. Assume I can uniquely identify the events. I need to send a single stream of events to the consuming code at the same near real-time performance.
A lot has been written about this kind of problem. Here's a foundational paper, by Leslie Lamport:
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks
The Wikipedia article on Operational Transformation theory is a perfectly good starting point for further research:
http://en.wikipedia.org/wiki/Operational_transformation
As for your problem, you'll have to choose some arbitrary weight to measure the cost of delay vs. the cost of dropped events. You can maintain two time-ordered priority queues into which incoming events go. You'd do a merge on the heads of the two queues with some delay (to allow for out-of-order events), and throw away events that happened "before" the timestamp of whatever event you last sent. If that's no better than what you had in mind already, well, at least you get to read that cool Lamport paper!
I think the optimization might be OS-specific. From the task as you described it, I picture two threads consuming the incoming data and appending it to a common stream, with access guarded by mutexes. Both Linux and Win32 have mutex-like primitives, but they may perform slowly if your data rate is really high. In that case I'd operate on blocks of data, which lets you take the mutex less often. There is also, of course, a main thread that consumes the data, and it accesses the stream under a mutex as well.
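The two-queue merge described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `DedupMerger` and `delay` are hypothetical names, events are `(timestamp, event_id)` tuples, and duplicates are recognised by ID since the question says events can be uniquely identified. The set of sent IDs grows without bound here and would need pruning in real use.

```python
import heapq
from typing import List, Set, Tuple

Event = Tuple[float, str]  # (timestamp, event_id)


class DedupMerger:
    def __init__(self, delay: float):
        self.delay = delay                      # how long to wait for stragglers
        self.queues: Tuple[List[Event], List[Event]] = ([], [])
        self.sent_ids: Set[str] = set()
        self.last_sent_ts = float("-inf")

    def push(self, source: int, event: Event) -> None:
        """Add an event from server 0 or 1 to its time-ordered queue."""
        heapq.heappush(self.queues[source], event)

    def poll(self, now: float) -> List[Event]:
        """Release events older than now - delay, in time order, without duplicates."""
        out: List[Event] = []
        while True:
            non_empty = [q for q in self.queues if q]
            if not non_empty:
                break
            q = min(non_empty, key=lambda h: h[0])  # queue with the earliest head
            ts, event_id = q[0]
            if ts > now - self.delay:
                break                               # too recent, keep waiting
            heapq.heappop(q)
            if event_id in self.sent_ids or ts < self.last_sent_ts:
                continue                            # duplicate, or too late to order
            self.sent_ids.add(event_id)
            self.last_sent_ts = ts
            out.append((ts, event_id))
        return out


m = DedupMerger(delay=2)
m.push(0, (1.0, "e1"))
m.push(1, (1.0, "e1"))   # same event from the other server
m.push(1, (3.0, "e3"))
m.push(0, (2.0, "e2"))
first = m.poll(now=4.0)
m.push(0, (1.5, "e0"))   # arrives after a later event was already sent
second = m.poll(now=6.0)
```

A larger `delay` drops fewer out-of-order events but adds latency, which is the delay-versus-dropped-events trade-off the answer mentions.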
