After doing lots of reading and building a POC we are still unsure as to whether Storm Trident or Spark Streaming can handle our use case:
We have an inbound stream of sensor data for millions of devices (which have unique identifiers).
We need to perform aggregation of this stream on a per device level. The aggregation will read data that has already been processed (and persisted) in previous batches.
Key point: When we process data for a particular device we need to ensure that no other processes are processing data for that particular device. This is because the outcome of our processing will affect the downstream processing for that device. Effectively we need a distributed lock.
In addition, each device's events need to be processed in the order in which they occurred.
Essentially we can’t have two batches for the same device being processed at the same time.
Can Trident/Spark Streaming handle our use case?
Any advice appreciated.
Since you have unique IDs, can you divide them up? Simply divide the ID by 10, for example, and depending on the remainder, send them to different processing boxes? This should also take care of making sure each device's events are processed in order, as they will all be sent to the same box. I believe Storm/Trident allows you to guarantee in-order processing. Not sure about Spark, but I would be surprised if it doesn't.
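If you go the Storm route, that divide-by-ID idea maps directly onto a fields grouping, which hashes on a tuple field so that all tuples with the same value land on the same bolt task. A minimal sketch, where DeviceSpout and AggregationBolt are placeholders for your own implementations:

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();

// DeviceSpout stands in for whatever reads your sensor stream; it must
// emit tuples that carry a "deviceId" field.
builder.setSpout("sensor-spout", new DeviceSpout(), 4);

// fieldsGrouping routes every tuple with the same deviceId to the same
// bolt task, so no two tasks ever process the same device concurrently
// and a device's events are handled in arrival order.
builder.setBolt("aggregate", new AggregationBolt(), 16)
       .fieldsGrouping("sensor-spout", new Fields("deviceId"));
```

Within a single task, tuples are processed one at a time, which gives you the per-device mutual exclusion you described without an explicit distributed lock.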
Pretty awesome problem to solve, I have to say though.
We have a MongoDB from which we consume a CDC stream using custom Python code. The CDC stream is dumped as files, which are then consumed by Spark, which runs SQL on the files and dumps the result set into Kafka.
Questions:
How do you make sure there is no data loss in the pipeline?
Even if there is some loss, how do you detect and pinpoint it?
How are these handled? What is the industry standard?
This problem is particularly significant when the replication target happens to be Kafka, given Kafka's semantics. On the bright side, as long as you are not compacting topics, it is possible to account for each message received by your consuming application. The issue is having something in the Kafka message that gives you a dense, monotonically increasing sequence number. And there are issues if a consumer is only reading a subset of the data: not all of the sequence numbers would be read, so it becomes hard to know whether data is absent because it's in a topic/partition you aren't reading or whether it is actually missing.
In the perfect situation your source has a sequence number in the user data. From my many customer interactions, this is highly unlikely. In my product (I work for IBM and own the CDC Kafka target engine), we allow a user to introduce a sequence number in the processing of the user data. You can consider doing this at both the subscription and the topic/partition level. But at that point you are trusting that CDC captured the original data and did not have a "bug" in reading it in the first place. Assuming you trust CDC to have at least read the source information from the source log, you can then insert a sequence number with our product if you want to go the do-it-yourself route.
There are problems with this, in that the sequence number is for a given replication session. So if there's an abnormal termination and you start the subscription up again, you might see replication with the new entries starting at zero. You can solve this by storing the number you left off at in the same location where you note the effective log position on the source that you've replicated to.
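As a rough illustration of the do-it-yourself route (this is generic consumer-side logic, not our product's API), a gap check might look like the following, assuming you have already arranged for a dense per-partition sequence number and can parse it out of each record:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;

public class SequenceGapDetector {
    // Last sequence number seen for each partition we are reading.
    private final Map<TopicPartition, Long> lastSeq = new HashMap<>();

    // Call once per record; seq must be the dense, monotonically
    // increasing number your pipeline injected for that partition.
    public void check(ConsumerRecord<?, ?> rec, long seq) {
        TopicPartition tp = new TopicPartition(rec.topic(), rec.partition());
        Long prev = lastSeq.get(tp);
        if (prev != null) {
            if (seq <= prev) {
                System.out.println("duplicate/replay on " + tp + " at seq=" + seq);
            } else if (seq != prev + 1) {
                System.out.println("gap on " + tp + ": " + prev + " -> " + seq);
            }
        }
        lastSeq.put(tp, seq);
    }
}
```

The per-partition state would itself need to be persisted across consumer restarts, for the same reason the replication-session counter above needs to be.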
To solve all of this, I designed something called the Transactionally Consistent Consumer. It removes duplication and exactly resequences operations. It has a checkpoint set of bytes that can be used to restart the source stream at any point previously seen (allowing for downstream data loss or incomplete processing). It does require that you trust CDC originally captured all the changes (which is the point of an enterprise-grade replication product). If you happened to have source-generated sequence numbers, those could work in conjunction with this.
https://www.ibm.com/docs/en/idr/11.4.0?topic=kafka-transactionally-consistent-consumer
If you're interested, I also did a presentation at Kafka Summit on the idea behind the technology, here:
https://www.confluent.io/kafka-summit-sf18/a-solution-for-leveraging-kafka-to-provide-end-to-end-acid-transactions/
Hopefully that helps a bit with how enterprise-grade products approach this.
Cheers.
There is no industry standard, only hard work to capture a few key things, work that never ends if the design is not simple from the outset.
It's like asynchronous eventing in SOA: hard to do, so not done often.
I work on such things; we test well, assume that some loss may occur, and weigh the cost against the benefit.
E.g. writing to Azure Event Hub from a TIBCO Cloud Mashery API by a client and manipulating the Event Hub via a post-insert Azure Function, or a CDC feed from Oracle via SharePlex, Kafka, and Spark's batch Kafka integration.
I have a topology that consists of two source topics which are read and processed by two different processors in a Kafka Streams app. One processor, A, reads its corresponding topic and creates a persistent local store, which is shared with the other processor, B, in the topology.
My issue is that, after a restart, I somehow need to pause processor B for a very small amount of time and give processor A time to read some events from its topic and update its local store before processor B starts its processing.
Since both processors belong to the same sub-topology, I can't use Thread.sleep in init(), for example, because this would cause the whole app to stall.
So is there a way to make processor B in the topology wait/stall for a very small amount of time when the application restarts, before it starts reading from the source topic and processing events?
Processing order is based on record timestamps. Hence, if the timestamps of the records processed by A are smaller than the timestamps of the records processed by B, those "A records" will be processed first.
Explicitly pausing one side does not make sense, as it may violate the processing order. Just make sure that your input data is properly timestamped and you don't have to worry about manual pausing.
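The usual way to make sure records are properly timestamped is a custom TimestampExtractor, a standard Kafka Streams extension point, that pulls event time from the payload instead of using the broker's log-append or producer time. A sketch, where SensorEvent and getEventTime() are placeholders for your own record type:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class EventTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        // Use the event time carried inside the payload, if present.
        if (record.value() instanceof SensorEvent) {
            return ((SensorEvent) record.value()).getEventTime();
        }
        // Fall back to the partition time seen so far for malformed records.
        return partitionTime;
    }
}
```

You register it via the `default.timestamp.extractor` config. If one input topic can be momentarily empty after a restart, raising `max.task.idle.ms` can also help: it makes the task wait for data on all of its input partitions before choosing the next record, instead of racing ahead on whichever partition has data.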
We usually use Hadoop MapReduce and its ecosystem (like Hive) for batch processing. But is there any way we can use Hadoop MapReduce for real-time data processing, for example live results or live tweets?
If not, what are the alternatives for real-time data processing or analysis?
Real-time App with MapReduce
Let’s try to implement a real-time app using Hadoop. To understand the scenario, let’s consider a temperature sensor. Assuming the sensor continues to work, we will keep getting new readings. So the data will never stop.
We should not wait for the data to finish, as that will never happen. Then maybe we should do the analysis periodically (e.g. every hour). We can run Spark every hour and get the last hour's data.
What if every hour we need the last 24 hours' analysis? Should we reprocess the last 24 hours of data every hour? Maybe we can calculate the hourly data, store it, and use those results to calculate the 24-hour figures. That will work, but I will have to write code to do it.
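The bookkeeping for that hourly rollup is simple enough; here is a toy sketch of the idea (all names illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RollingDay {
    private final Deque<Double> hourlyTotals = new ArrayDeque<>();
    private double daySum = 0.0;

    // Called once per hourly batch with that hour's aggregate; returns
    // the running 24-hour total without reprocessing any raw data.
    public double addHour(double hourTotal) {
        hourlyTotals.addLast(hourTotal);
        daySum += hourTotal;
        if (hourlyTotals.size() > 24) {
            daySum -= hourlyTotals.removeFirst();  // evict the oldest hour
        }
        return daySum;
    }
}
```

Which is exactly the point: the logic is trivial, but it is glue code you now have to own, schedule, and make fault-tolerant yourself.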
Our problems have just begun. Let us list a few requirements that complicate our problem.
What if the temperature sensor is placed inside a nuclear plant and our code creates alarms? Creating alarms after one hour has elapsed may not be the best way to handle it. Can we get alerts within 1 second?
What if you want the readings calculated at the hour boundary, while it takes a few seconds for the data to arrive at the storage? Now you cannot start the job at your boundary; you need to watch the disk and trigger the job once the data for the hour boundary has arrived.
Well, you can run Hadoop fast. But will the job finish within 1 second? Can we write the data to disk, read it, process it, produce the results, and recombine them with the other 23 hours of data in one second? Now things start to get tight.
The reason you start to feel the friction is that you are not using the right tool for the job. You are using a flat screwdriver when you have an Allen screw.
Stream Processing
The right tool for this kind of problem is called "Stream Processing". Here "stream" refers to the data stream: the sequence of data that will continue to come. Stream processing can watch the data as it comes in, process it, and respond to it in milliseconds.
Following are the reasons we want to move beyond batch processing (Hadoop/Spark), our comfort zone, and consider stream processing.
Some data naturally comes as a never-ending stream of events. To do batch processing, you need to store it, cut it off at some point, and process the data. Then you have to do the next batch and then worry about aggregating across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. You can have conditions, look at multiple levels of focus (we will discuss this when we get to windows), and also easily look at data from multiple streams simultaneously.
With streaming, you can respond to events faster. You can produce a result within milliseconds of receiving an event (update). With batch this often takes minutes.
Stream processing naturally fits time-series data and detecting patterns over time. For example, if you are trying to detect the length of a web session in a never-ending stream (this is an example of trying to detect a sequence), it is very hard to do with batches, as some sessions will fall into two batches. Stream processing handles this easily. If you take a step back and consider, the most common continuous data series are time-series data. For example, almost all IoT data are time-series data. Hence, it makes sense to use a programming model that fits them naturally.
Batch lets the data build up and tries to process it all at once, while stream processing handles data as it comes in and hence spreads the processing over time. Stream processing can therefore work with a lot less hardware than batch processing.
Sometimes the data is huge and it is not even possible to store it. Stream processing lets you handle large, firehose-style data and retain only the useful bits.
Finally, there is a lot of streaming data available (e.g. customer transactions, activities, website visits) and it will grow faster with IoT use cases (all kinds of sensors). Streaming is a much more natural model to think about and program those use cases.
In HDP 3.1, Hive-Kafka integration was introduced for working with real-time data. For more info, see the docs: Apache Hive-Kafka Integration
You can add Apache Druid to a Hadoop cluster to process OLAP queries on event data, and you can use Hive and Kafka with Druid.
Hadoop/Spark shines at handling large volumes of data and batch processing on them, but when your use case revolves around real-time analytics requirements, Kafka Streams and Druid are good options to consider.
Here's a good reference for understanding a similar use case:
https://www.youtube.com/watch?v=3NEQV5mjKfY
Hortonworks also provides HDF Stack (https://hortonworks.com/products/data-platforms/hdf/) which works best with use cases related to data in motion.
The Kafka and Druid documentation is a good place to understand the strengths of both technologies. Here are the links:
Kafka: https://kafka.apache.org/documentation/streams/
Druid: http://druid.io/docs/latest/design/index.html#when-to-use-druid
We have an existing setup where upstream systems send messages to us on a message queue and we process these messages. The content is XML and we simply unmarshal it. This unmarshalling step is followed by a write to the DB (to put the relevant values into the relevant columns).
The system is set to interface with many more upstream systems, and our volumes are going to increase to a peak of 40 million messages per day.
Our current way of processing is to have listeners on the queues and then multiple threads of producers and consumers that do the unmarshalling and the subsequent DB write.
My question: can this process fit into the Storm use case scenario?
I mean, can MQ feed my spout, with one bolt to unmarshal and its output feeding the next bolt, which does the write to the DB?
If yes, what benefit can I derive? Is it goodbye to the cumbersome multi-threaded producer/worker pattern of code?
If it's as simple as the above, then where/why would one want to resort to the conventional multi-threaded producer/consumer approach?
My point being: is there a data volume/frequency at which Storm starts to shine compared to the conventional approach?
PS: I'm very new to this, trying to get the hang of it, and want to ascertain whether this line of thinking is right.
Regards,
CVM
This scenario can definitely fit into a Storm topology. The spouts can pull from MQ and the bolts can handle the unmarshalling and subsequent processing.
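Wiring it up is a few lines; a sketch, where MqSpout, UnmarshalBolt, and DbWriterBolt are stand-ins for your MQ listener, XML unmarshalling, and DB insert code:

```java
import org.apache.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("mq-spout", new MqSpout(), 2);

// The unmarshalling bolt subscribes to the spout's output stream...
builder.setBolt("unmarshal", new UnmarshalBolt(), 8)
       .shuffleGrouping("mq-spout");

// ...and the DB-writer bolt subscribes to the unmarshalling bolt's
// output stream (a bolt never literally "becomes" a spout; it just
// feeds the next bolt).
builder.setBolt("db-writer", new DbWriterBolt(), 8)
       .shuffleGrouping("unmarshal");
```

The parallelism hints (2 and 8 here, purely illustrative) replace your hand-rolled thread pools: Storm spins up that many executors for each component across the cluster.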
The major benefit over the conventional multi-threaded pattern is the ability to add more worker nodes as the load increases. This is not so easy with traditional producer/consumer patterns.
A specific data volume number is a very broad question, since it depends on a large number of factors such as hardware.
Is there a standard approach for deduping parallel event streams? Before I attempt to reinvent the wheel, I want to know if this problem has some known approaches.
My client component will be communicating with two servers. Each one is providing a near real-time event stream (~1 second). The events may occasionally be out of order. Assume I can uniquely identify the events. I need to send a single stream of events to the consuming code at the same near real-time performance.
A lot has been written about this kind of problem. Here's a foundational paper, by Leslie Lamport:
http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html#time-clocks
The Wikipedia article on Operational Transformation theory is a perfectly good starting point for further research:
http://en.wikipedia.org/wiki/Operational_transformation
As for your problem, you'll have to choose some arbitrary weight to measure the cost of delay vs. the cost of dropped events. You can maintain two time-ordered priority queues where incoming events go. You'd do a merge on the heads of the two queues with some delay (to allow for out-of-order events), and throw away events that happened "before" the timestamp of whatever event you last sent. If that's no better than what you had in mind already, well, at least you get to read that cool Lamport paper!
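A rough sketch of that two-queue merge, where Event is a stand-in for your own uniquely identifiable, timestamped event type:

```java
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.function.Consumer;

public class StreamMerger {
    public record Event(String id, long timestamp) {}

    private final PriorityQueue<Event> a =
        new PriorityQueue<>((x, y) -> Long.compare(x.timestamp(), y.timestamp()));
    private final PriorityQueue<Event> b =
        new PriorityQueue<>((x, y) -> Long.compare(x.timestamp(), y.timestamp()));
    private final Set<String> seen = new HashSet<>();
    private final long delayMillis;          // grace period for out-of-order events
    private long lastEmitted = Long.MIN_VALUE;

    public StreamMerger(long delayMillis) { this.delayMillis = delayMillis; }

    public void offerA(Event e) { a.add(e); }  // feed from server 1
    public void offerB(Event e) { b.add(e); }  // feed from server 2

    // Emit, in timestamp order, every event older than (now - delayMillis),
    // dropping duplicates and events that arrive "before" the last one sent.
    public void drain(long now, Consumer<Event> out) {
        while (true) {
            PriorityQueue<Event> q = olderHead();
            if (q == null || q.peek().timestamp() > now - delayMillis) return;
            Event e = q.poll();
            if (e.timestamp() < lastEmitted) continue;   // arrived too late: drop
            if (!seen.add(e.id())) continue;             // already emitted: dedupe
            lastEmitted = e.timestamp();
            out.accept(e);
        }
    }

    private PriorityQueue<Event> olderHead() {
        if (a.isEmpty()) return b.isEmpty() ? null : b;
        if (b.isEmpty()) return a;
        return a.peek().timestamp() <= b.peek().timestamp() ? a : b;
    }
}
```

Tuning delayMillis is the weight mentioned above: a larger value tolerates more reordering at the cost of added latency.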
I think the optimization might be OS-specific. From the task as you described it, I picture two threads consuming incoming data and appending it to a common stream, with access guarded by mutexes. Both Linux and Win32 have mutex-like primitives, but they may perform poorly if the data rate is really high. In that case I'd operate on blocks of data, which lets you take the mutex less often. Of course there's also a main thread that consumes the data, and it too accesses the stream under the mutex.
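A small sketch of the operate-by-blocks idea in Java (a BlockingQueue hides the mutex here, but the batching principle is the same on any platform):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BlockedHandoff {
    private static final int BLOCK_SIZE = 256;
    private final BlockingQueue<List<String>> shared = new LinkedBlockingQueue<>();

    // Called on each reader thread; events are batched locally and a whole
    // block is handed off at once, so the shared queue's lock is taken
    // once per block instead of once per event.
    public void produce(Iterable<String> incoming) throws InterruptedException {
        List<String> block = new ArrayList<>(BLOCK_SIZE);
        for (String event : incoming) {
            block.add(event);
            if (block.size() == BLOCK_SIZE) {
                shared.put(block);                // one lock acquisition per block
                block = new ArrayList<>(BLOCK_SIZE);
            }
        }
        if (!block.isEmpty()) shared.put(block);  // flush the partial tail
    }

    // The main thread drains whole blocks at a time.
    public List<String> consume() throws InterruptedException {
        return shared.take();
    }
}
```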