Apache Beam stream processing failure recovery - spark-streaming

I am running a streaming Beam pipeline where I stream files/records from GCS using AvroIO, then create minutely/hourly buckets to aggregate events and write them to BQ. If the pipeline fails, how can I recover correctly and process only the unprocessed events? I do not want to double count events.
One approach I was considering is writing to Spanner or Bigtable, but it may be the case that the write to BQ succeeds while the DB write fails, and vice versa.
How can I maintain state in a reliable, consistent way in a streaming pipeline so that only unprocessed events are processed?
I want to make sure the final aggregated data in BQ is the exact count for each event type, with no under- or over-counting.
How does a Spark Streaming pipeline solve this? (I know it has a checkpointing directory for managing the state of queries and DataFrames.)
Are there any recommended techniques for solving this kind of problem accurately in streaming pipelines?

Based on clarification from the comments, this question boils down to: 'can we achieve exactly-once semantics across two successive runs of a streaming job, assuming both runs start from scratch?' The short answer is no. Even if the user is willing to store some state in external storage, it needs to be committed atomically/consistently with the streaming engine's internal state. Streaming engines like Dataflow and Flink store the required state internally, which is needed to 'resume' a job. With Flink you could resume from the latest savepoint, and with Dataflow you can 'update' a running pipeline (note that Dataflow does not actually kill your job even when there are errors; you need to cancel a job explicitly). Dataflow does provide an exactly-once processing guarantee across an update.
Somewhat relaxed guarantees are feasible with careful use of external storage. The details really depend on the specific goals (often it is not worth the extra complexity).
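As one hedged illustration of the "relaxed guarantees via external storage" point: BigQuery streaming inserts de-duplicate on insertId on a best-effort basis, so deriving the insertId deterministically from the aggregation key and window reduces (but does not eliminate) double counting on retry. A minimal sketch using the standalone google-cloud-bigquery Java client; the dataset, table, and field names below are made up:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

import java.util.HashMap;
import java.util.Map;

public class DeterministicInsertExample {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId table = TableId.of("my_dataset", "event_counts"); // made-up dataset/table

        // Illustrative aggregate for one event type and one window.
        String eventType = "purchase";
        String windowStart = "2018-01-01T10:00:00Z";
        long eventCount = 42L;

        // Derive the insertId from the aggregation key + window, so a retry after a
        // failure produces the same id and BigQuery's best-effort dedup can drop it.
        String insertId = eventType + "#" + windowStart;

        Map<String, Object> row = new HashMap<>();
        row.put("event_type", eventType);
        row.put("window_start", windowStart);
        row.put("event_count", eventCount);

        InsertAllResponse response = bigquery.insertAll(
                InsertAllRequest.newBuilder(table).addRow(insertId, row).build());
        if (response.hasErrors()) {
            System.err.println("Insert errors: " + response.getInsertErrors());
        }
    }
}
```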

Related

Apache Flink relating/caching data options

This is a very broad question. I'm new to Flink and looking into the possibility of using it as a replacement for our current analytics engine.
The scenario is: data is collected from various equipment and received as a JSON-encoded string with the format {"location.attribute": value, "TimeStamp": value}.
For example, a unitary traceability code is received for a location, after which various process parameters are received in a real-time stream. The analysis is to be run over the process parameters, but the output needs to include a relation to the traceability code, for example {"location.alarm": value, "location.traceability": value, "TimeStamp": value}.
What method does Flink use for caching values (in this case the current traceability code) while running analysis over other parameters received at a later time?
I'm mainly just looking for the area to research, as so far I've been unable to find any examples of this kind of scenario. Perhaps it's not the kind of process that Flink can handle.
A natural way to do this sort of thing with Flink would be to key the stream by the location, and then use keyed state in a ProcessFunction (or RichFlatMapFunction) to store the partial results until ready to emit the output.
With a keyed stream, you are guaranteed that every event with the same key will be processed by the same instance. You can then use keyed state, which is effectively a sharded key/value store, to store per-key information.
The Apache Flink training includes some explanatory material on keyed streams and working with keyed state, as well as an exercise or two that explore how to use these mechanisms to do roughly what you need.
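As a minimal sketch of that approach (the SensorEvent/EnrichedEvent POJOs and their accessors are assumed, using the Flink 1.x RichFlatMapFunction API), keyed state can hold the latest traceability code per location:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// SensorEvent / EnrichedEvent are assumed POJOs for the JSON records described above.
public class TraceabilityEnricher extends RichFlatMapFunction<SensorEvent, EnrichedEvent> {

    // One value per key (i.e. per location), managed and checkpointed by Flink.
    private transient ValueState<String> traceabilityCode;

    @Override
    public void open(Configuration parameters) {
        traceabilityCode = getRuntimeContext().getState(
                new ValueStateDescriptor<>("traceability-code", String.class));
    }

    @Override
    public void flatMap(SensorEvent event, Collector<EnrichedEvent> out) throws Exception {
        if (event.isTraceabilityCode()) {
            // Remember the most recent traceability code for this location.
            traceabilityCode.update(event.getValue());
        } else {
            // Attach the last seen code (may be null if none has arrived yet).
            out.collect(new EnrichedEvent(event.getLocation(), event, traceabilityCode.value()));
        }
    }
}

// Usage: events.keyBy(SensorEvent::getLocation).flatMap(new TraceabilityEnricher());
```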
Alternatively, you could do this with the Table or SQL API and implement it as a join of the stream with itself.

Exactly-once guarantee in Storm Trident in network partitioning and/or failure scenarios

So, Apache Storm + Trident provide exactly-once semantics. Imagine I have the following topology:
TridentSpout -> SumMoneyBolt -> SaveMoneyBolt -> Persistent Storage.
SumMoneyBolt sums monetary values in memory, then passes the result to SaveMoneyBolt, which should save the final value to a remote storage/database.
Now it is very important that we calculate these values and store them in the database only once. We do not want to accidentally double count the money.
So how does Storm with Trident handle network partitioning and/or failure scenarios when the write request to the database has been successfully sent, the database has successfully received the request, logged the transaction, and, while responding to the client, SaveMoneyBolt has either died or been partitioned from the network before receiving the database response?
I assume that if SaveMoneyBolt had died, Trident would retry the batch, but we cannot afford double counting.
How are such scenarios handled?
Thanks.
Trident assigns a unique transaction id (txid) to each batch. If a batch is retried, it will have the same txid. Batch state updates are also ordered, i.e. the state update for a batch will not happen until the update for the previous batch is complete. So, by storing the txid along with the values in the state, Trident can de-duplicate the updates and provide exactly-once semantics.
Trident comes with a few built-in Map state implementations that handle all this automatically.
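For example, a rough sketch of such a topology (the spout, field names, and grouping key below are assumed) using the built-in in-memory Map state:

```java
import org.apache.storm.generated.StormTopology;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Sum;
import org.apache.storm.trident.spout.ITridentSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class MoneyTopology {

    // 'spout' is assumed to be a transactional (or opaque transactional) Trident spout
    // emitting ("account", "amount") tuples.
    public static StormTopology build(ITridentSpout<?> spout) {
        TridentTopology topology = new TridentTopology();
        topology.newStream("money-stream", spout)
                .groupBy(new Fields("account"))
                // persistentAggregate stores the running sum together with the batch txid,
                // so a retried batch with the same txid is not applied twice.
                // MemoryMapState is the in-memory state from the testing package; a real
                // deployment would use a store-backed MapState factory instead.
                .persistentAggregate(new MemoryMapState.Factory(), new Fields("amount"),
                                     new Sum(), new Fields("total"));
        return topology.build();
    }
}
```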
For more information, take a look at the docs:
http://storm.apache.org/releases/1.0.1/Trident-tutorial.html
http://storm.apache.org/releases/current/Trident-state.html

Amazon Web Services: Spark Streaming or Lambda

I am looking for some high-level guidance on an architecture. I have a provider writing "transactions" to a Kinesis pipe (about 1MM/day). I need to pull those transactions off one at a time, validate the data, hit other SOAP or REST services for additional information, apply some business logic, and write the results to S3.
One approach that has been proposed is a Spark job that runs forever, pulling data and processing it within the Spark environment. The benefits enumerated were shareable cached data, availability of SQL, and in-house knowledge of Spark.
My thought was to have a series of Lambda functions that would process the data. As I understand it, I can have a Lambda watching the Kinesis pipe for new data. I want to run the pulled data through a bunch of small steps (Lambdas), each one doing a single step in the process. This seems like an ideal use of Step Functions. With regard to caches, if any are needed, I thought that Redis on ElastiCache could be used.
Can this be done using a combination of Lambda and Step Functions (using lambdas)? If it can be done, is it the best approach? What other alternatives should I consider?
This can be achieved using a combination of Lambda and Step Functions. As you described, the Lambda would monitor the stream and kick off a new execution of a state machine, passing the transaction data to it as input. You can see more documentation around Kinesis with Lambda here: http://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html.
The state machine would then pass the data from one Lambda function to the next, where the data will be processed and written to S3. You will need to contact AWS for an increase of the default StartExecution API limit of 2 per second to support 1MM/day.
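A minimal sketch of the Kinesis-triggered Lambda that starts one state machine execution per record, assuming the AWS SDK for Java v1 and the aws-lambda-java-events library; the class name and the STATE_MACHINE_ARN environment variable are illustrative:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;
import com.amazonaws.services.stepfunctions.AWSStepFunctions;
import com.amazonaws.services.stepfunctions.AWSStepFunctionsClientBuilder;
import com.amazonaws.services.stepfunctions.model.StartExecutionRequest;

import java.nio.charset.StandardCharsets;

public class KinesisToStepFunctions implements RequestHandler<KinesisEvent, Void> {

    // ARN of the state machine that runs the validate/enrich/write-to-S3 steps (illustrative).
    private static final String STATE_MACHINE_ARN = System.getenv("STATE_MACHINE_ARN");

    private final AWSStepFunctions stepFunctions = AWSStepFunctionsClientBuilder.defaultClient();

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
            // Each Kinesis record becomes one state machine execution.
            String payload = StandardCharsets.UTF_8
                    .decode(record.getKinesis().getData())
                    .toString();
            stepFunctions.startExecution(new StartExecutionRequest()
                    .withStateMachineArn(STATE_MACHINE_ARN)
                    .withInput(payload));
        }
        return null;
    }
}
```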
Hope this helps!

Using NiFi for scheduling Hadoop batch processes

According to NiFi's homepage, it "supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic".
I've been playing with NiFi for the last couple of months and can't help wondering why not use it for scheduling batch processes as well.
Let's say I have a use case in which data flows into Hadoop, is processed by a series of Hive/MapReduce jobs, and is then exported to some external NoSQL database to be used by another system.
Using NiFi in order to ingest and flow data into Hadoop is a use case that NiFi was built for.
However, using NiFi to schedule the jobs on Hadoop ("Oozie-like") is a use case I haven't encountered others implementing, and since it seems completely possible to implement, I'm trying to understand if there are reasons not to do so.
The gain of doing it all in NiFi is that one gets a visual representation of the entire course of the data, from source to destination, in one place. For complicated flows this is very important for maintenance.
In other words, my question is: are there reasons not to use NiFi as a scheduler/coordinator for batch processes? If so, what problems may arise in such a use case?
PS - I've read this: "Is Nifi having batch processing?" - but my question is about a different sense of "batch processing in NiFi" than the one raised in the linked question.
You are correct that you would have a concise visual representation of all steps in the process if the schedule triggers were present on the flow canvas, but NiFi was not designed as a scheduler/coordinator. Here is a comparison of some scheduler options.
Using NiFi to control scheduling feels like a "hammer" solution in search of a problem. It would reduce the ease of defining those schedules programmatically or interacting with them from external tools. Theoretically, you could define the schedule format, read the schedules into NiFi from a file, datasource, endpoint, etc., and use the ExecuteStreamCommand, ExecuteScript, or InvokeHTTP processors to kick off the batch process. This feels like introducing an unnecessary intermediate step, however. If consolidation and visualization are what you are going for, you could have a monitoring flow segment ingest those schedule definitions from their native format (Oozie, XML, etc.) and display them in NiFi without making NiFi responsible for defining and executing the scheduling.

Difference between Esper and Apache Storm?

My goal is to build a distributed application with publishers and consumers, where I use CEP to process streams of data and generate event notifications for event consumers.
What's the difference between Esper and Apache Storm?
Can I achieve what Storm does with only Esper, and when should I consider integrating Esper with Storm?
I'm confused; I thought Esper provided the same functionality.
Storm is a distributed realtime computation system that can generally be used for any purpose, while Esper is an event stream processing and event correlation engine (Complex Event Processing); Esper is therefore more specific.
Here are some example use cases for each:
Storm can be used to consume data from Twitter in real time and compute the most-used hashtag per topic.
Esper can be used to detect events like: a train normally travels at 100 miles per hour and its speed is reported every second; if the train's speed increases to 130 miles per hour within 10 minutes, an event is generated to notify the train operator.
There are some more criteria you can consider when selecting between them:
Storm is designed for distributed processing out of the box, while Esper does not appear to be (my team evaluated both in 2016)
Open source license (Storm) vs Commercial license (Esper)
Storm provides abstractions for running pieces of code, and its value is in managing the distribution aspects such as workers, management, and restarts. Whatever piece of code you plug in is yours to write, and your code must worry about how and where it keeps state such as counts or other derived information. Storm is often used for extract-transform-load or for ingesting streams into some store.
Esper provides abstractions for detecting situations among sequences of events through an "Event Processing Language (EPL)" that conforms to SQL92. It's an expressive, concise and extensible means of detecting situations with little or no code. EPL can be dynamically injected and managed, so you can add and remove rules at runtime without restarts. If you were to do this with code, you would always have to restart the JVM/workers/topology.
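As a rough illustration of that style (not a definitive example; this assumes the pre-8 Esper client API, and the SpeedEvent class is made up to match the train example above), an EPL rule can be registered and removed at runtime like this:

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;

public class SpeedAlertExample {

    // Hypothetical event class registered with the engine.
    public static class SpeedEvent {
        private final String trainId;
        private final double speed;
        public SpeedEvent(String trainId, double speed) { this.trainId = trainId; this.speed = speed; }
        public String getTrainId() { return trainId; }
        public double getSpeed() { return speed; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("SpeedEvent", SpeedEvent.class);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // An EPL rule created (and removable) at runtime, no JVM restart needed.
        EPStatement alert = engine.getEPAdministrator().createEPL(
                "select trainId, speed from SpeedEvent(speed > 130)");
        alert.addListener((newEvents, oldEvents) ->
                System.out.println("Overspeed: " + newEvents[0].get("trainId")
                        + " at " + newEvents[0].get("speed")));

        engine.getEPRuntime().sendEvent(new SpeedEvent("train-1", 135.0)); // triggers the listener

        alert.destroy(); // rules can also be removed at runtime
    }
}
```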
Esper achieves horizontal scalability by integrating directly with Kafka Streams, but the code for this is currently not open source.
In my opinion, Storm offers good resource management, such as load balancing, automatic restart when a machine goes down, an easier way to add machines or tasks, etc., which you have to handle yourself if using Esper alone.
