My goal is to build a distributed application with publishers and consumers, where I use CEP to process streams of data and generate event notifications for event consumers.
What's the difference between Esper and Apache Storm?
Can I achieve what Storm does with only Esper, and when should I consider integrating Esper with Storm?
I'm confused; I thought Esper provided the same functionality.
Storm is a distributed real-time computation system that can be used for practically any purpose, while Esper is an event stream processing and event correlation engine (Complex Event Processing), so Esper is more specialized.
Here are some example use cases for each:
Storm can be used to consume data from Twitter in real time and compute the most-used hashtag per topic.
Esper can be used to detect events like this: a train normally travels at 100 miles per hour and its speed is reported every second. If the train's speed rises to 130 miles per hour within 10 minutes, an event is generated to notify the train operator (see the EPL sketch below).
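For illustration, here is a rough sketch of that rule in EPL, deployed through Esper's classic (pre-8.x) Java API. The TrainSpeed event class, its fields, and the engine setup are assumptions made for the example, not taken from the original answer.

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;

public class TrainSpeedAlert {

    // Illustrative event type: one speed report per second, per train.
    public static class TrainSpeed {
        private final String trainId;
        private final double speed;
        public TrainSpeed(String trainId, double speed) { this.trainId = trainId; this.speed = speed; }
        public String getTrainId() { return trainId; }
        public double getSpeed() { return speed; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("TrainSpeed", TrainSpeed.class);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // Fire when a train reaches 130 mph within 10 minutes of an earlier report.
        String epl = "select b.trainId as trainId, b.speed as speed "
                   + "from pattern [every a=TrainSpeed -> "
                   + "  b=TrainSpeed(trainId = a.trainId, speed >= 130) where timer:within(10 min)]";

        EPStatement stmt = engine.getEPAdministrator().createEPL(epl);
        stmt.addListener((newEvents, oldEvents) ->
                System.out.println("Notify operator: train " + newEvents[0].get("trainId")
                        + " at " + newEvents[0].get("speed") + " mph"));

        engine.getEPRuntime().sendEvent(new TrainSpeed("IC-100", 100.0));
        engine.getEPRuntime().sendEvent(new TrainSpeed("IC-100", 131.0)); // triggers the alert
    }
}
```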
Here are some more criteria you can consider when selecting between them:
Storm is designed for distributed processing out of the box, while Esper does not seem to be (my team evaluated this in 2016).
Open source license (Storm) vs Commercial license (Esper)
Storm provides abstractions for running pieces of code, and its value is in managing the distribution aspects such as workers, management, and restarts. Whatever piece of code you plug in is yours to write, and your code must worry about how and where it keeps state such as counts or other derived information. Storm is often used for extract-transform-load or for ingesting streams into some store.
Esper provides abstractions for detecting situations among sequences of events through an "Event Processing Language" (EPL) that conforms to the SQL-92 standard. It's an expressive, concise and extensible means of detecting situations while writing little or no code. EPL can be injected and managed dynamically, so you can add and remove rules at runtime without restarts; if you were to do this in code, you would always have to restart the JVM/workers/topology (see the sketch below).
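Continuing the Esper sketch from above (same pre-8.x API; newer Esper versions use a compile-and-deploy API instead), adding and removing a rule at runtime might look roughly like this:

```java
// Add a rule at runtime: no JVM, worker or topology restart required.
EPStatement overspeed = engine.getEPAdministrator()
        .createEPL("select trainId, speed from TrainSpeed where speed >= 130");
overspeed.addListener((newEvents, oldEvents) ->
        System.out.println("Overspeed: " + newEvents[0].get("trainId")));

// ...later, retire just this rule while the engine keeps running.
overspeed.destroy();
```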
Esper offers horizontal scalability by integrating directly with Kafka Streams, but that code is currently not open source.
In my opinion, Storm offers good resource management, such as load balancing, automatic restarts when a machine goes down, an easier way to add machines or tasks, etc., which you would have to handle yourself if using Esper alone.
Related
I have a simple ZeroMQ PUB/SUB architecture for streaming data from the publisher to subscribers. When subscribers connect, the publisher starts streaming the data, but I want to modify this so that the publisher publishes the most recent snapshot of the data first and only after that starts streaming.
How can I achieve this?
Q : How can I achieve this?
( this being: "... streaming the data but I want to modify it, so that publisher publishes the most recent snapshot of the data first and after that starts streaming." )
Solution :
Instantiate a pair of PUB-s, the first called aSnapshotPUBLISHER, the second aStreamingPUBLISHER. Using the XPUB archetype for the former may help to easily integrate some add-on logic for subscriber-base management (a nice-to-have feature, yet somewhat off-topic at the moment).
Configure the former with aSnapshotPUBLISHER.setsockopt( ZMQ_CONFLATE, 1 ); other settings may focus on reducing latency and on ensuring all the needed resources are available both for smooth streaming via aStreamingPUBLISHER and for keeping the most recent snapshot readily available in aSnapshotPUBLISHER for any newcomer.
SUB-side agents simply follow this approach: having set up a pair of working ( .bind()/.connect() ) links ( to either of the PUB-s, or to a pair of XPUB+PUB ) and having confirmed the links are up and running smoothly, they stop sourcing snapshots from aSnapshotPUBLISHER and keep consuming only the ( now synced using a BaseID / TimeStamp / FrameMarker or similarly aligned ) streaming data from aStreamingPUBLISHER.
The known ZMQ_CONFLATE-mode limitation of not supporting multi-frame message payloads need not be considered a problem, since a low-latency rule of thumb is to pack/compress any data into right-sized BLOB-s rather than moving any sort of "decorated" but inefficient data-representation formats over the wire.
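A minimal publisher-side sketch of this pair of sockets, written with JeroMQ (a Java ZeroMQ binding). The endpoints, message contents and the assumption that JeroMQ's setConflate() maps to ZMQ_CONFLATE are illustrative, not part of the original answer.

```java
import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class SnapshotAndStreamPublisher {
    public static void main(String[] args) throws InterruptedException {
        try (ZContext ctx = new ZContext()) {
            // Snapshot channel: ZMQ_CONFLATE keeps only the latest message,
            // so a late joiner always receives the most recent snapshot.
            ZMQ.Socket aSnapshotPUBLISHER = ctx.createSocket(SocketType.PUB);
            aSnapshotPUBLISHER.setConflate(true);          // ZMQ_CONFLATE = 1
            aSnapshotPUBLISHER.bind("tcp://*:5556");

            // Streaming channel: ordinary PUB fan-out of incremental updates.
            ZMQ.Socket aStreamingPUBLISHER = ctx.createSocket(SocketType.PUB);
            aStreamingPUBLISHER.bind("tcp://*:5557");

            long frame = 0;
            while (!Thread.currentThread().isInterrupted()) {
                byte[] update = ("update#" + frame).getBytes(ZMQ.CHARSET);
                byte[] snapshot = ("snapshot-up-to#" + frame).getBytes(ZMQ.CHARSET);

                aStreamingPUBLISHER.send(update);          // incremental stream
                aSnapshotPUBLISHER.send(snapshot);         // overwritten by each newer publish
                frame++;
                Thread.sleep(100);
            }
        }
    }
}
```

A newcomer SUB would connect to the snapshot endpoint first, take the single conflated message it receives as its baseline, and then switch to (or merge in) the streaming endpoint once the frame markers line up.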
I am running a streaming Beam pipeline where I stream files/records from GCS using AvroIO and then create minutely/hourly buckets to aggregate events and add them to BQ. If the pipeline fails, how can I recover correctly and process only the unprocessed events? I do not want to double-count events.
One approach I was considering is writing to Spanner or Bigtable, but it may be the case that the write to BQ succeeds while the DB write fails, and vice versa.
How can I maintain state in a reliable, consistent way in a streaming pipeline so that only unprocessed events get processed?
I want to make sure the final aggregated data in BQ has the exact count for the different events, neither under- nor over-counted.
How does a Spark streaming pipeline solve this? (I know they have a checkpointing directory for managing the state of queries and DataFrames.)
Are there any recommended techniques for solving this kind of problem accurately in streaming pipelines?
Based on clarification from the comments, this question boils down to: "Can we achieve exactly-once semantics across two successive runs of a streaming job, assuming both runs start from scratch?" The short answer is no. Even if the user is willing to store some state in external storage, it needs to be committed atomically/consistently with the streaming engine's internal state. Streaming engines like Dataflow and Flink store the required state internally, which is needed to 'resume' a job. With Flink you can resume from the latest savepoint, and with Dataflow you can 'update' a running pipeline (note that Dataflow does not actually kill your job even when there are errors; you need to cancel a job explicitly). Dataflow does provide an exactly-once processing guarantee with 'update'.
Somewhat relaxed guarantees would be feasible with careful use of external storage. The details really depend on the specific goals (often it is not worth the extra complexity).
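To make "relaxed guarantees via external storage" a bit more concrete, one common pattern is to make the final write idempotent: derive a deterministic key from the window and event type and upsert the aggregate under that key, so a replayed window overwrites the previous value instead of adding to it. A minimal sketch follows; the IdempotentSink interface is hypothetical, standing in for a keyed store or a BigQuery MERGE.

```java
import java.time.Instant;

public class IdempotentAggregateWriter {

    /** Hypothetical keyed sink: an upsert by key, e.g. a BigQuery MERGE or a Bigtable row write. */
    interface IdempotentSink {
        void upsert(String key, long count);
    }

    private final IdempotentSink sink;

    IdempotentAggregateWriter(IdempotentSink sink) {
        this.sink = sink;
    }

    /** The key depends only on the window and event type, never on processing time,
     *  so re-running the same window produces the same key and overwrites the old row. */
    static String aggregateKey(String eventType, Instant windowStart) {
        return eventType + "|" + windowStart;
    }

    void write(String eventType, Instant windowStart, long count) {
        sink.upsert(aggregateKey(eventType, windowStart), count);
    }
}
```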
For every processor there is a way to configure the processor, and there is a context menu to view data provenance.
Is there a good explanation of what is data provenance?
Data provenance is all about understanding the origin and attribution of data. In a typical system you get 'logs'. When you consider data flowing through a series of processes and queues, you end up with a lot of logs, of course. If you want to follow the path a given piece of data took, or how long it took to take that path, or what happened to an object that got split up into different objects, and so on, all of that is really time-consuming and tough. The provenance that NiFi supports is like logging on steroids and is all about keeping and tracking the relationships between data and the events that shaped and impacted what happened to it. NiFi keeps track of where each piece of data comes from, what it learned about the data, maintains the trail across splits, joins and transformations, where it sends it, and ultimately when it drops the data. Think of it like a chain of custody for data.
This is really valuable for a few reasons. First, understanding and debugging. Having this provenance capture means that from a given event you can go forwards or backwards in the flow to see where data came from and where it went. Given that NiFi also has an immutable versioned content store under the covers, you can use this to click directly to the content at each stage of the flow. You can also replay the content and context of a given event against the latest flow. This in turn means much faster iteration towards the configuration and results you want. This provenance model is also valuable for compliance reasons. You can prove whether you sent data to the correct systems or not. If you learn that you didn't, you then have data with which to address the issue, or a powerful audit trail for follow-up.
The provenance model in Apache NiFi is really powerful, and it is being extended to Apache MiNiFi, which is a subproject of Apache NiFi, as well. More systems producing more provenance means a far stronger ability to track data end-to-end. Of course this becomes even more powerful when it can be combined with other lineage systems or centralized lineage stores. Apache Atlas may be a great system to integrate with for this, to bring a centralized view. NiFi is able not only to do what I described above but also to send these events to such a central store. So, exciting times ahead for this.
Hope that helps.
For a couple of days I've been trying to figure out how to inform the rest of the microservices that a new entity was created in microservice A, which stores that entity in MongoDB.
I want to:
Have low coupling between the microservices
Avoid distributed transactions between microservices like Two Phase Commit (2PC)
At first a message broker like RabbitMQ seems to be a good tool for the job, but then I see the problem that committing the new document in MongoDB and publishing the message to the broker is not atomic.
Why event sourcing? by eventuate.io:
One way of solving this issue is to make the schema of the documents a bit dirtier by adding a flag that says whether the document has been published to the broker, and having a scheduled background process that searches for unpublished documents in MongoDB and publishes them to the broker using confirmations; when the confirmation arrives, the document is marked as published (using at-least-once and idempotency semantics). This solution is proposed in this and this answers (a sketch of such a poller is shown below).
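A rough sketch of that background publisher, assuming the MongoDB Java driver and a Kafka producer as the broker; the collection name, topic name and "published" field are illustrative assumptions.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.bson.Document;

import java.util.Properties;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

public class UnpublishedDocumentPoller {

    public static void main(String[] args) throws Exception {
        MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> entities =
                mongo.getDatabase("serviceA").getCollection("entities");

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        while (true) {
            // Find documents that have not yet been published to the broker.
            for (Document doc : entities.find(eq("published", false))) {
                String id = doc.getObjectId("_id").toHexString();

                // Block until the broker confirms the write.
                producer.send(new ProducerRecord<>("entity-created", id, doc.toJson())).get();

                // Only mark as published after the confirmation. A crash between the send
                // and this update re-publishes the document on the next pass, which is why
                // this gives at-least-once delivery and consumers must be idempotent.
                entities.updateOne(eq("_id", doc.getObjectId("_id")), set("published", true));
            }
            Thread.sleep(1000);   // polling interval
        }
    }
}
```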
Reading an Introduction to Microservices by Chris Richardson, I ended up at this great presentation on Developing functional domain models with event sourcing, where one of the slides asked:
How to atomically update the database and publish events without 2PC? (the dual write problem).
The answer is simple (on the next slide)
Update the database and publish events
This is a different approach from this one, which is based on CQRS à la Greg Young.
The domain repository is responsible for publishing the events, this would normally be inside a single transaction together with storing the events in the event store.
I think that delegating the responsibilities of storing and publishing the events to the event store is a good thing, because it avoids the need for 2PC or a background process.
However, in a certain way it's true that:
If you rely on the event store to publish the events you'd have a tight coupling to the storage mechanism.
But we could say the same if we adopted a message broker for intercommunication between the microservices.
The thing that worries me most is that the Event Store seems to become a Single Point of Failure.
If we look at this example from eventuate.io, we can see that if the event store is down, we can't create accounts or money transfers, losing one of the advantages of microservices (although the system will continue to respond to queries).
So, is it correct to say that the Event Store, as used in the eventuate example, is a Single Point of Failure?
What you are facing is an instance of the Two Generals' Problem. Basically, you want two entities on a network to agree on something, but the network is not fail-safe. It has been proven that this cannot be guaranteed.
So no matter how many new entities you add to your network (the message queue being one), you will never have 100% certainty that agreement will be reached. In fact, the opposite happens: the more entities you add to your distributed system, the less certain you can be that an agreement will eventually be reached.
A practical answer for your case is that 2PC is not that bad when the alternative is adding even more complexity and more single points of failure. If you absolutely do not want a single point of failure and are willing to assume that the network is reliable (in other words, that the network itself cannot be a single point of failure), you can try a P2P algorithm such as a DHT, but for two peers I bet it reduces to simple 2PC.
We handle this with the Outbox approach in NServiceBus:
http://docs.particular.net/nservicebus/outbox/
This approach requires that the initial trigger for the whole operation came in as a message on the queue, but it works very well.
You could also create a flag for each entry inside the event store which tells whether the event has already been published. Another process could poll the event store for those unpublished events and put them onto a message queue or topic. The disadvantage of this approach is that consumers of this queue or topic must be designed to de-duplicate incoming messages, because this pattern only guarantees at-least-once delivery. Another disadvantage could be latency because of the polling frequency. But since we have already entered eventually consistent territory here, this might not be such a big concern.
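A minimal sketch of the consumer-side de-duplication this pattern requires. The in-memory set is for illustration only; in practice the processed-ID record would live in a durable store updated in the same transaction as the consumer's own writes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

public class DeduplicatingConsumer {
    // In-memory for illustration only; a durable store keyed by event ID,
    // updated transactionally with the consumer's own state, is the real thing.
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> handler;

    DeduplicatingConsumer(Consumer<String> handler) {
        this.handler = handler;
    }

    /** At-least-once delivery means the same event ID may arrive more than once;
     *  only the first occurrence is handled. */
    void onMessage(String eventId, String payload) {
        if (!processedEventIds.add(eventId)) {
            return; // duplicate delivery, already processed
        }
        handler.accept(payload);
    }
}
```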
How about if we had two event stores, and whenever a Domain Event is created, it is queued onto both of them? The event handler on the query side would then handle events popped from both event stores.
Of course every event should be idempotent.
But wouldn't this solve our problem of the event store being a single point of failure?
Not particularly a MongoDB solution, but have you considered leveraging the Streams feature introduced in Redis 5 to implement a reliable event store? Take a look at this intro here.
I find that it has a rich set of features like message tailing and message acknowledgement, as well as the ability to extract unacknowledged messages easily. This surely helps to implement at-least-once messaging guarantees. It also supports load balancing of messages using the "consumer group" concept, which can help with scaling the processing part.
Regarding your concern about it being a single point of failure: as per the documentation, streams and consumer information can be replicated across nodes and persisted to disk (using regular Redis mechanisms, I believe). This helps address the single-point-of-failure issue. I'm currently considering using this for one of my microservices projects.
Architecture question: We have an Apache Kafka-based eventing system and multiple systems producing/sending events. Each event has some data, including an ID, and I need to implement an "ID is complete" event. Example:
Event_A(id)
Event_B(id)
Event_C(id)
are received asynchronously, and only once all 3 events have been received do I need to send an Event_Complete(id). The problem is that we have multiple clusters of consumers and our database is eventually consistent.
A simple way would be to use the eventually consistent DB to store which events we have seen for each ID, and to add a "cron" job to eventually catch race conditions (a sketch of this tracking logic is shown below).
It feels like a problem that might have been solved out there already. So my question is: is there a better way to do it (without introducing a consistent datastore into the picture)?
Thanks a bunch!
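For reference, a minimal sketch of the "track which events have arrived per ID" idea mentioned above. The required event-type set and the in-memory map (standing in for the eventually consistent DB) are illustrative assumptions, and downstream consumers of Event_Complete(id) would still need to treat it idempotently, since two consumers may both observe completion.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CompletionTracker {
    // Event types that must all arrive before an ID counts as complete.
    private static final Set<String> REQUIRED = Set.of("Event_A", "Event_B", "Event_C");

    // Stand-in for the eventually consistent DB: received event types per ID.
    private final Map<String, Set<String>> receivedById = new ConcurrentHashMap<>();

    /** Record an incoming event; returns true once all required event types have been
     *  seen for this ID, i.e. when Event_Complete(id) may be emitted. A periodic "cron"
     *  sweep over this state can catch IDs whose completion was missed due to races. */
    public boolean record(String id, String eventType) {
        Set<String> received =
                receivedById.computeIfAbsent(id, k -> ConcurrentHashMap.newKeySet());
        received.add(eventType);
        return received.containsAll(REQUIRED);
    }
}
```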