Replay projection in production - event-sourcing

How do we replay a projection in a production environment?
For example, we have about 100k events, and a replay takes about 15 minutes. If we do this live, new events may come in while we replay, so the projection will not be up to date when the replay finishes.
So, aside from scheduling system downtime, how do we replay the projection gracefully?

A projection is always (potentially) out of date. Projections are Data on the Outside -- unlocked, non-authoritative copies of the real data.
The fact that projection updates lag behind the changes to the authoritative copies of the data is an inevitable consequence of distributing copies of the data.
So, aside from scheduling system downtime, how do we replay the projection gracefully?
You accept into your design that the projections are data "as at" some time in the past; and you let the system run with the previously cached projection while the new projection is assembled.

We typically name our projections. If you projected all your order events into projected-orders-v1, you can create projected-orders-v2 in parallel and let it build up in the background.
When it is ready, you make the code change required to read from the new projection.
After that you can delete your old projection if you want.
This requires that your projection mechanism can read your event log from the beginning independently.
Update: Designing your system according to CQRS, separating READS from WRITES, solves this because there will be separate, non-conflicting processes. One process is responsible for writing events to the end of the event stream, and (at least) one is responsible for reading from the beginning of the event stream. The process reading the events doesn't have to care whether an event is new or not; it only has to keep track of its position (the last known event) and keep reading forever.
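As a rough sketch of that read-side process (the EventStore, Event, and OrdersProjectionV2 types here are hypothetical placeholders, not any particular framework's API), a catch-up rebuilder might look like this:

    // Minimal sketch of a catch-up projection rebuild. EventStore, Event and
    // OrdersProjectionV2 are hypothetical placeholders, not a real library API.
    import java.util.List;

    public class ProjectionRebuilder {

        interface EventStore {
            // Read a batch of events starting at the given position (0 = start of the log).
            List<Event> readFrom(long position, int batchSize);
        }

        record Event(long position, String type, String payload) {}

        static class OrdersProjectionV2 {
            void apply(Event e) { /* update the projected-orders-v2 read model */ }
        }

        public static void rebuild(EventStore store, OrdersProjectionV2 projection) throws InterruptedException {
            long position = 0; // last known event; persist this checkpoint in a real system
            while (true) {
                List<Event> batch = store.readFrom(position, 500);
                for (Event e : batch) {
                    projection.apply(e);          // old and new events are handled the same way
                    position = e.position() + 1;  // advance (and ideally checkpoint) the position
                }
                if (batch.isEmpty()) {
                    Thread.sleep(1000);           // caught up; poll for newly appended events
                }
            }
        }
    }

Once projected-orders-v2 has caught up to the head of the stream, you make the code change to read from it and can drop projected-orders-v1.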

Related

Existing works on distributed animation queues?

I'm building out an animation library on a distributed system. It is currently working well for a single process, but I'd like to take advantage of the distributed nature of the system I'm working on. To that end, the process the animations are spawned from holds the state of the values used to render the scene.
When I start to conceptualize how the animation queue could work, I keep running into race conditions around the scene's state. For example, my animation implementation takes values: you essentially provide the value that a given property should be set to. The animation library is responsible for building the timing frames, i.e. the in-between values, given a duration, frame rate, and easing. Each frame is popped off the animation queue and evaluated, the property is updated to the frame's value, and the scene is rendered.
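Roughly sketched (the names here are illustrative stand-ins, not my actual API), the single-process loop is something like:

    // Illustrative single-process version; Frame and the scene map are simplified
    // stand-ins for the real library types.
    import java.util.Deque;
    import java.util.Map;

    public class AnimationLoop {
        record Frame(String property, double value) {}

        static void run(Deque<Frame> queue, Map<String, Double> sceneState) {
            while (!queue.isEmpty()) {
                Frame frame = queue.pop();                        // pop the next timing frame
                sceneState.put(frame.property(), frame.value());  // set the property to the frame's value
                render(sceneState);                               // render the scene with the new state
            }
        }

        static void render(Map<String, Double> sceneState) {
            // draw the scene; in the distributed version this shared state is what races
        }
    }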
However, when I start to think about updating the scene with new property values, the race conditions and the question of how to handle multiple processes trying to update that state concurrently come to mind.
So I'm interested in knowing whether there are pre-existing works that achieve a similar goal, or other efforts to distribute animation handling across processes, that I can reference and learn from.

Can we use Hadoop MapReduce for real-time data process?

We usually use Hadoop MapReduce and its ecosystem (Hive, etc.) for batch processing. But I would like to know whether there is any way we can use Hadoop MapReduce for real-time data processing, for example live results or live tweets.
If not, what are the alternatives for real-time data processing or analysis?
Real-time App with Map-Reduce
Let's try to implement a real-time app using Hadoop. To understand the scenario, consider a temperature sensor. Assuming the sensor keeps working, we will keep getting new readings, so the data will never stop.
We cannot wait for the data to finish, as that will never happen. So perhaps we should run the analysis periodically (e.g. every hour): run Spark every hour and process the last hour's data.
What if, every hour, we need an analysis of the last 24 hours? Should we reprocess the last 24 hours of data every hour? Maybe we can calculate the hourly aggregates, store them, and combine them into the 24-hour result. That will work, but we have to write code to do it.
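A back-of-the-envelope sketch of that roll-up idea (the HourlyStat record and the surrounding storage are hypothetical; only the arithmetic is shown):

    // Sketch of "aggregate each hour once, then combine stored hourly aggregates
    // into a 24-hour result" without reprocessing the raw readings.
    import java.time.Instant;
    import java.util.List;

    public class HourlyRollup {
        record HourlyStat(Instant hour, double sum, long count) {}

        // Run once per hour over that hour's raw readings, then persist the result.
        static HourlyStat aggregateHour(Instant hour, List<Double> readings) {
            double sum = readings.stream().mapToDouble(Double::doubleValue).sum();
            return new HourlyStat(hour, sum, readings.size());
        }

        // Combine the stored stats for the last 24 hours into a daily average.
        static double dailyAverage(List<HourlyStat> last24Hours) {
            double sum = last24Hours.stream().mapToDouble(HourlyStat::sum).sum();
            long count = last24Hours.stream().mapToLong(HourlyStat::count).sum();
            return count == 0 ? Double.NaN : sum / count;
        }
    }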
Our problems have just begun. Let us list a few requirements that complicate the problem.
What if the temperature sensor is placed inside a nuclear plant and our code creates alarms? Creating an alarm only after an hour has elapsed may not be the best way to handle it. Can we get alerts within one second?
What if you want the readings calculated at the hour boundary, but it takes a few seconds for the data to arrive at the storage? Now you cannot simply start the job at the hour boundary; you need to watch the disk and trigger the job once the data for that hour boundary has arrived.
Well, you can run Hadoop fast. But will the job finish within one second? Can we write the data to disk, read it back, process it, produce the results, and recombine them with the other 23 hours of data in one second? Now things start to get tight.
The reason you start to feel the friction is that you are not using the right tool for the job. You are using a flathead screwdriver when you have an Allen screw.
Stream Processing
The right tool for this kind of problem is called "Stream Processing". Here "Stream" refers to the data stream: the sequence of data that will keep coming. "Stream Processing" can watch the data as they come in, process them, and respond to them within milliseconds.
The following are reasons to move beyond batch processing (Hadoop/Spark), our comfort zone, and consider stream processing.
Some data naturally comes as a never-ending stream of events. To do batch processing, you need to store it, cut it off at some point, and process the data. Then you have to do the next batch and worry about aggregating across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. You can have conditions, look at multiple levels of focus (we will discuss this when we get to windows), and also easily look at data from multiple streams simultaneously.
With streaming, you can respond to events faster. You can produce a result within milliseconds of receiving an event (an update). With batch processing this often takes minutes.
Stream processing naturally fits time-series data and detecting patterns over time. For example, if you are trying to detect the length of a web session in a never-ending stream (an example of trying to detect a sequence), it is very hard to do with batches, as some sessions will fall into two batches. Stream processing handles this easily. If you take a step back and consider, the most common continuous data series are time series; for example, almost all IoT data are time-series data. Hence, it makes sense to use a programming model that fits them naturally.
Batch processing lets the data build up and tries to process it all at once, while stream processing processes data as they come in, spreading the processing over time. Hence stream processing can work with a lot less hardware than batch processing.
Sometimes the data is so huge that it is not even possible to store it. Stream processing lets you handle fire-hose-scale data and retain only the useful bits.
Finally, there is a lot of streaming data available (e.g. customer transactions, activities, website visits), and it will grow faster with IoT use cases (all kinds of sensors). Streaming is a much more natural model to think about and program those use cases.
In HDP 3.1, Hive-Kafka integration was introduced for working with real-time data. For more info, see the docs: Apache Hive-Kafka Integration
You can add Apache Druid to a Hadoop cluster to process OLAP queries on event data, and you can use Hive and Kafka with Druid.
Hadoop/Spark shines at handling large volumes of data and batch processing over it, but when your use case revolves around real-time analytics, Kafka Streams and Druid are good options to consider.
Here's a good reference for understanding a similar use case:
https://www.youtube.com/watch?v=3NEQV5mjKfY
Hortonworks also provides the HDF stack (https://hortonworks.com/products/data-platforms/hdf/), which works best with use cases involving data in motion.
The Kafka and Druid documentation is a good place to understand the strengths of both technologies. Here are their documentation links:
Kafka: https://kafka.apache.org/documentation/streams/
Druid: http://druid.io/docs/latest/design/index.html#when-to-use-druid
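To make the Kafka Streams option a bit more concrete, here is a minimal sketch for the temperature-sensor scenario above. The topic names, threshold, and bootstrap address are made up for illustration, and it assumes a recent Kafka Streams release (roughly 3.x):

    // Sketch only: topic names, threshold, and broker address are illustrative.
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Printed;
    import org.apache.kafka.streams.kstream.TimeWindows;

    import java.time.Duration;
    import java.util.Properties;

    public class TemperatureAlerts {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "temperature-alerts");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Double().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, Double> readings = builder.stream("sensor-readings");

            // React within milliseconds of a dangerous reading arriving, no hourly batch needed.
            readings.filter((sensorId, temperature) -> temperature > 90.0)
                    .to("temperature-alerts");

            // Continuous hourly aggregation instead of an hourly batch job.
            readings.groupByKey()
                    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
                    .count()
                    .toStream()
                    .print(Printed.toSysOut());

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
        }
    }

The point is that the alert fires as soon as the reading arrives, and the hourly aggregation is maintained continuously instead of being recomputed by a scheduled batch job.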

How to account for clock offsets in a distributed system?

Background
I have a system consisting of several distributed services, each of which is continuously generating events and reporting these to a central service.
I need to present a unified timeline of the events, where the ordering in the timeline corresponds to the moment each event occurred. The frequency of event occurrence and the network latency are such that I cannot simply use the time of arrival at the central collector to order the events.
For example, an event E1 may need to be rendered in the timeline above an event E2 despite arriving at the collector after it, which means the events need to come with timestamp metadata. This is where the problem arises.
Problem
Due to constraints on how the environment is set up, it is not possible to ensure that the local time services on each machine are reliably aware of current UTC time. I can assume that each machine can accurately gauge relative time, i.e. the clock speeds are close enough to make measurement of short timespans identical, but problems like NTP misconfiguration/partitioning make it impossible to guarantee that every machine agrees on the current UTC time.
This means that a naive approach of simply generating a local timestamp for each event as it occurs, then ordering events using that will not work: every machine has its own opinion of what universal time is.
So the question is: how can I recover an ordering for events generated in a distributed system where the clocks do not agree?
Approaches I've considered
Most solutions I find online go down the path of trying to synchronize all the clocks, which is not possible for me since:
I don't control the machines in question
The reason the clocks are out of sync in the first place is due to network flakiness, which I can't fix
My own idea was to query some kind of central time service every time an event is generated, then stamp that event with the retrieved time minus the network flight time. This gets hairy, because I have to add another service to the system and ensure its availability (I'm back to square one if the other services can't reach this one). I was hoping there was some clever way to do this that doesn't require me to centralize timekeeping in this way.
A simple solution, somewhat inspired by your own idea at the end, is to periodically ping what I'll call the time-source server. In the ping, include the service's local (chip) clock reading; the time-source echoes that back along with its own timestamp. The service can then deduce the round-trip time and guess that the time-source's clock read that timestamp roughly round-trip-time/2 nanoseconds ago. You can then use this as an offset to the local chip clock to determine a global-ish time.
You don't have to use a separate service for this; the collector server will do. The important part is that you don't have to call the time-source server on every request, which keeps it off the critical path.
If you don't want a sawtooth function for the time, you can smooth the time difference.
Congratulations, you've rebuilt NTP!
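For reference, a minimal sketch of that offset estimate with the smoothing applied (the echo transport itself is left out, since it can be whatever RPC the services already use):

    // Sketch of the offset estimation described above; only the arithmetic is shown.
    public class ClockOffsetEstimator {
        private double smoothedOffsetNanos = 0.0;
        private boolean initialized = false;

        // localSend/localReceive come from System.nanoTime() around the echo call;
        // serverNanos is the timestamp the time-source put in its reply.
        public void onEchoReply(long localSendNanos, long serverNanos, long localReceiveNanos) {
            long roundTrip = localReceiveNanos - localSendNanos;
            // Guess the server stamped its reply roughly half a round trip before we received it.
            long offsetSample = (serverNanos + roundTrip / 2) - localReceiveNanos;
            if (!initialized) {
                smoothedOffsetNanos = offsetSample;
                initialized = true;
            } else {
                // Exponential smoothing so the derived clock doesn't jump on every ping.
                smoothedOffsetNanos = 0.9 * smoothedOffsetNanos + 0.1 * offsetSample;
            }
        }

        // A "global-ish" timestamp to attach to events before sending them to the collector.
        public long globalishNanos() {
            return System.nanoTime() + (long) smoothedOffsetNanos;
        }
    }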

Creating real time datawarehouse

I am doing a personal project that consists of creating the full architecture of a data warehouse (DWH). As the ETL and BI analysis tool I decided to use Pentaho; it has a lot of functionality, from easy dashboard creation to full data mining processes and OLAP cubes.
I have read that a data warehouse must be a relational database, and I understand this. What I don't understand is how to achieve a near-real-time, or fully real-time, DWH. I have read about push and pull strategies, but my conclusions are the following:
The choice of DBMS is not important for creating a real-time DWH; it is possible with MySQL, SQL Server, Oracle or any other. As this is a personal project, I chose MySQL.
The key factor is the frequency of job scheduling, and this is the task of the scheduler. Is this assumption correct? I mean, is the key to creating a real-time DWH to schedule every ETL process to run every second?
If I am wrong, can you help me understand this? And if so, what is the way to create a real-time DWH? Is there any open source scheduler that allows this? Or any commercial scheduler that does?
I am very confused because some references say this is impossible, and others that it is possible.
Definition
Very interesting question. First of all, it should be defined how "real-time" realtime needs to be. Realtime means very low latency for incoming data, but it requires a good architecture in the sending systems, perhaps an event bus or messaging queue, and good infrastructure on the receiving end. This usually involves some kind of listener and pushing from the delivering systems.
Near-realtime would be the next "lower" level. If we say near-realtime means about 5 minutes of delay at most, your approach could work as well. For example, you could pull the data every minute or so. But keep in mind that you need some kind of high-performance check for whether new data is available and which data to fetch. If this check plus the pull takes longer than a minute, it becomes hard to keep up with the data. It really depends on the volume.
Realtime
As I said before, realtime analytics ideally requires a messaging queue or a service bus that some of your jobs can connect to and "listen" on for new data. When a new data package is pushed into the pipeline, it will probably be very small and can be processed very quickly.
If there is no infrastructure for listeners, you need to go near-realtime.
Near-realtime
This is the part where you have to develop more. You have to make sure you fetch relatively small data packages, which will usually be some kind of delta. This could be done with triggers if you have access to the source database. Otherwise you have to pull every once in a while, where your "once in a while" will probably be very frequent.
This could be done on Linux, for example, with a simple cron job, or on Windows with the Task Scheduler. Just keep in mind that your loading and processing time shouldn't exceed the time window you have until the next job starts.
Database
In the end, once you have defined what you want to achieve and have a general idea of how to implement delta loading or listeners, you are right - you could use a relational database. If you are interested in performance and are modelling this part as a star schema, you could also look into column-based engines or column-based databases such as Apache Cassandra.
Scheduling
For job scheduling you could also start with the standard Linux or Windows scheduling tools. If you code in Java, you could later use something like Quartz. But this only applies to near-realtime; realtime requires a different architecture, as explained above.
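As a rough illustration of the near-realtime variant, a minimal Quartz (2.x) sketch that fires a delta-load job every minute could look like this; the actual delta extraction is only hinted at in a comment, since it depends entirely on the source system:

    // Minimal Quartz 2.x sketch; the delta extraction/loading itself is hypothetical.
    import org.quartz.*;
    import org.quartz.impl.StdSchedulerFactory;

    public class DeltaLoadJob implements Job {
        @Override
        public void execute(JobExecutionContext context) throws JobExecutionException {
            // Hypothetical: pull only rows changed since the last high-water mark
            // (e.g. a modification timestamp) and load them into the DWH.
        }

        public static void main(String[] args) throws SchedulerException {
            JobDetail job = JobBuilder.newJob(DeltaLoadJob.class)
                    .withIdentity("delta-load", "dwh")
                    .build();

            // Fire every minute; the pull + load must stay well under this 60-second window.
            Trigger trigger = TriggerBuilder.newTrigger()
                    .withIdentity("every-minute", "dwh")
                    .startNow()
                    .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                            .withIntervalInSeconds(60)
                            .repeatForever())
                    .build();

            Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
            scheduler.start();
            scheduler.scheduleJob(job, trigger);
        }
    }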

Apache Kylin fault tolerance

Apache Kylin looks like a great tool that will fill the needs of a lot of data scientists. It's also a very complex system. We are developing an in-house solution with exactly the same goal in mind: a multidimensional OLAP cube with low query latency.
Among the many issues, the one I'm concerned of the most right now is about fault tolerance.
With large volumes of incoming transactional data, the cube must be incrementally updated, and some of the cuboids are updated over a long period of time, such as those with a time-dimension value at the scale of a year. Over such a long period, some piece of this complex system is guaranteed to fail, so how does the system ensure that all the raw transactional records are aggregated into the cuboids exactly once, no more, no less? Even if each of the pieces has its own fault tolerance mechanism, that doesn't mean they will play together automatically.
For simplicity, we can assume all the input data are saved in HDFS by another process, and can be "played back" in any way you want to recover from any interruption, voluntary or forced. What are Kylin's fault tolerance considerations, or is it not really an issue?
There are data faults and system faults.
Data fault tolerance: Kylin partitions a cube into segments and allows rebuilding an individual segment without impacting the whole cube. For example, assume a new daily segment is built each day and gets merged into a weekly segment on the weekend; weekly segments merge into a monthly segment, and so on. When there is a data error (or any other change) within the current week, you only need to rebuild one day's segment. Data changes further back will require rebuilding a weekly or monthly segment.
The segment strategy is fully customizable, so you can balance data error tolerance against query performance. More segments means more tolerance to data changes, but also more scans to execute for each query. Kylin provides a RESTful API; an external scheduling system can invoke the API to trigger segment builds and merges.
A cube is still online and can serve queries while some of its segments are being rebuilt.
System fault tolerance: Kylin relies on Hadoop and HBase for most system redundancy and fault tolerance. In addition, every build step in Kylin is idempotent, meaning you can safely retry a failed step without any side effects. This ensures the final result is correct, no matter how many failures and retries the build process has been through.
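For intuition about what an idempotent build step buys you, a generic write-to-temporary-location-then-atomic-move pattern looks roughly like this (the paths and contents are placeholders, not Kylin internals):

    // Generic sketch of an idempotent step: rerunning it after a failure converges
    // to the same end state. Paths and the "segment data" payload are placeholders.
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class IdempotentStep {
        static void runStep(Path tmpOut, Path finalOut) throws Exception {
            if (Files.exists(finalOut)) {
                return; // a previous attempt already completed; retrying is a no-op
            }
            // Recompute the step's output from scratch into a temporary location.
            Files.createDirectories(tmpOut.getParent());
            Files.writeString(tmpOut, "segment data");
            // Publish atomically so readers never observe a half-written output.
            Files.move(tmpOut, finalOut, StandardCopyOption.ATOMIC_MOVE);
        }
    }

Kylin's actual steps operate on Hadoop and HBase rather than local files, but the retry-safety property described above is the same idea.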
(I'm also an Apache Kylin co-creator and committer. :-)
Note: I'm an Apache Kylin co-creator and committer.
The fault tolerance point is a really good one, and one we have actually been asked about in cases with extremely large datasets. Recalculating from the beginning would require huge computing resources, network traffic and time.
But from a product perspective, the question is: which is more important, precise results or resources? For transaction data, I believe the exact number is more important, but for behavioral data an approximation should be fine; for example, the distinct count value in Kylin is currently an approximate result. It depends on what kind of case you want Kylin to serve for your business needs.
We will put this idea into our backlog and will update the Kylin dev mailing list when we have a clearer plan for this.
Thanks.
