How to validate a CDC data pipeline? - etl

We have a MongoDB instance from which we consume a CDC stream using custom Python code. The CDC stream is dumped as files, which are then consumed by Spark, which runs SQL on the files and writes the result set into Kafka.
Questions:
How do you make sure there is no data loss in the pipeline?
Even if there is some loss, how do you detect and pinpoint it?
How are these cases handled? What is the industry standard?

This problem is particularly significant when the replication target is Kafka, given Kafka's semantics. On the bright side, as long as you are not compacting topics, it is possible to account for every message received by your consuming application. The challenge is getting something into each Kafka message that gives you a dense, monotonically increasing sequence number. Things also get harder if a consumer only reads a subset of the data: not all of the sequence numbers will be seen, so it becomes difficult to tell whether data is actually missing or simply lives in a topic/partition you aren't reading.
In the ideal situation your source has a sequence number in the user data. From my many customer interactions, that is highly unlikely. In my product (I work for IBM and own the CDC Kafka target engine), we allow a user to introduce a sequence number during processing of the user data, at either the subscription or the topic/partition level. At that point you are trusting that CDC captured the original data and did not have a "bug" in reading it in the first place. Assuming you trust CDC to have at least read the source information from the source log, you can then insert a sequence number with our product if you want to go the do-it-yourself route.
There are problems with this, in that the sequence number is scoped to a given replication session: if there is an abnormal termination and you restart the subscription, the new entries may start again at zero. You can solve this by storing the number you left off at in the same place where you record the effective log position on the source that you have replicated up to.
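As a concrete illustration of the gap-detection idea, here is a minimal consumer sketch. It assumes each message body is JSON carrying a per-topic/partition "seq" field injected somewhere upstream; the field name, topic name and the kafka-python client are my own assumptions, not anything the product mandates.

```python
# Gap detection on a per-partition sequence number. The "seq" field, topic
# name and kafka-python client are assumptions for illustration only.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "cdc.events",
    bootstrap_servers="localhost:9092",
    group_id="cdc-validator",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

last_seen = {}  # (topic, partition) -> last sequence number observed

for msg in consumer:
    key = (msg.topic, msg.partition)
    seq = msg.value["seq"]
    prev = last_seen.get(key)
    if prev is not None and seq != prev + 1:
        # A gap suggests loss; a repeat suggests a duplicate or replay.
        print(f"sequence anomaly on {key}: expected {prev + 1}, got {seq} "
              f"(offset {msg.offset})")
    last_seen[key] = seq
```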
To solve all of this I designed something called the Transactionally Consistent Consumer. It removes duplicates and exactly resequences operations. It has a checkpoint (a set of bytes) that can be used to restart the source stream at any point previously seen, allowing for downstream data loss or incomplete processing. It does require that you trust CDC to have originally captured all the changes (which is the point of an enterprise-grade replication product). If you happen to have source-generated sequence numbers, those could work in conjunction with this.
https://www.ibm.com/docs/en/idr/11.4.0?topic=kafka-transactionally-consistent-consumer
If you're interested.
I also did a presentation at Kafka Summit on the idea behind the technology, here:
https://www.confluent.io/kafka-summit-sf18/a-solution-for-leveraging-kafka-to-provide-end-to-end-acid-transactions/
Hopefully that helps a bit with how enterprise grade products approach this.
Cheers.

There is no industry standard, only hard work to capture a few key things - work that never ends if the design is not simple from the outset.
It's like asynchronous eventing in SOA: hard to do, so not done often.
I work on such things; we test well, assume that some loss may occur, and weigh the cost against the benefit.
E.g. a client writing to Azure Event Hubs from a TIBCO Cloud Mashery API and manipulating the Event Hub via a post-insert Azure Function, or a CDC feed from Oracle via SharePlex, Kafka, and Spark batch Kafka integration.

Related

Kafka Streams: How to scale out the input topic partitions seamlessly when the topology has state involved

We use Kafka heavily for our messaging needs (it replaced MQ in our apps), and it is one of the best decisions we made, as it scales out easily as and when we need. We increase partitions when we know there will be new data coming into the system. This has no impact on the Kafka producer (metadata is refreshed from time to time) or consumer (a rebalance occurs when a new partition is added to a topic) used in our applications.
Recently, we had a requirement to group batches of records in a stream into 2-minute window intervals, and Kafka Streams looked like a perfect solution. However, after implementing it we realized we cannot really scale out like we used to, as every partition is tied to the changelog topic that is created for a state store.
And we cannot afford to create new topics every time or decide on a particular number of partitions that we need to start with.
This sounds like an enterprise issue. Any suggestions on this are highly appreciated!

Creating real time datawarehouse

I am doing a personal project that consists of creating the full architecture of a data warehouse (DWH). In this case I decided to use Pentaho as the ETL and BI analysis tool; it has a lot of functionality, from easy dashboard creation to full data mining processes and OLAP cubes.
I have read that a data warehouse must be a relational database, and I understand this. What I don't understand is how to achieve a near real-time, or fully real-time, DWH. I have read about push and pull strategies, but my conclusions are the following:
The choice of DBMS is not important for creating a real-time DWH; it is possible with MySQL, SQL Server, Oracle or any other. As I am doing this as a personal project, I chose MySQL.
The key factor is the frequency of job scheduling, and this is the scheduler's task. Is this assumption correct? That is, is the key to creating a real-time DWH simply to schedule every ETL job to run every second?
If I am wrong, can you help me understand this? What, then, is the way to create a real-time DWH? Is there any open-source scheduler that allows that? And any non-open-source scheduler that does?
I am very confused because some references say this is impossible, others that it is possible.
Definition
Very interesting question. First of all, it should be defined how "real-time" realtime needs to be. Realtime means very low latency for incoming data, but it requires good architecture in the sending systems, perhaps an event bus or messaging queue, and good infrastructure on the receiving end. This usually involves some kind of listener and pushing from the delivering systems.
Near-realtime would be the next "lower" level. If we say near-realtime means at most about 5 minutes of delay, your approach could work as well: for example, you could pull the data every minute or so. But keep in mind that you need some kind of high-performance check for whether new data is available and which rows to fetch. If this check plus the pull takes longer than a minute, it becomes hard to keep up with the data. It really depends on the volume.
Realtime
As I said before, realtime analytics ideally requires a messaging queue or a service bus that some of your jobs can connect to and "listen" on for new data. When a new data package is pushed into the pipeline, it will probably be very small and can be processed very fast.
If there is no infrastructure for listeners, you need to go near-realtime.
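For illustration, here is a minimal, hypothetical "listener" loader: it consumes events pushed onto a message queue (Kafka here) and lands them straight in the warehouse as they arrive. The topic, table and column names, and the kafka-python/PyMySQL clients, are all assumptions, not part of the original answer.

```python
# Hypothetical push-style loader: listen on a Kafka topic and insert each
# event into the warehouse as it arrives. All names are invented.
import json
import pymysql
from kafka import KafkaConsumer  # pip install kafka-python

db = pymysql.connect(host="localhost", user="dwh", password="secret",
                     database="dwh", autocommit=True)
consumer = KafkaConsumer(
    "sales.events",
    bootstrap_servers="localhost:9092",
    group_id="dwh-loader",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

with db.cursor() as cur:
    for msg in consumer:
        ev = msg.value
        cur.execute(
            "INSERT INTO fact_sales (order_id, amount, event_ts) VALUES (%s, %s, %s)",
            (ev["order_id"], ev["amount"], ev["event_ts"]),
        )
```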
Near-realtime
This is the part where you have to do more development. You have to make sure you fetch relatively small data packages, which will usually be some kind of delta. This could be done with triggers if you have access to the database. Otherwise you have to pull every once in a while, where your "once" will probably be very frequent.
This could be done on Linux, for example, with a simple cron job, or on Windows with the Task Scheduler. Just keep in mind that your loading and processing time shouldn't exceed the time window you have until the next job starts.
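A rough sketch of such a pull job, run from cron every minute and using an updated_at watermark so only the delta is fetched. The table, column and file names are invented for illustration; it uses PyMySQL since the asker chose MySQL.

```python
# Near-realtime delta pull: fetch only rows changed since the last watermark.
# Table, column and file names are invented; uses PyMySQL against MySQL.
from pathlib import Path
import pymysql

WATERMARK_FILE = Path("/var/lib/dwh/orders.watermark")  # hypothetical location

def read_watermark() -> str:
    if WATERMARK_FILE.exists():
        return WATERMARK_FILE.read_text().strip()
    return "1970-01-01 00:00:00"

src = pymysql.connect(host="source-db", user="etl", password="secret", database="shop")
dwh = pymysql.connect(host="dwh-db", user="etl", password="secret", database="dwh",
                      autocommit=True)

watermark = read_watermark()
with src.cursor() as s, dwh.cursor() as d:
    s.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > %s ORDER BY updated_at",
        (watermark,),
    )
    rows = s.fetchall()
    for order_id, status, updated_at in rows:
        # REPLACE keeps the staging table idempotent if the same row is pulled twice.
        d.execute(
            "REPLACE INTO stg_orders (id, status, updated_at) VALUES (%s, %s, %s)",
            (order_id, status, updated_at),
        )
    if rows:
        WATERMARK_FILE.write_text(str(rows[-1][2]))  # remember where we left off
```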
Database
In the end, once you have defined what you want to achieve and have a general idea of how to implement delta loading or listeners, you are right - you could use a relational database. If you are interested in performance and are modelling this part as a star schema, you could also look into column-based engines or column-oriented databases like Apache Cassandra.
Scheduling
For job scheduling, you could start with the standard Linux or Windows scheduling tools. If you code in Java, you could later use something like Quartz. But this only applies to near-realtime; realtime requires a different architecture, as I explained above.

Storm Trident and Spark Streaming: distributed batch locking

After doing lots of reading and building a POC we are still unsure as to whether Storm Trident or Spark Streaming can handle our use case:
We have an inbound stream of sensor data for millions of devices (which have unique identifiers).
We need to perform aggregation of this stream on a per device level. The aggregation will read data that has already been processed (and persisted) in previous batches.
Key point: When we process data for a particular device we need to ensure that no other processes are processing data for that particular device. This is because the outcome of our processing will affect the downstream processing for that device. Effectively we need a distributed lock.
In addition, each device's event data needs to be processed in the order in which the events occurred.
Essentially we can’t have two batches for the same device being processed at the same time.
Can trident/spark streaming handle our use case?
Any advice appreciated.
Since you have unique IDs, can you divide them up? For example, divide the ID by 10 and, depending on the remainder, send the events to different processing boxes. This also takes care of making sure each device's events are processed in order, since they will always be sent to the same box. I believe Storm/Trident allows you to guarantee in-order processing; I am not sure about Spark, but I would be surprised if it didn't.
Pretty awesome problem to solve, I have to say though.
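If the inbound stream happens to be Kafka (not stated in the question), here is a minimal sketch of the partition-by-device idea: keying each message by its device ID means Kafka hashes every device to a fixed partition, so one consumer owns that device's events and sees them in arrival order. The topic name, message shape and the kafka-python client are my own assumptions.

```python
# Route each device to a fixed partition by keying on the device id; Kafka's
# default partitioner hashes the key, so one consumer owns each device and
# receives its events in order. Topic name and message shape are invented.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(event: dict) -> None:
    # Same device_id -> same partition -> same consumer, in arrival order.
    producer.send("sensor.events", key=event["device_id"], value=event)

publish({"device_id": "dev-42", "reading": 3.14, "ts": 1700000000})
producer.flush()
```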

Safe to broadcast large objects with RabbitMQ?

I am relatively new to RabbitMQ and have found it extremely handy and swift; I have used it for communicating small objects using Ruby + the bunny gem.
Now I'm trying to pass objects of around 10-20 MB each to an exchange and fan them out to its subscribers.
It seems to work fine, BUT is it good practice to publish objects this large through RabbitMQ? Or should I use something else in conjunction with RabbitMQ?
You should not send files via AMQP.
Message queues are not databases. Specifically, RabbitMQ was not built with the idea of storing large objects in the queues, because messages are not supposed to be large.
Think about the real world a bit: the postal service was for years (not necessarily so much anymore) optimized for processing letters. If your letter is too fat (heavy), they charge a pretty hefty fee for additional postage. Big messages cost more to move around and disrupt the system. Additionally, your mailbox won't hold large messages - they get left somewhere else, either in a separate package drop or at your front door (where they sometimes go missing).
Message queues are the same way. A message typically contains a small piece of data describing an event or other meaningful thing that happened in your application. Usually the data conveyed by a message can be communicated in 100kB or less.
As I mention in this answer, the AMQP protocol (which underlies RabbitMQ) is a fairly chatty protocol. It requires that large messages be divided into multiple segments of no more than 131 kB. This can add significant overhead to a large file transfer, especially when compared to other file transfer mechanisms (e.g. FTP, HTTP).
More importantly for performance, the message has to be fully processed by the broker before it is made available in a queue, and it ties up RAM on the broker while this is being done. Putting files in the broker may work for one client and one broker, but it will break quickly when scaling out is attempted. Finally, compression is often desirable when transferring files - HTTP supports gzip compression automatically, while AMQP does not.
What should you do?
It is quite common in message-oriented applications to send a message containing a resource locator (e.g. URL) pointing to the larger data file, which is then accessed via appropriate means.
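A hypothetical "claim check" sketch of that pattern, in Python with pika rather than the asker's Ruby/bunny, and with a placeholder upload helper: store the large payload somewhere durable, then fan out only its locator.

```python
# Publish a small "claim check" message whose body is just a locator for the
# large payload. The upload helper is a placeholder; uses pika for RabbitMQ.
import json
import pika

def upload_to_blob_store(path: str) -> str:
    # Placeholder: push the file to S3/Azure Blob/shared storage and return its URL.
    return f"https://files.example.com/{path}"

url = upload_to_blob_store("reports/2024-01-31.bin")

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="reports", exchange_type="fanout")
channel.basic_publish(
    exchange="reports",
    routing_key="",
    body=json.dumps({"url": url, "size_bytes": 15_728_640}),
)
connection.close()
```

Subscribers then fetch the file themselves over HTTP (with compression, range requests, retries, etc.), while the queue only moves tiny messages.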
If it works and doesn't cause you any problems, then great. I would suggest that there may be a time cost for converting each object to a byte array, and clearly the reverse applies on the consumer side too. Since each object is so large, that may be a consideration, unless speed is not your primary objective. Is it necessary to send such large objects?
One big problem with sending large objects is that they will block an entire connection, so if you have more than one channel publishing on the same connection, the others will have to wait for the connection to finish sending the large object.

Recommendation for a large-scale data warehousing system

I have a large amount of data I need to store, and be able to generate reports on - each one representing an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).
I'm evaluating approaches to implementing this; obviously it needs to be reliable and should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.
I'm hoping that some SOers have experience of such software and can make a recommendation, and/or point out the pitfalls.
Ideally I'd like to deploy this on EC2.
Wow. You are opening up a huge topic.
A few things right off the top of my head...
think carefully about your schema for inserts in the transactional part and reads in the reporting part; you may be best off keeping them separate if you have really large data volumes
look carefully at the latency that you can tolerate between real-time reporting on your transactions and aggregated reporting on your historical data. Maybe you should have a process which runs periodically and aggregates your transactions.
look carefully at any requirement which sees you reporting across your transactional and aggregated data, either in the same report or as a drill-down from one to the other
prototype with some meaningful queries and some realistic data volumes
get yourself a real production-quality, enterprise-ready database, e.g. Oracle / MSSQL
think about using someone else's code/product for the reporting e.g. Crystal/BO / Cognos
as I say, huge topic. As I think of more I'll continue adding to my list.
HTH and good luck
@Simon made a lot of excellent points; I'll just add a few and reiterate/emphasize some others:
Use the right datatype for the Timestamps - make sure the DBMS has the appropriate precision.
Consider queueing for the capture of events, allowing for multiple threads/processes to handle the actual storage of the events.
Separate the schemas for your transactional and data warehouse
Seriously consider a periodic ETL from the transactional DB to the data warehouse (see the sketch after this list).
Remember that you probably won't have 50 transactions/second 24x7x365 - peak transactions vs. average transactions
Investigate partitioning tables in the DBMS. Oracle and MSSQL will both partition on a value (like date/time).
Have an archiving/data retention policy from the outset. Too many projects just start recording data with no plans in place to remove/archive it.
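To make the "periodic ETL" point above concrete, here is a rough roll-up sketch you might run hourly from a scheduler. The table names, the MySQL-flavoured SQL and the assumed unique key on the aggregate table are illustrative only, not a prescription.

```python
# Hourly roll-up of raw events into an aggregate table; assumes a unique key
# on (event_type, event_day, event_hour). Run it from cron/SQL Agent/etc.
import pymysql

db = pymysql.connect(host="dwh-db", user="etl", password="secret",
                     database="dwh", autocommit=True)
with db.cursor() as cur:
    cur.execute("""
        INSERT INTO agg_events_hourly (event_type, event_day, event_hour, event_count)
        SELECT event_type, DATE(event_ts), HOUR(event_ts), COUNT(*)
        FROM raw_events
        WHERE event_ts >= NOW() - INTERVAL 1 HOUR
        GROUP BY event_type, DATE(event_ts), HOUR(event_ts)
        ON DUPLICATE KEY UPDATE event_count = VALUES(event_count)
    """)
```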
I'm surprised none of the answers here cover Hadoop and HDFS - I would suggest that is because SO is a programmers' Q&A site and your question is in fact a data science question.
If you're dealing with a large number of queries and long processing times, you would use HDFS (a distributed storage system, which runs fine on EC2) to store your data and run batch queries (i.e. analytics) on commodity hardware.
You would then provision as many EC2 instances as needed (hundreds or thousands, depending on how big your data-crunching requirements are) and run MapReduce queries against your data to produce reports.
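As a rough sketch of that batch-reporting flow, here is a minimal job using PySpark rather than hand-written MapReduce; the paths and the event schema are invented for illustration.

```python
# Batch report over raw event files stored in HDFS (or S3): group events by
# type and day, write the result as Parquet for the reporting layer.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("event-report").getOrCreate()

events = spark.read.json("hdfs:///data/events/2024/01/*.json")  # invented path
report = events.groupBy("event_type", F.to_date("event_ts").alias("day")).count()
report.write.mode("overwrite").parquet("hdfs:///reports/events_by_day")
```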
Wow.. This is a huge topic.
Let me begin with databases. First, get something good if you are going to have crazy amounts of data. I like Oracle and Teradata.
Second, there is a definitive difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).
I believe you can approach this in two ways:
Throw money at the problem: Buy best in class software (databases, reporting software) and hire a few slick tech people to help
Take the homegrown approach: Build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of decent open-source tools and inexpensive agencies that do this work.
As far as the EC2 approach goes... I'm not sure how it would fit into a data storage strategy. The processing need is limited, which is where EC2 is strong; your primary goal is efficient storage and retrieval.
