Kafka Streams with lookup data on HDFS - hadoop

I'm writing an application with Kafka Streams (v0.10.0.1) and would like to enrich the records I'm processing with lookup data. This data (a timestamped file) is written to an HDFS directory daily (or 2-3 times a day).
How can I load this in the Kafka Streams application and join to the actual KStream?
What would be the best practice to reread the data from HDFS when a new file arrives there?
Or would it be better to switch to Kafka Connect and write the RDBMS table content to a Kafka topic that can be consumed by all the Kafka Streams application instances?
Update:
As suggested, Kafka Connect would be the way to go. Because the lookup data is updated in the RDBMS on a daily basis, I was thinking about running Kafka Connect as a scheduled one-off job instead of keeping the connection always open. Yes, because of semantics and the overhead of keeping a connection always open and making sure that it won't be interrupted, etc. For me, a scheduled fetch looks safer in this case.
The lookup data is not big, and records may be deleted / added / modified. I also don't know how I can always have a full dump in a Kafka topic and truncate the previous records. Enabling log compaction and sending null values for the keys that have been deleted probably won't work, as I don't know what has been deleted in the source system. Additionally, AFAIK I have no control over when the compaction happens.

The recommended approach is indeed to ingest the lookup data into Kafka, too -- for example via Kafka Connect -- as you suggested above yourself.
But in this case, how can I schedule the Connect job to run on a daily basis rather than continuously fetching from the source table, which is not necessary in my case?
Perhaps you can update your question to explain why you do not want to have a continuous Kafka Connect job running? Are you concerned about resource consumption (load on the DB), are you concerned about the semantics of the processing if it's not "daily updates", or...?
Update:
As suggested, Kafka Connect would be the way to go. Because the lookup data is updated in the RDBMS on a daily basis, I was thinking about running Kafka Connect as a scheduled one-off job instead of keeping the connection always open. Yes, because of semantics and the overhead of keeping a connection always open and making sure that it won't be interrupted, etc. For me, a scheduled fetch looks safer in this case.
Kafka Connect is safe, and the JDBC connector has been built for exactly the purpose of feeding DB tables into Kafka in a robust, fault-tolerant, and performant way (there are many production deployments already). So I would suggest not falling back to a "batch update" pattern just because "it looks safer"; personally, I think triggering daily ingestions is operationally less convenient than just keeping it running for continuous (and real-time!) ingestion, and it also leads to several downsides for your actual use case (see next paragraph).
But of course, your mileage may vary -- so if you are set on updating just once a day, go for it. But you lose a) the ability to enrich your incoming records with the very latest DB data at the point in time when the enrichment happens, and, conversely, b) you might actually enrich the incoming records with stale/old data until the next daily update completes, which most probably will lead to incorrect data that you are sending downstream / making available to other applications for consumption. If, for example, a customer updates her shipping address (in the DB) but you only make this information available to your stream processing app (and potentially many other apps) once per day, then an order processing app will ship packages to the wrong address until the next daily ingest completes.
The lookup data is not big, and records may be deleted / added / modified. I also don't know how I can always have a full dump in a Kafka topic and truncate the previous records. Enabling log compaction and sending null values for the keys that have been deleted probably won't work, as I don't know what has been deleted in the source system.
The JDBC connector for Kafka Connect already handles much of this for you: (1) in its incrementing/timestamp modes it ensures that DB inserts and updates are reflected in a Kafka topic, and (2) Kafka's log compaction ensures that the target topic doesn't grow out of bounds. Hard deletes are not visible to query-based polling, though, so deleted rows need a soft-delete flag in the table or a log-based change data capture tool. You may want to read up on the JDBC connector in the docs to learn which functionality you get for free: http://docs.confluent.io/current/connect/connect-jdbc/docs/
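On the concern that the deleted keys are unknown: if each load is a full dump and the previous dump is still available, one possible approach is to diff the two dumps and publish only the changes, with a null value (a tombstone) for every key that disappeared. The sketch below is plain Python with in-memory dicts standing in for the dumps and for the compacted topic; `diff_snapshots` and `apply_to_table` are hypothetical helper names, not part of Kafka or Kafka Connect.

```python
from typing import Dict, List, Optional, Tuple

# (key, value); a value of None plays the role of a Kafka tombstone.
Record = Tuple[str, Optional[str]]


def diff_snapshots(previous: Dict[str, str], current: Dict[str, str]) -> List[Record]:
    """Return the records to publish so a compacted topic ends up matching `current`."""
    records: List[Record] = []
    # Upserts: keys that are new or whose value changed since the last dump.
    for key, value in current.items():
        if previous.get(key) != value:
            records.append((key, value))
    # Tombstones: keys that were in the last dump but are gone now.
    for key in previous:
        if key not in current:
            records.append((key, None))
    return records


def apply_to_table(table: Dict[str, str], records: List[Record]) -> None:
    """Mimic the latest-value-per-key view that a compacted topic converges to."""
    for key, value in records:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value


yesterday = {"c1": "Berlin", "c2": "Paris", "c3": "Rome"}
today = {"c1": "Berlin", "c2": "Lyon", "c4": "Oslo"}

changes = diff_snapshots(yesterday, today)
table = dict(yesterday)
apply_to_table(table, changes)

print(changes)          # one update, one insert, one tombstone
print(table == today)   # the table view now matches the new dump
```

Because consumers of the topic always see the latest value per key, it does not matter exactly when the broker runs compaction; compaction only reclaims space.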

Related

Uploading data to a Kafka producer

I am new to Kafka in Spring Boot. I have been through many tutorials and have gained a fair knowledge of it.
I have been assigned a task and am facing an issue. I hope to get some help here.
The scenario is as follows.
1) I have a DB which is continuously updated with millions of records.
2) I have to query the DB every 5 minutes, pick the recently updated data, and send it to Kafka.
Condition: the old data that I picked in a previous iteration should not be picked again in the next DB call and pushed to Kafka.
I am done with the Spring scheduling part and pick the data using findAll() of Spring Boot JPA, but how can I write the logic so that it does not pick the old DB records and only takes the new records and pushes them to Kafka?
My DB table also has a field called "Recent_timeStamp" of type "datetime".
It's hard to tell without really seeing your logic and the way you work with the database, but from what you've described you should not just do "findAll" here.
Instead you should treat your DB table as time-driven data:
Since it has a field of timestamp, make sure there is an index on it
Instead of "findAll" execute something like:
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP > ?
ORDER BY RECENT_TIMESTAMP ASC
In this case you'll get the records ordered by the increasing timestamp
Now the ? denotes the last memorized timestamp that you've handled
So you'll have to maintain the state here
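The steps above can be sketched as follows. This is a minimal illustration, not Spring code: it uses Python's built-in sqlite3 with an in-memory database, and the table names `events` and `poll_state`, the column names, and the `send` callback (standing in for a Kafka producer) are all assumptions made for the example. Timestamps are stored as ISO-formatted strings so that string comparison matches time order.

```python
import sqlite3


def load_watermark(conn):
    """Read the last handled timestamp from persistent state."""
    return conn.execute("SELECT last_seen FROM poll_state WHERE id = 1").fetchone()[0]


def save_watermark(conn, timestamp):
    """Persist the last handled timestamp so a restart can resume from it."""
    conn.execute("UPDATE poll_state SET last_seen = ? WHERE id = 1", (timestamp,))
    conn.commit()


def poll_once(conn, send):
    """Fetch rows newer than the watermark, hand them to `send`, advance the watermark."""
    rows = conn.execute(
        "SELECT id, payload, recent_timestamp FROM events "
        "WHERE recent_timestamp > ? ORDER BY recent_timestamp ASC",
        (load_watermark(conn),),
    ).fetchall()
    for row in rows:
        send(row)
    if rows:
        save_watermark(conn, rows[-1][2])  # newest timestamp handled in this run
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT, recent_timestamp TEXT)")
conn.execute("CREATE INDEX idx_events_ts ON events (recent_timestamp)")
conn.execute("CREATE TABLE poll_state (id INTEGER PRIMARY KEY, last_seen TEXT)")
conn.execute("INSERT INTO poll_state VALUES (1, '1970-01-01 00:00:00')")
conn.executemany(
    "INSERT INTO events (payload, recent_timestamp) VALUES (?, ?)",
    [("a", "2024-01-01 10:00:00"), ("b", "2024-01-01 10:02:00")],
)

sent = []
first = poll_once(conn, sent.append)    # picks up a and b
conn.execute("INSERT INTO events (payload, recent_timestamp) VALUES ('c', '2024-01-01 10:06:00')")
second = poll_once(conn, sent.append)   # picks up only c
third = poll_once(conn, sent.append)    # nothing new
print(first, second, third)
```

One caveat of a strict `>` comparison: a row written later with exactly the same timestamp as the stored watermark would be skipped, so a tie-breaker column such as the primary key is worth considering.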
Another option is to query the data whose timestamp is less than 5 minutes old. In this case the query will look like this (pseudocode, since the actual syntax varies):
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP > now() - 5 minutes
ORDER BY RECENT_TIMESTAMP ASC
The first method is more robust: if your Spring Boot application is down for some reason, you'll be able to recover and query all the records from the point where it failed to send the data. On the other hand, you'll have to save this kind of pointer in some type of persistent storage.
The second solution is "easier" in the sense that you don't have state to maintain, but on the other hand you will miss data after a restart.
In both cases you might want to use some kind of pagination, because you don't know how many records you'll get from the database, and if the number of records exceeds your memory limits, the application will end up throwing an OutOfMemoryError.
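One way to paginate such a query is keyset pagination: each page continues after the last (timestamp, id) pair seen, so only one page is held in memory at a time. The sketch below again uses sqlite3 with a hypothetical `events` table; the page size and the cursor layout are assumptions for illustration.

```python
import sqlite3


def iter_pages(conn, cursor, page_size):
    """Yield pages of rows after `cursor`, a (timestamp, id) pair, oldest first."""
    last_ts, last_id = cursor
    while True:
        rows = conn.execute(
            "SELECT id, payload, recent_timestamp FROM events "
            "WHERE recent_timestamp > ? OR (recent_timestamp = ? AND id > ?) "
            "ORDER BY recent_timestamp ASC, id ASC LIMIT ?",
            (last_ts, last_ts, last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        # Continue after the last row of this page; the id breaks timestamp ties.
        last_id, last_ts = rows[-1][0], rows[-1][2]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT, recent_timestamp TEXT)")
conn.executemany(
    "INSERT INTO events (payload, recent_timestamp) VALUES (?, ?)",
    [
        ("a", "2024-01-01 10:00:00"),
        ("b", "2024-01-01 10:00:00"),
        ("c", "2024-01-01 10:00:00"),  # same timestamp, split across two pages
        ("d", "2024-01-01 10:01:00"),
        ("e", "2024-01-01 10:02:00"),
    ],
)

pages = list(iter_pages(conn, ("1970-01-01 00:00:00", 0), page_size=2))
page_sizes = [len(page) for page in pages]
payloads = [row[1] for page in pages for row in page]
print(page_sizes, payloads)
```

In a real job each page would be sent and the cursor persisted before fetching the next one, rather than collecting all pages into a list as the demo does.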
A completely different approach is to send the data to Kafka when you write to the database instead of when you read from it. At that point you probably have a data chunk of reasonably limited size, and in general you don't need the state, because you can store to the DB and send to Kafka from the same service, if the architecture of your application permits it.
You can look into kafka connect component if it serves your purpose.
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka® and other data systems. It makes it simple to quickly define connectors that move large data sets in and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export connector can deliver data from Kafka topics into secondary indexes like Elasticsearch, or into batch systems such as Hadoop for offline analysis.

How does KafkaStreams determine whether a GlobalKTable is fully populated while bootstrapping?

The topic I use to create a GlobalKTable is very active. In the documentation of KStream-GlobalKTable join I read
The GlobalKTable is fully bootstrapped upon (re)start of a KafkaStreams instance, which means the table is fully populated with all the data in the underlying topic that is available at the time of the startup. The actual data processing begins only once the bootstrapping has completed.
How does KafkaStreams determine whether all data is read? Does it read all the messages with a timestamp below the KafkaStreams instance bootstrap time? Or does it use some kind of timeout?
Either way, I guess we better get the retention and log compaction of the underlying topic right or a restart might take a while.
On startup, Kafka Streams reads the current log-end-offsets, and bootstrapping is finished once all data up to those offsets has been loaded (cf. KIP-99).
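The idea can be illustrated with a toy model. This is not Kafka Streams code: `Partition` and `bootstrap` are hypothetical in-memory stand-ins, and the `on_read` hook exists only so the demo can simulate a producer that keeps writing while the table is being loaded.

```python
class Partition:
    """A toy append-only log; the end offset is the offset of the next record."""

    def __init__(self):
        self.log = []

    def append(self, key, value):
        self.log.append((key, value))

    def end_offset(self):
        return len(self.log)


def bootstrap(partitions, on_read=None):
    """Load a table up to the end offsets captured at startup, and no further."""
    end_offsets = [p.end_offset() for p in partitions]  # snapshot taken once
    table = {}
    for partition, end in zip(partitions, end_offsets):
        for offset in range(end):
            key, value = partition.log[offset]
            if value is None:
                table.pop(key, None)   # tombstone removes the key
            else:
                table[key] = value     # later records overwrite earlier ones
            if on_read is not None:
                on_read()
    return table, end_offsets


p0, p1 = Partition(), Partition()
p0.append("a", "1")
p0.append("b", "2")
p1.append("c", "3")

late_writes = iter([("z", "late")])


def writer_keeps_going():
    # Simulates the active topic: a record arrives while bootstrapping is running.
    for key, value in late_writes:
        p0.append(key, value)


table, end_offsets = bootstrap([p0, p1], on_read=writer_keeps_going)
print(table, end_offsets, p0.end_offset())
```

The record written during the load is not part of the bootstrapped table; in this model it would be picked up by the regular update path afterwards, so no timestamp comparison or timeout is needed to decide when bootstrapping is done.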
Note, GlobalKTable is designed with static/rarely changing data in mind.
Either way, I guess we better get the retention and log compaction of the underlying topic right or a restart might take a while.
GlobalKTable checkpoints as of 0.11 (released today) so bootstrapping should be much faster on restart than in 0.10.2.

spring batch: process large file

I have 10 large files in production. We need to read each line from the files, convert the comma-separated values into a value object, send it to a JMS queue, and also insert it into 3 different tables in the database.
Across the 10 files we have 33 million lines. We are using Spring Batch (MultiResourceItemReader) to read each line and a writer to write it to the DB and also send it to JMS. It takes roughly 25 hours to complete everything.
Even though we have 10 systems in production, at present we use only one system to run this job (I am new to Spring Batch and not aware of how Spring supports load balancing).
Since we have only one system, we configured the data source to connect to the DB with a maximum of 25 connections.
To improve performance we decided to use Spring's multi-threading support and started with 5 threads. We saw an improvement: everything completed in 10 hours.
I have the following questions:
1) If I process using 5 threads, we will publish a huge amount of data to the JMS queue. Will the queue support that much data? Note that we have 10 systems in production to read JMS messages from the queue.
2) Is using 5 threads on 1 production system a good approach? Alternatively, instead of having Spring Batch insert the data into the DB, I could create a REST service: Spring Batch would call the REST API, which inserts the data into the DB and into the JMS queue (in that case, Spring Batch would read 4 or 5 lines per second and call the REST API for them; note that we have 10 production systems). With the REST API approach, will my system cope (can REST handle a huge number of requests behind a load balancer, and can JMS handle a huge number of messages), or is using threads in the Spring Batch app on 1 production system the better approach?
Different JMS providers are going to have different limits, but in general messaging can easily handle millions of rows in a small period of time.
Messaging is going to be faster than inserting directly into the database because a message has very little data to manage (other than JMS properties), without the overhead of a complete RDBMS, NoSQL database, or whatever; messaging outperforms them all.
Assuming the individual lines can be processed in any order, sending all data to the same queue and having n consumers work the back end is a sound solution.
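The shape of that solution, one producer feeding a single queue that n consumers drain in no particular order, can be sketched with Python's standard library. Here `queue.Queue` merely stands in for a JMS queue and threads stand in for the consumer systems; the line layout (`order_id,customer,amount`) and the function names are made up for the example.

```python
import queue
import threading


def parse_line(line):
    """Turn one comma-separated line into a simple value object (a dict)."""
    order_id, customer, amount = line.split(",")
    return {"order_id": int(order_id), "customer": customer, "amount": float(amount)}


def run_consumers(lines, num_consumers, handle):
    """Push every line onto one queue and let `num_consumers` workers drain it."""
    work = queue.Queue(maxsize=100)  # bounded, so the producer waits if consumers lag

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work for this worker
                return
            handle(parse_line(item))

    threads = [threading.Thread(target=worker) for _ in range(num_consumers)]
    for thread in threads:
        thread.start()
    for line in lines:
        work.put(line)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()


lines = [f"{i},customer-{i % 3},{i * 1.5}" for i in range(1, 21)]
processed = []
lock = threading.Lock()


def store(record):
    # Stand-in for the database insert; the lock protects the shared list.
    with lock:
        processed.append(record["order_id"])


run_consumers(lines, num_consumers=4, handle=store)
print(sorted(processed) == list(range(1, 21)))
```

Every line is handled exactly once, but the order in which consumers finish is not deterministic, which is why the approach depends on the lines being independent of each other.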
Your big bottleneck, however, is getting the data into the database. If the destination table(s) have many (or any) keys/indices on them, there is going to be serious contention because each insert/update/delete needs to update the indices, so even though you have n different consumers trying to update the database, they're going to trample on each other as the transactions are completed.
One solution I've seen is disabling all database constraints before you start and enabling them at the end; hopefully, if things worked, the data is consistent and usable. Of course, the risk is that there was bad data you didn't catch, and now you need to clean up or reattempt the load.
A better solution might be to transform the files into a single file that can be batch loaded into the database using a platform-specific tool. These tools often disable indexes, constraint checking, and anything else that's going to slow things down, oftentimes bypassing SQL itself, to get performance.

Pattern to load data to Elasticsearch from SQL server

Here is what we came up with: a status column with 3 values.
0 = Not indexed
1 = Indexed
2 = Updated
There will be 2 jobs.
Job 1 will select the top X records where status = 0 and put them into a queue such as RabbitMQ.
Then a consumer will bulk insert those records into ES and update the status of the DB records to 1.
For updates, since we have control of our data, the SQL stored proc that updates a particular record will set its status to 2. Job 2 will select the top X records where status = 2 and put them on RabbitMQ. Then a consumer will bulk insert those records into ES and update the status of the DB records to 1.
Of course, we may need an intermediate status for "queued" so that neither job picks up the same record again, and the same job should not run if its previous run hasn't completed. The chances of a queued record being updated are slim to none, since updates usually only happen at the end of the day, typically the next day.
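A minimal sketch of this status-column pattern, including the intermediate "queued" state, is shown below. It uses sqlite3 in place of SQL Server and leaves out RabbitMQ and Elasticsearch entirely. The numeric codes follow the job descriptions (1 after indexing, 2 after an update), while `QUEUED = 3`, the `docs` table, and the function names are assumptions for illustration.

```python
import sqlite3

NOT_INDEXED, INDEXED, UPDATED, QUEUED = 0, 1, 2, 3  # QUEUED is an assumed extra code


def claim_batch(conn, from_status, batch_size):
    """Pick up to `batch_size` rows in `from_status` and mark them as queued."""
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM docs WHERE status = ? ORDER BY id LIMIT ?",
            (from_status, batch_size),
        )
    ]
    conn.executemany(
        "UPDATE docs SET status = ? WHERE id = ? AND status = ?",
        [(QUEUED, doc_id, from_status) for doc_id in ids],
    )
    conn.commit()
    return ids  # these ids would be published to the queue


def mark_indexed(conn, ids):
    """Called by the consumer after a successful bulk insert into the search index."""
    conn.executemany(
        "UPDATE docs SET status = ? WHERE id = ? AND status = ?",
        [(INDEXED, doc_id, QUEUED) for doc_id in ids],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO docs (body, status) VALUES (?, ?)",
    [("new-1", NOT_INDEXED), ("new-2", NOT_INDEXED), ("new-3", NOT_INDEXED), ("edited", UPDATED)],
)

batch_a = claim_batch(conn, NOT_INDEXED, 2)   # job 1, first run
batch_b = claim_batch(conn, NOT_INDEXED, 2)   # job 1, second run: no overlap with batch_a
batch_c = claim_batch(conn, UPDATED, 2)       # job 2
mark_indexed(conn, batch_a + batch_b + batch_c)
statuses = [row[0] for row in conn.execute("SELECT status FROM docs ORDER BY id")]
print(batch_a, batch_b, batch_c, statuses)
```

The `AND status = ?` guard in `mark_indexed` means a record that was updated again while queued keeps its newer status instead of being flipped to indexed.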
I know there are rivers (but they are being deprecated and are probably not as flexible as ETL).
I would like to bulk insert records from my SQL server to Elasticsearch.
Write a scheduled batch job of some sort; whether it is ETL or any other tool doesn't matter.
Running "select from table where id > lastIdInsertedToElasticSearch" will load the latest records into Elasticsearch at a scheduled interval.
But what if a record is updated in the SQL server? What would be a good pattern to track updated records in the SQL server and then push the updated records in ES? I know ES has document versions when putting the same Id. But can't seem to be able to visualize a pattern.
So IMHO, batch inserts are good for building or re-building the index. So for the first time, you can run batch jobs that run SQL queries and perform bulk updates. Rivers, as you correctly pointed out, don't provide a lot of flexibility in terms of transformation.
If the entries in your SQL data store are created by you (i.e. by some codebase in your control), it would be better for the same codebase to update the documents in Elasticsearch, maybe not directly but by notifying some other service or with the help of queues, so that no time is wasted when responding to requests (if that's the kind of setup you have).
We have a pretty similar use case of Elasticsearch. We provide search inside our app, which performs search across different categories of data. Some of this data is actually created by the users of our app through our app - so we handle this easily. Our app writes that data to our SQL data store and pushes the same data in RabbitMQ for indexing/updating in Elasticsearch. On the other side of RabbitMQ, we have a consumer written in Python that basically replaces the entire document in Elasticsearch. So the corresponding rows in our SQL datastore and documents in Elasticsearch share the ID which enables us to update the document.
Another case is where a few types of data that we search on come from a 3rd party service which exposes the data over its HTTP API. The data creation is in our control, but we don't have an automated mechanism for updating the entries in Elasticsearch. In this case, we basically run a cron job that takes care of this. We have had to tune the cron schedule because we also have a limited quota of API queries. But in this case, our data is not really updated that much per day, so this kind of system works for us.
Disclaimer: I co-developed this solution.
I needed something like the jdbc-river that could do more complex "roll-ups" of data. After careful consideration of what it would take to modify the jdbc-river to suit my needs, I ended up writing the river-net.
Here are a few of the features:
It gets fairly decent performance (comparable to the jdbc-river; we get upwards of 6k rows/sec)
It can join many tables to create complex nested arrays of documents without creating duplicate child documents
It follows a lot of the same conventions as the jdbc-river.
It also supports reading from files.
It's written in C#
It uses Quartz.Net and supports cron expressions for scheduling.
This project is open source, and we already have a second project (also to be open sourced) that does generic job scheduling with RabbitMQ. We have ported over a lot of this project, and plan to use the RabbitMQ river for better performance and stability when indexing into Elasticsearch.
To combat large updates, we aren't hitting tables directly. Instead we use stored procedures that only grab deltas. We also have an option on the sp to reset the delta to reindex everything.
The project is fairly young with only a few commits, but we are open to collaboration and new ideas.

Compare and Contrast Change Data Capture and Database Change Notification

Oracle has two seemingly competing technologies. CDC and DCN.
What are the strengths of each?
When would you use one and not the other?
In general, you would use DCN to notify a client application that it needs to clear/update its cache. You would use CDC for ETL processing.
DCN would generally be preferable when you have an OLTP application that needs to be notified immediately about data changes in the database. Since the goal here is to minimize the number of network round-trips and the number of database hits, you'd generally want the application to use DCN for queries whose results are mostly static. If a large fraction of the query result changes regularly, you may be better off just refreshing the application's cache on a set frequency rather than running queries constantly to get the changed data (DCN does not contain the changed data, just the ROWID of the row(s) that changed). If the application goes down, I believe DCN allows changes to be lost.
CDC would generally be preferable when you have a DSS application that needs to periodically pull over all the data that changed in a number of tables. CDC can guarantee that the subscriber has received every change to the underlying table(s), which can be important if you are trying to replicate changes to a different database. CDC allows the subscriber to pull the changes at its convenience rather than trying to notify the subscriber that there are changes, so you'd definitely want CDC if you wanted the subscriber to process new data every hour or every day rather than in near real time. (note: DCN also has a guaranteed delivery mode, see comments below. --Mark Harrison)
CDC seems to be much more complex to set up than DCN.
I mean, to set up DCN I wrap a select in a start and end DCN block and then write a procedure to be called with a collection of changes. That's it.
CDC requires publishers and subscribers and, anyway, seems like more work.

Resources