Which Nifi processor to use for RDBMS Extract - apache-nifi

i will explain my use case to understand which DB extract utility to use.
I need to extract data from SQL Server tables with varying frequency each day. Each extract query is a complex SQL statement, involving 5-10 tables in joins etc with multiple causes. Have around 20-30 such statements overall.
All these extract queries might be required to run multiple times a day with varying frequencies each day. It depends on how many times we receive data from source system or other cases.
We are planning to use Kafka to publish a message to let Nifi workflow know whenever a RDBMS table is updated and flow needs to be triggered (i can't just trigger Nifi flow based on "incremental" column value, there might only be all row update scenarios and we might not create new rows in tables).
How should i go about designing my Nifi. There are ExecuteSQL/GenerateTableFetch/ExecuteSQLRecord/QueryDatabaseTable all sorts of components available. Which one is going to fit my requirement best?
Thanks!

I am suggesting that you use ExecuteSQL. You can set query from attribute or compose it using attribute. Easiest way is to create json and then parse that json and create attributes. Check this example, here is sql created from file you can adjust it to create it from kafka link

Related

Importing a large amount of data into Elasticsearch every time by dropping existing data

Currently, there's a denormalized table inside a MySQL database that contains hundreds of columns and millions of records.
The original source of the data does not have any way to track the changes so the entire table is dropped and rebuilt every day by a CRON job.
Now, I would like to import this data into Elaticsearch. What is the best way to approach this? Should I use logstash to connect directly to the table and import it or is there a better way?
Exporting the data into JSON or similar is an expensive process since we're talking about gigabytes of data every time.
Also, should I drop the index in elastic as well or is there a way to make it recognize the changes?
In any case - I'd recommend using index templates to simplify index creation.
Now for the ingestion strategy, I see two possible options:
Rework your ETL process to do a merge instead of dropping and recreating the entire table. This would definitely be slower but would allow shipping only deltas to ES or any other data source.
As you've imagined yourself - you should be probably fine with Logstash using daily jobs. Create a daily index and drop the old one during the daily migration.
You could introduce buffers, such as Kafka to your infrastructure, but I feel that might be an overkill for your current use case.

ETL + sync data between with Redshift and Dynamodb

I need to aggregate data coming from DynamoDB to AWS Redshift, and I need to be accurate and in-sync. For the ETL I'm planning to use DynamoDB Streams, Lambda transform, Kinesis Firehorse to, finally, Redshift.
How would be the process for updated data? I find it's all fine-tuned just for ETL. Which should be the best option to maintain both (Dynamo and Redshift) in sync?
These are my current options:
Trigger an "UPDATE" command direct from Lambda to Redshift (blocking).
Aggregate all update/delete records and process them on an hourly basis "somehow".
Any experience with this? Maybe is Redshift not the best solution? I need to extract aggregated data for reporting / dashboarding on 2 TB of data.
Redshift COPY command supports using a DyanmoDB table as a data source. This may or may not be a possible solution in your case as there are some limitations to this process. Data types and table naming differences can trip you up. Also this isn't a great option for incremental updates but can be done if the amount of data is small and you can design the updating SQL.
Another route to look at DynamoDB Stream. This will route data updates through Kinesis and this can be used to update Redshift at a reasonable rate. This can help keep data synced between these databases. This will likely make the data available for Redshift as quickly as possible.
Remember that you are not going to get Redshift to match on a moment by moment bases. Is this what you mean by "in-sync"? These are very different databases with very different use cases and architectures to support these use cases. Redshift works in big chunks of data changing slower than what typically happens in DynamoDB. There will be updating of Redshift in "chunks" which happen a more infrequent rate than on DynamoDB. I've made systems to bring this down to 5min intervals but 10-15min update intervals is where most end up when trying to keep a warehouse in sync.
The other option is to update Redshift infrequently (hourly?) and use federated queries to combine "recent" data with "older data" stored in Redshift. This is a more complicated solution and will likely mean changes to your data model to support but doable. So only go here if you really need to query very recent data right along side with older and bigger data.
The best-suited answer is to use a Staging table with an UPSERT operation (or a Redshift interpretation of it).
I found the answer valid on my use case when:
Keep Redshift as up to date as possible without causing blocking.
Be able to work with complex DynamoDB schemas so they can't be used as a source directly and data has to be transformed to adapt to Redshift DDL.
This is the architecture:
So we constantly load from Kinesis using the same COPY mechanism, but instead of loading directly to the final table, we use a staging one. Once the batch is loaded into staging we seek for duplicates between the two tables. Those duplicates on the final table will be DELETED before an INSERT is performed.
After trying this I've found that all DELETE operations on the same batch perform better if enclosed within a unique transaction. Also, a VACUUM operation is needed in order to re-balance the new load.
For further detail on the UPSERT operation, I've found this source very useful.

Apache Nifi - Federated Search

My team’s been thrown into the deep end and have been asked to build a federated search of customers over a variety of large datasets which hold varying degrees of differing data about each individuals (and no matching identifiers) and I was wondering how to go about implementing it.
I was thinking Apache Nifi would be a good fit to query our various databases, merge the result, deduplicate the entries via an external tool and then push this result into a database which is then queried for use in an Elasticsearch instance for the applications use.
So roughly speaking something like this:-
For examples sake the following data then exists in the result database from the first flow :-

Then running https://github.com/dedupeio/dedupe over this database table which will add cluster ids to aid the record linkage, e.g.:-

Second flow would then query the result database and feed this result into Elasticsearch instance for use by the applications API for querying which would use the cluster id to link the duplicates.
Couple questions:-
How would I trigger dedupe to run on the merged content was pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven’t considered any CDC process here as the databases will be getting constantly updated which I'd need to handle, so really interested if anybody had solved a similar problem or used different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since it looks like a Python library, I'm guessing writing a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.

How to implement an ETL Process

I would like to implement a synchronization between a source SQL base database and a target TripleStore.
However for matter of simplicity let say simply 2 databases. I wonder what approaches to use to have every change in the source database replicated in the target database. More specifically, I would like that each time some row changes in the source database that this can be seen by a process that will read the changes and populate the target database accordingly while applying some transformation in the middle.
I have seen suggestion around the mechanism of notification that can
be available in the database, or building tables such that changes can
be tracked (meaning doing it manually) and have the process polling it
at different intervals, or the usage of Logs (change data capture,
etc...)
I'm seriously puzzle about all of this. I wonder if anyone could give some guidance and explanation about the different approaches with respect to my objective. Meaning: name of methods and where to look.
My organization mostly uses: Postgres and Oracle database.
I have to take relational data and transform them in RDF so as to store them in a triplestore and keep that triplestore constantly synchronized with the data is the SQL Store.
Please,
Many thanks
PS:
A clarification between ETL and replication techniques as in Change Data capture, with respect to my overall objective would be appreciated.
Again i need to make sense of the subject, know what are the methods, so i can further start digging for myself. So far i have understood that CDC is the new way to go.
Assuming you can't use replication and you need to use some kind of ETL process to actually extract, transform and load all changes to the destination database, you could use insert, update and delete triggers to fill a (manually created) audit table. Columns GeneratedId, TableName, RowId, Action (insert, update, delete) and a boolean value to determine if your ETL process has already processed this change. Use that table to get all the changed rows in your database and transport them to the destination database. Then delete the processed rows from the audit table so that it doesn't grow too big. How often you have to run the ETL process depends on the amount of changes occurring in the source database.

Pattern to load data to Elasticsearch from SQL server

Here is what we came up with. By using 3 value status column.
0 = Not indexed
1 = Updated
2 = Indexed
There will be 2 jobs...
Job 1 will select top X records where status = 0 and pop them into a queue like RabitMQ.
Then a consumer will bulk insert those records to ES and update the status of DB records to 1.
For updates, since we have control of our data... The SQL stored proc that updates that particular record will set it's status to 2. Job2 will select top x records where status = 2 and pop them on RabitMQ. Then a consumer will bulk insert those records to ES and update the status of DB records to 1.
Of course we may need an intermediate status for "queued" so none of the jobs pick up the same record again but the same job should not run if it hasn't completed. The chances of a queued record being updated are slim to none. Since updates only happen at end of day usually the next day.
So I know there's rivers (but being deprecated and probably not flexible like ETL)
I would like to bulk insert records from my SQL server to Elasticsearch.
Write a scheduled batch job of some sort either ETL or any other tool doesn't matter.
select from table where id > lastIdInsertedToElasticSearch this will allow to load the latest records into Elasticsearch at scheduled interval.
But what if a record is updated in the SQL server? What would be a good pattern to track updated records in the SQL server and then push the updated records in ES? I know ES has document versions when putting the same Id. But can't seem to be able to visualize a pattern.
So IMHO, batch inserts are good for building or re-building the index. So for the first time, you can run batch jobs that run SQL queries and perform bulk updates. Rivers, as you correctly pointed out, don't provide a lot of flexibility in terms of transformation.
If the entries in your SQL data store are created by you (i.e. some codebase in your control), it would be better that the same code base updates documents in Elasticsearch, may be not directly but by notifying some other service or with the help of queues to not waste time in responding to requests (if that's the kind of setup you have).
We have a pretty similar use case of Elasticsearch. We provide search inside our app, which performs search across different categories of data. Some of this data is actually created by the users of our app through our app - so we handle this easily. Our app writes that data to our SQL data store and pushes the same data in RabbitMQ for indexing/updating in Elasticsearch. On the other side of RabbitMQ, we have a consumer written in Python that basically replaces the entire document in Elasticsearch. So the corresponding rows in our SQL datastore and documents in Elasticsearch share the ID which enables us to update the document.
Another case is where there are a few types of data that we perform search on comes from some 3rd party service which exposes the data over their HTTP API. The data creation is in our control but we don't have an automated mechanism of updating the entries in Elasticsearch. In this case, we basically run a cron job that takes care of this. We have managed to tune the cron's schedule because we also have a limited number of API queries quota. But in this case, our data is not really updated so much per day. So this kind of system works for us.
Disclaimer: I co-developed this solution.
I needed something like the jdbc-river that could do more complex "roll-ups" of data. After careful consideration of what it would take to modify the jdbc-river to suit my needs, I ended up writing the river-net.
Here are a few of the features:
It gets fairly decent performance (comparable to the jdbc-river. We get upwards of 6k rows/sec)
It can join many tables to create complex nested arrays of documents without creating duplicate child documents
It follows a lot of the same conventions as the jdbc-river.
It also supports reading from files.
It's written in C#
It uses Quartz.Net and supports cron expressions for scheduling.
This project is open source, and we already have a second project (also to be open sourced) that does generic job scheduling with RabbitMQ. We have ported over a lot of this project, and plan to the RabbitMQ river for better performance and stability when indexing into Elasticsearch.
To combat large updates, we aren't hitting tables directly. Instead we use stored procedures that only grab deltas. We also have an option on the sp to reset the delta to reindex everything.
The project is fairly young with only a few commits, but we are open to collaboration and new ideas.

Resources