Getting transformation configuration from custom processor: Nifi - etl

I am trying to test some functionality in NiFi. The data I pull from the database contains specific columns, say "id", and I need NiFi to rename that column to "customer_id". I understand this is an easy job using something like JOLT. My problem is that I need to pull these configurations or rules from somewhere else, say another database or some other location. I don't want to hard-code the column names in the JOLT transform; I want to get them from elsewhere. Is there a best practice or recommended way of doing this? Will I have to write a custom processor for this, and if so, what is the best place to start reading about writing custom processors?

There are many different ways to do transforms besides JOLT - it is worth researching Records and Schemas in NiFi.
But on to your problem - you could use LookupRecord with a LookupService to pull the configuration; for example, you could pull it out of a database or from a REST endpoint. There are many LookupServices - read the LookupRecord docs page for a list of them.
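If you do decide to write a custom processor, the NiFi Developer Guide and the processor bundle Maven archetype are the usual starting points. As an illustration only (not the LookupRecord approach above), here is a minimal sketch of a processor that renames JSON keys; the class name RenameColumnsProcessor is hypothetical, and the rule map is hard-coded where a real implementation would fetch it from another database, a REST endpoint, or a LookupService.

```java
// Sketch only: a hypothetical custom processor that renames JSON keys using rules
// that would normally be looked up from an external source (database, REST, etc.).
package com.example.nifi;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;

import com.fasterxml.jackson.databind.ObjectMapper;

public class RenameColumnsProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("FlowFiles whose columns were renamed")
            .build();

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // Hard-coded for illustration only; in a real flow these rules would be
        // pulled from another database, a REST endpoint, or a LookupService.
        Map<String, String> renameRules = Map.of("id", "customer_id");

        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(InputStream in, OutputStream out) throws IOException {
                // Read a single JSON object, rename its keys per the rules, write it back.
                Map<String, Object> data = mapper.readValue(in, Map.class);
                Map<String, Object> renamed = new LinkedHashMap<>();
                data.forEach((key, value) -> renamed.put(renameRules.getOrDefault(key, key), value));
                mapper.writeValue(out, renamed);
            }
        });

        session.transfer(flowFile, REL_SUCCESS);
    }
}
```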

Related

Looking for multi-database data processing with Spring Batch

I have a situation where I need to call 3 databases and create a CSV.
I have created a batch step where I get the data from my first database.
This returns around 10,000 records.
For each of these records I need to take the id and use it to fetch data from the other data sources. I have not been able to find the best solution.
Any help in finding a solution is appreciated.
I tried a separate step for each data source, but I am not sure how to pass the ids to the next step (we are talking about 10,000 ids).
Is it possible to connect to all 3 databases in the same step? I am new to Spring Batch, so I don't have a full grasp of all the concepts.
You can do the second call to fetch the details of each item in an item processor. This is a common pattern and is described in the Common Batch Patterns section of the reference documentation.
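As a rough sketch of that pattern, assuming the first step's reader emits the ids and a JdbcTemplate is configured against the second database; the CustomerId/CustomerDetails types, table and column names below are hypothetical placeholders:

```java
// Sketch of the per-item enrichment described above (all names are hypothetical).
import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

public class EnrichmentItemProcessor implements ItemProcessor<CustomerId, CustomerDetails> {

    private final JdbcTemplate secondDbTemplate; // configured against the second data source

    public EnrichmentItemProcessor(JdbcTemplate secondDbTemplate) {
        this.secondDbTemplate = secondDbTemplate;
    }

    @Override
    public CustomerDetails process(CustomerId item) {
        // One lookup per item read by the first step's reader.
        return secondDbTemplate.queryForObject(
                "SELECT id, name, email FROM customer_details WHERE id = ?",
                (rs, rowNum) -> new CustomerDetails(
                        rs.getLong("id"), rs.getString("name"), rs.getString("email")),
                item.value());
    }
}

// Hypothetical item types used above.
record CustomerId(long value) {}
record CustomerDetails(long id, String name, String email) {}
```

One query per item is the simplest form of the pattern; with 10,000 records you will want the lookup column indexed, and the processed items are then written out as CSV by the step's writer.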

File Import Validation in domain driven design

I am developing an application and trying to use a DDD approach. I have modeled all my entities and aggregates in the domain. Now I would like to keep my validation logic in the domain, but I need to support some bulk import functionality (via a CSV file). I need to validate the file and give the user information about the rows that need corrections.
What would be the best DDD way to do this? The validation logic is quite complex and I wouldn't like to duplicate the CSV file validation in the domain. Also, the file structure is a flatter structure that differs from the domain aggregate, so even if I validate the aggregate, the row information is lost.
What would be the best DDD way to do this? The validation logic is quite complex and I wouldn't like to duplicate the CSV file validation in the domain.
To accomplish something of this nature, you can use a Saga or Process Manager. It would maintain your position in the file and issue commands to your Aggregates to perform the file import. If the commands fail or are rejected then you can update the corresponding rows with the error messages that are produced by the domain.
Also, the file structure is a flatter structure that differs from the domain aggregate, so even if I validate the aggregate, the row information is lost.
You could group the records in the file if necessary, but the saga should maintain a record of the line numbers and corresponding commands. You may choose to model the file itself as an Aggregate which will provide benefits as well. Eliding domain concepts such as this will make things more difficult in the future, but adding them to the design will come at the cost of time.
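As an illustration only, a process manager along these lines might look something like the sketch below; CsvRow, ImportCustomerCommand, CommandBus, RowError and DomainValidationException are hypothetical names standing in for your own types.

```java
// Illustrative sketch of a process manager that drives a CSV import through the domain
// and keeps track of which line numbers failed validation. All type names are hypothetical.
import java.util.ArrayList;
import java.util.List;

public class CsvImportProcessManager {

    private final CommandBus commandBus;

    public CsvImportProcessManager(CommandBus commandBus) {
        this.commandBus = commandBus;
    }

    public List<RowError> importRows(List<CsvRow> rows) {
        List<RowError> errors = new ArrayList<>();
        int lineNumber = 1;
        for (CsvRow row : rows) {
            try {
                // Translate the flat CSV row into a command the aggregate understands;
                // the validation logic stays in the domain.
                commandBus.dispatch(new ImportCustomerCommand(row.customerId(), row.name()));
            } catch (DomainValidationException e) {
                // Record the line number so the user can be told exactly which rows to fix.
                errors.add(new RowError(lineNumber, e.getMessage()));
            }
            lineNumber++;
        }
        return errors;
    }
}

// Hypothetical supporting types.
record CsvRow(String customerId, String name) {}
record ImportCustomerCommand(String customerId, String name) {}
record RowError(int lineNumber, String message) {}

interface CommandBus {
    void dispatch(Object command) throws DomainValidationException;
}

class DomainValidationException extends Exception {
    DomainValidationException(String message) { super(message); }
}
```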
I can provide more detail if you can explain: are you using DDD, CQRS, ES, queues, async, etc.? As @Dnomyar suggested, it might also help to explain where you are facing actual difficulties.

How to get data from multiple tables using a single flow?

I want to fetch data from multiple tables using a single processor and push the data into the respective tables using NiFi. Which processors do I need to use for that? And what is the advantage of a single flow instead of multiple flows?
Thank you.
As per the NIFI-5221 Jira, starting from NiFi 1.7 you can use the DatabaseLookupService controller service.
Using this lookup service you can get data from more than one table in a single flow.
Refer to this and this for more references.

Which Nifi processor to use for RDBMS Extract

I will explain my use case to help determine which DB extract utility to use.
I need to extract data from SQL Server tables with varying frequency each day. Each extract query is a complex SQL statement, involving joins across 5-10 tables with multiple clauses. We have around 20-30 such statements overall.
All of these extract queries might need to run multiple times a day, with varying frequencies each day, depending on how many times we receive data from the source system, among other cases.
We are planning to use Kafka to publish a message to let the NiFi workflow know whenever an RDBMS table is updated and the flow needs to be triggered (I can't just trigger the NiFi flow based on an "incremental" column value; there might be update-only scenarios where no new rows are created in the tables).
How should I go about designing my NiFi flow? There are ExecuteSQL/GenerateTableFetch/ExecuteSQLRecord/QueryDatabaseTable and all sorts of components available. Which one is going to fit my requirement best?
Thanks!
I suggest you use ExecuteSQL. You can set the query from an attribute or compose it using attributes. The easiest way is to create JSON, then parse that JSON and create the attributes. Check this example; there the SQL is created from a file, and you can adjust it to create it from Kafka instead: link

Apache Nifi - Federated Search

My team has been thrown into the deep end and has been asked to build a federated search of customers over a variety of large datasets which hold varying degrees of differing data about each individual (and no matching identifiers), and I was wondering how to go about implementing it.
I was thinking Apache Nifi would be a good fit to query our various databases, merge the result, deduplicate the entries via an external tool and then push this result into a database which is then queried for use in an Elasticsearch instance for the applications use.
So roughly speaking something like this:-
For example's sake, the following data then exists in the result database from the first flow:-

Then https://github.com/dedupeio/dedupe is run over this database table, which adds cluster ids to aid the record linkage, e.g.:-

The second flow would then query the result database and feed the result into the Elasticsearch instance for use by the application's API, which would use the cluster id to link the duplicates.
A couple of questions:-
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven't considered any CDC process here, as the databases will be getting constantly updated, which I'd need to handle, so I'm really interested to hear if anybody has solved a similar problem or used a different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since it looks like a Python library, I'm guessing writing a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.
