in Middleware IIB/ACE how to parse huge JSON messages and insert them into DB - ibm-integration-bus

in Middleware IIB/ACE how to parse huge JSON messages and insert them into DB
rather hitting every time into DB. how to parse json messages into records and insert in DB table

In this case you will need to use the aggregate nodes. You may find the official documentation of the Message Flow Aggregation in the following link https://www.ibm.com/docs/en/integration-bus/9.0.0?topic=aggregation-message-flow
I also found the below tutorial useful but I it is explaining how to aggregate messages from different resources but you can adapt the solution as per your requirements.
https://www.youtube.com/watch?v=nNJ6fHHj_as

Related

What is the best approach while pooling data from DB and query DB again to fetch additional information?

The spring boot application that I am working on
pools 1000 messages from table X [ This table X is populated by another service s1]
From each message get the account number and query table Y to get additional information about account.
I am using spring integrating to pool messages from table X and reading additional information for account, I am planning to use Spring JDBC.
We are expecting about 10k messages very day.
Is above approach, to query table Y for each message, a good approach ?
No, that indeed not. If all of that data is in the same database, consider to write a proper SELECT to join those tables in a single query performed by that source polling channel adapter.
Another approach is to implement a stored procedure which will do that job for you and will return the whole needed data: https://docs.spring.io/spring-integration/reference/html/jdbc.html#stored-procedures.
Although if the memory for that number of records to handle at once is a limit in your environment or you don't care how fast all of them are processed, then indeed an integration flow with parallel processing of splitted polling result is OK. For that goal you can use a JdbcOutboundGateway as a service in your flow instead of playing with plain JdbcTemplate: https://docs.spring.io/spring-integration/reference/html/jdbc.html#jdbc-outbound-gateway

Uploding data to kafka producer

I am new to Kafka in Spring Boot, I have been through many tutorials and got fair knowledge about the same.
Currently I have been assigned a task and I am facing an issue. Hope to get some help here.
The scenario is as follows.
1)I have a DB which is getting updated continuously with millions of data.
2)I have to hit the DB after every 5 mins and pick the recently updated data and send it to Kafka.
Condition- The old data that I have picked in my previous iteration should not be picked in my next DB call and Kafka pushing.
I am done with the part of Spring Scheduling to pick the data by using findAll() of spring boot JPA, but how can I write the logic so that it does not pick the old DB records and just take the new record and push it to kafka.
My DB table also have a field called "Recent_timeStamp" of type "datetime"
Its hard to tell without really seeing your logic and the way you work with the database, but from what you've described you should do just "findAll" here.
Instead you should treat your DB table as a time-driven data:
Since it has a field of timestamp, make sure there is an index on it
Instead of "findAll" execute something like:
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP > ?
ORDER BY RECENT_TIMESTAMP ASC
In this case you'll get the records ordered by the increasing timestamp
Now the ? denotes the last memorized timestamp that you've handled
So you'll have to maintain the state here
Another option is to query the data whose timestamp is "less" than 5 minutes, in this case the query will look like this (pseudocode since the actual syntax varies):
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP < now() - 5 minutes
ORDER BY RECENT_TIMESTAMP ASC
The first method is more robust because if your spring boot application is "down" for some reason you'll be able to recover and query all your records from the point it has failed to send the data. On the other hand you'll have to save this kind of pointer in some type of persistent storage.
The second solution is "easier" in a sense that you don't have a state to maintain but on the other hand you will miss the data after the restart.
In both of the cases you might want to use some kind of pagination because basically you don't know how many records you'll get from the database and if the amount of records exceeds your memory limits, the application with end up with OutOfMemory error thrown.
A Completely different approach is throwing the data to kafka when you write to the database instead of when you read from it. At that point you might have a data chunk of (probably) reasonably limited size and in general you don't need the state because you can store to db and send to kafka from the same service, if the architecture of your application permits to do so.
You can look into kafka connect component if it serves your purpose.
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka® and other data systems. It makes it simple to quickly define connectors that move large data sets in and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export connector can deliver data from Kafka topics into secondary indexes like Elasticsearch, or into batch systems–such as Hadoop for offline analysis.

Which Nifi processor to use for RDBMS Extract

i will explain my use case to understand which DB extract utility to use.
I need to extract data from SQL Server tables with varying frequency each day. Each extract query is a complex SQL statement, involving 5-10 tables in joins etc with multiple causes. Have around 20-30 such statements overall.
All these extract queries might be required to run multiple times a day with varying frequencies each day. It depends on how many times we receive data from source system or other cases.
We are planning to use Kafka to publish a message to let Nifi workflow know whenever a RDBMS table is updated and flow needs to be triggered (i can't just trigger Nifi flow based on "incremental" column value, there might only be all row update scenarios and we might not create new rows in tables).
How should i go about designing my Nifi. There are ExecuteSQL/GenerateTableFetch/ExecuteSQLRecord/QueryDatabaseTable all sorts of components available. Which one is going to fit my requirement best?
Thanks!
I am suggesting that you use ExecuteSQL. You can set query from attribute or compose it using attribute. Easiest way is to create json and then parse that json and create attributes. Check this example, here is sql created from file you can adjust it to create it from kafka link

Apache Nifi - Federated Search

My team’s been thrown into the deep end and have been asked to build a federated search of customers over a variety of large datasets which hold varying degrees of differing data about each individuals (and no matching identifiers) and I was wondering how to go about implementing it.
I was thinking Apache Nifi would be a good fit to query our various databases, merge the result, deduplicate the entries via an external tool and then push this result into a database which is then queried for use in an Elasticsearch instance for the applications use.
So roughly speaking something like this:-
For examples sake the following data then exists in the result database from the first flow :-

Then running https://github.com/dedupeio/dedupe over this database table which will add cluster ids to aid the record linkage, e.g.:-

Second flow would then query the result database and feed this result into Elasticsearch instance for use by the applications API for querying which would use the cluster id to link the duplicates.
Couple questions:-
How would I trigger dedupe to run on the merged content was pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven’t considered any CDC process here as the databases will be getting constantly updated which I'd need to handle, so really interested if anybody had solved a similar problem or used different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since it looks like a Python library, I'm guessing writing a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.

bulk.find.update() vs update.collection(multi=true) in mongo

I am newbie to mongodb and would like to implement mongodb in my project having millions of record.Would like to know what should i prefer for update -bulk.find.update() vs update.collection with multi =true for performance.
As far as I know the biggest gains Bulk provides are these:
Bulk operations send only one request to the MongoDB for the all requests in the bulk. Others send a request per document or send only for one operation type from one of insert, update, updateOne, upsert with update operations and remove .
Bulk can handle a lot of different cases at different lines on a code page.
Bulk operations can work as asynchronously. Others cannot.
But today some operations work bulk based. For instance insertMany.
If the gains above have taken into account, update() must show the same performance results with an bulk.find.update() operation.
Because update() can take only one query object sending to MongoDB. And multi: true is only an argument which specifies that the all matched documents have to be updated. This means it makes only one request on the network. Just like Bulk operations.
So, both of sends only one request to MongoDB and MongoDB evaluates the query clause to find the documents that'd be updated then updates them!
I had tried to find out an answer for this question on MongoDB official site but I could not.
So, an explanation from #AsyaKamsky would be great!

Resources