I have a situation where I need to call 3 databases and create a CSV.
I have created a batch step that gets the data from my first database.
This gives around 10,000 records.
Now, from each of these records I need to take the id and use it to fetch the data from the other data sources. I have not been able to find a good solution.
Any help in finding a solution is appreciated.
I tried a separate step for each data source, but I am not sure how to pass the ids to the next step (we are talking about 10,000 ids).
Is it possible to connect to all 3 databases in the same step? I am new to Spring Batch, so I do not have a full grasp of all the concepts yet.
You can do the second call to fetch the details of each item in an item processor. This is a common pattern and is described in the Common Batch Patterns section of the reference documentation.
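For example, here is a minimal sketch of such a processor, assuming the items are Customer objects with getId/setDetails accessors and a JdbcTemplate pointing at the second database (all names are illustrative, and the same processor could query the third database as well):

```java
import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

public class CustomerEnrichmentProcessor implements ItemProcessor<Customer, Customer> {

    private final JdbcTemplate secondDbTemplate;

    public CustomerEnrichmentProcessor(JdbcTemplate secondDbTemplate) {
        this.secondDbTemplate = secondDbTemplate;
    }

    @Override
    public Customer process(Customer customer) {
        // fetch the details for this id from the second data source
        String details = secondDbTemplate.queryForObject(
                "SELECT details FROM customer_details WHERE customer_id = ?",
                String.class, customer.getId());
        customer.setDetails(details);
        return customer;
    }
}
```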
Related
I am writing a Spring Boot batch application and I am confused about the best way to process big data. I have an input file with millions of user ids and I need to remove those ids from another table. I don't think querying each user id is a good idea, but I don't have a different solution at this time. Unfortunately, these user ids are very random and cannot be sorted. Can anyone suggest the best approach? The database is Oracle.
You can load the file into a temporary table and then delete the IDs that appear in it, something like:
DELETE FROM ORIGINAL_TABLE WHERE ID IN (SELECT ID FROM TEMPORARY_TABLE).
This will perform better than running a query for each ID in the file to check whether it exists and delete it if necessary.
If you really need to use Spring Batch for that, you can create a two-step job: the first step is chunk-oriented and loads the file into the temporary table, and the second step is a simple tasklet that executes the delete query. I would add a third step to remove the temporary table and clean things up.
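A minimal sketch of such a job with the Spring Batch 4.x Java DSL, to be placed in a @Configuration @EnableBatchProcessing class (imports omitted for brevity); the table names and the reader/writer beans are assumptions:

```java
@Bean
public Job purgeJob(JobBuilderFactory jobs, StepBuilderFactory steps,
                    FlatFileItemReader<Long> idFileReader,
                    JdbcBatchItemWriter<Long> tempTableWriter,
                    DataSource dataSource) {
    // step 1: chunk-oriented, loads the ids from the file into TEMPORARY_TABLE
    Step loadTempTable = steps.get("loadTempTable")
            .<Long, Long>chunk(10_000)
            .reader(idFileReader)
            .writer(tempTableWriter)
            .build();

    // step 2: simple tasklet that runs the bulk delete in a single statement
    Step deleteIds = steps.get("deleteIds")
            .tasklet((contribution, chunkContext) -> {
                new JdbcTemplate(dataSource).update(
                        "DELETE FROM ORIGINAL_TABLE WHERE ID IN (SELECT ID FROM TEMPORARY_TABLE)");
                return RepeatStatus.FINISHED;
            })
            .build();

    // step 3: clean-up tasklet that drops the temporary table
    Step dropTempTable = steps.get("dropTempTable")
            .tasklet((contribution, chunkContext) -> {
                new JdbcTemplate(dataSource).execute("DROP TABLE TEMPORARY_TABLE");
                return RepeatStatus.FINISHED;
            })
            .build();

    return jobs.get("purgeJob")
            .start(loadTempTable).next(deleteIds).next(dropTempTable)
            .build();
}
```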
I am looking to read data from multiple tables (in different databases), aggregate the results and create a final result set. In my case, each query returns a List of objects. I went through the web many times and found no link other than Spring Batch How to read multiple table (queries) as Reader and write it as flat file write, but that one returns only a single object.
Is there any way we can do this? A working sample example would help a lot.
Example -
One query gives List of Departments - from Oracle DB
One query gives List of Employee - from Postgres
Now I want to build the Employee and Department relationship and send the resulting object to the processor for a further lookup against MongoDB, and then send the final object to the writer.
The question should rather be "how to join three tables from three different databases and write the result in a file". There is no built-in reader in Spring Batch that reads from multiple tables. You either need to create a custom reader, or decompose the problem at hand into tasks that can be implemented using Spring Batch tasklet/chunk-oriented steps.
I believe you can use the driving query pattern in a single chunk-oriented step. The reader reads employee items, then a processor enriches each item with 1) the department from Postgres and 2) the other info from MongoDB. This should work for small/medium datasets. If you have a lot of data, you can use partitioning to parallelize things and improve performance.
Another option, if you want to avoid a query per item, is to load all departments into a cache, for example (I guess there should be fewer departments than employees), and enrich items from the cache rather than with individual queries to the database.
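A sketch of that cache option; the Employee/Department classes and the SQL are assumptions for illustration, and the cache is loaded once when the processor is created:

```java
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

public class DepartmentEnrichmentProcessor implements ItemProcessor<Employee, Employee> {

    private final Map<Long, Department> departmentsById;

    public DepartmentEnrichmentProcessor(JdbcTemplate postgresTemplate) {
        // one query up front: departments are expected to be few compared to employees
        this.departmentsById = postgresTemplate.query(
                        "SELECT id, name FROM departments",
                        (rs, rowNum) -> new Department(rs.getLong("id"), rs.getString("name")))
                .stream()
                .collect(Collectors.toMap(Department::getId, department -> department));
    }

    @Override
    public Employee process(Employee employee) {
        // in-memory lookup instead of a query per item
        employee.setDepartment(departmentsById.get(employee.getDepartmentId()));
        return employee; // the MongoDB lookup can be done here or in a second processor
    }
}
```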
I'm new to Talend and I'm learning through videos and documentation, so I'm just not sure how to approach/implement this following best practices.
Goal
Integrate Magento and Quick Book using Talend.
My thoughts
Initially my first thought was that I would set up a direct DB connection to Magento, take the relevant data I need, process it, and send it to QuickBooks using the REST APIs (specifically the bulk APIs, in batches).
But then I thought it would be a little hectic for me to query the Magento database (multiple joins), so another option is to use Magento's REST API.
But as I'm not very familiar with the tool, I'm struggling a little to find the most suitable approach, so any help is appreciated.
What I've done till now
I've saved my auth (for QB) and DB (Magento) credentials in a file and, using tFileInputDelimited and tContextLoad, I'm storing them in context variables so they are accessible globally.
I've successfully configured the database connection and DB input, but I've not used metadata for the connection (should I use that, and if yes, how can I pass dynamic values there?). I've used my context variable data in the DB connection settings.
I've taken the relevant fields for now, but if I want more fields a simple query is not enough, as Magento stores data in multiple tables for Customer etc. It's not a big deal, I know, but I think it might increase my work.
For now that's what I've built, and my next step is to send the data to QB using REST (getting the access_token and saving it to a context variable) and then store the QB reference back in the Magento DB.
Also, I've decided to use the QB bulk APIs, but I'm not sure how I can process data in chunks in Talend (I tried to check multiple resources but no luck), i.e. if Magento returns 500 rows I want to process them in chunks of 30, as the QB batch max limit is 30. I will then send them to QB using REST and, as I said, I also want to store the QB reference ID back in Magento (so I can update it later).
Also, this will all be local for now; how can I do the same in production? How can I maintain development and production environments?
Resources I'm referring
For REST and Auth best practices - https://community.talend.com/t5/How-Tos-and-Best-Practices/Using-OAuth-2-0-with-Talend-to-Access-Goo...
Nice example for batch processing here:
https://community.talend.com/t5/Design-and-Development/Batch-processing-in-talend-job/td-p/51952
Redirect your input to a tFileOutputDelimited.
Enter the output filename, tick the "Split output in several files" option in the "Advanced settings" tab, and enter 1000 (or 30, to match the QB batch limit) in the "Rows in each output file" field. This will create n files based on the filename, with that many rows in each.
On the next subjob, use a tFileList to iterate over this list of files and read the records from each one.
I need to know the best approach to read data from one database in chunks (of 100) and, based on that data, read data from another database server.
Example: take an id from one database server and, based on that id, fetch data from the other database server.
I have searched on Google but haven't found a solution for reading twice and writing once in a batch.
One approach is to read in chunks and, inside the processor, take the id and hit the second database. But the processor handles a single item at a time, which is very time consuming.
The second approach is to use two different steps, but then we can't share the list of ids with the other step, because only a small amount of data can be shared between steps.
I need to know the best approach to read twice, one after the other.
There is no best approach as it depends on the use case.
One approach is to read in chunks and, inside the processor, take the id and hit the second database. But the processor handles a single item at a time, which is very time consuming.
This approach is a common pattern called the "Driving Query Pattern", explained in detail in the Common Batch Patterns section of the reference documentation. The idea is that the reader reads only IDs, and the processor enriches each item by querying the second server for that item's additional data. Of course this will generate a query per item, but that is what you want anyway, unless you want your second query to send the list of all IDs in the chunk. In that case, you can do it in org.springframework.batch.core.ItemWriteListener#beforeWrite, where you get the list of all items about to be written.
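Here is a sketch of that chunk-level variant, using the Spring Batch 4.x listener signature (beforeWrite receives the list of items in the chunk); the Item class, its accessors and the SQL are assumptions for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.springframework.batch.core.ItemWriteListener;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class ChunkEnrichmentListener implements ItemWriteListener<Item> {

    private final NamedParameterJdbcTemplate secondServerTemplate;

    public ChunkEnrichmentListener(NamedParameterJdbcTemplate secondServerTemplate) {
        this.secondServerTemplate = secondServerTemplate;
    }

    @Override
    public void beforeWrite(List<? extends Item> items) {
        List<Long> ids = items.stream().map(Item::getId).collect(Collectors.toList());
        // one query for the whole chunk instead of one query per item
        Map<Long, String> detailsById = secondServerTemplate.query(
                        "SELECT id, details FROM details_table WHERE id IN (:ids)",
                        Map.of("ids", ids),
                        (rs, rowNum) -> Map.entry(rs.getLong("id"), rs.getString("details")))
                .stream()
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
        items.forEach(item -> item.setDetails(detailsById.get(item.getId())));
    }

    @Override
    public void afterWrite(List<? extends Item> items) { }

    @Override
    public void onWriteError(Exception exception, List<? extends Item> items) { }
}
```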
The second approach is to use two different steps, but then we can't share the list of ids with the other step, because only a small amount of data can be shared between steps.
Yes, sharing a lot of data via the execution context is not recommended as this execution context will be persisted between steps. So I think this is not a good option for you.
Hope this helps.
I have the below requirement for a Spring Batch job, and I would like to know the best approach to achieve it.
Input: A relatively large file with report data (for today)
Processing:
1. Update Daily table and monthly table based on the report data for today
Daily table: just update the counts based on ID
Monthly table: add today's count to the existing value
My concerns are:
1. Since the data is huge, I may end up with multiple DB transactions. How can I do this operation in bulk?
2. To add to the existing counts in the monthly table, I must have the existing counts with me. I may have to maintain a map beforehand, but is this a good way to process it?
Please suggest the approach I should follow, or an example if there is one.
Thanks.
You can design a chunk-oriented step that first inserts the daily data from the file into the daily table. When this step is finished, you can use a step execution listener: in the afterStep method you have a handle on the step execution, from which you can get the write count with StepExecution#getWriteCount. You can then write this count to the monthly table.
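A sketch of such a listener, assuming a monthly_counts table keyed by month; the table, column and class names are illustrative, and the write count is simply added to the existing value:

```java
import java.time.YearMonth;

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.jdbc.core.JdbcTemplate;

public class MonthlyCountListener implements StepExecutionListener {

    private final JdbcTemplate jdbcTemplate;

    public MonthlyCountListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // nothing to do before the step
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        // number of items the step actually wrote to the daily table
        long todaysCount = stepExecution.getWriteCount();
        jdbcTemplate.update(
                "UPDATE monthly_counts SET total = total + ? WHERE month = ?",
                todaysCount, YearMonth.now().toString());
        return stepExecution.getExitStatus();
    }
}
```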
Since the data is huge, I may end up with multiple DB transactions. How can I do this operation in bulk?
With a chunk-oriented step, data is already written in bulk (one transaction per chunk). This model works very well even if your input file is huge.
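For instance, a minimal step definition (Spring Batch 4.x Java DSL, inside a @Configuration class) where the file is written to the daily table in chunks of 1000, one transaction per chunk, and the listener from the sketch above updates the monthly table afterwards; the reader/writer beans and the DailyRecord type are assumptions:

```java
@Bean
public Step dailyLoadStep(StepBuilderFactory steps,
                          FlatFileItemReader<DailyRecord> reportReader,
                          JdbcBatchItemWriter<DailyRecord> dailyTableWriter,
                          MonthlyCountListener monthlyCountListener) {
    return steps.get("dailyLoadStep")
            .<DailyRecord, DailyRecord>chunk(1000) // commit interval: one transaction per 1000 items
            .reader(reportReader)
            .writer(dailyTableWriter)              // JdbcBatchItemWriter issues batched inserts/updates
            .listener(monthlyCountListener)        // updates the monthly table in afterStep
            .build();
}
```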
To add to the existing counts in the monthly table, I must have the existing counts with me. I may have to maintain a map beforehand, but is this a good way to process it?
There is no need to store the info in a map; you can get the write count from the step execution after the step, as explained above.
Hope this helps.