Spring Batch: Read twice, one after the other, from a database - spring-boot

I need to know the best approach to read data from one database in chunks (of 100) and, based on that data, read data from another database server.
Example: take an id from one database server and, based on that id, fetch data from the other database server.
I have searched on Google but haven't found a solution for reading twice and writing once in a batch.
One approach is to read in chunks and, inside the processor, take the id and hit the other database. But the processor handles a single item at a time, which is very time consuming.
The second approach is to make two different steps, but then we can't share the list of ids with the other step, because only a small amount of data can be shared between steps.
I need to know the best approach to read twice, one after the other.

There is no best approach as it depends on the use case.
One approach is to read in chunks and, inside the processor, take the id and hit the other database. But the processor handles a single item at a time, which is very time consuming.
This approach is a common pattern called the "Driving Query Pattern", explained in detail in the Common Batch Patterns section of the reference documentation. The idea is that the reader reads only IDs, and the processor enriches the item by querying the second server for the additional data for that item. Of course this will generate a query for each item, but that is what you want anyway, unless you want your second query to send the list of all IDs in the chunk. In that case, you can do it in org.springframework.batch.core.ItemWriteListener#beforeWrite, where you get the list of all items to be written.
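For illustration, here is a minimal sketch of such an enriching processor (the CustomerDetails type, the SQL and the second JdbcTemplate are assumptions, not something from the original question):

```java
import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;

// Driving query pattern: the reader emits only ids from the first server,
// and this processor enriches each id by querying the second server.
public class EnrichingItemProcessor implements ItemProcessor<Long, CustomerDetails> {

    // JdbcTemplate configured against the second database server
    private final JdbcTemplate secondServerJdbcTemplate;

    public EnrichingItemProcessor(JdbcTemplate secondServerJdbcTemplate) {
        this.secondServerJdbcTemplate = secondServerJdbcTemplate;
    }

    @Override
    public CustomerDetails process(Long id) {
        // one query per item: fetch the additional data for this id
        return secondServerJdbcTemplate.queryForObject(
                "SELECT id, name, status FROM customer_details WHERE id = ?",
                (rs, rowNum) -> new CustomerDetails(
                        rs.getLong("id"), rs.getString("name"), rs.getString("status")),
                id);
    }
}

// simple carrier for the enriched data (hypothetical)
record CustomerDetails(long id, String name, String status) {}
```

The reader of such a step would be a plain cursor or paging reader that returns only the ids from the first server.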
The second approach is to make two different steps, but then we can't share the list of ids with the other step, because only a small amount of data can be shared between steps.
Yes, sharing a lot of data via the execution context is not recommended as this execution context will be persisted between steps. So I think this is not a good option for you.
Hope this helps.

Related

Looking for multi-database data processing with Spring Batch

I have a situation where I need to call 3 databases and create a CSV.
I have created a batch step where I can get the data from my first database.
This gives around 10000 records.
Now, from each of these records, I need to get the id and use it to fetch data from the other data sources. I have not been able to find the best solution.
Any help in finding the solution is appreciated.
I tried a separate step for each data source, but I'm not sure how to pass the ids to the next step (we are talking about 10000 ids).
Is it possible to connect to all 3 databases in the same step? I am new to Spring Batch, so I don't have a full grasp of all the concepts.
You can do the second call to fetch the details of each item in an item processor. This is a common pattern and is described in the Common Batch Patterns section of the reference documentation.
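To give an idea of how the three databases can live in one step, here is a rough configuration sketch (Spring Batch 5 style; the bean names, SQL, record type and output file are assumptions for illustration):

```java
import javax.sql.DataSource;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.builder.FlatFileItemWriterBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class CsvExportJobConfig {

    record ReportLine(Long id, String detailA, String detailB) {}

    // reader: only the ids from the first database
    @Bean
    public JdbcCursorItemReader<Long> idReader(DataSource firstDataSource) {
        return new JdbcCursorItemReaderBuilder<Long>()
                .name("idReader")
                .dataSource(firstDataSource)
                .sql("SELECT id FROM source_table")
                .rowMapper((rs, rowNum) -> rs.getLong("id"))
                .build();
    }

    // processor: one lookup per item on each of the other two databases
    // (in a real configuration the two templates would need qualifiers)
    @Bean
    public ItemProcessor<Long, ReportLine> enrichingProcessor(JdbcTemplate secondDbTemplate,
                                                              JdbcTemplate thirdDbTemplate) {
        return id -> new ReportLine(id,
                secondDbTemplate.queryForObject(
                        "SELECT detail_a FROM table_a WHERE id = ?", String.class, id),
                thirdDbTemplate.queryForObject(
                        "SELECT detail_b FROM table_b WHERE id = ?", String.class, id));
    }

    // writer: append each enriched record as a CSV line
    @Bean
    public FlatFileItemWriter<ReportLine> csvWriter() {
        return new FlatFileItemWriterBuilder<ReportLine>()
                .name("csvWriter")
                .resource(new FileSystemResource("report.csv"))
                .lineAggregator(line -> line.id() + "," + line.detailA() + "," + line.detailB())
                .build();
    }

    @Bean
    public Step exportStep(JobRepository jobRepository, PlatformTransactionManager transactionManager,
                           JdbcCursorItemReader<Long> idReader,
                           ItemProcessor<Long, ReportLine> enrichingProcessor,
                           FlatFileItemWriter<ReportLine> csvWriter) {
        return new StepBuilder("exportStep", jobRepository)
                .<Long, ReportLine>chunk(100, transactionManager)
                .reader(idReader)
                .processor(enrichingProcessor)
                .writer(csvWriter)
                .build();
    }
}
```

With this shape there is no need to pass 10000 ids between steps: each id flows through the processor, which does the per-item lookups on the other two databases.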

Spring Batch to read a CSV and update data in bulk in MySQL

I have the below requirement for a Spring Batch job, and I would like to know the best approach to achieve it.
Input: A relatively large file with report data (for today)
Processing:
1. Update the daily table and the monthly table based on the report data for today
Daily table: just update the counts based on ID
Monthly table: add today's count to the existing value
My concerns are:
1. Since the data is huge, I may end up having multiple DB transactions. How can I do this operation in bulk?
2. To add to the existing counts in the monthly table, I must have the existing counts with me. I may have to maintain a map beforehand, but is this a good way to process the data?
Please suggest the approach I should follow, or an example if there is one.
Thanks.
You can design a chunk-oriented step to first insert the daily data from the file into the table. When this step is finished, you can use a step execution listener: in the afterStep method, you have a handle to the step execution, from which you can get the write count with StepExecution#getWriteCount. You can then write this count to the monthly table.
Since the data is huge, I may end up having multiple DB transactions. How can I do this operation in bulk?
With a chunk-oriented step, data is already written in bulk (one transaction per chunk). This model works very well even if your input file is huge.
To add to the existing counts in the monthly table, I must have the existing counts with me. I may have to maintain a map beforehand, but is this a good way to process the data?
There is no need to store the info in a map; you can get the write count from the step execution after the step, as explained above.
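A rough sketch of that listener (the table and column names, the SQL and the injected JdbcTemplate are assumptions; adapt the statement to your schema):

```java
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.jdbc.core.JdbcTemplate;

public class MonthlyCountListener implements StepExecutionListener {

    private final JdbcTemplate jdbcTemplate;

    public MonthlyCountListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // nothing to do before the step
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        // number of items written by the daily-load step
        long writeCount = stepExecution.getWriteCount();
        // add today's count to the existing monthly value (hypothetical MySQL schema)
        jdbcTemplate.update(
                "UPDATE monthly_report SET total = total + ? WHERE month_key = DATE_FORMAT(CURDATE(), '%Y-%m')",
                writeCount);
        return stepExecution.getExitStatus();
    }
}
```

Register it on the daily-load step (for example with .listener(new MonthlyCountListener(jdbcTemplate)) on the step builder) so that afterStep runs once all chunks have been committed.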
Hope this helps.

Import data from a large CSV (or stream of data) to Neo4j efficiently in Ruby

I am new to background processes, so feel free to point it out if I am making wrong assumptions.
I am trying to write a script that imports data into a Neo4j db from a large CSV file (consider it a stream of data, endless). The CSV file contains only two columns - user_a_id and user_b_id - which map the directed relations. A few things to consider:
data might have duplicates
the same user can map to multiple other users, and there is no guarantee of when it will show up again.
My current solution: I am using Sidekiq and have one worker read the file in batches and dispatch workers to create edges in the database.
Problems that I am having:
Since I am receiving a stream of data, I cannot pre-sort the file and assign a job that builds the relations for a single user.
Since jobs are performed asynchronously, if two workers are working on relations of the same node, I will get a write lock from Neo4j.
Let's say I get around the write lock; if two workers are working on duplicated records, I will build duplicated edges.
Possible solution: build a synchronous queue and have only one worker perform the writing (it seems neither Sidekiq nor Resque has this option). This could be pretty slow, since only one thread is working on the job.
Or I can write my own implementation: one worker builds multiple queues of jobs based on user_id (one unique id per queue) and stores them in Redis, then one worker per queue writes to the database, with a maximum number of queues so I don't run out of memory, deleting a queue once it exhausts all its jobs (and rebuilding it if I see the same user_id in the future). This doesn't sound trivial though, so I would prefer to use an existing library before diving into this.
My question is: is there an existing gem that I can use? What is a good practice for handling this?
You have a number of options ;)
If your data really is in a file and not a stream, I would definitely recommend checking out the neo4j-import command which comes with Neo4j. It allows you to import CSV data at speeds on the order of 1 million rows per second. Two caveats: you may need to modify your file format a bit, and you need to be generating a fresh database (it doesn't import new data into an existing database).
I would also get familiar with the LOAD CSV command. This takes a CSV in any format and lets you write some Cypher commands to transform and import the data. It's not as fast as neo4j-import, but it's pretty fast, and it can stream a CSV file from disk or a URL.
Since you're using Ruby, I would also suggest checking out neo4apis. This is a gem that I wrote to make it easier to batch import data so that you're not making a single request for every entry in your file. It allows you to define a class in a sort of DSL with importers. These importers can take any sort of Ruby object and, given that Ruby object, will define what should be imported using add_node and add_relationship methods. Under the covers this generates Cypher queries which are buffered and executed in batches so that you don't have lots of round trips to Neo4j.
I would investigate all of those things first before thinking about doing things asynchronously. If you really do have a never-ending set of data coming in, however, the MERGE clause should help you with any race conditions or locks. It allows you to create objects and relationships if they don't already exist. It's basically a find_or_create, but at the database level. If you use LOAD CSV you'll probably want MERGE as well, and neo4apis uses MERGE under the covers.
Hope that helps!

Spring Batch: what is the best way to use the data retrieved in one TaskletStep in the processing of another step

I have a job in which:
The first step is a TaskletStep which retrieves some records (approx. 150-200) from a database table into a list.
The second step retrieves data from some other table and requires the list of records retrieved in the previous step for processing.
I came across three ways to do this:
1) Putting the list retrieved in the first step into the StepExecutionContext and then promoting it to the JobExecutionContext to share data between steps.
2) Using Spring's caching support, i.e. @Cacheable.
3) Programmatically putting the list in the ApplicationContext.
What is the best way to achieve this (preferably explained with an example), keeping in mind two main concerns:
whether the volume of data retrieved in the first step might increase, and performance
Remember that objects in the step context are stored in the database, so you must be sure that the objects are serializable and that there really are only a few of them.
If you are sure of that, put the objects in your JobExecutionContext (as in solution 1) or use a bean holder (see Passing data to future step); this type of approach is valid ONLY if the data from the first step is SMALL.
Otherwise, you can process the data in step 2 without the retrieval in step 1, and instead manage a cache of the step 1 data while processing the data in step 2; this way you don't need step 1 and don't need to store step 1 data in the database, and the step 1 data lookup while processing millions of records in step 2 doesn't hurt processing time.
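As a sketch of the bean holder option (names are made up; this is only reasonable for the small list described here, roughly 150-200 records):

```java
import java.util.ArrayList;
import java.util.List;

import org.springframework.stereotype.Component;

// Singleton holder filled by the step 1 tasklet and read by the step 2 processor,
// so nothing has to go through the persisted execution context.
@Component
public class Step1DataHolder {

    private final List<Long> ids = new ArrayList<>();

    public void store(List<Long> retrievedIds) {
        ids.clear();
        ids.addAll(retrievedIds);
    }

    public List<Long> getIds() {
        return List.copyOf(ids);
    }
}
```

The step 1 tasklet calls store(...) after its query, and the step 2 processor injects the holder and calls getIds(). Keep in mind that this only works while both steps run in the same JVM, and the list is lost if the job is restarted directly from step 2.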
I hope I was clear; English is not my first language.

Redis multiple requests

I am writing a very simple social networking app that uses Redis.
Each user has a sorted set that contains ids of items in their feed. If I want to display their feed, I do the following steps:
use ZREVRANGE to get ids of items in their feed
use HMGET to get the feed (each feed item is a string)
But now, I also want to know if the user has liked a feed item or not. So I have a set associated with each feed item that contains the ids of users who have liked that feed item.
If I get 15 feed items, I now have to execute an additional 15 requests to Redis to find out, for each feed item, whether the current user has liked it or not (by checking if their id exists in the set for each feed item).
So that will take 15+1 requests.
Is this type of querying considered 'normal' when using Redis? Are there better ways I can structure the data to avoid this many requests?
I am using the redis-rb gem.
You can easily refactor your code to collapse the 15 requests into one by using pipelining (which redis-rb supports).
You get the ids from the sorted set with the first request, and then you use them to get the many keys you need based on those results (using the pipeline).
With this approach you should have 2 requests in total instead of 16 and keep your code quite simple.
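Here is a sketch of the idea, written in Java with the Jedis client for illustration (redis-rb exposes the same mechanism through its pipelined method); the key names feed:<user>, feed:items and likes:<item> are made up:

```java
import java.util.ArrayList;
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;

public class FeedLoader {

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String userId = "42";

            // request 1: ids of the 15 most recent feed items
            List<String> itemIds = new ArrayList<>(jedis.zrevrange("feed:" + userId, 0, 14));

            // request 2: one pipeline carrying the item bodies and the "liked?" checks
            Pipeline pipeline = jedis.pipelined();
            Response<List<String>> items =
                    pipeline.hmget("feed:items", itemIds.toArray(new String[0]));
            List<Response<Boolean>> liked = new ArrayList<>();
            for (String itemId : itemIds) {
                liked.add(pipeline.sismember("likes:" + itemId, userId));
            }
            pipeline.sync(); // single round trip for everything queued above

            for (int i = 0; i < itemIds.size(); i++) {
                System.out.println(items.get().get(i) + " liked=" + liked.get(i).get());
            }
        }
    }
}
```

The commands are buffered locally and sent together when sync() is called, so the like checks no longer cost one round trip each.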
As an alternative, you can use a Lua script and fetch everything in one request.
With this kind of database (a non-relational database), you have to make a trade-off between making multiple requests and including some data redundancy.
You should analyze each case separately and consider some aspects, like:
How frequently will this data be accessed?
How much space will this redundancy consume?
How many requests will I have to make in order to have all the data, without redundancy?
Is performance an issue?
In your case, I would suggest keeping a Set/Hash or just JSON-encoded data for each user with a history of all recent user interactions, such as comments, likes, etc. Every time the user accesses the feed, you just have to read the feed and the history; only two requests.
One thing to keep in mind: on every user interaction, you must update all the redundant data as well.
