Insert unique timestamp to sql server using spring batch chunk step - spring-boot

I have a table that stores the current timestamp as a unique identifier. The format is "yyyy-MM-dd HH:mm:ss.SSSSSS". Unfortunately, with chunk-oriented steps I'm seeing duplicate insert errors because multiple items get the same timestamp. Please advise how I can handle this use case. I'm not in a position to change the table definition to use any other unique identifier.
I've tried setting the timestamp in the ItemProcessor, since it was originally failing when set in the ItemWriter, but the results were the same.
I also tested introducing a sleep of 1 ns (Thread.sleep(0, 1)), but that made the Spring Batch job terribly slow: a sample job with 200K records that took under 2 minutes was taking about 10-11 minutes with the sleep in place.
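For reference, the kind of processor I'm now considering (ReportItem and setCreatedTs(...) are placeholders for my actual item type) keeps the last issued value and bumps it by one microsecond instead of sleeping:

import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.concurrent.atomic.AtomicReference;

import org.springframework.batch.item.ItemProcessor;

// Hands out strictly increasing timestamps so that no two items share the same microsecond.
public class UniqueTimestampProcessor implements ItemProcessor<ReportItem, ReportItem> {

    // Last timestamp issued; safe to share even if the step is multi-threaded.
    private final AtomicReference<LocalDateTime> lastIssued =
            new AtomicReference<>(LocalDateTime.MIN);

    @Override
    public ReportItem process(ReportItem item) {
        LocalDateTime ts = lastIssued.updateAndGet(prev -> {
            LocalDateTime now = LocalDateTime.now();
            // If the clock has not moved past the previous value, bump by one
            // microsecond instead of sleeping.
            return now.isAfter(prev) ? now : prev.plus(1, ChronoUnit.MICROS);
        });
        item.setCreatedTs(ts); // placeholder setter on my item type
        return item;
    }
}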

Related

Spring batch to read CSV and update data in bulk to MySQL

I have the below requirement for a Spring Batch job and would like to know the best approach to achieve it.
Input: A relatively large file with report data (for today)
Processing:
1. Update Daily table and monthly table based on the report data for today
Daily table: just update the counts based on ID
Monthly table: add today's count to the existing value
My concerns are:
1. Since the data is huge, I may end up having multiple DB transactions. How can I do this operation in bulk?
2. To add to the existing counts in the monthly table, I must have the existing counts with me. I may have to maintain a map beforehand. But is this a good way to process it?
Please suggest the approach I should follow, or an example if there is one.
Thanks.
You can design a chunk-oriented step to first insert the daily data from the file into the table. When this step is finished, you can use a step execution listener: in its afterStep method you will have a handle to the step execution, from which you can get the write count with StepExecution#getWriteCount. You can then write this count to the monthly table.
Since the data is huge, I may end up having multiple DB transactions. How can I do this operation in bulk?
With a chunk-oriented step, data is already written in bulk (one transaction per chunk). This model works very well even if your input file is huge.
To add to the existing counts in the monthly table, I must have the existing counts with me. I may have to maintain a map beforehand. But is this a good way to process it?
No need to store the info in a map, you can get the write count from the step execution after the step as explained above.
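For illustration, a rough sketch of such a listener (the monthly_report table, its columns, and the JdbcTemplate wiring are placeholders, not something from your question):

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.jdbc.core.JdbcTemplate;

// After the daily-load step finishes, add its write count to the monthly table.
public class MonthlyCountListener implements StepExecutionListener {

    private final JdbcTemplate jdbcTemplate;

    public MonthlyCountListener(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // nothing to do before the step
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        long writeCount = stepExecution.getWriteCount();
        jdbcTemplate.update(
                "UPDATE monthly_report SET total = total + ? WHERE month = ?",
                writeCount, java.time.YearMonth.now().toString());
        return stepExecution.getExitStatus();
    }
}

You would register this listener on the daily-load step with .listener(...) in the step builder.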
Hope this helps.

Spring Batch Metadata Issue

When I try to disable Spring Batch metadata creation with the option spring.batch.initialize-schema=never and then launch the batch, nothing happens and the batch terminates immediately without running the related jobs.
On the other hand, when I enable metadata creation, the batch runs, but I get the classic SERIALIZED_CONTEXT field size error. I can't always save 4GB of data in the table when I execute the batch.
How can I definitively disable metadata creation and still have my batch work?
Edit: I think I found a kind of solution to avoid this issue, and I would like to have your point of view. I am now working with metadata generation enabled. The issue occurs when you have a large set of data stored in the ExecutionContext you pass between tasklets (we all know this is the reason). In my case it is an ArrayList of elements (POJOs), retrieved from a CSV file with OpenCSV. To overcome this issue I have:
- reduced the number of columns and lines in the ArrayList (because Spring Batch serializes this ArrayList into the SERIALIZED_CONTEXT field; the more columns and lines you have, the more likely you are to hit this issue)
- changed the type of the SERIALIZED_CONTEXT column from TEXT to LONGTEXT
- deleted the toString() method defined in the POJO (not sure it really helps)
But I am still wondering: if you have no choice and you have to load all your columns, what is the best way to prevent this issue?
So this is not an issue with metadata generation but with passing a large amount of data between two steps.
what if you have no choice and you have to load all your columns, what is the best way to prevent this issue?
You can still load all columns, but you have to reduce the chunk size. The whole point of chunk processing in Spring Batch is to not load all the data in memory. What you can do in your case is carefully choose a chunk size that fits your requirement. There is no recipe for choosing the correct chunk size (since it depends on the number of columns, the size of each column, etc.), so you need to proceed empirically.
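As an illustration, a minimal chunk-oriented step where the chunk size is the value to tune (this assumes the Spring Batch 5 StepBuilder style; CsvRow and the reader/writer beans are placeholders):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class CsvLoadStepConfig {

    // Placeholder for one mapped line of the CSV file.
    public record CsvRow(String id, long count) { }

    // The chunk size is the knob to tune empirically: each chunk is one
    // transaction and only this many items are held in memory at a time.
    private static final int CHUNK_SIZE = 500;

    @Bean
    public Step csvLoadStep(JobRepository jobRepository,
                            PlatformTransactionManager transactionManager,
                            ItemReader<CsvRow> reader,   // e.g. a FlatFileItemReader
                            ItemWriter<CsvRow> writer) { // e.g. a JdbcBatchItemWriter
        return new StepBuilder("csvLoadStep", jobRepository)
                .<CsvRow, CsvRow>chunk(CHUNK_SIZE, transactionManager)
                .reader(reader)
                .writer(writer)
                .build();
    }
}

With this model the file is streamed chunk by chunk instead of being loaded into an ArrayList and stored in the execution context.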

What will happen when inserting a row during a long running query

I am writing some data loading code that pulls data from a large, slow table in an oracle database. I have read-only access to the data, and do not have the ability to change indexes or affect the speed of the query in any way.
My select statement takes 5 minutes to execute and returns around 300,000 rows. The system is inserting large batches of new records constantly, and I need to make sure I get every last one, so I need to save a timestamp for the last time I downloaded the data.
My question is: If my select statement is running for 5 minutes, and new rows get inserted while the select is running, will I receive the new rows or not in the query result?
My gut tells me that the answer is 'no', especially since a large portion of those 5 minutes is just the time spent on the data transfer from the database to the local environment, but I can't find any direct documentation on the scenario.
"If my select statement is running for 5 minutes, and new rows get inserted while the select is running, will I receive the new rows or not in the query result?"
No. Oracle enforces strict isolation levels and does not permit dirty reads.
The default isolation level is Read Committed. This means the result set you get after five minutes will be identical to the one you would have got if Oracle could have delivered all the records to you in 0.0000001 seconds. Anything committed after your query started running will not be included in the results. That includes updates to existing records as well as inserts.
Oracle does this by tracking changes to the table in the UNDO tablespace. Provided it can reconstruct the original image of the data from that undo information, your query will run to completion; if for any reason the undo information is overwritten, your query will fail with the dreaded ORA-01555: snapshot too old. That's right: Oracle would rather hurl an exception than provide us with an inconsistent result set.
Note that this consistency applies at the statement level. If we run the same query twice within the same transaction we may see two different result sets. If that is a problem (I think not in your case) we need to switch from Read Committed to Serializable isolation.
The Concepts Manual covers Concurrency and Consistency in great depth. Find out more.
So, to answer your question: take the timestamp from the time you start the select. Specifically, take the max(created_ts) from the table before you kick off the query. This should protect you from the gap Alex mentions (if records are not committed the moment they are inserted, there is the potential to lose records if you base the select on a comparison with the system timestamp). Although doing this means you're issuing two queries in the same transaction, which means you do need Serializable isolation after all!
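Putting that together, a rough JDBC sketch of the two statements running in one SERIALIZABLE transaction (table and column names are placeholders):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import javax.sql.DataSource;

// Captures max(created_ts) and runs the extract in the same transaction,
// so both statements see the same snapshot of the table.
public class ExtractRunner {

    private final DataSource dataSource;

    public ExtractRunner(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public Timestamp extractSince(Timestamp lastRun) throws Exception {
        try (Connection con = dataSource.getConnection()) {
            con.setAutoCommit(false);
            con.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);

            Timestamp highWaterMark;
            try (PreparedStatement ps = con.prepareStatement(
                         "SELECT MAX(created_ts) FROM source_table");
                 ResultSet rs = ps.executeQuery()) {
                rs.next();
                highWaterMark = rs.getTimestamp(1);
            }

            try (PreparedStatement ps = con.prepareStatement(
                         "SELECT * FROM source_table WHERE created_ts > ?")) {
                ps.setTimestamp(1, lastRun);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // ... copy the row into the local environment ...
                    }
                }
            }

            con.commit();
            return highWaterMark; // persist this as the watermark for the next run
        }
    }
}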

How do I maintain a snapshot of data in Oracle and Couchbase

I have a Spring Batch job which reads a file daily and inserts the data into Oracle and Couchbase. There are other applications which read data from these data sources, and for that I just need the latest record data in the tables.
Lets take an example
Day 1: I received the file with the below records
123,student1,gradeA (id,name,grade)
124,student2,gradeA (id,name,grade)
Day 2: I received the file with the below records
123,student1,gradeB (id,name,grade)
So what I need to do is:
1. On Day 1 I should insert all the records from the file, as the table is initially empty
2. On Day 2 I need to invalidate the record for "124", as it is not in the file
3. On Day 2, update the record for "123" with the new grade
So on Day 2, if any read request comes for "124", I should throw an exception (data not found).
A couple of approaches I thought about:
Approach 1:
I can have a revision number column in the table. Every day a file is read, I get a unique revision number for that day and use it while inserting that day's records into the DB. But for this, I need to store the revision number somewhere else, and every time I need to read data I have to do an extra lookup to get the current revision number.
Approach 2:
Every time a record is updated, maintain a last_modified_date column, and after the batch run remove the records that were not modified (this might be costly).
The above approaches might work for Oracle, but for Couchbase I am thinking of having a TTL on every record to solve this.
Can someone suggest any other, better approaches for this?
I'd be cautious of the TTL in Couchbase. In the event that your batch schedule is delayed, you'd lose all your data.
Approach 2 seems the cleanest to me, and you can leverage a merge/upsert in both Oracle and Couchbase to insert/update accordingly. In Oracle, you'd want either a physical or temporary staging table to improve your MERGE performance. And for Couchbase, you'd use a bulk UPSERT.
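For the Oracle side, a rough sketch of the MERGE from a staging table (table and column names are placeholders; the staging table is assumed to hold today's file):

import org.springframework.jdbc.core.JdbcTemplate;

// Upserts today's rows from the staging table and clears out anything not seen today.
public class StudentMerger {

    private final JdbcTemplate jdbcTemplate;

    public StudentMerger(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void mergeFromStaging() {
        jdbcTemplate.execute(
            "MERGE INTO student t " +
            "USING student_staging s ON (t.id = s.id) " +
            "WHEN MATCHED THEN UPDATE SET t.grade = s.grade, t.last_modified = SYSTIMESTAMP " +
            "WHEN NOT MATCHED THEN INSERT (id, name, grade, last_modified) " +
            "VALUES (s.id, s.name, s.grade, SYSTIMESTAMP)");

        // Anything not touched by today's file is stale; remove (or invalidate) it.
        jdbcTemplate.update("DELETE FROM student WHERE last_modified < TRUNC(SYSDATE)");
    }
}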
You can use a flag to maintain record validity. Every day, all records are first marked as 'not valid'; then, during the batch process, each record from the file is added if missing or marked as 'valid' if it is already present.
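A minimal sketch of that flag flip around the batch run (table name, columns, and flag values are placeholders):

import org.springframework.jdbc.core.JdbcTemplate;

// Marks everything stale before the load; the writer then re-validates or inserts rows.
public class ValidityFlagDao {

    private final JdbcTemplate jdbcTemplate;

    public ValidityFlagDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Run once before the file is processed (e.g. in a tasklet before the chunk step).
    public void invalidateAll() {
        jdbcTemplate.update("UPDATE student SET valid = 'N'");
    }

    // Called per record from the writer: update and re-validate, or insert if missing.
    public void upsertAsValid(String id, String name, String grade) {
        int updated = jdbcTemplate.update(
                "UPDATE student SET name = ?, grade = ?, valid = 'Y' WHERE id = ?",
                name, grade, id);
        if (updated == 0) {
            jdbcTemplate.update(
                    "INSERT INTO student (id, name, grade, valid) VALUES (?, ?, ?, 'Y')",
                    id, name, grade);
        }
    }
}

Readers then treat valid = 'N' as "data not found".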

Oracle overwrites my Date?

I have a problem with Oracle dates. It seems that predefined dates in a Java application are different after being inserted into an Oracle DB.
Insert via JPA entity:
entity.setDateOfCreation(new Date(System.currentTimeMillis()));
// 1350565985000
After commit and retrieve:
entity.getDateOfCreation() // 1350565985047
Why is this different?
I assumed Oracle would just insert my specific Date object with these exact milliseconds into the database, but obviously it doesn't. Because of the minimal delay, it seems to "overwrite" the given date with its own date in milliseconds (and this despite the fact that I do NOT use @GeneratedValue).
Does the table you are working with have a trigger which populates that column? I would hope it does. I have experienced lots of problems in the past with time differences between the app server and the database. It is much better to have a single source of time, which ensures consistent timings across the estate.
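If you let the database be that single source of time, one option is to have the trigger (or a column default) populate the column and map it read-only in JPA. A rough sketch, assuming a Jakarta Persistence setup (use javax.persistence imports on older stacks) and placeholder entity/column names:

import java.sql.Timestamp;

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;

@Entity
public class Document {

    @Id
    private Long id;

    // JPA never sends a value for this column, so whatever the database
    // trigger/default writes is authoritative. Re-read or refresh the entity
    // to see the value the database assigned.
    @Column(name = "DATE_OF_CREATION", insertable = false, updatable = false)
    private Timestamp dateOfCreation;

    public Timestamp getDateOfCreation() {
        return dateOfCreation;
    }
}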
