How do I maintain a snapshot of data in Oracle and Couchbase?

I have a Spring Batch job which reads a file daily and inserts the data into Oracle and Couchbase. There are other applications which read data from these datasources, and for them I just need the latest record data in the tables.
Let's take an example.
Day 1: I received the file with the below records
123,student1,gradeA (id, name, grade)
124,student2,gradeA (id, name, grade)
Day 2: I received the file with the below records
123,student1,gradeB (id, name, grade)
So what I need to do is:
1. On Day 1, insert all the records from the file, as the table is initially empty.
2. On Day 2, invalidate the record for "124", as it is not in the file.
3. On Day 2, update the record for "123" with the new grade.
So on Day 2, if any read request comes for "124", I should throw an exception (data not found).
A couple of approaches I thought about:
Approach 1:
I can have a revision number column in the table; every day a file is read I get a unique revision number for that day and use it while inserting that day's records into the DB. But for this I need to store the revision number somewhere else, and every time I read data I have to do an extra lookup to get the current revision number.
Approach 2:
Every time a record is updated, maintain a last_modified_date column, and after the batch run remove whichever records were not modified (this might be costly).
The above approaches might work for Oracle, but for Couchbase I am thinking of setting a TTL on every record to solve this.
Can someone suggest any other, better approaches for this?

I'd be cautious of the TTL in Couchbase. In the event that your batch schedule is delayed, you'd lose all your data.
Approach 2 seems the cleanest to me, and you can leverage a merge/upsert in both Oracle and Couchbase to insert/update accordingly. In Oracle, you'd want either a physical or temporary staging table to improve your MERGE performance. And for Couchbase, you'd use a bulk UPSERT.
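For illustration, here is a minimal Java sketch of that idea, assuming a hypothetical STUDENT table plus a STUDENT_STG staging table that the batch truncates and reloads each day (Spring's JdbcTemplate for Oracle, Couchbase Java SDK 3.x for the bucket); the names and column layout are made up to match the example records above:

import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import org.springframework.jdbc.core.JdbcTemplate;
import java.util.List;

public class DailySnapshotWriter {

    private final JdbcTemplate jdbc;    // Oracle access
    private final Collection students;  // Couchbase collection

    public DailySnapshotWriter(JdbcTemplate jdbc, Collection students) {
        this.jdbc = jdbc;
        this.students = students;
    }

    // rows: each element is {id, name, grade} parsed from today's file
    public void applyDailyFile(List<Object[]> rows) {
        // 1. Stage today's file (assumes STUDENT_STG was truncated before the run).
        jdbc.batchUpdate("INSERT INTO student_stg (id, name, grade) VALUES (?, ?, ?)", rows);

        // 2. Merge staging into the main table: update changed rows, insert new ones.
        jdbc.update(
            "MERGE INTO student t USING student_stg s ON (t.id = s.id) " +
            "WHEN MATCHED THEN UPDATE SET t.name = s.name, t.grade = s.grade " +
            "WHEN NOT MATCHED THEN INSERT (id, name, grade) VALUES (s.id, s.name, s.grade)");

        // 3. Invalidate rows that are absent from today's file (e.g. id 124 on day 2).
        jdbc.update("DELETE FROM student WHERE id NOT IN (SELECT id FROM student_stg)");

        // 4. Upsert the same records into Couchbase, keyed by id.
        for (Object[] r : rows) {
            students.upsert(String.valueOf(r[0]),
                    JsonObject.create()
                            .put("name", String.valueOf(r[1]))
                            .put("grade", String.valueOf(r[2])));
        }
    }
}

Note that on the Couchbase side you would still need a separate step to delete or flag documents that dropped out of the file (for example, by tracking the day's ids and removing the rest).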

You can use a flag to track record validity.
Every day, mark all records as 'not valid' first; then, during the batch process, add a record if it is missing or mark it as 'valid' if it is already present.
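A rough sketch of this flag idea, again with hypothetical names (a VALID_FLAG column on the same STUDENT table), using Spring's JdbcTemplate:

import org.springframework.jdbc.core.JdbcTemplate;
import java.util.List;

public class ValidityFlagLoader {

    private final JdbcTemplate jdbc;

    public ValidityFlagLoader(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // rows: each element is {id, name, grade} parsed from today's file
    public void run(List<Object[]> rows) {
        // Step 1: assume every existing record is stale until today's file proves otherwise.
        jdbc.update("UPDATE student SET valid_flag = 'N'");

        // Step 2: upsert each record from the file and flip its flag back to 'Y'.
        jdbc.batchUpdate(
            "MERGE INTO student t " +
            "USING (SELECT ? AS id, ? AS name, ? AS grade FROM dual) s ON (t.id = s.id) " +
            "WHEN MATCHED THEN UPDATE SET t.name = s.name, t.grade = s.grade, t.valid_flag = 'Y' " +
            "WHEN NOT MATCHED THEN INSERT (id, name, grade, valid_flag) " +
            "VALUES (s.id, s.name, s.grade, 'Y')",
            rows);

        // Readers then query WHERE valid_flag = 'Y' and treat everything else as "data not found".
    }
}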

Related

Insert unique timestamp into SQL Server using a Spring Batch chunk step

I have a table that stores the current timestamp as a unique identifier. The format is “yyyy-MM-dd HH:mm:ss.SSSSSS”. Unfortunately, with chunk-oriented steps I’m seeing duplicate insert errors as multiple items get the same timestamp. Please advise how I can handle this use case. I’m not in a position to change the table definition to use any other unique identifier.
I’ve tried to set the timestamp in the itemProcessor, as it was failing originally when set in the itemWriter, but the results were the same.
I also tested introducing a sleep of 1 ns (Thread.sleep(0, 1)), but that made the Spring Batch job terribly slow. A sample job with 200K records that took under 2 minutes was taking about 10-11 minutes after adding Thread.sleep.

Data cleanup in Oracle DB is taking a long time for 300 billion records

Problem statement:
There is an address table in Oracle which has relationships with multiple tables like subscriber, member, etc.
Currently the design is such that when there is any change in the associated tables, the record version is incremented throughout all tables.
So a new record is added to the address table even if the same address is already present, resulting in a large number of duplicate copies.
We need to identify and remove the duplicate records, and update the foreign keys in the associated tables, while making sure it doesn't impact the running application.
Tried solution:
We have written a script for the cleanup logic, where a unique hash is generated for every address. If the calculated hash is already present, it means the address is a duplicate; we then merge it into a single address record and update the foreign keys in the associated tables.
But the problem is there are around 300 billion records in the address table, so this cleanup process is taking a lot of time, and it will take several days to complete.
We have tried adding an index on the hash column, but the process is still taking time.
Also, we have updated the insertion/query logic to use addresses as per the new structure (using the hash, and without the version), in order to take care of incoming requests in production.
We are planning to do the processing in chunks, but it will be a very long, ongoing activity.
Questions:
Would like to know if any further improvement can be made to the above approach.
Will distributed processing help here? (maybe using Hadoop, Spark/Hive/MR, etc.)
Is there some sort of tool that can be used here?
Suggestion 1
Use the built-in parallel delete:
delete /*+ parallel(t 8) */ mytable t where ...
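One caveat worth adding, as far as I recall: the DELETE itself only executes as parallel DML if the session enables it; otherwise the hint mainly parallelises the read side. A hedged JDBC sketch, with the connection details and the duplicate-detection predicate (dup_flag) left as placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ParallelDeleteExample {
    public static void main(String[] args) throws Exception {
        // args: jdbcUrl, user, password (placeholders)
        try (Connection con = DriverManager.getConnection(args[0], args[1], args[2]);
             Statement st = con.createStatement()) {
            con.setAutoCommit(false);
            // Allow the DELETE to run as parallel DML in this session.
            st.execute("ALTER SESSION ENABLE PARALLEL DML");
            int deleted = st.executeUpdate(
                "DELETE /*+ parallel(t 8) */ FROM address t WHERE t.dup_flag = 'Y'");
            con.commit();
            System.out.println("Deleted " + deleted + " rows");
        }
    }
}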
Suggestion 2
Use distributed processing (Hadoop, Spark/Hive) - watch out for potential contention on indexes or table blocks. It is recommended to have each process work on a logically isolated subset (see the sketch after these examples), e.g.
process 1 - delete mytable t where id between 1000 and 1999
process 2 - delete mytable t where id between 2000 and 2999
...
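Not actual Spark/Hadoop, but a plain-JDBC sketch of the "each worker owns an isolated id range" idea from Suggestion 2; the URL, credentials, ranges and WHERE clause are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RangeDeleteWorkers {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:oracle:thin:@//db-host:1521/SERVICE";   // placeholder
        ExecutorService pool = Executors.newFixedThreadPool(8);
        long chunk = 1_000_000L;
        for (long lo = 0; lo < 10_000_000L; lo += chunk) {
            final long from = lo, to = lo + chunk - 1;
            pool.submit(() -> {
                // Each worker deletes only its own id range, so workers are less likely
                // to contend on the same index or table blocks.
                try (Connection con = DriverManager.getConnection(url, "user", "password");
                     PreparedStatement ps = con.prepareStatement(
                             "DELETE FROM mytable t WHERE t.id BETWEEN ? AND ? AND t.dup_flag = 'Y'")) {
                    ps.setLong(1, from);
                    ps.setLong(2, to);
                    ps.executeUpdate();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}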
Suggestion 3
If more than ~30% of the table needs to be deleted, the fastest way would be to create an empty table, copy all required rows into it, drop the original table, rename the new one, and recreate all indexes and constraints. Of course it requires downtime, and it greatly depends on the number of indexes - the more you have, the longer it will take.
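A hedged sketch of Suggestion 3 issued over plain JDBC; the keep_flag predicate, table names and the follow-up index/constraint DDL are placeholders you would replace with your own:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RebuildWithoutDuplicates {
    public static void main(String[] args) throws Exception {
        // args: jdbcUrl, user, password (placeholders)
        try (Connection con = DriverManager.getConnection(args[0], args[1], args[2]);
             Statement st = con.createStatement()) {
            // Copy only the rows to keep; NOLOGGING + PARALLEL speed up the bulk copy.
            st.execute("CREATE TABLE address_new NOLOGGING PARALLEL 8 AS " +
                       "SELECT * FROM address WHERE keep_flag = 'Y'");
            st.execute("DROP TABLE address");
            st.execute("ALTER TABLE address_new RENAME TO address");
            // Recreate indexes, constraints and grants here, then gather statistics.
        }
    }
}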
P.S. There are no "magic" tools for this. In the end they all run the same SQL commands that you can run yourself.
It's also possible to use Oracle's MERGE statement to insert the data if you use clean SQL.

Spring batch to read CSV and update data in bulk to MySQL

I have the below requirement for a Spring Batch job, and I would like to know the best approach to achieve it.
Input: A relatively large file with report data (for today)
Processing:
1. Update the daily table and the monthly table based on today's report data
Daily table: just update the counts based on ID
Monthly table: add today's count to the existing value
My concerns are:
1. Since the data is huge, I may end up having multiple DB transactions. How can I do this operation in bulk?
2. To add to the existing counts in the monthly table, I must have the existing counts with me. I may have to maintain a map beforehand. But is it a good way to process it this way?
Please suggest the approach I should follow, or any example if there is one.
Thanks.
You can design a chunk-oriented step to first insert the daily data from the file to the table. When this step is finished, you can use a step execution listener and in the afterStep method, you will have a handle to the step execution where you can get the write count with StepExecution#getWriteCount. You can write this count to the monthly table.
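A minimal sketch of such a listener, assuming a hypothetical monthly_report table keyed by a month column (the SQL and names are made up for illustration):

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.jdbc.core.JdbcTemplate;

public class MonthlyCountListener implements StepExecutionListener {

    private final JdbcTemplate jdbc;

    public MonthlyCountListener(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // nothing to do before the step
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        // Number of items the daily step actually wrote.
        long written = stepExecution.getWriteCount();
        // Add today's count to the monthly rollup.
        jdbc.update("UPDATE monthly_report SET total = total + ? " +
                    "WHERE month = DATE_FORMAT(CURDATE(), '%Y-%m')", written);
        return stepExecution.getExitStatus();
    }
}

In the step configuration you would then register it with something like .listener(new MonthlyCountListener(jdbc)).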
Since the data is huge, I may end up having multiple DB transactions. How can I do this operation in bulk?
With a chunk oriented step, data is already written in bulk (one transaction per chunk). This model works very well even if your input file is huge.
To add to the existing counts in the monthly table, I must have the existing counts with me. I may have to maintain a map beforehand. But is it a good way to process it this way?
No need to store the info in a map, you can get the write count from the step execution after the step as explained above.
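And for the bulk writes themselves, a hedged sketch of a JdbcBatchItemWriter for the daily counts (the DailyCount class, table and columns are hypothetical); within each chunk the statements are sent as one JDBC batch:

import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;

public class DailyCountWriterConfig {

    // Hypothetical item type produced by the processor.
    public static class DailyCount {
        private long id;
        private long count;
        public long getId() { return id; }
        public long getCount() { return count; }
        public void setId(long id) { this.id = id; }
        public void setCount(long count) { this.count = count; }
    }

    public JdbcBatchItemWriter<DailyCount> dailyCountWriter(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<DailyCount>()
                .dataSource(dataSource)
                .sql("UPDATE daily_report SET count = :count WHERE id = :id")
                .beanMapped()   // binds :count and :id to the DailyCount getters
                .build();
    }
}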
Hope this helps.

Stop HBase update operation if it has the same value

I have a table in HBase named 'xyz'. When I do an update operation on this table, it updates the table even though it is the same record.
How can I prevent the second record from being added?
Eg:
create 'ns:xyz',{NAME=>'cf1',VERSIONS => 5}
put 'ns:xyz','1','cf1:name','NewYork'
put 'ns:xyz','1','cf1:name','NewYork'
The above put statements give 2 records with different timestamps if I check all versions. I am expecting that it should not add the 2nd record because it has the same value.
HBase isn't going to look through the entire row and work out if it's the same as the data you're adding. That would be an expensive operation, and HBase prides itself on its fast insert speeds.
If you're really eager to do this (and I'd ask if you really want to do this), you should perform a GET first to see if the data is already present in the table.
You could also write a Coprocessor to do this every time you PUT data, but again the performance would be undesirable.
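For what it's worth, a sketch of the GET-before-PUT check with the standard HBase Java client, reusing the table and column names from the shell example above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutIfChanged {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("ns:xyz"))) {
            byte[] row = Bytes.toBytes("1");
            byte[] cf = Bytes.toBytes("cf1");
            byte[] qual = Bytes.toBytes("name");
            byte[] newValue = Bytes.toBytes("NewYork");

            // Read the current value first.
            Result current = table.get(new Get(row).addColumn(cf, qual));
            byte[] stored = current.getValue(cf, qual);

            // Only write when the value actually changed, so no extra version piles up.
            if (stored == null || !Bytes.equals(stored, newValue)) {
                Put put = new Put(row);
                put.addColumn(cf, qual, newValue);
                table.put(put);
            }
        }
    }
}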
As mentioned by @Ben Watson, HBase is best known for its write performance, since it doesn't need to check for the existence of a value; multiple versions are maintained by default.
One hack you can do is to use custom versioning. Say a row key already has two versions; if you insert the same record with the same timestamp, HBase will simply overwrite that version with the new value.
NOTE: It is left to your application to generate the same timestamp for a particular value.
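A minimal sketch of that explicit-timestamp trick with the Java client, assuming a Table handle obtained the same way as in the previous sketch; the fixed timestamp value is something your application has to derive consistently:

// HBase stores one version per (row, column, timestamp), so re-putting the same
// application-chosen timestamp overwrites that version instead of adding another.
byte[] row = Bytes.toBytes("1");
long appTimestamp = 1L;   // hypothetical fixed timestamp chosen by the application
Put put = new Put(row);
put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), appTimestamp, Bytes.toBytes("NewYork"));
table.put(put);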

Importing data incrementally from an RDBMS to Hive/Hadoop using Sqoop

I have an Oracle database and need to import data into a Hive table. The daily import data size would be around 1 GB. What would be the better approach?
If I import each day's data as a partition, how can the updated values be handled?
For example, if I imported today's data as a partition and on the next day some fields are updated with new values:
Using --lastmodified we can get the updated values, but where do we need to send the updated values - to a new partition or to the old (already existing) partition?
If I send them to a new partition, then the data is duplicated.
If I want to send them to the already existing partition, how can that be achieved?
Your only option is to overwrite the entire existing partition with 'INSERT OVERWRITE TABLE...'.
The question is: how far back are you going to be constantly updating the data?
I can think of 3 approaches you can consider:
1. Decide on a threshold for 'fresh' data, for example '14 days back' or '1 month back'. Then each day you run the job, you overwrite partitions (only the ones which have updated values) going backwards, up to the decided threshold (see the sketch after this list). With ~1 GB a day it should be feasible. All the data from before your decided window is not guaranteed to be 100% correct. This scenario could be relevant if you know the fields can only be changed within a certain time window after they were initially set.
2. Make your Hive table compatible with ACID transactions, thus allowing updates on the table.
3. Split your daily job into 2 tasks: the new data being written for the run day, and the updated data that you need to go back and fix. Sqoop will be responsible for the new data; take care of the updated data 'manually' (some script that generates the update statements).
Finally, don't use partitions based on time; maybe dynamic partitioning is more suitable for your use case. It depends on the nature of the data being handled.
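As a rough illustration of approach 1, here is a sketch of overwriting a single day's partition through the Hive JDBC driver; the endpoint, credentials, table, partition column and staging table are all placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OverwriteUpdatedPartition {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 endpoint and credentials.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "user", "");
             Statement st = con.createStatement()) {
            // Rebuild one day's partition from a staging table loaded by the Sqoop --lastmodified import.
            st.execute(
                "INSERT OVERWRITE TABLE orders PARTITION (load_date='2019-05-01') " +
                "SELECT id, amount, updated_at FROM orders_staging " +
                "WHERE load_date = '2019-05-01'");
        }
    }
}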
