Deduplication in Oracle

Situation:
Table 'A' receives data from an Oracle GoldenGate feed as New, Updated, or Duplicate records (N/U/D); depending on that flag, a row either creates a new record or overwrites the old one. Every entry in the table has an UpdatedTimeStamp column containing its insertion timestamp.
Scope:
Write a stored procedure in Oracle that pulls the data for a time period based on the UpdatedTimeStamp column and publishes XML using DBMS_XMLGEN.
How can I ensure that a duplicate entered in the table is not processed again?
FYI - I am currently filtering via a new table that I created, named 'A-stg', into which the old data is inserted incrementally.

As far as I understand the question, there are a few ways to avoid duplicates.
The most obvious is to use DISTINCT, e.g.
select distinct data_column from your_table
Another is to use the timestamp column and get only the last (or the first?) value, e.g.
select data_column, max(timestamp_column)
from your_table
group by data_column
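If several versions of the same logical record can exist (the N/U/D feed), an analytic function can keep just the latest version per key inside the extraction window. A minimal sketch, assuming a hypothetical key column record_key alongside the UpdatedTimeStamp column from the question; :period_start and :period_end stand for the procedure's time-window parameters:
select *
from (select t.*,
             row_number() over (partition by record_key
                                order by updatedtimestamp desc) as rn
      from your_table t
      where updatedtimestamp between :period_start and :period_end)
where rn = 1;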

Related

Oracle 12c - refreshing the data in my tables based on the data from warehouse tables

I need to update some tables in my application from warehouse tables which are updated weekly or biweekly, and my tables should be updated based on those. These tables have foreign keys referencing them from other tables, so I cannot simply truncate a table and reinsert the whole data set every time. I have to take the delta and apply it based on a few primary key columns which don't change. I need some input on how to implement this approach.
My approach:
Check the last updated time of those tables, views.
If it is most recent then compare each row based on the primary key in my table and warehouse table.
update each column if it is different.
Do nothing if there is no change in columns.
insert if there is a new record.
My Question:
How do I implement this? Is writing PL/SQL code a good and efficient way, given that the expected number of records is around 800K?
Please provide any sample code or links.
I would go for PL/SQL with the BULK COLLECT / FORALL method. You can use MINUS in your cursor in order to reduce the data size and calculate the difference.
You can check this site for more information about bulk collect, forall and engines: http://www.oracle.com/technetwork/issue-archive/2012/12-sep/o52plsql-1709862.html
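A minimal sketch of that approach, assuming hypothetical tables my_table and warehouse_table that share a key column pk_id and a single payload column col1 (adapt the column lists to your schema); the MINUS in the driving query keeps only rows that differ:
DECLARE
  -- illustrative key and payload columns; adapt to your own schema
  TYPE t_ids  IS TABLE OF my_table.pk_id%TYPE;
  TYPE t_vals IS TABLE OF my_table.col1%TYPE;
  l_ids      t_ids;
  l_vals     t_vals;
  l_new_ids  t_ids  := t_ids();
  l_new_vals t_vals := t_vals();
BEGIN
  -- MINUS keeps only rows that are new or changed in the warehouse copy
  SELECT pk_id, col1
    BULK COLLECT INTO l_ids, l_vals
    FROM (SELECT pk_id, col1 FROM warehouse_table
          MINUS
          SELECT pk_id, col1 FROM my_table);

  -- try to update every delta row in one bulk-bound statement
  FORALL i IN 1 .. l_ids.COUNT
    UPDATE my_table SET col1 = l_vals(i) WHERE pk_id = l_ids(i);

  -- rows that updated nothing are new: collect and bulk-insert them
  FOR i IN 1 .. l_ids.COUNT LOOP
    IF SQL%BULK_ROWCOUNT(i) = 0 THEN
      l_new_ids.EXTEND;
      l_new_vals.EXTEND;
      l_new_ids(l_new_ids.COUNT)   := l_ids(i);
      l_new_vals(l_new_vals.COUNT) := l_vals(i);
    END IF;
  END LOOP;

  FORALL i IN 1 .. l_new_ids.COUNT
    INSERT INTO my_table (pk_id, col1) VALUES (l_new_ids(i), l_new_vals(i));

  COMMIT;
END;
/
For very large deltas you would fetch with an explicit cursor and BULK COLLECT ... LIMIT in batches rather than one big fetch.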
There are many parts to your question above and I will answer as best I can:
While it is possible to disable the referencing foreign keys, truncate the table, repopulate it with the updated data and then re-enable the foreign keys, given the requirements described above I don't believe truncating the table each time is optimal.
Yes, in principle PL/SQL is a good way to achieve what you want, as this is too complex to deal with in plain SQL and PL/SQL is an efficient alternative.
Conceptually, the approach I would take is something like as follows:
Initial set up:
create a sequence called activity_seq
Add an "activity_id" column of type number to your source tables with a unique constraint
Add a trigger to the source table/s setting activity_id = activity_seq.nextval for each insert / update of a table row
create some kind of master table to hold the "last processed activity id" value
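A hedged DDL sketch of that initial set up; src_table and load_control are illustrative names only:
CREATE SEQUENCE activity_seq;

ALTER TABLE src_table ADD (activity_id NUMBER);
ALTER TABLE src_table ADD CONSTRAINT src_table_activity_uq UNIQUE (activity_id);

-- stamp every inserted or updated row with the next activity id
CREATE OR REPLACE TRIGGER src_table_activity_trg
  BEFORE INSERT OR UPDATE ON src_table
  FOR EACH ROW
BEGIN
  :NEW.activity_id := activity_seq.NEXTVAL;  -- 11g+; use SELECT ... INTO on older releases
END;
/

-- holds the "last processed activity id" per source table
CREATE TABLE load_control (
  table_name                  VARCHAR2(30) PRIMARY KEY,
  last_processed_activity_id  NUMBER DEFAULT 0 NOT NULL
);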
Then bi/weekly:
retrieve the value of "last processed activity id" from the master table
select all rows in the source table/s having activity_id value > "last processed activity id" value
iterate through the selected source rows and update the target if a match is found based on whatever your match criterion is, or if no match is found then insert a new row into the target (I assume there is no delete as you do not mention it)
on completion, update the master table "last processed activity id" to the greatest value of activity_id for the source rows processed in step 3 above.
(please note that, depending on your environment and the number of rows processed, the above process may need to be split and repeated over a number of transactions)
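A minimal sketch of one such bi-weekly run, again with illustrative names (src_table, tgt_table, load_control) and a single payload column, using one MERGE in place of an explicit row-by-row loop:
DECLARE
  l_last_id NUMBER;
  l_max_id  NUMBER;
BEGIN
  -- step 1: read (and lock) the last processed activity id
  SELECT last_processed_activity_id INTO l_last_id
    FROM load_control WHERE table_name = 'SRC_TABLE' FOR UPDATE;

  -- fix the high-water mark for this run so concurrent inserts are not skipped
  SELECT NVL(MAX(activity_id), l_last_id) INTO l_max_id FROM src_table;

  -- steps 2 and 3: apply only rows changed since the last run
  MERGE INTO tgt_table t
  USING (SELECT pk_id, col1
           FROM src_table
          WHERE activity_id > l_last_id AND activity_id <= l_max_id) s
  ON (t.pk_id = s.pk_id)
  WHEN MATCHED THEN UPDATE SET t.col1 = s.col1
  WHEN NOT MATCHED THEN INSERT (pk_id, col1) VALUES (s.pk_id, s.col1);

  -- step 4: remember how far this run got
  UPDATE load_control
     SET last_processed_activity_id = l_max_id
   WHERE table_name = 'SRC_TABLE';

  COMMIT;
END;
/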
I hope this proves helpful

Creating a record history table - How do I create a record on creation?

For a project, I want to have a "History" table for my records. I have two tables for this (example) system:
RECORDS
ID
NAME
CREATE_DATE
RECORDS_HISTORY
ID
RECORDS_ID
LOG_DATE
LOG_TYPE
MESSAGE
When I insert a record into RECORDS, how can I automatically create an associated entry in RECORDS_HISTORY where RECORDS_ID is equal to the newly inserted ID in RECORDS?
I currently have a sequence on the ID in RECORDS to automatically increment when a new row is inserted, but I am unsure how to prepopulate a record in RECORDS_HISTORY that will look like this for each newly created (not updated) record.
INSERT INTO RECORDS_HISTORY (RECORDS_ID, LOG_DATE, LOG_TYPE, MESSAGE) VALUES (<records.id>, SYSDATE, 'CREATED', 'Record created')
How can I create this associated _HISTORY record on creation?
You didn't mention the DB you are working with; I assume it's Oracle. The most obvious answer is: use an "on insert" trigger. You can even get back the ID (sequence value) from the insert statement into table RECORDS. Disadvantages of this solution: triggers are kinda "hidden" code, they can slow down massive inserts, and you consume roughly double the disk space by storing data partially redundantly. What if RECORDS gets updated or deleted? Can that happen, and do you have to take care of that as well? The big question is: what is your goal?
There are proven historisation concepts around. Have a look at this: https://en.wikipedia.org/wiki/Slowly_changing_dimension
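For completeness, a minimal sketch of the "on insert" trigger approach mentioned above, assuming RECORDS_HISTORY.ID is fed by its own sequence (RECORDS_HISTORY_SEQ here is purely illustrative):
-- fires after every insert into RECORDS and writes the matching history row
CREATE OR REPLACE TRIGGER records_ai_trg
  AFTER INSERT ON records
  FOR EACH ROW
BEGIN
  INSERT INTO records_history (id, records_id, log_date, log_type, message)
  VALUES (records_history_seq.NEXTVAL, :NEW.id, SYSDATE, 'CREATED', 'Record created');
END;
/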

Query a table in different ways or orderings in Cassandra

I've recently started to play around with Cassandra. My understanding is that in a Cassandra table you define 2 keys, which can be either single column or composites:
The Partitioning Key: determines how to distribute data across nodes
The Clustering Key: determines in which order the records of a same partitioning key (i.e. within a same node) are written. This is also the order in which the records will be read.
Data from a table will always be sorted in the same order, which is the order of the clustering key column(s). So a table must be designed for a specific query.
But what if I need to perform 2 different queries on the data from a table? What is the best way to solve this when using Cassandra?
Example Scenario
Let's say I have a simple table containing posts that users have written:
CREATE TABLE posts (
username varchar,
creation timestamp,
content varchar,
PRIMARY KEY ((username), creation)
);
This table was "designed" to perform the following query, which works very well for me:
SELECT * FROM posts WHERE username='luke' [ORDER BY creation DESC];
Queries
But what if I need to get all posts regardless of the username, in order of time:
Query (1): SELECT * FROM posts ORDER BY creation;
Or get the posts in alphabetical order of the content:
Query (2): SELECT * FROM posts WHERE username='luke' ORDER BY content;
I know that it's not possible given the table I created, but what are the alternatives and best practices to solve this?
Solution Ideas
Here are a few ideas spawned from my imagination (just to show that at least I tried):
Querying with the IN clause to select posts from many users. This could help in Query (1). When using the IN clause, you can fetch globally sorted results if you disable paging. But using the IN clause quickly leads to bad performance when the number of usernames grows.
Maintaining full copies of the table for each query, each copy using its own PRIMARY KEY adapted to the query it is trying to serve.
Having a main table with a UUID as partitioning key. Then creating smaller copies of the table for each query, which only contain the (key) columns useful for their own sort order, and the UUID for each row of the main table. The smaller tables would serve only as "sorting indexes" to query a list of UUID as result, which can then be fetched using the main table.
I'm new to NoSQL; I would just like to know what the correct/durable/efficient way of doing this is.
The query SELECT * FROM posts ORDER BY creation; will result in a full cluster scan because you do not provide any partition key, and the ORDER BY clause in this query won't work anyway.
Your requirement "I need to get all posts regardless of the username, in order of time" is very hard to achieve in a distributed system; it requires Cassandra to:
fetch all user posts and move them to a single node (coordinator)
order them by date
take top N latest posts
Point 1 requires a full table scan. Indeed, as long as you don't fetch all records, the ordering cannot be achieved, unless you use a Cassandra clustering column to order at insertion time. But in that case it means that all posts are stored in the same partition, and this partition will grow forever ...
Query SELECT * FROM posts WHERE username='luke' ORDER BY content; is possible using a denormalized table or with the new materialized view feature (http://www.doanduyhai.com/blog/?p=1930)
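For example, a materialized view for that query could look like the following sketch (Cassandra 3.0+; the view name is illustrative):
CREATE MATERIALIZED VIEW posts_sorted_by_content AS
  SELECT username, content, creation
  FROM posts
  WHERE username IS NOT NULL AND content IS NOT NULL AND creation IS NOT NULL
  PRIMARY KEY ((username), content, creation);

-- then: SELECT * FROM posts_sorted_by_content WHERE username = 'luke';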
Question 1:
I bet you could model this with time buckets, depending on the range of times you're interested in.
You can do this by making the primary key a year, year-month, or year-month-day (or finer time intervals), depending on your use case.
The basic idea is that you bucket changes in whatever way suits your use case. For example:
If you often need to search these posts over months in the past, then you may want to use the year as the PK.
If you usually need to search the posts over several days in the past, then you may want to use a year-month as the PK.
If you usually need to search the post for yesterday or a couple of days, then you may want to use a year-month-day as your PK.
I'll give a fleshed out example with yyyy-mm-dd as the PK:
The table will now be:
CREATE TABLE posts_by_creation (
creation_year int,
creation_month int,
creation_day int,
creation timeuuid,
username text, -- using text instead of varchar, they're essentially the same
content text,
PRIMARY KEY ((creation_year,creation_month,creation_day), creation)
);
I changed creation to be a timeuuid to guarantee a unique row for each post creation event. If we used just a timestamp you could theoretically overwrite an existing post creation record in here.
Now we can insert the partition key (PK) columns creation_year, creation_month, creation_day based on the current creation time:
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update1');
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update2');
now() is a CQL function to generate a timeUUID; you would probably want to generate this in the application instead, parse out the yyyy-mm-dd for the PK, and then insert the timeUUID into the clustering column.
As a usage example for this table, let's say you wanted to see all of the changes made today; your CQL would look like:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2;
Or if you wanted to find all of the changes today after 5pm central:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2 AND creation >= minTimeuuid('2016-04-02 17:00-0600');
minTimeuuid() is another CQL function; it creates the smallest possible timeUUID for the given time, which guarantees that you get all of the changes from that time.
Depending on the time spans you may need to query a few different partition keys, but it shouldn't be that hard to implement. Also you would want to change your creation column to a timeuuid for your other table.
Question 2:
You'll have to create another table or use materialized views to support this new query pattern, just like you thought.
Lastly, if you're not on Cassandra 3.x+ or don't want to use materialized views, you can use atomic batches to ensure data consistency across your several denormalized tables (that's what they were designed for). So in your case it would be a BATCH statement with 3 inserts of the same data into 3 different tables that support your query patterns.
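A sketch of such a batch, assuming the three tables are the original posts, the posts_by_creation table above, and a hypothetical posts_by_content table, all with creation changed to a timeuuid as suggested earlier; in practice you would generate one timeuuid in the application and reuse it in all three statements instead of calling now() per statement:
BEGIN BATCH
  INSERT INTO posts (username, creation, content)
    VALUES ('luke', now(), 'hello world');
  INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content)
    VALUES (2016, 4, 2, now(), 'luke', 'hello world');
  INSERT INTO posts_by_content (username, content, creation)
    VALUES ('luke', 'hello world', now());
APPLY BATCH;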
The solution is to create other tables to support your queries.
For SELECT * FROM posts ORDER BY creation;, you may need a special column for grouping, perhaps by month and year, e.g. PRIMARY KEY ((year, month), timestamp). This way Cassandra will have better read performance because it doesn't need to scan the whole cluster to get all the data, and it also saves data transfer between nodes.
Likewise for SELECT * FROM posts WHERE username='luke' ORDER BY content;: you must create another table for this query too. All columns may be the same as in your first table but with a different primary key, because you cannot order by a column that is not a clustering column.
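For instance, such a table could look like this (a sketch; creation stays in the primary key only so that two posts with identical content do not overwrite each other):
CREATE TABLE posts_by_content (
  username varchar,
  content varchar,
  creation timestamp,
  PRIMARY KEY ((username), content, creation)
);

-- rows come back ordered by content within a user's partition:
-- SELECT * FROM posts_by_content WHERE username = 'luke';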

Why doesn't Cassandra UPDATE violate the no read before writes rule

I am confused by two seemingly contradictory statements about Cassandra
No reads before writes (presumably this is because writes are sequential whereas reads require scanning a primary key index)
INSERT and UPDATE have identical semantics (stated in an older version of the CQL manual but presumably still considered essentially true)
Suppose I've created the following simple table:
CREATE TABLE data (
id varchar PRIMARY KEY,
names set<text>
);
Now I insert some values:
insert into data (id, names) values ('123', {'joe', 'john'});
Now if I do an update:
update data set names = names + {'mary'} where id = '123';
The results are as expected:
id | names
-----+-------------------------
123 | {'joe', 'john', 'mary'}
Isn't this a case where a read has to occur before a write? The "cost" would seem to be the following:
The cost of reading the column
The cost of creating a union of the two sets (negligible here but could be noticeable with larger sets)
The cost of writing the data with the key and new column data
An insert would merely be the last of these.
There is no need for a read before writing. Internally, each collection stores data using one column per entry -- when you add a new entry to a collection, the operation is done on that single column*: if the column already exists it is overwritten, otherwise it is created (insert-or-update).
This is the reason why each entry in a collection can have custom ttl and writetime.
*While with Map and Set this is transparent, there is some internal trick to allow multiple columns with the same name inside a List.
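You can see the per-entry behaviour directly by giving one added element its own TTL; a small illustrative example against the data table above:
-- only the new 'mary' cell gets this TTL; 'joe' and 'john' are unaffected
UPDATE data USING TTL 86400 SET names = names + {'mary'} WHERE id = '123';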

How do I prevent the loading of duplicate rows in to an Oracle table?

I have some large tables (millions of rows). I constantly receive files containing new rows to add in to those tables - up to 50 million rows per day. Around 0.1% of the rows I receive are duplicates of rows I have already loaded (or are duplicates within the files). I would like to prevent those rows being loaded in to the table.
I currently use SQL*Loader in order to have sufficient performance to cope with my large data volume. If I take the obvious step and add a unique index on the columns which govern whether or not a row is a duplicate, SQL*Loader will fail the entire file which contains the duplicate row - whereas I only want to prevent the duplicate row itself from being loaded.
I know that in SQL Server and Sybase I can create a unique index with the 'Ignore Duplicates' property and that if I then use BCP the duplicate rows (as defined by that index) will simply not be loaded.
Is there some way to achieve the same effect in Oracle?
I do not want to remove the duplicate rows once they have been loaded - it's important to me that they should never be loaded in the first place.
What do you mean by "duplicate"? If you have a column which defines a unique row you should set up a unique constraint against that column. One typically adds a unique constraint on this column, which will automatically create the supporting unique index.
EDIT:
Yes, as commented below you should set up a "bad" file for SQL*Loader to capture invalid rows. But I think that establishing the unique index is probably a good idea from a data-integrity standpoint.
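A sketch of that combination, with illustrative names: a unique constraint on the columns that define a duplicate, plus a bad file in the control file, so that only the offending rows are rejected (ERRORS must be set high enough that the load is not aborted):
-- one-off: the columns that define a duplicate
ALTER TABLE target_table
  ADD CONSTRAINT target_table_uq UNIQUE (key_col1, key_col2);

-- control file (conventional path load); rejected duplicates land in the bad file
OPTIONS (ERRORS=999999999, DIRECT=FALSE)
LOAD DATA
INFILE 'daily_feed.dat'
BADFILE 'daily_feed.bad'
APPEND
INTO TABLE target_table
FIELDS TERMINATED BY ','
(key_col1, key_col2, payload_col)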
Use Oracle MERGE statement. Some explanations here.
You didn't say which release of Oracle you have. Have a look there for the MERGE command.
Basically like this
-- merge all rows from the staging table temp_emp_rec
MERGE INTO hr.employees e
USING temp_emp_rec t
ON (e.emp_id = t.emp_id)
WHEN MATCHED THEN
  -- you can update
  UPDATE
     SET first_name = t.first_name,
         last_name  = t.last_name
WHEN NOT MATCHED THEN
  -- or insert into the table
  INSERT (emp_id, first_name, last_name)
  VALUES (t.emp_id, t.first_name, t.last_name);
I would use integrity constraints defined on the appropriate table columns.
This page from the Oracle Concepts manual gives an overview; if you scroll down you will see what types of constraints are available.
Use the option below; if you get more than 9999999 errors, your sqlldr run will terminate.
OPTIONS (ERRORS=9999999, DIRECT=FALSE)
LOAD DATA
You will get the duplicate records in the bad file.
sqlldr user/password@schema CONTROL=file.ctl, LOG=file.log, BAD=file.bad
