ETL into an operational Oracle database used by a JSP/Spring/Hibernate app

I need to load some legacy data into an operational Oracle (11gR2) database. The database is used by a JSP/Spring/Hibernate (3.2.5.ga) application. A single sequence is used for generating unique keys across all the tables. The sequence definition is as below:
CREATE SEQUENCE "TEST"."HIBERNATE_SEQUENCE" MINVALUE 1 MAXVALUE 999999999999999999999999999 INCREMENT BY 1 START WITH 1000 CACHE 20 NOORDER NOCYCLE
The idea for the data load/ETL is to come up with a script that starts out with the current max sequence value, obtained by running
select HIBERNATE_SEQUENCE.NEXTVAL from dual
at the beginning of the script generation process, and then generates SQL INSERT statements for the data that needs to be populated. There is some logic involved in handling data cleanup, business rules etc. that gets applied through the script, and the generated SQL INSERT statements are expected to be run in one batch, which should bring in all of the legacy data.
Assuming that the max sequence value was 1000, the script uses this as a variable and increments it as necessary, and the output SQL INSERTs will be as below:
INSERT INTO USER_STATUS(ID, CREATE_DATE, UPDATE_DATE, STATUS_ID, USER_ID)
VALUES (**1001**, CURRENT_DATE, CURRENT_DATE, 20, 445);
INSERT INTO USER_ACTIVITY_LOG(ID, CREATE_DATE, UPDATE_DATE, DETAILS, LAST_USER_STATUS_ID)
VALUES (**1002**, CURRENT_DATE, CURRENT_DATE, 'USER ACTIVITY 1', **1001**);
INSERT INTO USER_STATUS(ID, CREATE_DATE, UPDATE_DATE, STATUS_ID, USER_ID)
VALUES (**1003**, CURRENT_DATE, CURRENT_DATE, 10, 445);
INSERT INTO USER_ACTIVITY_LOG(ID, CREATE_DATE, UPDATE_DATE, DETAILS, LAST_USER_STATUS_ID)
VALUES (**1004**, CURRENT_DATE, CURRENT_DATE, 'USER ACTIVITY 3', **1003**);
I have created some mock SQL to show how the output INSERTs are going to look; there are going to be a lot more tables involved in the insert operations. Whenever we make data changes from the back end we would normally use HIBERNATE_SEQUENCE.NEXTVAL to get the next unique key value, but since the SQL generation script runs in a disconnected mode, it does not use HIBERNATE_SEQUENCE.NEXTVAL; it increments a local variable instead.
The assumption about being able to generate (and run) this script is that we will:
have the application taken down for maintenance
have no database activity during the time of running the script and start out with the max sequence value.
generate the SQL
run the SQL - commit.
assuming that, in the process of script generation, the max sequence value goes up from 1000 to 5000: after the script is run and the data is loaded, HIBERNATE_SEQUENCE would need to be dropped/recreated to start at 5001 (see the sketch after this list)
bring the application back up.
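For concreteness, the drop/recreate step mentioned above would look something like this with the example values (the definition mirrors the existing sequence, only START WITH changes):
DROP SEQUENCE "TEST"."HIBERNATE_SEQUENCE";
CREATE SEQUENCE "TEST"."HIBERNATE_SEQUENCE" MINVALUE 1 MAXVALUE 999999999999999999999999999 INCREMENT BY 1 START WITH 5001 CACHE 20 NOORDER NOCYCLE;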
Now, to the reason I am posting this in such detail: I need your suggestions/input about any loopholes in this design and whether there is anything I am overlooking.
Any input is appreciated.
Thanks!

I would suggest against dropping and re-creating the sequence if it's used for any other task in your application; doing so means you also need to re-add any permissions, synonyms, etc.
Do you know at the start of the script how many inserts you will do? If so, and assuming that you won't have any other activity, then you can adjust the 'increment by' value of the sequence, so a single select from it will move the sequence forward by whatever value you want.
> drop sequence seq_test;
sequence SEQ_TEST dropped.
> create sequence seq_test start with 1 increment by 1;
sequence SEQ_TEST created.
> select seq_test.nextval from dual;
NEXTVAL
----------------------
1
> alter sequence seq_test increment by 500;
sequence SEQ_TEST altered.
> select seq_test.nextval from dual;
NEXTVAL
----------------------
501
> alter sequence seq_test increment by 1;
sequence SEQ_TEST altered.
> select seq_test.nextval from dual;
NEXTVAL
----------------------
502
Just be aware that the DDL statements will issue an implicit commit, so once they have run any inflight transaction will be committed, and any work performed after them will be a separate transaction.
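One possible way to apply this to the scenario in the question (a sketch only; the exact increment depends on how many IDs the generated script actually uses):
alter sequence hibernate_sequence increment by 4000;   -- DDL: implicit commit
select hibernate_sequence.nextval from dual;           -- jumps the sequence from 1000 to 5000 in one call
alter sequence hibernate_sequence increment by 1;      -- DDL: another implicit commit
-- the generated INSERTs (using IDs 1001..5000) then run and commit as their own transaction:
insert into user_status (id, create_date, update_date, status_id, user_id)
values (1001, current_date, current_date, 20, 445);
-- ... remaining generated INSERTs ...
commit;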

Related

Oracle LogMiner Results Inconsistent

I'm working on a LogMiner-based solution for capturing changes, and I've uncovered what appears to be an unusual set of behaviors when attempting to mine redo events that pertain to CLOB or BLOB operations.
In my use case, I've inserted a record into a table that contains 3 CLOB fields, where one CLOB field's value is small while the other two CLOB fields must be set using LOB_WRITE operations.
When I set a LogMiner SCN range that starts before and ends after the transaction commit, I get the full set of expected rows in V$LOGMNR_CONTENTS, which are:
0a00070084220000 37717288 START
0a00070084220000 37717288 INSERT
0a00070084220000 37717312 SEL_LOB_LOCATOR
0a00070084220000 37717312 LOB_WRITE (several of these as expected)
0a00070084220000 37717331 SEL_LOB_LOCATOR
0a00070084220000 37717331 LOB_WRITE (several of these as expected)
0a00070084220000 37717332 INSERT (sets the smaller clob data values)
0a00070084220000 37717334 COMMIT
The unusual bit occurs when starting the mining session with certain start/end SCN ranges.
For example, when I mine from 37717239 to 37717289, I expected LogMiner to provide both the START and the INSERT in the table; however only the START operation was present.
Additionally, when I mine from 37717290 to 37717340, I expected LogMiner to provide all the SEL_LOB_LOCATOR, LOB_WRITE, and subsequent INSERT and the COMMIT; however only the subsequent INSERT and COMMIT were present.
The only assertion I can make from this is that LogMiner seems to have trouble when you split a transaction in which certain redo events represent synthetic operations related to LOBs. The only way I've been able to reliably reconstruct the series of events has been to mine from 37717288 forward, forcing LogMiner to have the full scope of the transaction available when it materializes the rows in the contents view.
Why does LogMiner behave like this? Why does it not correctly materialize when splitting the transaction with the SCN ranges I presented above?
For Logminer any single command is atomic by definition.
In this case it starts at 37717288 and ends at 37717332.
It cannot be split. If you ask for any range that splits it, LogMiner will deliberately not fetch it (so you won't get partial results of a single command).
This also holds for large non-LOB commands, like DDL that generates many internal commands (e.g. altering a table to modify a column's default value).
Besides that, be aware that fetching LOB values from LogMiner is not reliable. Just play around with the values and you will see that it is highly inconsistent. (I have tests somewhere to prove it, so if you are interested I can provide them.)
Here is the test: define a table with 2 LOBs and create 2 rows, the first with a single LOB and the second with two LOBs.
drop table sample1.clobs2;
create table sample1.clobs2 (id number not null, clob1 clob not null, clob2 clob);
--start
select current_scn from v$database;
insert into sample1.clobs2 (id, clob1) values (3, 'abc');
insert into sample1.clobs2 (id, clob1, clob2) values (4, 'abc', '2abc');
commit;
update sample1.clobs2 set clob1='def' where id=3;
update sample1.clobs2 set clob1='def', clob2='2def' where id=4;
commit;
update sample1.clobs2 set clob1=rpad('ghj',30000,'Z') where id=3;
update sample1.clobs2 set clob1=rpad('ghj',30000,'Z'), clob2=rpad('ghj',30000,'Z') where id=4;
commit;
--end
select current_scn from v$database;
Start the logminer:
exec DBMS_LOGMNR.end_LOGMNR;
exec DBMS_LOGMNR.ADD_LOGFILE('put here any logfile(select MEMBER from v$logfile), logminer will do the rest');
begin DBMS_LOGMNR.START_LOGMNR(
STARTSCN => put here the scn from the above test,
ENDSCN => put here the scn from the above test,
OPTIONS => -- I leave all the possible parameters here just for you to play
--DBMS_LOGMNR.DICT_FROM_REDO_LOGS +
--DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG +
DBMS_LOGMNR.CONTINUOUS_MINE +
--DBMS_LOGMNR.COMMITTED_DATA_ONLY+
--DBMS_LOGMNR.DDL_DICT_TRACKING+
DBMS_LOGMNR.NO_ROWID_IN_STMT+
DBMS_LOGMNR.NO_SQL_DELIMITER
);
end;
/
Check the results:
select scn, (XIDUSN || '.' || XIDSLT || '.' || XIDSQN) AS transaction_id, operation, seg_name, ROW_ID, rollback, csf, SQL_REDO, c.*
from v$logmnr_contents c
where 1=1
and (seg_name='OBJ# put here the object id of the table sample1.clobs2' or operation='COMMIT');
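To find the object id used in the seg_name filter, you can query DBA_OBJECTS (a quick sketch; depending on version and dictionary options the OBJ# shown by LogMiner may correspond to OBJECT_ID or DATA_OBJECT_ID, so check both):
select object_id, data_object_id
from dba_objects
where owner = 'SAMPLE1' and object_name = 'CLOBS2';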
You will see:
large lobs behave differently from small ones
when only one LOB is updated, it is impossible to tell which one it was (the first or the second)
In addition, this behaviour changes between different Oracle versions.

Disable triggers and re-enable triggers but avoid table alteration in the meantime

I have the following situation:
A table (MyTable) should be processed (updates/inserts/deletes etc) by a batch process (a call to a myplsql() procedure).
During myplsql execution no one should touch MyTable - so MyTable is locked in exclusive mode by myplsql.
Now, MyTable has a number of on-insert, on-update and on-delete triggers defined, but those are not needed during batch processing; moreover, they slow the batch process down extremely.
So the solution is to disable the triggers before calling myplsql().
But how do I prevent someone from touching MyTable just after alter table ... disable trigger is performed and before myplsql manages to lock the table,
given that alter table performs an implicit commit, so any lock acquired before that will be lost anyway?
Part of the problem is that I do not control the other code or the other user that could try to touch the Table.
In a few words I need to perform the following in a single shot:
Lock MyTable
Disable Triggers (somehow without losing the lock)
Process MyTable
Enable Triggers
Unlock MyTable
One thought was to remove grants from the table and render it unusable for other users.
But as it turns out, that is not an option, as the other processes/users perform their operations logged in as the table owner.
Thanks.
A slightly different approach is to keep the triggers enabled but reduce (if not quite entirely remove) their impact, by adding a when clause something like:
create or replace trigger ...
...
for each row
when (sys_context('userenv', 'client_info') is null
or sys_context('userenv', 'client_info') != 'BATCH')
declare
...
begin
...
end;
/
Then in your procedure add a call at the start as your 'disable triggers' step:
dbms_application_info.set_client_info('BATCH');
and clear it again at the end, just in case the session is left alive and reused (so you might want to do this in an exception handler too):
dbms_application_info.set_client_info(null);
You could also use module, or action, or a combination. While that setting is in place the trigger will still be evaluated but won't fire, so anything happening inside will be skipped - the trigger body does not run, as the docs put it.
This isn't foolproof as there is nothing really stopping other users/applications making the same calls, but if you pick a more descriptive string and/or a combination of settings, it would have to be deliberate - and I think you're mostly worried about accidents not bad actors.
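A sketch of the module/action variant mentioned above; the 'LEGACY_BATCH' and 'MYTABLE_LOAD' strings are made up for illustration:
-- in the batch procedure, instead of (or as well as) client_info:
exec dbms_application_info.set_module(module_name => 'LEGACY_BATCH', action_name => 'MYTABLE_LOAD');
-- the trigger's WHEN clause would then test, for example:
--   when (sys_context('userenv', 'module') is null
--         or sys_context('userenv', 'module') != 'LEGACY_BATCH')
-- and clear it again at the end:
exec dbms_application_info.set_module(module_name => null, action_name => null);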
Quick speed test with a pointless trigger that just slows things down a bit.
create table t42 (id number);
-- no trigger
insert into t42 (id) select level from dual connect by level <= 10000;
10,000 rows inserted.
Elapsed: 00:00:00.050
create or replace trigger tr42 before insert on t42 for each row
declare
dt date;
begin
select sysdate into dt from dual;
end;
/
-- plain trigger
insert into t42 (id) select level from dual connect by level <= 10000;
10,000 rows inserted.
Elapsed: 00:00:00.466
create or replace trigger tr42 before insert on t42 for each row
when (sys_context('userenv', 'client_info') is null
or sys_context('userenv', 'client_info') != 'BATCH')
declare
dt date;
begin
select sysdate into dt from dual;
end;
/
-- userenv trigger, not set
insert into t42 (id) select level from dual connect by level <= 10000;
10,000 rows inserted.
Elapsed: 00:00:00.460
-- userenv trigger, set to BATCH
exec dbms_application_info.set_client_info('BATCH');
insert into t42 (id) select level from dual connect by level <= 10000;
10,000 rows inserted.
Elapsed: 00:00:00.040
exec dbms_application_info.set_client_info(null);
There's a bit of variation from making remote calls, but I ran a few times and it's clear that running with a plain trigger is very similar to running with the constrained trigger without BATCH set, and both are much slower than running without a trigger or with the constrained trigger with BATCH set. In my testing there's an order of magnitude difference.

Automatically delete new records with specific values - ORACLE

I have a table in a database (Oracle 11g) that receives roughly 45000 new records each day. Our organization has roughly 15 items (each with a predetermined, static unique value), and I am looking to either delete the records for these items automatically or change a specific value in these records' columns before my batch job packages the transactions and sends them off. Any suggestions on the best way to do this? These transactions are only 10-20 of the 45000, so checking each record as it is entered seems like it may cost too much. To add, the values come in periodically throughout the day via a DTS package from a SQL Server 2000 server; and yes, 2000 is end of life and we will be upgrading early next year.
Sample below: accept only non-negative values; if a value is less than 0 we change it to 99999:
create table my_table (val int);
create or replace trigger my_trigger
before insert on my_table
for each row
declare
begin
if :new.val < 0 then :new.val := 99999; end if;
end my_trigger;
/
insert into my_table values(0);
insert into my_table values(1);
insert into my_table values(-1);
select * from my_table
1 0
2 1
3 99999
If you want to prevent inserting the "wrong" values, a "silent insert reject" is not recommended; you should either raise an exception in the trigger or set a constraint. See the discussion here: Before insert trigger
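For example, two alternatives to silently rewriting the value (a sketch; the constraint name is made up):
-- option 1: a check constraint on the table
alter table my_table add constraint my_table_val_chk check (val >= 0);
-- option 2: raise an error from the trigger instead of rewriting :new.val
create or replace trigger my_trigger
before insert on my_table
for each row
begin
  if :new.val < 0 then
    raise_application_error(-20001, 'Negative values are not allowed');
  end if;
end my_trigger;
/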

How to guarantee order of primary key and timestamp in oracle database

I am creating some records which have id, ts, ... So first I call a select to get ts and id:
select SEQ_table.nextval, CURRENT_TIMESTAMP from dual
and then I call insert
insert into table ...id, ts ...
This works well in 99% of cases, but sometimes under a big load the order of the records is wrong: I need record.id < (record+1).id and record.ts < (record+1).ts, but this condition is not always met. How can I solve this problem? I am using an Oracle database.
You should not use the result of a sequence for ordering. This might look strange, but think about how sequences are cached and think about RAC. Every instance has its own sequence cache, and for performance you need big caches. Sequences would be better described as random unique key generators that happen to work sequentially most of the time.
The timestamp format has a time resolution up to the microsecond level. As hardware becomes quicker and load increases, it could be that you get multiple rows at the same time. There is not much you can do about that until Oracle takes the resolution a step further again.
Use an INSERT trigger to populate the id and ts columns.
create table sotest
(
id number,
ts timestamp
);
create sequence soseq;
CREATE OR REPLACE TRIGGER SOTEST_BI_TRIG BEFORE
INSERT ON SOTEST REFERENCING NEW AS NEW FOR EACH ROW
BEGIN
:new.id := soseq.nextval;
:new.ts := CURRENT_TIMESTAMP;
END;
/
PHIL#PHILL11G2 > insert into sotest values (NULL,NULL);
1 row created.
PHIL#PHILL11G2 > select * from sotest;
ID TS
---------- ----------------------------------
1 11-MAY-12 13.29.33.771515
PHIL#PHILL11G2 >
You should also pay attention to the other answer provided. Is id meant to be a meaningless primary key (it usually is in apps - it's just a key to join on)?
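If the real requirement is just to read rows back in creation order, one option with the trigger approach above is to order by the timestamp and use the id only as a tie-breaker - a sketch against the SOTEST table:
select id, ts
from sotest
order by ts, id;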

select only new row in oracle

I have a table with a VARCHAR2 primary key.
It has about 1,000,000 transactions per day.
My app wakes up every 5 minutes to generate a text file by querying only the new records.
It will remember the last point and process only the new records.
Do you have any idea how to query this with good performance?
I am able to add a new column if necessary.
What do you think this process should be done with?
PL/SQL?
Java?
Everyone here is really really close. However:
Scott Bailey's wrong about using a bitmap index if the table's under any sort of continuous DML load. That's exactly the wrong time to use a bitmap index.
Everyone else's answer about the PROCESSED CHAR(1) check in ('Y','N') column is right, but misses how to index it; you should use a function-based index like this:
CREATE INDEX MY_UNPROCESSED_ROWS_IDX ON MY_TABLE
(CASE WHEN PROCESSED_FLAG = 'N' THEN 'N' ELSE NULL END);
You'd then query it using the same expression:
SELECT * FROM MY_TABLE
WHERE (CASE WHEN PROCESSED_FLAG = 'N' THEN 'N' ELSE NULL END) = 'N';
The reason to use the function-based index is that Oracle doesn't write index entries for entirely NULL values being indexed, so the function-based index above will only contain the rows with PROCESSED_FLAG = 'N'. As you update your rows to PROCESSED_FLAG = 'Y', they'll "fall out" of the index.
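For completeness, a sketch of the matching update; once the flag becomes 'Y' the row no longer has an entry in MY_UNPROCESSED_ROWS_IDX:
UPDATE MY_TABLE
   SET PROCESSED_FLAG = 'Y'
 WHERE (CASE WHEN PROCESSED_FLAG = 'N' THEN 'N' ELSE NULL END) = 'N';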
Well, if you can add a new column, you could create a Processed column, which will indicate processed records, and create an index on this column for performance.
Then the query should only be for those rows that have been newly added, and not processed.
This should be easily done using sql queries.
Ah, I really hate to add another answer when the others have come so close to nailing it. But
As Ponies points out, Oracle does have a hidden column (ORA_ROWSCN - System Change Number) that can pinpoint when each row was modified. Unfortunately, the default is that it gets the information from the block instead of storing it with each row and changing that behavior will require you to rebuild a really large table. So while this answer is good for quieting the SQL Server fella, I'd not recommend it.
Astander is right there but needs a few caveats. Add a new column needs_processed CHAR(1) DEFAULT 'Y' and add a BITMAP index. For low-cardinality columns ('Y'/'N') the bitmap index will be faster. Once you have that, the rest is pretty easy. But you've got to be careful not to select the new rows, process them and mark them as processed in one step. Otherwise, rows could be inserted while you are processing that will get marked processed even though they have not been.
The easiest way would be to use pl/sql to open a cursor that selects unprocessed rows, processes them and then updates the row as processed. If you have an aversion to walking cursors, you could collect the pk's or rowids into a nested table, process them and then update using the nested table.
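A minimal PL/SQL sketch of the cursor version, reusing the hypothetical needs_processed column from above (the processing step is just a placeholder comment):
begin
  for r in (select t.rowid as rid, t.* from my_table t where needs_processed = 'Y') loop
    -- process the row here, e.g. write it out via utl_file
    update my_table set needs_processed = 'N' where rowid = r.rid;
  end loop;
  commit;
end;
/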
In the MS SQL Server world where I work, we have a 'version' column of type 'timestamp' on our tables.
So, to answer #1, I would add a new column.
To answer #2, I would do it in plsql for performance.
Mark
"astander" pretty much did the work for you. You need to ALTER your table to add one more column (lets say PROCESSED)..
You can also consider creating an INDEX on PROCESSED (a bitmap index may be of some advantage, as the possible values can only be 'Y' and 'N', but test it out) so that your query will use the index.
Also, if you are sure you only query every 5 minutes, check whether you can add another column of TIMESTAMP type and partition the table by it (not sure, check it out).
I would also think about writing a job or something similar that writes the file using UTL_FILE and shows it in the front end, if that is possible.
If performance is really a problem and you want to create your file asynchronously, you might want to use Oracle Streams, which will actually get modification data from your redo log without affecting performance of the main database. You may not even need a separate job, as you can configure Oracle Streams to do asynchronous replication of the changes, through which you can trigger the file creation.
Why not create an extra table that holds two columns: the ID column and a processed flag column? Have an insert trigger on the original table place its ID in this new table. Your logging process can then select records from this new table and mark them as processed. Finally, delete the processed records from this table.
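A sketch of that idea (all names are made up; the ID type mirrors the varchar2 key from the question):
create table new_rows_log (
  id        varchar2(100),
  processed char(1) default 'N'
);
create or replace trigger my_table_log_trg
after insert on my_table
for each row
begin
  insert into new_rows_log (id) values (:new.id);
end;
/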
I'm pretty much in agreement with Adam's answer. But I'd want to do some serious testing compared to an alternative.
The issue I see is that you need to not only select the rows, but also do an update of those rows. While that should be pretty fast, I'd like to avoid the update. And avoid having any large transactions hanging around (see below).
The alternative would be to add CREATE_DATE date default sysdate. Index that. And then select records where create_date >= (start date/time of your previous select).
But I don't have enough data on the relative costs of setting a sysdate as default vs. setting a value of Y, updating the function-based index vs. the date index, and doing a range select on the date vs. a specific select on a single value for the Y. You'll probably want to preserve stats or hint the query to use the index on the Y/N column, and definitely want to use a hint on a date column -- the stats on the date column will almost certainly be old.
If data are also being added to the table continuously, including during the period when your query is running, you need to watch out for transaction control. After all, you don't want to read 100,000 records that have the flag = Y, then do your update on 120,000, including the 20,000 that arrived while your query was running.
In the flag case, there are two easy ways: SET TRANSACTION before your select and commit after your update, or start by doing an update from Y to Q, then do your select for those that are Q, and then update to N. Oracle's read consistency is wonderful but needs to be handled with care.
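The Y -> Q -> N idea spelled out as a sketch, again using the hypothetical needs_processed column:
update my_table set needs_processed = 'Q' where needs_processed = 'Y';
commit;
-- process the rows where needs_processed = 'Q' ...
update my_table set needs_processed = 'N' where needs_processed = 'Q';
commit;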
For the date column version, if you don't mind a risk of processing a few rows more than once, just update your table that has the last processed date/time immediately before you do your select.
If there's not much information in the table, consider making it Index Organized.
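An index-organized version might look like this (a sketch with made-up column names; only worthwhile if the table really is narrow):
create table txn_log_iot (
  txn_id   varchar2(100) primary key,
  payload  varchar2(200)
) organization index;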
What about using Materialized view logs? You have a lot of options to play with:
SQL> create table test (id_test number primary key, dummy varchar2(1000));
Table created
SQL> create materialized view log on test;
Materialized view log created
SQL> insert into test values (1, 'hello');
1 row inserted
SQL> insert into test values (2, 'bye');
1 row inserted
SQL> select * from mlog$_test;
ID_TEST SNAPTIME$$ DMLTYPE$$ OLD_NEW$$ CHANGE_VECTOR$$
---------- ----------- --------- --------- ---------------------
1 01/01/4000 I N FE
2 01/01/4000 I N FE
SQL> delete from mlog$_test where id_test in (1,2);
2 rows deleted
SQL> insert into test values (3, 'hello');
1 row inserted
SQL> insert into test values (4, 'bye');
1 row inserted
SQL> select * from mlog$_test;
ID_TEST SNAPTIME$$ DMLTYPE$$ OLD_NEW$$ CHANGE_VECTOR$$
---------- ----------- --------- --------- ---------------
3 01/01/4000 I N FE
4 01/01/4000 I N FE
I think this solution should work.
You need to do the following steps.
For the first run, you will have to copy all records. In the first run you need to execute the following query:
insert into new_table (max_rowid) select max(rowid) from yourtable;
The next time you want to get only the newly inserted values, you can do it by executing the following command:
select * from yourtable where rowid > (select max_rowid from new_table);
Once you are done processing the rows from the above query, simply truncate new_table and insert max(rowid) from yourtable again (see the sketch below).
I think this should work and would be the fastest solution.
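The reset step at the end of each cycle, spelled out:
truncate table new_table;
insert into new_table (max_rowid) select max(rowid) from yourtable;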
