Oracle update millions of records from XML file - oracle

Gurus,
I have a reporting shell script on a Linux platform with an Oracle 12c database, which does the following:
Read the error XML files created in the last 24 hrs (mtime) from the Unix directory path.
Strip unwanted text (like ') using sed.
Fetch each row and column using cut -d ";" -f $X.
Prepare an update statement.
Execute the update statement after processing each file to set the error
code.
In UAT I received 400 files, each with 20,000 rows, which means the update statement is prepared and executed 400 x 20,000 times.
The issues I see are:
Unable to log/handle update errors in order to debug or rerun them.
It is taking a lot of time even though we have indexes.
What is the best way to handle such a situation?
I have the following thought in mind: instead of creating update statements, use sqlldr to load into a temp table and then update/merge the two tables. I'm not sure about the performance of executing 400 sqlldr runs. Any idea?
Is there a better way to handle this, in terms of error handling and process?
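For reference, a minimal sketch of the sqlldr-plus-merge idea described above; the names err_stage, txn_table, key_id and err_code, the credentials and the control file are placeholders, not the actual schema:
# 1) Bulk-load the parsed error rows into a staging table (direct path).
sqlldr userid=user/pass control=err_stage.ctl data=parsed_errors.csv direct=true

# 2) Apply all the updates in one set-based statement instead of 400 x 20,000 single-row updates.
sqlplus -s user/pass <<'EOF'
merge into txn_table t
using err_stage s
on (t.key_id = s.key_id)
when matched then
  update set t.err_code = s.err_code;
exit
EOF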

Related

External tables: how to make sure I don't load the same file/data

I want to use an external table to load a CSV file as it's very convenient, but the problem is: how do I make sure I don't load the same file twice in a row? I can't validate the loaded data because it can be the same information as before; I need to find a way to make sure the user doesn't load the same file as, say, 2 hours ago.
I thought about uploading the file with a different name each time and issuing an ALTER TABLE command to change the name of the file in the definition of the external table, but it sounds kind of risky.
I also thought about marking each row in the file with a sequence to help differentiate files, but I doubt the client would accept it, as they would need to do this manually (the file is exported from somewhere).
Is there any better way to make sure I don't load the same file into the external table, other than changing the file's name and executing an ALTER on the table?
Thank you
When you bring the data from the external table into your database, you can use the MERGE command instead of INSERT. It lets you avoid worrying about duplicate data.
See the blog post about the Oracle MERGE command:
What's more, we can wrap up the whole transformation process into this
one Oracle MERGE command, referencing the external table and the table
function in the one command as the source for the MERGED Oracle data.
alter session enable parallel dml;

merge /*+ parallel(contract_dim,10) append */
into contract_dim d
using TABLE(trx.go(
        CURSOR(select /*+ parallel(contracts_file,10) full(contracts_file) */ *
               from contracts_file))) f
on (d.contract_id = f.contract_id)
when matched then
  update set desc              = f.desc,
             init_val_loc_curr = f.init_val_loc_curr,
             init_val_adj_amt  = f.init_val_adj_amt
when not matched then
  insert values ( f.contract_id,
                  f.desc,
                  f.init_val_loc_curr,
                  f.init_val_adj_amt);
So there we have it - our complex ETL function all contained within a
single Oracle MERGE statement. No separate SQL*Loader phase, no
staging tables, and all piped through and loaded in parallel
I can only think of a solution somewhat like this (a bash sketch follows the list):
Have a timestamp encoded in the datafile name (like YYYYMMDDHHMISS-file.csv, where YYYYMMDDHHMISS is the timestamp).
Create a table with a timestamp field (as above).
Create a shell script that:
extracts the timestamp from the datafile name;
calls a SQL script with the timestamp as the parameter, which returns 0 if that timestamp does not exist and <>0 if it already exists; in that case exit the script with the error: File: YYYYMMDDHHMISS-file.csv already loaded;
copies the YYYYMMDDHHMISS-file.csv to input-file.csv;
runs the SQL*Loader script that loads the input-file.csv file;
on success: runs a second SQL script with the timestamp as parameter that inserts a record in the database to indicate that the file is loaded, and moves the original file to a backup folder;
on failure: reports the failure of the load script.
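A minimal sketch of that script; the credentials, check_loaded.sql, mark_loaded.sql and load_input.ctl are assumed names, and check_loaded.sql is expected to EXIT with a non-zero status when the timestamp is already recorded:
#!/bin/bash
DATAFILE=$1                      # e.g. 20150817053000-file.csv
TS=${DATAFILE%%-*}               # the leading YYYYMMDDHHMISS timestamp

# check_loaded.sql is assumed to EXIT non-zero when the timestamp already exists.
if ! sqlplus -s user/pass @check_loaded.sql "$TS"; then
  echo "File: $DATAFILE already loaded." >&2
  exit 1
fi

cp "$DATAFILE" input-file.csv

if sqlldr user/pass control=load_input.ctl data=input-file.csv; then
  # Record the timestamp, then archive the original file.
  sqlplus -s user/pass @mark_loaded.sql "$TS"
  mv "$DATAFILE" backup/
else
  echo "Load of $DATAFILE failed." >&2
  exit 1
fi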

Quick way to run a large SQL file containing INSERT statements - oracle

I have a SQL script that contains 700,000 INSERT statements. I tried to run it through Oracle SQL Developer, but it failed to even load the file. I tried to run it from SQL*Plus, but it takes quite a long time to execute such a large file.
To speed it up, I deleted all the constraints on the table, but there was no improvement.
I have looked for information, and people have been suggesting:
Split the file into manageable sizes - which I will fall back on as my last resort.
SQL*Loader - as far as I understand, SQL*Loader is used to export from the DB in a specific format and load the data back into the DB with a CTL file.
Is there any better way to handle this scenario?
Convert the file from many SQL statements to a smaller number of PL/SQL blocks to reduce the round-trip overhead. This only requires a few minutes with a text editor and can improve performance by orders of magnitude, especially over slow networks.
Every 10,000 lines, add a begin and an end; to the file.
Change this:
insert into ...
insert into ...
insert into ...
...
To this:
begin
insert into ...
insert into ...
insert into ...
...
end;
/
Don't convert the entire file into one large PL/SQL block, though. There is a limit to the size of anonymous PL/SQL blocks, and you might get a parser error.
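If each INSERT really does sit on its own line, the wrapping can be done mechanically; a small awk sketch (inserts.sql, inserts_blocks.sql and the 10,000 batch size are just placeholders):
awk 'NR % 10000 == 1 { print "begin" }
     { print }
     NR % 10000 == 0 { print "end;"; print "/" }
     END { if (NR % 10000 != 0) { print "end;"; print "/" } }' inserts.sql > inserts_blocks.sql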
Agree with the answer above - 700,000 inserts via script is going to be slow no matter what you do with it; load your data up as an external table or use SQL*Loader.
However
If you want to execute a large script with SQL Developer, don't OPEN the file - we have to open and parse and display the contents of that file. Ow.
Just do this in the worksheet
@script_name
That will execute the script.
To speed it up even further, hide or minimize the output area of the worksheet.
It's still not going to be super-fast with 700,000 inserts though.
AFAIK, option 2 is correct.
Use sed/awk/perl to convert the file into CSV (or fixed width) input file.
disable constraints, indexes, (possibly drop unique indexes)
create control file for your input file
exec sqlldr (turn direct path load on)
And this should finish within a few seconds.
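A rough sketch of the control-file and sqlldr steps; data.csv, my_table and the column names are placeholders for whatever the sed/awk conversion produces:
# Generate a minimal control file (table, file and column names are placeholders).
cat > load.ctl <<'EOF'
LOAD DATA
INFILE 'data.csv'
APPEND INTO TABLE my_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(col_a, col_b, col_c)
EOF

# direct=true turns on the direct path load.
sqlldr userid=user/pass control=load.ctl log=load.log direct=true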
You also have to check whether you have any triggers on the table you are inserting into. They can slow down the process if a lot of logic is coded behind them.
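A quick way to look for such triggers from the data dictionary; MY_TABLE and the credentials are placeholders:
sqlplus -s user/pass <<'EOF'
select trigger_name, status from user_triggers where table_name = 'MY_TABLE';
exit
EOF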

Avoiding Data Duplication when Loading Data from Multiple Servers

I have a dozen web servers each writing data to a log file. At the beginning of each hour, the data from the previous hour is loaded to hive using a cron script running the command:
hive -e "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"
In some cases, the command fails and exits with a code other than 0, in which case our script waits and tries again. The problem is that in some cases of failure the data load does not actually fail, even though a failure message is shown. How can I know for sure whether or not the data has been loaded?
An example of such a "failure" where the data is actually loaded:
Loading data to table default.my_table partition (dt=2015-08-17-05)
Failed with exception
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter
partition. FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask
Edit:
Alternatively, is there a way to query hive for the filenames loaded into it? I can use DESCRIBE to see the number of files. Can I know their names?
About "which files have been loaded in a partition":
if you had used an EXTERNAL TABLE and just uploaded your raw data
file in the HDFS directory mapped to LOCATION, then you could
(a) just run a hdfs dfs -ls on that directory from command line (or use the equivalent Java API call)
(b) run a Hive query such as select distinct INPUT__FILE__NAME from (...)
but in your case, you copy the data into a "managed" table, so there
is no way to retrieve the data lineage (i.e. which log file was used
to create each managed datafile)
...unless you explicitly add the original file name inside the log file, of
course (either in a "special" header record, or at the beginning of each record - which can be done with good old sed).
About "how to automagically avoid duplication on INSERT": there is a way, but it would require quite a bit of re-engineering, and would cost you in processing time (an extra Map step plus a MapJoin)...
map your log file to an EXTERNAL TABLE so that you can run an
INSERT-SELECT query
upload the original file name into your managed table using INPUT__FILE__NAME pseudo-column as source
add a WHERE NOT EXISTS clause w/ correlated sub-query, so that if the source file name is already present in target then you load nothing more
INSERT INTO TABLE Target
SELECT ColA, ColB, ColC, INPUT__FILE__NAME AS SrcFileName
FROM Source src
WHERE NOT EXISTS
  (SELECT DISTINCT 1
   FROM Target trg
   WHERE trg.SrcFileName = src.INPUT__FILE__NAME
  );
Note the silly DISTINCT that is actually required to avoid blowing away the RAM in your Mappers; it would be useless with a mature DBMS like Oracle, but the Hive optimizer is still rather crude...
I don't believe you can simply do this in Hadoop/Hive. So here are the basics of an implementation in Python:
import subprocess
# Run the count query and capture stdout; decode() makes this work on Python 2 and 3.
x = subprocess.check_output(
    ['hive', '-e', "select count(*) from my_table where dt='2015-08-17-05'"]
).decode().strip()
print(type(x))
print(x)
But you have to spend some time working with backslashes to get hive -e to work using python. It can be very difficult. It may be easier to write a file with that simple query in it first, and then use hive -f filename. Then, print the output of subprocess.check_output in order to see how the output is stored. You may need to do some regex or type conversions, but I think it should just come back as a string. Then simply use an if statement:
# Re-run the load only if the partition came back empty.
if int(x) > 0:
    pass
else:
    subprocess.check_call(['hive', '-e',
        "LOAD DATA LOCAL INPATH 'myfile.log' INTO TABLE my_table PARTITION(dt='2015-08-17-05')"])

How to insert the name of a file and its modified time using a batch/shell script and SQL*Loader

I have a requirement to insert bulk data into an Oracle database from a CSV file. The table's column specs match those of the CSV file's header, with the exception of three additional fields in the database:
A Primary Key field (for which a simple SEQUENCE.NEXTVAL is called)
A field for the name of the CSV file
A field for the last modified date+time of the file
The following Stack Overflow question addresses an extra-column issue, but the solution there is pretty easy because it uses Oracle's sysdate, which is available internally. I need to pass a parameter in from a batch/shell script.
Insert actual date time in a row with SQL*loader
Can PARFILE help here somehow?
My other alternative would be to do the whole task in two steps by writing a small Java program:
Use SQL Loader for bulk upload leaving out data for the filename and
modified time
And then run a separate update statement to populate the newly
created rows
But I'm looking for something that will get the job done in one shot. Any advice?
I'm afraid it's not possible with sqlldr alone.
There is no built-in facility for this in sqlldr.
You'd need some sort of script or program to dynamically create a .ctl file for each load.
Here is a bash script to help you get started:
#!/bin/bash -xv
readonly MY_FILENAME=$1
readonly DB_BUF_TABLE=$2

# Build a control file that embeds the data file's name as a constant column value.
readonly SQLLDR_CTL="LOAD DATA
CHARACTERSET UTF8
APPEND INTO TABLE $DB_BUF_TABLE
FIELDS TERMINATED BY ';'(
  filename CONSTANT '$MY_FILENAME',
  col_foo,
  col_bar
)"

echo "$SQLLDR_CTL" > "loader.ctl"
sqlldr control=loader.ctl parfile=loader.par data="$MY_FILENAME"
sqlldrReturnValue=$?
You'd need some locking with this, or path separation, so that concurrent loads are sure to start sqlldr with the proper ctl file.
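The question also asks for the file's last-modified date and time, and the same generated-control-file trick can cover that. A sketch under assumptions: file_mtime is a hypothetical column name, date -r is the GNU coreutils option for a file's modification time, and the SQLLDR_CTL definition in the script above would be replaced by something like this:
MY_MTIME=$(date -r "$MY_FILENAME" '+%Y-%m-%d %H:%M:%S')   # last-modified time of the data file

# EXPRESSION makes SQL*Loader evaluate a SQL expression instead of reading a field from the data file.
readonly SQLLDR_CTL="LOAD DATA
CHARACTERSET UTF8
APPEND INTO TABLE $DB_BUF_TABLE
FIELDS TERMINATED BY ';'(
  filename   CONSTANT '$MY_FILENAME',
  file_mtime EXPRESSION \"to_date('$MY_MTIME', 'YYYY-MM-DD HH24:MI:SS')\",
  col_foo,
  col_bar
)"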

Create large objects in Oracle via SQL*Plus

I have a view whose DDL definition is many thousands of lines long. Part of our CI process is to drop and recreate views from DDL using SQL*Plus called from a command-line script.
This works for hundreds of views in the database, but the very large view is never created in the target schema. I always paste the view creation script into Toad and run it manually after the automated process has completed. This is a drag.
There is no meaningful error message from SQL*Plus when the large-view portion of the DDL script is run, but I suspect that it fails because of its size.
Is there a "set" command that I can include at the top of my DDL to tell SQL*Plus that it's OK to create large views, or am I forever doomed to include a stoopid manual step in the otherwise automatic CI process?
Firstly, use the most recent version of SQL*Plus. It's been a long time since I had a piece of code that was too large to be executed through SQL*Plus. You can use the Instant Client.
I'd also look at refactoring the view. Look at the WITH clause, as that is relatively new and, if the view has evolved over a long period, there's a good chance it can be amended to make use of it.
Is there an empty line in the view SQL, or does any line have more than 2499 characters? Either one of these may cause SQL*Plus to behave unexpectedly but not actually fail.
If there is an empty line, Oracle will ignore everything before it and try to run everything after it. (This only applies to SQL, not PL/SQL.) For example, if you have an empty line right after the create view line, the query will run:
SQL> create or replace view newline_in_the_middle as
2
SQL> select * from dual;
D
-
X
A line with >2499 characters will be ignored but Oracle will still try to process the statement without it. This can cause problems but may still result in a valid statement:
SQL> create or replace view long_line as
2 select '...[enter 2500 characters]...' asdf from dual union all
SP2-0027: Input is too long (> 2499 characters) - line ignored
2 select '1' asdf from dual;
View created.
You may have to check the script output very carefully to find these issues.
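If blank lines turn out to be the culprit, SQL*Plus can be told to tolerate them with SET SQLBLANKLINES; a small sketch, where user/pass and big_view.sql are placeholders:
# Allow blank lines inside SQL statements while running the generated DDL script.
sqlplus -s user/pass <<'EOF'
set sqlblanklines on
@big_view.sql
exit
EOF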
