How to update an Oracle Table from SAS efficiently?

The problem I am trying to solve:
I have a SAS dataset work.testData (in the work library) that contains 8 columns and around 1 million rows. All columns are text (i.e. no numeric data). This SAS dataset is around 100 MB in file size. My objective is to have a step that loads this entire SAS dataset into Oracle, i.e. sort of a "copy and paste" of the SAS dataset from the SAS platform to the Oracle platform. The rationale is that, on a daily basis, this table in Oracle gets "replaced" by the one in SAS, which enables the downstream Oracle processes.
My approach to solve the problem:
One-off initial setup in Oracle:
In Oracle, I created a table called testData with a table structure pretty much identical to the SAS dataset testData. (i.e. Same table name, same number of columns, same column names, etc.).
On-going repeating process:
In SAS, a SQL pass-through step to truncate ora.testData (i.e. remove all rows whilst keeping the table structure). This ensures ora.testData is empty before inserting from SAS.
In SAS, a LIBNAME statement to assign the Oracle database as a SAS library (called ora). So I can "see" what's in Oracle and perform read/update from SAS.
In SAS, a PROC SQL procedure to "insert" data from the SAS dataset work.testData into the Oracle table ora.testData.
Sample code
One-off initial setup in Oracle:
Step 1: Run this Oracle SQL script in Oracle SQL Developer (to create the table structure for table testData, with 0 rows of data to begin with).
DROP TABLE testData;
CREATE TABLE testData
(
NODENAME VARCHAR2(64) NOT NULL,
STORAGE_NAME VARCHAR2(100) NOT NULL,
TS VARCHAR2(10) NOT NULL,
STORAGE_TYPE VARCHAR2(12) NOT NULL,
CAPACITY_MB VARCHAR2(11) NOT NULL,
MAX_UTIL_PCT VARCHAR2(12) NOT NULL,
AVG_UTIL_PCT VARCHAR2(12) NOT NULL,
JOBRUN_START_TIME VARCHAR2(19) NOT NULL
)
;
COMMIT;
On-going repeating process:
Steps 2, 3 and 4: Run this SAS code in SAS
******************************************************;
******* On-going repeatable process starts here ******;
******************************************************;
*** Step 2: Truncate the temporary Oracle transaction dataset;
proc sql;
connect to oracle (user=XXX password=YYY path=ZZZ);
execute (
truncate table testData
) by oracle;
execute (
commit
) by oracle;
disconnect from oracle;
quit;
*** Step 3: Assign Oracle DB as a libname;
LIBNAME ora Oracle user=XXX password=YYY path=ZZZ dbcommit=100000;
*** Step 4: Insert data from SAS to Oracle;
PROC SQL;
insert into ora.testData
select NODENAME length=64,
STORAGE_NAME length=100,
TS length=10,
STORAGE_TYPE length=12,
CAPACITY_MB length=11,
MAX_UTIL_PCT length=12,
AVG_UTIL_PCT length=12,
JOBRUN_START_TIME length=19
from work.testData;
QUIT;
******************************************************;
**** On-going repeatable process ends here *****;
******************************************************;
The limitation / problem to my approach:
The PROC SQL step (which transfers 100 MB of data from SAS to Oracle) takes around 5 hours - the job takes far too long to run!
The Question:
Is there a more sensible way to perform data transfer from SAS to Oracle? (i.e. updating an Oracle table from SAS).

First off, you can do the drop/recreate from SAS if that's a necessity. I wouldn't drop and recreate each time - a truncate seems an easier way to get the same result - but if you have other reasons then that's fine; either way, you can use execute (truncate table xyz) by oracle, or a similar pass-through statement, to truncate or drop the table.
Second, assuming there are no constraints or indexes on the table - which seems likely given you are dropping and recreating it - you may not be able to improve this much, because it may come down to network latency. However, there is one area you should look at in the connection settings (which you don't provide): how often SAS commits the data.
There are two ways to control this: the DBCOMMIT setting and the BULKLOAD setting. The former controls how frequently commits are executed (so if DBCOMMIT=100 then a commit is executed every 100 rows). More frequent commits mean less data is lost if a random failure occurs, but much slower execution. DBCOMMIT defaults to 0 for a PROC SQL INSERT, which means just make one commit (the fastest option assuming no errors), so this is less likely to be helpful unless you're overriding it.
Bulkload is probably my recommendation; that uses SQLLDR to load your data, ie, it batches the whole bit over to Oracle and then says 'Load this please, thanks.' It only works with certain settings and certain kinds of queries, but it ought to work here (subject to other conditions - read the documentation page above).
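To give a rough idea (untested against your setup, so treat it as a sketch): with the same connection details as your LIBNAME statement, Step 4 could be replaced by a bulk-loading PROC APPEND, using the BULKLOAD= data set option of SAS/ACCESS Interface to Oracle (BL_OPTIONS= can pass extra SQL*Loader options if you need them).
*** Hypothetical replacement for Step 4: bulk-load instead of row-by-row inserts;
LIBNAME ora ORACLE user=XXX password=YYY path=ZZZ;
PROC APPEND base=ora.testData (bulkload=yes)
            data=work.testData;
RUN;
The same BULKLOAD=YES data set option should also be attachable to ora.testData in your existing PROC SQL INSERT, if you prefer to keep that step as it is.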
If you're already using BULKLOAD and it is still slow, then you may be up against network latency; 5 hours for 100 MB seems slow, but I've seen all sorts of things in my (relatively short) day. If BULKLOAD doesn't work, I would probably bring in the Oracle DBAs and have them troubleshoot this, starting from a .csv file and a SQL*Loader control file (which should be basically identical to what SAS does with BULKLOAD); they should know how to troubleshoot that and at least be able to monitor performance of the database itself. If there are constraints on other tables that are problematic here (ie, other tables that too-frequently recalculate themselves based on your inserts or whatever), they should be able to find that out and recommend solutions.
You could look into PROC DBLOAD, which sometimes is faster than inserts in SQL (though all in all shouldn't really be, and is an 'older' procedure not used too much anymore). You could also look into whether you can avoid doing a complete flush and fill (ie, if there's a way to transfer less data across the network), or even simply shrinking the column sizes.
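For reference, a PROC DBLOAD attempt might look roughly like the sketch below; this is older syntax, so check the SAS/ACCESS documentation for your release, and note that DBLOAD creates the target table itself (so it would take the place of the one-off CREATE TABLE step). The credentials are again just your placeholders.
proc dbload dbms=oracle data=work.testData;
   user=XXX;
   orapw=YYY;
   path='ZZZ';
   table=testData;
   limit=0;   /* remove the default cap on the number of rows loaded */
   load;
run;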

Related

Import massive table from Oracle to PostgreSQL with oracle-fdw return ORA-01406

I am working on a project to transfer data from an Oracle database to a PostgreSQL database to create a data warehouse, using bash & SQL scripts. To access the Oracle database, I use the PostgreSQL extension oracle-fdw.
One of my scripts imports data from a massive table (~100,000,000 new rows/day). This table is partitioned and each partition contains 1 day of data. The query I use to import data looks like this:
INSERT INTO postgre_target_table (some_fields)
SELECT some_aggregated_fields -- (~150 fields)
FROM oracle_source_table
WHERE partition_id = :v_partition_id AND some_others_filters
GROUP BY primary_key;
On the DEV server, the query works fine (there is much less data on this server), but in PREPROD it returns the error ORA-01406: fetched column value was truncated.
In some posts, people say that the output fields may be too small, but if I send a simple SELECT query without the INSERT or GROUP BY I get the same error.
Another idea I found in another post is to create an Oracle-side view, but my query uses multiple parameters that I cannot use in a view.
The last idea I found is to create an Oracle stored procedure that fills a table with the aggregated data and then import from that table, but the Oracle database is critical and my customer prefers to avoid adding more data to it.
Now, I'm starting to think there's no solution and it's not good...
PostgreSQL version : 12.4 / Oracle version : 11.2
UPDATE
It seems my problem is more complicated than I thought.
After applying the modification given by Laurenz Albe, the query runs correctly in pgAdmin, but the problem still appears when I use the psql command.
Moreover, another query seems to have the same problem. This other query does not use the same source table as the first one; it uses 4 joined tables without any partitions. The common point between these queries is their structure.
The detail I omitted in the original post is that the purpose of both queries is to pivot a table. They look like this:
SELECT osr.id,
MIN(CASE osr.category
WHEN 123 THEN
1
END) AS field1,
MIN(CASE osr.category
WHEN 264 THEN
1
END) AS field2,
MIN(CASE osr.category
WHEN 975 THEN
1
END) AS field3,
...
FROM oracle_source_table osr
WHERE osr.category IN (123, 264, 975, ...)
GROUP BY osr.id;
Now that I have detailed what the queries look like, here are some results I got with the second one without changing the value of max_long (this query is lighter than the first one):
Sometimes it works (~10%), sometimes it fails (~90%) in pgAdmin, but it never works with the psql command
If I delete the WHERE, it always works
I don't understand why deleting the WHERE changes anything; the field used in this clause is a NUMBER(6, 0) between 0 and 2500 and it is still used in the SELECT clause... Oh, and in the 4 Oracle tables used by this query there is no LONG datatype, only NUMBER.
Among the 20 queries I have, only these two have a problem; their structure is similar and I don't believe in coincidences.
Don't despair!
Set the max_long option on the foreign table big enough that all your oversized data fit.
The documentation has the details:
max_long (optional, defaults to "32767")
The maximal length of any LONG, LONG RAW and XMLTYPE columns in the Oracle table. Possible values are integers between 1 and 1073741823 (the maximal size of a bytea in PostgreSQL). This amount of memory will be allocated at least twice, so large values will consume a lot of memory.
If max_long is less than the length of the longest value retrieved, you will receive the error message
ORA-01406: fetched column value was truncated
Example:
ALTER FOREIGN TABLE my_tab OPTIONS (ADD max_long '1000000');
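If max_long has already been set on the foreign table, ADD will complain that the option exists; the standard ALTER FOREIGN TABLE syntax lets you change it with SET instead:
ALTER FOREIGN TABLE my_tab OPTIONS (SET max_long '1000000');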

Oracle Temporary Table to convert a Long Raw to Blob

Questions have been asked in the past that seem to address pieces of my full question, but I'm not finding a complete answer.
Here is the situation:
I'm importing data from an old, but operational and production, Oracle server.
One of the columns is created as LONG RAW.
I will not be able to convert the table to a BLOB.
I would like to use a global temporary table to pull out data each time I call the server.
This feels like a good answer, from here: How to create a temporary table in Oracle
CREATE GLOBAL TEMPORARY TABLE newtable
ON COMMIT PRESERVE ROWS
AS SELECT
MYID, TO_LOB("LONGRAWDATA") "BLOBDATA"
FROM oldtable WHERE .....;
I do not want the table hanging around, and I'd only do a chunk of rows at a time, to pull out the old table in pieces, each time killing the table. Is it acceptable behavior to do the CREATE, then do the SELECT, then DROP?
Thanks..
--- EDIT ---
Just as a follow up, I decided to take an even different approach to this.
Branching the strong-oracle package, I was able to do what I originally hoped to do, which was to pull the data directly from the table without doing a conversion.
Here is the issue I've posted. If I am allowed to publish my code to a branch, I'll post a follow up here for completeness.
Oracle ODBC Driver Release 11.2.0.1.0 says that Prefetch for LONG RAW data types is supported, which is true.
One caveat is that LONG RAW can technically be up to 2GB in size. I had to set a hard max size of 10MB in the code, which is adequate for my use, so far at least. This probably could be a variable sent in to the connection.
This fix is a bit off original topic now however, but it might be useful to someone else.
With Oracle GTTs, it is not necessary to drop and create the table each time, and you don't need to worry about data "hanging around." In fact, it's inadvisable to drop and re-create it. The structure itself persists, but the data in it does not; the data only persists within each session. You can test this by opening two separate clients, loading data with one, and noticing that it's not there in the second client.
In effect, each time you open a session, it's like you are reading a completely different table, which was just truncated.
If you want to empty the table within your stored procedure, you can always truncate it; within a stored proc, you will need EXECUTE IMMEDIATE to do this.
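For instance, the standalone GTT this implies would be created once, up front, with something like the following (the MYID and BLOBDATA columns are taken from your example, and NUMBER for MYID is an assumption):
CREATE GLOBAL TEMPORARY TABLE my_temp_table
(
  myid     NUMBER,
  blobdata BLOB
)
ON COMMIT PRESERVE ROWS;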
This is really handy, but it also can make debugging a bear if you are implementing GTTs through code.
Out of curiosity, why a chunk at a time and not the entire dataset? What kind of data volumes are you talking about?
-- EDIT --
Per our comments conversation, this is very raw and untested, but I hope it will give you an idea of what I mean:
CREATE OR REPLACE PROCEDURE LOAD_DATA
AS
  TOTAL_ROWS number;
  TABLE_ROWS number := 1;
  ROWS_AT_A_TIME number := 100;
BEGIN
  select count(*)
    into TOTAL_ROWS
    from oldtable;
  WHILE TABLE_ROWS <= TOTAL_ROWS
  LOOP
    -- DDL inside PL/SQL has to go through execute immediate
    execute immediate 'truncate table MY_TEMP_TABLE';
    -- TO_LOB is only allowed directly in the select list of an INSERT ... SELECT,
    -- and ROWNUM BETWEEN x AND y returns nothing once x > 1, so the batch is
    -- selected by key through a subquery instead
    insert into MY_TEMP_TABLE
    SELECT MYID, TO_LOB(LONGRAWDATA) as BLOBDATA
      FROM oldtable
     WHERE MYID in (select MYID
                      from (select MYID, ROWNUM as RN
                              from (select MYID from oldtable order by MYID))
                     where RN between TABLE_ROWS
                                  and TABLE_ROWS + ROWS_AT_A_TIME - 1);
    -- push the converted batch across the database link
    insert into MY_REMOTE_TABLE@MY_REMOTE_SERVER
    select * from MY_TEMP_TABLE;
    commit;
    TABLE_ROWS := TABLE_ROWS + ROWS_AT_A_TIME;
  END LOOP;
  commit;
end LOAD_DATA;

Oracle accessing multiple databases

I'm using Oracle SQL Developer version 4.02.15.21.
I need to write a query that accesses multiple databases. All that I'm trying to do is get a list of all the IDs present in "TableX" (there is an instance of TableX in each of these databases, but with different values) in each database and union all of the results together into one big list.
My problem comes with accessing more than 4 databases -- I get this error: ORA-02020: too many database links in use. I cannot change the INIT.ORA file's open_links maximum limit.
So I've tried dynamically opening/closing these links:
SELECT Local.PUID FROM TableX Local
UNION ALL
----
SELECT Xdb1.PUID FROM TableX@db1 Xdb1;
ALTER SESSION CLOSE DATABASE LINK db1
UNION ALL
----
SELECT Xdb2.PUID FROM TableX@db2 Xdb2;
ALTER SESSION CLOSE DATABASE LINK db2
UNION ALL
----
SELECT Xdb3.PUID FROM TableX@db3 Xdb3;
ALTER SESSION CLOSE DATABASE LINK db3
UNION ALL
----
SELECT Xdb4.PUID FROM TableX@db4 Xdb4;
ALTER SESSION CLOSE DATABASE LINK db4
UNION ALL
----
SELECT Xdb5.PUID FROM TableX@db5 Xdb5;
ALTER SESSION CLOSE DATABASE LINK db5
However, this produces 'ORA-02081: database link is not open' on whichever db is being closed last.
Can someone please suggest an alternative or adjustment to the above?
Please provide a small sample of your suggestion with syntactically correct SQL if possible.
If you can't change the open_links setting, you cannot have a single query that selects from all the databases you want to query.
If your requirement is to query a large number of databases via database links, it seems highly reasonable to change the open_links setting. If you have one set of people telling you that you need to do X (query data from a large number of tables) and another set of people telling you that you cannot do X, it almost always makes sense to have those two sets of people talk and figure out which imperative wins.
If you don't have to solve the problem in a single query, then you have options. You can write a bit of PL/SQL, for example, that selects the data from each table in turn and does something with it. Depending on the number of database links involved, it may make sense to write a loop that generates a dynamic SQL statement for each database link, executes it, and then closes that link, as in the sketch below.
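A rough, untested sketch of that loop idea (the link names, the local staging table gather_results, and the PUID column are assumptions based on the question; DBMS_SESSION.CLOSE_DATABASE_LINK releases each link once its rows have been copied and committed):
DECLARE
  TYPE link_list IS TABLE OF VARCHAR2(128);
  v_links link_list := link_list('db1', 'db2', 'db3', 'db4', 'db5');
BEGIN
  -- local rows first
  INSERT INTO gather_results (puid)
  SELECT puid FROM TableX;

  FOR i IN 1 .. v_links.COUNT LOOP
    -- pull the rows from one remote database at a time
    EXECUTE IMMEDIATE
      'INSERT INTO gather_results (puid) SELECT puid FROM TableX@' || v_links(i);
    -- the link cannot be closed while a transaction still references it
    COMMIT;
    DBMS_SESSION.CLOSE_DATABASE_LINK(v_links(i));
  END LOOP;
END;
/
Only one link is ever open at a time, so the open_links limit is never hit, and the user simply queries gather_results afterwards.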
If you need to provide a user with the ability to run a single query that returns all the data, you can write a pipelined table function that implements this sort of loop with dynamic SQL and then let the user query the pipelined table function. This isn't really a single query that fetches the data from all the tables, but it is as close as you're likely to get without modifying the open_links limit.

Improving performance of CLOB insert across a DBLINK in Oracle

I am seeing poor performance in Oracle (11g) when trying to copy CLOBs from one database to another. I have tried several things, but haven't been able to improve this.
The CLOBs are used for gathering report data. This can be quite large on a record by record basis. I am calling a procedure on the remote databases (across a WAN) to build the data, then copying the results back to the database at the corporate headquarters for comparison. The general format is:
CREATE TABLE my_report(the_db VARCHAR2(30), object_id VARCHAR2(30),
final_value CLOB, CONSTRAINT my_report_pk PRIMARY KEY (the_db, object_id));
To gain performance, I accumulate the results for remote sites into remote copies of the table. At the end of the procedure run, I try to copy the data back. This query is very simple:
INSERT INTO my_report SELECT * FROM my_report@europe;
The performance that I am seeing is around 9 rows per second, with an average CLOB size of 3500 bytes. (I am using CLOBs as this size often goes above 4k, the VARCHAR2 limit.) For 70,000 records (not uncommon) this takes around 2 hours to transfer. I have tried the create table as select method, but this gets the same performance. I also spent more than a few hours tuning SQL*Net, but see no improvement from this. Changing the arraysize does not improve the performance (though it can reduce it if the value is lowered).
I am able to get a copy over using the old exp/imp methods (export the table from remote, import it back in), which runs much faster, but this is fairly manual for my automated report. I have considered trying to write a pipelined function to select this data from, using it to split the CLOBS into BYTE/VARCHAR2 chunks (with an additional chunk number column), but didn't want to do this if someone had tried it and found a problem.
Thanks for your help.
I was able to get better performance when increasing arraysize to 1500 or higher. See also attached document: http://www.fors.com/velpuri2/PERFORMANCE/SQLNET.pdf

Oracle DBMS package command to export table content as INSERT statement

Is there any subprogram similar to DBMS_METADATA.GET_DDL that can actually export the table data as INSERT statements?
For example, using DBMS_METADATA.GET_DDL('TABLE', 'MYTABLE', 'MYOWNER') will export the CREATE TABLE script for MYOWNER.MYTABLE. Any such things to generate all data from MYOWNER.MYTABLE as INSERT statements?
I know that, for instance, TOAD for Oracle or SQL Developer can export INSERT statements pretty fast, but I need a more programmatic way of doing it. Also, I cannot create any procedures or functions in the database I'm working on.
Thanks.
As far as I know, there is no Oracle supplied package to do this. And I would be skeptical of any 3rd party tool that claims to accomplish this goal, because it's basically impossible.
I once wrote a package like this, and quickly regretted it. It's easy to get something that works 99% of the time, but that last 1% will kill you.
If you really need something like this, and need it to be very accurate, you must tightly control what data is allowed and what tools can be used to run the script. Below is a small fraction of the issues you will face:
Escaping
Single inserts are very slow (especially if it goes over a network)
Combining inserts is faster, but can run into some nasty parsing bugs when you start inserting hundreds of rows
There are many potential data types, including custom ones. You may only have NUMBER, VARCHAR2, and DATE now, but what happens if someone adds RAW, BLOB, BFILE, nested tables, etc.?
Storing LOBs requires breaking the data into chunks because of VARCHAR2 size limitations (4000 or 32767, depending on how you do it).
Character set issues - This will drive you ¿¿¿¿¿¿¿ insane.
Environment limitations - For example, SQL*Plus does not allow more than 2500 characters per line, and will drop whitespace at the end of your line.
Referential Integrity - You'll need to disable these constraints or insert data in the right order.
"Fake" columns - virtual columns, XML lobs, etc. - don't import these.
Missing partitions - If you're not using INTERVAL partitioning you may need to manually create them.
Novalidated data - Just about any constraint can be violated, so you may need to disable everything.
If you want your data to be accurate you just have to use the Oracle utilities, like data pump and export.
Why don't you use regular export?
If you must, you can generate the export script:
Let's assume a table myTable(Name VARCHAR(30), AGE NUMBER, Address VARCHAR(60)).
select 'INSERT INTO myTable values (''' || Name || ''',' || AGE || ',''' || Address || ''');' from myTable;
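If the character columns can themselves contain single quotes, those also need to be doubled up, e.g. something along these lines:
select 'INSERT INTO myTable values ('''
       || REPLACE(Name, '''', '''''') || ''','
       || AGE || ','''
       || REPLACE(Address, '''', '''''') || ''');'
from myTable;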
Oracle SQL Developer does that with its Export feature: DDL as well as the data itself.
It can be a bit inconvenient for huge tables and is likely to run into the cases mentioned above, but it works well 99% of the time.
