Benefits of External Tables vs. UTL_FILE - oracle

I am writing an application in PL/SQL that takes a .csv flat-file, reads it, does some data processing on it, and then decides which of several tables to update, insert into, or delete.
I have the option of using the UTL_FILE.GET_LINE functionality to process a single record at a time, parsing it with various REGEX tools, storing the data temporarily in some variables, and then doing work with it (making decisions, updating tables, etc.)
I ALSO have the option of creating an external table and then stepping through it with a cursor on that external table (using a FOR loop for performance). I should still be able to do all of the same things with the data (making decisions, updating tables, etc.).
I have looked around, and a couple of forums suggest that external tables are the preferred solution because they scale better, are faster, and are more reliable. I have not, however, heard why. Oracle's documentation on UTL_FILE and external tables does not explain why one might be faster than the other, so I'm curious whether anyone has more information or references than I do about what would make one perform better than the other.

The performance difference is quite simple: UTL_FILE is a PL/SQL package, while external tables use the SQL*Loader code written in C.
If you have enough data, you can even load external tables in parallel with minimal effort, for instance: ALTER TABLE my_external_table PARALLEL 4;
External tables can be used in bulk mode (INSERT INTO my_table SELECT ... FROM my_external_table JOIN my_lookup_table USING (lookup_column)).
External tables can be set to transactionally safe mode (REJECT LIMIT 0), so the above INSERT either works or rolls back.
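Putting those three points together, a minimal sketch (the directory, file, table, and column names are made up for illustration):

CREATE TABLE my_external_table (
  lookup_column VARCHAR2(30),
  amount        NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir            -- a DIRECTORY object pointing at the .csv location
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('input.csv')
)
REJECT LIMIT 0;                          -- any bad record fails the whole load

ALTER TABLE my_external_table PARALLEL 4;

INSERT /*+ APPEND */ INTO my_table       -- direct-path bulk insert
SELECT lookup_column, amount
FROM   my_external_table
JOIN   my_lookup_table USING (lookup_column);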
Do you need more reasons?

If the file has a known structure/file format, then an external table is the way to go. UTL_FILE sits at a lower abstraction level - you are just working with a raw file - so a UTL_FILE-based parser will be brittle and likely introduce bugs. The deciding factor should not be performance; that said, I doubt you will be able to 'outperform' Oracle's external table implementation by rolling your own parser with REGEX and UTL_FILE.
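For comparison, the UTL_FILE route the question describes ends up looking something like this (the directory name, file name, and record layout are assumptions):

DECLARE
  l_file  UTL_FILE.FILE_TYPE;
  l_line  VARCHAR2(32767);
  l_name  VARCHAR2(100);
  l_value NUMBER;
BEGIN
  l_file := UTL_FILE.FOPEN('DATA_DIR', 'input.csv', 'r');
  LOOP
    BEGIN
      UTL_FILE.GET_LINE(l_file, l_line);
    EXCEPTION
      WHEN NO_DATA_FOUND THEN EXIT;  -- end of file
    END;
    -- hand-rolled CSV parsing: quoting, embedded commas and bad numbers all have
    -- to be handled here, which is where the brittleness comes from
    l_name  := REGEXP_SUBSTR(l_line, '[^,]+', 1, 1);
    l_value := TO_NUMBER(REGEXP_SUBSTR(l_line, '[^,]+', 1, 2));
    -- ... decide which table to insert into, update, or delete from ...
  END LOOP;
  UTL_FILE.FCLOSE(l_file);
END;
/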

Related

Hive Managed vs External tables maintainability

Which one is better (performance-wise and for day-to-day operations in the long run) for maintaining loaded data, managed or external?
By maintaining, I mean that these tables will frequently see the following operations on a daily basis:
Selects using partitions most of the time, though some queries do not use them.
Deletes of specific records, not a whole partition (for example, a problem is found in some columns and those rows need to be deleted and inserted again). I am not sure this is supported for normal tables unless transactional tables are used.
Most importantly, the need to merge files frequently, maybe twice a day, to combine small files and end up with fewer mappers. I know CONCATENATE is available for managed tables and INSERT OVERWRITE for external ones; which one costs less?
It depends on your use case. External tables are recommended when the data is shared across multiple applications, for example when Pig or some other tool is used alongside Hive to process the same data; in that kind of scenario external tables are mainly recommended. They are best suited when you are mostly reading data.
With managed tables, on the other hand, Hive has complete control over the data. Note that you can convert any external table to managed and vice versa:
alter table table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
Since in your case you are modifying the data frequently, it is better for Hive to have total control over the data, so in this scenario managed tables are recommended.
Apart from that, managed tables are more secure than external tables, because external tables can be accessed by anyone. With managed tables you can implement Hive-level security, which gives better control, whereas with external tables you have to rely on HDFS-level security.
You can refer to the link below, which gives a few pointers on the considerations:
External Vs Managed tables comparison
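As for the small-file merge the question raises, the two options it mentions look roughly like this (table, partition, and column names are hypothetical; CONCATENATE only applies to formats such as ORC/RCFile, and the self-rewriting INSERT OVERWRITE should be verified on your Hive version):

-- managed table stored as ORC/RCFile: merge a partition's small files in place
ALTER TABLE my_managed_table PARTITION (load_date='2023-01-01') CONCATENATE;

-- external table: rewrite the partition onto itself so its files get consolidated
INSERT OVERWRITE TABLE my_external_table PARTITION (load_date='2023-01-01')
SELECT col1, col2
FROM my_external_table
WHERE load_date='2023-01-01';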

How can I load a large amount of data into an Oracle database from a .csv file without risking dropping or mismatching data?

I'm in the middle of trying to migrate a large amount of data into an Oracle database from existing Excel files.
Due to the large number of rows loaded each time (10,000 and more), it is not possible to use SQL Developer for this task.
In every worksheet there is data that needs to go into different tables, while at the same time keeping the relations intact and not dropping any data.
For now, I use one .CSV file for each table and map them together afterwards. This, though, comes with a great risk of adding the wrong FK and thereby screwing up the whole thing. And I don't have the time, energy or will for clean-ups, even if it is my own mess…
My initial thought was whether I could bulk transfer with SQL*Loader using some kind of PL/SQL script, maybe in a .ctl file (the one used for mapping the properties), but it seems like I'm quite out in the bush with that one… (or am I…?)
The other thought was to create a simple program in C# using FastMember and load the database that way. (But that means I need to take the time to actually write the program, however small it is.)
I can't possibly be the only one who has had this issue, but trying to use my notToElevatedNinjaGoogling skills ends up with either SQL Developer (which is not an alternative) or the bulk copy thing from SQL*Loader (where I need to map it all together afterwards).
Are there any alternative solutions to my problem, or are the above the ones I need to cope with?
Did you consider using CSV files as external tables? As they act as if they were ordinary Oracle tables, you can write (PL/)SQL against them, inserting data into different tables in the target schema. That might give you some more freedom & control over what you are doing.
Behind the scenes, it is still SQL*Loader.
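A minimal sketch of that idea, assuming one external table per worksheet and made-up table, sequence, and column names, with the generated parent key captured so the child rows cannot end up with the wrong FK:

DECLARE
  l_parent_id parent_table.parent_id%TYPE;
BEGIN
  FOR r IN (SELECT parent_name, child_name, child_value
            FROM   ext_worksheet1)           -- external table over one of the .csv files
  LOOP
    -- create the parent row and capture its generated key
    -- (a real load would first check whether the parent already exists)
    INSERT INTO parent_table (parent_id, parent_name)
    VALUES (parent_seq.NEXTVAL, r.parent_name)
    RETURNING parent_id INTO l_parent_id;

    -- the child row uses the key we just captured, so the FK cannot be mismatched
    INSERT INTO child_table (child_id, parent_id, child_name, child_value)
    VALUES (child_seq.NEXTVAL, l_parent_id, r.child_name, r.child_value);
  END LOOP;
  COMMIT;
END;
/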

Advantages of temporary tables in Oracle

I've been trying to figure out what performance impact the use of temporary tables has on an Oracle database. We want to use these tables in our ETL process to save temporary results. At the moment we use physical tables for this purpose and truncate them at the beginning of the ETL process. I know that the truncate step is very expensive, so I wondered whether it would be better to use temporary tables instead.
Does anyone have experience with whether there is a performance boost from using temporary tables in this scenario?
I have only found answers regarding SQL Server, like in this question, but I don't know whether those recommendations also apply to Oracle.
It would be nice if someone could list the advantages and disadvantages of this feature and also point out the scenarios in which it is applicable.
Thanks in advance.
First of all: TRUNCATE is not expensive; it is a DELETE with no condition that is very expensive.
Second: do your temporary tables have indexes? What about foreign keys?
Those can affect performance.
Temporary tables work more or less like in SQL Server (of course the syntax is different, e.g. GLOBAL TEMPORARY TABLE), and in both databases they are just tables.
You won't get any performance gain with temporary tables over normal tables; they are essentially the same: they have a definition in the database, can have indexes, and are logged.
The only difference is that a temporary table is exclusive to your session (except for global tables), which means that if multiple scripts from multiple sessions refer to the same table, each one is reading/writing a different set of rows and they cannot lock each other (in that case you could gain performance, but I think it's rarely the case).
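For reference, an Oracle global temporary table for the ETL staging step in the question might look like this (names are made up); in Oracle the definition is shared but each session only sees its own rows:

CREATE GLOBAL TEMPORARY TABLE etl_stage (
  source_id   NUMBER,
  source_name VARCHAR2(100),
  amount      NUMBER
)
ON COMMIT PRESERVE ROWS;   -- rows survive commits but disappear when the session ends;
                           -- use ON COMMIT DELETE ROWS to clear them at each commit

-- each session works only with its own rows, so no TRUNCATE of a shared physical table is needed
INSERT INTO etl_stage (source_id, source_name, amount)
SELECT id, name, amount
FROM   source_table;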

Can full information about an Oracle schema data-model be selected atomically?

I'm instantiating a client-side representation of an Oracle Schema data-model in custom Table/Column/Constraint/Index data structures, in C/C++ using OCI. For this, I'm selecting from:
all_tables
all_tab_comments
all_col_comments
all_cons_columns
all_constraints
etc...
And then I'm using OCI to describe all tables, for precise information about column types. This is working, but our CI testing farm is often failing inside this schema data-model introspection code, because another test running in parallel is creating/deleting tables in the middle of this series of queries and describe calls I'm making.
My question is thus how can I introspect this schema atomically such that another session does not concurrently change that very schema I'm instropecting?
Would using a read-only serializable transaction around the selects and describes be enough? I.e. does MVCC apply to Oracle's data dictionary? What would be the likelihood of Snapshot Too Old errors on such system dictionaries?
If full atomicity is not possible, are there steps I could take to minimize the possibility of getting inconsistent / stale info?
I was thinking maybe left-joins to reduce the number of queries, and/or replacing the OCIDescribeAny() calls with other dictionary accesses joined to other tables, to get all table/column info in a single query each?
I'd appreciate some expert input on this concurrency issue. Thanks, --DD
A typical read-write conflict. Off the top of my head I see two ways around it:
Use the DBMS_LOCK package in both the "introspection" code and the "another test" code.
Rewrite your introspection query so that it returns everything you need in one go. There are multiple ways to do that:
Use XMLAGG and the like.
Use LISTAGG and get one big string or CLOB.
Just use a bunch of UNIONs to get one result set, as a single statement is guaranteed to be read-consistent.
Hope that helps.
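For the last point, something along these lines (the :schema_name bind is a placeholder); because it is a single statement, all branches see the same read-consistent snapshot of the dictionary:

SELECT 'TABLE' AS object_kind, table_name AS name, CAST(NULL AS VARCHAR2(128)) AS detail
FROM   all_tables
WHERE  owner = :schema_name
UNION ALL
SELECT 'COLUMN', table_name || '.' || column_name, data_type
FROM   all_tab_columns
WHERE  owner = :schema_name
UNION ALL
SELECT 'CONSTRAINT', constraint_name, constraint_type
FROM   all_constraints
WHERE  owner = :schema_name;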

External Tables vs SQLLoader

So, I often have to load data into holding tables to run some data validation checks and then return the results.
Normally, I create the holding table, then a sqlldr control file and load the data into the table, then I run my queries.
Is there any reason I should be using external tables for this instead?
In what way will they make my life easier?
The big advantage of external tables is that we can query them from inside the database using SQL. So we can just run the validation checks as SELECT statements without the need for a holding table. Similarly if we need to do some manipulation of the loaded data it is almost always easier to do this with SQL rather than SQLLDR commands. We can also manage data loads with DBMS_JOB/DBMS_SCHEDULER routines, which further cuts down the need for shell scripts and cron jobs.
However, if you already have a mature and stable process using SQLLDR then I concede it is unlikely you would realise tremendous benefits from porting to external tables.
There are also some cases - especially if you are loading millions of rows - where the SQLLDR approach may be considerably faster. However, the difference will not be as marked with more recent versions of the database. I fully expect that SQLLDR will eventually be deprecated in favour of external tables.
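For instance, a simple validation check can be run straight against the external table, with no holding table involved (the table name and rule here are hypothetical):

-- rows whose key appears more than once in the incoming file
SELECT lookup_column, COUNT(*) AS occurrences
FROM   my_external_table
GROUP  BY lookup_column
HAVING COUNT(*) > 1;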
If you look at the External Table syntax, it looks suspiciously like SQL*Loader control file syntax :-)
If your external table is going to be repeatedly used in multiple queries it might be faster to load a table (as you're doing now) rather than rescan your external table for each query. As #APC notes, Oracle is making improvements in them, so depending on your DB version YMMV.
I would use external tables for their flexibility.
It's easier to modify their data source to point to a different file: alter table ... location ('my_file.txt1','myfile.txt2') (see the sketch after this list).
You can do multitable inserts, merges, run it through a pipelined function etc...
Parallel query is easier ...
It also establishes dependencies better ...
The code is stored in the database so it's automatically backed up ...
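A short sketch of the first two points above, with hypothetical table, column, and file names:

-- repoint the same external table at this week's files
ALTER TABLE my_external_table LOCATION ('week_01.txt', 'week_02.txt');

-- multitable insert straight from the external table
INSERT ALL
  WHEN amount >= 0 THEN INTO credits (lookup_column, amount) VALUES (lookup_column, amount)
  WHEN amount <  0 THEN INTO debits  (lookup_column, amount) VALUES (lookup_column, amount)
SELECT lookup_column, amount
FROM   my_external_table;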
Another thing that you can do with external tables is read compressed files. If your files are gzip compressed for example, then you can use the PREPROCESSOR directive within your external table definition, to decompress the files as they are read.
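A sketch of what that looks like in the access parameters (the directory objects and the use of zcat are assumptions about the server setup):

CREATE TABLE my_gzipped_ext_table (
  lookup_column VARCHAR2(30),
  amount        NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR exec_dir:'zcat'   -- exec_dir is a DIRECTORY object containing the zcat binary
    FIELDS TERMINATED BY ','
  )
  LOCATION ('input.csv.gz')
)
REJECT LIMIT UNLIMITED;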
