How can I use Oracle Preprocessor for External Tables to consume this type of format?

Suppose I have a custom file format, which can be analogous to N tables. Let's pick 3. I could transform the file, writing a custom load wrapper to fill 3 database tables.
But suppose for space and resource constraints, I can't store all of this in the tablespace.
Can I use Oracle Preprocessor for External Tables to transform the custom file three different ways?
The examples of use I have read give gzip'd text files as an example. But that is a one-to-one file-to-table relationship, with only one transform.
I have a single file with N possible extractions of data.
Would I need to define N external tables, each referencing a different program?
If I map three tables to the same file, how will this affect performance? (Access is mostly or all reads, few or no writes).
Also, what format does the standard output of my preprocessor have to be? Must it be CSV, or are there ways to configure the external table driver?

"If I map three tables to the same
file, how will this affect
performance? (Access is mostly or all
reads, few or no writes"
There should be little or no difference between three sessions accessing the same file through one external table definition or three external table definitions.
External tables aren't cached by the database (might be by the file system or disk), so any access is purely physical reads.
Depending on the pre-processor program, there might be some level of serialization there (or you may use a pre-processor program to impose serialization).
Performance-wise, you'd be better off having a single session scan the external file/table and load it into one or more database tables. The other sessions then read from those tables, which are cached in the SGA. Also, you can index a database table so you don't have to read all of it.
You may be able to use multi-table inserts to load multiple database tables from a single external table definition in a single pass.
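For illustration, a minimal sketch of such a conditional multi-table insert; the external table ext_feed, the staging tables and all column names are hypothetical:

-- One pass over the file: route each record to the right staging table
INSERT ALL
  WHEN rec_type = 'ORD' THEN
    INTO stage_orders (order_id, order_text) VALUES (id_col, text_col)
  WHEN rec_type = 'ITM' THEN
    INTO stage_items  (item_id,  item_text)  VALUES (id_col, text_col)
SELECT rec_type, id_col, text_col
FROM   ext_feed;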
"what format does the standard output
of my preprocessor have to be? Must it
be CSV, or are there ways to configure
the external table driver?"
It pretty much follows SQL*Loader, and both are in the Utilities manual. You can use fixed format or other delimiters.
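As a rough sketch, the access parameters read much like a loader control file; the directory object, file name, table and columns below are all made up:

CREATE TABLE ext_feed (
  rec_type  VARCHAR2(3),
  id_col    NUMBER,
  text_col  VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir          -- hypothetical directory object
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY '|'          -- any delimiter; fixed-width POSITION specs also work
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('custom_feed.dat')        -- hypothetical file name
)
REJECT LIMIT UNLIMITED;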
"Would I need to define N external tables, each referencing a different program?"
Depends on how the data is interleaved. Ignoring pre-processors, you can have different external tables pulling different columns from the same file or use the LOAD WHEN clause to determine which records to include or exclude.
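For example, a sketch of two external tables over the same hypothetical file, each keeping only its own record type via LOAD WHEN:

-- First extraction: only 'ORD' records from custom_feed.dat
CREATE TABLE ext_orders (
  rec_type VARCHAR2(3),
  order_id NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    LOAD WHEN (rec_type = 'ORD')   -- include only order records
    FIELDS TERMINATED BY '|'
  )
  LOCATION ('custom_feed.dat')
)
REJECT LIMIT UNLIMITED;
-- ext_items would be defined the same way against the same file,
-- with LOAD WHEN (rec_type = 'ITM') and its own column list.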

Related

Hive Managed vs External tables maintainability

Which one is better (performance wise and operation on the long run) in maintaining data loaded, managed or external?
And by maintaining, I mean that these tables will frequently have the following operations on a daily basis:
Selects using partitions most of the time, but for some queries partitions are not used.
Deletes of specific records, not the whole partition (for example, a problem is found in some columns and those rows need to be deleted and inserted again). I am not sure if this is supported for normal tables unless they are transactional.
Most important, the need to merge files frequently, maybe twice a day, to combine small files and end up with fewer mappers. I know CONCATENATE is available on managed tables and INSERT OVERWRITE on external ones (see the sketch below); which one costs less?
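For reference, the two compaction approaches being compared look roughly like this; table names, partition column and values are made up:

-- Managed (ORC) table: merge the partition's small files in place
ALTER TABLE managed_tbl PARTITION (load_date='2020-01-01') CONCATENATE;

-- External table: rewrite the partition onto itself to compact its files
INSERT OVERWRITE TABLE external_tbl PARTITION (load_date='2020-01-01')
SELECT col1, col2 FROM external_tbl WHERE load_date='2020-01-01';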
It depends on your use case. External tables are recommended when the data is shared across multiple applications, for example when Pig or another tool is used alongside Hive to process the same files; they are mainly used when you are only reading the data.
With managed tables, on the other hand, Hive has complete control over the data. You can still convert any external table to managed and vice versa:
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');   -- managed to external
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');  -- external to managed
Since in your case you are doing frequent modifications to the data, it is better for Hive to have total control over it. In this scenario, managed tables are recommended.
Apart from that, managed tables are more secure than external tables, because an external table's files can be accessed by anyone. With managed tables you can implement Hive-level security, which gives better control; with external tables you have to implement HDFS-level security instead.
You can refer to the link below, which gives a few more pointers to consider:
External Vs Managed tables comparison

Benefits of External Tables vs. UTL_FILE

I am writing an application in PL/SQL that takes a .csv flat-file, reads it, does some data processing on it, and then decides which of several tables to update, insert into, or delete.
I have the option of using the UTL_FILE.GET_LINE functionality to process a single record at a time, parsing it with various REGEX tools, storing the data temporarily in some variables, and then doing work with it (making decisions, updating tables, etc.)
I ALSO have the option of creating an external table and then just stepping through it using a cursor on said external table (with a cursor FOR loop for performance). I should still be able to do all of the same things with the data (making decisions, updating tables, etc.).
I have looked around, and a couple of forums suggest that external tables are the preferred solution, as they scale better, are faster, and are more reliable. I have not, however, heard why. Oracle's documentation on UTL_FILE and/or external tables does not discuss why one might be faster than the other, so I'm curious if anyone has more information or references than I do about what would make one perform better than the other.
The performance difference is quite simple: UTL_FILE is a PL/SQL package, while external tables use the SQL*Loader code written in C.
If you have enough data, you can even load from external tables in parallel with minimal effort, for instance: ALTER TABLE my_external_table PARALLEL 4;
External tables can be used in bulk mode (INSERT INTO my_table SELECT ... FROM my_external_table JOIN my_lookup_table USING (lookup_column)).
External tables can be set to transactionally safe mode (REJECT LIMIT 0), so the above INSERT either works or rolls back.
Do you need more reasons?
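Putting those points together, a minimal sketch under the assumption that my_external_table, my_table and my_lookup_table exist with the columns shown:

-- Parallel scan of the external file, and fail the whole statement on any bad row
ALTER TABLE my_external_table PARALLEL 4;
ALTER TABLE my_external_table REJECT LIMIT 0;

-- Bulk load with a lookup join in a single SQL statement
INSERT INTO my_table (id, descr, lookup_value)
SELECT e.id, e.descr, l.lookup_value
FROM   my_external_table e
JOIN   my_lookup_table   l ON l.lookup_column = e.lookup_column;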
If the file's data has a known structure/file format, then an external table is the way to go. UTL_FILE is at a different abstraction level - you are just working with a raw file - so your use of UTL_FILE will be brittle and likely introduce bugs. The deciding factor should not be performance; however, I doubt you will be able to 'outperform' Oracle's external table implementation by rolling your own parser with REGEX and UTL_FILE.

Obtaining number of rows written to multiple targets in Informatica Power Center

I have a requirement where I need to obtain the number of rows written to multiple targets in my mapping. There are 3 targets in my mapping (T1, T2 and T3). I need the number of rows written to each target separately. These values need to be used in subsequent sessions.
I understand that there is a method where I can use separate counters and write them to a flat file and perform a lookup on this file in subsequent mappings. However, I am looking for a direct and better approach to this problem.
You can use the $PMTargetName#numAffectedRows built-in variables. In your case it would be something like
$PMT1#numAffectedRows
$PMT2#numAffectedRows
$PMT3#numAffectedRows
Please refer to An ETL Framework for Operational Metadata Logging for details.

How do I use sqlldr to load data from multiple files at once?

I need to load data into an Oracle DB using SQLLDR, but I need to pull parts of my table from two different INFILEs, using different field positions in each of those infiles. How can I do that?
You can certainly load data from multiple files into a single table and write a control file to do that, but the format of the files should be the same. Still, running two separate jobs would be a better option. Doing a little bit of research would help; I have done many extra things using SQL*Loader.
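If the layouts do match and you combine them, a control-file sketch listing both files could look like this (file, table and column names are hypothetical):

LOAD DATA
INFILE 'file_a.dat'
INFILE 'file_b.dat'
APPEND
INTO TABLE target_table
( col1 POSITION(1:10)  CHAR,
  col2 POSITION(11:20) CHAR )
-- Both infiles are read with the same field positions, which is exactly why
-- two separate jobs are simpler when the positions differ per file.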
Sounds like two separate jobs would be simplest.
Depending on the file definitions, it may be possible to use a single job. See this for an idea (except you'd actually have the two record formats loading into the same table rather than different tables).

External Tables vs SQLLoader

So, I often have to load data into holding tables to run some data validation checks and then return the results.
Normally, I create the holding table, then a sqlldr control file and load the data into the table, then I run my queries.
Is there any reason I should be using external tables for this instead?
In what way will they make my life easier?
The big advantage of external tables is that we can query them from inside the database using SQL. So we can just run the validation checks as SELECT statements without the need for a holding table. Similarly if we need to do some manipulation of the loaded data it is almost always easier to do this with SQL rather than SQLLDR commands. We can also manage data loads with DBMS_JOB/DBMS_SCHEDULER routines, which further cuts down the need for shell scripts and cron jobs.
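For instance, a validation check can be run straight against the external table, with no holding table involved; the table and column names here are invented:

-- Flag suspect rows directly from the flat file
SELECT id, amount
FROM   ext_csv_data
WHERE  amount < 0
   OR  id IS NULL;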
However, if you already have a mature and stable process using SQLLDR then I concede it is unlikely you would realise tremendous benefits from porting to external tables.
There are also some cases - especially if you are loading millions of rows - where the SQLLDR approach may be considerably faster. However, the difference will not be as marked with more recent versions of the database. I fully expect that SQLLDR will eventually be deprecated in favour of external tables.
If you look at the External Table syntax, it looks suspiciously like SQL*Loader control file syntax :-)
If your external table is going to be repeatedly used in multiple queries it might be faster to load a table (as you're doing now) rather than rescan your external table for each query. As #APC notes, Oracle is making improvements in them, so depending on your DB version YMMV.
I would use external tables for their flexibility.
It's easier to modify the data source on them to point at a different file, e.g. ALTER TABLE ... LOCATION ('my_file.txt1','myfile.txt2') (see the sketch after this list)
You can do multitable inserts, merges, run it through a pipelined function etc...
Parallel query is easier ...
It also establishes dependencies better ...
The code is stored in the database so it's automatically backed up ...
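A couple of those points as a rough sketch; ext_data, customers and their columns are hypothetical, only the file names come from the answer above:

-- Repoint the same external table at new files without recreating it
ALTER TABLE ext_data LOCATION ('my_file.txt1', 'myfile.txt2');

-- Merge the file's contents straight into a target table
MERGE INTO customers c
USING ext_data f
ON    (c.customer_id = f.customer_id)
WHEN MATCHED THEN
  UPDATE SET c.customer_name = f.customer_name
WHEN NOT MATCHED THEN
  INSERT (customer_id, customer_name) VALUES (f.customer_id, f.customer_name);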
Another thing that you can do with external tables is read compressed files. If your files are gzip compressed for example, then you can use the PREPROCESSOR directive within your external table definition, to decompress the files as they are read.
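A sketch of that PREPROCESSOR clause, which is also the hook the original question asks about: the program can be any executable that writes the transformed records to standard output. Directory objects, file name, table and columns below are assumptions:

CREATE TABLE ext_gz_data (
  id    NUMBER,
  descr VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR exec_dir:'zcat'     -- decompress (or otherwise transform) the file to stdout
    FIELDS TERMINATED BY ','
  )
  LOCATION ('my_data.csv.gz')
)
REJECT LIMIT UNLIMITED;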
