Checking string length on a SQL*Loader input field - Oracle

I'm reading data from a fixed-length text file and loading it into a table with fixed-length processing.
I want to check the input line length so that I can discard records that don't match the fixed length and log them into an error table.
Example:
Load into the Input_Log table if the line meets the specified length.
Load into the Input_Error_Log table if the input line length is less than or greater than the fixed line length.

I believe you would be better served by bulk loading your data into a staging table, then loading into the production table from there via a stored procedure, where you can apply rules via normal PL/SQL and DML to your heart's content. This is a typical best practice anyway.
sqlldr isn't really the tool to get too complicated in, even if you could do what you want. Maintainability and restartability become more complicated when you add complexity to a tool that's really designed for bulk loading. Add the complexity to a proper program.
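For illustration, here is a minimal sketch of that staging approach, assuming each whole input line is bulk loaded into a single-column staging table first; the table and column names (stg_input, raw_line, line_text, line_length) and the fixed length of 100 are all assumptions:

-- hypothetical staging table: sqlldr loads each line as-is into one column
CREATE TABLE stg_input (
    raw_line VARCHAR2(4000)
);

DECLARE
    c_fixed_len CONSTANT PLS_INTEGER := 100;  -- assumed fixed record length
BEGIN
    -- lines of exactly the expected length go to the good table
    INSERT INTO input_log (line_text)
    SELECT raw_line
      FROM stg_input
     WHERE LENGTH(raw_line) = c_fixed_len;

    -- everything else (shorter, longer, or empty) goes to the error table
    INSERT INTO input_error_log (line_text, line_length)
    SELECT raw_line, LENGTH(raw_line)
      FROM stg_input
     WHERE LENGTH(raw_line) IS NULL
        OR LENGTH(raw_line) <> c_fixed_len;

    COMMIT;
END;
/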
Let us know what you come up with.

Related

Is there any way to handle source flat file with dynamic structure in informatica power center?

I have to load a flat file using Informatica PowerCenter whose structure is not static; the number of columns may change in future runs.
In the sample source file I have 4 columns right now, but in the future I could get only 3 columns, or I may get a new set of columns as well. I can't go and change the code every time in production; I have to use the same code and handle this situation.
Is there any way to handle this scenario? PL/SQL and Unix would also work here.
I can see two ways to do it. The only requirement is that the source should decide on a future structure and stick to it. If someone decides tomorrow to change the structure, data types, or lengths, the mapping will not work properly.
Solutions:
1. Create extra columns in the source towards the end. If you have 5 columns now, extra columns after the 5th column will be pulled in as blank. Create as many as you want, but please note you need to transform them as per the future structure and load them into the proper place in the target.
2. This is similar to the solution above, but in this case read the line as a single column in the source and Source Qualifier, as a large string of length 40000.
Then split the columns on the delimiter in an Informatica Expression transformation. The splitting can be done by following the thread below; a rough SQL sketch is also given after the link. This can also be tricky if you have hundreds of columns.
Split Flat File String into multiple columns in Informatica
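Since PL/SQL is acceptable per the question, here is a rough SQL sketch of the splitting idea (outside Informatica): the whole line is held as one large string and split on its delimiter with REGEXP_SUBSTR. The table and column names (stg_lines, whole_line) and the comma delimiter are assumptions:

-- assumed staging table holding each input line as one large string;
-- the 6-argument REGEXP_SUBSTR form returns the n-th comma-separated field
-- and simply yields NULL when that field is missing from the line
SELECT REGEXP_SUBSTR(whole_line, '([^,]*)(,|$)', 1, 1, NULL, 1) AS col1,
       REGEXP_SUBSTR(whole_line, '([^,]*)(,|$)', 1, 2, NULL, 1) AS col2,
       REGEXP_SUBSTR(whole_line, '([^,]*)(,|$)', 1, 3, NULL, 1) AS col3,
       REGEXP_SUBSTR(whole_line, '([^,]*)(,|$)', 1, 4, NULL, 1) AS col4
  FROM stg_lines;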

Hadoop grep on SQL dumps

I want to use Apache Hadoop to parse large files (~20 MB each). These files are PostgreSQL dumps (i.e. mostly CREATE TABLE and INSERT statements). I just need to filter out anything that is not a CREATE TABLE or INSERT INTO in the first place.
So I decided to use the grep map reduce with ^(CREATE TABLE|INSERT).*;$ pattern (lines starting with CREATE TABLE or INSERT and ending with a ";").
My problem is that some of these CREATE and INSERT statements span multiple lines (because the schema is really large, I guess), so the pattern can't match them at all, for example:
CREATE TABLE test(
    "id"...,
    "name"...
);
I guess I could write a MapReduce job to refactor each INSERT and CREATE onto one line, but that would be really costly because the files are large. I could also remove all "\n" from the file, but then a single map operation would have to handle multiple CREATE/INSERT statements, making the balance of work across mappers really poor. I'd really like one map operation per INSERT or CREATE.
I'm not responsible for the creation of the dump files so I cannot change the layout of the initial dump files.
I actually have no clue what the best solution is; I could use some help :). I can provide any additional information if needed.
First things first:
20 MB files are NOT large files by Hadoop standards; you will probably have many files (unless you only have a tiny amount of data), so there should be plenty of parallelization possible.
As such, having one mapper per file could very well be an excellent solution, and you may even want to concatenate files to reduce overhead.
That being said:
If you don't want to handle all lines at once, and handling a single line at once is insufficient, then the only straightforward solution would be to handle 'a few' lines at once, for example 2 or 3.
An alternative solution would be to chop the file up and use one map per file part, but then you either need to deal with the edges, or accept that your solution may not cleanly handle one of the desired statements at a boundary.
I realize that this is still quite a conceptual answer, but based on your progress so far, I feel that this may be sufficient to get you there.

Oracle PL/SQL: choosing the update/merge column dynamically

I have a table with data relating to several moments in time that I have to keep updated. To save space and time, however, each row in my table refers to a given day, and the hourly and quarter-hourly data for that day are scattered throughout the several columns in that same row. When updating the data for a particular moment in time, I therefore must choose the column that has to be updated through some programming logic in my PL/SQL procedures and functions.
Is there a way to dynamically choose the column or columns involved in an update/merge operation without having to assemble the query string anew every time? Performance is a concern and the throughput must be high, so I can't do anything that would perform poorly.
Edit: I am aware of normalization issues. However, I would still like to know a good way to choose the columns to be updated/merged dynamically and programmatically.
The only way to dynamically choose what column or columns to use for a DML statement is to use dynamic SQL. And the only way to use dynamic SQL is to generate a SQL statement that can then be prepared and executed. Of course, you can assemble the string in a more or less efficient manner, you can potentially parse the statement once and execute it multiple times, etc. in order to minimize the expense of using dynamic SQL. But using dynamic SQL that performs close to what you'd get with static SQL requires quite a bit more work.
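As a point of reference, here is a minimal sketch of the dynamic SQL route, where only the column name is spliced into the statement text and the values are bound so the statement can be reused; the table and column names (readings, reading_day, Q_0815) are made up for illustration:

DECLARE
    l_col_name VARCHAR2(30) := 'Q_0815';     -- chosen by your own logic
    l_value    NUMBER       := 42;
    l_day      DATE         := TRUNC(SYSDATE);
BEGIN
    -- the column name cannot be a bind variable, so validate it
    -- (DBMS_ASSERT.SIMPLE_SQL_NAME rejects anything that is not a plain identifier)
    -- and concatenate it; the values are bound, which lets Oracle reuse the cursor
    EXECUTE IMMEDIATE
        'UPDATE readings SET '
        || DBMS_ASSERT.SIMPLE_SQL_NAME(l_col_name)
        || ' = :val WHERE reading_day = :dy'
        USING l_value, l_day;
END;
/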
I'd echo Ben's point-- it doesn't appear that you are saving time by structuring your table this way. You'll likely get much better performance by normalizing the table properly. I'm not sure what space you believe you are saving but I would tend to doubt that denormalizing your table structure is going to save you much if anything in terms of space.
One way to do what is required is to create a package with all possible updates (which aren't that many, as I'll only update one field at a given time) and then choose which query to use depending on my internal logic. This would, however, lead to a big if/else or switch/case-like statement. Is there a way to achieve similar results with better performance?

Upload DB2 data into an Oracle database - fixing junk data

I've been given a DB2 export of data (around 7 GB) with associated DB2 control files. My goal is to upload all of the data into an Oracle database. I've almost succeeded in this - I took the route of converting the control files into SQL*Loader CTL files and it has worked for the most part.
However, I have found that some of the data files contain terminators and junk data in some of the columns, which gets loaded into the database, causing obvious issues with matching on that data. E.g., a column that should contain '9930027130' will show length(trim(col)) = 14: 4 bytes of junk data.
My question is, what is the best way to eliminate this junk data from the system? I hope there's a simple addition to the CTL file that allows it to replace the junk with spaces; otherwise, I can only think of writing a script that analyses the data and replaces nulls/junk with spaces before running SQL*Loader.
What, exactly, is your definition of "junk"?
If you know that a column should only contain 10 characters of data, for example, you can add a NULLIF( LENGTH( <<column>> ) > 10 ) to your control file. If you know that the column should only contain numeric characters (or alphanumerics), you can write a custom data cleansing function (e.g. STRIP_NONNUMERIC) and call that from your control file, e.g.
COLUMN_NAME position(1:14) CHAR "STRIP_NONNUMERIC(:COLUMN_NAME)",
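A possible sketch of such a cleansing function, stripping everything that is not a digit with REGEXP_REPLACE (the function name matches the example above; adjust the rules to whatever your "junk" turns out to be):

CREATE OR REPLACE FUNCTION strip_nonnumeric (p_value IN VARCHAR2)
    RETURN VARCHAR2
    DETERMINISTIC
IS
BEGIN
    -- remove every character that is not a digit; an all-junk value becomes NULL
    RETURN REGEXP_REPLACE(p_value, '[^0-9]', '');
END strip_nonnumeric;
/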
Depending on your requirements, these cleansing functions and the cleansing logic can get rather complicated. In data warehouses that load and cleanse large amounts of data every night, data is generally moved through a series of staging tables as successive rounds of data cleansing and validation rules are applied, rather than trying to load and cleanse all the data in a single step.
A common approach would be, for example, to load all the data into VARCHAR2(4000) columns with no cleansing via SQL*Loader (or external tables). Then a separate process moves the data to a staging table that has the proper data types, NULL-ing out data that couldn't be converted (e.g. non-numeric data in a NUMBER column, impossible dates, etc.). Another process would come along and move the data to another staging table where you apply domain rules: things like a social security number has to be 9 digits, a latitude has to be between -90 and 90 degrees, or a state code has to be in the state lookup table. Depending on the complexity of the validations, you may have more processes that move the data to additional staging tables to apply ever stricter sets of validation rules.
"A column should contain '9930027130', will show length(trim(col)) = 14 : 4 Bytes of junk data. "
Do a SELECT DUMP(col) to determine the strange characters. Then decide whether they are always invalid, valid in some cases, or valid but interpreted wrongly.
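For example, a query along these lines (the table and column names are placeholders) shows the raw byte values of the suspect rows:

-- inspect rows whose trimmed length is longer than the expected 10 characters
SELECT col,
       LENGTH(TRIM(col)) AS trimmed_len,
       DUMP(col)         AS raw_bytes
  FROM your_table
 WHERE LENGTH(TRIM(col)) > 10;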

Is it reasonable to use small blobs in Oracle?

In Oracle, LONG RAW and VARCHAR2 have a max length of 4 KB, but I need to store objects of 8 KB and 16 KB, so I'm wondering what a good solution would be. I know I could use a BLOB, but a BLOB has variable length and is basically an extra file behind the scenes if I'm correct: a feature and a price I'm not interested in paying for my objects.
Are there any other solutions or datatypes that are more suited to this kind of need?
Thanks
A BLOB is not a file behind the scenes. It is stored in the database. Why does it matter that it has variable length? You can just use a BLOB column (or CLOB if your data is text data) and it gets stored in its own segment.
You should use a BLOB.
A BLOB is not stored as an extra file, it's stored as a block in one of your datafiles (just like other data). If the BLOB becomes too large for a single block (which may not happen in your case) then it will continue in another block.
If your BLOB data is really small, you can get Oracle to store it inline with other data in your row (like a varchar2).
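A minimal sketch of that, with made-up table and column names; ENABLE STORAGE IN ROW is the default and keeps BLOB values up to roughly 4 KB inline with the rest of the row, moving only larger values out to the LOB segment:

-- hypothetical table: small BLOB values stay inline with the row,
-- larger ones are stored in the LOB segment automatically
CREATE TABLE my_objects (
    id   NUMBER PRIMARY KEY,
    data BLOB
)
LOB (data) STORE AS (ENABLE STORAGE IN ROW);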
Internally, Oracle is doing something similar to what PAX suggested. The chunks are as big as a DB block minus some overhead. If you try and re-invent Oracle features on top of Oracle it's only going to be slower than the native feature.
You will also have to re-implement a whole heap of functionality that is already provided in DBMS_LOB (length, comparisons, etc).
Why don't you segment the binary data and store it in 4K chunks? You could either have four different columns for these chunks (and a length column for rebuilding them into your big structure) or the more normalized way of another table with the chunks in it tied back to the original table record.
This would provide for expansion should you need it in future.
For example:
Primary table:
create table primary_table (
    -- normal columns --
    chunk_id        integer,
    chunk_last_len  integer
);
Chunk table:
create table chunk_table (
    chunk_id        integer,
    chunk_sequence  integer,
    chunk           varchar2(4000),  -- or whatever chunk size you settle on
    primary key (chunk_id, chunk_sequence)
);
Of course, you may find that your DBMS does exactly that sort of behavior under the covers for BLOBs and it may be more efficient to let Oracle handle it, relieving you of the need to manually reconstruct your data from individual chunks. I'd measure the performance of each to figure out the best approach.
Don't store binary data in varchar2 columns, unless you are willing to encode them (base64 or similar). Character set issues might corrupt your data otherwise!
Try the following statement to see the effect:
select * from (select rownum-1 original, ascii(chr(rownum-1)) data from user_tab_columns where rownum<=256) where original<>data;
VARCHAR2 is of variable length just as well. If you need to store binary data of anything bigger than a small size in your database, you'll have to look in the BLOB's direction. Another solution, of course, is storing the binary data somewhere on the file system and storing the path to the file as a VARCHAR2 in the db.
