Upload DB2 data into an Oracle database - fixing junk data

I've been given a DB2 export of data (around 7 GB) with associated DB2 control files. My goal is to upload all of the data into an Oracle database. I've almost succeeded in this - I took the route of converting the control files into SQL*Loader CTL files and it has worked for the most part.
However, I have found that some of the data files contain terminators and junk data in some of the columns, which gets loaded into the database and causes obvious issues when matching on that data. E.g., a column that should contain '9930027130' shows length(trim(col)) = 14: 4 bytes of junk data.
My question is: what is the best way to eliminate this junk data from the system? I'm hoping there's a simple addition to the CTL file that allows it to replace the junk with spaces - otherwise I can only think of writing a script that analyses the data and replaces nulls/junk with spaces before running SQL*Loader.

What, exactly, is your definition of "junk"?
If you know that a column should only contain 10 characters of data, for example, you can add a NULLIF( LENGTH( <<column>> ) > 10 ) to your control file. If you know that the column should only contain numeric characters (or alphanumerics), you can write a custom data cleansing function (e.g., STRIP_NONNUMERIC) and call that from your control file, e.g.
COLUMN_NAME position(1:14) CHAR "STRIP_NONNUMERIC(:COLUMN_NAME)",
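A minimal sketch of such a cleansing function, on the assumption that "junk" means anything other than the digits 0-9 (the name and logic here are illustrative, not a built-in):

CREATE OR REPLACE FUNCTION strip_nonnumeric (p_value IN VARCHAR2)
  RETURN VARCHAR2
  DETERMINISTIC
IS
BEGIN
  -- Keep only the digits 0-9; everything else is treated as junk
  RETURN REGEXP_REPLACE(p_value, '[^0-9]');
END strip_nonnumeric;
/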
Depending on your requirements, these cleansing functions and the cleansing logic can get rather complicated. In data warehouses that load and cleanse large amounts of data every night, the data is generally moved through a series of staging tables as successive rounds of cleansing and validation rules are applied, rather than trying to load and cleanse everything in a single step. A common approach would be, for example, to load all the data into VARCHAR2(4000) columns with no cleansing via SQL*Loader (or external tables). Then a separate process moves the data to a staging table that has the proper data types, NULL-ing out data that couldn't be converted (e.g., non-numeric data in a NUMBER column, impossible dates, etc.). Another process moves the data to another staging table where you apply domain rules: things like a social security number has to be 9 digits, a latitude has to be between -90 and 90 degrees, or a state code has to be in the state lookup table. Depending on the complexity of the validations, you may have more processes that move the data through additional staging tables to apply ever stricter sets of validation rules.
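As a rough sketch of that staged approach, with made-up table and column names (stg1_customer_raw loaded entirely as text, stg2_customer with real data types):

-- Move from the all-VARCHAR2 stage to a typed stage; values that fail the
-- checks become NULL instead of killing the load
INSERT INTO stg2_customer (customer_id, birth_date, balance)
SELECT CASE WHEN REGEXP_LIKE(TRIM(customer_id_raw), '^[0-9]+$')
            THEN TO_NUMBER(TRIM(customer_id_raw)) END,
       CASE WHEN REGEXP_LIKE(TRIM(birth_date_raw), '^[0-9]{4}-[0-9]{2}-[0-9]{2}$')
            THEN TO_DATE(TRIM(birth_date_raw), 'YYYY-MM-DD') END,
       CASE WHEN REGEXP_LIKE(TRIM(balance_raw), '^-?[0-9]+(\.[0-9]+)?$')
            THEN TO_NUMBER(TRIM(balance_raw)) END
FROM   stg1_customer_raw;
-- Impossible dates (e.g. 2011-02-31) would still raise an error here; a small
-- PL/SQL conversion function with an exception handler is one way around that.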

"A column should contain '9930027130', will show length(trim(col)) = 14 : 4 Bytes of junk data. "
Do a SELECT DUMP(col) to determine what the strange characters are. Then decide whether they are always invalid, valid in some cases, or valid but being interpreted incorrectly.
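For example (the table and column names here are assumed for illustration):

-- See exactly which bytes are in the suspicious values
SELECT col, DUMP(col) FROM some_table WHERE LENGTH(TRIM(col)) > 10;

-- If the extra bytes turn out to be control characters, one possible cleanup
-- (replacing them with spaces, as the question suggests)
UPDATE some_table
SET    col = REGEXP_REPLACE(col, '[[:cntrl:]]', ' ')
WHERE  REGEXP_LIKE(col, '[[:cntrl:]]');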

Related

Is there any way to handle source flat file with dynamic structure in informatica power center?

I have to load a flat file using Informatica PowerCenter, and its structure is not static. The number of columns may change in future runs.
Here is the source file:
In the sample file I have 4 columns right now, but in the future I may get only 3 columns, or I may get a new set of columns as well. I can't change the code in production every time; I have to handle this situation with the same code.
The expected result set is:
Is there any way to handle this scenario? PL/SQL and Unix will also work here.
I can see two ways to do it. The only requirement is that the source should decide on a future structure and stick to it. If someone later decides to change the structure, data types, or lengths, the mapping will not work properly.
Solutions -
Create extra columns in the source towards the end. If you have 5 columns now, the extra columns after the 5th will be pulled in as blank. Create as many as you want, but please note you need to transform them as per the future structure and load them into the proper place in the target.
This is similar to the above solution, but in this case read the whole line as a single column in the source and source qualifier, as a large string of length 40000.
Then split the columns on the delimiter in an Informatica expression transformation. This splitting can be done by following the thread below. It can also get tricky if you have hundreds of columns.
Split Flat File String into multiple columns in Informatica
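Since the question says PL/SQL would also be acceptable, here is a minimal SQL sketch of the same splitting idea, assuming the whole line has been loaded into a single raw_line column of a staging table and that the delimiter is a comma (both assumptions):

SELECT REGEXP_SUBSTR(raw_line, '[^,]+', 1, 1) AS col1,
       REGEXP_SUBSTR(raw_line, '[^,]+', 1, 2) AS col2,
       REGEXP_SUBSTR(raw_line, '[^,]+', 1, 3) AS col3,
       REGEXP_SUBSTR(raw_line, '[^,]+', 1, 4) AS col4
FROM   stg_flat_file_raw;
-- Note: the '[^,]+' pattern skips empty fields; if empty fields matter,
-- a pattern like '(.*?)(,|$)' with the subexpression argument is safer.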

ETL: Informatica - truncation issue while loading flat file source data into an Oracle target table

I am trying to load data from a file into a relational target. The target DB is Oracle.
In the source file, the data for one account has special characters,
e.g.: Viswanathan^A# de
In our application the length of this field is 50, so in Informatica it is also 50.
Other data is loaded properly without any issues; that data doesn't have special characters.
While loading, the data is truncated to something like 'Viswanathan d', so the character 'e' is not loaded. Because of that, the application rejected this record.
I would like to know how to see and set the code page for the target and the source.
I think the issue is with the data length, or maybe the code page. Probably you are trying to insert Unicode data (data with an accent, like Dé). You can change the settings below and try again.
You can change the code page of the target in the target definition and set it to a Unicode code page.
Change the Integration Service data movement mode to Unicode.
Make the target column VARCHAR2(100 CHAR). To store Unicode values you may need roughly double the size you would need for ASCII values.
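For example, a hedged sketch on the Oracle side (the table and column names are placeholders):

-- Character (not byte) length semantics, so multi-byte characters still fit
ALTER TABLE target_table MODIFY (account_name VARCHAR2(100 CHAR));

-- Check what the database character set actually is
SELECT value FROM nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET';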

Doing String length on SQL Loader input field

I'm reading data from a fixed length text file and loading into a table with fixed length processing.
I want to check the input line length so that I can discard records that don't match the fixed length and log them into an error table.
Example
Load into the Input_Log table if the line meets the specified length.
Load into Input_Error_Log table if the input line length is less than or greater than the fixed line length.
I believe you would be better served by bulk loading your data into a staging table, then load into the production table from there via a stored procedure where you can apply rules via normal PL/SQL & DML to your heart's content. This is a typical best practice anyway.
sqlldr isn't really the tool to get too complicated in, even if you could do what you want. Maintainability and restart-ability become more complicated when you add complexity to a tool that's really designed for bulk loading. Add the complexity to a proper program.
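A minimal sketch of that staging approach, assuming the raw lines are bulk-loaded into a one-column staging table (stg_input_raw), the fixed length is 100, and the Input_Log / Input_Error_Log tables from the question have line_text and line_length columns (the staging table, column names, and length are all assumptions):

INSERT INTO input_log (line_text)
SELECT raw_line
FROM   stg_input_raw
WHERE  LENGTH(raw_line) = 100;

INSERT INTO input_error_log (line_text, line_length)
SELECT raw_line, LENGTH(raw_line)
FROM   stg_input_raw
WHERE  LENGTH(raw_line) <> 100;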
Let us know what you come up with.

Delphi: ClientDataSet is not working with big tables in Oracle

We have a TDBGrid connected to a TClientDataSet via a TDataSetProvider, in Delphi 7 with an Oracle database.
It works fine for showing the content of small tables, but the program hangs when you try to open a table with many rows (e.g. 2 million rows), because TClientDataSet tries to load the whole table into memory.
I tried setting "FetchOnDemand" to True for our TClientDataSet and "poFetchDetailsOnDemand" to True in the Options for the TDataSetProvider, but it does not help to solve the problem. Any ideas?
Update:
My solution is:
TClientDataSet.FetchOnDemand = True
TDataSetProvider.Options.poFetchDetailsOnDemand = True
TClientDataSet.PacketRecords = 500
I succeeded to solve the problem by setting the "PacketRecords" property for TCustomClientDataSet. This property indicates the number or type of records in a single data packet. PacketRecords is automatically set to -1, meaning that a single packet should contain all records in the dataset, but I changed it to 500 rows.
When working with an RDBMS, and especially with large datasets, trying to access a whole table is exactly what you shouldn't do. That's a typical newbie mistake, or a habit borrowed from old file-based small database engines.
When working with an RDBMS, you should load only the rows you're interested in, display/modify/update/insert them, and send the changes back to the database. That means a SELECT with a proper WHERE clause and also an ORDER BY - remember that row ordering is never assured when you issue a SELECT without an ORDER BY; a database engine is free to retrieve rows in whatever order it sees fit for a given query.
If you have to perform bulk changes, you need to do them in SQL and have them processed on the server, not load a whole table client side, modify it, and send changes row by row to the database.
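For example (the table and column names here are invented, not from the question):

-- Fetch only the rows the user is actually working with, in a defined order
SELECT order_id, customer_id, status
FROM   orders
WHERE  order_date >= ADD_MONTHS(TRUNC(SYSDATE), -1)
ORDER  BY order_id;

-- A bulk change stays on the server instead of going row by row from the client
UPDATE orders
SET    status = 'ARCHIVED'
WHERE  order_date < ADD_MONTHS(TRUNC(SYSDATE), -60);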
Loading large datasets client side may fail for several reasons: lack of memory (especially in 32-bit applications), memory fragmentation, and so on. You will probably flood the network with data you don't need, force the database to perform a full scan, and maybe flood the database cache as well.
Client datasets are simply not designed to handle millions or billions of rows. They are designed to cache the rows you need client side and then apply the changes back to the remote data. You need to change your application logic.

Deleting large number of rows of an Oracle table

I have a data table from the company which is 250 GB in size and has 35 columns. I need to delete around 215 GB of data, which is obviously a large number of rows to delete from the table. This table has no primary key.
What could be the fastest method to delete data from this table? Are there any tools in Oracle for such large deletion processes?
Please suggest the fastest way to do this using Oracle.
As said in the answer above, it's better to move the rows to be retained into a separate table and then truncate the original table, because of something called the HIGH WATERMARK. More details can be found here: http://sysdba.wordpress.com/2006/04/28/how-to-adjust-the-high-watermark-in-oracle-10g-alter-table-shrink/ . A plain DELETE will also overwhelm your UNDO tablespace.
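For reference, a minimal sketch of the shrink approach the linked article describes (the table name is a placeholder; SHRINK SPACE needs an ASSM tablespace and Oracle 10g or later):

-- After a big DELETE the high watermark stays put; SHRINK moves it back down
ALTER TABLE big_table ENABLE ROW MOVEMENT;
ALTER TABLE big_table SHRINK SPACE CASCADE;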
The "recovery model" term is rather applicable to MSSQL, I believe :).
Hope this clarifies the matter a bit.
Thanks.
Do you know which records need to be retained? How will you identify each record?
A solution might be to move the records to be retained to a temp db, and then truncate the big table. Afterwards, move the retained records back.
Beware that the transaction log file might become very big because of this (but depends on your recovery model).
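A rough sketch of that move-and-truncate approach in Oracle (the table name and the retention condition are assumptions; TRUNCATE cannot be rolled back, so test on a copy first):

-- Keep only the rows to be retained
CREATE TABLE big_table_keep AS
SELECT * FROM big_table
WHERE  created_date >= ADD_MONTHS(SYSDATE, -12);   -- assumed retention rule

TRUNCATE TABLE big_table;

-- Direct-path insert the retained rows back
INSERT /*+ APPEND */ INTO big_table
SELECT * FROM big_table_keep;
COMMIT;

DROP TABLE big_table_keep;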
We had a similar problem a long time ago: a table with 1 billion rows in it, from which we had to remove a very large proportion of the data based on certain rules. We solved it by writing a Pro*C job to extract the data that we wanted to keep, apply the rules, and sprintf the rows to be kept to a CSV file.
Then we created a sqlldr control file to reload the data using the direct path, which won't create undo/redo (and if you need to recover the table, you still have the CSV file until you do your next backup anyway).
The sequence was:
Run the Pro*C job to create the CSV files of data
Generate the DDL for the indexes
Drop the indexes
Run SQL*Loader using the CSV files
Recreate the indexes using a parallel hint
Analyse the table using degree(8)
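A rough sketch of the reload side of that sequence in SQL (object names and the degree of parallelism are placeholders; the CSV load itself runs through sqlldr with direct path enabled):

DROP INDEX big_table_ix1;

-- load the CSV files here, e.g.: sqlldr userid=... control=big_table.ctl direct=true

CREATE INDEX big_table_ix1 ON big_table (key_col)
  PARALLEL 8 NOLOGGING;

BEGIN
  -- analyse the table with the same degree of parallelism
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER,
                                tabname => 'BIG_TABLE',
                                degree  => 8);
END;
/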
The amount of parallelism depends on the CPUs and memory of the DB server - we had 16 CPUs and a few gig of RAM to play with, so that was not a problem.
The extract of the correct data was the longest part of this.
After a few trial runs, SQL*Loader was able to load the full 1 billion rows (that's a US billion, i.e. 1000 million rows) in under an hour.
