Bulk loading into Oracle with all tables in one data file - oracle

I would like to bulk load a bunch of data into an Oracle database. I have written a program to easily format my data however I want. I see many examples of loading a CSV file into Oracle, but they all require a control file for each table, linking it to one file.
It would be simple for me to create a script to generate all of the control files; however, I would first like to know whether it is possible to have all the data in one file, with the table names designated in the data file.
For example:
onefile.csv:
------------
details
1, John, john#gmail.com
2, Steve, steve#gmail.com
3, Sally, sally#gmail.com
account
1, John, johntheman, johnh43
2, Steve, password, steve.12
3, Sally, letmein, slllya2
Disclaimer: This is a completely fictional database design and is not at all reflective of how I might store user data in the real world.

You could use UTL_FILE to read a CSV. This would give you complete control over how you process its contents. But it does mean you would waste a lot of time and effort hand-rolling a meagre and slow implementation of SQL*Loader. Why would you want to do that?
One file per data source and/or target is the accepted convention for CSV generation. It is a necessary part of the contract. If we want to do anything else then we need to use a more suitable protocol such as XML or JSON, something which can support programmatic interrogation.
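For the sake of argument, the hand-rolled version would look something like the following minimal sketch (it assumes a directory object named DATA_DIR and the single-file layout from the question; all of the parsing is left to you):
DECLARE
  f    UTL_FILE.FILE_TYPE;
  line VARCHAR2(4000);
BEGIN
  f := UTL_FILE.FOPEN('DATA_DIR', 'onefile.csv', 'R');  -- DATA_DIR is an assumed directory object
  LOOP
    BEGIN
      UTL_FILE.GET_LINE(f, line);
    EXCEPTION
      WHEN NO_DATA_FOUND THEN EXIT;  -- GET_LINE raises NO_DATA_FOUND at end of file
    END;
    -- You would have to detect the "table name" lines yourself, remember the current
    -- target table, split the comma-separated values (e.g. with REGEXP_SUBSTR) and
    -- issue the INSERTs; everything SQL*Loader already does for you.
    NULL;
  END LOOP;
  UTL_FILE.FCLOSE(f);
END;
/
Everything SQL*Loader gives you for free (parsing, bad files, logging, direct path) you would have to re-implement around that loop.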

Why don't you use SQL*Loader (sqlldr)? See the SQL*Loader docs, in particular "Distinguishing Different Input Record Formats".
Benefits of Using Multiple INTO TABLE Clauses
Multiple INTO TABLE clauses enable you to:
Load data into different tables
Extract multiple logical records from a single input record
Distinguish different input record formats
Distinguish different input row object subtypes
But you would have to specify the table name (or its ID) on every line.
SQL*Loader works at the line level; I'm afraid it does not support stanzas.
So you might need to change it into:
onefile.csv:
------------
details,1, John, john#gmail.com
details,2, Steve, steve#gmail.com
details,3, Sally, sally#gmail.com
account,1, John, johntheman, johnh43
account,2, Steve, password, steve.12
account,3, Sally, letmein, slllya2
Which looks reminiscent of old COBOL record formats.
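With that layout, a single control file with multiple INTO TABLE clauses could look roughly like the sketch below. The column names (id, name, email, password, username) are assumptions for illustration; note the POSITION(1) on the second INTO TABLE clause, which resets field scanning back to the start of each record:
LOAD DATA
INFILE 'onefile.csv'
APPEND
INTO TABLE details
  WHEN rec_type = 'details'
  FIELDS TERMINATED BY ','
  (rec_type FILLER CHAR,
   id,
   name,
   email)
INTO TABLE account
  WHEN rec_type = 'account'
  FIELDS TERMINATED BY ','
  (rec_type FILLER POSITION(1) CHAR,
   id,
   name,
   password,
   username)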

Related

2 different Readers to fill an item in the same step

I have this situation: I have a CSV file with some information, and I have to complete this information with records from a database table and write a new file with this info.
I guess that I should use a multiple-reader implementation, one to read my file and another to read my database (something like this example: https://bigzidane.wordpress.com/2016/09/15/spring-batch-multiple-sources-as-input/). But I need to pass conditions to the query based on the current item being processed.
Anyway, if it is possible, I need to configure the query in my reader2 with info obtained in my reader1. How can I do this?
Here is a little summary of my problem:
Input File (Reader1)
Id;data1;data2;data3
Database (Reader2)
Id|data4;data5;data6
Output File
Id;data1;data2;data3;data4;data5;data6
Sorry for my English. Any link to articles or docs is welcome.
This is a common pattern known as the "driving query pattern" and is described in the Common patterns section of the reference documentation.
You can use a FlatFileItemReader to read your input file and an item processor to query the database for the current item and enrich it with additional data.
Another idea is to load your flat file in a staging table in the database and use a database item reader to join data.
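If you go the staging-table route, the enrichment itself can then be done in plain SQL. A minimal sketch, assuming hypothetical tables staging_csv (loaded from the input file), source_table (holding data4-data6) and enriched_output:
-- All table and column names below are assumptions for illustration.
INSERT INTO enriched_output (id, data1, data2, data3, data4, data5, data6)
SELECT s.id, s.data1, s.data2, s.data3,
       t.data4, t.data5, t.data6
FROM   staging_csv s
JOIN   source_table t ON t.id = s.id;
A database item reader over enriched_output (or over the join itself) can then feed the flat-file writer that produces the output file.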

Get data's source in kettle

When I use Kettle, I was wondering how to get a table column's source column. For example, after I have merged two tables into one table based on the primary key, given any column in the output table, I would like to be able to tell which table it belongs to and get the original column name in the original table. Thank you for helping, and sorry for my poor English...
http://i.stack.imgur.com/xoR0s.png
Given any field in table3 (suppose a field named A in table3), I would like to know where it comes from without the graphical view (from Java code or other means), i.e. the original table name (here input1 or input2) and the original column name (maybe B in input1, which represents A in table3). Besides, I use MySQL.
There are a couple of ways to do this:
1) Manually. If you right-click on the output step and choose Show Output fields (or whatever it's called), you will see the "origin step" for each of the outgoing fields. You can do the same for input fields. Then you can trace them back to those origin steps, and repeat the process of viewing the input fields at those steps, and seeing those fields' origins, and so on. This is probably not what you're looking for.
2) With code. Prior to 6.0, you'd need to programmatically perform the same operations as are listed in option 1 above. In 6.0 there is the Data Lineage capability, which offers the LineageClient API that can find the origin fields for the specified output fields. For more information see my blog post describing the Data Lineage capability. Also I put a Gremlin Console in the PDI Marketplace, to make the use of LineageClient easier (and you can visually see the lineage graph too).

Oracle - build dimension from a file based data source

I'm trying to build a star schema in Oracle 12c. In my case my data source is not a relational database but a single excel/csv file which is populated via a google form, which means I don't have any sort of reference from a source system such as auto incremental keys/ids. Now what would be the best approach to build a star schema given this condition?
File row sample:
<submitted timestamp>,<submitted by user>,<region>,<country>,<branch>,<branch location>,<branch area>,<branch type>,<branch name>,<branch private? yes/no value>,<the following would be all "fact" values (measurements)>,...,...,...
In case I wanted to build a "branch" dimension, how would I handle updates/inserts after the first load into the dimension table?
Thought solution so far:
I had thought of making a concatenated string "key" from the branch values, which would make it unique (underscore would be the "glue" to concatenate the values), e.g.:
<region>_<country>_<branch>_<branch location> as branch_key
I would insert all the distinct branches into a staging table, including the branch_key column for each one of them; then, when loading into the dimension, I could compare which keys do not exist yet in my dimension table and insert those. As for updates, I'm a bit stuck on how to handle them. I had thought of having another file mapping which branches are active, with an expiration date column. Basically I am trying to simulate what I could do if I had the data in a database instead of CSV files.
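For instance (just a sketch with made-up staging/dimension table names), the dimension load I have in mind would be something like:
MERGE INTO branch_dim d
USING (SELECT DISTINCT
              region || '_' || country || '_' || branch || '_' || branch_location AS branch_key,
              region, country, branch, branch_location
       FROM   branch_staging) s
ON (d.branch_key = s.branch_key)
WHEN NOT MATCHED THEN
  INSERT (d.branch_key, d.region, d.country, d.branch, d.branch_location)
  VALUES (s.branch_key, s.region, s.country, s.branch, s.branch_location);
A surrogate key from a sequence (or an identity column) could be filled in by the INSERT as well, since there are no source-system IDs to rely on.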
This is all I can think of so far; do you have any other recommendations/ideas on how to implement this? Take into consideration that the data source cannot change, as in I have to read these CSV files, since the data is not stored anywhere else.
Thank you.

Reading XML-files with StAX / Kettle (Pentaho)

I'm doing an ETL process with Pentaho (Spoon / Kettle) where I'd like to read an XML file and store element values in a DB.
This works just fine with the "Get data from XML" component...but the XML file is quite big, several gigabytes, and therefore reading the file takes too long.
Pentaho Wiki says:
The existing Get Data from XML step is easier to use but uses DOM parsers that need in-memory processing, and even the purging of parts of the file is not sufficient when these parts are very big.
The XML Input Stream (StAX) step uses a completely different approach to solve use cases with very big and complex data structures and the need for very fast data loads...
Therefore I'm now trying to do the same with StAX, but it just doesn't seem to work out as planned. I'm testing this with an XML file which only has one element group. The file is read and then mapped/inserted into the table...but now I get multiple rows in the table where all the values are "undefined", and some rows where I have the right values. In total I have 92 rows in the table, even though it should only have one row.
Flow goes like:
1) read with StAX
2) Modified Java Script Value
3) Output to DB
At step 2) I'm doing the following:
var id;
if ( xml_data_type_description.equals("CHARACTERS") &&
     xml_path.equals("/labels/label/id") ) {
  id = xml_data_value;
}
...
I'm using positional-staz.zip from http://forums.pentaho.com/showthread.php?83480-XPath-in-Get-data-from-XML-tool&p=261230#post261230 as an example.
How do I use StAX to read an XML file and store the element values in the DB?
I've been trying to look for examples but haven't found much. The above example uses a "Filter Rows" component before inserting the rows. I don't quite understand why it's being used; can't I just map the values I need? It might be that this problem occurs because I don't use, or don't know how to use, the Filter Rows component.
Cheers!
I posted a possible StAX-based solution on the forum listed above, but I'll post the gist of it here since it is awaiting moderator approval.
Using the StAX parser, you can select just those elements that you care about, namely those with a data type of CHARACTERS. For the forum example, you basically need to denormalize the rows in sets of 4 (EXPR, EXCH, DATE, ASK). To do this you add the row number to the stream (using an Add Sequence step) then use a Calculator to determine a "bucket number" = INT((rownum-1)/4). This will give you a grouping field for a Row Denormaliser step.
When the post is approved, you'll see a link to a transformation that uses StAX and the method I describe above.
Is this what you're looking for? If not please let me know where I misunderstood and maybe I can help.

Single Database Call With Many Parameters vs Many Database Calls With Few Parameters

I am writing a Content Management System which can store meta-data about different document-types. Each document-type has its own set of meta-data fields. For example, a Letter has fields like "To", "From", "ToAddress", "FromAddress" etc., whereas a MinutesOfMeeting has fields like "DateHeldOn", "TimeHeldOn", "AttendedBy" etc.
I am saving this information in the database in two kinds of tables: General and Specific. The General table stores information which is common to all types, such as DocumentOwnerName, DocumentCreatedDate, DocumentSize etc. The Specific "table" is not one table but a set of 35 different tables, one for each document-type.
I have a page which contains a grid showing a list of documents. One record corresponds to one document. Since the grid shows documents of all types, the first row may show a Letter, the second a MinutesOfMeeting, the third a Memo, etc.
I have also made a search feature where the user can set criteria on the basis of which the document list is retrieved. To make it work, there are four search-related parameters for each of the fields in each of the specific tables, and all of these parameters are passed to a central procedure. This procedure then filters out records on the basis of the criteria.
The problem is that, dealing with 35 different document-types, each having around 10 fields, I end up with more than a thousand parameters for the procedure. This is a maintenance nightmare. I am looking for a solution.
One solution is to deal with each of the specific tables individually, getting back IDs, and then union them. This is fine, except that I have to make 36 different calls to the database: one for each specific table plus one for the general table.
It all boils down to a simple architecture choice: Should I make a single database call passing many parameters or should I make many database calls passing few parameters.
Which approach is more preferable and why?
Edit: The web-server and database-server are on the same machine. Therefore, network speed shouldn't matter.
When designing an API where I need a procedure to take a large number of related parameters, or even a variable list of parameters, I use record types, e.g.:
TYPE param_type IS RECORD (
  -- Datatypes here are illustrative; "To" and "From" are reserved words, hence the adjusted names.
  ToName       VARCHAR2(100),
  FromName     VARCHAR2(100),
  ToAddress    VARCHAR2(200),
  FromAddress  VARCHAR2(200),
  DateHeldOn   DATE,
  TimeHeldOn   VARCHAR2(20),
  AttendedBy   VARCHAR2(200)
);
PROCEDURE do_search (in_params IN param_type);
The structure of the record is up to you, of course. If the procedure is coded to ignore the record elements that are NULL, then all the caller needs to do is set those elements that are required, e.g.:
DECLARE
  p param_type;
BEGIN
  p.DateHeldOn := DATE '2012-01-01';
  do_search(p);
END;
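As a rough sketch of what "ignore the NULL elements" might look like inside do_search (the documents table and its columns are invented for the example):
PROCEDURE do_search (in_params IN param_type) IS
BEGIN
  FOR r IN (SELECT d.document_id
            FROM   documents d  -- assumed table
            WHERE  (in_params.DateHeldOn IS NULL OR d.date_held_on = in_params.DateHeldOn)
            AND    (in_params.AttendedBy IS NULL OR d.attended_by  = in_params.AttendedBy)
            -- ... one such pair of predicates per record element ...
           )
  LOOP
    NULL;  -- collect or return the matching document IDs here
  END LOOP;
END do_search;
Each element the caller leaves NULL simply drops out of the filtering, so a single procedure covers all the document-types without needing a thousand parameters.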
