Talend Normalize Flat File into Relational Database Tables - etl

We have a single source table, which is flat — we need to insert different fields from a given record into multiple tables. We are successfully using lastInsertID a single time, but we are struggling with how to re-add fields from the same source row again in subsequent related tables.
For example, if we had a mailing address (goofy example coming up, but good for common discussion)
-----------Source----------
First Name
Middle Name
Last Name
Address 1
Address 2
City
State
Zip
-----------Targets-------------
People
address_id
First Name
Last Name
Address
address_id
state_id
zip_id
Address 1
Address 2
States
state_id
State Name
Zip
zip_id
Zip Code
Furthermore, we cannot be sure, that we may not need to add the same column to more than one table.
What is the best practice for this kind of data normalization in Talend?

I would approach this iteratively, normalising part of the table with each step.
You should be able to normalise the person data away from the address, state and zip data in one step and then normalise the state away from the address and zip data and then finally normalise the zip data away from the rest of the address.
As an example, and following on from the example in your question here's a few jobs that will do exactly that:
To start with we should create the example data. I'm going to use MySQL for this example but the same principles apply for any of the major RDBMS'.
Let's create an empty table to start with:
DROP DATABASE IF EXISTS normalisation;
CREATE DATABASE IF NOT EXISTS normalisation
CHARACTER SET utf8 COLLATE utf8_unicode_ci;
CREATE TABLE IF NOT EXISTS normalisation.denormalised (
FirstName VARCHAR(255),
MiddleName VARCHAR(255),
LastName VARCHAR(255),
Address1 VARCHAR(255),
Address2 VARCHAR(255),
City VARCHAR(255),
State VARCHAR(255),
Zip VARCHAR(255)
) ENGINE = INNODB;
Into this we need to populate it with some example data which can be done easily enough with Talend's tRowGenerator component:
I've configured the tRowGenerator to give us some semi sensible testing output:
I've also added an extra step to add some co-habitors to ~1/3 of addresses using the following tMap configuration:
Now that we have our test data easily generated we can move on to actually normalising the data from this denormalised table.
As mentioned above, our first step is to normalise the person data out. We start by creating the necessary tables for the person data and the remaining address data:
CREATE TABLE IF NOT EXISTS normalisation.person (
Person_id BIGINT AUTO_INCREMENT PRIMARY KEY,
FirstName VARCHAR(255),
MiddleName VARCHAR(255),
LastName VARCHAR(255),
Address_id BIGINT
) ENGINE = INNODB;
CREATE TABLE IF NOT EXISTS normalisation.addressStateZip (
Address_id BIGINT AUTO_INCREMENT PRIMARY KEY,
Address1 VARCHAR(50),
Address2 VARCHAR(50),
City VARCHAR(50),
State VARCHAR(50),
Zip VARCHAR(50),
UNIQUE KEY addressStateZip (Address1, Address2, City, State, Zip)
) ENGINE = INNODB;
We then populate these 2 tables by getting all of the address type data, taking only the unique rows and then putting this into the addressStateZip staging table:
The second part of the above job then compares the addressStateZip data to the initial denormalised table and collecting the joins to get the Address_id for the person table:
The remaining steps are now quite similar.
Next we create the state table and another staging table for the address and zip data:
CREATE TABLE IF NOT EXISTS normalisation.state (
State_id BIGINT AUTO_INCREMENT PRIMARY KEY,
State VARCHAR(255),
UNIQUE KEY state (State)
) ENGINE = INNODB;
CREATE TABLE IF NOT EXISTS normalisation.addressZip (
Address_id BIGINT AUTO_INCREMENT PRIMARY KEY,
Address1 VARCHAR(50),
Address2 VARCHAR(50),
City VARCHAR(50),
State_id BIGINT,
Zip VARCHAR(50),
UNIQUE KEY addressStateZip (Address1, Address2, City, State_id, Zip)
) ENGINE = INNODB;
Now we need to take the unique states from the addressStateZip table and put these into the state table:
And the second part, as before, then creates the data into the addressZip staging table with the State_id instead of the actual state:
Now, finally, we can create our zip table and then link that to a proper address table:
CREATE TABLE IF NOT EXISTS normalisation.zip (
Zip_id BIGINT AUTO_INCREMENT PRIMARY KEY,
ZIP VARCHAR(255),
UNIQUE KEY zip (ZIP)
) ENGINE = INNODB;
CREATE TABLE IF NOT EXISTS normalisation.address (
Address_id BIGINT AUTO_INCREMENT PRIMARY KEY,
Address1 VARCHAR(50),
Address2 VARCHAR(50),
City VARCHAR(50),
State_id BIGINT,
Zip_id BIGINT,
UNIQUE KEY addressStateZip (Address1, Address2, City, State_id, Zip_id)
) ENGINE = INNODB;
Using the same methodology as with the state data we get all of the unique zips and put these into the zip table:
And, as before, we can now put the Zip_id into a new, finished address table:
And to check things we can now run the following query to get all the data back out:
SELECT p.FirstName, p.MiddleName, p.LastName, a.Address1, a.Address2, a.City, s.State, z.Zip
FROM normalisation.person AS p
INNER JOIN normalisation.address AS a ON a.Address_id = p.Address_id
INNER JOIN normalisation.state AS s ON s.State_id = a.State_id
INNER JOIN normalisation.zip AS z ON z.Zip_id = a.Zip_id;
You'll probably also want to add some foreign key constraints to the tables now that you're done setting things up:
ALTER TABLE normalisation.person
ADD FOREIGN KEY (Address_id) REFERENCES address(Address_id);
ALTER TABLE normalisation.address
ADD FOREIGN KEY (State_id) REFERENCES state(State_id),
ADD FOREIGN KEY (Zip_id) REFERENCES zip(Zip_id);

I highly doubt this is the best practice, but it is the best of my knowledge. I have 2 different approaches for these kind of stuff:
Assuming [First Name, Last Name] is unique and that people may share addresses I would:
Insert Zip and States checking if they already exist. In that case would not insert;
Insert Address with a lookup on States and Zip to get state_id and zip_id.
Insert People with a lookup first on Zip and States again. Then a lookup on Address to get address_id and finnaly inserting on People if it does not exist already.
If [First Name, Last Name] are not unique or for some reason I don't want them to share addresses, zips or states I usually force the source to have some kind of ID, an explicit or an implicit one like LINE_NUMBER so we can distinguish people. The inserting order would then be the same but in this case I use people_id to distinguish addresses, zip, and state and even people in some of the lookups, depending on what's the intended result.
This last approach is kind of dirty since we might end having useless Ids that were only needed for insertion. To avoid this I would use tmp tables alike with extra fields and in the end would just insert blindly in the final table. If this is not a one time insert, it would require some extra logic on tmp and final tables sync.

Related

want to link crew Assignment to above tables given I'm getting an error How to solve this error?

CREATE TABLE Route(
RouteNo VARCHAR(10),
Origin VARCHAR(30),
Destination VARCHAR(30),
DepartureTime VARCHAR(15),
SerialNo VARCHAR(5),
ArrivalTime VARCHAR(15),
PRIMARY KEY(RouteNo) );
CREATE TABLE Employee(
EmployeeID VARCHAR(5) NOT NULL,
Name VARCHAR(30),
Phone NUMBER,
JobTitle VARCHAR(30),
PRIMARY KEY(EmployeeID) );
CREATE TABLE Flight(
SerialNo VARCHAR(5),
RouteNo VARCHAR(5),
FlightDate DATE,
ActualTD VARCHAR(10),
ActualTA VARCHAR(10),
PRIMARY KEY(SerialNo, RouteNo, FlightDate),
FOREIGN KEY(RouteNo) REFERENCES Route(RouteNo),
FOREIGN KEY(SerialNo) REFERENCES Airplane(SerialNo) ); -- does Airplane table exists ?
CREATE TABLE CrewAssigment(
EmployeeID VARCHAR(5),
RouteNo VARCHAR(5),
FlightDate DATE,
Role VARCHAR(45),
Hours INT,
PRIMARY KEY(EmployeeID, RouteNo, FlightDate),
FOREIGN KEY(EmployeeID) REFERENCES Employee(EmployeeID),
FOREIGN KEY(RouteNo) REFERENCES Route(RouteNo),
FOREIGN KEY(FlightDate) REFERENCES Flight(FlightDate) );
Select * from CrewAssignment
This is my code where I'm getting an error in the CrewAssignment table and above are the tables where the foreign key is referenced from.
Error report -
ORA-02270: no matching unique or primary key for this column-list
02270. 00000 - "no matching unique or primary key for this column-list"
*Cause: A REFERENCES clause in a CREATE/ALTER TABLE statement
gives a column-list for which there is no matching unique or primary
key constraint in the referenced table.
*Action: Find the correct column names using the ALL_CONS_COLUMNS
catalog view
A few objections.
This is clearly an Oracle question, not MySQL. How do I know? ORA-02270 is an Oracle database error code; pay attention to tags you use.
You should use VARCHAR2 datatype instead of VARCHAR. Why? Oracle recommends so.
create table flight fails first as it references the airplane table, and it doesn't exist yet (at least, not in code you posted)
error you're complaining about is due to create table crewassignment. One of its foreign keys references the flight table:
FOREIGN KEY(flightdate) REFERENCES flight(flightdate)
but flight's primary key is composite, made up of 3 columns:
PRIMARY KEY(serialno,
routeno,
flightdate)
which means that you can't create that foreign key.
So, what to do? No idea, I don't know rules responsible for such a data model. Either modify primary key of the flight table, or modify foreign key constraint of the crewassingment table.
Perhaps you could add a new column to flight table (made up of a sequence (or identity column, if your database version supports it) and then let the crewassignment table reference that primary key. Columns you currently use as a primary key (serialno, routeno, flightdate) would then switch to unique key.

Create table as select statement primary key in oracle

Is it possible to specify which is the primary key on creating table as select statement? My aim is to include the declaration of primary key on the create table not modifying the table after the creation.
CREATE TABLE suppliers
AS (SELECT company_id, address, city, state, zip
FROM companies
WHERE company_id < 5000);
Yes, it's possible. You would need to specify columns explicitly:
CREATE TABLE suppliers (
company_id primary key,
address,
city,
state,
zip
)
AS
SELECT company_id, address, city, state, zip
FROM companies
WHERE company_id < 5000;
Here is a demo
Note: in this case primary key constraint will be given a system-generated name. If you want it to have a custom name you'd have to execute alter table suppliers add constraint <<custom constraint name>> primary key(<<primary_key_column_name>>) after executing(without primary key specified) CREATE TABLE suppliers.. DDL statement.
Yes, it's possible.You can try referring below example.
create table student (rollno ,student_name,score , constraint pk_student primary key(rollno,student_name))
as
select empno,ename,sal
from emp;
You can create Suppliers table explicitly and if any column have primary key in companies table, then copy that structure from companies table.
Or
Create a same column defining primary key that you want and copy that column from companies table!

fast comparison a list with itself

I have a list giant list (100k entries) in my database. Each entry contains a id, text and a date.
I created a function to compare two text as possible. How it looks like is not necessary right now.
Is there a "good" way to remove "duplicates" (as possible) from the list by text?
Currently I'm looping through the list twice and compare each entry with each entry, except itself by id.
If your question is when you insert a row in the table... you can include the unique constraint.
Postgresql
CREATE TABLE table1 (
id serial PRIMARY KEY,
txt VARCHAR (50),
dt timestamp,
UNIQUE(txt)
);
Oracle
CREATE TABLE table1
( id numeric(10) NOT NULL,
txt varchar2(50) NOT NULL,
date timestamp,
CONSTRAINT txt_unique UNIQUE (txt)
);

one attribute referencing, attributes in two different tables

I have 4 tables
customer: CustomerID - primary key, name
Magazine: name - primary key, cost, noofissues
Newspaper: name - primary key, cost, noofissues
subscription: custID - references CustomerID of Customer, name, startdate, enddate
In the above, can I reference the name from subscription table to reference name from Magazine and name from Newspaper?
I have created the tables Customer, Newspaper and Magazine. I only need to create Subscription.
Can you do something like this?
CREATE TABLE subscription (
custID INT
CONSTRAINT subscription__custid__fk REFERENCES Customer( CustomerId ),
name VARCHAR2(50)
CONSTRAINT subscription__mag_name__fk REFERENCES Magazine( Name )
CONSTRAINT subscription__news_name__fk REFERENCES Newspaper( Name ),
startdate DATE
CONSTRAINT subscription__startdate__nn NOT NULL,
enddate DATE
);
Yes, you can and you will have two foreign keys on the same column pointing to different tables but if the value in the column is non-null then it will expect there to be a matching name in both the magazines table and the newspapers table - which is probably not what you are after.
Can you have a foreign key that asks can the value be in either exclusively in this table or that table (but not in both)? No.
But you can re-factor your database so you merge the newspapers and magazines tables into a single table (which you can then easily reference); like this:
CREATE TABLE customer (
CustomerID INT
CONSTRAINT customer__CustomerId__pk PRIMARY KEY,
name VARCHAR2(50)
CONSTRAINT customer__name__nn NOT NULL
);
CREATE TABLE Publications (
id INT
CONSTRAINT publications__id__pk PRIMARY KEY,
name VARCHAR2(50)
CONSTRAINT publications__name__nn NOT NULL,
cost NUMBER(6,2)
CONSTRAINT publications__cost__chk CHECK ( cost >= 0 ),
noofissues INT,
type CHAR(1),
CONSTRAINT publications__type__chk CHECK ( type IN ( 'M', 'N' ) )
);
CREATE TABLE subscription (
custID INT
CONSTRAINT subscription__custid__fk REFERENCES Customer( CustomerId ),
pubID INT
CONSTRAINT subscription__pubid__fk REFERENCES Publications( Id ),
startdate DATE
CONSTRAINT subscription__startdate__nn NOT NULL,
enddate DATE
);
If you are asking whether you can create a foreign key constraint on subscription that references either the newspaper table or the magazine table, the answer is no, you cannot. A foreign key must reference exactly one primary key.
Since magazine and newspaper have the same set of attributes, the simple option is to combine them into a single periodical table with an additional periodical_type column to indicate whether it is a magazine or a newspaper. You could then create your foreign key to the periodical table.
Although it probably won't make sense in this particular example, you could also have separate columns in subscription for magazine_name and newspaper_name and create separate foreign key constraints on those columns along with a check constraint that ensured that exactly one of the values was non-NULL. That might make sense if the two different parent tables had radically different attributes.
Not related to your question but as a general bit of advice, I wouldn't use the name as the primary key. In addition to being rather long, names tend to change over time and names aren't necessarily unique. I would use a different attribute for the key, potentially a synthetic primary key generated from a sequence.

Oracle - referential integrity with multiple types of data

I'm working on a set of database tables in Oracle and trying to figure out a way to enforce referential integrity with slightly polymorphic data.
Specifically, I have a bunch of different tables--hypothetically, let's say I have Apples, Bananas, Oranges, Tangerines, Grapes, and a hundred more types of fruit. Now I'm trying to make a table which describes performing steps involving a fruit. So I want to insert one row that says "eat Apple ID 100", then another row which says "peel Banana ID 250", then another row which says "refrigerate Tangerine ID 500", and so on.
Historically, we've done this in two ways:
1 - Include a column for each possible type of fruit. Use a check constraint to ensure that all but one column is NULL. Use foreign keys to ensure referential integrity to our fruit. So in my hypothetical example, we'd have a table with columns ACTION, APPLEID, BANANAID, ORANGEID, TANGERINEID, and GRAPEID. For the first action, we'd have a row 'Eat', 100, NULL, NULL, NULL, NULL, NULL. For the second action, we'd have 'Peel', NULL, 250, NULL, NULL, NULL. etc. etc.
This approach is great for getting all of Oracle's RI benefits automatically, but it just doesn't scale to a hundred types of fruit. You end up getting too many columns to be practical. Just figuring out which type of fruit you are dealing with becomes a challenge.
2 - Include a column with the name of the fruit, and a column with a fruit ID. This works also, but there isn't any way (AFAIK) to have Oracle enforce the validity of the data in any way. So our columns would be ACTION, FRUITTYPE, and FRUITID. The row data would be 'Eat', 'Apple', 100, then 'Peel', 'Banana', 250, etc. But there's nothing preventing someone from deleting Apple ID 100, or inserting a step saying 'Eat', 'Apple', 90000000 even though we don't have an Apple with that ID.
Is there a way to avoid maintaining a separate column per each individual fruit type, but still preserve most the benefits of foreign keys? (Or technically, I could be convinced to use a hundred columns if I can hide the complexity with a neat trick somehow. It just has to look sane in day-to-day use.)
CLARIFICATION: In our actual logic, the "fruits" are totally disparate tables with very little commonality. Think customers, employees, meetings, rooms, buildings, asset tags, etc. The list of steps is supposed to be free-form and allow users to specify actions on any of these things. If we had one table which contained each of these unrelated things, I wouldn't have a problem, but it would also be a really weird design.
It's not clear to me why you need to identify the FRUIT_TYPE on the TASKS table. On the face of it that's just a poor (de-normalised) data model.
In my experience, the best way of modelling this sort of data is with a super-type for the generic thing (FRUIT in your example) and sub-types for the specifics (APPLE, GRAPE, BANANA). This allows us to store common attributes in one place while recording the particular attributes for each instance.
Here is the super-type table:
create table fruits
(fruit_id number not null
, fruit_type varchar2(10) not null
, constraint fruit_pk primary key (fruit_id)
, constraint fruit_uk unique (fruit_id, fruit_type)
, constraint fruit_ck check (fruit_type in ('GRAPE', 'APPLE', 'BANANA'))
)
/
FRUITS has a primary key and a compound unique key. We need the primary key for use in foreign key constraints, because compound keys are a pain in the neck. Except when they are not, which is the situation with these sub-type tables. Here we use the unique key as the reference, because by constraining the value of FRUIT_TYPE in the sub-type we can guarantee that records in the GRAPES table map to FRUITS records of type 'GRAPE', etc.
create table grapes
(fruit_id number not null
, fruit_type varchar2(10) not null default 'GRAPE'
, seedless_yn not null char(1) default 'Y'
, colour varchar2(5) not null
, constraint grape_pk primary key (fruit_id)
, constraint grape_ck check (fruit_type = 'GRAPE')
, constraint grape_fruit_fk foreign key (fruit_id, fruit_type)
references fruit (fruit_id, fruit_type)
, constraint grape_flg_ck check (seedless_yn in ('Y', 'N'))
)
/
create table apples
(fruit_id number not null
, fruit_type varchar2(10) not null
, apple_type varchar2(10) not null default 'APPLE'
, constraint apple_pk primary key (fruit_id)
, constraint apple_ck check (fruit_type = 'APPLE')
, constraint apple_fruit_fk foreign key (fruit_id, fruit_type)
references fruit (fruit_id, fruit_type)
, constraint apple_type_ck check (apple_type in ('EATING', 'COOKING', 'CIDER'))
)
/
create table bananas
(fruit_id number not null
, fruit_type varchar2(10) not null default 'BANANA'
, constraint banana_pk primary key (fruit_id)
, constraint banana_ck check (fruit_type = 'BANANA')
, constraint banana_fruit_fk foreign key (fruit_id, fruit_type)
references fruit (fruit_id, fruit_type)
)
/
In 11g we can make FRUIT_TYPE a virtual column for the sub-type and do away with the check constraint.
So, now we need a table for task types ('Peel', 'Refrigerate', 'Eat ', etc).
create table task_types
(task_code varchar2(4) not null
, task_descr varchar2(40) not null
, constraint task_type_pk primary key (task_code)
)
/
And the actual TASKS table is a simple intersection between FRUITS and TASK_TYPES.
create table tasks
(task_code varchar2(4) not null
, fruit_id number not null
, constraint task_pk primary key (task_code, fruit_id)
, constraint task_task_fk ask foreign key (task_code)
references task_types (task_code)
, constraint task_fruit_fk foreign key (fruit_id)
references fruit (fruit_id)
/
If this does not satisfy your needs please edit your question to include more information.
"... if you want different tasks for different fruits..."
Yes I wondered whether that was the motivation underlying the OP's posted design. But usually workflow is a lot more difficult than that: some tasks will apply to all fruits, some will only apply to (say) fruits which come in bunches, others will only be relevant to bananas.
"In our actual logic, the 'fruits' are totally disparate tables with
very little commonality. Think customers, employees, meetings, rooms,
buildings, asset tags, etc. The list of steps is supposed to be
free-form and allow users to specify actions on any of these things."
So you have a bunch of existing tables. You want to be able to assign records from these tables to tasks in a freewheeling style yet be able to guarantee the identify of the specific record which owns the task.
I think you still need a generic table to hold an ID for the actor in the task, but you will need to link it to the other tables somehow. Here is how I might approach it:
Soem sample existing tables:
create table customers
(cust_id number not null
, cname varchar2(100) not null
, constraint cust_pk primary key (fruit_id)
)
/
create table employees
(emp_no number not null
, ename varchar2(30) not null
, constraint emp_pk primary key (fruit_id)
)
/
A generic table to hold actors:
create table actors
(actor_id number not null
, constraint actor_pk primary key (actor_id)
)
/
Now, you need intersection tables to associate your existing tables with the new one:
create table cust_actors
(cust_id number not null
, actor_id number not null
, constraint cust_actor_pk primary key (cust_id, actor_id)
, constraint cust_actor_cust_fk foreign key (cust_id)
references customers (cust_id)
, constraint cust_actor_actor_fk foreign key (actor_id)
references actors (actor_id)
)
/
create table emp_actors
(emp_no number not null
, actor_id number not null
, constraint emp_actor_pk primary key (emp_no, actor_id)
, constraint emp_actor_emp_fk foreign key (emp_no)
references eployees (emp_no)
, constraint cust_actor_actor_fk foreign key (actor_id)
references actors (actor_id)
)
/
The TASKS table is rather unsurprising, given what's gone before:
create table tasks
(task_code varchar2(4) not null
, actor_id number not null
, constraint task_pk primary key (task_code, actor_id)
, constraint task_task_fk ask foreign key (task_code)
references task_types (task_code)
, constraint task_actor_fk foreign key (actor_id)
references actors (actor_id)
/
I agree all those intersection tables look like a lot of overhead but there isn't any other way to enforce foreign key constraints. The additional snag is creating ACTORS and CUSTOMER_ACTORS records every time you create a record in CUSTOMERS. Ditto for deletions. The only good news is that you can generate all the code you need.
Is this solution better than a table with one hundred optional foreign keys? Perhaps not: it's a matter of taste. But I like it better than having no foreign keys at all. If there is on euniversal truth in database practice it is this: databases which rely on application code to enforce relational integrity are databases riddled with children referencing the wrong parent or referencing no parent at all.

Resources