Dynamically Identify Columns in External Tables
We have a process in which we upload employee data from multiple legislations (e.g. US, Philippines, Latin America) via SQL*Loader.
This happens at least once a week, and the current process is to create a control file every time employee information is loaded,
then load the data into staging tables using SQL*Loader.
I was hoping to simplify the process by creating an external table and running a concurrent request to put the data into our staging tables.
There are two stumbling blocks I'm encountering:
1. Some columns are not used by some legislations.
Example: US uses the column "Veteran_Information", while the Philippines and Latin America don't.
Philippines uses "SSS_Number", while US and Latin America don't.
Latin America uses a "Medical_Insurance" column, while US and Philippines don't.
Something like below:
US: LEGISLATION, EMPLOYEE_NUMBER, DATE_OF_BIRTH, VETERAN_INFORMATION
PHL: LEGISLATION, EMPLOYEE_NUMBER, DATE_OF_BIRTH, SSS_NUMBER
LAT: LEGISLATION, EMPLOYEE_NUMBER, DATE_OF_BIRTH, MEDICAL_INSURANCE
2. Business users don't use a standard CSV template/format.
Since the file is sent by non-IT business users, they don't usually follow a prescribed format (probably a training/user issue).
They often don't follow the correct order of columns.
They often don't follow the correct number of columns.
They often don't follow the correct names of columns.
Something like below:
US: LEGISLATION, EMPLOYEE_ID, VETERAN_INFORMATION, DATE_OF_BIRTH, EMAIL_ADD
PHL: EMP_NUM, LEGISLATION, DOB, SSS_NUMBER, EMAIL_ADDRESS
LAT: LEGISLATION, PS_ID, BIRTH_DATE, EMAIL, MEDICAL_INSURANCE
Is there a way for External Tables to identify the correct order and naming of columns even if they're not in the correct order/naming convention in the File?
Taking the Column Data from Problem 2:
US: LEGISLATION | EMPLOYEE_ID | VETERAN_INFORMATION | DATE_OF_BIRTH | EMAIL_ADD
US | 111 | No | 1967 | vet#gmail.com
PHL: EMP_NUM | LEGISLATION | DOB | SSS_NUMBER | EMAIL_ADDRESS
222 | PHL | 1898 | 456789 | pinoy#gmail.com
LAT: LEGISLATION | PS_ID | BIRTH_DATE | EMAIL | MEDICAL_INSURANCE
HON | 333 | 1956 | hon#gmail.com | Yes
I would like it to be like this when it appears in the External Table:
LEGISLATION | EMPLOYEE_NUMBER | DATE_OF_BIRTH | VETERAN_INFORMATION | SSS_NUMBER | MEDICAL_INSURANCE | EMAIL_ADDRESS
US | 111 | 1967 | N | (NULL) | (NULL) | vet#gmail.com
PHL | 222 | 1898 | (NULL) | 456789 | (NULL) | pinoy#gmail.com
HON | 333 | 1956 | (NULL) | (NULL) | Yes | hon#gmail.com
Is there a way for External Tables to do something like above?
Thanks in advance!
The simplest would be:
Use three distinct load scripts, one for each type of input (US, PHL, HON). Each script discards the other two record types, places the columns in the right position (possibly with some transformation, such as 'No' -> 'N'), and inserts NULL for columns that are not present for that record type.
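If the header names and column order cannot be relied on, another option is to normalise each file before it reaches the external table or SQL*Loader. The sketch below is an illustration only, not the method described in the answer above: it assumes each file starts with a header row, and the ALIASES table is a hypothetical mapping built from the header spellings shown in the question.

```python
import csv
import io

# Hypothetical alias table: every header spelling seen in the question,
# mapped to the canonical staging column. Extend it as new spellings appear.
ALIASES = {
    "LEGISLATION": "LEGISLATION",
    "EMPLOYEE_NUMBER": "EMPLOYEE_NUMBER",
    "EMPLOYEE_ID": "EMPLOYEE_NUMBER",
    "EMP_NUM": "EMPLOYEE_NUMBER",
    "PS_ID": "EMPLOYEE_NUMBER",
    "DATE_OF_BIRTH": "DATE_OF_BIRTH",
    "DOB": "DATE_OF_BIRTH",
    "BIRTH_DATE": "DATE_OF_BIRTH",
    "VETERAN_INFORMATION": "VETERAN_INFORMATION",
    "SSS_NUMBER": "SSS_NUMBER",
    "MEDICAL_INSURANCE": "MEDICAL_INSURANCE",
    "EMAIL_ADDRESS": "EMAIL_ADDRESS",
    "EMAIL_ADD": "EMAIL_ADDRESS",
    "EMAIL": "EMAIL_ADDRESS",
}

# Fixed column order expected by the staging table.
CANONICAL = [
    "LEGISLATION", "EMPLOYEE_NUMBER", "DATE_OF_BIRTH", "VETERAN_INFORMATION",
    "SSS_NUMBER", "MEDICAL_INSURANCE", "EMAIL_ADDRESS",
]


def normalize(text):
    """Return rows in CANONICAL order; absent columns become ''."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    # An unknown header raises KeyError so bad files fail loudly.
    header = [ALIASES[name.strip().upper()] for name in next(reader)]
    rows = []
    for record in reader:
        if not record:  # skip blank lines
            continue
        values = dict(zip(header, (value.strip() for value in record)))
        rows.append([values.get(column, "") for column in CANONICAL])
    return rows
```

The normalised rows can then be written back out as a CSV with a fixed layout, so a single control file or external table definition covers all legislations, with the empty strings loaded as NULL.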
I have created an Informatica flow
in which I need to read data from a table that has only one column, containing employee IDs.
The column might contain duplicates, and I need to write the distinct values to a file using the query below.
Query:
select distinct
    emp_id
from
    employee
where
    emp_id not in
    (
        select distinct
            custid
        from
            customer
    );
I have added the above query in the Source Qualifier.
The employee table contains 5 million records and the customer table contains 20 billion records.
My Informatica session is still running after more than 6 hours, and nothing has been written to the file because of the huge volume of data in both tables.
Following is my query plan:
--------------------------------------------------------------------
| Id | Operation                | Name       |
--------------------------------------------------------------------
|  0 | SELECT STATEMENT         |            |
|  1 | PX COORDINATOR           |            |
|  2 | PX SEND QC (RANDOM)      | :TQ10002   |
|  3 | HASH UNIQUE              |            |
|  4 | PX RECEIVE               |            |
|  5 | PX SEND HASH             | :TQ10001   |
|  6 | HASH UNIQUE              |            |
|  7 | HASH JOIN ANTI           |            |
|  8 | PX RECEIVE               |            |
|  9 | PX SEND PARTITION (KEY)  | :TQ10000   |
| 10 | PX SELECTOR              |            |
| 11 | INDEX FAST FULL SCAN     | PK_EMP_ID  |
| 12 | PX PARTITION RANGE ALL   |            |
| 13 | INDEX FAST FULL SCAN     | PK_CUST_ID |
--------------------------------------------------------------------
Sample table data :
employee
111
123
145
1345
111
123
145
678
....
customer
111
111
111
1345
111
145
145
145
145
145
145
....
Expected output :
123
678
Any solution is much appreciated !!!
It seems to me the SQL is the problem. If you don't have anything like a Sorter or Aggregator in the mapping, you don't have to worry about Informatica.
The SQL seems to contain expensive operations. You can try the query below:
select emp_id
from employee
where not exists
(select 1 from customer where custid = emp_id)
This should be a little faster because:
You aren't running a subquery to get distinct values from a 20-billion-row customer table.
You don't need DISTINCT in the outer select if emp_id is unique in employee (as the PK_EMP_ID index in the plan suggests), and NOT EXISTS never multiplies the rows returned by the outer select.
You could also use a LEFT JOIN with a WHERE ... IS NULL filter, but I think it would be expensive because of join-induced duplicates.
I would start by partitioning the customer table by hash or range (customer_id) or by insert_date. This should speed up your inline select substantially.
Also try this:
select emp_id from employee
minus
select emp_id from employee e, customer c
where e.emp_id=c.custid;
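To check that the rewrites return the same rows as the original NOT IN query, the sample data from the question can be loaded into a throwaway database. This is only a sketch: it uses SQLite through Python rather than Oracle, so EXCEPT stands in for MINUS, it says nothing about performance at 20 billion rows, and the function name is made up for the illustration. DISTINCT is kept on the NOT EXISTS variant because the sample employee data contains repeated IDs.

```python
import sqlite3


def unmatched_employees(employee_ids, customer_ids):
    """Run both rewrites on in-memory tables and return their sorted results."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (emp_id INTEGER)")
    conn.execute("CREATE TABLE customer (custid INTEGER)")
    conn.executemany("INSERT INTO employee VALUES (?)",
                     [(i,) for i in employee_ids])
    conn.executemany("INSERT INTO customer VALUES (?)",
                     [(i,) for i in customer_ids])

    # Anti-join written with NOT EXISTS.
    not_exists = conn.execute(
        "SELECT DISTINCT emp_id FROM employee e "
        "WHERE NOT EXISTS (SELECT 1 FROM customer c WHERE c.custid = e.emp_id)"
    ).fetchall()

    # Set difference; SQLite spells Oracle's MINUS as EXCEPT.
    set_difference = conn.execute(
        "SELECT emp_id FROM employee "
        "EXCEPT "
        "SELECT e.emp_id FROM employee e JOIN customer c ON e.emp_id = c.custid"
    ).fetchall()

    conn.close()
    return (sorted(row[0] for row in not_exists),
            sorted(row[0] for row in set_difference))


employees = [111, 123, 145, 1345, 111, 123, 145, 678]
customers = [111, 111, 111, 1345, 111, 145, 145, 145, 145, 145, 145]
print(unmatched_employees(employees, customers))
```

Both variants should yield the IDs listed under "Expected output" in the question.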
I have data in two different linked tables in Airtable and I need to join them together. See example:
The PERSON table looks like:
Name | Classes
----------------
John | A,B,C,F
Sally | B,F
Max | B,C
While the linked CLASSES table looks like:
Class | Date | People
---------------------------
A | 1975 | John
B | 2000 | John,Sally,Max
C | 1823 | John,Max
D | 1492 |
E | 2020 |
F | 2010 | John,Sally
What I need is:
Person | Class | Date
---------------------
John | A | 1975
John | B | 2000
John | C | 1823
John | F | 2010
Sally | B | 2000
Sally | F | 2010
Max | B | 2000
Max | C | 1823
How do I get this view / table as output?
The more I see questions like this with no answer, the more I realise that Airtable just isn't a database in any real sense.
This is a perfectly reasonable question about how to join two tables after those tables have been normalised. The answer? It can't be done, at least not easily!
So what is Airtable supposed to be used for? Building non-normalised databases, otherwise known as spreadsheets!
If you click a value such as "A" or "B" in the "Classes" field of the PERSON table, a popup shows the class details.
Or, if you really need that kind of table, my suggestion is this:
Create a new table called "xxx", then write code in the scripting block that populates the new table with data from the PERSON and CLASSES tables.
PS: The scripting block is only supported in the "Pro" plan.
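The logic such a script has to implement is a simple expansion of each person's linked classes into one row per (person, class) pair. The sketch below shows only that logic in plain Python with the sample data from the question; it does not use the Airtable scripting API, and the names PERSON, CLASS_DATES and person_class_rows are made up for the illustration.

```python
# Sample data from the question: each person's linked classes,
# and the date stored on each class record.
PERSON = {
    "John": ["A", "B", "C", "F"],
    "Sally": ["B", "F"],
    "Max": ["B", "C"],
}
CLASS_DATES = {"A": 1975, "B": 2000, "C": 1823, "D": 1492, "E": 2020, "F": 2010}


def person_class_rows(person, class_dates):
    """Expand each person's class list into (person, class, date) rows."""
    return [
        (name, cls, class_dates[cls])
        for name, classes in person.items()
        for cls in classes
    ]


for row in person_class_rows(PERSON, CLASS_DATES):
    print(row)
```

Classes with no linked people (D and E in the sample) produce no rows, which matches the output requested in the question.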
I have one table that needs to be split into several other tables.
The main table is just a transitional table.
I dump data from an Excel file into it (from 5k to 200k rows) and, using INSERT INTO ... SELECT, split it into the correct tables (five different tables).
However, the latest dataset that my client sent has records with duplicate values.
The primary key for my table is usually ENI. But even this value is duplicated, because the same company can be both a customer and a service provider, so it has two different records that use the same ENI.
What I have so far:
I found a script that uses MERGE and modified it to find rows with the same ENI and set the same main_id on all of them.
| Main_id | ENI    | company_name | Type |
| 1       | 1864   | JOHN         | C    |
| 2       | 351485 | JOEL         | C    |
| 3       | 16546  | MICHEL       | C    |
| 2       | 351485 | JOEL J.      | S    |
| 1       | 1864   | JOHN E. E.   | C    |
Main_id: Primary key that the main DB uses
ENI: Unique company number
Type: 'C' - CUSTOMER, 'S' - SERVICE PROVIDER
In some cases the duplicates can have the same type, as with main_id 1.
There are several other columns...
What I need:
Insert one row for each main_id that my other script already sorted out, and set a flag on the others to show that they were not inserted. I can't delete any data, because I'll need to send this information to the customer for validation.
Or maybe I simply can't do it this way and should go back to good old Excel.
Edit: in response to a question below, here is an example:
|Main_id| ENI | company_name| Type| RANK|
| 1 | 1864 | JOHN | C | 1 |
| 2 | 351485 | JOEL | C | 1 |
| 3 | 16546 | MICHEL | C | 1 |
| 2 | 351485 | JOEL J. | S | 2 |
| 1 | 1864 | JOHN E. E. | C | 2 |
RANK - for example, 1864 appears 2 times:
the first one found gets 1, the second gets 2, and so on. I tried using
RANK() OVER (PARTITION BY MAIN_ID ORDER BY ENI)
RANK() OVER (PARTITION BY company_name ORDER BY ENI)
Thanks to TEJASH I was able to come up with this solution:
MERGE INTO TABLEA S
USING (Select ROWID AS ID,
row_number() Over(partition by eni order by eni, type) as RANK_DUPLICATED
From TABLEA
) T
ON (S.ROWID = T.ID)
WHEN MATCHED THEN UPDATE SET S.RANK_DUPLICATED= T.RANK_DUPLICATED;
As far as I understood your problem, you just need to know the duplicate based on 2 columns. You can achieve it using analytical function as follows:
Select t.*,
row_number() Over(partition by main_id, eni order by company_name) as rnk
From your_table t
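The effect of that query on the sample rows can be tried out locally. The sketch below is an illustration under stated assumptions: it runs the same ROW_NUMBER() expression in SQLite through Python rather than in Oracle, it requires an SQLite build with window functions (3.25 or later), and the extra ORDER BY is added only to make the printed order predictable.

```python
import sqlite3

# Sample rows from the question: (main_id, eni, company_name, type).
ROWS = [
    (1, 1864, "JOHN", "C"),
    (2, 351485, "JOEL", "C"),
    (3, 16546, "MICHEL", "C"),
    (2, 351485, "JOEL J.", "S"),
    (1, 1864, "JOHN E. E.", "C"),
]


def rank_duplicates(rows):
    """Number the rows inside each (main_id, eni) group, starting at 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE your_table "
        "(main_id INTEGER, eni INTEGER, company_name TEXT, type TEXT)"
    )
    conn.executemany("INSERT INTO your_table VALUES (?, ?, ?, ?)", rows)
    ranked = conn.execute(
        "SELECT main_id, eni, company_name, "
        "ROW_NUMBER() OVER (PARTITION BY main_id, eni ORDER BY company_name) AS rnk "
        "FROM your_table ORDER BY main_id, rnk"
    ).fetchall()
    conn.close()
    return ranked


for row in rank_duplicates(ROWS):
    print(row)
```

Rows with rnk = 1 are the ones to insert into the target tables; rows with a higher rnk are the duplicates to flag and send back for validation.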
I need to obtain all the records from a table where the FIRSTNAME and LASTNAME of the records are the same but their BIRTHDATE values differ by 15 years or more.
Consider my table looks like:
_______________________________________________________________________________
| PRIMARY_ID | UNIQUE_ID | FIRSTNAME | LASTNAME | SUFFIX | BIRTHDATE |
_______________________________________________________________________________
| 12345 | abcd | john | collin | Mr | 1975-10-01 00:00:00|
| 12345 | cdef | john | collin | Mr | 1960-10-01 00:00:00|
| 12345 | efgh | john | collin | Mr | 1975-10-01 00:00:00|
| 12345 | ghij | john | collin | Mr | 1960-10-01 00:00:00|
| 12345 | aaaa | john | collin | Mr | 1975-10-01 00:00:00|
| 12345 | bdfs | john | collin | Mr | 1975-10-01 00:00:00|
| 12345 | asdf | john | collin | Mr | null |
| 12345 | dfgh | john | collin | Mr | null |
| 23456 | ghij | jeremy | lynch | Mr | 1982-10-15 00:00:00|
| 23456 | aaaa | jacob | lynch | Mr | 1945-10-12 00:00:00|
| 23456 | bdfs | jeremy | lynch | Mr | 1945-10-12 00:00:00|
| 23456 | asdf | jacob | lynch | Mr | null |
| 23456 | dfgh | jeremy | lynch | Mr | null |
_______________________________________________________________________________
In this table, for PRIMARY_ID 12345, the FIRSTNAME and LASTNAME are all the same but the BIRTHDATE difference between the UNIQUE_IDs is 15 years, so this PRIMARY_ID needs to be pulled out. For PRIMARY_ID 23456, on the other hand, the FIRSTNAME is not the same for all UNIQUE_ID records, so it must not be pulled out.
The table might contain NULL values for BIRTHDATE, which should be ignored.
This is what I have tried so far:
SELECT
/*+ PARALLEL(16) */
PRIMARY_ID,
UNIQUE_ID,
FIRSTNAME,
LASTNAME,
SUFFIX,
BIRTHDATE,
RANK() OVER ( ORDER BY FIRSTNAME, LASTNAME, SUFFIX, BIRTHDATE) "GROUP"
FROM THE_TABLE;
I have written this query to form separate groups distinguished by FIRSTNAME, LASTNAME and BIRTHDATE. I do not know how to proceed further with this.
Can someone please help out?
NOTE: The BIRTHDATE field is in varchar datatype and I use Oracle 12C.
As I understand it, the goal is to return the distinct set of primary_id for which alphabetically adjacent unique_id records that share the same firstname and lastname are separated by 15+ years. I also assume that NULL should interrupt the comparison and be considered a non-match (otherwise, primary_id 23456 would also match here for the pseudo-adjacent bdfs + ghij).
There are other ways to do this, but one way available in 12c is to use pattern matching. An example is below. The example just uses a difference of 5478 days to represent 15 years, but one could refine that if greater exactitude were needed for intercalary days etc.
SELECT DISTINCT PRIMARY_ID
FROM THE_TABLE
MATCH_RECOGNIZE (
PARTITION BY PRIMARY_ID
ORDER BY UNIQUE_ID
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN(FIFTEEN_DIFF)
DEFINE FIFTEEN_DIFF AS
(FIFTEEN_DIFF.FIRSTNAME = PREV(FIFTEEN_DIFF.FIRSTNAME)
AND FIFTEEN_DIFF.LASTNAME = PREV(FIFTEEN_DIFF.LASTNAME)
AND (ABS(EXTRACT( DAY FROM (TO_TIMESTAMP(FIFTEEN_DIFF.BIRTHDATE,'YYYY-MM-DD HH24:MI:SS') - PREV(TO_TIMESTAMP(FIFTEEN_DIFF.BIRTHDATE,'YYYY-MM-DD HH24:MI:SS'))))) >= 5478)));
Result:
PRIMARY_ID
12345
1 row selected.
The above query does the following:
PARTITIONs to look at each PRIMARY_ID group individually,
then ORDERs by the UNIQUE_ID, so only alphabetically-adjacent records are compared.
Then each record is compared to the previous one, and if they share FIRSTNAME and LASTNAME and their BIRTHDATEs differ by 15+ years, the pair is counted as a MATCH and one record is returned to indicate this.
After any match is found, it skips to the next row and resumes comparing.
Since only the distinct matches are desired, a DISTINCT is included in the select statement.
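For readers who find MATCH_RECOGNIZE hard to follow, the same adjacent-row comparison can be sketched procedurally. This is an illustration in plain Python, not a replacement for the query: ROWS holds the sample data from the question, the function name is made up, and the 5478-day threshold is the same approximation of 15 years used above.

```python
from datetime import datetime
from itertools import groupby

FMT = "%Y-%m-%d %H:%M:%S"
D1975, D1960 = "1975-10-01 00:00:00", "1960-10-01 00:00:00"
D1982, D1945 = "1982-10-15 00:00:00", "1945-10-12 00:00:00"

# (PRIMARY_ID, UNIQUE_ID, FIRSTNAME, LASTNAME, BIRTHDATE) from the question.
ROWS = [
    (12345, "abcd", "john", "collin", D1975),
    (12345, "cdef", "john", "collin", D1960),
    (12345, "efgh", "john", "collin", D1975),
    (12345, "ghij", "john", "collin", D1960),
    (12345, "aaaa", "john", "collin", D1975),
    (12345, "bdfs", "john", "collin", D1975),
    (12345, "asdf", "john", "collin", None),
    (12345, "dfgh", "john", "collin", None),
    (23456, "ghij", "jeremy", "lynch", D1982),
    (23456, "aaaa", "jacob", "lynch", D1945),
    (23456, "bdfs", "jeremy", "lynch", D1945),
    (23456, "asdf", "jacob", "lynch", None),
    (23456, "dfgh", "jeremy", "lynch", None),
]


def ids_with_gap(rows, min_days=5478):
    """Return PRIMARY_IDs having adjacent same-name rows >= min_days apart."""
    found = set()
    # PARTITION BY PRIMARY_ID, ORDER BY UNIQUE_ID.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for primary_id, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        for prev, cur in zip(group, group[1:]):
            if prev[4] is None or cur[4] is None:
                continue  # a NULL birthdate interrupts the comparison
            gap = datetime.strptime(cur[4], FMT) - datetime.strptime(prev[4], FMT)
            if prev[2:4] == cur[2:4] and abs(gap.days) >= min_days:
                found.add(primary_id)
    return found


print(ids_with_gap(ROWS))
```

Collecting the IDs in a set plays the role of the DISTINCT in the SQL version.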
EDIT:
In response to follow-up questions, adding two additional examples.
Alternative 1: Pre-Filter NULL
This will bring different UNIQUE_ID into proximity, giving different matches.
SELECT DISTINCT PRIMARY_ID
FROM (SELECT PRIMARY_ID, UNIQUE_ID, FIRSTNAME, LASTNAME, SUFFIX, BIRTHDATE
FROM THE_TABLE
WHERE BIRTHDATE
IS NOT NULL)
MATCH_RECOGNIZE (
PARTITION BY PRIMARY_ID
ORDER BY UNIQUE_ID
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (FIFTEEN_DIFF)
DEFINE FIFTEEN_DIFF AS
(FIFTEEN_DIFF.FIRSTNAME = PREV(FIFTEEN_DIFF.FIRSTNAME)
AND FIFTEEN_DIFF.LASTNAME = PREV(FIFTEEN_DIFF.LASTNAME)
AND (ABS(EXTRACT(DAY FROM (TO_TIMESTAMP(FIFTEEN_DIFF.BIRTHDATE , 'YYYY-MM-DD HH24:MI:SS') -
PREV(TO_TIMESTAMP(FIFTEEN_DIFF.BIRTHDATE , 'YYYY-MM-DD HH24:MI:SS'))))) >= 5478)));
Result (this now includes PRIMARY_ID 23456, as removing NULL makes two UNIQUE_IDs that are 15+ years apart adjacent in the ordering):
PRIMARY_ID
12345
23456
2 rows selected.
Alternative 2: Count NULL as a match
SELECT DISTINCT PRIMARY_ID
FROM THE_TABLE
MATCH_RECOGNIZE (
PARTITION BY PRIMARY_ID
ORDER BY UNIQUE_ID
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (FIFTEEN_DIFF)
DEFINE FIFTEEN_DIFF AS
(FIFTEEN_DIFF.FIRSTNAME = PREV(FIFTEEN_DIFF.FIRSTNAME)
AND FIFTEEN_DIFF.LASTNAME = PREV(FIFTEEN_DIFF.LASTNAME)
AND ((ABS(EXTRACT(DAY FROM (TO_TIMESTAMP(FIFTEEN_DIFF.BIRTHDATE , 'YYYY-MM-DD HH24:MI:SS') -
PREV(TO_TIMESTAMP(FIFTEEN_DIFF.BIRTHDATE , 'YYYY-MM-DD HH24:MI:SS'))))) >= 5478)
OR (LEAST(FIFTEEN_DIFF.BIRTHDATE,PREV(FIFTEEN_DIFF.BIRTHDATE)) IS NULL
AND COALESCE(FIFTEEN_DIFF.BIRTHDATE,PREV(FIFTEEN_DIFF.BIRTHDATE)) IS NOT NULL))));
Result (this also returns both PRIMARY_IDs, as NULL is now counted as a match):
PRIMARY_ID
12345
23456
2 rows selected.
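The two NULL-handling alternatives differ only in how a missing BIRTHDATE is treated, which a small procedural sketch can make explicit. As before, this is plain Python over the question's sample data rather than Oracle SQL, and the function name and the null_mode values "prefilter" and "null_matches" are labels invented for this illustration.

```python
from datetime import datetime
from itertools import groupby

FMT = "%Y-%m-%d %H:%M:%S"
D1975, D1960 = "1975-10-01 00:00:00", "1960-10-01 00:00:00"
D1982, D1945 = "1982-10-15 00:00:00", "1945-10-12 00:00:00"

# (PRIMARY_ID, UNIQUE_ID, FIRSTNAME, LASTNAME, BIRTHDATE) from the question.
ROWS = [
    (12345, "abcd", "john", "collin", D1975),
    (12345, "cdef", "john", "collin", D1960),
    (12345, "efgh", "john", "collin", D1975),
    (12345, "ghij", "john", "collin", D1960),
    (12345, "aaaa", "john", "collin", D1975),
    (12345, "bdfs", "john", "collin", D1975),
    (12345, "asdf", "john", "collin", None),
    (12345, "dfgh", "john", "collin", None),
    (23456, "ghij", "jeremy", "lynch", D1982),
    (23456, "aaaa", "jacob", "lynch", D1945),
    (23456, "bdfs", "jeremy", "lynch", D1945),
    (23456, "asdf", "jacob", "lynch", None),
    (23456, "dfgh", "jeremy", "lynch", None),
]


def matching_ids(rows, null_mode, min_days=5478):
    """null_mode: 'prefilter' drops NULL rows first (Alternative 1);
    'null_matches' treats exactly one NULL in a pair as a match (Alternative 2)."""
    if null_mode == "prefilter":
        rows = [r for r in rows if r[4] is not None]
    found = set()
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for primary_id, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        for prev, cur in zip(group, group[1:]):
            if prev[2:4] != cur[2:4]:
                continue  # names must match in every variant
            both_set = prev[4] is not None and cur[4] is not None
            one_null = (prev[4] is None) != (cur[4] is None)
            if both_set:
                gap = datetime.strptime(cur[4], FMT) - datetime.strptime(prev[4], FMT)
                if abs(gap.days) >= min_days:
                    found.add(primary_id)
            elif null_mode == "null_matches" and one_null:
                found.add(primary_id)
    return found


print(matching_ids(ROWS, "prefilter"))
print(matching_ids(ROWS, "null_matches"))
```

In both modes PRIMARY_ID 23456 is picked up as well, for the reasons given with each result above.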
Say I have a dataset/table (banking sector) which has the following details.
Name | Mob.no | AccountNo | Address | SSN | Salary....... |and so on.
john | 123456 | 987654321 | abx 123 | 1122 | 28000
I have to dump this into Hadoop.
But while dumping it, I want the AccountNo and SSN columns to be encrypted
when they are stored in HDFS.
This is the first part.
Now, when I am retrieving the results,
decryption should happen first.
After that I want to mask some of the columns.
Say there are two people (a CEO and a Project Manager) viewing the results for john.
The CEO should be able to see all the details (columns) after decryption.
For the Project Manager, the AccountNo and Salary columns should be masked.
For example:
Name | Mob.no | AccountNo | Address | SSN | Salary....... |and so on.
john | 123456 | 9876xxxxx | abx 123 | 1122 | xxxxx
Is there any way to achieve this in Hadoop?
Encrypting columns of data while dumping into HDFS.
Masking columns based on hierarchy.
Any leads would be appreciated, since I am new to Hadoop.
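To make the masking requirement concrete, the sketch below shows the intended per-role output in plain Python. It is only an illustration of the rule described in the question, applied to already-decrypted values: it does not use any Hadoop component, it does not cover the encryption step, and the role names and the MASK_RULES table are assumptions made up for the example.

```python
# Hypothetical masking rules: for each role, the columns to mask and how
# many leading characters stay visible. Roles with no rules see everything.
MASK_RULES = {
    "ceo": {},
    "project_manager": {"AccountNo": 4, "Salary": 0},
}


def mask_value(value, keep):
    """Replace everything after the first `keep` characters with 'x'."""
    return value[:keep] + "x" * (len(value) - keep)


def view_for(role, record):
    """Return a copy of the decrypted record as the given role may see it."""
    rules = MASK_RULES[role]  # an unknown role raises KeyError on purpose
    return {
        column: mask_value(value, rules[column]) if column in rules else value
        for column, value in record.items()
    }


john = {
    "Name": "john", "Mob.no": "123456", "AccountNo": "987654321",
    "Address": "abx 123", "SSN": "1122", "Salary": "28000",
}
print(view_for("ceo", john))
print(view_for("project_manager", john))
```

With these rules the Project Manager view shows AccountNo as 9876xxxxx and Salary as xxxxx, as in the example above, while the CEO view is unchanged.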