How to write huge table data to file | Informatica 10.x - oracle

I have created an Informatica flow where I need to read data from a table, specifically a single column that contains emp IDs. The column may contain duplicates, and I need to write only the distinct values to a file.
Query:
select distinct emp_id
from employee
where emp_id not in
(
    select distinct custid
    from customer
);
I have added the above query in the Source Qualifier.
The employee table contains 5 million records and the customer table contains 20 billion records.
My Informatica session is still running after more than 6 hours, and nothing has been written to the file because of the huge data volume in both tables.
Following is my query plan:
------------------------------------------------------------------
| Id | Operation                        | Name        |
------------------------------------------------------------------
|  0 | SELECT STATEMENT                 |             |
|  1 |  PX COORDINATOR                  |             |
|  2 |   PX SEND QC (RANDOM)            | :TQ10002    |
|  3 |    HASH UNIQUE                   |             |
|  4 |     PX RECEIVE                   |             |
|  5 |      PX SEND HASH                | :TQ10001    |
|  6 |       HASH UNIQUE                |             |
|  7 |        HASH JOIN ANTI            |             |
|  8 |         PX RECEIVE               |             |
|  9 |          PX SEND PARTITION (KEY) | :TQ10000    |
| 10 |           PX SELECTOR            |             |
| 11 |            INDEX FAST FULL SCAN  | PK_EMP_ID   |
| 12 |         PX PARTITION RANGE ALL   |             |
| 13 |          INDEX FAST FULL SCAN    | PK_CUST_ID  |
------------------------------------------------------------------
Sample table data:
employee
111
123
145
1345
111
123
145
678
....
customer
111
111
111
1345
111
145
145
145
145
145
145
....
Expected output:
123
678
Any solution is much appreciated!

It seems to me the SQL is the problem. If you don't have anything like a Sorter or Aggregator transformation, you don't have to worry about the Informatica side.
The SQL has some expensive operations. You can try the query below:
select emp_id
from employee e
where not exists
    (select 1 from customer c where c.custid = e.emp_id)
This should be a little faster because:
You aren't running a subquery to get distinct values from the 20-billion-row customer table.
You don't need DISTINCT in the outer select, because emp_id is unique in the employee table, and NOT EXISTS makes sure no duplicates come out of the outer select.
You can also use a LEFT JOIN plus a WHERE filter, as shown below, but I think it will be more expensive because of join-induced duplicates.
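For reference, that left-join variant (a sketch, using the same column names) would be:

select e.emp_id
from employee e
left join customer c
    on c.custid = e.emp_id
where c.custid is null;

Matched rows are discarded by the IS NULL filter, so the result is the same anti-join; if emp_id can repeat in employee, a DISTINCT would still be needed.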

I would start by partitioning the customer table, by hash or range on custid, or on an insert date; this would speed up your inline select substantially.
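For illustration, a hash-partitioned copy might look like this sketch (the partition count of 64 is an arbitrary assumption, and for a 20-billion-row table an online approach such as DBMS_REDEFINITION or partition exchange would be more realistic than a straight CTAS):

-- build a hash-partitioned copy of customer, spread across 64 partitions
CREATE TABLE customer_part
PARTITION BY HASH (custid)
PARTITIONS 64
AS
SELECT * FROM customer;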
Also try this:
select emp_id from employee
minus
select e.emp_id from employee e, customer c
where e.emp_id = c.custid;

Related

Oracle 11g insert into select from a table with duplicate rows

I have one table that needs to be split into several other tables, with the main table acting only as a transitive (staging) table.
I dump data from an Excel file into it (from 5k to 200k rows) and, using INSERT INTO ... SELECT, split it into the correct tables (five different tables).
However, the latest dataset my client sent has records with duplicate values.
The primary key is usually ENI for my table. But even this field is duplicated, because the same company can be both a customer and a service provider: they have two different registrations but use the same ENI.
What I have so far:
I found a script that uses MERGE and modified it to find rows with the same ENI and update them all with the same main_id.
| Main_id | ENI    | company_name | Type |
| 1       | 1864   | JOHN         | C    |
| 2       | 351485 | JOEL         | C    |
| 3       | 16546  | MICHEL       | C    |
| 2       | 351485 | JOEL J.      | S    |
| 1       | 1864   | JOHN E. E.   | C    |
Main_id: primary key that the main DB uses
ENI: unique company number
Type: 'C' - CUSTOMER, 'S' - SERVICE PROVIDER
Some cases can have the same type, just like id 1.
There are several other columns...
What I need:
Insert any one of the main_ids my other script already sorted, and set a flag on the others showing they were not inserted. I can't delete any data; I'll need to send this info to the customer to validate.
Or, if I simply can't do it this way, I'll go back to the good old Excel.
Edit: as asked below, this is an example:
| Main_id | ENI    | company_name | Type | RANK |
| 1       | 1864   | JOHN         | C    | 1    |
| 2       | 351485 | JOEL         | C    | 1    |
| 3       | 16546  | MICHEL       | C    | 1    |
| 2       | 351485 | JOEL J.      | S    | 2    |
| 1       | 1864   | JOHN E. E.   | C    | 2    |
RANK would work like this: since 1864 appears 2 times, the first occurrence found gets 1, the second 2, and so on. I tried using:
RANK() OVER (PARTITION BY MAIN_ID ORDER BY ENI)
RANK() OVER (PARTITION BY company_name ORDER BY ENI)
Thanks to TEJASH I was able to come up with this solution:
MERGE INTO TABLEA S
USING (SELECT ROWID AS ID,
              ROW_NUMBER() OVER (PARTITION BY eni ORDER BY eni, type) AS RANK_DUPLICATED
       FROM TABLEA
      ) T
ON (S.ROWID = T.ID)
WHEN MATCHED THEN UPDATE SET S.RANK_DUPLICATED = T.RANK_DUPLICATED;
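From there, the insert step could look like this sketch (target_table and its column list are assumptions; adjust to the real split tables):

-- insert only the first occurrence of each ENI; the rest stay behind,
-- identifiable by RANK_DUPLICATED > 1, so no data is deleted
INSERT INTO target_table (main_id, eni, company_name, type)
SELECT main_id, eni, company_name, type
FROM TABLEA
WHERE rank_duplicated = 1;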
As far as I understood your problem, you just need to identify the duplicates based on 2 columns. You can achieve that using an analytic function as follows:
SELECT t.*,
       ROW_NUMBER() OVER (PARTITION BY main_id, eni ORDER BY company_name) AS rnk
FROM your_table t
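A sketch of the flagging step built on that rnk (the FLAG column and its 'INSERTED'/'SKIPPED' values are assumptions):

-- mark the first occurrence as inserted and every later duplicate as skipped
MERGE INTO TABLEA s
USING (SELECT ROWID AS rid,
              ROW_NUMBER() OVER (PARTITION BY main_id, eni
                                 ORDER BY company_name) AS rnk
       FROM TABLEA) t
ON (s.ROWID = t.rid)
WHEN MATCHED THEN UPDATE
    SET s.flag = CASE WHEN t.rnk = 1 THEN 'INSERTED' ELSE 'SKIPPED' END;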

When I select, only one column is checked without duplicates

I have 2 tables like this:
first table
+------------+---------------+--------+
| pk | user_one |user_two|
+------------+---------------+--------+
second table
+------------+---------------+--------+----------------+----------------+
| pk | sender |receiver|fk of firsttable|content |
+------------+---------------+--------+----------------+----------------+
The first and second tables have a one-to-many (1:N) relationship.
There are many records in the second table:
| pk | sender|receiver|fk of firsttable|content |
|120 |car224 |car223 |1 |test message1 to 223
|121 |car224 |car223 |1 |test message2 to 223
|122 |car224 |car225 |21 |test message1 to 225
|123 |car224 |car225 |21 |test message2 to 225
|124 |car224 |car225 |21 |test message3 to 225
|125 |car224 |car225 |21 |test message4 to 225
Where rows share the same fk value, I want only the row with the largest pk.
I've changed the column names above to make it easier to understand.
Here is the actual SQL I've tried so far:
select *
from (select rownum rn,
mr.mrno,
mr.user_one,
mr.user_two,
m.mno,
m.content
from tbl_messagerelation mr,
tbl_message m
where (mr.user_one = 'car224' or
mr.user_two='car224') and
m.rowid in (select max(rowid)
from tbl_message
group by m.mno) and
rownum <= 1*20)
where rn > (1-1) * 20
And this is the result:
+---------+-------+----------+----------+-------------------------+----------------------+
| rn | mrno | user_one | user_two | mno(pk of second table) | content |
+---------+-------+----------+----------+-------------------------+----------------------+
| 1 | 1 | car224 | car223 | 125 | test message4 to 225 |
| 2 | 21 | car224 | car225 | 125 | test message4 to 225 |
+---------+-------+----------+----------+-------------------------+----------------------+
My desired result is something like this:
+---------+---------+----------+--------------------+----------------------+
| fk | sender | receiver | pk of second table | content |
+---------+---------+----------+--------------------+----------------------+
| 1 | car224 | car223 | 121 | test message2 to 223 |
| 21      | car224  | car225   | 125                | test message4 to 225 |
+---------+---------+----------+--------------------+----------------------+
Your table description, when compared to your query, is confusing. However, what I could understand is that you are probably looking for row_number().
An important piece of advice is to use the standard explicit JOIN syntax rather than the outdated comma-separated join syntax. The join keys were not clear to me, so replace the ? placeholders appropriately in your final query.
select * from
(
    select mr.*, m.*,
           row_number() over (partition by m.fk order by m.pk desc) as rn
    from tbl_messagerelation mr
    join tbl_message m on mr.? = m.?
)
where rn = 1
Or perhaps you don't need that join at all:
select * from
(
    select m.*,
           row_number() over (partition by m.fk order by m.pk desc) as rn
    from tbl_message m
)
where rn = 1

Get records from multiple Hive tables without join

I have 2 tables:
Table1 desc:
count int
Table2 desc:
count_val int
I get the fields count and count_val from the above tables and insert them into another audit table (table3).
Table3 desc:
count int
count_val int
I am trying to log the record counts of these 2 tables into the audit table for each job run.
Any suggestions are appreciated. Thanks!
If you want just aggregations (like sums), the solution comes from using UNION ALL:
INSERT INTO TABLE audit
SELECT
SUM(count),
SUM(count_val)
FROM (
SELECT
t1.count,
0 as count_val
FROM table1 t1
UNION ALL
SELECT
0 as count,
t2.count_val
FROM table2 t2
) unioned;
Otherwise a join is required, because you need some way to match your rows; that's how relational algebra (the theory behind SQL) works.
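Since the stated goal was record counts rather than sums, the same UNION ALL pattern can carry the counts directly; a minimal sketch, assuming the table names above:

INSERT INTO TABLE audit
SELECT
    SUM(cnt),
    SUM(cnt_val)
FROM (
    -- each branch contributes its own row count in one column and 0 in the other
    SELECT COUNT(*) AS cnt, 0 AS cnt_val FROM table1
    UNION ALL
    SELECT 0 AS cnt, COUNT(*) AS cnt_val FROM table2
) counts;

The outer SUMs collapse the two branch rows into the single audit row.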
===table1===
| count|
|------|
| 12 |
| 751 |
| 167 |
===table2===
| count_val|
|----------|
| 1991 |
| 321 |
| 489 |
| 7201 |
| 3906 |
===audit===
| count | count_val|
|-------|----------|
| ??? | ??? |

Improving SQL Exists scalability

Say we have two tables, TEST and TEST_CHILD, created in the following way:
CREATE TABLE TEST (id1 number PRIMARY KEY, word VARCHAR(50), numero number);
CREATE TABLE TEST_CHILD (id2 number references TEST(id1), word2 VARCHAR(50));
CREATE INDEX TEST_IDX ON TEST_CHILD(word2);
CREATE INDEX TEST_JOIN_IDX ON TEST_CHILD(id2);
INSERT INTO TEST SELECT ROWNUM, U1.USERNAME||U2.TABLE_NAME, LENGTH(U1.USERNAME) FROM ALL_USERS U1, ALL_TABLES U2;
INSERT INTO TEST_CHILD SELECT MOD(ROWNUM,15000)+1, U1.USER_ID||U2.TABLE_NAME FROM ALL_USERS U1, ALL_TABLES U2;
We would like to query rows from the TEST table that satisfy some criteria in the child table, so we go for:
SELECT /*+ FIRST_ROWS(10) */ * FROM TEST T WHERE EXISTS (SELECT NULL FROM TEST_CHILD TC WHERE word2 LIKE 'string%' AND TC.id2 = T.id1) AND ROWNUM < 10;
We always want just the first 10 results, no more. Therefore, we would like the response time for reading 10 results to be the same whether the table has 10 matching values or 1,000,000, since the database could get 10 distinct results from the child table and then fetch the corresponding values from the parent table (or at least that is the plan we would like). But when checking the actual execution plan we see:
-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 54 | 5 (20)| 00:00:01 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | NESTED LOOPS | | | | | |
| 3 | NESTED LOOPS | | 1 | 54 | 5 (20)| 00:00:01 |
| 4 | SORT UNIQUE | | 1 | 23 | 3 (0)| 00:00:01 |
| 5 | TABLE ACCESS BY INDEX ROWID| TEST_CHILD | 1 | 23 | 3 (0)| 00:00:01 |
|* 6 | INDEX RANGE SCAN | TEST_IDX | 1 | | 2 (0)| 00:00:01 |
|* 7 | INDEX UNIQUE SCAN | SYS_C005145 | 1 | | 0 (0)| 00:00:01 |
| 8 | TABLE ACCESS BY INDEX ROWID | TEST | 1 | 31 | 1 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<10)
6 - access("WORD2" LIKE 'string%')
filter("WORD2" LIKE 'string%')
7 - access("TC"."ID2"="T"."ID1")
There is a SORT UNIQUE under the COUNT STOPKEY, which as far as I know means it reads all matching results from the child table and deduplicates them before finally selecting only the first 10, making the query not as scalable as we would like.
Is there any mistake in my example?
Is it possible to improve this execution plan so it scales better?
The SORT UNIQUE is going to find and sort all of the records from TEST_CHILD that matched 'string%'; it is NOT going to read all results from the child table. Your logic requires this: if you only picked the first 10 rows from TEST_CHILD that matched 'string%', and those 10 rows all had the same ID, then your final results from TEST would have only 1 row.
Anyway, your performance should be fine as long as 'string%' matches a relatively low number of rows in TEST_CHILD. If your situation is such that 'string%' often matches a HUGE record count in TEST_CHILD, there's not much you can do to make the SQL more performant given the current tables. In such a case, if this is a mission-critical SQL with performance tied to your annual bonus, there's probably some fancy footwork you could do with MATERIALIZED VIEWs to, e.g., pre-compute 10 TEST rows for high-cardinality WORD2 values in TEST_CHILD.
One final thought: a "risky" solution, but one which should work if you don't have thousands of TEST_CHILD rows matching the same TEST row, would be the following:
SELECT *
FROM TEST
WHERE ID1 IN
(SELECT ID2
FROM TEST_CHILD
WHERE word2 like 'string%'
AND ROWNUM < 1000)
AND ROWNUM <10;
You can adjust 1000 up or down, of course, but if it's too low, you risk finding fewer than 10 distinct ID values, which would give you final results with fewer than 10 rows.

Joining tables returns extra rows

I am stuck on an Oracle query.
I am joining two tables on their cityid.
When I query the first table alone, it returns 486 rows, but when I join the two tables on cityid, no matter which join type I use, it returns 570 rows. Please advise: how can I get only the 486 records?
The query is as follows:
select c.year,c.amount,c.product,g.state
from Accounts c
join Address g
on g.cityid=c.cityid
order by c.year,c.product;
regards
That's perfectly possible.
If you have multiple addresses for a given account, or multiple accounts for a given address, you may wind up with more rows than just what's in the address or account table.
Consider:
Account
id | ... | cityid
4 | ... | 12
5 | ... | 12
6 | ... | 13
7 | ... | 14
Address
id | ... | cityid
2 | ... | 12
3 | ... | 13
4 | ... | 14
With your join you get:
Account Address
id | ... | cityid | id | ... | cityid
4 | ... | 12 | 2 | ... | 12
5 | ... | 12 | 2 | ... | 12
6 | ... | 13 | 3 | ... | 13
7 | ... | 14 | 4 | ... | 14
So, you see there are 4 records returned, even though there are 3 records in Address, with record Address.2 being repeated.
This could go the other way if the foreign key relationships were reversed.
And this is actually a core feature of relational databases: when data is entered with foreign key relationships maintained, the data entry does not need to be repeated.
You can limit the rows by selecting only the first (lowest id) address per city to join on. This usually involves a subquery or temporary table; I'll leave the exact syntax to an Oracle expert, because I think Sybase's syntax for it is different (and required to be done within a stored procedure, yick), but a sketch follows below.
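In Oracle, one way to express that idea is ROW_NUMBER() in an inline view; a minimal sketch, assuming Address has an id column to serve as the tiebreaker, as in the example above:

select c.year, c.amount, c.product, g.state
from Accounts c
join (
    -- keep only the lowest-id address per cityid
    select a.cityid, a.state,
           row_number() over (partition by a.cityid order by a.id) as rn
    from Address a
) g
    on g.cityid = c.cityid
   and g.rn = 1
order by c.year, c.product;

With at most one Address row per cityid, the join can no longer multiply the 486 Accounts rows.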
I find myself wondering if there might be a slightly different interpretation of the schema outside what you've described that might be more likely to resolve your issue.
