Get records from multiple Hive tables without join


I have two tables:
Table1 desc:
count int
Table2 desc:
count_val int
I read the fields count and count_val from the tables above and insert them into an audit table (Table3).
Table3 desc:
count int
count_val int
I am trying to log the record counts of these two tables into the audit table on each job run.
Any suggestions are appreciated. Thanks!

If you only need aggregates (such as sums), you can get them with UNION ALL:
INSERT INTO TABLE audit
SELECT
SUM(count),
SUM(count_val)
FROM (
SELECT
t1.count,
0 as count_val
FROM table1 t1
UNION ALL
SELECT
0 as count,
t2.count_val
FROM table2 t2
) unioned;
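Otherwise a join is required, because the rows of the two tables have to be matched up somehow; that is how relational algebra (the theory behind SQL) works. With the data below, nothing says which row of table1 belongs with which row of table2:
===table1===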
| count|
|------|
| 12 |
| 751 |
| 167 |
===table2===
| count_val|
|----------|
| 1991 |
| 321 |
| 489 |
| 7201 |
| 3906 |
===audit===
| count | count_val|
|-------|----------|
| ??? | ??? |
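The UNION ALL approach can be tried end to end on the sample data above. This is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made so the example runs anywhere), and the column name count is quoted because it is also a function name.

```python
import sqlite3

# In-memory SQLite stands in for Hive here (assumption, for runnability only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 ("count" INTEGER);
    CREATE TABLE table2 (count_val INTEGER);
    CREATE TABLE audit  ("count" INTEGER, count_val INTEGER);
    INSERT INTO table1 VALUES (12), (751), (167);
    INSERT INTO table2 VALUES (1991), (321), (489), (7201), (3906);
""")

# Pad each side with 0 for the column it lacks, stack the rows with
# UNION ALL, then sum each column into a single audit row.
conn.execute("""
    INSERT INTO audit
    SELECT SUM("count"), SUM(count_val)
    FROM (
        SELECT t1."count" AS "count", 0 AS count_val FROM table1 t1
        UNION ALL
        SELECT 0 AS "count", t2.count_val FROM table2 t2
    ) unioned
""")

audit_rows = conn.execute('SELECT "count", count_val FROM audit').fetchall()
print(audit_rows)  # [(930, 13908)]
```

No row of table1 is ever matched to a row of table2; the zeros simply keep the two column sets apart until the sums are taken.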

Related

How to write huge table data to file | Informatica 10.x

I have created an Informatica flow
that reads a single column, containing employee IDs, from a table.
The column may contain duplicates, and I need to write the distinct values to a file using the query below.
Query:
select distinct
emp_id
from
employee
where
emp_id not in
(
select distinct
custid
from
customer
);
I have added the above query to the Source Qualifier.
The employee table contains 5 million records and the customer table contains 20 billion records.
My Informatica session is still running after more than 6 hours and nothing has been written to the file, because of the huge amount of data in both tables.
Here is my query plan:
--------------------------------------------------------------------
Id | Operation | Name |
--------------------------------------------------------------------
0 | SELECT STATEMENT | |
1 | AX COORDINATOR | |
2 | AX SEND QC (RANDOM) | :AQ10002 |
3 | HASH UNIQUE | |
4 | AX RECEIVE | |
5 | AX SEND HASH | :AQ10001 |
6 | HASH UNIQUE | |
7 | HASH JOIN ANTI | |
8 | AX RECEIVE | |
9 | AX SEND PARTITION (KEY) | :AQ10000 |
10 | AX SELECTOR | |
11 | INDEX FAST FULL SCAN | PK_EMP_ID |
12 | AX PARTITION RANGE ALL | |
13 | INDEX FAST FULL SCAN | PK_CUST_ID |
--------------------------------------------------------------------
Sample table data :
employee
111
123
145
1345
111
123
145
678
....
customer
111
111
111
1345
111
145
145
145
145
145
145
....
Expected output :
123
678
Any solution is much appreciated !!!

It seems to me the SQL is the problem. If you don't have anything like a Sorter or Aggregator in the mapping, you don't have to worry about Informatica itself.
The SQL seems to contain expensive operations. You can try the following:
select emp_id
from employee
where not exists
(select 1 from customer where custid =emp_id)
This should be a little faster because:
you aren't running a subquery to get the distinct values from the 20-billion-row customer table;
you don't need DISTINCT in the outer select, assuming emp_id is unique in the employee table (the PK_EMP_ID index in your plan suggests so), and NOT EXISTS returns each employee row at most once, so it introduces no duplicates.
You could also use a left join plus a WHERE filter, but I think that would be expensive because of the duplicates the join produces.
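The NOT EXISTS rewrite can be checked against the sample data from the question. The sketch below runs on SQLite through Python's sqlite3 module rather than Oracle (an assumption for runnability), and it loads the employee rows exactly as listed in the question, duplicates included, to show that DISTINCT can only be dropped when emp_id really is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (emp_id INTEGER);
    CREATE TABLE customer (custid INTEGER);
    INSERT INTO employee VALUES (111),(123),(145),(1345),(111),(123),(145),(678);
    INSERT INTO customer VALUES (111),(111),(111),(1345),(111),(145),(145),(145);
""")

# Anti-join: keep employees for which no matching customer row exists.
anti_join_sql = """
    SELECT {distinct} e.emp_id
    FROM employee e
    WHERE NOT EXISTS (SELECT 1 FROM customer c WHERE c.custid = e.emp_id)
    ORDER BY e.emp_id
"""
without_distinct = [r[0] for r in conn.execute(anti_join_sql.format(distinct=""))]
with_distinct = [r[0] for r in conn.execute(anti_join_sql.format(distinct="DISTINCT"))]

print(without_distinct)  # [123, 123, 678]
print(with_distinct)     # [123, 678]
```

NOT EXISTS never multiplies rows, but it also does not remove duplicates that are already present in employee.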

I would start by partitioning the customer table by hash or range (customer_id) or by insert_date; this should speed up your inline select substantially.
Also try this:
select emp_id from employee
minus
select emp_id from employee e, customer c
where e.emp_id=c.custid;
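For comparison, here is the set-difference form on the same sample data. SQLite (used here through Python's sqlite3 as a stand-in, an assumption) spells Oracle's MINUS as EXCEPT; the second query is a hypothetical simplification that skips the join altogether.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (emp_id INTEGER);
    CREATE TABLE customer (custid INTEGER);
    INSERT INTO employee VALUES (111),(123),(145),(1345),(111),(123),(145),(678);
    INSERT INTO customer VALUES (111),(111),(111),(1345),(111),(145),(145),(145);
""")

# Same shape as the MINUS query above: all employees minus the matched ones.
minus_style = [r[0] for r in conn.execute("""
    SELECT emp_id FROM employee
    EXCEPT
    SELECT e.emp_id FROM employee e JOIN customer c ON e.emp_id = c.custid
    ORDER BY 1
""")]

# Hypothetical simpler variant: subtract the customer ids directly.
direct = [r[0] for r in conn.execute("""
    SELECT emp_id FROM employee
    EXCEPT
    SELECT custid FROM customer
    ORDER BY 1
""")]

print(minus_style, direct)  # [123, 678] [123, 678]
```

EXCEPT (like MINUS) returns distinct rows, so no separate DISTINCT is needed here.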

TABLE ACCESS FULL in Oracle execution plan

I have been tasked with working out the SELECT statement behind this explain plan:
------------------------------------------
| Id | Operation | Name |
------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | HASH JOIN RIGHT ANTI | |
| 2 | VIEW | VW_NSO_1 |
| 3 | HASH JOIN RIGHT SEMI| |
| 4 | TABLE ACCESS FULL | PART |
| 5 | TABLE ACCESS FULL | ORDERS |
| 6 | TABLE ACCESS FULL | CUSTOMER |
------------------------------------------
I can derive the select statement from Ids 0-5, but what does line 6 mean?
This is what I have managed to work out so far:
select *
from customer c join orders o
on c.custkey = o.custkey
where o_totalprice
not in
(select p_retailprice
from part p join orders o
on orders.o_custkey >= 0 and 0.1*o_totalprice >= 0)
I can't see where the last line of the plan comes into play.

Your query is:
select *
from customer c join orders o
on c.custkey = o.custkey
where o_totalprice
not in
(select p_retailprice
from part p join orders o
on orders.o_custkey >= 0 and 0.1*o_totalprice >= 0)
And your explain plan is
------------------------------------------
| Id | Operation | Name |
------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | HASH JOIN RIGHT ANTI | |
| 2 | VIEW | VW_NSO_1 |
| 3 | HASH JOIN RIGHT SEMI| |
| 4 | TABLE ACCESS FULL | PART |
| 5 | TABLE ACCESS FULL | ORDERS |
| 6 | TABLE ACCESS FULL | CUSTOMER |
------------------------------------------
In your case, this is what happens:
You are getting all the records from both customer and orders that match the join condition on the custkey field.
Your predicate restricts the output to rows where o_totalprice is not part of the dataset returned by the subquery. (For readability the column should be qualified with its table; I assume it comes from the orders table.)
The subquery returns all values of p_retailprice produced by the join between part and orders on orders.o_custkey >= 0 and 0.1*o_totalprice >= 0.
Taking this into consideration, the CBO is:
Accessing the CUSTOMER table by TABLE ACCESS FULL (line 6), which is logical, as you are selecting all columns from the table and probably have no index on custkey.
Performing a hash semi join (line 3) between PART and ORDERS. In general, a semi join is used for an IN or EXISTS clause, and the probe for a given row stops as soon as the EXISTS or IN condition is satisfied.
The HASH JOIN ANTI on line 1 appears when the optimizer pushes the join predicate into a view, normally when an anti join (NOT IN) is in place. Its result is then joined to the CUSTOMER table on line 6.
You are filtering only on the right table of the join (ORDERS), which is why the access paths reflect that.
This is just an overview of your execution plan and of the reasons why the CBO chose those access paths.
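To make the semi join and anti join terminology concrete, here is a small sketch of what the three forms return. It runs on SQLite through Python's sqlite3 (not Oracle, so the plans will differ), and the tiny tables and the key columns o_orderkey and p_partkey are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (o_orderkey INTEGER, o_totalprice INTEGER);
    CREATE TABLE part   (p_partkey INTEGER, p_retailprice INTEGER);
    INSERT INTO orders VALUES (1, 100), (2, 200), (3, 300);
    INSERT INTO part   VALUES (10, 100), (11, 100), (12, 300);
""")

def keys(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# Plain join: order 1 matches two parts, so it appears twice.
inner_join = keys("""SELECT o.o_orderkey FROM orders o
                     JOIN part p ON p.p_retailprice = o.o_totalprice
                     ORDER BY o.o_orderkey""")
# Semi join (IN): each order appears at most once, however many parts match.
semi_join = keys("""SELECT o.o_orderkey FROM orders o
                    WHERE o.o_totalprice IN (SELECT p_retailprice FROM part)
                    ORDER BY o.o_orderkey""")
# Anti join (NOT IN): only orders with no matching part survive.
anti_join = keys("""SELECT o.o_orderkey FROM orders o
                    WHERE o.o_totalprice NOT IN (SELECT p_retailprice FROM part)
                    ORDER BY o.o_orderkey""")

print(inner_join, semi_join, anti_join)  # [1, 1, 3] [1, 3] [2]
```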

When i select , only one column is checked without duplicates

I have two tables like this:
first table
+------------+---------------+--------+
| pk | user_one |user_two|
+------------+---------------+--------+
second table
+------------+---------------+--------+----------------+----------------+
| pk | sender |receiver|fk of firsttable|content |
+------------+---------------+--------+----------------+----------------+
The first and second tables have a one-to-many (1:N) relationship.
There are many records in the second table:
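+-----+--------+----------+------------------+----------------------+
| pk  | sender | receiver | fk of firsttable | content              |
+-----+--------+----------+------------------+----------------------+
| 120 | car224 | car223   | 1                | test message1 to 223 |
| 121 | car224 | car223   | 1                | test message2 to 223 |
| 122 | car224 | car225   | 21               | test message1 to 225 |
| 123 | car224 | car225   | 21               | test message2 to 225 |
| 124 | car224 | car225   | 21               | test message3 to 225 |
| 125 | car224 | car225   | 21               | test message4 to 225 |
+-----+--------+----------+------------------+----------------------+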
I need to find the rows that share the same fk value and, for each such group, keep the row with the largest pk.
I've changed the column names above to make the example easier to understand.
Here is the actual SQL I've tried so far:
select *
from (select rownum rn,
mr.mrno,
mr.user_one,
mr.user_two,
m.mno,
m.content
from tbl_messagerelation mr,
tbl_message m
where (mr.user_one = 'car224' or
mr.user_two='car224') and
m.rowid in (select max(rowid)
from tbl_message
group by m.mno) and
rownum <= 1*20)
where rn > (1-1) * 20
And this is the result:
+---------+-------+----------+----------+-------------------------+----------------------+
| rn | mrno | user_one | user_two | mno(pk of second table) | content |
+---------+-------+----------+----------+-------------------------+----------------------+
| 1 | 1 | car224 | car223 | 125 | test message4 to 225 |
| 2 | 21 | car224 | car225 | 125 | test message4 to 225 |
+---------+-------+----------+----------+-------------------------+----------------------+
My desired result is something like this:
+---------+---------+----------+--------------------+----------------------+
| fk | sender | receiver | pk of second table | content |
+---------+---------+----------+--------------------+----------------------+
| 1 | car224 | car223 | 121 | test message2 to 223 |
| 21 | car224 | car225 | 125 | test message4 to 225 |
+---------+---------+----------+--------------------+----------------------+

Your table description is confusing when compared with your query. As far as I could understand, though, you are probably looking for row_number().
One important piece of advice: use the standard explicit JOIN syntax rather than the outdated comma-separated (a, b) syntax for joins. The join keys were not clear to me, so replace the ? placeholders appropriately in your final query.
select * from
(
select mr.*, m.*, row_number() over ( partition by m.fk order by m.pk desc ) as rn
from tbl_messagerelation mr join tbl_message m on mr.? = m.?
) where rn =1
Or perhaps you don't need that join at all
select * from
(
select m.*, row_number() over ( partition by m.fk order by m.pk desc ) as rn
from tbl_message m
) where rn =1
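The second query can be run against the sample rows from the question. This is a sketch on SQLite via Python's sqlite3 (an assumption; window functions need SQLite 3.25 or newer), using the simplified column names pk and fk from the question rather than the real ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_message (pk INTEGER, sender TEXT, receiver TEXT,
                              fk INTEGER, content TEXT);
    INSERT INTO tbl_message VALUES
        (120, 'car224', 'car223', 1,  'test message1 to 223'),
        (121, 'car224', 'car223', 1,  'test message2 to 223'),
        (122, 'car224', 'car225', 21, 'test message1 to 225'),
        (123, 'car224', 'car225', 21, 'test message2 to 225'),
        (124, 'car224', 'car225', 21, 'test message3 to 225'),
        (125, 'car224', 'car225', 21, 'test message4 to 225');
""")

# Number the rows inside each fk group from the largest pk downwards,
# then keep only the first row of every group.
latest = conn.execute("""
    SELECT fk, sender, receiver, pk, content
    FROM (
        SELECT m.*,
               ROW_NUMBER() OVER (PARTITION BY m.fk ORDER BY m.pk DESC) AS rn
        FROM tbl_message m
    )
    WHERE rn = 1
    ORDER BY fk
""").fetchall()

for row in latest:
    print(row)
# (1, 'car224', 'car223', 121, 'test message2 to 223')
# (21, 'car224', 'car225', 125, 'test message4 to 225')
```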

Oracle UNION ALL query takes temp space

I have a query like this:
SELECT * FROM TEST1 LEFT OUTER JOIN TEST2 on TEST1.ID=TEST2.ID
UNION ALL
SELECT * FROM TEST3 LEFT OUTER JOIN TEST4 on TEST3.ID=TEST4.ID;
The behavior I see is that it first joins the TEST1 and TEST2 tables (billions of rows) and stores the output in the temp tablespace. Then it joins TEST3 and TEST4 and saves that output in the same temp segment. Finally it selects the records from there to display the result.
I see this behavior in both Redshift and Oracle. I was wondering why it stores the result in temporary segments after getting the result of the first SELECT. It is time-consuming and also eats up temp space. Can't it just start returning rows once the first SELECT finishes and then move on to the second one (instead of storing everything)?

This answer is somewhat speculative, because I don't have an Oracle doc reference. By inspection, we can imagine instead that you wanted to run the following query:
SELECT * FROM TEST1 JOIN TEST2
UNION ALL
SELECT * FROM TEST3 JOIN TEST4
ORDER BY some_col;
It should be clear that to apply an operation like ORDER BY to the whole result, all the records returned from the union query need to be in one logical place. A temp table would seem to work.
The fact that you are not using ORDER BY does not appear to affect the workflow Oracle uses.
I can also add another reason why Oracle is insisting on using a temp table here. Suppose it would be possible to write both halves of the union directly to the buffer. But what would happen if, at a later date, the size of the total union query suddenly exceeded what the buffer can hold? The answer is that your database would crash. So, using a temp table is a safe bet which should generally always work.

How do you observe this behaviour? Are you by any chance performing an INSERT or a CREATE TABLE? That would explain your observation, because in that case all rows are required in the end.
The same can be observed if your client has an option set to fetch all rows.
But in the normal case, where the client is interested in only the first few rows, Oracle quickly returns the first available rows (one fetch of array-size rows) from the first join and ignores the second one.
You may perform this little Gedankenexperiment:
create table test1 as
select rownum id,
lpad('x',1023,'X') pad
from dual connect by level <= 1000000;
Create tables 2 to 4 analogously.
Now run your query (adapted to valid syntax):
SELECT * FROM TEST1 CROSS JOIN TEST2
UNION ALL
SELECT * FROM TEST3 CROSS JOIN TEST4;
For me this returns the first page in SQL Developer in about 30 seconds, which somewhat disproves your claim.
Simply calculate the TEMP space required for two 10**6 * 10**6 Cartesian joins with a row length of 1K: it is far above my TEMP configuration.
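The calculation is easy to spell out. The sketch below takes the numbers from the experiment (one million rows per table, roughly 1 KB per row, two Cartesian joins); treating a joined row as 1 KB is a simplification, since it actually carries the padding of both sides, so this is a lower bound.

```python
rows_per_table = 10**6   # rows in each of TEST1..TEST4
row_bytes = 1024         # ~1K per row, as in the experiment (lower bound)
joins = 2                # TEST1 x TEST2 and TEST3 x TEST4

rows_per_join = rows_per_table * rows_per_table      # 10**12 joined rows
temp_bytes = joins * rows_per_join * row_bytes       # bytes to materialise
temp_tib = temp_bytes / 2**40                        # same figure in TiB

print(f"{temp_bytes:.3e} bytes, about {temp_tib:,.0f} TiB")
```

That is on the order of petabytes, so a result coming back after about 30 seconds cannot have been fully materialised in TEMP first.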
One possible way to observe what Oracle is actually doing is to run the query with the /*+ gather_plan_statistics */ hint.
Then get the SQL_ID of the statement and check the actual row counts (A-Rows) in the plan:
select * from table(dbms_xplan.display_cursor('a9y62gxagups6',null,'ALLSTATS LAST'));
SQL_ID a9y62gxagups6, child number 0
-------------------------------------
SELECT /*+ gather_plan_statistics */ * FROM TEST1 CROSS JOIN TEST2
UNION ALL SELECT * FROM TEST3 CROSS JOIN TEST4
Plan hash value: 1763392637
--------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 1 | UNION-ALL | | 1 | | 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 2 | MERGE JOIN CARTESIAN| | 1 | 1000G| 50 |00:00:28.52 | 166K| 166K| 142K| | | |
| 3 | TABLE ACCESS FULL | TEST1 | 1 | 1000K| 1 |00:00:00.02 | 4 | 28 | 0 | | | |
| 4 | BUFFER SORT | | 1 | 1000K| 50 |00:00:28.49 | 166K| 166K| 142K| 1255M| 11M| 97M (0)|
| 5 | TABLE ACCESS FULL | TEST2 | 1 | 1000K| 1000K|00:00:03.66 | 166K| 166K| 0 | | | |
| 6 | MERGE JOIN CARTESIAN| | 0 | 1000G| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
| 7 | TABLE ACCESS FULL | TEST3 | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
| 8 | BUFFER SORT | | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | 1103M| 10M| |
| 9 | TABLE ACCESS FULL | TEST4 | 0 | 1000K| 0 |00:00:00.01 | 0 | 0 | 0 | | | |
--------------------------------------------------------------------------------------------------------------------------------------
You can see that Oracle
1) fully scanned TEST2 (row 5),
2) fetched one row from TEST1 (row 3),
3) returned the first 50 rows (row 0),
4) left TEST3 and TEST4 untouched (rows 7 and 9).
You can simply adapt the example to your own join to see similar results.
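The behaviour in the plan (Starts = 0 for the second branch) resembles lazy iteration. As a loose analogy only, not a model of Oracle's executor, the Python sketch below chains two generators and fetches one "page" of 50 rows; the names branch and produced are made up for the illustration.

```python
from itertools import chain, islice

# Counts how many rows each branch has actually produced.
produced = {"first": 0, "second": 0}

def branch(name, n):
    """Yield n rows lazily, counting each row as it is produced."""
    for i in range(n):
        produced[name] += 1
        yield (name, i)

# UNION ALL as a lazy concatenation: the second branch is not started
# until the first one is exhausted.
union_all = chain(branch("first", 10**6), branch("second", 10**6))
first_page = list(islice(union_all, 50))   # client fetches 50 rows

print(len(first_page), produced)  # 50 {'first': 50, 'second': 0}
```

Nothing is stored in between: rows flow to the consumer as they are produced, and a branch that is never reached costs nothing.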

(Nested?) Select statement with MAX and WHERE clause

I'm racking my brain over a set of data in order to generate a report from an Oracle DB.
The data is in two tables:
SUPPLY
DEVICE
There is only one column that links the two tables:
SUPPLY.DEVICE_ID
DEVICE.ID
SUPPLY contains this data:
| DEVICE_ID | COLOR_TYPE | SERIAL | UNINSTALL_DATE |
|----------- |------------ |-------------- |--------------------- |
| 1232 | 1 | CAP857496 | 08/11/2016,19:10:50 |
| 5263 | 2 | CAP57421 | 07/11/2016,11:20:00 |
| 758 | 3 | CBO753421869 | 07/11/2016,04:25:00 |
| 758 | 4 | CC9876543 | 06/11/2016,11:40:00 |
| 8575 | 4 | CVF75421 | 05/11/2016,23:59:00 |
| 758 | 4 | CAP67543 | 30/09/2016,11:00:00 |
In DEVICE there are columns that I have to select (more or less all of them), and each row is unique.
What I need to achieve is:
for each SUPPLY.DEVICE_ID and SUPPLY.COLOR_TYPE, I need the most recent ROW -> MAX(UNINSTALL_DATE)
JOINED with
more or less all the columns in DEVICE.
At the end I should have something like this:
| ACCOUNT_CODE | MODEL | DEVICE.SERIAL | DEVICE_ID | COLOR_TYPE | SUPPLY.SERIAL | UNINSTALL_DATE |
|-------------- |------- |--------------- |----------- |------------ |--------------- |--------------------- |
| BUSTO | MS410 | LM753 | 1232 | 1 | CAP857496 | 08/11/2016,19:10:50 |
| MACCHI | MX310 | XC876 | 5263 | 2 | CAP57421 | 07/11/2016,11:20:00 |
| ASL_COMO | MX711 | AB123 | 758 | 3 | CBO753421869 | 07/11/2016,04:25:00 |
| ASL_COMO | MX711 | AB123 | 758 | 4 | CC9876543 | 06/11/2016,11:40:00 |
| ASL_VARESE | X950 | DE8745 | 8575 | 4 | CVF75421 | 05/11/2016,23:59:00 |
So far, using a nested select like:
SELECT DEVICE_ID,COLOR_TYPE,SERIAL,UNINSTALL_DATE FROM
(SELECT DEVICE_ID,COLOR_TYPE,SERIAL,UNINSTALL_DATE
FROM SUPPLY WHERE DEVICE_ID = '123456' ORDER BY UNINSTALL_DATE DESC)
WHERE ROWNUM <= 1
I managed to get the highest value in the UNINSTALL_DATE column, after first trying MAX(UNINSTALL_DATE) and HIGHEST(UNINSTALL_DATE).
I tried also:
SELECT SUPPLY.DEVICE_ID, SUPPLY.COLOR_TYPE, ....
FROM SUPPLY,DEVICE WHERE SUPPLY.DEVICE_ID = DEVICE.ID
and it works, but it gives me ALL the items; basically it's a merge of the two tables.
When I try to narrow down the selected data, I get errors or an empty result.
I'm starting to wonder whether it's even possible to obtain this data, and I've begun exporting the data to Excel to work from there, but I hope someone can help me before I give up...
Thank you in advance.

"for each SUPPLY.DEVICE_ID and SUPPLY.COLOR_TYPE, I need the most recent ROW -> MAX(UNINSTALL_DATE)"
Use the ROW_NUMBER function like this:
SELECT s.*,
row_number() OVER (
PARTITION BY DEVICE_ID, COLOR_TYPE
ORDER BY UNINSTALL_DATE DESC
) As RN
FROM SUPPLY s
This query marks the most recent row of each group with RN=1.
"JOINED with more or less all the columns in DEVICE."
Just join the above query to the DEVICE table:
SELECT d.*,
x.COLOR_TYPE,
x.SERIAL,
x.UNINSTALL_DATE
FROM (
SELECT s.*,
row_number() OVER (
PARTITION BY DEVICE_ID, COLOR_TYPE
ORDER BY UNINSTALL_DATE DESC
) As RN
FROM SUPPLY s
) x
JOIN DEVICE d
ON d.ID = x.DEVICE_ID AND x.RN=1
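The joined query can be checked against the sample rows from the question. This sketch runs on SQLite via Python's sqlite3 instead of Oracle (an assumption; it needs SQLite 3.25 or newer for window functions). Dates are stored as ISO-formatted text so that they sort correctly, and the DEVICE columns are taken from the desired output in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DEVICE (ID INTEGER, ACCOUNT_CODE TEXT, MODEL TEXT, SERIAL TEXT);
    CREATE TABLE SUPPLY (DEVICE_ID INTEGER, COLOR_TYPE INTEGER,
                         SERIAL TEXT, UNINSTALL_DATE TEXT);
    INSERT INTO DEVICE VALUES
        (1232, 'BUSTO',      'MS410', 'LM753'),
        (5263, 'MACCHI',     'MX310', 'XC876'),
        (758,  'ASL_COMO',   'MX711', 'AB123'),
        (8575, 'ASL_VARESE', 'X950',  'DE8745');
    INSERT INTO SUPPLY VALUES
        (1232, 1, 'CAP857496',    '2016-11-08 19:10:50'),
        (5263, 2, 'CAP57421',     '2016-11-07 11:20:00'),
        (758,  3, 'CBO753421869', '2016-11-07 04:25:00'),
        (758,  4, 'CC9876543',    '2016-11-06 11:40:00'),
        (8575, 4, 'CVF75421',     '2016-11-05 23:59:00'),
        (758,  4, 'CAP67543',     '2016-09-30 11:00:00');
""")

# Rank supplies per (DEVICE_ID, COLOR_TYPE) by date, newest first,
# keep rank 1 and attach the device columns.
report = conn.execute("""
    SELECT d.ACCOUNT_CODE, d.MODEL, d.SERIAL AS DEVICE_SERIAL,
           x.DEVICE_ID, x.COLOR_TYPE, x.SERIAL AS SUPPLY_SERIAL, x.UNINSTALL_DATE
    FROM (
        SELECT s.*,
               ROW_NUMBER() OVER (PARTITION BY DEVICE_ID, COLOR_TYPE
                                  ORDER BY UNINSTALL_DATE DESC) AS RN
        FROM SUPPLY s
    ) x
    JOIN DEVICE d ON d.ID = x.DEVICE_ID
    WHERE x.RN = 1
    ORDER BY x.UNINSTALL_DATE DESC
""").fetchall()

for row in report:
    print(row)
```

The older CAP67543 row for device 758, color 4 drops out, leaving the five rows of the desired result.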

OK - so you could group by device_id, color_type and select max(uninstall_date) as well, and join to the other table. But you would miss the serial value for the most recent row (for each combination of device_id, color_type).
There are a few ways to fix that. Your attempt with rownum was close, but the problem is that you need to order within each "group" (by device_id, color_type) and get the first row from each group. I am sure someone will post a solution along those lines, using either row_number() or rank() or perhaps the analytic version of max(uninstall_date).
When you just need the "top" row from each group, you can use keep (dense_rank first/last) - which may be slightly more efficient - like so:
select device_id, color_type,
max(serial) keep (dense_rank last order by uninstall_date) as serial,
max(uninstall_date) as uninstall_date
from supply
group by device_id, color_type
;
and then join to the other table. NOTE: dense_rank last will pick up the row OR ROWS with the most recent (max) date for each group. If there are ties, that is more than one row; the serial will then be the max (in lexicographical order) among those rows with the most recent date. You can also select min, or add some order so you pick a specific one (you didn't discuss this possibility).

SELECT
d.ACCOUNT_CODE, d.DNS_HOST_NAME,d.IP_ADDRESS,d.MODEL_NAME,d.OVERRIDE_SERIAL_NUMBER,d.SERIAL_NUMBER,
s.COLOR, s.SERIAL_NUMBER, s.UNINSTALL_TIME
FROM (
SELECT s.DEVICE_ID, s.LAST_LEVEL_READ, s.SERIAL_NUMBER,TRUNC(s.UNINSTALL_TIME), row_number()
OVER (
PARTITION BY DEVICE_ID, COLOR
ORDER BY UNINSTALL_TIME DESC
) As RN
FROM SUPPLY s
WHERE s.UNINSTALL_TIME IS NOT NULL AND s.SERIAL_NUMBER IS NOT NULL
)
JOIN DEVICE d
ON d.ID = s.DEVICE_ID AND s.RN=1;
@krokodilko: thank you very much for your help. The first query works. I modified it to remove the junk, using the real column names (yesterday evening I had no access to the DB) and selecting only the data I need.
Unfortunately, when I join the two tables as you suggested, I get this error:
ORA-00904: "S"."RN": invalid identifier
00904. 00000 - "%s: invalid identifier"
If I remove the s. before RN, the ORA-00904 moves back to s.DEVICE_ID.
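A likely cause, judging from the query as posted: the alias s exists only inside the parentheses, so the outer SELECT and the ON clause cannot see it. The inline view needs an alias of its own, and it must expose every column the outer query uses under a usable name (the posted inner select omits COLOR and leaves TRUNC(s.UNINSTALL_TIME) unaliased). The sketch below reproduces the effect on SQLite via Python's sqlite3, not Oracle, so the error text differs; the alias x and the cut-down tables are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device (id INTEGER, account_code TEXT);
    CREATE TABLE supply (device_id INTEGER, color INTEGER,
                         serial_number TEXT, uninstall_time TEXT);
    INSERT INTO device VALUES (758, 'ASL_COMO');
    INSERT INTO supply VALUES
        (758, 4, 'CC9876543', '2016-11-06 11:40:00'),
        (758, 4, 'CAP67543',  '2016-09-30 11:00:00');
""")

# Inner query: every column the outer query needs is selected by name.
inner = """
    SELECT s.device_id, s.color, s.serial_number, s.uninstall_time,
           ROW_NUMBER() OVER (PARTITION BY s.device_id, s.color
                              ORDER BY s.uninstall_time DESC) AS rn
    FROM supply s
    WHERE s.uninstall_time IS NOT NULL AND s.serial_number IS NOT NULL
"""

# Broken: the outer query refers to s, which is only visible inside the view.
broken = f"""SELECT d.account_code, s.serial_number
             FROM ({inner}) JOIN device d ON d.id = s.device_id AND s.rn = 1"""
# Fixed: the view gets its own alias x, and the outer query uses that.
fixed = f"""SELECT d.account_code, x.serial_number
            FROM ({inner}) x JOIN device d ON d.id = x.device_id AND x.rn = 1"""

try:
    conn.execute(broken)
    broken_error = None
except sqlite3.OperationalError as exc:
    broken_error = str(exc)   # SQLite's counterpart of ORA-00904

fixed_rows = conn.execute(fixed).fetchall()
print(broken_error)
print(fixed_rows)  # [('ASL_COMO', 'CC9876543')]
```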
