How to update a table in Hive 0.13?

My Hive version is 0.13. I have two tables, table_1 and table_2.
table_1 contains:
customer_id | items | price | updated_date
------------+-------+-------+-------------
10 | watch | 1000 | 20170626
11 | bat | 400 | 20170625
table_2 contains:
customer_id | items | price | updated_date
------------+----------+-------+-------------
10 | computer | 20000 | 20170624
I want to update a record in table_2 with the values from table_1 if its customer_id already exists in table_2; if not, the record should be appended to table_2.
Since Hive 0.13 does not support UPDATE, I tried using a join, but it fails.

You can use row_number() or a full join. This is an example using row_number(), where rows from table_1 (the new data) take precedence over the existing rows in table_2:
insert overwrite table table_2
select customer_id, items, price, updated_date
from
(
select customer_id, items, price, updated_date,
row_number() over(partition by customer_id order by new_flag desc) rn
from
(
select customer_id, items, price, updated_date, 1 as new_flag
from table_1
union all
select customer_id, items, price, updated_date, 0 as new_flag
from table_2
) all_data
)s where rn=1;
Also see this answer for update using FULL JOIN: https://stackoverflow.com/a/37744071/2700344
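The FULL JOIN variant could look roughly like the following. This is a sketch, not the linked answer's exact query; it assumes, as in the question, that table_1 holds the incoming rows and table_2 is the target. The aliases n (new) and o (old) are made up for illustration.

```sql
-- Sketch: rebuild table_2 from a full join of old (o) and new (n) rows.
-- Assumption: customer_id is unique within each table.
insert overwrite table table_2
select
  case when n.customer_id is not null then n.customer_id  else o.customer_id  end as customer_id,
  case when n.customer_id is not null then n.items        else o.items        end as items,
  case when n.customer_id is not null then n.price        else o.price        end as price,
  case when n.customer_id is not null then n.updated_date else o.updated_date end as updated_date
from table_2 o
full outer join table_1 n
  on o.customer_id = n.customer_id;
```

With the sample data from the question, table_2 would end up holding the two rows of table_1: customer 10 is replaced and customer 11 is appended.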

Related

rewrite query without DENSE_RANK

I have one very slow query and am trying to improve its response time with a materialized view, but one part of it is not compatible with the General Restrictions on Fast Refresh.
How can I rewrite it without DENSE_RANK?
create table t (id,object_id,log_cre_date) as
select 1,2,to_date('18/5/2010, 08:00','dd/mm/yyyy, hh:mi') from dual union all
select 2,2,to_date('18/5/2010, 10:00','dd/mm/yyyy, hh mi') from dual union all
select 3,3,to_date('18/5/2010, 11:00','dd/mm/yyyy, hh mi') from dual union all
select 4,3,to_date('18/5/2010, 12:10','dd/mm/yyyy, hh mi') from dual union all
select 5,4,to_date('18/5/2010, 12:20','dd/mm/yyyy, hh mi') from dual union all
select 6,4,to_date('18/5/2010, 11:30','dd/mm/yyyy, hh mi') from dual;
SELECT
MAX(t.id) KEEP(DENSE_RANK FIRST ORDER BY log_cre_date ASC) id,
t.object_id
FROM
t
GROUP BY
t.object_id
I am not sure the accepted answer is fast refreshable. Here is a query that definitely is:
SELECT max(cast(to_char(t.log_cre_date,'YYYYMMDDHH24MISS') || lpad(t.id,30,'0') as varchar2(80))) maxid,
t.object_id,
COUNT(*) cnt
FROM t
GROUP BY t.object_id;
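The idea is to append the zero-padded id to the formatted log_cre_date and take the max of the concatenation. Because the date comes first, the max belongs to the row with the latest log_cre_date, and the id you need can be extracted from its last 30 characters later. (To get the earliest row instead, as DENSE_RANK FIRST in the question does, take MIN rather than MAX; ties on the date then resolve to the smallest id.)
So, to get the id, you would do this:
SELECT to_number(substr(maxid,-30)) id, object_id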
FROM your_materialized_view;
You could put that in a view to hide the complexity.
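A minimal sketch of such a view, assuming the materialized view is called t_mv (the name used in the full example); the view name t_latest_id_v is hypothetical:

```sql
-- Hypothetical wrapper view; t_mv stands for the materialized view
-- built on the aggregate query above.
CREATE OR REPLACE VIEW t_latest_id_v AS
SELECT to_number(substr(maxid, -30)) AS id,
       object_id
FROM   t_mv;

-- Callers then never see the concatenated key:
SELECT id, object_id FROM t_latest_id_v;
```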
Here is a full example:
Create the base table
DROP TABLE t;
create table t (id,object_id,log_cre_date) as
select 1,2,to_date('18/5/2010, 08:00','dd/mm/yyyy, hh:mi') from dual union all
select 2,2,to_date('18/5/2010, 10:00','dd/mm/yyyy, hh mi') from dual union all
select 3,3,to_date('18/5/2010, 11:00','dd/mm/yyyy, hh mi') from dual union all
select 4,3,to_date('18/5/2010, 12:10','dd/mm/yyyy, hh mi') from dual union all
select 5,4,to_date('18/5/2010, 12:20','dd/mm/yyyy, hh mi') from dual union all
select 6,4,to_date('18/5/2010, 11:30','dd/mm/yyyy, hh mi') from dual;
Add some constraints to allow fast-refresh MV
ALTER TABLE t MODIFY id NOT NULL;
ALTER TABLE t ADD CONSTRAINT t_pk PRIMARY KEY ( id );
Create a snapshot log to enable fast refresh
--DROP MATERIALIZED VIEW LOG ON t;
CREATE MATERIALIZED VIEW LOG ON t WITH ROWID, PRIMARY KEY (OBJECT_ID, LOG_CRE_DATE) INCLUDING NEW VALUES;
Create the materialized view (note the presence of COUNT(*) in the select list; it is important!)
--DROP MATERIALIZED VIEW t_mv;
CREATE MATERIALIZED VIEW t_mv
REFRESH FAST ON COMMIT AS
SELECT max(cast(to_char(t.log_cre_date,'YYYYMMDDHH24MISS') || lpad(t.id,30,'0') as varchar2(80))) maxid,
t.object_id,
COUNT(*) cnt
FROM t
GROUP BY t.object_id;
Test it out
select to_number(substr(maxid,-30)) id, object_id
from t_mv;
+----+-----------+
| ID | OBJECT_ID |
+----+-----------+
| 2 | 2 |
| 4 | 3 |
| 5 | 4 |
+----+-----------+
DELETE FROM t WHERE id = 2;
COMMIT;
select to_number(substr(maxid,-30)) id, object_id
from t_mv;
+----+-----------+
| ID | OBJECT_ID |
+----+-----------+
| 4 | 3 |
| 5 | 4 |
| 1 | 2 | -- Now ID #1 is the latest for object_id 2
+----+-----------+
Maybe this query will run faster:
select object_id, id
from (
select object_id, first_value(id) over(partition by object_id order by log_cre_date) as id
from t
)
group by object_id, id;
Hope it helps!
I went through the restrictions, but I am not sure whether the following query will work.
Try this and let us know if it works.
Select t.id, t.object_id from
T join
(SELECT
min(log_cre_date) mindt,
t.object_id
FROM
t
GROUP BY
t.object_id) t1
On t.object_id = t1.object_id
And t.log_cre_date = t1.mindt;
Cheers!!

How to get mismatch records of two tables from same database in hive?

Eg:
select username, country from table1
Minus
Select username, country from table2;
The above MINUS query works in an RDBMS, but I want the same result using Hive. Can we use joins in Hive to get the result? If so, how do I get the proper result with a Hive query?
Set operations (MINUS/EXCEPT/INTERSECT in addition to UNION) are supported as of Hive 2.3.0 (released on 17 July 2017)
https://issues.apache.org/jira/browse/HIVE-12764
Demo
create table table1 (username string, country string);
create table table2 (username string, country string);
insert into table1 values ('Danny','USA'),('Danny','USA'),('David','UK');
insert into table2 values ('David','UK'),('Michal','France');
select username, country from table1
minus
Select username, country from table2
;
+--------------+-------------+
| _u1.username | _u1.country |
+--------------+-------------+
| Danny | USA |
+--------------+-------------+
In older Hive versions you can use:
select username
,country
from ( select 1 tab,username, country from table1
union all select 2 tab,username, country from table2
) t
group by username
,country
having count(case when tab = 2 then 1 end) = 0
;
+----------+---------+
| username | country |
+----------+---------+
| Danny | USA |
+----------+---------+
You can use a left join as follows:
select table1.username, table1.country
from table1 left join table2
on table1.username=table2.username and table1.country=table2.country
where table2.username is NULL and table2.country is NULL;
Yes. As MINUS and EXISTS are not available in older Hive versions, the MINUS operation can be done with the LEFT JOIN below.
SELECT t1.username, t1.country
FROM
(select username, country from table1) t1
LEFT JOIN
(Select username, country from table2) t2
ON t1.username =t2.username
AND t1.country =t2.country
WHERE t2.username IS NULL;
Important note: put the IS NULL check in the WHERE clause rather than adding it with AND to the join condition; the two give different results.
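One difference from MINUS is worth keeping in mind: MINUS returns distinct rows, while a plain LEFT JOIN keeps any duplicates that exist in table1. A sketch that also removes duplicates, using the demo tables table1 and table2 from the earlier answer (the aliases are illustrative), might be:

```sql
-- Anti-join with DISTINCT to mimic MINUS on (username, country).
-- Assumption: neither column contains NULLs. MINUS treats NULLs as equal,
-- whereas the equality join below never matches them.
select distinct t1.username, t1.country
from table1 t1
left join table2 t2
  on  t1.username = t2.username
  and t1.country  = t2.country
where t2.username is null;
```

For the demo data, the two identical Danny/USA rows collapse into one, matching the MINUS output shown earlier.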

ORACLE: How to get all column with GROUP by only 1 column?

I'm using Oracle Database.
How can I get all columns while grouping by only one column (EMP_ID)?
For example, I have the table ESD_RESULTS:
FIRST_NAME | LAST_NAME | EMP_ID | WRIST_STATUS | LFOOT_STATUS | DATE
Dodo | A | 0101 | Pass | Pass | 2016-01-18 10:00
Wedi | Wil | 0105 | Pass | Pass | 2016-01-18 10:05
Dodo | A | 0101 | Pass | Fail | 2016-01-18 10:11
What I want to display is the following (the latest row by date for each EMP_ID):
FIRST_NAME | LAST_NAME | EMP_ID | WRIST_STATUS | LFOOT_STATUS | DATE
Dodo | A | 0101 | Pass | Fail | 2016-01-18 10:11
Wedi | Wil | 0105 | Pass | Pass | 2016-01-18 10:05
I tried DISTINCT and GROUP BY, but the result still shows all rows.
One option is to use ROW_NUMBER() to identify the latest record for each employee:
SELECT t.FIRST_NAME,
t.LAST_NAME,
t.EMP_ID,
t.WRIST_STATUS,
t.LFOOT_STATUS,
t."DATE"
FROM
(
SELECT FIRST_NAME, LAST_NAME, EMP_ID, WRIST_STATUS, LFOOT_STATUS, "DATE",
ROW_NUMBER() OVER (PARTITION BY EMP_ID ORDER BY "DATE" DESC) rn
FROM ESD_RESULTS
) t
WHERE t.rn = 1
Since presumably the first name and the last name are determined by the emp_id (they don't change from one row to another), you might as well group by all three columns - resulting in less work. (On the other hand, it would make more sense to normalize your table design; one table shows the associated first name and last name for each emp_id, there is no need to repeat the first name and last name in "this" table, which you show in your post.)
Then: you can use the FIRST/LAST function, with keep (dense_rank ...), as demonstrated below, to eliminate the need for a subquery and an outer query. If there is the possibility of two rows having the exact same date and time for an emp_id, you may refine the query to accommodate "tie-breaks" of some kind. If there are no ties, then the query will work without modification.
DATE is a reserved word in Oracle; it shouldn't be used for table or column names. I changed it to DT.
with
test_data ( first_name, last_name, emp_id, wrist_status, lfoot_status, dt ) as (
select 'Dodo', 'A' , 0101, 'Pass', 'Pass', to_date('2016-01-18 10:00', 'yyyy-mm-dd hh24:mi') from dual union all
select 'Wedi', 'Wil', 0105, 'Pass', 'Pass', to_date('2016-01-18 10:05', 'yyyy-mm-dd hh24:mi') from dual union all
select 'Dodo', 'A' , 0101, 'Pass', 'Fail', to_date('2016-01-18 10:11', 'yyyy-mm-dd hh24:mi') from dual
)
-- end of test data (NOT part of the solution); SQL query begins BELOW THIS LINE
select first_name, last_name, emp_id,
min(wrist_status) keep (dense_rank last order by dt) as wrist_status,
min(lfoot_status) keep (dense_rank last order by dt) as lfoot_status,
max(dt) as dt
from test_data
group by first_name, last_name, emp_id
;
FIRST_NAME LAST_NAME EMP_ID WRIST_STATUS LFOOT_STATUS DT
---------- --------- ---------- ------------ ------------ ----------------
Dodo A 101 Pass Fail 2016-01-18 10:11
Wedi Wil 105 Pass Pass 2016-01-18 10:05
2 rows selected.

NULL values not found in cursor

I am trying to:
1. Create a cursor that gets all the current prices of items in a store.
2. Bulk collect the cursor and loop, upserting into the STORE_INVENTORY table with a MERGE statement.
3. NULL out the PRICE column for the rows in the STORE_INVENTORY table that are not in the cursor.
How can step 3 be done? I can do step 1 and 2 already as I have already updated or inserted the items that are pulled from the cursor.
Here is some example data:
There are three source tables, which are updated by an external party. My objective is to take these three sources of data and merge them into a single table.
SOURCE TABLES
ITEM_TYPES
DESC_ID | TYPE
A | Kitchen
B | Bath
ITEM_MANIFEST
LOC_ID | ORIGIN
U | USA
C | CHINA
ITEM_PRICE
ITEM_ID | PRICE | DESC_ID | LOC_ID | DATE
0 | 3.99 | A | U | 9/11/2015
1 | 2.99 | B | C | 9/11/2015
2 | 1.99 | A | U | 9/05/2015
DESTINATION TABLE
STORE_INVENTORY
ITEM_ID | TYPE | ORIGIN | PRICE
0 | Kitchen | CHINA | 3.99
8 | Bath | USA | 2.99
The SQL procedure takes a date as a parameter and only pulls rows from ITEM_PRICE that are dated after the given date.
If I execute the procedure with the passed-in date 9/10/2015:
Expected Output
STORE_INVENTORY
0 | Kitchen | USA | 3.99
1 | Bath | China | 2.99
8 | Bath | USA | NULL
So, something like this, then?
drop table item_description;
drop table item_manifest;
drop table item_price;
drop table store_inventory;
create table item_description
as
select 'A' desc_id, 'Kitchen' type from dual union all
select 'B' desc_id, 'Bath' type from dual;
create table item_manifest
as
select 'U' loc_id, 'USA' origin from dual union all
select 'C' loc_id, 'CHINA' origin from dual;
create table item_price
as
select 0 item_id, 3.99 price, 'A' desc_id, 'U' loc_id, to_date('11/09/2015', 'dd/mm/yyyy') dt from dual union all
select 1 item_id, 2.99 price, 'B' desc_id, 'C' loc_id, to_date('11/09/2015', 'dd/mm/yyyy') dt from dual union all
select 2 item_id, 1.99 price, 'A' desc_id, 'U' loc_id, to_date('05/09/2015', 'dd/mm/yyyy') dt from dual;
create table store_inventory
as
select 0 item_id, 'Kitchen' type, 'CHINA' origin, 3.99 price from dual union all
select 8 item_id, 'Bath' type, 'USA' origin, 2.99 price from dual;
select * from store_inventory;
ITEM_ID TYPE ORIGIN PRICE
---------- ------- ------ ----------
0 Kitchen CHINA 3.99
8 Bath USA 2.99
select coalesce(ip.item_id, si.item_id) item_id,
coalesce(id.type, si.type) type,
coalesce(im.origin, si.origin) origin,
ip.price
from item_description id
inner join item_price ip on (id.desc_id = ip.desc_id and ip.dt > to_date('10/09/2015', 'dd/mm/yyyy')) -- use a parameter for the date here
inner join item_manifest im on (ip.loc_id = im.loc_id)
full outer join store_inventory si on (si.item_id = ip.item_id);
ITEM_ID TYPE ORIGIN PRICE
---------- ------- ------ ----------
0 Kitchen USA 3.99
8 Bath USA
1 Bath CHINA 2.99
merge into store_inventory tgt
using (select coalesce(ip.item_id, si.item_id) item_id,
coalesce(id.type, si.type) type,
coalesce(im.origin, si.origin) origin,
ip.price
from item_description id
inner join item_price ip on (id.desc_id = ip.desc_id and ip.dt > to_date('10/09/2015', 'dd/mm/yyyy')) -- use a parameter for the date here
inner join item_manifest im on (ip.loc_id = im.loc_id)
full outer join store_inventory si on (si.item_id = ip.item_id)) src
on (src.item_id = tgt.item_id)
when matched then
update set tgt.type = src.type,
tgt.origin = src.origin,
tgt.price = src.price
when not matched then
insert (tgt.item_id, tgt.type, tgt.origin, tgt.price)
values (src.item_id, src.type, src.origin, src.price);
commit;
select * from store_inventory;
ITEM_ID TYPE ORIGIN PRICE
---------- ------- ------ ----------
0 Kitchen USA 3.99
8 Bath USA
1 Bath CHINA 2.99
Obviously, your procedure would have an input parameter of DATE datatype to pass into the query, and your query would use the parameter, rather than a hardcoded date like I did in my example. E.g. ip.dt > p_cutoff_date
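As a rough sketch of that, the MERGE above could be wrapped in a procedure like the one below. The procedure name refresh_store_inventory is hypothetical, and the tables are assumed to be the ones created in this example.

```sql
-- Sketch only: the procedure name is illustrative; p_cutoff_date replaces
-- the hardcoded date used in the example above.
create or replace procedure refresh_store_inventory (p_cutoff_date in date)
as
begin
  merge into store_inventory tgt
  using (select coalesce(ip.item_id, si.item_id) item_id,
                coalesce(id.type, si.type)       type,
                coalesce(im.origin, si.origin)   origin,
                ip.price
         from item_description id
              inner join item_price ip
                on (id.desc_id = ip.desc_id and ip.dt > p_cutoff_date)
              inner join item_manifest im
                on (ip.loc_id = im.loc_id)
              full outer join store_inventory si
                on (si.item_id = ip.item_id)) src
  on (src.item_id = tgt.item_id)
  when matched then
    update set tgt.type   = src.type,
               tgt.origin = src.origin,
               tgt.price  = src.price
  when not matched then
    insert (item_id, type, origin, price)
    values (src.item_id, src.type, src.origin, src.price);
end refresh_store_inventory;
/

-- Example call, e.g. from SQL*Plus:
begin
  refresh_store_inventory(to_date('10/09/2015', 'dd/mm/yyyy'));
  commit;
end;
/
```

The commit is left to the caller here so that the procedure can take part in a larger transaction; moving it inside the procedure is equally possible.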
"I can do step 1 and 2 already as I have already updated or inserted the items that are pulled from the cursor."
Hmm. These steps seem unnecessary - why not do them as part of the MERGE statement? What does the store_inventory table look like before you do your insert/update from the cursor? Also, what is the cursor you're using to do this?
Couldn't you do a date-limited subselect of ITEM_PRICE.PRICE, after pulling in the TYPE and ORIGIN via the main join to ITEM_PRICE, without limiting on date?
I.e., something like:
select ITEM_ID, TYPE, ORIGIN
/* not selecting PRICE in the main join */
,(select PRICE from ITEM_PRICE where your join conditions
and DATE >= your param)
from ITEM_TYPES, ITEM_MANIFEST, ITEM_PRICE
where your join conditions, but no criteria on DATE
Sorry, this would be clearer and easier to type up if you had provided your existing query.
From re-reading your question, I am unsure whether you are inserting only 2 rows but want to get 3, or whether you have 3 rows but want to NULL out the missing price.
If the target table already has the 3 rows, then, instead of a CURSOR-based approach (which can be slow on high volumes and is fussy to write), why not do an UPDATE instead, with DATE as a criterion? PRICE is set to NULL wherever there is no match, because a scalar subquery that returns no rows evaluates to NULL.
UPDATE STORE_INVENTORY set PRICE
= (select PRICE from ITEM_PRICE where your join conditions
and DATE >= your param)
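Filled in with concrete names, the UPDATE might look like the sketch below. It assumes the store_inventory and item_price tables from the earlier answer (with the date column called dt rather than DATE) and a bind variable :p_cutoff_date supplied by the caller; none of this is confirmed by the question.

```sql
-- Sketch under the assumptions above: every row of store_inventory is
-- touched, and rows without a qualifying price end up with PRICE = NULL.
update store_inventory si
set    si.price = (select ip.price
                   from   item_price ip
                   where  ip.item_id = si.item_id
                   and    ip.dt >= :p_cutoff_date);
```

If more than one item_price row per item can qualify, the subquery raises ORA-01427 (single-row subquery returns more than one row), so it would need an aggregate or a further filter. Note also that this statement only updates existing rows; it does not insert new items.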

Using DISTINCT for specific columns

select distinct employee_id, first_name, commission_pct, department_id from
employees;
When I use the above query, it returns the distinct combinations of all the listed columns. As employee_id (the primary key of employees) is unique, the query returns every row in the table.
I want a result set that has the distinct combinations of commission_pct and department_id. How should the query be formed? I tried to include DISTINCT in the middle, as in
select employee_id, first_name, distinct commission_pct, department_id from
employees;
It results in the error
ORA-00936: missing expression
How do I form a query whose result has only the distinct combinations of commission_pct and department_id? The table is from the HR schema of Oracle.
What you request is impossible. You cannot select all the employee ids but have only distinct commission_pct and department_id.
So think it over. What do you want to show?
1. All distinct commission_pct, department_id only?
2. All distinct commission_pct, department_id and the number of relevant employees?
3. All distinct commission_pct, department_id and the relevant employees comma separated?
4. All employees, but with nulls when commission_pct and department_id are the same as in the line before?
The first can be solved with DISTINCT. The second and third with GROUP BY (plus count or listagg). The last would be solved with the analytic function LAG.
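As an illustration of the second and third options, the GROUP BY queries might look like this against the HR employees table. The column aliases employee_count and employee_names are made up for the example.

```sql
-- Second option: each distinct pair with the number of employees sharing it
select commission_pct, department_id, count(*) as employee_count
from employees
group by commission_pct, department_id;

-- Third option: each distinct pair with the employees' first names,
-- comma separated (LISTAGG is available from Oracle 11g Release 2 onwards)
select commission_pct, department_id,
       listagg(first_name, ', ') within group (order by first_name) as employee_names
from employees
group by commission_pct, department_id;
```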
You have to remove the first two columns and keep only the columns that should be distinct:
select distinct commission_pct, department_id from
employees;
Indeed, if your second query worked, what would you expect to see in the first two columns? Consider this example data:
| employee_id | first_name | commission_pct | department_id |
| 1 | "x" | "b" | 3 |
| 2 | "y" | "b" | 3 |
| 1 | "x" | "c" | 4 |
| 2 | "y" | "c" | 4 |
You expect to get a result with only two rows, like this:
| employee_id | first_name | commission_pct | department_id |
| ? | ? | "b" | 3 |
| ? | ? | "c" | 4 |
But what do you expect in the first two columns?
Can you try this one?
SELECT
NAME1,
PH
FROM
(WITH T
AS (SELECT
'mark' NAME1,
'1234567' PH
FROM
DUAL
UNION ALL
SELECT
'bailey',
'456789'
FROM
DUAL
UNION ALL
SELECT
'mark',
'987654'
FROM
DUAL)
SELECT
NAME1,
PH,
ROW_NUMBER ( ) OVER (PARTITION BY NAME1 ORDER BY NAME1) SEQ
FROM
T)
WHERE
SEQ = 1;
If you don't care which specific row is returned, then use an aggregate function:
SELECT
NAME1,
MAX ( PH ) PH
FROM
T
GROUP BY
NAME1;
