Hive - Select unique rows based on some key field

I am using Hive 1.2.1 and want to select unique rows based on empid
empid empname dept
101 aaaa dept1
101 aaaa dept2
102 bbbb dept1
103 cccc dept2
I tried to use correlated subquery but that does not work
select empid,
empname,
dept
(select count(*)
from emp t2
where t2.empid = t1.empid) as row_number
from emp t1 where row_number=1
order by empid;
Is there a way to select unique value based on some key field? Need your help..
Expected output would be
empid empname dept
101 aaaa dept1
102 bbbb dept1
103 cccc dept2
Thanks.

If you need a single row per unique key, then you can use row_number():
select empid, empname, dept from (
select row_number() over (partition by empid order by empname, dept) as rowNum, empid, empname, dept from emp
) q where rowNum = 1
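To see the row_number() approach run end to end, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the example is runnable; window functions need SQLite 3.25 or newer). The table name emp and the sample rows come from the question.

```python
import sqlite3

# In-memory SQLite stands in for Hive here (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empid INTEGER, empname TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(101, "aaaa", "dept1"), (101, "aaaa", "dept2"),
     (102, "bbbb", "dept1"), (103, "cccc", "dept2")],
)

# Number the rows inside each empid group, then keep only the first one.
query = """
SELECT empid, empname, dept
FROM (
    SELECT empid, empname, dept,
           ROW_NUMBER() OVER (PARTITION BY empid ORDER BY empname, dept) AS rn
    FROM emp
) q
WHERE rn = 1
ORDER BY empid
"""
unique_rows = conn.execute(query).fetchall()
print(unique_rows)
```

The ORDER BY inside the window decides which duplicate survives; here dept1 wins for empid 101 because it sorts before dept2.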

Related

Need to add data to table only if the latest data (latest based highest date) for the record does not exist

I have a table EMPLOYEE with the following fields:
ENROLL_DATE EMP ID EMP_NAME DEPT SWITCH
01-20-2001 123 ABC D1 N
01-20-2012 123 ABC D2 N
10-12-2016 123 RST D2 N
02-10-2017 123 RST D3 N
02-10-2017 456 TYU D2 N
I have another table EMPLOYE_CUR
ENROLL_DATE EMP ID EMP_NAME DEPT SWITCH
02-23-2017 123 PQR D4 N
02-23-2017 456 TYU D2 Y
I need to insert records into table EMPLOYEE only if it is a new record. A record is new if EMP_ID, EMP_NAME or DEPT has changed; if only SWITCH has changed, we won't insert it.
We compare against the record with the highest ENROLL_DATE for that EMP_ID. Only if the incoming record differs from this latest record in table EMPLOYEE do we insert it.
So in above case the records which need to be inserted are:
ENROLL_DATE EMP ID EMP_NAME DEPT SWITCH
02-23-2017 123 PQR D4 N
So employee table will now have:
ENROLL_DATE EMP ID EMP_NAME DEPT SWITCH
01-20-2001 123 ABC D1 N
01-20-2012 123 ABC D2 N
10-12-2016 123 RST D2 N
02-10-2017 123 RST D3 N
02-19-2017 123 RST D4 Y
02-23-2017 123 PQR D4 N
I was trying to do this with a cursor that fetches the records with the highest ENROLL_DATE from the EMPLOYEE table and then compares them with the EMPLOYEE_CUR table, but could not figure out how to do it.
Can anyone help me with the query here?
Thanks!

Here is a simplified solution that should show you how to handle your possible special cases.
Note that I assume that the common history is identical in both tables and that a
new employee will be created in both tables.
If this is not true, you must refine the query to address it.
The solution has two parts:
1) Flag idle records in EMPLOYE_CUR that correspond to no change (e.g. only SWITCH changed), using analytic functions to track the previous values.
2) Select the real change records whose ENROLL_DATE is higher than the MAX(ENROLL_DATE) for that employee in table EMPLOYEE.
The final INSERT step is omitted because it is straightforward.
STEP 1
with history as (
select ENROLL_DATE, EMP_ID, EMP_NAME, DEPT,SWITCH,
LAG(EMP_NAME) OVER (PARTITION BY EMP_ID ORDER BY ENROLL_DATE) PREV_EMP_NAME,
LAG(DEPT) OVER (PARTITION BY EMP_ID ORDER BY ENROLL_DATE) PREV_DEPT
from EMPLOYE_CUR)
select
ENROLL_DATE, EMP_ID, EMP_NAME, DEPT, SWITCH,
case when EMP_NAME != PREV_EMP_NAME or DEPT != PREV_DEPT then 'Y' end is_changed
from history;
ENROLL_DATE          EMP_ID  EMP_NAME  DEPT  SWITCH  IS_CHANGED
-------------------  ------  --------  ----  ------  ----------
20.01.2001 00:00:00     123  ABC       D1    N
20.01.2012 00:00:00     123  ABC       D2    N       Y
12.10.2016 00:00:00     123  RST       D2    N       Y
10.02.2017 00:00:00     123  RST       D3    N       Y
17.02.2017 00:00:00     123  RST       D3    Y
19.02.2017 00:00:00     123  RST       D4    Y       Y
23.02.2017 00:00:00     123  PQR       D4    N       Y
STEP 2
with history as (
select ENROLL_DATE, EMP_ID, EMP_NAME, DEPT,SWITCH,
LAG(EMP_NAME) OVER (PARTITION BY EMP_ID ORDER BY ENROLL_DATE) PREV_EMP_NAME,
LAG(DEPT) OVER (PARTITION BY EMP_ID ORDER BY ENROLL_DATE) PREV_DEPT
from EMPLOYE_CUR),
clean_history as (
select
ENROLL_DATE, EMP_ID, EMP_NAME, DEPT, SWITCH,
case when EMP_NAME != PREV_EMP_NAME or DEPT != PREV_DEPT then 'Y' end is_changed
from history)
select clean_history.ENROLL_DATE, clean_history.EMP_ID, EMP_NAME, DEPT, SWITCH from clean_history
join (select EMP_ID, max(ENROLL_DATE) ENROLL_DATE from EMPLOYEE group by EMP_ID) e
on clean_history.EMP_ID = e.EMP_ID and clean_history.ENROLL_DATE > e.ENROLL_DATE
where is_changed = 'Y';
ENROLL_DATE          EMP_ID  EMP_NAME  DEPT  SWITCH
-------------------  ------  --------  ----  ------
19.02.2017 00:00:00     123  RST       D4    Y
23.02.2017 00:00:00     123  PQR       D4    N
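The two steps can be tried together in a small sketch. It uses Python's sqlite3 module instead of Oracle, stores dates as ISO strings so that they compare correctly as text, and loads EMPLOYE_CUR with the seven-row history shown in the STEP 1 output. All of these are assumptions made to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee     (enroll_date TEXT, emp_id INTEGER, emp_name TEXT, dept TEXT, switch TEXT);
CREATE TABLE employee_cur (enroll_date TEXT, emp_id INTEGER, emp_name TEXT, dept TEXT, switch TEXT);
""")

# Full history as in the STEP 1 output; dates rewritten as ISO strings (assumption).
history_rows = [
    ("2001-01-20", 123, "ABC", "D1", "N"),
    ("2012-01-20", 123, "ABC", "D2", "N"),
    ("2016-10-12", 123, "RST", "D2", "N"),
    ("2017-02-10", 123, "RST", "D3", "N"),
    ("2017-02-17", 123, "RST", "D3", "Y"),
    ("2017-02-19", 123, "RST", "D4", "Y"),
    ("2017-02-23", 123, "PQR", "D4", "N"),
]
conn.executemany("INSERT INTO employee_cur VALUES (?, ?, ?, ?, ?)", history_rows)
# EMPLOYEE already holds the history up to 2017-02-10.
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)", history_rows[:4])

query = """
WITH history AS (
    SELECT enroll_date, emp_id, emp_name, dept, switch,
           LAG(emp_name) OVER (PARTITION BY emp_id ORDER BY enroll_date) AS prev_emp_name,
           LAG(dept)     OVER (PARTITION BY emp_id ORDER BY enroll_date) AS prev_dept
    FROM employee_cur
),
latest AS (
    SELECT emp_id, MAX(enroll_date) AS enroll_date FROM employee GROUP BY emp_id
)
SELECT h.enroll_date, h.emp_id, h.emp_name, h.dept, h.switch
FROM history h
JOIN latest e ON h.emp_id = e.emp_id AND h.enroll_date > e.enroll_date
-- keep only rows where name or department differs from the previous row
WHERE h.emp_name != h.prev_emp_name OR h.dept != h.prev_dept
ORDER BY h.enroll_date
"""
to_insert = conn.execute(query).fetchall()
print(to_insert)
```

The 2017-02-17 row is skipped because only SWITCH changed, which is the "idle record" case from part 1.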

I used the query below and it is working fine, but I am not sure it covers all scenarios:
INSERT INTO EMPLOYEE (ENROLL_DATE, EMP_ID, EMP_NAME, DEPT, SWITCH)
SELECT ENROLL_DATE, EMP_ID, EMP_NAME, DEPT, SWITCH FROM EMPLOYEE_CUR
WHERE NOT EXISTS (SELECT * FROM EMPLOYEE A WHERE A.EMP_ID = EMPLOYEE_CUR.EMP_ID
AND A.EMP_NAME = EMPLOYEE_CUR.EMP_NAME
AND A.DEPT = EMPLOYEE_CUR.DEPT
AND A.ENROLL_DATE = (SELECT MAX(B.ENROLL_DATE) FROM EMPLOYEE B WHERE B.EMP_ID = A.EMP_ID));
Can someone comment on whether it looks correct?
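One way to check a NOT EXISTS insert of this kind is to run it against the sample rows from the question. The sketch below does that with Python's sqlite3 module rather than Oracle, and with dates stored as ISO strings; both are assumptions for the sake of a runnable example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee     (enroll_date TEXT, emp_id INTEGER, emp_name TEXT, dept TEXT, switch TEXT);
CREATE TABLE employee_cur (enroll_date TEXT, emp_id INTEGER, emp_name TEXT, dept TEXT, switch TEXT);
""")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)", [
    ("2001-01-20", 123, "ABC", "D1", "N"),
    ("2012-01-20", 123, "ABC", "D2", "N"),
    ("2016-10-12", 123, "RST", "D2", "N"),
    ("2017-02-10", 123, "RST", "D3", "N"),
    ("2017-02-10", 456, "TYU", "D2", "N"),
])
conn.executemany("INSERT INTO employee_cur VALUES (?, ?, ?, ?, ?)", [
    ("2017-02-23", 123, "PQR", "D4", "N"),  # name and dept changed: should be inserted
    ("2017-02-23", 456, "TYU", "D2", "Y"),  # only SWITCH changed: should be skipped
])

# Insert a current row unless the latest EMPLOYEE row for that emp_id
# already has the same name and department.
conn.execute("""
INSERT INTO employee (enroll_date, emp_id, emp_name, dept, switch)
SELECT c.enroll_date, c.emp_id, c.emp_name, c.dept, c.switch
FROM employee_cur c
WHERE NOT EXISTS (
    SELECT 1 FROM employee a
    WHERE a.emp_id = c.emp_id
      AND a.emp_name = c.emp_name
      AND a.dept = c.dept
      AND a.enroll_date = (SELECT MAX(b.enroll_date) FROM employee b
                           WHERE b.emp_id = a.emp_id)
)
""")

inserted = conn.execute(
    "SELECT * FROM employee WHERE enroll_date = '2017-02-23'"
).fetchall()
total = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(inserted, total)
```

Note that an emp_id that does not yet exist in EMPLOYEE also passes the NOT EXISTS test, so brand-new employees are inserted as well.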

What is the query to find emp details whose names contain letter vowel?(oracle database)

Emp Table
---------------
empid ename sal hiredate
101 ashish 5000 23-jul-2016
102 ankith 20000 21-sep-2012
103 uma 3000 10-jan-2004

You could use REGEXP_LIKE
select * from my_table
WHERE REGEXP_LIKE (ename, '([aeiou])');
To ignore case, pass the 'i' match parameter:
select * from my_table
WHERE REGEXP_LIKE (ename, '([aeiou])', 'i');
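The same character-class idea can be tried outside Oracle. In this sketch SQLite has no built-in REGEXP_LIKE, so a REGEXP function backed by Python's re module is registered by hand; that function, and the two extra rows 104 and 105, are hypothetical additions for illustration and not part of the question's data.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite rewrites "X REGEXP Y" as regexp(Y, X): pattern first, value second.
    return 1 if value is not None and re.search(pattern, value) else 0

conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE emp (empid INTEGER, ename TEXT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", [
    (101, "ashish"), (102, "ankith"), (103, "uma"),
    (104, "lynn"),   # hypothetical row: no vowel at all
    (105, "ERIC"),   # hypothetical row: upper-case vowels only
])

# Case-sensitive match on lower-case vowels.
vowel_names = [r[0] for r in conn.execute(
    "SELECT ename FROM emp WHERE ename REGEXP '[aeiou]' ORDER BY empid")]

# The inline (?i) flag plays the role of Oracle's 'i' match parameter.
any_case_names = [r[0] for r in conn.execute(
    "SELECT ename FROM emp WHERE ename REGEXP '(?i)[aeiou]' ORDER BY empid")]

print(vowel_names)
print(any_case_names)
```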

How to get only the one employee name from each department if the max salary is same from more than one employee

I am using below query:
SELECT rownum, job_id, employee_id, first_name, last_name, phone_number, salary
FROM employees OUTER
WHERE salary =
(
SELECT MAX(salary)
FROM employees
WHERE job_id = OUTER.job_id
GROUP BY job_id
)
AND ROWNUM < 6;
And getting below result:
1 AD_PRES 100 Steven King 515.123.4567 24000
2 AD_VP 101 Neena Kochhar 515.123.4568 17000
3 AD_VP 102 Lex De Haan 515.123.4569 17000
4 IT_PROG 103 Alexander Hunold 590.423.4567 9000
5 FI_MGR 108 Nancy Greenberg 515.124.4569 12008
But the problem is I want only one name for each JOB_ID. And that should be decided by alphabetical preference in FIRST_NAME.

One option is to join against a subquery that picks, for each job_id, the first name you want to retain (here the alphabetically first one). I wrapped your original query in a common table expression to make it more readable.
WITH t AS
(
SELECT rownum AS rn, job_id, employee_id, first_name, last_name,
phone_number, salary
FROM employees OUTER
WHERE salary =
(
SELECT MAX(salary)
FROM employees
WHERE job_id = OUTER.job_id
GROUP BY job_id
)
AND ROWNUM < 6
)
SELECT t1.rn, t1.job_id, t1.employee_id, t1.first_name, t1.last_name,
t1.phone_number, t1.salary
FROM t t1
INNER JOIN
(
SELECT job_id, MIN(first_name) AS min_name
FROM t
GROUP BY job_id
) t2
ON t1.job_id = t2.job_id AND t1.first_name = t2.min_name;

Use analytic functions:
select * from (
SELECT job_id, employee_id, first_name, last_name, phone_number, salary,
RANK() over (
partition by job_id
order by
salary desc,
first_name,
employee_id -- adding employee_id breaks ties in the ordering
) as rnk
FROM employees
) where rnk = 1;
This will probably also perform better than the subselect.
All this is written without a database at hand, so it may contain typos.
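Since the query above was written without a database at hand, here is a small runnable check of the ranking idea. It uses Python's sqlite3 module in place of Oracle (an assumption for illustration) and only the five rows shown in the question's result, with a reduced column list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees
                (job_id TEXT, employee_id INTEGER, first_name TEXT,
                 last_name TEXT, salary INTEGER)""")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)", [
    ("AD_PRES", 100, "Steven", "King", 24000),
    ("AD_VP", 101, "Neena", "Kochhar", 17000),
    ("AD_VP", 102, "Lex", "De Haan", 17000),
    ("IT_PROG", 103, "Alexander", "Hunold", 9000),
    ("FI_MGR", 108, "Nancy", "Greenberg", 12008),
])

# Rank within each job_id: highest salary first, then first_name,
# then employee_id as a final tie-breaker.
query = """
SELECT job_id, first_name, salary
FROM (
    SELECT job_id, first_name, salary,
           RANK() OVER (PARTITION BY job_id
                        ORDER BY salary DESC, first_name, employee_id) AS rnk
    FROM employees
) ranked
WHERE rnk = 1
ORDER BY job_id
"""
top_per_job = conn.execute(query).fetchall()
print(top_per_job)
```

For AD_VP both employees earn 17000, so first_name decides and Lex is kept over Neena.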

Oracle aggregate functions on strings

I have an employee with multiple managers. The manager name field has the form (firstname,lastname) and the email field has the form (last.first#email.com). There is no Mgr id.
So, when I group by employee id to get the max of Mgr name and email, I sometimes end up with the wrong name/email combination.
ex:
person Mgr_name Mgr_email
------- --------- ----------
111 brad,pitt pitt.brad#test.com
111 mike,clark clark.mike#test.com
when I group it by person and get the max(mgr_name),mgr_email, I get
person max(Mgr_name) max(Mgr_email)
------- --------- ----------
111 mike,clark pitt.brad#test.com
How do I get the correct email/name combination?

Use the row_number analytic function instead:
with t(person, Mgr_name, Mgr_email) as (
select 111, 'brad,pitt', 'pitt.brad#test.com' from dual union all
select 111, 'mike,clark', 'clark.mike#test.com' from dual )
select person, Mgr_name, Mgr_email from (
select t1.*, row_number() over (partition by person order by mgr_name) num from t t1)
where num = 1
This returns one row per person, so the name and email always come from the same row. As written it picks the alphabetically first mgr_name; use order by mgr_name desc to pick the max instead.
Output:
PERSON MGR_NAME MGR_EMAIL
---------- ---------- -------------------
111 brad,pitt pitt.brad#test.com
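For comparison, the sketch below numbers the rows per person in descending mgr_name order, which is one way to get the "max" manager together with the matching email. It runs on Python's sqlite3 module instead of Oracle, and the table name mgr is a hypothetical name chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "mgr" is a hypothetical table name; the rows are the question's sample data.
conn.execute("CREATE TABLE mgr (person INTEGER, mgr_name TEXT, mgr_email TEXT)")
conn.executemany("INSERT INTO mgr VALUES (?, ?, ?)", [
    (111, "brad,pitt", "pitt.brad#test.com"),
    (111, "mike,clark", "clark.mike#test.com"),
])

# Pick one whole row per person, so name and email cannot get mixed up.
query = """
SELECT person, mgr_name, mgr_email
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY person ORDER BY mgr_name DESC) AS num
    FROM mgr t
) numbered
WHERE num = 1
"""
max_manager = conn.execute(query).fetchall()
print(max_manager)
```

Applying MAX() to each column separately takes the largest value of each column independently, which is why the grouped query in the question paired mike,clark with pitt.brad#test.com.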

You could use a subselect to obtain the max mgr_name for each person in the table, then join it back to the base results so that only each person's "max" manager is displayed:
SELECT t1.Person, t1.Mgr_name, t1.mgr_email
FROM tableName t1
INNER JOIN (Select max(mgr_name) mname, Person from TableName group by person) t2
on t1.mgr_name = t2.mname
and t2.Person = T1.Person

display manager name and count of employees reporting him in employees table

I want to display manager_name and the count of employees reporting to him in the employees table. I want to sort the data by that count, i.e. the manager with the most reporting employees should come first.
I tried to write a self join but I could not get the output.
EMPLOYEE_ID FIRST_NAME MANAGER_ID SALARY HIRE_DATE
198 Donald 124 2600 21-JUN-99
199 Douglas 124 2600 13-JAN-00
200 Jennifer 101 4400 17-SEP-87
201 Michael 100 13000 17-FEB-96
202 Pat 201 6000 17-AUG-97
203 Susan 101 6500 07-JUN-94
204 Hermann 101 10000 07-JUN-94
205 Shelley 101 12000 07-JUN-94
206 William 205 8300 07-JUN-94
100 Steven 24000 17-JUN-87
101 Neena 100 17000 21-SEP-89
the table name is employees and i want to see names also

You can use the aggregate function COUNT and an ORDER BY clause.
You didn't mention the table name, so assuming it is EMPLOYEES, the query below should help.
SELECT MANAGER_ID, COUNT(EMPLOYEE_ID) as EMP_COUNT
FROM EMPLOYEES
GROUP BY MANAGER_ID
ORDER BY EMP_COUNT DESC;
Here EMP_COUNT is the column alias. If you don't want a column alias you can simply use the query below.
SELECT MANAGER_ID, COUNT(EMPLOYEE_ID)
FROM EMPLOYEES
GROUP BY MANAGER_ID
ORDER BY COUNT(EMPLOYEE_ID) DESC;
If you want to sort by ascending order instead of DESC you can use ASC.
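The queries above return MANAGER_ID only. Since the question also asks for the manager's name, a self join on the same table is one possible extension. The sketch below runs it on the question's sample rows using Python's sqlite3 module in place of Oracle (an assumption for illustration), with salary and hire date left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employees
                (employee_id INTEGER, first_name TEXT, manager_id INTEGER)""")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (198, "Donald", 124), (199, "Douglas", 124), (200, "Jennifer", 101),
    (201, "Michael", 100), (202, "Pat", 201), (203, "Susan", 101),
    (204, "Hermann", 101), (205, "Shelley", 101), (206, "William", 205),
    (100, "Steven", None), (101, "Neena", 100),
])

# e = the reporting employee, m = that employee's manager (same table).
query = """
SELECT m.first_name AS manager_name, COUNT(*) AS emp_count
FROM employees e
JOIN employees m ON e.manager_id = m.employee_id
GROUP BY m.employee_id, m.first_name
ORDER BY emp_count DESC, manager_name
"""
manager_counts = conn.execute(query).fetchall()
print(manager_counts)
```

Manager 124 has two reports in the sample but no row of its own, so the inner join drops it; a LEFT JOIN from a grouped subquery would keep it with a NULL name.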

We can get this output using an analytical function:
SELECT E.EMPID,E.EMPNAME as "Manager Name",M.EMPNAME AS "Employee Name",count(*) over(partition by e.empid) reportee_count
from empmgid m,empmgid e where M.MAGID=e.EMPID order by reportee_count desc;

Please employ the following SQL-Query:
SELECT
e.empno,
e.ename,
e1.empcnt
FROM
emp e,
(
SELECT
mgr,
COUNT(*) empcnt
FROM
emp
GROUP BY
mgr
) e1
WHERE
e.empno = e1.mgr;

-- Restrict the result to managers who have exactly two employees reporting to them
-----------------------------------------------------------------------
SELECT E1.* FROM
(
SELECT E1.EMPNO,E1.ENAME AS EMPLOYE,
M1.ENAME AS MANAGERS,
COUNT(*)
OVER
(
PARTITION BY E1.EMPNO
) EMPCNT
FROM EMP E1,EMP M1
WHERE M1.MGR=E1.EMPNO
) E1
WHERE EMPCNT = 2;

select count(distinct manager_id) from employees;
Ans:
COUNT(DISTINCTMANAGER_ID)
18
