remove successive rows in hive - hadoop

What is an efficient way to remove successive rows with duplicate values in specific fields in Hive? For example:
Input:
ID field1 field2 date
1 a b 2015-01-01
1 a b 2015-01-02
2 e d 2015-01-03
output:
ID field1 field2 date
1 a b 2015-01-01
2 e d 2015-01-03
Thanks in advance

One way to remove successive duplicates is to use lag to check the previous id and only keep rows where the previous id is different:
select * from (
select * ,
lag(id) over (order by date) previous_id
from mytable
) t where t.previous_id <> t.id
or t.previous_id is null -- accounts for the 1st row
If you also need to check field1 and field2, add a separate lag for each field and keep a row when any of the fields differs from the previous row:
select * from (
select * ,
lag(id) over (order by date) previous_id,
lag(field1) over (order by date) previous_field1,
lag(field2) over (order by date) previous_field2
from mytable
) t where (t.previous_id <> t.id or t.previous_field1 <> t.field1 or t.previous_field2 <> t.field2)
or t.previous_id is null -- accounts for the 1st row
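To try the lag comparison without a Hive cluster, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. SQLite is only a stand-in for Hive here, the helper name drop_successive_duplicates is made up for illustration, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

def drop_successive_duplicates(rows):
    """Keep a row only if id, field1 or field2 differs from the previous row (by date)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table mytable (id integer, field1 text, field2 text, date text)")
    conn.executemany("insert into mytable values (?, ?, ?, ?)", rows)
    query = """
        select id, field1, field2, date from (
            select *,
                   lag(id)     over (order by date) as previous_id,
                   lag(field1) over (order by date) as previous_field1,
                   lag(field2) over (order by date) as previous_field2
            from mytable
        ) t
        where t.previous_id is null              -- first row has no predecessor
           or t.previous_id <> t.id
           or t.previous_field1 <> t.field1
           or t.previous_field2 <> t.field2
        order by date
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

# Sample rows from the question
sample = [
    (1, "a", "b", "2015-01-01"),
    (1, "a", "b", "2015-01-02"),
    (2, "e", "d", "2015-01-03"),
]
print(drop_successive_duplicates(sample))
```

Only adjacent repeats are dropped: a row that matches an earlier, non-adjacent row is kept, which is the difference from a plain distinct.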

Related

Oracle How to make SELECT INSIDE A SELECT work?

Just wondering why the following select isn't working:
SELECT
A.FIELD1
, (SELECT PCN FROM (select B.PRIORITY, B.PCN
from
TABLE2 B
WHERE B.CUST= A.CUST
ORDER BY B.PRIORITY)
WHERE ROWNUM = 1) AS PCN
FROM TABLE1 A;
ERROR at line 2: ORA-00904: "A"."CUST": invalid identifier
Important to mention:
TABLE1 has as fields FIELD1, CUST.
TABLE2 has as fields PCN, PRIORITY, CUST.
Thanks in advance.
Your query shouldn't give you that error message; only when you remove the outer query would this happen.
CREATE TABLE TABLE1 (FIELD1 int, CUST int);
INSERT INTO TABLE1 VALUES(1,1);
1 rows affected
CREATE TABLE TABLE2 (PCN int, PRIORITY int, CUST int);
INSERT INTO TABLE2 VALUES (1,1,1);
1 rows affected
SELECT
A.FIELD1
, (SELECT PCN FROM (select B.PRIORITY, B.PCN
from
TABLE2 B
WHERE B.CUST= A.CUST
ORDER BY B.PRIORITY)
WHERE ROWNUM = 1) AS PCN
FROM TABLE1 A;
FIELD1 PCN
1 1
You can't nest inline selects more than one level deep without the inner selects losing the ability to reference the parent block (this applies to older Oracle versions; as far as I know the restriction was lifted in 12c). So your query on TABLE2 cannot see the columns from TABLE1 because of this nesting.
Try this:
SELECT a.field1,
pcn.pcn
FROM table1 a,
(SELECT b.cust,
b.priority,
b.pcn,
ROW_NUMBER() OVER (PARTITION BY b.cust ORDER BY b.priority) seq -- same ordering as the original query
FROM table2 b) pcn
WHERE a.cust = pcn.cust(+)
AND pcn.seq(+) = 1;
That will work well for report queries. If you end up adding a filter on a specific customer, you would be better off using OUTER APPLY, provided your Oracle version supports it (12c or later).
You could try this:
SELECT
A.FIELD1
, (SELECT B.PCN
from
TABLE2 B
WHERE B.CUST= A.CUST
ORDER BY B.PRIORITY
FETCH FIRST 1 ROWS ONLY) AS PCN
FROM TABLE1 A;
FETCH FIRST 1 ROWS ONLY gets you the first row in the given order. It works on 12c and up, supports nesting, and needs no second subquery.
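A rough way to check the "first row per customer" logic without an Oracle instance is the sketch below, which uses Python's sqlite3 module. It is an approximation under stated assumptions: SQLite has no FETCH FIRST, so LIMIT 1 stands in for FETCH FIRST 1 ROWS ONLY, and the sample rows are the small two-customer data set used elsewhere in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (field1 integer, cust integer);
    create table table2 (pcn integer, priority integer, cust integer);
    insert into table1 values (1, 1), (2, 2);
    insert into table2 values (1, 1, 1), (2, 2, 1);
""")

# Correlated scalar subquery: for each TABLE1 row, take the PCN with the
# lowest PRIORITY among the matching TABLE2 rows (NULL when there is none).
top_pcn = conn.execute("""
    select a.field1,
           (select b.pcn
              from table2 b
             where b.cust = a.cust
             order by b.priority
             limit 1) as pcn      -- SQLite stand-in for FETCH FIRST 1 ROWS ONLY
      from table1 a
     order by a.field1
""").fetchall()
conn.close()

print(top_pcn)
```

Customer 2 has no rows in table2, so its pcn comes back as None (NULL), the same behaviour a scalar subquery gives in Oracle.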
Yet another option might be a CTE.
Sample data:
SQL> with
2 table1 (field1, cust) as
3 (select 1, 100 from dual union all
4 select 2, 200 from dual
5 ),
6 table2 (pcn, priority, cust) as
7 (select 10, 1, 100 from dual union all
8 select 20, 2, 100 from dual union all
9 select 30, 1, 200 from dual
10 ),
Query begins here. Rank rows by priority, and then fetch the ones that rank as the highest (line #20):
11 temp as
12 (select a.field1,
13 b.pcn,
14 rank() over (partition by a.field1 order by b.priority desc) rnk
15 from table1 a join table2 b on a.cust = b.cust
16 )
17 select field1,
18 pcn
19 from temp
20 where rnk = 1;
FIELD1 PCN
---------- ----------
1 20
2 30
SQL>
You can use the FIRST aggregate function (KEEP ... DENSE_RANK FIRST) to get the same result without a nested subquery, assuming the ORDER BY is completely deterministic:
select
a.field1
, (
select max(b.pcn) keep(dense_rank first order by b.priority)
from table2 b
where b.cust = a.cust
) as pcn
from table1 a
which for this sample data
insert into table1 values(1,1);
insert into table1 values(2,2);
insert into table2 values(1,1,1);
insert into table2 values(2,2,1);
returns
FIELD1 PCN
1 1
2 (null)

How can I count the amount of values in different columns in oracle plsql

For example, I have a table with these values:
ID Date Col1 Col2 Col3 Col4
1 01/11/2021 A A B
2 01/11/2021 B B
The A and B values are dynamic, they can be other characters as well.
Now I need somehow to get to the result that ID 1 has 2 occurrences of A and one of B, and ID 2 has 0 occurrences of A and 2 occurrences of B.
I'm using dynamic SQL to do this:
for v_record in table_cursor
loop
for i in 1 .. 4
loop
v_query := 'select col'||i||' from table where id = '||v_record.id;
execute immediate v_query into v_char;
if v_char = "any letter I'm checking" then
amount := amount + 1;
end if;
end loop;
-- do something with the amount
end loop;
But there has to be a better, much more efficient way to do this.
I don't have that much knowledge of PL/SQL and I really don't know how to phrase this question for Google. I've looked into PIVOT, but I don't think that will help me in this case.
I'd appreciate it if someone could help me out.
Assuming the number of columns would be fixed at four, you could use a union aggregation approach here:
WITH cte AS (
SELECT ID, Col1 AS val FROM yourTable UNION ALL
SELECT ID, Col2 FROM yourTable UNION ALL
SELECT ID, Col3 FROM yourTable UNION ALL
SELECT ID, Col4 FROM yourTable
)
SELECT
t1.ID,
t2.val,
COUNT(c.ID) AS cnt
FROM (SELECT DISTINCT ID FROM yourTable) t1
CROSS JOIN (SELECT DISTINCT val FROM cte) t2
LEFT JOIN cte c
ON c.ID = t1.ID AND
c.val = t2.val
WHERE
t2.val IS NOT NULL
GROUP BY
t1.ID,
t2.val;
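This produces one row per ID and value, including zero counts. For the sample data that is A = 2 and B = 1 for ID 1, and A = 0 and B = 2 for ID 2.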
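The union-and-aggregate query can be tried locally with the sketch below, which runs it in SQLite through Python's sqlite3 module. Some assumptions: SQLite stands in for Oracle, the Date column is left out because the query does not use it, an ORDER BY is added so the output order is stable, and the placement of the empty cells in the sample rows is a guess (it does not affect the counts).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table yourTable (ID integer, Col1 text, Col2 text, Col3 text, Col4 text)")
conn.executemany(
    "insert into yourTable values (?, ?, ?, ?, ?)",
    [(1, "A", "A", "B", None), (2, "B", "B", None, None)],
)

counts = conn.execute("""
    WITH cte AS (                      -- unpivot the four columns into (ID, val) pairs
        SELECT ID, Col1 AS val FROM yourTable UNION ALL
        SELECT ID, Col2 FROM yourTable UNION ALL
        SELECT ID, Col3 FROM yourTable UNION ALL
        SELECT ID, Col4 FROM yourTable
    )
    SELECT t1.ID, t2.val, COUNT(c.ID) AS cnt
    FROM (SELECT DISTINCT ID FROM yourTable) t1
    CROSS JOIN (SELECT DISTINCT val FROM cte) t2   -- every ID paired with every value
    LEFT JOIN cte c ON c.ID = t1.ID AND c.val = t2.val
    WHERE t2.val IS NOT NULL
    GROUP BY t1.ID, t2.val
    ORDER BY t1.ID, t2.val
""").fetchall()
conn.close()

print(counts)
```

The cross join is what produces the zero rows: without it, ID 2 would simply have no row for A.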

Oracle Delete/Update in one query

I want to delete the duplicates from the table, update the unique identifier, and merge the quantities into the already existing record.
I have a table which can contain following records -
ID Name Req_qty
1001 ABC-02/01+Time 10
1001 ABC-03/01+Time 20
1001 ABC 30
1002 XYZ 40
1003 DEF-02/01+Time 10
1003 DEF-02/01+Time 20
And I am expecting the records after the operation as follows:
ID Name Req_Qty
1001 ABC 60
1002 XYZ 40
1003 DEF 30
Any assistance would be really helpful. Thanks!
It is possible to do this in a single SQL statement:
merge into (select rowid as rid, x.* from test_table x ) o
using ( select id
, regexp_substr(name, '^[[:alpha:]]+') as name
, sum(req_qty) as req_qty
, min(rowid) as rid
from test_table
group by id
, regexp_substr(name, '^[[:alpha:]]+')
) n
on (o.id = n.id)
when matched then
update
set o.name = n.name
, o.req_qty = n.req_qty
delete where o.rid > n.rid;
This uses a couple of tricks:
The delete clause of a merge statement only operates on rows that have been updated, so every row is updated; there's no restriction on what gets updated.
You can't select rowid from a "view", so it's aliased as rid before updating.
By selecting the minimum rowid for each ID we make an arbitrary choice about which row we're going to keep. We can then delete all the rows that have a "greater" rowid. If you have a primary key or any other column you'd prefer to use as a discriminator, just substitute that column for rowid (and ensure it's indexed if your table has any volume!).
Note that the regular expression differs from the other answer; it uses a caret (^) to anchor the search to the beginning of the string before looking for all alpha characters thereafter. This isn't required here, as the default start position for REGEXP_SUBSTR() is the first character (1-indexed), but it makes the intention clearer.
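To make it concrete what the USING subquery computes, here is a small plain-Python sketch of the same grouping: take the leading run of letters from each name, then sum the quantities per (id, name). The function name consolidate is invented for this illustration, and the Python pattern [A-Za-z]+ is only an ASCII approximation of '^[[:alpha:]]+'.

```python
import re
from collections import defaultdict

# Sample rows from the question: (ID, Name, Req_qty)
rows = [
    (1001, "ABC-02/01+Time", 10),
    (1001, "ABC-03/01+Time", 20),
    (1001, "ABC", 30),
    (1002, "XYZ", 40),
    (1003, "DEF-02/01+Time", 10),
    (1003, "DEF-02/01+Time", 20),
]

def consolidate(rows):
    """Group by ID and the leading alphabetic part of the name, summing quantities."""
    totals = defaultdict(int)
    for row_id, name, req_qty in rows:
        # re.match anchors at the start of the string, like the caret in the SQL pattern.
        # Assumes every name starts with at least one letter, as in the sample data.
        prefix = re.match(r"[A-Za-z]+", name).group(0)
        totals[(row_id, prefix)] += req_qty
    return sorted((row_id, name, qty) for (row_id, name), qty in totals.items())

print(consolidate(rows))
```

This only mirrors the aggregation; the update and delete of the surviving rows is what the MERGE itself adds on top.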
In your case, you will need to update the records first and then delete the records that are no longer required, as follows:
UPDATE TABLE1 T
SET T.REQ_QTY = (
SELECT
SUM(TIN.REQ_QTY) AS REQ_QTY
FROM
TABLE1 TIN
WHERE TIN.ID = T.ID
)
WHERE (T.ROWID,1) IN
(SELECT TIN1.ROWID, ROW_NUMBER() OVER (PARTITION BY TIN1.ID ORDER BY TIN1.ROWID)
FROM TABLE1 TIN1); -- takes one arbitrary record for each ID
-- Delete every row that has a sibling with a larger quantity, i.e. keep only the row holding the sum
-- (assumes all quantities are positive)
DELETE FROM TABLE1 T
WHERE EXISTS (SELECT 1 FROM TABLE1 TIN
WHERE TIN.ID = T.ID AND TIN.REQ_QTY > T.REQ_QTY);
UPDATE TABLE1 SET NAME = regexp_substr(NAME,'[[:alpha:]]+');
--Update--
The following merge should work for you
MERGE INTO
(select rowid as rid, T.* from MY_TABLE1 T ) MT
USING
(
SELECT * FROM
(SELECT ID,
regexp_substr(NAME,'^[[:alpha:]]+') AS NAME_UPDATED,
SUM(Req_qty) OVER (PARTITION BY ID) AS Req_qty_SUM,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ROWID) AS RN, -- RN = 1 marks the row to keep
ROWID AS RID
FROM MY_TABLE1)
WHERE RN = 1
) MT1
ON (MT.ID = MT1.ID)
WHEN MATCHED THEN
UPDATE SET MT.NAME = MT1.NAME_UPDATED, MT.Req_qty = MT1.Req_qty_SUM
delete where (MT.RID <> MT1.RID);
Cheers!!

Oracle Query: Get distinct names having count greater than a threshold

I have a table having schema given below
EmpID,MachineID,Timestamp
1, A,01-Nov-13
2, A,02-Nov-13
3, C,03-Nov-13
1, B,02-Nov-13
1, C,04-Nov-13
2, B,03-Nov-13
3, A,02-Nov-13
Desired Output:
EmpID,MachineID
1, A
1, B
1, C
2, A
2, B
3, A
3, C
So basically, I want to find the employees who have used more than one machine in the given time period.
The query I am using is
select EmpID,count(distinct(MachineID)) from table
where Timestamp between '01-NOV-13' AND '07-NOV-13'
group by EmpID having count(distinct(MachineID)) > 1
order by count(distinct(MachineID)) desc;
This query is giving me output like this
EmpID,count(distinct(MachineID))
1, 3
2, 2
3, 2
Can anyone help me change the query to get the output described above?
One possible solution:
CREATE TABLE emp_mach (
empid NUMBER,
machineid VARCHAR2(1),
timestamp_val DATE
);
INSERT INTO emp_mach VALUES (1,'A', DATE '2013-11-01');
INSERT INTO emp_mach VALUES (2,'A', DATE '2013-11-02');
INSERT INTO emp_mach VALUES (3,'C', DATE '2013-11-03');
INSERT INTO emp_mach VALUES (1,'B', DATE '2013-11-02');
INSERT INTO emp_mach VALUES (1,'C', DATE '2013-11-04');
INSERT INTO emp_mach VALUES (2,'B', DATE '2013-11-03');
INSERT INTO emp_mach VALUES (3,'A', DATE '2013-11-02');
COMMIT;
SELECT DISTINCT empid, machineid
FROM emp_mach
WHERE empid IN (
SELECT empid
FROM emp_mach
WHERE timestamp_val BETWEEN DATE '2013-11-01' AND DATE '2013-11-07'
GROUP BY empid
HAVING COUNT(DISTINCT machineid) > 1
)
ORDER BY empid, machineid;
(I've changed the name of the timestamp column to timestamp_val)
Output:
EMPID MACHINEID
---------- ---------
1 A
1 B
1 C
2 A
2 B
3 A
3 C
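If you want to run the IN / HAVING COUNT(DISTINCT ...) filter without an Oracle instance, the sketch below does it in SQLite through Python's sqlite3 module. Treat it as an approximation: SQLite stands in for Oracle, dates are stored as ISO text instead of DATE values, and employee 4 is a hypothetical extra row added only to show that single-machine employees are filtered out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp_mach (empid INTEGER, machineid TEXT, timestamp_val TEXT);
    INSERT INTO emp_mach VALUES
        (1, 'A', '2013-11-01'),
        (2, 'A', '2013-11-02'),
        (3, 'C', '2013-11-03'),
        (1, 'B', '2013-11-02'),
        (1, 'C', '2013-11-04'),
        (2, 'B', '2013-11-03'),
        (3, 'A', '2013-11-02'),
        (4, 'A', '2013-11-05');  -- hypothetical: used only one machine
""")

multi_machine = conn.execute("""
    SELECT DISTINCT empid, machineid
    FROM emp_mach
    WHERE empid IN (
        SELECT empid                       -- employees with more than one machine
        FROM emp_mach
        WHERE timestamp_val BETWEEN '2013-11-01' AND '2013-11-07'
        GROUP BY empid
        HAVING COUNT(DISTINCT machineid) > 1
    )
    ORDER BY empid, machineid
""").fetchall()
conn.close()

print(multi_machine)
```

Note that the date range is applied only inside the subquery, as in the answer above, so the outer query lists every machine of a qualifying employee regardless of date.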
You did the hardest part. Your query just has to be used to filter the results:
SELECT t1.empid, t1.machineid
FROM
table t1
WHERE
EXISTS (
SELECT
empid
FROM table t2
WHERE
timestamp BETWEEN '01-NOV-13' AND '07-NOV-13'
AND t2.empid = t1.empid
GROUP BY empid HAVING COUNT(distinct(machineid)) > 1
)
ORDER BY empid, machineid;
edit: posted a few secs after Przemyslaw Kruglej. I'll leave it here since it is just another alternative (using EXISTS instead of IN)
SELECT * FROM
(SELECT EmpID, COUNT(DISTINCT MachineID) AS NumMachines
from TableA
WHERE Timestamp between '01-NOV-13' AND '07-NOV-13'
group by EmpID
order by EmpID
)
WHERE NumMachines > 1;

Combining two tables with a different column

I have two SELECT queries that I want to combine using UNION:
Table 1 : which is a join between Table_a, table_b and table_c
id_table_a desc_table_a table_b.id_user table_c.field
-----------------------------------------------------------
1 desc1 1 field1
2 desc2 2 field2
3 desc3 3 field3
Table 2 : which is also a join between Table_a, table_b and table_c but it has these columns:
id_table_a desc_table_a table_c.id_user table_c.field
-----------------------------------------------------------
4 desc4 4 field4
5 desc5 5 field8
9 desc9 6 field9
The difference between the two is that Table 1 has table_b.id_user whereas Table 2 has
table_c.id_user instead.
Combined Table
id_table_a desc_table_a id_user table_c.field
-----------------------------------------------------------
1 desc1 1 field1
2 desc2 2 field2
3 desc3 3 field3
4 desc4 4 field4
5 desc5 5 field5
9 desc9 6 field6
I already have the join queries working, but doing a union between the two gives me
ORA-01790 expression must have same datatype as corresponding expression
which makes sense because the two columns are not the same.
I'm using Zend_Db's join and union for this.
So how can I tackle this to get the result?
Thanks.
Is the column order in the results above the same as the column order in your queries? Oracle is strict about column order in a UNION: corresponding columns must have matching datatypes. The example below produces an error:
create table test1_1790 (
col_a varchar2(30),
col_b number,
col_c date);
create table test2_1790 (
col_a varchar2(30),
col_c date,
col_b number);
select * from test1_1790
union all
select * from test2_1790;
ORA-01790: expression must have same datatype as corresponding expression
As you can see, the root cause of the error is the mismatched column order implied by using * as the column list. This type of error can easily be avoided by listing the columns explicitly:
select col_a, col_b, col_c from test1_1790
union all
select col_a, col_b, col_c from test2_1790;
A more frequent scenario for this error is when you inadvertently swap (or shift) two or more columns in the SELECT list:
select col_a, col_b, col_c from test1_1790
union all
select col_a, col_c, col_b from test2_1790;
Or, if the above does not solve your problem, how about giving the columns an alias,
like this? (The query is not the same as yours, but the point here is how to add an alias to a column.)
SELECT id_table_a,
desc_table_a,
table_b.id_user as iUserID,
table_c.field as iField
UNION
SELECT id_table_a,
desc_table_a,
table_c.id_user as iUserID,
table_c.field as iField
hope this helps.
