Hive How to get non-grouped columns for each group of grouped results? - hadoop

I have a table similar to the following.
|name | grp | dt
------------------------------
|foo | A | 2016-01-01
|bar | A | 2016-01-02
|hai | B | 2016-01-01
|bai | B | 2016-01-02
|baz | C | 2016-01-01
For each group, I want to find the name whose dt is the most recent. In other words, MAX(dt), GROUP by grp, and associate the name whose dt is the max of the group to the output:
|name | grp | dt
------------------------------
|bar | A | 2016-01-02
|bai | B | 2016-01-02
|baz | C | 2016-01-01
In Oracle, the following query works and is very clean (taken from here):
SELECT o.name, o.grp, o.dt
FROM tab o
LEFT JOIN tab b
ON o.grp = b.grp AND o.dt < b.dt
WHERE b.dt IS NULL
However this fails with [Error 10017]: Line 4:43 Both left and right aliases encountered in JOIN 'service_effective_from' From another question quoting the documentation, I learn that I cannot use an inequality operator in a join statement:
Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job.
What is a clean solution for obtaining this in Hive, given that I cannot use an inequality operator in a join condition?

The following works and is taken from here, but I don't find it very clean:
SELECT o.name, ogrp, o.dt
FROM tab o
JOIN (
SELECT grp, MAX(dt) dt
FROM tab
GROUP BY grp
) b
ON o.grp = b.grp AND o.dt = b.dt
As an aside, it takes 164 seconds on my environment for a comparable test table with 4 rows.

Related

SELECT Records > 0 and with NO NULL Values

I have a query in which I am producing results with rows that contain 0 values. I would like to exclude any rows in which columns B or C = 0. To exclude such rows, I have added the T2.A <> 0 and T2.A != 0. When I do this, the 0 values are replaced with NULLs. Thus I also added T2.A IS NOT NULL.
My results still produce the columns that I do not need which show (null) and would like to exclude these.
SELECT
(SELECT
SUM(T2.A) as prem
FROM Table_2 T2, Table_2 T1
WHERE T2.ENT_REF = T1.ENT_REF
AND UPPER(T2.PER) = 'HURR'
AND UPPER(T2.ENT_TYPE) = 'POL'
AND T2.Cov NOT IN ('OUTPROP','COV')
AND T2.A <> 0
AND T2.A IS NOT NULL
) as B,
(SELECT
SUM(T2.A) as prem
FROM Table_2 T2, Table_2 T1
WHERE T2.ENT_REFE = T1.ENT_REF
AND UPPER(T2.PER) IN ('I', 'II', 'II')
AND UPPER(T2.ENT_TYPE) = 'POL'
AND T2.Cov NOT IN ('OUTPROP','COV')
AND T2.A <> 0
AND T2.A IS NOT NULL
) as C
Ideally the result will go from:
+----+--------+--------+
| ID | B | C |
+----+--------+--------+
| 1 | 24 | 123 |
| 2 | 65 | 78 |
| 3 | 43 | 89 |
| 3 | 0 | 0 |
| 4 | 95 | 86 |
| 5 | 43 | 65 |
| 5 | (null) | (null) |
+----+--------+--------+
To something similar to the following:
+----+-----+-----+
| ID | B | C |
+----+-----+-----+
| 1 | 24 | 123 |
| 2 | 65 | 78 |
| 3 | 43 | 89 |
| 4 | 95 | 86 |
| 5 | 43 | 65 |
+----+-----+-----+
I have also attempted distinct values, but I have other columns such as dates which are different per row. Although I need to include dates, they are not as important to me as only getting B and C columns with only values > 0. I have also tried using a GROUP BY ID statement, but I get an error that states 'ORA-00979: not a GROUP BY expression'
You have written all the conditions in the SELECT clause.
You are facing the issue because the WHERE clause decides the number of rows to be fetched and SELECT clause decides values to be returned.
In your case, something like the following is happening:
Simple Example:
-- MANUAL DATA
WITH DATAA AS (
SELECT
1 KEY,
'VALS' VALUE,
1 SEQNUM
FROM
DUAL
UNION ALL
SELECT
2,
'IDEAL OPTION',
2
FROM
DUAL
UNION ALL
SELECT
10,
'EXCLUDE',
3
FROM
DUAL
)
-- QUERY OF YOUR TYPE
SELECT
(
SELECT
KEY
FROM
DATAA I
WHERE
I.KEY = 1
AND O.KEY = I.KEY
) AS KEY, -- DECIDE VALUES TO BE SHOWN
(
SELECT
KEY
FROM
DATAA I
WHERE
I.SEQNUM = 1
AND O.SEQNUM = I.SEQNUM
) AS SEQNUM -- DECIDE VALUES TO BE SHOWN
FROM
DATAA O
WHERE
O.KEY <= 2; -- DECIDES THE NUMBER OF RECORDS
OUTPUT:
If you don't want to change much logic in your query then just use additional WHERE clause outside your final query like:
SELECT <bla bla bla>
FROM <YOUR FINAL QUERY>
WHERE B IS NOT NULL AND C IS NOT NULL
Cheers!!
I guess you were on the right track, trying to group values.
In order to do that, columns (that are supposed to be distinct) will be left alone (such as ID in the following example), while the rest should be aggregated (using min, max or any other you find appropriate).
For example, as you said that there's some date column you don't care about - I mean, which one of them you'll select - then select the first one (i.e. min(date_column)). Similarly, you'd do with the rest. The group by clause should contain all non-aggregated columns (id in this example).
select id,
sum(a) a,
sum(b) b,
min(date_column) date_column
from your_current_query
group by id
If I understand your query right, it would be much easier and more performant, to avoid the lookups in the Select clause. Try to bring it all in one Query:
SELECT * FROM (
SELECT T2.ENT_REF AS ID,
SUM(CASE WHEN UPPER(T2.PER) = 'HURR' THEN T2.A END) AS B,
SUM(CASE WHEN UPPER(T2.PER) IN ('I', 'II', 'II') THEN T2.A END) as C
FROM Table_2 T2
WHERE UPPER(T2.ENT_TYPE) = 'POL'
AND T2.Cov NOT IN ('OUTPROP','COV')
GROUP BY T2.ENT_REF
)
WHERE B IS NOT NULL
OR C IS NOT NULL

PL/SQL Switching two columns from two tables

Suppose I have two tables (tblA and tblB) and want to switch the second column of each table (tblA.Grade and tblB.Grade) as shown:
+-------------------------------------+
| table a table b |
+-------------------------------------+
| name grade name grade |
| a 60 f 50 |
| b 45 g 70 |
| c 30 h 90 |
+-------------------------------------+
Now, I would like to switch the grade column from table a to table b and the the grade column from table b to table a. The result should look like this:
+-----------------------------------------+
| table a table b |
+-----------------------------------------+
| name grade name grade |
| a 50 f 60 |
| b 70 g 45 |
| c 90 h 30 |
+-----------------------------------------+
I have created the tables, loaded them into cursors using bulk collect and the following code to complete the transformation:
insert into tblA values('a',60);
insert into tblA values('b',45);
insert into tblA values('c',30);
insert into tblb values('f',70);
insert into tblb values('g',80);
insert into tblb values('h',90);
.
DECLARE
TYPE tbla_type IS TABLE OF tbla%ROWTYPE;
l_tbla tbla_type;
TYPE tblb_type IS TABLE OF tblb%ROWTYPE;
l_tblb tblb_type;
BEGIN
-- All rows at once...
SELECT *
BULK COLLECT INTO l_tbla
FROM tbla;
SELECT *
BULK COLLECT INTO l_tblb
FROM tblb;
DBMS_OUTPUT.put_line (l_tblb.COUNT);
FOR indx IN 1 .. l_tbla.COUNT
LOOP
DBMS_OUTPUT.put_line (l_tbla(indx).lname);
update tbla set grade = l_tblb(indx).grade
where l_tbla(indx).lname= tbla.lname;
update tblb set grade = l_tbla(indx).grade
where l_tblb(indx).lname= tblb.lname;
END LOOP;
END;
So, although I did the task, I am wondering if there is a more simple solution that I have not thought of?
Please let me know if anyone knows if there may be a more simple solution?
Note that there is nothing called first or second record in databases as there is no guarantee that the first record entered will be the first one returned. So there should always be an order by to decide first/second etc.
So assuming you want the records to be ordered by name and then swap grade of smallest name of first table with grade of smallest name of second table,
Now assuming you fix the order thingy in your existing code, and if it is working, I believe it would be faster than the way I would do it below. Something like
Create a temp table and put names and grade ordered by name.
Reason of using temp table is mostly because later if I want to correct or revert the data, I can use the same temp table to reverse the merge.
create table tmp1 as
with ta as
(select t.* ,
row_number() over (order by name) as rnk
from tblA t)
,tb as
(select t.* ,
row_number() over (order by name) as rnk
from tblb t)
select ta.name as ta_name,ta.grade as ta_grade,
tb.name as tb_name,tb.grade as tb_grade
from ta inner join tb
on ta.rnk=tb.rnk
Output of tmp1
+---------+----------+---------+----------+
| TA_NAME | TA_GRADE | TB_NAME | TB_GRADE |
+---------+----------+---------+----------+
| a | 60 | f | 70 |
| b | 45 | g | 80 |
| c | 30 | h | 90 |
+---------+----------+---------+----------+
Then use merge to swap value from tmp1.
merge into tbla t1
using tmp1 t
on (t1.name=t.ta_name)
when matched then update
set t1.grade=t.tb_grade;
merge into tblb t1
using tmp1 t
on (t1.name=t.tb_name)
when matched then update
set t1.grade=t.ta_grade;
If satisfied with result, drop the temp table later
drop table tmp1;

Complex SQL query to join two tables

Problem:
Given two tables: TableA, TableB, where TableA has a one-to-many relationship with TableB, I want to retrieve all records in TableB for where the search criteria matches a certain column in TableB and return NULL for the unique TableA records for the same attribute.
Table Structures:
Table A
ID(Primary Key) | Name | City
1 | ABX | San Francisco
2 | ASDF | Oakland
3 | FDFD | New York
4 | GFGF | Austin
5 | GFFFF | San Francisco
Table B
ATTR_ID |Attr_Type | Attr_Name | Attr_Value
1 | TableA | Attr_1 | Attr_Value_1
2 | TableD | Attr_1 | Attr_Value_2
1 | TableA | Attr_2 | Attr_Value_3
3 | TableA | Attr_4 | Attr_Value_4
9 | TableC | Attr_2 | Attr_Value_5
Table B holds attribtue names and values and is a common table used across multiple tables. Each table is identified by Attr_Type and ATTR_ID (which maps to the IDs of different tables).
For instance, the record in Table A with ID 1 has two attributes in Table B with Attr_Names: Attr_1 and Attr_2 and so on.
Expected Output
ID | Name | City | TableB.Attr_Value
1 | ABX | San Francisco | Attr_Value_1
2 | ASDF | Oakland | Attr_Value_2
3 | FDFD | New York | NULL
4 | GFGF | Austin | NULL
5 | GFFFF | San Francisco | NULL
Search Criteria:
Get rows from Table B for each record in Table A with ATTR_NAME Attr_1. If a particular TableA record doesn't have Attr_1, return null.
My Query
select id, name, city,
b.attr_value from table_A
join table_B b on
table_A.id =b.attr_id and b.attr_name='Attr_1'
This is a strange data structure. You need a left outer join with the conditions in the on clause:
select a.id, a.name, a.city, b.attr_value
from table_A a left join
table_B b
on a.id = b.attr_id and b.attr_name = 'Attr_1' and b.attr_type = 'TableA';
I added the attr_type condition, because that seems logic with this data structure.
I dont have an sql server to test the command, but what you want is an inner/outer join query. You could do something like this
select id, name, city,
b.attr_value from table_A
join table_B b on
table_A.id *= b.attr_id and b.attr_name *= 'Attr_1'
Something like this should do the trick for you

How to reduce the script lines for same table join condition in multiple times in oracle?

There are two tables table1 and table2.
Table1 is as below:
col1 | col2 | Col3
A 10 X
B 11 X
C 10 X
A 20 X
Table2 is as below:
col1 | col2 | col3 | col4
A 10 1 UDHAY
B 11 2 VIJAY
C 10 1 SURESH
A 20 2 ARUL
A 10 3 UDHAY
B 11 4 VIJAY
C 10 4 SURESH
A 20 5 ARUL
I want to display the column col4 in table2 with 3 join conditions as below.
table1.COl1 = table2.COl1
and table1.COl2 = table2.COl2
and table2.COl3 = '1'
Sample query :
select
table2.col4
from table1
left outer join table2
on(
table1.COl1 = table2.COl1
and table1.COl2 = table2.COl2
and table2.COl3 = '1');
Question: IF I want to display table2.col4 for condition table2.col3 1,2,3,4,5 with matching other condition from table1, how to make the script?
Actually I know we can add same table 5 times with different alias names and can print. But I don't want to repeat the same condition 5 times. Only the where conditions should be common for all 5 values.
Added on 30-OCT-2013:
Thanks for your response. NO not like you mentioned by using IN. Right now I am using below script concept :
select A.col1,A.co2,B1.col4 ,B2.col4,B3.col4.B4.col4
from table1 A
left outer join table2 B1
on(
A.COl1 = B1.COl1
and A.COl2 = B1.COl2
and B1.COl3 = '1')
left outer join table2 B2
on(
A.COl1 = B2.COl1
and A.COl2 = B2.COl2
and B2.COl3 = '2')
left outer join table2 B3
on(
A.COl1 = B3.COl1
and A.COl2 = B3.COl2
and B3.COl3 = '3')
left outer join table2 B4
on(
A.COl1 = B4.COl1
and A.COl2 = B4.COl2
and B4.COl3 = '4');
So my output will be:
A | 10 | UDHAY | |UDHAY| |
B | 11 | | VIJAY| | VIJAY |
C | 10 | SURESH | | | SURESH |
A | 20 | | ARUL | | |
But like above script I have to make it for 25 combination (1 to 25). So if I make the script like above the script lines will be more than 200 lines. To avoid that will you please help to suggest some other method to reduce the script lines and get the same output?
I'm not sure I'm getting you correctly, do you mean something like this?
and table2.COl3 IN ('1','2','3','4','5')
First solution (trivial):
Make a view with the whole set of Joins and query for it instead on your script, it's a bit dodgy but gets your script to be shorter. If your data set ends up growing off the scales you could switch to a materialized view instead and simplify what could end up being a costly plan.
Second solution (not so trivial):
If your tables do have a name convention you could write a PL block and loop through it, going around and arraying the tables that you want to join so that you can concat the joins on runtime in a string and run it as dynamic SQL.

Oracle compare two count() different tables

I have a problem with a query, see I have two tables, let say:
table a:
progid | name | type
12 | john | b
12 | anna | c
13 | sara | b
13 | ben | c
14 | alan | b
15 | george| b
table b:
progid | name | type
12 | john | b
12 | anna | c
13 | sara | b
14 | alan | b
15 | george| b
table a gets count
progid | count(*)
12 | 2
13 | 2
14 | 1
15 | 1
table b gets
progid | count(*)
12 | 2
**13 | 1**<-this is what I want to find different count
14 | 1
15 | 1
What I want is to find which progid in table b aren't in table a by count, (because as you can see the prog id is there but they should be there the same times! So ben is gone but the progid 13 is there)
So I want to get progid where count varies in the tables, I tried:
select a.progid from
(select progid ,count(*) total from tablea group by progid) a,
(select progid ,count(*) total from tableb group by progid) b
where
a.progid=b.progid and a.total<>b.total;
I get b.total invalid identifier
if I use a.count(progid)<>b.count(progid)
Error says can't use group functions there, any ideas? I'm desperate!
ok i've checked your answers and here's the original one
select a.beneficiarioid from
(select beneficiarioid,count(*) total from lmml_ejercicio_2012_3 where programaid=61 group by beneficiarioid order by beneficiarioid) a,
(select beneficiarioid,count(*) total from ejercicio_2012_3 where programaid=61 group by beneficiarioid order by beneficiarioid) where
a.beneficiarioid=b.beneficiarioid and a.total<>b.total;
anyway, i'll try your querys and let you know!! thank you very much!!
btw it's Oracle 11g
You should be able to use a subquery to get each count and then join them using a FULL OUTER JOIN:
select coalesce(a.progId, b.progId) progid,
coalesce(a.atotal, 0) atotal,
coalesce(b.btotal, 0) btotal
from
(
select progid, count(*) aTotal
from tablea
group by progId
) a
full outer join
(
select progid, count(*) bTotal
from tableb
group by progId
) b
on a.progid = b.progid
where coalesce(a.atotal, 0) <> coalesce(b.btotal, 0);
See SQL Fiddle with Demo. I used a FULL OUTER JOIN in the event you have rows in one table that do not exist in the other table.
Even though your query works fine on my database, I would prefer set operation:
(select progid ,count(*) total from tablea group by progid)
minus
(select progid ,count(*) total from tableb group by progid)

Resources