Hive: How to output total row count as a variable - hadoop

I have a dataset that I'm de-duping with the following code:
select session_id, sol_id, id, session_context_code, date
from (
select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id, date) as rn,
substr(case_id,2,9) as id
from df.t1_data
)undup
where undup.rn =1
order by session_id, sol_id, date
I want to add a variable that stores the total count of rows after dedup, and I tried with count(*):
select session_id, sol_id, id, session_context_code, date,count(*) as total
from (
select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id,date) as rn,
substr(case_id,2,9) as id
from df.t1_data
)undup
where undup.rn =1
order by session_id, sol_id, date
The error I received:
ERROR: Execute error: org.apache.hive.service.cli.HiveSQLException:
Error while compiling statement: FAILED: SemanticException
[Error 10025]: Line 1:44 Expression not in GROUP BY key 'session_id'
I just want to output a count, as a variable, of all distinct records by session_id and sol_id after de-duplicating by row number. How do I incorporate that into the code?
I tried Gomz's suggestion, but received this error:
ERROR: Execute error: org.apache.hive.service.cli.HiveSQLException:
Error while compiling statement: FAILED: ParseException line
1:614 missing EOF at 'group' near 'nifi_date'
Code:
select session_id, solicit_id, nifi_date,id, session_context_code,count(*) as total
from (
select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id) as rn,
substr(case_id,2,9) as id
from df.t1_data
)undup
where undup.rn =1 and
session_context_code in ("4","3") and
order by session_id, sol_id, nifi_date
group by session_id, sol_id, nifi_date,id, session_context_code

In Hive, when COUNT(*) appears in the SELECT clause together with other columns, those columns must all be listed in a GROUP BY clause.
Some examples:
SELECT COUNT(*) FROM employees;
SELECT id, name, COUNT(*) FROM employees GROUP BY id, name;
In your scenario, the query should look like this:
select session_id, sol_id, id, session_context_code, count(*) as total
from (
select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id,date) as rn,
substr(case_id,2,9) as id
from df.t1_data
)undup
where undup.rn =1
GROUP BY session_id, sol_id, id, session_context_code
order by session_id, sol_id
Update: If you want to count all distinct records only by session_id and sol_id, then the query can be as follows,
select session_id, sol_id, count(*) as total
from (
select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id,date) as rn,
substr(case_id,2,9) as id
from df.t1_data
)undup
where undup.rn =1
GROUP BY session_id, sol_id
order by session_id, sol_id;
As discussed, include in SELECT and GROUP BY only the columns you want to count by.
If you need more columns in the result than the ones being counted, you can create a temporary table containing only the counted columns and join it with the original table. For example, if you need the count per columns a, b but also want columns c, d, e, f from the table, you can do something like this:
CREATE TABLE tmp AS
SELECT a, b, count(*) AS total
FROM table1
GROUP BY a, b;
Then do a JOIN between tmp and table1 on columns a and b:
SELECT y.a, y.b, y.total, x.c, x.d, x.e, x.f
FROM tmp y, table1 x
WHERE y.a = x.a
AND y.b = x.b;
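As an alternative to the temporary table and join, a windowed count can attach the total to every de-duped row in a single query. This is only a sketch, assuming the df.t1_data table and columns from the question and a Hive version with windowing support. The empty OVER () makes the whole filtered result one window, so total is the number of rows that survive the rn = 1 filter.

```sql
-- Sketch only: assumes the df.t1_data table and the columns from the question.
select session_id, sol_id, id, session_context_code, date,
       count(*) over () as total   -- empty OVER (): one window spanning all de-duped rows
from (
  select *, ROW_NUMBER() OVER (PARTITION BY session_id, sol_id, date) as rn,
         substr(case_id,2,9) as id
  from df.t1_data
) undup
where undup.rn = 1
order by session_id, sol_id, date;
```

Because window functions are evaluated after the WHERE clause, the count reflects the de-duped rows, not the raw table. To get a count per session_id and sol_id instead of a grand total, the window could be written as count(*) over (partition by session_id, sol_id). On a very large table an unpartitioned window may be slow, since all rows fall into a single partition.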
Hope this helps!

Related

Oracle SQL -- Finding count of rows that match date maximum in table

I am trying to use a query to return the count from rows such that the date of the rows matches the maximum date for that column in the table.
Oracle SQL: version 11.2:
The following syntax seems correct to me, and it compiles and runs. However, instead of returning just the count for the maximum date, it returns several counts, more or less as if the HAVING clause weren't there.
Select ourDate, Count(1) as OUR_COUNT
from schema1.table1
group by ourDate
HAVING ourDate = max(ourDate) ;
How can this be fixed, please?
You can use:
SELECT MAX(ourDate) AS ourDate,
COUNT(*) KEEP (DENSE_RANK LAST ORDER BY ourDate) AS ourCount
FROM schema1.table1
or:
SELECT ourDate,
COUNT(*) AS our_count
FROM (
SELECT ourDate,
RANK() OVER (ORDER BY ourDate DESC) AS rnk
FROM schema1.table1
)
WHERE rnk = 1
GROUP BY ourDate
Which, for the sample data:
CREATE TABLE table1 (ourDate) AS
SELECT SYSDATE FROM DUAL CONNECT BY LEVEL <= 5 UNION ALL
SELECT SYSDATE - 1 FROM DUAL;
Both output:
OURDATE | OUR_COUNT
2022-06-28 13:35:01 | 5
I don't know if I understand what you want. Try this:
Select x.ourDate, Count(1) as OUR_COUNT
from schema1.table1 x
where x.ourDate = (select max(y.ourDate) from schema1.table1 y)
group by x.ourDate
One option is to use a subquery which fetches maximum date:
select ourdate, count(*)
from table1
where ourdate = (select max(ourdate)
from table1)
group by ourdate;
Or, a more modern approach (if your database version supports it; 11g doesn't, though):
select ourdate, count(*)
from table1
group by ourdate
order by ourdate desc
fetch first 1 rows only;
You can use this SQL query:
select MAX(ourDate),COUNT(1) as OUR_COUNT
from schema1.table1
where ourDate = (select MAX(ourDate) from schema1.table1)
group by ourDate;

How to get records from select statement based on one column distinct value in Oracle?

Please help me with the following problem:
And the result should be:
filtered by iban_code distinct
You can use the ROW_NUMBER analytic function.
Select * from
(Select t.*,
Row_number()
over (partition by per_id, iban_code
order by main_bank_account desc) as rn
From your_table t)
Where rn=1;
Cheers!!

Column name become invalid after referred as result of aggregate function MIN()

SELECT cust_detl.*,
MIN(CREATION_TIMESTAMP) OVER (PARTITION BY CUST_ID) AS MIN_TIMESTAMP
FROM CUST_DETAILS cust_detl
WHERE CREATION_TIMESTAMP=MIN_TIMESTAMP;
The above query should select all columns from the table CUST_DETAILS for the rows with the oldest value in the CREATION_TIMESTAMP column.
Any idea why MIN_TIMESTAMP is reported as an invalid identifier?
These are the columns that should display:
SELECT
CUSTOMER_DTL_SEQ.nextval,
CUST_ID,
CUST_REF_ID,
CUST_NAME,
CUST_ADDRESS,
CREATION_TIMESTAMP
FROM
(
SELECT
CUSTOMER_DTL_SEQ.nextval,
cust_detl.CUST_ID,
cust_detl.CUST_REF_ID,
cust_detl.CUST_NAME,
cust_detl.CUST_ADDRESS,
cust_detl.CREATION_TIMESTAMP,
MIN(CREATION_TIMESTAMP) OVER (PARTITION BY CUST_ID) AS min_timestamp
FROM cust_details cust_detl
)
WHERE CREATION_TIMESTAMP = min_timestamp;
I need to select the CREATION_TIMESTAMP column as well, and only the rows with the minimum timestamp should be selected. The problem is that the sequence with nextval is not allowed there. I need the sequence in the query because this statement will later be used in an INSERT INTO ... SELECT.
The PK needs to be incremented.
The column alias is not yet valid at that point: the rows are filtered by the WHERE condition first, and the SELECT list is evaluated afterwards on the filtered data. You need to put it in a subquery before you can use it.
SELECT * FROM
(SELECT cust_detl.*,
MIN(CREATION_TIMESTAMP) OVER (PARTITION BY CUST_ID) AS MIN_TIMESTAMP
FROM CUST_DETAILS cust_detl)
WHERE CREATION_TIMESTAMP=MIN_TIMESTAMP;
UPDATE: I don't know what columns your table has, but if you need only specific columns, the query goes like this (assuming you need only the columns cust_id, column1, column2 and column3 in the select list):
SELECT cust_id,
column1,
column2,
column3
FROM (SELECT cust_detl.cust_id,
cust_detl.column1,
cust_detl.column2,
cust_detl.column3,
cust_detl.creation_timestamp,
MIN(creation_timestamp) over(PARTITION BY cust_id) AS min_timestamp
FROM cust_details cust_detl)
WHERE creation_timestamp = min_timestamp;
If this still doesn't solve your problem, then post the list of columns in the table and the expected output.
Update 2: Fetch the sequence value in the outer query; this query should work fine.
SELECT customer_dtl_seq.nextval,
cust_id,
cust_ref_id,
cust_name,
cust_address,
creation_timestamp
FROM (SELECT cust_detl.cust_id,
cust_detl.cust_ref_id,
cust_detl.cust_name,
cust_detl.cust_address,
cust_detl.creation_timestamp,
MIN(creation_timestamp) over(PARTITION BY cust_id) AS min_timestamp
FROM cust_details cust_detl)
WHERE creation_timestamp = min_timestamp;
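Since the question mentions that the statement will end up in an INSERT, here is a sketch of how the query above might be wrapped. The target table cust_details_first and its column list are hypothetical names chosen for illustration. The sequence, table, and source columns are taken from the question, with the reference column spelled cust_ref_id to match the inner query.

```sql
-- Sketch only: cust_details_first and its column list are hypothetical.
INSERT INTO cust_details_first
       (id, cust_id, cust_ref_id, cust_name, cust_address, creation_timestamp)
SELECT customer_dtl_seq.nextval,   -- sequence is drawn in the top-level SELECT of the INSERT
       cust_id,
       cust_ref_id,
       cust_name,
       cust_address,
       creation_timestamp
FROM (SELECT cust_detl.cust_id,
             cust_detl.cust_ref_id,
             cust_detl.cust_name,
             cust_detl.cust_address,
             cust_detl.creation_timestamp,
             -- oldest timestamp per customer, computed before filtering
             MIN(cust_detl.creation_timestamp) OVER (PARTITION BY cust_detl.cust_id) AS min_timestamp
      FROM cust_details cust_detl)
WHERE creation_timestamp = min_timestamp;
```

The sequence stays out of the inline view, so nextval is evaluated once per row that passes the outer WHERE filter.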

How to find record from a very big HIVE table where column header__timestamp,header__change_seq should be latest update and id should unique

I have to find records in a Hive table where the combination (Id, header__timestamp, header__change_seq) should be unique. However, the table can contain duplicates of (Id, header__timestamp, header__change_seq), and in that case I have to fetch only one of the duplicate records.
select b.*
from (SELECT ID, max(COALESCE(header__timestamp))
max_modified,MAX(CAST(header__change_seq AS DECIMAL(38,0))) max_sequence
FROM table_name group by ID) a
join table_name b on (a.id=b.id and
a.max_modified=b.header__timestamp and
a.max_sequence=b.header__change_seq)
The total number of distinct ids is 244441250,
but the above query returns 244442548 rows
because of some duplicate records. I need exactly one row per distinct id, the one where header__change_seq and header__timestamp are at their maximum.
@Rahul, please try this one. It uses row_number(), so when id, header__timestamp and header__change_seq are duplicated, it selects only one record. Hope it helps.
select *
from (
select *,
row_number() over ( partition by id order by header__timestamp desc, header__change_seq desc) as rnk
from table_name) t
where t.rnk = 1;

SemanticException [Error 10007]: Ambiguous column reference _c1

I'm facing an issue when using four levels of nesting in a Hive query. Below is the query I'm executing:
SELECT *,
SUM(qtod.amount) OVER (PARTITION BY qtod.id, qtod.year_begin_date ORDER BY qtod.tran_date)
FROM (SELECT *,
SUM(mtod.amount) OVER (PARTITION BY mtod.id, mtod.quarter_begin_date ORDER BY mtod.tran_date)
FROM (SELECT *,
SUM(wtod.amount) OVER (PARTITION BY wtod.id, wtod.month_begin_date ORDER BY wtod.tran_date)
FROM (select id,
year_begin_date,
quarter_begin_date,
month_begin_date,
week_begin_date,
tran_date,
amount,
SUM(amount)
OVER (PARTITION BY id,week_begin_date ORDER BY tran_date) FROM table_name)wtod)mtod)qtod;
If I exclude the fourth level of nesting it works fine, but with it included I get the error message below:
FAILED: SemanticException [Error 10007]: Ambiguous column reference
_c1 in qtod
To avoid the nesting, I tried to do it another way:
SELECT * FROM
(SELECT id,year_begin_date,tran_date,amount,SUM(amount) OVER (PARTITION BY id,year_begin_date ORDER BY tran_date) FROM yeartodate)ytod
JOIN
(SELECT *, SUM(mtod.amount) OVER (PARTITION BY mtod.id, mtod.quarter_begin_date ORDER BY mtod.tran_date)
FROM (SELECT *, SUM(wtod.amount) OVER (PARTITION BY wtod.id, wtod.month_begin_date ORDER BY wtod.tran_date)
FROM (select id,
year_begin_date,
quarter_begin_date,
month_begin_date,
week_begin_date,
tran_date,
amount,
SUM(amount)
OVER (PARTITION BY id,week_begin_date ORDER BY tran_date) FROM table_name)wtod)mtod)qtod
ON qtod.id=ytod.id AND qtod.tran_date=ytod.tran_date;
I still get the same error.
After searching the web, I found that, according to a JIRA raised for Hive, this is an issue with Hive itself.
Since the JIRA is now fixed and the patch is available in Hive 14, I tried running it on Hive 14 (HDP).
I still get the same error.
Any suggestions are welcome.
Non-aliased function calls within a SELECT are mapped to column names _c1, _c2, etc. In this case you have a single non-aliased function call per SELECT, so they all create a column _c1.
The issue is that each level does a SELECT * from the sub-query below it and then appends another function call that also maps to _c1. The same column name then appears twice, hence the ambiguous column reference error.
The solution should be to alias all of your function calls so that they do not use the _c1 default name, like so:
SELECT * FROM
(SELECT id,year_begin_date,tran_date,amount,SUM(amount) OVER (PARTITION BY id,year_begin_date ORDER BY tran_date) AS ytod_amount_sum FROM yeartodate)ytod
JOIN
(SELECT *, SUM(mtod.amount) OVER (PARTITION BY mtod.id, mtod.quarter_begin_date ORDER BY mtod.tran_date) AS mtod_amount_sum
FROM (SELECT *, SUM(wtod.amount) OVER (PARTITION BY wtod.id, wtod.month_begin_date ORDER BY wtod.tran_date) AS wtod_amount_sum
FROM (select id,
year_begin_date,
quarter_begin_date,
month_begin_date,
week_begin_date,
tran_date,
amount,
SUM(amount)
OVER (PARTITION BY id,week_begin_date ORDER BY tran_date) AS amount_sum FROM table_name)wtod)mtod)qtod
ON qtod.id=ytod.id AND qtod.tran_date=ytod.tran_date;
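The same aliasing can be applied to the original four-level nested query, which avoids the join altogether. This is a sketch, assuming the table and column names from the question. The alias names (wtd_amount, mtd_amount, qtd_amount, ytd_amount) are made up for illustration. Note that the alias goes after the OVER (...) clause, not between the function call and OVER.

```sql
-- Sketch only: table_name and its columns come from the question; alias names are illustrative.
SELECT *,
       SUM(qtod.amount) OVER (PARTITION BY qtod.id, qtod.year_begin_date
                              ORDER BY qtod.tran_date) AS ytd_amount   -- year to date
FROM (SELECT *,
             SUM(mtod.amount) OVER (PARTITION BY mtod.id, mtod.quarter_begin_date
                                    ORDER BY mtod.tran_date) AS qtd_amount   -- quarter to date
      FROM (SELECT *,
                   SUM(wtod.amount) OVER (PARTITION BY wtod.id, wtod.month_begin_date
                                          ORDER BY wtod.tran_date) AS mtd_amount   -- month to date
            FROM (SELECT id,
                         year_begin_date,
                         quarter_begin_date,
                         month_begin_date,
                         week_begin_date,
                         tran_date,
                         amount,
                         SUM(amount) OVER (PARTITION BY id, week_begin_date
                                           ORDER BY tran_date) AS wtd_amount   -- week to date
                  FROM table_name) wtod) mtod) qtod;
```

With every windowed sum explicitly named, no level produces a default _cN column, so each SELECT * passes distinct column names up to the next level.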

Resources