Hive Getting only max occurrence of a value - hadoop

I have hive table which has two cloumns,I want to get the value which occured max number of times
For example in my below table a value occured twice and c only once , here a value is dominat so I want only a value as shown in output
col1 col2
a a_value1
a a_value2
a c_value3
b b_value1
OUTPUT:
col1 col2
a a_value1
b b_value1

You are looking for what statisticians call the mode. A pretty simple method is to use aggregation with a window function:
select col1, col2
from (select col1, col2, count(*) as cnt,
row_number() over (partition by col1 order by count(*) desc) as seqnum
from t
) t
where seqnum = 1;
The above query will return one value for each col1, even if there are ties. If you want all the values in the event of ties, then use rank() or dense_rank().

Related

How can I count the amount of values in different columns in oracle plsql

For example, I have a table with these values:
ID
Date
Col1
Col2
Col3
Col4
1
01/11/2021
A
A
B
2
01/11/2021
B
B
The A and B values are dynamic, they can be other characters as well.
Now I need somehow to get to the result that id 1 has 2 occurences of A and one of B. Id 2 has 0 occurences of A and 2 occurences of B.
I'm using dynamic SQL to do this:
for v_record in table_cursor
loop
for i in 1 .. 4
loop
v_query := 'select col'||i||' from table where id = '||v_record.id;
execute immediate v_query into v_char;
if v_char = "any letter I'm checking" then
amount := amount + 1;
end if;
end loop;
-- do somehting with the amount
end loop;
But there has to be a better much more efficient way to do this.
I don't have that much knowledge of plsql and I really don't know how to formulate this question in google. I've looked into pivot, but I don't think that will help me out in this case.
I'd appreciate it if someone could help me out.
Assuming the number of columns would be fixed at four, you could use a union aggregation approach here:
WITH cte AS (
SELECT ID, Col1 AS val FROM yourTable UNION ALL
SELECT ID, Col2 FROM yourTable UNION ALL
SELECT ID, Col3 FROM yourTable UNION ALL
SELECT ID, Col4 FROM yourTable
)
SELECT
t1.ID,
t2.val,
COUNT(c.ID) AS cnt
FROM (SELECT DISTINCT ID FROM yourTable) t1
CROSS JOIN (SELECT DISTINCT val FROM cte) t2
LEFT JOIN cte c
ON c.ID = t1.ID AND
c.val = t2.val
WHERE
t2.val IS NOT NULL
GROUP BY
t1.ID,
t2.val;
This produces:
Demo

Oracle Select unique on multiple column

How can I achieve this to Select to one row only dynamically since
the objective is to get the uniqueness even on multiple columns
select distinct
coalesce(least(ColA, ColB),cola,colb) A1, greatest(ColA, ColB) B1
from T
The best solution is to use UNION
select colA from your_table
union
select colB from your_table;
Update:
If you want to find the duplicate then use the EXISTS as follows:
SELECT COLA, COLB FROM YOUR_TABLE T1
WHERE EXISTS (SELECT 1 FROM YOUR_tABLE T2
WHERE T2.COLA = T1.COLB OR T2.COLB = T1.COLA)
If I correctly understand words: objective is to get the uniqueness even on multiple columns, number of columns may vary, table can contain 2, 3 or more columns.
In this case you have several options, for example you can unpivot values, sort, pivot and take unique values. The exact code depends on Oracle version.
Second option is listagg(), but it has limited length and you should use separators not appearing in values.
Another option is to compare data as collections. Here I used dbms_debug_vc2coll which is simple table of varchars. Multiset except does main job:
with t as (select rownum rn, col1, col2, col3,
sys.dbms_debug_vc2coll(col1, col2, col3) as coll
from test )
select col1, col2, col3 from t a
where not exists (
select 1 from t b where b.rn < a.rn and a.coll multiset except b.coll is empty )
dbfiddle with 3-column table, nulls and different test cases

Fetch name based on comma-separated ids

I have two tables, customers and products.
products:
productid name
1 pro1
2 pro2
3 pro3
customers:
id name productid
1 cust1 1,2
2 cust2 1,3
3 cust3
i want following result in select statement,
id name productid
1 cust1 pro1,pro2
2 cust2 pro1,pro3
3 cust3
i have 300+ records in both tables, i am beginner to back end coding, any help?
Definitely a poor database design but the bad thing is that you have to live with that. Here is a solution which I created using recursive query. I don't see the use of product table though since your requirement has nothing to do with product table.
with
--Expanding each row seperated by comma
tab(col1,col2,col3) as (
Select distinct c.id,c.prdname,regexp_substr(c.productid,'[^,]',1,level)
from customers c
connect by regexp_substr(c.productid,'[^,]',1,level) is not null
order by 1),
--Appending `Pro` to each value
tab_final as ( Select col1,col2, case when col3 is not null
then 'pro'||col3
else col3
end col3
from tab )
--Displaying result as expected
SELECT
col1,
col2,
LISTAGG(col3,',') WITHIN GROUP( ORDER BY col1,col2 ) col3
FROM
tab_final
GROUP BY
col1,
col2
Demo:
--Preparing dataset
With
customers(id,prdname,productid) as ( Select 1, 'cust1', '1,2' from dual
UNION ALL
Select 2, 'cust2','1,3' from dual
UNION ALL
Select 3, 'cust3','' from dual),
--Expanding each row seperated by comma
tab(col1,col2,col3) as (
Select distinct c.id,c.prdname,regexp_substr(c.productid,'[^,]',1,level)
from customers c
connect by regexp_substr(c.productid,'[^,]',1,level) is not null
order by 1),
--Appending `Pro` to each value
tab_final as ( Select col1,col2, case when col3 is not null
then 'pro'||col3
else col3
end col3
from tab )
--Displaying result as expected
SELECT
col1,
col2,
LISTAGG(col3,',') WITHIN GROUP( ORDER BY col1,col2 ) col3
FROM
tab_final
GROUP BY
col1,
col2
PS: While using don't forget to put your actual table columns as in my example it may vary.

Change Default Hive result to some values

I was trying to get duplicate record count from table, but for particular partitions data is not available, so hive is only printing "OK" result.
Is it possible to change this result with some value like 0 Or NULL.
Yes have tried with nvl,COALESCE,case option still it showing OK. AND goal is to only check duplicate count, so required at least one value
select col1, col2, nvl(count(*),0) AS DUPLICATE_ROW_COUNT, 'xyz' AS TABLE_NAME
from xyz
where data_dt='20170423'
group by col1,col2
having count(*) >1
It will return no rows on empty dataset because you are using group by and having filter. Group by having nothing to group, that is why it does not return any rows. Without group by and having query returns 0:
select nvl(count(*),0) cnt, 'xyz' AS TABLE_NAME
from xyz
where data_dt='20170423'
As a solution you can UNION ALL with null row when empty dataset
select col1, col2, nvl(count(*),0) AS DUPLICATE_ROW_COUNT, 'xyz' AS TABLE_NAME
from xyz
where data_dt='20170423'
group by col1,col2
having count(*) >1
UNION ALL --returns 1 row on empty dataset
select col1, col2, DUPLICATE_ROW_COUNT, TABLE_NAME
from (select null col1, null col2, null AS DUPLICATE_ROW_COUNT, 'xyz' AS TABLE_NAME
)a --inner join will not return rows when non-empty dataset
inner join (
select count(*) cnt from --should will return 0 on empty dataset
( --your original query
select col1, col2, nvl(count(*),0) AS DUPLICATE_ROW_COUNT, 'xyz' AS TABLE_NAME
from xyz
where data_dt='20170423'
group by col1,col2
having count(*) >1
)s --your original query
)s on s.cnt=0
Also it's may be possible to use CTE (WITH) and WHERE NOT EXISTS instead of inner joinfor your subquery, didn't test it.
Also you can use shell to get result and test it on empty value:
dataset=$(hive -e "set hive.cli.print.header=false; [YOUR QUERY HERE]);
# test on empty dataset
if [[ -z "$dataset" ]] ; then
dataset=0
fi

Oracle Query Subselect with order by

I am normally using MS SQL and am a total rookie with oracle.
I get an oracle driver problem when I use the ORDER BY statement in my subquery.
Example (my real statement is much more complex but I doubt it matters to my problem - I can post it if needed):
SELECT col1
, col2
, (SELECT colsub FROM subtbl WHERE idsub = tbl.id AND ROWNUM=1 ORDER BY coldate) col3
FROM tbl
If I do such a construct I get an odbc driver error: ORA-00907: Right bracket is missing (translated from german, so bracket might be other word :)).
If I remove the ORDER BY coldate everything works fine. I couldn't find any reason why, so what do I wrong?
It doesn't make any sense to write the ROWNUM and the ORDER BY this way since the ORDER BY is evaluated after the WHERE clause, meaning that it has no effect in this case. An example is given in this question.
This also gets a little more complicated because it is hard to join a sub-query back to the main query if it is nested too deeply.
The query below won't necessarily work because you can't join between tbl and subtbl in this way.
SELECT
col1,
col2,
(
select colsub
from (
SELECT colsub
FROM subtbl
WHERE idsub = tbl.id
order by coldate
)
where rownum = 1
) as col3
FROM tbl
So you'll need to use some sort of analytic function as shown in the example below:
SELECT
col1,
col2,
(SELECT max(colsub) keep (dense_rank first order by coldate) as colsub
FROM subtbl
WHERE idsub = tbl.id
group by idsub
) col3
FROM tbl
The FIRST analytic function is more complicated than it needs to be but it will get the job done.
Another option would be to use the ROW_NUMBER analytic function which would also solve the problem.
SELECT
col1,
col2,
(select colsub
from (
SELECT
idsub,
row_number() over (partition by idsub order by coldate) as rown,
colsub
FROM subtbl a
) a
WHERE a.idsub = tbl.id
and a.rown = 1
) col3
FROM tbl
What you are doing wrong is clear. You are using an order by in a sub-query. It does not make any sense using an order by in a sub-query so why would you want to do that?
Also you are using an order by on a sub-query that always returns 1 row. That also does not make any sense.
If you want the query result to be sorted use an order by at the highest level.
try:
select
col1,
col2,
colsub
from(
select
col1 ,
col2 ,
coldate,
max(coldate) over (partition by st.idsub) max_coldate
from
tbl t,
subtbl st
where
st.idsub = t.id)
where
coldate = max_coldate

Resources