Hive: unable to fetch column that is not present in GROUP BY


I have a table in Hive called purchase_data that lists all the purchases made.
I need to query this table and find the cust_id, product_id and price of the most expensive product purchased by each customer.
The data in the purchase_data table looks like:
cust_id product_id price purchase_date
--------------------------------------------------------
aiman_sarosh apple_iphone5s 55000 01-01-2014
aiman_sarosh apple_iphone6s 65000 01-01-2017
jeff_12 apple_iphone6s 65000 01-01-2017
jeff_12 dell_vostro 70000 01-01-2017
missy_el lenovo_thinkpad 70000 01-02-2017
I have written the code below, but it is not fetching the right rows.
Some rows are getting repeated:
select master.cust_id, master.product_id, master.price
from
(
select cust_id, product_id, price
from purchase_data
) as master
join
(
select cust_id, max(price) as price
from purchase_data
group by cust_id
) as max_amt_purchase
on max_amt_purchase.price = master.price;
output:
aiman_sarosh apple_iphone6s 65000.0
jeff_12 apple_iphone6s 65000.0
jeff_12 dell_vostro 70000.0
jeff_12 dell_vostro 70000.0
missy_el lenovo_thinkpad 70000.0
missy_el lenovo_thinkpad 70000.0
Time taken: 21.666 seconds, Fetched: 6 row(s)
Is there something wrong with the code?

Use row_number():
select pd.*
from (select pd.*,
row_number() over (partition by cust_id order by price desc) as seqnum
from purchase_data pd
) pd
where seqnum = 1;
This returns one row per cust_id, even if there are ties. If you want multiple rows when there are ties, then use rank() or dense_rank() instead of row_number().
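The difference between row_number() and rank() on ties can be checked with a minimal sketch. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later for window functions) as a stand-in for Hive; the helper name top_purchases and the extra tied row in the usage note are hypothetical, not part of the question.

```python
import sqlite3

# Sample rows copied from the question; SQLite stands in for Hive here.
ROWS = [
    ("aiman_sarosh", "apple_iphone5s", 55000, "01-01-2014"),
    ("aiman_sarosh", "apple_iphone6s", 65000, "01-01-2017"),
    ("jeff_12", "apple_iphone6s", 65000, "01-01-2017"),
    ("jeff_12", "dell_vostro", 70000, "01-01-2017"),
    ("missy_el", "lenovo_thinkpad", 70000, "01-02-2017"),
]


def top_purchases(rank_fn, extra_rows=()):
    """Return (cust_id, product_id, price) rows ranked first per customer."""
    # Only allow known window functions, since the name is spliced into the SQL.
    if rank_fn not in ("row_number", "rank", "dense_rank"):
        raise ValueError(rank_fn)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table purchase_data "
        "(cust_id text, product_id text, price integer, purchase_date text)"
    )
    conn.executemany(
        "insert into purchase_data values (?, ?, ?, ?)",
        list(ROWS) + list(extra_rows),
    )
    sql = f"""
        select cust_id, product_id, price
        from (select pd.*,
                     {rank_fn}() over (partition by cust_id
                                       order by price desc) as seqnum
              from purchase_data pd) ranked
        where seqnum = 1
        order by cust_id, product_id
    """
    result = conn.execute(sql).fetchall()
    conn.close()
    return result
```

Calling top_purchases("row_number") gives one row per customer. If a hypothetical second 70000 purchase is added for missy_el, top_purchases("rank", extra_rows=[...]) returns both tied rows, while row_number() still returns only one of them.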

I changed the code; it's working now:
select master.cust_id, master.product_id, master.price
from
purchase_data as master,
(
select cust_id, max(price) as price
from purchase_data
group by cust_id
) as max_price
where master.cust_id=max_price.cust_id and master.price=max_price.price;
output:
aiman_sarosh apple_iphone6s 65000.0
missy_el lenovo_thinkpad 70000.0
jeff_12 dell_vostro 70000.0
Time taken: 55.788 seconds, Fetched: 3 row(s)
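Why the first query repeated rows: its join condition compared only price, so a 70000 purchase by jeff_12 also matched missy_el's maximum of 70000, and vice versa. The sketch below reproduces both row counts. It assumes Python's sqlite3 module as a stand-in for Hive, and the QUERY template with its {condition} placeholder is only an illustration device.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table purchase_data (cust_id text, product_id text, price integer)"
)
conn.executemany(
    "insert into purchase_data values (?, ?, ?)",
    [
        ("aiman_sarosh", "apple_iphone5s", 55000),
        ("aiman_sarosh", "apple_iphone6s", 65000),
        ("jeff_12", "apple_iphone6s", 65000),
        ("jeff_12", "dell_vostro", 70000),
        ("missy_el", "lenovo_thinkpad", 70000),
    ],
)

# Same shape as the queries in the question; only the join condition varies.
QUERY = """
    select master.cust_id, master.product_id, master.price
    from purchase_data as master
    join (select cust_id, max(price) as price
          from purchase_data
          group by cust_id) as max_price
      on {condition}
    order by master.cust_id, master.product_id
"""

# Original condition: price only, so rows match other customers' maxima too.
price_only = conn.execute(
    QUERY.format(condition="max_price.price = master.price")
).fetchall()

# Corrected condition: the maximum must belong to the same customer.
per_customer = conn.execute(
    QUERY.format(
        condition="max_price.price = master.price "
        "and max_price.cust_id = master.cust_id"
    )
).fetchall()

print(len(price_only), len(per_customer))
```

With the sample data, the price-only join yields six rows and the join on both cust_id and price yields three, matching the two outputs shown in the question and in the corrected query.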

Related

In an Oracle SQL DB, the same primary ID is present more than once with different batch_id values. How can I find the batch ID just before the current batch ID?

I am working on an Oracle database.
We load customer data into a source table, from which it eventually migrates to a target table.
Every time customer data is loaded into the source table, it gets a unique batch_id.
If we want to update some field in the customer table, we load the same customer into the source table again, but this time with a different batch_id.
Now I want to know the customer's batch_id just before the latest batch_id.
The batch_id we use is usually the current date.

Use the ROW_NUMBER analytic function.
Your sample data:
select * from tab
order by 1,2
CUSTOMER_ID BATCH_ID
----------- -------------------
1 09.12.2019 00:00:00
1 10.12.2019 00:00:00
2 10.12.2019 00:00:00
ROW_NUMBER assigns a sequence number starting from 1 to each customer's rows, ordered by BATCH_ID descending. You are interested in the batch before the latest, i.e. the rows with number 2.
with cust as (
select
customer_id, batch_id,
row_number() over (partition by customer_id order by batch_id desc) rn
from tab)
select CUSTOMER_ID, BATCH_ID
from cust
where rn = 2;
CUSTOMER_ID BATCH_ID
----------- -------------------
1 09.12.2019 00:00:00
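The same query can be tried locally with a small sketch. It assumes Python's sqlite3 module (SQLite 3.25 or later) in place of Oracle, and stores batch_id as ISO-formatted text so that string ordering matches date ordering. The helper name previous_batches is hypothetical.

```python
import sqlite3


def previous_batches(rows):
    """Return (customer_id, batch_id) for the batch just before the latest."""
    conn = sqlite3.connect(":memory:")
    # batch_id is ISO text here (an assumption), so ORDER BY sorts it by date.
    conn.execute("create table tab (customer_id integer, batch_id text)")
    conn.executemany("insert into tab values (?, ?)", rows)
    result = conn.execute(
        """
        with cust as (
            select customer_id, batch_id,
                   row_number() over (partition by customer_id
                                      order by batch_id desc) as rn
            from tab)
        select customer_id, batch_id
        from cust
        where rn = 2
        order by customer_id
        """
    ).fetchall()
    conn.close()
    return result


# Sample data from the answer above, rewritten as ISO dates.
SAMPLE = [(1, "2019-12-09"), (1, "2019-12-10"), (2, "2019-12-10")]
print(previous_batches(SAMPLE))
```

Customer 2 has only one batch, so it has no row numbered 2 and does not appear in the result.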

It seems that you're basically looking for the second biggest value in the SOURCE table.
In this example code the SOURCE_TABLE represents the table containing same CUSTOMER_NO with different BATCH_NO:
create table source_table (customer_no integer, batch_no date);
insert into source_table values (1, SYSDATE-2);
insert into source_table values (1, SYSDATE-1);
insert into source_table values (1, SYSDATE);
SELECT batch_no
FROM (
SELECT batch_no, row_number() over (order by batch_no desc) as row_num
FROM source_table
) t
WHERE row_num = 2;
Here row_num = 2 picks the second biggest value in the table.
The query returns SYSDATE-1. With more than one customer in the table, add PARTITION BY customer_no to the window so that rows are numbered per customer.

Oracle SQL: select the first and last name of the customer with the most orders in 2017

I have the following tables
f_orders
ORDER_NUMBER NUMBER(5,0)
ORDER_DATE DATE
ORDER_TOTAL NUMBER(8,2)
CUST_ID NUMBER(5,0)
STAFF_ID NUMBER(5,0)
with the following data
ORDER_NUMBER ORDER_DATE ORDER_TOTAL CUST_ID STAFF_ID
5678 10-Dec-2017 103.02 123 12
9999 10-Dec-2017 10 456 19
9997 09-Dec-2017 3 123 19
9989 10-Dec-2016 3 123 19
and
f_customers
ID NUMBER(5,0)
FIRST_NAME VARCHAR2(25)
LAST_NAME VARCHAR2(35)
ADDRESS VARCHAR2(50)
with the following data
ID FIRST_NAME LAST_NAME ADDRESS
123 Cole Bee 123 Main Street
456 Zoe Twee 1009 Oliver Avenue
I'm supposed to display the name of the customer with the most orders placed in the year 2017.
My query looks like this:
SELECT f_customers.first_name,
f_customers.last_name,
count(order_total)
FROM f_orders JOIN f_customers
ON f_customers.id = f_orders.CUST_ID
WHERE TO_CHAR(order_date, 'DD-Mon-YYYY') LIKE '%2017'
GROUP BY f_customers.first_name, f_customers.last_name
HAVING count(order_total) = (SELECT max(count(cust_id))
FROM f_orders
GROUP BY cust_id)
The problem is that whenever I add the WHERE clause it returns no data found, even though it should return the name Cole Bee with 2 orders.
If I remove the WHERE clause it shows that Cole Bee has placed 3 orders.
I can't figure out why I get the no data found result. Any ideas?

Your main query is filtering on the year; the subquery on the right hand side of the having clause is not. The max(count()) is 3 if you run that subquery on its own, and you’re comparing that with the filtered list which (as you expect) only finds 2 rows for that customer.
Run the whole query with just the having part removed (rather than the where clause), and run just the subquery; and compare the results.
The simple answer is to repeat the filter:
SELECT f_customers.first_name,
f_customers.last_name,
count(order_total)
FROM f_orders JOIN f_customers
ON f_customers.id = f_orders.CUST_ID
WHERE TO_CHAR(order_date, 'DD-Mon-YYYY') LIKE '%2017'
GROUP BY f_customers.first_name, f_customers.last_name
HAVING count(order_total) = (SELECT max(count(cust_id))
FROM f_orders
WHERE TO_CHAR(order_date, 'DD-Mon-YYYY') LIKE '%2017'
GROUP BY cust_id)
Both filters could be written more simply as:
WHERE TO_CHAR(order_date, 'YYYY') = '2017'
or even:
WHERE EXTRACT(YEAR FROM order_date) = 2017
You can avoid hitting the table twice using analytic queries and other tricks, but as this seems to be an assignment, that may be going beyond what you’ve been taught and are expected to know/use.
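For reference, one way an analytic version might look is sketched below: count the 2017 orders per customer once, rank the counts, and keep rank 1. This is an illustration under assumptions rather than the expected assignment answer: it uses Python's sqlite3 module (SQLite 3.25 or later) instead of Oracle, stores order_date as ISO text, and filters the year with a date range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table f_orders (order_number integer, order_date text, "
    "order_total real, cust_id integer, staff_id integer)"
)
conn.execute(
    "create table f_customers (id integer, first_name text, "
    "last_name text, address text)"
)
# Sample data from the question, with dates rewritten in ISO format.
conn.executemany(
    "insert into f_orders values (?, ?, ?, ?, ?)",
    [
        (5678, "2017-12-10", 103.02, 123, 12),
        (9999, "2017-12-10", 10, 456, 19),
        (9997, "2017-12-09", 3, 123, 19),
        (9989, "2016-12-10", 3, 123, 19),
    ],
)
conn.executemany(
    "insert into f_customers values (?, ?, ?, ?)",
    [
        (123, "Cole", "Bee", "123 Main Street"),
        (456, "Zoe", "Twee", "1009 Oliver Avenue"),
    ],
)

# f_orders is read once: count per customer, then rank the counts.
top_customers = conn.execute(
    """
    with counts as (
        select c.first_name, c.last_name, count(*) as order_count
        from f_orders o
        join f_customers c on c.id = o.cust_id
        where o.order_date >= '2017-01-01' and o.order_date < '2018-01-01'
        group by c.id, c.first_name, c.last_name
    ),
    ranked as (
        select counts.*, rank() over (order by order_count desc) as rnk
        from counts
    )
    select first_name, last_name, order_count
    from ranked
    where rnk = 1
    """
).fetchall()
print(top_customers)
```

Because rank() is used, customers tied for the highest count would all be returned, which matches the behaviour of the HAVING version.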

How do I optimize my hive query for finding Sum of Count of Records from multiple tables

I have to generate a report that gives the sum of the counts from tables A, B and C for events that have been stored using Hive; my S3 buckets are partitioned by Organization_id.
For example:
Table A – Has a record for every day John (and other employees) goes to work
Table B – Has a record for every call that John (and other employees) makes or takes at work
Table C – Has a record for every expense that John(and other employees) submits at work
Basically I want a sum of the counts from A, B and C for John (employee_id) in the last month. There should be only one record for every date if there is a record in any of the 3 tables A, B or C (and sum the counts if there is a record for a date in one or more of the tables). So my Output is:
Employee id  Employee Name  Date         Count
--------------------------------------------------------
123          John           02-Jan-2016  55
123          John           12-Jan-2016  88
123          John           19-Jan-2016  103
The query that I came up with is:
select adcts.employee_name, adcts.employee_id,Total_count as event_count, adcts.event_date
from
(select coalesce(Evts.employee_id,imps.employee_id,AEvts.employee_id) as employee_id
, coalesce(Evts.employee_name,imps.employee_name,AEvts.employee_name) as employee_name
, coalesce(Evts.Event_count,0) + coalesce(Imps.Impression_count,0) + coalesce (AEvts.Event_Count,0)as Total_Count
, coalesce (Evts.event_date,imps.impression_date, AEvts.event_date) as event_date
from
(select employee_id, employee_name, count(*) as Event_count,event_date
from mm_events
where organization_id = 100048
and event_date between '2016-02-01' and '2016-02-04'
group by employee_id, employee_name,event_date) Evts
full outer join
(select employee_id, employee_name, count(*) as Impression_count, impression_date
from mm_impressions
where organization_id = 100048
and impression_date between '2016-02-01' and '2016-02-04'
group by employee_id, employee_name,impression_date) Imps
on Evts.employee_id = Imps.employee_id
full outer join
(select employee_id, employee_name, count(*) as Event_count,event_date
from mm_attributed_events
where organization_id = 100048
and event_date between '2016-02-01' and '2016-02-04'
and event_type = 'click'
group by employee_id, employee_name,event_date) AEvts
on AEvts.employee_id=Evts.employee_id
) adcts
join
(select distinct c.employee_id from default.t1_meta_dmp c
where c.employee_dmp_enabled='inherits'
and c.agency_dmp_enabled = 'inherits'
and c.agency_status='true'
and c.employee_status='true'
and c.organization_id = 100048) cc
on adcts.employee_id=cc.employee_id
order by adcts.employee_id asc
I have 2 questions:
1. Do I have the right query?
2. Because I’m using ‘full outer join’ I get more than one entry for the same date. Can someone suggest a better way to achieve the result? Different query maybe

You are getting more than one entry for the same date because you are grouping by date in subqueries but joining them only by employee_id. That is why your records are duplicated after join. You should add event_date to the join condition as well.
It seems you do not need FULL JOIN at all. A join is more expensive than UNION ALL. Select from each table with UNION ALL, then group by employee_name, employee_id, event_date and sum the per-table counts:
select employee_id, employee_name, sum(Event_count) as Total_Count , event_date
from
(
select employee_id, employee_name, count(*) as Event_count, event_date from mm_events
where organization_id = 100048 and event_date between '2016-02-01' and '2016-02-04'
group by employee_id, employee_name, event_date
union all
select employee_id, employee_name, count(*) as Event_count, impression_date as event_date
from mm_impressions
where organization_id = 100048 and impression_date between '2016-02-01' and '2016-02-04'
group by employee_id, employee_name,impression_date
union all
select employee_id, employee_name, count(*) as Event_count,event_date
from mm_attributed_events
where organization_id = 100048 and event_date between '2016-02-01' and '2016-02-04' and event_type = 'click'
group by employee_id, employee_name, event_date
) adcts
group by employee_id, employee_name, event_date
Add your join with the cc subquery to the above query.
The subqueries in a UNION ALL can run in parallel.
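The UNION ALL approach can be tried on a few rows with the sketch below. It assumes Python's sqlite3 module as a stand-in for Hive, drops the organization_id filter and the cc join for brevity, and uses made-up sample rows. Only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table mm_events (employee_id integer, employee_name text,
                            event_date text);
    create table mm_impressions (employee_id integer, employee_name text,
                                 impression_date text);
    create table mm_attributed_events (employee_id integer, employee_name text,
                                       event_date text, event_type text);
    -- Hypothetical sample rows, not taken from the question.
    insert into mm_events values (123, 'John', '2016-02-01'),
                                 (123, 'John', '2016-02-01'),
                                 (123, 'John', '2016-02-02');
    insert into mm_impressions values (123, 'John', '2016-02-01'),
                                      (123, 'John', '2016-02-01'),
                                      (123, 'John', '2016-02-01');
    insert into mm_attributed_events values
        (123, 'John', '2016-02-03', 'click'),
        (123, 'John', '2016-02-03', 'view');
    """
)

# Count per table and date, stack the counts, then sum them per date.
totals = conn.execute(
    """
    select employee_id, employee_name, event_date,
           sum(event_count) as total_count
    from (
        select employee_id, employee_name, event_date,
               count(*) as event_count
        from mm_events
        where event_date between '2016-02-01' and '2016-02-04'
        group by employee_id, employee_name, event_date
        union all
        select employee_id, employee_name, impression_date as event_date,
               count(*) as event_count
        from mm_impressions
        where impression_date between '2016-02-01' and '2016-02-04'
        group by employee_id, employee_name, impression_date
        union all
        select employee_id, employee_name, event_date,
               count(*) as event_count
        from mm_attributed_events
        where event_date between '2016-02-01' and '2016-02-04'
          and event_type = 'click'
        group by employee_id, employee_name, event_date
    ) adcts
    group by employee_id, employee_name, event_date
    order by event_date
    """
).fetchall()
print(totals)
```

Each date appears exactly once per employee, whether it occurs in one table or in several, which is the property the full outer join on employee_id alone could not guarantee.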

Oracle SQL: retrieve records based on maximum time

I have the data below.
table A
id
1
2
3
table B
id  name      data1     data2       datetime
1   cash      12345.00              12/12/2012 11:10:12
1   quantity  222.12                14/12/2012 11:10:12
1   date                20/12/2012  12/12/2012 11:10:12
1   date                19/12/2012  13/12/2012 11:10:12
1   date                13/12/2012  14/12/2012 11:10:12
1   quantity  330.10                17/12/2012 11:10:12
I want to retrieve data in one row like below:
tableA.id  tableB.cash  tableB.date  tableB.quantity
1          12345.00     13/12/2012   330.10
I want to retrieve based on max(datetime).

The data model appears to be insane: it makes no sense to join an ORDER_ID to a CUSTOMER_ID. It makes no sense to store dates in a VARCHAR2 column. It makes no sense to have no relationship between a CUSTOMER and an ORDER. It makes no sense to have two rows in the ORDER table with the same ORDER_ID. ORDER is also a reserved word, so you cannot use it as a table name. My best guess is that you want something like:
select *
from customer c
join (select order_id,
rank() over (partition by order_id
order by to_date( order_time, 'YYYYMMDD HH24:MI:SS' ) desc ) rnk
from order) o on (c.customer_id=o.order_id)
where o.rnk = 1
If that is not what you want, please (as I asked a few times in the comments) post the expected output.
These are the results I get with my query and your sample data (fixing the name of the ORDER table so that it is actually valid)
SQL> ed
Wrote file afiedt.buf
with orders as (
select 1 order_id, 'iphone' order_name, '20121201 12:20:23' order_time from dual union all
select 1, 'iphone', '20121201 12:22:23' from dual union all
select 2, 'nokia', '20110101 13:20:20' from dual ),
customer as (
select 1 customer_id, 'paul' customer_name from dual union all
select 2, 'stuart' from dual union all
select 3, 'mike' from dual
)
select *
from customer c
join (select order_id,
rank() over (partition by order_id
order by to_date( order_time, 'YYYYMMDD HH24:MI:SS' ) desc ) rnk
from orders) o on (c.customer_id=o.order_id)
where o.rnk = 1
SQL> /
CUSTOMER_ID CUSTOM   ORDER_ID        RNK
----------- ------ ---------- ----------
          1 paul            1          1
          2 stuart          2          1
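The same rank() idea can also be applied to the table B layout shown in the question: rank the rows per id and name by timestamp, keep rank 1, and pivot the names into columns. The sketch below is only an illustration under assumptions. It uses Python's sqlite3 module (SQLite 3.25 or later) instead of Oracle, merges data1 and data2 into a single hypothetical value column, and stores the timestamp in a hypothetical recorded_at column as ISO text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified table B: "value" and "recorded_at" are assumed column names.
conn.execute(
    "create table table_b (id integer, name text, value text, recorded_at text)"
)
conn.executemany(
    "insert into table_b values (?, ?, ?, ?)",
    [
        (1, "cash", "12345.00", "2012-12-12 11:10:12"),
        (1, "quantity", "222.12", "2012-12-14 11:10:12"),
        (1, "date", "20/12/2012", "2012-12-12 11:10:12"),
        (1, "date", "19/12/2012", "2012-12-13 11:10:12"),
        (1, "date", "13/12/2012", "2012-12-14 11:10:12"),
        (1, "quantity", "330.10", "2012-12-17 11:10:12"),
    ],
)


def latest_values():
    """Return one row per id holding the most recent value for each name."""
    return conn.execute(
        """
        select id,
               max(case when name = 'cash' then value end) as cash,
               max(case when name = 'date' then value end) as date_value,
               max(case when name = 'quantity' then value end) as quantity
        from (select b.*,
                     rank() over (partition by id, name
                                  order by recorded_at desc) as rnk
              from table_b b) latest
        where rnk = 1
        group by id
        """
    ).fetchall()


print(latest_values())
```

After the rnk = 1 filter only one row per name remains, so the max(case ...) expressions simply move each surviving value into its own column.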

Try something like
SELECT *
FROM CUSTOMER c
INNER JOIN ORDERS o -- ORDER is a reserved word in Oracle, so the table is named ORDERS here
ON (o.CUSTOMER_ID = c.CUSTOMER_ID)
WHERE TO_DATE(o.ORDER_TIME, 'YYYYMMDD HH24:MI:SS') =
(SELECT MAX(TO_DATE(o2.ORDER_TIME, 'YYYYMMDD HH24:MI:SS')) FROM ORDERS o2)
Share and enjoy.

Unique rows in oracle 11g

I have a query which returns a set of records like the one below:
Date Dept commission
5-Apr Sales 20
4-Apr Sales 21
1-Jan Marketing 35
Case 1: If I run the query between 1 Jan and 5 April I should get
Date Dept commission
5 April Sales 76
Case 2: When I run the query between 1 Jan and 31 Jan I should get the output as
Date Dept commission
1 Jan Marketing 35
Case 2 is simple: when I put in the date range I get the required results. But I am not sure how to handle case 1 to show the max/latest date, the Dept for that date and a sum of the commission for the selected date range. The output will be a single row with the latest date and its department, with sum(commission) for the selected date range.

SELECT
MAX(Date) AS Date
, ( SELECT tt.Dept
FROM tableX tt
WHERE tt.Date = MAX(t.Date)
) AS Dept
, SUM(Commission) AS Commission
FROM
tableX t
WHERE
Date BETWEEN StartDate AND EndDate
The above works in SQL-Server, MySQL, Postgres as the sql-fiddle, test-1 shows, however it does NOT work in Oracle 11g R2 !
This works though (sql-fiddle, test-2):
SELECT
MAX(t.Date) AS Date
, MIN(tt.Dept) AS Dept --- MIN, MAX irrelevant
, MAX(t.Commission) AS Commission --- t has one row; MAX avoids counting it once per joined tt row
FROM
( SELECT
MAX(Date) AS Date
, SUM(Commission) AS Commission
FROM
tableX
WHERE
Date BETWEEN StartDate AND EndDate
) t
JOIN
tableX tt
ON tt.Date = t.Date
The MIN(tt.Dept) takes care of the case where more than one row has the maximum date, say one row with Sales and one with Marketing, both on Apr-5.
This works, too, using the LAST_VALUE analytic function (sql-fiddle, test-3):
SELECT
MAX(Date) AS Date
, MIN(Dept) AS Dept
, SUM(Commission) AS Commission
FROM
( SELECT
Date AS Date
, LAST_VALUE(Dept) OVER( ORDER BY Date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS Dept
, Commission AS Commission
FROM
tableX
WHERE
Date BETWEEN StartDate AND EndDate
) t
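The join-based variant can be checked against the two cases from the question with a minimal sketch. It assumes Python's sqlite3 module rather than Oracle, renames the Date column to day to avoid a reserved word, and invents the year 2013 so the dates can be stored as ISO text. The helper name commission_summary is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tableX (day text, dept text, commission integer)")
# Rows from the question; the year 2013 is an assumption for ISO dates.
conn.executemany(
    "insert into tableX values (?, ?, ?)",
    [
        ("2013-04-05", "Sales", 20),
        ("2013-04-04", "Sales", 21),
        ("2013-01-01", "Marketing", 35),
    ],
)


def commission_summary(start, end):
    """Return (latest day, dept on that day, total commission) for a range."""
    return conn.execute(
        """
        select t.day, min(tt.dept) as dept, t.commission
        from (select max(day) as day, sum(commission) as commission
              from tableX
              where day between ? and ?) t
        join tableX tt on tt.day = t.day
        group by t.day, t.commission
        """,
        (start, end),
    ).fetchone()


print(commission_summary("2013-01-01", "2013-04-05"))
print(commission_summary("2013-01-01", "2013-01-31"))
```

The inner query produces a single row with the latest date and the total, and the join only looks up the department for that date. Grouping by t.day and t.commission keeps the total from being repeated if several departments share the latest date.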
