hive expression not in group by key size - hadoop

My table Schema is (State string,City String,Size int)
Here is my input data
Karnataka,Bangalore,200
Karnataka,Mysore,50
Karnataka,Bellary,100
Karnataka,Mangalore,10
Andhra pradesh,Chittoor, 25
Andhra pradesh,nellore, 15
Andhra pradesh,guntur, 20
Andhra pradesh,tirupathi, 30
Andhra pradesh,vizag, 35
Andhra pradesh,kadapa, 45
I want to retrieve the top 2 cities of each state along with their size. I want the output as below.
(Andhra pradesh,{(35),(30)},{(vizag),(tirupathi)})
(Karnataka,{(200),(100)},{(Bangalore),(Bellary)})
I have written the query as follows, but I am getting the error "expression not in group by key size". Please help me out.
select * from statefile group by state, city order by size limit 2;
Thanks in advance.

You would use row_number():
select sf.*
from (select sf.*,
             row_number() over (partition by state order by size desc) as seqnum
      from statefile sf
     ) sf
where seqnum <= 2;
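If you also want the grouped output format shown in the question (sizes and cities collected per state), here is a minimal sketch building on the query above. It assumes Hive 0.13+ where collect_list is available; note that element order inside collect_list is not guaranteed.
select state,
       collect_list(size) as sizes,   -- sizes of the top 2 cities per state
       collect_list(city) as cities   -- matching city names
from (select sf.*,
             row_number() over (partition by state order by size desc) as seqnum
      from statefile sf
     ) t
where seqnum <= 2
group by state;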

Related

Oracle Select Query on Same Table (self join)

It seems too simple, but I am not getting the desired results.
I have a table with this data:
Team_id Player_id Player_name Game_cd
1 100 abc 24
1 1000 xyz 24
1 588 ert 24
1 500 you 24
2 600 ops 24
2 700 dps 24
2 900 lmv 24
2 200 hmv 24
I have to write a query to get a result like this
Home_team home_plr_id home_player away_team away_plr_id away_player
1 100 abc 2 600 ops
1 1000 xyz 2 900 lmv
The query I wrote:
select f1.Team_id as home_team,
       f1.player_id as home_plr_id,
       f1.player_Name as home_player,
       f2.Team_id as away_team,
       f2.player_id as away_plr_id,
       f2.player_Name as away_player
from game f1, game f2
where f1.team_id <> f2.team_id
  and f1.game_cd = f2.game_cd
An alternative to @Radagast81's self-join is PIVOT, which is available in your Oracle version:
select home_plr_id, home_plr_name, away_plr_id, away_plr_name
from (select game.*,
             row_number() over (partition by team_id order by player_id) rn
      from game)
pivot (max(player_id) plr_id, max(player_name) plr_name
       for team_id in (1 home, 2 away))
SQL Fiddle
Players have to be numbered somehow (here by ID); it could be done by name, NULL or even at random. This numbering is needed only to put them in the same rows. The pivot also works if the number of players differs between teams.
It is not clear how you want to pair a home player with an away player. But provided that you don't care about that, the following might be what you are looking for:
WITH game_p AS (SELECT team_id, player_id, player_name, game_cd
, ROW_NUMBER() over (PARTITION BY team_id, game_cd ORDER BY player_id) pos
, dense_rank() over (PARTITION BY game_cd ORDER BY team_id) team_pos
FROM game)
SELECT NVL(f1.game_cd, f2.game_cd) AS game_cd
, f1.Team_id as home_team
, f1.player_id as home_plr_id
, f1.player_Name as home_player
, f2.Team_id as away_team
, f2.player_id as away_plr_id
, f2.player_Name as away_player
FROM (SELECT * FROM game_p WHERE team_pos = 1) f1
FULL JOIN (SELECT * FROM game_p WHERE team_pos = 2) f2
ON f1.game_cd = f2.game_cd
AND f1.pos = f2.pos
The new column POS gives every player of each team a position so that they can be paired with a player from the other team.
The new column TEAM_POS is to get the team_id mapped to the values 1 and 2, as the team_id's can differ per game.
Finally, do a FULL JOIN to get the final list. If the number of players is always the same for both teams, you can do a normal join instead...

Consolidate rows

I'm trying to cut down on the number of rows a report has. There are 2 assets returned by this query, but I want them to show up on one row.
Basically, if dc.name LIKE '%CT/PT%' then I want it on the same row as the asset. SP.SVC_PT_ID is the common field to join them on.
There will be times when there is no dc.name LIKE '%CT/PT%'; however, I still want DV.MFG_SERIAL_NUM to be populated, just with a NULL to the right.
Select SP.SVC_PT_ID, SP.DEVICE_ID, DV.MFG_SERIAL_NUM, dc.name,
substr(dc.name,26)
From EIP.SVC_PT_DEVICE_REL SP,
eip.device_class dc,
EIP.DEVICE DV
Where SP.EFF_START_TIME < To_date('20170930', 'YYYYMMDD') + 1
and SP.EFF_END_TIME is null
and dc.id = DV.device_class_id
and DV.ID = SP.device_id
ORDER BY SP.SVC_PT_ID, DV.MFG_SERIAL_NUM;
I'm not sure I understand what you are saying; a test case would certainly help. You said that the query you posted returns two rows (if only we could see which ones ...) but you want them to be displayed as in the image you attached to the message.
Generally speaking, you can do that using an aggregate function (such as MAX) on certain column(s), along with the GROUP BY clause that contains the rest of them.
Just for example:
select svc_pt_id, max(ctpt_name) ctpt_name, sum(ctpt_multipler) ctpt_multipler
from ...
group by svc_pt_id
As I said: a test case would help people who'd want to answer the question. True - someone might have understood it far better than I did and will provide assistance nevertheless.
EDIT: after you posted sample data (which, by the way, doesn't match the screenshot you posted previously), maybe something like this might do the job: use an analytic function to check whether the name contains CT/PT; if so, take its data. Otherwise, display both rows.
SQL> with test as (
2 select 14 svc_pt_id, 446733 device_id, 'Generic Electric' name from dual union
3 select 14, 456517, 'Generic CT/PT, Multiplier' from dual
4 ),
5 podaci as
6 (select svc_pt_id, device_id, name,
7 rank() over (partition by svc_pt_id
8 order by case when instr(name, 'CT/PT') > 1 then 1
9 else 2
10 end) rnk
11 from test
12 )
13 select svc_pt_id, device_id, name
14 from podaci
15 where rnk = 1;
SVC_PT_ID DEVICE_ID NAME
---------- ---------- -------------------------
14 456517 Generic CT/PT, Multiplier
SQL>
My TEST table (created by the WITH factoring clause) represents what would be the result of your current query.

Simulate pipelined order by in oracle 11g

I have been working with an application that is integrated with Spring and Hibernate 4.x and whose transactions are managed by JTA in a WebLogic application server. After 3 years, there are about 40 million records in just one of the 100 tables in my DB. The DB is Oracle 11g. The response time of one query is about 5 minutes because of the growing number of records in this table.
I extracted the query, put it into SQL Developer and ran the query advisor to get index suggestions. After doing all this, the response time was reduced to 2 minutes. But even so, this response time does not satisfy the customer. To clarify, here is the query:
select *
from (select (count(storehouse0_.ID) over()) as col_0_0_,
storehouse3_.storeHouse_ID as col_1_0_,
(DBPK_PUB_STOREHOUSE.get_Storehouse_Title(storehouse5_.id, 1)) as col_2_0_,
storehouse5_.Organization_Code as col_3_0_,
publicgood1_.Goods_Item_Id as col_4_0_,
storehouse0_.storeHouse_Inventory_Id as col_5_0_,
storehouse0_.Id as col_6_0_,
storehouse3_.samapel_Item_Id as col_7_0_,
samapelite10_.MAINNAME as col_8_0_,
publicgood1_.serial_Number as col_9_0_,
publicgood1_1_.production_Year as col_10_0_,
samapelpar2_.ID_SourceInfo as col_11_0_,
samapelpar2_.Pn as col_12_0_,
storehouse3_.expire_Date as col_13_0_,
publicgood1_1_.Status_Id as col_14_0_,
baseinform12_.Topic as col_15_0_,
publicgood1_.public_Num as col_16_0_,
cast(publicgood1_1_.goods_Status as number(10, 0)) as col_17_0_,
publicgood1_1_.goods_Status as col_18_0_,
publicgood1_1_.deleted as col_19_0_
from amd.Core_StoreHouse_Inventory_Item storehouse0_,
amd.Core_STOREHOUSE_INVENTORY storehouse3_,
amd.Core_STOREHOUSE storehouse5_,
amd.SMP_SAMAPEL_CODE samapelite10_
cross join amd.Core_Goods_Item_Public publicgood1_
inner join amd.Core_Goods_Item publicgood1_1_
on publicgood1_.Goods_Item_Id = publicgood1_1_.Id
left outer join amd.SMP_SOURCEINFO samapelpar2_
on publicgood1_1_.Samapel_Part_Number_Id =
samapelpar2_.ID_SourceInfo, amd.App_BaseInformation
baseinform12_
where not exists
(select ssec.samapelITem_id
from core_security_samapelitem ssec
inner join core_goods_item g
on ssec.samapelitem_id = g.samapel_item_id
where not exists (SELECT aa.groupid
FROM app_actiongroup aa
where aa.groupid in
(select au.groupid
from app_usergroup au
where au.userid = 1)
and aa.actionid = 9054)
and ssec.isenable = 1
and storehouse0_.goods_Item_ID = g.id)
and not exists
(select *
from CORE_POWER_SECURITY cps
where not exists (SELECT aa.groupid
FROM app_actiongroup aa
where aa.groupid in
(select au.groupid
from app_usergroup au
where au.userid = 1)
and aa.actionid = 9055)
and cps.inventory_id =
storehouse0_.storeHouse_Inventory_Id
and cps.goodsitemtype = 6)
and storehouse0_.storeHouse_Inventory_Id = storehouse3_.Id
and storehouse3_.storeHouse_ID = storehouse5_.Id
and storehouse3_.samapel_Item_Id = samapelite10_.MAINCODE
and publicgood1_1_.Status_Id = baseinform12_.ID
and 1 <> 2
and storehouse0_.goods_Item_ID = publicgood1_.Goods_Item_Id
and publicgood1_1_.edited = 0
and publicgood1_1_.deleted = 0
and (exists (select storehouse13_.Id
from amd.Core_STOREHOUSE storehouse13_
cross join amd.core_power power16_
cross join amd.core_power power17_
where storehouse5_.powerID = power16_.Id
and storehouse13_.powerID = power17_.Id
and (storehouse13_.Id in (741684217))
and storehouse13_.storeHouseType = 2
and (power16_.hierarchiCode like
power17_.hierarchiCode || '%')) or
(storehouse3_.storeHouse_ID in (741684217)) and
storehouse5_.storeHouseType = 1)
and (storehouse5_.storeHouse_Status not in (2, 3))
order by storehouse3_.samapel_Item_Id)
where rownum <= 10
[Note: This query is generated by Hibernate].
It is clear that ordering 40 million rows takes a lot of time.
I found the main issue of this query: when I omitted the “order by” and ran the query, its response time was reduced to about 5 seconds. I wondered why the “order by” affects the response time so much.
(Somebody may think that if this table were partitioned, or some other Oracle facility were used, it might respond faster. That may be true, but my emphasis here is on the “order by” performance.) Anyway, I cannot simply omit the “order by”, because the customer needs the ordering and it is necessary for paging. It is clear that when Oracle has to sort 40 million records, it naturally takes a long time, so I found a solution in which I order only the records that are actually needed: I effectively replace the global “order by” with a “where” clause that restricts the rows first. With this change the response time was reduced from 2 minutes to about 5 seconds, which is very exciting for me.
I explain my solution via an example below; please tell me whether this approach is sound, or whether there is another solution that I do not know about.
I explain my solution:
Let’s assume that there are two tables as below:
Post table
Id Other fields
1
2
3
4
5
… …
Post_comment table
Id post_id
1 5
2 5
3 5
4 5
6 5
7 2
8 2
9 2
10 3
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 1
20 1
21 1
22 1
23 1
24 1
25 1
26 4
27 4
There is a form that shows the result of a join between the POST table and the POST_COMMENT table.
I show both queries: one that does “order by” over all records of the table, and one that does “order by” only over the specific records that are needed. The results of the two queries are exactly the same, but the response time of the second approach is much better.
Assume that the page size is 10 and you are on page 3.
The first query, with “order by” over all records of the table:
select *
from (Select res.*, rownum as rownum_
from (Select * from POST_COMMENT Order by post_id asc) res
Where rownum <= 30)
where rownum_ > 20
The second solution:
Before executing the main query, I run the query below:
select *
from (select post_id, count(id) from POST_COMMENT group by post_id)
order by post_id asc
So its result is the following:
Post_id  Count(id)  Sum(count(id))
1        15         15
2        3          18
3        1          19
4        2          21
5        5          26
Note that the third column, Sum(count(id)), is calculated after that query; each entry is the running sum of all preceding counts.
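For reference, the running total could also be computed directly in the query with an analytic SUM over the aggregate; this is only a sketch, using the column names from the example above:
select post_id,
       count(id) as cnt,
       sum(count(id)) over (order by post_id) as running_total  -- cumulative sum of the per-post counts
from POST_COMMENT
group by post_id
order by post_id;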
So there is a formula that specifies which post_id values must be selected. The formula is below:
pageSize = 10, pageNumber = 3
from: (pageNumber - 1) * pageSize             => 2 * 10 = 20
to:   (pageNumber - 1) * pageSize + pageSize  => 20 + 10 = 30
So I need the posts whose Sum(count(id)) falls in the range (20, 30]. According to this, I need only the two post_id values 4 and 5, and the main query of the second approach becomes:
select *
from (select rownum as rownum_, res.*
from (select *
from (select * from POST_COMMENT where post_id in (4, 5))
order by post_id asc) res
where rownum <= 30)
where rownum_ > 20
If you look at both queries, you will see the biggest difference: the second query selects only the records of POST_COMMENT whose post_id is 4 or 5, and then orders those records rather than all records of the table.
After posting this, I kept searching and was finally redirected to HERE. I reached a response time that is very exciting for me: it was reduced from 3 minutes to less than 3 seconds. It is worth noting that I used only one tip from all of the query optimization guidelines on that site, namely “Duplicate constant condition for different tables whenever possible”.
Note: before applying this tip, there were already some indexes on the fields used in the where clause and order by.
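To illustrate that tip, here is a generic sketch (the table and column names are hypothetical; in the query above, the analogous pair is storehouse3_.storeHouse_ID = storehouse5_.Id with the constant 741684217, and whether the duplication is safe there depends on the surrounding OR branch):
select *
from   t1, t2
where  t1.key = t2.key
  and  t1.key = 741684217   -- original constant condition
  and  t2.key = 741684217;  -- duplicated onto the joined table so the optimizer can filter both tables early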

11g Oracle aggregate SQL query

Can you please help me with a query for this scenario? In the case below it should return a single row with A = 13, because 13 and 14 in column A have the most occurrences and the value of B (30) is greater for 13. We are interested in the maximum number of occurrences of A, and in case of a tie B should be used as the tie breaker.
A B
13 30
13 12
14 10
14 25
15 5
In the case below, where there is a single occurrence of each A (all tied), it should return 14, which has the maximum value of 40 for B.
A B
13 30
14 40
15 5
Use case: we get calls from corporate customers. We are interested in knowing during which hours of the day most calls come in, and in case of a tie, which of the busiest hours has the longest call.
Further question
There is a further question on this: I want to use either of the two solutions ('11g or lower' from @GurV or 'dense_rank' from @mathguy) in the bigger query below. How can I do it?
SELECT dv.id , u.email , dv.email_subject AS headline , dv.start_date , dv.closing_date, b.name AS business_name, ls.call_cost, dv.currency,
SUM(lsc.duration) AS duration, COUNT(lsc.id) AS call_count, ROUND(AVG(lsc.duration), 2) AS avg_duration
-- max(extract(HOUR from started )) keep (dense_rank last order by count(duration), max(duration)) as most_popular_hour
FROM deal_voucher dv
JOIN lead_source ls ON dv.id = ls.deal_id
JOIN lead_source_call lsc ON ls.PHONE_SID = lsc.phone_number_id
JOIN business b ON dv.business_id = b.id
JOIN users u ON b.id = u.business_id
AND TRUNC(dv.closing_date) = to_date('13-01-2017', 'dd-mm-yyyy')
AND lsc.status = 'completed' and lsc.duration >= 30
GROUP BY dv.id , u.email , dv.email_subject , dv.start_date , dv.closing_date, b.name, ls.call_cost, dv.currency
--, extract(HOUR from started )
Try this if 12c+
select a
from t
group by a
order by count(*) desc, max(b) desc
fetch first 1 row only;
If 11g or lower:
select * from (
select a
from t
group by a
order by count(*) desc, max(b) desc
) where rownum = 1;
Note that if there is equal count and equal max value for two or more values of A, then any one of them will be fetched.
Here is a query that will work in older versions (no fetch clause) and does not require a subquery. It uses the first/last function. In case of ties by both "count by A" and "value of max(B)" it selects only the row with the largest value of A. You can change that to min(A), or even to sum(A) (although that probably doesn't make sense in your problem) or LISTAGG(A, ',') WITHIN GROUP (ORDER BY A) to get a comma-delimited list of the A's that are tied for first place, but that requires 11.2 (I believe).
select max(a) keep (dense_rank last order by count(b), max(b)) as a
, max(max(b)) keep (dense_rank last order by count(b)) as b
from inputs
group by a
;
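Regarding the follow-up question, one way to reuse the keep (dense_rank last) pattern inside the bigger query is to pre-aggregate calls per deal and hour, then pick the busiest hour per deal. This is only a sketch: it assumes lsc.started is a timestamp column (as the commented-out line in the question suggests), and the result would still need to be joined back to the main query on deal id.
select deal_id,
       -- hour with the highest call count, ties broken by the longest call in that hour
       max(call_hour) keep (dense_rank last order by call_count, max_duration) as most_popular_hour
from (select dv.id as deal_id,
             extract(hour from lsc.started) as call_hour,
             count(*) as call_count,
             max(lsc.duration) as max_duration
      from deal_voucher dv
      join lead_source ls on dv.id = ls.deal_id
      join lead_source_call lsc on ls.phone_sid = lsc.phone_number_id
      where lsc.status = 'completed'
        and lsc.duration >= 30
      group by dv.id, extract(hour from lsc.started))
group by deal_id;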

recursive cte working very slow

I want to group the rows based on certain columns, i.e. if the data in these columns is the same in consecutive rows, assign them the same group number, and if it changes, assign a new one. This becomes complex because the same data in these columns could appear again in later rows; those rows have to be given another group number since they are not contiguous with the previous group.
I used a CTE for this purpose, and it gives the correct output, but it is so slow that iterating over 75k+ rows takes about 15 minutes. The code I used is:
WITH
cte AS (SELECT ROW_NUMBER () OVER (ORDER BY Patient_ID, Opnamenummer, SPECIALISMEN, Opnametype, OntslagDatumTijd) AS RowNumber,
Opnamenummer, Patient_ID, AfdelingsCode, Opnamedatum, Opnamedatumtijd, Ontslagdatum, Ontslagdatumtijd, IsSpoedopname, OpnameType, IsNuOpgenomen, SpecialismeCode, Specialismen
FROM t_opnames)
SELECT * INTO #ttt FROM cte;
WITH cte2 AS (SELECT TOP 1 RowNumber,
1 AS GroupNumber,
Opnamenummer, Patient_ID, AfdelingsCode, Opnamedatum, Opnamedatumtijd, Ontslagdatum, Ontslagdatumtijd, IsSpoedopname, OpnameType, IsNuOpgenomen, SpecialismeCode, Specialismen
FROM #ttt
ORDER BY RowNumber
UNION ALL
SELECT c1.RowNumber,
CASE
WHEN c2.Afdelingscode <> c1.Afdelingscode
OR c2.Patient_ID <> c1.Patient_ID
OR c2.Opnametype <> c1.Opnametype
THEN c2.GroupNumber + 1
ELSE c2.GroupNumber
END AS GroupNumber,
c1.Opnamenummer,c1.Patient_ID,c1.AfdelingsCode,c1.Opnamedatum,c1.Opnamedatumtijd,c1.Ontslagdatum,c1.Ontslagdatumtijd,c1.IsSpoedopname,c1.OpnameType,c1.IsNuOpgenomen, SpecialismeCode, Specialismen
FROM cte2 c2
JOIN #ttt c1 ON c1.RowNumber = c2.RowNumber + 1
)
SELECT *
FROM cte2
OPTION (MAXRECURSION 0) ;
DROP TABLE #ttt
I tried to improve performance by putting the output of the first CTE into a temp table. That improved the performance, but it is still too slow. So, how can I improve this code so that it runs in under 10 seconds for 75k+ records? The output before cancelling the query is: Screenshot. As visible in the image, the data in the columns Afdelingscode, Patient_ID and Opnametype is the same for RowNumber 3, 5 and 6, but they have different GroupNumbers because the rows are not consecutive.
Without data it's not that easy to test, but I would first try not using the temporary table and just chaining both CTEs from start to end, i.e.:
;WITH
cte AS (...),
cte2 AS (...)
select * from cte2
OPTION (MAXRECURSION 0);
Without knowing the indexes etc.: for instance, you do a lot of ordering in the first CTE. Is this supported by indexes (or one multicolumn index) or not?
Without the data I don't have the option to play with it, but looking at this:
CASE
WHEN c2.Afdelingscode <> c1.Afdelingscode
OR c2.Patient_ID <> c1.Patient_ID
OR c2.Opnametype <> c1.Opnametype
THEN c2.GroupNumber + 1
ELSE c2.GroupNumber
I would take a look at the PARTITION BY clause of ROW_NUMBER.
So try to run this:
WITH
cte AS (
SELECT ROW_NUMBER () OVER (PARTITION BY Afdelingscode , Patient_ID ,Opnametype ORDER BY Patient_ID, Opnamenummer, SPECIALISMEN, Opnametype, OntslagDatumTijd ) AS RowNumber,
Opnamenummer, Patient_ID, AfdelingsCode, Opnamedatum, Opnamedatumtijd, Ontslagdatum, Ontslagdatumtijd, IsSpoedopname, OpnameType, IsNuOpgenomen
FROM t_opnames)
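For completeness, here is a sketch of where that hint usually leads: the classic gaps-and-islands pattern, where the difference between the overall row number and the per-combination row number is constant within each contiguous run and can serve as a set-based group key, with no recursion needed. This is only a sketch of the technique; the ordering and grouping columns are taken from the question and may need adjusting.
WITH numbered AS (
    SELECT ROW_NUMBER() OVER (ORDER BY Patient_ID, Opnamenummer, SPECIALISMEN, Opnametype, OntslagDatumTijd) AS RowNumber,
           ROW_NUMBER() OVER (PARTITION BY AfdelingsCode, Patient_ID, Opnametype
                              ORDER BY Patient_ID, Opnamenummer, SPECIALISMEN, Opnametype, OntslagDatumTijd) AS PartRowNumber,
           Opnamenummer, Patient_ID, AfdelingsCode, Opnamedatum, Opnamedatumtijd,
           Ontslagdatum, Ontslagdatumtijd, IsSpoedopname, OpnameType, IsNuOpgenomen,
           SpecialismeCode, Specialismen
    FROM t_opnames
),
islands AS (
    SELECT *,
           -- RowNumber - PartRowNumber is constant within each contiguous run of identical
           -- (AfdelingsCode, Patient_ID, Opnametype); the earliest RowNumber marks the run's start
           MIN(RowNumber) OVER (PARTITION BY AfdelingsCode, Patient_ID, Opnametype,
                                             RowNumber - PartRowNumber) AS IslandStart
    FROM numbered
)
SELECT *,
       DENSE_RANK() OVER (ORDER BY IslandStart) AS GroupNumber
FROM islands
ORDER BY RowNumber;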
