Oracle select two (or more) adjacent rows having the same value for a given column

How do I do the following in Oracle:
I have a (simplified) table:
+-----+-----+-----+
| a | b | ... |
+-----+-----+-----+
| 1 | 7 | ... |
| 2 | 5 | ... |
| 1 | 7 | ... |
+-----+-----+-----+
Here a identifies a person, and b is the field I am interested in matching across rows. How do I construct a query that basically says "give me the person IDs where the person has the same b value in more than one row (i.e., duplicates)"?
So far I have tried:
SELECT a FROM mytable GROUP BY a HAVING COUNT(DISTINCT b) > 1;
This feels close except it just gives me the user IDs where the user has multiple unique b's, which I suspect is coming from the DISTINCT part, but I'm not sure how to change the query to achieve what I want.

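Try
SELECT DISTINCT a FROM mytable GROUP BY a, b HAVING COUNT(b) > 1;
For a person whose b values are 7, 5, 7, yours would count 2 (one 7, one 5). This one counts the rows in each (a, b) group, so for that person you get 1,7 -> 2 and 1,5 -> 1, and only the 1,7 group passes the HAVING filter.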

SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE mytable ( a, b ) AS
SELECT LEVEL, LEVEL FROM DUAL CONNECT BY LEVEL <= 2000
UNION ALL
SELECT LEVEL *2, LEVEL * 2 FROM DUAL CONNECT BY LEVEL <= 1000;
Query 1:
WITH data AS (
SELECT a
FROM mytable
GROUP BY a
HAVING COUNT(b) > COUNT( DISTINCT b )
ORDER BY a
),
numbered AS (
SELECT a,
ROWNUM AS rn
FROM data
)
SELECT a
FROM numbered
WHERE rn <= 20
Results:
| A |
|----|
| 2 |
| 4 |
| 6 |
| 8 |
| 10 |
| 12 |
| 14 |
| 16 |
| 18 |
| 20 |
| 22 |
| 24 |
| 26 |
| 28 |
| 30 |
| 32 |
| 34 |
| 36 |
| 38 |
| 40 |
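The difference between the original attempt and the duplicate check can be tried on a handful of rows. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Oracle (an assumption made only so the example runs anywhere); person 3 is a hypothetical extra row set added to show a person with two different b values and no repeats.

```python
import sqlite3

# SQLite stands in for Oracle here (an assumption made so the sketch is runnable).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO mytable (a, b) VALUES (?, ?)",
    [
        (1, 7), (2, 5), (1, 7),  # the rows from the question
        (3, 7), (3, 5),          # hypothetical person: two different b values, no repeat
    ],
)

# Original attempt: people with more than one *distinct* b value.
distinct_query = """
    SELECT a FROM mytable
    GROUP BY a
    HAVING COUNT(DISTINCT b) > 1
    ORDER BY a
"""

# Duplicate check: more rows than distinct values means some b value repeats.
duplicate_query = """
    SELECT a FROM mytable
    GROUP BY a
    HAVING COUNT(b) > COUNT(DISTINCT b)
    ORDER BY a
"""

people_with_distinct_values = [row[0] for row in conn.execute(distinct_query)]
people_with_duplicates = [row[0] for row in conn.execute(duplicate_query)]

print("more than one distinct b:", people_with_distinct_values)
print("repeated b value:", people_with_duplicates)
```

On this data the first query picks out person 3 only, while the second picks out person 1 only, which is the behaviour the question asks for.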

Related

Performance drop returning cursor with union all

I'm facing a performance drop I can't explain when using UNION ALL with two sub-queries in one cursor (at least I think that's the problem). PL/SQL Developer just freezes when opening the cursor results in the test window.
If I disable either sub-query, everything works fine.
If I take the whole query out of the cursor and run it in a regular SQL Query window, everything is fine without disabling any part.
The procedure structure is below; any help is appreciated:
procedure p_proc(p_param varchar2,
outcur out sys_refcursor) is
begin
open outcur for
select *
from (select -- visible cols
si.item_full_name
, si.final_price
, si.full_price
, si.receipt_num
, si.receipt_date
, si.vendor_code
, case when det.br_summary is null and mr.motiv_rate_value is not null then mr.motiv_rate_value
when det.br_summary is not null then det.br_summary
end personal_bonus_amount
, case when det.br_summary is null and mr.motiv_rate_value is not null then 1
when det.br_summary is not null then det.cross_sale_kt
end personal_bonus_koeff
-- service cols
, case when det.br_summary is null and mr.motiv_rate_value is not null then 'approximate'
when det.br_summary is not null then 'definite'
end personal_bonus_type
, coalesce(det.sale_stream, mr.sale_stream, 'Not defined') item_group_name
, si.operation_type
, si.src
-- pagination
, row_number() over (order by si.receipt_date desc) rn
from (-- curr day
select b.cost final_price
, case when b.discount = 0 then null else b.price
end full_price
, b.doc_number receipt_num
, b.receipt_date receipt_date
, i.item_code vendor_code
, i.full_name item_full_name
, b.subsite code_op
, b.operator_id
, to_char(b.businessday, 'yyyymm') sale_period
, b.oper_type operation_type
, 'bill' src
from scheme.bills b
join scheme.items i on i.item_code = b.item
where b.businessday = trunc(p_date_to)
and b.subsite = p_office_id
and b.operator_id = p_emp_id
union all
-- prev days
select l.txn_amount final_price
, case when l.disc = 0 then null else l.price
end full_price
, t.receipt_num receipt_num
, t.ts receipt_date
, i.item_code vendor_code
, i.full_name item_full_name
, s.office_code code_op
, e.emp_code operator_id
, to_char(l.dt,'yyyymm') sale_period
, l.txn_type operation_type
, 'txn' src
from scheme.txn t
join scheme.txn_lines l on t.rtl_txn_id = l.rtl_txn_id
join scheme.items i on l.item_id = i.item_id
join scheme.offices s on t.subsite_id = s.subsite_id
join scheme.employees e on t.employee_id = e.employee_id
where t.ts between trunc(p_date_from) and trunc(p_date_to)
and t.subsite_id = v_op_id
and t.employee_id = v_emp_id
) si
/* fact */
left join scheme.sales_details det on si.sale_period = det.period
and si.code_op = det.op_code
and ltrim(si.operator_id,'0') = ltrim(det.tab_num,'0')
and si.receipt_num = det.rcpt_num
and si.vendor_code = det.item_article
/* prognosis */
left join scheme.rates mr on si.sale_period = mr.motiv_rate_period
and si.code_op = mr.code_op
and si.vendor_code = mr.code_1c
where 1 = 1
and si.final_price between nvl(p_price_from, si.final_price) and nvl(p_price_to, si.final_price)
/* if no filters */
and (item_group_cnt = 0 or coalesce(det.sale_stream, mr.sale_stream, 'Not defined') in (select * from table(p_item_group)))
and si.receipt_num = nvl(p_receipt_num, si.receipt_num)
)
where rn between p_page_num * p_page_size + 1 and (p_page_num + 1) * p_page_size;
end;
UPD Explain plan for the whole query used in a cursor:
----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 10 | 32810 | 62 | 00:00:01 |
| * 1 | VIEW | | 10 | 32810 | 62 | 00:00:01 |
| * 2 | WINDOW SORT PUSHED RANK | | 2 | 2956 | 62 | 00:00:01 |
| 3 | NESTED LOOPS OUTER | | 2 | 2956 | 61 | 00:00:01 |
| 4 | NESTED LOOPS OUTER | | 2 | 2826 | 53 | 00:00:01 |
| 5 | VIEW | | 2 | 2728 | 46 | 00:00:01 |
| 6 | UNION-ALL | | | | | |
| 7 | NESTED LOOPS | | 1 | 138 | 32 | 00:00:01 |
| 8 | NESTED LOOPS | | 1 | 138 | 32 | 00:00:01 |
| 9 | PARTITION RANGE SINGLE | | 1 | 66 | 29 | 00:00:01 |
| * 10 | TABLE ACCESS BY LOCAL INDEX ROWID BATCHED | F003_BILL | 1 | 66 | 29 | 00:00:01 |
| * 11 | INDEX RANGE SCAN | IX_SUBSITE_DOCNUM_BUSINDAY_SEQ | 1 | | 5 | 00:00:01 |
| * 12 | INDEX RANGE SCAN | IX_D001_CODE_1C_ITEM_ID | 1 | | 2 | 00:00:01 |
| 13 | TABLE ACCESS BY INDEX ROWID | D001_ITEM | 1 | 72 | 3 | 00:00:01 |
| 14 | NESTED LOOPS | | 1 | 183 | 14 | 00:00:01 |
| 15 | NESTED LOOPS | | 1 | 183 | 14 | 00:00:01 |
| 16 | NESTED LOOPS | | 1 | 104 | 12 | 00:00:01 |
| 17 | NESTED LOOPS | | 1 | 70 | 7 | 00:00:01 |
| 18 | NESTED LOOPS | | 1 | 30 | 4 | 00:00:01 |
| 19 | TABLE ACCESS BY INDEX ROWID | D005_EMPLOYEE | 1 | 18 | 3 | 00:00:01 |
| * 20 | INDEX UNIQUE SCAN | PK_D005 | 1 | | 2 | 00:00:01 |
| 21 | TABLE ACCESS BY INDEX ROWID | D018_SUBSITE | 1 | 12 | 1 | 00:00:01 |
| * 22 | INDEX UNIQUE SCAN | PK_D018 | 1 | | 0 | 00:00:01 |
| 23 | PARTITION RANGE ITERATOR | | 1 | 40 | 3 | 00:00:01 |
| 24 | PARTITION HASH SINGLE | | 1 | 40 | 3 | 00:00:01 |
| * 25 | TABLE ACCESS FULL | F007_RTL_TXN | 1 | 40 | 3 | 00:00:01 |
| * 26 | TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED | F008_RTL_TXN_LI | 1 | 34 | 5 | 00:00:01 |
| * 27 | INDEX RANGE SCAN | IX_F008_RTL_TXN_ID | 7 | | 3 | 00:00:01 |
| * 28 | INDEX UNIQUE SCAN | PK_D001 | 1 | | 1 | 00:00:01 |
| 29 | TABLE ACCESS BY INDEX ROWID | D001_ITEM | 1 | 79 | 2 | 00:00:01 |
| * 30 | TABLE ACCESS BY INDEX ROWID BATCHED | T_OP_MOTIVATION_RATE_MYRTK | 1 | 49 | 7 | 00:00:01 |
| * 31 | INDEX RANGE SCAN | IDX02_CODE_OP_1C | 3 | | 3 | 00:00:01 |
| * 32 | TABLE ACCESS BY INDEX ROWID BATCHED | DET_SALES_PPT_DWH | 1 | 65 | 4 | 00:00:01 |
| * 33 | INDEX RANGE SCAN | IDX_03_RCPT_NUM | 3 | | 2 | 00:00:01 |
----------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
------------------------------------------
* 1 - filter("RN">=1 AND "RN"<=10)
* 2 - filter(ROW_NUMBER() OVER ( ORDER BY INTERNAL_FUNCTION("SI"."RECEIPT_DATE") DESC )<=10)
* 10 - filter("F003"."OPERATOR_ID"='000189513' AND "F003"."COST">=TO_NUMBER(TO_CHAR("F003"."COST")) AND "F003"."COST"<=TO_NUMBER(TO_CHAR("F003"."COST")))
* 11 - access("F003"."SUBSITE"='S165' AND "F003"."BUSINESSDAY"=TO_DATE(' 2021-11-23 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
* 11 - filter("F003"."BUSINESSDAY"=TO_DATE(' 2021-11-23 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "F003"."DOC_NUMBER" IS NOT NULL)
* 12 - access("I"."D001_CODE_1C"="F003"."ITEM")
* 12 - filter("I"."D001_CODE_1C" IS NOT NULL)
* 20 - access("E"."EMPLOYEE_ID"=3561503543)
* 22 - access("S"."SUBSITE_ID"=29260)
* 25 - filter("T"."EMPLOYEE_ID"=3561503543 AND "T"."SUBSITE_ID"=29260 AND "T"."F007_TS"<=TO_DATE(' 2021-11-23 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "T"."F007_RCPT_NUM_1C" IS NOT NULL)
* 26 - filter("L"."F008_AMOUNT">=TO_NUMBER(TO_CHAR("L"."F008_AMOUNT")) AND "L"."F008_AMOUNT"<=TO_NUMBER(TO_CHAR("L"."F008_AMOUNT")))
* 27 - access("T"."RTL_TXN_ID"="L"."RTL_TXN_ID")
* 28 - access("L"."ITEM_ID"="I"."ITEM_ID")
* 30 - filter("SI"."SALE_PERIOD"="MR"."MOTIV_RATE_PERIOD"(+))
* 31 - access("SI"."CODE_OP"="MR"."CODE_OP"(+) AND "SI"."VENDOR_CODE"="MR"."CODE_1C"(+))
* 32 - filter("SI"."CODE_OP"="DET"."OP_CODE"(+) AND "SI"."VENDOR_CODE"="DET"."ITEM_ARTICLE"(+) AND "DET"."ITEM_ARTICLE"(+) IS NOT NULL AND "DET"."PERIOD"(+)=TO_NUMBER("SI"."SALE_PERIOD") AND
LTRIM("SI"."OPERATOR_ID",'0')=LTRIM("DET"."TAB_NUM_RTK"(+),'0'))
* 33 - access("SI"."RECEIPT_NUM"="DET"."RCPT_NUM"(+))
* 33 - filter("DET"."RCPT_NUM"(+) IS NOT NULL)
Actual solution
I managed to get the procedure's execution plan from the DBA. The problem was that the optimizer chose a different index for joining the scheme.sales_details table when executing the query inside the procedure. I added an INDEX hint naming the index used by the regular query, and everything works fine.
Deprecated ideas down below
As far as I understand, the problem is the Oracle optimizer, which "thought" that doing the UNION ALL first was better than pushing the predicate into the sub-queries. Splitting the union into two separate queries makes it push the predicate without hesitation.
This can probably be fixed with hints; that is a work in progress for now.
A temporary workaround is to restructure the query, going from this structure
select *
from (select row_number() rn
, u.*
from (select *
from first_query
union all
select *
from second_query) u
-- some joins
join first_table ft
join second_table st
-- predicate block
where 1=1
and a = b
)
where rn between c and d;
to this
select *
from (select row_number() rn
, u.*
from (select *
from first_query) u
-- some joins
join first_table ft
join second_table st
-- predicate block
where 1=1
and a = b
union all
select row_number() rn
, u.*
from (select *
from second_query) u
-- some joins
join first_table ft
join second_table st
-- predicate block
where 1=1
and a = b
)
where rn between c and d;
It's not a perfect solution because it duplicates the JOIN section, but at least it works.

SAS Hive SQL (Hadoop) version of Proc Transpose?

I was wondering if there is a version of 'Proc Transpose' in SAS Hive SQL (Hadoop) ?
Otherwise I can see the only other (long winded) way is creating a lot of separate tables to then join back together, which I'd rather avoid.
Any assistance most welcome!
Sample table to Transpose > Intention to put Month along the top of the table so the rates are split by month:
+------+-------+----------+----------+-------+
| YEAR | MONTH | Geog | Category | Rates |
+------+-------+----------+----------+-------+
| 2018 | 1 | National | X | 32 |
| 2018 | 1 | National | Y | 43 |
| 2018 | 1 | National | Z | 47 |
| 2018 | 1 | Regional | X | 52 |
| 2018 | 1 | Regional | Y | 38 |
| 2018 | 1 | Regional | Z | 65 |
| 2018 | 2 | National | X | 63 |
| 2018 | 2 | National | Y | 14 |
| 2018 | 2 | National | Z | 34 |
| 2018 | 2 | Regional | X | 90 |
| 2018 | 2 | Regional | Y | 71 |
| 2018 | 2 | Regional | Z | 69 |
+------+-------+----------+----------+-------+
Sample output:
+------+----------+----------+----+----+
| YEAR | Geog | Category | 1 | 2 |
+------+----------+----------+----+----+
| 2018 | National | X | 32 | 63 |
| 2018 | National | Y | 43 | 14 |
| 2018 | National | Z | 47 | 34 |
| 2018 | Regional | X | 52 | 90 |
| 2018 | Regional | Y | 38 | 71 |
| 2018 | Regional | Z | 65 | 69 |
+------+----------+----------+----+----+
The typical "wallpaper" SQL technique for transposing (or pivoting) uses a sub-query of CASE expressions, one per pivot column, nested inside a grouping query whose aggregate function collapses the sub-query rows. Each group becomes a single pivot row in the result.
For example, your group is year, geog, category, and min is used to collapse:
proc sql;
create view want_pivot as
select year, geog, category
, min(rate_m1) as rate_m1
, min(rate_m2) as rate_m2
from
( select
year, geog, category
, case when month=1 then rates end as rate_m1
, case when month=2 then rates end as rate_m2
from have
)
group by year, geog, category
;
Here is the same concept in a slightly more general form, where data is repeated within the group at the detail level and mean is used to collapse over the repeats.
data have;
input id name $ value;
datalines;
1 a 1
1 a 2
1 a 3
1 b 2
1 c 3
2 a 2
2 d 4
2 b 5
3 e 1
run;
proc sql;
create view have_pivot as
select
id
, mean(a) as a
, mean(b) as b
, mean(c) as c
, mean(d) as d
, mean(e) as e
from
(
select
id
, case when name='a' then value end as a
, case when name='b' then value end as b
, case when name='c' then value end as c
, case when name='d' then value end as d
, case when name='e' then value end as e
from have
)
group by id
;
quit;
When the column names are not known a priori, you will need to write a code generator that makes one pass over the data to determine the name values and then writes the wallpaper query, which makes a second pass over the data and returns the pivot.
Also, many contemporary databases have a PIVOT clause that can be leveraged via pass-through.
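The two-pass code generator described above can be sketched in a few lines. This is only an illustration under stated assumptions: Python's sqlite3 module stands in for the real database, the table is named have as in the SAS example, and the generated column names rate_m1, rate_m2, ... follow the naming used in the first query. The month values are converted with int() before being spliced into the SQL text, so only numbers ever reach the generated statement.

```python
import sqlite3

# SQLite stands in for the real database (an assumption made so the sketch is runnable).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE have (year INTEGER, month INTEGER, geog TEXT, category TEXT, rates INTEGER)"
)
sample_rows = [
    (2018, 1, "National", "X", 32), (2018, 1, "National", "Y", 43),
    (2018, 1, "National", "Z", 47), (2018, 1, "Regional", "X", 52),
    (2018, 1, "Regional", "Y", 38), (2018, 1, "Regional", "Z", 65),
    (2018, 2, "National", "X", 63), (2018, 2, "National", "Y", 14),
    (2018, 2, "National", "Z", 34), (2018, 2, "Regional", "X", 90),
    (2018, 2, "Regional", "Y", 71), (2018, 2, "Regional", "Z", 69),
]
conn.executemany("INSERT INTO have VALUES (?, ?, ?, ?, ?)", sample_rows)

# Pass 1: discover which pivot columns are needed.
months = [
    int(row[0])
    for row in conn.execute("SELECT DISTINCT month FROM have ORDER BY month")
]

# Generate one CASE column per month value found in pass 1.
pivot_columns = ",\n       ".join(
    f"MIN(CASE WHEN month = {m} THEN rates END) AS rate_m{m}" for m in months
)
pivot_sql = f"""
SELECT year, geog, category,
       {pivot_columns}
FROM have
GROUP BY year, geog, category
ORDER BY year, geog, category
"""

# Pass 2: run the generated wallpaper query.
pivot_rows = conn.execute(pivot_sql).fetchall()

print(pivot_sql)
for row in pivot_rows:
    print(row)
```

The same generator idea carries over to Hive or SAS pass-through: only the text of the generated statement changes, not the two-pass structure.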
The Hadoop Mania post "TRANSPOSE/PIVOT a Table in Hive" shows the use of collect_list and map in a similar wallpapery manner:
select b.id, b.code, concat_ws('',b.p) as p, concat_ws('',b.q) as q, concat_ws('',b.r) as r, concat_ws('',b.t) as t from
(select id, code,
collect_list(a.group_map['p']) as p,
collect_list(a.group_map['q']) as q,
collect_list(a.group_map['r']) as r,
collect_list(a.group_map['t']) as t
from ( select
id, code,
map(key,value) as group_map
from test_sample
) a group by a.id, a.code) b;
If your sample dataset is representative of the real dataset, you can use a simple inner join as shown below. Year, geog and category form a unique combination within each month, so the code below should work; it takes month 1 from one copy of the table and month 2 from the other.
select a.YEAR ,
a.Geog ,
a.Category ,
a.Rates as month_1,
b.Rates as month_2
from have a
inner join
have b
on a.year = b.year
and a.Geog = b.Geog
and a.Category = b.category
where a.month = 1
and b.month = 2;

How to select minimum values for duplicate ids using hive

Can someone please help me on this.
I have data like this
id,age,name
10,25,abc
10,35,def
20,45,ghi
20,55,jkl
20,65,mno
30,40,pqr
30,50,stu
30,70,vwr
40,20,yza
40,25,fdf
40,25,dgh
40,20,sfs
Now I want to get the final result as below
+------+------+
| id | age |
+------+------+
| 10 | 25 |
| 20 | 45 |
| 30 | 40 |
| 40 | 20 |
| 40 | 20 |
+------+------+
I am able to do this in MySQL, but Hive does not support multiple arguments in a sub-query, so I cannot get the desired result in Hive.
I tried doing this with a Hive join, but without success.
Thanks in advance for help!!
select id
,age
from (select id
,age
,rank () over
(
partition by id
order by age
) as rnk
from mytable
) t
where t.rnk = 1
+----+-----+
| id | age |
+----+-----+
| 10 | 25 |
| 20 | 45 |
| 30 | 40 |
| 40 | 20 |
| 40 | 20 |
+----+-----+
Another way to get the expected output:
SELECT id,
age
FROM
(SELECT id,
age
FROM tblname) a LEFT SEMI
JOIN
(SELECT id,
MIN(age) age
FROM tblName
GROUP BY id) b ON a.id=b.id
AND a.age=b.age
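Both approaches can be checked against the sample data. The sketch below uses Python's sqlite3 module in place of Hive (an assumption made for runnability). SQLite has no LEFT SEMI JOIN, so the second query uses a plain inner join on the per-id minimum instead; because the grouped sub-query yields exactly one row per id, the result is the same as the semi join.

```python
import sqlite3

# SQLite stands in for Hive (an assumption made so the sketch is runnable).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, age INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (10, 25, "abc"), (10, 35, "def"),
        (20, 45, "ghi"), (20, 55, "jkl"), (20, 65, "mno"),
        (30, 40, "pqr"), (30, 50, "stu"), (30, 70, "vwr"),
        (40, 20, "yza"), (40, 25, "fdf"), (40, 25, "dgh"), (40, 20, "sfs"),
    ],
)

# Approach 1: rank rows within each id by age and keep rank 1.
# RANK() gives tied minimum ages the same rank, so both age-20 rows of id 40 survive.
rank_query = """
    SELECT id, age
    FROM (SELECT id, age,
                 RANK() OVER (PARTITION BY id ORDER BY age) AS rnk
          FROM mytable) t
    WHERE t.rnk = 1
    ORDER BY id, age
"""

# Approach 2: join each row to the minimum age of its id.
# An inner join replaces Hive's LEFT SEMI JOIN here; the sub-query has one row per id.
join_query = """
    SELECT a.id, a.age
    FROM mytable a
    JOIN (SELECT id, MIN(age) AS age
          FROM mytable
          GROUP BY id) b
      ON a.id = b.id AND a.age = b.age
    ORDER BY a.id, a.age
"""

rank_rows = conn.execute(rank_query).fetchall()
join_rows = conn.execute(join_query).fetchall()

print(rank_rows)
print(join_rows)
```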

Hive Query for Rolling total based on 2 fields

I have a table as shown below
Date | Customer | Count | Daily_Count | ITD_Count
d1 | A | 3 | 3 |
d2 | B | 4 | 4 |
d3 | A | 7 | 16 |
d3 | B | 9 | 16 |
d4 | A | 8 | 9 |
d4 | B | 1 | 9 |
Description of Fields:
Date : date
customer : name of customer
Count : # of customers
daily_Count : # of customers on daily basis calculated as
SUM(count) OVER (partition BY date )as Daily_Count
Question :
How do I calculate the Running Total or Rolling Total in the ITD_Count ?
The output should look like
Date | Customer | Count | Daily_Count | ITD_Count
d1 | A | 3 | 3 | 3
d2 | B | 4 | 4 | 7
d3 | A | 7 | 16 | 23
d3 | B | 9 | 16 | 23
d4 | A | 8 | 9 | 32
d4 | B | 1 | 9 | 32
I have tried several variations using window functions, but I hit a roadblock in all my attempts.
Attempt 1 :
SUM(daily_COunt) OVER (partition BY date order by date rows between unbounded preceding and current row ) as ITD_account_linking
Attempt 2 :
SUM(daily_COunt) OVER (partition BY date, daily_count order by date rows between unbounded preceding and current row ) as ITD_account_linking
and several more attempts following this. :(
Any possible suggestions to guide me in the right direction are welcome.
Please let me know if you need more details.
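Use Hive windowing and analytics functions. Sum Count rather than Daily_Count (Daily_Count already repeats the day's total on every row of that day), and use a RANGE frame so that rows sharing a date are treated as peers and receive the same running total:
SELECT Date, Customer, Count, Daily_Count,
SUM(Count) OVER (ORDER BY Date RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS ITD_Count
FROM mytable;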
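A date-level running total can be tried on the question's rows with the sketch below. It uses Python's sqlite3 module in place of Hive, and the table name daily and column names dt and cnt are hypothetical names chosen to avoid reserved words. The frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW over the per-row count: a ROWS frame would accumulate row by row, whereas RANGE includes every row with the same date, so both rows of a day get the same inception-to-date value.

```python
import sqlite3

# SQLite stands in for Hive; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (dt TEXT, customer TEXT, cnt INTEGER)")
conn.executemany(
    "INSERT INTO daily VALUES (?, ?, ?)",
    [
        ("d1", "A", 3),
        ("d2", "B", 4),
        ("d3", "A", 7),
        ("d3", "B", 9),
        ("d4", "A", 8),
        ("d4", "B", 1),
    ],
)

query = """
    SELECT dt, customer, cnt,
           -- total per day, repeated on every row of that day
           SUM(cnt) OVER (PARTITION BY dt) AS daily_count,
           -- running total up to and including all rows of the current day
           SUM(cnt) OVER (ORDER BY dt
                          RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS itd_count
    FROM daily
    ORDER BY dt, customer
"""

result_rows = conn.execute(query).fetchall()
daily_counts = [row[3] for row in result_rows]
itd_counts = [row[4] for row in result_rows]

for row in result_rows:
    print(row)
```

With these rows the d3 rows both show 23 and the d4 rows both show 32 (23 + 9).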

80% Rule Estimation Value in PL/SQL

Assume a range of values inserted into a schema table. At the end of the month I want to apply the following algorithm to these records (e.g. 2500 rows of numeric values): sort the values in ascending order (from the smallest to the highest value) and then find the 80% value of the sorted column.
In my example, if each row increases by one starting from 1, the 80% value will be the value in row 2000 (= 2500 - 2500 * 20 / 100). This algorithm needs to be implemented in a procedure where the number of rows is not constant; for example, it can vary from 2500 to 1,000,000 per month.
Hint: You can achieve this using Oracle's cumulative aggregate functions. For example, suppose your table looks like this:
MY_TABLE
+-----+----------+
| ID | QUANTITY |
+-----+----------+
| A | 1 |
| B | 2 |
| C | 3 |
| D | 4 |
| E | 5 |
| F | 6 |
| G | 7 |
| H | 8 |
| I | 9 |
| J | 10 |
+-----+----------+
At each row, you can sum the quantities so far using this:
SELECT
id,
quantity,
SUM(quantity)
OVER (ORDER BY quantity ROWS UNBOUNDED PRECEDING)
AS cumulative_quantity_so_far
FROM
MY_TABLE
Giving you:
+-----+----------+----------------------------+
| ID | QUANTITY | CUMULATIVE_QUANTITY_SO_FAR |
+-----+----------+----------------------------+
| A | 1 | 1 |
| B | 2 | 3 |
| C | 3 | 6 |
| D | 4 | 10 |
| E | 5 | 15 |
| F | 6 | 21 |
| G | 7 | 28 |
| H | 8 | 36 |
| I | 9 | 45 |
| J | 10 | 55 |
+-----+----------+----------------------------+
Hopefully this will help in your work.
Write a query using the percentile_disc function to solve your problem. Sounds like it does what you want.
An example would be
select percentile_disc(0.8) within group (order by the_value)
from my_table
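The question's own example (values 1 to 2500, expected answer 2000) can be checked with the sketch below. It runs on Python's sqlite3 module, which is assumed here not to provide PERCENTILE_DISC, so the query emulates it: Oracle documents PERCENTILE_DISC(P) as returning the value with the smallest CUME_DIST that is greater than or equal to P, and SQLite does offer CUME_DIST() as a window function. The table and column names follow the answer above.

```python
import sqlite3

# SQLite stands in for Oracle; PERCENTILE_DISC is emulated with CUME_DIST().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (the_value INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?)",
    [(v,) for v in range(1, 2501)],  # 2500 rows: 1, 2, ..., 2500
)

# Smallest value whose cumulative distribution reaches 80 percent.
query = """
    SELECT MIN(the_value)
    FROM (SELECT the_value,
                 CUME_DIST() OVER (ORDER BY the_value) AS cd
          FROM my_table) AS ranked
    WHERE cd >= 0.8
"""

p80 = conn.execute(query).fetchone()[0]
print(p80)
```

Because the query works on the cumulative distribution rather than a fixed row number, it does not depend on the row count being 2500.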
