Hive: Is there a better way to percentile rank a column? - performance

Currently, to percentile rank a column in Hive, I am using something like the following. I am trying to rank items in a column by the percentile they fall under, assigning a value from 0 to 1 to each item. The code below assigns a value from 0 to 9, essentially saying that an item with a char_percentile_rank of 0 is in the bottom 10% of items, and an item with a value of 9 is in the top 10%. Is there a better way of doing this?
select item
, characteristic
, case when characteristic <= char_perc[0] then 0
when characteristic <= char_perc[1] then 1
when characteristic <= char_perc[2] then 2
when characteristic <= char_perc[3] then 3
when characteristic <= char_perc[4] then 4
when characteristic <= char_perc[5] then 5
when characteristic <= char_perc[6] then 6
when characteristic <= char_perc[7] then 7
when characteristic <= char_perc[8] then 8
else 9
end as char_percentile_rank
from (
select split(item_id,'-')[0] as item
, split(item_id,'-')[1] as characteristic
, char_perc
from (
select collect_set(concat_ws('-',item,characteristic)) as item_set
, PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as char_perc
from(
select item
, sum(characteristic) as characteristic
from table
group by item
) t1
) t2
lateral view explode(item_set) explodetable as item_id
) t3
Note: I had to do the collect_set in order to avoid a self join, as the percentile function implicitly performs a group by.
I've gathered that the percentile function is horribly slow (at least in this usage). Perhaps it would be better to manually calculate percentile?

Try removing one of your derived tables by calling PERCENTILE as a window function:
select item
, characteristic
, case when characteristic <= char_perc[0] then 0
when characteristic <= char_perc[1] then 1
when characteristic <= char_perc[2] then 2
when characteristic <= char_perc[3] then 3
when characteristic <= char_perc[4] then 4
when characteristic <= char_perc[5] then 5
when characteristic <= char_perc[6] then 6
when characteristic <= char_perc[7] then 7
when characteristic <= char_perc[8] then 8
else 9
end as char_percentile_rank
from (
select item, characteristic
, PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) over () as char_perc
from (
select item
, sum(characteristic) as characteristic
from table
group by item
) t1
) t2
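To sanity-check the bucketing outside Hive, here is a minimal Python sketch of what the CASE chain does: it computes nine cut points and assigns each value the index of the first cut point it does not exceed. The item names and totals are made up for illustration, and the linear interpolation used by `statistics.quantiles(..., method="inclusive")` is only assumed to be close to what PERCENTILE computes.

```python
from bisect import bisect_left
from statistics import quantiles


def decile_ranks(totals):
    """Map each item to a decile bucket 0..9 based on its total."""
    # Nine cut points at the 10th..90th percentiles, linearly interpolated.
    cuts = quantiles(totals.values(), n=10, method="inclusive")
    # bisect_left returns the index of the first cut point >= value, which is
    # the branch the CASE chain would take; values above every cut point
    # get 9 (the ELSE branch).
    return {item: bisect_left(cuts, value) for item, value in totals.items()}


# Hypothetical per-item totals: i1 -> 1, i2 -> 2, ..., i20 -> 20.
totals = {f"i{n}": n for n in range(1, 21)}
ranks = decile_ranks(totals)
print(ranks["i1"], ranks["i20"])  # prints: 0 9
```

With 20 evenly spaced totals, each bucket ends up holding two items, which is a quick way to confirm the cut points behave as expected.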

Related

How to find the nearest location points?

I have a table that has customer_number, customer_location (lat, long), current_latched_tower_ID, tower_location (lat, long) and Distance (between the customer location and the latched tower location). I am trying to find a tower that is closer to the customer location than the current latched tower.
Example: a customer is currently latched to one tower (KA001), and the distance between the customer location and that tower is 1.6 km. However, another tower (KA002) is closer to the customer location, at a distance of 1.3 km.
Table
customer_number  cx_lat       cx_long      tower_lat  tower_long  Latched_tower_ID  Distance
34532            6.897257333  79.86474533  6.890487   79.869199   CM0321            0.51477
43445            6.935598403  81.14939421  6.947618   81.160246   BD0010            1.2292
54365            6.866224     79.88215     6.896111   79.868611   CM0037            1.6216
52568            7.113198     80.037247    7.121666   80.028888   GM0121            0.9476
Expected output table
customer_number  cx_lat       cx_long      tower_lat  tower_long  Latched_tower_ID  Distance  Cloesed_tower_ID  Closed_Distance
34532            6.897257333  79.86474533  6.890487   79.869199   CM0321            0.51477   CM0037            0.43222
52568            7.113198     80.037247    7.121666   80.028888   GM0121            0.9476    NULL              NULL
If there is no tower closer than the latched tower, the "Cloesed_tower_ID" and "Closed_Distance" columns should be NULL.
If you are not using SDO_GEOM then you can create a haversine function to calculate the distance between two latitude/longitude coordinates:
CREATE FUNCTION haversine_distance(
lat1 IN NUMBER,
long1 IN NUMBER,
lat2 IN NUMBER,
long2 IN NUMBER
) RETURN NUMBER DETERMINISTIC
IS
PI CONSTANT NUMBER := ASIN(1) * 2;
R CONSTANT NUMBER := 6371000; -- Approx. radius of the earth in m
PHI1 CONSTANT NUMBER := lat1 * PI / 180;
PHI2 CONSTANT NUMBER := lat2 * PI / 180;
DELTA_PHI CONSTANT NUMBER := (lat2 - lat1) * PI / 180;
DELTA_LAMBDA CONSTANT NUMBER := (long2 - long1) * PI / 180;
a NUMBER;
c NUMBER;
BEGIN
a := SIN(delta_phi/2) * SIN(delta_phi/2) + COS(phi1) * COS(phi2) *
SIN(delta_lambda/2) * SIN(delta_lambda/2);
c := 2 * ATAN2(SQRT(a), SQRT(1-a));
RETURN R * c; -- in metres
END;
/
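If you want to cross-check the numbers outside the database, the same formula can be sketched in Python. This is only an illustrative port of the function above (same Earth radius, same variable names), not something the Oracle queries depend on.

```python
import math

EARTH_RADIUS_M = 6_371_000  # approximate radius of the earth in metres


def haversine_distance(lat1, long1, lat2, long2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(long2 - long1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_M * c


# Customer 34532 to tower CM0037, coordinates taken from the question.
d = haversine_distance(6.897257333, 79.86474533, 6.896111, 79.868611)
print(round(d / 1000, 3))  # distance in km
```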
Then, I am assuming that you have your data in third normal form:
CREATE TABLE towers (
tower_id PRIMARY KEY,
t_lat,
t_long
) AS
SELECT 'CM0321', 6.890487, 79.869199 FROM DUAL UNION ALL
SELECT 'BD0010', 6.947618, 81.160246 FROM DUAL UNION ALL
SELECT 'CM0037', 6.896111, 79.868611 FROM DUAL UNION ALL
SELECT 'GM0121', 7.121666, 80.028888 FROM DUAL;
CREATE TABLE customers (
customer_number PRIMARY KEY,
cx_lat,
cx_long,
latched_tower_id
) AS
SELECT 34532, 6.897257333, 79.86474533, 'CM0321' FROM DUAL UNION ALL
SELECT 43445, 6.935598403, 81.14939421, 'BD0010' FROM DUAL UNION ALL
SELECT 54365, 6.866224000, 79.88215000, 'CM0037' FROM DUAL UNION ALL
SELECT 52568, 7.113198000, 80.03724700, 'GM0121' FROM DUAL;
ALTER TABLE customers ADD CONSTRAINT customers__lti__fk
FOREIGN KEY (latched_tower_id) REFERENCES towers (tower_id);
Then, from Oracle 12, you can calculate the closer towers using:
SELECT c.*,
TO_CHAR(
HAVERSINE_DISTANCE(c.cx_lat, c.cx_long, t.t_lat, t.t_long)/1000,
'FM999990.000'
) AS distance,
ct.tower_id AS closer_tower_id,
TO_CHAR(ct.distance, 'FM999990.000') AS closer_distance
FROM customers c
INNER JOIN towers t
ON (t.tower_id = c.latched_tower_id)
LEFT OUTER JOIN LATERAL(
SELECT ct.*,
HAVERSINE_DISTANCE(
c.cx_lat,
c.cx_long,
ct.t_lat,
ct.t_long
)/1000 AS distance
FROM towers ct
ORDER BY distance ASC
FETCH FIRST ROW ONLY
) ct
ON (ct.tower_id != c.latched_tower_id);
Which outputs:
CUSTOMER_NUMBER  CX_LAT       CX_LONG      LATCHED_TOWER_ID  DISTANCE  CLOSER_TOWER_ID  CLOSER_DISTANCE
43445            6.935598403  81.14939421  BD0010            1.795     (null)           (null)
54365            6.866224     79.88215     CM0037            3.644     CM0321           3.053
34532            6.897257333  79.86474533  CM0321            0.899     CM0037           0.445
52568            7.113198     80.037247    GM0121            1.318     (null)           (null)
Before Oracle 12, you can use:
SELECT customer_number,
cx_lat,
cx_long,
latched_tower_id,
distance,
CASE
WHEN latched_tower_id != closer_tower_id
THEN closer_tower_id
END AS closer_tower_id,
CASE
WHEN latched_tower_id != closer_tower_id
THEN closer_distance
END AS closer_distance
FROM (
SELECT c.*,
TO_CHAR(
HAVERSINE_DISTANCE(c.cx_lat, c.cx_long, t.t_lat, t.t_long)/1000,
'FM999990.000'
) AS distance,
ct.tower_id AS closer_tower_id,
TO_CHAR(
HAVERSINE_DISTANCE(c.cx_lat, c.cx_long, ct.t_lat, ct.t_long)/1000,
'FM999990.000'
) AS closer_distance,
ROW_NUMBER() OVER (
PARTITION BY c.customer_number
ORDER BY HAVERSINE_DISTANCE(c.cx_lat, c.cx_long, ct.t_lat, ct.t_long)
) AS rn
FROM customers c
INNER JOIN towers t
ON (t.tower_id = c.latched_tower_id)
CROSS JOIN towers ct
ORDER BY
customer_number,
DISTANCE ASC
)
WHERE rn = 1;
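Both queries follow the same rule: find the nearest tower for each customer, and report it only when it is not the latched one. As a rough sketch of that rule in Python, using the sample coordinates from above (the helper names `distance_km` and `closer_tower` are made up for this example):

```python
import math


def distance_km(lat1, long1, lat2, long2):
    """Haversine distance in kilometres (Earth radius assumed to be 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2)
         * math.sin(math.radians(long2 - long1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))


def closer_tower(customer, towers):
    """Return (tower_id, km) of the nearest tower, or None if the nearest
    tower is the one the customer is already latched to."""
    lat, long, latched = customer
    nearest = min(towers, key=lambda tid: distance_km(lat, long, *towers[tid]))
    if nearest == latched:
        return None
    return nearest, distance_km(lat, long, *towers[nearest])


towers = {
    "CM0321": (6.890487, 79.869199),
    "BD0010": (6.947618, 81.160246),
    "CM0037": (6.896111, 79.868611),
    "GM0121": (7.121666, 80.028888),
}
customers = {
    34532: (6.897257333, 79.86474533, "CM0321"),
    43445: (6.935598403, 81.14939421, "BD0010"),
    54365: (6.866224, 79.88215, "CM0037"),
    52568: (7.113198, 80.037247, "GM0121"),
}
for number, customer in customers.items():
    print(number, closer_tower(customer, towers))
```

Comparing every customer with every tower is fine for a handful of rows; for large tables a spatial index would be the better tool.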

Power Query How to Assign Values to range of Numbers

I have a range of numbers that I need to assign scores to and would like to see if there's an easy way to do it through Power Query:
>150 = 0,
101-150 = 1,
51-100 = 2,
21-50 = 3,
<=20 = 4
Thanks in advance
The most obvious way is just an if .. then .. else if .. construction as a new custom column similar to this:
Score =
if [value] > 150 then 0
else if [value] >= 101 and [value] <= 150 then 1
else if ...
else if [value] <= 20 then 4
else null
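Because the conditions are tested from top to bottom, each branch only really needs one bound. A small Python sketch of the same mapping makes the boundaries easy to test; it assumes whole numbers, so a value like 100.5 (which falls between the ranges in the question) would simply score 1 here.

```python
def score(value):
    """Score a number using the ranges from the question."""
    # Checks run top to bottom, so each branch only needs a lower bound.
    if value > 150:
        return 0
    if value > 100:
        return 1
    if value > 50:
        return 2
    if value > 20:
        return 3
    return 4


print([score(v) for v in (151, 150, 101, 100, 51, 50, 21, 20)])
# prints [0, 1, 1, 2, 2, 3, 3, 4]
```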

calculate percentage of two select counts

I have a query like
select count(1) from table_a where state=1;
it gives 20
select count(1) from table_a where state in (1,2);
it gives 25
I would like a query that returns the percentage, 80% (that is, 20*100/25).
Is it possible to get this in a single query?
I think without testing that the following SQL command can do that
SELECT SUM(CASE WHEN STATE = 1 THEN 1 ELSE 0 END)
/SUM(CASE WHEN STATE IN (1,2) THEN 1 ELSE 0 END)
as PERCENTAGE
FROM TABLE_A
or the following
SELECT S1 / (S1 + S2) as S1_PERCENTAGE
FROM
(
SELECT SUM(CASE WHEN STATE = 1 THEN 1 ELSE 0 END) as S1
,SUM(CASE WHEN STATE = 2 THEN 1 ELSE 0 END) as S2
FROM TABLE_A
)
or the following
SELECT S1 / T as S1_PERCENTAGE
FROM
(
SELECT SUM(CASE WHEN STATE = 1 THEN 1 ELSE 0 END) as S1
,SUM(CASE WHEN STATE IN (1,2) THEN 1 ELSE 0 END) as T
FROM TABLE_A
)
You have the choice between performance and readability!
Just as a slight variation on @schlebe's first query, you can continue to use count() by making that conditional:
select count(case when state = 1 then state end)
/ count(case when state in (1, 2) then state end) as result
from table_a
or multiplying by 100 to get a percentage instead of a decimal:
select 100 * count(case when state = 1 then state end)
/ count(case when state in (1,2) then state end) as percentage
from table_a
Count ignores nulls, and both of the case expressions default to null if their conditions are not met (you could have else null to make it explicit too).
Quick demo with a CTE for dummy data:
with table_a(state) as (
select 1 from dual connect by level <= 20
union all select 2 from dual connect by level <= 5
union all select 3 from dual connect by level <= 42
)
select 100 * count(case when state = 1 then state end)
/ count(case when state in (1,2) then state end) as percentage
from table_a;
PERCENTAGE
----------
80
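For anyone who finds the null handling easier to follow outside SQL, here is a rough Python sketch of the same idea: rows that do not match become None, and only non-None entries are counted. The guard for an empty denominator is an addition for this sketch and is not part of the queries above.

```python
def percentage(states):
    """Share of state 1 among rows whose state is 1 or 2, as a percentage."""
    # Mirror the CASE expressions: non-matching rows become None (NULL).
    ones = [s if s == 1 else None for s in states]
    ones_or_twos = [s if s in (1, 2) else None for s in states]
    # Like COUNT, count only the non-null entries.
    numerator = sum(v is not None for v in ones)
    denominator = sum(v is not None for v in ones_or_twos)
    if denominator == 0:
        return None  # no qualifying rows; avoids dividing by zero
    return 100 * numerator / denominator


states = [1] * 20 + [2] * 5 + [3] * 42  # same dummy data as the demo above
print(percentage(states))  # prints 80.0
```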
Why the plsql tag? Regardless, I think what you need is:
select (select count(1) from table_a where state=1) * 100 / (select count(1) from table_a where state in (1,2)) from dual

Progressive LOOP Statement ORACLE

How to create a function that returns a float(ChargeTotal)?
ChargeTotal is based on a progressive table using number of batches.
num of batches | charge
----------------------------
1-10 | 0
11-20 | 50
21-30 | 60
31-40 | 70
40+ | 80
If number of batches is 25 then
num of batches | charge
----------------------------
1-10 | 0
11-20 | 50*10
21-30 | 60*5
----------------------------
total | 800 <number I need to be returned(ChargeTotal)
So far I have come up with the following, but I'm unsure how to get the total for each loop, or if it is even possible to do more than one FOR statements:
CREATE OR REPLACE FUNCTION ChargeTotal
RETURN FLOAT IS
total FLOAT;
BEGIN
FOR a in 1 .. 10 LOOP
FOR a in 11 .. 20 LOOP
FOR a in 21 .. 30 LOOP
FOR a in 40 .. 1000 LOOP
RETURN Total;
END ChargeTotal;
OK, bear in mind that I have no DB available to test this right now, so there might be some syntax errors.
But I am thinking of something along these lines:
create or replace function ChargeTotal(xin number) return number is
cursor mycursor is
select lowLimit,highLimit,charge
from progressive_table order by lowLimit;
total number := 0; -- running charge (a variable named "sum" would shadow SUM)
segment number; -- batches that fall inside the current band
remaining number := xin; -- batches not yet charged
begin
for i in mycursor loop
exit when remaining <= 0;
-- charge only the part of this band that is actually used;
-- the open-ended top band needs a highLimit at least as large as any xin
segment := least(remaining, (i.highLimit-i.lowLimit)+1);
total := total + segment*i.charge;
remaining := remaining - segment;
end loop;
return total;
end;
/
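To check the expected totals without a database, the progressive calculation can be sketched in Python. The band list below is hard-coded from the question rather than read from a table, and the last band is assumed to start at 41, since the question lists both 31-40 and 40+.

```python
def charge_total(batches, bands):
    """Total charge when each band's rate applies only to batches inside it."""
    total = 0
    for low, high, rate in bands:
        if batches < low:
            break
        # Highest batch number charged in this band (open-ended if high is None).
        top = batches if high is None else min(batches, high)
        total += (top - low + 1) * rate
    return total


# (low, high, rate); the last band is assumed to start at 41.
BANDS = [(1, 10, 0), (11, 20, 50), (21, 30, 60), (31, 40, 70), (41, None, 80)]
print(charge_total(25, BANDS))  # prints 800
```

For 25 batches this gives 10*0 + 10*50 + 5*60 = 800, matching the worked example in the question.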
I think you can do the calculation in a single SQL statement, without a complex function.
The logic is:
you have a weight for each "band"
calculate the "band" for each row
use count(*) over to get the number of rows in each "band"
join to your weight table to get the subtotal for each band
use rollup to get the grand total
SQL:
select r.num_of_batches
,sum(r.subtotal_charge)
from (
with weights as
(select 1 as num_of_batches, 0 as charge from dual
union all
select 2 as num_of_batches, 50 as charge from dual
union all
select 3 as num_of_batches, 60 as charge from dual
union all
select 4 as num_of_batches, 70 as charge from dual
union all
select 5 as num_of_batches, 80 as charge from dual
)
select distinct n.num_of_batches
, w.charge
, count(*) over (partition by n.num_of_batches) as cnt
, count(*) over (partition by n.num_of_batches) * charge as subtotal_charge
from (
select num, case when floor(num / 10) > 4 then 5 else floor(num / 10)+1 end as num_of_batches
from tst_brances b
) n
inner join weights w on n.num_of_batches = w.num_of_batches
order by num_of_batches
) r
group by ROLLUP(r.num_of_batches)
populate test data
create table tst_brances(num int);
declare
i int;
begin
delete from tst_brances;
for i in 1..10 loop
insert into tst_brances(num) values (i);
end loop;
for i in 11..20 loop
insert into tst_brances(num) values (i);
end loop;
for i in 21..25 loop
insert into tst_brances(num) values (i);
end loop;
for i in 31..32 loop
insert into tst_brances(num) values (i);
end loop;
for i in 41..43 loop
insert into tst_brances(num) values (i);
end loop;
for i in 51..55 loop
insert into tst_brances(num) values (i);
end loop;
commit;
end;
results
1 1 0
2 2 500
3 3 360
4 4 140
5 5 640
6 1640

SSRS - Percentage for totals

      C1  C2  C3  C4  C5  C6  C7  C8  Total  **Percentages**
============================================================
R1     6   1   8   8   2   1   1   0     27  **60%**
R2     0   0   0   5   1   1   0   0      7  **16%**
R3     2   0   3   2   0   1   0   0      8  **18%**
R4     2   0   0   1   0   0   0   0      3  **7%**
TTL   10   1  11  16   3   3   1   0     45  **100%**
How to calculate the individual row percentages in SSRS
Thank you.
If you're not filtering your dataset, you could use the Dataset sum to get the overall total and use that as the denominator in your expression.
If your table is a matrix with the C1 - C8 all coming from one field, then your formula would just be:
=Sum(Fields!YourField.Value) / Sum(Fields!YourField.Value, "Dataset1")
If the C1 - C8 fields are in separate fields, you can use the same expression used for your total column as the numerator and then divide by the SUM of all the other fields.
=Sum(Fields!C1.Value + Fields!C2.Value + Fields!C3.Value + Fields!C4.Value + Fields!C5.Value + Fields!C6.Value + Fields!C7.Value + Fields!C8.Value)
/
Sum(Fields!C1.Value + Fields!C2.Value + Fields!C3.Value + Fields!C4.Value + Fields!C5.Value + Fields!C6.Value + Fields!C7.Value + Fields!C8.Value, "Dataset1")
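Either expression boils down to "row total divided by the grand total". A quick Python sketch with the figures from the question shows the arithmetic; the `rows` dictionary is just the sample table typed in by hand.

```python
def row_percentages(rows):
    """Percentage of the grand total contributed by each row."""
    row_totals = {name: sum(values) for name, values in rows.items()}
    grand_total = sum(row_totals.values())  # the dataset-wide denominator
    return {name: 100 * total / grand_total for name, total in row_totals.items()}


# C1..C8 values for each row, copied from the question.
rows = {
    "R1": [6, 1, 8, 8, 2, 1, 1, 0],
    "R2": [0, 0, 0, 5, 1, 1, 0, 0],
    "R3": [2, 0, 3, 2, 0, 1, 0, 0],
    "R4": [2, 0, 0, 1, 0, 0, 0, 0],
}
for name, pct in row_percentages(rows).items():
    print(name, f"{pct:.0f}%")
```

The rounded values add up to 101% here, which is only a rounding effect; the unrounded percentages sum to 100.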
I will work in SQL rather than in SSRS. Here is my approach. For SSRS here is the link.
DECLARE @YourTable TABLE
(
Col INT
,Col1 INT
,Col2 INT
,Col3 INT
)
INSERT INTO @YourTable VALUES
(1 , 20, 10, 15)
,(2 , 30, 12, 14)
,(2 , 22, 2, 4)
,(3 , 3, 10, 15)
,(5 , 5, 14, 14)
,(2 , 21, 32, 4)
SELECT * FROM @YourTable
; WITH CTE AS
(SELECT *,Col+Col1+Col2+Col3 AS SumCol FROM @YourTable)
SELECT *, CAST(SumCol*100.0 / SUM(SumCol) OVER() as DECIMAL(28,2)) FROM CTE
Here's another approach:
Create a row outside of the details group, above the first row of data.
Populate a Textbox in the new row with =Sum(Fields!Total.Value). Rename the Textbox to something unique, such as Denominator.
Hide the row.
For your percentage formula in the details row, use something like:
=Sum(Fields!Total.Value) / ReportItems!Denominator.Value

Resources