How to divide a dataset by grouping data? - Oracle

I have a table with a Depth column that ranges from 0 to 1000 m at 1 m intervals. I would like to group the rows into 10 m bins, with the average value for each bin, to save time. How can I do that? Thank you so much.
Here is my code, without any grouping on the depth column.
I am also wondering: will reducing the number of data rows like this increase query SPEED?
start = '2022-01-01'
end = '2022-03-01'
sql = f"""
SELECT
WELL_NAME, ROUND(OBV_TIME,'DDD') as "Date",
DEPTH, AVG(TEMPERATURE) as "TEMPERATURE"
FROM
TEMPERATURE_V
WHERE
AREA_NAME = 'Lake' AND WELL_NAME = '{well}'
AND OBV_TIME >= TO_DATE('{start}', 'YYYY-MM-DD')
AND OBV_TIME <= TO_DATE('{end}','YYYY-MM-DD')
AND DEPTH>={dts_well_depth_min}
GROUP BY
WELL_NAME, ROUND(OBV_TIME,'DDD'), DEPTH """

I found a solution using 10*TRUNC(DEPTH/10, 0), and it works:
SELECT
WELL_NAME,
OBV_TIME,
10*TRUNC(DEPTH/10,0) as "DEPTH_10",
AVG(TEMPERATURE) as "TEMPERATURE"
FROM TEMPERATURE_V
WHERE
AREA_NAME = 'Lake'
AND WELL_NAME = '{well}'
AND OBV_TIME >= TO_DATE('{start}', 'YYYY-MM-DD')
AND OBV_TIME <= TO_DATE('{end}', 'YYYY-MM-DD')
AND DEPTH>={dts_well_depth_min}
GROUP BY WELL_NAME, OBV_TIME, 10*TRUNC(DEPTH/10,0)
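As a side note (my addition, not part of the original solution): Oracle's built-in WIDTH_BUCKET function can express the same 10 m binning; for non-negative depths it is equivalent to 10*TRUNC(DEPTH/10). A minimal sketch, assuming DEPTH spans 0 to 1000 as described in the question:
SELECT
WELL_NAME,
OBV_TIME,
-- WIDTH_BUCKET(DEPTH, 0, 1000, 100) returns bucket numbers 1..100 for
-- DEPTH in [0, 1000); subtracting 1 and scaling by 10 yields the bin's
-- lower edge, e.g. DEPTH 237 -> bucket 24 -> DEPTH_10 = 230
(WIDTH_BUCKET(DEPTH, 0, 1000, 100) - 1) * 10 as "DEPTH_10",
AVG(TEMPERATURE) as "TEMPERATURE"
FROM TEMPERATURE_V
GROUP BY WELL_NAME, OBV_TIME, (WIDTH_BUCKET(DEPTH, 0, 1000, 100) - 1) * 10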

Related

In BigQuery, is there a way to increase the number of output sinks for a stage?

When I tried to improve my query's performance, I found that slot utilization was very low: only 100 slots were used for a long time (2,000 slots is the limit).
By investigating the log, I found that the bottleneck stage's 'parallelInputs' was only 80, even though the input was about 100 million rows with no duplicate values. I therefore think the query planner should increase the number of output sinks of the stage before the bottleneck stage.
Is there a way to encourage the query planner to increase the output sinks?
==============================
I have resolved my performance problem with the "UNION" method from the answer.
In my case, I used subqueries instead of a view, like the following:
with slow_stage as (
...
), multiple_read as (
select
*
from slow_stage where MOD(key, 3)=0
union all
select
*
from slow_stage where MOD(key, 3)=1
...
)
I expected a threefold improvement. However, the improvement was more than fivefold, because of the increase in the number of output sinks of slow_stage. Previously, slow_stage had only 80 output sinks; after applying the "UNION" method, it had over 1,000 output sinks.
I thought the number of output sinks depended on the size and shape of the output (I referred to "Google BigQuery: The Definitive Guide").
In this case I didn't change the output of slow_stage, so I don't know the reason for the major improvement.
There is a hack which you may be able to use, with caution:
The query below splits the yourDataset.yourTable table into 16 shards by the first character of col1. It doesn't increase cost, because BigQuery charges by the size of the data it scans.
Again, use it with caution, as the hack may not be effective in the future.
CREATE VIEW yourDataset.yourJoinedTable AS (
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = '0' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = '1' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = '2' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = '3' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = '4' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = '5' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = '6' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = '7' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = '8' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = '9' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = 'a' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = 'b' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = 'c' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = 'd' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = 'e' UNION ALL
SELECT * FROM yourDataset.yourTable WHERE SUBSTR(col1, 0, 1) = 'f'
);
SELECT ...
FROM yourDataset.yourJoinedTable;

Hive: Is there a better way to percentile rank a column?

Currently, to percentile rank a column in Hive, I am using something like the following. I am trying to rank items in a column by what percentile they fall under, assigning a value from 0 to 1 to each item. The code below assigns a value from 0 to 9, essentially saying that an item with a char_percentile_rank of 0 is in the bottom 10% of items, and a value of 9 is in the top 10%. Is there a better way of doing this?
select item
, characteristic
, case when characteristic <= char_perc[0] then 0
when characteristic <= char_perc[1] then 1
when characteristic <= char_perc[2] then 2
when characteristic <= char_perc[3] then 3
when characteristic <= char_perc[4] then 4
when characteristic <= char_perc[5] then 5
when characteristic <= char_perc[6] then 6
when characteristic <= char_perc[7] then 7
when characteristic <= char_perc[8] then 8
else 9
end as char_percentile_rank
from (
select split(item_id,'-')[0] as item
, split(item_id,'-')[1] as characteristic
, char_perc
from (
select collect_set(concat_ws('-',item,characteristic)) as item_set
, PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as char_perc
from(
select item
, sum(characteristic) as characteristic
from table
group by item
) t1
) t2
lateral view explode(item_set) explodetable as item_id
) t3
Note: I had to do the collect_set in order to avoid a self join, as the percentile function implicitly performs a GROUP BY.
I've gathered that the percentile function is horribly slow (at least in this usage). Perhaps it would be better to calculate the percentiles manually?
Try removing one of your derived tables:
select item
, characteristic
, case when characteristic <= char_perc[0] then 0
when characteristic <= char_perc[1] then 1
when characteristic <= char_perc[2] then 2
when characteristic <= char_perc[3] then 3
when characteristic <= char_perc[4] then 4
when characteristic <= char_perc[5] then 5
when characteristic <= char_perc[6] then 6
when characteristic <= char_perc[7] then 7
when characteristic <= char_perc[8] then 8
else 9
end as char_percentile_rank
from (
select item, characteristic
, PERCENTILE(BIGINT(characteristic),array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) over () as char_perc
from (
select item
, sum(characteristic) as characteristic
from table
group by item
) t1
) t2
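As an extra option (my suggestion, not part of the original answer): if your Hive version supports window functions (Hive 0.11+), percent_rank() or ntile() can replace the manual CASE ladder entirely. A minimal sketch over the same aggregated input; note that ntile() splits the rows into ten equal-sized groups, which can differ slightly from threshold-based bucketing when there are ties:
select item
, characteristic
-- percent_rank() gives the 0-to-1 rank the question originally asked for
, percent_rank() over (order by characteristic) as char_pct
-- ntile(10) assigns deciles 1..10; subtracting 1 reproduces the 0..9 scale
, ntile(10) over (order by characteristic) - 1 as char_percentile_rank
from (
select item
, sum(characteristic) as characteristic
from table
group by item
) t1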

Invalid number error - [Error Code: 1722, SQL State: 42000] ORA-01722: invalid number

The first of the two queries below gives me the [Error Code: 1722, SQL State: 42000] ORA-01722: invalid number error.
But when I limit the number of records, as in the second query, it runs fine.
Other than limiting the rows in the second query, the two queries are identical.
SELECT b.first_name,
b.last_name,
b.device_derived,
b.ios_version_group,
b.add_date,
FIRST_VALUE (b.add_date)
OVER (PARTITION BY b.first_name, b.last_name, b.ios_version_group)
AS first_date,
LAST_VALUE (b.add_date)
OVER (PARTITION BY b.first_name, b.last_name, b.ios_version_group)
AS last_date
FROM (SELECT a.first_name,
a.last_name,
a.os_version,
a.device_type,
a.device,
a.add_date,
a.device_derived,
CASE
WHEN ( ( UPPER (a.device_derived) = 'IPHONE'
OR UPPER (a.device_derived) = 'IPAD')
AND TO_NUMBER (SUBSTR (a.os_version, 1, 1)) > 4)
THEN
'iOS ' || SUBSTR (a.os_version, 1, 1)
ELSE
'Others'
END
AS ios_version_group
FROM (SELECT first_name,
last_name,
os_version,
device_type,
device,
add_date,
CASE
WHEN UPPER (device_type) = 'ANDROID'
THEN
'Android'
WHEN UPPER (device_type) = 'BB'
OR UPPER (device_type) = 'BLACKBERRY'
THEN
'Blackberry'
WHEN UPPER (device_type) = 'IOS'
AND ( SUBSTR (UPPER (device), 1, 6) = 'IPHONE'
OR SUBSTR (UPPER (device), 1, 4) = 'IPOD')
THEN
'iPhone'
WHEN UPPER (device_type) = 'IOS'
AND (SUBSTR (UPPER (device), 1, 4) = 'IPAD')
THEN
'iPad'
END
AS device_derived
FROM vw_mobile_devices_all) a) b;
SELECT b.first_name,
b.last_name,
b.device_derived,
b.ios_version_group,
b.add_date,
FIRST_VALUE (b.add_date)
OVER (PARTITION BY b.first_name, b.last_name, b.ios_version_group)
AS first_date,
LAST_VALUE (b.add_date)
OVER (PARTITION BY b.first_name, b.last_name, b.ios_version_group)
AS last_date
FROM (SELECT a.first_name,
a.last_name,
a.os_version,
a.device_type,
a.device,
a.add_date,
a.device_derived,
CASE
WHEN ( ( UPPER (a.device_derived) = 'IPHONE'
OR UPPER (a.device_derived) = 'IPAD')
AND TO_NUMBER (SUBSTR (a.os_version, 1, 1)) > 4)
THEN
'iOS ' || SUBSTR (a.os_version, 1, 1)
ELSE
'Others'
END
AS ios_version_group
FROM (SELECT first_name,
last_name,
os_version,
device_type,
device,
add_date,
CASE
WHEN UPPER (device_type) = 'ANDROID'
THEN
'Android'
WHEN UPPER (device_type) = 'BB'
OR UPPER (device_type) = 'BLACKBERRY'
THEN
'Blackberry'
WHEN UPPER (device_type) = 'IOS'
AND ( SUBSTR (UPPER (device), 1, 6) = 'IPHONE'
OR SUBSTR (UPPER (device), 1, 4) = 'IPOD')
THEN
'iPhone'
WHEN UPPER (device_type) = 'IOS'
AND (SUBSTR (UPPER (device), 1, 4) = 'IPAD')
THEN
'iPad'
END
AS device_derived
FROM vw_mobile_devices_all) a) b
WHERE ROWNUM <= 100;
Can somebody tell me why I am getting this error? Is there a more efficient way to write this query?
You have TO_NUMBER (SUBSTR (a.os_version, 1, 1)) in your queries, so presumably you're hitting data that doesn't have a number at the start of os_version when you request more than 100 rows.
You need to check your data.
This error happens when you try to convert a non-numeric value with TO_NUMBER.
In the second query, the first 100 rows apparently do not contain an os_version value that starts with a non-numeric character.
Try a simple SELECT against vw_mobile_devices_all to find the non-numeric os_version values. Then figure out how you can work around the problem; maybe you can query os_version differently.
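A minimal diagnostic sketch (my addition, assuming the view and column names from the question) to locate the offending rows:
-- os_version values that do not start with a digit will make
-- TO_NUMBER(SUBSTR(os_version, 1, 1)) raise ORA-01722
SELECT DISTINCT os_version
FROM vw_mobile_devices_all
WHERE SUBSTR(os_version, 1, 1) NOT BETWEEN '0' AND '9';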

Oracle performance tuning: ORDER BY is taking time

I have a query that returns two fields, pps_id and total_weight. pps_id is a column from the table, and total_weight is calculated in the inner query. After all the processing, the result is ordered by total_weight. The query has a high cost and a slow response time. Is there any way to improve its performance?
SELECT PPS_ID, TOTAL_WEIGHT
FROM ( SELECT PPS_ID, TOTAL_WEIGHT
FROM (SELECT pps_id,
ROUND (
( ( (60 * name_pct_match / 100)
+ prs_weight
+ year_weight
+ dt_weight)
/ 90)
* 100)
total_weight
FROM (SELECT pps_id,
ROUND (func_compare_name ('aaaa',
UPPER (name_en),
' ',
60))
name_pct_match,
DECODE (prs_nationality_id, 99, 15, 0)
prs_weight,
10 mother_weight,
100 total_attrib_weight,
CASE
WHEN TO_NUMBER (
TO_CHAR (birth_date, 'yyyy')) =
1986
THEN
5
ELSE
0
END
year_weight,
CASE
WHEN TO_CHAR (
TO_DATE ('12-JAN-86',
'DD-MON-RRRR'),
'dd') =
TO_CHAR (birth_date, 'dd')
AND TO_CHAR (
TO_DATE ('12-JAN-86',
'DD-MON-RRRR'),
'mm') =
TO_CHAR (birth_date, 'mm')
THEN
10
WHEN TO_DATE ('12-JAN-86', 'DD-MON-RRRR') BETWEEN birth_date
- 6
AND birth_date
+ 6
THEN
8
WHEN TO_DATE ('12-JAN-86', 'DD-MON-RRRR') BETWEEN birth_date
- 28
AND birth_date
+ 28
THEN
5
WHEN TO_DATE ('12-JAN-86', 'DD-MON-RRRR') BETWEEN birth_date
- 90
AND birth_date
+ 90
THEN
3
ELSE
0
END
dt_weight
FROM individual_profile
WHERE birth_date = '12-JAN-86'
AND IS_ACTIVE = 1
AND gender_id = 1
AND ROUND (func_compare_name ('aaa',
UPPER (name_en),
' ',
60)) > 20))
WHERE TOTAL_WEIGHT >= 100
ORDER BY total_weight DESC)
WHERE ROWNUM <= 10
I have tried splitting the query up and putting intermediate values in temp tables, but that also takes time. I want to improve the performance of the query.
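As a side note (my addition; the thread above contains no answer): on Oracle 12c and later, the outer ROWNUM wrapper can be replaced with the ANSI row-limiting clause, which states the Top-N intent directly. A sketch of the pattern, with scored_profiles standing in for the existing inner query that computes pps_id and total_weight:
WITH scored_profiles AS (
    SELECT pps_id, total_weight
    FROM ...  -- the existing weight-calculation subquery goes here
)
SELECT pps_id, total_weight
FROM scored_profiles
WHERE total_weight >= 100
ORDER BY total_weight DESC
FETCH FIRST 10 ROWS ONLY;
Whether this helps depends on the optimizer; the dominant cost here is likely the per-row func_compare_name call, so filtering rows before that function runs will usually matter more than the ORDER BY itself.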

Convert integer into percentage

I don't know how to convert an integer into a percentage; please help me. Thank you.
Here's the query:
SELECT 'Data' || ',' ||
TO_CHAR(D.DTIME_DAY,'MM/dd/yyyy') || ',' ||
NVL(o.CNT_OPENED,0) || ',' || --as cnt_opened
NVL(c.CNT_CLOSED,0) --as cnt_closed
FROM OWNER_DWH.DC_DATE d
LEFT JOIN (SELECT TRUNC(t.CREATE_TIME, 'MM') AS report_date,
count(*) AS cnt_opened
FROM APP_ACCOUNT.OTRS_TICKET t
WHERE t.CREATE_TIME BETWEEN SYSDATE -120 AND SYSDATE
GROUP BY TRUNC(t.CREATE_TIME, 'MM')) o
ON d.DTIME_DAY=o.REPORT_DATE
LEFT JOIN (SELECT TRUNC(t.CLOSE_TIME, 'MM') as report_date,
count(*) AS cnt_closed
FROM APP_ACCOUNT.OTRS_TICKET t
WHERE t.CLOSE_TIME BETWEEN SYSDATE -120 AND SYSDATE
GROUP BY TRUNC(t.CLOSE_TIME, 'MM')) c
ON D.DTIME_DAY=c.REPORT_DATE
WHERE d.DTIME_DAY BETWEEN SYSDATE -120 AND TRUNC(SYSDATE) -1 AND
d.DTIME_DAY = TRUNC(d.DTIME_DAY, 'MM')
ORDER BY D.DTIME_DAY;
The output of that query:
Data,10/01/2013,219,201
Data,11/01/2013,249,234
Data,12/01/2013,228,224
Data,01/01/2014,269,256
example output that I need is like this:
Data,10/01/2013,219, 52%, 201, 45%
Data,11/01/2013,249, 75%, 234, 60%
.......
........
Formula (for each row):
total = cnt_opened + cnt_closed
percentage for the cnt_opened column = cnt_opened * 100 / total
percentage for the cnt_closed column = cnt_closed * 100 / total
Try this:
Basically, just add CNT_OPENED and CNT_CLOSED to get the total; then, whichever value you want the percentage of, multiply it by 100 and divide by that total.
For instance, CNT_OPENED = 219 and CNT_CLOSED = 201 so the total is 420. Multiply CNT_OPENED by 100 and then divide by 420 -> (219 * 100) / 420 = 52. Do the same thing with CNT_CLOSED.
Note that this WILL result in an exception if both CNT_OPENED and CNT_CLOSED are 0.
SELECT 'Data'
||','||TO_CHAR(D.DTIME_DAY,'MM/dd/yyyy')
||','||NVL(o.CNT_OPENED,0) --as cnt_opened
||','||(NVL(o.CNT_OPENED,0) * 100) / (NVL(o.CNT_OPENED,0) + NVL(c.CNT_CLOSED,0)) || '%'
||','||NVL(c.CNT_CLOSED,0) --as cnt_closed
||','||(NVL(c.CNT_CLOSED,0) * 100) / (NVL(o.CNT_OPENED,0) + NVL(c.CNT_CLOSED,0)) || '%'
That will also potentially give you a million decimal places, so if you only want to take it out to a couple, simply use the TRUNC function and specify your precision (2 decimal places in this case):
TRUNC((NVL(o.CNT_OPENED,0) * 100) / (NVL(o.CNT_OPENED,0) + NVL(c.CNT_CLOSED,0)), 2)
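To guard against the zero-total case mentioned above (my addition, not from the original answer), NULLIF can turn the denominator into NULL instead of raising ORA-01476:
-- NULLIF(x, 0) returns NULL when x = 0, so the division yields NULL
-- rather than a divide-by-zero error; wrap the result in NVL if you prefer 0
TRUNC((NVL(o.CNT_OPENED,0) * 100) / NULLIF(NVL(o.CNT_OPENED,0) + NVL(c.CNT_CLOSED,0), 0), 2)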
