T-SQL Use Table Variable or Sum Against Parent Table - performance

The scenario is this: I am creating a log table that will end up being quite large once it is all said and done, and I want to create a status table that queries the log table over different date ranges and sums the results into multiple total fields.
I plan on writing this as a stored procedure, but my question is whether I would get the best performance by reading all the relevant records from the log table into a temp table before doing the sum operations.
I.e., I have this table:
SummaryValues
90DayValues
60DayValues
30DayValues
14DayValues
7DayValues
1DayValues
Would it be logical to take all values for the previous 90 days, insert them into a table variable, and then calculate the sums for my 6 fields in the summary table from there, or would it be just as fast to execute 6 SUM statements directly against the log table?

Sometimes you are better off reading into a temp table first, sometimes not. It makes sense if you have multiple passes of processing over the same data.
However, if you want "last 90 days", "last 60 days", etc., it can be done in one query.
Reading the question again, I'd just run one query and calculate all the values in one go, and not bother with any intermediate tables:
SELECT
    Stuff,
    SUM(CASE WHEN dayDiff <= 90 THEN SomeValue ELSE 0 END) AS SumValue90,
    SUM(CASE WHEN dayDiff <= 60 THEN SomeValue ELSE 0 END) AS SumValue60,
    SUM(CASE WHEN dayDiff <= 30 THEN SomeValue ELSE 0 END) AS SumValue30
FROM
(
    SELECT
        Stuff,
        DATEDIFF(day, SomeData, GETDATE()) AS dayDiff
    FROM
        Mytable
    WHERE
        ...
) foo
GROUP BY
    ...
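Applied to the six buckets in the question, the same pattern might look something like this (a sketch only; the log table and column names LogTable, LogValue, and LogDate are assumptions, since the actual log schema isn't shown). The single result row could then be inserted into SummaryValues:

SELECT
    SUM(CASE WHEN dayDiff <= 90 THEN LogValue ELSE 0 END) AS [90DayValues],
    SUM(CASE WHEN dayDiff <= 60 THEN LogValue ELSE 0 END) AS [60DayValues],
    SUM(CASE WHEN dayDiff <= 30 THEN LogValue ELSE 0 END) AS [30DayValues],
    SUM(CASE WHEN dayDiff <= 14 THEN LogValue ELSE 0 END) AS [14DayValues],
    SUM(CASE WHEN dayDiff <= 7  THEN LogValue ELSE 0 END) AS [7DayValues],
    SUM(CASE WHEN dayDiff <= 1  THEN LogValue ELSE 0 END) AS [1DayValues]
FROM
(
    SELECT
        LogValue,
        DATEDIFF(day, LogDate, GETDATE()) AS dayDiff
    FROM
        LogTable
    WHERE
        LogDate >= DATEADD(day, -90, GETDATE())  -- scan only the last 90 days, once
) x;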

Related

How to iterate over a hive table row by row and calculate metric when a specific condition is met?

I have a requirement as below:
I am trying to convert an MS Access table macro loop to work for a Hive table. The table, called trip_details, contains details about a specific trip taken by a truck. The truck can stop at multiple locations, and the type of stop is indicated by a flag called type_of_trip. This column contains values like arrival, departure, loading, etc.
The ultimate aim is to calculate the dwell time of each truck (how much time the truck takes before beginning another trip). To calculate this we have to iterate over the table row by row and check the trip type.
A typical example looks like this:
Do while not end of file:
    Store the first row in a variable.
    Move to the second row.
    If the type_of_trip = Arrival:
        Move to the third row
        If the type_of_trip = End Trip:
            Store the third row
            Take the difference of timestamps to calculate dwell time
            Append the row into the output table
End
What is the best approach to tackle this problem in hive?
I checked whether Hive has a loop keyword but could not find one. I was thinking of doing this using a shell script but need guidance on how to approach this.
I cannot disclose the entire data but feel free to shoot any questions in the comments section.
Input
Trip ID  type_of_trip  timestamp        location
1        Departure     28/5/2019 15:00  Warehouse
1        Arrival       28/5/2019 16:00  Store
1        Live Unload   28/5/2019 16:30  Store
1        End Trip      28/5/2019 17:00  Store
Expected Output
Trip ID  Origin_location  Destination_location  Dwell_time
1        Warehouse        Store                 2 hours
You do not need a loop for this; use the power of a SQL query.
Convert your timestamps to seconds (using the format you specified, 'dd/MM/yyyy HH:mm'), calculate the min and max per trip_id taking the type into account, subtract the seconds, and convert the difference to 'HH:mm' format or any other format you prefer:
with trip_details as (--use your table instead of this subquery
select stack (4,
1,'Departure' ,'28/5/2019 15:00','Warehouse',
1,'Arrival' ,'28/5/2019 16:00','Store',
1,'Live Unload' ,'28/5/2019 16:30','Store',
1,'End Trip' ,'28/5/2019 17:00','Store'
) as (trip_id, type_of_trip, `timestamp`, location)
)
select trip_id, origin_location, destination_location,
from_unixtime(destination_time-origin_time,'HH:mm') dwell_time
from
(
select trip_id,
min(case when type_of_trip='Departure' then unix_timestamp(`timestamp`,'dd/MM/yyyy HH:mm') end) origin_time,
max(case when type_of_trip='End Trip' then unix_timestamp(`timestamp`,'dd/MM/yyyy HH:mm') end) destination_time,
max(case when type_of_trip='Departure' then location end) origin_location,
max(case when type_of_trip='End Trip' then location end) destination_location
from trip_details
group by trip_id
)s;
Result:
trip_id  origin_location  destination_location  dwell_time
1        Warehouse        Store                 02:00

Referencing value from select column in where clause : Oracle

My tables are as below
MS_ISM_ISSUE
ISSUE_ID  ISSUE_DUE_DATE  ISSUE_SOURCE_TYPE
I1        25-11-2018      1
I2        25-12-2018      1
I3        27-03-2019      2
MS_ISM_SOURCE_SETUP
SOURCE_ID  MODULE_NAME
1          IT-Compliance
2          Risk Assessment
I have written the following query.
with rs as
(select
count(ISSUE_ID) as ISSUE_COUNT, src.MODULE_NAME,
case
when ISSUE_DUE_DATE<sysdate then 'Overdue'
when ISSUE_DUE_DATE between sysdate and sysdate + 90 then 'Within 3 months'
when ISSUE_DUE_DATE>sysdate+90 then 'Beyond 90 days'
end as date_range
from MS_ISM_ISSUE issue, MS_ISM_SOURCE_SETUP src
where issue.Issue_source_type = src.source_id
group by src.MODULE_NAME, case
when ISSUE_DUE_DATE<sysdate then 'Overdue'
when ISSUE_DUE_DATE between sysdate and sysdate + 90 then 'Within 3 months'
when ISSUE_DUE_DATE>sysdate+90 then 'Beyond 90 days'
end)
select ISSUE_COUNT,MODULE_NAME, DATE_RANGE,
(select count(ISSUE_COUNT) from rs where rs.MODULE_NAME=MODULE_NAME) as total from rs;
The output of the code is as below.
ISSUE_COUNT  MODULE_NAME      DATE_RANGE       Total
1            IT-Compliance    Overdue          3
1            IT-Compliance    Within 3 months  3
1            Risk Assessment  Beyond 90 days   3
The result is correct up to the 3rd column. What I want in the 4th column is the total issue count for the given module name. So in the above case the Total column should have the value 2 for the first and second rows (since there are 2 issues for IT-Compliance) and 1 for the third row (since one issue is present for Risk Assessment).
Essentially, what I want is to reference the current row's MODULE_NAME in the last WHERE clause. How do I achieve this in the query?
OK, this condition
where rs.MODULE_NAME=MODULE_NAME
is essentially the same as if you wrote
where MODULE_NAME = MODULE_NAME
which is simply always true (if there are no nulls in module_name).
Try using different table aliases for the inner query and the outer query, e.g.:
select count(ISSUE_COUNT) from rs rs2 where rs2.MODULE_NAME=rs.MODULE_NAME
You can also use an analytic function here, something like:
select ISSUE_COUNT,
MODULE_NAME,
DATE_RANGE,
COUNT(ISSUE_COUNT) OVER (PARTITION BY RS.MODULE_NAME) AS TOTAL
from rs
instead of your subquery
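Putting that together with the CTE from the question, the full query might look like this (an untested sketch reusing the question's tables and column names):

with rs as
 (select count(ISSUE_ID) as ISSUE_COUNT, src.MODULE_NAME,
         case
           when ISSUE_DUE_DATE < sysdate then 'Overdue'
           when ISSUE_DUE_DATE between sysdate and sysdate + 90 then 'Within 3 months'
           when ISSUE_DUE_DATE > sysdate + 90 then 'Beyond 90 days'
         end as date_range
    from MS_ISM_ISSUE issue, MS_ISM_SOURCE_SETUP src
   where issue.ISSUE_SOURCE_TYPE = src.SOURCE_ID
   group by src.MODULE_NAME,
            case
              when ISSUE_DUE_DATE < sysdate then 'Overdue'
              when ISSUE_DUE_DATE between sysdate and sysdate + 90 then 'Within 3 months'
              when ISSUE_DUE_DATE > sysdate + 90 then 'Beyond 90 days'
            end)
select ISSUE_COUNT,
       MODULE_NAME,
       DATE_RANGE,
       -- COUNT(...) OVER counts rows per module; SUM(ISSUE_COUNT) OVER (PARTITION BY MODULE_NAME)
       -- would total the counts instead, which matters if a group's count exceeds 1
       COUNT(ISSUE_COUNT) OVER (PARTITION BY MODULE_NAME) AS TOTAL
  from rs;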

how to model data / indexes to find timeslices fast

We have a lot of tables in our database with data that is only relevant/valid during a certain period of time. For example contracts: they have a start_date and an end_date, and the period is not necessarily aligned to full months.
Now this is a typical type of query against this table:
SELECT
*
FROM
contracts c
WHERE
c.start_date <= :1
AND c.end_date >= :2
AND c.region_id = :3
Since we have 20 years of data in our table (~7000 days), the date is a very good filter criterion, especially when :1 and :2 are the same day. The region_id is not such a good filter criterion because there aren't that many regions (~50). In this example we have (among others) 2 indexes on our table:
contracts_valid_index (start_date, end_date)
contracts_region (region_id)
Unfortunately, the above query will often use the contracts_region index because the optimizer thinks it's cheaper. The reason is simple: when I pick a day in the middle of our data, the database thinks an index over start_date is not very useful because it only filters out half the data, and the same applies to end_date. So the optimizer assumes it can only filter out 1/4 of my data. It does not know that start_date and end_date are usually pretty close together, which would make this index very selective.
An execution plan using the contracts_valid_index has higher costs than an execution plan using contracts_region. But in reality the contracts_valid_index is a lot better.
I currently don't think I can speed up my queries with better indexes (other than deleting all but contracts_valid_index). But maybe my data model is not well suited to the query optimizer. I assume others have similar needs, so I would love to know how they modeled their data or optimized their tables / indexes.
Any suggestions?
Since you indicate you are using Oracle 12c, it may help to define your start_date and end_date columns as a temporal valid time period, provided they match the appropriate temporal validity semantics: start_date and end_date need to be timestamps, end_date must be greater than start_date or possibly null, and valid time periods include the start date but exclude the end date, i.e. a half-open range, unlike the usual BETWEEN operator which denotes a fully closed range. For example:
ALTER TABLE contracts ADD (PERIOD FOR valid_time (start_date, end_date));
You can then query the contracts table for a given validity period thusly:
SELECT
c.*
FROM
contracts VERSIONS PERIOD FOR valid_time BETWEEN :1 AND :2 c
WHERE
c.region_id = :3
This is semantically similar to:
SELECT
c.*
FROM
contracts c
WHERE
:1 < end_date
AND start_date <= :2
AND c.region_id = :3
Alternatively to query for records that are valid for a specific point in time rather than a range of time:
SELECT
c.*
FROM
contracts AS OF PERIOD FOR valid_time :1 c
WHERE
c.region_id = :2
which is semantically similar to:
SELECT
c.*
FROM
contracts c
WHERE
:1 BETWEEN start_date AND end_date
and :1 <> end_date
and c.region_id = :2
I'm not sure whether null values for start_date and end_date indicate the beginning and end of time respectively, since I don't currently have a 12c instance to test in.
I have previously come across the same problem of index usage in relation to large sets of IP addresses on MySQL databases (bear with me; it really is the same problem).
The solution I found (by much googling; I'm not taking the credit for inventing it) was to use a geospatial index. This is specifically designed to find data within ranges. Most implementations (including MySQL's) are hard-wired to a 2-dimensional space, while IP addresses and time are 1-dimensional, but it's trivial to map a 1-dimensional coordinate into a 2-dimensional space (see the link for a step-by-step explanation).
Sorry, I don't know anything about Oracle's geospatial capabilities so I can't offer any example code, but it does support geospatial indexing, so it can resolve your queries efficiently.
You could try the following query to see if it works better:
WITH t1 AS (
SELECT *
FROM contracts c
WHERE c.start_date <= :1
AND c.end_date >= :2
)
SELECT *
FROM t1
WHERE t1.region_id = :3
Though it will likely prevent any possibility of using the contracts_region index.
Alternatively you could try hinting the query to use the desired index:
SELECT /*+ INDEX(c contracts_valid_index) */
*
FROM
contracts c
WHERE
c.start_date <= :1
AND c.end_date >= :2
AND c.region_id = :3
Or hinting it to not use the undesired index:
SELECT /*+ NO_INDEX(c contracts_region) */
*
FROM
contracts c
WHERE
c.start_date <= :1
AND c.end_date >= :2
AND c.region_id = :3
When testing this out for myself without using hints, I found that when selecting dates near the start or end of the available date range, the optimizer's chosen plan corresponded to the INDEX_RS_ASC hint. Adding that hint to the query as shown below made my tests use the desired index even when the date range was closer to the middle of the data:
SELECT /*+ INDEX_RS_ASC(c contracts_valid_index) */
*
FROM
contracts c
WHERE
c.start_date <= :1
AND c.end_date >= :2
AND c.region_id = :3
My sample data consisted of 10,000,000 rows evenly distributed across 50 regions, spanning 1000 years, each with a 30-day valid range.
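For reference, a data set along those lines could be generated roughly like this (a sketch only; the column and index names come from the question, while the table creation, date range, and distribution are assumptions):

-- generate ~10M contracts with a random 30-day validity window spread over ~1000 years
create table contracts as
select rownum              as id,
       mod(rownum, 50) + 1 as region_id,
       d                   as start_date,
       d + 30              as end_date
from (
       select date '1020-01-01' + trunc(dbms_random.value(0, 365000)) as d
       from dual
       connect by level <= 10000000  -- memory-hungry; generate in chunks if necessary
     );

create index contracts_valid_index on contracts (start_date, end_date);
create index contracts_region      on contracts (region_id);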

How to avoid expensive Cartesian product using row generator

I'm working on a query (Oracle 11g) that does a lot of date manipulation. Using a row generator, I'm examining each date within a range of dates for each record in another table. Through another query, I know that my row generator needs to generate 8500 dates, and this amount will grow by 365 days each year. Also, the table that I'm examining has about 18000 records, and this table is expected to grow by several thousand records a year.
The problem comes when joining the row generator to the other table to get the range of dates for each record. SQL Tuning Advisor says there's an expensive Cartesian product, which makes sense given that the query currently could generate up to 8500 x 18000 rows. Here's the query in its stripped-down form, without all the date logic etc.:
with n as (
select level n
from dual
connect by level <= 8500
)
select t.id, t.origdate + n origdate
from (
select id, origdate, closeddate
from my_table
) t
join n on origdate + n - 1 <= closeddate -- here's the problem join
order by t.id, t.origdate;
Is there an alternate way to join these two tables without the Cartesian product?
I need to calculate the elapsed time for each of these records, disallowing weekends and federal holidays, so that I can sort on the elapsed time. Also, the pagination for the table is done server-side, so we can't just load into the table and sort client-side.
The maximum age of a record in the system right now is 3656 days, and the average is 560, so it's not quite as bad as 8500 x 18000; but it's still bad.
I've just about resigned myself to adding a field to store the opendays, computing it once and storing the elapsed time, and creating a scheduled task to update all open records every night.
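If you do end up with that nightly-update approach, the scheduled task could be set up with DBMS_SCHEDULER along these lines (a sketch only: the opendays column and the business_days_between function are hypothetical placeholders for your own business-day logic):

begin
  dbms_scheduler.create_job(
    job_name        => 'refresh_open_days',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'begin
                          update my_table
                             set opendays = business_days_between(origdate, sysdate)
                           where closeddate is null;
                          commit;
                        end;',
    start_date      => systimestamp,
    repeat_interval => 'FREQ=DAILY;BYHOUR=2',  -- run nightly at 02:00
    enabled         => true);
end;
/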
I think that you would get better performance if you rewrite the join condition slightly:
with n as (
select level n
from dual
connect by level <= 8500
)
select t.id, t.origdate + n origdate
from (
select id, origdate, closeddate
from my_table
) t
join n on n <= Closeddate - Origdate + 1 --you could even create a function-based index
order by t.id, t.origdate;
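For completeness, the function-based index alluded to in the comment might look like this (the index name is made up, and whether the optimizer actually uses it for this join depends on the resulting plan):

create index my_table_duration_idx on my_table (closeddate - origdate + 1);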

How to get records randomly from the oracle database?

I need to select rows randomly from an Oracle DB.
Ex: Assume a table with 100 rows; how can I randomly return 20 of those 100 records?
SELECT *
FROM (
SELECT *
FROM table
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum < 21;
SAMPLE() is not guaranteed to give you exactly 20 rows, but might be suitable (and may perform significantly better than a full query + sort-by-random for large tables):
SELECT *
FROM table SAMPLE(20);
Note: the 20 here is an approximate percentage, not the number of rows desired. In this case, since you have 100 rows, to get approximately 20 rows you ask for a 20% sample.
SELECT * FROM table SAMPLE(10) WHERE ROWNUM <= 20;
This is more efficient as it doesn't need to sort the table.
SELECT column FROM
( SELECT column, dbms_random.value FROM table ORDER BY 2 )
where rownum <= 20;
In summary, two approaches were introduced:
1) using an ORDER BY DBMS_RANDOM.VALUE clause
2) using the SAMPLE([%]) clause
The first approach has the advantage of correctness: you will never fail to get a result if one actually exists. With the second approach you may get no result even though rows satisfying the condition exist, since the data is reduced during sampling.
The second approach has the advantage of efficiency: you get the result faster and put less load on your database. I was given a warning from a DBA that my query using the first approach puts a heavy load on the database.
You can choose between the two according to your needs.
For huge tables, the standard approach of sorting by dbms_random.value is not effective because you need to scan the whole table, and dbms_random.value is a pretty slow function that requires context switches. For such cases there are 3 additional methods:
1: Use the SAMPLE clause:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/SELECT.html#GUID-CFA006CA-6FF1-4972-821E-6996142A51C6
for example:
select *
from s1 sample block(1)
order by dbms_random.value
fetch first 1 rows only
i.e. get 1% of all blocks, then sort them randomly and return just 1 row.
2: If you have an index/primary key on a column with evenly distributed values, you can get the min and max values, generate a random value in that range, and fetch the first row with a value greater than or equal to that randomly generated value.
Example:
--big table with 1 mln rows, primary key on ID with evenly distributed values:
Create table s1(id primary key,padding) as
select level, rpad('x',100,'x')
from dual
connect by level<=1e6;
select *
from s1
where id>=(select
dbms_random.value(
(select min(id) from s1),
(select max(id) from s1)
)
from dual)
order by id
fetch first 1 rows only;
3: Get a random table block, generate a rowid, and fetch the row from the table by that rowid:
select *
from s1
where rowid = (
select
DBMS_ROWID.ROWID_CREATE (
1,
objd,
file#,
block#,
1)
from
(
select/*+ rule */ file#,block#,objd
from v$bh b
where b.objd in (select o.data_object_id from user_objects o where object_name='S1' /* table_name */)
order by dbms_random.value
fetch first 1 rows only
)
);
To randomly select 20 rows I think you'd be better off selecting all of them in random order and taking the first 20 of that set.
Something like:
Select *
from (select *
from table
order by dbms_random.value) -- you can also use DBMS_RANDOM.RANDOM
where rownum < 21;
Best used for small tables to avoid selecting large chunks of data only to discard most of it.
Here's how to pick a random sample out of each group:
SELECT GROUPING_COLUMN,
MIN (COLUMN_NAME) KEEP (DENSE_RANK FIRST ORDER BY DBMS_RANDOM.VALUE)
AS RANDOM_SAMPLE
FROM TABLE_NAME
GROUP BY GROUPING_COLUMN
ORDER BY GROUPING_COLUMN;
I'm not sure how efficient it is, but if you have a lot of categories and sub-categories, this seems to do the job nicely.
-- Q. How to find a random 50% of the records in a table?
Use this when you want a percentage of rows chosen randomly:
SELECT *
FROM (
SELECT *
FROM table_name
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum <= (select count(*) from table_name) * 50/100;
