I recently got some help with an Oracle query and don't quite understand how it works, so I can't get it to work with my data. Can anyone explain the logic of what is happening, in logical steps, and which variables are actually taken from an existing table's columns?
I am looking to select data from a table of readings (column names: day, hour, volume) and find the average volume for each hour of each day (thus GROUP BY day, hour), by going back over all readings for that day/hour combination in the past (as far back as my dataset goes) and writing out the average for it. Once that is done, it will write the results to a different table with the same column names (day, hour, volume), except that when I write it back on a per-hour basis, 'volume' will be the average for that hour of that day in the past. For example, I want to find what the average was for all Wednesdays at 7pm in the past, and output that average as a new record.
Assuming these 3 columns are used, and in reference to the code below, I am not sure how "hours" differs from "hrs" or what the t1 variable represents. Any help is appreciated.
INSERT INTO avg_table (days, hours, avrg)
WITH xweek
AS (SELECT ds, LPAD (hrs, 2, '0') hrs
FROM ( SELECT LEVEL ds
FROM DUAL
CONNECT BY LEVEL <= 7),
( SELECT LEVEL - 1 hrs
FROM DUAL
CONNECT BY LEVEL <= 24))
SELECT t1.ds, t1.hrs, AVG (volume)
FROM xweek t1, tables t2
WHERE t1.ds = TO_CHAR (t2.day(+), 'D')
AND t1.hrs = t2.hour(+)
GROUP BY t1.ds, t1.hrs;
I'd re-write this slightly so it makes more sense (to me at least).
To break it down bit by bit, CONNECT BY is a hierarchical (recursive) query. This is a common "cheat" to generate rows; in this case it generates 7 rows to represent the days of the week, numbered 1 to 7.
SELECT LEVEL ds
FROM DUAL
CONNECT BY LEVEL <= 7
The next one generates the hours 0 to 23 to represent midnight to 11pm. These are then joined together in the old style in a Cartesian or CROSS JOIN. This means that every possible combination of rows is returned, i.e. it generates every hour of every day for a single week.
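To make the cross join concrete, here is a minimal standalone sketch (not part of the original answer) that cross joins the two generators; it returns 7 x 24 = 168 rows, one for every day/hour combination in a week:
SELECT d.ds, h.hrs
  FROM (SELECT LEVEL ds FROM dual CONNECT BY LEVEL <= 7) d
 CROSS JOIN (SELECT LEVEL - 1 hrs FROM dual CONNECT BY LEVEL <= 24) h
 ORDER BY d.ds, h.hrs;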
The WITH clause is described in the documentation on the SELECT statement, it is commonly known as a Common Table Expression (CTE), or in Oracle the Subquery Factoring Clause. This enables you to assign a name to a sub-query and reference that single sub-query in multiple places. It can also be used to keep code clean or generate temporary tables in memory for ready access. It's not required in this case but it does help to separate the code nicely.
Lastly, (+) is Oracle's old notation for outer joins. The old and ANSI notations are mostly equivalent, but there are a few very small differences that are described in this question and answer.
As I said at the beginning, I would re-write this to conform to the ANSI standard because I find it more readable:
insert into avg_table (days, hours, avrg)
with xweek as (
select ds, lpad(hrs, 2, '0') hrs
from ( select level ds
from dual
connect by level <= 7 )
cross join ( select level - 1 hrs
from dual
connect by level <= 24 )
)
select t1.ds, t1.hrs, avg(volume)
from xweek t1
left outer join tables t2
on t1.ds = to_char(t2.day, 'd')
and t1.hrs = t2.hour
group by t1.ds, t1.hrs;
To go into slightly more detail, t1 is an alias for the CTE xweek, so you don't have to type the entire name each time. hrs is an alias for the generated expression; as you reference it explicitly, you need to call it something. HOURS is a column in your own table (the avg_table you are inserting into).
As to whether this is doing the correct thing, I'm not sure; you imply you only want it for a single day rather than the entire week, so only you can decide if this is correct. I also find it a little strange that you need the HOURS column in your table to be a character value left-padded with 0s, lpad(hrs, 2, '0'); once again, only you know if this is correct.
I would highly recommend playing about with this yourself and working out how everything fits together. You also seem to be missing some of the basics; get a textbook or look around on the internet or Stack Overflow, there are plenty of examples.
Related
Execution time differs greatly between the two queries below. These are queries generated by an app using Entity Framework.
The first one is a non-parameterized query that takes 0.559 seconds:
SELECT
"Project1"."C2" AS "C1",
"Project1"."C1" AS "C2",
"Project1"."KEYFIELD" AS "KEYFIELD"
FROM ( SELECT
"Extent1"."KEYFIELD" AS "KEYFIELD",
CAST( "Extent1"."LOCALDT" AS date) AS "C1",
2 AS "C2"
FROM "MYTABLE" "Extent1"
WHERE (
("Extent1"."LOCALDT" >= to_timestamp('2017-01-01','YYYY-MM-DD')) AND
("Extent1"."LOCALDT" <= to_timestamp('2018-01-01','YYYY-MM-DD'))
)
) "Project1"
ORDER BY "Project1"."C1" DESC;
The other one has a parameterized WHERE clause and takes 18.372 seconds to fetch the data:
SELECT
"Project1"."C2" AS "C1",
"Project1"."C1" AS "C2",
"Project1"."KEYFIELD" AS "KEYFIELD"
FROM ( SELECT
"Extent1"."KEYFIELD" AS "KEYFIELD",
CAST( "Extent1"."LOCALDT" AS date) AS "C1",
2 AS "C2"
FROM "MYTABLE" "Extent1"
WHERE (
("Extent1"."LOCALDT" >= :p__linq__0) AND
("Extent1"."LOCALDT" <= :p__linq__1)
)
) "Project1"
ORDER BY "Project1"."C1" DESC;
I know that parameterized queries are pretty useful for caching. How can I find the way to improve the performance of the parameterized query?
"parameterized queries are pretty useful for caching"
Just to be clear, when we use bind variables what gets cached is the parsed query and the execution plan. The assumption is that given a query like ...
where col1 = :p1
and col2 = :p2
... the same plan works as well when :p1 = 23 and :p2 = 42 as when :p1 = 42 and :p2 = 23. If our data has an even distribution then the assumption holds good. But if our data has some form of skew, we may end up with a plan which works well for one specific combination of values but is rubbish for most of the other queries our users need to run. This is the downside of bind variable peeking: the optimizer peeks at the first set of values it is given, builds a plan for those, and then reuses that plan for subsequent executions.
Date range queries are a notorious case in point. Your first query provides literal values that match records for a well-defined range, which presumably retrieves a narrow slice of the table. With the second query, however, the specified date range could be anything: a day, a week, a month, a year; well, you get the picture.
The upshot is, an index range scan could be very efficient for the first query and shocking for the second.
To understand more you need to explore the specific query:
Run explain plans for the two versions of the query and understand the differences. (Make sure you're working with realistic, production-like data: not just the volumes but the distribution and skew as well.)
Check the statistics are accurate, and consider whether refreshing them might help.
Understand the skew of the data, and check whether you are suffering from bind variable peeking. Perhaps you need to look at adaptive cursors.
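If you want to check whether this is what is happening, a hedged sketch along these lines may help (the LIKE filter is just an assumed way of locating your statement, and 'ALLSTATS LAST' only shows actual row counts if the query ran with the gather_plan_statistics hint or statistics_level = all):
-- See whether Oracle has marked the cursor bind-sensitive / bind-aware
-- (adaptive cursor sharing, 11g and later).
SELECT sql_id, child_number, is_bind_sensitive, is_bind_aware, executions
  FROM v$sql
 WHERE sql_text LIKE '%MYTABLE%'      -- substitute a fragment of your own SQL
   AND sql_text NOT LIKE '%v$sql%';
-- Then compare the actual plan of each child cursor.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id', NULL, 'ALLSTATS LAST'));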
Alternatively you may need to avoid using bind variables. Especially with date ranged queries on large tables it is not unusual to pass actual values for the date arguments. The cost of parsing the query each time it is executed is offset by getting the best plan for each set of parameters.
In short, we should understand our data and the way our users need to work with it, then write queries accordingly.
The workitem_routing_stats table has around 1,000,000 records. All records are accessed, which is why we are using the full scan hint. The query takes around 25 seconds to execute. Is there any way to tune it?
SELECT /*+ full(wrs) */
wrs.NODE_ID,
wrs.bb_id,
SUM(CASE WHEN WRS.START_TS >= (SYSTIMESTAMP-NUMTODSINTERVAL(7,'day'))
AND wrs.END_TS <= SYSTIMESTAMP THEN (wrs.WORKITEM_COUNT) END) outliers_last_sevend,
SUM(CASE WHEN WRS.START_TS >= (SYSTIMESTAMP-NUMTODSINTERVAL(30,'day'))
AND wrs.END_TS <= SYSTIMESTAMP THEN (wrs.WORKITEM_COUNT) END)
outliers_last_thirtyd ,
SUM(CASE WHEN WRS.START_TS >= (SYSTIMESTAMP-NUMTODSINTERVAL(90,'day'))
AND wrs.END_TS <= SYSTIMESTAMP THEN (wrs.WORKITEM_COUNT) END)
outliers_last_ninetyd ,
SUM(wrs.WORKITEM_COUNT) outliers_year
FROM workitem_routing_stats wrs
WHERE wrs.START_TS BETWEEN (SYSTIMESTAMP-numtodsinterval(365,'day')) AND SYSTIMESTAMP
AND wrs.END_TS BETWEEN (SYSTIMESTAMP-numtodsinterval(365,'day')) AND SYSTIMESTAMP
GROUP BY wrs.NODE_ID,wrs.bb_id ;
You may range partition the table monthly on the START_TS column (the query will then scan only the year you are interested in).
Secondly (not a very intelligent solution), you may add a parallel(wrs 4) hint if your storage is powerful.
You can combine these two things; a sketch of the partitioning DDL follows.
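This is only a sketch, with the column list and data types assumed from the query in the question; on 11g and later you can use interval partitioning so new monthly partitions are created automatically:
-- Assumed structure: partition by month on START_TS so the one-year
-- predicate only touches ~12 partitions instead of the whole table.
CREATE TABLE workitem_routing_stats_part (
  node_id        NUMBER,
  bb_id          NUMBER,
  workitem_count NUMBER,
  start_ts       TIMESTAMP,
  end_ts         TIMESTAMP
)
PARTITION BY RANGE (start_ts)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(PARTITION p_hist VALUES LESS THAN (TIMESTAMP '2014-01-01 00:00:00'));
-- The aggregation query itself stays the same; add /*+ parallel(wrs 4) */
-- if the storage can keep up.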
A full scan is going to be painful in any case...
However, you may avoid some computation if you simply put in the proper numbers instead of calling the conversion functions:
(SYSTIMESTAMP - numtodsinterval(365,'day'))
should be just the same as
(SYSTIMESTAMP - 365)
This should remove the overhead of calling the function and parsing the parameter string ('day').
One other possibility: it seems that this data keeps gaining new timestamps as of today, but the rest is just history...
If this is the case, then you could add a summary table to hold the summarized historic information, query the current table only for the recent stuff, and UNION it to the summary table for the older stuff.
You will then need to think through the job or other scheduled process that keeps the summaries populated, but it would save you a ton of time in this query.
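As a rough sketch of that idea (workitem_routing_stats_hist and the seven-day cut-off are assumed names and values, and only one of the sums is shown):
-- Recent rows come from the live table, older rows from a pre-aggregated
-- history table that a scheduled job keeps up to date.
SELECT node_id, bb_id, SUM(workitem_count) AS outliers_year
  FROM (SELECT node_id, bb_id, workitem_count
          FROM workitem_routing_stats
         WHERE start_ts >= SYSTIMESTAMP - NUMTODSINTERVAL(7, 'day')
        UNION ALL
        SELECT node_id, bb_id, workitem_count
          FROM workitem_routing_stats_hist
         WHERE start_ts <  SYSTIMESTAMP - NUMTODSINTERVAL(7, 'day')
           AND start_ts >= SYSTIMESTAMP - NUMTODSINTERVAL(365, 'day'))
 GROUP BY node_id, bb_id;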
We are running into a performance issue where I need some suggestions (we are on Oracle 10g R2).
The situation is something like this:
1) It is a legacy system.
2) Some of the tables hold data for the last 10 years (meaning data has never been deleted since the first version was rolled out). Most of the OLTP tables now have around 30,000,000 - 40,000,000 rows.
3) Search operations on these tables take a flat 5-6 minutes. (A simple query like select count(0) from xxxxx where isActive='Y' takes around 6 minutes.) When we looked at the explain plan we found that an index scan is happening on the isActive column.
4) We have suggested archiving and purging the old data which is not needed, and the team is working towards it. Even if we delete 5 years of data we are left with around 15,000,000 - 20,000,000 rows in the tables, which is itself very large, so we thought of partitioning these tables. However, we found that the user can search on most of the columns of these tables from the UI, which would defeat the very purpose of table partitioning.
So what steps need to be taken to improve this situation?
First of all: question why you are issuing the query select count(0) from xxxxx where isactive = 'Y' in the first place. Nine out of ten times it is a lazy way to check for the existence of a record. If that's the case with you, just replace it with a query that selects 1 row (rownum = 1 and a first_rows hint).
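A minimal sketch of that kind of probe, using the table and column names from the question:
SELECT /*+ first_rows(1) */ 1
  FROM xxxxx
 WHERE isActive = 'Y'
   AND ROWNUM = 1;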
The number of rows you mention is nothing to be worried about. If your application doesn't perform well when the number of rows grows, then your system is not designed to scale. I'd investigate all queries that take too long using SQL*Trace or ASH and fix them.
By the way: nothing you mentioned justifies the term legacy, IMHO.
Regards,
Rob.
Just a few observations:
I'm guessing that the "isActive" column can have two values - 'Y' and 'N' (or perhaps 'Y', 'N', and NULL - although why in the name of Fred there wouldn't be a NOT NULL constraint on such a column escapes me). If this is the case an index on this column would have very poor selectivity and you might be better off without it. Try dropping the index and re-running your query.
#RobVanWijk's comment about use of SELECT COUNT(*) is excellent. ONLY ask for a row count if you really need to have the count; if you don't need the count, I've found it's faster to do a direct probe (SELECT whatever FROM wherever WHERE somefield = somevalue) with an appropriate exception handler than it is to do a SELECT COUNT(*). In the case you cited, I think it would be better to do something like
DECLARE
  strIsActive           MY_TABLE.IS_ACTIVE%TYPE;
  bActive_records_found BOOLEAN;
BEGIN
  -- Probe for an active record; we don't care how many there actually are.
  SELECT IS_ACTIVE
    INTO strIsActive
    FROM MY_TABLE
   WHERE IS_ACTIVE = 'Y';
  bActive_records_found := TRUE;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    bActive_records_found := FALSE;
  WHEN TOO_MANY_ROWS THEN
    -- More than one matching row still means active records exist.
    bActive_records_found := TRUE;
END;
As to partitioning - partitioning can be effective at reducing query times IF the column on which the table is partitioned is used in all queries. For example, if a table is partitioned on the TRANSACTION_DATE column, then for the partitioning to make a difference all queries against this table would have to have a TRANSACTION_DATE test in the WHERE clause. Otherwise the database will have to search each partition to satisfy the query, so I doubt any improvements would be noted.
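A small sketch of the difference, with table and column names assumed:
-- Assuming my_table is range partitioned on TRANSACTION_DATE:
-- this query can be pruned to a single partition ...
SELECT COUNT(*)
  FROM my_table
 WHERE transaction_date >= DATE '2012-01-01'
   AND transaction_date <  DATE '2012-02-01';
-- ... whereas a predicate only on another column (e.g. isActive = 'Y')
-- forces the database to visit every partition.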
Share and enjoy.
I have two tables for the time dimension:
date (unique row for each day)
time of the day (unique row for each minute in a day)
Given this schema, what would a query look like if one wants to retrieve facts for the last X hours, where X can be any number greater than 0?
Things start to become tricky when the start time and end time happen to be in two different days of the year.
EDIT: My fact table does not have a timestamp column.
Fact tables do have (and should have) an original timestamp, in order to avoid weird by-time queries that happen over the boundary of a day. Weird here means having some type of complicated date-time function in the WHERE clause.
In most DWs these types of queries are very rare, but you seem to be streaming data into your DW and using it for reporting at the same time.
So I would suggest:
Introduce the full timestamp in the fact table.
For the old records, re-create the timestamp from the Date and Time keys.
DW queries are all about not having any functions in the WHERE clause, or if a function has to be used, make sure it is SARGABLE.
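For example (a hedged sketch, with the fact table and column names assumed), once the fact table carries a real timestamp the "last X hours" question becomes a plain sargable range predicate:
SELECT f.*
  FROM fact_events f                -- hypothetical fact table with event_ts
 WHERE f.event_ts >= SYSTIMESTAMP - NUMTODSINTERVAL(:x_hours, 'HOUR')
   AND f.event_ts <  SYSTIMESTAMP;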
You would probably be better served by converting the Start Date and End Date columns to TIMESTAMP and populating them.
Slicing the table would require taking the appropriate interval BETWEEN Start Date AND End Date. In Oracle the interval would be something along the lines of SYSDATE - (4/24) or SYSDATE - NUMTODSINTERVAL(4, 'HOUR')
This could also be rewritten as:
Start Date <= (SYSDATE - (4/24)) AND End Date >= (SYSDATE - (4/24))
It seems to me that, given the current schema you have, you will need to retrieve the appropriate time IDs from the time dimension table which meet your search criteria, and then search for matching rows in the fact table. Depending on the granularity of your time dimension, you might want to check the performance of doing either (SQL Server examples):
A subselect:
SELECT X FROM FOO WHERE TIMEID IN (SELECT ID FROM DIMTIME WHERE HOUR >= DATEPART(HOUR, CURRENT_TIMESTAMP)) AND DATEID IN (SELECT ID FROM DIMDATE WHERE DATE = GETDATE())
An inner join:
SELECT X FROM FOO INNER JOIN DIMTIME ON FOO.TIMEID = DIMTIME.ID INNER JOIN DIMDATE ON FOO.DATEID = DIMDATE.ID WHERE DIMTIME.HOUR >= DATEPART(HOUR, CURRENT_TIMESTAMP) AND DIMDATE.DATE = GETDATE()
Neither of these are truly attractive options.
Have you considered that you may be querying against a cube that is intended for roll-up analysis and not necessarily for "last X" analysis?
If this is not a "roll-up" cube, I would agree with the other posters in that you should re-stamp your fact tables with better keys, and if you do in fact intend to search off of hour frequently, you should probably include that in the fact table as well, as any other attempt will probably make the query non-sargable (see What makes a SQL statement sargable?).
Microsoft recommends at http://msdn.microsoft.com/en-us/library/aa902672%28v=sql.80%29.aspx that:
In contrast to surrogate keys used in other dimension tables, date and time dimension keys should be "smart." A suggested key for a date dimension is of the form "yyyymmdd". This format is easy for users to remember and incorporate into queries. It is also a recommended surrogate key format for fact tables that are partitioned into multiple tables by date.
Best of luck!
OK, I'm new to this CONNECT BY thing, but it's always quite useful. I have this small problem you guys might be able to help me with...
Given a start month (say to_char(sysdate,'YYYYMM')) and an end month (say to_char(add_months(sysdate, 6),'YYYYMM')), I want to get the list of months in between, in the same format.
I want to use this in a partition automation script. My best shot so far (pretty pitiful) yields invalid months, e.g. '201034'... (and yeah, I know, it's incredibly inefficient).
The code follows:
SELECT id
from
(select to_char(add_months(sysdate, 6),'YYYYMM') as tn_end, to_char(sysdate,'YYYYMM') as tn_start from dual) tabla,
(select * from
(Select Level as Id from dual connect by Level <= (Select to_char(add_months(sysdate, 1),'YYYYMM')from dual)) where id > to_char(sysdate,'YYYYMM')) t
Where
t.Id between tabla.tn_start and tabla.tn_end
What do I do to make this query return only valid months? Any tips?
cheers mates,
f.
The best way might be to separate the row generator from the date function: generate a list from 0 to 6 and calculate the months from that. If you want to pass the months in, then do that in the WITH clause:
with my_counter as (
Select Level-1 as id
from dual
connect by Level <= 7
)
select to_char(add_months(sysdate, id),'YYYYMM') from my_counter
The example below will allow you to plug in the dates you require to work out the difference.
with my_counter as (
Select Level-1 as id
from dual
connect by level <= months_between(add_months(trunc(sysdate,'MM'), 6),
trunc(sysdate,'MM')) + 1
)
select to_char(add_months(trunc(sysdate, 'MM'), id),'YYYYMM') from my_counter
For generating dates and date ranges, I strongly suggest you create a permanent calendar table with one row for each day. Even if you keep 20 years in this table it will be small, roughly 7,500 rows. Having such a table lets you attach additional (potentially non-standard) information to a date. For example, your company may use a 6-week reporting period which you cannot extract using TO_CHAR / TO_DATE; pre-compute it and store it in this table.
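A sketch of how such a table could be generated (the table name, start date, and 20-year window are assumed; add whatever extra attributes your business needs):
CREATE TABLE calendar_dim AS
SELECT DATE '2000-01-01' + LEVEL - 1                     AS cal_date,
       TO_CHAR(DATE '2000-01-01' + LEVEL - 1, 'YYYYMM')  AS cal_month,
       TO_CHAR(DATE '2000-01-01' + LEVEL - 1, 'D')       AS day_of_week
  FROM dual
CONNECT BY LEVEL <= 7305;   -- roughly 20 years of days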
Oh, and Oracle 11g has automatic partition management. If you are stuck with 10g, then this article may be of interest to you: Automatic Partition Management for Oracle 10g.
Try this:
with numbers as
( select level as n from dual
connect by level <= 7
)
select to_char (add_months (trunc(sysdate,'MM'), n-1), 'YYYYMM') id
from numbers;
ID
------
201012
201101
201102
201103
201104
201105
201106