My application allows users to collect measurement data as part of an experiment, and needs to have the ability to report on all of the measurements ever taken.
Below is a very simplified version of the tables I have:
CREATE TABLE EXPERIMENTS(
    EXPT_ID   INT,
    EXPT_NAME VARCHAR2(255 CHAR)
);

CREATE TABLE USERS(
    USER_ID INT,
    EXPT_ID INT
);

CREATE TABLE SAMPLES(
    SAMPLE_ID INT,
    USER_ID   INT
);

CREATE TABLE MEASUREMENTS(
    MEASUREMENT_ID          INT,
    SAMPLE_ID               INT,
    MEASUREMENT_PARAMETER_1 NUMBER,
    MEASUREMENT_PARAMETER_2 NUMBER
);
In my database there are 2000 experiments, each of which has 18 users. Each user has 6 samples to measure, and would do 100 measurements per sample.
This means that there are 2000 * 18 * 6 * 100 = 21600000 measurements currently stored in the database.
I'm trying to write a query that will get the AVG() of measurement parameter 1 and 2 for each user - that would return about 36,000 rows.
The query I have is extremely slow - I've left it running for over 30 minutes and it doesn't come back with anything. My question is: is there an efficient way of getting the averages? And is it actually possible to get results back for this amount of data in a reasonable time, say 2 minutes? Or am I being unrealistic?
Here's (again a simplified version) the query I have:
SELECT
    E.EXPT_ID,
    U.USER_ID,
    AVG(MEASUREMENT_PARAMETER_1) AS AVG_1,
    AVG(MEASUREMENT_PARAMETER_2) AS AVG_2
FROM
    EXPERIMENTS E,
    USERS U,
    SAMPLES S,
    MEASUREMENTS M
WHERE
    U.EXPT_ID = E.EXPT_ID
    AND S.USER_ID = U.USER_ID
    AND M.SAMPLE_ID = S.SAMPLE_ID
GROUP BY
    E.EXPT_ID, U.USER_ID
This will return a row for each expt_id/user_id combination and the average of the 2 measurement parameters.
For your query, in any case, the DBMS needs to read the complete MEASUREMENTS table. This is by far the biggest part of the data to read, and the part that takes most of the time if the query is well optimized (more on that below). That means the minimum runtime of your query is roughly the time it takes to read the complete MEASUREMENTS table from wherever it is stored. You can get a rough estimate by checking how much data that is (in MB or GB) and how long it takes to read that amount of data from the hard disk (or wherever the table is stored). If your query runs slower than that by a factor of 5 or more, you can be sure there is room for optimization.
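As a rough sketch of that size check (assuming the table lives in your own schema; otherwise query DBA_SEGMENTS):

-- Approximate on-disk size of the MEASUREMENTS table, in MB
SELECT ROUND(SUM(bytes) / 1024 / 1024) AS size_mb
FROM   user_segments
WHERE  segment_name = 'MEASUREMENTS';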
There is a vast amount of information (tutorials, individual hints which can be invaluable, and lists of general practices) about how to optimize Oracle queries, and you will not get through it all quickly. But if you provide the execution plan of your query (this is what Oracle's query optimizer thinks is the best way to fulfill your query), we will be able to spot the steps that can be optimized and suggest solutions.
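For example, one common way to produce that plan (a sketch using your query as written):

EXPLAIN PLAN FOR
SELECT E.EXPT_ID, U.USER_ID,
       AVG(MEASUREMENT_PARAMETER_1) AS AVG_1,
       AVG(MEASUREMENT_PARAMETER_2) AS AVG_2
FROM EXPERIMENTS E, USERS U, SAMPLES S, MEASUREMENTS M
WHERE U.EXPT_ID = E.EXPT_ID
  AND S.USER_ID = U.USER_ID
  AND M.SAMPLE_ID = S.SAMPLE_ID
GROUP BY E.EXPT_ID, U.USER_ID;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);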
I'm calculating a cumulative total in DAX like this:
DEFINE MEASURE 'Sales'[Running Total] =
    CALCULATE (
        SUM('Sales'[Revenue]),
        FILTER(ALL('Date'[Date]), 'Date'[Date] <= MAX('Date'[Date]))
    )
This should be a well-established pattern (at least it is referenced here: http://www.daxpatterns.com/cumulative-total/).
My problem is that when I try to evaluate it like this:

EVALUATE
SUMMARIZECOLUMNS(
    'Date'[Date],
    "Total_Revenue_By_Date",
    'Sales'[Running Total]
)
I run into this error:
The resultset of a query to external data source has exceeded the maximum allowed size of '1000000' rows.
I'm using a tabular model with DirectQuery. I know I can raise the limit, but the underlying tables are small - the Date table has around 10,000 rows and the Sales table has around 10,000 rows as well (it will be much larger in production) - so something here doesn't scale well.
I have an idea of how to get away with calculating running totals at the SQL level; any ideas on how to tackle this at the DAX level?
Models created by Power BI Desktop have a default limit of 1 million rows.
This might help you:
https://www.sqlbi.com/articles/tuning-query-limits-for-directquery/
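For reference, the SQL-level fallback mentioned in the question could be a plain window function (a sketch; the sales table and its sale_date/revenue columns are placeholder names):

SELECT
    sale_date,
    SUM(daily_revenue) OVER (ORDER BY sale_date) AS running_total
FROM (
    SELECT sale_date, SUM(revenue) AS daily_revenue
    FROM sales
    GROUP BY sale_date
) t
ORDER BY sale_date;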
I want to get hourly average temperatures from a table of thermometer readings with the row structure thermometer_id, timestamp (float, Julian days), value (float), plus an ascending index on timestamp.
To get the whole day from 4 days ago, I'm using this query:
SELECT
    ROUND(AVG(value), 2),  -- average temperature
    COUNT(*)               -- count of readings
FROM reads
WHERE
    timestamp >= (julianday(date('now')) - 5)  -- between 5 days...
    AND
    timestamp < (julianday(date('now')) - 4)   -- ...and 4 days ago
GROUP BY CAST(timestamp * 24 AS int)           -- make hours from floats, group by hours
It does work well, but it is very slow: for a 9 MB database with 355k rows it takes more than half a second to finish, which is confusingly long - it shouldn't take more than a few tens of ms. That is on not particularly fast hardware (though not an SSD), and I'm preparing it to run on a Raspberry Pi, which is quite slow in comparison; on top of that the table will grow by about 80k rows per day.
EXPLAIN QUERY PLAN reveals the reason:
"USE TEMP B-TREE FOR GROUP BY"
I've tried adding day and hour columns with indexes just for the sake of quick access, but the GROUP BY still didn't use any of the indexes.
How can I tune this query or database to make this query faster?
If an index is used to optimize the GROUP BY, the timestamp search can no longer be optimized (except by using the skip-scan optimization, which your old SQLite might not have). And going through all rows in reads, only to throw most of them away because of a non-matching timestamp, would not be efficient.
If SQLite doesn't automatically do the right thing, even after running ANALYZE, you can try to force it to use a specific index:
CREATE INDEX rhv ON reads(hour, value);
SELECT ... FROM reads INDEXED BY rhv WHERE timestamp ... GROUP BY hour;
But this is unlikely to result in a query plan that is actually faster.
As #colonel-thirty-two commented, the problem was the cast and multiplication in GROUP BY CAST(timestamp * 24 as int). Such grouping completely bypasses the index, hence the slow query time. When I used an hour column both for the time comparison and for the grouping, the query finished immediately.
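A sketch of that fix (assuming hour is an integer column maintained on insert as CAST(timestamp * 24 AS int); the index name is illustrative):

CREATE INDEX idx_reads_hour_value ON reads(hour, value);

SELECT
    ROUND(AVG(value), 2),  -- average temperature
    COUNT(*)               -- count of readings
FROM reads
WHERE hour >= CAST((julianday(date('now')) - 5) * 24 AS int)
  AND hour <  CAST((julianday(date('now')) - 4) * 24 AS int)
GROUP BY hour;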
I am investigating a problem where our application takes too much time to get data from an Oracle database. In my investigation, I found that the slowness traces back to the join between the tables and to the aggregate function SUM.
This may look simple, but I am not good at SQL query optimization.
The query is below
SELECT T1.TONNES, SUM(R.TONNES) AS TOTAL_TONNES
FROM
    RECLAIMED R,
    (SELECT DELIVERY_OUT_ID, SUM(TONNES) AS TONNES
     FROM RECLAIMED
     WHERE DELIVERY_IN_ID = 53773
     GROUP BY DELIVERY_OUT_ID) T1
WHERE
    R.DELIVERY_OUT_ID = T1.DELIVERY_OUT_ID
GROUP BY
    T1.TONNES
SUM(R.TONNES) is the total tonnes per delivery out.
SUM(TONNES) is the total tonnes per delivery in.
My table looks like
I have 16 million entries in this table, and trying several delivery_in_ids, on average the query takes about 6 seconds to come back.
I have a similar database (a complete copy, but with only 4 million entries), and when the same query is run there it returns in less than 1 second.
They both have the same indexes, so I am confident the index is not the problem.
I am certain it is just the data volume that weighs on the first database (16 million rows). I have a feeling that once this query is optimized, the problem will be solved.
Open for suggestions : )
Are the two databases on the same server? If not, first compare the machine configuration, settings and running applications.
If there are no differences, check whether you have NULL values in the column you want to SUM; if so, use the NVL function to improve your query.
Also, you may analyze (or rebuild) the index; it cleans up the index and is quite fast and safe for your data.
If that does not help, check whether the TABLESPACE of your table is close to full. It might have some impact... but I am not sure.
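For example (a sketch; the index name is a placeholder):

-- Treat NULL tonnes as zero in the aggregation
SELECT DELIVERY_OUT_ID, SUM(NVL(TONNES, 0)) AS TONNES
FROM RECLAIMED
WHERE DELIVERY_IN_ID = 53773
GROUP BY DELIVERY_OUT_ID;

-- Rebuild an index (hypothetical name)
ALTER INDEX RECLAIMED_IDX REBUILD;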
;-)
I've solved the performance problem by updating the stored procedure: it now filters the first table before joining to the second one. Below is the resulting query:
SELECT R.DELIVERY_IN_ID, R.DELIVERY_OUT_ID, SUM(R.TONNES),
       (SELECT SUM(TONNES) AS TONNES
        FROM RECLAIMED
        WHERE DELIVERY_OUT_ID = R.DELIVERY_OUT_ID) AS TOTAL_TONNES
FROM CTSBT_RECLAIMED R
WHERE DELIVERY_IN_ID = 53733
GROUP BY DELIVERY_IN_ID, R.DELIVERY_OUT_ID
The gain in timing/performance is huge in my case, since I am joining a huge table (16M rows): the query now finishes in less than a second. I suspect the slowness was because one side of the join had no index (see T1); even though in my case it only holds about 20 rows, that is enough to slow the query down because it is compared against 16 million entries.
The optimized query filters the 16 million rows first and merges with T1 afterwards.
Is there a better way to optimize this? Probably. But I am happy with this result and it solved what I intended to solve. Now moving on.
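For completeness, a possible single-pass variant (an untested sketch that assumes the same RECLAIMED table and keeps the literal DELIVERY_IN_ID from above):

SELECT R.DELIVERY_OUT_ID,
       SUM(CASE WHEN R.DELIVERY_IN_ID = 53733 THEN R.TONNES END) AS TONNES,
       SUM(R.TONNES) AS TOTAL_TONNES
FROM RECLAIMED R
WHERE R.DELIVERY_OUT_ID IN (SELECT DELIVERY_OUT_ID
                            FROM RECLAIMED
                            WHERE DELIVERY_IN_ID = 53733)
GROUP BY R.DELIVERY_OUT_ID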
Thanks for those who commented.
I have to sum a huge amount of data with aggregation and a WHERE clause, using the query below.
What I am doing is this: I have three tables; one contains terms, the second contains user terms, and the third contains the correlation factor between a term and a user term.
I want to calculate the similarity between the sentence the user inserted and the already existing sentences by summing the correlation factors between the sentences' terms, and keep the results greater than 0.5.
The problem is that this query takes more than 15 minutes, because I have huge tables.
Any suggestions to improve performance, please?
INSERT INTO plag_sentence_similarity
SELECT plag_terms.sentence_id,
       plag_user_terms.sentence_id,
       LEAST(SUM(plag_term_correlations3.correlation_factor) / plag_terms.sentence_length,
             SUM(plag_term_correlations3.correlation_factor) / plag_user_terms.sentence_length),
       plag_terms.isn,
       plag_user_terms.isn
FROM plag_term_correlations3,
     plag_terms,
     plag_user_terms
WHERE plag_terms.term_root = plag_term_correlations3.term1
  AND plag_user_terms.term_root = plag_term_correlations3.term2
  AND plag_user_terms.isn = 123
GROUP BY plag_user_terms.sentence_id, plag_terms.sentence_id, plag_terms.isn,
         plag_terms.sentence_length, plag_user_terms.sentence_length, plag_user_terms.isn
HAVING LEAST(SUM(plag_term_correlations3.correlation_factor) / plag_terms.sentence_length,
             SUM(plag_term_correlations3.correlation_factor) / plag_user_terms.sentence_length) > 0.5;
plag_terms contains more than 50 million records and plag_term_correlations3 contains 500,000.
If you have a sufficient amount of free disk space, then create a materialized view (a sketch follows after this list):
- over the join of the three tables,
- fast-refreshable on commit (don't use the ANSI join syntax here, even if tempted to do so, or the mview won't be fast-refreshable ... a strange bug in Oracle),
- with query rewrite enabled,
- properly physically organized for quick calculations.
The query rewrite is optional. If you can modify the above insert-select, then you can just select from the materialized view instead of selecting from the join of the three tables.
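A minimal sketch of such a materialized view (the view and log definitions are illustrative; the exact options required for fast refresh on commit depend on your Oracle version):

CREATE MATERIALIZED VIEW LOG ON plag_terms WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON plag_user_terms WITH ROWID;
CREATE MATERIALIZED VIEW LOG ON plag_term_correlations3 WITH ROWID;

CREATE MATERIALIZED VIEW plag_similarity_mv
  BUILD IMMEDIATE
  REFRESH FAST ON COMMIT
  ENABLE QUERY REWRITE
AS
SELECT t.ROWID AS t_rid,
       u.ROWID AS u_rid,
       c.ROWID AS c_rid,
       t.sentence_id     AS term_sentence_id,
       u.sentence_id     AS user_sentence_id,
       t.sentence_length AS term_sentence_length,
       u.sentence_length AS user_sentence_length,
       t.isn             AS term_isn,
       u.isn             AS user_isn,
       c.correlation_factor
FROM plag_terms t, plag_user_terms u, plag_term_correlations3 c
WHERE t.term_root = c.term1
  AND u.term_root = c.term2;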
As for the physical organization, consider (a sketch of the partitioning option follows below):
- hash partitioning by Plag_User_Terms.ISN (with a sufficiently high number of partitions; don't hesitate to partition your table with e.g. 1024 partitions, if it seems reasonable) if you want to do a bulk calculation over all values of ISN,
- single-table hash clustering by Plag_User_Terms.ISN if you want to retain your calculation over a single ISN.
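A hedged sketch of the hash-partitioning option only (the new table name is a placeholder):

CREATE TABLE plag_user_terms_part
PARTITION BY HASH (isn) PARTITIONS 1024
AS SELECT * FROM plag_user_terms;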
If you don't have spare disk space, then just hint your query to:
- either use nested loops joins, since the number of rows processed seems to be quite low (judging by the estimations in the execution plan),
- or full-scan the plag_term_correlations3 table in parallel.
Bottom line: constrain your tables with foreign keys, check constraints, not-null constraints, unique constraints, everything! The Oracle optimizer is capable of using most of this information to its advantage, as are the people who tune SQL queries.
I have a series of extremely similar queries that I run against a table of 1.4 billion records (with indexes); the only problem is that at least 10% of those queries take > 100x more time to execute than the others.
I ran an explain plan and noticed that for the fast queries (roughly 90%) Oracle is using an index range scan; on the slow ones, it's using a full index scan.
Is there a way to force Oracle to do an index range scan?
To "force" Oracle to use an index range scan, simply use an optimizer hint INDEX_RS_ASC. For example:
CREATE TABLE mytable (a NUMBER NOT NULL, b NUMBER NOT NULL, c CHAR(10)) NOLOGGING;
INSERT /*+ APPEND */ INTO mytable(a,b,c)
SELECT level, mod(level,100)+1, 'a' FROM dual CONNECT BY level <= 1E6;
CREATE INDEX myindex_ba ON mytable(b, a);
EXECUTE dbms_stats.gather_table_stats(NULL,'mytable');
SELECT /*+ FULL(m) */ b FROM mytable m WHERE b=10; -- full table scan
SELECT /*+ INDEX_RS_ASC(m) */ b FROM mytable m WHERE b=10; -- index range scan
SELECT /*+ INDEX_FFS(m) */ b FROM mytable m WHERE b=10; -- index fast full scan
Whether this will make your query actually run faster depends on many factors like the selectivity of the indexed value or the physical order of the rows in your table. For instance, if you change the query to WHERE b BETWEEN 10 AND <xxx>, the following costs appear in the execution plans on my machine:
b BETWEEN 10 AND ...     10     20     40     80
FULL                    749    750    751    752
INDEX_RS_ASC             29    325    865   1943
INDEX_FFS               597    598    599    601
If you change the query very slightly to select not only the indexed column b but also other, non-indexed columns, the costs change dramatically:
b BETWEEN 10 AND ...      10       20       40       80
FULL                     749      750      751      754
INDEX_RS_ASC            3352    40540   108215   243563
INDEX_FFS               3352    40540   108215   243563
I suggest the following approach:
- Get an explain plan for the slow statement.
- Using an INDEX hint, get an explain plan for the same statement using the index.
You'll notice that the cost of the INDEX plan is greater; this is why Oracle is not choosing the index plan. The cost is Oracle's estimate based on the statistics it has and various assumptions.
If the estimated cost of a plan is greater but the plan actually runs more quickly, then the estimate is wrong. Your job is to figure out why the estimate is wrong and correct that. Then Oracle will choose the right plan for this statement, and others, on its own.
To figure out why it's wrong, look at the number of expected rows in the plan. You will probably find that one of these is an order of magnitude out. This might be due to non-uniformly distributed column values, old statistics, columns that correlate with each other, etc.
To resolve this, you can get Oracle to collect better statistics and hint it with better starting assumptions. Then it will estimate accurate costs and come up with the fastest plan.
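For instance, refreshing the statistics with histograms might look like this (a sketch; the table name is a placeholder):

BEGIN
  dbms_stats.gather_table_stats(
    ownname    => USER,
    tabname    => 'MY_BIG_TABLE',               -- placeholder name
    method_opt => 'FOR ALL COLUMNS SIZE AUTO',  -- let Oracle decide where histograms help
    cascade    => TRUE);                        -- also gather index statistics
END;
/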
If you post more information I might be able to comment further.
If you want to know why the optimizer takes the decisions it does you need to use the 10053 trace.
SQL> alter session set events '10053 trace name context forever, level 1';
Then run explain plans for a sample fast query and a sample slow query. In the user dump directory you will get trace files detailing the decision trees which the CBO goes through. Somewhere in those files you will find the reasons why it chooses a full index scan over an index range scan.
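A sketch of the whole sequence (the traced statement is a placeholder; substitute your real slow query):

ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';

EXPLAIN PLAN FOR
SELECT * FROM my_big_table WHERE some_indexed_column BETWEEN 10 AND 80;  -- placeholder statement

ALTER SESSION SET EVENTS '10053 trace name context off';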
I'm not saying the trace files are an easy read. The best resource for understanding them is Wolfgang Breitling's excellent whitepaper "A look under the hood of CBO" (PDF)
You can use Oracle SQL hints to force the use of a specific index or to exclude an index.
Check the documentation:
http://psoug.org/reference/hints.html
http://www.adp-gmbh.ch/ora/sql/hints/index.html
For example (note that the hint must reference the table alias when one is used):
SELECT /*+ INDEX(emp_alias ix_emp) */ * FROM scott.emp emp_alias
I have seen hints ignored by Oracle.
Recently, our DBA set optimizer_index_cost_adj and the index was then used.
This is an Oracle initialization parameter, but you can also set it at the session level.
The default value is 100; we used 10.
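For example, at session level (10 is simply the value that worked in this case):

ALTER SESSION SET optimizer_index_cost_adj = 10;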