I am investigating a problem where our application takes too long to get data from Oracle Database. My investigation traced the slowness of the query to the join between tables and to the aggregate function SUM.
This may look simple, but I am not good with SQL query optimization.
The query is below
SELECT T1.TONNES, SUM(R.TONNES) AS TOTAL_TONNES
FROM RECLAIMED R,
     (SELECT DELIVERY_OUT_ID, SUM(TONNES) AS TONNES
      FROM RECLAIMED
      WHERE DELIVERY_IN_ID = 53773
      GROUP BY DELIVERY_OUT_ID) T1
WHERE R.DELIVERY_OUT_ID = T1.DELIVERY_OUT_ID
GROUP BY T1.TONNES
SUM(R.TONNES) is the total tonnes per delivery out.
SUM(TONNES) is the total tonnes per delivery in.
My table looks like
I have 16 million entries in this table, and trying multiple delivery_in_ids, the query takes about 6 seconds on average to come back.
I have a similar database (a complete copy, but with only 4 million entries), and when the same query is run there it takes less than 1 second.
They both have the same indexes, so I am confident that the index is not the problem.
I am certain it is just the data; the first database (16 million) is heavier. I have a feeling that once this query is optimized, the problem will be solved.
Open for suggestions : )
Are the two DBs on the same server? If not, first compare the machine configuration, settings, and running applications.
If there are no differences, check whether you have NULL values in the column you want to SUM. If there are, use the NVL function to improve your query.
You may also "Analyse" (or "Rebuild") the index. That cleans up the index, and it is quite fast and safe for your data.
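For example, something along these lines; this is only a sketch using the RECLAIMED table and TONNES column from your query, and the index name is invented:

-- Sketch only: wrap the summed column in NVL in case it contains NULLs
SELECT DELIVERY_OUT_ID, SUM(NVL(TONNES, 0)) AS TONNES
FROM   RECLAIMED
WHERE  DELIVERY_IN_ID = 53773
GROUP  BY DELIVERY_OUT_ID;

-- Rebuild an existing index (replace the name with one of yours)
ALTER INDEX IX_RECLAIMED_DELIVERY_IN REBUILD;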
If that does not help, check whether the TABLESPACE of your table is full. It might have some impact, but I am not sure.
;-)
I solved the performance problem by updating the stored procedure: the query is optimized by filtering the first table before joining to the second. Below is the resulting query:
SELECT R.DELIVERY_IN_ID, R.DELIVERY_OUT_ID, SUM(R.TONNES),
       (SELECT SUM(TONNES) AS TONNES
        FROM RECLAIMED
        WHERE DELIVERY_OUT_ID = R.DELIVERY_OUT_ID) AS TOTAL_TONNES
FROM CTSBT_RECLAIMED R
WHERE DELIVERY_IN_ID = 53733
GROUP BY DELIVERY_IN_ID, R.DELIVERY_OUT_ID
The improvement in timing is huge in my case, since I am joining against a huge table (16M rows). The query now runs in less than a second. I suspect the slowness was due to one side of the join having no index (see T1): even though in my case it only has about 20 rows, that is enough to slow the query down because it is compared against 16 million entries.
The optimized query filters those 16 million rows first and merges with T1 afterwards.
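For what it's worth, a pair of composite indexes along these lines might also help here. This is only a sketch based on the columns above; the index names are made up, and you would adjust the table name if CTSBT_RECLAIMED and RECLAIMED differ:

-- Hypothetical indexes: one covering the outer filter, one covering the
-- correlated scalar subquery, so both can be answered from an index.
CREATE INDEX IX_RECLAIMED_IN_OUT ON RECLAIMED (DELIVERY_IN_ID, DELIVERY_OUT_ID, TONNES);
CREATE INDEX IX_RECLAIMED_OUT    ON RECLAIMED (DELIVERY_OUT_ID, TONNES);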
Is there a better way to optimize this? Probably. But I am happy with this result, and it solved what I intended to solve. Now moving on.
Thanks for those who commented.
Related
I have a large query; part of it contains several joins in the WHERE clause, and according to the execution plan those joins are causing TABLE ACCESS (FULL), which obviously makes the query run very slowly.
Here is the part of the query that seems to be causing the issue
WHERE ......
A.CN= B.CN(+) AND
A.CI= B.CI(+) AND
A.SO= B.SO(+) AND
A.CN= C.CN(+) AND
The execution plan shows
HASH JOIN (RIGHT OUTER)
Access Predicates: "A"."CN"="C"."CN"
The estimated bytes is 700MB, which is about a third of the entire query's cost.
I have checked indexes and both tables have indexes on CN.
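(For reference, the kind of index I mean is a plain single-column index, roughly like this; the index names are illustrative:)

CREATE INDEX IX_A_CN ON A (CN);
CREATE INDEX IX_B_CN ON B (CN);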
I'm just beginning to learn about performance and how things work, so I'm sorry if this is a dumb question :x
Looking for advice on how to improve performance.
I have an application that collects performance metrics and stores them in a datamart. I then use Mondrian to enable analysis and ad-hoc exploration of the data. I'm collecting about 5e6 rows per day and total size of the METRIC table is about 300M rows.
We "color" our data based on the metrics comparison to an SLA. There are exactly 5 distinct values for color. When we do simple MDX queries to get, for example, a color distribution of the data for a specific date range, say 1 day, we see queries like below:
2014-06-11 23:17:08,042 DEBUG [sql] - 223: SqlTupleReader.readTuples [[Color].[Color]]: executing sql
[select "METRIC"."COLOR" as "c0" from "METRIC" "METRIC"
 group by "METRIC"."COLOR" order by "METRIC"."COLOR" ASC NULLS LAST]
2014-06-11 23:17:58,747 DEBUG [sql] - 223: , exec 50704 ms
In order to improve performance, the datamart includes aggregate tables at the hour and day levels, and both aggregate tables include the COLOR column.
I understand that Mondrian is very dependent on the underlying database performance, but there is really no way to tune this. I can create an index on COLOR (because a full scan of the index will be marginally faster than a full scan of the table), but it seems silly to create an index with 5 distinct values on a 300M row table. The day aggregate table has about 500K rows, and virtually the same query would run significantly faster against it, but Mondrian always seems to go to the base fact table for these dimension queries.
My question is, is there some way to avoid this query? If I can't avoid it, is it possible to get Mondrian to use the aggregate tables for this type of query? I have specified approxRowCount in the single level of this dimension/hierarchy and that eliminated the similar query to get the count of values. I haven't dug into the source of Mondrian yet to determine if there is a possibility of using the aggregate table or if there is some configuration on my part that is preventing it.
Edit for Clarification:
I probably didn't do a good job of asking my question, so let me try to clarify. My MDX query looks something like:
select [Color].[Color].Members on columns,
{[Measures].[Metric Value], [Measures].[Count]} on rows
from [Metric]
where [Time].[2014].[June].[11]
I can look at this and hand-write a SQL query that answers it:
select COLOR, avg(VALUE), sum(FACT_COUNT)
from AGG_DAY_METRIC
where YEAR = 2014
and MONTH = 6
and DAY_OF_MONTH = 11
group by COLOR
The database answers this query in about 100ms, scanning approx 4K rows. It takes Mondrian several minutes to answer the query because it issues several queries that don't answer the MDX query directly, but rather get information about the dimension. In the case above, the database has to scan 300M rows, taking 50 seconds, to return that there are 5 possible colors. If color were in a normal dimension table there would only be 5 rows, but in a degenerate dimension there can be 100s of millions of rows.
So my questions are:
a) Is there a way to tell Mondrian the values of a degenerate dimension and avoid these queries?
b) Is there a way to have Mondrian answer these queries from aggregate tables?
This problem was solved, not by modifying anything in the Mondrian schema or the application, but in the database. The database in this case was Oracle, and we were able to create a materialized view with query rewrite enabled.
The materialized view is created from the exact query issued by Mondrian. Since the color values don't change very frequently (almost never in our case), the materialized view does a full refresh once a day.
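Roughly, the view looked something like this; the view name and refresh schedule here are illustrative, and the defining query follows the one Mondrian issues:

-- Sketch only: names and refresh options are examples, not the exact DDL we used
CREATE MATERIALIZED VIEW MV_METRIC_COLOR
  BUILD IMMEDIATE
  REFRESH COMPLETE
  START WITH SYSDATE NEXT SYSDATE + 1  -- full refresh once a day
  ENABLE QUERY REWRITE
AS
SELECT COLOR
FROM METRIC
GROUP BY COLOR;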
In this case the queries went from taking minutes to milliseconds. If you're facing an issue like this and your database is Oracle, this is a good approach to speeding up tuple resolution for degenerate dimensions with low cardinality.
It's hard to give any specific directions without knowing more about your schema, but it looks to me like you have to make sure that the number of rows with a certain colour (the count) is defined as an aggregate measure (Count or Max Number).
Please note that these aggregates are not calculated continuously (I think it would be too heavy for the backing data store, and Mondrian won't keep a rolling set in memory for incoming facts).
The aggregation can be set up to be run/rebuilt at specific times (nightly, hourly, ...). This makes Mondrian a bit unsuitable for real-time analysis, but you should be able to run almost instant queries on historical data.
If your dimension has 5 distinct values in a 300M fact table, it should not be a degenerate dimension; it should be in a separate dimension table. A degenerate dimension should ONLY be used if its cardinality is close to the full fact table row count, making a separate table pointless, as there would be no significant storage savings and joining the dimension would mean reading a lot of data.
If you put the colors on a separate dim table, any "Read Tuples" query will return results in a few ms, and your problem is solved.
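A minimal sketch of that change, with made-up names (the question only gives us METRIC and COLOR):

-- Hypothetical dimension table plus a surrogate key on the fact table
CREATE TABLE COLOR_DIM (
  COLOR_ID NUMBER       PRIMARY KEY,
  COLOR    VARCHAR2(30) NOT NULL UNIQUE
);

ALTER TABLE METRIC ADD (COLOR_ID NUMBER REFERENCES COLOR_DIM (COLOR_ID));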
However, more to the point of your question, Mondrian should be able to pick the dim values from the agg tables. Unless you have distinct-count aggregators in the cube, in which case you're in a tricky situation (unless there's an agg table that exactly matches the level of detail you need, Mondrian will very likely scan the fact table).
You should also set the highCardinality attribute of this degenerate dimension to True. Even with only 5 distinct values, having highCardinality=false tells Mondrian it's safe to scan the whole dimension to populate the list of members. Setting it to true stops this scan.
You should also add an index to this column. It's always a good idea to add indexes to every key and degenerate dimension column in a fact table. With an index the DB should answer that SQL query much faster.
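On a 5-value column in a large Oracle fact table that would typically be a bitmap index; a sketch, assuming the METRIC table from the question and an invented index name:

CREATE BITMAP INDEX IX_METRIC_COLOR ON METRIC (COLOR);
-- Note: bitmap indexes can cause contention under heavy concurrent DML,
-- so they fit read-mostly fact tables best.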
Finally, you have a 300M row fact table. What DBMS are you using? Is it a column-oriented DB? If not, you should try one as a possible alternative to your data store. Column-oriented DBs give a significant performance increase over row-oriented DBs for Mondrian-like queries. There are a few good options out there; you should test-drive them.
The columns in the WHERE clause are not selective, and they are all in one single table. In addition, the expressions used are NOT EQUAL, OR, IS NULL, and IS NOT NULL. The primary key is on the customer ID. I am not sure how to work around this kind of data. Are there different indexing methods that can be created on the table, or other ways to solve the problem? I guess partitions won't be helpful either, since they would just break the table into one major section holding most of the data. Any thoughts or workarounds will be useful.
I'm putting the data below for reference, along with sample queries for ease of understanding.
sample query
colA = 'Marketable' OR colA is null
NORMAL index: gets ignored due to the OR and NULL operators. Moreover, the queried data covers more than 95% of the rows in the table.
BITMAP index: gets ignored due to more than 96% data coverage.
sample query
colB = '7' OR colB = '6' OR colB = '5'
NORMAL or BITMAP: both not useful due to the large data selection. The optimizer goes with a full table scan using the primary key cust_id.
sample query
colC <> 'SPECIAL SEGMENT' OR colC is null (since the values can change, no specific value is passed)
combination sample query
NOT (colB = '6' OR colB = '3') AND
(colC <> 'SPECIAL SEGMENT' OR colC is null)
Full table scans are not evil. Index access is not always more efficient.
If you want to return the majority of the data in a table, you want to use a full table scan, since that is the most efficient way of accessing large fractions of the data in the table. Indexes are great when you want to access relatively small fractions of the data in the table, but if you want most of the data, doing millions of index accesses is not going to be more efficient. In your first example, you want to return 9.2 million rows from a 9.3 million row table. A full table scan is the plan you want; that is the most efficient way to retrieve 99% of the rows in the table. Anything else is going to be less efficient. You could, I suppose, potentially partition the table on A, leading to full partition scans of the two large partitions. That would only cut, say, 1% of the work your query needs to do, though, and may have negative impacts on other queries on that table.
Now, I'm always a bit suspicious about queries that want to return 99% of the rows in a table in the first place. It would make no sense to have such a query in an OLTP system, for example, because no human is going to page through 9.2 million rows of data. It wouldn't make sense to have that sort of query if the goal is to replicate data because it would almost certainly be more efficient to just replicate incremental changes rather than the entire data set every time. It might make sense to read almost all the rows if the goal is to perform some aggregations. But if this is something that happens enough to care about optimizing the analysis, you'd be better off looking at ways of pre-aggregating the data using materialized views and dimensions so that you can read and aggregate the data once and then just read your pre-aggregated values at runtime.
If you do really need to read all that data, you may also want to look into parallel query. If there are relatively few readers, it is more efficient to let Oracle do the full scan in parallel so that your session can utilize more of the available hardware. Of course, that means that you can have fewer simultaneous sessions since more hardware for you means less for others, so that's a trade-off you need to understand. If you're building an ETL process where there will only be a couple sessions loading data at any point, parallel query can provide substantial performance improvements.
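For example (illustrative only: the question does not name the table, so big_table is a placeholder, and the degree of parallelism is arbitrary):

-- Hypothetical: force a parallel full scan for the first sample predicate
SELECT /*+ FULL(t) PARALLEL(t, 8) */ *
FROM   big_table t
WHERE  colA = 'Marketable' OR colA IS NULL;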
I am trying to select a distinct list of a certain column from a table with many millions of rows, such as:
select distinct stylecode from bass.stock_snapshot
This query obviously takes a very long time. What performance tuning can I do on this table?
If there are no predicates to my query, will an index help at all?
" just did this on a test table and the explain plan shows it did use
the index."
Please bear in mind that you would have to maintain that index forever more. I don't understand your data, but it seems unlikely this index will be useful for other queries, and this doesn't seem like the sort of query you ought to be running on a frequent basis.
If this is a one-off, some other approach such as parallel query might be better.
If on the other hand it is a frequent requirement perhaps a reference table for STYLECODE would be a good idea.
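For the reference-table option, one possible shape (the new table name is made up, and it would need to be refreshed whenever new stylecodes arrive):

-- Hypothetical reference table built once from the existing data
CREATE TABLE bass.stylecode_ref AS
  SELECT DISTINCT stylecode
  FROM   bass.stock_snapshot;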
I created an index on one table, a simple index, just like this:
CREATE INDEX IDX_TRANSACAO_NOVA_STATUS ON TRANSACAO_NOVA(STATUS) TABLESPACE COMVENIF;
This table has about 1,000K rows in it, and the STATUS column has just 5 or 6 possible values. After creating the index I expected that the query below would have better performance:
select * from transacao_nova tn where tn.status = 'XXX'
But the explain plan still shows me a full scan with a cost of 16,000.
Any help? I'm not a DBA, but I need to improve this performance.
Thanks in advance.
If there are only 5 or 6 different status values and a million records, the query optimizer may be deciding it is not worth using the index to do a range scan that would still return a substantial fraction of all the records in the table.
You might look into using an index-clustered table for this application.
If the data in the status column is skewed (not uniform: some values appear very often and others appear very rarely), you can accelerate queries for the rare values by refreshing statistics (and verifying that you are calculating a histogram for the status column). This will make Oracle use the index in the cases where it is more efficient.
http://docs.oracle.com/cd/E11882_01/server.112/e16638/stats.htm#autoId12
Be aware that automatically determining if a column needs a histogram is not a good idea as it may lead to inconsistent behaviour. It is better to manually specify histograms when needed. Also, histograms affect every query that uses those columns, so they should be collected with care.
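A sketch of doing that manually, assuming the TRANSACAO_NOVA table from the question (the owner defaults to the current user, and the bucket size is just an example):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'TRANSACAO_NOVA',
    method_opt => 'FOR COLUMNS STATUS SIZE 254',  -- explicit histogram on STATUS
    cascade    => TRUE);
END;
/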
You might need to generate new statistics on the table.
http://docs.oracle.com/cd/B19306_01/server.102/b14211/stats.htm
A common mistake is to assume that an index range scan will be better than a full scan because you only want some "small" fraction of the total rows in the table. But if the rows you want are scattered throughout the table's storage extents, locating them by an index lookup can be slower than just scanning the entire table. I can't say for sure that's the case in your situation, but it's a possibility.
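One quick way to gauge this is to compare the index's clustering factor with the table's row count, using the index from the question:

SELECT index_name, clustering_factor, num_rows, leaf_blocks
FROM   user_indexes
WHERE  index_name = 'IDX_TRANSACAO_NOVA_STATUS';

A clustering factor close to NUM_ROWS means the matching rows are scattered across many blocks, which is exactly the situation where the full scan can win.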
For a more in-depth discussion of this topic I recommend this paper.