Assumptions:
I have a number of tables comprised of facts and foreign keys ('dimensional' and 'key-value' type). For example, ENCOUNTER:
ID - primary key
dimensions
LOCATION_ID
PATIENT_ID
key-value
TYPE_ID
STATUS_ID
PATIENT_CLASS_ID
DISPOSITION_ID
...
facts
ADMISSION_DATE
DISCHARGE_DATE
...
I don't have the option to create a data warehouse
I would like to simplify the data structure for reporting
My approach is to create a number of pseudo-dimensional views ('D_LOCATION' based on the DEPARTMENT and LOCATION tables) and pseudo-fact views ('F_ENCOUNTER' based on ENCOUNTER table). In the pseudo-fact view, I would JOIN the key-value tables (e.g. STATUS, PATIENT_CLASS) to the fact table to include the name fields (e.g. STATUS.NAME, PATIENT_CLASS.NAME).
Questions:
If a query selects a subset of all of the fields from F_ENCOUNTER (i.e. not all of the key-value.name fields), is the Oracle 10g optimizer smart enough to exclude some of the key-value table joins (i.e. the ones that aren't included in the query)?
Is there anything that I can do to optimize this architecture (other than indices)
Is there another approach?
** edit **
Goals (in order of importance):
reduce query complexity; increase query consistency; decrease report-development time
optimize query-processing
minimize administrator burden
decrease storage
One optimization suggestion is not to use key-value pair tables. The concept of a Dimension table is that each record should contain all information about that concept without needing to join to normalized tables - i.e. turning a star schema into a snowflake schema.
While values might be repeated across dimension table records, it has the advantage of fewer joins in your reporting queries. Denormalizing tables in this way might seem counter intuitive but where performance is paramount it is usually the best solution.
I don't believe Oracle would exclude any joins done in the view, because the joins can impact the number of rows returned. (As when an inner join fails to match any rows, making the whole result set empty.)
What are the goals of your optimization? Query speed? query simplicity? storage efficiency? If you can sacrifice storage efficiency for better query performance, then replace the key-value references with the values themselves (TYPE_NAME instead of TYPE_ID, PATIENT_CLASS_NAME instead of PATIENT_CLASS_ID, etc.).
[Edit:] If the original architecture cannot be modified, consider using a materialized view. It would essentially pre-compute the joins and store the result set, giving you speedy query time at the cost of extra storage space and possibly-not-fresh data. You can control the latter by specifying an appropriate refresh policy. See http://en.wikipedia.org/wiki/Materialized_view and http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/mv.htm for further details.
Related
I want to optimize a query in vertica database. I have table like this
CREATE TABLE data (a INT, b INT, c INT);
and a lot of rows in it (billions)
I fetch some data using whis query
SELECT b, c FROM data WHERE a = 1 AND b IN ( 1,2,3, ...)
but it runs slow. The query plan shows something like this
[Cost: 3M, Rows: 3B (NO STATISTICS)]
The same is shown when I perform explain on
SELECT b, c FROM data WHERE a = 1 AND b = 1
It looks like scan on some part of table. In other databases I can create an index to make such query realy fast, but what can I do in vertica?
Vertica does not have a concept of indexes. You would want to create a query specific projection using the Database Designer if this is a query that you feel is run frequently enough. Each time you create a projection, the data is physically copied and stored on disk.
I would recommend reviewing projection concepts in the documentation.
If you see a NO STATISTICS message in the plan, you can run ANALYZE_STATISTICS on the object.
For further optimization, you might want to use a JOIN rather than IN. Consider using partitions if appropriate.
Creating good projections is the "secret-sauce" of how to make Vertica perform well. Projection design is a bit of an art-form, but there are 3 fundamental concepts that you need to keep in mind:
1) SEGMENTATION: For every row, this determines which node to store the data on, based on the segmentation key. This is important for two reasons: a) DATA SKEW -- if data is heavily skewed then one node will do too much work, slowing down the entire query. b) LOCAL JOINS - if you frequently join two large fact tables, then you want the data to be segmented the same way so that the joined records exist on the same nodes. This is extremely important.
2) ORDER BY: If you are performing frequent FILTER operations in the where clause, such as in your query WHERE a=1, then consider ordering the data by this key first. Ordering will also improve GROUP BY operations. In your case, you would order the projection by columns a then b. Ordering correctly allows Vertica to perform MERGE joins instead of HASH joins which will use less memory. If you are unsure how to order the columns, then generally aim for low to high cardinality which will also improve your compression ratio significantly.
3) PARTITIONING: By partitioning your data with a column which is frequently used in the queries, such as transaction_date, etc, you allow Vertica to perform partition pruning, which reads much less data. It also helps during insert operations, allowing to only affect one small ROS container, instead of the entire file.
Here is an image which can help illustrate how these concepts work together.
I have to sum a huge number of data with aggregation and where clause, using this query
what I am doing is like this : I have three tables one contains terms the second contains user terms , and the third contains correlation factor between term and user term.
I want to calculate the similarity between the sentence that that user inserted with an already existing sentences, and take the results greater than .5 by summing the correlation factor between sentences' terms
The problem is that this query takes more than 15 min. because I have huge tables
any suggestions to improve performance please?
insert into PLAG_SENTENCE_SIMILARITY
SELECT plag_TERMS.SENTENCE_ID ,plag_User_TERMS.SENTENCE_ID,
least( sum( plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_terms.sentence_length,
sum (plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_user_terms.sentence_length),
plag_TERMs.isn,
plag_user_terms.isn
FROM plag_TERM_CORRELATIONS3,
plag_TERMS,
Plag_User_TERMS
WHERE ( Plag_TERMS.TERM_ROOT = Plag_TERM_CORRELATIONS3.TERM1
AND Plag_User_TERMS.TERM_ROOT = Plag_TERM_CORRELATIONS3.TERM2
AND Plag_User_Terms.ISN=123)
having
least( sum( plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_terms.sentence_length,
sum (plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_user_terms.sentence_length) >0.5
group by (plag_User_TERMS.SENTENCE_ID,plag_TERMS.SENTENCE_ID , plag_TERMs.isn, plag_terms.sentence_length,plag_user_terms.sentence_length, plag_user_terms.isn);
plag_terms contains more than 50 million records and plag_correlations3 contains 500000
If you have a sufficient amount of free disk space, then create a materialized view
over the join of the three tables
fast-refreshable on commit (don't use the ANSI join syntax here, even if tempted to do so, or the mview won't be fast-refreshable ... a strange bug in Oracle)
with query rewrite enabled
properly physically organized for quick calculations
The query rewrite is optional. If you can modify the above insert-select, then you can just select from the materialized view instead of selecting from the join of the three tables.
As for the physical organization, consider
hash partitioning by Plag_User_Terms.ISN (with a sufficiently high number of partitions; don't hesitate to partition your table with e.g. 1024 partitions, if it seems reasonable) if you want to do a bulk calculation over all values of ISN
single-table hash clustering by Plag_User_Terms.ISN if you want to retain your calculation over a single ISN
If you don't have a spare disk space, then just hint your query to
either use nested loops joins, since the number of rows processed seems to be quite low (assumed by the estimations in the execution plan)
or full-scan the plag_correlations3 table in parallel
Bottom line: Constrain your tables with foreign keys, check constraints, not-null constraints, unique constraints, everything! Because Oracle optimizer is capable of using most of these informations to its advantage, as are the people who tune SQL queries.
I have an application that collects performance metrics and stores them in a datamart. I then use Mondrian to enable analysis and ad-hoc exploration of the data. I'm collecting about 5e6 rows per day and total size of the METRIC table is about 300M rows.
We "color" our data based on the metrics comparison to an SLA. There are exactly 5 distinct values for color. When we do simple MDX queries to get, for example, a color distribution of the data for a specific date range, say 1 day, we see queries like below:
2014-06-11 23:17:08,042 DEBUG [sql] - 223: SqlTupleReader.readTuples
[[Color].[Color]]: executing sql [select "METRIC"."COLOR" as "c0"
from "METRIC" "METRIC" group by "METRIC"."COLOR" order by
"METRIC"."COLOR" ASC NULLS LAST] 2014-06-11 23:17:58,747 DEBUG [sql] -
223: , exec 50704 ms
In order to improve performance, the datamart includes aggregate tables at the hour and day levels, and both aggregate tables include the COLOR column.
I understand that Mondrian is very dependent on the underlying database performance, but there is really no way to tune this. I can create an index on COLOR (because a full scan of the index will be marginally faster than a full scan of the table), but it seems silly to create an index with 5 distinct value on a 300M row table. The day aggregate table has about 500K rows and would be significantly faster executing virtually the same query against this table, but Mondrian always seems to go to the base fact table for these dimension queries.
My question is, is there some way to avoid this query? If I can't avoid it, is it possible to get Mondrian to use the aggregate tables for this type of query? I have specified approxRowCount in the single level of this dimension/hierarchy and that eliminated the similar query to get the count of values. I haven't dug into the source of Mondrian yet to determine if there is a possibility of using the aggregate table or if there is some configuration on my part that is preventing it.
Edit for Clarification:
I probably didn't do a good job of asking my question-let me try and clarify. My MDX query looks something like:
select [Color].[Color].Members on columns,
{[Measures].[Metric Value], [Measures].[Count]} on rows
from [Metric]
where [Time].[2014].[June].[11]
I can look at this and hand write a SQL query that answers this query
select COLOR, avg(VALUE), sum(FACT_COUNT)
from AGG_DAY_METRIC
where YEAR = 2014
and MONTH = 6
and DAY_OF_MONTH = 11
group by COLOR
The database answers this query in about 100ms scanning approx 4K rows. It takes Mondrian several minutes to answer the
query because it does several queries that don't answer the MDX query directly, but rather get information about the
dimension. In the case above, the database has to scan 300M rows, taking 50 seconds, to return that there are 5 possible
colors. If color was in a normal dimension table there would only be 5 rows, but in a degenerate dimension there can be 100s
of millions of rows.
So my questions are:
a) Is there a way to tell Mondrian the values of a degenerate dimension and avoid these queries?
b) Is there a way to have Mondrian answer these queries from aggregate tables?
This problem was solved, not by modifying anything in the Mondrian schema or the application, but the database. The database in this case was Oracle and we were able to create a materialized view with query rewrite enabled.
The materialized view is created from the exact query issued by Mondrian. Since the color values don't change very frequently (almost never in our case), the materialized view does a full refresh once a day.
In this case the queries went from taking minute(s) to milliseconds. If your facing an issue like this and your database is Oracle this is a good approach to speeding up the tuples resolution for degenerate dimensions with low cardinality.
It's hard to give any specific directions without knowing more about your schema, but it looks to me you have to make sure that the number of rows with certain colours (count) has to be marked defined as an aggregate measure (Count or Max Number).
Please note that these aggregates are not calculated continuously (I think it would be to heavy for the backing data-store, and Mondrian won't keep a flowing set in memory for incoming facts).
The aggregation can be specified to be ran/rebuilt at specific times (nightly, hourly...). This would make Mondrian a bit unsuitable for real-time analysis, but you should be able to do almost instant queries on historical data.
If your dimension has 5 distinct values in a 300M fact table it should not be a degenerate dimension. It should be in a separate dimension table. A degenerate dimension should ONLY be used if its cardinality is close to the full fact table row count, making a separate table pointless, as there would be no significant storage savings and joining the dimension results in a lot of data being read;
If you put the colors on a separate dim table, any "Read Tuples" query will return results in a few ms, and your problem is solved.
However, more to the point of your question, Mondrian should be able to pick the dim values from the agg tables. Unless you have distinct-count aggregators in the cube, in which case you're in a tricky situation (unless there's an agg table that exactly matches the level of detail you need, Mondrian will very likely scan the fact table).
You should also set the highCardinality attribute of this degenerate dimension to True. Even with only 5 distinct values, having highCardinality=false tells Mondrian it's safe to scan the whole dimension to populate the list of members. Setting it to true stops this scan.
You should also add an index to this column. It's always a good idea to add indexes to every key and degenerate dimension column in a fact table. With an index the DB should answer much faster that SQL query.
Finally, you have a 300M row fact table. What DBMS are you using? Is it a Column oriented DB? If not, you should try them as a possible alternative to your data store. Column oriented DB have a significant performance increase over Row oriented DBs for Mondrian-like queries. There are a few good options out there, you should test drive them.
I have a table in SYBASE which has around 1mio rows. This table currently does not have any index created and I would like to create one now. My questions are
What precautions should I take before creating an index?
Does this process require more tablespace to be allocated?
Any other performance considerations I should take care of?
Cheers
Ranjith
From manual.
When to index
Use the following general guidelines:
If you plan to do manual insertions into the IDENTITY column, create
a unique index to ensure that the inserts do not assign a value that
has already been used.
A column that is often accessed in sorted order, that is, specified in the order by clause, probably should be indexed so that
Adaptive Server can take advantage of the indexed order.
Columns that are regularly used in joins should always be indexed, since the system can perform the join faster if the columns
are in sorted order.
The column that stores the primary key of the table often has a clustered index, especially if it is frequently joined to columns in
other tables. Remember, there can be only one clustered index per
table.
A column that is often searched for ranges of values might be a good choice for a clustered index. Once the row with the first value
in the range is found, rows with subsequent values are guaranteed to
be physically adjacent. A clustered index does not offer as much of
an advantage for searches on single values.
When not to index
In some cases, indexes are not useful:
Columns that are seldom or never referenced in queries do not benefit
from indexes, since the system seldom has to search for rows on the
basis of values in these columns.
Columns that can have only two or three values, for example, "male" and "female" or "yes" and "no", get no real advantage from
indexes.
Try
sp_spaceused tablename, 1
Here is link to documentation.
Yes - Updating statistics about indexes.
Here is link to documentation.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I would like to know if there are general rules for creating an index or not.
How do I choose which fields I should include in this index or when not to include them?
I know its always depends on the environment and the amount of data, but I was wondering if we could make some globally accepted rules about making indexes in Oracle.
The Oracle documentation has an excellent set of considerations for indexing choices: http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/data_acc.htm#PFGRF004
Update for 19c: https://docs.oracle.com/en/database/oracle/oracle-database/19/tgdba/designing-and-developing-for-performance.html#GUID-99A7FD1B-CEFD-4E91-9486-2CBBFC2B7A1D
Quoting:
Consider indexing keys that are used frequently in WHERE clauses.
Consider indexing keys that are used frequently to join tables in SQL statements. For more information on optimizing joins, see the section "Using Hash Clusters for Performance".
Choose index keys that have high selectivity. The selectivity of an index is the percentage of rows in a table having the same value for the indexed key. An index's selectivity is optimal if few rows have the same value. Note: Oracle automatically creates indexes, or uses existing indexes, on the keys and expressions of unique and primary keys that you define with integrity constraints.
Indexing low selectivity columns can be helpful if the data distribution is skewed so that one or two values occur much less often than other values.
Do not use standard B-tree indexes on keys or expressions with few distinct values. Such keys or expressions usually have poor selectivity and therefore do not optimize performance unless the frequently selected key values appear less frequently than the other key values. You can use bitmap indexes effectively in such cases, unless the index is modified frequently, as in a high concurrency OLTP application.
Do not index columns that are modified frequently. UPDATE statements that modify indexed columns and INSERT and DELETE statements that modify indexed tables take longer than if there were no index. Such SQL statements must modify data in indexes as well as data in tables. They also generate additional undo and redo.
Do not index keys that appear only in WHERE clauses with functions or operators. A WHERE clause that uses a function, other than MIN or MAX, or an operator with an indexed key does not make available the access path that uses the index except with function-based indexes.
Consider indexing foreign keys of referential integrity constraints in cases in which a large number of concurrent INSERT, UPDATE, and DELETE statements access the parent and child tables. Such an index allows UPDATEs and DELETEs on the parent table without share locking the child table.
When choosing to index a key, consider whether the performance gain for queries is worth the performance loss for INSERTs, UPDATEs, and DELETEs and the use of the space required to store the index. You might want to experiment by comparing the processing times of the SQL statements with and without indexes. You can measure processing time with the SQL trace facility.
There are some things you should always index:
Primary Keys - these are given an index automatically (unless you specify a suitable existing index for Oracle to use)
Unique Keys - these are given an index automatically (ditto)
Foreign Keys - these are not automatically indexed, but you should add one to avoid performance issues when the constraints are checked
After that, look for other columns that are frequently used to filter queries: a typical example is people's surnames.
From the 10g Oracle Database Application Developers Guide - Fundamentals, Chapter 5:
In general, you should create an index on a column in any of the following situations:
The column is queried frequently.
A referential integrity constraint exists on the column.
A UNIQUE key integrity constraint exists on the column.
Use the following guidelines for determining when to create an index:
Create an index if you frequently want to retrieve less than about 15% of the rows in a large table. This threshold percentage varies greatly, however, according to the relative speed of a table scan and how clustered the row data is about the index key. The faster the table scan, the lower the percentage; the more clustered the row data, the higher the percentage.
Index columns that are used for joins to improve join performance.
Primary and unique keys automatically have indexes, but you might want to create an index on a foreign key; see Chapter 6, "Maintaining Data Integrity in Application Development" for more information.
Small tables do not require indexes; if a query is taking too long, then the table might have grown from small to large.
Some columns are strong candidates for indexing. Columns with one or more of the following characteristics are good candidates for indexing:
Values are unique in the column, or there are few duplicates.
There is a wide range of values (good for regular indexes).
There is a small range of values (good for bitmap indexes).
The column contains many nulls, but queries often select all rows having a value. In this case, a comparison that matches all the non-null values, such as:
WHERE COL_X >= -9.99 *power(10,125)
is preferable to
WHERE COL_X IS NOT NULL
This is because the first uses an index on COL_X (assuming that COL_X is a numeric column).
Columns with the following characteristics are less suitable for indexing:
There are many nulls in the column and you do not search on the non-null values.
Wow, that's just such a huge topic, it's hard to answer in this format. I srtongly recommend this book.
Relational Database Index Design and the Optimizers
by Tapio Lahdenmaki
You don't just use indexes to make table access faster, sometimes you make indexes to avoid table access altogether. Something not mentioned yet but vital.
There's a whole science to this if you really want to make your database perform maximally.
Ah, one specific optimization to Oracle is building reverse key indexes. If you have a PK index of a monoatomically increasing value, like a sequence, and you have highly concurrent inserts and don't plan to range scan that column then make it a reverse key index.
See how specific these optimizations can be?
Look into Database Normalization - you'll find a lot of good, industry standard rules about what keys should exist, how databases should be related, and hints on indexes.
-Adam
Usually one puts the ID columns up front and those usually identify the rows uniquely. A combination of columns can also do the same thing. As an example using cars... tags or license plates are unique and qualify for an index. They (the tags column) can qualify for the primary key. The owners name can qualify for an index if you are going to search on name. make of car really shouldn't get an index in the beginning as it's not going to vary too much. Indexes don't help if the data in the column doesn't vary too much.
Take a look at the SQL - what are the where clauses looking at. Those may need an index.
Measure. What is the issue - pages/queries taking too long ? what's being used for the queries. Create an index on those columns.
Caveats: indexes need time for updates and space.
and sometimes full table scans are quicker than an index. small tables can be scanned quicker than getting the index and then hitting the table. Look at your joins.