what is a skewed column in Oracle

I found a bottleneck in a query that selects data from only a single table and still takes a long time. I have a non-unique index on the two columns used in the where clause.
select name, isComplete from Student where year='2015' and isComplete='F'
While searching the internet I came across the concept of a skewed column. What is a skewed column?
How does a skewed column affect the performance of a query, and how do I resolve a skewed-column problem?
Any help is appreciated.

Skewed columns are columns in which the data is not evenly distributed among the rows.
For example, suppose:
You have a table order_lines with 100,000,000 rows
The table has a column named customer_id
You have 1,000,000 distinct customers
Some (very large) customers can have hundreds of thousands or millions of order lines.
In the above example, the data in order_lines.customer_id is skewed. On average, you'd expect each distinct customer_id to have 100 order lines (100 million rows divided by 1 million distinct customers). But some large customers have many, many more than 100 order lines.
This hurts performance because Oracle bases its execution plan on statistics. So, statistically speaking, Oracle thinks it can access order_lines based on a non-unique index on customer_id and get only 100 records back, which it might then join to another table or whatever using a NESTED LOOP operation.
But, then when it actually gets 1,000,000 order lines for a particular customer, the index access and nested loop join are hideously slow. It would have been far better for Oracle to do a full table scan and hash join to the other table.
So, when there is skewed data, the optimal access plan depends on which particular customer you are selecting!
Oracle lets you avoid this problem by optionally gathering "histograms" on columns, so Oracle knows which values have lots of rows and which have only a few. That gives the Oracle optimizer the information it needs to generate the best plan in most cases.
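For example, a histogram on the skewed column from the example above could be gathered along these lines (a minimal sketch; the schema owner and the bucket count of 254 are assumptions, and DBMS_STATS.GATHER_TABLE_STATS is the standard Oracle procedure for gathering table statistics):
begin
  dbms_stats.gather_table_stats(
    ownname    => user,                                  -- current schema (assumption)
    tabname    => 'ORDER_LINES',
    method_opt => 'FOR COLUMNS customer_id SIZE 254');   -- histogram with up to 254 buckets
end;
/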

Whether Oracle picks a full table scan or an index scan can depend on whether the column is skewed.
A skewed column is simply one with an uneven spread of values, for example a gender column containing 60 male rows and 40 female rows.

Related

oracle order by optimization

I am running a query on a large table and I am expecting a large number of rows back.
Unfortunately I need to order the result by 2 columns, which makes the query quite slow.
I added an index on those specific columns but was wondering whether the order direction makes a difference: one column is ordered desc and the other asc.
thanks and best wishes,
e.
Your query might benefit from an index ordered the same way as your order by clause e.g.
create index index1 on table1 (col1 desc, col2 asc);
Whether it will benefit depends on the relative cost of the index scans and table lookups versus a simple full table scan. If the number of rows you want is low relative to the total number of rows in the table the query might benefit.
The only way to know for sure is try it.
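For illustration, a query of the following shape (reusing table1, col1 and col2 from the index above, with a hypothetical selective bind variable) might be able to read rows back from index1 already in the requested order and skip a separate sort step:
select *
from   table1
where  col1 > :low_value          -- hypothetical selective predicate
order by col1 desc, col2 asc;     -- matches the column order and directions of index1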

Do PostgreSQL query plans depend on table row count?

My users table doesn't have many rows... yet. 😏
Might the query plan of the same query change as the table grows?
I.e., to see how my application will scale, should I seed my users table with BILLIONS 🤑 of rows before using EXPLAIN?
Estimated row counts are probably the most important factor influencing which query plan is chosen.
Two examples that support this:
If you use a WHERE condition on an indexed column of a table, three things can happen:
If the table is very small or a high percentage of the rows match the condition, a sequential scan will be used to read the whole table and filter out the rows that do not match the condition.
If the table is large and a low percentage of the rows match the condition, an index scan will be used.
If the table is large and a medium percentage of rows match the condition, a bitmap index scan will be used.
If you join two tables, the estimated row counts on the tables will determine if a nested loop join is chosen or not.
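A toy way to see this for yourself (the table, data and query here are made up, not the poster's schema):
create table users (id integer primary key, name text);
insert into users select g, 'user_' || g from generate_series(1, 10) as g;
analyze users;
explain select * from users where id = 5;
-- with only 10 rows the planner will typically choose a sequential scan
insert into users select g, 'user_' || g from generate_series(11, 1000000) as g;
analyze users;
explain select * from users where id = 5;
-- with a million rows the same query will typically switch to an index scan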

performance for sum oracle

I have to sum a huge amount of data with aggregation and a where clause, using the query below.
What I am doing is this: I have three tables, one containing terms, a second containing user terms, and a third containing the correlation factor between a term and a user term.
I want to calculate the similarity between the sentence the user inserted and the already existing sentences, and keep the results greater than 0.5, by summing the correlation factors between the sentences' terms.
The problem is that this query takes more than 15 minutes because the tables are huge.
any suggestions to improve performance please?
insert into plag_sentence_similarity
select plag_terms.sentence_id,
       plag_user_terms.sentence_id,
       least(sum(plag_term_correlations3.correlation_factor) / plag_terms.sentence_length,
             sum(plag_term_correlations3.correlation_factor) / plag_user_terms.sentence_length),
       plag_terms.isn,
       plag_user_terms.isn
from   plag_term_correlations3,
       plag_terms,
       plag_user_terms
where  plag_terms.term_root = plag_term_correlations3.term1
and    plag_user_terms.term_root = plag_term_correlations3.term2
and    plag_user_terms.isn = 123
group by plag_user_terms.sentence_id,
         plag_terms.sentence_id,
         plag_terms.isn,
         plag_terms.sentence_length,
         plag_user_terms.sentence_length,
         plag_user_terms.isn
having least(sum(plag_term_correlations3.correlation_factor) / plag_terms.sentence_length,
             sum(plag_term_correlations3.correlation_factor) / plag_user_terms.sentence_length) > 0.5;
plag_terms contains more than 50 million records and plag_correlations3 contains 500000
If you have a sufficient amount of free disk space, then create a materialized view
over the join of the three tables
fast-refreshable on commit (don't use the ANSI join syntax here, even if tempted to do so, or the mview won't be fast-refreshable ... a strange bug in Oracle)
with query rewrite enabled
properly physically organized for quick calculations
The query rewrite is optional. If you can modify the above insert-select, then you can just select from the materialized view instead of selecting from the join of the three tables.
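For illustration only, a fast-refreshable join materialized view along the lines described above might look like this (the mview name and the selected columns are assumptions; a join mview needs materialized view logs with rowids, and the rowid of every base table in its select list, to be fast-refreshable on commit):
create materialized view log on plag_terms with rowid;
create materialized view log on plag_user_terms with rowid;
create materialized view log on plag_term_correlations3 with rowid;
create materialized view mv_plag_join           -- hypothetical name
  refresh fast on commit
  enable query rewrite
as
select t.rowid               t_rid,             -- base-table rowids are required
       u.rowid               u_rid,             -- for fast refresh of a join mview
       c.rowid               c_rid,
       t.sentence_id         t_sentence_id,
       u.sentence_id         u_sentence_id,
       c.correlation_factor,
       t.sentence_length     t_sentence_length,
       u.sentence_length     u_sentence_length,
       t.isn                 t_isn,
       u.isn                 u_isn
from   plag_term_correlations3 c,
       plag_terms t,
       plag_user_terms u
where  t.term_root = c.term1
and    u.term_root = c.term2;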
As for the physical organization, consider
hash partitioning by Plag_User_Terms.ISN (with a sufficiently high number of partitions; don't hesitate to partition your table with e.g. 1024 partitions, if it seems reasonable) if you want to do a bulk calculation over all values of ISN
single-table hash clustering by Plag_User_Terms.ISN if you want to keep the calculation limited to a single ISN
If you don't have spare disk space, then just hint your query to
either use nested loops joins, since the number of rows processed seems to be quite low (judging by the estimates in the execution plan)
or full-scan the plag_correlations3 table in parallel
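As a rough sketch of the hinted variant (the table aliases c, t and u and the parallel degree are assumptions; the statement is otherwise the same as the one in the question):
select /*+ leading(u) use_nl(c t) */            -- nested-loops variant
       t.sentence_id,
       u.sentence_id,
       least(sum(c.correlation_factor) / t.sentence_length,
             sum(c.correlation_factor) / u.sentence_length),
       t.isn,
       u.isn
from   plag_term_correlations3 c,
       plag_terms t,
       plag_user_terms u
where  t.term_root = c.term1
and    u.term_root = c.term2
and    u.isn = 123
group by u.sentence_id, t.sentence_id, t.isn,
         t.sentence_length, u.sentence_length, u.isn
having least(sum(c.correlation_factor) / t.sentence_length,
             sum(c.correlation_factor) / u.sentence_length) > 0.5;
-- for the parallel full-scan variant, swap the hint for
-- /*+ full(c) parallel(c 8) */ (the degree of 8 is just an example)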
Bottom line: constrain your tables with foreign keys, check constraints, not-null constraints, unique constraints, everything! The Oracle optimizer is capable of using most of this information to its advantage, as are the people who tune SQL queries.

BI: Fact Table Design/Data warehouse modelling

I have some issues in designing my data warehouse and ETL process because of the fact table. It contains over 100 million rows covering 2 years of accounting data. The dimensions are related to the fact table via foreign keys; I also used surrogate keys, indexes and views. How would you deal with such a fact table in order to ensure good performance, a reasonable ETL process, and a data warehouse that is adaptive and resilient to change? Would partitioning the table by half year be a good approach?
First, you should look again at your data-warehouse design.
In a fact table, the combination of foreign keys must be unique per row. If it is not, there is something wrong with the ETL process.
You can easily check this by comparing the count of all rows in the fact table with the row count of a query that groups by every foreign key (select count(*) from fact_table group by fk1, fk2, fk..n). The two counts have to be equal.
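A sketch of that check (fact_table and the fk columns are placeholders taken from the parenthetical query above, with only two keys shown):
-- total number of rows in the fact table
select count(*) as total_rows
from   fact_table;
-- number of distinct foreign-key combinations
select count(*) as distinct_fk_combinations
from   (select fk1, fk2
        from   fact_table
        group by fk1, fk2);
-- if total_rows is larger than distinct_fk_combinations, some combination
-- of foreign keys appears on more than one row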
Next, you said that you use surrogate keys as foreign keys, so there is no need to repeat that they should be integers.
Partition the fact table by month; I don't see a reason for a half-year period.
100 million rows is not too big. Perhaps you should also think about a columnar database (Vertica, for example).
I created a columnstore index on the Fact Table and the query cost (relative to the batch) is now 14% with index and 86% without index. I think it's pretty good.
Execution Plan below.
http://uploadimage.ro/img.php?image=4508_execution_plan_sk6y.png

oracle partitioning on columns frequently used in joins and where conditions

The customer table contains 9.5 million records. The customer_id column is the primary key. The database is Oracle.
Questions:
1) Should the table contain main partitions or sub-partitions? How do I decide?
Also, I don't think indexing columnA or columnB will help here because of the type of data.
TableA.columnA (varchar) has more than 80% of its records with columnA values 5, 6 or 7. columnA only has values from 1 to 7.
TableA.columnB (varchar) has 90% of its records with columnB value = 102. columnB has values from 1 to 999.
Moreover, the typical queries are (in no particular order):
Query1: where tableA.columnA = values
Query2: where tableA.columnB = values
Query3: where tableA.columnA = values AND/OR tableA.columnB = values
2) When we create sub-partitions, what happens if the query's where clause only references the sub-partition column? Does query execution go directly to the sub-partition or through the main partition?
3) The join contains tableA.partitioned_column = tableB.indexed_column
(e.g. customer_Table.branch_code = branch_table.branch_code)
Does partitioning help in the case of JOIN? Will it improve performance?
1) It's very difficult to answer without knowing the table structure, the way it is usually used, etc. But generally, for big tables, partitioning is very often a necessity.
2) If you do not specify the partition, then Oracle will have to browse through all partitions to find where the subpartition is (which is not very slow), and then use partition pruning on the subpartition. It will still be significantly faster than not having subpartitions at all. But the best situation is to refer to both the partition and the subpartition in the WHERE clause.
3) I am 99% sure it will help, because Oracle can use partition pruning to get the needed rows from tableA at once. You will be 100% sure if you check the query plan. But the best situation is when both columns are partition keys.
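Purely as an illustration, composite list-list partitioning on the two columns might look like this (tableA, the column names and the dominant values 5, 6, 7 and 102 come from the question; the partition names, the extra columns and the exact layout are assumptions, and list-list composite partitioning needs Oracle 11g or later):
create table tableA (
  customer_id number primary key,
  columnA     varchar2(10),
  columnB     varchar2(10)
  -- other columns omitted
)
partition by list (columnA)
subpartition by list (columnB)
subpartition template (
  subpartition sp_102   values ('102'),
  subpartition sp_other values (default)
)
(
  partition p_5     values ('5'),
  partition p_6     values ('6'),
  partition p_7     values ('7'),
  partition p_other values (default)
);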
If 80-90% of the rows have the same values in these columns, and those are the most frequently queried values, then partitioning will only help somewhat: you would be pruning just 10-20% of the data during these queries, so you probably want to find another way for Oracle to home in on the data your query needs (dates, perhaps?).
The value distribution in your two columns also brings up the point of statistics and making sure they are being gathered properly (with histograms to describe the skew in these columns).
As @psur points out, without knowing the details of your system it's hard to give concrete suggestions.
