Snowflake: 2 correlated columns have very different clustering information (one has perfect, the other has terrible)

We have a table with 120M rows (across 2222 micro-partitions) that has 2 important columns: record_id, with unique values in the format prefix|<account_id>|<uuid>, and account_id, which holds the value of <account_id>. Note that the prefix is the same for all records. There are of course also some fact columns, but those are not relevant here.
Snowflake reports perfect clustering for the record_id column (chosen automatically by Snowflake; we have not defined any clustering key ourselves) via the SYSTEM$CLUSTERING_INFORMATION function:
"total_partition_count" : 2222,
"total_constant_partition_count" : 2222,
"average_overlaps" : 24.0,
"average_depth" : 25.0,
However, for the account_id column the clustering is very bad:
"total_constant_partition_count" : 0,
"average_overlaps" : 2221.0,
"average_depth" : 2222.0,
There are about 130 distinct account IDs, which means that on average the records of one account_id should span about 17 partitions. Even if Snowflake clusters by record_id, the beginning of that column (prefix|<account_id>) correlates with the account_id column, so records with the same account_id should end up in the same partitions. Therefore, I cannot figure out why there is 100% overlap of micro-partitions for the account_id column. It is as if Snowflake used some odd sort order for the record_id column and thus scattered the rows of each account across all partitions. Is that possible?
This has negative consequences for performance, since any query with an account_id filter results in a scan of all partitions.
Note: also asked this question at snowflake forum https://support.snowflake.net/s/question/0D50Z00008vfglCSAQ/2-correlated-columns-have-very-different-clustering-information-one-has-perfect-the-other-has-terrible

Snowflake's clustering reporting functions, like the one used above, have a limitation: only the first 6 characters of a varchar are considered when assessing clustering depth. So I would not trust the great results reported for record_id, since the first 6 characters may be identical because of the shared prefix even when the account IDs that follow are effectively random.
The best solution would be to explicitly declare a clustering key on account_id and enable auto-clustering on the table.
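For example, assuming the table is named events (a hypothetical name), a minimal sketch would be:

-- Declare an explicit clustering key; Snowflake's Automatic Clustering
-- service then reclusters the micro-partitions in the background.
alter table events cluster by (account_id);

-- Re-check clustering quality for account_id once reclustering has run.
select system$clustering_information('events', '(account_id)');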

Related

Oracle Auto Partition strategy on Integer column

I need some help on how to perform auto-partitioning on an integer column, similar to what we do on a date column, e.g. PARTITION BY RANGE (DIM_DT_ID) INTERVAL (NUMTODSINTERVAL(1,'DAY')).
I have 90 million rows, performance is poor, and our SLA per query is 2 seconds, so I would like to partition the table. What is the best approach, and how do I enable auto-partitioning on an integer column?
Our queries will always filter on these columns, like:
select * from <tbname>
where ObjectID = 1346785
and patentnumber = 23456;
"i'm just making an example here, as i cant paste the original query for legality sake"
Fair enough, but the advice we give you will only be as good as the information you give us. So far, nothing you have posted suggests you need Partitioning.
The pasted query would perform well with a compound index, and would probably benefit from compression of the leading column:
create index your_table_lookup_index
on your_table(ObjectID, patentnumber) compress 1;
If that's a unique combination then make the index unique.
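For instance, a sketch of the unique variant (index name hypothetical):

create unique index your_table_lookup_uk
on your_table(ObjectID, patentnumber) compress 1;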
how do i enable auto partition on a Integer column
However, if you think you do have a genuine use case for Partitioning then we can use Interval Partitioning with integers as well as dates. This statement will create a table partitioned on objectid with a partition for every ten values.
create table your_table (
    objectid      number,
    patentnumber  number,
    created_date  date
)
partition by range (objectid)
interval (10)
(
    partition p_00010 values less than (10)
);
On your posted figures that would be about 400 partitions with around 225000 rows per partition. Is that a good choice? Who can tell? You know your data and your use cases, we don't: perhaps a partition per objectid (i.e. with interval (1)) would be better.
You already have a table, so you need to split it into partitions. The standard way of doing this would be:
1. Create a new table with your partitioning strategy (like above) but with the default partition ranged for values less than (MAXVALUE).
2. Use partition exchange to move the existing table's data into the new structure (sketched below).
3. Drop the old table and rename the new table to the old name; resolve foreign keys and other dependencies.
4. Iteratively split the partition into the required ranges.
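A minimal sketch of the exchange step (table and partition names are hypothetical; the new table must match the old table's column layout):

-- Swap the old table's data into the new table's single partition
-- as a metadata operation, without physically copying rows.
alter table your_table_new
    exchange partition p_maxvalue
    with table your_table
    without validation;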
This is a fairly time-consuming process. You have tagged your question [oracle12c]; if you're using Oracle 12c R2 you should definitely look at its online conversion mechanism, which is a single command. Find out more.
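For the benefit of later readers, a minimal sketch of that single 12c R2 command (using the table and interval from the example above; check your version and licensing):

-- Convert the existing non-partitioned table in place, online (12.2+).
alter table your_table
    modify partition by range (objectid) interval (10)
    ( partition p_00010 values less than (10) )
    online;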
Remember that Partitioning for performance is a tricky game. While it can improve queries which return a large number of rows aligned with the partition key, it can make no difference to other queries, or even impair their performance. In particular, any query which does not include the partition key (objectid in your case) will likely perform worse after partitioning the table.
Final aside: as you know but for the benefit of future Seekers, Partitioning is a chargeable extra to the Enterprise Edition license. We're not allowed to use it unless we've paid for it.

What is a skewed column in Oracle?

I found a bottleneck in my query, which selects data from only a single table yet still takes a long time. I used a non-unique index on the two columns used in the WHERE clause:
select name, isComplete from Student where year = '2015' and isComplete = 'F'
Now I have come across the concept of a "skewed column" on the internet. What is it?
How does a skewed column affect query performance, and how do I resolve the problem?
Skewed columns are columns in which the data is not evenly distributed among the rows.
For example, suppose:
You have a table order_lines with 100,000,000 rows
The table has a column named customer_id
You have 1,000,000 distinct customers
Some (very large) customers can have hundreds of thousands or millions of order lines.
In the above example, the data in order_lines.customer_id is skewed. On average, you'd expect each distinct customer_id to have 100 order lines (100 million rows divided by 1 million distinct customers). But some large customers have many, many more than 100 order lines.
This hurts performance because Oracle bases its execution plan on statistics. So, statistically speaking, Oracle thinks it can access order_lines based on a non-unique index on customer_id and get only 100 records back, which it might then join to another table or whatever using a NESTED LOOP operation.
But, then when it actually gets 1,000,000 order lines for a particular customer, the index access and nested loop join are hideously slow. It would have been far better for Oracle to do a full table scan and hash join to the other table.
So, when there is skewed data, the optimal access plan depends on which particular customer you are selecting!
Oracle lets you avoid this problem by optionally gathering "histograms" on columns, so Oracle knows which values have lots of rows and which have only a few. That gives the Oracle optimizer the information it needs to generate the best plan in most cases.
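For example, a minimal sketch of gathering a histogram on the skewed column from the example above (the schema name APP is hypothetical):

begin
    dbms_stats.gather_table_stats(
        ownname    => 'APP',                              -- hypothetical schema
        tabname    => 'ORDER_LINES',
        method_opt => 'for columns customer_id size 254'  -- request a histogram
    );
end;
/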
Whether Oracle chooses a full table scan or an index scan can also depend on the skewed column. A skewed column is simply one with an uneven value spread, e.g. a gender column containing 60 males and 40 females.

BI: Fact Table Design/Data warehouse modelling

I have some issues designing my data warehouse and ETL process because of the fact table. It contains over 100 million rows covering 2 years of accounting data. The dimensions are related to the fact table via foreign keys; I also used surrogate keys, indexes, and views. How would you deal with such a fact table to ensure good performance, a reasonable ETL process, and a data warehouse that is adaptive and resilient to change? Would partitioning the table by half-year be a good approach?
First, you should look again at your data-warehouse design.
In a fact table, the combination of foreign keys must be unique per row. If it is not, there is something wrong with the ETL process.
You can easily check this by comparing the total row count of the fact table with the number of rows returned by a query that groups by every foreign key (select count(*) from fact_table group by fk1, fk2, ..., fkN); the counts have to be equal.
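For instance, a sketch of that check with three foreign keys (column names hypothetical); both counts must match:

-- Total rows in the fact table.
select count(*) as total_rows from fact_table;

-- Number of distinct foreign-key combinations.
select count(*) as distinct_key_rows
from (select fk1, fk2, fk3 from fact_table group by fk1, fk2, fk3);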
Next, you said that you use surrogate keys as foreign keys, so there is no need to repeat that they should be integers.
Partition the fact table by month; I don't see a reason for a half-year period.
100 million rows is not too big. Perhaps you should also think about a columnar database (Vertica, for example).
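A minimal sketch of monthly partitioning (SQL Server syntax, since the follow-up below uses a columnstore index; all names and boundary dates are hypothetical):

-- Partition function mapping dates into monthly ranges.
create partition function pf_fact_month (date)
as range right for values ('2015-01-01', '2015-02-01', '2015-03-01');

-- Partition scheme placing every partition on the primary filegroup.
create partition scheme ps_fact_month
as partition pf_fact_month all to ([primary]);

-- The fact table is then created on the scheme, e.g.:
-- create table fact_accounting (...) on ps_fact_month (accounting_date);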
I created a columnstore index on the fact table, and the query cost (relative to the batch) is now 14% with the index versus 86% without it. I think that's pretty good.
Execution plan below:
http://uploadimage.ro/img.php?image=4508_execution_plan_sk6y.png
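For reference, a minimal sketch of such an index (SQL Server syntax; the table name is hypothetical):

-- A clustered columnstore index stores the whole fact table column-wise,
-- which compresses well and speeds up scan-heavy BI queries.
create clustered columnstore index ccsi_fact_accounting
on fact_accounting;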

oracle partitioning on columns frequently used in joins and where conditions

The customer table contains 9.5 million records. The customer_id column is the primary key. The database is Oracle.
Questions:
1) Should the table contain main partitions or sub-partitions? How do I decide?
Also, I don't think indexing columnA or columnB will help here because of the type of data.
TableA.columnA (varchar) has more than 80% of the records for columnA values 5,6,7. The columnA has values from 1 to 7 only.
TableA.columnB (varchar) has 90% of the records for columnB value = 102. The columnB has values from 1 to 999.
Moreover, the typical queries are (in no particular order):
Query1: where tableA.columnA = values
Query2: where tableA.columnB = values
Query3: where tableA.columnA = values AND/OR tableA.columnB = values
2) When we create sub-partitions, what happens if the query only contains a where clause for sub-partition column? Does the query execution go directly to sub-partition or through main partition?
3) the join contains tableA.partitioned_column = tableB.indexed_column
(eg. customer_Table.branch_code = branch_table.branch_code)
Does partitioning help in the case of JOIN? Will it improve performance?
1) It's very difficult to answer without knowing the table structure, the way it's usually used, etc. But generally, for big tables partitioning is very often a necessity.
2) If you do not specify the partition, Oracle will have to look through all partitions to find where the subpartition is (which is not very slow), and then use partition pruning on the subpartition. That is still significantly faster than not having subpartitions at all. But the best situation is to refer to both the partition and the subpartition in the WHERE clause.
3) For 99% I think it will help, because Oracle can use partition pruning to get the needed rows from tableA at once. You will be 100% sure if you check the query plan. But the best situation is when both columns are partition keys.
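For illustration, a minimal sketch of composite list-list partitioning on the two filter columns (column types and the exact layout are hypothetical):

create table tableA (
    columnA varchar2(3),
    columnB varchar2(3)
)
partition by list (columnA)
subpartition by list (columnB)
subpartition template (
    subpartition sp_102   values ('102'),
    subpartition sp_other values (default)
)
(
    partition p_5     values ('5'),
    partition p_6     values ('6'),
    partition p_7     values ('7'),
    partition p_other values (default)
);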
If 80-90% of the rows hold the same values in these columns and those are the most frequently queried values, then partitioning will help only a little: you would be pruning just 10-20% of the data in those queries. You probably want to find another way for Oracle to home in on the data your query needs (dates, perhaps?).
The value distribution in your two columns also brings up the point of statistics and making sure they are being gathered properly (with histograms to describe the skew in these columns).
As @psur points out, without knowing the details of your system it's hard to give concrete suggestions.

Oracle 10g: why does Oracle do a full table scan instead of using a B-tree index with low cardinality in this case?

I am researching indexes in Oracle 10g to speed up a particular query. Over and over again I am reading that indexing low-cardinality columns (columns with very few unique values, such as a gender column in an employee table) will very rarely help speed up lookups. This makes sense if the data in that low-cardinality column is uniformly distributed, e.g. ~50% of employee records have gender = 'M' and the other ~50% have gender = 'F'. But what if the data is not uniformly distributed and you are searching for the records that do not have the same key as the majority? What if the above gender column was indexed, the employee table was for a company that had 2% male and 98% female employees, and we only ever queried the male employees? Does this low-cardinality rule of thumb still hold up?
The situation I am dealing with now is a table that has a non-null binary column; each record stores either a 1 or a 0. Within this table there are something like 99,999 records with a 0 and a single record with a 1. Oracle opts for a full table scan even though I have a b-tree index on this binary column.
I suppose part of what I am not understanding is what the b-tree would look like when the majority of keys are duplicates, and why it would not be able to quickly find the small set of records in the non-duplicate minority.
Dan,
The answer lies in the question: what are the various kinds of indexes that exist in Oracle?
When a column has a lot of redundancy (low cardinality), a bitmap index is the best choice.
E.g. suppose a table has a column named employee_status with the values (YES | NO):
select * from emp where employee_status = 'YES';
A b-tree index, hash index, or anything other than a bitmap index will rarely help here, in spite of the filters and indexes.
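For completeness, a sketch of such a bitmap index (note that bitmap indexes require Enterprise Edition and behave poorly under heavy concurrent DML):

-- One bitmap per distinct value; very compact for low-cardinality columns.
create bitmap index emp_status_bix on emp (employee_status);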
Thanks
Prashant Dixit
www.oracleant.com
