Performance hit when writing into partitioned tables

Can someone please help me understand why the table takes so long to write when it is very small?

As advised here, you shouldn't partition on a column that has high cardinality (number of unique values). As can be seen in the screenshot, the orderDate column has 753 unique values. Under the covers that means 753 folders have to be created, and each folder would have on average ~1.2 records in a parquet file (assuming equal date distribution).
You should consider extracting the month and year, or just the year value from the orderDate column, and partition on that.
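To get a feel for how much the partition count drops, here is a minimal Python sketch. The data is hypothetical (753 consecutive dates, which is an assumption; the real orderDate values are only visible in the screenshot), and `partition_counts` is an illustrative helper, not part of any library.

```python
from datetime import date, timedelta

def partition_counts(order_dates):
    """Count the distinct partition values each candidate key would produce."""
    by_day = set(order_dates)
    by_month = {(d.year, d.month) for d in order_dates}
    by_year = {d.year for d in order_dates}
    return {"day": len(by_day), "month": len(by_month), "year": len(by_year)}

# Hypothetical data: 753 consecutive order dates starting on 2019-01-01.
dates = [date(2019, 1, 1) + timedelta(days=i) for i in range(753)]
counts = partition_counts(dates)
print(counts)
```

Each distinct value becomes one folder, so partitioning on year and month (or only on year) leaves far fewer, larger files to write.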

Related

Oracle Auto Partition strategy on Integer column

I need some help with automatic (interval) partitioning on an integer column, similar to what we do on a date column with PARTITION BY RANGE (DIM_DT_ID) INTERVAL (NUMTODSINTERVAL(1,'DAY')).
I have 90 million rows and query performance is poor. Our SLA per query is 2 seconds, so I would like to partition the table. What is the best approach, and how do I enable automatic partitioning on an integer column?
Our queries will always filter on these columns, like this:
select * from <tbname>
where ObjectID =1346785
and patentnumber=23456.
"i'm just making an example here, as i cant paste the original query for legality sake"
Fair enough, but the advice we give you will only be as good as the information you give us. So far, nothing you have posted suggests you need Partitioning.
The pasted query would perform well with a compound index, and would probably benefit from compression of the leading column:
create index your_table_lookup_index
on your_table(ObjectID, patentnumber) compress 1;
If that's a unique combination then make the index unique.
how do i enable auto partition on a Integer column
However, if you think you do have a genuine use case for Partitioning then we can use Interval Partitioning with integers as well as dates. This statement will create a table partitioned on objectid with a partition for every ten values.
create table your_table (
objectid number,
patentnumber number,
created_date date
)
partition by range (objectid)
interval (10)
(
partition p_00010 values less than (10)
);
On your posted figures that would be about 400 partitions with around 225000 rows per partition. Is that a good choice? Who can tell? You know your data and your use cases, we don't: perhaps a partition per objectid (i.e. with interval (1)) would be better.
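As a rough illustration of how interval partitioning maps a value to a partition, here is a small Python sketch. It only models the arithmetic (a transition point of 10 and an interval of 10, as in the statement above); `interval_partition_range` is a hypothetical helper, not an Oracle API, and the 400-partition figure is taken from the estimate above.

```python
def interval_partition_range(value, transition=10, interval=10):
    """Return the [low, high) range of the partition that would hold `value`.

    Values below the transition point land in the initial partition
    (p_00010 in the statement above), whose lower bound is open (None).
    """
    if value < transition:
        return (None, transition)
    steps = (value - transition) // interval
    low = transition + steps * interval
    return (low, low + interval)

# The objectid used in the example query.
print(interval_partition_range(1346785))

# Rows per partition, using the figures quoted above (90 million rows, ~400 partitions).
rows_per_partition = 90_000_000 // 400
print(rows_per_partition)
```

Changing `interval` to 1 models the "one partition per objectid" variant mentioned above.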
You already have a table, so you need to split it into partitions. The standard way of doing this would be:
1. Create a new table with your partitioning strategy (like above) but with the default partition ranged for values less than (MAXVALUE).
2. Use partition exchange to move the existing table data into the new structure.
3. Drop the old table and rename the new table to the old table's name; resolve foreign keys and other dependencies.
4. Iteratively split the partition into the required ranges.
This is a fairly time-consuming process. You have tagged your question [oracle12c]; if you're using Oracle 12c R2 you should definitely look at its online conversion mechanism, which is a single command. Find out more.
Remember that Partitioning for performance is a tricky game. While it can improve queries which return a large number of rows aligned with the Partition key, it can make no difference to other queries, or even impair their performance. In particular, any query which does not include the partition key (objectid in your case) will likely perform worse after partitioning the table.
Final aside: as you know but for the benefit of future Seekers, Partitioning is a chargeable extra to the Enterprise Edition license. We're not allowed to use it unless we've paid for it.

what is skewed column in Oracle

I found a bottleneck in my query, which selects data from a single table and takes a long time. I have a non-unique index on two columns, both of which are used in the WHERE clause.
select name ,isComplete from Student where year='2015' and isComplete='F'
While searching the internet I came across the concept of a skewed column. What is it?
If you have any idea, please help:
How do I resolve the problem of a skewed column?
How does a skewed column affect the performance of the query?
Skewed columns are columns in which the data is not evenly distributed among the rows.
For example, suppose:
You have a table order_lines with 100,000,000 rows
The table has a column named customer_id
You have 1,000,000 distinct customers
Some (very large) customers can have hundreds of thousands or millions of order lines.
In the above example, the data in order_lines.customer_id is skewed. On average, you'd expect each distinct customer_id to have 100 order lines (100 million rows divided by 1 million distinct customers). But some large customers have many, many more than 100 order lines.
This hurts performance because Oracle bases its execution plan on statistics. So, statistically speaking, Oracle thinks it can access order_lines based on a non-unique index on customer_id and get only 100 records back, which it might then join to another table or whatever using a NESTED LOOP operation.
But, then when it actually gets 1,000,000 order lines for a particular customer, the index access and nested loop join are hideously slow. It would have been far better for Oracle to do a full table scan and hash join to the other table.
So, when there is skewed data, the optimal access plan depends on which particular customer you are selecting!
Oracle lets you avoid this problem by optionally gathering "histograms" on columns, so Oracle knows which values have lots of rows and which have only a few. That gives the Oracle optimizer the information it needs to generate the best plan in most cases.
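The gap between the average and the actual row counts can be seen in a small Python sketch. The numbers are scaled-down, made-up data (100 customers instead of 1,000,000), and `skew_summary` is a hypothetical helper; it only mimics the kind of per-value frequency information a histogram records, not Oracle's implementation.

```python
from collections import Counter

def skew_summary(values):
    """Compare the average rows per distinct value with the most frequent value."""
    counts = Counter(values)
    average = len(values) / len(counts)
    value, largest = counts.most_common(1)[0]
    return {
        "distinct": len(counts),
        "average": average,
        "largest_value": value,
        "largest": largest,
    }

# Hypothetical order_lines.customer_id data: 99 small customers with
# 10 order lines each, plus one very large customer (id 100) with 9,010.
customer_ids = [cid for cid in range(1, 100) for _ in range(10)] + [100] * 9010
summary = skew_summary(customer_ids)
print(summary)
```

An estimate based only on the average would expect 100 rows for every customer, which is far too high for the 99 small customers and far too low for customer 100.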
Whether a full table scan or an index scan is the better choice depends on how skewed the column is.
Skew simply describes how the values are spread, for example a gender column that contains 60 male and 40 female values.

Partitioning or bucketing hive table based on only month/year to optimize queries

I'm building a table that contains about 400k rows of a messaging app's data.
The current table's columns look something like this:
message_id (int)| sender_userid (int)| other_col (string)| other_col2 (int)| create_dt (timestamp)
A lot of queries I would be running in the future will rely on a where clause involving the create_dt column. Since I expect this table to grow, I would like to try and optimize it right now. I'm aware that partitioning is one way, but when I partition it based on create_dt the result is too many partitions since I have every single date spanning back to Nov 2013.
Is there a way to instead partition by a range of dates? How about partition for every 3 months? or even every month? If this is possible - Could I possibly have too many partitions in the future making it inefficient? What are some other possible partition methods?
I've also read about bucketing, but as far as I'm aware that's only useful if you would be doing joins on a column that the bucket is based on. I would most likely be doing joins only on column sender_userid (int).
Thanks!
I think this might be a case of premature optimization. I'm not sure what your definition of "too many partitions" is, but we have a similar use case. Our tables are partitioned by date and customer column. We have data that spans back to Mar 2013. This created approximately 160k+ partitions. We also use a filter on date and we haven't seen any performance problems with this schema.
On a side note, Hive is getting better at scaling up to 100s of thousands of partitions and tables.
On another side note, I'm curious as to why you're using Hive in the first place for this. 400k rows is a tiny amount of data and is not really suited for Hive.
Check out Hive's built-in UDFs. With the right combination of them you can achieve what you want. Here's an example that partitions by month (it produces a "YEAR-MONTH" string that you can use as the partition column value):
select concat(cast(year(to_date(create_dt)) as string),'-',cast(month(to_date(create_dt)) as string))
But when partitioning on dates it is usually useful to have multiple levels of the date dimension so in this case you should have two partition columns, first for year and second for month:
select year(to_date(create_dt)),month(to_date(create_dt))
Keep in mind that timestamps and dates are strings, and that functions like month() or year() return integers as values of date fields. You can use simple mathematical operations to figure out the right partition.
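The "simple mathematical operations" can be illustrated outside Hive with a short Python sketch. It assumes create_dt arrives as a 'YYYY-MM-DD HH:MM:SS' string (an assumption, since the exact format is not shown above), and `partition_keys` is a hypothetical helper. The quarter arithmetic gives the three-month buckets asked about in the question.

```python
from datetime import datetime

def partition_keys(create_dt):
    """Derive year, month and quarter partition values from a timestamp string."""
    ts = datetime.strptime(create_dt, "%Y-%m-%d %H:%M:%S")
    # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
    quarter = (ts.month - 1) // 3 + 1
    return {"year": ts.year, "month": ts.month, "quarter": quarter}

keys = partition_keys("2013-11-05 14:30:00")
print(keys)
```

The same integer arithmetic can be applied to the value returned by month() in a Hive query.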

BI: Fact Table Design/Data warehouse modelling

I have an issue designing my data warehouse and ETL process because of the fact table. It contains over 100 million rows for 2 years of accounting data. The dimensions are related to the fact table via foreign keys; I also use surrogate keys, indexes and views. How would you deal with such a fact table in order to ensure good performance, a reasonable ETL process, and a data warehouse that is adaptive and resilient to change? Would partitioning the table by half year be a good approach?
First, you should look again at your data-warehouse design.
In a fact table, the combination of foreign keys must be unique per row. If it is not, there is something wrong with the ETL process.
You can easily check this by comparing the total row count of the fact table (select count(*) from fact_table) with the number of groups you get when you group by every foreign key (select count(*) from (select 1 from fact_table group by fk1, fk2, fk..n) g). The two numbers have to be equal.
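The row-count comparison can be tried out with a small self-contained Python sketch that uses the standard sqlite3 module. The table, column names and data are made up for illustration, and `fk_combination_is_unique` is a hypothetical helper; on a real warehouse you would run the equivalent SQL directly.

```python
import sqlite3

def fk_combination_is_unique(conn, table, fk_columns):
    """Return True when every foreign-key combination occurs exactly once."""
    cols = ", ".join(fk_columns)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    # Number of distinct foreign-key combinations (one row per group).
    groups = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT 1 FROM {table} GROUP BY {cols})"
    ).fetchone()[0]
    return total == groups

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_table (fk1 INTEGER, fk2 INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO fact_table VALUES (?, ?, ?)",
    [(1, 1, 10.0), (1, 2, 20.0), (2, 1, 30.0)],
)
unique_before = fk_combination_is_unique(conn, "fact_table", ["fk1", "fk2"])

# A second row for the combination (1, 2) simulates an ETL duplicate.
conn.execute("INSERT INTO fact_table VALUES (1, 2, 99.0)")
unique_after = fk_combination_is_unique(conn, "fact_table", ["fk1", "fk2"])
print(unique_before, unique_after)
```

The table and column names are interpolated into the SQL only because this is a local sketch with trusted input.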
Next, you said that you use surrogate keys as foreign keys. I assume there is no need to repeat that they should be integers.
Partition the fact table by month; I don't see why you would use half-year periods.
100 million rows is not that big. Perhaps you should consider a columnar database (Vertica, for example).
I created a columnstore index on the Fact Table and the query cost (relative to the batch) is now 14% with index and 86% without index. I think it's pretty good.
Execution Plan below.
http://uploadimage.ro/img.php?image=4508_execution_plan_sk6y.png

Oracle 10g: why does Oracle do a full table scan instead of using a B-tree index with low cardinality in this case?

I am researching indexes in Oracle 10g to speed up a particular query. Over and over again I am reading that indexing low cardinality columns (columns with very few unique values, such as a gender column in an employee table) will very rarely help speed up lookups. This makes sense if the data in that low cardinality column is uniformly distributed, e.g. ~50% of employee records have gender = 'M' and the other ~50% have gender = 'F'. But what if the data is not uniformly distributed and you are searching for the records that do not have the same key as the majority? What if the above gender column was indexed, the employee table was for a company that had 2% male and 98% female employees, and we only ever query the male employees? Does this low cardinality rule of thumb still hold up?
The situation I am dealing with now is a table that has a non-null binary column; each record always stores either a 1 or a 0. Within this table there are something like 99,999 records with a 0 and a single record with a 1. Oracle is opting for a full table scan even though I have a b-tree index on this binary column.
I suppose part of what I am not understanding is what the b-tree would look like when the majority of keys are duplicates and why it would not be able to quickly find a set of records that are in the non-duplicate minority.
Dan,
The answer lies in the following question:
What are the various types of indexes that exist in Oracle?
When a column has a lot of redundancy (low cardinality), a bitmap index is the best choice.
For example, suppose a table has a column named 'Employee_status' with the values (YES | NO):
Select * from emp where Employee_Status='YES';
If you have any index other than a bitmap index (a B-tree index, a hash index, and so on), it will rarely help here, despite the filter and the index.
Thanks
Prashant Dixit
www.oracleant.com
