Local indexes vs Global indexes for partitioned tables in Oracle - oracle

I have partitioned a table that is growing at a rate of almost 7-8 million rows a day. The partitioning is done on a timestamp column, since data can be archived or discarded a few weeks later. I have also created indexes on the table, on the primary key and on another value that is unique. My indexes are partitioned as well; however, the index partitioning uses a hash function and does not include the partition key of the table (which is a timestamp). So I have a few questions.
The table is write-intensive. It currently sees mostly inserts, one update per row, and 2-3 lookups by indexed id within seconds of creation; after that the record is never accessed for any operation.
Is it optimal to define local indexes on the unique id, or is it better to define global indexes and partition them as I have already done? If I define global indexes without the table partitioning key in them (the timestamp, which is not used in lookups), will access become more expensive if the number of partitions is huge?
What are the downsides of global partitioned indexes for constantly growing data?
When I later decide to remove partitions, will the operation directly affect the indexes in use, given that they are partitioned by unique id rather than by timestamp?
Any other recommendations will be helpful.

Actually, I don't know of any reason to partition an index differently from its table (which is what you have).
Either create a global index or a local index, i.e. a partitioned index whose partitions match the partitions of the underlying table.
When you have a global index and you drop or truncate a partition, the global index becomes "unusable" and has to be rebuilt. You can have Oracle maintain it automatically by adding the UPDATE INDEXES clause to your drop/truncate statement. However, such an operation may take some time; this is the main drawback of a global index.
In general, local indexes are better: they are easier to maintain and usually faster since they are smaller. However, if you have many partitions and your main queries do not include the partition key (the timestamp in your case), then a local index may hurt performance. If you have, let's say, 100 partitions, Oracle would have to probe 100 index partitions, which basically means scanning 100 indexes! In such a case a global index is much faster.
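As a rough sketch of the two options discussed above, the statements below use a hypothetical table `events` with made-up column, partition, and index names (none of them come from the question). Note that Oracle requires a unique local index to include the table's partition key, which is why the local variant here is defined on `(id, created_at)` and therefore does not enforce uniqueness of `id` on its own.

```sql
-- Hypothetical table, range-partitioned by a timestamp column.
CREATE TABLE events (
  id         NUMBER        NOT NULL,
  created_at TIMESTAMP     NOT NULL,
  payload    VARCHAR2(200)
)
PARTITION BY RANGE (created_at) (
  PARTITION p_2024_w01 VALUES LESS THAN (TIMESTAMP '2024-01-08 00:00:00'),
  PARTITION p_2024_w02 VALUES LESS THAN (TIMESTAMP '2024-01-15 00:00:00')
);

-- Option A: local index, one index partition per table partition.
-- A lookup by id alone has to probe every index partition.
CREATE UNIQUE INDEX events_id_ts_lix ON events (id, created_at) LOCAL;

-- Option B: global hash-partitioned index, independent of the table partitions.
-- A lookup by id goes to exactly one index partition.
CREATE UNIQUE INDEX events_id_gix ON events (id)
  GLOBAL PARTITION BY HASH (id) PARTITIONS 8;

-- Dropping an old table partition: the local index partition goes with it,
-- while UPDATE INDEXES keeps the global index usable during the operation.
ALTER TABLE events DROP PARTITION p_2024_w01 UPDATE INDEXES;
```

With only a local index in place, the `DROP PARTITION` is cheap and needs no index maintenance clause; the cost of `UPDATE INDEXES` is paid only when global indexes exist.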

Related

How can partitioned index help if the range is needed to be given manually during index creation?

Today I was reading about "Partitioned index" from this link for a performance tuning requirement.
The example that is given in the link reads like the following:
CREATE INDEX employees_global_part_idx ON employees(employee_id)
GLOBAL PARTITION BY RANGE(employee_id)
(PARTITION p1 VALUES LESS THAN(5000),
PARTITION p2 VALUES LESS THAN(MAXVALUE));
This all looks good so far, except that it is somewhat confusing to me that while defining the index we manually set the bound of p1 to less than 5000.
So for example, if the table has 12000 records, one partition covers ids 1 to 4999 and the other covers ids 5000 to 12000, so the two are unequal in size. Another hurdle with this approach is that it seems one cannot add more partitions later on if needed. So over time this indexing approach will not give a good performance advantage.
So is there any way to overcome this problem with partitioned indexes?
If the employee_id values are incremented as new records are created, you may want to use a HASH partitioned index instead of a RANGE partitioned one.
As per the Oracle Partitioning guide:
Hash partitioned global indexes can also limit the impact of index skew on monotonously increasing column values.
Your index creation query would then be:
CREATE INDEX employees_global_part_idx ON employees(employee_id) GLOBAL
PARTITION BY HASH(employee_id)
(PARTITION p1,
PARTITION p2);
This lets Oracle take care of splitting the data evenly across the available partitions.
If you really want to use a RANGE partitioned index, then every now and then you would need to maintain it by splitting the last partition and rebuilding the index.
Read also: Global Partitioned Indexes.
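The maintenance step for a RANGE partitioned index mentioned above might look roughly like this. The split boundary of 10000 and the new partition names `p2a` and `p2b` are assumptions chosen for illustration; only the index name and `p2` come from the example in the question.

```sql
-- Split the open-ended last partition (p2, bounded by MAXVALUE) at a
-- hypothetical boundary so that new ids land in a fresh partition.
ALTER INDEX employees_global_part_idx
  SPLIT PARTITION p2 AT (10000)
  INTO (PARTITION p2a, PARTITION p2b);

-- Check the state of the index partitions after the split.
SELECT partition_name, status
FROM   user_ind_partitions
WHERE  index_name = 'EMPLOYEES_GLOBAL_PART_IDX';

-- Rebuild a partition if it is reported as UNUSABLE.
ALTER INDEX employees_global_part_idx REBUILD PARTITION p2b;
```

This also shows that a range-partitioned global index is not frozen at creation time: partitions can be added later by splitting, at the price of periodic maintenance that the HASH variant avoids.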
In terms of partitioning Oracle provides three types of indexes:
Local Partitioned Indexes: Each table partition has a corresponding index partition. I think this type is the one used (and found useful) by the majority.
Global Non-Partitioned Indexes: The index has no partitions and spans the entire table. For example, such an index can enforce a unique key where the partition key is not part of the unique key, which a local index cannot do.
Global Partitioned Indexes (the type you refer to in your question): You define the partitioning rule of the index independently of the partitioning rule of the table.
Actually, I can hardly imagine a situation where a Global Partitioned Index really makes sense. They would be useful only for some very special or exotic use cases, perhaps when you have a really huge amount of data and need to distribute your index over different physical storage.

When should we go for partition and bucketing in hive?

I understand the concepts of partitioning and bucketing in Hive tables. But what I'd like to know is "when do we go for partitioning and when do we go for bucketing?"
What are the ideal scenarios that suit partitioning and bucketing?
The main reasons for using partitioning and bucketing are as follows.
Partition:
Partitioning of table data is done to distribute the load horizontally.
Example: Suppose we have a very large table named "Parts" and we often run "where" queries that restrict the results to a particular part type.
For a faster query response, the table can be partitioned by (PART_TYPE STRING). Once you partition the table, Hive changes the way it structures the data storage and creates sub-directories that reflect the structure of the partition, like:
.../Parts/PART_TYPE=Engine-Part
.../Parts/PART_TYPE=Brakes
So now if you run a query on table "Parts" with WHERE PART_TYPE = 'Engine-Part', it will only scan the contents of one directory, PART_TYPE=Engine-Part.
Partitioning is a useful feature in Hive, but at the same time queries that do not filter on the partition column may take a long time to execute.
Another drawback is that creating too many partitions produces a large number of unnecessary Hadoop files and directories, which becomes an overhead for the NameNode, since the NameNode must keep all metadata for the file system in memory.
Bucketing:
Bucketing is another technique which can be used to further divide the data into a more manageable form.
Example: Suppose the table "part_sale" has a top-level partition of "sale_date" and is further partitioned by "part_type" as a second-level partition.
This will lead to too many small partitions.
.../part_sale/sale_date=2017-04-18/part_type=engine_part1
.../part_sale/sale_date=2017-04-18/part_type=engine_part2
.../part_sale/sale_date=2017-04-18/part_type=engine_part3
.../part_sale/sale_date=2017-04-18/part_type=engine_part4
Suppose instead that we bucket the "part_sale" table and use "part_type" as the bucketing column. The value of this column will be hashed into a user-defined number of buckets. Records with the same "part_type" will always be stored in the same bucket. You specify the number of buckets at table creation time, so the number of buckets is fixed and does not fluctuate with the data.
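A minimal HiveQL sketch of that layout follows. The non-key columns, the bucket count of 16, and the staging table `part_sale_staging` are assumptions made up for illustration; only `part_sale`, `sale_date`, and `part_type` come from the example above.

```sql
-- One directory per sale_date; inside each directory a fixed number of
-- bucket files, with rows assigned by the hash of part_type.
CREATE TABLE part_sale (
  part_id   BIGINT,
  part_type STRING,
  amount    DOUBLE
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (part_type) INTO 16 BUCKETS;

-- Load one day's data from a hypothetical staging table into its partition.
INSERT OVERWRITE TABLE part_sale PARTITION (sale_date = '2017-04-18')
SELECT part_id, part_type, amount
FROM   part_sale_staging
WHERE  sale_date = '2017-04-18';
```

The number of directories now grows only with the number of dates, while the number of files per date stays at the chosen bucket count regardless of how many distinct part types appear.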
Partitioning in Hive :-
If we are dealing with a large table and often run queries with WHERE clauses that restrict the results to particular values of one or more columns, then we should leverage Hive's partitioning. For a faster query response, a Hive table can be PARTITIONED BY (partition_cols_name). It helps to organize the data in a logical fashion, and when we query the partitioned table using the partition column, it allows Hive to skip all but the relevant sub-directories and files, so scans become cheap if partitioning is done properly. Partitioning should be used when the cardinality (the number of possible values a field can have) is not high; otherwise, too many partitions become an overhead on the NameNode.
Bucketing in Hive :-
If you want to segregate the data on a field which has high cardinality (the number of possible values a field can have), then you should use bucketing. If you want only a sample of the data according to some specific fields and not the entire data, bucketing can be a good option. If map-side joins are involved, bucketed tables are also a good option.
Partitioning helps eliminate data when the partition column is used in the WHERE clause, whereas bucketing organizes the data in each partition into multiple files, so that the same set of data is always written to the same bucket. This helps a lot when joining on the bucketed columns.
Hive bucketing is another technique for decomposing the data into more manageable, roughly equal parts. For example, suppose we have a table with columns like date, employee_name, employee_id, salary, leaves etc. In this table, using the date column as the top-level partition and employee_id as the second-level partition leads to too many small partitions. With bucketing, Hive instead assigns rows to buckets by the hash of the bucketing column.
The difference between Hive partitioning and bucketing is this: when we partition, we create a partition for each unique value of the column, so there may be situations where we end up with a lot of tiny partitions. With bucketing, you can limit the count to a number that you choose and decompose your data into those buckets. In Hive a partition is a directory, but a bucket is a file.
In Hive versions before 2.0, bucketing is not enforced by default. You have to set the following variable to enable it: set hive.enforce.bucketing=true;
PARTITIONING is used when the column that you filter on in your WHERE clause has few unique values.
BUCKETING is used when the column in your WHERE clause has many unique values.

Hive - Bucketing and Partitioning

What should be the basis for deciding whether to use partitioning or bucketing on a set of columns in Hive?
Suppose we have a huge data set with two columns that are queried most often. My obvious choice might be to partition on these two columns; but if this would result in a huge number of small files created in a huge number of directories, then partitioning on these columns would be the wrong decision, and bucketing might be the better option.
Can we define a methodology for deciding whether we should go for bucketing or partitioning?
Bucketing and partitioning are not exclusive, you can use both.
My short answer from my fairly long hive experience is "you should ALWAYS use partitioning, and sometimes you may want to bucket too".
If you have a big table, partitioning helps reduce the amount of data you query. A partition is usually represented as a directory on HDFS. A common usage is to partition by year/month/day, since most people query by date.
The only drawback is that you should not partition on columns with a big cardinality.
Cardinality is a fundamental concept in big data, it's the number of possible values a column may have. 'US state' for instance has a low cardinality (around 50), while for instance 'ip_number' has a large cardinality (2^32 possible numbers).
If you partition on a field with a high cardinality, hive will create a very large number of directories in HDFS, which is not good (extra memory load on namenode).
Bucketing can be useful, but you also have to be disciplined when inserting data into a table. Hive won't check that the data you're inserting is bucketed the way it's supposed to.
Populating a bucketed table has to do a CLUSTER BY, which may add an extra step in your processing.
But if you do lots of joins, they can be greatly sped up if both tables are bucketed the same way (on the same field and the same number of buckets). Also, once you decide the number of buckets, you can't easily change it.
Partitioning :
Partitioning is decomposing/dividing your input data based on some condition, e.g. date and country here.
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs PARTITION (dt='2012-01-01', country='GB');
Files created in the warehouse after loading data:
/user/hive/warehouse/logs/dt=2012-01-01/country=GB/file1
/user/hive/warehouse/logs/dt=2012-01-01/country=GB/file2
/user/hive/warehouse/logs/dt=2012-01-01/country=US/file3
/user/hive/warehouse/logs/dt=2012-01-02/country=GB/file4
/user/hive/warehouse/logs/dt=2012-01-02/country=US/file5
/user/hive/warehouse/logs/dt=2012-01-02/country=US/file6
SELECT ts, dt, line
FROM logs
WHERE country='GB';
This query will only scan file1, file2 and file4.
Bucketing :
Bucketing is further decomposing/dividing your input data based on some other conditions.
There are two reasons why we might want to organize our tables (or partitions) into buckets.
The first is to enable more efficient queries. Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. In particular, a join of two tables that are bucketed on the same columns – which include the join columns – can be efficiently implemented as a map-side join.
The second reason to bucket a table is to make sampling more efficient. When working with large datasets, it is very convenient to try out queries on a fraction of your dataset while you are in the process of developing or refining them.
Let’s see how to tell Hive that a table should be bucketed. We use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets:
CREATE TABLE student (rollNo INT, name STRING) CLUSTERED BY (rollNo) INTO 4 BUCKETS;
SELECT * FROM student TABLESAMPLE(BUCKET 1 OUT OF 4 ON rollNo);

Can an Oracle partitioned table improve performance on a single disk?

The Oracle SQL Reference describes partitioning:
Partitioning allows a table, index, or index-organized table to be subdivided into smaller pieces, where each piece of such a database object is called a partition. Each partition has its own name, and may optionally have its own storage characteristics.
So here is my question: if I only have one hard disk, all partitions of the table can only be placed on that one disk. Can a partitioned table still improve performance?
Partitioning on a single disk can still help for some queries. For instance, Oracle can do partition-pruning: this is the ability to select from only a subset of all partitions for a query.
Suppose that you have a table that contains data from the last 12 months. If you want to query some total and average over one particular month, then without partitioning Oracle will probably need to FULL SCAN the whole table or read a lot of data through an index. With a partitioning scheme by month, Oracle would only need to read 1/12th of the data and would still be able to FULL SCAN the partition as if it were a smaller version of the big table.
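As an illustration of the monthly scheme described above, here is a sketch with a hypothetical `sales` table; the table, column, and partition names are assumptions, and interval partitioning requires Oracle 11g or later.

```sql
-- Hypothetical table with one partition per month, created automatically
-- as rows for new months arrive.
CREATE TABLE sales (
  sale_id   NUMBER       NOT NULL,
  sale_date DATE         NOT NULL,
  amount    NUMBER(12,2)
)
PARTITION BY RANGE (sale_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(PARTITION p_before_2024 VALUES LESS THAN (DATE '2024-01-01'));

-- The predicate is on the partition key, so the optimizer can prune to the
-- March 2024 partition and full-scan only that segment.
SELECT SUM(amount), AVG(amount)
FROM   sales
WHERE  sale_date >= DATE '2024-03-01'
AND    sale_date <  DATE '2024-04-01';
```

Whether pruning actually happens can be checked in the execution plan, where the Pstart and Pstop columns show the range of partitions accessed.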

Which Oracle index is best to choose

I have a table that has 5 million records. The primary key of this table is generated from a sequence. My question is which index to create for best performance?
B-Tree Index (default)
(Range) Partitioned Indexes
Or any other?
Consider that I am going to use SELECT operations most of the time.
B-Tree is the default. We have tables with one billion rows with B-tree indexes. OLTP systems almost always use B-tree for everything. The only time you consider alternate index types is because of special considerations. For example, a highly redundant data set(low cardinality): like an index on a column that contains only Y or N characters, may benefit from a bit-map index. At least in terms of resources.
Bitmap indexes are often favored for data warehouse applications. Another approach is a partitioned table, where each partition holds all the rows sharing one common column value. This avoids having to read across all of the files in a tablespace to run a report, e.g. on the end-of-month data for A/R.
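For reference, the two index types contrasted above are created as follows. The table `orders` and its columns are hypothetical names used only for this sketch.

```sql
-- Default B-tree index: a plain CREATE INDEX, suitable for selective lookups
-- on high-cardinality columns.
CREATE INDEX orders_customer_ix ON orders (customer_id);

-- Bitmap index on a low-cardinality column holding only 'Y' or 'N'.
CREATE BITMAP INDEX orders_shipped_bix ON orders (is_shipped);
```

A primary key constraint already creates a unique B-tree index on its columns, so no extra index is needed for lookups by the key itself. Bitmap indexes tend to suit read-mostly data; under heavy concurrent DML they can cause lock contention, which is one reason OLTP systems usually stay with B-tree.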

Resources