Oracle | Compression Disadvantages and Advantages - oracle

We have an Oracle 12c database and will be migrating to 19c soon. Our general rule is to compress tables (Advanced Row/OLTP compression) and keep indexes uncompressed. Now we are facing a situation where, depending on the number of columns, some indexes are 800 GB while the corresponding compressed table is only 200 GB.
Can someone help me with my understanding of the below:
1. Does table compression have an impact on query performance or table loading?
2. Should we compress indexes? Will it impact the performance of loading or querying?
3. If a table is partitioned, can we selectively compress local indexes partition by partition?
4. Are there any best practices, dos, or don'ts for Oracle compression?

Re 1:
Table compression can have an impact on performance, mostly a positive one. However, it is nearly impossible to predict, as it depends on the data, the order in which the data is inserted into the table, the number of updates, and so on.
I would normally first check the potential compression ratio of a table, either with dbms_compression.get_compression_ratio or by simply creating a compressed and an uncompressed copy of the table (or of a subset of the rows if it is too big).
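A minimal PL/SQL sketch of the first approach; the tablespace, owner, and table names are placeholders, and parameter names can vary slightly between releases, so check the DBMS_COMPRESSION reference for your version:

```sql
-- Estimate the Advanced Row (OLTP) compression ratio for a table.
-- SCRATCH_TBS, APP_OWNER and BIG_TABLE are placeholder names.
DECLARE
  l_blkcnt_cmp    PLS_INTEGER;
  l_blkcnt_uncmp  PLS_INTEGER;
  l_row_cmp       PLS_INTEGER;
  l_row_uncmp     PLS_INTEGER;
  l_cmp_ratio     NUMBER;
  l_comptype_str  VARCHAR2(100);
BEGIN
  DBMS_COMPRESSION.GET_COMPRESSION_RATIO(
    scratchtbsname => 'SCRATCH_TBS',
    ownname        => 'APP_OWNER',
    objname        => 'BIG_TABLE',
    subobjname     => NULL,
    comptype       => DBMS_COMPRESSION.COMP_ADVANCED,
    blkcnt_cmp     => l_blkcnt_cmp,
    blkcnt_uncmp   => l_blkcnt_uncmp,
    row_cmp        => l_row_cmp,
    row_uncmp      => l_row_uncmp,
    cmp_ratio      => l_cmp_ratio,
    comptype_str   => l_comptype_str);
  DBMS_OUTPUT.PUT_LINE('Estimated ratio: ' || l_cmp_ratio ||
                       ' (' || l_comptype_str || ')');
END;
/
```

The procedure samples the table into the scratch tablespace, so it gives an estimate without touching the original segment.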
Re 2:
Index (prefix) compression eliminates repeated leading column values in multicolumn indexes, so the answer is the same as for 1: the benefit depends on how much repetition the leading columns contain, and you should measure before committing.
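For example, an existing multicolumn index can be rebuilt with prefix compression on its leading columns (the index, table, and column names below are illustrative; the number after COMPRESS is how many leading columns to deduplicate):

```sql
-- Rebuild an existing index, compressing the first two key columns.
ALTER INDEX ix_orders_cust_date REBUILD COMPRESS 2;

-- Or create it with prefix compression from the start.
CREATE INDEX ix_orders_cust_date
  ON orders (customer_id, order_date, order_id)
  COMPRESS 2;
```

ALTER INDEX ... VALIDATE STRUCTURE followed by a look at INDEX_STATS.OPT_CMPR_COUNT can suggest a suitable prefix length.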
Re 3:
Yes. According to the partitioning guide, you can use
CREATE INDEX i_cost1 ON costs_demo (prod_id) COMPRESS LOCAL (
PARTITION costs_old, PARTITION costs_q1_2003,
PARTITION costs_q2_2003, PARTITION costs_recent NOCOMPRESS);

Related

Reg : Efficiency among query optimizers in hive

After reading about query optimization techniques I came to know about the below techniques.
1. Indexing - bitmap and BTree
2. Partitioning
3. Bucketing
I got the difference between partitioning and bucketing, and when to use them, but I'm still confused about how indexes actually work. Where is the metadata for an index stored? Is it the namenode that stores it? When creating partitions or buckets, we can see multiple directories in HDFS, which explains the query performance optimization, but how do I visualize indexes? Are they really used in real life, given that partitioning and bucketing are already in the picture?
Please help me with the above queries. Also, is there a dedicated community page for Hadoop and Hive developers?
Indexes in Hive were never really used in real life and were never efficient, and as @mazaneicha noted in the comments, the indexing feature was removed completely in Hive 3.0; read this Jira: HIVE-18448. It was a good try anyway; thanks to Facebook's support, valuable lessons were learned.
But there are lightweight indexes in ORC (not classic indexes, but min/max statistics and Bloom filters that help prune stripes). ORC indexes and Bloom filters are efficient if the data is sorted during insert (distribute + sort).
Partitioning is most efficient when the partitioning scheme corresponds to how the table is filtered or how it is loaded (it allows loading partitions in parallel, and works well when the increment data is a whole partition).
Bucketing can help optimize joins and group-bys, but the sort-merge-bucket map join has serious restrictions that also make it inefficient in practice: both tables should have the same bucketing schema, which in real life is rare (or can be extremely inefficient to enforce), and the data should be sorted when loading buckets.
Consider using ORC with its built-in indexes and Bloom filters, and keep the number of files in your table small to avoid metadata overload and to avoid mappers copying thousands of files.
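A sketch of that setup in HiveQL (the table and column names are made up); the Bloom filter property and the distribute/sort on insert are what make the ORC stripe indexes selective:

```sql
-- ORC table with a Bloom filter on the column used for point lookups.
CREATE TABLE events_orc (
  user_id  BIGINT,
  event_ts TIMESTAMP,
  payload  STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns' = 'user_id');

-- Sort during insert so min/max stripe statistics can prune effectively.
INSERT OVERWRITE TABLE events_orc
SELECT user_id, event_ts, payload
FROM events_staging
DISTRIBUTE BY user_id SORT BY user_id;
```

Without the sort, every stripe tends to contain the full range of user_id values and the min/max statistics prune nothing.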
Also read these answers on Hive partitioning interview questions and on sorted tables in Hive.
Useful links.
Official documentation: LanguageManual
Cloudera community: https://community.cloudera.com/

Are there any advantages in using Indexes on tables in Hadoop over Oracle?

I need to compare the Indexing in Oracle Vs Hadoop(Hive). Up till now, I could find two major indexing techniques in Hive i.e. COMPACT INDEXING and BITMAP INDEXING. I could check out the performance difference of COMPACT INDEXING in Hive compared to Oracle. I would need to understand more use cases / scenarios of using Bitmap Indexing in Hive. Also, need to know if Hive supports Reverse Key Indexes , Ascending and Descending Indexes like Oracle.
Yes, there are significant advantages to using indexes in Hive over Oracle, keeping in mind that Hive is suited to large data sets and that there is ongoing work on making Hive a real-time data warehousing tool.
One use case where BITMAP indexing helps is a large table with columns that have few distinct values (you will get better results if the table is large; do not test with small tables).
As of now, Hive supports only two techniques for explicitly creating indexes: COMPACT and BITMAP.
That said, indexes in Hive are not recommended (although you can create them if your use case calls for it); the reason for this is the ORC format.
ORC has built-in indexes that allow the format to skip blocks of data during reads, and it also supports Bloom filters. Together these pretty much replicate what Hive indexes did, and they do it automatically inside the data format, without the need to manage an external table (which is essentially what a Hive index is).
I would suggest you rather spend your time properly setting up ORC tables.
Also read this great post about Hive indexing.
Hive is a data warehousing tool that runs on Hadoop, and it compiles Hive queries down to MapReduce jobs. The metadata and the actual data are kept separate; by default the metadata is stored in an embedded Apache Derby metastore, so the burden on that database is very small. Hive processes large tables easily because of its distributed nature. You can also compare inner-join performance between Oracle and Hive: for sufficiently large data sets, Hive can give you better performance.

Hive - Bucketing and Partitioning

What should be the basis for deciding between partitioning and bucketing on a set of columns in Hive?
Suppose we have a huge data set with two columns that are queried most often. My obvious choice might be to partition on these two columns, but if that would create a huge number of small files across a huge number of directories, it would be the wrong decision, and bucketing might be the better option.
Can we define a methodology for deciding whether to go for bucketing or partitioning?
Bucketing and partitioning are not exclusive, you can use both.
My short answer from my fairly long hive experience is "you should ALWAYS use partitioning, and sometimes you may want to bucket too".
If you have a big table, partitioning helps reducing the amount of data you query. A partition is usually represented as a directory on HDFS. A common usage is to partition by year/month/day, since most people query by date.
The only drawback is that you should not partition on columns with a big cardinality.
Cardinality is a fundamental concept in big data, it's the number of possible values a column may have. 'US state' for instance has a low cardinality (around 50), while for instance 'ip_number' has a large cardinality (2^32 possible numbers).
If you partition on a field with a high cardinality, hive will create a very large number of directories in HDFS, which is not good (extra memory load on namenode).
Bucketing can be useful, but you also have to be disciplined when inserting data into the table: Hive won't check that the data you're inserting is bucketed the way it's supposed to be.
Inserting into a bucketed table requires a CLUSTER BY, which may add an extra step to your processing.
But if you do lots of joins, they can be greatly sped up if both tables are bucketed the same way (on the same field and with the same number of buckets). Also, once you decide the number of buckets, you can't easily change it.
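A sketch of a bucketed table and a matching insert (the names and bucket count are illustrative; on Hive versions before 2.x you may also need SET hive.enforce.bucketing = true so the insert honors the declared bucketing):

```sql
CREATE TABLE clicks_bucketed (user_id BIGINT, url STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- CLUSTER BY makes the written files match the declared bucketing,
-- so joins on user_id against an identically bucketed table can use
-- a bucket map join.
INSERT OVERWRITE TABLE clicks_bucketed PARTITION (dt = '2019-01-01')
SELECT user_id, url
FROM clicks_staging
WHERE dt = '2019-01-01'
CLUSTER BY user_id;
```

Note that both join partners must share the bucketing column and bucket count for the join optimization to apply.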
Partitioning:
Partitioning means decomposing/dividing your input data based on some attribute, e.g. date and country here:
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs PARTITION (dt='2012-01-01', country='GB');
Files created in the warehouse after loading data:
/user/hive/warehouse/logs/dt=2012-01-01/country=GB/file1
/user/hive/warehouse/logs/dt=2012-01-01/country=GB/file2
/user/hive/warehouse/logs/dt=2012-01-01/country=US/file3
/user/hive/warehouse/logs/dt=2012-01-02/country=GB/file4
/user/hive/warehouse/logs/dt=2012-01-02/country=US/file5
/user/hive/warehouse/logs/dt=2012-01-02/country=US/file6
SELECT ts, dt, line
FROM logs
WHERE country='GB';
This query will only scan file1, file2 and file4.
Bucketing:
Bucketing further decomposes/divides your input data based on some other column.
There are two reasons why we might want to organize our tables (or partitions) into buckets.
The first is to enable more efficient queries. Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. In particular, a join of two tables that are bucketed on the same columns – which include the join columns – can be efficiently implemented as a map-side join.
The second reason to bucket a table is to make sampling more efficient. When working with large datasets, it is very convenient to try out queries on a fraction of your dataset while you are in the process of developing or refining them.
Let’s see how to tell Hive that a table should be bucketed. We use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets:
CREATE TABLE student (rollNo INT, name STRING) CLUSTERED BY (rollNo) INTO 4 BUCKETS;
SELECT * FROM student TABLESAMPLE(BUCKET 1 OUT OF 4 ON rollNo);

Composite indexes on fact tables in a data warehouse - datamart

Is it a best practice to keep composite unique indexes on fact tables in an Oracle EDW / data mart to avoid duplicates? Will it impact ETL load performance? Please share your thoughts on this topic. What other ways are there to meet the SLA for ETL loads?
Each insert into a table that has an index causes that index to be updated, which adds I/O and slows the load down a bit. So loading into a table with indexes, whether unique or not, will be somewhat slower. You can drop the index, load, and then recreate it; this also reduces index fragmentation and is usually faster for large loads.
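The drop-or-disable-then-rebuild pattern sketched above looks like this in Oracle (the index and table names are placeholders; for a non-unique index you can mark it unusable instead of dropping it, while a unique index backing a constraint would need the constraint disabled first):

```sql
-- Before the bulk load: stop maintenance of a non-unique index.
ALTER INDEX ix_fact_sales_dims UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;

-- ... run the bulk load into the fact table here ...

-- After the load: rebuild, optionally in parallel and with minimal redo.
ALTER INDEX ix_fact_sales_dims REBUILD PARALLEL 4 NOLOGGING;
ALTER INDEX ix_fact_sales_dims NOPARALLEL;
```

The final NOPARALLEL resets the index's parallel degree so later queries are not accidentally parallelized.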
I'm surprised to see a unique index on a fact table. Usually not much uniqueness is required there, and data warehouses generally denormalize and duplicate data.
It all depends on your case. If you can use the ETL process to avoid undesired duplicates, do that instead of using an index. Don't create the index if its sole purpose is data integrity/consistency: indexes get huge, so they had better be useful for your queries.

Which Oracle index is best to choose

I have a table with 5 million records. The primary key of this table is populated from a sequence. My question is: which index should I create for best performance?
B-tree index (the default)
(Range-)partitioned indexes
Or any other?
Consider that I am going to use SELECT operations most of the time.
B-tree is the default. We have tables with one billion rows that use B-tree indexes; OLTP systems almost always use B-trees for everything. The only time to consider alternate index types is when there are special considerations. For example, a highly redundant, low-cardinality data set, like an index on a column that contains only 'Y' or 'N' values, may benefit from a bitmap index, at least in terms of resources.
Bitmaps are often favored for data warehouse applications. Another approach is partitioned tables, where all rows sharing a common partition-key value are stored in the same segment; this avoids reading across every file in the tablespace to run a report, e.g. the end-of-month data for A/R.
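For illustration, the two index types side by side (the table and column names are made up):

```sql
-- Default B-tree index: right for the high-cardinality,
-- sequence-generated primary key.
CREATE UNIQUE INDEX pk_orders ON orders (order_id);

-- Bitmap index: only for low-cardinality columns in read-mostly
-- tables, since bitmap maintenance locks ranges of rows under
-- concurrent DML.
CREATE BITMAP INDEX bix_orders_flag ON orders (processed_flag);
```

For a 5-million-row table queried mostly with SELECT on the sequence key, the plain B-tree is almost always the right choice; partitioned indexes start to pay off at much larger scales or when partition pruning matches the query predicates.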