Is it a best practice to keep composite unique indexes on fact tables in an Oracle EDW / data mart to avoid duplicates? Will it impact ETL data load performance? Please share your thoughts on this topic. What other ways are there to meet the SLA for the ETL load?
Each insert into an indexed table also has to update the index, which costs extra I/O and slows the insert down a bit. So loading into a table with indexes, whether unique or not, will be somewhat slower. You can drop the index, load the data, and then create the index again. That reduces index fragmentation and is usually faster for large loads.
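To make the drop / load / recreate sequence concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for Oracle, and the table and index names (fact_sales, ix_fact_sales_key) are made up for illustration. The same three steps apply in Oracle, though the exact DDL and options differ.

```python
import sqlite3

def bulk_load(conn, rows):
    """Drop the index, load all rows, then rebuild the index once."""
    cur = conn.cursor()
    # 1. Drop the index so inserts do not have to maintain it row by row.
    cur.execute("DROP INDEX IF EXISTS ix_fact_sales_key")
    # 2. Load the whole batch.
    cur.executemany(
        "INSERT INTO fact_sales (date_id, product_id, amount) VALUES (?, ?, ?)",
        rows,
    )
    # 3. Recreate the index in a single pass over the loaded data.
    cur.execute(
        "CREATE INDEX ix_fact_sales_key ON fact_sales (date_id, product_id)"
    )
    conn.commit()

# Hypothetical fact table, created in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (date_id INTEGER, product_id INTEGER, amount REAL)"
)
conn.execute("CREATE INDEX ix_fact_sales_key ON fact_sales (date_id, product_id)")

rows = [(d, p, 1.0) for d in range(100) for p in range(50)]
bulk_load(conn, rows)

row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
index_names = [r[1] for r in conn.execute("PRAGMA index_list('fact_sales')")]
```

Whether this actually beats loading with the index in place depends on the batch size relative to the table, so it is worth timing both on your own data.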
I'm surprised to see a unique index on a fact table. Usually there is not so much uniqueness required there and in general data warehouses denormalize and duplicate data.
It all depends on your case. If you can use ETL to avoid undesired duplicates do it instead of using an index. Don't create this index if the sole purpose is data integrity/consistency. Indexes get huge so they better be useful for your queries.
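As a rough illustration of handling duplicates in the ETL step instead of with a unique index, the sketch below filters a batch on a composite key before it is loaded. The column names and the already_loaded set are assumptions for the example. In a real load the existing keys would come from the target table or a staging join.

```python
def dedupe_batch(rows, key_columns, existing_keys):
    """Return only rows whose composite key has not been seen yet.

    existing_keys is updated in place so later batches see the new keys.
    """
    fresh = []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key in existing_keys:
            continue  # duplicate within the batch or already in the target
        existing_keys.add(key)
        fresh.append(row)
    return fresh

batch = [
    {"date_id": 1, "product_id": 10, "amount": 5.0},
    {"date_id": 1, "product_id": 10, "amount": 5.0},  # duplicate within batch
    {"date_id": 2, "product_id": 10, "amount": 7.5},
    {"date_id": 3, "product_id": 11, "amount": 1.0},  # key already loaded
]
already_loaded = {(3, 11)}  # hypothetical keys present in the fact table
clean = dedupe_batch(batch, ("date_id", "product_id"), already_loaded)
```

Here clean keeps the first (1, 10) row and the (2, 10) row, and drops the other two.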
I want to have an in-memory cache layer in my application. To populate the cache, I have to get data from a large Cassandra table. A select-all is not recommended because, without partition keys, it is a slow read operation. Alternatively, I can "predict" the partition keys from another Cassandra table. I would have to read all of that table too, but it is a much smaller one. After reading the user table, I would build a list of potential partition keys (userX, userY) that may or may not be present in the initial table, and then try to populate the cache by executing a select query for each potential key. That also doesn't sound like a really good idea.
So the question is: how do I properly populate a cache layer with data from a Cassandra DB?
The second option is preferred for warming up or pre-loading your cache.
Single-partition asynchronous queries from multiple client/app instances are much better than doing a full table scan. Asynchronous queries from lots of clients distribute the load efficiently across all nodes in the cluster, which is why they perform better.
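The shape of that warm-up can be sketched as below. This is only an outline using asyncio from the standard library: fetch_partition and FAKE_TABLE are hypothetical stand-ins for a real single-partition read through your driver, and max_in_flight is an assumed knob for capping concurrent requests.

```python
import asyncio

# Hypothetical data standing in for the large table, keyed by partition key.
FAKE_TABLE = {"userX": {"name": "X"}, "userZ": {"name": "Z"}}

async def fetch_partition(key):
    """Stand-in for one single-partition query (WHERE partition_key = ?)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return FAKE_TABLE.get(key)  # None when the key does not exist

async def warm_cache(candidate_keys, max_in_flight=32):
    """Query each candidate key concurrently and cache the rows that exist."""
    cache = {}
    limiter = asyncio.Semaphore(max_in_flight)  # cap concurrent requests

    async def load(key):
        async with limiter:
            row = await fetch_partition(key)
        if row is not None:
            cache[key] = row

    await asyncio.gather(*(load(k) for k in candidate_keys))
    return cache

cache = asyncio.run(warm_cache(["userX", "userY", "userZ"]))
```

Keys that turn out not to exist (userY here) simply return nothing and are skipped, so guessing a few wrong keys is cheap compared with a full scan.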
It should be said that if you've got your data model right and you've sized your cluster correctly, you can achieve single-digit millisecond latencies. I work with a lot of large organisations who have a 95% SLA for 6-8ms reads. Cheers!
After reading about query optimization techniques I came to know about the below techniques.
1. Indexing - bitmap and BTree
2. Partitioning
3. Bucketing
I understand the difference between partitioning and bucketing and when to use each, but I'm still confused about how indexes actually work. Where is the metadata for an index stored? Is it the namenode that stores it? When we create partitions or buckets, we can see multiple directories in HDFS, which explains the query performance gain, but how can I visualize indexes? Are they really used in real life, given that partitioning and bucketing are available?
Please help me with the above questions. Also, is there any dedicated page for the Hadoop and Hive developer community?
Indexes in Hive were never used in real life and were never efficient, and as @mazaneicha noted in the comments, the indexing feature was removed completely in Hive 3.0; see this Jira: HIVE-18448. It was a great try anyway; thanks to Facebook's support, valuable lessons have been learned.
But there are lightweight indexes in ORC (well, not classic indexes but min/max statistics and Bloom filters, which help to prune stripes). ORC indexes and Bloom filters are efficient if the data is sorted during insert (distribute + sort).
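To see why sorting matters for min/max pruning, here is a toy model in Python. It is not ORC's actual file format or reader logic, just the idea: each "stripe" keeps the min and max of a column, and a point lookup can skip any stripe whose range cannot contain the value. The Stripe class and the stripe size of 10 are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Stripe:
    min_value: int
    max_value: int
    rows: list

def build_stripes(values, stripe_size):
    """Split values into fixed-size stripes and record min/max per stripe."""
    stripes = []
    for i in range(0, len(values), stripe_size):
        chunk = values[i:i + stripe_size]
        stripes.append(Stripe(min(chunk), max(chunk), chunk))
    return stripes

def stripes_to_read(stripes, target):
    """Keep only stripes whose [min, max] range could contain target."""
    return [s for s in stripes if s.min_value <= target <= s.max_value]

values = [(i * 37) % 100 for i in range(100)]  # 0..99 in scrambled order

unsorted_hits = stripes_to_read(build_stripes(values, 10), 42)
sorted_hits = stripes_to_read(build_stripes(sorted(values), 10), 42)
```

With sorted input only one stripe has to be read for the value 42. With scrambled input the ranges overlap, so several stripes survive pruning and the statistics help much less.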
Partitioning is most efficient when the partitioning scheme corresponds to how the table is filtered or how it is loaded (it allows partitions to be loaded in parallel, and it works efficiently when each data increment is a whole partition).
Bucketing can help with optimizing joins and group by, but sort-merge-bucket map join has serious restrictions that make it inefficient as well. Both tables must have the same bucketing scheme, which is rare in real life or can be extremely inefficient. The data also has to be sorted when loading buckets.
Consider using ORC with its built-in indexes and Bloom filters, and keep the number of files in your table small to avoid metadata overload and to avoid mappers copying thousands of files.
Read these: "partitions in hive interview questions" and "Sorted Table in Hive".
Useful links.
Official documentation: LanguageManual
Cloudera community: https://community.cloudera.com/
We have an Oracle 12c database and will be migrating to 19c soon. As a rule, we compress tables (advanced OLTP compression) and keep indexes uncompressed. Now we are facing a situation where, depending on the number of columns, some indexes are 800GB while the corresponding table is 200GB (compressed).
Can someone help me understand the following:
1. Does table compression have an impact on query performance or table loading?
2. Should we compress indexes? Will it impact the performance of loading or querying?
3. If a table is partitioned, can we selectively compress local indexes partition by partition?
4. Are there any best practices or dos and don'ts for Oracle compression?
Re 1:
Table compression can have an impact on performance, mostly a positive one. However, it is nearly impossible to predict, as it depends on the data, and on the order the data is inserted into the table, the number of updates, etc.
I'd normally first check the potential compression ratio of a table, either with dbms_compression.get_compression_ratio or by simply creating a compressed and an uncompressed copy of the table (or of a subset of the rows if it is too big).
Re 2:
Index compression eliminates repeated leading column values in multicolumn indexes, so the answer is the same as for 1: it depends on the data.
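A toy model of the idea, in Python: sorted index entries that share the same leading columns store that prefix once and keep only the differing suffixes. This is a conceptual sketch, not Oracle's on-disk block format, and the sample entries and the choice of prefix_len are made up.

```python
def prefix_compress(sorted_entries, prefix_len):
    """Store each distinct leading prefix once, followed by its suffixes."""
    compressed = []
    for entry in sorted_entries:
        prefix, suffix = entry[:prefix_len], entry[prefix_len:]
        if compressed and compressed[-1][0] == prefix:
            compressed[-1][1].append(suffix)  # prefix already stored
        else:
            compressed.append((prefix, [suffix]))
    return compressed

# Hypothetical three-column index entries: (region, country, row id).
entries = sorted([
    ("EU", "DE", 1001), ("EU", "DE", 1002), ("EU", "FR", 1003),
    ("US", "CA", 1004), ("US", "CA", 1005), ("US", "CA", 1006),
])
compressed = prefix_compress(entries, prefix_len=2)

# Count stored column values before and after as a crude size measure.
stored_before = sum(len(e) for e in entries)
stored_after = sum(len(p) + sum(len(s) for s in ss) for p, ss in compressed)
```

The saving grows with how often the leading columns repeat, which is why an index whose leading columns are nearly unique gains little or nothing.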
Re 3:
Yes. According to the partitioning guide, you can use
CREATE INDEX i_cost1 ON costs_demo (prod_id) COMPRESS LOCAL (
PARTITION costs_old, PARTITION costs_q1_2003,
PARTITION costs_q2_2003, PARTITION costs_recent NOCOMPRESS);
This is an interview question I faced: if we have 1 TB of data in HDFS, which method in Hive gives us faster performance, partitioning or bucketing?
I told them that we choose either partitioning or bucketing depending on the data, but the interviewer wasn't satisfied with my answer.
What would be the proper answer (along with an example)?
Your answer is correct: it really depends on the data and on what exactly you want to do with it.
Partitioning is used for distributing load horizontally in a logical fashion. It optimizes performance, but sometimes it leads to partitions that hold very little data. This results in bad performance, as MapReduce works better on a few big files than on many small ones.
Here, bucketing can help, because bucketing guarantees that all the data for a given value of the bucketing column stays together. E.g. if we bucket the employee table and use emp_id as the bucketing column, the value of this column is hashed into a user-defined number of buckets (which should be tuned to the number of records). Records with the same emp_id will always be stored in the same bucket. At the same time, one bucket may hold many emp_id values, giving a better-sized chunk of data for MapReduce processing. Bucketing is especially helpful if you want to perform a map-side join.
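The bucket assignment itself is just "hash the key, take it modulo the number of buckets". Here is a small Python sketch of that rule. The hash is simplified to the integer value itself, which is an assumption for illustration and not necessarily what your Hive version uses, and the emp_id values are made up.

```python
from collections import defaultdict

def bucket_for(emp_id, num_buckets):
    """Simplified bucketing rule: hash(key) mod number of buckets.

    The integer value stands in for the hash here (illustration only).
    """
    return emp_id % num_buckets

records = [
    {"emp_id": 101, "dept": "A"},
    {"emp_id": 202, "dept": "B"},
    {"emp_id": 303, "dept": "A"},
    {"emp_id": 404, "dept": "C"},
    {"emp_id": 101, "dept": "A"},  # same emp_id as the first record
    {"emp_id": 505, "dept": "B"},
]

buckets = defaultdict(list)
for rec in records:
    buckets[bucket_for(rec["emp_id"], 4)].append(rec)

ids_in_bucket_1 = [r["emp_id"] for r in buckets[1]]
```

Both records for emp_id 101 land in bucket 1, and that bucket also holds emp_id 505, so one bucket file can carry several keys while still keeping each key in exactly one place.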
Your answer is correct.
Hive partitioning is an effective method to improve query performance on larger tables. Partitioning allows you to store data in separate sub-directories under the table location. It greatly helps queries that filter on the partition key(s).
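The sub-directory layout can be sketched like this. Hive names partition directories column=value, and the Python below only models path selection: the table location /warehouse/sales and the year/month partition columns are hypothetical.

```python
def partition_path(table_location, partition_spec):
    """Build a Hive-style partition directory: <table>/<col>=<val>/..."""
    parts = "/".join(f"{col}={val}" for col, val in partition_spec.items())
    return f"{table_location}/{parts}"

def prune(paths, column, value):
    """Keep only the directories matching a filter on a partition column."""
    wanted = f"{column}={value}"
    return [p for p in paths if wanted in p.split("/")]

# Hypothetical table partitioned by year and month.
paths = [
    partition_path("/warehouse/sales", {"year": 2023, "month": 1}),
    partition_path("/warehouse/sales", {"year": 2023, "month": 2}),
    partition_path("/warehouse/sales", {"year": 2024, "month": 1}),
]
to_scan = prune(paths, "year", 2024)
```

A filter on year = 2024 leaves a single directory to scan, while a filter on a non-partition column would still have to read all three.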
Bucketing improves join performance if the bucket key and the join key are the same. Bucketing in Hive distributes the data into different buckets based on the hash of the bucket key. It also reduces I/O during the join if the join is performed on the same keys (columns).
I have a table with 5 million records. The primary key of this table is generated from a sequence. My question is: which index should I create for the best performance?
B-Tree Index (default)
(Range) Partitioned Indexes
Or any other?
Consider that I am going to run SELECT operations most of the time.
B-Tree is the default. We have tables with one billion rows with B-tree indexes. OLTP systems almost always use B-tree for everything. You only consider alternative index types when there are special considerations. For example, a highly redundant (low-cardinality) data set, like a column that contains only Y or N characters, may benefit from a bitmap index, at least in terms of resources.
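To show why low cardinality suits a bitmap index, here is a conceptual Python sketch. It keeps one bitmap per distinct value, with Python integers used as bit sets, and answers a two-column filter with a single bitwise AND. The columns active and region and their values are invented, and real bitmap indexes add compression and locking behaviour that this ignores.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    bitmaps = {}
    for row_id, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps

def matching_rows(bitmap):
    """Decode a bitmap back into the list of row ids it marks."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Two hypothetical low-cardinality columns over the same eight rows.
active = ["Y", "N", "Y", "Y", "N", "N", "Y", "N"]
region = ["E", "E", "W", "E", "W", "E", "W", "W"]

active_ix = build_bitmap_index(active)
region_ix = build_bitmap_index(region)

# WHERE active = 'Y' AND region = 'W' becomes one bitwise AND.
hits = matching_rows(active_ix["Y"] & region_ix["W"])
```

With only two distinct values per column the index is just two bitmaps, whereas a column with millions of distinct values would need millions of mostly empty bitmaps.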
Bitmap indexes are often favored for data warehouse applications. Another approach is partitioned tables, where a single physical data file holds all the rows that share one common column value. This eliminates having to read across all of the files in a tablespace to run a report. Ex: the end-of-month data for A/R.