Reg: Efficiency of query optimization techniques in Hive / Hadoop

After reading about query optimization techniques, I came across the techniques below:
1. Indexing - bitmap and B-tree
2. Partitioning
3. Bucketing
I got the difference between partitioning and bucketing and when to use them, but I'm still confused about how indexes actually work. Where is the index metadata stored? Is it the NameNode that stores it? When creating partitions or buckets we can see multiple directories in HDFS, which explains the query performance gain, but how do I visualize indexes? Are they really used in real life, given that partitioning and bucketing are already in the picture?
Please help me with the above questions. Also, is there a dedicated community page for Hadoop and Hive developers?

Indexes in Hive were never really used in real life and were never efficient, and as @mazaneicha noted in the comments, the indexing feature was removed completely in Hive 3.0; read this Jira: HIVE-18448. It was a great try anyway; thanks to Facebook's support, valuable lessons were learned.
But there are lightweight indexes in ORC (well, not classic indexes exactly, but min/max statistics and Bloom filters, which help prune stripes). ORC indexes and Bloom filters are efficient if the data is sorted during insert (distribute + sort).
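A minimal sketch of that setup; the table and column names (events_orc, user_id) are made up for illustration:

-- ORC keeps min/max statistics per stripe automatically; the Bloom filter
-- on user_id is added via table properties (hypothetical schema)
CREATE TABLE events_orc (
  user_id    BIGINT,
  event_type STRING,
  event_ts   TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  'orc.bloom.filter.columns' = 'user_id',
  'orc.bloom.filter.fpp'     = '0.05'
);

-- Sorting during insert clusters equal keys into few stripes, which is
-- what makes min/max and Bloom filter pruning effective on point lookups
INSERT OVERWRITE TABLE events_orc
SELECT user_id, event_type, event_ts
FROM events_staging
DISTRIBUTE BY user_id SORT BY user_id;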
Partitioning is most efficient when the partitioning schema corresponds to how the table is filtered or how it is loaded (it allows loading partitions in parallel, and if the increment data is a whole partition, loading works efficiently).
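A small sketch under those assumptions (sales and load_date are made-up names):

-- Daily increments land as whole partitions, and most queries filter
-- on load_date, so partition pruning applies (hypothetical table)
CREATE TABLE sales (
  order_id BIGINT,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (load_date STRING)
STORED AS ORC;

-- Only the matching partition directory is scanned
SELECT SUM(amount) FROM sales WHERE load_date = '2023-06-01';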
Bucketing can help optimize joins and GROUP BY, but the sort-merge-bucket map join has serious restrictions that also make it inefficient in practice: both tables should have the same bucketing schema, which in real life is rare or can be extremely inefficient to maintain, and the data should be sorted when loading the buckets.
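A sketch of what the SMB map join requires, assuming two hypothetical tables bucketed and sorted identically on the join key:

-- Both sides must share the bucketing and sort schema for the SMB map join
CREATE TABLE orders_b (order_id BIGINT, user_id BIGINT)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

CREATE TABLE users_b (user_id BIGINT, country STRING)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.auto.convert.sortmerge.join = true;

-- With matching bucketing, each mapper can merge-join one bucket pair
SELECT o.order_id, u.country
FROM orders_b o JOIN users_b u ON o.user_id = u.user_id;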
Consider using ORC with its built-in indexes and Bloom filters, and keep the number of files in your table small to avoid metadata overhead and to avoid mappers having to read thousands of small files.
Also read these answers: partitions in Hive interview questions, and Sorted Table in Hive.
Useful links:
Official documentation: LanguageManual
Cloudera community: https://community.cloudera.com/

Related

In Hive, is partitioning faster or bucketing faster?

This is an interview question I faced: if we have 1 TB of data in HDFS, which method in Hive gives us faster performance, partitioning or bucketing?
I told them that depending on the data we choose either partitioning or bucketing, but the interviewer wasn't satisfied with my answer.
What would be a proper answer (along with an example)?
Your answer is correct: it really depends on the data and what exactly you want to do with it.
Partitioning is used for distributing load horizontally in a logical fashion. It optimizes performance, but sometimes it can lead to partitions holding very little data. That results in bad performance, because MapReduce works better on fewer big files than on many small ones.
Here bucketing can help, because bucketing guarantees that all the data for a given value of the bucketing column stays together. E.g., if we bucket the employee table and use emp_id as the bucketing column, the value of this column will be hashed into a user-defined number of buckets (which must be chosen with the number of records in mind). Records with the same emp_id will always be stored in the same bucket. At the same time, one bucket may hold many emp_ids, giving a more optimally sized chunk of data for MapReduce processing. Bucketing is especially helpful if you want to perform a map-side join.
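A minimal sketch of that employee example; the column list and the bucket count of 16 are assumptions:

-- emp_id is hashed into 16 buckets; records with the same emp_id
-- always land in the same bucket file (hypothetical schema)
CREATE TABLE employee (
  emp_id INT,
  name   STRING,
  dept   STRING
)
CLUSTERED BY (emp_id) INTO 16 BUCKETS
STORED AS ORC;

-- On Hive < 2.0, also SET hive.enforce.bucketing = true before inserting;
-- newer versions always enforce the bucketing spec
INSERT INTO TABLE employee
SELECT emp_id, name, dept FROM employee_staging;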
Your answer is correct:
Hive partitioning is an effective method to improve query performance on larger tables. Partitioning allows you to store data in separate subdirectories under the table location, which greatly helps queries that filter on the partition key(s).
Bucketing improves join performance when the bucket key and the join key coincide. Bucketing in Hive distributes the data into different buckets based on the hash of the bucket key, and it also reduces I/O scans during a join when both sides are bucketed on the same keys (columns).

How to deal with Hive partitioning for performance versus over-partitioning

We have a very large Hadoop dataset holding more than a decade of historical transaction data: 6.5B rows and counting. It is partitioned on year and month.
Performance is poor for a number of reasons. Nearly all of our queries can be further qualified by customer_id as well, but we have 500 customers and that number is growing quickly. If we narrow a query to a given month, we still need to scan all of that month's records just to find the rows for one customer. The data is stored as Parquet now, so the main performance issue is not scanning the full contents of each record.
We hesitated to add a partition on customer because with 120 year-month partitions and 500 customers in each, we would end up with 60K partitions, which is more than the Hive metastore can handle effectively. We also hesitated to partition only on customer_id, because some customers are huge and others tiny, so we have a natural data skew.
Ideally, we would be able to partition the historical data, which is used far less frequently, using one rule (perhaps year + customer_id), and the current data using another (like year/month + customer_id). We have considered using multiple datasets, but managing this over time seems like more work, more changes, and so on.
Are there strategies or capabilities of Hive that provide a way to handle a case like this, where we "want" lots of partitions for performance but are limited by the metastore?
I am also confused about the benefit of bucketing. Suitable bucketing based on customer_id, for example, would seem to help in a similar way to partitioning. Yet Hortonworks "strongly recommends against" buckets (with no explanation why). Several other pages suggest bucketing is useful for sampling, and another good discussion of bucketing from Hortonworks indicates that Hive cannot do pruning with buckets the way it can with partitions.
We're on a recent version of Hive/Hadoop (moving from CDH 5.7 to AWS EMR).
In reality, 60K partitions is not a big problem for Hive. I have experience with about 2MM partitions for one Hive table, and it works pretty fast. You can find some details at https://andr83.io/1123. Of course you need to write your queries carefully. I can also recommend using the ORC format with its index and Bloom filter support.
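One hedged way to apply that advice to the scenario above: keep the 120 year/month partitions and let ORC handle customer_id pruning inside each partition. All names here are made up:

-- Hypothetical layout: partition by year/month, Bloom filter on customer_id
CREATE TABLE transactions (
  txn_id      BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(12,2)
)
PARTITIONED BY (yr INT, mon INT)
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns' = 'customer_id');

-- Sorting by customer_id on insert clusters each customer into few stripes,
-- so a customer_id filter can skip most of the partition's data
INSERT OVERWRITE TABLE transactions PARTITION (yr = 2016, mon = 5)
SELECT txn_id, customer_id, amount
FROM transactions_staging
DISTRIBUTE BY customer_id SORT BY customer_id;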

Are there any advantages to using indexes on tables in Hadoop over Oracle?

I need to compare indexing in Oracle vs. Hadoop (Hive). So far I have found two major indexing techniques in Hive: COMPACT INDEXING and BITMAP INDEXING. I was able to check the performance difference of COMPACT INDEXING in Hive compared to Oracle. I would like to understand more use cases / scenarios for BITMAP INDEXING in Hive. I also need to know whether Hive supports reverse key indexes and ascending/descending indexes like Oracle does.
Yes, there are significant advantages to using indexes in Hive over Oracle, keeping in mind that Hive is suitable for large data sets, and there are ongoing developments toward making Hive a real-time data warehousing tool.
One use case where BITMAP indexing fits is a large table whose indexed columns have relatively few distinct values (you will get better results if the table is large; do not test with small tables).
As of now, Hive supports only two techniques for explicitly created indexes: COMPACT and BITMAP.
That said, indexes in Hive are not recommended (although you can create them if your use case calls for it), and the reason is the ORC format. ORC has built-in indexes that allow the format to skip blocks of data during reads, and it also supports Bloom filter indexes. Together this pretty much replicates what Hive indexes did, and it happens automatically inside the data format without the need to manage an external index table (which is essentially what a Hive index is).
I would suggest you rather spend your time properly setting up ORC tables.
Also read this great post about Hive indexing.
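For reference, a sketch of the legacy index DDL (valid only before Hive 3.0, where indexing was removed per HIVE-18448); employee, emp_id, and dept are made-up names:

-- COMPACT index: stores indexed values with their file/block offsets
CREATE INDEX emp_id_idx ON TABLE employee (emp_id)
AS 'COMPACT' WITH DEFERRED REBUILD;

-- BITMAP index: intended for large tables with low-cardinality columns
CREATE INDEX dept_idx ON TABLE employee (dept)
AS 'BITMAP' WITH DEFERRED REBUILD;

-- An index must be (re)built before queries can use it
ALTER INDEX emp_id_idx ON employee REBUILD;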
Hive is a data warehousing tool that runs on Hadoop; it has built-in MapReduce capability for Hive queries. The metadata and the actual data are kept separate, with the metadata stored in a metastore database (Apache Derby by default), so the burden on that database is very small. Hive processes large tables easily because of its distributed nature. You can also compare inner-join performance between Oracle and Hive; on large data sets Hive will usually give you better performance.

Spark Performance On Individual Record Lookups

I am conducting a performance test comparing queries on existing internal Hive tables between Spark SQL and Hive on Tez. Throughout the tests, Spark showed query execution times that were on par with or faster than Hive on Tez. These results are consistent with many of the examples out there. However, there was one noted exception: a query involving key-based selection at the individual record level. In this instance, Spark was significantly slower than Hive on Tez.
After researching this topic on the internet, I could not find a satisfactory answer and wanted to pose this example to the SO community to see if this is an individual one-off case associated with our environment or data, or a larger pattern related to Spark.
Spark 1.6.1
Spark Conf: Executors 2, Executor Memory 32G, Executor Cores 4.
The data is in an internal Hive table stored as ORC files compressed with zlib. The total size of the compressed files is ~2.2 GB.
Here is the query code.
# Python API
# ORC with zlib: key-based select of a single record
dforczslt = sqlContext.sql(
    "SELECT * FROM dev.perf_test_orc_zlib WHERE test_id = 12345678987654321")
dforczslt.show()
The total time to complete this query was over 400 seconds, compared to ~6 seconds with Hive on Tez. I also tried enabling predicate pushdown via the SQL context configs, but this resulted in no noticeable performance increase. Also, when the same test was conducted using Parquet, the query time was on par with Hive as well. I'm sure there are other ways to increase query performance, such as using RDDs vs. DataFrames, etc., but I'm really looking to understand how Spark interacts with ORC files to produce this gap.
Let me know if I can provide additional clarification around any of the talking points listed above.
The following steps might help to improve the performance of the Spark SQL query.
In general, Hive can use the memory of the whole Hadoop cluster, which is significantly larger than the configured executor memory (here 2 × 32 = 64 GB). What is the memory size of your nodes?
Further, the number of executors (2) seems low compared to the number of map/reduce tasks generated by the Hive query. Increasing the number of executors in multiples of 2 might help improve performance.
In Spark SQL and DataFrames, optimized execution using manually managed memory (Tungsten) is now enabled by default, along with code generation for expression evaluation. This feature can be enabled by setting spark.sql.tungsten.enabled to true, in case it is not already enabled:
sqlContext.setConf("spark.sql.tungsten.enabled", "true")
The columnar nature of the ORC format helps avoid reading unnecessary columns. However, we are still reading unnecessary rows even when the query has a WHERE clause filter. ORC predicate push-down would improve performance through ORC's built-in indexes, but it is disabled in Spark SQL by default and needs to be enabled explicitly:
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
I would recommend doing some more research to find any remaining performance blockers.

Performance of Filter Queries in HBase?

I am looking for a data store that serves the following needs:
Distributed, because we have lots of data to query (in TBs).
Write-intensive: data will be generated by services, and we want to store it to perform analytics on it.
We want the analytical queries to be reasonably fast (on the order of minutes, not hours).
Most of our queries would be of the "select, filter, aggregate, sort" type.
The schema changes often, as what we store changes with the evolving requirements of the system.
Part of the stored data may also be used for pure large-scale map/reduce jobs for other purposes.
Key-value stores are scalable but do not support our query requirements.
Map/reduce jobs are scalable and can execute the queries, but I think they will not meet our query latency requirements.
An RDBMS (like MySQL) would satisfy our query needs but would force us into a fixed schema. We could scale it, but then we would have to do sharding, etc.
Commercial solutions like Vertica seem like they would solve all of our problems, but I would avoid a commercial solution if I can.
HBase seems to be a system as scalable as Hadoop, thanks to the underlying HDFS, and it seems to have facilities for filters and aggregations, but I am not sure about the performance of filter queries in HBase.
Currently HBase does not support secondary indexes, which makes me wonder whether HBase is the right option for filtering on an arbitrary column. According to the documentation, filtering on row key and column family is faster than filtering on just a column qualifier. However, I also read that keeping a Bloom filter on row key plus column family significantly increases the size of the Bloom filter, making that option practically infeasible.
I am unable to find much data online about performance of Filter queries in HBase.
Hoping I can find some more information here.
Thanks!
Try Apache Cassandra; it supports secondary indexes very well. As for HBase Bloom filters, please go through this link, which describes the multiple Bloom filter options depending on your access pattern: HBase Bloom filters
You are probably looking for MPP solutions like Postgres-XL or related platforms.
