This is an interview question I faced: if we have 1 TB of data in HDFS, which technique in Hive gives faster performance, partitioning or bucketing?
I told them that we choose either partitioning or bucketing depending on the data, but the interviewer wasn't satisfied with my answer.
What would be a proper answer (along with an example)?
Your answer is correct: it really depends on the data and on what exactly you want to do with it.
Partitioning distributes load horizontally in a logical fashion. It usually improves performance, but it can sometimes produce partitions that hold very little data. That hurts performance, because MapReduce works better on a few large files than on many small ones.
Here bucketing can help, because bucketing guarantees that all the data for a given value of the bucketing column stays together. E.g. if we bucket the employee table and use emp_id as the bucketing column, the value of this column is hashed into a user-defined number of buckets (which should be tuned to the number of records). Records with the same emp_id are always stored in the same bucket. At the same time, one bucket may hold many emp_id values, which gives MapReduce a better-sized chunk of data to process. Bucketing is especially helpful if you want to perform a map-side join.
Your answer is correct.
Hive partitioning is an effective way to improve query performance on larger tables. Partitioning stores the data in separate sub-directories under the table location. It greatly helps queries that filter on the partition key(s).
Bucketing improves join performance if the bucket key and the join key are the same. Bucketing in Hive distributes the data into different buckets based on the hash of the bucket key. It also reduces I/O during the join if the join happens on the same keys (columns).
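A minimal sketch of the join case, assuming two hypothetical tables (employee and salary_payment) that are both bucketed on emp_id with the same bucket count; the count of 32 and all column names other than emp_id are made up for illustration. hive.optimize.bucketmapjoin is a standard Hive setting, but whether the optimizer actually picks a bucket map join depends on the Hive version and execution engine.

```sql
-- Hypothetical tables: both bucketed on the join key with the same bucket count.
CREATE TABLE employee (
  emp_id   INT,
  emp_name STRING
)
CLUSTERED BY (emp_id) INTO 32 BUCKETS
STORED AS ORC;

CREATE TABLE salary_payment (
  emp_id  INT,
  paid_on DATE,
  amount  DECIMAL(12,2)
)
CLUSTERED BY (emp_id) INTO 32 BUCKETS
STORED AS ORC;

-- Allow Hive to consider a bucket map join for tables bucketed on the join key.
SET hive.optimize.bucketmapjoin = true;

-- Matching emp_id values live in the same bucket number of each table,
-- so a mapper only needs the corresponding bucket of the other side.
SELECT e.emp_id, e.emp_name, SUM(p.amount) AS total_paid
FROM employee e
JOIN salary_payment p ON e.emp_id = p.emp_id
GROUP BY e.emp_id, e.emp_name;
```

If the two tables used different bucket columns or incompatible bucket counts, Hive would fall back to an ordinary join.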
After reading about query optimization techniques I came to know about the below techniques.
1. Indexing - bitmap and BTree
2. Partitioning
3. Bucketing
I understand the difference between partitioning and bucketing and when to use them, but I'm still confused about how indexes actually work. Where is the metadata for an index stored? Is it the NameNode that stores it? When we create partitions or buckets we can see multiple directories in HDFS, which explains the query performance improvement, but how can I visualize indexes? Are they really used in real life when partitioning and bucketing are already in the picture?
Please help me with the above questions. Also, is there a dedicated page for the Hadoop and Hive developer community?
Indexes in Hive were never really used in real life and were never efficient. As @mazaneicha noted in the comments, the indexing feature was removed completely in Hive 3.0; see the Jira HIVE-18448. It was a good try anyway, and thanks to Facebook's support, valuable lessons were learned.
But there are lightweight indexes in ORC (not classic indexes, but min/max statistics and Bloom filters, which help prune stripes). ORC indexes and Bloom filters are efficient if the data is sorted during insert (distribute + sort).
Partitioning is most efficient when the partitioning scheme corresponds to how the table is filtered or how it is loaded (partitions can be loaded in parallel, and loading works efficiently when each increment of data is a whole partition).
Bucketing can help optimize joins and GROUP BY, but the sort-merge-bucket map join has serious restrictions that often make it inefficient too. Both tables must have the same bucketing scheme, which is rare in real life or can be extremely inefficient. The data must also be sorted when the buckets are loaded.
Consider using ORC with its built-in indexes and Bloom filters, and keep the number of files in your table small to avoid metadata overload and to avoid mappers copying thousands of files.
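As a rough sketch of the ORC suggestion, assuming a hypothetical events_orc table and an events_staging source table with the same columns (all names here are made up). orc.bloom.filter.columns is the ORC table property for requesting Bloom filters; how much data actually gets skipped depends on how well the load is sorted.

```sql
-- Hypothetical table; ORC keeps min/max statistics per stripe and row group,
-- plus a Bloom filter for each column listed in orc.bloom.filter.columns.
CREATE TABLE events_orc (
  event_id    BIGINT,
  customer_id INT,
  payload     STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.bloom.filter.columns' = 'customer_id'
);

-- Distribute and sort on load so each file and stripe covers a narrow
-- range of customer_id values. events_staging is an assumed source table.
INSERT OVERWRITE TABLE events_orc
SELECT event_id, customer_id, payload
FROM events_staging
DISTRIBUTE BY customer_id
SORT BY customer_id;

-- A point lookup can then skip stripes whose statistics rule out the value.
SELECT * FROM events_orc WHERE customer_id = 42;
```

Without the DISTRIBUTE BY / SORT BY step, every stripe tends to contain the full range of customer_id values and the min/max statistics prune very little.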
See also the related questions "partitions in hive interview questions" and "Sorted Table in Hive".
Useful links.
Official documentation: LanguageManual
Cloudera community: https://community.cloudera.com/
We have a very large Hadoop dataset having more than a decade of historical transaction data - 6.5B rows and counting. We have partitioned it on year and month.
Performance is poor for a number of reasons. Nearly all of our queries can be further qualified by customer_id, as well, but we have 500 customers and growing quickly. If we narrow the query to a given month, we still need to scan all records just to find the records for one customer. The data is stored as Parquet now, so the main performance issues are not related to scanning all of the contents of a record.
We hesitated to add a partition on customer because, with 120 year-month partitions and 500 customers in each, this would make 60K partitions, which is more than the Hive metastore can effectively handle. We also hesitated to partition only on customer_id because some customers are huge and others tiny, so we have a natural data skew.
Ideally, we would be able to partition historical data, which is used far less frequently, using one rule (perhaps year + customer_id) and current data using another (like year/month + customer_id). We have considered using multiple datasets, but managing this over time seems like more work and more changes.
Are there strategies, or capabilities of Hive that provide a way to handle a case like this where we "want" lots of partitions for performance, but are limited by the metastore?
I am also confused about the benefit of bucketing. A suitable bucketing based on customer id, for example, would seem to help in a similar way as partitioning. Yet Hortonworks "strongly recommends against" buckets (with no explanation why). Several other pages suggest bucketing is useful for sampling. Another good discussion of bucketing from Hortonworks indicates that Hive cannot do pruning with buckets the same way it can with partitions.
We're on a recent version of Hive/Hadoop (moving from CDH 5.7 to AWS EMR).
In practice, 60K partitions is not a big problem for Hive. I have experience with about 2MM partitions for one Hive table and it works pretty fast. You can find some details at https://andr83.io/1123. Of course, you need to write queries carefully. I can also recommend using the ORC format with indexes and Bloom filter support.
I understand the concepts of partitioning and bucketing in Hive tables. But what I'd like to know is "when do we go for partitioning and when do we go for bucketing?"
What are the ideal scenarios for partitioning and for bucketing?
These are the main reasons for using partitioning and bucketing.
Partition:
Partitioning of table data is done to distribute load horizontally.
Example: suppose we have a very large table named "Parts" and we often run WHERE queries that restrict the results to a particular part type.
For a faster query response, the table can be PARTITIONED BY (part_type STRING). Once you partition the table, Hive changes the way it structures the data storage and creates sub-directories that reflect the partition structure, like:
.../Parts/part_type=Engine-Part
.../Parts/part_type=Brakes
So if you now run a query on table "Parts" with WHERE part_type = 'Engine-Part', it will only scan the contents of the one directory part_type=Engine-Part.
Partitioning is useful in Hive, but queries that do not filter on the partition column still have to scan every partition and may take a long time to execute.
Another drawback is that too many partitions create a large number of unnecessary files and directories in HDFS, which becomes an overhead for the NameNode, since the NameNode must keep all metadata for the file system in memory.
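The "Parts" example could be declared roughly like this; part_id and part_name are hypothetical columns added only to make the statements complete.

```sql
-- Hypothetical columns; only part_type comes from the example above.
CREATE TABLE parts (
  part_id   BIGINT,
  part_name STRING
)
PARTITIONED BY (part_type STRING);

-- Filters on the partition column: only .../parts/part_type=Engine-Part is read.
SELECT part_id, part_name
FROM parts
WHERE part_type = 'Engine-Part';

-- No filter on the partition column: every partition directory is scanned.
SELECT part_id, part_type
FROM parts
WHERE part_name LIKE 'Bolt%';
```

The second query shows the drawback: partitioning only pays off when the predicate names the partition column.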
Bucketing:
Bucketing is another technique that can be used to divide the data further into more manageable parts.
Example: suppose the table "part_sale" has a top-level partition on "sale_date" and is further partitioned on "part_type" as a second-level partition.
This leads to too many small partitions:
.../part_sale/sale_date=2017-04-18/part_type=engine_part1
.../part_sale/sale_date=2017-04-18/part_type=engine_part2
.../part_sale/sale_date=2017-04-18/part_type=engine_part3
.../part_sale/sale_date=2017-04-18/part_type=engine_part4
Instead, we can bucket the "part_sale" table and use "part_type" as the bucketing column. The value of this column is hashed into a user-defined number of buckets. Records with the same "part_type" are always stored in the same bucket. You specify the number of buckets when the table is created, so the number of buckets is fixed and does not fluctuate with the data.
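A possible declaration for this layout, as a sketch: sale_id and amount are hypothetical columns, and 8 buckets is an arbitrary choice that would need tuning to the real data volume.

```sql
-- sale_date stays a partition (one directory per day);
-- part_type becomes the bucketing column instead of a second partition level.
CREATE TABLE part_sale (
  sale_id   BIGINT,
  part_type STRING,
  amount    DECIMAL(12,2)
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (part_type) INTO 8 BUCKETS;
```

Each sale_date directory then holds at most 8 bucket files, however many distinct part_type values arrive on that day.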
Partitioning in Hive:
If we are dealing with a large table and often run queries with WHERE clauses that restrict the results to particular values of one or more columns, we should partition on those columns. For a faster query response, a Hive table can be PARTITIONED BY (partition_cols_name). This organizes the data in a logical fashion, and when we query the table using the partition column, Hive can skip all but the relevant sub-directories and files, so the scan is cheap if the partitioning is done properly. Partitioning should be used when the cardinality (the number of possible values a field can have) is not high. If there are too many partitions, they become an overhead on the NameNode.
Bucketing in Hive:
If you want to segregate the data on a field that has high cardinality (the number of possible values a field can have), use bucketing. If you want only a sample of the data according to some specific fields rather than the entire data, bucketing can be a good option. If map-side joins are involved, bucketed tables are also a good option.
Partitioning helps eliminate data when the partition column is used in the WHERE clause, whereas bucketing organizes the data in each partition into multiple files so that the same set of keys is always written to the same bucket. This helps a lot when joining on those columns.
Bucketing in Hive is another technique for decomposing data into more manageable, roughly equal parts. For example, take a table with columns like date, employee_name, employee_id, salary, leaves, etc. Using the date column as the top-level partition and employee_id as the second-level partition leads to too many small partitions. Bucketing on employee_id avoids this: Hive assigns each row to a bucket by hashing the bucketing column.
The difference between partitioning and bucketing is this: when we partition, Hive creates a partition for each unique value of the column, which can leave us with a lot of tiny partitions. With bucketing, you limit the count to a number you choose and decompose your data into that many buckets. In Hive, a partition is a directory but a bucket is a file.
In Hive 1.x and earlier, bucketing is not enforced on insert by default. You have to set the following variable to enable it: set hive.enforce.bucketing=true; (From Hive 2.0 onward the property was removed and bucketing is always enforced.)
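A short sketch of populating a bucketed table under that setting. employee_bucketed and employee_staging are hypothetical table names, and 16 buckets is arbitrary.

```sql
-- Needed on Hive 1.x and earlier; on Hive 2.0+ omit this line,
-- because the property was removed and bucketing is always enforced.
SET hive.enforce.bucketing = true;

-- Hypothetical bucketed target table.
CREATE TABLE employee_bucketed (
  emp_id   INT,
  emp_name STRING
)
CLUSTERED BY (emp_id) INTO 16 BUCKETS;

-- With enforcement on, Hive plans one reducer per bucket and routes each
-- row by the hash of emp_id. employee_staging is an assumed unbucketed source.
INSERT OVERWRITE TABLE employee_bucketed
SELECT emp_id, emp_name
FROM employee_staging;
```

Loading files directly with LOAD DATA bypasses this step, so the files would not be bucketed even though the table metadata says they are.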
PARTITIONING is used when the column you filter on in your WHERE clause has few unique values.
BUCKETING is used when the column you filter or join on has many unique values.
Can we define a methodology using which we can decide if we should go for bucketing or partitioning?
Partitioning in Hive is a way of segregating table data into multiple files/directories. But partitioning gives effective results only when:
There is a limited number of partitions
The partitions are comparatively equal in size
This may not be possible in all scenarios. For example, when we partition a table by geographic location such as country, a few bigger countries will have large partitions (e.g. 4-5 countries alone contributing 70-80% of the total data), whereas small countries will create small partitions (all the remaining countries in the world may contribute just 20-30% of the total data). In such cases partitioning is not ideal.
To overcome the problem of over-partitioning, Hive provides bucketing, another technique for decomposing table data sets into more manageable parts.
Bucketing is based on (hash function of the bucketed column) mod (total number of buckets). The hash function depends on the type of the bucketing column.
Records with the same value in the bucketed column are always stored in the same bucket. Physically, each bucket is just a file in the table directory. In TABLESAMPLE, buckets are numbered starting from 1.
Bucketing works well when the field has high cardinality and the data is evenly distributed among the buckets. Partitioning works best when the cardinality of the partitioning field is not too high.
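To get a feel for how evenly a candidate column would spread across buckets, the hash-mod-buckets rule can be approximated with Hive's built-in hash() and pmod() functions. This is only an illustration: employee and emp_id are assumed names, 8 is an arbitrary bucket count, and the bucket Hive physically assigns can differ from this calculation depending on the Hive version and column type.

```sql
-- Approximate bucket number per row for a hypothetical 8-bucket layout,
-- with the number of rows that would land in each bucket.
SELECT pmod(hash(emp_id), 8) AS approx_bucket,
       COUNT(*)              AS row_count
FROM employee
GROUP BY pmod(hash(emp_id), 8)
ORDER BY approx_bucket;
```

Strongly uneven row counts suggest the column is skewed and that bucketing on it would reproduce the uneven-partition problem described above.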
What should be basis for us to narrow down whether to use partition or bucketing on a set of columns in Hive?
Suppose we have a huge data set with two columns that are queried most often. My obvious choice might be to partition on these two columns. But if that would result in a huge number of small files in a huge number of directories, then partitioning on these columns would be the wrong decision, and bucketing might have been the better option.
Can we define a methodology using which we can decide if we should go for bucketing or partitioning?
Bucketing and partitioning are not exclusive, you can use both.
My short answer from my fairly long hive experience is "you should ALWAYS use partitioning, and sometimes you may want to bucket too".
If you have a big table, partitioning helps reduce the amount of data you query. A partition is usually represented as a directory on HDFS. A common usage is to partition by year/month/day, since most people query by date.
The only drawback is that you should not partition on columns with a big cardinality.
Cardinality is a fundamental concept in big data, it's the number of possible values a column may have. 'US state' for instance has a low cardinality (around 50), while for instance 'ip_number' has a large cardinality (2^32 possible numbers).
If you partition on a field with a high cardinality, hive will create a very large number of directories in HDFS, which is not good (extra memory load on namenode).
Bucketing can be useful, but you also have to be disciplined when inserting data into a table. Hive won't check that the data you're inserting is bucketed the way it's supposed to be.
Populating a bucketed table requires a CLUSTER BY, which may add an extra step to your processing.
But if you do lots of joins, they can be greatly sped up if both tables are bucketed the same way (on the same field and the same number of buckets). Also, once you decide the number of buckets, you can't easily change it.
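As an illustration of the year/month/day layout mentioned above, here is a sketch using dynamic partitioning. page_views, page_views_raw and their columns are hypothetical; the two SET properties are the standard Hive switches for dynamic partition inserts.

```sql
-- Hypothetical table partitioned by date parts (low cardinality per level).
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (yr INT, mon INT, dy INT);

-- Let Hive derive the target partitions from the data itself.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- page_views_raw is assumed to have a TIMESTAMP column named view_ts.
-- The partition columns must come last in the SELECT list, in order.
INSERT OVERWRITE TABLE page_views PARTITION (yr, mon, dy)
SELECT user_id, url,
       year(view_ts), month(view_ts), day(view_ts)
FROM page_views_raw;

-- Reads only the directory yr=2017/mon=4/dy=18.
SELECT COUNT(*) FROM page_views WHERE yr = 2017 AND mon = 4 AND dy = 18;
```

Partitioning the same table on user_id instead would create one directory per user, which is the high-cardinality case the answer warns against.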
Partitioning:
Partitioning is decomposing/dividing your input data based on some condition, e.g. date and country here.
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs PARTITION (dt='2012-01-01', country='GB');
After loading a few more files into other partitions, the warehouse directory looks like this:
/user/hive/warehouse/logs/dt=2012-01-01/country=GB/file1
/user/hive/warehouse/logs/dt=2012-01-01/country=GB/file2
/user/hive/warehouse/logs/dt=2012-01-01/country=US/file3
/user/hive/warehouse/logs/dt=2012-01-02/country=GB/file4
/user/hive/warehouse/logs/dt=2012-01-02/country=US/file5
/user/hive/warehouse/logs/dt=2012-01-02/country=US/file6
SELECT ts, dt, line
FROM logs
WHERE country='GB';
This query will only scan file1, file2 and file4.
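One way to check which partitions exist, and which ones a query would touch, is sketched below against the logs table from the example. SHOW PARTITIONS and EXPLAIN DEPENDENCY are standard Hive statements; the exact shape of the EXPLAIN output varies by version.

```sql
-- List every partition of the table.
SHOW PARTITIONS logs;

-- List only the partitions matching a partial partition spec.
SHOW PARTITIONS logs PARTITION (country='GB');

-- Report the inputs the query depends on; after pruning, only the
-- country=GB partitions should appear among the input partitions.
EXPLAIN DEPENDENCY
SELECT ts, dt, line
FROM logs
WHERE country='GB';
```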
Bucketing:
Bucketing decomposes/divides your input data further, based on the hash of one or more columns.
There are two reasons why we might want to organize our tables (or partitions) into buckets.
The first is to enable more efficient queries. Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. In particular, a join of two tables that are bucketed on the same columns – which include the join columns – can be efficiently implemented as a map-side join.
The second reason to bucket a table is to make sampling more efficient. When working with large datasets, it is very convenient to try out queries on a fraction of your dataset while you are in the process of developing or refining them.
Let’s see how to tell Hive that a table should be bucketed. We use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets:
CREATE TABLE student (rollNo INT, name STRING) CLUSTERED BY (rollNo) INTO 4 BUCKETS;
SELECT * FROM student TABLESAMPLE(BUCKET 1 OUT OF 4 ON rollNo);