Advantage of creating Hive partitions when using parquet file storage

Advantage of creating Hive partitions when using parquet file storage - hadoop

Is it any advantage to create Hive partitions when using parquet file storage. Parquet is columnar storage file formats which stores data in column chunks with all the columns stored sequentially by index. When we query select a column based on a predicate, the select column index will jump to the required range based on predicate and print the values. How will partitioning be helpful? In row-oriented hive tables, partitioning is helpful because we'll hit only specified required range of data but Im not able to understand how will it be helpful in parquet storage.

In non-partitioned tables,hive would have to read all the files in the
table's data directory and then apply filters on it.For large table it is slow and expensive.
In partition tables,it will create subdirectories based on partition column.It distribute execution load horizontally and no need to search entire table columns for a single records.
The parquet file format have better compression but performance is not that good.
The partition with parquet reduce the execution time of query.eg.when i executed filter query on parquet table, it took 29.657 seconds whereas partition with parquet format took 14.21 seconds.If there is large table then definitely it will improve the performance of query.

Related

Storing data in HBase vs Parquet files

I am new to big data and am trying to understand the various ways of persisting and retrieving data.
I understand both Parquet and HBase are column oriented storage formats but Parquet is a file oriented storage and not a database unlike HBase.
My questions are :
What is the use case of using Parquet instead HBase
Is there a use case where Parquet can be used together with HBase.
In case of performing joins will Parquet be better performant than
HBase (say, accessed through a SQL skin like Phoenix)?

As you have already said in question, parquet is a storage while HBase is storage(HDFS) + Query Engine(API/shell) So a valid comparison should be done between parquet+Impala/Hive/Spark and HBase. Below are the key differences -
1) Disk space - Parquet takes less disk space in comparison to HBase. Parquet encoding saves more space than block compression in HBase.
2) Data Ingestion - Data ingestion in parquet is more efficient than HBase. A simple reason could be point 1. As in case of parquet, less data needs to be written on disk.
3) Record lookup on key - HBase is faster as this is a key-value storage while parquet is not. Indexing in parquet will be supported in future release.
4) Filter and other Scan queries - Since parquet store more information about records stored in a row group, it can skip lot of records while scanning the data. This is the reason, it's faster than HBase.
5) Updating records - HBase provides record updates while this may be problematic in parquet as the parquet files needs to be re-written. A careful design of schema and partitioning may improve updates but it's not comparable with HBase.
By comparing the above features, HBase seems more suitable for situations where updates are required and queries involve mainly key-value lookup. Query involving key range scan will also have better performance in HBase.
Parquet is suitable for use cases where updates are very few and queries involves filters, joins and aggregations.

Sorted Table in Hive (ORC file format)

I'm having some difficulties to make sure I'm leveraging sorted data within a Hive table. (Using ORC file format)
I understand we can affect how the data is read from a Hive table, by declaring a DISTRIBUTE BY clause in the create DDL.
CREATE TABLE trades
(
trade_id INT,
name STRING,
contract_type STRING,
ts INT
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (trade_id) SORTED BY (trade_id, time) INTO 8 BUCKETS
STORED AS ORC;
This will mean that every time I make a query to this table, the data will be distributed by trade_id among the various mappers and afterward it will be sorted.
My question is:
I do not want the data to be split into N files (buckets), because the volume is not that much and I would stay with small files.
However, I do want to leverage sorted insertion.
INSERT OVERWRITE TABLE trades
PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY trade_id
SORT BY trade_id;
Do I really need to use CLUSTERED/SORT in the create DLL statement? Or does Hive/ORC knows how to leverage the fact that the insertion process already ensured that the data is sorted?
Could it make sense to do something like:
CLUSTERED BY (trade_id) SORTED BY (trade_id, time) INTO 1 BUCKETS

Bucketed table is an outdated concept.
You do not need to write CLUSTERED BY in table DDL.
When loading table use distribute by partition key to reduce pressure on reducers especially when writing ORC, which requires intermediate buffers for building ORC and if each reducer loads many partitions it may cause OOM exception.
When the table is big, you can limit the max file size using bytes.per.reducer like this:
set hive.exec.reducers.bytes.per.reducer=67108864;--or even less
If you have more data, more reducers will be started, more files created. This is more flexible than loading fixed number of buckets.
This will also work better because for small tables you do not need to create smaller buckets.
ORC has internal indexes and bloom filters. Applying SORT you can improve index and bloom filters efficiency because all similar data will be stored together. Also this can improve compression depending on your data enthropy.
If distribution by partition key is not enough because you have some data skew and the data is big, you can additionally distribute by random. It is better to distribute by column if you have evenly distributed data. If not, distribute by random to avoid single long running reducer problem.
Finally your insert statement may look loke this:
set hive.exec.reducers.bytes.per.reducer=33554432; --32Mb per reducer
INSERT OVERWRITE TABLE trades PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY dt, --partition key is a must for big data
trade_id, --some other key if the data is too big and key is
--evenly distributed (no skew)
FLOOR(RAND()*100.0)%20 --random to distribute additionally on 20 equal parts
SORT BY contract_type; --sort data if you want filtering by this key
--to work better using internal index
Do not use CLUSTERED BY in table DDL because using DISTRIBUTE BY, ORC w indexes and bloom filters + SORT during insert you can achieve the same in more flexible way.
Distribute + sort can reduce the size of ORC files extremely by x3 or x4 times. Similar data can be better compressed and makes internal indexes more efficient.
Read also this: https://stackoverflow.com/a/55375261/2700344
This is related answer about about sorting: https://stackoverflow.com/a/47416027/2700344
The only case when you can use CLUSTER BY in table DDL is when you joining two big tables which can be bucketed by exactly the same number of buckets to be able to use sort-merge-bucket-map-join, but practically it is so rare case when you can bucket two big tables in the same way. Having only 1 bucket makes no sense because for small tables you can use map-join, just sort the data during insert to reduce the compressed data size.

When should we go for partition and bucketing in hive?

I understand the concepts of partitioning and bucketing in Hive tables. But what I'd like to know is "when do we for partition and when do we go for bucketing ?"
What are ideal scenarios that can be said as suitable for partitioning and bucketing ?

The main reasons in which one uses partition and bucketing.
Partition:
Partitioning of table data is done for distributing load horizontally .
Example: If we have a very large table names as "Parts" and often we run "where" queries that restricts the results to a particular Part Type.
For a faster query response the table can be partitioned by (PART_TYPE STRING).Once you partition the table it changes the way Hive structures the data storage and Hive now will create sub-directories which will reflect the structure of the partition like:
.../Parts/PART_TYPE = Engine-Part
.../Parts/Part_Type = Brakes
So,now if you run a query on table "Parts" with WHERE PART_TYPE = 'Engine-Part'
, it will only scan the contents of one directory PART_TYPE = 'Engine-Part'
Partitioning feature is useful in Hive. but at the same time it may take long time to execute other queries.
Another drawback is if we create too many partitions which in turn creates large number of Hadoop files and directories that got created unnecessarily and it becomes an overhead to NameNode since NameNode must keep all metdatafiles for the file system in memory.
Bucketing:
Bucketing is another technique which can be used to further divide the data into more manageable form.
Example: Suppose the table "part_sale" has a top level partition of "sale_date" and it is further partitioned into "part_type" as second level partition.
This will lead to too many small partitions .
.../part_sale/sale-date = 2017-04-18/part_type = engine_part1
.../part_sale/sale-date = 2017-04-18/part_type = engine_part2
.../part_sale/sale-date = 2017-04-18/part_type = engine_part3
.../part_sale/sale-date = 2017-04-18/part_type = engine_part4
If we bucket the "part_sale" table ,and use "part_type" as our bucketing column of the table.The value of this column will be hashed by a user-defined number into buckets.Records with the same "part_type" will always be stored in same bucket.You can specify the number of buckets at the time of table creation so that number of buckets are fixed and there is no fluctuation with data.

Partitioning in Hive :-
If we are dealing with a large table and often run queries with WHERE clauses that restrict the results to a particular partitioned column/columns, then we should leverage the partition concept of hive . For a faster query response Hive table can be PARTITIONED BY (partition_cols_name).Its helps to organize the data in logical fashion and when we query the partitioned table using partition column, it allows hive to skip all but relevant sub-directories and files, so scan becomes easy if partition is done properly. Should be done when the cardinality (number of possible values a field can have ) is not high. Else if there are too many partitions, then it is an overhead on the namenode.
Bucketing in Hive :-
If you want to segregate the data on a field which has high cardinality (number of possible values a field can have ), then we should use bucketing. If we want only a sample of data according to some specific fields and not the entire data , bucketing can be a good option. If some map-side joins are involved, then bucketed tables are a good option.

Partitioning helps in elimination of data, if used in WHERE clause, where as bucketing helps in organizing data in each partition into multiple files, so as same set of data is always written in same bucket. Helps a lot in joining of columns.
Hive Buckets is nothing but another technique of decomposing data or decreasing the data into more manageable parts or equal parts.For example we have table with columns like date,employee_name,employee_id,salary,leaves etc . In this table just use date column as the top-level partition and the employee_id as the second-level partition leads to too many small partitions. We can use HASH value for bucketing or a range to bucket the data.
Hive partitioning and Bucketing is ,when we do partitioning, we create a partition for each unique value of the column. But there may be situation where we need to create lot of tiny partitions. But if you use bucketing, you can limit it to a number which you choose and decompose your data into those buckets. In hive a partition is a directory but a bucket is a file.
In hive, bucketing does not work by default. You will have to set following variable to enable bucketing. set hive.enforce.bucketing=true;

PARTITIONING will be used when there are few unique values in the Column - which you want to load with your required WHERE clause
BUCKETING will be used if there are multiple unique values in your Where clause

Hive - Bucketing and Partitioning

What should be basis for us to narrow down whether to use partition or bucketing on a set of columns in Hive?
Suppose we have a huge data set, where we have two columns which are queried most often - so my obvious choice might be to make the partition based on these two columns, but also if this would result into a huge number of small files created in huge number of directories, than it would be a wrong decision to partition data based on these columns, and may be bucketing would have been a better option to do.
Can we define a methodology using which we can decide if we should go for bucketing or partitioning?

Bucketing and partitioning are not exclusive, you can use both.
My short answer from my fairly long hive experience is "you should ALWAYS use partitioning, and sometimes you may want to bucket too".
If you have a big table, partitioning helps reducing the amount of data you query. A partition is usually represented as a directory on HDFS. A common usage is to partition by year/month/day, since most people query by date.
The only drawback is that you should not partition on columns with a big cardinality.
Cardinality is a fundamental concept in big data, it's the number of possible values a column may have. 'US state' for instance has a low cardinality (around 50), while for instance 'ip_number' has a large cardinality (2^32 possible numbers).
If you partition on a field with a high cardinality, hive will create a very large number of directories in HDFS, which is not good (extra memory load on namenode).
Bucketing can be useful, but you also have to be disciplined when inserting data into a table. Hive won't check that the data you're inserting is bucketed the way it's supposed to.
A bucketed table has to do a CLUSTER BY, which may add an extra step in your processing.
But if you do lots of joins, they can be greatly sped up if both tables are bucketed the same way (on the same field and the same number of buckets). Also, once you decide the number of buckets, you can't easily change it.

Partioning :
Partioning is decomposing/dividing your input data based on some condition e.g: Date, Country here.
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs PARTITION (dt='2012-01-01', country='GB');
Files created in warehouse as below after loading data:
/user/hive/warehouse/logs/dt=2012-01-01/country=GB/file1/
/user/hive/warehouse/logs/dt=2012-01-01/country=GB/file2/
/user/hive/warehouse/logs/dt=2012-01-01/country=US/file3/
/user/hive/warehouse/logs/dt=2012-01-02/country=GB/file4/
/user/hive/warehouse/logs/dt=2012-01-02/country=US/file5/
/user/hive/warehouse/logs/dt=2012-01-02/country=US/file6
SELECT ts, dt, line
FROM logs
WHERE country='GB';
This query will only scan file1, file2 and file4.
Bucketing :
Bucketing is further Decomposing/dividing your input data based on some other conditions.
There are two reasons why we might want to organize our tables (or partitions) into buckets.
The first is to enable more efficient queries. Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. In particular, a join of two tables that are bucketed on the same columns – which include the join columns – can be efficiently implemented as a map-side join.
The second reason to bucket a table is to make sampling more efficient. When working with large datasets, it is very convenient to try out queries on a fraction of your dataset while you are in the process of developing or refining them.
Let’s see how to tell Hive that a table should be bucketed. We use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets:
CREATE TABLE student (rollNo INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS;
SELECT * FROM student TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());

What is the difference between partitioning and bucketing a table in Hive ?

I know both is performed on a column in the table but how is each operation different.

Partitioning data is often used for distributing load horizontally, this has performance benefit, and helps in organizing data in a logical fashion. Example: if we are dealing with a large employee table and often run queries with WHERE clauses that restrict the results to a particular country or department . For a faster query response Hive table can be PARTITIONED BY (country STRING, DEPT STRING). Partitioning tables changes how Hive structures the data storage and Hive will now create subdirectories reflecting the partitioning structure like
.../employees/country=ABC/DEPT=XYZ.
If query limits for employee from country=ABC, it will only scan the contents of one directory country=ABC. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning feature is very useful in Hive, however, a design that creates too many partitions may optimize some queries, but be detrimental for other important queries. Other drawback is having too many partitions is the large number of Hadoop files and directories that are created unnecessarily and overhead to NameNode since it must keep all metadata for the file system in memory.
Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table using date as the top-level partition and employee_id as the second-level partition leads to too many small partitions. Instead, if we bucket the employee table and use employee_id as the bucketing column, the value of this column will be hashed by a user-defined number into buckets. Records with the same employee_id will always be stored in the same bucket. Assuming the number of employee_id is much greater than the number of buckets, each bucket will have many employee_id. While creating table you can specify like CLUSTERED BY (employee_id) INTO XX BUCKETS; where XX is the number of buckets . Bucketing has several advantages. The number of buckets is fixed so it does not fluctuate with data. If two tables are bucketed by employee_id, Hive can create a logically correct sampling. Bucketing also aids in doing efficient map-side joins etc.

There are a few details missing from the previous explanations.
To better understand how partitioning and bucketing works, you should look at how data is stored in hive.
Let's say you have a table
CREATE TABLE mytable (
name string,
city string,
employee_id int )
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (employee_id) INTO 256 BUCKETS
then hive will store data in a directory hierarchy like
/user/hive/warehouse/mytable/y=2015/m=12/d=02
So, you have to be careful when partitioning, because if you for instance partition by employee_id and you have millions of employees, you'll end up having millions of directories in your file system.
The term 'cardinality' refers to the number of possible value a field can have. For instance, if you have a 'country' field, the countries in the world are about 300, so cardinality would be ~300. For a field like 'timestamp_ms', which changes every millisecond, cardinality can be billions. In general, when choosing a field for partitioning, it should not have a high cardinality, because you'll end up with way too many directories in your file system.
Clustering aka bucketing on the other hand, will result with a fixed number of files, since you do specify the number of buckets. What hive will do is to take the field, calculate a hash and assign a record to that bucket.
But what happens if you use let's say 256 buckets and the field you're bucketing on has a low cardinality (for instance, it's a US state, so can be only 50 different values) ? You'll have 50 buckets with data, and 206 buckets with no data.
Someone already mentioned how partitions can dramatically cut the amount of data you're querying. So in my example table, if you want to query only from a certain date forward, the partitioning by year/month/day is going to dramatically cut the amount of IO.
I think that somebody also mentioned how bucketing can speed up joins with other tables that have exactly the same bucketing, so in my example, if you're joining two tables on the same employee_id, hive can do the join bucket by bucket (even better if they're already sorted by employee_id since it's going to mergesort parts that are already sorted, which works in linear time aka O(n) ).
So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.
Also, you can partition on multiple fields, with an order (year/month/day is a good example), while you can bucket on only one field.

Before going into Bucketing, we need to understand what Partitioning is. Let us take the below table as an example. Note that I have given only 12 records in the below example for beginner level understanding. In real-time scenarios you might have millions of records.
PARTITIONING
---------------------
Partitioning is used to obtain performance while querying the data. For example, in the above table, if we write the below sql, it need to scan all the records in the table which reduces the performance and increases the overhead.
select * from sales_table where product_id='P1'
To avoid full table scan and to read only the records related to product_id='P1' we can partition (split hive table's files) into multiple files based on the product_id column. By this the hive table's file will be split into two files one with product_id='P1' and other with product_id='P2'. Now when we execute the above query, it will scan only the product_id='P1' file.
../hive/warehouse/sales_table/product_id=P1
../hive/warehouse/sales_table/product_id=P2
The syntax for creating the partition is given below. Note that we should not use the product_id column definition along with the non-partitioned columns in the below syntax. This should be only in the partitioned by clause.
create table sales_table(sales_id int,trans_date date, amount int)
partitioned by (product_id varchar(10))
Cons : We should be very careful while partitioning. That is, it should not be used for the columns where number of repeating values are very less (especially primary key columns) as it increases the number of partitioned files and increases the overhead for the Name node.
BUCKETING
------------------
Bucketing is used to overcome the cons that I mentioned in the partitioning section. This should be used when there are very few repeating values in a column (example - primary key column). This is similar to the concept of index on primary key column in the RDBMS. In our table, we can take Sales_Id column for bucketing. It will be useful when we need to query the sales_id column.
Below is the syntax for bucketing.
create table sales_table(sales_id int,trans_date date, amount int)
partitioned by (product_id varchar(10)) Clustered by(Sales_Id) into 3 buckets
Here we will further split the data into few more files on top of partitions.
Since we have specified 3 buckets, it is split into 3 files each for each product_id. It internally uses modulo operator to determine in which bucket each sales_id should be stored. For example, for the product_id='P1', the sales_id=1 will be stored in 000001_0 file (ie, 1%3=1), sales_id=2 will be stored in 000002_0 file (ie, 2%3=2),sales_id=3 will be stored in 000000_0 file (ie, 3%3=0) etc.

I think I am late in answering this question, but it keep coming up in my feed.
Navneet has provided excellent answer. Adding to it visually.
Partitioning helps in elimination of data, if used in WHERE clause, where as bucketing helps in organizing data in each partition into multiple files, so as same set of data is always written in same bucket. Helps a lot in joining of columns.
Suppose, you have a table with five columns, name, server_date, some_col3, some_col4 and some_col5. Suppose, you have partitioned the table on server_date and bucketed on name column in 10 buckets, your file structure will look something like below.
server_date=xyz
00000_0
00001_0
00002_0
........
00010_0
Here server_date=xyz is the partition and 000 files are the buckets in each partition. Buckets are calculated based on some hash functions, so rows with name=Sandy will always go in same bucket.

Hive Partitioning:
Partition divides large amount of data into multiple slices based on value of a table column(s).
Assume that you are storing information of people in entire world spread across 196+ countries spanning around 500 crores of entries. If you want to query people from a particular country (Vatican city), in absence of partitioning, you have to scan all 500 crores of entries even to fetch thousand entries of a country. If you partition the table based on country, you can fine tune querying process by just checking the data for only one country partition. Hive partition creates a separate directory for a column(s) value.
Pros:
Distribute execution load horizontally
Faster execution of queries in case of partition with low volume of data. e.g. Get the population from "Vatican city" returns very fast instead of searching entire population of world.
Cons:
Possibility of too many small partition creations - too many directories.
Effective for low volume data for a given partition. But some queries like group by on high volume of data still take long time to execute. e.g. Grouping of population of China will take long time compared to grouping of population in Vatican city. Partition is not solving responsiveness problem in case of data skewing towards a particular partition value.
Hive Bucketing:
Bucketing decomposes data into more manageable or equal parts.
With partitioning, there is a possibility that you can create multiple small partitions based on column values. If you go for bucketing, you are restricting number of buckets to store the data. This number is defined during table creation scripts.
Pros
Due to equal volumes of data in each partition, joins at Map side will be quicker.
Faster query response like partitioning
Cons
You can define number of buckets during table creation but loading of equal volume of data has to be done manually by programmers.

The difference is bucketing divides the files by Column Name, and partitioning divides the files under By a particular value inside table
Hopefully I defined it correctly

There are great responses here. I would like to keep it short to memorize the difference between partition & buckets.
You generally partition on a less unique column. And bucketing on most unique column.
Example if you consider World population with country, person name and their bio-metric id as an example. As you can guess, country field would be the less unique column and bio-metric id would be the most unique column. So ideally you would need to partition the table by country and bucket it by bio-metric id.

Using Partitions in Hive table is highly recommended for below reason -
Insert into Hive table should be faster ( as it uses multiple threads
to write data to partitions )
Query from Hive table should be efficient with low latency.
Example :-
Assume that Input File (100 GB) is loaded into temp-hive-table and it contains bank data from across different geographies.
Hive table without Partition
Insert into Hive table Select * from temp-hive-table
/hive-table-path/part-00000-1 (part size ~ hdfs block size)
/hive-table-path/part-00000-2
....
/hive-table-path/part-00000-n
Problem with this approach is - It will scan whole data for any query you run on this table. Response time will be high compare to other approaches where partitioning and Bucketing are used.
Hive table with Partition
Insert into Hive table partition(country) Select * from temp-hive-table
/hive-table-path/country=US/part-00000-1 (file size ~ 10 GB)
/hive-table-path/country=Canada/part-00000-2 (file size ~ 20 GB)
....
/hive-table-path/country=UK/part-00000-n (file size ~ 5 GB)
Pros - Here one can access data faster when it comes to querying data for specific geography transactions.
Cons - Inserting/querying data can further be improved by splitting data within each partition. See Bucketing option below.
Hive table with Partition and Bucketing
Note: Create hive table ..... with "CLUSTERED BY(Partiton_Column) into 5 buckets
Insert into Hive table partition(country) Select * from temp-hive-table
/hive-table-path/country=US/part-00000-1 (file size ~ 2 GB)
/hive-table-path/country=US/part-00000-2 (file size ~ 2 GB)
/hive-table-path/country=US/part-00000-3 (file size ~ 2 GB)
/hive-table-path/country=US/part-00000-4 (file size ~ 2 GB)
/hive-table-path/country=US/part-00000-5 (file size ~ 2 GB)
/hive-table-path/country=Canada/part-00000-1 (file size ~ 4 GB)
/hive-table-path/country=Canada/part-00000-2 (file size ~ 4 GB)
/hive-table-path/country=Canada/part-00000-3 (file size ~ 4 GB)
/hive-table-path/country=Canada/part-00000-4 (file size ~ 4 GB)
/hive-table-path/country=Canada/part-00000-5 (file size ~ 4 GB)
....
/hive-table-path/country=UK/part-00000-1 (file size ~ 1 GB)
/hive-table-path/country=UK/part-00000-2 (file size ~ 1 GB)
/hive-table-path/country=UK/part-00000-3 (file size ~ 1 GB)
/hive-table-path/country=UK/part-00000-4 (file size ~ 1 GB)
/hive-table-path/country=UK/part-00000-5 (file size ~ 1 GB)
Pros - Faster Insert. Faster Query.
Cons - Bucketing will creating more files. There could be issue with many small files in some specific cases
Hope this will help !!

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio