Can we make a table having both partitioning and bucketing in Hive?


Yes.
Partitioning means your data is divided into a number of directories on HDFS; each directory is a partition. For example, if your table definition is like
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(user_id) INTO 256 BUCKETS;
then you'll have directories on HDFS like
/user/hive/warehouse/user_info_bucketed/ds=2011-01-11/
/user/hive/warehouse/user_info_bucketed/ds=2011-01-12/
/user/hive/warehouse/user_info_bucketed/ds=2011-01-13/
Bucketing determines how your data is distributed inside a partition, so you'll have files on HDFS like
/user/hive/warehouse/user_info_bucketed/ds=2011-01-11/000000_0
/user/hive/warehouse/user_info_bucketed/ds=2011-01-11/000001_0
...
/user/hive/warehouse/user_info_bucketed/ds=2011-01-11/000255_0
/user/hive/warehouse/user_info_bucketed/ds=2011-01-12/000000_0
/user/hive/warehouse/user_info_bucketed/ds=2011-01-12/000001_0
...
/user/hive/warehouse/user_info_bucketed/ds=2011-01-12/000255_0
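For completeness, here's roughly how such a table gets populated, one partition at a time. This is just a sketch: user_info_staging is an assumed source table, and on Hive versions before 2.0 you need the enforce flag:
set hive.enforce.bucketing = true;  -- Hive < 2.0 only; always on in 2.0+
INSERT OVERWRITE TABLE user_info_bucketed PARTITION (ds = '2011-01-11')
SELECT user_id, firstname, lastname
FROM user_info_staging              -- hypothetical staging table
WHERE ds = '2011-01-11';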
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
http://www.hadooptpoint.com/hive-buckets-optimization-techniques/

Yes, this is straightforward. Try something like the following:
CREATE TABLE IF NOT EXISTS employee_partition_bucket
(
employeeID INT,
firstName STRING,
designation STRING,
salary INT
)
PARTITIONED BY (department STRING)
CLUSTERED BY (designation) INTO 2 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
In this example I have partitioned by department and bucketed by designation. Hope this helps!
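To load it with dynamic partitioning, something like this should work (a sketch; employee_stage is a hypothetical staging table with the same columns plus department):
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.enforce.bucketing = true;  -- Hive < 2.0 only
-- department goes last in the SELECT to feed the dynamic partition
INSERT OVERWRITE TABLE employee_partition_bucket PARTITION (department)
SELECT employeeID, firstName, designation, salary, department
FROM employee_stage;                -- hypothetical staging table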

Yes, you can! In that case, you will have buckets inside the partitioned data.

Related

Hive on AWS: convert S3 JSON to Columnar preserving partitions

I have files in S3 that contain many lines of JSON (separated by newlines). I want to convert these files to a columnar format for consumption by AWS Athena.
I am following the Converting to Columnar Formats guide to do this; however, when converted to ORC, the partitioning convention in S3 is lost.
In this example, how do you preserve the dt partition in the converted Parquet folder structure in S3? When I run the example, it just outputs s3://myBucket/pq/000000_0 and NOT s3://myBucket/pq/dt=2009-04-14-04-05/000000_0.
Here is the HQL that sets up the interface to bring the JSON into a Hive table:
CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string,
number string,
processId string,
browserCookie string,
requestEndTime string,
timers struct<modelLookup:string, requestTime:string>,
threadId string,
hostname string,
sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions' ;
msck repair table impressions;
Here is the HQL that converts it to Parquet:
CREATE EXTERNAL TABLE parquet_hive (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string)
STORED AS PARQUET
LOCATION 's3://mybucket/pq/';
INSERT OVERWRITE TABLE parquet_hive
SELECT requestbegintime, adid, impressionid, referrer, useragent, usercookie, ip
FROM impressions WHERE dt='2009-04-14-04-05';
First of all, add PARTITIONED BY (dt string) to the parquet_hive definition.
Second, if you want to insert the data partition by partition, you have to declare the partition you are inserting into. Note the PARTITION (dt='2009-04-14-04-05'):
INSERT OVERWRITE TABLE parquet_hive PARTITION (dt='2009-04-14-04-05')
SELECT requestbegintime,adid,impressionid,referrer,useragent,usercookie,ip
FROM impressions where dt='2009-04-14-04-05'
;
An easier way would be to use dynamic partitioning.
Note the PARTITION (dt) and the dt as the last column in the SELECT.
You might need to set hive.exec.dynamic.partition.mode:
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE parquet_hive PARTITION (dt)
SELECT requestbegintime,adid,impressionid,referrer,useragent,usercookie,ip,dt
FROM impressions where dt='2009-04-14-04-05'
;
P.S.
CREATE EXTERNAL TABLE impressions does not "read the JSON into a Hive table".
It is just an interface with the necessary information to read the underlying files.
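To illustrate, a small sketch: dropping the external table only removes metadata, and the partitions can be re-attached afterwards.
DROP TABLE impressions;           -- the files in S3 are untouched
-- re-run the CREATE EXTERNAL TABLE statement above, then:
MSCK REPAIR TABLE impressions;    -- rediscovers the dt=... partitions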
You can simply include the same PARTITIONED BY (dt string) parameter that is in your first statement, which will create the same directory structure.
In this case, the dt field (presumably a date) is actually stored in the directory name; a separate directory is created for each value.
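Putting it together, a sketch of the corrected definition (same columns as before, with dt declared as the partition column so the dt=... directories are preserved under s3://mybucket/pq/):
CREATE EXTERNAL TABLE parquet_hive (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://mybucket/pq/';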

Hive partitions on tables

When we partition a table, the columns on which the table is partitioned are not mentioned in the column list of the CREATE statement but are declared separately in the PARTITIONED BY clause. What is the reason behind this?
CREATE TABLE REGISTRATION_DATA (
userid BIGINT,
First_Name STRING,
Last_Name STRING,
address1 STRING,
address2 STRING,
city STRING,
zip_code STRING,
state STRING
)
PARTITIONED BY (
REGION STRING,
COUNTRY STRING
);
The partition columns we declare in Hive become pseudo-columns: they are not stored in the data files themselves, but we can query them directly as if they were regular columns.
If we also list a partition column in the CREATE statement's column list, we get an error like 'Error in semantic analysis: Columns repeated in partitioning columns'.
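For example (a minimal sketch against the table above), the partition columns can be used in SELECT and WHERE just like regular columns, even though they only exist as directory names:
-- REGION and COUNTRY come from the directory names, not the data files
SELECT userid, First_Name, Last_Name, REGION, COUNTRY
FROM REGISTRATION_DATA
WHERE COUNTRY = 'US';  -- partition pruning: only matching directories are read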

HIVE: Empty buckets getting created after partitioning in HDFS

I was trying to create partitions and buckets using Hive.
I set some of the properties:
set hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Below is the code for creating the table:
CREATE TABLE transactions_production
( id string,
dept string,
category string,
company string,
brand string,
date1 string,
productsize int,
productmeasure string,
purchasequantity int,
purchaseamount double)
PARTITIONED BY (chain string)
CLUSTERED BY (id) INTO 5 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Below is the code for inserting data into the table:
INSERT OVERWRITE TABLE transactions_production PARTITION (chain)
select id, dept, category, company, brand, date1, productsize, productmeasure,
purchasequantity, purchaseamount, chain from transactions_staging;
What went wrong:
Partitions and buckets are getting created in HDFS, but the data is present only in the 1st bucket of each partition; all the remaining buckets are empty.
Please let me know what I did wrong and how to resolve this issue.
When using bucketing, Hive computes a hash of the CLUSTERED BY column (here, id) and uses it to distribute rows across that many flat files inside each partition.
Because the table is split up by the hashes of the ids, the size of each file depends on the values in your table.
If no values get mapped to buckets other than the first one, all of the remaining flat files will be empty.
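You can approximate the bucket distribution before loading. A rough sketch (Hive's actual assignment is the column's hash taken modulo the bucket count; pmod keeps the result non-negative):
-- rows per would-be bucket, for 5 buckets
SELECT pmod(hash(id), 5) AS bucket, count(*) AS cnt
FROM transactions_staging
GROUP BY pmod(hash(id), 5);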

Hive, Bucketing for the partitioned table

This is my script:
--table without partition
drop table if exists ufodata;
create table ufodata ( sighted string, reported string, city string, shape string, duration string, description string )
row format delimited
fields terminated by '\t'
Location '/mapreduce/hive/ufo';
--load my data in ufodata
load data local inpath '/home/training/downloads/ufo_awesome.tsv' into table ufodata;
--create partition table
drop table if exists partufo;
create table partufo ( sighted string, reported string, city string, shape string, duration string, description string )
partitioned by ( year string )
clustered by (year) into 6 buckets
row format delimited
fields terminated by '\t';
--by default dynamic partition is not set
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--by default bucketing is false
set hive.enforce.bucketing=true;
--loading my data
insert overwrite table partufo
partition (year)
select sighted, reported, city, shape, duration, description, SUBSTR(TRIM(sighted), 1, 4) from ufodata;
Error message:
FAILED: Error in semantic analysis: Invalid column reference
I tried bucketing for my partitioned table. If I remove "clustered by (year) into 6 buckets" the script works fine. How do I bucket the partitioned table?
There is an important thing we should consider while doing bucketing in Hive: the same column cannot be used for both bucketing and partitioning. The reason is as follows:
Clustering and sorting happen within a partition. Inside each partition there is only one value of the partition column (in your case, year), so clustering and sorting on it would have no effect. That is the reason for your error.
You can use the syntax below to create a bucketed table with partitions:
CREATE TABLE bckt_movies
(mov_id BIGINT, mov_name STRING, prod_studio STRING, col_world DOUBLE, col_us_canada DOUBLE, col_uk DOUBLE, col_aus DOUBLE)
PARTITIONED BY (rel_year STRING)
CLUSTERED BY(mov_id) INTO 6 BUCKETS;
When you're doing dynamic partitioning, create a temporary table with all the columns (including the partition column) and load data into it.
Then create the actual partitioned table with the partition column. While loading data from the temporary table, the partition column should come last in the SELECT clause.
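A sketch of that two-step load (bckt_movies_stage is a hypothetical temporary table with the same columns, rel_year included as a regular column):
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.enforce.bucketing = true;  -- Hive < 2.0 only
-- rel_year goes last in the SELECT to feed the dynamic partition
INSERT OVERWRITE TABLE bckt_movies PARTITION (rel_year)
SELECT mov_id, mov_name, prod_studio, col_world, col_us_canada, col_uk, col_aus, rel_year
FROM bckt_movies_stage;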

HIVE order by messes up data

In Hive 0.8 with Hadoop 1.0.3, consider this table:
CREATE TABLE table (
key int,
date timestamp,
name string,
surname string,
height int,
weight int,
age int)
CLUSTERED BY(key) INTO 128 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Then I tried:
select *
from table
where key=xxx
order by date;
The result is sorted, but everything after the name column is wrong. In fact, all the rows have the exact same values in the respective fields, and the surname column is missing. I also have a bitmap index on name and surname and an index on key.
Is there something wrong with my query, or should I be looking into bugs in ORDER BY (I can't find anything specific)?
It seems like there has been an error in loading the data into Hive. Make sure you don't have any special characters in your CSV file that might interfere with the insertion.
Also, you have clustered by the key column. Where does this key come from: the CSV, or some other source? Are you sure it is unique?
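One quick way to spot misaligned rows, as a rough sketch: with FIELDS TERMINATED BY ',', a stray comma shifts every later field, and text that lands in an INT column parses to NULL.
-- rows where the numeric columns failed to parse hint at shifted fields
SELECT count(*) AS suspect_rows
FROM table
WHERE height IS NULL OR weight IS NULL OR age IS NULL;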
