HIVE: Empty buckets getting created after partitioning in HDFS - hadoop

I was trying to create partitions and buckets using Hive.
For that I set the following properties:
set hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Below is the code for creating the table:
CREATE TABLE transactions_production
( id string,
dept string,
category string,
company string,
brand string,
date1 string,
productsize int,
productmeasure string,
purchasequantity int,
purchaseamount double)
PARTITIONED BY (chain string) clustered by(id) into 5 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Below is the code for inserting data into the table:
INSERT OVERWRITE TABLE transactions_production PARTITION (chain)
select id, dept, category, company, brand, date1, productsize, productmeasure,
purchasequantity, purchaseamount, chain from transactions_staging;
What went wrong:
Partitions and buckets are getting created in HDFS but the data is present only in the 1st bucket of all the partitions; all the remaining buckets are empty.
Please let me know what I did wrong and how to resolve this issue.

When using bucketing, Hive computes a hash of the CLUSTERED BY value (here you use id) and splits each partition into that many flat files based on it.
Because rows are assigned to buckets by the hash of their id, the size of each bucket file depends on the values in your table.
If no ids hash to any bucket other than the first one, all of the remaining flat files will be empty.
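To check whether the data really is skewed, one quick sketch is to group the staging data by the would-be bucket number (pmod(hash(id), 5) only approximates Hive's bucket assignment, which can vary across versions):
-- counts rows per approximate bucket for a 5-bucket table
SELECT pmod(hash(id), 5) AS bucket_no, count(*) AS row_cnt
FROM transactions_staging
GROUP BY pmod(hash(id), 5);
If this shows rows under only one bucket number, the data itself is skewed; if it shows an even spread, revisit the bucketing settings instead.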

Related

HIVE - Cannot partition a table: semantic exception failure

I'm not able to import data into a partitioned table in Hive.
Here is how I create the table:
CREATE TABLE IF NOT EXISTS title_ratings
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
And then I load the data into it: LOAD DATA INPATH '/title.ratings.tsv.gz' INTO TABLE eval_hive_db.title_ratings;
It works fine up to here. Now I want to create a dynamically partitioned table. First of all, I set these params:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I now create my partitioned table:
CREATE TABLE IF NOT EXISTS title_ratings_part
(
tconst STRING,
numVotes INT
)
PARTITIONED BY (averageRating DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;
insert into title_ratings_part partition(title_ratings) select tconst, averageRating, numVotes from title_ratings;
(I also tried with numVotes instead by the way)
And I receive this error: FAILED: ValidationFailureSemanticException eval_hive_db.title_ratings_part: Partition spec {title_ratings=null} contains non-partition columns
Can someone help me, please?
Ideally, I want to partition my table by averageRating (less than 2, between 2 and 4, and greater than 4)
You can run this command to check whether the partition column contains null values:
select count(*) from title_ratings where averageRating is null;
If there are null values in this column you will get a non-zero count; fill those values first, then apply the partitioning again.
The partition column is stored as the last column in a table, so while inserting you need to maintain the correct order in the select statement. Also, the partition spec must name the partition column (averageRating), not the source table. Please change the insert to:
insert into title_ratings_part partition(averageRating)
select
tconst,
numVotes,
averageRating -- order-wise this should always be the last column
from title_ratings;
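Since the stated goal is to partition by rating ranges rather than by every distinct averageRating value, a hedged sketch of that approach follows (the table title_ratings_band, the band names, and the boundary handling are assumptions, not part of the question):
CREATE TABLE IF NOT EXISTS title_ratings_band
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
PARTITIONED BY (rating_band STRING)
STORED AS TEXTFILE;

INSERT INTO title_ratings_band PARTITION (rating_band)
SELECT tconst,
       averageRating,
       numVotes,
       CASE
         WHEN averageRating < 2 THEN 'low'     -- less than 2
         WHEN averageRating <= 4 THEN 'medium' -- between 2 and 4
         ELSE 'high'                           -- greater than 4
       END                                     -- derived partition column, last
FROM title_ratings;
This also keeps the partition count at three, instead of one directory per distinct DOUBLE value.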

Can we make a table having both partitioning and bucketing in hive?

Can we make a table having both partitioning and bucketing in Hive?
Yes.
Partitioning means your data is divided into a number of directories on HDFS; each directory is one partition. For example, if your table definition is like
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(user_id) INTO 256 BUCKETS;
Then you'll have directories on hdfs like
/user/hive/warehouse/user_info_bucketed/ds=2011-01-11/
/user/hive/warehouse/user_info_bucketed/ds=2011-01-12/
/user/hive/warehouse/user_info_bucketed/ds=2011-01-13/
Bucketing determines how your data is distributed inside a partition, so within each partition directory you'll have one file per bucket on HDFS, like
/user/hive/warehouse/user_info_bucketed/ds=2011-01-11/000000_0
/user/hive/warehouse/user_info_bucketed/ds=2011-01-11/000001_0
...
/user/hive/warehouse/user_info_bucketed/ds=2011-01-11/000255_0
/user/hive/warehouse/user_info_bucketed/ds=2011-01-12/000000_0
/user/hive/warehouse/user_info_bucketed/ds=2011-01-12/000001_0
...
/user/hive/warehouse/user_info_bucketed/ds=2011-01-12/000255_0
Reference:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
http://www.hadooptpoint.com/hive-buckets-optimization-techniques/
Yes. This is straightforward.
Try something like this:
CREATE TABLE IF NOT EXISTS employee_partition_bucket
(
employeeID Int,
firstName String,
designation String,
salary Int
)
PARTITIONED BY (department string)
CLUSTERED BY (designation) INTO 2 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
In this example I have partitioned by department and bucketed by designation; a matching insert is sketched below. Hope this helps.
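For completeness, a hedged companion insert (the staging table employee_stage is hypothetical; the SET statements repeat the properties shown earlier on this page):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;

INSERT OVERWRITE TABLE employee_partition_bucket PARTITION (department)
SELECT employeeID, firstName, designation, salary, department -- partition column last
FROM employee_stage;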
Yes, you can! In that case, you will have buckets inside the partitioned data.

Hive, Bucketing for the partitioned table

This is my script:
--table without partition
drop table if exists ufodata;
create table ufodata ( sighted string, reported string, city string, shape string, duration string, description string )
row format delimited
fields terminated by '\t'
Location '/mapreduce/hive/ufo';
--load my data in ufodata
load data local inpath '/home/training/downloads/ufo_awesome.tsv' into table ufodata;
--create partition table
drop table if exists partufo;
create table partufo ( sighted string, reported string, city string, shape string, duration string, description string )
partitioned by ( year string )
clustered by (year) into 6 buckets
row format delimited
fields terminated by '\t';
--by default dynamic partition is not set
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--by default bucketing is false
set hive.enforce.bucketing=true;
--loading mydata
insert overwrite table partufo
partition (year)
select sighted, reported, city, shape, duration, description, SUBSTR(TRIM(sighted), 1,4) from ufodata;
Error message:
FAILED: Error in semantic analysis: Invalid column reference
I tried bucketing for my partitioned table. If I remove "clustered by (year) into 6 buckets" the script works fine. How do I bucket the partitioned table?
There is an important thing we should consider while doing bucketing in Hive:
the same column name cannot be used for both bucketing and partitioning. The reason is as follows:
clustering and sorting happen within a partition. Inside each partition there is only one value of the partition column (in your case it is year), so clustering and sorting on it can have no effect. That is the reason for your error.
You can use the below syntax to create bucketing table with partition.
CREATE TABLE bckt_movies
(mov_id BIGINT , mov_name STRING ,prod_studio STRING, col_world DOUBLE , col_us_canada DOUBLE , col_uk DOUBLE , col_aus DOUBLE)
PARTITIONED BY (rel_year STRING)
CLUSTERED BY(mov_id) INTO 6 BUCKETS;
When you're doing dynamic partitioning, create a temporary table with all the columns (including your partition column) and load the data into it.
Then create the actual partitioned table with the partition column declared. While loading data from the temporary table, the partition column must come last in the select clause, as in the sketch below.
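Applied to the UFO example, a hedged sketch (clustering by shape rather than year is an assumption; any non-partition column would do):
drop table if exists partufo;
create table partufo
( sighted string, reported string, city string, shape string, duration string, description string )
partitioned by ( year string )
clustered by (shape) into 6 buckets -- bucket on a non-partition column
row format delimited
fields terminated by '\t';

insert overwrite table partufo
partition (year)
select sighted, reported, city, shape, duration, description,
       SUBSTR(TRIM(sighted), 1, 4) -- partition column derived last
from ufodata;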

Hive: Fatal error when trying to create dynamic partitions

create table MY_DATA0(session_id STRING, userid BIGINT,date_time STRING, ip STRING, URL STRING ,country STRING, state STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
TERMINATED BY '\n' STORED AS TEXTFILE ;
LOAD DATA INPATH '/inputhive' OVERWRITE INTO TABLE MY_DATA0;
create table part0(session_id STRING, userid BIGINT,date_time STRING, ip STRING, URL STRING) partitioned by (country STRING, state STRING, city STRING)
clustered by (userid) into 256 buckets ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE ;
insert overwrite table part0 partition(country, state, city) select session_id, userid, date_time, ip, url, country, state, city from my_data0;
Overview of my dataset:
{60A191CB-B3CA-496E-B33B-0ACA551DD503},1331582487,2012-03-12 13:01:27,66.91.193.75,http://www.acme.com/SH55126545/VD55179433,United States,Hauula,Hawaii
{365CC356-7822-8A42-51D2-B6396F8FC5BF},1331584835,2012-03-12 13:40:35,173.172.214.24,http://www.acme.com/SH55126545/VD55179433,United States,El Paso,Texas
When I run the last insert script I get this error:
java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]:
Fatal error occurred when node tried to create too many dynamic
partitions. The maximum number of dynamic partitions is controlled by
hive.exec.max.dynamic.partitions and
hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
PS:
I have set these two properties:
hive.exec.dynamic.partition.mode=nonstrict
hive.enforce.bucketing=true
Try setting those properties to higher values.
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
Partition columns should be listed last in the select statement, in the same order as in the partition spec.
For example, if state is the partition column, then "insert into table t1 partition(state) select id, name, dept, sal, state from t2;" will work. If instead the query is "insert into table t1 partition(state) select id, name, dept, state, sal from t2;" then the partitions will be created from the salary (sal) values, because Hive takes whatever column comes last as the partition value.
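Putting the two answers together for the part0 table from the question (the limit values are just the ones suggested above, not tuned numbers):
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;

insert overwrite table part0 partition(country, state, city)
select session_id, userid, date_time, ip, url,
       country, state, city -- partition columns last, in partition-spec order
from my_data0;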

Hive static partitions issue

I have a CSV file which has 600 records, 300 male and 300 female.
I have created Temp_Table and filled all these records into it. Then I created Table_Main with gender as the partition column.
The query for Temp_Table is:
Create table if not exists Temp_Table
(id string, age int, gender string, city string, pin string)
row format delimited
fields terminated by ',';
Then I write the below query:
Insert into Table_Main
partition (gender)
select a, b, c, d, gender from Temp_Table;
Problem: I am getting a single file at /user/hive/warehouse/mydb.db/Table_Main/gender=Male/000000_0
This file contains all 600 records. I am not sure what is happening; I expected this file to contain only the 300 Male records.
Q1: Where am I mistaken?
Q2: Should I not get one more folder for all the other values (those not in the static partition)? If not, what will happen to those records?
With a static partition (e.g. partition (gender='Male')) we need to specify a WHERE condition while inserting data into the partition table, which I had not done, so every selected row landed in that one partition. Alternatively, we can use a dynamic partition, which needs no WHERE condition. Both variants are sketched below.
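A hedged sketch of both options (assuming Table_Main's non-partition columns are id, age, city and pin, i.e. Temp_Table minus the partition column):
-- Static partition: one partition per statement, rows filtered explicitly
insert overwrite table Table_Main partition (gender='Male')
select id, age, city, pin from Temp_Table where gender = 'Male';

-- Dynamic partition: one statement; Hive routes each row by the last column
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table Table_Main partition (gender)
select id, age, city, pin, gender from Temp_Table;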
