issue with hive partitioning and bucketing in CDH 5.10 quick VM - hadoop

i am new to this area and got stuck in a simple issue.
I am loading data into a hive table (using insert command from another table tset1) which is partitioned by udate and day as bucket.
insert overwrite test1 partition(udate) select id,value,udate,day from tset1;
so now the issue is when I am loading data it is taking wrong value in partition column. Day is taken as partition because in my table this is last column so during data load it's taking day as udate.
how I can force my query to take the right value during data load?
hive (testdb)> create table test1_buk(id int, value string,day int) partitioned by(udate string) clustered by(day) into 5 buckets row format delimited fields terminated by ',' stored as textfile;
hive (testdb)> desc tset1;
OK
col_name data_type comment
id int
value string
udate string
day int
hive (testdb)> desc test1_buk;
OK
col_name data_type comment
id int
value string
day int
udate string
# Partition Information
# col_name data_type comment
udate string
hive (testdb)> select * from test1_buk limit 1;
OK
test1_buk.id test1_buk.value test1_buk.day test1_buk.udate
5 w 2000 10
please help.

Related

HIVE - Cannot partition a table: semantic exception failure

I'm not able to import data on partitioned table in Hive.
Here is how I create the table
CREATE TABLE IF NOT EXISTS title_ratings
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
And then I load the data into it : LOAD DATA INPATH '/title.ratings.tsv.gz' INTO TABLE eval_hive_db.title_ratings;
It works fine till here. Now I want to create a dynamic partitioned table. First of all, I setup theses params:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I now create my partitioned table:
CREATE TABLE IF NOT EXISTS title_ratings_part
(
tconst STRING,
numVotes INT
)
PARTITIONED BY (averageRating DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;
insert into title_ratings_part partition(title_ratings) select tconst, averageRating, numVotes from title_ratings;
(I also tried with numVotes instead by the way)
And I receive this error: FAILED: ValidationFailureSemanticException eval_hive_db.title_ratings_part: Partition spec {title_ratings=null} contains non-partition columns
Someone can help me please?
Ideally, I want to partition my table by averageRating (less than 2, between 2 and 4, and greater than 4)
You can run this command to check if there are null values or not.
select count(averageRating) from title_ratings group by averageRating;
Now, if there are null values in this column then you will get the count, which you have to fill then apply partitioning again.
Partition column is stored as last column in a table so while inserting you need to maintain correct order in select statement.
Pls change order of columns in select.
insert into title_ratings_part partition(title_ratings)
Select
Tconst,
numVotes,
averageRating --orderwise this should always be last column
from title_ratings

How to use insert statement for a Hive partitioned table?

I have a hive table dynpart.
id int
name char(30)
city char(30)
thisday string
# Partition Information
# col_name data_type comment
thisday string
It is partitioned by 'thisday' whose datatype is STRING.
How can I insert a single record into the table in a particular partition. I know there is load command to load an entire file data into hive table. I just want to know how an Insert statement can be written for a partitioned table. I tried to write command like below but this is taking data from another table.
insert into droplater partition(thisday='30/03/2017') select * from dynpart;
The table: Droplater has the same structure as dynpart. But the above command is to insert the data from another table. What I'd like to learn is to write a simple insert command into a partition, like: insert into tabname values(1,"abcd","efgh");into the table.
This will work for primitive types only (no arrays, structs etc.)
insert into tabname partition (thisday='30/03/2017') values (1,"abcd","efgh");
This will work for all types
insert into tabname partition (thisday='30/03/2017') select 1,"abcd","efgh";
P.s.
By all means, partition your table by date ((thisday date) )
insert into tabname partition (thisday=date '2017-03-30') ...
or at least use the ISO date format
insert into tabname partition (thisday='2017-03-30') ...

Hive, Bucketing for the partitioned table

This is my script:
--table without partition
drop table if exists ufodata;
create table ufodata ( sighted string, reported string, city string, shape string, duration string, description string )
row format delimited
fields terminated by '\t'
Location '/mapreduce/hive/ufo';
--load my data in ufodata
load data local inpath '/home/training/downloads/ufo_awesome.tsv' into table ufodata;
--create partition table
drop table if exists partufo;
create table partufo ( sighted string, reported string, city string, shape string, duration string, description string )
partitioned by ( year string )
clustered by (year) into 6 buckets
row format delimited
fields terminated by '/t';
--by default dynamic partition is not set
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--by default bucketing is false
set hive.enforcebucketing=true;
--loading mydata
insert overwrite table partufo
partition (year)
select sighted, reported, city, shape, min, description, SUBSTR(TRIM(sighted), 1,4) from ufodata;
Error message:
FAILED: Error in semantic analysis: Invalid column reference
I tried bucketing for my partitioned table. If I remove "clustered by (year) into 6 buckets" the script works fine. How do I bucket the partitioned table
There is an important thing we should consider while doing bucketing in hive.
The same column name cannot be used for both bucketing and partitioning. The reason is as follows:
Clustering and Sorting happens within a partition. Inside each partition there will be only one value associated with the partition column(in your case it is year)therefore there will not any be any impact on clustering and sorting. That is the reason for your error....
You can use the below syntax to create bucketing table with partition.
CREATE TABLE bckt_movies
(mov_id BIGINT , mov_name STRING ,prod_studio STRING, col_world DOUBLE , col_us_canada DOUBLE , col_uk DOUBLE , col_aus DOUBLE)
PARTITIONED BY (rel_year STRING)
CLUSTERED BY(mov_id) INTO 6 BUCKETS;
when you're doing dynamic partition, create a temporary table with all the columns (including your partitioned column) and load data into temporary table.
create actual partitioned table with partition column. While you are loading data from temporary table the partitioned column should be in the last in the select clause.

Hive: Fatal error when trying to create dynamic partitions

create table MY_DATA0(session_id STRING, userid BIGINT,date_time STRING, ip STRING, URL STRING ,country STRING, state STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
TERMINATED BY '\n' STORED AS TEXTFILE ;
LOAD DATA INPATH '/inputhive' OVERWRITE INTO TABLE MY_DATA0;
create table part0(session_id STRING, userid BIGINT,date_time STRING, ip STRING, URL STRING) partitioned by (country STRING, state STRING, city STRING)
clustered by (userid) into 256 buckets ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE ;
\insert overwrite table part0 partition(country, state, city) select session_id, userid, date_time,ip, url, country, state,city from my_data0;
Overview of my dataset:
{60A191CB-B3CA-496E-B33B-0ACA551DD503},1331582487,2012-03-12
13:01:27,66.91.193.75,http://www.acme.com/SH55126545/VD55179433,United
States,Hauula,Hawaii
{365CC356-7822-8A42-51D2-B6396F8FC5BF},1331584835,2012-03-12
13:40:35,173.172.214.24,http://www.acme.com/SH55126545/VD55179433,United
States,El Paso,Texas
When I run the last insert script I get an error as :
java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]:
Fatal error occurred when node tried to create too many dynamic
partitions. The maximum number of dynamic partitions is controlled by
hive.exec.max.dynamic.partitions and
hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
PS:
I have set this two properties:
hive.exec.dynamic.partition.mode::nonstrict
hive.enforce.bucketing::true
Try setting those properties to higher values.
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
Partition columns should be mentioned at last in select statement.
Ex: if state is the partition column, then "insert into table t1 partition(state) select Id, name, dept, sal, state from t2"; this will work. For instance if my query is like this "insert into table t1 partition(state) select Id, name, dept,state, sal from t2;" then partitions will be created with salary(sal) column

HIVE External Table - Set Empty Strings to NULL

Currently I have a HIVE 0.7 instance on Amazon EMR. I am trying to create a duplicate of this instance on a new EMR cluster using Hive 0.11.
In my 0.7 instance I have an external table that will set empty strings to NULL. Here is how I create the table:
CREATE EXTERNAL TABLE IF NOT EXISTS tablename
(column1 string,
column2 string)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
TBLPROPERTIES ('serialization.null.format' = '');
Data is added to the table like this:
ALTER TABLE tablename
ADD PARTITION (year = '2013', month = '10', day='01')
LOCATION '/location_in_hdfs';
This works great in 0.7 but in 0.11 it doesn't seem to be evaluating my empty strings as NULLS. Interestingly, creating a normal table with the same data and table definition seems to evaluate empty strings as NULLs as expected.
Is there are different way to do this with an external table in 0.11?
Hive default partition properties overriding the table properties. Include SERDE properties in your alter statement:
ALTER TABLE tablename ADD PARTITION (year = '2013', month = '10', day='01') SET
SERDEPROPERTIES ('serialization.null.format' = '');

Resources