This is my script:
--table without partition
drop table if exists ufodata;
create table ufodata ( sighted string, reported string, city string, shape string, duration string, description string )
row format delimited
fields terminated by '\t'
Location '/mapreduce/hive/ufo';
--load my data in ufodata
load data local inpath '/home/training/downloads/ufo_awesome.tsv' into table ufodata;
--create partition table
drop table if exists partufo;
create table partufo ( sighted string, reported string, city string, shape string, duration string, description string )
partitioned by ( year string )
clustered by (year) into 6 buckets
row format delimited
fields terminated by '\t';
--by default dynamic partition is not set
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--by default bucketing is false
set hive.enforce.bucketing=true;
--loading mydata
insert overwrite table partufo
partition (year)
select sighted, reported, city, shape, duration, description, SUBSTR(TRIM(sighted), 1,4) from ufodata;
Error message:
FAILED: Error in semantic analysis: Invalid column reference
I tried bucketing my partitioned table. If I remove "clustered by (year) into 6 buckets", the script works fine. How do I bucket the partitioned table?
There is an important point to keep in mind when bucketing in Hive.
The same column cannot be used for both bucketing and partitioning, for the following reason:
Clustering and sorting happen within a partition. Each partition holds exactly one value of the partition column (in your case, year), so clustering or sorting on that column would have no effect. Hive also does not treat the partition column as one of the table's regular columns, so CLUSTERED BY cannot reference it. That is the reason for your error.
You can use the syntax below to create a bucketed table with a partition.
CREATE TABLE bckt_movies
(mov_id BIGINT , mov_name STRING ,prod_studio STRING, col_world DOUBLE , col_us_canada DOUBLE , col_uk DOUBLE , col_aus DOUBLE)
PARTITIONED BY (rel_year STRING)
CLUSTERED BY(mov_id) INTO 6 BUCKETS;
When you use dynamic partitioning, first create a temporary (staging) table with all the columns, including the one you will partition on, and load the data into it.
Then create the actual partitioned table with the partition column declared in PARTITIONED BY. When you load data from the temporary table, the partition column must come last in the SELECT clause.
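Applied to the UFO script from the question, this could look roughly like the sketch below. Bucketing on shape is only my assumption for illustration; any regular (non-partition) column would work. I also use duration in the SELECT, since that is the column the staging table declares.

```sql
-- Sketch only: bucket on a regular column (shape is an assumed choice),
-- partition on year.
DROP TABLE IF EXISTS partufo;
CREATE TABLE partufo (
  sighted STRING, reported STRING, city STRING,
  shape STRING, duration STRING, description STRING
)
PARTITIONED BY (year STRING)
CLUSTERED BY (shape) INTO 6 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Needed on older Hive releases; drop this line if your version rejects it.
SET hive.enforce.bucketing=true;

-- The expression that feeds the partition column (year) comes last.
INSERT OVERWRITE TABLE partufo PARTITION (year)
SELECT sighted, reported, city, shape, duration, description,
       SUBSTR(TRIM(sighted), 1, 4)
FROM ufodata;
```

Here ufodata plays the role of the temporary table: it holds every column, and the year is derived from sighted during the insert.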
I'm not able to import data on partitioned table in Hive.
Here is how I create the table
CREATE TABLE IF NOT EXISTS title_ratings
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
And then I load the data into it: LOAD DATA INPATH '/title.ratings.tsv.gz' INTO TABLE eval_hive_db.title_ratings;
It works fine up to here. Now I want to create a dynamically partitioned table. First of all, I set these parameters:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I now create my partitioned table:
CREATE TABLE IF NOT EXISTS title_ratings_part
(
tconst STRING,
numVotes INT
)
PARTITIONED BY (averageRating DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;
insert into title_ratings_part partition(title_ratings) select tconst, averageRating, numVotes from title_ratings;
(I also tried with numVotes instead by the way)
And I receive this error: FAILED: ValidationFailureSemanticException eval_hive_db.title_ratings_part: Partition spec {title_ratings=null} contains non-partition columns
Can someone help me, please?
Ideally, I want to partition my table by averageRating (less than 2, between 2 and 4, and greater than 4)
You can run this query to check whether the column contains null values.
select averageRating, count(*) from title_ratings group by averageRating;
If there are null values in this column, they show up as their own group with a count. Fill them in and then apply the partitioning again.
The partition column is stored as the last column of a table, so when inserting you need to keep that order in the SELECT statement. The partition clause must also name the partition column (averageRating), not the source table.
Please change the order of the columns in the SELECT:
insert into title_ratings_part partition(averageRating)
select
tconst,
numVotes,
averageRating --the partition column must always come last
from title_ratings;
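For the three rating ranges mentioned in the question, one option is to partition on a derived band instead of the raw rating, so you get three partitions rather than one per distinct value. A minimal sketch follows. The table name title_ratings_banded, the column rating_band, and the band labels are all hypothetical, and I am assuming that "between 2 and 4" includes both ends and that rows with a null rating should be skipped.

```sql
-- Hypothetical names: title_ratings_banded, rating_band, 'lt2' / '2to4' / 'gt4'.
CREATE TABLE IF NOT EXISTS title_ratings_banded
(
  tconst STRING,
  averageRating DOUBLE,
  numVotes INT
)
PARTITIONED BY (rating_band STRING)
STORED AS TEXTFILE;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The derived partition value is the last expression in the SELECT.
INSERT INTO TABLE title_ratings_banded PARTITION (rating_band)
SELECT
  tconst,
  averageRating,
  numVotes,
  CASE
    WHEN averageRating < 2 THEN 'lt2'
    WHEN averageRating <= 4 THEN '2to4'
    ELSE 'gt4'
  END AS rating_band
FROM title_ratings
WHERE averageRating IS NOT NULL;  -- assumption: null ratings are left out
```

Because the band is a separate partition column, averageRating can stay as a regular column and remain queryable with its exact value.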
I have a text-format Hive table, like:
CREATE EXTERNAL TABLE op_log (
time string, debug string,app_id string,app_version string, ...more fields)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Now I create an ORC-format table with the same fields, like:
CREATE TABLE op_log_orc (
time string, debug string,app_id string,app_version string, ...more fields)
PARTITIONED BY (dt string)
STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
When I copy from op_log to op_log_orc, I get this error:
hive> insert into op_log_orc PARTITION(dt='2016-08-09') select * from op_log where dt='2016-08-09';
FAILED: SemanticException [Error 10044]: Line 1:12 Cannot insert into target table because column number/types are different ''2016-08-09'': Table insclause-0 has 62 columns, but query has 63 columns.
hive>
The partition key (dt) in the source table is returned in the result set as though it were a regular field, so you have the extra column. Exclude the dt field from the field list (instead of *) if you're going to specify its value in the partition key. Alternatively, just specify dt as the name of the partition, without providing a value. See CTAS (create table as select...) in the example here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS)
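The second option (naming dt without a value) might look like the sketch below. It relies on dynamic partitioning, so I am assuming it is acceptable to switch the session to nonstrict mode, and that both tables declare the same columns in the same order, as the question says.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- dt is named but not given a value, so Hive takes it from the last
-- column of the SELECT. SELECT * on a partitioned table returns the
-- partition column (dt) last, so the column counts now match.
INSERT INTO TABLE op_log_orc PARTITION (dt)
SELECT * FROM op_log WHERE dt = '2016-08-09';
```

With the static form, PARTITION(dt='2016-08-09'), you would instead list every regular column by name in the SELECT and leave dt out.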
When we partition a table, the columns on which the table is partitioned are not listed among the columns in the CREATE statement but appear separately in PARTITIONED BY. What is the reason for this?
CREATE TABLE REGISTRATION_DATA (
userid BIGINT,
First_Name STRING,
Last_Name STRING,
address1 STRING,
address2 STRING,
city STRING,
zip_code STRING,
state STRING
)
PARTITIONED BY (
REGION STRING,
COUNTRY STRING
);
A partition column in Hive acts as a pseudocolumn: we can query it directly even though it is not in the column list of the CREATE statement. Its values are stored in the partition directory names rather than in the data files.
So if we also include the partition column in the table's own column list, we get an error like 'Error in semantic analysis. Columns repeated in partitioning columns'.
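To illustrate, the partition columns can be selected and filtered like any other column. This is only a sketch: it assumes the table was created under the name registration_data, and 'EMEA' and 'FR' are made-up values.

```sql
-- region and country are not in the column list, but are still queryable.
-- 'EMEA' and 'FR' are hypothetical partition values.
SELECT userid, city, region, country
FROM registration_data
WHERE region = 'EMEA' AND country = 'FR';

-- Lists the partitions (region=.../country=...) that exist for the table.
SHOW PARTITIONS registration_data;
```

A filter on the partition columns lets Hive read only the matching partition directories instead of scanning the whole table.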
I was trying to create partitions and buckets using Hive.
I set these properties:
set hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Below is the code for creating the table:
CREATE TABLE transactions_production
( id string,
dept string,
category string,
company string,
brand string,
date1 string,
productsize int,
productmeasure string,
purchasequantity int,
purchaseamount double)
PARTITIONED BY (chain string) clustered by(id) into 5 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Below is the code for inserting data into the table:
INSERT OVERWRITE TABLE transactions_production PARTITION (chain)
select id, dept, category, company, brand, date1, productsize, productmeasure,
purchasequantity, purchaseamount, chain from transactions_staging;
What went wrong:
Partitions and buckets are getting created in HDFS but the data is present only in the 1st bucket of all the partitions; all the remaining buckets are empty.
Please let me know what I did wrong and how to resolve this issue.
When using bucketing, Hive computes a hash of the CLUSTERED BY value (here, id) and splits each partition into that many flat files, one per bucket.
Because rows are assigned to buckets by the hash of their id, the size of each bucket depends on the values in your table.
If no values hash to any bucket other than the first, all the other flat files will be empty.
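One way to check whether the ids really all hash to the same bucket is to compute the bucket number in a query over the staging table. The expression below mirrors how, as far as I know, older Hive versions assign buckets (hash, clear the sign bit, modulo the bucket count). Newer versions may use a different hash function, so treat the result as a rough diagnostic rather than the exact file layout.

```sql
-- 5 is the bucket count from the CREATE TABLE statement.
-- hash() is Hive's built-in hash function; & 2147483647 clears the sign bit.
SELECT chain,
       (hash(id) & 2147483647) % 5 AS bucket_no,
       count(*) AS row_count
FROM transactions_staging
GROUP BY chain, (hash(id) & 2147483647) % 5;
```

If this shows rows spread over several bucket numbers while the files on HDFS are still empty, the cause is more likely in how the insert ran than in the data itself.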
Can anyone show me where I am making a mistake with dynamic partitioning?
--Staging table:
create table staging_peopledata
(
firstname string,
secondname string,
salary float,
country string,
state string
)
row format delimited fields terminated by ',' lines terminated by '\n';
--Data for Staging table:
John,David,30000,RUS,tnRUS
John,David,30000,RUS,tnRUS
Mary,David,5000,AUS,syAUS
Mary,David,5000,AUS,syAUS
Mary,David,5000,AUS,weAUS
Pierre,Cathey,6000,RUS,kaRUS
Pierre,Cathey,6000,RUS,kaRUS
Ahmed,Talib,11000,US,bcUS
Ahmed,Talib,11000,US,onUS
Ahmed,Talib,11000,US,onUS
kris,David,80000,UK,lnUK
kris,David,80000,UK,soUK
--Production table:
create table Production_peopledata
(
firstname string,
lastname string,
salary float)
partitioned by (country string, state string)
row format delimited fields terminated by ',' lines terminated by '\n';
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table Production_peopledata
partition(country,state)
select firstname, secondname, salary, country, state from staging_peopledata;
If I execute the above command, I get the error below.
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode
requires atleast one static partition column. To turn this off set
hive.exec.dynamic.partition.mode=nonstrict
Can anyone tell me where I am making the mistake?
Can you please run the command below in the Hive shell?
hive>set hive.exec.dynamic.partition.mode=nonstrict;
You need to set the properties below:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
The column you want to partition on should not be part of the partitioned table's column list; it is declared in PARTITIONED BY instead. When you fill the partitioned table, the values for the partition column come from the source table.
Let's say we have EMP and EMP1 tables. EMP1 is the partitioned table, which will get its data from the EMP table. Apart from the partition column, the two tables have the same columns. So EMP1 declares the partition column, salpart, in PARTITIONED BY, and the source table EMP carries salpart as a regular column. After a successful run we can see the partition directories under the /user/hive/warehouse location. This is implemented as below:
CREATE TABLE IF NOT EXISTS emp ( eid int, name String,
salary String, destination String,salpart string)
COMMENT "Employee details"
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;
load data local inpath '/home/cloudera/myemployeedata.txt' overwrite into table emp;
--values for the salpart partition column come from the source table (emp)
CREATE TABLE IF NOT EXISTS emp1 ( eid int, name String,
salary String, destination String)
COMMENT "Employee details"
partitioned by (salpart string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;
Dynamic Partition:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table emp1 partition(salpart) select eid,name,salary,destination,salpart from emp;
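To confirm that the partitions were created, something like the following can be run afterwards. The dfs line is a sketch that assumes the Hive CLI, the default warehouse directory, and the default database; adjust the path if your setup differs.

```sql
-- Lists one entry per partition, in the form salpart=<value>.
SHOW PARTITIONS emp1;

-- Assumed path: default warehouse location and default database.
dfs -ls /user/hive/warehouse/emp1;
```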
According to the error, the mode is still strict. For dynamic partitioning it needs to be set to nonstrict.
Use the command below:
hive>set hive.exec.dynamic.partition.mode=nonstrict;
Try setting it once again:
set hive.exec.dynamic.partition.mode=nonstrict;
Sometimes Hive still treats the mode as strict even after you set this property, so I suggest setting it once more.
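Before re-running the insert, it may help to check what the current session actually has. SET with only a property name prints that property with its current value. These settings apply per session, so they need to be issued in the same session (or script) as the INSERT.

```sql
-- Prints each property together with its current value for this session.
SET hive.exec.dynamic.partition;
SET hive.exec.dynamic.partition.mode;
```

If the mode still shows strict here, the earlier SET was probably run in a different session or overridden later in the script.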