Hive: Fatal error when trying to create dynamic partitions - hadoop

create table MY_DATA0(session_id STRING, userid BIGINT,date_time STRING, ip STRING, URL STRING ,country STRING, state STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
TERMINATED BY '\n' STORED AS TEXTFILE ;
LOAD DATA INPATH '/inputhive' OVERWRITE INTO TABLE MY_DATA0;
create table part0(session_id STRING, userid BIGINT,date_time STRING, ip STRING, URL STRING) partitioned by (country STRING, state STRING, city STRING)
clustered by (userid) into 256 buckets ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE ;
insert overwrite table part0 partition(country, state, city) select session_id, userid, date_time,ip, url, country, state,city from my_data0;
Overview of my dataset:
{60A191CB-B3CA-496E-B33B-0ACA551DD503},1331582487,2012-03-12
13:01:27,66.91.193.75,http://www.acme.com/SH55126545/VD55179433,United
States,Hauula,Hawaii
{365CC356-7822-8A42-51D2-B6396F8FC5BF},1331584835,2012-03-12
13:40:35,173.172.214.24,http://www.acme.com/SH55126545/VD55179433,United
States,El Paso,Texas
When I run the last insert statement I get this error:
java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]:
Fatal error occurred when node tried to create too many dynamic
partitions. The maximum number of dynamic partitions is controlled by
hive.exec.max.dynamic.partitions and
hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
PS:
I have set these two properties:
hive.exec.dynamic.partition.mode::nonstrict
hive.enforce.bucketing::true

Try setting the two properties named in the error message to higher values:
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
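Before raising the limits, it can help to see how many partitions the insert would actually create. A minimal sketch, assuming the my_data0 staging table from the question; the alias partition_count is made up for illustration, and the limit values are only examples:

```sql
-- Count the distinct (country, state, city) combinations, i.e. the number
-- of partitions the dynamic insert would try to create.
SELECT COUNT(*) AS partition_count
FROM (
  SELECT DISTINCT country, state, city
  FROM my_data0
) t;

-- Raise both limits above that count (illustrative values).
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
```

If the count turns out to be very large, partitioning on a coarser column than city may be worth considering.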

Partition columns must come last in the SELECT list, because Hive assigns dynamic partition values by position, not by name.
For example, if state is the partition column, then "insert into table t1 partition(state) select Id, name, dept, sal, state from t2;" will work. If the query is instead "insert into table t1 partition(state) select Id, name, dept, state, sal from t2;", the partitions will be created from the salary (sal) column.
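With several partition columns the same rule applies to each of them. A sketch using the part0 and my_data0 tables from the question: the last three SELECT columns feed country, state and city in that order, whatever they happen to be named.

```sql
-- Dynamic partition columns are matched by position, so the SELECT list
-- must end with them, in the same order as the PARTITION clause.
INSERT OVERWRITE TABLE part0 PARTITION (country, state, city)
SELECT session_id, userid, date_time, ip, url,   -- regular columns
       country, state, city                      -- partition columns, last
FROM my_data0;
```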

Related

Unable to load data in Hive partitioned table

I have created a table in Hive with the following query:
create table if not exists employee(CASE_NUMBER String,
CASE_STATUS String,
CASE_RECEIVED_DATE DATE,
DECISION_DATE DATE,
EMPLOYER_NAME STRING,
PREVAILING_WAGE_PER_YEAR BIGINT,
PAID_WAGE_PER_YEAR BIGINT,
order_n int) partitioned by (JOB_TITLE_SUBGROUP STRING) row format delimited fields terminated by ',';
I tried loading data into the created table using the query below:
LOAD DATA INPATH '/salary_data.csv' overwrite into table employee partition (JOB_TITLE_SUBGROUP);
For the partitioned table, I have also set the following configuration:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
But I get the error below when executing the load query:
Your query has the following error(s):
Error while compiling statement: FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Invalid partition key & values; keys [job_title_subgroup, ], values [])
Please help.
If you want to load data into a Hive partition, you have to provide the value of the partition itself in the LOAD DATA query. So in this case, your query would be something like this.
LOAD DATA INPATH '/salary_data.csv' overwrite into table employee partition (JOB_TITLE_SUBGROUP="Value");
Here "Value" is the name of the partition into which you are loading your data. Hive uses "Value" to create the directory in which your .csv is stored, which will be something like this: .../employee/JOB_TITLE_SUBGROUP=Value. I hope this helps.
Check the documentation for details on the LOAD DATA syntax.
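As a sketch of the static form, assuming the employee table from the question and a hypothetical partition value 'Engineer' (both the value and the assumption that the file holds rows for only that subgroup are illustrative):

```sql
-- LOAD DATA places the file in the partition directory as-is, so the
-- partition value has to be given explicitly in the statement.
LOAD DATA INPATH '/salary_data.csv'
OVERWRITE INTO TABLE employee
PARTITION (JOB_TITLE_SUBGROUP = 'Engineer');

-- List the partitions to confirm the new one was registered.
SHOW PARTITIONS employee;
```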
EDITED
Since the table is dynamically partitioned, one solution is to load the .csv into an external table (e.g. employee_external) and then execute an INSERT statement like this:
INSERT OVERWRITE TABLE employee PARTITION(JOB_TITLE_SUBGROUP)
SELECT CASE_NUMBER, CASE_STATUS, (...), JOB_TITLE_SUBGROUP
FROM employee_external
I might be a little late to reply, but you can try the steps below:
Set below properties first :
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
Create temp table first:
CREATE EXTERNAL TABLE IF NOT EXISTS employee_temp(
ID STRING,
Name STRING,
Salary STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
tblproperties ("skip.header.line.count"="1");
Load Data in temporary table:
LOAD DATA INPATH 'filepath/employee.csv' OVERWRITE INTO TABLE employee_temp;
Create Partitioned Table:
CREATE EXTERNAL TABLE IF NOT EXISTS employee_part(
ID STRING,
Name STRING)
PARTITIONED BY (Salary STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
tblproperties ("skip.header.line.count"="1");
Load Data into partitioned table from intermediate / temp table:
INSERT OVERWRITE TABLE employee_part PARTITION (SALARY) SELECT * FROM employee_temp;

Hive partitions on tables

When we partition a table, the columns on which the table is partitioned are not listed among the regular columns in the CREATE statement; they appear only in the PARTITIONED BY clause. What is the reason behind this?
CREATE TABLE REGISTRATION_DATA (
userid BIGINT,
First_Name STRING,
Last_Name STRING,
address1 STRING,
address2 STRING,
city STRING,
zip_code STRING,
state STRING
)
PARTITIONED BY (
REGION STRING,
COUNTRY STRING
)
The partition columns declared in PARTITIONED BY become pseudocolumns that we can query directly, without having them in the column list of the CREATE statement. Their values are stored in the partition directory names rather than in the data files.
So if we also include a partition column among the table's regular columns, we get an error like 'Error in semantic analysis. Columns repeated in partitioning columns'.
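A sketch of the question's table with the partition columns declared only once. The lowercase table name registration_data and the filter values 'EMEA' and 'FR' are assumptions made for illustration:

```sql
-- Regular columns go in the column list; partition columns appear only in
-- PARTITIONED BY. Listing region or country in both places is an error.
CREATE TABLE registration_data (
  userid     BIGINT,
  first_name STRING,
  last_name  STRING,
  address1   STRING,
  address2   STRING,
  city       STRING,
  zip_code   STRING,
  state      STRING
)
PARTITIONED BY (
  region  STRING,
  country STRING
);

-- Partition columns can still be selected and filtered like any other column.
SELECT userid, city, region, country
FROM registration_data
WHERE region = 'EMEA' AND country = 'FR';
```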

Amazon EMR job with multiple input parameters

In Amazon Data Pipeline, I am creating an activity that copies data from S3 to EMR using Hive.
To achieve this I have to pass two input parameters to the EMR job as a step.
I have searched almost all of the Data Pipeline documentation but did not find a way to specify multiple input parameters.
I also talked with the AWS support team, but they were not clear about it either. The trick they suggested does not work.
Below are my step arguments and Hive query. Please let me know if anyone has an idea how to achieve this.
Steps:
s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://us-east-1.elasticmapreduce/libs/hive/hive-script,--base-path,s3://us-east-1.elasticmapreduce/libs/hive/,--hive-versions,latest,--run-hive-script,--args,-f,s3://gwbpipeline-test/scripts/multiple_user_sample_new.hql, -d, "output1=#{output.directoryPath}", -d,"input1=s3://gwbpipeline-test/temp/sb-test/#{format(#scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_users/", -d,"input2=s3://gwbpipeline-test/temp/sb-test/#{format(#scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_user_children/"
Hive Query:
drop table if exists tbl_users;
CREATE EXTERNAL TABLE tbl_users (
user_id string, user_first_name string, user_last_name string, user_email string, user_dob string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${input1}';
drop table if exists tbl_user_children;
CREATE EXTERNAL TABLE tbl_user_children (
id string, full_name string, birthday string, type string, user_id string, facebook_id string, date_added string
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${input2}';
drop table if exists tbl_users_child_output;
CREATE EXTERNAL TABLE userS3output (
user_id string, user_fname string, user_lname string, child_full_name string, child_dirthdate string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${output1}';
INSERT INTO TABLE tbl_users_child_output SELECT u.user_id, u.user_first_name, u.user_last_name, c.full_name, c.birthday FROM tbl_users as u join tbl_user_children as c ON u.user_id = c.user_id;
I was able to get this to work using the following format in the step field of the EMRActivity:
Basically, I replaced -d with -hiveconf. I also changed the substitution in the Hive script from ${input1} to ${hiveconf:input1}. I think this is a change made in newer versions of Hive.
Below is the changed working code:
s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://us-east-1.elasticmapreduce/libs/hive/hive-script,--base-path,s3://us-east-1.elasticmapreduce/libs/hive/,--hive-versions,latest,--run-hive-script,--args,-f,s3://gwbpipeline-test/scripts/multiple_user_sample_new.hql, -hiveconf, "output1=#{output.directoryPath}", -hiveconf,"input1=s3://gwbpipeline-test/temp/sb-test/#{format(#scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_users/", -hiveconf,"input2=s3://gwbpipeline-test/temp/sb-test/#{format(#scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_user_children/"
HIVE Query:
drop table if exists tbl_users;
CREATE EXTERNAL TABLE tbl_users (
user_id string, user_first_name string, user_last_name string, user_email string, user_dob string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hiveconf:input1}';
drop table if exists tbl_user_children;
CREATE EXTERNAL TABLE tbl_user_children (
id string, full_name string, birthday string, type string, user_id string, facebook_id string, date_added string
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hiveconf:input2}';
drop table if exists tbl_users_child_output;
CREATE EXTERNAL TABLE userS3output (
user_id string, user_fname string, user_lname string, child_full_name string, child_dirthdate string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hiveconf:output1}';
INSERT INTO TABLE tbl_users_child_output SELECT u.user_id, u.user_first_name, u.user_last_name, c.full_name, c.birthday FROM tbl_users as u join tbl_user_children as c ON u.user_id = c.user_id;
Hope this helps someone.

Hive - facing challenges with dynamic partition error

Can anyone point out where I am making a mistake with dynamic partitioning?
--Staging table:
create table staging_peopledata
(
firstname string,
secondname string,
salary float,
country string,
state string
)
row format delimited fields terminated by ',' lines terminated by '\n';
--Data for Staging table:
John,David,30000,RUS,tnRUS
John,David,30000,RUS,tnRUS
Mary,David,5000,AUS,syAUS
Mary,David,5000,AUS,syAUS
Mary,David,5000,AUS,weAUS
Pierre,Cathey,6000,RUS,kaRUS
Pierre,Cathey,6000,RUS,kaRUS
Ahmed,Talib,11000,US,bcUS
Ahmed,Talib,11000,US,onUS
Ahmed,Talib,11000,US,onUS
kris,David,80000,UK,lnUK
kris,David,80000,UK,soUK
--Production table:
create table Production_peopledata
(
firstname string,
lastname string,
salary float)
partitioned by (country string, state string)
row format delimited fields terminated by ',' lines terminated by '\n';
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table Production_peopledata
partition(country,state)
select firstname, secondname, salary, country, state from staging_peopledata;
If I execute the above command, I get the error below.
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode
requires atleast one static partition column. To turn this off set
hive.exec.dynamic.partition.mode=nonstrict
Can anyone tell me where I am making the mistake?
Can you please run the command below in the Hive shell?
hive>set hive.exec.dynamic.partition.mode=nonstrict;
You need to set below properties:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
The column you want to partition on should not be part of the table's column list, because the partition column is generated dynamically. When filling the partitioned table, the values for the partition column should come from the source table.
Let's say we have EMP and EMP1 tables. EMP1 is the partitioned table, which will get its data from the EMP table. Initially both tables have the same columns. First we create the partition column, salpart, on EMP1. Then we add this column to the source table, EMP. After a successful run we can see the partitioned files in the user/hive/warehouse location. This is implemented as follows:
CREATE TABLE IF NOT EXISTS emp ( eid int, name String,
salary String, destination String,salpart string)
COMMENT "Employee details"
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;
load data local inpath '/home/cloudera/myemployeedata.txt' overwrite into table emp;
CREATE TABLE IF NOT EXISTS emp1 ( eid int, name String,
salary String, destination String)
COMMENT "Employee details"
partitioned by (salpart string) -- values for this column come from the source table
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;
Dynamic Partition:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table emp1 partition(salpart) select eid,name,salary,destination,salpart from emp;
As per the error, it seems the mode is still strict; for dynamic partitioning it needs to be set to nonstrict.
Use the command below:
hive>set hive.exec.dynamic.partition.mode=nonstrict;
Try once again to run
set hive.exec.dynamic.partition.mode=nonstrict;
Sometimes Hive still behaves as if it were in strict mode even after you set this property, so I suggest setting it once again.
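One way to check what the session actually sees is to print the properties before the insert. A minimal sketch, assuming the tables from the question and that all statements run in the same Hive session:

```sql
-- With no "=value", SET only displays the current setting.
SET hive.exec.dynamic.partition;
SET hive.exec.dynamic.partition.mode;

-- Set both in the same session, immediately before the insert.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE Production_peopledata PARTITION (country, state)
SELECT firstname, secondname, salary, country, state
FROM staging_peopledata;
```

If the first two statements still report strict after the properties were set, the SET and the INSERT are probably not running in the same session.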

Hive partition columns seem to prevent "select distinct"

I have created a table in Hive like this:
CREATE TABLE application_path
(userId STRING, sessId BIGINT, accesstime BIGINT, actionId STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE;
Running on this table the query:
SELECT DISTINCT userId FROM application_path;
gives the expected result:
user1#domain.com
user2#domain.com
user3#domain.com
...
Then I've changed the declaration to add a partition:
CREATE TABLE application_path
(sessId BIGINT, accesstime BIGINT, actionId STRING)
PARTITIONED BY(userId STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE;
Now the query SELECT DISTINCT userId... runs for seconds as before, but eventually returns nothing.
I've just noticed the syntax:
SHOW PARTITIONS application_path;
but I was wondering if that's the only way to get unique (distinct) values from a partitioning column. The output of SHOW PARTITIONS is not even an exact replacement for what you would get from SELECT DISTINCT, since the column name is prefixed to each row:
hive> show partitions application_path;
OK
userid=user1#domain.com
userid=user2#domain.com
userid=user3#domain.com
...
What's strange to me is that userId can be used in GROUP BY with other columns, as in:
SELECT userId, sessId FROM application_path GROUP BY userId, sessId;
but does not return anything in:
SELECT userId FROM application_path GROUP BY userId;
I experienced the same issue; it will be fixed in Hive 0.10:
https://issues.apache.org/jira/browse/HIVE-2955
