Hive partition columns seem to prevent "select distinct"

I have created a table in Hive like this:
CREATE TABLE application_path
(userId STRING, sessId BIGINT, accesstime BIGINT, actionId STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE;
Running on this table the query:
SELECT DISTINCT userId FROM application_path;
gives the expected result:
user1#domain.com
user2#domain.com
user3#domain.com
...
Then I've changed the declaration to add a partition:
CREATE TABLE application_path
(sessId BIGINT, accesstime BIGINT, actionId STRING)
PARTITIONED BY(userId STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE;
Now the query SELECT DISTINCT userId... runs for a few seconds as before, but eventually returns nothing.
I've just noticed the syntax:
SHOW PARTITIONS application_path;
but I was wondering if that's the only way to get unique (distinct) values from a partitioning column. The output of SHOW PARTITIONS is not even an exact replacement for what you would get from SELECT DISTINCT, since the column name is prefixed to each row:
hive> show partitions application_path;
OK
userid=user1#domain.com
userid=user2#domain.com
userid=user3#domain.com
...
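Of course I could post-process that output outside Hive, for example with something like (rough sketch):
hive -e 'SHOW PARTITIONS application_path' | sed 's/^userid=//'
but that feels like a workaround rather than a real solution.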
What's strange to me is that userId can be used in GROUP BY with other columns, as in:
SELECT userId, sessId FROM application_path GROUP BY userId, sessId;
but does not return anything in:
SELECT userId FROM application_path GROUP BY userId;

I experienced the same issue; it will be fixed in 0.10:
https://issues.apache.org/jira/browse/HIVE-2955
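Until you can upgrade, a possible workaround (untested sketch, based on the observation in the question that GROUP BY with a non-partition column works) is to route the query through a subquery, so the outer DISTINCT no longer runs directly against the partitioned table:
SELECT DISTINCT t.userId
FROM (
SELECT userId, sessId
FROM application_path
GROUP BY userId, sessId
) t;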

Related

Hive partitions on tables

When we partition a table, the columns the table is partitioned on are not listed in the CREATE statement's column list but appear separately in the PARTITIONED BY clause. What is the reason behind this?
CREATE TABLE REGISTRATION_DATA (
userid BIGINT,
First_Name STRING,
Last_Name STRING,
address1 STRING,
address2 STRING,
city STRING,
zip_code STRING,
state STRING
)
PARTITIONED BY (
REGION STRING,
COUNTRY STRING
);
The partition columns declared in PARTITIONED BY become pseudo-columns: we can query them directly even though they are not in the CREATE statement's column list.
So if we also include a partition column in the table's own column list (the create query), we get an error like 'Error in semantic analysis. Columns repeated in partitioning columns'.
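For illustration, repeating a partition column in the column list reproduces the error (sketch, reusing the table from the question):
CREATE TABLE REGISTRATION_DATA (
userid BIGINT,
region STRING
)
PARTITIONED BY (region STRING, country STRING);
-- fails with: Error in semantic analysis. Columns repeated in partitioning columns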

Excluding the partition field from select queries in Hive

Suppose I have a table definition as follows in Hive(the actual table has around 65 columns):
CREATE EXTERNAL TABLE S.TEST (
COL1 STRING,
COL2 STRING
)
PARTITIONED BY (extract_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\007'
LOCATION 'xxx';
Once the table is created, when I run hive -e "describe s.test", I see extract_date as one of the columns on the table. Doing a select * from s.test also returns extract_date column values. Is it possible to exclude this virtual(?) column when running select queries in Hive?
Change this property:
set hive.support.quoted.identifiers=none;
and run the query as:
SELECT `(extract_date)?+.+` FROM <table_name>;
I tested it and it works fine.
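Applied to the table from the question, the full sequence would be (sketch):
set hive.support.quoted.identifiers=none;
-- the backquoted pattern is treated as a Java regex over column names:
-- it matches every column except extract_date, so only COL1 and COL2 are returned
SELECT `(extract_date)?+.+` FROM s.test;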

Hive: Fatal error when trying to create dynamic partitions

create table MY_DATA0(session_id STRING, userid BIGINT, date_time STRING, ip STRING, URL STRING, country STRING, state STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
LOAD DATA INPATH '/inputhive' OVERWRITE INTO TABLE MY_DATA0;
create table part0(session_id STRING, userid BIGINT, date_time STRING, ip STRING, URL STRING)
partitioned by (country STRING, state STRING, city STRING)
clustered by (userid) into 256 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
insert overwrite table part0 partition(country, state, city) select session_id, userid, date_time, ip, url, country, state, city from my_data0;
Overview of my dataset:
{60A191CB-B3CA-496E-B33B-0ACA551DD503},1331582487,2012-03-12 13:01:27,66.91.193.75,http://www.acme.com/SH55126545/VD55179433,United States,Hauula,Hawaii
{365CC356-7822-8A42-51D2-B6396F8FC5BF},1331584835,2012-03-12 13:40:35,173.172.214.24,http://www.acme.com/SH55126545/VD55179433,United States,El Paso,Texas
When I run the last insert script, I get this error:
java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]:
Fatal error occurred when node tried to create too many dynamic
partitions. The maximum number of dynamic partitions is controlled by
hive.exec.max.dynamic.partitions and
hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
PS:
I have set these two properties:
hive.exec.dynamic.partition.mode::nonstrict
hive.enforce.bucketing::true
Try setting those properties to higher values.
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
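Before raising the limits blindly, it can also help to check how many partitions the insert would actually create, for example with a quick count against the source table:
SELECT COUNT(*) AS partition_count
FROM (SELECT DISTINCT country, state, city FROM my_data0) t;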
Partition columns must be mentioned last in the SELECT statement, because Hive maps the selected columns to the target table by position, not by name.
Ex: if state is the partition column, then
insert into table t1 partition(state) select id, name, dept, sal, state from t2;
will work as intended, whereas
insert into table t1 partition(state) select id, name, dept, state, sal from t2;
will create the partitions from the salary (sal) values.

Amazon EMR job with multiple input parameters

In AWS Data Pipeline, I am creating an activity to copy from S3 to EMR using Hive.
To achieve this, I have to pass two input parameters to the EMR job as a step.
I have searched almost all of the Data Pipeline documentation but did not find a way to specify multiple input parameters.
I also talked with the AWS support team, but they were not clear about it either; the way they suggested also did not work.
Below are my step arguments and Hive query. Please let me know if anyone has an idea how to achieve this.
Steps:
s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://us-east-1.elasticmapreduce/libs/hive/hive-script,--base-path,s3://us-east-1.elasticmapreduce/libs/hive/,--hive-versions,latest,--run-hive-script,--args,-f,s3://gwbpipeline-test/scripts/multiple_user_sample_new.hql, -d, "output1=#{output.directoryPath}", -d,"input1=s3://gwbpipeline-test/temp/sb-test/#{format(#scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_users/", -d,"input2=s3://gwbpipeline-test/temp/sb-test/#{format(#scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_user_children/"
Hive Query:
drop table if exists tbl_users;
CREATE EXTERNAL TABLE tbl_users (
user_id string, user_first_name string, user_last_name string, user_email string, user_dob string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${input1}';
drop table if exists tbl_user_children;
CREATE EXTERNAL TABLE tbl_user_children (
id string, full_name string, birthday string, type string, user_id string, facebook_id string, date_added string
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${input2}';
drop table if exists tbl_users_child_output;
CREATE EXTERNAL TABLE tbl_users_child_output (
user_id string, user_fname string, user_lname string, child_full_name string, child_birthdate string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${output1}';
INSERT INTO TABLE tbl_users_child_output SELECT u.user_id, u.user_first_name, u.user_last_name, c.full_name, c.birthday FROM tbl_users as u join tbl_user_children as c ON u.user_id = c.user_id;
I was able to get this to work using the following format in the step field of EMRActivity.
Basically, I changed -d to -hiveconf, and changed the variable substitution in the Hive script from ${input1} to ${hiveconf:input1}. I think this is a change made in a newer version of Hive.
Below is the changed working code:
s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://us-east-1.elasticmapreduce/libs/hive/hive-script,--base-path,s3://us-east-1.elasticmapreduce/libs/hive/,--hive-versions,latest,--run-hive-script,--args,-f,s3://gwbpipeline-test/scripts/multiple_user_sample_new.hql, -hiveconf, "output1=#{output.directoryPath}", -hiveconf,"input1=s3://gwbpipeline-test/temp/sb-test/#{format(#scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_users/", -hiveconf,"input2=s3://gwbpipeline-test/temp/sb-test/#{format(#scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_user_children/"
Hive Query:
drop table if exists tbl_users;
CREATE EXTERNAL TABLE tbl_users (
user_id string, user_first_name string, user_last_name string, user_email string, user_dob string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hiveconf:input1}';
drop table if exists tbl_user_children;
CREATE EXTERNAL TABLE tbl_user_children (
id string, full_name string, birthday string, type string, user_id string, facebook_id string, date_added string
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hiveconf:input2}';
drop table if exists tbl_users_child_output;
CREATE EXTERNAL TABLE tbl_users_child_output (
user_id string, user_fname string, user_lname string, child_full_name string, child_birthdate string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hiveconf:output1}';
INSERT INTO TABLE tbl_users_child_output SELECT u.user_id, u.user_first_name, u.user_last_name, c.full_name, c.birthday FROM tbl_users as u join tbl_user_children as c ON u.user_id = c.user_id;
Hope this helps someone.
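For what it's worth, the same substitution mechanism can be tried locally with the plain Hive CLI before wiring it into Data Pipeline (the bucket names below are placeholders):
hive -f multiple_user_sample_new.hql \
--hiveconf input1=s3://my-bucket/input/tbl_users/ \
--hiveconf input2=s3://my-bucket/input/tbl_user_children/ \
--hiveconf output1=s3://my-bucket/output/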

Hive - facing challenges with a dynamic partition error

Can anyone show me where I am making a mistake with dynamic partitioning?
--Staging table:
create table staging_peopledata
(
firstname string,
secondname string,
salary float,
country string,
state string
)
row format delimited fields terminated by ',' lines terminated by '\n';
--Data for Staging table:
John,David,30000,RUS,tnRUS
John,David,30000,RUS,tnRUS
Mary,David,5000,AUS,syAUS
Mary,David,5000,AUS,syAUS
Mary,David,5000,AUS,weAUS
Pierre,Cathey,6000,RUS,kaRUS
Pierre,Cathey,6000,RUS,kaRUS
Ahmed,Talib,11000,US,bcUS
Ahmed,Talib,11000,US,onUS
Ahmed,Talib,11000,US,onUS
kris,David,80000,UK,lnUK
kris,David,80000,UK,soUK
--Production table:
create table Production_peopledata
(
firstname string,
lastname string,
salary float)
partitioned by (country string, state string)
row format delimited fields terminated by ',' lines terminated by '\n';
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table Production_peopledata
partition(country,state)
select firstname, secondname, salary, country, state from staging_peopledata;
If I execute the above command, I get the error below:
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode
requires atleast one static partition column. To turn this off set
hive.exec.dynamic.partition.mode=nonstrict
Can anyone tell me where I am making the mistake?
Can you please run the below command in the Hive shell:
hive>set hive.exec.dynamic.partition.mode=nonstrict;
You need to set the below properties:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
The column you want to partition on should not be part of the table definition, as the partition column is generated dynamically. When filling the partitioned table, the partition column's values should come from the source table.
Let's say we have tables EMP and EMP1. EMP1 is the partitioned table, which will get its data from EMP. Initially both tables are the same, so we first create a partition column, salpart, and add it to the source table EMP. After a successful run we can see the partition directories under the user/hive/warehouse location. The explanation above is implemented as follows:
CREATE TABLE IF NOT EXISTS emp ( eid int, name String,
salary String, destination String, salpart string)
COMMENT "Employee details"
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;
load data local inpath '/home/cloudera/myemployeedata.txt' overwrite into table emp;
CREATE TABLE IF NOT EXISTS emp1 ( eid int, name String,
salary String, destination String)
COMMENT "Employee details"
partitioned by (salpart string) -- the values for this column will come from the source table
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\t"
LINES TERMINATED BY "\n"
STORED AS TEXTFILE;
Dynamic Partition:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table emp1 partition(salpart) select eid,name,salary,destination,salpart from emp;
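Once the insert succeeds, the dynamically created partitions can be verified with:
SHOW PARTITIONS emp1;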
As per the error, it seems the mode is still strict; for dynamic partitioning it needs to be set to nonstrict.
Use the below command:
hive>set hive.exec.dynamic.partition.mode=nonstrict;
Once again, try setting
set hive.exec.dynamic.partition.mode=nonstrict;
Sometimes in Hive, even when you have set this property, it still applies strict mode, so I suggest you set the property once again.
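A quick way to confirm the setting actually took effect in the current session is to call SET with just the property name, which makes Hive print its current value:
hive> set hive.exec.dynamic.partition.mode;
hive.exec.dynamic.partition.mode=nonstrict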
