Hive partitions on tables - hadoop

When we partition a table, the columns on which the table is partitioned are not listed among the column definitions of the CREATE statement; they appear only in the PARTITIONED BY clause. What is the reason behind this?
CREATE TABLE REGISTRATION_DATA (
userid BIGINT,
First_Name STRING,
Last_Name STRING,
address1 STRING,
address2 STRING,
city STRING,
zip_code STRING,
state STRING
)
PARTITIONED BY (
REGION STRING,
COUNTRY STRING
);

The partition columns that we declare in Hive become pseudocolumns, which we can query directly without listing them among the regular columns of the CREATE statement.
If we also include a partition column in the table's own column list (in the CREATE query), we get an error like 'Error in semantic analysis. Columns repeated in partitioning columns'.
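To make the pseudocolumn idea concrete, here is a minimal Python sketch of how a Hive-style layout keeps partition values in the directory name rather than in the data file. The table location, column names, and sample row are hypothetical, and the helper functions are illustrative only, not part of any Hive API.

```python
# Sketch: partition column values live in the directory path, not in the file.
# All names and values below are assumptions made for illustration.

def partition_path(table_location, partition_values):
    """Build a Hive-style partition directory such as .../region=EU/country=FR."""
    parts = [f"{name}={value}" for name, value in partition_values]
    return "/".join([table_location.rstrip("/")] + parts)

def split_row(row, data_columns, partition_columns):
    """Separate the values stored in the file from those encoded in the path."""
    stored = {c: row[c] for c in data_columns}
    in_path = [(c, row[c]) for c in partition_columns]
    return stored, in_path

data_columns = ["userid", "city"]
partition_columns = ["region", "country"]
row = {"userid": 42, "city": "Paris", "region": "EU", "country": "FR"}

stored, in_path = split_row(row, data_columns, partition_columns)
path = partition_path("/warehouse/registration_data", in_path)
print(path)    # directory that would hold this row's data file
print(stored)  # only these values would be written into the file
```

Because region and country are already implied by the directory, declaring them a second time in the column list would describe the same column twice, which is consistent with the error quoted above.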

Related

AWS Athena creates indentation and moves values into wrong columns after partitions loads

I encountered the following problem:
1. I created a Hive table without partitions in HDFS on an EMR cluster and loaded data into it.
2. I created another Hive table based on the table from step 1, but with partitions derived from the datetime column: PARTITIONED BY (year STRING, month STRING, day STRING).
3. I loaded the data from the non-partitioned table into the partitioned table and got a valid result.
4. I created an Athena database and a table with the same structure as the Hive table.
5. I copied the partitioned files from HDFS to local disk and used aws s3 sync to transfer them into an empty S3 bucket. All files were transferred without errors and in the same layout as the Hive directory in HDFS.
6. I loaded the partitions with MSCK REPAIR TABLE and got no errors in the output.
After that I found that many values were shifted into the wrong columns; for example, a value that belongs in the "IP" column appeared in the "Operating_sys" column.
My scripts are:
-- Hive tables
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_page_part
(
log_DATE STRING,
user_id STRING,
page_path STRING,
referer STRING,
tracking_referer STRING,
medium STRING,
campaign STRING,
source STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
ad_id STRING,
keyword STRING,
user_agent STRING
)
PARTITIONED BY
(
`year` STRING,
`month` STRING,
`day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/events_partitioned';
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_event_part
(
log_DATE STRING,
user_id STRING,
category STRING,
action STRING,
label STRING,
value STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
extra_data_json STRING
)
PARTITIONED BY
(
`year` STRING,
`month` STRING,
`day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/pages_partitioned';
INSERT INTO TABLE cloudfront_logs_page_part
PARTITION
(
`year`,
`month`,
`day`
)
SELECT
log_DATE,
user_id,
page_path,
referer,
tracking_referer,
medium,
campaign,
source,
visitor_id,
ip,
session_id,
operating_sys,
ad_id,
keyword,
user_agent,
year(log_DATE) as `year`,
month(log_DATE) as `month`,
day(log_DATE) as `day`
FROM
cloudfront_logs_page;
INSERT INTO TABLE cloudfront_logs_event_part
PARTITION
(
`year`,
`month`,
`day`
)
SELECT
log_DATE,
user_id,
category,
action,
label,
value,
visitor_id,
ip,
session_id,
operating_sys,
extra_data_json,
year(log_DATE) as `year`,
month(log_DATE) as `month`,
day(log_DATE) as `day`
FROM
cloudfront_logs_event;
-- Athena tables
CREATE DATABASE IF NOT EXISTS test
LOCATION 's3://...';
DROP TABLE IF EXISTS test.cloudfront_logs_page_ath;
CREATE EXTERNAL TABLE IF NOT EXISTS powtoon_hive.cloudfront_logs_page_ath (
log_DATE STRING,
user_id STRING,
page_path STRING,
referer STRING,
tracking_referer STRING,
medium STRING,
campaign STRING,
source STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
ad_id STRING,
keyword STRING,
user_agent STRING
)
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://.../';
DROP TABLE IF EXISTS test.cloudfront_logs_event_ath;
CREATE EXTERNAL TABLE IF NOT EXISTS test.cloudfront_logs_event_ath
(
log_DATE STRING,
user_id STRING,
category STRING,
action STRING,
label STRING,
value STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
extra_data_json STRING
)
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://.../';
What can be wrong? Table structure? Athena metadata?
The easiest method would be to convert your raw files directly into a partitioned Parquet columnar format. This has the benefit of partitioning, columnar storage, predicate push-down and all those other fancy words.
See: Converting to Columnar Formats - Amazon Athena
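One reason delimited text is fragile compared with a columnar format can be sketched in a few lines of Python. This is only an assumption about what might cause the shift described in the question; the thread does not confirm it. The sample values are invented, and the point is simply that an unquoted comma inside a field moves every later value one column to the right.

```python
# Hypothetical log line: the referer value itself contains a comma.
columns = ["log_date", "user_id", "referer", "ip", "operating_sys"]
values = ["2017-01-01", "u1", "http://example.com/?a=1,2", "10.0.0.1", "Linux"]

line = ",".join(values)   # written as unquoted comma-delimited text
parsed = line.split(",")  # naive split on the field delimiter

# Values are assigned to columns by position; surplus values are dropped.
shifted = dict(zip(columns, parsed))
print(shifted["ip"])             # a fragment of the referer, not the IP
print(shifted["operating_sys"])  # the IP address ends up here
```

A columnar format such as Parquet stores each column separately with its own schema, so it does not depend on a delimiter being absent from the data.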

Hive select from table as complex type

Consider a base table employee and a table derived from it, employee_salary_period, which contains a column of the complex type map. How do I select data from employee and insert it into employee_salary_period, where salary_period_map is a key-value pair, i.e. salary: period?
CREATE TABLE employee(
emp_id bigint,
name string,
address string,
salary double,
period string,
position string
)
PARTITIONED BY (
dept_id bigint)
STORED AS PARQUET;
CREATE TABLE employee_salary_period(
emp_id bigint,
name string,
salary string,
period string,
salary_period_map Map<String,String>
)
PARTITIONED BY (
dept_id bigint)
STORED AS PARQUET;
I'm stuck trying to figure out how to select data as salary_period_map.
Consider using the str_to_map function provided by Hive. I assume you have only one key (salary) in your map.
select
emp_id,
name,
salary,
period,
str_to_map(concat(salary,":",period),'&',':') as salary_period_map
from employee;
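To see what the str_to_map call above produces, here is a rough Python model of its behaviour. It is a sketch, not Hive's implementation: it treats the delimiters as literal strings, and the salary and period values are hypothetical.

```python
def str_to_map(text, pair_delim, kv_delim):
    """Rough model of str_to_map(text, delimiter1, delimiter2).

    pair_delim separates key-value pairs, kv_delim separates a key from
    its value. Delimiters are treated as literal strings in this sketch.
    """
    result = {}
    for pair in text.split(pair_delim):
        key, _, value = pair.partition(kv_delim)
        result[key] = value
    return result

salary, period = "5000.0", "2016-Q1"  # hypothetical row values
salary_period_map = str_to_map(salary + ":" + period, "&", ":")
print(salary_period_map)
```

Since the concatenated string never contains '&', the result always has exactly one entry, with the salary as key and the period as value. A period value that itself contains ':' would be split at the first colon in this model.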

Hive, Bucketing for the partitioned table

This is my script:
--table without partition
drop table if exists ufodata;
create table ufodata ( sighted string, reported string, city string, shape string, duration string, description string )
row format delimited
fields terminated by '\t'
Location '/mapreduce/hive/ufo';
--load my data in ufodata
load data local inpath '/home/training/downloads/ufo_awesome.tsv' into table ufodata;
--create partition table
drop table if exists partufo;
create table partufo ( sighted string, reported string, city string, shape string, duration string, description string )
partitioned by ( year string )
clustered by (year) into 6 buckets
row format delimited
fields terminated by '\t';
--by default dynamic partition is not set
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--by default bucketing is false
set hive.enforce.bucketing=true;
--loading mydata
insert overwrite table partufo
partition (year)
select sighted, reported, city, shape, duration, description, SUBSTR(TRIM(sighted), 1,4) from ufodata;
Error message:
FAILED: Error in semantic analysis: Invalid column reference
I tried bucketing my partitioned table. If I remove "clustered by (year) into 6 buckets", the script works fine. How do I bucket the partitioned table?
There is an important thing to consider when bucketing in Hive.
The same column cannot be used for both bucketing and partitioning. The reason is as follows:
Clustering and sorting happen within a partition. Inside each partition there is only one value of the partition column (in your case, year), so clustering and sorting on it would have no effect. That is the reason for your error.
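The reasoning can be checked with a small Python sketch. The hash below is a stand-in chosen for illustration (Hive uses its own hash function), and the rows are invented. It shows that bucketing on the partition column puts every row of a partition into a single bucket, while bucketing on another column spreads them out.

```python
from collections import defaultdict

NUM_BUCKETS = 6

def bucket_of(value, num_buckets=NUM_BUCKETS):
    # Stand-in hash for illustration only; not Hive's actual hash function.
    return sum(value.encode("utf-8")) % num_buckets

# Hypothetical (year, mov_id) rows.
rows = [("1995", "m1"), ("1995", "m2"), ("1995", "m3"),
        ("2001", "m4"), ("2001", "m5")]

buckets_by_year = defaultdict(set)  # bucketing on the partition column
buckets_by_id = defaultdict(set)    # bucketing on a non-partition column
for year, mov_id in rows:
    buckets_by_year[year].add(bucket_of(year))
    buckets_by_id[year].add(bucket_of(mov_id))

for year in sorted(buckets_by_year):
    print(year, len(buckets_by_year[year]), len(buckets_by_id[year]))
```

Within one partition the year is constant, so any deterministic hash of it gives the same bucket for every row, and the other five buckets would stay empty.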
You can use the below syntax to create bucketing table with partition.
CREATE TABLE bckt_movies
(mov_id BIGINT , mov_name STRING ,prod_studio STRING, col_world DOUBLE , col_us_canada DOUBLE , col_uk DOUBLE , col_aus DOUBLE)
PARTITIONED BY (rel_year STRING)
CLUSTERED BY(mov_id) INTO 6 BUCKETS;
When you use dynamic partitioning, create a temporary table with all the columns (including your partition column) and load the data into it.
Then create the actual partitioned table with the partition column. When you load data from the temporary table, the partition column must come last in the select clause.

Hive: Fatal error when trying to create dynamic partitions

create table MY_DATA0(session_id STRING, userid BIGINT,date_time STRING, ip STRING, URL STRING ,country STRING, state STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
TERMINATED BY '\n' STORED AS TEXTFILE ;
LOAD DATA INPATH '/inputhive' OVERWRITE INTO TABLE MY_DATA0;
create table part0(session_id STRING, userid BIGINT,date_time STRING, ip STRING, URL STRING) partitioned by (country STRING, state STRING, city STRING)
clustered by (userid) into 256 buckets ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE ;
insert overwrite table part0 partition(country, state, city) select session_id, userid, date_time, ip, url, country, state, city from my_data0;
Overview of my dataset:
{60A191CB-B3CA-496E-B33B-0ACA551DD503},1331582487,2012-03-12 13:01:27,66.91.193.75,http://www.acme.com/SH55126545/VD55179433,United States,Hauula,Hawaii
{365CC356-7822-8A42-51D2-B6396F8FC5BF},1331584835,2012-03-12 13:40:35,173.172.214.24,http://www.acme.com/SH55126545/VD55179433,United States,El Paso,Texas
When I run the last insert script, I get the following error:
java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]:
Fatal error occurred when node tried to create too many dynamic
partitions. The maximum number of dynamic partitions is controlled by
hive.exec.max.dynamic.partitions and
hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
PS:
I have set these two properties:
hive.exec.dynamic.partition.mode::nonstrict
hive.enforce.bucketing::true
Try setting those properties to higher values.
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
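Before raising the limits, it can help to estimate how many partitions the insert would create: one per distinct combination of the partition columns. Here is a minimal Python sketch of that count, using invented sample rows and the limit of 100 reported in the error message.

```python
# Hypothetical rows: (session_id, country, state, city).
rows = [
    ("s1", "United States", "Hawaii", "Hauula"),
    ("s2", "United States", "Texas", "El Paso"),
    ("s3", "United States", "Texas", "El Paso"),
]
MAX_DYNAMIC_PARTITIONS = 100  # the limit reported in the error message

# One dynamic partition per distinct (country, state, city) combination.
partitions = {(country, state, city) for _, country, state, city in rows}
needed = len(partitions)
exceeds_limit = needed > MAX_DYNAMIC_PARTITIONS
print(needed, exceeds_limit)
```

With three partition levels down to city, real data can easily contain far more than 100 combinations, which would explain why the default limit is hit.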
Partition columns must come last in the select statement.
For example, if state is the partition column, then "insert into table t1 partition(state) select Id, name, dept, sal, state from t2;" will work. If instead the query is "insert into table t1 partition(state) select Id, name, dept, state, sal from t2;", the partitions will be created from the values of the salary (sal) column.
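The positional rule can be modelled in Python as follows. This is a sketch of the behaviour described above, not Hive code; the record values are invented and the helper name is hypothetical.

```python
def split_by_position(selected_values, num_partition_columns):
    """The last N selected values become partition values, whatever their names."""
    split = len(selected_values) - num_partition_columns
    return selected_values[:split], selected_values[split:]

record = {"Id": 7, "name": "Ann", "dept": "ops", "sal": 5000, "state": "TX"}

good_order = ["Id", "name", "dept", "sal", "state"]  # partition column last
bad_order = ["Id", "name", "dept", "state", "sal"]   # sal lands in last position

_, good_partition = split_by_position([record[c] for c in good_order], 1)
_, bad_partition = split_by_position([record[c] for c in bad_order], 1)
print(good_partition)  # partition value taken from state
print(bad_partition)   # partition value taken from sal
```

Column names and aliases in the select list play no part in the mapping, which is why the second query silently partitions by salary.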

Hive partition columns seem to prevent "select distinct"

I have created a table in Hive like this:
CREATE TABLE application_path
(userId STRING, sessId BIGINT, accesstime BIGINT, actionId STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE;
Running on this table the query:
SELECT DISTINCT userId FROM application_path;
gives the expected result:
user1#domain.com
user2#domain.com
user3#domain.com
...
Then I've changed the declaration to add a partition:
CREATE TABLE application_path
(sessId BIGINT, accesstime BIGINT, actionId STRING)
PARTITIONED BY(userId STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE;
Now the query SELECT DISTINCT userId... runs for a few seconds as before, but eventually returns nothing.
I've just noticed the syntax:
SHOW PARTITIONS application_path;
but I was wondering if that's the only way to get unique (distinct) values from a partitioning column. The output of SHOW PARTITIONS is not even an exact replacement for what you would get from SELECT DISTINCT, since the column name is prefixed to each row:
hive> show partitions application_path;
OK
userid=user1#domain.com
userid=user2#domain.com
userid=user3#domain.com
...
What's strange to me is that userId can be used in GROUP BY with other columns, as in:
SELECT userId, sessId FROM application_path GROUP BY userId, sessId;
but it does not return anything in:
SELECT userId FROM application_path GROUP BY userId;
I experienced the same issue. It will be fixed in Hive 0.10:
https://issues.apache.org/jira/browse/HIVE-2955
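Until that fix is available, the SHOW PARTITIONS output mentioned in the question can be post-processed to recover the distinct values. Here is a minimal Python sketch, assuming the output lines have already been captured as strings and that the values are not escaped. The function name is hypothetical.

```python
def partition_values(show_partitions_lines, column):
    """Extract distinct values of one partition column from SHOW PARTITIONS lines."""
    prefix = column.lower() + "="
    values = set()
    for line in show_partitions_lines:
        # Multi-level partitions are joined with "/", e.g. a=1/b=2.
        for part in line.strip().split("/"):
            if part.lower().startswith(prefix):
                values.add(part[len(prefix):])
    return sorted(values)

output = [
    "userid=user1#domain.com",
    "userid=user2#domain.com",
    "userid=user1#domain.com",
]
distinct_users = partition_values(output, "userId")
print(distinct_users)
```

The column name is compared case-insensitively because the output shown above prints userid in lower case even though the table declares userId.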
