Partition a hive table with a column in middle from another external table - hadoop

I have created an external table as below:
create external table if not exists complaints (date_received string, product string, sub_product string, issue string, sub_issue string, consumer_complaint_narrative string, state string, company_public_response string, company varchar(50), zipcode int, tags string, consumer_consent_provided string, submitted_via string, date_sent_company string, company_response string, timely_response string, consumer_disputed string, complaint_id int) row format delimited fields terminated by ',' stored as textfile location 'hdfs:hostname:8020/complaints/';
Now I want to create another table complaints_new partitioned by state and containing all the data from the table above. How can this be achieved?
I tried the below:
create external table if not exists complaints_new (date_received string, product string, sub_product string, issue string, sub_issue string, consumer_complaint_narrative string, company_public_response string, company varchar(50), zipcode int, tags string, consumer_consent_provided string, submitted_via string, date_sent_company string, company_response string, timely_response string, consumer_disputed string, complaint_id int) partitioned by (state varchar(20)) row format delimited fields terminated by ',' stored as textfile location 'hdfs://hostname:8020/complaints/';
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.mapred.mode = nonstrict;
insert into table complaints_new partition(state) select * from complaints;
The query is failing.

You have a few problems here. First, both tables point to the same location, which means you would be reading from and overwriting that location at the same time. Second, Hive expects the partition column to be the last element in the select list, so you cannot do select *; instead you have to list the columns explicitly and put state at the end of your select statement.
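As a sketch (assuming complaints_new points at its own directory, for example a hypothetical hdfs://hostname:8020/complaints_new/, rather than the same location as complaints), the insert would look roughly like this:
-- complaints_new must use a different LOCATION than complaints,
-- e.g. 'hdfs://hostname:8020/complaints_new/' (placeholder path)
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
insert into table complaints_new partition(state)
select date_received, product, sub_product, issue, sub_issue, consumer_complaint_narrative,
       company_public_response, company, zipcode, tags, consumer_consent_provided,
       submitted_via, date_sent_company, company_response, timely_response,
       consumer_disputed, complaint_id,
       state   -- partition column goes last
from complaints;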

Related

AWS Athena creates indentation and moves values into wrong columns after partitions loads

I encountered the following problem:
I created a Hive table in an EMR cluster on HDFS without partitions and loaded data into it.
I created another Hive table based on the table from step 1, but partitioned on the datetime column: PARTITIONED BY (year STRING, month STRING, day STRING).
I loaded the data from the non-partitioned table into the partitioned table and got a valid result.
I created an Athena database and table with the same structure as the Hive table.
I copied the partitioned files from HDFS to the local filesystem and transferred them into an empty S3 bucket with aws s3 sync. All files were transferred without errors and with the same layout as in the Hive directory on HDFS.
I loaded the partitions with MSCK REPAIR TABLE and didn't get any errors in the output.
After that I found that many values had shifted into the wrong columns; for example, a value that should be in the "IP" column ended up in the "Operating_sys" column, and so on.
My scripts are:
-- Hive tables
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_page_part
(
log_DATE STRING,
user_id STRING,
page_path STRING,
referer STRING,
tracking_referer STRING,
medium STRING,
campaign STRING,
source STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
ad_id STRING,
keyword STRING,
user_agent STRING
)
PARTITIONED BY
(
`year` STRING,
`month` STRING,
`day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/events_partitioned';
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_event_part
(
log_DATE STRING,
user_id STRING,
category STRING,
action STRING,
label STRING,
value STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
extra_data_json STRING
)
PARTITIONED BY
(
`year` STRING,
`month` STRING,
`day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/pages_partitioned';
INSERT INTO TABLE cloudfront_logs_page_part
PARTITION
(
`year`,
`month`,
`day`
)
SELECT
log_DATE,
user_id,
page_path,
referer,
tracking_referer,
medium,
campaign,
source,
visitor_id,
ip,
session_id,
operating_sys,
ad_id,
keyword,
user_agent,
year(log_DATE) as `year`,
month(log_DATE) as `month`,
day(log_DATE) as `day`
FROM
cloudfront_logs_page;
INSERT INTO TABLE cloudfront_logs_event_part
PARTITION
(
`year`,
`month`,
`day`
)
SELECT
log_DATE,
user_id,
category,
action,
label,
value,
visitor_id,
ip,
session_id,
operating_sys,
extra_data_json,
year(log_DATE) as `year`,
month(log_DATE) as `month`,
day(log_DATE) as `day`
FROM
cloudfront_logs_event;
-- Athena tables
CREATE DATABASE IF NOT EXISTS test
LOCATION 's3://...';
DROP TABLE IF EXISTS test.cloudfront_logs_page_ath;
CREATE EXTERNAL TABLE IF NOT EXISTS powtoon_hive.cloudfront_logs_page_ath (
log_DATE STRING,
user_id STRING,
page_path STRING,
referer STRING,
tracking_referer STRING,
medium STRING,
campaign STRING,
source STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
ad_id STRING,
keyword STRING,
user_agent STRING
)
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://.../';
DROP TABLE IF EXISTS test.cloudfront_logs_event_ath;
CREATE EXTERNAL TABLE IF NOT EXISTS test.cloudfront_logs_event_ath
(
log_DATE STRING,
user_id STRING,
category STRING,
action STRING,
label STRING,
value STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
extra_data_json STRING
)
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://.../';
What can be wrong? Table structure? Athena metadata?
The easiest method would be to convert your raw files directly into a partitioned Parquet columnar format. This has the benefit of partitioning, columnar storage, predicate push-down and all those other fancy words.
See: Converting to Columnar Formats - Amazon Athena
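For example (a sketch, assuming you convert on the EMR side from the original unpartitioned table; the S3 location is a placeholder), you could write the partitioned data as Parquet in Hive and point Athena at the result:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_page_parquet
(
log_DATE STRING,
user_id STRING,
page_path STRING,
referer STRING,
tracking_referer STRING,
medium STRING,
campaign STRING,
source STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
ad_id STRING,
keyword STRING,
user_agent STRING
)
PARTITIONED BY (`year` STRING, `month` STRING, `day` STRING)
STORED AS PARQUET
LOCATION 's3://your-bucket/events_parquet/';  -- placeholder path
INSERT INTO TABLE cloudfront_logs_page_parquet PARTITION (`year`, `month`, `day`)
SELECT log_DATE, user_id, page_path, referer, tracking_referer, medium, campaign, source,
       visitor_id, ip, session_id, operating_sys, ad_id, keyword, user_agent,
       year(log_DATE) AS `year`, month(log_DATE) AS `month`, day(log_DATE) AS `day`
FROM cloudfront_logs_page;
On the Athena side, define the table with STORED AS PARQUET (instead of ROW FORMAT DELIMITED) and run MSCK REPAIR TABLE as before.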

Unable to load text data into Hive table as ORC through temporary Hive table

I want to load a .csv file into a Hive table stored as ORC. I came across a post suggesting a workaround, so I executed the queries below:
1) Creating and loading data as a text file into a temporary table:
CREATE TABLE IF NOT EXISTS CrimesData( ID int, Case_Number int, CrimeDate string, Block string , IUCR string,Primary_Type string, Description string, Location_Description string, Arrest string, Domestic string, Beat int, District int, Ward int, Community_Area int, FBI_Code string, X_Coordinate int, Y_Coordinate int, Year int, Updated_On string, Latitude decimal(10,10), Longitude decimal(10,10), CrimeLocation string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '"' LINES TERMINATED BY '\n'
tblproperties("skip.header.line.count"="1");
LOAD DATA LOCAL INPATH '/home/cloudera/Documents/CrimesData.csv' INTO TABLE CrimesData;
2) Creating a new table and specifying ORC data as the source:
CREATE TABLE IF NOT EXISTS CrimesDataORC( ID int, Case_Number int, CrimeDate string, Block string , IUCR string,Primary_Type string, Description string, Location_Description string, Arrest string, Domestic string, Beat int, District int, Ward int, Community_Area int, FBI_Code string, X_Coordinate int, Y_Coordinate int, Year int, Updated_On string, Latitude decimal(10,10), Longitude decimal(10,10), CrimeLocation string)
STORED AS ORC;
3) Insert data into the new table from temporary table:
INSERT INTO TABLE CrimesDataORC SELECT * FROM CrimesData;
The first two steps execute without any error but the step 3 throws the following error:
Error while processing statement: FAILED: Execution Error, return code
2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
I am running the above queries on Cloudera Manager Quickstart VM 5.8.
Not sure where I am going wrong, as similar steps for another table in the same database work as expected.
It is probably data that does not comply with the table structure. Try adding WHERE conditions to the SELECT statement to insert smaller subsets and narrow down the offending rows, rather than inserting all the data at once.
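For example (a sketch; the Year filter is only an illustration), insert one slice at a time and see which slice triggers the failure:
-- try a small sample first
INSERT INTO TABLE CrimesDataORC SELECT * FROM CrimesData LIMIT 100;
-- then one slice at a time, e.g. by year, to locate the offending rows
INSERT INTO TABLE CrimesDataORC SELECT * FROM CrimesData WHERE Year = 2015;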

Query on Bucketized Table

I created a bucketed table as follows:
drop table if exists bi_st.st_usr_member_active_day_test;
CREATE TABLE `bi_st.st_usr_member_active_day_test`(
`cal_dt_from` string,
`cal_dt_to` string,
`memberid` string,
`vipcode` string,
`vipleavel` string,
`cityid` string,
`cityname` string,
`groupid` int,
`groupname` string,
`storeid` int,
`storename` string,
`sectionid` int,
`sectionname` string,
`promotionid` string,
`promotionname` string,
`moduleid` string,
`modulename` string,
`activeness_today` string,
`new_vip_class` string
)
clustered by (storeid) into 2 buckets
row format delimited fields terminated by '\t'
stored as orc TBLPROPERTIES('transactional'='true');
I then inserted some data into it and ran
select * from bi_st.st_usr_member_active_day_test where storeid = 193;
It failed with an ArrayIndexOutOfBoundsException. Can anybody explain this? Thanks

Provide a file name prefix in Hive INSERT

I have a Hive script that moves data from DynamoDB into S3:
CREATE EXTERNAL TABLE ddb-table (hash_key string, sort_key string, value string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "ddb-table",
"dynamodb.column.mapping" = "hash:hash,range:range,data:data"
);
CREATE EXTERNAL TABLE s3-bucket (hash string, range string, data string)
PARTITIONED BY (hash_key STRING)
LOCATION 's3://some-bucket-name/';
INSERT OVERWRITE TABLE s3-bucket PARTITION (hash_key)
SELECT sort_key, value, hash_key
FROM ddb-table;
However, I want to control the file name format. I want to use the hash_key as well as other values as the file name prefix in S3. Is this possible?

Amazon EMR job with multiple input parameters

In Amazon Data Pipeline, I am creating an activity to copy data from S3 to EMR using Hive.
To achieve this I have to pass two input parameters to the EMR job as a step.
I have searched almost all of the Data Pipeline documentation but did not find a way to specify multiple input parameters.
I also talked to the AWS support team, but they were not clear about it either, and the workaround they suggested did not work.
Below are my step arguments and Hive query. Please let me know if anyone has an idea how to achieve this.
Steps:
s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://us-east-1.elasticmapreduce/libs/hive/hive-script,--base-path,s3://us-east-1.elasticmapreduce/libs/hive/,--hive-versions,latest,--run-hive-script,--args,-f,s3://gwbpipeline-test/scripts/multiple_user_sample_new.hql, -d, "output1=#{output.directoryPath}", -d,"input1=s3://gwbpipeline-test/temp/sb-test/#{format(#scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_users/", -d,"input2=s3://gwbpipeline-test/temp/sb-test/#{format(#scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_user_children/"
Hive Query:
drop table if exists tbl_users;
CREATE EXTERNAL TABLE tbl_users (
user_id string, user_first_name string, user_last_name string, user_email string, user_dob string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${input1}';
drop table if exists tbl_user_children;
CREATE EXTERNAL TABLE tbl_user_children (
id string, full_name string, birthday string, type string, user_id string, facebook_id string, date_added string
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${input2}';
drop table if exists tbl_users_child_output;
CREATE EXTERNAL TABLE userS3output (
user_id string, user_fname string, user_lname string, child_full_name string, child_dirthdate string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${output1}';
INSERT INTO TABLE tbl_users_child_output SELECT u.user_id, u.user_first_name, u.user_last_name, c.full_name, c.birthday FROM tbl_users as u join tbl_user_children as c ON u.user_id = c.user_id;
I was able to get this to work using the following format in the step field of EMRActivity:
Basically, I changed -d to -hiveconf, and changed the variable substitution in the Hive script from ${input1} to ${hiveconf:input1}. I think this is a change made in a newer version of Hive.
Below is the changed, working code:
s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,s3://us-east-1.elasticmapreduce/libs/hive/hive-script,--base-path,s3://us-east-1.elasticmapreduce/libs/hive/,--hive-versions,latest,--run-hive-script,--args,-f,s3://gwbpipeline-test/scripts/multiple_user_sample_new.hql, -hiveconf, "output1=#{output.directoryPath}", -hiveconf,"input1=s3://gwbpipeline-test/temp/sb-test/#{format(#scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_users/", -hiveconf,"input2=s3://gwbpipeline-test/temp/sb-test/#{format(#scheduledStartTime,'YYYY-MM-dd hh-mm-ss')}/input/tbl_user_children/"
HIVE Query:
drop table if exists tbl_users;
CREATE EXTERNAL TABLE tbl_users (
user_id string, user_first_name string, user_last_name string, user_email string, user_dob string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hiveconf:input1}';
drop table if exists tbl_user_children;
CREATE EXTERNAL TABLE tbl_user_children (
id string, full_name string, birthday string, type string, user_id string, facebook_id string, date_added string
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hiveconf:input2}';
drop table if exists tbl_users_child_output;
CREATE EXTERNAL TABLE userS3output (
user_id string, user_fname string, user_lname string, child_full_name string, child_dirthdate string )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${hiveconf:output1}';
INSERT INTO TABLE tbl_users_child_output SELECT u.user_id, u.user_first_name, u.user_last_name, c.full_name, c.birthday FROM tbl_users as u join tbl_user_children as c ON u.user_id = c.user_id;
Hope this helps someone.
