Hive on AWS: convert S3 JSON to Columnar preserving partitions - hadoop

I have files in S3 that contain many lines of JSON (separated by newline). I want to convert these files to a Columnar Format for consumption by AWS Athena
I am following the Converting to Columnar Formats guide to do this, however when converted to ORC, the partition convention in S3 is lost.
In this example, how do you preserve the dt partition in the converted to parquet s3 folder structure? When I run the example it just outputs s3://myBucket/pq/000000_0 and NOT s3://myBucket/pq/dt=2009-04-14-04-05/000000_0
Here is the HQL that sets up interface to bring JSON into a Hive table:
CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string,
number string,
processId string,
browserCookie string,
requestEndTime string,
timers struct<modelLookup:string, requestTime:string>,
threadId string,
hostname string,
sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions' ;
msck repair table impressions;
Here is the HQL that converts to Parquet
CREATE EXTERNAL TABLE parquet_hive (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string)
STORED AS PARQUET
LOCATION 's3://mybucket/pq/';
INSERT OVERWRITE TABLE parquet_hive SELECT requestbegintime,adid,impressionid,referrer,useragent,usercookie,ip FROM impressions where dt='2009-04-14-04-05';

First of all, Add PARTITIONED BY (dt string) to parquet_hive definition.
Second -
If you want to insert the data, partition by partition, you have to declare the partition you are inserting into.
Note the PARTITION (dt='2009-04-14-04-05')
INSERT OVERWRITE TABLE parquet_hive PARTITION (dt='2009-04-14-04-05')
SELECT requestbegintime,adid,impressionid,referrer,useragent,usercookie,ip
FROM impressions where dt='2009-04-14-04-05'
;
An easier way would be to use dynamic partitioning.
Note the PARTITION (dt) and the dt as a last column in in the SELECT.
You might need to to set hive.exec.dynamic.partition.mode.
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE parquet_hive PARTITION (dt)
SELECT requestbegintime,adid,impressionid,referrer,useragent,usercookie,ip,dt
FROM impressions where dt='2009-04-14-04-05'
;
P.s.
CREATE EXTERNAL TABLE impressions does not "reads the JSON into a Hive table".
It is just an interface with the necessary information to read the HDFS files.
`

You can simply include the same PARTITIONED BY (dt string) parameter that is in your first statement, which will create the same directory structure.
In this case, the dt field (presumably, date) is actually stored in the directory name. A separate directory is created for each value.

Related

AWS Athena creates indentation and moves values into wrong columns after partitions loads

I encountered the following problem:
I created a Hive table in an EMR cluster in HDFS without partitions
and loaded a data to it.
I created another Hiva table based on the
table from the paragraph#1 but with partitions from the datetime
column: PARTITIONED BY (year STRING,month STRING,day STRING).
I loaded a data from the non partitioned table into partitioned table and get the valid result.
I created an Athena database and table with the same structure as Hive table.
I copied partitioned files from HDFS locally and by aws s3 sync transferred all files into S3 empty bucket. All files were transferred without error and with the same order as in Hive directory in HDFS.
I loaded partitions by MSCK REPAIR TABLE and didn't get any error in an output.
After that I found that many values got indentation, for example a value that need to be in the "IP" column was in "Operating_sys" column and etc.
My scripts are:
-- Hive tables
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_page_part
(
log_DATE STRING,
user_id STRING,
page_path STRING,
referer STRING,
tracking_referer STRING,
medium STRING,
campaign STRING,
source STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
ad_id STRING,
keyword STRING,
user_agent STRING
)
PARTITIONED BY
(
`year` STRING,
`month` STRING,
`day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/events_partitioned';
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_event_part
(
log_DATE STRING,
user_id STRING,
category STRING,
action STRING,
label STRING,
value STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
extra_data_json STRING
)
PARTITIONED BY
(
`year` STRING,
`month` STRING,
`day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/pages_partitioned';
INSERT INTO TABLE cloudfront_logs_page_part
PARTITION
(
`year`,
`month`,
`day`
)
SELECT
log_DATE,
user_id,
page_path,
referer,
tracking_referer,
medium,
campaign,
source,
visitor_id,
ip,
session_id,
operating_sys,
ad_id,
keyword,
user_agent,
year(log_DATE) as `year`,
month(log_DATE) as `month`,
day(log_DATE) as `day`
FROM
cloudfront_logs_page;
INSERT INTO TABLE cloudfront_logs_event_part
PARTITION
(
`year`,
`month`,
`day`
)
SELECT
log_DATE,
user_id,
category,
action,
label,
value,
visitor_id,
ip,
session_id,
operating_sys,
extra_data_json,
year(log_DATE) as `year`,
month(log_DATE) as `month`,
day(log_DATE) as `day`
FROM
cloudfront_logs_event;
-- Athena tables
CREATE DATABASE IF NOT EXISTS test
LOCATION 's3://...';
DROP TABLE IF EXISTS test.cloudfront_logs_page_ath;
CREATE EXTERNAL TABLE IF NOT EXISTS powtoon_hive.cloudfront_logs_page_ath (
log_DATE STRING,
user_id STRING,
page_path STRING,
referer STRING,
tracking_referer STRING,
medium STRING,
campaign STRING,
source STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
ad_id STRING,
keyword STRING,
user_agent STRING
)
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://.../';
DROP TABLE IF EXISTS test.cloudfront_logs_event_ath;
CREATE EXTERNAL TABLE IF NOT EXISTS test.cloudfront_logs_event_ath
(
log_DATE STRING,
user_id STRING,
category STRING,
action STRING,
label STRING,
value STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
extra_data_json STRING
)
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://.../';
What can be wrong? Table structure? Athena metadata?
The easiest method would be to convert your raw files directly into a partitioned Parquet columnar format. This has the benefit of partitioning, columnar storage, predicate push-down and all those other fancy words.
See: Converting to Columnar Formats - Amazon Athena

Provide a file name prefix in Hive INSERT

I have a Hive script script that moves data from DynamoDB into S3,
CREATE EXTERNAL TABLE ddb-table (hash_key string, sort_key string, value string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "ddb-table",
"dynamodb.column.mapping" = "hash:hash,range:range,data:data"
);
CREATE EXTERNAL TABLE s3-bucket (hash string, range string, data string)
PARTITIONED BY (hash_key STRING)
LOCATION 's3://some-bucket-name/';
INSERT OVERWRITE TABLE s3-bucket PARTITION (hash_key)
SELECT sort_key string, value string, hash_key string
FROM ddb-table;
However the I want to control the file name format. I want to use the hash_key as well as other values as the filenames prefix in S3. Is this possible?

Hive, Bucketing for the partitioned table

This is my script:
--table without partition
drop table if exists ufodata;
create table ufodata ( sighted string, reported string, city string, shape string, duration string, description string )
row format delimited
fields terminated by '\t'
Location '/mapreduce/hive/ufo';
--load my data in ufodata
load data local inpath '/home/training/downloads/ufo_awesome.tsv' into table ufodata;
--create partition table
drop table if exists partufo;
create table partufo ( sighted string, reported string, city string, shape string, duration string, description string )
partitioned by ( year string )
clustered by (year) into 6 buckets
row format delimited
fields terminated by '/t';
--by default dynamic partition is not set
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--by default bucketing is false
set hive.enforcebucketing=true;
--loading mydata
insert overwrite table partufo
partition (year)
select sighted, reported, city, shape, min, description, SUBSTR(TRIM(sighted), 1,4) from ufodata;
Error message:
FAILED: Error in semantic analysis: Invalid column reference
I tried bucketing for my partitioned table. If I remove "clustered by (year) into 6 buckets" the script works fine. How do I bucket the partitioned table
There is an important thing we should consider while doing bucketing in hive.
The same column name cannot be used for both bucketing and partitioning. The reason is as follows:
Clustering and Sorting happens within a partition. Inside each partition there will be only one value associated with the partition column(in your case it is year)therefore there will not any be any impact on clustering and sorting. That is the reason for your error....
You can use the below syntax to create bucketing table with partition.
CREATE TABLE bckt_movies
(mov_id BIGINT , mov_name STRING ,prod_studio STRING, col_world DOUBLE , col_us_canada DOUBLE , col_uk DOUBLE , col_aus DOUBLE)
PARTITIONED BY (rel_year STRING)
CLUSTERED BY(mov_id) INTO 6 BUCKETS;
when you're doing dynamic partition, create a temporary table with all the columns (including your partitioned column) and load data into temporary table.
create actual partitioned table with partition column. While you are loading data from temporary table the partitioned column should be in the last in the select clause.

AWS EMR Hive partitioning unable to recognize any type of partitions

I am trying to process some log files on a bucket in amazon s3.
I create the table :
CREATE EXTERNAL TABLE apiReleaseData2 (
messageId string, hostName string, timestamp string, macAddress string DISTINCT, apiKey string,
userAccountId string, userAccountEmail string, numFiles string)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='messageId, hostName, timestamp, macAddress, apiKey, userAccountId, userAccountEmail, numFiles')
LOCATION 's3://apireleasecandidate1/regression/transferstatistics/2013/12/31/';
Then I run the following HiveQL statement and get my desired output in the file without any issues. My directories are setup in the following manner :
s3://apireleasecandidate1/regression/transferstatistics/2013/12/31/ < All the log files for this day >
What I want to do is that I specify the LOCATION up to the 's3://apireleasecandidate1/regression/transferstatistics/' and then call the
ALTER TABLE <Table Name> ADD PARTITION (<path>)
statement or the
ALTER TABLE <Table Name> RECOVER PARTITIONS ;
statement to access the files in the subdirectories. But when I do this there is no data in my table.
I tried the following :
CREATE EXTERNAL TABLE apiReleaseDataUsingPartitions (
messageId string, hostName string, timestamp string, macAddress string, apiKey string,
userAccountId string, userAccountEmail string, numFiles string)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='messageId, hostName, timestamp, macAddress, apiKey, userAccountId, userAccountEmail, numFiles')
LOCATION 's3://apireleasecandidate1/regression/transferstatistics/';
and then I run the following ALTER command :
ALTER TABLE apiReleaseDataUsingPartitions ADD PARTITION (year='2013', month='12', day='31');
But running the Select statement on the table gives out no results.
Can someone please guide me what I am doing wrong ?
Am I missing something Important ?
Cheers
Tanzeel
In HDFS anyway, the partitions manifest in a key/value format like this:
hdfs://apireleasecandidate1/regression/transferstatistics/year=2013/month=12/day=31
I can't vouch for S3 but an easy way to check would be to write some data into a dummy partition and see where it creates the file.
ADD PARTITION supports an optional LOCATION parameter, so you might be able to deal with this by saying
ALTER TABLE apiReleaseDataUsingPartitions ADD PARTITION (year='2013', month='12', day='31') LOCATION 's3://apireleasecandidate1/regression/transferstatistics/2013/12/31/';
Again I've not dealt with S3 but would be interested to hear if this works for you.

Elastic Map Reduce JSON export to DynamoDB error AttributeValue may not contain an empty string

I'm trying to import data using an EMR job from JSON files in S3 that contain sparse fields e.g. an ios_os field and android_os but only one contains data. Sometimes the data is null and sometimes it's an empty string, when trying to insert into DynamoDB I'm getting an error (although I am able to insert some records that are sparsely populated):
"AttributeValue may not contain an empty string"
{"created_at_timestamp":1358122714,...,"data":null,"type":"e","android_network_carrier":""}
I filtered out the columns that had the empty string "", but I'm still getting that error. I'm assuming it's the "property":null values that are causing this (or both). I assume that for it to work properly those values shouldn't exist when going to DynamoDB?
Is there any way to tell Hive through the JSONSerde or Hive's interaction with the DynamoDB table to ignore empty string attribute values.
Here's an example of the Hive SQL schema and insert command:
CREATE EXTERNAL TABLE IF NOT EXISTS json_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
PARTITIONED BY (created_at BIGINT, type STRING)
ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
WITH SERDEPROPERTIES (
-- Common
"created_at"="$.created_at",
"data"="$.data",
"android_network_carrier"="$.anw",
"type"="$.dt"
)
LOCATION s3://test.data/json_events;
CREATE EXTERNAL TABLE IF NOT EXISTS dynamo_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test-events",
"dynamodb.column.mapping" = "created_at:created_at,data:data,type:type,android_network_carrier:android_network_carrier");
ALTER TABLE json_events RECOVER PARTITIONS;
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e';
The nulls shouldn't be a problem as long as it's not for the primary key.
However, DynamoDB does not allow empty strings nor empty sets as described in the data model.
To work around this, I think you have a couple options:
Define a constant for empty strings like "n/a", and make sure that your data extraction processes treats missing values as such.
You could also filter these records, but that will mean losing data. This could be done like this:
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e' AND android_network_carrier != "";

Resources