My issue is that I have tried this on my local machine with Hadoop and also checked on AWS EC2, and no records are returned by the query below. As far as I can tell, the script below is correct.
My question is: why don't we see any results in the part file after the job is complete?
DROP TABLE IF EXISTS batting;
CREATE EXTERNAL TABLE IF NOT EXISTS batting(id STRING, year INT, team STRING,
league STRING, games INT, ab INT, runs INT, hits INT, doubles INT, triples
INT, homeruns INT, rbi INT, sb INT, cs INT, walks INT, strikeouts INT, ibb
INT, hbp INT, sh INT, sf INT, gidp INT) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LOCATION 's3://hive-test1/batting';
DROP TABLE IF EXISTS master;
CREATE EXTERNAL TABLE IF NOT EXISTS master(id STRING, byear INT, bmonth INT,
bday INT, bcountry STRING, bstate STRING, bcity STRING, dyear INT, dmonth
INT, dday INT, dcountry STRING, dstate STRING, dcity STRING, fname STRING,
lname STRING, name STRING, weight INT, height INT, bats STRING, throws
STRING, debut STRING, finalgame STRING, retro STRING, bbref STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION
's3://hive-test1/master';
INSERT OVERWRITE DIRECTORY 's3://hive-test1/output' SELECT n.fname,
n.lname, x.year, x.runs FROM master n JOIN (SELECT b.id as id, b.year as
year, b.runs as runs FROM batting b JOIN (SELECT year, max(runs) AS best FROM
batting GROUP BY year) o WHERE b.runs=o.best AND b.year=o.year) x ON
x.id=n.id ORDER BY x.runs DESC;
When you use Hive to create the two tables, all you're doing is defining their structure: the table name, the fields and their types, the location, and so on. CREATE does nothing with the data itself.
Based on your similar question earlier, I think you have some existing HDFS files in CSV format that contain the data you want to query, right?
Before doing that, I suggest you manually insert a record into each table, like INSERT INTO batting (id, year, team, league) VALUES ('1', 2016, 'Red Sox', 'AL East');. Then query the table with SELECT * FROM batting; to confirm you have one record with some values in it.
Now you have the next problem to solve: how do I import an HDFS file to a Hive table? You can do this using Hue, if you have it installed. If not, I suggest you use Google to find an answer to this question.
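If you'd rather do it straight from the Hive shell, LOAD DATA is the usual route. A minimal sketch, with a hypothetical HDFS path (and since these are EXTERNAL tables, simply copying the CSV into the table's LOCATION directory achieves the same thing):
LOAD DATA INPATH '/user/hadoop/Batting.csv' INTO TABLE batting;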
In general, you have three problems to solve:
Create the tables in Hive so the Hive metastore knows about their structure. This is called data definition language, or DDL, in SQL.
Import or link your existing CSV data sets, sitting as files on HDFS, to their corresponding Hive tables.
Query the tables using SQL, most likely with SELECT and JOIN. This is called data manipulation language, or DML, in SQL.
Each is a different step. Make them work one by one, and you'll break a complex problem down into smaller problems that are easier to understand.
Related
I encountered the following problem:
I created a Hive table in an EMR cluster in HDFS without partitions and loaded data into it.
I created another Hive table based on the table from paragraph #1, but with partitions derived from the datetime column: PARTITIONED BY (year STRING, month STRING, day STRING).
I loaded data from the non-partitioned table into the partitioned table and got a valid result.
I created an Athena database and table with the same structure as Hive table.
I copied the partitioned files from HDFS to a local directory and transferred all of them into an empty S3 bucket with aws s3 sync. All files were transferred without errors and with the same layout as in the Hive directory in HDFS.
I loaded the partitions with MSCK REPAIR TABLE and didn't get any errors in the output.
After that I found that many values had shifted into the wrong columns; for example, a value that should be in the "ip" column ended up in the "operating_sys" column, and so on.
My scripts are:
-- Hive tables
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_page_part
(
log_DATE STRING,
user_id STRING,
page_path STRING,
referer STRING,
tracking_referer STRING,
medium STRING,
campaign STRING,
source STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
ad_id STRING,
keyword STRING,
user_agent STRING
)
PARTITIONED BY
(
`year` STRING,
`month` STRING,
`day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/events_partitioned';
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_event_part
(
log_DATE STRING,
user_id STRING,
category STRING,
action STRING,
label STRING,
value STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
extra_data_json STRING
)
PARTITIONED BY
(
`year` STRING,
`month` STRING,
`day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/pages_partitioned';
INSERT INTO TABLE cloudfront_logs_page_part
PARTITION
(
`year`,
`month`,
`day`
)
SELECT
log_DATE,
user_id,
page_path,
referer,
tracking_referer,
medium,
campaign,
source,
visitor_id,
ip,
session_id,
operating_sys,
ad_id,
keyword,
user_agent,
year(log_DATE) as `year`,
month(log_DATE) as `month`,
day(log_DATE) as `day`
FROM
cloudfront_logs_page;
INSERT INTO TABLE cloudfront_logs_event_part
PARTITION
(
`year`,
`month`,
`day`
)
SELECT
log_DATE,
user_id,
category,
action,
label,
value,
visitor_id,
ip,
session_id,
operating_sys,
extra_data_json,
year(log_DATE) as `year`,
month(log_DATE) as `month`,
day(log_DATE) as `day`
FROM
cloudfront_logs_event;
-- Athena tables
CREATE DATABASE IF NOT EXISTS test
LOCATION 's3://...';
DROP TABLE IF EXISTS test.cloudfront_logs_page_ath;
CREATE EXTERNAL TABLE IF NOT EXISTS powtoon_hive.cloudfront_logs_page_ath (
log_DATE STRING,
user_id STRING,
page_path STRING,
referer STRING,
tracking_referer STRING,
medium STRING,
campaign STRING,
source STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
ad_id STRING,
keyword STRING,
user_agent STRING
)
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://.../';
DROP TABLE IF EXISTS test.cloudfront_logs_event_ath;
CREATE EXTERNAL TABLE IF NOT EXISTS test.cloudfront_logs_event_ath
(
log_DATE STRING,
user_id STRING,
category STRING,
action STRING,
label STRING,
value STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
extra_data_json STRING
)
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://.../';
What can be wrong? Table structure? Athena metadata?
The easiest method would be to convert your raw files directly into a partitioned Parquet columnar format. This has the benefit of partitioning, columnar storage, predicate push-down and all those other fancy words.
See: Converting to Columnar Formats - Amazon Athena
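For instance, a minimal sketch of that conversion for the page logs, assuming a hypothetical cloudfront_logs_page_parquet table and S3 path (adjust the names and bucket to your setup); it mirrors your existing dynamic-partition insert but writes Parquet instead of delimited text:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_page_parquet
(
log_DATE STRING,
user_id STRING,
page_path STRING,
referer STRING,
tracking_referer STRING,
medium STRING,
campaign STRING,
source STRING,
visitor_id STRING,
ip STRING,
session_id STRING,
operating_sys STRING,
ad_id STRING,
keyword STRING,
user_agent STRING
)
PARTITIONED BY (`year` STRING, `month` STRING, `day` STRING)
STORED AS PARQUET
LOCATION 's3://your-bucket/cloudfront_logs_page_parquet/';
INSERT INTO TABLE cloudfront_logs_page_parquet PARTITION (`year`, `month`, `day`)
SELECT
log_DATE, user_id, page_path, referer, tracking_referer, medium, campaign,
source, visitor_id, ip, session_id, operating_sys, ad_id, keyword, user_agent,
year(log_DATE) as `year`,
month(log_DATE) as `month`,
day(log_DATE) as `day`
FROM
cloudfront_logs_page;
An Athena table with the same columns, STORED AS PARQUET and pointing at the same S3 location, then picks up the partitions via MSCK REPAIR TABLE, as described in the guide above.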
I have created an external table as below:
create external table if not exists complaints (date_received string, product string, sub_product string, issue string, sub_issue string, consumer_complaint_narrative string, state string, company_public_response string, company varchar(50), zipcode int, tags string, consumer_consent_provided string, submitted_via string, date_sent_company string, company_response string, timely_response string, consumer_disputed string, complaint_id int) row format delimited fields terminated by ',' stored as textfile location 'hdfs://hostname:8020/complaints/';
Now I want to create another table, complaints_new, partitioned by state and containing all the data from the above table. How can this be achieved?
I tried the below:
create external table if not exists complaints_new (date_received string, product string, sub_product string, issue string, sub_issue string, consumer_complaint_narrative string, company_public_response string, company varchar(50), zipcode int, tags string, consumer_consent_provided string, submitted_via string, date_sent_company string, company_response string, timely_response string, consumer_disputed string, complaint_id int) partitioned by (state varchar(20)) row format delimited fields terminated by ',' stored as textfile location 'hdfs://hostname:8020/complaints/';
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.mapred.mode = nonstrict;
insert into table complaints_new partition(state) select * from complaints;
The query is failing.
You have a few problems here. First, you are pointing both tables to the same location, which means you would be reading from and overwriting the same data. Second, Hive expects the partition column to be the last element in your select list, which means you cannot do SELECT *; instead, you have to select the fields explicitly and put state at the end of your SELECT statement.
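A minimal sketch of both fixes together (the complaints_new location here is hypothetical; note that state is selected explicitly and placed last):
create external table if not exists complaints_new (date_received string, product string, sub_product string, issue string, sub_issue string, consumer_complaint_narrative string, company_public_response string, company varchar(50), zipcode int, tags string, consumer_consent_provided string, submitted_via string, date_sent_company string, company_response string, timely_response string, consumer_disputed string, complaint_id int) partitioned by (state varchar(20)) row format delimited fields terminated by ',' stored as textfile location 'hdfs://hostname:8020/complaints_new/';
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
insert into table complaints_new partition(state) select date_received, product, sub_product, issue, sub_issue, consumer_complaint_narrative, company_public_response, company, zipcode, tags, consumer_consent_provided, submitted_via, date_sent_company, company_response, timely_response, consumer_disputed, complaint_id, state from complaints;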
In Hive:
I am trying to merge data from different tables with different schemas in Hive (the only difference is that additional columns with data may be present).
table A:
id int, name string, v1 string, v2 string
table B:
id int, name string, v1 String
table C:
id int, name string, v1 string, v2 string, v3 string, v4 string
I need to merge data from all these tables into a new 'table D' with
table D:
id int, name string, v1 string, v2 string, v3 string, v4 string
How can I insert data from each source table based on a condition on the table's schema: if a column exists, insert that column's value, otherwise insert NULL?
I have written a query something like in a script to loop through all tables:
insert into table D select id, name, v1,v2,v3,v4 from ${hiveconf:TABLE_NAME};
This will fail because it cannot find the v3 and v4 columns, which are not present in table A.
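For a single table the workaround itself is simple; a minimal sketch for table B, padding the missing columns with NULL:
insert into table D select id, name, v1, NULL as v2, NULL as v3, NULL as v4 from B;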
As I have to select from thousands of different tables, I would like to add condition-based column selection dynamically in a script (the tables can have a varying number of columns among v1 to v4), since it is not good practice to write and execute thousands of individual INSERT INTO statements.
How can we achieve this in SQL, or in a Hive-specific way?
I want to load a .csv file into a Hive table as an ORC file. I came across one post which suggested a workaround, so I executed the queries below:
1) Creating and loading data as a text file into a temporary table:
CREATE TABLE IF NOT EXISTS CrimesData( ID int, Case_Number int, CrimeDate string, Block string , IUCR string,Primary_Type string, Description string, Location_Description string, Arrest string, Domestic string, Beat int, District int, Ward int, Community_Area int, FBI_Code string, X_Coordinate int, Y_Coordinate int, Year int, Updated_On string, Latitude decimal(10,10), Longitude decimal(10,10), CrimeLocation string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '"' LINES TERMINATED BY '\n'
tblproperties("skip.header.line.count"="1");
LOAD DATA LOCAL INPATH '/home/cloudera/Documents/CrimesData.csv' INTO TABLE CrimesData;
2) Creating a new table and specifying ORC data as the source:
CREATE TABLE IF NOT EXISTS CrimesDataORC( ID int, Case_Number int, CrimeDate string, Block string , IUCR string,Primary_Type string, Description string, Location_Description string, Arrest string, Domestic string, Beat int, District int, Ward int, Community_Area int, FBI_Code string, X_Coordinate int, Y_Coordinate int, Year int, Updated_On string, Latitude decimal(10,10), Longitude decimal(10,10), CrimeLocation string)
STORED AS ORC;
3) Insert data into the new table from temporary table:
INSERT INTO TABLE CrimesDataORC SELECT * FROM CrimesData;
The first two steps execute without any error but the step 3 throws the following error:
Error while processing statement: FAILED: Execution Error, return code
2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
I am running the above queries on Cloudera Manager Quickstart VM 5.8.
Not sure where I am going wrong, as similar steps for another table in the same database work as expected.
It might be some kind of non-compliance between the data and the table structure. Try adding some WHERE conditions to the SELECT statement to check, rather than inserting all the data at once.
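For example, a minimal sketch (the Year filter is only illustrative; use whatever predicate lets you bisect the data and isolate the rows that break the conversion):
INSERT INTO TABLE CrimesDataORC SELECT * FROM CrimesData WHERE Year = 2016;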
I have files in S3 that contain many lines of JSON (separated by newline). I want to convert these files to a Columnar Format for consumption by AWS Athena
I am following the Converting to Columnar Formats guide to do this; however, when converted to ORC, the partition convention in S3 is lost.
In this example, how do you preserve the dt partition in the converted Parquet S3 folder structure? When I run the example it just outputs s3://myBucket/pq/000000_0 and NOT s3://myBucket/pq/dt=2009-04-14-04-05/000000_0.
Here is the HQL that sets up the interface to bring the JSON into a Hive table:
CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string,
number string,
processId string,
browserCookie string,
requestEndTime string,
timers struct<modelLookup:string, requestTime:string>,
threadId string,
hostname string,
sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions' ;
msck repair table impressions;
Here is the HQL that converts to Parquet:
CREATE EXTERNAL TABLE parquet_hive (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string)
STORED AS PARQUET
LOCATION 's3://mybucket/pq/';
INSERT OVERWRITE TABLE parquet_hive SELECT requestbegintime,adid,impressionid,referrer,useragent,usercookie,ip FROM impressions where dt='2009-04-14-04-05';
First of all, add PARTITIONED BY (dt string) to the parquet_hive definition.
Second -
If you want to insert the data, partition by partition, you have to declare the partition you are inserting into.
Note the PARTITION (dt='2009-04-14-04-05')
INSERT OVERWRITE TABLE parquet_hive PARTITION (dt='2009-04-14-04-05')
SELECT requestbegintime,adid,impressionid,referrer,useragent,usercookie,ip
FROM impressions where dt='2009-04-14-04-05'
;
An easier way would be to use dynamic partitioning.
Note the PARTITION (dt) and the dt as the last column in the SELECT.
You might need to set hive.exec.dynamic.partition.mode.
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE parquet_hive PARTITION (dt)
SELECT requestbegintime,adid,impressionid,referrer,useragent,usercookie,ip,dt
FROM impressions where dt='2009-04-14-04-05'
;
P.s.
CREATE EXTERNAL TABLE impressions does not "read the JSON into a Hive table".
It is just an interface with the necessary information to read the HDFS files.
You can simply include the same PARTITIONED BY (dt string) parameter that is in your first statement, which will create the same directory structure.
In this case, the dt field (presumably, date) is actually stored in the directory name. A separate directory is created for each value.
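A minimal sketch of the amended definition (same columns as before, with dt declared as a partition column rather than a regular one; it assumes you drop and recreate the table):
DROP TABLE IF EXISTS parquet_hive;
CREATE EXTERNAL TABLE parquet_hive (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://mybucket/pq/';
The static or dynamic INSERT OVERWRITE shown in the first answer then writes to s3://mybucket/pq/dt=2009-04-14-04-05/ instead of placing the files directly under s3://mybucket/pq/.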