In Hive:
I am trying to merge data from different tables with different schemas (the only difference is that some tables may have additional columns with data) in Hive.
table A:
id int, name string, v1 string, v2 string
table B:
id int, name string, v1 String
table C:
id int, name string, v1 string, v2 string, v3 string, v4 string
I need to merge data from all these tables into a new 'table D' with
table D:
id int, name string, v1 string, v2 string, v3 string, v4 string
How can I insert data from each source table based on its schema, i.e. if a column exists then insert that column's value, otherwise NULL?
I have written a query like the following in a script that loops through all the tables:
insert into table D select id, name, v1,v2,v3,v4 from ${hiveconf:TABLE_NAME};
This fails for table A because it cannot find the v3 and v4 columns, which are not part of table A.
Since I have to select from thousands of different tables, each with a varying number of columns among v1 to v4, I would like to build the column selection dynamically in the script; writing and executing thousands of individual INSERT INTO statements by hand is not good practice.
How can this be achieved in SQL, so that I can work out the Hive way of doing it?
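For example, with the schemas above, the statements I would want the script to generate would pad any missing columns with NULL, something like:
-- table A has no v3/v4
insert into table D select id, name, v1, v2, NULL as v3, NULL as v4 from A;
-- table B has only v1
insert into table D select id, name, v1, NULL as v2, NULL as v3, NULL as v4 from B;
-- table C already has all the columns
insert into table D select id, name, v1, v2, v3, v4 from C;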
create table h5_qti_desc
( h5id string,
query string,
title string,
item string,
query_ids string,
title_ids string,
item_ids string,
label bigint
)PARTITIONED BY (day string) LIFECYCLE 160;
insert overwrite table h5_qti_desc
select * from aaa;
I created a table named h5_qti_desc, and I want to insert into it from another table, aaa, which has a day field but no partitions.
Table aaa has several days, like '20171010','20171015'...
How can I insert into h5_qti_desc in a single statement, with the day values in aaa acting as the day partition of h5_qti_desc?
You can use Hive dynamic partition functionality to insert data. Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table.
Below is an example of loading data to all partitions using one insert statement:
hive>set hive.exec.dynamic.partition.mode=nonstrict;
hive>INSERT OVERWRITE TABLE h5_qti_desc PARTITION(day)
SELECT * FROM aaa
DISTRIBUTE BY day;
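Note that with SELECT * the dynamic partition column (day) must be the last column produced by the SELECT. If day is not the last column of aaa, list the columns explicitly; a sketch assuming aaa carries the same non-partition columns as h5_qti_desc:
INSERT OVERWRITE TABLE h5_qti_desc PARTITION(day)
SELECT h5id, query, title, item, query_ids, title_ids, item_ids, label, day
FROM aaa
DISTRIBUTE BY day;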
My issue is that I have tried this on my local machine with Hadoop and also checked on AWS EC2, and the query below returns no records. As far as I can tell, the script below is correct.
My question is: why don't we see any results in the part file after the job completes?
DROP TABLE IF EXISTS batting;
CREATE EXTERNAL TABLE IF NOT EXISTS batting(
  id STRING, year INT, team STRING, league STRING, games INT, ab INT,
  runs INT, hits INT, doubles INT, triples INT, homeruns INT, rbi INT,
  sb INT, cs INT, walks INT, strikeouts INT, ibb INT, hbp INT, sh INT,
  sf INT, gidp INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://hive-test1/batting';

DROP TABLE IF EXISTS master;
CREATE EXTERNAL TABLE IF NOT EXISTS master(
  id STRING, byear INT, bmonth INT, bday INT, bcountry STRING, bstate STRING,
  bcity STRING, dyear INT, dmonth INT, dday INT, dcountry STRING,
  dstate STRING, dcity STRING, fname STRING, lname STRING, name STRING,
  weight INT, height INT, bats STRING, throws STRING, debut STRING,
  finalgame STRING, retro STRING, bbref STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://hive-test1/master';

INSERT OVERWRITE DIRECTORY 's3://hive-test1/output'
SELECT n.fname, n.lname, x.year, x.runs
FROM master n
JOIN (SELECT b.id AS id, b.year AS year, b.runs AS runs
      FROM batting b
      JOIN (SELECT year, max(runs) AS best FROM batting GROUP BY year) o
      WHERE b.runs = o.best AND b.year = o.year) x
ON x.id = n.id
ORDER BY x.runs DESC;
When you use Hive to create the two tables, all you're doing is creating a definition: names, fields and their types, a location, and so on. CREATE does nothing with the data.
Based on your similar question earlier, I think you have some existing HDFS files in CSV format that contain the data you want to query, right?
Before doing that, I suggest that you manually insert a record into each table, like INSERT INTO batting (id, year, team, league) VALUES ('1', 2016, 'Red Sox', 'AL East');. Then query the table with SELECT * FROM batting; to confirm you have one record with some values in it.
Now you have the next problem to solve: how do I import an HDFS file to a Hive table? You can do this using Hue, if you have it installed. If not, I suggest you use Google to find an answer to this question.
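For reference, the usual Hive statement for this step is LOAD DATA; a minimal sketch, with hypothetical file paths that you would replace with your own:
-- Move a file that already sits on HDFS into the table (hypothetical path)
LOAD DATA INPATH '/user/hadoop/batting.csv' INTO TABLE batting;
-- Or copy a file from the local filesystem of the machine running the Hive client
LOAD DATA LOCAL INPATH '/tmp/batting.csv' INTO TABLE batting;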
In general, you have three problems to solve:
Create tables in Hive so the Hive metastore knows about their structure. This is called data definition language, or DDL, in SQL.
Import and link your existing CSV data sets, sitting as files on HDFS, to their corresponding Hive tables.
Query the tables using SQL, most likely with SELECT and JOIN; this is called data manipulation language, or DML, in SQL.
Each is a different step. Make them work one by one, and you'll break a complex problem down into smaller problems that are easier to understand.
When we partition a table, the columns on which the table is being partitioned are not mentioned in the CREATE statement's column list but are instead listed separately in the PARTITIONED BY clause. What is the reason behind this?
CREATE TABLE REGISTRATION_DATA (
userid BIGINT,
First_Name STRING,
Last_Name STRING,
address1 STRING,
address2 STRING,
city STRING,
zip_code STRING,
state STRING
)
PARTITIONED BY (
REGION STRING,
COUNTRY STRING
)
The partition columns that we create in Hive become pseudo-columns which we can query directly, without including them in the CREATE statement's column list.
So if we include a partition column in the table's own column list (in the CREATE query), we get an error like 'Error in semantic analysis: Columns repeated in partitioning columns'.
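At query time the partition columns behave like any other column; a minimal sketch against the table above, using made-up partition values:
-- filter on the partition pseudo-columns (values here are hypothetical)
SELECT userid, city
FROM REGISTRATION_DATA
WHERE region = 'WEST' AND country = 'US';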
I was trying to create partitions and buckets using Hive.
First, I set some of the properties:
set hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Below is the code for creating the table:
CREATE TABLE transactions_production
( id string,
dept string,
category string,
company string,
brand string,
date1 string,
productsize int,
productmeasure string,
purchasequantity int,
purchaseamount double)
PARTITIONED BY (chain string) clustered by(id) into 5 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Below is the code for inserting data into the table:
INSERT OVERWRITE TABLE transactions_production PARTITION (chain)
select id, dept, category, company, brand, date1, productsize, productmeasure,
purchasequantity, purchaseamount, chain from transactions_staging;
What went wrong:
Partitions and buckets are getting created in HDFS but the data is present only in the 1st bucket of all the partitions; all the remaining buckets are empty.
Please let me know what I did wrong and how to resolve this issue.
When using bucketing, Hive computes a hash of the CLUSTERED BY value (here you use id) and splits each partition into that many flat files (buckets).
Because the table is split up by the hashes of the ids, the size of each split depends on the values in your table.
If no values hash to any bucket other than the first one, all the other flat files will be empty.
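One way to check is to compute a similar hash split over the staging data and count rows per bucket; a rough sketch, where pmod(hash(id), 5) is my assumption of how ids would map to the 5 buckets:
-- Count how many rows would land in each of the 5 buckets
SELECT pmod(hash(id), 5) AS bucket_no,
       COUNT(*) AS rows_in_bucket
FROM transactions_staging
GROUP BY pmod(hash(id), 5);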
I'm trying to import data using an EMR job from JSON files in S3 that contain sparse fields, e.g. an ios_os field and an android_os field where only one contains data. Sometimes the data is null and sometimes it's an empty string. When trying to insert into DynamoDB I'm getting an error (although I am able to insert some records that are sparsely populated):
"AttributeValue may not contain an empty string"
{"created_at_timestamp":1358122714,...,"data":null,"type":"e","android_network_carrier":""}
I filtered out the columns that had the empty string "", but I'm still getting that error. I'm assuming it's the "property":null values that are causing this (or both). I assume that for it to work properly those values shouldn't exist when going to DynamoDB?
Is there any way to tell Hive, through the JSONSerde or through Hive's interaction with the DynamoDB table, to ignore empty string attribute values?
Here's an example of the Hive SQL schema and insert command:
CREATE EXTERNAL TABLE IF NOT EXISTS json_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
PARTITIONED BY (created_at BIGINT, type STRING)
ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
WITH SERDEPROPERTIES (
-- Common
"created_at"="$.created_at",
"data"="$.data",
"android_network_carrier"="$.anw",
"type"="$.dt"
)
LOCATION 's3://test.data/json_events';
CREATE EXTERNAL TABLE IF NOT EXISTS dynamo_events (
-- Common
created_at BIGINT,
data STRING,
type STRING,
android_network_carrier STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test-events",
"dynamodb.column.mapping" = "created_at:created_at,data:data,type:type,android_network_carrier:android_network_carrier");
ALTER TABLE json_events RECOVER PARTITIONS;
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e';
The nulls shouldn't be a problem as long as they're not in the primary key.
However, DynamoDB does not allow empty strings nor empty sets as described in the data model.
To work around this, I think you have a couple options:
Define a constant for empty strings, like "n/a", and make sure that your data extraction process treats missing values as such (see the sketch after the query below).
You could also filter out these records, but that would mean losing data. It could be done like this:
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
data,
android_network_carrier,
type
FROM json_events
WHERE created_at = 20130114 AND type = 'e' AND android_network_carrier != "";
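For the first option, a minimal sketch of substituting a constant at write time; the 'n/a' placeholder and the choice of column to rewrite are assumptions for illustration, not something from the question:
INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
       data,
       -- replace empty carrier strings with a placeholder DynamoDB will accept
       CASE WHEN android_network_carrier = '' THEN 'n/a'
            ELSE android_network_carrier END,
       type
FROM json_events
WHERE created_at = 20130114 AND type = 'e';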