Hive static partitions issue - hadoop

I have a csv file which have 600 records, 300 for male and female each.
I have created a Table_Temp and fill all these records in that table. Then, I create Table_Main with gender as partition column.
For Temp_Table query is:
Create table if not exists Temp_Table
(id string, age int, gender string, city string, pin string)
row format delimited
fields terminated by ',';
Then I write the below query:
Insert into Table_Main
partitioned (gender)
select a,b,c,d,gender from Table)Temp
Problem: I am getting a file in /user/hive/warehouse/mydb.db/Table_Main/gender=Male/000000_0
In this file, I am getting total 600 records. I am not sure whats happening but what I was expected is I should get 300 records in this file(only Male).
Q:1. Where am I mistaken ?
Q:2. Should I not get one more folder for all other values(which are not in static partition) ? If NOT, what will happen to those ?

In static partition we need to specify a where condition while inserting data into partition table.(which I have not done).
For this we can use dynamic partition without where condition.

Related

HIVE - Cannot partition a table: semantic exception failure

I'm not able to import data on partitioned table in Hive.
Here is how I create the table
CREATE TABLE IF NOT EXISTS title_ratings
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
And then I load the data into it : LOAD DATA INPATH '/title.ratings.tsv.gz' INTO TABLE eval_hive_db.title_ratings;
It works fine till here. Now I want to create a dynamic partitioned table. First of all, I setup theses params:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I now create my partitioned table:
CREATE TABLE IF NOT EXISTS title_ratings_part
(
tconst STRING,
numVotes INT
)
PARTITIONED BY (averageRating DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;
insert into title_ratings_part partition(title_ratings) select tconst, averageRating, numVotes from title_ratings;
(I also tried with numVotes instead by the way)
And I receive this error: FAILED: ValidationFailureSemanticException eval_hive_db.title_ratings_part: Partition spec {title_ratings=null} contains non-partition columns
Someone can help me please?
Ideally, I want to partition my table by averageRating (less than 2, between 2 and 4, and greater than 4)
You can run this command to check if there are null values or not.
select count(averageRating) from title_ratings group by averageRating;
Now, if there are null values in this column then you will get the count, which you have to fill then apply partitioning again.
Partition column is stored as last column in a table so while inserting you need to maintain correct order in select statement.
Pls change order of columns in select.
insert into title_ratings_part partition(title_ratings)
Select
Tconst,
numVotes,
averageRating --orderwise this should always be last column
from title_ratings

Loading Data into an empty Impala Table with account data partitioned by area code

I'm trying to copy data from a table called accounts into an empty table called accounts_by_area_code. I have the following fields in accounts_by_area_code: acct_num INT, first_name STRING, last_name STRING, phone_number STRING. The table is partitioned by areacode (the first 3 digits of phone_number.
I need to use a SELECT statement to extract the area code into an INSERT INTO TABLE command to copy the specified columns to the new table, dynamically partitioning by area code.
This is my last attempt:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num, first_name, last_name, phone_number, areacode) PARTITION (areacode) SELECT STRLEFT (phone_number,3) AS areacode FROM accounts;"
This generates ERROR: AnalysisException: Column permutation and PARTITION clause mention more columns (5) than the SELECT / VALUES clause and PARTITION clause return (1). I'm not convinced I have even the basic syntax correct so any help would be great as I'm new to Impala.
Impala creates partitions dynamically based on data. So not sure why you want to create an empty table with partitions because it will be auto created while inserting new data.
Still, I think you can create empty table with partitions like this-
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num) PARTITION (areacode)
SELECT CAST(NULL as STRING), STRLEFT (phone_number,3) AS areacode FROM accounts;"

How can insert into the table with the original day as partition in Hive?

create table h5_qti_desc
( h5id string,
query string,
title string,
item string,
query_ids string,
title_ids string,
item_ids string,
label bigint
)PARTITIONED BY (day string) LIFECYCLE 160;
insert overwrite into h5_qti_desc
select * from aaa
;
I create a table named h5_qti_desc, and I want to insert into it from another aaa table, which has the field of day and there is no partition in aaa.
Table aaa has several days, like '20171010','20171015'...
How can I insert into h5_qti_desc with day as partition once, and the days in aaa acted as day in h5_qti_desc's partition.
You can use Hive dynamic partition functionality to insert data. Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table.
Below is an example of loading data to all partitions using one insert statement:
hive>set hive.exec.dynamic.partition.mode=nonstrict;
hive>INSERT OVERWRITE TABLE h5_qti_desc PARTITION(day)
SELECT * FROM aaa
DISTRIBUTE day;

Updating unique id column for newly added records in table in hive

I have a table in which I want unique identifier to be added automatically as a new record is inserted into it. Considering I have column for unique identifier already created.
hive can't update the table but you can create a temporary table or overwrite your first table.
you can also use concat function to join the two diferent column or string.
here is the examples
function :concat(string A, string B…)
return: string
hive> select concat(‘abc’,'def’,'gh’) from dual;
abcdefgh
HQL &result
insert overwrite table stock select tradedate,concat('aa',tradetime),stockid ,buyprice,buysize ,sellprice,sellsize from stock;
20130726 aa094251 204001 6.6 152000 6.605 100
20130726 aa094106 204001 6.45 13400 6.46 100

Hive, Bucketing for the partitioned table

This is my script:
--table without partition
drop table if exists ufodata;
create table ufodata ( sighted string, reported string, city string, shape string, duration string, description string )
row format delimited
fields terminated by '\t'
Location '/mapreduce/hive/ufo';
--load my data in ufodata
load data local inpath '/home/training/downloads/ufo_awesome.tsv' into table ufodata;
--create partition table
drop table if exists partufo;
create table partufo ( sighted string, reported string, city string, shape string, duration string, description string )
partitioned by ( year string )
clustered by (year) into 6 buckets
row format delimited
fields terminated by '/t';
--by default dynamic partition is not set
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--by default bucketing is false
set hive.enforcebucketing=true;
--loading mydata
insert overwrite table partufo
partition (year)
select sighted, reported, city, shape, min, description, SUBSTR(TRIM(sighted), 1,4) from ufodata;
Error message:
FAILED: Error in semantic analysis: Invalid column reference
I tried bucketing for my partitioned table. If I remove "clustered by (year) into 6 buckets" the script works fine. How do I bucket the partitioned table
There is an important thing we should consider while doing bucketing in hive.
The same column name cannot be used for both bucketing and partitioning. The reason is as follows:
Clustering and Sorting happens within a partition. Inside each partition there will be only one value associated with the partition column(in your case it is year)therefore there will not any be any impact on clustering and sorting. That is the reason for your error....
You can use the below syntax to create bucketing table with partition.
CREATE TABLE bckt_movies
(mov_id BIGINT , mov_name STRING ,prod_studio STRING, col_world DOUBLE , col_us_canada DOUBLE , col_uk DOUBLE , col_aus DOUBLE)
PARTITIONED BY (rel_year STRING)
CLUSTERED BY(mov_id) INTO 6 BUCKETS;
when you're doing dynamic partition, create a temporary table with all the columns (including your partitioned column) and load data into temporary table.
create actual partitioned table with partition column. While you are loading data from temporary table the partitioned column should be in the last in the select clause.

Resources