Showing NULL values after adding column in hive - hadoop

I am using Hive version 1.2.1 and I am new to Hive.
I added a column to table_2, and it shows NULL values. I want to put the date part of a timestamp column into the newly created column. I tried the query below:
ALTER TABLE table_2 ADD COLUMNS(DATE_COL string);
INSERT INTO table_2 (DATE_COL) AS SELECT SUBSTRING(TIMESTAMP_COL,-19,10) FROM table_1 ;
This runs, but the newly created DATE_COL still shows NULL values.
I want just the date in DATE_COL.
table_1 has 13 columns; table_2 has 14 columns (13 + DATE_COL).
TIMESTAMP_COL is a STRING.
DATE_COL is a STRING.
Please tell me how to solve this problem.

Use the UPDATE command.
Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
As of Hive 0.14.0, INSERT ... VALUES, UPDATE, and DELETE are available with full ACID support. UPDATE and DELETE only work on tables that support ACID transactions.
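An INSERT INTO appends new rows instead of filling in existing ones, so the rows that were already in table_2 keep NULL in the new column. A minimal sketch of the UPDATE for the table in the question, assuming table_2 is a transactional table and already holds TIMESTAMP_COL values shaped like yyyy-MM-dd HH:mm:ss (neither is stated in the question):

```sql
-- Assumption: table_2 is transactional (ORC, bucketed, 'transactional'='true').
-- Assumption: timestamp_col is a string such as '2016-05-10 13:45:00'.
-- substr(..., 1, 10) keeps the first 10 characters, i.e. the date part.
UPDATE table_2
SET date_col = substr(timestamp_col, 1, 10);
```

On a non-transactional table this statement fails; rewriting the table with INSERT OVERWRITE and the same substr expression in the select list is the usual alternative there.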

Related

How to use derived columns in same hive table?

Could you please help me with the query below?
Suppose there is a table employee with columns A, B, and a Date column.
I have to load data from employee into another table, emp, with the following transformations applied:
1. Absolute value of column A (the column name in emp will be ABS_A)
2. Absolute value of column B (the column name in emp will be ABS_B)
3. Find sum(ABS_A) for a given Date
4. Find sum(ABS_B) for a given Date
5. Find sum(ABS_A)/sum(ABS_B) (the column name will be Average)
So the final table emp will have the columns below:
1. A
2. B
3. ABS_A
4. ABS_B
5. Average
How do I handle such derived columns in Hive?
I tried the query below, but it is not working. Could anyone guide me?
insert overwrite into emp
select
A,
B,
ABS(A) as ABS_A,
ABS(B) as ABS_B,
sum(ABS_A) OVER PARTION BY DATE AS sum_OF_A,
sum(ABS_B) OVER PARTTION BY DATE AS sum_of_b,
avg(sum_of_A,sum_of_b) over partition by date as average
from employee
Hive does not let you reference a column alias at the same query level where it is defined. Use a subquery, or repeat the expression in place of the alias.
insert overwrite table emp
select A, B, ABS_A, ABS_B, sum_OF_A, sum_of_b, `date`, sum_OF_A/sum_of_b as average
from
(
select A, B, ABS(A) as ABS_A, ABS(B) as ABS_B, `date`,
sum(ABS(A)) OVER (PARTITION BY `date`) AS sum_OF_A,
sum(ABS(B)) OVER (PARTITION BY `date`) AS sum_of_b
from employee
)s;

Hive modify partitioned table data

Problem: one column value is empty. It should be 'ab', but unfortunately I wrote '' instead of 'ab'.
My table is a partitioned table. Is there any way to change that?
I found the following approach, but it seems inefficient:
1. Create a temp table like my table.
2. Use INSERT OVERWRITE to read data from my old table and write it to the new table, using a CASE statement to change '' to 'ab'.
3. Replace my original table with the temp table.
I am looking for a solution along the lines of updating a partition and running MSCK. Is there any way to do that?
You can overwrite a single partition this way:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table target_table partition (part_col)
select
case when col1 = '' then 'ab' else col1 end as col1,
col2, --select all the columns in the same order
col3,
part_col --partition column is the last one
from target_table where part_col='your_partition_value';
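When only one known partition needs fixing, the same rewrite can be sketched with a static partition specification, which does not depend on the dynamic-partition settings. The names target_table, col1 to col3, part_col, and 'your_partition_value' are placeholders, not names from the question:

```sql
-- Static-partition form: the partition value is fixed in the PARTITION clause,
-- so part_col is left out of the select list.
INSERT OVERWRITE TABLE target_table PARTITION (part_col = 'your_partition_value')
SELECT
  CASE WHEN col1 = '' THEN 'ab' ELSE col1 END AS col1,  -- replace '' with 'ab'
  col2,                                                 -- remaining columns, in table order
  col3
FROM target_table
WHERE part_col = 'your_partition_value';
```

Only the named partition is rewritten; the other partitions are left untouched.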
One possible solution is to run an UPDATE on the table, provided the column is neither a partitioning nor a bucketing column.
UPDATE tablename SET column = (CASE WHEN column = '' THEN 'ab' ELSE column END) [WHERE expr if any];
Update: to support ACID operations in Hive, set the following:
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
Note: this works only with Hive 0.14 or later.
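The session settings are only half of the setup: in these Hive versions UPDATE also needs the table itself to be stored as ORC, bucketed, and marked transactional. A sketch of such a table definition follows; the table name, columns, and bucket count are hypothetical and only illustrate the shape:

```sql
-- Hypothetical table; names, types and bucket count are illustrative only.
CREATE TABLE tablename_acid (
  id   INT,
  col1 STRING
)
PARTITIONED BY (part_col STRING)
CLUSTERED BY (id) INTO 4 BUCKETS          -- ACID tables must be bucketed here
STORED AS ORC                             -- ACID tables must use ORC
TBLPROPERTIES ('transactional' = 'true'); -- enables UPDATE and DELETE
```

An existing non-transactional table would typically have its data copied into a table like this before UPDATE can be used on it.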

Excluding the partition field from select queries in Hive

Suppose I have the following table definition in Hive (the actual table has around 65 columns):
CREATE EXTERNAL TABLE S.TEST (
COL1 STRING,
COL2 STRING
)
PARTITIONED BY (extract_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\007'
LOCATION 'xxx';
Once the table is created, when I run hive -e "describe s.test", I see extract_date listed as one of the columns of the table. Running select * from s.test also returns the extract_date column values. Is it possible to exclude this virtual(?) column when running select queries in Hive?
Set this property:
set hive.support.quoted.identifiers=none;
and run the query as:
SELECT `(extract_date)?+.+` FROM <table_name>;
I tested it, and it works fine.
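The backquoted name is treated as a Java regular expression over column names, so the same pattern can be extended with alternation to drop more than one column. A small sketch against the s.test table from the question, here excluding col1 as well, purely as an illustration:

```sql
-- With quoted identifiers set to none, backquoted names are read as regexes.
SET hive.support.quoted.identifiers=none;

-- Selects every column of s.test except col1 and extract_date.
SELECT `(col1|extract_date)?+.+` FROM s.test;
```

The setting applies to the whole session, so backticks stop working as plain identifier quotes until it is changed back.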

Hive update all values in a column

I have an external partitioned Hive table. One of its columns is a string named OLDDATE that holds the date in a different format (DD-MM-YY). I want to update the column and store dates in YYYY-MM-DD format. All years are 20XX.
So I thought of this:
select CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) from table
This gives me the dates in the format I want. Now how do I overwrite the old date with this new date?
You can effect an update by overwriting the table with its own contents, changing only the date field according to your transformation, as in this pseudo-code:
INSERT OVERWRITE TABLE table_name
SELECT
col1
, col2
...
, CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) AS olddate
...
, coln
FROM table_name;
@user2441441
To overwrite a partitioned table:
INSERT OVERWRITE TABLE table_name PARTITION (p_col)
SELECT
col1
, col2
...
, CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) AS olddate
...
, coln
, p_col
FROM table_name;
Since it is a partitioned table, the folder names are created from the date values, so you cannot update those values in place.
One workaround is to create a new table, run your query above, and insert the data into the new table.
After that, you can drop the existing table and use the new table in its place.

hive : select row with column having maximum value without join

I am writing a Hive query over a table to pick the row with the maximum value in a column.
For example, there is a table with the following data:
key value updated_at
1 "a" 1
1 "b" 2
1 "c" 3
The row that was updated last needs to be selected.
I am currently using the following logic:
select tab1.* from table_name tab1
join (select key, max(updated_at) as max_updated from table_name group by key) tab2
on tab1.key = tab2.key and tab1.updated_at = tab2.max_updated;
Is there any other better way to perform this?
If it is true that updated_at is unique for that table, then the following is perhaps a simpler way of getting you what you are looking for:
-- I'm using Hive 0.13.0
SELECT * FROM table_name ORDER BY updated_at DESC LIMIT 1;
If it is possible for updated_at to be non-unique for some reason, you may need to adjust the ORDER BY logic to break any ties in the fashion you wish.
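If the latest row is needed per key rather than for the whole table, one join-free option is a window function. This is a sketch using the table_name, key, value, and updated_at names from the question; the aliases t, ranked, and rn are made up for the example:

```sql
-- Number the rows of each key from newest to oldest, then keep the newest.
SELECT key, value, updated_at
FROM (
  SELECT t.*,
         row_number() OVER (PARTITION BY key ORDER BY updated_at DESC) AS rn
  FROM table_name t
) ranked
WHERE rn = 1;
```

When two rows of the same key share the same updated_at, row_number() picks one of them arbitrarily, so add a further ORDER BY column if a specific tie-break is needed.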
