How to use derived columns in the same Hive table? - hadoop

Could you please help me with the below query?
Suppose there is a table employee with columns A, B, and a Date column.
I have to load data from table employee into another table emp with the below transformations applied.
Transformations on the Employee table:
1. Absolute value of column A (column name in emp will be ABS_A)
2. Absolute value of column B (column name in emp will be ABS_B)
3. Find the sum(ABS_A) for a given Date
4. Find the sum(ABS_B) for a given Date
5. Find sum(ABS_A)/sum(ABS_B) - column name will be Average
So the final table emp will have the below columns:
1.A
2.B
3.ABS_A
4.ABS_B
5.Average
How to handle such derived columns in Hive?
I tried the below query but it is not working. Could anyone guide me?
insert overwrite into emp
select
A,
B,
ABS(A) as ABS_A,
ABS(B) as ABS_B,
sum(ABS_A) OVER PARTION BY DATE AS sum_OF_A,
sum(ABS_B) OVER PARTTION BY DATE AS sum_of_b,
avg(sum_of_A,sum_of_b) over partition by date as average
from employee

Hive does not support referencing derived columns (column aliases) at the same query level. Use a subquery, or repeat the expression instead of referencing the alias.
insert overwrite table emp
select A, B, ABS_A, ABS_B, sum_OF_A, sum_of_b, `date`, sum_OF_A/sum_of_b as average
from
(
select A, B, ABS(A) as ABS_A, ABS(B) as ABS_B, `date`,
sum(ABS(A)) OVER (PARTITION BY `date`) AS sum_OF_A,
sum(ABS(B)) OVER (PARTITION BY `date`) AS sum_of_b
from employee
)s;
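Note that the select list above has eight columns (it keeps the two sums and the date alongside the average), so emp must be declared to match. A minimal sketch of such a DDL, with column types assumed since the question does not give them:
create table emp (
  A double,
  B double,
  ABS_A double,
  ABS_B double,
  sum_OF_A double,
  sum_of_b double,
  `date` string,
  average double
);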

Related

Showing NULL values after adding column in hive

I am using Hive version 1.2.1 and I am a newbie to Hive.
I have added a column to TABLE_2 and it shows NULL values. I want to put the DATE part from a timestamp column into the newly created column. I tried the below query:
ALTER TABLE table_2 ADD COLUMNS(DATE_COL string);
INSERT INTO table_2 (DATE_COL) AS SELECT SUBSTRING(TIMESTAMP_COL,-19,10) FROM table_1 ;
This runs, but it still shows NULL values in the newly created DATE_COL.
I want just the date in DATE_COL.
table_1 has 13 columns, table_2 has 14 columns (13 + DATE_COL).
TIMESTAMP_COL: STRING.
DATE_COL: STRING.
Please tell me how to solve this problem.
Use the UPDATE command:
Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
As of Hive 0.14.0, INSERT...VALUES, UPDATE, and DELETE are available with full ACID support.
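A minimal sketch of such an UPDATE, assuming table_2 meets the ACID requirements (transactional, ORC) and also carries TIMESTAMP_COL with values like '2016-01-15 10:23:45':
-- take the first 10 characters (the yyyy-MM-dd part) of the timestamp string
UPDATE table_2 SET DATE_COL = substr(TIMESTAMP_COL, 1, 10);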

Hive update all values in a column

I have an external partitioned Hive table. One of its columns is a string named OLDDATE that has the date in a different format (DD-MM-YY). I want to update the column to store dates in YYYY-MM-DD format. All years are 20XX.
So I thought of this
select CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) from table
This gives me the dates in the format I want. Now how do I overwrite the old date with this new date?
You can effect an update by overwriting the table with its own contents, just with the date field changed according to your transformation, like this pseudo-code:
INSERT OVERWRITE table
SELECT
col1
, col2
...
, CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) AS olddate
...
, coln
FROM table;
To overwrite a partitioned table:
INSERT OVERWRITE table PARTITION (p_col)
SELECT
col1
, col2
...
, CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) AS olddate
...
, coln
, p_col
FROM table;
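Dynamic-partition inserts like the one above usually need dynamic partitioning enabled in the session first; a minimal sketch of the usual settings (assuming the cluster defaults have not already been changed):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;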
Since it is a partitioned table, the folder names are created from the date values, hence you are not able to update the values in place.
One workaround would be to create a new table, run the above query, and insert the data into the new table.
After that you can drop your existing table and treat the new table as your required table.
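A minimal sketch of that workaround, with an assumed name new_table and the same placeholder columns as the pseudo-code above (untested, like the rest):
CREATE TABLE new_table LIKE table;   -- copies the schema, including the partition column
INSERT OVERWRITE TABLE new_table PARTITION (p_col)
SELECT col1, col2, ..., CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) AS olddate, ..., coln, p_col
FROM table;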

Range Partitioning in Hive

Does Hive support range partitioning?
I mean, does Hive support something like the below:
insert overwrite table table2 PARTITION (employeeId BETWEEN 2001 and 3000)
select employeeName FROM emp10 where employeeId BETWEEN 2001 and 3000;
Where table2 & emp10 has two columns:
employeeName &
employeeId
When I run the above query, I am facing an error:
FAILED: ParseException line 1:56 mismatched input 'BETWEEN' expecting ) near 'employeeId' in destination specification
It is not possible. Here is a quote from the Hive documentation:
A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns
No, it's not possible. However, you can use a separate calculated partition column instead, e.g.:
insert overwrite table table2 PARTITION (employeeId_range)
select employeeName, employeeId, floor(employeeId/1000) as employeeId_range FROM emp10 where employeeId BETWEEN 2000 and 2999;
which makes sure all values in that range fall into the same partition.
While querying the table, since we already know the range calculation, we can run:
select employeeName , employeeId FROM table2 where employeeId_range=2;
Thus we can also parallelise the queries over the given ranges.
Hope it helps.

SQL statement to insert one value selected from one table and the rest from outside the tables

I have two DB2 tables. Table one contains rows of constant data with id as one of the columns. The second table contains an id column, which is a foreign key to the id column in the first table, and has 3 more varchar columns.
I am trying to insert rows into the second table, where the entry in the id column is based on a where clause and the remaining columns get their values from outside.
table1 has columns id, t1c1, t1c2, t1c3
table2 has columns id, t2c1, t2c2, t2c3
to insert into table2, I am trying this query:
insert into table2 values
(select id from table1 where t1c2 like 'xxx', 'abc1','abc2');
I know it is something basic I am missing here.
Please help me correct this query.
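One common way to write this is an INSERT ... SELECT that carries the literal values in the select list; a minimal sketch, assuming the two literals belong in t2c1 and t2c2:
insert into table2 (id, t2c1, t2c2)
select id, 'abc1', 'abc2'
from table1
where t1c2 like 'xxx';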

Update query giving wrong results

I have a table called company_emp. In that table I have 6 columns related to employees:
empid
ename
dob
doj, ...
I have another table called bday. In that I have only 2 columns: empid and dob.
I have this query:
select empid, dob
from company_emp
where dob like '01/05/2011'
It shows a list of employees.
In the same way I have queried the bday table, and it listed some employees.
Now I want to update the company_emp table for employees who have the date '01/05/2011'.
I have tried a query like this:
update company_name a
set dob = (select dob from bday b
where b.empid=a.empid
and to_char(a.dob,'dd/mm/yyyy') = '01/05/2011'}
Then all the records in that column become null. How can I fix this query?
You're updating every row in the company_name/emp table.
You can fix that with a correlated subquery to make sure the row exists, or more efficiently by placing a primary or unique key on bday.empid and querying:
update (
select c.dob to_dob,
d.dob from_dob
from company_emp c join bday d on (c.empid = d.empid)
where d.dob = date '2011-05-01')
set to_dob = from_dob
Syntax not tested.
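A minimal sketch of the correlated-subquery variant mentioned above, filtering on company_emp.dob as in the original attempt (equally untested):
update company_emp a
set dob = (select b.dob from bday b where b.empid = a.empid)
where to_char(a.dob,'dd/mm/yyyy') = '01/05/2011'
  and exists (select 1 from bday b where b.empid = a.empid);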
