Convert value while inserting into HIVE table - hadoop

I have created a bucketed table called emp_bucket with 4 buckets, clustered on the salary column. The structure of the table is as below:
hive> describe Consultant_Table_Bucket;
OK
id int
age int
gender string
role string
salary double
Time taken: 0.069 seconds, Fetched: 5 row(s)
I also have a staging table from which I can insert data into the above bucketed table. Below is some sample data from the staging table:
id age Gender role salary
-----------------------------------------------------
938 38 F consultant 55038.0
939 26 F student 33319.0
941 20 M student 97229.0
942 48 F consultant 78209.0
943 22 M consultant 77841.0
My requirement is to load into the bucketed table only those employees whose salary is greater than 10,000, and while loading I have to convert the "consultant" role to "BigData consultant".
I know how to insert data into my bucketed table with an INSERT ... SELECT statement, but I need some guidance on how the consultant value in the role column above can be changed to BigData consultant while inserting.
Any help appreciated.

Based on your insert, you just need to work on the role part of your select:
INSERT INTO TABLE bucketed_user PARTITION (salary)
SELECT
    id,
    age,
    gender,
    if(role = 'consultant', 'BigData consultant', role) AS role,
    salary
FROM stage_table
WHERE salary > 10000;
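If you prefer standard SQL over Hive's if() function, a CASE expression gives the same result. A sketch using the same table names as in the answer above:
-- same filter and role conversion, written with CASE instead of if()
INSERT INTO TABLE bucketed_user PARTITION (salary)
SELECT
    id,
    age,
    gender,
    CASE WHEN role = 'consultant' THEN 'BigData consultant' ELSE role END AS role,
    salary
FROM stage_table
WHERE salary > 10000;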

Related

How to use derived columns in same hive table?

Could you please help me with the below query?
Suppose there is a table employee with columns A, B, and a Date column.
I have to load data from table employee into another table emp with the below transformations applied.
Transformations on the Employee table:
1. Absolute value of column A (the column name in emp will be ABS_A)
2. Absolute value of column B (the column name in emp will be ABS_B)
3. Find the sum(ABS_A) for a given Date
4. Find the sum(ABS_B) for a given Date
5. Find sum(ABS_A)/sum(ABS_B) - the column name will be Average
So the final table emp will have the below columns:
1. A
2. B
3. ABS_A
4. ABS_B
5. Average
How do I handle such derived columns in Hive?
I tried the below query but it is not working. Could anyone guide me?
insert overwrite into emp
select
A,
B,
ABS(A) as ABS_A,
ABS(B) as ABS_B,
sum(ABS_A) OVER PARTION BY DATE AS sum_OF_A,
sum(ABS_B) OVER PARTTION BY DATE AS sum_of_b,
avg(sum_of_A,sum_of_b) over partition by date as average
from employee
Hive does not support referencing derived columns (column aliases) at the same query level. Use a subquery, or repeat the expression in place of the alias.
insert overwrite table emp
select A, B, ABS_A, ABS_B, sum_OF_A, sum_of_b, `date`, sum_OF_A/sum_of_b as average
from
(
    select A, B, ABS(A) as ABS_A, ABS(B) as ABS_B, `date`,
           sum(ABS(A)) OVER (PARTITION BY `date`) AS sum_OF_A,
           sum(ABS(B)) OVER (PARTITION BY `date`) AS sum_of_b
    from employee
) s;
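The same logic can also be written with a CTE (WITH clause, supported since Hive 0.13), using Hive's FROM ... INSERT ... SELECT form. A sketch with the same table and column names as above:
-- compute the absolute values and per-date sums once, then derive the average
WITH abs_vals AS (
    select A, B, ABS(A) as ABS_A, ABS(B) as ABS_B, `date`,
           sum(ABS(A)) OVER (PARTITION BY `date`) AS sum_OF_A,
           sum(ABS(B)) OVER (PARTITION BY `date`) AS sum_of_b
    from employee
)
FROM abs_vals
INSERT OVERWRITE TABLE emp
SELECT A, B, ABS_A, ABS_B, sum_OF_A, sum_of_b, `date`, sum_OF_A/sum_of_b as average;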

how to delete rows with less than 20 repetitions from a hive table

I am trying to delete user_ids that are repeated fewer than 20 times in the rating table (ids with fewer than 20 votes mess up the prediction).
delete * FROM rating
WHERE COUNT(user_id) <20;
Below is the error I got: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 3:6 Not yet supported place for UDAF 'COUNT'
There are two big problems:
1. Your query is wrong. To work properly you need to use the aggregate function count with a GROUP BY on the user_id column.
2. You cannot delete records using a DELETE statement unless your table is a transactional table.
To delete records from a non-transactional table you need to use an INSERT OVERWRITE statement to overwrite the table with only the records you want to keep.
Syntax:
INSERT OVERWRITE TABLE <table_name> SELECT * FROM <table_name> WHERE <condition>;
Your code should look like this:
INSERT OVERWRITE TABLE rating
SELECT *
FROM rating
WHERE user_id IN
(
    SELECT user_id
    FROM rating
    GROUP BY user_id
    HAVING count(user_id) >= 20
);
If you have a transactional table, then you can delete the user_ids having a count less than 20 with the following statement.
hive> delete from rating where user_id in
(select user_id from rating group by user_id having count(user_id) < 20);
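If your Hive version does not accept the IN subquery, a LEFT SEMI JOIN does the same filtering; it only returns columns from the left-hand table, so SELECT r.* keeps the rating schema intact. A sketch that keeps only users with at least 20 ratings:
INSERT OVERWRITE TABLE rating
SELECT r.*
FROM rating r
LEFT SEMI JOIN
(
    SELECT user_id
    FROM rating
    GROUP BY user_id
    HAVING count(user_id) >= 20
) k
ON (r.user_id = k.user_id);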

Trigger in multiple schema

I have one database with two schemas: schema1 and schema2.
I want to create a trigger on the SCHEMA1.CLIENT table. Whenever an update query is performed on SCHEMA1.CLIENT, rows with the values before the change and after the change should be added to the SCHEMA2.HISTORY table.
Example:
SCHEMA1.CLIENT
NAME AGE DEPT
KRANTHI 21 CSE
KUMAR 22 ME
If I update the above table to change KRANTHI's age from 21 to 33, the rows should be stored like below in SCHEMA2.HISTORY.
SCHEMA2.HISTORY
MODIFIED NAME AGE DEPT
BEFORE KRANTHI 21 CSE
AFTER KRANTHI 33 CSE
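Hive itself has no triggers, so assuming this is a regular RDBMS such as Oracle, a minimal row-level trigger sketch could look like the following (the trigger name is made up, and SCHEMA2.HISTORY is assumed to already exist with the columns shown and to be writable by the trigger owner):
CREATE OR REPLACE TRIGGER schema1.client_history_trg
AFTER UPDATE ON schema1.client
FOR EACH ROW
BEGIN
  -- row as it looked before the update
  INSERT INTO schema2.history (modified, name, age, dept)
  VALUES ('BEFORE', :OLD.name, :OLD.age, :OLD.dept);
  -- row as it looks after the update
  INSERT INTO schema2.history (modified, name, age, dept)
  VALUES ('AFTER', :NEW.name, :NEW.age, :NEW.dept);
END;
/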

Hive static partitions issue

I have a csv file which has 600 records, 300 each for male and female.
I have created a Temp_Table and filled all these records into that table. Then I created Table_Main with gender as the partition column.
The query for Temp_Table is:
Create table if not exists Temp_Table
(id string, age int, gender string, city string, pin string)
row format delimited
fields terminated by ',';
Then I write the below query:
Insert into Table_Main
partition (gender)
select a, b, c, d, gender from Temp_Table;
Problem: I am getting a file in /user/hive/warehouse/mydb.db/Table_Main/gender=Male/000000_0.
In this file, I am getting a total of 600 records. I am not sure what is happening, but what I expected is that I should get 300 records (only Male) in this file.
Q1. Where am I mistaken?
Q2. Should I not get one more folder for all the other values (which are not in the static partition)? If not, what will happen to those?
With a static partition we need to specify a WHERE condition while inserting data into the partitioned table (which I had not done).
Alternatively, we can use dynamic partitioning, which does not need a WHERE condition.
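For reference, a sketch of both variants, using the table names from the question and assuming Table_Main's non-partition columns are id, age, city, and pin (everything from the staging table except gender):
-- Static partition: the partition value is hard-coded, so the rows
-- must be filtered with a matching WHERE clause.
INSERT INTO TABLE Table_Main PARTITION (gender = 'Male')
SELECT id, age, city, pin
FROM Temp_Table
WHERE gender = 'Male';

-- Dynamic partition: Hive routes each row by the last column of the SELECT.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE Table_Main PARTITION (gender)
SELECT id, age, city, pin, gender
FROM Temp_Table;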

Update query resulting wrongly

I have a table called company_emp. In that table I have 6 columns related to employees:
empid
ename
dob
doj, ...
I have another table called bday. In it I have only 2 columns: empid and dob.
I have this query:
select empid, dob
from company_emp
where dob like '01/05/2011'
It shows a list of employees.
In the same way I queried the bday table and it listed some employees.
Now I want to update the company_emp table for employees who have date '01/05/2011'.
I have tried a query like this:
update company_name a
set dob = (select dob from bday b
where b.empid=a.empid
and to_char(a.dob,'dd/mm/yyyy') = '01/05/2011')
Then the dob values in all the rows become null. How can I fix this query?
You're updating every row in the company_name/emp table.
You can fix that with a correlated subquery to make sure the row exists, or more efficiently by placing a primary or unique key on bday.empid and querying:
update (
    select c.dob to_dob,
           d.dob from_dob
    from company_emp c join bday d on (c.empid = d.empid)
    where d.dob = date '2011-05-01')
set to_dob = from_dob
Syntax not tested.
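And for completeness, the correlated-subquery variant mentioned above, also untested; it only touches rows that actually have a matching empid in bday:
UPDATE company_emp a
SET a.dob = (SELECT b.dob
             FROM bday b
             WHERE b.empid = a.empid)
WHERE to_char(a.dob, 'dd/mm/yyyy') = '01/05/2011'
  AND EXISTS (SELECT 1
              FROM bday b
              WHERE b.empid = a.empid);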
