Range Partitioning in Hive - hadoop

Does Hive support range partitioning?
I mean, does Hive support something like the below:
insert overwrite table table2 PARTITION (employeeId BETWEEN 2001 and 3000)
select employeeName FROM emp10 where employeeId BETWEEN 2001 and 3000;
where both table2 and emp10 have two columns:
employeeName and
employeeId.
When I run the above query, I get this error:
FAILED: ParseException line 1:56 mismatched input 'BETWEEN' expecting ) near 'employeeId' in destination specification

It is not possible. Here is a quote from the Hive documentation:
A table can have one or more partition columns and a separate data directory is created for each distinct value combination in the partition columns.
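In other words, a partition spec must name a constant value per partition (or, with dynamic partitioning, a column), never a predicate such as BETWEEN. The closest static equivalent is one insert per range bucket; a minimal sketch, assuming a hypothetical variant of table2 (call it table2_bucketed) partitioned by a bucket column employeeId_range:
-- hypothetical table: partitioned by a bucket number rather than a range
create table table2_bucketed (employeeName STRING, employeeId INT)
partitioned by (employeeId_range INT);
-- one static insert per bucket; the partition value is a constant
insert overwrite table table2_bucketed PARTITION (employeeId_range=2)
select employeeName, employeeId
from emp10
where employeeId BETWEEN 2000 and 2999;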

No, it's not possible. But you can use a separate calculated column instead, e.g.:
insert overwrite table table2 PARTITION (employeeId_range)
select employeeName, employeeId, floor(employeeId/1000) as employeeId_range
FROM emp10 where employeeId BETWEEN 2000 and 2999;
which makes sure all values in that range fall into the same partition.
When querying the table, since we already know the range calculation, we can do:
select employeeName, employeeId FROM table2 where employeeId_range = 2;
Thus we can also parallelize the queries over the given ranges.
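Note that a dynamic-partition insert like the one above also needs these session settings (nonstrict mode because no static partition value is supplied):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;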
Hope it helps.

Related

How to use derived columns in the same Hive table?

Could you please help me with the below query?
Suppose there is a table employee with columns A, B, and a Date column.
I have to load data from table employee into another table emp with the below transformations applied.
Transformations on the employee table:
1. Absolute value of column A (the column name in emp will be ABS_A)
2. Absolute value of column B (the column name in emp will be ABS_B)
3. Find the sum(ABS_A) for a given Date
4. Find the sum(ABS_B) for a given Date
5. Find sum(ABS_A)/sum(ABS_B); the column name will be Average.
So the final table emp will have the below columns:
1. A
2. B
3. ABS_A
4. ABS_B
5. Average
How to handle such derived column in hive?
I tried the below query but it's not working. Could anyone guide me?
insert overwrite into emp
select
A,
B,
ABS(A) as ABS_A,
ABS(B) as ABS_B,
sum(ABS_A) OVER PARTION BY DATE AS sum_OF_A,
sum(ABS_B) OVER PARTTION BY DATE AS sum_of_b,
avg(sum_of_A,sum_of_b) over partition by date as average
from employee
Hive does not support referencing column aliases (derived columns) at the same subquery level. Use a subquery, or repeat the functions in place of the column aliases.
insert overwrite table emp
select A, B, ABS_A, ABS_B, sum_OF_A, sum_of_b, `date`, sum_OF_A/sum_of_b as average
from
(
select A, B, ABS(A) as ABS_A, ABS(B) as ABS_B, `date`,
sum(ABS(A)) OVER (PARTITION BY `date`) AS sum_OF_A,
sum(ABS(B)) OVER (PARTITION BY `date`) AS sum_of_b
from employee
)s;
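Alternatively, taking the "functions in place of column aliases" route, the same load can be written at a single level; a sketch that yields exactly the five columns the question lists (window results can appear inside arithmetic in Hive's select list):
insert overwrite table emp
select A, B,
       ABS(A) as ABS_A,
       ABS(B) as ABS_B,
       -- repeat the window expressions instead of referencing their aliases
       sum(ABS(A)) over (partition by `date`) / sum(ABS(B)) over (partition by `date`) as average
from employee;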

Loading Data into an empty Impala Table with account data partitioned by area code

I'm trying to copy data from a table called accounts into an empty table called accounts_by_area_code. I have the following fields in accounts_by_area_code: acct_num INT, first_name STRING, last_name STRING, phone_number STRING. The table is partitioned by areacode (the first 3 digits of phone_number).
I need to use a SELECT statement that extracts the area code inside an INSERT INTO TABLE command, to copy the specified columns to the new table, dynamically partitioning by area code.
This is my last attempt:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num, first_name, last_name, phone_number, areacode) PARTITION (areacode) SELECT STRLEFT (phone_number,3) AS areacode FROM accounts;"
This generates ERROR: AnalysisException: Column permutation and PARTITION clause mention more columns (5) than the SELECT / VALUES clause and PARTITION clause return (1). I'm not convinced I have even the basic syntax correct, so any help would be great as I'm new to Impala.
Impala creates partitions dynamically based on the data, so I'm not sure why you want to create an empty table with partitions: they will be created automatically while inserting new data.
Still, I think you can pre-create the partitions like this:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num) PARTITION (areacode)
SELECT CAST(NULL AS INT), STRLEFT(phone_number, 3) AS areacode FROM accounts;"
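As for the original error: the column permutation plus the PARTITION clause name five columns, so the SELECT must return five expressions, with the partition value last. A sketch using the column names from the question:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode PARTITION (areacode)
SELECT acct_num, first_name, last_name, phone_number, STRLEFT(phone_number, 3) FROM accounts;"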

How to copy all constraints and data from one schema to another in Oracle

I am using Toad for Oracle 12c. I need to copy a table and its data (40M rows) from one schema to another (prod to test). However, there is a unique key (not the PK of this table) on a column called record_id, which has data like 3.000*******19E15. About 2M rows get the same numbers (I believe because the numbers are very large) even though they are unique in prod. When I try to copy, it violates the unique key on that column. I am using Toad's "export data to another schema" function to copy the data.
When I execute these queries in prod:
select count(*) from table_name
or
select count(distinct record_id) from table_name
both queries give exactly the same count.
I don't have DBA permission. How do I copy all the data without violating the unique key of the table?
Thanks in advance!
You can use an upsert (a conditional INSERT or UPDATE), or you may write a small procedure for this.
You may also consider NOT EXISTS, but your data is big and it might not be resource-efficient:
insert into prod_tab
select * from other_tab t1 where not exists (
    select 1 from prod_tab t2 where t1.record_id = t2.record_id
);
In Oracle you can use a MERGE query for that.
The query proceeds as follows for each data row:
if the source record_id does not yet exist in the target table, a new record is inserted;
else, the existing record is updated with the source values.
For the sake of the example, I assumed that there are two other columns in the table: column1 and column2.
MERGE INTO target_table t1
USING (SELECT * FROM source_table) t2
ON (t1.record_id = t2.record_id)
WHEN MATCHED THEN UPDATE SET
    t1.column1 = t2.column1,
    t1.column2 = t2.column2
WHEN NOT MATCHED THEN INSERT
    (record_id, column1, column2) VALUES (t2.record_id, t2.column1, t2.column2);
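If you want to see which record_id values actually collide in the target, a quick diagnostic query (target_table as in the MERGE above):
select record_id, count(*)
from target_table
group by record_id
having count(*) > 1;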

How to join two tables using a non-primary key and remove duplicates in the result set

I am using MS Access 2013 DB
I have two tables
Table1:
StartDate, EndDate, ID1, ID2, ProgramName, LanguageID, Language, Gender, CenterName, ZoneName
Table2:
StartDate, EndDate, ID3, ProgramName, LanguageID, Language, Gender, CenterName, ZoneName
I want to join these two tables and remove duplicates by comparing the following columns from both tables:
StartDate, EndDate, ProgramName, LanguageID, Language, Gender, CenterName, ZoneName
Some of the data in the StartDate and EndDate columns is null. The resulting table should contain the following columns with no duplicate data:
StartDate, EndDate, ID1, ID2, ID3, ProgramName, LanguageID, Language, Gender, CenterName, ZoneName
First, you want to create a new table (with the combined structure, including ID1, ID2 and ID3).
Second, insert into the new table every record from table1 and table2 whose compared column values are different (i.e., without duplicates), as in the sketch below. Remember to check the null values in StartDate and EndDate when comparing.
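A sketch of that second step as a single Access SQL SELECT, to feed into an INSERT or make-table query (Table1/Table2 as named above; Access has no FULL OUTER JOIN, so three UNION branches cover matched rows, Table1-only rows and Table2-only rows, the date comparisons are written null-safely because of the nulls mentioned in the question, and UNION itself discards exact duplicates):
SELECT t1.StartDate, t1.EndDate, t1.ID1, t1.ID2, t2.ID3,
       t1.ProgramName, t1.LanguageID, t1.Language, t1.Gender, t1.CenterName, t1.ZoneName
FROM Table1 AS t1, Table2 AS t2
WHERE t1.ProgramName = t2.ProgramName AND t1.LanguageID = t2.LanguageID
  AND t1.Language = t2.Language AND t1.Gender = t2.Gender
  AND t1.CenterName = t2.CenterName AND t1.ZoneName = t2.ZoneName
  AND ((t1.StartDate = t2.StartDate) OR (t1.StartDate IS NULL AND t2.StartDate IS NULL))
  AND ((t1.EndDate = t2.EndDate) OR (t1.EndDate IS NULL AND t2.EndDate IS NULL))
UNION
SELECT t1.StartDate, t1.EndDate, t1.ID1, t1.ID2, NULL,
       t1.ProgramName, t1.LanguageID, t1.Language, t1.Gender, t1.CenterName, t1.ZoneName
FROM Table1 AS t1
WHERE NOT EXISTS (
    SELECT 1 FROM Table2 AS t2
    WHERE t1.ProgramName = t2.ProgramName AND t1.LanguageID = t2.LanguageID
      AND t1.Language = t2.Language AND t1.Gender = t2.Gender
      AND t1.CenterName = t2.CenterName AND t1.ZoneName = t2.ZoneName
      AND ((t1.StartDate = t2.StartDate) OR (t1.StartDate IS NULL AND t2.StartDate IS NULL))
      AND ((t1.EndDate = t2.EndDate) OR (t1.EndDate IS NULL AND t2.EndDate IS NULL))
)
UNION
SELECT t2.StartDate, t2.EndDate, NULL, NULL, t2.ID3,
       t2.ProgramName, t2.LanguageID, t2.Language, t2.Gender, t2.CenterName, t2.ZoneName
FROM Table2 AS t2
WHERE NOT EXISTS (
    SELECT 1 FROM Table1 AS t1
    WHERE t1.ProgramName = t2.ProgramName AND t1.LanguageID = t2.LanguageID
      AND t1.Language = t2.Language AND t1.Gender = t2.Gender
      AND t1.CenterName = t2.CenterName AND t1.ZoneName = t2.ZoneName
      AND ((t1.StartDate = t2.StartDate) OR (t1.StartDate IS NULL AND t2.StartDate IS NULL))
      AND ((t1.EndDate = t2.EndDate) OR (t1.EndDate IS NULL AND t2.EndDate IS NULL))
);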

My Hadoop interview scenario-based query (the solution can be in Hive/Pig/MapReduce)

I have data in a file like the below (comma separated).
ID,Name,Sal
101,Ramesh,M,1000
102,Prasad,K,500
I want the output table to be like below
101, Ramesh M, 1000
102, Prasad K, 500
i.e. Name and Surname in a single column in the output.
In Hive, if I just declare the table with row format delimited fields terminated by ',', it will not work, since the name is split across two comma-separated fields. Do we need to write a SerDe?
The solution can be in MR or Pig also.
Why don't you use the concat function? If you don't want to process the data and just want to query the raw data, think about creating a view on it:
select ID, concat(Name, ' ', Surname), Sal from table;
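A sketch of that view (the view name emp_v and raw-table name emp_raw are hypothetical; Name/Surname follow the snippet above):
CREATE VIEW emp_v AS
SELECT ID, concat(Name, ' ', Surname) AS FullName, Sal
FROM emp_raw;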
You can use the concat function.
First, you can create a table (i.e. table1) over the raw data, with 4 columns delimited by commas (a DDL sketch follows):
ID, first_name, last_name, salary
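A sketch of that DDL (the file location is hypothetical; skip.header.line.count drops the header row shown in the sample data):
CREATE EXTERNAL TABLE table1 (
  ID INT,
  first_name STRING,
  last_name STRING,
  salary INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/emp_raw'  -- hypothetical path
TBLPROPERTIES ('skip.header.line.count'='1');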
Then concat first_name and last_name in a select query and store the result in another table using the CTAS (CREATE TABLE AS SELECT) feature:
CREATE TABLE EMP_TABLE AS SELECT ID, CONCAT(first_name, ' ', last_name) AS NAME, salary FROM table1;
