Add conditional field to table in Hive or Impala - hadoop

I have a massive table stored as parquet and I need to add columns based on conditions.
Is there a way to do that without having to recreate a new table in Hive or Impala?
Something like this?
ALTER TABLE xyz
ADD COLUMN flag AS (CASE WHEN ... END)
Thank you

I don't believe that Hive or Impala support computed columns. This type of calculation is often done using a view:
CREATE VIEW v_xyz AS
SELECT xyz.*,
(CASE WHEN ... END) as flag
FROM xyz;
You can then update the view at any time to adjust the logic or add new columns.
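As a minimal sketch of the view approach above, assuming a hypothetical table orders_example with a numeric column amount (neither name comes from the question), the conditional flag could be written like this:

```sql
-- Hypothetical names: orders_example, amount, v_orders_example.
CREATE VIEW v_orders_example AS
SELECT o.*,
       CASE WHEN o.amount >= 100 THEN 'high' ELSE 'low' END AS flag
FROM orders_example o;
```

Queries then read from v_orders_example instead of the base table, so the Parquet files are never rewritten; the expression is evaluated at query time.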

Related

How to create table in Hive with specific column values from another table

I am new to Hive and have some problems. I tried to find an answer here and on other sites, but with no luck. I also tried many different queries that came to mind, also without success.
I have my source table and I want to create a new table like this.
Where:
id would be number of distinct counties as auto increment numbers and primary key
counties as distinct names of counties (from source table)
You could follow this approach: a CTAS (Create Table As Select).
With your example, this CTAS could work:
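-- Deduplicate first, then number the rows; numbering before DISTINCT would make every row unique.
CREATE TABLE t_county
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE AS
WITH t AS (
SELECT DISTINCT county
FROM counties)
SELECT ROW_NUMBER() OVER (ORDER BY county) AS id, county
FROM t;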
You cannot have primary keys or foreign keys in Hive the way you have them in RDBMSs like Oracle or MySQL. Hive is schema-on-read rather than schema-on-write like Oracle, so it does not enforce constraints on the data.
I will not give you the exact answer, because you should try it yourself first and then come back here if you have a problem or a doubt. What I can tell you is that you can use the INSERT statement to fill a new table with data from another table, e.g.:
CREATE TABLE CARS (name STRING);
INSERT INTO TABLE CARS SELECT x FROM TABLE_2;
You can also use INSERT OVERWRITE instead if you want to delete all the existing data in that table (CARS).
So, the operation will be
CREATE TABLE ==> INSERT OPERATION (OVERWRITE?) + QUERY OPERATION
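The flow above could look roughly like this, reusing the counties source table from the question; county_list is a made-up name for the target:

```sql
-- county_list is a hypothetical target table.
CREATE TABLE IF NOT EXISTS county_list (county STRING);

-- OVERWRITE replaces whatever the table held before; use INSERT INTO to append.
INSERT OVERWRITE TABLE county_list
SELECT DISTINCT county
FROM counties;
```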
Hive is not an RDBMS, so there is no enforced primary key or foreign key.
But you can add a unique ID column in Hive (a random UUID rather than an incrementing number). Please try:
Create table new_table as
select reflect("java.util.UUID", "randomUUID") id, countries from my_source_table;

select all but few columns in impala

Is there a way to replicate the below in impala?
SET hive.support.quoted.identifiers=none
INSERT OVERWRITE TABLE MyTableParquet PARTITION (A='SumVal', B='SumOtherVal') SELECT `(A)?+.+` FROM MyTxtTable WHERE A='SumVal'
Basically I have a text table in Hive with 1000 fields, and I need a SELECT that drops field A. The above works in Hive but not in Impala. How can I do this in Impala without listing all the other 999 fields explicitly?

Add a fixed value column to a table in Hive

I am prototyping my Pig script in Hive. I need to add a status column to a table that is imported from an Oracle database.
My Pig script looks like this:
user_data = LOAD 'USER_DATA' USING PigStorage(',') AS (USER_ID:int,MANAGER_ID:int,USER_NAME:int);
user_data_status = FOREACH user_data GENERATE
USER_ID,
MANAGER_ID,
USER_NAME,
'active' AS STATUS;
Here I am adding the STATUS column with 'active' value to the user_data table.
How can I add a column to an existing table while importing it via HiveQL?
As far as I know, you will have to reload the data as you did in Pig.
For example, if you already have the table user_data with columns USER_ID:int, MANAGER_ID:int, USER_NAME:int and you want USER_ID:int, MANAGER_ID:int, USER_NAME:int plus a STATUS column set to 'active',
you can load the table user_data_status with something like this:
INSERT OVERWRITE TABLE user_data_status SELECT *, 'active' AS STATUS FROM user_data;
There are options to add columns to the existing table, but that only updates the metadata in the metastore, and the values default to NULL.
If I were you, I would reload the complete data rather than alter the column structure and then update the whole table with an UPDATE command. Hope this helps!
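To illustrate the metadata-only option mentioned above, here is a rough sketch. It assumes user_data is an unpartitioned table, and the status column name and type are made up for the example:

```sql
-- Adds the column to the table definition only; no data files are rewritten.
ALTER TABLE user_data ADD COLUMNS (status STRING);

-- Existing rows have no stored value for the new column, so it reads as NULL.
-- COALESCE supplies the constant at query time instead.
SELECT user_id, manager_id, user_name, COALESCE(status, 'active') AS status
FROM user_data;
```

This avoids rewriting the data, at the cost of repeating the COALESCE in every query (or hiding it behind a view).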

HIVE: How create a table with all columns in another table EXCEPT one of them?

When I need to change a column into a partition (convert normal column as partition column in hive), I want to create a new table to copy all columns except one. I currently have >50 columns in the original table. Is there any clean way of doing that?
Something like:
CREATE student_copy LIKE student EXCEPT age and hair_color;
Thanks!
You can use a CTAS with a regex column specification:
set hive.support.quoted.identifiers=none;
CREATE TABLE student_copy AS SELECT `(age|hair_color)?+.+` FROM student;
set hive.support.quoted.identifiers=column;
BUT (as mentioned by Kishore Kumar Suthar) this will not create a partitioned table, as that is not supported with CTAS (Create Table As Select).
The only way I see for you to get your partitioned table is to get the complete CREATE statement of the table (as mentioned by Abraham):
SHOW CREATE TABLE student;
Then alter it to create a partition on the column you want. After that you can use the SELECT with the regex when inserting into the new table.
If your partition column is already part of this SELECT, you need to make sure it is the last column you insert. If it is not, you can exclude that column in the regex and include it explicitly as the last column. Also, if you expect several partitions to be created by your INSERT statement, you need to enable dynamic partitioning:
set hive.support.quoted.identifiers=none;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE student_copy PARTITION(partcol1) SELECT `(age|hair_color|partcol1)?+.+`, partcol1 FROM student;
set hive.support.quoted.identifiers=column;
Setting hive.support.quoted.identifiers=none is required for the backtick-quoted regex in the query. I set this parameter back to its original value, hive.support.quoted.identifiers=column, after my statement.
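For illustration only, the edited CREATE statement for the partitioned target might look like the sketch below. The column list (name, class), the partition column type, and the storage format are assumptions; in practice they would come from the SHOW CREATE TABLE output.

```sql
-- Hypothetical columns; copy the real list from SHOW CREATE TABLE student.
CREATE TABLE student_copy (
  name  STRING,
  class STRING
)
PARTITIONED BY (partcol1 STRING)  -- the partition column is declared here, not in the column list
STORED AS PARQUET;
```

The partition column appears only in PARTITIONED BY, which is why the INSERT has to select it last.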
CREATE TABLE student_copy LIKE student;
It just copies the source table definition.
CREATE TABLE student_copy AS select name, age, class from student;
This copies the structure as well as the data. The target cannot be a partitioned table or an external table.
I use the command below to get the CREATE statement of an existing table.
SHOW CREATE TABLE student;
Copy the result and modify that based on your requirement for new table and run the modified command to get the new table.

How to let CREATE TABLE...AS SELECT in HIVE do not populate data?

When I run CTAS in Hive, the data is populated as well. I just want to create the table without populating the data. How should I do that? Thanks.
You can do that by using the LIKE keyword.
create table new_table_name LIKE old_table_name
This will create the table structure without the data.
Use CREATE EXTERNAL TABLE instead of CREATE TABLE. Note the EXTERNAL keyword.
Use a WHERE condition in the SELECT statement that matches no records in the table.
Example table demo1:
id  name  country
1   abc   India
2   xyz   Germany
3   pqr   France
In CREATE TABLE...AS SELECT in Hive:
Create table demo2...As SELECT id, name, country from demo1 where id=0;
In the WHERE condition above, id is compared with 0, and since no row in the data has that id, the SELECT statement fetches no records. Similarly, choose any WHERE condition that returns no records, and no data will be inserted into the newly created table.
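A variant of the same idea that does not depend on the data is a predicate that can never be true. This is a sketch using the demo1 table above; demo2_empty is a made-up name:

```sql
-- 1 = 0 is false for every row, so only the column definitions are carried over.
CREATE TABLE demo2_empty AS
SELECT id, name, country
FROM demo1
WHERE 1 = 0;
```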
@Sunil's answer helped me as well; I am just posting an addition that was necessary in my case.
The source table was in Avro format but the new one I wanted in ORC, hence,
CREATE TABLE dataaggregate_orc_empty LIKE dataaggregate_avro_compressed
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES ('orc.compress'='ZLIB');
The above step can be split in two steps, if required :
CREATE TABLE dataaggregate_orc_empty LIKE dataaggregate_avro_compressed;
alter table dataaggregate_orc_empty set fileformat ORC;
I would be glad if someone provides inputs for the data format changes that occur in this process and related problems, if any.
