Hive modify partitioned table data - hadoop

Problem: One column value is null. It should be 'ab'. Unfortunately, I wrote '' instead of 'ab'.
My table is a partitioned table. Is there any way to change that value?
I found the following approach, but it seems inefficient:
Create a temp table like my table.
Use INSERT OVERWRITE to read data from my old table and write it to the new table, with a CASE statement to change '' to 'ab'.
Then rename the temp table to the original table.
I am looking for a solution along the lines of updating the partition and running MSCK. Is there a way to do that?

You can overwrite a single partition this way:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table target_table partition (part_col)
select
case when column ='' then 'ab' else column end as column ,
col2, --select all the columns in the same order
col3,
part_col --partition column is the last one
from target_table where part_col='your_partition_value';
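
Before overwriting, it can help to see which partitions actually contain the bad value, so that only those are rewritten. A minimal sketch, assuming the affected column is called col1 (a placeholder, like target_table and part_col in the answer above):

```sql
-- List the partitions that contain the empty-string value and how many rows each has.
-- col1 is a placeholder for the affected column.
SELECT part_col, COUNT(*) AS bad_rows
FROM target_table
WHERE col1 = ''
GROUP BY part_col;
```

Each partition value returned can then be used in the WHERE clause of the overwrite statement.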

One possible solution is to run an UPDATE on the table, provided the column is neither a partitioning nor a bucketing column.
UPDATE tablename SET column = (CASE WHEN column = '' THEN 'ab' else column END) [WHERE expr if any];
Update: To support ACID operations in Hive, set the following:
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
Note: This works only with Hive 0.14 or later.
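
UPDATE also requires the table itself to be transactional. The following is a rough sketch of what such a table might look like in the Hive versions this answer targets, where ACID tables had to be bucketed and stored as ORC. The table name target_table_acid, the column names, and the bucket count are all hypothetical.

```sql
-- Hypothetical transactional table; names and bucket count are illustrative only.
CREATE TABLE target_table_acid (
  col1 STRING,
  col2 STRING,
  col3 STRING
)
PARTITIONED BY (part_col STRING)
CLUSTERED BY (col2) INTO 4 BUCKETS   -- bucketing on a column that will not be updated
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Fix the bad value in one partition; col1 is neither a partition nor a bucket column.
UPDATE target_table_acid
SET col1 = 'ab'
WHERE col1 = '' AND part_col = 'your_partition_value';
```

An existing non-transactional table would first have to be copied into a table like this, so for a one-off fix the INSERT OVERWRITE approach is usually less work.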

Related

HIVE - Cannot partition a table: semantic exception failure

I'm not able to import data into a partitioned table in Hive.
Here is how I create the table:
CREATE TABLE IF NOT EXISTS title_ratings
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
And then I load the data into it:
LOAD DATA INPATH '/title.ratings.tsv.gz' INTO TABLE eval_hive_db.title_ratings;
It works fine up to here. Now I want to create a dynamically partitioned table. First of all, I set these parameters:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I now create my partitioned table:
CREATE TABLE IF NOT EXISTS title_ratings_part
(
tconst STRING,
numVotes INT
)
PARTITIONED BY (averageRating DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;
insert into title_ratings_part partition(title_ratings) select tconst, averageRating, numVotes from title_ratings;
(I also tried with numVotes instead, by the way.)
And I receive this error:
FAILED: ValidationFailureSemanticException eval_hive_db.title_ratings_part: Partition spec {title_ratings=null} contains non-partition columns
Can someone help me, please?
Ideally, I want to partition my table by averageRating (less than 2, between 2 and 4, and greater than 4)
You can run this command to check whether the column contains null values.
select averageRating, count(*) from title_ratings group by averageRating;
If there are null values in this column, they will show up as a NULL group with its row count. Fill those values in, then apply the partitioning again.
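The partition column is stored as the last column in a table, so when inserting you need to keep the correct column order in the select statement.
Please change the order of the columns in the select. The partition clause must also name the partition column (averageRating), not the source table, which is what the "contains non-partition columns" error is complaining about.
insert into title_ratings_part partition(averageRating)
select
tconst,
numVotes,
averageRating -- the partition column must always come last
from title_ratings;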
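
Partitioning directly on averageRating creates one partition per distinct rating value. For the three ranges mentioned in the question, one option is to partition on a derived band column instead. This is only a sketch: the table name title_ratings_by_band, the column rating_band, and the labels 'low', 'mid', 'high', and 'unknown' are assumptions, not part of the original schema.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical table partitioned on a derived band instead of the raw rating.
CREATE TABLE IF NOT EXISTS title_ratings_by_band
(
  tconst STRING,
  averageRating DOUBLE,
  numVotes INT
)
PARTITIONED BY (rating_band STRING)
STORED AS TEXTFILE;

INSERT INTO TABLE title_ratings_by_band PARTITION (rating_band)
SELECT
  tconst,
  averageRating,
  numVotes,
  CASE
    WHEN averageRating IS NULL THEN 'unknown'  -- keep NULL ratings out of the real bands
    WHEN averageRating < 2 THEN 'low'          -- less than 2
    WHEN averageRating <= 4 THEN 'mid'         -- between 2 and 4
    ELSE 'high'                                -- greater than 4
  END AS rating_band                           -- partition column comes last
FROM title_ratings;
```

This keeps averageRating as a regular column, so the exact rating is still available in every row.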

How to replace NULL values in one column to 0 (of a very large table) without creating a new column of the desired results added to the table in HIVE?

I am trying to replace all of the NULL values with 0 in a column of a big table in Hive.
However, every time I try some code I end up generating a new column. The column I am trying to modify still exists and still has the NULL values, while the automatically generated new column (i.e. _c1) contains what I want the original column to look like.
I tried COALESCE, but that also ended up generating a new column. I also tried a CASE WHEN, with the same result.
Select *,
CASE WHEN columnname IS NULL THEN 0
ELSE columnname
END
from tablename;
Also tried
SELECT coalesce(columnname, CAST(0 AS BIGINT)) FROM tablename
I would just like to update the table so that the other columns stay as they are and the column I want to modify keeps its original name, but with 0 in place of the NULL values.
I don't want to generate a new column but modify an existing one.
How should I do that?
Use INSERT OVERWRITE:
insert overwrite table tablename
select c1,c2,...,coalesce(columnname,0) as columnname
from tablename;
Note that you have to specify all the other column names required in select.

Hive update all values in a column

I have an external partitioned Hive table. One of its columns is a string named OLDDATE that holds the date in a different format (DD-MM-YY). I want to update the column and store the dates in YYYY-MM-DD format. All years are 20XX.
So I thought of this
select CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) from table
This gives me the dates in the format I want. Now how do I overwrite the old date with this new date?
You can effect an update by overwriting the table with its own contents, just with the date field changed according to your transformation, like this pseudo-code:
INSERT OVERWRITE TABLE table
SELECT
col1
, col2
...
, CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) AS olddate
...
, coln
FROM table;
@user2441441 To overwrite a partitioned table:
INSERT OVERWRITE TABLE table PARTITION (p_col)
SELECT
col1
, col2
...
, CONCAT('20',SPLIT(OLDDATE ,'-')[2],'-',SPLIT(OLDDATE ,'-')[1],'-',SPLIT(OLDDATE ,'-')[0]) AS olddate
...
, coln
, p_col
FROM table;
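
Because INSERT OVERWRITE replaces the existing data, it may be worth checking the transformation on the current rows first. A minimal sketch, assuming the table is called my_table (a hypothetical name standing in for the real table):

```sql
-- Rows whose OLDDATE is missing or not in DD-MM-YY form; the CONCAT/SPLIT
-- expression would produce a wrong or NULL date for these.
SELECT OLDDATE, COUNT(*) AS row_count
FROM my_table
WHERE OLDDATE IS NULL
   OR NOT (OLDDATE RLIKE '^[0-9]{2}-[0-9]{2}-[0-9]{2}$')
GROUP BY OLDDATE;

-- Preview the converted values next to the originals before overwriting anything.
SELECT OLDDATE,
       CONCAT('20', SPLIT(OLDDATE, '-')[2], '-', SPLIT(OLDDATE, '-')[1], '-', SPLIT(OLDDATE, '-')[0]) AS new_date
FROM my_table
LIMIT 20;
```

If the first query returns no rows, every value matches the expected pattern and the overwrite should convert all of them.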
Since it's a partitioned table, the folder names must be created from the date values.
Hence you are not able to update the values in place.
One workaround is to create a new table, run your query above, and insert the data into the new table.
After that, you can drop your existing table and use the new table in its place.

Hive and Sqoop partition

I have Sqooped data from a Netezza table and the output file is in HDFS, but one column is a timestamp and I want to load it as a date column in my Hive table. I want to partition the table on that date column. How can I do that?
Example: in HDFS the data looks like 2013-07-30 11:08:36
In Hive I want to load only the date (2013-07-30), not the timestamp. I want to partition on that column daily.
How can I pass the partition column dynamically?
I have tried loading the data into one table as a source. In the final table I would do insert overwrite table partition by (date_column=dynamic date) select * from table1
Set these two properties:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
The query can then look like this:
INSERT OVERWRITE TABLE TABLE PARTITION (DATE_STR)
SELECT
:
:
-- Partition Col is the last column
to_date(date_column) DATE_STR
FROM table1;
You can also explore the following two options of Sqoop's hive-import. If it is an incremental import, you will be able to get the current day's partition.
--hive-partition-key
--hive-partition-value
You can just load the EMP_HISTORY table from EMP by enabling dynamic partitioning and converting the timestamp to a date with the to_date function.
The code might look something like this:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE EMP_HISTORY PARTITION (join_date)
SELECT e.name as name, e.age as age, e.salary as salary, e.loc as loc, to_date(e.join_date) as join_date from EMP e;

How to insert init-data into a table in hive?

I wanted to insert some initial data into a table in Hive, so I wrote the HQL below:
INSERT OVERWRITE TABLE table PARTITION(dt='2014-06-26') SELECT 'key_sum' as key, '0' as value;
but it does not work.
I also tried a similar query:
INSERT OVERWRITE TABLE table PARTITION(dt='2014-06-26') SELECT 'key_sum' as key, '0' as value FROM table limit 1;
But it also didn't work, as I see that the tables are empty.
How can I set the initial data into the table?
(There is a reason why I have to do a self-join.)
The first HQL statement needs a FROM clause. It is missing, so the statement fails:
INSERT OVERWRITE TABLE table PARTITION(dt='2014-06-26') SELECT 'key_sum' as key, '0' as value;
For the second HQL statement, the table in the FROM clause must have at least one row, so that the constant init values can be selected and written into your newly created table.
INSERT OVERWRITE TABLE table PARTITION(dt='2014-06-26') SELECT 'key_sum', '0' FROM table limit 1;
You can use any existing Hive table that has data in it and give it a try.
The following query works fine if the test table has already been created in Hive.
INSERT OVERWRITE TABLE test PARTITION(dt='2014-06-26') SELECT 'key_sum' as key, '0' as value FROM test;
I think the table we insert into must be created first.
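
On Hive 0.14 or later, initial rows can also be inserted without selecting from another table, using INSERT ... VALUES. A small sketch under that version assumption; the table name init_demo is hypothetical, and the column names key and value are taken from the question:

```sql
-- Create the target table first (hypothetical name and schema).
CREATE TABLE IF NOT EXISTS init_demo
(
  key STRING,
  value STRING
)
PARTITIONED BY (dt STRING);

-- Insert one constant row into a static partition.
INSERT INTO TABLE init_demo PARTITION (dt='2014-06-26')
VALUES ('key_sum', '0');
```

Note that this form uses INSERT INTO, which appends to the partition rather than replacing its contents.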

Resources