How to alter/add a column to a JSON SerDe table - Hadoop

I have a table:
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ.testtable (
x BIGINT,
y STRING,
z STRING
)
PARTITIONED BY (
date string,
hour STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'paths'='x, y, z')
STORED AS TEXTFILE
LOCATION 'testlocation/testtable'
The table holds a huge amount of JSON data. I want to add one more column, c, to the existing table, so I tried:
1. ALTER TABLE XYZ.testtable ADD COLUMNS (c STRING);
2. ALTER TABLE XYZ.testtable SET SERDEPROPERTIES ('paths'='x, y, z, c');
but the c value, which is present in the JSON files, comes back as NULL.
I tried dropping and recreating the table with the c column, and that worked fine. Can anyone help with how to alter a JsonSerDe table to add a column?
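For a partitioned table like this, a likely cause is that ALTER TABLE ... ADD COLUMNS only updates the table-level schema, so partitions created before the change keep the old column list and return NULL for c. A sketch of a possible fix, assuming Hive 1.1+ (where the CASCADE keyword is available):
-- Add the column to the table and, via CASCADE, to existing partition metadata:
ALTER TABLE XYZ.testtable ADD COLUMNS (c STRING) CASCADE;
-- Point the JsonSerDe at the new field, as before:
ALTER TABLE XYZ.testtable SET SERDEPROPERTIES ('paths'='x, y, z, c');
If older partitions still return NULL, the SerDe properties may also need updating per partition, e.g. ALTER TABLE XYZ.testtable PARTITION (`date`='...', hour='...') SET SERDEPROPERTIES (...).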

Related

Change Hive External Table Column names to upper case and add new columns

I have an external table, for example dump_table, which is partitioned by year, month and day. If I run show create table dump_table I get the following:
CREATE EXTERNAL TABLE `dump_table`
(
`col_name` double,
`col_name_2` timestamp
)
PARTITIONED BY (
`year` int,
`month` int,
`day` int)
CLUSTERED BY (
someid)
INTO 32 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://somecluster/test.db/dump_table'
TBLPROPERTIES (
'orc.compression'='SNAPPY',
'transient_lastDdlTime'='1564476840')
I have to change its columns to upper case and also add new columns, so it will become something like:
CREATE EXTERNAL TABLE `dump_table_2`
(
`COL_NAME` DOUBLE,
`COL_NAME_2` TIMESTAMP,
`NEW_COL` DOUBLE
)
PARTITIONED BY (
`year` int,
`month` int,
`day` int)
CLUSTERED BY (
someid)
Option 1:
As an option I can run CHANGE (DDL reference here) to change the column names and then add the new columns. BUT I do not have any backup for this table and it contains a lot of data. If anything goes wrong I might lose data.
Can I create a new external table and migrate the data, partition by partition, from dump_table to dump_table_2? What would the query for this migration look like?
Is there a better way of achieving this use case? Please help.
You can create the new table dump_table_2 with the new columns and load the data using SQL:
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dump_table_2 partition (`year`, `month`, `day`)
select col1,
...
colN,
`year`, `month`, `day`
from dump_table t --join other tables if necessary to calculate columns
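Since NEW_COL does not exist in the source table, one approach (a sketch, reusing the column names from the question) is to load a typed NULL for it:
insert overwrite table dump_table_2 partition (`year`, `month`, `day`)
select col_name,
col_name_2,
cast(null as double) as new_col, --no source data for NEW_COL yet
`year`, `month`, `day`
from dump_table;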

HIVE - create external tables where string itself contains commas

I am new to Hive and am creating external tables on a CSV file. One of the issues I am coming across is values that contain multiple commas within the string itself. For example, the CSV file contains the following:
[screenshot: CSV file]
When I create an external table in Hive, because there are commas within the "name" column, it shifts the first name to the right, adding another column. This throws all of the data off when you view the table in Hive.
[screenshot: external table result in Hive]
Is there anything I can add to my script to keep the commas but also keep first and last name in the same column when the external table is created? Thank you all in advance - I am very new to Hive.
CREATE EXTERNAL TABLE database.tablename (
ID INT,
Name String,
City String,
State String
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/xyz/xyz/database/directory/'
TBLPROPERTIES ("skip.header.line.count"="1");
Check this solution - you need to add this line: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
https://community.cloudera.com/t5/Support-Questions/comma-in-between-data-of-csv-mapped-to-external-table-in/td-p/220193
Complete DDL example:
create table hcc(field1 string,
field2 string,
field3 string,
field4 string,
field5 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"");

Hive Partition By dynamic value in s3 file name

Assuming an S3 location with required data is of the form:
s3://stack-overflow-example/v1/
where each file title in v1/ is of the form
francesco_{YYY_DD_MM_HH}_totti.csv
and each csv file contains a unix timestamp as a column in each row.
Is it possible to create an external Hive table partitioned by the {YYY_DD_MM_HH} in each file name, without first creating an unpartitioned table?
I have tried the below:
create external table so_test
(
a int,
b int,
unixtimestamp string
)
PARTITIONED BY (
from_unixtime(CAST(ord/1000 as BIGINT), 'yyyy-MM-dd') string
)
LOCATION 's3://stack-overflow-example/v1'
but this fails.
An option that should work is creating an unpartitioned table like the below:
create external table so_test
(
a int,
b int,
unixtimestamp string
)
LOCATION 's3://stack-overflow-example/v1';
and then dynamically inserting into a partitioned table:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
create external table so_test_partitioned
(
a int,
b int,
unixtimestamp string
)
PARTITIONED BY (
datep string
)
LOCATION 's3://stack-overflow-example/v1';
INSERT OVERWRITE TABLE so_test_partitioned PARTITION (datep)
select
a,
b,
unixtimestamp,
from_unixtime(CAST(unixtimestamp/1000 AS BIGINT), 'yyyy-MM-dd') as datep
from so_test;
Is creating an unpartitioned table first the only way?
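For reference, Hive maps partitions to directories rather than to file names, so a table cannot be partitioned directly on a token inside a file name. Besides the two-step insert above, another route is to rewrite the files under partition-style prefixes and let Hive discover them; a sketch, assuming a hypothetical layout such as s3://stack-overflow-example/v1/datep=2019-01-01/:
-- Discover all datep=... directories under the table location:
MSCK REPAIR TABLE so_test_partitioned;
-- Or register a single partition explicitly:
ALTER TABLE so_test_partitioned ADD PARTITION (datep = '2019-01-01')
LOCATION 's3://stack-overflow-example/v1/datep=2019-01-01/';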

How to add column comments to the hive table which is using row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

I am trying to add column comments to a Hive table (cities_v2) which uses org.apache.hadoop.hive.serde2.OpenCSVSerde. When I run the ALTER queries to add a column comment, they run fine without errors, but the column comment still remains "from deserializer". Please help me.
Queries used to alter the table to add comments:
alter table cities_v2 change city_id city_id string COMMENT 'Unique ID from DCM';
alter table cities_v2 change city city string COMMENT 'City name, in English';
create table query:
CREATE EXTERNAL TABLE IF NOT EXISTS cities_v2 (
city_id INT ,
city STRING
)
PARTITIONED BY (filedate_pst STRING)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar" = "\," ,
"quoteChar" = "\"")
LOCATION '/common/data/dfa/cities_v2/'
tblproperties ("skip.header.line.count"="1");
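A possible explanation: the "from deserializer" placeholder appears for SerDes whose schema Hive reads from the deserializer rather than from the metastore (those not listed in hive.serdes.using.metastore.for.schema), so column comments stored by ALTER TABLE are not shown by DESCRIBE. One workaround sketch keeps the comments in table properties instead; the comment.* keys below are hypothetical names, not a Hive convention:
-- Hypothetical comment.* keys; TBLPROPERTIES accepts arbitrary key/value pairs:
ALTER TABLE cities_v2 SET TBLPROPERTIES (
'comment.city_id' = 'Unique ID from DCM',
'comment.city' = 'City name, in English');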

HIVE External Table - Set Empty Strings to NULL

Currently I have a HIVE 0.7 instance on Amazon EMR. I am trying to create a duplicate of this instance on a new EMR cluster using Hive 0.11.
In my 0.7 instance I have an external table that will set empty strings to NULL. Here is how I create the table:
CREATE EXTERNAL TABLE IF NOT EXISTS tablename
(column1 string,
column2 string)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
TBLPROPERTIES ('serialization.null.format' = '');
Data is added to the table like this:
ALTER TABLE tablename
ADD PARTITION (year = '2013', month = '10', day='01')
LOCATION '/location_in_hdfs';
This works great in 0.7, but in 0.11 it doesn't seem to be evaluating my empty strings as NULLs. Interestingly, creating a normal (non-external) table with the same data and table definition does evaluate empty strings as NULLs as expected.
Is there a different way to do this with an external table in 0.11?
Hive's default partition properties are overriding the table properties, so set the SerDe property on the partition as well. ADD PARTITION and SET SERDEPROPERTIES are separate ALTER TABLE statements:
ALTER TABLE tablename ADD PARTITION (year = '2013', month = '10', day = '01') LOCATION '/location_in_hdfs';
ALTER TABLE tablename PARTITION (year = '2013', month = '10', day = '01') SET SERDEPROPERTIES ('serialization.null.format' = '');
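A quick sanity check that the partition picked up the property (a sketch):
-- Rows where column1 was an empty string should now surface as NULL:
SELECT count(*)
FROM tablename
WHERE year = '2013' AND month = '10' AND day = '01'
AND column1 IS NULL;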
