Hive Partition By dynamic value in s3 file name - hadoop

Assuming an S3 location with the required data is of the form:
s3://stack-overflow-example/v1/
where each file name in v1/ is of the form
francesco_{YYY_DD_MM_HH}_totti.csv
and each csv file contains a unix timestamp as a column in each row.
Is it possible to create an external hive table partitioned by the {YYY_DD_MM_HH} in each file name without first creating an unpartitioned table?
I have tried the below:
create external table so_test
(
a int,
b int,
unixtimestamp string
)
PARTITIONED BY (
from_unixtime(CAST(ord/1000 as BIGINT), 'yyyy-MM-dd') string
)
LOCATION 's3://stack-overflow-example/v1'
but this fails, since PARTITIONED BY only accepts plain column definitions (name and type), not expressions.
An option that should work is creating an unpartitioned table like the below:
create external table so_test
(
a int,
b int,
unixtimestamp string
)
LOCATION 's3://stack-overflow-example/v1';
and then dynamically inserting into a partitioned table:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
create external table so_test_partitioned
(
a int,
b int,
unixtimestamp string
)
PARTITIONED BY (
datep string
)
LOCATION 's3://stack-overflow-example/v1';
INSERT OVERWRITE TABLE so_test_partitioned PARTITION (datep)
select
a,
b,
unixtimestamp,
from_unixtime(CAST(unixtimestamp/1000 as BIGINT), 'yyyy-MM-dd') as datep
from so_test;
Is creating an unpartitioned table first the only way?

Related

Creating external hive table from parquet file which contains json string

I have a parquet file which is stored in a partitioned directory. The format of the partition is
/dates=*/hour=*/something.parquet.
The content of the parquet file looks as follows:
{a:1,b:2,c:3}
This is JSON data and I want to create an external Hive table over it.
My approach:
CREATE EXTERNAL TABLE test_table (a int, b int, c int) PARTITIONED BY (dates string, hour string) STORED AS PARQUET LOCATION '/user/output/';
After that I run MSCK REPAIR TABLE test_table; but I get the following output:
hive> select * from test_table;
OK
NULL NULL NULL 2021-09-27 09
The other three columns are null. I think I have to define the JSON schema somehow, but I have no idea how to proceed further.
Create a table with the same schema as the parquet file (a single string column holding the JSON):
CREATE EXTERNAL TABLE test_table (value string) PARTITIONED BY (dates string, hour string) STORED AS PARQUET LOCATION '/user/output/';
Run repair table to mount partitions:
MSCK REPAIR TABLE test_table;
Parse value in query:
select e.a, e.b, e.c
from test_table t
lateral view json_tuple(t.value, 'a', 'b', 'c') e as a,b,c
Cast values as int if necessary: cast(e.a as int) as a
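Putting the steps together, a minimal sketch of the full query (the partition filter values are just the ones from the question's example output):
select cast(e.a as int) as a,
cast(e.b as int) as b,
cast(e.c as int) as c,
t.dates,
t.hour
from test_table t
lateral view json_tuple(t.value, 'a', 'b', 'c') e as a, b, c
where t.dates = '2021-09-27' and t.hour = '09';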
You can also create a table that exposes the JSON fields as columns:
CREATE EXTERNAL TABLE IF NOT EXISTS test_table(
a INT,
b INT,
c INT)
partitioned by (dates string, hour string)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS PARQUET
location '/user/output/';
Then run MSCK REPAIR TABLE test_table;
You would be able to query directly without writing any parsers.

Change Hive External Table Column names to upper case and add new columns

I have an external table, for example dump_table, which is partitioned over year, month and day. If I run show create table dump_table I get the following:
CREATE EXTERNAL TABLE `dump_table`
(
`col_name` double,
`col_name_2` timestamp
)
PARTITIONED BY (
`year` int,
`month` int,
`day` int)
CLUSTERED BY (
someid)
INTO 32 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://somecluster/test.db/dump_table'
TBLPROPERTIES (
'orc.compression'='SNAPPY',
'transient_lastDdlTime'='1564476840')
I have to change its columns to upper case and also add new columns, so it will become something like:
CREATE EXTERNAL TABLE `dump_table_2`
(
`COL_NAME` DOUBLE,
`COL_NAME_2` TIMESTAMP,
`NEW_COL` DOUBLE
)
PARTITIONED BY (
`year` int,
`month` int,
`day` int)
CLUSTERED BY (
someid)
Option 1:
As an option I can run Change (DDL Reference here) to rename the columns and then add the new columns. BUT the thing is that I do not have any backup for this table and it contains a lot of data. If anything goes wrong I might lose data.
Can I create a new external table and migrate the data, partition by partition, from dump_table to dump_table_2? What would the query look like for this migration?
Is there any better way of achieving this use case? Please help.
You can create the new table dump_table_2 with the new columns and load the data using SQL:
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dump_table_2 partition (`year`, `month`, `day`)
select col1,
...
colN,
`year`, `month`, `day`
from dump_table t --join other tables if necessary to calculate columns
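For comparison, the in-place ALTER route mentioned as Option 1 in the question would look roughly like the sketch below (not from the original answer; CASCADE also pushes the change down to the metadata of existing partitions):
ALTER TABLE dump_table CHANGE `col_name` `COL_NAME` double CASCADE;
ALTER TABLE dump_table CHANGE `col_name_2` `COL_NAME_2` timestamp CASCADE;
ALTER TABLE dump_table ADD COLUMNS (`NEW_COL` double) CASCADE;
This modifies dump_table in place, which is exactly the risk the question raises, so the separate dump_table_2 plus the INSERT OVERWRITE above is the safer migration path.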

How to alter/add column to json serde table

I had a table
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ.testtable (
x BIGINT,
y STRING,
z STRING
)
PARTITIONED BY (
date string,
hour STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'paths'='x, y, z')
STORED AS TEXTFILE
LOCATION 'testlocation/testtable'
with huge JSON data. I want to add one more column, something like c, to the existing table, so I tried:
1. alter table XYZ.testtable add columns (c STRING);
2. ALTER TABLE XYZ.testtable SET SERDEPROPERTIES ('paths'='x, y, z, c');
but the c value, which is present in the JSON file, comes back as null.
I tried dropping and recreating the table with the 'c' column and it worked fine. Can anyone help with how to alter a JsonSerDe table to add a column?
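One thing worth checking (an assumption, not something confirmed in this thread): on a partitioned table, ADD COLUMNS without CASCADE only updates the table-level schema, so rows in already-existing partitions can keep coming back as null for the new column. A hedged sketch of the CASCADE variant:
-- assumption: CASCADE propagates the new column to existing partition schemas
alter table XYZ.testtable add columns (c STRING) CASCADE;
ALTER TABLE XYZ.testtable SET SERDEPROPERTIES ('paths'='x, y, z, c');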

Hive Create Table Partitions from file name

New to Hadoop. I know how to create a table in Hive (the syntax).
I am creating a table with 3 partition keys, but the keys are in the file names.
File name example: ServerName_ApplicationName_ApplicationName.XXXX.log.YYYY-MM-DD
There are hundreds of files in a directory, and I want to create a table with the following partition keys taken from the file name: ServerName, ApplicationName, Date, and then load all the files into the table.
A Hive script would be the preference, but I am open to any other ideas.
(The files are CSV, and I know the schema (column definitions) of the files.)
I assume the file name is in the format ServerName_ApplicationName.XXXX.log.YYYY-MM-DD (I removed the second "ApplicationName", assuming it to be a typo).
Create a table on the contents of the original files, something like:
create external table default.stack
(col1 string,
col2 string,
col3 string,
col4 int,
col5 int
)
ROW FORMAT DELIMITED
FIELDS terminated by ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs://nameservice1/location1...';
Create another partitioned table in another location, like:
create external table default.stack_part
(col1 string,
col2 string,
col3 string,
col4 int,
col5 int
)
PARTITIONED BY ( servername string, applicationname string, load_date string)
STORED as AVRO -- you can choose any format for the final file
location 'hdfs://nameservice1/location2...';
Insert into the partitioned table from the base table using the below query:
set hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.output=true;
set hive.exec.parallel=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Insert overwrite table default.stack_part
partition ( servername, applicationname, load_date)
select *,
split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[0] as servername
,split(split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[1],'[.]')[0] as applicationname
,split(split(reverse(split(reverse(INPUT__FILE__NAME),"/")[0]),"_")[1],'[.]')[3] as load_date
from default.stack;
I have tested this and it works.
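To sanity-check the result, the partitions created by the dynamic insert can be listed with a standard command, for example:
SHOW PARTITIONS default.stack_part;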

hive external partitioned table

First I created a Hive external table partitioned by code and date:
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ
(
ID STRING,
SAL BIGINT,
NAME STRING
)
PARTITIONED BY (CODE INT,DATE STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/old_work/XYZ';
and then I execute an insert overwrite on this table, taking data from another table:
INSERT OVERWRITE TABLE XYZ PARTITION (CODE,DATE)
SELECT
*
FROM TEMP_XYZ;
and after that I count the number of records in Hive:
select count(*) from XYZ;
it shows me that 1000 records are there.
Then I rename or move the location '/old_work/XYZ' to '/new_work/XYZ', drop the XYZ table, and create it again pointing the location to the new directory, i.e. '/new_work/XYZ':
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ
(
ID STRING,
SAL BIGINT,
NAME STRING
)
PARTITIONED BY (CODE INT,DATE STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/new_work/XYZ';
But then when I execute select count(*) from XYZ in Hive, it shows 0 records.
I think I missed something; please help me on this.
You need not drop the table and re-create it the second time.
As soon as you move or rename the external HDFS location of the table, just do this:
msck repair table <table_name>
In your case the error was because the Hive metastore wasn't updated with the new path.
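So, with the re-created table from the question pointing at '/new_work/XYZ', the missing step is simply:
MSCK REPAIR TABLE XYZ;
select count(*) from XYZ; -- should show the 1000 records again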
