Map a hive partition to a location - hadoop

I have a Hive external table, partitioned by year, month, day and hour:
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int,
  `hour` int)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
  'hdfs://path/to/data'
The data exists in directories such as
2014/05/10/07/00
2014/05/10/07/01
...
2014/05/10/07/22
2014/05/10/07/23
I get results when I select data using the following:
Select * from my_table where year=2014 and month="05" and day="07" and hour="03"
but I want to be able to query without the quotes for values starting with a zero. Currently, neither of the following works:
Select * from my_table where year=2014 and month=05 and day=07 and hour=03
Select * from my_table where year=2014 and month=5 and day=7 and hour=3
How can I support this (without renaming the directories to drop the zero prefix from single-digit values)?
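
One way to support this without renaming anything is to map each zero-padded directory to a partition with integer values explicitly; Hive then compares the partition values, not the directory names. A minimal sketch, assuming the directory levels correspond to year/month/day/hour:

ALTER TABLE my_table ADD IF NOT EXISTS
  PARTITION (`year`=2014, `month`=5, `day`=10, `hour`=7)
  LOCATION 'hdfs://path/to/data/2014/05/10/07';

After this, month=5 in a query matches the data under the 05 directory.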

Before I go into the answer: this does involve changing the directory names, but it will make querying much simpler for you.
We have a similar structure for our partitions, but instead of naming them in the format 2014/05/10/07/22, we use 2014/201405/20140510/07/20140510.22. The partitions are:
PARTITIONED BY (
  years bigint,
  months bigint,
  days bigint,
  hours float
)
Now coming to the advantages of using this:
The query mentioned in the question:
Select * from my_table where year=2014 and month=05 and day=07 and hour=03
becomes, with the new partitions, simply:
Select * from my_table where hours = 20140507.03
Queries on days and months can likewise be run directly, without explicitly specifying the year and month.
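
For example, a date-range scan then needs only the days column, since each value embeds its year and month; a sketch against the renamed partitions above:

-- whole range in one predicate, no separate year/month filters
Select * from my_table where days BETWEEN 20140501 AND 20140510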

Related

Change Hive External Table Column names to upper case and add new columns

I have an external table, for example dump_table, which is partitioned over year, month and day. If I run show create table dump_table I get the following:
CREATE EXTERNAL TABLE `dump_table` (
  `col_name` double,
  `col_name_2` timestamp)
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int)
CLUSTERED BY (
  someid)
INTO 32 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://somecluster/test.db/dump_table'
TBLPROPERTIES (
  'orc.compression'='SNAPPY',
  'transient_lastDdlTime'='1564476840')
I have to change its columns to upper case and also add new columns, so it will become something like:
CREATE EXTERNAL TABLE `dump_table_2` (
  `COL_NAME` DOUBLE,
  `COL_NAME_2` TIMESTAMP,
  `NEW_COL` DOUBLE)
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int)
CLUSTERED BY (
  someid)
Option 1:
As one option, I can run ALTER TABLE ... CHANGE (DDL reference here) to rename the columns and then add the new columns. BUT I do not have any backup for this table, and it contains a lot of data. If anything goes wrong I might lose data.
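
For reference, Option 1 would look roughly like the following sketch of Hive's CHANGE / ADD COLUMNS DDL (untested against this table; note that Hive treats column names case-insensitively, so verify that the upper-casing actually sticks in your metastore):

ALTER TABLE dump_table CHANGE `col_name` `COL_NAME` double;
ALTER TABLE dump_table CHANGE `col_name_2` `COL_NAME_2` timestamp;
ALTER TABLE dump_table ADD COLUMNS (`NEW_COL` double);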
Can I instead create a new external table and migrate the data, partition by partition, from dump_table to dump_table_2? What would the query for this migration look like?
Is there any better way of achieving this use case? Please help.
You can create the new table dump_table_2 with the new columns and load the data using SQL:
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table dump_table_2 partition (`year`, `month`, `day`)
select col1,
       ...
       colN,                    --all data columns, in the new table's order
       `year`, `month`, `day`   --partition columns go last
  from dump_table t --join other tables if necessary to calculate columns
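
Since the worry here is losing data, a cheap sanity check before dropping the old table is to compare per-partition row counts between the two tables (a sketch):

select `year`, `month`, `day`, count(*) from dump_table   group by `year`, `month`, `day`;
select `year`, `month`, `day`, count(*) from dump_table_2 group by `year`, `month`, `day`;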

get latest data from hive table with multiple partition columns

I have a Hive table with the structure below:
ID string,
Value string,
year int,
month int,
day int,
hour int,
minute int
This table is refreshed every 15 minutes and is partitioned by the year/month/day/hour/minute columns. Below are some sample partitions:
year=2019/month=12/day=29/hour=19/minute=15
year=2019/month=12/day=30/hour=00/minute=45
year=2019/month=12/day=30/hour=08/minute=45
year=2019/month=12/day=30/hour=09/minute=30
year=2019/month=12/day=30/hour=09/minute=45
I want to select only the latest partition's data from the table. I tried using max() on those partition columns, but it is not very efficient given how large the data is.
Please let me know how I can get the data in a convenient way using Hive SQL.
If the latest partition always falls on the current date, then you can filter on the current date partition and use rank() to find the records with the latest hour and minute:
select * --list columns here
from
(
  select s.*, rank() over(order by hour desc, minute desc) rnk
    from your_table s
   where s.year=year(current_date) --filter current day (better pass pre-calculated variables if possible)
     and s.month=lpad(month(current_date),2,'0')
     and s.day=lpad(day(current_date),2,'0')
     -- and s.hour=lpad(hour(current_timestamp),2,'0') --consider also adding this
) s
where rnk=1 --latest hour, minute
And if the latest partition does not necessarily fall on the current date, then you can use rank() over (order by s.year desc, s.month desc, s.day desc, hour desc, minute desc); without a filter on the date, though, this will scan the whole table and is not efficient.
It will perform best if you can calculate the partition filters in the shell and pass them in as parameters. See the comments in the code.
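
For example, with the values computed in the shell and passed in via --hivevar (the variable names and file name here are illustrative):

hive --hivevar y=2019 --hivevar m=12 --hivevar d=30 -f latest.hql

-- latest.hql: the same query as above, with static partition filters
select *
from
(
  select s.*, rank() over(order by hour desc, minute desc) rnk
    from your_table s
   where s.year=${hivevar:y} and s.month=${hivevar:m} and s.day=${hivevar:d}
) s
where rnk=1;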

Is expression-based partitioning supported in Hive?

I have a table with a column; can I create a partition based on an expression using that column?
I read that IBM's Big SQL technology has this feature.
I also know we can partition in Hive by a column, but what about an expression?
In this case I am doing a cast, but it could be any expression:
CREATE TABLE INVENTORY_A (
  trans_id int,
  product varchar(50),
  trans_ts timestamp
)
PARTITIONED BY (
  cast(trans_ts as date) AS date_part
)
I expect the records to be partitioned by the date value, so that when a user writes a query like
select * from INVENTORY_A where trans_ts BETWEEN timestamp '2016-06-23 14:00:00.000' AND timestamp '2016-06-23 14:59:59.000'
the query will be smart enough to break the timestamp down to its date and filter on the date alone.
Hive has no expression-based partitioning; the closest equivalent is dynamic partitioning, where you declare date_part as a regular partition column and compute the cast in the SELECT that loads the table.
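
A sketch of what that looks like for the table in the question (inventory_staging is a hypothetical unpartitioned source holding the raw rows):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE INVENTORY_A (
  trans_id int,
  product varchar(50),
  trans_ts timestamp
)
PARTITIONED BY (date_part date);

-- the dynamic partition column is computed last in the SELECT
INSERT OVERWRITE TABLE INVENTORY_A PARTITION (date_part)
SELECT trans_id, product, trans_ts, cast(trans_ts as date) AS date_part
FROM inventory_staging;

Note that queries must still filter on date_part explicitly (e.g. where date_part = '2016-06-23'); Hive will not rewrite a trans_ts predicate into a partition filter for you.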

How can I insert into a table with the original day as the partition in Hive?

create table h5_qti_desc
( h5id string,
  query string,
  title string,
  item string,
  query_ids string,
  title_ids string,
  item_ids string,
  label bigint
) PARTITIONED BY (day string) LIFECYCLE 160;

insert overwrite into h5_qti_desc
select * from aaa;
I created a table named h5_qti_desc and I want to insert into it from another table, aaa, which has a day field but no partitions.
Table aaa spans several days, like '20171010', '20171015', ...
How can I insert into h5_qti_desc in a single statement, with the day values from aaa acting as the day partition of h5_qti_desc?
You can use Hive's dynamic partition functionality to insert the data. A dynamic-partition insert (or multi-partition insert) is designed to solve exactly this problem: it determines which partitions should be created and populated while scanning the input table.
Below is an example of loading data into all partitions using one insert statement (note that the dynamic partition column, day, must come last in the SELECT list):
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT OVERWRITE TABLE h5_qti_desc PARTITION(day)
SELECT * FROM aaa
DISTRIBUTE BY day;
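
To confirm what was created, you can list the partitions afterwards:

hive> SHOW PARTITIONS h5_qti_desc;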

query hive partitioned table over date/time range

My Hive table is partitioned on year, month, day and hour.
Now I want to fetch data from 2014-05-27 to 2014-06-05.
How can I do that?
I know one option is to partition on epoch (or yyyy-mm-dd-hh) and pass epoch times in the query.
Can I do it without losing the date hierarchy?
Table structure:
CREATE TABLE IF NOT EXISTS table1 (col1 int, col2 int)
PARTITIONED BY (year int, month int, day int, hour int)
STORED AS TEXTFILE;
This is a scenario we face every day while querying tables in Hive. We have partitioned our tables similarly to the way you describe, and it has helped a lot with querying. This is how we partition:
CREATE TABLE IF NOT EXISTS table1 (col1 int, col2 int)
PARTITIONED BY (year bigint, month bigint, day bigint, hour int)
STORED AS TEXTFILE;
For partitions we assign values like this:
year = 2014, month = 201409, day = 20140924, hour = 01
This way the querying becomes really simple and you can filter directly on the day column:
select * from table1 where day >= 20140527 and day <= 20140605
Hope this helps.
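
If it helps, the composite values can be derived at load time from a timestamp column (a sketch; the ts column and source_table are assumptions, and date_format needs Hive 1.2+):

select col1, col2,
       year(ts)                                    as year,
       cast(date_format(ts, 'yyyyMM')   as bigint) as month,
       cast(date_format(ts, 'yyyyMMdd') as bigint) as day,
       hour(ts)                                    as hour
  from source_table;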
You can query like this:
WHERE st_date > '2014-05-27-00' and end_date < '2014-06-05-24'
This should give you the desired result, because even though it is a string it will be compared lexicographically: '2014-04-04' is always greater than '2014-04-03', provided each part is zero-padded.
I ran it on my sample tables and it works perfectly fine.
You can use CONCAT with LPAD.
Say you want all partitions between 2020-03-24, hour=00 and 2020-04-24, hour=23. Your WHERE condition would then look like:
WHERE CONCAT(year, '-', LPAD(month,2,'0'), '-', LPAD(day,2,'0'), '_', LPAD(hour,2,'0')) > '2020-03-24_00'
  AND CONCAT(year, '-', LPAD(month,2,'0'), '-', LPAD(day,2,'0'), '_', LPAD(hour,2,'0')) < '2020-04-24_23'
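
As a quick check of the expression: for the partition year=2020, month=3, day=24, hour=0, the concatenation evaluates to '2020-03-24_00', which compares correctly against the string bounds.

SELECT CONCAT(2020, '-', LPAD(3,2,'0'), '-', LPAD(24,2,'0'), '_', LPAD(0,2,'0'));
-- 2020-03-24_00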
