query hive partitioned table over date/time range - hadoop

My Hive table is partitioned on year, month, day, and hour.
Now I want to fetch data from 2014-05-27 to 2014-06-05.
How can I do that?
I know one option is to create the partition on epoch (or yyyy-mm-dd-hh) and pass the epoch time in the query.
Can I do it without losing the date hierarchy?
Table Structure
CREATE TABLE IF NOT EXISTS table1 (col1 int, col2 int)
PARTITIONED BY (year int, month int, day int, hour int)
STORED AS TEXTFILE;

This is a similar scenario to one we face every day while querying tables in Hive. We have partitioned our tables in a way similar to the one you described, and it has helped a lot in querying. This is how we partition:
CREATE TABLE IF NOT EXISTS table1 (col1 int, col2 int)
PARTITIONED BY (year bigint, month bigint, day bigint, hour int)
STORED AS TEXTFILE;
For partitions we assign values like this:
year = 2014, month = 201409, day = 20140924, hour = 01
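This way the querying becomes really simple, because each day value already carries its year and month. You can query the range directly (the upper bound uses <= so that 2014-06-05 itself is included):
select * from table1 where day >= 20140527 and day <= 20140605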
Hope this helps

You can query like this:
WHERE st_date > '2014-05-27-00' and end_date < '2014-06-05-24'
This should give you the desired result: even though the value is a string, it is compared lexicographically, so '2014-04-04' is always greater than '2014-04-03'. This holds only while every component is zero-padded to a fixed width.
I ran it on my sample tables and it works perfectly fine.
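String comparison orders dates correctly only when each component is zero-padded to the same width. A small sketch with literal values (no table needed) that shows both cases; the column aliases are made up for illustration:

```sql
-- Padded values compare in calendar order; unpadded ones may not.
SELECT
  '2014-04-04' > '2014-04-03' AS padded_in_order,    -- true
  '2014-10-01' > '2014-9-01'  AS unpadded_in_order;  -- false: '1' sorts before '9'
```

October should sort after September, but without the leading zero the comparison is decided by the characters '1' and '9'.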

You can use CONCAT with LPAD.
Say you want all partitions from 2020-03-24, hour=00 through 2020-04-24, hour=23. Then your WHERE condition would look like this (>= and <= keep the first and last hour in the result):
WHERE (CONCAT(year, '-', LPAD(month,2,'0'), '-', LPAD(day,2,'0'), '_', LPAD(hour,2,'0')) >= '2020-03-24_00')
AND (CONCAT(year, '-', LPAD(month,2,'0'), '-', LPAD(day,2,'0'), '_', LPAD(hour,2,'0')) <= '2020-04-24_23')
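For the original layout in the question (plain int year, month, day and hour partitions), the range can also be written directly against the partition columns, which keeps the date hierarchy untouched. A minimal sketch, assuming both end dates are meant to be inclusive:

```sql
-- Form 1: spell out the month boundary explicitly.
SELECT *
FROM table1
WHERE year = 2014
  AND (   (month = 5 AND day >= 27)   -- 2014-05-27 .. 2014-05-31
       OR (month = 6 AND day <= 5));  -- 2014-06-01 .. 2014-06-05

-- Form 2: fold year/month/day into one yyyymmdd number and compare that.
SELECT *
FROM table1
WHERE year * 10000 + month * 100 + day BETWEEN 20140527 AND 20140605;
```

The first form compares partition columns against constants only; the second is shorter and also works across year boundaries. Checking the plan with EXPLAIN is a reasonable way to confirm that only the intended partitions are read.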

Related

calculations with Time(HH:MM:SS) type of column in Hive

I have created a Hive table with a column avg_response_time that holds time values in HH:MM:SS format. Since this is not a timestamp, I had to declare the column as a string. Now I want to do some calculations.
Here is the table schema:
create table agent_performance
(
S_No int,
`Date` string,
Agent string,
Total_chats int,
avg_response_time string,
avg_resolution_time string,
avg_rating float,
Total_feedback int
)
row format delimited
fields terminated by ',';
I am attaching an image of the dataset; this is how the dataset looks.
I want to do some calculations:
Total contribution hours for each agent on a weekly basis
Average weekly response time for each agent
You can split the hour:min:seconds data on the delimiter ':'.
Then use the parts to calculate the total response time or resolution time.
Also use date_format(current_date(),'W') to calculate the week number within a month.
select
  agent,
  date_format(`date`,'W') as week_no,
  -- convert HH:MM:SS to seconds, then to hours
  sum((split(avg_resolution_time,':')[0]*3600 + split(avg_resolution_time,':')[1]*60 + split(avg_resolution_time,':')[2])/3600) as total_weekly_contri_hrs,
  avg((split(avg_response_time,':')[0]*3600 + split(avg_response_time,':')[1]*60 + split(avg_response_time,':')[2])/3600) as avg_weekly_response_time_hrs
from agent_performance
-- group by the expressions themselves; positional "group by 1,2" depends on configuration
group by
  agent,
  date_format(`date`,'W')
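The conversion inside the query above can be checked on a single literal. A minimal sketch, using a made-up value '00:02:30' and explicit casts so the result is an integer number of seconds:

```sql
-- HH:MM:SS -> seconds: hours*3600 + minutes*60 + seconds
SELECT cast(split('00:02:30', ':')[0] AS int) * 3600
     + cast(split('00:02:30', ':')[1] AS int) * 60
     + cast(split('00:02:30', ':')[2] AS int) AS total_seconds;
-- 0*3600 + 2*60 + 30 = 150
```

Dividing the result by 3600 gives hours, which is what the aggregate query does before summing or averaging.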

get latest data from hive table with multiple partition columns

I have a Hive table with the structure below:
ID string,
Value string,
year int,
month int,
day int,
hour int,
minute int
This table is refreshed every 15 minutes and is partitioned by the year/month/day/hour/minute columns. Sample partitions are shown below.
year=2019/month=12/day=29/hour=19/minute=15
year=2019/month=12/day=30/hour=00/minute=45
year=2019/month=12/day=30/hour=08/minute=45
year=2019/month=12/day=30/hour=09/minute=30
year=2019/month=12/day=30/hour=09/minute=45
I want to select only the latest partition's data from the table. I tried to use max() on those partition columns, but it's not very efficient as the data size is huge.
Please let me know how I can get the data in a convenient way using Hive SQL.
If the latest partition is always on the current date, you can filter on the current date's partition and use rank() to find the records with the latest hour and minute:
select * --list columns here
from
(
  select s.*, rank() over(order by hour desc, minute desc) rnk
  from your_table s
  where s.year=year(current_date) --filter current day (better pass variables calculated if possible)
    and s.month=lpad(month(current_date),2,0)
    and s.day=lpad(day(current_date),2,0)
    -- and s.hour=lpad(hour(current_timestamp),2,0) --consider also adding this
) s
where rnk=1 --latest hour, minute
If the latest partition is not necessarily on current_date, you can use rank() over (order by s.year desc, s.month desc, s.day desc, hour desc, minute desc) without the date filter, but this scans the whole table and is not efficient.
It will perform best if you can calculate the partition filters in the shell and pass them as parameters. See the comments in the code.
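Passing the filters as parameters could look roughly like the sketch below. It relies on Hive's --hivevar option and ${hivevar:...} substitution; the script name latest.hql and the variable names yr, mon and dy are hypothetical, and the values would be computed by the calling shell script.

```sql
-- Invoked from the shell, for example:
--   hive --hivevar yr=2019 --hivevar mon=12 --hivevar dy=30 -f latest.hql
-- (latest.hql, yr, mon and dy are illustrative names, not from the original post)
SELECT s.*   -- includes the helper column rnk; list columns explicitly to drop it
FROM (
  SELECT t.*,
         rank() OVER (ORDER BY hour DESC, minute DESC) AS rnk
  FROM your_table t
  WHERE t.year  = ${hivevar:yr}
    AND t.month = ${hivevar:mon}
    AND t.day   = ${hivevar:dy}
) s
WHERE s.rnk = 1;   -- rows from the latest hour/minute of that day
```

Because the filter values are constants by the time the query is compiled, only the partitions of that one day need to be read.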

How can insert into the table with the original day as partition in Hive?

create table h5_qti_desc
( h5id string,
query string,
title string,
item string,
query_ids string,
title_ids string,
item_ids string,
label bigint
)PARTITIONED BY (day string) LIFECYCLE 160;
insert overwrite into h5_qti_desc
select * from aaa
;
I created a table named h5_qti_desc, and I want to insert into it from another table, aaa, which has a day field and is not partitioned.
Table aaa has several days, like '20171010','20171015'...
How can I insert into h5_qti_desc in one statement, with the day values in aaa used as the day partitions of h5_qti_desc?
You can use Hive dynamic partition functionality to insert data. Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table.
Below is an example of loading data to all partitions using one insert statement:
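hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT OVERWRITE TABLE h5_qti_desc PARTITION(day)
SELECT * FROM aaa
DISTRIBUTE BY day;
Note that the dynamic partition column must come last in the SELECT list, so SELECT * works only if day is the last column of aaa.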
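One way to check the result of a dynamic-partition insert like the one above is to list the partitions it created and count the rows in each. A minimal sketch; the day values in the comment are the ones the question mentions for aaa:

```sql
-- One entry per partition, in the form day=<value>,
-- for example day=20171010 and day=20171015.
SHOW PARTITIONS h5_qti_desc;

-- Row counts per partition, to compare against the source table.
SELECT day, count(*) AS row_count
FROM h5_qti_desc
GROUP BY day;
```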

issue with hive partitioning and bucketing in CDH 5.10 quick VM

I am new to this area and got stuck on a simple issue.
I am loading data into a Hive table (using an insert command from another table, tset1) which is partitioned by udate and bucketed by day.
insert overwrite test1 partition(udate) select id,value,udate,day from tset1;
The issue is that the load puts the wrong value in the partition column. day is taken as the partition value because it is the last column in my table, so during the load day is used as udate.
How can I force my query to take the right value during the data load?
hive (testdb)> create table test1_buk(id int, value string,day int) partitioned by(udate string) clustered by(day) into 5 buckets row format delimited fields terminated by ',' stored as textfile;
hive (testdb)> desc tset1;
OK
col_name data_type comment
id int
value string
udate string
day int
hive (testdb)> desc test1_buk;
OK
col_name data_type comment
id int
value string
day int
udate string
# Partition Information
# col_name data_type comment
udate string
hive (testdb)> select * from test1_buk limit 1;
OK
test1_buk.id test1_buk.value test1_buk.day test1_buk.udate
5 w 2000 10
please help.
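With dynamic partitioning, Hive takes the partition value from the last column of the SELECT list, regardless of column names, which matches the symptom described. A minimal sketch of a reordered insert, assuming the target is the test1_buk table shown in the DDL and that the Hive version is older than 2.0 (where bucketing has to be enforced explicitly):

```sql
-- Needed on Hive versions before 2.0 so that inserts honour the bucket definition.
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Column order follows test1_buk: id, value, day, then the partition column udate last.
INSERT OVERWRITE TABLE test1_buk PARTITION (udate)
SELECT id, value, day, udate
FROM tset1;
```

After this, a select on test1_buk should show day and udate in their intended columns instead of swapped.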

Map a hive partition to a location

I have a Hive external table partitioned by year, month, day and hour.
PARTITIONED BY (
`year` int,
`month` int,
`day` int,
`hour` int)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
'hdfs://path/to/data'
The data exists in directories such as
2014/05/10/07/00
2014/05/10/07/01
...
2014/05/10/07/22
2014/05/10/07/23
I get results when I select data using the following:
Select * from my_table where year=2014 and month="05" and day="07" and hour="03"
But I want to be able to query without the quotes for values starting with a zero. Currently the following two examples don't work:
Select * from my_table where year=2014 and month=05 and day=07 and hour=03
Select * from my_table where year=2014 and month=5 and day=7 and hour=3
How can I support this (without changing the directories to drop the zero prefix on single-digit values)?
Thanks,
Guy
Before I go into the answer: this does involve changing the directory names, but it will make querying much simpler for you.
We have a similar structure for our partitions, but instead of naming them in the format 2014/05/10/07/22, we use 2014/201405/20140510/07/20140510.22. Basically the partitions are:
PARTITIONED BY
(
years bigint,
months bigint,
days bigint,
hours float
)
Now for the advantages of this layout.
The query mentioned in the question:
Select * from my_table where year=2014 and month=05 and day=07 and hour=03
With the new partitions it becomes:
Select * from my_table where hours = 20140507.03
Queries on days and months can also be run directly, without explicitly specifying months and years. One caveat: a 32-bit float cannot represent a value such as 20140507.03 exactly, so a double or string column is a safer choice for hours.
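If the directories must keep their zero-padded names, another option is to register each partition explicitly and point it at its directory, so the partition values themselves stay plain integers. A minimal sketch, assuming the first four directory levels correspond to year, month, day and hour (the sample listing in the question leaves this ambiguous, so the path below is an assumption):

```sql
-- Map the integer partition values to the existing zero-padded directory.
ALTER TABLE my_table ADD IF NOT EXISTS
  PARTITION (year = 2014, month = 5, day = 10, hour = 7)
  LOCATION 'hdfs://path/to/data/2014/05/10/07';

-- The partition can then be queried without quotes or leading zeros.
SELECT *
FROM my_table
WHERE year = 2014 AND month = 5 AND day = 10 AND hour = 7;
```

The trade-off is that every new directory needs its own ADD PARTITION statement, typically issued by whatever job writes the data.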
