Calculations with a Time (HH:MM:SS) type of column in Hive

I have created a Hive table with a column avg_response_time that holds time values in HH:MM:SS format. Since it is not a timestamp, I had to declare the column as a string. Now I want to do some calculations.
Here is the table schema:
create table agent_performance
(
S_No int,
`Date` string,
Agent string,
Total_chats int,
avg_response_time string,
avg_resolution_time string,
avg_rating float,
Total_feedback int
)
row format delimited
fields terminated by ',';
I am attaching an image of the dataset to show what it looks like.
I want to do some calculations:
Total contribution hours for each agent on a weekly basis
Average weekly response time for each agent

You can split the hour:min:sec data on the ':' delimiter and then use the parts to calculate the total response time or resolution time.
Also use date_format(current_date(),'W') to get the week number within a month; in the query below it is applied to the `date` column.
select
    agent,
    date_format(`date`, 'W') as week_no,
    sum((split(avg_resolution_time, ':')[0] * 3600
       + split(avg_resolution_time, ':')[1] * 60
       + split(avg_resolution_time, ':')[2]) / 3600) as total_weekly_contri_hrs,
    avg((split(avg_response_time, ':')[0] * 3600
       + split(avg_response_time, ':')[1] * 60
       + split(avg_response_time, ':')[2]) / 3600) as avg_weekly_response_time_hrs
from agent_performance
group by 1, 2;
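As a quick sanity check of the HH:MM:SS-to-hours conversion used above, you can run the same expression on a literal value (the '01:30:00' here is just an illustrative input):
select (split('01:30:00', ':')[0] * 3600
      + split('01:30:00', ':')[1] * 60
      + split('01:30:00', ':')[2]) / 3600 as hrs;
-- returns 1.5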

Related

The ambiguity w.r.t date field in Dim_time

I have come across a fact table fact_trips, composed of columns like:
driver_id,
vehicle_id,
date (int, in 'YYYYMMDD' form),
timestamp (bigint, in milliseconds),
miles,
time_of_trip
I have another table dim_time, composed of columns like:
date (int, in 'YYYYMMDD' form),
timestamp (bigint, in milliseconds),
month,
year,
day_of_week,
day
Now, when I want to see the trips grouped by year, I have to join the two tables on timestamp (the bigint) and then group by year from dim_time.
Why the hell do we keep date in int form then? Ultimately I have to join on timestamp anyway. What needs to be changed?
Also, the dim_time does not have a primary key, hence there are multiple entries for the same date. So, when I join the tables, I get more rows in return than expected.
You should have 2 Dim tables:
DIM_DATE: PK = YYYYMMDD
DIM_TIME: PK = a number. It will hold one record per millisecond in a day (assuming you are holding time at the millisecond grain rather than second, minute, etc.).
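With that split, grouping trips by year only needs the date key. A minimal sketch, assuming a dim_date table keyed by the same YYYYMMDD int stored in fact_trips (the column name date_key is illustrative, not from the original post):
select d.year, count(*) as trips
from fact_trips f
join dim_date d
  on f.date = d.date_key   -- both ints in YYYYMMDD form, so no timestamp join is needed
group by d.year;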

Hive, Bucketing for the partitioned table

This is my script:
--table without partition
drop table if exists ufodata;
create table ufodata ( sighted string, reported string, city string, shape string, duration string, description string )
row format delimited
fields terminated by '\t'
Location '/mapreduce/hive/ufo';
--load my data in ufodata
load data local inpath '/home/training/downloads/ufo_awesome.tsv' into table ufodata;
--create partition table
drop table if exists partufo;
create table partufo ( sighted string, reported string, city string, shape string, duration string, description string )
partitioned by ( year string )
clustered by (year) into 6 buckets
row format delimited
fields terminated by '\t';
--by default dynamic partition is not set
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--by default bucketing is false
set hive.enforce.bucketing=true;
--loading mydata
insert overwrite table partufo
partition (year)
select sighted, reported, city, shape, min, description, SUBSTR(TRIM(sighted), 1,4) from ufodata;
Error message:
FAILED: Error in semantic analysis: Invalid column reference
I tried bucketing for my partitioned table. If I remove "clustered by (year) into 6 buckets" the script works fine. How do I bucket the partitioned table?
There is an important thing to consider while bucketing in Hive:
the same column cannot be used for both bucketing and partitioning. The reason is as follows:
Clustering and sorting happen within a partition. Inside each partition there is only one value of the partition column (in your case, year), so clustering or sorting on it would have no effect. That is the reason for your error.
You can use syntax like the following to create a bucketed table with a partition:
CREATE TABLE bckt_movies
(mov_id BIGINT , mov_name STRING ,prod_studio STRING, col_world DOUBLE , col_us_canada DOUBLE , col_uk DOUBLE , col_aus DOUBLE)
PARTITIONED BY (rel_year STRING)
CLUSTERED BY(mov_id) INTO 6 BUCKETS;
When you are doing dynamic partitioning, create a temporary (staging) table with all the columns, including the partition column, and load the data into it.
Then create the actual partitioned table. When you load data from the temporary table, the partition column must come last in the select clause, as in the sketch below.
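Putting those steps together, a minimal sketch of that flow for the bckt_movies table above (the staging table stg_movies is illustrative, not from the original post):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.enforce.bucketing=true;

-- the staging table holds every column, including rel_year
insert overwrite table bckt_movies partition (rel_year)
select mov_id, mov_name, prod_studio, col_world, col_us_canada, col_uk, col_aus,
       rel_year            -- partition column goes last
from stg_movies;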

Hive Hadoop : Need to LOAD data into a table based on conditions on the input file

I am new to Hadoop and Hive and have just started doing basic querying in Hive.
I have an input text file with a large number of fields per line. The format of the file is something like this:
1;23;0;;;;1;3;2;1;1;4;5;6;;;;
1;43;6;;;;1;3;2;1;1;4;5;5;;;;
1;53;7;;;;1;3;2;1;1;4;5;2;;;;
(Each integer before a ";" has a meaning, which I intend to map to a column name in the Hive table; each line contains about 400 fields.)
So for inserting this data I have created a table "test" using the following query:
CREATE TABLE test (field1 INT, field2 INT, field3 INT, field4 INT, ... field390 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\073";
And I load my text file with the records using the LOAD query as below:
LOAD DATA LOCAL INPATH '/tmp/test.txt'
OVERWRITE INTO TABLE test;
All fields up to the 50th are inserted into the table accurately; after that, the columns are mismatched.
In my input format, the 50th field in test.txt is an INT that decides how many of the following fields to take.
Example:
50th field: 2 -> Hive has to take the next 2*10 INT field values and insert them into the table.
50th field: 1 -> Hive has to take the next 1*10 INT field values and insert them into the table. The remaining 10 fields can be set to NULL.
(The maximum value of the 50th field is 2, so I have reserved 2*10 fields for this in the table.)
After the 50th + (2*10) fields, the data should be read normally, in the same sequence as before the 50th field.
Is there a way to put a condition on the input so that the data gets inserted accordingly in Hive?
Any help would be appreciated. I need a solution that does not require pre-processing test.txt before loading it into the table.
I have tried to answer it at http://www.knowbigdata.com/page/hive-hadoop-need-load-data-table-based-conditions-input-file#comment-85
Does it make sense?
You can use a WHERE clause in Hive.
First load the data into a raw Hive table (or onto HDFS), then create the target table and load it from the raw table based on a WHERE clause, for example:
SELECT * FROM table_reference
WHERE name LIKE '%venu%';
Resource: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
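Building on the "raw table first" idea, here is a minimal sketch of the conditional parsing (the staging table test_raw, the target table test_parsed, and the column aliases are illustrative, not from the original post):
CREATE TABLE test_raw (line STRING);

LOAD DATA LOCAL INPATH '/tmp/test.txt' OVERWRITE INTO TABLE test_raw;

-- split each line on ';' and apply the 50th-field rule in the SELECT;
-- CASE expressions without ELSE yield NULL for the fields that are absent
INSERT OVERWRITE TABLE test_parsed
SELECT
  CAST(f[0]  AS INT)  AS field1,
  CAST(f[49] AS INT)  AS field50,
  CASE WHEN CAST(f[49] AS INT) >= 1 THEN CAST(f[50] AS INT) END AS field51,
  CASE WHEN CAST(f[49] AS INT) >= 2 THEN CAST(f[60] AS INT) END AS field61
FROM (SELECT split(line, ';') AS f FROM test_raw) t;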

query hive partitioned table over date/time range

My Hive table is partitioned on year, month, day, and hour.
Now I want to fetch data from 2014-05-27 to 2014-06-05.
How can I do that?
I know one option is to partition on epoch (or yyyy-mm-dd-hh) and pass epoch times in the query.
Can I do it without losing the date hierarchy?
Table Structure
CREATE TABLE IF NOT EXISTS table1 (col1 int, col2 int)
PARTITIONED BY (year int, month int, day int, hour int)
STORED AS TEXTFILE;
This is a scenario we face every day while querying tables in Hive. We have partitioned our tables similarly to the way you explained, and it has helped a lot with querying. This is how we partition:
CREATE TABLE IF NOT EXISTS table1 (col1 int, col2 int)
PARTITIONED BY (year bigint, month bigint, day bigint, hour int)
STORED AS TEXTFILE;
For partitions we assign values like this:
year = 2014, month = 201409, day = 20140924, hour = 01
This way the querying becomes really simple and you can directly query:
select * from table1 where day >= 20140527 and day < 20140605
Hope this helps
You can query like this:
WHERE st_date > '2014-05-27-00' and end_date < '2014-06-05-24'
This should give you the desired result because even though it is a string, it will be compared lexicographically, i.e. '2014-04-04' will always be greater than '2014-04-03'.
I ran it on my sample tables and it works perfectly fine.
You can use CONCAT with LPAD.
Say you want to get all partitions from 2020-03-24, hour=00 to 2020-04-24, hour=23; then your WHERE condition would look like:
WHERE (CONCAT(year, '-', LPAD(month,2,'0'), '-', LPAD(day,2,'0'), '_', LPAD(hour,2,'0')) > '2020-03-24_00')
AND (CONCAT(year, '-', LPAD(month,2,'0'), '-', LPAD(day,2,'0'), '_', LPAD(hour,2,'0')) < '2020-04-24_23')
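Applied to the table1 schema from the question (integer partition columns) and the 2014-05-27 to 2014-06-05 range, a sketch of the full query might look like:
SELECT col1, col2
FROM table1
WHERE CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'), '_', LPAD(hour, 2, '0'))
      BETWEEN '2014-05-27_00' AND '2014-06-05_23';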

HIVE order by messes up data

In Hive 0.8 with Hadoop 1.03 consider this table:
CREATE TABLE table (
key int,
date timestamp,
name string,
surname string,
height int,
weight int,
age int)
CLUSTERED BY(key) INTO 128 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Then I tried:
select *
from table
where key=xxx
order by date;
The result is sorted, but everything after the name column is wrong. In fact, all the rows have exactly the same values in those fields, and the surname column is missing. I also have a bitmap index on name and surname, and an index on key.
Is there something wrong with my query, or should I be looking into bugs with ORDER BY? (I can't find anything specific.)
It seems like there has been an error in loading data into Hive. Make sure you don't have any special characters in your CSV file that might interfere with the insertion.
Also, you have clustered by the key column. Where does this key come from: the CSV, or some other source? Are you sure it is unique?
