Complex Hive query - Hadoop

Hi, I have the following table:
ID | time
---+-------
5  | 200101
3  | 200102
2  | 200103
12 | 200101
16 | 200103
18 | 200106
Now I want to know how often each month of the year appears. I can't use a plain GROUP BY, because that only counts the values that actually appear in the table; I also want a 0 when a certain month does not appear. So the output should be something like this:
time   | count
-------+------
200101 | 2
200102 | 1
200103 | 1
200104 | 0
200105 | 0
200106 | 1
Sorry for the bad table format, I hope it is still clear what I mean.
I would appreciate any help.

You can provide a year-month table containing all the year and month combinations. I wrote a script for you that generates such a CSV file:
#!/bin/bash
# year_month.sh -- emit one YYYYMM value per line
start_year=1970
end_year=2015
for year in $(seq ${start_year} ${end_year}); do
    for month in $(seq 1 12); do
        printf "%d%02d\n" "${year}" "${month}"
    done
done > year_month.csv
Save it as year_month.sh and run it. You will get a file year_month.csv containing every year-month from 1970 to 2015. You can change start_year and end_year to adjust the range.
Then, upload the year_month.csv file to HDFS. For example,
hadoop fs -mkdir /user/joe/year_month
hadoop fs -put year_month.csv /user/joe/year_month/
After that, you can load year_month.csv into Hive. For example,
create external table if not exists
year_month (time int)
location '/user/joe/year_month';
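A quick sanity check after the table is created (optional sketch): 1970 through 2015 inclusive is 46 years, so the table should hold 552 rows.
-- should return 552 (46 years x 12 months)
select count(*) from year_month;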
Finally, you can join the new table with your table to get the final result. For example, assuming your table is called id_time:
from (
    select year_month.time as time, id_time.id as id
    from year_month
    left outer join id_time
    on year_month.time = id_time.time
) temp
select time, count(id) as count
group by time;
Note: you may need to make small modifications (such as paths and column types) to the statements above.
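As an aside, on Hive 0.13+ (which has posexplode and allows SELECT without FROM) you could skip the CSV entirely and generate the year-month dimension inline. A sketch, assuming the same 1970-2015 range:
-- generates 197001 .. 201512 without an external file
select concat(cast(1970 + y as string),
              lpad(cast(m + 1 as string), 2, '0')) as time
from (select 1 as dummy) t
lateral view posexplode(split(space(45), ' ')) ys as y, yv   -- y = 0..45
lateral view posexplode(split(space(11), ' ')) ms as m, mv;  -- m = 0..11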

Related

How to iterate over a Hive table row by row and calculate a metric when a specific condition is met?

I have a requirement as below:
I am trying to convert an MS Access table macro loop to work for a Hive table. The table, called trip_details, contains details about a specific trip taken by a truck. The truck can stop at multiple locations, and the type of stop is indicated by a flag called type_of_trip. This column contains values like arrival, departure, loading, etc.
The ultimate aim is to calculate the dwell time of each truck (how much time the truck takes before beginning another trip). To calculate this, we have to iterate over the table row by row and check the trip type.
A typical example looks like this:
Do while not end of file:
    Store the first row in a variable.
    Move to the second row.
    If the type_of_trip = Arrival:
        Move to the third row.
        If the type_of_trip = End Trip:
            Store the third row.
            Take the difference of timestamps to calculate dwell time.
            Append the row to the output table.
End
What is the best approach to tackle this problem in Hive?
I tried checking whether Hive has a loop keyword but could not find one. I was thinking of doing this with a shell script, but I need guidance on how to approach it.
I cannot disclose the entire data but feel free to shoot any questions in the comments section.
Input
Trip ID | type_of_trip | timestamp       | location
--------+--------------+-----------------+----------
1       | Departure    | 28/5/2019 15:00 | Warehouse
1       | Arrival      | 28/5/2019 16:00 | Store
1       | Live Unload  | 28/5/2019 16:30 | Store
1       | End Trip     | 28/5/2019 17:00 | Store
Expected Output
Trip ID | Origin_location | Destination_location | Dwell_time
--------+-----------------+----------------------+-----------
1       | Warehouse       | Store                | 2 hours
You do not need a loop for this; use the power of a SQL query.
Convert your timestamps to seconds (using the format you specified, 'dd/MM/yyyy HH:mm'), calculate the min and max per trip_id taking the trip type into account, subtract the seconds, and convert the difference to 'HH:mm' or any other format you prefer:
with trip_details as ( -- use your table instead of this subquery
    select stack(4,
        1, 'Departure'  , '28/5/2019 15:00', 'Warehouse',
        1, 'Arrival'    , '28/5/2019 16:00', 'Store',
        1, 'Live Unload', '28/5/2019 16:30', 'Store',
        1, 'End Trip'   , '28/5/2019 17:00', 'Store'
    ) as (trip_id, type_of_trip, `timestamp`, location)
)
select trip_id, origin_location, destination_location,
       from_unixtime(destination_time - origin_time, 'HH:mm') as dwell_time
from
(
    select trip_id,
           min(case when type_of_trip = 'Departure' then unix_timestamp(`timestamp`, 'dd/MM/yyyy HH:mm') end) as origin_time,
           max(case when type_of_trip = 'End Trip'  then unix_timestamp(`timestamp`, 'dd/MM/yyyy HH:mm') end) as destination_time,
           max(case when type_of_trip = 'Departure' then location end) as origin_location,
           max(case when type_of_trip = 'End Trip'  then location end) as destination_location
    from trip_details
    group by trip_id
) s;
Result:
trip_id | origin_location | destination_location | dwell_time
--------+-----------------+----------------------+-----------
1       | Warehouse       | Store                | 02:00
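One caveat on the from_unixtime trick above: it interprets the difference as an epoch timestamp and renders it in the session time zone, so it is only reliable in UTC and for dwell times under 24 hours. A purely arithmetic formatting sketch that avoids both limits (FROM-less SELECT needs Hive 0.13+):
-- 7200 seconds -> '2:00' in any time zone; replace 7200 with
-- destination_time - origin_time in the real query
select concat(cast(floor(7200 / 3600) as string), ':',
              lpad(cast(floor((7200 % 3600) / 60) as string), 2, '0')) as dwell_time;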

Using an Oracle loop to concatenate strings

I have something like this:
id | day | description
---+-----+----------------
1  | 1   | hi
1  | 1   | today
1  | 1   | is a beautifull
1  | 1   | day
1  | 2   | exemplo
1  | 2   | for
1  | 2   | this case
I need to write a function that, for each day, concatenates the description column and returns a result like this:
id | day | description
---+-----+------------------------------
1  | 1   | hi today is a beautifull day
1  | 2   | exemplo for this case
Any idea how I can do this using a loop in a function in Oracle?
You need a way of determining the order in which the values should be aggregated. The snippet below relies on the implicit order in which Oracle reads the rows from the datafiles; if you have row movement enabled, you may get inconsistent results, since rows can be read in different orders as they are relocated in the underlying datafiles.
SELECT id,
       day,
       LISTAGG( description, ' ' ) WITHIN GROUP ( ORDER BY ROWNUM ) AS description
FROM   your_table
GROUP BY id, day;
It would be better to have another column that stores the order within each day.
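If you do add such a column (call it seq here; the name is hypothetical), the aggregation order becomes deterministic:
SELECT id,
       day,
       LISTAGG( description, ' ' ) WITHIN GROUP ( ORDER BY seq ) AS description
FROM   your_table
GROUP BY id, day;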

Hive syntax: purpose of curly braces and dollar sign

I'm reading over some Hive scripts from another team in my company and having trouble understanding a specific part of them. The part in question is where dt='${product_dt}', which can be found on the third line from the bottom of the code chunk below.
I've never seen this syntax before nor am I able to find anything via Google search (probably because I don't know the correct search terms to use). Any insight into what that where row filter step is doing would be appreciated.
set hive.security.authorization.enabled=false;
add jar /opt/mobiletl/prod_workflow_dir/lib/hiveudf_hash.jar;
create temporary function hash_string as 'HashString';
drop table 00_truthset_product_email_uid_pid;
create table 00_truthset_product_email_uid_pid as
select distinct email,
concat_ws('|', hash_string(lower(email), "SHA-1"),
hash_string(lower(email), "MD5"),
hash_string(upper(email), "SHA-1"),
hash_string(upper(email), "MD5")) as hashed_email,
uid, address_id, confidencescore
from product.prod_vintages
where dt='${product_dt}'
and email is not null and email != ''
and address_id is not null and address_id != '';
I tried set product_dt = 2014-12;, but it doesn't seem to work:
hive> SELECT dt FROM enabilink.prod_vintages GROUP BY dt LIMIT 10;
. . .
dt
2014-12
2015-01
2015-02
2015-03
2015-05
2015-07
2015-10
2016-01
2016-02
2016-03
hive> set product_dt = 2014-12;
hive> SELECT email FROM product.prod_vintages WHERE dt='${product_dt}';
. . .
Total MapReduce CPU Time Spent: 2 seconds 570 msec
OK
email
Time taken: 25.801 seconds
Those are variables set in Hive. If you set a variable earlier in the same session, Hive will replace the reference with the specified value. For example:
set product_dt=03-11-2012;
Edit
Make sure that you remove the spaces in your dt field (use the trim UDF). Also, set the variable without spaces.
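Applied to the question above, the session should look like this (note: no spaces around the = and none in the value):
set product_dt=2014-12;
SELECT email FROM product.prod_vintages WHERE dt='${product_dt}';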

Fast infix search and count on a 40-million-row table in PostgreSQL

I'm new to database administration, but I need to create a database view, and the DB admin requires it to run in 5 minutes or less. My database is PostgreSQL 9.1.1 on Red Hat 4.4, Linux 64-bit. I'm unsure about the hardware specifications. One of the tables has 40 million rows. From that table, I have a column of directory paths which I must group by about 20 string patterns, counting the occurrences of each. The patterns require infix search, since they can appear in the middle or at the end of a path. The patterns also have a priority, as in: when %str1% then 'str1', when %str2% then 'str2', and str1, str2, str3, etc. can all occur in the same path, i.e.
path
/usr/myblock/str1/str2
/usr/myblock/something/str2
/usr/myblock/str1/something/str3
What I did so far was to build a table out of CASE statements, then join it back to the original table with LIKE, then SELECT id, pattern, count(pattern). The query runtime was terrible, taking 5 minutes to retrieve from 5.5K rows. My query looks like this:
WITH a AS (
SELECT CASE
WHEN path ~ '^/usr/myblock/(.*)str1(.*)' THEN 'str1'
WHEN path ~ '^/usr/myblock/(.*)str2$' THEN 'str2'
WHEN path ~ '^/usr/myblock/(.*)str3$' THEN 'str3'
.... --multiple other case conditions
WHEN path ~ '^/usr/myblock/' THEN 'others'
ELSE 'n/a'
END as flow
FROM mega_t WHERE left(path,13)='/usr/myblock/' limit 5)
SELECT id, a.flow, count(*) AS flow_count FROM a
JOIN mega_t ON path LIKE '%' || a.flow || '%'
WHERE (some_conditions) AND to_timestamp(test_runs.created_at::double precision)
> ('now'::text::date - '1 mon'::interval) --collect last 1 month's results only
GROUP BY id, a.flow;
My expected output for that simple case would be:
id | flow | flow_count
1 | str1 | 2
2 | str2 | 1
What is a better way to search for substrings like this and count occurrences? I can't use ts_stat, nor a plain SELECT count(path) ... WHERE path LIKE '%str1%', because of the if-else priority it needs. I read about creating trigram indexes, but I think that is overkill for my patterns. I hope this question is clear and useful. One more thing I should add: the 40-million-row table is updated every few seconds or minutes, while the view will be accessed only every eight hours.
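For what it's worth, here is a single-pass sketch (reusing the placeholder patterns from the question, and omitting the artificial id column) that preserves the if-else priority via CASE but avoids the join back to the table; each row is classified once and aggregated in the same scan:
SELECT flow, count(*) AS flow_count
FROM (
    SELECT CASE                            -- first match wins, giving the priority
               WHEN path ~ 'str1'  THEN 'str1'
               WHEN path ~ 'str2$' THEN 'str2'
               WHEN path ~ 'str3$' THEN 'str3'
               -- ... the other patterns ...
               ELSE 'others'
           END AS flow
    FROM mega_t
    WHERE left(path, 13) = '/usr/myblock/'
) a
GROUP BY flow;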

Hive: SemanticException [Error 10002]: Line 3:21 Invalid column reference 'name'

I am using the following Hive query script with version 0.13.0:
DROP TABLE IF EXISTS movies.movierating;
DROP TABLE IF EXISTS movies.list;
DROP TABLE IF EXISTS movies.rating;
DROP DATABASE IF EXISTS movies;
ADD JAR /usr/local/hadoop/hive/hive/lib/RegexLoader.jar;
CREATE DATABASE IF NOT EXISTS movies;
CREATE EXTERNAL TABLE IF NOT EXISTS movies.list (id STRING, name STRING, genre STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'
with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s");
CREATE EXTERNAL TABLE IF NOT EXISTS movies.rating (id STRING, userid STRING, rating STRING, timestamp STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'
with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s %4$s");
LOAD DATA LOCAL INPATH 'ml-10M100K/movies.dat' into TABLE movies.list;
LOAD DATA LOCAL INPATH 'ml-10M100K/ratings.dat' into TABLE movies.rating;
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating STRING);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, rating.rating from movies.list list LEFT JOIN movies.rating rating ON (list.id=rating.id) GROUP BY list.id;
The issue is when I execute the script without the "GROUP BY" clause it works fine.
But when I execute it with the "GROUP BY" clause, I get the following error
FAILED: SemanticException [Error 10002]: Line 4:21 Invalid column reference 'name'
Any ideas what is happening here?
Appreciate your help
Thanks!
If you group by a column, your select statement can only select a) that column, b) columns derived only from that column, or c) a UDAF applied to other columns.
In this case, you're only grouping by list.id, so when you try to select list.name, that's invalid. Think about it this way: what if your list table contained the following two entries:
id|name |genre
--+-----+------
01|name1|comedy
01|name2|horror
What would you expect this query to return:
select list.id, list.name, list.genre from list group by list.id;
In this case it's nonsensical. I'm guessing that id in reality is a primary key, but note that hive does not know this, so the above data set is perfectly valid.
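To make the rule concrete, here is a sketch of a query against the question's tables that selects only things allowed under GROUP BY list.id:
SELECT list.id,                -- (a) the grouping column itself
       substr(list.id, 1, 2),  -- (b) a value derived only from that column
       count(rating.rating)    -- (c) a UDAF applied to other columns
FROM movies.list list
LEFT JOIN movies.rating rating ON (list.id = rating.id)
GROUP BY list.id;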
With all that in mind, it's not clear to me how to fix it because I don't know the desired output. For example, let's say without the group by (just the join), you have as output:
id|name |genre |rating
--+-----+------+-------
01|name1|comedy|'pretty good'
01|name1|comedy|'bad'
02|name2|horror|'9/10'
03|name3|action|NULL
What would you want the output to be with the group by? What are you trying to accomplish by doing the group by?
OK, let me see if I can ask this in a better way.
Here are my two tables.
Movies list table - consists of movie information:
ID | Movie Name | Genre
---+------------+---------
1  | Movie 1    | comedy
2  | movie 2    | action
3  | movie 3    | thriller
And I have a ratings table:
MOVIE_ID | USER ID | RATING on 5 | TIMESTAMP
---------+---------+-------------+----------
1        | xyz     | 5           | 12345612
1        | abc     | 4           | 23232312
2        | zvc     | 1           | 12321123
2        | zyx     | 2           | 12312312
What I would like to do is get the output in the following way:
Movie ID | Movie Name | Genre  | Rating Average
---------+------------+--------+---------------
1        | Movie 1    | comedy | 4.5
2        | Movie 2    | action | 1.5
I am not a DB expert, but I understand this much: when you group data together, you need to reduce the multiple values to a scalar, or else all the grouped values (if strings) must be the same, right?
For example, in my previous case I was grouping them together as strings. That is fine for list.id, list.name, and list.genre, but list.rating is always going to cause a problem here (I just learned Pig along with Hive, and grouping works differently there).
So to tackle the problem, I cast the rating to a float and averaged it, storing the result in a FLOAT column. Have a look at my code below:
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating FLOAT);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, AVG(cast(rating.rating as FLOAT))
FROM movies.list list
LEFT JOIN movies.rating rating ON (list.id = rating.id)
GROUP BY list.id, list.name, list.genre
ORDER BY list.id DESC;
Thank you for your explanation. I might save the following question for the next thread, but here is my observation:
The overall job performs worse when the grouping and the join are done together than when they are done in two separate queries. For the same job, I changed the code to perform the grouping first and then the join, and the overall time dropped by 40 seconds: it was taking 140 seconds before and now takes 100 seconds. Any reason for that?
Once again thank you for your explanation.
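(For reference, a sketch of that group-first-then-join rewrite, assuming the same schema; the aggregation collapses the ratings to one row per movie before the join touches them:)
SELECT list.id, list.name, list.genre, r.avg_rating
FROM movies.list list
LEFT JOIN (
    SELECT id, AVG(cast(rating as FLOAT)) AS avg_rating
    FROM movies.rating
    GROUP BY id
) r ON (list.id = r.id)
ORDER BY list.id DESC;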
I came across the same issue:
org.apache.hadoop.hive.ql.parse.SemanticException: Invalid column reference "charge_province"
After I put "charge_province" in the GROUP BY, the issue was gone. I don't know why.
