How to group by multiple columns and then transpose in Hive - hadoop

I have some data that I want to group by multiple columns, aggregate, and then transpose into separate columns using Hive.
For example, given this input
Input:
hr type value
01 a 10
01 b 20
01 c 50
01 a 30
02 c 10
02 b 90
02 a 80
I want to produce this output:
Output:
hr a_avg b_avg c_avg
01 20 20 50
02 80 90 10
There should be one column for each distinct type in my input; a_avg is the average value of type a for each hour.
How can I do this in Hive? I am guessing I might need to make use of https://github.com/klout/brickhouse/wiki/Collect-UDFs
So far the best I can think of is to use multiple group-by clauses, but that won't transpose the data into multiple columns.
Any ideas?

You don't necessarily need to use Brickhouse, but it will definitely make it easier. Here is what I'm thinking, something like
select hr
     , type_map['a'] a_avg
     , type_map['b'] b_avg
     , type_map['c'] c_avg
from (
  select hr
       , collect(type, avg_value) type_map -- Brickhouse collect; creates a map
  from (
    select hr
         , type
         , avg( value ) avg_value
    from db.table
    group by hr, type ) x
  group by hr ) y
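If you would rather avoid Brickhouse, one common alternative is conditional aggregation: one AVG(CASE ...) expression per output column. The sketch below runs that query through Python's sqlite3 module only so it can be tried locally; the table name t is made up for illustration, and it assumes the set of types (a, b, c) is known in advance. The SELECT itself uses only standard SQL that Hive should also accept, but it has not been run on Hive here.

```python
import sqlite3

# Sample rows from the question: (hr, type, value).
ROWS = [
    ("01", "a", 10), ("01", "b", 20), ("01", "c", 50), ("01", "a", 30),
    ("02", "c", 10), ("02", "b", 90), ("02", "a", 80),
]

# One AVG(CASE ...) per output column. CASE yields NULL for rows of
# another type, and AVG ignores NULLs.
PIVOT_SQL = """
SELECT hr,
       AVG(CASE WHEN type = 'a' THEN value END) AS a_avg,
       AVG(CASE WHEN type = 'b' THEN value END) AS b_avg,
       AVG(CASE WHEN type = 'c' THEN value END) AS c_avg
FROM t
GROUP BY hr
ORDER BY hr
"""


def pivot_averages(rows):
    """Load rows into an in-memory table (hypothetical name t) and pivot."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (hr TEXT, type TEXT, value INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = conn.execute(PIVOT_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in pivot_averages(ROWS):
        print(row)
```

The trade-off is that every new type needs another column expression, whereas the map-based version only needs another type_map[...] lookup.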

Related

PLSQL script that does gcd between all possible pairs of numbers in a randomly generated table

I have to create a table with 10 entries, filled with random values from 20-100. The numbers must be distinct. Then I must create a second table, filled with the greatest common divisor of all possible pairs (x,y), where x and y are numbers from the first table. This is what I have thus far:
DROP TABLE random;
CREATE TABLE random
(
numbers int
);
DECLARE
numberToInsert int;
i int:=0;
CURSOR c_random is select trunc(dbms_random.value(20,100)) from dual;
BEGIN
LOOP
open c_random;
i:=i+1;
fetch c_random into numberToInsert;
INSERT INTO random VALUES (numberToInsert);
close c_random;
EXIT WHEN i = 10;
END LOOP;
END;
/
This inserts 10 random numbers from 20-100 into table random. However, the numbers are not always distinct, and I don't know how to do the second part. Obviously I need to use a while loop and implement Euclid's algorithm. Any help would be appreciated.
Try this query:
select x, y
from(
select 19+level as x
from dual
connect by level <= 200-19
) cross join (
select 19+level as y
from dual
connect by level <= 200-19
)
order by dbms_random.value()
fetch first 20 rows only;
The query generates two sets of numbers from 20 to 200, then combines them using a cross join, which gives a distinct set of pairs (X, Y) drawn from both sets.
Finally, the query orders the rows randomly and picks 20 of them.
The FETCH FIRST 20 ROWS ONLY clause works on Oracle 12c and later; on earlier versions you need to wrap this query in a subquery and use WHERE rownum <= 20.
To insert the result into the table, simply use:
INSERT INTO table ...
SELECT ....the above query....
Which part of the query ensures the numbers are distinct?
Well, the CROSS JOIN guarantees that.
If two distinct sets are combined using a cross join, the resulting pairs are also distinct. This is how a cross join works: it combines every row from one table with every row from the other table.
Please take a look at the simple example below; the result is distinct:
select x, y
from(
select 19+level as x
from dual
connect by level <= 22-19
) cross join (
select 19+level as y
from dual
connect by level <= 22-19
)
X Y
---------- ----------
20 20
20 21
20 22
21 20
21 21
21 22
22 20
22 21
22 22
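For reference, here is a rough Python sketch of the two steps the question describes: draw 10 distinct numbers from 20-100, then compute the gcd of every pair with Euclid's algorithm. It is not PL/SQL and not the method from the answer above; the function names are made up, and it assumes "all possible pairs" means unordered pairs of two different numbers.

```python
import random
from itertools import combinations


def euclid_gcd(a, b):
    """Greatest common divisor by Euclid's algorithm (repeated remainder)."""
    while b:
        a, b = b, a % b
    return a


def distinct_random_numbers(count=10, low=20, high=100, seed=None):
    """Draw `count` distinct integers from low..high inclusive.

    random.sample picks without replacement, so no duplicates can occur.
    """
    rng = random.Random(seed)
    return rng.sample(range(low, high + 1), count)


def gcd_of_pairs(numbers):
    """Return (x, y, gcd) for every unordered pair of the given numbers."""
    return [(x, y, euclid_gcd(x, y)) for x, y in combinations(numbers, 2)]


if __name__ == "__main__":
    numbers = distinct_random_numbers()
    for x, y, g in gcd_of_pairs(numbers):
        print(x, y, g)
```

In PL/SQL the same shape would be a loop that keeps drawing until the value is not already in the table, followed by a nested loop over the rows with the while loop from euclid_gcd inside.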

Sum multiple columns using PIG

I have multiple files with the same columns and I am trying to aggregate the values in two columns using SUM.
The column structure is below
ID first_count second_count name desc
1 10 10 A A_Desc
1 25 45 A A_Desc
1 30 25 A A_Desc
2 20 20 B B_Desc
2 40 10 B B_Desc
How can I sum first_count and second_count per ID to get the following result?
ID first_count second_count name desc
1 65 80 A A_Desc
2 60 30 B B_Desc
Below is the script I wrote, but when I execute it I get the error "Could not infer matching function for SUM as multiple or none of them fit. Please use an explicit cast."
A = LOAD '/output/*/part*' AS (id:chararray,first_count:chararray,second_count:chararray,name:chararray,desc:chararray);
B = GROUP A BY id;
C = FOREACH B GENERATE group as id,
SUM(A.first_count) as first_count,
SUM(A.second_count) as second_count,
A.name as name,
A.desc as desc;
Your load statement is wrong: first_count and second_count are loaded as chararray, and SUM can't add strings. If you are sure that these columns contain only numbers, load them as int. Try this:
A = LOAD '/output/*/part*' AS (id:chararray,first_count:int,second_count:int,name:chararray,desc:chararray);
It should work.
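To make the point about types concrete, here is a small Python sketch of the same aggregation (not Pig). The counts arrive as strings, just like chararray fields, and have to be converted to integers before they can be added. Grouping on name and desc together with the ID is an assumption made here so that they can be carried through to the output; the function name is made up.

```python
from collections import defaultdict

# Rows as they would be read from a text file: every field is a string.
ROWS = [
    ("1", "10", "10", "A", "A_Desc"),
    ("1", "25", "45", "A", "A_Desc"),
    ("1", "30", "25", "A", "A_Desc"),
    ("2", "20", "20", "B", "B_Desc"),
    ("2", "40", "10", "B", "B_Desc"),
]


def sum_counts(rows):
    """Sum first_count and second_count per (id, name, desc) group."""
    totals = defaultdict(lambda: [0, 0])
    for rec_id, first, second, name, desc in rows:
        key = (rec_id, name, desc)
        # int() plays the role of the explicit cast the error asks for.
        totals[key][0] += int(first)
        totals[key][1] += int(second)
    return [
        (rec_id, first, second, name, desc)
        for (rec_id, name, desc), (first, second) in sorted(totals.items())
    ]


if __name__ == "__main__":
    for row in sum_counts(ROWS):
        print(row)
```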

Event Study (Extracting Dates in SAS)

I need to analyse abnormal returns for an event study on mergers and acquisitions.
I would like to analyse abnormal returns to acquirers by using event windows. Basically I would like to extract the prices for the acquirers using -1 (the day before the announcement date), the announcement date, and +1 (the day after the announcement date).
I have two different datasets to extract information from.
The first is a dataset with all the merger and acquisition information that has the information in the following format:
DealNO AcquirerNO TargetNO AnnouncementDate
123 abcd Cfgg 22/12/2010
222 qwert cddfgf 26/12/1998
In addition, I have a 2nd dataset which has all the prices.
ISINnumber Date Price
abcd 21/12/2010 10
abcd 22/12/2010 11
abcd 23/12/2010 11
abcd 24/12/2010 12
qwert 20/12/1998 20
qwert 21/12/1998 20
qwert 22/12/1998 21
qwert 23/12/1998 21
qwert 24/12/1998 21
qwert 25/12/1998 22
qwert 26/12/1998 21
qwert 27/12/1998 23
ISIN number is the same as acquirer no, and that is the matching code.
In the end I would like to have a database something like this:
DealNO AcquirerNO TargetNO AnnouncementDate Acquirerprice(-1day) Acquirerprice(0day) Acquirerprice(+1day)
123 abcd Cfgg 22/12/2010 10 11 12
222 qwert cddfgf 26/12/1998 22 21 23
Do you know how I can get this?
I'd prefer to use SAS to run the code, but if you are familiar with any other programs that can get the data like this, please let me know.
Thank you in advance ^_^.
This can be done quite easily with PROC SQL by joining the PRICE dataset three times, once for each day in the event window. Try this (assuming data set names of ANNOUNCE and PRICE, and that both date columns are stored as SAS date values):
Warning: untested code
proc sql;
  create table RESULT as
  select a.dealno,
         a.acquirerno,
         a.targetno,
         a.announcementdate,
         p.price as acquirerprice_prev,
         c.price as acquirerprice_cur,
         n.price as acquirerprice_next
  from ANNOUNCE a
  /* match on each deal's own announcement date, not on one fixed day */
  /* offsets are calendar days, so a weekend or holiday may have no price row */
  left join PRICE p on a.acquirerno = p.isinnumber and p.date = a.announcementdate - 1
  left join PRICE c on a.acquirerno = c.isinnumber and c.date = a.announcementdate
  left join PRICE n on a.acquirerno = n.isinnumber and n.date = a.announcementdate + 1
  ;
quit;
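Since the question is open to other tools, here is a minimal Python sketch of the same event-window lookup. It is only an illustration under assumptions: the function name is made up, dates are already parsed into date objects, and the offsets are calendar days, so a missing trading day simply yields None.

```python
from datetime import date, timedelta

# (DealNO, AcquirerNO, TargetNO, AnnouncementDate) from the question.
DEALS = [
    (123, "abcd", "Cfgg", date(2010, 12, 22)),
    (222, "qwert", "cddfgf", date(1998, 12, 26)),
]

# (ISINnumber, Date, Price); a subset of the question's price rows.
PRICES = [
    ("abcd", date(2010, 12, 21), 10),
    ("abcd", date(2010, 12, 22), 11),
    ("abcd", date(2010, 12, 23), 11),
    ("abcd", date(2010, 12, 24), 12),
    ("qwert", date(1998, 12, 24), 21),
    ("qwert", date(1998, 12, 25), 22),
    ("qwert", date(1998, 12, 26), 21),
    ("qwert", date(1998, 12, 27), 23),
]


def event_window_prices(deals, prices, offsets=(-1, 0, 1)):
    """Append the acquirer's price at each day offset to every deal row."""
    lookup = {(isin, day): price for isin, day, price in prices}
    result = []
    for deal_no, acquirer, target, announced in deals:
        window = tuple(
            lookup.get((acquirer, announced + timedelta(days=offset)))
            for offset in offsets
        )
        result.append((deal_no, acquirer, target, announced) + window)
    return result


if __name__ == "__main__":
    for row in event_window_prices(DEALS, PRICES):
        print(row)
```

Passing a wider tuple of offsets, for example (-2, -1, 0, 1, 2), widens the window without any other change.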

Hive: Joining two tables with different keys

I have two tables like the ones below. I want to join them and get the result shown under Expected Output.
Some rows of table 2 (categories 10, 12, 15 and 43) do not have an activity ID; the field is just empty.
All fields are tab separated. Category "33" has three descriptions in table 2.
We need to make use of "Activity ID" to get the right result for category "33", as there are 3 values for it.
Could anyone tell me how to achieve this output?
TABLE: 1
Empid Category ActivityID
44126 33 TRAIN
44127 10 UFL
44128 12 TOI
44129 33 UNASSIGNED
44130 15 MICROSOFT
44131 33 BENEFITS
44132 43 BENEFITS
TABLE 2:
Category ActivityID Categdesc
10 billable
12 billable
15 Non-billable
33 TRAIN Training
33 UNASSIGNED Bench
33 BENEFITS Benefits
43 Benefits
Expected Output:
44126 33 Training
44127 10 Billable
44128 12 Billable
44129 33 Bench
44130 15 Non-billable
44131 33 Benefits
44132 43 Benefits
It's a little difficult to do this in Hive as there are many limitations. This is how I solved it, but there could be a better way.
I named your tables as below.
Table1 = EmpActivity
Table2 = ActivityMas
The challenge comes from the empty ActivityID fields in Table2. I created a view and used UNION ALL to combine the results of two distinct queries.
Create view ActView AS Select * from ActivityMas Where ActivityId = '';
SELECT * From (
  Select EmpActivity.EmpId, EmpActivity.Category, ActivityMas.categdesc
  from EmpActivity JOIN ActivityMas
  ON EmpActivity.Category = ActivityMas.Category
  AND EmpActivity.ActivityId = ActivityMas.ActivityId
  UNION ALL
  Select EmpActivity.EmpId, EmpActivity.Category, ActView.categdesc from EmpActivity
  JOIN ActView ON EmpActivity.Category = ActView.Category
) u;
You have to wrap the queries in a top-level SELECT because older Hive versions (0.12 and earlier) do not support UNION ALL directly at the top level; the subquery also needs an alias (u here). This runs 3 MR jobs in total. Below is the result I got.
44127 10 billable
44128 12 billable
44130 15 Non-billable
44132 43 Benefits
44131 33 Benefits
44126 33 Training
44129 33 Bench
I'm not sure if I understand your question or your data, but would this work?
select table1.empid, table1.category, table2.categdesc
from table1 join table2
on table1.activityID = table2.activityID;
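Another option, sketched here as an assumption rather than a tested Hive solution, is to join the lookup table twice: once on the exact (category, activity) pair and once on the category's empty-activity row, then take whichever description is present with COALESCE. The snippet runs the query through Python's sqlite3 module so it can be checked locally; the table names emp_activity and activity_mas are made up, and the SELECT uses only equality join conditions and COALESCE, which Hive also provides.

```python
import sqlite3

# Table 1: (empid, category, activityid).
EMP = [
    (44126, "33", "TRAIN"), (44127, "10", "UFL"), (44128, "12", "TOI"),
    (44129, "33", "UNASSIGNED"), (44130, "15", "MICROSOFT"),
    (44131, "33", "BENEFITS"), (44132, "43", "BENEFITS"),
]

# Table 2: (category, activityid, categdesc); empty string = no activity id.
MAS = [
    ("10", "", "billable"), ("12", "", "billable"), ("15", "", "Non-billable"),
    ("33", "TRAIN", "Training"), ("33", "UNASSIGNED", "Bench"),
    ("33", "BENEFITS", "Benefits"), ("43", "", "Benefits"),
]

# x = exact match on category + activity, d = category default (empty activity).
JOIN_SQL = """
SELECT e.empid, e.category,
       COALESCE(x.categdesc, d.categdesc) AS categdesc
FROM emp_activity e
LEFT JOIN activity_mas x
       ON e.category = x.category AND e.activityid = x.activityid
LEFT JOIN activity_mas d
       ON e.category = d.category AND d.activityid = ''
ORDER BY e.empid
"""


def describe_employees(emp_rows, mas_rows):
    """Return (empid, category, categdesc) with an exact-then-default lookup."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp_activity (empid INTEGER, category TEXT, activityid TEXT)")
    conn.execute("CREATE TABLE activity_mas (category TEXT, activityid TEXT, categdesc TEXT)")
    conn.executemany("INSERT INTO emp_activity VALUES (?, ?, ?)", emp_rows)
    conn.executemany("INSERT INTO activity_mas VALUES (?, ?, ?)", mas_rows)
    result = conn.execute(JOIN_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in describe_employees(EMP, MAS):
        print(row)
```

This relies on each category having at most one empty-activity row in table 2; with more than one, the default join would duplicate employees.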

One Select, Two Resultsets: Returning Summary and Detail based on same query

I'm attempting to perform a large query where I want to:
Query the detail rows, then
Perform aggregations based on the results returned
Essentially, I want to perform my data-intensive query ONCE and derive both summary and detail values from it. I'm SURE there is a better way to do this in the frontend application (e.g. detail rows in the SQL, aggregation in the front end), but I want to know how to do it all in PL/SQL using essentially one select against the db; for performance reasons, I don't want to run essentially the same large select twice. At this point my reasons for wanting to do it in one query might be called stubborn: even if there's a better way, I'd like to know if it can be done.
I know how to get the basic "detail-level" resultset. That query would return data such as:
UPC-Region-ProjectType-TotalAssignments-IncompleteAssignments
So say I have 11 records:
10-A-X-20-10
11-B-X-10-5
12-C-Y-30-15
13-C-Z-20-10
14-A-Y-10-5
15-B-X-30-15
16-C-Z-20-10
17-B-Y-10-5
18-C-Z-30-15
19-A-X-20-10
20-B-X-10-5
I want to be able to perform the query, then perform aggregations on that resultset, such as:
Region A Projects: 3
Region A Total Assign: 50
Region A Incompl Assign: 25
Region B...
Region C...
Project Type X Projects: 5
Project Type X Total Assign: 90
Project Type X Incompl Assign: 45
Project Type Y...
Project Type Z...
And then return both resultsets (Summary + Detail) to the calling application.
I guess the idea would be to run the details query into a temp table, then select from it and aggregate to build the second "summary level" query, and pass the two resultsets back as two ref cursors.
But I'm open to ideas...
My initial attempts have been:
type rec_projects is record
(record matching my DetailsSQL)
/* record variable */
project_resultset rec_projects;
/* cursor variable */
OPEN cursorvar1 FOR
select
upc,
region,
project_type,
tot_assigns,
incompl_assigns
...
Then I:
loop
fetch cursorvar1 into project_resultset;
exit when cursorvar1%NOTFOUND;
/* perform row-by-row aggregations into variables */
If project_resultset.region = 'A'
then
numAProj := numAProj + 1;
numATotalAssign := numATotalAssign + project_resultset.Totassigns;
numAIncomplAssign := numAIncomplAssign + project_resultset.Incomplassigns;
and so on...
end loop;
Followed by opening another refcursor var - selecting the variables from DUAL:
open cursorvar2 for
select
numAProj, numATotalAssign, numAIncomplAssign, etc, etc from dual;
Lastly:
cur_out1 := cursorvar1;
cur_out2 := cursorvar2;
This is not working. cursorvar1 seems to load fine, and I get into the loop, but I'm not ending up with anything in cursorvar2, and I feel I'm probably on the wrong path here (that there is a better way to do it).
Thanks for your help.
I prefer doing all calculations on server side.
Both types of information (detail + master) can be fetched through a single cursor:
with
DET as (
-- your details subquery here
select
UPC,
Region,
Project_Type,
Total_Assignments,
Incomplete_Assignments
from ...
)
select
UPC,
Region,
Project_Type,
Total_Assignments,
Incomplete_Assignments,
null as projects_ctr
from DET
union all
select
null as UPC,
Region,
null as Project_Type,
sum(Total_Assignments) as Total_Assignments,
sum(Incomplete_Assignments) as Incomplete_Assignments,
count(0) as projects_ctr
from DET
group by Region
union all
select
null as UPC,
null as Region,
Project_Type,
sum(Total_Assignments) as Total_Assignments,
sum(Incomplete_Assignments) as Incomplete_Assignments,
count(0) as projects_ctr
from DET
group by Project_Type
order by UPC nulls first, Region, Project_Type
Result:
UPC Region Project_Type Total_Assignments Incomplete_Assignments Projects_Ctr
------ ------ ------------ ----------------- ---------------------- ------------
(null) A (null) 50 25 3
(null) B (null) 60 30 4
(null) C (null) 100 50 4
(null) (null) X 90 45 5
(null) (null) Y 50 25 3
(null) (null) Z 70 35 3
10 A X 20 10 (null)
11 B X 10 5 (null)
12 C Y 30 15 (null)
13 C Z 20 10 (null)
14 A Y 10 5 (null)
15 B X 30 15 (null)
16 C Z 20 10 (null)
17 B Y 10 5 (null)
18 C Z 30 15 (null)
19 A X 20 10 (null)
20 B X 10 5 (null)
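To check the shape of this result without an Oracle instance, the same WITH ... UNION ALL pattern can be run through Python's sqlite3 module, as in the sketch below. The table name detail is made up for illustration, the ORDER BY is left out, and SQLite stands in for Oracle here, so treat it as an approximation of the query above rather than a copy of it.

```python
import sqlite3

# The 11 detail rows from the question:
# (upc, region, project_type, total_assignments, incomplete_assignments).
DETAIL_ROWS = [
    (10, "A", "X", 20, 10), (11, "B", "X", 10, 5), (12, "C", "Y", 30, 15),
    (13, "C", "Z", 20, 10), (14, "A", "Y", 10, 5), (15, "B", "X", 30, 15),
    (16, "C", "Z", 20, 10), (17, "B", "Y", 10, 5), (18, "C", "Z", 30, 15),
    (19, "A", "X", 20, 10), (20, "B", "X", 10, 5),
]

# Detail rows, then per-region totals, then per-project-type totals,
# all read from the single det subquery.
COMBINED_SQL = """
WITH det AS (
    SELECT upc, region, project_type, total_assignments, incomplete_assignments
    FROM detail
)
SELECT upc, region, project_type,
       total_assignments, incomplete_assignments, NULL AS projects_ctr
FROM det
UNION ALL
SELECT NULL, region, NULL,
       SUM(total_assignments), SUM(incomplete_assignments), COUNT(*)
FROM det
GROUP BY region
UNION ALL
SELECT NULL, NULL, project_type,
       SUM(total_assignments), SUM(incomplete_assignments), COUNT(*)
FROM det
GROUP BY project_type
"""


def detail_and_summary(rows):
    """Return detail and summary rows together from one statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE detail (upc INTEGER, region TEXT, project_type TEXT, "
        "total_assignments INTEGER, incomplete_assignments INTEGER)"
    )
    conn.executemany("INSERT INTO detail VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(COMBINED_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in detail_and_summary(DETAIL_ROWS):
        print(row)
```

The caller can tell the row kinds apart by which columns are NULL: detail rows have a UPC, region summaries have only a region, and project-type summaries have only a project type.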
If you are going to be creating these reports regularly, it might be better to create a global temporary table to store the results of your initial query:
CREATE GLOBAL TEMPORARY TABLE MY_TEMP_TABLE
ON COMMIT DELETE ROWS
AS
SELECT
UPC,
Region,
ProjectType,
TotalAssignments,
IncompleteAssignments
FROM WHEREVER
;
You can then run a series of follow-up queries to calculate the various statistics values for your report and output them in a format other than a large text table.
