Rolling Distinct Counts - hadoop

I asked this question regarding how to get a rolling count of distinct users using SQL, but I also have Hadoop at my disposal, and now I'm wondering whether this analysis is better suited to Hadoop. Unfortunately, I'm new to Hadoop, so beyond loading the data and running the most basic MapReduce jobs I don't know how to approach this. Assuming this is a good candidate for Hadoop, what is the best approach to take?

One possible way to model this in MapReduce:
M/R Job 1:
Mapper: emit the USER_ID as the output key and the date portion of CONNECT_TS as the output value.
Reducer: for each USER_ID group (input key), output the minimum date observed (output value).
M/R Job 2:
Mapper: from the previous job's output, swap the key/value pairs (DATE as output key, USER_ID as value).
Reducer (single): for each input key (date), count the number of user IDs (values); output this count along with an accumulated running total. Because a lone reducer sees its keys in sorted order, the running total can simply be carried across dates.
In Pig, everything except keeping the running total could be done using the following script (one possible workaround for the running total is sketched after the sample output below):
A = LOAD '/home/cswhite/data.tsv' USING PigStorage('\t') AS (SESSION, USER_ID, TIMESTAMP);
B = foreach A GENERATE USER_ID, SUBSTRING(TIMESTAMP, 0, 10) AS DATE;
BF = filter B by DATE >= '2013-01-01';
C = group BF by USER_ID;
D = foreach C {
    sorted = order BF by DATE;
    earliest = limit sorted 1;
    generate group, flatten(earliest);
}
E = foreach D generate group as USER_ID, earliest::DATE as DATE;
F = group E by DATE;
G = foreach F generate group as DATE, COUNT(E) as USERS_CNT;
H = group G all;
I = foreach H generate SUM(G.USERS_CNT) as TOTAL_USERS;
So for the following tab-separated input data:
1 99 2013-01-01 2:23:33
2 101 2013-01-01 2:23:55
3 104 2013-01-01 2:24:41
4 101 2013-01-01 2:24:43
5 101 2013-01-02 2:25:01
6 102 2013-01-02 2:26:01
7 99 2013-01-03 2:27:01
8 92 2013-01-04 2:28:01
9 234 2013-01-05 2:29:01
The alias G is as follows:
(2013-01-01,3)
(2013-01-02,1)
(2013-01-04,1)
(2013-01-05,1)
And alias 'I' is:
(6)
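One possible (if quadratic) way to keep the running total in Pig itself is to cross G with a copy of itself and sum the counts of all earlier-or-equal dates; this is only a sketch, affordable here because G holds one row per date:
G2 = foreach G generate DATE as D2, USERS_CNT as CNT2;
X = cross G, G2;
XF = filter X by D2 <= DATE;  -- ISO dates compare correctly as strings
XG = group XF by DATE;
RUNNING = foreach XG generate group as DATE, SUM(XF.CNT2) as RUNNING_TOTAL;
Each row of RUNNING then pairs a date with the cumulative count of distinct users first seen on or before that date.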

Related

aggregate function with join by most recent date Oracle

I'm trying to select a single test score per student by test type, test element, semester, and ID: one score per ID. If a student has taken the test more than once, I want to return only the highest (or most recent) score for that test and element.
My problem is that there are a very small number of instances (fewer than 10 records out of 2000) where a student has two test scores recorded on different dates because they've re-taken the test to improve their score, and we record both scores. My output therefore has a small number of records with multiple scores for unique name_ids (where test_id = 'act' and element_id = 'comp').
Examples of the two tables are:
students
----------------------------------------------------------
name_id term_id
100 Fall
100 Spring
100 Summer
105 Fall
105 Spring
110 Fall
110 Spring
110 Summer
test_score
----------------------------------------------------------
name_id test_id element_id score test_date
100 act comp 25 02/01/2019
100 sat comp 1250 01/20/2019
105 act comp 19 01/15/2019
105 act comp 21 02/28/2019
110 act comp 27 01/31/2019
I've tried using MAX(test_score), but perhaps I could use MAX(test_date)? Either would work, because students don't report additional test scores from later dates if those scores aren't higher than what was originally reported.
This is a small part of a larger routine joining several tables, so I don't know that I can replace my JOIN(s). I'm just trying to get this small subset of the routine to produce the correct number of unique records:
SELECT
a.name_id NameID,
a.term_id TermID,
MAX(b.score) Score
FROM students a
LEFT JOIN test_score b ON a.name_id = b.name_id AND b.test_id = 'act' AND b.element_id = 'comp'
WHERE a.term_id = 'Spring'
group by b.score,a.name_id,a.term_id
order by a.name_id
There are no error messages, but the results from the above yield two records for NameID 105:
NameID TermID Score
100 Spring 25
105 Spring 19
105 Spring 21
110 Spring 27
I'm not certain how to write this so that it selects only the highest score (or only the score from the most recent date).
Thanks for your guidance.
To select the highest score, the GROUP BY cannot include the score...
SELECT
a.name_id NameID,
a.term_id TermID,
MAX(b.score) Score
FROM students a
LEFT JOIN test_score b ON a.name_id = b.name_id AND b.test_id = 'act' AND b.element_id = 'comp'
WHERE a.term_id = 'Spring'
group by a.name_id,a.term_id
order by a.name_id
To get the score associated with the highest date...
SELECT x.NameID,
x.TermID,
y.score
FROM (
SELECT
a.name_id NameID,
a.term_id TermID,
-- MAX(b.score) Score
MAX(b.test_date) test_date
FROM students a
LEFT JOIN test_score b ON a.name_id = b.name_id AND b.test_id = 'act' AND b.element_id = 'comp'
WHERE a.term_id = 'Spring'
group by a.name_id,a.term_id ) x
LEFT JOIN test_score y ON x.nameid = y.name_id AND x.test_date = y.test_date
order by x.nameid

Pig: Get first occurrence of variable in a group (while aggregating other variables)?

I have a dataset that looks like
gr col1 col2
A 2 'haha'
A 4 'haha'
A 3 'haha'
B 5 'hoho'
B 1 'hoho'
As you can see, in each group gr there is a numeric variable col1 and a string variable col2 that is the same within each group.
How can I express the following pseudo-code in Pig?
foreach group gr: generate the mean of col1 and get the first occurrence of col2
so output would look like
gr mean name
A 3 'haha'
B 3 'hoho'
thanks!
GROUP BY (gr, col2) and take the AVG of col1. This assumes the fields are tab-separated.
PigScript
A = load 'test6.txt' USING PigStorage('\t') as (gr:chararray,col1:int,col2:chararray);
B = GROUP A BY (gr,col2);
C = FOREACH B GENERATE FLATTEN(group) as (gr,name),AVG(A.col1) as mean;
DUMP C;
Note: if you want the fields in the order shown in your expected output, add an extra step:
D = FOREACH C GENERATE $0 as gr,$2 as mean,$1 as name;
Output
(A,'haha',3.0)
(B,'hoho',3.0)

Sum multiple columns using PIG

I have multiple files with the same columns, and I am trying to aggregate the values in two columns using SUM.
The column structure is below
ID first_count second_count name desc
1 10 10 A A_Desc
1 25 45 A A_Desc
1 30 25 A A_Desc
2 20 20 B B_Desc
2 40 10 B B_Desc
How can I sum the first_count and second_count?
ID first_count second_count name desc
1 65 80 A A_Desc
2 60 30 B B_Desc
Below is the script I wrote, but when I execute it I get the error "Could not infer the matching function for SUM as multiple or none of them fit. Please use an explicit cast."
A = LOAD '/output/*/part*' AS (id:chararray,first_count:chararray,second_count:chararray,name:chararray,desc:chararray);
B = GROUP A BY id;
C = FOREACH B GENERATE group as id,
SUM(A.first_count) as first_count,
SUM(A.second_count) as second_count,
A.name as name,
A.desc as desc;
Your load statement is the problem: first_count and second_count are loaded as chararray, and SUM can't add two strings. If you are sure these columns will only ever contain numbers, load them as int. Try this:
A = LOAD '/output/*/part*' AS (id:chararray,first_count:int,second_count:int,name:chararray,desc:chararray);
It should work.
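A further wrinkle, beyond the cast: A.name and A.desc in the FOREACH project bags, not single values. Assuming name and desc are constant per id, as the sample data suggests, a sketch of the full script folds them into the grouping key:
A = LOAD '/output/*/part*' AS (id:chararray, first_count:int, second_count:int, name:chararray, desc:chararray);
B = GROUP A BY (id, name, desc);
C = FOREACH B GENERATE FLATTEN(group) AS (id, name, desc),
    SUM(A.first_count) AS first_count,
    SUM(A.second_count) AS second_count;
DUMP C;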

How to get the start and end events from a table

I have the following records in a table
session_id sequence timestamp
1 1 298349
1 2 299234
1 3 234255
2 1 153523
2 2 234524
3 1 123434
I want to have the following results
session_id start end
1 298349 234255
2 153523 234524
3 123434 123434
How can I do this in pig?
register 'file:$piglib/datafu-1.2.0.jar';
define FirstTupleFromBag datafu.pig.bags.FirstTupleFromBag();
input_data = load 'so.txt' using PigStorage('\t') as (session_id:int, sequence:int, time:long);
g = group input_data by session_id;
r = foreach g {
    s1 = order input_data by sequence asc;
    s2 = order input_data by sequence desc;
    generate group as session_id, FirstTupleFromBag(s1, null).time as start, FirstTupleFromBag(s2, null).time as end;
}
dump r;
First we group by session_id, then sort each group by sequence in ascending and in descending order, and take the first tuple of each sorted bag.
This makes use of the datafu UDF library (http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/bags/FirstTupleFromBag.html)
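If adding the datafu jar is not an option, the same idea can be sketched in plain Pig with nested ORDER/LIMIT and FLATTEN; flattening two single-tuple bags yields their one-row cross product:
g = group input_data by session_id;
r2 = foreach g {
    sorted_asc = order input_data by sequence asc;
    sorted_desc = order input_data by sequence desc;
    first_row = limit sorted_asc 1;
    last_row = limit sorted_desc 1;
    generate group as session_id, flatten(first_row.time) as start, flatten(last_row.time) as end;
}
dump r2;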

Eliminate pairs of observations under the condition that observations can have more than one possible partner observation

In my current project we have had several occasions where we needed to implement a matching based on varying conditions. First, a more detailed description of the problem.
We have a table test:
key Value
1 10
1 -10
1 10
1 20
1 -10
1 10
2 10
2 -10
Now we want to apply a rule so that, inside a group (defined by the value of key), pairs with a sum of 0 are eliminated.
The expected result would be:
key value
1 10
1 20
Sort order is not relevant.
The following code is an example of our solution.
We want to eliminate the observations with my_id 2 and 7, and additionally 2 of the 3 observations with amount 10.
data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;
/* get all possible matches represented by pairs of my_id */
proc sql noprint;
create table zwischen_erg as
select a.my_id as a_id,
b.my_id as b_id
from test as a inner join
test as b on (a.alias=b.alias)
where a.amount=-b.amount;
quit;
/* select ids of matches to eliminate */
proc sort data=zwischen_erg ;
by a_id b_id;
run;
data zwischen_erg1;
set zwischen_erg;
by a_id;
if first.a_id then tmp_id1 = 0;
tmp_id1 +1;
run;
proc sort data=zwischen_erg;
by b_id a_id;
run;
data zwischen_erg2;
set zwischen_erg;
by b_id;
if first.b_id then tmp_id2 = 0;
tmp_id2 +1;
run;
proc sql;
create table delete_ids as
select zwischen_erg1.a_id as my_id
from zwischen_erg1 as erg1 left join
zwischen_erg2 as erg2 on
(erg1.a_id = erg2.a_id and
erg1.b_id = erg2.b_id)
where tmp_id1 = tmp_id2
;
quit;
/* use delete_ids as filter */
proc sql noprint;
create table erg as
select a.*
from test as a left join
delete_ids as b on (a.my_id = b.my_id)
where b.my_id=.;
quit;
The algorithm seems to work; at least nobody has found input data that caused an error.
But nobody could explain to me why it works, and I don't understand in detail how it works.
So I have a couple of questions.
Does this algorithm eliminate the pairs correctly for all possible combinations of input data?
If it does work correctly, how does the algorithm work in detail? Especially the part where tmp_id1 = tmp_id2.
Is there a better algorithm to eliminate corresponding pairs?
Thanks in advance and happy coding
Michael
As an answer to your third question: the following approach seems simpler to me, and probably more performant, since it has no joins.
/*For every (absolute) value, find how many more positive/negative occurrences we have per key*/
proc sql;
create view V_INTERMEDIATE_VIEW as
select key, abs(Value) as Value_abs, sum(sign(value)) as balance
from INPUT_DATA
group by key, Value_abs
;
quit;
*The balance variable here means how many times more often did we see the positive than the negative of this value. I.e., how many of either the positive or the negative were we not able to eliminate;
/*Now output*/
data OUTPUT_DATA (keep=key Value);
set V_INTERMEDIATE_VIEW;
Value = sign(balance)*Value_abs; *Put the correct value back;
do i=1 to abs(balance) by 1;
output;
end;
run;
If you only want pure SAS (so no proc sql), you could do it as below. Note that the idea behind it remains the same.
data V_INTERMEDIATE_VIEW /view=V_INTERMEDIATE_VIEW;
set INPUT_DATA;
value_abs = abs(value);
run;
proc sort data=V_INTERMEDIATE_VIEW out=INTERMEDIATE_DATA;
by key value_abs; *we will encounter the negatives of each value and then the positives;
run;
data OUTPUT_DATA (keep=key value);
set INTERMEDIATE_DATA;
by key value_abs;
retain balance 0;
balance = sum(balance,sign(value));
if last.value_abs then do;
value = sign(balance)*value_abs; *set sign depending on what we have in excess;
do i=1 to abs(balance) by 1;
output;
end;
balance=0; *reset balance for next value_abs;
end;
run;
NOTE: thanks to Joe for some useful performance suggestions.
I don't see any bugs after a quick read, but "zwischen_erg" could contain a lot of unnecessary many-to-many matches, which would be inefficient.
The following seems to work (though that's not guaranteed) and might be more efficient. It is also shorter, so it is perhaps easier to see what's going on: the k-th occurrence of each amount within an alias is matched against the k-th occurrence of its negative, and only rows without such a partner survive.
data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;
proc sort data=test;
by alias amount;
run;
data zwischen_erg;
set test;
by alias amount;
if first.amount then occurrence = 0;
occurrence+1;
run;
proc sql;
create table zwischen as
select
a.my_id,
a.alias,
a.amount
from zwischen_erg as a
left join zwischen_erg as b
on a.amount = (-1)*b.amount and a.occurrence = b.occurrence
where b.my_id is missing;
quit;
