Comparing several table columns - performance

I have two tables, each with three columns but a different number of rows. If, for a certain row, the values in columns one and two are the same as in the other table, I want to select the value from the third column. Below is sample code that works, but since in reality I have over 2 million rows in both tables, it takes a very long time to run. Is there a way to speed up the code by getting rid of the loops?
MOOSE2 = table(['2010/03/30 00:30:00'; '2010/03/22 18:00:00'; '2010/04/21 18:30:00'; '2010/02/20 02:20:00'; '2010/03/10 02:30:00'], [5; 8; 4; 9; 7], [10; 11; 12; 13; 14]);
Lion2 = table(['2010/03/30 00:30:00'; '2010/04/21 18:30:00'; '2010/03/20 22:00:00'; '2010/03/10 02:00:00'], [5; 4; 6; 7], [17; 12; 11; 14]);
[sMOOSE, ~] = size(MOOSE2);
[sLion, ~] = size(Lion2);
dmoose = [];
dlion = [];
for i = 1:sLion
    for j = 1:sMOOSE
        if (MOOSE2.(1)(j,:) == Lion2.(1)(i,:)) & (MOOSE2.(2)(j,:) == Lion2.(2)(i,:))
            dmoose = [dmoose; MOOSE2.(3)(j,:)];
            dlion = [dlion; Lion2.(3)(i,:)];
        end
    end
end
Which gives me the correct output of
dlion =
17
12
dmoose =
10
12

Perfect scenario for intersect:
MOOSE2=table(['2010/03/30 00:30:00'; '2010/03/22 18:00:00' ; '2010/04/21 18:30:00'; '2010/02/20 02:20:00'; '2010/03/10 02:30:00'],[5 ;8 ;4; 9 ;7],[10; 11 ;12 ;13 ;14]);
Lion2=table(['2010/03/30 00:30:00'; '2010/04/21 18:30:00'; '2010/03/20 22:00:00'; '2010/03/10 02:00:00'],[5;4;6;7],[17;12;11;14]);
[~,moose_index,lion_index] = intersect(MOOSE2(:,1:2),Lion2(:,1:2),'rows');
dlion = Lion2.Var3(lion_index)
dmoose = MOOSE2.Var3(moose_index)
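For readers coming from other stacks, the same two-column match is just an inner join on the first two columns. A minimal sketch in Python with pandas (pandas and the column names ts/val/data are illustrative assumptions, not part of the original MATLAB question):
import pandas as pd

# Hypothetical frames mirroring the two MATLAB tables above.
moose = pd.DataFrame({"ts": ["2010/03/30 00:30:00", "2010/03/22 18:00:00",
                             "2010/04/21 18:30:00", "2010/02/20 02:20:00",
                             "2010/03/10 02:30:00"],
                      "val": [5, 8, 4, 9, 7],
                      "data": [10, 11, 12, 13, 14]})
lion = pd.DataFrame({"ts": ["2010/03/30 00:30:00", "2010/04/21 18:30:00",
                            "2010/03/20 22:00:00", "2010/03/10 02:00:00"],
                     "val": [5, 4, 6, 7],
                     "data": [17, 12, 11, 14]})

# Inner join on the first two columns; suffixes keep both third columns.
matched = moose.merge(lion, on=["ts", "val"], suffixes=("_moose", "_lion"))
print(matched["data_moose"].tolist())  # [10, 12]
print(matched["data_lion"].tolist())   # [17, 12]
Like intersect, the join hashes the key columns instead of comparing every pair of rows, which is why either approach should scale far better than the double loop.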

Related

Advice to make my Pig code below simpler

Here is my code; it does two GROUP ALL operations and works. My goal is to produce the unique user count of all students together with their total scores, plus the unique user count of students located in CA. Is there a way to simplify this down to one GROUP operation, or any other constructive idea to make the code simpler, for example using only one FOREACH operation? Thanks.
student_all = group student all;
student_all_summary = FOREACH student_all GENERATE COUNT_STAR(student) as uu_count, SUM(student.mathScore) as count1, SUM(student.verbScore) as count2;
student_CA = filter student by LID == 1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA);
Sample input (student ID, location ID, mathScore, verbScore):
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
Sample output (unique user count, unique user count in CA, sum of mathScore of all students, sum of verbScore of all students):
7 3 150 240
thanks in advance,
Lin
You might be looking for this.
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int, lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
    student_CA = filter data by lid == 1;
    student_CA_sum = SUM(student_CA.sid);
    student_CA_count = COUNT(student_CA.sid);
    mathScore = SUM(data.ms);
    verbScore = SUM(data.vs);
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output is:
grunt> dump result
(6,3,150,240)
grunt> describe result
result: {student_CA_sum: long,student_CA_count: long,mathScore: long,verbScore: long}
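To see what the nested FOREACH computes, here is a rough plain-Python equivalent over the sample rows (a sketch only; the tuple layout follows the load schema above):
# One pass over all rows, mirroring the nested FOREACH logic.
rows = [(1, 1, 10, 20), (2, 1, 20, 30), (3, 1, 30, 40),
        (4, 2, 30, 50), (5, 2, 30, 50), (6, 3, 30, 50)]

student_CA_sum = sum(sid for sid, lid, ms, vs in rows if lid == 1)
student_CA_count = sum(1 for sid, lid, ms, vs in rows if lid == 1)
mathScore = sum(ms for sid, lid, ms, vs in rows)
verbScore = sum(vs for sid, lid, ms, vs in rows)

print(student_CA_sum, student_CA_count, mathScore, verbScore)
# 6 3 150 240, matching the dump above
Note that the first field sums the CA student IDs; this detail matters in the correction further down.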
First load the file (student) into the Hadoop file system, then perform the actions below.
split student into student_CA if locationId == 1, student_Other if locationId != 1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA) as uu_count, COUNT_STAR(student_CA) as locationCACount, SUM(student_CA.mathScore) as mScoreCount, SUM(student_CA.verbScore) as vScoreCount;
student_Other_all = group student_Other all;
student_Other_all_summary = FOREACH student_Other_all GENERATE COUNT_STAR(student_Other) as uu_count, 0 as locationOtherCount:long, SUM(student_Other.mathScore) as mScoreCount, SUM(student_Other.verbScore) as vScoreCount;
student_CAandOther_all_summary = UNION student_CA_all_summary, student_Other_all_summary;
student_summary_all = group student_CAandOther_all_summary all;
student_summary = foreach student_summary_all generate SUM(student_CAandOther_all_summary.uu_count) as studentIdCount, SUM(student_CAandOther_all_summary.locationCACount) as locationCount, SUM(student_CAandOther_all_summary.mScoreCount) as mathScoreCount, SUM(student_CAandOther_all_summary.vScoreCount) as verbScoreCount;
Output:
dump student_summary;
(6,3,150,240)
Hope this helps :)
While solving your problem, I also ran into an issue with Pig, which I assume is caused by improper exception handling in the UNION command: executing it can hang your command-line prompt without a proper error message. Let me know if you want the snippet that reproduces it.
The accepted answer has a logical error. Try it with the input file below:
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
7 1 10 10
The output will be
(13,4,160,250)
but it should be
(7,4,160,250)
since SUM(student_CA.sid) adds up the CA student IDs (1+2+3+7 = 13) instead of counting all students (7); it only looked right before because 1+2+3 happens to equal the original student count of 6.
I have modified the script to work correctly:
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int, lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
    student_CA_sum = COUNT(data.sid); -- total student count (variable name kept from the original script)
    student_CA = filter data by lid == 1;
    student_CA_count = COUNT(student_CA.sid);
    mathScore = SUM(data.ms);
    verbScore = SUM(data.vs);
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output
(7,4,160,250)

Pig: Group By, Average, and Order By

I am new to pig and I have a text file where each line contains a different record of information in the following format:
name, year, count, uniquecount
For example:
Zverkov winced_VERB 2004 8 8
Zverkov winced_VERB 2008 4 4
Zverkov winced_VERB 2009 1 1
zvlastni _ADV_ 1913 1 1
zvlastni _ADV_ 1928 2 2
zvlastni _ADV_ 1929 3 2
I want to group all the records by their unique names, then for each unique name calculate count/uniquecount, and finally sort the output by this calculated value.
Here is what I have been trying:
bigrams = LOAD 'input/bigram/zv.gz' AS (bigram:chararray, year:int, count:float, books:float);
group_bigrams = GROUP bigrams BY bigram;
average_bigrams = FOREACH group_bigrams GENERATE group, SUM(bigrams.count) / SUM(bigrams.books) AS average;
sorted_bigrams = ORDER average_bigrams BY average;
It seems my original code does produce the desired output with one minor change:
bigrams = LOAD 'input/bigram/zv.gz' AS (bigram:chararray, year:int, count:float, books:float);
group_bigrams = GROUP bigrams BY bigram;
average_bigrams = FOREACH group_bigrams GENERATE group, SUM(bigrams.count)/SUM(bigrams.books) AS average;
sorted_bigrams = ORDER average_bigrams BY average DESC, group ASC;
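If it helps to check the logic outside Pig, here is a short plain-Python sketch of the same group-average-sort (the inline sample stands in for the zv.gz input; whitespace splitting is an assumption):
from collections import defaultdict

lines = ["Zverkov winced_VERB 2004 8 8", "Zverkov winced_VERB 2008 4 4",
         "Zverkov winced_VERB 2009 1 1", "zvlastni _ADV_ 1913 1 1",
         "zvlastni _ADV_ 1928 2 2", "zvlastni _ADV_ 1929 3 2"]

# Accumulate per-bigram sums of count and uniquecount (books).
sums = defaultdict(lambda: [0.0, 0.0])
for line in lines:
    *name, year, count, books = line.split()
    key = " ".join(name)
    sums[key][0] += float(count)
    sums[key][1] += float(books)

# average = SUM(count) / SUM(books), ordered like BY average DESC, group ASC.
for key, (c, b) in sorted(sums.items(), key=lambda kv: (-kv[1][0] / kv[1][1], kv[0])):
    print(key, c / b)
# zvlastni _ADV_ 1.2
# Zverkov winced_VERB 1.0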

Pig 0.11.1 - Count groups in a time range

I have a dataset, A, that has timestamp, visitor, URL:
(2012-07-21T14:00:00.000Z, joe, hxxp://www.aaa.com)
(2012-07-21T14:01:00.000Z, mary, hxxp://www.bbb.com)
(2012-07-21T14:02:00.000Z, joe, hxxp://www.aaa.com)
I want to measure number of visits per user per URL in a time window of say, 10 minutes, but as a rolling window that increments by the minute. Output would be:
(2012-07-21T14:00 to 2012-07-21T14:10, joe, hxxp://www.aaa.com, 2)
(2012-07-21T14:01 to 2012-07-21T14:11, joe, hxxp://www.aaa.com, 1)
To make the arithmetic easy, I change the timestamp to minute of the day, as:
(840, joe, hxxp://www.aaa.com) /* 840 = 14:00 hrs x 60 + 00 mins */
To iterate over 'A' by a moving time window, I create a dataset B of minutes in the day:
(0)
(1)
(2)
.
.
.
.
(1440)
Ideally, I want to do something like:
A = load 'dataset1' AS (ts, visitor, uri)
B = load 'dataset2' as (minute)
foreach B {
C = filter A by ts > minute AND ts < minute + 10;
D = GROUP C BY (visitor, uri);
foreach D GENERATE group, count(C) as mycnt;
}
DUMP B;
I know "GROUP" isn't allowed inside a "FOREACH" loop but is there a workaround to achieve the same result?
Thanks!
Maybe you can do something like this?
NOTE: This depends on the minutes you create for the logs being integers. If they are not, you can round to the nearest minute.
myudf.py
#!/usr/bin/python
# outputSchema is provided by Pig's Jython UDF support.
@outputSchema('expanded: {(num:int)}')
def expand(start, end):
    # Emit one single-field tuple per minute in [start, end).
    return [(x,) for x in range(start, end)]
myscript.pig
register 'myudf.py' using jython as myudf ;
-- A1 is the minutes. Schema:
-- A1: {minute: int}
-- A2 is the logs. Schema:
-- A2: {minute: int,name: chararray}
-- These schemas should change to fit your needs.
B = FOREACH A1 GENERATE minute,
    FLATTEN(myudf.expand(minute, minute+10)) AS matchto ;
-- B is in the form:
-- 1 1
-- 1 2
-- ....
-- 2 2
-- 2 3
-- ....
-- 100 100
-- 100 101
-- etc.
-- Now we join on the minute in the second column of B with the
-- minute in the log, then it is just grouping by the minute in
-- the first column and name and counting
C = JOIN B BY matchto, A2 BY minute ;
D = FOREACH (GROUP C BY (B::minute, name))
    GENERATE FLATTEN(group), COUNT(C) as count ;
I'm a little worried about speed for larger logs, but it should work. Let me know if you need me to explain anything.
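To make the expand-and-join idea concrete outside Pig, here is a small plain-Python sketch of the same counting (minutes and names are illustrative). It inverts the expansion: each log minute is replicated into every 10-minute window that contains it, which yields the same per-window counts as the UDF-plus-JOIN approach:
from collections import Counter

logs = [(840, "joe", "hxxp://www.aaa.com"),
        (841, "mary", "hxxp://www.bbb.com"),
        (842, "joe", "hxxp://www.aaa.com")]

window = 10
counts = Counter()
for minute, visitor, uri in logs:
    # A log at minute m falls in every window [start, start + 10)
    # with start in m - 9 .. m.
    for start in range(minute - window + 1, minute + 1):
        counts[(start, visitor, uri)] += 1

print(counts[(840, "joe", "hxxp://www.aaa.com")])  # 2 (minutes 840 and 842)
print(counts[(841, "joe", "hxxp://www.aaa.com")])  # 1 (only minute 842)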
A = load 'dataSet1' as (ts, visitor, uri);
houred = FOREACH A GENERATE visitor, org.apache.pig.tutorial.ExtractHour(ts) as hour, uri;
hour_frequency1 = GROUP houred BY (hour, visitor);
Something like this should help. ExtractHour is a UDF from the Pig tutorial; you could create something similar for your required duration. Then group by hour and user, and use GENERATE with COUNT to get the per-group counts.
http://pig.apache.org/docs/r0.7.0/tutorial.html

More problems with the LAG function in SAS

The following bit of SAS code is supposed to read from a dataset which contains a numeric variable called 'Radvalue'. Radvalue is the temperature of a radiator: if a radiator is switched off but its temperature then increases by 2 or more, that is a sign it has come on; if it is on but its temperature decreases by 2 or more, that is a sign it has gone off.
Radstate is a new variable in the dataset which indicates for every observation whether the radiator is on or off, and it's this I'm trying to fill in automatically for the whole dataset.
So I'm trying to use the LAG function: I initialise the first row, which doesn't have a DIF_Radvalue, and then apply the algorithm I just described from row 2 onwards.
Any idea why the columns Radstate and l_radstate come out completely blank?
Thanks ever so much!! Let me know if I haven't explained the problem clearly.
Data work.heating_algorithm_b;
Input ID Radvalue;
Datalines;
1 15.38
2 15.38
3 20.79
4 33.47
5 37.03
6 40.45
7 40.45
8 40.96
9 39.44
10 31.41
11 26.49
12 23.06
13 21.75
14 20.16
15 19.23
;
DATA temp.heating_algorithm_c;
    SET temp.heating_algorithm_b;
    DIF_Radvalue = Radvalue - lag(Radvalue);
    l_Radstate = lag(Radstate);
    if missing(dif_radvalue) then
        do;
            dif_radvalue = 0;
            radstate = "off";
        end;
    else if l_Radstate = "off" & DIF_Radvalue > 2 then Radstate = "on";
    else if l_Radstate = "on" & DIF_Radvalue < -2 then Radstate = "off";
    else Radstate = l_Radstate;
run;
You were trying to apply the LAG function to a variable that only exists in the output data set (RADSTATE). I replaced the LAG on RADSTATE with a RETAIN. Also, you were right to keep the LAG function outside any conditional logic. Try the code below.
Data work.heating_algorithm_b;
Input ID Radvalue;
Datalines;
1 15.38
2 15.38
3 20.79
4 33.47
5 37.03
6 40.45
7 40.45
8 40.96
9 39.44
10 31.41
11 26.49
12 23.06
13 21.75
14 20.16
15 19.23
;
DATA work.heating_algorithm_c;
    length radstate $3;
    retain radstate;
    SET work.heating_algorithm_b;
    old_radvalue = lag(radvalue);
    if _n_ = 1 then do;
        dif_radvalue = 0;
        radstate = "off";
    end;
    else do;
        DIF_Radvalue = Radvalue - Old_Radvalue;
        if Radstate = "off" & DIF_Radvalue > 2 then Radstate = "on";
        else if Radstate = "on" & DIF_Radvalue < -2 then Radstate = "off";
        /* Else Radstate stays the same */
    end;
run;
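If it helps to see the control flow outside SAS, here is the same on/off rule as a short Python sketch (the 2-degree threshold and the initial "off" state come from the question; the function name is made up):
def radstates(radvalues, threshold=2):
    """Label each reading 'on'/'off' from the change since the previous one."""
    states, prev, state = [], None, "off"  # first row defaults to off
    for value in radvalues:
        diff = 0 if prev is None else value - prev
        if state == "off" and diff > threshold:
            state = "on"
        elif state == "on" and diff < -threshold:
            state = "off"
        states.append(state)
        prev = value
    return states

readings = [15.38, 15.38, 20.79, 33.47, 37.03, 40.45, 40.45, 40.96,
            39.44, 31.41, 26.49, 23.06, 21.75, 20.16, 19.23]
print(radstates(readings))  # off, off, then on until the drop to 31.41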
I have no SAS experience, but maybe you need a missing(l_Radstate) check to cover the first time through, maybe something like this:
if missing(l_Radstate) then
do; radstate = "off"; end;
I think that would only be needed if the Radvalue - lag(Radvalue) did not force DIF_Radvalue to be missing. If it does, I am not sure what would help...

Ruby: multiple loop sets but with limited rows per set

Alrightie, so this time I'm building a CSV file with Ruby. The outer loop should run num_of_loops times, but each pass emits an entire set of class days rather than stopping at the requested number of rows. I want to change the first column of the CSV file to a new name for each row.
If I do this:
class_days = %w[Wednesday Thursday Friday]
num_of_loops = (num_of_loops / class_days.size).ceil
num_of_loops.times {
  ["Wednesday", "Thursday", "Friday"].each do |x|
    data[0] = x
    data[4] = classname()
    # Write all to file
    csv << data
  end
}
Then the loop writes only 3 rows for a 5-row request.
I'd like it to produce the full 5 rows, so that instead of stopping at Wed/Thurs/Fri it goes Wed/Thurs/Fri/Wed/Thurs.
class_days = %w[Wednesday Thursday Friday]
num_of_loops.times do |i|
  data[0] = class_days[i % class_days.size]
  data[4] = classname
  csv << data
end
The interesting part is here:
class_days[i % class_days.size]
We need an index into class_days that is between 0 and class_days.size - 1. We can get that with the % (modulo) operator. That operator yields the remainder after dividing i by class_days.size. This table shows how it works:
i   i % 3
0   0
1   1
2   2
3   0
4   1
5   2
...
The other key part is that the times method yields indices starting with 0.
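For comparison, the same round-robin pattern in Python; itertools.cycle is an alternative design when you only need to repeat the list and never need the index itself:
from itertools import cycle, islice

class_days = ["Wednesday", "Thursday", "Friday"]
num_of_loops = 5

# Modulo indexing, as in the Ruby answer:
print([class_days[i % len(class_days)] for i in range(num_of_loops)])

# Equivalent with cycle/islice:
print(list(islice(cycle(class_days), num_of_loops)))
# Both print ['Wednesday', 'Thursday', 'Friday', 'Wednesday', 'Thursday']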
