I am new to Pig, and I have a text file where each line contains a record in the following format:
name, year, count, uniquecount
For example:
Zverkov winced_VERB 2004 8 8
Zverkov winced_VERB 2008 4 4
Zverkov winced_VERB 2009 1 1
zvlastni _ADV_ 1913 1 1
zvlastni _ADV_ 1928 2 2
zvlastni _ADV_ 1929 3 2
I want to group all the records by their unique names, then for each unique name calculate count/uniquecount, and finally sort the output by this calculated value.
Here is what I have been trying:
bigrams = LOAD 'input/bigram/zv.gz' AS (bigram:chararray, year:int, count:float, books:float);
group_bigrams = GROUP bigrams BY bigram;
average_bigrams = FOREACH group_bigrams GENERATE group, SUM(bigrams.count) / SUM(bigrams.books) AS average;
sorted_bigrams = ORDER average_bigrams BY average;
It seems my original code does produce the desired output with one minor change:
bigrams = LOAD 'input/bigram/zv.gz' AS (bigram:chararray, year:int, count:float, books:float);
group_bigrams = GROUP bigrams BY bigram;
average_bigrams = FOREACH group_bigrams GENERATE group, SUM(bigrams.count)/SUM(bigrams.books) AS average;
sorted_bigrams = ORDER average_bigrams BY average DESC, group ASC;
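With the sample rows above, the two groups work out to 13/13 = 1.0 for "Zverkov winced_VERB" and 6/5 = 1.2 for "zvlastni _ADV_", so dumping the result should show something like:
grunt> DUMP sorted_bigrams;
(zvlastni _ADV_,1.2)
(Zverkov winced_VERB,1.0)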
I have two tables with a different number of rows and three columns. If, for a certain row, the values in columns one and two are the same as in the other table, I want to select the value from the third column. Below is sample code that works, but since I actually have over 2 million rows in both tables, it takes a very long time to run. Is there a way to speed up the code by getting rid of the loops?
MOOSE2=table(['2010/03/30 00:30:00';'2010/03/22 18:00:00';'2010/04/21 18:30:00';'2010/02/20 02:20:00';'2010/03/10 02:30:00'],[5;8;4;9;7],[10;11;12;13;14]);
Lion2=table(['2010/03/30 00:30:00';'2010/04/21 18:30:00';'2010/03/20 22:00:00';'2010/03/10 02:00:00'],[5;4;6;7],[17;12;11;14]);
[sMOOSE,~]=size(MOOSE2);
[sLion,~]=size(Lion2);
dmoose=[];
dlion=[];
for i=1:sLion
    for j=1:sMOOSE
        if (MOOSE2.(1)(j,:)==Lion2.(1)(i,:)) & (MOOSE2.(2)(j,:)==Lion2.(2)(i,:))
            dmoose=[dmoose;MOOSE2.(3)(j,:)];
            dlion=[dlion;Lion2.(3)(i,:)];
        end
    end
end
Which gives me the correct output of
dlion =
17
12
dmoose =
10
12
Perfect scenario for intersect:
MOOSE2=table(['2010/03/30 00:30:00';'2010/03/22 18:00:00';'2010/04/21 18:30:00';'2010/02/20 02:20:00';'2010/03/10 02:30:00'],[5;8;4;9;7],[10;11;12;13;14]);
Lion2=table(['2010/03/30 00:30:00';'2010/04/21 18:30:00';'2010/03/20 22:00:00';'2010/03/10 02:00:00'],[5;4;6;7],[17;12;11;14]);
[~,moose_index,lion_index] = intersect(MOOSE2(:,1:2),Lion2(:,1:2),'rows');
dlion = Lion2.Var3(lion_index)
dmoose = MOOSE2.Var3(moose_index)
I am using the U2 Toolkit for .NET to update records in a UniVerse database. So far we have had no issues updating non-multivalued fields with the following code:
Open_Again:
Try
    db_connectionU2 = openConnU2()
    db_connectionU2.Open()
Catch ex As Exception
    GoTo Open_Again
End Try
Dim cmdWIP As New U2Command
'cmdWIP = New U2Command("DELETE FROM MPS", db_connectionU2)
cmdWIP = New U2Command("UPDATE POH SET EPOS=#FLAG where PONO='C11447'", db_connectionU2)
cmdWIP = New U2Command("UPDATE CURCVRD F8=#F8 where F0='51747*1'", db_connectionU2)
cmdWIP.Parameters.Add(New U2Parameter("#F8", U2Type.VarChar)).Value = "t"
cmdWIP.Connection = db_connectionU2
cmdWIP.ExecuteNonQuery()
cmdWIP.Dispose()
cmdWIP = Nothing
db_connectionU2.Close()
db_connectionU2.Dispose()
db_connectionU2 = Nothing
but we have a problem when we try to write to a multivalued field. It returns the error "Column being update from single to multi is illegal".
Thank you
You need to look at the DICT of that file and make sure your entries are marked as multivalued and have a Multi-Value Association.
Here is an example from the HS.SALES demo account.
>LIST DICT CUSTOMER
DICT CUSTOMER 03:56:47pm 01 Dec 2016 Page 1
Type &
Field......... Field. Field........ Conversion.. Column......... Output Depth &
Name.......... Number Definition... Code........ Heading........ Format Assoc..
CUSTID D 0 P(0N) Customer ID 10R S
#ID D 0 CUSTOMER 10L S
SAL D 1 Salutation 5T S
FNAME D 2 First Name 12T S
LNAME D 3 Last Name 16T S
COMPANY D 4 Company Name 20T S
ADDR1 D 5 Address line 1 30T S
ADDR2 D 6 Address line 2 30T S
CITY D 7 City 12T S
STATE D 8 P(2A) State 2L S
MCU
ZIP D 9 P(5N) Zip 5L S
PHONE D 10 P("("3N")"3N Telephone 13R S
-4N)
PRODID D 11 P(1A4N) Product 5L M ORDER
S
SER_NUM D 12 P(6N) Serial# 6L M ORDER
S
Notice how PRODID has "M ORDERS" after it (the "S" drops to the next line thanks to the 80-character width of my terminal). This tells UniVerse that it is a multivalued field with an Association called ORDERS, which lets the SQL interpreter know how to update things.
It gets a bit more complicated than that, and I would recommend looking up HS.ADMIN, and specifically HS.SCRIB, for tips on formatting things for non-Pick-style consumption. Check the UVodbc guide for more info on that.
Here is my code; it does two GROUP ... ALL operations and it works. My goal is to generate the unique user count of all students along with their total scores, plus the unique user count of students located in CA. Is there a good way to simplify the code to use only one GROUP operation, or any other constructive ideas to make the code simpler, for example using only one FOREACH operation? Thanks.
student_all = group student all;
student_all_summary = FOREACH student_all GENERATE COUNT_STAR(student) as uu_count, SUM(student.mathScore) as count1, SUM(student.verbScore) as count2;
student_CA = filter student by LID==1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA);
Sample input (student ID, location ID, mathScore, verbScore):
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
Sample output (unique user count, unique user count in CA, sum of mathScore of all students, sum of verbScore of all students):
6 3 150 240
thanks in advance,
Lin
You might be looking for this.
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
    student_CA = filter data by lid == 1;
    student_CA_sum = SUM(student_CA.sid);
    student_CA_count = COUNT(student_CA.sid);
    mathScore = SUM(data.ms);
    verbScore = SUM(data.vs);
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output is:
grunt> dump result
(6,3,150,240)
grunt> describe result
result: {student_CA_sum: long,student_CA_count: long,mathScore: long,verbScore: long}
First load the file (student) into the Hadoop file system, then perform the actions below.
split student into student_CA if locationId == 1, student_Other if locationId != 1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA) as uu_count, COUNT_STAR(student_CA) as locationCACount, SUM(student_CA.mathScore) as mScoreCount, SUM(student_CA.verbScore) as vScoreCount;
student_Other_all = group student_Other all;
student_Other_all_summary = FOREACH student_Other_all GENERATE COUNT_STAR(student_Other) as uu_count, 0L as locationCACount, SUM(student_Other.mathScore) as mScoreCount, SUM(student_Other.verbScore) as vScoreCount;
student_CAandOther_all_summary = UNION student_CA_all_summary, student_Other_all_summary;
student_summary_all = group student_CAandOther_all_summary all;
student_summary = foreach student_summary_all generate SUM(student_CAandOther_all_summary.uu_count) as studentIdCount, SUM(student_CAandOther_all_summary.locationCACount) as locationCount, SUM(student_CAandOther_all_summary.mScoreCount) as mathScoreCount, SUM(student_CAandOther_all_summary.vScoreCount) as verbScoreCount;
output:
dump student_summary;
(6,3,150,240)
Hope this helps :)
While working on your problem, I also encountered an issue with Pig, which I assume is caused by improper exception handling in the UNION command. It can hang your command-line prompt when you execute that command, without a proper error message. If you want, I can share the snippet for that.
The accepted answer has a logical error.
Try the input file below:
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
7 1 10 10
The output will be
(13,4,160,250)
The output should be
(7,4,160,250)
I have modified the script so that it works correctly:
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
    student_CA_sum = COUNT(data.sid);
    student_CA = filter data by lid == 1;
    student_CA_count = COUNT(student_CA.sid);
    mathScore = SUM(data.ms);
    verbScore = SUM(data.vs);
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output
(7,4,160,250)
I have one file max_rank.txt containing:
1,a
2,b
3,c
and second file max_rank_add.txt:
d
e
f
My expected result is:
1,a
2,b
3,c
4,d
5,e
6,f
So I want to generate RANK for second set of values, but starting with value greater than max from first set.
The beginning of the script probably looks like this:
existing = LOAD 'max_rank.txt' using PigStorage(',') AS (id: int, text : chararray);
new = LOAD 'max_rank_add.txt' using PigStorage() AS (text2 : chararray);
ordered = ORDER existing by id desc;
limited = LIMIT ordered 1;
new_rank = RANK new;
But I have a problem with the last, most important line, which should add the value from limited to rank_new in new_rank.
Can you please give any suggestions?
Regards
Pawel
I've found a solution.
Both scripts work:
rank_plus_max = foreach new_rank generate flatten(limited.$0 + rank_new), text2;
rank_plus_max = foreach new_rank generate limited.$0 + rank_new, text2;
This DOES NOT work:
rank_plus_max = foreach new_rank generate flatten(limited.$0) + flatten(rank_new);
2014-02-24 10:52:39,580 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 10, column 62> mismatched input '+' expecting SEMI_COLON
Details at logfile: /export/home/pig/pko/pig_1393234166538.log
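For context, here is a minimal end-to-end sketch of the working variant; the final UNION, the field rename, and the int cast are my additions, not from the original post:
existing = LOAD 'max_rank.txt' USING PigStorage(',') AS (id:int, text:chararray);
new = LOAD 'max_rank_add.txt' USING PigStorage() AS (text2:chararray);
-- single-row relation holding the current maximum id; a projection of it
-- (limited.$0) acts as a scalar inside the FOREACH below
ordered = ORDER existing BY id DESC;
limited = LIMIT ordered 1;
-- RANK numbers the new rows 1..N; the scalar max id shifts them upward
new_rank = RANK new;
rank_plus_max = FOREACH new_rank GENERATE (int)(limited.$0 + rank_new) AS id, text2 AS text;
-- stitch the two sets back together
result = UNION existing, rank_plus_max;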
I have a dataset, A, that has timestamp, visitor, URL:
(2012-07-21T14:00:00.000Z, joe, hxxp://www.aaa.com)
(2012-07-21T14:01:00.000Z, mary, hxxp://www.bbb.com)
(2012-07-21T14:02:00.000Z, joe, hxxp://www.aaa.com)
I want to measure number of visits per user per URL in a time window of say, 10 minutes, but as a rolling window that increments by the minute. Output would be:
(2012-07-21T14:00 to 2012-07-21T14:10, joe, hxxp://www.aaa.com, 2)
(2012-07-21T14:01 to 2012-07-21T14:11, joe, hxxp://www.aaa.com, 1)
To make the arithmetic easy, I change the timestamp to minute of the day, as:
(840, joe, hxxp://www.aaa.com) /* 840 = 14 hrs x 60 + 00 mins */
To iterate over 'A' by a moving time window, I create a dataset B of minutes in the day:
(0)
(1)
(2)
...
(1440)
Ideally, I want to do something like:
A = load 'dataset1' AS (ts, visitor, uri);
B = load 'dataset2' as (minute);
foreach B {
    C = filter A by ts > minute AND ts < minute + 10;
    D = GROUP C BY (visitor, uri);
    foreach D GENERATE group, count(C) as mycnt;
}
DUMP B;
I know "GROUP" isn't allowed inside a "FOREACH" loop but is there a workaround to achieve the same result?
Thanks!
Maybe you can do something like this?
NOTE: This depends on the minutes you create for the logs being integers. If they are not, you can round to the nearest minute.
myudf.py
#!/usr/bin/python
@outputSchema('expanded: {(num:int)}')
def expand(start, end):
    # emit one single-field tuple per minute in [start, end)
    return [ (x,) for x in range(start, end) ]
myscript.pig
register 'myudf.py' using jython as myudf ;
-- A1 is the minutes. Schema:
-- A1: {minute: int}
-- A2 is the logs. Schema:
-- A2: {minute: int,name: chararray}
-- These schemas should change to fit your needs.
B = FOREACH A1 GENERATE minute,
FLATTEN(myudf.expand(minute, minute+10)) AS matchto ;
-- B is in the form:
-- 1 1
-- 1 2
-- ....
-- 2 2
-- 2 3
-- ....
-- 100 100
-- 100 101
-- etc.
-- Now we join on the minute in the second column of B with the
-- minute in the log, then it is just grouping by the minute in
-- the first column and name and counting
C = JOIN B BY matchto, A2 BY minute ;
D = FOREACH (GROUP C BY (B::minute, name))
GENERATE FLATTEN(group), COUNT(C) as count ;
I'm a little worried about speed for larger logs, but it should work. Let me know if you need me to explain anything.
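For completeness, the two LOAD statements that the comments describe might look like this (the file names and the space delimiter are assumptions, not from the original answer):
A1 = LOAD 'minutes.txt' USING PigStorage(' ') AS (minute:int);
A2 = LOAD 'logs.txt' USING PigStorage(' ') AS (minute:int, name:chararray);
With those in place, B, C, and D above run as written.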
A = load 'dataSet1' as (ts, visitor, uri);
houred = FOREACH A GENERATE visitor, org.apache.pig.tutorial.ExtractHour(ts) as hour, uri;
hour_frequency1 = GROUP houred BY (hour, visitor);
Something like this should help. ExtractHour is a UDF from the Pig tutorial; you could create something similar for your required duration. Then group by hour and user, and use GENERATE to do a count.
http://pig.apache.org/docs/r0.7.0/tutorial.html
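For illustration, a hedged sketch of the counting step that the answer describes, modeled on the Pig tutorial (the alias hour_frequency2 is my own):
hour_frequency2 = FOREACH hour_frequency1 GENERATE
    FLATTEN(group) AS (hour, visitor), -- the grouping keys
    COUNT(houred) AS cnt;              -- visits per visitor per hour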