Pig 0.11.1 - Count groups in a time range

I have a dataset, A, that has timestamp, visitor, URL:
(2012-07-21T14:00:00.000Z, joe, hxxp://www.aaa.com)
(2012-07-21T14:01:00.000Z, mary, hxxp://www.bbb.com)
(2012-07-21T14:02:00.000Z, joe, hxxp://www.aaa.com)
I want to measure the number of visits per user per URL in a time window of, say, 10 minutes, but as a rolling window that advances by one minute. The output would be:
(2012-07-21T14:00 to 2012-07-21T14:10, joe, hxxp://www.aaa.com, 2)
(2012-07-21T14:01 to 2012-07-21T14:11, joe, hxxp://www.aaa.com, 1)
To make the arithmetic easy, I change the timestamp to minute of the day, as:
(840, joe, hxxp://www.aaa.com) /* 840 = 14 hrs x 60 + 00 mins */
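For reference, this conversion can be sketched with Pig 0.11's built-in datetime functions (assuming the timestamp is an ISO 8601 chararray; A_min is a hypothetical alias):
-- sketch: minute-of-day from the ISO timestamp
A = LOAD 'dataset1' AS (ts:chararray, visitor:chararray, uri:chararray);
A_min = FOREACH A GENERATE (GetHour(ToDate(ts)) * 60 + GetMinute(ToDate(ts))) AS minute, visitor, uri;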
To iterate over 'A' by a moving time window, I create a dataset B of minutes in the day:
(0)
(1)
(2)
...
(1440)
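Such a file can be generated outside Pig, e.g. with a shell one-liner (note that a day runs from minute 0 through 1439):
seq 0 1439 > dataset2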
Ideally, I want to do something like:
A = LOAD 'dataset1' AS (ts, visitor, uri);
B = LOAD 'dataset2' AS (minute);
foreach B {
    C = FILTER A BY ts >= minute AND ts < minute + 10;
    D = GROUP C BY (visitor, uri);
    foreach D GENERATE group, COUNT(C) AS mycnt;
}
DUMP B;
I know "GROUP" isn't allowed inside a "FOREACH" loop, but is there a workaround to achieve the same result?
Thanks!

Maybe you can do something like this?
NOTE: This depends on the minutes you create for the logs being integers. If they are not, you can round to the nearest minute.
myudf.py
#!/usr/bin/python
@outputSchema('expanded: {(num:int)}')
def expand(start, end):
    # emit one single-field tuple per minute in [start, end)
    return [(x,) for x in range(start, end)]
myscript.pig
register 'myudf.py' using jython as myudf ;
-- A1 is the minutes. Schema:
-- A1: {minute: int}
-- A2 is the logs. Schema:
-- A2: {minute: int,name: chararray}
-- These schemas should change to fit your needs.
B = FOREACH A1 GENERATE minute,
FLATTEN(myudf.expand(minute, minute+10)) AS matchto ;
-- B is in the form:
-- 1 1
-- 1 2
-- ....
-- 2 2
-- 2 3
-- ....
-- 100 100
-- 100 101
-- etc.
-- Now we join on the minute in the second column of B with the
-- minute in the log, then it is just grouping by the minute in
-- the first column and name and counting
C = JOIN B BY matchto, A2 BY minute ;
D = FOREACH (GROUP C BY (B::minute, name))
    GENERATE FLATTEN(group), COUNT(C) AS cnt ;
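For completeness, the two inputs would be loaded with something like this (file names here are placeholders to adapt):
A1 = LOAD 'minutes' AS (minute:int) ;
A2 = LOAD 'logs' AS (minute:int, name:chararray) ;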
I'm a little worried about speed for larger logs, but it should work. Let me know if you need me to explain anything.

A = LOAD 'dataSet1' AS (ts, visitor, uri);
houred = FOREACH A GENERATE visitor, org.apache.pig.tutorial.ExtractHour(ts) AS hour, uri;
hour_frequency1 = GROUP houred BY (hour, visitor);
Something like this should help.
ExtractHour is a UDF from the Pig tutorial; you could create something similar for your required duration.
Then group by the hour (or your duration) and the user.
You can use GENERATE with COUNT to do the count, as sketched after the link below.
http://pig.apache.org/docs/r0.7.0/tutorial.html
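For example, the counting step could be sketched like this (hour_frequency2 is a hypothetical alias, reusing the relations above):
hour_frequency2 = FOREACH hour_frequency1 GENERATE FLATTEN(group) AS (hour, visitor), COUNT(houred) AS cnt;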

Related

U2 Universe Update Multi value field error

I am using the Universe U2.NET toolkit to update records in a Universe database. So far we have had no issue updating a non-multivalue field with the following code:
Open_Again:
Try
    db_connectionU2 = openConnU2()
    db_connectionU2.Open()
Catch ex As Exception
    GoTo Open_Again
End Try
Dim cmdWIP As New U2Command
'cmdWIP = New U2Command("DELETE FROM MPS", db_connectionU2)
cmdWIP = New U2Command("UPDATE POH SET EPOS=#FLAG where PONO='C11447'", db_connectionU2)
cmdWIP = New U2Command("UPDATE CURCVRD F8=#F8 where F0='51747*1'", db_connectionU2)
cmdWIP.Parameters.Add(New U2Parameter("#F8", U2Type.VarChar)).Value = "t"
cmdWIP.Connection = db_connectionU2
cmdWIP.ExecuteNonQuery()
cmdWIP.Dispose()
cmdWIP = Nothing
db_connectionU2.Close()
db_connectionU2.Dispose()
db_connectionU2 = Nothing
but we have a problem when we try to write to a multivalue field. It returns the error "Column being update from single to multi is illegal".
Thank you
You need to look at the DICT of that file and make sure your entries are marked as MultiValued and have a Multi-Value Association.
Here is an example from the HS.SALES demo account.
>LIST DICT CUSTOMER
DICT CUSTOMER 03:56:47pm 01 Dec 2016 Page 1
Type &
Field......... Field. Field........ Conversion.. Column......... Output Depth &
Name.......... Number Definition... Code........ Heading........ Format Assoc..
CUSTID D 0 P(0N) Customer ID 10R S
#ID D 0 CUSTOMER 10L S
SAL D 1 Salutation 5T S
FNAME D 2 First Name 12T S
LNAME D 3 Last Name 16T S
COMPANY D 4 Company Name 20T S
ADDR1 D 5 Address line 1 30T S
ADDR2 D 6 Address line 2 30T S
CITY D 7 City 12T S
STATE D 8 P(2A) State 2L S
MCU
ZIP D 9 P(5N) Zip 5L S
PHONE D 10 P("("3N")"3N Telephone 13R S
-4N)
PRODID D 11 P(1A4N) Product 5L M ORDER
S
SER_NUM D 12 P(6N) Serial# 6L M ORDER
S
Notice how PRODID has "M ORDERS" after it (the S drops to the next line thanks to the 80-character width of my terminal). This tells Universe that it is a multivalued field with an Association called ORDERS, which lets the SQL interpreter know how to update things.
It gets a bit more complicated from there, and I would recommend looking up HS.ADMIN and specifically HS.SCRIB for tips on formatting things for non-Pick-style consumption. Check the UVodbc guide for more info on that.

Getting Repository dependencies

I'm using Informatica with Oracle RDBMS. Lately I've been struggling a bit.
I got an assignment to query the dependencies between each Model/Workflow, so the desired result will look something like this:
GRAND_MODEL | GRAND_WORKFLOW | WAIT_4_MODEL | WAIT_4_WORKFLOW
DWH_Model1  | WF_workflow1   | DWH_Model3   | WF_Workflow3_1
DWH_Model1  | WF_workflow1   | DWH_Model4   | WF_Workflow4_1
DWH_Model2  | WF_workflow2_1 | DWH_Model1   | WF_Workflow1
Which means: WF_workflow1 in model DWH_Model1 waits for workflow WF_Workflow3_1 in model DWH_Model3, etc.
We have 3 types of workflows: DELTA (the name contains the word DELTA), DWH (same here) and CALC (same here). A workflow that waits uses an EVENT that contains both models' names, and a workflow that raises the flag contains a CMD that contains the grand model name.
So far we've come up with this:
SELECT DISTINCT OA.SUBJ_NAME AS GRAND_MODEL,
OL.SUBJ_NAME AS WAIT_4_MODEL_NAME,
REP.WORKFLOW_NAME AS WAIT_4_WORKFLOW_NAME,
A.FLAG_NAME,
CASE
WHEN INSTR(UPPER(A.FLAG_NAME), 'DWH') > 0 THEN
'DWH'
WHEN INSTR(UPPER(REP.WORKFLOW_NAME), 'DELTA') > 0 THEN
'DELTA'
ELSE
'CALC'
END CONNECTION_NAME
FROM OPB_SUBJECT#TO_INFORMATICA_ADMIN OL,
OPB_SUBJECT#TO_INFORMATICA_ADMIN OA,
OPB_TASK#TO_INFORMATICA_ADMIN T,
OPB_TASK#TO_INFORMATICA_ADMIN TL,
OPB_TASK_INST#TO_INFORMATICA_ADMIN TI,
REP_SESSION_INSTANCES#TO_INFORMATICA_ADMIN REP,
(SELECT T.TASK_ID,
SUBSTR(T.ATTR_VALUE,
LENGTH(T.ATTR_VALUE) + 2 -
INSTR(REVERSE(T.ATTR_VALUE), '/')) FLAG_NAME
FROM OPB_TASK_ATTR#TO_INFORMATICA_ADMIN T
WHERE T.TASK_TYPE = 60
AND INSTR(REVERSE(T.ATTR_VALUE), '/') > 0) A,
(SELECT T.TASK_ID,
T.SUBJECT_ID,
SUBSTR(T.PM_VALUE,
LENGTH(T.PM_VALUE) + 2 -
INSTR(REVERSE(T.PM_VALUE), '/')) FLAG_NAME
FROM OPB_TASK_VAL_LIST#TO_INFORMATICA_ADMIN T
WHERE INSTR(REVERSE(T.PM_VALUE), '/') > 0) L
WHERE OL.SUBJ_ID = L.SUBJECT_ID
AND A.TASK_ID = T.TASK_ID
AND T.SUBJECT_ID = OA.SUBJ_ID
AND A.FLAG_NAME = L.FLAG_NAME
AND OL.SUBJ_NAME <> OA.SUBJ_NAME
AND L.TASK_ID = TL.TASK_ID
AND TL.TASK_ID = TI.TASK_ID
AND TI.WORKFLOW_ID = REP.WORKFLOW_ID
This query works! The problem is that I'm getting worklets back as workflows as well, so some of the time the last join fails. I don't know how to avoid it.
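One possible direction (untested; the TASK_TYPE codes are undocumented repository internals, commonly reported as 71 for workflows and 70 for worklets, so verify them against your own repository first) is to keep only containers that are true workflows:
-- hypothetical extra condition for the WHERE clause;
-- check SELECT DISTINCT TASK_TYPE FROM OPB_TASK in your repository first
AND EXISTS (SELECT 1
              FROM OPB_TASK#TO_INFORMATICA_ADMIN W
             WHERE W.TASK_ID = TI.WORKFLOW_ID
               AND W.TASK_TYPE = 71)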

Advice to make my Pig code below simpler

Here is my code; I do two GROUP ALL operations and the code works. My purpose is to generate the unique user count of all students with their total scores, plus the unique user count of students located in CA. I'm wondering if there is a good way to simplify my code to use only one GROUP operation, or any other constructive ideas to make the code simpler, for example using only one FOREACH operation? Thanks.
student_all = group student all;
student_all_summary = FOREACH student_all GENERATE COUNT_STAR(student) as uu_count, SUM(student.mathScore) as count1,SUM(student.verbScore) as count2;
student_CA = filter student by LID==1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA);
Sample input (student ID, location ID, mathScore, verbScore):
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
Sample output (unique user count, unique user count in CA, sum of mathScore of all students, sum of verbScore of all students):
6 3 150 240
thanks in advance,
Lin
You might be looking for this.
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
student_CA = filter data by lid == 1;
student_CA_sum = SUM( student_CA.sid ) ;
student_CA_count = COUNT( student_CA.sid ) ;
mathScore = SUM(data.ms);
verbScore = SUM(data.vs);
GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output is:
grunt> dump result
(6,3,150,240)
grunt> describe result
result: {student_CA_sum: long,student_CA_count: long,mathScore: long,verbScore: long}
First load the student file into the Hadoop file system, then perform the actions below.
split student into student_CA if locationId == 1, student_Other if locationId != 1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA) as uu_count,COUNT_STAR(student_CA)as locationCACount, SUM(student_CA.mathScore) as mScoreCount,SUM(student_CA.verbScore) as vScoreCount;
student_Other_all = group student_Other all;
student_Other_all_summary = FOREACH student_Other_all GENERATE COUNT_STAR(student_Other) as uu_count, 0L as locationCACount, SUM(student_Other.mathScore) as mScoreCount, SUM(student_Other.verbScore) as vScoreCount; -- constant named locationCACount so the UNION keeps a single schema
student_CAandOther_all_summary = UNION student_CA_all_summary, student_Other_all_summary;
student_summary_all = group student_CAandOther_all_summary all;
student_summary = foreach student_summary_all generate SUM(student_CAandOther_all_summary.uu_count) as studentIdCount, SUM(student_CAandOther_all_summary.locationCACount) as locationCount, SUM(student_CAandOther_all_summary.mScoreCount) as mathScoreCount , SUM(student_CAandOther_all_summary.vScoreCount) as verbScoreCount;
output:
dump student_summary;
(6,3,150,240)
Hope this helps :)
While solving your problem, I also encountered an issue with Pig. I assume it is caused by improper exception handling in the UNION command: executing that command can actually hang your command-line prompt without a proper error message. If you want, I can share the snippet for that.
The accepted answer has a logical error.
Try the input file below:
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
7 1 10 10
The output will be
(13,4,160,250)
The output should be
(7,4,160,250)
I have modified the script to work correctly.
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
student_CA_sum = COUNT( data.sid ) ; -- total student count (alias name kept from the original script)
student_CA = filter data by lid == 1;
student_CA_count = COUNT( student_CA.sid ) ;
mathScore = SUM(data.ms);
verbScore = SUM(data.vs);
GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output
(7,4,160,250)

Pig: Group By, Average, and Order By

I am new to Pig and I have a text file where each line contains a record of information in the following format:
name, year, count, uniquecount
For example:
Zverkov winced_VERB 2004 8 8
Zverkov winced_VERB 2008 4 4
Zverkov winced_VERB 2009 1 1
zvlastni _ADV_ 1913 1 1
zvlastni _ADV_ 1928 2 2
zvlastni _ADV_ 1929 3 2
I want to group all the records by their unique names, then for each unique name calculate count/uniquecount, and finally sort the output by this calculated value.
Here is what I have been trying:
bigrams = LOAD 'input/bigram/zv.gz' AS (bigram:chararray, year:int, count:float, books:float);
group_bigrams = GROUP bigrams BY bigram;
average_bigrams = FOREACH group_bigrams GENERATE group, SUM(bigrams.count) / SUM(bigrams.books) AS average;
sorted_bigrams = ORDER average_bigrams BY average;
It seems my original code produces the desired output with one minor change:
bigrams = LOAD 'input/bigram/zv.gz' AS (bigram:chararray, year:int, count:float, books:float);
group_bigrams = GROUP bigrams BY bigram;
average_bigrams = FOREACH group_bigrams GENERATE group, SUM(bigrams.count)/SUM(bigrams.books) AS average;
sorted_bigrams = ORDER average_bigrams BY average DESC, group ASC;
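With the sample rows above, this should give roughly:
(zvlastni _ADV_,1.2)
(Zverkov winced_VERB,1.0)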

Speed up the analysis

I have 2 dataframes in R, for example df and dfrefseq:
df<-data.frame( chr = c("chr1","chr1","chr1","chr4")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("HTA","OdX","FEA","MGA")
)
dfrefseq<-data.frame( chr = c("chr1","chr1","chr1","chr2")
, start = c(843294,4329248,4329423,4932234)
, stop = c(845294,4329248,4529423,4935234)
, genenames= c("tra","FGE","FFs","FAA")
)
I want to check, for each gene in df, which gene in dfrefseq lies closest to the selected df gene.
I first selected "chr1" in both dataframes.
Then I calculated, for the first gene in readschr1, the distance between the start-start, start-stop, stop-start and stop-stop sites.
The sum of these calculations says everything about the distance. My question here is: how can I speed up this analysis? Right now I have tested only 1 gene against a dataframe, but I need to test 2000 genes.
readschr1 <- subset(df, df[,1] == "chr1")
refseqchr1 <- subset(dfrefseq, dfrefseq[,1] == "chr1")
names <- list()
read_start_start <- list()
read_start_stop <- list()
read_stop_start <- list()
read_stop_stop <- list()
for (i in 1:nrow(refseqchr1)) {
    startstart <- abs(readschr1[1,2] - refseqchr1[i,2])
    startstop  <- abs(readschr1[1,2] - refseqchr1[i,3])
    stopstart  <- abs(readschr1[1,3] - refseqchr1[i,2])
    stopstop   <- abs(readschr1[1,3] - refseqchr1[i,3])
    read_start_start[[i]] <- matrix(startstart)
    read_start_stop[[i]]  <- matrix(startstop)
    read_stop_start[[i]]  <- matrix(stopstart)
    read_stop_stop[[i]]   <- matrix(stopstop)
    names[[i]] <- matrix(refseqchr1[i,4])
}
table <- cbind(names, read_start_start, read_start_stop, read_stop_start, read_stop_stop)
sumtotalcolumns <- as.numeric(table[,2]) + as.numeric(table[,3]) + as.numeric(table[,4]) + as.numeric(table[,5])
test <- cbind(table, sumtotalcolumns)
test1 <- test[order(as.vector(test$sumtotalcolumns)), ]
Thank you!
The Bioconductor package GenomicRanges is designed to work with this type of data
source('http://bioconductor.org/biocLite.R')
biocLite('GenomicRanges') # one-time installation
then
library(GenomicRanges)
gr <- with(df,
GRanges(factor(chr, levels=paste("chr", 1:4, sep="")),
IRanges(start, stop), genenames=genenames))
grrefseq <- with(dfrefseq,
GRanges(factor(chr, levels=paste("chr", 1:4, sep="")),
IRanges(start, stop), genenames=genenames))
and
> nearest(gr, grrefseq)
[1] 1 2 3 NA
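The returned indices can then be used to pull out the matching gene names, e.g. (a small usage sketch):
idx <- nearest(gr, grrefseq)
# nearest refseq gene for each df gene; NA where the chromosome has no match
data.frame(df_gene = gr$genenames, nearest_refseq = grrefseq$genenames[idx])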
You can merge the two separate data.frames together to form one table and then use vectorized operations. The key to merge is to specify the common column(s) between the data.frames and to tell it what to do when there are cases that do not match. Specifying all = TRUE will return all rows and fill in NAs if there is no match in the other data.frame, i.e. chr2 and chr4 in this case. Once the data.frames have been merged, it's a simple exercise in subtracting the different columns from one another and then summing the four columns of interest. I use transform to cut down on the typing needed to do the subtraction.
zz <- merge(df, dfrefseq, by = "chr", all = TRUE)
zz <- transform(zz,
read_start_start = abs(start.x - start.y)
, read_start_stop = abs(start.x - stop.y)
, read_stop_start = abs(stop.x - start.y)
, read_stop_stop = abs(stop.x - stop.y)
)
zz <- transform(zz,
sum_total_columns = read_start_start + read_start_stop + read_stop_start + read_stop_stop
)
Here's one approach to get the row with the minimum distance. I'm assuming you want to do this by chr and genenames. I use the plyr package, but there are base solutions if you'd prefer one; a base-R sketch follows below.
require(plyr)
ddply(zz, c("chr", "genenames.x"), function(x) x[which.min(x$sum_total_columns) ,])
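And for reference, a base-R equivalent of that ddply call (a sketch, no extra packages needed):
# keep, per chr and df gene, the row with the smallest total distance
zz_split <- split(zz, list(zz$chr, zz$genenames.x), drop = TRUE)
do.call(rbind, lapply(zz_split, function(x) x[which.min(x$sum_total_columns), ]))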
