Invalid scalar projection in foreach - hadoop

Hi, I have a Pig script like this. The FOREACH statement throws an invalid scalar projection error. Here is my code:
a = load 'file' using PigStorage(':');
b = group a by ($1, $7, $11);
c = foreach b generate flatten(group), COUNT(a) as (cnt: int);
d = filter c by cnt>1;
e = foreach d generate flatten(a) ;
The error is shown below
<line 6, column 31> Invalid scalar projection: a : A column needs to be projected from a relation for it to be used as a scalar
Any help will be appreciated.

The issue is that 'a' does not exist in the schema of relation 'd'.
If you DESCRIBE 'd', you get:
d: {bytearray,bytearray,bytearray,cnt: int}
in which 'a' does not appear. Relation 'c' is formed by projecting only the flattened group fields and COUNT(a); the bag 'a' itself is not carried into 'c', so it is not available in 'd' either.
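A minimal sketch of one way to fix it, assuming you still want to flatten the original tuples after the filter: project the bag 'a' into 'c' as well, so it remains available downstream.
a = load 'file' using PigStorage(':');
b = group a by ($1, $7, $11);
c = foreach b generate flatten(group), a, COUNT(a) as cnt;
d = filter c by cnt > 1;
e = foreach d generate flatten(a);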

Related

Inserting tuples inside an inner bag using Pig Latin - Hadoop

I am trying to create the following format of relation using Pig Latin:
userid, day, {(pid,fulldate, x,y),(pid,fulldate, x,y), ...}
Relation description: each user (userid) on each day (day) has purchased multiple products (pid).
I am loading the data with:
A = LOAD '**from a HDFS URL**' AS (pid: chararray, userid: chararray, day: int, fulldate: chararray, x: chararray, y: chararray);
B= GROUP A BY (userid, day);
Describe B;
B: {group: (userid: chararray,day: int),A: {(pid: chararray,day: int,fulldate: chararray,x: chararray,userid: chararray,y: chararray)}}
C= FOREACH B FLATTEN(B) AS (userid,day), $1.pid, $1.fulldate,$1.x,$1.y;
Describe C;
C: {userid: chararray,day: int,{(pid: chararray)},{(fulldate: chararray)},{(x: chararray)},{(y: chararray)}}
The result of DESCRIBE C is not the format I want! What am I doing wrong?
You are correct up to the GROUP BY part. After that, however, you are trying to do something messy; I'm actually not sure what is happening with your alias C. To arrive at the format you are looking for, you will need a nested FOREACH:
C = FOREACH B {
    data = A.(pid, fulldate, x, y);
    GENERATE FLATTEN(group), data;
}
This allows C to have one record for each (userid, day) and all the corresponding (pid,fulldate, x, y) tuples in a bag.
You can read more about nested foreach here: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html (Search for nested foreach in that link).
My understanding is that B is almost what you're looking for, except you would like the tuple containing userid and day to be flattened, and you would like only pid, fulldate, x, and y to appear in the bag.
First, you want to flatten the tuple group which has fields userid and day, not the bag A which contains multiple tuples. Flattening group unnests the tuple, which only has 1 set of unique values for each row, whereas flattening the bag A would effectively ungroup your previous GROUP BY statement since the values in the bag A are not unique. So the first part should read C = FOREACH B GENERATE FLATTEN(group) AS (userid, day);
Next, you want to keep pid, fulldate, x, and y in separate tuples for each record, but the way you've selected them essentially makes a bag of all the pid values, a bag of all the fulldate values, etc. Instead, try selecting these fields in a way that keeps the tuples nested in the bag:
C = FOREACH B GENERATE
    FLATTEN(group) AS (userid, day),
    A.(pid, fulldate, x, y) AS A;
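As a quick sanity check (a sketch using the aliases above; 'check' is just an illustrative name), you can flatten the bag back out and confirm that each (userid, day) pair carries its own (pid, fulldate, x, y) tuples:
check = FOREACH C GENERATE userid, day, FLATTEN(A);
DUMP check;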

Get value for unique record using Pig

Below is the input data set.
col1,col2,col3,col4,col5
key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10
col2, col3, and col4 together identify a unique record. For each such unique record I need to pick any one value from col1 and populate it as a new field, say col6. The expected output is below:
col1,col2,col3,col4,col5,col6
key1,111,1,12/11/2016,10,key3
key2,111,1,12/11/2016,10,key3
key3,111,1,12/11/2016,10,key3
key4,222,2,12/22/2016,10,key5
key5,222,2,12/22/2016,10,key5
key6,333,3,12/30/2016,10,key6
key7,111,0,12/11/2016,10,key7
Below is my script; I am getting an error:
A = load 'test1.csv' using PigStorage(',');
B = GROUP A by ($1,$2,$3);
C = FOREACH B GENERATE FLATTEN(group), MAX(A.$0);
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2106: Error executing an algebraic function
This looks like a good use case for a nested FOREACH.
Ref : https://pig.apache.org/docs/r0.14.0/basic.html#foreach
Input:
key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10
Pig script:
A = load 'input.csv' using PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray);
B = FOREACH (GROUP A BY (col2, col3, col4)) {
    ordered = ORDER A BY col1 DESC;
    latest = LIMIT ordered 1;
    GENERATE FLATTEN(A) AS (col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray),
             FLATTEN(latest.col1) AS col6:chararray;
};
DUMP B;
Output:
(key1,111,1,12/11/2016,10,key3)
(key2,111,1,12/11/2016,10,key3)
(key3,111,1,12/11/2016,10,key3)
(key4,222,2,12/22/2016,10,key5)
(key5,222,2,12/22/2016,10,key5)
(key6,333,3,12/30/2016,10,key6)
(key7,111,0,12/11/2016,10,key7)
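For reference, the original error most likely comes from loading without a schema: with a plain PigStorage(',') load every field is a bytearray, so MAX cannot work out a type for $0. A minimal sketch that keeps the MAX approach, assuming a lexicographic maximum of col1 is an acceptable "any one value":
A = load 'test1.csv' using PigStorage(',') AS (col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray);
B = GROUP A BY (col2, col3, col4);
C = FOREACH B GENERATE FLATTEN(A), MAX(A.col1) AS col6;
This should produce the same shape of output as the nested FOREACH above, with col6 appended to every row of each group.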

How does Pig's COGROUP operator work?

How does the COGROUP operator work here?
How and why are we getting empty bags in the last two lines of the output? (No website explains in detail how COGROUP arranges the data.)
A = load 'student' as (name:chararray, age:int, gpa:float);
B = load 'student' as (name:chararray, age:int, gpa:float);
dump B;
(joe,18,2.5)
(sam,,3.0)
(bob,,3.5)
X = cogroup A by age, B by age;
dump X;
(18,{(joe,18,2.5)},{(joe,18,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})
There is a very clear example in the Definitive Guide book. I hope the snippet below helps you understand the COGROUP concept.
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
COGROUP generates a tuple for each unique grouping key. The first field of each tuple is the key, and the remaining fields are bags of tuples from the relations with a matching key. The first bag contains the matching tuples from relation A with the same key. Similarly, the second bag contains the matching tuples from relation B with the same key.
If for a particular key a relation has no matching key, then the bag for that relation is empty. For example, since no one has bought a scarf (with ID 1), the second bag in the tuple for that row is empty. This is an example of an outer join, which is the default type for COGROUP.
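As for the two empty bags in your output: the group key in those rows is null (sam and bob have no age). With GROUP/COGROUP, null keys from the same relation are grouped together, but null keys coming from different relations are treated as non-matching, so A's null-age tuples and B's null-age tuples land in separate rows, each with an empty bag on the other side. If you only want keys present in both relations, COGROUP also accepts the INNER keyword; a minimal sketch against your aliases:
X = cogroup A by age inner, B by age inner;
dump X;
With both sides marked inner, rows that would have an empty bag on either side (including the null-key rows) are dropped.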

How to normalize a tuple of maps in Apache Pig?

I have the following relation in a pig script:
my_relation: {entityId: chararray,attributeName: chararray,bytearray}
(++JIYMIS2D,timeseries,([value#50.0,timestamp#1388675231000]))
(++JRGOCZQD,timeseries,([value#50.0,timestamp#1388592317000],[value#25.0,timestamp#1388682237000]))
(++GCYI1OO4,timeseries,())
(++JYY0LOTU,timeseries,())
There can be any number of value/timestamp pairs in the bytearray column (even zero).
I would like to transform this relation into this (one row for each entityId, attributeName, value, timestamp quartet):
++JIYMIS2D,timeseries,50.0,1388675231000
++JRGOCZQD,timeseries,50.0,1388592317000
++JRGOCZQD,timeseries,25.0,1388682237000
++GCYI1OO4,timeseries,,
++JYY0LOTU,timeseries,,
Alternatively, this would be fine too, since I am not interested in the rows that have no value/timestamp pairs:
++JIYMIS2D,timeseries,50.0,1388675231000
++JRGOCZQD,timeseries,50.0,1388592317000
++JRGOCZQD,timeseries,25.0,1388682237000
Any ideas? Basically I want to normalize the tuple of maps in the bytearray column so that the schema is like this:
my_relation: {entityId: chararray,
attributeName: chararray,
value: float,
timestamp: int}
I am a Pig beginner, so sorry if this is obvious! Do I need a UDF to do this?
This question is similar but has no answers so far: How do I split in Pig a tuple of many maps into different rows
I am running Apache Pig version 0.12.0-cdh5.1.2
EDIT - adding details of what I've done so far.
Here's a pig script snippet, with output below:
-- StateVectorFileStorage is a LoadStoreFunc and AttributeData is a UDF, both java.
ts_to_average = LOAD 'StateVector' USING StateVectorFileStorage();
ts_to_average = LIMIT ts_to_average 10;
ts_to_average = FOREACH ts_to_average GENERATE entityId, FLATTEN(AttributeData(*));
a = FOREACH ts_to_average GENERATE entityId, $1 as attributeName:chararray, $2#'value';
b = foreach a generate entityId, attributeName, FLATTEN($2);
c_no_flatten = foreach b generate
$0 as entityId,
$1 as attributeName,
TOBAG($2 ..);
c = foreach b generate
$0 as entityId,
$1 as attributeName,
FLATTEN(TOBAG($2 ..));
d = foreach c generate
entityId,
attributeName,
(float)$2#'value' as value,
(int)$2#'timestamp' as timestamp;
dump a;
describe a;
dump b;
describe b;
dump c_no_flatten;
describe c_no_flatten;
dump c;
describe c;
dump d;
describe d;
Output follows. Notice how in the relation 'c', the second value/timestamp pair [value#52.0,timestamp#1388683516000] is lost.
(++JIYMIS2D,RechargeTimeSeries,([value#50.0,timestamp#1388675231000],[value#52.0,timestamp#1388683516000]))
(++JRGOCZQD,RechargeTimeSeries,([value#50.0,timestamp#1388592317000]))
(++GCYI1OO4,RechargeTimeSeries,())
a: {entityId: chararray,attributeName: chararray,bytearray}
(++JIYMIS2D,RechargeTimeSeries,[value#50.0,timestamp#1388675231000],[value#52.0,timestamp#1388683516000])
(++JRGOCZQD,RechargeTimeSeries,[value#50.0,timestamp#1388592317000])
(++GCYI1OO4,RechargeTimeSeries)
b: {entityId: chararray,attributeName: chararray,bytearray}
(++JIYMIS2D,RechargeTimeSeries,{([value#50.0,timestamp#1388675231000])})
(++JRGOCZQD,RechargeTimeSeries,{([value#50.0,timestamp#1388592317000])})
(++GCYI1OO4,RechargeTimeSeries,{()})
c_no_flatten: {entityId: chararray,attributeName: chararray,{(bytearray)}}
(++JIYMIS2D,RechargeTimeSeries,[value#50.0,timestamp#1388675231000])
(++JRGOCZQD,RechargeTimeSeries,[value#50.0,timestamp#1388592317000])
(++GCYI1OO4,RechargeTimeSeries,)
c: {entityId: chararray,attributeName: chararray,bytearray}
(++JIYMIS2D,RechargeTimeSeries,50.0,1388675231000)
(++JRGOCZQD,RechargeTimeSeries,50.0,1388592317000)
(++GCYI1OO4,RechargeTimeSeries,,)
d: {entityId: chararray,attributeName: chararray,value: float,timestamp: int}
This should do the trick. First, flatten the tuple of maps to get rid of the encapsulating tuple:
b = foreach a generate entityId, attributeName, FLATTEN($2);
Now we can convert everything but the first two fields into a bag. The bag can be flattened (see http://pig.apache.org/docs/r0.12.0/basic.html#flatten) to get rows for each value/timestamp pair:
c = foreach b generate
$0 as entityId,
$1 as attributeName,
FLATTEN(TOBAG($2 ..));
Lastly, get the values you need out of the map:
d = foreach c generate
entityId,
attributeName,
(float)$2#'value' as value,
(int)$2#'timestamp' as timestamp;
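If, as you mentioned, you are not interested in rows with no value/timestamp pairs at all, a final filter (a sketch; the alias 'e' is just illustrative) drops them:
e = FILTER d BY value IS NOT NULL AND timestamp IS NOT NULL;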
Update:
Some other options to make a bag of maps out of the tuple of maps:
DataFu's TransposeTupleToBag: http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/util/TransposeTupleToBag.html
The foo() Python UDF in this answer: Pig - how to iterate on a bag of maps

How can I use DESCRIBE and DUMP inside nested FOREACH

I'm new to Pig, and sometimes I need to access the schema of relations inside a nested FOREACH. For example:
A = LOAD 'data' AS (url:chararray,outlink:chararray);
DUMP A;
(www.ccc.com,www.hjk.com)
(www.ddd.com,www.xyz.org)
(www.aaa.com,www.cvn.org)
(www.www.com,www.kpt.net)
(www.www.com,www.xyz.org)
(www.ddd.com,www.xyz.org)
B = GROUP A BY url;
DUMP B;
(www.aaa.com,{(www.aaa.com,www.cvn.org)})
(www.ccc.com,{(www.ccc.com,www.hjk.com)})
(www.ddd.com,{(www.ddd.com,www.xyz.org),(www.ddd.com,www.xyz.org)})
(www.www.com,{(www.www.com,www.kpt.net),(www.www.com,www.xyz.org)})
X = FOREACH B {
FA = FILTER A BY outlink == 'www.xyz.org';
PA = FA.outlink;
DA = DISTINCT PA;
GENERATE group, COUNT(DA);
}
DUMP X;
(www.aaa.com,0)
(www.ccc.com,0)
(www.ddd.com,1)
(www.www.com,1)
I want to know the structure of FA, PA, and DA. I have tried to use DESCRIBE inside the FOREACH block, but it gives an error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 13, column 13> Syntax error, unexpected symbol at or near 'FA'
Is there any way to get the schema and structure of relations inside a nested FOREACH, just for learning purposes?
Not directly; DESCRIBE and DUMP only work on top-level aliases. Instead, do multiple runs and project FA/PA/DA in the GENERATE statement. Sample code projecting FA:
X = FOREACH B {
FA = FILTER A BY outlink == 'www.xyz.org';
--PA = FA.outlink;
--DA = DISTINCT PA;
GENERATE group, FA;
}
DUMP X;
DESCRIBE X;
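To inspect DA (or PA) the same way, uncomment the intermediate steps and project that alias instead; a sketch following the same pattern:
X = FOREACH B {
FA = FILTER A BY outlink == 'www.xyz.org';
PA = FA.outlink;
DA = DISTINCT PA;
GENERATE group, DA;
}
DUMP X;
DESCRIBE X;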

Resources