PIG: scalar has more than one row in the output - hadoop

I have the following Pig code, in which I check whether the fields srcgt and destgt of each record from the main files start with any of the values listed in another file (intlgt.txt), which contains 338, 918299, 181 and 238. It throws the error shown below. Can you please suggest how to overcome this on Apache Pig version 0.15.0 (r1682971)?
Pig code:
record = LOAD '/u02/20160201*.SMS' USING PigStorage('|','-tagFile') ;
intlgtrec = LOAD '/u02/config/intlgt.txt' ;
intlgt = foreach intlgtrec generate $0 as intlgt;
cdrfilter = foreach record generate (chararray) $1 as aparty, (chararray) $2 as bparty,(chararray) $3 as dt,(chararray)$4 as timestamp,(chararray) $29 as status,(chararray) $26 as srcgt,(chararray) $27 as destgt,(chararray)$0 as cdrfname ,(chararray) $13 as prepost;
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) ) ;
Error is:
WARN org.apache.hadoop.mapred.LocalJobRunner - job_local1939982195_0002
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (338), 2nd :(918299) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar") at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)

When you write
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) );
Pig expects a scalar as the second argument: a single value, whether a number or a chararray. It therefore assumes that intlgt is a relation with exactly one row, e.g. the result of
intlgt = foreach (group intlgtrec all) generate COUNT_STAR(intlgtrec.$0);
(this generates a single row containing the number of records in the original relation).
In your case, intlgt contains more than one row, because you have not grouped it.
Based on your code, you are trying to find SMS messages that have an intlgt on either end. Possible solutions:
If your intlgt entries all have the same length (e.g. 3), generate SUBSTRING(srcgt, 0, 3) as srcgtshort (SUBSTRING indexes are zero-based and the stop index is exclusive) and JOIN cdrfilter BY srcgtshort with intlgt BY intlgt. This gives you the records whose srcgt begins with a value from intlgt. Then repeat this for destgt.
If they have a small number of distinct lengths (e.g. some entries have length 3, some 4 and some 5), you can do the same thing, but it is more laborious, as a separate field is required for each length.
If the number of rows in the two relations is not too big, do a CROSS between them, which creates every possible combination of a row from record and a row from intlgt. You can then filter by STARTSWITH(srcgt, intlgt::intlgt), because both are now fields of the same relation. Beware of this approach, as the number of records can get HUGE!
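The cross-and-filter option could look roughly like the following. This is a minimal sketch, not a tested script: it reuses the paths and field positions from the question, trims cdrfilter to four fields for brevity, and assumes intlgt.txt holds one value per line. The aliases prefixes, crossed, matched and projected and the field name prefix are made up for illustration.

```pig
-- Sketch only: prefixes/crossed/matched/projected and the field "prefix" are illustrative names.
record = LOAD '/u02/20160201*.SMS' USING PigStorage('|', '-tagFile');
cdrfilter = FOREACH record GENERATE
    (chararray) $0  AS cdrfname,
    (chararray) $1  AS aparty,
    (chararray) $26 AS srcgt,
    (chararray) $27 AS destgt;

-- Assumption: one prefix per line in the config file.
prefixes = LOAD '/u02/config/intlgt.txt' AS (prefix:chararray);

-- Pair every CDR with every prefix (row count = CDRs x prefixes).
crossed = CROSS cdrfilter, prefixes;

-- Both operands are now fields of the same relation, so no scalar is involved.
matched = FILTER crossed BY STARTSWITH(cdrfilter::srcgt, prefixes::prefix)
                         OR STARTSWITH(cdrfilter::destgt, prefixes::prefix);

-- Drop the prefix column, then collapse CDRs that matched more than one prefix.
projected = FOREACH matched GENERATE
    cdrfilter::cdrfname AS cdrfname,
    cdrfilter::aparty   AS aparty,
    cdrfilter::srcgt    AS srcgt,
    cdrfilter::destgt   AS destgt;
intlcdrs = DISTINCT projected;
```

With only the four values listed in the question, the cross is four times the size of the CDR input, which may well be acceptable. Note that DISTINCT also merges CDRs that are genuinely identical in the projected fields, so keep enough columns to tell records apart if that matters.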

Related

Significance of Group keywords in Apache pig word count program

I am a beginner with Pig and have started with the word count program.
In the following word count program, I see the group keyword used in the 3rd and 4th lines. Is its usage the same in both places? I am a bit confused, because the group in the 4th line throws an error when written in caps.
lines = LOAD '/user/root/pig/pig_demo.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
The two are different. The former, "GROUP", is an operator, whereas the latter, "group", refers to the field that holds the group key.
Below is the explanation from the Pig documentation.
The GROUP operator groups together tuples that have the same group key (key field). The key field will be a tuple if the group key has more than one field, otherwise it will be the same type as that of the group key. The result of a GROUP operation is a relation that includes one tuple per group. This tuple contains two fields:
The first field is named "group" (do not confuse this with the GROUP operator) and is the same type as the group key.
The second field takes the name of the original relation and is type bag.
The names of both fields are generated by the system as shown in the example below.
Note that the GROUP (and thus COGROUP) and JOIN operators perform similar functions. GROUP creates a nested set of output tuples while JOIN creates a flat set of output tuples.
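The example that the quoted passage refers to is not reproduced here, so the word count script from the question can stand in for it. A small sketch: the comments describe what DESCRIBE is expected to report rather than quoting its exact output, and the field name occurrences is made up for illustration.

```pig
lines = LOAD '/user/root/pig/pig_demo.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- GROUP here is the operator.
grouped = GROUP words BY word;

-- Expect two fields: "group" (a chararray, the key) and "words"
-- (a bag holding the tuples of the original relation that share that key).
DESCRIBE grouped;

-- group here is the first field of grouped, not the operator.
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS occurrences;
DUMP wordcount;
```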

How to retrieve previous row value in Pig

Hi, I am using Pig to move values into HBase. I am trying to evaluate a condition: if it succeeds, I concatenate a value from the current row; if it fails, I concatenate the value from the previous row.
I tried the code below, but it does not work and throws an error.
Code:
STOCK_A = LOAD '/user/cloudera/pat.hl7' USING PigStorage('|');
data = FILTER STOCK_A BY ($0 matches '.*OBR.*' or $0 matches '.*OBX.*');
MSH_DATA = FOREACH data GENERATE ($0 == 'OBR' ? CONCAT('HL','OBR',(chararray)$1) : CONCAT('HL','OBR',(chararray)(data -1).$1)) AS Uid, $1 AS id, $5 AS result, $3 AS resultname;
Error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 14, column 122> mismatched input '.' expecting RIGHT_PAREN
I want that concatenated value to be repeated in the following rows until I reach another OBR. Please help.
You can't refer to previous rows in Pig itself, but you can write an aggregate UDF that accepts all rows and does what is required. Keep in mind that you also need to specify a parallelism of 1, or your rows will be split into chunks.
I think you can use Stitch, Over and lag to compute values from the previous row. Not sure about efficiency, though.

How to count on two columns of group by items in pig

I have generated two columns (origin and destination) out of 'n' columns. Now I want to generate a count for each combination of these two columns, but I am not able to get the result. I get the error: ERROR 1070: Could not resolve Count using imports:
Below is my script:
mydata = load '/Projects/Flightdata/1987/Rawdata' using PigStorage(',') as (year:int, month:int, dom:int, dow:int, deptime:long, crsdeptime:long, arrtime:long, crsarrtime:long, uniqcarcode:chararray, flightnum:long, tailnum:chararray, actelaptime:long, crselaptime:long, airtime:long, arrdeltime:long, depdeltime:long, origcode:chararray, destcode:chararray, dist:long, taxintime:long, taxiouttime:long, flightcancl:int, canclcode:chararray, diverted:int, carrierdel:long, weatherdel:long, nasdel:long, securitydel:long, lateaircraftdel:long);
Step2 = foreach mydata generate origcode, destcode;
grpby = group Step2 by (origcode, destcode) ;
step3 = foreach grpby generate group.origcode as source, group.destcode as destination, Count(step2);
Here I want to generate a count for each combination of origin and destination.
Any guidance will be helpful.
Please see the Pig documentation about case sensitivity:
The names of Pig Latin functions are case sensitive. The built-in function is COUNT, so Count cannot be resolved.
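Applied to the script in the question, that means writing COUNT in upper case. A sketch of the corrected statements follows. Relation aliases are case sensitive as well, so the argument is written as Step2 to match the way the alias was defined. The field name flights is made up for illustration.

```pig
-- mydata is loaded exactly as in the question.
Step2 = FOREACH mydata GENERATE origcode, destcode;
grpby = GROUP Step2 BY (origcode, destcode);

-- COUNT (upper case) is the built-in; Step2 must match the alias's original casing.
step3 = FOREACH grpby GENERATE
    group.origcode AS source,
    group.destcode AS destination,
    COUNT(Step2)   AS flights;
```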

Pig latin join by field

I have a Pig Latin related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
I would like to join these two datasets: look up the corresponding value from dataset B and place it beside each value from dataset A. The expected output is:
FITKA 0.124411, FINVA 0.454535 and so on ..
(It could also look like: FITKA, 0.124411, FINVA, 0.454535 and so on ..)
Then I would be able to multiply the values (0.124411 x 0.454535 .. and so on), because they are now on the same row, which is what I want.
Of course I can join column by column, but then the values appear at the end of the row and I have to tidy them up with another FOREACH ... GENERATE. I want a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (a sentence, in a way).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple that contains a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})

Pig Latin using two data sources in one FILTER statement

In my Pig script, I am reading data from more than 5 data sources (Hive tables), where one is the main source data and the rest are dimension tables. I am trying to filter the main data relation (or alias) on a value in one of the dimension relations.
E.g.
-- main_data is main data source and dept_data is department data
filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;
In my Pig script there are at least 20 places where I need to match a value across multiple data sources and produce a new relation. But I get the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1.
Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... )
Details at logfile: /root/pig_1403263965493.log
I also tried the "relation::field" approach, to no avail. As an alternative I am joining the two relations (data sources) to get the filtered data, but I feel this will slow down execution and unnecessarily dump a huge amount of data.
Please guide me on how to use two or more data sources in one FILTER statement, something like the following in SQL, so that I can avoid JOIN statements and do it all within the FILTER statement:
Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID
If you want to match records from different tables by a single ID, you would pretty much have to use a join, as such:
Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID
If you just want to keep the records that occur in all other tables, you could probably go for an INTERSECT and then a
FILTER BY someID IN someIDList
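For the deptID example in the question, the join this answer describes might be written as below. A hedged sketch: main_data and dept_data are the aliases from the question, joined is an illustrative name, and the 'replicated' hint assumes dept_data is small enough to fit in memory.

```pig
-- main_data and dept_data are loaded as in the question's script.
-- 'replicated' requests a map-side join: the large relation is listed first,
-- and the small one (dept_data) is held in memory.
joined = JOIN main_data BY deptID, dept_data BY departmentID USING 'replicated';

-- main_data's fields come first in the join output, so the question's
-- positional references still point at main_data columns.
filtered_data2 = FOREACH joined GENERATE $0, $1, $3, $7;
```

An inner join keeps only the main_data rows that have a match, which is the effect the FILTER was meant to have. If departmentID is not unique in dept_data, each matching main_data row is repeated once per match, so applying DISTINCT to the dimension keys first may be worthwhile.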
