How to retrieve previous row value in Pig - hadoop

Hi i am using Pig to move values in HBASE. I am trying to execute on condition if it is success i'll Concatenate a value, if it fails i'll concatenate value of previous row.
for that i tried below code but it is not working and throwing error.
Code:
STOCK_A = LOAD '/user/cloudera/pat.hl7' USING PigStorage('|');
data = FILTER STOCK_A BY ($0 matches '.*OBR.*' or $0 matches '.*OBX.*');
MSH_DATA = FOREACH data GENERATE ($0 == 'OBR' ? CONCAT('HL','OBR',(chararray)$1) : CONCAT('HL','OBR',(chararray)(data -1).$1)) AS Uid, $1 AS id, $5 AS result, $3 AS resultname;
Error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 14, column 122> mismatched input '.' expecting RIGHT_PAREN
I want that concatenated value to be replicated in other rows till i reach another OBR. Please Help.

You can't refer to previous rows in Pig itself, but you can write an aggregate UDF that will accept all rows and do the required. But keep in mind that you also need to specify parallelism 1 or your rows will be split in chunks

I think you can Stitch, Over and lag to calculate the data from previous row. Not sure about efficiency though.

Related

mismatched input '$1' expecting LEFT_PAREN

I am new to pig Latin scripting I don't know whether am i doing is right or wrong please help me.
Below is the sample which I have the first group by player name that is first parameter now data which is present in bag i want to order them by score desc
Is it possible to get it done in pig by single statement?
(B.Kumarr,{(B.Kumarr,18),(B.Kumarr,10),(B.Kumarr,38)})
cricData3 = FOREACH cricData2 GENERATE $0,ORDER $1.$1 By DESC;
(B.Kumarr,{(B.Kumarr,38),(B.Kumarr,18),(B.Kumarr,10)})

PIG: scalar has more than one row in the output

I have following code in pig in which i am checking the field (srcgt & destgt in record) from main files stored in record for values as mentioned in another file(intlgt.txt) having values 338,918299,181,238 but it throws error as mentioned below. Can you please suggest how to overcome this on Apache Pig version 0.15.0 (r1682971).
Pig code:
record = LOAD '/u02/20160201*.SMS' USING PigStorage('|','-tagFile') ;
intlgtrec = LOAD '/u02/config/intlgt.txt' ;
intlgt = foreach intlgtrec generate $0 as intlgt;
cdrfilter = foreach record generate (chararray) $1 as aparty, (chararray) $2 as bparty,(chararray) $3 as dt,(chararray)$4 as timestamp,(chararray) $29 as status,(chararray) $26 as srcgt,(chararray) $27 as destgt,(chararray)$0 as cdrfname ,(chararray) $13 as prepost;
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) ) ;`
Error is:
WARN org.apache.hadoop.mapred.LocalJobRunner - job_local1939982195_0002
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (338), 2nd :(918299) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar") at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
When you are using
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) );
PIG is looking for a scalar. Be it a number, or a chararray; but a single one. So pig assumes your intlgt::intlgt is a relation with one row. e.g. the result of
intlgt = foreach (group intlgtrec all) generate COUNT_STAR(intlgtrec.$0)
(this would generate single row, with the count of records in the original relation)
In your case, the intlgt contains more than one row, since you have not done any grouping on it.
Based on your code, you're trying to look for SMS messages that had an intlgt on either end. Possible solutions:
if your intlgt enteries all have the same length (e.g. 3) then generate substring(srcgt, 1, 3) as srcgtshort, and JOIN intlgt::intlgt with record::srcgtshort. this will give you the records where srcgt begins with a value from intlgt. Then repeat this for destgt.
if they have a small number of lengths (e.g. some entries have length 3, some have length 4, and some have length 5) you can do the same thing, but it would be more laborious (as a field is required for each 'length').
if the number of rows in the two relations is not too big, do a cross between them, which would create all possible combinations of rows from record and rows from intlgt. Then you can filter by STARTSWITH(srcgt, intlgt::intlgt), because the two of them are fields in the same relation. Beware of this approach, as the number of records can get HUGE!

Apache Hadoop pig SPLIT not working. Giving Error 1200

Structure of bag:
emp = LOAD '...../emp.csv' using PigStorage(',') AS
(ename:chararray,id:int,job:chararray,sal:double)
This bag contains details of employees. I want to split the data based on job.
Bag = split emp into mngr if job == 'MANAGER';
This is not working & giving Error 1200.
If I include one more condition with it, for ex.- sal10k if sal<10000, then it is working. But why not only on one chararray?
I am new to hadoop pig. Know few basics. Kindly help.
Kindly find the solution to the problem below along with basic explanation about SPLIT operator:
The SPLIT operator is used to break a relation into two new relations. So you need to take care of both conditions , like IF and ELSE:
For instance: IF(Something matches) then make Relation1, IF(NOT(something
matches) then make another relation. ( You don't have else keyword in Pig).
SPLIT operation is an independent operation, meaning that you cant store the SPLIT operation in a relation:
Example:
Bag = split emp into mngr if job == 'MANAGER'; // This is wrong.
You can't represent a SPLIT operation by a relation.
It will execute independently on the GRUNT shell or Script like this :
*SPLIT emp INTO managers IF(job MATCHES '.MANAGER.'),not_managers IF(NOT(job MATCHES '.MANAGER.'));*
Here is an example data set and output for your reference:
**
Dataset
**
Ron,1331,MANAGER,7232332.34
John,4332,ASSOCIATE,45534.6
Michell,4112,MANAGER,8342423.43
Tamp,1353,ASSOCIATE,34324.67
Ramo,2144,MODULE LEAD,845433.32
Shina,1389,MANAGER,8345321.78
Chin,4323,MODULE LEAD,455465.42
SCRIPT:
emp = LOAD 'stackfile.txt' USING PigStorage(',') AS (ename:chararray,id:int,job:chararray,sal:double);
SPLIT emp INTO managers IF(job MATCHES '.*MANAGER.*'),not_managers IF(NOT(job MATCHES '.*MANAGER.*'));
DUMP managers;
OUTPUT:
(Ron,1331,MANAGER,7232332.34)
(Michell,4112,MANAGER,8342423.43)
(Shina,1389,MANAGER,8345321.78)
I think you are using SPLIT operator wrong.
This is from doc:
SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression …] [, alias OTHERWISE];
So don't use this part "Bag =" at start.

sampling of records inside group by throwing error

sample data : (tsv file: sampl)
1 a
2 b
3 c
raw= load 'sampl' using PigStorage() as (f1:chararray,f2:chararray);
grouped = group raw by f1;
describe grouped;
fields = foreach grouped {
x = sample raw 1;
generate x;
}
When I run this I am getting error at the line x = sample raw 1;
ERROR 1200: mismatched input 'raw' expecting LEFT_PAREN
Is sampling not allowed for a grouped record?
You can't use 'sample' command inside nested block.This is not supported in pig.
Only few operations operations like (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY) are allowed in nested block. You have to use the sample command outside of the nested block.
The other problem is, you are loading your input data using default delimiter ie tab. But your input data is delimited with space, so you need to change your script like this
raw= load 'sampl' using PigStorage(' ') as (f1:chararray,f2:chararray);

Pig Latin using two data sources in one FILTER statement

In my pig script, am reading data from more than 5 data sources (Hive tables), where one is the main source data and rest were kind of dimension data tables. I am trying to filter the main data source relation (or alias) w.r.t some value in one of the dimension relation.
E.g.
-- main_data is main data source and dept_data is department data
filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;
In my pig script there are minimum 20 instances where I need to match for some value between multiple data sources and produce a new relation. But am getting some error as
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1.
Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... )
Details at logfile: /root/pig_1403263965493.log
I tried to use "relation::field" approach also, no use. Alternatively, am joining these two relations (data sources) to get filtered data, but I feel, this will slow down the execution process and unnecessirity huge data will be dumped.
Please guide me how two use two or more data sources in one FILTER statement, something like in SQL, so that I can avoid using JOIN statements and get it done from FILTER statement itself.
Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID
If you want to match records from different tables by a single ID, you would pretty much have to use a join, as such:
Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID
If you just want to keep the records that occur in all other tables, you could probably go for an INTERSECT and then a
FILTER BY someID IN someIDList

Resources