Apache Hadoop Pig SPLIT not working, giving ERROR 1200

Structure of bag:
emp = LOAD '...../emp.csv' using PigStorage(',') AS
(ename:chararray,id:int,job:chararray,sal:double)
This bag contains details of employees. I want to split the data based on job.
Bag = split emp into mngr if job == 'MANAGER';
This is not working and gives ERROR 1200.
If I include one more condition, for example sal10k IF sal < 10000, then it works. But why does it not work with only a single condition on a chararray?
I am new to Hadoop Pig and know only a few basics. Kindly help.

Please find the solution to the problem below, along with a basic explanation of the SPLIT operator.
The SPLIT operator is used to break a relation into two or more new relations, so you need to take care of both conditions, like an IF and an ELSE:
For instance: IF (something matches) put the record into Relation1, and IF (NOT(something matches)) put it into another relation. (Pig has no ELSE keyword, although newer versions provide an OTHERWISE branch.)
SPLIT is a standalone operation, meaning that you cannot assign the SPLIT operation to a relation:
Example:
Bag = split emp into mngr if job == 'MANAGER'; -- This is wrong.
You can't represent a SPLIT operation by a relation.
It executes on its own in the Grunt shell or in a script, like this:
SPLIT emp INTO managers IF (job MATCHES '.*MANAGER.*'), not_managers IF (NOT(job MATCHES '.*MANAGER.*'));
Here is an example data set and output for your reference:
Dataset:
Ron,1331,MANAGER,7232332.34
John,4332,ASSOCIATE,45534.6
Michell,4112,MANAGER,8342423.43
Tamp,1353,ASSOCIATE,34324.67
Ramo,2144,MODULE LEAD,845433.32
Shina,1389,MANAGER,8345321.78
Chin,4323,MODULE LEAD,455465.42
SCRIPT:
emp = LOAD 'stackfile.txt' USING PigStorage(',') AS (ename:chararray,id:int,job:chararray,sal:double);
SPLIT emp INTO managers IF(job MATCHES '.*MANAGER.*'),not_managers IF(NOT(job MATCHES '.*MANAGER.*'));
DUMP managers;
OUTPUT:
(Ron,1331,MANAGER,7232332.34)
(Michell,4112,MANAGER,8342423.43)
(Shina,1389,MANAGER,8345321.78)
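If you only ever need the MANAGER rows by themselves, a plain FILTER (rather than SPLIT) is usually the simpler tool; a minimal sketch on the same schema:
managers = FILTER emp BY job == 'MANAGER';
DUMP managers;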

I think you are using the SPLIT operator incorrectly.
This is from the documentation:
SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression …] [, alias OTHERWISE];
So do not put the "Bag =" part at the start.
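With at least two output relations (or an OTHERWISE branch, available in newer Pig versions), the statement from the question could be written roughly as:
SPLIT emp INTO mngr IF job == 'MANAGER', others OTHERWISE;
DUMP mngr;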

Related

mismatched input '$1' expecting LEFT_PAREN

I am new to Pig Latin scripting and I don't know whether what I am doing is right or wrong; please help me.
Below is a sample. I first group by player name (the first parameter); now I want to order the data that is in the bag by score, descending.
Is it possible to get this done in Pig with a single statement?
Input (after grouping):
(B.Kumarr,{(B.Kumarr,18),(B.Kumarr,10),(B.Kumarr,38)})
What I tried (this raises the 'mismatched input' error from the title):
cricData3 = FOREACH cricData2 GENERATE $0,ORDER $1.$1 By DESC;
Expected output:
(B.Kumarr,{(B.Kumarr,38),(B.Kumarr,18),(B.Kumarr,10)})
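One way to do this in a single statement is a nested FOREACH with an inner ORDER. A sketch, assuming the grouped relation was built as cricData2 = GROUP cricData1 BY name and cricData1 has the schema (name:chararray, score:int):
cricData3 = FOREACH cricData2 {
    -- sort the bag of this player's tuples by score, highest first
    sorted = ORDER cricData1 BY score DESC;
    GENERATE group, sorted;
}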

How to count on two columns of group by items in pig

I have generated two columns (origin and destination) out of 'n' columns. Now I want to generate a count for each combination of these two columns, but I am not able to get the result. I am getting the error ERROR 1070: Could not resolve Count using imports:
Below is my script:
mydata = load '/Projects/Flightdata/1987/Rawdata' using PigStorage(',') as (year:int, month:int, dom:int, dow:int, deptime:long, crsdeptime:long, arrtime:long, crsarrtime:long, uniqcarcode:chararray, flightnum:long, tailnum:chararray, actelaptime:long, crselaptime:long, airtime:long, arrdeltime:long, depdeltime:long, origcode:chararray, destcode:chararray, dist:long, taxintime:long, taxiouttime:long, flightcancl:int, canclcode:chararray, diverted:int, carrierdel:long, weatherdel:long, nasdel:long, securitydel:long, lateaircraftdel:long);
Step2 = foreach mydata generate origcode, destcode;
grpby = group Step2 by (origcode, destcode) ;
step3 = foreach grpby generate group.origcode as source, group.destcode as destination, Count(step2);
Here I want to generate the count for each combination of origin and destination.
Any guidance will be helpful.
Please see the Pig documentation about case sensitivity:
The names of Pig Latin functions are case sensitive.
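With that fix (and noting that relation aliases are case-sensitive too, so the bag inside grpby is Step2, not step2), the last line of the script would look roughly like this:
step3 = foreach grpby generate group.origcode as source, group.destcode as destination, COUNT(Step2);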

Pig latin join by field

I have a Pig Latin-related problem.
I have this data below (all in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
I would like to join these two datasets: take the corresponding value from dataset B and place it beside the matching value from dataset A. The expected output is below:
FITKA 0.123133, FINVA 0.454535 and so on ..
(They can also be like: FITKA, 0.123133, FINVA, 0.454535 and so on .. )
Then I would be able to multiply the values (0.123133 x 0.454535, and so on) because they are now on the same row, which is what I want.
Of course I can join column by column, but then the values end up at the end of the row and I have to clean that up with another FOREACH ... GENERATE. I want a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (essentially one sentence per row).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple that contains a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})
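If the data stays in its current six-column layout, one rough way to line each word in A up with its value from B is to flatten A into one word per row and then join; a sketch using the TOBAG and FLATTEN built-ins (field names follow the question):
A = LOAD 'records' AS (f1:chararray, f2:chararray, f3:chararray, f4:chararray, f5:chararray, f6:chararray);
B = LOAD 'values' AS (word:chararray, val:double);
-- turn the six columns of A into one word per row
A_words = FOREACH A GENERATE FLATTEN(TOBAG(f1, f2, f3, f4, f5, f6)) AS word;
-- attach each word's value from B
joined = JOIN A_words BY word, B BY word;
pairs = FOREACH joined GENERATE A_words::word, B::val;
Note that this flattens everything onto separate rows; to group the words back into their original sentence you would need to add a row identifier (for example with RANK) before flattening.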

Pig Latin using two data sources in one FILTER statement

In my Pig script I am reading data from more than 5 data sources (Hive tables), where one is the main data source and the rest are dimension tables. I am trying to filter the main relation (or alias) with respect to some value in one of the dimension relations.
E.g.
-- main_data is main data source and dept_data is department data
filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;
In my Pig script there are at least 20 places where I need to match a value between multiple data sources and produce a new relation, but I am getting this error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1.
Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... )
Details at logfile: /root/pig_1403263965493.log
I also tried the "relation::field" approach, with no luck. Alternatively, I can join these two relations (data sources) to get the filtered data, but I feel this will slow down execution and dump unnecessarily huge amounts of data.
Please guide me on how to use two or more data sources in one FILTER statement, something like in SQL, so that I can avoid JOIN statements and get it done from the FILTER statement itself:
Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID
If you want to match records from different tables by a single ID, you would pretty much have to use a join, as such:
Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID
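In Pig that join-based approach would look roughly like this (a sketch; relation and field names follow the question):
filtered_data1 = JOIN main_data BY deptID, dept_data BY departmentID;
-- main_data's columns come first in the joined schema, so the original positional projection still works
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;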
If you just want to keep the records that occur in all other tables, you could probably go for an INTERSECT and then a
FILTER BY someID IN someIDList
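Note that the IN operator in Pig (0.12 and later) only accepts a literal list, not another relation, so it helps only when the ID list is small and known up front; for example (a sketch with made-up IDs):
small_depts = FILTER main_data BY deptID IN (10, 20, 30);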

Pig throws error for simple Group by and count occurrence task

I am using Hadoop's Pig Latin to find the number of occurrences of unique search strings in a search engine log file.
Please help me out. Thanks in advance.
Pig script
excitelog = load '/user/hadoop/input/excite-small.log' using PigStorage() AS
(encryptcode:chararray, numericid:int, searchstring:chararray);
GroupBySearchString = GROUP excitelog by searchstring;
searchStrFrq = foreach GroupBySearchString Generate group as searchstring,count(searchstring);
Error encountered
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve count using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
You need to do:
searchStrFrq = foreach GroupBySearchString Generate group as searchstring,
COUNT(excitelog) as kount;
This is because of the way grouping works in Pig: GroupBySearchString is a relation of tuples of the form {group, excitelog}, where excitelog is itself a bag of all the tuples matching the group. COUNT is a UDF that takes a bag as input and returns the number of tuples in that bag, so COUNT(excitelog) gives you the number of tuples matching the group.
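For instance, DESCRIBE GroupBySearchString should print a schema along these lines:
GroupBySearchString: {group: chararray,excitelog: {(encryptcode: chararray,numericid: int,searchstring: chararray)}}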
Function names such as PigStorage and COUNT are case-sensitive, so the COUNT function has to be written exactly as below:
wordcount = FOREACH grouped GENERATE group, COUNT(words);
