Sampling of records inside GROUP BY throws an error - hadoop

sample data : (tsv file: sampl)
1 a
2 b
3 c
raw= load 'sampl' using PigStorage() as (f1:chararray,f2:chararray);
grouped = group raw by f1;
describe grouped;
fields = foreach grouped {
x = sample raw 1;
generate x;
}
When I run this, I get an error at the line x = sample raw 1;
ERROR 1200: mismatched input 'raw' expecting LEFT_PAREN
Is sampling not allowed for a grouped record?

You can't use the SAMPLE operator inside a nested block. This is not supported in Pig.
Only a few operations (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY) are allowed in a nested block. You have to use SAMPLE outside of the nested block.
Another possible problem: you are loading your input data using the default delimiter, i.e. tab. If your input data is actually delimited with a space, as it appears in the question, you need to change your script like this:
raw= load 'sampl' using PigStorage(' ') as (f1:chararray,f2:chararray);
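As a rough sketch of two possible workarounds, the script could either sample before grouping or use a nested LIMIT, which is one of the operations allowed in a nested block. Note that SAMPLE takes a fraction between 0 and 1 rather than a row count, and that LIMIT keeps an arbitrary record, not a random one. The aliases sampled, grouped_sample and one_per_group are hypothetical names introduced for illustration.

```pig
-- Default PigStorage() splits on tab; pass ' ' instead if the file is space-delimited.
raw = LOAD 'sampl' USING PigStorage() AS (f1:chararray, f2:chararray);

-- Option 1: sample the whole relation first (roughly 50% of the rows), then group.
sampled        = SAMPLE raw 0.5;
grouped_sample = GROUP sampled BY f1;

-- Option 2: keep at most one record per group with a nested LIMIT.
grouped       = GROUP raw BY f1;
one_per_group = FOREACH grouped {
    x = LIMIT raw 1;
    GENERATE group, FLATTEN(x);
};
```

Option 1 changes which groups survive (a group can disappear entirely if none of its rows are sampled), so it is only a substitute when approximate output is acceptable.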

Related

Outputting a tuple with space between two values in pig

I have been using pig to filter a large file which contains data in tab separated form. The data inside that file is in the following form - fname lname age
Bill Gates 50
Warren Buffet 100
Elon Musk 80
Jack Dorsey 10
I want to filter this file for records where age > 50 and store the resulting data in (fname lname) form in a file using Pig.
Here is the code which I'm using -
data = LOAD 'persons.txt' AS (fname:chararray, lname:chararray, age:int);
data1 = FILTER data BY age > 50;
data2 = FOREACH data1 GENERATE (fname, lname);
STORE data2 INTO 'result.txt';
Using this code, I am getting the following output -
(Warren,Buffet)
(Elon,Musk)
This is not the output I want. Instead, I want to get the following output -
(Warren Buffet)
(Elon Musk)
To get this kind of output, I have tried using FOREACH data1 GENERATE (fname lname) without a comma between fname and lname. But it shows the error Syntax error, unexpected symbol at or near fname.
Can anybody help me get the correct output?
Note -> I am running Pig on Hadoop Cluster not locally.
Use CONCAT with a space in between fname and lname
data2 = FOREACH data1 GENERATE CONCAT(fname,' ',lname);
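Put together with the script from the question, this might look like the sketch below. It uses the nested two-argument form of CONCAT, on the assumption that some older Pig releases only accept exactly two arguments; on newer releases the three-argument call should be equivalent. The field name fullname is a hypothetical label added for illustration.

```pig
data  = LOAD 'persons.txt' AS (fname:chararray, lname:chararray, age:int);
data1 = FILTER data BY age > 50;
-- Build one chararray "fname lname"; nested CONCAT avoids relying on the variadic form.
data2 = FOREACH data1 GENERATE CONCAT(fname, CONCAT(' ', lname)) AS fullname;
STORE data2 INTO 'result.txt';
```

Because data2 now holds a single chararray field rather than a tuple, the stored lines should not be wrapped in parentheses.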

PIG: scalar has more than one row in the output

I have the following Pig code, in which I check the fields srcgt and destgt of the main files (loaded into record) against the values listed in another file (intlgt.txt), which contains the values 338, 918299, 181, 238. It throws the error shown below. Can you please suggest how to overcome this on Apache Pig version 0.15.0 (r1682971)?
Pig code:
record = LOAD '/u02/20160201*.SMS' USING PigStorage('|','-tagFile') ;
intlgtrec = LOAD '/u02/config/intlgt.txt' ;
intlgt = foreach intlgtrec generate $0 as intlgt;
cdrfilter = foreach record generate (chararray) $1 as aparty, (chararray) $2 as bparty,(chararray) $3 as dt,(chararray)$4 as timestamp,(chararray) $29 as status,(chararray) $26 as srcgt,(chararray) $27 as destgt,(chararray)$0 as cdrfname ,(chararray) $13 as prepost;
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) ) ;
Error is:
WARN org.apache.hadoop.mapred.LocalJobRunner - job_local1939982195_0002
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (338), 2nd :(918299) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar") at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
When you are using
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) );
Pig is looking for a scalar here: a number or a chararray, but a single value. So Pig assumes that your intlgt::intlgt refers to a relation with exactly one row, e.g. the result of
intlgt = foreach (group intlgtrec all) generate COUNT_STAR(intlgtrec.$0);
(this would generate a single row, with the count of records in the original relation)
In your case, intlgt contains more than one row, since you have not done any grouping on it.
Based on your code, you're trying to find SMS messages that have an intlgt value on either end. Possible solutions:
If your intlgt entries all have the same length (e.g. 3), then generate SUBSTRING(srcgt, 0, 3) as srcgtshort (SUBSTRING indexes start at 0), and JOIN on intlgt::intlgt and the new srcgtshort field. This will give you the records where srcgt begins with a value from intlgt. Then repeat this for destgt.
If they have a small number of distinct lengths (e.g. some entries have length 3, some have length 4, and some have length 5), you can do the same thing, but it would be more laborious (as a separate field is required for each length).
If the number of rows in the two relations is not too big, do a CROSS between them, which creates all possible combinations of rows from record and rows from intlgt. Then you can filter by STARTSWITH(srcgt, intlgt::intlgt), because the two of them are now fields in the same relation. Beware of this approach, as the number of records can get HUGE!
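The CROSS-based option could be sketched roughly as follows. Only srcgt and destgt are projected to keep the example short, and the field name prefix and the alias crossed are hypothetical names introduced here; the paths are the ones from the question. It assumes intlgt.txt holds one value per line.

```pig
record    = LOAD '/u02/20160201*.SMS' USING PigStorage('|', '-tagFile');
-- Other fields from the question are omitted for brevity.
cdrfilter = FOREACH record GENERATE (chararray) $26 AS srcgt, (chararray) $27 AS destgt;
intlgt    = LOAD '/u02/config/intlgt.txt' AS (prefix:chararray);

-- Every CDR row is paired with every prefix row.
crossed  = CROSS cdrfilter, intlgt;
-- Both fields now live in the same relation, so no scalar projection is involved.
intlcdrs = FILTER crossed BY STARTSWITH(cdrfilter::srcgt, intlgt::prefix)
                          OR STARTSWITH(cdrfilter::destgt, intlgt::prefix);
```

A record that matches more than one prefix shows up once per match, so a projection followed by DISTINCT may be needed afterwards, at the cost of also merging records that were identical to begin with.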

How to count on two columns of group by items in pig

I have generated two columns (origin and destination) out of 'n' columns. Now I want to generate a count for each combination of these two columns, but I am not able to get the result. I am getting the error ERROR 1070: Could not resolve Count using imports:
Below is my script,
mydata = load '/Projects/Flightdata/1987/Rawdata' using PigStorage(',') as (year:int, month:int, dom:int, dow:int, deptime:long, crsdeptime:long, arrtime:long, crsarrtime:long, uniqcarcode:chararray, flightnum:long, tailnum:chararray, actelaptime:long, crselaptime:long, airtime:long, arrdeltime:long, depdeltime:long, origcode:chararray, destcode:chararray, dist:long, taxintime:long, taxiouttime:long, flightcancl:int, canclcode:chararray, diverted:int, carrierdel:long, weatherdel:long, nasdel:long, securitydel:long, lateaircraftdel:long);
Step2 = foreach mydata generate origcode, destcode;
grpby = group Step2 by (origcode, destcode) ;
step3 = foreach grpby generate group.origcode as source, group.destcode as destination, Count(step2);
Here I want to generate a count for each combination of origin and destination.
Any guidance will be helpful.
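Please see the Pig documentation about case sensitivity:
The names of Pig Latin functions are case sensitive.
So the function has to be written as COUNT, not Count. Relation aliases are case sensitive as well, so the bag must be referenced as Step2, not step2.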
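Applied to the script in the question, the fix might look like the sketch below. To keep it short, it loads the file without a schema and picks the two columns by position, which is an assumption based on the column order given in the question; flights is a hypothetical field name. Relation aliases are case sensitive too, so the bag is referenced as Step2.

```pig
-- origcode is the 17th field ($16) and destcode the 18th ($17) in the question's schema.
mydata = LOAD '/Projects/Flightdata/1987/Rawdata' USING PigStorage(',');
Step2  = FOREACH mydata GENERATE (chararray) $16 AS origcode, (chararray) $17 AS destcode;
grpby  = GROUP Step2 BY (origcode, destcode);
-- COUNT must be upper case, and the bag name must match the alias Step2 exactly.
step3  = FOREACH grpby GENERATE group.origcode AS source,
                                group.destcode AS destination,
                                COUNT(Step2) AS flights;
```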

Pig Latin using two data sources in one FILTER statement

In my Pig script, I am reading data from more than 5 data sources (Hive tables), where one is the main source data and the rest are dimension data tables. I am trying to filter the main data source relation (or alias) by some value in one of the dimension relations.
E.g.
-- main_data is main data source and dept_data is department data
filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;
In my Pig script there are at least 20 places where I need to match some value between multiple data sources and produce a new relation. But I am getting an error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1.
Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... )
Details at logfile: /root/pig_1403263965493.log
I also tried the "relation::field" approach, with no success. As an alternative, I am joining these two relations (data sources) to get the filtered data, but I feel this will slow down execution and dump an unnecessarily large amount of data.
Please guide me on how to use two or more data sources in one FILTER statement, something like in SQL, so that I can avoid JOIN statements and do it from the FILTER statement itself:
Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID
If you want to match records from different tables by a single ID, you would pretty much have to use a join, as such:
Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID
If you just want to keep the records that occur in all other tables, you could probably go for an INTERSECT and then a
FILTER BY someID IN someIDList
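For the single-ID case, a join against a small dimension table could be sketched as below. This is only an illustration under several assumptions: the Hive table names are hypothetical, HCatLoader is assumed to be available under the class name shown (it may differ by Hive version), and the dimension table is assumed to be small enough to fit in memory, which is what the 'replicated' join hint requires.

```pig
-- Hypothetical Hive table names; adjust to the real ones.
main_data = LOAD 'mydb.main_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
dept_data = LOAD 'mydb.dept_table' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Keep only the join key, de-duplicated, so the join cannot multiply rows.
dept_keys = FOREACH dept_data GENERATE departmentID;
dept_ids  = DISTINCT dept_keys;

-- Replicated (map-side) join: the small relation must be listed last.
joined         = JOIN main_data BY deptID, dept_ids BY departmentID USING 'replicated';
filtered_data2 = FOREACH joined GENERATE $0, $1, $3, $7;
```

Since the fields of main_data come first in the joined relation, the positional references $0, $1, $3 and $7 still point at the same main_data columns as in the question.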

Hadoop, how to normalize multiple columns data?

I have a .txt file like this:
1036177 19459.7356 17380.3761 18084.1440
1045709 19674.2457 17694.8674 18700.0120
1140443 19772.0645 17760.0904 19456.7521
where the first column represents the key and the others are the values.
I would like to normalize (min-max) each column and then sum up the columns.
Can someone give me some advice on how to do that in MapReduce?
From an algorithmic perspective you'll need to:
Mapper
Parse / tokenize each input line by its delimiter (space?)
Use a Text object to encapsulate the key field
Either create a custom value class to encapsulate the other fields or use an ArrayWritable wrapper
Output this Key / Value from your Mapper
Reducer
All values will be grouped by the same key, so here you'll just need to process each input value and calculate the min, max and sum for each column
Finally output your result
You might want to look at using Apache Pig which should make this task much easier (untested):
grunt> A = LOAD '/path/to/data.txt' USING PigStorage(' ')
AS (key, fld1:float, fld2:float, fld3:float);
grunt> GRP = GROUP A BY key;
grunt> B = FOREACH GRP GENERATE $0, MIN(A.fld1), MAX(A.fld1), SUM(A.fld1),
MIN(A.fld2), MAX(A.fld2), SUM(A.fld2),
MIN(A.fld3), MAX(A.fld3), SUM(A.fld3);
grunt> STORE B INTO '/path/to/output' USING PigStorage('\t', '-schema');
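If the goal is min-max normalization of each column across all rows, followed by a per-row sum, one possible variation is to compute the column minima and maxima once with GROUP ALL and then use them as scalars. The sketch below is untested and assumes that every column has max > min (otherwise the division has a zero denominator); the aliases G, stats and normalized and the field name total are hypothetical.

```pig
A = LOAD '/path/to/data.txt' USING PigStorage(' ')
    AS (key:chararray, fld1:double, fld2:double, fld3:double);

-- One-row relation holding the min and max of every column.
G     = GROUP A ALL;
stats = FOREACH G GENERATE
            MIN(A.fld1) AS min1, MAX(A.fld1) AS max1,
            MIN(A.fld2) AS min2, MAX(A.fld2) AS max2,
            MIN(A.fld3) AS min3, MAX(A.fld3) AS max3;

-- stats has exactly one row, so its fields can be projected as scalars.
normalized = FOREACH A GENERATE key,
                 (fld1 - stats.min1) / (stats.max1 - stats.min1)
               + (fld2 - stats.min2) / (stats.max2 - stats.min2)
               + (fld3 - stats.min3) / (stats.max3 - stats.min3) AS total;

STORE normalized INTO '/path/to/output' USING PigStorage('\t');
```

In plain MapReduce the same idea would typically take two passes: one job to find the per-column min and max, and a second job that reads those values and emits the normalized sum per key.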

Resources