How to count on two columns of group by items in Pig (Hadoop)

I have generated two columns (origin and destination) out of 'n' columns. Now I want to generate a count for each combination of these two columns, but I am not able to get the result. I am getting this error: ERROR 1070: Could not resolve Count using imports:
Below is my script,
mydata = load '/Projects/Flightdata/1987/Rawdata' using PigStorage(',') as (year:int, month:int, dom:int, dow:int, deptime:long, crsdeptime:long, arrtime:long, crsarrtime:long, uniqcarcode:chararray, flightnum:long, tailnum:chararray, actelaptime:long, crselaptime:long, airtime:long, arrdeltime:long, depdeltime:long, origcode:chararray, destcode:chararray, dist:long, taxintime:long, taxiouttime:long, flightcancl:int, canclcode:chararray, diverted:int, carrierdel:long, weatherdel:long, nasdel:long, securitydel:long, lateaircraftdel:long);
Step2 = foreach mydata generate origcode, destcode;
grpby = group Step2 by (origcode, destcode) ;
step3 = foreach grpby generate group.origcode as source, group.destcode as destination, Count(step2);
Here I want to generate the count for each combination of origin and destination.
Any guidance will be helpful.

Please see the Pig documentation about case sensitivity: the names of Pig Latin functions are case sensitive.
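The fix is therefore to write COUNT in uppercase; note that relation aliases are case sensitive too (Step2, not step2). The corrected final line would be:
step3 = foreach grpby generate group.origcode as source, group.destcode as destination, COUNT(Step2);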

Related

PIG: scalar has more than one row in the output

I have the following Pig code, in which I check the fields (srcgt and destgt) of the relation record, loaded from the main files, against the values listed in another file (intlgt.txt) containing 338, 918299, 181, 238, but it throws the error mentioned below. Can you please suggest how to overcome this on Apache Pig version 0.15.0 (r1682971)?
Pig code:
record = LOAD '/u02/20160201*.SMS' USING PigStorage('|','-tagFile') ;
intlgtrec = LOAD '/u02/config/intlgt.txt' ;
intlgt = foreach intlgtrec generate $0 as intlgt;
cdrfilter = foreach record generate (chararray) $1 as aparty, (chararray) $2 as bparty,(chararray) $3 as dt,(chararray)$4 as timestamp,(chararray) $29 as status,(chararray) $26 as srcgt,(chararray) $27 as destgt,(chararray)$0 as cdrfname ,(chararray) $13 as prepost;
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) ) ;
Error is:
WARN org.apache.hadoop.mapred.LocalJobRunner - job_local1939982195_0002
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (338), 2nd :(918299) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar")
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
When you are using
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) );
Pig is looking for a scalar, be it a number or a chararray, but a single value. So Pig assumes your intlgt::intlgt is a relation with exactly one row, e.g. the result of
intlgt = foreach (group intlgtrec all) generate COUNT_STAR(intlgtrec.$0)
(this would generate a single row holding the count of records in the original relation)
In your case, intlgt contains more than one row, since you have not done any grouping on it.
Based on your code, you're trying to look for SMS messages that had an intlgt on either end. Possible solutions:
If your intlgt entries all have the same length (e.g. 3), then generate SUBSTRING(srcgt, 1, 3) as srcgtshort and JOIN intlgt::intlgt with record::srcgtshort. This will give you the records where srcgt begins with a value from intlgt. Then repeat this for destgt.
If they come in a small number of lengths (e.g. some entries have length 3, some length 4, and some length 5), you can do the same thing, but it is more laborious, as a separate field is required for each length.
If the number of rows in the two relations is not too big, do a CROSS between them, which creates all possible combinations of rows from record and rows from intlgt. Then you can filter by STARTSWITH(srcgt, intlgt::intlgt), because the two of them are now fields in the same relation (see the sketch below). Beware of this approach, as the number of records can get HUGE!
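A minimal sketch of the CROSS approach, reusing the aliases from the question (assuming both relations are small enough to cross):
-- pair every CDR with every intlgt prefix, then keep only the matching pairs
crossed = CROSS cdrfilter, intlgt;
intlcdrs = FILTER crossed BY STARTSWITH(cdrfilter::srcgt, intlgt::intlgt) OR STARTSWITH(cdrfilter::destgt, intlgt::intlgt);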

Pig Latin join by field

I have a Pig Latin related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
And I would like to get those two datasets joined: take the corresponding value from dataset B and place it beside each value from dataset A. The expected output is below:
FITKA 0.124411, FINVA 0.454535 and so on ..
(They can also be like: FITKA, 0.124411, FINVA, 0.454535 and so on ..)
And then I would be able to multiply the values (0.124411 x 0.454535 .. and so on), because they are on the same row now, and this is what I want.
Of course I can join column by column, but then the values appear at the end of the row and I have to clean that up with another foreach ... generate. I want a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (a sentence, in a way).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple containing a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})
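A sketch of how the join could then work, assuming A is restructured so that a relation A_restructured (a hypothetical alias) carries a field words:bag{t:(word:chararray, count:int)}:
-- flatten the bag so each word becomes its own row, then a single JOIN lines up the values
words_a = FOREACH A_restructured GENERATE FLATTEN(words) AS (word:chararray, count:int);
joined = JOIN words_a BY word, B BY f1;
result = FOREACH joined GENERATE words_a::word, B::f2;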

Pig Latin using two data sources in one FILTER statement

In my Pig script, I am reading data from more than 5 data sources (Hive tables), where one is the main data source and the rest are dimension tables. I am trying to filter the main relation (or alias) by a value in one of the dimension relations.
E.g.
-- main_data is main data source and dept_data is department data
filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;
In my Pig script there are at least 20 places where I need to match a value between multiple data sources and produce a new relation. But I am getting an error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1.
Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... )
Details at logfile: /root/pig_1403263965493.log
I tried the "relation::field" approach also, to no avail. Alternatively, I am joining these two relations (data sources) to get the filtered data, but I feel this will slow down execution and unnecessarily shuffle huge amounts of data.
Please guide me on how to use two or more data sources in one FILTER statement, something like the following in SQL, so that I can avoid JOIN statements and get it done from the FILTER statement itself:
Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID
If you want to match records from different tables by a single ID, you pretty much have to use a join (see the sketch below), as such:
Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID
If you just want to keep the records that occur in all the other tables, you could probably go for an intersection and then a
FILTER BY someID IN someIDList
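A hedged sketch of the join-based approach, using hypothetical aliases A (main data) and B, C, D (dimension tables) with the field names from the question:
-- join on deptID first, then re-project before each further join to keep field names simple
ab = JOIN A BY deptID, B BY departmentID;
ab_p = FOREACH ab GENERATE A::deptID AS deptID, A::sectionID AS sectionID, A::cityID AS cityID;
abc = JOIN ab_p BY sectionID, C BY sectionID;
abc_p = FOREACH abc GENERATE ab_p::deptID AS deptID, ab_p::cityID AS cityID;
abcd = JOIN abc_p BY cityID, D BY cityID;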

Pig throws error for simple Group by and count occurrence task

Using Hadoop's Pig Latin to find the number of occurrences of unique search strings in a search engine log file.
Please help me out. Thanks in advance.
Pig script
excitelog = load '/user/hadoop/input/excite-small.log' using PigStorage() AS
(encryptcode:chararray, numericid:int, searchstring:chararray);
GroupBySearchString = GROUP excitelog by searchstring;
searchStrFrq = foreach GroupBySearchString Generate group as searchstring,count(searchstring);
Error encountered
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve count using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
You need to do:
searchStrFrq = foreach GroupBySearchString Generate group as searchstring,
COUNT(excitelog) as kount;
This is because of the way grouping works in Pig: GroupBySearchString is a relation of {group, excitelog} tuples, where excitelog is itself a bag of all the tuples matching the group. COUNT is a UDF that takes a bag as input and returns the number of tuples in the bag. So COUNT(excitelog) gives you the number of tuples matching each group.
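You can confirm that structure with DESCRIBE; given the schema in the question, the output should look roughly like this (a sketch, not verbatim output):
grunt> DESCRIBE GroupBySearchString;
GroupBySearchString: {group: chararray,excitelog: {(encryptcode: chararray,numericid: int,searchstring: chararray)}}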
Function names like PigStorage and COUNT are case sensitive, so you need to write the COUNT function as below:
wordcount = FOREACH grouped GENERATE group , COUNT(words);

Pig: how to count the number of rows in an alias

I did something like this to count the number of rows in an alias in Pig:
logs = LOAD 'log';
logs_w_one = foreach logs generate 1 as one;
logs_group = group logs_w_one all;
logs_count = foreach logs_group generate SUM(logs_w_one.one);
dump logs_count;
This seems to be too inefficient. Please enlighten me if there is a better way!
COUNT is part of Pig; see the manual:
LOGS = LOAD 'log';
LOGS_GROUP = GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
Arnon Rotem-Gal-Oz already answered this question a while ago, but I thought some may like this slightly more concise version.
LOGS = LOAD 'log';
LOG_COUNT = FOREACH (GROUP LOGS ALL) GENERATE COUNT(LOGS);
Be careful: with COUNT, the first field of each tuple in the bag must not be null, otherwise that tuple is not counted. Use the function COUNT_STAR to count all rows regardless of nulls.
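A small illustration of the difference (the schema here is hypothetical):
-- assuming 'log' has a nullable first field
data = LOAD 'log' AS (id:chararray);
g = GROUP data ALL;
counts = FOREACH g GENERATE COUNT(data) AS non_null_rows, COUNT_STAR(data) AS all_rows;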
Basic counting is done as stated in the other answers and in the Pig documentation:
logs = LOAD 'log';
all_logs_in_a_bag = GROUP logs ALL;
log_count = FOREACH all_logs_in_a_bag GENERATE COUNT(logs);
dump log_count;
You are right that counting is inefficient, even with Pig's builtin COUNT, because it funnels everything through one reducer. However, I had a revelation today: one of the ways to speed it up is to reduce the RAM utilization of the relation we're counting.
In other words, when counting a relation we don't actually care about the data itself, so let's use as little RAM as possible. You were on the right track with your first iteration of the count script.
logs = LOAD 'log';
ones = FOREACH logs GENERATE 1 AS one:int;
counter_group = GROUP ones ALL;
log_count = FOREACH counter_group GENERATE COUNT(ones);
dump log_count;
This will work on much larger relations than the previous script and should be much faster. The main difference between this and your original script is that we don't need to sum anything.
This also doesn't have the problem other solutions have, where null values affect the count: it counts all the rows, regardless of whether the first column is null.
Use COUNT_STAR:
LOGS = LOAD 'log';
LOGS_GROUP = GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT_STAR(LOGS);
Here is a version with an optimization.
All the solutions above require Pig to read and write full tuples while counting; the script below writes out only 1s:
DEFINE row_count(inBag, name) RETURNS result {
    X = FOREACH $inBag GENERATE 1;
    $result = FOREACH (GROUP X ALL PARALLEL 1) GENERATE '$name', COUNT(X);
};
Then use it like:
xxx = row_count(rows, 'rows_count');
What you want is to count all the lines in a relation (a dataset, in Pig Latin terms).
That is very easy following these steps:
logs = LOAD 'log'; -- relation called logs, using PigStorage with tab as the field delimiter
logs_grouped = GROUP logs ALL; -- gives a relation with one row, with logs as a bag
number = FOREACH logs_grouped GENERATE COUNT_STAR(logs); -- show me the number
I have to say Kevin's point is important: using COUNT instead of COUNT_STAR, we would get only the number of lines whose first field is not null.
I also like Jerome's one-line syntax, which is more concise, but to be didactic I prefer to divide it in two and add some comments.
In general I prefer:
numerito = FOREACH (GROUP CARGADOS3 ALL) GENERATE COUNT_STAR(CARGADOS3);
over
name = GROUP CARGADOS3 ALL;
number = FOREACH name GENERATE COUNT_STAR(CARGADOS3);
