Pig: DUMP after JOIN throws ERROR 1066: Unable to open iterator for alias C

Below is my requirement:
Input:
0104919 ,08476,48528,2016,2016-08-29
00104919 ,08476,48528,2016,2016-09-05
00104919 ,08476,48528,2016,2016-09-12
00104919 ,08476,48528,2017,2016-08-29
Output after join should be:
2,00104919 ,08476,48528,2016,2016-09-05,2016-09-12
3,00104919 ,08476,48528,2016,2016-09-12,2016-08-29
Below is my code:
TABL = LOAD '/TABL/part-r-00000' using PigStorage('~') AS (a,b,c,d,e,f);
pre_Q1 = FOREACH TABL generate a,b,c,d,e;
DIST = DISTINCT pre_Q1;
ORDR = ORDER DIST BY *;
Q1 = rank ORDR;
Q2 = FOREACH Q1 GENERATE rank_ORDR + 1 AS rank_Q2, a, b, c, d, e;
Q_join = join Q2 by (rank_Q2, a, b, c, d), Q1 by (rank_ORDR, a, b, c, d);
C = limit Q_join 100;
dump C;
I am getting the error below.
Can someone point out what is causing it?
Failed Jobs:
JobId Alias Feature Message Outputs
job_1474127474437_528208 C,Q2,Q_join HASH_JOIN Message: Job failed!
Input(s):
Successfully read 5235587 records (1516199217 bytes) from: "/TABL/part-r-00000"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1474127474437_528166 -> job_1474127474437_528185,
job_1474127474437_528185 -> job_1474127474437_528190,
job_1474127474437_528190 -> job_1474127474437_528204,
job_1474127474437_528204 -> job_1474127474437_528206,
job_1474127474437_528206 -> job_1474127474437_528208,
job_1474127474437_528208 -> null,
null
2017-01-04 04:02:37,407 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2017-01-04 04:02:37,569 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2017-01-04 04:02:37,729 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2017-01-04 04:02:37,887 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2017-01-04 04:02:37,945 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
2017-01-04 04:02:37,945 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias C
Details at logfile: /var/log/gphd/pig/pig.log

Your sample input is comma-delimited, but the LOAD statement splits on '~'. Try changing the first line as below:
TABL = LOAD '/TABL/part-r-00000' using PigStorage(',') AS (a,b,c,d,e,f);
Also watch out for the space at the end of column a; it may affect the join.
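To illustrate both points, here is a minimal Python sketch (not Pig). It assumes the loader simply splits each line on the delimiter character; `split_fields` is a hypothetical helper, not part of any Pig API. The input line is taken from the question.

```python
def split_fields(line, delimiter):
    """Model of delimiter-based field splitting (an assumption about the loader)."""
    return line.split(delimiter)


# One input line copied from the question, including the space after the first value.
line = "00104919 ,08476,48528,2016,2016-09-05"

tilde_fields = split_fields(line, "~")   # delimiter used in the original script
comma_fields = split_fields(line, ",")   # delimiter that matches the data

# With '~' the whole line lands in the first field, so the other columns are empty.
print(len(tilde_fields))
# With ',' the line splits into the five expected columns.
print(len(comma_fields))

# The trailing space is part of the key, so it does not equal the trimmed value.
print(comma_fields[0] == "00104919")
print(comma_fields[0].strip() == "00104919")
```

In Pig itself, the built-in TRIM function can be applied to column a before the join if the trailing space turns out to matter.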

Related

Sentiment Analysis of twitter data using hadoop and pig

Tweets from Twitter are stored in HDFS in Hadoop.
The tweets need to be processed for sentiment analysis. The tweets in HDFS are in Avro format, so they need to be processed using the JSON loader, but the Pig script is not reading the tweets from HDFS. After changing the JAR files, the Pig script shows a failure message.
With the following JAR files, the Pig script fails:
REGISTER '/home/cloudera/Desktop/elephant-bird-hadoop-compat-4.17.jar';
REGISTER '/home/cloudera/Desktop/elephant-bird-pig-4.17.jar';
REGISTER '/home/cloudera/Desktop/json-simple-3.1.0.jar';
With this other set of JAR files the script does not fail, but no data is read either:
REGISTER '/home/cloudera/Desktop/elephant-bird-hadoop-compat-4.17.jar';
REGISTER '/home/cloudera/Desktop/elephant-bird-pig-4.17.jar';
REGISTER '/home/cloudera/Desktop/json-simple-1.1.jar';
Here are all the Pig commands I used:
tweets = LOAD '/user/cloudera/OutputData/tweets' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH tweets GENERATE myMap#'id' as id ,myMap#'tweets' as tweets;
tokens = foreach B generate id, tweets, FLATTEN(TOKENIZE(tweets)) As word;
dictionary = load ' /user/cloudera/OutputData/AFINN.txt' using PigStorage('\t') AS(word:chararray,rating:int);
word_rating = join tokens by word left outer, dictionary by word using 'replicated';
describe word_rating;
rating = foreach word_rating generate tokens::id as id,tokens::tweets as tweets, dictionary::rating as rate;
word_group = group rating by (id,tweets);
avg_rate = foreach word_group generate group, AVG(rating.rate) as tweet_rating;
positive_tweets = filter avg_rate by tweet_rating>=0;
DUMP positive_tweets;
negative_tweets = filter avg_rate by tweet_rating<=0;
DUMP negative_tweets;
Error when running the DUMP commands above with the first set of JAR files:
Input(s):
Failed to read data from "/user/cloudera/OutputData/tweets"
Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-1614543351/tmp37889715"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1556902124324_0001
2019-05-03 09:59:09,409 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2019-05-03 09:59:09,427 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException
Details at logfile: /home/cloudera/pig_1556902594207.log
Error when running the DUMP commands above with the second set of JAR files:
Input(s):
Successfully read 0 records (5178477 bytes) from: "/user/cloudera/OutputData/tweets"
Output(s):
Successfully stored 0 records in: "hdfs://quickstart.cloudera:8020/tmp/temp-1614543351/tmp479037703"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1556902124324_0002
2019-05-03 10:01:05,417 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2019-05-03 10:01:05,418 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2019-05-03 10:01:05,418 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2019-05-03 10:01:05,428 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2019-05-03 10:01:05,428 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
The expected output was the sorted positive and negative tweets, but I am getting errors.
Please help. Thank you.
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException
This usually indicates a syntax error in the Pig script.
The AS keyword in a LOAD statement usually requires a schema. myMap in your LOAD statement is not a valid schema.
See https://stackoverflow.com/a/12829494/8886552 for an example of JsonLoader.

Unable to perform sum operation in pig

I am trying to perform a sum operation on my data in Pig, but it is not accepting the explicit type cast. I have also tried replacing (int) with (double) while performing the sum.
Code
drivers = LOAD '/sachin/drivers.csv' USING PigStorage(',');
time = LOAD '/sachin/timesheet.csv' USING PigStorage(',');
drivdata = FILTER drivers BY $0>1;
timedata = filter time by $0>0;
drivgrp = group timedata by $0;
drivinfo = foreach drivgrp generate group as id , SUM(timedata.$2) as totalhr , SUM(timedata.$3) as totmillogged;
drivfinal = foreach drivdata generate $0 as id , $1 as name;
result = join drivfinal by id , drivinfo by id;
finalres = foreach result generate $0 as id, $1 as name, $3 as hrslogged, $4 as mileslogged;
summile = foreach finalres generate (int)SUM(mileslogged);
DUMP summile;
Error Message
grunt> exec /home/sachin/sec.pig
2017-12-13 21:57:58,812 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 1 time(s).
2017-12-13 21:57:58,854 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s).
2017-12-13 21:57:58,996 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s).
2017-12-13 21:57:59,036 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s).
2017-12-13 21:57:59,080 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s).
2017-12-13 21:57:59,121 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s).
2017-12-13 21:57:59,192 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_INT 2 time(s).
2017-12-13 21:57:59,246 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1045: <line 10, column 41> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
Details at logfile: /home/sachin/pig_1513175202309.log
grunt>
What I am actually trying to do is, for each driver in the top 5 list, find the miles logged and the percentage of the total miles logged that the driver accounts for, and store the result in HDFS.
Link for dataset: https://raw.githubusercontent.com/hortonworks/data-tutorials/master/tutorials/hdp/how-to-process-data-with-apache-pig/assets/driver_data.zip
Can anyone help me solve this problem or understand what is going wrong here?
You have to cast mileslogged and then call the SUM function:
finalres = foreach result generate $0 as id, $1 as name, $3 as hrslogged, (int)$4 as mileslogged;
summile = foreach finalres generate SUM(mileslogged);
Also, I noticed that you are not specifying the datatype in the LOAD statement. The default datatype is bytearray, and I suspect you will get the correct result if you don't explicitly cast the fields in the subsequent steps.
From http://pig.apache.org/docs/r0.17.0/func.html#sum, SUM is defined as:
Computes the sum of the numeric values in a single-column bag. SUM requires a preceding GROUP ALL statement for global sums and a GROUP BY statement for group sums.
Your code is passing a double, whereas SUM requires a bag containing doubles. There is no need to typecast, but you need to group before calling the SUM function:
allres = group finalres ALL;
summile = foreach allres generate SUM(finalres.mileslogged);
DUMP summile;
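The idea behind GROUP ALL followed by SUM can be sketched in Python (not Pig). The rows below are hypothetical values invented for illustration, not taken from the linked dataset, and `group_all` and `sum_column` are hypothetical helpers that only model the behaviour described above. The last step also sketches the per-driver percentage the question asks about.

```python
# Hypothetical rows shaped like finalres: (id, name, hrslogged, mileslogged).
finalres = [
    (10, "driver_a", 100, 5000),
    (11, "driver_b", 50, 2500),
    (12, "driver_c", 50, 2500),
]


def group_all(rows):
    """Model of GROUP ... ALL: one group whose bag holds every row."""
    return [("all", list(rows))]


def sum_column(bag, index):
    """Model of SUM over a single column of a bag."""
    return sum(row[index] for row in bag)


allres = group_all(finalres)
# One output value per group; GROUP ALL gives exactly one group.
summile = [sum_column(bag, 3) for _, bag in allres]
total_miles = summile[0]

# Share of the total for each driver, as a percentage.
percentages = [(name, miles, 100.0 * miles / total_miles)
               for _, name, _, miles in finalres]

print(summile)
print(percentages)
```

The point of the sketch is that the sum is computed over a collection of rows (the bag produced by grouping), not over a single scalar field, which is why the original call on mileslogged alone could not be matched to SUM.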

Use PIG to count the number of records in an avro file

I can open an Avro file in HUE, and HUE shows me that it has 10 records. I can browse through all 10 records in HUE.
Now I write the following code in Pig:
data = LOAD '/user/admin/2015/10/04/02/file1.avro' USING AvroStorage();
data_group = GROUP data ALL;
row_count = FOREACH data_group GENERATE COUNT(data);
dump row_count;
The output of the job is
Input(s):
Successfully read 4 records (58507 bytes) from: "/user/admin/2015/10/04/02/file1.avro"
Output(s):
Successfully stored 1 records (6 bytes) in: "hdfs://nn1/tmp/temp-268177355/tmp915757783"
Counters:
Total records written : 1
Total bytes written : 6
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1438959478020_940907
2015-10-29 19:08:55,252 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-10-29 19:08:55,252 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-10-29 19:08:55,253 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-10-29 19:08:55,261 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-10-29 19:08:55,261 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(4)
How did 10 become 4? Is there a different way to count the number of records in an Avro file using Pig?

Pig Dump command throwing error

Unable to fetch data from a join.
Data:
Jorge Posada |Yankees| {(Catcher,2000),(Designated_hitter,2001)}|[games#1594,hit_by_pitch#65,grand_slams#7]
Landon Powell |Oakland|{(Catcher,2000),(First_baseman,2001)}|[on_base_percentage#0.297,games#26,home_runs#7]
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003),(Left_fielder,2001)}|[games#258,hit_by_pitch#3]
**Code:**
bfile= LOAD 'basketball1.txt' using PigStorage('|') as (name:chararray,team:chararray,pos:bag{t:tuple(point:chararray,year:int)},bat:map[]);
bfile1= foreach bfile generate name,pos.year as year;
bfile2= foreach bfile1 generate name,flatten(year) as play_year ;
bfile3= group bfile2 by play_year;
bfile4= foreach bfile3 generate group,COUNT($1) as count;
bfile5= foreach bfile generate flatten(pos.year) as year,bat#'games' as games_cnt;
bfile6= group bfile5 by year;
bjoin= join bfile3 by group ,bfile6 by group;
bjoin1= foreach bjoin generate bfile3.group,bfile3::bfile2.name as name,
bfile6::bfile5.games_cnt as tot_games;
**Describe bjoin1:**
bjoin: {bfile3::group: int,bfile3::bfile2: {(name: chararray,play_year: int)},
bfile6::group: int,bfile6::bfile5: {(year: int,games_cnt: bytearray)}}
While doing dump bjoin1 I face the following issue:
2014-11-15 07:31:42,318 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
2014-11-15 07:31:42,321 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias bjoin1
Details at logfile: /home/cloudera/pig_1416065344409.log
grunt> 2014-11-15 07:31:47,857 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce

Pig - Replicated Join

I have two input files
Student file :
abc 30 4.5
xyz 34 9.5
def 28 6.5
klm 35 10.5
Location file :
abc hawthorne
xyz artesia
def garnet
klm vanness
My desired output:
abc hawthorne
xyz artesia
def garnet
klm vanness
To achieve this, I wrote the following pig program.
A = LOAD '/user/hive/warehouse/students.txt' USING PigStorage(' ') AS (NAME:CHARARRAY,AGE:INT,GPA:FLOAT);
B = LOAD '/user/hive/warehouse/location.txt.txt' using PigStorage(' ') AS (NAME:CHARARRAY,LOCATION:CHARARRAY);
C = JOIN A BY NAME , B BY LOCATION USING 'replicated';
DUMP C;
The trouble is that I don't see any output. On top of that, I see the following warnings during execution:
2014-01-22 15:18:15,829 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 2 time(s).
2014-01-22 15:18:15,829 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 2 time(s).
2014-01-22 15:18:15,829 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-01-22 15:18:15,829 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-01-22 15:18:15,832 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2014-01-22 15:18:15,832 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2014-01-22 15:18:15,841 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-01-22 15:18:15,841 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-01-22 15:18:15,841 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
Hadoop Job IDs executed by Pig: job_201401210934_0082,job_201401210934_0083
I think you are not seeing any output because the join does not produce any matches.
You are joining NAME from A (abc, xyz, def, klm) with LOCATION from B (hawthorne, artesia, garnet, vanness). No string appears in both sets, so the join returns no rows.
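A minimal Python sketch (not Pig) of the same join, using the data from the question. `inner_join` is a hypothetical helper; it builds an in-memory index of the second relation, which is roughly the idea behind a replicated join, though it makes no claim about how Pig implements it.

```python
# Data copied from the question.
students = [("abc", 30, 4.5), ("xyz", 34, 9.5), ("def", 28, 6.5), ("klm", 35, 10.5)]
locations = [("abc", "hawthorne"), ("xyz", "artesia"),
             ("def", "garnet"), ("klm", "vanness")]


def inner_join(left, left_key, right, right_key):
    """Inner join on one column of each relation, indexing the right side in memory."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]


# Join as written in the question: A.NAME against B.LOCATION (column 1).
name_vs_location = inner_join(students, 0, locations, 1)
# Join on the column the two files actually share: NAME (column 0) on both sides.
name_vs_name = inner_join(students, 0, locations, 0)

print(len(name_vs_location))
print(len(name_vs_name))
print([(row[0], row[4]) for row in name_vs_name])
```

Joining on NAME from both relations gives one row per student, and projecting the name and location columns yields the desired output listed in the question.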
