Sentiment Analysis of twitter data using hadoop and pig - hadoop

Tweets from twitter are stored in hdfs in hadoop.
The tweets need to be processed for sentiment analysis. The tweets in hdfs are in avro format so they need to be processed using Json loader But in pig scripting the tweets from hdfs are not getting read.After changing jar files the pig script is showing failed message
By using these following jar files by pig script is getting failed.
REGISTER '/home/cloudera/Desktop/elephant-bird-hadoop-compat-4.17.jar';
REGISTER '/home/cloudera/Desktop/elephant-bird-pig-4.17.jar';
REGISTER '/home/cloudera/Desktop/json-simple-3.1.0.jar';
These are another set of jar files with which its not failing but data is also not getting read.
REGISTER '/home/cloudera/Desktop/elephant-bird-hadoop-compat-4.17.jar';
REGISTER '/home/cloudera/Desktop/elephant-bird-pig-4.17.jar';
REGISTER '/home/cloudera/Desktop/json-simple-1.1.jar';
Here is all my pig scripting commands i have used:
tweets = LOAD '/user/cloudera/OutputData/tweets' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH tweets GENERATE myMap#'id' as id ,myMap#'tweets' as tweets;
tokens = foreach B generate id, tweets, FLATTEN(TOKENIZE(tweets)) As word;
dictionary = load ' /user/cloudera/OutputData/AFINN.txt' using PigStorage('\t') AS(word:chararray,rating:int);
word_rating = join tokens by word left outer, dictionary by word using 'replicated';
describe word_rating;
rating = foreach word_rating generate tokens::id as id,tokens::tweets as tweets, dictionary::rating as rate;
word_group = group rating by (id,tweets);
avg_rate = foreach word_group generate group, AVG(rating.rate) as tweet_rating;
positive_tweets = filter avg_rate by tweet_rating>=0;
DUMP positive_tweets;
negative_tweets = filter avg_rate by tweet_rating<=0;
DUMP negative_tweets;
Error on dumping above tweets command for the first set of jar files:
Input(s):
Failed to read data from "/user/cloudera/OutputData/tweets"
Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-1614543351/tmp37889715"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1556902124324_0001
2019-05-03 09:59:09,409 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2019-05-03 09:59:09,427 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException
Details at logfile: /home/cloudera/pig_1556902594207.log
Error on dumping above tweets command for the second set of jar files:
Input(s):
Successfully read 0 records (5178477 bytes) from: "/user/cloudera/OutputData/tweets"
Output(s):
Successfully stored 0 records in: "hdfs://quickstart.cloudera:8020/tmp/temp-1614543351/tmp479037703"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1556902124324_0002
2019-05-03 10:01:05,417 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2019-05-03 10:01:05,418 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2019-05-03 10:01:05,418 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2019-05-03 10:01:05,428 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2019-05-03 10:01:05,428 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
Expected output was sorted positive and neative tweets but getting errors.
Please do help. Thank you.

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException This usually indicates a syntax error in the Pig script.
The AS keyword in a LOAD statement usually require a schema. myMap in your LOAD statement is not a valid schema.
See https://stackoverflow.com/a/12829494/8886552 for an example of JsonLoader.

Related

Pig: After join dump is throwing ERROR 1066: Unable to open iterator for alias C

Below is my requirement:
Input:
0104919 ,08476,48528,2016,2016-08-29
00104919 ,08476,48528,2016,2016-09-05
00104919 ,08476,48528,2016,2016-09-12
00104919 ,08476,48528,2017,2016-08-29
Output after join should be:
2,00104919 ,08476,48528,2016,2016-09-05,2016-09-12
3,00104919 ,08476,48528,2016,2016-09-12,2016-08-29
Below is my code:
TABL = LOAD '/TABL/part-r-00000' using PigStorage('~') AS (a,b,c,d,e,f);
pre_Q1 = FOREACH TABL generate a,b,c,d,e;
DIST = DISTINCT pre_Q1;
ORDR = ORDER DIST BY *;
Q1 = rank ORDR;
Q2 = FOREACH Q1 GENERATE rank_ORDR + 1 AS rank_Q2, a, b, c, d, e;
Q_join = join Q2 by (rank_Q2, a, b, c, d), Q1 by (rank_ORDR, a, b, c, d);
C = limit Q_join 100;
dump C;
I am getting the below error.
Can someone point out what must be causing the below error.
Failed Jobs:
JobId Alias Feature Message Outputs
job_1474127474437_528208 C,Q2,Q_join HASH_JOIN Message: Job failed!
Input(s):
Successfully read 5235587 records (1516199217 bytes) from: "/TABL/part-r-00000"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1474127474437_528166 -> job_1474127474437_528185,
job_1474127474437_528185 -> job_1474127474437_528190,
job_1474127474437_528190 -> job_1474127474437_528204,
job_1474127474437_528204 -> job_1474127474437_528206,
job_1474127474437_528206 -> job_1474127474437_528208,
job_1474127474437_528208 -> null,
null
2017-01-04 04:02:37,407 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2017-01-04 04:02:37,569 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2017-01-04 04:02:37,729 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2017-01-04 04:02:37,887 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2017-01-04 04:02:37,945 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
2017-01-04 04:02:37,945 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias C
Details at logfile: /var/log/gphd/pig/pig.log
Try to modify the first line as below :
TABL = LOAD '/TABL/part-r-00000' using PigStorage(',') AS (a,b,c,d,e,f);
And watch out to the space at the end of the column a, it may affect the join !

Use PIG to count the number of records in an avro file

I can open a avro file in HUE and HUE shows me it has 10 records. i can browse through all the 10 records in HUE.
Now I write the following code in PIG
data = LOAD '/user/admin/2015/10/04/02/file1.avro' USING AvroStorage();
data_group = GROUP data ALL;
row_count = FOREACH data_group GENERATE COUNT(data);
dump row_count;
The output of the job is
Input(s):
Successfully read 4 records (58507 bytes) from: "/user/admin/2015/10/04/02/file1.avro"
Output(s):
Successfully stored 1 records (6 bytes) in: "hdfs://nn1/tmp/temp-268177355/tmp915757783"
Counters:
Total records written : 1
Total bytes written : 6
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1438959478020_940907
2015-10-29 19:08:55,252 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-10-29 19:08:55,252 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-10-29 19:08:55,253 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-10-29 19:08:55,261 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-10-29 19:08:55,261 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(4)
How did 10 become 4. Is there a different way to count the number of records in an avro file using PIG?

Hive will not write to aws s3

I have an external table in hive that is stored on my hadoop cluster and I want to move its contents into an external table that is stored on Amazon s3.
So I created an s3 backed table like so:
CREATE EXTERNAL TABLE IF NOT EXISTS export.export_table
like table_to_be_exported
ROW FORMAT SERDE ...
with SERDEPROPERTIES ('fieldDelimiter'='|')
STORED AS TEXTFILE
LOCATION 's3a://bucket/folder';
Then I run: INSERT INTO export.export_table SELECT * FROM table_to_be_exported
It outputs the following
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : Starting Job = job_1435176004514_0028, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1435176004514_0028/
INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1435176004514_0028
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2015-07-06 09:22:18,379 Stage-1 map = 0%, reduce = 0%
INFO : 2015-07-06 09:22:27,795 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.9 sec
INFO : MapReduce Total cumulative CPU time: 2 seconds 900 msec
INFO : Ended Job = job_1435176004514_0028
INFO : Stage-4 is selected by condition resolver.
INFO : Stage-3 is filtered out by condition resolver.
INFO : Stage-5 is filtered out by condition resolver.
INFO : Moving data to: s3a://bucket/folder/.hive-staging_hive_2015-07-06_09-22-10_351_9216807769834089982-3/-ext-10000 from s3a://bucket/folder/.hive-staging_hive_2015-07-06_09-22-10_351_9216807769834089982-3/-ext-10002
ERROR : Failed with exception Wrong FS: s3a://bucket/folder/.hive-staging_hive_2015-07-06_09-22-10_351_9216807769834089982-3/-ext-10002, expected: hdfs://quickstart.cloudera:8020
java.lang.IllegalArgumentException: Wrong FS: s3a://bucket/folder/.hive-staging_hive_2015-07-06_09-22-10_351_9216807769834089982-3/-ext-10002, expected: hdfs://quickstart.cloudera:8020
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
at org.apache.hadoop.hdfs.DistributedFileSystem.getEZForPath(DistributedFileSystem.java:1916)
at org.apache.hadoop.hdfs.client.HdfsAdmin.getEncryptionZoneForPath(HdfsAdmin.java:262)
at org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.isPathEncrypted(Hadoop23Shims.java:1187)
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2449)
at org.apache.hadoop.hive.ql.exec.MoveTask.moveFile(MoveTask.java:105)
at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:222)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1638)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1397)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1181)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1047)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1042)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:145)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:70)
at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:209)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask (state=08S01,code=1)
I have s3a key and secret set in my hadoop core-site.xml and am able to do reads and writes from s3 using hadoop directly hdfs dfs -ls s3a://.
Any guesses as to what I could do to get this to work?
Try using s3 instead of s3a, my guess is that s3a is not supported yet in EMR's Hive distribution.

Pig Dump command throwing error

Unable to fetch data form join.
Data:
Jorge Posada |Yankees| {(Catcher,2000),(Designated_hitter,2001)}|[games#1594,hit_by_pitch#65,grand_slams#7]
Landon Powell |Oakland|{(Catcher,2000),(First_baseman,2001)}|[on_base_percentage#0.297,games#26,home_runs#7]
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003),(Left_fielder,2001)}|[games#258,hit_by_pitch#3]
**Code:**
bfile= LOAD 'basketball1.txt' using PigStorage('|') as (name:chararray,team:chararray,pos:bag{t:tuple(point:chararray,year:int)},bat:map[]);
bfile1= foreach bfile generate name,pos.year as year;
bfile2= foreach bfile1 generate name,flatten(year) as play_year ;
bfile3= group bfile2 by play_year;
bfile4= foreach bfile3 generate group,COUNT($1) as count;
bfile5= foreach bfile generate flatten(pos.year) as year,bat#'games' as games_cnt;
bfile6= group bfile5 by year;
bjoin= join bfile3 by group ,bfile6 by group;
bjoin1= foreach bjoin generate bfile3.group,bfile3::bfile2.name as name,
bfile6::bfile5.games_cnt as tot_games;
**Describe bjoin1:**
bjoin: {bfile3::group: int,bfile3::bfile2: {(name: chararray,play_year: int)},
bfile6::group: int,bfile6::bfile5: {(year: int,games_cnt: bytearray)}}
While doing dump bjoin1 I face the following issue:
2014-11-15 07:31:42,318 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
2014-11-15 07:31:42,321 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias bjoin1
Details at logfile: /home/cloudera/pig_1416065344409.log
grunt> 2014-11-15 07:31:47,857 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce

Pig - Replicated Join

I have two input files
Student file :
abc 30 4.5
xyz 34 9.5
def 28 6.5
klm 35 10.5
Location file :
abc hawthorne
xyz artesia
def garnet
klm vanness
My desired ouput
abc hawthorne
xyz artesia
def garnet
klm vanness
To achieve this, I wrote the following pig program.
A = LOAD '/user/hive/warehouse/students.txt' USING PigStorage(' ') AS (NAME:CHARARRAY,AGE:INT,GPA:FLOAT);
B = LOAD '/user/hive/warehouse/location.txt.txt' using PigStorage(' ') AS (NAME:CHARARRAY,LOCATION:CHARARRAY);
C = JOIN A BY NAME , B BY LOCATION USING 'replicated';
DUMP C;
The trouble is that I dont see any output message. On top of that, I see the following warnings while execution :
2014-01-22 15:18:15,829 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 2 time(s).
2014-01-22 15:18:15,829 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 2 time(s).
2014-01-22 15:18:15,829 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-01-22 15:18:15,829 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-01-22 15:18:15,832 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2014-01-22 15:18:15,832 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2014-01-22 15:18:15,841 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-01-22 15:18:15,841 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-01-22 15:18:15,841 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
Hadoop Job IDs executed by Pig: job_201401210934_0082,job_201401210934_0083
i feel you are not seeing any output because join is not leading to any match.
You are creating a join on NAME from A (abc, xyz, def, klm) & LOCATION from B (hawthorne, artesia, garnet, vanness) and if you see there are no matching strings in two data sets, so leading to no join.

Resources