PIG:Twitter Sentiment Analysis - hadoop

I am trying to implement the twitter sentiment analysis.I need to get all positive tweets and negative tweets and store them in particular text files.
sample.json
{"id": 252479809098223616, "created_at": "Wed Apr 12 08:23:20 +0000 2016", "text": "google is a good company", "user_id": 450990391}{"id": 252479809098223616, "created_at": "Wed Apr 12 08:23:20 +0000 2016", "text": "facebook is a bad company","user_id": 450990391}
dictionary.text having all the positive and negetive words list
weaksubj 1 bad adj n negative
strongsubj 1 good adj n positive
Pig Script:-
tweets = load 'new.json' using JsonLoader('id:chararray,text:chararray,user_id:chararray,created_at:chararray');
dictionary = load 'dictionary.text' AS (type:chararray,length:chararray,word:chararray,pos:chararray,stemmed:chararray,polarity:chararray);
words = foreach tweets generate FLATTEN( TOKENIZE(text) ) AS word,id,text,user_id,created_at;
sentiment = join words by word left outer, dictionary by word;
senti2 = foreach sentiment generate words::id as id,words::created_at as created_at,words::text as text,words::user_id as user_id,dictionary::polarity as polarity;
res = FILTER senti2 BY polarity MATCHES '.*possitive.*';
describe res:-
res: {id: chararray,created_at: chararray,text: chararray,user_id: chararray,polarity: chararray}
But when I dump res I dont see any output, but it executes fine without any errors.
What is the mistake that I am doing here.
Please suggest me.
Mohan.V

I see 2 errors here
1 : line 2 - When you DUMP dictionary , you will see all the records
in column 1 with rest of columns showing empty.
Solution : Specify an appropriate delimiter using PigStorage();
dictionary = load 'dictionary.text' AS (type:chararray,length:chararray,word:chararray,pos:chararray,stemmed:chararray,polarity:chararray);
DUMP dictionary;
(weaksubj 1 bad adj n negative,,,,,)
(strongsubj 1 good adj n positive,,,,,)
Second error :
line 6 : Correct the spelling of positive ! use something like
res = FILTER senti2 BY UPPER(polarity) MATCHES '.*POSITIVE.*';

I see spelling mistake in:
res = FILTER senti2 BY polarity MATCHES '.*possitive.*';
Isn't it '.*positive.*' ?

As per my recommendations you should use custom UDF's for solving your problem . Now you can use elephant-bird-pig-4.1.jar,json-simple-1.1.1.jar .
Also if you wanted to look at example for these then you can use these Sentiment Analysis Tutorial .
If you wanted code then you can refer these code and format your code according to tutorial and my code ,
REGISTER ‘/usr/local/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/ usr/local /elephant-bird-pig-4.1.jar';
REGISTER '/ usr/local /json-simple-1.1.1.jar’;
load_tweets = LOAD '/user/new.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
extract_details = FOREACH load_tweets GENERATE myMap#'id' as id,myMap#'text' as text;
tokens = foreach extract_details generate id,text, FLATTEN(TOKENIZE(text)) As word;
dictionary = load '/user/dictionary.text' AS (type:chararray,length:chararray,word:chararray,pos:chararray,stemmed:chararray,polarity:chararray);
word_rating = join tokens by word left outer, dictionary by word using 'replicated’; describe word_rating;
rating = foreach word_rating generate tokens::id as id,tokens::text as text, dictionary::rating as rate;
word_group = group rating by (id,text);
avg_rate = foreach word_group generate group, AVG(rating.rate) as tweet_rating;
positive_tweets = filter avg_rate by tweet_rating>=0;

Related

PIG: Filter a string on a basis of a phrase

I was wondering if it is possible yo filter a string on the basis of the phrase? For example,I want to count number of times when ps3(ps 3) appears in the query. I am not sure how not to use exact match with the filter condition for "ps 3" as do not know how to put a tab inside of it. My code so far is:
data = LOAD '/user/cloudera/' using PigStorage(',') as (text:chararray);
filtered_data = FILTER data BY (text matches '.*ps3.*') OR (text == 'ps 3');
Res = FOREACH (GROUP filtered_data ALL) GENERATE COUNT(filtered_data);
DUMP Res;
So obviously code fails to count queries like "ps 3 today". Is there is a way to handle this?
Try this -
A = LOAD 'input.csv' USING PigStorage(',') AS (text:chararray);
B = FILTER A BY (LOWER(text) MATCHES '.*ps 3.*' OR LOWER(text) MATCHES '.*ps3.*');
DUMP B Output :
(ps 3 today)
(ps 3)
(ps3)
(PS3TODAY)

Removing duplicates using PigLatin and retaining the last element

I am using PigLatin. And I want to remove the duplicates from the bags and want to retain the last element of the particular key.
Input:
User1 7 LA
User1 8 NYC
User1 9 NYC
User2 3 NYC
User2 4 DC
Output:
User1 9 NYC
User2 4 DC
Here the first filed is a key. And I want the last record of that particular key to be retained in the output.
I know how to retain the first element. It is as below. But not able to retain the last element.
inpt = load '......' ......;
user_grp = GROUP inpt BY $0;
filtered = FOREACH user_grp {
top_rec = LIMIT inpt 1;
GENERATE FLATTEN(top_rec);
};
Can anybody help me on this? Thanks in advance!
#Anil : If you order by one of the fields in descending order. You will be able to get the last record. In the below code, have ordered by second field of input (field name : no in script)
Input :
User1,7,LA
User1,8,NYC
User1,9,NYC
User2,3,NYC
User2,4,DC
Pig snippet :
user_details = LOAD 'user_details.csv' USING PigStorage(',') AS (user_name:chararray,no:long,city:chararray);
user_details_grp_user = GROUP user_details BY user_name;
required_user_details = FOREACH user_details_grp_user {
user_details_sorted_by_no = ORDER user_details BY no DESC;
top_record = LIMIT user_details_sorted_by_no 1;
GENERATE FLATTEN(top_record);
}
Output : DUMP required_user_details
(User1,9,NYC )
(User2,4,DC)
Ok.. You can use RANK Operator .
Hope the below code helps.
rec = LOAD '/user/cloudera/inputfiles/sample.txt' USING PigStorage(',') AS(user:chararray,no:int,loc:chararray);
rec_rank = rank rec;
rec_rank_each = FOREACH rec_rank GENERATE $0 as rank_key, user, no, loc;
rec_rank_grp = GROUP rec_rank_each by user;
rec_rank_max = FOREACH rec_rank_grp GENERATE group as temp_user, MAX(rec_rank_each.rank_key) as max_rank;
rec_join = JOIN rec_rank_each BY (user,rank_key) , rec_rank_min BY(temp_user,max_rank);
rec_output = FOREACH rec_join GENERATE user,no,loc;
dump rec_output;
Ensure that you run this from pig 0.11 version as rank operator introduced from pig 0.11

Apache Pig: filter based on tupple member content

I'm learning Apache Pig and have encountered an issue to realise what I wish.
I've this object (after doing a GROUP BY):
MLSET_1: {group chararray,MLSET: {(key: chararray, text: chararray)}}
I'd like to GENERATE key only when a certain pattern (PATTERN_A) appears in text AND another pattern (PATTERN_B) does not appear in the text field for one key.
I know that I can use MLSET.text to get a tupple of all text values for a specific key but then I'm still having the same issue on how to filter on the list of items from a tuple.
Here's an example:
(key_A,{(key_A,start),(key_A,stop),(key_A,unknown),(key_A,whatever)})
(key_B,{(key_B,stop),(key_B,whatever)})
(key_C,{(key_C,start),(key_C,stop),(key_C,whatever)})
I'd like to get keys for lines where "start" appears and "unknown" does not appears. In this example I will get only key_C as a result.
Thanks in advance for your help !
Here's some code that might help you out. The solution is a nested foreach here:
C = FOREACH MLSET_1 {F1 = FILTER MLSET BY (text == PATTERN_A); F2 = FILTER MLSET BY (text != PATTERN_B); GENERATE group, COUNT(F1) AS cnt1, COUNT(F2) AS cnt2;};
D = FILTER C BY (cnt1 > 1 AND cnt2 == 0);
you'll probably have to adapt the comparison in the nested filter.
Here the another approach
C = FOREACH MLSET_1 GENERATE $0,$1,BagToString(MLSET.(key,text));
D = FILTER C BY ($2 MATCHES '.*start.*') AND NOT($2 MATCHES '.*unknown.*');
E = FOREACH D GENERATE $0,$1;
DUMP E;
Output for the above input:
(key_c,{(key_c,start),(key_c,stop),(key_c,whatever)})

Grouping a databag by identical values in pig

I created the following Pig script to filter the sentences from a collection of web documents (Common Crawl) that mention a movie title (from a predefined data file of movie titles), apply sentiment analysis on those sentences and group those sentiments by movie.
register ../commoncrawl-examples/lib/*.jar;
set mapred.task.timeout= 1000;
register ../commoncrawl-examples/dist/lib/commoncrawl-examples-1.0.1-HM.jar;
register ../dist/lib/movierankings-1.jar
register ../lib/piggybank.jar;
register ../lib/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1.jar;
register ../lib/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1-models.jar;
register ../lib/stanford-corenlp-full-2014-01-04/ejml-0.23.jar;
register ../lib/stanford-corenlp-full-2014-01-04/joda-time.jar;
register ../lib/stanford-corenlp-full-2014-01-04/jollyday.jar;
register ../lib/stanford-corenlp-full-2014-01-04/xom.jar;
DEFINE IsNotWord com.moviereviewsentimentrankings.IsNotWord;
DEFINE IsMovieDocument com.moviereviewsentimentrankings.IsMovieDocument;
DEFINE ToSentenceMoviePairs com.moviereviewsentimentrankings.ToSentenceMoviePairs;
DEFINE ToSentiment com.moviereviewsentimentrankings.ToSentiment;
DEFINE MoviesInDocument com.moviereviewsentimentrankings.MoviesInDocument;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
-- LOAD pages, movies and words
pages = LOAD '../data/textData-*' USING SequenceFileLoader as (url:chararray, content:chararray);
movies_fltr_grp = LOAD '../data/movie_fltr_grp_2/part-*' as (group: chararray,movies_fltr: {(movie: chararray)});
-- FILTER pages containing movie
movie_pages = FILTER pages BY IsMovieDocument(content, movies_fltr_grp.movies_fltr);
-- SPLIT pages containing movie in sentences and create movie-sentence pairs
movie_sentences = FOREACH movie_pages GENERATE flatten(ToSentenceMoviePairs(content, movies_fltr_grp.movies_fltr)) as (content:chararray, movie:chararray);
-- Calculate sentiment for each movie-sentence pair
movie_sentiment = FOREACH movie_sentences GENERATE flatten(ToSentiment(movie, content)) as (movie:chararray, sentiment:int);
-- GROUP movie-sentiment pairs by movie
movie_sentiment_grp_tups = GROUP movie_sentiment BY movie;
-- Reformat and print movie-sentiment pairs
movie_sentiment_grp = FOREACH movie_sentiment_grp_tups GENERATE group, movie_sentiment.sentiment AS sentiments:{(sentiment: int)};
describe movie_sentiment_grp;
Test runs on a small subset of the web crawl showed to be successfully give me pairs of a movie title with a databag of integers (from 1 to 5, representing very negative, negative, neutral, positive and very positive). As a last step I would like to transform this data into pairs movie title and a databag containing tuples with all distinct integers existing for this movie title and their count. The describe movie_sentiment_grp at the end of the script returns:
movie_sentiment_grp: {group: chararray,sentiments: {(sentiment: int)}}
So basically I probably need to FOREACH over each element of movie_sentiment_grp and GROUP the sentiments databag into groups of identical values and then use the COUNT() function to get the number of elements in each group. I was however not able to find anything on how to group a databag of integers into groups of identical values. Does anyone know how to do this?
Dummy solution:
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp{
sentiments_grp = GROUP sentiments BY ?;
}
Check out the CountEach UDF from Apache DataFu. Given a bag it will produce a new bag of the distinct tuples, with the count appended to each corresponding tuple.
Example from the documentation should make this clear:
DEFINE CountEachFlatten datafu.pig.bags.CountEach('flatten');
-- input:
-- ({(A),(A),(C),(B)})
input = LOAD 'input' AS (B: bag {T: tuple(alpha:CHARARRAY, numeric:INT)});
-- output_flatten:
-- ({(A,2),(C,1),(B,1)})
output_flatten = FOREACH input GENERATE CountEachFlatten(B);
For your case:
DEFINE CountEachFlatten datafu.pig.bags.CountEach('flatten');
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp GENERATE
group,
CountEach(sentiments);
You were on the right track. movie_sentiment_grp is in the right format, and a nested FOREACH would be correct, except you can not use a GROUP in it. The solution is to use a UDF. Something like this:
myudfs.py
#!/usr/bin/python
#outputSchema('sentiments: {(sentiment:int, count:int)}')
def count_sentiments(BAG):
res = {}
for s in BAG:
if s in res:
res[s] += 1
else:
res[s] = 1
return res.items()
This UDF is used like:
Register 'myudfs.py' using jython as myfuncs;
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp
GENERATE group, myfuncs.count_sentiments(sentiments) ;

How can I select record with minimum value in pig latin

I have timestamped samples and I'm processing them using Pig. I want to find, for each day, the minimum value of the sample and the time of that minimum. So I need to select the record that contains the sample with the minimum value.
In the following for simplicity I'll represent time in two fields, the first is the day and the second the "time" within the day.
1,1,4.5
1,2,3.4
1,5,5.6
To find the minimum the following works:
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as samp;
But then I've lost the exact time at which the minimum happened. I hoped I could use nested expressions. I tried the following:
dailyminima = FOREACH g {
minsample = MIN(samples.samp);
mintuple = FILTER samples BY samp == minsample;
GENERATE group as day, mintuple.time, mintuple.samp;
};
But with that I receive the error message:
2012-11-12 12:08:40,458 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000:
<line 5, column 29> Invalid field reference. Referenced field [samp] does not exist in schema: .
Details at logfile: /home/hadoop/pig_1352722092997.log
If I set minsample to a constant, it doesn't complain:
dailyminima = FOREACH g {
minsample = 3.4F;
mintuple = FILTER samples BY samp == minsample;
GENERATE group as day, mintuple.time, mintuple.samp;
};
And indeed produces a sensible result:
(1,{(2)},{(3.4)})
While writing this I thought of using a separate JOIN:
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as minsamp;
dailyminima = JOIN samples BY (day, samp), dailyminima BY (day, minsamp);
That work, but results (in the real case) in a join over two large data sets instead of a search through a single day's values, which doesn't seem healthy.
In the real case I actually want to find max and min and associated times. I hoped that the nested expression approach would allow me to do both at once.
Suggestions of ways to approach this would be appreciated.
Thanks to alexeipab for the link to another SO question.
One working solution (finding both min and max and the associated time) is:
dailyminima = FOREACH g {
minsamples = ORDER samples BY samp;
minsample = LIMIT minsamples 1;
maxsamples = ORDER samples BY samp DESC;
maxsample = LIMIT maxsamples 1;
GENERATE group as day, FLATTEN(minsample), FLATTEN(maxsample);
};
Another way to do it, which has the advantage that it doesn't sort the entire relation, and only keeps the (potential) min in memory, is to use the PiggyBank ExtremalTupleByNthField. This UDF implements Accumulator and Algebraic and is pretty efficient.
Your code would look something like this:
DEFINE TupleByNthField org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('3', 'min');
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
bagged = FOREACH g GENERATE TupleByNthField(samples);
flattened = FOREACH bagged GENERATE FLATTEN($0);
min_result = FOREACH flattened GENERATE $1 .. ;
Keep in mind that the fact we are sorting based on the samp field is defined in the DEFINE statement by passing 3 as the first param.

Resources