Count word occurrences in each row using Pig - Hadoop

I have a set of tweets that have many different fields
raw_tweets = LOAD 'input.tsv' USING PigStorage('\t') AS (tweet_id, text,
in_reply_to_status_id, favorite_count, source, coordinates, entities,
in_reply_to_screen_name, in_reply_to_user_id, retweet_count, is_retweet,
retweet_of_id, user_id_id, lang, created_at, event_id_id, is_news);
I want to find the most common words for each date. I managed to group the texts by date:
r1 = FOREACH raw_tweets GENERATE SUBSTRING(created_at,0,10) AS a, REPLACE
(LOWER(text),'([^a-z\\s]+)','') AS b;
r2 = group r1 by a;
r3 = foreach r2 generate group as a, r1 as b;
r4 = foreach r3 generate a, FLATTEN(BagToTuple(b.b));
Now it looks like:
(date text text3)
(date2 text2)
I removed the special characters, so only "real" words appear in the text.
Sample:
2017-06-18 the plants are green the dog is black there are words this is
2017-06-19 more words and even more words another phrase begins here
I want the output to look like
2017-06-18 the are is
2017-06-19 more words and
I don't really care how many times each word appears. I just want to show the most common ones; if two words appear the same number of times, show any of them.

While I'm sure there is a way to do this entirely in Pig, it would probably be more difficult than necessary.
UDFs are the way to go, in my opinion, and Python is just one option I will show because it's quick to register it in Pig.
For example,
input.tsv
2017-06-18 the plants are green the dog is black there are words this is
2017-06-19 more words and even more words another phrase begins here
py_udfs.py
from collections import Counter
from operator import itemgetter

# Pig's Jython UDF decorator; it declares the schema of the returned bag.
@outputSchema("y:bag{t:tuple(word:chararray,count:int)}")
def word_count(sentence):
    ''' Does a word count of a sentence and orders common words first '''
    words = Counter()
    for w in sentence.split():
        words[w] += 1
    values = ((word,count) for word,count in words.items())
    return sorted(values,key=itemgetter(1),reverse=True)
script.pig
REGISTER 'py_udfs.py' USING jython AS py_udfs;
A = LOAD 'input.tsv' USING PigStorage('\t') as (created_at:chararray,sentence:chararray);
B = FOREACH A GENERATE created_at, py_udfs.word_count(sentence);
\d B
Output
(2017-06-18,{(is,2),(the,2),(are,2),(green,1),(black,1),(words,1),(this,1),(plants,1),(there,1),(dog,1)})
(2017-06-19,{(more,2),(words,2),(here,1),(another,1),(begins,1),(phrase,1),(even,1),(and,1)})
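Since the question only asks for the three most common words per date, the counting step could be sketched in plain Python as below. This is a minimal sketch and not part of the UDF above: the helper name top_words and the default n=3 are assumptions for illustration, and ties are resolved in whatever order Counter.most_common returns them.

```python
from collections import Counter


def top_words(sentence, n=3):
    """Return the n most frequent words in a whitespace-separated sentence.

    Hypothetical helper: the name and the default n=3 are illustrative.
    """
    counts = Counter(sentence.split())
    # most_common(n) yields (word, count) pairs, highest count first.
    return [word for word, _ in counts.most_common(n)]


rows = {
    "2017-06-18": "the plants are green the dog is black there are words this is",
    "2017-06-19": "more words and even more words another phrase begins here",
}
for date, text in sorted(rows.items()):
    print(date, " ".join(top_words(text)))
```

Inside the UDF, the same effect could be had by slicing the sorted result to its first three entries before returning it.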
If you are doing textual analysis, though, I would suggest:
Removing stop words
Lemmatization / stemming
Using Apache Spark
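As a rough illustration of the first suggestion, stop words can be filtered out before counting. The stop-word set below is a tiny hypothetical sample, not a real list; in practice it would come from an NLP library or a curated file.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "is", "are", "this", "there", "and", "a", "an"}


def count_content_words(sentence, stop_words=STOP_WORDS):
    """Count words in a sentence, skipping anything in stop_words."""
    words = (w for w in sentence.lower().split() if w not in stop_words)
    return Counter(words)


counts = count_content_words(
    "the plants are green the dog is black there are words this is"
)
print(sorted(counts.items()))
```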

Related

Spark Drop Duplicates on multiple columns - Performance Issue [duplicate]

I'm trying to use Spark dataframes instead of RDDs since they appear to be more high-level than RDDs and tend to produce more readable code.
In a 14-node Google Dataproc cluster, I have about 6 million names that are translated to ids by two different systems: sa and sb. Each Row contains name, id_sa and id_sb. My goal is to produce a mapping from id_sa to id_sb such that for each id_sa, the corresponding id_sb is the most frequent id among all names attached to id_sa.
Let's try to clarify with an example. If I have the following rows:
[Row(name='n1', id_sa='a1', id_sb='b1'),
Row(name='n2', id_sa='a1', id_sb='b2'),
Row(name='n3', id_sa='a1', id_sb='b2'),
Row(name='n4', id_sa='a2', id_sb='b2')]
My goal is to produce a mapping from a1 to b2. Indeed, the names associated to a1 are n1, n2 and n3, which map respectively to b1, b2 and b2, so b2 is the most frequent mapping in the names associated to a1. In the same way, a2 will be mapped to b2. It's OK to assume that there will always be a winner: no need to break ties.
I was hoping that I could use groupBy(df.id_sa) on my dataframe, but I don't know what to do next. I was hoping for an aggregation that could produce, in the end, the following rows:
[Row(id_sa=a1, max_id_sb=b2),
Row(id_sa=a2, max_id_sb=b2)]
But maybe I'm trying to use the wrong tool and I should just go back to using RDDs.
Using join (it will result in more than one row in group in case of ties):
import pyspark.sql.functions as F
from pyspark.sql.functions import count, col
cnts = df.groupBy("id_sa", "id_sb").agg(count("*").alias("cnt")).alias("cnts")
maxs = cnts.groupBy("id_sa").agg(F.max("cnt").alias("mx")).alias("maxs")
cnts.join(maxs,
(col("cnt") == col("mx")) & (col("cnts.id_sa") == col("maxs.id_sa"))
).select(col("cnts.id_sa"), col("cnts.id_sb"))
Using window functions (will drop ties):
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
w = Window().partitionBy("id_sa").orderBy(col("cnt").desc())
(cnts
.withColumn("rn", row_number().over(w))
.where(col("rn") == 1)
.select("id_sa", "id_sb"))
Using struct ordering:
from pyspark.sql.functions import struct
(cnts
.groupBy("id_sa")
.agg(F.max(struct(col("cnt"), col("id_sb"))).alias("max"))
.select(col("id_sa"), col("max.id_sb")))
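The struct-ordering variant works because structs compare field by field, so the maximum of (cnt, id_sb) is the pair with the highest count. The same idea can be sketched in plain Python with tuples, which compare the same way. This is only an illustration on the question's four rows, not Spark code, and the function name most_frequent_mapping is made up for the example.

```python
from collections import Counter

rows = [
    ("n1", "a1", "b1"),
    ("n2", "a1", "b2"),
    ("n3", "a1", "b2"),
    ("n4", "a2", "b2"),
]


def most_frequent_mapping(rows):
    """Map each id_sa to its most frequent id_sb (hypothetical helper)."""
    # Plays the role of groupBy("id_sa", "id_sb") with a count aggregate.
    pair_counts = Counter((id_sa, id_sb) for _, id_sa, id_sb in rows)
    best = {}
    for (id_sa, id_sb), cnt in pair_counts.items():
        # Tuples compare element by element, so the count decides first
        # and id_sb only breaks ties.
        candidate = (cnt, id_sb)
        if id_sa not in best or candidate > best[id_sa]:
            best[id_sa] = candidate
    return {id_sa: id_sb for id_sa, (_, id_sb) in best.items()}


print(most_frequent_mapping(rows))
```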
See also How to select the first row of each group?
I think what you might be looking for are window functions:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Window
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
Here is an example in Scala (I don't have a Spark Shell with Hive available right now, so I was not able to test the code, but I think it should work):
case class MyRow(name: String, id_sa: String, id_sb: String)
val myDF = sc.parallelize(Array(
MyRow("n1", "a1", "b1"),
MyRow("n2", "a1", "b2"),
MyRow("n3", "a1", "b2"),
MyRow("n1", "a2", "b2")
)).toDF("name", "id_sa", "id_sb")
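import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first
// Note: ordering by id_sb picks the lexicographically largest id_sb per id_sa, not the most frequent one.
val windowSpec = Window.partitionBy(myDF("id_sa")).orderBy(myDF("id_sb").desc)
// The new column must be named max_id_sb so that the filter can refer to it.
myDF.withColumn("max_id_sb", first(myDF("id_sb")).over(windowSpec)).filter("id_sb = max_id_sb")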
There are probably more efficient ways to achieve the same results with Window functions, but I hope this points you in the right direction.

Graphlab: How to avoid manually duplicating functions that differ only in a string variable?

I imported my dataset with SFrame:
products = graphlab.SFrame('amazon_baby.gl')
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
I would like to do sentiment analysis on a set of words shown below:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
Then I would like to create a new column for each of the selected words in the products matrix and the entry is the number of times such word occurs, so I created a function for the word "awesome":
def awesome_count(word_count):
    if 'awesome' in word_count:
        return word_count['awesome']
    else:
        return 0
products['awesome'] = products['word_count'].apply(awesome_count)
So far so good, but I need to manually create other functions for each of the selected words in this way, e.g., great_count, etc. How can I avoid this manual effort and write cleaner code?
I think the SFrame.unpack command should do the trick. In fact, the limit parameter will accept your list of selected words and keep only these results, so that part is greatly simplified.
I don't know precisely what's in your reviews data, so I made a toy example:
# Create the data and convert to bag-of-words.
import graphlab
products = graphlab.SFrame({'review':['this book is awesome',
'I hate this book']})
products['word_count'] = \
graphlab.text_analytics.count_words(products['review'])
# Unpack the bag-of-words into separate columns.
selected_words = ['awesome', 'hate']
products2 = products.unpack('word_count', limit=selected_words)
# Fill in zeros for the missing values.
for word in selected_words:
    col_name = 'word_count.{}'.format(word)
    products2[col_name] = products2[col_name].fillna(value=0)
I also can't help but point out that GraphLab Create does have its own sentiment analysis toolkit, which could be worth checking out.
I actually found an easier way to do this:
def wordCount_select(wc,selectedWord):
    if selectedWord in wc:
        return wc[selectedWord]
    else:
        return 0
for word in selected_words:
    products[word] = products['word_count'].apply(lambda wc: wordCount_select(wc, word))
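The helper above is essentially a dictionary lookup with a default, which dict.get already provides. A minimal sketch on plain Python dictionaries follows; the dictionaries stand in for entries of the word_count column, no GraphLab is required, and the function name select_counts is an assumption for illustration.

```python
selected_words = ["awesome", "great", "hate"]


def select_counts(word_count, words=selected_words):
    """Return {word: count} for the chosen words, using 0 when absent."""
    return {word: word_count.get(word, 0) for word in words}


# Each dict plays the role of one entry of the word_count column.
reviews = [
    {"this": 1, "book": 1, "is": 1, "awesome": 1},
    {"i": 1, "hate": 2, "this": 1, "book": 1},
]
for wc in reviews:
    print(select_counts(wc))
```

When building one lambda per word in a loop, binding the word as a default argument (lambda wc, w=word: wc.get(w, 0)) guards against Python's late-binding closures in case the lambda is evaluated after the loop variable has changed.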

PIG Script to split large txt file into parts based on specified word

I am trying to build a Pig script that takes in a textbook file, divides it into chapters, compares the words in each chapter, and returns only the words that show up in all chapters, along with their counts. The chapters are delimited fairly easily by CHAPTER - X.
Here's what I have so far:
lines = LOAD '../../Alice.txt' AS (line:chararray);
lineswithoutspecchars = FOREACH lines GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','') as line;
words = FOREACH lineswithoutspecchars GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
Sorry that this question is probably way too simple compared to what I normally ask on stackoverflow and I googled around for it but perhaps I am not using the correct keywords. I am brand new to PIG and trying to learn it for a new job assignment.
Thanks in advance!
It is a bit lengthy, but you will get the result. You could cut down unnecessary relations depending on your file, though. I have added comments in the script.
Input File:
Pig does not know whether integer values in baseball are stored as ASCII strings, Java
serialized values, binary-coded decimal, or some other format. So it asks the load func-
tion, because it is that function’s responsibility to cast bytearrays to other types. In
general this works nicely, but it does lead to a few corner cases where Pig does not know
how to cast a bytearray. In particular, if a UDF returns a bytearray, Pig will not know
how to perform casts on it because that bytearray is not generated by a load function.
CHAPTER - X
In a strongly typed computer language (e.g., Java), the user must declare up front the
type for all variables. In weakly typed languages (e.g., Perl), variables can take on values
of different type and adapt as the occasion demands.
CHAPTER - X
In this example, remember we are pretending that the values for base_on_balls and
ibbs turn out to be represented as integers internally (that is, the load function con-
structed them as integers). If Pig were weakly typed, the output of unintended would
be records with one field typed as an integer. As it is, Pig will output records with one
field typed as a double. Pig will make a guess and then do its best to massage the data
into the types it guessed.
Pig Script:
A = LOAD 'file' as (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','') as line;
-- We need to split on CHAPTER X, but the load function above gives us one tuple per line, so
-- group everything and convert that bag to a string, which gives a single tuple with _ as the delimiter.
C = GROUP B ALL;
D = FOREACH C GENERATE BagToString(B) as (line:chararray);
-- Now we don't have any commas, so convert our delimiter CHAPTER X to a comma. We do this because passing the result
-- to TOKENIZE then splits it into separate chapters, which makes it easy to RANK them.
E = FOREACH D GENERATE REPLACE(line,'_CHAPTER X_',',') AS (line:chararray);
F = FOREACH E GENERATE REPLACE(line,'_',' ') AS (line:chararray); -- remove the delimiter created by BagToString
-- create separate columns
G = FOREACH F GENERATE FLATTEN(TOKENIZE(line,',')) AS (line:chararray);
-- rank each chapter, which makes it easy to count each word per chapter
H = RANK G;
J = FOREACH H GENERATE rank_G,FLATTEN(TOKENIZE(line)) as (line:chararray);
J1 = GROUP J BY (rank_G, line);
J2 = FOREACH J1 GENERATE COUNT(J) AS (cnt:long),FLATTEN(group.line) as (word:chararray),FLATTEN(group.rank_G) as (rnk:long);
-- J2 now has no duplicate words within a chapter.
-- So if we group by word and keep only words whose chapter count is greater than 2, we are sure the word is present in all three chapters.
J3 = GROUP J2 BY word;
J4 = FOREACH J3 GENERATE SUM(J2.cnt) AS (sumval:long),COUNT(J2) as (cnt:long),FLATTEN(group) as (word:chararray);
J5 = FILTER J4 BY cnt > 2;
J6 = FOREACH J5 GENERATE word,sumval;
dump J6;
-- result in order word,count across chapters
Output:
(a,8)
(In,5)
(as,6)
(the,9)
(values,4)
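For comparison, the same logic (split on the chapter marker, count words per chapter, keep only words present in every chapter, sum their counts) can be sketched in plain Python. This is not a Pig replacement. The function name and the marker regex are assumptions based on the "CHAPTER - X" delimiter described in the question, and unlike the Pig script the sketch strips special characters after splitting rather than before.

```python
import re
from collections import Counter


def words_in_all_chapters(text, marker=r"CHAPTER\s*-\s*\w+"):
    """Return {word: total count} for words that occur in every chapter."""
    chapters = [c for c in re.split(marker, text) if c.strip()]
    # One Counter per chapter, after stripping non-letter characters.
    per_chapter = [
        Counter(re.sub(r"[^a-zA-Z\s]+", "", chapter).split())
        for chapter in chapters
    ]
    if not per_chapter:
        return {}
    common = set(per_chapter[0])
    for counts in per_chapter[1:]:
        common &= set(counts)
    return {w: sum(c[w] for c in per_chapter) for w in sorted(common)}


sample = "a cat sat CHAPTER - 1 a dog sat down CHAPTER - 2 a bird sat and a fish sat"
print(words_in_all_chapters(sample))
```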

Hadoop Pig Script Help Needed with labeling words in a sentence

I am working on a solution to the following problem:
Given an arbitrary text document written in English, write a program that will generate a concordance, i.e. an alphabetical list of all word occurrences, labeled with word frequencies.
Bonus: label each word with the sentence numbers in which each occurrence appeared.
Now, I have the first part of this exercise completed. I am stuck on the bonus part.
Can someone please help me out? I am using Hadoop Pig on Cloudera Live. Here is what the sample output is supposed to look like, including the bonus.
a. a {2:1,1}
b. all {1:1}
c. alphabetical {1:1}
d. an {2:1,1}
e. appeared {1:2}
Wordcount.pig script does the word count and the other one puts it in alphabetical order.
Wordcount.pig
--Load data
lines = LOAD '/user/cloudera/gettysburg.txt' AS (line:chararray);
-- Create list
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
-- Count occurrences
grouped = GROUP words BY word;
--Generate word count
wordcount = FOREACH grouped GENERATE group, COUNT(words);
--Save output
STORE wordcount into '/user/cloudera/output';
WORDCOUNTALPHABETIZE.PIG
--Load unsorted data file
unsortedData = LOAD '/user/cloudera/output/UnsortedList.txt' AS (words:chararray, frequency:int);
DUMP unsortedData;
--Put data in alphabetical order
sortedData = ORDER unsortedData BY words ASC, frequency;
DUMP sortedData;
--Save output
STORE sortedData into '/user/cloudera/output2';
Thanks,
Anne
This could be achieved with the Enumerate UDF from DataFu, which generates a sequence number for each tuple in a bag. Can you try this?
register datafu-1.1.0.jar;
define Enumerate datafu.pig.bags.Enumerate('1');
A = LOAD '/home/hduser/a22.dat' as (line:chararray);
Z = FOREACH A GENERATE FLATTEN(TOKENIZE(line,'.')) as (word:chararray); -- generate line_number with rank
Z1 = RANK Z;
Z2 = FOREACH Z1 GENERATE rank_Z,FLATTEN(TOKENIZE(word)) as (word:chararray); -- line_number,word
Z3 = RANK Z2; -- rank used to maintain the word order
Z4 = GROUP Z3 by rank_Z; -- grouped by line_number to generate word_number for each line
Z5 = foreach Z4 {
    sorted = order Z3 by rank_Z2;
    generate group, sorted;
} -- ordered to maintain word order
Z6 = foreach Z5 generate FLATTEN(Enumerate(sorted)) as (l:int,word_no:int,word:chararray,line_no:int); -- generate word_number
Z7 = GROUP Z6 BY word;
Z8 = FOREACH Z7 GENERATE group,Z6.line_no,Z6.word_no,COUNT(Z6); -- output in order word,line_number,word_number,count_of_each_word
For the word "nation", the output is:
(nation,{(16),(13),(25),(16)},{(2),(2),(4),(1)},4)
This is in the order (word,{word_number1,word_number2,word_number3,word_number4},{line_number1,line_number2,line_number3,line_number4},count_of_each_word).
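The target format in the question (word, frequency, sentence numbers) can also be sketched in plain Python, which may help when checking the Pig output. This sketch makes simplifying assumptions that are not in the original: sentences end at '.', '!' or '?', and words are lowercased runs of letters and apostrophes.

```python
import re
from collections import defaultdict


def concordance(text):
    """Map each word to the list of sentence numbers where it occurs."""
    occurrences = defaultdict(list)
    # Naive sentence split; real text needs a proper sentence tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    for number, sentence in enumerate(sentences, start=1):
        for word in re.findall(r"[a-z']+", sentence.lower()):
            occurrences[word].append(number)
    # Alphabetical order, as a concordance requires.
    return dict(sorted(occurrences.items()))


text = "A nation is a people. All people appeared."
for word, numbers in concordance(text).items():
    # Printed as: word {frequency:sentence numbers}
    print("%s {%d:%s}" % (word, len(numbers), ",".join(map(str, numbers))))
```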

Hadoop Pig - Optimizing Word Count

In the canonical pig wordcount example, I'm curious how folks approach optimizing the condition where grouping by word could result in a bag with many (many) elements.
For example:
A = load 'input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
In line C, if there is a word, let's say "the", that occurs 1 billion times in the input file, this can result in the reducer hanging for a very long time while processing. What can be done to optimize this?
In any case, Pig checks whether a combiner can be used and adds one if so.
In your example it will introduce a combiner, because COUNT is an algebraic function. The combiner reduces the number of key-value pairs per word to a few, or in the best case only one, before they reach the reducer. So on the reducer side you will not end up with a huge number of values for any given word.
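What the combiner buys can be illustrated with a small Python simulation: each map task pre-aggregates its own split, so the reducer receives at most one partial count per word per split instead of one record per occurrence. This is a conceptual sketch only; the splits and function names are invented for the example and do not reflect Pig internals.

```python
from collections import Counter


def map_with_combiner(split):
    """Simulate one map task: tokenize and pre-aggregate locally."""
    return Counter(word for line in split for word in line.split())


def reduce_counts(partials):
    """Simulate the reducer: merge the per-split partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


splits = [
    ["the cat the dog", "the bird"],
    ["the fish", "a cat"],
]
partials = [map_with_combiner(s) for s in splits]
# The reducer for "the" sees 2 partial counts instead of 4 raw records.
print([p["the"] for p in partials], reduce_counts(partials)["the"])
```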

Resources