I am working on a solution to the following problem:
Given an arbitrary text document written in English, write a program that will generate a concordance, i.e. an alphabetical list of all word occurrences, labeled with word frequencies.
Bonus: label each word with the sentence numbers in which each occurrence appeared.
Now, I have the first part of this exercise completed. I am stuck on the bonus part.
Can someone please help me out? I am using Hadoop Pig on Cloudera Live. Here is what the sample output is supposed to look like, including the bonus.
a. a {2:1,1}
b. all {1:1}
c. alphabetical {1:1}
d. an {2:1,1}
e. appeared {1:2}
The Wordcount.pig script does the word count, and the second script puts the result in alphabetical order.
Wordcount.pig
--Load data
lines = LOAD '/user/cloudera/gettysburg.txt' AS (line:chararray);
-- Create list
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
-- Count occurrences
grouped = GROUP words BY word;
--Generate wordcount
wordcount = FOREACH grouped GENERATE group, COUNT(words);
--Save output
STORE wordcount into '/user/cloudera/output';
WORDCOUNTALPHABETIZE.PIG
--Load unsorted data file
unsortedData = LOAD '/user/cloudera/output/UnsortedList.txt' AS (words:chararray, frequency:int);
DUMP unsortedData;
--Put data in alphabetical order
sortedData = ORDER unsortedData BY words ASC, frequency;
DUMP sortedData;
--Save output
STORE sortedData into '/user/cloudera/output2';
Thanks,
Anne
This could be achieved with the Enumerate UDF from DataFu, which generates a sequence number for each tuple in a bag. Can you try this?
register datafu-1.1.0.jar;
define Enumerate datafu.pig.bags.Enumerate('1');
A = LOAD '/home/hduser/a22.dat' as (line:chararray);
Z = FOREACH A GENERATE FLATTEN(TOKENIZE(line,'.')) as (word:chararray); -- split into sentences; RANK below supplies the line_number
Z1 = RANK Z;
Z2 = FOREACH Z1 GENERATE rank_Z,FLATTEN(TOKENIZE(word)) as (word:chararray); -- line_number,word
Z3 = RANK Z2; -- rank used to maintain the word order
Z4 = GROUP Z3 by rank_Z; -- grouped by line_number to generate word_number for each line
Z5 = foreach Z4 {
sorted = order Z3 by rank_Z2;
generate group, sorted;
} -- ordered to maintain word order
-- Enumerate appends its index as the last field, so the last field is the word_number within the line
Z6 = foreach Z5 generate FLATTEN(Enumerate(sorted)) as (l:int,line_no:int,word:chararray,word_no:int);
Z7 = GROUP Z6 BY word;
Z8 = FOREACH Z7 GENERATE group,Z6.word_no,Z6.line_no,COUNT(Z6); -- output in order word,word_numbers,line_numbers,count_of_each_word
For the word "nation", the output is:
(nation,{(16),(13),(25),(16)},{(2),(2),(4),(1)},4)
in the order (word,{(word_number1),(word_number2),(word_number3),(word_number4)},{(line_number1),(line_number2),(line_number3),(line_number4)},count_of_each_word)
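To check the Pig output against a small input, the same concordance can be sketched in plain Python. This is only an illustration, not part of the Pig answer: the function names `concordance` and `format_entry` are made up here, and it assumes that a sentence ends at '.', '!' or '?' followed by whitespace and that words are lowercased before counting.

```python
import re
from collections import defaultdict


def concordance(text):
    """Map each lowercased word to the sentence numbers of its occurrences."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    occurrences = defaultdict(list)
    for number, sentence in enumerate(sentences, start=1):
        for word in re.findall(r"[a-z']+", sentence.lower()):
            occurrences[word].append(number)
    return dict(occurrences)


def format_entry(word, sentence_numbers):
    """Render one entry as: word {frequency:sentence,sentence,...}."""
    listed = ",".join(str(n) for n in sentence_numbers)
    return "%s {%d:%s}" % (word, len(sentence_numbers), listed)


sample = "A cat saw a dog. The dog ran."  # made-up sample text
result = concordance(sample)
for word in sorted(result):
    print(format_entry(word, result[word]))
```

Each printed line has the shape asked for in the question, for example `a {2:1,1}` for a word that occurs twice in sentence 1.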
Related
I have a set of tweets that have many different fields
raw_tweets = LOAD 'input.tsv' USING PigStorage('\t') AS (tweet_id, text,
in_reply_to_status_id, favorite_count, source, coordinates, entities,
in_reply_to_screen_name, in_reply_to_user_id, retweet_count, is_retweet,
retweet_of_id, user_id_id, lang, created_at, event_id_id, is_news);
I want to find the most common words for each date. I managed to group the texts by date:
r1 = FOREACH raw_tweets GENERATE SUBSTRING(created_at,0,10) AS a, REPLACE
(LOWER(text),'([^a-z\\s]+)','') AS b;
r2 = group r1 by a;
r3 = foreach r2 generate group as a, r1 as b;
r4 = foreach r3 generate a, FLATTEN(BagToTuple(b.b));
Now it looks like:
(date text text3)
(date2 text2)
I removed the special characters, so only "real" words appear in the text.
Sample:
2017-06-18 the plants are green the dog is black there are words this is
2017-06-19 more words and even more words another phrase begins here
I want the output to look like
2017-06-18 the are is
2017-06-19 more words and
I don't really care how many times each word appears. I just want to show the most common ones; if two words appear the same number of times, show either of them.
While I'm sure there is a way to do this entirely in Pig, it would probably be more difficult than necessary.
UDFs are the way to go, in my opinion, and Python is just one option I will show because it's quick to register it in Pig.
For example,
input.tsv
2017-06-18 the plants are green the dog is black there are words this is
2017-06-19 more words and even more words another phrase begins here
py_udfs.py
from collections import Counter
from operator import itemgetter

@outputSchema("y:bag{t:tuple(word:chararray,count:int)}")
def word_count(sentence):
    ''' Does a word count of a sentence and orders common words first '''
    words = Counter()
    for w in sentence.split():
        words[w] += 1
    values = ((word, count) for word, count in words.items())
    return sorted(values, key=itemgetter(1), reverse=True)
script.pig
REGISTER 'py_udfs.py' USING jython AS py_udfs;
A = LOAD 'input.tsv' USING PigStorage('\t') as (created_at:chararray,sentence:chararray);
B = FOREACH A GENERATE created_at, py_udfs.word_count(sentence);
\d B
Output
(2017-06-18,{(is,2),(the,2),(are,2),(green,1),(black,1),(words,1),(this,1),(plants,1),(there,1),(dog,1)})
(2017-06-19,{(more,2),(words,2),(here,1),(another,1),(begins,1),(phrase,1),(even,1),(and,1)})
If you are doing textual analysis, though, I would suggest:
Removing stop words
Lemmatization / stemming
Using Apache Spark
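Coming back to the original question, which only wants the few most common words per date, the counting logic could be trimmed as in the sketch below. It is plain Python for local testing rather than a registered Jython UDF, and the name `top_words` and its parameter `n` are assumptions made for this example. Ties are resolved by whatever order `Counter.most_common` returns, which the question says is acceptable.

```python
from collections import Counter


def top_words(sentence, n=3):
    """Return the n most frequent words in sentence, most frequent first."""
    counts = Counter(sentence.split())
    return [word for word, _ in counts.most_common(n)]


rows = [
    ("2017-06-18", "the plants are green the dog is black there are words this is"),
    ("2017-06-19", "more words and even more words another phrase begins here"),
]
top_by_date = {date: top_words(text) for date, text in rows}
for date, words in top_by_date.items():
    print(date, " ".join(words))
```

Inside Pig, the same body could be returned from the UDF as a bag of single-field tuples, with the output schema adjusted to match.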
I have fields C1C2C3C4 (no delimiter present) in a raw file, and I have to generate output that looks like C1,C2,C3,C4 using a Pig script.
Given: C1, C2, C3 and C4 are each 4 bytes.
This should be straightforward with these steps:
Load the data as is
Generate four new columns, using the SUBSTRING function
For example, since SUBSTRING takes a zero-based start index and an exclusive stop index, you should be able to extract c2 as:
SUBSTRING(inputstring, 4, 8)
Extending Dennis's Answer.
Assuming the field is stored as chararray
A = LOAD 'data.txt' as (f1:chararray);
B = FOREACH A GENERATE
SUBSTRING(f1,0,4) as A1,
SUBSTRING(f1,4,8) as A2,
SUBSTRING(f1,8,12) as A3,
SUBSTRING(f1,12,16) as A4;
DUMP B;
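The index arithmetic can be checked quickly outside Pig. Pig's SUBSTRING uses a zero-based start index and an exclusive stop index, which is the same convention as a Python slice, so field i of width w covers the range from i*w up to (but not including) (i+1)*w. The helper name `split_fixed` and the sample record are made up for this sketch, which assumes 4-character fields as stated in the question.

```python
def split_fixed(record, width=4, fields=4):
    """Split a delimiter-free record into `fields` pieces of `width` characters."""
    # Field i covers the half-open range [i * width, (i + 1) * width).
    return [record[i * width:(i + 1) * width] for i in range(fields)]


record = "AAAABBBBCCCCDDDD"  # made-up sample row
print(",".join(split_fixed(record)))
```

The second field comes from `record[4:8]`, matching a SUBSTRING call with start 4 and stop 8.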
I am trying to build a Pig script that takes in a textbook file, divides it into chapters, compares the words in each chapter, and returns only the words that show up in all chapters, along with their counts. The chapters are delimited fairly simply by CHAPTER - X.
Here's what I have so far:
lines = LOAD '../../Alice.txt' AS (line:chararray);
lineswithoutspecchars = FOREACH lines GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','') as line;
words = FOREACH lineswithoutspecchars GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
Sorry, this question is probably far simpler than what I normally ask on Stack Overflow. I googled around for it, but perhaps I am not using the correct keywords. I am brand new to Pig and am trying to learn it for a new job assignment.
Thanks in advance!
This is a bit lengthy, but you will get the result. You could cut out unnecessary relations depending on your file. I have added comments in the script.
Input File:
Pig does not know whether integer values in baseball are stored as ASCII strings, Java
serialized values, binary-coded decimal, or some other format. So it asks the load func-
tion, because it is that function’s responsibility to cast bytearrays to other types. In
general this works nicely, but it does lead to a few corner cases where Pig does not know
how to cast a bytearray. In particular, if a UDF returns a bytearray, Pig will not know
how to perform casts on it because that bytearray is not generated by a load function.
CHAPTER - X
In a strongly typed computer language (e.g., Java), the user must declare up front the
type for all variables. In weakly typed languages (e.g., Perl), variables can take on values
of different type and adapt as the occasion demands.
CHAPTER - X
In this example, remember we are pretending that the values for base_on_balls and
ibbs turn out to be represented as integers internally (that is, the load function con-
structed them as integers). If Pig were weakly typed, the output of unintended would
be records with one field typed as an integer. As it is, Pig will output records with one
field typed as a double. Pig will make a guess and then do its best to massage the data
into the types it guessed.
Pig Script:
A = LOAD 'file' as (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','') as line;
-- We need to split on CHAPTER X, but the LOAD above gives us one tuple per line of the file.
-- So group everything and convert that bag to a string, which gives a single tuple with _ as the delimiter.
C = GROUP B ALL;
D = FOREACH C GENERATE BagToString(B) as (line:chararray);
-- Now we don't have any commas, so convert our delimiter CHAPTER X to a comma. We do this because passing the
-- result to TOKENIZE with ',' splits it into one tuple per chapter, which we can then RANK.
-- The cleanup in B also strips the hyphen from "CHAPTER - X", so match any run of whitespace between the two words.
E = FOREACH D GENERATE REPLACE(line,'_CHAPTER\\s+X_',',') AS (line:chararray);
F = FOREACH E GENERATE REPLACE(line,'_',' ') AS (line:chararray); -- remove the delimiter created by BagToString
-- split into one tuple per chapter
G = FOREACH F GENERATE FLATTEN(TOKENIZE(line,',')) AS (line:chararray);
-- Rank each chapter so that each word can be counted per chapter.
H = RANK G;
J = FOREACH H GENERATE rank_G,FLATTEN(TOKENIZE(line)) as (line:chararray);
J1 = GROUP J BY (rank_G, line);
J2 = FOREACH J1 GENERATE COUNT(J) AS (cnt:long),FLATTEN(group.line) as (word:chararray),FLATTEN(group.rank_G) as (rnk:long);
-- J2 now holds one row per (chapter, word), so a word appears at most once per chapter.
-- Group by word and keep the words whose chapter count is greater than 2, i.e. the words present in all 3 chapters of this sample.
J3 = GROUP J2 BY word;
J4 = FOREACH J3 GENERATE SUM(J2.cnt) AS (sumval:long),COUNT(J2) as (cnt:long),FLATTEN(group) as (word:chararray);
J5 = FILTER J4 BY cnt > 2;
J6 = FOREACH J5 GENERATE word,sumval;
dump J6;
-- result in order: word, count across chapters
Output:
(a,8)
(In,5)
(as,6)
(the,9)
(values,4)
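For a small file, the same logic can be cross-checked in plain Python. This is a sketch under assumptions, not a replacement for the Pig script: the function name `words_in_all_chapters` and the sample text are invented for illustration, the chapter marker is taken to be the literal text CHAPTER - X, and matching is case-sensitive, as in the script above.

```python
import re
from collections import Counter


def words_in_all_chapters(text, marker=r"CHAPTER - X"):
    """Count words that occur in every chapter; returns {word: total count}."""
    chapters = [c for c in re.split(marker, text) if c.strip()]
    per_chapter = [Counter(re.findall(r"[A-Za-z]+", c)) for c in chapters]
    if not per_chapter:
        return {}
    # Keep only the words present in every chapter's counter.
    common = set(per_chapter[0])
    for counts in per_chapter[1:]:
        common &= set(counts)
    return {w: sum(counts[w] for counts in per_chapter) for w in sorted(common)}


sample = "the cat sat CHAPTER - X the dog sat down CHAPTER - X the cat and the dog sat"
print(words_in_all_chapters(sample))
```

Intersecting the per-chapter word sets plays the role of the chapter-count filter in the script, and it does not depend on knowing the number of chapters in advance.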
The following code works quite well. However, when I already have two existing bags (say, under the aliases S1 and S2, representing two sets), how do I call the setDifference UDF to generate the set difference? If I manually construct an additional bag from my existing input bags (S1 and S2), won't that add overhead?
register datafu-1.2.0.jar;
define setDifference datafu.pig.sets.SetDifference();
-- ({(3),(4),(1),(2),(7),(5),(6)} \t {(1),(3),(5),(12)})
A = load 'input.txt' AS (B1:bag{T:tuple(val:int)},B2:bag{T:tuple(val:int)});
F1 = foreach A generate B1;
F2 = foreach A generate B2;
differenced = FOREACH A {
-- input bags must be sorted
sorted_b1 = ORDER B1 by val;
sorted_b2 = ORDER B2 by val;
GENERATE setDifference(sorted_b1,sorted_b2);
}
-- produces: ({(2),(4),(6),(7)})
DUMP differenced;
Update:
The question is: suppose I already have two bags; how do I call the setDifference UDF to get the set difference? Do I need to build another super bag that contains the two separate bags? Thanks.
thanks in advance,
Lin
I don't see any overhead issue with the UDF invocation.
Ref: http://datafu.incubator.apache.org/docs/datafu/guide/set-operations.html has an example of using SetDifference.
As per the API (http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/sets/SetDifference.html), SetDifference takes bags as input and emits the difference between them.
N.B. The input bags have to be sorted.
In the example snippet shared, I don't see the need for the lines below:
F1 = foreach A generate B1;
F2 = foreach A generate B2;
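As background on the sorting requirement: with both inputs sorted, a difference can be computed in a single pass with one cursor per input. The sketch below illustrates that general idea in plain Python; it is not DataFu's actual implementation, and the name `sorted_difference` is made up for the example.

```python
def sorted_difference(first, second):
    """Return items of sorted list `first` that are absent from sorted list `second`."""
    result = []
    j = 0
    for item in first:
        # Advance the cursor in `second` past everything smaller than `item`.
        while j < len(second) and second[j] < item:
            j += 1
        if j == len(second) or second[j] != item:
            result.append(item)
    return result


b1 = sorted([3, 4, 1, 2, 7, 5, 6])
b2 = sorted([1, 3, 5, 12])
print(sorted_difference(b1, b2))
```

With the two bags from the question's sample row, this yields the same four values as the `-- produces:` comment in the script.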
Let me explain the problem. I have this line of code:
u = FOREACH persons GENERATE FLATTEN($0#'experiences') as j;
dump u;
which produces this output:
([id#1,date_begin#12 2012,description#blabla,date_end#04 2013],[id#2,date_begin#02 2011,description#blabla2,date_end#04 2013])
([id#1,date_begin#12 2011,description#blabla3,date_end#04 2012],[id#2,date_begin#02 2010,description#blabla4,date_end#04 2011])
Then, when I do this:
p = foreach u generate j#'id', j#'description';
dump p;
I have this output:
(1,blabla)
(1,blabla3)
But that's not what I wanted. I would like to have an output like this:
(1,blabla)
(2,blabla2)
(1,blabla3)
(2,blabla4)
How can I get this?
Thank you very much.
I'm assuming that the $0 you are FLATTENing in u is a tuple.
The overall problem is that j is only referencing the first map in the tuple. In order to get the output you want, you'll have to convert each tuple into a bag, then FLATTEN it.
If you know that each tuple will have up to two maps, you can do:
-- My B is your u
B = FOREACH A GENERATE (tuple(map[],map[]))$0#'experiences' AS T ;
B2 = FOREACH B GENERATE FLATTEN(TOBAG(T.$0, T.$1)) AS j ;
C = foreach B2 generate j#'id', j#'description' ;
If you don't know how many fields will be in the tuple, then this will be much harder.
NOTE: This works for pig 0.10.
For tuples with an undefined number of maps, the best answer I can think of is using a UDF to parse the bytearray:
myudf.py
@outputSchema('vals: {(val:map[])}')
def foo(the_input):
    # This converts the indeterminate number of maps into a bag.
    # the_input arrives as a bytearray, so turn it into a string first.
    text = ''.join(chr(i) for i in the_input).strip('()')
    out = []
    for f in text.split('],['):
        f = f.strip('[]')
        # Entries are comma-separated and each one looks like key#value.
        out.append(dict(i.split('#') for i in f.split(',')))
    return out
myscript.pig
register 'myudf.py' using jython as myudf ;
B = FOREACH A GENERATE FLATTEN($0#'experiences') ;
T1 = FOREACH B GENERATE FLATTEN(myudf.foo($0)) AS M ;
T2 = FOREACH T1 GENERATE M#'id', M#'description' ;
However, this relies on the fact that #, ,, or ],[ will not appear in any of the keys or values in the map.
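That limitation is easy to reproduce outside Pig. The sketch below applies the same splitting rules to an ordinary string instead of a bytearray; `parse_maps` and both sample rows are hypothetical, chosen only to show one row that parses and one that does not.

```python
def parse_maps(text):
    """Parse '([k#v,...],[k#v,...])' into a list of dicts, one per map."""
    body = text.strip("()")
    maps = []
    for chunk in body.split("],["):
        chunk = chunk.strip("[]")
        maps.append(dict(pair.split("#", 1) for pair in chunk.split(",")))
    return maps


row = "([id#1,description#blabla],[id#2,description#blabla2])"
for m in parse_maps(row):
    print(m["id"], m["description"])

# A comma inside a value splits the entry in the wrong place and the parse fails.
try:
    parse_maps("([id#1,description#bla,bla])")
    parse_failed = False
except ValueError:
    parse_failed = True
print(parse_failed)
```

If the data can contain those characters, a stricter serialization of the maps would be needed before this kind of string parsing is safe.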
NOTE: This works for pig 0.11.
So it seems that how pig handles the input to the python UDFs changed in this case. Instead of a bytearray being the input to foo, the bytearray is automatically converted to the appropriate type. In that case it makes everything much easier:
myudf.py
@outputSchema('vals: {(val:map[])}')
def foo(the_input):
    # This converts the indeterminate number of maps into a bag.
    out = []
    for m in the_input:
        out.append(m)
    return out
myscript.pig
register 'myudf.py' using jython as myudf ;
-- This time you should pass in the entire tuple.
B = FOREACH A GENERATE $0#'experiences' ;
T1 = FOREACH B GENERATE FLATTEN(myudf.foo($0)) AS M ;
T2 = FOREACH T1 GENERATE M#'id', M#'description' ;