Word Count with custom delimiter (|) using Pig - hadoop

I am new to Pig Latin. I want to process the file below and find the most frequently occurring word.
Hadoop|is|an|open|source|Java-based|programming|framework|that|supports|
the|processing|and|storage|of|extremely|large|data|sets|in|a|distributed|computing|environment.
The file contains a | as a delimiter.

There are quite a few word count examples out there. In any case, here is one that handles the '|' delimiter:
lines = LOAD 'input.txt' AS (line:chararray);
newlines = FOREACH lines GENERATE REPLACE(line,'\\|',' ') AS newline;
words = FOREACH newlines GENERATE FLATTEN(TOKENIZE(newline)) as word;
grouped = GROUP words BY word;
w_count = FOREACH grouped GENERATE group, COUNT(words);
DUMP w_count;
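For anyone who wants to check the logic outside Pig, here is a rough Python sketch of the same steps (replace the pipe, tokenize, group, count), plus picking the most frequent word, which is what the question asked for. The function name word_counts and the sample lines are made up for illustration, and str.split() only approximates TOKENIZE, which I believe also splits on a few other characters such as commas and double quotes.

```python
import re
from collections import Counter

def word_counts(lines):
    """Count words in pipe-delimited lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Like REPLACE(line, '\\|', ' '): the pattern is a regex, so the pipe is escaped.
        spaced = re.sub(r"\|", " ", line)
        # Like FLATTEN(TOKENIZE(...)): one entry per whitespace-separated token.
        counts.update(spaced.split())
    return counts

# Made-up sample lines, not the file from the question.
sample = ["Hadoop|is|open|source|", "Hadoop|framework"]
counts = word_counts(sample)

print(counts)
print(counts.most_common(1))  # the single most frequent word and its count
```

In Pig itself, the equivalent last step would be to order w_count by the count in descending order and keep the first row.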

Related

Apache pig program

I need help writing a Pig script that counts the number of words in a
file containing the text below:
What|is|Hadoop
History|of|Hadoop
How|Hadoop|name|was|given
Problems|with|Traditional|Large-Scale|Systems|and|Need|for|Hadoop
Understanding|Hadoop|Architecture
Fundamental|of|HDFS|(Blocks,|Name|Node,|Data|Node,|Secondary|Name|Node)
Rack|Awareness
Read/Write|from|HDFS
HDFS|Federation|and|High|Availability
Load each line as a chararray, replace the '|' with a space (' '), and tokenize the line to get the individual words. Then group the words and count them:
A = LOAD '/user/hadoop/data.txt' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(REPLACE(line,'\\|',' ')));
C = GROUP B BY $0;
D = FOREACH C GENERATE group, COUNT(B);
DUMP D;

How may I filter lines in Apache Pig?

I have a txt, and then I loaded lines from the txt, with this script:
lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
I need to check every line for a given phrase. For example, if the line is:
'Hi, I'm lord Stark, how are you?'
I need to search for "how are you" in every line of the file and count the occurrences.
I tried with:
sentences = FOREACH lines GENERATE (FILTER lines BY (f1 matches 'how are you')) AS sent;
But it doesn't work.
Please help me.
Use the following to keep only the records that contain the string "how are you":
lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
sentence = FILTER lines BY (line matches '.*how are you.*');
To count the matching lines:
grouped = GROUP sentence ALL;
sentence_COUNT = FOREACH grouped GENERATE COUNT(sentence);
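The reason the pattern needs '.*' on both sides is that matches has to cover the whole line, not just part of it. Here is a small Python sketch of that behaviour, assuming Pig's matches works like Java's String.matches. The helper name pig_matches and the sample lines are made up for illustration.

```python
import re

def pig_matches(line, pattern):
    # Assumption: Pig's `matches` behaves like Java's String.matches,
    # so the pattern must cover the entire string.
    return re.fullmatch(pattern, line) is not None

# Made-up sample lines.
lines = [
    "Hi, I'm lord Stark, how are you?",
    "how are you",
    "Winter is coming",
    "how are you, and how are you today?",
]

bare = [l for l in lines if pig_matches(l, "how are you")]
wrapped = [l for l in lines if pig_matches(l, ".*how are you.*")]

print(len(bare))     # lines that consist of exactly the phrase
print(len(wrapped))  # lines that contain the phrase anywhere

# Counting every occurrence rather than every matching line:
occurrences = sum(len(re.findall("how are you", l)) for l in lines)
print(occurrences)
```

Note that GROUP ... ALL followed by COUNT gives the number of matching lines, so a line that contains the phrase twice is still counted once.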

How does the REPLACE function in Pig work?

My input file, words.txt, is shown below. No record in the file contains a space.
Hi
Hi
How
I am loading this file into Pig
words = LOAD '/user/inputs/words.txt' USING PigStorage() AS (line:chararray);
words_each = FOREACH words GENERATE REPLACE(line,'','|') ;
dump words_each;
I am getting output as
|H|i|
|H|i|
|H|o|w|
But I would like to know exactly how REPLACE treats '', which is the second argument in my call.
There are no spaces in my file, so why am I getting | in my output?
In your statement, REPLACE is called with '' (an empty string) as the pattern, which contains no whitespace at all.
To replace a space, the pattern has to be ' ' (a single space).
The two cases behave differently:
words_each = FOREACH words GENERATE REPLACE(line,'','|'); -- empty pattern
words_each = FOREACH words GENERATE REPLACE(line,' ','|'); -- single space
The second argument is treated as a regular expression. An empty pattern matches the empty string at every position, so the first statement inserts a pipe (|) before, between, and after the characters, which is why Hi becomes |H|i|. The second statement changes nothing, because your file contains no spaces.
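The same effect can be reproduced with any regex engine. Below is a minimal Python sketch, assuming REPLACE does a regex "replace all" on each line. The helper name pig_replace is made up for illustration.

```python
import re

def pig_replace(text, pattern, replacement):
    # Assumption: REPLACE applies the regex to every match in the string,
    # like a regex "replace all".
    return re.sub(pattern, replacement, text)

for word in ["Hi", "How"]:
    empty_pattern = pig_replace(word, "", "|")   # matches at every position
    space_pattern = pig_replace(word, " ", "|")  # no spaces, so nothing changes
    print(empty_pattern, space_pattern)
```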

Splitting Pig tuple

I want to split the following tuple into two tuples using pig script.
(key=bb7bde5661923b947ce59958773e85c5\,\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)
I want the output as follows:
(key=bb7bde5661923b947ce59958773e85c5\) (\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)
Yes, you can solve this with REGEX_EXTRACT_ALL and TOTUPLE. First split the string into two parts: the first column is everything before the first comma, and the second column is the rest of the string. Then convert each column to a tuple and store the result.
input
key=bb7bde5661923b947ce59958773e85c5\,\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^([^,]+),(.*)$')) AS (col1,col2);
C = FOREACH B GENERATE TOTUPLE(col1),TOTUPLE(col2);
STORE C INTO 'output';
Output (stored in the output/part* files):
(key=bb7bde5661923b947ce59958773e85c5\) (\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)
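To see why the regex splits only at the first comma, here is a short Python sketch using the same pattern. The helper name split_first_comma and the simplified sample line (without the backslashes from the real input) are made up for illustration.

```python
import re

# Same pattern as in the script: everything up to the first comma,
# then the comma, then whatever is left.
SPLIT_AT_FIRST_COMMA = re.compile(r"^([^,]+),(.*)$")

def split_first_comma(line):
    """Return (before_first_comma, rest), or None if the line does not match."""
    m = SPLIT_AT_FIRST_COMMA.match(line)
    return m.groups() if m else None

# Simplified, hypothetical sample line.
line = "key=abc123,/css/bootstrap.min.css,/con-us.php"
print(split_first_comma(line))

# A line without any comma does not match the pattern at all.
print(split_first_comma("key=abc123"))
```

Because [^,]+ cannot cross a comma, the first group always stops at the first one, and (.*) keeps the remaining commas intact.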

I want to tokenize string using the following delimiters in pig: dash, comma, hash, space and colon

How can I do this using STRSPLIT, TOKENIZER or any other method?
You can use STRSPLIT with a regex to solve this. I am not sure whether each line of your input contains a single delimiter or a mix of them (dash, comma, hash, space and colon), but the solution below works in both cases.
input
a#b c-d,e
f e,g#h:i
1,2,3,4,5
l#y#z#h#n
A B C D E
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'[-,:\\s#]',5));
DUMP B;
Output:
(a,b,c,d,e)
(f,e,g,h,i)
(1,2,3,4,5)
(l,y,z,h,n)
(A,B,C,D,E)
If your input uses only a single delimiter, say '#' or any of the others you mentioned, try the script below (the third argument, 5, is the maximum number of fields to return, here the number of columns in your input):
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'#',5));
To support an additional delimiter, say '$', just add it inside the regex character class.
Note that '$' is a special character in regex, so escape it with a double backslash, and keep the dash first in the class so that it stays a literal dash, like this: '[-,:\\s#\\$]'. Writing '[\\$-,:\\s#]' would turn $-, into a character range that no longer matches the dash.
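A quick way to sanity-check a delimiter class before running the Pig job is to try it in any regex engine. The Python sketch below assumes STRSPLIT follows Java's String.split(regex, limit), where a positive limit caps the number of returned fields. The helper name strsplit and the extra test strings are made up for illustration.

```python
import re

def strsplit(line, pattern, limit):
    # Assumption: a positive limit caps the number of returned fields,
    # as in Java's String.split(regex, limit). Python's maxsplit counts
    # splits instead, hence limit - 1.
    return re.split(pattern, line, maxsplit=limit - 1)

DELIMS = r"[-,:\s#]"

for line in ["a#b c-d,e", "f e,g#h:i", "1,2,3,4,5"]:
    print(strsplit(line, DELIMS, 5))

# With more fields than the limit, the tail stays unsplit in the last field.
print(strsplit("a,b,c,d,e,f", DELIMS, 5))

# Dash position matters once '$' is added to the class:
dash_first = strsplit("c-d$e", r"[-,:\s#\$]", 5)  # dash is a literal delimiter
dash_range = strsplit("c-d$e", r"[\$-,:\s#]", 5)  # $-, is read as a range
print(dash_first, dash_range)
```

In the last line, the second pattern treats $-, as the range from '$' to ',', which does not include the dash, so c-d stays in one piece.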

Resources