I need help writing a Pig script that counts the number of words in a
file containing the text below:
What|is|Hadoop
History|of|Hadoop
How|Hadoop|name|was|given
Problems|with|Traditional|Large-Scale|Systems|and|Need|for|Hadoop
Understanding|Hadoop|Architecture
Fundamental|of|HDFS|(Blocks,|Name|Node,|Data|Node,|Secondary|Name|Node)
Rack|Awareness
Read/Write|from|HDFS
HDFS|Federation|and|High|Availability
Load the data as a chararray, replace each '|' with a space (' '), and tokenize the line to get the words. Then group the words and count them.
A = LOAD '/user/hadoop/data.txt' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(REPLACE(line,'\\|',' ')));
C = GROUP B BY $0;
D = FOREACH C GENERATE group, COUNT(B);
DUMP D;
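For anyone who wants to check the logic outside Pig, here is a rough Python sketch of the same replace, tokenize, group and count steps. It is not Pig code. The sample lines are a few of the lines from the question, and the character set used for splitting is an assumption meant to approximate TOKENIZE's default delimiters (space, double quote, comma, parentheses, asterisk).

```python
import re
from collections import Counter

# A few lines from the question's file (not the whole file).
lines = [
    "What|is|Hadoop",
    "History|of|Hadoop",
    "How|Hadoop|name|was|given",
    "Fundamental|of|HDFS|(Blocks,|Name|Node,|Data|Node,|Secondary|Name|Node)",
]

def word_count(lines):
    counts = Counter()
    for line in lines:
        # Same idea as REPLACE(line, '\\|', ' '): turn every pipe into a space.
        spaced = line.replace("|", " ")
        # Assumed approximation of TOKENIZE's default delimiters.
        tokens = [t for t in re.split(r'[ ",()*]+', spaced) if t]
        # GROUP ... BY word followed by COUNT collapses to a counter update.
        counts.update(tokens)
    return counts

counts = word_count(lines)
print(counts["Hadoop"], counts["Node"], counts["of"])
```

Because the punctuation is treated as a delimiter, "(Blocks," and "Node)" end up counted as "Blocks" and "Node" rather than as separate words with brackets attached.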
I am new to Pig Latin. I want to process the file below and find the most frequently occurring word.
Hadoop|is|an|open|source|Java-based|programming|framework|that|supports|
the|processing|and|storage|of|extremely|large|data|sets|in|a|distributed|computing|environment.
The file contains a | as a delimiter.
There are quite a few word count examples out there. In any case, here is one that handles the '|' delimiter:
lines = LOAD 'input.txt' AS (line:chararray);
newlines = FOREACH lines GENERATE REPLACE(line,'\\|',' ') AS newline;
words = FOREACH newlines GENERATE FLATTEN(TOKENIZE(newline)) as word;
grouped = GROUP words BY word;
w_count = FOREACH grouped GENERATE group, COUNT(words);
DUMP w_count;
I have a sample dataset that looks something like this:
tmj_dc_mgmt, Washington, en, 483, 457, 256, ['hiring', 'BusinessMgmt', 'Washington', 'Job']
SRiku0728, 福山市, ja, 6705, 357, 273, ['None']
BesiktaSeyma_, Akyurt, tr, 12921, 1801, 283, ['None']
AnnaKFrick, Virginia, en, 5731, 682, 1120, ['Investment', 'PPP', 'Bogota', 'jobs']
Accprimary, Manchester, en, 1650, 268, 404, ['None']
The data inside the square brackets are hashtags. I want to find the top 10 hashtags in the whole list.
I have got this far, but I am not sure how to proceed.
twitter_feed = LOAD '/twitter-data-mining/15' USING PigStorage(',');
hash_tags = FOREACH twitter_feed GENERATE $7;
fallten = FILTER hash_tags BY $1 MATCHES '\w+'|'\w+(\s\w+)*'
DUMP fallten;
Any help in the right direction would be appreciated.
Thanks!
The load statement is incorrect. There are two ways to get the hashtags. The first is to load using '[' as the delimiter and then manipulate the string to count the hashtags. The second is to load the entire line and use REGEX_EXTRACT_ALL to extract the hashtags. I am listing the first way; see below.
Load using '[' as the delimiter, which will give 2 fields.
Extract the second field, i.e. $1, and remove the right bracket ']' and
all single quotes.
Tokenize the resulting field to get all the hashtags.
Keep only the hashtags that do not match 'None'.
Group the hashtags.
Count each group.
Note: I am not changing the case of the hashtags, since that is trivial.
A = LOAD 'test10.txt' USING PigStorage('[');
B = FOREACH A GENERATE REPLACE(REPLACE($1,']',''),'\'','');
C = FOREACH B GENERATE FLATTEN(TOKENIZE(*));
D = FILTER C BY NOT($0 MATCHES 'None');
E = GROUP D by $0;
F = FOREACH E GENERATE group,COUNT(D.$0);
DUMP F;
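To make the steps easier to follow, here is a small Python sketch of the same pipeline, run on the sample rows from the question. It is only an illustration, not Pig code. The final "keep the n most frequent" step is my own addition, since the question asks for the top 10 and the script above stops at the counts; in Pig that would presumably be an ORDER ... DESC followed by a LIMIT.

```python
from collections import Counter

# Sample rows copied from the question.
rows = [
    "tmj_dc_mgmt, Washington, en, 483, 457, 256, ['hiring', 'BusinessMgmt', 'Washington', 'Job']",
    "SRiku0728, 福山市, ja, 6705, 357, 273, ['None']",
    "BesiktaSeyma_, Akyurt, tr, 12921, 1801, 283, ['None']",
    "AnnaKFrick, Virginia, en, 5731, 682, 1120, ['Investment', 'PPP', 'Bogota', 'jobs']",
    "Accprimary, Manchester, en, 1650, 268, 404, ['None']",
]

def top_hashtags(rows, n=10):
    counts = Counter()
    for row in rows:
        # Split on '[' and keep the second field, like PigStorage('[') and $1.
        parts = row.split("[", 1)
        if len(parts) < 2:
            continue  # no hashtag list on this row
        # Drop the closing bracket and the single quotes.
        cleaned = parts[1].replace("]", "").replace("'", "")
        # Split on commas and trim spaces to get the individual tags.
        tags = [t.strip() for t in cleaned.split(",") if t.strip()]
        # Skip the 'None' placeholder, then count what is left.
        counts.update(t for t in tags if t != "None")
    # Added step (not in the Pig script): keep only the n most frequent tags.
    return counts.most_common(n)

top = top_hashtags(rows)
print(top)
```

In this sample every hashtag appears exactly once, so the ordering only becomes meaningful on the full dataset.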
My input file is words.txt, shown below. There are no spaces in any record of this file.
Hi
Hi
How
I am loading this file into Pig
words = LOAD '/user/inputs/words.txt' USING PigStorage() AS (line:chararray);
words_each = FOREACH words GENERATE REPLACE(line,'','|') ;
dump words_each;
I am getting output as
|H|i|
|H|i|
|H|o|w|
But I would like to know exactly how the REPLACE function treats '', which is my second argument.
There are no spaces in my file, so why am I getting | in my output?
As written, your statement calls REPLACE on '', which is an empty string and contains no whitespace. As a pattern, the empty string matches at every position in the line, including before the first character and after the last one.
If you want to replace spaces, you need to write ' ' instead.
The two cases are different, as shown below:
words_each = FOREACH words GENERATE REPLACE(line,'','|') ; -- without space
words_each = FOREACH words GENERATE REPLACE(line,' ','|') ; -- with space
The first statement inserts the pipe symbol (|) at every position, i.e. before the first character and after each character, while the second has no effect because there are no spaces in your file.
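If it helps, the same behaviour can be reproduced outside Pig. The sketch below uses Python's re module and assumes that Pig's REPLACE treats its second argument as a regular expression and replaces every match; replace_all is just a helper name I made up for the illustration.

```python
import re

def replace_all(text, pattern, replacement):
    # Regex-based replacement of every match (the behaviour assumed for REPLACE).
    return re.sub(pattern, replacement, text)

for word in ["Hi", "How"]:
    # Empty pattern: matches at every position, so a pipe lands everywhere.
    with_empty = replace_all(word, "", "|")
    # Single-space pattern: nothing to match, so the word is unchanged.
    with_space = replace_all(word, " ", "|")
    print(with_empty, with_space)
```

The empty pattern matches between every pair of characters and at both ends of the string, which is why the pipe also shows up before the first letter.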
Hi everyone,
I have found many examples of counting words, but none of counting letters. I want to split the words into letters and count them, but my code is wrong. Can someone help me with this? Thanks very much. Here is my code:
A = load './in/*.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER((chararray)$0))) as words;
C = FOREACH B GENERATE FLATTEN(REGEX_EXTRACT_ALL(words, '([a-zA-Z])')) as letter;
D = group C by letter;
E = FOREACH D GENERATE COUNT(C), group;
DUMP E;
Change the corresponding line as below:
C = foreach B generate flatten(TOKENIZE(REPLACE(words,'','|'), '|')) as letter;
The trick I have used is to replace each letter boundary with a special character (|) and then tokenize with that character as the delimiter. You can also use an uncommon string sequence instead of the special character.
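Here is a minimal Python sketch of that trick, just to show what each step produces. It assumes the empty pattern matches at every letter boundary, and that tokenizing drops the empty pieces at the start and end; the input words are made up for the example.

```python
import re
from collections import Counter

def letters(word):
    # Like REPLACE(words, '', '|'): put a pipe at every letter boundary.
    piped = re.sub("", "|", word)
    # Like TOKENIZE(..., '|'): split on the pipe; empty pieces are dropped here.
    return [piece for piece in piped.split("|") if piece]

def letter_count(words):
    counts = Counter()
    for word in words:
        # Lower-case first, as the question's script does with LOWER.
        counts.update(letters(word.lower()))
    return counts

counts = letter_count(["Hi", "How"])  # hypothetical input words
print(counts)
```

Note that a word that already contains '|' would lose that character with this approach, which is why an uncommon delimiter is the safer choice.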
My CSV file contains 150 columns, with double quotes ("") as text qualifiers. How can I remove the quotes using a dynamic Pig/Hive/HBase script? I also have multiple files (50 CSV files with different columns). How can I remove the quotes from all of them?
I tried the Pig script below for a file with 2 columns:
A = LOAD 'hdfs://<hostname>:<port>/user/test/input.csv' AS line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'"(.*)","(.*)"')) AS (id:int,name:chararray);
STORE B INTO '/user/test/output' USING PigStorage(',');
Any help would be appreciated.
Can you try it like this?
input.txt
"123","456","789"
"abc","def","ghi"
PigScript:
A = LOAD 'input.txt' AS line;
B = FOREACH A GENERATE REPLACE(line,'\\"','') AS line1;
C = FOREACH B GENERATE FLATTEN(STRSPLIT(line1,'\\,',3));
D = FOREACH C GENERATE $0,$1,$2;
DUMP D;
Output:
(123,456,789)
(abc,def,ghi)
In your case, change the third statement above to STRSPLIT(line1,'\\,',150), where 150 is the total number of columns; you can then access all the values as $0, $1, ... $149.
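The idea is the same in any language: strip the qualifier first, then split on the delimiter. A quick Python sketch using the two sample rows above (strip_qualifiers is a hypothetical helper name, not part of Pig or any library):

```python
def strip_qualifiers(line, delimiter=","):
    # Remove every double quote, then split on the delimiter.
    return line.replace('"', "").split(delimiter)

# The two rows from input.txt above.
rows = ['"123","456","789"', '"abc","def","ghi"']
parsed = [strip_qualifiers(row) for row in rows]
print(parsed)
```

Since nothing here depends on the column count, the same function would handle files with different numbers of columns. One caveat: this simple approach breaks if a quoted value itself contains a comma.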