Hi everyone,
I can find many examples of counting words, but none of counting letters. I just want to split the words into letters and count them, but my code is wrong. Can someone help me with this? Thanks very much. This is my code:
A = load './in/*.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE(LOWER((chararray)$0))) as words;
C = FOREACH B GENERATE FLATTEN(REGEX_EXTRACT_ALL(words, '([a-zA-Z])')) as letter;
D = group C by letter;
E = FOREACH D GENERATE COUNT(C), group;
DUMP E;
Change the corresponding line as follows:
C = foreach B generate flatten(TOKENIZE(REPLACE(words,'','|'), '|')) as letter;
The trick is to replace each letter boundary with a special character (|) and then tokenize with that character as the delimiter. You can also use an uncommon string sequence instead of the special character.
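For readers who want to see the boundary trick outside Pig, here is a minimal Python sketch of the same idea. It is only an illustration, not the Pig implementation: the function name count_letters is made up, and it assumes the words never contain the '|' marker themselves. Like the Pig line, it counts every character of a word, not only a to z.

```python
from collections import Counter


def count_letters(lines):
    """Count characters per word using the 'mark every boundary' trick."""
    counts = Counter()
    for line in lines:
        # Lowercase and split on whitespace, roughly like LOWER + TOKENIZE.
        for word in line.lower().split():
            # Replacing the empty string inserts '|' at every boundary:
            # "abc" -> "|a|b|c|"
            marked = word.replace("", "|")
            # Splitting on '|' then yields one token per character.
            letters = [token for token in marked.split("|") if token]
            counts.update(letters)
    return counts


if __name__ == "__main__":
    for letter, total in sorted(count_letters(["Hello world"]).items()):
        print(total, letter)
```

Python's str.replace treats the empty search string the same way Java's regex-based replacement does here, which is why the marked string looks identical in both.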
Is there an efficient way to find duplicate substrings? Here, duplicate means that two identical substrings sit directly next to each other without overlapping. For example, the source string is:
ABCDDEFGHFGH
'D' and 'FGH' are duplicated. 'F' appears twice in the sequence, but the two occurrences are not adjacent, so it does not count as a duplicate. The algorithm should therefore return ['D', 'FGH']. Is there an elegant algorithm that avoids the brute-force method?
This is related to the longest repeated substring problem, which can be solved by building a suffix tree in Θ(n) time and space.
Not very efficient (a suffix tree or suffix array is better for very large strings), but here is a very short regular expression solution (C#):
using System.Linq;
using System.Text.RegularExpressions;

string source = @"ABCDDEFGHFGH";
string[] result = Regex
    .Matches(source, @"(.+)\1")
    .OfType<Match>()
    .Select(match => match.Groups[1].Value)
    .ToArray();
Explanation
(.+) - group of any (at least 1) characters
\1 - the same group (group #1) repeated
Test
Console.Write(string.Join(", ", result));
Outcome
D, FGH
In case of ambiguity, e.g. "AAAA", where both "AA" and "A" are valid answers, the pattern matches greedily, so "AA" is returned.
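The same backreference pattern can be tried in other backtracking regex engines. As a rough cross-check, here is a small Python sketch; the helper name adjacent_repeats is hypothetical and not part of the answer above.

```python
import re


def adjacent_repeats(source):
    """Return each substring that is immediately followed by a copy of itself."""
    # (.+) captures one or more characters; \1 requires the same text again.
    return [match.group(1) for match in re.finditer(r"(.+)\1", source)]


if __name__ == "__main__":
    print(adjacent_repeats("ABCDDEFGHFGH"))
    print(adjacent_repeats("AAAA"))
```

Because the quantifier is greedy, the engine tries the longest candidate for the group first, which is why "AAAA" yields "AA" rather than "A".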
A regex might turn out to be very slow, so I guess it's best to use two cursors running hand in hand. The algorithm should be clear from the JS code below.
function getNborDupes(s){
  var cl = 0,   // cursor left
      cr = 0,   // cursor right
      ts = "",  // test string
      res = []; // result array
  while (cl < s.length){
    cr = cl;
    while (++cr < s.length){
      ts = s.slice(cl,cr); // ts runs from cl to cr (char at cr excluded)
      // compare ts with the substring from cr to cr + ts.length (char at cr + ts.length excluded)
      // if they match, push ts to the result, advance both cursors past the repeat and continue
      ts === s.substr(cr,ts.length) && (res.push(ts), cl = cr += ts.length);
    }
    cl++;
  }
  return res;
}
var str = "ABCDDEFGHFGH";
console.log(getNborDupes(str));
Throughout the whole process ts will take the following values.
A
AB
ABC
ABCD
ABCDD
ABCDDE
ABCDDEF
ABCDDEFG
ABCDDEFGH
ABCDDEFGHF
ABCDDEFGHFG
B
BC
BCD
BCDD
BCDDE
BCDDEF
BCDDEFG
BCDDEFGH
BCDDEFGHF
BCDDEFGHFG
C
CD
CDD
CDDE
CDDEF
CDDEFG
CDDEFGH
CDDEFGHF
CDDEFGHFG
D
E
EF
EFG
EFGH
EFGHF
EFGHFG
F
FG
FGH
The cl = cr += ts.length part decides whether the search restarts before or after the matched repeat. With the code as written, the input "ABABABAB" returns ["AB","AB"]; if you change it to cr = cl += ts.length, you should expect ["AB", "AB", "AB"].
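To make the effect of that one assignment easier to compare, here is a Python sketch of the same two-cursor loop with the choice exposed as a flag. The function name neighbor_dupes and the parameter skip_both_copies are invented for this illustration; the loop structure is meant to mirror the JS above, not to improve on it.

```python
def neighbor_dupes(s, skip_both_copies=True):
    """Two-cursor scan for substrings immediately followed by a copy of themselves."""
    res = []
    cl = 0  # left cursor
    while cl < len(s):
        cr = cl + 1  # right cursor (exclusive end of the test string)
        while cr < len(s):
            ts = s[cl:cr]
            if ts == s[cr:cr + len(ts)]:
                res.append(ts)
                if skip_both_copies:
                    # Same as `cl = cr += ts.length`: resume after the second copy.
                    cr += len(ts)
                    cl = cr
                else:
                    # Same as `cr = cl += ts.length`: resume at the second copy.
                    cl += len(ts)
                    cr = cl
            cr += 1
        cl += 1
    return res


if __name__ == "__main__":
    print(neighbor_dupes("ABCDDEFGHFGH"))
    print(neighbor_dupes("ABABABAB"))
    print(neighbor_dupes("ABABABAB", skip_both_copies=False))
```

With the flag off, the second copy of a repeat is allowed to act as the first copy of the next one, which is where the extra "AB" comes from.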
I need help writing a Pig script that counts the number of words in a file containing the text below:
What|is|Hadoop
History|of|Hadoop
How|Hadoop|name|was|given
Problems|with|Traditional|Large-Scale|Systems|and|Need|for|Hadoop
Understanding|Hadoop|Architecture
Fundamental|of|HDFS|(Blocks,|Name|Node,|Data|Node,|Secondary|Name|Node)
Rack|Awareness
Read/Write|from|HDFS
HDFS|Federation|and|High|Availability
Load each line as a chararray, replace the '|' with a space (' '), and tokenize the line to get the words. Then group and count the words.
A = LOAD '/user/hadoop/data.txt' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(REPLACE(line,'\\|',' ')));
C = GROUP B BY $0;
D = FOREACH C GENERATE group, COUNT(B);
DUMP D;
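The replace, tokenize, group and count steps can be sketched in Python to check the logic on a few lines before running the job. This is an approximation, not Pig itself: count_words is a made-up name, and the separator set (whitespace, double quote, comma, parentheses, star) is my reading of what TOKENIZE uses by default.

```python
import re
from collections import Counter


def count_words(lines):
    """Replace '|' with a space, tokenize, then count each word."""
    counts = Counter()
    for line in lines:
        spaced = line.replace("|", " ")
        # Assumed default TOKENIZE separators: space, ", comma, ( ) and *.
        words = [w for w in re.split(r'[\s",()*]+', spaced) if w]
        counts.update(words)
    return counts


if __name__ == "__main__":
    sample = ["What|is|Hadoop", "History|of|Hadoop"]
    for word, total in count_words(sample).items():
        print(word, total)
```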
I have a sample dataset that looks like this:
tmj_dc_mgmt, Washington, en, 483, 457, 256, ['hiring', 'BusinessMgmt', 'Washington', 'Job']
SRiku0728, 福山市, ja, 6705, 357, 273, ['None']
BesiktaSeyma_, Akyurt, tr, 12921, 1801, 283, ['None']
AnnaKFrick, Virginia, en, 5731, 682, 1120, ['Investment', 'PPP', 'Bogota', 'jobs']
Accprimary, Manchester, en, 1650, 268, 404, ['None']
The data inside the square brackets are hashtags. I want to count the top 10 hashtags in the whole list.
I have reached this far, and I am not sure how to proceed.
twitter_feed = LOAD '/twitter-data-mining/15' USING PigStorage(',');
hash_tags = FOREACH twitter_feed GENERATE $7;
fallten = FILTER hash_tags BY $1 MATCHES '\w+'|'\w+(\s\w+)*'
DUMP fallten;
Any help in the right direction would be appreciated.
Thanks!
The load statement is incorrect. There are two ways to get the hashtags. The first is to load using '[' as the delimiter and then manipulate the string to count the hashtags. The second is to load the entire line and use REGEX_EXTRACT_ALL to get the hashtags. I am listing the first way; see below.
Load using '[' as the delimiter, which gives 2 fields.
Extract the second field, i.e. $1, and remove the right bracket ']' and all single quotes.
Tokenize the resulting field to get all the hashtags.
Filter out the hashtags that match 'None'.
Group the hashtags.
Count the groupings.
Note: I am not changing the case of the hashtags, since that is trivial.
A = LOAD 'test10.txt' USING PigStorage('[');
B = FOREACH A GENERATE REPLACE(REPLACE($1,']',''),'\'','');
C = FOREACH B GENERATE FLATTEN(TOKENIZE(*));
D = FILTER C BY NOT($0 MATCHES 'None');
E = GROUP D by $0;
F = FOREACH E GENERATE group,COUNT(D.$0);
DUMP F;
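The steps above can be sketched in Python to test them on a handful of lines. This is an illustration under assumptions: top_hashtags is a hypothetical name, each line is assumed to contain at most one '[', and the final most_common(n) call is my addition for the "top 10" part of the question, which the Pig script above does not do.

```python
import re
from collections import Counter


def top_hashtags(lines, n=10):
    """Split on '[', strip ']' and quotes, tokenize, drop 'None', count."""
    counts = Counter()
    for line in lines:
        parts = line.split("[", 1)
        if len(parts) < 2:
            continue  # no hashtag list on this line
        cleaned = parts[1].replace("]", "").replace("'", "")
        tags = [t for t in re.split(r"[\s,]+", cleaned) if t]
        counts.update(t for t in tags if t != "None")
    # Sort by count, highest first, and keep the first n entries.
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        "tmj_dc_mgmt, Washington, en, 483, 457, 256, ['hiring', 'Job']",
        "Accprimary, Manchester, en, 1650, 268, 404, ['None']",
    ]
    print(top_hashtags(sample))
```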
I want to split the following tuple into two tuples using a Pig script.
(key=bb7bde5661923b947ce59958773e85c5\,\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)
I want the output as follows:
(key=bb7bde5661923b947ce59958773e85c5\) (\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)
Yes, you can solve this using a regex and the TOTUPLE function. First split the string into two parts: the first column is everything before the first comma and the second column is the rest of the string. Then convert the two columns to tuples and store them.
input
key=bb7bde5661923b947ce59958773e85c5\,\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^([^,]+),(.*)$')) AS (col1,col2);
C = FOREACH B GENERATE TOTUPLE(col1),TOTUPLE(col2);
STORE C INTO 'output';
Output (stored in the output/part* files):
(key=bb7bde5661923b947ce59958773e85c5\) (\/css\/bootstrap.min.cssHTTP\/1.1\,\/con-us.php,\/con-us.phpHTTP\/1.1\)
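The regex can be checked on its own before running the script. Here is a small Python sketch using the same pattern; the function name split_at_first_comma is hypothetical, and the one-element Python tuples only stand in for what TOTUPLE produces.

```python
import re


def split_at_first_comma(line):
    """Return (before,), (after,) split at the first comma, or None if no comma."""
    # [^,]+ cannot cross a comma, so group 1 ends at the first one;
    # (.*) then takes everything that is left, further commas included.
    match = re.match(r"^([^,]+),(.*)$", line)
    if match is None:
        return None
    return (match.group(1),), (match.group(2),)


if __name__ == "__main__":
    print(split_at_first_comma("key=abc,/css/a.css,/b.php"))
```

Lines without any comma do not match, so the sketch returns None for them.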
How can I do this using STRSPLIT, TOKENIZER or any other method?
You can use STRSPLIT with a regex to solve this problem. I am not sure whether your input has a single delimiter or a combination of delimiters (hyphen, comma, colon, space and hash), but the solution below works for both.
input
a#b c-d,e
f e,g#h:i
1,2,3,4,5
l#y#z#h#n
A B C D E
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'[-,:\\s#]',5));
DUMP B;
Output:
(a,b,c,d,e)
(f,e,g,h,i)
(1,2,3,4,5)
(l,y,z,h,n)
(A,B,C,D,E)
If your input has only a single delimiter, say '#' or any other delimiter you mentioned, then try the script below (the '5' in the third argument is the total number of columns in your input):
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'#',5));
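With multiple delimiters, if you want to add a new delimiter, say '$', just add it inside the character class of the regex.
Note: '$' is a special character in regex, and escaping it with a double backslash is harmless here. Keep the hyphen as the first character of the class, though, because '\\$-,' would be read as a range from '$' to ',' and the hyphen itself would no longer be a delimiter. Write it like this: '[-\\$,:\\s#]'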
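Character-class details like the position of the hyphen are easy to get wrong, so it can help to try the pattern in isolation. The Python sketch below is only a stand-in for STRSPLIT: split_fields is a made-up name, and maxsplit = columns - 1 is my way of imitating a limit of 5 pieces. Python needs single backslashes in a raw string where the Pig script needs doubled ones.

```python
import re


def split_fields(line, pattern=r"[-,:\s#]", columns=5):
    """Split a line on any single delimiter character, into at most `columns` pieces."""
    return re.split(pattern, line, maxsplit=columns - 1)


if __name__ == "__main__":
    print(split_fields("a#b c-d,e"))
    # '$' added after the leading hyphen: both stay literal delimiters.
    print(split_fields("a$b-c", r"[-$,:\s#]"))
    # '$' placed before the hyphen: '$-,' becomes a range and '-' is ignored.
    print(split_fields("a-b", r"[\$-,:\s#]"))
```

The last call shows why the hyphen position matters: once it sits between two characters it defines a range, and a literal '-' in the input is no longer split on.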