Strange behavior while splitting string with non-word character regex - ruby

Case 1(Trailing space)
> "on behalf of all of us ".split(/\W+/)
=> ["on", "behalf", "of", "all", "of", "us"]
But if there is a leading space, it gives the following:
Case 2(Leading space)
> " on behalf of all of us".split(/\W+/)
=> ["", "on", "behalf", "of", "all", "of", "us"]
I was expecting the result of Case 1 for Case 2 as well.
ADDED
> "#dhh congratulations!!".split(/\W+/)
=> ["", "dhh", "congratulations"]
Would anyone please help me to understand the behavior?

[Update]
Skip regex, just Split on space!
> "#dhh congratulations!!".split
=> ["#dhh", "congratulations"]
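\W matches any non-word character, including a space. split treats every match as a delimiter and returns the text on either side of it. When the string starts with a delimiter, the text before it is the empty string, so the result begins with "". A delimiter at the end also produces an empty field, but split drops trailing empty fields by default, which is why Case 1 looks clean.
To get consistent behavior when the extra characters are whitespace, remove them first with the #strip method. Note that strip only removes whitespace, so it does not help with a leading "#".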
Case 1 (leading and trailing space)
> " on behalf of all of us ".strip.split(/\W+/)
=> ["on", "behalf", "of", "all", "of", "us"]
Case 2 (trailing space)
> "on behalf of all of us ".strip.split(/\W+/)
=> ["on", "behalf", "of", "all", "of", "us"]

From the docs:
split(pattern=$;, [limit]) → anArray
[...]
If the limit parameter is omitted, trailing null fields are suppressed. If limit is a positive number, at most that number of fields will be returned (if limit is 1, the entire string is returned as the only entry in an array). If negative, there is no limit to the number of fields returned, and trailing null fields are not suppressed.
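The effect of limit described in the docs can be checked directly. This is a minimal sketch: the helper name fields is made up for illustration, and a limit of 0 behaves like an omitted limit.

```ruby
# Hypothetical helper: split on runs of non-word characters with an explicit limit.
def fields(str, limit = 0)
  str.split(/\W+/, limit)
end

p fields(" on behalf of us ")                   # leading "" kept, trailing "" dropped
p fields(" on behalf of us ", -1)               # negative limit keeps the trailing "" too
p fields(" on behalf of us ").reject(&:empty?)  # drop empty fields wherever they appear
```

The leading empty field is never suppressed, so reject(&:empty?) (or stripping the delimiters first) is needed if it is unwanted.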

Just for the record, the following works for me:
" #dhh congratulations!!".gsub(/^\W+/,'').split /\W+/
Another one
" #dhh congratulations!!".scan /\w+/
Both give the expected results. However, there is a caveat for contractions like
> " Don't be shy.".scan /\w+/
=> ["Don", "t", "be", "shy"]
I am actually collecting words which are not articles, conjunctions, prepositions etc. So anyway I am ignoring such short forms and hence I used this solution.
I am preparing a word cloud from tweets. If you know of a proven algorithm, please share.
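If contractions should survive, one option is to allow an apostrophe between word characters and then filter out unwanted words. This is only a sketch: STOP_WORDS is a tiny hypothetical list and content_words is an invented name, not something from the question.

```ruby
# Hypothetical stop-word list; a real word cloud would need a much longer one.
STOP_WORDS = %w[a an the and or but of on in to be].freeze

# Scan for words, allowing an apostrophe inside a word ("don't"),
# then drop anything found in STOP_WORDS.
def content_words(text)
  text.downcase.scan(/\w+(?:'\w+)*/).reject { |word| STOP_WORDS.include?(word) }
end

p content_words(" Don't be shy.")
p content_words("#dhh congratulations!!")
```

Because scan describes the words rather than the delimiters, leading punctuation never produces an empty first element.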

Related

Ruby regular expressions for finding words

I'm quite new to regular expressions. I am using the regular expression:
/\w+/
To check for words, and it's obvious that this will have problems with punctuation, but I'm not quite sure how to change this regular expression. For example, when I run this command from a class I made:
Wordify.new.regex(/\w+/).string("This sentence isn't 'the best-example, isn't it not?...").display
I get the output:
-----------
this: 1
sentence: 1
isn: 2
t: 2
the: 1
best: 1
example: 1
it: 1
not: 1
-----------
How can I adjust the regular expression so that it matches words with apostrophes, such as isn't, as one word, but matches only the when searching 'the or the'? Hyphens in the middle of a word like stack-overflow should return stack and overflow separately, which this already does.
Additionally, words shouldn't be able to start or end with numbers: test1241 or 436test should become test, but te7st is okay. Plain numbers should not be recognised.
Sorry, I know this is a big ask, but I'm not sure where to start with regex. Would be grateful if you could also explain what the expression means if possible.
str = "This is 2a' 4test' of my agréable re4'gex, n'est-ce pas?"
r = /
  [[:alpha:]]               # match a letter
  (?:                       # begin the outer non-capture group
    (?:[[:alpha:]]|\d|')    # match a letter, digit or apostrophe in a non-capture group
    *                       # execute the above non-capture group zero or more times
    [[:alpha:]]             # match a letter
  )?                        # close the outer non-capture group and make it optional
/x                          # free-spacing regex definition mode
str.scan r
#=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est", "ce", "pas"]
Note the outer non-capture group is made optional so that a word consisting of a single letter still matches.
Hmmm. Maybe we should add a hyphen to the inner non-capture group.
r = /[[:alpha:]](?:(?:[[:alpha:]]|\d|'|-)*[[:alpha:]])?/
str.scan r
#=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est-ce", "pas"]
I now rarely use the word-matching character \w, mainly because it matches the underscore, as well as letters and digits. Instead I reach for a POSIX bracket expression (search "POSIX"), which has the added (perhaps primary) benefit that it is not English-centric. For example, matching a word character with the exception of an underscore is [[:alnum:]].
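The difference between \w and a POSIX bracket expression can be seen on a string that contains an underscore and a non-ASCII letter. A small sketch, with made-up helper names and sample text:

```ruby
# "caf\u00e9" is "café"; the escape keeps this snippet ASCII-only.
SAMPLE = "caf\u00e9_au lait"

# \w covers only ASCII letters, digits and the underscore.
def ascii_words(str)
  str.scan(/\w+/)
end

# [[:alnum:]] is Unicode-aware and excludes the underscore.
def posix_words(str)
  str.scan(/[[:alnum:]]+/)
end

p ascii_words(SAMPLE)  # splits at the accented letter, keeps the underscore
p posix_words(SAMPLE)  # keeps the accented word whole, splits at the underscore
```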
You can do something basic using:
/[a-z]+(?:'[a-z]+)*/i
To extend it to allow words like a2b while avoiding 123abc, abc123 and plain numbers:
/[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i
There are no special regex features used in the two patterns, only basics.
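To see how the extended pattern treats the cases from the question, it can be run against a short sample. The constant and method names here are invented for the example.

```ruby
# Letters, optionally followed by groups of apostrophe+letters or digits+letters.
WORD = /[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i

def words(str)
  str.scan(WORD)
end

p words("isn't 'the a2b 123abc abc123 555 te7st")
```

Leading and trailing digits fall away because every match must start with a letter and every digit run must be followed by letters.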
Try scanning the string using the [[:alpha:]] POSIX character class:
s = "This a sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
s.scan(/[[:alpha:]](?:['\w]*[[:alpha:]])?/)
# => ["This", "a", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
[First attempt]
I split the string into tokens separated by whitespace or hyphens then clean up each token per your rules, since it seems like they might be adjusted as you refine your problem:
def tokenize(str)
  tokens = str.split(/(?:\s+|-)/)
  tokens.reduce([]) do |memo, token|
    token.gsub!(/(^\W+|\W+$)/, '') # Strip enclosing non-words
    token.gsub!(/(^\d+|\d+$)/, '') # Strip enclosing digits
    memo + (token=='' ? [] : [token]) # Ignore the empty string
  end
end
s = "This sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
puts tokenize(s).inspect
# ["This", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
Clearly this solution doesn't use just regular expressions, but for my money it's much easier to understand and modify than what I imagine a big regex would look like!

Making each word of input an array element

For example, say I have input like follows:
" see all of these cool spaces "
Omit the quotes. What I'm looking for is how to turn that into an array of words. Like this:
['see', 'all', 'of', 'these', 'cool', 'spaces']
Thanks
Here's one way: Use split (see String#split):
string.split
By default, split will split the string into an array where the whitespace is, ignoring leading and trailing whitespace. Exactly what you're asking for. This is the same as using the more explicit string.split(" ").
" see all of these cool spaces ".split
#=> ["see", "all", "of", "these", "cool", "spaces"]
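For contrast, here is a small sketch of how the default whitespace split differs from splitting on a single-space regex. The method names and sample string are made up for illustration.

```ruby
SPACED = " see  all "  # one leading space, two spaces in the middle, one trailing

# No argument: split on runs of whitespace, ignoring leading and trailing whitespace.
def whitespace_split(str)
  str.split
end

# Regex of exactly one space: every space is a delimiter, so empty fields appear.
def single_space_split(str)
  str.split(/ /)
end

p whitespace_split(SPACED)
p single_space_split(SPACED)
```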

How can I match Word Boundary "or" [#@]?

I can't seem to get a regex that matches either a hashtag #, an @, or a word-boundary. The goal is to break a string into Twitter-like entities and topics so:
input = "Hello #world, #ruby anotherString"
input.scan(entitiesRegex)
# => ["Hello", "#world", "#ruby", "anotherString"]
To get just the words, excluding "anotherString" which is too large, is simple:
/\b\w{3,12}\b/
will return ["Hello", "world", "ruby"]. Unfortunately this doesn't include the hashtags and @s. It seems like it should work simply with:
/[\b#@]\w{3,12}\b/
but that returns ["#world", "#ruby"]. This made me realize that a word boundary is not a character, so it doesn't fall into the category of "a single character" and won't match inside a character class. A few more attempts:
/\b|[#@]\w{3,12}\b/
returns ["", "", "#world", "", "#ruby", "", "", ""].
/(\b|[#@])\w{3,12}\b/
matches the right things, but returns [[""], ["#"], ["#"]] as expected, because the parentheses also mean capture everything enclosed.
/((\b|[#@])\w{3,12}\b)/
kind of works. It returns [["Hello", ""], ["#world", "#"], ["#ruby", "#"]]. So now all the correct items are there, they're just located at the first element of each of the subarrays. The following snippet technically works:
input.scan(/((\b|[#@])\w{3,12}\b)/).collect(&:first)
Is it possible to simplify this to match and return the correct substrings with just the regular expression not requiring the collect post-processing?
You can just use the regular expression /[#@]?\b\w+\b/. That is, optionally match a # or @, followed by a word boundary (in #ruby, that boundary would be between # and ruby; in a normal word it would also match at the start of the word) and a bunch of word characters.
p "Hello #world, #ruby anotherString".scan(/[#@]?\b\w+\b/)
# => ["Hello", "#world", "#ruby", "anotherString"]
Furthermore, you can adjust the number of characters a matching word should have with quantifiers. You gave an example in a comment to a deleted answer to match only #ruby by using {3,4}:
p "Hello #world, #ruby anotherString".scan(/[#@]?\b\w{3,4}\b/)
# => ["#ruby"]
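Another way to avoid the collect step is a non-capturing group: scan returns whole matches when the pattern has no capture groups. This sketch assumes the intended prefix characters are # and @, and keeps the 3 to 12 character limit from the question. ENTITY and entities are names invented here.

```ruby
# (?: ... ) groups the alternation without capturing, so scan yields plain strings.
ENTITY = /(?:\b|[#@])\w{3,12}\b/

def entities(str)
  str.scan(ENTITY)
end

p entities("Hello #world, #ruby anotherString")
p entities("@matz says hi")
```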

String splitting with unknown punctuation in Ruby

I am building an application that downloads sentences and parses them for a word game. I don't know in advance what punctuation the text will contain.
I'd like to be able to split up the sentence/s, examine them for part of speech tag, and if the correct tag is found, replace it with " ", and rejoin them back in order.
text = "some string, with punctuation- for example: things I don't know about, that may or may not have whitespaces and random characters % !!"
How can I split it into an array so that I can pass the parser over each word, and rejoin them in order, bearing in mind that string.split(//) seems to need to know what punctuation I'm looking for?
split is useful when you can more easily describe the delimiters than the parts to be extracted. In your case, you can more easily describe the parts to be extracted than the delimiters, so scan is better suited. You should use scan.
text.scan(/[\w']+/)
# => ["some", "string", "with", "punctuation", "for", "example", "things", "I", "don't", "know", "about", "that", "may", "or", "may", "not", "have", "whitespaces", "and", "random", "characters"]
If you want to replace the matches, there is even more reason to not use split. In that case, you should use gsub.
text.gsub(/[\w']+/) do |word|
  if word.is_of_certain_part_of_speech?
    "___" # Replace it with `"___"`.
  else
    word # Put back the original word.
  end
end
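A runnable variant of the same idea, with the part-of-speech check replaced by a plain list lookup. The blank_out method and its targets argument are stand-ins invented for this sketch; a real tagger would take the place of the include? test.

```ruby
# Replace every word found in `targets` with "___", leaving punctuation
# and spacing untouched because gsub only rewrites the matched words.
def blank_out(text, targets)
  text.gsub(/[\w']+/) do |word|
    targets.include?(word.downcase) ? "___" : word
  end
end

sample = "some string, with punctuation- for example: things I don't know"
puts blank_out(sample, %w[string example don't])
```

No rejoin step is needed, since the text between the matches is never removed.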

Ruby count number of found keywords in string

I have an array with keywords and I have a string, which may contain those keywords. I now need to know how many keywords are in the given string:
keywords = [ 'text' ,'keywords' ,'contains' ,'blue', '42']
text = 'This text is not long but it contains 3 keywords'
How can I now find out with a ruby command how many of the strings in my array are in the text (three in this case)? I could of course use a for each loop but I am almost sure that there is a more concise way to achieve this.
Thanks for your help
Update: Preferably the solution should not rely on the spaces. So the spaces could be replaced by arbitrary characters.
Update 2: The command should look for unique occurrences.
Here's one approach:
text.scan(/#{keywords.join('|')}/).length
Note that this is safe only if the keywords array contains only alphanumeric characters.
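One possible way around that restriction, and to count each keyword once as the second update asks, is Regexp.union, which escapes special characters in the strings it is given. The method name keyword_hits is made up for this sketch.

```ruby
# Count how many distinct keywords occur anywhere in the text.
# Regexp.union escapes metacharacters, so keywords like "c++" are safe.
def keyword_hits(text, keywords)
  text.scan(Regexp.union(keywords)).uniq.length
end

keywords = ['text', 'keywords', 'contains', 'blue', '42']
text = 'This text is not long but it contains 3 keywords'
p keyword_hits(text, keywords)
p keyword_hits('I like c++ and c++', ['c++', 'ruby'])
```

This matches substrings, so it does not rely on spaces between words, but it also means a keyword inside a longer word is counted.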
Not exactly what you wanted but
irb(main):012:0> text.split(' ')
=> ["This", "text", "is", "not", "long", "but", "it", "contains", "3", "keywords"]
irb(main):013:0> text.split(' ') & keywords
=> ["text", "contains", "keywords"]
will give you an array with matches
