String splitting with unknown punctuation in Ruby

I am building an application that downloads sentences and parses them for a word game. I don't know in advance what punctuation the text will contain.
I'd like to split the sentence(s) into words, check each word's part-of-speech tag, replace the word with " " if it has the tag I'm looking for, and rejoin everything in the original order.
text = "some string, with punctuation- for example: things I don't know about, that may or may not have whitespaces and random characters % !!"
How can I split it into an array so that I can pass the parser over each word, and rejoin them in order, bearing in mind that string.split(//) seems to need to know what punctuation I'm looking for?

split is useful when the delimiters are easier to describe than the parts to be extracted. In your case the parts to be extracted are easier to describe than the delimiters, so scan is the better fit. Using split here is the wrong choice; you should use scan.
text.scan(/[\w']+/)
# => ["some", "string", "with", "punctuation", "for", "example", "things", "I", "don't", "know", "about", "that", "may", "or", "may", "not", "have", "whitespaces", "and", "random", "characters"]
If you want to replace the matches, there is even more reason to not use split. In that case, you should use gsub.
text.gsub(/[\w']+/) do |word|
  if word.is_of_certain_part_of_speech?
    "___" # Replace it with `"___"`.
  else
    word # Put back the original word.
  end
end
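To make the block form concrete, here is a minimal runnable sketch. The `nouns` list and the `noun` lambda are hypothetical stand-ins for whatever part-of-speech tagger the application really uses; only the `gsub`-with-a-block idea comes from the answer above.

```ruby
# Hypothetical stand-in for a real part-of-speech tagger (an assumption).
nouns = %w[string punctuation example]
noun = ->(word) { nouns.include?(word.downcase) }

text = "some string, with punctuation- for example: things I don't know"

# gsub yields each matched word; everything between the words is left
# untouched, so punctuation and spacing survive without an explicit rejoin.
blanked = text.gsub(/[\w']+/) { |word| noun.call(word) ? "___" : word }

puts blanked
# prints: some ___, with ___- for ___: things I don't know
```

Because the non-word characters are never matched, there is no separate "rejoin" step to get wrong.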

Related

Ruby regular expressions for finding words

I'm quite new to regular expressions. I am using the regular expression:
/\w+/
To check for words, and it's obvious that this will have problems with punctuation, but I'm not quite sure how to change this regular expression. For example, when I run this command from a class I made:
Wordify.new.regex(/\w+/).string("This sentence isn't 'the best-example, isn't it not?...").display
I get the output:
-----------
this: 1
sentence: 1
isn: 2
t: 2
the: 1
best: 1
example: 1
it: 1
not: 1
-----------
How can I adjust the regular expression so that it matches words with apostrophes, such as isn't, as one word, but matches only the when it encounters 'the or the'? Hyphenated words such as stack-overflow should be returned as stack and overflow separately, which this already does.
Additionally, words shouldn't be able to start or end with numbers: test1241 or 436test should become test, but te7st is okay. Plain numbers should not be recognised.
Sorry, I know this is a big ask, but I'm not sure where to start with regex. Would be grateful if you could also explain what the expression means if possible.
str = "This is 2a' 4test' of my agréable re4'gex, n'est-ce pas?"
r = /
[[:alpha:]] # match a letter
(?: # begin the outer non-capture group
(?:[[:alpha:]]|\d|') # match a letter, digit or apostrophe in a non-capture group
* # execute the above non-capture group zero or more times
[[:alpha:]] # match a letter
)? # close the outer non-capture group and make it optional
/x # free-spacing regex definition mode
str.scan r
#=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est", "ce", "pas"]
Note that the outer non-capture group (made optional by the trailing ?) is needed in case the word to be matched is a single character.
Hmmm. Maybe we should add a hyphen to the inner non-capture group.
r = /[[:alpha:]](?:(?:[[:alpha:]]|\d|'|-)*[[:alpha:]])?/
str.scan r
#=> ["This", "is", "a", "test", "of", "my", "agréable", "re4'gex", "n'est-ce", "pas"]
I now rarely use the word-matching character \w, mainly because it matches the underscore, as well as letters and digits. Instead I reach for a POSIX bracket expression (search "POSIX"), which has the added (perhaps primary) benefit that it is not English-centric. For example, matching a word character with the exception of an underscore is [[:alnum:]].
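A quick way to see the difference described above is to scan the same text with both classes. The sample string is an assumption chosen to contain an accented letter and an underscore; in Ruby, `\w` covers only ASCII letters, digits and the underscore, while `[[:alpha:]]` is Unicode-aware.

```ruby
# Assumed sample: one accented word and one identifier with an underscore.
sample = "agr\u00E9able snake_case" # "agréable snake_case"

# \w stops at the accented letter but happily includes the underscore.
ascii_words = sample.scan(/\w+/)          # ["agr", "able", "snake_case"]

# [[:alpha:]] keeps the accented letter and treats the underscore as a break.
alpha_words = sample.scan(/[[:alpha:]]+/) # ["agréable", "snake", "case"]

p ascii_words
p alpha_words
```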
You can do something basic using:
/[a-z]+(?:'[a-z]+)*/i
To extend it to allow words like a2b while dropping the digits from 123abc and abc123 and ignoring plain numbers:
/[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i
There are no special regex features in these two patterns, only the basics.
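As a rough check of how the two patterns differ, here they are run against a sample string. The string is an assumption assembled from the cases mentioned in the question, not something the answer itself tested.

```ruby
basic    = /[a-z]+(?:'[a-z]+)*/i
extended = /[a-z]+(?:'[a-z]+|\d+[a-z]+)*/i

# Assumed sample built from the question's edge cases.
sample = "isn't 'the best-example te7st test1241 436test 555"

# The basic pattern splits te7st at the digit.
basic_words = sample.scan(basic)
# ["isn't", "the", "best", "example", "te", "st", "test", "test"]

# The extended pattern keeps digits only when letters follow them.
extended_words = sample.scan(extended)
# ["isn't", "the", "best", "example", "te7st", "test", "test"]

p basic_words
p extended_words
```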
Try scanning the string using the [[:alpha:]] POSIX character class:
s = "This a sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
s.scan(/[[:alpha:]](?:['\w]*[[:alpha:]])?/)
# => ["This", "a", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
[First attempt]
I split the string into tokens separated by whitespace or hyphens then clean up each token per your rules, since it seems like they might be adjusted as you refine your problem:
def tokenize(str)
  tokens = str.split(/(?:\s+|-)/)
  tokens.reduce([]) do |memo, token|
    token.gsub!(/(^\W+|\W+$)/, '') # Strip enclosing non-words
    token.gsub!(/(^\d+|\d+$)/, '') # Strip enclosing digits
    memo + (token=='' ? [] : [token]) # Ignore the empty string
  end
end
s = "This sentence isn't 'the best-example, isn't it not?... a1 2b 3c3 d4d 555 stack-overflow"
puts tokenize(s).inspect
# ["This", "sentence", "isn't", "the", "best", "example", "isn't", "it", "not", "a", "b", "c", "d4d", "stack", "overflow"]
Clearly this solution doesn't use just regular expressions, but for my money it's much easier to understand and modify than (what I imagine) a big regex would be!

How can I match Word Boundary "or" [@#]?

I can't seem to get a regex that matches either a hashtag #, an @, or a word boundary. The goal is to break a string into Twitter-like entities and topics so:
input = "Hello #world, #ruby anotherString"
input.scan(entitiesRegex)
# => ["Hello", "#world", "#ruby", "anotherString"]
To get just the words, excluding "anotherString" which is too large, is simple:
/\b\w{3,12}\b/
will return ["Hello", "world", "ruby"]. Unfortunately this doesn't include the hashtags and @s. It seems like it should work simply with:
/[\b@#]\w{3,12}\b/
but that returns ["#world", "#ruby"]. This made me realize that word boundaries are not by definition a character, so they don't fall into the category of "A single character" and, so, won't match. A few more attempts:
/\b|[@#]\w{3,12}\b/
returns ["", "", "#world", "", "#ruby", "", "", ""].
/(\b|[@#])\w{3,12}\b/
matches the right things, but returns [[""], ["#"], ["#"], [""]] as expected, because the parentheses also capture everything they enclose.
/((\b|[@#])\w{3,12}\b)/
kind of works. It returns [["Hello", ""], ["#world", "#"], ["#ruby", "#"]]. So now all the correct items are there, they're just located at the first element of each of the subarrays. The following snippet technically works:
input.scan(/((\b|[@#])\w{3,12}\b)/).collect(&:first)
Is it possible to simplify this to match and return the correct substrings with just the regular expression not requiring the collect post-processing?
You can just use the regular expression /[@#]?\b\w+\b/. That is, optionally match an @ or a #, followed by a word boundary (in #ruby, that boundary is between # and ruby; in a normal word it matches at the start of the word) and a run of word characters.
p "Hello #world, #ruby anotherString".scan(/[@#]?\b\w+\b/)
# => ["Hello", "#world", "#ruby", "anotherString"]
Furthermore, you can adjust the number of characters a matching word should have with quantifiers. You gave an example in a comment to a deleted answer to match only #ruby by using {3,4}:
p "Hello #world, #ruby anotherString".scan(/[@#]?\b\w{3,4}\b/)
# => ["#ruby"]
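The nested arrays in the question come from the capture groups: String#scan returns the captures instead of the whole match as soon as the pattern contains a capturing group. A sketch of the asker's own pattern with a non-capturing group `(?:...)` is shown here; the sample input with one @ mention and one hashtag is an assumption.

```ruby
# Assumed sample with one @ mention and one # topic.
input = "Hello @world, #ruby anotherString"

# (?:...) groups the alternatives without capturing, so scan returns
# plain strings and no collect(&:first) post-processing is needed.
entities = input.scan(/(?:\b|[@#])\w{3,12}\b/)

p entities
# => ["Hello", "@world", "#ruby"]
```

Here "anotherString" is skipped because it has 13 word characters, one more than the `{3,12}` limit allows.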

Regex for finding at most n consecutive patterns

Let's say our pattern is a regex for capital letters (though we could have a more complex pattern than searching for capitals).
To find at least n consecutive patterns (in this case, the pattern we are looking for is simply a capital letter), we can do this:
(Using Ruby)
somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"
at_least_2_capitals = somestring.scan(/[A-Z][A-Z]+/)
=> ["ABC", "XYZ"]
at_least_3_capitals = somestring.scan(/[A-Z]{3}[A-Z]*/)
=> ["ABC", "XYZ"]
However, how do I search for at most n consecutive patterns, for example, at most one consecutive capital letter:
matches = somestring.scan(/ ??? /)
=> [" deFgHij kLmN pQrS ", " abcdEf"]
Detailed strategy
I read that I need to negate the "at least" regex, by turning it into a DFA, negating the accept states, (then converting it back to NFA, though we can leave it as it is) so to write it as a regex. If we think of encountering our pattern as receiving a '1' and not receiving the pattern as receiving a '0', we can draw a simple DFA diagram (where n=1, we want at most one of our pattern):
Specifically, I was wondering how this becomes a regex. Generally, I hope to find how to find "at most" with regex, as my regex skills feel stunted with "at least" alone.
Trip Hazards - not quite the right solution in spirit
Note that this question is not a duplicate of this post, as using the accepted methodology there would give:
somestring.scan(/[A-Z]{2}[A-Z]*(.*)[A-Z]{2}[A-Z]*/)
=> [[" deFgHij kLmN pQrS X"]]
Which is not what the DFA shows, not just because it misses the second sought match - more importantly that it includes the 'X', which it should not, as 'X' is followed by another capital, and from the DFA we see that a capital which is followed by another capital is not an accept state.
You could suggest
somestring.split(/[A-Z]{2}[A-Z]*/)
=> ["", " deFgHij kLmN pQrS ", " abcdEf"]
(Thanks to Rubber Duck)
but I still want to know how to find at most n occurrences using regex alone. (For knowledge!)
Why your attempt does not work
There are a few problems with your current attempt.
The reason that X is part of the match is that .* is greedy and consumes as much as possible - hence, leaving only the required two capital letters to be matched by the trailing bit. This could be fixed with a non-greedy quantifier.
The reason why you don't get the second match is twofold. First, you require two trailing capital letters to be there, but instead there is the end of the string. Second, matches cannot overlap. The first match includes at least two trailing capital letters, but the second would need to match these again at the start which is not possible.
There are more hidden problems: try an input with four consecutive capital letters - it can give you an empty match (provided you use the non-greedy quantifier - the greedy one has even worse problems).
Fixing all of these with the current approach is hard (I tried and failed - check the edit history of this post if you want to see my attempt until I decided to scrap this approach altogether). So let's try something else!
Looking for another solution
What is it that we want to match? Disregarding the edge cases, where the match starts at the beginning of the string or ends at the end of the string, we want to match:
(non-caps) 1 cap (non-caps) 1 cap (non-caps) ....
This is ideal for Jeffrey Friedl's unrolling-the-loop technique, which looks like
[^A-Z]+(?:[A-Z][^A-Z]+)*
Now what about the edge cases? We can phrase them like this:
We want to allow a single capital letter at the beginning of the match, only if it's at the beginning of the string.
We want to allow a single capital letter at the end of the match, only if it's at the end of the string.
To add these to our pattern, we simply group a capital letter with the appropriate anchor and mark both together as optional:
(?:^[A-Z])?[^A-Z]+(?:[A-Z][^A-Z]+)*(?:[A-Z]$)?
Now it's really working. And even better, we don't need capturing any more!
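For reference, a short sketch of the final pattern run against the string from the question, plus one extra input (an assumption, not from the question) that exercises both anchored edge cases. Note that in Ruby `^` and `$` match at line boundaries; `\A` and `\z` would be the strict string anchors for multi-line input.

```ruby
somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"

at_most_1 = /(?:^[A-Z])?[^A-Z]+(?:[A-Z][^A-Z]+)*(?:[A-Z]$)?/

# Runs of two or more capitals act as separators between the matches.
matches = somestring.scan(at_most_1)
p matches
# => [" deFgHij kLmN pQrS ", " abcdEf"]

# Assumed extra input: a single capital at the very start and at the very
# end is allowed, so the whole string is one match.
edge = "Abc dEf G".scan(at_most_1)
p edge
# => ["Abc dEf G"]
```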
Generalizing the solution
This solution is easily generalized to the case of "at most n consecutive capital letters", by changing each [A-Z] to [A-Z]{1,n} and thereby allowing up to n capital letters where there is only one allowed so far.
See the demo for n = 2.
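A minimal sketch of that generalization, with the regex built by interpolation. The `at_most_caps` lambda and the sample string are hypothetical illustrations, not part of the original answer or its demo.

```ruby
# Hypothetical helper: pattern for "at most n consecutive capital letters".
at_most_caps = lambda do |n|
  caps = "[A-Z]{1,#{n}}"
  /(?:^#{caps})?[^A-Z]+(?:#{caps}[^A-Z]+)*(?:#{caps}$)?/
end

# Assumed sample containing runs of one, two and three capitals.
sample = "ABC deFGhij XYZ abCD"

# With n = 2, "FG" and the trailing "CD" are allowed inside a match,
# while the three-letter runs "ABC" and "XYZ" still split the matches.
up_to_two = sample.scan(at_most_caps.call(2))
p up_to_two
# => [" deFGhij ", " abCD"]

# With n = 1, every run of two capitals becomes a separator as well.
up_to_one = sample.scan(at_most_caps.call(1))
p up_to_one
# => [" de", "hij ", " ab"]
```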
tl;dr
To match words containing at most N PATTERNs, use the regex
/\b(?:\w(?:(?<!PATTERN)|(?!(?:PATTERN){N,})))+\b/
For example, to match words containing at most 1 capital letter,
/\b(?:\w(?:(?<![A-Z])|(?!(?:[A-Z]){1,})))+\b/
This works for multi-character patterns too.
Clarification Needed
I'm afraid your examples may cause confusion. Let's add a few words:
somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf mixedCaps mixeDCaps mIxedCaps mIxeDCaps T TT t tt"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Now, rerunning your at-least-2-capitals regex returns
at_least_2_capitals = somestring.scan(/[A-Z][A-Z]+/)
=> ["ABC", "XYZ", "DC", "DC", "TT"]
Note how complete words are not captured! Are you sure this is what you wanted? I ask, of course, because in your latter examples, your at-most-1-capital regex returns complete words, instead of just the capital letters being captured.
Solution
Here's the solution either way.
First, for matching just patterns (and not entire words, as consistent with your initial examples), here's a regex for at-most-N-PATTERNs:
/(?<!PATTERN)(?!(?:PATTERN){N+1,})(?:PATTERN)+/
For example, the at-most-1-capitals regex would be
/(?<![A-Z])(?!(?:[A-Z]){2,})(?:[A-Z])+/
and returns
=> ["F", "H", "L", "N", "Q", "S", "E", "C", "I", "C", "I", "T"]
To further exemplify, the at-most-2-capitals regex returns
=> ["F", "H", "L", "N", "Q", "S", "E", "C", "DC", "I", "C", "I", "DC", "T", "TT"]
Finally, if you wanted to match entire words that contained at most a certain number of consecutive patterns, then here's a whole different approach:
/\b(?:\w(?:(?<![A-Z])|(?![A-Z]{1,})))+\b/
This returns
["deFgHij", "kLmN", "pQrS", "abcdEf", "mixedCaps", "mIxedCaps", "T", "t", "tt"]
The general form is
/\b(?:\w(?:(?<!PATTERN)|(?!(?:PATTERN){N,})))+\b/
You can see all these examples at http://ideone.com/hImmZr.
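Since the linked examples may not be reachable, here is a small sketch that builds the general whole-word form for a given N and runs it on the extended string. The `at_most_n_words` lambda is a hypothetical wrapper introduced for illustration; `pattern` is assumed to be a single-character regex source such as "[A-Z]".

```ruby
# Hypothetical wrapper around the general form shown above.
at_most_n_words = lambda do |pattern, n|
  /\b(?:\w(?:(?<!#{pattern})|(?!(?:#{pattern}){#{n},})))+\b/
end

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf mixedCaps mixeDCaps mIxedCaps mIxeDCaps T TT t tt"

# N = 1: words with no two capitals in a row.
one = somestring.scan(at_most_n_words.call("[A-Z]", 1))
p one
# => ["deFgHij", "kLmN", "pQrS", "abcdEf", "mixedCaps", "mIxedCaps", "T", "t", "tt"]

# N = 2: runs of two capitals are accepted, runs of three are not.
two = somestring.scan(at_most_n_words.call("[A-Z]", 2))
p two
# => ["deFgHij", "kLmN", "pQrS", "abcdEf", "mixedCaps", "mixeDCaps", "mIxedCaps", "mIxeDCaps", "T", "TT", "t", "tt"]
```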
To find "at most" with a regex, you use the quantifier {1,n} (possibly preceded by a negative lookbehind and followed by a negative lookahead), so it seems that what you want is:
irb(main):006:0> somestring.scan(/[A-Z]{1,2}/)
=> ["AB", "C", "F", "H", "L", "N", "Q", "S", "XY", "Z", "E"]
or
irb(main):007:0> somestring.scan(/(?<![A-Z])[A-Z]{1,2}(?![A-Z])/)
=> ["F", "H", "L", "N", "Q", "S", "E"]
EDIT: if the OP still wants "the longest strings not including more than two uppercase letters" in a row, they can use:
irb(main):025:0> somestring.scan(/[^A-Z]+(?:[A-Z]{1,2}[^A-Z]+)*/)
=> [" deFgHij kLmN pQrS ", " abcdEf"]
(but that regex possibly won't match in the beginning and the end of the string)
It seems that
irb(main):026:0> somestring.split(/[A-Z]{3,}/)
=> ["", " deFgHij kLmN pQrS ", " abcdEf"]
would be better for that.

Ruby count number of found keywords in string

I have an array with keywords and I have a string, which may contain those keywords. I now need to know how many keywords are in the given string:
keywords = [ 'text' ,'keywords' ,'contains' ,'blue', '42']
text = 'This text is not long but it contains 3 keywords'
How can I now find out with a ruby command how many of the strings in my array are in the text (three in this case)? I could of course use a for each loop but I am almost sure that there is a more concise way to achieve this.
Thanks for your help
Update: Preferably the solution should not rely on the spaces. So the spaces could be replaced by arbitrary characters.
Update 2: The command should look for unique occurrences.
Here's one approach:
text.scan(/#{keywords.join('|')}/).length
Note that this is safe only if the keywords array contains only alphanumeric characters.
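One way to lift the alphanumeric-only restriction and honour the "unique occurrences" update is sketched here. It assumes substring matching is what is wanted (so "context" would count as containing "text"), which fits the requirement of not relying on spaces.

```ruby
keywords = ['text', 'keywords', 'contains', 'blue', '42']
text = 'This text is not long but it contains 3 keywords'

# Regexp.union escapes each keyword, so regex metacharacters in a keyword
# are matched literally instead of being interpreted.
pattern = Regexp.union(keywords)

# uniq collapses repeated hits so each keyword is counted once.
unique_hits = text.scan(pattern).uniq
p unique_hits      # => ["text", "contains", "keywords"]
p unique_hits.size # => 3

# The same count without a regex.
plain_count = keywords.count { |keyword| text.include?(keyword) }
p plain_count      # => 3
```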
Not exactly what you wanted but
irb(main):012:0> text.split(' ')
=> ["This", "text", "is", "not", "long", "but", "it", "contains", "3", "keywords"]
irb(main):013:0> text.split(' ') & keywords
=> ["text", "contains", "keywords"]
will give you an array with matches

Ruby: Is there a way to split a string only on the first x occurrences?

For example, suppose I have this:
001, "john doe", "male", 37, "programmer", "likes dogs, women, and is lazy"
The problem is that the line is only supposed to have 6 fields. But if I separate it with split I get more, due to the comma being used improperly to separate the fields.
Right now I'm splitting everything, then when I get to the 5-th index onward I concatenate all the strings. But I was wondering if there was a split(",",6) or something along these lines.
Ruby has a CSV module in the standard library. It will do what you really need here (ignore commas inside double quotes).
require 'csv'
# CSV::Reader was removed in Ruby 1.9; CSV.parse_line is the current equivalent.
p CSV.parse_line("\"cake, pie\", bacon")
result:
["cake, pie", " bacon"]
You might want to strip the results if you're dim like me and stick whitespace everywhere.
Yes, you can do the_string.split(",", 6). However this will still give "wrong" result if there's a comma inside quotes somewhere in the middle (e.g. 001, "doe, john",...).
However using Shellwords might be more appropriate here as this will also allow other sections than the last to contain commas inside quotes (it will also remove the quotes which may or may not be a problem, depending on what you're trying to do).
Example:
require 'shellwords'
the_string = %(001, "doe, john", "male", 37, "programmer", "likes dogs, women, and is lazy")
Shellwords.shellwords the_string
#=> ["001,", "doe, john,", "male,", "37,", "programmer,", "likes dogs, women, and is lazy"]
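For the simpler case where only the last field contains stray commas, the limit argument to split mentioned above can be sketched as follows, using the line from the question. The `strip` call is an addition to drop the space after each comma.

```ruby
line = '001, "john doe", "male", 37, "programmer", "likes dogs, women, and is lazy"'

# The second argument caps the number of fields; the sixth field keeps
# the rest of the line, commas included.
fields = line.split(",", 6).map(&:strip)

p fields.size # => 6
p fields.last # => "\"likes dogs, women, and is lazy\""
```

The quotes stay in the fields, so they still need removing if only the bare values are wanted.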
