How do I keep the delimiters when splitting a Ruby string? - ruby

I have text like:
content = "Do you like to code? How I love to code! I'm always coding."
I'm trying to split it on either a ? or . or !:
content.split(/[?.!]/)
When I print out the results, the punctuation delimiters are missing.
Do you like to code
How I love to code
I'm always coding
How can I keep the punctuation?

Answer
Use a positive lookbehind regular expression (i.e. ?<=) inside a parenthesis capture group to keep the delimiter at the end of each string:
content.split(/(?<=[?.!])/)
# Returns an array with:
# ["Do you like to code?", " How I love to code!", " I'm always coding."]
That leaves a white space at the start of the second and third strings. Add a match for zero or more white spaces (\s*) after the capture group to exclude it:
content.split(/(?<=[?.!])\s*/)
# Returns an array with:
# ["Do you like to code?", "How I love to code!", "I'm always coding."]
Additional Notes
While it doesn't make sense with your example, the delimiter can be shifted to the front of the strings starting with the second one. This is done with a positive lookahead regular expression (i.e. ?=). For the sake of anyone looking for that technique, here's how to do that:
content.split(/(?=[?.!])/)
# Returns an array with:
# ["Do you like to code", "? How I love to code", "! I'm always coding", "."]
A better example to illustrate the behavior is:
content = "- the - quick brown - fox jumps"
content.split(/(?=-)/)
# Returns an array with:
# ["- the ", "- quick brown ", "- fox jumps"]
Notice that the square bracket capture group wasn't necessary since there is only one delimiter. Also, since the first match happens at the first character it ends up as the first item in the array.

To answer the question's title, adding a capture group to your split regex will preserve the split delimiters:
"Do you like to code? How I love to code! I'm always coding.".split /([?!.])/
=> ["Do you like to code", "?", " How I love to code", "!", " I'm always coding", "."]
From there, it's pretty simple to reconstruct sentences (or do other massaging as the problem calls for it):
s.split(/([?!.])/).each_slice(2).map(&:join).map(&:strip)
=> ["Do you like to code?", "How I love to code!", "I'm always coding."]
The regexes given in other answers do fulfill the body of the question more succinctly, though.

I'd use something like:
content.scan(/.+?[?!.]/)
# => ["Do you like to code?", " How I love to code!", " I'm always coding."]
If you want to get rid of the intervening spaces, use:
content.scan(/.+?[?!.]/).map(&:lstrip)
# => ["Do you like to code?", "How I love to code!", "I'm always coding."]

Use partition. An example from the documentation:
"hello".partition("l") #=> ["he", "l", "lo"]

The most robust way to do this is with a Natural Language Processing library: Rails gem to break a paragraph into series of sentences
You can also split in groups:
#content.split(/(\?+)|(\.+)|(!+)/)
After splitting into groups, you can join the sentence and delimiter.
#content.split(/(\?+)|(\.+)|(!+)/).each_slice(2) {|slice| puts slice.join}

Related

How to extract content within square brackets in ruby

I am trying to extract the content inside the square brackets.
So far I have been using this, which works but I was wondering instead of using this delete function, if I could directly use something in regex.
a = "This is such a great day [cool awesome]"
a[/\[.*?\]/].delete('[]') #=> "cool awesome"
Almost.
a = "This is such a great day [cool awesome]"
a[/\[(.*?)\]/, 1]
# => "cool awesome"
a[/(?<=\[).*?(?=\])/]
# => "cool awesome"
The first one relies on extracting a group instead of a full match; the second one leverages lookahead and lookbehind to avoid the delimiters in the final match.
You can do this with Regular expression using Regexp#=~.
/\[(?<inside_brackets>.+)\]/ =~ a
=> 25
inside_brackets
=> "cool awesome"
This way you assign inside_brackets to the string which matches the regular expression if any, which I think it is more readable.
Note: Be careful to place the regular expression at the left hand side.

Regular expression returns only one match

I have a set of keywords. Any keyword can contain a space symbol ['one', 'one two']. I generate a regexp from these kyewords like this /\b(?i:one|one\ two|three)\b/. Full example below:
keywords = ['one', 'one two', 'three']
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
text.downcase.scan(re)
the result of this code is
=> ["one", "one"]
How to find match of the second keyword one two and get result like this?
=> ["one", "one two"]
Regexes are eager to match. Once they find a match, they don't try to find another possibly longer one (with one important exception).
/\b(?i:one|one\ two|three)\b/ is never going to match one two because it will always match one first. You'd need /\b(?i:one two|one|three)\b/ so it tries one two first. Probably the simplest way to automate this is to sort by the longest keywords first.
keywords = ['one', 'one two', 'three']
re = Regexp.union(keywords.sort { |a,b| b.length <=> a.length }).source
re = /\b#{re}\b/i;
text = 'Some word one and one two other word'
puts text.scan(re)
Note that I set the whole regex to be case-insensitive, easier to read than (?:...), and that downcasing the string is redundant.
The exception is repetition like +, * and friends. They are greedy by default. .+ is going to match as many characters as it can. That's greedy. You can make it lazy, to match the first thing it sees, with a ?. .+? will match a single character.
"A foot of fools".match(/(.*foo)/); # matches "A foot of foo"
"A foot of fools".match(/(.*?foo)/); # matches "A foo"
The point is that \bone\b matches one in one two and since this branch appears before one two branch, it "wins" (see Remember That The Regex Engine Is Eager).
You need to sort the keyword array in a descending order before building a regex. It will then look like
(?-mix:\b(?i:three|one\ two|one)\b)
This way the longer one two will be before the shorter one and will get matched.
See the Ruby demo:
keywords = ['one', 'one two', 'three']
keywords = keywords.dup.sort.reverse
re = /\b(?i:#{ Regexp.union(keywords).source })\b/
text = 'Some word one and one two other word'
puts text.downcase.scan(re)
# => [ one, one two ]
I tried your example by moving the first element to the second position of the array and it works (e.g. http://rubular.com/r/4F2Hc46wHT).
In fact, it looks like the first keyword "overlaps" the second.
This response may be unhelpful if you can't change keywords order.

How do I write a regular expression that will match characters in any order?

I'm trying to write a regular expressions that will match a set of characters without regard to order. For example:
str = "act"
str.scan(/Insert expression here/)
would match:
cat
act
tca
atc
tac
cta
but would not match ca, ac or cata.
I read through a lot of similar questions and answers here on StackOverflow, but have not found one that matches my objectives exactly.
To clarify a bit, I'm using ruby and do not want to allow repeat characters.
Here is your solution
^(?:([act])(?!.*\1)){3}$
See it here on Regexr
^ # matches the start of the string
(?: # open a non capturing group
([act]) # The characters that are allowed and a capturing group
(?!.*\1) # That character is matched only if it does not occur once more, Lookahead assertion
){3} # Defines the amount of characters
$
The only special think is the lookahead assertion, to ensure the character is not repeated.
^ and $ are anchors to match the start and the end of the string.
[act]{3} or ^[act]{3}$ will do it in most regular expression dialects. If you can narrow down the system you're using, that will help you get a more specific answer.
Edit: as mentioned by #georgydyer in the comments below, it's unclear from your question whether or not repeated characters are allowed. If not, you can adapt the answer from this question and get:
^(?=[act]{3}$)(?!.*(.).*\1).*$
That is, a positive lookahead to check a match, and then a negative lookahead with a backreference to exclude repeated characters.
Here's how I'd go about it:
regex = /\b(?:#{ Regexp.union(str.split('').permutation.map{ |a| a.join }).source })\b/
# => /(?:act|atc|cat|cta|tac|tca)/
%w[
cat act tca atc tac cta
ca ac cata
].each do |w|
puts '"%s" %s' % [w, w[regex] ? 'matches' : "doesn't match"]
end
That outputs:
"cat" matches
"act" matches
"tca" matches
"atc" matches
"tac" matches
"cta" matches
"ca" doesn't match
"ac" doesn't match
"cata" doesn't match
I use the technique of passing an array into Regexp.union for a lot of things; I works especially well with the keys of a hash, and passing the hash into gsub for rapid search/replace on text templates. This is the example from the gsub documentation:
'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"
Regexp.union creates a regex, and it's important to use source instead of to_s when extracting the actual pattern being generated:
puts regex.to_s
=> (?-mix:\b(?:act|atc|cat|cta|tac|tca)\b)
puts regex.source
=> \b(?:act|atc|cat|cta|tac|tca)\b
Notice how to_s embeds the pattern's flags inside the string. If you don't expect them you can accidentally embed that pattern into another, which won't behave as you expect. Been there, done that and have the dented helmet as proof.
If you really want to have fun, look into the Perl Regexp::Assemble module available on CPAN. Using that, plus List::Permutor, lets us generate more complex patterns. On a simple string like this it won't save much space, but on long strings or large arrays of desired hits it can make a huge difference. Unfortunately, Ruby has nothing like this, but it is possible to write a simple Perl script with the word or array of words, and have it generate the regex and pass it back:
use List::Permutor;
use Regexp::Assemble;
my $regex_assembler = Regexp::Assemble->new;
my $perm = new List::Permutor split('', 'act');
while (my #set = $perm->next) {
$regex_assembler->add(join('', #set));
}
print $regex_assembler->re, "\n";
(?-xism:(?:a(?:ct|tc)|c(?:at|ta)|t(?:ac|ca)))
See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more information about using Regexp::Assemble with Ruby.
I will assume several things here:
- You are looking for permutations of given characters
- You are using ruby
str = "act"
permutations = str.split(//).permutation.map{|p| p.join("")}
# and for the actual test
permutations.include?("cat")
It is no regex though.
No doubt - the regex that uses positive/negative lookaheads and backreferences is slick, but if you're only dealing with three characters, I'd err on the side of verbosity by explicitly enumerating the character permutations like #scones suggested.
"act".split('').permutation.map(&:join)
=> ["act", "atc", "cat", "cta", "tac", "tca"]
And if you really need a regex out of it for scanning a larger string, you can always:
Regexp.union "act".split('').permutation.map(&:join)
=> /\b(act|atc|cat|cta|tac|tca)\b/
Obviously, this strategy doesn't scale if your search string grows, but it's much easier to observe the intent of code like this in my opinion.
EDIT: Added word boundaries for false positive on cata based on #theTinMan's feedback.

Eloquent way to format string in Ruby?

str = 'foo_bar baz __goo'
Should print as
Foo Bar Baz Goo
Tried to use split /(\s|_)/, but it returns '_' and ' ' and multiple spaces...?
Ruby 1.9.3
try this:
str.split(/[\s_]+/).map(&:classify).join(" ")
if you have access to active support, or
str.split(/[\s_]+/).map(&:capitalize).join(" ")
if you want plain ruby.
Assuming you mean /(\s|_)/ (direction of the slashes matters!), your regular expression is pretty close. The reason you're getting the delimiters (spaces and underscores) in your result is the parentheses: they instruct the splitter to include the delimiters in the returned array.
The reason you are getting extra empty strings is that you are splitting on \s, which matches exactly one space (or tab), or '_', which matches exactly one underscore. If you want to treat any number of spaces or underscores as a single delimiter, you need to add + to your regex - it means "one or more of the previous thing".
But \s|_+ means "a space, or one or more underscores". You want to apply the + to the whole expression, not just the _. That brings us back to the parentheses. In this case, you want to group the two alternatives together without capturing (and returning) them; the syntax for that is (?:...). So this is the result:
str.split(/(?:\s|_)+/)
Now, if you want to normalize case, you want to run capitalize on each string, which you can do with map, like this:
str.split(/(?:\s|_)+/).map { |s| s.capitalize }
or use the shortcut:
str.split(/(?:\s|_)+/).map(&:capitalize)
So far, all these solutions return an array of strings, which you can do a variety of things with. But if you just want to put them back together into a single string, you can use join. For instance, to put them together with a single space between them:
str.split(/(?:\s|_)+/).map(&:capitalize).join ' '
Try splitting the string on:
[\s_]+
Use /[\ _]+/, it looks for one or more occurrence or space or underscore. In that way it is able to eat out multiple underscores, spaces or a combination of both. After than you get an array, so you use map to transform each of them. Later you can get them together using join. See examples -
Get them in a list -
str.split(/[\ _]+/).map {|s| s.capitalize }
=> ["Foo", "Bar", "Baz", "Goo"]
Get them as a whole string -
str.split(/[\ _]+/).map {|s| s.capitalize }.join(" ")
=> "Foo Bar Baz Goo"
Here the pure Ruby version:
str = 'foo_bar baz __goo'
str.split(/[ _]+/).map{|s| s[0].upcase+s[1..-1]}.join(" ")

Regex replace pattern with first char of match & second char in caps

Let's say i have the following string:
"a test-eh'l"
I want to capitalize the start of each word. A word can be separated by a space, apostrophe, hyphen, a forward slash, a period, etc. So I want the string to turn out like this:
"A Test-Eh'L"
I'm not too worried about getting the first character capitalized from the gsub call, as that's easy to do after the fact. However, when I've been using IRB and match method, I only seem to be getting one result. When i use a scan, it collects the matches, but the problem is I cannot really do much with it, as i need to replace the contents of the original string.
Here's what i have so far:
"a test-eh'a".scan(/[\s|\-|\'][a-z]/)
=> [" t", "-e", "'a"]
"a test-eh'a".match(/[\s|\-|\'][a-z]/)
=> #<MatchData " t">
Then if i try the pattern using gsub:
"a test-eh'a".gsub(/[\s|\-|\'][a-z]/, $1)
TypeError: can't convert nil into String
In javascript, i would normally use parenthesis instead of square brackets on the front section. However, i wasn't getting correct results in the scan call when doing so.
"a test-eh'a".scan(/(\s|\-|\')[a-z]/)
=> [[" "], ["-"], ["'"]]
"a test-eh'a".gsub(/(\s|\-|\')[a-z]/, $1)
=> "a'est'h'"
Any help would be appreciated.
Try this:
"a test-eh'a".gsub(/(?:^|\s|-|')[a-z]/) { |r| r.upcase }
# => "A Test-Eh'A"

Resources