Why does the following parsing solution work? - ruby

I need simple parsing with embedded single and double quotes. For the following input:
" hello 'there ok \"hohh\" ' ciao \"eeee \" \" yessss 'aaa' \" %%55+ "
I need the following output:
["hello", "there ok \"hohh\" ", "ciao", "eeee ", " yessss 'aaa' ", "%%55+"]
Why does the following Ruby code that I came up with work? I do not understand the regex part. I know basic regex, and I assumed that the embedded quotes would not work, yet they do, both for single quotes containing doubles and vice versa.
text.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.select{|x|x}

No need to solve this with a custom regex; the ruby standard library contains a module for this: Shellwords.
Manipulates strings like the UNIX Bourne shell
This module manipulates strings according to the word parsing rules of the UNIX Bourne shell.
Usage:
require 'shellwords'
str = " hello 'there ok \"hohh\" ' ciao \"eeee \" \" yessss 'aaa' \" %%55+ "
Shellwords.split(str)
#=> ["hello", "there ok \"hohh\" ", "ciao", "eeee ", " yessss 'aaa' ", "%%55+"]
# Or equivalently:
str.shellsplit
#=> ["hello", "there ok \"hohh\" ", "ciao", "eeee ", " yessss 'aaa' ", "%%55+"]
The above is the "right" answer. Use that. What follows is additional information to explain why to use this, and why your answer "sort-of" works.
Parsing these strings accurately is tricky! Your regex attempt works for most inputs, but does not properly handle various edge cases. For example, consider:
str = "foo\\ bar"
str.shellsplit
#=> ["foo bar"] (correct!)
str.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.select{|x|x}
#=> ["foo\\", "bar"] (wrong!)
The method's implementation does still use a (more complex!) regex under the hood, but also handles edge cases such as invalid inputs - which yours does not.
line.scan(/\G\s*(?>([^\s\\\'\"]+)|'([^\']*)'|"((?:[^\"\\]|\\.)*)"|(\\.?)|(\S))(\s|\z)?/m)
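One such invalid input is an unbalanced quote. The following is a minimal sketch of the difference, assuming a made-up sample string (it does not appear in the question): Shellwords.split raises an ArgumentError, while the scan-based regex quietly falls through to its third alternative and returns a token that still contains the stray quote.

```ruby
require 'shellwords'

# Hypothetical input: the opening double quote is never closed.
broken = 'say "oops'

# The scan-based approach does not notice the problem: both quoted
# alternatives fail, so [^\s]+ picks up the stray quote as part of a word.
scan_result = broken.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.compact

# Shellwords treats the same input as an error.
shellwords_error =
  begin
    Shellwords.split(broken)
    nil
  rescue ArgumentError => e
    e
  end

p scan_result             # ["say", "\"oops"]
p shellwords_error.class  # ArgumentError
```

Whether raising or returning a best-effort token is preferable depends on the caller; the point is only that the two approaches disagree on malformed input.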
So without digging too deeply into the flaws of your approach (but suffice to say, it doesn't always work!), why does it mostly work? Well, your regex:
/\"(.*?)\"|'(.*?)'|([^\s]+)/
...is saying:
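If " is found, match as little as possible (.*?) up until the closing ".
Same as above, for single quotes (').
If neither quoted form matches at the current position, match a run of one or more non-whitespace characters ([^\s]+ -- which could also, equivalently, have been written as \S+).
Embedded quotes of the other kind work because . matches any character except a newline, so a ' inside a "..." section (or a " inside a '...' section) is consumed as ordinary text until the closing delimiter is reached.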
The .flatten is necessary because you're using capture groups ((...)), so scan returns one array per match. Non-capture groups ((?:...)) would avoid that, but each match would then include its surrounding quote characters.
The .select{|x|x}, or (effectively) equivalently .compact was also necessary because of these capture groups - since in each match, 2 of the 3 groups were not part of the result.
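To make the role of the three capture groups concrete, here is a small sketch that prints the raw scan result before it is flattened. The sample string is an assumption chosen for brevity, not the one from the question.

```ruby
# Hypothetical short input: a single-quoted section containing double quotes.
text = %q{say 'it "now"' ok}
pattern = /\"(.*?)\"|'(.*?)'|([^\s]+)/

# Each match yields one slot per capture group; only the group belonging
# to the alternative that matched is filled, the other two are nil.
raw = text.scan(pattern)
p raw
# [[nil, nil, "say"], [nil, "it \"now\"", nil], [nil, nil, "ok"]]

# flatten merges the per-match arrays, compact drops the nil slots.
tokens = raw.flatten.compact
p tokens
# ["say", "it \"now\"", "ok"]
```

The single quote is reached first when scanning left to right, so the second alternative wins and the double quotes inside it are just ordinary characters for .*?.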

Related

Delete all the whitespaces that occur after a word in ruby

I have a string " hello world! How is it going?"
The output I need is " helloworld!Howisitgoing?"
So all the whitespaces after hello should be removed. I am trying to do this in ruby using regex.
I tried strip and delete(' ') methods but I didn't get what I wanted.
some_string = " hello world! How is it going?"
some_string.delete(' ') #deletes all spaces
some_string.strip #removes trailing and leading spaces only
Please help. Thanks in advance!
There are numerous ways this could be accomplished without a regular expression, but using one could be the "cleanest" looking approach, with no need to take substrings, etc. The regular expression I believe you are looking for is /(?!^)(\s)/.
" hello world! How is it going?".gsub(/(?!^)(\s)/, '')
#=> " helloworld!Howisitgoing?"
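The \s matches any whitespace character (including tabs, etc.), and ^ is an anchor that matches the beginning of a line (in Ruby, \A is the anchor for the beginning of the whole string). The (?!...) is a negative lookahead: it rejects a match at any position where the enclosed pattern would match. Used together, they match every whitespace character except one at the very start of a line.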
If you are not familiar with gsub, it is very similar to the replace method found in other languages, but takes a regular expression. It additionally has a gsub! counterpart that mutates the string in place instead of creating a new altered copy.
Note that strictly speaking, this isn't all whitespace "after a word" to quote the exact question, but I gathered from your examples that your intentions were "all whitespace except beginning of string", which this will do.
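One caveat worth illustrating: because ^ matches at the start of every line in Ruby, a multi-line input keeps one leading whitespace character per line. The sketch below uses a made-up two-line string (an assumption, not from the question) and compares ^ with \A.

```ruby
# Hypothetical two-line input, each line starting with a space.
multi = " a b\n c d"

# ^ matches at the start of every line, so the space after "\n" survives
# (the "\n" itself is not at a line start, so it is removed).
per_line = multi.gsub(/(?!^)\s/, '')

# \A matches only at the start of the whole string.
whole = multi.gsub(/(?!\A)\s/, '')

p per_line  # " ab cd"
p whole     # " abcd"
```

For single-line input such as the string in the question, both patterns give the same result.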
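def remove_spaces_after_word(str, word)
  # Locate the word (whole word, case-insensitive); escape it so that
  # regex metacharacters in word are treated literally.
  i = str.index(/\b#{Regexp.escape(word)}\b/i)
  return str if i.nil?
  i += word.size
  # Drop every space at or after the end of the word; keep earlier ones.
  str.gsub(/ /) { Regexp.last_match.begin(0) >= i ? '' : ' ' }
end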
remove_spaces_after_word("Hey hello world! How is it going?", "hello")
#=> "Hey helloworld!Howisitgoing?"

Ruby regex eliminate new line until . or ? or capital letter

I'd like to do the following with my strings:
line1= "You have a house\nnext to the corner."
Eliminate \n if the sentence doesn't finish in new line after dot or question mark or capital letter, so the desired output will be in this case:
"You have a house next to the corner.\n"
So another example, this time with the question mark:
"You like baggy trousers,\ndon't you?"
should become:
"You like baggy trousers, don't you?\n".
I've tried:
line1.gsub!(/(?<!?|.)"\n"/, " ")
(?<!?|.) is meant to say: immediately preceding \n there must NOT be either a question mark (?) or a period (.)
But I get the following syntax error:
SyntaxError: (eval):2: target of repeat operator is not specified: /(?<!?|.)"\n"/
And for the sentences where in the middle of them there's a capital letter, insert a \n before that capital letter so the sentence:
"We were winning The Home Secretary played a important role."
Should become:
"We were winning\nThe Home Secretary played a important role."
NOTE: This answer is not meant to provide a generic way to remove unnecessary newline symbols inside sentences; it is only meant to serve the OP's purpose of removing or inserting newlines in specific places in a string.
Since you need to replace matches in different scenarios differently, you should consider a 2-step approach. The first step is
.gsub(/(?<![?.])\n/, ' ')
This one will replace all newlines that are not preceded by ? or . (as (?<![?.]) is a negative lookbehind that fails the match if the character immediately before the current location is ? or .).
The second step is
.sub(/(?<!^) *+(?=[A-Z])/, "\n")
or
.sub(/(?<!^) *+(?=\p{Lu})/, "\n")
It matches zero or more spaces ( *+, possessively, so there is no backtracking into the space pattern) that are not at the beginning of a line (due to the (?<!^) negative lookbehind; replace ^ with \A to test against the start of the whole string) and that are followed by a capital letter ((?=\p{Lu}) is a positive lookahead that requires an uppercase letter immediately to the right of the current location). Because sub replaces only the first match, the newline is inserted before the first such capital letter only.
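Put together, the two steps can be sketched as follows, using the sample sentences from the question. The variable names are assumptions introduced for illustration.

```ruby
# Step 1: join lines that were broken mid-sentence, i.e. replace every
# newline that is not preceded by ? or . with a space.
joined = "You have a house\nnext to the corner.".gsub(/(?<![?.])\n/, ' ')
question = "You like baggy trousers,\ndon't you?".gsub(/(?<![?.])\n/, ' ')

# Step 2: replace the spaces before the first mid-line capital letter with
# a newline. The replacement must be the double-quoted "\n"; a single-quoted
# '\n' would insert a literal backslash followed by n.
split_once = "We were winning The Home Secretary played a important role."
split_once = split_once.sub(/(?<!^) *+(?=\p{Lu})/, "\n")

p joined      # "You have a house next to the corner."
p question    # "You like baggy trousers, don't you?"
p split_once  # "We were winning\nThe Home Secretary played a important role."
```

Neither step appends the trailing newline shown in the desired output; that would have to be added separately.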
You are nearly there. You need to a) escape both ? and . and b) remove quotation marks around \n in the expression:
line1= "You have a house\nnext to the corner.\nYes?\nNo."
line1.gsub!(/(?<!\?|\.)\s*\n\s*/, " ")
#⇒ "You have a house next to the corner.\nYes?\nNo."
As you want the trailing \n, just add it afterwards:
line1.gsub!(/\Z/, "\n")
#⇒ "You have a house next to the corner.\nYes?\nNo.\n"
The simple way to do this is to replace all the embedded new-lines with a space, which effectively joins the line segments, then fix the line-end. It's not necessary to worry about the punctuation and it's not necessary to use (or maintain) a regex.
You can do this a lot of ways, but I'd use:
sentences = [
"foo\nbar",
"foo\n\nbar",
"foo\nbar\n",
]
sentences.map{ |s| s.gsub("\n", ' ').squeeze(' ').strip + "\n" }
# => ["foo bar\n", "foo bar\n", "foo bar\n"]
Here's what's happening inside the map block:
s                    # => "foo\nbar", "foo\n\nbar", "foo\nbar\n"
  .gsub("\n", ' ')   # => "foo bar", "foo  bar", "foo bar "
  .squeeze(' ')      # => "foo bar", "foo bar", "foo bar "
  .strip             # => "foo bar", "foo bar", "foo bar"
  + "\n"

Replacing '&' with '\&' in Ruby using String#sub

I'm trying to replace every & in a string with \& using String#gsub in Ruby. What I see is confusing me, as I was hoping to get milk \& honey:
irb(main):009:0> puts "milk & honey".sub(/&/,'\ &')
milk \ & honey
=> nil
irb(main):010:0> puts "milk & honey".sub(/&/,'\&')
milk & honey
=> nil
irb(main):011:0> puts "milk & honey".sub(/&/,'\\&')
milk & honey
=> nil
irb(main):012:0>
This is on Ruby 2.0.0p481 on OS X. (I was using String#sub above but plan to use String#gsub for the general case with more than one & in a string.)
When you pass a string as the replacement value to String#sub (or String#gsub), it is first scanned for backreferences to the original string. Of particular interest here, the sequence \& is replaced by whatever part of the string matched the whole regular expression:
puts "bar".gsub(/./, '\\&\\&') # => bbaarr
Note that, despite appearances, the Ruby string literal '\\&\\&' represents a string with only four characters, not six:
puts '\\&\\&' # => \&\&
That's because even single-quoted Ruby strings are subject to backslash-substitution, in order to allow the inclusion of single-quotes inside single-quoted strings. Only ' or another backslash itself trigger substitution; a backslash followed by anything else is taken as simply a literal backslash. That means that you can usually get literal backslashes without doubling them:
puts '\&\&' # still => \&\&
But that's a fiddly detail to rely on, as the next character could change the interpretation. The safest practice is doubling all backslashes that you want to appear literally in a string.
Now in this case, we want to somehow get a literal backslash-ampersand back out of sub. Fortunately, just like the Ruby string parser, sub allows us to use doubled backslashes to indicate that a backslash should be taken as literal instead of as the start of a backreference. We just need to double the backslash in the string that sub receives - which means doubling both of the backslashes in the string's literal representation, taking us to a total of four backslashes in that form:
puts "milk & honey".sub(/&/, '\\\\&')
You can get away with only three backslashes here if you like living dangerously. :)
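The backslash counting is easier to follow when the literals are inspected directly. This is a small sketch; the variable names two, three and four are assumptions for illustration.

```ruby
# Single-quoted literals: \\ collapses to one backslash, while \& is
# left alone because & is neither a quote nor a backslash.
two   = '\\&'    # backslash, ampersand (the same string as '\&')
three = '\\\&'   # backslash, backslash, ampersand
four  = '\\\\&'  # backslash, backslash, ampersand

p two.length     # 2
p four.length    # 3
p three == four  # true

# sub then reads the string again: \& means "the whole match",
# while \\ means "one literal backslash".
with_two  = "milk & honey".sub(/&/, two)
with_four = "milk & honey".sub(/&/, four)

puts with_two   # milk & honey
puts with_four  # milk \& honey
```

The three-backslash form only works because the character after the last backslash happens to be &, which is why it is the fragile one.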
Alternatively, you can avoid all the backslash-counting and use the block form, where the replacement is obtained by calling a block of code instead of parsing a static string. Since the block is free to do any sort of substitution or string munging it wants, its return value is not scanned for backslash substitutions like the string version is:
puts "milk & honey".sub(/&/) { '\\&' }
Or the "risky" version:
puts "milk & honey".sub(/&/) { '\&' }
Just triple the \:
puts "milk & honey".sub(/&/,'\\\&')
In a Ruby replacement string, \& stands for the entire match; that is why it has to be escaped, and then we need to add the literal \. The available patterns are listed below:
\& (the entire match)
\+ (the last group)
\` (pre-match string)
\' (post-match string)
\0 (same as \&)
\1 (first captured group)
\2 (second captured group)
\\ (a backslash)
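A few of these sequences in action, as a short sketch. The sample strings and variable names are assumptions chosen for illustration.

```ruby
# \1 and \2 refer to the first and second captured groups.
swapped = "John Smith".sub(/(\w+) (\w+)/, '\2, \1')

# \0 is the entire match, just like \&.
wrapped = "abc".sub(/b/, '[\0]')

# \` is the text before the match, \' the text after it.
# (The second literal is double-quoted so that \' reaches sub intact.)
pre_match  = "abc".sub(/b/, '<\`>')
post_match = "abc".sub(/b/, "<\\'>")

# \\ in the replacement produces one literal backslash.
backslash = "a/b".sub('/', '\\\\')

p swapped     # "Smith, John"
p wrapped     # "a[b]c"
p pre_match   # "a<a>c"
p post_match  # "a<c>c"
puts backslash # a\b
```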
Block representation is easier and more human-readable and maintainable:
puts "milk & honey".sub(/&/) { '\&' }

How do I keep the delimiters when splitting a Ruby string?

I have text like:
content = "Do you like to code? How I love to code! I'm always coding."
I'm trying to split it on either a ? or . or !:
content.split(/[?.!]/)
When I print out the results, the punctuation delimiters are missing.
Do you like to code
How I love to code
I'm always coding
How can I keep the punctuation?
Answer
Use a positive lookbehind assertion (i.e. (?<=...)) to split right after each delimiter, which keeps the delimiter at the end of each string:
content.split(/(?<=[?.!])/)
# Returns an array with:
# ["Do you like to code?", " How I love to code!", " I'm always coding."]
That leaves a white space at the start of the second and third strings. Add a match for zero or more whitespace characters (\s*) after the lookbehind to consume it:
content.split(/(?<=[?.!])\s*/)
# Returns an array with:
# ["Do you like to code?", "How I love to code!", "I'm always coding."]
Additional Notes
While it doesn't make sense with your example, the delimiter can be shifted to the front of the strings starting with the second one. This is done with a positive lookahead regular expression (i.e. ?=). For the sake of anyone looking for that technique, here's how to do that:
content.split(/(?=[?.!])/)
# Returns an array with:
# ["Do you like to code", "? How I love to code", "! I'm always coding", "."]
A better example to illustrate the behavior is:
content = "- the - quick brown - fox jumps"
content.split(/(?=-)/)
# Returns an array with:
# ["- the ", "- quick brown ", "- fox jumps"]
Notice that the square-bracket character class wasn't necessary since there is only one delimiter. Also, since the first match happens at the first character, the delimiter appears at the front of the first item in the array as well.
To answer the question's title, adding a capture group to your split regex will preserve the split delimiters:
"Do you like to code? How I love to code! I'm always coding.".split /([?!.])/
=> ["Do you like to code", "?", " How I love to code", "!", " I'm always coding", "."]
From there, it's pretty simple to reconstruct sentences (or do other massaging as the problem calls for it):
content.split(/([?!.])/).each_slice(2).map(&:join).map(&:strip)
=> ["Do you like to code?", "How I love to code!", "I'm always coding."]
The regexes given in other answers do fulfill the body of the question more succinctly, though.
I'd use something like:
content.scan(/.+?[?!.]/)
# => ["Do you like to code?", " How I love to code!", " I'm always coding."]
If you want to get rid of the intervening spaces, use:
content.scan(/.+?[?!.]/).map(&:lstrip)
# => ["Do you like to code?", "How I love to code!", "I'm always coding."]
Use partition. An example from the documentation:
"hello".partition("l") #=> ["he", "l", "lo"]
The most robust way to do this is with a Natural Language Processing library: Rails gem to break a paragraph into series of sentences
You can also split in groups:
content.split(/(\?+)|(\.+)|(!+)/)
After splitting into groups, you can join the sentence and delimiter.
content.split(/(\?+)|(\.+)|(!+)/).each_slice(2) {|slice| puts slice.join}

Regex replace pattern with first char of match & second char in caps

Let's say I have the following string:
"a test-eh'l"
I want to capitalize the start of each word. A word can be separated by a space, apostrophe, hyphen, a forward slash, a period, etc. So I want the string to turn out like this:
"A Test-Eh'L"
I'm not too worried about getting the first character capitalized from the gsub call, as that's easy to do after the fact. However, when I've been using IRB and the match method, I only seem to be getting one result. When I use scan, it collects the matches, but the problem is I cannot really do much with them, as I need to replace the contents of the original string.
Here's what I have so far:
"a test-eh'a".scan(/[\s|\-|\'][a-z]/)
=> [" t", "-e", "'a"]
"a test-eh'a".match(/[\s|\-|\'][a-z]/)
=> #<MatchData " t">
Then if I try the pattern using gsub:
"a test-eh'a".gsub(/[\s|\-|\'][a-z]/, $1)
TypeError: can't convert nil into String
In JavaScript, I would normally use parentheses instead of square brackets on the front section. However, I wasn't getting correct results in the scan call when doing so.
"a test-eh'a".scan(/(\s|\-|\')[a-z]/)
=> [[" "], ["-"], ["'"]]
"a test-eh'a".gsub(/(\s|\-|\')[a-z]/, $1)
=> "a'est'h'"
Any help would be appreciated.
Try this:
"a test-eh'a".gsub(/(?:^|\s|-|')[a-z]/) { |r| r.upcase }
# => "A Test-Eh'A"
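As for why the attempts in the question misbehaved: in gsub(pattern, $1), $1 is an ordinary argument, so it is evaluated before gsub runs and holds whatever an earlier match left behind (nil the first time, hence the TypeError, and "'" after the scan call, hence "a'est'h'"). The sketch below contrasts a replacement string with the block form; the two-group pattern is an assumption introduced for illustration.

```ruby
input = "a test-eh'a"

# '\1' and '\2' in a replacement string refer to the groups of each match,
# but a replacement string cannot call upcase on them.
unchanged = input.gsub(/(\s|-|')([a-z])/, '\1\2')

# In the block form, $1 and $2 are set for each match before the block runs,
# so the second group can be transformed.
capitalized = input.gsub(/(\s|-|')([a-z])/) { "#{$1}#{$2.upcase}" }

p unchanged    # "a test-eh'a"
p capitalized  # "a Test-Eh'A"
```

This variant leaves the first character alone, since it is not preceded by a separator; adding ^ as an alternative, as in the answer above, covers that case.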

Resources