Could someone please explain the following Ruby code to me in detail? - ruby

I nearly had this challenge on Code Wars in the bag, but I blew it because my knowledge of gsub is sub-par at best. While I roughly understand the concept of gsub, I would like a more thorough understanding of it (different ways you can use it could be helpful to my development) as well as a bit-by-bit explanation of the code below.
def autocorrect(input)
input.gsub(/\b(you+|u)\b/i, 'your sister')
end

You're taking every part of the string that matches the regular expression shown and replacing it with the second parameter, which in this case is "your sister". Regular expressions are a bit tricky in Ruby, but essentially that regular expression is saying:
/ #starts the reg exp
\b #any word boundary
(you+|u) #the letters 'yo' followed by one or more of the letter 'u' (so you and youuuuu both fit), or just the letter 'u' alone without a 'y' or 'o'... the pipe symbol is an or statement in reg-exp, taking one or the other for a match.
\b #again finishing a word boundary
/i #closes the expression; the trailing i makes the match case-insensitive.
Check out Rubular for tips. http://rubular.com/
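Since the question also asks about different ways to use gsub, here is a minimal sketch of four common call forms. The sample strings and variable names are made up for illustration; only the pattern comes from the question.

```ruby
# Pattern from the question: "yo" plus one or more "u", or a lone "u",
# as a whole word, case-insensitively.
PATTERN = /\b(you+|u)\b/i

# Illustrative input: "your" and "bayou" must NOT be touched because of \b.
text = "U said youuu like it, but your bayou is muddy"

# 1. String replacement: every match becomes the same text.
plain = text.gsub(PATTERN, 'your sister')

# 2. Block form: the block receives each match and returns its replacement.
shouted = text.gsub(PATTERN) { |match| match.upcase }

# 3. Backreference: \1 is whatever the first group captured.
bracketed = text.gsub(PATTERN, '[\1]')

# 4. Hash form: each match is looked up as a key in the hash.
leet = "cat hat".gsub(/[ac]/, 'a' => '4', 'c' => '<')

puts plain, shouted, bracketed, leet
```

The block form is usually the most flexible, because the replacement can depend on the matched text.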

Related

Split sentence by period followed by a capital letter

I'm trying to find a regex that will split a piece of text into sentences at ./?/! that is followed by a space that is followed by a capital letter.
"Hello there, my friend. In other words, i.e. what's up, man."
should split to:
Hello there, my friend| In other words, i.e. what's up, man|
I can get it to split on ./?/!, but I have no luck getting the space and capital letter criteria.
What I came up with:
.split("/. \s[A-Z]/")
Split a piece of text into sentences based on the criteria that it is a ./?/! that is followed by a space that is followed by a capital letter.
You may use a regex based on a lookahead:
s = "Hello there, my friend. In other words, i.e. what's up, man."
puts s.split(/[!?.](?=\s+\p{Lu})/)
See the Ruby demo. In case you also need to split with the punctuation at the end of the string, use /[!?.](?=(?:\s+\p{Lu})|\s*\z)/.
Details:
[!?.] - matches a !, ? or . that is...
(?=\s+\p{Lu}) - (a positive lookahead) followed by 1+ whitespace characters and then 1 uppercase letter immediately to the right of the current location.
See the Rubular demo.
NOTE: If you need to split regular English text into sentences, you should consider using existing NLP solutions/libraries. See:
Pragmatic Segmenter
srx-english
The latter is based on regex, and can easily be extended with more regular expressions.
Apart from Wiktor's answer, you can also use lookarounds to find a zero-width position and split on it.
Regex: (?<=[.?!]\s)(?=[A-Z]) finds a zero-width position preceded by one of [.?!] plus a whitespace character and followed by an uppercase letter.
s = "Hello there, my friend. In other words, i.e. what's up, man."
puts s.split(/(?<=[.?!]\s)(?=[A-Z])/)
Output
Hello there, my friend.
In other words, i.e. what's up, man.
Ruby Demo
Update: Based on Cary Swoveland's comment.
If the OP wanted to break the string into sentences I'd suggest (?<=[.?!])\s+(?=[A-Z]), as it removes spaces between sentences and permits the number of such spaces to be greater than one.
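To compare the three patterns from the answers above side by side, here is a small sketch. The input string is a made-up variation of the one in the question, with a double space after the first sentence to show where the patterns differ.

```ruby
# Illustrative input: note the TWO spaces after "friend."
s = "Hello there, my friend.  In other words, i.e. what's up, man. Is it? Yes!"

# Lookahead only: the punctuation is consumed, the whitespace is kept.
consume_punct = s.split(/[!?.](?=\s+\p{Lu})/)

# Lookbehind + lookahead, exactly one whitespace character allowed:
# the double space after "friend." prevents a split there.
zero_width = s.split(/(?<=[.?!]\s)(?=[A-Z])/)

# Lookbehind + \s+ + lookahead: punctuation kept, whitespace removed.
trimmed = s.split(/(?<=[.?!])\s+(?=[A-Z])/)

p consume_punct
p zero_width
p trimmed
```

None of the three splits at "i.e. what's", because the character after the whitespace is lowercase.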

Regular expression to find first letter in a string

Consider this example string:
mystr ="1. moody"
I want to capitalize the first letter that occurs in mystr. I am trying this regular expression in Ruby, but it still returns all the letters in mystr (moody) instead of only the letter m.
puts mystr.scan(/[a-zA-Z]{1}/)
Any help appreciated!
Do as below using String#sub
(arup~>~)$ pry --simple-prompt
>> s = "1. moody"
=> "1. moody"
>> s.sub(/[a-z]/i,&:upcase)
=> "1. Moody"
>>
If you want to modify the source string use s.sub!(/[a-z]/i,&:upcase).
Just for completeness, although it doesn’t directly answer your question as posed but could be relevant, consider this variation:
mystr ="1. école"
The line mystr.sub(/[a-z]/i,&:upcase) (as in Arup Rakshit’s answer) will match the second letter of the word, producing
1. éCole
The line mystr.sub /\b\s?[a-zA-Z]{1}/, &:upcase (diego.greyrobot’s answer) won’t match at all and so the line will be unchanged.
There are two problems here. The first is that [a-zA-Z] doesn’t match accented characters, so é isn’t matched. The fix for this is to use the \p{Letter} character property:
mystr.sub /\p{Letter}/, &:upcase
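This will match the character in question, but in Ruby versions before 2.4 it won't change it. This is due to the second problem, which is that in those versions upcase (and downcase) only works on characters in the ASCII range; since Ruby 2.4 both methods handle Unicode letters, so the line above is enough there. On older versions this is almost as easy to fix, but relies on using an external library such as unicode_utils: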
require 'unicode_utils'
mystr.sub(/\p{Letter}/) { |c| UnicodeUtils.upcase(c)}
This results in:
1. École
which is probably what is wanted in this case.
This may not affect you if you are sure all your data is just ASCII, but is worth knowing for other situations.
The reason your attempt returns all the letters is that scan does just that: it returns every substring that matches the regex, which in your case is every letter. For your use case you should use sub, since you only want to substitute one letter.
I use http://rubular.com to practice my Ruby Regexes. Here's what I came up with http://rubular.com/r/fAQEDFVEVn
The regex is: /\b[a-z]/
It uses \b to find a word boundary, and then we ask for one letter only with [a-z]
Finally we'll use sub to replace it with its upcased version:
"1. moody".sub /\b[a-z]/, &:upcase
=> "1. Moody"
Hope that helps.
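To make the difference between scan and sub concrete, here is a short sketch pulling the answers above together. The last line assumes Ruby 2.4 or later, where upcase handles non-ASCII letters; the variable names are illustrative.

```ruby
mystr = "1. moody"

# scan collects EVERY match, which is why the original attempt
# returned all five letters.
all_letters = mystr.scan(/[a-zA-Z]/)

# sub replaces only the FIRST match and leaves the rest alone.
capitalized = mystr.sub(/[a-z]/i, &:upcase)

# \p{L} matches any Unicode letter, so the accented first letter is found.
# Assumption: Ruby >= 2.4, where String#upcase is Unicode-aware.
accented = "1. école".sub(/\p{L}/, &:upcase)

p all_letters
puts capitalized, accented
```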

Regex for matching everything before trailing slash, or first question mark?

I'm trying to come up with a regex that will elegantly match everything in a URL AFTER the domain name, and before the first ?, the last slash, or the end of the URL, if neither of the 2 exist.
This is what I came up with but it seems to be failing in some cases:
regex = /[http|https]:\/\/.+?\/(.+)[?|\/|]$/
In summary:
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/ should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2 should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
Please don't use Regex for this. Use the URI library:
require 'uri'
str_you_want = URI("http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price").path
Why?
See everything about this famous question for a good discussion of why these kinds of things are a bad idea.
Also, this XKCD really says why:
In short, regexes are an incredibly powerful tool, but when you're dealing with things that are defined by hundred-page convoluted standards, and there is already a library that does it faster, easier, and more correctly, why reinvent the wheel?
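Note that URI#path keeps the leading slash (and a trailing one, if present), so to get exactly the strings listed in the question the slashes still need trimming. A minimal sketch, where path_without_slashes is a hypothetical helper name and not part of the URI library:

```ruby
require 'uri'

# Hypothetical helper: returns the URL's path with leading and
# trailing slashes removed. The query string is already excluded by #path.
def path_without_slashes(url)
  URI(url).path.gsub(%r{\A/+|/+\z}, '')
end

base = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"

# The three cases from the question: trailing slash, query string, neither.
results = [base + "/", base + "?id=2", base].map { |url| path_without_slashes(url) }

puts results
```

All three inputs from the question should produce the same path this way.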
If lookaheads are allowed
((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)
Copy + Paste this in http://regexpal.com/
See here with ruby regex tester: http://rubular.com/r/uoLLvTwkaz
Image using javascript regex, but it works out the same
(?=) is just a lookahead
I basically set up three matches from 2XXX up to (in this order):
(?=\?\w+) # lookahead for a question mark followed by one or more word characters
(?=/\s+) # lookahead for a slash followed by one or more whitespace characters
.*\w # match up to the last word character
I'm pretty sure that some parentheses were not needed but I just copy pasted.
There are essentially two OR | expressions in the (A|B|C) expression. The order matters since it's like a (ifthen|elseif|else) type deal.
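A tiny sketch of why the order of alternatives matters, using a made-up word rather than the URL pattern above: the regex engine tries the alternatives left to right and takes the first one that succeeds, not the longest.

```ruby
word = "foobar"

# "foo" is listed first and succeeds, so "foobar" is never tried.
short_first = word[/foo|foobar/]

# With the longer alternative first, the whole word is matched.
long_first = word[/foobar|foo/]

puts short_first, long_first
```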
You can probably adjust the prefix; I just assumed that you wanted 2XXX, where X is a digit, to match.
Also, save the pitchforks everyone: regular expressions are not always the best tool, but they're there for you when you need them.
Also, there is an xkcd for everything: https://xkcd.com/208/

What does `(?:| ...)` mean in a Ruby regular expression?

While reading Engineering long-lasting software: an Agile approach using SaaS and cloud computing I came across the following regex (Chapter 5, Section 5.3 Introducing Cucumber and Capybara):
/^(?:|I )am on (.+)$/
I know about the non-capturing (?: ...) syntax, but what I don’t understand is the meaning of the first pipe character after the colon. Is it a typo? Does it serve any particular purpose?
The pipe in regex means alternative. In this case, it is expressing alternation between an empty string "" and the string "I ".
It is just the or. It can match either nothing or I (with a space). The rest is non-capturing group like you mention.
The regex matches something like "I am on a diet" and also "am on a diet", and in both examples captures "a diet" in the first group.
Try it out on Rubular - http://rubular.com/r/q3RFEoxj1e
(?:|something)
("nothing / empty string or the match")
Matches exactly the same strings as:
(?:something)?
("the match, once or none")
In other words: the non-capturing subpattern is optional.
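A quick sketch to check that both spellings accept and reject the same inputs. The constant names and the third sample line are made up for illustration.

```ruby
WITH_PIPE          = /^(?:|I )am on (.+)$/
WITH_QUESTION_MARK = /^(?:I )?am on (.+)$/

lines = ["I am on a diet", "am on a diet", "You am on a diet"]

# str[regex, 1] returns the first capture group, or nil when there is no match.
results = lines.map { |line| [line[WITH_PIPE, 1], line[WITH_QUESTION_MARK, 1]] }

p results
```

The third line is rejected by both patterns because of the ^ anchor.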

gsub partial replace

I would like to replace only the group in parenthesis in this expression :
my_string.gsub(/<--MARKER_START-->(.)*<--MARKER_END-->/, 'replace_text')
so that I get : <--MARKER_START-->replace_text<--MARKER_END-->
I know I could repeat the whole MARKER_START and MARKER_END blocks in the substitution expression but I thought there should be a more simple way to do this.
You can do it with zero width look-ahead and look-behind assertions.
This regex should work in Ruby 1.9, in Perl, and in many other places:
s.gsub( /(?<=<--MARKER_START-->).*?(?=<--MARKER_END-->)/, 'replacement text' )
Note: Ruby 1.8 only supports look-ahead assertions. You need both look-ahead and look-behind to do this properly.
What happens in Ruby 1.8 is that the ?<= causes it to crash, because it doesn't understand the look-behind assertion. For that part, you then have to fall back to using a backreference, as Greig Hewgill mentions,
so what you get is
s.gsub( /(<--MARKER_START-->).*?(?=<--MARKER_END-->)/, '\1replacement text' )
EXPLANATION THE FIRST:
I've replaced the (.)* in the middle of your regex with .*? - this is non-greedy.
If you don't have non-greedy, then your regex will try and match as much as it can - if you have 2 markers on one line, it goes wrong. This is best illustrated by example:
"<b>One</b> Two <b>Three</b>".gsub( /<b>.*<\/b>/, 'BOLD' )
=> "BOLD"
What we actually want:
"<b>One</b> Two <b>Three</b>".gsub( /<b>.*?<\/b>/, 'BOLD' )
=> "BOLD Two BOLD"
EXPLANATION THE SECOND:
zero-width-look-ahead-assertion sounds like a giant pile of nerdly confusion.
What "look-ahead-assertion" actually means is "Only match if the thing we are looking for is followed by this other stuff."
For example, only match a digit, if it is followed by an F.
"123F" =~ /\d(?=F)/ # will match the 3, but not the 1 or the 2
What "zero width" actually means is "consider the 'followed by' in our search, but don't count it as part of the match when doing replacement or grouping or things like that."
Using the same example of 123F, if we didn't use the lookahead assertion, and instead just did this:
"123F" =~ /\dF/ # will match 3F, because F is considered part of the match
As you can see, this is ideal for checking for our <--MARKER END-->, but what we need for the <--MARKER START--> is the ability to say "Only match if the thing we are looking for FOLLOWS this other stuff". That's called a look-behind assertion, which Ruby 1.8 doesn't have for some strange reason.
Hope that makes sense :-)
PS: Why use lookahead assertions instead of just backreferences? If you use lookahead, you're not actually replacing the <--MARKER--> bits, only the contents. If you use backreferences, you are replacing the whole lot. I don't know if this incurs much of a performance hit, but from a programming point of view it seems like the right thing to do, as we don't actually want to be replacing the markers at all.
You could do something like this:
my_string.gsub(/(<--MARKER_START-->)(.*)(<--MARKER_END-->)/, '\1replace_text\3')
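To compare the approaches from both answers, here is a sketch on a made-up string that contains two marker pairs on one line. It assumes Ruby 1.9 or later for the look-behind version.

```ruby
# Illustrative input with TWO marker pairs on the same line.
s = "x <--MARKER_START-->old<--MARKER_END--> y <--MARKER_START-->older<--MARKER_END--> z"

# Lookarounds: only the text between the markers is part of the match.
with_lookarounds = s.gsub(/(?<=<--MARKER_START-->).*?(?=<--MARKER_END-->)/, 'new')

# Backreferences, non-greedy: the markers are matched, then put back via \1 and \2.
with_backrefs = s.gsub(/(<--MARKER_START-->).*?(<--MARKER_END-->)/, '\1new\2')

# Backreferences, greedy (.*): swallows everything up to the LAST end marker.
greedy = s.gsub(/(<--MARKER_START-->)(.*)(<--MARKER_END-->)/, '\1new\3')

puts with_lookarounds, with_backrefs, greedy
```

The first two produce the same result; the greedy version only behaves the same when there is a single marker pair on the line.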
