I'm breaking down the code below to check my understanding of the regex and gsub:
str = "abc/def/ghi.rb"
str = str.gsub(/^.*\//, '')
#str = ghi.rb
^ : beginning of the string
\/ : escape character for /
^.*\/ : everything from beginning to the last occurrence of / in the string
Is my understanding of the expression right?
How does .* work exactly?
Your general understanding is correct. The entire regex will match abc/def/ and String#gsub will replace it with an empty string.
However, note that String#gsub doesn't change the string in place; it returns a new string. Your code works because it assigns the result back to str. Without that assignment, str would still contain the original value ("abc/def/ghi.rb") after the substitution. To change the string in place, you can use String#gsub!.
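A minimal sketch of that difference, using the string and pattern from the question (the method name gsub_demo is made up for illustration):

```ruby
# Hypothetical helper, only for illustration.
def gsub_demo
  str = "abc/def/ghi.rb".dup          # dup so the string is safely mutable

  returned   = str.gsub(/^.*\//, '')  # gsub returns a new string...
  after_gsub = str.dup                # ...and leaves the receiver untouched

  str.gsub!(/^.*\//, '')              # gsub! modifies the receiver itself
  [returned, after_gsub, str]
end

p gsub_demo
# => ["ghi.rb", "abc/def/ghi.rb", "ghi.rb"]
```

Keep in mind that gsub! returns nil when nothing was replaced, so its return value is best not chained blindly.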
As to how .* works - the algorithm the regex engine uses is called backtracking. Since .* is greedy (will try to match as many characters as possible), you can think that something like this will happen:
Step 1: .* matches the entire string abc/def/ghi.rb. Afterwards \/ tries to match a forward slash, but fails (nothing is left to match). .* has to backtrack.
Step 2: .* matches the entire string except the last character - abc/def/ghi.r. Afterwards \/ tries to match a forward slash, but fails (/ != b). .* has to backtrack.
Step 3: .* matches the entire string except the last two characters - abc/def/ghi.. Afterwards \/ tries to match a forward slash, but fails (/ != r). .* has to backtrack.
...
Step n: .* matches abc/def. Afterwards \/ tries to match a forward slash and succeeds. The matching ends here.
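The steps above can be mimicked with a small loop. This is only a simplified model of what the engine does for this one pattern (the method name trace_greedy_match is made up), not how a real regex engine is implemented:

```ruby
# Simplified model of how the greedy .* in /^.*\// gives characters back.
def trace_greedy_match(str)
  steps = []
  str.length.downto(0) do |len|
    consumed = str[0, len]   # what .* holds on this attempt
    steps << consumed
    # \/ must match the character right after what .* consumed
    return [steps, consumed + "/"] if str[len] == "/"
  end
  [steps, nil]               # no slash anywhere: the overall match fails
end

steps, match = trace_greedy_match("abc/def/ghi.rb")
puts steps   # each candidate for .*, longest first
p match      # => "abc/def/"
```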
No, not quite.
^: beginning of a line
\/: escaped slash (escape character is \ alone)
^.*\/ : everything from the beginning of a line to the last occurrence of / on that line
.* depends on the mode of the regex. In singleline mode (i.e., without m option), it means the longest possible sequence of zero or more non-newline characters. In multiline mode (i.e., with m option), it means the longest possible sequence of zero or more characters.
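A short sketch of the difference, using a made-up two-line string (TWO_LINES is not from the question):

```ruby
TWO_LINES = "abc/def\nghi/jkl.rb"

# Without the m option, . stops at the newline, so the match stays on one line.
p TWO_LINES[/^.*\//]           # => "abc/"

# With the m option, . also matches "\n", so .* can run across lines.
p TWO_LINES[/^.*\//m]          # => "abc/def\nghi/"

# ^ anchors at every line start, so gsub strips each line separately.
p TWO_LINES.gsub(/^.*\//, '')  # => "def\njkl.rb"
```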
Your understanding is correct, but you should also note that the last statement is true because:
Repetition is greedy by default: as many occurrences as possible
are matched while still allowing the overall match to succeed.
Quoted from the Regexp documentation.
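To see the effect of greediness, it may help to compare it with the lazy form .*? on the string from the question (the constant name PATH is just for illustration):

```ruby
PATH = "abc/def/ghi.rb"

p PATH[/^.*\//]    # greedy: as much as possible => "abc/def/"
p PATH[/^.*?\//]   # lazy: as little as possible => "abc/"
```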
Yes. In short, it matches any number of any characters (.*) ending with a literal / (\/).
gsub replaces the match with the second argument (empty string '').
Nothing wrong with your regex, but File.basename(str) might be more appropriate.
To expound on what #Stefen said: It really looks like you're dealing with a file path, and that makes your question an XY problem where you're asking about Y when you should ask about X: Rather than how to use and understand a regex, the question should be what tool is used to manage paths.
Instead of rolling your own code, use code already written that comes with the language:
str = "abc/def/ghi.rb"
File.basename(str) # => "ghi.rb"
File.dirname(str) # => "abc/def"
File.split(str) # => ["abc/def", "ghi.rb"]
The reason you want to take advantage of File's built-in code is that it takes into account the difference between directory delimiters in *nix-style OSes and Windows. File::SEPARATOR is "/" on every platform, and on Windows Ruby additionally sets File::ALT_SEPARATOR to "\\" (it is nil elsewhere), so the File methods accept both delimiters there:
File::SEPARATOR # => "/"
If your code moves from one system to another it will continue working if you use the built-in methods, whereas using a regex will immediately break because the delimiter will be wrong.
Related
So I have a string that looks like this:
#jackie#test.com, #mike#test.com
What I want to do is before any email in this comma separated list, I want to remove the #. The issue I keep running into is that if I try to do a regular \A flag like so /[\A#]+/, it finds all the instances of # in that string...including the middle crucial #.
The same thing happens if I do /[\s#]+/. I can't figure out how to just look at the beginning of each string, where each string is a complete email address.
Edit 1
Note that all I need is the regex, I already have the rest of the stuff I need to do what I want. Specifically, I am achieving everything else like this:
str.gsub(/#/, '').split(',').map(&:strip)
Where str is my string.
All I am looking for is the regex portion for my gsub.
You may use the below negative lookbehind based regex.
str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
The negative lookbehind (?<!\S) asserts that the character we are about to match is not preceded by a non-space character. So this matches a # at the start of the string or a # that follows a whitespace character.
The difference between my answer and hwnd's str.gsub(/\B#/, '') is that mine won't match the # in :# but hwnd's will. \B matches between two word characters or between two non-word characters.
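A small sketch comparing the two patterns. The method names and the "x:#y" input are made up for illustration:

```ruby
def strip_with_lookbehind(s)
  s.gsub(/(?<!\S)#/, '')   # only a # at the start or after whitespace
end

def strip_with_non_boundary(s)
  s.gsub(/\B#/, '')        # any # preceded by a non-word character (or nothing)
end

list = "#jackie#test.com, #mike#test.com"
p strip_with_lookbehind(list)     # => "jackie#test.com, mike#test.com"
p strip_with_non_boundary(list)   # => "jackie#test.com, mike#test.com"

tricky = "x:#y"                   # hypothetical input containing :#
p strip_with_lookbehind(tricky)   # => "x:#y"
p strip_with_non_boundary(tricky) # => "x:y"
```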
Here is one solution
str = "#jackie#test.com, #mike#test.com"
p str.split(/,[ ]+/).map{ |i| i.gsub(/^#/, '')}
Output
["jackie#test.com", "mike#test.com"]
I want to match urls that do NOT contain the string 'localhost' using Ruby regex
Based on answers and comments here, I put together two solutions, both of which seem to work:
Solution A:
(?!.*localhost)^.*$
Example: http://rubular.com/r/tQtbWacl3g
Solution B:
^((?!localhost).)*$
Example: http://rubular.com/r/2KKnQZUMwf
The problem is that I don't understand what they're doing. For example, according to the docs, ^ can be used in various ways:
[^abc] Any single character except: a, b, or c
^ Start of line
But I don't get how it's being applied here.
Can someone breakdown these expressions for me, and how they differ from one another?
In both of your cases, ^ is just the start of the line (since it's not used inside a character class). Since both ^ and the lookahead are zero-width assertions, we can switch them around in the first case - I think that makes it a bit easier to explain:
^(?!.*localhost).*$
The ^ anchors the expression to the beginning of the string. The lookahead then starts from that position and tries to find localhost anywhere in the string (the "anywhere" is taken care of by the .* in front of localhost). If that localhost can be found, the subexpression of the lookahead matches and therefore the negative lookahead causes the pattern to fail. Since the lookahead is bound to start at the beginning of the string by the adjacent ^, this means the pattern overall cannot match. If, however, the .*localhost does not match (and hence localhost does not occur in the string), the lookahead succeeds, and the .*$ simply takes care of matching the rest of the string.
Now the other one
^((?!localhost).)*$
This time the lookahead only checks at the current position (there is no .* inside it). But the lookahead is repeated for every single character. This way it does check every single position again. Here is roughly what happens: the ^ makes sure that we're starting at the beginning of the string again. The lookahead checks whether the word localhost is found at that position. If not, all is well, and . consumes one character. The * then repeats both of those steps. We are now one character further in the string, and the lookahead checks whether the second character starts the word localhost - again, if not, all is well, and . consumes another character. This is done for every single character in the string, until we reach the end.
In this particular case both methods are equivalent, and you could select one based on performance (if it matters) or readability (if not; probably the first one). However, in other cases the second variant is preferable, because it allows you to do this repetition for a fixed part of the string, whereas the first variant will always check the entire string.
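A sketch of both variants side by side. CHECK_ONCE and CHECK_EACH are the two patterns discussed above; HOST_CHECK is a hypothetical pattern, not from the question, that applies the per-character check to the host part only, to illustrate the "fixed part of the string" remark:

```ruby
CHECK_ONCE = /^(?!.*localhost).*$/     # one lookahead scanning the whole line
CHECK_EACH = /^((?!localhost).)*$/     # lookahead repeated before every character

# Assumption for illustration: restrict the repeated check to the host.
HOST_CHECK = %r{^http://((?!localhost)[^/])*(/.*)?$}

def verdicts(url)
  [CHECK_ONCE, CHECK_EACH, HOST_CHECK].map { |re| re.match?(url) }
end

p verdicts("http://example.com/")           # => [true, true, true]
p verdicts("http://localhost:1234/toto")    # => [false, false, false]
p verdicts("http://example.com/localhost")  # => [false, false, true]
```

The last line shows where the variants differ: only the host-restricted pattern accepts a URL whose path merely contains the word.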
You can get them easily explained online. The first:
NODE EXPLANATION
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
localhost 'localhost'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
--------------------------------------------------------------------------------
' '
And the second:
NODE EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
( group and capture to \1 (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
localhost 'localhost'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
. any character except \n
--------------------------------------------------------------------------------
)* end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string
--------------------------------------------------------------------------------
As an aside comment, these two solutions are slow. A better way is to use:
^(?:[^l]+|l(?!ocalhost))+
In other words: any run of characters that are not an l, or an l that is not followed by ocalhost. Note that, as written, the pattern has no end anchor, so it only rejects the whole string when a $ follows the group.
This gives a better result since the lookahead doesn't have to be checked at each position. (For a URL like http://localhost:1234/toto this kind of pattern fails in ~15 steps vs ~50 steps for the two other patterns.)
You can improve this pattern using atomic groups and possessive quantifiers to forbid backtracks:
^(?>[^l]++|l(?!ocalhost))++
Note that in your particular case you can speed up your pattern considering that you only want to check the host part of the url. Example:
^http:\/\/(?>[^l\s\/]++|l(?!ocalhost))++(?>\/\S*+|$)
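A quick check of these patterns in Ruby. UNANCHORED and HOST_ONLY are copied from above; the trailing $ in ANCHORED is an addition made here on the assumption that the whole line is meant to be rejected:

```ruby
UNANCHORED = /^(?>[^l]++|l(?!ocalhost))++/
ANCHORED   = /^(?>[^l]++|l(?!ocalhost))++$/   # $ added here (assumption)
HOST_ONLY  = /^http:\/\/(?>[^l\s\/]++|l(?!ocalhost))++(?>\/\S*+|$)/

LOCAL_URL = "http://localhost:1234/toto"

p LOCAL_URL[UNANCHORED]                             # => "http://" (a prefix still matches)
p ANCHORED.match?(LOCAL_URL)                        # => false
p ANCHORED.match?("http://example.com/")            # => true
p HOST_ONLY.match?(LOCAL_URL)                       # => false
p HOST_ONLY.match?("http://example.com/localhost")  # => true
```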
according to the docs, ^ can be used in various ways:
[^abc] Any single character except: a, b, or c
^ Start of line
But I don't get how it's being applied here.
In the regex
(?!.*localhost)^.*$
The ^ is not inside any brackets, so the second one applies. Here is a trivial example:
/^x/
That regex says to match the start of the line, followed by the letter x. So it will match lines like this:
xcellent
x-ray
However, the regex will not match the lines:
axb
excellent
...because the x does not appear directly after the start of the line. You may wonder why 'axb' doesn't match. After all 'a' is the start of the line, and it is followed by an 'x'. However, 'start of the line' is just to the left of the first character, like this:
|
V
axb
^ is called a zero-width match because it matches the slim sliver just to the left of the 'a', e.g. between the starting quote mark and the 'a' in "axb". There's not really any space there, so ^ matches something that is 0 width.
Here is another example:
/x^/
That says to match the character x followed by the start of the line. Well, no line can have an x first and then the start of the line second, so that won't ever match anything.
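These cases can be tried directly. The sketch below also adds a multi-line string (not from the question) to show that in Ruby ^ matches after every newline, whereas \A matches only at the very start of the string; the method names are made up:

```ruby
def line_start_x(str)
  str =~ /^x/    # index of an x sitting at the start of a line, or nil
end

def string_start_x(str)
  str =~ /\Ax/   # index of an x at the very start of the string, or nil
end

p line_start_x("xcellent")   # => 0
p line_start_x("axb")        # => nil
p line_start_x("a\nxb")      # => 2, ^ also matches right after a newline
p string_start_x("a\nxb")    # => nil
p("x\nx" =~ /x^/)            # => nil, nothing can follow x and still be a line start
```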
Now your regex:
(?!.*localhost)^.*$
Like the 'start of line' ^, a lookahead is zero-width. What that means is that the lookahead scans the string looking for the match, but when it finds the match, it comes back to the beginning of the string, and then looks for the rest of the regex:
^.*$
One word of advice: when a regex requires lookarounds (lookaheads or lookbehinds), 99% of the time there are easier ways to do what you want. For instance, you could write:
url = "....."
if url.start_with?('http')
  # then the line starts with 'http'
else
  # the line doesn't start with 'http'
end
That's much easier to read, and it doesn't require trying to decipher a complex regex.
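Applied to the localhost question itself, the same idea might look like the sketch below. The method names are made up; the second one assumes the input is a well-formed URL, since URI.parse raises URI::InvalidURIError otherwise:

```ruby
require 'uri'

# Plain substring check, no regex needed.
def mentions_localhost?(url)
  url.include?('localhost')
end

# Stricter variant: only look at the host part of the URL.
def localhost_host?(url)
  URI.parse(url).host == 'localhost'
end

p mentions_localhost?("http://localhost:1234/toto")    # => true
p mentions_localhost?("http://example.com/")           # => false
p localhost_host?("http://localhost:1234/toto")        # => true
p localhost_host?("http://example.com/localhost")      # => false
```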
I am using Ruby 1.9.3 and going through the Ruby tutorials. I got stuck on a statement in which a regular expression works and produces output, but I'm confused about the logic of the \/ operator.
RegExp-1
Today's date is: 1/15/2013. (String)
(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{4}) (Expression)
RegExp-2
s = 'a' * 25 + 'd' 'a' * 4 + 'c' (String)
/(b|a+)*\/ =~ s #=> ( expression)
I couldn't understand how \/ and the =~ operator work in Ruby.
Could anyone here help me understand them?
Thanks
\ serves as an escape character. In this context, it is used to indicate that the next character is a normal one and should not serve some special function. Normally the / would end the regex, as regexes are bookended by /. But preceding the / with a \ basically says "I'm not telling you to end the regex when I use this /, I want it as part of the regex."
As Lee pointed out, your second regex is invalid, specifically because you never end the regex with a proper /. You escape the last / so that it's just a plain-text character, so the regex is left hanging. It's like doing str = "hello.
As another example, ^ is normally used in a regex to indicate the beginning of a line, but \^ means you just want the literal ^ character in the regex.
=~ says "does the regex match the string?" If there is a match, it returns the index of the start of the match, otherwise returns nil. See this question for details.
EDIT: Note that the ?<month>, ?<day>, ?<year> parts are named capture groups. It seems like you could use a bit of brush-up on regex; check out this appendix of sorts to see what all the different special characters do.
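A small sketch of both points, using the string and expression from the question (the constant names are made up):

```ruby
DATE_PATTERN = /(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{4})/
DATE_TEXT    = "Today's date is: 1/15/2013."

# =~ returns the index where the match starts, or nil if there is none.
p(DATE_TEXT =~ DATE_PATTERN)        # => 17
p("no date here" =~ DATE_PATTERN)   # => nil

# The named groups can be read from the MatchData object.
m = DATE_PATTERN.match(DATE_TEXT)
p m[:month]   # => "1"
p m[:day]     # => "15"
p m[:year]    # => "2013"
```

Writing the pattern with a %r{...} literal instead of /.../ would avoid the need to escape the slashes at all.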
My implementation of markdown turns double hyphens into endashes. E.g., a -- b becomes a – b
But sometimes users write a - b when they mean a -- b. I'd like a regular expression to fix this.
Obviously body.gsub(/ - /, " -- ") comes to mind, but this messes up markdown's unordered lists, i.e., if a line starts - list item, it will become -- list item. So the solution must only swap out hyphens when there is a word character somewhere to their left.
You can match a word character to the hyphen's left and use a backreference in the replacement string to put it back:
body.gsub(/(\w) - /, '\1 -- ')
Perhaps, if you want to be a little more accepting ...
gsub(/\b([ \t]+)-(?=[ \t]+)/, '\1--')
\b[ \t] forces a word character before the whitespace through a word-boundary condition. I don't use \s, to avoid matching across line breaks. I also use only one capture, to preserve the preceding whitespace (does Ruby 1.8.x have a ?<= ?).
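A sketch comparing the naive substitution with the two patterns above. The method names and the sample body (with an indented list item) are made up for illustration:

```ruby
def fix_dashes(body)
  body.gsub(/(\w) - /, '\1 -- ')                 # needs a word character right before " - "
end

def fix_dashes_lenient(body)
  body.gsub(/\b([ \t]+)-(?=[ \t]+)/, '\1--')     # accepts any run of spaces or tabs
end

SAMPLE_BODY = "  - list item\na - b"

p SAMPLE_BODY.gsub(/ - /, " -- ")      # => "  -- list item\na -- b" (list item broken)
p fix_dashes(SAMPLE_BODY)              # => "  - list item\na -- b"
p fix_dashes_lenient(SAMPLE_BODY)      # => "  - list item\na -- b"

# Only the lenient pattern handles extra whitespace around the hyphen.
p fix_dashes("a  -  b")                # => "a  -  b"
p fix_dashes_lenient("a  -  b")        # => "a  --  b"
```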
Why does this snippet:
'He said "Hello"' =~ /(\w)\1/
match "ll"? I thought that the \w part matches "H", and hence \1 refers to "H", so nothing should be matched. Why this result?
I thought that the \w part matches "H"
\w matches any alphanumeric character (and underscore). It also happens to match H, but that's not terribly interesting, since the regular expression then goes on to say that the same character has to appear twice in a row. H can't satisfy that in your text (it doesn't appear twice consecutively), and neither can any of the other characters except l. So the regular expression matches ll.
You're thinking of /^(\w)\1/. The caret symbol specifies that the match must start at the beginning of the line. Without that, the match can start anywhere in the string (it will find the first match).
And you're right, nothing was matched at that position. Then the regex engine went further and found a match, which it returned to you.
\w of course matches any word character, not just 'H'.
The point is, \1 must match the same text that the (\w) group just captured. Only the letter "l" is doubled, so only "ll" matches your regex.
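The behaviour described in these answers can be checked with a short sketch (the constant name QUOTE is made up):

```ruby
QUOTE = 'He said "Hello"'

p(QUOTE =~ /(\w)\1/)     # => 11, the index of the first l in "Hello"
p QUOTE[/(\w)\1/]        # => "ll", the whole match
p QUOTE[/(\w)\1/, 1]     # => "l", what group 1 captured
p(QUOTE =~ /^(\w)\1/)    # => nil, anchored at the start, "He" is not a doubled character
```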
A nice page for toying around with ruby and regular expressions is Rubular