How does the \/ operator work in Ruby regular expressions? - ruby

I am using Ruby 1.9.3 and going through the Ruby tutorials. I got stuck on a statement in which a regular expression works and produces output, but I am confused about the logic of the \/ operator.
RegExp-1
Today's date is: 1/15/2013. (String)
(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{4}) (Expression)
RegExp-2
s = 'a' * 25 + 'd' 'a' * 4 + 'c' (String)
/(b|a+)*\/ =~ s #=> ( expression)
Now I can't understand how the \/ and =~ operators work in Ruby.
Could anyone here help me understand them?
Thanks

\ serves as an escape character. In this context, it indicates that the next character is an ordinary one and should not serve its special function. Normally the / would end the regex, since regex literals are delimited by /. Preceding the / with a \ basically says "I'm not telling you to end the regex with this /, I want it as part of the regex."
As Lee pointed out, your second regex is invalid, specifically because you never end it with a proper /. You escape the last /, so it is just a literal character and the regex is left hanging. It's like doing str = "hello.
As another example, ^ is normally used in a regex to indicate the beginning of a line, but \^ means you just want the literal ^ character.
=~ asks "does the regex match the string?" If there is a match, it returns the index of the start of the match; otherwise it returns nil. See this question for details.
EDIT: Note that ?<month>, ?<day>, and ?<year> are named capture groups. It seems like you could use a bit of a brush-up on regex; check out this appendix of sorts to see what all the different special characters do.
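To make the escaping and the return value of =~ concrete, here is a minimal sketch using the date string from the question. The variable names (text, date_re, start_index, no_match, same_re) are only illustrative.

```ruby
text = "Today's date is: 1/15/2013."

# \/ is a literal slash; an unescaped / would end the regex literal.
date_re = /(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{4})/

# =~ returns the index where the match starts, or nil when nothing matches.
start_index = text =~ date_re            # => 17
no_match    = "no date here" =~ date_re  # => nil

# Regexp#match returns a MatchData; named groups are read by name.
m = date_re.match(text)
puts m[:month], m[:day], m[:year]

# %r{...} uses different delimiters, so / needs no escaping at all.
same_re = %r{(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{4})}
puts same_re.match(text)[:year]
```

The %r{...} form is often easier to read when a pattern contains several slashes.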

Related

Generating a character class

I'm trying to censor letters in a word with word.gsub(/[^#{guesses}]/i, '-'), where word and guesses are strings.
When guesses is "", I get this error RegexpError: empty char-class: /[^]/i. I could sort such cases with an if/else statement, but can I add something to the regex to make it work in one line?
Since you are only matching (or not matching) letters, you can add a non-letter character to your regex, e.g. # or %:
word.gsub(/[^%#{guesses}]/i, '-')
See IDEONE demo
If #{guesses} is empty, the regex will still be valid, and since % does not appear in a word, there is no risk of failing to censor a percent sign.
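A small sketch of the placeholder idea, wrapped in a hypothetical helper named censor. It assumes % never occurs in word, and it adds Regexp.escape (not part of the answer above) so that guesses such as "-" or "]" cannot break the character class.

```ruby
# Hypothetical helper: replaces every character of word that has not been guessed.
def censor(word, guesses)
  # '%' keeps the class non-empty even when guesses is "".
  word.gsub(/[^%#{Regexp.escape(guesses)}]/i, '-')
end

puts censor("Ruby", "")    # nothing guessed yet
puts censor("Ruby", "ub")  # two letters revealed
```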
You have two options. One is to skip the substitution when guesses is empty, that is:
unless guesses.empty?
  word.gsub(/[^#{Regexp.escape(guesses)}]/i, '-')
end
Although that's not what you asked for, it's really the safest plan here and the clearest in terms of code.
Or you could use the tr function instead, though only for non-empty strings, so this could be substituted inside the unless block:
word.tr('^' + guesses.downcase + guesses.upcase, '-')
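Generally tr performs better than gsub if used frequently. It also needs no regex escaping, although -, ^ and \ still have special meaning inside a tr character set, so this assumes guesses contains only letters.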
Edit: Added a note about tr not working on empty strings.
Since tr treats ^ as a special case on empty strings, you can use an embedded ternary, but that ends up confusing what's going on considerably:
word.tr(guesses.empty? ? '' : ('^' + guesses.downcase + guesses.upcase), '-')
This may look somewhat similar to tadman's answer.
Probably you should keep the string that represents what you want to hide, instead of what you want to show. Let's say this is remains. Then, it would be easy as:
word.tr(remains.upcase + remains.downcase, "-")
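One way the remains idea could look in practice. The helper name censor_with_tr and the way remains is derived from guesses are assumptions for illustration, and the sketch assumes word contains only letters (characters such as - or ^ are special to tr).

```ruby
# Hypothetical helper: hides the letters of word that have not been guessed yet.
def censor_with_tr(word, guesses)
  # Letters still hidden = distinct letters of the word minus the guessed ones.
  remains = (word.downcase.chars.uniq - guesses.downcase.chars).join
  # With an empty remains, tr has nothing to replace and returns the word as-is.
  word.tr(remains.upcase + remains.downcase, "-")
end

puts censor_with_tr("Ruby", "")
puts censor_with_tr("Ruby", "ub")
puts censor_with_tr("Ruby", "ruby")
```

Because the empty case falls out naturally here, no if/else or ternary is needed.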

How to understand gsub(/^.*\//, '') or the regex

Breaking up the code below to check my understanding of the regex and gsub:
str = "abc/def/ghi.rb"
str = str.gsub(/^.*\//, '')
# str is now "ghi.rb"
^ : beginning of the string
\/ : escape character for /
^.*\/ : everything from beginning to the last occurrence of / in the string
Is my understanding of the expression right?
How does .* work exactly?
Your general understanding is correct. The entire regex will match abc/def/ and String#gsub will replace it with empty string.
However, note that String#gsub doesn't change the string in place; it returns a new string. Here str ends up as "ghi.rb" only because the result is assigned back to it; without the assignment, str would still contain the original value ("abc/def/ghi.rb"). To change the string in place, you can use String#gsub!.
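A short sketch of the difference between gsub and gsub!. The variable names are illustrative, and .dup is used only to guarantee a mutable string.

```ruby
str = "abc/def/ghi.rb".dup

# gsub returns a new string and leaves the receiver untouched.
basename = str.gsub(/^.*\//, '')
after_gsub = str.dup              # snapshot: still the full path

# gsub! modifies the receiver in place.
str.gsub!(/^.*\//, '')
after_bang = str.dup              # snapshot: now only the file name

# gsub! returns nil when it made no substitution, so avoid chaining on it.
second_call = str.gsub!(/^.*\//, '')

puts basename, after_gsub, after_bang, second_call.inspect
```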
As to how .* works - the algorithm the regex engine uses is called backtracking. Since .* is greedy (will try to match as many characters as possible), you can think that something like this will happen:
Step 1: .* matches the entire string abc/def/ghi.rb. Afterwards \/ tries to match a forward slash, but fails (nothing is left to match). .* has to backtrack.
Step 2: .* matches the entire string except the last character - abc/def/ghi.r. Afterwards \/ tries to match a forward slash, but fails (/ != b). .* has to backtrack.
Step 3: .* matches the entire string except the last two characters - abc/def/ghi.. Afterwards \/ tries to match a forward slash, but fails (/ != r). .* has to backtrack.
...
Step n: .* matches abc/def. Afterwards \/ tries to match a forward slash and succeeds. The matching ends here.
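The effect of the backtracking steps above can be seen by comparing the greedy .* with its lazy counterpart .*? on the same string (a minimal sketch; variable names are illustrative):

```ruby
str = "abc/def/ghi.rb"

# Greedy: .* grabs everything, then backs off until a / can match,
# so the match ends at the LAST slash.
greedy = str[/^.*\//]

# Lazy: .*? takes as little as possible, so the match ends at the FIRST slash.
lazy = str[/^.*?\//]

puts greedy
puts lazy
puts str.gsub(/^.*\//, '')
```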
No, not quite.
^: beginning of a line
\/: escaped slash (escape character is \ alone)
^.*\/ : everything from the beginning of a line to the last occurrence of / on that line (without the m option, . does not match a newline)
.* depends on the mode of the regex. In singleline mode (i.e., without m option), it means the longest possible sequence of zero or more non-newline characters. In multiline mode (i.e., with m option), it means the longest possible sequence of zero or more characters.
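The difference between ^, \A and the m option only shows up on strings that contain a newline. A minimal sketch with a made-up two-line string:

```ruby
text = "a/b\nc/d"

# ^ matches at the start of every line, so each line is trimmed separately.
per_line = text.gsub(/^.*\//, '')

# \A matches only at the very start of the string.
start_only = text.gsub(/\A.*\//, '')

# With the m option, . also matches "\n", so .* can run across lines
# up to the last slash in the whole string.
across_lines = text.gsub(/^.*\//m, '')

p per_line, start_only, across_lines
```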
Your understanding is correct, but you should also note that the last statement is true because:
Repetition is greedy by default: as many occurrences as possible
are matched while still allowing the overall match to succeed.
Quoted from the Regexp documentation.
Yes. In short, it matches any number of any characters (.*) ending with a literal / (\/).
gsub replaces the match with the second argument (empty string '').
Nothing wrong with your regex, but File.basename(str) might be more appropriate.
To expound on what @Stefen said: It really looks like you're dealing with a file path, and that makes your question an XY problem where you're asking about Y when you should ask about X: Rather than how to use and understand a regex, the question should be what tool is used to manage paths.
Instead of rolling your own code, use code already written that comes with the language:
str = "abc/def/ghi.rb"
File.basename(str) # => "ghi.rb"
File.dirname(str) # => "abc/def"
File.split(str) # => ["abc/def", "ghi.rb"]
The reason you want to take advantage of File's built-in code is that it accounts for the different directory delimiters in *nix-style OSes and Windows. File::SEPARATOR is "/" on every platform, and on Windows Ruby additionally sets File::ALT_SEPARATOR to "\\" so that the File methods accept both delimiters:
File::SEPARATOR # => "/"
If your code moves from one system to another it will continue working if you use the built-in methods, whereas a regex hard-coded to / will miss paths that use a different delimiter.

regex multiple matches with OR look behind

I have the following string:
'/photos/full/1/454/6454.jpg?20140521103415','/photos/full/2/452/54_2.jpg?20140521104743','/photos/full/3/254/C2454_3.jpg?20140521104744'
What I want to parse is the address from / to the ? but I can't seem to figure it out.
So far I have /(?<=')[^?]*/ which will properly get the first link, but the second and third link will start with ,'/photos/full/... <--notice that it starts with a ,'
If I then try /(?<=',')[^?]*/ I get the second and third link but miss the first link.
Rather than use two regexes, is there a way I can combine them into one? I've tried using /((?<=')|(?<=',')[^?]*/ to no avail.
My code is of the form matches = string.scan(regex) and then I run a match.each block...
In Ruby 2, which has \K, you can use this simple regex (see demo):
'\K/[^?]+
To see all the matches:
regex = /'\K\/[^?]+/
subject.scan(regex) {|result|
# inspect result
}
Explanation of the regex:
'      # a literal single quote
\K     # "keep out": drops what has been matched so far from the result
\/     # a literal /
[^?]+  # one or more characters other than ? (as many as possible)
You can use this:
(?<=,|^)'\K[^?]+
Here (?<=,|^) checks that the quote is preceded by a comma or the start of the string/line, and \K removes everything to its left (here, the quote) from the match result.
Or, more simply:
[^?']+(?=\?)
That is, a run of characters that are neither a quote nor a question mark, followed by a question mark.
One can simply use a positive lookahead and non-greedy operator, and this of course is not limited to v2.0:
str.scan(/(?<=')\/.*?(?=\?)/)
#=> ["/photos/full/1/454/6454.jpg",
# "/photos/full/2/452/54_2.jpg",
# "/photos/full/3/254/C2454_3.jpg"]
Edit: I added a positive lookbehind for the single quote. See comments.
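For comparison, here is a runnable sketch that applies three of the patterns suggested above to the string from the question. The variable names are illustrative, and \K requires Ruby 2.0 or later.

```ruby
subject = "'/photos/full/1/454/6454.jpg?20140521103415'," \
          "'/photos/full/2/452/54_2.jpg?20140521104743'," \
          "'/photos/full/3/254/C2454_3.jpg?20140521104744'"

# \K drops the opening quote from each match.
with_keep = subject.scan(/'\K\/[^?]+/)

# Lookbehind for the quote, lazy .*?, lookahead for the ?.
with_lookaround = subject.scan(/(?<=')\/.*?(?=\?)/)

# No quote or ? inside the match, and a ? must follow.
with_class = subject.scan(/[^?']+(?=\?)/)

puts with_keep
puts with_keep == with_lookaround && with_keep == with_class
```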

how to extract value from line found

I'm opening a file and finding the line I need, but then I have trouble creating a variable from the found string
70c 08:04:04.014 rexx TRACE 2203 8=4.4|9=892|35=J|49=ICE_SM_S|56=SM|34=280|70=0241608914160889|71=0|626=2|793=16|72=|466=1164266784|857=0|73=1|11=|37=1156426784|526=1156426674|38=1|198=1310883PTM|54=1|6=117.2100000000|336=R|625=P|55=B|461=FXXXXX|200=20120901|207=IFEU|53=1|30=ICE|453=2|448=SLM|447=C|452=7|448=FFC|447=C|452=12|75=20120210|60=20120310-09:04:04|77=O|58=CYU795|232=14|233=GL_TRADEJOBOUT|234=N|233=GL_ORDERJOBOUT|234=N|233=GL_TAKEN|234=0|233=GL_TRADETYPE|234=E|
This is the string and I want to assign it to a variable of tag198, so it would be
tag198 = '1310883PTM'
Anything after | is not needed.
tag198 = line.match(/198=(.*)/)[1]
puts tag198
but that keeps all after 198; I need just the string prior to the |.
Change your regular expression to:
/198=(.+?)\|/
That makes it non-greedy and stop at the vertical bar. You have to escape the vertical bar because it normally would mean "OR" in a regular expression.
Your regular expression's * is greedy, and will consume all characters it can without stopping the rest of the expression from matching. There is nothing in the expression that tells ruby when to stop collecting characters.
Look at regular-expressions.info. A partial fix for your problem would be to put a '|' after your capture:
tag198 = line.match(/198=(.*)\|/)[1]
puts tag198
The '|' is escaped as it has special meaning in regexes otherwise. This doesn't yet work though, because the * can still consume '|' characters, so long as it leaves one behind to match the '|' in our expression. To fix completely, prevent the * from capturing any pipes:
tag198 = line.match(/198=([^|]*)\|/)[1]
puts tag198
See results of this change here.
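The behaviour of each variant can be checked on a shortened version of the log line (the short sample strings below are made up for illustration, not the full record). The last pattern, which also requires a | or the start of the string before 198=, is an extra precaution that none of the answers propose.

```ruby
line = "8=4.4|9=892|35=J|198=1310883PTM|54=1|6=117.2100000000|"

greedy      = line.match(/198=(.*)/)[1]       # runs to the end of the line
greedy_pipe = line.match(/198=(.*)\|/)[1]     # stops only before the LAST pipe
lazy        = line.match(/198=(.+?)\|/)[1]    # stops at the first pipe
no_pipes    = line.match(/198=([^|]*)\|/)[1]  # cannot cross a pipe at all

# String#[] with a capture index returns nil instead of raising when the tag is absent.
missing = "8=4.4|9=892|"[/198=([^|]*)/, 1]

# Requiring a field boundary avoids matching inside another tag such as 1198=.
tricky   = "1198=X|198=Y|"
naive    = tricky[/198=([^|]*)/, 1]
anchored = tricky[/(?:\A|\|)198=([^|]*)/, 1]

puts greedy, greedy_pipe, lazy, no_pipes, missing.inspect, naive, anchored
```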
If it is only letters and numbers you could use
/198=([A-Za-z0-9]*)/
Also, in case you didn't know, you can test regular expressions on rubular.com. It also provides some information about special characters in regular expressions, and it is a great site for all your regular expression needs even if they aren't for Ruby.

Escaped Ruby regular expression gives no match when used as a literal regexp. Why?

Code says it all:
teststring = "helloworld$"
string_from_user = "world$"
regexp = Regexp.escape(string_from_user) # assigns "world\\$"
p teststring =~ Regexp.new(regexp) # prints 5 => match found
p teststring =~ /regexp/ # prints nil => no match
That the first one matches is mentioned in the Regexp.escape docs.
But why doesn't the second version match?
I'm concerned because I need to pass this regexp to third party Ruby code. The string comes from the user, so I want to escape it. Then, in some situations, I might add additional regexp symbols to this user's string. For example, I might pass "^helloworld\\$" so that third party code would match strings like "helloworld$othercontent".
I am worried that if the third party code uses =~ /regexp/ instead of =~ Regexp.new(regexp), I will be in trouble, because there will be no match as indicated by the code above.
Because /regexp/ is a regexp matching the string "regexp". Perhaps you meant /#{regexp}/?
Edit: I take it, from reading your question more fully, that you're passing a string into third party code that you know will be making a Regexp from that string. In which case, you should be safe. As noted above, /regexp/ cannot possibly be what they're doing, because it's just wrong. They must be using Regexp.new() or something similar.
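A minimal sketch of the three cases, using the strings from the question. The variable names are illustrative; the anchored pattern at the end mirrors the "^helloworld\\$" idea mentioned by the asker.

```ruby
teststring = "helloworld$"
escaped = Regexp.escape("world$")        # the string world\$ (backslash + dollar)

via_new     = teststring =~ Regexp.new(escaped)  # builds the regex from the string
via_interp  = teststring =~ /#{escaped}/         # interpolation does the same
literal     = teststring =~ /escaped/            # looks for the text "escaped"
not_escaped = teststring =~ /world$/             # here $ is an end-of-line anchor

# Extra regex syntax can be added around the escaped user input.
anchored = Regexp.new("^hello" + escaped)
at_start = "helloworld$othercontent" =~ anchored

p via_new, via_interp, literal, not_escaped, at_start
```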
