Replacing partial regex matches in place with Ruby - ruby

I want to transform the following text
This is a ![foto](foto.jpeg), here is another ![foto](foto.png)
into
This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder2/foto.png)
In other words I want to find all the image paths that are enclosed between brackets (the text is in Markdown syntax) and replace them with other paths. The string containing the new path is returned by a separate real_path function.
I would like to do this using String#gsub in its block version. Currently my code looks like this:
re = /!\[.*?\]\((.*?)\)/
rel_content = content.gsub(re) do |path|
real_path(path)
end
The problem with this regex is that it will match ![foto](foto.jpeg) instead of just foto.jpeg. I also tried other regexen like (?>\!\[.*?\]\()(.*?)(?>\)) but to no avail.
My current workaround is to split the path and reassemble it later.
Is there a Ruby regex that matches only the path inside the brackets and not all the contextual required characters?
Post-answers update: The main problem here is that Ruby's regexen have no way to specify zero-width lookbehinds. The most generic solution is to group what the part of regexp before and the one after the real matching part, i.e. /(pre)(matching-part)(post)/, and reconstruct the full string afterwards.
In this case the solution would be
re = /(!\[.*?\]\()(.*?)(\))/
rel_content = content.gsub(re) do
$1 + real_path($2) + $3
end

A quick solution (adjust as necessary):
s = 'This is a ![foto](foto.jpeg)'
s.sub!(/!(\[.*?\])\((.*?)\)/, '\1(/folder1/\2)' )
p s # This is a [foto](/folder1/foto.jpeg)

You can always do it in two steps - first extract the whole image expression out and then second replace the link:
str = "This is a ![foto](foto.jpeg), here is another ![foto](foto.png)"
str.gsub(/\!\[[^\]]*\]\(([^)]*)\)/) do |image|
image.gsub(/(?<=\()(.*)(?=\))/) do |link|
"/a/new/path/" + link
end
end
#=> "This is a ![foto](/a/new/path/foto.jpeg), here is another ![foto](/a/new/path/foto.png)"
I changed the first regex a bit, but you can use the same one you had before in its place. image is the image expression like ![foto](foto.jpeg), and link is just the path like foto.jpeg.
[EDIT] Clarification: Ruby does have lookbehinds (and they are used in my answer):
You can create lookbehinds with (?<=regex) for positive and (?<!regex) for negative, where regex is an arbitrary regex expression subject to the following condition. Regexp expressions in lookbehinds they have to be fixed width due to limitations on the regex implementation, which means that they can't include expressions with an unknown number of repetitions or alternations with different-width choices. If you try to do that, you'll get an error. (The restriction doesn't apply to lookaheads though).
In your case, the [foto] part has a variable width (foto can be any string) so it can't go into a lookbehind due to the above. However, lookbehind is exactly what we need since it's a zero-width match, and we take advantage of that in the second regex which only needs to worry about (fixed-length) compulsory open parentheses.
Obviously you can put real_path in from here, but I just wanted a test-able example.
I think that this approach is more flexible and more readable than reconstructing the string through the match group variables

In your block, use $1 to access the first capture group ($2 for the second and so on).
From the documentation:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.

As a side note, some people think '\1' inappropriate for situations where an unconfirmed number of characters are matched. For example, if you want to match and modify the middle content, how can you protect the characters on both sides?
It's easy. Put a bracket around something else.
For example, I hope replace a-ruby-porgramming-book-531070.png to a-ruby-porgramming-book.png. Remove context between last "-" and last ".".
I can use /.*(-.*?)\./ match -531070. Now how should I replace it? Notice
everything else does not have a definite format.
The answer is to put brackets around something else, then protect them:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1.')
# => "a-ruby-porgramming-book.png"
If you want add something before matched content, you can use:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1-2019\2.')
# => "a-ruby-porgramming-book-2019-531070.png"

Related

What is the difference between these three alternative ways to write Ruby regular expressions?

I want to match the path "/". I've tried the following alternatives, and the first two do match, but I don't know why the third doesn't:
/\A\/\z/.match("/") # <MatchData "/">
"/\A\/\z/".match("/") # <MatchData "/">
Regexp.new("/\A\/\z/").match("/") # nil
What's going on here? Why are they different?
The first snippet is the only correct one.
The second example is... misleading. That string literal "/\A\/\z/" is, obviously, not a regex. It's a string. Strings have #match method which converts its argument to a regexp (if not already one) and match against it. So, in this example, it's '/' that is the regular expression, and it matches a forward slash found in the other string.
The third line is completely broken: don't need the surrounding slashes there, they are part of regex literal, which you didn't use. Also use single quoted strings, not double quoted (which try to interpret escape sequences like \A)
Regexp.new('\A/\z').match("/") # => #<MatchData "/">
And, of course, none of the above is needed if you just want to check if a string consists of only one forward slash. Just use the equality check in this case.
s == '/'

Generating a character class

I'm trying to censor letters in a word with word.gsub(/[^#{guesses}]/i, '-'), where word and guesses are strings.
When guesses is "", I get this error RegexpError: empty char-class: /[^]/i. I could sort such cases with an if/else statement, but can I add something to the regex to make it work in one line?
Since you are only matching (or not matching) letters, you can add a non-letter character to your regex, e.g. # or %:
word.gsub(/[^%#{guesses}]/i, '-')
See IDEONE demo
If #{guesses} is empty, the regex will still be valid, and since % does not appear in a word, there is no risk of censuring some guessed percentage sign.
You have two options. One is to avoid testing if your matches are empty, that is:
unless (guesses.empty?)
word.gsub(/^#{Regex.escape(guesses)}/i, '-')
end
Although that's not your intention, it's really the safest plan here and is the most clear in terms of code.
Or you could use the tr function instead, though only for non-empty strings, so this could be substituted inside the unless block:
word.tr('^' + guesses.downcase + guesses.upcase, '-')
Generally tr performs better than gsub if used frequently. It also doesn't require any special escaping.
Edit: Added a note about tr not working on empty strings.
Since tr treats ^ as a special case on empty strings, you can use an embedded ternary, but that ends up confusing what's going on considerably:
word.tr(guesses.empty? ? '' : ('^' + guesses.downcase + guesses.upcase), '-')
This may look somewhat similar to tadman's answer.
Probably you should keep the string that represents what you want to hide, instead of what you want to show. Let's say this is remains. Then, it would be easy as:
word.tr(remains.upcase + remains.downcase, "-")

Regex negative lookbehinds with a wildcard

I'm trying to match some text if it does not have another block of text in its vicinity. For example, I would like to match "bar" if "foo" does not precede it. I can match "bar" if "foo" does not immediately precede it using negative look behind in this regex:
/(?<!foo)bar/
but I also like to not match "foo 12345 bar". I tried:
/(?<!foo.{1,10})bar/
but using a wildcard + a range appears to be an invalid regex in Ruby. Am I thinking about the problem wrong?
You are thinking about it the right way. But unfortunately lookbehinds usually have be of fixed-length. The only major exception to that is .NET's regex engine, which allows repetition quantifiers inside lookbehinds. But since you only need a negative lookbehind and not a lookahead, too. There is a hack for you. Reverse the string, then try to match:
/rab(?!.{0,10}oof)/
Then reverse the result of the match or subtract the matching position from the string's length, if that's what you are after.
Now from the regex you have given, I suppose that this was only a simplified version of what you actually need. Of course, if bar is a complex pattern itself, some more thought needs to go into how to reverse it correctly.
Note that if your pattern required both variable-length lookbehinds and lookaheads, you would have a harder time solving this. Also, in your case, it would be possible to deconstruct your lookbehind into multiple variable length ones (because you use neither + nor *):
/(?<!foo)(?<!foo.)(?<!foo.{2})(?<!foo.{3})(?<!foo.{4})(?<!foo.{5})(?<!foo.{6})(?<!foo.{7})(?<!foo.{8})(?<!foo.{9})(?<!foo.{10})bar/
But that's not all that nice, is it?
As m.buettner already mentions, lookbehind in Ruby regex has to be of fixed length, and is described so in the document. So, you cannot put a quantifier within a lookbehind.
You don't need to check all in one step. Try doing multiple steps of regex matches to get what you want. Assuming that existence of foo in front of a single instance of bar breaks the condition regardless of whether there is another bar, then
string.match(/bar/) and !string.match(/foo.*bar/)
will give you what you want for the example.
If you rather want the match to succeed with bar foo bar, then you can do this
string.scan(/foo|bar/).first == "bar"

Variable Declaration Regex

I'm trying to make a simple Ruby regex to detect a JavaScript Declaration, but it fails.
Regex:
lines.each do |line|
unminifiedvar = /var [0-9a-zA-Z] = [0-9];/.match(line)
next if unminifiedvar == nil #no variable declarations on the line
#...
end
Testing Line:
var testvariable10 = 9;
A variable name can have more than one character, so you need a + after the character-set [...]. (Also, JS variable names can contain other characters besides alphanumerics.) A numeric literal can have more than one character, so you want a + on the RHS too.
More importantly, though, there are lots of other bits of flexibility that you'll find more painful to process with a regular expression. For instance, consider var x = 1+2+3; or var myString = "foo bar baz";. A variable declaration may span several lines. It need not end with a semicolon. It may have comments in the middle of it. And so on. Regular expressions are not really the right tool for this job.
Of course, it may happen that you're parsing code from a particular source with a very special structure and can guarantee that every declaration has the particular form you're looking for. In that case, go ahead, but if there's any danger that the nature of the code you're processing might change then you're going to be facing a painful problem that really isn't designed to be solved with regular expressions.
[EDITED about a day after writing, to fix a mistake kindly pointed out by "the Tin Man".]
You forgot the +, as in, more than one character for the variable name.
var [0-9a-zA-Z]+ = [0-9];
You may also want to add a + after the [0-9]. That way it can match multiple digits.
var [0-9a-zA-Z]+ = [0-9]+;
http://rubular.com/r/kPlNcGRaHA
Try /var [0-9a-zA-Z]+ = \d+;/
Without the +, [0-9a-zA-Z] will only match a single alphanumeric character. With +, it can match 1 or more alphanumeric characters.
By the way, to make it more robust, you may want to make it match any number of spaces between the tokens, not just exactly one space each. You may also want to make the semicolon at the end optional (because Javascript syntax doesn't require a semicolon). You might also want to make it always match against the whole line, not just a part of the line. That would be:
/\Avar\s+[0-9a-zA-Z]+\s*=\s*\d+;?\Z/
(There is a way to write [0-9a-zA-Z] more concisely, but it has slipped my memory; if someone else knows, feel free to edit this answer.)

Very odd issue with Ruby and regex

I am getting completely different reults from string.scan and several regex testers...
I am just trying to grab the domain from the string, it is the last word.
The regex in question:
/([a-zA-Z0-9\-]*\.)*\w{1,4}$/
The string (1 single line, verified in Ruby's runtime btw)
str = 'Show more results from software.informer.com'
Work fine, but in ruby....
irb(main):050:0> str.scan /([a-zA-Z0-9\-]*\.)*\w{1,4}$/
=> [["informer."]]
I would think that I would get a match on software.informer.com ,which is my goal.
Your regex is correct, the result has to do with the way String#scan behaves. From the official documentation:
"If the pattern contains groups, each individual result is itself an array containing one entry per group."
Basically, if you put parentheses around the whole regex, the first element of each array in your results will be what you expect.
It does not look as if you expect more than one result (especially as the regex is anchored). In that case there is no reason to use scan.
'Show more results from software.informer.com'[ /([a-zA-Z0-9\-]*\.)*\w{1,4}$/ ]
#=> "software.informer.com"
If you do need to use scan (in which case you obviously need to remove the anchor), you can use (?:) to create non-capturing groups.
'foo.bar.baz lala software.informer.com'.scan( /(?:[a-zA-Z0-9\-]*\.)*\w{1,4}/ )
#=> ["foo.bar.baz", "lala", "software.informer.com"]
You are getting a match on software.informer.com. Check the value of $&. The return of scan is an array of the captured groups. Add capturing parentheses around the suffix, and you'll get the .com as part of the return value from scan as well.
The regex testers and Ruby are not disagreeing about the fundamental issue (the regex itself). Rather, their interfaces are differing in what they are emphasizing. When you run scan in irb, the first thing you'll see is the return value from scan (an Array of the captured subpatterns), which is not the same thing as the matched text. Regex testers are most likely oriented toward displaying the matched text.
How about doing this :
/([a-zA-Z0-9\-]*\.*\w{1,4})$/
This returns
informer.com
On your test string.
http://rubular.com/regexes/13670

Resources