ruby regex make sub stop at first match - ruby

I am trying to replace a specific pattern in a text string.
That pattern is an a href link containing the word "Sak".
My script currently looks like this:
ccontent=ccontent.sub(/<a .+?href=\"([^\"]+)\"[^\>]*>Sak<\/a>/, '')
The problem is that this replaces the entire string (the string contains two links).
The problem is somewhere around the `<a .+?` part: it runs through the link I want to replace entirely, goes into the next link, and replaces that whole link as well.
But I want it to stop when the first pattern match is reached so that it only erases the "Sak" link.
How do I make the pattern match stop the first time it reaches the 'href'?

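Your .+? is lazy, but . matches any character, including >, so it keeps expanding across tag boundaries until the rest of the pattern matches. The match therefore starts at the first <a and runs all the way through to the "Sak" link.
Just use the [^>]* character class you're already using at the end of the regex, which cannot leave the opening tag: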
ccontent.sub(/<a [^>]*href=\"([^\"]+)\"[^>]*>Sak<\/a>/, '')
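To see the difference between the two patterns, here is a minimal sketch. The sample markup in SAMPLE_HTML and the constant names are made up for illustration and are not from the question.

```ruby
# Made-up input: two links, only the second one is the "Sak" link.
SAMPLE_HTML = '<a class="x" href="/first">First</a> <a class="y" href="/sak">Sak</a>'

# Original pattern: "." may also match ">", so ".+?" can run past the first tag.
CROSSING = /<a .+?href="([^"]+)"[^>]*>Sak<\/a>/
# Suggested pattern: "[^>]*" cannot leave the opening tag it started in.
BOUNDED = /<a [^>]*href="([^"]+)"[^>]*>Sak<\/a>/

p SAMPLE_HTML.sub(CROSSING, '') # => ""
p SAMPLE_HTML.sub(BOUNDED, '')  # => "<a class=\"x\" href=\"/first\">First</a> "
```

With the bounded pattern only the "Sak" link is removed; the space that separated the two links is left behind.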

Related

Ruby replace string if next letter is found

I want to replace a pattern in Ruby only if the next letter after the pattern is one of a given set.
Example: replace "αυ" with "av" ONLY IF the next letter after "αυ" is one of the following: α|γ|δ|λ|μ|ν|ρ|σμ|ω
This code will not work, of course; I suppose I need a more complicated regex that matches one of the letters after the pattern.
string.gsub!("αυ", "av") if string =~ /α|γ|δ|λ|μ|ν|ρ|σμ|ω/
Thanks for any suggestion.
Use a positive lookahead:
string.gsub!(/αυ(?=α|γ|δ|λ|μ|ν|ρ|σμ|ω)/, "av")
Details
αυ - the literal substring αυ
(?=α|γ|δ|λ|μ|ν|ρ|σμ|ω) - a positive lookahead that requires one of the alternatives to appear immediately after αυ without including it in the match value (i.e. it will be left in the resulting string).
You may also "contract" the single-char alternatives into a character class:
/αυ(?=[αγδλμνρω]|σμ)/
σμ cannot be put inside the character class [αγδλμνρω] since it contains 2 chars, so it stays as a separate alternative.
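A short sketch of the lookahead in action. The helper name transliterate and the sample strings are made up for illustration; they are not meant to be real words.

```ruby
# "αυ" is replaced only when one of the listed letters (or "σμ") follows it.
AV_BEFORE_LETTER = /αυ(?=[αγδλμνρω]|σμ)/

# Hypothetical helper wrapping the substitution.
def transliterate(word)
  word.gsub(AV_BEFORE_LETTER, "av")
end

puts transliterate("αυρα")  # prints avρα   (ρ follows, so αυ is replaced)
puts transliterate("αυσμα") # prints avσμα  (σμ follows, so αυ is replaced)
puts transliterate("αυτο")  # prints αυτο   (τ is not in the list, unchanged)
```

Because the lookahead is zero-width, the letter that triggered the replacement stays in the output.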

Ensure non-matching of a pattern within a scope

I am trying to create a regex that matches a pattern in some part of a string, but not in another part of the string.
I am trying to match a substring that
(i) is surrounded by a balanced pair of one or more consecutive backticks `
(ii) and does not include as many consecutive backticks as in the surrounding patterns
(iii) where the surrounding patterns (sequence of backticks) are not adjacent to other backticks.
This is a variant of the inline code notation in Markdown.
Examples of matches are as follows:
"xxx`foo`yyy" # => matches "foo"
"xxx``foo`bar`baz``yyy" # => matches "foo`bar`baz"
"xxx```foo``bar``baz```yyy" # => matches "foo``bar``baz"
One regex to achieve this is:
/(?<!`)(?<backticks>`+)(?<inline>.+?)\k<backticks>(?!`)/
which uses a non-greedy match.
I was wondering if I can get rid of the non-greedy match.
The idea comes from when the prohibited pattern is a single character. When I want to match a substring that is surrounded by a single quote ' that does not include a single quote in it, I can do either:
/'.+?'/
/'[^']+'/
The first one uses non-greedy match, and the second one uses an explicit non-matching pattern [^'].
I am wondering if it is possible to have something like the second form when the prohibited pattern is not a single character.
Going back to the original issue, there is the negative lookahead syntax (?!...), but I cannot restrict its effective scope. If I make my regex like this:
/(?<!`)(?<backticks>`+)(?<inline>(?!.*\k<backticks>).*)\k<backticks>(?!`)/
then the effect of (?!.*\k<backticks>) will not be limited to within (?<inline>...), but will extend to the rest of the string. And since that contradicts the \k<backticks> at the end, the regex fails to match.
Is there a regex technique to ensure non-matching of a pattern (not-necessarily a single character) within a certain scope?
You can match one or more characters, checking before each one that the delimiter does not start at that position:
/(?<!`)(?<backticks>`+)(?<inline>(?:(?!\k<backticks>).)+)\k<backticks>(?!`)/
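As a quick check of this pattern against the question's examples, here is a minimal sketch. The constant INLINE_CODE and the helper inline_code are names made up for illustration, and the last input is an extra unbalanced case that is not from the question.

```ruby
# Before each character of the inline part, the negative lookahead checks
# that the opening backtick run does not start again at that position.
INLINE_CODE = /(?<!`)(?<backticks>`+)(?<inline>(?:(?!\k<backticks>).)+)\k<backticks>(?!`)/

# Hypothetical helper: returns the named group "inline", or nil without a match.
def inline_code(text)
  text[INLINE_CODE, "inline"]
end

p inline_code("xxx`foo`yyy")           # => "foo"
p inline_code("xxx``foo`bar`baz``yyy") # => "foo`bar`baz"
p inline_code("xxx``foo`yyy")          # => nil (the backtick runs are not balanced)
```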

ruby regexp Skipping Zero Length Matches and nil matches

I have a Ruby app that uses the first string matched by a regex: my_url.match(/my_regex/).first
The strings are a list of URLs that contain an address or a postcode, and from each of them I need to extract the postcode or address using a regex.
Samples of urls:
http://www.adresses.co.uk/avon/bath-city
http://www.adresses.co.uk/postcode/rm107jj
My regex:
\.co\.uk\/postcode\/([^\/]*)|\.co\.uk\/(?!postcode)([^\/]*\/[^\/]*)
My problem is that for non-postcode URLs the first capture matched by this regex is nil.
How can I rewrite or change this regex so that it skips nil matches, or so that the first capture is never nil? I need to solve it with the regex, not in Ruby code, please.
Here's a regex that captures in group #1 everything after postcode/ if it's present, or else everything after .co.uk/:
\.co\.uk\/(?:postcode\/)?([^\/\n]+(?:\/[^\/\n]+)?)
Note that this will give unexpected results if there are unwanted path elements at the end of a postcode link, such as:
http://www.adresses.co.uk/postcode/rm107jj/oops
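A small sketch of what group #1 holds for each kind of URL, including the trailing-element case just mentioned. The constant AFTER_DOMAIN and the helper path_after_domain are names made up for illustration.

```ruby
# Group 1: everything after an optional "postcode/", up to two path elements.
AFTER_DOMAIN = /\.co\.uk\/(?:postcode\/)?([^\/\n]+(?:\/[^\/\n]+)?)/

# Hypothetical helper: returns capture group 1, or nil without a match.
def path_after_domain(url)
  url[AFTER_DOMAIN, 1]
end

p path_after_domain("http://www.adresses.co.uk/avon/bath-city")        # => "avon/bath-city"
p path_after_domain("http://www.adresses.co.uk/postcode/rm107jj")      # => "rm107jj"
p path_after_domain("http://www.adresses.co.uk/postcode/rm107jj/oops") # => "rm107jj/oops"
```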
UPDATE: Based on the comments, it looks like you want to match just the last path element. But we can't simply capture the second element, because there might be only one:
http://www.adresses.co.uk/west-midlands
We can, however, make the first element optional:
\.co\.uk\/(?:[^\/\n]+\/)?([^\/\n]+)
Notice how I used a non-capturing group for the optional portion, so the part you want is still captured in group #1.
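The version with an optional first element can be tried out like this. It is only a sketch; LAST_ELEMENT and last_path_element are names made up for illustration.

```ruby
# The first path element is optional and non-capturing, so group 1 is
# the second element when there are two, and the only element otherwise.
LAST_ELEMENT = /\.co\.uk\/(?:[^\/\n]+\/)?([^\/\n]+)/

# Hypothetical helper: returns capture group 1, or nil without a match.
def last_path_element(url)
  url[LAST_ELEMENT, 1]
end

p last_path_element("http://www.adresses.co.uk/avon/bath-city")   # => "bath-city"
p last_path_element("http://www.adresses.co.uk/postcode/rm107jj") # => "rm107jj"
p last_path_element("http://www.adresses.co.uk/west-midlands")    # => "west-midlands"
```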
...

Replacing partial regex matches in place with Ruby

I want to transform the following text
This is a ![foto](foto.jpeg), here is another ![foto](foto.png)
into
This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder2/foto.png)
In other words I want to find all the image paths that are enclosed between brackets (the text is in Markdown syntax) and replace them with other paths. The string containing the new path is returned by a separate real_path function.
I would like to do this using String#gsub in its block version. Currently my code looks like this:
re = /!\[.*?\]\((.*?)\)/
rel_content = content.gsub(re) do |path|
  real_path(path)
end
The problem with this regex is that it will match ![foto](foto.jpeg) instead of just foto.jpeg. I also tried other regexen like (?>\!\[.*?\]\()(.*?)(?>\)) but to no avail.
My current workaround is to split the path and reassemble it later.
Is there a Ruby regex that matches only the path inside the brackets and not all the contextual required characters?
Post-answers update: The main problem here is that Ruby's regexen have no way to specify zero-width lookbehinds. The most generic solution is to group the part of the regexp before the real matching part and the part after it, i.e. /(pre)(matching-part)(post)/, and reconstruct the full string afterwards.
In this case the solution would be
re = /(!\[.*?\]\()(.*?)(\))/
rel_content = content.gsub(re) do
  $1 + real_path($2) + $3
end
A quick solution (adjust as necessary):
s = 'This is a ![foto](foto.jpeg)'
s.sub!(/!(\[.*?\])\((.*?)\)/, '!\1(/folder1/\2)')
p s # => "This is a ![foto](/folder1/foto.jpeg)"
You can always do it in two steps - first extract the whole image expression out and then second replace the link:
str = "This is a ![foto](foto.jpeg), here is another ![foto](foto.png)"
str.gsub(/\!\[[^\]]*\]\(([^)]*)\)/) do |image|
  image.gsub(/(?<=\()(.*)(?=\))/) do |link|
    "/a/new/path/" + link
  end
end
#=> "This is a ![foto](/a/new/path/foto.jpeg), here is another ![foto](/a/new/path/foto.png)"
I changed the first regex a bit, but you can use the same one you had before in its place. image is the image expression like ![foto](foto.jpeg), and link is just the path like foto.jpeg.
[EDIT] Clarification: Ruby does have lookbehinds (and they are used in my answer):
You can create lookbehinds with (?<=regex) for positive and (?<!regex) for negative, where regex is an arbitrary regex subject to the following condition: because of limitations in the regex implementation, the expression in a lookbehind has to be fixed width, which means it can't include repetitions of unknown length or, except at the top level of the lookbehind, alternations whose choices differ in width. If you try to do that, you'll get an error. (The restriction doesn't apply to lookaheads, though.)
In your case, the [foto] part has a variable width (foto can be any string) so it can't go into a lookbehind due to the above. However, lookbehind is exactly what we need since it's a zero-width match, and we take advantage of that in the second regex which only needs to worry about (fixed-length) compulsory open parentheses.
Obviously you can put real_path in from here, but I just wanted a test-able example.
I think that this approach is more flexible and more readable than reconstructing the string through the match group variables
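To make the fixed-width point concrete, here is a minimal sketch. PATH_IN_PARENS and compiles? are names made up for illustration, and the second pattern is just one example of a variable-width lookbehind.

```ruby
# Fixed-width lookbehind: "(" is always exactly one character wide.
PATH_IN_PARENS = /(?<=\().*(?=\))/

p "![foto](foto.jpeg)"[PATH_IN_PARENS] # => "foto.jpeg"

# Hypothetical helper: true when Ruby accepts the pattern source,
# false when compiling it raises RegexpError.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

p compiles?('(?<=\().*')       # => true
p compiles?('(?<=\[.*\]\().*') # variable-width lookbehind; expected to be rejected
```

Building the pattern with Regexp.new at run time keeps the invalid pattern from being a syntax error in the script itself.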
In your block, use $1 to access the first capture group ($2 for the second and so on).
From the documentation:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.
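Put together for the image-path case, that could look like the sketch below. The body of real_path is a made-up stand-in, since the question does not show the real function, and IMAGE and rewrite_image_paths are names chosen for illustration.

```ruby
# Group 1 is the text before the path, group 2 the path, group 3 the ")".
IMAGE = /(!\[.*?\]\()(.*?)(\))/

# Hypothetical stand-in for the asker's real_path function.
def real_path(path)
  "/folder1/#{path}"
end

# Rebuilds each match from its capture groups, replacing only the path.
def rewrite_image_paths(content)
  content.gsub(IMAGE) { "#{$1}#{real_path($2)}#{$3}" }
end

puts rewrite_image_paths("This is a ![foto](foto.jpeg), here is another ![foto](foto.png)")
# prints: This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder1/foto.png)
```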
As a side note, some people think '\1' is unsuitable when the text around the part you want to change has no fixed length. For example, if you want to match and modify content in the middle, how can you protect the characters on both sides?
It's easy: put capture groups (parentheses) around the other parts as well.
For example, I want to change a-ruby-porgramming-book-531070.png to a-ruby-porgramming-book.png, i.e. remove the content between the last "-" and the last ".".
I can use /.*(-.*?)\./ to match -531070. Now how should I replace it? Notice that
everything else does not have a definite format.
The answer is to put parentheses around the other parts, then keep them in the replacement:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1.')
# => "a-ruby-porgramming-book.png"
If you want to add something before the matched content, you can use:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1-2019\2.')
# => "a-ruby-porgramming-book-2019-531070.png"

regular expression gsub only if it does not have anything before

Is there any way to scan only if there is nothing before what I am scanning for?
For example, I have a post and I am scanning for a forward slash and what follows it, but I do not want to match a forward slash if it is not the beginning character.
I want to scan for /this but I do not want to scan for this/this or http://this.com.
The regular expression I am currently using is..
/\/(\w+)/
I am using this with gsub to link each /forwardslash.
I think what you are asking for is to match only words that begin with '/', not just strings or lines beginning with '/'. If that is true, I believe the following regex will work: %r{(?:^|\s+)/(\w+)}
For example:
"/foo /this this/that http://this".scan %r{(?:^|\s+)/(\w+)} # => [["foo"], ["this"]]
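In Ruby, the caret (^) matches at the beginning of a line and the dollar sign ($) at the end of a line; \A and \z anchor to the beginning and end of the whole string.
So
/^\/(\w+)/
...will match only at the beginning of a line; use \A instead of ^ if it must be the very beginning of the string.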
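In Ruby specifically, ^ anchors to the start of any line, while \A anchors to the start of the whole string. A minimal sketch of the difference, with a made-up two-line input and constant names chosen for illustration:

```ruby
# Made-up input: the slash word sits at the start of the second line.
MULTILINE_POST = "intro line\n/second"

LINE_START   = /^\/(\w+)/  # "^" matches at the start of every line
STRING_START = /\A\/(\w+)/ # "\A" matches only at the start of the string

p MULTILINE_POST.scan(LINE_START)   # => [["second"]]
p MULTILINE_POST.scan(STRING_START) # => []
p "/first word".scan(STRING_START)  # => [["first"]]
```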
First thing, since you're using a regex with slashes, change the delimiter to something else (e.g. %r{...} in Ruby); then you won't have to escape the slashes and it will be easier to read.
Secondly, if you want to replace the slash as well then include it in the capture.
On to the regex.
...if it is not the beginning character...
...of a line:
!^(/\w+)!
...if it is not the beginning character...
...of a word:
!\s(/\w+)!
but that won't match if it's at the very beginning of a line. For that you'll need something a lot more complex, so I'd just run both the regexes here instead of creating that monster.
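As an alternative to running two regexes, one way to cover both the start of a line and the start of a word in a single pattern is a negative lookbehind. This is a different technique from the ones in the answers above, shown only as a sketch. The link markup, SLASH_WORD, and link_slash_words are assumptions made up for illustration.

```ruby
# Negative lookbehind: the slash must not be preceded by a non-whitespace
# character, so it matches at the start of the text or right after whitespace.
SLASH_WORD = %r{(?<!\S)/(\w+)}

# Hypothetical helper: wraps each standalone /word in a link.
def link_slash_words(post)
  post.gsub(SLASH_WORD) { %(<a href="/#{$1}">/#{$1}</a>) }
end

puts link_slash_words("/foo this/that http://this.com /bar")
# prints: <a href="/foo">/foo</a> this/that http://this.com <a href="/bar">/bar</a>
```

Because the lookbehind is zero-width, the whitespace before the slash is not part of the match and does not need to be put back in the replacement.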

Resources