ruby regexp Skipping Zero Length Matches and nil matches - ruby

I have a Ruby app that uses the first string matched by a regex: my_url.match(/my_regex/).first
The strings are a list of URLs that contain either an address or a postcode, and from each of them I need to extract the postcode or address using a regex.
Samples of urls:
http://www.adresses.co.uk/avon/bath-city
http://www.adresses.co.uk/postcode/rm107jj
My regex:
\.co\.uk\/postcode\/([^\/]*)|\.co\.uk\/(?!postcode)([^\/]*\/[^\/]*)
My problem is that for non-postcode URLs the first capture group of this regex is nil.
How can I rewrite or change this regex so that it skips nil matches, or so that the first match is never nil? I need to solve this in the regex itself, not in Ruby code, please.

Here's a regex that captures in group #1 everything after postcode/ if it's present, or else everything after .co.uk/:
\.co\.uk\/(?:postcode\/)?([^\/\n]+(?:\/[^\/\n]+)?)
Note that this will give unexpected results if there are unwanted path elements at the end of a postcode link, such as:
http://www.adresses.co.uk/postcode/rm107jj/oops
UPDATE: Based on the comments, it looks like you want to match just the last path element. But we can't simply capture the second element, because there might be only one:
http://www.adresses.co.uk/west-midlands
We can, however, make the first element optional:
\.co\.uk\/(?:[^\/\n]+\/)?([^\/\n]+)
Notice how I used a non-capturing group for the optional portion, so the part you want is still captured in group #1.
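As a quick check, here is a minimal sketch that runs the updated pattern against the sample URLs from this thread. The constant name LAST_ELEMENT is only illustrative and does not come from the question.

```ruby
# Pattern from the update: an optional first path element, then capture the last one.
LAST_ELEMENT = /\.co\.uk\/(?:[^\/\n]+\/)?([^\/\n]+)/

urls = [
  'http://www.adresses.co.uk/avon/bath-city',
  'http://www.adresses.co.uk/postcode/rm107jj',
  'http://www.adresses.co.uk/west-midlands'
]

# String#[] with a regex and a group index returns that capture group (or nil).
captures = urls.map { |url| url[LAST_ELEMENT, 1] }
p captures
# => ["bath-city", "rm107jj", "west-midlands"]
```

Because the optional part is a non-capturing group, group #1 is never nil when the pattern matches.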
...

Related

How to visit a link inside an email using capybara

I am new to Cucumber with Capybara. I have an application to test whose flow is: after submitting a form, an email is sent to the user that contains a link to another app. To access that app we have to open the email and click the link, which redirects to the app. I don't have access to the mailbox. Is there any way to extract that link and continue with the flow?
Please suggest a possible way to do it.
Regards,
Abhisek Das
In your test, use whatever means you need in order to trigger the sending of the email by your application. Once the email is sent, use a regular expression to find the URL from the link within the email body (note this will work only for an email that contains a single link), and then visit the path from that URL with Capybara to continue with your test:
path_regex = /(?:"https?\:\/\/.*?)(\/.*?)(?:")/
email = ActionMailer::Base.deliveries.last
path = email.body.match(path_regex)[1]
visit(path)
Regular expression explained
A regular expression (regex) itself is demarcated by forward slashes, and this regex in particular consists of three groups, each demarcated by pairs of parentheses. The first and third groups both begin with ?:, indicating that they are non-capturing groups, while the second is a capturing group (no ?:). I will explain the significance of this distinction below.
The first group, (?:"https?\:\/\/.*?), is a:
non-capturing group, ?:
that matches a single double quote, "
we match a quote since we anticipate the URL to be in the href="..." attribute of a link tag
followed by the string http
optionally followed by a lowercase s, s?
the question mark makes the preceding match, in this case s, optional
followed by a colon and two forward slashes, \:\/\/
note the backslashes, which are used to escape characters that otherwise have a special meaning in a regex
followed by a wildcard, .*?, which will match any character any number of times up until the next match in the regex is reached
the period, or wildcard, matches any character
the asterisk, *, repeats the preceding match up to an unlimited number of times, depending on the successive match that follows
the question mark makes this a lazy match, meaning the wildcard will match as few characters as possible while still allowing the next match in the regex to be satisfied
The second group, (\/.*?) is a capturing group that:
matches a single forward slash, \/
this will match the first forward slash after the host portion of the URL (e.g. the slash at the end of http://www.example.com/) since the slashes in http:// were already matched by the first group
followed by another lazy wildcard, .*?
The third group, (?:"), is:
another non-capturing group, ?:
that matches a single double quote, "
And thus, our second group will match the portion of the URL starting with the forward slash after the host and going up to, but not including, the double quote at the end of our href="...".
When we call the match method using our regex, it returns an instance of MatchData, which behaves much like an array. The element at index 0 is a string containing the entire matched string (from all of the groups in the regex), while elements at subsequent indices contain only the portions of the string matched by the regex's capturing groups (only our second group, in this case). Thus, to get the corresponding match of our second group—which is the path we want to visit using Capybara—we grab the element at index 1.
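To make the MatchData behaviour concrete, here is a small sketch that applies the same regex to a plain string instead of a real email. The HTML snippet and the app.example.com host are made-up sample data, not part of the original answer.

```ruby
path_regex = /(?:"https?\:\/\/.*?)(\/.*?)(?:")/

# Hypothetical email body containing a single link.
body = '<p>Click <a href="https://app.example.com/invites/abc123?x=1">here</a></p>'

match = body.match(path_regex)

# Index 0 is the whole match, including the quotes consumed by the
# non-capturing groups.
whole = match[0]
# Index 1 is the only capturing group: the path after the host.
path = match[1]

puts whole # "https://app.example.com/invites/abc123?x=1" (with the surrounding quotes)
puts path  # /invites/abc123?x=1
```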
You can use Nokogiri to parse the email body and find the link you want to click.
Imagine you want to click a link Change my password:
email = ActionMailer::Base.deliveries.last
html = Nokogiri::HTML(email.html_part.body.to_s)
target_url = html.at("a:contains('Change my password')")['href']
visit target_url
I think this is more semantic and robust than using regular expressions. For example, this would work if the email has many links.
If you're using or willing to use the capybara-email gem, there's now a simpler way of doing this. Let's say you've generated an email to recipient@email.com, which contains the link 'fancy link'.
Then you can just do this in your test suite:
open_email('recipient@email.com') # Allows the current_email method
current_email.click_link 'fancy link'

How to match anything EXCEPT this string?

How can I match a string that is NOT partners?
Here is what I have that matches partners:
/^partners$/i
I've tried the following to NOT match partners but doesn't seem to work:
/^(?!partners)$/i
Your regex
/^(?!partners)$/i
only matches empty lines because you didn't include the end-of-line anchor in your lookahead assertion. Lookaheads do just that - they "look ahead" without actually matching any characters, so only lines that match the regex ^$ will succeed.
This would work:
/^(?!partners$)/i
This reports a match with any string (or, since we're in Ruby here, any line in a multi-line string) that's different from partners. Note that it only matches the empty string at the start of the line. Which is enough for validation purposes, but the match result will be "" (instead of nil which you'd get if the match failed entirely).
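The difference between the two patterns can be checked with a short sketch. The sample strings and variable names below are arbitrary choices for illustration.

```ruby
broken = /^(?!partners)$/i  # lookahead and $ are both tested at the same position
fixed  = /^(?!partners$)/i  # rejects only a line that is exactly "partners"

samples = ['partners', 'Partners', 'partnership', 'other', '']

# =~ returns the match position (0 here, which is truthy in Ruby) or nil.
broken_hits = samples.select { |s| s =~ broken }
fixed_hits  = samples.select { |s| s =~ fixed }

p broken_hits     # => [""]
p fixed_hits      # => ["partnership", "other", ""]
p 'other'[fixed]  # => "" (a zero-length match, not nil)
```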
Not easily, but it can be done with the lookahead operator.
Here is the Ruby regex:
^((?!partners).)*$
Cheers
If you only want to get a true value when the string is not partners, then there is no need to use a regex and you can just use a string comparison (one which ignores case).
For example, in Java:
!"partners".equalsIgnoreCase(aString)
If for some reason you need a positive regex match for any string which does not contain partners (if it's part of a larger regex, for example), you could use several different constructs, like:
^(?:(?!partners).)*$
or
^(?:[^p]+|p(?!artners))*$
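The two approaches from this answer can be sketched in Ruby as follows. The helper name not_partners? and the constant NO_PARTNERS are hypothetical, and String#casecmp stands in for the Java comparison.

```ruby
# Case-insensitive string comparison, no regex needed.
# casecmp returns 0 when the two strings are equal ignoring case.
def not_partners?(str)
  !str.casecmp('partners').zero?
end

# Positive match for any line that does not contain "partners".
NO_PARTNERS = /^(?:(?!partners).)*$/

p not_partners?('PARTNERS')            # => false
p not_partners?('partner')             # => true
p 'our team' =~ NO_PARTNERS            # => 0
p 'our partners page' =~ NO_PARTNERS   # => nil
```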

Replacing partial regex matches in place with Ruby

I want to transform the following text
This is a ![foto](foto.jpeg), here is another ![foto](foto.png)
into
This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder2/foto.png)
In other words I want to find all the image paths that are enclosed between brackets (the text is in Markdown syntax) and replace them with other paths. The string containing the new path is returned by a separate real_path function.
I would like to do this using String#gsub in its block version. Currently my code looks like this:
re = /!\[.*?\]\((.*?)\)/
rel_content = content.gsub(re) do |path|
real_path(path)
end
The problem with this regex is that it will match ![foto](foto.jpeg) instead of just foto.jpeg. I also tried other regexen like (?>\!\[.*?\]\()(.*?)(?>\)) but to no avail.
My current workaround is to split the path and reassemble it later.
Is there a Ruby regex that matches only the path inside the brackets and not all the contextual required characters?
Post-answers update: The main problem here is that Ruby's regexen (as of Ruby 1.8) have no way to specify zero-width lookbehinds. The most generic solution is to group the part of the regexp before the real matching part and the part after it, i.e. /(pre)(matching-part)(post)/, and reconstruct the full string afterwards.
In this case the solution would be
re = /(!\[.*?\]\()(.*?)(\))/
rel_content = content.gsub(re) do
$1 + real_path($2) + $3
end
A quick solution (adjust as necessary):
s = 'This is a ![foto](foto.jpeg)'
s.sub!(/(!\[.*?\])\((.*?)\)/, '\1(/folder1/\2)' )
p s # => "This is a ![foto](/folder1/foto.jpeg)"
You can always do it in two steps - first extract the whole image expression out and then second replace the link:
str = "This is a ![foto](foto.jpeg), here is another ![foto](foto.png)"
str.gsub(/\!\[[^\]]*\]\(([^)]*)\)/) do |image|
image.gsub(/(?<=\()(.*)(?=\))/) do |link|
"/a/new/path/" + link
end
end
#=> "This is a ![foto](/a/new/path/foto.jpeg), here is another ![foto](/a/new/path/foto.png)"
I changed the first regex a bit, but you can use the same one you had before in its place. image is the image expression like ![foto](foto.jpeg), and link is just the path like foto.jpeg.
[EDIT] Clarification: Ruby does have lookbehinds (and they are used in my answer):
You can create lookbehinds with (?<=regex) for positive and (?<!regex) for negative, where regex is an arbitrary regular expression subject to the following condition: expressions in lookbehinds have to be fixed-width due to limitations of the regex implementation, which means that they can't include expressions with an unknown number of repetitions, or alternations with different-width choices nested inside the expression (different-length alternatives are accepted only at the top level of the lookbehind). If you try to do that, you'll get an error. (The restriction doesn't apply to lookaheads, though.)
In your case, the [foto] part has a variable width (foto can be any string) so it can't go into a lookbehind due to the above. However, lookbehind is exactly what we need since it's a zero-width match, and we take advantage of that in the second regex which only needs to worry about (fixed-length) compulsory open parentheses.
Obviously you can put real_path in from here, but I just wanted a test-able example.
I think that this approach is more flexible and more readable than reconstructing the string through the match group variables.
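To illustrate the fixed-width restriction described in this answer, here is a minimal sketch. It assumes a Ruby version whose regex engine rejects variable-width lookbehinds, which is the behaviour I know of; the pattern is built with Regexp.new so that the error can be rescued at runtime instead of failing when the file is parsed.

```ruby
str = 'This is a ![foto](foto.jpeg)'

# Fixed-width lookbehind: a single literal "(" must precede the match.
result = str.gsub(/(?<=\().*?(?=\))/) { |link| "/a/new/path/#{link}" }
p result # => "This is a ![foto](/a/new/path/foto.jpeg)"

# Variable-width lookbehind: ".*?" has no fixed length, so this is
# expected to raise RegexpError on current Ruby versions.
begin
  Regexp.new('(?<=!\[.*?\]\()foto')
  puts 'variable-width lookbehind was accepted'
rescue RegexpError => e
  puts "rejected: #{e.message}"
end
```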
In your block, use $1 to access the first capture group ($2 for the second and so on).
From the documentation:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.
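As a self-contained sketch of the block form, the example below rebuilds each match from its capture groups. The real_path body is a hypothetical stand-in (the asker's function is not shown), and Regexp.last_match is used as a more explicit equivalent of $1, $2 and $3.

```ruby
# Hypothetical stand-in for the asker's real_path function.
def real_path(path)
  "/folder1/#{path}"
end

content = 'This is a ![foto](foto.jpeg), here is another ![foto](foto.png)'
re = /(!\[.*?\]\()(.*?)(\))/

rel_content = content.gsub(re) do
  m = Regexp.last_match # same data as $1, $2, $3 inside the block
  "#{m[1]}#{real_path(m[2])}#{m[3]}"
end

puts rel_content
# This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder1/foto.png)
```

Grabbing the MatchData at the top of the block avoids surprises if the code called inside the block runs its own regex matches.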
As a side note, some people think '\1' is unsuitable when the text around the match has no fixed format. For example, if you want to match and modify only the middle of a string, how do you preserve the characters on both sides?
It's easy: put capture groups around the other parts as well.
For example, I want to turn a-ruby-programming-book-531070.png into a-ruby-programming-book.png, i.e. remove everything between the last "-" and the last ".".
I can use /.*(-.*?)\./ to match -531070. Now how should I replace it? Notice that
everything else has no definite format.
The answer is to put capture groups around the other parts, then keep them in the replacement:
"a-ruby-programming-book-531070.png".sub(/(.*)(-.*?)\./, '\1.')
# => "a-ruby-programming-book.png"
If you want to add something before the matched content, you can use:
"a-ruby-programming-book-531070.png".sub(/(.*)(-.*?)\./, '\1-2019\2.')
# => "a-ruby-programming-book-2019-531070.png"

ruby regex make sub stop at first match

I am trying to replace a specific pattern in a text string.
That pattern is a href containing the word "sak".
My script currently looks like this:
ccontent=ccontent.sub(/<a .+?href=\"([^\"]+)\"[^\>]*>Sak<\/a>/, '')
The problem is that this replaces the entire string (the string contains two links).
The problem is somewhere around the `<a .+?` part: it runs through the link I want to replace entirely, goes into the next link, and replaces that whole link as well.
But I want it to STOP when the first pattern match is reached, so that it only erases the "sak" link.
How do I make the pattern match stop the first time it reaches the 'href'?
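Your expression behaves as if it were greedy: although .+? is lazy, the match still starts at the first <a and .+? keeps consuming characters, including the > that closes the first tag, until the rest of the pattern can match. It therefore runs from the first link all the way through the Sak link.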
Just use the [^>]* character set you're already using at the end of the regex:
ccontent.sub(/<a [^>]*href=\"([^\"]+)\"[^>]*>Sak<\/a>/, '')
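A small sketch with two adjacent links shows the difference. The HTML string and the variable names are invented for illustration.

```ruby
html = '<a href="/one">First</a><a class="x" href="/sak">Sak</a>'

lazy    = /<a .+?href="([^"]+)"[^>]*>Sak<\/a>/     # .+? may cross tag boundaries
bounded = /<a [^>]*href="([^"]+)"[^>]*>Sak<\/a>/   # [^>]* stays inside one tag

lazy_result    = html.sub(lazy, '')
bounded_result = html.sub(bounded, '')

p lazy_result    # => ""
p bounded_result # => "<a href=\"/one\">First</a>"
```

With the lazy version the match starts at the first link and swallows both; with the negated character class only the Sak link is removed.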

Very odd issue with Ruby and regex

I am getting completely different results from string.scan and from several regex testers...
I am just trying to grab the domain from the string, it is the last word.
The regex in question:
/([a-zA-Z0-9\-]*\.)*\w{1,4}$/
The string (1 single line, verified in Ruby's runtime btw)
str = 'Show more results from software.informer.com'
The testers work fine, but in Ruby...
irb(main):050:0> str.scan /([a-zA-Z0-9\-]*\.)*\w{1,4}$/
=> [["informer."]]
I would think that I would get a match on software.informer.com ,which is my goal.
Your regex is correct, the result has to do with the way String#scan behaves. From the official documentation:
"If the pattern contains groups, each individual result is itself an array containing one entry per group."
Basically, if you put parentheses around the whole regex, the first element of each array in your results will be what you expect.
It does not look as if you expect more than one result (especially as the regex is anchored). In that case there is no reason to use scan.
'Show more results from software.informer.com'[ /([a-zA-Z0-9\-]*\.)*\w{1,4}$/ ]
#=> "software.informer.com"
If you do need to use scan (in which case you obviously need to remove the anchor), you can use (?:) to create non-capturing groups.
'foo.bar.baz lala software.informer.com'.scan( /(?:[a-zA-Z0-9\-]*\.)*\w{1,4}/ )
#=> ["foo.bar.baz", "lala", "software.informer.com"]
You are getting a match on software.informer.com. Check the value of $&. The return of scan is an array of the captured groups. Add capturing parentheses around the suffix, and you'll get the .com as part of the return value from scan as well.
The regex testers and Ruby are not disagreeing about the fundamental issue (the regex itself). Rather, their interfaces are differing in what they are emphasizing. When you run scan in irb, the first thing you'll see is the return value from scan (an Array of the captured subpatterns), which is not the same thing as the matched text. Regex testers are most likely oriented toward displaying the matched text.
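The points made in these answers can be seen side by side in a short sketch using the string from the question; the variable names are illustrative only.

```ruby
str = 'Show more results from software.informer.com'

capturing     = /([a-zA-Z0-9\-]*\.)*\w{1,4}$/
non_capturing = /(?:[a-zA-Z0-9\-]*\.)*\w{1,4}$/

# With a capture group, scan returns the groups, not the matched text.
groups_only = str.scan(capturing)
# Without capture groups, scan returns the matched text itself.
full_text = str.scan(non_capturing)

p groups_only # => [["informer."]]
p full_text   # => ["software.informer.com"]

# The capturing regex still matched the whole domain; it is at index 0.
m = str.match(capturing)
p m[0] # => "software.informer.com"
p m[1] # => "informer."
```

A repeated group keeps only its last iteration, which is why group 1 holds "informer." rather than "software.".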
How about doing this :
/([a-zA-Z0-9\-]*\.*\w{1,4})$/
This returns
informer.com
On your test string.
http://rubular.com/regexes/13670

Resources