ruby: use regex to replace http://anything with <a href="http://anything">http://anything</a>


I've seen a lot of variations on this question, but they usually are trying to either validate the 'anything' portion of the url, or provide different text for the anchor text vs the link.
For a simple blog function for users, I need an application helper that returns the same text, except that it finds any string that starts with http:// (and ends at whitespace or the end of the string) and replaces it with <a href="same_string_here">same_string_here</a>.
Any tips on how to do this with regex would be appreciated... I figured out bits and pieces (grab a word starting with http) but cannot get the whole thing to work (can't figure out how to put it into the template with quotes around the href, handle the :// in the test, or put the string in a second place before the </a>).

You can try the following approach:
str = "XYZhttp://abc XYZ"
str.gsub!(/(http:\/\/\S+)/, '<a href="\1">\1</a>')
puts str
which will give you
XYZ<a href="http://abc">http://abc</a> XYZ
Note: \S+ matches one or more non-whitespace characters, so at least one such character must follow http://. The substitution also does not remove the whitespace after http://abc. If you need that, append \s* to the regular expression outside the parentheses.
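Since the question asks for an application helper, the substitution above could be wrapped roughly as follows. This is a minimal sketch: the helper name link_urls is made up for illustration, and it assumes the input is plain text that should be HTML-escaped (here with ERB::Util.html_escape from the standard library) before the anchors are added.

```ruby
require 'erb'

# Hypothetical helper name, not from the original question.
# Escapes the text first, then wraps every http:// run in an anchor tag.
def link_urls(text)
  escaped = ERB::Util.html_escape(text)
  escaped.gsub(%r{http://\S+}) { |url| %(<a href="#{url}">#{url}</a>) }
end

puts link_urls("see http://abc now")
# => see <a href="http://abc">http://abc</a> now
```

In a Rails view the returned string would typically still need to be marked as HTML-safe, as another answer here does with html_safe.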

I think an answer can be extracted from this answer: Turn URLs and #* into links. It does more than you want because it looks for twitter specific links like "#" but it does have solutions to your href question.

You can use:
body.gsub(/(?<!"|'|>)https?:\/\/[\S]+/, '<a href="\0">\0</a>').html_safe
The advantage of this approach is that it replaces bare URL text but leaves existing HTML links alone. For example:
https://www.example.com
http://www.example.com
<a href="https://www.example.com">https://www.example.com</a>
# becomes
<a href="https://www.example.com">https://www.example.com</a>
<a href="http://www.example.com">http://www.example.com</a>
<a href="https://www.example.com">https://www.example.com</a>
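To try the lookbehind pattern outside Rails, a plain-Ruby sketch might look like this. It leaves out html_safe, which is a Rails method; the constant name and sample text are illustrative only.

```ruby
# Skip URLs that directly follow a quote or '>' (that is, ones already inside an <a> tag).
BARE_URL = /(?<!"|'|>)https?:\/\/\S+/

text = 'Visit https://www.example.com or <a href="http://www.example.com">http://www.example.com</a>'
linked = text.gsub(BARE_URL, '<a href="\0">\0</a>')
puts linked
# => Visit <a href="https://www.example.com">https://www.example.com</a> or <a href="http://www.example.com">http://www.example.com</a>
```

One caveat: \S+ stops only at whitespace, so a bare URL immediately followed by a closing tag such as </p> would swallow that tag too.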

Related

Using url unescape to build a url within a view in a string

I am building a url in a view within my sinatra app
I required the URI module in my app.
<li>Show</li>
Without URI.unescape I do not see %20 when I hover over the link. I just see this:
http://127.0.0.1:9292/blog/Coffee Title/49459
I am hoping for that space to be a -. But when I click on the link it will return in my browser as:
http://127.0.0.1:9292/blog/Coffee%20Title/49459
I tried using URI.unescape in irb. I am having trouble evaluating the Ruby code within the string. I am not sure what the right format is but think I am getting close.

URI.escape replaces characters in a string with their percent-encoded counterparts, which is why a space character becomes %20 (20 is the hexadecimal ASCII code for space). It does not replace spaces with dashes.
To replace a character with another character in a string, use String#tr:
title = "Coffee Title"
title_with_dashes = title.tr(" ", "-")
puts title_with_dashes
# => Coffee-Title
However, this is only half of the equation. You've changed the URLs in your links but as a consequence your server probably isn't going to recognize the URL when someone clicks on the link, because the path that exists is Coffee Title, not Coffee-Title. That's a topic for another question, though.
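To make the difference concrete, the sketch below contrasts percent-encoding with dash substitution. URI.escape and URI.unescape were removed in Ruby 3.0, so it uses ERB::Util.url_encode from the standard library instead. The blog_path helper, and the idea of resolving the post by its numeric id, are assumptions for illustration and not part of the original app.

```ruby
require 'erb'

title = "Coffee Title"

# Percent-encoding keeps the title recoverable but shows %20 in the URL.
encoded = ERB::Util.url_encode(title)

# Dash substitution gives a readable slug; it is lossy if titles can contain dashes.
slug = title.tr(" ", "-")

# Hypothetical helper: build the link path from the title and the post id.
def blog_path(title, id)
  "/blog/#{title.tr(' ', '-')}/#{id}"
end

puts encoded                  # Coffee%20Title
puts slug                     # Coffee-Title
puts blog_path(title, 49459)  # /blog/Coffee-Title/49459
```

If the route looks the post up by the trailing id, the slug can be treated as decoration and never needs to be converted back into the title.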

Regex for matching everything before trailing slash, or first question mark?

I'm trying to come up with a regex that will elegantly match everything in an URL AFTER the domain name, and before the first ?, the last slash, or the end of the URL, if neither of the 2 exist.
This is what I came up with but it seems to be failing in some cases:
regex = /[http|https]:\/\/.+?\/(.+)[?|\/|]$/
In summary:
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/ should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2 should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price

Please don't use Regex for this. Use the URI library:
require 'uri'
str_you_want = URI("http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price").path
Why?
See everything about this famous question for a good discussion of why these kinds of things are a bad idea.
Also, this XKCD really says why:
In short, regexes are an incredibly powerful tool, but when you're dealing with things defined by hundred-page convoluted standards, and there is already a library that does the job faster, more easily, and more correctly, why reinvent the wheel?
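Note that URI#path keeps the leading slash and any trailing slash, so matching the three expected results from the question takes one more step. A minimal sketch, with path_without_slashes as an illustrative helper name:

```ruby
require 'uri'

# Return the path of a URL without its query string or leading/trailing slashes.
def path_without_slashes(url)
  URI(url).path.gsub(%r{\A/+|/+\z}, '')
end

base = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"
[base + "/", base + "?id=2", base].each do |url|
  puts path_without_slashes(url)
end
# prints 2013/07/31/a-new-health-care-approach-dont-hide-the-price three times
```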

If lookaheads are allowed
((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)
Copy + Paste this in http://regexpal.com/
See here with ruby regex tester: http://rubular.com/r/uoLLvTwkaz
Image using javascript regex, but it works out the same
(?=) is just a lookahead
I basically set up three matches from 2XXX up to (in this order):
(?=\?\w+) # lookahead for a question mark followed by one or more word characters
(?=/\s+) # lookahead for a slash followed by one or more whitespace characters
.*\w # match up to the last word character
I'm pretty sure that some parentheses were not needed but I just copy pasted.
There are essentially two OR | expressions in the (A|B|C) expression. The order matters since it's like a (ifthen|elseif|else) type deal.
You can probably adjust the prefix; I just assumed that you wanted to match 2XXX, where X is a digit.
Also, save the pitchforks, everyone: regular expressions are not always the best tool, but they're there for you when you need them.
Also, there is an xkcd (https://xkcd.com/208/) for everything.
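If a regex is still preferred, a more compact pattern that does not depend on the path starting with a year is possible. The following is one possible sketch, not the answerer's pattern: it assumes an http or https URL with a host followed by a slash, and captures everything after the host up to the first ?, dropping any trailing slashes. PATH_RE and path_part are illustrative names.

```ruby
# Group 1: the path after the host, stopping before trailing slashes, the first "?", or the end.
PATH_RE = %r{\Ahttps?://[^/?]+/([^?]*?)/*(?:\?|\z)}

# Returns the captured path, or nil when the URL has no slash after the host.
def path_part(url)
  match = PATH_RE.match(url)
  match && match[1]
end

base = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"
puts path_part(base + "/")
puts path_part(base + "?id=2")
puts path_part(base)
```

The lazy quantifier inside the group is what lets the following /* absorb the trailing slashes instead of the capture.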

Find URLs in text and wrap in anchor tag

I'm basically writing my own Markdown parser. I want to detect a URL in a string and wrap it with an anchor tag if it's a valid URL. For example:
string = 'here is a link: http://google.com'
# if string matches regex (which it does)
# should return:
'here is a link: <a href="http://google.com">http://google.com</a>'
# but this would remain unchanged:
string = 'here is a link: google.com'
How can I achieve this?
Bonus points if you can point me to the code in an existing Ruby markdown parser that I can use as an example.

In general: use a regular expression to find URLs and wrap them in your HTML:
urls = %r{(?:https?|ftp|mailto)://\S+}i
html = str.gsub urls, '<a href="\0">\0</a>'
Note that this particular solution will turn this text:
See more at http://www.google.com.
…into…
See more at <a href="http://www.google.com.">http://www.google.com.</a>
So you may want to play with the regex a bit to figure out where the URL should really end.
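One way to decide where the URL ends is to refuse trailing punctuation. The sketch below is an assumption about what counts as trailing punctuation, not a complete URL grammar: it uses a lazy match plus a lookahead so that characters such as . , ; : ! ? ) stay outside the link when they come right before whitespace or the end of the string. The names URL_RE and autolink are illustrative, and mailto is left out here because mailto: addresses do not use ://.

```ruby
# Lazy \S+? stops as soon as only punctuation remains before whitespace or the end.
URL_RE = %r{(?:https?|ftp)://\S+?(?=[.,;:!?)]*(?:\s|\z))}i

def autolink(str)
  str.gsub(URL_RE, '<a href="\0">\0</a>')
end

puts autolink("See more at http://www.google.com.")
# => See more at <a href="http://www.google.com">http://www.google.com</a>.
```

The trade-off is that a URL that legitimately ends in one of those characters, such as a closing parenthesis, loses it.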

You can use this jquery plugin
http://www.jquery.gr/linker/

Replacing partial regex matches in place with Ruby

I want to transform the following text
This is a ![foto](foto.jpeg), here is another ![foto](foto.png)
into
This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder2/foto.png)
In other words I want to find all the image paths that are enclosed between brackets (the text is in Markdown syntax) and replace them with other paths. The string containing the new path is returned by a separate real_path function.
I would like to do this using String#gsub in its block version. Currently my code looks like this:
re = /!\[.*?\]\((.*?)\)/
rel_content = content.gsub(re) do |path|
  real_path(path)
end
The problem with this regex is that it will match ![foto](foto.jpeg) instead of just foto.jpeg. I also tried other regexen like (?>\!\[.*?\]\()(.*?)(?>\)) but to no avail.
My current workaround is to split the path and reassemble it later.
Is there a Ruby regex that matches only the path inside the brackets and not all the contextual required characters?
Post-answers update: The main problem here is that Ruby 1.8's regexen have no way to specify zero-width lookbehinds (Ruby 1.9 and later do support them). The most generic solution is to group the part of the regexp before the real matching part and the part after it, i.e. /(pre)(matching-part)(post)/, and reconstruct the full string afterwards.
In this case the solution would be
re = /(!\[.*?\]\()(.*?)(\))/
rel_content = content.gsub(re) do
  $1 + real_path($2) + $3
end

A quick solution (adjust as necessary):
s = 'This is a ![foto](foto.jpeg)'
s.sub!(/(!\[.*?\])\((.*?)\)/, '\1(/folder1/\2)')
p s # => "This is a ![foto](/folder1/foto.jpeg)"

You can always do it in two steps - first extract the whole image expression out and then second replace the link:
str = "This is a ![foto](foto.jpeg), here is another ![foto](foto.png)"
str.gsub(/\!\[[^\]]*\]\(([^)]*)\)/) do |image|
  image.gsub(/(?<=\()(.*)(?=\))/) do |link|
    "/a/new/path/" + link
  end
end
#=> "This is a ![foto](/a/new/path/foto.jpeg), here is another ![foto](/a/new/path/foto.png)"
I changed the first regex a bit, but you can use the same one you had before in its place. image is the image expression like ![foto](foto.jpeg), and link is just the path like foto.jpeg.
[EDIT] Clarification: Ruby does have lookbehinds (and they are used in my answer):
You can create lookbehinds with (?<=regex) for positive and (?<!regex) for negative, where regex is an arbitrary regular expression subject to the following condition: expressions in lookbehinds have to be fixed width due to limitations of the regex implementation, which means that they can't include repetitions of unknown length. Alternatives of different widths are allowed only at the top level of the lookbehind, as in (?<=a|bc), not nested inside it. If you break these rules, you'll get an error. (The restriction doesn't apply to lookaheads.)
In your case, the [foto] part has a variable width (foto can be any string) so it can't go into a lookbehind due to the above. However, lookbehind is exactly what we need since it's a zero-width match, and we take advantage of that in the second regex which only needs to worry about (fixed-length) compulsory open parentheses.
Obviously you can put real_path in from here, but I just wanted a test-able example.
I think that this approach is more flexible and more readable than reconstructing the string through the match group variables
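Another option, in Ruby 2.0 and later, is \K, which drops everything matched so far from the reported match. Unlike a lookbehind, the part before \K may be variable-width, so the block receives only the path. In this sketch real_path is a stand-in for the asker's function and simply prefixes a folder.

```ruby
# Stand-in for the asker's real_path; the real one would return the actual location.
def real_path(path)
  "/folder1/#{path}"
end

content = "This is a ![foto](foto.jpeg), here is another ![foto](foto.png)"

# Everything before \K must match but is kept out of the text that gets replaced.
rel_content = content.gsub(/!\[[^\]]*\]\(\K[^)]*(?=\))/) { |path| real_path(path) }
puts rel_content
# => This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder1/foto.png)
```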

In your block, use $1 to access the first capture group ($2 for the second and so on).
From the documentation:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.
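As an illustration of the block form, the same reconstruction can be written with named groups and Regexp.last_match instead of $1, $2 and $3. This is a sketch only: IMAGE_RE, rewrite_image_paths and the body of real_path are placeholders, with real_path standing in for the asker's function.

```ruby
# Named groups: the text before the path, the path itself, and the closing parenthesis.
IMAGE_RE = /(?<pre>!\[[^\]]*\]\()(?<path>[^)]*)(?<post>\))/

# Stand-in for the asker's real_path function.
def real_path(path)
  "/folder1/#{path}"
end

def rewrite_image_paths(content)
  content.gsub(IMAGE_RE) do
    m = Regexp.last_match # MatchData for the match currently being replaced
    "#{m[:pre]}#{real_path(m[:path])}#{m[:post]}"
  end
end

puts rewrite_image_paths("This is a ![foto](foto.jpeg), here is another ![foto](foto.png)")
# => This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder1/foto.png)
```

Capturing the MatchData at the top of the block guards against a later regex call overwriting the match variables.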

As a side note, some people think '\1' is unsuitable when the match covers a variable number of characters. For example, if you want to match and modify something in the middle of a string, how do you preserve the characters on both sides?
It's easy: put capture groups around the parts you want to keep.
For example, I want to turn a-ruby-programming-book-531070.png into a-ruby-programming-book.png, that is, remove everything between the last "-" and the last ".".
I can use /.*(-.*?)\./ to match -531070. Now how should I replace it? Notice that
everything else has no definite format.
The answer is to put capture groups around the other parts, then restore them in the replacement:
"a-ruby-programming-book-531070.png".sub(/(.*)(-.*?)\./, '\1.')
# => "a-ruby-programming-book.png"
If you want to add something before the matched content, you can use:
"a-ruby-programming-book-531070.png".sub(/(.*)(-.*?)\./, '\1-2019\2.')
# => "a-ruby-programming-book-2019-531070.png"

How can multiple trailing slashes be removed from a URL in Ruby

What I'm trying to achieve is this. Let's say we have two example URLs:
url1 = "http://emy.dod.com/kaskaa/dkaiad/amaa//////////"
url2 = "http://www.example.com/"
How can I extract the stripped-down URLs?
url1 = "http://emy.dod.com/kaskaa/dkaiad/amaa"
url2 = "http://www.example.com"
URI.parse in Ruby sanitizes certain types of malformed URLs but is ineffective in this case.
If we use regex then /^(.*)\/$/ removes a single slash / from url1 and is ineffective for url2.
Is anybody aware of how to handle this type of URL parsing?
The point here is I don't want my system to have http://www.example.com/ and http://www.example.com being treated as two different URLs. And same goes for http://emy.dod.com/kaskaa/dkaiad/amaa//// and http://emy.dod.com/kaskaa/dkaiad/amaa/.

If you just need to remove all slashes from the end of the url string then you can try the following regex:
"http://emy.dod.com/kaskaa/dkaiad/amaa//////////".sub(/(\/)+$/,'')
"http://www.example.com/".sub(/(\/)+$/,'')
/(\/)+$/ - this regex finds one or more slashes at the end of the string. Then we replace this match with empty string.
Hope this helps.
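For the stated goal of treating such URLs as identical, the substitution can be applied before comparing. A small sketch, with an illustrative helper name; it uses \z rather than $ so that only slashes at the very end of the string are stripped, since $ in Ruby also matches before a newline.

```ruby
# Remove every slash at the very end of the string.
def strip_trailing_slashes(url)
  url.sub(%r{/+\z}, '')
end

a = strip_trailing_slashes("http://emy.dod.com/kaskaa/dkaiad/amaa//////////")
b = strip_trailing_slashes("http://emy.dod.com/kaskaa/dkaiad/amaa/")
puts a      # http://emy.dod.com/kaskaa/dkaiad/amaa
puts a == b # true
```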

Although this thread is a bit old and the top answer is quite good, I suggest another way to do this:
/^(.*?)\/+$/
You can see it in action here: https://regex101.com/r/vC6yX1/2
The magic here is *?, which does a lazy match. So the entire expression can be read as:
Match as few characters as possible and capture them, while matching as many slashes as possible at the end.
In plainer English: replacing the match with the captured group removes all trailing slashes.

def without_trailing_slash path
  path[ %r(.*[^/]) ]
end
path = "http://emy.dod.com/kaskaa/dkaiad/amaa//////////"
puts without_trailing_slash path # "http://emy.dod.com/kaskaa/dkaiad/amaa"
