Ruby regex: extract a list of urls from a string - ruby

I have a string of images' URLs and I need to convert it into an array.
http://rubular.com/r/E2a5v2hYnJ
How do I do this?

URI.extract(your_string)
That's all you need if you already have it in a string. You need require 'uri' first unless something has already loaded it. Gotta love that standard library! Note that recent Ruby versions mark URI.extract as obsolete, so check the docs for your version.
Here's the link to the docs: URI.extract

String#scan returns an array:
myarray = mystring.scan(/regex/)
See here on regular-expressions.info
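As a minimal sketch of the scan approach, assuming the URLs are separated by whitespace (the sample text and the pattern below are illustrative, not taken from the question):

```ruby
# Hypothetical input: image URLs separated by spaces.
text = "http://example.com/a.jpg http://example.com/b.png"

# With no capture groups in the pattern, scan returns every full match.
urls = text.scan(%r{https?://\S+})
# urls is ["http://example.com/a.jpg", "http://example.com/b.png"]
```

If the separator is something other than whitespace, the \S+ part would need to change accordingly.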

The best answer will depend very much on exactly what input string you expect.
If your test string is accurate then I would not use a regex. Do this instead (as suggested by Marnen Laibow-Koser):
mystring.split('?v=3')
If you really don't have constant fluff between your useful strings then regex might be better. Your regex is greedy. This will get you part way:
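mystring.scan(/https?:\/\/[\w.\/-]*?\.(?:jpe?g|gif|png)/)
Note the '?' after the '*' in the part capturing the server and path pieces of the URL; this makes the regex non-greedy. The extension group is non-capturing, (?:...), because scan returns only the captured groups when the pattern contains any, and the '-' sits at the end of the character class so that it is a literal hyphen rather than part of a range.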
The problem with this is that if your server name or path contains any of .jpg, .jpeg, .gif or .png then the result will be wrong in that instance.
Figuring out what is best needs more information about your input string. You might for example find it better to pattern match the fluff between your desired URLs.
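To make the trade-offs concrete, here is a small sketch. It assumes an input where each URL is followed by a constant '?v=3' suffix; the sample string is made up for illustration.

```ruby
# Hypothetical input: URLs glued together, each followed by "?v=3".
text = "http://example.com/a.jpg?v=3http://example.com/img/b.png?v=3"

# A capturing group makes scan return only the captures (the extensions).
with_group = text.scan(%r{https?://[\w./-]*?\.(jpe?g|gif|png)})
# with_group is [["jpg"], ["png"]]

# A non-capturing group makes scan return the whole matches.
urls = text.scan(%r{https?://[\w./-]*?\.(?:jpe?g|gif|png)})
# urls is ["http://example.com/a.jpg", "http://example.com/img/b.png"]

# With constant fluff between the URLs, split gives the same result.
split_urls = text.split('?v=3')
```

The split version is simpler, but it only works while the separator really is constant.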

Use String#split (see the docs for details).

Part of the problem is that in Rubular you are using https instead of http. This gets you closer to what you want if the other answers don't work for you:
http://rubular.com/r/cIjmjxIfz5

Related

Remove specific parts from url

Lets suppose I have a url like this:
https://www.youtube.com/watch/3e4345?v=rwmEkvPBG1s
What is the best and shortest way to get only the 3e4345 part?
Sometimes the URL has no additional params after the ?.
I don't want to use any gems.
What I did was:
url = url.split('/watch/')
url = url[1].split('/')[0].split('?')[0]
Is there a better way? Thanks
Possibly the safest and best option: use URI.
URI("https://www.youtube.com/watch/34345?v=rwmEkvPBG1s").path.split("/").last
For more, see How to extract URL parameters from a URL with Ruby or Rails?
You could use the match method with a regular expression. The value at [1] is the first capture from the regular expression.
The parentheses around the \d+ are what capture the digits out of the URL when it matches. Note that \d+ matches digits only, so an ID that contains letters, such as the 3e4345 in the question, needs a broader class such as [^\/?]+.
url.to_s.match(/\/watch\/(\d+).*$/)[1]
require 'uri'
x = "https://www.youtube.com/watch/34345?v=rwmEkvPBG1s"
File.basename(URI(x).path)
=> "34345"
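Putting the answers above together, here is a minimal sketch that handles URLs with and without a query string. The helper names watch_id and watch_id_regex are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: returns the last path segment of the URL.
# URI#path excludes the query string, so "?v=..." never gets in the way.
def watch_id(url)
  File.basename(URI(url).path)
end

# Regex variant: capture everything after /watch/ up to the next
# slash, question mark or fragment marker.
def watch_id_regex(url)
  url[%r{/watch/([^/?#]+)}, 1]
end

with_query = watch_id("https://www.youtube.com/watch/3e4345?v=rwmEkvPBG1s")
without_query = watch_id("https://www.youtube.com/watch/3e4345")
# Both calls return "3e4345".
```

The URI version relies on the ID being the last path segment. The regex version relies on the /watch/ prefix instead.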

Regex for matching everything before trailing slash, or first question mark?

I'm trying to come up with a regex that will elegantly match everything in an URL AFTER the domain name, and before the first ?, the last slash, or the end of the URL, if neither of the 2 exist.
This is what I came up with but it seems to be failing in some cases:
regex = /[http|https]:\/\/.+?\/(.+)[?|\/|]$/
In summary:
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/ should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2 should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
Please don't use Regex for this. Use the URI library:
require 'uri'
str_you_want = URI("http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price").path
Why?
See everything about this famous question for a good discussion of why these kinds of things are a bad idea.
Also, this XKCD really says why:
In short, regexes are an incredibly powerful tool, but when you're dealing with things defined by hundred-page convoluted standards, and there is already a library that does it faster, more easily, and more correctly, why reinvent the wheel?
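URI#path keeps the leading slash and any trailing slash, so the three examples from the question need one extra step. A minimal sketch follows; the helper name path_after_domain is an assumption made for illustration.

```ruby
require 'uri'

# Hypothetical helper: path without the leading and trailing slashes.
# URI#path already excludes the query string.
def path_after_domain(url)
  URI(url).path.gsub(%r{\A/+|/+\z}, '')
end

base = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"
results = [base + "/", base + "?id=2", base].map { |u| path_after_domain(u) }
# Every element is "2013/07/31/a-new-health-care-approach-dont-hide-the-price".
```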
If lookaheads are allowed
((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)
Copy + Paste this in http://regexpal.com/
See here with ruby regex tester: http://rubular.com/r/uoLLvTwkaz
(?=) is just a lookahead
I basically set up three matches from 2XXX up to (in this order):
(?=\?\w+) # lookahead for a question mark followed by one or more word characters
(?=/\s+) # lookahead for a slash followed by one or more whitespace characters
.*\w # match up to the last word character
I'm pretty sure that some parentheses were not needed but I just copy pasted.
There are essentially two OR | expressions in the (A|B|C) expression. The order matters since it's like a (ifthen|elseif|else) type deal.
You can probably fix up the prefix; I just assumed that you wanted 2XXX, where X is a digit, to match.
Also, save the pitchforks, everyone: regular expressions are not always the best tool, but they are there for you when you need them.
Also, there is an xkcd for everything (https://xkcd.com/208/).
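If a regex is still preferred, here is a Ruby sketch that does not depend on the path starting with a year. It uses a lazy quantifier followed by an optional trailing slash and an optional query string. The pattern and the names PATH_RE and path_segment are assumptions of mine, not the answer's expression.

```ruby
# Hypothetical pattern: capture lazily after the host, then allow an
# optional trailing slash and an optional "?query" before the end.
PATH_RE = %r{\Ahttps?://[^/]+/(.+?)/?(?:\?.*)?\z}

def path_segment(url)
  url[PATH_RE, 1]
end

base = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"
segments = [base + "/", base + "?id=2", base].map { |u| path_segment(u) }
# Every element is "2013/07/31/a-new-health-care-approach-dont-hide-the-price".
```

A URL with no path at all yields nil here, so callers would need to handle that case.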

Ruby: Rubeque: Variable in regexp?

I'm solving http://www.rubeque.com/problems/a-man-comma--a-plan-comma--a-canal--panama-excl-/solutions but I'm a bit confused about treating #{} as comment in regexp.
My code looks like this now:
def longest_palindrome(txt)
  txt[/#{txt.reverse}/]
end
I tried txt[/"#{txt.reverse}"/] and txt[#{txt.reverse}], but neither works as I wish. How should I interpolate a variable into a regexp?
This is not something you can do with a regex.
While you could use variable interpolation in the construction of a regex (see the other answers/comments), that wouldn't help you here. You could only use that to reverse a literal string, not a regex match result. Even if you could, you still wouldn't have solved the "find the longest palindrome" part, at least not with acceptable runtime performance.
Use a different approach to the problem.
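One such approach, sketched here as a plain brute-force search over substrings. It assumes an exact, case-sensitive comparison. The Rubeque problem may treat punctuation or case differently, so take this as an illustration of the idea and not as a submission.

```ruby
# Brute force: try every substring and keep the longest one that reads
# the same in both directions. Roughly O(n^3), fine for short inputs.
def longest_palindrome(txt)
  best = ""
  (0...txt.length).each do |i|
    (i...txt.length).each do |j|
      candidate = txt[i..j]
      next unless candidate.length > best.length
      best = candidate if candidate == candidate.reverse
    end
  end
  best
end

result = longest_palindrome("xyzracecarabc")
# result is "racecar"
```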
It is hard to tell what you want to happen without examples, but I suppose you are after
txt[/#{Regexp.escape(txt.reverse)}/]
See the Regexp.escape method
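A short sketch of why the escape matters when interpolating a string into a regexp. The sample strings are made up for illustration.

```ruby
text = "3x50 or 3.50"
needle = "3.50"

# Plain interpolation: the "." in needle is a regex metacharacter and
# matches any character, so the first hit is "3x50".
loose = text[/#{needle}/]

# Escaped interpolation: the "." is matched literally.
exact = text[/#{Regexp.escape(needle)}/]
# loose is "3x50", exact is "3.50"
```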

String parse using regex

I have a string which is a function call. I want to parse it and obtain the parameters:
"add_location('http://abc.com/page/1/','This is the title, it is long',39.677765,-45.4343,34454,'http://abc.com/images/image_1.jpg')"
It has a total of 6 parameters and is a mixture of urls, integers and decimals. I can't figure out the regex for the split method which I will be using. Please help!
This is what I have come up with - which is wrong.
/('(.*\/[0-9]*)',)|([0-9]*,)/
Treating the string like a CSV might work:
require 'csv'
str = "add_location('http://abc.com/page/1/','This is the title, it is long',39.677765,-45.4343,34454,'http://abc.com/images/image_1.jpg')"
p CSV.parse(str[13..-2], :quote_char => "'").first
# => ["http://abc.com/page/1/", "This is the title, it is long", "39.677765", "-45.4343", "34454", "http://abc.com/images/image_1.jpg"]
Assuming all non-numeric parameters are enclosed in single quotes, as in your example
string.scan( /'.+?'|[-0-9.]+/ )
You really don't want to be parsing things this complex with a regex; it just won't work in the long run. I'm not sure whether you just want to parse this one string, or whether there are lots of strings in this form which vary in exact contents. If you give a bit more info about your end goal, you might be able to get some more detailed help.
For parsing things this complex in the general case, you really want to perform proper tokenization (i.e. lexical analysis) of the string. In the past with Ruby, I've had good experiences doing this with Citrus. It's a nice gem for parsing complex tokens/languages like you're trying to do. You can find more about it here:
https://github.com/mjijackson/citrus
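For a string as regular as the one in the question, a small hand-written tokenizer built on the standard library's StringScanner may be enough before reaching for a parser gem. This is a sketch under the assumption that arguments are either single-quoted strings or plain numbers. The helper name parse_call_args is made up for illustration.

```ruby
require 'strscan'

# Hypothetical helper: tokenizes "name(arg, arg, ...)" where each arg is
# a single-quoted string, a decimal or an integer.
def parse_call_args(call)
  scanner = StringScanner.new(call)
  scanner.skip(/\w+\(/) or raise ArgumentError, "expected name("
  args = []
  until scanner.skip(/\)/)
    if scanner.scan(/'((?:[^'\\]|\\.)*)'/)
      args << scanner[1]               # quoted string, quotes dropped
    elsif scanner.scan(/-?\d+\.\d+/)
      args << scanner.matched.to_f     # decimal
    elsif scanner.scan(/-?\d+/)
      args << scanner.matched.to_i     # integer
    else
      raise ArgumentError, "unexpected input at offset #{scanner.pos}"
    end
    scanner.skip(/,/)                  # optional separator
  end
  args
end

str = "add_location('http://abc.com/page/1/','This is the title, it is long',39.677765,-45.4343,34454,'http://abc.com/images/image_1.jpg')"
params = parse_call_args(str)
# params holds six values: strings for the quoted arguments,
# Floats for the decimals and an Integer for 34454.
```

Unlike the split or scan one-liners, this fails loudly on input it does not understand, which tends to be easier to debug.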

What regex can I use to get the domain name from a url in Ruby?

I am trying to construct a regex to extract a domain given a url.
for:
http://www.abc.google.com/
http://abc.google.com/
https://www.abc.google.com/
http://abc.google.com/
should give:
abc.google.com
URI.parse('http://www.abc.google.com/').host
#=> "www.abc.google.com"
Not a regex, but probably more robust than anything we come up with here.
URI.parse('http://www.abc.google.com/').host.gsub(/^www\./, '')
If you want to remove the www. as well this will work without raising any errors if the www. is not there.
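Applied to the four URLs from the question, the approach can be sketched as follows. The helper name host_without_www is an assumption, and sub with an \A anchor is used because the prefix can occur at most once at the start of the host.

```ruby
require 'uri'

# Hypothetical helper: host of the URL with a leading "www." removed.
def host_without_www(url)
  URI.parse(url).host.sub(/\Awww\./, '')
end

urls = [
  "http://www.abc.google.com/",
  "http://abc.google.com/",
  "https://www.abc.google.com/",
  "http://abc.google.com/"
]
hosts = urls.map { |u| host_without_www(u) }
# Every element is "abc.google.com".
```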
I don't know much about Ruby, but this regex pattern gives you the last 3 parts of the URL, excluding the trailing slash, with a minimum of 2 characters per part.
([\w-]{2,}\.[\w-]{2,}\.[\w-]{2,})/$
You may be able to use the domain_name gem for this kind of work. From the README:
require "domain_name"
host = DomainName("a.b.example.co.uk")
host.domain #=> "example.co.uk"
Your question is a little bit vague. Can you give a precise specification of what exactly you want to do? (Preferably with a test suite.) Right now, all your question says is that you want a method that always returns 'abc.google.com'. That's easy:
def extract_domain
  return 'abc.google.com'
end
But that's probably not what you meant …
Also, you say that you need a Regexp. Why? What's wrong with, for example, using the URI class? After all, parsing and manipulating URIs is exactly what it was made for!
require 'uri'
URI.parse('https://abc.google.com/').host # => 'abc.google.com'
And lastly, you say you are "trying to extract a domain", but you never specify what you mean by "domain". It looks like you sometimes mean the FQDN and sometimes randomly drop parts of the FQDN, but according to what rules? For example, for the FQDN abc.google.com, the domain name is google.com and the host name is abc, but you want it to return abc.google.com, which is not just the domain name but the whole FQDN. Why?
