Get first match of image name in URL (regex, Ruby) - ruby

I'm trying to use a regex to get the first match of an image name in a URL (Ruby).
Here's my current code:
@wikimedia_link.match(/(\/|:)([a-zA-Z\_\-0-9]*\.(jpeg|jpg|png|gif))/).try(:[], 2)
It works (returns "Samuel_L_Jackson_Comic_Con.jpg") if I have one match, i.e.
http://en.wikipedia.org/wiki/File:Samuel_L_Jackson_Comic_Con.jpg
However, it returns nil for the URL below, apparently because both "Lucy_desi_1957.JPG" and "220px-Lucy_desi_1957.JPG" appear in it:
http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG/220px-Lucy_desi_1957.JPG
Any idea how to get just the first match?
Thank you!

If you want the filename at the end, add a $ to anchor the match to the end of the line:
/(\/|:)([\w.-]+\.(jpeg|jpg|png|gif)$)/i

What you want is:
@wikimedia_link[/[^\/:]+\.(?i:jpeg|jpg|png|gif)/]
The (?i:...) group switches to case-insensitive matching for the extension only, so either jpg or JPG will be matched.
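A minimal sketch of that approach, checked against both URLs from the question. The helper name first_image_name and the constant IMAGE_NAME are made up for illustration; the sketch relies only on String#[] returning the leftmost match, or nil when nothing matches.

```ruby
# Hypothetical helper: returns the first image file name in a URL, or nil.
IMAGE_NAME = /[^\/:]+\.(?i:jpeg|jpg|png|gif)/

def first_image_name(url)
  # String#[] with a regex returns the leftmost match, or nil if there is none.
  url[IMAGE_NAME]
end

single = 'http://en.wikipedia.org/wiki/File:Samuel_L_Jackson_Comic_Con.jpg'
thumb  = 'http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG/220px-Lucy_desi_1957.JPG'

puts first_image_name(single)  # Samuel_L_Jackson_Comic_Con.jpg
puts first_image_name(thumb)   # Lucy_desi_1957.JPG
p first_image_name('http://example.com/no-image-here')  # nil
```

Because the character class excludes both "/" and ":", the "File:" prefix in the first URL is not part of the result.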

This is how I'd do it:
2.0.0-p247 :008 > image_url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG/220px-Lucy_desi_1957.JPG'
=> "http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG/220px-Lucy_desi_1957.JPG"
2.0.0-p247 :009 > image_name = image_url.match( /[-_\w:]+\.(jpe?g|png|gif)$/i )
=> #<MatchData "220px-Lucy_desi_1957.JPG" 1:"JPG">
2.0.0-p247 :012 > image_name.to_s
=> "220px-Lucy_desi_1957.JPG"
Without IRB:
image_url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG/220px-Lucy_desi_1957.JPG'
image_name = image_url.match( /[-_\w:]+\.(jpe?g|png|gif)$/i );
puts image_name #=> "220px-Lucy_desi_1957.JPG"
Because the pattern is anchored with $, it always yields the file name at the end of the URL, whether that is a simple URL with a simple filename:
http://www.anexample.com/dog.jpg
or with a more complex filename:
http://www.anexample.com/342432_large-xs_dog.jpg
or if an image is referenced more than once in the URL:
http://www.anexample.com/cat.jpg/upload/342432_large-xs_dog.jpg/xs/342432_large-xs_dog.jpg

The following regex works for both of your examples
/^.+\/[\w:]+\.(jpe?g|png|gif)/i
And you can get just "http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG" with the following
"http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG/220px-Lucy_desi_1957.JPG".match(/^.+\/[\w:]+\.(jpe?g|png|gif)/i).to_a.first
If you're just after the filename itself, drop the ^.+\/ from the regex, leave it as simply
/[\w:]+\.(jpe?g|png|gif)/i
Using this version in the match above returns only "Lucy_desi_1957.JPG".
In either case, if no match is found, nil is returned.
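To see why "first" and "last" give different answers for the thumbnail URL, it can help to list every candidate with String#scan. This is only a sketch: the pattern below is a variation on the ones above (it allows hyphens and leaves out the colon), not a regex taken from any of the answers.

```ruby
url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG/220px-Lucy_desi_1957.JPG'

# The only group is non-capturing, so scan returns the matched strings themselves.
names = url.scan(/[\w-]+\.(?:jpe?g|png|gif)/i)

p names        # ["Lucy_desi_1957.JPG", "220px-Lucy_desi_1957.JPG"]
p names.first  # "Lucy_desi_1957.JPG"
p names.last   # "220px-Lucy_desi_1957.JPG"
```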

Related

Extract url params in ruby

I would like to extract parameters from url. I have following path pattern:
pattern = "/foo/:foo_id/bar/:bar_id"
And example url:
url = "/foo/1/bar/2"
I would like to get {foo_id: 1, bar_id: 2}. I tried to convert pattern into something like this:
"\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)"
I failed at the first step, when I tried to escape the slashes in the pattern:
formatted = pattern.gsub("/", "\/")
Do you know how to fix this gsub? Or perhaps you know a better way to do this.
EDIT:
It is plain Ruby. I am not using RoR.
You only need to escape slashes in a Regexp literal, e.g. /foo\/bar/. When defining a Regexp from a string it's not necessary: Regexp.new("foo/bar") produces the same Regexp as /foo\/bar/.
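A quick way to check both points in irb. This is just a sketch; the variable names are illustrative. In a double-quoted string "\/" is the same as "/", which is why the gsub in the question appears to do nothing.

```ruby
pattern = "/foo/:foo_id/bar/:bar_id"

# "\/" in a double-quoted string is just "/", so each slash is replaced by itself.
unchanged = pattern.gsub("/", "\/")
p unchanged == pattern                 # true

# A Regexp built from a string needs no slash escaping.
from_string  = Regexp.new("foo/bar")
from_literal = /foo\/bar/

p from_string.match("x/foo/bar")[0]    # "foo/bar"
p from_literal.match("x/foo/bar")[0]   # "foo/bar"
```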
As to your larger problem, here's how I'd solve it, which I'm guessing is pretty much how you'd been planning to solve it:
PATTERN_PART_MATCH = /:(\w+)/
PATTERN_PART_REPLACE = '(?<\1>.+?)'
def pattern_to_regexp(pattern)
  expr = Regexp.escape(pattern) # just in case
    .gsub(PATTERN_PART_MATCH, PATTERN_PART_REPLACE)
  Regexp.new(expr)
end
pattern = "/foo/:foo_id/bar/:bar_id"
expr = pattern_to_regexp(pattern)
# => /\/foo\/(?<foo_id>.+?)\/bar\/(?<bar_id>.+?)/
str = "/foo/1/bar/2"
expr.match(str)
# => #<MatchData "/foo/1/bar/2" foo_id:"1" bar_id:"2">
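One caveat with a lazy group at the very end of an unanchored pattern: it matches as little as possible, so a multi-digit id gets cut short. Below is a hedged variant, assuming parameter values never contain a slash. The helper names compile_route and extract_params are hypothetical. The values come back as strings, so call to_i on them if you need integers.

```ruby
# Hypothetical helpers; the names are for illustration only.
def compile_route(pattern)
  # Each :name becomes a named group that stops at the next slash.
  source = Regexp.escape(pattern).gsub(/:(\w+)/) { "(?<#{Regexp.last_match(1)}>[^/]+)" }
  # \A and \z force the whole string to match.
  Regexp.new("\\A#{source}\\z")
end

def extract_params(pattern, url)
  match = compile_route(pattern).match(url)
  return nil unless match
  # Pair the group names (as symbols) with the captured values.
  Hash[match.names.map(&:to_sym).zip(match.captures)]
end

params = extract_params("/foo/:foo_id/bar/:bar_id", "/foo/1/bar/23")
puts params[:foo_id]  # 1
puts params[:bar_id]  # 23

# For comparison, the unanchored lazy pattern stops after one character.
lazy = /\/foo\/(?<foo_id>.+?)\/bar\/(?<bar_id>.+?)/
puts lazy.match("/foo/1/bar/23")[:bar_id]  # 2
```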
Try this:
regex = /\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)/i
matches = "/foo/1/bar/2".match(regex)
Hash[matches.names.zip(matches[1..-1])]
IRB output:
2.3.1 :032 > regex = /\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)/i
=> /\/foo\/(?<foo_id>.*)\/bar\/(?<bar_id>.*)/i
2.3.1 :033 > matches = "/foo/1/bar/2".match(regex)
=> #<MatchData "/foo/1/bar/2" foo_id:"1" bar_id:"2">
2.3.1 :034 > Hash[matches.names.zip(matches[1..-1])]
=> {"foo_id"=>"1", "bar_id"=>"2"}
I'd advise reading this article on how Rack parses query params. The above works for the example you gave, but it does not extend to other params.
http://codefol.io/posts/How-Does-Rack-Parse-Query-Params-With-parse-nested-query
This might help you; the foo id and bar id will be dynamic.
# URL to scan
url = "/foo/1/bar/2"
# Scan the ids out of the URL (\d+ keeps multi-digit ids whole)
id = url.scan(/\d+/)
# Build the hash-like string from the ids
url_with_id = "{foo_id: #{id[0]}, bar_id: #{id[1]}}"
# => "{foo_id: 1, bar_id: 2}"
If you want to turn the string into a hash (eval executes the string as Ruby code, so only use it on input you trust):
url_hash = eval(url_with_id)
# => {:foo_id=>1, :bar_id=>2}
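If the goal is just the hash, it can be built directly without going through a string and eval. A small sketch, assuming the ids are the only digit runs in the URL and appear in the same order as the keys; the key names come from the pattern in the question.

```ruby
url = "/foo/1/bar/2"

# Collect every run of digits and convert each one to an Integer.
ids = url.scan(/\d+/).map(&:to_i)

# Pair the keys with the ids in order.
params = Hash[[:foo_id, :bar_id].zip(ids)]

puts params[:foo_id]  # 1
puts params[:bar_id]  # 2
```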

Reading Several URIs in ruby

I need to read the contents of a web page several times and extract some information from it using regular expressions. I am using open-uri to read the page, and the sample code I wrote is as follows:
require 'open-uri'
def getResults(words)
  results = []
  words.each do |word|
    results.push getAResult(word)
  end
  results
end
def getAResult(word)
  file = open("http://www.somapage.com?option=#{word}")
  contents = file.read
  file.close
  contents.match /some-regex-here/
  $1.empty? ? -1 : $1.to_f
end
The problem is that unless I comment out the file.close line, getAResult always returns -1. When I try this code in the console, getAResult immediately returns -1, but the ruby process keeps running for another two to three seconds or so.
If I remove the file.close line, getAResult returns the correct result, but then getResults returns a bunch of -1s except for the first one. I tried using the curb gem to read the page, but a similar problem appears.
This seems like an issue related to threading. However, I couldn't come up with anything reasonable to search for, so I haven't found a solution. What do you think the problem might be?
NOTE: The web page I am trying to read does not return results quickly. It takes some time.
Try hpricot or nokogiri.
They can search your HTML document via XPath.
$1 is only set when the regex contains a capture group, so add one and grab the match result, like the following:
1.9.3-327 (main):0 > contents.match /div/
=> #<MatchData "div">
1.9.3-327 (main):0 > $1
=> nil
1.9.3-327 (main):0 > contents.match /(div)/
=> #<MatchData "div" 1:"div">
1.9.3-327 (main):0 > $1
=> "div"
If you are worried about thread safety, then you shouldn't use the $n regexp variables. Capture your results directly, like this:
value = contents[/regexp/]
Specifically, here's a more ruby-like formatting of that method:
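def getAResult(word)
  # On Ruby 3.0 and later, call URI.open instead of open for URLs.
  contents = open("http://www.somapage.com?option=#{word}") { |f| f.read }
  value = contents[/some-regex-here/]
  # String#[] returns nil when nothing matches, so test for nil rather than empty?
  value.nil? ? -1 : value.to_f
end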
The block form of #open (as above) automatically closes the file when you are done with it.
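If the regex has a capture group, String#[] takes the group number as a second argument, which avoids $1 entirely. A sketch with a made-up page body and pattern (SCORE and parse_result are hypothetical names, and no network access is involved):

```ruby
# Hypothetical pattern, for illustration only.
SCORE = /score:\s*(\d+(?:\.\d+)?)/

def parse_result(contents)
  # The second argument selects capture group 1; the call returns nil on no match.
  value = contents[SCORE, 1]
  value.nil? ? -1 : value.to_f
end

puts parse_result("<p>score: 42.5</p>")  # 42.5
puts parse_result("<p>no data</p>")      # -1
```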

How can I fix this regex which extracts a tweet id from a Twitter URL?

I am trying to write a regex that will extract a tweet id from a Twitter URL.
I have this one, which works when the Twitter username has a number in it:
'.*?\\d+.*?(\\d+)'
ruby-1.9.2-p0 > Regexp.new('.*?\\d+.*?(\\d+)',Regexp::IGNORECASE).match('https://twitter.com/#!/sportsguy33/status/41257488166686720')[1]
=> "41257488166686720"
ruby-1.9.2-p0 > Regexp.new('.*?\\d+.*?(\\d+)',Regexp::IGNORECASE).match('http://twitter.com/#!/dailythunder/status/41382006113841153')[1]
=> "3"
And this one, which works when the Twitter username doesn't have a number in it
'.*?(\\d+)'
ruby-1.9.2-p0 > Regexp.new('.*?(\\d+)',Regexp::IGNORECASE).match('https://twitter.com/#!/sportsguy33/status/41257488166686720')[1]
=> "33"
ruby-1.9.2-p0 > Regexp.new('.*?(\\d+)',Regexp::IGNORECASE).match('http://twitter.com/#!/dailythunder/status/41382006113841153')[1]
=> "41382006113841153"
How can I write one that will work in either case?
If the tweet ID is the last part of the URL, you can use:
'\/(\d+)$'
The $ matches the end of the string.
I just released a gem, tweet_url, to parse Twitter URLs.
require 'tweet_url'
tweet_url = TweetUrl.parse('https://twitter.com/yukihiro_matz/status/755950562227605504')
tweet_url.status_id #=> 755950562227605504
Heads up!
Be aware that there may be URLs like https://twitter.com/sferik/status/540897316908331009/photo/1, so we cannot simply extract the last numeric part.
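One way to cover all of the URLs mentioned in this thread is to anchor on the "/status/" segment instead of on where the digits happen to fall. A minimal sketch, assuming the id always directly follows "/status/"; the helper name tweet_id is made up.

```ruby
# Hypothetical helper: returns the tweet id as a String, or nil.
def tweet_id(url)
  # %r{...} avoids having to escape the slashes; group 1 is the id.
  url[%r{/status/(\d+)}, 1]
end

puts tweet_id('https://twitter.com/#!/sportsguy33/status/41257488166686720')
# 41257488166686720
puts tweet_id('http://twitter.com/#!/dailythunder/status/41382006113841153')
# 41382006113841153
puts tweet_id('https://twitter.com/sferik/status/540897316908331009/photo/1')
# 540897316908331009
```

Digits in the username or after the id no longer matter, because only the run of digits right after "/status/" is captured.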
I would suggest you try out Rubular.
Rubular is a Ruby-based regular expression editor. It's a handy way to test regular expressions as you write them.

How to get the file extension from a url?

New to ruby, how would I get the file extension from a url like:
http://www.example.com/asdf123.gif
Also, how would I format this string, in c# I would do:
string.format("http://www.example.com/{0}.{1}", filename, extension);
Use File.extname
File.extname("test.rb") #=> ".rb"
File.extname("a/b/d/test.rb") #=> ".rb"
File.extname("test") #=> ""
File.extname(".profile") #=> ""
To format the string
"http://www.example.com/%s.%s" % [filename, extension]
This works for URLs with a query string:
require 'uri'
file = 'http://recyclewearfashion.com/stylesheets/page_css/page_css_4f308c6b1c83bb62e600001d.css?1343074150'
File.extname(URI.parse(file).path) # => '.css'
It also returns "" if the file has no extension.
url = 'http://www.example.com/asdf123.gif'
extension = url.split('.').last
This will get you the extension for a URL (in the simplest manner possible). Now, for output formatting:
printf "http://www.example.com/%s.%s", filename, extension
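A short comparison of the two approaches when a query string is present. The URL with "?size=1.5" is an invented example, not one from the question; the last line shows format, which returns the string instead of printing it.

```ruby
require 'uri'

url = 'http://www.example.com/asdf123.gif?size=1.5'

# Splitting on the last dot picks up part of the query string.
p url.split('.').last                 # "5"

# Parsing first restricts the lookup to the path component.
p File.extname(URI.parse(url).path)   # ".gif"

filename  = 'asdf123'
extension = 'gif'
p format('http://www.example.com/%s.%s', filename, extension)
# "http://www.example.com/asdf123.gif"
```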
You could use Ruby's URI class like this to get the path component of the URI and match the extension after its last dot (this will also work if the URL contains a query part):
require 'uri'
your_url = 'http://www.example.com/asdf123.gif'
path = URI.split(your_url)[5]
extension = path[/\.([\w+-]+)$/, 1] # => "gif"
I realize this is an ancient question, but here's another vote for using Addressable. You can use the .extname method, which works as desired even with a query string:
Addressable::URI.parse('http://www.example.com/asdf123.gif').extname # => ".gif"
Addressable::URI.parse('http://www.example.com/asdf123.gif?foo').extname # => ".gif"

Extract all urls inside a string in Ruby

I have some text content with a list of URLs contained in it.
I am trying to grab all the URLs out and put them in an array.
I have this code
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"
urls = content.scan(/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix)
I am trying to get the end results to be:
['http://www.google.com', 'http://www.google.com/index.html']
The above code does not seem to be working correctly. Does anyone know what I am doing wrong?
Thanks
Easy:
ruby-1.9.2-p136 :006 > require 'uri'
ruby-1.9.2-p136 :006 > URI.extract(content, ['http', 'https'])
=> ["http://www.google.com", "http://www.google.com/index.html"]
A different approach, from the perfect-is-the-enemy-of-the-good school of thought:
urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
I haven't checked the syntax of your regex, but String#scan will produce an array, and when the regex contains capture groups each member is itself an array of the groups matched. So I'd expect the result to be:
[['http', '.google.com'], ...]
You'll need non-capturing groups, /(?:stuff)/, if you want the format you've given.
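A small demonstration of that difference, using a deliberately simplified pattern and made-up example hosts rather than the regex from the question:

```ruby
text = "see http://a.example and https://b.example"

# With capture groups, each result is an array of the captured groups.
p text.scan(/(https?):\/\/(\S+)/)
# [["http", "a.example"], ["https", "b.example"]]

# With only non-capturing groups, each result is the whole match.
p text.scan(/(?:https?):\/\/\S+/)
# ["http://a.example", "https://b.example"]
```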
Edit (looking at regex): Also, your regex does look a bit wrong. You don't want the start and end anchors (^ and $), since you don't expect the matches to be at start and end of content. Secondly, if your ([0-9]{1,5})? is trying to capture a port number, I think you're missing a colon to separate the domain from the port.
Further edit, after playing: I think you want something like this:
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html http://example.com:3000/foo"
urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
# => ["http://www.google.com", "http://www.google.com/index.html", "http://example.com:3000/foo"]
... but note that it won't match pure IP-address URLs (like http://127.0.0.1), because of the [a-z]{2,5} for the TLD.
Just for your interest:
Ruby has a URI module, which has a regex implemented to do such things:
require "uri"
urls = []
schemes_you_want_to_grab = ['ftp', 'http', 'https', 'mailto', 'see']
html_string.scan(URI.regexp(schemes_you_want_to_grab)) do |*matches|
  urls << $&
end
For more information visit the Ruby Ref: URI
The most upvoted answer was causing issues with Markdown URLs for me, so I had to figure out a regex to extract URLs. Below is what I use:
URL_REGEX = /(https?:\/\/\S+?)(?:[\s)]|$)/i
content.scan(URL_REGEX).flatten
The last part here (?:[\s)]|$) is used to identify the end of the URL and you can add characters there as per your need and content. Right now it looks for any space characters, closing bracket or end of string.
content = "link in text [link1](http://www.example.com/test) and [link2](http://www.example.com/test2)
http://www.example.com/test3
http://www.example.com/test4"
returns ["http://www.example.com/test", "http://www.example.com/test2", "http://www.example.com/test3", "http://www.example.com/test4"].

Resources