How to get the file extension from a url? - ruby

I'm new to Ruby. How would I get the file extension from a URL like:
http://www.example.com/asdf123.gif
Also, how would I format this string? In C# I would do:
string.format("http://www.example.com/{0}.{1}", filename, extension);

Use File.extname
File.extname("test.rb") #=> ".rb"
File.extname("a/b/d/test.rb") #=> ".rb"
File.extname("test") #=> ""
File.extname(".profile") #=> ""
To format the string:
"http://www.example.com/%s.%s" % [filename, extension]

This also works for URLs with a query string:
require 'uri'
file = 'http://recyclewearfashion.com/stylesheets/page_css/page_css_4f308c6b1c83bb62e600001d.css?1343074150'
File.extname(URI.parse(file).path) # => '.css'
It also returns "" if the file has no extension.
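For reuse, the same idea can be wrapped in a small method. This is only a sketch: url_extension is a hypothetical helper name, and the sample URLs are illustrative rather than taken from the question.

```ruby
require 'uri'

# Hypothetical helper: returns the extension (with its leading dot) of the
# URL's path, or "" when the path has none. Query and fragment are ignored
# because only the parsed path is passed to File.extname.
def url_extension(url)
  File.extname(URI.parse(url).path)
end

puts url_extension('http://www.example.com/asdf123.gif')          # .gif
puts url_extension('http://www.example.com/asdf123.gif?v=1.2#x')  # .gif
puts url_extension('http://www.example.com/download').inspect     # ""
```

Note that URI.parse raises URI::InvalidURIError on malformed input, so untrusted URLs may need a rescue around the call.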

url = 'http://www.example.com/asdf123.gif'
extension = url.split('.').last
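This gets you the extension for a URL in the simplest possible way. It returns "gif" without the leading dot, and it gives wrong results if the URL has a query string or no extension at all. Now, for output formatting: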
printf "http://www.example.com/%s.%s", filename, extension
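When the extension comes from File.extname rather than from split, it keeps its leading dot, so the format string should not add a second one. A minimal sketch using the URL from the question (the variable names are illustrative):

```ruby
url = 'http://www.example.com/asdf123.gif'

extension = File.extname(url)             # ".gif" (leading dot included)
filename  = File.basename(url, extension) # "asdf123"

# No literal dot between the placeholders: extension already carries one.
rebuilt = format('http://www.example.com/%s%s', filename, extension)
puts rebuilt
```

Kernel#format returns the string instead of printing it, which is closer to what string.Format does in C#.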

You could use Ruby's URI class to get the path component of the URI and match whatever follows its last dot (this also works if the URL contains a query part):
require 'uri'
your_url = 'http://www.example.com/asdf123.gif'
# URI.split returns an array of components; index 5 is the path
path = URI.split(your_url)[5]
# capture group 1 is the extension without the dot, or nil if there is none
extension = path[/\.([\w+-]+)$/, 1] # => "gif"

I realize this is an ancient question, but here's another vote for using Addressable. You can use the .extname method, which works as desired even with a query string:
Addressable::URI.parse('http://www.example.com/asdf123.gif').extname # => ".gif"
Addressable::URI.parse('http://www.example.com/asdf123.gif?foo').extname # => ".gif"

Related

String Include is not matching

I'm having an issue where a link that contains .pdf is not matching with include? in Ruby. Example code:
link = somelink.pdf
puts link.include?(".pdf")
Output when I run the program.
http://somelink.com/somepdf.pdf
false
Try converting it to a string first:
link = somelink.pdf
puts link.to_s.include?(".pdf")
OR
File.extname(link.to_s) == ".pdf"

Regular expression in ruby?

I have a URL like below.
/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db
I need to extract only the id of the play (i.e. 5b35a825-d372-4375-b2f0-f641a38067db) using regular expression. How can I do it?
I would not use a regexp to parse a url. I would use Ruby's libraries to handle URLs:
require 'uri'
url = '/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db'
uri = URI.parse(url)
params = URI::decode_www_form(uri.query).to_h
params['play']
# => "5b35a825-d372-4375-b2f0-f641a38067db"
You can do:
str = '/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db'
match = str.match(/.*\?play=([^&]+)/)
puts match[1]
# prints 5b35a825-d372-4375-b2f0-f641a38067db
The regex /.*\?play=([^&]+)/ will match everything up until ?play=, and then capture anything that is not a & (the query string parameter separator)
A match will create a MatchData object, represented here by match variable, and captures will be indices of the object, hence your matched data is available at match[1].
url = '/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db'
url.split("play=")[1] #=> "5b35a825-d372-4375-b2f0-f641a38067db"
Ruby's built-in URI class has everything needed to correctly parse, split and decode URLs:
require 'uri'
uri = URI.parse('/shows/the-ruby-book/meta-programming/?play=5b35a825-d372-4375-b2f0-f641a38067db')
URI::decode_www_form(uri.query).to_h['play'] # => "5b35a825-d372-4375-b2f0-f641a38067db"
If you're using an older Ruby that doesn't support to_h, use:
Hash[URI::decode_www_form(uri.query)]['play'] # => "5b35a825-d372-4375-b2f0-f641a38067db"
You should use URI, rather than try to split/extract using a regexp, because the query of a URI will be encoded if any values are not within the characters allowed by the spec. URI, or Addressable::URI, will decode those back to their original values for you.
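To illustrate the point about encoding, here is a small sketch. The URL and its title parameter are made up for the example; only the play parameter name comes from the question.

```ruby
require 'uri'

# Illustrative URL (not from the question): the title value is percent-encoded.
url = '/shows/?title=ruby%20%26%20rails&play=5b35a825'

query  = URI.parse(url).query
params = URI.decode_www_form(query).to_h

puts params['title'] # ruby & rails
puts params['play']  # 5b35a825

# A plain split hands back the still-encoded text.
raw_title = query.split('&').first.split('=').last
puts raw_title       # ruby%20%26%20rails
```

The encoded ampersand (%26) is the interesting case: decode_www_form splits on the real separators first and decodes afterwards, so the value survives intact.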

Using binary data (strings in utf-8) from external file

I have a problem using strings in UTF-8 format, e.g. "\u0161\u010D\u0159\u017E\u00FD".
When such string is defined as variable in my program it works fine. But when I use such string by reading it from some external file I get the wrong output (I don't get what I want/expect). Definitely I'm missing some necessary encoding stuff...
My code:
file = "c:\\...\\vlmList_unicode.txt" #\u306b\u3064\u3044\u3066
data = File.open(file, 'rb') { |io| io.read.split(/\t/) }
puts data
data_var = "\u306b\u3064\u3044\u3066"
puts data_var
Output:
\u306b\u3064\u3044\u3066 # what I don't want
について # what I want
I'm trying to read the file in binary form by specifying 'rb', but obviously there is some other problem...
I run my code in NetBeans 7.3.1 with the built-in JRuby 1.7.3 (I also tried Ruby 2.0.0, but it made no difference).
Since I'm new to the Ruby world, any ideas are welcome...
If your file contains the literal escaped string:
\u306b\u3064\u3044\u3066
Then you will need to unescape it after reading. Ruby does this for you with string literals, which is why the second case worked for you. Taken from the answer to "Is this the best way to unescape unicode escape sequences in Ruby?", you can use this:
file = "c:\\...\\vlmList_unicode.txt" #\u306b\u3064\u3044\u3066
data = File.open(file, 'rb') { |io|
  contents = io.read.gsub(/\\u([\da-fA-F]{4})/) { |m|
    [$1].pack("H*").unpack("n*").pack("U*")
  }
  contents.split(/\t/)
}
Alternatively, if you would like to make it more readable, extract the substitution into a new method and add it to the String class:
class String
  def unescape_unicode
    self.gsub(/\\u([\da-fA-F]{4})/) { |m|
      [$1].pack("H*").unpack("n*").pack("U*")
    }
  end
end
Then you can call:
file = "c:\\...\\vlmList_unicode.txt" #\u306b\u3064\u3044\u3066
data = File.open(file, 'rb') { |io|
  io.read.unescape_unicode.split(/\t/)
}
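The substitution can be tried without a file. The sketch below uses a single-quoted literal to stand in for the file contents, and swaps the pack/unpack chain for a shorter conversion via String#hex; that variant is my own simplification, not the code from the linked answer.

```ruby
# Text as it would arrive from the file: literal backslash-u sequences,
# not the characters themselves (single quotes keep the backslashes).
raw = 'abc\u306b\u3064\u3044\u3066'

# Turn each four-digit hex escape into its code point, then into UTF-8.
decoded = raw.gsub(/\\u([\da-fA-F]{4})/) { [$1.hex].pack('U') }

puts decoded
```

Like the original, this only handles four-digit escapes; characters written as surrogate pairs would need extra handling.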
Just as a FYI:
data = File.open(file, 'rb') { |io| io.read.split(/\t/) }
Can be written more simply as one of these:
data = File.read(file, mode: 'rb').split(/\t/)
data = File.readlines(file, "\t", mode: 'rb')
(Remember that File inherits from IO, which is where these methods are defined, so look in IO for documentation on them.)
readlines takes a "separator" parameter, which in the example above is "\t". Ruby uses it in place of the default line separator ("\n"), so records are retrieved using the tab delimiter.
This makes me wonder a bit why you'd want to do that, though. I've never seen tabs as record delimiters, only as column/field delimiters in "TSV" (tab-separated values) files. That leads me to think you should probably be using Ruby's CSV class, with "\t" as the column separator. But without samples of the actual file you're reading, I can't say for sure.

Get first match of image name in URL (regex, Ruby)

I'm trying to regex the first match of an image name in a URL (ruby).
Here's my current code:
@wikimedia_link.match(/(\/|:)([a-zA-Z\_\-0-9]*\.(jpeg|jpg|png|gif))/).try(:[], 2)
It works (returns "Samuel_L_Jackson_Comic_Con.jpg") if I have one match, i.e.
http://en.wikipedia.org/wiki/File:Samuel_L_Jackson_Comic_Con.jpg
However, this returns nil, which seems to be because there are both "Lucy_desi_1957.JPG" and "220px-Lucy_desi_1957.JPG" in the URL:
http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG/220px-Lucy_desi_1957.JPG
Any idea how to get just the first match?
Thank you!
If you want the filename at the end, add a $ to match the end of the string:
/(\/|:)([\w\-.]+\.(jpeg|jpg|png|gif)$)/i
What you want is:
@wikimedia_link[/[^\/:]+\.(?i:jpeg|jpg|png|gif)/]
The (?i:...) group switches to case-insensitive matching, so either jpg or JPG will be matched.
This is how I'd do it:
2.0.0-p247 :008 > image_url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG/220px-Lucy_desi_1957.JPG'
=> "http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG/220px-Lucy_desi_1957.JPG"
2.0.0-p247 :009 > image_name = image_url.match( /[-_\w:]+\.(jpe?g|png|gif)$/i )
=> #<MatchData "220px-Lucy_desi_1957.JPG" 1:"JPG">
2.0.0-p247 :012 > image_name.to_s
=> "220px-Lucy_desi_1957.JPG"
Without IRB:
image_url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG/220px-Lucy_desi_1957.JPG'
image_name = image_url.match( /[-_\w:]+\.(jpe?g|png|gif)$/i );
puts image_name #=> "220px-Lucy_desi_1957.JPG"
This solution is the best because it derives the file name of the image, whether from a simple URL with a simple filename:
http://www.anexample.com/dog.jpg
or with a more complex filename:
http://www.anexample.com/342432_large-xs_dog.jpg
or if an image is referenced more than once in the URL:
http://www.anexample.com/cat.jpg/upload/342432_large-xs_dog.jpg/xs/342432_large-xs_dog.jpg
The following regex works for both of your examples
/^.+\/[\w:]+\.(jpe?g|png|gif)/i
And you can get just "http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG" with the following
"http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG/220px-Lucy_desi_1957.JPG".match(/^.+\/[\w:]+\.(jpe?g|png|gif)/i).to_a.first
If you're just after the filename itself, drop the ^.+\/ from the regex, leaving simply:
/[\w:]+\.(jpe?g|png|gif)/i
Using this version in the match returns only "Lucy_desi_1957.JPG".
In either case, if no match is found, nil is returned.
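To see the difference between the first and the last filename in such a URL, String#scan can collect every match. This is a sketch combining ideas from the answers above; the character class [^\/:] is borrowed from the earlier answer, and the non-capturing group is my own adjustment so that scan returns whole matches.

```ruby
url = 'http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Lucy_desi_1957.JPG/220px-Lucy_desi_1957.JPG'

# With a capturing group, scan would return only the captured extensions,
# so the alternation is wrapped in (?:...) instead.
names = url.scan(/[^\/:]+\.(?:jpe?g|png|gif)/i)

puts names.first # Lucy_desi_1957.JPG
puts names.last  # 220px-Lucy_desi_1957.JPG
```

If the URL contains no image name, scan returns an empty array, so first and last are both nil.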

Regex to remove text before "http://"?

I have a ruby app parsing a bunch of URLs from strings:
@text = "a string with a url http://example.com"
@text.split.grep(/http[s]?:\/\/\w/)
@text[0] = "http://example.com"
This works fine ^^
But sometimes the URLs have text before the http://, for example:
@text = "What's a spacebar? ...http://example.com"
@text[0] = "...http://example.com"
Is there a regex that can select just the text before "http://" in a string so I can strip it out?
Perhaps a nicer way to achieve the same result is to use the URI standard library.
require 'uri'
text = "a string with a url http://example.com and another URL here:http://2.example.com and this here"
URI.extract(text, ['http', 'https'])
# => ["http://example.com", "http://2.example.com"]
Documentation: URI.extract
Splitting and then grepping is an odd way to do this. Why don't you just use String#scan:
@text = "a string with a url http://example.com"
urls = @text.scan(/http[s]?:\/\/\S+/)
urls[0] # => "http://example.com"
A lookahead matches everything before the URL:
.*(?=http://)
or you could cover other schemes in the same pattern:
.*(?=(f|ht)tps?://)
Alternatively, just search for http://, then remove the part of the string before it (=~ returns the offset into the string).
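The offset approach can be sketched as follows. The sample text is the one from the question; the variable names start and url are illustrative.

```ruby
text = "What's a spacebar? ...http://example.com"

# =~ returns the index of the first match, or nil when there is none.
start = text =~ %r{https?://}

# Keep everything from the match onward; nil if no URL was found.
url = start ? text[start..-1] : nil

puts url # http://example.com
```

Using %r{...} avoids having to escape the slashes in the pattern.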
