Which characters make a URL invalid? - validation

Which characters make a URL invalid?
Are these valid URLs?
example.com/file[/].html
http://example.com/file[/].html

In general, URIs as defined by RFC 3986 (see Section 2: Characters) may contain any of the following 84 characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=
Note that this list doesn't state where in the URI these characters may occur.
Any other character needs to be percent-encoded (%hh). Each part of the URI has further restrictions about which characters need to be represented by a percent-encoded word.
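As a quick illustration, here is a minimal sketch using Python's urllib (the input string is just an example):

from urllib.parse import quote

# Percent-encode everything outside the unreserved set (%hh form):
print(quote("file[/].html", safe=""))  # file%5B%2F%5D.html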

The '[' and ']' in this example are "unwise" characters but still legal. If the '/' inside the brackets is meant to be part of the file name, then it is invalid, since '/' is reserved and should be properly encoded:
http://example.com/file[/].html
To add some clarification and directly address the question above, there are several classes of characters that cause problems for URLs and URIs.
There are some characters that are disallowed and should never appear in a URL/URI, reserved characters (described below), and other characters that may cause problems in some cases but are marked as "unwise" or "unsafe". The reasons why these characters are restricted are clearly spelled out in RFC 1738 (URLs) and RFC 2396 (URIs). Note that the newer RFC 3986 (which updates RFC 1738) defines which characters are allowed in a given context, but the older spec offers a simpler and more general description of which characters are not allowed, with the following rules.
Excluded US-ASCII Characters disallowed within the URI syntax:
control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
space = <US-ASCII coded character 20 hexadecimal>
delims = "<" | ">" | "#" | "%" | <">
The character "#" is excluded because it is used to delimit a URI from a fragment identifier. The percent character "%" is excluded because it is used for the encoding of escaped characters. In other words, the "#" and "%" are reserved characters that must be used in a specific context.
List of unwise characters are allowed but may cause problems:
unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
Characters that are reserved within a query component and/or have special meaning within a URI/URL:
reserved = ";" | "/" | "?" | ":" | "#" | "&" | "=" | "+" | "$" | ","
The "reserved" syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax. Characters in the "reserved" set are not reserved in all contexts. The hostname, for example, can contain an optional username so it could be something like ftp://user#hostname/ where the '#' character has special meaning.
Here is an example of a URL that has invalid and unwise characters (e.g. '$', '[', ']') and should be properly encoded:
http://mw1.google.com/mw-earth-vectordb/kml-samples/gp/seattle/gigapxl/$[level]/r$[y]_c$[x].jpg
Some of the character restrictions for URIs and URLs are programming-language-dependent. For example, the '|' (0x7C) character, although only marked as "unwise" in the URI spec, will throw a URISyntaxException in the Java java.net.URI constructor, so a URL like http://api.google.com/q?exp=a|b is not allowed and must instead be encoded as http://api.google.com/q?exp=a%7Cb when using Java with a URI object instance.
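For instance, a hedged Python sketch of that encoding step (the query string is illustrative; any percent-encoding routine would do the same):

from urllib.parse import quote

# Encode the pipe before handing the URL to a stricter parser:
print(quote("exp=a|b", safe="="))  # exp=a%7Cb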

Most of the existing answers here are impractical because they totally ignore the real-world usage of addresses like:
https://en.wikipedia.org/wiki/Möbius_strip or
https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en.
First, a digression into terminology. What are these addresses? Are they valid URLs?
Historically, the answer was "no". According to RFC 3986, from 2005, such addresses are not URIs (and therefore not URLs, since URLs are a type of URIs). Per the terminology of 2005 IETF standards, we should properly call them IRIs (Internationalized Resource Identifiers), as defined in RFC 3987, which are technically not URIs but can be converted to URIs simply by percent-encoding all non-ASCII characters in the IRI.
Per modern spec, the answer is "yes". The WHATWG Living Standard simply classifies everything that would previously be called "URIs" or "IRIs" as "URLs". This aligns the specced terminology with how normal people who haven't read the spec use the word "URL", which was one of the spec's goals.
What characters are allowed under the WHATWG Living Standard?
Per this newer meaning of "URL", what characters are allowed? In many parts of the URL, such as the query string and path, we're allowed to use arbitrary "URL units", which are
URL code points and percent-encoded bytes.
What are "URL code points"?
The URL code points are ASCII alphanumeric, U+0021 (!), U+0024 ($), U+0026 (&), U+0027 ('), U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+002A (*), U+002B (+), U+002C (,), U+002D (-), U+002E (.), U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+003F (?), U+0040 (@), U+005F (_), U+007E (~), and code points in the range U+00A0 to U+10FFFD, inclusive, excluding surrogates and noncharacters.
(Note that the list of "URL code points" doesn't include %, but that %s are allowed in "URL units" if they're part of a percent-encoding sequence.)
The only place I can spot where the spec permits the use of any character that's not in this set is in the host, where IPv6 addresses are enclosed in [ and ] characters. Everywhere else in the URL, either URL units or some even more restrictive set of characters is allowed.
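To make the definition concrete, here is a rough Python sketch of the "URL code point" test quoted above (an illustrative helper, not a full URL validator):

URL_CODE_POINT_EXTRAS = set("!$&'()*+,-./:;=?@_~")

def is_url_code_point(ch):
    # ASCII: alphanumerics plus the punctuation listed in the spec.
    if ch.isascii():
        return ch.isalnum() or ch in URL_CODE_POINT_EXTRAS
    cp = ord(ch)
    if not 0x00A0 <= cp <= 0x10FFFD:
        return False
    if 0xD800 <= cp <= 0xDFFF:  # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF or cp & 0xFFFE == 0xFFFE:  # noncharacters
        return False
    return True

print(is_url_code_point("ö"))  # True
print(is_url_code_point("{"))  # False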
What characters were allowed under the old RFCs?
For the sake of history, and since it's not explored fully elsewhere in the answers here, let's examine what was allowed under the older pair of specs.
First of all, we have two types of RFC 3986 reserved characters:
:/?#[]@, which are part of the generic syntax for a URI defined in RFC 3986
!$&'()*+,;=, which aren't part of the RFC's generic syntax, but are reserved for use as syntactic components of particular URI schemes. For instance, semicolons and commas are used as part of the syntax of data URIs, and & and = are used as part of the ubiquitous ?foo=bar&qux=baz format in query strings (which isn't specified by RFC 3986).
Any of the reserved characters above can be legally used in a URI without encoding, either to serve their syntactic purpose or just as literal characters in data in some places where such use could not be misinterpreted as the character serving its syntactic purpose. (For example, although / has syntactic meaning in a URL, you can use it unencoded in a query string, because it doesn't have meaning in a query string.)
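Python's urllib reflects this context-sensitivity: its quote() function takes a safe parameter naming the reserved characters to leave literal ('/' by default), so you decide which characters are serving a syntactic purpose in your context:

from urllib.parse import quote

print(quote("a/b?c"))             # a/b%3Fc  ('/' left alone by default)
print(quote("a/b?c", safe="/?"))  # a/b?c    ('?' declared safe as well)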
RFC 3986 also specifies some unreserved characters, which can always be used simply to represent data without any encoding:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~
Finally, the % character itself is allowed for percent-encodings.
That leaves only the following ASCII characters that are forbidden from appearing in a URL:
The control characters (chars 0-1F and 7F), including new line, tab, and carriage return.
"<>^`{|}
Every other character from ASCII can legally feature in a URL.
Then RFC 3987 extends that set of unreserved characters with the following unicode character ranges:
%xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
/ %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
/ %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
/ %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
/ %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
/ %xD0000-DFFFD / %xE1000-EFFFD
These block choices from the old spec seem bizarre and arbitrary given the latest Unicode block definitions; this is probably because the blocks have been added to in the decade since RFC 3987 was written.
Finally, it's perhaps worth noting that simply knowing which characters can legally appear in a URL isn't sufficient to recognise whether some given string is a legal URL or not, since some characters are only legal in particular parts of the URL. For example, the reserved characters [ and ] are legal as part of an IPv6 literal host in a URL like http://[1080::8:800:200C:417A]/foo but aren't legal in any other context, so the OP's example of http://example.com/file[/].html is illegal.
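For example, Python's (lenient, non-validating) URL splitter shows the brackets being treated as IPv6 host delimiters rather than data:

from urllib.parse import urlsplit

# Brackets delimit an IPv6 literal host; urlsplit strips them:
print(urlsplit("http://[1080::8:800:200C:417A]/foo").hostname)
# 1080::8:800:200c:417a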

In your supplementary question you asked if www.example.com/file[/].html is a valid URL.
That URL isn't valid because a URL is a type of URI and a valid URI must have a scheme like http: (see RFC 3986).
If you meant to ask if http://www.example.com/file[/].html is a valid URL then the answer is still no because the square bracket characters aren't valid there.
The square bracket characters are reserved for URLs in this format: http://[2001:db8:85a3::8a2e:370:7334]/foo/bar (i.e. an IPv6 literal instead of a host name)
It's worth reading RFC 3986 carefully if you want to understand the issue fully.

All valid characters that can be used in a URI (a URL is a type of URI) are defined in RFC 3986.
All other characters can be used in a URL provided they are "URL encoded" first. This involves replacing the invalid character with specific "codes" (usually the percent symbol (%) followed by a hexadecimal number).
This link, HTML URL Encoding Reference, contains a list of the encodings for invalid characters.
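A minimal round-trip sketch in Python (the sample string is arbitrary):

from urllib.parse import quote, unquote

encoded = quote("100% free", safe="")
print(encoded)           # 100%25%20free
print(unquote(encoded))  # 100% free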

Several Unicode character ranges are valid in HTML5, although it might still not be a good idea to use them.
E.g., the href docs (http://www.w3.org/TR/html5/links.html#attr-hyperlink-href) say:
The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.
Then the definition of "valid URL" points to http://url.spec.whatwg.org/, which says it aims to:
Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process.
That document defines URL code points as:
ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.
The term "URL code points" is then used in the statement:
If c is not a URL code point and not "%", parse error.
in several parts of the parsing algorithm, including the scheme, authority, relative path, query, and fragment states: so basically the entire URL.
Also, the validator http://validator.w3.org/ passes URLs containing characters like "你好", and does not pass URLs containing characters like the space in "a b".
Of course, as mentioned by Stephen C, it is not just about characters but also about context: you have to understand the entire algorithm. But since the "URL code points" class is used at key points of the algorithm, it gives a good idea of what you can and cannot use.
See also: Unicode characters in URLs

I needed to select characters to split URLs in a string, so I decided to create a list of characters which could not be found in the URL by myself:
>>> allowed = "-_.~!*'();:@&=+$,/?%#[]ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
>>> from string import printable
>>> ''.join(set(printable).difference(set(allowed)))
'`" <\x0b\n\r\x0c\\\t{^}|>'
So, the possible choices are the newline, tab, space, backslash and `"<>{}^|. I guess I'll go with the space or newline. :)

I am implementing an old HTTP (0.9, 1.0, 1.1) request and response reader/writer. The request URI is the most problematic place.
You can't just use RFC 1738, 2396 or 3986 as-is: many old HTTP clients and servers allow more characters. So I did some research based on accidentally published web server access logs: "GET URI HTTP/1.0" 200.
I've found that the following non-standard characters are often used in URIs:
\ { } < > | ` ^ "
These characters were described in RFC 1738 as unsafe.
If you want to be compatible with all old HTTP clients and servers - you have to allow these characters in the request URI.
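As a sketch of that permissive policy (the character set below is my own assumption built from the list above, not any standard):

import re

# RFC 3986 characters plus the "unsafe" ones observed in the wild.
LENIENT_URI = re.compile(r'[A-Za-z0-9\-._~:/?#\[\]@!$&\'()*+,;=%\\{}<>|`^"]+')

print(bool(LENIENT_URI.fullmatch("/search?q={a|b}")))  # True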
Please read more information about this research in oghttp-request-collector.

This is not really an answer to your question, but validating URLs is really a serious p.i.t.a. You're probably better off just validating the domain name and leaving the query part of the URL alone. That is my experience.
You could also resort to pinging the URL and seeing if it results in a valid response, but that might be too much for such a simple task.
Regular expressions to detect URLs are abundant, google it :)

I can't comment on the answers above, but I wanted to emphasize the point (made in another answer) that allowed characters aren't allowed everywhere. For example, domain names can't contain underscores, so http://test_url.com is invalid.
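A hedged sketch of the classic LDH (letters, digits, hyphen) rule for a single DNS label, ignoring IDN and overall length limits:

import re

LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")

print(bool(LABEL.match("test_url")))  # False: underscores are not allowed
print(bool(LABEL.match("test-url")))  # True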

From the source (emphasis added when needed):
Unsafe:
Characters can be unsafe for a number of reasons. The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs.
The characters "<" and ">" are unsafe because they are used as the
delimiters around URLs in free text; the quote mark (") is used to
delimit URLs in some systems. The character "#" is unsafe and should
always be encoded because it is used in World Wide Web and in other
systems to delimit a URL from a fragment/anchor identifier that might
follow it. The character "%" is unsafe because it is used for
encodings of other characters. Other characters are unsafe because
gateways and other transport agents are known to sometimes modify such
characters. These characters are "{", "}", "|", "\", "^", "~", "[",
"]", and "`".
All unsafe characters must always be encoded within a URL. For
example, the character "#" must be encoded within URLs even in systems
that do not normally deal with fragment or anchor identifiers, so that
if the URL is copied into another system that does use them, it will
not be necessary to change the URL encoding.
Source: RFC 1738, section 2.2 (URL Character Encoding Issues)

I came up with a couple of regular expressions for PHP that will convert URLs in text to anchor tags. (First it converts all www. URLs to http://, and then it converts all URLs starting with https?:// into <a href="...">...</a> HTML links.)
$string = preg_replace('/(https?:\/\/)([!#$&-;=?\-\[\]_a-z~%]+)/sim', '<a href="$1$2">$2</a>', preg_replace('/(\s)((www\.)([!#$&-;=?\-\[\]_a-z~%]+))/sim', '$1http://$2', $string));

Related

What does [CFWS] and [FWS] mean in this ABNF?

RFC 2822 for emails has the below ABNF for quoted-string.
quoted-string = [CFWS]
                DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                [CFWS]
I googled and found that CFWS stands for Comments, Folding, White Space. I know what white space is, but I don't know what comments and folding are in terms of the ABNF for an email address.
Also, what does the [FWS] inside *() mean? Can the double quotes contain zero or more occurrences of qcontent, each preceded by optional folding white space?
This is very confusing. References to understand ABNF would be much appreciated.
This isn't part of the generic ABNF syntax (currently defined in RFC 5234, although RFC 2234 was the definition of ABNF in play at the time RFC 2822 was written). Rather, FWS and CFWS are special tokens defined in the email RFC itself (see section 3.2.3 of RFC 2822, or section 3.2.2 of RFC 5322, which obsoleted RFC 2822 in 2008).
From RFC 5322:
2.2.3. Long Header Fields
Each header field is logically a single line of characters
comprising the field name, the colon, and the field body. For
convenience however, and to deal with the 998/78 character
limitations per line, the field body portion of a header field can be
split into a multiple-line representation; this is called "folding".
The general rule is that wherever this specification allows for
folding white space (not simply WSP characters), a CRLF may be
inserted before any WSP.
For example, the header field:
Subject: This is a test
can be represented as:
Subject: This
is a test
...
The process of moving from this folded multiple-line representation
of a header field to its single line representation is called
"unfolding". Unfolding is accomplished by simply removing any CRLF
that is immediately followed by WSP. Each header field should be
treated in its unfolded form for further syntactic and semantic
evaluation. An unfolded header field has no length restriction and
therefore may be indeterminately long.
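Unfolding is easy to express in code; here is a one-line Python sketch of the rule just quoted (remove any CRLF immediately followed by WSP):

import re

folded = "Subject: This\r\n is a test"
print(re.sub(r"\r\n(?=[ \t])", "", folded))  # Subject: This is a test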
...
3.2.2. Folding White Space and Comments
White space characters, including white space used in folding
(described in section 2.2.3), may appear between many elements in
header field bodies. Also, strings of characters that are treated as
comments may be included in structured field bodies as characters
enclosed in parentheses. The following defines the folding white
space (FWS) and comment constructs.
Strings of characters enclosed in parentheses are considered
comments so long as they do not appear within a "quoted-string", as
defined in section 3.2.4. Comments may nest.
There are several places in this specification where comments and
FWS may be freely inserted. To accommodate that syntax, an
additional token for "CFWS" is defined for places where comments
and/or FWS can occur. However, where CFWS occurs in this
specification, it MUST NOT be inserted in such a way that any line
of a folded header field is made up entirely of WSP characters and
nothing else.
FWS      = ([*WSP CRLF] 1*WSP) / obs-FWS
                                     ; Folding white space
ctext    = %d33-39 /    ; Printable US-ASCII
           %d42-91 /    ; characters not including
           %d93-126 /   ; "(", ")", or "\"
           obs-ctext
ccontent = ctext / quoted-pair / comment
comment  = "(" *([FWS] ccontent) [FWS] ")"
CFWS     = (1*([FWS] comment) [FWS]) / FWS
Throughout this specification, where FWS (the folding white space
token) appears, it indicates a place where folding, as discussed in
section 2.2.3, may take place. Wherever folding appears in a
message (that is, a header field body containing a CRLF followed by
any WSP), unfolding (removal of the CRLF) is performed before any
further semantic analysis is performed on that header field according
to this specification. That is to say, any CRLF that appears in FWS is
semantically "invisible".
A comment is normally used in a structured field body to provide some
human-readable informational text. Since a comment is allowed to
contain FWS, folding is permitted within the comment. Also note that
since quoted-pair is allowed in a comment, the parentheses and
backslash characters may appear in a comment, so long as they appear
as a quoted-pair. Semantically, the enclosing parentheses are not
part of the comment; the comment is what is contained between the two
parentheses. As stated earlier, the "" in any quoted-pair and the
CRLF in any FWS that appears within the comment are semantically
"invisible" and therefore not part of the comment either.
Runs of FWS, comment, or CFWS that occur between lexical tokens in a
structured header field are semantically interpreted as a single space
character.

Using CertReq.exe, how to encode special characters in Subject

We are using Microsoft Certificate Request (CertReq.exe) to build certificate requests programmatically. For this purpose, we have to create input INF files (see Microsoft's certreq documentation).
The Subject property is defined as Relative Distinguished Name string values, which should be encoded as specified by RFC 1779.
That essentially means escaping certain characters (", +, ,, ;, <, >, or \) by prefixing them with \.
The problem is that I could not figure out how to properly encode a Subject containing the value "O=Foo + Bar".
Input (relevant INF part):
[NewRequest]
Subject = "CN=www.foo.de,OU=Foobar,O=Foo \+ Bar,L=Foo,S=Bar,C=DE"
Output:
The string contains an invalid X500 name attribute key, oid, value or delimiter. 0x80092023 (-2146885597 CRYPT_E_INVALID_X500_STRING)
c:\file_path.inf([NewRequest] Subject = "CN=www.foo.de,OU=Foobar,O=Foo \+ Bar,L=Foo,S=Bar,C=DE")
Duplicate escaping (using both " and \) is discouraged by RFC 1779, but it seems to solve problems in LDAP queries, for instance.
However, we also tried specifying the subject without quotation marks, but got another unwanted result.
Input:
[NewRequest]
Subject = CN=www.foo.de,OU=Foobar,O=Foo \+ Bar,L=Foo,S=Bar,C=DE
Output:
The data is invalid. 0x8007000d (WIN32: 13 ERROR_INVALID_DATA)
c:\file_path.inf([NewRequest] Subject = "CN=www.foo.de", "OU=Foobar", "O=Foo \+ Bar", "L=Foo", "S=Bar", "C=DE")
The whole process works without the + sign. What is the correct way to encode a RDN (relative distinguished name) in the INF file?
Normally the + character has special meaning. You can disable that behavior like this and just use the + character like you would any other.
Subject = CN=www.foo.de,OU=Foobar,O=Foo + Bar,L=Foo,S=Bar,C=DE
X500NameFlags = 0x20000000
The plus character is normally reserved to separate multiple values for multi-valued RDNs.
I'm not entirely sure why escaping it does not work with CertEnroll as you are expecting it to.

Allowed characters in map key identifier in YAML?

Which characters are and are not allowed in a key (i.e. example in example: "Value") in YAML?
The YAML 1.2 specification simply advises using printable characters, with explicit control characters being excluded (see the spec):
In constructing key names, characters that the YAML spec uses to denote syntax or special meaning need to be avoided (e.g. # denotes a comment, > denotes folding, - denotes a list, etc.).
Essentially, you are left with the coding conventions (restrictions) of whatever code (parser/tool implementation) needs to consume your YAML document. The more you stick with alphanumerics, the better; it has simply been our experience that the underscore has worked with most tooling we have encountered.
It has been a shared practice with others we work with to convert the period character (.) to an underscore character (_) when mapping namespace syntax that uses periods to YAML. Some people have similarly used hyphens successfully, but we have seen them misconstrued in some implementations.
Any character can appear in a key, provided the key is properly quoted (with either single quotes 'example' or double quotes "example"). Please also be aware that a key does not have to be a scalar; it can be a list or a map.
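A small Python sketch using PyYAML (an assumed dependency; the keys are illustrative), showing that quoting makes otherwise special characters usable in keys:

import yaml  # PyYAML, assumed installed

doc = """
plain-key: 1
"key with spaces, # and :": 2
'another: quoted key': 3
"""
print(yaml.safe_load(doc))
# {'plain-key': 1, 'key with spaces, # and :': 2, 'another: quoted key': 3}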

What does 'case insensitive' mean in RFC 3986 with respect to non-English characters?

RFC 3986 specifies that the host component of a URI is 'case insensitive'. However, it doesn't specify what 'case insensitive' means in terms of UCS or UTF-8 characters.
Examples given in the RFC (e.g. "<HTTP://www.EXAMPLE.com/> is equivalent to <http://www.example.com/>") allow us to infer that 'case insensitive' means at least that the characters A-Z are considered equivalent to the characters 32 code points ahead of them, i.e. a-z. However, no mention is made of how characters outside this range should be treated. So, given a non-encoded, non-normalised registered name of www.OLÉ.com, I see three potential forms of normalisation permissible by the RFC:
Lower case to www.olé.com then percent encode to www.ol%E9.com
Lower case only A-Z characters to www.olÉ.com and then percent encode to www.ol%C9.com
Percent encode to www.OL%C9.com, and then lower case the non-percent encoded parts to www.ol%C9.com, producing the same result as 2.
So the question is: Which is correct? If it's case 1., what defines which characters are considered upper case, and which are considered lower case (and which characters don't have a case)?
Hostnames resolved by DNS are always lowercase.
It is not possible to have UTF-8 characters in DNS hostnames (RFC 1123); however, a workaround has been put in place with "internationalized domain names". This workaround is commonly known as punycode.
Punycode enables non-ASCII characters to be represented by ASCII characters:
non-ASCII characters are represented by ASCII characters that are allowed in host name labels (letters, digits, and hyphens).
-- https://www.ietf.org/rfc/rfc3492.txt
As for the example that you have provided in your question (www.olé.com), the domain name that would be resolved is not www.ol%E9.com.
If you are getting percentage signs in your domain name, it means that you have URL-encoded the hostname, and that is not correct, at least not for resolving.
For example, it will work correctly to have an a tag that looks like this:
<a href="http://www.olé.com">Click Here</a>
However, the DNS server will not resolve www.ol%C3%A9.com, but rather, the converted domain name as punycode:
Example
www.ol%C3%A9.com
becomes
www.olé.com
which in punycode translates to:
www.xn--ol-cja.com
Web browsers will generally convert uppercase characters to the lowercase version. For example, both www.olé.com and www.olÉ.com translate to the same DNS hostname (www.xn--ol-cja.com), because www.olÉ.com was lowercased to www.olé.com.
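You can reproduce this with Python's built-in "idna" codec (which implements IDNA 2003 with its case-folding step; the third-party idna package implements the newer IDNA 2008):

print("www.olé.com".encode("idna"))  # b'www.xn--ol-cja.com'
print("www.olÉ.com".encode("idna"))  # b'www.xn--ol-cja.com' (case-folded)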
I recommend two tools to check IDN domain names to see what a domain name looks like once it goes through the punycode translation:
Verisign's IDN Conversion Tool: http://mct.verisign-grs.com/
Punycoder (Punycode to Text/Unicode): https://www.punycoder.com/
Verisign's IDN tool is much stricter. Try both tools with www.olÉ.com as the input to see what I mean.
The rules for IDNA (Internationalized Domain Names for Applications) are complicated, but there are two main RFC's that are worth a look at:
Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale: https://www.rfc-editor.org/rfc/rfc5894
The Unicode Code Points and Internationalized Domain Names for Applications: https://www.rfc-editor.org/rfc/rfc5892
RFC 5894, section 3.1.3, specifies that characters may not be allowed if:
The character is an uppercase form or some other form that is
mapped to another character by Unicode case folding.

Why doesn't URI.escape escape single quotes?

Why doesn't URI.escape escape single quotes?
URI.escape("foo'bar\" baz")
=> "foo'bar%22%20baz"
For the same reason it doesn't escape ? or / or :, and so forth. URI.escape() only escapes characters that cannot be used in URLs at all, not characters that have a special meaning.
What you're looking for is CGI.escape():
require "cgi"
CGI.escape("foo'bar\" baz")
=> "foo%27bar%22+baz"
This is an old question, but the answer hasn't been updated in a long time, so I thought I'd update it for others who are having the same problem. The solution I found was to use ERB::Util.url_encode if you have the erb module available. This took care of single quotes, &, and * for me as well.
Note that CGI::escape encodes spaces as plus signs (+) rather than %20, which is only correct inside query strings.
According to the docs, URI.escape(str [, unsafe]) uses a regexp that matches all symbols that must be replaced with codes. By default the method uses REGEXP::UNSAFE. When this argument is a String, it represents a character set.
In your case, to modify URI.escape to escape even the single quotes you can do something like this ...
reserved_characters = /[^a-zA-Z0-9\-\.\_\~]/
URI.escape(YOUR_STRING, reserved_characters)
Explanation: Some info on the spec ...
All parameter names and values are escaped using the [RFC3986]
percent-encoding (%xx) mechanism. Characters not in the unreserved
character set ([RFC3986] section 2.3) MUST be encoded. Characters in
the unreserved character set MUST NOT be encoded. Hexadecimal
characters in encodings MUST be upper case. Text names and values MUST
be encoded as UTF-8 octets before percent-encoding them per [RFC3629].
I know this has been answered, but what I wanted was something slightly different, and I thought I might as well post it: I wanted to keep the "/" in the URL but escape all the other non-standard characters. I did it thus:
#public filename is a *nix filepath,
#like `"/images/isn't/this a /horrible filepath/hello.png"`
public_filename.split("/").collect{|s| ERB::Util.url_encode(s)}.join("/")
=> "/images/isn%27t/this%20a%20/horrible%20filepath/hello.png"
I needed to escape the single quote because I was writing a cache invalidation for AWS CloudFront, which didn't like single quotes and expected them to be escaped. The above should make a URI which is safer than the standard URI.escape but which still looks like a URI (CGI.escape breaks the URI format by escaping "/").
