Escape and download URL using Ruby - ruby

I'm trying to download the HTML content from a URL without success.
Here is the URL:
http://example.com/some_string[value]
When I use RestClient, I get this error:
URI::InvalidURIError: bad URI(is not URI?)
I got some help on the Ruby on Rails IRC channel. The idea is to escape just the end of the URL:
irb> "http://example.com/" + CGI::escape("some_string[value]")
=> "http://example.com/some_string%5Bvalue%5D"
The generated URL does not work; I'm getting a 404. It works in a browser, though.
Anyone knows how to get it to work?

According to the URI RFC:
Other characters are excluded because gateways and other transport
agents are known to sometimes modify such characters, or they are
used as delimiters.
unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
Data corresponding to excluded characters must be escaped in order to
be properly represented within a URI.
Trusting a browser's response or ability to handle a link is risky. Browsers do everything they can to return a page instead of enforcing the standards, so they are not authoritative sources on whether a page or URL is correctly defined.
RestClient's error probably comes from URI, which raised the same error when I tested parsing the URL with URI.
I haven't ever seen a URL using unencoded "[" and "]" characters.
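A minimal Ruby sketch of that advice, assuming the rest-client gem and the placeholder URL from the question: escape the whole URL with URI's parser instead of CGI-escaping a piece of it, so the scheme and slashes survive while the brackets get percent-encoded.
require 'uri'
require 'rest-client'  # gem install rest-client

raw = 'http://example.com/some_string[value]'

# URI::DEFAULT_PARSER.escape percent-encodes only what URI cannot
# parse (here, the square brackets) and leaves :, /, etc. alone.
url = URI::DEFAULT_PARSER.escape(raw)
# => "http://example.com/some_string%5Bvalue%5D"

html = RestClient.get(url)
If the escaped form still returns a 404, the server itself is handling literal brackets in a nonstandard way, and no amount of client-side encoding will fix that.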

Related

UTL_HTTP - needs to stop escaping reserved chars

I've got a REST web service and a PL/SQL package from which I want to call the web service.
The parameters for the call are located within the URI of the WebService.
http://myservice:8080/some/path/action?value1=123456&value2=some chars&value3=aGermanSonderzeichenCalledÄ
As you can see, there are two problems with the URI: first, the whitespace in value2, and second, the special character in value3.
So it is clear to me that the URI has to be encoded into a friendlier format.
The web service expects UTF-8, so the URI is encoded with:
UTL_URL.ESCAPE(url,false,'UTF-8').
This results in the following URI:
http://myservice:8080/some/path/action?value1=123456&value2=some%20chars&value3=aGermanSonderzeichenCalled%C3%84
So far, so good. This encoded URI is passed to UTL_HTTP.BEGIN_REQUEST(url,'GET').
When I execute this request and intercept it with Wireshark, I can see that the URI actually called is:
http://myservice:8080/some/path/action?value1=123456&value2=some%2520chars&value3=aGermanSonderzeichenCalled%25C3%2584
We can see that UTL_HTTP escapes the reserved character '%' to %25, so in my case the whitespace was first converted to %20 and then to %2520.
What I'm looking for is a way to stop UTL_HTTP from escaping the reserved characters.
As an alternative, a way in which UTL_HTTP deals with the whitespace and special character, without me calling UTL_URL, would also work for me.
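The double escape is easy to reproduce outside PL/SQL. A sketch in Ruby, with ERB::Util.url_encode standing in for UTL_URL.ESCAPE, shows how an already-encoded value is mangled by a second pass:
require 'erb'

once  = ERB::Util.url_encode('some chars')  # => "some%20chars"
twice = ERB::Util.url_encode(once)          # => "some%2520chars"
# The second pass treats the '%' of "%20" as data and escapes it to "%25".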

Trouble in passing "=" (equal) symbol in subsequent request - Jmeter

I recently started using JMeter.
My application returns a URL with an encryption value as a response, and that value has to be passed in the next request to get the next page. The encryption value always ends with "=", e.g. "http://mycompany.com/enc=EncRypTedValue=". When the value is passed in the request, the "=" is replaced with its encoded form %3d, e.g. "http://mycompany.com/enc=EncRypTedValue%3d". Since the token has changed, my application does not serve the request.
It took me a while to understand this: unlike other languages and environments, in network standards URIs (URLs) do not use quotes or escape characters to hide special characters.
Instead, a URL needs to be properly encoded by encoding each individual parameter separately in order to build the complete URL. In JavaScript encoding/decoding of the parameters is done with encodeURIComponent() and decodeURIComponent() respectively.
For example, the following:
http://example.com/?p1=hello=hi&p2=three=3
should be encoded using encodeURIComponent() on each parameter to build the following:
http://example.com/?p1=hello%3Dhi&p2=three%3D3
Note that the equal signs used for the parameters p1=... and p2=... remain as-is.
Do not try to encode/decode the whole URL; it won't work. :)
Do not be fooled by what is displayed in a browser address bar: that is only the human-friendly string. The moment you copy it to the clipboard, the browser will encode it.
Hope this helps someone.
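The example above uses JavaScript; for comparison, here is a sketch of the same per-parameter encoding in Ruby using the standard library's URI.encode_www_form:
require 'uri'

# Each value is encoded separately: the '=' inside a value becomes
# %3D, while the '=' separating key and value stays literal.
query = URI.encode_www_form('p1' => 'hello=hi', 'p2' => 'three=3')
# => "p1=hello%3Dhi&p2=three%3D3"
url = "http://example.com/?#{query}"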
Your application has a problem then, because that's the way it should be sent. URL parameters should be encoded as specified in RFC 3986. Browsers can even do it automatically, so that's something that should be fixed in your web app if it is not working.
If data for a URI component would conflict with a reserved character's
purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
What you are experiencing is URL encoding: = is a reserved character in URLs, and you cannot just append it to your URL unencoded. This obviously already happened in your case. On the server side, the URL parameters need to be decoded again; normally, though, that is the job of the container.
Based on your use case, you may wish to consider one of the following approaches:
You can use the Regular Expression Extractor post-processor to capture your response and store it in a JMeter variable. Since variables are Java Unicode strings, you shouldn't experience any problem with extra encoding of your "=" symbol.
JMeter provides __urldecode function which you can utilize to decode your request.
You can pre-process the request with a __BeanShell function or a BeanShell PreProcessor to decode the whole URL with something like:
URLDecoder.decode(vars.get("your_URL_to be decoded"),"encoding");
If you are adding the encryption value to the subsequent request as a request parameter, make sure 'Encode?' is unchecked.
Use quotes for your values. E.g. -Jkey="val=ue"
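For what it's worth, the decode step these answers describe is a one-liner in most languages. A Ruby sketch of the round trip, using the question's example token:
require 'uri'

encoded = 'EncRypTedValue%3d'
URI.decode_www_form_component(encoded)  # => "EncRypTedValue="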

Having special characters in URLs in CodeIgniter

I want to have URLs like
http://www.example.com/(*.)
but CodeIgniter does not allow me to do that. When I try to access such URLs I get a 404 error (even though the requested page exists).
I know I can set allowed characters in URL, but I thought about encoding it. However, if I do something like this:
http://www.example.com/<?php echo rawurlencode(string) ?>
or even:
http://www.example.com/<?php echo rawurlencode(rawurlencode(string)) ?>
I still get the 404. Why is that? '%' is an allowed character, so why won't it work? And what can I do to fix it?
You can allow certain signs through config/config.php and the permitted_uri_chars key.
However, though I'm not fully certain, I do believe these are restricted by default to increase security. As the relevant explanation suggests:
/*
|--------------------------------------------------------------------------
| Allowed URL Characters
|--------------------------------------------------------------------------
|
| This lets you specify with a regular expression which characters are permitted
| within your URLs. When someone tries to submit a URL with disallowed
| characters they will get a warning message.
|
| As a security measure you are STRONGLY encouraged to restrict URLs to
| as few characters as possible. By default only these are allowed: a-z 0-9~%.:_-
|
| Leave blank to allow all characters -- but only if you are insane.
|
| DO NOT CHANGE THIS UNLESS YOU FULLY UNDERSTAND THE REPERCUSSIONS!!
|
*/
For instance, what's neat about the default setting is that it allows few enough characters that you can parse IDs from URLs without risking having them compromised by quotes or similar characters. Of course there's automatic and manual $this->db->escape(), but this just adds another failsafe.
When you pass a URL-encoded string in the URI, CodeIgniter will generate an error if the encoded string contains a /: it tries to parse it as a segment boundary, thus rendering a 404. What you need to use is a query string:
$string = rawurlencode($input);
http://www.example.com/class/method/?string=$string
Then in your method, read it with get():
function method()
{
    $string = $this->input->get('string');
}
In case you want to have slashes / in the URL, use double raw encoding. For example:
$string = rawurlencode(rawurlencode('/sth/sth'));

Sinatra converting backslash to forward slash

The JSON I'm posting to my webserver looks like this:
"qry_when":["date_is_in(\"X:\\Finqueries\\Dates\\earnings files\\earnings.wmt.txt\")"]
but in my sinatra code,
post '/parsequery/*' do
data = params[:captures][0]
data looks like
"qry_when":["date_is_in(/"X:/Finqueries/Dates/earnings files/earnings.wmt.txt/")"]
Because the \" is getting turned into /", when I later call JSON.parse(data), I get a parsing error:
unexpected token at 'X:/Finqueries/Dates/earnings files/earnings.wmt.txt/")"]
Is there any way to get Sinatra not to convert the backslashes to forward slashes?
EDIT: As a solution, I have JavaScript convert every "\" to %5C and the quote characters to their percent-encoded forms before sending the JSON, and it now works in both Chrome and Opera.
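An alternative worth sketching (not what the edit above does): send the JSON in the POST body instead of in the path, so neither the browser's URL handling nor the router ever touches the backslashes. The route name mirrors the question; the rest is illustrative:
require 'sinatra'
require 'json'

# The request body is delivered verbatim, so the \" escapes inside
# the JSON survive and JSON.parse sees the original string.
post '/parsequery' do
  data = JSON.parse(request.body.read)
  data['qry_when'].inspect
end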

Which characters make a URL invalid?

Which characters make a URL invalid?
Are these valid URLs?
example.com/file[/].html
http://example.com/file[/].html
In general URIs as defined by RFC 3986 (see Section 2: Characters) may contain any of the following 84 characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=
Note that this list doesn't state where in the URI these characters may occur.
Any other character needs to be encoded with percent-encoding (%hh). Each part of the URI has further restrictions about which characters need to be represented by a percent-encoded word.
The '[' and ']' in this example are "unwise" characters but still legal. If the '/' inside the brackets is meant to be part of the file name, then it is invalid, since '/' is reserved and should be properly encoded:
http://example.com/file[/].html
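A quick way to see why position matters: every character of that URL is drawn from the 84-character list, yet the URL is still rejected. A Ruby sketch (the character class below transcribes the RFC 3986 list above):
require 'uri'

# Every character of the URL is in the allowed list...
ALLOWED = %r{\A[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]+\z}
url = 'http://example.com/file[/].html'
url.match?(ALLOWED)  # => true

# ...yet the URL is still rejected, because '[' and ']' may not
# appear in the path component.
URI.parse(url)       # raises URI::InvalidURIError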
To add some clarification and directly address the question above, there are several classes of characters that cause problems for URLs and URIs.
There are some characters that are disallowed and should never appear in a URL/URI, reserved characters (described below), and other characters that may cause problems in some cases, but are marked as "unwise" or "unsafe". Explanations for why the characters are restricted are clearly spelled out in RFC-1738 (URLs) and RFC-2396 (URIs). Note the newer RFC-3986 (update to RFC-1738) defines the construction of what characters are allowed in a given context but the older spec offers a simpler and more general description of which characters are not allowed with the following rules.
Excluded US-ASCII Characters disallowed within the URI syntax:
control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
space = <US-ASCII coded character 20 hexadecimal>
delims = "<" | ">" | "#" | "%" | <">
The character "#" is excluded because it is used to delimit a URI from a fragment identifier. The percent character "%" is excluded because it is used for the encoding of escaped characters. In other words, the "#" and "%" are reserved characters that must be used in a specific context.
The following "unwise" characters are allowed but may cause problems:
unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
Characters that are reserved within a query component and/or have special meaning within a URI/URL:
reserved = ";" | "/" | "?" | ":" | "#" | "&" | "=" | "+" | "$" | ","
The "reserved" syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax. Characters in the "reserved" set are not reserved in all contexts. The hostname, for example, can contain an optional username so it could be something like ftp://user#hostname/ where the '#' character has special meaning.
Here is an example of a URL that has invalid and unwise characters (e.g. '$', '[', ']') and should be properly encoded:
http://mw1.google.com/mw-earth-vectordb/kml-samples/gp/seattle/gigapxl/$[level]/r$[y]_c$[x].jpg
Some of the character restrictions for URIs and URLs are programming language-dependent. For example, the '|' (0x7C) character although only marked as "unwise" in the URI spec will throw a URISyntaxException in the Java java.net.URI constructor so a URL like http://api.google.com/q?exp=a|b is not allowed and must be encoded instead as http://api.google.com/q?exp=a%7Cb if using Java with a URI object instance.
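For instance, a Ruby sketch that repairs the Google Earth URL from above by percent-encoding just the offending '$', '[' and ']' characters (format builds each %XX triplet from the character's code):
require 'uri'

url = 'http://mw1.google.com/mw-earth-vectordb/kml-samples/gp/seattle/gigapxl/$[level]/r$[y]_c$[x].jpg'

safe = url.gsub(/[\[\]$]/) { |c| format('%%%02X', c.ord) }
# => ".../gigapxl/%24%5Blevel%5D/r%24%5By%5D_c%24%5Bx%5D.jpg"

URI.parse(safe)  # parses; the unescaped original raises URI::InvalidURIError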
Most of the existing answers here are impractical because they totally ignore the real-world usage of addresses like:
https://en.wikipedia.org/wiki/Möbius_strip or
https://zh.wikipedia.org/wiki/Wikipedia:关于中文维基百科/en.
First, a digression into terminology. What are these addresses? Are they valid URLs?
Historically, the answer was "no". According to RFC 3986, from 2005, such addresses are not URIs (and therefore not URLs, since URLs are a type of URIs). Per the terminology of 2005 IETF standards, we should properly call them IRIs (Internationalized Resource Identifiers), as defined in RFC 3987, which are technically not URIs but can be converted to URIs simply by percent-encoding all non-ASCII characters in the IRI.
Per modern spec, the answer is "yes". The WHATWG Living Standard simply classifies everything that would previously be called "URIs" or "IRIs" as "URLs". This aligns the specced terminology with how normal people who haven't read the spec use the word "URL", which was one of the spec's goals.
What characters are allowed under the WHATWG Living Standard?
Per this newer meaning of "URL", what characters are allowed? In many parts of the URL, such as the query string and path, we're allowed to use arbitrary "URL units", which are
URL code points and percent-encoded bytes.
What are "URL code points"?
The URL code points are ASCII alphanumeric, U+0021 (!), U+0024 ($), U+0026 (&), U+0027 ('), U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+002A (*), U+002B (+), U+002C (,), U+002D (-), U+002E (.), U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+003F (?), U+0040 (@), U+005F (_), U+007E (~), and code points in the range U+00A0 to U+10FFFD, inclusive, excluding surrogates and noncharacters.
(Note that the list of "URL code points" doesn't include %, but that %s are allowed in "URL units" if they're part of a percent-encoding sequence.)
The only place I can spot where the spec permits the use of any character that's not in this set is in the host, where IPv6 addresses are enclosed in [ and ] characters. Everywhere else in the URL, either URL units are allowed or some even more restrictive set of characters.
What characters were allowed under the old RFCs?
For the sake of history, and since it's not explored fully elsewhere in the answers here, let's examine what was allowed under the older pair of specs.
First of all, we have two types of RFC 3986 reserved characters:
:/?#[]@, which are part of the generic syntax for a URI defined in RFC 3986
!$&'()*+,;=, which aren't part of the RFC's generic syntax, but are reserved for use as syntactic components of particular URI schemes. For instance, semicolons and commas are used as part of the syntax of data URIs, and & and = are used as part of the ubiquitous ?foo=bar&qux=baz format in query strings (which isn't specified by RFC 3986).
Any of the reserved characters above can be legally used in a URI without encoding, either to serve their syntactic purpose or just as literal characters in data in some places where such use could not be misinterpreted as the character serving its syntactic purpose. (For example, although / has syntactic meaning in a URL, you can use it unencoded in a query string, because it doesn't have meaning in a query string.)
RFC 3986 also specifies some unreserved characters, which can always be used simply to represent data without any encoding:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~
Finally, the % character itself is allowed for percent-encodings.
That leaves only the following ASCII characters that are forbidden from appearing in a URL:
The control characters (chars 0-1F and 7F), including new line, tab, and carriage return.
The space character plus "<>\^`{|}
Every other character from ASCII can legally feature in a URL.
Then RFC 3987 extends that set of unreserved characters with the following unicode character ranges:
%xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
/ %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
/ %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
/ %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
/ %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
/ %xD0000-DFFFD / %xE1000-EFFFD
These block choices from the old spec seem bizarre and arbitrary given the latest Unicode block definitions; this is probably because the blocks have been added to in the decade since RFC 3987 was written.
Finally, it's perhaps worth noting that simply knowing which characters can legally appear in a URL isn't sufficient to recognise whether some given string is a legal URL or not, since some characters are only legal in particular parts of the URL. For example, the reserved characters [ and ] are legal as part of an IPv6 literal host in a URL like http://[1080::8:800:200C:417A]/foo but aren't legal in any other context, so the OP's example of http://example.com/file[/].html is illegal.
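That host-only exception is easy to demonstrate with Ruby's standard URI library (hostname, which strips the brackets, has been in the stdlib since 1.9.3):
require 'uri'

# Brackets are legal as the delimiter of an IPv6 literal host...
uri = URI.parse('http://[1080::8:800:200C:417A]/foo')
uri.hostname  # => "1080::8:800:200C:417A"

# ...but nowhere else in the URL:
URI.parse('http://example.com/file[/].html')  # raises URI::InvalidURIError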
In your supplementary question you asked if www.example.com/file[/].html is a valid URL.
That URL isn't valid because a URL is a type of URI and a valid URI must have a scheme like http: (see RFC 3986).
If you meant to ask if http://www.example.com/file[/].html is a valid URL then the answer is still no because the square bracket characters aren't valid there.
The square bracket characters are reserved for URLs in this format: http://[2001:db8:85a3::8a2e:370:7334]/foo/bar (i.e. an IPv6 literal instead of a host name)
It's worth reading RFC 3986 carefully if you want to understand the issue fully.
All valid characters that can be used in a URI (a URL is a type of URI) are defined in RFC 3986.
All other characters can be used in a URL provided that they are "URL encoded" first. This involves replacing the invalid character with a specific code, usually the percent symbol (%) followed by a hexadecimal number.
This link, HTML URL Encoding Reference, contains a list of the encodings for invalid characters.
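Percent-encoding is mechanical enough to do by hand. A small Ruby sketch (percent_encode is an illustrative helper, not a library function) that builds the %hh codes, one per UTF-8 byte:
# Each byte of the character's UTF-8 encoding becomes a %XX triplet.
def percent_encode(char)
  char.bytes.map { |b| format('%%%02X', b) }.join
end

percent_encode(' ')  # => "%20"
percent_encode('ü')  # => "%C3%BC"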
Several Unicode character ranges are valid in HTML5, although it might still not be a good idea to use them.
E.g., the href docs (http://www.w3.org/TR/html5/links.html#attr-hyperlink-href) say:
The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.
Then the definition of "valid URL" points to http://url.spec.whatwg.org/, which says it aims to:
Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process.
That document defines URL code points as:
ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.
The term "URL code points" is then used in the statement:
If c is not a URL code point and not "%", parse error.
in several parts of the parsing algorithm, including the scheme, authority, relative path, query and fragment states: so basically the entire URL.
Also, the validator http://validator.w3.org/ passes for URLs like "你好", and does not pass for URLs with characters like spaces, as in "a b".
Of course, as mentioned by Stephen C, it is not just about characters but also about context: you have to understand the entire algorithm. But since the "URL code points" class is used at key points of the algorithm, it gives a good idea of what you can and cannot use.
See also: Unicode characters in URLs
I needed to select characters to split URLs in a string, so I decided to create a list of characters that could not be found in a URL:
>>> allowed = "-_.~!*'();:@&=+$,/?%#[]ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
>>> from string import printable
>>> ''.join(set(printable).difference(set(allowed)))
'`" <\x0b\n\r\x0c\\\t{^}|>'
So, the possible choices are the newline, tab, space, backslash and "<>{}^|. I guess I'll go with the space or newline. :)
I am implementing an old HTTP (0.9, 1.0, 1.1) request and response reader/writer. The request URI is the most problematic place.
You can't just use RFC 1738, 2396 or 3986 as-is. There are many old HTTP clients and servers that allow more characters, so I did some research based on accidentally published web server access logs ("GET URI HTTP/1.0" 200).
I've found that the following non-standard characters are often used in URIs:
\ { } < > | ` ^ "
These characters were described in RFC 1738 as unsafe.
If you want to be compatible with all old HTTP clients and servers, you have to allow these characters in the request URI.
Please read more information about this research in oghttp-request-collector.
This is not really an answer to your question, but validating URLs is really a serious p.i.t.a. You're probably better off just validating the domain name and leaving the query part of the URL alone. That is my experience.
You could also resort to pinging the URL and seeing if it results in a valid response, but that might be too much for such a simple task.
Regular expressions to detect URLs are abundant, google it :)
I can't comment on the above answers, but wanted to emphasize the point (in another answer) that allowed characters aren't allowed everywhere. For example, domain names can't have underscores, so http://test_url.com is invalid.
From the source (emphasis added when needed):
Unsafe:
Characters can be unsafe for a number of reasons. The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs.
The characters "<" and ">" are unsafe because they are used as the
delimiters around URLs in free text; the quote mark (""") is used to
delimit URLs in some systems. The character "#" is unsafe and should
always be encoded because it is used in World Wide Web and in other
systems to delimit a URL from a fragment/anchor identifier that might
follow it. The character "%" is unsafe because it is used for
encodings of other characters. Other characters are unsafe because
gateways and other transport agents are known to sometimes modify such
characters. These characters are "{", "}", "|", "\", "^", "~", "[",
"]", and "`".
All unsafe characters must always be encoded within a URL. For
example, the character "#" must be encoded within URLs even in systems
that do not normally deal with fragment or anchor identifiers, so that
if the URL is copied into another system that does use them, it will
not be necessary to change the URL encoding.
Source
I came up with a couple of regular expressions for PHP that will convert URLs in text to anchor tags. (First it converts all www. URLs to http://, and then it converts all URLs with https?:// into <a href="...">...</a> HTML links.)
$string = preg_replace('/(https?:\/\/)([!#$&-;=?\-\[\]_a-z~%]+)/sim', '<a href="$1$2">$1$2</a>', preg_replace('/(\s)((www\.)([!#$&-;=?\-\[\]_a-z~%]+))/sim', '$1http://$2', $string));
