RSS + Foreign characters -> Validation error - utf-8

I have an RSS feed and one of the links has an 'ó' in it which causes validation issue. Is it not possible to include these characters in the URI? Is it bad practice to have a character like this in the URI anyway?
Feed us UTF-8.

Unicode characters are legal in URI's. Now you can also use them in domain names. So you have error in your validation mechanism.

Related

How to Escape Double Quotes from Ruby Page Object text

In using the Page Object gem, I'm trying to pull text from a page to verify error messages. One of these error messages contains double-quotes, but when the page object pulls the text from the page, it pulls some other characters.
expected ["Please select a category other than the Default â?oEMSâ?? before saving."]
to include "Please select a category other than the Default \"EMS\" before saving."
(RSpec::Expectations::ExpectationNotMetError)
I'm not quite sure how to escape these - I'm not sure where I could use Regexs and be able to escape these odd characters.
Honestly you are over complicating your validation.
I would recommend simplifying what you are trying to do, start by asking yourself: Is the part in quotes a critical part of your validation?
If it is, isolate it by doing a String.contains("EMS")
If it is not, then you are probably doing too much work, only check for exactly what you need in validation:
String.beginsWith("Please select a category other than the Default")
With respect to the actual issue you are having, on a technical level you have an encoding issue. Encode your result string with utf-8 before you pass it to your validation and you will be fine.
Good luck
It's pretty likely that somewhere along the line encoded the string improperly. (A tipoff is the accented characters followed by ?.) It seems pretty likely that the quotes were converted to "smart quotes" somewhere. This table compares Window-1252 to UTF-8:
Code Point Characters UTF-8 Bytes
Unicode Windows
1252 Expected Actual
------ ---- - --- -----------
U+201C 0x93 “ “ %E2 %80 %9C
U+201D 0x94 ” †%E2 %80 %9D
What you'll want to do is spot check various places in the code to find the first place the string is encoded in something other than UTF-8:
puts error_str.encoding
(For clarity, error_str is the variable that holds the string you are testing. I'm using puts, but you might want have another way to log diagnostic messages.)
Once you find the string that's not encoded UTF-8, you can convert it:
error_str.encode('UTF-8')
Or, if the string is hardcoded somewhere, just replace the string.
For more debugging advice, see: 3 Steps to Fix Encoding Problems in Ruby and How to Get From They’re to They’re.

How to handle the javascript decoding error?

When the code is implemented, some characters cannot be decoded. I am getting a bunch of question marks like ??. How can I fix this?
HtmlInput inputBox2 = (HtmlInput)currentPage.getHtmlElementById("classNo");
inputBox2.setValueAttribute("2016同學15");
ScriptResult result = currentPage.executeJavaScript("javascript:Search(2)");
I found this in the compiler: ScriptResult[result=net.sourceforge.htmlunit.corejs.javascript.Undefined#24d7aac3 page=HtmlPage(http://www.xx.org/classNo=2016??15)#1330510442]
You might try to use URL-encoding for some ASCII and all non ASCII characters.
e.g. space by %20
Here is a web site explaning the
HTML URL Encoding Reference.
You can also interactive encode strings there.
Your "2016同學15" would be encoded as:
"2016%E5%90%8C%E5%AD%B815"

UPS/FedEx shipment request special characters

Recently I've been working on implementation of Label generation for FedEx and UPS couriers using they external service. I have a problem with special characters printed on label. Within response I'm getting correct text but on Label all special chars are replaced by dummy signs. According UPS&FedEx docs they perfectly supports such characters on labels till they are passed as UTF-8 and encoding node in xml is present (pointing to UTF-8).
Did anyone faced similar problem? Maybe there is an official note from them that they'r not supporting such case that I'm not aware of.
UPS and FedEx APIs supports only Latin-1 chars. Dummy chars were assigned by auto utf-8 cast in one of internal methods (dicttoxml) that results in double UTF-8 encoding.

What constitutes a valid URI query parameter key?

I'm looking over Section 3.4 of RFC 3986 trying to understand what constitutes a valid URI query parameter key, but I'm not seeing a clear answer.
The reason I'm asking is because I'm writing a Ruby class that composes a URI with query parameters. When a new parameter is added I want to validate the key. Based on experience, it seems like the key will be invalid if it requires any escaping.
I should also say that I plan to validate the key. I'm not sure how to go about validating this data either, but I do know that in all cases I should escape this value.
Advice is appreciated. Advice in the context of how validation might already be possible through say a Ruby Gem would also be a plus.
I could well be wrong, but that spec seems to say that anything following '?' or '#' is valid as long. I wonder if you should be looking more at the spec for 'application/x-www-form-urlencoded' (ie. the key/value pairs we're all used to)?
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
This is the default content type. Forms submitted with this content
type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by +', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by %HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by =' and name/value pairs are separated from each other by &'.
I don't believe key=value is part of the RFC, it's a convention that has emerged. Wikipedia suggests this is an 'W3C recommendation'.
Seems like some good stuff to be found searching on the application/x-www-form-urlencoded content type.
http://www.w3.org/TR/REC-html40/interact/forms.html#form-data-set

Disabling visually ambiguous characters in Google URL-shortener output

Is there a way to say (programmatically, I mean their API) the Google URL shortener not to produce short URL with characters like:
0 O
1 l
Because people often make mistake when reading those characters from displays and typing them elsewhere.
You cannot request the API to use a custom charset, so no.
Not a proper solution, but you could check the url for unwanted characters and request another short URL for the same long URL until you get one you like. Google URL shortner issues a unique short URL for an already shortned URL if you provide an OAuth token with the request. However I am not sure if a user is limited to one unique short URL per a specific long URL in which case this won't work either.
Since you're doing it programmatically, you could swap out those chars for their ascii value, '%6F' for the letter o, for instance. In this case, just warn the users that in doubt, it's a numeral.
Alternatively, use a font that distinguishes ambiguous chars, or better yet, color-code them (or underline numerals, or whatever visual mark)

Resources