ISO-8859-1 characters treated as UTF-8 in XSLT attributes - utf-8

The ¬ character (0xAC in ISO-8859-1) works for normal text if I ensure that ISO-8859-1 is always used as the encoding throughout. However, when using it in attributes it is escaped to: %C2%AC. I understand that it needs to be escaped for urls, but not why it escapes it in the same way as it would for UTF-8, rather than just %AC as I'd expect it to for ISO-8859-1.
Since the escapes are in the output html file the only conclusion is that the xslt processor is the cause.
Example:
input.xml
stylesheet.xslt
makefile
Which for me generates:
output.html
Output was generated using xsltproc, compiled against libxml 20707, libxslt 10126 and libexslt 815. This was on #! Linux (amd64). I have also tried: xmlstarlet tr (also uses libxml), xalan and google chrome (by adding an <?xml-stylesheet ... >, see input_ss.xml tag) with the same result.
Opera doesn't escape it at all, and it allows ¬ to be used literally in the url and attribute.
Is this standard behaviour for xslt or is this a bug in the way the attributes are escaped? And either way, is there a solution other than replacing %C2%AC with %AC bearing in mind it is almost certainly the same for other characters that are valid ISO-8859-1 and invalid in UTF-8.

There are 3 different text-based technologies in use here, XML, HTML and URIs.
All of these have escape mechanisms - that is to say, ways to use text to indicate other text that it is impossible or difficult to indicate in a given context.
The not-sign character ¬ (U+00AC) could be escaped in the first two as ¬ or ¬ perhaps with some leading zeros, in both XML and HTML (¬ would also work in HTML). This escape would be used no matter what encoding the XML or HTML was in, because it relates to the character ¬, not to its set of octets in a given character encoding - indeed, we would generally only use it in the case where there was no such set of octets in the encoding being used.
In this case, this is unnecessary, since the output is in a character encoding in which there is no need to escape it, and so in the source you can see The ¬ character unescaped.
This HTML includes the text of a URI. The encoding of the HTML has nothing to do with this, because the encoding is how we get the text of the HTML from one machine to another, but when the HTML is being parsed to read this URI we're past that point and are dealing with some text at the level of text - that is to say, it doesn't have an encoding any more.
Now, URIs have their own escape mechanisms. This must be used in the case of ¬, as it is not a character allowed in URIs (as opposed to IRIs). Sadly, unlike the escapes in XML and HTML, these escapes are based on octets in a given encoding rather than the code-point of the character itself.
It's easy to see this as a mistake now, but URIs were specified in 1994 and that formalised work going back to 1989/1990 while Unicode 1.0 was released in 1991 and didn't have the ground-breaking 2.0 until 1996, so hindsight has considerably more benefits than URI's inventors. (HTML had the same problem many years ago, but the format of its encodings made it much easier to fix this without as many backwards-compatibility issues).
So, what encoding should we use for those octets? The original specs left this undefined, but really the only possible choice is UTF-8. It's the only encoding that gives those escapes commonly used for chracters special to URIs their escapes in the range 0x20 - 0x7F while also covering all of the UCS.
There's also no way to indicate another choice could be more appropriate. Remember, we're working at the level of text, so your use of ISO-8859-1 is completely irrelevant. Even if we kept track of the encoding while parsing the HTML, the URI is going to be made use of in a way that is nothing to do with the document, so we still couldn't use it. In all, if we have to make use of an octet-based encoding, and we have to keep characters in the ASCII range matching the octets they'd have in ASCII, the only possible basis for the encoding is UTF-8.
For that reason, the escape in any URI for ¬ must always be %C2%AC.
There can be some legacy systems that expect URIs to use other encodings, but the solution is to fix the bit that's broken, not the bit that works, so if something expects ¬ to be %AC then catch it close to that by converting %C2%AC close to its use (and if it outputs %AC itself then of course you'll need to fix it to %C2%AC before it hits the outside world).

The XSLT spec says that when serializing URI-valued attributes, all non-ASCII characters are escaped using the %HH-escaping of the UTF-8 octets that represent the character. Although %HH-escaping of other encodings has been used in the past, it is no longer used today. This is quite independent of the encoding of the document itself.

Related

Find and replace non utf8 character

I have a process that inserts data into PDFs that eventually loads into a system that gets searched based on that inserted data. The inserted data looks something like:
<<
/IBM-ODIndexes
<< /Private
<<
/DOB (05031983)
/FULL_NAME (TEST USER)
/YEAR (2020)
>>
/LastModified(D:20210112201530)
>>
However, there are instances where the data in the FULL_NAME field contains non UTF8 characters and then users are unable to search the data. Specifically apostrophes come over from Microsoft Word and then gets interpreted like this:
/FULL_NAME (JERRY OÃ<83>¢ââ<80><9a>‰â<80><9e>¢CONNELL)
In this case I am looking to strip out the apostrophe that is represented as Ã<83>¢ââ<80><9a>‰â<80><9e>¢ and replace it with a white space.
There are several complexities here, but in general I would say that the only reliable way to deal with it is to figure out the text encoding of the incoming document and converting it to the target encoding.
Ã<83>¢ââ<80><9a>‰â<80><9e>¢ is 34 characters (that is, at least 34 bytes), and no single encoding ever used that much space for a single character. What’s probably happening is multiple levels of encoding, such as HTML entities, base64, UTF-8/16/32 or escape characters like %% to represent % in SQL or \\ to represent \ in Bash. Reversing all these levels of encoding manually is going to involve quite a lot of reading the huge docx standard. The simpler alternative is to use a library which can just convert the entire text into a known character encoding for you, at which point you have to do at most a single conversion into UTF-8.
Another argument for this is that the “apostrophe string” does contain otherwise harmless characters like “a” and “e”. Without at least some understanding of the encodings you’re unlikely to be able to separate encoded characters from non-encoded ones, which would make the resulting text full of invalid text.

Can I treat all domain names as being IDNs without any ill effects?

From testing, it seems like trying to convert both IDNs and regular domain names 'just works' - eg, if the input doesn't need to be changed punycode will just return the input.
punycode.toASCII('lancôme.com');
returns:
'xn--lancme-lxa.com'
And
punycode.toASCII('apple.com');
returns:
'apple.com'
This looks great, but is it specified anywhere? Can I safely convert everything to punycode?
That is correct. If you look at how the procedure for converting unicode strings to ascii punycode, the process only alters any non-ascii character. Since regular domains cannot contain non-ascii characters, if your conversor is correctly implemented, it will never transform any pure-ascii string.
You can read more about how unicode is converted to punycode here: https://en.wikipedia.org/wiki/Punycode
Punycode is specified in RFC 3492: https://www.ietf.org/rfc/rfc3492.txt, and it clearly says:
"Basic code point segregation" is a very simple and
efficient encoding for basic code points occurring in the extended
string: they are simply copied all at once.
Therefore, if your extended string is made of basic code points, it will just be copied without change.

German Umlaut displayed wrong despite correct Charset

I am encountering a weird problem regarding the encoding of my files.
I have a site which is multilingual; Users can set this viá a dropdown on the site itself, the default value being German.
When the user logs in, some settings are being set depending on the language (charset, codepage and LCID). At this point I also want to point out, that all my files are ANSI-encoded.
Recently, I had to make some changes.
So I fire up Visual Studio 2010, edit the files in question and upload them to my server using Filezilla.
And now, all of a sudden, the German umlauts (Ää, Öö, Üü, ß) are being displayed incorrectly (something like ä) - but only on the files I opened with VS2010.
I checked the charset on the site itself and also displaying it with Response.CharSet and it was ISO-8859-1, which is correct.
So I tried some converting with notepad++, but no success.
I know that setting the charset to UTF-8 would solve this problem, but a) the charset is set from a database-value and b) it kind of messes things up in other languages.
You are displaying a utf-8 encoded file with a iso-8859-1 view. Usually you want to see just one character, but why do you see two instead of one? This is because in utf-8 a german small 'a' letter with 'two dots' is a 2-byte sequence with utf-8 (0xC3 and 0xA4). If this gets NOT displayed as utf-8 but as iso-8859-1 encoding - which means one byte one character - you'll get that what you have mentioned. You'll get the startbyte 0xC3 as a single iso-8859-1 character and the following byte 0xA4 as as a single iso-8859-1 character. In utf-8 this 2-byte sequence must become decoded by extracting the payload bits of the startbyte and the following byte like this:
Startbyte: 11000011
Following: 10100100
So 110 of the startbyte must get stripped off, so 11 is left.
So 10 of the following byte must get stripped off, so 100100 is left.
Chained together this becomes 11100100 which is decimal 228 which should be equal to the german character 'a with two dots' unicode codepoint.
I recommend to let the encoding as it is, utf-8. It is just the encoding of your viewer/editor that should display utf-8 encoded files as utf-8 and not as iso-8859-1. Configure your viewer/editor with utf-8. In other words, configure the viewer's/editor's encoding according to the encoding of the file's content (which is in your case utf-8 and NOT iso-8859-1).
To convert your files or check them for a certain encoding, just use madedit. madedit has a built-in hex-editor which wraps a rectangle around utf-8 sequences, displaying just one character on the right side (the encoded codepoint). It's easy to identify single-byte characters and/or 2/3/4-byte sequences within utf-8 encoded files. It also wraps a rectangle around the 3-byte utf-8 BOM (if any).
Encoding problems have several failure points:
Check template file encoding
Check response encoding
Check database encoding
Check that they are coherent to what you want to output.
Also note that Notepad++ has a "Encode as..." and a "Convert to..."
1st one reads file as encoding specified and 2nd reads file and writes it back to selected encoding (changing file)

How to force Firefox addon charset/encoding?

I'm having a trouble with a mobile addon: it shows me the new elements added by scripting with a different charset of the page. E.g. I can read "cuadrúpedo" but the same word in my plugin show "cuadr¡pedo".
I tryed writing the next line to the beginning of my addon, but it didn't work:
document.getElementsByTagName("html")[0].setAttribute("lang", "es");
Then, I wrote a "converter function" which replaces the special characters with unicode, like the next line, but it didn't work.
str.replace( /ú/g, "/xfa־" );
What can I do?
Probably it's a matter of text encoding.
Make sure the file that contains the literal "cuadrúpedo" is saved as utf-8, not ansi.
Keep in mind that a few key files must be ansi encoded. These are install.rdf, chrome.manifest and bootstrap.js. In this case use unicode escapes, "cuadr\u00fapedo".
When the JavaScript file is loaded (in Gecko 1.8 and later) from a chrome:// URL, a Byte Order Mark is used to determine the character encoding of the script. Otherwise, the character encoding will be the same as the one used by the XUL file. So, one solution is the HTTP header can contain a character encoding declaration as part of the Content-Type header, for example:
Content-Type: application/javascript; charset=UTF-8
For cross version compatibility you must limit yourself to ASCII. However, you can use unicode escapes – the earlier example rewritten using them would be:
var text = "Ein sch\u00F6nes Beispiel eines mehrsprachigen Textes: \u65E5\u672C\u8A9E";
JavaScript and Navigator support for UTF-8/Unicode means you can use non-Latin, international, and localized characters, plus special technical symbols in JavaScript programs. Unicode provides a standard way to encode multilingual text: since the UTF-8 encoding of Unicode is compatible with ASCII, programs can use ASCII characters. To receive non-ASCII character input, the client needs to send the input as Unicode.
There is a webpage for text escaping and unescaping in Javascript:
http://0xcc.net/jsescape/
Sources:
https://developer.mozilla.org/en-US/docs/International_characters_in_XUL_JavaScript
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Values,_variables,_and_literals#Unicode

Allowed characters in submit forms (including UTF-8)

Suppose I allow my users to submit a form containing some text fields (I'm not talking about passwords). My users would occasionally use non-ASCII characters like Russian, Chinese, etc. so I use UTF-8 charsets in my database. The question is, should I really allow all of the possible UTF-8 characters? I had a look at the ASCII table and saw that characters 0 to 31 have nothing to do with text, except for newlines and white spaces. Characters 176 to 223 seem to be for decorative purposes :p. Should I restrict them?
The W3C skips these characters in their example regular expression in Multilingual form encoding:
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
Make sure it is valid UTF-8 and Unicode? Yes
Make sure it does not include certain characters, such as control codes? Probably not necessary
You should be aware that even though you are using UTF-8 in your form, you may not get valid UTF-8 from all user-agents when they send form data to you, and you will have to filter it as necessary. Invalid UTF-8 can take many forms, some of them being
Overlong encodings (which can lead to security issues)
Other invalid UTF-8 byte sequences, which may indicate that the user-agent ignored the character encoding and has submitted something like Windows-1252 or ISO-8859-1 encoding instead.
Code points that lie in reserved surrogate space in Unicode
All the above need to be filtered out during input, otherwise you are not storing valid Unicode.
If you want to serve valid HTML or XHTML, which use a subset of Unicode, you will need also need to filter out (either at input or output):
C0 control codes 0x00 to 0x19 (apart from tab, space, new line, carraige return)
0x7F
C1 control codes 0x80 to 0xBF
(probably) any code point above 0x10FFFF
No.
It's a very bad idea to try to "pre-clean" user input. What you consider "decorative" might be absolutely necessary to readers of another language. The best solution is to store the text as-is in the database, and then sanitize it before writing to the page.
When you say "the ASCII table" you're talking about this page, aren't you? That page is garbage. Only the first 128 characters (ie, 0..127) are "ASCII"; the mappings they show for the numbers 128..255 are from an ASCII extension called cp437. There are a lot of "extended ASCII's" out there, and cp437 is far from the most common one.
But I digress. Your question isn't about character encodings, it's about filtering, and a filter should be based on the properties of the characters: is it a letter, a digit, a control character? Most modern programming languages provide methods or functions to obtain such information, and most provide regex support as well. As for what you should filter, or whether you should filter at all, only you can know that.
It sounds like you need to learn more about character encodings and Unicode, though. Start here.

Resources