babel:octets-to-string throws INVALID-UTF8-CONTINUATION-BYTE

I'm writing a Lisp program to fetch a page from a Chinese website, and I've run into a problem parsing the Chinese text out of the binary stream. I already have a vector of (unsigned-byte 8) containing the whole page, but when I pass it to babel:octets-to-string, it throws an exception.
(setf buffer (babel:octets-to-string buffer :encoding :utf-8))
The exception is:
Illegal :UTF-8 character starting at position 437. [Condition of
type BABEL-ENCODINGS:INVALID-UTF8-CONTINUATION-BYTE]
I found that it throws this exception whenever it encounters a Chinese character. How can I solve it?

The error message says everything - there is an invalid UTF-8 byte sequence in your data.
The most probable cause for this error is that the page text itself is not encoded in UTF-8 but in some other encoding used for Chinese text. You should check the HTML 'META HTTP-EQUIV' tag and the 'Content-Type' HTTP response header for the encoding.
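A common culprit for Chinese sites is GBK/GB2312 or Big5 rather than UTF-8. As a rough sketch of the idea in Python (not the original Lisp; the URL is a placeholder), honour the declared charset instead of assuming UTF-8:
import re
from urllib.request import urlopen

with urlopen('http://example.cn/') as resp:        # placeholder URL
    raw = resp.read()
    charset = resp.headers.get_content_charset()   # from the Content-Type header

if charset is None:
    # Fall back to the <meta> charset declaration inside the page itself.
    m = re.search(rb'charset=["\']?([-\w]+)', raw[:2048])
    charset = m.group(1).decode('ascii') if m else 'utf-8'

text = raw.decode(charset, errors='replace')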

Related

Oracle mysterious Unicode codepoint

While calling XMLTYPE() on a CLOB column which should contain a valid XML 1.0 document (the DB encoding should be UTF-8), the following error message comes out (I am from Italy):
ORA-31011: XML parsing failed
ORA-19202: Error occurred in XML processing
LPX-00217: invalid character 15577023 (U+EDAFBF)
Error at line 240
ORA-06512: at "SYS.XMLTYPE", line 272
ORA-06512: at line 1
31011. 00000 - "XML parsing failed"
*Cause: XML parser returned an error while trying to parse the document.
*Action: Check if the document to be parsed is valid.
Now, this invalid character is given as Unicode codepoint EDAFBF. The problem is that according to the Unicode spec (Wikipedia), there are no codepoints beyond U+10FFFF. So what could this error mean?
Inspecting this CLOB with SQL Developer (and copying it into Notepad++ with the encoding set to UTF-8) does not reveal anything unusual beyond some strange characters which apparently came from the user's browser when he copied text from a Microsoft Word document (but the CLOB, at least as copied from the SQL Developer UI and displayed by Notepad++ with UTF-8 encoding, seems to be valid UTF-8 text).
Is there a way to reproduce this error by populating Oracle directly (from SQL Developer or in some other way)? (Contacting the end user to find out exactly what he put in the web form is problematic.)
Not addressing the first part of the question, but you can reproduce it with a RAW value:
select xmltype('<dummy>'
|| utl_raw.cast_to_varchar2(cast('EDAFBF' as raw(6)))
|| '</dummy>')
from dual;
Error report -
SQL Error: ORA-31011: XML parsing failed
ORA-19202: Error occurred in XML processing
LPX-00217: invalid character 15577023 (U+EDAFBF)
Error at line 1
ORA-06512: at "SYS.XMLTYPE", line 310
ORA-06512: at line 1
Just selecting the character:
select utl_raw.cast_to_varchar2(cast('EDAFBF' as raw(6)))
from dual;
... is displayed as a small square with an even smaller question mark inside it (I think) in SQL Developer for me (version 4.1), but that's just how it's choosing to render that; copying and pasting still gives the replacement character � since the codepoint is, as you say, invalid. XMLType is being stricter about the validity than CLOB. The unistr() function doesn't handle the value either, which isn't really a surprise.
(You don't need to cast the string to raw(6), just utl_raw.cast_to_varchar2('EDAFBF') has the same effect; but doing it explicitly makes it a bit clearer what's going on, I think).
I don't see how that could have got into your file without some kind of corruption, possibly through a botched character set conversion I suppose. You could maybe use dbms_lob.replace_fragment() or similar to replace or remove that character, but of course there may be others you haven't hit yet, and at best you'd only be treating the symptoms rather than the cause.
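As an aside (my addition, checked with Python rather than Oracle): 15577023 is just 0xEDAFBF read as a number, i.e. the raw bytes, and those three bytes are the UTF-8-style encoding of U+DBFF, a UTF-16 high surrogate, which strict UTF-8 forbids. That would explain both the impossible "codepoint" and the parser's rejection, and a surrogate leaking through like this fits the botched-conversion theory (e.g. UTF-16 text converted to UTF-8 code-unit by code-unit):
data = bytes.fromhex('EDAFBF')
try:
    data.decode('utf-8')                # strict UTF-8 rejects encoded surrogates
except UnicodeDecodeError as e:
    print(e)
# Extracting the payload bits by hand shows what the sequence would encode:
cp = ((data[0] & 0x0F) << 12) | ((data[1] & 0x3F) << 6) | (data[2] & 0x3F)
print(hex(cp))                          # 0xdbff: a high surrogate, not a valid scalar value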

W3C unable to validate

Sorry, I am unable to validate this document because on line 1200 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: utf8 "\xD8" does not map to Unicode
I would be thankful to know what exactly I should do. My website is: http://dailysahara.com/
The issue, as stated by the validator, is that you have some invalid UTF-8 in your document. It appears to be in the box on the left of the site with the four tabs "Tags", "Comments", "Recents", and "Popular". It shows up to me as a black square like this: �. If you remove that, you should be able to validate your site.
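If it helps, here is a minimal Python sketch (mine, not the validator's) that reports every byte offset failing strict UTF-8 decoding, so you can track down stray bytes like 0xD8 in the saved HTML:
def find_invalid_utf8(path):
    data = open(path, 'rb').read()
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode('utf-8')
            break                                # the rest decodes cleanly
        except UnicodeDecodeError as e:
            off = pos + e.start                  # error offsets are slice-relative
            print('bad byte 0x%02X at offset %d' % (data[off], off))
            pos = off + 1                        # skip the bad byte, keep scanning

find_invalid_utf8('index.html')                  # hypothetical saved copy of the page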

German umlauts displayed wrong despite correct charset

I am encountering a weird problem regarding the encoding of my files.
I have a site which is multilingual; users can set the language via a dropdown on the site itself, the default being German.
When the user logs in, some settings are set depending on the language (charset, codepage and LCID). I should also point out that all my files are ANSI-encoded.
Recently, I had to make some changes.
So I fire up Visual Studio 2010, edit the files in question and upload them to my server using FileZilla.
And now, all of a sudden, the German umlauts (Ää, Öö, Üü, ß) are displayed incorrectly (something like Ã¤), but only in the files I opened with VS2010.
I checked the charset on the site itself, and also by displaying it with Response.CharSet, and it was ISO-8859-1, which is correct.
So I tried some converting with Notepad++, but with no success.
I know that setting the charset to UTF-8 would solve this problem, but (a) the charset is set from a database value and (b) it kind of messes things up in other languages.
You are displaying a UTF-8 encoded file with an ISO-8859-1 view. Usually you want to see just one character, so why do you see two instead of one? Because in UTF-8 the German small 'a' with two dots (ä) is a two-byte sequence (0xC3 0xA4). If this is displayed not as UTF-8 but as ISO-8859-1, where one byte is one character, you get exactly what you described: the start byte 0xC3 as one ISO-8859-1 character and the following byte 0xA4 as another. In UTF-8 this two-byte sequence must be decoded by extracting the payload bits of the start byte and the following byte, like this:
Startbyte: 11000011
Following: 10100100
Strip the 110 marker from the start byte, leaving the payload bits 00011.
Strip the 10 marker from the following byte, leaving 100100.
Chained together this becomes 11100100, which is decimal 228, the Unicode codepoint of the German 'a with two dots' (U+00E4).
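A quick check of the same arithmetic in Python (my illustration, using only the standard codecs):
utf8_bytes = 'ä'.encode('utf-8')            # b'\xc3\xa4'
print(utf8_bytes.decode('iso-8859-1'))      # 'Ã¤': each byte shown as its own character
print(utf8_bytes.decode('utf-8'))           # 'ä': both bytes decoded as one codepoint
print(ord('ä'))                             # 228, matching the bit arithmetic above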
I recommend leaving the encoding as it is: UTF-8. It is the viewer/editor that should display UTF-8 encoded files as UTF-8 rather than as ISO-8859-1. Configure your viewer/editor for UTF-8; in other words, configure the viewer's/editor's encoding to match the encoding of the file's content (which in your case is UTF-8, NOT ISO-8859-1).
To convert your files or check them for a certain encoding, just use MadEdit. MadEdit has a built-in hex editor which draws a rectangle around each UTF-8 sequence, displaying just one character on the right side (the decoded codepoint). This makes it easy to identify single-byte characters and 2/3/4-byte sequences within UTF-8 encoded files. It also draws a rectangle around the 3-byte UTF-8 BOM (if any).
Encoding problems have several failure points:
Check template file encoding
Check response encoding
Check database encoding
Check that they are consistent with what you want to output.
Also note that Notepad++ has an "Encode as..." and a "Convert to..." option: the first reinterprets the file in the specified encoding, while the second reads the file and writes it back in the selected encoding (changing the file).

Charset in data URI

Over the years from reading the evolving specs I had assumed that RFC 3986 had finally settled on UTF-8 encoding for escape octet sequences. That is, if my URI has %XX%YY%ZZ I can take that sequence of decoded octets (for any URI in the scheme-specific part) and interpret the resulting bytes as UTF-8 to find out what decoded information was intended. In practical terms, I can call JavaScript decodeURIComponent() which does this decoding automatically for me.
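(As a quick illustration of that assumption, not anything from the spec itself, Python's urllib.parse.unquote() behaves the same way as decodeURIComponent(), treating the decoded octets as UTF-8:)
from urllib.parse import unquote
print(unquote('%C3%A4'))   # 'ä': the two octets are interpreted as one UTF-8 sequence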
Then I read the spec for data: URIs, RFC 2397, which includes a charset argument, which (naturally) indicates the charset of the encoded data. But how does that work? If I have a two-octet encoded sequence %XX%YY in my data: URI, does a charset=iso-8859-1 indicate that the two decoded octets should not be interpreted as a UTF-8 sequence, but as two separate Latin characters (since each byte in ISO-8859-1 represents a character)? RFC 2397 seems to indicate this, as it gives an example of "greek [sic] characters":
data:text/plain;charset=iso-8859-7,%be%fg%be
But this means that JavaScript decodeURIComponent() (which assumes UTF-8 encoded octets) can't be used to extract a string from a data URI, correct? Does this mean I have to create my own decoding for data URIs if the charset is something besides UTF-8?
Furthermore, does this mean that RFC 2397 is now in conflict with RFC 3986, which seems to indicate that UTF-8 is assumed? Or does RFC 3986 only refer to "new URI scheme[s]", meaning that the data: URI scheme gets grandfathered in and has its own technique for specifying what the encoded octets mean?
My best guess at the moment is that data: plays by its own rules and if it indicates a charset other than UTF-8, I'll have to use something other than decodeURIComponent() in JavaScript. Any recommendations on a replacement method would be welcome, too.
Remember that the data: URI scheme describes a resource that can be thought of as a file which consists of an opaque bytestream, just as though it were an http: URI (the same bytestream, but stored on an HTTP server), an ftp: URI (the same bytestream, but stored on an FTP server) or a file: URI (the same bytestream, but stored on your local filesystem). Only the metadata attached to the file gives the bytestream meaning.
RFC 2397 gives a clear specification of how this bytestream is to be embedded in the URI itself (in contrast to other URI schemes, where the URI gives instructions on where to fetch the bytestream, not what it contains). It might be base64 or it might be the percent-encoding method given in the RFC. Base64 is going to be more compact if the bytestream contains many non-ASCII bytes.
The data: URI also describes its own Content-Type, which gives the intended interpretation of the bytestream. In this case, since you have used text/plain;charset=iso-8859-7, the bytes must be correctly encoded ISO-8859-7 text. The bytes will definitely not be decoded as UTF-8 or any other character encoding; they will be unambiguously decoded using the character encoding you have specified.
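So, to the practical question: decodeURIComponent() is indeed UTF-8 only, and for any other declared charset you need to percent-decode to raw octets first and then apply that charset yourself. A minimal Python sketch of the two-step decode (the Greek payload is made up for illustration):
from urllib.parse import unquote_to_bytes

uri = 'data:text/plain;charset=iso-8859-7,%e1%e2%e3'   # hypothetical payload
header, payload = uri.split(',', 1)
charset = 'iso-8859-7'                 # parse this out of the header in real code
raw = unquote_to_bytes(payload)        # step 1: percent-decode to raw octets
print(raw.decode(charset))             # step 2: apply the declared charset -> 'αβγ'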

RSS reader error: Input is not proper UTF-8 when using simplexml_load_file()

I'm using the simplexml_load_file method to parse a feed from an external source.
My code is like this:
$url = 'http://www.thedailystar.net/latest/rss/rss.xml';
$rssParser = simplexml_load_file($url);
The output is as follows :
Warning: simplexml_load_file() [function.simplexml-load-file]: http://www.thedailystar.net/latest/rss/rss.xml:12: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x92 0x73 0x20 0x48 in C:\xampp\htdocs\googlebd\index.php on line 39
Ultimately it stops with a fatal error. The main problem is that the site's character encoding is ISO-8859-1, not UTF-8.
Can I read this feed using this method (the SimpleXML API)?
If not, is any other method available?
I've searched through Google but found no answer. Every method I tried returns this error.
Thanks,
Rashed
Well, well, when I retrieve this content using Python, I get the following:
'\n<rss version="2.0" encoding="ISO-8859-1">\n [...]
<description>The results of this year\x92s Higher Secondary Certificate
Now, it says it's ISO-8859-1, but \x92 is not in that character set; it is the closing curly single quote, used as an apostrophe, in Windows-1252. So the page throws an encoding error, and per the XML spec, clients should be "strict" and not fix errors.
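You can confirm that reading in Python (my check; strictly speaking ISO-8859-1 does map byte 0x92, but only to an unprintable C1 control code):
print(b'\x92'.decode('windows-1252'))       # '’': right single quotation mark (U+2019)
print(repr(b'\x92'.decode('iso-8859-1')))   # '\x92': a C1 control code, nothing printable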
You can retrieve it and filter out the non-ISO-8859-1 characters in some fashion, or better, convert the encoding using mb_convert_encoding() before passing the result to your RSS parser.
Oh, and if you want to incorporate the result into a UTF-8 page, you may have to convert everything to UTF-8, though this is English, which might not even require any different character encoding if it all turns out to be ASCII after all.
We ran into the same issue and used utf8_encode to change the encoding from ISO-8859-1/latin-1 to UTF-8 and get past the error.
$contents = file_get_contents($url);
simplexml_load_string(utf8_encode($contents));
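One caveat worth knowing: utf8_encode() always assumes its input is ISO-8859-1, which is exactly why it works here; for any other source encoding you would want mb_convert_encoding() instead (and note that utf8_encode() is deprecated as of PHP 8.2).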
