Oracle mysterious Unicode codepoint

While calling XMLTYPE() on a CLOB column that should contain valid XML 1.0 (the database encoding should be UTF-8), the following error message comes out (translated from the original Italian):
ORA-31011: XML parsing failed
ORA-19202: Error occurred in XML processing
LPX-00217: invalid character 15577023 (U+EDAFBF)
Error at line 240
ORA-06512: at "SYS.XMLTYPE", line 272
ORA-06512: at line 1
31011. 00000 - "XML parsing failed"
*Cause: XML parser returned an error while trying to parse the document.
*Action: Check if the document to be parsed is valid.
Now this invalid character is reported as Unicode codepoint EDAFBF. The problem is that, according to the Unicode spec (Wikipedia), there are no codepoints beyond U+10FFFF. So what could this error mean?
Inspecting this CLOB with SQL Developer (and copying it into Notepad++ with the encoding set to UTF-8) does not reveal anything unusual beyond some strange characters, which apparently came from the user's browser when he copied text from a Microsoft Word document. At least as copied from the SQL Developer UI and displayed by Notepad++ as UTF-8, the CLOB seems to be valid UTF-8 text.
Is there a way to reproduce this error by populating Oracle directly (from SQL Developer or in some other way)? Contacting the end user to find out exactly what he entered in the web form is problematic.

Not addressing the first part of the question, but you can reproduce it with a RAW value:
select xmltype('<dummy>'
|| utl_raw.cast_to_varchar2(cast('EDAFBF' as raw(6)))
|| '</dummy>')
from dual;
Error report -
SQL Error: ORA-31011: XML parsing failed
ORA-19202: Error occurred in XML processing
LPX-00217: invalid character 15577023 (U+EDAFBF)
Error at line 1
ORA-06512: at "SYS.XMLTYPE", line 310
ORA-06512: at line 1
Just selecting the character:
select utl_raw.cast_to_varchar2(cast('EDAFBF' as raw(6)))
from dual;
... is displayed as a small square with an even smaller question mark inside it (I think) in SQL Developer for me (version 4.1), but that's just how it's choosing to render it; copying and pasting still gives the replacement character � since the codepoint is, as you say, invalid. XMLType is being stricter about validity than the CLOB is. The unistr() function doesn't handle the value either, which isn't really a surprise.
(You don't need to cast the string to raw(6), just utl_raw.cast_to_varchar2('EDAFBF') has the same effect; but doing it explicitly makes it a bit clearer what's going on, I think).
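For example, this shorter form fails with the same LPX-00217 error:
select xmltype('<dummy>'
|| utl_raw.cast_to_varchar2('EDAFBF')  -- no raw(6) cast needed
|| '</dummy>')
from dual;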
I don't see how that could have got into your file without some kind of corruption, possibly through a botched character set conversion, I suppose. You could maybe use dbms_lob.fragment_replace() (SecureFiles only) or plain replace() to remove that character, but of course there may be others you haven't hit yet, and at best you'd only be treating the symptoms rather than the cause.
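A minimal cleanup sketch along those lines, assuming a hypothetical table docs with a CLOB column xml_clob (plain replace() works on CLOBs):
-- Remove every occurrence of the three invalid bytes ED AF BF.
-- "docs" and "xml_clob" are placeholder names; test on a copy first.
update docs
set    xml_clob = replace(xml_clob, utl_raw.cast_to_varchar2('EDAFBF'), '')
where  dbms_lob.instr(xml_clob, utl_raw.cast_to_varchar2('EDAFBF')) > 0;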

Related

characters from notepad file getting converted into special characters while reading using utl_file.get_line procedure

I have written a program to read data from a text file and load it into a table using the UTL_FILE package in Oracle. While reading a few lines, some characters are getting converted into special characters, for example:
string in file = 63268982_GHC –EXH PALOMARES EVA
value entered into database = 63268982_GHC âEXH PALOMARES EVA
I tried using the CONVERT function but it did not achieve anything.
My Oracle version is 11gR2 and it uses the NLS character set WE8ISO8859P1. Because these strings represent physical file names, I get a mismatch when I try to match against the filename.
I tried re-converting the value stored in Oracle from the WE character set back to ASCII like below:
convert('63268989_GHC âEXH PALOMARES','us7ascii','WE8ISO8859P1')
but the outcome is different from what was in the text file. Can anyone please suggest how this problem can be overcome?
The – character in the file is not a regular hyphen (-, chr(45)) but an En Dash, U+2013, stored in UTF-8 as three bytes: decimal 226, 128, 147, or hex e2, 80, 93. Interpreted individually rather than as a single multibyte character, those bytes come out as â followed by two codes that WE8ISO8859P1 treats as control characters (under Windows-1252 they would render as â€“), which is why the stored value shows â where the dash should be.
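You can confirm what actually got stored with the dump() function; a quick check, with hypothetical table and column names:
select fname, dump(fname, 1016) as stored_bytes  -- 1016 = hex plus character set name
from   file_names
where  fname like '63268982%';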
Try opening the file with utl_file.fopen_nchar and reading lines with utl_file.get_line_nchar.
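A minimal sketch of that approach, assuming a directory object MY_DIR and a file names.txt (both placeholders):
declare
  f   utl_file.file_type;
  buf nvarchar2(32767);
begin
  -- The nchar variants read the file as Unicode instead of interpreting
  -- the bytes in the database character set.
  f := utl_file.fopen_nchar('MY_DIR', 'names.txt', 'r');
  loop
    utl_file.get_line_nchar(f, buf);
    dbms_output.put_line(buf);
  end loop;
exception
  when no_data_found then
    utl_file.fclose(f);
end;
/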
Oracle 11gR2 Database Globalization Support Guide: Programming with Unicode.

Play framework JDBC ebean mysql exception with characters řů but accepts áõ

Trying to save models and I get a:
java.sql.SQLException: Incorrect string value: ...
Saving a text like "jedna dva tři kachna dům a kachní maso"
I'm using default.url="jdbc:mysql://[url]/[database]?characterEncoding=UTF-8"
řů have no encoding in latin1; áõ do. That suggests that CHARACTER SET latin1 is involved somewhere. Let's see SHOW CREATE TABLE.
C599, etc, are valid utf8 encodings for the corresponding characters.
? occurs when the destination character set cannot represent the character. Again, this points to the column/table being latin1, when it should be utf8 (or utf8mb4).
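If the table does turn out to be latin1, converting it is the usual fix; a sketch, with t as a placeholder table name:
-- Check the current definition first:
SHOW CREATE TABLE t;
-- Convert the table and its text columns to utf8mb4 (back it up first):
ALTER TABLE t CONVERT TO CHARACTER SET utf8mb4;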
More discussion, and for debugging similar situations: Trouble with utf8 characters; what I see is not what I stored
The text probably has some special character, and the UTF-8 encoding that you are forcing may cause an error.
The string is plain enough as text, but its UTF-8 encoding shows the multibyte characters:
String:
jedna dva tři kachna dům a kachní maso
UTF-8 bytes:
'jedna dva t\xc5\x99i kachna d\xc5\xafm a kachn\xc3\xad maso'
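You can confirm those byte values in MySQL itself (on a utf8mb4 connection):
-- HEX() shows the encoding of the string in the connection character set:
SELECT HEX('ř');  -- C599
SELECT HEX('ů');  -- C5AF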

Adaptive Server Enterprise ASCII function for multi-byte characters

While converting from Oracle to Sybase ASE I encountered the following issue: the ASCII function doesn't return the code for multi-byte characters properly; it looks like it only gets the first byte.
For example, the following statement returns 34655
select ASCII('㍉') from dual
while in Sybase it returns 63
select ASCII('㍉')
Adaptive Server has the following language settings
Language: Japanese
Character Set: eucjis
Even if I use the Sybase uscalar function
select uscalar('㍉')
it returns 63
Only passing the hex equivalent of this Japanese symbol to the uscalar function gives a different result, though still not the same as in Oracle:
select uscalar(0x875F)
returns 24455.
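For what it's worth, 34655 is hex 875F and 24455 is hex 5F87, the same two bytes read in the opposite order; on the Oracle side that is easy to check:
select to_char(34655, 'XXXX') as ascii_hex,   -- returns 875F
       to_char(24455, 'XXXX') as uscalar_hex  -- returns 5F87
from dual;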
But this way another issue appears: I'm not able to cast this symbol to hex, as
select convert(varbinary,'㍉')
returns only the first byte again (0x3f)
Please help me find an appropriate way of getting the correct character code for multi-byte Japanese symbols in Adaptive Server Enterprise.

W3C unable to validate

Sorry, I am unable to validate this document because on line 1200 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: utf8 "\xD8" does not map to Unicode
I would be thankful to know what exactly I should do. My website is: http://dailysahara.com/
The issue, as stated by the validator, is that you have some invalid UTF-8 in your document. It appears to be in the box on the left of the site with the four tabs "Tags", "Comments", "Recents", and "Popular". It shows up for me as a black square like this: �. If you remove that, you should be able to validate your site.

Character reference "&#1" is an invalid XML character

I set the property mapred.textoutputformat.separator to the value \001. But when I run the MR job, it throws an exception:
Character reference "&#1" is an invalid XML character.
Please help me.
I found the solution. The reason is that the job configuration is serialized as XML, and U+0001 is not a legal XML 1.0 character, so when the "\001" character sequence (or another such control character) is used, it gets written out in an invalid form during serialization.
So the solution was to Base64-encode the separator in the configuration, override the getRecordWriter method of the TextOutputFormat class, and decode it there (Base64.decodeBase64).
This works.
