How to detect and fix incorrect character encoding - utf-8

A upstream service reads a stream of UTF-8 bytes, assumes they are ISO-8859-1, applies ISO-8859-1 to UTF-8 encoding, and sends them to my service, labeled as UTF-8.
The upstream service is out of my control. They may fix it, it may never be fixed.
I know that I can fix the encoding by applying UTF-8 to ISO-8859-1 encoding then labeling the bytes as UTF-8. But what happens if my upstream fixes their issue?
Is there any way to detect this issue and fix the encoding only when I find a bad encoding?
I'm also not sure that the upstream encoding is ISO-8859-1. I think the upstream is perl so that encoding makes sense and each sample I've tried decoded correctly when I apply ISO-8859-1 encoding.
When the source sends e4 9c 94 (✔) to my upstream, my upstream sends me c3 a2 c2 9c c2 94 (â).
utf-8 string ✔ as bytes: e4 9c 94
bytes e4 9c 94 as latin1 string: â
utf-8 string â as bytes: c3 a2 c2 9c c2 94
I can fix it applying upstream.encode('ISO-8859-1').force_encoding('UTF-8') but it will break as soon as the upstream issue is fixed.

Since you know how it is mangled, you can try to unmangle it by decoding the received UTF-8 bytes, encoding to latin1, and decoding as UTF-8 again. Only your mangled strings, pure ASCII strings, or very unlikely latin-1 string combinations will successfully decode twice. If that decoding fails, assume the upstream was fixed and just decode once as UTF-8. A pure ASCII string will correctly decode with either method so there is no issue there as well. There are valid UTF-8-encoded sequences that survive a double-decode but they are unlikely to occur in normal text.
Here's an example in Python (you didn't mention a language...):
# Assume bytes are latin1, but return encoded UTF-8.
def bad(b):
return b.decode('latin1').encode('utf8')
# Assume bytes are UTF-8, and pass them along.
def good(b):
return b
def decoder(b):
try:
return b.decode('utf8').encode('latin1').decode('utf8')
except UnicodeError:
return b.decode('utf8')
b = '✔'.encode('utf8')
print(decoder(bad(b)))
print(decoder(good(b)))
Output:
✔
✔

Bare ISO 8859-1 is almost guaranteed to be invalid UTF-8. Attempting to decode as ISO 8859-1 and then as UTF-8, and falling back to simply decoding as UTF-8 if this produces invalid byte sequences should work for this specific case.
In some more detail, the UTF-8 encoding severely restricts which non-ASCII character sequences are allowed. The allowed patterns are extremely unlikely in ISO-8859-1 because in this encoding, they represent sequences like à followed by an unprintable control character or mathematical operator, which simply do not tend to occur in any valid text.

Based on Mark Tolonen's answer, again in Python 3:
def maybe_fix_encoding(utf8_string, possible_codec="cp1252"):
"""Attempts to fix mangled text caused by interpreting UTF8 as cp1252
(or other codec: https://docs.python.org/3/library/codecs.html)"""
try:
return utf8_string.encode(possible_codec).decode('utf8')
except UnicodeError:
return utf8_string
>>> maybe_fix_encoding("some normal text and some scandinavian characters æ ø å Æ Ø Å")
'some normal text and some scandinavian characters æ ø å Æ Ø Å'

Based on turpachull's answer, and the python3 list of standard encodings (& Mark Amery's answer listing the set for various versions of python), here's a script that will attempt every encoding transform on stdin and output each version if it's different from the plain utf_8.
#!/usr/bin/env python3
import sys
import fileinput
encodings = ["ascii", "big5hkscs", "cp1006", "cp1125", "cp1250", "cp1252", "cp1254", "cp1256", "cp1258", "cp273", "cp437", "cp720", "cp775", "cp852", "cp856", "cp858", "cp861", "cp863", "cp865", "cp869", "cp875", "cp949", "euc_jis_2004", "euc_kr", "gbk", "hz", "iso2022_jp_1", "iso2022_jp_2004", "iso2022_jp_ext", "iso8859_11", "iso8859_14", "iso8859_16", "iso8859_3", "iso8859_5", "iso8859_7", "iso8859_9", "koi8_r", "koi8_u", "latin_1", "mac_cyrillic", "mac_iceland", "mac_roman", "ptcp154", "shift_jis_2004", "utf_16_be", "utf_32", "utf_32_le", "utf_7", "utf_8_sig", "big5", "cp037", "cp1026", "cp1140", "cp1251", "cp1253", "cp1255", "cp1257", "cp424", "cp500", "cp737", "cp850", "cp855", "cp857", "cp860", "cp862", "cp864", "cp866", "cp874", "cp932", "cp950", "euc_jisx0213", "euc_jp", "gb18030", "gb2312", "iso2022_jp", "iso2022_jp_2", "iso2022_jp_3", "iso2022_kr", "iso8859_10", "iso8859_13", "iso8859_15", "iso8859_2", "iso8859_4", "iso8859_6", "iso8859_8", "johab", "koi8_t", "kz1048", "mac_greek", "mac_latin2", "mac_turkish", "shift_jis", "shift_jisx0213", "utf_16", "utf_16_le", "utf_32_be", "utf_8"]
def maybe_fix_encoding(utf8_string, possible_codec="utf_8"):
try:
return utf8_string.encode(possible_codec).decode('utf_8')
except UnicodeError:
return utf8_string
for line in sys.stdin:
for e in encodings:
i=line.rstrip('\n')
result=maybe_fix_encoding(i, e)
if result != i or e == 'utf_8':
print("\t".join([e, result]))
print("\n")
usage e.g.:
$ echo 'Requiem der morgenröte' | ~/decode_string.py
cp1252 Requiem der morgenröte
cp1254 Requiem der morgenröte
iso2022_jp_1 Requiem der morgenr(D**B"yte
iso2022_jp_2 Requiem der morgenr(D**B"yte
iso2022_jp_2004 Requiem der morgenr(Q):B"yte
iso2022_jp_3 Requiem der morgenr(O):B"yte
iso2022_jp_ext Requiem der morgenr(D**B"yte
latin_1 Requiem der morgenröte
iso8859_9 Requiem der morgenröte
iso8859_14 Requiem der morgenröte
iso8859_15 Requiem der morgenröte
mac_iceland Requiem der morgenr̦te
mac_roman Requiem der morgenr̦te
mac_turkish Requiem der morgenr̦te
utf_7 Requiem der morgenr+AMMAtg-te
utf_8 Requiem der morgenröte
utf_8_sig Requiem der morgenröte

Related

Convert UTF-8 multibyte characters into multiple ascii characters

Can someone help me with converting some (potential) bogus UTF-8 multibyte characters into ascii as follows?
\u6162 → ["\x61", "\x62"] → ["a", "b"] → "ab"
My use case is fun only. I know I'm not compressing anything by representing two ascii characters in a multibyte character.
I've played around with various versions of unpack but it never seems to work correctly:
"\u6162".unpack('H*')
# => ["e685a2"]
Force encoding seems to return the same:
"\u6162".force_encoding('US-ASCII')
# => "\xE6\x85\xA2"
"\u6162" is not equivalent to "\x61" + "\x62". \u indicates a Unicode code point which does not translate directly to a hex value. Unicode code point 6162 is 慢.
Because it is a string, and because Ruby uses UTF-8 by default, when you unpack it you get the UTF-8 value of U+6162 which is three bytes: E6 85 A2.
2.2.1 :023 > "\u6162".encoding
=> #<Encoding:UTF-8>
2.2.1 :024 > "\u6162".unpack("A*")
=> ["\xE6\x85\xA2"]
To get what you want, you need its UTF-16 representation 61 62. But if you just encode as UTF-16 you'll get a byte order marker FE FF 61 62. So use UTF-16BE (big endian) to avoid this.
2.2.1 :052 > "\u6162".encode("UTF-16BE").unpack("A*")
=> ["ab"]
"\u6162".codepoints.first.divmod(16 ** 2).map(&:chr).join
# => "ab"

UTF-8 quoted-printable, multiline subject for Thunderbird?

Let's say I want to compose an email header with UTF-8, quoted-printable encoded subject, which is "test — UNIX-утилита для проверки типа файла и сравнения значений". I can confirm the bytes of the characters using:
$ echo "UNIX-утилита ..." | perl utfinfo.pl
Got 16 uchars
Char: 'U' u: 85 [0x0055] b: 85 [0x55] n: LATIN CAPITAL LETTER U [Basic Latin]
Char: 'N' u: 78 [0x004E] b: 78 [0x4E] n: LATIN CAPITAL LETTER N [Basic Latin]
Char: 'I' u: 73 [0x0049] b: 73 [0x49] n: LATIN CAPITAL LETTER I [Basic Latin]
Char: 'X' u: 88 [0x0058] b: 88 [0x58] n: LATIN CAPITAL LETTER X [Basic Latin]
Char: '-' u: 45 [0x002D] b: 45 [0x2D] n: HYPHEN-MINUS [Basic Latin]
Char: 'у' u: 1091 [0x0443] b: 209,131 [0xD1,0x83] n: CYRILLIC SMALL LETTER U [Cyrillic]
Char: 'т' u: 1090 [0x0442] b: 209,130 [0xD1,0x82] n: CYRILLIC SMALL LETTER TE [Cyrillic]
Char: 'и' u: 1080 [0x0438] b: 208,184 [0xD0,0xB8] n: CYRILLIC SMALL LETTER I [Cyrillic]
...
So, I'm trying to get the UTF-8, quoted printable representation of this. For instance, using Python's quopri:
$ python -c 'import quopri; a="test — UNIX-утилита для проверки типа файла и сравнения значений"; print(quopri.encodestring(a));'
test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9
... or PHP's quoted_printable_encode, which gives the exact same output:
$ php -r '$a="test — UNIX-утилита для проверки типа файла и сравнения значений"; echo quoted_printable_encode($a)."\n";'
test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9
So, to test, I make a text file called test.eml, and try to simply wrap this output in the =?UTF-8?Q? ... ?= tags for the Subject: line, making sure that line endings are CRLF \r\n:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... but if I open this in Thunderbird, I get a corrupt output:
I've read somewhere that multiline in long header fields is covered by RFC0822 "LONG HEADER FIELDS", and basically, the line ending should be followed by a space. So I indent the continuation lines by one space:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... and I get a slighly different subject in Thunderbird, but still corrupt:
Now, if I delete =\r\n from the first three continuation lines, so the subject is all in one line:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... then actually Thunderbird shows the subject line well:
... but then my header is in conflict with the recommendation from RFC 2822 - 2.1.1. Line Length Limits which says "Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF."; specifically the line limit of 78 characters.
So, how can I obtain the proper multi-line quoted-printable representation of an UTF-8 Subject header string, so I can use it in an .eml file split at 78 characters - and have Thunderbird correctly read it?
When I ask python to create an email with that subject, here's what it does:
$ python
Python 2.7.9 (default, Mar 1 2015, 18:22:53)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from email.message import Message
>>> from email.header import Header
>>> msg = Message()
>>> import quopri
>>> h = Header(quopri.decodestring('test =E2=80=94 UNIX-'
'=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F'
'=D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8'
'=D0=BF=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8'
'=D1=81=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F '
'=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?='), 'UTF-8')
>>> msg['Subject'] = h
>>> print msg.as_string()
Subject: =?utf-8?b?dGVzdCDigJQgVU5JWC3Rg9GC0LjQu9C40YLQsCDQtNC70Y8g0L/RgNC+0LI=?=
=?utf-8?b?0LXRgNC60Lgg0YLQuNC/0LAg0YTQsNC50LvQsCDQuCDRgdGA0LDQstC90LU=?=
=?utf-8?b?0L3QuNGPINC30L3QsNGH0LXQvdC40Lk/?=
>>>
So it uses base64 encoding instead of quoted-printable, but my strong suspicion, based on this, is that the answer is that each line must begin and end the escape.
Indeed:
>>> import email
>>> s = '''Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0?=
... =?UTF-8?Q?=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80=D0?=
... =?UTF-8?Q?=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0?=
... =?UTF-8?Q? =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0?=
... =?UTF-8?Q?=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1?=
... =?UTF-8?Q?=87=D0=B5=D0=BD=D0=B8=D0=B9?=
...
... Hello.
... '''
>>> e = email.message_from_string(s.replace('\n', '\r\n'))
>>> email.header.decode_header(e['Subject'])
[('test \xe2\x80\x94 UNIX-\xd1\x83\xd1\x82\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x82\xd0\xb0 \xd0\xb4\xd0\xbb\xd1\x8f \xd0\xbf\xd1\x80\xd0\xbe\xd0\xb2\xd0\xb5\xd1\x80\xd0\xba\xd0\xb8 \xd1\x82\xd0\xb8\xd0\xbf\xd0\xb0 \xd1\x84\xd0\xb0\xd0\xb9\xd0\xbb\xd0\xb0 \xd0\xb8 \xd1\x81\xd1\x80\xd0\xb0\xd0\xb2\xd0\xbd\xd0\xb5\xd0\xbd\xd0\xb8\xd1\x8f \xd0\xb7\xd0\xbd\xd0\xb0\xd1\x87\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb9', 'utf-8')]
>>> decoded = email.header.decode_header(e['Subject'])
>>> print decoded[0][0].decode(decoded[0][1])
test — UNIX-утилита для проверки типа файла и сравнения значений
EDIT: However, even with the above added in .eml file, Thunderbird fails again:
... but this time it indicates it got some of the chars correct. And indeed, breakage occurs where lines are broken "in the middle of a character"; say if for the sequence 0xD1, 0x83 for the character у, the =D1?= ends one line, and the Q?=83 starts the other, then Thunderbird cannot parse that. So after manual rearrangement, this snippet can be obtained:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8?=
=?UTF-8?Q?=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80?=
=?UTF-8?Q?=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0?=
=?UTF-8?Q? =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0?=
=?UTF-8?Q?=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0?=
=?UTF-8?Q?=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... which opens fine as an .eml message in Thunderbird (same as this image from OP).
EDIT2: Also PHP seems to do it right, with this invocation of mb_encode_mimeheader (directly pasteable in .eml file):
$ php -r '$a="test — UNIX-утилита для проверки типа файла и сравнения значений"; mb_internal_encoding("UTF-8"); echo mb_encode_mimeheader($a, "UTF-8", "Q")."\n";'
test =?UTF-8?Q?=E2=80=94=20UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82?=
=?UTF-8?Q?=D0=B0=20=D0=B4=D0=BB=D1=8F=20=D0=BF=D1=80=D0=BE=D0=B2=D0=B5?=
=?UTF-8?Q?=D1=80=D0=BA=D0=B8=20=D1=82=D0=B8=D0=BF=D0=B0=20=D1=84=D0=B0?=
=?UTF-8?Q?=D0=B9=D0=BB=D0=B0=20=D0=B8=20=D1=81=D1=80=D0=B0=D0=B2=D0=BD?=
=?UTF-8?Q?=D0=B5=D0=BD=D0=B8=D1=8F=20=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD?=
=?UTF-8?Q?=D0=B8=D0=B9?=
The problem with your test.eml is that your RFC2047 encoding is broken. The Q encoding is based on quoted-printable, but is not entirely the same. In particular, each space needs to be encoded as either =20 or _, and you cannot escape line breaks with a final =.
Fundamentally, each =?...?= sequence needs to be a single, unambiguous token per RFC 822. You can either break up your input into multiple such tokens and leave the spaces unencoded, or encode the spaces. Note that spaces between two such tokens are not significant, so encoding the spaces into the sequences makes more sense.
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test_=E2=80=94_UNIX-=D1=83=D1=82=D0=B8=D0=BB?=
=?UTF-8?Q?=D0=B8=D1=82=D0=B0_=D0=B4=D0=BB_=D1=8F_=D0=BF=D1=80?=
=?UTF-8?Q?=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8_=D1=82=D0=B8=D0=BF?=
=?UTF-8?Q?=D0=B0_=D1=84=D0=B0=D0=B9=D0=BB=D0=B0_=D0=B8_=D1=81?=
=?UTF-8?Q?=D1=80=D0=B0=D0=B2=D0=BD_=D0=B5=D0=BD=D0=B8=D1=8F_?=
=?UTF-8?Q?=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
Of course, with this exposition, quoted-printable isn't really legible at all, and probably takes much more space than base64, so you might prefer to go with the B encoding in the end after all.
Unless you are writing a MIME library yourself, the simple solution is to not care, and let the library piece this together for you. PHP is more problematic (the standard library lacks this functionality, and the third-party libraries are somewhat uneven--find one you trust, and stick to it), but in Python, simply pass in a Unicode string, and the email library will encode it if necessary.

Explain what those escaped numbers mean in unicode encoding in ruby 1.8.7

0186 is the unicode "code". Where do 198 and 134 come from? How can go the other way around, from these byte codes to unicode strings?
>> c = JSON '["\\u0186"]'
[
[0] "Ɔ"
]
>> c[0][0]
198
>> c[0][1]
134
>> c[0][2]
nil
Another confusing thing is unpack. Another seemingly arbitrary number. Where does that come from? Is it even correct? From the 1.8.7 String#unpack documentation:
U | Integer | UTF-8 characters as unsigned integers
>> c[0].unpack('U')
[
[0] 390
]
>
You can find your answers here Unicode Character 'LATIN CAPITAL LETTER OPEN O' (U+0186):
Note that 186 (hexadecimal) === 390 (decimal)
C/C++/Java source code : "\u0186"
UTF-32 (decimal) : 390
UTF-8 (hex) : 0xC6 0x86 (i.e. 198 134)
You can read more about UTF-8 encoding on Wikipedia's article on UTF-8.
UTF-8 (UCS Transformation Format — 8-bit[1]) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.

Displaying whole ORACLE 8-bit CHARSETS in UNICODE

I maintain an Java EE web application against an eight bits charset oracle database.
The application will be used from abroad and I want to be able to check strings -for example with UNICODE regexps, and both from Java and from Javascript- to see if they fit into the database CHARSET.
One function in GDK -globalization developer kit- gives the equivalent Java name of the oracle charset -I think it was ISO-8859-15-. But I'm not certain the correspondence will be exact.
What I wanted is to display the whole charset -NOT ISO..., but the ORACLE one- char by char to use both from Java and Javascript, even to display the UNICODE points and to tell apart the control characters from printable ones.
There is a funcion in Oracle's GDK to that end?
Thank you.
I think I've found it! (Eureka!)
A little JAVA JDBC program resulted in exactly the characters in ISO-8859-15 that are distintc to ISO-8859-1 (by the way, I've learned that ISO-8859-1 occupies from 0x00 to 0xff in UNICODE).
Program output:
CHR: 164 UNICODE: 8364 euro sign
CHR: 166 UNICODE: 352
CHR: 168 UNICODE: 353
CHR: 180 UNICODE: 381
CHR: 184 UNICODE: 382
CHR: 188 UNICODE: 338
CHR: 189 UNICODE: 339
CHR: 190 UNICODE: 376
Program code (not using GDK at all):
NOTE: the statement "SELECT CHR(i using nchar_cs) FROM DUAL" just gave back the same numbers... WHY?
for(int i=0; i<256; i++)
{
Statement select = con.createStatement();
ResultSet result = select.executeQuery("select CHR(" + i +") from DUAL");
while(result.next())
{
int unicodePoint = result.getString(1).codePointBefore(1);
//int unicodePoint = result.getString(1).codePointAt(0);
if (unicodePoint != i)
System.out.println("CHR: " + i + "\tUNICODE: " + unicodePoint);
}
result.close();
result = null;
select.close();
select = null;
}

Is byte 0xFF valid in a UTF-8 encoded string?

Can an UTF-8 string contain the byte 0xFF (255)?
No. It is specifically forbidden by the spec.
UTF-8, Number of 1 bytes,First code point is U+0000, Last code point is U+007F.
The bytes 0xFE and 0xFF do not valid in UTF-8.
The first byte is 0 in The UTF-8 when bytes only one.
[click image for more info about UTF-8 bytes]

Resources