A upstream service reads a stream of UTF-8 bytes, assumes they are ISO-8859-1, applies ISO-8859-1 to UTF-8 encoding, and sends them to my service, labeled as UTF-8.
The upstream service is out of my control. They may fix it, it may never be fixed.
I know that I can fix the encoding by applying UTF-8 to ISO-8859-1 encoding then labeling the bytes as UTF-8. But what happens if my upstream fixes their issue?
Is there any way to detect this issue and fix the encoding only when I find a bad encoding?
I'm also not sure that the upstream encoding is ISO-8859-1. I think the upstream is perl so that encoding makes sense and each sample I've tried decoded correctly when I apply ISO-8859-1 encoding.
When the source sends e4 9c 94 (✔) to my upstream, my upstream sends me c3 a2 c2 9c c2 94 (â).
utf-8 string ✔ as bytes: e4 9c 94
bytes e4 9c 94 as latin1 string: â
utf-8 string â as bytes: c3 a2 c2 9c c2 94
I can fix it applying upstream.encode('ISO-8859-1').force_encoding('UTF-8') but it will break as soon as the upstream issue is fixed.
Since you know how it is mangled, you can try to unmangle it by decoding the received UTF-8 bytes, encoding to latin1, and decoding as UTF-8 again. Only your mangled strings, pure ASCII strings, or very unlikely latin-1 string combinations will successfully decode twice. If that decoding fails, assume the upstream was fixed and just decode once as UTF-8. A pure ASCII string will correctly decode with either method so there is no issue there as well. There are valid UTF-8-encoded sequences that survive a double-decode but they are unlikely to occur in normal text.
Here's an example in Python (you didn't mention a language...):
# Assume bytes are latin1, but return encoded UTF-8.
def bad(b):
return b.decode('latin1').encode('utf8')
# Assume bytes are UTF-8, and pass them along.
def good(b):
return b
def decoder(b):
try:
return b.decode('utf8').encode('latin1').decode('utf8')
except UnicodeError:
return b.decode('utf8')
b = '✔'.encode('utf8')
print(decoder(bad(b)))
print(decoder(good(b)))
Output:
✔
✔
Bare ISO 8859-1 is almost guaranteed to be invalid UTF-8. Attempting to decode as ISO 8859-1 and then as UTF-8, and falling back to simply decoding as UTF-8 if this produces invalid byte sequences should work for this specific case.
In some more detail, the UTF-8 encoding severely restricts which non-ASCII character sequences are allowed. The allowed patterns are extremely unlikely in ISO-8859-1 because in this encoding, they represent sequences like à followed by an unprintable control character or mathematical operator, which simply do not tend to occur in any valid text.
Based on Mark Tolonen's answer, again in Python 3:
def maybe_fix_encoding(utf8_string, possible_codec="cp1252"):
"""Attempts to fix mangled text caused by interpreting UTF8 as cp1252
(or other codec: https://docs.python.org/3/library/codecs.html)"""
try:
return utf8_string.encode(possible_codec).decode('utf8')
except UnicodeError:
return utf8_string
>>> maybe_fix_encoding("some normal text and some scandinavian characters æ ø å Æ Ø Å")
'some normal text and some scandinavian characters æ ø å Æ Ø Å'
Based on turpachull's answer, and the python3 list of standard encodings (& Mark Amery's answer listing the set for various versions of python), here's a script that will attempt every encoding transform on stdin and output each version if it's different from the plain utf_8.
#!/usr/bin/env python3
import sys
import fileinput
encodings = ["ascii", "big5hkscs", "cp1006", "cp1125", "cp1250", "cp1252", "cp1254", "cp1256", "cp1258", "cp273", "cp437", "cp720", "cp775", "cp852", "cp856", "cp858", "cp861", "cp863", "cp865", "cp869", "cp875", "cp949", "euc_jis_2004", "euc_kr", "gbk", "hz", "iso2022_jp_1", "iso2022_jp_2004", "iso2022_jp_ext", "iso8859_11", "iso8859_14", "iso8859_16", "iso8859_3", "iso8859_5", "iso8859_7", "iso8859_9", "koi8_r", "koi8_u", "latin_1", "mac_cyrillic", "mac_iceland", "mac_roman", "ptcp154", "shift_jis_2004", "utf_16_be", "utf_32", "utf_32_le", "utf_7", "utf_8_sig", "big5", "cp037", "cp1026", "cp1140", "cp1251", "cp1253", "cp1255", "cp1257", "cp424", "cp500", "cp737", "cp850", "cp855", "cp857", "cp860", "cp862", "cp864", "cp866", "cp874", "cp932", "cp950", "euc_jisx0213", "euc_jp", "gb18030", "gb2312", "iso2022_jp", "iso2022_jp_2", "iso2022_jp_3", "iso2022_kr", "iso8859_10", "iso8859_13", "iso8859_15", "iso8859_2", "iso8859_4", "iso8859_6", "iso8859_8", "johab", "koi8_t", "kz1048", "mac_greek", "mac_latin2", "mac_turkish", "shift_jis", "shift_jisx0213", "utf_16", "utf_16_le", "utf_32_be", "utf_8"]
def maybe_fix_encoding(utf8_string, possible_codec="utf_8"):
try:
return utf8_string.encode(possible_codec).decode('utf_8')
except UnicodeError:
return utf8_string
for line in sys.stdin:
for e in encodings:
i=line.rstrip('\n')
result=maybe_fix_encoding(i, e)
if result != i or e == 'utf_8':
print("\t".join([e, result]))
print("\n")
usage e.g.:
$ echo 'Requiem der morgenröte' | ~/decode_string.py
cp1252 Requiem der morgenröte
cp1254 Requiem der morgenröte
iso2022_jp_1 Requiem der morgenr(D**B"yte
iso2022_jp_2 Requiem der morgenr(D**B"yte
iso2022_jp_2004 Requiem der morgenr(Q):B"yte
iso2022_jp_3 Requiem der morgenr(O):B"yte
iso2022_jp_ext Requiem der morgenr(D**B"yte
latin_1 Requiem der morgenröte
iso8859_9 Requiem der morgenröte
iso8859_14 Requiem der morgenröte
iso8859_15 Requiem der morgenröte
mac_iceland Requiem der morgenr̦te
mac_roman Requiem der morgenr̦te
mac_turkish Requiem der morgenr̦te
utf_7 Requiem der morgenr+AMMAtg-te
utf_8 Requiem der morgenröte
utf_8_sig Requiem der morgenröte
Nginx can be configured to generate a uuid suitable for client identification. Upon receiving a request from a new client, it appends a uuid in two forms before forwarding the request upstream to the origin server(s):
cookie with uuid in Base64 (e.g. CgIGR1ZfUkeEXQ2YAwMZAg==)
header with uuid in hexadecimal (e.g. 4706020A47525F56980D5D8402190303)
I want to convert a hexadecimal representation to the Base64 equivalent. I have a working solution in Ruby, but I don't fully grasp the underlying mechanics, especially the switching of byte-orders:
hex_str = "4706020A47525F56980D5D8402190303"
Treating hex_str as a sequence of high-nibble (most significant 4 bits first) binary data, produce the (ASCII-encoded) string representation:
binary_seq = [hex_str].pack("H*")
# 47 (71 decimal) -> "G"
# 06 (6 decimal) -> "\x06" (non-printable)
# 02 (2 decimal) -> "\x02" (non-printable)
# 0A (10 decimal) -> "\n"
# ...
#=> "G\x06\x02\nGR_V\x98\r]\x84\x02\x19\x03\x03"
Map binary_seq to an array of 32-bit little-endian unsigned integers. Each 4 characters (4 bytes = 32 bits) maps to an integer:
data = binary_seq.unpack("VVVV")
# "G\x06\x02\n" -> 167904839 (?)
# "GR_V" -> 1449087559 (?)
# "\x98\r]\x84" -> 2220690840 (?)
# "\x02\x19\x03\x03" -> 50534658 (?)
#=> [167904839, 1449087559, 2220690840, 50534658]
Treating data as an array of 32-bit big-endian unsigned integers, produce the (ASCII-encoded) string representation:
network_seq = data.pack("NNNN")
# 167904839 -> "\n\x02\x06G" (?)
# 1449087559 -> "V_RG" (?)
# 2220690840 -> "\x84]\r\x98" (?)
# 50534658 -> "\x03\x03\x19\x02" (?)
#=> "\n\x02\x06GV_RG\x84]\r\x98\x03\x03\x19\x02"
Encode network_seq in Base64 string:
Base64.encode64(network_seq).strip
#=> "CgIGR1ZfUkeEXQ2YAwMZAg=="
My rough understanding is that big-endian is the standard byte-order for network communications, while little-endian is more common on host machines. Why nginx provides two forms that require switching byte order to convert I'm not sure.
I also don't understand how the .unpack("VVVV") and .pack("NNNN") steps work. I can see that G\x06\x02\n becomes \n\x02\x06G, but I don't understand the steps that get there. For example, focusing on the first 8 digits of hex_str, why do .pack(H*) and .unpack("VVVV") produce:
"4706020A" -> "G\x06\x02\n" -> 167904839
whereas converting directly to base-10 produces:
"4706020A".to_i(16) -> 1191576074
? The fact that I'm asking this shows I need clarification on what exactly is going on in all these conversions :)
I'm importing content from an outside database that is infected with a variety of odd characters, e.g.
> str
=> "Nature’s Variety, Best Friends Animal Society team up"
From context it seems that ’ represents a right single-quote. In cp1252 encoding:
> str.encode('cp1252')
=> "Nature\xE2\x80\x99s Variety, Best Friends Animal Society team up"
So how do I convert it to the correct UTF-8 character? Here's what I've tried:
> str.encode('UTF-8')
=> "Nature’s Variety, Best Friends Animal Society team up"
> str.encode('cp1252').encode('UTF-8')
=> "Nature’s Variety, Best Friends Animal Society team up"
> str.encode('UTF-8', invalid: :replace, replace: '?', undef: :replace)
=> "Nature’s Variety, Best Friends Animal Society team up"
> str.encode('cp1252').encode('UTF-8', invalid: :replace, replace: '?', undef: :replace)
=> "Nature’s Variety, Best Friends Animal Society team up"
I'd rather find a way to do a generic re-encoding so that it will handle all such miss-encoded characters. But if I have to I'll do individual search and replacing. But I'm not able to make that work either:
> str.encode('cp1252').gsub('\xE2/x80/x99', "'")
=> "Nature\xE2\x80\x99s Variety, Best Friends Animal Society team up"
> str.encode('cp1252').gsub(%r{\xE2\x80\x99}, "'")
SyntaxError: unexpected tIDENTIFIER, expecting $end
> str.encode('cp1252').gsub(Regexp.escape('\xE2\x80\x99'), "'")
=> "Nature\xE2\x80\x99s Variety, Best Friends Animal Society team up"
I'd like to do this, but I can't even paste these characters into my REPL:
> str.gsub('’', "'")
When I try I get:
> str.gsub('C"b,b,b
* "', ",")
=> "Nature’s Variety, Best Friends Animal Society team up"
Frustrating. Any suggestions on how to encode this properly into UTF-8?
Edit: At the request for the actual bytes in the string:
> str.bytes.to_a.join(' ')
=> "78 97 116 117 114 101 195 162 226 130 172 226 132 162 115 32 86 97 114 105 101 116 121 44 32 66 101 115 116 32 70 114 105 101 110 100 115 32 65 110 105 109 97 108 32 83 111 99 105 101 116 121 32 116 101 97 109 32 117 112"
I had this problem with Fixing Incorrect String Encoding From MySQL. You need to set the proper encoding and then force it back.
fallback = {
"\u0081" => "\x81".force_encoding("CP1252"),
"\u008D" => "\x8D".force_encoding("CP1252"),
"\u008F" => "\x8F".force_encoding("CP1252"),
"\u0090" => "\x90".force_encoding("CP1252"),
"\u009D" => "\x9D".force_encoding("CP1252")
}
str.encode('CP1252', fallback: fallback).force_encoding('UTF-8')
The fallback may not be necessary depending on your data, but it ensures that it won't raise an error by handling the five bytes which are undefined in CP1252.
Once Ruby has got the encoding wrong, the characters will stay incorrect, according to the original mistake. Conversions simply convert the now wrong characters into the new encoding.
To correct Ruby's mistake on input, you need to use the force_encoding method, which does not do a conversion, it just corrects Ruby's note of what encoding a String has.
In your case the fault has occurred before you read the values from the DB. If you pick out the problem bytes: bytes = %w(195 162 226 130 172 226 132 162).map(&:to_i) they look to be in UTF-8 encoding, and already in the database double-encoded. You can probably assume a problem with whatever has written these into the DB (note if it is a live process, this is a bug that needs sorting, you will continue to get these bad values in).
What has happened is your DB (or code that writes to it) received some UTF-8 bytes representing the correct character, but assumed they were CP1252 to be converted to UTF-8. It made that conversion and wrote valid UTF-8 (but wrong characters) into the DB.
If I do the following in Ruby console using UTF-8 encoding in my terminal and as the default Ruby encoding, I can replicate your problem:
str = "Nature’s Variety, Best Friends Animal Society team up"
=> "Nature’s Variety, Best Friends Animal Society team up"
str = str.force_encoding('CP1252').encode('UTF-8')
=> "Nature’s Variety, Best Friends Animal Society team up"
The fault is reversible, as shown here:
str = str.encode('CP1252').force_encoding('UTF-8')
=> "Nature’s Variety, Best Friends Animal Society team up"
The encode('CP1252') undoes the original mistaken conversion.
The force_encoding('UTF-8') sets the encoding back to what the system most likely received in the first place.
You will want to find where in your system an assumption of CP1252 input is being made, and instead assume UTF-8 (it may get more complicated than that if you have multiple sources in different encodings).
I have encountered what appears to be a bug with the YAML parser. Take this simple yaml file for example:
new account:
- FLEETBOSTON
- 011001742
If you parse it using this ruby line of code:
INPUT_DATA = YAML.load_file("test.yml")
Then I get this back:
{"new account"=>["FLEETBOSTON", 2360290]}
Am I doing something wrong? Because I'm pretty sure this is never supposed to happen.
It is supposed to happen. Numbers starting with 0 are in octal notation. Unless the next character is x, in which case they're hexadecimal.
07 == 7
010 == 8
011 == 9
0x9 == 9
0xA == 10
0xF == 15
0x10 == 16
0x11 == 17
Go into irb and just type in 011001742.
1.9.2-p290 :001 > 011001742
=> 2360290
PEBKAC. :)
Your number is a number, so it's treated as a number. If you want to make it explictly a string, enclose it into quotes, so YAML will not try to make it a number.
new account:
- FLEETBOSTON
- '011001742'
0186 is the unicode "code". Where do 198 and 134 come from? How can go the other way around, from these byte codes to unicode strings?
>> c = JSON '["\\u0186"]'
[
[0] "Ɔ"
]
>> c[0][0]
198
>> c[0][1]
134
>> c[0][2]
nil
Another confusing thing is unpack. Another seemingly arbitrary number. Where does that come from? Is it even correct? From the 1.8.7 String#unpack documentation:
U | Integer | UTF-8 characters as unsigned integers
>> c[0].unpack('U')
[
[0] 390
]
>
You can find your answers here Unicode Character 'LATIN CAPITAL LETTER OPEN O' (U+0186):
Note that 186 (hexadecimal) === 390 (decimal)
C/C++/Java source code : "\u0186"
UTF-32 (decimal) : 390
UTF-8 (hex) : 0xC6 0x86 (i.e. 198 134)
You can read more about UTF-8 encoding on Wikipedia's article on UTF-8.
UTF-8 (UCS Transformation Format — 8-bit[1]) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.