iconv for Elixir - utf-8

I download a CSV file and save it with this code:
body = HTTPoison.get!(url).body
|> String.replace("Ã¼", "ü")
|> String.replace("Ã¶", "ö")
File.write!("/tmp/example.csv", body)
Using String.replace/3 to replace Ã¼ with ü is of course not a good approach. HTTPoison tells me that the response's Content-Type header is {"Content-Type", "csv;charset=utf-8"}.
How can I solve this without String.replace/3?

What you have here is data that is first UTF-8 encoded, then the bytes are treated as latin1 encoding and encoded to UTF-8 again.
A hex dump snippet from the data in that URL shows this:
00007d20: 2c22 222c 2c2c 224f 7269 6769 6e3a 2044 ,"",,,"Origin: D
00007d30: c383 c2bc 7373 656c 646f 7266 222c 224b ....sseldorf","K
00007d40: 6579 776f 7264 733a 204c 6173 7420 4d69 eywords: Last Mi
ü is encoded as <<0xc3, 0x83, 0xc2, 0xbc>> which was probably created like this:
iex(1)> "ü\0"
<<195, 188, 0>>
iex(2)> <<195::utf8, 188::utf8>> == <<0xc3, 0x83, 0xc2, 0xbc>>
true
To reverse this process, you can use a combination of :unicode.characters_to_list and :erlang.list_to_binary.
iex(3)> <<0xc3, 0x83, 0xc2, 0xbc>> |> :unicode.characters_to_list |> :erlang.list_to_binary
"ü"
That URL also includes a BOM at the start:
00000000: efbb bf22 5a75 7069 6422 2c22 5072 6f67 ..."Zupid","Prog
^^^^ ^^
00000010: 7261 6d49 6422 2c22 4d65 7263 6861 6e74 ramId","Merchant
00000020: 5072 6f64 7563 744e 756d 6265 7222 2c22 ProductNumber","
This can be removed using |> Enum.drop(1) after :unicode.characters_to_list.
So the following should work for you:
HTTPoison.get!(url).body
|> :unicode.characters_to_list
|> Enum.drop(1)
|> :erlang.list_to_binary
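For readers who want to check the byte-level reasoning outside of Elixir, here is a minimal Python sketch of the same idea. The sample string is a hypothetical stand-in for the downloaded body, and the function name repair is made up for illustration; it assumes exactly one round of UTF-8 -> Latin-1 -> UTF-8 damage plus a leading BOM, as seen in the hex dumps above.

```python
# Hypothetical sample standing in for a line of the downloaded CSV body.
original = '"Origin: Düsseldorf"'

# Simulate the damage: UTF-8 bytes misread as Latin-1, re-encoded as UTF-8,
# then prefixed with a UTF-8 byte order mark (EF BB BF).
mangled = b"\xef\xbb\xbf" + original.encode("utf-8").decode("latin-1").encode("utf-8")


def repair(body: bytes) -> str:
    """Undo one round of UTF-8 -> Latin-1 -> UTF-8 mojibake and drop a leading BOM."""
    text = body.decode("utf-8-sig")  # "utf-8-sig" strips the BOM if present
    # Every code point is now <= 0xFF, so Latin-1 maps it back to the original byte.
    return text.encode("latin-1").decode("utf-8")


print(repair(mangled))
```

The encode("latin-1") step plays the role of :erlang.list_to_binary here: it only works because each decoded code point fits in a single byte, which is also why the BOM (U+FEFF) has to be dropped first.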

Related

How to syntax highlight a hex dump of JVM classfile?

I'm studying binary file structure of JVM classfile.
My current toolbox consists of $ xxd <classfile> and $ javap -v <classfile>.
Sample outputs of these two tools are as follows:
$ xxd com/example/mycode/MyTest.class
00000000: cafe babe 0000 003d 001d 0a00 0200 0307 .......=........
00000010: 0004 0c00 0500 0601 0010 6a61 7661 2f6c ..........java/l
00000020: 616e 672f 4f62 6a65 6374 0100 063c 696e ang/Object...<in
00000030: 6974 3e01 0003 2829 5609 0008 0009 0700 it>...()V.......
...
000001a0: 0000 000a 0002 0000 0005 0008 0006 0001 ................
000001b0: 001b 0000 0002 001c ........
and
$ javap.exe -v com/example/mycode/MyTest.class
Classfile /<PathTo>/MyTest.class
Last modified 2022/11/01; size 440 bytes
...
interfaces: 0, fields: 0, methods: 2, attributes: 1
Constant pool:
#1 = Methodref #2.#3 // java/lang/Object."<init>":()V
#2 = Class #4 // java/lang/Object
#3 = NameAndType #5:#6 // "<init>":()V
#4 = Utf8 java/lang/Object
#5 = Utf8 <init>
#6 = Utf8 ()V
...
#27 = Utf8 SourceFile
#28 = Utf8 MyTest.java
{
public com.example.mycode.MyTest();
descriptor: ()V
flags: (0x0001) ACC_PUBLIC
Code:
stack=1, locals=1, args_size=1
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return
LineNumberTable:
line 3: 0
public static void main(java.lang.String[]);
...
}
SourceFile: "MyTest.java"
But from these two outputs it is difficult to tell which part of one output
corresponds to which part of the other.
It is hard to analyze the hex dump by comparing it with the disassembled output.
In this particular case I could assign tags manually by referring to the
specification,
but it was hard work even though the sample file is a trivial hello world.
For large files in general, this manual approach is impractical.
Edit: made the question proper
So what I want to do is the following:
Syntax-highlight the xxd dump output according to the classfile structure,
so that I can easily see which part is, for example, the constant pool,
or the method info and attributes, and easily compare it
with the javap output.
More ambitiously, it would be useful to view the javap and xxd outputs
side by side, so that selecting text on one side
highlights the corresponding text on the other side.
So, my question:
Is there any way, or any tool, to understand
xxd hex dump output in terms of javap disassembled output?
In particular, I'd like to see how each group of hex bytes corresponds to
each disassembled entry, one-to-one.
My current idea is to color-highlight the hex dump,
possibly like the following image.
Is there any software that does this?
Maybe I need to do some coding,
something like writing a parser for .class files.
If so, what is the most efficient, lowest-effort way
to obtain a highlighted hex dump with format tag annotations
according to the .class file specs, as shown in the image below?
Thank you for reading.
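As a possible starting point for the "write a parser" route, here is a minimal Python sketch that walks the class file header and constant pool and returns byte ranges with labels, which could then drive coloring of the xxd output. The function name annotate and the sample bytes are hypothetical: the sample reuses the first six constant-pool entries from the dump above with constant_pool_count shortened to 7. The sketch stops after the constant pool and decodes Utf8 entries as plain UTF-8, which only approximates the JVM's modified UTF-8.

```python
import struct

# Payload size in bytes (after the 1-byte tag) for fixed-size constant-pool
# entries, taken from the JVM class file format. Tag 1 (Utf8) is variable-length.
FIXED = {
    3: ("Integer", 4), 4: ("Float", 4), 5: ("Long", 8), 6: ("Double", 8),
    7: ("Class", 2), 8: ("String", 2), 9: ("Fieldref", 4), 10: ("Methodref", 4),
    11: ("InterfaceMethodref", 4), 12: ("NameAndType", 4), 15: ("MethodHandle", 3),
    16: ("MethodType", 2), 17: ("Dynamic", 4), 18: ("InvokeDynamic", 4),
    19: ("Module", 2), 20: ("Package", 2),
}


def annotate(data: bytes):
    """Return (start, end, label) spans for the header and the constant pool."""
    if data[:4] != b"\xca\xfe\xba\xbe":
        raise ValueError("not a class file")
    spans = [(0, 4, "magic"), (4, 6, "minor_version"),
             (6, 8, "major_version"), (8, 10, "constant_pool_count")]
    (count,) = struct.unpack_from(">H", data, 8)
    pos, index = 10, 1
    while index < count:
        tag = data[pos]
        if tag == 1:  # Utf8: u2 length followed by that many bytes
            (length,) = struct.unpack_from(">H", data, pos + 1)
            end = pos + 3 + length
            text = data[pos + 3:end].decode("utf-8", "replace")
            label = f"#{index} Utf8 {text}"
        else:
            name, size = FIXED[tag]
            end = pos + 1 + size
            label = f"#{index} {name}"
        spans.append((pos, end, label))
        index += 2 if tag in (5, 6) else 1  # Long and Double occupy two slots
        pos = end
    return spans


# Hypothetical truncated sample built from the bytes shown in the xxd dump.
sample = (bytes.fromhex("cafebabe 0000 003d 0007 0a00020003 070004 0c00050006 010010")
          + b"java/lang/Object" + bytes.fromhex("010006") + b"<init>"
          + bytes.fromhex("010003") + b"()V")

for start, end, label in annotate(sample):
    print(f"{start:08x}-{end:08x}  {label}")
```

Each printed range uses the same offsets as the left-hand column of xxd, so the labels can be matched against the #1, #2, ... entries in the javap constant pool listing.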

How to detect and fix incorrect character encoding

An upstream service reads a stream of UTF-8 bytes, assumes they are ISO-8859-1, applies ISO-8859-1 to UTF-8 encoding, and sends them to my service, labeled as UTF-8.
The upstream service is out of my control. They may fix it, it may never be fixed.
I know that I can fix the encoding by applying UTF-8 to ISO-8859-1 encoding then labeling the bytes as UTF-8. But what happens if my upstream fixes their issue?
Is there any way to detect this issue and fix the encoding only when I find a bad encoding?
I'm also not sure that the upstream encoding is ISO-8859-1. I think the upstream is written in Perl, so that encoding makes sense, and every sample I've tried decoded correctly when I applied ISO-8859-1 encoding.
When the source sends e2 9c 94 (✔) to my upstream, my upstream sends me c3 a2 c2 9c c2 94 (â followed by two invisible control characters).
utf-8 string ✔ as bytes: e2 9c 94
bytes e2 9c 94 as latin1 string: â (plus the control characters U+009C and U+0094)
utf-8 string â (plus the two control characters) as bytes: c3 a2 c2 9c c2 94
I can fix it applying upstream.encode('ISO-8859-1').force_encoding('UTF-8') but it will break as soon as the upstream issue is fixed.
Since you know how it is mangled, you can try to unmangle it by decoding the received UTF-8 bytes, encoding to latin1, and decoding as UTF-8 again. Only your mangled strings, pure ASCII strings, or very unlikely latin-1 string combinations will successfully decode twice. If that decoding fails, assume the upstream was fixed and just decode once as UTF-8. A pure ASCII string will correctly decode with either method so there is no issue there as well. There are valid UTF-8-encoded sequences that survive a double-decode but they are unlikely to occur in normal text.
Here's an example in Python (you didn't mention a language...):
# Assume bytes are latin1, but return encoded UTF-8.
def bad(b):
    return b.decode('latin1').encode('utf8')
# Assume bytes are UTF-8, and pass them along.
def good(b):
    return b
def decoder(b):
    try:
        return b.decode('utf8').encode('latin1').decode('utf8')
    except UnicodeError:
        return b.decode('utf8')
b = '✔'.encode('utf8')
print(decoder(bad(b)))
print(decoder(good(b)))
Output:
✔
✔
Bare ISO 8859-1 is almost guaranteed to be invalid UTF-8. Attempting to decode as ISO 8859-1 and then as UTF-8, and falling back to simply decoding as UTF-8 if this produces invalid byte sequences should work for this specific case.
In some more detail, the UTF-8 encoding severely restricts which non-ASCII character sequences are allowed. The allowed patterns are extremely unlikely in ISO-8859-1 because in this encoding, they represent sequences like à followed by an unprintable control character or mathematical operator, which simply do not tend to occur in any valid text.
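To make that claim concrete, here is a small Python sketch (the helper name is_valid_utf8 and the sample words are only illustrative). It encodes a few ordinary words as ISO-8859-1 and checks whether the resulting bytes happen to be valid UTF-8, then shows a contrived string that does pass.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Ordinary accented text stored as ISO-8859-1 is rejected by a UTF-8 decoder,
# because each accented byte is followed by a plain ASCII letter.
for word in ["Düsseldorf", "crème brûlée", "Ærø"]:
    print(word, is_valid_utf8(word.encode("latin-1")))

# Pure ASCII is valid under both encodings, so it is never misdetected.
print("plain", is_valid_utf8("plain".encode("latin-1")))

# A contrived false positive: the Latin-1 text "Ã¼" has the bytes C3 BC,
# which are also the UTF-8 encoding of "ü".
print("Ã¼", is_valid_utf8("Ã¼".encode("latin-1")))
```

Only the last kind of input, an accented capital followed directly by a symbol or control character, is ambiguous, which is why the try-then-fall-back approach tends to work in practice.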
Based on Mark Tolonen's answer, again in Python 3:
def maybe_fix_encoding(utf8_string, possible_codec="cp1252"):
    """Attempts to fix mangled text caused by interpreting UTF8 as cp1252
    (or other codec: https://docs.python.org/3/library/codecs.html)"""
    try:
        return utf8_string.encode(possible_codec).decode('utf8')
    except UnicodeError:
        return utf8_string
>>> maybe_fix_encoding("some normal text and some scandinavian characters æ ø å Æ Ø Å")
'some normal text and some scandinavian characters æ ø å Æ Ø Å'
Based on turpachull's answer, and the python3 list of standard encodings (& Mark Amery's answer listing the set for various versions of python), here's a script that will attempt every encoding transform on stdin and output each version if it's different from the plain utf_8.
#!/usr/bin/env python3
import sys
import fileinput
encodings = ["ascii", "big5hkscs", "cp1006", "cp1125", "cp1250", "cp1252", "cp1254", "cp1256", "cp1258", "cp273", "cp437", "cp720", "cp775", "cp852", "cp856", "cp858", "cp861", "cp863", "cp865", "cp869", "cp875", "cp949", "euc_jis_2004", "euc_kr", "gbk", "hz", "iso2022_jp_1", "iso2022_jp_2004", "iso2022_jp_ext", "iso8859_11", "iso8859_14", "iso8859_16", "iso8859_3", "iso8859_5", "iso8859_7", "iso8859_9", "koi8_r", "koi8_u", "latin_1", "mac_cyrillic", "mac_iceland", "mac_roman", "ptcp154", "shift_jis_2004", "utf_16_be", "utf_32", "utf_32_le", "utf_7", "utf_8_sig", "big5", "cp037", "cp1026", "cp1140", "cp1251", "cp1253", "cp1255", "cp1257", "cp424", "cp500", "cp737", "cp850", "cp855", "cp857", "cp860", "cp862", "cp864", "cp866", "cp874", "cp932", "cp950", "euc_jisx0213", "euc_jp", "gb18030", "gb2312", "iso2022_jp", "iso2022_jp_2", "iso2022_jp_3", "iso2022_kr", "iso8859_10", "iso8859_13", "iso8859_15", "iso8859_2", "iso8859_4", "iso8859_6", "iso8859_8", "johab", "koi8_t", "kz1048", "mac_greek", "mac_latin2", "mac_turkish", "shift_jis", "shift_jisx0213", "utf_16", "utf_16_le", "utf_32_be", "utf_8"]
def maybe_fix_encoding(utf8_string, possible_codec="utf_8"):
    try:
        return utf8_string.encode(possible_codec).decode('utf_8')
    except UnicodeError:
        return utf8_string
for line in sys.stdin:
    for e in encodings:
        i = line.rstrip('\n')
        result = maybe_fix_encoding(i, e)
        if result != i or e == 'utf_8':
            print("\t".join([e, result]))
    print("\n")
usage e.g.:
$ echo 'Requiem der morgenröte' | ~/decode_string.py
cp1252 Requiem der morgenröte
cp1254 Requiem der morgenröte
iso2022_jp_1 Requiem der morgenr(D**B"yte
iso2022_jp_2 Requiem der morgenr(D**B"yte
iso2022_jp_2004 Requiem der morgenr(Q):B"yte
iso2022_jp_3 Requiem der morgenr(O):B"yte
iso2022_jp_ext Requiem der morgenr(D**B"yte
latin_1 Requiem der morgenröte
iso8859_9 Requiem der morgenröte
iso8859_14 Requiem der morgenröte
iso8859_15 Requiem der morgenröte
mac_iceland Requiem der morgenr̦te
mac_roman Requiem der morgenr̦te
mac_turkish Requiem der morgenr̦te
utf_7 Requiem der morgenr+AMMAtg-te
utf_8 Requiem der morgenröte
utf_8_sig Requiem der morgenröte

Convert string in Image and blit in Pyglet

I want to convert a string that contains the data of a PNG image back into an image and blit it in Pyglet, but I couldn't get it to work. It shows this error:
Error:
File "a14.py", line 52, in __init__
self.sprite = pyglet.sprite.Sprite(img, 0, 0)
File "/usr/local/lib/python3.4/dist-packages/pyglet-1.2.4-py3.4.egg/pyglet/sprite.py", line 234, in __init__
AttributeError: 'bytes' object has no attribute 'get_texture'
Code:
imageData:
8950 4e47 0d0a 1a0a 0000 000d 4948 4452
0000 0010 0000 001c 0806 0000 0068 313f
1a00 0000 0473 4249 5408 0808 087c 0864
8800 0002 0249 4441 5438 8d95 552d 8cdb
3018 7dad 0a02 47a6 2bec 58d8 32b4 c2b0
8d4c 0a8c 74c4 5649 d819 e6c0 a181 05e6
5848 1593 d319 562a c958 60c7 3c66 5898
69a4 d0ac 0349 9c38 3fd5 ee49 561c e77b
ef7b 9fe3 2f59 ecf3 fc8a 191c 291d ad7d
cb73 eb7e 758b 287c 562f 6400 a27a 1a52
6a89 8c04 8e94 1a62 515d e0c6 0a1b 6c81
2241 e145 a364 8b7e 092d b9a8 2e00 8040
f1b9 eab0 6f5c 2ca7 c881 e237 c900 b06b
ca5c b664 e212 437e 0b96 730f 348b a0b5
ae47 6f3e 2c61 0500 8245 0009 e178 be21
b770 1c07 5a96 f555 6b43 340e ee7f ff42
5a1d c616 4838 e9ec fed3 07fc fdf9 3851
0217 5dd6 34ab 27ea 54bb e102 5a96 803a
8d04 572f 1f3f 83f5 ea76 d2ac 0e06 e078
3e92 700d b60e ac04 169a 7370 0570 d52c
32a3 5dbb 35f6 797e 5d36 22b6 7500 5a96
c6c9 4c62 0083 93b8 a3b4 b33c 83b4 3ae0
fd97 1fdd 1e0c 558f 94a2 70eb f7cd 9b43
b50d d726 a64f 1e09 f41b 292c 5310 9740
7905 183f 770e c8c6 12b1 7a81 b8c4 3c10
3e1b 9101 80f1 f3cc 3968 d076 e2ff c208
482f 3135 b722 aefc 8a94 6c2c c2b0 046b
0fa4 9780 cbd8 7426 571c 5040 8a4e 84f1
335e 5e5f a705 36c1 3b48 d422 2d3c 7542
db83 a590 c8b6 299e 1eee a605 5a91 4b90
c10f bd9b b5b7 307b f0f4 7087 f3a1 dbc0
5248 33e7 6189 5248 9c0f 172b fba4 833e
fa22 7358 0cff 0bdf 9fff cc06 0fb3 4f3a
88dc 7414 148b 0a49 b8c6 8e56 e32f d254
f010 5a31 c4a2 8256 cc7c 8d67 1d4c 89c4
a276 e54c b8fb 0775 650d 8c04 f5e9 f200
0000 0049 454e 44ae 4260 82
the code that I tried:
import pyglet
import binascii
class Window(pyglet.window.Window):
    def __init__(self, *args, **kwargs):
        super(Window, self).__init__(*args, **kwargs)
        imgData = ...  # in my code, the imageData hex string shown above goes here
        img = "".join(imgData.split())
        img = binascii.unhexlify(img)
        self.sprite = pyglet.sprite.Sprite(img, 0, 0)
    def on_draw(self):
        self.clear()
        self.sprite.draw()
def main():
    window = Window(width=640, height=480, caption='Pyglet')
    pyglet.app.run()
if __name__ == '__main__':
    main()
Is there a way to do this? If so, how? Can someone help me?
Sprite expects an image object, not a bytes object. You can use io.BytesIO to treat the bytes as a file object and then load it with pyglet.image.load().
Full working example
import pyglet
import binascii
import io
imgData = '''8950 4e47 0d0a 1a0a 0000 000d 4948 4452
0000 0010 0000 001c 0806 0000 0068 313f
1a00 0000 0473 4249 5408 0808 087c 0864
8800 0002 0249 4441 5438 8d95 552d 8cdb
3018 7dad 0a02 47a6 2bec 58d8 32b4 c2b0
8d4c 0a8c 74c4 5649 d819 e6c0 a181 05e6
5848 1593 d319 562a c958 60c7 3c66 5898
69a4 d0ac 0349 9c38 3fd5 ee49 561c e77b
ef7b 9fe3 2f59 ecf3 fc8a 191c 291d ad7d
cb73 eb7e 758b 287c 562f 6400 a27a 1a52
6a89 8c04 8e94 1a62 515d e0c6 0a1b 6c81
2241 e145 a364 8b7e 092d b9a8 2e00 8040
f1b9 eab0 6f5c 2ca7 c881 e237 c900 b06b
ca5c b664 e212 437e 0b96 730f 348b a0b5
ae47 6f3e 2c61 0500 8245 0009 e178 be21
b770 1c07 5a96 f555 6b43 340e ee7f ff42
5a1d c616 4838 e9ec fed3 07fc fdf9 3851
0217 5dd6 34ab 27ea 54bb e102 5a96 803a
8d04 572f 1f3f 83f5 ea76 d2ac 0e06 e078
3e92 700d b60e ac04 169a 7370 0570 d52c
32a3 5dbb 35f6 797e 5d36 22b6 7500 5a96
c6c9 4c62 0083 93b8 a3b4 b33c 83b4 3ae0
fd97 1fdd 1e0c 558f 94a2 70eb f7cd 9b43
b50d d726 a64f 1e09 f41b 292c 5310 9740
7905 183f 770e c8c6 12b1 7a81 b8c4 3c10
3e1b 9101 80f1 f3cc 3968 d076 e2ff c208
482f 3135 b722 aefc 8a94 6c2c c2b0 046b
0fa4 9780 cbd8 7426 571c 5040 8a4e 84f1
335e 5e5f a705 36c1 3b48 d422 2d3c 7542
db83 a590 c8b6 299e 1eee a605 5a91 4b90
c10f bd9b b5b7 307b f0f4 7087 f3a1 dbc0
5248 33e7 6189 5248 9c0f 172b fba4 833e
fa22 7358 0cff 0bdf 9fff cc06 0fb3 4f3a
88dc 7414 148b 0a49 b8c6 8e56 e32f d254
f010 5a31 c4a2 8256 cc7c 8d67 1d4c 89c4
a276 e54c b8fb 0775 650d 8c04 f5e9 f200
0000 0049 454e 44ae 4260 82'''
class Window(pyglet.window.Window):
    def __init__(self, *args, **kwargs):
        super(Window, self).__init__(*args, **kwargs)
        img = "".join(imgData.split())
        img = binascii.unhexlify(img)
        file_object = io.BytesIO(img)
        img = pyglet.image.load("noname.png", file=file_object)
        self.sprite = pyglet.sprite.Sprite(img, 0, 0)
    def on_draw(self):
        self.clear()
        self.sprite.draw()
def main():
    window = Window(width=640, height=480, caption='Pyglet')
    pyglet.app.run()
if __name__ == '__main__':
    main()
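To see why the unhexlified value is not yet an image, here is a small sketch that does not need pyglet at all. It decodes only the first two lines of the hex dump above and reads the PNG signature and IHDR chunk from an io.BytesIO stream; the variable names are illustrative. The result is still just an encoded PNG byte stream, which is what pyglet.image.load() decodes when given the file object.

```python
import io
import struct

# First two lines of the hex dump from the question (32 bytes).
hex_dump = """8950 4e47 0d0a 1a0a 0000 000d 4948 4452
0000 0010 0000 001c 0806 0000 0068 313f"""

# Remove all whitespace, then turn the hex digits into raw bytes.
png_bytes = bytes.fromhex("".join(hex_dump.split()))

# Wrap the bytes in a file-like object, as an image loader would expect.
stream = io.BytesIO(png_bytes)
signature = stream.read(8)

# IHDR layout: 4-byte length, 4-byte chunk type, 4-byte width, 4-byte height.
length, chunk_type, width, height = struct.unpack(">I4sII", stream.read(16))

print(signature)
print(chunk_type, width, height)
```

The header shows a 16 x 28 pixel image, so the data in the question is a complete encoded PNG file rather than raw pixel data.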

UTF-8 quoted-printable, multiline subject for Thunderbird?

Let's say I want to compose an email header with UTF-8, quoted-printable encoded subject, which is "test — UNIX-утилита для проверки типа файла и сравнения значений". I can confirm the bytes of the characters using:
$ echo "UNIX-утилита ..." | perl utfinfo.pl
Got 16 uchars
Char: 'U' u: 85 [0x0055] b: 85 [0x55] n: LATIN CAPITAL LETTER U [Basic Latin]
Char: 'N' u: 78 [0x004E] b: 78 [0x4E] n: LATIN CAPITAL LETTER N [Basic Latin]
Char: 'I' u: 73 [0x0049] b: 73 [0x49] n: LATIN CAPITAL LETTER I [Basic Latin]
Char: 'X' u: 88 [0x0058] b: 88 [0x58] n: LATIN CAPITAL LETTER X [Basic Latin]
Char: '-' u: 45 [0x002D] b: 45 [0x2D] n: HYPHEN-MINUS [Basic Latin]
Char: 'у' u: 1091 [0x0443] b: 209,131 [0xD1,0x83] n: CYRILLIC SMALL LETTER U [Cyrillic]
Char: 'т' u: 1090 [0x0442] b: 209,130 [0xD1,0x82] n: CYRILLIC SMALL LETTER TE [Cyrillic]
Char: 'и' u: 1080 [0x0438] b: 208,184 [0xD0,0xB8] n: CYRILLIC SMALL LETTER I [Cyrillic]
...
So, I'm trying to get the UTF-8, quoted printable representation of this. For instance, using Python's quopri:
$ python -c 'import quopri; a="test — UNIX-утилита для проверки типа файла и сравнения значений"; print(quopri.encodestring(a));'
test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9
... or PHP's quoted_printable_encode, which gives the exact same output:
$ php -r '$a="test — UNIX-утилита для проверки типа файла и сравнения значений"; echo quoted_printable_encode($a)."\n";'
test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9
So, to test, I make a text file called test.eml, and try to simply wrap this output in the =?UTF-8?Q? ... ?= tags for the Subject: line, making sure that line endings are CRLF \r\n:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... but if I open this in Thunderbird, I get a corrupt output:
I've read somewhere that multiline in long header fields is covered by RFC0822 "LONG HEADER FIELDS", and basically, the line ending should be followed by a space. So I indent the continuation lines by one space:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
 =D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
 =D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
 =D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... and I get a slightly different subject in Thunderbird, but still corrupt:
Now, if I delete =\r\n from the first three continuation lines, so the subject is all in one line:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... then actually Thunderbird shows the subject line well:
... but then my header is in conflict with the recommendation from RFC 2822 - 2.1.1. Line Length Limits which says "Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF."; specifically the line limit of 78 characters.
So, how can I obtain the proper multi-line quoted-printable representation of an UTF-8 Subject header string, so I can use it in an .eml file split at 78 characters - and have Thunderbird correctly read it?
When I ask python to create an email with that subject, here's what it does:
$ python
Python 2.7.9 (default, Mar 1 2015, 18:22:53)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from email.message import Message
>>> from email.header import Header
>>> msg = Message()
>>> import quopri
>>> h = Header(quopri.decodestring('test =E2=80=94 UNIX-'
'=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F'
'=D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8'
'=D0=BF=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8'
'=D1=81=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F '
'=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?='), 'UTF-8')
>>> msg['Subject'] = h
>>> print msg.as_string()
Subject: =?utf-8?b?dGVzdCDigJQgVU5JWC3Rg9GC0LjQu9C40YLQsCDQtNC70Y8g0L/RgNC+0LI=?=
 =?utf-8?b?0LXRgNC60Lgg0YLQuNC/0LAg0YTQsNC50LvQsCDQuCDRgdGA0LDQstC90LU=?=
 =?utf-8?b?0L3QuNGPINC30L3QsNGH0LXQvdC40Lk/?=
>>>
So it uses base64 encoding instead of quoted-printable, but my strong suspicion, based on this, is that the answer is that each line must begin and end the escape.
Indeed:
>>> import email
>>> s = '''Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0?=
... =?UTF-8?Q?=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80=D0?=
... =?UTF-8?Q?=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0?=
... =?UTF-8?Q? =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0?=
... =?UTF-8?Q?=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1?=
... =?UTF-8?Q?=87=D0=B5=D0=BD=D0=B8=D0=B9?=
...
... Hello.
... '''
>>> e = email.message_from_string(s.replace('\n', '\r\n'))
>>> email.header.decode_header(e['Subject'])
[('test \xe2\x80\x94 UNIX-\xd1\x83\xd1\x82\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x82\xd0\xb0 \xd0\xb4\xd0\xbb\xd1\x8f \xd0\xbf\xd1\x80\xd0\xbe\xd0\xb2\xd0\xb5\xd1\x80\xd0\xba\xd0\xb8 \xd1\x82\xd0\xb8\xd0\xbf\xd0\xb0 \xd1\x84\xd0\xb0\xd0\xb9\xd0\xbb\xd0\xb0 \xd0\xb8 \xd1\x81\xd1\x80\xd0\xb0\xd0\xb2\xd0\xbd\xd0\xb5\xd0\xbd\xd0\xb8\xd1\x8f \xd0\xb7\xd0\xbd\xd0\xb0\xd1\x87\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb9', 'utf-8')]
>>> decoded = email.header.decode_header(e['Subject'])
>>> print decoded[0][0].decode(decoded[0][1])
test — UNIX-утилита для проверки типа файла и сравнения значений
EDIT: However, even with the above added in .eml file, Thunderbird fails again:
... but this time it indicates it got some of the chars correct. And indeed, breakage occurs where lines are broken "in the middle of a character"; say if for the sequence 0xD1, 0x83 for the character у, the =D1?= ends one line, and the Q?=83 starts the other, then Thunderbird cannot parse that. So after manual rearrangement, this snippet can be obtained:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8?=
 =?UTF-8?Q?=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80?=
 =?UTF-8?Q?=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0?=
 =?UTF-8?Q? =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0?=
 =?UTF-8?Q?=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0?=
 =?UTF-8?Q?=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... which opens fine as an .eml message in Thunderbird (same as this image from OP).
EDIT2: Also PHP seems to do it right, with this invocation of mb_encode_mimeheader (directly pasteable in .eml file):
$ php -r '$a="test — UNIX-утилита для проверки типа файла и сравнения значений"; mb_internal_encoding("UTF-8"); echo mb_encode_mimeheader($a, "UTF-8", "Q")."\n";'
test =?UTF-8?Q?=E2=80=94=20UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82?=
=?UTF-8?Q?=D0=B0=20=D0=B4=D0=BB=D1=8F=20=D0=BF=D1=80=D0=BE=D0=B2=D0=B5?=
=?UTF-8?Q?=D1=80=D0=BA=D0=B8=20=D1=82=D0=B8=D0=BF=D0=B0=20=D1=84=D0=B0?=
=?UTF-8?Q?=D0=B9=D0=BB=D0=B0=20=D0=B8=20=D1=81=D1=80=D0=B0=D0=B2=D0=BD?=
=?UTF-8?Q?=D0=B5=D0=BD=D0=B8=D1=8F=20=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD?=
=?UTF-8?Q?=D0=B8=D0=B9?=
The problem with your test.eml is that your RFC2047 encoding is broken. The Q encoding is based on quoted-printable, but is not entirely the same. In particular, each space needs to be encoded as either =20 or _, and you cannot escape line breaks with a final =.
Fundamentally, each =?...?= sequence needs to be a single, unambiguous token per RFC 822. You can either break up your input into multiple such tokens and leave the spaces unencoded, or encode the spaces. Note that spaces between two such tokens are not significant, so encoding the spaces into the sequences makes more sense.
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test_=E2=80=94_UNIX-=D1=83=D1=82=D0=B8=D0=BB?=
 =?UTF-8?Q?=D0=B8=D1=82=D0=B0_=D0=B4=D0=BB=D1=8F_=D0=BF=D1=80?=
 =?UTF-8?Q?=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8_=D1=82=D0=B8=D0=BF?=
 =?UTF-8?Q?=D0=B0_=D1=84=D0=B0=D0=B9=D0=BB=D0=B0_=D0=B8_=D1=81?=
 =?UTF-8?Q?=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F_?=
 =?UTF-8?Q?=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
Of course, with this exposition, quoted-printable isn't really legible at all, and probably takes much more space than base64, so you might prefer to go with the B encoding in the end after all.
Unless you are writing a MIME library yourself, the simple solution is to not care, and let the library piece this together for you. PHP is more problematic (the standard library lacks this functionality, and the third-party libraries are somewhat uneven--find one you trust, and stick to it), but in Python, simply pass in a Unicode string, and the email library will encode it if necessary.
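As an illustration of letting the library do the work, here is a minimal Python 3 sketch using the standard email package. It explicitly requests the Q encoding through a Charset object (an assumption made only to match the question; leaving the charset at its defaults is also fine) and then decodes the result again to check the round trip.

```python
from email.charset import QP, Charset
from email.header import Header, decode_header, make_header

subject = "test — UNIX-утилита для проверки типа файла и сравнения значений"

# Request the RFC 2047 "Q" encoding explicitly for this header.
charset = Charset("utf-8")
charset.header_encoding = QP

# header_name lets the library account for the "Subject: " prefix on the first line.
encoded = Header(subject, charset, maxlinelen=76, header_name="Subject").encode()
print("Subject: " + encoded)

# Decode the folded header again; whitespace between encoded-words is dropped.
round_trip = str(make_header(decode_header(encoded)))
print(round_trip)
```

Each output line is a separate =?utf-8?q?...?= encoded-word, with spaces written as _ and continuation lines starting with a space, so the result can be pasted after "Subject: " in an .eml file (with CRLF line endings).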

Explain what those escaped numbers mean in unicode encoding in ruby 1.8.7

0186 is the Unicode code point. Where do 198 and 134 come from? How can I go the other way around, from these byte values to Unicode strings?
>> c = JSON '["\\u0186"]'
[
[0] "Ɔ"
]
>> c[0][0]
198
>> c[0][1]
134
>> c[0][2]
nil
Another confusing thing is unpack. Another seemingly arbitrary number. Where does that come from? Is it even correct? From the 1.8.7 String#unpack documentation:
U | Integer | UTF-8 characters as unsigned integers
>> c[0].unpack('U')
[
[0] 390
]
You can find your answers on the page for Unicode Character 'LATIN CAPITAL LETTER OPEN O' (U+0186):
Note that 186 (hexadecimal) === 390 (decimal)
C/C++/Java source code : "\u0186"
UTF-32 (decimal) : 390
UTF-8 (hex) : 0xC6 0x86 (i.e. 198 134)
You can read more about UTF-8 encoding on Wikipedia's article on UTF-8.
UTF-8 (UCS Transformation Format — 8-bit[1]) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.
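The arithmetic behind 198 and 134 can be sketched in a few lines of Python (the function names are made up for illustration, and only the two-byte range U+0080..U+07FF is handled). A two-byte UTF-8 sequence has the bit pattern 110xxxxx 10xxxxxx, where the x bits are the 11 bits of the code point.

```python
def utf8_encode_2byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range is handled in this sketch")
    lead = 0b11000000 | (code_point >> 6)          # 110xxxxx: top 5 bits
    trail = 0b10000000 | (code_point & 0b111111)   # 10xxxxxx: low 6 bits
    return bytes([lead, trail])


def utf8_decode_2byte(data: bytes) -> int:
    """Recover the code point from a two-byte UTF-8 sequence."""
    lead, trail = data
    return ((lead & 0b00011111) << 6) | (trail & 0b00111111)


encoded = utf8_encode_2byte(0x0186)
print(list(encoded))               # the byte values Ruby 1.8.7 shows for c[0][0], c[0][1]
print(utf8_decode_2byte(encoded))  # the code point that unpack('U') reports
```

For U+0186 (decimal 390), 390 >> 6 is 6 and 390 & 63 is 6, giving 0xC0 | 6 = 0xC6 (198) and 0x80 | 6 = 0x86 (134); reversing the masks and shift gives 390 again.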
