Interpret and display different encodings for email bodies in Ruby

I am using the Mail gem in my Rails app, but I am encountering some problems with encoding.
I am working on a mail that looks like this:
....
Message-ID: <22D41F1A16CD5A5719309A96F8C95D50@vcrfnyjsz>
From: "=?utf-8?B?IOWFqOWbvealvOWHpOWFvOiBjOWwj+WnkOS/oeaBrw==?=" <info@nks-media.ru>
To: ...
...
MIME-Version: 1.0
Content-Type: text/html;
charset="utf-8"
Content-Transfer-Encoding: base64
X-Priority: 5
X-MSMail-Priority: Low
X-Mailer: Microsoft Outlook Express 6.00.2900.5512
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.5512
PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMCBUcmFuc2l0aW9uYWwv
L0VOIj4NCjxIVE1MIHhtbG5zOm8gPSAidXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6
b2ZmaWNlIj48SEVBRD4NCjxNRVRBIGNvbnRlbnQ9InRleHQvaHRtbDsgY2hhcnNldD11dGYtOCIg
aHR0cC1lcXVpdj1Db250ZW50LVR5cGU+DQo8TUVUQSBuYW1lPUdFTkVSQVRPUiBjb250ZW50PSJN
U0hUTUwgOC4wMC42MDAxLjIzNTg4Ij48L0hFQUQ+DQo8Qk9EWSBiZ0NvbG9yPWFxdWE+DQo8UD48
Rk9OVCBjb2xvcj1ncmF5IHNpemU9Nj7lhajlm73lsI/lp5Dkv6Hmga/vvIzlrabnlJ/lprnkv6Hm
ga/vvIzmpbzlh6TlhbzogYzlpbPvvIzoia/lrrbkv6Hmga/vvIzlhbzogYzkv6Hmga/vvIzlpKfk
v53lgaXkv6Hmga88L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6
ICMwMGZmZmYiIGNvbG9yPSMwMGZmZmYgc2l6ZT02PjxBIA0KaHJlZj0iaHR0cDovL3d3dy5obmhu
LmNsdWIveGlueGkuaHRtIj5odGh0dHA6Ly93d3cuaG5obi5jbHViL3hpbnhpLmh0bWh0dHA6Ly93
d3cuaG5obi5jbHViL3hpbnhpLmh0bTwvQT48L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc3R5bGU9IkJB
Q0tHUk9VTkQtQ09MT1I6ICMwMGZmZmYiIGNvbG9yPSMwMGZmZmYgDQpzaXplPTY+PC9GT05UPiZu
YnNwOzwvUD4NCjxQPjxGT05UIGNvbG9yPSM4MDgwODAgc2l6ZT02PjxGT05UIHNpemU9Nj48Rk9O
VCBzdHlsZT0iQkFDS0dST1VORC1DT0xPUjogYXF1YSIgDQpjb2xvcj1hcXVhPuS6uuWPr+S7peaK
oui1sOS7luWUr+S4gOaDs+imgeeahOWls+S6uu+8jOWlueWPquiDveWxnuS6juS7luOAgjwvRk9O
VD48L1A+DQo8UD48Rk9OVCBzaXplPTY+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFx
dWEiIA0KY29sb3I9YXF1YT7jgIDjgIDmlrnnk7fpl63kuIrnnLzmt7HlkLjkuobkuIDlj6PmsJTv
vIznhLblkI7lsIblpbnnmoTmlbTkuKrohLjln4vlnKjog7jliY3jgILku5bojqvmiY7nibnnmoTv
vIzov5nkuKrlppblrb3vvIHnm7TmjqXpl7fmrbvlvpfkuobvvIE8L0ZPTlQ+PC9QPg0KPFA+PEZP
TlQgc2l6ZT02PjxGT05UIHN0eWxlPSJCQUNLR1JPVU5ELUNPTE9SOiBhcXVhIiANCmNvbG9yPWFx
dWE+ICAgICAgICDmuIXmup/miJHnn6XpgZPvvIzmmK/pgqPlj6rlrp7lipvlubPlubPnmoTohb7o
m4fvvIznqYbojbvlj4jmmK/osIHllYrvvJ/mmK/ku5blkI7mnaXmlLbmnI3nmoTlppbprZTlkJfv
vJ88L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc2l6ZT02PjxGT05UIHN0eWxlPSJCQUNLR1JPVU5ELUNP
TE9SOiBhcXVhIiANCmNvbG9yPWFxdWE+ICAgIOWFremBk+WYtOinkua1ruWHuuS4gOaKueivoeW8
gumYtOajrueahOeskeaEj++8muKAnOaNheegtOWkqeOAguKAnTwvRk9OVD48L1A+DQo8UD48Rk9O
VCBzaXplPTY+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFxdWEiIA0KY29sb3I9YXF1
YT4gICAg4oCc6YKj5aW55oCO5LmI5Lya6ZSZ5LqG6Zeo77yf5pei54S25pyJ5LiA77yM5oiR5oCO
5LmI6IO95LiN6K6k5Li66L+Y5Lya5pyJ5LqM77yf5aaC5p6c5oiR6K+05L2g546w5Zyo5q2j5aSE
5LqO5aS06ISR5re35Lmx77yM5oCd6Lev5LiN5riF55qE54q25oCB5LiN6L+H5YiG5ZCn77yf4oCd
5bm06L275rCR6K2m6Zeu55m95Li977yM55m95Li954K55aS05om/6K6k44CCPC9GT05UPjwvUD4N
CjxQPjxGT05UIHNpemU9Nj48Rk9OVCBzdHlsZT0iQkFDS0dST1VORC1DT0xPUjogYXF1YSIgDQpj
b2xvcj1hcXVhPiAgICDmnpfljZfmnKzlsLHmmK/nm5fnlKjkuoblkI7kurrnmoTnn6Xor4bvvIzm
iYDku6XkuZ/msqHku4DkuYjlj6/pqoTlgrLnmoTvvIzkvr/nrJHnnYDku6TkvJfkurrlubPouqvv
vIzlj4jlr7nprY/lvoHpl67pgZPvvJrigJzprY/ljb/lrrblnKjmnJ3loILkuYvkuIrkuJPpl67m
raTkuovvvIzmg7PmnaXlv4XmmK/mnInku4DkuYjmt7HmhI/nvaLvvJ/igJ08L0ZPTlQ+PC9QPg0K
PFA+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFxdWEiIGNvbG9yPWFxdWEgDQpzaXpl
PTY+44CA44CAPC9GT05UPjwvUD48L0ZPTlQ+PC9GT05UPjwvRk9OVD48L0ZPTlQ+PC9GT05UPjwv
Rk9OVD48L0ZPTlQ+PC9CT0RZPjwvSFRNTD4NCg==
(I have omitted the irrelevant parts.)
I fetch it with Ruby's Net::IMAP class and pass it as a string to the
Mail.read_from_string
method of the gem.
It returns an object, call it msg. I now call msg.body and get this answer:
#<Mail::Body:0x007f0045976ea8 @boundary=nil, @preamble=nil, @epilogue=nil, @charset="US-ASCII", @part_sort_order=["text/plain", "text/enriched", "text/html"], @parts=[], @raw_source="PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMCBUcmFuc2l0aW9uYWwv\r\n... (the same Base64 text as in the message above, with its lines joined by \r\n) ...\r\nRk9OVD48L0ZPTlQ+PC9CT0RZPjwvSFRNTD4NCg==\r\n\r\n\r\n", @encoding="base64">
so everything seems right.
I do:
msg.body.encoding # return "Base64"
and it's right again. But here is the strange part: when I do
msg.body.only_us_ascii? # return True
Shouldn't this be false? The charset in the Content-Type header of the email is 'utf-8'.
In fact, if I try to do
msg.body.decoded
here is:
"<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">\r\n<HTML xmlns:o = \"urn:schemas-microsoft-com:office:office\"><HEAD>\r\n<META content=\"text/html; charset=utf-8\" http-equiv=Content-Type>\r\n<META name=GENERATOR content=\"MSHTML 8.00.6001.23588\"></HEAD>\r\n<BODY bgColor=aqua>\r\n<P><FONT color=gray size=6>\xE5\x85\xA8\xE5\x9B\xBD\xE5\xB0\x8F\xE5\xA7\x90\xE4\xBF\xA1\xE6\x81\xAF\xEF\xBC\x8C\xE5\xAD\xA6\xE7\x94\x9F\xE5\xA6\xB9\xE4\xBF\xA1\xE6\x81\xAF\xEF\xBC\x8C\xE6\xA5\xBC\xE5\x87\xA4\xE5\x85\xBC\xE8\x81\x8C\xE5\xA5\xB3\xEF\xBC\x8C\xE8\x89\xAF\xE5\xAE\xB6\xE4\xBF\xA1\xE6\x81\xAF\xEF\xBC\x8C\xE5\x85\xBC\xE8\x81\x8C\xE4\xBF\xA1\xE6\x81\xAF\xEF\xBC\x8C\xE5\xA4\xA7\xE4\xBF\x9D\xE5\x81\xA5\xE4\xBF\xA1\xE6\x81\xAF</FONT></P>\r\n<P><FONT style=\"BACKGROUND-COLOR: #00ffff\" color=#00ffff size=6><A \r\nhref=\"http://www.hnhn.club/xinxi.htm\">hthttp://www.hnhn.club/xinxi.htmhttp://www.hnhn.club/xinxi.htm</A></FONT></P>\r\n<P><FONT style=\"BACKGROUND-COLOR: #00ffff\" color=#00ffff \r\nsize=6></FONT> </P>\r\n<P><FONT color=#808080 size=6><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua>\xE4\xBA\xBA\xE5\x8F\xAF\xE4\xBB\xA5\xE6\x8A\xA2\xE8\xB5\xB0\xE4\xBB\x96\xE5\x94\xAF\xE4\xB8\x80\xE6\x83\xB3\xE8\xA6\x81\xE7\x9A\x84\xE5\xA5\xB3\xE4\xBA\xBA\xEF\xBC\x8C\xE5\xA5\xB9\xE5\x8F\xAA\xE8\x83\xBD\xE5\xB1\x9E\xE4\xBA\x8E\xE4\xBB\x96\xE3\x80\x82</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" 
\r\ncolor=aqua>\xE3\x80\x80\xE3\x80\x80\xE6\x96\xB9\xE7\x93\xB7\xE9\x97\xAD\xE4\xB8\x8A\xE7\x9C\xBC\xE6\xB7\xB1\xE5\x90\xB8\xE4\xBA\x86\xE4\xB8\x80\xE5\x8F\xA3\xE6\xB0\x94\xEF\xBC\x8C\xE7\x84\xB6\xE5\x90\x8E\xE5\xB0\x86\xE5\xA5\xB9\xE7\x9A\x84\xE6\x95\xB4\xE4\xB8\xAA\xE8\x84\xB8\xE5\x9F\x8B\xE5\x9C\xA8\xE8\x83\xB8\xE5\x89\x8D\xE3\x80\x82\xE4\xBB\x96\xE8\x8E\xAB\xE6\x89\x8E\xE7\x89\xB9\xE7\x9A\x84\xEF\xBC\x8C\xE8\xBF\x99\xE4\xB8\xAA\xE5\xA6\x96\xE5\xAD\xBD\xEF\xBC\x81\xE7\x9B\xB4\xE6\x8E\xA5\xE9\x97\xB7\xE6\xAD\xBB\xE5\xBE\x97\xE4\xBA\x86\xEF\xBC\x81</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua> \xE6\xB8\x85\xE6\xBA\x9F\xE6\x88\x91\xE7\x9F\xA5\xE9\x81\x93\xEF\xBC\x8C\xE6\x98\xAF\xE9\x82\xA3\xE5\x8F\xAA\xE5\xAE\x9E\xE5\x8A\x9B\xE5\xB9\xB3\xE5\xB9\xB3\xE7\x9A\x84\xE8\x85\xBE\xE8\x9B\x87\xEF\xBC\x8C\xE7\xA9\x86\xE8\x8D\xBB\xE5\x8F\x88\xE6\x98\xAF\xE8\xB0\x81\xE5\x95\x8A\xEF\xBC\x9F\xE6\x98\xAF\xE4\xBB\x96\xE5\x90\x8E\xE6\x9D\xA5\xE6\x94\xB6\xE6\x9C\x8D\xE7\x9A\x84\xE5\xA6\x96\xE9\xAD\x94\xE5\x90\x97\xEF\xBC\x9F</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua> \xE5\x85\xAD\xE9\x81\x93\xE5\x98\xB4\xE8\xA7\x92\xE6\xB5\xAE\xE5\x87\xBA\xE4\xB8\x80\xE6\x8A\xB9\xE8\xAF\xA1\xE5\xBC\x82\xE9\x98\xB4\xE6\xA3\xAE\xE7\x9A\x84\xE7\xAC\x91\xE6\x84\x8F\xEF\xBC\x9A\xE2\x80\x9C\xE6\x8D\x85\xE7\xA0\xB4\xE5\xA4\xA9\xE3\x80\x82\xE2\x80\x9D</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua> 
\xE2\x80\x9C\xE9\x82\xA3\xE5\xA5\xB9\xE6\x80\x8E\xE4\xB9\x88\xE4\xBC\x9A\xE9\x94\x99\xE4\xBA\x86\xE9\x97\xA8\xEF\xBC\x9F\xE6\x97\xA2\xE7\x84\xB6\xE6\x9C\x89\xE4\xB8\x80\xEF\xBC\x8C\xE6\x88\x91\xE6\x80\x8E\xE4\xB9\x88\xE8\x83\xBD\xE4\xB8\x8D\xE8\xAE\xA4\xE4\xB8\xBA\xE8\xBF\x98\xE4\xBC\x9A\xE6\x9C\x89\xE4\xBA\x8C\xEF\xBC\x9F\xE5\xA6\x82\xE6\x9E\x9C\xE6\x88\x91\xE8\xAF\xB4\xE4\xBD\xA0\xE7\x8E\xB0\xE5\x9C\xA8\xE6\xAD\xA3\xE5\xA4\x84\xE4\xBA\x8E\xE5\xA4\xB4\xE8\x84\x91\xE6\xB7\xB7\xE4\xB9\xB1\xEF\xBC\x8C\xE6\x80\x9D\xE8\xB7\xAF\xE4\xB8\x8D\xE6\xB8\x85\xE7\x9A\x84\xE7\x8A\xB6\xE6\x80\x81\xE4\xB8\x8D\xE8\xBF\x87\xE5\x88\x86\xE5\x90\xA7\xEF\xBC\x9F\xE2\x80\x9D\xE5\xB9\xB4\xE8\xBD\xBB\xE6\xB0\x91\xE8\xAD\xA6\xE9\x97\xAE\xE7\x99\xBD\xE4\xB8\xBD\xEF\xBC\x8C\xE7\x99\xBD\xE4\xB8\xBD\xE7\x82\xB9\xE5\xA4\xB4\xE6\x89\xBF\xE8\xAE\xA4\xE3\x80\x82</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua> \xE6\x9E\x97\xE5\x8D\x97\xE6\x9C\xAC\xE5\xB0\xB1\xE6\x98\xAF\xE7\x9B\x97\xE7\x94\xA8\xE4\xBA\x86\xE5\x90\x8E\xE4\xBA\xBA\xE7\x9A\x84\xE7\x9F\xA5\xE8\xAF\x86\xEF\xBC\x8C\xE6\x89\x80\xE4\xBB\xA5\xE4\xB9\x9F\xE6\xB2\xA1\xE4\xBB\x80\xE4\xB9\x88\xE5\x8F\xAF\xE9\xAA\x84\xE5\x82\xB2\xE7\x9A\x84\xEF\xBC\x8C\xE4\xBE\xBF\xE7\xAC\x91\xE7\x9D\x80\xE4\xBB\xA4\xE4\xBC\x97\xE4\xBA\xBA\xE5\xB9\xB3\xE8\xBA\xAB\xEF\xBC\x8C\xE5\x8F\x88\xE5\xAF\xB9\xE9\xAD\x8F\xE5\xBE\x81\xE9\x97\xAE\xE9\x81\x93\xEF\xBC\x9A\xE2\x80\x9C\xE9\xAD\x8F\xE5\x8D\xBF\xE5\xAE\xB6\xE5\x9C\xA8\xE6\x9C\x9D\xE5\xA0\x82\xE4\xB9\x8B\xE4\xB8\x8A\xE4\xB8\x93\xE9\x97\xAE\xE6\xAD\xA4\xE4\xBA\x8B\xEF\xBC\x8C\xE6\x83\xB3\xE6\x9D\xA5\xE5\xBF\x85\xE6\x98\xAF\xE6\x9C\x89\xE4\xBB\x80\xE4\xB9\x88\xE6\xB7\xB1\xE6\x84\x8F\xE7\xBD\xA2\xEF\xBC\x9F\xE2\x80\x9D</FONT></P>\r\n<P><FONT style=\"BACKGROUND-COLOR: aqua\" color=aqua \r\nsize=6>\xE3\x80\x80\xE3\x80\x80</FONT></P></FONT></FONT></FONT></FONT></FONT></FONT></FONT></BODY></HTML>\r\n"
The result is not UTF-8 as I expected but ASCII-8BIT, and I don't know how to use it or display it in the browser.
Any help?

It's Base64, as revealed by msg.body.encoding # return "Base64". I'm no email format expert, but the Content-Transfer-Encoding: base64 header in your paste looks like the place msg.body.encoding gets it from.
Base64 isn't actually a character encoding like UTF-8. It's instead a conversion of binary data to ASCII.
I think it's unfortunate that the Mail gem doesn't take care of this for you.
But if what you have is Base64, you can decode it using the stdlib Base64 module (after require 'base64'):
data = Base64.decode64(msg.body.raw_source)
However, if it's Base64-encoded in the first place, what comes out the other side might not be plain text but some kind of binary file format (an MS Word document, say), so it might still not make sense when read directly, even once decoded.
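For the message in the question, the decoded bytes are UTF-8 text (the header says charset="utf-8"), but Ruby labels a freshly decoded Base64 string as ASCII-8BIT. Below is a minimal sketch of relabelling such bytes using only the stdlib. The sample string is made up for illustration and stands in for the real body.

```ruby
require 'base64'

# Hypothetical stand-in for the mail body: a UTF-8 HTML fragment, Base64-encoded.
original = "<p>\u5168\u56FD</p>"
encoded  = Base64.encode64(original)

# decode64 returns raw bytes, which Ruby tags as binary (ASCII-8BIT).
bytes = Base64.decode64(encoded)

# The bytes are already valid UTF-8, so only the label has to change.
# force_encoding converts nothing; dup leaves the binary string untouched.
html = bytes.dup.force_encoding('UTF-8')

puts html.encoding
puts html.valid_encoding?
```

With the Mail gem, the same relabelling would presumably apply to the string returned by msg.body.decoded, which in the question already holds the decoded bytes. That part is an assumption and is not exercised by this sketch.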

Related

Parse quoted-printable encoding content from .mht file

I am trying to get all the images from a .mht file using the Nokogiri gem. But since the .mht file uses quoted-printable encoding, all the image tags I get back have weird characters in them:
<img alt='3D"AFC-Logo' src="3D%22https://upload.=" width='3D"75"' height='3D"75"'>
<img src="3D%22https://en.wikipedia.org/static/images/footer/wikimedia-butto=" width='3D"88"' height='3D"31"' alt='3D"Wikimedia'>
<img src="3D%22https://en.wikipedia.org/static/images/footer/poweredby_mediawiki_8=" alt='3D"Powered' width='3D"88"' height='3D"31"'>
This is the link to that .mht file: https://drive.google.com/file/d/1DtbgrFyCEcggAk1nqpZSluNhRt-k3t95/view?usp=sharing
And below is the code that I am using to get all the images from the .mht file:
require 'nokogiri'
# Collect the unique src values of all <img> tags in the document.
def get_image_links(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//img[@src]")
  raise "No <img .../> tags!" if nodes.empty?
  nodes.inject([]) do |uris, node|
    puts node.to_s
    uris << node.attr('src').strip
  end.uniq
end
# The method has to be defined before it is called.
html = File.open("1646037951.mht").read
image_links = get_image_links(html)
I have tried to decode it with .unpack('M').first, but it's still not working: it just returns the same result as above.
Or maybe Rails has something for this?
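One possible reason the attempt had no visible effect, and this is only a guess, is that the decoding was applied after Nokogiri had already parsed the still-encoded markup. Below is a small sketch of decoding first. The fragment is made up and only shaped like the tags above; in a real .mht file the quoted-printable part would first have to be separated from the surrounding MIME headers and boundaries.

```ruby
# Hypothetical quoted-printable fragment, shaped like the <img> tags in the question.
# "=3D" encodes "=", and a "=" at the end of a line marks a soft line break.
qp = "<img alt=3D\"AFC-Logo\" src=3D\"https://example.org/lo=\ngo.png\" width=3D\"75\">"

# unpack('M') decodes quoted-printable and returns a one-element array.
html = qp.unpack('M').first

puts html
```

The decoded string is what would then be handed to Nokogiri::HTML.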

Using ruby SAX parsers for GB2312 encoded xml

Good day,
I have a lot of big XML files that I need to parse, but the problem is that they are encoded as 'gb2312'. I would normally use a SAX parser for this.
So here is an example of the XML:
<?xml version="1.0" encoding="gb2312"?>
<Root>
<ValueList Count="112290" FieldCount="11">
<Item1 Value1="23743" Value2="Дипломатия � Пустой кувшин" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item2 Value1="6611" Value2="ДЛ � 018 омела � золотой кинжал" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item3 Value1="6608" Value2="Наука (ДЛ)�круг фей 021�тяпка" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item4 Value1="6612" Value2="Знаки ДЛ � 003руны � разрушение" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
....
</Root>
I'm trying to use the Nokogiri SAX parser (I also tried libxml-ruby, with the same result):
require 'nokogiri'
class SchemaParser < Nokogiri::XML::SAX::Document
  def initialize
    @cnt = 0
  end
  def start_element(name, attrs = [])
    if name == "Item1"
      @cnt += 1
      puts @cnt
    end
  end
end
parser = Nokogiri::XML::SAX::Parser.new(SchemaParser.new)
parser.parse_io(File.open('2_4_EQUIPMENT_ESSENCE.xml'), 'gb2312')
But this gives the error "`check_encoding': 'GB2312' is not a valid encoding (ArgumentError)". If I remove the encoding declaration and let Nokogiri detect the encoding itself, I get this error:
encoding error : input conversion failed due to input error, bytes 0xA8 0x43 0x20 0xA7
encoding error : input conversion failed due to input error, bytes 0xA8 0x43 0x20 0xA7
I/O error : encoder error
I also tried to open File with proper encoding, but that didn't help SAX parser:
[3] pry(main)> f = File.open('2_4_EQUIPMENT_ESSENCE.xml', "r:gb2312")
=> #<File:2_4_EQUIPMENT_ESSENCE.xml>
[4] pry(main)> f.external_encoding.name
=> "GB2312"
Did anyone use 'gb2312' encoding with SAX parsers in ruby? Any recommendations how to proceed?
It seems the issue is that Libxml2 does not support the GB2312 encoding (see here for a list of supported encodings).
I'm not sure if you have tried this, but I think you can work around this by removing the encoding declaration from the XML files (so Libxml2 does not try to transcode the data) and set the external encoding of the File object to GB2312, because then Ruby will transcode the file to UTF-8 as it is read, and from then on everything will remain as UTF-8.
So, here is my workaround.
Problems:
Some of the characters in the XML files are not valid 'gb2312'. I found that 'GB18030', which covers the full set of Chinese characters, is a better choice.
I converted all the XML files to UTF-8 so that I can use the SAX parser.
I ended up with this rake task:
desc "convert chinese xml files to utf-8"
task :convert do
  rm_rf 'data/utf8'
  mkdir 'data/utf8'
  Dir.foreach('data') { |f|
    if f.end_with?('.xml')
      puts "converted:: data/utf8/#{f}" if system("iconv -f GB18030 -t UTF-8 data/#{f} > data/utf8/#{f}")
    end
  }
  # replace encodings for xml files
  system("bundle exec ruby -pi -e \"gsub(/gb2312/, 'UTF-8')\" data/utf8/*.xml")
end
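The same conversion can probably be done without shelling out to iconv, since Ruby can transcode a file while reading it. Below is a rough sketch under that assumption. The sample document and the temporary file are made up for illustration and stand in for one of the data/*.xml files.

```ruby
require 'tempfile'

# Hypothetical sample standing in for one of the data/*.xml files.
utf8_xml = "<?xml version=\"1.0\" encoding=\"gb2312\"?>\n<Root Value=\"\u4E2D\u6587\"/>\n"

# Write the sample to disk as GB18030 bytes.
tmp = Tempfile.new(['sample', '.xml'])
tmp.close
File.binwrite(tmp.path, utf8_xml.encode('GB18030'))

# "r:GB18030:UTF-8" means: the file is GB18030 on disk, return UTF-8 strings.
converted = File.open(tmp.path, 'r:GB18030:UTF-8') { |f| f.read }

# Update the XML declaration so that it matches the new encoding.
converted = converted.sub(/encoding="gb2312"/i, 'encoding="UTF-8"')
tmp.unlink

puts converted
```

For very large files, reading line by line instead of with a single read would keep memory use down; the transcoding mode string works the same way in both cases.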

Problems with text/csv Content-Encoding = UTF-8 in Ruby Mechanize

When attempting to load a page which is a CSV that has encoding of UTF-8, using Mechanize V2.5.1, I used the following code:
a.content_encoding_hooks << lambda { |httpagent, uri, response, body_io|
  response['Content-Encoding'] = 'none' if response['Content-Encoding'].to_s == 'UTF-8'
}
p4 = a.get(redirect_url, nil, ['accept-encoding' => 'UTF-8'])
but I find that the content encoding hook is not being called and I get the following error and traceback:
/Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:787:in 'response_content_encoding': unsupported content-encoding: UTF-8 (Mechanize::Error)
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:274:in 'fetch'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:949:in 'response_redirect'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:299:in 'fetch'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:949:in 'response_redirect'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:299:in 'fetch'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize.rb:407:in 'get'
from prototype/test1.rb:307:in `<main>'
Does anyone have an idea why the content hook code is not firing and why I am getting the error?
"but I find that the content encoding hook is not being called"
What makes you think that?
The error message references this code:
def response_content_encoding response, body_io
  ...
  ...
  out_io = case response['Content-Encoding']
           when nil, 'none', '7bit', "" then
             body_io
           when 'deflate' then
             content_encoding_inflate body_io
           when 'gzip', 'x-gzip' then
             content_encoding_gunzip body_io
           else
             raise Mechanize::Error,
               "unsupported content-encoding: #{response['Content-Encoding']}"
So Mechanize only recognizes these content encodings: 'none', '7bit', 'deflate', 'gzip', and 'x-gzip' (a missing or empty header is accepted as well).
From the HTTP/1.1 spec:
14.11 Content-Encoding
The Content-Encoding entity-header field is used as a modifier to the
media-type. When present, its value indicates what additional content
codings have been applied to the entity-body, and thus what decoding
mechanisms must be applied in order to obtain the media-type
referenced by the Content-Type header field. Content-Encoding is
primarily used to allow a document to be compressed without losing the
identity of its underlying media type.
Content-Encoding = "Content-Encoding" ":" 1#content-coding
Content codings are defined in section 3.5. An example of its use is
Content-Encoding: gzip
The content-coding is a characteristic of the entity identified by the
Request-URI. Typically, the entity-body is stored with this encoding
and is only decoded before rendering or analogous usage. However, a
non-transparent proxy MAY modify the content-coding if the new coding
is known to be acceptable to the recipient, unless the "no-transform"
cache-control directive is present in the message.
...
...
3.5 Content Codings
Content coding values indicate an encoding transformation that has
been or can be applied to an entity. Content codings are primarily
used to allow a document to be compressed or otherwise usefully
transformed without losing the identity of its underlying media type
and without loss of information. Frequently, the entity is stored in
coded form, transmitted directly, and only decoded by the recipient.
content-coding = token
All content-coding values are case-insensitive. HTTP/1.1 uses
content-coding values in the Accept-Encoding (section 14.3) and
Content-Encoding (section 14.11) header fields. Although the value
describes the content-coding, what is more important is that it
indicates what decoding mechanism will be required to remove the
encoding.
The Internet Assigned Numbers Authority (IANA) acts as a registry for
content-coding value tokens. Initially, the registry contains the
following tokens:
gzip
  An encoding format produced by the file compression program "gzip" (GNU zip) as described in RFC 1952 [25]. This format is a
  Lempel-Ziv coding (LZ77) with a 32 bit CRC.
compress
  The encoding format produced by the common UNIX file compression program "compress". This format is an adaptive
  Lempel-Ziv-Welch coding (LZW).
  Use of program names for the identification of encoding formats
  is not desirable and is discouraged for future encodings. Their
  use here is representative of historical practice, not good
  design. For compatibility with previous implementations of HTTP,
  applications SHOULD consider "x-gzip" and "x-compress" to be
  equivalent to "gzip" and "compress" respectively.
deflate
  The "zlib" format defined in RFC 1950 [31] in combination with the "deflate" compression mechanism described in RFC 1951 [29].
identity
  The default (identity) encoding; the use of no transformation whatsoever. This content-coding is used only in the
  Accept-Encoding header, and SHOULD NOT be used in the
  Content-Encoding header.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5
In other words, an HTTP content encoding has nothing to do with character encodings such as ASCII, UTF-8, or Latin-1.
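To make the distinction concrete, here is a small stdlib-only sketch. It treats gzip as the content coding and UTF-8 as the charset of a made-up CSV body. No HTTP request is involved; the variable names are purely illustrative.

```ruby
require 'zlib'

# Hypothetical CSV body containing non-ASCII characters, held as UTF-8.
csv = "name,city\nJ\u00F3zsef,Gy\u0151r\n"

# "Content-Encoding: gzip" describes this step: the bytes are compressed for transfer.
wire_bytes = Zlib.gzip(csv)

# The receiver undoes the content coding first...
raw = Zlib.gunzip(wire_bytes)

# ...and only then applies the charset, which would be announced separately,
# for example as "Content-Type: text/csv; charset=UTF-8".
text = raw.force_encoding('UTF-8')

puts text
```

A server that sends "Content-Encoding: UTF-8" is therefore putting a charset name into a header that is meant for compression schemes.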
In addition the source code for Mechanize::HTTP::Agent has this in it:
# A list of hooks to call after retrieving a response. Hooks are called with
# the agent and the response returned.
attr_reader :post_connect_hooks
# A list of hooks to call before making a request. Hooks are called with
# the agent and the request to be performed.
attr_reader :pre_connect_hooks
# A list of hooks to call to handle the content-encoding of a request.
attr_reader :content_encoding_hooks
So it doesn't even look like you are calling the right hook.
Here is an example I got to work:
require 'mechanize'
a = Mechanize.new
p a.content_encoding_hooks
func = lambda do |a, uri, resp, body_io|
  puts body_io.read
  puts "The Content-Encoding is: #{resp['Content-Encoding']}"
  if resp['Content-Encoding'].to_s == 'UTF-8'
    resp['Content-Encoding'] = 'none'
  end
  puts "The Content-Encoding is now: #{resp['Content-Encoding']}"
end
a.content_encoding_hooks << func
a.get(
  'http://localhost:8080/cgi-bin/myprog.rb',
  [],
  nil,
  "Accept-Encoding" => 'gzip, deflate' # This is what Firefox always uses
)
myprog.rb:
#!/usr/bin/env ruby
require 'cgi'
cgi = CGI.new('html3')
headers = {
  "type" => 'text/html',
  "Content-Encoding" => "UTF-8",
}
cgi.out(headers) do
  cgi.html() do
    cgi.head{ cgi.title{"Content-Encoding Test"} } +
    cgi.body() do
      cgi.div(){ "The Accept-Encoding was: #{cgi.accept_encoding}" }
    end
  end
end
--output:--
[]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><HTML><HEAD><TITLE>Content-Encoding Test</TITLE></HEAD><BODY><DIV>The Accept-Encoding was: gzip, deflate</DIV></BODY></HTML>
The Content-Encoding is: UTF-8
The Content-Encoding is now: none

I always get an UndefinedConversionError in Ruby 2.0 while scraping with Mechanize

When I try to submit a textarea with Mechanize and Ruby 2.0, I always get an
Encoding::UndefinedConversionError: U+0151 from UTF-8 to ISO-8859-1
Then I tried to convert the text with Iconv and got a similar result:
Iconv.iconv("LATIN1", "UTF-8", text)
I get this error message:
Iconv::IllegalSequence: "őzködik, melyet "...
The text contains Eastern European characters. What can I do to avoid this kind of problem, and how can I convert properly between different encodings?
I have found an elegant solution:
replacements = [["À", "À"], ["Á", "Á"], ["Â", "Â"], ["Ã", "Ã"], ["Ä", "Ä"], ["Å", "Å"], ["Æ", "Æ"], ["Ç", "Ç"], ["È", "È"], ["É", "É"], ["Ê", "Ê"], ["Ë", "Ë"], ["Ì", "Ì"], ["Í", "Í"], ["Î", "Î"], ["Ï", "Ï"], ["Ð", "Ð"], ["Ñ", "Ñ"], ["Ò", "Ò"], ["Ó", "Ó"], ["Ô", "Ô"], ["Õ", "Õ"], ["Ö", "Ö"], ["Ø", "Ø"], ["Ù", "Ù"], ["Ú", "Ú"], ["Û", "Û"], ["Ü", "Ü"], ["Ý", "Ý"], ["Þ", "Þ"], ["ß", "ß"], ["à", "à"], ["á", "á"], ["â", "â"], ["ã", "ã"], ["ä", "ä"], ["å", "å"], ["æ", "æ"], ["ç", "ç"], ["è", "è"], ["é", "é"], ["ê", "ê"], ["ë", "ë"], ["ì", "ì"], ["í", "í"], ["î", "î"], ["ï", "ï"], ["ð", "ð"], ["ñ", "ñ"], ["ò", "ò"], ["ó", "ó"], ["ô", "ô"], ["õ", "õ"], ["ö", "ö"], ["ø", "ø"], ["ù", "ù"], ["ú", "ú"], ["û", "û"], ["ü", "ü"], ["ý", "ý"], ["þ", "þ"], ["ÿ", "ÿ"]]
def replace(str, replacements)
  replacements.each { |replacement| str.gsub!(replacement[0], replacement[1]) }
  return str
end
my_string = replace(my_string, replacements)
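As an alternative sketch, String#encode accepts a fallback option that is consulted for every character the target encoding cannot represent, which avoids maintaining a replacement table by hand. The example below substitutes HTML numeric character references. Whether the receiving form accepts such references is an assumption that depends on the site.

```ruby
# "őzködik": ő (U+0151) does not exist in ISO-8859-1, while ö does.
text = "\u0151zk\u00F6dik"

# The fallback proc is called once per unrepresentable character and
# returns the string to use in its place.
latin1 = text.encode('ISO-8859-1', fallback: ->(ch) { format('&#%d;', ch.ord) })

# Convert back to UTF-8 only to print the result on a UTF-8 terminal.
puts latin1.encode('UTF-8')
```

If dropping or replacing such characters with a fixed placeholder is acceptable, encode('ISO-8859-1', undef: :replace, replace: '?') is a simpler variant of the same idea.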

Trouble Parsing XML using Ruby XML Parser

I am having trouble parsing some returned XML using this command: XML::Parser.string(xml_string).parse
Here is the XML I am trying to parse:
<div style=\"border:1px solid #990000;padding-left:20px;margin:0 0 10px 0;\">
<h4>A PHP Error was encountered</h4>
<p>Severity: Notice</p>
<p>Message: Undefined index: HTTP_USER_AGENT</p>
<p>Filename: test</p>
<p>Line Number: test</p>
</div><?xml version=\"1.0\" encoding=\"UTF-8\"?>
<response>
<review>
<reviewer><![CDATA[test]]></reviewer>
<ip><![CDATA[test]]></ip>
<rating><![CDATA[test]]></rating>
<content><![CDATA[test.]]></content>
<date><![CDATA[test]]></date>
</review>
</response>
I get this error:
Fatal error: XML declaration allowed only at the start of the document at :10.Fatal error: Extra content at the end of the document at :11.
LibXML::XML::Error: Fatal error: Extra content at the end of the document
What is going on here?
Your string is not a valid XML document; it appears to be two documents concatenated together. (The first one is a "<div>" the second one is a "<response>".)
Try separating them into two strings and parsing each of them separately.
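Here is a minimal sketch of that separation. It assumes the stray markup always precedes the XML declaration, and it uses a shortened, made-up version of the response; the variable names are illustrative.

```ruby
# Hypothetical, shortened version of the response from the question.
raw = "<div><h4>A PHP Error was encountered</h4></div>" \
      "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<response><review/></response>"

# Everything before the XML declaration is the stray error markup.
xml_start  = raw.index('<?xml')
error_html = raw[0...xml_start]
xml_string = raw[xml_start..-1]

puts error_html
puts xml_string
```

Only xml_string would then be passed to XML::Parser.string(...).parse.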
When you are fetching xml_string, I believe you need to set the user agent. You are not providing a user agent so the server serving the XML is choking.
Use this code to add a user agent to your request:
resp = http.post(path, query, {'User-Agent' => "Ruby"})
