Apache NiFi Encode XML to ISO-8859-1

I need to convert the contents of an XML file from UTF-8 to ISO-8859-1, but for the life of me I cannot get it to change.
I tried using ConvertCharacterSet, but the output is unchanged.
The processor is configured as follows:
Input Character Set: UTF-8
Output Character Set: ISO-8859-1
The input is:
<TstH>
<Header>
<ProductCode>LOCAL</ProductCode>
<EffectiveDate>2022-07-18</EffectiveDate>
<AgencyCode>USER1</AgencyCode>
<AgencyPassword>pwd1</AgencyPassword>
<Trace>0</Trace>
</Header>
</TstH>
And my expected output should be:
<TstH>
<Header>
<ProductCode>LOCAL</ProductCode>
<EffectiveDate>2022-07-18</EffectiveDate>
<AgencyCode>USER1</AgencyCode>
<AgencyPassword>pwd1</AgencyPassword>
<Trace>0</Trace>
</Header>
</TstH>
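One thing that may be worth ruling out: the sample above contains only ASCII characters, and ASCII characters are encoded as the same bytes in UTF-8 and ISO-8859-1, so a correct conversion of this particular file would look unchanged. A minimal Ruby sketch of that point (the strings are illustrative, and nothing here involves NiFi itself):

```ruby
# Illustrative strings only; not taken from a NiFi flow.
ascii_xml = "<Trace>0</Trace>"

# ASCII-only text: both encodings produce identical bytes.
same_bytes = ascii_xml.encode("UTF-8").bytes == ascii_xml.encode("ISO-8859-1").bytes

# A non-ASCII character (e with acute accent) is where the two encodings differ.
accented = "\u00E9"
utf8_bytes = accented.encode("UTF-8").bytes
latin1_bytes = accented.encode("ISO-8859-1").bytes

puts same_bytes           # true
puts utf8_bytes.inspect   # [195, 169]
puts latin1_bytes.inspect # [233]
```

Testing the flow with a value that contains an accented character would show whether the processor is actually converting anything.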

Related

Interpret and display different Encodings for email bodies in ruby:

I am using the Email gem in my Rails app, but I am running into some encoding problems.
I am working on a mail that looks like this:
....
Message-ID: <22D41F1A16CD5A5719309A96F8C95D50@vcrfnyjsz>
From: "=?utf-8?B?IOWFqOWbvealvOWHpOWFvOiBjOWwj+WnkOS/oeaBrw==?=" <info@nks-media.ru>
To: ...
...
MIME-Version: 1.0
Content-Type: text/html;
charset="utf-8"
Content-Transfer-Encoding: base64
X-Priority: 5
X-MSMail-Priority: Low
X-Mailer: Microsoft Outlook Express 6.00.2900.5512
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.5512
PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMCBUcmFuc2l0aW9uYWwv
L0VOIj4NCjxIVE1MIHhtbG5zOm8gPSAidXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6
b2ZmaWNlIj48SEVBRD4NCjxNRVRBIGNvbnRlbnQ9InRleHQvaHRtbDsgY2hhcnNldD11dGYtOCIg
aHR0cC1lcXVpdj1Db250ZW50LVR5cGU+DQo8TUVUQSBuYW1lPUdFTkVSQVRPUiBjb250ZW50PSJN
U0hUTUwgOC4wMC42MDAxLjIzNTg4Ij48L0hFQUQ+DQo8Qk9EWSBiZ0NvbG9yPWFxdWE+DQo8UD48
Rk9OVCBjb2xvcj1ncmF5IHNpemU9Nj7lhajlm73lsI/lp5Dkv6Hmga/vvIzlrabnlJ/lprnkv6Hm
ga/vvIzmpbzlh6TlhbzogYzlpbPvvIzoia/lrrbkv6Hmga/vvIzlhbzogYzkv6Hmga/vvIzlpKfk
v53lgaXkv6Hmga88L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6
ICMwMGZmZmYiIGNvbG9yPSMwMGZmZmYgc2l6ZT02PjxBIA0KaHJlZj0iaHR0cDovL3d3dy5obmhu
LmNsdWIveGlueGkuaHRtIj5odGh0dHA6Ly93d3cuaG5obi5jbHViL3hpbnhpLmh0bWh0dHA6Ly93
d3cuaG5obi5jbHViL3hpbnhpLmh0bTwvQT48L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc3R5bGU9IkJB
Q0tHUk9VTkQtQ09MT1I6ICMwMGZmZmYiIGNvbG9yPSMwMGZmZmYgDQpzaXplPTY+PC9GT05UPiZu
YnNwOzwvUD4NCjxQPjxGT05UIGNvbG9yPSM4MDgwODAgc2l6ZT02PjxGT05UIHNpemU9Nj48Rk9O
VCBzdHlsZT0iQkFDS0dST1VORC1DT0xPUjogYXF1YSIgDQpjb2xvcj1hcXVhPuS6uuWPr+S7peaK
oui1sOS7luWUr+S4gOaDs+imgeeahOWls+S6uu+8jOWlueWPquiDveWxnuS6juS7luOAgjwvRk9O
VD48L1A+DQo8UD48Rk9OVCBzaXplPTY+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFx
dWEiIA0KY29sb3I9YXF1YT7jgIDjgIDmlrnnk7fpl63kuIrnnLzmt7HlkLjkuobkuIDlj6PmsJTv
vIznhLblkI7lsIblpbnnmoTmlbTkuKrohLjln4vlnKjog7jliY3jgILku5bojqvmiY7nibnnmoTv
vIzov5nkuKrlppblrb3vvIHnm7TmjqXpl7fmrbvlvpfkuobvvIE8L0ZPTlQ+PC9QPg0KPFA+PEZP
TlQgc2l6ZT02PjxGT05UIHN0eWxlPSJCQUNLR1JPVU5ELUNPTE9SOiBhcXVhIiANCmNvbG9yPWFx
dWE+ICAgICAgICDmuIXmup/miJHnn6XpgZPvvIzmmK/pgqPlj6rlrp7lipvlubPlubPnmoTohb7o
m4fvvIznqYbojbvlj4jmmK/osIHllYrvvJ/mmK/ku5blkI7mnaXmlLbmnI3nmoTlppbprZTlkJfv
vJ88L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc2l6ZT02PjxGT05UIHN0eWxlPSJCQUNLR1JPVU5ELUNP
TE9SOiBhcXVhIiANCmNvbG9yPWFxdWE+ICAgIOWFremBk+WYtOinkua1ruWHuuS4gOaKueivoeW8
gumYtOajrueahOeskeaEj++8muKAnOaNheegtOWkqeOAguKAnTwvRk9OVD48L1A+DQo8UD48Rk9O
VCBzaXplPTY+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFxdWEiIA0KY29sb3I9YXF1
YT4gICAg4oCc6YKj5aW55oCO5LmI5Lya6ZSZ5LqG6Zeo77yf5pei54S25pyJ5LiA77yM5oiR5oCO
5LmI6IO95LiN6K6k5Li66L+Y5Lya5pyJ5LqM77yf5aaC5p6c5oiR6K+05L2g546w5Zyo5q2j5aSE
5LqO5aS06ISR5re35Lmx77yM5oCd6Lev5LiN5riF55qE54q25oCB5LiN6L+H5YiG5ZCn77yf4oCd
5bm06L275rCR6K2m6Zeu55m95Li977yM55m95Li954K55aS05om/6K6k44CCPC9GT05UPjwvUD4N
CjxQPjxGT05UIHNpemU9Nj48Rk9OVCBzdHlsZT0iQkFDS0dST1VORC1DT0xPUjogYXF1YSIgDQpj
b2xvcj1hcXVhPiAgICDmnpfljZfmnKzlsLHmmK/nm5fnlKjkuoblkI7kurrnmoTnn6Xor4bvvIzm
iYDku6XkuZ/msqHku4DkuYjlj6/pqoTlgrLnmoTvvIzkvr/nrJHnnYDku6TkvJfkurrlubPouqvv
vIzlj4jlr7nprY/lvoHpl67pgZPvvJrigJzprY/ljb/lrrblnKjmnJ3loILkuYvkuIrkuJPpl67m
raTkuovvvIzmg7PmnaXlv4XmmK/mnInku4DkuYjmt7HmhI/nvaLvvJ/igJ08L0ZPTlQ+PC9QPg0K
PFA+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFxdWEiIGNvbG9yPWFxdWEgDQpzaXpl
PTY+44CA44CAPC9GT05UPjwvUD48L0ZPTlQ+PC9GT05UPjwvRk9OVD48L0ZPTlQ+PC9GT05UPjwv
Rk9OVD48L0ZPTlQ+PC9CT0RZPjwvSFRNTD4NCg==
(I have omitted the irrelevant parts.)
I fetch it with Ruby's Net::IMAP class and pass it as a string to the
Email.read_from_string
method of the gem.
It returns an object, call it msg. I now call msg.body and get this:
<Mail::Body:0x007f0045976ea8 #boundary=nil, #preamble=nil, #epilogue=nil, #charset="US-ASCII", #part_sort_order=["text/plain", "text/enriched", "text/html"], #parts=[], #raw_source="PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMCBUcmFuc2l0aW9uYWwv\r\nL0VOIj4NCjxIVE1MIHhtbG5zOm8gPSAidXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6\r\nb2ZmaWNlIj48SEVBRD4NCjxNRVRBIGNvbnRlbnQ9InRleHQvaHRtbDsgY2hhcnNldD11dGYtOCIg\r\naHR0cC1lcXVpdj1Db250ZW50LVR5cGU+DQo8TUVUQSBuYW1lPUdFTkVSQVRPUiBjb250ZW50PSJN\r\nU0hUTUwgOC4wMC42MDAxLjIzNTg4Ij48L0hFQUQ+DQo8Qk9EWSBiZ0NvbG9yPWFxdWE+DQo8UD48\r\nRk9OVCBjb2xvcj1ncmF5IHNpemU9Nj7lhajlm73lsI/lp5Dkv6Hmga/vvIzlrabnlJ/lprnkv6Hm\r\nga/vvIzmpbzlh6TlhbzogYzlpbPvvIzoia/lrrbkv6Hmga/vvIzlhbzogYzkv6Hmga/vvIzlpKfk\r\nv53lgaXkv6Hmga88L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6\r\nICMwMGZmZmYiIGNvbG9yPSMwMGZmZmYgc2l6ZT02PjxBIA0KaHJlZj0iaHR0cDovL3d3dy5obmhu\r\nLmNsdWIveGlueGkuaHRtIj5odGh0dHA6Ly93d3cuaG5obi5jbHViL3hpbnhpLmh0bWh0dHA6Ly93\r\nd3cuaG5obi5jbHViL3hpbnhpLmh0bTwvQT48L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc3R5bGU9IkJB\r\nQ0tHUk9VTkQtQ09MT1I6ICMwMGZmZmYiIGNvbG9yPSMwMGZmZmYgDQpzaXplPTY+PC9GT05UPiZu\r\nYnNwOzwvUD4NCjxQPjxGT05UIGNvbG9yPSM4MDgwODAgc2l6ZT02PjxGT05UIHNpemU9Nj48Rk9O\r\nVCBzdHlsZT0iQkFDS0dST1VORC1DT0xPUjogYXF1YSIgDQpjb2xvcj1hcXVhPuS6uuWPr+S7peaK\r\noui1sOS7luWUr+S4gOaDs+imgeeahOWls+S6uu+8jOWlueWPquiDveWxnuS6juS7luOAgjwvRk9O\r\nVD48L1A+DQo8UD48Rk9OVCBzaXplPTY+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFx\r\ndWEiIA0KY29sb3I9YXF1YT7jgIDjgIDmlrnnk7fpl63kuIrnnLzmt7HlkLjkuobkuIDlj6PmsJTv\r\nvIznhLblkI7lsIblpbnnmoTmlbTkuKrohLjln4vlnKjog7jliY3jgILku5bojqvmiY7nibnnmoTv\r\nvIzov5nkuKrlppblrb3vvIHnm7TmjqXpl7fmrbvlvpfkuobvvIE8L0ZPTlQ+PC9QPg0KPFA+PEZP\r\nTlQgc2l6ZT02PjxGT05UIHN0eWxlPSJCQUNLR1JPVU5ELUNPTE9SOiBhcXVhIiANCmNvbG9yPWFx\r\ndWE+ICAgICAgICDmuIXmup/miJHnn6XpgZPvvIzmmK/pgqPlj6rlrp7lipvlubPlubPnmoTohb7o\r\nm4fvvIznqYbojbvlj4jmmK/osIHllYrvvJ/mmK/ku5blkI7mnaXmlLbmnI3nmoTlppbprZTlkJfv\r\nvJ88L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc2l6ZT02PjxGT05UIHN0eWxlPS
JCQUNLR1JPVU5ELUNP\r\nTE9SOiBhcXVhIiANCmNvbG9yPWFxdWE+ICAgIOWFremBk+WYtOinkua1ruWHuuS4gOaKueivoeW8\r\ngumYtOajrueahOeskeaEj++8muKAnOaNheegtOWkqeOAguKAnTwvRk9OVD48L1A+DQo8UD48Rk9O\r\nVCBzaXplPTY+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFxdWEiIA0KY29sb3I9YXF1\r\nYT4gICAg4oCc6YKj5aW55oCO5LmI5Lya6ZSZ5LqG6Zeo77yf5pei54S25pyJ5LiA77yM5oiR5oCO\r\n5LmI6IO95LiN6K6k5Li66L+Y5Lya5pyJ5LqM77yf5aaC5p6c5oiR6K+05L2g546w5Zyo5q2j5aSE\r\n5LqO5aS06ISR5re35Lmx77yM5oCd6Lev5LiN5riF55qE54q25oCB5LiN6L+H5YiG5ZCn77yf4oCd\r\n5bm06L275rCR6K2m6Zeu55m95Li977yM55m95Li954K55aS05om/6K6k44CCPC9GT05UPjwvUD4N\r\nCjxQPjxGT05UIHNpemU9Nj48Rk9OVCBzdHlsZT0iQkFDS0dST1VORC1DT0xPUjogYXF1YSIgDQpj\r\nb2xvcj1hcXVhPiAgICDmnpfljZfmnKzlsLHmmK/nm5fnlKjkuoblkI7kurrnmoTnn6Xor4bvvIzm\r\niYDku6XkuZ/msqHku4DkuYjlj6/pqoTlgrLnmoTvvIzkvr/nrJHnnYDku6TkvJfkurrlubPouqvv\r\nvIzlj4jlr7nprY/lvoHpl67pgZPvvJrigJzprY/ljb/lrrblnKjmnJ3loILkuYvkuIrkuJPpl67m\r\nraTkuovvvIzmg7PmnaXlv4XmmK/mnInku4DkuYjmt7HmhI/nvaLvvJ/igJ08L0ZPTlQ+PC9QPg0K\r\nPFA+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFxdWEiIGNvbG9yPWFxdWEgDQpzaXpl\r\nPTY+44CA44CAPC9GT05UPjwvUD48L0ZPTlQ+PC9GT05UPjwvRk9OVD48L0ZPTlQ+PC9GT05UPjwv\r\nRk9OVD48L0ZPTlQ+PC9CT0RZPjwvSFRNTD4NCg==\r\n\r\n\r\n", #encoding="base64">
so everything seems right.
I do:
msg.body.encoding # returns "Base64"
and that is right again, but here is the strange part. When I do:
msg.body.only_us_ascii? # returns true
Shouldn't this be false? The charset in the Content-Type header of the email is 'utf-8'.
In fact, if I try
msg.body.decoded
I get:
"<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">\r\n<HTML xmlns:o = \"urn:schemas-microsoft-com:office:office\"><HEAD>\r\n<META content=\"text/html; charset=utf-8\" http-equiv=Content-Type>\r\n<META name=GENERATOR content=\"MSHTML 8.00.6001.23588\"></HEAD>\r\n<BODY bgColor=aqua>\r\n<P><FONT color=gray size=6>\xE5\x85\xA8\xE5\x9B\xBD\xE5\xB0\x8F\xE5\xA7\x90\xE4\xBF\xA1\xE6\x81\xAF\xEF\xBC\x8C\xE5\xAD\xA6\xE7\x94\x9F\xE5\xA6\xB9\xE4\xBF\xA1\xE6\x81\xAF\xEF\xBC\x8C\xE6\xA5\xBC\xE5\x87\xA4\xE5\x85\xBC\xE8\x81\x8C\xE5\xA5\xB3\xEF\xBC\x8C\xE8\x89\xAF\xE5\xAE\xB6\xE4\xBF\xA1\xE6\x81\xAF\xEF\xBC\x8C\xE5\x85\xBC\xE8\x81\x8C\xE4\xBF\xA1\xE6\x81\xAF\xEF\xBC\x8C\xE5\xA4\xA7\xE4\xBF\x9D\xE5\x81\xA5\xE4\xBF\xA1\xE6\x81\xAF</FONT></P>\r\n<P><FONT style=\"BACKGROUND-COLOR: #00ffff\" color=#00ffff size=6><A \r\nhref=\"http://www.hnhn.club/xinxi.htm\">hthttp://www.hnhn.club/xinxi.htmhttp://www.hnhn.club/xinxi.htm</A></FONT></P>\r\n<P><FONT style=\"BACKGROUND-COLOR: #00ffff\" color=#00ffff \r\nsize=6></FONT> </P>\r\n<P><FONT color=#808080 size=6><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua>\xE4\xBA\xBA\xE5\x8F\xAF\xE4\xBB\xA5\xE6\x8A\xA2\xE8\xB5\xB0\xE4\xBB\x96\xE5\x94\xAF\xE4\xB8\x80\xE6\x83\xB3\xE8\xA6\x81\xE7\x9A\x84\xE5\xA5\xB3\xE4\xBA\xBA\xEF\xBC\x8C\xE5\xA5\xB9\xE5\x8F\xAA\xE8\x83\xBD\xE5\xB1\x9E\xE4\xBA\x8E\xE4\xBB\x96\xE3\x80\x82</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" 
\r\ncolor=aqua>\xE3\x80\x80\xE3\x80\x80\xE6\x96\xB9\xE7\x93\xB7\xE9\x97\xAD\xE4\xB8\x8A\xE7\x9C\xBC\xE6\xB7\xB1\xE5\x90\xB8\xE4\xBA\x86\xE4\xB8\x80\xE5\x8F\xA3\xE6\xB0\x94\xEF\xBC\x8C\xE7\x84\xB6\xE5\x90\x8E\xE5\xB0\x86\xE5\xA5\xB9\xE7\x9A\x84\xE6\x95\xB4\xE4\xB8\xAA\xE8\x84\xB8\xE5\x9F\x8B\xE5\x9C\xA8\xE8\x83\xB8\xE5\x89\x8D\xE3\x80\x82\xE4\xBB\x96\xE8\x8E\xAB\xE6\x89\x8E\xE7\x89\xB9\xE7\x9A\x84\xEF\xBC\x8C\xE8\xBF\x99\xE4\xB8\xAA\xE5\xA6\x96\xE5\xAD\xBD\xEF\xBC\x81\xE7\x9B\xB4\xE6\x8E\xA5\xE9\x97\xB7\xE6\xAD\xBB\xE5\xBE\x97\xE4\xBA\x86\xEF\xBC\x81</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua> \xE6\xB8\x85\xE6\xBA\x9F\xE6\x88\x91\xE7\x9F\xA5\xE9\x81\x93\xEF\xBC\x8C\xE6\x98\xAF\xE9\x82\xA3\xE5\x8F\xAA\xE5\xAE\x9E\xE5\x8A\x9B\xE5\xB9\xB3\xE5\xB9\xB3\xE7\x9A\x84\xE8\x85\xBE\xE8\x9B\x87\xEF\xBC\x8C\xE7\xA9\x86\xE8\x8D\xBB\xE5\x8F\x88\xE6\x98\xAF\xE8\xB0\x81\xE5\x95\x8A\xEF\xBC\x9F\xE6\x98\xAF\xE4\xBB\x96\xE5\x90\x8E\xE6\x9D\xA5\xE6\x94\xB6\xE6\x9C\x8D\xE7\x9A\x84\xE5\xA6\x96\xE9\xAD\x94\xE5\x90\x97\xEF\xBC\x9F</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua> \xE5\x85\xAD\xE9\x81\x93\xE5\x98\xB4\xE8\xA7\x92\xE6\xB5\xAE\xE5\x87\xBA\xE4\xB8\x80\xE6\x8A\xB9\xE8\xAF\xA1\xE5\xBC\x82\xE9\x98\xB4\xE6\xA3\xAE\xE7\x9A\x84\xE7\xAC\x91\xE6\x84\x8F\xEF\xBC\x9A\xE2\x80\x9C\xE6\x8D\x85\xE7\xA0\xB4\xE5\xA4\xA9\xE3\x80\x82\xE2\x80\x9D</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua> 
\xE2\x80\x9C\xE9\x82\xA3\xE5\xA5\xB9\xE6\x80\x8E\xE4\xB9\x88\xE4\xBC\x9A\xE9\x94\x99\xE4\xBA\x86\xE9\x97\xA8\xEF\xBC\x9F\xE6\x97\xA2\xE7\x84\xB6\xE6\x9C\x89\xE4\xB8\x80\xEF\xBC\x8C\xE6\x88\x91\xE6\x80\x8E\xE4\xB9\x88\xE8\x83\xBD\xE4\xB8\x8D\xE8\xAE\xA4\xE4\xB8\xBA\xE8\xBF\x98\xE4\xBC\x9A\xE6\x9C\x89\xE4\xBA\x8C\xEF\xBC\x9F\xE5\xA6\x82\xE6\x9E\x9C\xE6\x88\x91\xE8\xAF\xB4\xE4\xBD\xA0\xE7\x8E\xB0\xE5\x9C\xA8\xE6\xAD\xA3\xE5\xA4\x84\xE4\xBA\x8E\xE5\xA4\xB4\xE8\x84\x91\xE6\xB7\xB7\xE4\xB9\xB1\xEF\xBC\x8C\xE6\x80\x9D\xE8\xB7\xAF\xE4\xB8\x8D\xE6\xB8\x85\xE7\x9A\x84\xE7\x8A\xB6\xE6\x80\x81\xE4\xB8\x8D\xE8\xBF\x87\xE5\x88\x86\xE5\x90\xA7\xEF\xBC\x9F\xE2\x80\x9D\xE5\xB9\xB4\xE8\xBD\xBB\xE6\xB0\x91\xE8\xAD\xA6\xE9\x97\xAE\xE7\x99\xBD\xE4\xB8\xBD\xEF\xBC\x8C\xE7\x99\xBD\xE4\xB8\xBD\xE7\x82\xB9\xE5\xA4\xB4\xE6\x89\xBF\xE8\xAE\xA4\xE3\x80\x82</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua> \xE6\x9E\x97\xE5\x8D\x97\xE6\x9C\xAC\xE5\xB0\xB1\xE6\x98\xAF\xE7\x9B\x97\xE7\x94\xA8\xE4\xBA\x86\xE5\x90\x8E\xE4\xBA\xBA\xE7\x9A\x84\xE7\x9F\xA5\xE8\xAF\x86\xEF\xBC\x8C\xE6\x89\x80\xE4\xBB\xA5\xE4\xB9\x9F\xE6\xB2\xA1\xE4\xBB\x80\xE4\xB9\x88\xE5\x8F\xAF\xE9\xAA\x84\xE5\x82\xB2\xE7\x9A\x84\xEF\xBC\x8C\xE4\xBE\xBF\xE7\xAC\x91\xE7\x9D\x80\xE4\xBB\xA4\xE4\xBC\x97\xE4\xBA\xBA\xE5\xB9\xB3\xE8\xBA\xAB\xEF\xBC\x8C\xE5\x8F\x88\xE5\xAF\xB9\xE9\xAD\x8F\xE5\xBE\x81\xE9\x97\xAE\xE9\x81\x93\xEF\xBC\x9A\xE2\x80\x9C\xE9\xAD\x8F\xE5\x8D\xBF\xE5\xAE\xB6\xE5\x9C\xA8\xE6\x9C\x9D\xE5\xA0\x82\xE4\xB9\x8B\xE4\xB8\x8A\xE4\xB8\x93\xE9\x97\xAE\xE6\xAD\xA4\xE4\xBA\x8B\xEF\xBC\x8C\xE6\x83\xB3\xE6\x9D\xA5\xE5\xBF\x85\xE6\x98\xAF\xE6\x9C\x89\xE4\xBB\x80\xE4\xB9\x88\xE6\xB7\xB1\xE6\x84\x8F\xE7\xBD\xA2\xEF\xBC\x9F\xE2\x80\x9D</FONT></P>\r\n<P><FONT style=\"BACKGROUND-COLOR: aqua\" color=aqua \r\nsize=6>\xE3\x80\x80\xE3\x80\x80</FONT></P></FONT></FONT></FONT></FONT></FONT></FONT></FONT></BODY></HTML>\r\n"
The result is not UTF-8 as I expected but ASCII-8BIT, and I don't know how to use it or display it in the browser.
Any help?
It's Base64, as revealed by msg.body.encoding # returns "Base64". I'm no email format expert, but I'd guess the Base64 nature of the body was revealed in some header you didn't include in your paste. (After all, msg.body.encoding must be getting it from somewhere.)
Base64 isn't actually a character encoding like UTF-8. It is instead a conversion of binary data to ASCII.
I think it's unfortunate that the Email gem doesn't take care of this for you.
But if what you have is Base64, you can decode it using the stdlib Base64 module.
data = Base64.decode64(msg.body.raw_source)
However, if it's Base64-encoded in the first place, what comes out the other side might not be plain text but some kind of binary file format (an MS Word document? I dunno), so it might still not make sense when read directly, even once decoded.
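A minimal sketch of the decoding step, assuming the decoded bytes really are UTF-8 text as the charset header in the question states. The sample string is a hypothetical stand-in for the message body, not the email from the question. It also shows why an ASCII-only check on the raw body is true: the Base64 text itself contains only ASCII characters.

```ruby
require "base64"

# Hypothetical stand-in for an HTML body containing non-ASCII text.
html = "<P>\u5168\u56FD</P>\r\n"
raw_body = Base64.encode64(html)

# The Base64 representation is plain ASCII, whatever the payload is.
ascii_only = raw_body.ascii_only?

# Decoding yields raw bytes tagged as binary (ASCII-8BIT).
decoded = Base64.decode64(raw_body)
binary_encoding = decoded.encoding

# Relabel the bytes as UTF-8 (no bytes change), then check they are valid.
text = decoded.dup.force_encoding(Encoding::UTF_8)

puts ascii_only            # true
puts text.valid_encoding?  # true
puts text == html          # true
```

force_encoding only changes the label on the string, so it is appropriate when the bytes are already in the target encoding; encode would be the call for an actual conversion.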

Python 3 - GeoPy and encoding

I'm using DictWriter to write a dictionary to a csv after some geolocation work.
location = geolocator.reverse(coords)
row["address"] = location.address
writer.writerow(row)
Which generates this:
File "C:\bin64\python\3.4.3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200e' in
position 118: character maps to <undefined>
My problem was in how I was opening the file. I suppose I should have posted that in the question. I needed to set the encoding upon opening the file.
with open('results.csv', mode='w', encoding='utf-8', newline='') as file:
    ...

Nokogiri removing xml encoding

I am using Nokogiri to parse some XML. This XML has some HTML as values. I am seeing some strange behavior when parsing it: Nokogiri appears to remove some of the HTML-encoded tags, so when I parse the HTML I am unable to decode it properly. See the examples below:
doc = Nokogiri::XML '<?xml version="1.0"?><manifest
xmlns="http://www.imsglobal.org/xsd/imscp_v1p1"
identifier="Manifest-eaf97d26-aa83-4399-8e9b-ae9f6f5fc6a2"
xmlns="http://www.imsglobal.org/xsd/imscp_v1p1"
xmlns:imsmd="http://www.imsglobal.org/xsd/imsmd_v1p2"
xmlns:imsqti="http://www.imsglobal.org/xsd/imsqti_v2p1">
<imsmd:langstring>&lt;p&gt;
 These are the&lt;strong&gt;instructions&lt;/strong&gt; for the pool&lt;/p&gt;</imsmd:langstring>'
This yields the following value:
"<?xml version=\"1.0\"?>\n<manifest xmlns=\"http://www.imsglobal.org/xsd/imscp_v1p1\" xmlns:imsmd=\"http://www.imsglobal.org/xsd/imsmd_v1p2\" xmlns:imsqti=\"http://www.imsglobal.org/xsd/imsqti_v2p1\" identifier=\"Manifest-eaf97d26-aa83-4399-8e9b-ae9f6f5fc6a2\">\n<imsmd:langstring>p
 These are thestrong instructions/strong for the pool/p</imsmd:langstring></manifest>\n"
Notice how the &lt; and &gt; entities are missing. However, the following works as expected:
doc = Nokogiri::XML '<?xml version="1.0"?><imsmd:langstring>&lt;p&gt;
 These are the&lt;strong&gt; instructions&lt;/strong&gt; for the pool&lt;/p&gt;</imsmd:langstring>'
and gives the following result:
"<?xml version=\"1.0\"?>\n<imsmd:langstring>&lt;p&gt;
 These are the&lt;strong&gt; instructions&lt;/strong&gt; for the pool&lt;/p&gt;</imsmd:langstring>\n"
I am sure I am missing something but can't figure out what is causing this.

Testing Nokogiri XML generation with blank nodes

I'm having a bit of trouble testing some XML generation using Nokogiri when the node is blank. I'm using Minitest to compare the generated XML string with a template fixture file. My test fails with the blank node as Minitest is comparing <Node></Node> with <Node />.
XML Generation
builder = Nokogiri::XML::Builder.new encoding: "UTF-8" do |xml|
  xml.Header
  xml.FileName @object.filename
end
Template file
This is the file I'm using as a fixture in my tests
<?xml version="1.0" encoding="UTF-8"?>
<Header/>
<FileName></FileName>
Minitest output
3) Failure:
--- expected
+++ actual
@@ -25,7 +25,7 @@
<Header />
- <FileName/>
+ <FileName></FileName>
As you can see, Minitest is comparing a self-closing tag with a non-self-closing tag, which makes the test fail. Strangely, changing the fixture tag to a self-closing one results in exactly the same error message.
This happens because @object.filename is sometimes nil. If I have a blank XML node (as in xml.Header above), using a self-closing tag in my fixture works without a problem.
I would use XML schema in this case:
def test_that_xml_data_conforms_to_schema
  xml_data = ...
  schema_data = ...
  fragment = Nokogiri::XML.parse(xml_data)
  schema = Nokogiri::XML::Schema(schema_data)
  assert schema.valid?(fragment)
end

Using ruby SAX parsers for GB2312 encoded xml

Good day,
I have a lot of big XML files that I need to parse, but the problem is that they use the 'gb2312' encoding. I would normally use a SAX parser for this.
Here is an example of the XML:
<?xml version="1.0" encoding="gb2312"?>
<Root>
<ValueList Count="112290" FieldCount="11">
<Item1 Value1="23743" Value2="Дипломатия � Пустой кувшин" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item2 Value1="6611" Value2="ДЛ � 018 омела � золотой кинжал" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item3 Value1="6608" Value2="Наука (ДЛ)�круг фей 021�тяпка" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item4 Value1="6612" Value2="Знаки ДЛ � 003руны � разрушение" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
....
</Root>
I'm trying to use Nokogiri SAX (also tried libxml-ruby with same result) parser:
require 'nokogiri'

class SchemaParser < Nokogiri::XML::SAX::Document
  def initialize
    @cnt = 0
  end

  # Count and print every <Item1> start tag.
  def start_element(name, attrs = [])
    if name == "Item1"
      @cnt += 1
      puts @cnt
    end
  end
end

parser = Nokogiri::XML::SAX::Parser.new(SchemaParser.new)
parser.parse_io(File.open('2_4_EQUIPMENT_ESSENCE.xml'), 'gb2312')
But this raises the error "`check_encoding': 'GB2312' is not a valid encoding (ArgumentError)". If I remove the encoding declaration and let Nokogiri detect the encoding itself, I get this error:
encoding error : input conversion failed due to input error, bytes 0xA8 0x43 0x20 0xA7
encoding error : input conversion failed due to input error, bytes 0xA8 0x43 0x20 0xA7
I/O error : encoder error
I also tried to open File with proper encoding, but that didn't help SAX parser:
[3] pry(main)> f = File.open('2_4_EQUIPMENT_ESSENCE.xml', "r:gb2312")
=> #<File:2_4_EQUIPMENT_ESSENCE.xml>
[4] pry(main)> f.external_encoding.name
=> "GB2312"
Did anyone use 'gb2312' encoding with SAX parsers in ruby? Any recommendations how to proceed?
It seems the issue is that libxml2 does not support the GB2312 encoding (see the libxml2 documentation for the list of supported encodings).
I'm not sure if you have tried this, but I think you can work around this by removing the encoding declaration from the XML files (so Libxml2 does not try to transcode the data) and set the external encoding of the File object to GB2312, because then Ruby will transcode the file to UTF-8 as it is read, and from then on everything will remain as UTF-8.
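As a rough sketch of the "transcode while reading" idea, Ruby's file mode string accepts an external and an internal encoding. The example below writes a small hypothetical sample file in GB18030 and reads it back as UTF-8. The helper name read_as_utf8 and the sample content are assumptions for illustration, and the result is not passed to Nokogiri here.

```ruby
require "tempfile"

# Hypothetical helper: read a file stored in source_encoding and
# return its content transcoded to UTF-8.
def read_as_utf8(path, source_encoding)
  File.open(path, "r:#{source_encoding}:UTF-8", &:read)
end

expected = "<Item Value=\"\u4E2D\u6587\"/>"

content = Tempfile.create(["sample", ".xml"]) do |tmp|
  tmp.binmode
  # Write the sample as GB18030 bytes, standing in for a real data file.
  tmp.write(expected.encode("GB18030"))
  tmp.flush
  read_as_utf8(tmp.path, "GB18030")
end

puts content.encoding     # UTF-8
puts content == expected  # true
```

If the file contains bytes that are invalid in the declared source encoding, the read raises an encoding error rather than silently producing garbage, which can help confirm whether the declared encoding is the right one.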
So, here is my workaround.
Problems:
Some of the characters in the XML files are not valid in 'gb2312'. I found that 'GB18030' is a better choice, since it covers the full range of Chinese characters.
I converted all the XML files to UTF-8 so that I can use the SAX parser.
I ended up with this rake task:
desc "convert chinese xml files to utf-8"
task :convert do
  rm_rf 'data/utf8'
  mkdir 'data/utf8'
  Dir.foreach('data') do |f|
    if f.end_with?('.xml')
      puts "converted:: data/utf8/#{f}" if system("iconv -f GB18030 -t UTF-8 data/#{f} > data/utf8/#{f}")
    end
  end
  # replace the encoding declaration in the converted XML files
  system("bundle exec ruby -pi -e \"gsub(/gb2312/, 'UTF-8')\" data/utf8/*.xml")
end
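The same two steps (transcode, then rewrite the declaration) can also be sketched in plain Ruby without shelling out to iconv. This is an illustrative alternative, not the approach used above. The function name gb_xml_to_utf8 is hypothetical, and it assumes the whole document fits in memory, which may not hold for very large files.

```ruby
# Hypothetical helper: takes the raw bytes of a GB18030-encoded XML document
# and returns UTF-8 text with the encoding declaration updated to match.
def gb_xml_to_utf8(bytes)
  text = bytes.dup.force_encoding("GB18030").encode("UTF-8")
  text.sub(/encoding="gb2312"/i, 'encoding="UTF-8"')
end

# Illustrative input, built in memory instead of being read from data/*.xml.
original = %(<?xml version="1.0" encoding="gb2312"?><Root Value="\u4E2D"/>)
raw_bytes = original.encode("GB18030").b

converted = gb_xml_to_utf8(raw_bytes)

puts converted.encoding  # UTF-8
puts converted == %(<?xml version="1.0" encoding="UTF-8"?><Root Value="\u4E2D"/>)  # true
```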
