Problems with text/csv Content-Encoding = UTF-8 in Ruby Mechanize - ruby

When attempting to load a CSV page that is served with a Content-Encoding of UTF-8, using Mechanize v2.5.1, I used the following code:

    a.content_encoding_hooks << lambda { |httpagent, uri, response, body_io|
      response['Content-Encoding'] = 'none' if response['Content-Encoding'].to_s == 'UTF-8'
    }

    p4 = a.get(redirect_url, nil, ['accept-encoding' => 'UTF-8'])
but I find that the content encoding hook is not being called and I get the following error and traceback:
/Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:787:in 'response_content_encoding': unsupported content-encoding: UTF-8 (Mechanize::Error)
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:274:in 'fetch'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:949:in 'response_redirect'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:299:in 'fetch'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:949:in 'response_redirect'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:299:in 'fetch'
from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize.rb:407:in 'get'
from prototype/test1.rb:307:in `<main>'
Does anyone have an idea why the content hook code is not firing and why I am getting the error?

but I find that the content encoding hook is not being called
What makes you think that?
The error message references this code:
    def response_content_encoding response, body_io
      ...

      out_io = case response['Content-Encoding']
               when nil, 'none', '7bit', "" then
                 body_io
               when 'deflate' then
                 content_encoding_inflate body_io
               when 'gzip', 'x-gzip' then
                 content_encoding_gunzip body_io
               else
                 raise Mechanize::Error,
                       "unsupported content-encoding: #{response['Content-Encoding']}"
So Mechanize only recognizes the content encodings nil, '' (empty), 'none', '7bit', 'deflate', 'gzip', and 'x-gzip'.
From the HTTP/1.1 spec:
14.11 Content-Encoding
The Content-Encoding entity-header field is used as a modifier to the
media-type. When present, its value indicates what additional content
codings have been applied to the entity-body, and thus what decoding
mechanisms must be applied in order to obtain the media-type
referenced by the Content-Type header field. Content-Encoding is
primarily used to allow a document to be compressed without losing the
identity of its underlying media type.
Content-Encoding = "Content-Encoding" ":" 1#content-coding
Content codings are defined in section 3.5. An example of its use is
Content-Encoding: gzip
The content-coding is a characteristic of the entity identified by the
Request-URI. Typically, the entity-body is stored with this encoding
and is only decoded before rendering or analogous usage. However, a
non-transparent proxy MAY modify the content-coding if the new coding
is known to be acceptable to the recipient, unless the "no-transform"
cache-control directive is present in the message.
...
...
3.5 Content Codings
Content coding values indicate an encoding transformation that has
been or can be applied to an entity. Content codings are primarily
used to allow a document to be compressed or otherwise usefully
transformed without losing the identity of its underlying media type
and without loss of information. Frequently, the entity is stored in
coded form, transmitted directly, and only decoded by the recipient.
content-coding = token
All content-coding values are case-insensitive. HTTP/1.1 uses
content-coding values in the Accept-Encoding (section 14.3) and
Content-Encoding (section 14.11) header fields. Although the value
describes the content-coding, what is more important is that it
indicates what decoding mechanism will be required to remove the
encoding.
The Internet Assigned Numbers Authority (IANA) acts as a registry for
content-coding value tokens. Initially, the registry contains the
following tokens:
gzip
    An encoding format produced by the file compression program
    "gzip" (GNU zip) as described in RFC 1952 [25]. This format is a
    Lempel-Ziv coding (LZ77) with a 32 bit CRC.

compress
    The encoding format produced by the common UNIX file compression
    program "compress". This format is an adaptive
    Lempel-Ziv-Welch coding (LZW).

    Use of program names for the identification of encoding formats
    is not desirable and is discouraged for future encodings. Their
    use here is representative of historical practice, not good
    design. For compatibility with previous implementations of HTTP,
    applications SHOULD consider "x-gzip" and "x-compress" to be
    equivalent to "gzip" and "compress" respectively.

deflate
    The "zlib" format defined in RFC 1950 [31] in combination with
    the "deflate" compression mechanism described in RFC 1951 [29].

identity
    The default (identity) encoding; the use of no transformation
    whatsoever. This content-coding is used only in the
    Accept-Encoding header, and SHOULD NOT be used in the
    Content-Encoding header.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5
In other words, an HTTP content encoding has nothing to do with ASCII vs. UTF-8 vs. Latin-1.
In addition the source code for Mechanize::HTTP::Agent has this in it:
# A list of hooks to call after retrieving a response. Hooks are called with
# the agent and the response returned.
attr_reader :post_connect_hooks
# A list of hooks to call before making a request. Hooks are called with
# the agent and the request to be performed.
attr_reader :pre_connect_hooks
# A list of hooks to call to handle the content-encoding of a request.
attr_reader :content_encoding_hooks
So it doesn't even look like you are calling the right hook.
Here is an example I got to work:
    require 'mechanize'

    a = Mechanize.new
    p a.content_encoding_hooks

    func = lambda do |a, uri, resp, body_io|
      puts body_io.read
      puts "The Content-Encoding is: #{resp['Content-Encoding']}"

      if resp['Content-Encoding'].to_s == 'UTF-8'
        resp['Content-Encoding'] = 'none'
      end

      puts "The Content-Encoding is now: #{resp['Content-Encoding']}"
    end

    a.content_encoding_hooks << func

    a.get(
      'http://localhost:8080/cgi-bin/myprog.rb',
      [],
      nil,
      "Accept-Encoding" => 'gzip, deflate'  # This is what Firefox always uses
    )
myprog.rb:
    #!/usr/bin/env ruby
    require 'cgi'

    cgi = CGI.new('html3')

    headers = {
      "type" => 'text/html',
      "Content-Encoding" => "UTF-8",
    }

    cgi.out(headers) do
      cgi.html() do
        cgi.head { cgi.title { "Content-Encoding Test" } } +
        cgi.body() do
          cgi.div() { "The Accept-Encoding was: #{cgi.accept_encoding}" }
        end
      end
    end
--output:--
[]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><HTML><HEAD><TITLE>Content-Encoding Test</TITLE></HEAD><BODY><DIV>The Accept-Encoding was: gzip, deflate</DIV></BODY></HTML>
The Content-Encoding is: UTF-8
The Content-Encoding is now: none
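For completeness, the normalization the hook performs can be exercised without any network traffic. This is a minimal sketch using a plain Hash to stand in for the response object Mechanize passes to the hook:

```ruby
# Sketch: the hook's normalization logic in isolation. In Mechanize the
# fourth lambda argument is the response body IO; a plain Hash stands in
# for the response here purely for illustration.
normalize = lambda do |_agent, _uri, response, _body_io|
  # Treat a charset name in Content-Encoding as "no transformation applied".
  response['Content-Encoding'] = 'none' if
    response['Content-Encoding'].to_s.casecmp('UTF-8').zero?
end

response = { 'Content-Encoding' => 'UTF-8' }
normalize.call(nil, nil, response, nil)
puts response['Content-Encoding'] # => none
```

Registering such a lambda via `agent.content_encoding_hooks << normalize` before the first `get` would apply the same rewrite to real responses.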

Related

Google Cloud DLP - CSV inspection

I'm trying to inspect a CSV file and there are no findings being returned (I'm using the EMAIL_ADDRESS info type and the addresses I'm using are coming up with positive hits here: https://cloud.google.com/dlp/demo/#!/). I'm sending the CSV file into inspect_content with a byte_item as follows:
    byte_item: {
      type: :CSV,
      data: File.open('/xxxxx/dlptest.csv', 'r').read
    }
In looking at the supported file types, it looks like CSV/TSV files are inspected via Structured Parsing.
For CSV/TSV, does that mean one can't just send in the file and needs to use the table attribute instead of byte_item, as per https://cloud.google.com/dlp/docs/inspecting-structured-text?
What about XLSX files, for example? They're an unspecified file type, so I tried a configuration like this, but it still returned no findings:

    byte_item: {
      type: :BYTES_TYPE_UNSPECIFIED,
      data: File.open('/xxxxx/dlptest.xlsx', 'rb').read
    }
I'm able to do inspection and redaction with images and text fine, but am having a bit of a problem with other file types. Any ideas/suggestions welcome! Thanks!
Edit: The contents of the CSV in question:
$ cat ~/Downloads/dlptest.csv
dylans#gmail.com,anotehu,steve#example.com
blah blah,anoteuh,
aonteuh,
$ file ~/Downloads/dlptest.csv
~/Downloads/dlptest.csv: ASCII text, with CRLF line terminators
The full request:
    parent = "projects/xxxxxxxx/global"

    inspect_config = {
      info_types: [{ name: "EMAIL_ADDRESS" }],
      min_likelihood: :POSSIBLE,
      limits: { max_findings_per_request: 0 },
      include_quote: true
    }

    request = {
      parent: parent,
      inspect_config: inspect_config,
      item: {
        byte_item: {
          type: :CSV,
          data: File.open('/xxxxx/dlptest.csv', 'r').read
        }
      }
    }

    dlp = Google::Cloud::Dlp.dlp_service
    response = dlp.inspect_content(request)
The CSV file I was testing with was something I created using Google Sheets and exported as a CSV; however, the file showed locally as "text/plain; charset=us-ascii". I downloaded a CSV off the internet and it had a mime type of "text/csv; charset=utf-8". That is the one that worked. So it looks like my issue was specifically due to the file having an incorrect mime type.
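If you do need the structured route, the shape of a table item can be built from a parsed CSV with plain Ruby. A sketch (the hash layout follows the DLP v2 ContentItem/Table messages; the CSV contents here are made up):

```ruby
require 'csv'

csv_text = "email,notes\nsomeone@example.com,blah blah\n"
header, *rows = CSV.parse(csv_text)

# DLP table item: headers are FieldId objects, rows are lists of Values.
table_item = {
  table: {
    headers: header.map { |h| { name: h } },
    rows: rows.map { |r| { values: r.map { |v| { string_value: v.to_s } } } }
  }
}

# table_item would then be passed as `item:` in the inspect_content request
# in place of the byte_item hash shown above.
```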
xlsx is not yet supported. Coming soon. (Maybe that part of the question should be split out from the CSV debugging issue.)

API Blueprint and Dredd - Required field missing from response, but tests still pass

I am using a combination of API Blueprint and Dredd to test an API my application is dependent on. I am using attributes in API blueprint to define the structure of the response's body.
Apparently I'm missing something though, because the tests always pass even though I've purposefully defined a fake "required" parameter that I know is missing from the API's response. It seems that Dredd is only testing whether the type of the response body is correct (array), rather than the type and the parameters within it.
My API Blueprint file:
FORMAT: 1A
HOST: http://somehost.net
# API Title
## Endpoints [GET /endpoint/{date}]
+ Parameters
    + date: `2016-09-01` (string, required) - Date

+ Response 200 (application/json; charset=utf-8)
    + Attributes (array[Data])
## Data Structures
### Data
- realParameter: 2432432 (number)
- realParameter2: `some string` (string, required)
- realParameter3: `Something else` (string, required)
- realParameter4: 1 (number, required)
- fakeParam: 1 (number, required)
The response body:
    [
      {
        "realParameter": 31,
        "realParameter2": "some value",
        "realParameter3": "another value",
        "realParameter4": 8908
      },
      {
        "realParameter": 54,
        "realParameter2": "something here",
        "realParameter3": "and here too",
        "realParameter4": 6589
      }
    ]
And my Dredd config file:
reporter: apiary
custom:
  apiaryApiKey: somekey
  apiaryApiName: somename
dry-run: null
hookfiles: null
language: nodejs
sandbox: false
server: null
server-wait: 3
init: false
names: false
only: []
output: []
header: []
sorted: false
user: null
inline-errors: false
details: false
method: []
color: true
level: info
timestamp: false
silent: false
path: []
blueprint: myApiBlueprintFile.apib
endpoint: 'http://ahost.com'
Does anyone have any idea why Dredd ignores the fact that "fakeParameter" doesn't actually show up in the response body and still allows the test to pass?
You've run into a limitation of MSON, the language API Blueprint uses for describing attributes. In many cases, MSON describes what MAY be present in the data structure rather than what MUST exactly be present.
The most prominent case is arrays, where basically any content of the array is optional, and thus the underlying generated JSON Schema doesn't put any constraints on array contents. Dredd just respects that, so indirectly it becomes a Dredd issue too; however, there's not much Dredd can do about it.
There's an issue for the problem: apiaryio/mson#66. You can follow and comment on the issue to get updates about this. Dredd is usually very prompt in picking up the latest API Blueprint parser, so once it's implemented in the language itself, it won't take long to appear in Dredd.
An obvious (but tedious) workaround is to specify your own JSON Schema with stricter rules, using a + Schema section alongside the + Attributes section.
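For the blueprint above, such a + Schema section might look like this (a sketch only; the schema mirrors the Data structure's required members and would need to be kept in sync with your real fields):

```apib
+ Response 200 (application/json; charset=utf-8)
    + Attributes (array[Data])
    + Schema

            {
                "type": "array",
                "items": {
                    "type": "object",
                    "required": ["realParameter2", "realParameter3",
                                 "realParameter4", "fakeParam"]
                }
            }
```

With this in place, a response whose array items lack fakeParam fails validation instead of silently passing.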

JMeter: Body & File content

The WebAPI request has a POST method which expects a content body. I've tried using both the Parameters and Body options, but I receive error responses: 'Invalid Request' with a 400 status code, etc.
JMeter request Sample Content Body:
    {
      "ParamA": 111,
      "ParamB": "Char String",
      "ParamC": "VarType"
    }
OR
{ "ParamA": 111, "ParamB": "Char String", "ParamC": "VarType"}
Listener Request:
POST data:
--8vpH3B6WcV4f1La46_wccVi4c25lrLJaGcN--
Listener Response:
{"message":"The request is invalid.","modelState":{"value":["An error
has occurred."]}}
Any insight into viable options? Eventually, I'm planning on reading the body string from a .csv file so I can parameterize the request. Reading from a .csv file only reads the first line of the request body - for example: '{'
Any help would be greatly appreciated.
Best,
Ray
In the HTTP Request, uncheck the option:

    Use multipart/form-data for POST

Also check that your CSV does not contain data that includes the CSV separator, which is '\t' by default.
Ensure it doesn't by changing the separator to '|', for example, if you're sure your JSON will never contain it.

Interpret and display different encodings for email bodies in Ruby - ruby

I am using the Email gem in my Rails app, but I am encountering some problems with encoding.
I am working with a mail that presents itself this way:
....
Message-ID: <22D41F1A16CD5A5719309A96F8C95D50#vcrfnyjsz>
From: "=?utf-8?B?IOWFqOWbvealvOWHpOWFvOiBjOWwj+WnkOS/oeaBrw==?=" <info#nks-media.ru>
To: ...
...
MIME-Version: 1.0
Content-Type: text/html;
charset="utf-8"
Content-Transfer-Encoding: base64
X-Priority: 5
X-MSMail-Priority: Low
X-Mailer: Microsoft Outlook Express 6.00.2900.5512
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.5512
PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMCBUcmFuc2l0aW9uYWwv
L0VOIj4NCjxIVE1MIHhtbG5zOm8gPSAidXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6
b2ZmaWNlIj48SEVBRD4NCjxNRVRBIGNvbnRlbnQ9InRleHQvaHRtbDsgY2hhcnNldD11dGYtOCIg
aHR0cC1lcXVpdj1Db250ZW50LVR5cGU+DQo8TUVUQSBuYW1lPUdFTkVSQVRPUiBjb250ZW50PSJN
U0hUTUwgOC4wMC42MDAxLjIzNTg4Ij48L0hFQUQ+DQo8Qk9EWSBiZ0NvbG9yPWFxdWE+DQo8UD48
Rk9OVCBjb2xvcj1ncmF5IHNpemU9Nj7lhajlm73lsI/lp5Dkv6Hmga/vvIzlrabnlJ/lprnkv6Hm
ga/vvIzmpbzlh6TlhbzogYzlpbPvvIzoia/lrrbkv6Hmga/vvIzlhbzogYzkv6Hmga/vvIzlpKfk
v53lgaXkv6Hmga88L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6
ICMwMGZmZmYiIGNvbG9yPSMwMGZmZmYgc2l6ZT02PjxBIA0KaHJlZj0iaHR0cDovL3d3dy5obmhu
LmNsdWIveGlueGkuaHRtIj5odGh0dHA6Ly93d3cuaG5obi5jbHViL3hpbnhpLmh0bWh0dHA6Ly93
d3cuaG5obi5jbHViL3hpbnhpLmh0bTwvQT48L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc3R5bGU9IkJB
Q0tHUk9VTkQtQ09MT1I6ICMwMGZmZmYiIGNvbG9yPSMwMGZmZmYgDQpzaXplPTY+PC9GT05UPiZu
YnNwOzwvUD4NCjxQPjxGT05UIGNvbG9yPSM4MDgwODAgc2l6ZT02PjxGT05UIHNpemU9Nj48Rk9O
VCBzdHlsZT0iQkFDS0dST1VORC1DT0xPUjogYXF1YSIgDQpjb2xvcj1hcXVhPuS6uuWPr+S7peaK
oui1sOS7luWUr+S4gOaDs+imgeeahOWls+S6uu+8jOWlueWPquiDveWxnuS6juS7luOAgjwvRk9O
VD48L1A+DQo8UD48Rk9OVCBzaXplPTY+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFx
dWEiIA0KY29sb3I9YXF1YT7jgIDjgIDmlrnnk7fpl63kuIrnnLzmt7HlkLjkuobkuIDlj6PmsJTv
vIznhLblkI7lsIblpbnnmoTmlbTkuKrohLjln4vlnKjog7jliY3jgILku5bojqvmiY7nibnnmoTv
vIzov5nkuKrlppblrb3vvIHnm7TmjqXpl7fmrbvlvpfkuobvvIE8L0ZPTlQ+PC9QPg0KPFA+PEZP
TlQgc2l6ZT02PjxGT05UIHN0eWxlPSJCQUNLR1JPVU5ELUNPTE9SOiBhcXVhIiANCmNvbG9yPWFx
dWE+ICAgICAgICDmuIXmup/miJHnn6XpgZPvvIzmmK/pgqPlj6rlrp7lipvlubPlubPnmoTohb7o
m4fvvIznqYbojbvlj4jmmK/osIHllYrvvJ/mmK/ku5blkI7mnaXmlLbmnI3nmoTlppbprZTlkJfv
vJ88L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc2l6ZT02PjxGT05UIHN0eWxlPSJCQUNLR1JPVU5ELUNP
TE9SOiBhcXVhIiANCmNvbG9yPWFxdWE+ICAgIOWFremBk+WYtOinkua1ruWHuuS4gOaKueivoeW8
gumYtOajrueahOeskeaEj++8muKAnOaNheegtOWkqeOAguKAnTwvRk9OVD48L1A+DQo8UD48Rk9O
VCBzaXplPTY+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFxdWEiIA0KY29sb3I9YXF1
YT4gICAg4oCc6YKj5aW55oCO5LmI5Lya6ZSZ5LqG6Zeo77yf5pei54S25pyJ5LiA77yM5oiR5oCO
5LmI6IO95LiN6K6k5Li66L+Y5Lya5pyJ5LqM77yf5aaC5p6c5oiR6K+05L2g546w5Zyo5q2j5aSE
5LqO5aS06ISR5re35Lmx77yM5oCd6Lev5LiN5riF55qE54q25oCB5LiN6L+H5YiG5ZCn77yf4oCd
5bm06L275rCR6K2m6Zeu55m95Li977yM55m95Li954K55aS05om/6K6k44CCPC9GT05UPjwvUD4N
CjxQPjxGT05UIHNpemU9Nj48Rk9OVCBzdHlsZT0iQkFDS0dST1VORC1DT0xPUjogYXF1YSIgDQpj
b2xvcj1hcXVhPiAgICDmnpfljZfmnKzlsLHmmK/nm5fnlKjkuoblkI7kurrnmoTnn6Xor4bvvIzm
iYDku6XkuZ/msqHku4DkuYjlj6/pqoTlgrLnmoTvvIzkvr/nrJHnnYDku6TkvJfkurrlubPouqvv
vIzlj4jlr7nprY/lvoHpl67pgZPvvJrigJzprY/ljb/lrrblnKjmnJ3loILkuYvkuIrkuJPpl67m
raTkuovvvIzmg7PmnaXlv4XmmK/mnInku4DkuYjmt7HmhI/nvaLvvJ/igJ08L0ZPTlQ+PC9QPg0K
PFA+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFxdWEiIGNvbG9yPWFxdWEgDQpzaXpl
PTY+44CA44CAPC9GT05UPjwvUD48L0ZPTlQ+PC9GT05UPjwvRk9OVD48L0ZPTlQ+PC9GT05UPjwv
Rk9OVD48L0ZPTlQ+PC9CT0RZPjwvSFRNTD4NCg==
(I have omitted parts that are not relevant.)
I fetch it with Ruby's Net::IMAP class and pass it as a string to the gem's

    Email.read_from_string

method.
It returns an object; call it msg. I now call msg.body and get this output:
<Mail::Body:0x007f0045976ea8 #boundary=nil, #preamble=nil, #epilogue=nil, #charset="US-ASCII", #part_sort_order=["text/plain", "text/enriched", "text/html"], #parts=[], #raw_source="PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMCBUcmFuc2l0aW9uYWwv\r\nL0VOIj4NCjxIVE1MIHhtbG5zOm8gPSAidXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6\r\nb2ZmaWNlIj48SEVBRD4NCjxNRVRBIGNvbnRlbnQ9InRleHQvaHRtbDsgY2hhcnNldD11dGYtOCIg\r\naHR0cC1lcXVpdj1Db250ZW50LVR5cGU+DQo8TUVUQSBuYW1lPUdFTkVSQVRPUiBjb250ZW50PSJN\r\nU0hUTUwgOC4wMC42MDAxLjIzNTg4Ij48L0hFQUQ+DQo8Qk9EWSBiZ0NvbG9yPWFxdWE+DQo8UD48\r\nRk9OVCBjb2xvcj1ncmF5IHNpemU9Nj7lhajlm73lsI/lp5Dkv6Hmga/vvIzlrabnlJ/lprnkv6Hm\r\nga/vvIzmpbzlh6TlhbzogYzlpbPvvIzoia/lrrbkv6Hmga/vvIzlhbzogYzkv6Hmga/vvIzlpKfk\r\nv53lgaXkv6Hmga88L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6\r\nICMwMGZmZmYiIGNvbG9yPSMwMGZmZmYgc2l6ZT02PjxBIA0KaHJlZj0iaHR0cDovL3d3dy5obmhu\r\nLmNsdWIveGlueGkuaHRtIj5odGh0dHA6Ly93d3cuaG5obi5jbHViL3hpbnhpLmh0bWh0dHA6Ly93\r\nd3cuaG5obi5jbHViL3hpbnhpLmh0bTwvQT48L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc3R5bGU9IkJB\r\nQ0tHUk9VTkQtQ09MT1I6ICMwMGZmZmYiIGNvbG9yPSMwMGZmZmYgDQpzaXplPTY+PC9GT05UPiZu\r\nYnNwOzwvUD4NCjxQPjxGT05UIGNvbG9yPSM4MDgwODAgc2l6ZT02PjxGT05UIHNpemU9Nj48Rk9O\r\nVCBzdHlsZT0iQkFDS0dST1VORC1DT0xPUjogYXF1YSIgDQpjb2xvcj1hcXVhPuS6uuWPr+S7peaK\r\noui1sOS7luWUr+S4gOaDs+imgeeahOWls+S6uu+8jOWlueWPquiDveWxnuS6juS7luOAgjwvRk9O\r\nVD48L1A+DQo8UD48Rk9OVCBzaXplPTY+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFx\r\ndWEiIA0KY29sb3I9YXF1YT7jgIDjgIDmlrnnk7fpl63kuIrnnLzmt7HlkLjkuobkuIDlj6PmsJTv\r\nvIznhLblkI7lsIblpbnnmoTmlbTkuKrohLjln4vlnKjog7jliY3jgILku5bojqvmiY7nibnnmoTv\r\nvIzov5nkuKrlppblrb3vvIHnm7TmjqXpl7fmrbvlvpfkuobvvIE8L0ZPTlQ+PC9QPg0KPFA+PEZP\r\nTlQgc2l6ZT02PjxGT05UIHN0eWxlPSJCQUNLR1JPVU5ELUNPTE9SOiBhcXVhIiANCmNvbG9yPWFx\r\ndWE+ICAgICAgICDmuIXmup/miJHnn6XpgZPvvIzmmK/pgqPlj6rlrp7lipvlubPlubPnmoTohb7o\r\nm4fvvIznqYbojbvlj4jmmK/osIHllYrvvJ/mmK/ku5blkI7mnaXmlLbmnI3nmoTlppbprZTlkJfv\r\nvJ88L0ZPTlQ+PC9QPg0KPFA+PEZPTlQgc2l6ZT02PjxGT05UIHN0eWxlPS
JCQUNLR1JPVU5ELUNP\r\nTE9SOiBhcXVhIiANCmNvbG9yPWFxdWE+ICAgIOWFremBk+WYtOinkua1ruWHuuS4gOaKueivoeW8\r\ngumYtOajrueahOeskeaEj++8muKAnOaNheegtOWkqeOAguKAnTwvRk9OVD48L1A+DQo8UD48Rk9O\r\nVCBzaXplPTY+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFxdWEiIA0KY29sb3I9YXF1\r\nYT4gICAg4oCc6YKj5aW55oCO5LmI5Lya6ZSZ5LqG6Zeo77yf5pei54S25pyJ5LiA77yM5oiR5oCO\r\n5LmI6IO95LiN6K6k5Li66L+Y5Lya5pyJ5LqM77yf5aaC5p6c5oiR6K+05L2g546w5Zyo5q2j5aSE\r\n5LqO5aS06ISR5re35Lmx77yM5oCd6Lev5LiN5riF55qE54q25oCB5LiN6L+H5YiG5ZCn77yf4oCd\r\n5bm06L275rCR6K2m6Zeu55m95Li977yM55m95Li954K55aS05om/6K6k44CCPC9GT05UPjwvUD4N\r\nCjxQPjxGT05UIHNpemU9Nj48Rk9OVCBzdHlsZT0iQkFDS0dST1VORC1DT0xPUjogYXF1YSIgDQpj\r\nb2xvcj1hcXVhPiAgICDmnpfljZfmnKzlsLHmmK/nm5fnlKjkuoblkI7kurrnmoTnn6Xor4bvvIzm\r\niYDku6XkuZ/msqHku4DkuYjlj6/pqoTlgrLnmoTvvIzkvr/nrJHnnYDku6TkvJfkurrlubPouqvv\r\nvIzlj4jlr7nprY/lvoHpl67pgZPvvJrigJzprY/ljb/lrrblnKjmnJ3loILkuYvkuIrkuJPpl67m\r\nraTkuovvvIzmg7PmnaXlv4XmmK/mnInku4DkuYjmt7HmhI/nvaLvvJ/igJ08L0ZPTlQ+PC9QPg0K\r\nPFA+PEZPTlQgc3R5bGU9IkJBQ0tHUk9VTkQtQ09MT1I6IGFxdWEiIGNvbG9yPWFxdWEgDQpzaXpl\r\nPTY+44CA44CAPC9GT05UPjwvUD48L0ZPTlQ+PC9GT05UPjwvRk9OVD48L0ZPTlQ+PC9GT05UPjwv\r\nRk9OVD48L0ZPTlQ+PC9CT0RZPjwvSFRNTD4NCg==\r\n\r\n\r\n", #encoding="base64">
so everything seems right.
I do:

    msg.body.encoding # returns "Base64"

and that's right again. But here's the strange part: when I do

    msg.body.only_us_ascii? # returns true

Shouldn't this be false? The charset in the header of the email is 'utf-8'.
In fact, if I try

    msg.body.decoded

here is the result:
"<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">\r\n<HTML xmlns:o = \"urn:schemas-microsoft-com:office:office\"><HEAD>\r\n<META content=\"text/html; charset=utf-8\" http-equiv=Content-Type>\r\n<META name=GENERATOR content=\"MSHTML 8.00.6001.23588\"></HEAD>\r\n<BODY bgColor=aqua>\r\n<P><FONT color=gray size=6>\xE5\x85\xA8\xE5\x9B\xBD\xE5\xB0\x8F\xE5\xA7\x90\xE4\xBF\xA1\xE6\x81\xAF\xEF\xBC\x8C\xE5\xAD\xA6\xE7\x94\x9F\xE5\xA6\xB9\xE4\xBF\xA1\xE6\x81\xAF\xEF\xBC\x8C\xE6\xA5\xBC\xE5\x87\xA4\xE5\x85\xBC\xE8\x81\x8C\xE5\xA5\xB3\xEF\xBC\x8C\xE8\x89\xAF\xE5\xAE\xB6\xE4\xBF\xA1\xE6\x81\xAF\xEF\xBC\x8C\xE5\x85\xBC\xE8\x81\x8C\xE4\xBF\xA1\xE6\x81\xAF\xEF\xBC\x8C\xE5\xA4\xA7\xE4\xBF\x9D\xE5\x81\xA5\xE4\xBF\xA1\xE6\x81\xAF</FONT></P>\r\n<P><FONT style=\"BACKGROUND-COLOR: #00ffff\" color=#00ffff size=6><A \r\nhref=\"http://www.hnhn.club/xinxi.htm\">hthttp://www.hnhn.club/xinxi.htmhttp://www.hnhn.club/xinxi.htm</A></FONT></P>\r\n<P><FONT style=\"BACKGROUND-COLOR: #00ffff\" color=#00ffff \r\nsize=6></FONT> </P>\r\n<P><FONT color=#808080 size=6><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua>\xE4\xBA\xBA\xE5\x8F\xAF\xE4\xBB\xA5\xE6\x8A\xA2\xE8\xB5\xB0\xE4\xBB\x96\xE5\x94\xAF\xE4\xB8\x80\xE6\x83\xB3\xE8\xA6\x81\xE7\x9A\x84\xE5\xA5\xB3\xE4\xBA\xBA\xEF\xBC\x8C\xE5\xA5\xB9\xE5\x8F\xAA\xE8\x83\xBD\xE5\xB1\x9E\xE4\xBA\x8E\xE4\xBB\x96\xE3\x80\x82</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" 
\r\ncolor=aqua>\xE3\x80\x80\xE3\x80\x80\xE6\x96\xB9\xE7\x93\xB7\xE9\x97\xAD\xE4\xB8\x8A\xE7\x9C\xBC\xE6\xB7\xB1\xE5\x90\xB8\xE4\xBA\x86\xE4\xB8\x80\xE5\x8F\xA3\xE6\xB0\x94\xEF\xBC\x8C\xE7\x84\xB6\xE5\x90\x8E\xE5\xB0\x86\xE5\xA5\xB9\xE7\x9A\x84\xE6\x95\xB4\xE4\xB8\xAA\xE8\x84\xB8\xE5\x9F\x8B\xE5\x9C\xA8\xE8\x83\xB8\xE5\x89\x8D\xE3\x80\x82\xE4\xBB\x96\xE8\x8E\xAB\xE6\x89\x8E\xE7\x89\xB9\xE7\x9A\x84\xEF\xBC\x8C\xE8\xBF\x99\xE4\xB8\xAA\xE5\xA6\x96\xE5\xAD\xBD\xEF\xBC\x81\xE7\x9B\xB4\xE6\x8E\xA5\xE9\x97\xB7\xE6\xAD\xBB\xE5\xBE\x97\xE4\xBA\x86\xEF\xBC\x81</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua> \xE6\xB8\x85\xE6\xBA\x9F\xE6\x88\x91\xE7\x9F\xA5\xE9\x81\x93\xEF\xBC\x8C\xE6\x98\xAF\xE9\x82\xA3\xE5\x8F\xAA\xE5\xAE\x9E\xE5\x8A\x9B\xE5\xB9\xB3\xE5\xB9\xB3\xE7\x9A\x84\xE8\x85\xBE\xE8\x9B\x87\xEF\xBC\x8C\xE7\xA9\x86\xE8\x8D\xBB\xE5\x8F\x88\xE6\x98\xAF\xE8\xB0\x81\xE5\x95\x8A\xEF\xBC\x9F\xE6\x98\xAF\xE4\xBB\x96\xE5\x90\x8E\xE6\x9D\xA5\xE6\x94\xB6\xE6\x9C\x8D\xE7\x9A\x84\xE5\xA6\x96\xE9\xAD\x94\xE5\x90\x97\xEF\xBC\x9F</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua> \xE5\x85\xAD\xE9\x81\x93\xE5\x98\xB4\xE8\xA7\x92\xE6\xB5\xAE\xE5\x87\xBA\xE4\xB8\x80\xE6\x8A\xB9\xE8\xAF\xA1\xE5\xBC\x82\xE9\x98\xB4\xE6\xA3\xAE\xE7\x9A\x84\xE7\xAC\x91\xE6\x84\x8F\xEF\xBC\x9A\xE2\x80\x9C\xE6\x8D\x85\xE7\xA0\xB4\xE5\xA4\xA9\xE3\x80\x82\xE2\x80\x9D</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua> 
\xE2\x80\x9C\xE9\x82\xA3\xE5\xA5\xB9\xE6\x80\x8E\xE4\xB9\x88\xE4\xBC\x9A\xE9\x94\x99\xE4\xBA\x86\xE9\x97\xA8\xEF\xBC\x9F\xE6\x97\xA2\xE7\x84\xB6\xE6\x9C\x89\xE4\xB8\x80\xEF\xBC\x8C\xE6\x88\x91\xE6\x80\x8E\xE4\xB9\x88\xE8\x83\xBD\xE4\xB8\x8D\xE8\xAE\xA4\xE4\xB8\xBA\xE8\xBF\x98\xE4\xBC\x9A\xE6\x9C\x89\xE4\xBA\x8C\xEF\xBC\x9F\xE5\xA6\x82\xE6\x9E\x9C\xE6\x88\x91\xE8\xAF\xB4\xE4\xBD\xA0\xE7\x8E\xB0\xE5\x9C\xA8\xE6\xAD\xA3\xE5\xA4\x84\xE4\xBA\x8E\xE5\xA4\xB4\xE8\x84\x91\xE6\xB7\xB7\xE4\xB9\xB1\xEF\xBC\x8C\xE6\x80\x9D\xE8\xB7\xAF\xE4\xB8\x8D\xE6\xB8\x85\xE7\x9A\x84\xE7\x8A\xB6\xE6\x80\x81\xE4\xB8\x8D\xE8\xBF\x87\xE5\x88\x86\xE5\x90\xA7\xEF\xBC\x9F\xE2\x80\x9D\xE5\xB9\xB4\xE8\xBD\xBB\xE6\xB0\x91\xE8\xAD\xA6\xE9\x97\xAE\xE7\x99\xBD\xE4\xB8\xBD\xEF\xBC\x8C\xE7\x99\xBD\xE4\xB8\xBD\xE7\x82\xB9\xE5\xA4\xB4\xE6\x89\xBF\xE8\xAE\xA4\xE3\x80\x82</FONT></P>\r\n<P><FONT size=6><FONT style=\"BACKGROUND-COLOR: aqua\" \r\ncolor=aqua> \xE6\x9E\x97\xE5\x8D\x97\xE6\x9C\xAC\xE5\xB0\xB1\xE6\x98\xAF\xE7\x9B\x97\xE7\x94\xA8\xE4\xBA\x86\xE5\x90\x8E\xE4\xBA\xBA\xE7\x9A\x84\xE7\x9F\xA5\xE8\xAF\x86\xEF\xBC\x8C\xE6\x89\x80\xE4\xBB\xA5\xE4\xB9\x9F\xE6\xB2\xA1\xE4\xBB\x80\xE4\xB9\x88\xE5\x8F\xAF\xE9\xAA\x84\xE5\x82\xB2\xE7\x9A\x84\xEF\xBC\x8C\xE4\xBE\xBF\xE7\xAC\x91\xE7\x9D\x80\xE4\xBB\xA4\xE4\xBC\x97\xE4\xBA\xBA\xE5\xB9\xB3\xE8\xBA\xAB\xEF\xBC\x8C\xE5\x8F\x88\xE5\xAF\xB9\xE9\xAD\x8F\xE5\xBE\x81\xE9\x97\xAE\xE9\x81\x93\xEF\xBC\x9A\xE2\x80\x9C\xE9\xAD\x8F\xE5\x8D\xBF\xE5\xAE\xB6\xE5\x9C\xA8\xE6\x9C\x9D\xE5\xA0\x82\xE4\xB9\x8B\xE4\xB8\x8A\xE4\xB8\x93\xE9\x97\xAE\xE6\xAD\xA4\xE4\xBA\x8B\xEF\xBC\x8C\xE6\x83\xB3\xE6\x9D\xA5\xE5\xBF\x85\xE6\x98\xAF\xE6\x9C\x89\xE4\xBB\x80\xE4\xB9\x88\xE6\xB7\xB1\xE6\x84\x8F\xE7\xBD\xA2\xEF\xBC\x9F\xE2\x80\x9D</FONT></P>\r\n<P><FONT style=\"BACKGROUND-COLOR: aqua\" color=aqua \r\nsize=6>\xE3\x80\x80\xE3\x80\x80</FONT></P></FONT></FONT></FONT></FONT></FONT></FONT></FONT></BODY></HTML>\r\n"
Not UTF-8 as I expected, but ASCII-8BIT, and I don't know how to use it or view it in the browser.
Any help?
It's Base64, as revealed by msg.body.encoding # returns "Base64". I'm no email format expert, but the Base64 nature of the body is declared in the Content-Transfer-Encoding: base64 header shown in your paste, which is where msg.body.encoding must be getting it from.
Base64 isn't actually a character encoding like UTF-8. It's instead a conversion of binary data to ascii.
I think it's unfortunate that the Email gem doesn't take care of this for you.
But if what you have is Base64, you can decode it using the stdlib Base64 class.
    require 'base64'
    data = Base64.decode64(msg.body.raw_source)
However, if it's Base64-encoded in the first place, what comes out the other side might not be plain text but some kind of binary file format (an MS Word document? I don't know), so it might still not make sense to read directly, even once decoded.
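The missing step after decoding is telling Ruby what the bytes mean. Base64.decode64 returns a binary (ASCII-8BIT) string; since the mail header declares charset=utf-8, you can retag the result with force_encoding. A self-contained sketch (the body here is a short stand-in, not the original message):

```ruby
require 'base64'

# Stand-in for the message body: Base64 of some UTF-8 HTML.
raw = Base64.encode64("<HTML>\u5168\u56fd</HTML>")

html = Base64.decode64(raw)   # decoded bytes, tagged ASCII-8BIT
html.force_encoding('UTF-8')  # reinterpret per the charset=utf-8 header

puts html.encoding        # => UTF-8
puts html.valid_encoding? # => true
```

force_encoding does not convert any bytes; it only changes how Ruby interprets them, which is exactly what's wanted here because the decoded bytes already are UTF-8.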

Using ruby SAX parsers for GB2312 encoded xml

Good day,
I have a lot of big XML files that I need to parse, but the problem is they have 'gb2312' encoding. I would normally use a SAX parser for this.
So here is an example of the XML:
<?xml version="1.0" encoding="gb2312"?>
<Root>
<ValueList Count="112290" FieldCount="11">
<Item1 Value1="23743" Value2="Дипломатия � Пустой кувшин" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item2 Value1="6611" Value2="ДЛ � 018 омела � золотой кинжал" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item3 Value1="6608" Value2="Наука (ДЛ)�круг фей 021�тяпка" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item4 Value1="6612" Value2="Знаки ДЛ � 003руны � разрушение" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
....
</Root>
I'm trying to use Nokogiri SAX (also tried libxml-ruby with same result) parser:
    require 'nokogiri'

    class SchemaParser < Nokogiri::XML::SAX::Document
      def initialize
        @cnt = 0
      end

      def start_element name, attrs = []
        if name == "Item1"
          @cnt += 1
          puts @cnt
        end
      end
    end

    parser = Nokogiri::XML::SAX::Parser.new(SchemaParser.new)
    parser.parse_io(File.open('2_4_EQUIPMENT_ESSENCE.xml'), 'gb2312')
But this gives the error "`check_encoding': 'GB2312' is not a valid encoding (ArgumentError)". If I remove the encoding declaration and let Nokogiri detect the encoding itself, I receive this error:
encoding error : input conversion failed due to input error, bytes 0xA8 0x43 0x20 0xA7
encoding error : input conversion failed due to input error, bytes 0xA8 0x43 0x20 0xA7
I/O error : encoder error
I also tried to open the file with the proper encoding, but that didn't help the SAX parser:
[3] pry(main)> f = File.open('2_4_EQUIPMENT_ESSENCE.xml', "r:gb2312")
=> #<File:2_4_EQUIPMENT_ESSENCE.xml>
[4] pry(main)> f.external_encoding.name
=> "GB2312"
Did anyone use 'gb2312' encoding with SAX parsers in ruby? Any recommendations how to proceed?
It seems the issue is that Libxml2 does not support the GB2312 encoding (see here for a list of supported encodings).
I'm not sure if you have tried this, but I think you can work around it by removing the encoding declaration from the XML files (so Libxml2 does not try to transcode the data) and setting the external encoding of the File object to GB2312. Ruby will then transcode the file to UTF-8 as it is read, and from then on everything remains UTF-8.
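A sketch of that approach, using a small sample file generated on the spot (GB18030 is used for the sample bytes since it is a superset of GB2312 and is supported by Ruby's transcoding):

```ruby
# Write a sample "gb2312" XML file: the bytes D6D0 B9FA encode 中国.
gb_text = (+"\xD6\xD0\xB9\xFA").force_encoding('GB18030')
content = %(<?xml version="1.0" encoding="gb2312"?><Root>#{gb_text}</Root>)
File.binwrite('sample_gb.xml', content)

# Open with external encoding GB18030 and internal encoding UTF-8:
# Ruby transcodes as it reads, so downstream code only ever sees UTF-8.
xml = File.open('sample_gb.xml', 'r:GB18030:UTF-8', &:read)

# Fix the declaration so libxml2 never sees the unsupported encoding name.
xml = xml.sub('encoding="gb2312"', 'encoding="UTF-8"')

puts xml
```

The resulting UTF-8 string can then be handed to the SAX parser's parse method instead of the raw file.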
So, here is my workaround.
Problems:
Some of the characters present in the XML are not valid 'gb2312'; I found that 'GB18030' is a better choice, as it covers the full range of Chinese characters.
I converted all XMLs to UTF-8 so I can use the SAX parser.
I ended up with this rake task:
    desc "convert chinese xml files to utf-8"
    task :convert do
      rm_rf 'data/utf8'
      mkdir 'data/utf8'

      Dir.foreach('data') do |f|
        if f.end_with?('.xml')
          puts "converted:: data/utf8/#{f}" if system("iconv -f GB18030 -t UTF-8 data/#{f} > data/utf8/#{f}")
        end
      end

      # replace encodings for xml files
      system("bundle exec ruby -pi -e \"gsub(/gb2312/, 'UTF-8')\" data/utf8/*.xml")
    end
