Why do I get "ArgumentError - invalid byte sequence in US-ASCII" in ruby 2.0

I have some code that handles a web request that comes in (not a rails app), and with the following couple lines,
str.encode!(::Encoding::ASCII, :undef => :replace, :invalid => :replace, :replace => '')
str.gsub(/[\\\%\']/, '')
the str.gsub call gets the exception "ArgumentError - invalid byte sequence in US-ASCII".
I was under the impression that, if I called the encode! method with ::Encoding::ASCII, it would handle this, but apparently not; the character I'm trying to handle shows up in my text logfile as ‘ or %91. Anyone know why the encode! call doesn't do what I expect it to?
I don't know what the string looks like exactly beforehand -- this has only ever occurred in a production environment, and I am debugging from logfiles, where the value has likely been encoded in some fashion other than the original. I'm going to try Marshal.dump'ing the object to save it off and reproduce it locally next time it happens.
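(A minimal sketch of that save-and-replay idea; Marshal preserves both the raw bytes and the string's encoding tag, so the failing value can be rebuilt exactly. The file name here is illustrative.)
File.binwrite('bad_string.dump', Marshal.dump(str)) # in production, when the error is caught
str = Marshal.load(File.binread('bad_string.dump')) # locally, to reproduce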

In Ruby 2.0 (and earlier), trying to use encode (or encode!) on a string that is already in the target encoding is a no-op:
Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.
In your case, if str already has ASCII encoding then the encode call will not do anything, so any invalid bytes will remain and cause errors in the subsequent gsub call.
This doesn’t happen with Ruby 2.1, which also introduced the scrub method as an easier way to remove invalid bytes.
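For example, on Ruby 2.1+ the cleanup becomes a one-liner; a minimal sketch (scrub with an empty string drops the invalid bytes instead of replacing them):
str = "abc\x91def".force_encoding(Encoding::US_ASCII)
str.valid_encoding? # => false
str.scrub('') # => "abcdef"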
If you cannot upgrade your version of Ruby you might be able to get round this by changing to a different encoding and back, for example:
str.encode(::Encoding::UTF_8, :undef => :replace, :invalid => :replace, :replace => '').encode(::Encoding::ASCII)
A better solution would be to ensure you correctly handle the character encoding of all text data entering your application, converting as necessary as it enters (usually to UTF-8). How you do this will depend on where the data is coming from.
In your example, it looks like the data is being submitted in the CP-1252 encoding (the character U+2018 LEFT SINGLE QUOTATION MARK is encoded as the byte 0x91 in that encoding). If you are sure the data always comes in this encoding, you could fix this with:
str.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
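As a quick illustration of that conversion (a sketch; the byte values follow the Windows-1252 code page):
str = "\x91quoted\x92".force_encoding(Encoding::ASCII_8BIT) # bytes as received
str.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
# => "‘quoted’" (proper left/right single quotation marks)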

Related

How to decode a string in Ruby

I am working with the Mandrill Inbound Email API. When an email has an attachment with one or more spaces in its file name, the file name is encoded in a format that I do not know how to decode.
Here is an example string I receive for the file name: =?UTF-8?B?TWlzc2lvbmFyecKgRmFpdGjCoFByb21pc2XCoGFuZMKgQ2FzaMKgUmVjZWlwdHPCoFlURMKgMjUzNQ==?= =?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?=
I tried Base64.decode64(#{encoded_value}) but that didn't return readable text.
How do I decode that value into a readable string?
This is MIME encoded-word syntax as defined in RFC 2047. From Wikipedia:
The form is: "=?charset?encoding?encoded text?=".
charset may be any character set registered with IANA. Typically it would be the same charset as the message body.
encoding can be either "Q" denoting Q-encoding that is similar to the quoted-printable encoding, or "B" denoting base64 encoding.
encoded text is the Q-encoded or base64-encoded text.
Fortunately you don't need to write a decoder for this. The Mail gem comes with a Mail::Encodings.value_decode method that works perfectly and is very well-tested:
subject = "=?UTF-8?B?TWlzc2lvbmFyecKgRmFpdGjCoFByb21pc2XCoGFuZMKgQ2FzaMKgUmVjZWlwdHPCoFlURMKgMjUzNQ==?= =?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?="
Mail::Encodings.value_decode(subject)
# => "Missionary Faith Promise and Cash Receipts YTD 253599 July-2015.csv"
It gracefully handles lots of edge cases you probably wouldn't think of (until your app tries to handle them and falls over):
subject = "Re:[=?iso-2022-jp?B?GyRCJTAlayE8JV0lcyEmJTglYyVRJXMzdDwwMnEbKEI=?=\n =?iso-2022-jp?B?GyRCPFIbKEI=?=] =?iso-2022-jp?B?GyRCSlY/LiEnGyhC?=\n =?iso-2022-jp?B?GyRCIVolMCVrITwlXSVzIVskKkxkJCQ5ZyRvJDsbKEI=?=\n =?iso-2022-jp?B?GyRCJE43byRLJEQkJCRGIUolaiUvJSglOSVIGyhC?=#1056273\n =?iso-2022-jp?B?GyRCIUsbKEI=?="
Mail::Encodings.value_decode(subject)
# => "Re:[グルーポン・ジャパン株式会社] 返信:【グルーポン】お問い合わせの件について(リクエスト#1056273\n )"
If you're using Rails you already have the Mail gem. Otherwise just add gem "mail" to your Gemfile, then bundle install and, in your script, require "mail".
Thanks to the comment from @Yevgeniy-Anfilofyev, who pointed me in the right direction, I was able to write the following method that correctly parses the encoded value and returns an ASCII string.
def self.decode(value)
  # It turns out the value is made up of multiple encoded parts,
  # so we first split them apart and decode each one separately
  encoded_parts = value.split('=?UTF-8?B?').
    map { |x| x.sub(/\?.*$/, '') }. # strip the trailing "?=" delimiter
    delete_if { |x| x.blank? }      # blank? comes from ActiveSupport
  encoded_parts.map { |x| Base64.decode64(x) }. # decode each part
    join.                    # join the parts together
    force_encoding('utf-8'). # tag the result as UTF-8
    gsub("\xC2\xA0", " ")    # replace UTF-8 non-breaking spaces with ASCII spaces
end
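A quick sanity check of the helper, using the file name from the question (this assumes ActiveSupport is loaded, since blank? comes from there):
encoded = "=?UTF-8?B?TWlzc2lvbmFyecKgRmFpdGjCoFByb21pc2XCoGFuZMKgQ2FzaMKgUmVjZWlwdHPCoFlURMKgMjUzNQ==?= =?UTF-8?B?OTnCoEp1bHktMjAxNS5jc3Y=?="
decode(encoded)
# => "Missionary Faith Promise and Cash Receipts YTD 253599 July-2015.csv"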

In ruby, how do I turn a text representation of a byte in to a byte?

What is the best way to turn the string "FA" into /xFA/ ?
To be clear, I don't want to turn "FA" into 7065 or "FA".to_i(16).
In Java the equivalent would be this:
byte b = (byte) Integer.decode("0xFA");
So you're using / markers, but you aren't actually asking about regexps, right?
I think this does what you want:
['FA'].pack('H*')
# => "\xFA"
There is no actual byte type in the ruby stdlib (not that I know of, anyway), just Strings, which can be any number of bytes long (in this case, one). A single "byte" is typically represented as a 1-byte-long String in ruby. #bytesize on a String will always return the length in bytes.
"\xFA".bytesize
# => 1
Your example happens not to be a valid UTF-8 character by itself. Depending on exactly what you're doing and how your environment is set up, your string might end up being tagged with a UTF-8 encoding by default. If you are dealing with binary data and want to make sure the string is tagged as such, you might want to call #force_encoding on it to be sure. It should NOT be necessary when using #pack; the results should be tagged as ASCII-8BIT already (which has a synonym of BINARY; it's basically the "null encoding" used in ruby for binary data).
['FA'].pack('H*').encoding
# => #<Encoding:ASCII-8BIT>
But if you're dealing with string objects holding what's meant to be binary data, not necessarily valid character data in any encoding, it is useful to know that you may sometimes need to call str.force_encoding("ASCII-8BIT") (or force_encoding("BINARY"), same thing) to make sure your string isn't tagged as a particular text encoding. Otherwise ruby will complain when you try certain operations on it if it includes invalid bytes for that encoding, or in other cases possibly do the wrong thing.
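A minimal demonstration of why the tag matters:
s = "\xFF\x01".force_encoding('UTF-8')
s.valid_encoding? # => false, \xFF can never appear in valid UTF-8
s.force_encoding('BINARY')
s.valid_encoding? # => true, any byte sequence is valid binary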
Actually for a regexp
Okay, you actually do want a regexp. So we have to take the string we created and embed it in a regexp. Here's one way:
representation = "FA"
str = [representation].pack("H*")
# => "\xFA"
data = "\x01\xFA\xC2".force_encoding("BINARY")
regexp = Regexp.new(str)
data =~ regexp
# => 1 (matched on byte 1; the first byte of data is byte 0)
You see how I needed the force_encoding there on the data string, otherwise ruby would default to it being a UTF-8 string (depending on ruby version and environment setup), and complain that those bytes aren't valid UTF-8.
In some cases you might need to explicitly set the regexp to handle binary data too, the docs say you can pass a second argument 'n' to Regexp.new to do that, but I've never done it.
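(For the record, a sketch using the constant form of that flag, Regexp::NOENCODING, which makes the pattern match byte-for-byte regardless of the haystack's encoding; treat this as something to verify rather than gospel:)
bin = ['FA'].pack('H*') # ASCII-8BIT "\xFA"
re = Regexp.new(bin, Regexp::NOENCODING)
"\x01\xFA\xC2".force_encoding('BINARY') =~ re # => 1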

Ruby URI.extract returns empty array or ArgumentError: invalid byte sequence in UTF-8

I'm trying to get a list of files from url like this:
require 'uri'
require 'open-uri'
url = 'http://www.wmprof.com/media/niti/download'
html = open(url).read
puts URI.extract(html).select{ |link| link[/(PL)/]}
This code returns ArgumentError: invalid byte sequence in UTF-8 on the line with URI.extract (even though html.encoding returns utf-8).
I've found some solutions to encoding problems, but when I change the code to
html.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
URI.extract returns an empty array, even when I'm not calling the select method on it. Any suggestions?
The character encoding of the website might be ISO-8859-1 or a related one. We can't tell for sure, since there are only two occurrences of the same non-US-ASCII character, and it doesn't really matter anyway.
html.each_char.reject(&:ascii_only?) # => ["\xDC", "\xDC"]
Finding the actual encoding is done by guessing. The age of the markup (HTML 3.2) or the language(s) used might be a clue. In this case the content of the PDF file is especially helpful (it contains SPRÜH-EX and the file has the name TI_DE_SPR%dcH_EX.pdf). Then we only need to find the encoding in which "\xDC" and "Ü" are equal, either by knowing it or by writing some Ruby:
Encoding.list.select { |e| "Ü" == "\xDC".encode!(Encoding::UTF_8, e) rescue next }.map(&:name)
Of course, letting a program do the guessing is an option too. There is the libguess library. The web browser can do it as well, although you'd need to download the file, and the server might tell the browser it's UTF-8 even if it isn't (like in this case). Any decent text editor will also try to detect the file encoding: e.g. ST3 thinks it's Windows 1252, which is a superset of ISO-8859-1 (like UTF-8 is of US-ASCII).
Possible solutions are manually setting the string encoding to ISO-8859-1:
html.force_encoding(Encoding::ISO_8859_1)
Or (preferably) transcoding the string from ISO-8859-1 to UTF-8:
html.encode!(Encoding::UTF_8, Encoding::ISO_8859_1)
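After the transcode, the previously invalid bytes are proper characters (a quick check):
html.valid_encoding? # => true
html.each_char.reject(&:ascii_only?) # => ["Ü", "Ü"]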
To answer the other question: URI.extract isn't the method you're looking for. Apparently it's obsolete and, more importantly, it doesn't extract relative URIs.
A simple alternative is using a regular expression with String#scan. It works with this site, but it might not with other ones. For the best reliability you'd have to use an HTML parser (a Nokogiri sketch follows the example below). Here's an example that should do what you want:
html.scan(/href="(.*?PL.*?)"/).flatten # => ["SI_PL_ACTIV_bicompact.pdf", ...]
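And the Nokogiri version mentioned above, as a hedged sketch (it assumes the gem is installed and that html has already been transcoded to UTF-8 as shown earlier):
require 'nokogiri'
doc = Nokogiri::HTML(html)
doc.css('a[href]').map { |a| a['href'] }.grep(/PL/)
# => ["SI_PL_ACTIV_bicompact.pdf", ...]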

Ruby character encoding issue with scraped HTML

I'm having a character encoding issue with a Ruby script that does some HTML scraping and parsing with the Nokogiri gem. At one point in the script, I call join("\n") on an array of strings that have been pulled from some HTML, which causes this error:
./script.rb:333:in `join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
In my logs, I can see CafÃ© showing up for some of the strings that would be included in the join operation.
Is it that some of the strings in my array to be joined are ASCII-8BIT and some are UTF-8, and ruby can't combine them? Do I need to convert or sanitize my strings after parsing them with Nokogiri (into UTF-8)?
I tried force_encoding('UTF-8') and encode('UTF-8') on the scraped HTML content before I do anything else with it, but it didn't help. In fact, after I tried encode('UTF-8'), my script crashed even earlier, when it called to_s on a string containing CafÃ©.
Character encoding always really confuses me. Is there something else I can do to sanitize the strings to avoid this error?
Edit:
I was doing something similar in Perl recently and used a module called Text::Unidecode, which let me pass my strings to a function that translates any problematic characters, e.g. the letter a with an acute to the plain letter a. Is there anything similar for ruby? (This isn't necessarily what I'm aiming for, though; if I can keep the a with acute then that's preferable, I think.)
Edit2:
I'm really confused by this and it's proving difficult to reproduce reliably. Here's some code:
[CODE REMOVED]
Edit3:
I removed the previously posted code example because it wasn't correct. But the bottom line is, whenever I try to print or call to_s on the string that was scraped, I get the encoding error.
Edit4:
It turned out in the end that the scraped html input was not what was causing the problem. I got the encoding error whenever I tried to print or call to_s on a hash containing, among other things, the scraped html text. The 'other things' were values from database queries, and they were being returned in ASCII-8BIT. To fix the issue, I explicitly had to call force_encoding('UTF-8') on each database value that I use (although I hear that the mysql2 gem does this automatically so I should switch to that).
I hate character encoding.
Presumably, CafÃ© is supposed to be Café. If we start out with Café in UTF-8 but treat the bytes as though they were encoded in ISO-8859-1 (AKA Latin-1) and then re-encode them as UTF-8, we get the CafÃ© that you're seeing; for example:
> s = 'Café'
=> "Café"
> s.encoding
=> #<Encoding:UTF-8>
> s.force_encoding('iso-8859-1').encode('utf-8')
=> "Café"
So somewhere you're reading a UTF-8 string but treating it as Latin-1 and re-encoding it as UTF-8. I'd guess that Nokogiri is reading the page and thinking that it is Latin-1 or being told by your user agent that it is getting Latin-1 text. Perhaps you have a bad default encoding somewhere, or the HTTP headers are lying about the encoding, or the page itself is lying about its encoding.
You need to get everything into UTF-8 at the edges of your scraper. Figure out who is lying about the encoding and sort it out right there.
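One concrete knob worth checking: Nokogiri accepts an explicit encoding argument when parsing, which overrides whatever the page or the headers claim (a sketch, assuming the page really is UTF-8):
doc = Nokogiri::HTML(html, nil, 'UTF-8')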
Don't feel bad, scraping and encoding is a nightmare of confusion, stupidity, guesswork, and hard liquor. Servers lie, pages lie, browsers lie, no one is happy.

"invalid byte sequence in UTF-8" in rspec controller response

We encounter the error in the title on some of our newer virtual machines, while other machines remain unaffected, and we wonder why, and furthermore how to get rid of it.
The two main differences are as follows:
vm_old:
debian squeeze
ruby1.9.2p0
vm_new:
debian wheezy
ruby1.9.2p320 (over rvm)
There naturally are more differences within the VMs, but I don't know which would affect this behavior.
We have a response containing umlauts within some of our controllers (i.e. '{"message": "ü"}') and we have set # encoding: utf-8
Within the spec we test the response against a fixed string with this umlaut
it 'should test something' do
  get :some_controller, format: :json
  response.status.should == 200
  json = ActiveSupport::JSON.decode(response.body)
  json["message"].should == 'ü' # breaks on this line
  # ... some more tests
end
The substitute for ü seems to be a random 4-digit string.
On occasion this string seems to be valid UTF-8 and can be transferred.
We then have a failed spec instead of the error message in the title, since the random string is not the same as ü.
The spec file itself also has the # encoding: utf-8 on the first line.
We tried playing with the locale and with force_encoding('utf-8'), to no avail.
The question now becomes:
Has someone else encountered a problem like this?
and
How to solve it?
Edit: turns out it is not always starting with P\.
Edit 2:
Testing around showed it is a problem with the json decode.
The controller response is something like "{\"foo\": \"\u00fc\"}", decoding that results in random output where the sequence \u00fc used to be.
For simple reproduction:
bundle exec rails c
> ActiveSupport::JSON.decode(ActiveSupport::JSON.encode({:foo => "ü"}))
rails version is 3.0.4
Edit 3:
Changing the JSON backend to Yaml seems to be a valid workaround.
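(For reference, in Rails 3.0 the backend switch was a one-liner; the exact backend name is an assumption worth verifying against your version's ActiveSupport::JSON::Backends:)
ActiveSupport::JSON.backend = 'Yaml' # e.g. in config/initializers/json_backend.rb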
I'm not certain if this will be of help to you, but I figured I'd toss it out there. For me, adding this code:
.encode('UTF-16le', :invalid => :replace, :replace => '').encode('UTF-8')
totally saved me. Essentially, it involves converting your UTF-8 string to UTF-16 and then encoding it back to UTF-8. Since the conversion to UTF-16 can't be a no-op, the invalid bytes actually get replaced along the way, unlike a same-encoding encode call.
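In the context of the spec above, that would look something like this (a sketch):
body = response.body.encode('UTF-16le', :invalid => :replace, :replace => '').encode('UTF-8')
json = ActiveSupport::JSON.decode(body)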
