Why do I get ArgumentError - invalid byte sequence in UTF-8? - ruby

While trying to print Duplicaci¾n out of a CSV file, I get the following error:
ArgumentError - invalid byte sequence in UTF-8
I'm using Ruby 1.9.3-p362 and opening the file using:
CSV.foreach(fpath, headers: true) do |row|
How can I skip an invalid character without using iconv or str.encode(undef: :replace, invalid: :replace, replace: '')?
I tried answers from the following questions, but nothing worked:
ruby 1.9: invalid byte sequence in UTF-8
Ruby Invalid Byte Sequence in UTF-8

This is from the CSV.open documentation:
You must provide a mode with an embedded Encoding designator unless your data is in Encoding::default_external(). CSV will check the Encoding of the underlying IO object (set by the mode you pass) to determine how to parse the data. You may provide a second Encoding to have the data transcoded as it is read just as you can with a normal call to IO::open(). For example, "rb:UTF-32BE:UTF-8" would read UTF-32BE data from the file but transcode it to UTF-8 before CSV parses it.
That applies to any method in CSV that opens a file.
Also start reading in the documentation at the part beginning with:
CSV and Character Encodings (M17n or Multilingualization)
Ruby is expecting UTF-8 but is seeing bytes that don't fit. I'd suspect Windows-1252, ISO-8859-1, or a variant.
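As a minimal sketch of the "external:internal" designator described above, assuming the file really is Windows-1252 (that is a guess, and the sample data and the `name` column are made up for illustration):

```ruby
require "csv"
require "tempfile"

# Hypothetical sample data: "Duplicación" stored as Windows-1252 (0xF3 is "ó").
raw = "name\nDuplicaci\xF3n\n".b

rows = []
Tempfile.create(["sample", ".csv"]) do |file|
  file.binmode
  file.write(raw)
  file.flush

  # "Windows-1252:UTF-8" means: the file is Windows-1252, transcode to UTF-8
  # before CSV parses it.
  CSV.foreach(file.path, headers: true, encoding: "Windows-1252:UTF-8") do |row|
    rows << row["name"]
  end
end

puts rows.first
```

If the guess about the source encoding is wrong, the text will still load but some characters will come out as the wrong letters, so it is worth checking a few known values.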

Related

Invalid UTF-8 Ruby strings

I'm running into some strange behaviour and inconsistency in the way that Ruby (v2.5.3) deals with encoded strings versus the YAML parser. Here's an example:
"\x80" # Returns "\x80"
"\x80".bytesize # Returns 1
"\x80".bytes # Returns [128]
"\x80".encoding # Returns UTF-8
YAML.load('{value: "\x80"}')["value"] # Returns "\u0080"
YAML.load('{value: "\x80"}')["value"].bytesize # Returns 2
YAML.load('{value: "\x80"}')["value"].bytes # Returns [194, 128]
YAML.load('{value: "\x80"}')["value"].encoding # Returns UTF-8
My understanding of UTF-8 is that any single-byte value above 0x7F should be encoded into two bytes. So my questions are the following:
Is the one byte string "\x80" valid UTF-8?
If so, why does YAML convert into a two-byte pattern?
If not, why is Ruby claiming the encoding is UTF-8 but containing an invalid byte sequence?
Is there a way to make the YAML parser and the Ruby string behave in the same way as each other?
It is not valid UTF-8
"\x80".valid_encoding?
# false
Ruby is claiming it is UTF-8 because all String literals are UTF-8 by default, even if that makes them invalid.
I don't think you can force the YAML parser to return invalid UTF-8. But to get Ruby to convert that character, you can do this:
"\x80".b.ord.chr('utf-8')
# "\u0080"
.b is only available in Ruby 2+. You need to use force_encoding otherwise.
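For completeness, here is a small sketch of the force_encoding variant mentioned above, next to the validity check. The variable names are only for illustration:

```ruby
raw = "\x80" # one byte, labelled UTF-8 but not valid UTF-8

puts raw.valid_encoding? # false

# Relabel a copy as binary, read the byte value, and build the UTF-8
# character with that code point. This is the pre-2.0 spelling of String#b.
converted = raw.dup.force_encoding(Encoding::BINARY).ord.chr(Encoding::UTF_8)

p converted.bytes           # the two-byte UTF-8 form of U+0080
p converted.valid_encoding?
```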

How can I use Net::Http to download a file with UTF-8 characters in it?

I have an application where users can upload text-based files (xml, csv, txt) that are persisted to S3. Some of these files are pretty big. There are a variety of operations that need to be performed on the data in these files, so rather than read them from S3 and have it time out occasionally I download the files locally, then turn the operations loose on them.
Here's the code I use to download the file from S3. Upload is the name of the AR model I use to store this information. This method is an instance method on the Upload model:
def download
basename = File.basename(self.text_file_name.path)
filename = Rails.root.join(basename)
host = MyFitment::Utility.get_host_without_www(self.text_file_name.url)
Net::HTTP.start(host) do |http|
f = open(filename)
begin
http.request_get(self.text_file_name.url) do |resp|
resp.read_body do |segment|
f.write(segment) # Fails when non-ASCII 8-bit characters are included.
end
end
ensure
f.close()
end
end
filename
end
So you see that line above where the load fails. This code somehow thinks all files that are downloaded are encoded in ASCII 8-bit. How can I:
1) Check the encoding of a remote file like that
2) Download it and write it successfully.
Here's the error that is happening with a particular file right now:
Encoding::UndefinedConversionError: "\x95" from ASCII-8BIT to UTF-8
from /Users/me/code/myapp/app/models/upload.rb:47:in `write'
Thank you for any help you can offer!
How can I: 1) Check the encoding of a remote file like that.
You can check the Content-Type header of the response, which, if present, may look something like this:
Content-Type: text/plain; charset=utf-8
As you can see, the encoding is specified there. If there's no Content-Type header, or if the charset is not specified, or if the charset is specified incorrectly, then you can't know the encoding of the text. There are gems that can try to guess the encoding (with increasing accuracy), e.g. rchardet or charlock_holmes, but for complete accuracy, you have to know the encoding before reading the text.
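A rough sketch of reading the charset out of such a header. The helper name charset_from_content_type is hypothetical, and the regex is a simplification rather than a full header parser:

```ruby
# Hypothetical helper: returns the Encoding named by the charset parameter,
# or nil when the header is missing, has no charset, or names an encoding
# Ruby does not know.
def charset_from_content_type(content_type)
  return nil if content_type.nil?

  match = content_type.match(/charset\s*=\s*"?([^\s;"]+)"?/i)
  return nil unless match

  Encoding.find(match[1])
rescue ArgumentError
  nil
end

p charset_from_content_type("text/plain; charset=utf-8")
p charset_from_content_type("text/plain")
```

With Net::HTTP you would pass it something like resp["Content-Type"]. A nil result means you are back to guessing.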
This code somehow thinks all files that are downloaded are encoded in ASCII 8-bit.
In ruby, ASCII-8BIT is equivalent to binary, which means the Net::HTTP library just gives you a string containing a series of single bytes, and it's up to you to decide how to interpret those bytes.
If you want to interpret those bytes as UTF-8, then you do that with String#force_encoding():
text = text.force_encoding("UTF-8")
You might want to do that if, for instance, you want to do some regex matching on the string, and you want to match full characters (which might be multi-byte) rather than just single bytes.
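To illustrate the difference between bytes and characters, here is a small sketch. The string "café" is just made-up sample data standing in for a response body:

```ruby
# Bytes as Net::HTTP would hand them over: tagged ASCII-8BIT (binary).
body = "caf\xC3\xA9".b

# Same bytes, relabelled as UTF-8. Nothing is converted.
text = body.dup.force_encoding("UTF-8")

p body.length           # counts bytes
p text.length           # counts characters
p text.valid_encoding?
p text.scan(/./).last   # the regex now matches a whole multi-byte character
```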
Encoding::UndefinedConversionError: "\x95" from ASCII-8BIT to UTF-8
Using String#encode('UTF-8') to convert ASCII-8BIT to UTF-8 doesn't work for bytes whose ascii codes are greater than 127:
(0..255).each do |ascii_code|
str = ascii_code.chr("ASCII-8BIT")
#puts str.encoding #=>ASCII-8BIT
begin
str.encode("UTF-8")
rescue Encoding::UndefinedConversionError
puts "Can't encode char with ascii code #{ascii_code} to UTF-8."
end
end
--output:--
Can't encode char with ascii code 128 to UTF-8.
Can't encode char with ascii code 129 to UTF-8.
Can't encode char with ascii code 130 to UTF-8.
...
...
Can't encode char with ascii code 253 to UTF-8.
Can't encode char with ascii code 254 to UTF-8.
Can't encode char with ascii code 255 to UTF-8.
Ruby just reads one byte at a time from the ASCII-8BIT string and tries to convert the character in the byte to UTF-8. So, while 128 may be a legal byte in UTF-8 when part of a multi-byte character sequence, 128 is not a legal UTF-8 character as a single byte.
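If you do know what the bytes really are, you can name the source encoding when converting. The sketch below assumes Windows-1252, which is only a guess for the file in the question (in that encoding, 0x95 is a bullet):

```ruby
chunk = "\x95 item".b # binary data containing the byte from the error message

failure = nil
begin
  chunk.encode("UTF-8") # source is ASCII-8BIT, so Ruby cannot map 0x95
rescue Encoding::UndefinedConversionError => e
  failure = e.message
end
puts failure

# Second argument = what the bytes actually are (assumed here).
converted = chunk.encode("UTF-8", "Windows-1252")
puts converted
```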
As for writing the strings to a file, instead of this:
f = open(filename)
...if you want to output UTF-8 to the file, you would write:
f = open(filename, "w:UTF-8")
By default, ruby uses whatever the value of Encoding.default_external is to encode output to a file. The default_external encoding is pulled from your system's environment, or you can set it explicitly.
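If all you want is a byte-for-byte copy of the download, a different option from "w:UTF-8" is to open the file in binary mode, so Ruby never tries to transcode the segments. A sketch, with made-up segments standing in for what read_body yields:

```ruby
require "tmpdir"

# Stand-ins for the chunks yielded by resp.read_body (ASCII-8BIT strings).
segments = ["abc\x95".b, "def".b]

written = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "download.txt") # hypothetical file name

  # "wb" = write, binary: bytes go to disk unchanged.
  File.open(path, "wb") do |f|
    segments.each { |segment| f.write(segment) }
  end

  written = File.binread(path)
end

p written.bytes
```

You can then decide on the encoding later, when you read the file back for processing.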

Ruby incompatible character encodings

I am currently trying to write a script that iterates over an input file and checks data on a website. If it finds the new data, it prints out to the terminal that it passes, if it doesn't it tells me it fails. And vice versa for deleted data. It was working fine until the input file I was given contains the "™" character. Then when ruby gets to that line, it is spitting out an error:
PDAPWeb.rb:73:in `include?': incompatible character encodings: UTF-8 and IBM437 (Encoding::CompatibilityError)
The offending line is a simple check to see if the text exists on the page.
if browser.text.include?(program_name)
Where the program_name variable is a parsed piece of information from the input file. In this instance, the program_name contains the 'TM' character mentioned before.
After some research I found that adding the line # encoding: utf-8 to the beginning of my script could help, but so far has not proven useful.
I added this to my program_name variable to see if it would help(and it allowed my script to run without errors), but now it is not properly finding the TM character when it should be.
program_name = record[2].gsub("\n", '').force_encoding("utf-8").encode("IBM437", replace: nil)
This seemed to convert the TM character to this: Γäó
I thought maybe I had the IBM437 and utf-8 parts reversed, so I tried the opposite:
program_name = record[2].gsub("\n", '').force_encoding("IBM437").encode("utf-8", replace: nil)
and am now receiving this error when attempting to run the script
PDAPWeb.rb:48:in `encode': U+2122 from UTF-8 to IBM437 (Encoding::UndefinedConversionError)
I am using ruby 1.9.3p392 (2013-02-22) and I'm not sure if I should upgrade as this is the standard version installed in my company.
Is my encoding incorrect and causing it to convert the TM character with errors?
Here’s what it looks like is going on. Your input file contains a ™ character, and it is in UTF-8 encoding. However when you read it, since you don’t specify the encoding, Ruby assumes it is in your system’s default encoding of IBM437 (you must be on Windows).
This is basically the same as this:
>> input = "™"
=> "™"
>> input.encoding
=> #<Encoding:UTF-8>
>> input.force_encoding 'ibm437'
=> "\xE2\x84\xA2"
Note that force_encoding doesn’t change the actual string, just the label associated with it. This is the same outcome as in your case, only you arrive here via a different route (by reading the file).
The web page also has a ™ symbol, and is also encoded as UTF-8, but in this case Ruby has the encoding correct (Watir probably uses the headers from the page):
>> web_page = '™'
=> "™"
>> web_page.encoding
=> #<Encoding:UTF-8>
Now when you try to compare these two strings you get the compatibility error, because they have different encodings:
>> web_page.include? input
Encoding::CompatibilityError: incompatible character encodings: UTF-8 and IBM437
from (irb):11:in `include?'
from (irb):11
from /Users/matt/.rvm/rubies/ruby-2.2.1/bin/irb:11:in `<main>'
If either of the two strings only contained ASCII characters (i.e. code points less than 128) then this comparison would have worked. UTF-8 and IBM437 are both supersets of ASCII, and are only incompatible if they both contain characters outside of the ASCII range. This is why you only started seeing this behaviour when the input file had a ™.
The fix is to inform Ruby what the actual encoding of the input file is. You can do this with the already loaded string:
>> input.force_encoding 'utf-8'
=> "™"
You can also do this when reading the file, e.g. (there are a few ways of reading files, they all should allow you to explicitly specify the encoding):
input = File.read("input_file.txt", :encoding => "utf-8")
# now input will be in the correct encoding
Note in both of these the string isn’t being changed, it still contains the same bytes, but Ruby now knows its correct encoding.
Now the comparison should work okay:
>> web_page.include? input
=> true
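Put together as a script rather than an irb session, the failure and the fix look roughly like this. The strings are made-up stand-ins for the page text and the input record:

```ruby
web_page = "Widget™ Pro"                       # correctly labelled UTF-8
input = "Widget™".dup.force_encoding("IBM437") # UTF-8 bytes, wrong label

error = nil
begin
  web_page.include?(input)
rescue Encoding::CompatibilityError => e
  error = e
end
puts error.message if error

# Correct the label. The bytes are untouched.
input.force_encoding("UTF-8")
found = web_page.include?(input)
puts found
```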
There is no need to encode the string. Here’s what happens if you do. First if you correct the encoding to UTF-8 then encode to IBM437:
>> input.force_encoding("utf-8").encode("IBM437", replace: nil)
Encoding::UndefinedConversionError: U+2122 from UTF-8 to IBM437
from (irb):16:in `encode'
from (irb):16
from /Users/matt/.rvm/rubies/ruby-2.2.1/bin/irb:11:in `<main>'
IBM437 doesn’t include the ™ character, so you can’t encode a string containing it to this encoding without losing data. By default Ruby raises an exception when this happens. You can force the encoding by using the :undef option, but the symbol is lost:
>> input.force_encoding("utf-8").encode("IBM437", :undef => :replace)
=> "?"
If you go the other way, first using force_encoding to IBM437 then encoding to UTF-8 you get the string Γäó:
>> input.force_encoding("IBM437").encode("utf-8", replace: nil)
=> "Γäó"
The string is already in IBM437 encoding as far as Ruby is concerned, so force_encoding doesn’t do anything. The UTF-8 representation of ™ is the three bytes 0xe2 0x84 0xa2, and when interpreted as IBM437 these bytes correspond to the three characters seen here which are then converted into their UTF-8 representations.
(These two outcomes are the other way round from what you describe in the question, hence my comment above. I’m assuming that this is just a copy-and-paste error.)
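The same mechanics can be run in reverse to check what happened to a string. This is only a sketch to show that Γäó carries the same three bytes as ™; it is not something you need in the script itself:

```ruby
# Mislabel the UTF-8 bytes of "™" as IBM437, then transcode to UTF-8.
mojibake = "™".dup.force_encoding("IBM437").encode("UTF-8")
puts mojibake

# Undo it: encode back to IBM437 to recover the original three bytes,
# then label them as what they really are.
repaired = mojibake.encode("IBM437").force_encoding("UTF-8")
puts repaired
p repaired.bytes.map { |b| format("0x%02x", b) }
```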

Incompatible character encodings error

I'm trying to run a ruby script which generates translated HTML files from a JSON file. However I get this error:
incompatible character encodings: UTF-8 and CP850
translation_hash = JSON.parse(File.read('translation_master.json').force_encoding("ISO-8859-1").encode("utf-8", replace: nil))
It seems to get stuck on this line of the JSON:
"3": "Klassisch geschnittene Anzüge",
because there is a special character "ü". The JSON file's encoding is ANSI. Any ideas what could be wrong?
Try adding # encoding: UTF-8 to the top of the ruby file. This tells ruby to interpret the file with a different encoding. If this doesn't work try to find out what kind of encoding the text uses and change the line accordingly.
IMHO your code should work if the encoding of the json file is "ISO-8859-1" and if it is a valid json file.
So you should first verify that "ISO-8859-1" is the correct encoding and, while you are at it, that the file is a valid JSON file.
# read the file with the encoding, you assume it is correct
json_or_not = File.read('translation_master.json').force_encoding("ISO-8859-1")
# print the result and check whether anything looks wrong
puts json_or_not
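Once the encoding is confirmed, you can let File.read do the transcoding. A sketch, assuming the file is ISO-8859-1 (on Western Windows systems "ANSI" usually means Windows-1252, where ü is the same byte); the temporary file stands in for translation_master.json:

```ruby
require "json"
require "tmpdir"

# Hypothetical file content: the line from the question, with "ü" as byte 0xFC.
raw = "{\"3\": \"Klassisch geschnittene Anz\xFCge\"}".b

translation_hash = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "translation_master.json")
  File.binwrite(path, raw)

  # "ISO-8859-1:UTF-8" = file encoding : encoding of the string you get back.
  text = File.read(path, encoding: "ISO-8859-1:UTF-8")
  translation_hash = JSON.parse(text)
end

puts translation_hash["3"]
```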

`gsub': incompatible character encodings: UTF-8 and IBM437

I tried searching, including on Google, but with no luck.
OS: Windows XP
Ruby version 1.9.3p0
Error:
`gsub': incompatible character encodings: UTF-8 and IBM437
Code:
require 'rubygems'
require 'hpricot'
require 'net/http'
source = Net::HTTP.get('host', '/' + ARGV[0] + '.asp')
doc = Hpricot(source)
doc.search("p.MsoNormal/a").each do |a|
puts a.to_plain_text
end
The program outputs a few strings, but when the text is "NOŻYCE" I get the error above.
Could somebody help?
You could try converting your HTML to UTF-8 since it appears the original is in vintage-retro DOS format:
source.encode!('UTF-8')
That should flip it from 8-bit ASCII to UTF-8 as expected by the Hpricot parser.
The inner encoding of the source variable is UTF-8 but that is not what you want.
As tadman wrote, you must first tell Ruby that the actual characters in the string are in the IBM437 encoding. Then you can convert that string to your favourite encoding, but only if such a conversion is possible.
source.force_encoding('IBM437').encode('UTF-8')
In your case, you cannot convert your string to ISO-8859-2 because not all IBM437 characters can be converted to that charset. Sticking to UTF-8 is probably your best option.
Anyway, are you sure that the file is actually transmitted in IBM437? Maybe it is stored as such on the HTTP server but is sent over the wire in another encoding. Or it may not even be exactly IBM437; it may be CP852, also called MS-DOS Latin 2 (different from ISO Latin 2).
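One way to test that suspicion is to decode the same bytes under both candidate encodings and see which one gives readable text. A sketch, where the bytes are produced locally as a stand-in for the real response (Ruby lists CP852 under the name IBM852):

```ruby
# Stand-in for the downloaded bytes: "NOŻYCE" as a CP852 server would send it.
wire_bytes = "NOŻYCE".encode("IBM852").b

# Decode the same bytes under each assumption.
as_437 = wire_bytes.dup.force_encoding("IBM437").encode("UTF-8")
as_852 = wire_bytes.dup.force_encoding("IBM852").encode("UTF-8")

puts "as IBM437: #{as_437}"
puts "as IBM852: #{as_852}"
```

Whichever decoding shows the Polish letters correctly is the one to use in force_encoding before converting to UTF-8.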
