Incompatible character encodings error - ruby

I'm trying to run a Ruby script which generates translated HTML files from a JSON file. However, I get this error:
incompatible character encodings: UTF-8 and CP850
Ruby
translation_hash = JSON.parse(File.read('translation_master.json').force_encoding("ISO-8859-1").encode("utf-8", replace: nil))
It seems to get stuck on this line of the JSON:
Json
"3": "Klassisch geschnittene Anzüge",
because there is a special character "ü". The JSON file's encoding is ANSI. Any ideas what could be wrong?

Try adding # encoding: UTF-8 to the top of the Ruby file. This magic comment tells Ruby which encoding the source file itself, and so its string literals, is written in. If this doesn't work, find out which encoding the JSON text actually uses and change the line that reads it accordingly.

IMHO your code should work if the JSON file really is encoded as "ISO-8859-1" and if it is valid JSON.
So you should first verify that "ISO-8859-1" is the correct encoding and,
while you are at it, that the file is a valid JSON file.
# read the file using the encoding you assume is correct
json_or_not = File.read('translation_master.json').force_encoding("ISO-8859-1")
# print the result and check whether anything looks garbled
puts json_or_not
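
A minimal sketch of that check, assuming a hypothetical stand-in for translation_master.json that is written to a temporary directory as ISO-8859-1 bytes. The helper names (utf8_bytes?, read_json_as_utf8, write_sample_json) are made up for illustration. It first tests whether the raw bytes are valid UTF-8, then lets File.read transcode from the assumed source encoding before JSON.parse sees the text.

```ruby
require 'json'
require 'tmpdir'

# True when the raw bytes of the file form valid UTF-8.
def utf8_bytes?(path)
  File.binread(path).force_encoding('UTF-8').valid_encoding?
end

# Read the file as source_encoding, transcode to UTF-8, then parse it.
def read_json_as_utf8(path, source_encoding)
  JSON.parse(File.read(path, encoding: "#{source_encoding}:UTF-8"))
end

# Hypothetical sample: the line from the question, stored as ISO-8859-1.
def write_sample_json(dir)
  path = File.join(dir, 'translation_master.json')
  json = %({"3": "Klassisch geschnittene Anz\u00FCge"})
  File.binwrite(path, json.encode('ISO-8859-1'))
  path
end

Dir.mktmpdir do |dir|
  path = write_sample_json(dir)
  puts "valid UTF-8 on disk? #{utf8_bytes?(path)}"
  puts read_json_as_utf8(path, 'ISO-8859-1')['3']
end
```

If the file is really Windows-1252 (which Windows tools often call "ANSI"), passing 'Windows-1252' instead should work the same way here, since ü is the same byte (0xFC) in both encodings.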

Related

Unexpected encoding error using JSON.parse

I've got a rather large JSON file on my Windows machine and it contains stuff like \xE9. When I JSON.parse it, it works fine.
However, when I push the code to my server running CentOS, I always get this: "\xE9" on US-ASCII (Encoding::InvalidByteSequenceError)
Here is the output of file on both machines
Windows:
λ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators
CentOS:
$ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators
Here is the error I get when trying to parse it:
$ ruby -rjson -e 'JSON.parse(File.read("data.json"))'
/usr/local/rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/json/common.rb:155:in `encode': "\xC3" on US-ASCII (Encoding::InvalidByteSequenceError)
What could be causing this problem? I've tried using iconv to change the file into every possible encoding I can, but nothing seems to work.
"\xE9" is é in ISO-8859-1 (and various other ISO-8859-X encodings and Windows-1250 and ...) and is certainly not UTF-8.
You can get File.read to fix up the encoding for you by using the encoding options:
File.read('data.json',
:external_encoding => 'iso-8859-1',
:internal_encoding => 'utf-8'
)
That will give you a UTF-8 encoded string that you can hand to JSON.parse.
Or you could let JSON.parse deal with the encoding by using just :external_encoding to make sure the string comes off the disk with the right encoding flag:
JSON.parse(
File.read('data.json',
:external_encoding => 'iso-8859-1',
)
)
You should have a close look at data.json to figure out why file(1) thinks it is UTF-8. The file might incorrectly have a BOM when it is not UTF-8 or someone might be mixing UTF-8 and Latin-1 encoded strings in one file.
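
One way to take that close look is sketched below, assuming you can load the file's raw bytes (for example with File.binread('data.json')). The helper names and the two-line sample are hypothetical. It reports which lines are not valid UTF-8 and whether the data starts with a UTF-8 BOM.

```ruby
# Return the 1-based numbers of lines whose bytes are not valid UTF-8.
def invalid_utf8_line_numbers(bytes)
  bad = bytes.b.each_line.with_index(1).reject do |line, _number|
    line.force_encoding('UTF-8').valid_encoding?
  end
  bad.map { |_line, number| number }
end

# True when the data starts with the three-byte UTF-8 BOM (EF BB BF).
def utf8_bom?(bytes)
  bytes.b.start_with?("\xEF\xBB\xBF".b)
end

# Hypothetical sample mixing a UTF-8 line with a Latin-1 line.
def mixed_sample
  "{\"ok\": \"caf\u00E9\",\n".b + "\"bad\": \"caf\xE9\"}\n".b
end

p invalid_utf8_line_numbers(mixed_sample)
p utf8_bom?(mixed_sample)
```

A file without line terminators, like the one in the question, is a single line to this check, so splitting on another delimiter may be more useful there.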

Encoding::UndefinedConversionError when using open-uri

When I do this:
require 'open-uri'
response = open('some-html-page-url-here')
response.read
On a certain url I get the following error (due to wrong encoding in the returned url?!):
Encoding::UndefinedConversionError: U+00A0 from UTF-8 to US-ASCII
Any way around this to still get the html content?
In the introduction to the open-uri module, the docs say this,
It is possible to open an http, https or ftp URL as though it were a file
And if you know anything about reading files, you know that you need the encoding of the file you are trying to read, so that you can tell Ruby how to read it (i.e. how many bytes each character occupies).
In the first code example in the docs, there is this:
open("http://www.ruby-lang.org/en") {|f|
f.each_line {|line| p line}
p f.base_uri # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
p f.content_type # "text/html"
p f.charset # "iso-8859-1"
p f.content_encoding # []
p f.last_modified # Thu Dec 05 02:45:02 UTC 2002
}
So if you don't know the encoding of the "file" you are trying to read, you can get the encoding with f.charset. If that encoding is different than your default external encoding, you will most likely get an error. Your default external encoding is the encoding ruby uses to read from external sources. You can check what your default external encoding is set to like this:
The default external Encoding is pulled from your environment...Have a
look:
$ echo $LC_CTYPE
en_US.UTF-8
or
$ ruby -e 'puts Encoding.default_external.name'
UTF-8
http://graysoftinc.com/character-encodings/ruby-19s-three-default-encodings
On Mac OS X, I actually have to do the following to see the default external encoding:
$ echo $LANG
You can set your default external encoding with the method Encoding.default_external=(), so you might want to try something like this:
open('some_url_here') do |f|
Encoding.default_external = f.charset
html = f.read
end
Setting an IO object to binmode, like you have done, tells ruby that the encoding of the file is BINARY (or ruby's confusing synonym ASCII-8BIT), which means you are telling ruby that each character in the file takes up one byte. In your case, you are telling ruby to read the character U+00A0, whose UTF-8 representation takes up two bytes 0xC2 0xA0, as two characters instead of just one character, so you have eliminated your error, but you have produced two junk characters instead of the original character.
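
The byte counts described above can be checked directly. This is a small sketch; char_and_byte_counts is a hypothetical helper name.

```ruby
# Compare how one string is counted as UTF-8 text and as binary data.
def char_and_byte_counts(text)
  utf8 = text.encode('UTF-8')
  {
    utf8_chars: utf8.length,      # characters when read as UTF-8
    bytes: utf8.bytesize,         # bytes on the wire or on disk
    binary_chars: utf8.b.length   # "characters" when tagged ASCII-8BIT
  }
end

nbsp = "\u00A0"
p nbsp.bytes.map { |b| format('0x%02X', b) }
p char_and_byte_counts(nbsp)
```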
Doing a response.binmode before the response.read stops the error from happening.
Had the same issue, will add my solution here:
After reading the open-uri documentation further, it turns out you could set the encoding of the io before reading using the set_encoding method, like this:
result = open('some-page-uri') do |io|
io.set_encoding(Encoding.default_external)
io.read
end
Hope it helps!
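
Both suggestions above come down to labelling the downloaded bytes with the charset the server reported. A network-free sketch of that idea follows, assuming the body was read in binary mode and the charset string came from something like f.charset. decode_body and its fallback parameter are hypothetical names, not part of open-uri.

```ruby
# Label raw bytes with the reported charset and transcode them to UTF-8.
# Falls back to `fallback` when the charset is missing or unknown to Ruby.
def decode_body(bytes, charset, fallback: Encoding::UTF_8)
  encoding =
    begin
      charset ? Encoding.find(charset) : fallback
    rescue ArgumentError
      fallback
    end
  bytes.dup.force_encoding(encoding).encode(Encoding::UTF_8)
end

latin1_body = "caf\xE9".b # bytes as an ISO-8859-1 server would send them
puts decode_body(latin1_body, 'iso-8859-1')
puts decode_body("\xC2\xA0".b, nil).length
```

Unlike assigning Encoding.default_external, this does not change global state. It can still raise if the bytes are not valid in the reported charset, so treat it as a starting point.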

ruby `encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)

Hannibal episodes in tvdb have weird characters in them.
For example:
Œuf
So ruby spits out:
./manifesto.rb:19:in `encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
from ./manifesto.rb:19:in `to_json'
from ./manifesto.rb:19:in `<main>'
Line 19 is:
puts @tree.to_json
Is there a way to deal with these non utf characters? I'd rather not replace them, but convert them? Or ignore them? I don't know, any help appreciated.
Weird part is that script works fine via cron. Manually running it creates error.
File.open(yml_file, 'w') should be changed to File.open(yml_file, 'wb')
It seems you should use another encoding for the object. Set the proper encoding on the variable @tree, for instance iso-8859-1 instead of ascii-8bit, by calling @tree.force_encoding('ISO-8859-1'), because ASCII-8BIT is meant only for binary data.
To find the current external encoding for ruby, issue:
Encoding.default_external
If sudo solves the problem, the problem was in the default encoding (codepage), so to resolve it you have to set the proper default encoding in one of two ways:
In Ruby, to change the encoding to UTF-8 or another suitable one, do as follows:
Encoding.default_external = Encoding::UTF_8
In bash, grep the currently valid settings:
$ sudo env|grep UTF-8
LC_ALL=ru_RU.UTF-8
LANG=ru_RU.UTF-8
Then set them in .bashrc in a similar way, using your own locale rather than necessarily ru_RU, for example:
export LC_ALL=ru_RU.UTF-8
export LANG=ru_RU.UTF-8
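
When the string is tagged ASCII-8BIT but its bytes are really UTF-8, relabelling it is often enough for to_json to work. The sketch below assumes that situation. to_utf8_text is a hypothetical helper, and the ISO-8859-1 fallback is an assumption for illustration, not something the question confirms.

```ruby
require 'json'

# Relabel binary-tagged bytes as UTF-8 when they are valid UTF-8;
# otherwise assume ISO-8859-1 (an assumption) and transcode.
def to_utf8_text(str)
  candidate = str.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  str.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

episode = "\xC5\x92uf".b # the UTF-8 bytes of "Œuf", tagged as ASCII-8BIT
puts episode.encoding
puts [to_utf8_text(episode)].to_json
```

force_encoding only changes the label and leaves the bytes untouched, so nothing is replaced or lost when the bytes were UTF-8 all along.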
I had the same problem when saving to the database. Here is one thing that I use (perhaps it will help someone).
If you know that your text sometimes contains unusual characters, you can
encode it in another format before saving, and then
decode it again after it is returned from the database.
Example:
require 'cgi'
string = "Œuf"
# before saving, encode the string
text_to_save = CGI.escape(string)
# the character "Œ" is encoded as "%C5%92"; the other characters stay the same
# => "%C5%92uf"
# after loading from the database, decode it
CGI.unescape("%C5%92uf")
# => "Œuf"
I just suffered through a number of hours trying to fix a similar problem. I'd checked my locales, database encoding, everything I could think of and was still getting ASCII-8BIT encoded data from the database.
Well, it turns out that if you store text in a binary field, it will automatically be returned as ASCII-8BIT encoded text, which makes sense, however this can (obviously) cause problems in your application.
It can be fixed by changing the column type back to :text in your migrations.

Why do I get ArgumentError - invalid byte sequence in UTF-8?

While trying to print Duplicaci¾n out of a CSV file, I get the following error:
ArgumentError - invalid byte sequence in UTF-8
I'm using Ruby 1.9.3-p362 and opening the file using:
CSV.foreach(fpath, headers: true) do |row|
How can I skip an invalid character without using iconv or str.encode(undef: :replace, invalid: :replace, replace: '')?
I tried answers from the following questions, but nothing worked:
ruby 1.9: invalid byte sequence in UTF-8
Ruby Invalid Byte Sequence in UTF-8
This is from the CSV.open documentation:
You must provide a mode with an embedded Encoding designator unless your data is in Encoding::default_external(). CSV will check the Encoding of the underlying IO object (set by the mode you pass) to determine how to parse the data. You may provide a second Encoding to have the data transcoded as it is read just as you can with a normal call to IO::open(). For example, "rb:UTF-32BE:UTF-8" would read UTF-32BE data from the file but transcode it to UTF-8 before CSV parses it.
That applies to any method in CSV that opens a file.
Also start reading in the documentation at the part beginning with:
CSV and Character Encodings (M17n or Multilingualization)
Ruby is expecting UTF-8 but is seeing bytes that don't fit. I'd suspect Windows-1252 or ISO-8859-1 or a variant.
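
A minimal sketch of the transcoding the documentation describes, assuming the file is Windows-1252 (a guess, as noted above). The sample file and the helper names are hypothetical, and the csv library must be available (it ships as a bundled gem in recent Ruby versions).

```ruby
require 'csv'
require 'tmpdir'

# Read a CSV file stored in source_encoding and return its rows as hashes
# whose strings have been transcoded to UTF-8.
def read_csv_as_utf8(path, source_encoding)
  rows = []
  CSV.foreach(path, headers: true, encoding: "#{source_encoding}:UTF-8") do |row|
    rows << row.to_h
  end
  rows
end

# Hypothetical sample file holding "Duplicación" as Windows-1252 bytes.
def write_sample_csv(dir)
  path = File.join(dir, 'sample.csv')
  File.binwrite(path, "id,name\n1,Duplicaci\u00F3n\n".encode('Windows-1252'))
  path
end

Dir.mktmpdir do |dir|
  p read_csv_as_utf8(write_sample_csv(dir), 'Windows-1252')
end
```

This transcodes rather than skips the character, which is usually preferable when the source encoding can be identified.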

unable to convert array data to json when '¿' is there

This is my Ruby code:
require 'json'
a=Array.new
value="¿value"
data=value.gsub('¿','-')
a[0]=data
puts a
puts "json is"
puts jsondata=a.to_json
I get the following error:
C:\Ruby193>new.rb
C:/Ruby193/New.rb:3: invalid multibyte char (US-ASCII)
C:/Ruby193/New.rb:3: syntax error, unexpected tIDENTIFIER, expecting $end
value="┐value"
^
That's not a JSON problem: Ruby can't decode your source because it contains a multibyte character. By default, Ruby 1.9 tries to decode source files as US-ASCII (Ruby 2.0 and later default to UTF-8), but ¿ isn't representable in US-ASCII, so it fails. The solution is to provide a magic comment as described in the documentation. Assuming your source file's encoding is UTF-8, you can tell Ruby that like so:
# encoding: UTF-8
# ...
value = "¿value"
# ...
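
Filled in, the script from the question might look like the sketch below. It assumes the file is saved as UTF-8, and dashed_json is a hypothetical helper name. The magic comment only takes effect when it is the first line of the file (or the second, after a shebang).

```ruby
# encoding: UTF-8
require 'json'

# Replace the inverted question mark with a dash and return a JSON array.
def dashed_json(value)
  [value.gsub('¿', '-')].to_json
end

puts __ENCODING__ # the encoding Ruby is using for this source file
puts 'json is'
puts dashed_json('¿value')
```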
With an editor or an IDE, icktoofay's solution (# encoding: UTF-8 in the first line) is perfect.
In a shell with IRB or Pry it is difficult to find a working configuration. But there is a workaround that at least worked for my encoding problem, which was entering German umlaut characters.
Workaround for PRY:
In PRY I use the edit command to edit the contents of the input buffer
as described in this pry wiki page.
This opens an external editor (you can configure which editor you want), and the editor accepts special characters that cannot be entered in PRY directly.

Resources