How to use utf-8 encoded data in erb template - ruby

I have a data file stored in UTF-8, and I want to embed its data in an ERB template. The data file explicitly declares UTF-8 at the top. But when I run the ERB engine, I get an Encoding::CompatibilityError.
I assumed that, since Ruby's default encoding is ASCII, the ERB template must also be treated as ASCII. I explicitly changed it to UTF-8, but that did not help.
Here is the data-file:
# coding: utf-8
samples: [
{ name: '北京', city: '北京' }
]
Here is the Erb template:
<% # -*- coding: UTF-8 -*- %>
#...
<p><%= samples[:name] %></p>

(I decided to write a separate answer.)
There are two issues, I think:
1. the encoding of the data file on input
2. the encoding you use on output
The erb library understands the encoding declared in a magic comment, but it does nothing for the data file; you have to take care of that yourself. So when you read the file, either specify its encoding explicitly or set the default encoding beforehand.
On output, you also need to specify the encoding. You can do this per I/O channel.
The easiest way is to set the default encoding:
Encoding.default_external = "UTF-8"
This makes UTF-8 the default for any I/O that does not specify its own encoding.
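As a minimal sketch of the input side, the data can be read with an explicit encoding and then handed to ERB. The helper name render_name, the one-line template, and the temporary data file are assumptions made up for this illustration; they are not the asker's actual files.

```ruby
require "erb"
require "tempfile"

# Hypothetical helper: read the data file as UTF-8 regardless of
# Encoding.default_external, then render it through a small template.
def render_name(data_path)
  name = File.read(data_path, encoding: "UTF-8").strip
  ERB.new("<p><%= name %></p>").result(binding)
end

# Stand-in data file containing the two characters from the question.
Tempfile.create(["city", ".txt"]) do |f|
  f.binmode                  # write the UTF-8 bytes unchanged
  f.write("\u5317\u4EAC\n")  # U+5317 U+4EAC
  f.flush
  html = render_name(f.path)
  puts html.encoding
  puts html
end
```

Passing the encoding at the read site keeps the result independent of the environment's locale, which is the alternative to setting Encoding.default_external globally.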

In the scenario where you have an ERB template rendering strings from another file that is in UTF-8, adding the following to the top of the ERB template solved it for me:
<%# coding: UTF-8 %>
(instead of <% # -*- coding: UTF-8 -*- %>)

If you're using Rails, have you configured the default encoding in application.rb? For example:
config.encoding = "utf-8"
My Rails (3.2.1) project does not contain any encoding configuration other than that.
Another thing to check is whether your data file really is in UTF-8.
On a Unix-like system, you can use the nkf command to guess the encoding:
nkf --guess FILE_NAME
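If nkf is not available, a rough check can be done in Ruby itself. This is only a sketch: the helper name utf8_file? is made up here, and a positive result means the bytes form valid UTF-8, not that the file was necessarily written as UTF-8 (a pure ASCII file passes too).

```ruby
require "tempfile"

# Hypothetical helper: true if the raw bytes of the file are valid UTF-8.
def utf8_file?(path)
  File.binread(path).force_encoding("UTF-8").valid_encoding?
end

Tempfile.create("sample") do |f|
  f.binmode
  f.write("\u5317\u4EAC")   # valid UTF-8 bytes
  f.flush
  p utf8_file?(f.path)      # => true
end

Tempfile.create("sample") do |f|
  f.binmode
  f.write("\xDE\xE4\xE5".b) # bytes from some single-byte code page
  f.flush
  p utf8_file?(f.path)      # => false
end
```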

Specify <meta http-equiv="content-type" content="text/html;charset=UTF-8" /> in the header of the template

Related

Why does a file written out encoded as UTF-8 end up being ISO-8859-1 instead?

I am reading an ISO-8859-1 encoded text file, transcoding it to UTF-8, and writing out a different file as UTF-8. However, when I inspect the output file, it is still encoded as ISO-8859-1! What am I doing wrong?
Here is my ruby class:
module EF
  class Transcoder
    # app_path ......... Path to the java console application (InferEncoding.jar) that infers the character encoding.
    # target_encoding .. Transcodes the text loaded from the file into this encoding.
    attr_accessor :app_path, :target_encoding
    def initialize(consoleAppPath)
      @app_path = consoleAppPath
      @target_encoding = "UTF-8"
    end
    def detect_encoding(filename)
      encoding = `java -jar #{@app_path} \"#{filename}\"`
      encoding = encoding.strip
    end
    def transcode(filename)
      original_encoding = detect_encoding(filename)
      content = File.open(filename, "r:#{original_encoding}", &:read)
      content = content.force_encoding(original_encoding)
      content.encode!(@target_encoding, :invalid => :replace)
    end
    def transcode_file(input_filename, output_filename)
      content = transcode(input_filename)
      File.open(output_filename, "w:#{@target_encoding}") do |f|
        f.write content
      end
    end
  end
end
By way of explanation, @app_path is the path to a Java jar file. This console application will read a text file and tell me what its current encoding is (printing it to stdout). It uses the ubiquitous ICU library. (I tried using the ruby gem charlock-holmes, but I cannot get it to compile on Windows for MINGW. The Java bindings to ICU are good, so I wrote a Java application instead.)
To call the above class, I do this in irb:
require_relative 'transcoder'
tc = EF::Transcoder.new("C:/Users/paul.chernoch/Documents/java/InferEncoding.jar")
tc.detect_encoding "C:/temp/infer-encoding-test/ISO-8859-1.txt"
tc.transcode_file("C:/temp/infer-encoding-test/ISO-8859-1.txt", "C:/temp/infer-encoding-test/output-utf8.txt")
tc.detect_encoding "C:/temp/infer-encoding-test/output-utf8.txt"
The file ISO-8859-1.txt is encoded like it sounds. I used Notepad++ to write the file using that encoding.
I used my Java application to test the file. It concurs that it is in ISO-8859-1 format.
I also created a file in Notepad++ and saved it as UTF-8. I then verified using my java app that it was in UTF-8.
After I perform the above in irb, I used my java app to test the output file and it says the format is still ISO-8859-1.
What am I doing wrong? If you hard-code the method detect_encoding to return "ISO-8859-1", you do not need my java application to replicate the part that reads the file.
Any solution must NOT use charlock-holmes.

Rails parse upload file "\xDE" from ASCII-8BIT to UTF-8

I am trying to parse an uploaded *.txt file and extract some information to import into the DB. Before saving it, I try to convert the string to UTF-8. When I do that, I get this error:
"\xDE" from ASCII-8BIT to UTF-8
The first characters of the file:
Import data \xDE\xE4\xE5
The code that runs before parsing:
# encoding: utf-8
require "iconv"
class HandlerController < ApplicationController
def add_report
utf8_format = "UTF-8"
file_data = params[:import_file].tempfile.read.encode(utf8_format)
end
end
P.S. I also tried doing this with iconv, but it didn't help.
You need to start from a known encoding with valid content (and compatible characters for input and output) before you will be able to successfully convert a string.
ASCII-8BIT (binary) doesn't assign any characters to the byte values 128..255, so those bytes cannot be converted to Unicode.
The chances are that the input - as you say it is text - is in some other encoding to start with. You could start by assuming ISO-8859-1 ("Latin-1") which is quite a common encoding, although you may have some other clue, or know what characters to expect in the file, in which case you should try others.
I suggest you try something like this:
file_data = params[:import_file].tempfile.read.force_encoding('ISO-8859-1')
utf8_file_data = file_data.encode(utf8_format)
This probably will not give you an error, but if my guess at 'ISO-8859-1' is wrong, it will give you gibberish unfortunately.
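To make the difference concrete, here is a small sketch using the bytes quoted in the question. The helper name to_utf8 is made up for the example, and ISO-8859-1 is still only a guess at the real source encoding.

```ruby
# Hypothetical helper: label the raw bytes with their (assumed) source
# encoding, then transcode them to UTF-8.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

# An upload read in binary mode comes back tagged as ASCII-8BIT.
raw = "Import data \xDE\xE4\xE5".b

begin
  raw.encode("UTF-8")   # no character mapping exists for bytes >= 0x80
rescue Encoding::UndefinedConversionError => e
  puts "direct encode fails: #{e.message}"
end

converted = to_utf8(raw, "ISO-8859-1")
puts converted.encoding   # => UTF-8
puts converted.length     # => 15
```

The conversion succeeds for any input because every byte is a valid ISO-8859-1 character, which is exactly why a wrong guess yields gibberish instead of an error.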

ruby 1.9 wrong file encoding on windows

I have a ruby file with these contents:
# encoding: iso-8859-1
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}
puts File.read('foo.txt').encoding
When I run it from windows command prompt ruby 1.9.3 I get: IBM437
When I run it from cygwin ruby 1.9.3 I get: UTF-8
What I expect to get is: iso-8859-1
Can someone explain what's happening here?
UPDATE
Here's a better description of what I'm looking for:
I understand now, thanks to Darshan, that by default ruby will load files in Encoding.default_external, but shouldn't the # encoding: iso-8859-1 line override that?
Should ruby be able to auto-detect a file's encoding? Is there any filesystem where the encoding is an attribute?
What is my best option to 'remember' the encoding I saved the file in?
You're not specifying the encoding when you read the file. You're being very careful to specify it everywhere except there, but then you're reading it with the default encoding.
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'.force_encoding('iso-8859-1')}
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding }
# => ISO-8859-1
Also note that you probably mean 'fòo'.encode('iso-8859-1') rather than 'fòo'.force_encoding('iso-8859-1'). The latter leaves the bytes unchanged, while the former transcodes the string.
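The difference shows up in the bytes. A short sketch, assuming the string starts out as UTF-8 (the helper name relabel_and_transcode is made up for the illustration):

```ruby
# Hypothetical helper returning both variants of the same input string.
def relabel_and_transcode(text, target)
  {
    relabeled:  text.dup.force_encoding(target), # same bytes, new label
    transcoded: text.encode(target)              # new bytes, same characters
  }
end

result = relabel_and_transcode("f\u00F2o", "ISO-8859-1")  # "fòo" as UTF-8

p result[:relabeled].bytes    # => [102, 195, 178, 111]
p result[:transcoded].bytes   # => [102, 242, 111]
p result[:relabeled].length   # => 4
p result[:transcoded].length  # => 3
```

After force_encoding, the two UTF-8 bytes of ò are treated as two separate ISO-8859-1 characters, so the relabeled string is one character longer than intended.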
Update: I'll elaborate a bit since I wasn't as clear or thorough as I could have been.
If you don't specify an encoding with File.read(), the file will be read with Encoding.default_external. Since you're not setting that yourself, Ruby is using a value depending on the environment it's run in. In your Windows environment, it's IBM437; in your Cygwin environment, it's UTF-8. So my point above was that of course that's what the encoding is; it has to be, and it has nothing to do with what bytes are contained in the file. Ruby doesn't auto-detect encodings for you.
force_encoding() doesn't change the bytes in a string, it only changes the Encoding attached to those bytes. If you tell Ruby "pretend this string is ISO-8859-1", then it won't transcode them when you tell it "please write this string as ISO-8859-1". encode() transcodes for you, as does writing to the file if you don't trick it into not doing so.
Putting those together, if you have a source file in ISO-8859-1:
# encoding: iso-8859-1
# Write in ISO-8859-1 regardless of default_external
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}
# Read in ISO-8859-1 regardless of default_external,
# transcoding if necessary to default_internal, if set
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding } # => ISO-8859-1
puts File.read('foo.txt').encoding # -> Whatever is specified by default_external
If you have a source file in UTF-8:
# encoding: utf-8
# Write in ISO-8859-1 regardless of default_external, transcoding from UTF-8
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}
# Read in ISO-8859-1 regardless of default_external,
# transcoding if necessary to default_internal, if set
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding } # => ISO-8859-1
puts File.read('foo.txt').encoding # -> Whatever is specified by default_external
Update 2, to answer your new questions:
No, the # encoding: iso-8859-1 line does not change Encoding.default_external, it only tells Ruby that the source file itself is encoded in ISO-8859-1. Simply add
Encoding.default_external = "iso-8859-1"
if you expect all files that you read to be stored in that encoding.
No, I don't personally think Ruby should auto-detect encodings, but reasonable people can disagree on that one, and a discussion of "should it be so" seems off-topic here.
Personally, I use UTF-8 for everything, and in the rare circumstances that I can't control encoding, I manually set the encoding when I read the file, as demonstrated above. My source files are always in UTF-8. If you're dealing with files that you can't control and don't know the encoding of, the charguess gem or similar would be useful.
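A sketch of that workflow: write a file in ISO-8859-1, then read it back with an explicit external encoding and transcode to UTF-8 on the way in. The helper names and the temporary path are assumptions for the example; the "r:ISO-8859-1:UTF-8" mode string uses Ruby's external:internal encoding form.

```ruby
require "tmpdir"

# Hypothetical helpers for a file whose encoding is known to be ISO-8859-1.
def write_latin1(path, text)
  File.open(path, "w:ISO-8859-1") { |f| f << text }  # transcodes on write
end

def read_latin1_as_utf8(path)
  File.open(path, "r:ISO-8859-1:UTF-8", &:read)      # transcodes on read
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "foo.txt")
  write_latin1(path, "f\u00F2o")
  p File.binread(path).bytes     # => [102, 242, 111]
  text = read_latin1_as_utf8(path)
  puts text.encoding             # => UTF-8
  p text.bytes                   # => [102, 195, 178, 111]
end
```

Neither call depends on Encoding.default_external, so the result is the same from the Windows command prompt and from Cygwin.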

Does Ruby auto-detect a file's codepage?

If I save a text file containing the character б (U+0431) but save it with an ANSI code page, Ruby returns ord = 63. Saving the file with UTF-8 as the encoding returns ord = 208, 177.
Should I be explicitly telling Ruby to handle the input as being encoded with a certain code page? If so, how do I do this?
Is that in ruby source code or in a file which is read with File.open? If it's in the ruby source code, you can (in ruby 1.9) add this to the top of the file:
# encoding: utf-8
Or you could specify most other encodings (like iso-8859-1).
If you are reading a file with File.open, you could do something like this:
File.open("file.txt", "r:utf-8") {|f| ... }
As with the encoding comment, you can pass in different types of encodings here too.
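A small sketch of what the numbers in the question mean. The helper name code_points and the temporary file are made up for the example, and Windows-1252 is only assumed as a typical "ANSI" code page that has no б.

```ruby
require "tempfile"

# Hypothetical helper: code points of the characters, read with an
# explicit encoding instead of Encoding.default_external.
def code_points(path, encoding)
  File.read(path, encoding: encoding).chars.map(&:ord)
end

Tempfile.create(["cyrillic", ".txt"]) do |f|
  f.binmode
  f.write("\u0431")                 # U+0431 stored as UTF-8
  f.flush
  p File.binread(f.path).bytes      # => [208, 177]
  p code_points(f.path, "UTF-8")    # => [1073]
end

# A code page without the character can only store a substitute, here "?".
p "\u0431".encode("Windows-1252", undef: :replace).ord  # => 63
```

So 208, 177 are the two UTF-8 bytes of a single character, while 63 suggests the character was already replaced by "?" when the file was saved, before Ruby ever read it.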
