Encoding issue when using Nokogiri replace - ruby

I have this code:
# encoding: utf-8
require 'nokogiri'
s = "<a href='/path/to/file'>Café Verona</a>".encode('UTF-8')
puts "Original string: #{s}"
@doc = Nokogiri::HTML::DocumentFragment.parse(s)
links = @doc.css('a')
only_text = 'Café Verona'.encode('UTF-8')
puts "Replacement text: #{only_text}"
links.first.replace(only_text)
puts @doc.to_html
However, the output is this:
Original string: <a href='/path/to/file'>Café Verona</a>
Replacement text: Café Verona
CafÃ© Verona
Why does the text in @doc end up with the wrong encoding?
I tried with and without encode('UTF-8') or using Document instead of DocumentFragment, but it's the same problem.
I'm using Nokogiri v1.5.6 with Ruby 1.9.3p194.

It seems that if you pass a Nokogiri text node instead of a plain string, it does the right thing ;)
links.first.replace Nokogiri::XML::Text.new(only_text, @doc)

I can't duplicate the problem, but I have two different things to try:
Instead of using:
s = "<a href='/path/to/file'>Café Verona</a>".encode('UTF-8')
Try:
s = "<a href='/path/to/file'>Café Verona</a>"
Your string is already UTF-8 encoded, because of your statement # encoding: utf-8. That's why you put that in the script, to tell Ruby the literal string is in UTF-8. It's possible that you're double-encoding it, though I don't think Ruby will -- it should silently ignore the second attempt because it's already UTF-8.
Another thing I wonder about is, output like:
CafÃ© Verona
is an indicator that the language/character-set encoding of your system and your terminal aren't right. Trying to output UTF-8 strings on a system set to something else can get mismatches in the terminal and/or browser. Windows systems are typically Win-1252, ISO-8859-1 or something similar, not UTF-8. On my Mac OS system I have these environment variables set:
LANG=en_US.UTF-8
LC_ALL=en_US.UTF-8
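The kind of mismatch described above can be reproduced without Nokogiri. A minimal sketch, assuming the garbling comes from UTF-8 bytes being read as ISO-8859-1 (the helper name misread_as_latin1 is made up for illustration and is not part of any library):

```ruby
# Hypothetical helper: reinterpret the UTF-8 bytes of +text+ as ISO-8859-1,
# then transcode back to UTF-8 so the garbling is visible on a UTF-8 terminal.
def misread_as_latin1(text)
  text.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

original = "Caf\u00E9 Verona"      # "Café Verona"
garbled  = misread_as_latin1(original)

puts garbled
puts original.length               # character count of the original
puts garbled.length                # one more: the two bytes of "é" became two characters
```

If the output of a script looks like this, the bytes are usually still correct UTF-8 and only the reader (terminal, browser, or a later processing step) is assuming the wrong encoding.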
"Open iso-8859-1 encoded html with nokogiri messes up accents" might be useful too.

Related

Can I programmatically convert "Iâ€™d" to "I’d" using Ruby?

I can't seem to find the right combination of String#encode shenanigans.
I think I'd got confused on this one so I'll post this here to hopefully help anyone else who is similarly confused.
I was trying to do my encoding in an irb session, which gives you
irb(main):002:0> 'Iâ€™d'.force_encoding('UTF-8')
=> "Iâ€™d"
And if you try using encode instead of force_encoding then you get
irb(main):001:0> 'Iâ€™d'.encode('UTF-8')
=> "Iâ€™d"
This is with irb set to use an output and input encoding of UTF-8. In my case, converting that string the way I want involves telling Ruby that the source string is in Windows-1252 encoding. You can do this with the -E argument, which takes the encodings in the form 'inputencoding:outputencoding', and then you get this
$ irb -EWindows-1252:UTF-8
irb(main):001:0> 'Iâ€™d'
=> "I\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2d"
That looks wrong unless you pipe it out, which gives this
$ ruby -E Windows-1252:UTF-8 -e "puts 'Iâ€™d'"
I’d
Hurrah. I'm not sure about why Ruby showed it as "I\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2d" (something to do with the code page of the terminal?) so if anyone can comment with further insight that would be great.
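The same repair can be done inside a script instead of through -E. A minimal sketch, assuming the garbled text is a UTF-8 string whose original bytes were once read as Windows-1252 (repair_mojibake is a hypothetical helper name, not a Ruby built-in):

```ruby
# Hypothetical helper: undo "UTF-8 bytes read as Windows-1252" garbling.
def repair_mojibake(garbled)
  # Map each character back to its Windows-1252 byte, then relabel
  # the resulting bytes as the UTF-8 they were all along.
  garbled.encode('Windows-1252').force_encoding('UTF-8')
end

# The three characters that replaced the apostrophe, written as escapes
# so the example does not depend on the encoding of this source file.
garbled = "I\u00E2\u20AC\u2122d"

puts repair_mojibake(garbled)
```

Note that encode raises an error if the input contains a character that has no Windows-1252 byte, so this only suits strings that really were garbled this way.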
I expect your script is using the encoding cp1251 and you have ruby >= 1.9.
Then you can use force_encoding:
#encoding: cp1251
#works also with encoding: binary
source = 'Iâ€™d'
puts source.force_encoding('utf-8') #-> I’d
If my assumptions are wrong: which encoding do you use, and which Ruby version?
A little background:
Problems with encoding are difficult to analyse. There may be conflicts between:
Encoding of the source code (That's defined by the editor).
Expected encoding of the source code (that's defined with #encoding on the first line). This is used by ruby.
Encoding of the string (see e.g. section String encodings in http://nuclearsquid.com/writings/ruby-1-9-encodings/ )
Encoding of the output shell
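Most of the encodings in this list can be inspected from inside Ruby. A rough sketch (encoding_report is an illustrative helper name); the editor's real on-disk encoding is the one item Ruby cannot report, and the shell's encoding is only visible indirectly through the locale that feeds the default external encoding:

```ruby
# Collect the encodings Ruby itself knows about for a given string.
def encoding_report(sample)
  {
    source_file:      __ENCODING__,               # from the magic comment, or Ruby's default
    string:           sample.encoding,            # tag carried by this particular string
    default_external: Encoding.default_external,  # assumed for data read from files and pipes
    default_internal: Encoding.default_internal   # nil unless transcoding on read was requested
  }
end

encoding_report('abc').each do |name, enc|
  puts "#{name}: #{enc.inspect}"
end
```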

Encoding::UndefinedConversionError when using open-uri

When I do this:
require 'open-uri'
response = open('some-html-page-url-here')
response.read
On a certain url I get the following error (due to wrong encoding in the returned url?!):
Encoding::UndefinedConversionError: U+00A0 from UTF-8 to US-ASCII
Any way around this to still get the html content?
In the introduction to the open-uri module, the docs say this,
It is possible to open an http, https or ftp URL as though it were a file
If you know anything about reading files, you know that you need the encoding of the file you are trying to read, so that you can tell Ruby how to read it (i.e. how many bytes each character occupies).
In the first code example in the docs, there is this:
open("http://www.ruby-lang.org/en") {|f|
  f.each_line {|line| p line}
  p f.base_uri # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
  p f.content_type # "text/html"
  p f.charset # "iso-8859-1"
  p f.content_encoding # []
  p f.last_modified # Thu Dec 05 02:45:02 UTC 2002
}
So if you don't know the encoding of the "file" you are trying to read, you can get the encoding with f.charset. If that encoding is different than your default external encoding, you will most likely get an error. Your default external encoding is the encoding ruby uses to read from external sources. You can check what your default external encoding is set to like this:
The default external Encoding is pulled from your environment...Have a
look:
$ echo $LC_CTYPE
en_US.UTF-8
or
$ ruby -e 'puts Encoding.default_external.name'
UTF-8
http://graysoftinc.com/character-encodings/ruby-19s-three-default-encodings
On Mac OSX, I actually have to do the following to see the default external encoding:
$ echo $LANG
You can set your default external encoding with the method Encoding.default_external=(), so you might want to try something like this:
html = open('some_url_here') do |f|
  Encoding.default_external = f.charset
  f.read # the block's value is assigned to html
end
Setting an IO object to binmode, like you have done, tells ruby that the encoding of the file is BINARY (or ruby's confusing synonym ASCII-8BIT), which means you are telling ruby that each character in the file takes up one byte. In your case, you are telling ruby to read the character U+00A0, whose UTF-8 representation takes up two bytes 0xC2 0xA0, as two characters instead of just one character, so you have eliminated your error, but you have produced two junk characters instead of the original character.
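Both effects can be seen without any network access. A small sketch using a literal string in place of the downloaded page (ascii_conversion_error is an illustrative helper name):

```ruby
# Returns the message of the error raised when +text+ cannot be
# represented in US-ASCII, or nil if the conversion succeeds.
def ascii_conversion_error(text)
  text.encode('US-ASCII')
  nil
rescue Encoding::UndefinedConversionError => e
  e.message
end

sample = "a\u00A0b"   # "a", a no-break space (U+00A0), "b"

puts ascii_conversion_error(sample).inspect

# Relabeling the same bytes as binary avoids the error,
# but U+00A0 is now counted as two separate one-byte characters.
binary = sample.dup.force_encoding(Encoding::BINARY)
puts sample.length   # characters when read as UTF-8
puts binary.length   # characters when read as ASCII-8BIT
```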
Doing a response.binmode before the response.read stops the error from happening.
Had the same issue, will add my solution here:
After reading the open-uri documentation further, it turns out you could set the encoding of the io before reading using the set_encoding method, like this:
result = open('some-page-uri') do |io|
  io.set_encoding(Encoding.default_external)
  io.read
end
Hope it helps!

Zlib and utf-8 in ruby

I'm trying to use zlib to compress some lengthy strings, some of which may contain unicode characters. At the moment, I'm doing this in ruby, but I think this would apply across any language really. Here's the super basic implementation:
require 'zlib'
example = "“hello world”" # note the unicode quotes
compressed = Zlib.deflate(example)
puts Zlib.inflate(compressed)
The issue here is that the text comes out as this:
\xE2\x80\x9Chello world\xE2\x80\x9
...no unicode quotes, just weird unrecognizable characters. Does anyone know of a way that Zlib can be used while retaining unicode characters? Bonus points for an answer in ruby : )
It seems Zlib produces ASCII-8BIT as the default encoding upon inflating. To fix it just force the original encoding:
require 'zlib'
input = "“hello world”"
compressed = Zlib.deflate(input)
output = Zlib.inflate(compressed).force_encoding(input.encoding)
Or set the encoding manually:
output = Zlib.inflate(compressed).force_encoding('utf-8')
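To confirm that nothing is lost in compression and only the encoding tag changes, the bytes can be compared before and after. A minimal sketch (zlib_round_trip is an illustrative helper name):

```ruby
require 'zlib'

# Compress and decompress +text+, restoring the encoding tag that inflate drops.
def zlib_round_trip(text)
  inflated = Zlib.inflate(Zlib.deflate(text))
  inflated.force_encoding(text.encoding)
end

input = "\u201Chello world\u201D"   # curly quotes written as escapes
raw   = Zlib.inflate(Zlib.deflate(input))

puts raw.encoding                     # the binary tag applied by inflate
puts raw.bytes == input.bytes         # the bytes themselves are unchanged
puts zlib_round_trip(input) == input  # equal again once the tag is restored
```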

Why do I get an "Invalid Byte Sequence in UTF-8" error reading a text file?

I'm writing a Ruby script to process a large text file, and keep getting an odd encoding error.
Here's the situation:
input_data = File.new(in_path, 'r').read
p input_data.encoding.name # UTF-8
break_char = "\r".encode("UTF-8")
p break_char # "\r"
p break_char.encoding.name # "UTF-8"
input_data.split(",".encode("UTF-8"))
p Encoding.compatible?(input_data, break_char) # => #<Encoding:UTF-8>
This produces the error :in 'split': invalid byte sequence in UTF-8 (ArgumentError)
I read http://blog.grayproductions.net/articles/ruby_19s_string and looked at other solutions to apparently the same problem, but still can't work out why it's happening when I believe I am controlling the encodings.
I'm on OSX working with ruby 1.9.2
Obviously your input file is not UTF-8 (or at least, not entirely). If you don't care about non-ascii characters, you can simply assume your file is ascii-8bit encoded. BTW, your separator (break_char) is not causing problems as comma is encoded the same way in UTF-8 as in ASCII.
fname = 'test.in'
# create example file and fill it with invalid UTF-8 sequence
File.open(fname, 'w') do |f|
  f.write "\xc3\x28"
end
# then try to read and parse it
s = File.open(fname) do |f| # file opened as UTF-8
#s = File.open(fname, 'r:ascii-8bit') do |f| # file opened as ascii-8bit
  f.read
end
p s.split ','
I fail to get an error here on Linux even when the input file is not UTF-8. (I'm using Ruby 1.9.2, as well.)
Logically, either this problem is linked with OS-X, or it's something to do with your input data. Does it happen regardless of the data in the input file?
(I realise that this is not a proper answer, but I lack the rep to add a comment. And since no-one has responded yet, I thought it better than nothing...)
You read the file using the default encoding your system provides. So ruby tags the string as utf8, which doesn't mean it's really utf8-data. Try file <input file> to guess what kind of encoding is in there, then tell ruby it's that one (unclean: force_encoding(<encoding>), clean: tell the File object what encoding it is, I don't know how to do that) and then use encode!("utf8") to convert it to utf8.
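For the "clean" route of telling the File object which encoding the data is in, File.read and File.open accept an encoding option of the form external:internal. A sketch under the assumption that the file turned out to be ISO-8859-1 (read_as_utf8 is an illustrative helper name, and the temporary file only stands in for the real input):

```ruby
require 'tempfile'

# Read +path+ as +source_encoding+ and return its content transcoded to UTF-8.
def read_as_utf8(path, source_encoding)
  File.read(path, encoding: "#{source_encoding}:UTF-8")
end

# Demo input: four ISO-8859-1 bytes written in binary mode.
text = Tempfile.create('latin1') do |f|
  f.binmode
  f.write("\xE4\xF6\xFC\xDF".b)
  f.flush
  read_as_utf8(f.path, 'ISO-8859-1')
end

puts text
puts text.encoding
```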
Here are 2 common situations and how to deal with them:
Situation 1
You have a UTF-8 input-file with possibly a few invalid bytes
Remove the invalid bytes:
test = "Partly valid\xE4 UTF-8 encoding: äöüß"
File.open( 'input_file', 'w' ) {|f| f.write(test)}
str = File.read( 'input_file' )
str.scrub('')
=> "Partly valid UTF-8 encoding: äöüß"
Situation 2
You have an input-file that could be in either UTF-8 or ISO-8859-1 encoding
Check which encoding it is and convert to UTF-8 (if necessary):
test = "String in ISO-8859-1 encoding: \xE4\xF6\xFC\xDF"
File.open( 'input_file', 'w' ) {|f| f.write(test)}
str = File.read( 'input_file' )
unless str.valid_encoding?
  str.encode!( 'UTF-8', 'ISO-8859-1', invalid: :replace )
end #unless
=> "String in ISO-8859-1 encoding: äöüß"
Notes
The above code snippets assume that Ruby encodes all your strings in UTF-8 by default. Even though this is almost always the case, you can make sure of it by starting your scripts with # encoding: UTF-8.
It is programmatically possible to detect invalid data in most multi-byte encodings like UTF-8 (in Ruby, see: #valid_encoding?). However, it is NOT possible (or at least extremely hard) to programmatically detect invalid data in single-byte encodings like ISO-8859-1. Thus the above code snippet does not work the other way around, i.e. detecting whether a String is valid ISO-8859-1.
Even though UTF-8 has become increasingly popular as the default encoding in computer systems, ISO-8859-1 and other Latin-1 flavors are still very popular in Western countries, especially in North America. Be aware that there are several single-byte encodings out there that are very similar to ISO-8859-1 but differ slightly from it. Examples: CP1252 (a.k.a. Windows-1252), ISO-8859-15
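The asymmetry between the two kinds of encoding is easy to check. A small sketch that builds a string containing every possible byte value (all_bytes is an illustrative helper name):

```ruby
# Every possible byte value, 0x00 through 0xFF, as one binary string.
def all_bytes
  (0..255).to_a.pack('C*')
end

# In ISO-8859-1 every byte is a legal character, so validation cannot fail.
puts all_bytes.force_encoding('ISO-8859-1').valid_encoding?

# In UTF-8 many byte sequences are illegal, so validation can catch bad data.
puts all_bytes.force_encoding('UTF-8').valid_encoding?
```

This is why checking valid_encoding? against UTF-8 first and falling back to a single-byte encoding works, while the reverse order would accept everything.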
Please try this one:-
input_data = File.open("path/your_file.pdf", "rb") {|io| io.read}
Thanks

How can I convert a string from windows-1252 to utf-8 in Ruby?

I'm migrating some data from MS Access 2003 to MySQL 5.0 using Ruby 1.8.6 on Windows XP (writing a Rake task to do this).
Turns out the Windows string data is encoded as windows-1252 and Rails and MySQL are both assuming utf-8 input so some of the characters, such as apostrophes, are getting mangled. They wind up as "a"s with an accent over them and stuff like that.
Does anyone know of a tool, library, system, methodology, ritual, spell, or incantation to convert a windows-1252 string to utf-8?
For Ruby 1.8.6, it appears you can use Ruby Iconv, part of the standard library:
Iconv documentation
According to this helpful article, it appears you can at least purge unwanted win-1252 characters from your string like so:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
One might then attempt to do a full conversion like so:
ic = Iconv.new('UTF-8', 'WINDOWS-1252')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
If you're on Ruby 1.9...
string_in_windows_1252 = database.get(...)
# => "Fåbulous"
string_in_windows_1252.encoding
# => #<Encoding:Windows-1252>
string_in_utf_8 = string_in_windows_1252.encode('UTF-8')
# => "Fabulous"
string_in_utf_8.encoding
# => #<Encoding:UTF-8>
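The snippet above relies on the string already being tagged as Windows-1252. When the data arrives with a different tag (binary, for instance), the source encoding can be passed to encode explicitly. A minimal sketch with a made-up sample value (windows_1252_to_utf8 is an illustrative helper name):

```ruby
# Convert bytes known to be Windows-1252 into a UTF-8 string,
# whatever encoding tag the string currently carries.
def windows_1252_to_utf8(raw)
  raw.encode('UTF-8', 'Windows-1252')
end

raw = "It\x92s".b   # 0x92 is the curly apostrophe in Windows-1252

converted = windows_1252_to_utf8(raw)
puts converted
puts converted.encoding
```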
Hi,
I had the exact same problem.
These tips helped me get going:
Always check for the proper encoding name in order to feed your conversion tools correctly.
In doubt you can get a list of supported encodings for iconv or recode using:
$ recode -l
or
$ iconv -l
Always start from your original file and encode a sample to work with:
$ recode windows-1252..u8 < original.txt > sample_utf8.txt
or
$ iconv -f windows-1252 -t utf8 original.txt -o sample_utf8.txt
Install Ruby 1.9, because it helps you A LOT when it comes to encodings. Even if you don't use it in your program, you can always start an irb1.9 session and pick at the strings to see what the output is.
File.open has a new 'mode' parameter in Ruby 1.9. Use it!
This article helped a lot: http://blog.nuclearsquid.com/writings/ruby-1-9-encodings
File.open('original.txt', 'r:windows-1252:utf-8')
# This opens a file specifying all encoding options. r:windows-1252 means read it as windows-1252. :utf-8 means treat it as utf-8 internally.
Have fun and swear a lot!
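To see what the 'r:windows-1252:utf-8' mode string does, the file handle can be asked for its external and internal encodings. A sketch using a temporary file in place of original.txt (inspect_transcoding_read is an illustrative helper name):

```ruby
require 'tempfile'

# Returns [external encoding name, internal encoding name, content] for +path+,
# opened so that Windows-1252 bytes are transcoded to UTF-8 while reading.
def inspect_transcoding_read(path)
  File.open(path, 'r:windows-1252:utf-8') do |f|
    [f.external_encoding.name, f.internal_encoding.name, f.read]
  end
end

Tempfile.create('w1252') do |tmp|
  tmp.binmode
  tmp.write("caf\xE9".b)   # "café" as Windows-1252 bytes
  tmp.flush
  p inspect_transcoding_read(tmp.path)
end
```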
If you want to convert a file named win1252_file, on a unix OS, run:
$ iconv -f windows-1252 -t utf-8 win1252_file > utf8_file
You should probably be able to do the same on Windows with cygwin.
If you're NOT on Ruby 1.9, and assuming yhager's command works, you could try
require 'fileutils'
File.open('/tmp/w1252', 'wb') do |file|
  # Write the raw bytes unchanged. Appending each byte with << would
  # write the byte's decimal digits (e.g. "146") instead of the byte.
  file.write(my_windows_1252_string)
end
`iconv -f windows-1252 -t utf-8 /tmp/w1252 > /tmp/utf8`
my_utf_8_string = File.read('/tmp/utf8')
['/tmp/w1252', '/tmp/utf8'].each do |path|
  FileUtils.rm path
end
