How to solve this iconv exception? - ruby

The following code:
def convertToUTF8 str
  str = Iconv.conv('ASCII//IGNORE', 'UTF8', str).gsub("\x00", "") # line 585
  str.chomp!
  str
end
Raises exception:
myfile.rb:585: in `conv': invalid encoding ("ASCII", "UTF8") (Iconv::InvalidEncoding)
This error occurs on my MacBook Pro, but the same code runs on my peer's MacBook Pro, so I suppose it is an environment problem.
I tried the following code:
# The following two lines are in bar.rb
require 'iconv'
puts Iconv.list
Then execute bar.rb:
$ ruby bar.rb | grep UTF # ruby version is 1.9.1-p376
UTF-8
UTF-8-MAC
UTF8-MAC
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UNICODE-1-1-UTF-7
UTF-7
CSUNICODE11UTF7
But if I switch to Ruby 2.0, the list contains "UTF8".
So the question becomes: how do I make "UTF8" available to Ruby 1.9.1?
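One possible workaround, offered as an assumption rather than a confirmed fix: the list above shows that this iconv build knows the name UTF-8 but not UTF8, so spelling the source encoding with the hyphen may be enough for Iconv.conv. On Ruby 1.9 and later the same stripping can also be sketched without iconv, using String#encode. The method name strip_non_ascii is hypothetical.

```ruby
# Hypothetical helper: drops every character that has no ASCII equivalent,
# then removes NUL bytes and a trailing newline, like the method in the question.
def strip_non_ascii(str)
  ascii = str.encode('US-ASCII', 'UTF-8',
                     invalid: :replace, undef: :replace, replace: '')
  ascii.gsub("\x00", "").chomp
end

p strip_non_ascii("caf\u00e9\x00\n")
```

Because the encoding names are handled by Ruby itself here, the result does not depend on which aliases the local iconv library happens to list.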

Related

Ruby not respecting # encoding specification

Given the following script (it must be in its own file):
#!/usr/bin/env ruby

# encoding: binary
s = "\xe1\xe7\xe6\x07\x00\x01\x00"
puts s.encoding
The output of this is "UTF-8". Why isn't it binary (ASCII-8BIT)?
Because the # encoding: binary comment must be on the first line of the file or, if the first line is a shebang such as #!/usr/bin/env ruby, on the line immediately following it.
When the blank line is removed (i.e. the encoding specification is on the second line):
#!/usr/bin/env ruby
# encoding: binary
s = "\xe1\xe7\xe6\x07\x00\x01\x00"
puts s.encoding
...the output is "ASCII-8BIT".
Here is a link to the Ruby documentation regarding magic comments such as encoding (thanks to Stefan who mentioned this in a comment):
https://ruby-doc.org/core-3.1.2/doc/syntax/comments_rdoc.html#label-Magic+Comments
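A minimal sketch of the placement rule, assuming a ruby executable is available to run child scripts: it writes both variants to temporary files, runs each one, and reports the script encoding that Ruby picked. The helper name script_encoding is made up for this illustration.

```ruby
require 'rbconfig'
require 'tempfile'

# Hypothetical helper: runs +source+ as a standalone script and returns
# whatever it prints.
def script_encoding(source)
  Tempfile.create(['magic_comment', '.rb']) do |file|
    file.write(source)
    file.flush
    IO.popen([RbConfig.ruby, file.path], &:read).strip
  end
end

body = "puts __ENCODING__\n"
with_gap    = "#!/usr/bin/env ruby\n\n# encoding: binary\n" + body
without_gap = "#!/usr/bin/env ruby\n# encoding: binary\n" + body

puts script_encoding(with_gap)     # magic comment sits on line 3
puts script_encoding(without_gap)  # magic comment sits on line 2
```

Only the second variant should report the binary encoding; in the first, the comment is treated as an ordinary comment.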

How to consistently get ASCII-8BIT encoding in Ruby?

Ruby seems a bit inconsistent in its handling of encodings:
irb -E BINARY:BINARY
irb(main):001:0> "hi".encoding
=> #<Encoding:ASCII-8BIT>
So that "works". Now what about plain ruby?
ruby -E BINARY:BINARY -e 'p "hi".encoding'
#<Encoding:US-ASCII>
That doesn't work. Furthermore, when p "hi".encoding is placed in x.rb, the output of ruby -E BINARY:BINARY x.rb is:
#<Encoding:UTF-8>
How do I get ASCII-8BIT literals when invoking ruby?
String literals have the same encoding as the script encoding. Instead of 'hi'.encoding you can use the keyword __ENCODING__ to retrieve it. The script encoding can be changed by putting a magic comment at the beginning of your script:
# encoding: ASCII-8BIT
p __ENCODING__ # => #<Encoding:ASCII-8BIT>
The -E flag of ruby doesn't affect the encoding of string literals. It only changes the default external and internal encodings. You can read about the various types of encodings and their purpose in the Encoding documentation.
Back to the encoding of string literals: Even though irb claims its -E flag is the "Same as ruby -E" that isn't true. It uses the external encoding as script encoding. irb already has several limitations. This could be one of them. It's at least a documentation bug.
Besides the magic comment there's another discouraged way to set the script encoding via ruby: the -K flag and the n (none) kcode. ruby -Kne "p __ENCODING__" should print #<Encoding:ASCII-8BIT>. However -K also changes the external encoding.
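If only a few strings need to be binary, they can also be tagged individually instead of changing the script encoding. A short sketch using String#b (Ruby 2.0+) and String.new with an encoding: option (Ruby 2.3+); these are alternatives to the magic comment, not something the -E flag does.

```ruby
# Literals follow the script encoding, which __ENCODING__ reports.
p __ENCODING__
p "hi".encoding

# String#b returns a copy tagged as ASCII-8BIT.
binary_copy = "hi".b
p binary_copy.encoding

# String.new accepts an explicit encoding for the new string.
binary_new = String.new("hi", encoding: Encoding::BINARY)
p binary_new.encoding
```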

Ruby error invalid multibyte char (US-ASCII)

I am trying to run the ruby script found here
but I am getting the error
invalid multibyte char (US-ASCII)
for line 12 which is
http = Net::HTTP.new("twitter.com", Net::HTTP.https_default_port())
Can someone please explain what this means and how I can fix it? Thanks.
When you run the script with Ruby 1.9, change the first lines of the script to:
#!/usr/bin/env ruby
# encoding: utf-8
require 'net/http'
This tells Ruby to read the script's source as UTF-8. Without that line, Ruby 1.9 defaults to US-ASCII.
Just for the record: this will not work in Ruby 1.8, because 1.8 doesn't know anything about string encodings. And the line is no longer needed in Ruby 2.0, because Ruby 2.0 uses UTF-8 as the default source encoding anyway.
It means that a multibyte character is used and Ruby is not set to handle it. If you are using an old version of Ruby, then put the following magic comment at the beginning of the file:
# coding: utf-8
If you use a modern version of Ruby, then that problem would not arise in the first place.
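When the reported line looks like plain ASCII, as line 12 does here, the multibyte character may be an invisible one, such as a non-breaking space pasted from a web page. A small sketch for locating such lines; the method name non_ascii_line_numbers is hypothetical.

```ruby
# Hypothetical helper: returns the 1-based numbers of the lines in +source+
# that contain at least one non-ASCII byte.
def non_ascii_line_numbers(source)
  source.b.each_line.with_index(1)
        .reject { |line, _number| line.ascii_only? }
        .map { |_line, number| number }
end

# The second line contains a non-breaking space (U+00A0) after the "=".
sample = "require 'net/http'\nhttp =\u00a0nil\n"
p non_ascii_line_numbers(sample)
```

For a file on disk, the source can be obtained with File.binread(path).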

Ruby: ARGV breaks accented characters

# encoding: utf-8
foo = "Résumé"
p foo
> "Résumé"
# encoding: utf-8
ARGV.each do |argument|
  p argument
end
test.rb Résumé
> "R\xE9sum\xE9"
Why does this occur, and how can I get ARGV to return "Résumé"?
I have chcp 65001 set already and am using ruby 1.9.2p290 (2011-07-09) [i386-mingw32]
EDIT After asking around on irc, I was instructed to do chcp 1252>NUL which fixed the problem.
For some reason, Windows doesn't use UTF-8 in your console. So, although Ruby expects a UTF-8 encoded string, it gets a Windows-1252 encoded string.
So you have several possibilities (which I can't test as I, fortunately, don't use Windows):
Persuade Windows to use UTF-8 in your console. I don't know if chcp should work and, if so, why it doesn't.
Tell Ruby to use Windows-1252 instead of UTF-8 as default
Convert ARGV from Windows-1252 to UTF-8 manually:
Example:
>> argument = "R\xE9sum\xE9"
=> "R\xE9sum\xE9"
>> argument.force_encoding('windows-1252').encode('utf-8')
=> "Résumé"
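The manual conversion can be applied to every argument at once. A sketch, assuming the console really delivers Windows-1252 bytes; decode_args is a made-up name, and a literal array stands in for ARGV so the snippet runs anywhere.

```ruby
# Hypothetical helper: reinterprets each argument's bytes as +from+ and
# transcodes the result to UTF-8. The input strings are left untouched.
def decode_args(args, from = 'Windows-1252')
  args.map { |arg| arg.dup.force_encoding(from).encode('UTF-8') }
end

raw_args = ["R\xE9sum\xE9"] # stand-in for ARGV
p decode_args(raw_args)
```

In a real script the call would be decode_args(ARGV).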

How can I convert a string from windows-1252 to utf-8 in Ruby?

I'm migrating some data from MS Access 2003 to MySQL 5.0 using Ruby 1.8.6 on Windows XP (writing a Rake task to do this).
Turns out the Windows string data is encoded as windows-1252 and Rails and MySQL are both assuming utf-8 input so some of the characters, such as apostrophes, are getting mangled. They wind up as "a"s with an accent over them and stuff like that.
Does anyone know of a tool, library, system, methodology, ritual, spell, or incantation to convert a windows-1252 string to utf-8?
For Ruby 1.8.6, it appears you can use Ruby Iconv, part of the standard library:
Iconv documentation
According to this helpful article, it appears you can at least purge unwanted win-1252 characters from your string like so:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
One might then attempt to do a full conversion like so:
ic = Iconv.new('UTF-8', 'WINDOWS-1252')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
If you're on Ruby 1.9...
string_in_windows_1252 = database.get(...)
# => "Fåbulous"
string_in_windows_1252.encoding
# => #<Encoding:Windows-1252>
string_in_utf_8 = string_in_windows_1252.encode('UTF-8')
# => "Fåbulous"
string_in_utf_8.encoding
# => #<Encoding:UTF-8>
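If the string arrives as raw bytes or tagged with the wrong encoding, encode can be told the source encoding explicitly. A sketch, assuming the bytes really are Windows-1252; win1252_to_utf8 is a hypothetical name. Any byte Ruby cannot map is replaced with "?" instead of raising.

```ruby
# Hypothetical helper: treats +bytes+ as Windows-1252, whatever encoding the
# string is currently tagged with, and returns a UTF-8 copy.
def win1252_to_utf8(bytes)
  bytes.encode('UTF-8', 'Windows-1252',
               invalid: :replace, undef: :replace, replace: '?')
end

# 0x92 is the curly apostrophe in Windows-1252.
apostrophe = win1252_to_utf8("It\x92s".b)
p apostrophe
p apostrophe.encoding
```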
Hi,
I had the exact same problem.
These tips helped me get going:
Always check for the proper encoding name in order to feed your conversion tools correctly.
When in doubt, you can get a list of supported encodings for iconv or recode using:
$ recode -l
or
$ iconv -l
Always start from your original file and encode a sample to work with:
$ recode windows-1252..u8 < original.txt > sample_utf8.txt
or
$ iconv -f windows-1252 -t utf8 original.txt -o sample_utf8.txt
Install Ruby 1.9, because it helps you A LOT when it comes to encodings. Even if you don't use it in your program, you can always start an irb1.9 session and poke at the strings to see what the output is.
File.open has a new 'mode' parameter in Ruby 1.9. Use it!
This article helped a lot: http://blog.nuclearsquid.com/writings/ruby-1-9-encodings
File.open('original.txt', 'r:windows-1252:utf-8')
# This opens the file with both encodings specified: r:windows-1252 reads the data as windows-1252, and :utf-8 transcodes it to utf-8 internally.
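A runnable sketch of the same mode string, using a temporary file with one Windows-1252 encoded line as a stand-in for original.txt; the helper name read_win1252 is made up.

```ruby
require 'tempfile'

# Hypothetical helper: reads +path+ as Windows-1252 and returns UTF-8 text.
def read_win1252(path)
  File.open(path, 'r:windows-1252:utf-8') { |io| io.read }
end

Tempfile.create('original') do |file|
  file.binmode
  file.write("caf\xE9\n".b) # 0xE9 is "é" in Windows-1252
  file.flush
  text = read_win1252(file.path)
  p text
  p text.encoding
end
```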
Have fun and swear a lot!
If you want to convert a file named win1252_file, on a Unix OS, run:
$ iconv -f windows-1252 -t utf-8 win1252_file > utf8_file
You should probably be able to do the same on Windows with cygwin.
If you're NOT on Ruby 1.9, and assuming yhager's command works, you could try
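require 'fileutils'

# Write the raw bytes unchanged. (file << byte would write each byte's
# decimal text, e.g. "233", because each_byte yields integers.)
File.open('/tmp/w1252', 'wb') do |file|
  file.write(my_windows_1252_string)
end
# Let the iconv command-line tool do the conversion.
`iconv -f windows-1252 -t utf-8 /tmp/w1252 > /tmp/utf8`
my_utf_8_string = File.read('/tmp/utf8')
# Remove the temporary files.
['/tmp/w1252', '/tmp/utf8'].each do |path|
  FileUtils.rm path
end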
