Converting UTF8 to ANSI with Ruby

I have a Ruby script that generates a UTF-8 CSV file remotely on a Linux machine and then transfers the file to a Windows machine through SFTP.
I then need to open this file with Excel, but Excel doesn't handle UTF-8, so I always need to open the file in a text editor that can convert UTF-8 to ANSI.
I would love to do this programmatically using Ruby and avoid the manual conversion step. What's the easiest way to do it?
PS: I tried using iconv but had no success.

ascii_str = yourUTF8text.unpack("U*").map { |c| c.chr }.join
This assumes that your text really does fit in the ASCII character set; Integer#chr raises a RangeError for any codepoint above 255.

I finally managed to do it using iconv, I was just messing up the parameters. So, this is how you do it:
require 'iconv'
utf8_csv = File.open("utf8file.csv").read
# careful with the unusual parameter order: TO first, then FROM!
ansi_csv = Iconv.iconv("LATIN1", "UTF-8", utf8_csv).join
File.open("ansifile.csv", "w") { |f| f.puts ansi_csv }
That's it!
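For reference, on newer Rubies where Iconv is deprecated or gone, a rough String#encode equivalent might look like the sketch below (same placeholder file names as above; Windows-1252 is assumed to be the "ANSI" code page Excel expects):
utf8_csv = File.read("utf8file.csv", encoding: "UTF-8")
# replace anything that has no Windows-1252 equivalent instead of raising
ansi_csv = utf8_csv.encode("Windows-1252", invalid: :replace, undef: :replace, replace: "?")
File.open("ansifile.csv", "w:Windows-1252") { |f| f.write ansi_csv }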

I had a similar issue trying to generate CSV files from user-generated content on the server. I found the unidecoder gem which does a nice job of transliterating unicode characters into ascii.
Example:
"olá, mundo!".to_ascii #=> "ola, mundo!"
"你好".to_ascii #=> "Ni Hao "
"Jürgen Müller".to_ascii #=> "Jurgen Muller"
"Jürgen Müller".to_ascii("ü" => "ue") #=> "Juergen Mueller"
For our simple use case, this worked well.
Pivotal Labs has a great blog post on unicode transliteration to ascii discussing this in more detail.
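Applied to the CSV case above, a whole-file sketch using the gem's String#to_ascii patch (file names are placeholders) could look like this:
require 'unidecoder'
ascii_csv = File.read("utf8file.csv").to_ascii   # transliterate every non-ASCII character
File.open("asciifile.csv", "w") { |f| f.write ascii_csv }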

Since ruby 1.9 there is an easier way:
yourstring.encode('ASCII')
To avoid exceptions on invalid bytes or characters that have no ASCII equivalent, you can have them replaced instead:
yourstring.encode('ASCII', invalid: :replace, undef: :replace, replace: "_")
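For example (illustrative values):
"olá".encode('ASCII')
# => raises Encoding::UndefinedConversionError, because á has no ASCII equivalent
"olá".encode('ASCII', invalid: :replace, undef: :replace, replace: "_")
# => "ol_"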

Related

Can I programmatically convert "I’d" to "I’d" using Ruby?

I can't seem to find the right combination of String#encode shenanigans.
I think I'd got confused on this one so I'll post this here to hopefully help anyone else who is similarly confused.
I was trying to do my encoding in an irb session, which gives you
irb(main):002:0> 'I’d'.force_encoding('UTF-8')
=> "I’d"
And if you try using encode instead of force_encoding then you get
irb(main):001:0> 'I’d'.encode('UTF-8')
=> "I’d"
This is with irb set to use an output and input encoding of UTF-8. In my case, converting that string the way I want involves telling Ruby that the source string is in Windows-1252 encoding. You can do this with the -E argument, where you specify external:internal encoding, and then you get this:
$ irb -EWindows-1252:UTF-8
irb(main):001:0> 'I’d'
=> "I\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2d"
That looks wrong unless you pipe it out, which gives this
$ ruby -E Windows-1252:UTF-8 -e "puts 'I’d'"
I’d
Hurrah. I'm not sure about why Ruby showed it as "I\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2d" (something to do with the code page of the terminal?) so if anyone can comment with further insight that would be great.
I expect your script is using the encoding cp1252 (Windows-1252) and you have ruby >= 1.9.
Then you can use force_encoding:
#encoding: cp1252
#works also with encoding: binary
source = 'I’d'
puts source.force_encoding('utf-8') #-> I’d
If my assumptions are wrong: which encoding do you use, and which ruby version?
A little background:
Problems with encoding are difficult to analyse. There may be conflicts between:
Encoding of the source code (defined by the editor that saved the file).
Declared encoding of the source code (set with the #encoding magic comment on the first line; this is what ruby uses).
Encoding of the string object (see e.g. the section "String encodings" in http://nuclearsquid.com/writings/ruby-1-9-encodings/).
Encoding of the output shell.
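If the broken string is already in memory as UTF-8 text rather than a literal in your source file, a minimal repair sketch (assuming the mojibake came from a Windows-1252 round trip) is to encode it back to Windows-1252 and then re-label the bytes as UTF-8:
broken = "I’d"                                                  # UTF-8 string containing the mojibake characters
fixed = broken.encode('Windows-1252').force_encoding('UTF-8')  # back to the raw bytes, then re-tag them as UTF-8
puts fixed #=> I’d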

Encoding::UndefinedConversionError when using open-uri

When I do this:
require 'open-uri'
response = open('some-html-page-url-here')
response.read
On a certain URL I get the following error (due to a wrong or missing encoding declaration in the returned page?!):
Encoding::UndefinedConversionError: U+00A0 from UTF-8 to US-ASCII
Any way around this to still get the html content?
In the introduction to the open-uri module, the docs say this,
It is possible to open an http, https or ftp URL as though it were a file
And if you know anything about reading files, you know that you have to know the encoding of the file you are trying to read. You need the encoding so that you can tell ruby how to read the file (i.e. how many bytes each character occupies).
In the first code example in the docs, there is this:
open("http://www.ruby-lang.org/en") {|f|
f.each_line {|line| p line}
p f.base_uri # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
p f.content_type # "text/html"
p f.charset # "iso-8859-1"
p f.content_encoding # []
p f.last_modified # Thu Dec 05 02:45:02 UTC 2002
}
So if you don't know the encoding of the "file" you are trying to read, you can get it with f.charset. If that encoding is different from your default external encoding, you will most likely get an error. Your default external encoding is the encoding ruby uses to read from external sources. You can check what it is set to like this:
The default external Encoding is pulled from your environment...Have a
look:
$ echo $LC_CTYPE
en_US.UTF-8
or
$ ruby -e 'puts Encoding.default_external.name'
UTF-8
http://graysoftinc.com/character-encodings/ruby-19s-three-default-encodings
On Mac OSX, I actually have to do the following to see the default external encoding:
$ echo $LANG
You can set your default external encoding with the method Encoding.default_external=(), so you might want to try something like this:
open('some_url_here') do |f|
  Encoding.default_external = f.charset
  html = f.read
end
Setting an IO object to binmode, like you have done, tells ruby that the encoding of the file is BINARY (or ruby's confusing synonym for it, ASCII-8BIT), which means you are telling ruby that each character in the file takes up one byte. In your case, you are telling ruby to read the character U+00A0, whose UTF-8 representation takes up two bytes (0xC2 0xA0), as two characters instead of one. That eliminates the error, but it produces two junk characters instead of the original character.
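Putting that together without touching the global default, one sketch (the URL is a placeholder; open-uri tags the string it returns with the declared charset when there is one) is to read the body and then force a UTF-8 result with replacement:
require 'open-uri'

html = open('http://www.ruby-lang.org/en') do |f|
  body = f.read   # tagged with the page's declared charset, if any
  body.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
end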
Doing a response.binmode before the response.read stops the error from happening.
Had the same issue, will add my solution here:
After reading the open-uri documentation further, it turns out you could set the encoding of the io before reading using the set_encoding method, like this:
result = open('some-page-uri') do |io|
  io.set_encoding(Encoding.default_external)
  io.read
end
Hope it helps!

CSV writes Ñ into its code form instead of its actual form

I have a CSV file. I checked its encoding using this:
File.open('C:\path\to\file\myfile.txt').read.encoding
and it returned the encoding as:
=> #<Encoding:IBM437>
I'm reading this CSV per row -- stripping spaces and doing other stuff. After "cleansing" it, I push it to a new file. I'm doing it like this:
CSV.foreach(file_read, encoding: "IBM437:UTF-8") do |r|
  # some code
  CSV.open(file_appended, "a", col_sep: "|") do |csv|
    csv << r
  end
end
Now my problem is, inside the CSV I'm reading, there's a word with an accented character -- Ñ to be exact. This character is being appended to the new file as
\u2564
It's a problem considering that the accented character is a vital part of that word, and I want that character to appear in the new file as-is.
I tried the following source:destination encodings, but to no avail:
ISO-8859-1:UTF8 (and vice versa)
ISO-8859-1:Windows-1252 (and vice versa)
Am I missing something?
Here is my ruby version, just in case you need to know:
ruby 1.9.3p392 (2013-02-22) [i386-mingw32]
Thanks in advance!
The line below solved my problem:
Encoding.default_external = "iso-8859-1"
It tells Ruby that the file being read is actually encoded in ISO-8859-1, so the byte 0xD1 is interpreted as Ñ, instead of being mapped through IBM437, where 0xD1 is the box-drawing character ╤ (U+2564).
Credit goes to Darshan Computing's answer here. Just look for Update #2.
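If you would rather not change the process-wide default, an alternative sketch is to declare the real source encoding per call (file names are the placeholders from the question):
CSV.foreach(file_read, encoding: "ISO-8859-1:UTF-8") do |r|
  # r now holds properly decoded UTF-8 strings, so Ñ survives the round trip
  CSV.open(file_appended, "a", col_sep: "|", encoding: "UTF-8") do |csv|
    csv << r
  end
end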

How to work with multibyte strings read from a file in Ruby? See inside

I'm a newbie to Ruby with a Perl background, and I have a problem with .reverse on a multibyte string read from a UTF-8 encoded file.
Code:
#!C:\Ruby200-x64\bin\ruby
puts "Content-Type:text/plain;charset=utf8\n\n" #I execute it via CGI
$: << "."
puts "А это строка".reverse #mb-string output is pretty fine
#but when I do the following, it fails;
file = File.open('test_rb_file.txt','r')
file.each_line {|line| puts line.reverse} #puts line works good, but not puts line.reverse
The script itself is in UTF-8, and test_rb_file.txt is in UTF-8. So when I output a multibyte string directly, all is OK, but when I read it from a file and reverse it, it fails.
I think specifying the encoding of the file I read from (test_rb_file.txt) would do the trick, but I don't know how to do that. And I may be wrong about that.
Any ideas how to fix the problem? Thanks in advance.
UPD: All fixed, thanks everyone. The following sets the encoding of the input file and fixes the problem:
file = File.open('test_rb_file.txt','r:UTF-8')
File.open('test_rb_file.txt','r:UTF-8')
To check the encoding of a String: "YourString".encoding
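For completeness, the same open-mode syntax can also transcode while reading, in case the file is not UTF-8 to begin with (the ISO-8859-1 here is only an example):
file = File.open('test_rb_file.txt', 'r:ISO-8859-1:UTF-8')  # external ISO-8859-1, transcoded to UTF-8 internally
file.each_line { |line| puts line.reverse }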

Ruby UTF-8 Encoding doesn't work in Windows even with Magic Comment

I'm trying to run a file (ruby anyfile.rb in cmd prompt) with the following contents:
# encoding: utf-8
puts 'áá'
the following error happens:
invalid multibyte char (UTF-8)
It seems that Ruby does not understand the magic comment...
EDIT: If I remove the "# encoding: utf-8" line and run it from the command prompt like this:
ruby -E:UTF-8 encoding.rb
then it works - any ideas?
EDIT2: When I run:
ruby -e 'p [Encoding.default_external, Encoding.default_internal]'
I get [#<Encoding:CP850>, nil]; maybe my Encoding.default_external is wrong?!
Environment:
Windows XP (yes, I also hate windows + ruby)
ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
I believe this is a classic case of "if you hear hooves, think horses, not zebras".
The error message is telling you that you have a byte sequence in your file that is not a valid UTF-8 multibyte sequence.
It is definitely possible that
It seems that Ruby does not understand the magic comment...
as you say, and that up until now nobody noticed that magic comments don't actually work because you are the first person in the history of humankind to actually try to use magic comments. (Actually, this is not possible. If Ruby didn't understand magic comments, it would complain about an invalid ASCII character, since ASCII is the default encoding if no magic comment is present.)
Or, there actually is an invalid multibyte UTF-8 sequence in your file.
Which do you think is more likely? If I were you, I would check my file.
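One quick way to check (the file name is a placeholder) is to read the raw bytes and ask Ruby whether they form valid UTF-8:
bytes = File.binread('anyfile.rb')
puts bytes.force_encoding('UTF-8').valid_encoding?   # false means the file really does contain an invalid sequence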
I've encountered similar issues from time to time with files that were not saved as UTF-8, even when the magic comment states so.
I've found that Ruby 1.9.2 had issues properly converting UTF-8 to code pages 850 and 437, the defaults for the command prompt on Windows.
I do recommend you upgrade to Ruby 1.9.3 (latest is patchlevel 125), which solves a lot of encoding issues, especially on Windows.
Also, verify that your file is properly saved and does not contain a Unicode BOM (so it is plain UTF-8).
To verify that, you can switch the console code page to Unicode (chcp 65001) and run type myscript.rb
You should see the accented letters correctly.
Last but not least, ensure your command prompt uses a TrueType font so extended characters are properly displayed.
Hope that helps.
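A quick way to check both suspects from Ruby itself (the path is a placeholder) might be:
p Encoding.default_external      # e.g. #<Encoding:CP850> on a Western cmd.exe
p File.binread('anyfile.rb', 3)  # "\xEF\xBB\xBF" at the start means the file has a UTF-8 BOM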
Try
# encoding: iso-8859-1
Not everything that's text is utf8.
Are you sure you selected 'UTF-8' from the Encoding dropdown when you saved the file in Notepad? I've just tried this on an XP machine and your code example worked for me.
