Ruby 2.0 iconv replacement - ruby

I don't know Ruby, but I want to run a script that fails with:
D:/Heather/Ruby/lib/ruby/2.0.0/rubygems/core_ext/kernel_require.rb:45:in `require': cannot load such file -- iconv (LoadError)
It works if I comment out the iconv code, but it would be much better if I could rewrite this part:
return Iconv.iconv('UTF-8//IGNORE', 'UTF-8', (s + ' ') ).first[0..-2]
without iconv. Maybe I can use String#encode here somehow?

Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in Ruby 2.0.
You can still install it as a gem.
Reference material if you're unsure:
https://rvm.io/packages/iconv/
However, the suggestion is that you don't, and instead use:
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
API
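As a small sketch of the String#encode suggestion above: the sample bytes and variable names are made up for illustration, and the input is deliberately tagged as binary so that a real conversion takes place.

```ruby
# Bytes of unknown origin, tagged as binary (ASCII-8BIT) for this illustration.
raw = "abc\xFFdef".b

# Transcode to UTF-8; any byte that cannot be converted becomes "?".
clean = raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')

puts clean                  # abc?def
puts clean.valid_encoding?  # true
```

Passing replace: '' instead of '?' drops the offending bytes rather than marking them.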

String#scrub can be used since Ruby 2.1.
str.scrub('')
str.scrub{ |bytes| '' }
Related question: Equivalent of Iconv.conv(“UTF-8//IGNORE”,…) in Ruby 1.9.X?
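A short sketch of the three common ways to call String#scrub (Ruby 2.1+). The sample string is an assumption; it is built the same way as in the examples further down, with a 0xFF byte that can never appear in valid UTF-8.

```ruby
# A string flagged as UTF-8 that contains one invalid byte.
dirty = "a#{0xFF.chr}b".force_encoding('UTF-8')

dropped = dirty.scrub('')   # remove the invalid bytes
marked  = dirty.scrub       # replace them with U+FFFD (the default)
tagged  = dirty.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }  # custom text per bad sequence

puts dropped   # ab
puts tagged    # a<ff>b
puts marked.valid_encoding?  # true
```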

If you're not on Ruby 2.1 or later and so can't use String#scrub, the following will drop every part of the string that isn't correctly UTF-8 encoded.
string.encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
The encode method does almost exactly what you want, but with the caveat that encode doesn't do anything if it thinks the string is already UTF-8. So you need to change encodings, going via an encoding that can still encode the full set of unicode characters that UTF-8 can encode. (If you don't you'll corrupt any characters that aren't in that encoding - 7bit ASCII would be a really bad choice!)
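The round trip described above could be wrapped up roughly like this. The method names strip_invalid_utf8 and utf16_round_trip are hypothetical helpers for illustration, not part of any library.

```ruby
# Fallback for rubies without String#scrub: go through UTF-16 so that
# encode actually runs, dropping invalid sequences on the way.
def utf16_round_trip(str)
  str.encode('UTF-16', invalid: :replace, replace: '').encode('UTF-8')
end

# Prefer String#scrub where it exists (Ruby 2.1+), otherwise round-trip.
def strip_invalid_utf8(str)
  return str.scrub('') if str.respond_to?(:scrub)

  utf16_round_trip(str)
end

dirty = "a#{0xFF.chr}b".force_encoding('UTF-8')
puts strip_invalid_utf8(dirty)  # ab
puts utf16_round_trip(dirty)    # ab
```

UTF-16 is used as the intermediate encoding because it can represent every character UTF-8 can, so valid text survives the trip unchanged.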

I have not had luck with the various approaches that use a one-line string.encode by itself.
But I wrote a backfill that implements String#scrub for MRI before 2.1, or for other rubies that do not have it.
https://github.com/jrochkind/scrub_rb

Related

iconv will be deprecated in the future, transliterate

Ruby 1.9.3 warns about iconv deprecation, but I use iconv to strip diacritics and get plain ASCII:
Iconv.iconv('ascii//translit', 'utf-8', 'Těžiště')
returns Teziste. How can I obtain this using String#encode?
If I had Rails (or just ActiveSupport) around, I'd do something like this:
ActiveSupport::Multibyte::Unicode.normalize('Těžiště', :kd).chars.grep(/\p{^Mn}/).join('')
to get 'Teziste'. The :kd essentially decomposes your accented characters into separate accents and characters and then the \p{^Mn} removes all the non-spacing marks from the character stream and when you put it all back together with join, you get the unaccented string back.
If you don't have Rails or ActiveSupport handy, then you could use UnicodeUtils.compatibility_decomposition from unicode-utils instead of ActiveSupport::Multibyte::Unicode.normalize:
> UnicodeUtils.compatibility_decomposition('Těžiště').chars.grep(/\p{^Mn}/).join('')
=> "Teziste"
I tend to have the ActiveSupport version patched into String in Rails-land:
class String
  def de_accent
    #
    # `\p{Mn}` is also known as `\p{Nonspacing_Mark}` but only the short
    # and cryptic form is documented.
    #
    ActiveSupport::Multibyte::Unicode.normalize(self, :kd).chars.grep(/\p{^Mn}/).join('')
  end
end
so that I can say things like:
> s = 'Těžiště'.de_accent
=> "Teziste"
to strip out accents.
This approach won't handle everything but maybe it will do enough.
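The same decompose-then-strip idea can be sketched without any gem on Ruby 2.2 or later, using the built-in String#unicode_normalize. That method is swapped in here for the ActiveSupport and unicode-utils calls above; it is not what the answer itself used, and de_accent is written as a plain method purely for illustration.

```ruby
# Decompose accented characters (NFKD), then drop the non-spacing marks.
def de_accent(text)
  text.unicode_normalize(:nfkd).chars.grep(/\p{^Mn}/).join
end

puts de_accent('Těžiště')  # Teziste
```

Like the original, this only handles characters that decompose into a base letter plus combining marks; letters such as "ø" or "ß" pass through unchanged.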

Converting UTF-8 characters into proper ASCII characters

I have the string "V\355ctor" (I think that's Víctor).
Is there a way to convert it to ASCII where í would be replaced by an ASCII i?
I already have tried Iconv without success.
(I'm only getting Iconv::IllegalSequence: "\355ctor")
Further, are there differences between Ruby 1.8.7 and Ruby 2.0?
EDIT:
Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "V\355ctor") seems to work, but the result is Vctor, not Victor.
I know of two options.
transliterate from the I18n gem.
$ irb
1.9.3-p448 :001 > string = "Víctor"
=> "Víctor"
1.9.3-p448 :002 > require 'i18n'
=> true
1.9.3-p448 :003 > I18n.transliterate(string)
=> "Victor"
Unidecoder from the stringex gem.
Stringex::Unidecoder.decode(string)
Update:
When running Unidecoder on "V\355ctor", you get the following error:
Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with IBM437 string)
Hmm, maybe you want to first translate from IBM437:
string.force_encoding('IBM437').encode('UTF-8')
This may help you get further. Note that the autodetected encoding could be incorrect; if you know exactly what the encoding is, everything becomes a lot easier.
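To illustrate why knowing the real encoding matters, here is a sketch under the assumption that the bytes are ISO-8859-1 (Latin-1), where byte 0xED is "í". That is a guess based on the question's "I think that's Víctor", not something the thread confirms, and the accent stripping uses the built-in String#unicode_normalize (Ruby 2.2+) rather than either gem mentioned above.

```ruby
# "\355" is the octal escape for the single byte 0xED.
raw = "V\355ctor".b

# ASSUMPTION: the bytes are ISO-8859-1. Relabel them, then transcode to UTF-8.
utf8 = raw.force_encoding('ISO-8859-1').encode('UTF-8')

# Decompose "í" into "i" plus a combining accent, then drop the accent.
ascii = utf8.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')

puts utf8   # Víctor
puts ascii  # Victor
```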
What you want to do is called transliteration.
The most used and best maintained library for this is ICU. (Iconv is frequently used too, but it has many limitations such as the one you ran into.)
A cursory Google search yields a few ruby ICU wrappers. I'm afraid I cannot comment on which one is better, since I've admittedly never used any of them. But that is the kind of stuff you want to be using.

How to change deprecated iconv to String#encode for invalid UTF8 correction

I get sources from the web and sometimes the encoding of the material is not 100% UTF8 byte sequence valid. I use iconv to silently ignore these sequences to get a cleaned string.
@iconv = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = @iconv.iconv(untrusted_string)
However, now that iconv has been deprecated, I see its deprecation warning a lot.
iconv will be deprecated in the future, use String#encode
I tried converting it, using String#encode's :invalid and :replace options, but it does not seem to work (i.e. the incorrect byte sequence is not removed). What is the correct way to use String#encode for this?
This has been answered in this question:
Is there a way in ruby 1.9 to remove invalid byte sequences from strings?
Use either
untrusted_string.chars.select{|i| i.valid_encoding?}.join
or
untrusted_string.encode('UTF-8', :invalid => :replace, :replace => '').encode('UTF-8')
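The character-filtering approach (the first of the two lines above) can be sketched as follows. The sample string is an assumption, assembled from raw bytes so that it contains valid multibyte characters plus one stray 0xFF byte.

```ruby
# Build a string flagged as UTF-8 with one invalid byte in the middle.
untrusted = ("héllo".b + 0xFF.chr + " wörld".b).force_encoding('UTF-8')

# String#chars yields each invalid byte as its own one-byte "character",
# so filtering on valid_encoding? drops exactly the broken pieces.
cleaned = untrusted.chars.select(&:valid_encoding?).join

puts untrusted.valid_encoding?  # false
puts cleaned                    # héllo wörld
```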
The question that Martijn linked to has what seem to be the two best ways to do that, but Martijn made an understandable but incorrect change when copying the second approach to his answer here. Doing .encode('UTF-8', <options>).encode('UTF-8') doesn't work. As indicated in the original answer in the other question, the key is to encode to a different encoding, then back to UTF-8. If your original string is already flagged as UTF-8 in ruby's internals then ruby will ignore any call to encode it as UTF-8.
In the following examples I'm going to use "a#{0xFF.chr}b".force_encoding('UTF-8') to produce a string that ruby believes is UTF-8 but which contains invalid UTF-8 bytes.
1.9.3p194 :019 > "a#{0xFF.chr}b".force_encoding('UTF-8')
=> "a\xFFb"
1.9.3p194 :020 > "#{0xFF.chr}".force_encoding('UTF-8').encoding
=> #<Encoding:UTF-8>
Note how encoding to UTF-8 does nothing:
1.9.3p194 :016 > "a#{0xFF.chr}b".force_encoding('UTF-8').encode('UTF-8', :invalid => :replace, :replace => '').encode('UTF-8')
=> "a\xFFb"
But encoding to something else (UTF-16) and then back to UTF-8 cleans up the string:
1.9.3p194 :017 > "a#{0xFF.chr}b".force_encoding('UTF-8').encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
=> "ab"

Iconv and Kconv on Ruby (1.9.2)

I know that Iconv is used to convert strings' encoding.
From my understanding, Kconv serves the same purpose (am I wrong?).
My question is: what is the difference between them, and which should I use for encoding conversions?
By the way, I found some info saying that Iconv will be deprecated as of version 1.9.3.
As https://stackoverflow.com/users/23649/jtbandes says, it looks like Kconv is similar to Iconv but specialized for Kanji ("the logographic Chinese characters that are used in the modern Japanese writing system along with hiragana" http://en.wikipedia.org/wiki/Kanji). Unless you are working on something specifically Japanese, I'm guessing you don't need Kconv.
If you're using Ruby 1.9, you can use the built-in encoding support most of the time instead of Iconv. I tried for hours to understand what I was doing until I read this:
http://www.joelonsoftware.com/articles/Unicode.html
Then you can start to use stuff like
String#encode # Ruby 1.9
String#encode! # Ruby 1.9
String#force_encoding # Ruby 1.9
with confidence. If you have more complex needs, do read http://blog.grayproductions.net/categories/character_encodings
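A small sketch of how the three methods differ; the sample bytes are an assumption (the word "café" as it would be stored in ISO-8859-1). force_encoding only changes the label on the bytes, while encode and encode! rewrite the bytes for the target encoding.

```ruby
# "café" in ISO-8859-1: 0xE9 is "é". Start from unlabelled binary data.
bytes = "caf\xE9".b

# force_encoding relabels the string; the bytes stay exactly the same.
relabelled = bytes.dup.force_encoding('ISO-8859-1')

# encode returns a new string whose bytes are transcoded to UTF-8.
converted = relabelled.encode('UTF-8')

# encode! does the same conversion in place.
in_place = relabelled.dup
in_place.encode!('UTF-8')

p relabelled.bytes  # [99, 97, 102, 233]
p converted.bytes   # [99, 97, 102, 195, 169]
puts converted      # café
```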
UPDATED Thanks to JohnZ in the comments
Iconv is still useful in Ruby 1.9 because it can transliterate characters (something that String#encode et al. can't do). Here's an example of how to extend String with a function that transliterates to UTF-8:
require 'iconv'
class ::String
  # Return a new String that has been transliterated into UTF-8
  # Should work in Ruby 1.8 and Ruby 1.9 thanks to http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/
  def as_utf8(from_encoding = 'UTF-8')
    ::Iconv.conv('UTF-8//TRANSLIT', from_encoding, self + ' ')[0..-2]
  end
end
"foo".as_utf8 #=> "foo"
"foo".as_utf8('ISO-8859-1') #=> "foo"
Thanks JohnZ!

Converting UTF8 to ANSI with Ruby

I have a Ruby script that generates a UTF8 CSV file remotely in a Linux machine and then transfers the file to a Windows machine thru SFTP.
I then need to open this file with Excel, but Excel doesn't get UTF8, so I always need to open the file in a text editor that has the capability to convert UTF8 to ANSI.
I would love to do this programmatically using Ruby and avoid the manual conversion step. What's the easiest way to do it?
PS: I tried using iconv but had no success.
ascii_str = yourUTF8text.unpack("U*").map{|c|c.chr}.join
assuming that your text really does fit in the ascii character set.
I finally managed to do it using iconv, I was just messing up the parameters. So, this is how you do it:
require 'iconv'
utf8_csv = File.open("utf8file.csv").read
# gotta be careful with the weird parameters order: TO, FROM !
ansi_csv = Iconv.iconv("LATIN1", "UTF-8", utf8_csv).join
File.open("ansifile.csv", "w") { |f| f.puts ansi_csv }
That's it!
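On Ruby 1.9 and later the same conversion can be sketched without iconv. This assumes that "ANSI" means Windows-1252, the usual default code page on Western European Windows installs; the sample CSV text and the temporary file location are made up for illustration.

```ruby
require 'tmpdir'

# Sample UTF-8 CSV content (illustrative data only).
utf8_csv = "name;city\nJo\u00E3o;S\u00E3o Paulo\n"

# Transcode to Windows-1252; characters with no equivalent become "?".
ansi_csv = utf8_csv.encode('Windows-1252',
                           invalid: :replace, undef: :replace, replace: '?')

Dir.mktmpdir do |dir|
  path = File.join(dir, 'ansifile.csv')
  # binwrite stores the bytes exactly as they are, with no newline translation.
  File.binwrite(path, ansi_csv)
  puts File.binread(path).bytesize == ansi_csv.bytesize  # true
end
```

Unlike Iconv.iconv, String#encode takes only the target encoding as its first argument, so there is no TO/FROM order to mix up.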
I had a similar issue trying to generate CSV files from user-generated content on the server. I found the unidecoder gem which does a nice job of transliterating unicode characters into ascii.
Example:
"olá, mundo!".to_ascii #=> "ola, mundo!"
"你好".to_ascii #=> "Ni Hao "
"Jürgen Müller".to_ascii #=> "Jurgen Muller"
"Jürgen Müller".to_ascii("ü" => "ue") #=> "Juergen Mueller"
For our simple use case, this worked well.
Pivotal Labs has a great blog post on unicode transliteration to ascii discussing this in more detail.
Since ruby 1.9 there is an easier way:
yourstring.encode('ASCII')
By default this raises an error on any character that has no ASCII equivalent. To avoid that, you can have such characters replaced:
yourstring.encode('ASCII', invalid: :replace, undef: :replace, replace: "_")
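A brief sketch of the difference between the strict call and the replacing one; the sample text is an assumption chosen for illustration.

```ruby
text = "olá, mundo!"

# The strict conversion raises because "á" has no ASCII equivalent.
begin
  text.encode('ASCII')
rescue Encoding::UndefinedConversionError => e
  puts "strict conversion failed: #{e.class}"
end

# With the replacement options, the character is swapped for "_" instead.
safe = text.encode('ASCII', invalid: :replace, undef: :replace, replace: '_')
puts safe  # ol_, mundo!
```

Note that this replaces accented letters rather than transliterating them, so "á" becomes "_", not "a".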
