Ruby's String#gsub, unicode, and non-word characters - ruby

As part of a larger series of operations, I'm trying to take tokenized chunks of a larger string and get rid of punctuation, non-word gobbledygook, etc. My initial attempt used String#gsub and the \W regexp character class, like so:
my_str = "Hello,"
processed = my_str.gsub(/\W/,'')
puts processed # => Hello
Super, super, super simple. Of course, now I'm extending my program to deal with non-Latin characters, and all heck's broken loose. Ruby's \W seems to be something like [^A-Za-z0-9_], which, of course, also matches letters with diacritics (ü, í, etc.). So, now my formerly-simple code crashes and burns in unpleasant ways:
my_str = "Quística."
processed = my_str.gsub(/\W/,'')
puts processed # => Qustica
Notice that gsub() obligingly removed the accented "í" character. One way I've thought of to fix this would be to extend Ruby's \W whitelist to include higher Unicode code points, but there are an awful lot of them, and I know I'd miss some and cause problems down the line (and let's not even start thinking about non-Latin languages...). Another solution would be to blacklist all the stuff I want to get rid of (punctuation, $/%/&/™, etc.), but, again, there's an awful lot of that and I really don't want to start playing blacklist-whack-a-mole.
Has anybody out there found a principled solution to this problem? Is there some hidden, Unicode-friendly version of \W that I haven't discovered yet? Thanks!

You need to run ruby with the "-Ku" option to make it use UTF-8. See the documentation for command-line options. This is what happens when I do this with irb:
% irb -Ku
irb(main):001:0> my_str = "Quística."
=> "Quística."
irb(main):002:0> processed = my_str.gsub(/\W/,'')
=> "Quística"
irb(main):003:0>
You can also put it on the #! line in your ruby script:
#!/usr/bin/ruby -Ku

I would just like to add that in 1.9.1 it works by default.
$ irb
ruby-1.9.1-p243 > my_str = "Quística."
=> "Quística."
ruby-1.9.1-p243 > processed = my_str.gsub(/\W/,'')
=> "Quística"
ruby-1.9.1-p243 > processed.encoding
=> #<Encoding:UTF-8>
PS. Nothing beats rvm for trying out different versions of Ruby. DS.
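On current Ruby versions, \w and \W are documented as ASCII-only again, even for UTF-8 strings, so the 1.9.1 result above should not be relied on. A minimal sketch of a Unicode-aware alternative uses the \p{Word} property class instead; the method name strip_non_word is made up for illustration:

```ruby
# Remove everything that is not a Unicode letter, mark, digit or
# connector punctuation (such as "_"). `strip_non_word` is a
# hypothetical helper name, not part of Ruby.
def strip_non_word(text)
  text.gsub(/[^\p{Word}]/, '')
end

puts strip_non_word("Hello,")
puts strip_non_word("Quística.")
```

Because the character class is defined by Unicode properties rather than a hand-written whitelist, it should also cover non-Latin scripts without extra work.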

Related

Escaping in %q notation won't work in irb

Here is a sample code called test.rb:
s = %Q_abc\_def\_ghi_
puts s
s = %q_abc\_def\_ghi_
puts s
It works fine as expected:
➜ Desktop ruby test.rb
abc_def_ghi
abc_def_ghi
However, when I run it in irb, nothing happened after s = %q_abc\_def\_ghi_:
➜ Desktop irb
irb(main):001:0> s = %Q_abc\_def\_ghi_
=> "abc_def_ghi"
irb(main):002:0> puts s
abc_def_ghi
=> nil
irb(main):003:0>
irb(main):004:0* s = %q_abc\_def\_ghi_
irb(main):005:1> puts s
irb(main):006:1>
irb(main):007:1*
irb(main):008:1*
Why won't it work? And how can I escape '_' (or other delimiters) in %q notation?
My Ruby version is:
ruby -v
ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-darwin15]
IRB has its own Ruby lexer/parser, which it uses to keep track of the state of the code entered so that it can do things like display different prompts depending on whether you are in the middle of a string or defining a method or class. The code is then passed to Ruby to be evaluated “properly”.
It looks like this lexer has a bug in how it handles escaping in single-quoted-style strings that aren’t actually using single quotes.
Ruby itself handles the escaping just fine, so normally I don’t think this bug would have much effect, but in your example you happen to have used the string def right after the second _, and def is a keyword that IRB also looks for.
This combination appears to put IRB into a strange state where its understanding of what is going on differs from what’s actually happening. This is the odd behaviour you are seeing.
A little playing around with a checked out version of the IRB code seems to support this. The snippet I think is to blame looks like this:
elsif ch == '\\' and @ltype == "'" #'
  case ch = getc
  when "\\", "\n", "'"
  else
    ungetc
  end
Changing the when line to also look for the actual character being used:
when "\\", "\n", "'", quoted
(quoted is a parameter passed to the function) appears to fix it, and your examples all work fine with this modified version. I don’t know whether that is a sufficient fix, though; I don’t know the code, and this is just a quick hack.
It might be worth opening a bug about this.
I'm not sure why this displays differently in your Ruby file and IRB, but lowercase percent strings behave like single-quoted strings: escape sequences such as \n are not processed, and only the delimiter and the backslash itself can be escaped. See Difference between '%{}', '%Q{}', '%q{}' in ruby string delimiters
Since %q supports so little escaping, IRB may simply be mishandling the combination of an unusual delimiter and a backslash escape.
This probably isn't the answer you were looking for but I think it should help a bit.
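Until the IRB lexer is fixed, one way that should sidestep the problem is to pick a delimiter that does not occur in the string, so no escaping is needed at all. A small sketch (the variable names are only for illustration):

```ruby
# Escaping the delimiter inside %q, as in the question's test.rb.
escaped = %q_abc\_def\_ghi_

# A delimiter that never appears in the text needs no escaping.
bracketed = %q{abc_def_ghi}

puts escaped == bracketed

# Like single quotes, %q leaves other backslash sequences untouched:
# this is the four characters a, \, n, b rather than a newline.
literal = %q{a\nb}
puts literal.length
```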

iconv will be deprecated in the future, transliterate

Ruby 1.9.3 warns that iconv is deprecated, but I use iconv to remove diacritics and get plain ASCII:
Iconv.iconv('ascii//translit', 'utf-8', 'Těžiště')
This returns Teziste. How can I obtain the same result using String#encode?
If I had Rails (or just ActiveSupport) around, I'd do something like this:
ActiveSupport::Multibyte::Unicode.normalize('Těžiště', :kd).chars.grep(/\p{^Mn}/).join('')
to get 'Teziste'. The :kd form decomposes each accented character into a base character followed by separate combining accents, the grep(/\p{^Mn}/) keeps only the characters that are not non-spacing marks, and join puts what remains back together as the unaccented string.
If you don't have Rails or ActiveSupport handy, then you could use UnicodeUtils.compatibility_decomposition from unicode-utils instead of ActiveSupport::Multibyte::Unicode.normalize:
> UnicodeUtils.compatibility_decomposition('Těžiště').chars.grep(/\p{^Mn}/).join('')
=> "Teziste"
I tend to have the ActiveSupport version patched into String in Rails-land:
class String
  def de_accent
    #
    # `\p{Mn}` is also known as `\p{Nonspacing_Mark}` but only the short
    # and cryptic form is documented.
    #
    ActiveSupport::Multibyte::Unicode.normalize(self, :kd).chars.grep(/\p{^Mn}/).join('')
  end
end
so that I can say things like:
> s = 'Těžiště'.de_accent
=> "Teziste"
to strip out accents.
This approach won't handle everything but maybe it will do enough.
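If neither ActiveSupport nor unicode-utils is available, Ruby 2.2 and later ship String#unicode_normalize in the standard library, which can do the same decomposition. A minimal sketch along the same lines (strip_accents is a hypothetical name, not an existing method):

```ruby
# Decompose accented characters (NFKD), then drop the non-spacing
# marks (\p{Mn}) that the decomposition split off.
# `strip_accents` is a hypothetical helper name.
def strip_accents(text)
  text.unicode_normalize(:nfkd).gsub(/\p{Mn}/, '')
end

puts strip_accents('Těžiště')
```

As with the versions above, characters that have no decomposition (ø or ß, for example) pass through unchanged, so the result is not guaranteed to be pure ASCII.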

Converting UTF-8 characters into proper ASCII characters

I have the string "V\355ctor" (I think that's Víctor).
Is there a way to convert it to ASCII where í would be replaced by an ASCII i?
I already have tried Iconv without success.
(I'm only getting Iconv::IllegalSequence: "\355ctor")
Further, are there differences between Ruby 1.8.7 and Ruby 2.0?
EDIT:
Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "V\355ctor") seems to work, but the result is Vctor, not Victor.
I know of two options.
transliterate from the I18n gem.
$ irb
1.9.3-p448 :001 > string = "Víctor"
=> "Víctor"
1.9.3-p448 :002 > require 'i18n'
=> true
1.9.3-p448 :003 > I18n.transliterate(string)
=> "Victor"
Unidecoder from the stringex gem.
Stringex::Unidecoder.decode(string)
Update:
When running Unidecoder on "V\355ctor", you get the following error:
Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with IBM437 string)
Hmm, maybe you want to first translate from IBM437:
string.force_encoding('IBM437').encode('UTF-8')
This may help you get further. Note that the autodetected encoding could be incorrect; if you know exactly what the encoding is, everything becomes a lot easier.
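To make the re-encoding step concrete: \355 is the single byte 0xED, which is í in ISO-8859-1 (Latin-1). The sketch below assumes the input really is Latin-1, which the question does not confirm; with a different source encoding (IBM437, say) the same byte would decode to a different character.

```ruby
# "\355" (octal) is the same byte as "\xED". `.b` gives a binary copy.
raw = "V\355ctor".b

# Assumption: the bytes are ISO-8859-1. Transcode them to UTF-8.
utf8 = raw.encode('UTF-8', 'ISO-8859-1')

# Decompose í into i + combining accent, then drop the accent.
ascii = utf8.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')

puts utf8
puts ascii
```

Once the string is valid UTF-8, I18n.transliterate or Unidecoder from the answer above could be used in place of the last step.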
What you want to do is called transliteration.
The most used and best maintained library for this is ICU. (Iconv is frequently used too, but it has many limitations such as the one you ran into.)
A cursory Google search yields a few ruby ICU wrappers. I'm afraid I cannot comment on which one is better, since I've admittedly never used any of them. But that is the kind of stuff you want to be using.

Desperately trying to remove this diabolical excel generated special character from csv in ruby

My computer has no idea what this character is. It came from Excel.
In Excel it was a weird space; now it is literally represented by several symbols, i.e. my computer has no idea what it is.
This character is represented by a Ê in Excel (in the csv; in the xls it is a space of some kind). OS X's TextEdit treats it as a big space this long "            ", which is, I think, what it is. Ruby's CSV parser blows up when it tries to parse it as normal utf-8, so I have to add :encoding => "windows-1251:utf-8" to parse it, in which case Ruby turns it into a "K". This K appears in groups of 9, 12, 15 and 18 (KKKKKKKKK, etc.) in my CSV, and cannot be removed via gsub(/K/) (groups of K, /KKKKKKKKK/, etc., cannot be removed either)! I've also used the open-source tool CSVfix, but its "removing leading and trailing spaces" command had no effect on the Ks.
I've tried using sed as suggested in Remove non-ascii characters from csv, but got errors like
sed: 1: "output.csv": invalid command code o
when running something like sed -i 's/[\d128-\d255]//' input.csv on Mac.
Parse your csv with the following to remove your "evil" character
.encode!("ISO-8859-1", :invalid => :replace)
**self-answers (different account, same person)
1st solution attempt:
evil_string_from_csv_cell = "KKKKKKKKK"
encoding_opts = {
  :invalid => :replace, :undef => :replace,
  :replace => '', :universal_newline => true }
evil_string_from_csv_cell.encode Encoding.find('ASCII'), encoding_opts
#=> ""
2nd solution attempt:
Don't use 'windows-1251:utf-8' for encoding, use 'iso-8859-1' instead, which will turn those (cyrillic) K's into '\xCA', which can then be removed with
string.gsub!(/\xCA/, '')
** I have not solved this problem yet.
3rd solution attempt:
Trying to match the array of K's as if they were actual Latin K's is foolish. Copy and paste in the actual Cyrillic K and see how that works. Here is the character; notice the little curl on the end:
К
ruby treats it by making it a little bit bolder than normal K's
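To illustrate why gsub(/K/) never matched: under windows-1251 the byte 0xCA decodes to the Cyrillic letter К (U+041A), which only looks like the Latin K (U+004B). A minimal sketch that may be worth trying, assuming the cell holds 0xCA bytes followed by ordinary text (the sample value is made up):

```ruby
# Hypothetical cell contents: three 0xCA bytes, then ASCII text.
raw = "\xCA\xCA\xCAMetatagA".b
cell = raw.encode('UTF-8', 'Windows-1251')

latin_attempt = cell.gsub(/K/, '')          # Latin K: matches nothing
cyrillic_attempt = cell.gsub(/\u041A/, '')  # Cyrillic К: removes the run

puts latin_attempt == cell
puts cyrillic_attempt
```

Writing the character as \u041A in the pattern avoids having to paste the look-alike letter into the source file.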
4th solution/strategy attempt (success):
Use regular expressions to capture the characters. So long as you can encode the weird spaces (or whatever they are) into something, you can then ignore them using regular expressions.
Also try to take advantage of any spatial (matrix-like) patterns amongst the document types.
The answer to this problem is
A.) This is a very difficult problem. No one so far knows how to "physically" remove the Cyrillic Ks.
but
B.) CSV files are just strings separated by unescaped commas, so matching strings using regular expressions works just fine so long as the encoding doesn't break the program.
So to read the file
f = File.open(File.join(Rails.root, 'lib', 'assets', 'repo', name), :encoding => "windows-1251:utf-8")
parsed = CSV.parse(f)
then find specific rows via regular expression literal string matching (it will overlook the cyrillic K's)
# Here, row[0] is the metatag column.
specific_metatag_row = parsed.index { |row| row[0] =~ /MetatagA/ }
I couldn't get sed working but finally had luck doing this in Vim:
vim myhorriblefile.csv
# Once vim is open:
:%s/Ê/ /g
:wq
# Done!
As a generalized function for reuse, this can be:
clean_weird_character () {
  vim "$1" -c ":%s/Ê/ /g" -c "wq"
}

Working around unexpected behavior in yaml for Ruby -- interned unicode strings

(1.9 on Windows)
Reproducing:
require 'yaml'
s = YAML::load("\xEC\x86\x8C\xEB\x85\x80\xEC\x8B\x9C\xEB\x8C\x80")
# => "∞åîδàÇ∞ï£δîÇ" or "소녀시대", depending on your terminal's unicode support
s_interned = s.intern
s_interned.class # => Symbol
s_yamld = s_interned.to_yaml
# => "--- \":\\xEC\\x86\\x8C\\xEB\\x85\\x80\\xEC\\x8B\\x9C\\xEB\\x8C\\x80\"\n"
unyamld = YAML::load(s_yamld)
# => ":∞åîδàÇ∞ï£δîÇ" or ":소녀시대"
unyamld.class # => String
# => expected: Symbol
And once again:
YAML::load(s_interned.to_yaml).class # => String
Here's how a "normal" symbol behaves:
YAML::load(:foo.to_yaml).class # => Symbol
Normal symbols behave fine, but symbols with unicode characters don't seem to. They get interpreted as strings with a colon as their first character.
I'm pretty sure this script was working last night. But I woke up this morning and everything has gone wrong.
Does anyone know how I can resolve this or get around this?
I've tried using some clever regular expression/sub hacks to get around this and "reconvert", but they've all proven inelegant or have made the situation worse.
I'm new to 1.9 as well but it seems you have to add the encoding to the top of the file sometimes. Something like:
# encoding: utf-8
Again... no idea when or why. Still have to learn how it works in 1.9. I found some more background information here: "Ruby 1.9 Common Problems Pt. 1: Encoding".
