Using Ruby's fastercsv with character encodings - ruby

Using Ruby 1.8.7, I want to accept CSVs into my system. Even though this is an admin application, it seems I can get several different types of CSV. On my Mac, if I export from Excel using the "Windows CSV" option, then FasterCSV can read it by default. On Windows I seem to be getting UTF-16 encoded CSVs (which I haven't figured out how to parse yet).
It seems like a pretty common thing to allow users to upload a CSV that could be in UTF-8, UTF-16, ASCII, etc., then detect and parse it. Has anyone figured this out?
I started to look at UniversalDetector to help me detect, then use Iconv to convert, but this seems to be tricky and I was hoping someone had figured it out :)

According to FasterCSV's docs, the initialize method takes an :encoding option:
The encoding to use when parsing the file. Defaults to your $KCODE setting. Valid values: n or N for none, e or E for EUC, s or S for SJIS, and u or U for UTF-8 (see Regexp.new()).
Because its list is limited, you might want to look into using iconv to pre-process the contents, then pass them to CSV. You can use Ruby's interface to iconv ("Iconv") or the command-line version. Iconv is very powerful and flexible, and capable of converting UTF-16 among other things.
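As a sketch of that pre-processing step: under Ruby 1.8 the call would be Iconv.conv('UTF-8', 'UTF-16LE', data); the String#encode form below is the Ruby 1.9+ equivalent, with a made-up two-row CSV standing in for an upload:

```ruby
require 'csv'

# Simulated UTF-16LE upload, as Windows tools often produce.
utf16 = "name,city\nJ\u00FCrgen,K\u00F6ln\n".encode('UTF-16LE')

# Pre-process to UTF-8 before handing the data to the CSV parser.
# On Ruby 1.8 the equivalent is: Iconv.conv('UTF-8', 'UTF-16LE', utf16)
utf8 = utf16.encode('UTF-8')

rows = CSV.parse(utf8)
# rows.first == ["name", "city"]
```

The point is simply that the CSV parser only ever sees UTF-8; all the encoding work happens before parsing.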
Actually detecting the encoding of the document is more problematic, but the command-line tools can help you there. If I remember right, they can help identify the encoding. iconv can also convert between encodings or, if you want, be told to convert to ASCII, substituting the closest matching characters or ignoring them entirely.
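One pragmatic first pass at detection is checking for a byte-order mark. This is my own sketch, not part of iconv; BOM-less files still need a real detector (e.g. the UniversalDetector gem mentioned above) or a documented default:

```ruby
# Guess an encoding from the leading byte-order mark, if any.
def guess_encoding(data)
  b = data.b # compare raw bytes regardless of the string's tagged encoding
  if    b.start_with?("\xFF\xFE".b)     then 'UTF-16LE'
  elsif b.start_with?("\xFE\xFF".b)     then 'UTF-16BE'
  elsif b.start_with?("\xEF\xBB\xBF".b) then 'UTF-8'
  else 'UTF-8' # assumption: treat BOM-less input as UTF-8
  end
end
```

Windows tools that emit UTF-16 usually do write a BOM, which is why this cheap check covers the case in the question surprisingly often.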
Ruby 1.9.2 is much more capable than 1.8.7 when it comes to dealing with different character sets, so you might want to consider upgrading. Also, to become more familiar with the tools and issues of dealing with character-sets and multibyte characters you should read James Gray's blogs.

Related

ICU requires intermediate UTF16 conversion step

Why is libicu using UTF-16 as its "common denominator" format instead of UTF-8? I need to convert from UTF-8 to UTF-32 and back, and libicu seems to make it unnecessarily difficult by requiring this two-step utf8->utf16->utf32 conversion, even though its own functions like u_tolower also require a UChar32 input.
It doesn't seem memory is the determining factor here; otherwise they could just use UTF-8 as their "base" format as well.
UTF-16 is the default encoding form of the Unicode Standard, so I suspect that answers the "why" there. See this ICU page for some additional information.

Is there a UTF-8 replacement for BASH?

I often need to write simple BASH scripts on my computer for manipulating files. BASH seems to have difficulty working with UTF-8 content.
Are there any versions of BASH which are fully UTF-8 compatible?
Is there a replacement for BASH, which uses a similar or identical syntax, but is UTF-8 compatible?
Bash itself shouldn't have any problems using UTF-8. Most likely your problems are caused by another program, e.g. the terminal emulator or editor. Make sure that your environment is set up to use UTF-8. For more information on this, see for example here.
I take it your problem is that the usual sed/awk/grep etc. don't support Unicode, so Stack Overflow solutions usually don't work for you?
bash itself is very limited without external programs.
To do what you want, you probably have to code in a more fully featured programming language than bash.
UTF-8 itself is not very convenient for processing; you need to decode it into 2-byte or 4-byte characters (i.e. convert to UTF-16 or UTF-32), process those characters, and then convert back to UTF-8 for storage.
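In Ruby (the language of the rest of this thread), that decode-process-re-encode cycle looks like the following sketch, using fixed-width UTF-32LE as the working form:

```ruby
utf8 = "h\u00E9llo"

# Decode to UTF-32LE: every character is exactly 4 bytes, so unpack
# gives one integer code point per character.
codepoints = utf8.encode('UTF-32LE').unpack('V*')

# ... process the fixed-width code points here ...

# Re-encode to UTF-8 for storage.
back = codepoints.pack('V*').force_encoding('UTF-32LE').encode('UTF-8')
# back == "héllo"
```

Working on integer code points sidesteps the variable-width byte sequences that make raw UTF-8 awkward to slice and index.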

Dealing with UTF-8 file in Ruby

8 gods and goddesses,
I have a CSV file that is rumoured to be encoded in Win UTF-8. I need to apply a bunch of regexps and other sorts of string/array manipulation to it and then output it again in Win UTF-8. I'm running Ruby 1.8 on Mac Lion. Are there any gotchas I should be aware of? I've got no UTF-8 fu.
Ok, so win utf-8 shocked everyone else as it did me. What about UTF-8? anyone? anyone?
Thanks in advance
Mark
I am not quite sure what your problem is.
Ruby 1.8 has native support for UTF-8. Actually, it's the only multibyte format it can work with internally. Otherwise you can always use iconv to convert between encodings: if the format is something other than UTF-8, you have to use iconv for input and output.
Regarding CSV, I think that fastercsv is a really neat library for that, as it covers all the corner cases and lets you customize the input/output format.
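A sketch of that kind of pipeline on Ruby 1.9+, where FasterCSV ships as the standard CSV library (the sample data and the cleanup rule, trimming and collapsing whitespace, are just illustrations):

```ruby
require 'csv'

data = "name,note\n Ann , hello  world \n"

# Parse, apply string manipulation to each field, and re-emit CSV.
cleaned = CSV.parse(data).map do |row|
  row.map { |field| field.to_s.strip.squeeze(' ') }
end
out = CSV.generate { |csv| cleaned.each { |r| csv << r } }
# out == "name,note\nAnn,hello world\n"
```

Letting the CSV library handle quoting and escaping on both ends is exactly the corner-case coverage the answer refers to.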
Depending on how many of these files you have to edit it might be faster to just use some texteditor to convert your file to standard UTF-8 with Unix style endings. Then you can apply your changes and convert it back with your editor.

In Ruby, how to automatically convert non-supported characters in text-processing?

(Using Ruby 1.8)
I only have a brief understanding of encoding and such, but what I want to know is: in any given script handling any given text file, is there some universal library or call I need to make to turn non-standard characters into their nearest printable equivalents? I realize there's no "all-in-one" fix, but this is for an English (U.S. gov't) text file, so I'm wondering if there's something that mitigates what must be a relatively common issue in English text formatting.
For example, in a text file, I have an entry like this:
0-8­23
That hyphen is just literally a hyphen as I've typed it out. In the file, though, it's something that looks like a hyphen (an en dash?), but when copying and pasting it, for example into this browser text box, it doesn't show up.
Printing it out via a Ruby script gets this:
08�23
How do I get my script to resolve it into a dash, or something other than a gremlin?
It's very common to run into hyphen-like characters and dashes, especially in the output of word-processors. Converting them isn't too hard if you know what the byte is that represents the character, but gets to be a pain when you get a document with several different ones. It gets worse as you throw other accented characters into the mix.
Ruby 1.8 doesn't support multibyte and Unicode character sets as well as 1.9+, but you can work around that somewhat by using the Iconv library.
Iconv lets you convert between various character sets, such as US-ASCII, ISO-8859-1 and WIN-1252. It's smarter than a regex because it knows how to convert accented characters to similar-looking characters, or ignore them if nothing similar exists, allowing your transliteration to degrade gracefully.
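For example, Iconv can be asked to transliterate with Iconv.conv('ASCII//TRANSLIT', 'UTF-8', str). On Ruby 1.9+ a fallback table passed to String#encode gives similar graceful degradation; the mapping below is my own and only covers dash-like characters:

```ruby
# Map common dash-like characters down to a plain ASCII hyphen.
DASH_FALLBACK = {
  "\u2013" => '-', # en dash
  "\u2014" => '-', # em dash
  "\u00AD" => '-', # soft hyphen
}

s = "0-8\u201323"  # an en dash between "8" and "23", as in the question
ascii = s.encode('US-ASCII', fallback: DASH_FALLBACK)
# ascii == "0-8-23"
```

Any non-ASCII character not listed in the table still raises an error, which is often what you want: it surfaces the next gremlin instead of silently mangling it.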
I have some example code in an answer to a related question. Also read James Gray's article linked in the answer. It explains the problem and ways to fix it, ending up with recommending Iconv too.
You could whitelist with gsub:
string.gsub(/[^a-zA-Z0-9]/, '')
Without knowing more information, I can't build the perfect regex for you, but the general idea is to replace anything that's not what you're expecting (anything not a letter or number or expected symbols).

clean up strange encoding in ruby

I'm currently playing a bit with couchdb.
I'm trying to migrate some blog data from redis (key value store) to couchdb (key value store).
Seeing as I probably migrated this data a gazillion times from and to different blogging engines (everybody has got to have a hobby :) ), there seem to be some encoding snafus.
I'm using CouchREST to access CouchDB from ruby and I'm getting this:
<JSON::GeneratorError: source sequence is illegal/malformed>
the problem seems to be the body_html part of the object:
<Post:0x00000000e9ee18 #body_html="[.....]Wie Sie bereits wissen, m\xF6chte EUserv k\xFCnftig seine [...]
Those are supposed to be Umlauts ("möchte" and "künftig").
Any idea how to get rid of those problems? I tried some conversions using the ruby 1.9 encoding feature or iconv before inserting, but haven't got any luck yet :(
If I try to e.g. convert that stuff to ISO-8859-1 using the .encode() method of ruby 1.9, this is what happens (different text, same problem):
#<Encoding::UndefinedConversionError: "\xC6\x92" from UTF-8 to ISO-8859-1>
I try to e.g. convert that stuff to ISO-8859-1
Close. You actually want to do it the other way around: you've got ISO-8859-1(*), you want UTF-8(**). So str.encode('utf-8', 'iso-8859-1') would be more likely to do the trick.
*: actually you might well have Windows code page 1252, which is like ISO-8859-1, but with extra smart-quotes and things in the range 0x80-0x9F which ISO-8859-1 uses for control codes. If so, use 'cp1252' instead.
**: well, you probably do. Working with UTF-8 is the best way forward so you can store all possible characters. If you really want to keep working in ISO-8859-1/cp1252, then presumably the problem is just that Ruby has mis-guessed the character set in use and you can fix it by calling str.force_encoding('iso-8859-1').
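A sketch of that repair on Ruby 1.9+, using the umlaut bytes from the question:

```ruby
# Bytes that are really ISO-8859-1 ("möchte") but carry no usable encoding tag.
raw = "m\xF6chte".b

# Re-tag the bytes with their true encoding, then transcode to UTF-8 --
# equivalent to str.encode('utf-8', 'iso-8859-1') on the untagged string.
fixed = raw.force_encoding('ISO-8859-1').encode('UTF-8')
# fixed == "möchte"
```

If the source is really Windows code page 1252, swap in 'Windows-1252' as the force_encoding argument; for bytes in the 0x80-0x9F range the two give different results.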
