Dealing with UTF-8 file in Ruby - ruby

8 gods and goddesses,
I have a csv file is is rumoured to be encoded in Win UTF-8. I need to apply a bunch of reg exps and other sorts of string/array manipulation to it and then have it output again in WIN UTF-8. I'm running Ruby 1.8 on Mac Lion. Are there any gotchas that I should be aware of? I got no UTF-8 fu.
Ok, so win utf-8 shocked everyone else as it did me. What about UTF-8? anyone? anyone?
Thanks in advance
Mark

I am not quite sure what your problem is.
Ruby 1.8 has native support for UTF-8. Actually it the only format it can work with internally. Otherwise you can always use iconv to convert between encodings. If the format is something different than UTF-8 you have to use iconv for input and output.
Regarting CSV, I think that fastercsv is a really neat framework for that, as it covers all corner cases and allows you to customize the in-/output format.
Depending on how many of these files you have to edit it might be faster to just use some texteditor to convert your file to standard UTF-8 with Unix style endings. Then you can apply your changes and convert it back with your editor.

Related

Is there a UTF-8 replacement for BASH?

I often need to write simple BASH scripts on my computer for manipulating files. BASH seems to have difficulty working with UTF-8 content.
Are there any versions of BASH which are fully UTF-8 compatible?
Is there a replacement for BASH, which uses a similar or identical syntax, but is UTF-8 compatible?
Bash itself shouldn't have any problems using UTF8. Most likely your problems are caused by another program, e.g. the terminal emulator or editor. Make sure that your environment is set up to use UTF8. For more information on this, see for example here.
I take your problem is the usual sed/awk/grep... etc doesn't support unicode, so stackoverflow solutions usually don't work for you?
bash itself is very limited without external programs.
To do what you want, you probably have to code in a more functional programming language other than bash.
UTF-8 itself is not very suitable for processing, you need to parse it into 2-byte or 4 byte character and then process the characters. (i.e. conversion to UTF-16 or UTF-32) and then convert it back to UTF-8 for storage.

Batch convert to UTF8 using Ruby

I'm encountering a little problem with my file encodings.
Sadly, as of yet I still am not on good terms with everything where encoding matters; although I have learned plenty since I began using Ruby 1.9.
My problem at hand: I have several files to be processed, which are expected to be in UTF-8 format. But I do not know how to batch convert those files properly; e.g. when in Ruby, I open the file, encode the string to utf8 and save it in another place.
Unfortunately that's not how it is done - the file is still in ANSI.
At least that's what my Notepad++ says.
I find it odd though, because the string was clearly encoded to UTF-8, and I even set the File.open parameter :encoding to 'UTF-8'. My shell is set to CP65001, which I believe also corresponds to UTF-8.
Any suggestions?
Many thanks!
/e: What's more, when in Notepad++, I can convert manually as such:
Selecting everything,
copy,
setting encoding to UTF-8 (here, \x-escape-sequences can be seen)
pasting everything from clipboard
Done! Escape-characters vanish, file can be processed.
Unfortunately that's not how it is done - the file is still in ANSI. At least that's what my Notepad++ says.
UTF-8 was designed to be a superset of ASCII, which means that most of the printable ASCII characters are the same in UTF-8. For this reason it's not possible to distinguish between ASCII and UTF-8 unless you have "special" characters. These special characters are represented using multiple bytes in UTF-8.
It's well possible that your conversion is actually working, but you can double-check by trying your program with special characters.
Also, one of the best utilities for converting between encodings is iconv, which also has ruby bindings.

Ruby 1.9 - Invalid multibyte character (utf-8)

I have a ruby file with only these two lines:
# encoding: utf-8
puts "—"
When I run it with ruby test_enc.rb it fails with:
test_enc.rb:2: invalid multibyte char (UTF-8)
test_enc.rb:2: unterminated string meets end of file
I don't know how to properly specify the character code of — (emdash), but vim tells me it is 151, Hex 97, Octal 227. It fails the same way with other characters like ã as well, so I doubt it is related specifically to that character.
I am running on Windows XP and the version of ruby I'm using is:
ruby 1.9.1p430 (2010-08-16 revision 28998) [i386-mingw32]
I feel like there is something very obvious I am missing here. Any ideas?
EDIT: Learned a valuable lesson about assumptions today - specifically assuming your editor IS using UTF-8 without actually checking it. Oops!
Thanks for the quick and accurate replies all!
EDIT AGAIN: The 'setting up vim properly for utf-8' grew too big and wasn't really relevant to this question, so it is now a separate question.
Given that Ruby is explicitly calling your attention to UTF-8, I strongly suspect that you haven't actually written out a UTF-8 file to start with. Make sure that Vim (or whatever text editor you're using to create the file) is really set to write out UTF-8.
Note that in UTF-8, any non-ASCII character will be represented by multiple bytes, not a single byte as you've described from the Vim diagnostics. I'd recommend using a binary file editor (or dump, or whatever) to really show what's in the text file though. Something that doesn't already have some preconceived notion of the encoding - something that isn't even trying to think of it as a text file.
Notepad lets you write out a file in UTF-8, so you might want to try that just to see what happens. (I don't have Ruby installed myself, otherwise I'd try it for you.)
Your file is in latin1. Ruby is right.
emdash would be encoded on two bytes not one in UTF-8.

Using Ruby's fastercsv with character encodings

Using Ruby 1.8.7, I want to accept csv's into my system, even though this is an admin application, it seems I can get several different types of csvs. On my mac if I export from excel using "windows csv" option then fastercsv can read it out by default. On windows I seem to be getting utf-16 encoded csvs (which I havent figured out how to parse yet)
It seems like a pretty common thing to allow users to upload a csv that could be in utf8, utf16, ascii etc type formats, detect and parse them. Has anyone figured this out?
I started to look at UniversalDetector to help me detct, then use Iconv to convert, but this seems to be tricky and was hoping someone figured it out :)
According to FasterCSV's docs, the initialize method takes an :encoding option:
The encoding to use when parsing the file. Defaults to your $KDOCE setting. Valid values: n??? orN??? for none, e??? orE??? for EUC, s??? orS??? for SJIS, and u??? orU??? for UTF-8 (see Regexp.new()).
Because its list is limited, you might want to look into using iconv to do a pre-process of the contents, then pass them to CSV. You can use Ruby's interface to iconv ("Iconv") or the command-line version of it. Iconv is very powerful and flexible and capable of converting UTF-16 among other things.
Actually detecting the encoding of the document is more problematic, but the command-line version can help you there. If I remember right it can help identify the encoding. It can also convert between encodings, or, if you want, it can be told to convert to ASCII, converting to the closest matching characters, or ignoring them entirely.
Ruby 1.9.2 is much more capable than 1.8.7 when it comes to dealing with different character sets, so you might want to consider upgrading. Also, to become more familiar with the tools and issues of dealing with character-sets and multibyte characters you should read James Gray's blogs.

How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?

My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).
When the BOM (Byte Order Mark) is there, I have no problem. I know if the file is UTF-8 or UTF-16 BE or LE.
I wanted to assume when there was no BOM that the file was ANSI. But I have found that the files I am dealing with often are missing their BOM. Therefore no BOM may mean that the file is ANSI, UTF-8, UTF-16 BE or LE.
When the file has no BOM, what would be the best way to scan some of the file and most accurately guess the type of encoding? I'd like to be right close to 100% of the time if the file is ANSI and in the high 90's if it is a UTF format.
I'm looking for a generic algorithmic way to determine this. But I actually use Delphi 2009 which knows Unicode and has a TEncoding class, so something specific to that would be a bonus.
Answer:
ShreevatsaR's answer led me to search on Google for "universal encoding detector delphi" which surprised me in having this post listed in #1 position after being alive for only about 45 minutes! That is fast googlebotting!! And also amazing that Stackoverflow gets into 1st place so quickly.
The 2nd entry in Google was a blog entry by Fred Eaker on Character encoding detection that listed algorithms in various languages.
I found the mention of Delphi on that page, and it led me straight to the Free OpenSource ChsDet Charset Detector at SourceForge written in Delphi and based on Mozilla's i18n component.
Fantastic! Thank you all those who answered (all +1), thank you ShreevatsaR, and thank you again Stackoverflow, for helping me find my answer in less than an hour!
Maybe you can shell out to a Python script that uses Chardet: Universal Encoding Detector. It is a reimplementation of the character encoding detection that used by Firefox, and is used by many different applications. Useful links: Mozilla's code, research paper it was based on (ironically, my Firefox fails to correctly detect the encoding of that page), short explanation, detailed explanation.
Here is how notepad does that
There is also the python Universal Encoding Detector which you can check.
My guess is:
First, check if the file has byte values less than 32 (except for tab/newlines). If it does, it can't be ANSI or UTF-8. Thus - UTF-16. Just have to figure out the endianness. For this you should probably use some table of valid Unicode character codes. If you encounter invalid codes, try the other endianness if that fits. If either fit (or don't), check which one has larger percentage of alphanumeric codes. Also you might try searchung for line breaks and determine endianness from them. Other than that, I have no ideas how to check for endianness.
If the file contains no values less than 32 (apart from said whitespace), it's probably ANSI or UTF-8. Try parsing it as UTF-8 and see if you get any invalid Unicode characters. If you do, it's probably ANSI.
If you expect documents in non-English single-byte or multi-byte non-Unicode encodings, then you're out of luck. Best thing you can do is something like Internet Explorer which makes a histogram of character values and compares it to histograms of known languages. It works pretty often, but sometimes fails too. And you'll have to have a large library of letter histograms for every language.
ASCII? No modern OS uses ASCII any more. They all use 8 bit codes, at least, meaning it's either UTF-8, ISOLatinX, WinLatinX, MacRoman, Shift-JIS or whatever else is out there.
The only test I know of is to check for invalid UTF-8 chars. If you find any, then you know it can't be UTF-8. Same is probably possible for UTF-16. But when it's no Unicode set, then it'll be hard to tell which Windows code page it might be.
Most editors I know deal with this by letting the user choose a default from the list of all possible encodings.
There is code out there for checking validity of UTF chars.

Resources