Ruby 1.9 - Invalid multibyte character (utf-8)

I have a ruby file with only these two lines:
# encoding: utf-8
puts "—"
When I run it with ruby test_enc.rb it fails with:
test_enc.rb:2: invalid multibyte char (UTF-8)
test_enc.rb:2: unterminated string meets end of file
I don't know how to properly specify the character code of — (em dash), but Vim tells me it is 151, Hex 97, Octal 227. It fails the same way with other characters such as ã, so I doubt it is related specifically to that character.
I am running on Windows XP and the version of ruby I'm using is:
ruby 1.9.1p430 (2010-08-16 revision 28998) [i386-mingw32]
I feel like there is something very obvious I am missing here. Any ideas?
EDIT: Learned a valuable lesson about assumptions today - specifically assuming your editor IS using UTF-8 without actually checking it. Oops!
Thanks for the quick and accurate replies all!
EDIT AGAIN: The 'setting up vim properly for utf-8' grew too big and wasn't really relevant to this question, so it is now a separate question.

Given that Ruby is explicitly calling your attention to UTF-8, I strongly suspect that you haven't actually written out a UTF-8 file to start with. Make sure that Vim (or whatever text editor you're using to create the file) is really set to write out UTF-8.
Note that in UTF-8, any non-ASCII character will be represented by multiple bytes, not a single byte as you've described from the Vim diagnostics. I'd recommend using a binary file editor (or dump, or whatever) to really show what's in the text file though. Something that doesn't already have some preconceived notion of the encoding - something that isn't even trying to think of it as a text file.
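If you'd rather stay in Ruby than reach for a hex editor, here is a minimal sketch along those lines (assuming the script is saved as test_enc.rb in the current directory):
# Read the raw bytes with no notion of text encoding.
data = File.open("test_enc.rb", "rb") { |f| f.read }
# A UTF-8 em dash shows up as "e2 80 94"; a lone "97" means
# the file was actually saved as Windows-1252.
puts data.unpack("C*").map { |b| "%02x" % b }.join(" ")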
Notepad lets you write out a file in UTF-8, so you might want to try that just to see what happens. (I don't have Ruby installed myself, otherwise I'd try it for you.)

Your file is in Latin-1 (or, given the byte 151 = 0x97 that Vim reported, more likely Windows-1252). Ruby is right.
An em dash would be encoded as three bytes (0xE2 0x80 0x94) in UTF-8, not one.
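A quick Ruby 1.9 check makes the difference visible:
# encoding: utf-8
s = "—"                                # em dash, U+2014
p s.encode("UTF-8").bytes.to_a         # => [226, 128, 148]  (three bytes)
p s.encode("Windows-1252").bytes.to_a  # => [151]            (the 0x97 Vim reported)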

Related

Ruby 1.9 iso-8859-8-i encoding

I'm trying to create a piece of code that will download a page from the internet and do some manipulation on it. The page is encoded in iso-8859-1.
I can't find a way to handle this file. I need to search through the file in Hebrew and return the changed file to the user.
I tried to use string.encode, but I still get the wrong encoding.
When printing the response encoding, I get "encoding":{}, as if it's undefined. This is an example of what it returns:
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd-\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd
It should be Hebrew letters.
When I try with final.body.encode('iso-8859-8-i'), I get the error code converter not found (ASCII-8BIT to iso-8859-8-i).
When you have input whose encoding Ruby or the OS has assigned incorrectly, conversions will not work: Ruby starts from the wrong assumption and tries to preserve the wrong characters when converting.
However, if you know from some other source what the correct encoding is, you can use the force_encoding method to tell Ruby how to interpret the bytes it has loaded into a String. Note that this alters the object in place.
E.g.
contents = final.body
contents.force_encoding( 'ISO-8859-8' )  # relabels the bytes in place; nothing is converted
puts contents
At this point (provided it works), you now can make conversions (to e.g. UTF-8), because Ruby has been correctly told what characters it is dealing with.
I could not find 'ISO-8859-8-I' on my version of Ruby. I am not sure yet how close 'ISO-8859-8' is to what you need (some Googling suggests that it may be OK for you, if the ...-I encoding is not available).
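Once the relabelling works, the actual conversion is one more call; you can also ask Ruby which 8859-8 variants it ships with. A small Ruby 1.9 sketch:
# After force_encoding, a real conversion is possible.
utf8 = contents.encode('UTF-8')

# List the ISO-8859-8 variants this Ruby build knows about.
puts Encoding.name_list.grep(/8859-8/i)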

Dealing with UTF-8 file in Ruby

8 gods and goddesses,
I have a CSV file that is rumoured to be encoded in Win UTF-8. I need to apply a bunch of regexps and other sorts of string/array manipulation to it and then have it output again in Win UTF-8. I'm running Ruby 1.8 on Mac Lion. Are there any gotchas that I should be aware of? I got no UTF-8 fu.
Ok, so win utf-8 shocked everyone else as it did me. What about UTF-8? anyone? anyone?
Thanks in advance
Mark
I am not quite sure what your problem is.
Ruby 1.8 has native support for UTF-8; in fact, that is the only format it can work with internally. Otherwise you can always use iconv to convert between encodings: if the format is something other than UTF-8, you have to run it through iconv on input and output.
Regarding CSV, fastercsv is a really neat library for this, as it covers the corner cases and lets you customize the input/output format.
Depending on how many of these files you have to edit, it might be faster to just use a text editor to convert the file to standard UTF-8 with Unix-style line endings, apply your changes, and convert it back with your editor.
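A minimal Ruby 1.8 sketch of that round trip, assuming the mysterious "Win UTF-8" really means Windows-1252 (the file names and the whitespace cleanup are just placeholders):
require 'iconv'
require 'fastercsv'

# Read raw bytes and convert from the assumed input encoding to UTF-8.
raw  = File.open('input.csv', 'rb') { |f| f.read }
utf8 = Iconv.conv('UTF-8', 'WINDOWS-1252', raw)

# Do the string/array manipulation on proper UTF-8 rows.
rows = FasterCSV.parse(utf8)
rows.each { |row| row.map! { |field| field.to_s.gsub(/\s+/, ' ').strip } }

# Generate CSV again and convert back before writing out.
out = Iconv.conv('WINDOWS-1252', 'UTF-8',
                 FasterCSV.generate { |csv| rows.each { |r| csv << r } })
File.open('output.csv', 'wb') { |f| f.write(out) }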

Batch convert to UTF8 using Ruby

I'm encountering a little problem with my file encodings.
Sadly, I am still not on good terms with anything where encoding matters, although I have learned plenty since I began using Ruby 1.9.
My problem at hand: I have several files to be processed, which are expected to be in UTF-8 format, but I do not know how to batch convert them properly. E.g. in Ruby, I open the file, encode the string to UTF-8 and save it in another place.
Unfortunately that's not how it is done - the file is still in ANSI.
At least that's what my Notepad++ says.
I find it odd though, because the string was clearly encoded to UTF-8, and I even set the File.open parameter :encoding to 'UTF-8'. My shell is set to CP65001, which I believe also corresponds to UTF-8.
Any suggestions?
Many thanks!
/e: What's more, when in Notepad++, I can convert manually as such:
Selecting everything,
copy,
setting encoding to UTF-8 (here, \x-escape-sequences can be seen)
pasting everything from clipboard
Done! Escape-characters vanish, file can be processed.
Unfortunately that's not how it is done - the file is still in ANSI. At least that's what my Notepad++ says.
UTF-8 was designed to be a superset of ASCII, which means that every ASCII character is encoded identically in UTF-8. For this reason it's not possible to distinguish a pure-ASCII file from UTF-8 unless the file contains "special" (non-ASCII) characters, which are represented using multiple bytes in UTF-8.
It's quite possible that your conversion is actually working; you can double-check by running your program on input that contains special characters.
Also, one of the best utilities for converting between encodings is iconv, which also has ruby bindings.
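For the batch conversion itself, here is a sketch using Ruby 1.9's open-mode transcoding (the Windows-1252 source encoding and the directory names are assumptions; substitute the real input encoding):
require 'fileutils'
FileUtils.mkdir_p('converted')

Dir.glob('input/*.txt').each do |path|
  # 'r:EXTERNAL:INTERNAL' reads Windows-1252 bytes and yields UTF-8 strings.
  text = File.open(path, 'r:Windows-1252:UTF-8') { |f| f.read }
  # Write the transcoded text back out, labelled and encoded as UTF-8.
  File.open(File.join('converted', File.basename(path)), 'w:UTF-8') { |f| f.write(text) }
end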

In Ruby, how to automatically convert non-supported characters in text-processing?

(Using Ruby 1.8)
I only have a brief understanding of encoding and such...but what I want to know is, in any given script handling any given text-file, is there some universal library or call I need to make to turn non-standard characters into their nearest printable equivalent. I realize there's no "all-in-one" fix, but this is for a English (U.S. gov't) text file, and so I'm wondering if there's something that mitigates what must be a relatively common issue in English text formatting.
For example, in a text file, I have an entry like this:
0-8­23
That hyphen is just literally a hyphen as I've typed it out. In the file, though, it's something that looks like a hyphen (an en dash?), but when copying and pasting it, for example into this browser text box, it doesn't show up.
Printing it out via a Ruby script gets this:
08�23
How do I get my script to resolve it into a dash. Or something other than a gremlin?
It's very common to run into hyphen-like characters and dashes, especially in the output of word-processors. Converting them isn't too hard if you know what the byte is that represents the character, but gets to be a pain when you get a document with several different ones. It gets worse as you throw other accented characters into the mix.
Ruby 1.8 doesn't support multibyte and Unicode character sets as well as 1.9+, but you can work around that somewhat by using the Iconv library.
Iconv lets you convert between various character sets, such as US-ASCII, ISO-8859-1 and WIN-1252. It's smarter than a regex, because it knows how to convert accented characters to similar-looking characters, or ignore them if nothing similar exists, allowing your transliteration to degrade gracefully.
I have some example code in an answer to a related question. Also read James Gray's article linked in that answer; it explains the problem and ways to fix it, ending up recommending Iconv too.
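A small Ruby 1.8 sketch of Iconv's transliteration mode (the input string is a stand-in for the question's entry with the en dash):
require 'iconv'

# '//TRANSLIT' maps characters such as the en dash to their closest
# ASCII equivalent; '//IGNORE' silently drops anything unmappable.
converter = Iconv.new('US-ASCII//TRANSLIT//IGNORE', 'UTF-8')
puts converter.iconv("0-8\xe2\x80\x9323")  # en dash; typically prints 0-8-23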
You could whitelist with gsub:
string.gsub(/[^a-zA-Z0-9]/, '-')  # the second argument is the replacement; gsub needs it (or a block)
Without knowing more information, I can't build the perfect regex for you, but the general idea is to replace anything that's not what you're expecting (anything not a letter or number or expected symbols).

Unicode characters in a Ruby script?

I would like to write a Ruby script which writes Japanese characters to the console. For example:
puts "こんにちは・今日は"
However, I get an exception when running it:
jap.rb:1: Invalid char `\377' in expression
jap.rb:1: Invalid char `\376' in expression
Is it possible to do? I'm using Ruby 1.8.6.
You've saved the file in the UTF-16LE encoding, the one Windows misleadingly calls “Unicode”. This encoding is generally best avoided because it's not an ASCII-superset: each code unit is stored as two bytes, with ASCII characters having the other byte stored as \0. This will confuse an awful lot of software; it is unusual to use UTF-16 for file storage.
What you are seeing with \377 and \376 (octal for \xFF and \xFE) is the U+FEFF Byte Order Mark sequence put at the front of UTF-16 files to distinguish UTF-16LE from UTF-16BE.
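You can confirm this by peeking at the first two bytes of the script; a minimal sketch (jap.rb is the file name from the question):
# Read the first two bytes without any text-encoding interpretation.
bom = File.open('jap.rb', 'rb') { |f| f.read(2) }
# UTF-16LE files start with ff fe; UTF-16BE files with fe ff.
puts bom.unpack('C*').map { |b| '%02x' % b }.join(' ')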
Ruby 1.8 is totally byte-based; it makes no attempt to read Unicode characters from a script. So you can only save source files in ASCII-compatible encodings. Normally, you'd want to save your files as UTF-8 (without BOM; the UTF-8 faux-BOM is another great Microsoft innovation that breaks everything). This'd work great for scripts on the web producing UTF-8 pages.
And if you wanted to be sure the source code would be tolerant of being saved in any ASCII-compatible encoding, you could encode the string to make it more resilient (if less readable):
puts "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe3\x83\xbb\xe4\xbb\x8a\xe6\x97\xa5\xe3\x81\xaf"
However! Writing to the console is itself a big problem. What encoding is used to send characters to the console varies from platform to platform. On Linux or OS X, it's UTF-8. On Windows, it's a different encoding for every installation locale (as selected on “Language for non-Unicode applications” in the “Regional and Language Options” control panel entry), but it's never UTF-8. This setting is—again, misleadingly—known as the ANSI code page.
So if you are using a Japanese Windows install, your console encoding will be Windows code page 932 (a variant of Shift-JIS). If that's the case, you can save the text file from a text editor using “ANSI” or explicitly “Japanese cp932”, and when you run it in Ruby you'll get the right characters out. Again, if you wanted to make the source withstand misencoding, you could escape the string in cp932 encoding:
puts "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81E\x8d\xa1\x93\xfa\x82\xcd"
But if you run it on a machine in another locale, it'll produce different characters. You will be unable to write Japanese to the default console from Ruby on a Western Windows installation (code page 1252).
(Whilst Ruby 1.9 improves Unicode handling a lot, it doesn't change anything here. It's still a bytes-based application using the C standard library IO functions, and that means it is limited to Windows's local code page.)
