why does my vim encoding and ruby encoding not agree? - ruby

In the ruby file:
p __ENCODING__
#<Encoding:US-ASCII>
In vim:
set encoding?
encoding=utf-8
This is causing me grief (http://stackoverflow.com/questions/14495486/ruby-syntax-error-with-multiple-language-in-hash), which is patched but I still don't understand why the file shows as ASCII by ruby and utf-8 by vim.

As @melpomene commented, :set encoding tells you what encoding Vim uses internally.
:set fileencoding will tell you what encoding Vim decided to use for your document. The possible values are given by the fileencodings option. ASCII is not part of the default list as it's usually handled transparently by the other encodings listed.
But that part of your question is puzzling me:
but I still don't understand why the file is ASCII
because it looks like you actively want that file to be treated as ASCII by the interpreter.
Anyway, that encoding directive is only used by Ruby: it doesn't mean that the file is actually encoded as ASCII or that Vim is supposed to care about it and treat it in a special way.
In short, whether your file is actually encoded in ASCII or not, Vim doesn't care.
So… what do you want exactly? That vim sets its fileencoding option to ASCII when you open a supposedly ASCII file? That your supposedly ASCII file be converted to another encoding?
edit
With that directive, you explicitly tell Ruby that the file's content must be treated as ASCII, and Ruby says "OK, that's ASCII, if you say so."
This directive doesn't change anything about the actual encoding of the file. It could be utf-8, latin1 or whatever.
Vim doesn't understand that directive.
Vim chooses the encoding it uses for that file according to a number of rules you should read about in :h encoding, :h fileencoding and :h fileencodings.
Vim doesn't treat ASCII in a special "ASCII" way; it just handles it as the subset of utf-8 that it is.
So, before we go further, please verify:
the encoding of the file with something like $ file /path/to/file
the fileencoding Vim uses for that file with :set fileencoding
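To illustrate the point that an encoding label and the bytes on disk are separate things, here is a minimal Ruby sketch. The sample bytes and the helper name relabel are made up for the illustration; they are not from the question.

```ruby
# The same five bytes, viewed under two different encoding labels.
# "caf\xC3\xA9" is "café" encoded as UTF-8 (sample data, assumed for this sketch).
RAW_BYTES = "caf\xC3\xA9".b.freeze

# Hypothetical helper: force_encoding changes only the label, never the bytes.
def relabel(bytes, encoding)
  bytes.dup.force_encoding(encoding)
end

as_utf8  = relabel(RAW_BYTES, Encoding::UTF_8)
as_ascii = relabel(RAW_BYTES, Encoding::US_ASCII)

puts as_utf8.bytes == as_ascii.bytes  # the bytes are identical under both labels
puts as_utf8.valid_encoding?          # the bytes form valid UTF-8
puts as_ascii.valid_encoding?         # bytes above 0x7F are not valid US-ASCII
```

The magic comment plays the same role for a source file as force_encoding does here: it tells Ruby how to interpret the bytes, it does not convert them.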

Related

Issue with encoding of a character (not able to sed or .gsub)

I am dealing with some multilingual data (English and Arabic) in a JSON file with a weird character I am not able to parse. I am not sure what the character is. I tried getting the ASCII value via vim and this is what I got:
"38 0x26"
This is the status line I used in vim to get the value (http://vim.wikia.com/wiki/Showing_the_ASCII_value_of_the_current_character).
:set statusline=%<%f%h%m%r%=%b\ 0x%B\ \ %l,%c%V\ %P
This is how the character looks in vim -
I tried 'sed' and '.gsub' to replace this character, without success.
Is there a way I can replace this character (preferably with Ruby's .gsub) with '&' or something else?
Thanks
Try something like:
sed 's/[[:alnum:][:space:]\[\]{}()\.\*\\\/_(AllAsciiVariationYouWant)]/&/g;t
s/./?/g' YourFile
where (AllAsciiVariationYouWant) stands for every other character that you want to keep as is (written without the surrounding parentheses).
JSON is normally encoded in UTF-8 (Unicode). If you're seeing funky-looking characters in your file, it's probably because your editor is not treating Unicode characters properly. That could be caused by a terminal emulator that doesn't support Unicode, an incorrect $LANG setting, vim not being able to correctly determine the encoding of the file, or likely other reasons.
What terminal program are you using? What's your $LANG environment variable set to (echo $LANG)? If you're certain your terminal supports Unicode, try:
LANG=en_US.utf-8 vim your_file_here.json
(The above example assumes that U.S. English is appropriate for the file, which it may not be.)
As for replacing characters in the file, vim's substitution command can be used:
:%s/old text/new text/g
The above command will run the substitute command on all lines in the file (%), replacing every instance of "old text" with "new text". (The g at the end tells vim to replace every instance on a line, not just the first it finds.)
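Since the question asked for a .gsub solution, here is a rough Ruby sketch of the same idea: first list the code points to find out what the character really is, then replace it by code point. The helper names and the sample string are hypothetical, and U+00A0 (non-breaking space) merely stands in for the unknown character.

```ruby
# List each character with its Unicode code point so the odd one can be identified.
def codepoints(text)
  text.each_char.map { |ch| format("U+%04X %s", ch.ord, ch.inspect) }
end

# Replace every occurrence of the character with the given code point.
def replace_codepoint(text, codepoint, replacement)
  target = [codepoint].pack("U")     # build the character from its number (UTF-8 string)
  text.gsub(target) { replacement }  # block form: the replacement is used literally
end

sample = "Tom\u00A0Jerry"            # hypothetical input containing U+00A0
puts codepoints(sample)
puts replace_codepoint(sample, 0x00A0, "&")
```

This assumes the text has already been read as UTF-8; if the string carries a different encoding label, gsub may raise an encoding compatibility error.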

File encoding using ruby in windows

I have two files in a windows folder. Using the technique described here I found out that one file encoding is ANSI and another one is UTF-8.
However, if I open cmd or PowerShell and try to get the encoding in IRB with the following code, I always get "CP850":
File.open(file_name).read.encoding.name # => CP850
or
File.open(file_name).external_encoding.name # => CP850
Notepad++ also gives me that one file is ANSI and another is UTF-8.
How can I get the proper encoding using Ruby in Windows?
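It is impossible to tell with certainty what encoding a file uses, but it's possible to make an educated guess.
When you open a file without specifying an encoding, Ruby does not inspect the contents; it simply labels the data with the default external encoding (in your case CP850, the console codepage).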
See Detect encoding
and What is ANSI format? about ANSI
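One way to make such an educated guess in Ruby is to read the raw bytes and test them against a list of candidate encodings. This is only a sketch: the candidate list (UTF-8 first, then Windows-1252 as a stand-in for "ANSI") and the method names are assumptions, and a pure-ASCII file will be reported as UTF-8.

```ruby
# Candidate encodings to try, strictest first (an assumption; adjust to your data).
ENCODING_CANDIDATES = [Encoding::UTF_8, Encoding::Windows_1252].freeze

# Return the first candidate under which the bytes are a valid string, or nil.
def guess_encoding(bytes, candidates = ENCODING_CANDIDATES)
  candidates.find { |enc| bytes.dup.force_encoding(enc).valid_encoding? }
end

# Read the file in binary mode so Ruby applies no default encoding to it.
def guess_file_encoding(path)
  guess_encoding(File.binread(path))
end

puts guess_encoding("caf\xC3\xA9".b)  # bytes of "café" saved as UTF-8
puts guess_encoding("caf\xE9".b)      # bytes of "café" saved in a single-byte codepage
```

Single-byte encodings accept almost any byte sequence, so they only make sense as a fallback at the end of the list.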

Ruby UTF-8 Encoding doesn't work in Windows even with Magic Comment

I'm trying to run a file (ruby anyfile.rb in cmd prompt) with the following contents:
# encoding: utf-8
puts 'áá'
The following error occurs:
invalid multibyte char (UTF-8)
It seems that Ruby does not understand the magic comment...
EDIT: If I remove the "# encoding: utf-8" and run the command prompt like this:
ruby -E:UTF-8 encoding.rb
then it works - any ideas?
EDIT2: when i run:
ruby -e 'p [Encoding.default_external, Encoding.default_internal]'
I got [#<Encoding:CP850>, nil]. Maybe my Encoding.default_external is wrong?
Environment:
Windows XP (yes, I also hate windows + ruby)
ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
I believe this is a classic case of "if you hear hooves, think horses, not zebras".
The error message is telling you that you have a byte sequence in your file that is not a valid UTF-8 multibyte sequence.
It is definitely possible that
It seems that Ruby does not understand the magic comment...
as you say, and that up until now nobody noticed that magic comments don't actually work because you are the first person in the history of humankind to actually try to use magic comments. (Actually, this is not possible. If Ruby didn't understand magic comments, it would complain about an invalid ASCII character, since ASCII is the default encoding if no magic comment is present.)
Or, there actually is an invalid multibyte UTF-8 sequence in your file.
Which do you think is more likely? If I were you, I would check my file.
I've encountered similar issues from time to time with files that were not saved as UTF-8, even though the magic comment said so.
I've found that Ruby 1.9.2 had trouble properly converting UTF-8 to codepages 850 and 437, the defaults for the command prompt on Windows.
I recommend you upgrade to Ruby 1.9.3 (latest is patchlevel 125), which solves a lot of encoding issues, especially on Windows.
Also, verify that your saved file does not contain a Unicode BOM (so it is plain UTF-8) and is properly saved.
To check, you can switch the console codepage to Unicode (chcp 65001) and run type myscript.rb
You should see the accented letters correctly.
Last but not least, ensure your command prompt uses a TrueType font so extended characters are properly displayed.
Hope that helps.
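If you would rather check the file from Ruby than from the console, something along these lines could work. It is a sketch with made-up method names: it reads the raw bytes, reports whether a UTF-8 BOM is present, and reports whether the rest is valid UTF-8.

```ruby
# The three bytes of a UTF-8 byte order mark.
UTF8_BOM = "\xEF\xBB\xBF".b.freeze

# Inspect raw bytes: is there a BOM, and is the remainder valid UTF-8?
def check_utf8_source(bytes)
  bytes = bytes.b                                   # binary copy, caller's string untouched
  has_bom = bytes.start_with?(UTF8_BOM)
  body = has_bom ? bytes.byteslice(3..-1) : bytes   # skip the BOM before validating
  { bom: has_bom, valid_utf8: body.force_encoding(Encoding::UTF_8).valid_encoding? }
end

# Same check for a file on disk, read in binary mode.
def check_utf8_file(path)
  check_utf8_source(File.binread(path))
end

# 'áá' saved as UTF-8 versus 'áá' saved as Latin-1 / Windows-1252 (sample data).
p check_utf8_source("puts '\xC3\xA1\xC3\xA1'\n".b)
p check_utf8_source("puts '\xE1\xE1'\n".b)
```

A file that reports valid_utf8 as false while carrying a utf-8 magic comment would explain the "invalid multibyte char (UTF-8)" error.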
Try
# encoding: iso-8859-1
Not everything that's text is utf8.
Are you sure you selected 'UTF-8' from the Encoding dropdown when you saved the file in Notepad? I've just tried this on an XP machine and your code example worked for me.

How does the magic comment ( # Encoding: utf-8 ) in Ruby work?

How does the magic comment in Ruby work? I am talking about:
# Encoding: utf-8
Is this a preprocessing directive? Are there other uses of this type of construction?
An instruction to the Ruby interpreter placed at the top of the source file is called a magic comment. Before processing your source code, the interpreter reads this line and sets the proper encoding. It's quite common for interpreted languages, I believe. At least Python uses the same approach.
You can specify encoding in a number of different ways (some of them are recognized by editors):
# encoding: UTF-8
# coding: UTF-8
# -*- coding: UTF-8 -*-
You can read some interesting stuff about source encoding in this article.
The only thing I'm aware of that has similar construction is shebang, but it is related to Unix shells in general and is not Ruby-specific.
magic_comments defined in ruby/ruby
This magic comment tells Ruby the source encoding of the currently parsed file. As Ruby 1.9.x assumes US-ASCII by default, you have to tell the interpreter what encoding your source code is in if you use non-ASCII characters (like umlauts or accented characters).
The comment has to be on the first line of the file (or on the line below the shebang, if one is used) to be recognized.
There are other encoding settings. See this question for more information.
Since version 2.0, Ruby assumes UTF-8 encoding of the source file by default. As such, this magic encoding comment has become a rarer sight in the wild if you write your source code in UTF-8 anyway.
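A quick way to watch the magic comment take effect is to write tiny source files with different first lines and ask each one for its __ENCODING__. The sketch below does that with a temporary file and load; the helper name, the global variable and the probe line are invented for the demonstration, and the last result assumes Ruby 2.0 or later.

```ruby
require "tempfile"

# One line of Ruby that records the source encoding of the file it lives in.
ENCODING_PROBE = "$reported_encoding = __ENCODING__\n"

# Write the given source to a temporary .rb file, load it, and return
# the source encoding that file reported (hypothetical helper).
def source_encoding_of(source)
  Tempfile.create(["magic", ".rb"]) do |f|
    f.write(source)
    f.close
    $reported_encoding = nil
    load f.path
    $reported_encoding
  end
end

puts source_encoding_of("# encoding: iso-8859-1\n" + ENCODING_PROBE)
puts source_encoding_of("#!/usr/bin/env ruby\n# -*- coding: us-ascii -*-\n" + ENCODING_PROBE)
puts source_encoding_of(ENCODING_PROBE)  # no magic comment: the interpreter's default
```

The second call shows the comment being honoured on the line after a shebang.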
As you noted, magic comments are a special preprocessing construct. They must be defined at the top of the file (except, if there is already a unix shebang at the top). As of Ruby 2.3 there are three kinds of magic comments:
Encoding comment: See other answers. Must always be the first magic comment. Must be ASCII compatible. Sets the source encoding, so you will run into problems if the real encoding of the file does not match the specified encoding
frozen_string_literal: true: Freezes all string literals in the current file
warn_indent: true: Activates indentation warnings for the current file
More info: Magic Instructions
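The effect of frozen_string_literal can be observed in a similar way: generate a small file with a given header line and check whether a string literal in it comes out frozen. This is a sketch only; the helper name and the global variable are made up, and it needs Ruby 2.3 or later.

```ruby
require "tempfile"

# Load a throwaway file that starts with the given header line and report
# whether a string literal inside that file is frozen (hypothetical helper).
def literal_frozen_with?(header)
  Tempfile.create(["frozen", ".rb"]) do |f|
    f.write("#{header}\n$literal_frozen = 'abc'.frozen?\n")
    f.close
    $literal_frozen = nil
    load f.path
    $literal_frozen
  end
end

puts literal_frozen_with?("# frozen_string_literal: true")
puts literal_frozen_with?("# frozen_string_literal: false")
```

Like the encoding comment, this setting applies per file, which is why the check runs in a separately loaded file instead of the current one.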
While this isn't exactly an answer for your question, if you want to read more about encodings, how they work, what kinds of problems crop up with them: the great Yehuda Katz wrote about encodings as they were being worked out in Ruby 1.9 and beyond:
Ruby 1.9 Encodings: A Primer and the Solution for Rails
Encodings, Unabridged

Reading ASCII-encoded files with Ruby 1.9 in a UTF-8 environment

I just upgraded from Ruby 1.8 to 1.9, and most of my text processing scripts now fail with the error invalid byte sequence in UTF-8. I need to either strip out the invalid characters or specify that Ruby should use ASCII encoding instead (or whatever encoding the C stdio functions write, which is how the files were produced) -- how would I go about doing either of those things?
Preferably the latter, because (as near as I can tell) there's nothing wrong with the files on disk -- if there are weird, invalid characters they don't appear in my editor...
What's your locale set to in the shell? In Linux-based systems you can check this by running the locale command and change it by e.g.
$ export LANG=en_US
My guess is that you are using locale settings which have UTF-8 encoding and this is causing Ruby to assume that the text files were created according to utf-8 encoding rules. You can see this by trying
$ LANG=en_GB ruby -e 'warn "foo".encoding.name'
US-ASCII
$ LANG=en_GB.UTF-8 ruby -e 'warn "foo".encoding.name'
UTF-8
For a more general treatment of how string encoding has changed in Ruby 1.9 I thoroughly recommend
http://blog.grayproductions.net/articles/ruby_19s_string
(code examples assume bash or similar shell - C-shell derivatives are different)
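Independently of the locale, both options from the question can be done in Ruby itself: name the file's encoding when opening it, or keep UTF-8 and replace the bytes that are invalid. The sketch below assumes the files are really ISO-8859-1, which is a guess and not something the question states; the method names are invented, and String#scrub needs Ruby 2.1 or later.

```ruby
require "tempfile"

# Read a file whose bytes are in source_encoding and return UTF-8 text.
# "r:EXTERNAL:INTERNAL" makes Ruby transcode while reading.
def read_as_utf8(path, source_encoding = "ISO-8859-1")
  File.open(path, "r:#{source_encoding}:UTF-8", &:read)
end

# Alternative: keep the UTF-8 label and replace any invalid bytes.
def strip_invalid_utf8(bytes, replacement = "?")
  bytes.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

# Demo with a temporary file holding "naïve" in ISO-8859-1 (sample data).
Tempfile.create("legacy") do |f|
  f.binmode
  f.write("na\xEFve\n".b)
  f.close
  puts read_as_utf8(f.path)
end
puts strip_invalid_utf8("na\xEFve".b)
```

Naming the real encoding keeps the accented characters; scrubbing loses them, so it is better kept as a last resort.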

Resources