Ruby not respecting # encoding specification - ruby

Given the following script (it must be in its own file):
#!/usr/bin/env ruby

# encoding: binary
s = "\xe1\xe7\xe6\x07\x00\x01\x00"
puts s.encoding
The output of this is "UTF-8". Why isn't it binary (ASCII-8BIT)?

Because the # encoding: binary must be on the line immediately following the #!/usr/bin/env ruby. Alternatively, if there is no #!/usr/bin/env ruby line, it must then be on the first line of the file.
When the blank line is removed (i.e. the encoding specification is on the second line):
#!/usr/bin/env ruby
# encoding: binary
s = "\xe1\xe7\xe6\x07\x00\x01\x00"
puts s.encoding
...the output is "ASCII-8BIT".
Here is a link to the Ruby documentation regarding magic comments such as encoding (thanks to Stefan who mentioned this in a comment):
https://ruby-doc.org/core-3.1.2/doc/syntax/comments_rdoc.html#label-Magic+Comments

How to consistently get ASCII-8BIT encoding in Ruby?

Ruby seems a bit inconsistent in its handling of encodings:
irb -E BINARY:BINARY
irb(main):001:0> "hi".encoding
=> #<Encoding:ASCII-8BIT>
So that "works". Now what about plain ruby?
ruby -E BINARY:BINARY -e 'p "hi".encoding'
#<Encoding:US-ASCII>
That doesn't work. Furthermore, when p "hi".encoding is placed in x.rb, the output of ruby -E BINARY:BINARY x.rb is:
#<Encoding:UTF-8>
How do I get ASCII-8BIT literals when invoking ruby?
String literals have the same encoding as the script encoding. Instead of 'hi'.encoding you can use the keyword __ENCODING__ to retrieve it. The script encoding can be changed by putting a magic comment at the beginning of your script:
# encoding: ASCII-8BIT
p __ENCODING__ # => #<Encoding:ASCII-8BIT>
The -E flag of ruby doesn't affect the encoding of string literals. It's only for changing the external and internal encoding. You can read about the various types of encodings and their purpose in the Encoding documentation.
Back to the encoding of string literals: Even though irb claims its -E flag is the "Same as ruby -E", that isn't true. It uses the external encoding as the script encoding. irb already has several limitations; this could be one of them. It's at least a documentation bug.
Besides the magic comment there's another discouraged way to set the script encoding via ruby: the -K flag and the n (none) kcode. ruby -Kne "p __ENCODING__" should print #<Encoding:ASCII-8BIT>. However -K also changes the external encoding.
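The distinction between the two settings can be checked directly. A minimal sketch, run as a plain script with no -E flag:

```ruby
# The script encoding (used for string literals) and the default external
# encoding (used for IO) are separate settings. -E changes only the
# latter; a magic comment changes the former.
p __ENCODING__                  # the script encoding
p Encoding.default_external     # the external encoding; BINARY under -E BINARY
p "hi".encoding == __ENCODING__ # literals always follow the script encoding
```

The last line is true regardless of what -E is set to, which is exactly why the flag alone can't produce ASCII-8BIT literals.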

How can I read accented characters in ruby shoes?

I can't do a very simple thing in ruby shoes.
The code is the following:
#encoding:UTF-8
Shoes.app do
  @var = File.readlines('temp.txt')
  para @var
end
If in the txt there is a word with an accented character, the code doesn't work. Shoes says "not a valid UTF8 string"
How can I solve this problem?
edit: the result is the same if I remove the line #encoding:UTF-8
The line:
#encoding:UTF-8
only governs the characters you use for your actual code. And if you are using a recent version of ruby, it does nothing because by default ruby code is assumed to be written with UTF-8 characters.
edit: the result is the same if I remove the line #encoding:UTF-8
Yes, that's because you are probably using a recent version of ruby, and that is the default.
You have to know the encoding of the file you want to read. If the file is encoded in ISO-8859-1, you can do:
@var = File.readlines('temp.txt', encoding: "ISO-8859-1")
What characters are in your file? What do you get when you do this:
$ ruby -e"puts Encoding.default_external"
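To read and transcode in one step, File.readlines also accepts an "EXTERNAL:INTERNAL" encoding pair. A self-contained sketch (the file name and contents are made up for illustration):

```ruby
# Write a small ISO-8859-1 file (0xE9 is "é" in that encoding), then
# read it back transcoded to UTF-8 via an "external:internal" pair.
File.binwrite('temp.txt', "caf\xE9\n")
lines = File.readlines('temp.txt', encoding: 'ISO-8859-1:UTF-8')
puts lines.first          # => café
puts lines.first.encoding # => UTF-8
```

This way the strings arrive already in UTF-8, which is what Shoes expects.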

Can I programmatically convert "I’d" to "I’d" using Ruby?

I can't seem to find the right combination of String#encode shenanigans.
I think I'd got confused on this one so I'll post this here to hopefully help anyone else who is similarly confused.
I was trying to do my encoding in an irb session, which gives you
irb(main):002:0> 'I’d'.force_encoding('UTF-8')
=> "I’d"
And if you try using encode instead of force_encoding then you get
irb(main):001:0> 'I’d'.encode('UTF-8')
=> "I’d"
This is with irb set to use an output and input encoding of UTF-8. In my case, converting that string the way I want involves telling Ruby that the source string is in Windows-1252 encoding. You can do this with the -E argument, in which you specify `inputencoding:outputencoding`, and then you get this:
$ irb -EWindows-1252:UTF-8
irb(main):001:0> 'I’d'
=> "I\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2d"
That looks wrong unless you pipe it out, which gives this
$ ruby -E Windows-1252:UTF-8 -e "puts 'I’d'"
I’d
Hurrah. I'm not sure about why Ruby showed it as "I\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2d" (something to do with the code page of the terminal?) so if anyone can comment with further insight that would be great.
I expect your script is using the encoding cp1252 (Windows-1252) and you have Ruby >= 1.9.
Then you can use force_encoding:
# encoding: cp1252
# works also with encoding: binary
source = 'I’d'
puts source.force_encoding('utf-8') #-> I’d
If my assumptions are wrong: which encoding do you use, and which Ruby version?
A little background:
Problems with encoding are difficult to analyse. There may be conflicts between:
Encoding of the source code (That's defined by the editor).
Expected encoding of the source code (that's defined with #encoding on the first line). This is used by ruby.
Encoding of the string (see e.g. section String encodings in http://nuclearsquid.com/writings/ruby-1-9-encodings/ )
Encoding of the output shell

Encoding::UndefinedConversionError when using open-uri

When I do this:
require 'open-uri'
response = open('some-html-page-url-here')
response.read
On a certain URL I get the following error (due to the wrong encoding of the returned page?):
Encoding::UndefinedConversionError: U+00A0 from UTF-8 to US-ASCII
Any way around this to still get the html content?
In the introduction to the open-uri module, the docs say this,
It is possible to open an http, https or ftp URL as though it were a file
And if you know anything about reading files, then you know you need the encoding of the file you are trying to read, so that you can tell Ruby how to read it (i.e. how many bytes each character occupies).
In the first code example in the docs, there is this:
open("http://www.ruby-lang.org/en") {|f|
  f.each_line {|line| p line}
  p f.base_uri # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
  p f.content_type # "text/html"
  p f.charset # "iso-8859-1"
  p f.content_encoding # []
  p f.last_modified # Thu Dec 05 02:45:02 UTC 2002
}
So if you don't know the encoding of the "file" you are trying to read, you can get it with f.charset. If that encoding differs from your default external encoding, you will most likely get an error. The default external encoding is the encoding Ruby uses to read from external sources. You can check what it is set to like this:
The default external Encoding is pulled from your environment...Have a
look:
$ echo $LC_CTYPE
en_US.UTF-8
or
$ ruby -e 'puts Encoding.default_external.name'
UTF-8
http://graysoftinc.com/character-encodings/ruby-19s-three-default-encodings
On Mac OSX, I actually have to do the following to see the default external encoding:
$ echo $LANG
You can set your default external encoding with the method Encoding.default_external=(), so you might want to try something like this:
open('some_url_here') do |f|
  Encoding.default_external = f.charset
  html = f.read
end
Setting an IO object to binmode, like you have done, tells ruby that the encoding of the file is BINARY (or ruby's confusing synonym ASCII-8BIT), which means you are telling ruby that each character in the file takes up one byte. In your case, you are telling ruby to read the character U+00A0, whose UTF-8 representation takes up two bytes 0xC2 0xA0, as two characters instead of just one character, so you have eliminated your error, but you have produced two junk characters instead of the original character.
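That byte-versus-character effect can be seen without any IO at all; a minimal sketch:

```ruby
# U+00A0 is one character but two bytes (0xC2 0xA0) in UTF-8.
# Relabeling those bytes as BINARY turns it into two characters.
nbsp = "\u00A0"
raw  = nbsp.dup.force_encoding('ASCII-8BIT')
p nbsp.bytes  # => [194, 160]
p nbsp.length # => 1
p raw.length  # => 2
```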
Doing a response.binmode before the response.read stops the error from happening.
Had the same issue, will add my solution here:
After reading the open-uri documentation further, it turns out you could set the encoding of the io before reading using the set_encoding method, like this:
result = open('some-page-uri') do |io|
  io.set_encoding(Encoding.default_external)
  io.read
end
Hope it helps!
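The set_encoding pattern can be exercised offline with a StringIO standing in for the HTTP response (the bytes and charset here are assumptions for illustration):

```ruby
require 'stringio'

# Simulate an ISO-8859-1 response body: "café" plus a no-break space
# (0xA0), the very character from the original error.
io = StringIO.new("caf\xE9\xA0ole".b)
io.set_encoding('ISO-8859-1')   # what f.charset would have reported
text = io.read.encode('UTF-8')  # transcode instead of mis-reading
p text.encoding # => #<Encoding:UTF-8>
```

Tagging the stream with its real charset first, then transcoding to UTF-8, avoids both the UndefinedConversionError and the junk bytes that binmode produces.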

unable to convert array data to json when '¿' is there

this is my ruby code
require 'json'
a = Array.new
value = "¿value"
data = value.gsub('¿', '-')
a[0] = data
puts a
puts "json is"
puts jsondata = a.to_json
getting following error
C:\Ruby193>new.rb
C:/Ruby193/New.rb:3: invalid multibyte char (US-ASCII)
C:/Ruby193/New.rb:3: syntax error, unexpected tIDENTIFIER, expecting $end
value="┐value"
^
That's not a JSON problem — Ruby can't decode your source because it contains a multibyte character. By default, Ruby tries to decode files as US-ASCII, but ¿ isn't representable in US-ASCII, so it fails. The solution is to provide a magic comment as described in the documentation. Assuming your source file's encoding is UTF-8, you can tell Ruby that like so:
# encoding: UTF-8
# ...
value = "¿value"
# ...
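Putting it together, a corrected version of the original script with the magic comment added (harmless on modern Rubies, where UTF-8 is already the default source encoding):

```ruby
# encoding: UTF-8
require 'json'

a = Array.new
value = "¿value"
data = value.gsub('¿', '-')  # replace ¿ with -
a[0] = data
puts a
puts "json is"
puts jsondata = a.to_json    # => ["-value"]
```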
With an editor or an IDE, the solution of icktoofay (# encoding: UTF-8 in the first line) is perfect.
In a shell with IRB or PRY it is difficult to find a working configuration. But there is a workaround that at least worked for my encoding problem which was to enter German umlaut characters.
Workaround for PRY:
In PRY I use the edit command to edit the contents of the input buffer
as described in this pry wiki page.
This opens an external editor (you can configure which editor to use), and the editor accepts special characters that cannot be entered in PRY directly.