Printing a CP850-encoded string with Ruby (IRB) - ruby

I want to open a text file (test.txt) that contains Arabic text (its encoding is CP850), then print its contents to STDOUT:
# coding: CP850
STDOUT.set_encoding(Encoding::CP850) # not sure if it's necessary
open('G:/test.txt', ?r) { |f|
  f.read.each_char { |c| print c }
  # or puts f.read
}
gets
but it does not print the Arabic characters; the output is just symbols and random characters.
Using Ruby 2.2.3

Change the encoding of the file to UTF-8.
I don't know how this is accomplished in Ruby, but in Python 3 (which newer Django versions use), it's:
open('filename.txt', 'w', encoding='utf-8')
If you're using Python 2 it will be slightly more difficult. If so, it's worth upgrading to 3, because it handles Unicode natively and makes doing anything with Arabic a lot easier.
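For the Ruby side, a minimal sketch of the same idea, assuming the file really is CP850 (the path comes from the question):
# Read the file as CP850, then transcode to UTF-8 before printing.
text = File.read('G:/test.txt', encoding: 'CP850')
puts text.encode('UTF-8')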

Related

macOS Automator's Ruby defaults to ASCII despite being >= 1.9

I am trying to access the text in the macOS clipboard from within Automator using a Ruby script. This script calls macOS's internal Ruby (/usr/bin/ruby). After running into much trouble with invalid byte sequence errors, I noticed that Automator's Ruby defaults to ASCII instead of UTF-8, even though UTF-8 has been the default in modern Ruby for years.
So, running the following:
require 'clipboard'
puts(Clipboard.paste.encoding)
always yields "ASCII", while running the same Ruby interpreter from the command line to run the same script and to paste the same pieces of text always yields "UTF-8".
This becomes an issue when I copy multibyte characters, like accented characters (e.g. ê). For instance, if I copy the following text:
Bourdieu, P., & Passeron, J.-C. (1970). La reproduction: éléments pour une théorie du système d’enseignement. Ed. de Minuit.
And then run:
require 'clipboard'
puts(Clipboard.paste)
I get nothing in Automator, while I get a copy of the original text on the command line.
If I try to transform the text in any way, I get an error. Let's say I run the following:
require 'clipboard'
puts(Clipboard.paste.gsub(/\r/,""))
In response, I will receive:
-e:2:in `gsub': invalid byte sequence in US-ASCII (ArgumentError)
from -e:2:in `<main>'
How can I avoid this and make sure what I get from the clipboard is already converted into proper UTF-8?
I have tried the encode and force_encoding methods, as well as a variety of combinations of # encoding: UTF-8, Encoding.default_external='utf-8' and Encoding.default_internal='utf-8', but it seems there are corrupt characters that hinder the conversion, so no success in the end.
Is there anything I am ignoring here, or any combination I haven't tried?
Notes:
It is Automator that calls the interpreter, not me, so I can't modify Automator's call to add switches or options.
string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '') works, but the sanitization comes at the cost of chopping off the multibyte characters, which is obviously not the intended behaviour here.
I found that in macOS Mojave 10.14.6, starting the Automator 'Run Shell Script' with # coding: UTF-8 solved the problem. I am not sure whether the #!/usr/bin/ruby line is useful or necessary, but I include it. You can test by running this code with and without the # coding: UTF-8 line:
#!/usr/bin/ruby
# coding: UTF-8
test_s = "will print ✪"
puts test_s
Credit for this answer goes to a thread on discussions.apple.com
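If the relabeling has to happen in code instead, the usual pattern is force_encoding (which relabels without transcoding) followed by a scrub of any genuinely invalid bytes. A sketch under that assumption; it is not guaranteed to behave identically inside Automator's environment:
require 'clipboard'

# Relabel the mislabeled US-ASCII string as UTF-8 (no byte conversion),
# then drop any bytes that are not valid UTF-8 (String#scrub, Ruby >= 2.1).
text = Clipboard.paste.force_encoding('UTF-8')
text = text.scrub('') unless text.valid_encoding?
puts text.gsub(/\r/, "")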

Print Unicode character to RubyMine console

So I want to print a Hebrew character, Unicode value \u{fb20}, to the RubyMine console, the common output pane where hello world would typically go if you were to just run puts 'hello world'.
The code I am trying to run is just puts "\u{fb20}", nothing crazy.
RubyMine is set to the default system encoding at both the project and IDE levels, and I have tried setting the encoding to UTF-8 and UTF-16, but none of these settings will simply print this character correctly to the console.
I get the wrong glyphs printed to the console at the moment, not the right character. The right character is ﬠ.
Try encoding the character in Ruby, and then printing it.
symbol = "\ufb20"
puts symbol.encode('utf-16')
And change RubyMine to be encoded in UTF-16.
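As an aside, a quick diagnostic sketch (standard library only) to see which encodings Ruby itself believes are in effect, which you can compare against RubyMine's console settings:
puts Encoding.default_external          # what Ruby assumes for external I/O
puts Encoding.default_internal.inspect  # nil unless explicitly set
puts STDOUT.external_encoding.inspect   # the console stream's encoding, if any
puts "\u{fb20}".encoding                # => UTF-8 (the default source encoding)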

Replacing "\xe9" character from a unicode string in Python 3

Using SublimeText 2.0.2 with Python 3.4.2, I get a webpage with urllib:
response = urllib.request.urlopen(req)
pagehtml = response.read()
Print => qualit\xe9">\r\n\t\t<META HTTP
I get a "\xe9" character within the unicode string!
The header of the pagehtml tells me it's encoded in ISO-8859-1
(Content-Type: text/html;charset=ISO-8859-1). But if I decode it with ISO-8859-1 and then encode it in utf-8, it only gets worse...
resultat = pagehtml.decode('ISO-8859-1').encode('utf-8')
Print => qualit\xc3\xa9">\r\n\t\t<META HTTP
How can I replace all the "\xe9"... characters with their corresponding letters ("é"...)?
Edit 1
I'm getting a UnicodeEncodeError (that's why I was encoding in 'utf-8')!
I should mention I'm running my code within SublimeText 2.0.2. That seems to be the problem.
Edit 2
It works fine in IDLE (Python 3.4.2) and in the OSX terminal (Python 2.5), but it doesn't work in SublimeText 2.0.2 (with Python 3.4.2)... => That seems to be a problem with the SublimeText console (output window) and not with my code.
I'm going to look at the PYTHONIOENCODING environment variable as suggested by J.F. Sebastian.
It seems I should be able to set it in the sublime-build file.
Edit 3 - Solution
I just added "env": {"PYTHONIOENCODING": "UTF-8"} in the sublime-build file.
Done. Thanks everyone ;-)
The response is an encoded byte string. Just decode it:
>>> pagehtml = b'qualit\xe9'
>>> print(pagehtml)
b'qualit\xe9'
>>> print(pagehtml.decode('ISO-8859-1'))
qualité
I am pretty sure you do not actually have a problem, except for understanding bytes versus unicode. Things are working as they should. pagehtml is encoded bytes. (I confirmed this with req = 'http://python.org' in your first line.) When bytes are displayed, those which can be interpreted as printable ascii encodings are printed as such and other bytes are printed with hex escapes. b'\xe9' is the hex escape encoding of the single-byte ISO-8859-1 encoding of é and b'\xc3\xa9' is the hex escape encoding of its double-byte utf-8 encoding.
>>> b = b"qualit\xe9"
>>> u = b.decode('ISO-8859-1')
>>> u
'qualité'
>>> b2 = u.encode()
>>> b2
b'qualit\xc3\xa9'
>>> len(b) == 7 and len(b2) == 8
True
>>> b[6]
233
>>> b2[6], b2[7]
(195, 169)
So pageuni = pagehtml.decode('ISO-8859-1') gives you the page as unicode. This decoding does the replacing that you asked for.
I'm getting a UnicodeEncodeError (that's why I was encoding in 'utf-8')! I should mention I'm running my code within SublimeText. That seems to be the problem. Any solution?
Don't encode manually; print Unicode strings instead.
For Unix
Set PYTHONIOENCODING=utf-8 if the output is redirected or if the locale (LANGUAGE, LC_ALL, LC_CTYPE, LANG) is not configured (it defaults to C, i.e. ASCII).
For Windows
If the content can be represented using the console codepage then set PYTHONIOENCODING=your_console_cp envvar e.g., PYTHONIOENCODING=cp1252 (set it to cp1252 only if it is indeed the encoding that your console uses, run chcp to check). Or use whatever encoding SublimeText can show correctly if it doesn't open a console window to run Python scripts.
Unless the output is redirected, you don't need to set the PYTHONIOENCODING envvar if you run your script from the command line directly.
Otherwise (to support characters that can't be represented in the console encoding), install the win_unicode_console package and either run your script using python3 -mrun your_script.py or put this at the top of your script:
import win_unicode_console
win_unicode_console.enable()
It uses Win32 API such as WriteConsoleW() to print to the console. You still need to configure correct fonts to see arbitrary Unicode text in the console.

How does the magic comment (# Encoding: utf-8) in Ruby work?

How does the magic comment in Ruby work? I am talking about:
# Encoding: utf-8
Is this a preprocessing directive? Are there other uses of this type of construction?
An interpreter instruction at the top of a Ruby source file is called a magic comment. Before processing your source code, the interpreter reads this line and sets the proper encoding. This is quite common among interpreted languages, I believe; at least Python uses the same approach.
You can specify encoding in a number of different ways (some of them are recognized by editors):
# encoding: UTF-8
# coding: UTF-8
# -*- coding: UTF-8 -*-
You can read some interesting stuff about source encoding in this article.
The only thing I'm aware of with a similar construction is the shebang, but that is related to Unix shells in general and is not Ruby-specific.
magic_comments defined in ruby/ruby
This magic comment tells Ruby the source encoding of the currently parsed file. As Ruby 1.9.x assumes US-ASCII by default, you have to tell the interpreter what encoding your source code is in if you use non-ASCII characters (like umlauts or accented characters).
The comment has to be on the first line of the file (or below the shebang, if one is used) to be recognized.
There are other encoding settings. See this question for more information.
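As a concrete illustration, here is a hypothetical file, umlaut.rb:
# encoding: UTF-8
# Without the line above, Ruby 1.9 aborts at parse time with
# "invalid multibyte char (US-ASCII)" when it reaches the literal below.
puts "über"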
Since version 2.0, Ruby assumes UTF-8 encoding of the source file by default. As such, this magic encoding comment has become a rarer sight in the wild if you write your source code in UTF-8 anyway.
As you noted, magic comments are a special preprocessing construct. They must be defined at the top of the file (except when there is a Unix shebang at the top, in which case they go directly below it). As of Ruby 2.3 there are three kinds of magic comments:
Encoding comment: See the other answers. Must always be the first magic comment. Must be ASCII-compatible. Sets the source encoding, so you will run into problems if the real encoding of the file does not match the specified encoding.
frozen_string_literal: true: Freezes all string literals in the current file (see the sketch below)
warn_indent: true: Activates indentation warnings for the current file
More info: Magic Instructions
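A minimal sketch of the frozen_string_literal comment in action (nothing here is project-specific; behavior as of Ruby 2.3+):
# encoding: UTF-8
# frozen_string_literal: true

s = "héllo"                    # parsed per the encoding comment above
puts s.encoding                # => UTF-8
puts s.frozen?                 # => true, because of frozen_string_literal
puts "mutable".dup.frozen?     # => false; dup returns an unfrozen copy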
While this isn't exactly an answer to your question, if you want to read more about encodings, how they work, and what kinds of problems crop up with them: the great Yehuda Katz wrote about encodings as they were being worked out in Ruby 1.9 and beyond:
Ruby 1.9 Encodings: A Primer and the Solution for Rails
Encodings, Unabridged

Reading ASCII-encoded files with Ruby 1.9 in a UTF-8 environment

I just upgraded from Ruby 1.8 to 1.9, and most of my text processing scripts now fail with the error invalid byte sequence in UTF-8. I need to either strip out the invalid characters or specify that Ruby should use ASCII encoding instead (or whatever encoding the C stdio functions write, which is how the files were produced) -- how would I go about doing either of those things?
Preferably the latter, because (as near as I can tell) there's nothing wrong with the files on disk -- if there are weird, invalid characters they don't appear in my editor...
What's your locale set to in the shell? On Linux-based systems you can check this by running the locale command, and change it with e.g.
$ export LANG=en_US
My guess is that you are using locale settings which have UTF-8 encoding and this is causing Ruby to assume that the text files were created according to utf-8 encoding rules. You can see this by trying
$ LANG=en_GB ruby -e 'warn "foo".encoding.name'
US-ASCII
$ LANG=en_GB.UTF-8 ruby -e 'warn "foo".encoding.name'
UTF-8
For a more general treatment of how string encoding has changed in Ruby 1.9 I thoroughly recommend
http://blog.grayproductions.net/articles/ruby_19s_string
(code examples assume bash or similar shell - C-shell derivatives are different)
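As for the two in-code alternatives the asker mentions, a rough sketch (the file name is illustrative):
# 1. Declare the file's actual encoding when reading it:
text = File.read('data.txt', encoding: 'US-ASCII')

# 2. Or strip invalid bytes by transcoding through binary, the same
#    pattern used in the Automator answer above:
raw = File.read('data.txt')
text = raw.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')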
