Printing accented characters in Ruby

I would like to display this: Bonne fête
I have # encoding: UTF-8 on the very first line of my script. However, the result I get is this:
Bonne f\u00EAte
Is there something I can do to display the correct character (ê instead of \u00EA)?
ps:
I am using: Aptana Studio 3 as IDE, Ruby 1.9.3, and Windows 7
Windows>Preferences>Workspace>Text file encoding is set to Other: UTF-8
I have tried # encoding: ISO-8859-1 with no luck.
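One possible explanation, which the post does not confirm, is that the string was printed with p (or inspect) while Ruby's default external encoding was not UTF-8. In that situation inspect escapes non-ASCII characters as \uXXXX, whereas puts writes the string's bytes unchanged. A minimal sketch of that behaviour follows; inspect_under is a hypothetical helper written for this illustration, and the escape "\u00EA" stands for ê so the example does not depend on how the file is saved.

```ruby
# Hypothetical helper: returns str.inspect as it would look under the given
# default external encoding, then restores the previous settings.
def inspect_under(str, external)
  previous_external = Encoding.default_external
  previous_internal = Encoding.default_internal
  previous_verbose  = $VERBOSE
  $VERBOSE = nil # silence the warning Ruby may print when defaults change
  Encoding.default_external = external
  Encoding.default_internal = nil
  str.inspect
ensure
  Encoding.default_external = previous_external
  Encoding.default_internal = previous_internal
  $VERBOSE = previous_verbose
end

greeting = "Bonne f\u00EAte" # "Bonne fête"

escaped  = inspect_under(greeting, Encoding::US_ASCII) # ê is escaped
readable = inspect_under(greeting, Encoding::UTF_8)    # ê is kept as-is

puts escaped
puts readable
```

If this is the cause, checking Encoding.default_external in the IDE's console and printing with puts instead of p would be reasonable first steps.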

Related

macOS Automator's Ruby defaults to ASCII despite being >= 1.9

I am trying to access the text on the macOS clipboard from within Automator using a Ruby script. The script runs under macOS's built-in Ruby (/usr/bin/ruby). After running into a lot of trouble with unidentified character sequence errors, I noticed that Automator's Ruby defaults to ASCII instead of UTF-8, even though UTF-8 has been the default in modern Ruby for years.
So, running the following:
require 'clipboard'
puts(Clipboard.paste.encoding)
always yields "ASCII", while running the same Ruby interpreter from the command line to run the same script and to paste the same pieces of text always yields "UTF-8".
This becomes an issue when I copy multibyte characters like the accented characters (e.g. ê). For instance if I copy the following text:
Bourdieu, P., & Passeron, J.-C. (1970). La reproduction: éléments pour une théorie du système d’enseignement. Ed. de Minuit.
And then run:
require 'clipboard'
puts(Clipboard.paste)
I get nothing in Automator while I get a copy of the original text on the command line.
If I try to transform the text in any way, I get an error. Let's say I run the following:
require 'clipboard'
puts(Clipboard.paste.gsub(/\r/,""))
In response, I will receive:
-e:2:in `gsub': invalid byte sequence in US-ASCII (ArgumentError)
from -e:2:in `<main>'
How can I avoid this and make sure what I get from the clipboard is already converted into proper UTF-8?
I have tried encode and force_encoding methods, as well as a variety of combinations of # encoding: UTF-8, Encoding.default_external='utf-8' and Encoding.default_internal='utf-8', but it seems there are corrupt characters that hinder the conversion, so no success in the end.
Is there anything I am ignoring here, or any combination I haven't tried?
Notes:
It is Automator that calls the interpreter, and not me. So, I can't modify Automator's call to add switches and modify options.
string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '') works, but the sanitization comes at the cost of chopping off the multibyte characters, which is obviously not the intended behaviour here.
I found that in macOS Mojave 10.14.6, starting the Automator 'Run Shell Script' with # coding: UTF-8 solved the problem. Not sure the #!/usr/bin/ruby is useful or necessary, but I include it. You can test by using this code with and without the # coding: UTF-8:
#!/usr/bin/ruby
# coding: UTF-8
test_s = "will print ✪"
puts test_s
Credit for the answer is from here: discussions.apple.com
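The difference between relabelling bytes and transcoding them can be sketched as follows. This is only an illustration under an assumption: the pasted variable simulates clipboard text whose bytes are valid UTF-8 but which arrives labelled US-ASCII, which may or may not be exactly what Clipboard.paste returns inside Automator.

```ruby
# Simulated clipboard content (assumption): UTF-8 bytes labelled US-ASCII.
pasted = "th\u00E9orie\r\n".dup.force_encoding(Encoding::US_ASCII) # "théorie"

puts pasted.valid_encoding? # the bytes 0xC3 0xA9 are not valid US-ASCII

# Regexp operations on the mislabelled string fail, as in the question.
error_message =
  begin
    pasted.gsub(/\r/, "")
    nil
  rescue ArgumentError => e
    e.message
  end
puts error_message.inspect

# force_encoding only changes the label; no byte is modified.
relabelled = pasted.dup.force_encoding(Encoding::UTF_8)
cleaned    = relabelled.gsub(/\r/, "")
puts cleaned

# The "binary" conversion from the notes treats every non-ASCII byte as
# undefined and replaces it with "", so the accented letter disappears.
stripped = pasted.encode("UTF-8", "binary",
                         invalid: :replace, undef: :replace, replace: "")
puts stripped
```

If the clipboard bytes are genuinely corrupt rather than mislabelled, relabelling will not produce a valid string, and valid_encoding? on the result is a quick way to tell the two cases apart.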

Encoding files with utf-8 using Ruby

I can't get Ruby to use UTF-8 when writing files.
A script like this
# encoding: UTF-8
puts "ą"
works fine,
but a script like this
# encoding: UTF-8
File.open("test.txt", "w:UTF-8") do |f|
f.write "ą"
end
causes the following error to appear in the console:
task.rb: 4: invalid multibyte char (UTF-8)
even though every setting that turns on UTF-8 encoding is applied.
I'm using Ruby 2.0.0-p451 from RubyInstaller for Windows.
OK, everything works fine now. I just changed the encoding in Notepad++ from ANSI to UTF-8.
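Why the ANSI save broke the script can be sketched like this. The example assumes that "ANSI" meant Windows-1250 on that machine (the post does not say which code page was in use), and utf8_file? is a hypothetical helper, not part of any library.

```ruby
require "tempfile"

# Hypothetical helper: true when the file's raw bytes form valid UTF-8.
def utf8_file?(path)
  File.binread(path).force_encoding(Encoding::UTF_8).valid_encoding?
end

letter = "\u0105" # "ą"

# Assumption: the editor's "ANSI" is Windows-1250, where "ą" is one byte.
ansi_bytes = letter.encode("Windows-1250")
utf8_bytes = letter.encode("UTF-8")

results = {}
{ ansi: ansi_bytes, utf8: utf8_bytes }.each do |label, bytes|
  Tempfile.create("encoding-check") do |file|
    File.binwrite(file.path, bytes) # write the bytes without any conversion
    results[label] = utf8_file?(file.path)
  end
end

p ansi_bytes.bytes # a single byte that is not valid UTF-8 on its own
p results
```

A magic comment only tells Ruby how to read the file; it does not change the bytes the editor wrote, so the file has to be saved as UTF-8 as well.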

Encode a movie with Unicode filename in Windows using Popen

I want to encode a movie through IO.popen with Ruby (1.9.3) on Windows 7.
If the file name contains only ASCII characters, encoding proceeds normally.
But with a Unicode file name, the script returns a "No such file or directory" error.
The code looks like this:
#-*- encoding: utf-8 -*-
command = "ffmpeg -i ü.rm"
IO.popen(command){|pipe|
pipe.each{|line|
p line
}
}
I couldn't work out whether the problem is caused by ffmpeg or by Ruby.
How can I fix this problem?
Windows doesn't use UTF-8 encoding. Ruby sends the byte sequence of the Unicode filename to the file system directly, and of course the file system won't recognize UTF-8 sequences. It seems that newer versions of Ruby have fixed this issue. (I'm not sure. I'm using 1.9.2p290 and it's still there.)
You need to convert the UTF-8 filename to the encoding your Windows installation uses.
# coding: utf-8
code_page = "cp#{`chcp`.chomp[/\d+$/]}" # detect code page automatically.
command = "ffmpeg -i ü.rm".encode(code_page)
IO.popen(command) do |pipe|
pipe.each do |line|
p line
end
end
Another way is to save your script in the same encoding that Windows uses, and don't forget to update the encoding declaration. For example, I'm using Simplified Chinese Windows, which uses GBK (CP936) as its default encoding:
# coding: GBK
# save this file in GBK
command = "ffmpeg -i ü.rm"
IO.popen(command) do |pipe|
pipe.each do |line|
p line
end
end
BTW, by convention, it is suggested to use do...end for multi-line code blocks rather than {...}, unless in special cases.
UPDATE:
The underlying NTFS filesystem uses UTF-16 for file names, so 가 is a valid filename character. However, GBK cannot encode 가, and neither can CP932 on your Japanese Windows. So you cannot pass that specific filename to cmd.exe, and it is unlikely that you can process that file with IO.popen. For CP932-compatible filenames, the encoding approach above works fine. For filenames that are not compatible with CP932, it might be better to rename the files to something compatible.
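Whether a given filename survives the conversion can be checked before calling IO.popen. The following is a rough sketch; encode_for_code_page is a hypothetical helper, and the code pages are passed in explicitly rather than detected with chcp so that the example also runs outside Windows.

```ruby
# Hypothetical helper: encode a command line for a console code page,
# returning nil when some character has no representation in it.
def encode_for_code_page(command, code_page)
  command.encode(code_page)
rescue Encoding::UndefinedConversionError
  nil
end

umlaut_command = "ffmpeg -i \u00FC.rm" # "ü.rm"
hangul_command = "ffmpeg -i \uAC00.rm" # "가.rm"

gbk_umlaut   = encode_for_code_page(umlaut_command, "GBK")   # encodable
gbk_hangul   = encode_for_code_page(hangul_command, "GBK")   # nil
cp932_hangul = encode_for_code_page(hangul_command, "CP932") # nil

p gbk_umlaut.encoding
p gbk_hangul
p cp932_hangul
```

A nil result here signals that the name cannot be passed through that code page at all, which is the case where renaming the file is the practical option.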

Encoding issue when using Nokogiri replace

I have this code:
# encoding: utf-8
require 'nokogiri'
s = "<a href='/path/to/file'>Café Verona</a>".encode('UTF-8')
puts "Original string: #{s}"
@doc = Nokogiri::HTML::DocumentFragment.parse(s)
links = @doc.css('a')
only_text = 'Café Verona'.encode('UTF-8')
puts "Replacement text: #{only_text}"
links.first.replace(only_text)
puts @doc.to_html
However, the output is this:
Original string: <a href='/path/to/file'>Café Verona</a>
Replacement text: Café Verona
Café Verona
Why does the text in @doc end up with the wrong encoding?
I tried with and without encode('UTF-8') or using Document instead of DocumentFragment, but it's the same problem.
I'm using Nokogiri v1.5.6 with Ruby 1.9.3p194.
It seems that passing a Nokogiri text node does the trick ;)
links.first.replace Nokogiri::XML::Text.new(only_text, @doc)
I can't duplicate the problem, but I have two different things to try:
Instead of using:
s = "<a href='/path/to/file'>Café Verona</a>".encode('UTF-8')
Try:
s = "<a href='/path/to/file'>Café Verona</a>"
Your string is already UTF-8 encoded, because of your statement # encoding: utf-8. That's why you put that in the script, to tell Ruby the literal string is in UTF-8. It's possible that you're double-encoding it, though I don't think Ruby will -- it should silently ignore the second attempt because it's already UTF-8.
Another thing I wonder about is, output like:
Café Verona
is an indicator that the language/character-set encoding of your system and your terminal aren't right. Trying to output UTF-8 strings on a system set to something else can get mismatches in the terminal and/or browser. Windows systems are typically Win-1252, ISO-8859-1 or something similar, not UTF-8. On my Mac OS system I have these environment variables set:
LANG=en_US.UTF-8
LC_ALL=en_US.UTF-8
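The "Ã©" pattern is what UTF-8 bytes look like when something reads them as ISO-8859-1. A small sketch that reproduces it in plain Ruby, without Nokogiri, and then reverses it (the variable names are only for illustration):

```ruby
original = "Caf\u00E9 Verona" # "Café Verona"

# Treat the UTF-8 bytes as ISO-8859-1, then transcode to UTF-8 for display.
# The two bytes of "é" (0xC3 0xA9) become the two characters "Ã" and "©".
misread = original.dup
                  .force_encoding(Encoding::ISO_8859_1)
                  .encode(Encoding::UTF_8)
puts misread

# Reversing the steps recovers the original text.
repaired = misread.encode(Encoding::ISO_8859_1)
                  .force_encoding(Encoding::UTF_8)
puts repaired
```

Seeing this exact pattern suggests that one layer (the library, the terminal, or the browser) is assuming a single-byte Latin encoding for bytes that are actually UTF-8.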
"Open iso-8859-1 encoded html with nokogiri messes up accents" might be useful too.

Ruby UTF-8 Encoding doesn't work in Windows even with Magic Comment

I'm trying to run a file (ruby anyfile.rb in cmd prompt) with the following contents:
# encoding: utf-8
puts 'áá'
The following error occurs:
invalid multibyte char (UTF-8)
It seems that Ruby does not understand the magic comment...
EDIT: If I remove the "# encoding: utf-8" and run the command prompt like this:
ruby -E:UTF-8 encoding.rb
then it works - any ideas?
EDIT2: When I run:
ruby -e 'p [Encoding.default_external, Encoding.default_internal]'
I get [#<Encoding:CP850>, nil]. Maybe my Encoding.default_external is wrong?!
Environment:
Windows XP (yes, I also hate windows + ruby)
ruby 1.9.2p180 (2011-02-18) [i386-mingw32]
I believe this is a classic case of "if you hear hooves, think horses, not zebras".
The error message is telling you that you have a byte sequence in your file that is not a valid UTF-8 multibyte sequence.
It is definitely possible that
It seems that Ruby does not understand the magic comment...
as you say, and that up until now nobody noticed that magic comments don't actually work because you are the first person in the history of humankind to actually try to use magic comments. (Actually, this is not possible. If Ruby didn't understand magic comments, it would complain about an invalid ASCII character, since ASCII is the default encoding if no magic comment is present.)
Or, there actually is an invalid multibyte UTF-8 sequence in your file.
Which do you think is more likely? If I were you, I would check my file.
I've encountered similar issues from time to time with files that were not saved as UTF-8, even when the magic comment states so.
I've found that Ruby 1.9.2 had trouble properly converting UTF-8 to code pages 850 and 437, the defaults for the command prompt on Windows.
I recommend upgrading to Ruby 1.9.3 (the latest is patchlevel 125), which solves a lot of encoding issues, especially on Windows.
Also, verify that your saved file does not contain a Unicode BOM (so it is plain UTF-8) and that it is saved properly.
To verify that, you can switch the console code page to Unicode (chcp 65001) and run type myscript.rb
You should see the accented letters displayed correctly.
Last but not least, ensure your command prompt uses a TrueType font so extended characters are displayed properly.
Hope that helps.
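The file checks suggested above can also be done from Ruby itself. This is a minimal sketch; source_report is a hypothetical helper, and the three sample strings stand in for a script saved as UTF-8, as UTF-8 with a BOM, and as ISO-8859-1.

```ruby
UTF8_BOM = "\xEF\xBB\xBF".b.freeze

# Hypothetical helper: reports whether raw source bytes start with a UTF-8
# BOM and whether the remaining bytes are valid UTF-8.
def source_report(bytes)
  raw = bytes.b # work on a binary copy of the bytes
  has_bom = raw.start_with?(UTF8_BOM)
  body = has_bom ? raw.byteslice(3, raw.bytesize - 3) : raw
  { bom: has_bom, valid_utf8: body.force_encoding(Encoding::UTF_8).valid_encoding? }
end

script = "# encoding: utf-8\nputs '\u00E1\u00E1'\n" # puts 'áá'

saved_as_utf8   = script
saved_with_bom  = "\uFEFF" + script
saved_as_latin1 = script.encode("ISO-8859-1")

utf8_report   = source_report(saved_as_utf8)
bom_report    = source_report(saved_with_bom)
latin1_report = source_report(saved_as_latin1)

p utf8_report
p bom_report
p latin1_report
```

For a real file, the same helper could be called as source_report(File.binread("anyfile.rb")); a false valid_utf8 would point to the editor's save encoding rather than to the magic comment.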
Try
# encoding: iso-8859-1
Not everything that's text is utf8.
Are you sure you selected 'UTF-8' from the Encoding dropdown when you saved the file in Notepad? I've just tried this on an XP machine and your code example worked for me.

Resources