How can I read accented characters in ruby shoes?


I can't do a very simple thing in ruby shoes.
The code is the following:
#encoding:UTF-8
Shoes.app do
  @var = File.readlines('temp.txt')
  para @var
end
If the text file contains a word with an accented character, the code doesn't work: Shoes says "not a valid UTF8 string".
How can I solve this problem?
edit: the result is the same if I remove the line #encoding:UTF-8

The line:
#encoding:UTF-8
only tells Ruby which encoding the source file itself is written in. In Ruby 2.0 and later it has no effect, because source code is assumed to be UTF-8 by default.
Regarding your edit ("the result is the same if I remove the line #encoding:UTF-8"): yes, that's because you are probably using a recent version of Ruby, where UTF-8 is already the default source encoding.
You have to know the encoding of the file you want to read. If the file is encoded in ISO-8859-1, you can do:
@var = File.readlines('temp.txt', encoding: "ISO-8859-1")
What characters are in your file? What do you get when you do this:
$ ruby -e"puts Encoding.default_external"
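As a minimal sketch of the approach above, assuming the file really is ISO-8859-1 (the question does not say), the following reads a file and transcodes each line to UTF-8 while reading, so that a UTF-8-only consumer such as Shoes' para receives valid strings. The helper name read_lines_as_utf8 and the throwaway sample file are hypothetical.

```ruby
require 'tempfile'

# Hypothetical helper: read a text file stored in `source_encoding`
# and return its lines transcoded to UTF-8.
def read_lines_as_utf8(path, source_encoding)
  # "EXTERNAL:INTERNAL" asks Ruby to transcode while reading.
  File.readlines(path, encoding: "#{source_encoding}:UTF-8")
end

# Demo with a throwaway file containing "café" in ISO-8859-1,
# where é is the single byte 0xE9.
Tempfile.create('latin1-sample') do |f|
  f.binmode
  f.write("caf\xE9\n".b)
  f.flush

  lines = read_lines_as_utf8(f.path, 'ISO-8859-1')
  puts lines.first.encoding          # UTF-8
  puts lines.first.valid_encoding?   # true
end
```

If the file turns out to be in a different encoding, only the second argument needs to change.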

Related

macOS Automator's Ruby defaults to ASCII despite being >= 1.9

I am trying to get access to the text in the macOS clipboard from within Automator using a Ruby script. This script calls macOS's internal Ruby (/usr/bin/ruby). After running into much trouble with unidentified character sequence errors, I noticed that Automator's Ruby defaults to ASCII instead of UTF-8, even though UTF-8 has been the default in modern Ruby for years.
So, running the following:
require 'clipboard'
puts(Clipboard.paste.encoding)
always yields "ASCII", while running the same Ruby interpreter from the command line to run the same script and to paste the same pieces of text always yields "UTF-8".
This becomes an issue when I copy multibyte characters like the accented characters (e.g. ê). For instance if I copy the following text:
Bourdieu, P., & Passeron, J.-C. (1970). La reproduction: éléments pour une théorie du système d’enseignement. Ed. de Minuit.
And then run:
require 'clipboard'
puts(Clipboard.paste)
I get nothing in Automator while I get a copy of the original text on the command line.
If I try to transform the text in any way, I get an error. Let's say I run the following:
require 'clipboard'
puts(Clipboard.paste.gsub(/\r/,""))
In response, I will receive:
-e:2:in `gsub': invalid byte sequence in US-ASCII (ArgumentError)
from -e:2:in `<main>'
How can I avoid this and make sure what I get from the clipboard is already converted into proper UTF-8?
I have tried encode and force_encoding methods, as well as a variety of combinations of # encoding: UTF-8, Encoding.default_external='utf-8' and Encoding.default_internal='utf-8', but it seems there are corrupt characters that hinder the conversion, so no success in the end.
Is there anything I am ignoring here, or any combination I haven't tried?
Notes:
It is Automator that calls the interpreter, and not me. So, I can't modify Automator's call to add switches and modify options.
string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '') works, but the sanitization comes at the cost of chopping off the multibyte characters, which is obviously not the intended behaviour here.

I found that in macOS Mojave 10.14.6, starting the Automator 'Run Shell Script' with # coding: UTF-8 solved the problem. Not sure the #!/usr/bin/ruby is useful or necessary, but I include it. You can test by using this code with and without the # coding: UTF-8:
#!/usr/bin/ruby
# coding: UTF-8
test_s = "will print ✪"
puts test_s
Credit for the answer is from here: discussions.apple.com
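The note in the question about encode! dropping characters can be reproduced in isolation. The sketch below assumes the bytes are valid UTF-8 that merely carry the wrong encoding label (the question does not confirm this for the Automator clipboard), and the sample string is hypothetical. Transcoding from binary treats every byte above 0x7F as undefined and removes it, whereas force_encoding only relabels the bytes.

```ruby
# Hypothetical input: the UTF-8 bytes of "théorie", labeled as binary.
raw = "th\u00E9orie".b

# Transcoding from binary: bytes >= 0x80 have no mapping, so they are dropped.
stripped = raw.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

# Relabeling: the bytes stay untouched, only the encoding tag changes.
relabeled = raw.dup.force_encoding('UTF-8')

puts stripped                     # thorie
puts relabeled                    # théorie
puts relabeled.valid_encoding?    # true
```

If the relabeled string reports valid_encoding? as false, the bytes are in some other encoding and need a real transcode from that encoding instead.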

CSV writes Ñ into its code form instead of its actual form

I have a CSV file. I checked its encoding using this:
File.open('C:\path\to\file\myfile.txt').read.encoding
and it returned the encoding as:
=> #<Encoding:IBM437>
I'm reading this CSV per row -- stripping spaces and doing other stuff. After "cleansing" it, I push it to a new file. I'm doing it like this:
CSV.foreach(file_read, encoding: "IBM437:UTF-8") do |r|
  # some code
  CSV.open(file_appended, "a", col_sep: "|") do |csv|
    csv << r
  end
end
Now my problem is, inside the CSV I'm reading, there's a word with an accented character -- Ñ to be exact. This character is being appended to the new file as
\u2564
This is a problem because the accented character is a vital part of that word, and I want it to appear in the new file as-is.
I tried the following source:destination encodings, but to no avail:
ISO-8859-1:UTF8 (and vice versa)
ISO-8859-1:Windows-1252 (and vice versa)
Am I missing something?
Here is my Ruby version, in case it matters:
ruby 1.9.3p392 (2013-02-22) [i386-mingw32]
Thanks in advance!

The line below solved my problem:
Encoding.default_external = "iso-8859-1"
It tells Ruby to assume that files it reads are encoded in ISO-8859-1 unless told otherwise, so the Ñ character is interpreted correctly.
Credit goes to Darshan Computing's answer here. Just look for Update #2.
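To see where \u2564 comes from, here is a small sketch that interprets the same single byte under both encodings. It assumes the source file stores Ñ as the byte 0xD1, as ISO-8859-1 does; the variable names are illustrative.

```ruby
# Ñ in ISO-8859-1 is the single byte 0xD1.
byte = "\xD1".b

# Read as IBM437, 0xD1 is a box-drawing character (U+2564).
as_ibm437 = byte.dup.force_encoding('IBM437').encode('UTF-8')

# Read as ISO-8859-1, the same byte is Ñ (U+00D1).
as_latin1 = byte.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts as_ibm437.unpack('U*').map { |cp| format('U+%04X', cp) }.join  # U+2564
puts as_latin1.unpack('U*').map { |cp| format('U+%04X', cp) }.join  # U+00D1
```

Because Encoding.default_external is process-wide, it also affects every other file the script opens without an explicit encoding.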

Encoding issue when using Nokogiri replace

I have this code:
# encoding: utf-8
require 'nokogiri'
s = "<a href='/path/to/file'>Café Verona</a>".encode('UTF-8')
puts "Original string: #{s}"
@doc = Nokogiri::HTML::DocumentFragment.parse(s)
links = @doc.css('a')
only_text = 'Café Verona'.encode('UTF-8')
puts "Replacement text: #{only_text}"
links.first.replace(only_text)
puts @doc.to_html
However, the output is this:
Original string: <a href='/path/to/file'>Café Verona</a>
Replacement text: Café Verona
CafÃ© Verona
Why does the text in @doc end up with the wrong encoding?
I tried with and without encode('UTF-8') or using Document instead of DocumentFragment, but it's the same problem.
I'm using Nokogiri v1.5.6 with Ruby 1.9.3p194.

Seems that if you pass a nokogiri text object it does the thing ;)
links.first.replace Nokogiri::XML::Text.new(only_text, @doc)

I can't duplicate the problem, but I have two different things to try:
Instead of using:
s = "<a href='/path/to/file'>Café Verona</a>".encode('UTF-8')
Try:
s = "<a href='/path/to/file'>Café Verona</a>"
Your string is already UTF-8 encoded, because of your statement # encoding: utf-8. That's why you put that in the script, to tell Ruby the literal string is in UTF-8. It's possible that you're double-encoding it, though I don't think Ruby will -- it should silently ignore the second attempt because it's already UTF-8.
Another thing I wonder about is, output like:
CafÃ© Verona
is an indicator that the language/character-set encoding of your system and your terminal aren't right. Trying to output UTF-8 strings on a system set to something else can get mismatches in the terminal and/or browser. Windows systems are typically Win-1252, ISO-8859-1 or something similar, not UTF-8. On my Mac OS system I have these environment variables set:
LANG=en_US.UTF-8
LC_ALL=en_US.UTF-8
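The garbled output has a recognizable shape: it is what the UTF-8 bytes of é look like when something interprets them as ISO-8859-1 (or Windows-1252). The following sketch reproduces the effect in plain Ruby. It illustrates the general mechanism only and does not establish whether Nokogiri or the terminal performs the misinterpretation in this case.

```ruby
original = "Caf\u00E9 Verona"   # é is two bytes in UTF-8: 0xC3 0xA9

# Misread the UTF-8 bytes as ISO-8859-1, then convert that to UTF-8.
garbled = original.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts garbled   # CafÃ© Verona

# The damage is reversible as long as no bytes were lost.
repaired = garbled.encode('ISO-8859-1').force_encoding('UTF-8')
puts repaired == original   # true
```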
"Open iso-8859-1 encoded html with nokogiri messes up accents" might be useful too.

ruby unicode escapes as command line arguments

It looks like this question has been asked by a python dev (Allowing input of Unicode escapes as command line arguments), which I think partially relates, but it doesn't fully give me a solution for my immediate problem in Ruby. I'm curious if there is a way to take escaped unicode sequences as command line arguments, assign to a variable, then have the escaped unicode be processed and displayed as normal unicode after the script runs. Basically, I want to be able to choose a unicode number, then have Ruby stick that in a filename and have the actual unicode character displayed.
Here are a few things I've noticed that cause problems:
unicode = ARGV[0] #command line argument is \u263a
puts unicode
puts unicode.inspect
=> u263a
=> "u263a"
The backslash needed for the string to be treated as a Unicode escape sequence gets stripped.
Then, if we try adding another "\" to escape it,
unicode = ARGV[0] #command line argument is \\u263a
puts unicode
puts unicode.inspect
=> \u263a
=> "\\u263a"
but it still won't be processed properly.
Here's some more relevant code where I'm actually trying to make this happen:
unicode = ARGV[0]
filetype = ARGV[1]
path = unicode + "." + filetype
File.new(path, "w")
It seems like this should be pretty simple, but I've searched and searched and cannot find a solution. I should add, I do know that supplying the hard-coded escaped unicode in a string works just fine, like File.new("\u263a.#{filetype}", "w"), but getting it from an argument/variable is what I'm having an issue with. I'm using Ruby 1.9.2.

To unescape the Unicode-escaped command line argument and create a new file with the user-supplied Unicode string in the filename, I used @mu is too short's method of using pack and unpack, like so:
filetype = ARGV[1]
unicode = ARGV[0].gsub(/\\u([\da-fA-F]{4})/) {|m| [$1].pack("H*").unpack("n*").pack("U*")}
path = unicode + "." + filetype
File.new(path, "w")
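An equivalent way to do the conversion, sketched here as an alternative to the pack/unpack chain, is to parse the hex digits as an integer and pack that as a single code point. The method name unescape_unicode is hypothetical, and like the code above it handles only four-digit \uXXXX escapes.

```ruby
# Hypothetical helper: turn literal "\uXXXX" sequences into the characters they name.
def unescape_unicode(text)
  # \h matches one hex digit; String#hex parses the captured digits.
  text.gsub(/\\u(\h{4})/) { [$1.hex].pack('U') }
end

arg = '\u263a'                        # the six literal characters \u263a, as received in ARGV[0]
puts unescape_unicode(arg)            # ☺
puts unescape_unicode(arg + '.txt')   # ☺.txt
```

Text without any escape sequences passes through unchanged.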

Output the euro sign in a ruby script

I'm currently trying to output the euro symbol through a simple ruby script that I made but I keep getting "?" whenever I try.
I'm currently using puts "\244".
Any thoughts?
btw. I'm using ruby 1.9.2 p180

You need to add a "magic comment" at the top of your script like this:
# encoding: UTF-8
puts "€"
... assuming that you want to use UTF-8 to allow for multibyte characters. You can then use the Euro symbol directly.
You can read more about Ruby 1.9 string encoding and magic comments here:
http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings
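For context on the original puts "\244": the octal escape \244 is the single byte 0xA4, which is the euro sign in ISO-8859-15 but not a valid UTF-8 sequence on its own, which would explain the "?" on a terminal that expects UTF-8. A short sketch, assuming ISO-8859-15 was the intended encoding (the question does not say):

```ruby
legacy = "\244".b   # one byte, 0xA4

# Interpret the byte as ISO-8859-15 and transcode it to UTF-8.
euro = legacy.dup.force_encoding('ISO-8859-15').encode('UTF-8')

puts euro               # €
puts euro == "\u20AC"   # true
puts euro.bytesize      # 3
```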
