Ruby compatibility error encoding - ruby

I'm having a problem. Let's look:
C:\temp> ruby script.rb
script.rb => Powershell output
puts "ę" => ę #irb \xA9
puts "\xA9" => ▯
puts "ę"=="\xA9" => false
input = $stdin.gets.chomp => input=="ę"
puts "e#{input}e" => eęe
puts "ę"==input => false
puts "ę#{input}" => Encoding::CompatibilityError (UTF-8 and CP852)
irb => #command line in ruby
puts "ę"=="\xA9" => true
input = $stdin.gets.chomp => input=="ę"
puts "ę"==input => true && "\xA9"==input => true
puts "ę#{input}" => ęę
It looks like PowerShell's input uses a different font for all special characters than Ruby and Notepad++ (?). Can I change that so it works when I type at the prompt (when asked) and does not raise an error?
Edit: Sorry for misdirection. I added invoke and specified that file has extension ".rb" not ".txt"
Edit2: OK, I've done some more research and tried encoding a variable as UTF-8. Something strange occurred.
puts "ę#{input.encoding}" => ęCP852
puts "\xA9" => UTF-8
Encoding to CP852 revealed that encoding works on the bytes. I learned that the value of "ę" = 20+99 = 119, "ą" = 20 + 85, 20 = C4.
OK, got it: ".encoding" shows which encoding a string uses, and that resolved the problem.
puts "ę#{input.encode "UTF-8"}" => ęę
Thanks everyone for your input.

If input typed at the prompt (either cmd or PowerShell) causes problems because the console and the script use different encodings, convert it inside the script, for example with input.encode "UTF-8" in Ruby. If you don't know which method does that in your language, search for "your_language_name encoding".
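The fix described above can be reproduced without a console. This is a minimal sketch, assuming the console code page is CP852 as in the question; the variable names are illustrative, and the byte is built with pack so the example does not depend on the terminal.

```ruby
# Byte 0xA9 is "ę" in CP852. pack returns a binary string, so relabel it as
# CP852 to mimic what $stdin.gets returned in the PowerShell session above.
input = [0xA9].pack("C").force_encoding("CP852")

literal = "\u0119" # "ę" written as a UTF-8 literal

# Same character, different bytes and encodings, so the comparison fails.
same_before = (literal == input)

# Transcode the console input to UTF-8 before comparing or interpolating.
converted  = input.encode("UTF-8")
same_after = (literal == converted)
combined   = "#{literal}#{converted}"

puts same_before
puts same_after
puts combined
```

Interpolating input directly into a UTF-8 string is what raised Encoding::CompatibilityError; after encode both sides are UTF-8 and the interpolation succeeds.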

Related

SmarterCSV and file encoding issues in Ruby

I'm working with a file that appears to have UTF-16LE encoding. If I run
File.read(file, :encoding => 'utf-16le')
the first line of the file is:
"<U+FEFF>=\"25/09/2013\"\t18:39:17\t=\"Unknown\"\t=\"+15168608203\"\t\"Message.\"\r\n"
If I read the file using something like
csv_text = File.read(file, :encoding => 'utf-16le')
I get an error stating
ASCII incompatible encoding needs binmode (ArgumentError)
If I switch the encoding in the above to
csv_text = File.read(file, :encoding => 'utf-8')
I make it to the SmarterCSV section of the code, but get an error that states
`=~': invalid byte sequence in UTF-8 (ArgumentError)
The full code is below. If I run this in the Rails console, it works just fine, but if I run it using ruby test.rb, it gives me the first error:
require 'smarter_csv'
headers = ["date_of_message", "timestamp_of_message", "sender", "phone_number", "message"]
path = '/path/'
Dir.glob("#{path}*.CSV").each do |file|
  csv_text = File.read(file, :encoding => 'utf-16le')
  File.open('/tmp/tmp_file', 'w') { |tmp_file| tmp_file.write(csv_text) }
  puts 'made it here'
  SmarterCSV.process('/tmp/tmp_file', {
    :col_sep => "\t",
    :force_simple_split => true,
    :headers_in_file => false,
    :user_provided_headers => headers
  }).each do |row|
    converted_row = {}
    # The slices strip the ="..." wrapper around each field.
    # String#to_date comes from ActiveSupport (loaded in the Rails console).
    converted_row[:date_of_message] = row[:date_of_message][2..-2].to_date
    # Keys follow user_provided_headers, so the time column is :timestamp_of_message.
    converted_row[:timestamp] = row[:timestamp_of_message]
    converted_row[:sender] = row[:sender][2..-2]
    converted_row[:phone_number] = row[:phone_number][2..-2]
    converted_row[:message] = row[:message][1..-2]
    converted_row[:room] = file.gsub(path, '')
  end
end
Update - 05/13/15
Ultimately, I decided to encode the file string as UTF-8 rather than diving deeper into the SmarterCSV code. The first problem in the SmarterCSV code is that it does not allow a user to specify binary mode when reading in a file, but after adjusting the source to handle that, a myriad of other encoding-related issues popped up, many of them related to how various parameters are handled for files that are not UTF-8 encoded. It may have been the easy way out, but encoding everything as UTF-8 before feeding it into SmarterCSV solved my issue.
Add binmode to the File.read call.
File.read(file, :encoding => 'utf-16le', mode: "rb")
"b" Binary file mode
Suppresses EOL <-> CRLF conversion on Windows. And
sets external encoding to ASCII-8BIT unless explicitly
specified.
ref: http://ruby-doc.org/core-2.0.0/IO.html#method-c-read
Now pass the correct encoding to SmarterCSV
SmarterCSV.process('/tmp/tmp_file', {
:file_encoding => "utf-16le", ...
Update
It turned out that SmarterCSV does not support binary mode. After the OP tried to modify its code without success, the simple solution was to convert the input to UTF-8, which SmarterCSV supports.
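A minimal sketch of that conversion step, assuming the input really is UTF-16LE with a leading byte order mark. The helper name read_utf16le_as_utf8 and the sample row are made up for illustration; the resulting UTF-8 text is what would be written to the temporary file before calling SmarterCSV.

```ruby
require "tempfile"

# Hypothetical helper: read a UTF-16LE file in binary mode, transcode it to
# UTF-8, and drop the byte order mark (U+FEFF) if one is present.
def read_utf16le_as_utf8(path)
  raw = File.read(path, mode: "rb", encoding: "UTF-16LE")
  raw.encode("UTF-8").delete_prefix("\uFEFF")
end

# Build a small UTF-16LE sample resembling the first row in the question.
sample = "\uFEFF=\"25/09/2013\"\t18:39:17\r\n".encode("UTF-16LE")

utf8_text = nil
Tempfile.create(["sample", ".CSV"]) do |f|
  f.binmode
  f.write(sample)
  f.flush
  utf8_text = read_utf16le_as_utf8(f.path)
end

puts utf8_text.encoding
puts utf8_text.inspect
```

Stripping the BOM matters here because the field slicing in the question counts characters from the start of each value.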
Unfortunately, you're using a 'flat-file' style of storage and character encoding is going to be an issue on both ends (reading or writing).
I would suggest using something along the lines of str = str.force_encoding("UTF-8") and see if you can get that to work.

UndefinedConversionError trying to parse Arabic from email body

Using the mail gem for Ruby, I am getting this message:
mail.rb:22:in `encode': "\xC7" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
from mail.rb:22:in `<main>'
If I remove encode, I get this message:
/var/lib/gems/1.9.1/gems/bson-1.7.0/lib/bson/bson_ruby.rb:63:in `rescue in to_utf8_binary': String not valid utf-8: "<div dir=\"ltr\"><div class=\"gmail_quote\">l<br><br><br><div dir=\"ltr\"><div class=\"gmail_quote\"><br><br><br><div dir=\"ltr\"><div class=\"gmail_quote\"><br><br><br><div dir=\"ltr\"><div dir=\"rtl\">\xC7\xE1\xE4\xD5 \xC8\xC7\xE1\xE1\xDB\xC9 \xC7\xE1\xDA\xD1\xC8\xED\xC9</div></div>\r\n</div><br></div>\r\n</div><br></div>\r\n</div><br></div>" (BSON::InvalidStringEncoding)
This is my code:
require 'mail'
require 'mongo'
connection = Mongo::Connection.new
db = connection.db("DB")
db = Mongo::Connection.new.db("DB")
newsCollection = db["news"]
Mail.defaults do
  retriever_method :pop3, :address => "pop.gmail.com",
                          :port => 995,
                          :user_name => 'my_username',
                          :password => '*****',
                          :enable_ssl => true
end
emails = Mail.last
# Checks if the email is multipart and decodes accordingly. Put to extract UTF8 from body
plain_part = emails.multipart? ? (emails.text_part ? emails.text_part.body.decoded : nil) : emails.body.decoded
html_part = emails.html_part ? emails.html_part.body.decoded : nil
mongoMessage = {"date" => emails.date.to_s , "subject" => emails.subject , "body" => plain_part.encode('UTF-8') }
msgID = newsCollection.insert(mongoMessage) #add the document to the database and returns it's ID
puts msgID
For English and Hebrew it works perfectly, but it seems Gmail is sending Arabic with a different encoding. Replacing UTF-8 with ASCII-8BIT gives a similar error.
I get the same result when using plain_part for plain email messages. I am handling emails from one specific source, so I can say with confidence that html_part is not causing the error.
To make it extra weird, the subject in Arabic is rendered perfectly.
What encoding should I use?
If you use encode without options, it will raise this error if your string claims one encoding but contains characters from another.
Try it this way:
plain_part.encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '?')
This replaces characters that are invalid or undefined for the given encoding with "?". If that is not sufficient for your needs, you need a way to check whether your plain_part string is valid.
For example, you can use valid_encoding? for this.
I recently stumbled across a similar problem where I couldn't be sure what the encoding really was, so I wrote this (maybe slightly humble) method. Maybe it helps you find a way to fix your problem.
def self.encode!(str)
  return nil if str.nil?

  # Encodings to try, in order, when the declared encoding is wrong.
  known_encodings = %w(
    UTF-8
    ISO-8859-1
  )

  begin
    str.encode(Encoding.find('UTF-8'))
  rescue Encoding::UndefinedConversionError
    known_encodings.each do |encoding|
      # Work on a copy so the caller's string is not relabelled.
      fixed_str = str.dup.force_encoding(encoding)
      if fixed_str.valid_encoding?
        return fixed_str.encode(Encoding.find('UTF-8'))
      end
    end
    # Nothing matched: replace whatever cannot be converted.
    return str.encode(Encoding.find('UTF-8'), :invalid => :replace, :undef => :replace, :replace => '?')
  end
end
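To make the difference between replacing and relabelling concrete, here is a small sketch using the bytes from the error message in the question. That those bytes are Windows-1256 is my own inference (they decode to readable Arabic under that code page); nothing in the thread confirms which charset the mail declared.

```ruby
# Bytes copied from the BSON error message above, held as a binary string.
raw = "\xC7\xE1\xE4\xD5 \xC8\xC7\xE1\xE1\xDB\xC9 \xC7\xE1\xDA\xD1\xC8\xED\xC9".b

# Option 1: lossy. Every byte that has no UTF-8 mapping becomes "?".
lossy = raw.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")

# Option 2: relabel a copy with the assumed source encoding, then transcode.
arabic = raw.dup.force_encoding("Windows-1256") # assumption, see lead-in
utf8 = arabic.valid_encoding? ? arabic.encode("UTF-8") : nil

puts lossy
puts utf8
```

The lossy form keeps the insert into MongoDB from failing but discards the text; relabelling preserves it when the guessed encoding is right. ISO-8859-1 in the method above would also pass valid_encoding?, since every byte is valid in it, but it would produce Latin letters instead of Arabic.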
I found a workaround.
Since only specific emails will be sent to this account, just for use by this application, I have full control over formatting. For some reason mail decodes a text/plain attachment perfectly,
so:
body = nil # declared outside the block so it is still visible afterwards
emails.attachments.each do |attachment|
  if attachment.content_type.start_with?('text/plain')
    # Extract the text file.
    begin
      body = attachment.body.decoded
    rescue StandardError => e
      puts "Unable to save data for #{attachment.filename} because #{e.message}"
    end
  end
end
mongoMessage = {"date" => emails.date.to_s, "subject" => emails.subject, "body" => body}

How do I get YAML in Ruby as of 1.9.3 to dump ASCII-8Bit strings as strings?

Here's the problem: I might have strings that are UTF-8, and I might have strings that are US-ASCII. Regardless of the encoding, I'd like YAML.dump(str) to actually dump String objects, instead of these useless !binary objects as the example shows.
Is there a flag or something I'm not seeing to force YAML.dump() to do the right thing?
Ruby 1.9.1 example
YAML::VERSION # "0.60"
a = "foo" # => "foo"
a.force_encoding("BINARY") # => "foo"
YAML.dump(a) # => "--- foo\n"
Ruby 1.9.3 example
YAML::VERSION # "1.2.2"
a = "foo" # => "foo"
a.force_encoding("BINARY") # => "foo"
YAML.dump(a) # => "--- !binary |-\n Zm9v\n"
Update: Got my own answer
YAML::ENGINE.yamler='syck'
YAML.dump(a) # => "--- foo\n"
So, it looks like using the old yamler engine will force the old behavior.
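Syck was removed from the standard library in Ruby 2.0, so the engine switch is not available on later versions. A sketch of an alternative that does not depend on the engine, assuming the bytes really are valid UTF-8 (or plain ASCII): relabel a copy of the string before dumping it.

```ruby
require "yaml"

a = "foo".dup.force_encoding("BINARY") # ASCII-8BIT, as in the question

# Relabel a copy as UTF-8; this changes only the encoding tag, not the bytes.
text = a.dup.force_encoding("UTF-8")
raise ArgumentError, "bytes are not valid UTF-8" unless text.valid_encoding?

dumped = YAML.dump(text)
puts dumped
```

The valid_encoding? guard is there because relabelling arbitrary binary data as UTF-8 would only move the problem to whoever reads the YAML back.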

Thor & YAML outputting as binary?

I'm using Thor and trying to output YAML to a file. In irb I get what I expect. Plain text in YAML format. But when part of a method in Thor, its output is different...
class Foo < Thor
  include Thor::Actions
  desc "bar", "test"
  def set
    test = {"name" => "Xavier", "age" => 30}
    puts test
    # {"name"=>"Xavier", "age"=>30}
    puts test.to_yaml
    # !binary "bmFtZQ==": !binary |-
    #   WGF2aWVy
    # !binary "YWdl": 30
    File.open("data/config.yml", "w") {|f| f.write(test.to_yaml) }
  end
end
Any ideas?
All Ruby 1.9 strings have an encoding attached to them.
YAML encodes some non-UTF-8 strings as binary, even when they look innocent and contain no high-bit characters. You might think that your code always uses UTF-8, but built-ins can return non-UTF-8 strings (e.g. file path routines).
To avoid binary encoding, make sure all your strings are UTF-8 encoded before calling to_yaml. Change the encoding with the force_encoding("UTF-8") method.
For example, this is how I encode my options hash into YAML:
options = {
  :port => 26000,
  :rackup => File.expand_path(File.join(File.dirname(__FILE__), "../sveg.rb"))
}
utf8_options = {}
options.each_pair { |k,v| utf8_options[k] = ((v.is_a? String) ? v.force_encoding("UTF-8") : v)}
puts utf8_options.to_yaml
Here is an example of yaml encoding simple strings as binary
>> x = "test"
=> "test"
>> x.encoding
=> #<Encoding:UTF-8>
>> x.to_yaml
=> "--- test\n...\n"
>> x.force_encoding "ASCII-8BIT"
=> "test"
>> x.to_yaml
=> "--- !binary |-\n dGVzdA==\n"
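The output in the question shows the hash keys tagged as binary as well, while the each_pair loop above only relabels values. A hedged sketch of a recursive variant follows; deep_utf8 is a name made up for this example, and it assumes every string in the structure holds valid UTF-8 bytes.

```ruby
require "yaml"

# Hypothetical helper: return a copy of obj in which every String, including
# hash keys and nested elements, is tagged as UTF-8. Strings are duplicated
# first so frozen or shared strings are left untouched.
def deep_utf8(obj)
  case obj
  when String then obj.dup.force_encoding("UTF-8")
  when Hash   then obj.each_with_object({}) { |(k, v), h| h[deep_utf8(k)] = deep_utf8(v) }
  when Array  then obj.map { |e| deep_utf8(e) }
  else obj
  end
end

# Simulate the situation in the question: keys and values tagged ASCII-8BIT.
test = { "name".b => "Xavier".b, "age".b => 30 }

yaml = deep_utf8(test).to_yaml
puts yaml
```

Working on copies avoids mutating strings the caller still holds, at the cost of one extra allocation per string.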
After version 1.9.3p125, Ruby's built-in YAML engine treats all BINARY-encoded strings differently than before. All you need to do is set a correct non-BINARY encoding before calling String#to_yaml.
In Ruby 1.9, every String object has an Encoding object attached,
and as the following blog post (by James Edward Gray II) explains, Ruby has three default encodings that apply when a String is created:
http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings.
One of them may solve your problem: the source code encoding.
This is the encoding of your source code. It can be specified by adding a magic encoding comment on the first line, or on the second line if the first line is a shebang.
The magic encoding comment can be any of the following:
# encoding: utf-8
# coding: utf-8
# -*- encoding: utf-8 -*-
So in your case, if you use Ruby 1.9.3p125 or later, this should be solved by adding one of the magic encoding comments at the beginning of your code.
# encoding: utf-8
require 'thor'
class Foo < Thor
  include Thor::Actions
  desc "bar", "test"
  def bar
    test = {"name" => "Xavier", "age" => 30}
    puts test
    #{"name"=>"Xavier", "age"=>30}
    puts test["name"].encoding.name
    #UTF-8
    puts test.to_yaml
    #---
    #name: Xavier
    #age: 30
    puts test.to_yaml.encoding.name
    #UTF-8
  end
end
I have been struggling with this using 1.9.3p545 on Windows - just with a simple hash containing strings - and no Thor.
The gem ZAML solves the problem quite simply:
require 'ZAML'
yaml = ZAML.dump(some_hash)
File.write(path_to_yaml_file, yaml)

Ruby: Generate a utf-8 character from code point as string

I need to write all UTF-8 characters to a file. I have all the code points as strings such as "5363" or "328E", but I can't append them to \u to build an escape like "\u5363". Please help.
(this will work if you have ruby 1.9 or newer)
#irb -E utf-8
irb(main):032:0> s=""
=> ""
irb(main):033:0> i=0x328e
=> 12942
irb(main):034:0> s<<i
=> "㊎"
irb(main):036:0> s<<0x5363
=> "㊎卣"
for your case:
my_char_codes = ["5363","328E"]
s = ""
my_char_codes.each{ |c| s << c.to_i(16) }
# now s contains "卣㊎"
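Two other ways to do the same conversion, shown as a sketch with the same two code points: Integer#chr with an explicit encoding, and Array#pack with the "U" directive. Neither depends on the encoding of an existing string the way s << i does.

```ruby
codes = ["5363", "328E"]

# Integer#chr with an explicit encoding turns one code point into a string.
chars = codes.map { |hex| hex.to_i(16).chr(Encoding::UTF_8) }
s = chars.join

# Array#pack with "U*" builds one UTF-8 string from a list of code points.
packed = codes.map { |hex| hex.to_i(16) }.pack("U*")

puts s
puts packed
```

Either result can then be written out with File.write; the strings are already UTF-8, so no further conversion is needed.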
