Ruby: Generate a utf-8 character from code point as string - ruby

I need to write all utf-8 characters in file. I have all codes as string "5363" or "328E", but I can't add it to \u, to make structure, like "\u5363". Help me please.

(this will work if you have ruby 1.9 or newer)
#irb -E utf-8
irb(main):032:0> s=""
=> ""
irb(main):033:0> i=0x328e
=> 12942
irb(main):034:0> s<<i
=> "㊎"
irb(main):036:0> s<<0x5363
=> "㊎卣"
for your case:
my_char_codes = ["5363","328E"]
s = ""
my_char_codes.each{ |c| s << c.to_i(16) }
# now s contains "㊎卣"

Related

Ruby compatibility error encoding

I'm having a problem. Let's look:
C:\temp> ruby script.rb
script.rb => Powershell output
puts "ę" => ę #irb \xA9
puts "\xA9" => ▯
puts "ę"=="\xA9" => false
input = $stdin.gets.chomp => input=="ę"
puts "e#{input}e" => eęe
puts "ę"==input => false
puts "ę#{input}" => Encoding::Compatibility Error Utf8 & CP852
irb => #command line in ruby
puts "ę"=="\xA9" => true
input = $stdin.gets.chomp => input=="ę"
puts "ę"==input => true && "\xA9"==input => true
puts "ę#{input}" => ęę
It looks like powershell's input uses other font for all special characters than ruby and notepad++(?). Can i change that so it will work when i type in prompt(when asked) and does not show an error?
Edit: Sorry for misdirection. I added invoke and specified that file has extension ".rb" not ".txt"
Edit2: Ok, I've researched some more information and I've been trying do some encoding(UTF8) to a variable. Somethin' strange occured.
puts "ę#{input.encoding}" => ęCP852
puts "\xA9" => UTF-8
Encoding to CP852 has revealed that encoding pass on bytes. I learned that value of "ę"=20+99=119, "ą" = 20 + 85, 20 = C4
Ok. got it ".encoding" - shows what encoding i use. And that resolve this problem.
puts "ę#{input.encode "UTF-8"}" => ęę
Thanks everyone for your input.
If your input in prompt( either cmd or powershell) is causing problems due to incompatibility of using differents encodings just try to encode it via methods in script.encode "UTF-8" #in case of Ruby language If you dont know what methods do that just google your_language_name encoding

Delete non-UTF characters from a string in Ruby?

How do I delete non-UTF8 characters from a ruby string? I have a string that has for example "xC2" in it. I want to remove that char from the string so that it becomes a valid UTF8.
This:
text.gsub!(/\xC2/, '')
returns an error:
incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
I was looking at text.unpack('U*') and string.pack as well, but did not get anywhere.
You can use encode for that.
text.encode('UTF-8', :invalid => :replace, :undef => :replace)
For more info look into Ruby-Docs
You could do it like this
# encoding: utf-8
class String
def validate_encoding
chars.select(&:valid_encoding?).join
end
end
puts "testing\xC2 a non UTF-8 string".validate_encoding
#=>testing a non UTF-8 string
You text have ASCII-8BIT encoding, instead you should use this:
String.delete!("^\u{0000}-\u{007F}");
It will serve the same purpose.
You can use /n, as in
text.gsub!(/\xC2/n, '')
to force the Regexp to operate on bytes.
Are you sure this is what you want, though? Any Unicode character in the range [U+80, U+BF] will have a \xC2 in its UTF-8 encoded form.
Try Iconv
1.9.3p194 :001 > require 'iconv'
# => true
1.9.3p194 :002 > string = "testing\xC2 a non UTF-8 string"
# => "testing\xC2 a non UTF-8 string"
1.9.3p194 :003 > ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
# => #<Iconv:0x000000026c9290>
1.9.3p194 :004 > ic.iconv string
# => "testing a non UTF-8 string"
The best solution to this problem that I've found is this answer to the same question: https://stackoverflow.com/a/8711118/363293.
In short: "€foo\xA0".chars.select(&:valid_encoding?).join
data = '' if not (data.force_encoding("UTF-8").valid_encoding?)

How do I get YAML in Ruby as of 1.9.3 to dump ASCII-8Bit strings as strings?

Here's the problem: I might have strings that are UTF-8, and I might have strings that are US-ASCII. Regardless of the encoding, I'd like YAML.dump(str) to actually dump String objects, instead of these useless !binary objects as the example shows.
Is there a flag or something I'm not seeing to force YAML.dump() to do the right thing?
Ruby 1.9.1 example
YAML::VERSION # "0.60"
a = "foo" # => "foo"
a.force_encoding("BINARY") # => "foo"
YAML.dump(a) # => "--- foo\n"
Ruby 1.9.3 example
YAML::VERSION # "1.2.2"
a = "foo" # => "foo"
a.force_encoding("BINARY") # => "foo"
YAML.dump(a) # => "--- !binary |-\n Zm9v\n"
Update: Got my own answer
YAML::ENGINE.yamler='syck'
YAML.dump(a) # => "--- foo\n"
So, looks like using the old yamler engine with force the old behavior.
Update: Got my own answer
YAML::ENGINE.yamler='syck'
YAML.dump(a) # => "--- foo\n"

Thor & YAML outputting as binary?

I'm using Thor and trying to output YAML to a file. In irb I get what I expect. Plain text in YAML format. But when part of a method in Thor, its output is different...
class Foo < Thor
include Thor::Actions
desc "bar", "test"
def set
test = {"name" => "Xavier", "age" => 30}
puts test
# {"name"=>"Xavier", "age"=>30}
puts test.to_yaml
# !binary "bmFtZQ==": !binary |-
# WGF2aWVy
# !binary "YWdl": 30
File.open("data/config.yml", "w") {|f| f.write(test.to_yaml) }
end
end
Any ideas?
All Ruby 1.9 strings have an encoding attached to them.
YAML encodes some non-UTF8 strings as binary, even when they look innocent, without any high-bit characters. You might think that your code is always using UTF8, but builtins can return non-UTF8 strings (ex File path routines).
To avoid binary encoding, make sure all your strings encodings are UTF-8 before calling to_yaml. Change the encoding with force_encoding("UTF-8") method.
For example, this is how I encode my options hash into yaml:
options = {
:port => 26000,
:rackup => File.expand_path(File.join(File.dirname(__FILE__), "../sveg.rb"))
}
utf8_options = {}
options.each_pair { |k,v| utf8_options[k] = ((v.is_a? String) ? v.force_encoding("UTF-8") : v)}
puts utf8_options.to_yaml
Here is an example of yaml encoding simple strings as binary
>> x = "test"
=> "test"
>> x.encoding
=> #<Encoding:UTF-8>
>> x.to_yaml
=> "--- test\n...\n"
>> x.force_encoding "ASCII-8BIT"
=> "test"
>> x.to_yaml
=> "--- !binary |-\n dGVzdA==\n"
After version 1.9.3p125, ruby build-in YAML engine will treat all BINARY encoding differently than before. All you need to do is to set correct non-BINARY encoding before your String.to_yaml.
in Ruby 1.9, All String object have attached a Encoding object
and as following blog ( by James Edward Gray II ) mentioned, ruby have build in three type of encoding when String is generated:
http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings.
One of encoding may solve your problem => Source code Encoding
This is the encoding of your source code, and can be specify by adding magic encoding string at the first line or second line ( if you have a sha-bang string at the first line of your source code )
the magic encoding code could be one of following:
# encoding: utf-8
# coding: utf-8
# -- encoding : utf-8 --
so in your case, if you use ruby 1.9.3p125 or later, this should be solved by adding one of magic encoding in the beginning of your code.
# encoding: utf-8
require 'thor'
class Foo < Thor
include Thor::Actions
desc "bar", "test"
def bar
test = {"name" => "Xavier", "age" => 30}
puts test
#{"name"=>"Xavier", "age"=>30}
puts test["name"].encoding.name
#UTF-8
puts test.to_yaml
#---
#name: Xavier
#age: 30
puts test.to_yaml.encoding.name
#UTF-8
end
end
I have been struggling with this using 1.9.3p545 on Windows - just with a simple hash containing strings - and no Thor.
The gem ZAML solves the problem quite simply:
require 'ZAML'
yaml = ZAML.dump(some_hash)
File.write(path_to_yaml_file, yaml)

convert utf-8 to unicode in ruby

The UTF-8 of "龅" is E9BE85 and the unicode is U+9F85. Following code did not work as expected:
irb(main):004:0> "龅"
=> "\351\276\205"
irb(main):005:0> Iconv.iconv("unicode","utf-8","龅").to_s
=> "\377\376\205\237"
P.S: I am using Ruby1.8.7.
Ruby 1.9+ is much better equipped to deal with Unicode than 1.8.7, so, I strongly suggest running under 1.9.2 if at all possible.
Part of the problem is that 1.8 didn't understand that a UTF-8 or Unicode character could be more than one byte long. 1.9 does understand that and introduces things like String#each_char.
require 'iconv'
# encoding: UTF-8
RUBY_VERSION # => "1.9.2"
"龅".encoding # => #<Encoding:UTF-8>
"龅".each_char.entries # => ["龅"]
Iconv.iconv("unicode","utf-8","龅").to_s # =>
# ~> -:8:in `iconv': invalid encoding ("unicode", "utf-8") (Iconv::InvalidEncoding)
# ~> from -:8:in `<main>'
To get the list of available encodings with Iconv, do:
require 'iconv'
puts Iconv.list
It's a long list so I won't add it here.
You can try this:
"%04x" % "龅".unpack("U*")[0]
=> "9f85"
Should use UNICODEBIG// as the target encoding
irb(main):014:0> Iconv.iconv("UNICODEBIG//","utf-8","龅")[0].each_byte {|b| puts b.to_s(16)}
9f
85
=> "\237\205"

Resources