How to convert hex into UTF-8 on ruby (1.8.7)? - ruby

For example how to create UTF-8 character from the following: "0x63 0xcc 0x8c"?
I understand that ruby 1.9 has better UTF-8 but this question is for ruby 1.8.7.

Ruby String unpack? http://ruby-doc.org/core/classes/String.src/M001112.html.
For example:
"\x68\x65\x6c\x6c\x6f".unpack("Z*") --> "hello"

Related

Can I programmatically convert "I’d" to "I’d" using Ruby?

I can't seem to find the right combination of String#encode shenanigans.
I think I'd got confused on this one so I'll post this here to hopefully help anyone else who is similarly confused.
I was trying to do my encoding in an irb session, which gives you
irb(main):002:0> 'I’d'.force_encoding('UTF-8')
=> "I’d"
And if you try using encode instead of force_encoding then you get
irb(main):001:0> 'I’d'.encode('UTF-8')
=> "I’d"
This is with irb set to use an output and input encoding of UTF-8. In my case to convert that string the way I want it involves telling Ruby that the source string is in windows-1252 encoding. You can do this by using the -E argument in which you specify `inputencoding:outputencoding' and then you get this
$ irb -EWindows-1252:UTF-8
irb(main):001:0> 'I’d'
=> "I\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2d"
That looks wrong unless you pipe it out, which gives this
$ ruby -E Windows-1252:UTF-8 -e "puts 'I’d'"
I’d
Hurrah. I'm not sure about why Ruby showed it as "I\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2d" (something to do with the code page of the terminal?) so if anyone can comment with further insight that would be great.
I expect your script is using the encoding cp1251 and you have ruby >= 1.9.
Then you can use force_encoding:
#encoding: cp1251
#works also with encoding: binary
source = 'I’d'
puts source.force_encoding('utf-8') #-> I’d
If my exceptions are wrong: Which encoding do you use and which ruby version?
A little background:
Problems with encoding are difficult to analyse. There may be conflicts between:
Encoding of the source code (That's defined by the editor).
Expected encoding of the source code (that's defined with #encoding on the first line). This is used by ruby.
Encoding of the string (see e.g. section String encodings in http://nuclearsquid.com/writings/ruby-1-9-encodings/ )
Encoding of the output shell

Ruby error invalid multibyte char (US-ASCII)

I am trying to run the ruby script found here
but I am getting the error
invalid multibyte char (US-ASCII)
for line 12 which is
http = Net::HTTP.new("twitter.com", Net::HTTP.https_default_port())
can someone please explain to me what this means and how I can fix it, thanks
When you run the script with Ruby 1.9, change the first two lines of the script to:
#!/usr/bin/env ruby
# encoding: utf-8
require 'net/http'
This tells Ruby to run the script with support for the UTF-8 character set. Without that line Ruby 1.9 would default to the US_ASCII character set.
Just for the record: This will not work in Ruby 1.8, because 1.8 doesn't knew anything about string encodings. And the line is not needed anymore in Ruby 2.0, because Ruby 2.0 is using UTF-8 as the default anyway.
It means that a multibyte character is used and Ruby is not set to handle it. If you are using an old version of Ruby, then put the following magic comment at the beginning of the file:
# coding: utf-8
If you use a modern version of Ruby, then that problem would not arise in the first place.

`gsub': incompatible character encodings: UTF-8 and IBM437

I try to use search, google but with no luck.
OS: Windows XP
Ruby version 1.9.3po
Error:
`gsub': incompatible character encodings: UTF-8 and IBM437
Code:
require 'rubygems'
require 'hpricot'
require 'net/http'
source = Net::HTTP.get('host', '/' + ARGV[0] + '.asp')
doc = Hpricot(source)
doc.search("p.MsoNormal/a").each do |a|
puts a.to_plain_text
end
Program output few strings but when text is ”NOŻYCE” I am getting error above.
Could somebody help?
You could try converting your HTML to UTF-8 since it appears the original is in vintage-retro DOS format:
source.encode!('UTF-8')
That should flip it from 8-bit ASCII to UTF-8 as expected by the Hpricot parser.
The inner encoding of the source variable is UTF-8 but that is not what you want.
As tadman wrote, you must first tell Ruby that the actual characters in the string are in the IBM437 encoding. Then you can convert that string to your favourite encoding, but only if such a conversion is possible.
source.force_encoding('IBM437').encode('UTF-8')
In your case, you cannot convert your string to ISO-8859-2 because not all IBM437 characters can be converted to that charset. Sticking to UTF-8 is probably your best option.
Anyway, are you sure that that file is actually transmitted in IBM437? Maybe it is stored as such in the HTTP server but it is sent over-the-wire with another encoding. Or it may not even be exactly in IBM437, it may be CP852, also called MS-DOC Latin 2 (different from ISO Latin 2).

Output the euro sign in a ruby script

I'm currently trying to output the euro symbol through a simple ruby script that I made but I keep getting "?" whenever I try.
I'm currently using puts "\244".
Any thoughts?
btw. I'm using ruby 1.9.2 p180
You need to add a "magic comment" at the top of your script like this:
# encoding: UTF-8
puts "€"
... assuming that you want to use UTF-8 to allow for double-byte characters. You can then use the Euro symbol directly.
You can read more about Ruby 1.9 string encoding and magic comments here:
http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings

Iconv and Kconv on Ruby (1.9.2)

I know that Iconv is used to convert strings' encoding.
From my understandings Kconv is for the same purpose (am I wrong?).
My question is: what is the difference between them, and what should I use for encoding conversions.
btw found some info that Iconv will be deprecated from 1.9.3 version.
As https://stackoverflow.com/users/23649/jtbandes says, it looks Kconv is like Iconv but specialized for Kanji ("the logographic Chinese characters that are used in the modern Japanese writing system along with hiragana" http://en.wikipedia.org/wiki/Kanji). Unless you are working on something specifically Japanese, I'm guessing you don't need Kconv.
If you're using Ruby 1.9, you can use the built-in encoding support most of the time instead of Iconv. I tried for hours to understand what I was doing until I read this:
http://www.joelonsoftware.com/articles/Unicode.html
Then you can start to use stuff like
String#encode # Ruby 1.9
String#encode! # Ruby 1.9
String#force_encoding # Ruby 1.9
with confidence. If you have more complex needs, do read http://blog.grayproductions.net/categories/character_encodings
UPDATED Thanks to JohnZ in the comments
Iconv is still useful in Ruby 1.9 because it can transliterate characters (something that String#encode et al. can't do). Here's an example of how to extend String with a function that transliterates to UTF-8:
require 'iconv'
class ::String
# Return a new String that has been transliterated into UTF-8
# Should work in Ruby 1.8 and Ruby 1.9 thanks to http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/
def as_utf8(from_encoding = 'UTF-8')
::Iconv.conv('UTF-8//TRANSLIT', from_encoding, self + ' ')[0..-2]
end
end
"foo".as_utf8 #=> "foo"
"foo".as_utf8('ISO-8859-1') #=> "foo"
Thanks JohnZ!

Resources