Any PHP or Ruby library to convert Traditional Chinese to Simplified Chinese or vice versa? - iconv

Is there any PHP or Ruby library to convert Traditional Chinese to Simplified Chinese or vice versa (Big5 <--> GB)? The iconv library won't do it, as it merely converts the encoding, but the glyphs stay the same.

Try this class for PHP - http://www.phpclasses.org/browse/package/3130.html

You might get some leverage with Ruby 1.9:
Encoding.constants.grep /gb/i
=> [:GB18030, :GBK, :GB1988, :GB12345]
Encoding.constants.grep /big5/i
=> [:Big5, :BIG5, :Big5_HKSCS, :BIG5_HKSCS, :Big5_UAO, :BIG5_UAO]
so it's something like the approach from "How can I convert a string from windows-1252 to utf-8 in Ruby?":
original = File.open('name', 'r:original_encoding').read
converted = original.encode('new_encoding')  # encode transcodes; force_encoding would only relabel the bytes
Though I've never tried it.
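For the Big5/GB case specifically, here is a minimal sketch (Ruby 1.9+, hypothetical file names). Note that it only transcodes the byte encoding; it will not map Traditional glyphs to Simplified ones, which is exactly the limitation the question describes:
# Read Big5-encoded text and write it back out as GB18030 (encoding change only).
big5_text = File.open('traditional.txt', 'r:Big5').read
gb_text   = big5_text.encode('GB18030')
File.open('output_gb.txt', 'w:GB18030') { |f| f.write(gb_text) }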

Related

How can I transcode US-ASCII HTML entities, e.g. &eacute;, to UTF-8's é under Linux?

A quick web search will confirm that US-ASCII is a subset of UTF-8, but what I've not yet found is how to convert named entities like &foo; and numeric character references to their corresponding native UTF-8 characters.
I know that 7-bit US-ASCII is unchanged in UTF-8, but I haven't yet seen a program that filters the text through and converts &foo; to how it would naturally be expressed in UTF-8.
You can use html_entity_decode($s, ENT_QUOTES, 'UTF-8') in PHP or html.unescape(s) in Python.
https://www.php.net/manual/en/function.html-entity-decode.php
https://docs.python.org/3/library/html.html#html.unescape
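For Ruby, a comparable sketch using the third-party htmlentities gem (assumed installed; the built-in CGI.unescapeHTML only handles a small set of named entities):
require 'htmlentities'
# Decodes named entities and numeric character references into UTF-8 characters.
HTMLEntities.new.decode('caf&eacute; &amp; more')  # => "café & more"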

Convert a unicode string to characters in Ruby?

I have the following string:
l\u0092issue
My question is: how do I convert it to UTF-8 characters?
I have tried this:
1.9.3p484 :024 > "l\u0092issue".encode('utf-8')
=> "l\u0092issue"
You seem to have got your encodings into a bit of a mix-up. If you haven’t already, you should first read Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) which provides a good introduction to this type of thing. There is a good set of articles on how Ruby handles character encodings at http://graysoftinc.com/character-encodings/understanding-m17n-multilingualization. You could also have a look at the Ruby docs for String and Encoding.
In this specific case, the string l\u0092issue means that the second character is the one with Unicode codepoint U+0092. This codepoint is PRIVATE USE TWO (see the chart), which basically means this position isn’t used.
However, looking at the Windows CP-1252 encoding, position 0x92 is occupied by the character ’, so if this is the missing character then the string would be l’issue, which looks a lot more likely even though I don’t speak French.
What I suspect has happened is your program has received the string l’issue encoded in CP-1252, but has assumed it was encoded in ISO-8859-1 (ISO-8859-1 and CP-1252 are quite closely related) and re-encoded it to UTF-8 leaving you with the string you now have.
The real fix for you is to be careful about the encodings of any strings that enter (and leave) your program, and how you manage them.
To transform your string to l’issue, you can encode it back to ISO-8859-1, then use force_encoding to tell Ruby the real encoding is CP-1252, and then you can re-encode to UTF-8:
2.1.0 :001 > s = "l\u0092issue"
=> "l\u0092issue"
2.1.0 :002 > s = s.encode('iso-8859-1')
=> "l\x92issue"
2.1.0 :003 > s.force_encoding('cp1252')
=> "l\x92issue"
2.1.0 :004 > s.encode('utf-8')
=> "l’issue"
This is only really a demonstration of what is going on though. The real solution is to make sure you’re handling encodings correctly.
That is already encoded as UTF-8 (unless you changed the original string encoding). Ruby is just showing you the escape sequences when you inspect the string (which is what IRB does there); \u0092 is the escape sequence for that character.
Try puts "l\u0092issue" to see the rendered character, if your terminal font supports it.

How do pack and unpack guess the character encoding when converting to and from UTF-8?

Suppose I want to convert "\xBD" to UTF-8.
If I use pack & unpack, I'll get ½:
puts "\xBD".unpack('C*').pack('U*') #=> ½
as "\xBD" is ½ in ISO-8859-1.
BUT "\xBD" is œ in ISO-8859-9.
My question is: why pack used ISO-8859-1 instead of ISO-8859-9 to convert the char to UTF-8? Is there some way to configure that character encoding?
I know I can use Iconv in Ruby 1.8.7, and String#encode in 1.9.2, but I'm curious about pack because I use it in some code.
This actually has nothing to do with how \xBD is represented in ISO-8859-x. The critical part is the pack into UTF-8.
The pack receives [189]. The code point 189 is defined in UTF-8 (more precisely, Unicode) as ½. Don't think of this as the Unicode spec writers "preferring" ISO-8859-1 over ISO-8859-15; they had to choose which code point represented ½, and they chose 189 (in fact, the first 256 Unicode code points deliberately match ISO-8859-1).
Since you're trying to learn more about pack/unpack, let me explain more:
When you unpack with the C directive, Ruby interprets the string as ASCII-8BIT (raw bytes) and extracts the byte values. In this case \xBD translates to 0xBD, a.k.a. 189. This is a really basic conversion.
When you pack with the U directive, Ruby treats each integer in the array as a Unicode codepoint and encodes it as UTF-8.
pack/unpack have very specific behavior depending on the directives you provide. I suggest reading up on them at ruby-doc.org. Some of the directives still don't make sense to me, so don't be discouraged.
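If what you actually want is "interpret these bytes as a particular ISO-8859 encoding and give me UTF-8", here is a sketch (Ruby 1.9+) that states the source encoding explicitly instead of relying on pack treating byte values as codepoints:
# Label the raw byte with its real encoding, then transcode to UTF-8.
"\xBD".force_encoding('ISO-8859-1').encode('UTF-8')   # => "½"
"\xBD".force_encoding('ISO-8859-15').encode('UTF-8')  # => "œ"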

Ruby hex code to Unicode conversion

I crawled a website which contains Unicode, and the results look something like this in code:
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
How do I convert it back in Ruby to the original Unicode text, which is in UTF-8 format?
If you have ruby 1.9, you can try:
a.force_encoding('UTF-8')
Otherwise if you have < 1.9, I'd suggest reading this article on converting to UTF-8 in Ruby 1.8.
Short answer: you should be able to puts a and see the string printed out. For me, at least, that string prints in both 1.8.7 and 1.9.2.
Long answer:
First thing: it depends on whether you're using Ruby 1.8.7 or 1.9.2, since the way strings and encodings are handled changed between them.
In 1.8.7:
Strings are just lists of bytes. When you print them out, if your OS can handle it, you can just puts a and it should work correctly. If you do a[0], you'll get the first byte. If you want to get each character, things are pretty darn tricky.
In 1.9.2:
Strings are lists of bytes with an encoding. If the webpage was sent with the correct encoding, your string should already be encoded correctly. If not, you'll have to set it (as per Mike Lewis's answer). If you do a[0], you'll get the first character (the heart). If you want each byte, you can do a.bytes.
If your OS, for whatever reason, is giving you those literal ASCII characters, my previous answer is obviously invalid; disregard it. :P
Here's what you can do:
a.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.to_i(16)].pack('U') }
This will scan for the ASCII sequence \u followed by four hex digits and replace it with the corresponding Unicode character.
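Applied to the string from the question, that should give something like:
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
decoded = a.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.to_i(16)].pack('U') }
puts decoded  # => ♥ 오 빠! 죽 기 전 에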
You can also specify the encoding when you open a new IO object: http://www.ruby-doc.org/core/classes/IO.html#M000889
Compared to Mike's solution, this may prevent troubles if you forget to force the encoding before exposing the string to the rest of your application, if there are multiple mechanisms for retrieving strings from your module or class. However, if you begin crawling SJIS or KOI-8 encoded websites, then Mike's solution will be easier to adapt for the character encoding name returned by the web server in its headers.
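A minimal sketch of that approach for a local file (Ruby 1.9+); the encoding names are just placeholders for whatever the source actually uses:
# 'r:EXTERNAL:INTERNAL' reads the bytes as EXTERNAL and transcodes them to INTERNAL.
html = File.open('page.html', 'r:ISO-8859-1:UTF-8') { |f| f.read }
html.encoding  # => #<Encoding:UTF-8>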

Encoding error in content fetched with open-uri in Ruby on Rails

In some cases when I use open to get a web page in Ruby the content of the page has an encoding error. Example:
open("http://www.google.com.br").read
Chars like ç and ã are replaced by ?
How can I get the right chars?
This seems to work:
require 'iconv'
require 'open-uri'
i = Iconv.new('UTF-8', 'LATIN1')
i.iconv(open('http://google.com.br').read)
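On Ruby 1.9+, where Iconv is deprecated, a roughly equivalent sketch (assuming the page really is Latin-1; newer Rubies want URI.open instead of open):
require 'open-uri'
html = open('http://google.com.br').read
html.force_encoding('ISO-8859-1').encode('UTF-8')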
Running Ruby 1.9.2 here. Your code yields HTML which contains words like this:
Configura\xE7\xF5es
So on my work machine at least (Vista, using the Windows CMD console), the page comes back as raw Latin-1 bytes shown as escape sequences (\xE7\xF5 is "çõ" in Latin-1), rather than the characters themselves.
Also, as far as I know, Ruby 1.9.2 is "almost" fully Unicode compliant, so I am guessing you shouldn't have UTF-8 issues unless your console cannot handle printing UTF-8 characters.
Hope that helps.
