Ruby decode string - ruby

In Ruby, how can I get "b\x81rger" from the string "bürger"?
I need to print special characters to a Zebra printer. I can see that "b\x81rger" prints "bürger", but sending "bürger" does not print the correct character.

Turns out it’s CP850.
Proper solution (Ruby 2.2+)
Normalize the unicode string and then encode it into CP850:
"bürger".unicode_normalize(:nfc).encode(Encoding::CP850)
#⇒ "b\x81rger"
Works for both special characters and combined diacritics.
Fallback solution (before Ruby 2.2)
Encode and pray it’s a composed umlaut:
"bürger".encode(Encoding::CP850)
#⇒ "b\x81rger"

Related

Rails 3.2.21 / ruby 1.9.3 how can I \u encode unicode chars within a string

I need to sanitize some text sent to an email service provider (Sendgrid) that does not support unicode in the recipient name unless it is \u escaped.
When the UTF-8 string is s = "Pablö", how can I "\u escape" any unicode inside the string so that I get "Pabl\u00f6"?
Converting to JSON also escapes the quotes (which I don't want):
"Pablö".to_json
=> "\"Pabl\\u00f6\""
What I'm looking for is something just like .force_encoding('binary') except for Unicode. Inspecting Encoding.aliases.values.uniq I don't see anything like 'unicode'.
I'm going to assume that everything is UTF-8 because we're not cavemen banging rocks together.
to_json isn't escaping quotes, it is adding quotes inside the string (because JSON requires strings to be quoted) and then inspect escapes them (and the backslash).
These quotes from to_json should always be there so you could just strip them off:
"Pablö".to_json[1..-2] # Lots of ways to do this...
=> "Pabl\\u00f6"
Keep in mind, however, that the behavior of to_json with UTF-8 depends on which JSON library you're using and possibly other things. For example, in my stock Ruby 2.2, the standard JSON library leaves UTF-8 alone; the JSON specification is quite happy with UTF-8, so why bother encoding it? You might want to do the escaping yourself with something like:
s.chars.map { |c| c.ord > 127 ? '\u%.4x' % c.ord : c }.join
Anything above 127 is out of ASCII range so that simple ord test takes care of anything like ö, ñ, µ, ... You'll want to adjust the map block if you need to encode other characters (such as \n).
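For example, wrapped in a small helper (a sketch; the method name escape_non_ascii is made up here, and it assumes plain BMP characters):
def escape_non_ascii(s)
  s.chars.map { |c| c.ord > 127 ? '\u%.4x' % c.ord : c }.join
end
escape_non_ascii("Pablö")
=> "Pabl\\u00f6"
That is a literal backslash, u, 0, 0, f, 6, with no surrounding quotes added.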

Escaping special characters in ruby

This is a common question, but I just can't seem to find the answer without resorting to unreliable regular expressions.
Basically if there is a \302\240 or similar combination in a string I want to replace it with the real character.
I am using PLruby for this, hence the warn.
obj = {"a"=>"some string with special chars"}
warn obj.inspect
NOTICE: {"Outputs"=>["a\302\240b"]} <- chars are escaped
warn "\302\240"
NOTICE: <-- there is a non breaking space here, like I want
warn "#{json.inspect}"
NOTICE: {"Outputs"=>["a\302\240"b]} <- chars are escaped
So these can be decoded when I use a string literal, but with the "#{x}" format the \xxx placeholders are never decoded into characters.
How would I assign the same string as the middle command yields?
Ruby Version: 1.8.5
You mentioned that you're using PL/ruby. That suggests that your strings are actually bytea values (the PostgreSQL version of a BLOB) using the old "escape" format. The escape format encodes non-ASCII values in octal with a leading \ so a bit of gsub and Array#pack should sort you out:
bytes = s.gsub(/\\([0-7]{3})/) { [ $1.to_i(8) ].pack('C') }
That will expand the escape values in s to raw bytes and leave them in bytes. You're still dealing with binary data though so just trying to display it on a console won't necessarily do anything useful. If you know that you're dealing with comprehensible strings then you'll have to figure out what encoding they're in and use String methods to sort out the encoding.
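A quick demonstration of the expansion (a sketch; the inspection calls assume a newer Ruby, and on 1.8.5 the expanded bytes can simply be passed along as-is):
s = 'a\302\240b'   # single quotes keep the backslash escapes literal, as PLruby hands them over
bytes = s.gsub(/\\([0-7]{3})/) { [$1.to_i(8)].pack('C') }
bytes.bytes.to_a       #=> [97, 194, 160, 98], i.e. "a", then 0xC2 0xA0 (UTF-8 non-breaking space), then "b"
bytes.force_encoding('UTF-8')   # Ruby 1.9+ only: tag the raw bytes as UTF-8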
Perhaps you just want to use .to_s instead?

Ruby hexacode to unicode conversion

I crawled a website which contains unicode, and the results look something like this in code:
a = "\\u2665 \\uc624 \\ube60! \\uc8fd \\uae30 \\uc804 \\uc5d0"
May I know how do I do it in Ruby to convert it back to the original Unicode text which is in UTF-8 format?
If you have ruby 1.9, you can try:
a.force_encoding('UTF-8')
Otherwise if you have < 1.9, I'd suggest reading this article on converting to UTF-8 in Ruby 1.8.
Short answer: you should be able to 'puts a' and see the string printed out. For me, at least, that string prints correctly in both 1.8.7 and 1.9.2.
Long answer:
First thing: it depends on whether you're using ruby 1.8.7 or 1.9.2, since the way strings and encodings were handled changed.
In 1.8.7:
Strings are just lists of bytes. When you print them out, if your OS can handle it, you can just 'puts a' and it should work correctly. If you do a[0], you'll get the first byte. If you want to get each character, things are pretty darn tricky.
In 1.9.2:
Strings are lists of bytes with an encoding. If the webpage was sent with the correct encoding, your string should already be encoded correctly. If not, you'll have to set it (as per Mike Lewis's answer). If you do a[0], you'll get the first character (the heart). If you want each byte, you can do a.bytes.
If your OS, for whatever reason, is giving you those literal ASCII characters, my previous answer is obviously invalid; disregard it. :P
Here's what you can do:
a.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.to_i(16)].pack("U") }
This will scan for the literal two-character sequence '\u' followed by four hexadecimal digits and replace each occurrence with the corresponding Unicode character.
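Applied to the string above (a sketch; the exact output depends on the crawled page):
decoded = a.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.to_i(16)].pack("U") }
puts decoded   # the escapes are gone; the first character printed is the heart, ♥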
You can also specify the encoding when you open a new IO object: http://www.ruby-doc.org/core/classes/IO.html#M000889
Compared to Mike's solution, this may prevent troubles if you forget to force the encoding before exposing the string to the rest of your application, if there are multiple mechanisms for retrieving strings from your module or class. However, if you begin crawling SJIS or KOI-8 encoded websites, then Mike's solution will be easier to adapt for the character encoding name returned by the web server in its headers.
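For example (a sketch; the filename is hypothetical), reading a crawled page with the encoding declared on the IO object itself:
File.open("crawled_page.html", "r:UTF-8") do |f|
  contents = f.read
  contents.encoding   #=> #<Encoding:UTF-8>
end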

convert unicode into character with ruby

I found a dictionary of Chinese characters in unicode. I'm trying to build a database of characters out of this dictionary, but I don't know how to convert unicode to a character.
p "国".unpack("U*").first # this gives the codepoint 22269
How can I convert 22269 back into the character, i.e. the opposite of the line above?
Ruby 1.9 :
p "国".codepoints.first #=> 22269
p 22269.chr('UTF-8') #=> "国"
[22269].pack('U*') #=> "国" or "\345\233\275"
Edit: Works in 1.8.6+ (verified in 1.8.6, 1.8.7, and 1.9.2). In 1.8.x you get a three-byte string representing the single Unicode character, but using puts on that causes the correct Chinese character to appear in the terminal.
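To build the character database the questioner describes, the same round trip can be run over a whole string (a sketch on Ruby 1.9+; the sample characters stand in for the real dictionary):
dictionary = "国人中"   # stand-in for the dictionary contents
codepoints = Hash[dictionary.each_char.map { |ch| [ch, ch.ord] }]
#=> {"国"=>22269, "人"=>20154, "中"=>20013}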

Remove all but some special characters

I am trying to come up with a regex to remove all special characters except some. For example, I have a string:
str = "subscripción gustaría♥"
I want the output to be "subscripción gustaría".
The way I tried to do it is to match anything which is not an ASCII character (00-7F) and not one of the special characters I want, and replace it with a blank.
str.gsub(/(=?[^\x00-\x7F])(=?^\xC3\xB3)(=?^\xC3\xA1)/,'')
This doesn't work. The last special character is not removed.
Can someone help? (This is ruby 1.8)
Update: I am trying to make the question a little clearer. The string is UTF-8 encoded, and I am trying to whitelist the ASCII characters plus ó and í and blacklist everything else.
Oniguruma has support for all the characters you care about without having to deal with codepoints. You can just add the unicode characters inside the character class you're whitelisting, followed by the 'u' option.
ruby-1.8.7-p248 > str = "subscripción gustaría♥"
=> "subscripci\303\263n gustar\303\255a\342\231\245"
ruby-1.8.7-p248 > puts str.gsub(/[^a-zA-Z\sáéíóúÁÉÍÓÚ]/u,'')
subscripción gustaría
=> nil
str.split('').find_all {|c| (0x00..0x7f).include? c.ord }.join('')
The question is a bit vague. There is not a word about the encoding of the string. Also, do you want to whitelist characters or blacklist them? Which ones?
But you get the idea: decide what you want, then use proper ranges as colleagues here have already proposed. Some examples:
If str = "subscripción gustaría♥" is UTF-8,
then you can blacklist everything above the Latin range (excluding whitespace):
str.gsub(/[^\x{0021}-\x{017E}\s]/,'')
If the string is in the ISO-8859-1 codepage, you can try to match all the quirky characters, like the "heart", from the beginning of the ASCII range:
str.gsub(/[\x01-\x1F]/,'')
The problem is here with regex, has nothing to do with Ruby. You probably will need to experiment more.
It is not completely clear which characters you want to keep and which you want to delete. The example string's character is some Unicode character that, in my browser, displays as a heart symbol. But it seems you are dealing with 8-bit ASCII characters (since you are using ruby 1.8 and your regular expressions point that way).
Nonetheless, you should be able to do it in one of two ways; either specify the characters you want to keep or, alternatively, specify the characters you want to delete. For example, the following specifies that all characters 0x00-0x7F and 0xC0-0xF6 should be kept (remove everything that is not in that group):
puts str.gsub(/[^\x00-\x7F\xC0-\xF6]/,'')
This next example specifies that characters 0xA1 and 0xC3 should be deleted.
puts str.gsub(/[\xA1\xC3]/,'')
I ended up doing this: str.gsub(/[^\x00-\x7FÁáÉéÍíÑñÓóÚúÜü]/,''). It doesn't work on my mac but works on linux.
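The mac/linux difference is typical of Ruby 1.8, where multibyte regexp behaviour depends on $KCODE and the environment. On Ruby 1.9+ encodings are explicit, and the same whitelist works as written (a sketch, assuming the source file and the string are UTF-8):
str = "subscripción gustaría♥"
str.gsub(/[^\x00-\x7FÁáÉéÍíÑñÓóÚúÜü]/, '')
#=> "subscripción gustaría"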
