Convert weird characters - Ruby - ruby

How can we convert with Ruby a string such as this one:
๐‘™๐‘Ž๐‘ก๐‘œ๐‘Ÿ๐‘Ÿ๐‘’
To:
Latorre

s = "๐‘™๐‘Ž๐‘ก๐‘œ๐‘Ÿ๐‘Ÿ๐‘’"
=> "๐‘™๐‘Ž๐‘ก๐‘œ๐‘Ÿ๐‘Ÿ๐‘’"
s.unicode_normalize(:nfkc).capitalize
=> "Latorre"
Here is a great article about unicode normalization: https://blog.daftcode.pl/fixing-unicode-for-ruby-developers-60d7f6377388

Related

Convert an emoji to HTML UTF-8 in Ruby

I have a rails server running, where I have a bunch of clocks emojis, and I want to render them to the HTML. The emojis are in ASCII format:
ch = "\xF0\x9F\x95\x8f" ; 12.times.map { ch.next!.dup }.rotate(-1)
# => ["๐Ÿ•›", "๐Ÿ•", "๐Ÿ•‘", "๐Ÿ•’", "๐Ÿ•“", "๐Ÿ•”", "๐Ÿ••", "๐Ÿ•–", "๐Ÿ•—", "๐Ÿ•˜", "๐Ÿ•™", "๐Ÿ•š"]
What I want is this:
> String.define_method(:to_html_utf8) { chars.map! { |x| "&#x#{x.dump[3..-2].delete('{}')};" }.join }
> ch = "\xF0\x9F\x95\x8f" ; 12.times.map { ch.next!.to_html_utf8 }.rotate(-1)
# => ["๐Ÿ•›", "๐Ÿ•", "๐Ÿ•‘", "๐Ÿ•’", "๐Ÿ•“", "๐Ÿ•”", "๐Ÿ••", "๐Ÿ•–", "๐Ÿ•—", "๐Ÿ•˜", "๐Ÿ•™", "๐Ÿ•š"]
> ?๐Ÿ–„.to_html_utf8
# => "๐Ÿ–„"
> "๐Ÿญ๐Ÿน".to_html_utf8
#=> "๐Ÿญ๐Ÿน"
As you can see the to_html_utf8 does use some brute force way to get the job done.
Is there a better way to convert the emojis in aforementioned html compatible UTF-8?
Please note that it would be better to avoid and rails helpers or rails stuff in general, and it can be run with ruby 2.7+ only using standard library stuff.
The emojis are in ASCII format:
ch = "\xF0\x9F\x95\x8f"
0xf0 0x9f 0x95 0x8f is the character's UTF-8 byte sequence. Don't use that unless you absolutely have to. It's much easier to enter the emojis directly, e.g.:
ch = '๐Ÿ•'
#=> "๐Ÿ•"
or to use the character's codepoint, e.g.:
ch = "\u{1f550}"
#=> "๐Ÿ•"
ch = 0x1f550.chr('UTF-8')
#=> "๐Ÿ•"
You can usually just render that character into your HTML page if the "charset" is UTF-8.
If you want to turn the string's characters into their numeric character reference counterparts yourself, you could use:
ch.codepoints.map { |cp| format('&#x%x;', cp) }.join
#=> "๐Ÿ•"
Note that the conversion is trivial โ€“ 1f550 is simply the character's (hex) codepoint.
The easiest way is to simply use UTF-8 natively and not escaping anything.

Uuencode vs Base64 encode: why do pack('m') and pack('u') in Ruby return strings of different lengths?

According to the specs they should be same length, and a string of length 36 should translate to a string of length 48, for example:
bin = "123456789012345678901234567890123456"
[49] pry(main)> [bin].pack("m").length
=> 49
[50] pry(main)> [bin].pack("u").length
=> 50
[54] pry(main)> [bin].pack("m")
=> "MTIzNDU2Nzg5MDEyMzQ1Njc4OTAxMjM0NTY3ODkwMTIzNDU2\n"
[55] pry(main)> [bin].pack("u")
=> "D,3(S-#4V-S#Y,\#$R,S0U-C<X.3`Q,C,T-38W.#DP,3(S-#4V\n"
Compensating for the "funny newline" we get the proper length in the base64 encoding (the pack('m') variant), but I don't know how to get the line length right in the uuencoding (the pack('u') variant).
I really need that uuencoded string to be 48 chars long :) what's the issue here?
Update
I did my own uuencode implementation, created a method that generates bitmap and then split the bitmap etc to make a uuencode implementation, as the provider of the specification helpfully explained in the spec
def to_bitmap bytes
bytes.scan(/./).map{|b| b.ord.to_s(2).rjust(8, "0")}.join
end
[5] pry(main)> to_bitmap(str).scan(/.{6}/).map{|b| (from_bitmap("00"+b).ord+0x20).chr }.join
=> ",3(S-#4V-S#Y,\#$R,S0U-C<X.3 Q,C,T-38W.#DP,3(S-#4V"
[6] pry(main)> to_bitmap(str).scan(/.{6}/).map{|b| (from_bitmap("00"+b).ord+0x20).chr }.join.length
=> 48
and I assume this is the good thing, it's kind of like uuencode, but differs in a couple of places:
,3(S-#4V-S#Y,\#$R,S0U-C<X.3 Q,C,T-38W.#DP,3(S-#4V
D,3(S-#4V-S#Y,\#$R,S0U-C<X.3`Q,C,T-38W.#DP,3(S-#4V\n
Wierd, I guess it's the specification I'm implementing is using a "uuencode" and not quite uuencode though they claim that generic software libraries support this format, am I missing something or does this seem like bullshit and workaround for somebodies half-assed implementation of uuencode?

Ruby string escape for supplementary plane Unicode characters

I know that I can escape a basic Unicode character in Ruby with the \uNNNN escape sequence. For example, for a smiling face U+263A (โ˜บ) I can use the string literal "\u2603".
How do I escape Unicode characters greater than U+FFFF that fall outside the basic multilingual plane, like a winking face: U+1F609 (๐Ÿ˜‰)?
Using the surrogate pair form like in Java doesn't work; it results in an invalid string that contains the individual surrogate code points:
s = "\uD83D\uDE09" # => "\xED\xA0\xBD\xED\xB8\x89"
s.valid_encoding? # => false
You can use the escape sequence \u{XXXXXX}, where XXXXXX is between 1 and 6 hex digits:
s = "\u{1F609}" # => "๐Ÿ˜‰"
The braces can also contain multiple runs separated by single spaces or tabs to encode multiple characters:
s = "\u{41f 440 438 432 435 442 2c 20 43c 438 440}!" # => "ะŸั€ะธะฒะตั‚, ะผะธั€!"
You could also use byte escapes to write a literal that contains the UTF-8 encoding of the character, though that's not very convenient, and doesn't necessarily result in a UTF-8-encoded string, if the file encoding differs:
# encoding: utf-8
s = "\xF0\x9F\x98\x89" # => "๐Ÿ˜‰"
s.length # => 1
# encoding: iso-8859-1
s = "\xF0\x9F\x98\x89" # => "\xF0\x9F\x98\x89"
s.length # => 4

Net::Telnet - puts or print string in UTF-8

I'm using an API in which I have to send client informations as a Json-object over a telnet connection (very strange, I know^^).
I'm german so the client information contains very often umlauts or the รŸ.
My procedure:
I generate a Hash with all the command information.
I convert the Hash to a Json-object.
I convert the Json-object to a string (with .to_s).
I send the string with the Net::Telnet.puts command.
My puts command looks like: (cmd is the Json-object)
host.puts(cmd.to_s.force_encoding('UTF-8'))
In the log files I see, that the Json-object don't contain the umlauts but for example this: รƒยผ instead of รผ.
I proved that the string is (with or without the force_encoding() command) in UTF-8. So I think that the puts command doesn't send the strings in UTF-8.
Is it possible to send the command in UTF-8? How can I do this?
The whole methods:
host = Net::Telnet::new(
'Host' => host_string,
'Port' => port_integer,
'Output_log' => 'log/'+Time.now.strftime('%Y-%m-%d')+'.log',
'Timeout' => false,
'Telnetmode' => false,
'Prompt' => /\z/n
)
def send_cmd_container(host, cmd, params=nil)
cmd = JSON.generate({'*C'=>'se','Q'=>[get_cmd(cmd, params)]})
host.puts(cmd.to_s.force_encoding('UTF-8'))
add_request_to_logfile(cmd)
end
def get_cmd(cmd, params=nil)
if params == nil
return {'*C'=>'sq','CMD'=>cmd}
else
return {'*C'=>'sq','CMD'=>cmd,'PARAMS'=>params}
end
end
Addition:
I also log my sended requests by this method:
def add_request_to_logfile(request_string)
directory = 'log/'
File.open(File.join(directory, Time.now.strftime('%Y-%m-%d')+'.log'), 'a+') do |f|
f.puts ''
f.puts '> '+request_string
end
end
In the logfile my requests also don't contain UTF-8 umlauts but for example this: รƒยผ
TL;DR
Set 'Binmode' => true and use Encoding::BINARY.
The above should work for you. If you're interested in why, read on.
Telnet doesn't really have a concept of "encoding." Telnet just has two modes: Normal mode assumes you're sending 7-bit ASCII characters, and binary mode assumes you're sending 8-bit bytes. You can't tell Telnet "this is UTF-8" because Telnet doesn't know what that means. You can tell it "this is ASCII-7" or "this is a sequence of 8-bit bytes," and that's it.
This might seem like bad news, but it's actually great news, because it just so happens that UTF-8 encodes text as sequences of 8-bit bytes. frรผh, for example, is five bytes: 66 72 c3 bc 68. This is easy to confirm in Ruby:
puts str = "\x66\x72\xC3\xBC\x68"
# => frรผh
puts str.bytes.size
# => 5
In Net::Telnet we can turn on binary mode by passing the 'Binmode' => true option to Net::Telnet::new. But there's one more thing we have to do: Tell Ruby to treat the string like binary data, i.e. a sequence of 8-bit bytes.
You already tried to use String#force_encoding, but what you might not have realized is that String#force_encoding doesn't actually convert a string from one encoding to another. Its purpose isn't to change the data's encodingโ€”its purpose is to tell Ruby what encoding the data is already in:
str = "frรผh" # => "frรผh"
p str.encoding # => #<Encoding:UTF-8>
p str[2] # => "รผ"
p str.bytes # => [ 102, 114, 195, 188, 104 ] # This is the decimal represent-
# ation of the hexadecimal bytes
# we saw before, `66 72 c3 bc 68`
str.force_encoding(Encoding::BINARY) # => "fr\xC3\xBCh"
p str[2] # => "\xC3"
p str.bytes # => [ 102, 114, 195, 188, 104 ] # Same bytes!
Now I'll let you in on a little secret: Encoding::BINARY is just an alias for Encoding::ASCII_8BIT. Since ASCII-8BIT doesn't have multi-byte characters, Ruby shows รผ as two separate bytes, \xC3\xBC. Those bytes aren't printable characters in ASCII-8BIT, so Ruby shows the \x## escape codes instead, but the data hasn't changedโ€”only the way Ruby prints it has changed.
So here's the thing: Even though Ruby is now calling the string BINARY or ASCII-8BIT instead of UTF-8, it's still the same bytes, which means it's still UTF-8. Changing the encoding it's "tagged" as, however, means when Net::Telnet does (the equivalent of) data[n] it will always get one byte (instead of potentially getting multi-byte characters as in UTF-8), which is just what we want.
And so...
host = Net::Telnet::new(
# ...all of your other options...
'Binmode' => true
)
def send_cmd_container(host, cmd, params=nil)
cmd = JSON.generate('*C' => 'se','Q' => [ get_cmd(cmd, params) ])
cmd.force_encoding(Encoding::BINARY)
host.puts(cmd)
# ...
end
(Note: JSON.generate always returns a UTF-8 string, so you never have to do e.g. cmd.to_s.)
Useful diagnostics
A quick way to check what data Net::Telnet is actually sending (and receiving) is to set the 'Dump_log' option (in the same way you set the 'Output_log' option). It will write both sent and received data to a log file in hexdump format, which will allow you to see if the bytes being sent are correct. For example, I started a test server (nc -l 5555) and sent the string frรผh (host.puts "frรผh".force_encoding(Encoding::BINARY)), and this is what was logged:
> 0x00000: 66 72 c3 bc 68 0a fr..h.
You can see that it sent six bytes: the first two are f and r, the next two make up รผ, and the last two are h and a newline. On the right, bytes that aren't printable characters are shown as ., ergo fr..h.. (By the same token, I sent the string IโคNY and saw I...NY. in the right column, because โค is three bytes in UTF-8: e2 9d a4).
So, if you set 'Dump_log' and send a รผ, you should see c3 bc in the output. If you do, congratulationsโ€”you're sending UTF-8!
P.S. Read Yehuda Katz' article Ruby 1.9 Encodings: A Primer and the Solution for Rails. In fact, read it yearly. It's really, really useful.

Ruby windows-1250 encoding

I'm trying to get data from site with charset windows-1250
I have this code:
require 'open-uri'
p open('http://www.ceskybenzin.cz/mapa/0').read.force_encoding('Windows-1250').encode('UTF-8').scan /addMarker\( point, '(.*?) - (.*?) - (.*?) - (.*?)', 'green', (.*?), bublina, 0 \);/
and I'm getting data like:
["EuroOil", "Prun\u00E9\u0159ov ", "U\u0161\u00E1k", "Zat\u00EDm nezadan\u00FD kraj", "181"]
could someone tell me how to correctly get data from windows-1250 site
Thank you
a[0] => ["Kont.cz (NOVA-KONT)", "Praha 4", "Opatovsk\xC3\xA1", "Hlavn\u00ED m\u011Bsto Praha", "1"]
a.last => ["EuroOil", "Prun\u00E9\u0159ov ", "U\u0161\u00E1k", "Zat\u00EDm nezadan\u00FD kraj", "181"]
a.last.select { |i| puts i.encode("utf-8") } => produces
EuroOil
Prunรฉrov
Usรกk
Zatรญm nezadanรฝ kraj
181
you have unicode-8 symbols in your data not win-1250.
to convert your current example string to correct text you can do this
data = ["EuroOil", "Prun\u00E9\u0159ov ", "U\u0161\u00E1k", "Zat\u00EDm nezadan\u00FD kraj", "181"]
data.select{|snippet| snippet.encode("UTF-8")}
=> ["EuroOil", "Prunรฉล™ov ", "Uลกรกk", "Zatรญm nezadanรฝ kraj", "181"]
if output you exampled is from console, then this is because console outputs with utf-8 encoding not with encoding of your source site (and maybe parsing works correctly until it displays)

Resources