I have a Rails server running with a bunch of clock emojis that I want to render into the HTML. The emojis are in ASCII format:
ch = "\xF0\x9F\x95\x8f" ; 12.times.map { ch.next!.dup }.rotate(-1)
# => ["🕛", "🕐", "🕑", "🕒", "🕓", "🕔", "🕕", "🕖", "🕗", "🕘", "🕙", "🕚"]
What I want is this:
> String.define_method(:to_html_utf8) { chars.map! { |x| "&#x#{x.dump[3..-2].delete('{}')};" }.join }
> ch = "\xF0\x9F\x95\x8f" ; 12.times.map { ch.next!.to_html_utf8 }.rotate(-1)
# => ["&#x1F55B;", "&#x1F550;", "&#x1F551;", "&#x1F552;", "&#x1F553;", "&#x1F554;", "&#x1F555;", "&#x1F556;", "&#x1F557;", "&#x1F558;", "&#x1F559;", "&#x1F55A;"]
> ?🖄.to_html_utf8
# => "&#x1F584;"
> "🐭🐹".to_html_utf8
#=> "&#x1F42D;&#x1F439;"
As you can see, to_html_utf8 uses a rather brute-force way to get the job done.
Is there a better way to convert the emojis to the aforementioned HTML-compatible form?
Please note that it would be better to avoid any Rails helpers or Rails stuff in general, and it should run on Ruby 2.7+ using only standard library stuff.
The emojis are in ASCII format:
ch = "\xF0\x9F\x95\x8f"
0xf0 0x9f 0x95 0x8f is the character's UTF-8 byte sequence. Don't use that unless you absolutely have to. It's much easier to enter the emojis directly, e.g.:
ch = '🕐'
#=> "🕐"
or to use the character's codepoint, e.g.:
ch = "\u{1f550}"
#=> "🕐"
ch = 0x1f550.chr('UTF-8')
#=> "🕐"
You can usually just render that character into your HTML page if the "charset" is UTF-8.
If you want to turn the string's characters into their numeric character reference counterparts yourself, you could use:
ch.codepoints.map { |cp| format('&#x%x;', cp) }.join
#=> "&#x1f550;"
Note that the conversion is trivial – 1f550 is simply the character's (hex) codepoint.
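For reuse, that one-liner can be wrapped in a small helper (the method name is my own invention):

```ruby
# Turn every character of a string into a hexadecimal numeric
# character reference, e.g. "\u{1F550}" becomes "&#x1f550;".
def to_html_entities(str)
  str.codepoints.map { |cp| format('&#x%x;', cp) }.join
end

puts to_html_entities("\u{1F42D}\u{1F439}")  # => &#x1f42d;&#x1f439;
```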
The easiest way is to simply use UTF-8 natively and not escape anything.
I have a string like this
"base: [_0x3e63[241], _0x3e63[242]],
gray: [_0x3e63[243], _0x3e63[244], _0x3e63[245], _0x3e63[246], _0x3e63[247], _0x3e63[248], _0x3e63[249], _0x3e63[250], _0x3e63[251], _0x3e63[252]],
red: [_0x3e63[253], _0x3e63[254], _0x3e63[255], _0x3e63[256], _0x3e63[257], _0x3e63[258], _0x3e63[259], _0x3e63[260], _0x3e63[261], _0x3e63[262]],
pink: [_0x3e63[263], _0x3e63[264], _0x3e63[265], _0x3e63[266], _0x3e63[267], _0x3e63[268], _0x3e63[269], _0x3e63[270], _0x3e63[271], _0x3e63[272]],
grape: [_0x3e63[273], _0x3e63[274], _0x3e63[275], _0x3e63[276], _0x3e63[277], _0x3e63[278], _0x3e63[279], _0x3e63[280], _0x3e63[281], _0x3e63[282]],
violet: [_0x3e63[283], _0x3e63[284], _0x3e63[285], _0x3e63[286], _0x3e63[287], _0x3e63[288], _0x3e63[289], _0x3e63[290], _0x3e63[291], _0x3e63[292]],
indigo: [_0x3e63[293], _0x3e63[294], _0x3e63[295], _0x3e63[296], _0x3e63[297], _0x3e63[298], _0x3e63[299], _0x3e63[300], _0x3e63[301], _0x3e63[302]],
blue: [_0x3e63[303], _0x3e63[304], _0x3e63[305], _0x3e63[306], _0x3e63[307], _0x3e63[308], _0x3e63[309], _0x3e63[310], _0x3e63[311], _0x3e63[312]],
cyan: [_0x3e63[313], _0x3e63[314], _0x3e63[315], _0x3e63[316], _0x3e63[317], _0x3e63[318], _0x3e63[319], _0x3e63[320], _0x3e63[321], _0x3e63[322]],
teal: [_0x3e63[323], _0x3e63[324], _0x3e63[325], _0x3e63[326], _0x3e63[327], _0x3e63[328], _0x3e63[329], _0x3e63[330], _0x3e63[331], _0x3e63[332]],
green: [_0x3e63[333], _0x3e63[334], _0x3e63[335], _0x3e63[336], _0x3e63[337], _0x3e63[338], _0x3e63[339], _0x3e63[340], _0x3e63[341], _0x3e63[342]],
lime: [_0x3e63[343], _0x3e63[344], _0x3e63[345], _0x3e63[346], _0x3e63[347], _0x3e63[348], _0x3e63[349], _0x3e63[350], _0x3e63[351], _0x3e63[352]],
yellow: [_0x3e63[353], _0x3e63[354], _0x3e63[355], _0x3e63[356], _0x3e63[357], _0x3e63[358], _0x3e63[359], _0x3e63[360], _0x3e63[361], _0x3e63[362]],
orange: [_0x3e63[363], _0x3e63[364], _0x3e63[365], _0x3e63[366], _0x3e63[367], _0x3e63[368], _0x3e63[369], _0x3e63[370], _0x3e63[371], _0x3e63[372]]"
_0x3e63 is a Ruby array with the values:
_0x3e63 = ["#f783ac", "#faa2c1", "#fcc2d7", "#ffdeeb", "#fff0f6", "#862e9c", "#9c36b5", "#ae3ec9", "#be4bdb", "#cc5de8", "#da77f2", "#e599f7", "#eebefa", "#f3d9fa", "#f8f0fc", "#5f3dc4", "#6741d9", "#7048e8", "#7950f2", "#845ef7", "#9775fa", "#b197fc", "#d0bfff", "#e5dbff", "#f3f0ff", "#364fc7", "#3b5bdb", "#4263eb", "#4c6ef5", "#5c7cfa", "#748ffc", "#91a7ff", "#bac8ff", "#dbe4ff", "#edf2ff", "#1864ab", "#1971c2", "#1c7ed6", "#228be6", "#339af0", "#4dabf7", "#74c0fc", "#a5d8ff", "#d0ebff", "#e7f5ff", "#0b7285", "#0c8599", "#1098ad", "#15aabf", "#22b8cf", "#3bc9db", "#66d9e8", "#99e9f2", "#c5f6fa", "#e3fafc", "#087f5b", "#099268", "#0ca678", "#12b886", "#20c997", "#38d9a9", "#63e6be", "#96f2d7", "#c3fae8", "#e6fcf5", "#2b8a3e", "#2f9e44", "#37b24d", "#40c057", "#51cf66", "#69db7c", "#8ce99a", "#b2f2bb", "#d3f9d8", "#ebfbee", "#5c940d", "#66a80f", "#74b816", "#82c91e", "#94d82d", "#a9e34b", "#c0eb75", "#d8f5a2", "#e9fac8", "#f4fce3", "#e67700", "#f08c00", "#f59f00", "#fab005", "#fcc419", "#ffd43b", "#ffe066", "#ffec99", "#fff3bf", "#fff9db", "#d9480f", "#e8590c"]
I cannot find a way to replace each _0x3e63[xxxxxxx] in the string with the corresponding value from the array.
Use String#gsub with a block.
Assuming your input string is stored in the variable input, the following code does the replacement and displays the result:
puts input.gsub(/_0x3e63\[(\d+)\]/) { |s| _0x3e63[$1.to_i] }
(The array _0x3e63 you posted in the question does not contain enough values to have indices like 247 or 251 but the code works nevertheless.)
The code is very simple. The regular expression /_0x3e63\[(\d+)\]/ matches any string that starts with _0x3e63[, continues with one or more digits (\d+) and ends with ].
For each match the block is executed and the value returned by the block is used to replace the matched piece of the original string.
The replacement uses $1 (that contains the sub-string that matches the first capturing group) as an index into the array _0x3e63. Because the value of $1 is a string, .to_i is used to convert it to a number (required to be used as index in the array).
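A self-contained sketch of the same idea, with a shorter made-up array in place of the full one:

```ruby
_0x3e63 = ["#f783ac", "#faa2c1", "#fcc2d7"]
input = "base: [_0x3e63[0], _0x3e63[2]]"

# (\d+) captures the digits; inside the block $1 holds that capture,
# and to_i turns it into an integer array index.
result = input.gsub(/_0x3e63\[(\d+)\]/) { _0x3e63[$1.to_i] }
puts result  # => base: [#f783ac, #fcc2d7]
```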
We are given:
str = <<~END
base: [arr[6], arr[3]],
gray: [arr[0], arr[4], arr[1], arr[5]],
red: [arr[2]]
END
#=> "base: [arr[6], arr[3]],\ngray: [arr[0], arr[4], arr[1], arr[5]],\nred: [arr[2]]\n"
and
arr = ["#f783ac", "#faa2c1", "#fcc2d7", "#ffdeeb", "#fff0f6", "#862e9c",
"#9c36b5"]
We can perform the required replacements by using String#gsub with a regular expression and Kernel#eval:
puts str.gsub(/\barr\[\d+\]/) { |s| eval s }
base: [#9c36b5, #ffdeeb],
gray: [#f783ac, #fff0f6, #faa2c1, #862e9c],
red: [#fcc2d7]
The regular expression performs the following operations:
\b # match a word break (to avoid matching 'gnarr')
arr\[ # match string 'arr['
\d+ # match 1+ digits
\] # match ']'
Rubular
One must be cautious about using eval (to avoid launching missiles inadvertently, for example), but as long as the matched parts of the string can be trusted it's a safe and useful method.
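If the string cannot be fully trusted, the same result can be had without eval by capturing the index and looking it up directly:

```ruby
arr = ["#f783ac", "#faa2c1", "#fcc2d7", "#ffdeeb", "#fff0f6", "#862e9c",
       "#9c36b5"]
str = "base: [arr[6], arr[3]],\nred: [arr[2]]\n"

# (\d+) captures only the digits, so no matched text is ever executed.
safe = str.gsub(/\barr\[(\d+)\]/) { arr[$1.to_i] }
puts safe
# base: [#9c36b5, #ffdeeb],
# red: [#fcc2d7]
```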
I'm using an API in which I have to send client information as a JSON object over a Telnet connection (very strange, I know^^).
I'm German, so the client information very often contains umlauts or the ß.
My procedure:
I generate a hash with all the command information.
I convert the hash to a JSON object.
I convert the JSON object to a string (with .to_s).
I send the string with the Net::Telnet puts command.
My puts command looks like this (cmd is the JSON object):
host.puts(cmd.to_s.force_encoding('UTF-8'))
In the log files I see that the JSON object doesn't contain the umlauts but, for example, ü instead of ü.
I verified that the string is in UTF-8 (with or without the force_encoding call). So I think the puts command doesn't send the strings as UTF-8.
Is it possible to send the command in UTF-8? How can I do this?
The whole methods:
host = Net::Telnet::new(
'Host' => host_string,
'Port' => port_integer,
'Output_log' => 'log/'+Time.now.strftime('%Y-%m-%d')+'.log',
'Timeout' => false,
'Telnetmode' => false,
'Prompt' => /\z/n
)
def send_cmd_container(host, cmd, params=nil)
cmd = JSON.generate({'*C'=>'se','Q'=>[get_cmd(cmd, params)]})
host.puts(cmd.to_s.force_encoding('UTF-8'))
add_request_to_logfile(cmd)
end
def get_cmd(cmd, params=nil)
if params == nil
return {'*C'=>'sq','CMD'=>cmd}
else
return {'*C'=>'sq','CMD'=>cmd,'PARAMS'=>params}
end
end
Addition:
I also log my sent requests with this method:
def add_request_to_logfile(request_string)
directory = 'log/'
File.open(File.join(directory, Time.now.strftime('%Y-%m-%d')+'.log'), 'a+') do |f|
f.puts ''
f.puts '> '+request_string
end
end
In the logfile, my requests also don't contain the UTF-8 umlauts but, for example, ü.
TL;DR
Set 'Binmode' => true and use Encoding::BINARY.
The above should work for you. If you're interested in why, read on.
Telnet doesn't really have a concept of "encoding." Telnet just has two modes: Normal mode assumes you're sending 7-bit ASCII characters, and binary mode assumes you're sending 8-bit bytes. You can't tell Telnet "this is UTF-8" because Telnet doesn't know what that means. You can tell it "this is ASCII-7" or "this is a sequence of 8-bit bytes," and that's it.
This might seem like bad news, but it's actually great news, because it just so happens that UTF-8 encodes text as sequences of 8-bit bytes. früh, for example, is five bytes: 66 72 c3 bc 68. This is easy to confirm in Ruby:
puts str = "\x66\x72\xC3\xBC\x68"
# => früh
puts str.bytes.size
# => 5
In Net::Telnet we can turn on binary mode by passing the 'Binmode' => true option to Net::Telnet::new. But there's one more thing we have to do: Tell Ruby to treat the string like binary data, i.e. a sequence of 8-bit bytes.
You already tried to use String#force_encoding, but what you might not have realized is that String#force_encoding doesn't actually convert a string from one encoding to another. Its purpose isn't to change the data's encoding—its purpose is to tell Ruby what encoding the data is already in:
str = "früh" # => "früh"
p str.encoding # => #<Encoding:UTF-8>
p str[2] # => "ü"
p str.bytes # => [ 102, 114, 195, 188, 104 ] # This is the decimal represent-
# ation of the hexadecimal bytes
# we saw before, `66 72 c3 bc 68`
str.force_encoding(Encoding::BINARY) # => "fr\xC3\xBCh"
p str[2] # => "\xC3"
p str.bytes # => [ 102, 114, 195, 188, 104 ] # Same bytes!
Now I'll let you in on a little secret: Encoding::BINARY is just an alias for Encoding::ASCII_8BIT. Since ASCII-8BIT doesn't have multi-byte characters, Ruby shows ü as two separate bytes, \xC3\xBC. Those bytes aren't printable characters in ASCII-8BIT, so Ruby shows the \x## escape codes instead, but the data hasn't changed—only the way Ruby prints it has changed.
So here's the thing: Even though Ruby is now calling the string BINARY or ASCII-8BIT instead of UTF-8, it's still the same bytes, which means it's still UTF-8. Changing the encoding it's "tagged" as, however, means when Net::Telnet does (the equivalent of) data[n] it will always get one byte (instead of potentially getting multi-byte characters as in UTF-8), which is just what we want.
And so...
host = Net::Telnet::new(
# ...all of your other options...
'Binmode' => true
)
def send_cmd_container(host, cmd, params=nil)
cmd = JSON.generate('*C' => 'se','Q' => [ get_cmd(cmd, params) ])
cmd.force_encoding(Encoding::BINARY)
host.puts(cmd)
# ...
end
(Note: JSON.generate always returns a UTF-8 string, so you never have to do e.g. cmd.to_s.)
Useful diagnostics
A quick way to check what data Net::Telnet is actually sending (and receiving) is to set the 'Dump_log' option (in the same way you set the 'Output_log' option). It will write both sent and received data to a log file in hexdump format, which will allow you to see if the bytes being sent are correct. For example, I started a test server (nc -l 5555) and sent the string früh (host.puts "früh".force_encoding(Encoding::BINARY)), and this is what was logged:
> 0x00000: 66 72 c3 bc 68 0a fr..h.
You can see that it sent six bytes: the first two are f and r, the next two make up ü, and the last two are h and a newline. On the right, bytes that aren't printable characters are shown as ., ergo fr..h.. (By the same token, I sent the string I❤NY and saw I...NY. in the right column, because ❤ is three bytes in UTF-8: e2 9d a4).
So, if you set 'Dump_log' and send a ü, you should see c3 bc in the output. If you do, congratulations—you're sending UTF-8!
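You can also inspect the bytes in Ruby before sending anything; unpack1('H*') renders a string's raw bytes as hex, the same bytes a 'Dump_log' would record:

```ruby
# Show the raw UTF-8 bytes of each string as a hex string.
puts "ü".unpack1('H*')     # => c3bc
puts "früh".unpack1('H*')  # => 6672c3bc68
```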
P.S. Read Yehuda Katz' article Ruby 1.9 Encodings: A Primer and the Solution for Rails. In fact, read it yearly. It's really, really useful.
When I invoke a Python 3 script from a Windows batch (.cmd) file,
a UTF-8 arg is not passed as UTF-8, but as a series of bytes,
each of which is interpreted by Python as an individual character.
How can I convert the Python 3 arg string to its intended UTF-8 state?
The calling .cmd and the called .py are shown below.
PS. As I mention in a comment below, calling u00FF.py "ÿ" directly from the Windows console commandline works fine. It is only a problem when I invoke u00FF.cmd via the .cmd, and I am looking for a Python 3 way to convert the double-encoded UTF-8 arg back to a "normally" encoded UTF-8 form.
I've now included the full (and latest) test code here. It's a bit long, but I hope it explains the issue clearly enough.
Update: I've seen why the file read of "ÿ" was "double-encoding"... I was reading the UTF-8 file in binary/byte mode... I should have used codecs.open('u00FF.arg', 'r', 'utf-8') instead of just plain open('u00FF.arg','r')... I've updated the offending code, and the output. The codepage issues seems to be the only problem now...
Because the Python issue has been largely resolved, and the codepage issue is quite independent of Python, I have posted another codepage specific question at
Codepage 850 works, 65001 fails! There is NO response to “call foo.cmd”. internal commands work fine.
::::::::::::::::::: BEGIN .cmd BATCH FILE ::::::::::::::::::::
:: Windows Batch file (UTF-8 encoded, no BOM): "u00FF.cmd"
@echo ÿ>u00FF.arg
@u00FF.py "ÿ"
@goto :eof
::::::::::::::::::: END OF .cmd BATCH FILE ::::::::::::::::::::
################### BEGIN .py SCRIPT #####################################
# -*- coding: utf-8 -*-
import sys
print ("""
Unicode
=======
CodePoint U+00FF
Character ÿ __Unicode Character 'LATIN SMALL LETTER Y WITH DIAERESIS'
UTF-8 bytes
===========
Hex: \\xC3 \\xBF
Dec: 195 191
Char: Ã ¿ __Unicode Character 'INVERTED QUESTION MARK'
\_______Unicode Character 'LATIN CAPITAL LETTER A WITH TILDE'
""")
print("## ====================================================")
print("## ÿ via hard-coding in this .py script itself ========")
print("##")
hard1s = "ÿ"
hard1b = hard1s.encode('utf_8')
print("hard1s: len", len(hard1s), " '" + hard1s + "'")
print("hard1b: len", len(hard1b), hard1b)
for i in range(0,len(hard1s)):
print("CodePoint[", i, "]", hard1s[i], "U+"+"{0:x}".upper().format(ord(hard1s[i])).zfill(4) )
print(''' This is a single CodePoint for "ÿ" (as expected).''')
print()
print("## ====================================================")
print("## ÿ read into this .py script from a UTF-8 file ======")
print("##")
import codecs
file1 = codecs.open( 'u00FF.arg', 'r', 'utf-8' )
file1s = file1.readline()
file1s = file1s[:1] # keep only the first character (drop the line ending)
file1b = file1s.encode('utf_8')
print("file1s: len", len(file1s), " '" + file1s + "'")
print("file1b: len", len(file1b), file1b)
for i in range(0,len(file1s)):
print("CodePoint[", i, "]", file1s[i], "U+"+"{0:x}".upper().format(ord(file1s[i])).zfill(4) )
print(''' This is a single CodePoint for "ÿ" (as expected).''')
print()
print("## ====================================================")
print("## ÿ via sys.argv from a call to .py from a .cmd) ===")
print("##")
argv1s = sys.argv[1]
argv1b = argv1s.encode('utf_8')
print("argv1s: len", len(argv1s), " '" + argv1s + "'")
print("argv1b: len", len(argv1b), argv1b)
for i in range(0,len(argv1s)):
print("CodePoint[", i, "]", argv1s[i], "U+"+"{0:x}".upper().format(ord(argv1s[i])).zfill(4) )
print(''' These 2 CodePoints are way off-beam,
even allowing for the "double-encoding" seen above.
The CodePoints are from an entirely different Unicode-Block.
This must be a Codepage issue.''')
print()
################### END OF .py SCRIPT #####################################
Here is the output from the above code.
========================== BEGIN OUTPUT ================================
C:\>u00FF.cmd
Unicode
=======
CodePoint U+00FF
Character ÿ __Unicode Character 'LATIN SMALL LETTER Y WITH DIAERESIS'
UTF-8 bytes
===========
Hex: \xC3 \xBF
Dec: 195 191
Char: Ã ¿ __Unicode Character 'INVERTED QUESTION MARK'
\_______Unicode Character 'LATIN CAPITAL LETTER A WITH TILDE'
## ====================================================
## ÿ via hard-coding in this .py script itself ========
##
hard1s: len 1 'ÿ'
hard1b: len 2 b'\xc3\xbf'
CodePoint[ 0 ] ÿ U+00FF
This is a single CodePoint for "ÿ" (as expected).
## ====================================================
## ÿ read into this .py script from a UTF-8 file ======
##
file1s: len 1 'ÿ'
file1b: len 2 b'\xc3\xbf'
CodePoint[ 0 ] ÿ U+00FF
This is a single CodePoint for "ÿ" (as expected).
## ====================================================
## ÿ via sys.argv from a call to .py from a .cmd) ===
##
argv1s: len 2 '├┐'
argv1b: len 6 b'\xe2\x94\x9c\xe2\x94\x90'
CodePoint[ 0 ] ├ U+251C
CodePoint[ 1 ] ┐ U+2510
These 2 CodePoints are way off-beam,
even allowing for the "double-encoding" seen above.
The CodePoints are from an entirely different Unicode-Block.
This must be a Codepage issue.
========================== END OF OUTPUT ================================
Batch files and encodings are a finicky issue. First of all: Batch files have no direct way of specifying the encoding they're in and cmd does not really support Unicode batch files. You can easily see that if you save a batch file with a Unicode BOM or as UTF-16 – they will throw an error.
What you see when you put the ÿ directly into the command line is that when running a command Windows will initially use the command line as Unicode (it may have been converted from some legacy encoding beforehand, but in the end what Windows uses is Unicode). So Python will (hopefully) always grab the Unicode content of the arguments.
However, since cmd has its own opinions about the codepage (and you never told it to use UTF-8) the UTF-8 string you put in the batch file won't be interpreted as UTF-8 but instead in the default cmd codepage (850 or 437, in your case).
You can force UTF-8 with chcp:
chcp 65001 > nul
You can save the following file as UTF-8 and try it out:
@echo off
chcp 850 >nul
echo ÿ
chcp 65001 >nul
echo ÿ
Keep in mind, though, that the chcp setting will persist in the shell if you run the batch from there which may make things weird.
The Windows shell uses a specific code page (see the CHCP command's output). You need to convert from the Windows code page to UTF-8. See the iconv module, or decode()/encode().
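A sketch of that decode()/encode() round trip: the ├┐ from the output above is what the UTF-8 bytes c3 bf look like when mis-decoded with code page 850, so (assuming the console code page really was 850, as in the question) reversing the wrong decode recovers the character:

```python
mojibake = "├┐"  # what sys.argv[1] held for the argument "ÿ"

# Encode back with the code page that produced the wrong characters,
# recovering the original raw bytes...
raw = mojibake.encode("cp850")      # b'\xc3\xbf'
# ...then decode those bytes as the UTF-8 they actually were.
fixed = raw.decode("utf-8")
print(fixed)                        # => ÿ
```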