Programmatically get a list of characters a certain .ttf font file supports - ruby

Is there a way to programmatically get a list of the characters a .ttf file supports using Ruby and/or Bash? I am trying to pipe the supported character codes into a text file for later processing.
(I would prefer not to use Font Forge.)

I found a Ruby gem called ttfunk that can do this.
After a gem install ttfunk, you can get all unicode characters by running the following script:
require 'ttfunk'
file = TTFunk::File.open("path/to/font.ttf")
cmap = file.cmap
# Merge the code maps of all Unicode subtables (code point => glyph id)
chars = {}
cmap.tables.each do |subtable|
  next unless subtable.unicode?
  chars = chars.merge(subtable.code_map)
end
# Convert the decimal code points to hex strings
unicode_chars = chars.keys.map { |dec| dec.to_s(16) }
puts "\n -- Found #{unicode_chars.length} characters in this font \n\n"
p unicode_chars
Which will output something like:
 -- Found 2815 characters in this font
["20", "21", "22", "23", ... , "fef8", "fef9", "fefa", "fefb", "fefc", "fffc", "ffff"]
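Since the goal is to get the codes into a text file, the merged code map can be written out one code point per line. This is only a minimal sketch using the standard library: the helper name write_codepoints, the U+XXXX line format and the output file name are my own assumptions, and the small sample hash stands in for the chars hash built by the script (code point => glyph id).

```ruby
require 'tmpdir'

# Hypothetical helper, not part of ttfunk: writes one code point per line
# in U+XXXX form and returns how many lines were written.
# `code_map` is assumed to map integer code points to glyph ids,
# like the `chars` hash merged from subtable.code_map.
def write_codepoints(code_map, path)
  lines = code_map.keys.sort.map { |cp| format('U+%04X', cp) }
  File.write(path, lines.join("\n") + "\n")
  lines.length
end

# Stand-in data; in the real script, pass `chars` instead.
sample = { 0x21 => 4, 0x20 => 3, 0xFEFC => 900 }
out_path = File.join(Dir.tmpdir, 'supported_chars.txt')
count = write_codepoints(sample, out_path)
puts "Wrote #{count} code points to #{out_path}"
```

Sorting the keys first keeps the file stable between runs, which makes later diffing or processing easier.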

Related

Convert an emoji to HTML UTF-8 in Ruby

I have a Rails server running where I have a bunch of clock emojis, and I want to render them into the HTML. The emojis are in ASCII format:
ch = "\xF0\x9F\x95\x8f" ; 12.times.map { ch.next!.dup }.rotate(-1)
# => ["🕛", "🕐", "🕑", "🕒", "🕓", "🕔", "🕕", "🕖", "🕗", "🕘", "🕙", "🕚"]
What I want is this:
> String.define_method(:to_html_utf8) { chars.map! { |x| "&#x#{x.dump[3..-2].delete('{}')};" }.join }
> ch = "\xF0\x9F\x95\x8f" ; 12.times.map { ch.next!.to_html_utf8 }.rotate(-1)
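# => ["&#x1F55B;", "&#x1F550;", "&#x1F551;", "&#x1F552;", "&#x1F553;", "&#x1F554;", "&#x1F555;", "&#x1F556;", "&#x1F557;", "&#x1F558;", "&#x1F559;", "&#x1F55A;"]
> ?🖄.to_html_utf8
# => "&#x1F584;"
> "🐭🐹".to_html_utf8
# => "&#x1F42D;&#x1F439;"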
As you can see, to_html_utf8 uses a brute-force approach (slicing the output of String#dump) to get the job done.
Is there a better way to convert the emojis into the aforementioned HTML-compatible form?
Please note that it would be better to avoid any Rails helpers or Rails-specific code in general; it should run on Ruby 2.7+ using only the standard library.
The emojis are in ASCII format:
ch = "\xF0\x9F\x95\x8f"
0xf0 0x9f 0x95 0x8f is the character's UTF-8 byte sequence. Don't use that unless you absolutely have to. It's much easier to enter the emojis directly, e.g.:
ch = '🕐'
#=> "🕐"
or to use the character's codepoint, e.g.:
ch = "\u{1f550}"
#=> "🕐"
ch = 0x1f550.chr('UTF-8')
#=> "🕐"
You can usually just render that character into your HTML page if the "charset" is UTF-8.
If you want to turn the string's characters into their numeric character reference counterparts yourself, you could use:
ch.codepoints.map { |cp| format('&#x%x;', cp) }.join
#=> "&#x1f550;"
Note that the conversion is trivial – 1f550 is simply the character's (hex) codepoint.
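If the string mixes markup with emojis, you may only want to escape the non-ASCII characters and leave the rest alone. A minimal sketch, assuming the input is valid UTF-8; the helper name to_numeric_refs is made up for illustration:

```ruby
# Hypothetical helper: replaces every non-ASCII character with a
# hexadecimal numeric character reference and leaves ASCII untouched.
# Assumes `str` is a valid UTF-8 string.
def to_numeric_refs(str)
  str.gsub(/[^[:ascii:]]/) { |char| format('&#x%x;', char.ord) }
end

puts to_numeric_refs("<p>\u{1f550} lunch</p>")
# prints: <p>&#x1f550; lunch</p>
```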
The easiest way is to simply use UTF-8 natively and not escape anything.

How to print unicode characters in Command Prompt with Ruby

I was wondering how to print unicode characters, such as Japanese or fun characters like 📦.
I can print hearts with:
hearts = "\u2665"
puts hearts.encode('utf-8')
How can I print more unicode characters with Ruby in Command Prompt?
My method works with some characters but not all.
Code examples would be greatly appreciated.
You need to enclose the unicode character in { and } if the number of hex digits isn't 4 (credit : /u/Stefan) e.g.:
heart = "\u2665"
package = "\u{1F4E6}"
fire_and_one_hundred = "\u{1F525 1F4AF}"
puts heart
puts package
puts fire_and_one_hundred
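The \u escapes are resolved when the source is parsed. If the code point is only known at runtime (read from a file, say), Array#pack with the U directive builds the same string. A small sketch; the helper name from_codepoints is hypothetical:

```ruby
# Hypothetical helper: builds a UTF-8 string from integer code points.
def from_codepoints(*codepoints)
  codepoints.pack('U*')
end

puts from_codepoints(0x2665)               # heart
puts from_codepoints(0x1F525, 0x1F4AF)     # fire and one hundred
puts from_codepoints(0x1F4E6) == "\u{1F4E6}"  # prints: true
```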
Alternatively, you can put the unicode character directly in your source. On macOS this is easy with the Emoji & Symbols menu (Ctrl + Command + Space by default); Windows 10 has a similar menu (Win + ;). Both work in most applications, most likely including your text editor or Ruby IDE:
heart = "♥"
package = "📦"
fire_and_one_hundred = "🔥💯"
puts heart
puts package
puts fire_and_one_hundred
Output:
♥
📦
🔥💯

can't write IP to text file without formatting issues

I'm having trouble reading an IP from a text file and properly writing it to another text file. It shows the written IP in the file as: "ÿþ1 9 2 . 1 6 8 . 1 1 0 . 4"
# Read the first line for the IP
def get_server_ip
  File.open("d:\\ip_addr.txt") do |line|
    a = line.readline()
    b = a.to_s
  end
end
# Append the IP to file2
def append_ip
  FileUtils.cp('file1.txt', 'file2.txt')
  file_names = ['file2.txt']
  file_names.each do |file_name|
    text = File.read(file_name)
    b = get_server_ip
    new_contents = text.gsub('ip_here', b)
    File.open(file_name, "w") {|file| file.puts new_contents }
  end
end
I've tried .strip and .delete(' ') with no luck. Can anyone see the issue?
Thank you
The file was generated with Notepad on Windows. It is encoded as UTF-16LE.
The first two bytes in the file have the codes 0xFF and 0xFE; this is the Byte Order Mark (BOM) of UTF-16LE.
Each character is encoded on 2 bytes (16 bits), least significant byte first (little-endian order).
The spaces between the printable characters in the output are in fact NUL characters (characters with code 0).
What you can do (apart from converting the file to a more convenient encoding like UTF-8 or even ISO-8859-1) is to pass 'rb:BOM|UTF-16LE' as the second argument of File#open.
r tells File#open to open the file in read-only mode (which it also does by default);
b means "binary mode"; it is required by BOM|UTF-16LE;
:BOM|UTF-16LE tells Ruby to read and ignore the BOM if it is present in the file and to expect the rest of the file to be encoded as UTF-16LE.
If you can, I recommend converting the file to UTF-8 or ISO-8859-1 with a decent editor (even Notepad can do it); then all these problems vanish.
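Here is a rough sketch of how the 'rb:BOM|UTF-16LE' mode could be used to get the IP as a plain UTF-8 string. The helper name first_line_utf16le and the sample file are made up for the demonstration; the sample is written with a BOM to imitate what Notepad produces.

```ruby
require 'tmpdir'

# Hypothetical helper: reads a UTF-16LE file (with or without a BOM),
# transcodes it to UTF-8 and returns the first line without line endings.
def first_line_utf16le(path)
  text = File.open(path, 'rb:BOM|UTF-16LE') { |file| file.read }
  text.encode('UTF-8').lines.first.to_s.strip
end

# Build a sample file: BOM (FF FE) followed by UTF-16LE text.
sample = File.join(Dir.tmpdir, 'ip_addr_utf16.txt')
File.binwrite(sample, "\xFF\xFE".b + "192.168.110.4\r\n".encode('UTF-16LE').b)

p first_line_utf16le(sample)  # => "192.168.110.4"
```

Transcoding to UTF-8 right after reading means the later gsub works on strings with the same encoding as the template file.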

How to decode IFC using Ruby

In Ruby, I'm reading an .ifc file to get some information, but I can't decode it. For example, the file content:
"'S\X2\00E9\X0\jour/Cuisine'"
should be:
"'Séjour/Cuisine'"
I'm trying to encode it with:
puts ifcFileLine.encode("Windows-1252")
puts ifcFileLine.encode("ISO-8859-1")
puts ifcFileLine.encode("ISO-8859-5")
puts ifcFileLine.encode("iso-8859-1").force_encoding("utf-8")
But nothing gives me what I need.
I don't know anything about IFC, but based solely on the page Denis linked to and your example input, this works:
ESCAPE_SEQUENCE_EXPR = /\\X2\\(.*?)\\X0\\/
def decode_ifc(str)
  str.gsub(ESCAPE_SEQUENCE_EXPR) do
    # Each group of four hex digits is one Unicode code point
    $1.gsub(/..../) { $&.to_i(16).chr(Encoding::UTF_8) }
  end
end
str = 'S\X2\00E9\X0\jour/Cuisine'
puts "Input:", str
puts "Output:", decode_ifc(str)
All this code does is replace every sequence of four characters (/..../) between the delimiters, which will each be a Unicode code point in hexadecimal, with the corresponding Unicode character.
Note that this code handles only this specific encoding. A quick glance at the implementation guide shows other encodings, including an \X4 directive for Unicode characters outside the Basic Multilingual Plane. This ought to get you started, though.
See it on eval.in: https://eval.in/776980
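For the \X4 directive mentioned above, a possible extension is sketched here. It rests on an assumption I have not checked against real files: that \X4\ is followed by groups of eight hex digits and terminated by \X0\, in the same way \X2\ uses groups of four. The method names are hypothetical.

```ruby
# Turns a run of fixed-width hex groups into the matching UTF-8 characters.
def hex_groups_to_utf8(hex, width)
  hex.scan(/\h{#{width}}/).map { |group| group.to_i(16) }.pack('U*')
end

# Hypothetical extension of decode_ifc: handles \X2\ (4 hex digits per
# character) and, by assumption, \X4\ (8 hex digits per character).
def decode_ifc_extended(str)
  str.gsub(/\\X2\\((?:\h{4})+)\\X0\\/) { hex_groups_to_utf8(Regexp.last_match(1), 4) }
     .gsub(/\\X4\\((?:\h{8})+)\\X0\\/) { hex_groups_to_utf8(Regexp.last_match(1), 8) }
end

puts decode_ifc_extended('S\X2\00E9\X0\jour/Cuisine')
puts decode_ifc_extended('box \X4\0001F4E6\X0\ end')
```

Unlike the /..../ version, the patterns only match complete groups of hex digits, so malformed sequences are left in the text untouched instead of being partly decoded.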
If anyone is interested, here is some Python code that decodes three of the IFC encodings: \X\, \X2\ and \S\
import re
def decodeIfc(txt):
    # Backslashes are awkward to handle in Python regexes, so swap them for a placeholder first
    txt = txt.replace('\\', 'µµµ')
    txt = re.sub('µµµX2µµµ([0-9A-F]{4,})+µµµX0µµµ', decodeIfcX2, txt)
    txt = re.sub('µµµSµµµ(.)', decodeIfcS, txt)
    txt = re.sub('µµµXµµµ([0-9A-F]{2})', decodeIfcX, txt)
    txt = txt.replace('µµµ','\\')
    return txt
def decodeIfcX2(match):
    # X2 encodes characters as a multiple of 4 hexadecimal digits.
    return ''.join(list(map(lambda x : chr(int(x,16)), re.findall('([0-9A-F]{4})',match.group(1)))))
def decodeIfcS(match):
    return chr(ord(match.group(1))+128)
def decodeIfcX(match):
    # Sometimes IFC files were made on an old Mac, which uses MacRoman encoding.
    num = int(match.group(1), 16)
    if num <= 127 or num >= 160:
        return chr(num)
    else:
        return bytes.fromhex(match.group(1)).decode("macroman")

Net::Telnet - puts or print string in UTF-8

I'm using an API in which I have to send client information as a Json-object over a telnet connection (very strange, I know^^).
I'm German, so the client information very often contains umlauts or the ß.
My procedure:
I generate a Hash with all the command information.
I convert the Hash to a Json-object.
I convert the Json-object to a string (with .to_s).
I send the string with the Net::Telnet.puts command.
My puts command looks like: (cmd is the Json-object)
host.puts(cmd.to_s.force_encoding('UTF-8'))
In the log files I see that the Json-object doesn't contain the umlauts but, for example, this: ü instead of ü.
I verified that the string is in UTF-8 (with or without the force_encoding() command). So I think that the puts command doesn't send the strings in UTF-8.
Is it possible to send the command in UTF-8? How can I do this?
The complete methods:
host = Net::Telnet::new(
  'Host' => host_string,
  'Port' => port_integer,
  'Output_log' => 'log/'+Time.now.strftime('%Y-%m-%d')+'.log',
  'Timeout' => false,
  'Telnetmode' => false,
  'Prompt' => /\z/n
)
def send_cmd_container(host, cmd, params=nil)
  cmd = JSON.generate({'*C'=>'se','Q'=>[get_cmd(cmd, params)]})
  host.puts(cmd.to_s.force_encoding('UTF-8'))
  add_request_to_logfile(cmd)
end
def get_cmd(cmd, params=nil)
  if params == nil
    return {'*C'=>'sq','CMD'=>cmd}
  else
    return {'*C'=>'sq','CMD'=>cmd,'PARAMS'=>params}
  end
end
Addition:
I also log the requests I send with this method:
def add_request_to_logfile(request_string)
  directory = 'log/'
  File.open(File.join(directory, Time.now.strftime('%Y-%m-%d')+'.log'), 'a+') do |f|
    f.puts ''
    f.puts '> '+request_string
  end
end
In the logfile my requests also don't contain UTF-8 umlauts but for example this: ü
TL;DR
Set 'Binmode' => true and use Encoding::BINARY.
The above should work for you. If you're interested in why, read on.
Telnet doesn't really have a concept of "encoding." Telnet just has two modes: Normal mode assumes you're sending 7-bit ASCII characters, and binary mode assumes you're sending 8-bit bytes. You can't tell Telnet "this is UTF-8" because Telnet doesn't know what that means. You can tell it "this is ASCII-7" or "this is a sequence of 8-bit bytes," and that's it.
This might seem like bad news, but it's actually great news, because it just so happens that UTF-8 encodes text as sequences of 8-bit bytes. früh, for example, is five bytes: 66 72 c3 bc 68. This is easy to confirm in Ruby:
puts str = "\x66\x72\xC3\xBC\x68"
# => früh
puts str.bytes.size
# => 5
In Net::Telnet we can turn on binary mode by passing the 'Binmode' => true option to Net::Telnet::new. But there's one more thing we have to do: Tell Ruby to treat the string like binary data, i.e. a sequence of 8-bit bytes.
You already tried to use String#force_encoding, but what you might not have realized is that String#force_encoding doesn't actually convert a string from one encoding to another. Its purpose isn't to change the data's encoding—its purpose is to tell Ruby what encoding the data is already in:
str = "früh"    # => "früh"
p str.encoding  # => #<Encoding:UTF-8>
p str[2]        # => "ü"
p str.bytes     # => [ 102, 114, 195, 188, 104 ]
                # This is the decimal representation of the hexadecimal
                # bytes we saw before, `66 72 c3 bc 68`
str.force_encoding(Encoding::BINARY) # => "fr\xC3\xBCh"
p str[2]        # => "\xC3"
p str.bytes     # => [ 102, 114, 195, 188, 104 ] # Same bytes!
Now I'll let you in on a little secret: Encoding::BINARY is just an alias for Encoding::ASCII_8BIT. Since ASCII-8BIT doesn't have multi-byte characters, Ruby shows ü as two separate bytes, \xC3\xBC. Those bytes aren't printable characters in ASCII-8BIT, so Ruby shows the \x## escape codes instead, but the data hasn't changed—only the way Ruby prints it has changed.
So here's the thing: Even though Ruby is now calling the string BINARY or ASCII-8BIT instead of UTF-8, it's still the same bytes, which means it's still UTF-8. Changing the encoding it's "tagged" as, however, means when Net::Telnet does (the equivalent of) data[n] it will always get one byte (instead of potentially getting multi-byte characters as in UTF-8), which is just what we want.
And so...
host = Net::Telnet::new(
  # ...all of your other options...
  'Binmode' => true
)
def send_cmd_container(host, cmd, params=nil)
  cmd = JSON.generate('*C' => 'se','Q' => [ get_cmd(cmd, params) ])
  cmd.force_encoding(Encoding::BINARY)
  host.puts(cmd)
  # ...
end
(Note: JSON.generate always returns a UTF-8 string, so you never have to do e.g. cmd.to_s.)
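To make the difference between relabeling and converting concrete, here is a small sketch that compares String#force_encoding with String#encode on the same text. The two helper names are made up for the comparison; ISO-8859-1 is just an example target encoding.

```ruby
# Relabels the bytes as binary; the byte sequence itself is unchanged.
def relabel_as_binary(str)
  str.dup.force_encoding(Encoding::BINARY)
end

# Transcodes to ISO-8859-1; the byte sequence changes.
def transcode_to_latin1(str)
  str.encode(Encoding::ISO_8859_1)
end

text = "fr\u00FCh"
p relabel_as_binary(text).bytes    # => [102, 114, 195, 188, 104]
p transcode_to_latin1(text).bytes  # => [102, 114, 252, 104]
```

Only the second call would change what goes over the wire, which is why force_encoding is safe to use right before host.puts.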
Useful diagnostics
A quick way to check what data Net::Telnet is actually sending (and receiving) is to set the 'Dump_log' option (in the same way you set the 'Output_log' option). It will write both sent and received data to a log file in hexdump format, which will allow you to see if the bytes being sent are correct. For example, I started a test server (nc -l 5555) and sent the string früh (host.puts "früh".force_encoding(Encoding::BINARY)), and this is what was logged:
> 0x00000: 66 72 c3 bc 68 0a fr..h.
You can see that it sent six bytes: the first two are f and r, the next two make up ü, and the last two are h and a newline. On the right, bytes that aren't printable characters are shown as ., ergo fr..h.. (By the same token, I sent the string I❤NY and saw I...NY. in the right column, because ❤ is three bytes in UTF-8: e2 9d a4).
So, if you set 'Dump_log' and send a ü, you should see c3 bc in the output. If you do, congratulations—you're sending UTF-8!
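If you want to know which bytes to look for in the dump before sending anything, you can print them from Ruby first. A minimal sketch; hex_bytes is a hypothetical helper, not something Net::Telnet provides:

```ruby
# Hypothetical helper: formats a string's bytes the way a hexdump shows them.
def hex_bytes(str)
  str.bytes.map { |byte| format('%02x', byte) }.join(' ')
end

puts hex_bytes("fr\u00FCh\n")  # prints: 66 72 c3 bc 68 0a
puts hex_bytes("I\u2764NY")    # prints: 49 e2 9d a4 4e 59
```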
P.S. Read Yehuda Katz' article Ruby 1.9 Encodings: A Primer and the Solution for Rails. In fact, read it yearly. It's really, really useful.
