How to get a glyph's Unicode value using FreeType?

I'm trying to use freetype to enumerate the glyphs (name and unicode) in a font file.
For getting the name, I'm using FT_Get_Glyph_Name.
But how can I get the glyph unicode value?
I'm new to glyphs and fonts.

The Unicode code point is not technically stored together with the glyph in a TrueType/OpenType font. One has to iterate over the font's cmap table to get the mapping, which could also be a non-Unicode one, and multiple mappings pointing to the same glyph may exist. The good news is that FreeType provides well-documented facilities in its API to iterate over the code points of the currently selected character map. So, in code:
#include <ft2build.h>
#include FT_FREETYPE_H
#include <format>
#include <iostream>

// Ensure a Unicode character map is selected (face is an open FT_Face)
FT_Select_Charmap(face, FT_ENCODING_UNICODE);

FT_UInt  gid;
FT_ULong charcode = FT_Get_First_Char(face, &gid);
while (gid != 0)
{
    std::cout << std::format("Codepoint: {:x}, gid: {}", charcode, gid) << std::endl;
    charcode = FT_Get_Next_Char(face, charcode, &gid);
}
With this information you can build a best-effort map from glyphs to Unicode code points.
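For instance, here is a minimal sketch (mine, not from the answer) that records the reverse mapping while walking the charmap; a multimap is used because several code points can reach the same glyph:

#include <map>

// Sketch: build a glyph-index -> code-point multimap from the charmap
// iteration above; face is assumed to be an open FT_Face with a
// Unicode charmap selected.
std::multimap<FT_UInt, FT_ULong> gid_to_codepoints;
FT_UInt g;
for (FT_ULong cp = FT_Get_First_Char(face, &g); g != 0;
     cp = FT_Get_Next_Char(face, cp, &g))
{
    gid_to_codepoints.insert({g, cp});
}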

One would expect the FT_CharMap to hold this info:
[...] The currently active charmap is available as face->charmap.
but unfortunately it only defines the kind of encoding (Unicode, MacRoman, Shift-JIS etc.). Apparently the act of looking up a code is done elsewhere – and .notdef simply gets returned when that character is unavailable after all.
Looking in one of my own FreeType-based OpenType renderers, which reports 'by name' where possible, I found in the initialization sequence some code that stores the name of a glyph if it has one, and the Unicode code point otherwise. But that code was based on the presence of glyph names.
Thinking further: you can test every possible Unicode code point and see whether FT_Get_Char_Index returns 0 (.notdef) or a valid glyph index. So initialize an empty table for all possible glyphs and only fill in each one's Unicode value if the following routine finds it.
For a moderately modern font you need only check up to U+FFFF; for something like a heavy Chinese font (up to U+2F9F4 for Heiti SC) or emoji (up to U+1FA95 for Segoe UI Emoji) you need a considerably larger array. (Getting that maximum number out of a font is an entirely different story, alas. Deciding what to do depends on what you want to use this for.)
printf ("num glyphs: %u\n", face->num_glyphs);
for (code=1; code<=0xFFFF; code++)
{
glyph_index = FT_Get_Char_Index(face, code);
/* 0 = .notdef */
if (glyph_index)
{
printf ("%d -> %04X\n", glyph_index, code);
}
}
This short C snippet prints out the translation table from font glyph index to a corresponding Unicode. Beware that (1) not all glyphs in a font need to have a Unicode associated with them. Some fonts have tons of 'extra' glyphs, to be used in OpenType substitutions (such as alternative designs and custom ligatures) or other uses (such as aforementioned Segoe UI Emoji; it contains color masks for all of its emoji). And (2) some glyphs may be associated with multiple Unicode characters. The glyph design for A, for example, can be used as both a Latin Capital Letter A and a Greek Capital Letter Alpha.
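To make point (2) concrete, here is a small check (my sketch, not from the answer); face is assumed to be an open FT_Face with a Unicode charmap selected, and the test only fires in fonts whose cmap points both code points at one glyph:

/* Sketch: in some fonts U+0041 (Latin Capital Letter A) and U+0391
   (Greek Capital Letter Alpha) resolve to the same glyph index. */
FT_UInt a     = FT_Get_Char_Index(face, 0x0041);
FT_UInt alpha = FT_Get_Char_Index(face, 0x0391);
if (a != 0 && a == alpha)
    printf("U+0041 and U+0391 share glyph index %u\n", a);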

Not all glyphs in a font will necessarily have a Unicode code point. In OpenType text display, there is an m:n mapping between Unicode character sequences and glyph sequences. If you are interested in a relationship between Unicode code points and glyphs, the thing that makes most sense is to use the mapping from Unicode code points to default glyphs contained in a font's 'cmap' table.
For more background, see OpenType spec: Advanced Typographic Extensions - OpenType Layout.
As for glyph names, every glyph can have a name, regardless of whether it is mapped from a code point in the 'cmap' table or not. Glyph names are contained in the 'post' table. But not all fonts necessarily include glyph names. For example, a CJK font is unlikely to include glyph names.
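As a quick illustration (a sketch, not from the answer), FreeType exposes the 'post' table names through FT_Get_Glyph_Name, and the FT_HAS_GLYPH_NAMES macro reports whether the font ships names at all; face and glyph_index are assumed to be valid:

/* Sketch: read a glyph's 'post'-table name, guarding against fonts
   (for example many CJK fonts) that include no glyph names. */
char name[128];
if (FT_HAS_GLYPH_NAMES(face) &&
    FT_Get_Glyph_Name(face, glyph_index, name, sizeof(name)) == 0)
{
    printf("glyph %u is named '%s'\n", glyph_index, name);
}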

Related

How to find the default font for East Asian characters in macOS

When using a text-editing application, a font (such as "Menlo") is selected to present glyphs. When the selected font doesn't contain a particular glyph (such as "𠹷", a Simplified Chinese character that "Menlo" doesn't include), the application picks a font for you to present it. In macOS (Catalina), about 62 fonts (STBaoliSC-Regular, STKaiti, STSong, PingFangSC-Regular...) contain the glyph "𠹷", and I found that almost every text-editing application (VS Code, Sublime Text, TextEdit) picks the same font, "PingFangSC-Regular". So I wonder: does every glyph have its own default font? If so, how can I get the font name?
This is handled by the "cascade list." If you want the default list, it's available through CTFontCopyDefaultCascadeListForLanguages, on a per-font basis:
import CoreText
let font = CTFontCreateWithName("Helvetica" as CFString, 12, nil)
let descriptors = CTFontCopyDefaultCascadeListForLanguages(font, nil)! as! [CTFontDescriptor]
If you wanted to see the list, you could do it this way (Core Text does not have very nice bridging to Swift):
for descriptor in descriptors {
    let attributes = CTFontDescriptorCopyAttributes(descriptor) as! [String: Any]
    print(attributes[kCTFontNameAttribute as String]!)
}
==>
LucidaGrande
.AppleSymbolsFB
GeezaPro
NotoNastaliqUrdu
Thonburi
Kailasa
PingFangSC-Regular
PingFangTC-Regular
AppleSDGothicNeo-Regular
PingFangTC-Regular
PingFangSC-Regular
PingFangHK-Regular
PingFangSC-Regular
HiraginoSans-W3
HiraginoSansGB-W3
KohinoorBangla-Regular
KohinoorDevanagari-Regular
KohinoorGujarati-Regular
MuktaMahee-Regular
NotoSansKannada-Regular
KhmerSangamMN
LaoSangamMN
MalayalamSangamMN
NotoSansMyanmar-Regular
NotoSansZawgyi-Regular
NotoSansOriya
SinhalaSangamMN
TamilSangamMN
KohinoorTelugu-Regular
NotoSansArmenian-Regular
EuphemiaUCAS
Menlo-Regular
STIXGeneral-Regular
Galvji
Kefa-Regular
.NotoSansUniversal
AppleColorEmoji
PingFangSC-Regular is the first East Asian font in the list, so it's the one that will be picked to replace Helvetica. It's also the first East Asian font in the cascade lists for Lucida Grande and Helvetica Neue. And it's a rather straightforward, even boring, font. But what if you were using a somewhat more unusual font, like American Typewriter? Well, that would be replaced with Songti, which is a bit lighter. Marker Felt is replaced with Kaiti, which has more variation in its stroke widths (though IMO it would be much better to replace it with Kaiti Black rather than Regular). I don't know of any built-in Asian fonts that are as "fun" as the available Latin fonts, but if you had one, you could customize the cascade list to choose it instead (using NSFontDescriptor on Mac, or UIFontDescriptor on iOS), as sketched below.
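For what it's worth, here is a rough sketch of that customization using the Core Text C API from C++ (my code, not from the answer; "Songti SC" and "American Typewriter" are just assumed example fonts, substitute any installed ones):

#include <CoreText/CoreText.h>

// Sketch: build a font whose cascade list prefers "Songti SC" for
// fallback, via the kCTFontCascadeListAttribute descriptor attribute.
CTFontRef CreateFontWithCustomCascade()
{
    CTFontDescriptorRef fallback =
        CTFontDescriptorCreateWithNameAndSize(CFSTR("Songti SC"), 12.0);
    const void *cascade[] = { fallback };
    CFArrayRef cascadeList =
        CFArrayCreate(nullptr, cascade, 1, &kCFTypeArrayCallBacks);

    const void *keys[]   = { kCTFontNameAttribute, kCTFontCascadeListAttribute };
    const void *values[] = { CFSTR("American Typewriter"), cascadeList };
    CFDictionaryRef attrs = CFDictionaryCreate(
        nullptr, keys, values, 2,
        &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);

    CTFontDescriptorRef desc = CTFontDescriptorCreateWithAttributes(attrs);
    CTFontRef font = CTFontCreateWithFontDescriptor(desc, 12.0, nullptr);

    CFRelease(desc);
    CFRelease(attrs);
    CFRelease(cascadeList);
    CFRelease(fallback);
    return font;
}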
If you want more details about cascade lists, see the WWDC 2018 video, Creating Apps for a Global Audience.

How to get the height at which to draw a strikethrough from FreeType

FreeType has font metrics for the underline position, but I can't seem to find any metrics for the strikethrough position. How do text engines usually compute this value? Should I just put it at 1/3*ascent or whatever looks good? I suppose that for Latin at least this should be 1/2*height of "m" but I'm looking for a more general solution.
This information is not provided for all the various font formats supported by FreeType, so it is not exposed in the "main" interface.
In the (common but not universal) case of TrueType or OpenType fonts, it can be retrieved from the TT_OS2 table, in the fields yStrikeoutSize and yStrikeoutPosition; you should be prepared for the table to be absent, or for yStrikeoutSize to be zero or negative and thus unusable.
I do not remember an equivalent for plain PostScript fonts (.pfb/.pfa, or even in the .afm).
The various bitmap formats might have the information available; an example is strike_out in Windows FNT. Notice this is only the position, while the size defaults to the same as the underline's. Basically, every format is on its own here.
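As a sketch under those caveats (mine, not the answer's; the fallback heuristic is just the 1/3-of-ascent idea from the question), the TrueType/OpenType case with FreeType could look like:

#include <ft2build.h>
#include FT_FREETYPE_H
#include FT_TRUETYPE_TABLES_H

/* Sketch: fetch strikeout metrics from the OS/2 table when present and
   usable, falling back to a heuristic otherwise. Results are in font
   units; scale them like any other design-space metric. */
void get_strikeout(FT_Face face, FT_Short *pos, FT_Short *size)
{
    TT_OS2 *os2 = (TT_OS2 *)FT_Get_Sfnt_Table(face, FT_SFNT_OS2);
    if (os2 && os2->yStrikeoutSize > 0)
    {
        *pos  = os2->yStrikeoutPosition;
        *size = os2->yStrikeoutSize;
    }
    else
    {
        /* Heuristic fallback, as suggested in the question. */
        *pos  = (FT_Short)(face->ascender / 3);
        *size = face->underline_thickness;
    }
}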

Arabic-English Transliteration using unsupported font

I am working on language transliteration for Arabic (Ar) and English (En) text.
Here is the link which shows the character-by-character replacements: https://github.com/Shnoulle/Ar-PHP/blob/master/Arabic/data/Transliteration.xml
Now the issue is:
I am dealing with the fonts robert_bold.ttf and robert_regular_0.ttf, which have some special characters with underlines and overlines, as in this snapshot.
I have the .ttf files, so I can see these fonts on my system. But in my application, and in the Transliteration.xml above, those characters come out as junk, like [, } [ etc.
How can I add support for these unsupported characters in the Transliteration.xml file?
<pair>
    <search>ي</search>
    <replace>y</replace>
</pair>
<pair>
    <search>ى</search>
    <replace>a</replace>
</pair>
<pair>
    <search>أ</search>
    <replace>^</replace> <!-- here the desired replacement, an s with an underscore, is not supported -->
</pair>
It seems that the font is not Unicode encoded but contains the underlined letters at some arbitrarily assigned codes. While this works up to a point, it does not work across applications, of course. It works only when that specific font is used.
The proper way is to use correct Unicode characters such as U+1E0F LATIN SMALL LETTER D WITH LINE BELOW “ḏ” and, for rendering, try to find fonts containing it.
An alternative is to use just basic Latin letters with some markup, say <u>d</u>. This means the text must not be treated as plain text in later processing, and in rendering, the markup should be interpreted as a request for a line under the letter(s).
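For example (a sketch assuming the same Transliteration.xml format, and the common transliteration of ذ as ḏ), a pair using the real Unicode character would look like:

<pair>
    <search>ذ</search>
    <replace>ḏ</replace> <!-- U+1E0F LATIN SMALL LETTER D WITH LINE BELOW -->
</pair>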

How to get glyph unicode representation of Unicode character

Windows uses the Uniscribe library to substitute typed Arabic and Indic characters based on their position. The new glyph still carries the original Unicode value of the typed character, although the shaped form has its own dedicated representation in Unicode.
How can I get the Unicode of what is actually displayed, not what is typed?
There are lots of tools for this, like ICU, Charmap, and the rest. I myself recommend http://unicode.codeplex.com; it uses the Unicode Character Database to represent characters.
Note that Unicode only assigns information to characters and says nothing about their visual representation; the code charts merely show an example rendering. So to view each code point you need a comprehensive Unicode font, like Arial Unicode MS, which is the largest and best choice on the Windows platform.
Most characters are implemented in this font, but for new characters you need an update for it (if there is such an update), or you can use a font which you know implements your desired characters.
Your interpretation of what is happening in Uniscribe is not correct.
Once you have glyphs, the original information is gone; there is no reliable way to go back to Unicode.
Even without going to Arabic, there is no way to distinguish whether the glyph for the fi ligature (for example) comes from 'f' and 'i' (U+0066 U+0069) or from 'fi' (U+FB01).
(http://www.fileformat.info/info/unicode/char/fb01/index.htm)
Also, some of the resulting glyphs do not have a Unicode value associated with them, so there is no "Unicode of what is actually displayed".

Why doesn't FONTSIGNATURE reflect lfCharSet?

I'm enumerating Windows fonts like this:
LOGFONTW lf = {0};
lf.lfCharSet = DEFAULT_CHARSET;
lf.lfFaceName[0] = L'\0';
lf.lfPitchAndFamily = 0;
::EnumFontFamiliesEx(hdc, &lf,
                     reinterpret_cast<FONTENUMPROCW>(FontEnumCallback),
                     reinterpret_cast<LPARAM>(this), 0);
My callback function has this signature:
int CALLBACK FontEnumerator::FontEnumCallback(const ENUMLOGFONTEX *pelf,
                                              const NEWTEXTMETRICEX *pMetrics,
                                              DWORD font_type,
                                              LPARAM context);
For TrueType fonts, I typically get each face name multiple times. For example, for multiple calls, I'll get pelf->elfFullName and pelf->elfLogFont.lfFaceName set as "Arial". Looking more closely at the other fields, I see that each call is for a different script. For example, on the first call pelf->elfScript will be "Western" and pelf->elfLogFont.lfCharSet will be the numeric equivalent of ANSI_CHARSET. On the second call, I get "Hebrew" and HEBREW_CHARSET. Third call "Arabic" and ARABIC_CHARSET. And so on. So far, so good.
But the font signature (pMetrics->ntmFontSig) field for all versions of Arial is identical. In fact, the font signature claims that all of these versions of Arial support Latin-1, Hebrew, Arabic, and others.
I know the character sets of the strings I'm trying to draw, so I'm trying to instantiate an appropriate font based on the font signatures. Because the font signatures always match, I always end up selecting the "Western" font, even when displaying Hebrew or Arabic text. I'm using low level Uniscribe APIs, so I don't get the benefit of Windows font linking, and yet my code seems to work.
Does lfCharSet actually carry any meaning or is it a legacy artifact? Should I just set lfCharSet to DEFAULT_CHARSET and stop worrying about all the script variations of each face?
For my purposes, I only care about TrueType and OpenType fonts.
I think I found the answer. Fonts that get enumerated multiple times are "big" fonts. Big fonts are single fonts that include glyphs for multiple scripts or code pages.
The Unicode portion of the FONTSIGNATURE (fsUsb) represents all the Unicode subranges that the font can handle. This is independent of the character set. If you use the wide character APIs, you can use all the included glyphs in the font, regardless of which character set was specified when you create the font.
The code page portion of the FONTSIGNATURE (fsCsb) represents the code pages that the font can handle. I believe this is only significant when the font is not a "big" font. In that case, the fsUsb masks will be all zeros, and the fsCsb will specify the appropriate character set(s). In those cases, it's important to get the lfCharSet correct in the LOGFONT.
When instantiating a "big" font and using the wide character APIs, it apparently doesn't matter which lfCharSet you specify.
