freetype extracting character code from a glyph - freetype

I would like to extract a character code in some form (a Unicode code point, wchar_t, or anything printable to a text format) from a glyph_index and an FT_Face. I know that FT_Get_Char_Index exists for the reverse direction (character code to glyph index).
Some glyphs may be composite, but is there any way to do the translation under some mild assumptions, for example that the text is in the Latin alphabet?


How to get glyph unicode using freetype?

I'm trying to use freetype to enumerate the glyphs (name and unicode) in a font file.
For getting the name, I'm using FT_Get_Glyph_Name.
But how can I get the glyph unicode value?
I'm a newbie to glyph and font.
The Unicode code point is not technically stored together with the glyph in a TrueType/OpenType font. One has to iterate over the font's cmap table to get the mapping, which could also be a non-Unicode one, and multiple mappings pointing to the same glyph may exist. The good news is that FreeType provides well-documented facilities in its API to iterate over the code points of the currently selected character map. So, in code:
// Ensure a Unicode character map is selected
if (FT_Select_Charmap(face, FT_ENCODING_UNICODE) != 0) {
    // The font has no Unicode character map
}

FT_ULong charcode;
FT_UInt  gid;
charcode = FT_Get_First_Char(face, &gid);
while (gid != 0)
{
    std::cout << std::format("Codepoint: {:x}, gid: {}", charcode, gid) << std::endl;
    charcode = FT_Get_Next_Char(face, charcode, &gid);
}
With this information you can create a best effort map from glyphs to Unicode code points.
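Collecting those (charcode, gid) pairs into a reverse map is one way to build that best-effort table. A minimal sketch, using plain integer types instead of FT_ULong/FT_UInt so it stands alone, with the cmap entries passed in as a plain vector rather than read from an actual FT_Face:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Build a best-effort reverse map: glyph index -> all code points mapped
// to it. The (codepoint, gid) pairs would come from the FT_Get_First_Char /
// FT_Get_Next_Char loop above; here they arrive as a plain vector.
std::map<std::uint32_t, std::vector<std::uint32_t>>
build_reverse_map(const std::vector<std::pair<std::uint32_t, std::uint32_t>>& cmap)
{
    std::map<std::uint32_t, std::vector<std::uint32_t>> rev;
    for (const auto& [codepoint, gid] : cmap)
        rev[gid].push_back(codepoint);  // one glyph may collect several code points
    return rev;
}
```

Looking up a glyph index then yields every code point the cmap associates with it; a shared A/Alpha glyph, say, would list both U+0041 and U+0391.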
One would expect the FT_CharMap to hold this info:
[...] The currently active charmap is available as face->charmap.
but unfortunately it only identifies the kind of encoding (Unicode, MacRoman, Shift-JIS, etc.). The actual lookup is done elsewhere, and glyph index 0 (.notdef) simply gets returned when a character is unavailable after all.
Looking in one of my own FreeType-based OpenType renderers, which reports glyphs 'by name' where possible, I found in the initialization sequence some code that stores the name of a glyph if it has one, and the Unicode code point otherwise. But that code relied on the presence of glyph names.
Thinking further: you can test every possible Unicode code point and see whether FT_Get_Char_Index returns 0 (.notdef) or a valid glyph index. So initialize an empty table for all possible glyphs and only fill in each one's Unicode code point if the following routine finds it.
For a moderately modern font you need only check up to U+FFFF; for something like a heavy Chinese font (up to U+2F9F4 for Heiti SC) or an emoji font (up to U+1FA95 for Segoe UI Emoji) you need a considerably larger array. (Getting that maximum number out of a font is an entirely different story, alas. What to do here depends on what you want to use this for.)
FT_UInt  glyph_index;
FT_ULong code;

printf("num glyphs: %ld\n", face->num_glyphs);
for (code = 1; code <= 0xFFFF; code++)
{
    glyph_index = FT_Get_Char_Index(face, code);
    /* 0 = .notdef */
    if (glyph_index)
    {
        printf("%u -> %04lX\n", glyph_index, code);
    }
}
This short C snippet prints the translation table from font glyph index to a corresponding Unicode code point. Beware that (1) not all glyphs in a font need to have a Unicode code point associated with them: some fonts have tons of 'extra' glyphs, to be used in OpenType substitutions (such as alternative designs and custom ligatures) or for other purposes (such as the aforementioned Segoe UI Emoji, which contains color masks for all of its emoji). And (2) some glyphs may be associated with multiple Unicode characters: the glyph design for A, for example, can be used for both Latin Capital Letter A and Greek Capital Letter Alpha.
Not all glyphs in a font will necessarily have a Unicode code point. In OpenType text display, there is an m:n mapping between Unicode character sequences and glyph sequences. If you are interested in a relationship between Unicode code points and glyphs, the thing that makes most sense is to use the mapping from Unicode code points to default glyphs that is contained in a font's 'cmap' table.
For more background, see OpenType spec: Advanced Typographic Extensions - OpenType Layout.
As for glyph names, every glyph can have a name, regardless of whether it is mapped from a code point in the 'cmap' table or not. Glyph names are contained in the 'post' table. But not all fonts necessarily include glyph names. For example, a CJK font is unlikely to include glyph names.

does length of barcode image change because of barcode text

I was reading about barcodes, just a general query I came up with:
Does the length of barcode image change because of the text in it?
For example: will the length of a barcode encoding 986262 differ from one encoding 111111?
Generally speaking, you can consider barcodes "monospaced". The only cases for which this isn't true are when a character needs an escape code to be represented.
For example in Code 128B, you need to escape to Code 128A to issue a control character like TAB, or in Code 128A you need to switch temporarily to Code 128B to embed a lowercase alphabetical character.
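The arithmetic behind the "monospaced" claim can be sketched for Code 128: every symbol (start code, each data symbol, checksum) is 11 modules wide and the stop pattern is 13, so two inputs that need the same number of data symbols and no code-set switches render at the same width. A sketch under that single-code-set assumption:

```cpp
// Width in modules of a Code 128 barcode, assuming one symbol per data
// character in a single code set (no shifts or code-set switches):
// start + data symbols + checksum are 11 modules each, stop is 13.
int code128_width_modules(int data_symbols)
{
    return 11 * (data_symbols + 2) + 13;
}
```

Both 986262 and 111111 are six digits, so code128_width_modules(6) applies to each and the printed barcodes come out equally wide.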

QTextDocument print to pdf and unicode

I am trying to print a PDF file from a QTextDocument. The content of the document is set by setHtml().
Simplified example:
QTextDocument document;
document.setHtml("<h1>My html \304\205</h1>"); // Octal encoded ą
QPrinter printer(QPrinter::HighResolution);
printer.setPageSize(QPrinter::A4);
printer.setOutputFormat(QPrinter::PdfFormat);
printer.setOutputFileName("cert.pdf");
document.print(&printer);
It does not work as expected on Windows (MSVC): I get a PDF file with "?" in place of most Polish characters. It works on Ubuntu.
On Windows it produces a PDF with an embedded Tahoma font subset. How can I force QPrinter or QPrintEngine to embed more characters from this (or any other) font?
As pepe suggested in the comments, I needed to wrap the string in one of:
QString::fromUtf8
tr() (in case of joining translated parts)
an HTML escape sequence (e.g. &#261; for ą)
My original HTML in the program was built from tr() parts, but I forgot to octal-escape some of them. (This worked on GCC but not on MSVC, even with UTF-8 sources with a BOM.)
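The octal escapes sidestep compiler differences because they pin down the exact bytes: "\304\205" is the two-byte UTF-8 sequence 0xC4 0x85 for U+0105 (ą), whereas a literal ą in the source can be mangled by the compiler's idea of the source or execution character set. A minimal self-contained check of that claim:

```cpp
#include <string>

// "\304\205" spells out the bytes 0xC4 0x85 directly; that is the UTF-8
// encoding of U+0105 (LATIN SMALL LETTER A WITH OGONEK). No source-charset
// interpretation is involved, so MSVC and GCC produce the same bytes.
bool is_utf8_a_ogonek(const std::string& s)
{
    return s.size() == 2 &&
           static_cast<unsigned char>(s[0]) == 0xC4 &&
           static_cast<unsigned char>(s[1]) == 0x85;
}
```

QString::fromUtf8 then decodes those bytes into the correct QChar regardless of platform, which is why the wrapped strings survive the trip through QTextDocument.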

Arabic-English Transliteration using unsupported font

I am working on language transliteration for Ar and En text.
Here is the link which displays character by character replacement : https://github.com/Shnoulle/Ar-PHP/blob/master/Arabic/data/Transliteration.xml
Now issue is:
I am dealing with the fonts robert_bold.ttf and robert_regular_0.ttf, which have some special characters with underlines and overlines, as in this snap.
I have the .ttf files, so I can see these fonts on my system. But in my application, and in the Transliteration.xml above, the characters come out as junk like [, } [ etc.
How can I add support for these unsupported characters in the Transliteration.xml file?
<pair>
<search>ي</search>
<replace>y</replace>
</pair>
<pair>
<search>ى</search>
<replace>a</replace>
</pair>
<pair>
<search>أ</search>
<replace>^</replace> <!-- should be s with underscore, which is not supported -->
</pair>
It seems that the font is not Unicode encoded but contains the underlined letters at some arbitrarily assigned codes. While this works up to a point, it does not work across applications, of course. It works only when that specific font is used.
The proper way is to use correct Unicode characters such as U+1E0F LATIN SMALL LETTER D WITH LINE BELOW “ḏ” and, for rendering, try to find fonts containing it.
An alternative is to use just basic Latin letters with some markup, say <u>d</u>. This means that the text must not be treated as plain text in later processing, and in rendering, the markup should be interpreted as requesting for a line under the letter(s).
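Either way, text extracted with that font's arbitrary codes has to be remapped before it can be treated as ordinary Unicode. A sketch of such a fix-up pass, where the table entries ('[' and '}' mapping to "ḏ" and to markup) are purely hypothetical examples, not the font's real assignments:

```cpp
#include <map>
#include <string>

// Remap a font's private character codes to proper Unicode (or markup).
// Codes not present in the table are passed through unchanged.
std::string remap_private_codes(const std::string& text,
                                const std::map<char, std::string>& table)
{
    std::string out;
    for (char c : text) {
        auto it = table.find(c);
        out += (it != table.end()) ? it->second : std::string(1, c);
    }
    return out;
}
```

With a table like {'[' -> "\xE1\xB8\x8F"} (the UTF-8 bytes of U+1E0F ḏ) or {'}' -> "<u>s</u>"}, the same routine serves both the proper-Unicode and the markup approach.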

How to get glyph unicode representation of Unicode character

Windows uses the Uniscribe library to substitute Arabic and Indic typed characters based on their location. The new glyph still carries the original Unicode code point of the typed character, although it has its own dedicated representation in Unicode.
How can I get the Unicode of what is actually displayed, not what is typed?
There are lots of tools for this, like ICU and Charmap. I myself recommend http://unicode.codeplex.com; it uses the Unicode Character Database to represent characters.
Note that Unicode only carries information about characters and says nothing about their visual representation. To view each code point you need a comprehensive Unicode font such as MS Arial Unicode, which is the largest and the best choice on the Windows platform.
Most characters are implemented in this font, but for new characters you need an update for it (if such an update exists), or you can use a font that you know implements your desired characters.
Your interpretation of what is happening in Uniscribe is not correct.
Once you have glyphs, the original information is gone; there is no reliable way to go back to Unicode.
Even without going to Arabic, there is no way to distinguish if the glyph for the fi ligature (for example) comes from 'f' and 'i' (U+0066 U+0069) or from 'fi' (U+FB01).
(http://www.fileformat.info/info/unicode/char/fb01/index.htm)
Also, some of the resulting glyphs do not have a Unicode value associated with them, so there is no "Unicode of what is actually displayed"
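The information loss is one-directional: Unicode does record a compatibility decomposition for presentation forms like U+FB01, so that code point can be expanded back to "fi", but a glyph index chosen by the shaper carries no comparable record. A tiny hand-picked subset of those decompositions, for illustration only (not a full NFKD implementation):

```cpp
#include <cstdint>
#include <map>
#include <string>

// Compatibility decompositions for the Latin presentation-form ligatures
// U+FB00..U+FB04, copied from Unicode's NFKD data. Returns an empty string
// for code points without an entry in this small table.
std::string decompose_ligature(std::uint32_t codepoint)
{
    static const std::map<std::uint32_t, std::string> table = {
        {0xFB00, "ff"}, {0xFB01, "fi"}, {0xFB02, "fl"},
        {0xFB03, "ffi"}, {0xFB04, "ffl"},
    };
    auto it = table.find(codepoint);
    return it != table.end() ? it->second : std::string();
}
```

This only works because U+FB01 is itself a code point; a ligature glyph produced by an OpenType 'liga' substitution never passes through such a code point, which is exactly why the displayed text cannot be recovered reliably.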
