Windows uses the Uniscribe library to substitute typed Arabic and Indic characters based on their position in the word. The substituted glyph still carries the Unicode value of the typed character, although it has its own dedicated representation in Unicode.
How can I get the Unicode value of what is actually displayed, not of what was typed?
There are lots of tools for this, such as ICU, Charmap and others. I recommend http://unicode.codeplex.com, which uses the Unicode Character Database to represent characters.
Note that Unicode only provides information about characters and says nothing about how they are rendered; the shapes in the code charts are just suggested examples. So to view each code point you need a font with broad Unicode coverage, such as Arial Unicode MS, which is the largest and best choice on the Windows platform.
Most characters are implemented in this font, but for newer characters you need an updated version of it (if one exists), or you can use another font that you know implements the characters you want.
Your interpretation of what is happening in Uniscribe is not correct.
Once you have glyphs, the original information is gone; there is no reliable way to go back to Unicode.
Even without going to Arabic, there is no way to tell whether the glyph for the fi ligature (for example) comes from 'f' and 'i' (U+0066 U+0069) or from 'ﬁ' (U+FB01).
(http://www.fileformat.info/info/unicode/char/fb01/index.htm)
Also, some of the resulting glyphs do not have a Unicode value associated with them, so there is no "Unicode of what is actually displayed".
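The ligature case can be checked against the Unicode Character Database. Here is a minimal Python sketch using only the standard unicodedata module; it illustrates the code-point side of the argument and does not involve Uniscribe itself:

```python
import unicodedata

typed = "\u0066\u0069"      # 'f' followed by 'i'
ligature = "\uFB01"         # LATIN SMALL LIGATURE FI

# The two strings are different sequences of code points...
same_code_points = typed == ligature

# ...but compatibility normalization folds the ligature into 'f' + 'i',
# so after NFKC nothing records which form the text originally used.
folded = unicodedata.normalize("NFKC", ligature)

print(unicodedata.name(ligature))
print(same_code_points, folded == typed)
```

A renderer is free to draw both strings with the same glyph, which is why the glyph alone cannot tell you which code points produced it.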
I'm trying to use FreeType to enumerate the glyphs (name and Unicode value) in a font file.
For getting the name, I'm using FT_Get_Glyph_Name.
But how can I get the glyph's Unicode value?
I'm new to glyphs and fonts.
The Unicode code point is not technically stored together with the glyph in a TrueType/OpenType font. One has to iterate over the font's cmap table to get the mapping, which may be a non-Unicode one, and multiple mappings may point to the same glyph. The good news is that FreeType provides well-documented API facilities for iterating over the code points in the currently selected character map. In code:
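// Requires <format>, <iostream>, <ft2build.h> and FT_FREETYPE_H.
// `face` is assumed to be an already loaded FT_Face.
// Ensure a Unicode character map is selected
FT_Select_Charmap(face, FT_ENCODING_UNICODE);
FT_ULong charcode;
FT_UInt gid;
charcode = FT_Get_First_Char(face, &gid);
// gid becomes 0 once the charmap has no further entries
while (gid != 0)
{
    std::cout << std::format("Codepoint: {:x}, gid: {}\n", charcode, gid);
    charcode = FT_Get_Next_Char(face, charcode, &gid);
}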
With this information you can create a best effort map from glyphs to Unicode code points.
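As a rough illustration of that best-effort map, the (code point, glyph id) pairs can be inverted into a glyph-to-code-points table. The sketch below is plain Python with no FreeType binding; `invert_cmap` and the sample values are hypothetical and only stand in for the pairs such a loop would print:

```python
from collections import defaultdict

def invert_cmap(cmap):
    """Turn {code point: glyph id} into {glyph id: sorted code points}."""
    glyph_to_codes = defaultdict(list)
    for codepoint, gid in cmap.items():
        if gid != 0:            # glyph 0 is .notdef, skip it
            glyph_to_codes[gid].append(codepoint)
    return {gid: sorted(codes) for gid, codes in glyph_to_codes.items()}

# Hypothetical cmap contents; the glyph ids are made up for illustration.
sample_cmap = {0x0041: 36, 0x0391: 36, 0x0042: 37, 0xE000: 0}
reverse = invert_cmap(sample_cmap)
print(reverse)
```

Each glyph maps to a list because several code points may share one glyph. Glyphs that are reachable only through OpenType substitutions never appear in the table at all.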
One would expect the FT_CharMap to hold this info:
[...] The currently active charmap is available as face->charmap.
but unfortunately it only defines the kind of encoding (Unicode, MacRoman, Shift-JIS etc.). Apparently the act of looking up a code is done elsewhere – and .notdef simply gets returned when that character is unavailable after all.
Looking in one of my own FreeType-based OpenType renderers, which reports glyphs 'by name' where possible, I found some code in the initialization sequence that stores the name of a glyph if it has one, and its Unicode value otherwise. But that code relied on the presence of glyph names.
Thinking further: you can test every possible Unicode codepoint and see if it returns 0 (.notdef) or a valid glyph index. So initialize an empty table for all possible glyphs and only fill in each one's Unicode if the following routine finds it.
For a moderately modern font you need only check up to Unicode U+FFFF; for something like a heavy Chinese font (up to U+2F9F4 for Heiti SC) or Emoji (up to U+1FA95 for Segoe UI Emoji) you need a considerably larger array. (Getting that maximum out of a font is an entirely different story, alas. Deciding what to do depends on what you want to use this for.)
FT_ULong code;
FT_UInt glyph_index;

/* num_glyphs is an FT_Long, so print it with %ld */
printf("num glyphs: %ld\n", face->num_glyphs);
for (code = 1; code <= 0xFFFF; code++)
{
    glyph_index = FT_Get_Char_Index(face, code);
    /* 0 = .notdef */
    if (glyph_index)
    {
        printf("%u -> %04lX\n", glyph_index, code);
    }
}
This short C snippet prints out the translation table from font glyph index to a corresponding Unicode code point. Beware that (1) not all glyphs in a font need to have a Unicode code point associated with them. Some fonts have tons of 'extra' glyphs, to be used in OpenType substitutions (such as alternative designs and custom ligatures) or for other purposes (such as the aforementioned Segoe UI Emoji, which contains color masks for all of its emoji). And (2) some glyphs may be associated with multiple Unicode characters. The glyph design for A, for example, can be used as both a Latin Capital Letter A and a Greek Capital Letter Alpha.
Not all glyphs in a font will necessarily have a Unicode code point. In OpenType text display, there is an m:n mapping between Unicode character sequences and glyph sequences. If you are interested in a relationship between Unicode code points and glyphs, the thing that makes most sense is to use the mapping from Unicode code points to default glyphs that is contained in the font's 'cmap' table.
For more background, see OpenType spec: Advanced Typographic Extensions - OpenType Layout.
As for glyph names, every glyph can have a name, regardless of whether it is mapped from a code point in the 'cmap' table or not. Glyph names are contained in the 'post' table. But not all fonts necessarily include glyph names. For example, a CJK font is unlikely to include glyph names.
I'm using the font Apple SD Gothic Neo. The letters print fine except when I have one with an accent mark, like ú:
This is not a custom font, and it happens on all font weights. If it makes a difference, I'm pulling the string from Firebase.
Why is this happening and what can I do?
Use a different font.
When a font lacks a glyph, that glyph is substituted from another font, resulting in a typographical mismatch. That’s what’s happening here. You are using a font that is very Unicode-incomplete for Latin alphabet characters. It is intended for Korean! Use a more appropriate font.
I was reading about barcodes and came up with a general question:
Does the length of a barcode image change depending on the text encoded in it?
For example, will the length of a barcode for 986262 be different from that of one for 111111?
Generally speaking, you can consider barcodes "monospaced". The only cases for which this isn't true are when a character needs an escape code to be represented.
For example in Code 128B, you need to escape to Code 128A to issue a control character like TAB, or in Code 128A you need to switch temporarily to Code 128B to embed a lowercase alphabetical character.
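To make the "monospaced" point concrete, here is a small sketch that counts the width of a Code 128 symbol in modules when everything is encoded in code set B. It assumes the standard Code 128 geometry (11 modules per symbol, a 13-module stop pattern) and ignores quiet zones; `code128b_modules` is a hypothetical helper, not part of any barcode library:

```python
MODULES_PER_SYMBOL = 11   # every Code 128 symbol is 11 modules wide
STOP_MODULES = 13         # the stop pattern is 13 modules wide

def code128b_modules(text):
    """Width in modules of `text` encoded in code set B, quiet zones excluded."""
    symbols = 1                       # start symbol for code set B
    for ch in text:
        code = ord(ch)
        if 32 <= code <= 127:
            symbols += 1              # directly encodable in code set B
        elif code < 32:
            symbols += 2              # SHIFT symbol plus the code set A symbol
        else:
            raise ValueError(f"{ch!r} is outside the range this sketch handles")
    symbols += 1                      # check symbol
    return symbols * MODULES_PER_SYMBOL + STOP_MODULES

print(code128b_modules("986262"), code128b_modules("111111"))
print(code128b_modules("98626\t"))
```

Under these assumptions 986262 and 111111 come out at the same width, while swapping one digit for a TAB adds one extra symbol. Real encoders often switch to code set C for runs of digits, which packs two digits per symbol; that shortens both numeric examples equally.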
I am working on language transliteration for Arabic and English text.
Here is the file that defines the character-by-character replacements: https://github.com/Shnoulle/Ar-PHP/blob/master/Arabic/data/Transliteration.xml
The issue is:
I am working with the fonts robert_bold.ttf and robert_regular_0.ttf, which contain some special characters with an underline or overline.
I have the .ttf files, so I can see these fonts on my system. But in my application and in the Transliteration.xml above, these characters show up as junk such as [, } and so on.
How can I add support for these unsupported characters to the Transliteration.xml file?
<pair>
<search>ي</search>
<replace>y</replace>
</pair>
<pair>
<search>ى</search>
<replace>a</replace>
</pair>
<pair>
<search>أ</search>
<replace>^</replace> // Here is one of the character s_ (s with underscore not supported)
</pair>
It seems that the font is not Unicode encoded but contains the underlined letters at some arbitrarily assigned codes. While this works up to a point, it does not work across applications, of course. It works only when that specific font is used.
The proper way is to use correct Unicode characters such as U+1E0F LATIN SMALL LETTER D WITH LINE BELOW “ḏ” and, for rendering, try to find fonts containing it.
An alternative is to use just basic Latin letters with some markup, say <u>d</u>. This means that the text must not be treated as plain text in later processing, and in rendering, the markup should be interpreted as requesting a line under the letter(s).
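For the Unicode route, one way to obtain such characters is to append U+0331 COMBINING MACRON BELOW to the base letter and let NFC normalization pick the precomposed character where one exists. This is a sketch using Python's standard unicodedata module; `with_line_below` is a hypothetical helper name:

```python
import unicodedata

COMBINING_MACRON_BELOW = "\u0331"

def with_line_below(letter):
    """Return `letter` with a line below, precomposed where Unicode allows."""
    return unicodedata.normalize("NFC", letter + COMBINING_MACRON_BELOW)

d_below = with_line_below("d")   # composes to the single code point U+1E0F
s_below = with_line_below("s")   # no precomposed form: stays 's' + U+0331

print(unicodedata.name(d_below))
print([f"U+{ord(c):04X}" for c in s_below])
```

Either result is ordinary Unicode text, so it can be stored in a transliteration table and displayed by any font that covers these characters, rather than only by one custom-encoded font.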
I am able to display Chinese characters correctly, but when I try to display an Arabic string, the output shown in the OpenGL scene is different from the Arabic string shown in the Visual Studio editor. I know it has something to do with "Complex Script", but I have not been able to find a good example on this matter. How can I display Arabic text correctly?
Unlike Latin characters which each have a single visual representation, each Arabic character can have many different appearances depending on the surrounding characters. The logical characters in an Arabic string need to be converted to a sequence of visual glyphs in order to be correctly displayed. OpenGL doesn't do this processing for you so you're seeing the logical characters rendered without this processing.
To get around this you will need to use a library such as Uniscribe to transform the logical string into a visual string which you then give to OpenGL for rendering. There are some samples here.
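To see how many appearances a single logical letter can have, the Unicode Character Database can be queried for the positional presentation forms of one letter. The sketch below uses only Python's standard unicodedata module and the Arabic Presentation Forms-B block; `presentation_forms` is a hypothetical helper and not a substitute for a shaping library:

```python
import unicodedata

def presentation_forms(letter):
    """Map positional form name -> presentation-form character for one letter."""
    forms = {}
    for cp in range(0xFE70, 0xFF00):          # Arabic Presentation Forms-B
        parts = unicodedata.decomposition(chr(cp)).split()
        # Entries look like "<initial> 0628": a tag, then the base letter(s).
        if len(parts) == 2 and parts[1] == f"{ord(letter):04X}":
            forms[parts[0].strip("<>")] = chr(cp)
    return forms

beh = presentation_forms("\u0628")            # ARABIC LETTER BEH
for position, ch in sorted(beh.items()):
    print(position, f"U+{ord(ch):04X}")
```

This only lists the forms of one letter. Choosing the right form for each position, forming ligatures and reordering right-to-left text is the work a shaping library does before the glyphs reach OpenGL.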