Interpreting a text character copied from a website and its format - windows

I'm curious as to how this works from a low-level point of view.
I understand that computers represent text characters using ASCII or Unicode code values.
For example, just now I copied a '€' character from a website to put in an email, because the character is not on my keyboard.
How does Windows store this character? As a unique integer identifying it? When I paste the character into an email or a Word document, it even preserves its text formatting.
How does the email editor or Word know how to reproduce what I copied with exactly the same format? And if the place I copied the character from was using its own special character encoding, would it be translated to the wrong character when I pasted it into an email?
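To make the question concrete, here is a minimal Python sketch (not Windows clipboard code) of the numbers involved. The euro sign is the Unicode code point U+20AC, and each encoding stores it as different bytes. Windows applications usually exchange plain clipboard text as UTF-16 (the CF_UNICODETEXT format), so the pasting application receives the code point rather than the website's original bytes. The variable names are only illustrative.

```python
euro = "\u20ac"  # the euro sign copied from the web page

code_point = ord(euro)            # 0x20AC: the "unique integer" for this character
utf8 = euro.encode("utf-8")       # three bytes: E2 82 AC
utf16 = euro.encode("utf-16-le")  # two bytes: AC 20 (little-endian, as used on Windows)
cp1252 = euro.encode("cp1252")    # one byte: 80 in the legacy Windows-1252 code page

# Decoding bytes with the wrong encoding is what produces a wrong character.
mojibake = utf8.decode("cp1252")  # UTF-8 bytes misread as Windows-1252

print(hex(code_point), utf8.hex(), utf16.hex(), cp1252.hex(), mojibake)
```

A wrong character appears only when some program in the chain decodes bytes with a different encoding than the one used to write them, as the last line shows.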

Related

Spacing issue between letters while converting Word to PDF on Windows

I have a Word document (docx) of Urdu text in the Jameel Noori Nastaleeq font. In Word it shows as a 10-page file, but after exporting to PDF it becomes an 11-page file because every letter is followed by extra space.
Can anyone please provide some information?
Edited:
Please download the file from
File
It has to do with the XML formatting of Word. When any text is pasted into Word (while the font is Jameel Noori Nastaleeq), Word places extra formatting in between the words. That formatting looks fine in Word; however, when the file is converted to PDF, the extra space becomes visible. When the text is simply typed in Word, the formatting is applied to entire paragraphs rather than to individual words. That is why a typed document doesn't contain the extra spaces.

QTextDocument print to pdf and unicode

I am trying to print a PDF file from a QTextDocument. The content of the document is set by setHtml().
Simplified example:
QTextDocument document;
document.setHtml("<h1>My html \304\205</h1>"); // Octal encoded ą
QPrinter printer(QPrinter::HighResolution);
printer.setPageSize(QPrinter::A4);
printer.setOutputFormat(QPrinter::PdfFormat);
printer.setOutputFileName("cert.pdf");
document.print(&printer);
It does not work as expected on Windows (msvc). I get a PDF file with "?" in place of most Polish characters. It works on Ubuntu.
On Windows it produces a PDF with an embedded subset of the Tahoma font. How can I force QPrinter or QPrintEngine to embed more characters from this (or any other) font?
As pepe suggested in the comments, I needed to do one of the following with this string:
Wrap it in QString::fromUtf8
Wrap it in tr() (in case of joining translated parts)
Use an HTML escape sequence (e.g. &#261; for ą)
My original HTML in the program was built from tr() parts, but I forgot to octal-escape some of them (which worked on gcc but not on msvc, even with the source saved as UTF-8 with BOM).
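The effect of the fix can be illustrated outside Qt. The sketch below is plain Python, not Qt code. It assumes the failing build treated the literal's bytes as a single-byte encoding such as Latin-1, and that assumption is not confirmed by the post. It shows that the octal escapes \304\205 are the UTF-8 bytes of ą, that decoding them as UTF-8 (which is what QString::fromUtf8 does) recovers the letter, and that the numeric reference &#261; names the same character.

```python
import html

raw = b"\304\205"  # the same two bytes as in the C++ string literal (0xC4 0x85)

as_utf8 = raw.decode("utf-8")      # correct: one character, U+0105 (a with ogonek)
as_latin1 = raw.decode("latin-1")  # assumed failure mode: two unrelated characters

from_entity = html.unescape("&#261;")  # decimal 261 == 0x105, the same letter

print(repr(as_utf8), repr(as_latin1), repr(from_entity))
```

Note that ę is a different letter (U+0119, decimal 281), so its numeric reference would be &#281;.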

Arabic-English Transliteration using unsupported font

I am working on language transliteration between Arabic and English text.
Here is a link that shows the character-by-character replacements: https://github.com/Shnoulle/Ar-PHP/blob/master/Arabic/data/Transliteration.xml
Now the issue is:
I am dealing with the fonts robert_bold.ttf and robert_regular_0.ttf, which contain some special characters with an underline or overline, as in this snap
I have the .ttf files, so I can see these fonts on my system. But in my application, and in the Transliteration.xml above, those characters come out as junk such as [, } [ etc.
How can I add support for these unsupported characters to the Transliteration.xml file?
<pair>
<search>ي</search>
<replace>y</replace>
</pair>
<pair>
<search>ى</search>
<replace>a</replace>
</pair>
<pair>
<search>أ</search>
<replace>^</replace> // Here is one of the character s_ (s with underscore not supported)
</pair>
It seems that the font is not Unicode encoded but contains the underlined letters at some arbitrarily assigned codes. While this works up to a point, it does not work across applications, of course. It works only when that specific font is used.
The proper way is to use correct Unicode characters such as U+1E0F LATIN SMALL LETTER D WITH LINE BELOW “ḏ” and, for rendering, try to find fonts containing it.
An alternative is to use just basic Latin letters with some markup, say <u>d</u>. This means that the text must not be treated as plain text in later processing, and in rendering, the markup should be interpreted as requesting for a line under the letter(s).
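As a small illustration of the Unicode route, the sketch below uses Python's standard unicodedata module. The helper name with_line_below is made up for this example. It appends U+0331 COMBINING MACRON BELOW to a base letter and normalizes the result, so a letter that has a precomposed form (such as d) becomes a single code point, while one that has none (such as s) stays as base letter plus combining mark.

```python
import unicodedata

COMBINING_MACRON_BELOW = "\u0331"

def with_line_below(letter: str) -> str:
    """Return the letter with a line below, composed (NFC) where Unicode allows it."""
    return unicodedata.normalize("NFC", letter + COMBINING_MACRON_BELOW)

d_below = with_line_below("d")  # composes to the single code point U+1E0F
s_below = with_line_below("s")  # no precomposed form: stays 's' + U+0331

print(unicodedata.name(d_below))
print([f"U+{ord(ch):04X}" for ch in s_below])
```

Either string can be placed in a <replace> element of the XML file as ordinary text. Whether it displays correctly still depends on the font supporting the combining mark.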

How to get glyph unicode representation of Unicode character

Windows uses the Uniscribe library to substitute typed Arabic and Indic characters based on their position. The new glyph still has the original Unicode value of the typed character, although it has its own dedicated representation in Unicode.
How can I get the Unicode value of what is actually displayed, not of what was typed?
There are lots of tools for this, like ICU, Charmap and the rest. I myself recommend http://unicode.codeplex.com, which uses the Unicode Character Database to represent characters.
Note that Unicode only provides information about characters and says nothing about their visual representation. It merely suggests rendering a character like the example glyph in its charts. So to view each code point you need a standard Unicode font such as Arial Unicode MS, which is the largest and the best choice on the Windows platform.
Most characters are implemented in this font, but for new characters you need an update for it (if there is such an update), or you can use a font that you know implements the characters you want.
Your interpretation of what is happening in Uniscribe is not correct.
Once you have glyphs, the original information is gone; there is no reliable way to go back to Unicode.
Even without going to Arabic, there is no way to distinguish if the glyph for the fi ligature (for example) comes from 'f' and 'i' (U+0066 U+0069) or from 'fi' (U+FB01).
(http://www.fileformat.info/info/unicode/char/fb01/index.htm)
Also, some of the resulting glyphs do not have a Unicode value associated with them, so there is no "Unicode of what is actually displayed".
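The fi example can be checked with Python's standard unicodedata module. This is only a sketch of the character-level side: it shows that two different code point sequences stand for the same visible ligature, which is why a glyph alone cannot tell you which sequence was typed. It does not touch Uniscribe or any font.

```python
import unicodedata

typed_pair = "fi"            # U+0066 U+0069, which a font may render as one ligature glyph
ligature_char = "\ufb01"     # U+FB01, a single code point for the same ligature

ligature_name = unicodedata.name(ligature_char)
# Compatibility normalization maps the ligature character back to the two letters.
folded = unicodedata.normalize("NFKC", ligature_char)

print(ligature_name, folded == typed_pair, len(typed_pair), len(ligature_char))
```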

How to render an arabic character in OpenGL?

I am able to display Chinese characters correctly, but when I try to display an Arabic string, the output shown in the OpenGL scene is different from the Arabic string shown in the Visual Studio editor. I know it has something to do with "complex scripts", but I have not been able to find a good example on this subject. How can I display Arabic text correctly?
Unlike Latin characters which each have a single visual representation, each Arabic character can have many different appearances depending on the surrounding characters. The logical characters in an Arabic string need to be converted to a sequence of visual glyphs in order to be correctly displayed. OpenGL doesn't do this processing for you so you're seeing the logical characters rendered without this processing.
To get around this you will need to use a library such as Uniscribe to transform the logical string into a visual string which you then give to OpenGL for rendering. There are some samples here.
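To illustrate what "many different appearances" means, the sketch below uses Python's standard unicodedata module to list the contextual forms that Unicode records for one logical letter, ARABIC LETTER BEH (U+0628), in the Arabic Presentation Forms-B block. It is not a shaping implementation, and the helper name presentation_forms is made up for this example. A real shaping engine such as Uniscribe selects glyphs using the font's own tables.

```python
import unicodedata

BEH = "\u0628"  # the logical character stored in the string

def presentation_forms(letter: str) -> dict:
    """Map form name (isolated/final/initial/medial) to its presentation character."""
    forms = {}
    for cp in range(0xFE70, 0xFF00):  # Arabic Presentation Forms-B block
        parts = unicodedata.decomposition(chr(cp)).split()
        # Entries look like "<initial> 0628": a form tag followed by the base letter.
        if len(parts) == 2 and parts[0].startswith("<") and int(parts[1], 16) == ord(letter):
            forms[parts[0].strip("<>")] = chr(cp)
    return forms

beh_forms = presentation_forms(BEH)
for form, ch in beh_forms.items():
    print(form, f"U+{ord(ch):04X}")
```

Rendering the logical character directly, without choosing among these forms and joining them, gives the disconnected output described in the question.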

Resources