How to avoid inadvertent encoding of UTF-8 files as ASCII/ANSI?

While editing a file encoded as UTF-8 without BOM, the content may end up containing no Unicode characters outside the ASCII or ANSI range. The next time the file is opened, some text editors (Notepad++) will interpret it as ASCII/ANSI encoded and open it as such. Unaware of the change, the user continues editing, now adding non-ANSI Unicode characters, which are rendered useless because the file is saved as ANSI. A menu option may exist (Notepad++) to open ANSI files as UTF-8 without BOM, but that leads to the reverse issue of inadvertently saving ANSI files with a Unicode encoding.

One workaround is to add a character outside the ANSI range to a comment in the file. Depending on the decoding algorithm, it might force the editor (Notepad++) to recognize the file as encoded in UTF-8 w/o BOM.
In an HTML document, for example, you could follow the charset declaration in the head with such a Unicode comment, here U+05D0 HEBREW LETTER ALEF:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <!-- א -->

How would you suggest that an editor tell the difference between ASCII/ANSI and UTF-8 w/o BOM, when the files look the same?
If you want guaranteed recognition of UTF-8 as UTF-8, either add the BOM, or force the file to contain UTF-8 characters.

Configure your editor to always use UTF-8 if possible; if that's not possible, complain to the creators of your editor. Charsets not targeting Unicode are, IMO, deprecated and should be treated as such.
Files using only characters in the ASCII space (the 7-bit one) are byte-for-byte identical in UTF-8 anyway, so if you HAVE to deliver something in ASCII encoding, just don't type any non-ASCII Unicode characters.
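To make the ambiguity concrete, here is a minimal sketch (my own illustration, not from the answers above) of the only byte-level signals an editor really has: a leading BOM, or at least one byte with the high bit set. Pure 7-bit ASCII content triggers neither.
#include <string>

// Hedged sketch of the two checks an editor can apply to decide that a
// buffer is UTF-8 rather than ASCII/ANSI.
bool starts_with_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           (unsigned char)bytes[0] == 0xEF &&
           (unsigned char)bytes[1] == 0xBB &&
           (unsigned char)bytes[2] == 0xBF;
}

bool contains_non_ascii_byte(const std::string& bytes) {
    for (unsigned char c : bytes)
        if (c >= 0x80) return true;   // something outside 7-bit ASCII
    return false;                     // pure ASCII: ANSI and UTF-8 are byte-identical
}
If neither check fires, the bytes are valid ASCII, ANSI and UTF-8 all at once, which is exactly why the editor's guess can silently change between sessions.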

Related

editing files with bitbucket adds this to the start of files: M-oM-;M-?

Firstly, what is M-oM-;M-? ?
When I push a commit to bitbucket, and someone uses the online editor to make a small change, it changes the first line from:
<?xml version="1.0" encoding="utf-8"?>
to:
M-oM-;M-?<?xml version="1.0" encoding="utf-8"?>
I can see these special characters using cat -A <myfile>
This is a problem because it breaks my *.csproj files and the projects fail to load in Visual Studio.
Bitbucket Support gave me articles about .gitattributes and git config, which I've already tried, but the issue persists:
$ git config core.autocrlf
true
$ cat .gitattributes
*.js text
*.cs text
*.xml text
*.csproj text
*.sln text
*.config text
*.cshtml text
*.json text
*.sql text
*.ts text
*.xaml text
I've also tried:
$ cat .gitattributes
*.js text eol=crlf
*.cs text eol=crlf
*.xml text eol=crlf
*.csproj text eol=crlf
*.sln text eol=crlf
*.config text eol=crlf
*.cshtml text eol=crlf
*.json text eol=crlf
*.sql text eol=crlf
*.ts text eol=crlf
*.xaml text eol=crlf
Is there some setting that I'm missing to help prevent this set of characters from being inserted into the start of my files?
First: M-o, M-;, and M-? are representation techniques to show non-ASCII characters as ASCII. Specifically, they're an encoding technique to show that bit 7 (0x80) is set, and the remaining bits are then displayed as if the characters were ASCII. Lowercase o is code 0x6f, ; is 0x3b, and ? is 0x3f. Putting the high bit (0x80) back into all three, and dropping the 0x and using uppercase, we get the values EF, BB, and BF. If nothing else, you should memorize this sequence—EF BB BF—or at least remember that it exists, because it's the UTF-8 encoding of a Unicode Byte Order Mark or BOM, U+FEFF (which you should also memorize, at least that it exists).
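As a quick sanity check of those numbers, here is a small sketch (my own, assuming a cat -A-style display) that reproduces the M- notation by hand:
#include <cstdio>

int main() {
    // The UTF-8 encoding of U+FEFF (the BOM) is the three bytes EF BB BF.
    // Tools like cat -A print a byte with the high bit set as "M-" followed
    // by the character for (byte & 0x7F).
    const unsigned char bom[] = {0xEF, 0xBB, 0xBF};
    for (unsigned char b : bom)
        std::printf("M-%c", b & 0x7F);   // 0x6F 'o', 0x3B ';', 0x3F '?'
    std::printf("\n");                   // prints: M-oM-;M-?
    return 0;
}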
For more on Unicode in general, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
When storing Unicode as UTF-16, the byte order mark has a purpose: it tells you whether the stored data is UTF-16-LE, or UTF-16-BE. But when storing Unicode as UTF-8, the byte order mark is almost entirely useless. I personally believe it should never be used. Microsoft, on the other hand, apparently believe it should always be used (or almost always). See the Wikipedia quote below.
... and someone uses the online editor ...
This online editor, apparently, is either written by Microsoft, or by someone who thinks Microsoft is correct. They are inserting a UTF-8 byte order mark in your plain-text file.
Bitbucket Support gave me articles about .gitattributes ...
Unless the online editor looks inside .gitattributes files, this won't help: it's that editor that is adding the BOM.
That said, since Git 2.18, Git has had the notion of a working-tree-encoding attribute. Some editors might actually look at this. I may not understand the Microsoft philosophy correctly—I already noted that I disagree with it. I think, though, that they say: store a BOM in any UTF-8 encoded file if the "main" copy of that file should be stored in UTF-16 format. (Side note: the UTF-8 BOM tells you nothing about whether the UTF-16 file would be UTF-16-LE or UTF-16-BE, so—again in my opinion—it's pretty useless as an indicator. See also In UTF-16, UTF-16BE, UTF-16LE, is the endian of UTF-16 the computer's endianness?)
In any case, if this editor does look at some configuration option, setting the configuration option—whatever it is—would help. If it does not, nothing you do here will help. Note that working-tree-encoding, while related to Unicode encoding, does not imply that a BOM should or should not be included. So, if your Git is 2.18 or later, you have this extra knob you can twiddle, but that's not what it's for. If it does actually help, that's great, but also quite wrong. :-)
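For reference, a working-tree-encoding line follows the form shown in the gitattributes documentation; it declares that the working-tree copy of matching files is UTF-16 while Git stores UTF-8 internally, and again it says nothing about a BOM:
*.ps1 text working-tree-encoding=UTF-16LE eol=CRLF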
The thing that's weirdest about this is:
[The BOM] breaks my *.csproj files and fails to load projects in Visual Studio.
Visual Studio is a Microsoft product. The Wikipedia page notes that:
Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII.
One would think that if their editors insist on adding BOMs, their other programs would be able to handle BOMs.

QTextDocument print to pdf and unicode

I'm trying to print a PDF file from a QTextDocument. The content of the document is set with setHtml().
Simplified example:
#include <QTextDocument>
#include <QPrinter>

QTextDocument document;
document.setHtml("<h1>My html \304\205</h1>"); // Octal encoded ą
QPrinter printer(QPrinter::HighResolution);
printer.setPageSize(QPrinter::A4);
printer.setOutputFormat(QPrinter::PdfFormat);
printer.setOutputFileName("cert.pdf");
document.print(&printer);
It does not work as expected on Windows (MSVC). I get a PDF file with "?" in place of most Polish characters. It works on Ubuntu.
On Windows it produces a PDF with an embedded subset of the Tahoma font. How can I force QPrinter or QPrintEngine to embed more characters from this (or any other) font?
As pepe suggested in the comments, I needed to wrap the string in one of:
QString::fromUtf8
tr() (in case of joining translated parts)
Use an HTML escape sequence (e.g. &#261; for ą)
My original HTML in the program was built from tr() parts, but I forgot to octal-escape some of them (which worked on GCC, but not on MSVC, even with UTF-8-with-BOM sources).
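A minimal sketch of what that fix looks like in the example above (my own illustration of the suggestions, not the asker's exact code):
// Decode the 8-bit literal explicitly as UTF-8 instead of relying on the
// default const char* -> QString conversion:
document.setHtml(QString::fromUtf8("<h1>My html \304\205</h1>"));

// Or keep the source pure ASCII and let the HTML parser produce the letter:
document.setHtml(QStringLiteral("<h1>My html &#261;</h1>"));   // &#261; is U+0105 ą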

Arabic-English Transliteration using unsupported font

I am working on language transliteration for Ar and En text.
Here is the link which displays character by character replacement : https://github.com/Shnoulle/Ar-PHP/blob/master/Arabic/data/Transliteration.xml
Now issue is:
I am dealing with the fonts robert_bold.ttf and robert_regular_0.ttf, which have some special characters with underlines and overlines, as in this snap.
I have the .ttf files, so I can see these fonts on my system. But in my application, and in the Transliteration.xml above, those characters come through as junk like [, } [ etc.
How can I add support for these unsupported characters in the Transliteration.xml file?
<pair>
<search>ي</search>
<replace>y</replace>
</pair>
<pair>
<search>ى</search>
<replace>a</replace>
</pair>
<pair>
<search>أ</search>
<replace>^</replace> <!-- intended here: s with a line below, which is not supported -->
</pair>
It seems that the font is not Unicode encoded but contains the underlined letters at some arbitrarily assigned codes. While this works up to a point, it does not work across applications, of course. It works only when that specific font is used.
The proper way is to use correct Unicode characters such as U+1E0F LATIN SMALL LETTER D WITH LINE BELOW “ḏ” and, for rendering, try to find fonts containing it.
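For example, a hypothetical pair (assuming the underlined letter is meant as the transliteration of Arabic ḏāl, as in common transliteration schemes) would use the real Unicode character rather than a font-specific code:
<pair>
<search>ذ</search>
<replace>ḏ</replace> <!-- U+1E0F LATIN SMALL LETTER D WITH LINE BELOW -->
</pair>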
An alternative is to use just basic Latin letters with some markup, say <u>d</u>. This means that the text must not be treated as plain text in later processing, and in rendering, the markup should be interpreted as a request for a line under the letter(s).

UTF-8 but still not showing ÆØÅ (danish chars)

Take a look at this:
http://thebekker.dk/_skole/GFeksamen/
You can see the 2nd menu item shows some weird sign instead of "Ø".
I've set utf-8 in the meta tag, and even tried AddDefaultCharset UTF-8 in .htaccess...
Still no result. If I change to ISO-8859-1 it works fine, but that causes problems when I start making AJAX calls for content...
I don't get it.
How do I get it to use UTF-8 and show ÆØÅ?
If you declare that your content is encoded in UTF-8 with the meta tags or default charset, then your content needs to be actually encoded in UTF-8. The fact that it shows correctly when declaring your content to be encoded in ISO-8859-1 means that your content is actually encoded in ISO-8859-1. Save your source code file as UTF-8 or otherwise make sure that your content is UTF-8 encoded.
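To make the mismatch concrete, here is a small sketch (my own illustration) of what the browser receives for "Ø" under each encoding:
#include <cstdio>

int main() {
    const unsigned char latin1[] = {0xD8};       // "Ø" saved as ISO-8859-1: one byte
    const unsigned char utf8[]   = {0xC3, 0x98}; // "Ø" saved as UTF-8: two bytes
    // A browser told "charset=utf-8" that receives the lone byte 0xD8 sees an
    // invalid UTF-8 sequence and renders the replacement character instead of Ø.
    std::printf("ISO-8859-1: %02X   UTF-8: %02X %02X\n",
                latin1[0], utf8[0], utf8[1]);
    return 0;
}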
Saving the source file as "Western European (Windows)" in the EditPlus text editor did it for me, plus using utf8_encode in PHP.
You can also write such characters with Unicode-based HTML entities, like € and many others. In my company we work with many translations and languages, like French, which has many special characters.
Set your website's encoding to UTF-8 and use functions like utf8_encode in PHP,
or convert manually: http://www.sql-und-xml.de/unicode-database/online-tools/

How to get glyph unicode representation of Unicode character

Windows uses the Uniscribe library to substitute Arabic and Indic typed characters based on their position. The new glyph still carries the original Unicode code point of the typed character, although the shaped form has its own dedicated representation in Unicode.
How can I get the Unicode of what is actually displayed, not what is typed?
There are lots of tools for this, like ICU, Charmap and the rest. I myself recommend http://unicode.codeplex.com; it uses the Unicode Character Database to represent characters.
Note that Unicode only assigns properties and information to characters and never speaks about their visual representation; the code charts merely suggest what a rendering might look like. So to view each code point you need a comprehensive Unicode font like Arial Unicode MS, which is the largest and the best choice on the Windows platform.
Most of the characters are implemented in this font, but for new characters you need an update for it (if there is such an update), or you can use a font that you know implements your desired characters.
Your interpretation of what is happening in Uniscribe is not correct.
Once you have glyphs, the original information is gone; there is no reliable way to go back to Unicode.
Even without going to Arabic, there is no way to distinguish if the glyph for the fi ligature (for example) comes from 'f' and 'i' (U+0066 U+0069) or from 'fi' (U+FB01).
(http://www.fileformat.info/info/unicode/char/fb01/index.htm)
Also, some of the resulting glyphs do not have a Unicode value associated with them, so there is no "Unicode of what is actually displayed".
