How do I make Emacs Dired mode display Unicode characters in Windows?

I have Emacs 23.3 running on Windows XP, and I work on some files whose filenames contain a mix of English and Devanagari or Tamil characters (e.g., que.प्रश्न.txt or ans.பதில்.txt).
When I visit the directory containing these files in Dired, the names don't appear correctly, even though I can see them in Windows Explorer. Dired displays names like "deva~1.txt" for filenames that begin with English characters, and for names composed entirely of non-English characters it displays something like "47d1~1.txt".
I suppose this has something to do with what Windows internally returns to Emacs, but I notice that running dir at a command prompt in the same directory displays the full names (even though cmd renders all the non-English characters as ? symbols).
Is there any way I can get Dired to render filenames with non-English characters correctly?

It's actually a limitation of Emacs's implementation. Emacs uses Windows primitives that date back to before Unicode, so any filename with chars that cannot be encoded in your "codepage" will be replaced with the mangled foo~1 name (if your file system is VFAT) or something else in other cases. Hopefully we will soon switch over to the "new" Windows primitives that use UTF-16 (IIRC) and do not suffer from such problems any more. But you may have to wait for Emacs-25.1 for that. It may happen sooner if you give us a hand, tho ;-)
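For the curious, here is a minimal sketch of the difference being described, assuming Python 3 with ctypes on Windows (the que.* pattern stands in for the file from the question; this is an illustration, not how Emacs itself is written). The narrow "A" Win32 directory APIs can only report names representable in the current ANSI codepage and carry the 8.3 alias alongside, while the wide "W" APIs, which os.listdir uses, always see the real name:

    # Sketch: compare the narrow (ANSI) and wide (UTF-16) views of a
    # directory entry, assuming Python 3 with ctypes on Windows.
    import ctypes
    import os
    from ctypes import wintypes

    class WIN32_FIND_DATAA(ctypes.Structure):
        _fields_ = [
            ("dwFileAttributes",   wintypes.DWORD),
            ("ftCreationTime",     wintypes.FILETIME),
            ("ftLastAccessTime",   wintypes.FILETIME),
            ("ftLastWriteTime",    wintypes.FILETIME),
            ("nFileSizeHigh",      wintypes.DWORD),
            ("nFileSizeLow",       wintypes.DWORD),
            ("dwReserved0",        wintypes.DWORD),
            ("dwReserved1",        wintypes.DWORD),
            ("cFileName",          ctypes.c_char * 260),
            ("cAlternateFileName", ctypes.c_char * 14),
        ]

    data = WIN32_FIND_DATAA()
    handle = ctypes.windll.kernel32.FindFirstFileA(b"que.*", ctypes.byref(data))
    if handle != -1:
        print("narrow API name:", data.cFileName)        # ANSI-codepage-limited
        print("8.3 alias:", data.cAlternateFileName)     # e.g. b'QUE~1.TXT'
        ctypes.windll.kernel32.FindClose(handle)

    # The wide APIs (used by os.listdir on Windows) see the full name.
    print("wide API names:", [n for n in os.listdir(".") if n.startswith("que.")])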

Related

How to create a filename with characters that are not part of UTF-8 on Windows?

[Edit/Disclaimer]: Comments pointed out that I have to clarify which encoding the user uses. I will update accordingly.
I have a customer from China who recently reported an issue with their filenames on Windows. The software works with most Chinese characters, but it seems he has found one file that fails.
Unfortunately, they are not able to send me the filename, as neither zipping nor transmitting the file through other mediums seems to preserve it.
What is the easiest way (e.g. through Python) to generate a filename on Windows that is covered by the NTFS file system encoding but not UTF-8?
Unicode strings are encoded as a series of bytes, and the rules for turning those bytes back into characters are what determine what a filename visually looks like to you in an operating system.
Given that Windows uses (a variation of) Unicode, and you say you have a character that's not in Unicode, it also means that there is simply no way to represent that character.
Imagine if Unicode only contained the numbers 0-9, and you asked someone how to encode the letter A. There's no answer to this, because only 0-9 are defined.
You could make up a new Unicode codepoint for your character, but then operating systems won't know what to do with it unless you also make your own font files.
I somehow doubt that that's what you want to do, though, but it's an option. Could your customer rename the file before sending it to you?
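That said, there is one well-known class of names that NTFS accepts but that is not valid UTF-8: NTFS stores names as raw sequences of 16-bit units, so an unpaired UTF-16 surrogate is a legal filename component yet cannot be encoded as well-formed UTF-8. A minimal sketch, assuming Python 3 on Windows (where str filenames are handed to the wide Win32 APIs unchanged; the filename below is hypothetical):

    # Create a file whose name NTFS accepts but strict UTF-8 rejects.
    # A lone high surrogate (U+D800) is allowed in an on-disk NTFS name
    # even though it is not valid Unicode text.
    name = "bad_\ud800_name.txt"      # hypothetical name with a lone surrogate

    with open(name, "w") as f:        # goes through CreateFileW unchanged
        f.write("hello")

    try:
        name.encode("utf-8")          # strict UTF-8 cannot express it
    except UnicodeEncodeError as e:
        print("not representable in UTF-8:", e)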

Unicode filenames on FAT-32?

As far as I understand, NTFS supports Unicode filenames (UTF-16, as Microsoft claims?).
But the official MSDN documentation is very vague regarding which codepage(s) are used to store filenames (filepaths) on FAT-32.
Here it says that the OEM code page (CP437, I assume) is used to store filenames: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317748.aspx
But here it turns out that there can be different OEM codepages, with CP437 being one of them: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317752.aspx
And we all know that utilities like mount support many more codepages for FAT than just the OEM set.
So what is the actual codepage for FAT-32 filenames? Does it depend on the system codepage at the time the FAT volume was created? Can FAT support true Double Byte Character Set codepages like UTF-16? Or are Multi Byte Character Set codepages like UTF-8 the limit?
And a more specific question:
What happens when I use the CreateFileW function (which, as MSDN states, uses UTF-16 as the filename codepage) to create a file on a FAT-32 volume?
You might have to experiment here. This is a great question, and I'm not 100% confident, but:
So what is the actual codepage for FAT-32 filenames? Does it depend on the system codepage at the time the FAT volume was created?
The "OEM codepage", whatever that is for the system.
Can FAT support true Double Byte Character Set codepages like UTF-16? Or are Multi Byte Character Set codepages like UTF-8 the limit?
No, I don't believe FAT is directly capable of either UTF-16 or UTF-8. That said, Microsoft stores the Unicode filename in an out-of-band way; a file thus has two filenames. (This is also how you can have filenames longer than 8.3 characters.)
And a more specific question: What happens when I use the CreateFileW function (which, as MSDN states, uses UTF-16 as the filename codepage) to create a file on a FAT-32 volume?
The Unicode filename, as passed to CreateFileW, is stored directly in the out-of-band (long) filename. For the in-band (short) entry, it is re-encoded into the OEM codepage (whatever that happens to be on the system). If it cannot be converted into the OEM codepage, or exceeds 8.3 characters, Windows will call the file something like FILENA~1.TXT.
Some citations for these answers:
First, this page tells us that the OEM code page != the Windows code page:
Non-Unicode applications that create FAT files sometimes have to use the standard C runtime library conversion functions to translate between the Windows code page character set and the OEM code page character set. With Unicode implementations of the file system functions, it is not necessary to perform such translations.
On a typical American system, the OEM code page is CP437, but the Windows code page is Windows-1252. (The FooA calls, I believe, use the Windows code page; which one that is depends on locale.)
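You can check which codepage contains which character using Python's codec tables, which mirror the Windows conversion tables for these codepages:

    # Σ (U+03A3) exists in CP437 but not in Windows-1252.
    print("\u03a3".encode("cp437"))       # b'\xe4'
    try:
        "\u03a3".encode("cp1252")
    except UnicodeEncodeError:
        print("\u03a3 is not representable in Windows-1252")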
If you have a FAT volume available, you can see this in action. The character "Σ" (U+03A3) is not present in Windows-1252; however, it is in CP437. You can see both the short and long filenames with dir /X. With a file named asdfΣ.txt, you'll see:
ASDFΣ.TXT asdfΣ.txt
However, with a file named "asdfΛ.txt" (Λ is not present in either CP437 or Windows-1252), you'll see:
ASDF~1.TXT asdf?.txt
(You'll likely see ?, because cmd.exe's font cannot display a Λ.)
For information about long filenames, see this Wikipedia article.
Also, interestingly, if you name a file "asdf©.txt", you might get:
ASDFC.TXT asdfc.txt
… I'm not 100% sure here, but I think Windows cleverly decided to substitute "c" for "©", and did likewise when displaying it. If you change the font to something not raster-based, like Consolas, you'll see:
ASDFC.TXT asdf©.txt
And this is why you should use the FooW functions.
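If you'd rather inspect the two names programmatically than with dir /X, here is a minimal sketch, assuming Python 3 with ctypes on Windows. GetShortPathNameW is the documented Win32 call behind the short names dir /X shows; short_name is just our own helper name:

    # Look up the 8.3 alias Windows generated for a long Unicode name.
    import ctypes

    def short_name(path: str) -> str:
        buf = ctypes.create_unicode_buffer(260)
        if ctypes.windll.kernel32.GetShortPathNameW(path, buf, len(buf)) == 0:
            raise ctypes.WinError()
        return buf.value

    with open("asdf\u03a3.txt", "w"):     # Σ: in CP437, not in Windows-1252
        pass
    print(short_name("asdf\u03a3.txt"))   # e.g. ASDFΣ.TXT on a FAT volume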
The basic FAT or FAT32 directory entries support only short names (the old DOS 8.3 format) in the current OEM codepage. However, VFAT (FAT with long-filename support), which is used while running under Windows, can store an additional, so-called long filename for each file, in UTF-16.

"Safe" File naming

Are there any "unsafe" file names that can be encountered in Windows, Mac OS, Linux, etc?
For example:
New Video 2012-External Room
GED Practice Sheet
RgRrE-re-_d Da-
I've heard that even naming files with spaces, underscores, capital letters, and dashes could be potentially problematic, even though Windows doesn't include them in its list of forbidden characters. Is this true? I vaguely recall seeing programs that don't distinguish between uppercase and lowercase characters, and I know that URLs encode unsafe ASCII characters with % escapes (for example, %20 for a space).
Both Unix-like systems (including Linux and Mac OS) and Windows should have no problem with underscores. Spaces should also generally be fine, but you occasionally find buggy code that can't handle them.
For Windows, it's not that capitals are problematic. It's that Windows filesystems are case-insensitive, so in some cases when interoperating (e.g. with a Git repo, which is case-sensitive) you can end up with problems (e.g. the repo ends up with duplicates that differ only in capitalization).
I'm not sure about -. One reason to avoid it is that - has special meaning for many command-line programs (e.g. rm -r), so you have to use annoying syntax like ./-r or rm -- -r. I would also generally avoid more exotic characters like %.
It depends strongly on the context of use. Certain non-forbidden characters can cause problems for certain programs, though the vast majority of applications which use standard system APIs should not encounter any issues.
Some programs (especially command-line tools) can be sensitive to the presence of spaces in the filename. Others may use only ASCII internally, and thus be incapable of handling filenames containing characters outside of basic ASCII. (Most modern OSes, by and large, will accept almost any Unicode character in a filename.)
Some tools might require certain characters to be escaped (e.g. % in batch scripts), while others may not like having quotes in the filename.
Finally, a note on upper/lowercase: most Windows filesystems are case-preserving but otherwise case-insensitive, so upper/lowercase differences usually don't matter.
But note that in almost every case, the files can still be used, even if some workaround is needed to make them work.
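If you need to enforce this in code, here is one conservative sanitizer sketch in Python; the forbidden-character class and the reserved device names follow Microsoft's documented Win32 naming rules, while the sanitize helper itself is just ours:

    # Conservatively rewrite a filename so it is acceptable on Windows,
    # macOS, and Linux: replace Win32-forbidden characters, strip
    # trailing dots/spaces, and dodge reserved device names.
    import re

    WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL",
                        *(f"COM{i}" for i in range(1, 10)),
                        *(f"LPT{i}" for i in range(1, 10))}

    def sanitize(name: str, replacement: str = "_") -> str:
        name = re.sub(r'[<>:"/\\|?*\x00-\x1f]', replacement, name)
        name = name.rstrip(" .")             # Windows forbids these at the end
        if name.split(".")[0].upper() in WINDOWS_RESERVED:
            name = replacement + name        # e.g. CON.txt -> _CON.txt
        return name or replacement

    print(sanitize("New Video 2012-External Room"))   # already safe: unchanged
    print(sanitize('bad:"name"?.txt'))                # bad__name__.txt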

Emacs failure in loading charset map when saving file with unicode

I created an ordinary text file on Windows 7 64-bit using GNU Emacs 23.3.1. I can edit the file with other programs such as LINQPad (the file happens to be a LINQPad script, extension .linq). Everything is fine until I put a Unicode character in the file, a character such as the Greek letter λ (lambda). I can input the letter in Emacs and it displays correctly. However, Emacs refuses to save the file, reporting the following error:
Failure in loading charset map: 8859-7
If I input λs in LINQPad, Emacs will read and display them, but will not save the file.
I just noticed that Notepad++ also has unexpected behavior with this file: it does not display the λs, but instead pairs of odd characters such as Î». That fits the untuition (pun intended) that the Unicode chars are being stored as pairs of bytes. So it looks like this is a kind of ambiguous situation (storing Unicode in text files), but it also looks like LINQPad and Visual Studio "do the obvious thing".
I want to use Emacs because it's the only program I have that reflows sequences of commented lines (lines after //; it reflows them with Alt-Q), and I want to use Greek characters in my comments because I'm describing a mathematical program.
I'll be grateful for advice and answers.
UPDATE: some advice in other questions said to try M-x describe-char, also bound to C-x =; both of those give me the same failure message as above, so they're on the right track, just not answers.
This once happened to me when I had upgraded all packages (including Emacs) without realising I still had an Emacs session open during the upgrade. The next time I asked it to save some Unicode, it tried to load 8859-7 and failed because the path was different in the upgraded version. I had to redo the edit after restarting Emacs.
I just noticed that Notepad++ also has unexpected behavior with this file: it does not display the λs, but instead pairs of odd characters such as Î».
Î» is what you get when you interpret the byte sequence 0xCE, 0xBB using the encoding ISO-8859-1, or Windows code page 1252 (Western European). Code page 1252 is probably the default (‘ANSI’) code page on your machine.
0xCE, 0xBB is the UTF-8 encoding of the character λ (U+03BB Greek small letter lambda). So to display it correctly you need to tell your text editor that the file is saved in UTF-8 and not ANSI.
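A three-line check of that explanation, in Python:

    raw = "λ".encode("utf-8")       # b'\xce\xbb', the bytes in the file
    print(raw.decode("cp1252"))     # Î» -- what an ANSI editor shows
    print(raw.decode("utf-8"))      # λ  -- what a UTF-8-aware editor shows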
In Notepad++, choose UTF-8 from the menu bar ‘Encoding’ entry.
In Emacs, C-x C-m c utf-8-dos (or unix or whatever) as a prefix to opening or saving the file. Hopefully by saving in UTF-8 you'll avoid whatever the problem is with the ISO 8859-7 (Greek) map; you certainly don't want to be saving any files in 8859-7, or indeed anything but UTF-8, if you can help it.

ASCII in Windows XP and Ubuntu Linux

I've made a program in MSVC++ which outputs memory contents (in ASCII). The ASCII I see in the Windows console seems to match what I see in various ASCII tables (smiley, diamond, club, right arrow, etc.). This program needs to compile under Linux (which it does), but the ASCII output looks completely different. A few symbols are the same, but the rest are quite different. Is there any way to change how the terminal displays ASCII codes?
EDIT: The program executes correctly; it's just the ASCII that is being displayed differently.
ASCII defines character codes from 0x00 through 0x7f. Everything else (0x80-0xff) is not part of the ASCII standard and depends on what the operating system defines as the characters to display. However, the characters you mention (smiley, diamond, club, etc) are the representations of the ASCII "control characters" that don't normally have a visual representation. Windows lets you print such characters and see the glyphs it has defined for them, but your Linux is probably interpreting the control characters as formatting control codes (which they are) instead of printing corresponding glyphs.
What you are seeing is the "extended" character set that IBM initially included when PCs were first unleashed upon the world. Yes, we are going back to the age of mighty dinosaurs, so bear with me. These characters live above $7F, and the interpretation of their symbols on the screen can even be influenced by the font chosen. Most Linux distros are now using UTF-8 (or something close), and as such, the fonts installed may have completely different symbols, or even missing glyphs. In cases where you are comparing "ASCII" representations (a misnomer, as it's not really true ASCII) of the same data, they may or may not exactly match, as you must have the same glyph renderings in both display fonts to see similar representations. Try getting both your Windows and Linux installs to use the same font if possible, and then see if there is a change.
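An easy way to see the CP437 interpretation of those high bytes, regardless of your terminal's codepage, is to decode them explicitly; a sketch in Python (Python's cp437 codec covers the graphical range above 0x7F, while the pictures for the control codes below 0x20 were a display convention of the PC text mode rather than part of the codec):

    # Decode the "extended" byte range 0x80-0xFF as IBM's CP437.
    print(bytes(range(0x80, 0x100)).decode("cp437"))
    # ÇüéâäàåçêëèïîìÄÅ... through the box-drawing and Greek glyphs.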
If your browser supports Unicode (and you have the correct fonts installed), you will see them below.
You can copy and paste them into an editor with Unicode support (e.g. Notepad) and save as UTF-16BE.
Then if you open the file in a hex editor, you will see the Unicode code point for each visible glyph.
For example, the first ASCII char, NUL, has the visible Unicode glyph 0x2639.
In C/C++/Java you can write it as \u2639.
It's not a NUL char, but its visual representation.
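The same escape works in Python, and you can see the exact bytes a hex editor would show after a UTF-16BE save:

    glyph = "\u2639"                        # the visible stand-in described above
    print(glyph)                            # ☹
    print(glyph.encode("utf-16-be").hex())  # '2639'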
http://en.wikipedia.org/wiki/Code_page_437
☹☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~⌂ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒáíóúñѪº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■⓿
