I am currently making an app that deals with Ancient Greek (Unicode extended) characters. To put it simple, the user types a word in Greek and the program analyses it morphologically and shows all its declension, i. e. all its possible forms. Pretty simple, I am new to this. I made it with Latin (a language with no special Unicode characters) a week ago and it works perfectly.
I work with Ruby 3.0.2 and the Command Prompt attached to its installation file. I can write code using Greek Unicode letters (like "puts "ἀγαθός"") and they are displayed on the Command Prompt without problem. UTF-8 works fine there. I think the default codepage is UTF-8 for the .rb file.
The problem is when the Command Prompt tries to recognize the letters the user has written. To test recognition, when the user writes a word, the program shows again the letters written (I wrote them using the Windows Polytonic Greek Keyboard). To my disgrace these letters are only squares with question marks inside them. After a while, I couldn't write any single Greek character (not even non-Unicode-extended) because the command prompt crashes after doing it.
One solution I made was to change, before running the program, the codepage to Windows-1253, which supports Greek characters but not Unicode characters. That made possible to write Greek common characters and the command prompt recognizes them well. But of course it continues crashing if I dare to write a single alpha with spiritus asper.
But I really would like to use UTF-8 for everything, and I don't know why the program is doing this. Of course Windows Powershell does the same.
I hope I have explained the problem well. Sorry if my language is not appropriate, but I hope you have got the point. Thanks!
Related
In the following code, the ü is not the single Unicode character U+00FC but is a single grapheme cluster composed of two Unicode characters, the plain ASCII u U+0075 followed by the combining diaeresis U+0308.
fmt.Println("Jürgen Džemal")
fmt.Println("Ju\u0308rgen \u01c5emel")
If I run it in the go playground, it works as expected.
If I run it in a MS Windows 10 "Command Prompt" window, it doesn't visually combine the combining character with the prior character.
However when I cut and paste the text into here it appears correctly:
C:\> ver
Microsoft Windows [Version 10.0.17134.228]
C:\> test
Jürgen Džemal
Jürgen Džemel
On screen, in the "Command Prompt" window it looked more like:
Ju¨rgen Džemel
Changing the code page (chcp) from 850 to 65001 made no difference. Changing fonts (Consolas, Courier etc) made no difference.
In the past I have experienced problems that were fundamentally because Microsoft require Windows programs to use a different API to output characters to STDOUT depending on whether STDOUT is attached to a console or to a file. I don't know if this is a different manifestation of the same issue.
Is there something I can do to make this Unicode grapheme-cluster appear correctly?
As eryksun and Peter commented,
The Windows console (conhost.exe) doesn't support combining codes. You'll have to first normalize to an equivalent string that uses precomposed characters.
you can use golang.org/x/text/unicode/norm to do the normalization (e.g. norm.NFC.String("Jürgen Džemal"))
I tried this
s := "Ju\u0308rgen \u01c5emel"
fmt.Println(s) // dieresis not combined with u by conhost.exe
s = norm.NFC.String(s)
fmt.Println(s) // shows correctly
And the output looked like this
or, for the visually impaired with fabulously sophisticated screen readers - a bit like this:
Ju¨rgen Džemel
Jürgen Džemel
Note that Unicode has four different normalised forms but NFC is the most used on the Internet in web-pages and is also appropriate for this situation.
There are other methods in this package that may be more efficient or more useful
I read there are visual-characters in use which can only be represented in Unicode using combining characters. In other words for which there is no precomposed character. A more thorough approach would be needed to do something appropriate with those. Essentially the complications of Unicode (or perhaps more accurately of human languages and their typography) are almost without end. It sometimes seems that way to me.
References
https://blog.golang.org/normalization
https://godoc.org/golang.org/x/text/unicode/norm
https://learn.microsoft.com/en-us/windows/desktop/intl/using-unicode-normalization-to-represent-strings
For example, several characters used in writing Lithuanian have double diacritics, as they have only decomposed forms. An example is lowercase U with macron and tilde ("ū̃", U+016b U+0303, where the first code point is a lowercase U with macron and the second is a combining acute accent).
I have emacs 23.3 running on windows XP and I work on some files whose filenames contain a combination of English & devanagari or tamil characters (e.g., que.प्रश्न.txt or ans.பதில்.txt).
When I visit the directory containing this file in Dired, these file names don't appear correctly, even though I can see the names in windows explorer. Dired displays names like "deva~1.txt" for filenames that have begin with english characters but in case of names fully composed of non-english characters it displays something like "47d1~1.txt".
I suppose this has something to do with what Windows internally returns to emacs but I notice that running dir on command-prompt at the same directory displays the full names (even though cmd just renders all non-english characters as ? symbol).
Is there anyway I can enable dired to render filenames with non-english characters correctly?
It's actually a limitation of Emacs's implementation. Emacs uses Windows primitives that date back to before Unicode, so any filename with chars that cannot be encoded in your "codepage" will be replaced with the mangled foo~1 name (if your file system is VFAT) or something else in other cases. Hopefully we will soon switch over to the "new" Windows primitives that use UTF-16 (IIRC) and do not suffer from such problems any more. But you may have to wait for Emacs-25.1 for that. It may happen sooner if you give us a hand, tho ;-)
I created an ordinary text file on Windows 7 64-bit using gnu emacs 23.3.1. I can edit the file with other programs such as LinqPad (the file happens to be a linqpad script, extension .linq). Everything is fine until I put a Unicode character in the file, a character such as the greek letter λ (lambda). I can input the letter in emacs and it displays correctly. However, emacs refuses to save the file, reporting the following error
Failure in loading charset map: 8859-7
If I input the λ in LinqPad, emacs will read and display them, but will not save the file.
I just noticed that Notepad++ has other unexpected behavior with this file: it does not display the λ's, but instead pairs of odd characters such as λ. That is fitting to an untuition (pun intended) that the unicode chars are being stored as pairs. So it looks like this is a kind of ambiguous situation (storing unicode in text files), but it also looks like linqPad and visual studio "do the obvious thing."
I want to use emacs because it's the only program that I have that reflows sequences of commented lines (lines after //, reflows them with Alt-Q), and I want to use greek characters in my comments because I'm describing a mathematical program.
I'll be grateful for advice and answers.
UPDATE: some advice in other questions said to try M-x describe-char, also bound to C-x = ; both of those give me the same failure message as above, so they're on the right track, just not answers.
This once happened to me when I had upgraded all packages (including Emacs) without realising I still had an Emacs session open during the upgrade. Next time I asked it to save some Unicode, it tried to load 8859-7 and failed because the path was different in the upgraded version. I had to redo the edit after restarting Emacs.
I just noticed that Notepad++ has other unexpected behavior with this file: it does not display the λs, but instead pairs of odd characters such as λ.
λ is what you get when you interpret the byte sequence 0xCE, 0xBB using the encoding ISO-8859-1, or Windows code page 1252 (Western European). Code page 1252 is probably the default (‘ANSI’) code page on your machine.
0xCE, 0xBB is the UTF-8 encoding of the character λ (U+03BB Greek small letter lambda). So to display it correctly you need to tell your text editor that the file is saved in UTF-8 and not ANSI.
In Notepad++, choose UTF-8 from the menu bar ‘Encoding’ entry.
In Emacs, C-x C-m c utf-8-dos (or unix or whatever) as a prefix to opening or saving the file. Hopefully by saving in UTF-8 you'll avoid whatever the problem is with the ISO 8859-7 (Greek) map; you certainly don't want to be saving any files in 8859-7, or indeed anything but UTF-8, if you can help it.
(Using Ruby 1.8)
I only have a brief understanding of encoding and such...but what I want to know is, in any given script handling any given text-file, is there some universal library or call I need to make to turn non-standard characters into their nearest printable equivalent. I realize there's no "all-in-one" fix, but this is for a English (U.S. gov't) text file, and so I'm wondering if there's something that mitigates what must be a relatively common issue in English text formatting.
For example, in a text file, I have an entry like this:
0-823
That hyphen is just literally a hyphen as I've typed it out. In the file though, it's something that looks like a hyphen (an n-dash?) but when copy and pasting it...for example, into this browser text box, it doesn't show up.
Printing it out via a Ruby script gets this:
08�23
How do I get my script to resolve it into a dash. Or something other than a gremlin?
It's very common to run into hyphen-like characters and dashes, especially in the output of word-processors. Converting them isn't too hard if you know what the byte is that represents the character, but gets to be a pain when you get a document with several different ones. It gets worse as you throw other accented characters into the mix.
Ruby 1.8 doesn't support multibyte and Unicode character sets as well as 1.9+, but you can work around that somewhat by using the Iconv library.
Iconv lets you convert between various character-sets, such as US-ASCII, ISO-8859-1 and WIN-1252. It's smarter than a regex, because it knows how to convert from accented characters, to similarly looking characters, or ignore them if nothing similar exists, allowing your transliteration to degrade gracefully.
I have some example code in an answer to a related question. Also read James Grey's article linked in the answer. It explains the problem and ways to fix it, ending up with recommending Iconv too.
You could whitelist with gsub:
string.gsub(/[^a-zA-Z0-9]/)
Without knowing more information, I can't build the perfect regex for you, but the general idea is to replace anything that's not what you're expecting (anything not a letter or number or expected symbols).
This is quite a low-level (low in the sense of "closer to the metal") question.
I was wondering if any of you could point me to documentation, explanations, etc. of how, upon receiving a Unicode character (or any character code, but I'm particularly interested in the Unicode Standard) the console in Windows, good ol' cmd.exe (using, say, codepage 65001) and xterm in Linux started with, say, LC_CTYPE=en_US.UTF-8 look up the corresponding glyph (and where).
I know it may be harder to know in Windows, but I can't really find much information.
Thank you.
As far as I can tell, cmd.exe is bound to whatever 256-character code page you defined as the "codepage for non-Unicode programs" or whatever it was called.
To elaborate, if I set the above setting to Japanese, cmd.exe suddenly replaces backslashes with yen signs (as does every other non-Unicode app on the system) and correctly interprets ShiftJIS codes, for example. Setting it to Dutch gives me an accented I (I forgot which), while another codepage would give a half-filled vertical solid instead on the same character.
Not Unicode. Unicode would let me do all three at the same time.
The console uses a TextWriter with an encoding created from the codepage. That means that the characters written are encoded into bytes using the specific Encoding object for the codepage.
the console doesn't support Unicode. :)