PC-DOS vs MS-DOS vs Windows multilingual text files

As far as I know, PC-DOS 3.3 and MS-DOS 3.3 were both released in 1987, and they shipped with several code pages (850, 860, 863, 865).
Does that mean a user could write text using Portuguese (CP860) and, say, Nordic (CP865) symbols in one file?
Or was it more like one code page per operating system? For example, PC-DOS sold in Portugal had only code page 860, so the user could only use symbols from that code page, while PC-DOS sold in Scandinavia had only code page 865.
The same question applies to Windows: starting with which version did it support multilingual text documents?

DOS had no real knowledge of code pages. Strings were just byte sequences (zero- or dollar-terminated).
Code pages were used mostly for display: changing the code page changes how a given byte value is drawn on screen.
What you describe here is a frequent problem: mixed encodings in one text. If you are old enough, you will remember a lot of such problems on the web. A text file has no tag or metadata describing its code page. If you mix encodings, you will simply see the characters according to the active code page; change the screen's code page, and you get a new interpretation of the same bytes.
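To illustrate, here is a quick sketch using Python's built-in DOS code page codecs: the same bytes read differently depending on which code page is active.

    # The same byte sequence, interpreted under two DOS code pages.
    data = b"S\x84o"
    print(data.decode("cp860"))  # -> 'São' on a Portuguese system
    print(data.decode("cp865"))  # -> 'Säo' on a Nordic system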

You can do anything you want in your own file. It's communicating how to read it to others that would be a problem.
So, no, not really. Using more than one character encoding in a file and calling it a text file would be more trouble than it's worth.
The settings of an operating system do not have a direct relationship to the contents of a file. Programs that exchange files between systems (such as over the Internet) might use an understanding of the source character encoding plus a local setting for character encoding and do a lossy transcoding.
Nothing has changed, except that with the advent of Unicode more than 25 years ago, more scripts than you can imagine are available in one character set. So if any transcoding is to be done, ideally it would only be to UTF-8.

Related

Character set conversion problem - debug invalid characters - reverse engineer earlier conversions

Character conversion problem.
I have a few strings which are incorrectly encoded or decoded.
The strings came in an ASCII format CSV file.
The current strings I have are:
N‚met
Tet‹
I know, that the:
"‚" character (0x82) should be originally "é" (é acute accent)
"‹" character (0x8B) should be originally "ő" (o double acute accent)
How can I debug and reverse engineer, what conversions happened with the original characters to get the current characters?
I suppose that multiple decode/encode round trips happened, but I was not able to reproduce the original characters.
I put an expanded version of my comment as answer:
Your viewer uses CP1252 (English and Western Europe, also called ANSI in Windows) or CP1250 (Eastern Europe) or another similar code page. Most characters are encoded in the same manner, with just a few language-specific differences. Your example does not include characters that differ between the two encodings, so I cannot say precisely which one it is.
Those code pages are used on Microsoft Windows, and they are based on (but not 100% compatible with) Latin-1, so it is common to see text interpreted with such encodings. macOS and Linux are now heavily UTF-8 based; Windows uses Unicode internally, but as UTF-16.
The old encoding is probably CP437: the standard code page in DOS, so it was used frequently for CSV files as well. Other frequent old encodings are CP850 (Western Europe) and CP852 (Central Europe).
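If CP852 is indeed the source, you can verify it with a quick round trip. This is just a sketch in Python (CP852 has the Hungarian "ő" at 0x8B, whereas CP437 has "ï" there):

    # Undo the CP1252 misreading, then decode the original bytes as CP852.
    garbled = "N‚met"                # what the CP1252 viewer shows
    raw = garbled.encode("cp1252")   # recover the original bytes: b'N\x82met'
    print(raw.decode("cp852"))       # -> 'Német'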
For the other questions you put in the comments: if you are asking about tools, you should go to Super User. Some editors allow you to specify the encoding; a browser (opening a local file) also lets you choose the encoding, and I think you may be able to copy the result as Unicode [not sure]; other tools sometimes have hidden options for importing files, though possibly not with all the options you need. If you want to do it programmatically, ask a new question on this site, and specify the language. Python is well suited for such conversions (most scripting languages were created to handle text): Python has many encodings built in, and you just specify them when reading and writing the files. R can also be told the input encoding.
I wrote my own utility that has helped me diagnose and fix many thorny encoding issues. It is available as part of an open-source library. The utility converts any String to a Unicode sequence and vice versa. All you have to do is:
String codes = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Hello world");
And it will return the String "\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u006f\u0072\u006c\u0064".
The same works for any String in any language, including special characters. The article Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison explains the library and where to get it (it is available both on Maven Central and GitHub); search in the article for the paragraph "String Unicode converter". So when you read your String, convert it and see what comes up. That way you will see which symbols are there, and whether the information is merely distorted by a wrong encoding or actually lost. You can easily find tables on the Internet that map any symbol to a Unicode code point.

Utility to Stamp/Watermark Unicode Text Into a PDF

I am looking for a (preferably) command line utility to stamp/watermark unicode text content into a PDF document.
I tried PDF Stamp and a couple of others that I found over the net, but to no avail with Greek characters (e.g. ΓΔΘΛ become ÃÄÈË).
Many thanks for any help!
With sufficiently "odd" characters, you generally need to specify a font and an encoding. I suspect that at least one of the tools you experimented with has the capability to define such things.
Reading their docs, it looks like PDFStamp will let you specify a font, but not an encoding. That doesn't bode well. It might always pick "Identity-H" for system fonts... worth trying.
I must admit, I'm surprised. "Disappointed" even. Have you contacted their email support?
Once upon a time, iText shipped with a number of command-line tools that were mostly intended as examples but were nonetheless useful. I suspect you could dig them out of the SVN archive on SourceForge and get them to build again, if your Java-fu is up to the task. Just be sure to use BaseFont.IDENTITY_H whenever you're given a choice of encodings for a font.
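If a bit of scripting is acceptable instead of a pure command-line tool, here is a rough sketch of the same stamping idea in Python with reportlab and pypdf rather than PDFStamp or iText. The file names and the font path are assumptions; any TrueType font with Greek coverage should work:

    from io import BytesIO
    from reportlab.pdfgen import canvas
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont
    from pypdf import PdfReader, PdfWriter

    # Register a Unicode-capable TrueType font (the path is an assumption).
    pdfmetrics.registerFont(TTFont("DejaVu", "DejaVuSans.ttf"))

    # Build a one-page PDF containing the watermark text.
    buf = BytesIO()
    c = canvas.Canvas(buf)
    c.setFont("DejaVu", 36)
    c.drawString(100, 500, "ΓΔΘΛ")  # Unicode text, embedded via the TTF
    c.save()
    buf.seek(0)
    stamp = PdfReader(buf).pages[0]

    # Merge the stamp onto every page of the target document.
    writer = PdfWriter()
    for page in PdfReader("input.pdf").pages:
        page.merge_page(stamp)
        writer.add_page(page)
    with open("stamped.pdf", "wb") as f:
        writer.write(f)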

using par for formatting comments in code with international characters

I'm using Par (in linux) to get nice comments formatting quickly. The problem is that now I want to introduce comments that include some international characters, like áéíóú or äëïöü...
The program Berkeley Par counts each of these international characters as 2 ASCII characters (I believe), and its output comes out broken because it doesn't count characters properly.
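To show what I mean, here is a quick Python check, assuming the comments are UTF-8 encoded; a byte-oriented tool sees twice as many "characters" as there really are:

    s = "áéíóú"
    print(len(s))                  # 5 characters
    print(len(s.encode("utf-8")))  # 10 bytes: each is 2 bytes in UTF-8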
Did you face this problem before? Do you have any solution? Ideas?
You mean the code from Add multibyte characters support in "par" (or just the patches applied to the original source) doesn't work for you?
Then maybe it is a problem with your shell or the font it uses. Are you sure the shell and the font you use are able to reproduce Unicode characters?
Par, as distributed in Ubuntu from Hardy on, is supposed to handle multi-byte encodings.
http://packages.ubuntu.com/hardy/par
I've never even heard of this tool, but check out par 1.52.
The latest version of Par, released on 2001-Apr-29, tar'd and gzip'd. The only real change is better support for 8-bit character sets (as opposed to just 7-bit ASCII), but see also the release notes.
Edit: On the page, see par_1.52-i18n.3.diff.gz:
A patch by Jérôme Pouiller that adds support for multibyte charsets (like UTF-8), plus Debian packaging. Copied from http://sysmic.org/par/debian/.
See also his original announcement.

ASCII in Windows XP and Ubuntu Linux

I've made a program in MSVC++ which outputs memory contents (in ASCII). The ASCII I see in the Windows console seems to match what I see in various ASCII tables (smiley, diamond, club, right arrow, etc.). This program needs to compile under Linux (which it does), but the ASCII output looks completely different. A few symbols are the same, but the rest are entirely different. Is there any way to change how the terminal displays these ASCII codes?
EDIT: The program executes correctly, it's just the ASCII that is being displayed differently.
ASCII defines character codes from 0x00 through 0x7f. Everything else (0x80-0xff) is not part of the ASCII standard and depends on what the operating system defines as the characters to display. However, the characters you mention (smiley, diamond, club, etc) are the representations of the ASCII "control characters" that don't normally have a visual representation. Windows lets you print such characters and see the glyphs it has defined for them, but your Linux is probably interpreting the control characters as formatting control codes (which they are) instead of printing corresponding glyphs.
What you are seeing is the "extended" character set that IBM initially included when PCs were first unleashed upon the world. Yes, we are going back to the age of mighty dinosaurs, so bear with me. These characters live above $7F and the interpretation of their symbols on the screen can even be influenced by the font chosen. Most linux distros are now using UTF-8 (or something close) and as such, the fonts installed may have completely different symbols, or even missing glyphs. In cases where you are comparing "ASCII" representations (which is a misnomer, as it's not really true ASCII) of the same data, it may or may not exactly match, as you must have the same "glyph" renderings in both display fonts to correctly see similar representations. Try getting both your Windows and Linux installs to use the same font if possible, and then see if there is a change.
If your browser supports Unicode (and you have the correct fonts installed), you will see them below.
You can copy and paste them into an editor with Unicode support (e.g. Notepad) and save as UTF-16BE.
If you then open the file in a hex editor, you will see the Unicode code point for each visible glyph.
For example, the first ASCII character, NUL, has the visible glyph U+2639 here;
in C/C++/Java you can write it as \u2639.
It is not a NUL character, just its visual representation.
http://en.wikipedia.org/wiki/Code_page_437
☹☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~⌂ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒáíóúñѪº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■
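Incidentally, you can reproduce most of that table with Python's built-in CP437 codec. Bytes 0x00-0x1F decode to control characters rather than to the DOS glyphs, so this sketch starts at 0x20:

    # Print the printable half of the classic CP437 table.
    print(bytes(range(0x20, 0x100)).decode("cp437"))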

How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?

My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).
When the BOM (Byte Order Mark) is there, I have no problem. I know if the file is UTF-8 or UTF-16 BE or LE.
I wanted to assume when there was no BOM that the file was ANSI. But I have found that the files I am dealing with often are missing their BOM. Therefore no BOM may mean that the file is ANSI, UTF-8, UTF-16 BE or LE.
When the file has no BOM, what would be the best way to scan some of the file and most accurately guess the type of encoding? I'd like to be right close to 100% of the time if the file is ANSI and in the high 90's if it is a UTF format.
I'm looking for a generic algorithmic way to determine this. But I actually use Delphi 2009 which knows Unicode and has a TEncoding class, so something specific to that would be a bonus.
Answer:
ShreevatsaR's answer led me to search on Google for "universal encoding detector delphi", which surprised me by having this post listed in the #1 position after being alive for only about 45 minutes! That is fast googlebotting!! And it is also amazing that Stack Overflow gets into 1st place so quickly.
The 2nd entry in Google was a blog entry by Fred Eaker on Character encoding detection that listed algorithms in various languages.
I found the mention of Delphi on that page, and it led me straight to the Free OpenSource ChsDet Charset Detector at SourceForge written in Delphi and based on Mozilla's i18n component.
Fantastic! Thank you to all those who answered (all +1), thank you ShreevatsaR, and thank you again, Stack Overflow, for helping me find my answer in less than an hour!
Maybe you can shell out to a Python script that uses Chardet: Universal Encoding Detector. It is a reimplementation of the character encoding detection used by Firefox, and it is used by many different applications. Useful links: Mozilla's code, the research paper it was based on (ironically, my Firefox fails to correctly detect the encoding of that page), a short explanation, and a detailed explanation.
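A minimal usage sketch (the file name is hypothetical):

    import chardet

    with open("mystery.txt", "rb") as f:
        result = chardet.detect(f.read())
    print(result)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}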
Here is how notepad does that
There is also the python Universal Encoding Detector which you can check.
My guess is:
First, check if the file has byte values less than 32 (except for tab/newlines). If it does, it can't be ANSI or UTF-8, so it must be UTF-16; you just have to figure out the endianness. For that, you should probably use some table of valid Unicode character codes: if you encounter invalid codes, try the other endianness and see whether that fits. If both fit (or neither does), check which one has the larger percentage of alphanumeric codes. You might also try searching for line breaks and determining endianness from them. Other than that, I have no ideas on how to check endianness.
If the file contains no values less than 32 (apart from said whitespace), it's probably ANSI or UTF-8. Try parsing it as UTF-8, and see if you get any invalid Unicode characters. If you do, it's probably ANSI.
If you expect documents in non-English single-byte or multi-byte non-Unicode encodings, then you're out of luck. Best thing you can do is something like Internet Explorer which makes a histogram of character values and compares it to histograms of known languages. It works pretty often, but sometimes fails too. And you'll have to have a large library of letter histograms for every language.
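A rough Python sketch of the heuristic described above (illustrative only; "ANSI" is assumed to mean CP1252):

    def guess_encoding(data: bytes) -> str:
        # BOM checks first (the easy case from the question).
        if data.startswith(b"\xef\xbb\xbf"):
            return "utf-8-sig"
        if data.startswith(b"\xff\xfe"):
            return "utf-16-le"
        if data.startswith(b"\xfe\xff"):
            return "utf-16-be"
        # Control bytes other than tab/LF/CR suggest UTF-16.
        if any(b < 32 and b not in (9, 10, 13) for b in data):
            # In ASCII-heavy UTF-16-LE text the NUL follows each ASCII
            # byte (odd offsets); in UTF-16-BE it precedes it.
            nul_even = data[0::2].count(0)
            nul_odd = data[1::2].count(0)
            return "utf-16-le" if nul_odd > nul_even else "utf-16-be"
        # No control bytes: try strict UTF-8, fall back to ANSI.
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "cp1252"  # assumption: "ANSI" = Western Windows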
ASCII? No modern OS uses ASCII any more. They all use 8 bit codes, at least, meaning it's either UTF-8, ISOLatinX, WinLatinX, MacRoman, Shift-JIS or whatever else is out there.
The only test I know of is to check for invalid UTF-8 sequences. If you find any, then you know the file can't be UTF-8. The same is probably possible for UTF-16. But when it's not a Unicode encoding, it will be hard to tell which Windows code page it might be.
Most editors I know deal with this by letting the user choose a default from the list of all possible encodings.
There is code out there for checking validity of UTF chars.
