Finding the proper ASCII character for superscript "r"?

Does an ASCII code exist for a superscript lowercase letter r?
I have found a superscript letter e, but in French "1st" is written "1er", so we are missing the r portion.
This seems to be a special character outside the range of available ASCII characters?

There are no superscript letters in ASCII. ASCII contains only the basic Latin (English) letters A–Z, a–z, digits, and a small collection of other characters. There are only 128 code positions in ASCII.
In Unicode, there are more characters (about 1,000,000 code positions, of which a little over 100,000 have currently been assigned). They include “ᵉ” U+1D49 MODIFIER LETTER SMALL E, which belongs to the Phonetic Extensions block, which means that it has been included due to its use in phonetic notations, not due to use in the normal writing systems of human languages. I think it is the “superscript e” that you have found; I cannot imagine what else it might be. There is really no law against using such characters as simple superscript letters, but it isn’t particularly advisable either.
Anyway, there is no corresponding “r”, simply because a superscript r is not used in phonetic notations (widely enough).
In general, superscript letters as used in e.g. English “1st” or French “1er” should be regarded as stylistic variants of normal letters rather than independent characters. At least this is the Unicode position, which is what software vendors normally adhere to. So you cannot indicate superscripts at the plain-text level, only at a higher “protocol level”.
Depending on the software context, this could mean 1) using the superscript command in a word processor like MS Word (select the letter(s) and apply a formatting command); 2) using sup markup in HTML; 3) using the OpenType sups feature, when supported by the software and the font. The last option is the only one that is really acceptable typographically: it uses a superscript glyph designed by a typographer, whereas the other options just take reduced-size letters and place them higher.

The superscript r letter can be coded as denary 691 (Unicode U+02B3, ʳ). Similarly, a few other letters of the alphabet are available as superscripts in Unicode's Spacing Modifier Letters, Phonetic Extensions and Phonetic Extensions Supplement blocks, as shown below.
[NB: Not all alphabet characters are available as superscripts; just these ones to my knowledge.]
Here's a list of these superscript characters…
Superscript Small b → Denary 7495; Unicode: 1D47 ᵇ
Superscript Small c → Denary 7580; Unicode: 1D9C ᶜ
Superscript Small d → Denary 7496; Unicode: 1D48 ᵈ
Superscript Small f → Denary 7584; Unicode: 1DA0 ᶠ
Superscript Small g → Denary 7586; Unicode: 1DA2 ᶢ
Superscript Small h → Denary 0688; Unicode: 02B0 ʰ
Superscript Small j → Denary 0690; Unicode: 02B2 ʲ
Superscript Small k → Denary 7503; Unicode: 1D4F ᵏ
Superscript Small m → Denary 7504; Unicode: 1D50 ᵐ
Superscript Small n → Denary 8319; Unicode: 207F ⁿ
Superscript Small p → Denary 7510; Unicode: 1D56 ᵖ
Superscript Small t → Denary 7511; Unicode: 1D57 ᵗ
Superscript Capital H → Denary 7544; Unicode: 1D78 ᵸ
Superscript Capital I → Denary 7590; Unicode: 1DA6 ᶦ
Superscript Capital L → Denary 7595; Unicode: 1DAB ᶫ
Superscript Capital N → Denary 7600; Unicode: 1DB0 ᶰ
Superscript Small s → Denary 0738; Unicode: 02E2 ˢ
Superscript Capital U → Denary 7608; Unicode: 1DB8 ᶸ
Superscript Small v → Denary 7515; Unicode: 1D5B ᵛ
Superscript Small x → Denary 0739; Unicode: 02E3 ˣ
Superscript Small z → Denary 7611; Unicode: 1DBB ᶻ
Cf: https://en.wikipedia.org/wiki/Secondary_articulation#Unicode_support_of_superscript_IPA_letters
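If you just want to produce these characters programmatically, here is a minimal sketch (in Java, purely as an illustration; the class name is made up) that builds the French ordinal "1ᵉʳ" from the code points mentioned above. As the first answer notes, markup-level superscripting (HTML sup, OpenType sups) is usually preferable to these modifier letters.

public class Superscripts {
    public static void main(String[] args) {
        // U+1D49 MODIFIER LETTER SMALL E and U+02B3 MODIFIER LETTER SMALL R
        String premier = "1" + "\u1D49" + "\u02B3";
        System.out.println(premier);                 // 1ᵉʳ (if the console font/encoding supports it)
        System.out.println(premier.codePointAt(2));  // 691, the denary value of U+02B3
    }
}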

Related

From list of numbers to number in prolog [duplicate]

I'm working in Prolog and I have a list of numbers like:
L = [4,8,0]
I want to obtain the corresponding number, 480, but I didn't succeed.
What I tried is to convert every numeric element of L into its corresponding character code, using the built-in predicate char_code, and then to use the built-in predicate number_codes to obtain the numeric term from the list of character codes. However, it doesn't work. Can you help me?
test([], []).
test([X|Y], [C|K]) :-
    char_code(C, X),
    test(Y, K).

try(X) :-
    test([4,8,0], L),
    number_codes(X, L).
P.S.: can you explain why the combination char_code → number_codes doesn't work?
The predicate char_code/2 is a relation between a character and its character code. For example, char_code(a, 97) is true, since 97 is the decimal ASCII code for the character 'a'. If you were, for example, to query char_code(C, 97), you would get C = a as a solution. Similarly, char_code('9', 57) is true, since 57 is the ASCII code for the character that represents the digit 9. Note that '9' here is not the numeric 9, but the character '9'.
In your code, you are applying the char_code/2 predicate to a raw number. So char_code(C, 9) will succeed if C is the character whose code is the number 9 (in this case, a horizontal tab). When you then attempt to use number_codes/2 on those values, you don't get digits, since you aren't providing the character codes that correspond to the digits.
You could see this if you were to do a simple test on your test/2 predicate at the Prolog prompt:
?- test([4,8,0], L).
L = ['\004\', '\b', '\000\']
What you expected was L = ['4', '8', '0'].
What you need to use is atom_number/2 and not char_code/2. Then you have atom_number(C, 9) succeeding when C = '9' (C is bound to the atom '9'):
numbers_atoms([], []).
numbers_atoms([X|Y], [C|K]) :-
    atom_number(C, X),
    numbers_atoms(Y, K).

digits_number(Digits, Number) :-
    numbers_atoms(Digits, Atoms),
    number_codes(Number, Atoms).
Then you get:
?- digits_number([4,8,0], Number).
Number = 480.
You can also write:
numbers_atoms(Numbers, Atoms) :-
    maplist(atom_number, Atoms, Numbers).

set-fontset-font only overrides some Unicode character ranges (OS X)

If I put the following into a scratch buffer (running emacs -q on 24.4.1), I can get the symbols (those in the comment) to change by varying the font name (changing the size so that it becomes "MS PGothic-24" for example). This is nice, and seems to show set-fontset-font working as intended.
; multiset symbols (unions with stuff inside them: ⊌ ⊎
(set-fontset-font t '(#x228C . #x228E) "MS PGothic-22")
However, if I have
; greek alphabet: α β γ δ
(set-fontset-font t '(#x370 . #x3F0) "MS PGothic-22")
then I seem to be stuck with a Menlo font. If I put the cursor on the character and do C-u C-x =, I get output including the following:
mac-ct:-*-Menlo-normal-normal-normal-*-12-*-*-*-m-0-iso10646-1 (#x2F9)
The rest of the output confirms that the character I'm looking at is in the given range:
character: α (displayed as α) (codepoint 945, #o1661, #x3b1)
By contrast, the output for the union-symbols is
mac-ct:-*-MS PGothic-normal-normal-normal-*-22-*-*-*-p-0-iso10646-1 (#x43A0)

Multi-character substitution cipher algorithm

My problem is the following. I have a list of substitutions, including one substitution for each letter of the alphabet, but also some substitutions for groups of more than one letter. For example, in my cipher p becomes b, l becomes w, e becomes i, but le becomes by, and ple becomes memi.
So, while I can think of a few simple/naïve ways of implementing this cipher, they aren't very efficient, and I was wondering what the most efficient way to do it would be. The answer doesn't have to be in any particular language; a general structured-English algorithm would be fine, but if it must be in some language I'd prefer C++ or Java or similar.
EDIT: I don't need this cipher to be decipherable, an algorithm that mapped all single letters to the letter 'w' but mapped the string 'had' to the string 'jon' instead should be ok, too (then the string "Mary had a little lamb." would become "Wwww jon w wwwwww wwww.").
I'd like the algorithm to be fully general.
One possible approach is to use a deterministic automaton. The closest commonly used example to your problem is the Aho–Corasick string matching algorithm. The difference is that, instead of reporting matches, you emit ciphertext on certain transitions. In general, each transition either emits ciphertext or does not.
In your example
p -> b
l -> w
e -> i
le -> by
ple -> memi
The automaton (in Erlang-like pseudocode):
start(p) -> p(next());
start(l) -> l(next());
start(e) -> e(next());
...
p(l) -> pl(next());
p(X) -> emit(b), start(X).
l(e) -> emit(by), start(next());
l(X) -> emit(w), start(X).
e(X) -> emit(i), start(X).
pl(e) -> emit(memi), start(next());
pl(X) -> emit(b), l(X).
If you are not familiar with Erlang: start(), p(), and so on are functions, one for each state. Each line with -> is one transition, and the actions follow the ->. emit() is a function which emits ciphertext and next() is a function returning the next character. X is a variable matching any other character.
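If you do not want to build the full automaton, a simpler (though less efficient) alternative is a greedy longest-match scan: at each position, try the longest substitution key first. Here is a minimal Java sketch of that idea (the question mentions Java as acceptable); the class and method names are just illustrative, and it is not Aho–Corasick, since in the worst case it retries up to the longest key length at every position.

import java.util.*;

public class SubstitutionCipher {
    private final Map<String, String> rules = new HashMap<>();
    private int maxKeyLength = 0;

    public void addRule(String from, String to) {
        rules.put(from, to);
        maxKeyLength = Math.max(maxKeyLength, from.length());
    }

    public String encode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            String replacement = null;
            int matched = 1;
            // Try the longest possible key first, then shorter ones.
            for (int len = Math.min(maxKeyLength, text.length() - i); len >= 1; len--) {
                String candidate = text.substring(i, i + len);
                if (rules.containsKey(candidate)) {
                    replacement = rules.get(candidate);
                    matched = len;
                    break;
                }
            }
            // Characters with no rule (spaces, punctuation, ...) pass through unchanged.
            out.append(replacement != null ? replacement : text.substring(i, i + 1));
            i += matched;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        SubstitutionCipher c = new SubstitutionCipher();
        c.addRule("p", "b");
        c.addRule("l", "w");
        c.addRule("e", "i");
        c.addRule("le", "by");
        c.addRule("ple", "memi");
        System.out.println(c.encode("ple"));   // memi
        System.out.println(c.encode("pe"));    // bi
        System.out.println(c.encode("apple")); // abmemi
    }
}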

Difference between Context-sensitive grammar and Context-free grammar [duplicate]

In my textbook, here is the explanation of these two terms:
Context Sensitive Grammar:
grammar can have productions of the form w1 → w2, where w1 = lAr and
w2 = lwr, where A is a nonterminal symbol, l and r are strings of zero
or more terminal or nonterminal symbols, and w is a nonempty string of
terminal or nonterminal symbols. It can also have the production S → λ
as long as S does not appear on the right-hand side of any other
production.
Context Free Grammar:
grammar can have productions only of the form w1 → w2, where w1 is a
single symbol that is not a terminal symbol. A type 3 grammar can have
productions only of the form w1 → w2 with w1 = A and either w2 = aB or
w2 = a, where A and B are nonterminal symbols and a is a terminal
symbol, or with w1 = S and w2 = λ.
In my textbook, the author said: CSG is a special case of CFG. But I don't get this point, because in a CSG, lAr → lwr, where l and r can be strings of zero or more terminals or nonterminals. So, when they are strings of length zero, we can write lAr as A, and the CSG production becomes a CFG production. So a CSG is a CFG.
Have I understood something wrong? Please correct it for me.
Thanks :)
The textbook is in error. As you say, a CFG is a special case of a CSG.
CSGs can express strictly more languages than CFGs can.
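To make this concrete (a small illustrative example, not from the textbook): any CFG production A → w with w nonempty is already of the CSG form lAr → lwr with l = r = λ, so every such CFG is also a CSG. The converse fails: a context-sensitive production such as bAc → bac rewrites A only when it appears between b and c, which no single CFG production can express, and there are context-sensitive languages, such as {a^n b^n c^n : n ≥ 1}, that no CFG can generate.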

What encoding/code page is cmd.exe using?

When I open cmd.exe in Windows, what encoding is it using?
How can I check which encoding it is currently using? Does it depend on my regional setting or are there any environment variables to check?
What happens when you type a file with a certain encoding? Sometimes I get garbled characters (incorrect encoding used) and sometimes it kind of works. However I don't trust anything as long as I don't know what's going on. Can anyone explain?
Yes, it’s frustrating—sometimes type and other programs
print gibberish, and sometimes they do not.
First of all, Unicode characters will only display if the
current console font contains the characters. So use
a TrueType font like Lucida Console instead of the default Raster Font.
But if the console font doesn’t contain the character you’re trying to display,
you’ll see question marks instead of gibberish. When you get gibberish,
there’s more going on than just font settings.
When programs use standard C-library I/O functions like printf, the
program’s output encoding must match the console’s output encoding, or
you will get gibberish. chcp shows and sets the current codepage. All
output using standard C-library I/O functions is treated as if it is in the
codepage displayed by chcp.
Matching the program’s output encoding with the console’s output encoding
can be accomplished in two different ways:
A program can get the console’s current codepage using chcp or
GetConsoleOutputCP, and configure itself to output in that encoding (a small
Java sketch of this approach follows below), or
you or a program can set the console’s current codepage using chcp or
SetConsoleOutputCP to match the default output encoding of the program.
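For the first approach, here is a minimal Java sketch (the class name and the assumed code page are illustrative): it assumes the console is using code page 850, the default OEM code page shown by chcp further down, and wraps System.out in a stream that encodes output as Cp850, Java’s name for that code page.

import java.io.PrintStream;

public class ConsoleOut {
    public static void main(String[] args) throws Exception {
        // Assume the console code page is 850 (check with `chcp`);
        // "Cp850" is the corresponding Java charset name.
        PrintStream con = new PrintStream(System.out, true, "Cp850");
        con.println("German äöü ÄÖÜ ß");  // the bytes now match the console's encoding
    }
}

Of course this only helps for characters that exist in code page 850; the Cyrillic and CJK lines of the test text below still cannot be displayed this way.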
However, programs that use Win32 APIs can write UTF-16LE strings directly
to the console with
WriteConsoleW.
This is the only way to get correct output without setting codepages. And
even when using that function, if a string is not in the UTF-16LE encoding
to begin with, a Win32 program must pass the correct codepage to
MultiByteToWideChar.
Also, WriteConsoleW will not work if the program’s output is redirected;
more fiddling is needed in that case.
type works some of the time because it checks the start of each file for
a UTF-16LE Byte Order Mark (BOM), i.e. the bytes 0xFF 0xFE. If it finds such a
mark, it displays the Unicode characters in the file using WriteConsoleW
regardless of the current codepage. But when you type any file without a
UTF-16LE BOM, or use non-ASCII characters with any command
that doesn’t call WriteConsoleW, you will need to set the
console codepage and program output encoding to match each other.
How can we find this out?
Here’s a test file containing Unicode characters:
ASCII abcde xyz
German äöü ÄÖÜ ß
Polish ąęźżńł
Russian абвгдеж эюя
CJK 你好
Here’s a Java program that prints the test text in a bunch of different
Unicode encodings and also saves each variant to a file. It could be in any
programming language; it only writes ASCII characters or encoded bytes to stdout.
import java.io.*;

public class Foo {
    private static final String BOM = "\ufeff";
    private static final String TEST_STRING
        = "ASCII abcde xyz\n"
        + "German äöü ÄÖÜ ß\n"
        + "Polish ąęźżńł\n"
        + "Russian абвгдеж эюя\n"
        + "CJK 你好\n";

    public static void main(String[] args) throws Exception {
        String[] encodings = new String[] {
            "UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE" };
        for (String encoding : encodings) {
            System.out.println("== " + encoding);
            for (boolean writeBom : new Boolean[] {false, true}) {
                System.out.println(writeBom ? "= bom" : "= no bom");
                String output = (writeBom ? BOM : "") + TEST_STRING;
                byte[] bytes = output.getBytes(encoding);
                System.out.write(bytes);
                FileOutputStream out = new FileOutputStream("uc-test-"
                    + encoding + (writeBom ? "-bom.txt" : "-nobom.txt"));
                out.write(bytes);
                out.close();
            }
        }
    }
}
The output in the default codepage? Total garbage!
Z:\andrew\projects\sx\1259084>chcp
Active code page: 850
Z:\andrew\projects\sx\1259084>java Foo
== UTF-8
= no bom
ASCII abcde xyz
German ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish ąęźżńł
Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK õ¢áÕÑ¢
= bom
´╗┐ASCII abcde xyz
German ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish ąęźżńł
Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK õ¢áÕÑ¢
== UTF-16LE
= no bom
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺↓☺z☺|☺D☺B☺
R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦
C J K `O}Y
= bom
 ■A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺↓☺z☺|☺D☺B☺
R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦
C J K `O}Y
== UTF-16BE
= no bom
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣☺↓☺z☺|☺D☺B
R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O
C J K O`Y}
= bom
■  A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣☺↓☺z☺|☺D☺B
R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O
C J K O`Y}
== UTF-32LE
= no bom
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺
R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N
♦ O♦
C J K `O }Y
= bom
 ■ A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺
R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N
♦ O♦
C J K `O }Y
== UTF-32BE
= no bom
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B
R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N
♦O
C J K O` Y}
= bom
■  A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B
R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N
♦O
C J K O` Y}
However, what if we type the files that got saved? They contain the exact
same bytes that were printed to the console.
Z:\andrew\projects\sx\1259084>type *.txt
uc-test-UTF-16BE-bom.txt
■  A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣☺↓☺z☺|☺D☺B
R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O
C J K O`Y}
uc-test-UTF-16BE-nobom.txt
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣☺↓☺z☺|☺D☺B
R u s s i a n ♦0♦1♦2♦3♦4♦5♦6 ♦M♦N♦O
C J K O`Y}
uc-test-UTF-16LE-bom.txt
ASCII abcde xyz
German äöü ÄÖÜ ß
Polish ąęźżńł
Russian абвгдеж эюя
CJK 你好
uc-test-UTF-16LE-nobom.txt
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺↓☺z☺|☺D☺B☺
R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦
C J K `O}Y
uc-test-UTF-32BE-bom.txt
■  A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B
R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N
♦O
C J K O` Y}
uc-test-UTF-32BE-nobom.txt
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ☺♣ ☺↓ ☺z ☺| ☺D ☺B
R u s s i a n ♦0 ♦1 ♦2 ♦3 ♦4 ♦5 ♦6 ♦M ♦N
♦O
C J K O` Y}
uc-test-UTF-32LE-bom.txt
A S C I I a b c d e x y z
G e r m a n ä ö ü Ä Ö Ü ß
P o l i s h ą ę ź ż ń ł
R u s s i a n а б в г д е ж э ю я
C J K 你 好
uc-test-UTF-32LE-nobom.txt
A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺ ↓☺ z☺ |☺ D☺ B☺
R u s s i a n 0♦ 1♦ 2♦ 3♦ 4♦ 5♦ 6♦ M♦ N
♦ O♦
C J K `O }Y
uc-test-UTF-8-bom.txt
´╗┐ASCII abcde xyz
German ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish ąęźżńł
Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK õ¢áÕÑ¢
uc-test-UTF-8-nobom.txt
ASCII abcde xyz
German ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish ąęźżńł
Russian ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK õ¢áÕÑ¢
The only thing that works is the UTF-16LE file with a BOM, printed to the
console via type.
If we use anything other than type to print the file, we get garbage:
Z:\andrew\projects\sx\1259084>copy uc-test-UTF-16LE-bom.txt CON
 ■A S C I I a b c d e x y z
G e r m a n õ ÷ ³ ─ Í ▄ ▀
P o l i s h ♣☺↓☺z☺|☺D☺B☺
R u s s i a n 0♦1♦2♦3♦4♦5♦6♦ M♦N♦O♦
C J K `O}Y
1 file(s) copied.
From the fact that copy CON does not display Unicode correctly, we can
conclude that the type command has logic to detect a UTF-16LE BOM at the
start of the file, and use special Windows APIs to print it.
We can see this by opening cmd.exe in a debugger when it goes to type
out a file:
After type opens a file, it checks for a BOM of 0xFEFF—i.e., the bytes
0xFF 0xFE in little-endian—and if there is such a BOM, type sets an
internal fOutputUnicode flag. This flag is checked later to decide
whether to call WriteConsoleW.
But that’s the only way to get type to output Unicode, and only for files
that have BOMs and are in UTF-16LE. For all other files, and for programs
that don’t have special code to handle console output, your files will be
interpreted according to the current codepage, and will likely show up as
gibberish.
You can emulate how type outputs Unicode to the console in your own programs like so:
#include <stdio.h>
#include <string.h>
#define UNICODE
#include <windows.h>

static LPCSTR lpcsTest =
    "ASCII abcde xyz\n"
    "German äöü ÄÖÜ ß\n"
    "Polish ąęźżńł\n"
    "Russian абвгдеж эюя\n"
    "CJK 你好\n";

int main() {
    int n;
    DWORD written;
    wchar_t buf[1024];
    HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE);

    /* Note: the destination size is given in wide characters, not bytes. */
    n = MultiByteToWideChar(CP_UTF8, 0,
                            lpcsTest, strlen(lpcsTest),
                            buf, sizeof(buf) / sizeof(buf[0]));
    WriteConsole(hConsole, buf, n, &written, NULL);
    return 0;
}
This program works for printing Unicode on the Windows console using the
default codepage.
For the sample Java program, we can get a little bit of correct output by
setting the codepage manually, though the output gets messed up in weird ways:
Z:\andrew\projects\sx\1259084>chcp 65001
Active code page: 65001
Z:\andrew\projects\sx\1259084>java Foo
== UTF-8
= no bom
ASCII abcde xyz
German äöü ÄÖÜ ß
Polish ąęźżńł
Russian абвгдеж эюя
CJK 你好
ж эюя
CJK 你好
你好
好
�
= bom
ASCII abcde xyz
German äöü ÄÖÜ ß
Polish ąęźżńł
Russian абвгдеж эюя
CJK 你好
еж эюя
CJK 你好
你好
好
�
== UTF-16LE
= no bom
A S C I I a b c d e x y z
…
However, a C program that sets a Unicode UTF-8 codepage:
#include <stdio.h>
#include <windows.h>

int main() {
    int n;
    UINT oldCodePage;
    char buf[1024];

    oldCodePage = GetConsoleOutputCP();
    if (!SetConsoleOutputCP(65001)) {
        printf("error\n");
    }
    freopen("uc-test-UTF-8-nobom.txt", "rb", stdin);
    n = fread(buf, sizeof(buf[0]), sizeof(buf), stdin);
    fwrite(buf, sizeof(buf[0]), n, stdout);
    SetConsoleOutputCP(oldCodePage);
    return 0;
}
does have correct output:
Z:\andrew\projects\sx\1259084>.\test
ASCII abcde xyz
German äöü ÄÖÜ ß
Polish ąęźżńł
Russian абвгдеж эюя
CJK 你好
The moral of the story?
type can print UTF-16LE files with a BOM, regardless of your current codepage.
Win32 programs can be programmed to output Unicode to the console, using
WriteConsoleW.
Other programs which set the codepage and adjust their output encoding accordingly can print Unicode on the console, regardless of what the codepage was when the program started.
For everything else you will have to mess around with chcp, and will probably still get weird output.
Type
chcp
to see your current code page (as Dewfy already said).
Use
nlsinfo
to see all installed code pages and find out what your code page number means.
You need to have Windows Server 2003 Resource kit installed (works on Windows XP) to use nlsinfo.
To answer your second query re. how encoding works, Joel Spolsky wrote a great introductory article on this. Strongly recommended.
I've long been frustrated by Windows code page issues and the portability and localisation problems they cause for C programs. The previous posts have detailed the issues at length, so I'm not going to add anything in this respect.
To make a long story short, I eventually ended up writing my own UTF-8 compatibility library layer over the Visual C++ standard C library. Basically, this library ensures that a standard C program works correctly in any code page, using UTF-8 internally.
This library, called MsvcLibX, is available as open source at https://github.com/JFLarvoire/SysToolsLib. Main features:
C sources encoded in UTF-8, using normal char[] C strings, and standard C library APIs.
In any code page, everything is processed internally as UTF-8 in your code, including the main() routine argv[], with standard input and output automatically converted to the right code page.
All stdio.h file functions support UTF-8 pathnames > 260 characters, up to 64 KBytes actually.
The same sources compile and link successfully on Windows using Visual C++, MsvcLibX and the Visual C++ C library, and on Linux using gcc and the Linux standard C library, with no need for #ifdef ... #endif blocks.
Adds include files common in Linux, but missing in Visual C++. Ex: unistd.h
Adds missing functions, like those for directory I/O, symbolic link management, etc, all with UTF-8 support of course :-).
More details in the MsvcLibX README on GitHub, including how to build the library and use it in your own programs.
The release section in the above GitHub repository provides several programs using this MsvcLibX library, that will show its capabilities. Ex: Try my which.exe tool with directories with non-ASCII names in the PATH, searching for programs with non-ASCII names, and changing code pages.
Another useful tool there is the conv.exe program. This program can easily convert a data stream from any code page to any other. Its default is input in the Windows code page and output in the current console code page. This makes it possible to view data generated by Windows GUI apps (e.g. Notepad) correctly in a command console, with a simple command like: type WINFILE.txt | conv
This MsvcLibX library is by no means complete, and contributions for improving it are welcome!
The command CHCP shows the current codepage. By default the console uses an OEM code page such as 850, which is different from the Windows ANSI code page (e.g. 1252). So for English-only text you wouldn't see any difference, but text in an extended code page (like Cyrillic) will be printed wrongly.
In Java I used encoding "IBM850" to write the file. That solved the problem.
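For example (a hedged sketch of what that might look like; the file name is made up):

import java.io.*;

public class WriteOem {
    public static void main(String[] args) throws IOException {
        // Write the file in IBM850 so that `type` shows it correctly
        // under the default OEM code page 850.
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("out.txt"), "IBM850")) {
            w.write("German äöü ÄÖÜ ß\r\n");
        }
    }
}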
You can control the code page by creating a file %HOMEPATH%\init.cmd and having cmd.exe run it at startup (for example via the AutoRun value under HKCU\Software\Microsoft\Command Processor).
Mine says:
@ECHO OFF
CHCP 65001 > nul
