What is the difference between calling a Win32 API function that has an A appended to its name and calling the one that has a W?
I know the suffixes mean ASCII and wide character (Unicode), but what is the difference in the input or the output?
For example, if I call GetDefaultCommConfigA, will it fill my COMMCONFIG structure with ASCII strings instead of WCHAR strings? (Or vice versa for GetDefaultCommConfigW.)
In other words, how do I know which encoding a string is in, ASCII or Unicode? Is it determined by the version of the function I call, A or W?
I have found this question, but I don't think it answers my question.
The A functions use ANSI (not ASCII) strings as input and output, meaning text in the system's active code page, and the W functions use Unicode strings instead (UCS-2 on NT4 and earlier, UTF-16 on Windows 2000 and later). Refer to MSDN for more details.
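To make the difference concrete without calling the Win32 API itself, here is a small Python sketch that encodes one string the way an A function and a W function would each expect it. It assumes the ANSI code page is Windows-1252 (`cp1252`); that is only one possible code page, and the real one depends on the system locale.

```python
# Sketch: the same text as an "ANSI" buffer and as a "wide" buffer.
# Assumption: the ANSI code page is Windows-1252; on a real system it
# depends on the locale settings.
text = "Größe"

ansi_bytes = text.encode("cp1252")     # layout an ...A function works with
wide_bytes = text.encode("utf-16-le")  # layout an ...W function works with

print(len(ansi_bytes))  # one byte per character in this code page
print(len(wide_bytes))  # two bytes per UTF-16 code unit

# Reading a buffer with the wrong assumption garbles the text, which is why
# the suffix of the function you call tells you the encoding of its strings.
garbled = ansi_bytes.decode("utf-16-le", errors="replace")
print(garbled == text)
```

So the bytes you get back are only meaningful if you interpret them according to the variant you called.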
I found this commit from Facebook Infer, and I have no idea what \027[0K and \027[%iA mean.
What do these special strings mean? And if there are more strings like this (I think there are), where can I find the full documentation for them?
Those are escape sequences to tell your terminal what to do.
For example, the sequence of characters represented by \027[0K (where \027 is the decimal value of the ASCII Esc character) tells the terminal to "clear line from cursor to the end."
One helpful document/guide on this subject can be found at https://shiroyasha.svbtle.com/escape-sequences-a-quick-guide-1
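As a rough illustration of how that sequence is used, here is a Python sketch that redraws a status line in place. The helper name `overwrite_line` is made up for this example, and it assumes a terminal that understands ECMA-48 style sequences.

```python
import sys

ESC = "\x1b"                # ASCII Esc, decimal 27
CLEAR_TO_EOL = ESC + "[0K"  # erase from the cursor to the end of the line

def overwrite_line(text: str) -> str:
    """Return a string that redraws the current terminal line with `text`."""
    # "\r" moves the cursor to column 0; the escape sequence then clears
    # whatever is left over from a longer previous line.
    return "\r" + text + CLEAR_TO_EOL

if __name__ == "__main__":
    sys.stdout.write(overwrite_line("processing file 10 of 10"))
    sys.stdout.write(overwrite_line("done") + "\n")
```

Without the clear sequence, the tail of the longer first message would stay visible after "done".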
The Facebook code is copied from another source here, which uses hard-coded formatters imitating termcap (this page gives some background). The original has comments indicating where its information came from.
The formatter uses "%i" for integers. Here the integer is a repeat count for the "cursor up" movement, \033[A.
In most languages, \033 (octal) is used for the ASCII escape character. But this source (according to the GitHub analysis) is written in OCaml, and it uses the decimal value of the ASCII escape character. In OCaml syntax, you could write the octal value like this: \o033
Once you understand the formatting parts (how the escape character is represented, and the use of %i to format a number), the rest is documented in several places:
The relevant standard is ECMA-48.
The termcap (or analogous terminfo) information is in the terminal database.
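The two formatting points above can be checked with a short Python sketch. Python has no decimal escape like OCaml's \027, so the decimal form is spelled `chr(27)` here; `cursor_up` is a hypothetical helper name, not something from the Infer source.

```python
# Three spellings of the same Esc character in Python source:
# decimal 27, octal 033, hexadecimal 1B.
ESC_DECIMAL = chr(27)
ESC_OCTAL = "\033"
ESC_HEX = "\x1b"

def cursor_up(n: int) -> str:
    """Build the 'cursor up' sequence that moves the cursor up n lines."""
    # Same shape as the "\027[%iA" format string from the question:
    # Esc, "[", the repeat count, then the final byte "A".
    return ESC_HEX + "[%iA" % n

print(ESC_DECIMAL == ESC_OCTAL == ESC_HEX)
print(repr(cursor_up(3)))
```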
I am testing code that uses a routine returning a char's "ASCII value" to decide which chars to show: my program should drop control chars but keep chars that may be entered by the user. Although the routine is called "ascii", it does not return only ASCII values: giving it the char ƒ returns 402.
For example, I have found this web site
but it doesn't list ƒ as 402 that I can see.
I need to know whether there are other codes above 402 that I should test my code with. The software that 'ascii' is written in uses UCS-2 internally as its character set. The web site I found doesn't mention UCS-2.
There are probably many interpretations of »Control Character« out there, but I'll assume you mean C0 and C1 control characters (includes references to the relevant Unicode Standards).
The commonly used integer representation of Unicode characters is the code point notation: »U+« followed by a hexadecimal number of at least four digits, which you will find near mentions of characters, e.g. »U+007F (delete)«. The result of your »ASCII value« routine will probably be this number without the »U+«, shown in decimal: ƒ is U+0192, and 0x192 is 402.
UCS-2 is a specific encoding for Unicode characters (which you probably won't need to care about directly) and is equivalent to Unicode code points only for characters within the range of the BMP, U+0000 to U+FFFF. So values above 402 do exist; in UCS-2 they can go up to 65535.
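A minimal Python sketch of such a filter is below. It assumes your »ASCII value« routine behaves like Python's `ord`, i.e. that it returns the code point. It also treats U+007F (delete) as a control character alongside the C0 and C1 ranges. The function names are made up for the example.

```python
def is_control(ch: str) -> bool:
    """True for C0 (U+0000..U+001F), DEL (U+007F) and C1 (U+0080..U+009F)."""
    cp = ord(ch)
    return cp <= 0x1F or 0x7F <= cp <= 0x9F

def in_bmp(ch: str) -> bool:
    """True if the code point fits in a single UCS-2 code unit."""
    return ord(ch) <= 0xFFFF

def drop_controls(text: str) -> str:
    """Keep every character that is not a C0/C1 control character."""
    return "".join(ch for ch in text if not is_control(ch))

print(ord("ƒ"))                         # code point of U+0192 in decimal
print(drop_controls("a\x07b\x85cƒ\n"))  # bell, NEL and newline are dropped
```

Useful test inputs are the boundaries of each range: 0x1F and 0x20, 0x7E and 0x7F, 0x9F and 0xA0, plus a few values near 0xFFFF.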
From testing, it seems that converting both IDNs and regular domain names 'just works': if the input doesn't need to be changed, punycode just returns the input.
punycode.toASCII('lancôme.com');
returns:
'xn--lancme-lxa.com'
And
punycode.toASCII('apple.com');
returns:
'apple.com'
This looks great, but is it specified anywhere? Can I safely convert everything to punycode?
That is correct. If you look at the procedure for converting Unicode strings to ASCII Punycode, the process only alters non-ASCII characters. Since regular domains cannot contain non-ASCII characters, a correctly implemented converter will never transform a pure-ASCII string.
You can read more about how Unicode is converted to Punycode here: https://en.wikipedia.org/wiki/Punycode
Punycode is specified in RFC 3492: https://www.ietf.org/rfc/rfc3492.txt, and it clearly says:
"Basic code point segregation" is a very simple and
efficient encoding for basic code points occurring in the extended
string: they are simply copied all at once.
Therefore, if your extended string is made of basic code points, it will just be copied without change.
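The same behaviour can be reproduced outside JavaScript. The sketch below uses Python's built-in `idna` codec (which implements the older IDNA 2003 rules) as a stand-in for punycode.toASCII; `to_ascii` is a hypothetical wrapper name, and the two libraries may differ on edge cases not shown here.

```python
def to_ascii(domain: str) -> str:
    """Convert a domain name to its ASCII form, label by label."""
    # The "idna" codec leaves pure-ASCII labels alone and encodes the
    # others as "xn--" followed by the Punycode form of the label.
    return domain.encode("idna").decode("ascii")

print(to_ascii("lancôme.com"))
print(to_ascii("apple.com"))
```

Passing every domain through a wrapper like this is therefore idempotent for names that are already ASCII.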
My Win32/MFC program builds up a list of names, sorting them alphabetically as it puts them into the list. When it supported only ASCII strings, this worked with a simple char-by-char string comparison. But now that I want to accept UTF-8 strings, I need a more complex scheme since, for example, all forms of the letter "a" should be equivalent from an alphabetizing standpoint.
Is there a function somewhere that can do this, or will I have to craft my own comparison table to sort these strings?
The CompareStringEx function probably does what you need.
But note that this function (and the Windows API in general) does not use the UTF-8 encoding to represent Unicode strings. Instead, it uses the UTF-16 encoding (aka "wide character strings"). You might just be confusing the UTF-8 encoding with Unicode in general. But if you are really dealing with UTF-8 encoded strings, you can convert them from UTF-8 to wide character strings with the MultiByteToWideChar function.
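To show the two steps in a form that runs anywhere, here is a Python sketch: first the UTF-8 to UTF-16 conversion, then a sort in which accented forms of a letter sort with the base letter. The sort key is a crude stand-in built on Unicode normalization, not what CompareStringEx does internally, and real collation rules vary by locale. All function names are made up for this example.

```python
import unicodedata

def utf8_to_utf16le(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16LE, the layout wide-char APIs expect."""
    return data.decode("utf-8").encode("utf-16-le")

def base_letter_key(s: str) -> str:
    """Crude sort key: drop combining marks and ignore case."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

# Names as they might arrive, encoded as UTF-8 bytes.
names = [b"\xc3\x81lvaro", b"Zoe", b"andrew", b"\xc3\xa4nne"]
decoded = [n.decode("utf-8") for n in names]
sorted_names = sorted(decoded, key=base_letter_key)
print(sorted_names)
```

A plain code point comparison would instead put every accented name after "Zoe", which is the behaviour the question wants to avoid.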