Maximum number of characters output from Win32 ToUnicode()/ToAscii() - windows

What is the maximum number of characters that could be output from the Win32 functions ToUnicode()/ToAscii()?
Surely there is a sensible upper bound on what it can output given a virtual key code, scan code, and keyboard state?

On my Windows 8 machine, USER32!ToAscii calls USER32!ToUnicode with an internal buffer and cchBuff set to 2. Because the output of ToAscii is an LPWORD and not an LPSTR, we cannot assume anything about the real limits of ToUnicode from this investigation, but we know that ToAscii is always going to output a WORD. The return value tells you whether 0, 1 or 2 bytes of this WORD contain useful data.
Moving on to ToUnicode, things get a bit trickier. If it returns 0 then nothing was written. If it returns 1 or -1 then one UTF-16 code unit was written. We are then left with the strange "2 ≤ return value" case. We can try to dissect the MSDN documentation:
Two or more characters were written to the buffer specified by pwszBuff. The most common cause for this is that a dead-key character (accent or diacritic) stored in the keyboard layout could not be combined with the specified virtual key to form a single character. However, the buffer may contain more characters than the return value specifies. When this happens, any extra characters are invalid and should be ignored.
You could interpret this as "two or more characters were written but only two of them are valid" but then the return value should be documented as 2 and not 2 ≤ value.
I believe there are two things going on in that paragraph, and we should first eliminate what it calls "extra characters":
However, the buffer may contain more characters than the return value specifies.
This just implies that the function may party on your buffer beyond what it is actually going to return as valid. This is confirmed by:
When this happens, any extra characters are invalid and should be ignored.
This just leaves us with the unfortunate opening sentence:
Two or more characters were written to the buffer specified by pwszBuff.
I have no problem imagining a return value of 2; it can be as simple as a base character combined with a diacritic that does not exist as a precomposed code point.
The "or more" part could come from multiple sources. If the base character is encoded as a surrogate pair then any additional diacritic/combining character will push you over 2. There could also simply be more than one diacritic/combining character on the base character. There might even be a leading LTR/RTL mark.
I don't know if it is possible to end up with all three conditions at the same time, but I would play it safe and specify a buffer of 10 or so WCHARs. This should be well within the limits of what you can produce on a keyboard with "a single keystroke".
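To make that concrete, here is a minimal sketch (my own, not from the answer itself) of calling ToUnicode with a 10-WCHAR buffer and keeping only as many code units as the return value vouches for. The TranslateKeystroke helper and the 'A'-key probe in main are illustrative names, not anything from the question.

#include <windows.h>
#include <cstdio>

// Translate one keystroke; trust only the code units the return value reports.
// The 10-WCHAR buffer follows the recommendation above.
int TranslateKeystroke(UINT vk, UINT scanCode, const BYTE keyState[256],
                       WCHAR out[10])
{
    int rc = ToUnicode(vk, scanCode, keyState, out, 10, 0);
    if (rc > 0)   return rc;   // rc valid UTF-16 code units were written to out[]
    if (rc == -1) return 1;    // a dead key was stored; out[0] holds the dead-key character
    return 0;                  // rc == 0: the keystroke produced no translation
}

int main()
{
    BYTE state[256] = {0};     // no modifiers pressed
    WCHAR buf[10];
    UINT sc = MapVirtualKeyW('A', MAPVK_VK_TO_VSC);
    int n = TranslateKeystroke('A', sc, state, buf);
    printf("%d code unit(s)\n", n);   // typically 1 ('a') with the US layout
    return 0;
}

In a real message loop the keyState array would normally come from GetKeyboardState rather than being zeroed.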
This is by no means a final answer but it might be the best you are going to get unless somebody from Microsoft responds.

In the usual dead-key case we can receive one or two wchar_t values from a single ToUnicode call (if the key cannot be combined with the dead key, it returns two wchar_t values).
But Windows also supports ligatures:
A ligature, in keyboard terminology, means a single key that outputs two or more UTF-16 code points. Note that some languages use scripts that are outside the BMP (Basic Multilingual Plane) and need to be realized entirely through ligatures of surrogate pairs (two UTF-16 code units each).
If we want to look from a practical side of things: Here is a list of Windows system keyboard layouts that are using ligatures.
51 out of 208 system layouts have ligatures
So, as we can see from those tables, we can get up to 4 wchar_t values from one ToUnicode() call (for one keypress).
If we want to look at it from a theoretical perspective, we can look at kbd.h in the Windows SDK, where the underlying keyboard layout structures are defined:
/*
 * Macro for ligature with "n" characters
 */
#define TYPEDEF_LIGATURE(n) typedef struct _LIGATURE##n { \
    BYTE  VirtualKey;                                     \
    WORD  ModificationNumber;                             \
    WCHAR wch[n];                                         \
} LIGATURE##n, *KBD_LONG_POINTER PLIGATURE##n;

/*
 * Table element types (for various numbers of ligatures), used
 * to facilitate static initializations of tables.
 *
 * LIGATURE1 and PLIGATURE1 are used as the generic type
 */
TYPEDEF_LIGATURE(1) // LIGATURE1, *PLIGATURE1;
TYPEDEF_LIGATURE(2) // LIGATURE2, *PLIGATURE2;
TYPEDEF_LIGATURE(3) // LIGATURE3, *PLIGATURE3;
TYPEDEF_LIGATURE(4) // LIGATURE4, *PLIGATURE4;
TYPEDEF_LIGATURE(5) // LIGATURE5, *PLIGATURE5;

typedef struct tagKbdLayer {
    ....
    /*
     * Ligatures
     */
    BYTE nLgMax;
    BYTE cbLgEntry;
    PLIGATURE1 pLigature;
    ....
} KBDTABLES, *KBD_LONG_POINTER PKBDTABLES;
nLgMax here is the maximum number of characters in any ligature, i.e. the size of the wch[n] array in each table entry (so it determines the size of each pLigature entry).
cbLgEntry is the size in bytes of each entry in the pLigature table.
nLgMax is a BYTE, which means that a single ligature could theoretically produce up to 255 wchar_t values (UTF-16 code units).
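As a hedged illustration of where nLgMax and cbLgEntry live at run time: keyboard layout DLLs conventionally export a KbdLayerDescriptor function that returns a PKBDTABLES, so a sketch like the one below can print those fields for an installed layout. The choice of kbdus.dll is arbitrary (it has no ligatures, so it should report zeros; pick one of the layouts from the list above to see non-zero values), and it assumes kbd.h is available on your include path; some versions of that header expect KBD_TYPE to be pre-defined.

#define KBD_TYPE 4          // some versions of kbd.h expect this to be defined up front
#include <windows.h>
#include <kbd.h>            // the header quoted above
#include <cstdio>

typedef PKBDTABLES (APIENTRY *PFN_KBDLAYERDESCRIPTOR)(VOID);

int main()
{
    // Load a layout DLL directly; kbdus.dll is just an example choice.
    HMODULE hLayout = LoadLibraryW(L"kbdus.dll");
    if (!hLayout) return 1;

    PFN_KBDLAYERDESCRIPTOR pfn =
        (PFN_KBDLAYERDESCRIPTOR)GetProcAddress(hLayout, "KbdLayerDescriptor");
    if (pfn) {
        PKBDTABLES pTables = pfn();
        if (pTables)
            printf("nLgMax=%u cbLgEntry=%u\n",
                   (unsigned)pTables->nLgMax, (unsigned)pTables->cbLgEntry);
    }

    FreeLibrary(hLayout);
    return 0;
}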

Related

bwip.js: How to use the Group Separator character with GS1-128

There is a service for generating barcodes hosted on metafloor.com, using bwip.js.
I want to generate a barcode for the following data (the GS character is represented by {GS}).
(01)10875066000333(10)1212{GS}(17)121212(30)8{GS}
According to the documentation I'm able to generate a barcode for the data without GS characters:
https://bwipjs-api.metafloor.com/?bcid=gs1-128&text=(01)10875066000333(10)1212(17)121212(30)8
But the scanner requires the GS characters.
The documentation is clear:
Special characters must be encoded in the format ^NNN.
The parse option has to be enabled by using the parsefnc parameter.
The parameter has to be URL-encoded.
So for my string it's:
https://bwipjs-api.metafloor.com/?bcid=gs1-128&text=(01)10875066000333(10)1212%5E029(17)121212(30)8%5E029&parsefnc
But this gives me Error: bwipp.GS1badCSET82character: AI 10: Invalid CSET 82 character.
I also tried:
Sending the GS char directly as %1D
Sending the GS char as %5EGS
Sending the GS char as ^029
Sending the GS char directly
Setting parsefnc=true
Combinations of all of the above
But I'm still getting the same error.
Is there something I'm doing wrong or is the problem on the other side?
For GS1 Application Identifier based data, trust the library to encode the data correctly by selecting the GS1-specific encoder for the symbology (gs1-128 in this case) and then provide the input in bracketed AI notation, i.e. without FNC1 / GS separators.
The encoder will automatically add all of the necessary FNC1 non-data characters (which are transmitted as ASCII GS characters when read by a scanner) and it will also validate the contents of the AI data that you supply.
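For this particular symbol, that means dropping the manual ^029 insertions entirely and letting the gs1-128 encoder place the FNC1 separators itself. Based on the URLs in the question, the request would presumably look like this (an untested sketch, not a verified call):

https://bwipjs-api.metafloor.com/?bcid=gs1-128&text=(01)10875066000333(10)1212(17)121212(30)8

The encoder should then emit the FNC1 after the variable-length AI (10) automatically, and a scanner transmits that FNC1 as the GS character the question is after.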
Users that select a generic symbology and then attempt to perform the AI encoding themselves are prone to making several mistakes:
Omitting the required FNC1 in first position.
Omitting the required FNC1 separators at the end of AIs with no pre-determined width.
Terminating pre-defined length AIs with unnecessary FNC1 characters.
Terminating the message with an unnecessary FNC1 character.
Encoding ASCII GS data characters instead of the canonical FNC1 non-data characters.
Including illegal, literal parentheses to denote the AIs.
Providing improperly formatted or invalid AI values.
Omitting requisite AI attributes.
Including mutually-exclusive AI pairings.
Many of these mistakes will result in failure to decode and interpret the GS1 AI data (even if the barcode appears to read successfully) which may result in charge-backs and necessitate relabelling or disposal.
The data that you are providing runs afoul of at least some of these pitfalls.
See this article for a thorough description of the checks that BWIPP (and hence BWIP-JS) implements to prevent such data quality issues.

When using a convert-char-to-ASCII-value routine, I need to find out what values it is actually returning, as they are not strictly ASCII

When testing my code, which uses an "ASCII value" routine to decide which chars to show, my program should drop control chars but keep chars that may be entered by the user. It seems that while the routine is called "ascii", it does not just return ASCII values: giving it the char ƒ returns 402.
For example, I have found this web site,
but it doesn't list ƒ as 402 that I can see.
I need to know whether there are other codes above 402 that I need to test my code with. The character set used internally by the software that 'ascii' is written in is UCS-2. The web site I found doesn't mention UCS-2.
There are probably many interpretations of »Control Character« out there, but I'll assume you mean the C0 and C1 control characters (the linked page includes references to the relevant Unicode standards).
The commonly used 32-bit integer representation of Unicode characters in general is the code point notation: »U+« followed by an at-least-4-digit positive hex number, which you will find near mentions of characters, e.g. as in »U+007F (delete)«. The result of your »ASCII value« routine will probably be this number without the »U+«.
UCS-2 is a specific encoding for Unicode characters (which you probably won't need to care about directly), and is equivalent to Unicode code points only for characters within the range of the BMP.
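As an illustration of that answer, here is a small sketch (my own, assuming the "ASCII value" routine returns the Unicode code point of each UCS-2 code unit) that treats C0 controls, DEL and C1 controls as droppable and everything else, including 402 (U+0192, ƒ), as keepable. The is_control name is just an illustrative helper.

#include <cstdio>

// C0 controls are U+0000..U+001F, DEL is U+007F, C1 controls are U+0080..U+009F.
static bool is_control(unsigned int cp) {
    return cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F);
}

int main() {
    printf("%d %d %d\n", is_control(0x0A), is_control(0x192), is_control(0x85));
    // expected: 1 0 1 -- line feed is a control, U+0192 (402) is not, NEL (U+0085) is
    return 0;
}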

Can .proto files' fields start at zero?

.proto examples all seem to start numbering their fields at one.
e.g. https://developers.google.com/protocol-buffers/docs/proto#simple
message SearchRequest {
  required string query = 1;
  optional int32 page_number = 2;
  optional int32 result_per_page = 3;
}
If zero could be used, it would make some messages one or more bytes smaller (i.e. those with one or more fields numbered 16).
As the key is simply a varint encoding of (fieldnum << 3 | fieldtype) I can't immediately see why zero shouldn't be used.
Is there a reason for not starting the field numbering at zero?
One very immediate reason is that zero field numbers are rejected by protoc:
test.proto:2:28: Field numbers must be positive integers.
As to why Protocol Buffers has been designed this way, I can only guess. One nice consequence is that a message full of zeros will be detected as invalid. Zero can also be used to indicate "no field" internally as a return value in the protocol buffers implementation.
Assigning Tags
As you can see, each field in the message definition has a unique numbered tag. These tags are used to identify your fields in the message binary format, and should not be changed once your message type is in use. Note that tags with values in the range 1 through 15 take one byte to encode, including the identifying number and the field's type (you can find out more about this in Protocol Buffer Encoding). Tags in the range 16 through 2047 take two bytes. So you should reserve the tags 1 through 15 for very frequently occurring message elements. Remember to leave some room for frequently occurring elements that might be added in the future.
The smallest tag number you can specify is 1, and the largest is 2^29 - 1, or 536,870,911. You also cannot use the numbers 19000 through 19999 (FieldDescriptor::kFirstReservedNumber through FieldDescriptor::kLastReservedNumber), as they are reserved for the Protocol Buffers implementation - the protocol buffer compiler will complain if you use one of these reserved numbers in your .proto. Similarly, you cannot use any previously reserved tags.
https://developers.google.com/protocol-buffers/docs/proto
Just as the documentation says, 0 simply cannot be used.
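For the size argument in the question, here is a small sketch (assuming the standard protobuf wire format) showing the tag key (fieldnum << 3 | wiretype) crossing the one-byte varint boundary exactly at field number 16. The varint_size helper is just an illustrative name.

#include <cstdint>
#include <cstdio>

// Number of bytes a value occupies when encoded as a base-128 varint.
static int varint_size(uint32_t v) {
    int n = 1;
    while (v >= 0x80) { v >>= 7; ++n; }
    return n;
}

int main() {
    const uint32_t wire_type = 0;                    // a varint-typed field
    for (uint32_t field : {1u, 15u, 16u, 2047u, 2048u}) {
        uint32_t key = (field << 3) | wire_type;     // tag key = fieldnum << 3 | wiretype
        printf("field %4u -> key 0x%04X -> %d byte(s)\n", field, key, varint_size(key));
    }
    // fields 1..15 take one key byte, 16..2047 take two, 2048 takes three,
    // matching the documentation quoted above.
    return 0;
}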

Decoded barcode extra digits

I am trying to come to terms with how a barcode is decoded and generated by a scanner.
A note from the client says the following generated bar code contains extra characters:
Generated Code: |2389299920014}
Extra Characters: Apparently the first two and last three characters are not part of the bar code.
Question
Are the extra characters attached by the bar code reader (therefore dependent on the scanner) or are they an intrinsic part of the barcode?
Here is a sample image of a barcode:
http://imageshack.us/a/img824/1862/dm6x.jpg
Thanks
[SOLVED] My apologies. This was just another one of those cases of 'shooting your mouth off' without doing proper research.
Solution: The code is EAN-13. The prefix and suffix are probably scanner-dependent. The 13 digits in between break down, reading left to right, as: the first 3 digits are the GS1 prefix, the next 9 digits are the company ID + item ID, and the last digit is the check digit.
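For completeness, a small sketch (my own, using the commonly cited example number 4006381333931 rather than the scanned value, which is ambiguous) of how the EAN-13 check digit mentioned above is verified. The ean13_check_digit_ok name is just an illustrative helper.

#include <cctype>
#include <cstdio>
#include <string>

// Validate an EAN-13 string's check digit (the last digit). Counting positions
// from the left, digits in even positions (2nd, 4th, ...) are weighted 3,
// odd positions are weighted 1.
static bool ean13_check_digit_ok(const std::string& code) {
    if (code.size() != 13) return false;
    int sum = 0;
    for (int i = 0; i < 12; ++i) {
        if (!isdigit((unsigned char)code[i])) return false;
        int d = code[i] - '0';
        sum += (i % 2 == 0) ? d : 3 * d;   // index 0 is position 1 (weight 1)
    }
    int expected = (10 - sum % 10) % 10;
    return expected == code[12] - '0';
}

int main() {
    printf("%d\n", ean13_check_digit_ok("4006381333931"));   // prints 1 (valid)
    return 0;
}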
It's hard to answer without understanding what format you are trying to encode, what the intended contents are, and what the purported contents are.
Some formats add extra information as part of the encoding process, but it does not become part of the content. When correctly encoded and decoded, the output should match the input exactly.
Barcodes encode what they encode; there is no data that is somehow part of the barcode but not encoded in it.
EAN-13 has no scanner-dependent considerations, no. The encoding and decoding of a given number is the same everywhere. EAN-13 encodes 13 digits, so I am not sure what the 13 digits "in between" mean.
You mention GS1, which is something else. A family of barcodes in fact. You'd have to say what specifically you are using. The GS1 encodings are likewise not ambiguous or scanner-dependent. You know what you want to encode, you encode it exactly, it's read exactly.

Win32 Edit Control - GetText does not return final \n

I have a Win32 Edit window (i.e. CreateWindow with classname "EDIT").
Every time I add a line to the control I append '\r\n' (i.e new line).
However, when I call WM_GETTEXT to get the text of the EDIT window, it is always missing the last '\n'.
If I add 1 to the result of WM_GETTEXTLENGTH and use that as the buffer size, I get the correct character count and WM_GETTEXT returns the final '\n'.
MSDN says this about WM_GETTEXTLENGTH:
When the WM_GETTEXTLENGTH message is sent, the DefWindowProc function returns the length, in characters, of the text. Under certain conditions, the DefWindowProc function returns a value that is larger than the actual length of the text. This occurs with certain mixtures of ANSI and Unicode, and is due to the system allowing for the possible existence of double-byte character set (DBCS) characters within the text. The return value, however, will always be at least as large as the actual length of the text; you can thus always use it to guide buffer allocation. This behavior can occur when an application uses both ANSI functions and common dialogs, which use Unicode.
... but that doesn't explain the off-by-one conundrum.
Why does this occur, and is it safe for me to just add an unexplained 1 to the text length?
Edit
After disabling the Unicode compile, I can get it working with an ASCII build; however, I would like to get this working with a UNICODE build. Perhaps the EDIT window control does not behave well with UNICODE?
Try setting the ES_MULTILINE and ES_WANTRETURN styles for your edit control.
\r and \n map to byte constructs, which work when you compile for ASCII.
Because \r, \n are not guaranteed to represent carriage return, line feed (both could map to line feed, for example), it is best to use the hexadecimal code points when building the string. (You would probably use the TCHAR functions.)
Compile for ASCII - sprintf(dest, "%s\x0D\x0A", str);
Compile for UNICODE - wsprintf(dest, L"%s\x000D\x000A", str);
When you call WM_GETTEXT to retrieve the text, you might need to call WideCharToMultiByte to convert it to a certain code page or character set, such as ASCII or UTF-8, in order to save it to a file.
http://msdn.microsoft.com/en-us/library/aa450989.aspx
The documentation for WM_GETTEXT says the supplied buffer has to be large enough to include the null terminator. The documentation for WM_GETTEXTLENGTH says the return value does not include the null terminator. So you have to include room for an extra character when allocating the buffer that receives the text.
You have to add room for one extra character for the terminating '\0'.
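Putting those two points together, here is a minimal sketch (hwndEdit being the handle from the question's CreateWindow call, GetEditText an illustrative name) that sizes the buffer as WM_GETTEXTLENGTH + 1 so the terminating null has room and the final '\n' is not dropped.

#include <windows.h>
#include <string>

std::wstring GetEditText(HWND hwndEdit)
{
    int len = (int)SendMessageW(hwndEdit, WM_GETTEXTLENGTH, 0, 0);
    std::wstring text(len + 1, L'\0');               // +1 for the null terminator
    int copied = (int)SendMessageW(hwndEdit, WM_GETTEXT,
                                   (WPARAM)(len + 1), (LPARAM)&text[0]);
    text.resize(copied);                             // WM_GETTEXT returns chars copied,
    return text;                                     // excluding the null; trailing \r\n kept
}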
