Can .proto files' fields start at zero? - protocol-buffers

.proto examples all seem to start numbering their fields at one.
e.g. https://developers.google.com/protocol-buffers/docs/proto#simple
message SearchRequest {
required string query = 1;
optional int32 page_number = 2;
optional int32 result_per_page = 3;
}
If zero could be used, it would make some messages one or more bytes smaller (i.e. those with one or more fields numbered 16, since shifting every field number down by one would bring field 16 into the one-byte key range).
As the key is simply a varint encoding of (field_number << 3 | wire_type), I can't immediately see why zero shouldn't be used.
Is there a reason for not starting the field numbering at zero?

One very immediate reason is that zero field numbers are rejected by protoc:
test.proto:2:28: Field numbers must be positive integers.
As to why Protocol Buffers was designed this way, I can only guess. One nice consequence is that a message consisting entirely of zero bytes is detected as invalid. Zero can also be used internally to mean "no field", e.g. as a return value inside a Protocol Buffers implementation.

Assigning Tags
As you can see, each field in the message definition has a unique numbered tag. These tags are used to identify your fields in the message binary format, and should not be changed once your message type is in use. Note that tags with values in the range 1 through 15 take one byte to encode, including the identifying number and the field's type (you can find out more about this in Protocol Buffer Encoding). Tags in the range 16 through 2047 take two bytes. So you should reserve the tags 1 through 15 for very frequently occurring message elements. Remember to leave some room for frequently occurring elements that might be added in the future.
The smallest tag number you can specify is 1, and the largest is 2^29 − 1, or 536,870,911. You also cannot use the numbers 19000 through 19999 (FieldDescriptor::kFirstReservedNumber through FieldDescriptor::kLastReservedNumber), as they are reserved for the Protocol Buffers implementation - the protocol buffer compiler will complain if you use one of these reserved numbers in your .proto. Similarly, you cannot use any previously reserved tags.
https://developers.google.com/protocol-buffers/docs/proto
Just like the document says, the smallest tag is 1, so 0 can't be used.
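The one-byte/two-byte key boundary described above is easy to verify by hand. A minimal sketch of the key encoding (the key is the varint of field_number << 3 | wire_type, so field numbers 1 through 15 fit in a single byte for any wire type, and 16 and up need two):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def tag_bytes(field_number: int, wire_type: int) -> bytes:
    """The key preceding each field is varint(field_number << 3 | wire_type)."""
    return encode_varint(field_number << 3 | wire_type)

print(len(tag_bytes(15, 0)))  # field 15 -> 1-byte key
print(len(tag_bytes(16, 0)))  # field 16 -> 2-byte key
```

This is why reserving 1 through 15 for frequent fields matters: the key for field 15 is 0x78 (one byte), while field 16 encodes as 0x80 0x01 (two bytes).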

Related

Protobuf deserialize exception

Trying to deserialize a message using protobuf in Java and getting the below exception.
Caused by: com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length.
at com.google.protobuf.InvalidProtocolBufferException.truncatedMessage(InvalidProtocolBufferException.java:86)
at com.google.protobuf.CodedInputStream$ArrayDecoder.readRawLittleEndian64(CodedInputStream.java:1179)
at com.google.protobuf.CodedInputStream$ArrayDecoder.readFixed64(CodedInputStream.java:791)
at com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:534)
at com.google.protobuf.GeneratedMessageV3.parseUnknownFieldProto3(GeneratedMessageV3.java:305)
I've manually decoded your string, and I agree with the library: your message is truncated. I'm guessing that this is because you're using string-based APIs, and there is a zero byte in the data - many text APIs treat a zero byte (NUL in ASCII terms) as the end of the string.
Here's the breakdown:
\n=10=field 1, length prefix - I'm assuming this is a string
\x14=20
"id:article:v1:964000"
(22 bytes used for field 1)
\x12=18=field 2, length prefix - I'm assuming this is a sub-message
$=36
\n=10=field 1, length prefix - I'm assuming this is a string
\x10=16
"predicted_topics"
(18 bytes used for field 2.1)
\x12=18=field 2, length prefix - I'm assuming this is a string
\x06=6
"IS/biz"
(8 bytes used for field 2.2)
\x1a=26=field 3, length prefix - I'm assuming this is "bytes"
\x08=8
\xf0
l
\x8f
\xde
p
\x9f
\xe4
(unexpected EOF)
at the end, we're trying to decode 8 bytes of the inner-most message, and we've only got 7 bytes left. I know this isn't a sub-message because that would result in an invalid tag, and it doesn't look like UTF-8, so I'm assuming that this is a bytes field (but frankly it doesn't matter: we need 8 bytes, and we only have 7).
My guess is that the last byte in the bytes field was zero; if we assume a missing \x00 at the end, then field 2.3 is 10 bytes, and we've accounted for 18+8+10=36 bytes, which would make the sub-message (field 2) complete. There may well be more missing data after the outer sub-message - I have no way of knowing.
So: make sure you're not using text-based APIs with binary data.
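The manual walk above can be mechanized. A minimal sketch of a wire-format walker (handling only the varint and length-delimited wire types, which is enough for this message) that raises the same truncation condition the Java library reported:

```python
def read_varint(buf: bytes, pos: int):
    """Decode one base-128 varint starting at pos; return (value, new_pos)."""
    shift = result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def walk(buf: bytes):
    """Walk top-level fields, raising EOFError on a truncated length prefix."""
    pos, fields = 0, []
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire = key >> 3, key & 7
        if wire == 0:                      # varint
            value, pos = read_varint(buf, pos)
            fields.append((field, value))
        elif wire == 2:                    # length-delimited (string/bytes/message)
            length, pos = read_varint(buf, pos)
            if pos + length > len(buf):
                raise EOFError(f"field {field}: need {length} bytes, have {len(buf) - pos}")
            fields.append((field, buf[pos:pos + length]))
            pos += length
        else:
            raise ValueError(f"wire type {wire} not handled in this sketch")
    return fields
```

Feeding it the truncated tail from the breakdown (a \x1a\x08 key/length followed by only 7 bytes) reproduces the "input ended unexpectedly" failure, while the intact field 1 parses cleanly.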

How to enforce maximum string length and allowed characters in protocol buffer?

I am working with protocol buffers to integrate two systems. I want the .proto file to be the contract between those systems. I see no way to put constraints on the string type:
maximum length
minimum length
allowed characters
How to enforce (or at least document) those rules in .proto files?
No idea how to enforce it, but you can document the rules by putting standard C comments /* comment */ in the .proto file. That's what I do to describe some fields of my messages.
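Since the .proto file cannot express these constraints, they end up enforced in application code at the boundary where messages are parsed or built. A sketch of such a check (the limits and allowed-character set here are hypothetical, chosen only for illustration):

```python
import re

# Hypothetical contract for a string field such as SearchRequest.query.
# Protobuf itself cannot express these rules, so they live here and are
# documented as comments in the .proto file.
MAX_LEN = 64
ALLOWED = re.compile(r"^[A-Za-z0-9 _.-]+$")

def validate_query(query: str) -> str:
    """Reject strings that violate the documented contract."""
    if not 1 <= len(query) <= MAX_LEN:
        raise ValueError(f"length {len(query)} outside [1, {MAX_LEN}]")
    if not ALLOWED.match(query):
        raise ValueError("contains characters outside the allowed set")
    return query
```

Later tooling such as protoc-gen-validate added field options for exactly this purpose, but with plain proto2/proto3 an application-level check plus a comment is about the best you can do.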

Maximum number of characters output from Win32 ToUnicode()/ToAscii()

What is the maximum number of characters that could be output from the Win32 functions ToUnicode()/ToAscii()?
Surely there is a sensible upper bound on what it can output given a virtual key code, scan key code, and keyboard state?
On my Windows 8 machine, USER32!ToAscii calls USER32!ToUnicode with an internal buffer and cchBuff set to 2. Because the output of ToAscii is an LPWORD and not an LPSTR, we cannot assume anything about the real limits of ToUnicode from this investigation, but we know that ToAscii is always going to output a WORD. The return value tells you whether 0, 1 or 2 bytes of this WORD contain useful data.
Moving on to ToUnicode and things get a bit trickier. If it returns 0 then nothing was written. If it returns 1 or -1 then one UCS-2 code point was written. We are then left with the strange 2 <= return expression. We can try to dissect the MSDN documentation:
Two or more characters were written to the buffer specified by pwszBuff. The most common cause for this is that a dead-key character (accent or diacritic) stored in the keyboard layout could not be combined with the specified virtual key to form a single character. However, the buffer may contain more characters than the return value specifies. When this happens, any extra characters are invalid and should be ignored.
You could interpret this as "two or more characters were written but only two of them are valid" but then the return value should be documented as 2 and not 2 ≤ value.
I believe there are two things going on in that sentence and we should eliminate what it calls "extra characters":
However, the buffer may contain more characters than the return value specifies.
This just implies that the function may party on your buffer beyond what it is actually going to return as valid. This is confirmed by:
When this happens, any extra characters are invalid and should be ignored.
This just leaves us with the unfortunate opening sentence:
Two or more characters were written to the buffer specified by pwszBuff.
I have no problem imagining a return value of 2, it can be as simple as a base character combined with a diacritic that does not exist as a pre-composed code point.
The "or more" part could come from multiple sources. If the base character is encoded as a surrogate-pair then any additional diacritic/combining-character will push you over 2. There could simply also be more than one diacritic/combining-character on the base character. There might even be a leading LTR/RTL mark.
I don't know if it is possible to end up with all 3 conditions at the same time but I would play it safe and specify a buffer of 10 or so WCHARs. This should be well within the limits of what you can produce on a keyboard with "a single keystroke".
This is by no means a final answer but it might be the best you are going to get unless somebody from Microsoft responds.
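The counting argument above (surrogate pair plus combining characters) can be illustrated without touching the Win32 API at all, by counting UTF-16 code units, which is what ToUnicode's return value measures:

```python
def utf16_units(s: str) -> int:
    """Number of UTF-16 code units (WCHARs) needed to store s."""
    return len(s.encode("utf-16-le")) // 2

print(utf16_units("e\u0301"))            # base char + combining acute -> 2
print(utf16_units("\U00010400"))         # non-BMP char, surrogate pair -> 2
print(utf16_units("\U00010400\u0301"))   # surrogate pair + diacritic  -> 3
```

Each additional combining character or directional mark adds at least one more unit, which is why a fixed buffer of 10 or so WCHARs is a comfortable margin for anything a single keystroke can produce.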
In the usual dead-key case we receive one or two wchar_t values per ToUnicode call (if the key cannot be composed with the dead key, it returns two wchar_t values).
But Windows also supports ligatures:
A ligature in keyboard terminology means when a single key outputs two or more UTF-16 codepoints. Note that some languages use scripts that are outside of the BMP (Basic Multilingual Plane) and need to be completely realized by ligatures of surrogate pairs (two UTF-16 codepoints).
If we want to look from a practical side of things: Here is a list of Windows system keyboard layouts that are using ligatures.
51 out of 208 system layouts have ligatures
So as we can see from the tables - we can have up to 4 wchar_t for one ToUnicode() call (for one keypress).
If we want to look from a theoretical perspective - we can look at kbd.h in Windows SDK where underlying keyboard layout structures are defined:
/*
 * Macro for ligature with "n" characters
 */
#define TYPEDEF_LIGATURE(n) typedef struct _LIGATURE##n { \
    BYTE  VirtualKey;                                     \
    WORD  ModificationNumber;                             \
    WCHAR wch[n];                                         \
} LIGATURE##n, *KBD_LONG_POINTER PLIGATURE##n;

/*
 * Table element types (for various numbers of ligatures), used
 * to facilitate static initializations of tables.
 *
 * LIGATURE1 and PLIGATURE1 are used as the generic type
 */
TYPEDEF_LIGATURE(1) // LIGATURE1, *PLIGATURE1;
TYPEDEF_LIGATURE(2) // LIGATURE2, *PLIGATURE2;
TYPEDEF_LIGATURE(3) // LIGATURE3, *PLIGATURE3;
TYPEDEF_LIGATURE(4) // LIGATURE4, *PLIGATURE4;
TYPEDEF_LIGATURE(5) // LIGATURE5, *PLIGATURE5;
typedef struct tagKbdLayer {
    ....
    /*
     * Ligatures
     */
    BYTE       nLgMax;
    BYTE       cbLgEntry;
    PLIGATURE1 pLigature;
    ....
} KBDTABLES, *KBD_LONG_POINTER PKBDTABLES;
nLgMax here is the size of the LIGATURE##n.wch[n] array, i.e. the maximum number of characters a single ligature can produce (it determines the size of each pLigature entry).
cbLgEntry is the size in bytes of each pLigature entry.
So nLgMax is a BYTE, which means that a ligature could theoretically be up to 255 wchar_t values (UTF-16 code units).

For SCSI, was there a change to the definition of ADDITIONAL LENGTH?

I am reading the SCSI SPC4r22. In regards to ADDITIONAL LENGTH all revisions prior to spc3 have stated the following ("shall not be adjusted"):
From spc2r20.pdf:
"The ADDITIONAL LENGTH field shall specify the length in bytes of the parameters. If the ALLOCATION LENGTH of the CDB is too small to transfer all of the parameters, the ADDITIONAL LENGTH shall not be adjusted to reflect the truncation."
But I don't see that statement in SPC3 or SPC4. Has that been changed or am I missing the phrase? If I'm missing it, can someone please quote it?
It is just worded in a more general way:
"If the information being transferred to the Data-In Buffer includes fields containing counts of the number of bytes in some or all of the data, then the contents of these fields shall not be altered to reflect the truncation"
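A sketch of what the rule means in practice, using the standard INQUIRY parameter data layout (byte 4 is ADDITIONAL LENGTH, counting the bytes that follow it; the 36-byte standard data size and zero-filled payload here are just for illustration):

```python
def inquiry_data(full_len: int, allocation_length: int) -> bytes:
    """Build standard INQUIRY parameter data of full_len bytes, then truncate
    the transfer to the CDB's ALLOCATION LENGTH. ADDITIONAL LENGTH (byte 4)
    counts the bytes after byte 4 of the *full* parameter data and, per the
    quoted rule, is NOT reduced to reflect the truncation."""
    data = bytearray(full_len)            # payload contents irrelevant here
    data[4] = full_len - 5                # full value, independent of truncation
    return bytes(data[:allocation_length])
```

So an initiator that sends a short ALLOCATION LENGTH still sees the device's true ADDITIONAL LENGTH (31 for 36-byte standard data) and can reissue the command with a larger buffer.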

Windows DNS server debug log hostname format

I was reading a Windows DNS server debug log file, in particular the packet captures, and am trying to understand how to parse the host names in order to use them in scripts.
The following is an example from an ANSWER section:
Offset = 0x007f, RR count = 2
Name "[C06A](5)e6033(1)g(10)akamaiedge[C059](3)net(0)"
TYPE A (1)
CLASS 1
TTL 20
DLEN 4
DATA 23.62.18.101
So, looking at the string "[C06A](5)e6033(1)g(10)akamaiedge[C059](3)net(0)" I realized that the numbers in parentheses are a count of the number of characters that follow. Replacing all of them with dots (except the first and last, which should just be removed) works like a charm.
The stuff in square brackets, though, remains a mystery to me. If I simply remove it all after handling the parentheses and quotes, the above string becomes e6033.g.akamaiedge.net. That is a valid host name.
So my question is: what does that content in square brackets actually mean? What is the correct way to turn that string into a proper host name I could feed to nslookup and other tools?
It appears it's the 2nd possible form of the NAME field as documented here:
http://www.zytrax.com/books/dns/ch15/#name
NAME This name reflects the QNAME of the question and may take
one of TWO formats. The first format is the label format defined for
QNAME above. The second format is a pointer (in the interests of data
compression, which to be fair to the original authors was far more
important then than now). A pointer is an unsigned 16-bit value with
the following format (the top two bits of 11 indicate the pointer
format):
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 1  1  <------------- offset (14 bits) ------------->
The offset in octets (bytes) from the start of the whole message.
Must point to a label format record to derive name length.
Note: Pointers, if used, terminate names. The name field may consist
of a label (or sequence of labels) terminated with a zero length
record OR a single pointer OR a label (or label sequence) terminated
with a pointer.
where the response is using pointers to refer to data elsewhere in the message.
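Putting the two observations together (length-prefixed labels in parentheses, compression pointers in brackets), a sketch of a converter for the debug-log format. It assumes, as the question's experiment suggests, that the log prints the labels a pointer expands to inline, so the [hex] pointer markers can simply be dropped for display:

```python
import re

def log_name_to_host(name: str) -> str:
    """Turn a Windows DNS debug-log name like
    "[C06A](5)e6033(1)g(10)akamaiedge[C059](3)net(0)"
    into e6033.g.akamaiedge.net. The [hex] tokens are compression pointers
    (top two bits 11, remaining 14 bits an offset into the message); since
    the log appears to print the pointed-to labels inline anyway, we drop
    the markers and join the (count)text labels with dots."""
    name = name.strip('"')
    name = re.sub(r"\[[0-9A-Fa-f]{4}\]", "", name)        # drop pointer markers
    labels = re.findall(r"\((\d+)\)([^()\[\]]*)", name)   # (count, text) pairs
    return ".".join(text for count, text in labels if int(count) > 0)
```

The zero-length label (0) at the end is the root-label terminator and is discarded, matching the questioner's "remove the first and last" rule. The resulting name can be fed directly to nslookup.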