Are BSON arrays heterogeneous or homogeneous?

This is not explicitly stated in the spec, although the example in the footnotes is given for a homogeneous array.
A Google search doesn't yield a definitive answer.
Looking at various APIs, array contents are returned as objects rather than as values of a single type, and can then be inspected dynamically.
The only practical reason I personally can see for a heterogeneous array is that if an array contains documents, these may have different sets of fields. Otherwise a user would prefer a (heterogeneous) document over a (homogeneous) array.

It does say so in the spec.
It says a document is an int32 length followed by an element list, terminated by a null byte:
document ::= int32 e_list "\x00"
An element list is made up of elements:
e_list ::= element e_list
And elements can be of any type BSON supports:
element ::= "\x01" e_name double 64-bit binary floating point
| "\x02" e_name string UTF-8 string
| "\x03" e_name document Embedded document
....<snip>....
The first note at the bottom of the page explains that arrays are simply documents whose keys are ascending integer strings:
Array - The document for an array is a normal BSON document with
integer values for the keys, starting with 0 and continuing
sequentially. For example, the array ['red', 'blue'] would be encoded
as the document {'0': 'red', '1': 'blue'}.
As such,
BSON will happily serialize {"Key1":[12, "12", 12.1, "a string", Binary(0x001232)]}
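A quick round trip with the bson package that ships with PyMongo (assuming that library; the Binary payload mirrors the made-up one above) shows the mixed-type array surviving intact:

import bson                      # the bson package shipped with PyMongo (assumed)
from bson.binary import Binary

doc = {"Key1": [12, "12", 12.1, "a string", Binary(b"\x00\x12\x32")]}
raw = bson.encode(doc)           # serialize the mixed-type array to BSON bytes
assert bson.decode(raw) == doc   # the heterogeneous array round-trips intact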

Related

How does GitHub encode their GraphQL cursors?

GitHub's GraphQL cursors are intentionally opaque, so they shouldn't ever be decoded by a client. However, I'd like to know their approach towards pagination, especially when combined with sorting.
GitHub's pagination cursors use multiple layers of encoding. I will list them in order from the perspective of a decoder (a decoding sketch follows the list):
The cursor string is encoded using URL-safe Base64, meaning it uses - and _ instead of + and /. This might be for consistency with their REST-based API.
Decoding the Base64 string gives us another string in the format cursor:v2:[something], so the next step is decoding the something.
The 'something' is a binary-encoded piece of data containing the actual cursor properties. The first byte defines the cursor type:
0x91 => No sorting is used; the cursor contains the length of the id field followed by the id itself. 0xcd seems to indicate a two-byte id, 0xce a four-byte id. The id can be verified by decoding the Base64 id GraphQL field.
0x92 => A composite cursor containing the sorted property and the id. The sorted property is either a length-prefixed ordinal number or two bytes plus a string or ISO date string; it is followed by the length-prefixed id.
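To make the layering concrete, here is a minimal Python sketch that peels the two outer layers (the inner binary layout is GitHub-internal and may change, so the sketch stops at the raw payload):

import base64

def peel_cursor(cursor: str) -> bytes:
    """Strip the URL-safe Base64 layer and the 'cursor:v2:' prefix,
    returning the inner binary payload whose first byte is the
    cursor type (0x91 or 0x92) described above."""
    raw = base64.urlsafe_b64decode(cursor + "=" * (-len(cursor) % 4))
    _, sep, payload = raw.partition(b"cursor:v2:")
    if not sep:
        raise ValueError("not a v2 cursor")
    return payload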

Why did the definition of dot (.) change between XPath 1.0 and 2.0?

When researching details for an answer to an XPath question here on Stack Overflow, I ran into a difference between XPath 1.0 and 2.0 for which I can find no rationale.
I tried to understand what . really means.
In XPath 1.0, . is an abbreviation for self::node(). Both self and node are crystal-clear to me.
In XPath 2.0, . is a primary expression, the "context item expression". The Abbreviated Syntax section explicitly states this in a note.
What was the rationale for the change? Is there a difference between . and self::node() in XPath 2.0?
From the spec itself, the intent of the change is not clear to me. I tried googling keywords like dot or period, primary expression, and rationale.
XPath 1.0 had four data types: string, number, boolean, and node-set. There was no way of handling collections of values other than nodes. This meant, for example, that there was no way of summing over derived values (if elements had attributes of the form price='$23.95', there was no way of summing over the numbers obtained by stripping off the $ sign because the result of such stripping would be a set of numbers, and there was no such data type).
So XPath 2.0 introduced more general sequences, and that meant that the facilities for manipulating sequences had to be generalised; for example if $X is a sequence of numbers, then $X[. > 0] filters the sequence to include only the positive numbers. But that only works if "." can refer to a number as well as to a node.
In short: self::node() filters out atomic items, while . does not. Atomic items (numbers, strings, and many other XML Schema types) are not nodes (unlike elements, attributes, comments, etc.).
Consider the example from the spec: (1 to 100)[. mod 5 eq 0]. If the . is replaced by self::node(), the expression is not valid XPath, because mod requires both arguments to be numeric and atomization does not help in this case.
For those scanning the spec: XPath 2.0 defines an item() type-matching construct, but it has nothing to do with node tests, as atomics are not nodes and axis steps always return just nodes. Therefore, dot cannot be defined as self::item(). It really needs to be a special language construct.
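A quick way to see this behaviour from Python, assuming the third-party elementpath package (an XPath 2.0 implementation; XPath 1.0-only engines such as lxml cannot evaluate this expression):

import xml.etree.ElementTree as ET
import elementpath  # third-party XPath 2.0 implementation (assumed installed)

root = ET.XML("<doc/>")  # dummy context; the expression never touches a node
print(elementpath.select(root, "(1 to 100)[. mod 5 eq 0]"))
# [5, 10, 15, ..., 100] -- '.' ranges over atomic values here;
# with self::node() in the predicate, evaluation fails instead,
# because an axis step cannot be applied to atomic items.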

Searching within text fields in CloudKit

How are people searching within a string (a substring) field using CloudKit?
For making predicates for use with CloudKit, from what I gather, you can only do BEGINSWITH and TOKENMATCHES, to search a single text field (by prefix) or all fields (exact token match) respectively. CONTAINS only works on collections, despite these examples. I can't determine a way to find, for example, roses in the string "Red roses are pretty".
I was thinking of making a tokenized version of certain string fields; for example the following fields on a hypothetical record:
description: 'Red roses are pretty'
descriptionTokenized: ['Red', 'roses', 'are', 'pretty']
Testing this out makes CONTAINS somewhat useful when searching for distinct, space-separated substrings, but it is still not as good as SQL LIKE would be.
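A sketch of the tokenization step (shown in Python for brevity; in practice this would run in the app before saving the CKRecord, and the field names are the hypothetical ones above):

import re

def tokenize(text: str) -> list[str]:
    """Lowercase and split a text field into word tokens, so an exact
    match against the token list behaves like a case-insensitive
    word search."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Populate the hypothetical descriptionTokenized field before saving:
record = {
    "description": "Red roses are pretty",
    "descriptionTokenized": tokenize("Red roses are pretty"),
}
assert "roses" in record["descriptionTokenized"]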

Can .proto files' fields start at zero?

.proto examples all seem to start numbering their fields at one.
e.g. https://developers.google.com/protocol-buffers/docs/proto#simple
message SearchRequest {
  required string query = 1;
  optional int32 page_number = 2;
  optional int32 result_per_page = 3;
}
If zero could be used, it would make some messages one or more bytes smaller (i.e. those with one or more fields numbered 16).
As the key is simply a varint encoding of (fieldnum << 3 | fieldtype), I can't immediately see why zero shouldn't be used; a sketch of the encoding follows below.
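For concreteness, a minimal Python sketch of that key encoding (my own illustration, not code from the protobuf sources) shows where the one-byte/two-byte boundary falls:

def encode_key(field_number: int, wire_type: int) -> bytes:
    """Varint-encode a protobuf field key: (field_number << 3) | wire_type."""
    key = (field_number << 3) | wire_type
    out = bytearray()
    while True:
        byte = key & 0x7F
        key >>= 7
        if key:
            out.append(byte | 0x80)  # more varint bytes follow
        else:
            out.append(byte)
            return bytes(out)

# Wire type 0 (varint): fields 1-15 fit the key in one byte,
# field 16 crosses the 7-bit boundary and needs two.
assert len(encode_key(15, 0)) == 1   # 15 << 3 = 120 < 128
assert len(encode_key(16, 0)) == 2   # 16 << 3 = 128 >= 128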
Is there a reason for not starting the field numbering at zero?
One very immediate reason is that zero field numbers are rejected by protoc:
test.proto:2:28: Field numbers must be positive integers.
As to why Protocol Buffers was designed this way, I can only guess. One nice consequence is that a message full of zeros will be detected as invalid. Zero can also be used internally to indicate "no field" as a return value in a Protocol Buffers implementation.
Assigning Tags
As you can see, each field in the message definition has a unique numbered tag. These tags are used to identify your fields in the message binary format, and should not be changed once your message type is in use. Note that tags with values in the range 1 through 15 take one byte to encode, including the identifying number and the field's type (you can find out more about this in Protocol Buffer Encoding). Tags in the range 16 through 2047 take two bytes. So you should reserve the tags 1 through 15 for very frequently occurring message elements. Remember to leave some room for frequently occurring elements that might be added in the future.
The smallest tag number you can specify is 1, and the largest is 2^29 - 1, or 536,870,911. You also cannot use the numbers 19000 through 19999 (FieldDescriptor::kFirstReservedNumber through FieldDescriptor::kLastReservedNumber), as they are reserved for the Protocol Buffers implementation - the protocol buffer compiler will complain if you use one of these reserved numbers in your .proto. Similarly, you cannot use any previously reserved tags.
https://developers.google.com/protocol-buffers/docs/proto
Just as the documentation says, 0 can't be used.

MAPI: Format of PR_SEARCH_KEY

Does anyone know the format of the MAPI property PR_SEARCH_KEY?
The online documentation has this to say about it:
The search key is formed by concatenating the address type (in uppercase characters), the colon character ':', the e-mail address in canonical form, and the terminating null character.
And the exchange document MS-OXOABK says this:
The PidTagSearchKey property of type PtypBinary is a binary value formed by concatenating the ASCII string "EX: " followed by the DN for the object converted to all upper case, followed by a zero byte value.
However, all the MAPI messages I've seen with this property have it as some sort of 16-byte binary sequence that looks like a GUID. Does anyone else have any more information about it? Is it always 16 bytes?
Thanks!
I believe that the property PR_SEARCH_KEY will be of different formats for different objects (as alluded to by Moishe).
A MAPI message object will have a unique value assigned to PR_SEARCH_KEY on creation; however, if the object is copied, this property value is copied also. I presume that when you reply to an e-mail, Exchange will assign the PR_SEARCH_KEY value to be the original message's value.
You will need to inspect each object type to understand how the PR_SEARCH_KEY is formed, but I doubt it's always 16 bytes for all MAPI types.
This USENET discussion is a good one, with Dmitry Streblechenko, an expert on Extended MAPI, involved.
The sentence before the ones you quoted from the online docs reads, "MAPI uses specific rules for constructing search keys for message recipients" which makes me think that it's talking about the PR_SEARCH_KEY property on MAPI_MAILUSER objects -- or at least not on MAPI_MESSAGE objects.
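For the recipient case the documented rule is at least mechanical; a minimal sketch, assuming an EX address type and a made-up distinguished name:

def recipient_search_key(addr_type: str, address: str) -> bytes:
    """Build a recipient PR_SEARCH_KEY per the documented rule:
    address type + ':' + address, uppercased, NUL-terminated.
    (MS-OXOABK uppercases the whole DN for EX addresses.)"""
    return f"{addr_type}:{address}".upper().encode("ascii") + b"\x00"

# With a made-up Exchange DN:
recipient_search_key("EX", "/o=Org/ou=Site/cn=Recipients/cn=jdoe")
# -> b'EX:/O=ORG/OU=SITE/CN=RECIPIENTS/CN=JDOE\x00'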
