Why does protobuf's FieldMask use field names instead of field numbers?

Why does protobuf's FieldMask use field names instead of field numbers? - protocol-buffers

In the docs for FieldMask the paths use the field names (e.g., foo.bar.buzz), which means renaming the message field names can result in a breaking change.
Why doesn't FieldMask use the field numbers to define the path?
Something like 1.3.1?

You may want to consider filing an issue on the GitHub protocolbuffers repo for a definitive answer from the code's authors.
Your proposal seems logical. Using names may be a historical artifact. There's a possibly relevant comment on an issue thread in that repo:
https://github.com/protocolbuffers/protobuf/issues/3793#issuecomment-339734117
"You are right that if you use FieldMasks then you can't safely rename fields. But for that matter, if you use the JSON format or text format then you have the same issue that field names are significant and can't be changed easily. Changing field names really only works if you use the binary format only and avoid FieldMasks."

The answer for your question lies in the fact FieldMasks are a convention/utility developed on top of the proto3 schema definition language, and not a feature of it (and that utility is not present in all of the language bindings)
While you’re right in your observation that it can break easily (as schemas tend evolve and change), you need to consider this design choice from a user friendliness POV:
If you’re building an API and want to allow the user to select the field set present inside the response payload (the common use case for field masks), it’ll be much more convenient for you to allow that using field paths, rather then binary fields indices, as the latter would force the user of the gRPC/protocol generated code to be “aware” of the schema. That’s not always the desired case when providing API as a code software packages.
While implementing this as a proto schema feature can allow the user to have the best of both worlds (specify field paths, have them encoded as binary indices) for binary encoding, it would also:
Complicate code generation requirements
Still be an issue for plain text encoding.
So, you can understand why it was left as an “external utility”.

Related

What is the essential difference between Document and Collectiction in YAML syntax?

Warning: This question is a more philosophical question than practical, but I find it well as to be asked and answered in practical contexts (forums like StackOverflow here, instead of the SoftwareEngineering stack-exchange website), due to the native development in the actual use de-facto of YAML and the way the way it's specification has evolved and features have been added to it over time. Let's ask:
As opposed to formats/languages/protocols such as JSON, the YAML format allows you (according to this link, that seems pretty official, or at least accurate and reliable source to understand the YAML specification) to embed multiple 'Documents' within one file/stream, using the three-dashes marking ("---").
If so, it's hard to ignore the fact that the concept/model/idea of 'Document' in YAML, is no longer an external definition, or "meta"-directive that helps the human/parser to organize multiple/distincted documents along each other (similar to the way file-systems defining the concept of "file" to organize different files, but each file in itself - does not necessarily recognize that it's a file, or that it's being part of a file system that wraps it, by definition, AFAIK.
However, when YAML allows for a multi-Document YAML files, that gather collections of Documents in a single YAML file (and perhaps in a way that is similar/analogous to HTTP Pipelining approach of HTTP protocol), the concept/model/idea/goal of Document receives a new, wider definition/character de-facto, as a part of the YAML grammar and it's produces, and not just of the YAML specification as an assistive concept or format description that helps to describe the specification.
If so, being a Document part of the language itself, what is the added value of this data-structure, compared to the existing, familiar and well-used good old data-structure of Collection (array of items)?
I'm asking it, because I've seen in this link (here) some snippet (in the second example), which describes a YAML sequence that is actually a collection of logs. For some reason, the author of the example, chose to prefer to present each log as a separate "Document" (separated with three-dashes), gathered together in the same YAML sequence/file, instead of writing a file that has a "Collection" of logs represented with the data-type of array. Why did he choose to do this? Is his choice fit, correct, ideal?
I can speculate that the added value of the distinction between a Document and a Collection become relevant when using more advanced features of the YAML grammar, such as Anchors, Tags, References. I guess every Document provide a guarantee that all these identifiers will be a unique set, and there is no collision or duplicates among them. Am I right? And if so, is this the only advantage, or maybe there are any more justifications for the existence of these two pretty-similar data structures?
My best for now, is to see Document as a "meta"-Collection, that is more strict, and lack of high-level logic, or as two different layers of collection schemes. Is it correct, accurate way of view?
And even if I am right, why in the above example (of the logs document from the link), when there's no use and not imply or expected to use duplications or collisions or even identifiers/anchors or compound structures at all - the author is still choosing to represent the collection's items as separate documents? Is this just not so successful selection of an example? Or maybe I'm missing something, and this is a redundancy in the specification, or an evolving syntactic-sugar due to practical needs?
Because the example was written on a website that looks serious with official information written by professionals who dealt with the essence of the language and its definition, theory and philosophy behind (as opposed to practical uses in the wild), and also in light of other provided examples I have seen in it and the added value of them being meticulous, I prefer not to assume that the example is just simply imperfect/meticulous/fit, and that there may be a good reason to choose to write it this way over another, in the specific case exampled.

First, let's look at the technical difference between the list of documents in a YAML stream and a YAML sequence (which is a collection of ordered items). For this, I'll discuss YAML tags, which are an advanced feature so I'll provide a quick overview:
YAML nodes can have tags, such as !!str (the official tag for string values) or !dice (a local tag that can be interpreted by your application but is unknown to others). This applies to all nodes: Scalars, mappings and sequences. Nodes that do not have such a tag set in the source will be assigned the non-specific tag ?, except for quoted scalars which get ! instead. These non-specific tags are later resolved to specific tags, thereby defining to which kind of data structure the node will be deserialized into.
YAML implementations in scripting languages, such as PyYAML, usually only implement resolution by looking at the node's value. For example, a scalar node containing true will become a boolean value, 42 will become an integer, and droggeljug will become a string.
YAML implementations for languages with static types, however, do this differently. For example, assume you deserialize your YAML into a Java class
public class Config {
String name;
int count;
}
Assume the YAML is
name: 42
count: five
The 42 will become a String despite the fact that it looks like a number. Likewise, five will generate an error because it is not a number; it won't be deserialized into a string. This means that not the content of the node defines how it will be deserialized, but the path to the node.
What does this have to do with documents? Well, the YAML spec says:
Resolving the tag of a node must only depend on the following three parameters: (1) the non-specific tag of the node, (2) the path leading from the root to the node and (3) the content (and hence the kind) of the node.)
So, the technical difference is: If you put your data into a single document with a collection at the top, the YAML processor is allowed to take into account the position of the data in the top-level collection when resolving a tag. However, when you put your data in different documents, the YAML processor must not depend on the position of the document in the YAML stream for resolving the tag.
What does this mean in practice? It means that YAML documents are structurally disjoint from one another. Whether a YAML document is valid or not must not depend on any preceeding or succeeding documents. Consequentially, even when deserialization runs into a semantic problem (such as with the five above) in one document, a following document may still be deserialized successfully.
The goal of this design is to be able to concatenate arbitrary YAML documents together without altering their semantics: A middleware component may, without understanding the semantics of the YAML documents, collect multiple streams together or split up a single stream. As long as they are syntactically correct, stream splitting and merging are sound operations that do not invalidate a YAML document even if another document is structurally invalid.
This design primary focuses on sending and receiving data over networks. Of course, nowadays, YAML is primarily used as configuration language. This is why this feature is seldom used and of rather little importance.
Edit: (Reply to comment)
What about end-cases like a string-tagged Document starts with a folded-string, making even its following "---" and "..." just a characters of the global string?
That is not the case, see rules l-bare-document and c-forbidden. A line containing un-indented ... not followed by non-whitespace will always end a document if one is open.
Moreover, ... doesn't do anything if no document is open. This ensures that a stream merger can always append ... to a document to ensure that the current document is closed, but no additional one is created.
--- has widely been adopted as separator between YAML documents (and, perhaps more prominently, between YAML front matter and content in tools like Jekyll) where ... would have been more appropriate, particularly in Jekyll. This gives the false impression that --- should be used by tooling to separate documents, when in reality ... is the syntactic element designed for that use-case.

Do (document) bundle entries always have to be referenced or referencing?

The specification for FHIR documents seems to mandate that all bundle entries in the document resource be part of the reference graph rooted at the Composition entry. That is, they should be the source or the target of a reference relation that traces all the way up to the root entry.
Unfortunately I have not been able to locate all the relevant passages in the FHIR specification; one place where it is spelled out is in 3.3.1 Document Content, but it is not really clear whether this pertains to all bundles of type 'document' (i.e. even those that happen to be bundles with type code 'document' but are merely collections of machine-processable data without any aspirations to represent a FHIRy document).
The problem with the referencedness requirement lies in the fact that the HAPI validator employs linear search for checking the references. So, if we have to ship N bundle entries full of data to a payor, we have to include a list with N references (one for each data-bearing bundle entry). That leads to N reference searches with O(N) effort during validation, which makes the reference checking complexity effectively quadratic in the number of entries.
This easily brings even the most powerful computers to their knees. Current size contraints effectively cap the number of entries per file at roughly 25000, and the HAPI validator needs several hours to chew through that, even on the most powerful CPUs currently available. Without the references, validation would take less than a minute for the same file.
In our use case, data-bearing entries have no identity outside of the containing bundle file. Practically speaking they would need neither entry.fullUrl nor entry.resource.id, because their business identifiers are contained in included base64 blobs. However, presence or absence of these identifiers has no practical influence on the time needed for validation (fractions of a second even for a 1 GB file), so who cares. It's the list of references that kills the HAPI validator.
Perhaps it would be possible to fulfil the letter of the referencedness requirement by making all entries include a reference to the Composition. The HAPI validator doesn't care either way, so I don't know whether that would be valid or not. But even if it were FHIRly valid, it would be a monstrously silly workaround.
Is there a way to ditch the referencedness requirement? Perhaps by changing the bundle type to something like 'collection', or by using contained resources?
P.S.: for the moment we are using a workaround that cuts the time for validation from hours to less than a minute, but it's a hack, and we currently don't have the resources to fix the HAPI validator. What I'm mostly concerned about is the question how the specifications (profiles) need to be changed in order to avoid the problem I described.

(i.e. even those that happen to be bundles with type code 'document' but are merely collections of machine-processable data without any aspirations to represent a FHIRy document)
If it is not a document, and not intended to be one, do not use the 'document' Bundle type. If you do, you would me misrepresenting the data which is what FHIR tries to avoid.
It seems like you want to send a collection of resources that are not necessarily related, so
Is there a way to ditch the referencedness requirement? Perhaps by changing the bundle type to something like 'collection'
Yes, I would use 'collection', or maybe a 'batch/transaction' depending on what I want to tell the receiver to do with the data.

The documents page says:
The document bundle SHALL include only:
The Composition resource, and any resources directly or indirectly (e.g. recursively) referenced from it
A Binary resource containing a stylesheet (as described below)
Provenance Resources that have a target of Composition or another resource included in the document
A document is a frozen set of content intended as an attested, human-readable, frozen set of content. If that's not what you need, then use a different Bundle type. However, if you do need the 'document' type, that doesn't mean that systems should necessarily validate all requirements at runtime

HL7 FHIR mark resources as anonymized

I am trying to map an existing domain into HL7 FHIR.
So far it was pretty easy to find FHIR resources that more or less represent the same data and can be used for that purpose. But now I am running into a problem of which I am not sure how to solve it.
The existing domain allows that data can be anonymized depending on the users access level. e.g. a patient's name or address might be removed and marked as anonymized. Other data will be pseudonymised, for example a the birthdate in 1980 will be replaced with 01.01.1980. An Age of 37 will be replaced with a category of 30-40.
So I am unsure how to integrate that into the FHIR domain. I was thinking I could create an extension holding a boolean, indicating if a value was anonymized or not and always replace or remove the original value. This might work, but I will run into big problems when the anonymized value is of a different type than the original value (e.g. Age is replaced by a range of values)
Is that even a valid approach? I thought this might be common problem, but I could not find any examples where people described methods of how to mark data as altered. Unfortunately the documentation at http://build.fhir.org/extensibility-registry.html does not contain anything that would help my case.

You can use security labels for this purpose (Resource.meta.security). Take a look at REDACTED and SUBSETTED in the security label value set: https://www.hl7.org/fhir/valueset-security-labels.html
If you need to convey a data type other than the one allowed by the resource (e.g. wanting to convey a range rather than a birthdate), you'd need to use an extension. (Note that dates are valid even if you only include the year.)

Can't figure out how to search LOINC using FHIR for a specific test by name?

Can anyone provide some insight on the required syntax to use to search LOINC using FHIR for a specific string in the labs descriptive text portion of an Observation resource?
Is this even possible?
The documentation is all over the place and I can't find an example for this generic kind of search.
I found similar examples here: https://www.hl7.org/fhir/2015Sep/valueset-operations.html
Such as: GET "[base]/ValueSet/23/$validate-code?system=http://loinc.org&code=1963-8&display=test"
But none of them are providing a general enough case to do a global search of the LOINC system for a specific string in an Observation resource.
None of my attempts to use the FHIR UI here, http://polaris.i3l.gatech.edu:8080/gt-fhir-webapp/search?serverId=gatechreadonly&resource=Observation , have been successful. I keep getting a 500 Internal Server Error because I don't know the correct syntax to use for the value part of the search, and I can't find any documentation out of all the copious documents online that explains this very simple concept.
Can anyone provide some insight?
Totally frustrated at this point.

Observation?code=12345-6
or
Observation?code=http://loinc.org|12345-6
where 12345-6 is whatever LOINC code you want to look for (e.g. 39802-4)
The second ensures you'll only match on LOINC codes as opposed to codes from other systems, though given the relatively unique format of LOINC codes, you're mostly safe without including that.
If you want to search for a set of codes, then you can separate the codes or the tuples with commas: E.g.
Observation?code=12345-6,12345-7
or
Observation?code=http://loinc.org|12345-6,http://loinc.org|123456
If you expect to search by a really long list of codes frequently, you can define a value set that includes all the desired codes and then filter by value set:
Observation?code:in=http://somwhere.org/whatever/ValueSet/123
Note: for readability, I haven't escaped the URL contents, but you'll need to escape the URL values appropriately.

Should JSON data contain formatted data?

When using JSON to populate a section of a page I often encounter that data needs special formatting - formatting that need to match that already on the page, which is done serverside.
A number might need to be formatted as a currency, a special date format or wrapped in for negative values.
But where should this formatting take place - doing it clientside will mean that I need to replicate all the formatting that takes place on the serverside. Doing it serverside and placing the formatted values in the JSON object means a less generic and reusable data set.
What is the recommended approach here?

The generic answer is to format data as late/as close to the user as is possible (or perhaps "practical" is a better term).
Irritatingly this means that its an "it depends" answer - and you've more or less already identified the compromise you're going to have to make i.e. do you remove flexibility/portability by formatting server side or do you potentially introduct duplication by doing it client side.
Personally I would tend towards client side unless there's a very good reason not to do so - simply because we're back to trying to format stuff as close to the user as possible, though I would be a bit concerned about making sure that I'm applying the right formatting rules in the browser.

JSON supports the following basic types:
Numbers,
Strings,
Boolean,
Arrays,
Objects
and Null (empty).
A currency is usually nothing else than a number, but formatted according to country-specific rules. And dates are not (yet) included in JSON at all.
Whatever is recommendable depends on what you do in your application and what kind of JScript libraries you are already using. If you are already formatting alot of data in your server side code, then add it there. If not, and you already have some classes included, which can cope with formatting (JQuery and MooTools have some capabilities), do it in the browser.
So either format them in the client or format them before sending them over - both solutions work.
If you want to delve deeper into this, i recommend this wikipedia article about JSON.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio