How to retrieve protobuf data rendomly? - protocol-buffers

I want to store large amount of data in a protobuf format in which include time-stamp parameter. And I want to retrieve the data based on the time-stamp value.
Thanks.

Protobuf is a sequential-access format. There's no way to jump into the middle of a message looking for data; you have to parse through the whole thing.
Some options:
Devise a framing format that allows you to break up your datastore into many small chunks, each of which is a separate protobuf message. This is a fairly large project.
Use SQLite or even an actual database.
Use a random-access-fieldly format like Cap'n Proto instead. (Disclosure: I'm the author of Cap'n Proto, and also of Protobufs v2 (Google's open source release).)

Related

Validating protobuf using json schema

Are there any examples / references to see how protobuf data can be validated using json schema?
Apologies if I'm starting off with something too basic...
Protobuf data can be validated using protobuf deserialisers; if data is parsed by the parser that was generated for the message (and is part of the class representing that message), then it's valid data. To generate that parser / class, you'd have started with a protobuf schema and compiled that with protoc.
Generally speaking, I'd say that wanting to validate such data against a json schema is possibly not a good idea. The point is that, to also have a json schema for the same data is to then have "two versions of the truth", which is generally a bad idea. Which one is right; the .proto schema, or the json schema? If I edit one, have I accurately edited the other?
JSON Can Do More Than Protobuf
I can see why you may want to check such data against a json schema. In a json schema you can define things like value and size constraints that cannot be expressed in a protobuf schema. For example, a message field "bearing" might in the application have a limited valid value between 0 and 359. There is no way to implement such a constraint in protobuf, but if expressed in a json schema used to validate json data, the validator would object if "bearing" were set to 412.
So, why not generate code from the json schema? I have tried (some time ago - I'm out of date) code generators for languages like C# using json schema as input, but found the result unsatisfactory (the code generators I tried didn't want to implement all the things in my json schema, e.g. unions). Things may have got a lot better since then.
Is there a Better Solution?
If this is indeed the kind of thing you need to do, then it's likely that choosing protobuf is not ideal for the purpose (due to the lack of constraints in protobuf schema). The question then is, what are the alternatives?
In my experience, if you want to stick to the concept of starting with a schema and generating code, the best I've ever used is ASN.1 (where "best" assumes you're willing to pay for good commercial ASN.1 tools from companies like Objective Systems or Nokalva - I've been a customer of both).
These days, ASN.1 can even serialise to json (or xml in several flavours, or other text and packed / unpacked binary data formats). The ASN.1 schema language does have constraints on sizes of lists and/or values of fields. There is an official translation between ASN.1 schema and XML schema (XSD), with the better ASN.1 tools able to do that translation. There may now be an defined translation between ASN.1 and json schema too (I don't know), plus tools to do that.
The point of that is, with translation tools, one can then say that the ASN.1 schema and XSD (or json) schema are "one single truth" - one being automatically generated from the other which was hand written.
A Good Halfway Hosue?
I notice (from a quick search) that there are various git* projects purporting to translate between protobuf and json schema, which if satisfactory means that your json and protocol buffer schema can be automatically translated between one and the other (which means that my 2nd para above is junk!).
Unless something has happened recently, those json protobuf schema translations are going to be limited, or disappointing. ASN.1, XSD and json schema are broadly similar in terms of what their syntaxes allow to be expressed (including size and value constraints), so translation between them doesn't necessarily lose "information". However, the syntax of protobuf schema is a lot more limited than that of json schema, so a translation from json schema to protobuf might lose the very information that you want.
The good news though would be that the protobuf schema would still be a "form of the truth" having been translated from the json schema. If you were using protobuf to generate json data instead of protobuf binary format data, the "original form of the truth" (the json schema) can be used to validate the protobuf generated json, with constraints on value and size still intact. That would be a good result!
Good luck!

Differences between .proto, .pb and .pbtxt

As explained in their website and Wikipedia, the Protocol Buffers (or Protobuf) is "used to serialize structured data". The definition of the data structure is done in a .proto file that can be compiled by protoc and turned into code (.cc/.h, .py, .java...) that can be imported to several languages to manipulate and serialize the data.
My understanding is that the .pb files contain that data in binary and the .pbtxt are an equivalent that contain it in ascii. Is that correct?
If so, why are .pbtxt so readable? I've found some with commentaries (https://github.com/google/mediapipe/blob/master/mediapipe/graphs/hand_tracking/subgraphs/hand_renderer_cpu.pbtxt).
Also, are .pb/.pbtxt enough to interpret the data? Or do you need their .proto?

How secure the protobuf is to get some of the data out?

Without any encryption, if the recipient has the serialized Protobuf file but does not have the generated Protobuf class (they don't have access to the .proto file that define its structure), is it possible for them to get any data in the Protobuf file from the binary?
If they have access to a part of the .proto file (for example, just one related message in the file) can they get a part of that data out from the entire file while skipping other unknown parts?
yes, absolutely; the protoc tool can help with this (see: --decode_raw), as can https://protogen.marcgravell.com/decode - so it should not be treated as "secure" at all
yes, absolutely - that's a key part built into the protocol that allows messages to be extensible such that they can decode the bits they understand and either ignore or just store (for round-trip or "extension" fields) the bits they don't understand
protobuf is not a security device; to someone with the right tools it is just as readable as xml or json, with the slight issue that it can be uncertain how to interpret some values; but: you can infer and guess and reverse engineer
Ok, I have found this page https://developers.google.com/protocol-buffers/docs/encoding
The message discards all the names and is just a pair of key number and values. The generated class might offer some protection for safely reading these data and could not read unknown data. (Sure enough because the generated class was generated from known structure, .proto file)
But if I am an attacker I could reference that Encoding page and try to figure out which area in the binary corresponds to which data. For example, varint might be easy to spot after changing some data. And proceed to write my own .proto file to attack this unknown data or even a custom binary reader that can selectively read part of the binary.

Searching through the protocol buffer file

I'm new to protocol buffers and I was wondering whether it was possible to search a protocol buffers binary file and read the data in a structured format. For example if a message in my .proto file has 4 fields I would like to serialize the message and write multiple messages into a file and then search for a particular field in the file. If I find the field I would like to read back the message in the same structured format as it was written. Is this possible with protocol buffers ? If possible any sample code or examples would be very helpful. Thank you
You should treat protobuf library as one serialization protocol, not an all-in-one library which supports complex operations (such as querying, indexing, picking up particular data). Google has various libraries on top of open-sourced portion of protobuf to do so, but they are not released as open source, as they are tied with their unique infrastructure. That being said, what you want is certainly possible, yet you need to write some code.
Anyhow, some of your requirements are:
one file contains various serialized binaries.
search a particular field in each serialized binary and extract that chunk.
There are several ways to achieve them.
The most popular way for serial read/write is that the file contains a series of [size, type, serialization output]. That is, one serialized output is always prefixed by size and type (either 4/8 byte or variable-length) to help reading and parsing. So you just repeat this procedure: 1) read size and type, 2) read binary with given size, 3) parse with given type 4) goto 1). If you use union type or one file shares same type, you may skip type. You cannot drop size, as there is no way know the end of output by itself. If you want random read/write, other type of data structure is necessary.
'search field' in binary file is more tricky. One way is to read/parse output one by one and to check the existance of field by HasField(). It is most obvious and slow yet straightforward way to do so. If you want to search field by number (say, you want to search 'optional string email = 3;'), thus search by binary blob (like 0x1A, field number 3, wire type 2), it is not possible. In a serialized binary stream, field information is saved merely a number. Without an exact context (.proto scheme or binary file's structure), the number alone doesn't mean anything. There is no guarantee that 0x1A is from field information, or field information from other message type, or actually number 26, or part of other number, etc. That is, you need to maintain the information by yourself. You may create another file or database with necessary information to fetch particular message (like the location of serialization output with given field).
Long story short, what you ask is beyond what open-sourced protobuf library itself does, yet you can write them with your requirements.
I hope, this is what you are looking for:
http://temk.github.io/protobuf-utils/
This is a command line utility for searching within protobuf file.

How to determine if a string is a protocol buffer object or not

I have a location that stores plain string data right now and I want to store at the same location a protobuf object in the future.
Is there a way for my new code to read the old data (plain string) and reliably determine: "this is not a protobuf object"?
Given that I can't reliably determine the format of the current plain string data (e.g. hostnames) from other formats (it's possible that some protocol buffer objects resemble hostnames, which could be anything) either, it doesn't seem feasible right?
It seems that you haven't yet stored any protocol buffers data in the location. How about marking the new data with some specific prefix?
For example, generate a GUID or just some random 64-bit value, and prepend it to the data whenever you store a protobuf message. On reading, just check if this marker is present.
By varying the length of the marker, you can make any collisions with existing data arbitrarily improbable.

Resources