Design of the Protobuf binary format: performance and varint - protocol-buffers

I need to design a binary format to save data from a scientific application. This data has to be encoded in a binary format that can't be easily read by any other application (it is a requirement by some of our clients). As a consequence, we decided to build our own binary format, its encoder and its decoder.
We got some inspiration from many binary formats, including protobuf. One thing that puzzles me is the way protobuf encodes the length of embedded messages. According to https://developers.google.com/protocol-buffers/docs/encoding, the size of an embedded message is encoded at its very beginning as a varint.
But before we encode an embedded message, we don't yet know its size (think, for instance, of an embedded message that contains many integers encoded as varints). As a consequence, we need to encode the message entirely before we write it to disk, so that we know its size.
Imagine that this message is huge. As a consequence, it is very difficult to encode it in an efficient way. We could encode this size as a full int and seek back to this part of the file once the embedded message is written, but we lose the nice property of varints: you don't need to specify whether you have a 32-bit or a 64-bit integer. So going back to Google's implementation using a varint:
Is there an implementation trick I am missing, or is this scheme likely to be inefficient for large messages?

Yes, the correct way to do this is to write the message first, at the back of the buffer, and then prepend the size. With proper buffer management, you can write the message in reverse.
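As a rough illustration of the idea, here is a minimal Python sketch that serializes the inner payload into memory first and only then writes the varint size in front of it. It is a simplified variant (it buffers the payload rather than writing in reverse), and the names `encode_varint` and `write_embedded`, the field number, and the stand-in payload are assumptions made for the example, not part of any protobuf library.

```python
import io

def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def write_embedded(stream, field_number: int, payload: bytes) -> None:
    """Write a length-delimited field: tag, varint length, then the payload."""
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    stream.write(encode_varint(tag))
    stream.write(encode_varint(len(payload)))
    stream.write(payload)

# Serialize the inner data into memory first, so its size is known.
# (Stand-in payload: three bare varints, not a complete protobuf message.)
inner = b"".join(encode_varint(n) for n in (1, 150, 300_000))

out = io.BytesIO()
write_embedded(out, field_number=3, payload=inner)
encoded = out.getvalue()
```

The cost is that the whole embedded message must fit in memory before its prefix is written, which is the trade-off the question is asking about.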
That said, why write your own message format when you can just use protobuf? It would be better to use Protobuf directly and encrypt the file. That would be easy for you to use, and still be hard for other applications to read.

Related

Is Gzip compressed binary data or uncompressed text safe to transmit over https, or should it be base 64 encoded as the final step before sending it?

My question is in the title, this provides context to help you understand my confusion. Everything is sent over https.
My understanding of base 64 encoding is that it is a way of representing binary data as text, such that the text is safe to transmit across networks or the internet because it avoids anything that might be interpreted as a control code by the various possible protocols that might be involved at some point.
Given this understanding, I am confused why everything sent over the internet is not base 64 encoded. When is it safe not to base 64 encode something before sending it? I understand that not everything understands or expects to receive things in base 64, but my question is why doesn't everything expect and work with this if it is the only way to send data without the possibility it could be interpreted as control codes?
I am designing an Android app and server API such that the app can use the API to send data to the server. There are some potentially large SQLite database files the client will be sending to the server (I know this sounds strange, yes it needs to send the entire database files). They are being gzipped prior to uploading. I know there is also a header that can be used to indicate this: Content-Encoding: gzip. Would it be safe to compress the data and send it with this header without base 64 encoding it? If not, why does such a header exist if it is not safe to use? I mean, if you base 64 encode it first and then compress it, you undo the point of base 64 encoding and it is not at that point base 64 encoded. If you compress it first and then base 64 encode it, that header would no longer be valid as it is not in the compressed format at that point. We actually don't want to use the header because we want to save the files in a compressed state, and using the header will cause the server to decompress it prior to our API code running. I'm only asking this to further clarify why I am confused about whether it is safe to send gzip compressed data without base 64 encoding it.
My best guess is that it depends on whether what you are sending is binary data or not. If you are sending binary data, it should be base 64 encoded as the final step before uploading it. But if you are sending text data, you may not need to do this. However, it seems to me that this might still depend on the character encoding used. Perhaps some character encodings can result in sending data that could be interpreted as a control code? If this is true, which character encodings are safe to send without base 64 encoding them as the final step prior to sending? If I am correct about this, it implies you should only use that gzip header if you are sending compressed text that has not been base 64 encoded. Does compressing it create the possibility of something that could be interpreted as a control code?
I realize this was rather long, so I will repeat my primary questions (the title) here: Is either Gzip compressed binary data or uncompressed text safe to transmit, or should it be base 64 encoded as the final step before sending it? Okay I lied there is one more question involved in this. Would sending gzip compressed text always be safe to send without base 64 encoding it at the end, no matter which character encoding it had prior to compression?
My understanding of base 64 encoding is that it is a way of representing binary data as text,
Specifically, as text consisting of characters drawn from a 64-character set, plus a couple of additional characters serving special purposes.
such that the text is safe to transmit across networks or the internet because it avoids anything that might be interpreted as a control code by the various possible protocols that might be involved at some point.
That's a bit of an overstatement. For two endpoints to communicate with each other, they need to agree on one protocol. If another protocol becomes involved along the way, then it is the responsibility of the endpoints for that transmission to handle any needed encoding considerations for it.
What bytes and byte combinations can successfully be conveyed is a matter of the protocol in use, and there are plenty that handle binary data just fine.
At one time there was also an issue that some networks were not 8-bit clean, so that bytes with numeric values greater than 127 could not be conveyed across those networks, but that is not a practical concern today.
Given this understanding, I am confused why everything sent over the internet is not base 64 encoded.
Given that the understanding you expressed is seriously flawed, it is not surprising that you are confused.
When is it safe not to base 64 encode something before sending it?
It is not only safe but essential to avoid base 64 encoding when the recipient of the transmission expects something different. The two or more parties to a given transmission must agree about the protocol to be used. That establishes the acceptable parameters of the communication. Although Base 64 is an available option for part or all of a message, it is by no means the only one, nor is it necessarily the best one for binary data, much less for data that are textual to begin with.
I understand that not everything understands or expects to receive things in base 64, but my question is why doesn't everything expect and work with this if it is the only way to send data without the possibility it could be interpreted as control codes?
Because it is not by any means the only way to avoid data being misinterpreted.
They are being gzipped prior to uploading. I know there is also a header that can be used to indicate this: Content-Encoding: gzip. Would it be safe to compress the data and send it with this header without base 64 encoding it?
It would be expected to transfer such data without base-64 encoding it. HTTP(S) handles binary data just fine. The Content-Encoding header tells the recipient how to interpret the message body, and if it specifies a binary content coding (such as gzip) then binary data conforming to that coding are what the recipient will expect.
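To make this concrete, here is a minimal Python sketch that builds (but does not send) an HTTP request whose body is raw gzip output. The endpoint URL and the stand-in payload are hypothetical; the point is only that the body is passed as plain bytes, with no base-64 step.

```python
import gzip
import urllib.request

# Stand-in for a database file; deliberately contains every possible byte value.
payload = bytes(range(256)) * 64
body = gzip.compress(payload)

req = urllib.request.Request(
    "https://api.example.com/upload",  # hypothetical endpoint
    data=body,                         # raw binary body, no base-64 encoding
    headers={
        "Content-Type": "application/octet-stream",
        "Content-Encoding": "gzip",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted because the endpoint is made up.

# The receiver reverses the coding to recover the original bytes.
restored = gzip.decompress(req.data)
```

If the server should store the file still compressed, one option is to drop the Content-Encoding header and treat the gzip file itself as the payload; the body remains plain binary either way.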
My best guess is that it depends on if what you are sending is binary data or not.
No. These days, for all practical intents and purposes, it depends only on what application-layer protocol you are using for the transmission. If it specifies that some or all of the message is to be base-64 encoded (according to a particular base-64 scheme, as there are more than one) then that's what the sender must do and how the receiver will interpret the message. If the protocol does not specify that, then the sender must not perform base-64 encoding. Some protocols afford the sender the option to make this choice, but those also provide a way for the sender to indicate inside the transmission what choice has been made.
Is either Gzip compressed binary data or uncompressed text safe to transmit, or should it be base 64 encoded as the final step before sending it?
Neither is inherently unsafe to transmit on today's networks. Whether data are base-64 encoded for transmission is a question of agreement between sender and receiver.
Okay I lied there is one more question involved in this. Would sending gzip compressed text always be safe to send without base 64 encoding it at the end, no matter which character encoding it had prior to compression?
The character encoding of the uncompressed text is not a factor in whether a gzipped version can be safely and successfully conveyed. But it probably matters for the receiver or anyone to whom they forward that data to understand the uncompressed text correctly. If you intend to accommodate multiple character encodings then you will want to provide a way to indicate which applies to each text.

How can I use Protocol Buffers for structuring a third party serial Communication Protocol?

I am working on an ECG module that outputs its data as bytes. There's a protocol document explaining how the packets coming out of the module are structured.
I want to decode that data.
I am not sure whether Protocol Buffers will help with this or not. Are there any other methods that would be helpful for decoding the data and implementing that protocol in Python?
Protocol Buffers only works with its own encoding format.
For decoding a manufacturer-specific custom binary format, I would suggest Python's built-in struct module. Here is a basic tutorial which is easy to get started with.
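As a minimal sketch of the struct approach, assume a made-up packet layout: a sync byte 0xA5, a one-byte sample count, a 16-bit little-endian sequence number, then that many signed 16-bit samples. The real layout has to come from the module's protocol document; every field here is hypothetical.

```python
import struct

# Hypothetical header: sync byte, sample count, sequence number (little-endian).
HEADER = struct.Struct("<BBH")

def parse_packet(packet: bytes) -> dict:
    """Decode one packet of the assumed layout into a dictionary."""
    sync, count, seq = HEADER.unpack_from(packet, 0)
    if sync != 0xA5:
        raise ValueError("bad sync byte")
    # 'count' signed 16-bit samples follow the header.
    samples = struct.unpack_from(f"<{count}h", packet, HEADER.size)
    return {"sequence": seq, "samples": list(samples)}

# Build a sample packet with the same layout and decode it again.
packet = struct.pack("<BBH3h", 0xA5, 3, 7, -12, 0, 512)
parsed = parse_packet(packet)
```

The `<` prefix selects little-endian with no padding; use `>` instead if the document specifies big-endian fields.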
If the manufacturer-specific format is text based instead, you can either use basic string manipulation to split it into tokens, or one of the parsing libraries for Python.

Little endian packet treated as big endian by dpkt

I am using dpkt to parse some ieee80211 packets.
I see that the ieee80211 object created has wrong values.
Digging deeper I found that the ieee80211 treats the data as big endian while in practice the packets I am providing it are little endian.
Is there a way to detect the endianness of the packet in runtime so I could maybe change it to big endian before providing it to dpkt.ieee80211?
There shouldn't be anything to detect or guess. IEEE 802.11 is a standard protocol, and its specification states the correct endianness for each and every part of a frame. If the endianness is reversed, then the frame is malformed. You can grab the latest copy of the standard here.
Looking over the 3500+ page pdf (thank god for ctrl+f), it seems that most values are big-endian, just like in TCP/IP. But apparently, little-endian is used here and there. For instance, in some TKIP fields. Frankly, that's a bit surprising.
You haven't mentioned the frame/field you're trying to create/decode, so it's hard to say anything more specific than to look it up.
The only way you're going to be able to detect endianness when you don't know one way or the other would be to inject a payload and have that parsed the same way.
You can then check for endianness by checking the identity of the payload you injected.
It turns out that for IEEE80211 under CAPWAP the frame control bytes are simply swapped.
It is probably an an-initial-mistake-gone-de-facto-standard case.
See answer in Wireshark Q&A
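If the only difference really is the swapped frame control bytes, a small pre-processing step before parsing may be enough. The helper below is a hypothetical sketch (the name `swap_frame_control` and the sample bytes are made up for illustration); it swaps the first two bytes and leaves the rest of the frame untouched.

```python
def swap_frame_control(frame: bytes) -> bytes:
    """Return a copy of the frame with its first two bytes exchanged."""
    if len(frame) < 2:
        raise ValueError("frame too short to contain a frame control field")
    return frame[1:2] + frame[0:1] + frame[2:]

# Made-up example bytes: frame control arrives as 00 08 instead of 08 00.
raw = bytes.fromhex("0008") + b"rest-of-frame"
fixed = swap_frame_control(raw)
# 'fixed' could then be handed to the parser, e.g. dpkt.ieee80211.IEEE80211(fixed).
```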

Saving a PNG to a Redis Server

I'm trying to save a png generated by Canvas2Image to a Redis server and then display it again as an image.
I can't think of any way to do this and by searching Google I can't find any solution. Does anybody know how to do this?
This is for a website I'm making where anybody can draw on a canvas in real time.
Redis has a binary-safe protocol, and most standard instructions are fine with arbitrary binary data as both keys and values. There is no need to base-64 (or otherwise) encode, as long as your library supports the binary-safe aspect. For example, with StackExchange.Redis (for .NET) you can pass a byte[] as the value to StringSet, and the result of StringGet can be cast to a byte[].
Then the only question becomes: how to get the binary of the png; but that should just be standard IO.
It's possible to encode a PNG as a base64 byte encoded string. Redis can then store the string like any other string.
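For the base64 route, a minimal Python sketch of the round trip looks like this. The bytes below are the PNG signature followed by filler, not a real image, and the Redis call itself is left out because it depends on the client library in use.

```python
import base64

# PNG signature plus stand-in data; a real image would be read from a file or upload.
png_bytes = b"\x89PNG\r\n\x1a\n" + bytes(range(256))

# Encode to an ASCII string that can be stored like any other string value.
as_text = base64.b64encode(png_bytes).decode("ascii")

# Decode again when the image needs to be served.
round_tripped = base64.b64decode(as_text)

# Base64 emits 4 characters for every 3 input bytes.
overhead = len(as_text) / len(png_bytes)
```

The encoded form is about a third larger than the raw bytes, which is one reason to prefer storing the binary directly when the client library allows it.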
If you'd like users to be able to draw in real time on the same image, it might be more effective to maintain the image as an SVG and share the image via client to client web sockets.

How does the clipboard work in Windows?

What is its data structure? Is it XML-based? How can it distinguish between different content types, for example text, image, files etc.?
It is a system-wide bit bucket, which means it just holds a sequence of bytes and an integer value (a Windows atom) which describes its format, but it does not ensure that the byte sequence really is in this format.
The only feature other than this is that an application can decide whether it wants the system to store the byte sequence, or whether the application keeps the data itself and only provides it when someone requests it.
So as you see it is an API and not a data structure.
