What is marshal and unmarshal in Golang Proto?

I understand that we need to marshal an object to serialise it, and unmarshal it to deserialise it. However, my question is: when do we invoke marshal and unmarshal, and why do we serialise objects if we are going to deserialise them again soon?
PS: I just started learning Go and Proto, so I would really appreciate your help. Thank you.

Good question!
Marshaling (also called serializing) converts a struct to raw bytes. Usually, you do this when you're sending data to something outside your program: you might be writing to a file or sending the struct in an HTTP request body.
Unmarshaling (also called deserializing) converts raw bytes to a struct. You do this when you're accepting data from outside your program: you might be reading from a file or an HTTP response body.
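For a concrete picture, here is a minimal round trip in Go using the standard library's encoding/json (the User type here is just an example). With protobuf the flow is the same, except you call proto.Marshal and proto.Unmarshal on a message type generated from your .proto schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type User struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
}

func main() {
	// Marshal: struct in memory -> raw bytes you can write to a file,
	// an HTTP body, a message queue, and so on.
	out, err := json.Marshal(User{Name: "Ada", Age: 36})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // {"name":"Ada","age":36}

	// Unmarshal: raw bytes received from outside -> struct in memory.
	var u User
	if err := json.Unmarshal(out, &u); err != nil {
		panic(err)
	}
	fmt.Println(u.Name, u.Age) // Ada 36
}
```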
In both situations, the program sending the data has it in memory as a struct. We have to marshal and unmarshal because the recipient program can't just reach into the sender's memory and read the struct. Why?
Often the two programs are on different computers, so they can't access each other's memory directly.
Even for two processes on the same computer, shared memory is complex and usually dangerous. What happens if you're halfway through overwriting the data when the other program reads it? What if your struct includes sensitive data (like a decrypted password)?
Even if you could share memory, the two programs would need to interpret the data in the same way. Different languages, and even different versions of Go, can represent the same struct differently in memory. For example, do you store an integer's most-significant bit first or last? Depending on their answers, one program might interpret the bits 001 as the integer 1, while the other might interpret the same memory as the integer 4.
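The byte-level version of this problem, endianness, is easy to demonstrate with Go's standard encoding/binary package: the same four bytes decode to very different integers depending on the byte order the reader assumes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// The same four bytes, interpreted with two different byte orders:
	b := []byte{0x00, 0x00, 0x00, 0x01}
	fmt.Println(binary.BigEndian.Uint32(b))    // 1
	fmt.Println(binary.LittleEndian.Uint32(b)) // 16777216
}
```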
Nowadays, we're usually marshaling to and unmarshaling from some standardized format, like JSON, CSV, or YAML. (Back in the day, this wasn't always true - you'd often convert your struct to and from bytes by hand, using whatever logic made sense to you at the time.) Protobuf schemas make it easier to generate marshaling and unmarshaling code that's CPU- and space-efficient (compared to CSV or JSON), strongly typed, and works consistently across programming languages.

Related

Design of the Protobuf binary format: performance and varint

I need to design a binary format to save data from a scientific application. This data has to be encoded in a binary format that can't be easily read by any other application (it is a requirement by some of our clients). As a consequence, we decided to build our own binary format, its encoder and its decoder.
We got some inspiration from many binary formats, including protobuf. One thing that puzzles me is the way protobuf encodes the length of embedded messages. According to https://developers.google.com/protocol-buffers/docs/encoding, the size of an embedded message is encoded at its very beginning as a varint.
But before we encode an embedded message, we don't yet know its size (think, for instance, of an embedded message that contains many integers encoded as varints). As a consequence, we need to encode the message entirely before we write it to disk, so that we know its size.
Imagine that this message is huge. It then becomes very difficult to encode it efficiently. We could encode the size as a fixed-width int and seek back to that part of the file once the embedded message is written, but we would lose the nice property of varints: you don't need to specify whether you have a 32-bit or a 64-bit integer. So, going back to Google's implementation using a varint:
Is there an implementation trick I am missing, or is this scheme likely to be inefficient for large messages?
Yes, the correct way to do this is to write the message first, at the back of the buffer, and then prepend the size. With proper buffer management, you can write the message in reverse.
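For illustration, here is a minimal Go sketch of the buffer-then-prefix approach, using the standard encoding/binary varint helpers (encodeEmbedded is a hypothetical helper, and the inner bytes are the hand-encoded protobuf field from the encoding docs):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeEmbedded length-prefixes an already-encoded embedded message
// with a varint, the way the protobuf wire format does.
func encodeEmbedded(dst, msg []byte) []byte {
	dst = binary.AppendUvarint(dst, uint64(len(msg))) // varint length prefix
	return append(dst, msg...)                        // then the message bytes
}

func main() {
	inner := []byte{0x08, 0x96, 0x01} // field 1 = varint 150
	out := encodeEmbedded(nil, inner)
	fmt.Printf("% x\n", out) // 03 08 96 01
}
```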
That said, why write your own message format at all? It would be simpler to use protobuf directly and encrypt the resulting file. That would be easy for you to use, and still hard for other applications to read.

JMS: Delivering a String or an Object as the payload, which is faster?

I have a simple Person object that contains some basic information about a person. If I want to send it by JMS:
- I can convert this object into JSON, then deliver it as a String.
- I can use the Person object as the payload directly.
I'm using ActiveMQ as the JMS provider. Which way is faster?
And what if I need to send a Map or List as the payload?
This is mostly about the performance of serialization, not about JMS/ActiveMQ itself. An ObjectMessage is a binary blob on the wire that uses Java serialization; with a String message, you can choose whatever serialization library you want.
This article with runnable benchmarks shows that JSON serialization can be as fast as Java object serialization. Although the article is obviously biased, you can see that Jackson/JSON serialization and Java serialization are pretty close in terms of performance.
You can also measure it yourself, with your own kind of data. Either way, it's likely a micro-optimization. If serialization speed truly matters that much, see whether you can reduce the size or number of the objects you send instead.
As a final note, if you deal with very large payloads, their size (and therefore the transport time) will affect performance. In that case, you may want to make sure your JSON is not indented, and possibly compress it as well.
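That advice is language-agnostic. As a sketch of what it looks like in Go (the Person type is illustrative): json.Marshal already emits compact output, and a gzip writer handles the compression.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
)

type Person struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
}

func main() {
	// A reasonably large payload, so compression has something to work with.
	people := make([]Person, 1000)
	for i := range people {
		people[i] = Person{Name: "Ada Lovelace", Age: 36}
	}

	// json.Marshal produces compact (unindented) output; json.MarshalIndent
	// is the variant to avoid for wire payloads.
	raw, err := json.Marshal(people)
	if err != nil {
		panic(err)
	}

	// Compress before handing the bytes to the transport.
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		panic(err)
	}
	zw.Close()

	fmt.Printf("plain: %d bytes, gzipped: %d bytes\n", len(raw), buf.Len())
}
```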

GobEncoder for Passing Anonymous Function via RPC

I'm trying to build a system that will execute a function on multiple machines, passing the function anonymously via RPC to each worker machine (a la MapReduce) to execute on some subset of data. Gob doesn't support encoding functions, though the docs for GobEncoder say that "A type that implements GobEncoder and GobDecoder has complete control over the representation of its data and may therefore contain things such as private fields, channels, and functions, which are not usually transmissible in gob streams" so it seems possible.
Any examples of how this might work? I don't know much about how this encoding/decoding should be done with Gob.
IMHO this won't work. While it is true that a type implementing GobEncoder and GobDecoder can (de)serialize unexported fields of structs, it is still impossible to (de)serialize code: Go is statically compiled and linked, with no runtime code generation capabilities (which would circumvent compile-time type safety).
In short: you cannot serialize functions, only data. Your worker binaries must already contain the functions you want to execute; the RPC can then just name which one to run (see the sketch below). Take a look at net/rpc (which uses encoding/gob by default).
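A common workaround is exactly that dispatch-by-name pattern. Here is a minimal sketch in Go; the tasks map and runTask are hypothetical names, and in a real system runTask would be invoked from a net/rpc handler after decoding the request:

```go
package main

import "fmt"

// Each worker registers the functions it knows how to run under a name.
// The coordinator sends only the name plus the arguments over RPC.
var tasks = map[string]func(data []int) int{
	"sum": func(data []int) int {
		total := 0
		for _, v := range data {
			total += v
		}
		return total
	},
}

// runTask is what an RPC handler on the worker would call after
// decoding the task name and arguments from the request.
func runTask(name string, data []int) (int, error) {
	fn, ok := tasks[name]
	if !ok {
		return 0, fmt.Errorf("unknown task %q", name)
	}
	return fn(data), nil
}

func main() {
	result, err := runTask("sum", []int{1, 2, 3})
	fmt.Println(result, err) // 6 <nil>
}
```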
You may want to try GoCircuit, which provides a framework that basically lets you do this:
http://www.gocircuit.org/
It works by copying your binary to the remote machine(s), starting it, then doing an RPC that effectively says "execute function X with args A, B, ..."

Is 'handle' synonymous with 'pointer' in WinAPI?

I've been reading some books on Windows programming in C++ lately, and I have some confusion about a few recurring concepts in WinAPI. For example, there are tons of data types that start with the handle prefix 'H'. Are these supposed to be used like pointers? But then there are other data types that start with the pointer prefix 'P', so I guess not. Then what is a handle, exactly? And why were pointers to some data types given separate type names in the first place? For example, PCHAR could easily have been designed as CHAR*.
Handles used to be pointers in early versions of Windows, but they are not anymore. Think of a handle as a "cookie": a unique value that allows Windows to find a resource that was allocated earlier. For example, CreateFile() returns a new handle; you later pass it to SetFilePointer() and ReadFile() to read data from that same file, and finally to CloseHandle() to clean up the internal data structure, closing the file as well. That is the general pattern: one API function to create the resource, one or more to use it, and one to destroy it.
Yes, the types that start with P are pointer types. And yes, they are superfluous; it works just as well if you write the * yourself. I'm not actually sure why C programmers like to declare them; I personally think they reduce code readability, and I always avoid them. But do note the compound types, like LPCWSTR, a "long pointer to a constant wide string". The L doesn't mean anything anymore; it dates back to the 16-bit versions of Windows. But pointer, const, and wide are important. I do use that typedef; not doing so risks future portability problems, which is the core reason these typedefs exist.
A handle is the same as a pointer only insofar as both identify a particular item. A pointer is the address of the item, so if you know its structure you can start reading fields from it. A handle may or may not be a pointer; even if it is one underneath, you don't know what it points to, so you can't get at the fields.
The best way to think of a handle is as a unique ID for something in the system. When you pass it to a system call, the system knows what to cast it to (if it is a pointer) or how to treat it (if it is just some ID or index).
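To make the "unique ID into a table the system keeps" idea concrete, here is a toy sketch in Go. This is not how Windows actually implements handles, and all the names are hypothetical; it only shows why a handle lets the system find the resource while the caller sees nothing but an opaque cookie.

```go
package main

import "fmt"

type Handle int

var (
	table  = map[Handle]string{} // the "system's" private resource table
	nextID = Handle(1)
)

// create allocates a resource and hands back an opaque cookie for it.
func create(contents string) Handle {
	h := nextID
	nextID++
	table[h] = contents
	return h
}

// read looks the resource up by its cookie; callers never touch the table.
func read(h Handle) (string, bool) {
	contents, ok := table[h]
	return contents, ok
}

// closeHandle invalidates the cookie, like CloseHandle() does.
func closeHandle(h Handle) { delete(table, h) }

func main() {
	h := create("hello")
	s, _ := read(h)
	fmt.Println(s) // hello
	closeHandle(h)
	_, ok := read(h)
	fmt.Println(ok) // false: the handle is no longer valid
}
```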

gson vs protocol buffer

What are the pros and cons of protocol buffer (protobuf) over GSON?
In what situations is protobuf more appropriate than GSON?
I am sorry for a very generic question.
Both JSON (via the Gson library) and protobuf are portable between platforms, but:
- protobuf is smaller (bandwidth) and cheaper (CPU) to read/write
- JSON is human-readable/editable (protobuf is binary, and hard to parse without library support)
- protobuf fragments are trivial to merge: just concatenate them
- JSON is easily passed to web-page clients
- the main Java version of protobuf needs a contract definition (.proto) and code generation; Gson allows arbitrary POJO usage (there are protobuf implementations that work on such objects, but not for Java AFAIK)
If performance is key: protobuf.
For use with a web page (JavaScript), or when human readability matters: JSON (perhaps via Gson).
If you want efficiency and cross-platform portability, you should send messages between applications that contain only the necessary information, nothing more and nothing less.
Serialising classes via Java's own mechanisms, Gson, protobufs, or whatever else creates data that contains not only the information you wish to send, but also information about the logical structures/hierarchies of the data structures that were used to represent the data inside your application.
This makes those classes and data mappings dual-purpose: one, to represent the data internally in the application, and two, to be transmitted to another application. Those two roles can conflict, and there is an onus on the developer to remember that the classes, collections, and data layout they are working with at any time will also be serialised.
