Thrift, Avro and ProtoBuf data governance

We have a use case of streaming data from the main transactional system to downstream consumers such as the data analytics and machine learning teams.
One of the requirements is data governance: the data source must be able to control who can read which column, and potentially the lifecycle of the data, so that data sitting in another domain gets purged once the source removes it. For example, if a user deletes their account, we need to make sure the data is removed from all downstream systems.
We are considering Thrift, Avro and ProtoBuf. What are the common frameworks we can use for such data governance? Do any of these protocols support metadata for data governance around authorization and lifecycle?

Let me get this straight:
Protobuf is not a security device; to someone with the right tools it is just as readable as XML or JSON, with the slight caveat that it can be uncertain how to interpret some values.
It is not much different from JSON or XML in this respect. It is just an interface language. Sure, its encoding is a bit different and a lot more customizable, but it in no way addresses security. It is up to you to secure the channel between sender and receiver.

Related

gRPC and schema evolution guarantees?

I am evaluating gRPC. On the subject of compatibility with schema evolution, I did find information that protocol buffers, which gRPC exchanges for data serialization, have a format such that different versions of a message can stay compatible, as long as the schema evolution allows it.
But that does not tell me whether two iterations of a gRPC client/server will be able to exchange a call that did not change in the schema, regardless of the changes in the rest of the schema.
Does gRPC guarantee that an older generated version of a client or server will always be able to issue/answer a call that has not changed in the schema file, against code generated from any more recent schema on the other side, regardless of the rest of the schema? (Assuming no other breaking changes, like a non-backward-compatible gRPC version change.)
gRPC has compatibility for a method as long as:
the proto package, service name, and method name are unchanged
the proto request and response messages are still compatible
the cardinality (unary vs streaming) of the request and response message remains unchanged
New services and methods can be added at will and not impact compatibility of preexisting methods. This doesn't get discussed much because those restrictions are mostly what people would expect.
There is actually some wiggle room in changing cardinality as unary is encoded the same as streaming on-the-wire, but it's generally just better to assume you can't change cardinality and add a new method instead.
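To make those rules concrete, here is a minimal sketch of a compatible evolution; the package, service, message, and method names are all made up:

    // billing.proto, version 2 (hypothetical names).
    // Version 1 had only the Charge method and the account_id field;
    // everything marked "new" was added without breaking older peers.
    syntax = "proto3";
    package billing.v1;

    message ChargeRequest {
      string account_id = 1;
      string currency   = 2;  // new field: unknown to old peers, simply ignored
    }

    message ChargeResponse {
      string charge_id = 1;
    }

    message RefundRequest  { string charge_id = 1; }  // new message
    message RefundResponse { bool refunded = 1; }     // new message

    service Billing {
      // Unchanged: same package, service name, method name, and unary
      // cardinality, so v1 clients and servers still interoperate with v2 code.
      rpc Charge(ChargeRequest) returns (ChargeResponse);
      // New method: does not affect compatibility of Charge.
      rpc Refund(RefundRequest) returns (RefundResponse);
    }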
This topic is discussed in the Modifying gRPC Services over Time talk (PDF slides and YouTube links are available). I'll note that the slides are not intended to stand alone.

Microservice messaging in Go

I am working on a project splitting services off of a monolithic codebase. Currently, the services I am working on are a login/authentication service, a service providing a few REST API endpoints, and a data retrieval service feeding the REST API. I need the services to exchange data with each other; in particular, data retrieval needs to send data to the REST API, and the REST API needs to exchange authentication requests/responses with the login service.
The technology of choice in go world seems to be protobufs and gRPC. I have been looking into using these, but it seems really inconvenient, so I wonder if I am just doing things wrong.
For example, my data retrieval service gets records from an RDBMS and my REST API serves this data as JSON. Normally, I would define a "model" struct for each type of record in the database/API, use struct tags for both the database and JSON in the struct definitions, and use something like sqlx to scan query results into structs and encoding/json to serialize structs into JSON. But if I want to pass the data around using gRPC and protobufs, this whole setup goes out of the window. Since protobuf generates its own struct types, I would have to manually implement conversion from SQL rows into protobufs and from protobufs into JSON for every single message type I define. Not that implementing conversions is hard, but it introduces more opportunities for bugs and code fragility, and feels like reinventing the wheel.
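For a concrete (hypothetical) example, this is roughly the glue I would have to hand-write for a single User record; PBUser stands in for the struct protoc would generate:

    package models

    // UserRow is the database model, scanned with sqlx via `db` tags
    // and serialized to JSON via `json` tags.
    type UserRow struct {
        ID    int64  `db:"id"    json:"id"`
        Email string `db:"email" json:"email"`
    }

    // PBUser stands in for a protoc-generated struct (in real code it
    // would live in a generated package, e.g. pb.User).
    type PBUser struct {
        Id    int64
        Email string
    }

    // ToProto is the hand-written conversion in question: one of these
    // is needed for every message type.
    func (u UserRow) ToProto() *PBUser {
        return &PBUser{Id: u.ID, Email: u.Email}
    }

    // FromProto converts back so the REST API can serve JSON.
    func FromProto(p *PBUser) UserRow {
        return UserRow{ID: p.Id, Email: p.Email}
    }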
And this seems like it should be a very common problem. Am I missing some obvious solution?

OpenWhisk and binary data from Google Flatbuffers

We have data being created by a simulated device being put on the network with NanoMSG with a payload of Google FlatBuffers (binary).
We would like to trigger on patterns of this data with OpenWhisk, and respond with Flatbuffer encoded responses.
Assume latency and throughput are not a big concern here.
Which approach can we take:
Write a repeater that converts the FlatBuffer to JSON (FB has a utility to do this) and then place the data onto an AMQP bus which is listened to by OpenWhisk? (We have folks familiar with AMQP, but not Kafka.)
Try to do something with Kafka, which seems (maybe it is only the IBM version) to directly handle the binary FlatBuffers (probably still need a shim from NanoMSG to Kafka). E.g.
How to invoke an OpenWhisk action from IoT Platform in Bluemix
https://medium.com/openwhisk/serverless-transformation-of-iot-data-in-motion-with-openwhisk-272e36117d6c
Not sure if we still don't need the FlatBuffers JavaScript deserializer and serializer to convert the binary base64 data in JavaScript to JSON.
Learn Kafka, and then transform the NanoMsg payload (FlatBuffers to JSON).
Something else?
Anyone have direct experience in this?
Update
Thank you James, those are spot-on links. But they do raise some secondary issues:
If the data is in a Google FlatBuffers schema, there does not seem to be any advantage to using Kafka's binary transformation, since the mux/demux from base64 still needs to be done in the JavaScript layer.
It is slightly disturbing that Kafka (which is known for its low latency) is batching the events. That does affect latency when one has IoT (sensor) data that needs to be responded to in a closed loop to actuators; sensor->control->actuator is a common robotics model, and that is pretty much what we are doing. For the moment we are not pushing the latency issue, but I can see emerging cases where we will need the low latency. What is the thinking in the Kafka Whisk provider community about this?
I must be missing something, but the AMQP provider says it is using rhea (https://github.com/amqp/rhea#receiver). That seems to provide all one needs in terms of writing simple triggers/rules for dealing with sensor stream data. Why would one use OpenWhisk?
Either option makes sense. OpenWhisk actions receive and return JSON messages. Binary data passed into those functions must be Base64 encoded.
If you use an AMQP feed, you can convert the binary data to JSON manually.
The Kafka feed provider does support automatic encoding of the binary input values (using the isBinary* parameters).
Kafka feeds push batches of messages to the OpenWhisk actions. This is different from a message queue, which would push one message at a time. This feed provider is built into OpenWhisk.
There is an external community feed provider for AMQP here. It would require you to install and run it manually.
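As a rough sketch of the decode step an action has to do, here is a minimal Go example; the "payload" field name is an assumption, and the same few lines translate directly to JavaScript:

    package main

    import (
        "encoding/base64"
        "encoding/json"
        "fmt"
    )

    // decodePayload extracts base64-encoded FlatBuffer bytes from the JSON
    // event an action receives. The "payload" key is hypothetical.
    func decodePayload(event []byte) ([]byte, error) {
        var msg map[string]string
        if err := json.Unmarshal(event, &msg); err != nil {
            return nil, err
        }
        return base64.StdEncoding.DecodeString(msg["payload"])
    }

    func main() {
        event := []byte(`{"payload":"SGVsbG8="}`) // "Hello" in base64
        raw, err := decodePayload(event)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%d raw bytes: %q\n", len(raw), raw)
        // From here the bytes would be handed to the FlatBuffers accessors
        // generated from your schema.
    }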

Apache NiFi with Websocket stream of Google Flatbuffers payloads

Has something like this been done before? If not, what would be involved in getting NiFi to ingest a stream arriving over a WebSocket with Google FlatBuffers?
(would a simple TCP stream make it easier or harder?)
UPDATE
I have a C++ program that is running on a node, collecting data and publishing it via a nanomsg pub/sub channel over a websocket. The data in C++ looks like structs, and I am serializing it with Google FlatBuffers. It is a very simple struct; think of CSV records. We have a team member who wants to capture this data with NiFi and put it into a database.
Personally, since FlatBuffers supports conversion of binary to JSON, I think it is almost easier to just write a short C#, Python, Java or JavaScript program to receive the FlatBuffers, open a DB connection, and dump the data (maybe converting to JSON first, if needed).
To my knowledge, NiFi does not have an integration with the nanomsg library/protocol out of the box. This would likely require writing a custom processor that is capable of consuming nanomsg packets using the nanomsg PUB/SUB pattern / socket types.
One could use existing processors, such as the Consume* processors (ConsumeKafka, ConsumeJMS) as an example / guide for how to write a processor that consumes messages from a topic/queue that follows the pub/sub pattern.
You would then want to transform the payload from Flatbuffers binary to a format insertable into the desired database. Again, a custom processor using code generated from your Flatbuffer schema would probably be the right approach for that.
As you mention, this could also be accomplished with a simple program. If you wrote that program in Java (using the Java nanomsg and flatbuffers libs) as a prototype/proof-of-concept, then it could be refactored into one or more NiFi custom processors in the future if you wish to move to NiFi.
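As a rough skeleton of that bridge program, here it is in Go for brevity; the table name, connection string, and the stubbed receive/decode functions are all placeholders for the nanomsg subscription and the FlatBuffers accessors generated from your schema, and the shape is the same in Java or Python:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq" // any database/sql driver works here
    )

    // receive stands in for the nanomsg SUB socket read; a real program
    // would use a nanomsg binding to pull each FlatBuffer message.
    func receive() ([]byte, bool) { return nil, false }

    // decodeRecord stands in for the generated FlatBuffers accessors;
    // it turns the raw bytes into the fields to store.
    func decodeRecord(raw []byte) (ts int64, value float64) { return 0, 0 }

    func main() {
        db, err := sql.Open("postgres", "postgres://localhost/sensors?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        for {
            raw, ok := receive()
            if !ok {
                return
            }
            ts, value := decodeRecord(raw)
            if _, err := db.Exec(
                "INSERT INTO readings (ts, value) VALUES ($1, $2)", ts, value,
            ); err != nil {
                log.Println("insert failed:", err)
            }
        }
    }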

Management layer above Thrift

Thrift sounds awesome, but I can't find some basic stuff I'm used to in RPC frameworks (such as HttpServlet). Examples of things I can't find: session management, filtering, upload/download progress.
I understand that the missing stuff might be a management layer on top of Thrift. If so, any example of such a layer? Perhaps AOP (Aspect Oriented)?
I can't imagine such a layer that compiles to all languages, and that's what I'm missing. Taking session management as an example, there might be several clients that all need to do some authentication and pass the session_id with each RPC. I would expect a similar API for all languages doing so.
Does anyone know of a management layer for Thrift?
So thrift itself is not going to help you out a lot here.
I have had similar desires, and have a few suggestions:
1. Put your management objects into the IDL
Simply add an api token or common transfer data struct as a parameter to all of your service methods. Set it as parameter id 15 so that it will always be the last parameter, even if you add others in the middle.
As the first step in your handler you can validate/store/do whatever with the extra data.
This has the advantage that it is valid in any platform that thrift supports.
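A minimal Thrift IDL sketch of that idea (all names are made up):

    // Common management data, passed on every call.
    struct CallContext {
      1: string sessionId,
      2: optional string apiToken,
    }

    struct AccountInfo {
      1: string accountId,
      2: string displayName,
    }

    service AccountService {
      // Business parameters come first; the context is pinned at id 15 so
      // it stays last even if more parameters are added before it later.
      AccountInfo getAccount(1: string accountId, 15: CallContext ctx),
    }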
2. Use thrift over http
If you use HTTP as your transport, you can include whatever data you want as HTTP headers, and the thrift content as the body.
This will often require a custom http client for every platform you use to inject the data, and a custom handler on the server to use the data, but neither of those are prohibitively difficult.
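For example, in Go the client-side injection can be a custom http.RoundTripper handed to whatever Thrift HTTP transport your binding provides; the header name and wiring here are just a sketch, and the server side is a plain HTTP middleware that reads the header before the Thrift processor runs:

    package thrifthttp

    import "net/http"

    // sessionTransport injects a session header into every outgoing request.
    type sessionTransport struct {
        sessionID string
        next      http.RoundTripper
    }

    func (t sessionTransport) RoundTrip(req *http.Request) (*http.Response, error) {
        req = req.Clone(req.Context())
        req.Header.Set("X-Session-Id", t.sessionID) // header name is arbitrary
        return t.next.RoundTrip(req)
    }

    // NewSessionClient returns an *http.Client that can back the Thrift
    // HTTP client transport of your language binding.
    func NewSessionClient(sessionID string) *http.Client {
        return &http.Client{
            Transport: sessionTransport{sessionID: sessionID, next: http.DefaultTransport},
        }
    }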
3. Hack the protocol
It is possible to create your own custom protocol that wraps another protocol and injects custom data. Take a look at how the multiplexed protocol works in the thrift library for most languages:
C# here. It sends the method name across the wire as service:method. The multiplexed processor unwraps this encoding and passes it on to the appropriate processor.
I have used a similar method to encode arbitrary key/value pairs (like http headers) inside the method name.
The downside to this is that you need to write a more complicated extension for each platform you will be using, but only once. It varies a bit from language to language how this works, but it is generally simple enough once you figure it out.
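As a toy Go sketch of the naming trick itself (the separator and format are made up; a real implementation would wrap the protocol's WriteMessageBegin/ReadMessageBegin):

    package methodmeta

    import "strings"

    // Encode packs key/value pairs into the method name, in the spirit of
    // the multiplexed protocol's "service:method" convention.
    func Encode(method string, headers map[string]string) string {
        parts := []string{method}
        for k, v := range headers {
            parts = append(parts, k+"="+v)
        }
        return strings.Join(parts, "|")
    }

    // Decode splits the wire name back into the real method name and the
    // embedded headers.
    func Decode(wireName string) (string, map[string]string) {
        parts := strings.Split(wireName, "|")
        headers := make(map[string]string, len(parts)-1)
        for _, p := range parts[1:] {
            if k, v, ok := strings.Cut(p, "="); ok {
                headers[k] = v
            }
        }
        return parts[0], headers
    }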
These are just a few ideas I have had, and I am sure there are others. The nice thing about thrift is how the individual components are decoupled from each other. If you have special needs, you can swap any of them out as needed to add specific functionality.
