Apache NiFi with Websocket stream of Google Flatbuffers payloads - apache-nifi

Has something like this been done before? If not, what would be involved in getting NiFi to ingest a stream arriving over a WebSocket with Google FlatBuffers?
(would a simple TCP stream make it easier or harder?)
UPDATE
I have a C++ program that is running on a node, which is collecting data and publishing it via nanomessage pub/sub channel over a websocket. The data in C++ looks like structs, and I am serializing it with Google Flatbuffers. It is a very simple struct, think of csv records. We have a team member who wants to capture this data with NiFi and put it to a database.
Personally, since Flatbuffers supports conversion of binary to JSON, I think this is almost easier just writing a short C#, python, java or javascript program to receive the flabuffers, open a DB connection, and dump the data. (maybe convert to JSON first, if needed).

To my knowledge, NiFi does not have an integration with the nanomsg library/protocol out of the box. This would likely require writing a custom processor for that is capable of consuming nanomsg packets using the nanomsg PUBSUB pattern / socket types.
One could use existing processors, such as the Consume* processors (ConsumeKafka, ConsumeJMS) as an example / guide for how to write a processor that consumes messages from a topic/queue that follows the pub/sub pattern.
You would then want to transform the payload from Flatbuffers binary to a format insertable into the desired database. Again, a custom processor using code generated from your Flatbuffer schema would probably be the right approach for that.
As you mention, this could also be accomplished with a simple program. If you wrote that program in Java (using the Java nanomsg and flatbuffers libs) as a prototype/proof-of-concept, then it could be refactored into one or more NiFi custom processors in the future if you wish to move to NiFi.

Related

Thrift, Avro and ProtoBuf data governance

We have a use case of data streaming from the main transactional system to other downstream such as data analytics and machine learning team.
One of the requirements are to ensure data governance that data source can control who can read which column, and potentially lifecycle of a data to ensure data siting in another domain gets purged should the source data removed it, such as if a user deletes the account, we need to make sure the data in all downstream gets removed.
While we are considering Thrift, Avro and ProtoBuf, what are the common frameworks that we can use for such data governance? Do any of these protocol supports metadata for such data governance around data authorization, lifecycle?
Let me get this straight:
protobuf is not a security device; to someone with the right tools it is just as readable as xml or json, with the slight issue that it can be uncertain how to interpret some values;
It's not of a much difference than JSON nor XML. It is just an interface language. Sure, it has encoding, it is a bit different and a lot more customizable, but it does in no way confront security. It is up to you to secure the channel between sender and receiver.

OpenWhisk and binary data from Google Flatbuffers

We have data being created by a simulated device being put on the network with NanoMSG with a payload of Google FlatBuffers (binary).
We would like to trigger on patterns of this data with OpenWhisk, and respond with Flatbuffer encoded responses.
Assume latency and throughput are not a big concern here.
Which approach can we take:
Write a repeater that converts the Flatbuffer to JSON (FB has a utility to do this) and then place the data onto an AMQP buss which is listened to by OpenWhisk? (we have folks familiar with AMQP, but not Kafka
Try to do something with Kafka, which seems (maybe it is only the IBM version) to directly handle the binary Flabuffers (probably still need a shim from NanoMSG to Kafka. E.g.
How to invoke an OpenWhisk action from IoT Platform in Bluemix
https://medium.com/openwhisk/serverless-transformation-of-iot-data-in-motion-with-openwhisk-272e36117d6c
Not sure if we still don't need the Flatbuffers JavaScript deserializer and serializer to convert the binary based64 data in JavaScript to JSON
Learn Kakfa, and then transform the NanoMsg payload (Flatbuffers to JSON).
Something else?
Anyone have direct experience in this?
Update
Thank you James, those are spot-on links. But it does raise some secondary issues:
If the data is in Google FlatBuffers schema, it does not seem to be any advantage to using Kafka binary transformation, since the mux/demux from base64 still needs to be done in the javascript layer.
It is slightly disturbing that Kafka (which is known for its low latency) is batching the events. That does effect latency when one has Iot (sensor data) that needs to be responded to in a closed-loop to actuators (sensor->control->actuators) is a common robotics model, and that is pretty much close to what we are doing. For the moment we are not pushing the latency issue, but I can see emerging cases where we will need the low latency. What is the thinking in the Kafka Whisk provider community about this?
I must be missing something, but the AMQP provider says it is using RHEA https://github.com/amqp/rhea#receiver . That seems to provide all one needs in terms of writing a simple trigger/rules for dealing with sensor stream data. Why would one use OpenWhisk?
Either option makes sense. OpenWhisk actions receive and return JSON messages. Binary data passed into those functions must be Base64 encoded.
If you use an AMQP feed, you can convert the binary data to JSON manually.
The Kafka feed provider does support automatic encoding of the binary input values (using the isBinary* parameters).
Kafka feeds push batches of messages to the OpenWhisk actions. This is different from a message queue, which would push one message at a time. This feed provider is built-in OpenWhisk.
There is an external community feed provider for AMQP here. This would need you to install and run it manually.

How best to insert a stream of messages from a C app into RethinkDB

I have a C-based app which is collecting measurement streams which I want to dump in to RethinkDB. There is no need for any queries, just creating a table and inserting data. Is anybody aware of such as simple library?
There is the AtnNn/librethinkdbxx driver which I most likely could wrap easily, but it has no documentation and C++ and I don't get along well :)
RethinkDB protocol isn't that hard to write your own driver, like a simple C driver with write command support only
However, I propose that you abstract it and using a simple HTTP server to write into RethinkDB. You can do this with NodeJS in under 100 lines of code easily since you only need to write.

ElasticSearch: Jest vs Rest vs TransportClient vs NodeClient

I have gone through the official documentation at https://www.elastic.co/blog/found-interfacing-elasticsearch-picking-client
But it does not give any benchmarks or performance numbers to help choose among the clients. And I am finding it non-trivial to setup a TransportClient or setup a NodeClient because the documentation for that is also really sparse with little to no examples whatsoever.
So if someone has already done some benchmarking on choosing a client, I would really appreciate that and focus more on tuning an established client rather than evaluating what client to choose.
Our application is a write-heavy application and we plan to have a 50-shard, 50-replica ES cluster for that.
All those clients are fine for querying and they all have their pros and cons (below list is not exhaustive):
A Node client provides a single hop into the cluster but since it will also be part of the cluster it can also induce too much chatter within the cluster
A Transport client is not part of the cluster, hence requires a two-hop roundtrip, and communicates with a single node at a time in a round-robin fashion (from the list provided during its construction)
Jest is basically the missing client for the ES REST interface
If you feel like you don't need all what Jest has to offer and simply want to interact with a few endpoints, you might as well create your own REST client by using Spring REST template, Apache HTTP, etc
If you're going to have a write-heavy application I suggest you don't even use any of those clients at all. The main reason is that they are all synchronous in nature and if any component of your architecture or the network were to fail for some reason, then you'd lose data, and that might not be an option for you.
If you have plenty of data to ingest, you normally go the asynchronous way, i.e. storing your data in a temporary (yet durable) queue (Kafka, Redis, JMS, etc) and then let another process stream it to ES. There are many ways to do that, but a very simple one is to use Logstash for that.
Whether you decide to store your data in Kafka or JMS or Redis, you can then let Logstash consume your data and stream it to ES, i.e. you let Logstash worry about the heavy write part, which it does very well. That can be achieved very easily with
a kafka or redis or stomp input
a few filters to massage your data
an elasticsearch output to forward the resulting data to ES via the bulk endpoint.
With that kind of well-tuned setup, you can handle very heavy write loads without needing to worry about which client you want to use and how you need to tune it. The question is still open for querying, though, but since the write part is paramount in your case, you need to make it solid, the only serious way is by going asynchronous and let a well-developed and tested ETL (such as Logstash, or fluentd, etc) do it for you.
UPDATE
It is worth noting that as of ES 5.0, there will be a new Java REST client available.

Avro RPC/Storm integration

I have an existing Avro RPC client that sends data to an Avro RPC server. The Avro RPC server currently writes the data into HDFS (and does other things as well). We are changing our server processes to be based upon Storm. I am hoping to find an easy way to get my data into Storm, hopefully using the Avro RPC messages I have now.
I have been looking around for a way to do this, so far with no success. Storm has an RPC model, but it appears to be limited to passing strings, which I want to avoid (why I went to Avro in the first place). Zeromq might be a possibility, but seems limited for what I am trying to do.
Can someone suggest an elegant way for me to get my Avro RPC, schema-based data, into Storm??
Thanks!!!!
so... have not figured out a way to do this directly... but the solution we came up with was a Storm call back procedure that would ask Avro RPC for data. So, basically we switched the client/server relationship. Seems to work well.

Resources