OpenWhisk and binary data from Google FlatBuffers

We have data created by a simulated device and published on the network with NanoMsg, with a payload of Google FlatBuffers (binary).
We would like to trigger on patterns of this data with OpenWhisk, and respond with Flatbuffer encoded responses.
Assume latency and throughput are not a big concern here.
Which approach should we take?
Write a repeater that converts the FlatBuffers to JSON (FlatBuffers has a utility to do this) and then place the data onto an AMQP bus that OpenWhisk listens to? (We have folks familiar with AMQP, but not Kafka.)
Try to do something with Kafka, which seems (maybe it is only the IBM version) to handle the binary FlatBuffers directly (probably still needing a shim from NanoMsg to Kafka), e.g.:
How to invoke an OpenWhisk action from IoT Platform in Bluemix
https://medium.com/openwhisk/serverless-transformation-of-iot-data-in-motion-with-openwhisk-272e36117d6c
We are not sure whether we would still need the FlatBuffers JavaScript serializer/deserializer to convert the Base64-encoded binary data to JSON in JavaScript.
Learn Kafka, and then transform the NanoMsg payload (FlatBuffers) to JSON.
Something else?
Anyone have direct experience in this?
Update
Thank you James, those are spot-on links, but they do raise some secondary issues:
If the data is in a Google FlatBuffers schema, there does not seem to be any advantage to Kafka's binary transformation, since the mux/demux from Base64 still needs to be done in the JavaScript layer.
It is slightly disturbing that Kafka (which is known for its low latency) batches the events. That does affect latency when you have IoT (sensor) data that needs to be handled in a closed loop to actuators; sensor -> control -> actuator is a common robotics model, and that is pretty much what we are doing. For the moment we are not pushing the latency issue, but I can see emerging cases where we will need the low latency. What is the thinking in the Kafka/OpenWhisk provider community about this?
I must be missing something, but the AMQP provider says it is using rhea (https://github.com/amqp/rhea#receiver). That seems to provide everything one needs for writing simple triggers/rules to deal with sensor stream data. Why would one use OpenWhisk?

Either option makes sense. OpenWhisk actions receive and return JSON messages. Binary data passed into those functions must be Base64 encoded.
If you use an AMQP feed, you can convert the binary data to JSON manually.
The Kafka feed provider does support automatic encoding of the binary input values (using the isBinary* parameters).
Kafka feeds push batches of messages to the OpenWhisk actions. This is different from a message queue, which would push one message at a time. This feed provider is built into OpenWhisk.
There is an external community feed provider for AMQP here. You would need to install and run it yourself.
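For illustration, here is a minimal sketch of a Node.js OpenWhisk action that handles a Base64-encoded FlatBuffers payload. It assumes the flatbuffers npm package and accessors generated by flatc from a hypothetical SensorReading schema; the parameter name (params.value) and the generated API will differ depending on your feed provider and schema, so treat this as a sketch rather than the provider's actual contract.
// Sketch only: hypothetical schema and parameter names.
const flatbuffers = require('flatbuffers');                       // API differs slightly between flatbuffers 1.x and 2.x
const { SensorReading } = require('./sensor_reading_generated');  // flatc-generated code (hypothetical)

function main(params) {
  // OpenWhisk delivers binary payloads as Base64 strings
  const bytes = Buffer.from(params.value, 'base64');
  const buf = new flatbuffers.ByteBuffer(new Uint8Array(bytes));
  const reading = SensorReading.getRootAsSensorReading(buf);

  // Decide how to respond based on the decoded fields; a FlatBuffers response
  // would be built with flatbuffers.Builder and returned Base64-encoded.
  return {
    id: reading.id(),
    value: reading.value(),
    response: Buffer.from(bytes).toString('base64')
  };
}

exports.main = main;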

Related

Thrift, Avro and ProtoBuf data governance

We have a use case of data streaming from the main transactional system to downstream consumers such as the data analytics and machine learning teams.
One of the requirements is data governance: the data source must be able to control who can read which column, and potentially the lifecycle of the data, so that data sitting in another domain gets purged when the source removes it. For example, if a user deletes their account, we need to make sure the data in all downstream systems is removed as well.
While we are considering Thrift, Avro and ProtoBuf, what are the common frameworks we can use for such data governance? Do any of these protocols support metadata for data governance around authorization and lifecycle?
Let me get this straight:
protobuf is not a security device; to someone with the right tools it is just as readable as XML or JSON, with the slight complication that it can be unclear how to interpret some values;
It is not much different from JSON or XML. It is just an interface language. Sure, its encoding is a bit different and a lot more customizable, but it in no way addresses security. It is up to you to secure the channel between the sender and the receiver.
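To illustrate the point, here is a minimal sketch assuming the protobufjs npm package and a hypothetical user.proto schema with a demo.User type: anyone who holds the schema (or reverse-engineers the field numbers) can decode the bytes, so Protocol Buffers provide no confidentiality by themselves.
// Sketch only: schema file and type name are hypothetical.
const protobuf = require('protobufjs');

// Decode a captured payload given only the schema; no key or secret is required.
async function inspect(rawBytes) {
  const root = await protobuf.load('user.proto');    // hypothetical schema file
  const User = root.lookupType('demo.User');         // hypothetical type name
  const message = User.decode(rawBytes);
  console.log(User.toObject(message));               // plain, fully readable object
}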

Bidirectional client-server communication using Server-Sent Events instead of WebSockets?

It is possible to achieve two-way communication between a client and a server using Server-Sent Events (SSE) if the clients send messages using HTTP POST and receive messages asynchronously over SSE.
It has been mentioned here that SSE with AJAX has higher round-trip latency and higher client-to-server bandwidth, since every HTTP request carries headers, and that WebSockets are better in this case. However, isn't it an advantage of SSE that it allows consistent data compression, given that WebSockets' permessage-deflate supports selective compression, meaning some messages might be compressed while others aren't?
Your best bet in this scenario would be to use a WebSockets server, because building a WS implementation from scratch is not only time-consuming but also pointless given that the problem has already been solved. Since you've tagged Socket.io, that's a good option to get started: it's an open-source tool that is easy to use and well documented.
However, since it is open source, it doesn't provide some functionality that is critical when you want to stream data in a production-level application: scalability, interoperability (for endpoints operating on protocols other than WebSockets), fault tolerance, reliable message ordering, etc.
The real-time messaging infrastructure plus these critical production-level features are provided as a service called a 'Data Stream Network'. A couple of companies provide this, such as Ably, PubNub, etc.
I've worked extensively with Ably, so I'm comfortable sharing an example in Node.js that uses Ably:
var Ably = require('ably');
var realtime = new Ably.Realtime('YOUR-API-KEY');
var channel = realtime.channels.get('data-stream-a');
// subscribe on the devices or database side
channel.subscribe(function(message) {
  console.log("Received: " + message.data);
});
// publish from Server A
channel.publish("example", "message data");
You can create a free account to get an API key with 3 million free messages per month, which should be enough to try it out properly.
There's also a concept of Reactor functions, which essentially lets you invoke serverless functions in real time on AWS, Azure, Google Cloud, etc. You can place a database on one side too and log data as it arrives.
Hope this helps!
Yes, it's possible.
You can have more than 1 parallel HTTP connection open, so there's nothing stopping you.
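As a minimal sketch of that pattern, assuming Node.js with Express (route names and port are illustrative): clients receive pushes over a long-lived SSE connection and send messages upstream with ordinary HTTP POSTs.
// Sketch only: Express app with one SSE channel and one POST endpoint.
const express = require('express');
const app = express();
app.use(express.json());

const clients = new Set();

// Downstream: each client opens a long-lived SSE connection here
app.get('/events', (req, res) => {
  res.set({
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive'
  });
  res.flushHeaders();
  clients.add(res);
  req.on('close', () => clients.delete(res));
});

// Upstream: clients send messages with a plain HTTP POST,
// and the server fans them out to every SSE subscriber
app.post('/send', (req, res) => {
  for (const client of clients) {
    client.write('data: ' + JSON.stringify(req.body) + '\n\n');
  }
  res.sendStatus(204);
});

app.listen(3000);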

Apache NiFi with Websocket stream of Google Flatbuffers payloads

Has something like this been done before? If not, what would be involved in getting NiFi to ingest a stream arriving over a WebSocket with Google FlatBuffers?
(would a simple TCP stream make it easier or harder?)
UPDATE
I have a C++ program running on a node that is collecting data and publishing it via a NanoMsg pub/sub channel over a WebSocket. The data in C++ looks like structs, and I am serializing it with Google FlatBuffers. It is a very simple struct; think of CSV records. We have a team member who wants to capture this data with NiFi and put it into a database.
Personally, since FlatBuffers supports conversion of binary to JSON, I think it is almost easier to just write a short C#, Python, Java or JavaScript program to receive the FlatBuffers, open a DB connection, and dump the data (maybe converting to JSON first, if needed).
To my knowledge, NiFi does not have an integration with the nanomsg library/protocol out of the box. This would likely require writing a custom processor that is capable of consuming nanomsg packets using the nanomsg PUB/SUB pattern and socket types.
One could use existing processors, such as the Consume* processors (ConsumeKafka, ConsumeJMS) as an example / guide for how to write a processor that consumes messages from a topic/queue that follows the pub/sub pattern.
You would then want to transform the payload from Flatbuffers binary to a format insertable into the desired database. Again, a custom processor using code generated from your Flatbuffer schema would probably be the right approach for that.
As you mention, this could also be accomplished with a simple program. If you wrote that program in Java (using the Java nanomsg and flatbuffers libs) as a prototype/proof-of-concept, then it could be refactored into one or more NiFi custom processors in the future if you wish to move to NiFi.
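As a sketch of the "simple program" route in Node.js, assuming the flatbuffers and sqlite3 npm packages and accessors generated from a hypothetical Record schema; the transport that delivers the raw bytes (nanomsg or a WebSocket) is left out.
// Sketch only: schema, table and column names are hypothetical.
const flatbuffers = require('flatbuffers');           // API differs slightly between flatbuffers 1.x and 2.x
const sqlite3 = require('sqlite3');
const { Record } = require('./record_generated');     // flatc-generated code (hypothetical)

const db = new sqlite3.Database('records.db');

function handleMessage(rawBytes) {
  const buf = new flatbuffers.ByteBuffer(new Uint8Array(rawBytes));
  const record = Record.getRootAsRecord(buf);
  // Flatten the struct-like record into columns, much like a CSV row
  db.run('INSERT INTO records (id, value, ts) VALUES (?, ?, ?)',
         [record.id(), record.value(), record.timestamp()]);
}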

ElasticSearch: Jest vs Rest vs TransportClient vs NodeClient

I have gone through the official documentation at https://www.elastic.co/blog/found-interfacing-elasticsearch-picking-client
But it does not give any benchmarks or performance numbers to help choose among the clients. And I am finding it non-trivial to set up a TransportClient or a NodeClient because the documentation for those is also really sparse, with little to no examples.
So if someone has already done some benchmarking when choosing a client, I would really appreciate it, so I can focus on tuning an established client rather than evaluating which client to choose.
Our application is a write-heavy application and we plan to have a 50-shard, 50-replica ES cluster for that.
All of those clients are fine for querying, and they all have their pros and cons (the list below is not exhaustive):
A Node client provides a single hop into the cluster, but since it also becomes part of the cluster it can induce too much chatter within the cluster
A Transport client is not part of the cluster, hence requires a two-hop round trip, and communicates with a single node at a time in a round-robin fashion (from the list provided during its construction)
Jest is basically the missing client for the ES REST interface
If you feel you don't need all that Jest has to offer and simply want to interact with a few endpoints, you might as well create your own REST client using Spring's RestTemplate, Apache HttpClient, etc.
If you're going to have a write-heavy application, I suggest you don't use any of those clients at all. The main reason is that they are all synchronous in nature, and if any component of your architecture or the network were to fail, you would lose data, and that might not be an option for you.
If you have plenty of data to ingest, you normally go the asynchronous way, i.e. storing your data in a temporary (yet durable) queue (Kafka, Redis, JMS, etc) and then let another process stream it to ES. There are many ways to do that, but a very simple one is to use Logstash for that.
Whether you decide to store your data in Kafka or JMS or Redis, you can then let Logstash consume your data and stream it to ES, i.e. you let Logstash worry about the heavy write part, which it does very well. That can be achieved very easily with
a kafka or redis or stomp input
a few filters to massage your data
an elasticsearch output to forward the resulting data to ES via the bulk endpoint.
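As a rough sketch (topic, index and filter names are illustrative, and the exact options depend on your Logstash version), such a pipeline configuration looks something like this:
input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["app-events"]
  }
}

filter {
  # massage your data here, e.g. parse the JSON payload
  json { source => "message" }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "app-events-%{+YYYY.MM.dd}"
  }
}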
With that kind of well-tuned setup, you can handle very heavy write loads without needing to worry about which client to use and how to tune it. The question is still open for querying, though, but since the write part is paramount in your case, you need to make it solid; the only serious way is to go asynchronous and let a well-developed and tested ETL (such as Logstash, fluentd, etc.) do it for you.
UPDATE
It is worth noting that as of ES 5.0, there will be a new Java REST client available.

Which tech is available to stream data from social media to Hadoop?

I am searching for technologies I can use to stream data from social media to Hadoop.
I searched and found these technologies:
Flume.
Storm.
Kafka.
Which tool is the best, and why? Is anyone familiar with other tools?
Most likely, you will want to use Flume, as it is built to work with HDFS. However, as with all things, it depends.
Kafka is basically a queueing system that is usually used to persist data in the event of a failure in your analytics architecture. If this sounds like what you need, it might also be worth looking into RabbitMQ, ZeroMQ, or maybe Kestrel.
Storm is used for complex event processing. If you use Storm, you will be using ZeroMQ under the hood and will likely have to set up a spout hooked up to Kafka or RabbitMQ. If you need to do complicated munging of the data before storage, this might be the right option. There are other options you can use too, like Spark. I'm inclined to suggest Storm purely out of personal preference. I heard that LinkedIn was releasing a real-time complex event processing framework as well, but I can't remember its name; I'll update the post when I find it.
On a different note, if you're asking this question, it might be because you haven't built this thing yet. If that is the case, you might want to look into something other than hadoop if you need streaming. The ecosystem is rapidly expanding, and there are probably many ways to do what you want to do.
Apache Kafka is a distributed messaging system. Very briefly: you push (publish) messages into a Kafka queue using a Kafka producer, and on the other end you consume them using a Kafka consumer (subscriber). The messages/feeds can be divided into categories called topics. You can run Kafka as a cluster, which makes it very scalable and lets it be expanded without any downtime.
It could be a nice choice for holding your social media streams. Kafka retains the messages pushed to it for a configurable time, and the best part, from their documentation, is that they say:
Kafka's performance is effectively constant with respect to data size so retaining lots of data is not a problem.
Check out the docs for better visibility.
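As a minimal sketch of that publish/consume flow, assuming the kafkajs Node.js client and an illustrative social-feed topic (broker address and names are placeholders):
// Sketch only: topic, group and broker names are illustrative.
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'social-demo', brokers: ['localhost:9092'] });

async function run() {
  // Producer: push social-media messages onto a topic
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: 'social-feed',
    messages: [{ value: JSON.stringify({ user: 'alice', text: 'hello' }) }]
  });

  // Consumer: subscribe and hand each message to downstream processing (Storm, an HDFS loader, ...)
  const consumer = kafka.consumer({ groupId: 'ingest' });
  await consumer.connect();
  await consumer.subscribe({ topic: 'social-feed', fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      console.log('consumed:', message.value.toString());
    }
  });
}

run().catch(console.error);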
Now, Storm is a very scalable, fault-tolerant distributed computation system that can easily be integrated with any queueing system (like Kafka) or database (HDFS, Cassandra, etc.). So you can feed your messages into a Storm cluster for further processing based on your requirements. There is something called KafkaSpout that provides seamless integration between Storm and Kafka.
You should also look at the Kafka-Hadoop loader on GitHub, which creates a Hadoop job for incrementally loading messages from Kafka topics onto HDFS, with multiple-file output semantics.
Also, as Peter Klipfel said:
you might want to look into something other than hadoop if you need streaming
You can also check other available alternatives, such as Apache Cassandra, which works great with streaming data at very low latency.
I think it depends on where you are pulling the data and what you are trying to do with the data.
An alternative is to use IBM Streams, where you can pull directly from social media streams and store the data in many different data stores of your choice.
For example, you can use the streamsx.social toolkit from here: https://github.com/IBMStreams/streamsx.social which allows you to pull tweets directly from an HTTP stream.
Once you get data into Streams, the product also provides many adapters that allow you to store the streaming data into a datastore (e.g. HDFS using streamsx.hdfs, HBase using streamsx.hbase).
I think another consideration is what kind of analytics you are doing with the social media data. If you would like to analyze the social data in-stream before it is stored, IBM Streams also provides a text toolkit that allows you to extract insight from the unstructured text of the social data. You can analyze the data without really having to store it anywhere.
Hope it helps!
