I have an existing Avro RPC client that sends data to an Avro RPC server. The Avro RPC server currently writes the data into HDFS (and does other things as well). We are changing our server processes to be based upon Storm. I am hoping to find an easy way to get my data into Storm, hopefully using the Avro RPC messages I have now.
I have been looking around for a way to do this, so far with no success. Storm has an RPC model, but it appears to be limited to passing strings, which I want to avoid (why I went to Avro in the first place). Zeromq might be a possibility, but seems limited for what I am trying to do.
Can someone suggest an elegant way for me to get my Avro RPC, schema-based data, into Storm??
Thanks!!!!
so... have not figured out a way to do this directly... but the solution we came up with was a Storm call back procedure that would ask Avro RPC for data. So, basically we switched the client/server relationship. Seems to work well.
Related
We have data being created by a simulated device being put on the network with NanoMSG with a payload of Google FlatBuffers (binary).
We would like to trigger on patterns of this data with OpenWhisk, and respond with Flatbuffer encoded responses.
Assume latency and throughput are not a big concern here.
Which approach can we take:
Write a repeater that converts the Flatbuffer to JSON (FB has a utility to do this) and then place the data onto an AMQP buss which is listened to by OpenWhisk? (we have folks familiar with AMQP, but not Kafka
Try to do something with Kafka, which seems (maybe it is only the IBM version) to directly handle the binary Flabuffers (probably still need a shim from NanoMSG to Kafka. E.g.
How to invoke an OpenWhisk action from IoT Platform in Bluemix
https://medium.com/openwhisk/serverless-transformation-of-iot-data-in-motion-with-openwhisk-272e36117d6c
Not sure if we still don't need the Flatbuffers JavaScript deserializer and serializer to convert the binary based64 data in JavaScript to JSON
Learn Kakfa, and then transform the NanoMsg payload (Flatbuffers to JSON).
Something else?
Anyone have direct experience in this?
Update
Thank you James, those are spot-on links. But it does raise some secondary issues:
If the data is in Google FlatBuffers schema, it does not seem to be any advantage to using Kafka binary transformation, since the mux/demux from base64 still needs to be done in the javascript layer.
It is slightly disturbing that Kafka (which is known for its low latency) is batching the events. That does effect latency when one has Iot (sensor data) that needs to be responded to in a closed-loop to actuators (sensor->control->actuators) is a common robotics model, and that is pretty much close to what we are doing. For the moment we are not pushing the latency issue, but I can see emerging cases where we will need the low latency. What is the thinking in the Kafka Whisk provider community about this?
I must be missing something, but the AMQP provider says it is using RHEA https://github.com/amqp/rhea#receiver . That seems to provide all one needs in terms of writing a simple trigger/rules for dealing with sensor stream data. Why would one use OpenWhisk?
Either option makes sense. OpenWhisk actions receive and return JSON messages. Binary data passed into those functions must be Base64 encoded.
If you use an AMQP feed, you can convert the binary data to JSON manually.
The Kafka feed provider does support automatic encoding of the binary input values (using the isBinary* parameters).
Kafka feeds push batches of messages to the OpenWhisk actions. This is different from a message queue, which would push one message at a time. This feed provider is built-in OpenWhisk.
There is an external community feed provider for AMQP here. This would need you to install and run it manually.
Has something like this been done before? If not, what would be involved in getting NiFi to ingest a stream arriving over a WebSocket with Google FlatBuffers?
(would a simple TCP stream make it easier or harder?)
UPDATE
I have a C++ program that is running on a node, which is collecting data and publishing it via nanomessage pub/sub channel over a websocket. The data in C++ looks like structs, and I am serializing it with Google Flatbuffers. It is a very simple struct, think of csv records. We have a team member who wants to capture this data with NiFi and put it to a database.
Personally, since Flatbuffers supports conversion of binary to JSON, I think this is almost easier just writing a short C#, python, java or javascript program to receive the flabuffers, open a DB connection, and dump the data. (maybe convert to JSON first, if needed).
To my knowledge, NiFi does not have an integration with the nanomsg library/protocol out of the box. This would likely require writing a custom processor for that is capable of consuming nanomsg packets using the nanomsg PUBSUB pattern / socket types.
One could use existing processors, such as the Consume* processors (ConsumeKafka, ConsumeJMS) as an example / guide for how to write a processor that consumes messages from a topic/queue that follows the pub/sub pattern.
You would then want to transform the payload from Flatbuffers binary to a format insertable into the desired database. Again, a custom processor using code generated from your Flatbuffer schema would probably be the right approach for that.
As you mention, this could also be accomplished with a simple program. If you wrote that program in Java (using the Java nanomsg and flatbuffers libs) as a prototype/proof-of-concept, then it could be refactored into one or more NiFi custom processors in the future if you wish to move to NiFi.
I need a mechanism to send data from node-red, to be stored in HDFS (Hadoop).
I prefer the data to be streamed. I am thinking about using the 'websocket out' node to write the data to it and use a Flume agent to read.
I am new to node-red.
Could you please let know if I am in the right direction and clarify with some details if I am not? Any alternate approach should also be fine.
Update: node-red offers 'bluemixhdfs' node which is exclusively tied up with IBM bluemix whereas I am using only a vanilla hadoop.
I recently had the similar issue for a small project of mine. So I try to explain my approach.
A little background: In the application, I had to do some processing on real-time streaming data from different data sources. At the same time, I also needed to store the streaming data for future processing.
I used Apache Kafka message broker as an integration agent between Node-RED and HDFS (and also for Apache Spark Stream processing engine).
In Node-RED, I used Kafka node to publish streaming data from different data sources to separate topics in Kafka.
Node-RED flow with Streaming data sources and Apache Kafka
HDFS Sink Connector, a Kafka Connect component, is then used to store the streaming data to the HDFS.
Flow Architecture for Node-RED to HDFS and Spark Streaming using Kafka Message broker
This approach can also be adopted when many streaming data sources like IoT sensors, Stock market data, Social media data, weather api, etc. are to be connected as a single flow using Node-RED and then want to use HDFS for storing these data for further processing.
I'm afraid that I'm not a Hadoop expert and so probably can't provide an answer directly. However it looks like Kafka supports websockets and this should be reasonably performant.
Depending on your architecture though, you should pay some attention to websocket security. Unless NR and Hadoop are both on a private secured network, websockets may be tricky to secure properly.
I think that websocket performance would be reasonable as long as the data size per transaction isn't too large (kb rather than Gb). You will need to do some testing though as there are too many factors influencing the performance of Node-RED to easily predict whether it will have the performance you require.
Node-RED supports a great many types of connectivity so if websockets don't work in your architecture, there are plenty of others such as UNIX pipes, TCP or UDP connections.
I have gone through the official documentation at https://www.elastic.co/blog/found-interfacing-elasticsearch-picking-client
But it does not give any benchmarks or performance numbers to help choose among the clients. And I am finding it non-trivial to setup a TransportClient or setup a NodeClient because the documentation for that is also really sparse with little to no examples whatsoever.
So if someone has already done some benchmarking on choosing a client, I would really appreciate that and focus more on tuning an established client rather than evaluating what client to choose.
Our application is a write-heavy application and we plan to have a 50-shard, 50-replica ES cluster for that.
All those clients are fine for querying and they all have their pros and cons (below list is not exhaustive):
A Node client provides a single hop into the cluster but since it will also be part of the cluster it can also induce too much chatter within the cluster
A Transport client is not part of the cluster, hence requires a two-hop roundtrip, and communicates with a single node at a time in a round-robin fashion (from the list provided during its construction)
Jest is basically the missing client for the ES REST interface
If you feel like you don't need all what Jest has to offer and simply want to interact with a few endpoints, you might as well create your own REST client by using Spring REST template, Apache HTTP, etc
If you're going to have a write-heavy application I suggest you don't even use any of those clients at all. The main reason is that they are all synchronous in nature and if any component of your architecture or the network were to fail for some reason, then you'd lose data, and that might not be an option for you.
If you have plenty of data to ingest, you normally go the asynchronous way, i.e. storing your data in a temporary (yet durable) queue (Kafka, Redis, JMS, etc) and then let another process stream it to ES. There are many ways to do that, but a very simple one is to use Logstash for that.
Whether you decide to store your data in Kafka or JMS or Redis, you can then let Logstash consume your data and stream it to ES, i.e. you let Logstash worry about the heavy write part, which it does very well. That can be achieved very easily with
a kafka or redis or stomp input
a few filters to massage your data
an elasticsearch output to forward the resulting data to ES via the bulk endpoint.
With that kind of well-tuned setup, you can handle very heavy write loads without needing to worry about which client you want to use and how you need to tune it. The question is still open for querying, though, but since the write part is paramount in your case, you need to make it solid, the only serious way is by going asynchronous and let a well-developed and tested ETL (such as Logstash, or fluentd, etc) do it for you.
UPDATE
It is worth noting that as of ES 5.0, there will be a new Java REST client available.
i am searching for technologies that i can use in order to stream data from social media
to hadoop.
i searched and found those tech
Flume.
Storm.
Kafka.
which tool is the best? and why? does anyone familiar with some other tools ?
Most likely, you will want to use Flume as it is built to work with hdfs. However, as with all things, it depends.
Kafka is basically a queuing system that is usually used to persist data in the event of a failure in your analytics architecture. If this sounds like what you need, it might be worth looking into RabbitMQ, ZeroMQ, or maybe Kestrel.
Storm is used for complex event processing. If you use storm, you will be using zeroMQ under the hood, and will likely have to set up a spout that is hooked up to kafka or RabbitMQ. IF you need to do complicated munging of the data before storage, this might be the right option. There are other options that you can use too like spark. I'm inclined to suggest storm purely out of personal preference. I heard that linkedin was releasing a realtime complex event processing framework as well, but I can't remember the name of it. I'll update the post when I can find it.
On a different note, if you're asking this question, it might be because you haven't built this thing yet. If that is the case, you might want to look into something other than hadoop if you need streaming. The ecosystem is rapidly expanding, and there are probably many ways to do what you want to do.
Apache Kafka is a distributed messaging system. In very brief its like you pushed (published) some messages into a Kafka Queue using a KafKa producer and On the other end you consumed it using a Kafka consumer (subscriber). The messages/feeds can be divided into categories called Topic. Now you can run Kafka in cluster which makes it very scalable and can be expanded without any downtime.
It could be a nice choice for holding your social media streams. Kafka retains the message pushed to it for a configurable time and the best part is from their documentation they say
Kafka's performance is effectively constant with respect to data size so retaining lots of data is not a problem.
Check out the doc for more better visibility.
Now Storm is a very scalable, fault-tolerant distributed computation system which can easily be integrated with any queueing (like Kafka) or databases (HDFS/Cassandra etc). So you can feed your messages to a storm cluster for further processing based on your requirement. There is something called KafkaSpout which does a seamless integration between storm and kafka.
You should also look at the Kafka-hadoop loader #github which creates Hadoop Job for incremental loading messages from Kafka topics onto hdfs with multiple file output semantics
Also as #Peter Klipfel said that:
you might want to look into something other than hadoop if you need streaming
You can also check for other alternatives available like Apache Cassandra ,works great with streaming data with a very low latency.
I think it depends on where you are pulling the data and what you are trying to do with the data.
An alternative is to use IBM Streams where you can pull directly from social media streams and store to many different data store of your choice.
For example, you can use the streamsx.social toolkit from here: https://github.com/IBMStreams/streamsx.social which allows you to pull tweets directly from an HTTP stream.
Once you get data into Streams, the product also provides many adapters that allow you to store the streaming data into datastore (e.g. HDFS using streamsx.hdfs, HBase using streamsx.hbase.)
I think another consideration is what kind of analytics are you doing with the social media data. If you would like to analyze the social data in-stream before the data is stored, IBM Streams also provides a text toolkit that allows you to extract insight from the social data unstructured text. You can analyze the data without really having to store it anywhere.
Hope it helps!