Use case for Apache NiFi? - apache-nifi

I was wondering if the following IoT problem/use case would be an intended fit for using Apache NiFi:
I use NB-IoT/LTE-M as the connectivity means for sending messages to an IoT cloud platform (e.g. AWS IoT Core, Azure IoT Hub or others). I need a protocol converter/gateway for messages entering as UDP or TCP and leaving as MQTT. Of course I could develop a UDP/TCP listener/server that listens for the incoming messages and publishes them to the desired IoT cloud platform's (MQTT) broker. But I was thinking of eventually using Apache NiFi, as it has processors for UDP, TCP and MQTT. However, I was wondering whether Apache NiFi is meant for this kind of (IoT) scenario?
Thanks.
Guy

We are using Apache NiFi to ingest and route IoT data at scale. I had to write a custom processor because of a proprietary IoT protocol; assembling the rest of the flow, however, has been just drag and drop. Before you invest in developing your own UDP/TCP listener/server, at least try NiFi and see if you can solve your problem. With NiFi you can design your directed graphs of data routing pretty fast and have a short learning feedback loop.
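In case it helps, a custom processor does not need much boilerplate. Below is a minimal sketch only (the processor name and the pass-through "conversion" are made up for illustration); the stock ListenUDP/ListenTCP and PublishMQTT processors already cover the standard protocols with no code at all:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.annotation.documentation.Tags;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

// Hypothetical processor that rewrites a proprietary binary payload
// into a normalized form before it is published via MQTT downstream.
@Tags({"iot", "protocol", "conversion"})
public class ConvertProprietaryPayload extends AbstractProcessor {

    public static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Successfully converted messages")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Rewrite the FlowFile content. The real decoding logic depends on the
        // proprietary protocol and is omitted here; this sketch just copies bytes through.
        flowFile = session.write(flowFile, (InputStream in, OutputStream out) -> in.transferTo(out));
        session.transfer(flowFile, REL_SUCCESS);
    }
}
```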
Further, think about:
What will limit the ability to grow the system?
Which resource constraints are important to pay attention to? E.g. metric volume, velocity, variety, volatility
How big can it get? Do you need resiliency?
With a clustered NiFi you can spread your workload across multiple instances and satisfy the growth and resiliency requirements. You can also merge data and throttle its volume to protect downstream systems. The capabilities of NiFi are very versatile.
To answer your question: yes, Apache NiFi is actively used for IoT scenarios. There is even a NiFi IoT tutorial on Cloudera: https://www.cloudera.com/tutorials/nifi-in-trucking-iot.html

Related

Difference between Apache NiFi and Apache Heron (Storm)

I'm working on an academic project that involves working with streaming data from sensors. I've narrowed it down to Heron (the successor of Storm) and NiFi. Both have built-in support for back pressure, which is crucial for my project.
What are the main differences between Apache Nifi and Heron?
Which one is more suitable for IoT applications?
It basically comes down to stream processing vs data flow...
I think this summarizes some of the differences:
Difference between Apache Beam and Apache Nifi
In a nutshell -
NiFi is geared more toward data acquisition from devices and supports several protocols, while Heron is a stream-processing engine that allows for complex streaming computations on the data as it flows in from NiFi. Heron can run alongside NiFi on a single server, as Heron's footprint is small (around 200 MB for a local installation).

How can I send data from node-red to Hadoop?

I need a mechanism to send data from node-red, to be stored in HDFS (Hadoop).
I prefer the data to be streamed. I am thinking about using the 'websocket out' node to write the data and a Flume agent to read it.
I am new to node-red.
Could you please let know if I am in the right direction and clarify with some details if I am not? Any alternate approach should also be fine.
Update: node-red offers a 'bluemixhdfs' node, but it is tied exclusively to IBM Bluemix, whereas I am using plain vanilla Hadoop.
I recently had a similar issue in a small project of mine, so I will explain my approach.
A little background: In the application, I had to do some processing on real-time streaming data from different data sources. At the same time, I also needed to store the streaming data for future processing.
I used the Apache Kafka message broker as an integration agent between Node-RED and HDFS (and also for the Apache Spark stream-processing engine).
In Node-RED, I used the Kafka node to publish streaming data from the different data sources to separate topics in Kafka.
Node-RED flow with Streaming data sources and Apache Kafka
The HDFS Sink Connector, a Kafka Connect component, is then used to store the streaming data in HDFS.
Flow Architecture for Node-RED to HDFS and Spark Streaming using Kafka Message broker
This approach can also be adopted when many streaming data sources (IoT sensors, stock market data, social media data, weather APIs, etc.) are to be connected in a single Node-RED flow and the data then needs to be stored in HDFS for further processing.
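For illustration, here is roughly what publishing a sensor reading to a Kafka topic looks like with the plain Java client (the broker address, topic name and payload below are made-up examples; in the actual flow the Node-RED Kafka node does this step for you):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SensorProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One topic per data source, mirroring the separate topics used in the Node-RED flow.
            String payload = "{\"sensorId\":\"temp-01\",\"value\":21.7}";
            producer.send(new ProducerRecord<>("sensor-temperature", "temp-01", payload));
        }
        // The HDFS Sink Connector (Kafka Connect) then drains the topic into HDFS.
    }
}
```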
I'm afraid that I'm not a Hadoop expert and so probably can't provide an answer directly. However it looks like Kafka supports websockets and this should be reasonably performant.
Depending on your architecture though, you should pay some attention to websocket security. Unless NR and Hadoop are both on a private secured network, websockets may be tricky to secure properly.
I think that websocket performance would be reasonable as long as the data size per transaction isn't too large (KB rather than GB). You will need to do some testing though, as there are too many factors influencing the performance of Node-RED to easily predict whether it will deliver the performance you require.
Node-RED supports a great many types of connectivity so if websockets don't work in your architecture, there are plenty of others such as UNIX pipes, TCP or UDP connections.

Architecture of syncing logs to hadoop

I have different environments across a few cloud providers: Windows servers and Linux servers on Rackspace, AWS, etc. There is a firewall between these and the internal network.
I need to build a real-time setup in which all newly generated IIS and Apache logs are synced to an internal big data environment.
I know there are tools like Splunk or Sumo Logic that might help, but we are required to implement this logic with open-source technologies. Due to the firewall, I am assuming I can only pull the logs rather than have the cloud providers push them.
Can anyone share with me the rule of thumb or a common architecture for syncing tons of logs in NRT (near real time)? I have heard of Apache Flume and Kafka and am wondering whether those are required, or whether it is just a matter of using something like rsync.
You can use rsync to get the logs, but you can't analyze them the way Spark Streaming or Apache Storm does.
You can go ahead with one of these two options.
Apache Spark Streaming + Kafka
OR
Apache Storm + Kafka
Have a look at this article about integration approaches of these two options.
Have a look at this presentation, which covers an in-depth analysis of Spark Streaming and Apache Storm.
Performance depends on your use case. Spark Streaming can be up to 40x faster than Storm. But if you add "reliability" as a key criterion, then data should be moved into HDFS first before being processed by Spark Streaming, which will reduce the final throughput.
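A rough sketch of the Spark Streaming + Kafka option is shown below (topic names, the broker address and the HDFS path are placeholders; it assumes the spark-streaming-kafka-0-10 integration and that a collector on the cloud side ships the logs into Kafka):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class LogsToHdfs {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("log-sync");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka:9092");      // assumed broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "log-sync");

        // One Kafka topic per log source (IIS and Apache logs shipped in from the edge).
        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Arrays.asList("iis-logs", "apache-logs"), kafkaParams));

        // Write each micro-batch to a time-stamped HDFS directory.
        stream.map(ConsumerRecord::value)
              .foreachRDD((rdd, time) -> {
                  if (!rdd.isEmpty()) {
                      rdd.saveAsTextFile("hdfs://namenode:8020/logs/" + time.milliseconds());
                  }
              });

        jssc.start();
        jssc.awaitTermination();
    }
}
```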
Reliability Limitations: Apache Storm
Exactly once processing requires a durable data source.
At least once processing requires a reliable data source.
An unreliable data source can be wrapped to provide additional guarantees.
With durable and reliable sources, Storm will not drop data.
Common pattern: Back unreliable data sources with Apache Kafka (minor latency hit traded for 100% durability).
Reliability Limitations: Spark Streaming
Fault tolerance and reliability guarantees require HDFS-backed data source.
Moving data to HDFS prior to stream processing introduces additional latency.
Network data sources (Kafka, etc.) are vulnerable to data loss in the event of a worker node failure.

sending value from cc3200 to my server using mqtt

How can I make my server accept the data sent by the CC3200 through the MQTT protocol? I made the CC3200 publish values successfully to my server's IP address, but I don't know what I should do to make my server dump those incoming values into its database. I use XAMPP for the server functionality.
Any suggestions?
I am using the HiveMQ broker.
If your primary goal is to have some telemetry data from CC3200 stored in the database, I would suggest that you take a look at this webinar. You can configure Kaa server to use one of multiple existing log appenders to publish your data to Spark, Cassandra, MongoDB, HDFS, Couchbase, etc. There are several major benefits of doing data collection with Kaa:
All of the data is structured end-to-end. You define the telemetry data model in the Kaa UI, which translates into Avro-compatible schemas and generates object bindings in the Kaa SDK. Instead of writing boilerplate code for data marshalling, you just invoke SDK functions like this: kaa_logging_add_record(kaa_client_get_context(kaa_client)->log_collector, log_record), where log_record is a structure auto-generated by Kaa based on your data model. On the other end, in your analytics system, you receive structured data that you can immediately start processing and querying - no need for custom interpretation code; it's auto-generated for you.
You can write to several destinations simultaneously: for example, save telemetry data into HDFS for warehousing, send to Spark for stream analytics, and push to your custom data processing/visualization service with REST. All of this is configurable by adding log appenders through the Kaa administrative UI.
Kaa takes care of the data delivery reliability and consistency. You can set up one or more reliable log appenders. It is not until all of the configured reliable appenders acknowledge a successful write that the client is instructed to remove the local data copy.
Kaa server is scalable and reliable out-of-the box. There is no single point of failure in the cluster. You can add more server capacity on the fly by spinning off more nodes. They would register against Zookeeper and the cluster would automatically rebalance the load. If there is a node failure, the clients automatically migrate to the remaining nodes.
Kaa is transport agnostic, so you can plug in pretty much any transport protocol implementation you like, including MQTT. The default protocol is similar to MQTT in the amount of overhead it introduces.
The integration instructions specifically for CC3200 are being prepared for the upcoming 0.8.0 release here.
Disclaimer: I work for a company behind Kaa open-source IoT platform.

Scaling multi-channel pub/sub via web-sockets

I have been looking into this gist which provides a minimal functional implementation of channelled pub/sub style communication over websockets.
For multiple channels we can have a local hash of EM::Channel instances, created on the fly as required. What I am concerned with is how this setup can be scaled to support a cluster of server instances, or what alternatives are available to facilitate channelled pub/sub via websockets that are usable in clustered deployments.
The Jet Protocol provides strict pub/sub (no-poll) semantics and is open source. It is far more powerful than subscribing to "channels" (this is called "fetching" in Jet terminology).