Difference between Apache NiFi and Apache Heron (Storm)

I'm working on an academic project that involves processing stream data from sensors. I've narrowed my options down to Heron (the successor to Storm) and NiFi. Both have built-in support for back pressure, which is crucial for my project.
What are the main differences between Apache NiFi and Heron?
Which one is more suitable for IoT applications?

It basically comes down to stream processing vs data flow...
I think this summarizes some of the differences:
Difference between Apache Beam and Apache NiFi

In a nutshell -
NiFi is geared more toward data acquisition from devices and supports several protocols, while Heron is a stream processing engine that allows for complex streaming computations on the data as it flows in from NiFi. Heron can run alongside NiFi on a single server, since Heron's footprint is small, around 200 MB for a local installation.
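To give a sense of what such streaming computations look like on the Heron side: Heron is designed to be API-compatible with Storm, so a topology can be sketched against the familiar spout/bolt API and run on Heron by swapping in its Storm-compatibility dependency. The sensor spout and threshold bolt below are illustrative stand-ins, not part of the original answer; in practice the spout would usually read from Kafka or a NiFi output port.

```java
// Minimal Storm-style topology sketch (Heron runs Storm-API topologies via its
// compatibility layer). All names and the threshold value are hypothetical.
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class SensorTopology {

    // Stand-in spout that emits fake sensor readings every 100 ms.
    public static class SensorSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("sensor-" + random.nextInt(3), random.nextDouble() * 100));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sensorId", "reading"));
        }
    }

    // Example computation: flag readings above a threshold.
    public static class ThresholdBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            double reading = tuple.getDoubleByField("reading");
            if (reading > 90.0) {
                System.out.println("ALERT " + tuple.getStringByField("sensorId") + ": " + reading);
            }
            collector.ack(tuple);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: nothing emitted downstream
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sensors", new SensorSpout());
        builder.setBolt("threshold", new ThresholdBolt()).fieldsGrouping("sensors", new Fields("sensorId"));

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("sensor-topology", new Config(), builder.createTopology());
        Utils.sleep(30_000);
        cluster.shutdown();
    }
}
```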

Related

Use case for Apache NiFi?

I was wondering whether the following IoT problem/use case would be a good fit for Apache NiFi:
I use NB-IoT/LTE-M as the connectivity means for sending messages to an IoT cloud platform (e.g. AWS IoT Core, Azure IoT Hub or others). I need a protocol converter/gateway for messages that enter as UDP or TCP and leave as MQTT. Of course, I could develop a UDP/TCP listener/server that listens for the incoming messages and publishes them to the desired IoT cloud platform's (MQTT) broker. But I was thinking of eventually using Apache NiFi, as it has processors for UDP, TCP and MQTT. However, I was wondering whether Apache NiFi is meant for this kind of (IoT) scenario?
Thanks.
Guy
We are using Apache NiFi to ingest and route IoT data at scale. I had to write a custom processor because of a proprietary IoT protocol, but assembling the rest of the flow has been just drag and drop. Before you invest in developing your own UDP/TCP listener/server, at least try NiFi and see if you can solve your problem with it. With NiFi you can design your directed graphs of data routing pretty fast and have a short learning feedback loop.
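For comparison, the hand-rolled route mentioned above might look roughly like the sketch below: a UDP listener that republishes each datagram to an MQTT broker. It assumes java.net.DatagramSocket plus the Eclipse Paho MQTT client; the broker URL, port, and topic are placeholders, none of which come from the original answer. In NiFi the equivalent is typically a ListenUDP (or ListenTCP) processor wired to a PublishMQTT processor, with no code at all.

```java
// Hypothetical UDP-to-MQTT bridge sketch (Eclipse Paho assumed for the MQTT side).
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.util.Arrays;

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class UdpToMqttBridge {
    public static void main(String[] args) throws Exception {
        // Placeholder broker URL and client id.
        MqttClient mqtt = new MqttClient("tcp://broker.example.com:1883", "udp-bridge");
        mqtt.connect();

        try (DatagramSocket socket = new DatagramSocket(5000)) {   // placeholder UDP port
            byte[] buffer = new byte[2048];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
                socket.receive(packet);                            // blocks until a datagram arrives
                byte[] payload = Arrays.copyOf(packet.getData(), packet.getLength());
                MqttMessage message = new MqttMessage(payload);
                message.setQos(1);                                 // at-least-once delivery
                mqtt.publish("devices/ingest", message);           // placeholder topic
            }
        }
    }
}
```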
Further, think about:
What will limit the ability to grow the system?
Which resource constraints are important to pay attention to? E.g. metric volume, velocity, variety, volatility
How big can it get? Do you need resiliency?
With clustered NiFi you can spread your workload across multiple instances and satisfy the growth and resiliency requirements. You can also merge data and throttle its volume to protect downstream systems. The capabilities of NiFi are very versatile.
To answer your question: yes, Apache NiFi is actively used for IoT scenarios. There is even a NiFi IoT tutorial from Cloudera: https://www.cloudera.com/tutorials/nifi-in-trucking-iot.html

Difference between Apache Beam and Apache NiFi

What are the use cases for Apache Beam and Apache NiFi?
Both of them seem to be data flow engines. If they have similar use cases, which of the two is better?
Apache Beam is an abstraction layer for stream processing systems like Apache Flink, Apache Spark (streaming), Apache Apex, and Apache Storm. It lets you write your code against a standard API, and then execute the code using any of the underlying platforms. So theoretically, if you wrote your code against the Beam API, that code could run on Flink or Spark Streaming without any code changes.
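As a rough illustration of that portability, here is a minimal Beam word-count sketch (file paths and names are placeholders). The same pipeline code is meant to run on the Direct, Flink, or Spark runner simply by passing a different --runner option at launch.

```java
// Minimal, runner-agnostic Beam pipeline sketch: count words in a text file.
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamWordCount {
    public static void main(String[] args) {
        // The runner (DirectRunner, FlinkRunner, SparkRunner, ...) is chosen via --runner=...
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            .apply("ReadLines", TextIO.read().from("input.txt"))                  // placeholder path
            .apply("SplitWords", FlatMapElements
                .into(TypeDescriptors.strings())
                .via((String line) -> Arrays.asList(line.split("\\s+"))))
            .apply("CountWords", Count.perElement())
            .apply("FormatResults", MapElements
                .into(TypeDescriptors.strings())
                .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
            .apply("WriteCounts", TextIO.write().to("word-counts"));              // placeholder prefix

        pipeline.run().waitUntilFinish();
    }
}
```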
Apache NiFi is a data flow tool that is focused on moving data between systems, all the way from very small edge devices with the use of MiNiFi, back to the larger data centers with NiFi. NiFi's focus is on capabilities like visual command and control, filtering of data, enrichment of data, data provenance, and security, just to name a few. With NiFi, you aren't writing code and deploying it as a job, you are building a living data flow through the UI that is taking effect with each action.
Stream processing platforms are often focused on computations involving joins of streams and windowing operations, whereas a data flow tool is often complementary and used to manage the flow of data from the sources to the processing platforms.
There are actually several integration points between NiFi and stream processing systems... there are components for Flink, Spark, Storm, and Apex that can pull data from NiFi, or push data back to NiFi. Another common pattern would be to use MiNiFi + NiFi to get data into Apache Kafka, and then have the stream processing systems consume from Kafka.
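A hedged sketch of the consuming end of that pattern: once MiNiFi/NiFi publish into a Kafka topic, any stream processor, or even a plain Java client as below, can subscribe to it. The broker address, topic name, and group id are placeholders, not values from the original answer.

```java
// Plain Kafka consumer sketch reading whatever NiFi/MiNiFi published to a topic.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NifiTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");   // placeholder broker
        props.put("group.id", "stream-processor");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("nifi-ingest"));   // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```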

How can I send data from Node-RED to Hadoop?

I need a mechanism to send data from node-red, to be stored in HDFS (Hadoop).
I prefer the data to be streamed. I am thinking about using the 'websocket out' node to write the data and a Flume agent to read it.
I am new to Node-RED.
Could you please let me know if I am heading in the right direction, and clarify with some details if I am not? Any alternate approach would also be fine.
Update: Node-RED offers a 'bluemixhdfs' node, which is exclusively tied to IBM Bluemix, whereas I am using plain vanilla Hadoop.
I recently had a similar issue in a small project of mine, so I will try to explain my approach.
A little background: In the application, I had to do some processing on real-time streaming data from different data sources. At the same time, I also needed to store the streaming data for future processing.
I used the Apache Kafka message broker as an integration agent between Node-RED and HDFS (and also for the Apache Spark Streaming processing engine).
In Node-RED, I used the Kafka node to publish streaming data from different data sources to separate topics in Kafka.
[Figure: Node-RED flow with streaming data sources and Apache Kafka]
The HDFS Sink Connector, a Kafka Connect component, is then used to store the streaming data in HDFS.
[Figure: Flow architecture for Node-RED to HDFS and Spark Streaming using a Kafka message broker]
This approach can also be adopted when many streaming data sources (IoT sensors, stock market data, social media data, weather APIs, etc.) need to be connected in a single flow using Node-RED, with HDFS used to store the data for further processing.
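For reference, the HDFS Sink Connector mentioned above is driven by a small configuration; a hedged sketch of such a properties file is below. The connector name, topic, NameNode URL, and flush size are placeholders, and the exact keys depend on the connector version.

```properties
# Sketch of a Kafka Connect HDFS sink configuration (Confluent HDFS Sink Connector assumed).
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=sensor-readings
hdfs.url=hdfs://namenode:8020
flush.size=1000
```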
I'm afraid that I'm not a Hadoop expert and so probably can't provide an answer directly. However it looks like Kafka supports websockets and this should be reasonably performant.
Depending on your architecture though, you should pay some attention to websocket security. Unless NR and Hadoop are both on a private secured network, websockets may be tricky to secure properly.
I think that websocket performance would be reasonable as long as the data size per transaction isn't too large (kb rather than Gb). You will need to do some testing though as there are too many factors influencing the performance of Node-RED to easily predict whether it will have the performance you require.
Node-RED supports a great many types of connectivity so if websockets don't work in your architecture, there are plenty of others such as UNIX pipes, TCP or UDP connections.

Architecture of syncing logs to hadoop

I have different environments across a few cloud providers, such as Windows and Linux servers in Rackspace, AWS, etc., and there is a firewall between them and the internal network.
I need to build a real-time environment where all newly generated IIS and Apache logs are synced to an internal big data environment.
I know there are tools like Splunk or Sumo Logic that might help, but we are required to implement this logic with open source technologies. Because of the firewall, I am assuming I can only pull the logs rather than have the cloud providers push them.
Can anyone share the rule of thumb or a common architecture for syncing large volumes of logs in NRT (near real time)? I have heard of Apache Flume and Kafka and am wondering whether those are required, or whether it is just a matter of using something like rsync.
You can use rsync to get the logs, but you can't analyze them the way Spark Streaming or Apache Storm does.
You can go ahead with one of these two options.
Apache Spark Streaming + Kafka
OR
Apache Storm + Kafka
Have a look at this article about integration approaches for these two options.
Have a look at this presentation, which covers an in-depth analysis of Spark Streaming and Apache Storm.
Performance is dependent on your use case. Spark Streaming is 40x faster than Storm processing. But if you add "reliability" as a key criterion, then data should be moved into HDFS first before being processed by Spark Streaming, which will reduce the final throughput. (A sketch of the Spark Streaming + Kafka wiring follows the lists below.)
Reliability Limitations: Apache Storm
Exactly once processing requires a durable data source.
At least once processing requires a reliable data source.
An unreliable data source can be wrapped to provide additional guarantees.
With durable and reliable sources, Storm will not drop data.
Common pattern: Back unreliable data sources with Apache Kafka (minor latency hit traded for 100% durability).
Reliability Limitations: Spark Streaming
Fault tolerance and reliability guarantees require an HDFS-backed data source.
Moving data to HDFS prior to stream processing introduces additional latency.
Network data sources (Kafka, etc.) are vulnerable to data loss in the event of a worker node failure.
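To make the "Apache Spark Streaming + Kafka" option above concrete, here is a minimal sketch using the spark-streaming-kafka-0-10 direct stream. The broker address, topic names, and batch interval are placeholders, and the job only counts log lines per micro-batch; a real job would parse and aggregate them.

```java
// Spark Streaming + Kafka sketch: read log lines from Kafka topics, count them per micro-batch.
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class LogStreamJob {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("log-stream").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "kafka-broker:9092");   // placeholder broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "log-analyzer");

        Collection<String> topics = Arrays.asList("iis-logs", "apache-logs");   // placeholder topics
        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

        // Count log lines in each 5-second micro-batch.
        stream.map(ConsumerRecord::value)
              .count()
              .print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```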

Streaming data [Hadoop/MapReduce] - What are the challenges?

I have read about streaming data in many places, but I am just trying to understand the challenges faced when processing it with the MapReduce technique,
i.e. the reason behind the existence of frameworks like Apache Flume, Apache Storm, etc.
Please share your advice and thoughts.
Thanks,
Ranit
There are many technologies out there, and many of them run on the Hadoop framework.
The older Hadoop services like Hive tend to be slow, and are usually used for batch jobs, not for streaming.
As streaming becomes more and more a necessity, other services have surfaced like Storm or Spark that are designed for faster execution and integration with messaging queues like Kafka for streaming.
In data analytics, though, most of the processing is not all real time: historical data may be processed in batch mode to extract models that are then used for real-time analytics, so a 'streaming' system is usually based on a Lambda Architecture http://lambda-architecture.net/
A service like Spark tries to integrate all of the components, with Spark Streaming for the speed layer, Spark SQL for the serving layer, and Spark MLlib for the modeling, all based on the Hadoop Distributed File System (HDFS) for replicated large-volume storage.
Flume helps in directing the data from the sources to HDFS for raw storage, but in order to process it, Storm or Spark is used.
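As a rough illustration of that Flume hop, a minimal agent configuration along the lines below picks log files up from a spool directory and writes them to HDFS (the agent name, directory, and NameNode address are placeholders):

```properties
# Sketch of a minimal Flume agent: spooling-directory source -> memory channel -> HDFS sink.
agent1.sources = logsrc
agent1.channels = memch
agent1.sinks = hdfssink

agent1.sources.logsrc.type = spooldir
agent1.sources.logsrc.spoolDir = /var/log/incoming
agent1.sources.logsrc.channels = memch

agent1.channels.memch.type = memory
agent1.channels.memch.capacity = 10000

agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/raw/logs/%Y-%m-%d
agent1.sinks.hdfssink.hdfs.fileType = DataStream
agent1.sinks.hdfssink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfssink.channel = memch
```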
Hope that helps.
Your question is open ended, but I assume you want to understand the challenges of processing streaming data in a MapReduce environment.
1) MapReduce is primarily designed for batch processing. It is meant for processing high volumes of data at rest on disk.
2) Streaming data is high-velocity data coming from various sources such as web application click streams, social media logs, Twitter tags, and application logs.
3) The stream of events might be processed either in a stateless manner (assuming every event is unique) or in a stateful manner (e.g. collect the data for 2 seconds and then process it), whereas batch applications have no such requirement. A short windowed-count sketch follows this list.
4) Streaming applications want delivery/processing guarantees. For example, the framework must provide an "exactly once" delivery/processing mechanism so that it processes all the stream events without fail. This is not a challenge in batch processing, since all the data is available locally.
5) External connectors: streaming frameworks must support external connectivity to read data in real time from the various sources discussed in (2). This is not a challenge in batch processing, since the data is locally available.
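To illustrate the stateful, windowed case from point 3, here is a hedged sketch using Kafka Streams: count events per key over 2-second windows. The topic, broker address, and key scheme are placeholders, and newer Kafka Streams versions rename TimeWindows.of to ofSizeWithNoGrace; other frameworks expose similar windowing primitives.

```java
// Kafka Streams sketch: stateful processing that counts events per key in 2-second windows.
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedEventCount {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-event-count");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker:9092");   // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("click-stream");   // placeholder topic

        // Stateful step: group by key (e.g. user id) and count within 2-second windows.
        events.groupByKey()
              .windowedBy(TimeWindows.of(Duration.ofSeconds(2)))
              .count()
              .toStream()
              .foreach((windowedKey, count) -> System.out.println(windowedKey + " -> " + count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```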
Hope this helps.
