What is the difference between Apache Flume and Apache Storm? - hadoop

What is the difference between Apache Flume and Apache Storm?
Is it possible to ingest log data into a Hadoop cluster using Storm?
Both are used for streaming data, so can Storm be used as an alternative to Flume?

Apache Flume is a service for collecting large amounts of streaming data, particularly logs. Flume pushes data to consumers using mechanisms it calls data sinks. Flume can push data to many popular sinks right out of the box, including HDFS, HBase, Cassandra, and some relational databases.
Apache Storm is built for stream processing. It bridges batch processing and stream processing, the latter of which Hadoop is not natively designed to handle. Storm runs continuously, processing a stream of incoming data and dicing it into batches so that Hadoop can ingest it more easily. Data sources are called spouts and each processing node is a bolt; bolts perform computations on the data, including pushing output to data stores and other services.
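To make the spout/bolt terminology concrete, here is a hedged, minimal sketch of a Storm topology in Java (assuming the org.apache.storm API of Storm 1.x/2.x); LogSpout, LogBolt, the topology name, and the sample log line are all illustrative, and a real bolt would write to HDFS, HBase, or another store instead of printing:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

import java.util.Map;

public class LogTopology {

    // Spout: the data source; here it just emits a dummy log line once per second.
    public static class LogSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("127.0.0.1 - GET /index.html 200"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    // Bolt: a processing node; this one only prints, a real one might parse and persist.
    public static class LogBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("processed: " + tuple.getStringByField("line"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: no downstream stream is declared
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("log-spout", new LogSpout(), 1);
        builder.setBolt("log-bolt", new LogBolt(), 2).shuffleGrouping("log-spout");
        StormSubmitter.submitTopology("log-topology", new Config(), builder.createTopology());
    }
}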
If you need something that works out of the box, choose Flume, once you have decided whether pushing or pulling data makes more sense for you. If streaming data is, for now, just a small add-on to your already developed Hadoop environment, Storm is a good choice.
It is possible to ingest log data into a Hadoop cluster using Storm.
Storm can be used as an alternative to Flume.

Related

What is the best practice for NiFi production deployment?

I have a three-node NiFi cluster. We just installed the NiFi packages on Linux machines and clustered them with a separate ZooKeeper cluster. I am planning to monitor NiFi performance via Nagios, but we saw that Hortonworks Ambari also provides features for management and monitoring.
What is the best practice for NiFi deployment in production?
How should we scale up?
How can we monitor NiFi?
Should we monitor queue/process performance?
Should we use something like Ambari?
Regards.
Edit-1:
@James, actually I am collecting user event logs from several sources within the company.
All events are first written to Kafka. NiFi consumes from Kafka and does simple transformations, like promoting a field from the payload to an attribute.
After the transformations, the data is written to both Elasticsearch and HDFS. Before writing to HDFS we merge flowfiles, so writes to HDFS happen in batches.
I have around 50k events/s.

How to connect Elasticsearch to Apache Spark Streaming or Storm?

We are building a real-time big data tool with open-source tools. Our main goal is to supervise and analyze a network by getting logs from a Kafka server in real time. We saw in tutorials that we have to divide our tool into two sections, Analytics and Supervision, as shown below.
For the supervision section we chose Elasticsearch and Logstash.
Regarding the analytics section, my team and I are comparing Apache Spark Streaming and Apache Storm in order to use one of them with Elasticsearch. Even though Apache Storm is a true real-time data-processing tool and faster than Apache Spark Streaming, it does not provide machine learning libraries the way Apache Spark does. That's why we are thinking of choosing Apache Spark. The Elastic website indicates that there is a connector, ES-Hadoop, to connect an Elasticsearch database to the Hadoop ecosystem, as shown in the figure below.
However, we are a little confused by this picture because it shows only Spark SQL and not all the Spark frameworks (MLlib, Spark Streaming, etc.). We made some assumptions and came up with two final possible architectures. We only want to know whether they are technically correct and whether we are heading in the wrong direction.
With Apache Spark streaming:
With Apache Storm:
Both of your architectural diagrams are OK. Keep in mind that Spark Streaming will not work in this scenario. ES-Hadoop provides easy-access APIs to get data from and put data into Elasticsearch. It also provides methods to get the data into the Spark framework as RDDs, or as DataFrames in the case of Spark SQL. Once the data is in the framework, all the ML libraries can be applied to it for machine learning or analytics. Elasticsearch is not capable of streaming data, so Spark Streaming in the strict sense is not possible. So in the diagram, the arrow to the optional HDFS and then to Spark Streaming can be removed, and the arrow should just point to HDFS. My concern, however, would be running MLlib algorithms on the data in real time and expecting real-time performance. A typical use case might be to do the model generation offline and use the model in real time for analysis.
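For what it's worth, a minimal sketch of reading an Elasticsearch index into Spark through the ES-Hadoop connector might look like the following (assuming the elasticsearch-spark artifact is on the classpath; the index name "network-logs/_doc", the local master, and the Elasticsearch address are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

import java.util.Map;

public class EsToSpark {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("es-to-spark")
                .setMaster("local[*]")              // placeholder; use your real cluster master
                .set("es.nodes", "localhost:9200"); // placeholder Elasticsearch endpoint

        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            // Each document becomes a (docId, fieldMap) pair; from here MLlib or any
            // other Spark library can be applied to the data.
            JavaPairRDD<String, Map<String, Object>> docs =
                    JavaEsSpark.esRDD(jsc, "network-logs/_doc");
            System.out.println("documents read: " + docs.count());
        }
    }
}

This reads in batch mode, which matches the point above that Elasticsearch itself is not a streaming source.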

Does Apache Kafka Store the messages internally in HDFS or Some other File system

We have a project requirement of testing the data at the Kafka layer. JSON files are moving into the Hadoop area and Kafka is reading the live data in Hadoop (raw JSON files). Now I have to test whether the data sent from the other system and read by Kafka are the same.
Can I validate the data at Kafka? Does Kafka store the messages internally on HDFS? If yes, is it stored in a file structure similar to what Hive saves internally, i.e. a single folder for a single table?
Kafka stores data in local files (i.e., the local file system of each running broker). For those files, Kafka uses its own storage format, which is based on a partitioned append-only log abstraction.
The local storage directory can be configured via the parameter log.dir. This configuration happens individually for each broker, i.e., each broker can use a different location. The default value is /tmp/kafka-logs.
The Kafka community is also working on tiered storage, which will allow brokers not only to use local disks but also to offload "cold data" to a second tier: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
Furthermore, each topic has multiple partitions. How partitions are distributed is a Kafka-internal implementation detail, so you should not rely on it. To get the current state of your cluster, you can request metadata about topics, partitions, etc. (see https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader for a code example). Also keep in mind that partitions are replicated and, if you write, you always need to write to the partition leader (if you use a KafkaProducer, it will automatically find the leader for each partition you write to).
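As a hedged alternative to the low-level approach in that wiki page, topic and partition metadata (including the current leaders) can also be fetched with the AdminClient API; the broker address and topic name below are placeholders:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;

public class TopicMetadata {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("my-topic"))
                                         .all().get().get("my-topic");
            for (TopicPartitionInfo p : desc.partitions()) {
                // leader() is the broker a producer must write to for that partition
                System.out.printf("partition %d leader %s replicas %s%n",
                        p.partition(), p.leader(), p.replicas());
            }
        }
    }
}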
For further information, browse https://cwiki.apache.org/confluence/display/KAFKA/Index
I think you can, but you have to do that manually. You can let Kafka sink whatever output to HDFS. Maybe my answer is a bit late and this 'confluent' reference appeared after that, but briefly one can do the following:
Assuming all of your servers are running (check the Confluent website).
Create your connector:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics='your topic'
hdfs.url=hdfs://localhost:9000
flush.size=3
Note: this approach assumes that you are using their platform (the Confluent Platform), which I haven't used.
Fire up the kafka-hdfs streamer.
Also you might find more useful details in this Stack Overflow discussion.
This happens to most beginners. Let's first understand that a component you see used in big data processing may not be related to Hadoop at all.
YARN, MapReduce, and HDFS are the three core components of Hadoop. Hive, Pig, Oozie, Sqoop, HBase, etc. work on top of Hadoop.
Frameworks like Kafka or Spark are not dependent on Hadoop; they are independent entities. Spark supports Hadoop: YARN can be used as Spark's cluster manager and HDFS for storage.
In the same way Kafka, as an independent entity, can work with Spark. It stores its messages in the local file system.
log.dirs=/tmp/kafka-logs
You can check this at $KAFKA_HOME/config/server.properties
Hope this helps.
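To come back to the original question about validating the data at Kafka: since the messages live in Kafka's own log files rather than in HDFS, one practical way to check them is to consume them back and compare with what the source system sent. A minimal, hedged sketch with the standard Java consumer (topic name, broker address, and group id are placeholders):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ValidateTopic {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");       // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "validation-check");              // placeholder group
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");             // read from the beginning
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("raw-json"));                  // placeholder topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // Compare record.value() against the JSON the source system produced.
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}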

How to Stream Data To an EMR Cluster

I'd appreciate ideas on how to stream data from an on-premises Windows server to a persistent EMR cluster.
Some Background
I would like to run a persistent cluster running an MR job, much like the WordCount examples that are available. I would like to stream text from a local Windows server up to the cluster and have it processed by the running job.
All of the streaming WordCount examples I have reviewed always start with a static text file in S3 and don't cover how to implement anything to generate the stream.
Does this need to be treated in two parts?
Get the data first into S3
Stream it into the EMR cluster?
I have seen tools like Logstash, which run agents on the local server that tail the end of a web log and transfer it.
As you can probably tell, I'm a Windows guy stretching into EMR and, by association, Linux. Feel free to let me know if there is some way-cool command-line tool that already does this.
Thanks in advance.
Currently, EMR as-is only supports MR, Hive, Pig, HBase, and Impala. MR/Hive/Pig process data in a batch-oriented fashion, and data can't be streamed to them. HBase is a NoSQL database, and Impala is used for interactive ad-hoc queries.
For processing streaming data there are a lot of other options, like Storm, Samza, and S4. From AWS there is Kinesis, which recently became generally available.
Yes, a static file would go into S3 and then be the input to your EMR cluster job.
But I believe the fact that you want a persistent cluster implies you are streaming continuously from your Windows server. Is that the case?
If so, you need to create an AWS Kinesis stream and configure your producers, which put data into the stream's shards by calling PutRecord.
Start by reading "Developing Record Consumer Applications"
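For the producer side mentioned above, a hedged sketch of pushing a log line into a Kinesis stream with the AWS SDK for Java (v1) could look like this; the stream name, partition key, and log line are placeholders, and credentials/region are assumed to come from the default provider chain:

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class KinesisLogProducer {
    public static void main(String[] args) {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        String logLine = "2014-03-01 12:00:00 GET /index.html 200";  // placeholder log line
        PutRecordRequest request = new PutRecordRequest()
                .withStreamName("emr-ingest")                        // placeholder stream name
                .withPartitionKey("web-server-01")                   // determines the target shard
                .withData(ByteBuffer.wrap(logLine.getBytes(StandardCharsets.UTF_8)));

        kinesis.putRecord(request);
    }
}

An agent or scheduled task on the Windows server would run something like this for each new log line, and a consumer application (for example, one running on the EMR side) would read the stream and feed the job.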
I think you could use Apache Flume (https://flume.apache.org/).
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

What's the difference between Flume and Sqoop?

Both Flume and Sqoop are meant for data movement, so what is the difference between them? Under what conditions should I use Flume or Sqoop?
From http://flume.apache.org/
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Flume helps collect data from a variety of sources, like logs, JMS, and spooling directories. Multiple Flume agents can be configured to collect a high volume of data.
It scales horizontally.
From http://sqoop.apache.org/
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Sqoop helps move data between Hadoop and other databases, and it can transfer data in parallel for performance.
Both Sqoop and Flume pull data from the source and push it to the sink. The main difference is that Flume is event-driven, while Sqoop is not.
Flume:
Flume is a framework for populating Hadoop with data. Agents are populated throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.
Sqoop:
Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside Hadoop and instruct Sqoop to move data from Oracle, Teradata, or other relational databases to the target.
You can see the full Post
Flume:
A very common use case is collecting log data from one system (a bank of web servers) and aggregating it in HDFS for later analysis.
Sqoop:
On the other hand, Sqoop is designed for performing bulk imports of data into HDFS from structured data stores. A simple use case would be an organization that runs a nightly Sqoop import to load the day's data from a production DB into a Hive data warehouse for analysis.
--From the definitive guide.
1. Apache Sqoop and Apache Flume work with different kinds of data sources. Flume works well for streaming data sources that are generated continuously in a Hadoop environment, such as log files from multiple servers, whereas Apache Sqoop is designed to work with any relational database system that has JDBC connectivity.
2. Sqoop can also import data from NoSQL databases like MongoDB or Cassandra and allows direct data transfer to Hive or HDFS. For transferring data to Hive with Apache Sqoop, a table has to be created, for which the schema is taken from the source database itself.
3. In Apache Flume data loading is event-driven, whereas in Apache Sqoop the data load is not driven by events.
4. Flume is a better choice for moving bulk streaming data from sources like JMS or a spooling directory, whereas Sqoop is an ideal fit if the data sits in databases like Teradata, Oracle, MySQL Server, Postgres, or any other JDBC-compatible database.
5. In Apache Flume, data flows to HDFS through multiple channels, whereas in Apache Sqoop HDFS is the destination for imported data.
6. Apache Flume has an agent-based architecture, i.e. the code written in Flume is known as an agent and is responsible for fetching the data, whereas the Apache Sqoop architecture is based on connectors. The connectors in Sqoop know how to connect to the various data sources and fetch data accordingly.
Lastly, Sqoop and Flume cannot be used to achieve the same tasks, as they were developed specifically to serve different purposes. Apache Flume agents are designed to fetch streaming data, like tweets from Twitter or log files from a web server, whereas Sqoop connectors are designed to work only with structured data sources and fetch data from them.
Apache Sqoop is mainly used for parallel data transfers and data imports, as it copies data quickly, whereas Apache Flume is used for collecting and aggregating data because of its distributed, reliable nature and highly available backup routes.
Sqoop and Flume are both meant to fulfill data-ingestion needs, but they serve different purposes. Apache Flume works well for streaming data sources that are generated continuously in a Hadoop environment, such as log files from multiple servers, whereas Apache Sqoop works well with any RDBMS that has JDBC connectivity.
Sqoop is actually meant for bulk data transfers between Hadoop and any other structured data store. Flume collects log data from many sources, aggregates it, and writes it to HDFS.
I came across this interesting infographic that explains the differences between the two Apache projects, Sqoop and Flume:
Difference between Sqoop and Flume
Sqoop
Sqoop can perform imports/exports between an RDBMS and HDFS/Hive/HBase.
Sqoop only imports/exports structured data, not unstructured or semi-structured data.
Flume
Flume imports streaming data from multiple sources, mostly semi-structured or unstructured in nature. Nowadays, Kafka is a better alternative to Flume.
