Apache Storm vs Apache Airflow - apache-storm

What is the difference between the two? As far as I understand both of them are based on the concept of directed acyclic graph (DAG) and Storm processes data in real time and Airflow rather moves the whole thing from one stage to the other. Is it the only difference between them? What kind of jobs are both of them suitable for?

Airflow is an orchestrating engine like Azkaban or Oozie. Storm is a near real-time processing engine which has largely been replaced by Spark Streaming or Heron (Twitter’s 2nd generation replacement for Storm).

Related

Giraph, Hadoop, Spark and Cassandra

Is it possible for me to use Giraph if I have Spark clusters and Cassandra but no Hadoop clusters?
Currently, I am using GraphX and would like to use Giraph instead. Is this possible considering that I have Spark clusters and am using Cassandra?
I have only limited experience with Giraph from years ago, and I never tried using it outside of a Hadoop cluster. But it looks like what you want is at least technically possible if not necessarily easy.
This code is the companion to Practical Graph Analytics with Apache Giraph. As you can see, it requires Hadoop in the classpath for DoubleWritable and Text, for example, but it does nothing with a Hadoop cluster. Instead, it works with in-memory arrays. It looks like all you need to do is implement compute in the BasicComputation class to do whatever you need with Cassandra as long as you keep Hadoop around as a dependency to help satisfy the type boundaries for BasicComputation.
I never found Giraph terribly intuitive, but hopefully you can make this unconventional setup work.

Differences between MapReduce and Yarn

I was searching about hadoop and mapreduce with respect to straggler problems and the papers in this problem
but yesterday I found that there is hadoop 2 with Yarn ,,
unfortunately no paper is talking about straggler problem in Yarn
So I want to know what is difference between MapReduce and Yarn in the part straggler?
is Yarn suffer from straggler problem?
and when MRmaster asks resource manger for resources , resource manger will give MRmaster all resources it needs or it is according to cluster computing capabilities ?
thanks so much,,
Here are the MapReduce 1.0 and MapReduce 2.0 (YARN)
MapReduce 1.0
In a typical Hadoop cluster, racks are interconnected via core switches. Core switches should connect to top-of-rack switches Enterprises using Hadoop should consider using 10GbE, bonded Ethernet and redundant top-of-rack switches to mitigate risk in the event of failure. A file is broken into 64MB chunks by default and distributed across Data Nodes. Each chunk has a default replication factor of 3, meaning there will be 3 copies of the data at any given time. Hadoop is “Rack Aware” and HDFS has replicated chunks on nodes on different racks. JobTracker assign tasks to nodes closest to the data depending on the location of nodes and helps the NameNode determine the ‘closest’ chunk to a client during reads. The administrator supplies a script which tells Hadoop which rack the node is in, for example: /enterprisedatacenter/rack2.
Limitations of MapReduce 1.0 – Hadoop can scale up to 4,000 nodes. When it exceeds that limit, it raises unpredictable behavior such as cascading failures and serious deterioration of overall cluster. Another issue being multi-tenancy – it is impossible to run other frameworks than MapReduce 1.0 on a Hadoop cluster.
MapReduce 2.0
MapReduce 2.0 has two components – YARN that has cluster resource management capabilities and MapReduce.
In MapReduce 2.0, the JobTracker is divided into three services:
ResourceManager, a persistent YARN service that receives and runs applications on the cluster. A MapReduce job is an application.
JobHistoryServer, to provide information about completed jobs
Application Master, to manage each MapReduce job and is terminated when the job completes.
TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a node. NodeManager is responsible for launching containers that could either be a map or reduce task.
This new architecture breaks JobTracker model by allowing a new ResourceManager to manage resource usage across applications, with ApplicationMasters taking the responsibility of managing the execution of jobs. This change removes a bottleneck and lets Hadoop clusters scale up to larger configurations than 4000 nodes. This architecture also allows simultaneous execution of a variety of programming models such as graph processing, iterative processing, machine learning, and general cluster computing, including the traditional MapReduce.
You say "Differences between MapReduce and YARN". MapReduce and YARN definitely different. MapReduce is Programming Model, YARN is architecture for distribution cluster. Hadoop 2 using YARN for resource management. Besides that, hadoop support programming model which support parallel processing that we known as MapReduce. Before hadoop 2, hadoop already support MapReduce. In short, MapReduce run above YARN Architecture. Sorry, i don't mention in part of straggler problem.
"when MRmaster asks resource manger for resources?"
when user submit MapReduce Job. After MapReduce job has done, resource will be back to free.
"resource manger will give MRmaster all resources it needs or it is according to cluster computing capabilities"
I don't get this question point. Obviously, the resources manager will give all resource it needs no matter what cluster computing capabilities. Cluster computing capabilities will influence on processing time.
There is no YARN in MapReduce 1. In MapReduce there is Yarn.
If for straggler problem you mean that if first guy waits 'something' which then causes more waits along a road who depends on that first guy then I guess there is always this problem in MR jobs. Getting allocated resources naturally participate to this problem along with all other things which may cause components to wait something.
Tez which is supposed to be a drop-in replacement for MR job runtime makes a things differently. Instead of doing task runs in a same way current MR Appmaster does it tries to use DAG of tasks which does a much better job of not getting into bad straggler problem.
You need to understand a relationship between MR and YARN. YARN is simply a dummy resource scheduler meaning it doesn't schedule 'tasks'. What it gives to MR Appmaster is a set or resources(in a sense it's only combination of memory and cpu and location). It's then MR Appmaster responsibility to decide what to do with those resources.

Twitter Storm v/s Apache Hadoop

It would be great if somebody can explain me the architectural differences between Twitter Storm and Apache Hadoop? I am looking out for some internals stuff beyond real time v/s batch processing. As both technologies are quiet similar in terms of writing a topology for Storm or map-reduce on Hadoop, in terms of task tracker/job tracker for Hadoop and the equivalent nimbus/supervisor for Storm, in terms of Hadoop partition and equivalent shuffling (random, field etc.) on Storm etc. (Am I correct if I say that Storm uses message queues internally for transporting data between spouts/bolt which is not exactly the case with Hadoop where in there are intermediate files created and hence an I/O involved.)
EDIT:
I have gone through the question Apache Storm compared to Hadoop but the accepted answer leaves me with a desire to know more than just the use case i.e. real time v/s batch processing.
The main diffence is that Storm can do realtime processing of streams of Tupple s (incoming data) while Hadoop do batch processing with MapReduce jobs.
both of them process data in a distributed way, but with storm you can have live analitics while you will have to wait the mapreduce job to finish before playing with your results.
Nathan Marz (Storm creator) is writing a book about Big Data where he discusses how to create big data systems with Hadoop, Storm and other technologies.
The book is discussing "The Lambda Architecture". Checkout this slide by Nathan Marz himself: Runaway complexity in Big Data... and a plan to stop it

Best technology stack for aggregation across various properties

We are working on developing a platform which models flow of entities across a graph. The system has to answer questions of the kind how many entities having these properties are sitting at a given node on the graph , what is the inflow on a node, outflow on a node etc. Flow data is fed to the system in a stream. We are thinking of breaking the flow data in time buckets(say 5 mins) and pre-compute various aggregates against different properties and storing the aggregates in DynamoDB to serve queries.
With regards to this we are evaluating the following options:
EMR: Put flow data in AWS -S3/DynamoDB run a Map Reduce/hive job
Putting recent data into AWS- RDS, computing the aggregates via sql
Akka: It is a framework to build distributed applications via Actors
and Message passing.
If anyone has worked on similar usecase or has used any of the above technologies, please let me know what approach would be best fit for our use case.
I have used EMR to process data in S3... works pretty well. And the best part is you can spin up hadoop clusters of various sizes that fit the work load.
you may want to look into Storm for stream processing
I am also collecting a list of big-data tools here: http://hadoopilluminated.com/hadoop_book/Bigdata_Ecosystem.html
The final solution employed AWS Redshift, the driving reason was the requirement of high speed data ingestion, which Redshift provides via the COPY command.
Hadoop is built to store the data efficiently, however it does not gurantees a sub-second sla for ingestion, neither does it provide an SLA for when the data will be available for MR jobs, this was the main reason we did not go with EMR or Hadoop in general.

Apache Storm compared to Hadoop

How does Storm compare to Hadoop? Hadoop seems to be the defacto standard for open-source large scale batch processing, does Storm has any advantages over hadoop? or Are they completely different?
Why don't you tell your opinion.
http://www.infoq.com/news/2011/09/twitter-storm-real-time-hadoop/
http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Twitter Storm has been touted as real time Hadoop. That is more a marketing take for easy consumption.
They are superficially similar since both are distributed application solutions. Apart from the typical distributed architectural elements like master/slave, zookeeper based coordination, to me comparison falls off the cliff.
Twitter is more like a pipline for processing data as it comes. The pipe is what connects various computing nodes that receive data, compute and deliver output. (There lingo is spouts and bolts) Extend this analogy to a complex pipeline wiring that can be re-engineered when required and you get Twitter Storm.
In nut shell it processes data as it comes. There is no latency.
Hadoop how ever is different in this respect primarily due to HDFS. It a solution geared to distributed storage and tolerance to outage of many scales (disks, machines, racks etc)
M/R is built to leverage data localization on HDFS to distribute computational jobs. Together, they do not provide facility for real time data processing. But that is not always a requirement when you are looking through large data. (needle in the haystack analogy)
In short, Twitter Storm is a distributed real time data processing solution. I don't think we should compare them. Twitter built it because it needed a facility to process small tweets but humungous number of them and in real time.
See: HStreaming if you are compelled to compare it with some thing
Basically, both of them are used for analyzing big data, but Storm is used for real time processing while Hadoop is used for batch processing.
This is a very good introduction to Storm that I found:
Click here
Rather than to be compared, they are supposed to supplement each other now having batch + real-time (pseudo-real time) processing. There is a corresponding video presentation - Ted Dunning on Twitter's Storm
I've been using Storm for a while and now I've quit this really good technology for an amazing one : Spark (http://spark.apache.org) which provides developer with a unified API for batch or streaming processing (micro-batch) as well as machine learning and graph processing.
worth a try.
Storm is for Fast Data (real time) & Hadoop is for Big data(pre-existing tons of data). Storm can't process Big data but it can generate Big data as a output.
Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
Since many sub systems exists in Hadoop ecosystem, we have to chose right sub system depending on business requirements & feasibility of a particular system.
Hadoop MapReduce is efficient for batch processing of one job at a time. This is the reason why Hadoop is being used extensively as a data warehousing tool rather than data analysis tool.
Since the question is related to only "Storm" vs "Hadoop", have a look at Storm use cases - Financial Services, Telecom, Retail, Manufacturing, Transportation.
Hadoop MapReduce is best suited for batch processing.
Storm is a complete stream processing engine and can be used for real time data analytics with latency in sub-seconds.
Have a look at this dezyre article for comparison between Hadoop, Storm and Spark. It explains similarities and differences.
It can be summarized with below picture ( from dezyre article)

Resources