Storm framework applications - hadoop

I built an application for searching similar images stores in distributed environment using Hadoop. But Hadoop does not support real time processing, that why the response time is long. I know that Storm is another framework for big data analysis application. But I got confused whether we can use Storm to implement this kind of application.
Does anybody give an advice what kind of application that use efficiently Storm framework.

Storm is a very scalable, fast, fault-tolerant open source system for distributed computation, with a special focus on stream processing. Storm excels at event processing and incremental computation, calculating rolling metrics in real time over streams of data
Event stream processing is major strength of Storm.
Generally Hadoop is used for batch-processing. But Storm is The Hadoop of real-time processing and Spark is Distributed processing for all with in-memory data store
Have a look at this Storm and Spark and Stack Comparison links
EDIT:
My solution for this problem
1) Store the images in CMS (content management system) with CDN spread across multiple networks and not in HDFS or NoSQL database)
2) Store the Image Id, Image Name, MD5SUM, Image Location meta information in HBase table
3) Use Spark & HBase for image data processing e.g. remove duplicate image by checking MD5SUM

Related

Big data lambda architecture with cassandra and hadoop

I am working on a Big Data solution for sensor data and predictive analytics.
I am new to Big Data, and have read about the lambda-architecture.
I thought about using Cassandra Database together with Hadoop.
Cassandra is a high available and Partition tolerance database and Hadoop hdfs a file system for large analytics jobs.
If I receive the data from a Internet of Things Device, should the data be saved first in Hadoop and then to Cassandra?
The lambda architecture has Hadoop in batch layer, receiving the data and sending it to the serving layer to a nosql database.
Why should the data be first in Hadoop?
and what kind of data is stored in Cassandra if Hadoop contains the raw data?
The stream layer is out of Focus at the moment.
I just want to understand the usage of Cassandra and Hadoop together.
The data in Hadoop is for large analytics and in cassandra there should be the result from my Hadoop jobs.
Does that mean i can store my raw data in both? i can store my raw data in Cassandra and in Hadoop if not only the large analytics jobs are useful for my application?
Example
INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES (’1234ABCD’,’2013-04-03 07:02:00′,’73F’);
if this is my insert and i have thousands of them in one single minute.
I want to do some large jobs i use Hadoop ?
But also i need every single Data Row for my application without analytics. Cassandra is storing it too?
The trade off is between the latency and throughput. Hadoop is supposed to provide the high throughput but the latency is quite high. So hadoop is used for batch processing in lambda architecture. But there may be requirement when you would like to pass on the pre-computed data ( Or summarized data) to another layer like visualization layer .These precomputed data is basically stored in cassandra or hbase to have low latency.
As you receive the data from a IoT Device, you need to save this data as quickly as you can. That's exactly what Cassandra is great for.
Than you need to process this data, and as the data amount is large, in the realistic case you do not want to have on-the-fly data processing, but to have batch(nightly, for example) processing instead.
And it's turn of Hadoop here.
So you have to extract the data from Cassandra, then put into Hadoop's file system (hdfs) and then do some processing (via Hive or Spark).
You could also think of having Cassandra-Spark direct streaming job, but I'd suggest to copy data from Cassandra first, as this allows to use this data as sandbox (to debug jobs, testing new algorithms, etc.) without any impact on Casandra cluster performance.
You can read about Cassandra and big data here.
Disclaimer: I am the author of this post.

MapReduce or Spark for Batch processing on Hadoop?

I know that MapReduce is a great framework for batch processing on Hadoop. But, Spark also can be used as batch framework on Hadoop that provides scalability, fault tolerance and high performance compared MapReduce. Cloudera, Hortonworks and MapR started supporting Spark on Hadoop with YARN as well.
But, a lot of companies are still using MapReduce Framework on Hadoop for batch processing instead of Spark.
So, I am trying to understand what are the current challenges of Spark to be used as batch processing framework on Hadoop?
Any thoughts?
Spark is an order of magnitude faster than mapreduce for iterative algorithms, since it gets a significant speedup from keeping intermediate data cached in the local JVM.
With Spark 1.1 which primarily includes a new shuffle implementation (sort-based shuffle instead of hash-based shuffle), a new network module (based on netty instead of using block manager for sending shuffle data), a new external shuffle service made Spark perform the fastest PetaByte sort (on 190 nodes with 46TB RAM) and TeraByte sort breaking Hadoop's old record.
Spark can easily handle the dataset which are order of magnitude larger than the cluster's aggregate memory. So, my thought is that Spark is heading in the right direction and will eventually get even better.
For reference this blog post explains how databricks performed the petabyte sort.
I'm assuming when you say Hadoop you mean HDFS.
There are number of benefits of using Spark over Hadoop MR.
Performance: Spark is at least as fast as Hadoop MR. For iterative algorithms (that need to perform number of iterations of the same dataset) is can be a few orders of magnitude faster. Map-reduce writes the output of each stage to HDFS.
1.1. Spark can cache (depending on the available memory) this intermediate results and therefore reduce latency due to disk IO.
1.2. Spark operations are lazy. This means Spark can perform certain optimizing before it starts processing the data because it can reorder operations because they have executed yet.
1.3. Spark keeps a lineage of operations and recreates the partial failed state based on this lineage in case of failure.
Unified Ecosystem: Spark provides a unified programming model for various types of analysis - batch (spark-core), interactive (REPL), streaming (spark-streaming), machine learning (mllib), graph processing (graphx), SQL queries (SparkSQL)
Richer and Simpler API: Spark's API is richer and simpler. Richer because it supports many more operations (e.g., groupBy, filter ...). Simpler because of the expressiveness of these functional constructs. Spark's API supports Java, Scala and Python (for most APIs). There is experimental support for R.
Multiple Datastore Support: Spark supports many data stores out of the box. You can use Spark to analyze data in a normal or distributed file system, HDFS, Amazon S3, Apache Cassandra, Apache Hive and ElasticSearch to name a few. I'm sure support for many other popular data stores is comings soon. This essentially if you want to adopt Spark you don't have to move your data around.
For example, here is what code for word count looks in Spark (Scala).
val textFile = sc.textFile("some file on HDFS")
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
I'm sure you have to write a few more lines if you are using standard Hadoop MR.
Here are some common misconceptions about Spark.
Spark is just a in-memory cluster computing framework. However, this is not true. Spark excels when your data can fit in memory because memory access latency is lower. But you can make it work even when your dataset doesn't completely fit in memory.
You need to learn Scala to use Spark. Spark is written in Scala and runs on the JVM. But the Spark provides support for most of the common APIs in Java and Python as well. So you can easily get started with Spark without knowing Scala.
Spark does not scale. Spark is for small datasets (GBs) only and doesn't scale to large number of machines or TBs of data. This is also not true. It has been used successfully to sort PetaBytes of data
Finally, if you do not have a legacy codebase in Hadoop MR it makes perfect sense to adopt Spark, the simple reason being all major Hadoop vendors are moving towards Spark for good reason.
Apache Spark runs in memory, making it much faster than mapreduce.
Spark started as a research project at Berkeley.
Mapreduce use disk extensively (for external sort, shuffle,..).
As the input size for a hadoop job is in order of terabytes. Spark memory requirements will be more than traditional hadoop.
So basically, for smaller jobs and with huge memory in ur cluster, sparks wins. And this is not practically the case for most clusters.
Refer to spark.apache.org for more details on spark

Can druid replace hadoop?

Druid is used for both real time and batch processing. But can it totally replace hadoop?
If not why? As in what is the advantage of hadoop over druid?
I have read that druid is used along with hadoop. So can the use of Hadoop be avoided?
We are talking about two slightly related but very different technologies here.
Druid is a real-time analytics system and is a perfect fit for timeseries and time based events aggregation.
Hadoop is HDFS (a distributed file system) + Map Reduce (a paradigm for executing distributed processes), which together have created an eco system for distributed processing and act as underlying/influencing technology for many other open source projects.
You can setup druid to use Hadoop; that is to fire MR jobs to index batch data and to read its indexed data from HDFS (of course it will cache them locally on the local disk)
If you want to ignore Hadoop, you can do your indexing and loading from a local machine as well, of course with the penalty of being limited to one machine.
Can you avoid using Hadoop with Druid? Yes, you can stream data in real-time into a Druid cluster rather than batch-loading it with Hadoop. One way to do this is to stream data into Kafka, which will handle incoming events and pass them into Storm, which can then process and load them into Druid Realtime nodes.
Typically this setup is used with Hadoop in parallel, because streamed real-time data comes with its own baggage and often needs to be fixed up and backfilled. That whole architecture has been dubbed "Lambda" by some.
Druid is used for both real time and batch processing. But can it totally replace hadoop? If not why?
It depends on your cases. Have a look at Druid official website documentation.
Druid is good choice for below use cases:
Insert rates are very high, but updates are less common
Most of queries are aggregation and reporting with low latency of 100ms to a few seconds.
Data has a time component
Load data from Kafka, HDFS, flat files, or object storage like Amazon S3
Druid is not good choince for below use cases
Need low-latency updates of existing records using a primary key. Druid supports streaming inserts, but not streaming updates
Building an offline reporting system where query latency is not very important.
In case of big joins
So if you are looking for offline reporting system where query latency is not important, Hadoop may score in that scenario.

When to use Hadoop, HBase, Hive and Pig?

What are the benefits of using either Hadoop or HBase or Hive ?
From my understanding, HBase avoids using map-reduce and has a column oriented storage on top of HDFS. Hive is a sql-like interface for Hadoop and HBase.
I would also like to know how Hive compares with Pig.
MapReduce is just a computing framework. HBase has nothing to do with it. That said, you can efficiently put or fetch data to/from HBase by writing MapReduce jobs. Alternatively you can write sequential programs using other HBase APIs, such as Java, to put or fetch the data. But we use Hadoop, HBase etc to deal with gigantic amounts of data, so that doesn't make much sense. Using normal sequential programs would be highly inefficient when your data is too huge.
Coming back to the first part of your question, Hadoop is basically 2 things: a Distributed FileSystem (HDFS) + a Computation or Processing framework (MapReduce). Like all other FS, HDFS also provides us storage, but in a fault tolerant manner with high throughput and lower risk of data loss (because of the replication). But, being a FS, HDFS lacks random read and write access. This is where HBase comes into picture. It's a distributed, scalable, big data store, modelled after Google's BigTable. It stores data as key/value pairs.
Coming to Hive. It provides us data warehousing facilities on top of an existing Hadoop cluster. Along with that it provides an SQL like interface which makes your work easier, in case you are coming from an SQL background. You can create tables in Hive and store data there. Along with that you can even map your existing HBase tables to Hive and operate on them.
While Pig is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig basically has 2 parts: the Pig Interpreter and the language, PigLatin. You write Pig script in PigLatin and using Pig interpreter process them. Pig makes our life a lot easier, otherwise writing MapReduce is always not easy. In fact in some cases it can really become a pain.
I had written an article on a short comparison of different tools of the Hadoop ecosystem some time ago. It's not an in depth comparison, but a short intro to each of these tools which can help you to get started.
(Just to add on to my answer. No self promotion intended)
Both Hive and Pig queries get converted into MapReduce jobs under the hood.
HTH
I implemented a Hive Data platform recently in my firm and can speak to it in first person since I was a one man team.
Objective
To have the daily web log files collected from 350+ servers daily queryable thru some SQL like language
To replace daily aggregation data generated thru MySQL with Hive
Build Custom reports thru queries in Hive
Architecture Options
I benchmarked the following options:
Hive+HDFS
Hive+HBase - queries were too slow so I dumped this option
Design
Daily log Files were transported to HDFS
MR jobs parsed these log files and output files in HDFS
Create Hive tables with partitions and locations pointing to HDFS locations
Create Hive query scripts (call it HQL if you like as diff from SQL) that in turn ran MR jobs in the background and generated aggregation data
Put all these steps into an Oozie workflow - scheduled with Daily Oozie Coordinator
Summary
HBase is like a Map. If you know the key, you can instantly get the value. But if you want to know how many integer keys in Hbase are between 1000000 and 2000000 that is not suitable for Hbase alone.
If you have data that needs to be aggregated, rolled up, analyzed across rows then consider Hive.
Hopefully this helps.
Hive actually rocks ...I know, I have lived it for 12 months now... So does HBase...
Hadoop is a a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
There are four main modules in Hadoop.
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Before going further, Let's note that we have three different types of data.
Structured: Structured data has strong schema and schema will be checked during write & read operation. e.g. Data in RDBMS systems like Oracle, MySQL Server etc.
Unstructured: Data does not have any structure and it can be any form - Web server logs, E-Mail, Images etc.
Semi-structured: Data is not strictly structured but have some structure. e.g. XML files.
Depending on type of data to be processed, we have to choose right technology.
Some more projects, which are part of Hadoop:
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad-hoc querying.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Hive Vs PIG comparison can be found at this article and my other post at this SE question.
HBASE won't replace Map Reduce. HBase is scalable distributed database & Map Reduce is programming model for distributed processing of data. Map Reduce may act on data in HBASE in processing.
You can use HIVE/HBASE for structured/semi-structured data and process it with Hadoop Map Reduce
You can use SQOOP to import structured data from traditional RDBMS database Oracle, SQL Server etc and process it with Hadoop Map Reduce
You can use FLUME for processing Un-structured data and process with Hadoop Map Reduce
Have a look at: Hadoop Use Cases.
Hive should be used for analytical querying of data collected over a period of time. e.g Calculate trends, summarize website logs but it can't be used for real time queries.
HBase fits for real-time querying of Big Data. Facebook use it for messaging and real-time analytics.
PIG can be used to construct dataflows, run a scheduled jobs, crunch big volumes of data, aggregate/summarize it and store into relation database systems. Good for ad-hoc analysis.
Hive can be used for ad-hoc data analysis but it can't support all un-structured data formats unlike PIG.
Consider that you work with RDBMS and have to select what to use - full table scans, or index access - but only one of them.
If you select full table scan - use hive. If index access - HBase.
Understanding in depth
Hadoop
Hadoop is an open source project of the Apache foundation. It is a framework written in Java, originally developed by Doug Cutting in 2005. It was created to support distribution for Nutch, the text search engine. Hadoop uses Google's Map Reduce and Google File System Technologies as its foundation.
Features of Hadoop
It is optimized to handle massive quantities of structured, semi-structured and unstructured data using commodity hardware.
It has shared nothing architecture.
It replicates its data into multiple computers so that if one goes down, the data can still be processed from another machine that stores its replica.
Hadoop is for high throughput rather than low latency. It is a batch operation handling massive quantities of data; therefore the response time is not immediate.
It complements Online Transaction Processing and Online Analytical Processing. However, it is not a replacement for a RDBMS.
It is not good when work cannot be parallelized or when there are dependencies within the data.
It is not good for processing small files. It works best with huge data files and data sets.
Versions of Hadoop
There are two versions of Hadoop available :
Hadoop 1.0
Hadoop 2.0
Hadoop 1.0
It has two main parts :
1. Data Storage Framework
It is a general-purpose file system called Hadoop Distributed File System (HDFS).
HDFS is schema-less
It simply stores data files and these data files can be in just about any format.
The idea is to store files as close to their original form as possible.
This in turn provides the business units and the organization the much needed flexibility and agility without being overly worried by what it can implement.
2. Data Processing Framework
This is a simple functional programming model initially popularized by Google as MapReduce.
It essentially uses two functions: MAP and REDUCE to process data.
The "Mappers" take in a set of key-value pairs and generate intermediate data (which is another list of key-value pairs).
The "Reducers" then act on this input to produce the output data.
The two functions seemingly work in isolation with one another, thus enabling the processing to be highly distributed in highly parallel, fault-tolerance and scalable way.
Limitations of Hadoop 1.0
The first limitation was the requirement of MapReduce programming expertise.
It supported only batch processing which although is suitable for tasks such as log analysis, large scale data mining projects but pretty much unsuitable for other kinds of projects.
One major limitation was that Hadoop 1.0 was tightly computationally coupled with MapReduce, which meant that the established data management vendors where left with two opinions:
Either rewrite their functionality in MapReduce so that it could be
executed in Hadoop or
Extract data from HDFS or process it outside of Hadoop.
None of the options were viable as it led to process inefficiencies caused by data being moved in and out of the Hadoop cluster.
Hadoop 2.0
In Hadoop 2.0, HDFS continues to be data storage framework.
However, a new and seperate resource management framework called Yet Another Resource Negotiater (YARN) has been added.
Any application capable of dividing itself into parallel tasks is supported by YARN.
YARN coordinates the allocation of subtasks of the submitted application, thereby further enhancing the flexibility, scalability and efficiency of applications.
It works by having an Application Master in place of Job Tracker, running applications on resources governed by new Node Manager.
ApplicationMaster is able to run any application and not just MapReduce.
This means it does not only support batch processing but also real-time processing. MapReduce is no longer the only data processing option.
Advantages of Hadoop
It stores data in its native from. There is no structure imposed while keying in data or storing data. HDFS is schema less. It is only later when the data needs to be processed that the structure is imposed on the raw data.
It is scalable. Hadoop can store and distribute very large datasets across hundreds of inexpensive servers that operate in parallel.
It is resilient to failure. Hadoop is fault tolerance. It practices replication of data diligently which means whenever data is sent to any node, the same data also gets replicated to other nodes in the cluster, thereby ensuring that in event of node failure,there will always be another copy of data available for use.
It is flexible. One of the key advantages of Hadoop is that it can work with any kind of data: structured, unstructured or semi-structured. Also, the processing is extremely fast in Hadoop owing to the "move code to data" paradigm.
Hadoop Ecosystem
Following are the components of Hadoop ecosystem:
HDFS: Hadoop Distributed File System. It simply stores data files as close to the original form as possible.
HBase: It is Hadoop's database and compares well with an RDBMS. It supports structured data storage for large tables.
Hive: It enables analysis of large datasets using a language very similar to standard ANSI SQL, which implies that anyone familier with SQL should be able to access data on a Hadoop cluster.
Pig: It is an easy to understand data flow language. It helps with analysis of large datasets which is quite the order with Hadoop. Pig scripts are automatically converted to MapReduce jobs by the Pig interpreter.
ZooKeeper: It is a coordination service for distributed applications.
Oozie: It is a workflow schedular system to manage Apache Hadoop jobs.
Mahout: It is a scalable machine learning and data mining library.
Chukwa: It is data collection system for managing large distributed system.
Sqoop: It is used to transfer bulk data between Hadoop and structured data stores such as relational databases.
Ambari: It is a web based tool for provisioning, managing and monitoring Hadoop clusters.
Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.
Hive is not
A relational database
A design for Online Transaction Processing (OLTP).
A language for real-time queries and row-level updates.
Features of Hive
It stores schema in database and processed data into HDFS.
It is designed for OLAP.
It provides SQL type language for querying called HiveQL or HQL.
It is familier, fast, scalable and extensible.
Hive Architecture
The following components are contained in Hive Architecture:
User Interface: Hive is a data warehouse infrastructure that can create interaction between user and HDFS. The User Interfaces that Hive supports are Hive Web UI, Hive Command line and Hive HD Insight(In Windows Server).
MetaStore: Hive chooses respective database servers to store the schema or Metadata of tables, databases, columns in a table, their data types and HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying on schema info on the Metastore. It is one of the replacements of traditional approach for MapReduce program. Instead of writing MapReduce in Java, we can write a query for MapReduce and process it.
Exceution Engine: The conjunction part of HiveQL process engine and MapReduce is the Hive Execution Engine. Execution engine processes the query and generates results as same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBase: Hadoop Distributed File System or HBase are the data storage techniques to store data into file system.
For a Comparison Between Hadoop Vs Cassandra/HBase read this post.
Basically HBase enables really fast read and writes with scalability. How fast and scalable? Facebook uses it to manage its user statuses, photos, chat messages etc. HBase is so fast sometimes stacks have been developed by Facebook to use HBase as the data store for Hive itself.
Where As Hive is more like a Data Warehousing solution. You can use a syntax similar to SQL to query Hive contents which results in a Map Reduce job. Not ideal for fast, transactional systems.
I worked on Lambda architecture processing Real time and Batch loads.
Real time processing is needed where fast decisions need to be taken in case of Fire alarm send by sensor or fraud detection in case of banking transactions.
Batch processing is needed to summarize data which can be feed into BI systems.
we used Hadoop ecosystem technologies for above applications.
Real Time Processing
Apache Storm: Stream Data processing, Rule application
HBase: Datastore for serving Realtime dashboard
Batch Processing
Hadoop: Crunching huge chunk of data. 360 degrees overview or adding context to events. Interfaces or frameworks like Pig, MR, Spark, Hive, Shark help in computing. This layer needs scheduler for which Oozie is good option.
Event Handling layer
Apache Kafka was first layer to consume high velocity events from sensor.
Kafka serves both Real Time and Batch analytics data flow through Linkedin connectors.
First of all we should get clear that Hadoop was created as a faster alternative to RDBMS. To process large amount of data at a very fast rate which earlier took a lot of time in RDBMS.
Now one should know the two terms :
Structured Data : This is the data that we used in traditional RDBMS and is divided into well defined structures.
Unstructured Data : This is important to understand, about 80% of the world data is unstructured or semi structured. These are the data which are on its raw form and cannot be processed using RDMS. Example : facebook, twitter data. (http://www.dummies.com/how-to/content/unstructured-data-in-a-big-data-environment.html).
So, large amount of data was being generated in the last few years and the data was mostly unstructured, that gave birth to HADOOP. It was mainly used for very large amount of data that takes unfeasible amount of time using RDBMS. It had many drawbacks, that it could not be used for comparatively small data in real time but they have managed to remove its drawbacks in the newer version.
Before going further I would like to tell that a new Big Data tool is created when they see a fault on the previous tools. So, whichever tool you will see that is created has been done to overcome the problem of the previous tools.
Hadoop can be simply said as two things : Mapreduce and HDFS. Mapreduce is where the processing takes place and HDFS is the DataBase where data is stored. This structure followed WORM principal i.e. write once read multiple times. So, once we have stored data in HDFS, we cannot make changes. This led to the creation of HBASE, a NOSQL product where we can make changes in the data also after writing it once.
But with time we saw that Hadoop had many faults and for that we created different environment over the Hadoop structure. PIG and HIVE are two popular examples.
HIVE was created for people with SQL background. The queries written is similar to SQL named as HIVEQL. HIVE was developed to process completely structured data. It is not used for ustructured data.
PIG on the other hand has its own query language i.e. PIG LATIN. It can be used for both structured as well as unstructured data.
Moving to the difference as when to use HIVE and when to use PIG, I don't think anyone other than the architect of PIG could say. Follow the link :
https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html
Let me try to answer in few words.
Hadoop is an eco-system which comprises of all other tools. So, you can't compare Hadoop but you can compare MapReduce.
Here are my few cents:
Hive: If your need is very SQLish meaning your problem statement can be catered by SQL, then the easiest thing to do would be to use Hive. The other case, when you would use hive is when you want a server to have certain structure of data.
Pig: If you are comfortable with Pig Latin and you need is more of the data pipelines. Also, your data lacks structure. In those cases, you could use Pig. Honestly there is not much difference between Hive & Pig with respect to the use cases.
MapReduce: If your problem can not be solved by using SQL straight, you first should try to create UDF for Hive & Pig and then if the UDF is not solving the problem then getting it done via MapReduce makes sense.
Pig: it is better to handle files and cleaning data
example: removing null values,string handling,unnecessary values
Hive: for querying on cleaned data
1.We are using Hadoop for storing Large data (i.e.structure,Unstructure and Semistructure data ) in the form file format like txt,csv.
2.If We want columnar Updations in our data then we are using Hbase tool
3.In case of Hive , we are storing Big data which is in structured format
and in addition to that we are providing Analysis on that data.
4.Pig is tool which is using Pig latin language to analyze data which is in any format(structure,semistructure and unstructure).
Cleansing Data in Pig is very easy,a suitable approach would be cleansing data through pig and then processing data through hive and later uploading it to hdfs.
Use of Hive, Hbase and Pig w.r.t. my real time experience in different projects.
Hive is used mostly for:
Analytics purpose where you need to do analysis on history data
Generating business reports based on certain columns
Efficiently managing the data together with metadata information
Joining tables on certain columns which are frequently used by using bucketing concept
Efficient Storing and querying using partitioning concept
Not useful for transaction/row level operations like update, delete, etc.
Pig is mostly used for:
Frequent data analysis on huge data
Generating aggregated values/counts on huge data
Generating enterprise level key performance indicators very frequently
Hbase is mostly used:
For real time processing of data
For efficiently managing Complex and nested schema
For real time querying and faster result
For easy Scalability with columns
Useful for transaction/row level operations like update, delete, etc.
Short answer to this question is -
Hadoop - Is Framework which facilitates distributed file system and programming model which allow us to store humongous sized data and process data in distributed fashion very efficiently and with very less processing time compare to traditional approaches.
(HDFS - Hadoop Distributed File system)
(Map Reduce - Programming Model for distributed processing)
Hive - Is query language which allows to read/write data from Hadoop distributed file system in a very popular SQL like fashion. This made life easier for many non-programming background people as they don't have to write Map-Reduce program anymore except for very complex scenarios where Hive is not supported.
Hbase - Is Columnar NoSQL Database. Underlying storage layer for Hbase is again HDFS. Most important use case for this database is to be able to store billion's of rows with million's of columns. Low latency feature of Hbase helps faster and random access of record over distributed data, is very important feature to make it useful for complex projects like Recommender Engines. Also it's record level versioning capability allow user to store transactional data very efficiently (this solves the problem of updating records we have with HDFS and Hive)
Hope this is helpful to quickly understand the above 3 features.
I believe this thread hasn't done in particular justice to HBase and Pig in particular. While I believe Hadoop is the choice of the distributed, resilient file-system for big-data lake implementations, the choice between HBase and Hive is in particular well-segregated.
As in, a lot of use-cases have a particular requirement of SQL like or No-SQL like interfaces. With Phoenix on top of HBase, though SQL like capabilities is certainly achievable, however, the performance, third-party integrations, dashboard update are a kind of painful experiences. However, it's an excellent choice for databases requiring horizontal scaling.
Pig is in particular excellent for non-recursive batch like computations or ETL pipelining (somewhere, where it outperforms Spark by a comfortable distance). Also, it's high-level dataflow implementations is an excellent choice for batch querying and scripting. The choice between Pig and Hive is also pivoted on the need of the client or server-side scripting, required file formats, etc. Pig supports Avro file format which is not true in the case of Hive. The choice for 'procedural dataflow language' vs 'declarative data flow language' is also a strong argument for the choice between pig and hive.
Hadoop:
HDFS stands for Hadoop Distributed File System which uses Computational processing model Map-Reduce.
HBase:
HBase is Key-Value storage, good for reading and writing in near real time.
Hive:
Hive is used for data extraction from the HDFS using SQL-like syntax. Hive use HQL language.
Pig:
Pig is a data flow language for creating ETL. It's an scripting language.
Pig is mostly dead after Cloudera got rid of it in CDP. Also last release on Apache was 19 June, 2017: release 0.17.0 so basically no committers actively working anymore. Use Spark or Python way more powerful than Pig.

Apache Storm compared to Hadoop

How does Storm compare to Hadoop? Hadoop seems to be the defacto standard for open-source large scale batch processing, does Storm has any advantages over hadoop? or Are they completely different?
Why don't you tell your opinion.
http://www.infoq.com/news/2011/09/twitter-storm-real-time-hadoop/
http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Twitter Storm has been touted as real time Hadoop. That is more a marketing take for easy consumption.
They are superficially similar since both are distributed application solutions. Apart from the typical distributed architectural elements like master/slave, zookeeper based coordination, to me comparison falls off the cliff.
Twitter is more like a pipline for processing data as it comes. The pipe is what connects various computing nodes that receive data, compute and deliver output. (There lingo is spouts and bolts) Extend this analogy to a complex pipeline wiring that can be re-engineered when required and you get Twitter Storm.
In nut shell it processes data as it comes. There is no latency.
Hadoop how ever is different in this respect primarily due to HDFS. It a solution geared to distributed storage and tolerance to outage of many scales (disks, machines, racks etc)
M/R is built to leverage data localization on HDFS to distribute computational jobs. Together, they do not provide facility for real time data processing. But that is not always a requirement when you are looking through large data. (needle in the haystack analogy)
In short, Twitter Storm is a distributed real time data processing solution. I don't think we should compare them. Twitter built it because it needed a facility to process small tweets but humungous number of them and in real time.
See: HStreaming if you are compelled to compare it with some thing
Basically, both of them are used for analyzing big data, but Storm is used for real time processing while Hadoop is used for batch processing.
This is a very good introduction to Storm that I found:
Click here
Rather than to be compared, they are supposed to supplement each other now having batch + real-time (pseudo-real time) processing. There is a corresponding video presentation - Ted Dunning on Twitter's Storm
I've been using Storm for a while and now I've quit this really good technology for an amazing one : Spark (http://spark.apache.org) which provides developer with a unified API for batch or streaming processing (micro-batch) as well as machine learning and graph processing.
worth a try.
Storm is for Fast Data (real time) & Hadoop is for Big data(pre-existing tons of data). Storm can't process Big data but it can generate Big data as a output.
Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
Since many sub systems exists in Hadoop ecosystem, we have to chose right sub system depending on business requirements & feasibility of a particular system.
Hadoop MapReduce is efficient for batch processing of one job at a time. This is the reason why Hadoop is being used extensively as a data warehousing tool rather than data analysis tool.
Since the question is related to only "Storm" vs "Hadoop", have a look at Storm use cases - Financial Services, Telecom, Retail, Manufacturing, Transportation.
Hadoop MapReduce is best suited for batch processing.
Storm is a complete stream processing engine and can be used for real time data analytics with latency in sub-seconds.
Have a look at this dezyre article for comparison between Hadoop, Storm and Spark. It explains similarities and differences.
It can be summarized with below picture ( from dezyre article)

Resources