I have just started exploring Big Data technology and the Hadoop framework.
But I am getting confused by the many ecosystem components and frameworks. Could you please advise on how to get a structured start with learning?
I mean, which ecosystem component should one focus on? Any one in particular, or all of them?
Help much appreciated!
Ranit
I wrote this answer on Quora a few months back. I hope it helps:
1. Go through some introductory videos on Hadoop
It is very important to have a high-level idea of Hadoop before starting to work on it directly. These introductory videos will help in understanding the scope of Hadoop and the use cases where it can be applied. There are plenty of resources available online, and going through any of the videos will be beneficial.
2. Understanding MapReduce
The second thing that helped me was understanding what MapReduce is and how it works. It is explained very nicely in this paper: http://static.googleusercontent....
Another nice tutorial is available here: http://ksat.me/map-reduce-a-real...
For points 1 and 2, go through the first four lectures of the week-one video lectures. The whole concept of distributed computing and MapReduce is explained very nicely there: https://class.coursera.org/mmds-001/lecture
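To make the model concrete, here is a toy, single-machine sketch of the idea in plain Java (no Hadoop involved; the input documents are made up for illustration): map emits (key, value) pairs, the framework groups them by key, and reduce aggregates each group.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceToy {
    public static void main(String[] args) {
        List<String> documents = Arrays.asList("the cat", "the dog", "the cat sat");

        // "Map" phase: emit a (word, 1) pair for every word in every document.
        List<Map.Entry<String, Integer>> mapped = documents.stream()
                .flatMap(doc -> Arrays.stream(doc.split(" ")))
                .map(word -> Map.entry(word, 1))          // Map.entry needs Java 9+
                .collect(Collectors.toList());

        // "Shuffle" + "reduce" phase: group the pairs by key and sum the values.
        Map<String, Integer> counts = mapped.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));

        System.out.println(counts); // e.g. {the=3, cat=2, sat=1, dog=1} (order varies)
    }
}
```

In real Hadoop the map and reduce functions run on many machines and the shuffle happens over the network, but the shape of the computation is the same.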
3. Getting started with Cloudera VM
Once you understand the basics of Hadoop, you can download the VM provided by Cloudera and start running some Hadoop commands on it. You can download the VM from this link: http://www.cloudera.com/content/...
It would be good to get familiar with basic Hadoop commands on the VM and to understand how it works.
4. Setting up standalone/pseudo-distributed Hadoop
Once you are familiar with Hadoop using the VM, I would recommend setting up your own standalone Hadoop on your machine. The installation steps are explained very nicely in this blog post by Michael G. Noll: Running Hadoop On Ubuntu Linux (Single-Node Cluster) - Michael G. Noll
5. Understanding the Hadoop Ecosystem
It would be good to get familiar with the other components in the Hadoop ecosystem, like Apache Pig, Hive, HBase, Flume-NG, Hue, etc. They all serve different purposes, and having some knowledge of each will be really helpful when building any product around the Hadoop ecosystem. You can install all of them easily on your machine and get started with them; the Cloudera VM has most of them installed already.
6. Writing MapReduce Jobs
Once you are done with steps 1-5, writing MapReduce jobs shouldn't be a challenge. It is explained thoroughly in Hadoop: The Definitive Guide. If MapReduce really interests you, I would suggest reading Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec and Jeffrey D. Ullman: Page on Stanford
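For reference, here is the classic word-count job written against the Hadoop MapReduce Java API (Hadoop 2.x style). It is a minimal sketch; the input and output paths are supplied on the command line and are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: for every line of input, emit (word, 1) for each token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the 1s emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You package this into a jar and submit it with the hadoop command; the framework handles splitting the input, scheduling the map and reduce tasks, and the shuffle in between.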
I would recommend going for Hadoop first; it's the basis for a lot of the other systems out there. Check out the main site: http://hadoop.apache.org/ and check out Cloudera: they provide a virtual machine image with their distribution (CDH) pre-installed, so you can jump into action without having to deal with installation problems: http://www.cloudera.com/content/cloudera/en/downloads/cdh/cdh-5-2-0.html
After that, I would look into HDFS, just to understand a bit more about how Hadoop stores its data. Beyond that, it depends on what type of problems you're trying to solve; each system tackles a specific and (usually) different problem:
Hive / Cassandra: for database-like interaction.
Pig: for data transformation.
Spark: for real-time data analysis.
Check out this link for more details: http://www.cloudera.com/content/cloudera/en/training/library/apache-hadoop-ecosystem.html
I hope you find that useful.
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. (From Wikipedia.)
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
There are four main modules in Hadoop.
1. Hadoop Common: The common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
3. Hadoop YARN: A framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Before going further, let's note that we have three different types of data.
Structured: Structured data has a strong schema, and the schema is checked during write and read operations, e.g. data in RDBMS systems like Oracle, MySQL, SQL Server, etc.
Unstructured: Data without any defined structure; it can be in any form, e.g. web server logs, e-mail, images, etc.
Semi-structured: Data that is not strictly structured but has some structure, e.g. XML files.
Depending on the type of data to be processed, we have to choose the right technology.
Some more projects that are part of the Hadoop ecosystem:
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Pig™: A high-level data-flow language and execution framework for parallel computation.
A Hive vs. Pig comparison can be found in my other post on this question.
HBase won't replace MapReduce. HBase is a scalable distributed database, and MapReduce is a programming model for the distributed processing of data. MapReduce may act on data stored in HBase during processing.
You can use Hive/HBase for structured/semi-structured data and process it with Hadoop MapReduce.
You can use Sqoop to import structured data from a traditional RDBMS such as Oracle or SQL Server and process it with Hadoop MapReduce.
You can use Flume to ingest unstructured data and then process it with Hadoop MapReduce.
Have a look at: Hadoop Use Cases
Hive should be used for analytical querying of data collected over a period of time, e.g. calculating trends or summarizing website logs, but it can't be used for real-time queries.
HBase is a good fit for real-time querying of big data. Facebook uses it for messaging and real-time analytics.
Pig can be used to construct data flows, run scheduled jobs, crunch big volumes of data, aggregate/summarize it, and store the results in relational database systems. It is good for ad hoc analysis.
Hive can also be used for ad hoc data analysis, but unlike Pig it doesn't support all unstructured data formats.
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services, all of which are very useful for a variety of distributed systems. HBase is not operational without ZooKeeper.
Apache Spark is a general compute engine that offers fast data analysis at large scale. Spark can run on top of HDFS, but it bypasses MapReduce and uses its own data processing framework instead. Common use cases for Apache Spark include real-time queries, event stream processing, iterative algorithms, complex operations, and machine learning.
Mahout™: A Scalable machine learning and data mining library.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine
I have covered only some of the key components of the Hadoop ecosystem. If you would like to see all the components of the ecosystem, have a look at this ecosystem table.
If the above table is difficult to digest, have a look at a condensed version of the ecosystem in this article.
But to understand all of these systems, I would suggest starting with the Apache website first and exploring other articles later.
Big data is not a technology in itself; it is a concept.
Think of "database": a database is not a technology in itself, it is a concept. Oracle, DB2, etc. are database technologies.
Coming back to big data: the concept refers to dealing with huge volumes of data that are difficult to analyze using traditional databases or technologies. People treat Hadoop as a synonym for big data, but Hadoop is just one technology, developed under Apache, that implements the big data concept.
Hadoop has its own file system called HDFS, and it uses MapReduce to solve big data problems. On top of Hadoop there is Hive, which offers a SQL-like language but internally uses MapReduce. HBase is a NoSQL-style database. Pig is a scripting language that uses MapReduce internally.
There are also several commercial distributions, like MapR, Hortonworks, and Cloudera.
So start learning with Hadoop: HDFS, MapReduce, YARN, Hive.
Things I did to learn Hadoop.
a) Install Hadoop from scratch. I mean download CentOS, Hadoop, Java, etc. and install them manually.
b) Understand how HDFS works.
c) Understand how MapReduce works.
d) Write a word count job in Java.
This will help you get started.
Related
There are a whole lot of Hadoop ecosystem pictures on the internet, but I struggle to understand how the tools work together.
E.g. in the attached picture, why are Pig and Hive based on MapReduce, whereas other tools like Spark or Storm sit on YARN?
Would you be so kind as to explain this?
Thanks!
BR
[Attached picture: Hadoop ecosystem diagram]
The picture shows Pig and Hive on top of MapReduce. This is because MapReduce is a distributed computing engine that is used by Pig and Hive: Pig and Hive queries get executed as MapReduce jobs. It is easier to work with Pig and Hive, since they give a higher-level abstraction over MapReduce.
Now let's take a look at Spark/Storm/Flink on YARN in the picture. YARN is a cluster manager that allows various applications to run on top of it. Storm, Spark and Flink are all examples of applications that can run on top of YARN. MapReduce is also an application that can run on YARN, as shown in the diagram. YARN handles the resource-management piece so that multiple applications can share the same cluster. (If you are interested in another example of a similar technology, check out Mesos.)
Finally, at the bottom of the picture is HDFS. This is the distributed storage layer that allows applications to store and access data. It provides features such as distributed storage, replication and fault tolerance.
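To make "store and access data" concrete, applications usually talk to HDFS through the org.apache.hadoop.fs.FileSystem API. Here is a minimal sketch; the /user/demo path is just a placeholder, and the cluster address is assumed to come from core-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListing {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // List a directory; "/user/demo" is a placeholder path.
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }
}
```

The same API is what MapReduce, Pig, Hive, Spark, and the hdfs command-line tool use under the hood to read and write files in the cluster.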
If you are interested in deeper-dives, check out the Apache Projects page.
I am sorry if this question seems naive, but I am new to the data engineering field and am self-learning right now. My question is: what is the difference between ETL products like Pentaho and Hadoop?
When would I use one instead of the other? Or can I use them together, and if so, how?
Thank you,
An ETL tool Extracts data, Transforms it (join, enrich, filter, ...) and Loads the result into another data store. Good ETL tools are visual, data-store agnostic and easy to automate.
Hadoop is a data store distributed over a cluster of machines, plus software to process that distributed data. Its data transformation is specialized around a few elementary operations that can be optimized for this usually massive amount of data, such as (but not only) MapReduce.
Pentaho Data Integration has connectors to Hadoop systems that are easy to set up and tune. So a good strategy is to set up a Hadoop cluster as the data store and manipulate it through PDI.
Pentaho PDI is a tool for creating, managing, running and monitoring ETL workflows. It can work with Hadoop, RDBMSs, queues, files, etc. Hadoop is a platform for distributed storage and computation (the MapReduce framework, HDFS, etc.). Many tools can run on Hadoop, or can connect to Hadoop to use its data and run processes.
Pentaho PDI can connect to Hadoop using its own connectors and read/write data. You can start a Hadoop job from PDI, and PDI can also process data itself inside a transformation flow and store or send results to HDFS, an RDBMS, a queue, email, etc. Of course you can build your own tooling for ETL workflows or simply use bash + Hive, etc., but PDI allows ETL processing in a unified way that doesn't depend on specific data sources and targets. Pentaho also has great visualization.
My understanding was that Spark is an alternative to Hadoop. However, when trying to install Spark, the installation page asks for an existing Hadoop installation. I'm not able to find anything that clarifies that relationship.
Secondly, Spark apparently has good connectivity to Cassandra and Hive. Both have SQL-style interfaces. However, Spark has its own SQL. Why would one use Cassandra/Hive instead of Spark's native SQL, assuming this is a brand-new project with no existing installation?
Spark is a distributed in memory processing engine. It does not need to be paired with Hadoop, but since Hadoop is one of the most popular big data processing tools, Spark is designed to work well in that environment. For example, Hadoop uses the HDFS (Hadoop Distributed File System) to store its data, so Spark is able to read data from HDFS, and to save results in HDFS.
For speed, Spark keeps its data sets in memory. It will typically start a job by loading data from durable storage, such as HDFS, HBase, a Cassandra database, etc. Once the data is loaded into memory, Spark can run many transformations on the data set to calculate a desired result. The final result is then typically written back to durable storage.
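A rough sketch of that load-transform-save pattern, using Spark's Java RDD API; the HDFS paths and the "ERROR" filter are placeholders, not part of the original answer.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ErrorLogFilter {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ErrorLogFilter");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load from durable storage (placeholder HDFS path).
        JavaRDD<String> lines = sc.textFile("hdfs:///logs/access.log");

        // Transformations run in memory across the cluster; cache() keeps the
        // filtered set in memory because it is used twice below.
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")).cache();
        System.out.println("error lines: " + errors.count());

        // Write the final result back to durable storage (placeholder path).
        errors.saveAsTextFile("hdfs:///logs/errors-out");
        sc.stop();
    }
}
```

The intermediate data set never has to be spilled to disk between the two actions, which is exactly where the speedup over multi-pass MapReduce comes from.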
In terms of it being an alternative to Hadoop, it can be much faster than Hadoop at certain operations. For example, a multi-pass MapReduce-style operation can be dramatically faster in Spark than with Hadoop MapReduce, since most of Hadoop's disk I/O is avoided. Spark can read data formatted for Apache Hive, so Spark SQL can be much faster than using HQL (Hive Query Language).
Cassandra has its own native query language called CQL (Cassandra Query Language), but it is a small subset of full SQL and is quite poor for things like aggregation and ad hoc queries. So when Spark is paired with Cassandra, it offers a more feature-rich query language and allows you to do data analytics that native CQL doesn't provide.
Another use case for Spark is stream processing. Spark can be set up to ingest incoming real-time data, process it in micro-batches, and then save the result to durable storage, such as HDFS, Cassandra, etc.
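A minimal sketch of that micro-batch model with Spark Streaming's Java API; the socket source, host, and port are made-up placeholders.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSketch {
    public static void main(String[] args) throws Exception {
        // local[2]: one thread for the receiver, one for processing.
        SparkConf conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]");

        // Each micro-batch covers 10 seconds of incoming data.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Ingest lines of text from a TCP socket (placeholder host/port).
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // Count the records in each batch and print the per-batch result;
        // a real job would write to durable storage such as HDFS or Cassandra instead.
        lines.count().print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```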
So Spark is really a standalone in-memory system that can be paired with many different distributed databases and file systems to add performance, a more complete SQL implementation, and features they may lack, such as stream processing.
I'm writing a paper about Hadoop for university and stumbled over your question. Spark only uses Hadoop for persistence, and only if you want it to. It's possible to use it with other storage back ends, such as Amazon S3.
On the other hand, Spark runs in-memory, and it is not primarily built for the batch MapReduce use cases that Hadoop was/is.
I can recommend this article if you would like a more detailed description: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
The README.md file in Spark can solve your puzzle:
A Note About Hadoop Versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
Please refer to the build documentation at
"Specifying the Hadoop Version"
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.
It is fairly obvious, and we will all agree, that we can call HDFS + YARN + MapReduce "Hadoop". But what about other combinations and other products in the Hadoop ecosystem?
Is, for example, HDFS + YARN + Spark still Hadoop? Is HBase Hadoop? I guess we consider HDFS + YARN + Pig to be Hadoop, since Pig uses MapReduce.
Are only the MapReduce-based tools considered Hadoop, while anything else running on HDFS + YARN (like Spark) is not Hadoop?
I agree with your impression that the "Hadoop" term does not have a useful definition. "We have a Hadoop cluster" may mean various things.
There is an official answer though at http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F:
The Apache™ Hadoop® project develops open-source software for
reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.
So "Hadoop" is the name of a project and a software library. Any other usage is ill-defined.
In addition to the Apache Hadoop definition from the official website, I would like to highlight that Hadoop is a framework and there are many subsystems in the Hadoop ecosystem.
I am quoting this content from the official website so that broken links in the future do not cause any issues with this answer.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
More or less,
Hadoop = Distributed Storage (HDFS) + Distributed Processing (YARN + MapReduce)
But these four modules do not cover the complete Hadoop ecosystem. There are many Hadoop-related projects and 40+ subsystems in the Hadoop ecosystem.
Other Hadoop-related projects at Apache include:
Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A Scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: A high-performance coordination service for distributed applications.
Coming back to your question:
Just have a look at the 40+ subsystems in the Hadoop ecosystem. Not everything you have quoted is Hadoop itself, but most of it is related to Hadoop.
Spark is part of the Hadoop ecosystem, but it does not have to use HDFS or YARN. HDFS data sets can be replaced with RDDs (resilient distributed datasets) built from other sources, and Spark can run in standalone mode without YARN.
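For example, a Spark job can run entirely inside a single JVM in local mode, with an RDD built from an in-memory collection, so neither HDFS nor YARN is involved. A minimal sketch (the data is made up):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalSpark {
    public static void main(String[] args) {
        // "local[*]" runs Spark inside this JVM: no YARN and no HDFS required.
        SparkConf conf = new SparkConf().setAppName("LocalSpark").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An RDD built from an in-memory collection instead of HDFS files.
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        int sum = numbers.reduce(Integer::sum);

        System.out.println("sum = " + sum);
        sc.stop();
    }
}
```

The same code runs unchanged on a YARN cluster reading from HDFS; only the master setting and the data source change.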
Have a look at this article and this article for Hadoop & Spark comparison.
Spark use cases over Hadoop:
Iterative Algorithms in Machine Learning
Interactive Data Mining and Data Processing
Stream processing
Sensor data processing
Since Spark doesn't have its own storage system, it has to depend on a distributed storage system, and HDFS is one of them.
Have a look at this related SE question:
Can apache spark run without hadoop?
The most generally accepted understanding of Hadoop: HDFS and MapReduce and their related processes and tooling.
A related term is the Hadoop ecosystem: Hive/Pig/HBase, ZooKeeper, Oozie, and also vendor-specific tools such as Impala and Ambari.
Why do we require a big data system?
STORE (to store massive amounts of data)
PROCESS (to process data/requests in a timely manner)
SCALE (to scale easily as the data grows)
Google provided an early big data solution:
Google File System: to solve distributed storage.
MapReduce: to solve distributed computing.
Google published research papers on these systems. Apache developed an open-source system similar to the one developed by Google; it is known as Hadoop.
The Hadoop counterpart of the Google File System is HDFS (Hadoop Distributed File System): a file system to manage the storage of data.
MapReduce: a framework to process data across multiple servers.
Note: In 2013, Apache released Hadoop 2.0, in which MapReduce was split into two components:
MapReduce: a framework to define the data processing task.
YARN: a framework to run the data processing task.
HADOOP ECOSYSTEM
Hadoop was not easy to understand, and it was limited to hard-core developers. To make using Hadoop easier, lots of tools sprang up, which as a whole are known as the Hadoop ecosystem.
The Hadoop ecosystem contains tools like:
HIVE:
provides a SQL interface to Hadoop.
a bridge to Hadoop for folks who don't have exposure to object-oriented programming in Java.
HBASE:
a database management system on top of Hadoop.
integrates with your application just like a traditional database.
PIG:
a data manipulation language.
transforms unstructured data into a structured format.
you can query this structured data using interfaces like Hive.
SPARK:
a distributed computing engine used along with Hadoop.
provides an interactive shell to quickly process datasets.
has a bunch of built-in libraries for machine learning, stream processing, graph processing, etc.
OOZIE:
a tool to schedule workflows across the Hadoop ecosystem technologies.
FLUME/SQOOP:
tools to transfer data between other systems and Hadoop.
This concludes a very high level overview of Hadoop.
Hadoop is not a stack like the LAMP or MEAN stack. It is a collection of frameworks and tools that work together to solve complex big data problems.
It is basically a project under the Apache Software Foundation. Various subprojects like MapReduce, Ambari, Sqoop, Spark, ZooKeeper, etc. together make up what is called the Hadoop ecosystem.
Source: https://www.datacloudschool.com/2020/01/introduction-what-is-hadoop.html
I am looking to understand and probably play with Hadoop, and am looking at the open source projects from Facebook here. There seem to be way too many to wrap my head around. If someone can explain where and how each of these projects fits, that would be a great help.
As some background, I am thinking about working on a project where the primary driver is images, so I want to start things off right when picking a platform (solution). Please feel free to suggest any other technologies as well.
Cloudera has a table that gives equivalents of core Hadoop projects in terms of the Google stack:
Google | Hadoop
MapReduce | MapReduce
GFS | HDFS
BigTable | HBase
Chubby | ZooKeeper
Sawzall | Hive, Pig
These, and particularly the first four, are the core components the others build on. MapReduce spawns workers as close as possible to the data they will work on. HDFS replicates unstructured data. HBase is a column store. ZooKeeper does service discovery, locking, and leader election. Hive and Pig are high-level query languages, implemented as MapReduce computations over data in HDFS (or HBase).
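To illustrate the "column store" part, here is a minimal sketch using the (newer) HBase Java client API; the table name, column family, and values are made up for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum etc. from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "u1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```

Note how access is by row key and column, not by SQL query; that is what makes HBase suit random, real-time reads and writes rather than ad hoc analytics.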
There is a lot more to the project ecosystem, from self-contained tools like Avro (serialization; think protocol buffers) and toolkits like Mahout (machine learning), to full-featured products like Nutch (the crawler and search engine from which Hadoop was spun off).
Integrators are making distributions of Hadoop and Hadoop-like stacks (Hadoop is loosely coupled, and some provide alternatives to important components); the core projects are maintained by the Apache Software Foundation.
I wrote an article on this very topic last month:
The Hadoop Universe
I think it explains all the Hadoop-related Apache projects reasonably well, in a paragraph each.
The Hadoop ecosystem is growing at a very fast pace. There are open-source (like Cloudera) and commercial (like MapR) offerings. Start with the Hadoop ecosystem world map and go to the next level as required. The article is a bit outdated but still relevant.