I know that Hadoop is based on a master/slave architecture:
HDFS works with NameNodes and DataNodes,
and MapReduce works with JobTrackers and TaskTrackers.
But I can't find all of these services on MapR; I found out that it has its own architecture with its own services.
I'm a little confused. Could anyone please tell me the difference between using Hadoop alone and using it with MapR?
You should refer to the latest Hadoop 2.x architecture, since YARN (Yet Another Resource Negotiator) and High Availability were introduced in version 2.x.
The JobTracker and TaskTracker are replaced by the ResourceManager, the NodeManager, and a per-application ApplicationMaster.
Hadoop 2.x YARN & High Availability
For MapR architecture, refer to MapR article
For comparison between different distributors, refer to this image
Detailed comparison is available at Data-magnum article by Bill Vorhies
MapR and Apache Hadoop do NOT have the same architecture at the storage level. MapR uses its own filesystem, MapR-FS, which is completely different from HDFS in concept and implementation. You can find a more detailed comparison here: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots#.VfGwwxG6eUk
https://www.mapr.com/resources/videos/comparison-mapr-fs-and-hdfs
MapR uses most of the Apache big data projects as its baseline.
MapR is a provider of a Hadoop (and big data technology stack) distribution, with certain add-ons and technical support for its clients.
Underneath, MapR is built on the same architecture as Apache Hadoop, including all the core libraries. However, the MapR distribution is more like a bundle: a complete and compatible big data technology package.
The main benefit of MapR is that its distributions of various technologies like Hive, HBase, Spark, etc. will be compatible with core Hadoop and with each other. This is particularly important because the big data technologies are evolving at different paces, so new releases become incompatible very quickly.
So vendors like MapR, Cloudera, etc. provide their own versions of the Hadoop distribution, plus support, so that end users can concentrate on building their product without worrying about compatibility issues. But almost all of them use the Apache distribution under the hood.
In the future, they might come up with certain variations and additional features in an attempt to prevent clients from switching to other vendors, but as of now that is not the case.
It is kind of obvious and we will all agree that we can call HDFS + YARN + MapReduce as Hadoop. But what happens with different other combinations and other products in the Hadoop ecosystem?
Is, for example, HDFS + YARN + Spark still Hadoop? Is HBASE Hadoop? I guess we consider HDFS + YARN + Pig Hadoop, since Pig uses MapReduce.
Are only the MapReduce tools considered Hadoop, but anything else run on HDFS + YARN(like Spark) is non Hadoop?
I agree with your impression that the "Hadoop" term does not have a useful definition. "We have a Hadoop cluster" may mean various things.
There is an official answer though at http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F:
The Apache™ Hadoop® project develops open-source software for
reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.
So "Hadoop" is the name of a project and a software library. Any other usage is ill-defined.
In addition to the Apache Hadoop definition from the official website, I would like to highlight that Hadoop is a framework and that there are many subsystems in the Hadoop ecosystem.
I am quoting this content from the official website so that broken links in the future do not cause any issue for this answer.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
More or less,
Hadoop = Distributed Storage (HDFS) + Distributed Processing (YARN + MapReduce)
But these four modules do not cover the complete Hadoop ecosystem. There are many Hadoop-related projects and 40+ subsystems in the Hadoop ecosystem.
Other Hadoop-related projects at Apache include:
Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A Scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: A high-performance coordination service for distributed applications.
Coming back to your question:
Just have a look at the 40+ subsystems in the Hadoop ecosystem. Not everything you have mentioned is Hadoop itself, but most of it is related to Hadoop.
Spark is part of the Hadoop ecosystem, but it requires neither HDFS nor YARN. Spark works on RDDs (resilient distributed datasets), which can be built from sources other than HDFS, and it can run in standalone mode without YARN.
Have a look at this article and this article for Hadoop & Spark comparison.
Spark use cases over Hadoop:
Iterative Algorithms in Machine Learning
Interactive Data Mining and Data Processing
Stream processing
Sensor data processing
Since Spark doesn't have its own storage system, it has to depend on a distributed storage system, of which HDFS is one option.
Have a look at related SE question:
Can apache spark run without hadoop?
The most generally accepted understanding of Hadoop: HDFS and Map/Reduce and their related processes and tooling.
A related term: Hadoop ecosystem: Hive/Pig/Hbase, Zookeeper, Oozie. Also vendor specific ones such as impala, ambari.
Why do we need a big data system?
STORE (to store massive amount of data)
PROCESS (process the data/requests in timely manner)
SCALE (scale easily as data grows)
Google provided one big data solution:
Google File System: to solve distributed storage.
MapReduce: to solve distributed computing.
Google published research papers describing these systems. Apache developed a similar open-source system, known as Hadoop.
The Hadoop counterpart of the Google File System is HDFS (Hadoop Distributed File System): a file system to manage the storage of data.
MapReduce: a framework to process data across multiple servers.
Note: In 2013, Apache released HADOOP 2.0
(MapReduce was broken into two components:
MapReduce: A framework to define the data processing task.
YARN: A framework to run the data processing task.
)
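To make the MapReduce idea above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using word count. It is plain Python and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. In a real cluster the framework would run the map and reduce steps on many machines and do the grouping for you.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

The point of the split is that map_phase only ever sees one line and reduce_phase only ever sees one key, which is what lets a framework spread the work across servers.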
HADOOP ECOSYSTEM
Hadoop was not easy to understand, and it was limited to hard-core developers. To make Hadoop easier to use, lots of tools sprang up, which as a whole are known as the Hadoop ecosystem.
Hadoop Ecosystem contains tools like:
HIVE:
provides an SQL interface to hadoop.
bridge to hadoop for folks who don't have exposure to OOP in Java.
HBASE:
a database management system on top of hadoop.
integrates with our app just like a traditional database.
PIG:
DATA Manipulation Language.
transforms unstructured data into structured format.
query this structured data using interfaces like Hive.
SPARK:
A distributed computing engine used along with Hadoop.
Interactive shell to quickly process datasets.
has a bunch of built-in libraries for machine learning, stream processing, graph processing, etc.
OOZIE:
tool to schedule workflows on all the hadoop ecosystem technologies.
FLUME/SQOOP:
tools to transfer data between other systems and hadoop.
This concludes a very high level overview of Hadoop.
Hadoop is not a stack like LAMP or MEAN. Hadoop is a collection of frameworks and tools that work together to solve complex big data problems.
It is basically a project under the Apache foundation. Various subprojects like MapReduce, Ambari, Sqoop, Spark, ZooKeeper, etc. together make up what is called the Hadoop ecosystem.
Source : https://www.datacloudschool.com/2020/01/introduction-what-is-hadoop.html
I see a substitution for mapreduce jobs, MapR, which can read data directly from stream and process it. Is my understanding correct?
Are there any samples that I can refer?
Is it commercial?
Is there any catch in using it?
Is it a substitution for flume?
Can we use it with apache hadoop? If yes, then why does the distribution only talk about yarn and mapreduce and not MapR?
Thanks in advance.
MapR is a commercial distribution of Apache Hadoop with HDFS replaced by MapR-FS. Essentially it is the same Hadoop, with the same MapReduce jobs running on top of it, covered with tons of marketing that causes confusion and questions like yours. Here's the diagram of the components they have in their distribution: https://www.mapr.com/products/mapr-distribution-including-apache-hadoop
For stream processing on top of MapR you can use Apache Spark Streaming, Apache Flume, or Apache Storm; it depends on the task you need to solve.
Yes, it is commercial, licensed on a per-node basis as far as I know. You can easily contact their sales team; they would be glad to explain the prices and terms.
The catch is the same as with other Hadoop distributions. Personally I would prefer a fully open-source platform over the proprietary MapR-FS, but it's up to you to choose.
No, it is not a substitute for Flume.
Yes, because Apache Hadoop is part of many commercial distributions: Cloudera, MapR, Hortonworks, Pivotal, etc. When you read about Hadoop, you read about the system architecture, not about the commercial packages that offer support for it to enterprises.
I'm familiar with the infrastructure or architecture of Cloudera:
Master Nodes include NameNode, SecondaryNameNode, JobTracker, and HMaster.
Slave Nodes include DataNode, TaskTracker, and HRegionServer.
Master services should each be on their own node (unless it's a small cluster, in which case SecondaryNameNode, JobTracker, and HMaster may be combined, and even the NameNode if it's a really small cluster).
Slave Nodes should always be colocated on the same node. The more slave nodes, the merrier.
SecondaryNameNode is a misnomer, unless you enable it for High Availability.
Does MapR maintain this setup? How is it similar and how is it different?
Good information by #JamCon in his reply, but there are some things worth clarifying:
The comment regarding patches is not accurate. MapR packages a broad range of Hadoop projects in its distribution so you don't have to separately compile anything. And MapR has the same APIs as any other distro, meaning their packages are not about compatibility but are simply bug fixes / enhancements from the community. There's typically no extra work required to get Hadoop ecosystem projects to run on MapR. And they release ecosystem updates at least once a month, as far as I can tell, to keep current with new enhancements.
Regarding the inclusion of YARN, we've been running MapR on YARN across large clusters since July '14! I believe MapR has their own ecosystem project vetting process, and they graduate MapR packaged versions to GA once they determine a project is ready for enterprise support.
MapR deviates from the vanilla Hadoop & CDH distributions a bit. It keeps most of the services and structure (Job Tracker, Data Nodes, HBase Master & Region, MR, etc), but there are some significant differences.
One of the defining items about MapR's distribution is that it doesn't use HDFS. It has its own custom FS, which features HA and operates without NameNodes (via distributed metadata). It also allowed them to enable NFS access, as well as snapshotting, years ahead of the rest of the Hadoop distros.
The custom FS does complicate their distribution a bit, though ... for example, when you want to run products or services, you often need to install the MapR specific patches. When you want to run mahout, you need to compile it with the MapR patches from https://github.com/mapr/mahout. But it also gives them an opportunity to incorporate better security at the FS level, as seen by the implementation of "Access Control Expressions" and Cluster/Job/Volume ACLs.
Overall, it's a well structured product. My biggest concern is they've deviated so far from the norm that when new innovations are adopted, they're slow to adapt, because it has to be incorporated into their highly modified environment. YARN is a perfect example ... they haven't released it yet, even though their competitors have.
From an architecture standpoint, MapR has no master nodes. The functions that the master nodes provide in a typical Hadoop architecture are instead distributed and performed within the "data nodes" of MapR.
https://www.mapr.com/why-hadoop/why-mapr/architecture-matters
MapR doesn't have a master node; it uses a built-in mechanism instead. Cloudera, by contrast, has a master node, a secondary NameNode, and a ResourceManager.
http://commandstech.com/mapr-vs-cloudera-vs-hortonworks/
I am looking to understand and probably play with Hadoop, and am looking at the open source projects from Facebook here. There seem to be way too many to wrap my head around. If someone can explain where and how each of these projects fits, that would be a great help.
As some background I am thinking about working on a project where the primary driver is images. So want to start things off right when picking a platform (solution). So please feel free to suggest any other technologies as well.
Cloudera has a table that gives equivalents of core Hadoop projects in terms of the Google stack:
MapReduce | MapReduce
GFS | HDFS
BigTable | HBase
Chubby | ZooKeeper
Sawzall | Hive, Pig
These, and particularly the first four, are the core components others build on. MapReduce spawns workers as close as possible to the data they will work on. HDFS replicates unstructured data. HBase is a column store. ZooKeeper does service discovery, locking, and leader election. Hive and Pig are high-level query languages whose queries are executed as MapReduce computations, typically over data stored in HDFS (they can also read HBase tables).
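As a rough illustration of what "HDFS replicates unstructured data" means, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. This is a toy model only: the function names, the node names, and the round-robin assignment are assumptions made for the example, and the real HDFS placement policy is more involved (it takes racks into account, for instance). The block size and replication factor used here are just example values.

```python
import math


def split_into_blocks(file_size, block_size):
    """Return the size of each block of a file; the last block may be smaller."""
    count = math.ceil(file_size / block_size)
    return [min(block_size, file_size - i * block_size) for i in range(count)]


def place_replicas(num_blocks, nodes, replication):
    """Assign every block to `replication` distinct nodes, round-robin.

    Simplified stand-in for a placement policy, not the HDFS algorithm.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)  # sizes in MB
    placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
    for block, holders in placement.items():
        print(f"block {block} ({blocks[block]} MB) -> {holders}")
```

Because every block lives on several nodes, losing one node does not lose data, and a MapReduce worker can usually be started on a node that already holds the block it needs.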
There is a lot more to the project ecosystem, from self-contained tools like Avro (serialisation, think protocol buffers), toolkits like Mahout (machine learning), to full-featured products like Nutch (crawler and search engine from which Hadoop was spun off).
Integrators are making distributions of Hadoop and Hadoop-like stacks (Hadoop is loosely coupled and some provide alternatives to important components); the core projects are maintained by the Apache foundation.
I wrote an article on this very topic last month:
The Hadoop Universe
I think it explains all the Hadoop-related Apache projects reasonably, in a paragraph each.
The Hadoop ecosystem is growing at a very fast pace. There is open source (like Cloudera) and commercial (like MapR) software. Start with the Hadoop ecosystem world map and go to the next level as required. The article is a bit outdated, but still relevant.
A project of mine is to compare different variants of Hadoop, it is said that there are many of them out there, but googling didn't work well for me :(
Does anyone know any different variants of Hadoop? The only one I found was Haloop.
I think the more generic term is "map reduce":
http://www.google.com/search?gcx=c&sourceid=chrome&ie=UTF-8&q=map+reduce&safe=active
Not exactly sure what you mean by different variants for Hadoop.
But, there are a lot of companies providing commercial support or providing their own versions of Hadoop (open-source and proprietary). You can find more details here.
For example, MapR has its own proprietary implementation of Hadoop, but they claim it's compatible with Apache Hadoop, which is a bit vague because Apache Hadoop is evolving and there are no standards around the Hadoop API. Cloudera has its own version of Hadoop, CDH, which is based on Apache Hadoop. Hortonworks, which was spun off from Yahoo, provides commercial support for Hadoop.
You can find more information here. Hadoop is evolving very fast, so this might be a bit stale.
This can refer to
- Hadoop's file system,
- or its effective support for map reduce...
- or even more generally, to the idea of cloud / distributed storage systems.
Best to clarify which aspects of Hadoop you are interested in.
Of course, when comparing Hadoop academically, you must first start by looking at GFS, since that is the origin of Hadoop.
Setting HBase aside, we can see Hadoop as two layers: a storage layer and a map-reduce layer.
The storage layer has the following genuinely different implementations, which would be interesting to compare: the standard Hadoop file system, HDFS over Cassandra (Brisk), HDFS over S3, and the MapR Hadoop implementation.
MapR has also changed the map-reduce implementation.
This site http://www.nosql-database.org/ has a list of a lot of NoSql DBs out there. Maybe it can help you.