A project of mine is to compare different variants of Hadoop. It is said that there are many of them out there, but googling didn't work well for me :(
Does anyone know of any different variants of Hadoop? The only one I found was HaLoop.
I think the more generic term is "map reduce":
http://www.google.com/search?q=map+reduce
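To make the term concrete, here is a minimal, illustrative sketch (plain Python with made-up input, not Hadoop code) of the map, shuffle, and reduce phases that all of these systems share:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in an input split
    for word in document.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group all intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values for each key
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"])  # 2
```

A real map-reduce system runs these phases across many machines, with the shuffle moving data between them over the network; that distribution is what the different Hadoop variants implement differently.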
Not exactly sure what you mean by different variants of Hadoop.
But there are a lot of companies providing commercial support for Hadoop or their own versions of it (open-source and proprietary). You can find more details here.
For example, MapR has its own proprietary implementation of Hadoop, but they claim it's compatible with Apache Hadoop, which is a bit vague because Apache Hadoop is evolving and there are no standards around the Hadoop API. Cloudera has its own version of Hadoop, CDH, which is based on Apache Hadoop. Hortonworks, which was spun off from Yahoo!, provides commercial support for Hadoop.
You can find more information here. Hadoop is evolving very fast, so this might be a bit stale.
This can refer to
- Hadoop's file system,
- or its map-reduce support,
- or, even more generally, the idea of cloud/distributed storage systems.
It's best to clarify which aspects of Hadoop you are interested in.
Of course, when comparing Hadoop academically, you should start by looking at GFS, since that is the origin of Hadoop.
Setting HBase aside, we can see Hadoop as two layers: a storage layer and a map-reduce layer.
The storage layer has several quite different implementations that would be interesting to compare: the standard Hadoop file system (HDFS), HDFS over Cassandra (Brisk), HDFS over S3, and MapR's Hadoop implementation.
MapR has also changed the map-reduce implementation.
This site, http://www.nosql-database.org/, has a list of many of the NoSQL databases out there. Maybe it can help you.
Related
I have a curious question.
What are some distributed and scalable alternatives to Hadoop? I am looking for a distributed file system like HDFS that can be used as cheap and effective storage, and I would like a data processing engine (batch/real-time) on top of it. I know Spark can be a good alternative, but I would like to use this system as a file archive which is distributed, fault-tolerant, and scalable. Are there any apt solutions? Suggestions are welcome. Thanks :)
There are some other alternatives to Hadoop and Apache Spark, such as Cluster Map Reduce and Hydra; they are all relatively good for big-data projects. Read more here:
https://datafloq.com/read/Big-Data-Hadoop-Alternatives/1135
If you are still looking into alternatives, this Gigaom article may help:
https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
By default, Spark writes to HDFS.
Since HDFS is an open-source alternative to GFS (Google FS), you can use a connector to GFS (Google FS is available via Google Cloud Platform storage services)... there is a catch: it is expensive for massive data transfers between nodes/clusters. Hadoop was not designed for real-time data, but for less dynamic data.
MapR claims to be 20% faster than regular HDFS (its underlying file system, MapR-FS, replaces HDFS) https://mapr.com/why-mapr/
NetApp has an alternative to HDFS as well http://www.netapp.com/us/solutions/applications/big-data-analytics/index.aspx
The links above all come from the Gigaom article I shared.
I hope this helps somehow.
I know that Hadoop is based on a master/slave architecture.
HDFS works with NameNodes and DataNodes,
and MapReduce works with JobTrackers and TaskTrackers.
But I can't find these services in MapR; I found out that it has its own architecture with its own services.
I'm a little bit confused. Could anyone please tell me the difference between using Hadoop alone and using it with MapR?
You have to refer to the latest Hadoop 2.x architecture, since YARN (Yet Another Resource Negotiator) and High Availability were introduced in the 2.x versions.
The JobTracker and TaskTrackers are replaced with the ResourceManager, NodeManagers, and per-application ApplicationMasters.
Hadoop 2.x YARN & High Availability
For the MapR architecture, refer to the MapR article.
For a comparison between the different distributors, refer to this image.
A detailed comparison is available in the Data-magnum article by Bill Vorhies.
MapR and Apache Hadoop DO NOT have the same architecture at the storage level. MapR uses its own file system, MapR-FS, which is completely different from HDFS in terms of concept and implementation. You can find a more detailed comparison here: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots#.VfGwwxG6eUk
https://www.mapr.com/resources/videos/comparison-mapr-fs-and-hdfs
MapR uses most of the Apache big-data distributions as its baseline.
MapR is a Hadoop (and big-data technology stack) distribution provider with certain add-ons and technical support for its clients.
Underneath, the MapR distribution is largely on the same architecture as Apache Hadoop, including most of the core library distribution. The MapR distribution is more like a bundle of a complete and mutually compatible big-data technology package.
The main benefit of MapR is that its distribution of the various technologies like Hive, HBase, Spark, etc. will be compatible with core Hadoop and with each other. This is particularly important because the big-data technologies evolve at different paces, and hence new releases become incompatible very quickly.
So vendors like MapR, Cloudera, etc. provide their own Hadoop distributions and support, such that end users can concentrate on building their product without worrying about compatibility issues. But almost all of them are using the Apache distribution under the carpet.
In the future, they might come up with certain variations and additional features in an attempt to prevent clients from switching to other vendors, but as of now that is not the case.
I started working on Hadoop MapReduce.
I am a beginner in Java and Hadoop and know how to code Hadoop MapReduce, but I am interested in learning how it works internally in the cloud.
Can you please share some good links which explain how Hadoop works internally?
How Hadoop works is not related to the cloud. It works the same way on 3 laptops ;-) Hadoop is often linked to cloud computing because it is designed to be used with a lot of cheap machines, so it makes sense to run Hadoop in the cloud.
By the way, Hadoop is NOT only map/reduce. It's a distributed file system first, and we are able to execute distributed tasks on the distributed files. And NOT ONLY map/reduce tasks (since version 2, I think).
It's a very large subject, so if you are starting out, you will have to read many articles before you become a master ;-)
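As a complement to the reading, here is a toy, single-process sketch (plain Python, illustrative only) of what Hadoop does internally with a job: the input is split into blocks, each mapper processes one block, and intermediate keys are routed to reducers with a hash partitioner, mimicking Hadoop's default HashPartitioner:

```python
from collections import defaultdict

NUM_REDUCERS = 3

def mapper(block):
    # In Hadoop, each mapper reads one input split (roughly one HDFS block)
    for word in block.split():
        yield word, 1

def partition(key):
    # Mimics Hadoop's default HashPartitioner: hash(key) mod numReduceTasks
    return hash(key) % NUM_REDUCERS

blocks = ["big data is big", "data is everywhere"]

# Map phase: route each intermediate pair to the reducer that owns its key
reducer_inputs = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for block in blocks:
    for key, value in mapper(block):
        reducer_inputs[partition(key)][key].append(value)

# Reduce phase: each reducer independently aggregates its own partition
result = {}
for groups in reducer_inputs:
    for key, values in groups.items():
        result[key] = sum(values)

print(result["big"], result["data"])  # 2 2
```

Because every occurrence of a key hashes to the same reducer, each reducer can aggregate its partition without talking to the others; in real Hadoop, the mappers and reducers run on different machines and the routing step is the network shuffle.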
My advice: first look for articles about MapReduce:
http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/ (short)
https://developer.yahoo.com/hadoop/tutorial/module4.html (long)
Then look for articles about the Hadoop architecture (the file system, then YARN):
http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/YARN.html
You should have a look on SlideShare too.
I see MapR described as a substitute for MapReduce jobs, one which can read data directly from a stream and process it. Is my understanding correct?
Are there any samples that I can refer to?
Is it commercial?
Is there any catch in using it?
Is it a substitute for Flume?
Can we use it with Apache Hadoop? If yes, then why does the distribution only talk about YARN and MapReduce and not MapR?
Thanks in advance.
MapR is a commercial distribution of Apache Hadoop with HDFS replaced by MapR-FS. Essentially it is the same Hadoop, with the same Map-Reduce jobs running on top of it, covered with tons of marketing that causes the confusion and questions like yours. Here's the diagram of the components they have in their distribution: https://www.mapr.com/products/mapr-distribution-including-apache-hadoop
For stream processing on top of MapR you can use Apache Spark Streaming, Apache Flume, or Apache Storm; it depends on the task you need to solve.
Yes, it is commercial, licensed on a per-node basis as far as I know. You can easily contact their sales guys; they would be glad to explain the prices and terms.
Just like with the other Hadoop distributions. Personally I would prefer a fully open-source platform rather than the proprietary MapR-FS, but it's up to you to choose.
No
Because Apache Hadoop is part of many commercial distributions: Cloudera, MapR, Hortonworks, Pivotal, etc. When you read about Hadoop, you read about the system architecture, and not about the commercial packages that offer enterprise support for it.
I am looking to understand and probably play with Hadoop, and am looking at the open-source projects from Facebook here. There seem to be way too many to wrap my head around. If someone can explain where and how each of these projects fits, that would be a great help.
As some background, I am thinking about working on a project where the primary driver is images. So I want to start things off right when picking a platform (solution). Please feel free to suggest any other technologies as well.
Cloudera has a table that gives equivalents of core Hadoop projects in terms of the Google stack:

Google    | Hadoop
MapReduce | MapReduce
GFS       | HDFS
BigTable  | HBase
Chubby    | ZooKeeper
Sawzall   | Hive, Pig
These, and particularly the first four, are the core components others build on. MapReduce spawns workers as close as possible to the data they will work on. HDFS replicates unstructured data. HBase is a column store. ZooKeeper does service discovery, locking, and leader election. Hive and Pig are high-level query languages, implemented as MapReduce computations over data in HDFS (or HBase).
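To make the Hive/Pig point concrete, here is a hedged sketch (plain Python with made-up data, not Hive's actual planner) of how a GROUP-BY-with-average query lowers to one map step and one reduce step:

```python
from collections import defaultdict

# Hypothetical table rows; in Hive these would live in files on HDFS
rows = [("eng", 100), ("eng", 120), ("sales", 80)]

def map_row(row):
    # Map: the GROUP BY column becomes the key; the aggregated column, the value
    dept, salary = row
    yield dept, salary

# Shuffle: group salaries by department
groups = defaultdict(list)
for row in rows:
    for key, value in map_row(row):
        groups[key].append(value)

# Reduce: compute the average salary for each group
averages = {dept: sum(vals) / len(vals) for dept, vals in groups.items()}
print(averages)  # {'eng': 110.0, 'sales': 80.0}
```

Real Hive compiles the query into a plan of such map and reduce stages and submits them as Hadoop jobs; the sketch above just shows why a GROUP BY maps so naturally onto the shuffle's key grouping.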
There is a lot more to the project ecosystem, ranging from self-contained tools like Avro (serialisation; think protocol buffers), through toolkits like Mahout (machine learning), to full-featured products like Nutch (the crawler and search engine from which Hadoop was spun off).
Integrators are making distributions of Hadoop and Hadoop-like stacks (Hadoop is loosely coupled and some provide alternatives to important components); the core projects are maintained by the Apache foundation.
I wrote an article on this very topic last month:
The Hadoop Universe
I think it explains all the Hadoop-related Apache projects reasonably, in a paragraph each.
The Hadoop ecosystem is growing at a very fast pace. There are open-source distributions (like Cloudera's CDH) and commercial ones (like MapR). Start with the Hadoop ecosystem world map and go to the next level as required. The article is a bit outdated, but still relevant.