How does Hadoop MapReduce work internally in the cloud? - hadoop

I have started working on Hadoop MapReduce.
I am a beginner in Java and Hadoop. I know how to code for Hadoop MapReduce, but I am interested in learning how it works internally in the cloud.
Can you please share some good links which explain how Hadoop works internally?

How Hadoop works is not related to the cloud. It works the same way on 3 laptops ;-) Hadoop is often linked to cloud computing because it is designed to run on a lot of cheap machines, so it makes sense to run Hadoop in the cloud.
By the way, Hadoop is NOT only map/reduce. It is a distributed file system first, and it lets us execute distributed tasks on the distributed files. And NOT ONLY map/reduce tasks (since version 2, I think).
It's a very large subject, so if you are just starting, you will have to read many articles before you master it ;-)
My advice: first look for articles about MapReduce:
http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/ (short)
https://developer.yahoo.com/hadoop/tutorial/module4.html (long)
Then look for articles about Hadoop architecture (the file system, then YARN):
http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/YARN.html
You should have a look on SlideShare too.
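To give a feel for what happens internally before reading those articles, here is a minimal plain-Java sketch of the three phases of a word count: map, shuffle and reduce. It runs in a single process and uses no Hadoop classes; the class and method names (MapReduceSketch, map, shuffle, reduce, run) are made up for this illustration. In a real cluster the map calls run in parallel on different machines and the shuffle moves data over the network.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of map -> shuffle -> reduce.
// Hypothetical names; this is not the Hadoop API.
class MapReduceSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key (keys kept sorted).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: combine all values of one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Drives the three phases one after another.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(Arrays.asList("hello hadoop", "hello world")));
    }
}
```

The point of the split is that map and reduce never share state: each map call sees one line and each reduce call sees one key, which is what lets the framework spread them over many machines.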

Related

How to start exploring BigData, Hadoop and its ecosystem components?

I have just started exploring Big Data technology and the Hadoop framework.
But I am getting confused by so many ecosystem components and frameworks. Could you please advise how to get a structured start for learning?
I mean, which ecosystem component should one focus on? Any in particular, or all of them?
Help much appreciated!
Ranit
I wrote this answer on Quora a few months back. Hope this will help:
1. Go through some introductory videos on Hadoop
It's very important to have a high-level idea of Hadoop before you start working on it directly. These introductory videos will help in understanding the scope of Hadoop and the use cases where it can be applied. There are a lot of resources available online, and going through any of the videos will be beneficial.
2. Understanding MapReduce
The second thing which helped me was to understand what Map Reduce is and how it works. It is explained very nicely in this paper: http://static.googleusercontent....
Another nice tutorial is available here : http://ksat.me/map-reduce-a-real...
For points 1 and 2, go through the first four video lectures of week one. The whole concept of distributed computing and MapReduce is explained very nicely here: https://class.coursera.org/mmds-001/lecture
3. Getting started with Cloudera VM
Once you understand the basics of Hadoop, you can download the VM provided by Cloudera and start running some Hadoop commands on it. You can download the VM from this link: http://www.cloudera.com/content/...
It would be nice to get familiar with basic Hadoop commands on the VM and understand how they work.
4. Setting up the standalone/Pseudo distributed Hadoop
I would recommend setting up your own standalone Hadoop on your machine once you are familiar with Hadoop using the VM. The steps for installing are explained very nicely on this blog by Michael G. Noll : Running Hadoop On Ubuntu Linux (Single-Node Cluster) - Michael G. Noll
5. Understanding the Hadoop Ecosystem
It would be nice to get familiar with other components in the Hadoop ecosystem like Apache Pig, Hive, HBase, Flume-NG, Hue etc. All of these serve different purposes, and having some information on each will be really helpful in building any product around the Hadoop ecosystem. You can install all of these easily on your machine and get started with them. The Cloudera VM has most of them installed already.
6. Writing Map Reduce Jobs
Once you are done with steps 1-5, I don't think writing MapReduce jobs will be a challenge. It is explained thoroughly in The Definitive Guide. If MapReduce really interests you, I would suggest reading the book Mining Massive Datasets by Anand Rajaraman, Jure Leskovec and Jeffrey D. Ullman : Page on Stanford
I would recommend going for Hadoop first; it's the basis for a lot of the other systems out there. Check out the main site: http://hadoop.apache.org/ and check out Cloudera. They provide a virtual image of their distribution (called CDH) that comes with everything pre-installed, so you can jump into action without having to deal with installation problems: http://www.cloudera.com/content/cloudera/en/downloads/cdh/cdh-5-2-0.html
After that, I would look into HDFS, just to understand a bit more about how Hadoop stores data. Then it depends on what type of problems you're trying to solve, since each system tackles a specific and (usually) different problem:
Hive / Cassandra: For database-like interaction
Pig: For data transformation.
Spark: For real time data analysis
Check out this link for more details: http://www.cloudera.com/content/cloudera/en/training/library/apache-hadoop-ecosystem.html
I hope you find that useful.
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy - From wikipedia
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
There are four main modules in Hadoop.
1. Hadoop Common: The common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
3. Hadoop YARN: A framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
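To make the HDFS module a little more concrete: HDFS splits each file into fixed-size blocks and stores several copies of every block on different machines. The rough arithmetic can be sketched as below. The class and method names are hypothetical, and the two constants are assumed to be the stock Hadoop 2.x defaults (128 MB blocks, replication factor 3); a given cluster may be configured differently.

```java
// Back-of-the-envelope HDFS storage arithmetic.
// Hypothetical names; the defaults are assumed to match stock Hadoop 2.x
// (dfs.blocksize = 128 MB, dfs.replication = 3).
class HdfsBlockMath {
    static final long DEFAULT_BLOCK_SIZE = 128L * 1024 * 1024;
    static final int DEFAULT_REPLICATION = 3;

    // Number of blocks a file is split into (the last block may be partial).
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw disk space used across the cluster once every block is replicated.
    static long rawBytesStored(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGiB, DEFAULT_BLOCK_SIZE));      // 8
        System.out.println(rawBytesStored(oneGiB, DEFAULT_REPLICATION)); // 3221225472
    }
}
```

The block count matters for MapReduce as well, because by default each block becomes roughly one input split, and therefore one map task.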
Before going further, let's note that we have three different types of data.
Structured: Structured data has a strong schema, and the schema is checked during write and read operations, e.g. data in RDBMS systems like Oracle, MySQL Server etc.
Unstructured: The data does not have any structure and can take any form: web server logs, e-mail, images etc.
Semi-structured: The data is not strictly structured but has some structure, e.g. XML files.
Depending on the type of data to be processed, we have to choose the right technology.
Some more projects that are part of the Hadoop ecosystem:
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Pig™: A high-level data-flow language and execution framework for parallel computation
A Hive vs Pig comparison can be found in my other post in this question.
HBase won't replace MapReduce. HBase is a scalable distributed database and MapReduce is a programming model for distributed processing of data. MapReduce may act on data in HBase during processing.
You can use Hive/HBase for structured/semi-structured data and process it with Hadoop MapReduce.
You can use Sqoop to import structured data from traditional RDBMS databases (Oracle, SQL Server etc.) and process it with Hadoop MapReduce.
You can use Flume to collect unstructured data and process it with Hadoop MapReduce.
Have a look at: Hadoop Use Cases
Hive should be used for analytical querying of data collected over a period of time, e.g. to calculate trends or summarize website logs, but it can't be used for real-time queries.
HBase fits real-time querying of big data. Facebook uses it for messaging and real-time analytics.
Pig can be used to construct dataflows, run scheduled jobs, crunch big volumes of data, aggregate/summarize it and store it in relational database systems. It is good for ad-hoc analysis.
Hive can be used for ad-hoc data analysis, but unlike Pig it can't support all unstructured data formats.
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization and providing group services which are very useful for a variety of distributed systems. HBase is not operational without ZooKeeper.
Apache Spark is a general compute engine that offers fast data analysis on a large scale. Spark can run on top of HDFS but bypasses MapReduce and instead uses its own data processing framework. Common use cases for Apache Spark include real-time queries, event stream processing, iterative algorithms, complex operations and machine learning.
Mahout™: A Scalable machine learning and data mining library.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine
I have covered only some of the key components of the Hadoop ecosystem. If you would like to see all the components of the ecosystem, have a look at this ecosystem table.
If the above table is difficult to digest, have a look at the minified version of the ecosystem in this article.
But to understand all of these systems, I would suggest you start with the Apache website first and explore other articles later.
Big data is not a technology in itself; it is a concept.
Think of databases: a database is not a technology in itself, it is a concept. Oracle, DB2 etc. are database technologies.
So, coming back to big data, this concept is used to deal with huge data which is difficult to analyze using traditional databases or technologies. People think of Hadoop as a synonym for big data, but Hadoop is just a technology developed by Apache to implement the big data concept.
Hadoop has its own file system called HDFS, and it uses MapReduce to solve big data problems. Apart from Hadoop there is Hive, which is similar to SQL but internally uses MapReduce. HBase is a NoSQL database. Pig is a scripting language which uses MapReduce internally.
There are many commercial distributions for big data, like MapR, Hortonworks, Cloudera etc.
So start learning with Hadoop: HDFS, MapReduce, YARN, Hive.
Things I did to learn Hadoop.
a) Install Hadoop from scratch. I mean download CentOS, Hadoop, Java etc., and install them manually.
b) Understand how HDFS works.
c) Understand how MapReduce works.
d) Write word count in Java.
This will help you get started.

Running a non-MapReduce program in Hadoop

I have a question. I have a program written in NetBeans. The program reads data from Cassandra and writes the result back into it. My program is not MapReduce at all. I built the program into a .jar file. Now, I want to know if I can execute it in Hadoop.
Actually, I want to know: can I run a non-MapReduce program in Hadoop?
You could architect this program to run on Hadoop v2 as a YARN application. This would require re-architecting your application to fit the YARN paradigm. An example of how to do this is given here: Writing App Framework on Yarn
This is not a simple exercise. Also, if you are interested in using Hadoop, I would consider simply rewriting your application to use HBase (another NoSQL columnar database and a competitor to Cassandra), which is built on top of HDFS. It integrates with MapReduce, so MapReduce jobs can read from and write to HBase tables.
This question is ages old but has never been answered. Anyhow, two projects are looking into this issue:
Apache Slider (incubating): http://slider.incubator.apache.org/
and
Apache Myriad (incubating): http://myriad.incubator.apache.org/
Slider is mainly sponsored by Hortonworks while Myriad is a MapR / Mesosphere project with large assistance from PayPal.

What are sites for Hadoop best practices?

What are some sites for Hadoop best practices (not books) where I can get a step-by-step process to create new projects, with small examples? I am not able to find a single site like this. Please share.
There is an awesome article from Yahoo developers on Apache Hadoop: Best Practices and Anti-Patterns
Hadoop is not a single application; it is a distributed processing framework used by several applications that sit on top of it. Pig, Hive, HBase, Cassandra etc. are a few of the many such applications, each designed for a specific requirement. Underneath, these applications rely on the Hadoop framework, which mainly consists of a distributed file system (HDFS) and distributed processing (MapReduce).
Technically, when you have a bare-minimum Hadoop cluster (HDFS + MapReduce only) you can start writing MapReduce-based applications (in Java, or in other languages through Hadoop Streaming) to process some data.
What you could do is first download a pre-built, pre-configured Hadoop virtual image from the Cloudera or Hortonworks distribution and get it running on your machine. After that, start learning to write MapReduce jobs in Java and run them in your virtual machine.
Here is the URL to download Cloudera Hadoop Distribution VM
Here is the link to learn writing simplest wordcount job.
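As a small illustration of the Hadoop Streaming route mentioned above: a streaming mapper is just a program that reads input lines from standard input and writes one key/value pair per line to standard output, with a tab between key and value by default. The sketch below follows that contract for word count. It is written in Java only to stay in one language; in practice streaming is mostly used with scripting languages. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Streaming-style word count mapper: stdin lines in, "word<TAB>1" lines out.
// Hypothetical names; no Hadoop classes are needed for the mapper itself.
class StreamingMapperSketch {

    // Converts one input line into the text the mapper would emit for it.
    static String mapLine(String line) {
        StringBuilder out = new StringBuilder();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.append(word).append('\t').append(1).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.print(mapLine(line));
        }
    }
}
```

A matching reducer would read the sorted "word<TAB>1" lines from its own standard input and sum the counts for each run of identical words.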

Can someone explain the Hadoop stack to me?

I am looking to understand and probably play with Hadoop, and am looking at the open source projects from Facebook here. There seem to be way too many to wrap my head around. If someone can explain where and how each of these projects fits, that would be a great help.
As some background, I am thinking about working on a project where the primary driver is images, so I want to start things off right when picking a platform (solution). Please feel free to suggest any other technologies as well.
Cloudera has a table that gives equivalents of core Hadoop projects in terms of the Google stack:
Google | Hadoop
MapReduce | MapReduce
GFS | HDFS
BigTable | HBase
Chubby | ZooKeeper
Sawzall | Hive, Pig
These, and particularly the first four, are the core components others build on. MapReduce spawns workers as close as possible to the data they will work on. HDFS replicates unstructured data. HBase is a column store. ZooKeeper does service discovery, locking, and leader election. Hive and Pig are high-level query languages, which are implemented as MapReduce computations over data stored in HDFS (or HBase).
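The idea of spawning workers "as close as possible to the data" can be sketched as a simple preference order: a free node that already holds a replica of the block, then a free node in the same rack as a replica, then any free node. The code below is a simplified illustration of that order, not Hadoop's actual scheduler; the class, method, node and rack names are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified locality preference for placing one map task.
// Hypothetical names; this is an illustration, not Hadoop scheduler code.
class LocalityPlacementSketch {

    // Returns the free node closest to the data, or null if no node is free.
    static String pickNode(Set<String> replicaNodes, List<String> freeNodes, Map<String, String> rackOf) {
        // 1. Node-local: a free node that stores a replica of the block.
        for (String node : freeNodes) {
            if (replicaNodes.contains(node)) {
                return node;
            }
        }
        // 2. Rack-local: a free node in the same rack as some replica.
        Set<String> replicaRacks = new HashSet<>();
        for (String node : replicaNodes) {
            String rack = rackOf.get(node);
            if (rack != null) {
                replicaRacks.add(rack);
            }
        }
        for (String node : freeNodes) {
            String rack = rackOf.get(node);
            if (rack != null && replicaRacks.contains(rack)) {
                return node;
            }
        }
        // 3. Off-rack: any free node; the block is then read over the network.
        return freeNodes.isEmpty() ? null : freeNodes.get(0);
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("n1", "rackA");
        rackOf.put("n2", "rackA");
        rackOf.put("n3", "rackB");
        rackOf.put("n4", "rackB");
        Set<String> replicas = new HashSet<>(Arrays.asList("n1"));

        System.out.println(pickNode(replicas, Arrays.asList("n3", "n1"), rackOf)); // n1 (node-local)
        System.out.println(pickNode(replicas, Arrays.asList("n3", "n2"), rackOf)); // n2 (rack-local)
        System.out.println(pickNode(replicas, Arrays.asList("n3", "n4"), rackOf)); // n3 (off-rack)
    }
}
```

Moving the computation to the data is cheap, since a task is a small piece of code, while moving the data to the computation would mean shipping whole blocks across the network.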
There is a lot more to the project ecosystem, from self-contained tools like Avro (serialisation, think protocol buffers), toolkits like Mahout (machine learning), to full-featured products like Nutch (crawler and search engine from which Hadoop was spun off).
Integrators are making distributions of Hadoop and Hadoop-like stacks (Hadoop is loosely coupled and some provide alternatives to important components); the core projects are maintained by the Apache Software Foundation.
I wrote an article on this very topic last month:
The Hadoop Universe
I think it explains all the Hadoop-related Apache projects reasonably, in a paragraph each.
The Hadoop ecosystem is growing at a very fast pace. There is both open source (like Cloudera) and commercial (like MapR) software. Start with the Hadoop ecosystem world map and go to the next level as required. The article is a bit outdated, but is still relevant.

Variants of Hadoop

A project of mine is to compare different variants of Hadoop. It is said that there are many of them out there, but googling didn't work well for me :(
Does anyone know of any different variants of Hadoop? The only one I found was HaLoop.
I think the more generic term is "map reduce":
http://www.google.com/search?gcx=c&sourceid=chrome&ie=UTF-8&q=map+reduce&safe=active
Not exactly sure what you mean by different variants for Hadoop.
But, there are a lot of companies providing commercial support or providing their own versions of Hadoop (open-source and proprietary). You can find more details here.
For example, MapR has its own proprietary implementation of Hadoop, but they claim it's compatible with Apache Hadoop, which is a bit vague because Apache Hadoop is evolving and there are no standards around the Hadoop API. Cloudera has its own version of Hadoop, CDH, which is based on Apache Hadoop. Hortonworks, which was spun off from Yahoo, provides commercial support for Hadoop.
You can find more information here. Hadoop is evolving very fast, so this might be a bit stale.
This can refer to
- Hadoop's file system,
- or its effective support for map reduce...
- or even more generally, to the idea of cloud / distributed storage systems.
Best to clarify what aspects of Hadoop you are interested in.
Of course, when comparing Hadoop academically, you must first start by looking at GFS, since that is the origin of Hadoop.
Setting HBase aside, we can see Hadoop as two layers: a storage layer and a map-reduce layer.
The storage layer has the following really different implementations, which would be interesting to compare: the standard Hadoop file system, HDFS over Cassandra (Brisk), HDFS over S3, and the MapR Hadoop implementation.
MapR has also changed the map-reduce implementation.
This site http://www.nosql-database.org/ has a list of a lot of NoSQL DBs out there. Maybe it can help you.

Resources