How does MapReduce processing work with Local File System? - hadoop

How does MapReduce processing work if the input and output come from the local file system?
Does MapReduce job execution happen asynchronously across the Hadoop cluster?
If yes, how does that happen?
In which use case do we actually need this approach?

MapReduce works the same way on the local file system (mapper -> reducer); it is only a matter of efficiency, since running locally is less efficient than running on a cluster.
Yes, MapReduce job execution happens asynchronously across the Hadoop cluster; how tasks are scheduled depends on which scheduler your MapReduce setup uses. See the Hadoop scheduler documentation for more details.
In most cases this approach is used for testing purposes, i.e. running a MapReduce program against the local file system.
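To make that concrete, here is a minimal sketch of a word-count job forced into local mode, written against the Hadoop 2.x Java API; the class names and the input/output paths are placeholders chosen for illustration. Setting mapreduce.framework.name to "local" runs the job in-process with LocalJobRunner instead of submitting it to the cluster, and setting fs.defaultFS to "file:///" makes it read and write the local file system instead of HDFS; the mapper and reducer code is unchanged.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LocalWordCount {

      // Same mapper logic you would run on a cluster.
      public static class TokenizeMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer it = new StringTokenizer(value.toString());
          while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE);
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local"); // in-process LocalJobRunner, no YARN
        conf.set("fs.defaultFS", "file:///");          // local file system instead of HDFS

        Job job = Job.getInstance(conf, "local word count");
        job.setJarByClass(LocalWordCount.class);
        job.setMapperClass(TokenizeMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("input"));    // local directory (placeholder)
        FileOutputFormat.setOutputPath(job, new Path("output")); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }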

Related

Hadoop in Maui+Torque cluster

I have a cluster with Torque+Maui. Is it possible to install Hadoop on the same cluster? If so, what are the pros and cons of doing this?
This may be a good place to start:
http://hadoop.apache.org/docs/r0.18.3/hod.html
I haven't worked with it personally but I've heard that this isn't being actively maintained.
From what I've seen, Hadoop has its own scheduler, which expects a set of Hadoop nodes to be running where the Hadoop file system lives. This is usually a persistent environment, so you can load the file system once (big data) and assign your job to a node that happens to hold a copy of the data you need. Torque, by contrast, tends to take any set of free nodes from the cluster, assign them to a job, run the job, and then clean up the environment for the next job. This runs contrary to the design of Hadoop.
I can see where it would be good to have an environment that could do both, to fully utilize the systems you already have, but management will be messy at best.

Hadoop YARN - performance of LocalJobRunner vs. cluster deployed job

I'm doing some tests with M/R jobs running on a two-node Hadoop 2.2.0 cluster. One thing I would like to understand is the performance considerations of running a job in local mode (not managed by the ResourceManager) versus running it on YARN. The tests I made show it runs much, much faster when the job is executed via LocalJobRunner than when it is managed by YARN. When setting up the cluster I followed the steps described here: http://raseshmori.wordpress.com/2012/10/14/install-hadoop-nextgen-yarn-multi-node-cluster/ . Perhaps there is some configuration the guide forgot to mention?
Thanks!
You'd run LocalJobRunner for tests and small examples. You'd use the cluster when you need to process amounts of data that justify using Hadoop in the first place (a.k.a. "big data").
When you run a small example, the overhead of distributed execution overwhelms the benefits of parallelization.
Arnon is right. I found in one of my use cases that running with LocalJobRunner is much faster than using YARN. LocalJobRunner runs the map tasks in-process on the local machine; the job is not submitted to the cluster, so map tasks are not scheduled across multiple machines. LocalJobRunner should therefore be used for unit testing the code, and that's it. For all other practical purposes, use YARN.
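As a rough sketch of what actually differs between the two setups (property names as in Hadoop 2.x; the host names below are placeholders, not values from the question): local mode needs no cluster at all, while YARN mode pays a scheduling and container-startup overhead that is only worthwhile for inputs large enough to need several nodes.

    import org.apache.hadoop.conf.Configuration;

    public class ModeConfigs {

      // In-process LocalJobRunner: good for unit tests and small samples.
      static Configuration localMode() {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");
        return conf;
      }

      // Submission to YARN: tasks are scheduled in containers across the cluster,
      // which is only worth the overhead for genuinely large inputs.
      static Configuration yarnMode() {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("fs.defaultFS", "hdfs://namenode:8020");             // placeholder host
        conf.set("yarn.resourcemanager.hostname", "resourcemanager"); // placeholder host
        return conf;
      }
    }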

Does oozie provide any performance optimizations in terms of I/O?

Since Oozie is a workflow engine for the Hadoop platform, does it improve the performance of executing a DAG of dependent MapReduce jobs?
I mean, since the output of one MapReduce job is given as input to the next MapReduce job in the DAG, does Oozie provide any mechanism for storing the intermediate results in memory and thus saving I/O?
Or is it just a workflow manager that coordinates a series of dependent MapReduce jobs?
I want to know how Oozie works internally.
It is just a workflow manager. It doesn't change how, say, MapReduce works even though it runs M/R jobs.
What you are describing is much more like what Apache Spark does. I'm not aware that Oozie integrates directly with Spark yet, but it can't possibly be difficult or far off.
It is "just a workflow manager, that coordinates a series of MapReduce" jobs. It uses the same mechanisms to execute jobs as using the command line.

Difference between PIG local and mapreduce mode

What is the actual difference between running PIG scripts locally and on mapreduce?
I understand that mapreduce mode is when you run it on a cluster that has HDFS installed. Does this mean local mode does not need HDFS, and that no MapReduce jobs get triggered at all? What is the difference, and when do you use one or the other?
Local mode builds a simulated MapReduce job that runs off a local file on disk. It is in theory equivalent to MapReduce, but it's not a "real" MR job; you shouldn't be able to tell the difference from a user perspective.
Local mode is great for development.
Local mode: All scripts are run on a single machine without requiring Hadoop MapReduce and HDFS. This can be useful for developing and testing Pig logic. If you're using a small set of data to develop or test your code, then local mode could be faster than going through the MapReduce infrastructure.
Local mode doesn’t require Hadoop. When you run in Local mode, the Pig program runs in the context of a local Java Virtual Machine, and data access is via the local file system of a single machine. Local mode is actually a local simulation of MapReduce in Hadoop’s LocalJobRunner class.
MapReduce mode (also known as Hadoop mode): Pig is executed on the Hadoop cluster. In this case, the Pig Script gets converted into a series of MapReduce jobs that are then run on the Hadoop cluster.
If you have a terabyte of data that you want to perform operations on and you want to interactively develop a program, you may soon find things slowing down considerably, and you may start growing your storage. Local mode allows you to work with a subset of your data in a more interactive manner so that you can figure out the logic (and work out the bugs) of your Pig program.
After you have things set up as you want them and your operations are running smoothly, you can then run the script against the full data set using MapReduce mode.
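From the command line the switch is just the -x flag: pig -x local script.pig versus pig -x mapreduce script.pig. The same choice can be made programmatically; here is a minimal sketch using Pig's Java API, where the file names and the field layout are made up for illustration:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigModes {
      public static void main(String[] args) throws Exception {
        // Local mode: single JVM, local file system, no cluster required.
        PigServer local = new PigServer(ExecType.LOCAL);
        countLines(local, "sample.txt");   // small local sample for development

        // MapReduce mode: the same Pig Latin compiles into MapReduce jobs
        // that run on the Hadoop cluster and read/write HDFS.
        PigServer cluster = new PigServer(ExecType.MAPREDUCE);
        countLines(cluster, "/data/full"); // full data set on HDFS
      }

      // Identical Pig logic in both modes; only the execution engine differs.
      private static void countLines(PigServer pig, String input) throws Exception {
        pig.registerQuery("lines = LOAD '" + input + "' AS (line:chararray);");
        pig.registerQuery("grouped = GROUP lines BY line;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(lines);");
        pig.store("counts", input + "-counts");
      }
    }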

How do I set up a distributed map-reduce job using hadoop streaming and ruby mappers/reducers?

I'm able to run a local mapper and reducer built using ruby with an input file.
I'm unclear about the behavior of the distributed system though.
For the production system, I have HDFS set up across two machines. I know that if I store a large file on HDFS, it will have some blocks on both machines to allow for parallelization. Do I also need to store the actual mapper and reducer files (my Ruby files in this case) on HDFS as well?
Also, how would I then go about actually running the streaming job so that it runs in a parallel manner on both systems?
If you use mappers/reducers written in Ruby (or anything other than Java), you have to use Hadoop Streaming. Hadoop Streaming has an option to package your mapper/reducer files with the job when it is submitted to the cluster, so you don't need to copy them to HDFS yourself. The following link should have what you are looking for.
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html
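As a concrete, hedged sketch, assuming the scripts are called mapper.rb and reducer.rb and that Ruby is installed on every node: the -file option ships the local scripts with the job, so only the input data needs to be on HDFS, and the framework starts map tasks in parallel on the nodes that hold the input blocks. The same submission the hadoop-streaming jar performs from the command line can also be driven from Java through its Tool class:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.streaming.StreamJob;
    import org.apache.hadoop.util.ToolRunner;

    public class SubmitStreamingJob {
      public static void main(String[] args) throws Exception {
        String[] streamingArgs = {
            "-input", "/user/me/input",    // HDFS input, already split into blocks
            "-output", "/user/me/output",  // HDFS output directory (must not exist)
            "-mapper", "ruby mapper.rb",   // command run for each map task
            "-reducer", "ruby reducer.rb", // command run for each reduce task
            "-file", "mapper.rb",          // ship the local scripts with the job
            "-file", "reducer.rb"
        };
        // Same code path as the hadoop-streaming command line entry point.
        int exitCode = ToolRunner.run(new Configuration(), new StreamJob(), streamingArgs);
        System.exit(exitCode);
      }
    }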
