Hadoop: force 1 mapper task per node from jobconf - hadoop

I want to run one task (mapper) per node on my Hadoop cluster, but I cannot modify the configuration the TaskTracker runs with (I'm just a user).
For this reason, I need to be able to push the option through the job configuration. I tried setting mapred.tasktracker.map.tasks.maximum=1 on the hadoop jar command line, but the TaskTracker ignores it because it has a different value in its own configuration file.
By the way, the cluster uses the Capacity Scheduler.
Is there any way I can force 1 task per node?
Edited:
Why? I have a memory-bound task, so I want each task to use all the memory available to the node.

When you set the number of mappers, whether through the configuration files or by some other means, it is only a hint to the framework; it does not guarantee that you will get exactly that many mappers. The number of map tasks is actually governed by the number of InputSplits, and split creation is carried out by the logic in your InputFormat. If you really want just one mapper to process an entire file, override isSplitable() in the InputFormat class you are using so that it returns false. But why would you do that? The power of Hadoop lies in distributed parallel processing.
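As a rough illustration of that isSplitable hook (the class name WholeFileTextInputFormat is made up here, and note that this gives one mapper per input file, not one per node):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Treat every input file as a single, unsplittable unit so that each
// file is processed by exactly one mapper.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

You would then register it in the driver with job.setInputFormatClass(WholeFileTextInputFormat.class);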

Related

Map-reduce via Oozie

If I am using Oozie to run a MapReduce job, is there a specific number of mappers that will be started?
Is it:
one for Oozie and one for the MapReduce job, or
one for Oozie and one mapper for every 64 MB block (the default block size)?
The above answers focus on how many maps and reduces a MapReduce job needs. However, as you specifically ask about Oozie, I will share my experience with MapReduce (in Pig) via Oozie.
Explanation
When you kick off an Oozie workflow, you need one YARN application for it. I am not sure what the exact logic is, but it appears that these applications usually require one map task, and occasionally two.
Besides the above, you still need the same number of mappers and reducers to do the actual work as if you were not using Oozie. (If you see a different number than you expected, it may be because you passed specific map or reduce properties when calling the script.)
Warning
The above means that if you had 100 available containers and kicked off 100 workflows (for example, by starting a daily job with a start date 100 days in the past), it is likely that the workflows would take up all available containers and the actual work would be suspended indefinitely.
Short answer: Oozie launches a MapReduce job by submitting a map-only job to the cluster, called the Oozie launcher. I agree with @Dennis Jaheruddin.
Detailed answer, based on my research into Oozie's execution model:
Oozie's execution model is different from the default approach users take to run Hadoop jobs. When a user invokes the Hadoop, Hive, or Pig CLI tool from a Hadoop edge node, the corresponding client executable runs on that node, which is configured to contact and submit jobs to the Hadoop cluster. When the same jobs are defined and submitted via an Oozie workflow action, things work differently.
Let's say you are submitting a workflow job using the Oozie CLI on the edge node. The Oozie client actually submits the workflow to the Oozie server, which typically runs on a different node. Regardless of where it runs, it is the Oozie server's responsibility to submit and run the underlying MapReduce jobs on the Hadoop cluster. Oozie doesn't do so by using the standard client tools installed locally on the Oozie server node. Instead, it first submits a MapReduce job called the "launcher job," which in turn runs the Hadoop, Hive, or Pig job using the appropriate client APIs.
Important note: The Oozie launcher is basically a map-only job running a single mapper on the Hadoop cluster. This map task knows what to do for the specific action it is supposed to run and does the appropriate thing by using the libraries for Hadoop, Pig, etc. This results in other MapReduce jobs being spun up as required. These Oozie jobs are called "asynchronous actions" in Oozie parlance. Oozie doesn't run these actions in its own server, but kicks them off on the Hadoop cluster using a launcher job. The reason the Oozie server "outsources" the launcher to the Hadoop cluster is to protect itself from unexpected workloads and also to isolate user code from its own services. After all, Oozie has access to an awesome distributed system in the form of a Hadoop cluster.
Coming to MapReduce actions: you can set the number of map tasks, but there is no guarantee it will be honored; it depends on the input, as described below.
The number of maps is usually driven by the total size of the inputs,
that is, the total number of blocks of the input files.
Setting the number of maps is a suggestion (the actual number is driven by the input splits).
Setting the number of reducers is a demand (it is honored as requested).
Number of Maps
The number of maps is usually driven by the number of DFS blocks in the input files, which leads people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps per node, although we have taken it up to 300 or so for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
The number of mappers depends on the number of logical input splits; it does not depend on the number of blocks. You can control the number of input splits from your program.
Refer to https://hadoopi.wordpress.com/2013/05/27/understand-recordreader-inputsplit/ for more information about how input splits affect the number of mappers and how to create input splits.
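For illustration, here is a minimal sketch of how a driver can influence split sizes with the new-API FileInputFormat helpers (the class name SplitSizeDemo and the byte values are just examples; the helpers correspond to the mapreduce.input.fileinputformat.split.minsize/maxsize properties in Hadoop 2.x):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");

        // A larger minimum split size means fewer, bigger splits, hence fewer mappers.
        FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);   // 512 MB
        // A smaller maximum split size means more, smaller splits, hence more mappers.
        FileInputFormat.setMaxInputSplitSize(job, 1024L * 1024 * 1024);  // 1 GB
        // ... set mapper, reducer, input and output paths as usual, then submit.
    }
}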

Hadoop Yarn - how to request fix number of containers

How can Apache Spark or Hadoop Mapreduce request a fixed number of containers?
In Spark yarn-client mode, this can be requested by setting the configuration spark.executor.instances, which directly controls the number of YARN containers it gets. How does Spark translate this into a YARN parameter that YARN understands?
I know that by default it can depend on the number of splits and the configuration values yarn.scheduler.minimum-allocation-mb and yarn.scheduler.minimum-allocation-vcores. But Spark has the ability to request an exact, fixed number of containers. How can any AM do that?
In Hadoop MapReduce, the number of containers for map tasks is decided by the number of input splits, which in turn is based on the size of the source files. One map container is requested for every input split.
By default, the number of reducers per job is one. It can be customized by setting mapreduce.reduce.tasks. Pig and Hive have their own logic to decide the number of reducers (this can also be customized).
One container (a reduce container, usually bigger than a map container) is requested per reducer.
The total number of mappers and reducers is therefore defined in the job configuration at submission time.
I think it is done through the AM API that YARN provides. An AM can call rsrcRequest.setNumContainers(numContainers); see http://hadoop.apache.org/docs/r2.5.2/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html#Writing_a_simple_Client
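To make that concrete, here is a rough sketch of an ApplicationMaster asking for a fixed number of containers via the higher-level AMRMClient API (class and method names are from the Hadoop 2.x YARN client library; the container count, memory, and vcore values are arbitrary examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// Sketch of an ApplicationMaster asking YARN for a fixed number of containers.
public class FixedContainerRequestDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        AMRMClient<ContainerRequest> amrmClient = AMRMClient.createAMRMClient();
        amrmClient.init(conf);
        amrmClient.start();
        amrmClient.registerApplicationMaster("", 0, "");

        int numContainers = 10;                              // the fixed number we want
        Resource capability = Resource.newInstance(2048, 2); // 2 GB, 2 vcores each
        Priority priority = Priority.newInstance(0);

        // One ContainerRequest per container; the RM grants them via allocate() calls.
        for (int i = 0; i < numContainers; i++) {
            amrmClient.addContainerRequest(
                new ContainerRequest(capability, null, null, priority));
        }
        // ... heartbeat with amrmClient.allocate(progress) and launch the granted containers ...
    }
}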
I had a similar discussion on another question: Yarn container understanding and tuning

Mapreduce dataflow Internals

I have tried to understand the MapReduce anatomy from various books and blogs, but I am not getting a clear idea.
What happens when I submit a job to the cluster using this command:
..Loaded the files into hdfs already
bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
Can anyone explain the sequence of operations that happens, right from the client and then inside the cluster?
The process goes like this (a minimal driver sketch follows the list):
1- The client configures and sets up the job via Job and submits it to the JobTracker.
2- Once the job has been submitted, the JobTracker assigns a job ID to it.
3- Then the output specification of the job is verified. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
4- Once this is done, InputSplits for the job are created (based on the InputFormat you are using). If the splits cannot be computed, because the input paths don't exist, for example, then the job is not submitted and an error is thrown to the MapReduce program.
5- Based on the number of InputSplits, map tasks are created, and each InputSplit gets processed by one map task.
6- Then the resources required to run the job are copied across the cluster, such as the job JAR file, the configuration file, etc. The job JAR is copied with a high replication factor (which defaults to 10) so that there are lots of copies across the cluster for the TaskTrackers to access when they run tasks for the job.
7- Then, based on the location of the data blocks that are going to be processed, the JobTracker directs TaskTrackers to run map tasks on the very same DataNodes where those data blocks are present. If there are no free CPU slots on that DataNode, the data is moved to a nearby DataNode with free slots and the process continues without having to wait.
8- Once the map phase starts, individual records (key-value pairs) from each InputSplit are processed by the Mapper one by one until the entire InputSplit is consumed.
9- Once the map phase is over, the output undergoes shuffle, sort, and combine. After this, the reduce phase starts, giving you the final output.
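For reference, a minimal driver corresponding to step 1 might look like the sketch below (WordCountDriver is an illustrative name; the mapper and reducer classes are assumed to live elsewhere in the job JAR):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver, mirroring step 1: the client builds a Job and submits it.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The mapper/reducer classes are assumed to exist elsewhere in the jar:
        // job.setMapperClass(WordCountMapper.class);
        // job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /usr/joe/wordcount/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist (step 3)

        // From here on, job ID assignment, output checks, split computation,
        // and resource copying (steps 2-6) happen as described above.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}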
Also, I would suggest you go through this link.
HTH

How to run hadoop multithread way in single JVM?

I have a 4-core desktop and want to use all my cores for local data processing with Hadoop.
(I.e. sometimes I have enough power to process data locally, and sometimes I submit the same jobs to the cluster.)
By default, Hadoop local mode runs only one mapper and one reducer, so my local jobs are really slow.
I do not want to set up a cluster on a single machine, first because of the "painful" configuration and second because I would have to create a jar each time. So the perfect solution would be to run embedded Hadoop on a single machine.
P.S. Pseudo-distributed mode is a bad option, since it will create a cluster with a single node, so I would still get only one mapper and would have to spend some time on additional configuration.
You need to use MultithreadedMapRunner: set it via JobConf's setMapRunnerClass method and don't forget to set mapred.map.multithreadedrunner.threads to the desired concurrency level.
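A minimal sketch of that wiring with the old (mapred) API, assuming you have your own old-API mapper class to plug in (the driver class name here is made up):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedRunnerDemo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultithreadedRunnerDemo.class);
        // Run each map task's records through a pool of threads instead of one thread:
        conf.setMapRunnerClass(MultithreadedMapRunner.class);
        conf.setInt("mapred.map.multithreadedrunner.threads", 4); // e.g. one per core
        // conf.setMapperClass(MyMapper.class);  // your own thread-safe old-API mapper
        // ... set input/output formats and paths as usual ...
        JobClient.runJob(conf);
    }
}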
There is also another way; you should:
set MultithreadedMapper as your mapper class on the Job object
call MultithreadedMapper.setMapperClass with your actual mapper class
call MultithreadedMapper.setNumberOfThreads with the desired concurrency level
But be careful: your mapper class must be thread-safe, and its setup and cleanup methods will be called several times, so it isn't a smart idea to mix MultithreadedMapper with MultipleOutputs unless you implement your own MultithreadedMapper-inspired class.
Hadoop purposely does not run more than one task at the same time in one JVM for isolation purposes. And in stand-alone (local) mode, only one JVM is ever used. If you want to make use of your four cores, you should run in pseudo-distributed mode, and increase the max number of concurrent tasks to four. You can do this with the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties.
Configuration conf = new Configuration();
Job job = new Job(conf, "SolerRandomHit");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(MultithreadedMapper.class);
// Wire in the real (thread-safe) mapper and the thread count; MyActualMapper is a placeholder.
MultithreadedMapper.setMapperClass(job, MyActualMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 4);

Why all the reduce tasks are ending up in a single machine?

I wrote a relatively simple MapReduce program on the Hadoop platform (Cloudera distribution). Each map and reduce task writes some diagnostic information to standard output besides doing its regular work.
However, when I look at these log files, I find that the map tasks are relatively evenly distributed among the nodes (I have 8 nodes), but the reduce task standard output log can only be found on one single machine.
I guess that means all the reduce tasks ended up executing on a single machine, which is problematic and confusing.
Does anybody have any idea what's happening here? Is it a configuration problem?
How can I make the reduce tasks also distribute evenly?
If the output from your mappers all has the same key, it will all go to a single reducer.
If your job has multiple reducers, but they all queue up on a single machine, then you have a configuration issue.
Use the web interface (http://MACHINE_NAME:50030) to monitor the job and see which reducers it has as well as which machines are running them. There is other information you can drill into there that should help in figuring out the issue.
A couple of questions about your configuration:
How many reducers are running for the job?
How many reducer slots are available on each node?
Does the node running the reducers have better hardware than the other nodes?
Hadoop decides which reducer will process which output keys by using a Partitioner.
If you are only outputting a few keys and want an even distribution across your reducers, you may be better off implementing a custom Partitioner for your output data, e.g.:
public class MyCustomPartitioner extends Partitioner<KEY, VALUE> {

    @Override
    public int getPartition(KEY key, VALUE value, int numPartitions) {
        // Do something based on the key or value to decide which
        // partition this record should go to, e.g. hash the key:
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
You can then set this custom partitioner in the job configuration with
Job job = new Job(conf, "My Job Name");
job.setPartitionerClass(MyCustomPartitioner.class);
You can also implement the Configurable interface in your custom Partitioner if you want to do any further configuration based on job settings.
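As a rough sketch of that last point (the class name, the Text/IntWritable key and value types, and the property name my.partitioner.prefix.length are all made up for illustration):

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical example: partition by a key prefix whose length comes from the job config.
public class ConfigurablePartitioner extends Partitioner<Text, IntWritable>
        implements Configurable {

    private Configuration conf;
    private int prefixLength = 1;   // default if the property is absent

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        // "my.partitioner.prefix.length" is a made-up property name.
        this.prefixLength = conf.getInt("my.partitioner.prefix.length", 1);
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String prefix = key.toString();
        if (prefix.length() > prefixLength) {
            prefix = prefix.substring(0, prefixLength);
        }
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}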
Also, check that you haven't set the number of reduce tasks to 1 anywhere in the configuration (look for "mapred.reduce.tasks") or in code, e.g.
job.setNumReduceTasks(1);
