I have a question.
I want to increase the number of my map and reduce tasks to match the size of my input data. When I execute System.out.println(conf.getNumReduceTasks()) and System.out.println(conf.getNumMapTasks()), it shows me:
1 1
When I then execute conf.setNumReduceTasks(1000000) and conf.setNumMapTasks(1000000) and run the println calls again, it shows me:
1000000 1000000
However, I see no change in my MapReduce program's execution time. My input comes from Cassandra; specifically, it is a Cassandra column family with about 362,000 rows.
I want to set the number of my map and reduce tasks to the number of input rows.
What should I do?
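For reference, a minimal sketch of the driver calls described above, using the old mapred API. The class name and the surrounding job setup (Cassandra input format, mapper, reducer) are assumptions and are omitted here:

    import org.apache.hadoop.mapred.JobConf;

    public class TaskCountDemo {
        public static void main(String[] args) {
            // Bare JobConf; the real job setup (Cassandra input format, mapper,
            // reducer, output paths) is assumed to happen elsewhere.
            JobConf conf = new JobConf();

            // The question reports "1 1" here.
            System.out.println(conf.getNumMapTasks() + " " + conf.getNumReduceTasks());

            conf.setNumMapTasks(1000000);      // only a hint for the number of map tasks
            conf.setNumReduceTasks(1000000);   // an actual request for 1,000,000 reducers

            // The new values are stored in the job configuration, which is why this
            // now prints "1000000 1000000"; it does not mean the cluster will run
            // (or can run) that many tasks.
            System.out.println(conf.getNumMapTasks() + " " + conf.getNumReduceTasks());
        }
    }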
Setting the number of map/reduce tasks for your job does define how many map/reduce processes will be used to process it. Consider whether you really need that many Java processes.
That said, the number of map tasks is mostly determined automatically; setting the number of map tasks is only a hint that can increase the number of maps that were determined by Hadoop.
For reduce tasks, the default is 1 and the practical limit is around 1,000.
See: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
It's also important to understand that each node of your cluster has a maximum number of map/reduce tasks that can execute concurrently. This is controlled by the following configuration settings:
mapred.tasktracker.map.tasks.maximum
and
mapred.tasktracker.reduce.tasks.maximum
The default for both of these is 2.
So increasing the number of map/reduce tasks will be limited to the number of tasks that can run simultaneously per node. This may be one reason you aren't seeing a change in execution time for your job.
See: http://hadoop.apache.org/docs/stable/mapred-default.html
The summary is:
Let Hadoop determine the number of maps, unless you want more map tasks.
Use the mapred.tasktracker.{map|reduce}.tasks.maximum settings to control how many tasks can run at one time per node.
The number of reduce tasks should be somewhere between 1 and 2 times (mapred.tasktracker.reduce.tasks.maximum * #nodes). You also have to take into account how many map/reduce jobs you expect to run at once, so that a single job doesn't consume all of the available reduce slots.
A value of 1,000,000 is almost certainly too high for either setting; it's not practical to run that many Java processes. I expect that such high values are simply being ignored.
After setting mapred.tasktracker.{map|reduce}.tasks.maximum to the number of tasks your nodes are able to run simultaneously, try increasing your job's map/reduce tasks incrementally.
You can see the actual number of tasks used by your job in the job.xml file to verify your settings.
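As a rough illustration of that summary, here is a hedged driver sketch. It reads the per-node slot maximums (which must really be configured in mapred-site.xml on each TaskTracker node, and mirrored in the client configuration for this read to be meaningful) and derives a job-level reducer count from them; the node count and the 1x multiplier are arbitrary illustration values:

    import org.apache.hadoop.mapred.JobConf;

    public class ReducerSizingSketch {
        public static void main(String[] args) {
            JobConf conf = new JobConf();

            // Per-node concurrency limits; the default for both is 2.
            int mapSlotsPerNode    = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
            int reduceSlotsPerNode = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);

            int nodes = 10;  // assumption: you know (or look up) your cluster size

            // Between 1x and 2x the total number of reduce slots, per the summary above;
            // start at 1x and increase incrementally.
            int reducers = 1 * reduceSlotsPerNode * nodes;

            conf.setNumReduceTasks(reducers);
            // Leave the number of maps to Hadoop; setNumMapTasks() is only a hint anyway.
            System.out.println("map slots/node=" + mapSlotsPerNode
                    + ", reduce slots/node=" + reduceSlotsPerNode
                    + ", reducers requested=" + reducers);
        }
    }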
Related
When running a certain file on Hadoop using MapReduce, it sometimes creates 1 map task and 1 reduce task, while another file can use 4 map tasks and 1 reduce task.
My question is: on what basis is the number of map and reduce tasks decided?
Is there a certain input size after which a new map/reduce task is created?
Many thanks, folks.
From the official documentation:
The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.
The ideal reducers should be the optimal value that gets them closest to:
A multiple of the block size
A task time between 5 and 15 minutes
Creates the fewest files possible
Anything other than that means there is a good chance your reducers are less than great. There is a tremendous tendency for users to use a REALLY high value ("More parallelism means faster!") or a REALLY low value ("I don't want to blow my namespace quota!"). Both are equally dangerous, resulting in one or more of:
Terrible performance on the next phase of the workflow
Terrible performance due to the shuffle
Terrible overall performance because you've overloaded the namenode with objects that are ultimately useless
Destroying disk IO for no really sane reason
Lots of network transfers
The number of mappers is equal to the number of HDFS blocks of the input file that will be processed.
The number of reducers ideally should be about 10% of your total mappers. Say you have 100 mappers; then ideally the number of reducers should be somewhere around 10.
However, it is also possible to explicitly specify the number of reducers in your MapReduce job.
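To make that 10% rule of thumb concrete, a small sketch using the new mapreduce API; the input path and the 128 MB block size are assumptions for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class TenPercentRule {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path("/data/input");        // hypothetical input path
            long blockSize = 128L * 1024 * 1024;         // assumed HDFS block size

            // Rough mapper estimate: one mapper per HDFS block of the input.
            long inputBytes = FileSystem.get(conf).getContentSummary(input).getLength();
            long estimatedMappers = Math.max(1, inputBytes / blockSize);

            // Rule of thumb from above: reducers ~ 10% of mappers.
            int reducers = (int) Math.max(1, estimatedMappers / 10);

            Job job = Job.getInstance(conf, "ten-percent-rule");
            job.setNumReduceTasks(reducers);
            System.out.println("estimated mappers=" + estimatedMappers
                    + ", reducers=" + reducers);
        }
    }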
In Hadoop, what is the difference between using n mappers and n reducers, versus n mappers and 1 reducer?
In the case of using 1 reducer, which computer (of the mappers) does the reduce phase run on, if I have 3 computers?
The number of mappers is controlled by the amount of data being processed. Reducers are controlled either by the developer or by various system parameters.
To override the number of reducers:
set mapreduce.job.reduces=#;
or, if it is a Hive job and you want more control over how much work each reducer has to do, you can tweak certain parameters such as:
hive.exec.reducers.bytes.per.reducer.
You can still override it by using mapreduce.job.reduces; using bytes per reducer just lets you control the amount of data each reducer processes.
As for controlling where the reducers run, you really cannot, except by using node labels. That would mean controlling where all of the tasks in the job run, not just the reducers.
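For a plain MapReduce job (rather than Hive), the equivalent override in the driver looks roughly like this; mapreduce.job.reduces is the MRv2 property name, and setNumReduceTasks() sets the same thing:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class OverrideReducers {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Either set the property directly ...
            conf.setInt("mapreduce.job.reduces", 10);

            Job job = Job.getInstance(conf, "override-reducers");

            // ... or use the API call, which sets the same property.
            job.setNumReduceTasks(10);

            // Rest of the job setup (mapper, reducer, input/output paths) omitted.
            System.out.println("reducers = " + job.getNumReduceTasks());
        }
    }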
Assume that 8 GB of memory is available on a node in a Hadoop system.
If the TaskTracker and DataNode consume 2 GB and the memory required for each task is 200 MB, how many map and reduce tasks can be started?
8 - 2 = 6 GB
6144 MB / 200 MB = 30.72
So, 30 map and reduce tasks in total will be started.
Am I right or am I missing something?
The number of mappers and reducers is not determined by the resources available. You have to set the number of reducers in your code by calling setNumReduceTasks().
For the number of mappers, it is more complicated, as they are set by Hadoop. By default, there is roughly one map task per input split. You can tweak that by changing the default block size, the record reader, or the number of input files.
You should also set in the hadoop configuration files the maximum number of map tasks and reduce tasks that run concurrently, as well as the memory allocated to each task. Those last two configurations are the ones that are based on the available resources. Keep in mind that map and reduce tasks run on your CPU, so you are practically restricted by the number of available cores (one core cannot run two tasks at the same time).
This guide may help you with more details.
The number of concurrent tasks is not decided based only on the memory available on a node; it depends on the number of cores as well. If your node has 8 vcores and each of your tasks takes 1 core, then only 8 tasks can run at a time.
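To make that concrete, a toy calculation that combines the two limits (memory-based and core-based), using the figures from the question plus an assumed 8 vcores:

    public class SlotEstimate {
        public static void main(String[] args) {
            long nodeMemoryMb = 8 * 1024;   // 8 GB on the node
            long daemonsMb    = 2 * 1024;   // TaskTracker + DataNode overhead
            long perTaskMb    = 200;        // memory per task, from the question
            int  vcores       = 8;          // assumed number of cores on the node

            long memoryLimited = (nodeMemoryMb - daemonsMb) / perTaskMb;  // 6144 / 200 = 30
            int  coreLimited   = vcores;                                  // 1 core per task

            // The node can only run the smaller of the two limits concurrently.
            long concurrent = Math.min(memoryLimited, coreLimited);
            System.out.println("memory-limited=" + memoryLimited
                    + ", core-limited=" + coreLimited
                    + ", concurrent tasks=" + concurrent);
        }
    }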
I want to know how many MapReduce jobs can be submitted/run simultaneously in a single-node Hadoop environment. Is there any limit?
From a configuration standpoint, there's no limit I'm aware of. You can set the number of map and reduce slots to whatever you want. Practically, though, each slot has to spin up a JVM capable of running some Hadoop code, which requires some amount of memory, so eventually you would run out of memory on your machine. You might also have to configure job queues cleverly in order to run a ton at the same time.
Now, what is possible is a very different question than what is a good idea...
You can submit as many jobs as you want; they will be queued up, and the scheduler will run them based on FIFO (by default) and the available resources. The number of jobs being executed by Hadoop will depend on the factors described by John above.
The number of reducer slots is set when the cluster is configured. This will limit the number of MapReduce jobs based on the number of reducers each job requests. Mappers are generally more limited by the number of DataNodes and the number of processors per node.
Is it possible to limit the number of mappers running for a job at any given time using Hadoop Streaming? For example, I have a 28-node cluster that can run 1 task per node. If I have a job with 100 tasks, I'd like to use only, say, 20 out of the 28 nodes at any point in time. I'd like to limit some jobs because they may contain many long-running tasks, and I sometimes want to run some faster jobs and be sure that they can run immediately rather than wait for the long-running job to finish.
I saw this question and the title is spot on but the answers don't seem to address this particular issue.
Thanks!
While I am not aware of "node-wise" capacity scheduling, there is an alternative scheduler built for a very similar case: the Capacity Scheduler.
http://hadoop.apache.org/common/docs/r0.19.2/capacity_scheduler.html
You should define a special queue for potentially long jobs and another queue for short jobs, and this scheduler will take care that some capacity is always available for each queue's jobs.
The following option may make sense if the amount of work in each mapper is substantial, since this strategy does involve the overhead of reading up to 20 counters in each map invocation.
Create a group of counters and make the group name MY_TASK_MAPPERS. Make the keys MAPPER<1..K>, where K is the maximum number of mappers you want. Then, in the mapper, iterate through the counters until one of them is found to be 0. Place the machine's un-dotted IP address as a long value in that counter, effectively assigning that machine to that mapper. If all K are already taken, just quit the mapper without doing anything.
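A rough sketch of that counter trick in the new mapreduce API is below. It assumes that counter values written by one running task become visible to other running tasks, which is what the trick relies on; the group name, K = 20, and the class name are otherwise arbitrary choices:

    import java.io.IOException;
    import java.net.InetAddress;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of the counter-claiming trick described above.
    public class ThrottledMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final String GROUP = "MY_TASK_MAPPERS";
        private static final int K = 20;          // max number of "active" mappers
        private boolean claimedSlot = false;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Encode this machine's IPv4 address as a long ("un-dotted" form).
            byte[] ip = InetAddress.getLocalHost().getAddress();
            long ipAsLong = 0;
            for (byte b : ip) {
                ipAsLong = (ipAsLong << 8) | (b & 0xFF);
            }
            // Scan MAPPER1..MAPPERK and claim the first free (zero-valued) counter.
            for (int i = 1; i <= K; i++) {
                Counter c = context.getCounter(GROUP, "MAPPER" + i);
                if (c.getValue() == 0) {
                    c.setValue(ipAsLong);        // assign this machine to slot i
                    claimedSlot = true;
                    break;
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (!claimedSlot) {
                return;   // all K slots already taken: quit without doing anything
            }
            // ... real map logic would go here ...
        }
    }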