How to skip failed map tasks in hadoop streaming - hadoop

I am running a Hadoop streaming MapReduce job with 26895 map tasks in total. However, one task that deals with a certain input always fails. So I set mapreduce.map.failures.maxpercent=1 to skip the failed task, but the job still did not succeed.
Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
map     100.00%     26895      0        0        26894     1       8 / 44
reduce  100.00%     1          0        0        0         1       0 / 1
How can I skip this task?

There is a configuration setting for this.
Set mapred.max.map.failures.percent and mapred.max.reduce.failures.percent in mapred-site.xml to specify the failure threshold. Both default to 0.
These property names are now deprecated; use the following properties instead:
mapreduce.map.failures.maxpercent
mapreduce.reduce.failures.maxpercent
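For a sense of scale, the threshold is a percentage of the total number of tasks. The sketch below works through the arithmetic for the numbers in the question. It assumes the framework compares failures against the threshold roughly as failed * 100 > maxpercent * total; the function name is hypothetical and not part of Hadoop.

```python
def exceeds_failure_threshold(failed_tasks, total_tasks, max_percent):
    """Return True if the failed tasks exceed the allowed percentage.

    Assumption: the check is failed * 100 > max_percent * total,
    done in integer arithmetic to avoid rounding issues.
    """
    return failed_tasks * 100 > max_percent * total_tasks


# Numbers from the question: 1 failed map task out of 26895.
print(exceeds_failure_threshold(1, 26895, 1))  # False: within a 1% threshold
print(exceeds_failure_threshold(1, 26895, 0))  # True: the default of 0 tolerates no failures
```

Under this assumption, one failed task out of 26895 stays within a 1% threshold.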

Related

sge All queues dropped because of overload or full

I am trying to run a million batch jobs with SGE.
The first 10,000 or so jobs execute fine, but after about an hour the processing slows down and eventually stops.
I cannot find any errors in the logs. The only message I can see is:
"All queues dropped because of overload or full"
How should I configure the setup so that it runs normally?
There is one master server and four clients, and files are shared over NFS.
Every system runs on Docker and Docker Swarm.
This is the qstat output from when job execution slowed down:
$qstat -j
queue instance "peteris.q@sge00" dropped because it is full
queue instance "peteris.q@sge02" dropped because it is full
queue instance "peteris.q@sge03" dropped because it is full
queue instance "peteris.q@sge01" dropped because it is full
All queues dropped because of overload or full
detail messages
$qstat -j 1595799
=============================================================
job_number: 1595799
exec_file: job_scripts/1595799
submission_time: Sun May 27 08:08:10 2018
owner: root
uid: 0
group: root
gid: 0
sge_o_home: /root
sge_o_path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
sge_o_workdir: /data/23andMe
sge_o_host: sge
account: sge
cwd: /data/23andMe
mail_list: root@sge
notify: FALSE
job_name: python3
jobshare: 0
env_list:
job_args: lineage.py,makeShell/1009_user3130_user3600.list
script_file: python3
usage 1: cpu=00:00:02, mem=0.59503 GBs, io=0.03963, vmem=493.180M, maxvmem=493.180M
scheduling info: queue instance "peteris.q@sge00" dropped because it is full
queue instance "peteris.q@sge02" dropped because it is full
queue instance "peteris.q@sge03" dropped because it is full
queue instance "peteris.q@sge01" dropped because it is full
All queues dropped because of overload or full
sge config
algorithm default
schedule_interval 0:0:10
maxujobs 0
queue_sort_method load
job_load_adjustments np_load_avg=100.0
load_adjustment_decay_time 0:7:30
load_formula np_load_avg
schedd_job_info true
flush_submit_sec 2
flush_finish_sec 2
params none
reprioritize_interval 0:0:0
halftime 168
usage_weight_list cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor 5.000000
weight_user 0.250000
weight_project 0.250000
weight_department 0.250000
weight_job 0.250000
weight_tickets_functional 0
weight_tickets_share 0
share_override_tickets TRUE
share_functional_shares TRUE
max_functional_jobs_to_schedule 200
report_pjob_tickets TRUE
max_pending_tasks_per_job 50
halflife_decay_list none
policy_hierarchy OFS
weight_ticket 0.500000
weight_waiting_time 0.278000
weight_deadline 3600000.000000
weight_urgency 0.500000
weight_priority 0.000000
max_reservation 0
default_duration INFINITY
sge queue config
qname peteris.q
hostlist @allhosts
seq_no 0
load_thresholds NONE
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:00:05
priority 0
min_cpu_interval 00:00:05
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make
rerun FALSE
slots 20
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:01
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
It seems you have hit a practical limit on the number of active jobs that the queue can handle at any given time. I cannot confirm where SGE defines this maximum, but it is likely:
max_jobs
The number of active (not finished) jobs simultaneously
allowed in Sun Grid Engine is controlled by this parameter.
A value greater than 0 defines the limit. The default value
0 means "unlimited". If the max_jobs limit is exceeded by a
job submission then the submission command exits with exit
status 25 and an appropriate error message.
Changing max_jobs will take immediate effect.
This value is a global configuration parameter only. It cannot
be overwritten by the execution host local configuration.
From: http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html?pathrev=V62u5_TAG
If this is correct, then the value is unlimited; however, SGE is unlikely to perform well while managing ~1 million active jobs, which is probably the issue you are seeing. I recommend using job arrays, since this is exactly what they are for, i.e., managing and running many near-identical tasks.
There are many resources online for job arrays in SGE, such as these:
http://wiki.gridengine.info/wiki/index.php/Simple-Job-Array-Howto
http://talby.rcs.manchester.ac.uk/~ri/_linux_and_hpc_lib/sge_array.html
https://wiki.duke.edu/display/SCSC/SGE+Array+Jobs
I am happy to assist further if you edit your question with the specific requirements for each task. For example, does each of the ~1 million tasks require one or more parameters as input?
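As a rough sketch of how an array task could pick its own input: SGE sets the SGE_TASK_ID environment variable for each task of a job submitted with qsub -t 1-N, and the task can use it as a 1-based index into a list of inputs. The manifest file name (manifest.txt, one .list path per line) and the function names are hypothetical, not taken from the question.

```python
import os


def select_task_input(list_files, task_id):
    """Map a 1-based SGE task ID to one entry of list_files."""
    if not 1 <= task_id <= len(list_files):
        raise IndexError(f"task id {task_id} outside 1..{len(list_files)}")
    return list_files[task_id - 1]


def main():
    # SGE sets SGE_TASK_ID for every task of an array job.
    task_id = int(os.environ["SGE_TASK_ID"])
    # Hypothetical manifest: one input path per line.
    with open("manifest.txt") as fh:
        list_files = [line.strip() for line in fh if line.strip()]
    print(select_task_input(list_files, task_id))


# Demo with in-memory data; inside an array task you would call main() instead.
print(select_task_input(["a.list", "b.list", "c.list"], 2))  # b.list
```

With this layout, one qsub submission replaces many individual submissions, and each task hands the selected path to the actual worker script.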

Hive cross join fails on local map join

Is there a direct way to address the following error, or, more generally, a better way to use Hive to get the join that I need? Output to a stored table isn't a requirement, as I would be content with an INSERT OVERWRITE LOCAL DIRECTORY to a CSV.
I am trying to perform the following cross join. ipintegers is a 9GB table, and geoiplite is 270MB.
CREATE TABLE iplatlong_sample AS
SELECT ipintegers.networkinteger, geoiplite.latitude, geoiplite.longitude
FROM geoiplite
CROSS JOIN ipintegers
WHERE ipintegers.networkinteger >= geoiplite.network_start_integer AND ipintegers.networkinteger <= geoiplite.network_last_integer;
I use CROSS JOIN on ipintegers instead of geoiplite because I have read that the rule is for the smaller table to be on the left, larger on the right.
Map and Reduce stages complete to 100% according to HIVE, but then
2015-08-01 04:45:36,947 Stage-1 map = 100%, reduce = 100%, Cumulative
CPU 8767.09 sec
MapReduce Total cumulative CPU time: 0 days 2 hours 26
minutes 7 seconds 90 msec
Ended Job = job_201508010407_0001
Stage-8 is selected by condition resolver.
Execution log at: /tmp/myuser/.log
2015-08-01 04:45:38 Starting to launch local task to process map
join; maximum memory = 12221153280
Execution failed with exit status: 3
Obtaining error information
Task failed!
Task ID: Stage-8
Logs:
/tmp/myuser/hive.log
FAILED: Execution Error, return code 3 from
org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
MapReduce Jobs
Launched: Job 0: Map: 38 Reduce: 1 Cumulative CPU: 8767.09 sec
HDFS Read: 9438495086 HDFS Write: 8575548486 SUCCESS
My hive config:
SET hive.mapred.local.mem=40960;
SET hive.exec.parallel=true;
SET hive.exec.compress.output=true;
SET hive.exec.compress.intermediate = true;
SET hive.optimize.skewjoin = true;
SET mapred.compress.map.output=true;
SET hive.stats.autogather=false;
I have varied SET hive.auto.convert.join between true and false but with the same result.
Here are the errors in the output log from /tmp/myuser/hive.log
$ tail -12 -f /tmp/myuser/hive.log
2015-08-01 07:30:46,086 ERROR exec.Task (SessionState.java:printError(419)) - Execution failed with exit status: 3
2015-08-01 07:30:46,086 ERROR exec.Task (SessionState.java:printError(419)) - Obtaining error information
2015-08-01 07:30:46,087 ERROR exec.Task (SessionState.java:printError(419)) -
Task failed!
Task ID:
Stage-8
Logs:
2015-08-01 07:30:46,087 ERROR exec.Task (SessionState.java:printError(419)) - /tmp/myuser/hive.log
2015-08-01 07:30:46,087 ERROR mr.MapredLocalTask (MapredLocalTask.java:execute(268)) - Execution failed with exit status: 3
2015-08-01 07:30:46,094 ERROR ql.Driver (SessionState.java:printError(419)) - FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask
I am running the Hive client on the master, a Google Cloud Platform instance of type n1-highmem-8 (8 CPU, 52GB); the workers are n1-highmem-4 (4 CPU, 26GB). I suspect that after MAP and REDUCE a local join (as implied) takes place on the master. Regardless, in bdutils I configured the JAVAOPTS for the worker nodes (n1-highmem-4) to: n1-highmem-4
SOLUTION EDIT: The solution is to organize the range data into a range tree.
I don't think it is possible to perform this kind of cross join by brute force. Just multiply the row counts: it gets out of hand. You need some optimizations, which I don't think Hive is capable of yet.
But this problem can actually be solved in O(N1+N2) time, provided you have your data sorted (which Hive can do for you). You go through both lists simultaneously: at each step you take an IP integer, add any intervals that start at or before it, remove those that have ended, emit the matching tuples, and so on. Pseudocode:
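active = []  # intervals that may still contain the current x
ipintegers = iterator(ipintegers_sorted_file)
intervals = iterator(intervals_sorted_on_start_file)
for x in ipintegers:
    # drop intervals that ended before x
    active = [i for i in active if i.end >= x]
    # pull in intervals that start at or before x, skipping any already ended
    while intervals.has_current() and intervals.current.start <= x:
        if intervals.current.end >= x:
            active.append(intervals.current)
        intervals.next()
    for i in active:
        output_match(i, x)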
Now, if you have an external script or UDF that knows how to read the smaller table, takes IP integers as input and emits matching tuples as output, you can use Hive with SELECT TRANSFORM to stream the inputs to it.
Or you can probably just run this algorithm on a local machine with the two input files, because it is only O(N), and even 9 GB of data is very doable.
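As a sketch of the external-script option: a script used with SELECT TRANSFORM reads tab-separated rows on standard input and writes tab-separated rows on standard output. The version below swaps the sweep for a binary search over the interval start values, which is only correct if the intervals do not overlap (an assumption about the geoiplite data). The file name geoiplite.tsv, its column order and all function names are hypothetical.

```python
import bisect
import sys


def load_intervals(lines):
    """Parse 'start<TAB>end<TAB>lat<TAB>lon' lines into a list sorted by start."""
    rows = []
    for line in lines:
        start, end, lat, lon = line.rstrip("\n").split("\t")
        rows.append((int(start), int(end), lat, lon))
    rows.sort()
    return rows


def lookup(rows, starts, x):
    """Return the interval containing x, or None (assumes non-overlapping intervals)."""
    pos = bisect.bisect_right(starts, x) - 1  # last interval with start <= x
    if pos >= 0 and rows[pos][1] >= x:
        return rows[pos]
    return None


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hypothetical local export of the small table.
    with open("geoiplite.tsv") as fh:
        rows = load_intervals(fh)
    starts = [r[0] for r in rows]
    for line in stdin:
        if not line.strip():
            continue
        x = int(line.strip())
        hit = lookup(rows, starts, x)
        if hit is not None:
            stdout.write(f"{x}\t{hit[2]}\t{hit[3]}\n")


# Demo with in-memory data; a real TRANSFORM script would call main() instead.
demo = load_intervals(["10\t20\t1.5\t2.5\n", "30\t40\t3.5\t4.5\n"])
print(lookup(demo, [r[0] for r in demo], 15))
```

Each lookup costs O(log N2), so the whole pass is O(N1 log N2) rather than O(N1+N2), but the large table no longer needs to be sorted.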

Hive takes long time to launch hadoop job

I am a newbie to Hadoop and Hive. I am using Hive integration with Hadoop to execute the queries. When I submit any query, following log messages appear on console:
Hive history
file=/tmp/root/hive_job_log_root_28058@hadoop2_201203062232_1076893031.txt Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce
tasks determined at compile time: 1 In order to change the average
load for a reducer (in bytes): set
hive.exec.reducers.bytes.per.reducer= In order to limit the
maximum number of reducers: set hive.exec.reducers.max= In
order to set a constant number of reducers: set
mapred.reduce.tasks= Starting Job = job_201203062223_0004,
Tracking URL =
http://:50030/jobdetails.jsp?jobid=job_201203062223_0004 Kill
Command = //opt/hadoop_installation/hadoop-0.20.2/bin/../bin/hadoop
job -kill job_201203062223_0004 Hadoop job information for Stage-1:
number of mappers: 1; number of reducers: 1 2012-03-06 22:32:26,707
Stage-1 map = 0%, reduce = 0% 2012-03-06 22:32:29,716 Stage-1 map =
100%, reduce = 0% 2012-03-06 22:32:38,748 Stage-1 map = 100%, reduce
= 100% Ended Job = job_201203062223_0004 MapReduce Jobs Launched: Job 0: Map: 1 Reduce: 1 HDFS Read: 8107686 HDFS Write: 4 SUCCESS Total
MapReduce CPU Time Spent: 0 msec OK
The text mentioned in bold starts a Hadoop job (that's what I believe). It takes a long time to start the job. Once this line has been executed, the MapReduce operations run swiftly. My questions are:
Is there any way to make the launch of the Hadoop job faster? Is it possible to skip this phase?
Where does the value of 'Kill Command' come from (in the bold text)?
Please let me know if any further input is required.
1) Starting Job = job_201203062223_0004, Tracking URL = http://:50030/jobdetails.jsp?jobid=job_201203062223_0004
Ans: your HQL query > translated to a Hadoop job > Hadoop does some background work (planning resources, data locality, the stages needed to process the query, launch configs, job and task ID generation, etc.) > launch mappers > sort and shuffle > reduce (aggregation) > result written to HDFS.
The above flow is part of the Hadoop job life cycle, so none of it can be skipped.
At http://namenode:port/jobtracker.jsp you can monitor your job's status using its job ID, job_201203062223_0004.
2) Kill Command = HADOOP_HOME/bin/hadoop job -kill job_201203062223_0004
Ans: these lines are shown before your mappers launch because Hadoop works on big data, and a job may take more or less time depending on the size of your dataset. The line is a hint in case you want to kill the job at any point. It is shown for every Hadoop job, and printing an info line like this takes hardly any time.
Some additions with respect to your comments:
Hive is not meant for low-latency jobs; immediate results are not possible.
(Please check the stated purposes of Hive on the Apache Hive site.)
The launch overhead (see question 1: Hadoop does some background work) is inherent in Hive and cannot be avoided.
Even for small datasets, this launch overhead is present in Hadoop.
PS: if you really need quick results, please look at Shark.
First, Hive is a tool that replaces your hand-written MapReduce work with HQL. In the background it has lots of predefined functions and MapReduce programs. When you run an HQL query, the Hadoop cluster does a lot of things: finding the data blocks, allocating tasks, and so on.
Second, you can kill a job with the hadoop shell command.
If your job ID is AAAAA,
you can execute the command below to kill it:
$HADOOP_HOME/bin/hadoop job -kill AAAAA
Launch of hadoop job can get delayed due to unavailability of resources. If you use yarn you can see that the jobs are in accepted state but not yet running. This means there is some other ongoing job that has consumed all your executors and the new query is waiting to run.
You can kill the older job by using hadoop job -kill <job_id> command or wait for it to finish.

Parallel processing with dependencies on a SGE cluster

I'm doing some experiments on a computing cluster. My algorithm has two steps. The first one writes its outputs to some files, which are then used by the second step. The dependencies are 1 to n, meaning one step 2 program needs the output of n step 1 programs. I'm not sure how to do this without either wasting cluster resources or keeping the head node busy. My current solution is:
submit script (this runs on the head node)
    for different params, p:
        run step 1 with p
    sleep some time based on an estimate of how long step 1 takes
    for different params, q:
        run step 2 with q
step 2 algorithm (this runs on the computing nodes)
    while files are not ready:
        sleep a few minutes
    do step 2
Is there any better way to do this?
SGE provides both job dependencies and array jobs for that. You can submit your phase 1 computations as an array job and then submit the phase 2 computation as a dependent job using qsub -hold_jid <phase 1 job ID|name> .... This makes the phase 2 job wait until all the phase 1 computations have finished; it is then released and dispatched. The phase 1 computations will run in parallel as long as there are enough slots in the cluster.
In a submission script it might be useful to specify holds by job name and to give each array job a unique name. E.g.
mkdir experiment_1; cd experiment_1
qsub -N phase1_001 -t 1-100 ./phase1
qsub -hold_jid phase1_001 -N phase2_001 ./phase2 q1
cd ..
mkdir experiment_2; cd experiment_2
qsub -N phase1_002 -t 1-42 ./phase1 parameter_file
qsub -hold_jid phase1_002 -N phase2_002 ./phase2 q2
cd ..
This will schedule 100 executions of the phase1 script as the array job phase1_001 and another 42 executions as the array job phase1_002. If there are 142 slots on the cluster, all 142 executions will run in parallel. Then one execution of the phase2 script will be dispatched after all tasks in the phase1_001 job have finished and one execution will be dispatched after all tasks in the phase1_002 job have finished. Again those can run in parallel.
Each task in the array job will receive a unique $SGE_TASK_ID value, ranging from 1 to 100 for the tasks in job phase1_001 and from 1 to 42 for the tasks in job phase1_002. From it you can compute the p parameter.
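One way to compute the parameter from the task ID is to treat the ID as a 1-based index into a fixed parameter grid. The sketch below assumes p is a pair (alpha, seed); the grid values and the function name are made up for illustration.

```python
import itertools
import os


def params_for_task(task_id, alphas, seeds):
    """Map a 1-based task ID to one (alpha, seed) combination."""
    grid = list(itertools.product(alphas, seeds))
    return grid[task_id - 1]


# Hypothetical grid: 4 alphas x 25 seeds = 100 combinations, matching -t 1-100.
ALPHAS = [0.1, 0.2, 0.5, 1.0]
SEEDS = range(25)

# Outside an array job the variable is unset here, so fall back to task 1.
task_id = int(os.environ.get("SGE_TASK_ID", "1"))
alpha, seed = params_for_task(task_id, ALPHAS, SEEDS)
print(f"task {task_id}: alpha={alpha}, seed={seed}")
```

Because the mapping is deterministic, the phase 2 script can rebuild the same grid to find out which output files it should expect.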

Setting the number of map tasks and reduce tasks

I am currently running a job in which I fixed the number of map tasks to 20, but I am getting a higher number. I also set the number of reduce tasks to zero, but I am still getting a number other than zero. The total time for the MapReduce job to complete is also not displayed. Can someone tell me what I am doing wrong?
I am using this command:
hadoop jar Test_Parallel_for.jar Test_Parallel_for Matrix/test4.txt Result 3 \ -D mapred.map.tasks = 20 \ -D mapred.reduce.tasks =0
Output:
11/07/30 19:48:56 INFO mapred.JobClient: Job complete: job_201107291018_0164
11/07/30 19:48:56 INFO mapred.JobClient: Counters: 18
11/07/30 19:48:56 INFO mapred.JobClient: Job Counters
11/07/30 19:48:56 INFO mapred.JobClient: Launched reduce tasks=13
11/07/30 19:48:56 INFO mapred.JobClient: Rack-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient: Launched map tasks=24
11/07/30 19:48:56 INFO mapred.JobClient: Data-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient: FileSystemCounters
11/07/30 19:48:56 INFO mapred.JobClient: FILE_BYTES_READ=4020792636
11/07/30 19:48:56 INFO mapred.JobClient: HDFS_BYTES_READ=1556534680
11/07/30 19:48:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=6026699058
11/07/30 19:48:56 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1928893942
11/07/30 19:48:56 INFO mapred.JobClient: Map-Reduce Framework
11/07/30 19:48:56 INFO mapred.JobClient: Reduce input groups=40000000
11/07/30 19:48:56 INFO mapred.JobClient: Combine output records=0
11/07/30 19:48:56 INFO mapred.JobClient: Map input records=40000000
11/07/30 19:48:56 INFO mapred.JobClient: Reduce shuffle bytes=1974162269
11/07/30 19:48:56 INFO mapred.JobClient: Reduce output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient: Spilled Records=120000000
11/07/30 19:48:56 INFO mapred.JobClient: Map output bytes=1928893942
11/07/30 19:48:56 INFO mapred.JobClient: Combine input records=0
11/07/30 19:48:56 INFO mapred.JobClient: Map output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient: Reduce input records=40000000
[hcrc1425n30]s0907855:
The number of map tasks for a given job is driven by the number of input splits and not by the mapred.map.tasks parameter. For each input split a map task is spawned. So, over the lifetime of a mapreduce job the number of map tasks is equal to the number of input splits. mapred.map.tasks is just a hint to the InputFormat for the number of maps.
In your example Hadoop has determined that there are 24 input splits and will spawn 24 map tasks in total. However, you can control how many map tasks are executed in parallel on each task tracker.
Also, removing a space after -D might solve the problem for reduce.
For more information on the number of map and reduce tasks, please look at the below url
https://cwiki.apache.org/confluence/display/HADOOP2/HowManyMapsAndReduces
As Praveen mentions above, when using the basic FileInputFormat classes the number of map tasks is just the number of input splits that constitute the data. The number of reducers is controlled by mapred.reduce.tasks, specified in the way you have it: -D mapred.reduce.tasks=10 would specify 10 reducers. Note that the space after -D is required; if you omit the space, the configuration property is passed along to the relevant JVM, not to Hadoop.
Are you specifying 0 because there is no reduce work to do? In that case, if you're having trouble with the run-time parameter, you can also set the value directly in code. Given a JobConf instance job, call
job.setNumReduceTasks(0);
inside, say, your implementation of Tool.run. That should produce output directly from the mappers. If your job actually produces no output whatsoever (because you're using the framework just for side-effects like network calls or image processing, or if the results are entirely accounted for in Counter values), you can disable output by also calling
job.setOutputFormat(NullOutputFormat.class);
It's important to keep in mind that the MapReduce framework in Hadoop allows us only to suggest the number of map tasks for a job, which, as Praveen pointed out above, will correspond to the number of input splits for the task. This is unlike its behavior for the number of reducers (which is directly related to the number of files output by the MapReduce job), where we can demand that it provide n reducers.
To explain it with an example:
Assume your Hadoop input file size is 2 GB and you set the block size to 64 MB; then 32 mapper tasks are set to run, and each mapper processes one 64 MB block to complete the map phase of your Hadoop job.
==> The number of mappers set to run depends entirely on 1) file size and 2) block size
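The arithmetic behind the 32 mappers can be written out as follows. This is a simplified sketch that assumes a single splittable file and one split per block; real InputFormats may deviate slightly from this count.

```python
import math


def num_input_splits(file_size_bytes, split_size_bytes):
    """Approximate map task count for one splittable file: ceil(size / split size)."""
    return math.ceil(file_size_bytes / split_size_bytes)


MB = 1024 * 1024
GB = 1024 * MB

print(num_input_splits(2 * GB, 64 * MB))  # 32
```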
Assume you are running Hadoop on a cluster of 4 nodes:
Assume you set the mapred.map.tasks and mapred.reduce.tasks parameters in the conf file on each node as follows:
Node 1: mapred.map.tasks = 4 and mapred.reduce.tasks = 4
Node 2: mapred.map.tasks = 2 and mapred.reduce.tasks = 2
Node 3: mapred.map.tasks = 4 and mapred.reduce.tasks = 4
Node 4: mapred.map.tasks = 1 and mapred.reduce.tasks = 1
Assume you set the above parameters for the 4 nodes in this cluster. Note that Node 2 is set to only 2 and 2 respectively because its processing resources might be smaller (e.g., 2 processors, 2 cores), and Node 4 is set even lower, to just 1 and 1, perhaps because that node has only 1 processor with 2 cores and so can't run more than 1 mapper and 1 reducer task.
So when you run the job, Node 1, Node 2, Node 3 and Node 4 are configured to run a maximum total of 4+2+4+1 = 11 mapper tasks simultaneously, out of the 32 mapper tasks that the job needs to complete. After each node completes its map tasks, it takes on some of the remaining mapper tasks.
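To make the scheduling picture concrete, the sketch below counts how many rounds ("waves") of map tasks this cluster would need, using the 32 map tasks implied by the 2 GB file and 64 MB blocks. It assumes all tasks take about the same time, which is a simplification; the function name is hypothetical.

```python
import math


def map_waves(total_map_tasks, slots_per_node):
    """Return (concurrent map slots, number of waves) for a set of map tasks."""
    concurrent = sum(slots_per_node)
    return concurrent, math.ceil(total_map_tasks / concurrent)


# Per-node map slots from the example: 4, 2, 4 and 1.
print(map_waves(32, [4, 2, 4, 1]))  # (11, 3)
```

The first two waves run 11 tasks each and the last wave runs the remaining 10.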
Now, coming to reducers: since you set mapred.reduce.tasks = 0, we only get mapper output, in 32 files (1 file for each mapper task), and no reducer output.
In newer versions of Hadoop there are the much more granular mapreduce.job.running.map.limit and mapreduce.job.running.reduce.limit, which let you limit the number of simultaneously running mappers and reducers irrespective of the HDFS file split size. This is helpful if you are under a constraint not to take up large resources in the cluster.
JIRA
From your log I understand that you have 12 input files, as 12 data-local maps were generated. Rack-local maps are spawned for the same file if some of the blocks of that file are on another data node. How many data nodes do you have?
In your example, the -D parts are not picked up:
hadoop jar Test_Parallel_for.jar Test_Parallel_for Matrix/test4.txt Result 3 \ -D mapred.map.tasks = 20 \ -D mapred.reduce.tasks =0
They should come after the classname part like this:
hadoop jar Test_Parallel_for.jar Test_Parallel_for -Dmapred.map.tasks=20 -Dmapred.reduce.tasks=0 Matrix/test4.txt Result 3
A space after -D is allowed though.
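The ordering rule can be captured in a small helper that assembles the argument list: generic -D options go directly after the class name, each property=value pair is a single argument with no spaces around =, and the application's own arguments come last. The helper name is hypothetical, and the options are only honored if the main class parses generic options (for example via ToolRunner).

```python
def build_hadoop_command(jar, main_class, properties, app_args):
    """Build argv with -D generic options directly after the class name."""
    cmd = ["hadoop", "jar", jar, main_class]
    for key, value in properties.items():
        # No spaces around '=': "key=value" must be one argument.
        cmd += ["-D", f"{key}={value}"]
    return cmd + list(app_args)


cmd = build_hadoop_command(
    "Test_Parallel_for.jar",
    "Test_Parallel_for",
    {"mapred.map.tasks": 20, "mapred.reduce.tasks": 0},
    ["Matrix/test4.txt", "Result", "3"],
)
print(" ".join(cmd))
```

Passing the resulting list to subprocess.run avoids shell quoting problems with the property values.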
Also note that changing the number of mappers is probably a bad idea as other people have mentioned here.
The number of map tasks is directly defined by the number of chunks your input is split into. The size of a data chunk (i.e. the HDFS block size) is controllable and can be set for an individual file, a set of files, or one or more directories. So setting a specific number of map tasks for a job is possible, but it involves setting a corresponding HDFS block size for the job's input data. mapred.map.tasks can be used for that too, but only if the value provided is greater than the number of splits for the job's input data.
Controlling the number of reducers via mapred.reduce.tasks is correct. However, setting it to zero is a rather special case: the job's output is then a concatenation of the mappers' outputs (unsorted). In Matt's answer one can see more ways to set the number of reducers.
One way to increase the number of mappers is to provide your input as split files (you can use the Linux split command). Hadoop streaming usually assigns as many mappers as there are input files (if there is a large number of files); if not, it will try to split the input into equal-sized parts.
Use -D property=value rather than -D property = value (eliminate extra whitespace). Thus -D mapred.reduce.tasks=value would work fine.
Setting the number of map tasks doesn't always reflect the value you have set, since it depends on the split size and the InputFormat used.
Setting the number of reduces will definitely override the number of reduces set in the cluster/client-side configuration.
I agree that the number of map tasks depends on the input splits, but in some scenarios I have seen slightly different behavior.
Case 1: I created a simple map-only task, and it created 2 duplicate output files (the data is the same).
The command I used is below:
bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -D mapred.reduce.tasks=0 -input /home/sample.csv -output /home/sample_csv112.txt -mapper /home/amitav/workpython/readcsv.py
Case 2: So I restricted the map tasks to 1. The output came out correctly, with one output file, but one reducer was also launched according to the UI, although I had restricted the reducers. The command is given below:
bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -D mapred.map.tasks=1 mapred.reduce.tasks=0 -input /home/sample.csv -output /home/sample_csv115.txt -mapper /home/amitav/workpython/readcsv.py
The first part has already been answered: it is "just a suggestion".
The second part has also been answered: "remove extra spaces around =".
If neither of these worked, are you sure you have implemented ToolRunner?
The number of map tasks depends on the file size. If you want n map tasks, divide the file size by n and set the split size accordingly, as follows:
conf.set("mapred.max.split.size", "41943040"); // maximum split file size in bytes
conf.set("mapred.min.split.size", "20971520"); // minimum split file size in bytes
Folks, from this theory it seems we cannot run MapReduce jobs in parallel.
Let's say I configured a total of 5 mapper slots on a particular node, and I want JOB1 to use 3 mappers and JOB2 to use 2 mappers so that the jobs can run in parallel. But if the above properties are ignored, how can I execute jobs in parallel?
From what I understand from reading the above, it depends on the input files. If there are 100 input files, Hadoop will create 100 map tasks.
However, how many can run at any one time depends on the node configuration.
If a node is configured to run 10 map tasks, only 10 map tasks will run in parallel, picking 10 different input files out of the 100 available.
Map tasks will continue to fetch more files as they finish processing a file.

Resources