Nutch fetch command not fetching data - hadoop

I have a cluster setup with the following software stack :
nutch-branch-2.3.1,
gora-hbase 0.6.1
Hadoop 2.5.2,
hbase-0.98.8-hadoop2
So the initial command sequence is: inject, generate, fetch, parse, updatedb.
The first two (inject, generate) work fine, but the fetch command, even though it executes successfully, does not fetch any data, and because the fetch step produces nothing, its subsequent steps fail as well.
Please find the counter logs for each job below:
Inject job:
2016-01-08 14:12:45,649 INFO [main] mapreduce.Job: Counters: 31
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=114853
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=836443
HDFS: Number of bytes written=0
HDFS: Number of read operations=2
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=179217
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=59739
Total vcore-seconds taken by all map tasks=59739
Total megabyte-seconds taken by all map tasks=183518208
Map-Reduce Framework
Map input records=29973
Map output records=29973
Input split bytes=94
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=318
CPU time spent (ms)=24980
Physical memory (bytes) snapshot=427704320
Virtual memory (bytes) snapshot=5077356544
Total committed heap usage (bytes)=328728576
injector
urls_injected=29973
File Input Format Counters
Bytes Read=836349
File Output Format Counters
Bytes Written=0
Generate job:
2016-01-08 14:14:38,257 INFO [main] mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=137140
FILE: Number of bytes written=623942
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=937
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Launched reduce tasks=2
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=43788
Total time spent by all reduces in occupied slots (ms)=305690
Total time spent by all map tasks (ms)=14596
Total time spent by all reduce tasks (ms)=61138
Total vcore-seconds taken by all map tasks=14596
Total vcore-seconds taken by all reduce tasks=61138
Total megabyte-seconds taken by all map tasks=44838912
Total megabyte-seconds taken by all reduce tasks=313026560
Map-Reduce Framework
Map input records=14345
Map output records=14342
Map output bytes=1261921
Map output materialized bytes=137124
Input split bytes=937
Combine input records=0
Combine output records=0
Reduce input groups=14342
Reduce shuffle bytes=137124
Reduce input records=14342
Reduce output records=14342
Spilled Records=28684
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=1299
CPU time spent (ms)=39600
Physical memory (bytes) snapshot=2060779520
Virtual memory (bytes) snapshot=15215738880
Total committed heap usage (bytes)=1864892416
Generator
GENERATE_MARK=14342
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
2016-01-08 14:14:38,429 INFO [main] crawl.GeneratorJob: GeneratorJob: finished at 2016-01-08 14:14:38, time elapsed: 00:01:47
2016-01-08 14:14:38,431 INFO [main] crawl.GeneratorJob: GeneratorJob: generated batch id: 1452242570-1295749106 containing 14342 URLs
Fetch job:
../nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1452242566-14060 -crawlId 1 -threads 50
2016-01-08 14:14:43,142 INFO [main] fetcher.FetcherJob: FetcherJob: starting at 2016-01-08 14:14:43
2016-01-08 14:14:43,145 INFO [main] fetcher.FetcherJob: FetcherJob: batchId: 1452242566-14060
2016-01-08 14:15:53,837 INFO [main] mapreduce.Job: Job job_1452239500353_0024 completed successfully
2016-01-08 14:15:54,286 INFO [main] mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=44
FILE: Number of bytes written=349279
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1087
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Launched reduce tasks=2
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=30528
Total time spent by all reduces in occupied slots (ms)=136535
Total time spent by all map tasks (ms)=10176
Total time spent by all reduce tasks (ms)=27307
Total vcore-seconds taken by all map tasks=10176
Total vcore-seconds taken by all reduce tasks=27307
Total megabyte-seconds taken by all map tasks=31260672
Total megabyte-seconds taken by all reduce tasks=139811840
Map-Reduce Framework
Map input records=0
Map output records=0
Map output bytes=0
Map output materialized bytes=28
Input split bytes=1087
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=28
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=426
CPU time spent (ms)=11140
Physical memory (bytes) snapshot=1884893184
Virtual memory (bytes) snapshot=15245959168
Total committed heap usage (bytes)=1751646208
FetcherStatus
HitByTimeLimit-QueueFeeder=0
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
2016-01-08 14:15:54,314 INFO [main] fetcher.FetcherJob: FetcherJob: finished at 2016-01-08 14:15:54, time elapsed: 00:01:11
Please advise.

It's been a while since I worked with Nutch, but from memory there is a time-to-live on fetching a page. For instance, if you crawl http://helloworld.com today and issue the fetch command again today, it will probably just finish without fetching anything, because the time-to-live on the URL http://helloworld.com defers it by x number of days (I forget the default time-to-live).
I think you can fix this by clearing the crawl_db and trying again - or there may now be a command to set the time-to-live to 0.
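For the "clear the crawl db" route on this particular stack (Nutch 2.x on gora-hbase), a rough sketch is below. It assumes the Nutch 2.x convention that the crawl data for -crawlId 1 lives in an HBase table named 1_webpage and that the seed URLs are in a urls/ directory; verify the table name with the HBase shell's list command before dropping anything.
# drop the existing crawl table and start the crawl from scratch
echo "disable '1_webpage'" | hbase shell
echo "drop '1_webpage'" | hbase shell
bin/nutch inject urls/ -crawlId 1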

Finally, after several hours of R&D, I found that the problem was caused by a bug in Nutch: "The batch id passed to GeneratorJob by option/argument -batchId <id> is ignored and a generated batch id is used to mark the current batch." It is listed as an issue here: https://issues.apache.org/jira/browse/NUTCH-2143
Special thanks to andrew-butkus :)
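For anyone hitting the same thing before NUTCH-2143 is fixed, a small workaround sketch based on the logs above: the generate step printed its own batch id (1452242570-1295749106), while fetch was run against 1452242566-14060, so there was nothing to fetch. Passing the batch id the generator actually printed - or, if I remember the 2.x FetcherJob usage correctly, -all to fetch every batch that is due - should make the fetch pick up the 14342 generated URLs:
../nutch fetch -D mapred.reduce.tasks=2 -D fetcher.timelimit.mins=180 1452242570-1295749106 -crawlId 1 -threads 50
# or, fetch everything that is currently due (flag recalled from the 2.x FetcherJob usage, please verify):
../nutch fetch -D mapred.reduce.tasks=2 -D fetcher.timelimit.mins=180 -all -crawlId 1 -threads 50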

Related

hadoop: how to show execution time of put command? Or how to show the duration of loading a file into HDFS?

How do I configure the put command in Hadoop so that it shows the execution time?
Because this command:
hadoop fs -put table.txt /tables/table
is just returning this:
16/04/04 01:44:47 WARN util.NativeCodeLoader:
Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
The command works, but does not show any execution time. Do you know if it is possible to make the command show the execution time? Or is there another way to get that information?
Per my understanding, the hadoop fs command does not provide any debug information such as execution time, but you can get the execution time in two ways:
The Bash way: start=$(date +'%s') && hadoop fs -put visit-sequences.csv /user/hadoop/temp && echo "It took $(($(date +'%s') - $start)) seconds"
From the log file: you can check the namenode log file, which lists all the details related to the executed command, such as how much time it took, the file size, replication, etc. (A small timing sketch also follows the log excerpt below.)
e.g. I tried the command hadoop fs -put visit-sequences.csv /user/hadoop/temp and got the logs below, specific to the put operation, in the log file.
2016-04-04 20:30:00,097 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 127.0.0.1
2016-04-04 20:30:00,097 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2016-04-04 20:30:00,097 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 38
2016-04-04 20:30:00,097 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 75
2016-04-04 20:30:00,118 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 3 SyncTimes(ms): 95
2016-04-04 20:30:00,120 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /data/misc/hadoop/store/hdfs/namenode/current/edits_inprogress_0000000000000000038 -> /data/misc/hadoop/store/hdfs/namenode/current/edits_0000000000000000038-0000000000000000039
2016-04-04 20:30:00,120 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 40
2016-04-04 20:30:01,781 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Transfer took 0.06s at 15.63 KB/s
2016-04-04 20:30:01,781 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000000039 size 1177 bytes.
2016-04-04 20:30:01,830 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 0
2016-04-04 20:30:56,252 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741829_1005{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-1b928386-65b9-4438-a781-b154cdb9a579:NORMAL:127.0.0.1:50010|RBW]]} for /user/hadoop/temp/visit-sequences.csv._COPYING_
2016-04-04 20:30:56,532 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1073741829_1005{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-1b928386-65b9-4438-a781-b154cdb9a579:NORMAL:127.0.0.1:50010|RBW]]} is not COMPLETE (ucState = COMMITTED, replication# = 0 < minimum = 1) in file /user/hadoop/temp/visit-sequences.csv._COPYING_
2016-04-04 20:30:56,533 INFO org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream: Nothing to flush
2016-04-04 20:30:56,548 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_1073741829_1005{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-1b928386-65b9-4438-a781-b154cdb9a579:NORMAL:127.0.0.1:50010|RBW]]} size 742875
2016-04-04 20:30:56,957 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/temp/visit-sequences.csv._COPYING_ is closed by DFSClient_NONMAPREDUCE_1242172231_1
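As a variation on the Bash approach above, the shell's built-in time wrapper gives the same wall-clock number without the date arithmetic (the timing line shown is only illustrative):
time hadoop fs -put table.txt /tables/table
# real    0m4.3s   <- wall-clock duration of the put (example output, will differ per run)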

Hadoop Streaming - Too many map tasks

I realize that we can't exactly dictate how many map tasks to use; we can only suggest a number. But this still doesn't make sense:
2016-01-07 07:19:25,117 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input paths to process : 1
2016-01-07 07:19:25,165 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): number of splits:40
I have a single .txt file in my input which contains:
x,2,65
t,6,12
y,5,11
n,3,71
.
.
(8 lines)
I would expect 8 map tasks to be created, but instead I get 40 map tasks, of which 32 have nothing coming through stdin and hence don't do anything.
I'm running a separate executable in each map task, with each input line containing the parameters it needs.
How does this all work?
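For what it's worth, if the goal really is one map task per input line (so each line's parameters drive one run of the external executable), streaming can be pointed at NLineInputFormat, which by default hands one line to each map task. A rough sketch only - the class name is recalled from the old mapred API that streaming uses, and the paths, jar location and executable name are placeholders:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -input /user/me/params.txt \
  -output /user/me/out \
  -mapper ./my_executable \
  -file ./my_executable
# Note: with a non-default input format, streaming may feed "key<TAB>value" (byte offset, then the line)
# to the mapper's stdin, so the executable might need to strip the leading offset.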

yarn hadoop run slowly

I installed Cloudera Manager (CDH 5) and created my own cluster. Everything is good, but when I run the task it runs slowly (18 min), while the equivalent Ruby script runs in about 5 seconds.
My task consists of:
#mapper.py
import sys

def do_map(doc):
    for word in doc.split():
        yield word.lower(), 1

for line in sys.stdin:
    for key, value in do_map(line):
        print(key + "\t" + str(value))
and
#reducer.py
import sys

def do_reduce(word, values):
    return word, sum(values)

prev_key = None
values = []

for line in sys.stdin:
    key, value = line.split("\t")
    if key != prev_key and prev_key is not None:
        result_key, result_value = do_reduce(prev_key, values)
        print(result_key + "\t" + str(result_value))
        values = []
    prev_key = key
    values.append(int(value))

if prev_key is not None:
    result_key, result_value = do_reduce(prev_key, values)
    print(result_key + "\t" + str(result_value))
I run my task with this command:
yarn jar hadoop-streaming.jar -input lenta_articles -output lenta_wordcount -file mapper.py -file reducer.py -mapper "python mapper.py" -reducer "python reducer.py"
Log of the run command:
15/11/17 10:14:27 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper.py, reducer.py] [/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/hadoop-streaming-2.6.0-cdh5.4.8.jar] /tmp/streamjob8334226755199432389.jar tmpDir=null
15/11/17 10:14:29 INFO client.RMProxy: Connecting to ResourceManager at manager/10.128.181.136:8032
15/11/17 10:14:29 INFO client.RMProxy: Connecting to ResourceManager at manager/10.128.181.136:8032
15/11/17 10:14:31 INFO mapred.FileInputFormat: Total input paths to process : 909
15/11/17 10:14:32 INFO mapreduce.JobSubmitter: number of splits:909
15/11/17 10:14:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1447762910705_0010
15/11/17 10:14:32 INFO impl.YarnClientImpl: Submitted application application_1447762910705_0010
15/11/17 10:14:32 INFO mapreduce.Job: The url to track the job: http://manager:8088/proxy/application_1447762910705_0010/
15/11/17 10:14:32 INFO mapreduce.Job: Running job: job_1447762910705_0010
15/11/17 10:14:49 INFO mapreduce.Job: Job job_1447762910705_0010 running in uber mode : false
15/11/17 10:14:49 INFO mapreduce.Job: map 0% reduce 0%
15/11/17 10:16:04 INFO mapreduce.Job: map 1% reduce 0%
The size of the lenta_wordcount folder is 2.5 MB. It consists of 909 files, with an average file size of 3 KB.
Ask me if there is anything else you need to know, or any command you would like me to run.
What am I doing wrong?
Hadoop is not efficient at handling a large number of small files, but it is efficient at processing a small number of large files.
Since you are already using Cloudera, have a look at the alternatives for improving performance with a large number of small files in Hadoop, as quoted in the Cloudera article.
The main reason for the slow processing:
Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
If you have more files, you need more mappers to read and process the data. Thousands of mappers processing small files and passing their output to the reducers over the network will degrade performance.
Passing the input as sequence files with LZO compression is one of the best alternatives for handling a large number of small files. Have a look at SE Question 1 and Other Alternative.
There are some other alternatives (some are not specific to Python), but you should look at this article; a streaming-flavoured sketch of the CombineFileInputFormat option follows the list below.
Change the ingestion process/interval
Batch file consolidation
Sequence files
HBase
S3DistCp (If using Amazon EMR)
Using a CombineFileInputFormat
Hive configuration settings
Using Hadoop’s append capabilities
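For the streaming job in this question, a sketch of what the CombineFileInputFormat option from the list above might look like. The class and property names are from memory and should be verified against the Hadoop build shipped with CDH 5.4 before relying on them:
yarn jar hadoop-streaming.jar \
  -D mapreduce.input.fileinputformat.split.maxsize=134217728 \
  -inputformat org.apache.hadoop.mapred.lib.CombineTextInputFormat \
  -input lenta_articles -output lenta_wordcount \
  -mapper "python mapper.py" -reducer "python reducer.py" \
  -file mapper.py -file reducer.py
# The idea: pack the 909 small files into a few ~128 MB splits so only a handful of mappers launch
# instead of 909.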

issue with gzip input file with size > 64 MB

I am running a Hadoop streaming job that has only mappers and no reducers. I am giving this job 4 input files, all gzipped, to make sure that each input file goes to one mapper. Two of the gzipped input files are smaller than 64 MB, while the other two are larger than 64 MB. The job runs for a long time, nearly 40 minutes, and then fails saying "Error: # of failed Map Tasks exceeded allowed limit." Normally the job should not take more than 1 minute; I am not sure why it went on for 40 minutes.
When I check the output directory I see that output is generated for the two gzipped input files smaller than 64 MB, and no output is generated for the gzipped input files larger than 64 MB.
Has anybody seen such behaviour?
I see the following messages when the job is launched (I don't see these if I pass only smaller files (< 64 MB) as input to the job):
12/02/06 10:39:10 INFO mapred.FileInputFormat: Total input paths to process : 2
12/02/06 10:39:10 INFO net.NetworkTopology: Adding a new node: /10.209.191.0/10.209.191.57:1004
12/02/06 10:39:10 INFO net.NetworkTopology: Adding a new node: /10.209.191.0/10.209.191.50:1004
12/02/06 10:39:10 INFO net.NetworkTopology: Adding a new node: /10.209.186.0/10.209.186.28:1004
12/02/06 10:39:10 INFO net.NetworkTopology: Adding a new node: /10.209.188.0/10.209.188.48:1004
12/02/06 10:39:10 INFO net.NetworkTopology: Adding a new node: /10.209.185.0/10.209.185.50:1004
12/02/06 10:39:10 INFO net.NetworkTopology: Adding a new node: /10.209.188.0/10.209.188.35:1004
If you have defined your own derivative of FileInputFormat, then I suspect you ran into this bug:
https://issues.apache.org/jira/browse/MAPREDUCE-2094
If you have, then I recommend copying the implementation of the isSplitable method from TextInputFormat into your own class.

Setting the number of map tasks and reduce tasks

I am currently running a job where I fixed the number of map tasks to 20, but I am getting a higher number. I also set the number of reduce tasks to zero, but I am still getting a number other than zero. The total time for the MapReduce job to complete is also not displayed. Can someone tell me what I am doing wrong?
I am using this command:
hadoop jar Test_Parallel_for.jar Test_Parallel_for Matrix/test4.txt Result 3 \ -D mapred.map.tasks = 20 \ -D mapred.reduce.tasks =0
Output:
11/07/30 19:48:56 INFO mapred.JobClient: Job complete: job_201107291018_0164
11/07/30 19:48:56 INFO mapred.JobClient: Counters: 18
11/07/30 19:48:56 INFO mapred.JobClient: Job Counters
11/07/30 19:48:56 INFO mapred.JobClient: Launched reduce tasks=13
11/07/30 19:48:56 INFO mapred.JobClient: Rack-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient: Launched map tasks=24
11/07/30 19:48:56 INFO mapred.JobClient: Data-local map tasks=12
11/07/30 19:48:56 INFO mapred.JobClient: FileSystemCounters
11/07/30 19:48:56 INFO mapred.JobClient: FILE_BYTES_READ=4020792636
11/07/30 19:48:56 INFO mapred.JobClient: HDFS_BYTES_READ=1556534680
11/07/30 19:48:56 INFO mapred.JobClient: FILE_BYTES_WRITTEN=6026699058
11/07/30 19:48:56 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1928893942
11/07/30 19:48:56 INFO mapred.JobClient: Map-Reduce Framework
11/07/30 19:48:56 INFO mapred.JobClient: Reduce input groups=40000000
11/07/30 19:48:56 INFO mapred.JobClient: Combine output records=0
11/07/30 19:48:56 INFO mapred.JobClient: Map input records=40000000
11/07/30 19:48:56 INFO mapred.JobClient: Reduce shuffle bytes=1974162269
11/07/30 19:48:56 INFO mapred.JobClient: Reduce output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient: Spilled Records=120000000
11/07/30 19:48:56 INFO mapred.JobClient: Map output bytes=1928893942
11/07/30 19:48:56 INFO mapred.JobClient: Combine input records=0
11/07/30 19:48:56 INFO mapred.JobClient: Map output records=40000000
11/07/30 19:48:56 INFO mapred.JobClient: Reduce input records=40000000
[hcrc1425n30]s0907855:
The number of map tasks for a given job is driven by the number of input splits and not by the mapred.map.tasks parameter. For each input split a map task is spawned. So, over the lifetime of a mapreduce job the number of map tasks is equal to the number of input splits. mapred.map.tasks is just a hint to the InputFormat for the number of maps.
In your example Hadoop has determined there are 24 input splits and will spawn 24 map tasks in total. But you can control how many map tasks can be executed in parallel by each task tracker.
Also, removing the extra spaces around = after -D might solve the problem for the reducers.
For more information on the number of map and reduce tasks, please look at the below url
https://cwiki.apache.org/confluence/display/HADOOP2/HowManyMapsAndReduces
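To make the "splits drive the map count" point concrete, the split size itself can be nudged from the command line, assuming the driver passes generic options through ToolRunner (which a later answer here discusses). A hedged sketch using the old-style property names the rest of this thread uses; whether a maximum split size is honored depends on the InputFormat/API the job was written against:
hadoop jar Test_Parallel_for.jar Test_Parallel_for -D mapred.max.split.size=33554432 -D mapred.reduce.tasks=0 Matrix/test4.txt Result 3
# 32 MB splits on 64 MB blocks would roughly double the number of map tasks for the same input.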
As Praveen mentions above, when using the basic FileInputFormat classes it is just the number of input splits that constitute the data that determines the number of map tasks. The number of reducers is controlled by mapred.reduce.tasks, specified in the way you have it: -D mapred.reduce.tasks=10 would specify 10 reducers. Note that the space after -D is required; if you omit the space, the configuration property is passed along to the relevant JVM, not to Hadoop.
Are you specifying 0 because there is no reduce work to do? In that case, if you're having trouble with the run-time parameter, you can also set the value directly in code. Given a JobConf instance job, call
job.setNumReduceTasks(0);
inside, say, your implementation of Tool.run. That should produce output directly from the mappers. If your job actually produces no output whatsoever (because you're using the framework just for side-effects like network calls or image processing, or if the results are entirely accounted for in Counter values), you can disable output by also calling
job.setOutputFormat(NullOutputFormat.class);
It's important to keep in mind that the MapReduce framework in Hadoop only allows us to
suggest the number of map tasks for a job,
which, as Praveen pointed out above, will correspond to the number of input splits for the task. This is unlike its behavior for the number of reducers (which is directly related to the number of files output by the MapReduce job), where we can
demand that it provide n reducers.
To explain it with an example:
Assume your Hadoop input file size is 2 GB and you set the block size to 64 MB, so 32 map tasks are set to run, with each mapper processing one 64 MB block to complete the map phase of your Hadoop job.
==> The number of mappers set to run depends entirely on 1) the file size and 2) the block size.
Assume you are running Hadoop on a cluster of size 4:
Assume you set the mapred.map.tasks and mapred.reduce.tasks parameters in the conf file for the nodes as follows:
Node 1: mapred.map.tasks = 4 and mapred.reduce.tasks = 4
Node 2: mapred.map.tasks = 2 and mapred.reduce.tasks = 2
Node 3: mapred.map.tasks = 4 and mapred.reduce.tasks = 4
Node 4: mapred.map.tasks = 1 and mapred.reduce.tasks = 1
Assume you set the above parameters for the 4 nodes in this cluster. Notice that Node 2 is set to only 2 and 2 respectively because its processing resources might be smaller (e.g. 2 processors, 2 cores), and Node 4 is set even lower, to just 1 and 1, perhaps because its processing resources are 1 processor with 2 cores, so it can't run more than 1 mapper and 1 reducer task.
So when you run the job, Node 1, Node 2, Node 3 and Node 4 are configured to run a maximum total of (4+2+4+1) = 11 map tasks simultaneously out of the 32 map tasks that need to be completed by the job. As each node completes its map tasks, it takes up more of the remaining map tasks out of the 32.
Now coming to reducers: since you set mapred.reduce.tasks = 0, we only get the mapper output in 32 files (1 file per map task) and no reducer output.
In newer versions of Hadoop there are the much more granular mapreduce.job.running.map.limit and mapreduce.job.running.reduce.limit properties, which let you cap how many map and reduce tasks run concurrently, irrespective of the HDFS file split size. This is helpful if you are under a constraint not to take up large resources in the cluster.
JIRA
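A quick sketch of how those limits are passed on the command line (the jar, driver class and paths here are just placeholders; the properties themselves are the ones named above, available in newer Hadoop 2.x releases if I recall correctly):
hadoop jar my-job.jar MyDriver -D mapreduce.job.running.map.limit=10 -D mapreduce.job.running.reduce.limit=2 /input /output
# caps this job at 10 concurrently running map tasks and 2 concurrently running reduce tasks,
# regardless of how many splits (and therefore total map tasks) the input produces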
From your log I understood that you have 12 input files, as there are 12 local maps generated. Rack-local maps are spawned for the same file if some of the blocks of that file are on some other data node. How many data nodes do you have?
In your example, the -D parts are not picked up:
hadoop jar Test_Parallel_for.jar Test_Parallel_for Matrix/test4.txt Result 3 \ -D mapred.map.tasks = 20 \ -D mapred.reduce.tasks =0
They should come after the classname part like this:
hadoop jar Test_Parallel_for.jar Test_Parallel_for -Dmapred.map.tasks=20 -Dmapred.reduce.tasks=0 Matrix/test4.txt Result 3
A space after -D is allowed though.
Also note that changing the number of mappers is probably a bad idea as other people have mentioned here.
The number of map tasks is directly defined by the number of chunks your input is split into. The size of a data chunk (i.e. the HDFS block size) is controllable and can be set for an individual file, a set of files, or a directory (or directories). So, setting a specific number of map tasks for a job is possible but involves setting a corresponding HDFS block size for the job's input data. mapred.map.tasks can be used for that too, but only if its provided value is greater than the number of splits for the job's input data.
Controlling the number of reducers via mapred.reduce.tasks is correct. However, setting it to zero is a rather special case: the job's output is a concatenation of the mappers' outputs (non-sorted). In Matt's answer one can see more ways to set the number of reducers.
One way you can increase the number of mappers is to give your input in the form of split files [you can use linux split command]. Hadoop streaming usually assigns that many mappers as there are input files[if there are a large number of files] if not it will try to split the input into equal sized parts.
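A minimal sketch of that idea (file names and HDFS paths are placeholders):
split -l 5000 big_input.txt part_
hadoop fs -mkdir -p /user/me/split_input
hadoop fs -put part_* /user/me/split_input/
# point the streaming job's -input at /user/me/split_input so each piece can get its own mapper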
Use -D property=value rather than -D property = value (eliminate the extra whitespace). Thus -D mapred.reduce.tasks=value would work fine.
Setting the number of map tasks doesn't always reflect the value you have set, since it depends on the split size and the InputFormat used.
Setting the number of reduces will definitely override the number of reduces set in the cluster/client-side configuration.
I agree that the number of map tasks depends on the input splits, but in some scenarios I have seen it behave a little differently.
Case 1: I created a simple map-only task and it created 2 duplicate output files (the data is the same).
The command I gave is below:
bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -D mapred.reduce.tasks=0 -input /home/sample.csv -output /home/sample_csv112.txt -mapper /home/amitav/workpython/readcsv.py
Case 2: So I restricted the map tasks to 1; the output came correctly with one output file, but one reducer was also launched in the UI screen even though I restricted the reducer job. The command is given below:
bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -D mapred.map.tasks=1 mapred.reduce.tasks=0 -input /home/sample.csv -output /home/sample_csv115.txt -mapper /home/amitav/workpython/readcsv.py
The first part has already been answered: "just a suggestion".
The second part has also been answered: "remove the extra spaces around =".
If neither of these worked, are you sure you have implemented ToolRunner?
The number of map tasks depends on the file size. If you want n map tasks, divide the file size by n and use the result as the split size, as follows:
conf.set("mapred.max.split.size", "41943040"); // maximum split file size in bytes
conf.set("mapred.min.split.size", "20971520"); // minimum split file size in bytes
Folks, from this theory it seems we cannot run MapReduce jobs in parallel.
Let's say I configured a total of 5 map tasks to run on a particular node. I also want to use them in such a way that JOB1 can use 3 mappers and JOB2 can use 2 mappers, so that the jobs run in parallel. But the above properties are ignored, so how can I execute jobs in parallel?
From what I understand from reading the above, it depends on the input files. If there are 100 input files, Hadoop will create 100 map tasks.
However, it depends on the node configuration how many can run at one point in time.
If a node is configured to run 10 map tasks, only 10 map tasks will run in parallel, picking 10 different input files out of the 100 available.
Map tasks will continue to fetch more files as and when they finish processing a file.
