MapReduce API giving wrong mapper count - Hadoop

I am trying to get the number of mappers in a MapReduce program using the piece of code below. I get the value of mapreduce.job.maps as 2, but the program actually launches 6 mappers because there are 6 small files. Has anyone run into a similar issue?
Code
job.getConfiguration().get("mapreduce.job.maps")
Log:
num of mappers : 2
...
17/05/13 06:56:47 INFO input.FileInputFormat: Total input paths to process : 6
17/05/13 06:56:47 INFO mapreduce.JobSubmitter: number of splits:6
...
17/05/13 06:56:48 INFO mapreduce.Job: Running job: job_1494588725898_0047
17/05/13 06:56:59 INFO mapreduce.Job: Job job_1494588725898_0047 running in uber mode : false
17/05/13 06:56:59 INFO mapreduce.Job: map 0% reduce 0%
...
17/05/13 06:57:39 INFO mapreduce.Job: map 100% reduce 100%
17/05/13 06:57:40 INFO mapreduce.Job: Job job_1494588725898_0047 completed successfully
17/05/13 06:57:40 INFO mapreduce.Job: Counters: 49
File System Counters
...
Job Counters
Launched map tasks=6
Launched reduce tasks=2

This is not an issue; it is the normal behaviour of MapReduce.
The value you get for the mapreduce.job.maps property is its default value, 2. The number of map tasks is always determined from the file input splits, which is 6 in this scenario. To get the actual number of map tasks launched for a job, you have to wait until the job has completed.
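If you need that number programmatically rather than from the console log, one option is to read the job counters once the job has finished. A minimal sketch (the helper method name is made up, surrounding driver code and error handling are omitted, and it assumes the same Job instance as in the question):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

// Hypothetical helper: call it only after job.waitForCompletion(true) has
// returned, because the counters are not final before the job finishes.
static void printLaunchedTasks(Job job) throws Exception {
    long maps = job.getCounters()
                   .findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
    long reduces = job.getCounters()
                      .findCounter(JobCounter.TOTAL_LAUNCHED_REDUCES).getValue();
    System.out.println("Launched map tasks    = " + maps);    // 6 in this case
    System.out.println("Launched reduce tasks = " + reduces); // 2 in this case
}

These are the same values that show up under "Job Counters" in the log above.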

Related

Hive Testbench data generation failed

I cloned the Hive Testbench to run the Hive benchmark on a Hadoop cluster built with the Apache binary distributions of Hadoop v2.9.0, Hive 2.3.0 and Tez 0.9.0.
I managed to build the two data generators, TPC-H and TPC-DS, but the next step, data generation, fails for both of them. The failure is very consistent: each run fails at exactly the same step and produces the same error messages.
For TPC-H, the data generation screen output is here:
$ ./tpch-setup.sh 10
ls: `/tmp/tpch-generate/10/lineitem': No such file or directory
Generating data at scale factor 10.
...
18/01/02 14:43:00 INFO mapreduce.Job: Running job: job_1514226810133_0050
18/01/02 14:43:01 INFO mapreduce.Job: Job job_1514226810133_0050 running in uber mode : false
18/01/02 14:43:01 INFO mapreduce.Job: map 0% reduce 0%
18/01/02 14:44:38 INFO mapreduce.Job: map 10% reduce 0%
18/01/02 14:44:39 INFO mapreduce.Job: map 20% reduce 0%
18/01/02 14:44:46 INFO mapreduce.Job: map 30% reduce 0%
18/01/02 14:44:48 INFO mapreduce.Job: map 40% reduce 0%
18/01/02 14:44:58 INFO mapreduce.Job: map 70% reduce 0%
18/01/02 14:45:14 INFO mapreduce.Job: map 80% reduce 0%
18/01/02 14:45:15 INFO mapreduce.Job: map 90% reduce 0%
18/01/02 14:45:23 INFO mapreduce.Job: map 100% reduce 0%
18/01/02 14:45:23 INFO mapreduce.Job: Job job_1514226810133_0050 completed successfully
18/01/02 14:45:23 INFO mapreduce.Job: Counters: 0
SLF4J: Class path contains multiple SLF4J bindings.
...
ls: `/tmp/tpch-generate/10/lineitem': No such file or directory
Data generation failed, exiting.
For TPC-DS, the error messages are here:
$ ./tpcds-setup.sh 10
...
18/01/02 22:13:58 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
18/01/02 22:13:58 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:13:59 INFO input.FileInputFormat: Total input files to process : 1
18/01/02 22:13:59 INFO mapreduce.JobSubmitter: number of splits:10
18/01/02 22:13:59 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
18/01/02 22:13:59 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/01/02 22:13:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1514226810133_0082
18/01/02 22:14:00 INFO client.YARNRunner: Number of stages: 1
18/01/02 22:14:00 INFO Configuration.deprecation: mapred.job.map.memory.mb is deprecated. Instead, use mapreduce.map.memory.mb
18/01/02 22:14:00 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.9.0, revision=0873a0118a895ca84cbdd221d8ef56fedc4b43d0, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2017-07-18T05:41:23Z ]
18/01/02 22:14:00 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:14:00 INFO client.TezClient: Submitting DAG application with id: application_1514226810133_0082
18/01/02 22:14:00 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs://192.168.10.15:8020/apps/tez,hdfs://192.168.10.15:8020/apps/tez/lib/
18/01/02 22:14:00 INFO client.TezClientUtils: Using tez.lib.uris.classpath value from configuration: null
18/01/02 22:14:00 INFO client.TezClient: Tez system stage directory hdfs://192.168.10.15:8020/tmp/hadoop-yarn/staging/rapids/.staging/job_1514226810133_0082/.tez/application_1514226810133_0082 doesn't exist and is created
18/01/02 22:14:01 INFO client.TezClient: Submitting DAG to YARN, applicationId=application_1514226810133_0082, dagName=GenTable+all_10
18/01/02 22:14:01 INFO impl.YarnClientImpl: Submitted application application_1514226810133_0082
18/01/02 22:14:01 INFO client.TezClient: The url to track the Tez AM: http://boray05:8088/proxy/application_1514226810133_0082/
18/01/02 22:14:05 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:14:05 INFO mapreduce.Job: The url to track the job: http://boray05:8088/proxy/application_1514226810133_0082/
18/01/02 22:14:05 INFO mapreduce.Job: Running job: job_1514226810133_0082
18/01/02 22:14:06 INFO mapreduce.Job: Job job_1514226810133_0082 running in uber mode : false
18/01/02 22:14:06 INFO mapreduce.Job: map 0% reduce 0%
18/01/02 22:15:51 INFO mapreduce.Job: map 10% reduce 0%
18/01/02 22:15:54 INFO mapreduce.Job: map 20% reduce 0%
18/01/02 22:15:55 INFO mapreduce.Job: map 40% reduce 0%
18/01/02 22:15:56 INFO mapreduce.Job: map 50% reduce 0%
18/01/02 22:16:07 INFO mapreduce.Job: map 60% reduce 0%
18/01/02 22:16:09 INFO mapreduce.Job: map 70% reduce 0%
18/01/02 22:16:11 INFO mapreduce.Job: map 80% reduce 0%
18/01/02 22:16:19 INFO mapreduce.Job: map 90% reduce 0%
18/01/02 22:19:54 INFO mapreduce.Job: map 100% reduce 0%
18/01/02 22:19:54 INFO mapreduce.Job: Job job_1514226810133_0082 completed successfully
18/01/02 22:19:54 INFO mapreduce.Job: Counters: 0
...
TPC-DS text data generation complete.
Loading text data into external tables.
Optimizing table time_dim (2/24).
Optimizing table date_dim (1/24).
Optimizing table item (3/24).
Optimizing table customer (4/24).
Optimizing table household_demographics (6/24).
Optimizing table customer_demographics (5/24).
Optimizing table customer_address (7/24).
Optimizing table store (8/24).
Optimizing table promotion (9/24).
Optimizing table warehouse (10/24).
Optimizing table ship_mode (11/24).
Optimizing table reason (12/24).
Optimizing table income_band (13/24).
Optimizing table call_center (14/24).
Optimizing table web_page (15/24).
Optimizing table catalog_page (16/24).
Optimizing table web_site (17/24).
make: *** [store_sales] Error 2
make: *** Waiting for unfinished jobs....
make: *** [store_returns] Error 2
Data loaded into database tpcds_bin_partitioned_orc_10.
I notice that the target temporary HDFS directory, both while the job is running and after the failure, is always empty except for the generated sub-directories.
At this point I don't know whether the failure is due to Hadoop configuration issues, mismatched software versions, or something else. Any help?
I had a similar issue when running this job. When I passed the script an HDFS location where I had permission to write, it completed successfully.
./tpcds-setup.sh 10 <hdfs_directory_path>
I still get this error when the script kicks off:
Data loaded into database tpcds_bin_partitioned_orc_10.
ls: `<hdfs_directory_path>/10': No such file or directory
However, the script runs successfully and the data is generated and loaded into the Hive tables at the end.
Hope that helps.

Running Hadoop MapReduce word count for the first time fails?

When I run the Hadoop word count example for the first time, it fails. Here's what I'm doing:
Format namenode: $HADOOP_HOME/bin/hdfs namenode -format
Start HDFS/YARN:
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager
Run wordcount: hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount input output
(let's say the input folder is already in HDFS; I'm not going to list every single command here)
Output:
16/07/17 01:04:34 INFO client.RMProxy: Connecting to ResourceManager at hadoop-master/172.20.0.2:8032
16/07/17 01:04:35 INFO input.FileInputFormat: Total input paths to process : 2
16/07/17 01:04:35 INFO mapreduce.JobSubmitter: number of splits:2
16/07/17 01:04:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1468688654488_0001
16/07/17 01:04:36 INFO impl.YarnClientImpl: Submitted application application_1468688654488_0001
16/07/17 01:04:36 INFO mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1468688654488_0001/
16/07/17 01:04:36 INFO mapreduce.Job: Running job: job_1468688654488_0001
16/07/17 01:04:46 INFO mapreduce.Job: Job job_1468688654488_0001 running in uber mode : false
16/07/17 01:04:46 INFO mapreduce.Job: map 0% reduce 0%
Terminated
And then HDFS crashes, so I can't access http://localhost:50070/
Then I restart everything (repeat step 2), rerun the example, and everything's fine.
How can I fix it for the first run? My HDFS obviously has no data the first time around, maybe that's the problem?
UPDATE:
Running an even simpler example fails as well:
hadoop#8f98bf86ceba:~$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar pi 3 3
Number of Maps = 3
Samples per Map = 3
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Starting Job
16/07/17 03:21:28 INFO client.RMProxy: Connecting to ResourceManager at hadoop-master/172.20.0.3:8032
16/07/17 03:21:29 INFO input.FileInputFormat: Total input paths to process : 3
16/07/17 03:21:29 INFO mapreduce.JobSubmitter: number of splits:3
16/07/17 03:21:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1468696855031_0001
16/07/17 03:21:31 INFO impl.YarnClientImpl: Submitted application application_1468696855031_0001
16/07/17 03:21:31 INFO mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1468696855031_0001/
16/07/17 03:21:31 INFO mapreduce.Job: Running job: job_1468696855031_0001
16/07/17 03:21:43 INFO mapreduce.Job: Job job_1468696855031_0001 running in uber mode : false
16/07/17 03:21:43 INFO mapreduce.Job: map 0% reduce 0%
Same problem, HDFS terminates
Your post is too incomplete to deduce what is wrong here. My guess is that hadoop-mapreduce-examples-2.7.2-sources.jar is not what you want; more likely you need hadoop-mapreduce-examples-2.7.2.jar, which contains the .class files rather than the sources.
HDFS has to be restarted once before MapReduce jobs can be run successfully. This is because HDFS creates some data on the first run, and stopping it cleans up its state so that MapReduce jobs can then be run through YARN.
So my solution was:
Start Hadoop: $HADOOP_HOME/sbin/start-dfs.sh
Stop Hadoop: $HADOOP_HOME/sbin/stop-dfs.sh
Start Hadoop again: $HADOOP_HOME/sbin/start-dfs.sh

Hadoop mapper phase stuck after 19%. In which cases can this occur?

My MapReduce program is running fine with other MR code. There is no error in the code, yet it still gets stuck.
15/05/28 19:53:29 INFO input.FileInputFormat: Total input paths to process : 1
15/05/28 19:53:31 INFO mapred.JobClient: Running job: job_201504101709_0927
15/05/28 19:53:32 INFO mapred.JobClient: map 0% reduce 0%
15/05/28 19:53:46 INFO mapred.JobClient: map 19% reduce 0%
15/05/28 20:03:50 INFO mapred.JobClient: map 0% reduce 0%
15/05/28 20:03:51 INFO mapred.JobClient: Task Id : attempt_201504101709_0927_m_000000_0, Status : FAILED
Task attempt_201504101709_0927_m_000000_0 failed to report status for 602 seconds. Killing!
A possible reason for this is a bug in the Mapper, such as an infinite loop. Check that everything is fine in the Mapper. If you feel that is not the problem, update your question with your mapper code.
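If the Mapper is not looping forever but simply spends a long time on a single record, the "failed to report status for 602 seconds. Killing!" message can also be avoided by reporting progress from inside map(). A rough sketch using the new mapreduce API (the class name, key/value types and per-token work are placeholders, not taken from the question):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Placeholder mapper: the point is the context.progress() heartbeat, which
// tells the framework the task is still alive during long-running work.
public class SlowRecordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            // ... expensive per-token work would go here ...
            word.set(token);
            context.write(word, ONE);
            context.progress();  // heartbeat so the task is not killed at the timeout
        }
    }
}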

In Hadoop, How can I find which slave node is executing an attempt N?

I'm using Hadoop 1.2.1, and my Hadoop application fails during the Reduce phase. From the Hadoop run I see messages like the following:
15/05/22 18:14:15 INFO mapred.JobClient: map 0% reduce 0%
15/05/22 18:14:25 INFO mapred.JobClient: map 100% reduce 0%
15/05/22 18:24:25 INFO mapred.JobClient: map 0% reduce 0%
15/05/22 18:24:26 INFO mapred.JobClient: Task Id : attempt_201505221804_0013_m_000000_0, Status : FAILED
Task attempt_201505221804_0013_m_000000_0 failed to report status for 600 seconds. Killing!
15/05/22 18:24:35 INFO mapred.JobClient: map 100% reduce 0%
I'd like to see the log of attempt_201505221804_0013_m_000000_0, but it is too time-consuming to find out which slave executed it.
Someone told me to use the Hadoop web pages to find it, but there is a firewall on this cluster and I can't change that option, because the cluster is not owned by our group.
Is there any way to find out where this attempt was executed?
You should be able to find this information in the JobTracker logs, which by default are under HADOOP_HOME/logs. They contain entries that look similar to this:
INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201503262103_0001_m_000000_0' to tip task_201503262103_0001_m_000000, for tracker 'host'
You can search the file for the specific attempt id.

Hadoop, Launched reduce tasks = 1

Hadoop is running on a cluster of 8 nodes. The submitted job produces several key-value pairs as mapper output, with different keys (checked manually), so I expect several reducers to be launched to handle the data across the nodes.
I don't know why, but as the log reports, the number of launched reduce tasks is always 1. Since there are tens of different keys, I expect at least as many reducers as nodes, i.e. 8 (which is also the number of slaves).
This is the log when the job ends:
13/05/25 04:02:31 INFO mapred.JobClient: Job complete: job_201305242051_0051
13/05/25 04:02:31 INFO mapred.JobClient: Counters: 30
13/05/25 04:02:31 INFO mapred.JobClient: Job Counters
13/05/25 04:02:31 INFO mapred.JobClient: Launched reduce tasks=1
13/05/25 04:02:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=21415994
13/05/25 04:02:31 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/05/25 04:02:31 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/05/25 04:02:31 INFO mapred.JobClient: Rack-local map tasks=7
13/05/25 04:02:31 INFO mapred.JobClient: Launched map tasks=33
13/05/25 04:02:31 INFO mapred.JobClient: Data-local map tasks=26
13/05/25 04:02:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=5486645
13/05/25 04:02:31 INFO mapred.JobClient: File Output Format Counters
13/05/25 04:02:31 INFO mapred.JobClient: Bytes Written=2798
13/05/25 04:02:31 INFO mapred.JobClient: FileSystemCounters
13/05/25 04:02:31 INFO mapred.JobClient: FILE_BYTES_READ=2299685944
13/05/25 04:02:31 INFO mapred.JobClient: HDFS_BYTES_READ=2170126861
13/05/25 04:02:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2879025663
13/05/25 04:02:31 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2798
13/05/25 04:02:31 INFO mapred.JobClient: File Input Format Counters
13/05/25 04:02:31 INFO mapred.JobClient: Bytes Read=2170123000
Other (useful?) information:
for each node I have 1 core assigned to the job
I manually checked that the job is effectively running on 8 nodes.
I have not set any parameter that fixes the number of reduce tasks at one.
Hadoop version: 1.1.2
So, do you have any idea why the number of reducers is 1 and not more?
Thanks
You should:
first check whether your cluster supports more than one reducer;
then specify the number of reducers you want to run.
Check the supported reducer count
The most convenient way to check this is the JobTracker web UI: http://localhost:50030/machines.jsp?type=active (you may need to replace localhost with the hostname the JobTracker is running on). It shows all the active TaskTrackers in your cluster and how many reducers each TaskTracker can run concurrently.
Specify the reducer number
There are three ways for you:
Specify the reducer number in your code
As zsxwing has shown, you can specify the reducer number by calling the setNumReduceTasks() method of JobConf, passing the desired number of reducers as the parameter.
Specify the reducer number in your command line
You can also pass the reducer number on the command line, like this:
bin/hadoop jar hadoop-examples-1.0.4.jar terasort -Dmapred.reduce.tasks=2 teragen teragen_out
The above command line will start 2 reducers.
Specify the reducer number in your conf/mapred-site.xml
You can also add a new property in your mapred-site.xml like this:
<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
</property>
You need to set the number of reducers yourself (the default is 1), no matter how many keys the mappers output. You can use job.setNumReduceTasks(5) to set the number of reduce tasks to 5.
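For completeness, here is a minimal driver sketch showing where that call goes. Since the question is on Hadoop 1.1.2, this uses the old mapred API; the class name, paths, mapper/reducer classes and the choice of 8 reducers are only placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyDriver.class);
        conf.setJobName("my-job");
        conf.setNumReduceTasks(8);            // e.g. one reducer per slave node; default is 1
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // conf.setMapperClass(...); conf.setReducerClass(...);  // your own classes here
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

With the new API, the equivalent is job.setNumReduceTasks(...) on org.apache.hadoop.mapreduce.Job, as in the answer above.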
