Hive Testbench data generation failed - hadoop

I cloned the Hive Testbench to run the Hive benchmark on a Hadoop cluster built from the Apache binary distributions of Hadoop 2.9.0, Hive 2.3.0 and Tez 0.9.0.
I managed to build the two data generators, TPC-H and TPC-DS. However, the next step, data generation, fails for both TPC-H and TPC-DS. The failure is consistent: every run fails at exactly the same step and produces the same error messages.
For TPC-H, the data generation screen output is here:
$ ./tpch-setup.sh 10
ls: `/tmp/tpch-generate/10/lineitem': No such file or directory
Generating data at scale factor 10.
...
18/01/02 14:43:00 INFO mapreduce.Job: Running job: job_1514226810133_0050
18/01/02 14:43:01 INFO mapreduce.Job: Job job_1514226810133_0050 running in uber mode : false
18/01/02 14:43:01 INFO mapreduce.Job: map 0% reduce 0%
18/01/02 14:44:38 INFO mapreduce.Job: map 10% reduce 0%
18/01/02 14:44:39 INFO mapreduce.Job: map 20% reduce 0%
18/01/02 14:44:46 INFO mapreduce.Job: map 30% reduce 0%
18/01/02 14:44:48 INFO mapreduce.Job: map 40% reduce 0%
18/01/02 14:44:58 INFO mapreduce.Job: map 70% reduce 0%
18/01/02 14:45:14 INFO mapreduce.Job: map 80% reduce 0%
18/01/02 14:45:15 INFO mapreduce.Job: map 90% reduce 0%
18/01/02 14:45:23 INFO mapreduce.Job: map 100% reduce 0%
18/01/02 14:45:23 INFO mapreduce.Job: Job job_1514226810133_0050 completed successfully
18/01/02 14:45:23 INFO mapreduce.Job: Counters: 0
SLF4J: Class path contains multiple SLF4J bindings.
...
ls: `/tmp/tpch-generate/10/lineitem': No such file or directory
Data generation failed, exiting.
For TPC-DS, the error messages are here:
$ ./tpcds-setup.sh 10
...
18/01/02 22:13:58 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
18/01/02 22:13:58 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:13:59 INFO input.FileInputFormat: Total input files to process : 1
18/01/02 22:13:59 INFO mapreduce.JobSubmitter: number of splits:10
18/01/02 22:13:59 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
18/01/02 22:13:59 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/01/02 22:13:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1514226810133_0082
18/01/02 22:14:00 INFO client.YARNRunner: Number of stages: 1
18/01/02 22:14:00 INFO Configuration.deprecation: mapred.job.map.memory.mb is deprecated. Instead, use mapreduce.map.memory.mb
18/01/02 22:14:00 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.9.0, revision=0873a0118a895ca84cbdd221d8ef56fedc4b43d0, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2017-07-18T05:41:23Z ]
18/01/02 22:14:00 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:14:00 INFO client.TezClient: Submitting DAG application with id: application_1514226810133_0082
18/01/02 22:14:00 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs://192.168.10.15:8020/apps/tez,hdfs://192.168.10.15:8020/apps/tez/lib/
18/01/02 22:14:00 INFO client.TezClientUtils: Using tez.lib.uris.classpath value from configuration: null
18/01/02 22:14:00 INFO client.TezClient: Tez system stage directory hdfs://192.168.10.15:8020/tmp/hadoop-yarn/staging/rapids/.staging/job_1514226810133_0082/.tez/application_1514226810133_0082 doesn't exist and is created
18/01/02 22:14:01 INFO client.TezClient: Submitting DAG to YARN, applicationId=application_1514226810133_0082, dagName=GenTable+all_10
18/01/02 22:14:01 INFO impl.YarnClientImpl: Submitted application application_1514226810133_0082
18/01/02 22:14:01 INFO client.TezClient: The url to track the Tez AM: http://boray05:8088/proxy/application_1514226810133_0082/
18/01/02 22:14:05 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:14:05 INFO mapreduce.Job: The url to track the job: http://boray05:8088/proxy/application_1514226810133_0082/
18/01/02 22:14:05 INFO mapreduce.Job: Running job: job_1514226810133_0082
18/01/02 22:14:06 INFO mapreduce.Job: Job job_1514226810133_0082 running in uber mode : false
18/01/02 22:14:06 INFO mapreduce.Job: map 0% reduce 0%
18/01/02 22:15:51 INFO mapreduce.Job: map 10% reduce 0%
18/01/02 22:15:54 INFO mapreduce.Job: map 20% reduce 0%
18/01/02 22:15:55 INFO mapreduce.Job: map 40% reduce 0%
18/01/02 22:15:56 INFO mapreduce.Job: map 50% reduce 0%
18/01/02 22:16:07 INFO mapreduce.Job: map 60% reduce 0%
18/01/02 22:16:09 INFO mapreduce.Job: map 70% reduce 0%
18/01/02 22:16:11 INFO mapreduce.Job: map 80% reduce 0%
18/01/02 22:16:19 INFO mapreduce.Job: map 90% reduce 0%
18/01/02 22:19:54 INFO mapreduce.Job: map 100% reduce 0%
18/01/02 22:19:54 INFO mapreduce.Job: Job job_1514226810133_0082 completed successfully
18/01/02 22:19:54 INFO mapreduce.Job: Counters: 0
...
TPC-DS text data generation complete.
Loading text data into external tables.
Optimizing table time_dim (2/24).
Optimizing table date_dim (1/24).
Optimizing table item (3/24).
Optimizing table customer (4/24).
Optimizing table household_demographics (6/24).
Optimizing table customer_demographics (5/24).
Optimizing table customer_address (7/24).
Optimizing table store (8/24).
Optimizing table promotion (9/24).
Optimizing table warehouse (10/24).
Optimizing table ship_mode (11/24).
Optimizing table reason (12/24).
Optimizing table income_band (13/24).
Optimizing table call_center (14/24).
Optimizing table web_page (15/24).
Optimizing table catalog_page (16/24).
Optimizing table web_site (17/24).
make: *** [store_sales] Error 2
make: *** Waiting for unfinished jobs....
make: *** [store_returns] Error 2
Data loaded into database tpcds_bin_partitioned_orc_10.
I noticed that the target temporary HDFS directory is always empty, both while the job is running and after the failure, except for the generated sub-directories.
At this point I don't know whether the failure is due to a Hadoop configuration issue, mismatched software versions, or something else. Any help?

I had a similar issue when running this job. When I passed the script an HDFS location that I had permission to write to, the script succeeded.
./tpcds-setup.sh 10 <hdfs_directory_path>
I still get this error when the script kicks off:
Data loaded into database tpcds_bin_partitioned_orc_10.
ls: `<hdfs_directory_path>/10': No such file or directory
However, the script runs successfully, and at the end the data is generated and loaded into the Hive tables.
Hope that helps.

Related

Difference between NameNode heap usage and ResourceManager heap usage (trying to find NameNode heap usage cause)?

What is the difference between NameNode heap usage and ResourceManager heap usage? I am trying to find the cause of heavy NameNode heap usage.
In the Ambari dashboard, I see... when running some Sqoop jobs. I am not sure what is causing the NameNode heap usage to be so high (I don't have much experience with Hadoop administration). Is this an unusual amount? I only noticed it recently.
Furthermore, the Sqoop jobs appear to freeze for an abnormally long time after the MapReduce task reaches 100% completion, e.g.:
[2020-01-31 14:00:55,193] INFO mapreduce.JobSubmitter: number of splits:12
[2020-01-31 14:00:55,402] INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1579648183118_1085
[2020-01-31 14:00:55,402] INFO mapreduce.JobSubmitter: Executing with tokens: []
[2020-01-31 14:00:55,687] INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
[2020-01-31 14:00:55,784] INFO impl.YarnClientImpl: Submitted application application_1579648183118_1085
[2020-01-31 14:00:55,837] mapreduce.Job: The url to track the job: http://hw001.ucera.local:8088/proxy/application_1579648183118_1085/
[2020-01-31 14:00:55,837] mapreduce.Job: Running job: job_1579648183118_1085
[2020-01-31 14:01:02,964] mapreduce.Job: Job job_1579648183118_1085 running in uber mode : false
[2020-01-31 14:01:02,965] mapreduce.Job: map 0% reduce 0%
[2020-01-31 14:01:18,178] mapreduce.Job: map 8% reduce 0%
[2020-01-31 14:02:21,552] mapreduce.Job: map 17% reduce 0%
[2020-01-31 14:04:55,239] mapreduce.Job: map 25% reduce 0%
[2020-01-31 14:05:36,417] mapreduce.Job: map 33% reduce 0%
[2020-01-31 14:05:37,424] mapreduce.Job: map 42% reduce 0%
[2020-01-31 14:05:40,440] mapreduce.Job: map 50% reduce 0%
[2020-01-31 14:05:41,444] mapreduce.Job: map 58% reduce 0%
[2020-01-31 14:05:44,455] mapreduce.Job: map 67% reduce 0%
[2020-01-31 14:05:52,484] mapreduce.Job: map 75% reduce 0%
[2020-01-31 14:05:56,499] mapreduce.Job: map 83% reduce 0%
[2020-01-31 14:05:59,528] mapreduce.Job: map 92% reduce 0%
[2020-01-31 14:06:00,534] INFO mapreduce.Job: map 100% reduce 0%
<...after some time longer than usual...>
[2020-01-31 14:10:05,446] INFO mapreduce.Job: Job job_1579648183118_1085 completed successfully
My hadoop version
[airflow@airflowetl root]$ hadoop version
Hadoop 3.1.1.3.1.0.0-78
Source code repository git@github.com:hortonworks/hadoop.git -r e4f82af51faec922b4804d0232a637422ec29e64
Compiled by jenkins on 2018-12-06T12:26Z
Compiled with protoc 2.5.0
From source with checksum eab9fa2a6aa38c6362c66d8df75774
This command was run using /usr/hdp/3.1.0.0-78/hadoop/hadoop-common-3.1.1.3.1.0.0-78.jar
Anyone with more hadoop experience know what could be going on here? Any debugging advice?
NameNode heap is mostly determined by the number of file blocks stored in HDFS. In particular, many small files, or many files being written at once, will cause a large heap.
The ResourceManager heap is not correlated with the NameNode heap. It is determined by the number of YARN jobs that are actively being tracked.
In a cluster I've maintained, the NameNode heap was 32 GB, and I think the ResourceManager heap was only 8 GB.
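To illustrate why the number of files and blocks dominates NameNode heap, here is a rough sketch that estimates heap from object counts. The figure of about 150 bytes per namespace object (file, directory or block) is a commonly quoted rule of thumb, not something measured on the cluster in the question, and the function name and sample numbers are hypothetical.

```python
# Rule-of-thumb estimate of NameNode heap consumed by namespace objects.
# ASSUMPTION: roughly 150 bytes of heap per file, directory or block object.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_bytes(num_files, num_dirs, avg_blocks_per_file):
    """Return an approximate heap footprint in bytes for the HDFS namespace."""
    num_blocks = num_files * avg_blocks_per_file
    return int((num_files + num_dirs + num_blocks) * BYTES_PER_OBJECT)


# The same 1 TiB of data stored two ways (hypothetical numbers):
# 8192 files of 128 MiB each, versus about one million files of 1 MiB each.
large_files = estimate_namenode_heap_bytes(8_192, 100, 1)
small_files = estimate_namenode_heap_bytes(1_048_576, 100, 1)

print(f"large files: {large_files / 2**20:.1f} MiB of heap")
print(f"small files: {small_files / 2**20:.1f} MiB of heap")
print(f"ratio: {small_files / large_files:.0f}x")
```

The absolute numbers are only indicative, but the ratio shows the point made above: the same amount of data costs far more NameNode heap when it is spread over many small files.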

Mapreduce api giving wrong mapper count

I am trying to get the number of mappers in a MapReduce program using the piece of code below. The value I get for mapreduce.job.maps is 2, but the program actually launches 6 mappers because there are 6 small files. Is anyone else seeing a similar issue?
Code:
job.getConfiguration().get("mapreduce.job.maps")
Log:
num of mappers : 2
...
17/05/13 06:56:47 INFO input.FileInputFormat: Total input paths to process : 6
17/05/13 06:56:47 INFO mapreduce.JobSubmitter: number of splits:6
...
17/05/13 06:56:48 INFO mapreduce.Job: Running job: job_1494588725898_0047
17/05/13 06:56:59 INFO mapreduce.Job: Job job_1494588725898_0047 running in uber mode : false
17/05/13 06:56:59 INFO mapreduce.Job: map 0% reduce 0%
...
17/05/13 06:57:39 INFO mapreduce.Job: map 100% reduce 100%
17/05/13 06:57:40 INFO mapreduce.Job: Job job_1494588725898_0047 completed successfully
17/05/13 06:57:40 INFO mapreduce.Job: Counters: 49
File System Counters
...
Job Counters
Launched map tasks=6
Launched reduce tasks=2
This is not an issue; it is the expected behaviour of MapReduce.
The value you get for the mapreduce.job.maps property is its default value, 2. The number of map tasks is determined by the input splits, of which there are 6 in this scenario (see "number of splits:6" in the log). To get the actual number of map tasks launched for a job, you have to wait until the job has completed and check the "Launched map tasks" counter.
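As a rough illustration of why the split count follows the input files rather than mapreduce.job.maps, here is a simplified Python model of the split logic as I understand it from FileInputFormat. It assumes splittable files, the default minimum and maximum split sizes, and a 128 MiB block size; the function name and the file sizes are hypothetical.

```python
# Simplified model of per-file split counting (not the Hadoop source itself).
SPLIT_SLOP = 1.1  # the last split of a file may be up to 10% larger than split_size


def count_splits(file_sizes, block_size=128 * 2**20, min_size=1, max_size=2**63 - 1):
    """Count input splits for a list of file sizes given in bytes."""
    split_size = max(min_size, min(max_size, block_size))
    total = 0
    for length in file_sizes:
        if length == 0:
            total += 1  # an empty file still yields one (empty) split
            continue
        remaining = length
        while remaining / split_size > SPLIT_SLOP:
            total += 1
            remaining -= split_size
        if remaining != 0:
            total += 1  # the tail of the file becomes the final split
    return total


# Six small files give six splits, regardless of mapreduce.job.maps.
print(count_splits([1000] * 6))
# One 300 MiB file is cut into block-sized pieces instead.
print(count_splits([300 * 2**20]))
```

Splits never span files in this model, so each small file costs at least one map task, which matches the six mappers in the log above.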

Can't finish MR when using Sqoop transfer data from HDFS to MYSQL

While transferring data from HDFS to MySQL, a MapReduce job gets spawned. But, it gets stuck and does not get completed.
sqoop export --connect jdbc:mysql://crxy2:3306/test --username root --password 19911130 --table info --export-dir sqoop_export
I see the following in the logs:
Warning: /software/sqoop-1.4.6.alpha/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /software/sqoop-1.4.6.alpha/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /software/sqoop-1.4.6.alpha/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /software/sqoop-1.4.6.alpha/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
15/12/02 01:17:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
15/12/02 01:17:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/12/02 01:17:37 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/12/02 01:17:37 INFO tool.CodeGenTool: Beginning code generation
15/12/02 01:17:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `info` AS t LIMIT 1
15/12/02 01:17:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `info` AS t LIMIT 1
15/12/02 01:17:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /software/hadoop-2.6.0
Note: /tmp/sqoop-root/compile/344126e97612def1e3976c1978c2e75e/info.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/12/02 01:17:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/344126e97612def1e3976c1978c2e75e/info.jar
15/12/02 01:17:42 INFO mapreduce.ExportJobBase: Beginning export of info
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/software/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/software/hbase-0.98.8-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/12/02 01:17:43 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/12/02 01:17:45 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/12/02 01:17:45 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/12/02 01:17:45 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/12/02 01:17:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/12/02 01:17:50 INFO input.FileInputFormat: Total input paths to process : 1
15/12/02 01:17:50 INFO input.FileInputFormat: Total input paths to process : 1
15/12/02 01:17:50 INFO mapreduce.JobSubmitter: number of splits:4
15/12/02 01:17:50 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/12/02 01:17:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1449047829255_0001
15/12/02 01:17:51 INFO impl.YarnClientImpl: Submitted application application_1449047829255_0001
15/12/02 01:17:52 INFO mapreduce.Job: The url to track the job: http://crxy2:8088/proxy/application_1449047829255_0001/
15/12/02 01:17:52 INFO mapreduce.Job: Running job: job_1449047829255_0001
15/12/02 01:18:12 INFO mapreduce.Job: Job job_1449047829255_0001 running in uber mode : false
15/12/02 01:18:12 INFO mapreduce.Job: map 0% reduce 0%
15/12/02 01:19:10 INFO mapreduce.Job: map 75% reduce 0%
15/12/02 01:19:12 INFO mapreduce.Job: map 100% reduce 0%
15/12/02 01:29:41 INFO mapreduce.Job: Task Id : attempt_1449047829255_0001_m_000001_0, Status : FAILED
AttemptID:attempt_1449047829255_0001_m_000001_0 Timed out after 600 secs
15/12/02 01:29:42 INFO mapreduce.Job: map 75% reduce 0%
15/12/02 01:29:58 INFO mapreduce.Job: map 100% reduce 0%
15/12/02 01:40:11 INFO mapreduce.Job: Task Id : attempt_1449047829255_0001_m_000001_1, Status : FAILED
AttemptID:attempt_1449047829255_0001_m_000001_1 Timed out after 600 secs
15/12/02 01:40:12 INFO mapreduce.Job: map 75% reduce 0%
15/12/02 01:40:28 INFO mapreduce.Job: map 100% reduce 0%
15/12/02 01:50:41 INFO mapreduce.Job: Task Id : attempt_1449047829255_0001_m_000001_2, Status : FAILED
AttemptID:attempt_1449047829255_0001_m_000001_2 Timed out after 600 secs
15/12/02 01:50:42 INFO mapreduce.Job: map 75% reduce 0%
15/12/02 01:51:00 INFO mapreduce.Job: map 100% reduce 0%
15/12/02 02:01:13 INFO mapreduce.Job: Job job_1449047829255_0001 failed with state FAILED due to: Task failed task_1449047829255_0001_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
15/12/02 02:01:13 INFO mapreduce.Job: Counters: 32
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=370395
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=556
HDFS: Number of bytes written=0
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=4
Launched map tasks=7
Other local map tasks=3
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=2732612
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=2732612
Total vcore-seconds taken by all map tasks=2732612
Total megabyte-seconds taken by all map tasks=2798194688
Map-Reduce Framework
Map input records=0
Map output records=0
Input split bytes=504
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=759
CPU time spent (ms)=5170
Physical memory (bytes) snapshot=245080064
Virtual memory (bytes) snapshot=2529026048
Total committed heap usage (bytes)=46792704
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
15/12/02 02:01:13 INFO mapreduce.ExportJobBase: Transferred 556 bytes in 2,607.4894 seconds (0.2132 bytes/sec)
15/12/02 02:01:13 INFO mapreduce.ExportJobBase: Exported 0 records.
15/12/02 02:01:13 ERROR tool.ExportTool: Error during export: Export job failed!
2015-12-02 08:01:15,791 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=Application Finished - Succeeded TARGET=RMAppManager RESULT=SUCCESS APPID=application_1449047829255_0002
2015-12-02 08:01:15,793 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Attempt appattempt_1449047829255_0002_000001 is done. finalState=FINISHED
2015-12-02 08:01:15,793 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1449047829255_0002 requests cleared
2015-12-02 08:01:15,794 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application removed - appId: application_1449047829255_0002 user: root queue: default #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2015-12-02 08:01:15,794 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1449047829255_0002 user: root leaf-queue of parent: root #applications: 0
2015-12-02 08:01:15,794 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1449047829255_0002,name=info.jar,user=root,queue=default,state=FINISHED,trackingUrl=http://crxy2:8088/proxy/application_1449047829255_0002/jobhistory/job/job_1449047829255_0002,appMasterHost=crxy2,startTime=1449069503787,finishTime=1449072069229,finalStatus=FAILED
2015-12-02 08:01:15,796 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning master appattempt_1449047829255_0002_000001
2015-12-02 08:01:15,873 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Null container completed...
2015-12-02 08:01:15,873 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Null container completed...
2015-12-02 08:01:16,879 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Null container completed...
The questioner was looking at the wrong logs. They were able to troubleshoot the issue by going through the logs of the failed task attempts, as suggested in the comments.
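Note that the ResourceManager log quoted in the question refers to application_1449047829255_0002, while the export job that failed is job_1449047829255_0001. A small sketch like the one below can pull the failed attempt IDs out of the client output, so that it is clear which task logs to open. The regular expression is based only on the log lines shown above, and the function name is hypothetical.

```python
import re

# Matches client log lines such as:
#   ... Task Id : attempt_1449047829255_0001_m_000001_0, Status : FAILED
ATTEMPT_RE = re.compile(
    r"Task Id : (attempt_(\d+)_(\d+)_[mr]_\d+_\d+), Status : FAILED"
)


def failed_attempts(log_text):
    """Return (attempt_id, application_id) pairs for every FAILED attempt line."""
    pairs = []
    for match in ATTEMPT_RE.finditer(log_text):
        attempt_id, cluster_ts, job_seq = match.groups()
        pairs.append((attempt_id, f"application_{cluster_ts}_{job_seq}"))
    return pairs


sample = """\
15/12/02 01:29:41 INFO mapreduce.Job: Task Id : attempt_1449047829255_0001_m_000001_0, Status : FAILED
AttemptID:attempt_1449047829255_0001_m_000001_0 Timed out after 600 secs
15/12/02 01:40:11 INFO mapreduce.Job: Task Id : attempt_1449047829255_0001_m_000001_1, Status : FAILED
"""

for attempt_id, app_id in failed_attempts(sample):
    print(attempt_id, "->", app_id)
```

With the application ID in hand, the attempt logs can be opened from the tracking URL printed in the client output, or fetched with `yarn logs -applicationId <id>` if log aggregation is enabled on the cluster.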

hadoop yarn single node performance tuning

I have a Hadoop 2.5.2 single-node installation on my Ubuntu VM, which has 4 cores at 3 GHz each and 4 GB of memory. This VM is not for production, only for demos and learning.
Then I wrote a very simple MapReduce application in Python and used it to process 49 XML files. All of these files are small, a few hundred lines each, so I expected the job to finish quickly. To my big surprise, it took more than 20 minutes (the output of the job is correct). Below are the output metrics:
14/12/15 19:37:55 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/12/15 19:37:57 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/12/15 19:38:03 INFO mapred.FileInputFormat: Total input paths to process : 49
14/12/15 19:38:06 INFO mapreduce.JobSubmitter: number of splits:49
14/12/15 19:38:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1418368500264_0005
14/12/15 19:38:10 INFO impl.YarnClientImpl: Submitted application application_1418368500264_0005
14/12/15 19:38:10 INFO mapreduce.Job: Running job: job_1418368500264_0005
14/12/15 19:38:59 INFO mapreduce.Job: Job job_1418368500264_0005 running in uber mode : false
14/12/15 19:38:59 INFO mapreduce.Job: map 0% reduce 0%
14/12/15 19:39:42 INFO mapreduce.Job: map 2% reduce 0%
14/12/15 19:40:05 INFO mapreduce.Job: map 4% reduce 0%
14/12/15 19:40:28 INFO mapreduce.Job: map 6% reduce 0%
14/12/15 19:40:49 INFO mapreduce.Job: map 8% reduce 0%
14/12/15 19:41:10 INFO mapreduce.Job: map 10% reduce 0%
14/12/15 19:41:29 INFO mapreduce.Job: map 12% reduce 0%
14/12/15 19:41:50 INFO mapreduce.Job: map 14% reduce 0%
14/12/15 19:42:08 INFO mapreduce.Job: map 16% reduce 0%
14/12/15 19:42:28 INFO mapreduce.Job: map 18% reduce 0%
14/12/15 19:42:49 INFO mapreduce.Job: map 20% reduce 0%
14/12/15 19:43:08 INFO mapreduce.Job: map 22% reduce 0%
14/12/15 19:43:28 INFO mapreduce.Job: map 24% reduce 0%
14/12/15 19:43:48 INFO mapreduce.Job: map 27% reduce 0%
14/12/15 19:44:09 INFO mapreduce.Job: map 29% reduce 0%
14/12/15 19:44:29 INFO mapreduce.Job: map 31% reduce 0%
14/12/15 19:44:49 INFO mapreduce.Job: map 33% reduce 0%
14/12/15 19:45:09 INFO mapreduce.Job: map 35% reduce 0%
14/12/15 19:45:28 INFO mapreduce.Job: map 37% reduce 0%
14/12/15 19:45:49 INFO mapreduce.Job: map 39% reduce 0%
14/12/15 19:46:09 INFO mapreduce.Job: map 41% reduce 0%
14/12/15 19:46:29 INFO mapreduce.Job: map 43% reduce 0%
14/12/15 19:46:49 INFO mapreduce.Job: map 45% reduce 0%
14/12/15 19:47:09 INFO mapreduce.Job: map 47% reduce 0%
14/12/15 19:47:29 INFO mapreduce.Job: map 49% reduce 0%
14/12/15 19:47:49 INFO mapreduce.Job: map 51% reduce 0%
14/12/15 19:48:08 INFO mapreduce.Job: map 53% reduce 0%
14/12/15 19:48:28 INFO mapreduce.Job: map 55% reduce 0%
14/12/15 19:48:48 INFO mapreduce.Job: map 57% reduce 0%
14/12/15 19:49:09 INFO mapreduce.Job: map 59% reduce 0%
14/12/15 19:49:29 INFO mapreduce.Job: map 61% reduce 0%
14/12/15 19:49:55 INFO mapreduce.Job: map 63% reduce 0%
14/12/15 19:50:23 INFO mapreduce.Job: map 65% reduce 0%
14/12/15 19:50:53 INFO mapreduce.Job: map 67% reduce 0%
14/12/15 19:51:22 INFO mapreduce.Job: map 69% reduce 0%
14/12/15 19:51:50 INFO mapreduce.Job: map 71% reduce 0%
14/12/15 19:52:18 INFO mapreduce.Job: map 73% reduce 0%
14/12/15 19:52:48 INFO mapreduce.Job: map 76% reduce 0%
14/12/15 19:53:18 INFO mapreduce.Job: map 78% reduce 0%
14/12/15 19:53:48 INFO mapreduce.Job: map 80% reduce 0%
14/12/15 19:54:18 INFO mapreduce.Job: map 82% reduce 0%
14/12/15 19:54:48 INFO mapreduce.Job: map 84% reduce 0%
14/12/15 19:55:19 INFO mapreduce.Job: map 86% reduce 0%
14/12/15 19:55:48 INFO mapreduce.Job: map 88% reduce 0%
14/12/15 19:56:16 INFO mapreduce.Job: map 90% reduce 0%
14/12/15 19:56:44 INFO mapreduce.Job: map 92% reduce 0%
14/12/15 19:57:14 INFO mapreduce.Job: map 94% reduce 0%
14/12/15 19:57:45 INFO mapreduce.Job: map 96% reduce 0%
14/12/15 19:58:15 INFO mapreduce.Job: map 98% reduce 0%
14/12/15 19:58:46 INFO mapreduce.Job: map 100% reduce 0%
14/12/15 19:59:20 INFO mapreduce.Job: map 100% reduce 100%
14/12/15 19:59:28 INFO mapreduce.Job: Job job_1418368500264_0005 completed successfully
14/12/15 19:59:30 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=17856
FILE: Number of bytes written=5086434
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=499030
HDFS: Number of bytes written=10049
HDFS: Number of read operations=150
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=49
Launched reduce tasks=1
Data-local map tasks=49
Total time spent by all maps in occupied slots (ms)=8854232
Total time spent by all reduces in occupied slots (ms)=284672
Total time spent by all map tasks (ms)=1106779
Total time spent by all reduce tasks (ms)=35584
Total vcore-seconds taken by all map tasks=1106779
Total vcore-seconds taken by all reduce tasks=35584
Total megabyte-seconds taken by all map tasks=1133341696
Total megabyte-seconds taken by all reduce tasks=36438016
Map-Reduce Framework
Map input records=9352
Map output records=296
Map output bytes=17258
Map output materialized bytes=18144
Input split bytes=6772
Combine input records=0
Combine output records=0
Reduce input groups=53
Reduce shuffle bytes=18144
Reduce input records=296
Reduce output records=52
Spilled Records=592
Shuffled Maps =49
Failed Shuffles=0
Merged Map outputs=49
GC time elapsed (ms)=33590
CPU time spent (ms)=191390
Physical memory (bytes) snapshot=13738057728
Virtual memory (bytes) snapshot=66425016320
Total committed heap usage (bytes)=10799808512
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=492258
File Output Format Counters
Bytes Written=10049
14/12/15 19:59:30 INFO streaming.StreamJob: Output directory: /data_output/sb50projs_1_output
As a newbie to Hadoop, I have several questions about this unreasonably poor performance:
How do I configure Hadoop/YARN/MapReduce to make the whole environment more convenient for trial usage?
I understand that Hadoop is designed for huge data sets and big files. But in a trial environment my files are small and my data is very limited, so which default configuration items should I change? I have changed "dfs.blocksize" in hdfs-site.xml to a smaller value to match my small files, but it made no big difference. I know there are some JVM configuration items in yarn-site.xml and mapred-site.xml, but I am not sure how to adjust them.
How do I read the Hadoop logs?
Under the logs folder there are separate log files for the nodemanager, resourcemanager, namenode and datanode. I tried to read these files to understand how the 20 minutes were spent, but it's not easy for a newbie like me. Is there any tool or UI that could help me analyze the logs?
Basic performance tuning tools
I have googled this question and found a bunch of names such as Ganglia, Nagios, Vaidya and Ambari. Which tool is best for analysing an issue like "why did it take 20 minutes to do such a simple job?"
Large number of Hadoop processes
Even when no job is running, I found around 100 Hadoop processes on my VM, like below (I am using htop, sorted by memory). Is this normal for Hadoop, or have I misconfigured my environment?
You don't have to change anything.
The default configuration is meant for a small environment. You can change it as the environment grows. There are a lot of parameters, and fine tuning takes a lot of time.
But I admit your setup is smaller than the usual test environments.
The logs you have to read aren't the service logs but the job logs. Find them in /var/log/hadoop-yarn/containers/
If you want a better view of your MapReduce job, use the web interface at http://127.0.0.1:8088/. You will see your job's progress in real time.
IMO, basic tuning = use the Hadoop web interfaces. There are plenty available natively.
I think you have found your problem. This can be normal, or not.
But in short, YARN launches MapReduce tasks so as to use all the available memory:
The available memory is set in your yarn-site.xml: yarn.nodemanager.resource.memory-mb (defaults to 8 GiB).
The memory for a task is defined in mapred-site.xml, or in the task itself, by the property mapreduce.map.memory.mb (defaults to 1024 MiB in Apache Hadoop 2.x, which matches the job counters above).
So:
Change the available memory for your nodemanager (to 3 GiB, in order to leave 1 GiB for the system).
Change the memory available to the Hadoop services (-Xmx in hadoop-env.sh and yarn-env.sh): system + all Hadoop services (namenode / datanode / resourcemanager / nodemanager) < 1 GiB.
Change the memory for your map tasks (512 MiB?). The smaller it is, the more tasks can run at the same time.
Change yarn.scheduler.minimum-allocation-mb to 512 in yarn-site.xml to allow mappers with less than 1 GiB of memory.
I hope this will help you.
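The effect of these settings on parallelism can be sketched with a little arithmetic. The snippet below assumes that YARN rounds each container request up to a multiple of yarn.scheduler.minimum-allocation-mb and that the MapReduce application master asks for 1536 MB (which I believe is the default for yarn.app.mapreduce.am.resource.mb). The function names are hypothetical and the result is an upper bound, not a measurement.

```python
import math


def container_size_mb(requested_mb, min_allocation_mb):
    """Round a memory request up to the next multiple of the scheduler minimum."""
    return max(1, math.ceil(requested_mb / min_allocation_mb)) * min_allocation_mb


def max_concurrent_maps(node_memory_mb, map_memory_mb, min_allocation_mb, am_memory_mb=1536):
    """Upper bound on map containers that fit next to one application master."""
    am = container_size_mb(am_memory_mb, min_allocation_mb)
    per_map = container_size_mb(map_memory_mb, min_allocation_mb)
    return max(0, (node_memory_mb - am) // per_map)


# A 512 MB map request is still given 1024 MB while the minimum allocation is 1024.
print(container_size_mb(512, 1024))

# Settings suggested above: 3 GiB for the nodemanager, 512 MiB maps, 512 MiB minimum.
print(max_concurrent_maps(3072, 512, 512))

# Same settings, with the application master also reduced to 512 MiB (an extra assumption).
print(max_concurrent_maps(3072, 512, 512, am_memory_mb=512))
```

This is why lowering mapreduce.map.memory.mb alone is not enough: the scheduler minimum has to come down as well, otherwise each map task still occupies a full 1 GiB container.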

Map Reduce Job is reported as complete on history server while on console it shows as only half way thru

I am running an MRv1 job on a Hadoop YARN 2.3.0 cluster. The problem is that when I submit this job, YARN creates multiple applications for it, and the last application running in YARN is marked as complete even though the console reports it as only 58% complete. I have confirmed that the job is also not printing the log statements it is supposed to print when it actually completes.
Please see the output from the job submission console below. It just stops at 58%, while the job history server and the YARN cluster UI report that this job has already succeeded.
14/08/28 08:36:19 INFO mapreduce.Job: map 54% reduce 0%
14/08/28 08:44:13 INFO mapreduce.Job: map 55% reduce 0%
14/08/28 08:52:16 INFO mapreduce.Job: map 56% reduce 0%
14/08/28 08:59:22 INFO mapreduce.Job: map 57% reduce 0%
14/08/28 09:07:33 INFO mapreduce.Job: map 58% reduce 0%
Thanks.

Resources