HBase Import command - hadoop

We are currently migrating from CDH3u4 to CDH5. We made new cluster and copied all data. Everything went smooth thanks to Cloudera manager. But we have problem with migrating data from HBase 0.90.6 to HBase 0.96.1.1.
I tried to migrate data by using Export/Import feature of HBase (https://hbase.apache.org/book/ops_mgt.html#export. I have managed to export data and copy them to new server (discp). When I used command on destination cluster:
hbase -Dhbase.import.version=0.90 org.apache.hadoop.hbase.mapreduce.Import ip /user/rtomsej/ip3
Job was completed successfully, but no data was load (table ip is still blank):
14/06/25 09:04:58 INFO mapreduce.Job: Job job_1403615212297_0014 running in uber mode : false
14/06/25 09:04:58 INFO mapreduce.Job: map 0% reduce 0%
14/06/25 09:05:08 INFO mapreduce.Job: map 7% reduce 0%
14/06/25 09:05:11 INFO mapreduce.Job: map 43% reduce 0%
14/06/25 09:05:16 INFO mapreduce.Job: map 45% reduce 0%
14/06/25 09:05:18 INFO mapreduce.Job: map 50% reduce 0%
14/06/25 09:05:20 INFO mapreduce.Job: map 55% reduce 0%
14/06/25 09:05:21 INFO mapreduce.Job: map 57% reduce 0%
14/06/25 09:05:22 INFO mapreduce.Job: map 80% reduce 0%
14/06/25 09:05:23 INFO mapreduce.Job: map 86% reduce 0%
14/06/25 09:05:25 INFO mapreduce.Job: map 91% reduce 0%
14/06/25 09:05:26 INFO mapreduce.Job: map 98% reduce 0%
14/06/25 09:05:28 INFO mapreduce.Job: map 100% reduce 0%
14/06/25 09:05:28 INFO mapreduce.Job: Job job_1403615212297_0014 completed successfully
14/06/25 09:05:28 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=5172058
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=5452414893
HDFS: Number of bytes written=0
HDFS: Number of read operations=132
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=44
Data-local map tasks=44
Total time spent by all maps in occupied slots (ms)=410004
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=410004
Total vcore-seconds taken by all map tasks=410004
Total megabyte-seconds taken by all map tasks=419844096
Map-Reduce Framework
Map input records=9964456
Map output records=0
Input split bytes=5720
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=7648
CPU time spent (ms)=117230
Physical memory (bytes) snapshot=17097363456
Virtual memory (bytes) snapshot=68115570688
Total committed heap usage (bytes)=26497384448
File Input Format Counters
Bytes Read=5452409173
File Output Format Counters
Bytes Written=0
When I look into log no error is here.
I would appreciate any idea, thank you so much!

It seems the problem was in command:
hbase -Dhbase.import.version=0.90 org.apache.hadoop.hbase.mapreduce.Import ip /user/rtomsej/ip3
When I modified it like this, whole job went OK:
hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import ip /user/rtomsej/ip3
Think that import.version=0.90 is not supported.

I have the same problem, but your solution does not works for me.
I tried lots of time, found that once I disable the Table before running the Import task. There are error of "regionserver not online", but during the task is running, I enable the Table. The Import Task ended smoothly and the new data is loaded!!!

Related

Hadoop producing no output?

I've recently started learning how to use the Hadoop system, and decided it's time to try writing some code. Before that, I wanted to try running the examples seen in the Getting Started page. However, it does not seem to produce any visible results.
I'm currently using Hadoop version 3.3.1 using a single-node setup,
and using jdk 11.0.11. I am running this on Windows 10 (due to current development requirements).
I've used the following command on cmd:
hadoop jar %hadoop_home%/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep input /output 'dfs[a-z.]+'
The output to the command:
C:\Windows\system32>hadoop jar %hadoop_home%/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep input /output 'dfs[a-z.]+'
2021-12-15 00:33:10,486 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2021-12-15 00:33:10,800 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/E/.staging/job_1639519343908_0005
2021-12-15 00:33:11,029 INFO input.FileInputFormat: Total input files to process : 10
2021-12-15 00:33:11,108 INFO mapreduce.JobSubmitter: number of splits:10
2021-12-15 00:33:11,281 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1639519343908_0005
2021-12-15 00:33:11,281 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-15 00:33:11,442 INFO conf.Configuration: resource-types.xml not found
2021-12-15 00:33:11,443 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-12-15 00:33:11,497 INFO impl.YarnClientImpl: Submitted application application_1639519343908_0005
2021-12-15 00:33:11,527 INFO mapreduce.Job: The url to track the job: http://DESKTOP-S15C716:8088/proxy/application_1639519343908_0005/
2021-12-15 00:33:11,528 INFO mapreduce.Job: Running job: job_1639519343908_0005
2021-12-15 00:33:19,611 INFO mapreduce.Job: Job job_1639519343908_0005 running in uber mode : false
2021-12-15 00:33:19,615 INFO mapreduce.Job: map 0% reduce 0%
2021-12-15 00:33:31,178 INFO mapreduce.Job: map 50% reduce 0%
2021-12-15 00:33:32,263 INFO mapreduce.Job: map 60% reduce 0%
2021-12-15 00:33:39,624 INFO mapreduce.Job: map 90% reduce 0%
2021-12-15 00:33:40,632 INFO mapreduce.Job: map 100% reduce 0%
2021-12-15 00:33:41,636 INFO mapreduce.Job: map 100% reduce 100%
2021-12-15 00:33:41,648 INFO mapreduce.Job: Job job_1639519343908_0005 completed successfully
2021-12-15 00:33:41,760 INFO mapreduce.Job: Counters: 51
File System Counters
FILE: Number of bytes read=6
FILE: Number of bytes written=3021766
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=31877
HDFS: Number of bytes written=86
HDFS: Number of read operations=35
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Killed map tasks=1
Launched map tasks=10
Launched reduce tasks=1
Data-local map tasks=10
Total time spent by all maps in occupied slots (ms)=89653
Total time spent by all reduces in occupied slots (ms)=8222
Total time spent by all map tasks (ms)=89653
Total time spent by all reduce tasks (ms)=8222
Total vcore-milliseconds taken by all map tasks=89653
Total vcore-milliseconds taken by all reduce tasks=8222
Total megabyte-milliseconds taken by all map tasks=91804672
Total megabyte-milliseconds taken by all reduce tasks=8419328
Map-Reduce Framework
Map input records=819
Map output records=0
Map output bytes=0
Map output materialized bytes=60
Input split bytes=1139
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=60
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =10
Failed Shuffles=0
Merged Map outputs=10
GC time elapsed (ms)=90
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=2952790016
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=30738
File Output Format Counters
Bytes Written=86
2021-12-15 00:33:41,790 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2021-12-15 00:33:41,814 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/E/.staging/job_1639519343908_0006
2021-12-15 00:33:41,855 INFO input.FileInputFormat: Total input files to process : 1
2021-12-15 00:33:41,913 INFO mapreduce.JobSubmitter: number of splits:1
2021-12-15 00:33:41,950 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1639519343908_0006
2021-12-15 00:33:41,950 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-15 00:33:42,179 INFO impl.YarnClientImpl: Submitted application application_1639519343908_0006
2021-12-15 00:33:42,190 INFO mapreduce.Job: The url to track the job: http://DESKTOP-S15C716:8088/proxy/application_1639519343908_0006/
2021-12-15 00:33:42,191 INFO mapreduce.Job: Running job: job_1639519343908_0006
2021-12-15 00:33:55,301 INFO mapreduce.Job: Job job_1639519343908_0006 running in uber mode : false
2021-12-15 00:33:55,302 INFO mapreduce.Job: map 0% reduce 0%
2021-12-15 00:34:00,336 INFO mapreduce.Job: map 100% reduce 0%
2021-12-15 00:34:06,366 INFO mapreduce.Job: map 100% reduce 100%
2021-12-15 00:34:07,375 INFO mapreduce.Job: Job job_1639519343908_0006 completed successfully
2021-12-15 00:34:07,404 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=6
FILE: Number of bytes written=548197
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=212
HDFS: Number of bytes written=0
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3232
Total time spent by all reduces in occupied slots (ms)=3610
Total time spent by all map tasks (ms)=3232
Total time spent by all reduce tasks (ms)=3610
Total vcore-milliseconds taken by all map tasks=3232
Total vcore-milliseconds taken by all reduce tasks=3610
Total megabyte-milliseconds taken by all map tasks=3309568
Total megabyte-milliseconds taken by all reduce tasks=3696640
Map-Reduce Framework
Map input records=0
Map output records=0
Map output bytes=0
Map output materialized bytes=6
Input split bytes=126
Combine input records=0
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=6
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=13
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=536870912
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=86
File Output Format Counters
Bytes Written=0
Yet when viewing the contents of the now-made 'output' folder,
I receive the following result:
hdfs dfs -ls /output
Found 2 items
-rw-r--r-- 1 E supergroup 0 2021-12-15 00:34 /output/_SUCCESS
-rw-r--r-- 1 E supergroup 0 2021-12-15 00:34 /output/part-r-00000
I.e. there's no data written to those files!
May anyone please assist me?
If you have no data in your HDFS input folder that matches the grep pattern 'dfs[a-z.]+', then the output will be empty
From the linked docs (which are for Unix, not Windows), make sure this command completed
bin/hdfs dfs -put %HADOOP_HOME%/etc/hadoop/*.xml input
And you can grep dfs $HADOOP_HOME/etc/hadoop/*.xml (at least on Unix) as well, locally, to verify there should be data output

Why there is no reducer when running 1TB teragen?

I am running a terasort benchmark for hadoop using the following command:
jar /Users/karan.verma/Documents/backups/h/hadoop-2.6.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen -Dmapreduce.job.maps=100 1t random-data
and got the following logs printed for 100 map tasks:
18/03/27 13:06:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/27 13:06:04 INFO client.RMProxy: Connecting to ResourceManager at /127.0.0.1:8032
18/03/27 13:06:05 INFO terasort.TeraSort: Generating -727379968 using 100
18/03/27 13:06:05 INFO mapreduce.JobSubmitter: number of splits:100
18/03/27 13:06:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1522131782827_0001
18/03/27 13:06:06 INFO impl.YarnClientImpl: Submitted application application_1522131782827_0001
18/03/27 13:06:06 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1522131782827_0001/
18/03/27 13:06:06 INFO mapreduce.Job: Running job: job_1522131782827_0001
18/03/27 13:06:16 INFO mapreduce.Job: Job job_1522131782827_0001 running in uber mode : false
18/03/27 13:06:16 INFO mapreduce.Job: map 0% reduce 0%
18/03/27 13:06:29 INFO mapreduce.Job: map 2% reduce 0%
18/03/27 13:06:31 INFO mapreduce.Job: map 3% reduce 0%
18/03/27 13:06:32 INFO mapreduce.Job: map 5% reduce 0%
....
18/03/27 13:09:27 INFO mapreduce.Job: map 100% reduce 0%
and here is the final counters as printed on console:
18/03/27 13:09:29 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=10660990
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=8594
HDFS: Number of bytes written=0
HDFS: Number of read operations=400
HDFS: Number of large read operations=0
HDFS: Number of write operations=200
Job Counters
Launched map tasks=100
Other local map tasks=100
Total time spent by all maps in occupied slots (ms)=983560
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=983560
Total vcore-milliseconds taken by all map tasks=983560
Total megabyte-milliseconds taken by all map tasks=1007165440
Map-Reduce Framework
Map input records=0
Map output records=0
Input split bytes=8594
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=9746
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=11220811776
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
and here is the output on job schedular:
Please suggest why there is no reduce task?
Your run command says that you're running teragen and not terasort. teragen simply generates data that you can then use for terasort, and so no reducers are needed.
To run terasort over the data that you've just generated, run:
hadoop jar /Users/karan.verma/Documents/backups/h/hadoop-2.6.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort random-data terasort-output
You should then see reducers.
No reduce tasks run when executing teragen. Here is the documentation:
TeraGen will run map tasks to generate the data and will not run any reduce tasks. The default number of map task is defined by the "mapreduce.job.maps=2" param. It's the only purpose here is to generate the 1TB of random data in the following format " 10 bytes key | 2 bytes break | 32 bytes acsii/hex | 4 bytes break | 48 bytes filler | 4 bytes break | \r\n".

Hadoop mapper phase stuck after 19% In which case this possibility can occur?

My MapReduce program is running fine with other MR code. There is no error in the code. Still it is getting stuck.
15/05/28 19:53:29 INFO input.FileInputFormat: Total input paths to process : 1
15/05/28 19:53:31 INFO mapred.JobClient: Running job: job_201504101709_0927
15/05/28 19:53:32 INFO mapred.JobClient: map 0% reduce 0%
15/05/28 19:53:46 INFO mapred.JobClient: map 19% reduce 0%
15/05/28 20:03:50 INFO mapred.JobClient: map 0% reduce 0%
15/05/28 20:03:51 INFO mapred.JobClient: Task Id : attempt_201504101709_0927_m_000000_0, Status : FAILED
Task attempt_201504101709_0927_m_000000_0 failed to report status for 602 seconds. Killing!
Possible reason for this might be a bug in Mapper, like an infinite loop. Just check if everything is fine in Mapper. If you feel its not problem, update your question with your mapper code.

hadoop yarn single node performance tuning

I have hadoop 2.5.2 single mode installation on my Ubuntu VM, which is: 4-core, 3GHz per core; 4G memory. This VM is not for production, only for demo and learning.
Then, I wrote a vey simple map-reduce application using python, and use this application to process 49 xmls. All these xml files are small-size, hundreds of lines each. So, I expected a quick process. But, big22 surprise to me, it took more than 20 minutes to finish the job (the output of the job is correct.). Below is the output metrics :
14/12/15 19:37:55 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/12/15 19:37:57 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/12/15 19:38:03 INFO mapred.FileInputFormat: Total input paths to process : 49
14/12/15 19:38:06 INFO mapreduce.JobSubmitter: number of splits:49
14/12/15 19:38:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1418368500264_0005
14/12/15 19:38:10 INFO impl.YarnClientImpl: Submitted application application_1418368500264_0005
14/12/15 19:38:10 INFO mapreduce.Job: Running job: job_1418368500264_0005
14/12/15 19:38:59 INFO mapreduce.Job: Job job_1418368500264_0005 running in uber mode : false
14/12/15 19:38:59 INFO mapreduce.Job: map 0% reduce 0%
14/12/15 19:39:42 INFO mapreduce.Job: map 2% reduce 0%
14/12/15 19:40:05 INFO mapreduce.Job: map 4% reduce 0%
14/12/15 19:40:28 INFO mapreduce.Job: map 6% reduce 0%
14/12/15 19:40:49 INFO mapreduce.Job: map 8% reduce 0%
14/12/15 19:41:10 INFO mapreduce.Job: map 10% reduce 0%
14/12/15 19:41:29 INFO mapreduce.Job: map 12% reduce 0%
14/12/15 19:41:50 INFO mapreduce.Job: map 14% reduce 0%
14/12/15 19:42:08 INFO mapreduce.Job: map 16% reduce 0%
14/12/15 19:42:28 INFO mapreduce.Job: map 18% reduce 0%
14/12/15 19:42:49 INFO mapreduce.Job: map 20% reduce 0%
14/12/15 19:43:08 INFO mapreduce.Job: map 22% reduce 0%
14/12/15 19:43:28 INFO mapreduce.Job: map 24% reduce 0%
14/12/15 19:43:48 INFO mapreduce.Job: map 27% reduce 0%
14/12/15 19:44:09 INFO mapreduce.Job: map 29% reduce 0%
14/12/15 19:44:29 INFO mapreduce.Job: map 31% reduce 0%
14/12/15 19:44:49 INFO mapreduce.Job: map 33% reduce 0%
14/12/15 19:45:09 INFO mapreduce.Job: map 35% reduce 0%
14/12/15 19:45:28 INFO mapreduce.Job: map 37% reduce 0%
14/12/15 19:45:49 INFO mapreduce.Job: map 39% reduce 0%
14/12/15 19:46:09 INFO mapreduce.Job: map 41% reduce 0%
14/12/15 19:46:29 INFO mapreduce.Job: map 43% reduce 0%
14/12/15 19:46:49 INFO mapreduce.Job: map 45% reduce 0%
14/12/15 19:47:09 INFO mapreduce.Job: map 47% reduce 0%
14/12/15 19:47:29 INFO mapreduce.Job: map 49% reduce 0%
14/12/15 19:47:49 INFO mapreduce.Job: map 51% reduce 0%
14/12/15 19:48:08 INFO mapreduce.Job: map 53% reduce 0%
14/12/15 19:48:28 INFO mapreduce.Job: map 55% reduce 0%
14/12/15 19:48:48 INFO mapreduce.Job: map 57% reduce 0%
14/12/15 19:49:09 INFO mapreduce.Job: map 59% reduce 0%
14/12/15 19:49:29 INFO mapreduce.Job: map 61% reduce 0%
14/12/15 19:49:55 INFO mapreduce.Job: map 63% reduce 0%
14/12/15 19:50:23 INFO mapreduce.Job: map 65% reduce 0%
14/12/15 19:50:53 INFO mapreduce.Job: map 67% reduce 0%
14/12/15 19:51:22 INFO mapreduce.Job: map 69% reduce 0%
14/12/15 19:51:50 INFO mapreduce.Job: map 71% reduce 0%
14/12/15 19:52:18 INFO mapreduce.Job: map 73% reduce 0%
14/12/15 19:52:48 INFO mapreduce.Job: map 76% reduce 0%
14/12/15 19:53:18 INFO mapreduce.Job: map 78% reduce 0%
14/12/15 19:53:48 INFO mapreduce.Job: map 80% reduce 0%
14/12/15 19:54:18 INFO mapreduce.Job: map 82% reduce 0%
14/12/15 19:54:48 INFO mapreduce.Job: map 84% reduce 0%
14/12/15 19:55:19 INFO mapreduce.Job: map 86% reduce 0%
14/12/15 19:55:48 INFO mapreduce.Job: map 88% reduce 0%
14/12/15 19:56:16 INFO mapreduce.Job: map 90% reduce 0%
14/12/15 19:56:44 INFO mapreduce.Job: map 92% reduce 0%
14/12/15 19:57:14 INFO mapreduce.Job: map 94% reduce 0%
14/12/15 19:57:45 INFO mapreduce.Job: map 96% reduce 0%
14/12/15 19:58:15 INFO mapreduce.Job: map 98% reduce 0%
14/12/15 19:58:46 INFO mapreduce.Job: map 100% reduce 0%
14/12/15 19:59:20 INFO mapreduce.Job: map 100% reduce 100%
14/12/15 19:59:28 INFO mapreduce.Job: Job job_1418368500264_0005 completed successfully
14/12/15 19:59:30 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=17856
FILE: Number of bytes written=5086434
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=499030
HDFS: Number of bytes written=10049
HDFS: Number of read operations=150
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=49
Launched reduce tasks=1
Data-local map tasks=49
Total time spent by all maps in occupied slots (ms)=8854232
Total time spent by all reduces in occupied slots (ms)=284672
Total time spent by all map tasks (ms)=1106779
Total time spent by all reduce tasks (ms)=35584
Total vcore-seconds taken by all map tasks=1106779
Total vcore-seconds taken by all reduce tasks=35584
Total megabyte-seconds taken by all map tasks=1133341696
Total megabyte-seconds taken by all reduce tasks=36438016
Map-Reduce Framework
Map input records=9352
Map output records=296
Map output bytes=17258
Map output materialized bytes=18144
Input split bytes=6772
Combine input records=0
Combine output records=0
Reduce input groups=53
Reduce shuffle bytes=18144
Reduce input records=296
Reduce output records=52
Spilled Records=592
Shuffled Maps =49
Failed Shuffles=0
Merged Map outputs=49
GC time elapsed (ms)=33590
CPU time spent (ms)=191390
Physical memory (bytes) snapshot=13738057728
Virtual memory (bytes) snapshot=66425016320
Total committed heap usage (bytes)=10799808512
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=492258
File Output Format Counters
Bytes Written=10049
14/12/15 19:59:30 INFO streaming.StreamJob: Output directory: /data_output/sb50projs_1_output
As a newbie to hadoop, for this crazy unreasonable performance, I have several questions:
how to configure my hadoop/yarn/mapreduce to make the whole environment more convenient for trial usage?
I understand hadoop is designed for huge-data and big files. But for a trial environment, my files are small and my data is very limited, which default configuration items should I change? I have changed "dfs.blocksize" of hdfs-site.xml to a smaller value to match my small files, but seems no big enhancements. I know there are some JVM configuration items in yarn-site.xml and mapred-site.xml, but I am not sure about how to adjust them.
how to read hadoop logs
Under the logs folder, there are separate log files for nodemanager/resourcemanager/namenode/datanode. I tried to read these files to understand how the 20 minutes are spent during the process, but it's not easy for a newbie like me. So I wonder is there any tool/UI could help me to analyze the logs.
basic performance tuning tools
Actually I have googled around for this question, and I got a bunch of names like Ganglia/Nagios/Vaidya/Ambari. I want to know, which tool is best analyse the issue like , "why it took 20 minutes to do such a simple job?".
big number of hadoop processes
Even if there is no job running on my hadoop, I found around 100 hadoop processes on my VM, like below (I am using htop, and sort the result by memory). Is this normal for hadoop ? Or am I incorrect for some environment configuration?
You don't have to change anything.
The default configuration is done for small environment. You may change it if you grow the environment. Ant there is a lot of params and a lot of time for fine tuning.
But I admit your configuration is smaller than the usual ones for tests.
The log you have to read isn't the services ones but the job ones. Find them in /var/log/hadoop-yarn/containers/
If you want a better view of your MR, use the web interface on http://127.0.0.1:8088/. You will see your job's progression in real time.
IMO, Basic tuning = use hadoop web interfaces. There are plenty available natively.
I think you find your problem. This can be nomal, or not.
But quickly, YARN launch MR to use all the available memory :
Available memory is set in your yarn-site.xml : yarn.nodemanager.resource.memory-mb (default to 8 Gio).
Memory for a task is defined in mapred-site.xml or in the task itself by the property : mapreduce.map.memory.mb (default to 1536 Mio)
So :
Change the available memory for your nodemanager (to 3Gio, in order to let 1 Gio for the system)
Change the memory available for hadoop services (-Xmx in hadoop-env.sh, yarn-env.sh) (system + each hadoop services (namenode / datanode / ressourcemanager / nodemanager) < 1 Gio.
Change the memory for your map tasks (512 Mio ?). The lesser it is, more task can be executed in the same time.
Change yarn.scheduler.minimum-allocation-mb to 512 in yarn-site.xml to allow mappers with less than 1 Gio of memory.
I hope this will help you.

Is there a way to find why reducer was killed by AM

I am trying to run some graph processing job on Hadoop using 1GB Input. But my reduce task are being killed by Application Master.
Here is the output
14/06/29 16:15:02 INFO mapreduce.Job: map 100% reduce 53%
14/06/29 16:15:03 INFO mapreduce.Job: map 100% reduce 57%
14/06/29 16:15:04 INFO mapreduce.Job: map 100% reduce 60%
14/06/29 16:15:05 INFO mapreduce.Job: map 100% reduce 63%
14/06/29 16:15:05 INFO mapreduce.Job: Task Id : attempt_1404050864296_0002_r_000003_0, Status : FAILED
Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
and In the end job is failed with following output.
14/06/29 16:11:58 INFO mapreduce.Job: Job job_1404050864296_0001 failed with state FAILED due to: Task failed task_1404050864296_0001_r_000001
Job failed as tasks failed. failedMaps:0 failedReduces:1
14/06/29 16:11:58 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=1706752372
FILE: Number of bytes written=3414132444
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1319319669
HDFS: Number of bytes written=0
HDFS: Number of read operations=30
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed reduce tasks=7
Killed reduce tasks=1
Launched map tasks=10
Launched reduce tasks=8
Data-local map tasks=10
Total time spent by all maps in occupied slots (ms)=12527776
Total time spent by all reduces in occupied slots (ms)=1256256
Total time spent by all map tasks (ms)=782986
Total time spent by all reduce tasks (ms)=78516
Total vcore-seconds taken by all map tasks=782986
Total vcore-seconds taken by all reduce tasks=78516
Total megabyte-seconds taken by all map tasks=6263888000
Total megabyte-seconds taken by all reduce tasks=628128000
Map-Reduce Framework
Map input records=85331845
Map output records=170663690
Map output bytes=1365309520
Map output materialized bytes=1706637020
Input split bytes=980
Combine input records=0
Spilled Records=341327380
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=2573
CPU time spent (ms)=820310
Physical memory (bytes) snapshot=18048614400
Virtual memory (bytes) snapshot=72212246528
Total committed heap usage (bytes)=28289007616
File Input Format Counters
Bytes Read=1319318689
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at pegasus.DegDist.run(DegDist.java:201)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at pegasus.DegDist.main(DegDist.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
I have checked the Logs but there is no information why reduce task was killed. Is there a way to find out why this reduce task was killed.I am intereseted in specific reason of killing Reduce Job.

Resources