I am seeing a very odd ArrayIndexOutOfBoundsException in a Scalding-driven job running on Hadoop 2.7.1. The mapper log is below. It looks like the equator somehow gets set to a negative number when spill 2 starts (see the "(EQUATOR) -1742031900" line). Is this normal?
2015-08-12 23:39:19,649 INFO [main] org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2015-08-12 23:39:20,174 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 0 kvi 469762044(1879048176)
2015-08-12 23:39:20,175 INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 1792
2015-08-12 23:39:20,175 INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 187904816
2015-08-12 23:39:20,175 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 1879048192
2015-08-12 23:39:20,175 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 469762044; length = 117440512
2015-08-12 23:39:20,214 INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2015-08-12 23:39:20,216 INFO [main] cascading.flow.hadoop.FlowMapper: cascading version: 2.6.1
2015-08-12 23:39:20,216 INFO [main] cascading.flow.hadoop.FlowMapper: child jvm opts: -Xmx1024m -Djava.io.tmpdir=./tmp
2015-08-12 23:39:20,516 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
2015-08-12 23:39:20,552 INFO [main] cascading.flow.hadoop.FlowMapper: sourcing from: TempHfs["SequenceFile[['docId', 'otherDocId', 'score']]"][9909013673/_pipe_11__pipe_12/]
2015-08-12 23:39:20,552 INFO [main] cascading.flow.hadoop.FlowMapper: sinking to: GroupBy(_pipe_11+_pipe_12)[by:[{1}:'docId']]
2015-08-12 23:39:29,424 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2015-08-12 23:39:29,424 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 108647886; bufvoid = 1879048192
2015-08-12 23:39:29,424 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 469762044(1879048176); kvend = 449947816(1799791264); length = 19814229/117440512
2015-08-12 23:39:29,425 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 839953118 kvi 209988272(839953088)
2015-08-12 23:39:43,985 INFO [SpillThread] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.gz]
2015-08-12 23:39:46,767 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 0
2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 839953118 kv 209988272(839953088) kvi 178264648(713058592)
2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 839953118; bufend = 1014433072; bufvoid = 1879048192
2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 209988272(839953088); kvend = 178264648(713058592); length = 31723625/117440512
2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 1696670336 kvi 424167580(1696670320)
2015-08-12 23:40:22,641 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 1
2015-08-12 23:40:22,641 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 1696670336 kv 424167580(1696670320) kvi 392768808(1571075232)
2015-08-12 23:40:22,641 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2015-08-12 23:40:22,641 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 1696670336; bufend = 1869363604; bufvoid = 1879048192
2015-08-12 23:40:22,641 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 424167580(1696670320); kvend = 392768808(1571075232); length = 31398773/117440512
2015-08-12 23:40:22,642 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) -1742031900 kvi 34254072(137016288)
2015-08-12 23:40:47,329 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 2
2015-08-12 23:40:47,330 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator -1742031900 kv 34254072(137016288) kvi 34254072(137016288)
2015-08-12 23:40:47,331 ERROR [main] cascading.flow.stream.TrapHandler: caught Throwable, no trap available, rethrowing
cascading.flow.stream.DuctException: internal error: ['7541904654925238223', '2.812180059539485']
at cascading.flow.hadoop.stream.HadoopGroupByGate.receive(HadoopGroupByGate.java:81)
at cascading.flow.hadoop.stream.HadoopGroupByGate.receive(HadoopGroupByGate.java:37)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:80)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:145)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:133)
at cascading.operation.Identity$2.operate(Identity.java:137)
at cascading.operation.Identity.operate(Identity.java:150)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ArrayIndexOutOfBoundsException
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1453)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1349)
at java.io.DataOutputStream.write(DataOutputStream.java:88)
at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:273)
at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:253)
at cascading.tuple.hadoop.io.HadoopTupleOutputStream.writeIntInternal(HadoopTupleOutputStream.java:155)
at cascading.tuple.io.TupleOutputStream.write(TupleOutputStream.java:86)
at cascading.tuple.io.TupleOutputStream.writeTuple(TupleOutputStream.java:64)
at cascading.tuple.hadoop.io.TupleSerializer.serialize(TupleSerializer.java:37)
at cascading.tuple.hadoop.io.TupleSerializer.serialize(TupleSerializer.java:28)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1149)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:610)
at cascading.tap.hadoop.util.MeasuredOutputCollector.collect(MeasuredOutputCollector.java:69)
at cascading.flow.hadoop.stream.HadoopGroupByGate.receive(HadoopGroupByGate.java:68)
... 18 more
mapreduce.task.io.sort.mb is what made the difference. When it is set to 2 GB or larger, the job consistently runs into this problem.
I suggest setting it to the value below or smaller:
-Dmapreduce.task.io.sort.mb=1792
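To illustrate why a large sort buffer can produce a negative equator, here is a minimal sketch of 32-bit wrap-around arithmetic on a circular buffer. It is not the Hadoop source: the method names are made up, and the offset is a hypothetical value back-computed so that the result matches the logged equator. The capacity and bufend values come from the log in the question. Note that the log already shows the negative equator with sort.mb at 1792, so a smaller value may be needed in practice.

```java
// Sketch only: NOT the Hadoop implementation. It shows how advancing a
// position inside a large circular buffer can overflow a 32-bit int.
class EquatorOverflowSketch {

    /** Buffer size in bytes for a given sort.mb, computed in 64-bit. */
    static long bufferBytes(int sortMb) {
        return (long) sortMb << 20;
    }

    /** Naive wrap-around: the int sum can overflow before the modulo is applied. */
    static int naiveAdvance(int pos, int offset, int capacity) {
        return (pos + offset) % capacity;
    }

    /** Same computation with the sum widened to long, so it cannot overflow. */
    static int safeAdvance(int pos, int offset, int capacity) {
        return (int) (((long) pos + offset) % capacity);
    }

    public static void main(String[] args) {
        int capacity = 1879048192; // bufvoid from the log (sort.mb = 1792)
        int bufend = 1869363604;   // bufend of the third spill, from the log
        int offset = 683571792;    // hypothetical offset, chosen for illustration

        System.out.println(bufferBytes(1792));                      // 1879048192
        System.out.println(bufferBytes(2048) > Integer.MAX_VALUE);  // true
        System.out.println(naiveAdvance(bufend, offset, capacity)); // -1742031900
        System.out.println(safeAdvance(bufend, offset, capacity));  // 673887204
    }
}
```

The closer the buffer size gets to Integer.MAX_VALUE, the less headroom a sum of two in-buffer offsets has, which would be consistent with the problem showing up only at large sort.mb values.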
I suspected a threading issue, so I tried the settings below and they worked. I am not sure whether the fix will hold.
<property>
<name>mapreduce.map.sort.spill.percent</name>
<value>0.8</value>
</property>
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>10</value>
</property>
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>100</value>
</property>
<property>
<name>mapred.map.multithreadedrunner.threads</name>
<value>1</value>
</property>
<property>
<name>mapreduce.mapper.multithreadedmapper.threads</name>
<value>1</value>
</property>
Related
I am trying to format the NameNode but don't know what is wrong here.
I have added entries to /etc/hosts and core-site.xml, but I am still not able to format the drive.
[hduser@ec2-54-242-91-165 ~]$ hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
2018-02-14 20:55:25,989 INFO [main] namenode.NameNode (LogAdapter.java:info(47)) - STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.7.5
STARTUP_MSG: classpath = /usr/local/hadoop/etc/:/usr/local/hadoop/share/hadoop/common/lib/httpclient-4.2.5.jar:/usr/local/hadoop/share/hadoop/common/lib/curator-framework-2.7.1.jar:/usr/local/hadoop/share/hadoop/common/lib/curator-recipes-2.7.1.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/usr/local/hadoop/share/hadoop/common/lib/paranamer-2.3.jar:/usr/local/hadoop/share/hadoop/common/lib/jettison-1.1.jar:/usr/local/hadoop/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/usr/local/hadoop/share/hadoop/common/lib/jersey-core-1.9.jar:/usr/local/hadoop/share/hadoop/common/lib/jets3t-0.9.0.jar:/usr/local/hadoop/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/common/lib/junit-4.11.jar:/usr/local/hadoop/share/hadoop/common/lib/jersey-json-1.9.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-math3-3.1.1.jar:/usr/local/hadoop/share/hadoop/common/lib/jetty-util-6.1.26.jar:/usr/local/hadoop/share/hadoop/common/lib/avro-1.7.4.jar:/usr/local/hadoop/share/hadoop/common/lib/servlet-api-2.5.jar:/usr/local/hadoop/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/usr/local/hadoop/share/hadoop/common/lib/gson-2.2.4.jar:/usr/local/hadoop/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/usr/local/hadoop/share/hadoop/common/lib/jsch-0.1.54.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-io-2.4.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-net-3.1.jar:/usr/local/hadoop/share/hadoop/common/lib/jetty-sslengine-6.1.26.jar:/usr/local/hadoop/share/hadoop/common/lib/hadoop-annotations-2.7.5.jar:/usr/local/hadoop/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/usr/local/hadoop/share/hadoop/common/lib/mockito-all-1.8.5.jar:/usr/local/hadoop/share/hadoop/common/lib/xmlenc-0.52.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-collections-3.2.2.jar:/usr/local/hadoop/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/usr/local/hadoop/share/hadoop/
common/lib/httpcore-4.2.5.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-httpclient-3.1.jar:/usr/local/hadoop/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/usr/local/hadoop/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar:/usr/local/hadoop/share/hadoop/common/lib/jersey-server-1.9.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-compress-1.4.1.jar:/usr/local/hadoop/share/hadoop/common/lib/stax-api-1.0-2.jar:/usr/local/hadoop/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:/usr/local/hadoop/share/hadoop/common/lib/curator-client-2.7.1.jar:/usr/local/hadoop/share/hadoop/common/lib/jetty-6.1.26.jar:/usr/local/hadoop/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/usr/local/hadoop/share/hadoop/common/lib/xz-1.0.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-logging-1.1.3.jar:/usr/local/hadoop/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/usr/local/hadoop/share/hadoop/common/lib/jsr305-3.0.0.jar:/usr/local/hadoop/share/hadoop/common/lib/activation-1.1.jar:/usr/local/hadoop/share/hadoop/common/lib/htrace-core-3.1.0-incubating.jar:/usr/local/hadoop/share/hadoop/common/lib/netty-3.6.2.Final.jar:/usr/local/hadoop/share/hadoop/common/lib/guava-11.0.2.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/usr/local/hadoop/share/hadoop/common/lib/log4j-1.2.17.jar:/usr/local/hadoop/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/usr/local/hadoop/share/hadoop/common/lib/asm-3.2.jar:/usr/local/hadoop/share/hadoop/common/lib/hadoop-auth-2.7.5.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-lang-2.6.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-digester-1.8.jar:/usr/local/hadoop/share/hadoop/common/lib/commons-codec-1.4.jar:/usr/local/hadoop/share/hadoop/common/lib/zookeeper-3.4.6.jar:/usr/local/hadoop/share/hadoop/common/lib/hamcrest-core-1.3.jar:/usr/local/hadoop/sha
re/hadoop/common/lib/commons-cli-1.2.jar:/usr/local/hadoop/share/hadoop/common/lib/slf4j-api-1.7.10.jar:/usr/local/hadoop/share/hadoop/common/lib/jsp-api-2.1.jar:/usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.5.jar:/usr/local/hadoop/share/hadoop/common/hadoop-nfs-2.7.5.jar:/usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.5-tests.jar:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/jersey-core-1.9.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jetty-util-6.1.26.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/servlet-api-2.5.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-io-2.4.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/xmlenc-0.52.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/xercesImpl-2.9.1.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jersey-server-1.9.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jetty-6.1.26.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/jsr305-3.0.0.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/netty-all-4.0.23.Final.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/leveldbjni-all-1.8.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/htrace-core-3.1.0-incubating.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/netty-3.6.2.Final.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/guava-11.0.2.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/xml-apis-1.3.04.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/log4j-1.2.17.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/asm-3.2.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-lang-2.6.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-codec-1.4.jar:/usr/local/hadoop/share/hadoop/hdfs/lib/commons-cli-1.2.jar:/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.7.5-tests.jar:/usr/lo
cal/hadoop/share/hadoop/hdfs/hadoop-hdfs-nfs-2.7.5.jar:/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.7.5.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jettison-1.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-core-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-json-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jetty-util-6.1.26.jar:/usr/local/hadoop/share/hadoop/yarn/lib/servlet-api-2.5.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jackson-jaxrs-1.9.13.jar:/usr/local/hadoop/share/hadoop/yarn/lib/protobuf-java-2.5.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-io-2.4.jar:/usr/local/hadoop/share/hadoop/yarn/lib/javax.inject-1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/zookeeper-3.4.6-tests.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-collections-3.2.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jaxb-impl-2.2.3-1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-server-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-compress-1.4.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/stax-api-1.0-2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jetty-6.1.26.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/yarn/lib/xz-1.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-logging-1.1.3.jar:/usr/local/hadoop/share/hadoop/yarn/lib/aopalliance-1.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jaxb-api-2.2.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jsr305-3.0.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-guice-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/activation-1.1.jar:/usr/local/hadoop/share/hadoop/yarn/lib/leveldbjni-all-1.8.jar:/usr/local/hadoop/share/hadoop/yarn/lib/netty-3.6.2.Final.jar:/usr/local/hadoop/share/hadoop/yarn/lib/guava-11.0.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jersey-client-1.9.jar:/usr/local/hadoop/share/hadoop/yarn/lib/log4j-1.2.17.jar:/usr/local/hadoop/share/hadoop/yarn/lib/jackso
n-xc-1.9.13.jar:/usr/local/hadoop/share/hadoop/yarn/lib/asm-3.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-lang-2.6.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-codec-1.4.jar:/usr/local/hadoop/share/hadoop/yarn/lib/zookeeper-3.4.6.jar:/usr/local/hadoop/share/hadoop/yarn/lib/guice-3.0.jar:/usr/local/hadoop/share/hadoop/yarn/lib/commons-cli-1.2.jar:/usr/local/hadoop/share/hadoop/yarn/lib/guice-servlet-3.0.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-common-2.7.5.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-tests-2.7.5.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-2.7.5.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.7.5.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-common-2.7.5.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-registry-2.7.5.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-api-2.7.5.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.7.5.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-2.7.5.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-client-2.7.5.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-sharedcachemanager-2.7.5.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.7.5.jar:/usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-server-applicationhistoryservice-2.7.5.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/paranamer-2.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jersey-core-1.9.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/junit-4.11.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/avro-1.7.4.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/protobuf-java-2.5.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/commons-io-2.4.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/javax.inject-1.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/hadoop-annotations-
2.7.5.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jersey-server-1.9.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/commons-compress-1.4.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/snappy-java-1.0.4.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jackson-mapper-asl-1.9.13.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/xz-1.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/aopalliance-1.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/jersey-guice-1.9.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/leveldbjni-all-1.8.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/netty-3.6.2.Final.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/log4j-1.2.17.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/asm-3.2.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/guice-3.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/hamcrest-core-1.3.jar:/usr/local/hadoop/share/hadoop/mapreduce/lib/guice-servlet-3.0.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.7.5.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-app-2.7.5.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-2.7.5.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-2.7.5.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.5.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.5.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.5-tests.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-2.7.5.jar
STARTUP_MSG: build = https://shv@git-wip-us.apache.org/repos/asf/hadoop.git -r 18065c2b6806ed4aa6a3187d77cbe21bb3dba075; compiled by 'kshvachk' on 2017-12-16T01:06Z
STARTUP_MSG: java = 1.8.0_161
************************************************************/
2018-02-14 20:55:25,999 INFO [main] namenode.NameNode (LogAdapter.java:info(47)) - registered UNIX signal handlers for [TERM, HUP, INT]
2018-02-14 20:55:26,001 INFO [main] namenode.NameNode (NameNode.java:createNameNode(1441)) - createNameNode [-format]
2018-02-14 20:55:26,590 WARN [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Formatting using clusterid: CID-d4daffff-5e9b-4eba-9de0-a880e9de496d
2018-02-14 20:55:26,943 INFO [main] namenode.FSNamesystem (FSNamesystem.java:<init>(721)) - No KeyProvider found.
2018-02-14 20:55:26,944 INFO [main] namenode.FSNamesystem (FSNamesystemLock.java:<init>(120)) - fsLock is fair: true
2018-02-14 20:55:26,947 INFO [main] namenode.FSNamesystem (FSNamesystemLock.java:<init>(136)) - Detailed lock hold time metrics enabled: false
2018-02-14 20:55:26,989 INFO [main] blockmanagement.DatanodeManager (DatanodeManager.java:<init>(239)) - dfs.block.invalidate.limit=1000
2018-02-14 20:55:26,990 INFO [main] blockmanagement.DatanodeManager (DatanodeManager.java:<init>(245)) - dfs.namenode.datanode.registration.ip-hostname-check=true
2018-02-14 20:55:26,991 INFO [main] blockmanagement.BlockManager (InvalidateBlocks.java:printBlockDeletionTime(72)) - dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2018-02-14 20:55:26,993 INFO [main] blockmanagement.BlockManager (InvalidateBlocks.java:printBlockDeletionTime(77)) - The block deletion will start around 2018 Feb 14 20:55:26
2018-02-14 20:55:26,995 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(354)) - Computing capacity for map BlocksMap
2018-02-14 20:55:26,995 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(355)) - VM type = 64-bit
2018-02-14 20:55:26,997 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(356)) - 2.0% max memory 966.7 MB = 19.3 MB
2018-02-14 20:55:26,997 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(361)) - capacity = 2^21 = 2097152 entries
2018-02-14 20:55:27,037 INFO [main] blockmanagement.BlockManager (BlockManager.java:createBlockTokenSecretManager(385)) - dfs.block.access.token.enable=false
2018-02-14 20:55:27,039 INFO [main] blockmanagement.BlockManager (BlockManager.java:<init>(371)) - defaultReplication = 3
2018-02-14 20:55:27,040 INFO [main] blockmanagement.BlockManager (BlockManager.java:<init>(372)) - maxReplication = 512
2018-02-14 20:55:27,041 INFO [main] blockmanagement.BlockManager (BlockManager.java:<init>(373)) - minReplication = 1
2018-02-14 20:55:27,041 INFO [main] blockmanagement.BlockManager (BlockManager.java:<init>(374)) - maxReplicationStreams = 2
2018-02-14 20:55:27,041 INFO [main] blockmanagement.BlockManager (BlockManager.java:<init>(375)) - replicationRecheckInterval = 3000
2018-02-14 20:55:27,041 INFO [main] blockmanagement.BlockManager (BlockManager.java:<init>(376)) - encryptDataTransfer = false
2018-02-14 20:55:27,041 INFO [main] blockmanagement.BlockManager (BlockManager.java:<init>(377)) - maxNumBlocksToLog = 1000
2018-02-14 20:55:27,054 INFO [main] namenode.FSNamesystem (FSNamesystem.java:<init>(749)) - fsOwner = hduser (auth:SIMPLE)
2018-02-14 20:55:27,056 INFO [main] namenode.FSNamesystem (FSNamesystem.java:<init>(750)) - supergroup = supergroup
2018-02-14 20:55:27,056 INFO [main] namenode.FSNamesystem (FSNamesystem.java:<init>(751)) - isPermissionEnabled = true
2018-02-14 20:55:27,057 INFO [main] namenode.FSNamesystem (FSNamesystem.java:<init>(762)) - HA Enabled: false
2018-02-14 20:55:27,059 INFO [main] namenode.FSNamesystem (FSNamesystem.java:<init>(799)) - Append Enabled: true
2018-02-14 20:55:27,169 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(354)) - Computing capacity for map INodeMap
2018-02-14 20:55:27,169 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(355)) - VM type = 64-bit
2018-02-14 20:55:27,170 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(356)) - 1.0% max memory 966.7 MB = 9.7 MB
2018-02-14 20:55:27,171 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(361)) - capacity = 2^20 = 1048576 entries
2018-02-14 20:55:27,174 INFO [main] namenode.FSDirectory (FSDirectory.java:<init>(241)) - ACLs enabled? false
2018-02-14 20:55:27,180 INFO [main] namenode.FSDirectory (FSDirectory.java:<init>(245)) - XAttrs enabled? true
2018-02-14 20:55:27,181 INFO [main] namenode.FSDirectory (FSDirectory.java:<init>(253)) - Maximum size of an xattr: 16384
2018-02-14 20:55:27,181 INFO [main] namenode.NameNode (FSDirectory.java:<init>(304)) - Caching file names occuring more than 10 times
2018-02-14 20:55:27,191 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(354)) - Computing capacity for map cachedBlocks
2018-02-14 20:55:27,196 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(355)) - VM type = 64-bit
2018-02-14 20:55:27,197 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(356)) - 0.25% max memory 966.7 MB = 2.4 MB
2018-02-14 20:55:27,197 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(361)) - capacity = 2^18 = 262144 entries
2018-02-14 20:55:27,198 INFO [main] namenode.FSNamesystem (FSNamesystem.java:<init>(5147)) - dfs.namenode.safemode.threshold-pct = 0.9990000128746033
2018-02-14 20:55:27,198 INFO [main] namenode.FSNamesystem (FSNamesystem.java:<init>(5148)) - dfs.namenode.safemode.min.datanodes = 0
2018-02-14 20:55:27,198 INFO [main] namenode.FSNamesystem (FSNamesystem.java:<init>(5149)) - dfs.namenode.safemode.extension = 30000
2018-02-14 20:55:27,201 INFO [main] metrics.TopMetrics (TopMetrics.java:logConf(76)) - NNTop conf: dfs.namenode.top.window.num.buckets = 10
2018-02-14 20:55:27,204 INFO [main] metrics.TopMetrics (TopMetrics.java:logConf(78)) - NNTop conf: dfs.namenode.top.num.users = 10
2018-02-14 20:55:27,205 INFO [main] metrics.TopMetrics (TopMetrics.java:logConf(80)) - NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2018-02-14 20:55:27,211 INFO [main] namenode.FSNamesystem (FSNamesystem.java:initRetryCache(908)) - Retry cache on namenode is enabled
2018-02-14 20:55:27,216 INFO [main] namenode.FSNamesystem (FSNamesystem.java:initRetryCache(916)) - Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2018-02-14 20:55:27,219 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(354)) - Computing capacity for map NameNodeRetryCache
2018-02-14 20:55:27,219 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(355)) - VM type = 64-bit
2018-02-14 20:55:27,219 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(356)) - 0.029999999329447746% max memory 966.7 MB = 297.0 KB
2018-02-14 20:55:27,222 INFO [main] util.GSet (LightWeightGSet.java:computeCapacity(361)) - capacity = 2^15 = 32768 entries
Re-format filesystem in Storage Directory /tmp/hadoop-hduser/dfs/name ? (Y or N) Y
2018-02-14 20:56:40,909 INFO [main] namenode.FSImage (FSImage.java:format(176)) - Allocated new BlockPoolId: BP-1602166314-127.0.0.1-1518641800839
2018-02-14 20:56:40,926 INFO [main] common.Storage (NNStorage.java:format(568)) - Storage directory /tmp/hadoop-hduser/dfs/name has been successfully formatted.
2018-02-14 20:56:40,945 INFO [FSImageSaver for /tmp/hadoop-hduser/dfs/name of type IMAGE_AND_EDITS] namenode.FSImageFormatProtobuf (FSImageFormatProtobuf.java:save(413)) - Saving image file /tmp/hadoop-hduser/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2018-02-14 20:56:41,108 INFO [FSImageSaver for /tmp/hadoop-hduser/dfs/name of type IMAGE_AND_EDITS] namenode.FSImageFormatProtobuf (FSImageFormatProtobuf.java:save(416)) - Image file /tmp/hadoop-hduser/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 323 bytes saved in 0 seconds.
2018-02-14 20:56:41,123 INFO [main] namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:getImageTxIdToRetain(203)) - Going to retain 1 images with txid >= 0
2018-02-14 20:56:41,125 INFO [main] util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 0
2018-02-14 20:56:41,126 INFO [Thread-1] namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/127.0.0.1
************************************************************/
As I understand Sqoop, it launches a few mappers on different data nodes, each of which opens a JDBC connection to the RDBMS. Once the connection is established, the data is transferred to HDFS.
I am just trying to understand: does a Sqoop mapper spill data temporarily to disk on the data node? I know spilling happens in MapReduce, but I am not sure about a Sqoop job.
It seems sqoop-import runs as a map-only job and doesn't spill, while sqoop-merge runs as a full map-reduce job and does spill. You can check this in the JobTracker while a sqoop import is running.
Have a look at this part of a sqoop import log. It does not spill; it fetches the rows and writes them to HDFS:
INFO [main] ... mapreduce.db.DataDrivenDBRecordReader: Using query: SELECT...
[main] mapreduce.db.DBRecordReader: Executing query: SELECT...
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
INFO [Thread-16] ...mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false
INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1489705733959_2462784_m_000000_0 is done. And is in the process of committing
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_1489705733959_2462784_m_000000_0' to hdfs://
Now have a look at this sqoop-merge log (some rows skipped). It spills to disk (note "Spilling map output" in the log):
INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://bla-bla/part-m-00000:0+48322717
...
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
...
INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 1024
INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 751619264
INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 1073741824
INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452; length = 67108864
INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$**MapOutputBuffer**
INFO [main] com.pepperdata.supervisor.agent.resource.r: Datanode bla-bla is LOCAL.
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.snappy]
...
INFO [main] org.apache.hadoop.mapred.MapTask: **Starting flush of map output**
INFO [main] org.apache.hadoop.mapred.MapTask: **Spilling map output**
INFO [main] org.apache.hadoop.mapred.MapTask: **bufstart** = 0; **bufend** = 184775274; bufvoid = 1073741824
INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452(1073741808); kvend = 267347800(1069391200); length = 1087653/67108864
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
[main] org.apache.hadoop.mapred.MapTask: Finished spill 0
...Task:attempt_1489705733959_2479291_m_000000_0 is done. And is in the process of committing
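If you want to check a task log for spills without reading it by eye, a small helper along these lines can count the "Finished spill N" lines. This is only a sketch: the class and method names are made up, and it assumes the log text uses the same MapTask message format as the excerpts above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: counts spill completions in the text of a map task log.
class SpillLogCheck {

    // Matches lines such as "...mapred.MapTask: Finished spill 0".
    private static final Pattern FINISHED_SPILL =
            Pattern.compile("MapTask: Finished spill (\\d+)");

    /** Returns how many "Finished spill N" messages appear in the log text. */
    static int countFinishedSpills(String log) {
        int count = 0;
        Matcher m = FINISHED_SPILL.matcher(log);
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String mergeLog = String.join("\n",
                "INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output",
                "[main] org.apache.hadoop.mapred.MapTask: Finished spill 0");
        String importLog =
                "INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]";

        System.out.println(countFinishedSpills(mergeLog));  // 1
        System.out.println(countFinishedSpills(importLog)); // 0
    }
}
```

A count of zero for every map task would be consistent with a map-only job such as the import above.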
I have written a MapReduce program. At first it ran fine, but after I changed something, my computer suddenly reported that it was out of memory. I then realized that the job was using a lot of memory, and I don't know why. After I deleted the spill files, I found that the program no longer runs correctly. It spills constantly, and I don't remember what code I changed. Here are my mapper, reducer, driver and the console messages:
Mapper:
package SalesProduct;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SalesCategoryMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private DoubleWritable one = new DoubleWritable(1);

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // TODO Auto-generated method stub
        String valueString = value.toString();
        StringTokenizer tokenizerArticle = new StringTokenizer(valueString,"\n");
        System.out.println("Here: In map \n");
        while (tokenizerArticle.hasMoreTokens()){
            //StringTokenizer tokenizer = new StringTokenizer(tokenizerArticle.nextToken());
            String[] items = valueString.split("\t");
            String itemName = items[3];
            double itemPrice = Double.parseDouble(items[4]);
            context.write(new Text(itemName), new DoubleWritable(itemPrice));
            //context.write(new Text(itemName), one);
        }
    }
}
Reducer:
package SalesProduct;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
public class SalesItemCategoryReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private DoubleWritable result = new DoubleWritable();
    public void reduce(Text t_key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        Text key = t_key;
        double sum = 0;
        for(DoubleWritable val : values){
            sum = sum + val.get();
        }
        /*
        while(values.hasNext()){
            DoubleWritable tmp = values.next();
            sum = sum + tmp.get();
        }*/
        //result.set(sum);
        context.write(key, result);
    }
}
Driver:
package SalesResult;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SalesItemDriver {
    public static void main(String[] args) throws ClassNotFoundException, IOException, InterruptedException{
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf,"SalesItemDriver");
        job.setJarByClass(SalesItemDriver.class);
        // get category
        job.setMapperClass(SalesProduct.SalesCategoryMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);
        job.setCombinerClass(SalesProduct.SalesItemCategoryReducer.class);
        job.setReducerClass(SalesProduct.SalesItemCategoryReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        // Set the input split size
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);
        CombineTextInputFormat.setMinInputSplitSize(job, 2097152);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        //FileOutputFormat.setOutputPath(job, new Path(args[1]));
        //Path path = new Path(args[1]);
        Path path = new Path(args[1]);
        FileSystem fs = FileSystem.get(conf);
        if(fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);
        job.waitForCompletion(true);
    }
}
2017-04-21 22:04:50,780 WARN [org.apache.hadoop.util.NativeCodeLoader] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-21 22:04:51,843 INFO [org.apache.hadoop.conf.Configuration.deprecation] - session.id is deprecated. Instead, use dfs.metrics.session-id
2017-04-21 22:04:51,844 INFO [org.apache.hadoop.metrics.jvm.JvmMetrics] - Initializing JVM Metrics with processName=JobTracker, sessionId=
2017-04-21 22:04:52,132 WARN [org.apache.hadoop.mapreduce.JobResourceUploader] - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2017-04-21 22:04:52,138 WARN [org.apache.hadoop.mapreduce.JobResourceUploader] - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2017-04-21 22:04:52,148 INFO [org.apache.hadoop.mapreduce.lib.input.FileInputFormat] - Total input files to process : 1
2017-04-21 22:04:52,256 INFO [org.apache.hadoop.mapreduce.JobSubmitter] - number of splits:2
2017-04-21 22:04:52,412 INFO [org.apache.hadoop.mapreduce.JobSubmitter] - Submitting tokens for job: job_local1001883244_0001
2017-04-21 22:04:52,646 INFO [org.apache.hadoop.mapreduce.Job] - The url to track the job: http://localhost:8080/
2017-04-21 22:04:52,647 INFO [org.apache.hadoop.mapreduce.Job] - Running job: job_local1001883244_0001
2017-04-21 22:04:52,648 INFO [org.apache.hadoop.mapred.LocalJobRunner] - OutputCommitter set in config null
2017-04-21 22:04:52,653 INFO [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] - File Output Committer Algorithm version is 1
2017-04-21 22:04:52,653 INFO [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2017-04-21 22:04:52,654 INFO [org.apache.hadoop.mapred.LocalJobRunner] - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2017-04-21 22:04:52,717 INFO [org.apache.hadoop.mapred.LocalJobRunner] - Waiting for map tasks
2017-04-21 22:04:52,718 INFO [org.apache.hadoop.mapred.LocalJobRunner] - Starting task: attempt_local1001883244_0001_m_000000_0
2017-04-21 22:04:52,742 INFO [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] - File Output Committer Algorithm version is 1
2017-04-21 22:04:52,742 INFO [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2017-04-21 22:04:52,754 INFO [org.apache.hadoop.yarn.util.ProcfsBasedProcessTree] - ProcfsBasedProcessTree currently is supported only on Linux.
2017-04-21 22:04:52,754 INFO [org.apache.hadoop.mapred.Task] - Using ResourceCalculatorProcessTree : null
2017-04-21 22:04:52,760 INFO [org.apache.hadoop.mapred.MapTask] - Processing split: hdfs://localhost:9000/input/purchases.txt:0+134217728
2017-04-21 22:04:52,837 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 0 kvi 26214396(104857584)
2017-04-21 22:04:52,837 INFO [org.apache.hadoop.mapred.MapTask] - mapreduce.task.io.sort.mb: 100
2017-04-21 22:04:52,837 INFO [org.apache.hadoop.mapred.MapTask] - soft limit at 83886080
2017-04-21 22:04:52,837 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 0; bufvoid = 104857600
2017-04-21 22:04:52,837 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 26214396; length = 6553600
2017-04-21 22:04:52,841 INFO [org.apache.hadoop.mapred.MapTask] - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2017-04-21 22:04:53,652 INFO [org.apache.hadoop.mapreduce.Job] - Job job_local1001883244_0001 running in uber mode : false
2017-04-21 22:04:53,654 INFO [org.apache.hadoop.mapreduce.Job] - map 0% reduce 0%
2017-04-21 22:04:54,718 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
2017-04-21 22:04:54,718 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 0; bufend = 49471275; bufvoid = 104857600
2017-04-21 22:04:54,718 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 26214396(104857584); kvend = 17610700(70442800); length = 8603697/6553600
2017-04-21 22:04:54,718 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 58074971 kvi 14518736(58074944)
2017-04-21 22:04:55,730 INFO [org.apache.hadoop.mapred.MapTask] - Finished spill 0
2017-04-21 22:04:55,738 INFO [org.apache.hadoop.mapred.MapTask] - (RESET) equator 58074971 kv 14518736(58074944) kvi 12367824(49471296)
2017-04-21 22:04:56,831 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
2017-04-21 22:04:56,831 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 58074971; bufend = 2688654; bufvoid = 104857592
2017-04-21 22:04:56,831 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 14518736(58074944); kvend = 5915040(23660160); length = 8603697/6553600
2017-04-21 22:04:56,831 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 11292334 kvi 2823076(11292304)
2017-04-21 22:04:57,661 INFO [org.apache.hadoop.mapred.MapTask] - Finished spill 1
2017-04-21 22:04:57,670 INFO [org.apache.hadoop.mapred.MapTask] - (RESET) equator 11292334 kv 2823076(11292304) kvi 672168(2688672)
2017-04-21 22:04:58,665 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
2017-04-21 22:04:58,665 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 11292334; bufend = 60763609; bufvoid = 104857600
2017-04-21 22:04:58,665 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 2823076(11292304); kvend = 20433780(81735120); length = 8603697/6553600
2017-04-21 22:04:58,665 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 69367289 kvi 17341816(69367264)
2017-04-21 22:04:59,369 INFO [org.apache.hadoop.mapred.MapTask] - Finished spill 2
2017-04-21 22:04:59,377 INFO [org.apache.hadoop.mapred.MapTask] - (RESET) equator 69367289 kv 17341816(69367264) kvi 15190908(60763632)
2017-04-21 22:05:00,401 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
2017-04-21 22:05:00,401 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 69367289; bufend = 13980964; bufvoid = 104857600
2017-04-21 22:05:00,401 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 17341816(69367264); kvend = 8738120(34952480); length = 8603697/6553600
2017-04-21 22:05:00,401 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 22584644 kvi 5646156(22584624)
2017-04-21 22:05:01,083 INFO [org.apache.hadoop.mapred.MapTask] - Finished spill 3
2017-04-21 22:05:01,092 INFO [org.apache.hadoop.mapred.MapTask] - (RESET) equator 22584644 kv 5646156(22584624) kvi 3495248(13980992)
2017-04-21 22:05:02,071 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
2017-04-21 22:05:02,071 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 22584644; bufend = 72055919; bufvoid = 104857600
2017-04-21 22:05:02,071 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 5646156(22584624); kvend = 23256860(93027440); length = 8603697/6553600
2017-04-21 22:05:02,071 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 80659599 kvi 20164892(80659568)
2017-04-21 22:05:02,769 INFO [org.apache.hadoop.mapred.MapTask] - Finished spill 4
2017-04-21 22:05:02,777 INFO [org.apache.hadoop.mapred.MapTask] - (RESET) equator 80659599 kv 20164892(80659568) kvi 18013984(72055936)
2017-04-21 22:05:03,792 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
2017-04-21 22:05:03,792 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 80659599; bufend = 25273274; bufvoid = 104857600
2017-04-21 22:05:03,792 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 20164892(80659568); kvend = 11561196(46244784); length = 8603697/6553600
2017-04-21 22:05:03,792 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 33876954 kvi 8469232(33876928)
2017-04-21 22:05:04,491 INFO [org.apache.hadoop.mapred.MapTask] - Finished spill 5
2017-04-21 22:05:04,499 INFO [org.apache.hadoop.mapred.MapTask] - (RESET) equator 33876954 kv 8469232(33876928) kvi 6318324(25273296)
2017-04-21 22:05:04,755 INFO [org.apache.hadoop.mapred.LocalJobRunner] - map > map
2017-04-21 22:05:05,507 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
It keeps spilling until I shut it down.
Why?
I'm so confused...
I run this program on my own computer, not in the cloud.
My computer has only 5 GB of memory left; does that matter?
It was runnable at first, meaning it produced a part-00000 file, although the content was not what I expected.
Now it produces many files like that.
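One likely cause, judging only from the mapper as posted: the while loop tests tokenizerArticle.hasMoreTokens() but never calls nextToken(), so the condition never becomes false and the same record is written over and over, which would fill the sort buffer and produce spill after spill. As far as I know, the text input formats used here pass map() one line per call, so the tokenizer loop is probably not needed at all. Below is a minimal plain-Java sketch of the per-line parsing with no loop. The class name, the null-on-bad-input convention and the sample line are hypothetical, and it deliberately leaves out the Hadoop types so it can be run on its own.

```java
import java.util.AbstractMap;
import java.util.Map;

public class PurchaseLineParser {
    // Returns (item name, price) taken from columns 3 and 4 of one tab-separated line,
    // or null when the line has too few columns or the price is not a number.
    static Map.Entry<String, Double> parse(String line) {
        String[] items = line.split("\t");
        if (items.length < 5) {
            return null;
        }
        try {
            return new AbstractMap.SimpleEntry<>(items[3], Double.parseDouble(items[4]));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample line; the real layout of purchases.txt is not shown in the question.
        Map.Entry<String, Double> entry = parse("2012-01-01\t09:00\tSan Jose\tMusic\t214.05\tAmex");
        System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
}
```

In the mapper, the body of map() would then parse the line once and call context.write a single time per input line. Separately, the reducer as posted writes result without ever setting it, because result.set(sum) is commented out, which might explain output that was not as expected.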
I'm very new to Hadoop MapReduce. I installed a multi-node cluster, but I still get sequential execution.
How can I work out whether my program is running on the other machines in the cluster or not?
This is the result of the execution:
Picked up _JAVA_OPTIONS: -Xmx1g
16/06/07 14:49:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/07 14:49:19 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/06/07 14:49:19 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/06/07 14:49:21 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/06/07 14:49:21 INFO input.FileInputFormat: Total input paths to process : 3
16/06/07 14:49:22 INFO mapreduce.JobSubmitter: number of splits:3
16/06/07 14:49:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1881318657_0001
16/06/07 14:49:24 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/06/07 14:49:24 INFO mapreduce.Job: Running job: job_local1881318657_0001
16/06/07 14:49:24 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/06/07 14:49:24 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/06/07 14:49:24 INFO mapred.LocalJobRunner: Waiting for map tasks
16/06/07 14:49:24 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_m_000000_0
16/06/07 14:49:24 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 14:49:24 INFO mapred.MapTask: Processing split: hdfs://master:9000/input/leukemia.txt:0+1172207
16/06/07 14:49:24 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/06/07 14:49:24 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/06/07 14:49:24 INFO mapred.MapTask: soft limit at 83886080
16/06/07 14:49:24 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/06/07 14:49:24 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/06/07 14:49:24 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/06/07 14:49:25 INFO mapreduce.Job: Job job_local1881318657_0001 running in uber mode : false
16/06/07 14:49:25 INFO mapreduce.Job: map 0% reduce 0%
16/06/07 14:49:31 INFO mapred.LocalJobRunner: map > map
16/06/07 14:49:31 INFO mapreduce.Job: map 22% reduce 0%
-3.042421771435325E-9
-3.042421771435325E-9
-3.042421771435325E-9
-3.042421771435325E-9
-3.042421771435325E-9
-2.9889415942690763E-9
-2.9889415942690763E-9
-2.9889415942690763E-9
-2.9287384547432996E-9
-2.898469757139896E-9
-2.898469757139896E-9
-2.880377562441664E-9
-2.880377562441664E-9
-2.880377562441664E-9
-2.8430632294667886E-9
-2.819146987128837E-9
-2.819146987128837E-9
-2.819146987128837E-9
-2.819146987128837E-9
-2.819146987128837E-9
931
16/06/07 15:00:44 INFO mapred.LocalJobRunner: map > map
16/06/07 15:00:44 INFO mapred.MapTask: Starting flush of map output
16/06/07 15:00:44 INFO mapred.MapTask: Spilling map output
16/06/07 15:00:44 INFO mapred.MapTask: bufstart = 0; bufend = 14151; bufvoid = 104857600
16/06/07 15:00:44 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
16/06/07 15:00:46 INFO mapred.MapTask: Finished spill 0
16/06/07 15:00:46 INFO mapred.Task: Task:attempt_local1881318657_0001_m_000000_0 is done. And is in the process of committing
16/06/07 15:00:47 INFO mapred.LocalJobRunner: map
16/06/07 15:00:47 INFO mapred.Task: Task 'attempt_local1881318657_0001_m_000000_0' done.
16/06/07 15:00:47 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_m_000000_0
16/06/07 15:00:47 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_m_000001_0
16/06/07 15:00:48 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 15:00:48 INFO mapred.MapTask: Processing split: hdfs://master:9000/input/leukemia1.txt:0+1172207
16/06/07 15:00:48 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/06/07 15:00:48 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/06/07 15:00:48 INFO mapred.MapTask: soft limit at 83886080
16/06/07 15:00:48 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/06/07 15:00:48 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/06/07 15:00:48 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/06/07 15:00:48 INFO mapreduce.Job: map 100% reduce 0%
16/06/07 15:01:47 INFO mapred.LocalJobRunner: map > map
16/06/07 15:01:48 INFO mapreduce.Job: map 56% reduce 0%
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.001716001136338E-9
-2.997252637652067E-9
-2.997252637652067E-9
-2.9593407930592893E-9
-2.9178102507568847E-9
-2.9178102507568847E-9
-2.9178102507568847E-9
-2.8542232742481287E-9
-2.8542232742481287E-9
-2.8510431833778047E-9
-2.8510431833778047E-9
-2.8510431833778047E-9
-2.8510431833778047E-9
-2.8222418341121026E-9
-2.8222418341121026E-9
907
16/06/07 15:11:30 INFO mapred.LocalJobRunner: map > map
16/06/07 15:11:30 INFO mapred.MapTask: Starting flush of map output
16/06/07 15:11:30 INFO mapred.MapTask: Spilling map output
16/06/07 15:11:30 INFO mapred.MapTask: bufstart = 0; bufend = 14151; bufvoid = 104857600
16/06/07 15:11:30 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
16/06/07 15:11:30 INFO mapred.MapTask: Finished spill 0
16/06/07 15:11:30 INFO mapred.Task: Task:attempt_local1881318657_0001_m_000001_0 is done. And is in the process of committing
16/06/07 15:11:30 INFO mapred.LocalJobRunner: map
16/06/07 15:11:30 INFO mapred.Task: Task 'attempt_local1881318657_0001_m_000001_0' done.
16/06/07 15:11:30 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_m_000001_0
16/06/07 15:11:30 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_m_000002_0
16/06/07 15:11:30 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 15:11:30 INFO mapred.MapTask: Processing split: hdfs://master:9000/input/leukemia2.txt:0+1172207
16/06/07 15:11:30 INFO mapreduce.Job: map 100% reduce 0%
16/06/07 15:11:31 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/06/07 15:11:31 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/06/07 15:11:31 INFO mapred.MapTask: soft limit at 83886080
16/06/07 15:11:31 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/06/07 15:11:31 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/06/07 15:11:31 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/06/07 15:11:37 INFO mapred.LocalJobRunner: map > map
16/06/07 15:11:38 INFO mapreduce.Job: map 89% reduce 0%
-3.064963887619912E-9
-3.064963887619912E-9
-3.064963887619912E-9
-3.064963887619912E-9
-3.064963887619912E-9
-3.0090989883906007E-9
-2.9474075636124447E-9
-2.9474075636124447E-9
-2.9474075636124447E-9
-2.9388849943338927E-9
-2.9388849943338927E-9
-2.8915704649620403E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
925
16/06/07 15:20:19 INFO mapred.LocalJobRunner: map > map
16/06/07 15:20:19 INFO mapred.MapTask: Starting flush of map output
16/06/07 15:20:19 INFO mapred.MapTask: Spilling map output
16/06/07 15:20:19 INFO mapred.MapTask: bufstart = 0; bufend = 14151; bufvoid = 104857600
16/06/07 15:20:19 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
16/06/07 15:20:20 INFO mapred.MapTask: Finished spill 0
16/06/07 15:20:20 INFO mapred.Task: Task:attempt_local1881318657_0001_m_000002_0 is done. And is in the process of committing
16/06/07 15:20:22 INFO mapred.LocalJobRunner: map
16/06/07 15:20:22 INFO mapred.Task: Task 'attempt_local1881318657_0001_m_000002_0' done.
16/06/07 15:20:22 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_m_000002_0
16/06/07 15:20:22 INFO mapred.LocalJobRunner: map task executor complete.
16/06/07 15:20:22 INFO mapreduce.Job: map 100% reduce 0%
16/06/07 15:20:23 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/06/07 15:20:23 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_r_000000_0
16/06/07 15:20:24 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 15:20:24 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#7f5be2d5
16/06/07 15:20:25 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=668309888, maxSingleShuffleLimit=167077472, mergeThreshold=441084544, ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/06/07 15:20:25 INFO reduce.EventFetcher: attempt_local1881318657_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
16/06/07 15:20:28 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1881318657_0001_m_000002_0 decomp: 14157 len: 14161 to MEMORY
16/06/07 15:20:29 INFO reduce.InMemoryMapOutput: Read 14157 bytes from map-output for attempt_local1881318657_0001_m_000002_0
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14157, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->14157
16/06/07 15:20:30 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1881318657_0001_m_000001_0 decomp: 14157 len: 14161 to MEMORY
16/06/07 15:20:30 INFO reduce.InMemoryMapOutput: Read 14157 bytes from map-output for attempt_local1881318657_0001_m_000001_0
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14157, inMemoryMapOutputs.size() -> 2, commitMemory -> 14157, usedMemory ->28314
16/06/07 15:20:30 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1881318657_0001_m_000000_0 decomp: 14157 len: 14161 to MEMORY
16/06/07 15:20:30 INFO reduce.InMemoryMapOutput: Read 14157 bytes from map-output for attempt_local1881318657_0001_m_000000_0
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14157, inMemoryMapOutputs.size() -> 3, commitMemory -> 28314, usedMemory ->42471
16/06/07 15:20:30 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
16/06/07 15:20:30 INFO mapred.LocalJobRunner: 3 / 3 copied.
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: finalMerge called with 3 in-memory map-outputs and 0 on-disk map-outputs
16/06/07 15:20:30 INFO mapred.Merger: Merging 3 sorted segments
16/06/07 15:20:30 INFO mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 42435 bytes
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: Merged 3 segments, 42471 bytes to disk to satisfy reduce memory limit
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: Merging 1 files, 42471 bytes from disk
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
16/06/07 15:20:30 INFO mapred.Merger: Merging 1 sorted segments
16/06/07 15:20:30 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 42455 bytes
16/06/07 15:20:30 INFO mapred.LocalJobRunner: 3 / 3 copied.
16/06/07 15:20:33 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:33 INFO mapreduce.Job: map 100% reduce 67%
16/06/07 15:20:36 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:38 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
16/06/07 15:20:42 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:42 INFO mapreduce.Job: map 100% reduce 100%
16/06/07 15:20:44 INFO mapred.Task: Task:attempt_local1881318657_0001_r_000000_0 is done. And is in the process of committing
16/06/07 15:20:44 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:44 INFO mapred.Task: Task attempt_local1881318657_0001_r_000000_0 is allowed to commit now
16/06/07 15:20:45 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1881318657_0001_r_000000_0' to hdfs://master:9000/output2/_temporary/0/task_local1881318657_0001_r_000000
16/06/07 15:20:45 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:45 INFO mapred.Task: Task 'attempt_local1881318657_0001_r_000000_0' done.
16/06/07 15:20:45 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_r_000000_0
16/06/07 15:20:45 INFO mapred.LocalJobRunner: reduce task executor complete.
16/06/07 15:20:45 INFO mapreduce.Job: Job job_local1881318657_0001 completed successfully
16/06/07 15:20:46 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=177067554
FILE: Number of bytes written=179551452
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=10549863
HDFS: Number of bytes written=42438
HDFS: Number of read operations=37
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
Map-Reduce Framework
Map input records=3
Map output records=3
Map output bytes=42453
Map output materialized bytes=42483
Input split bytes=557
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=42483
Reduce input records=3
Reduce output records=3
Spilled Records=6
Shuffled Maps =3
Failed Shuffles=0
Merged Map outputs=3
GC time elapsed (ms)=227283
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=2477260800
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=42438
peace
Look at the job ID. Yours says: job_local1881318657_0001 running in uber mode : false. That is a local job. If it had run on a cluster, the ID would not contain "local"; it would just be the job plus the identifiers of the app master and attempts.
You need to check the JobTracker (default port 50030) and explore the details of the job ID mentioned in the logs above.
You can monitor the jobs at:
localhost:8088
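The job ID check described above can be sketched as a tiny helper. This assumes, based only on the logs in this thread, that jobs run by the LocalJobRunner get IDs starting with job_local; the class and method names are hypothetical.

```java
public class JobIdCheck {
    // True when the ID has the job_local prefix seen in the LocalJobRunner logs above.
    static boolean looksLocal(String jobId) {
        return jobId != null && jobId.startsWith("job_local");
    }

    public static void main(String[] args) {
        // ID copied from the log in the question.
        System.out.println(looksLocal("job_local1881318657_0001"));
        // Hypothetical cluster-style ID, made up for contrast.
        System.out.println(looksLocal("job_1400000000000_0001"));
    }
}
```

A prefix test is enough here because the question is only whether the job went through the local runner; it says nothing about how many nodes a cluster job actually used.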
I submitted an MR job on CDH5 Beta 2 using the following hadoop jar command:
hadoop jar ./hadoop-examples-0.0.1-SNAPSHOT.jar com.aravind.learning.hadoop.mapred.join.ReduceSideJoinDriver tech_talks/users.csv tech_talks/ratings.csv tech_talks/output/ReduceSideJoinDriver/
I've also tried providing the fs name and job tracker URL explicitly, as below, without any success:
hadoop jar ./hadoop-examples-0.0.1-SNAPSHOT.jar com.aravind.learning.hadoop.mapred.join.ReduceSideJoinDriver -Dfs.default.name=hdfs://abc.com:8020 -Dmapreduce.job.tracker=x.x.x.x:8021 tech_talks/users.csv tech_talks/ratings.csv tech_talks/output/ReduceSideJoinDriver/
The job runs successfully but uses the LocalJobRunner instead of submitting to the cluster. The output is written to HDFS and is correct. I'm not sure what I am doing wrong here, so I'd appreciate your input. Explicitly specifying the fs and job tracker, as shown above, gives the same result:
14/04/16 20:35:44 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/04/16 20:35:44 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/04/16 20:35:45 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/04/16 20:35:45 INFO input.FileInputFormat: Total input paths to process : 2
14/04/16 20:35:45 INFO mapreduce.JobSubmitter: number of splits:2
14/04/16 20:35:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1427968352_0001
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/staging/ird21427968352/.staging/job_local1427968352_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/staging/ird21427968352/.staging/job_local1427968352_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/local/localRunner/ird2/job_local1427968352_0001/job_local1427968352_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/local/localRunner/ird2/job_local1427968352_0001/job_local1427968352_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/04/16 20:35:46 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
14/04/16 20:35:46 INFO mapreduce.Job: Running job: job_local1427968352_0001
14/04/16 20:35:46 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/04/16 20:35:46 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
14/04/16 20:35:46 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/16 20:35:46 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:46 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:46 INFO mapred.MapTask: Processing split: hdfs://...:8020/user/ird2/tech_talks/ratings.csv:0+4388258
14/04/16 20:35:46 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/04/16 20:35:46 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
14/04/16 20:35:46 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
14/04/16 20:35:46 INFO mapred.MapTask: soft limit at 83886080
14/04/16 20:35:46 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
14/04/16 20:35:46 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
14/04/16 20:35:47 INFO mapreduce.Job: Job job_local1427968352_0001 running in uber mode : false
14/04/16 20:35:47 INFO mapreduce.Job: map 0% reduce 0%
14/04/16 20:35:48 INFO mapred.LocalJobRunner:
14/04/16 20:35:48 INFO mapred.MapTask: Starting flush of map output
14/04/16 20:35:48 INFO mapred.MapTask: Spilling map output
14/04/16 20:35:48 INFO mapred.MapTask: bufstart = 0; bufend = 6485388; bufvoid = 104857600
14/04/16 20:35:48 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 24860980(99443920); length = 1353417/6553600
14/04/16 20:35:49 INFO mapred.MapTask: Finished spill 0
14/04/16 20:35:49 INFO mapred.Task: Task:attempt_local1427968352_0001_m_000000_0 is done. And is in the process of committing
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map
14/04/16 20:35:49 INFO mapred.Task: Task 'attempt_local1427968352_0001_m_000000_0' done.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:49 INFO mapred.MapTask: Processing split: hdfs://...:8020/user/ird2/tech_talks/users.csv:0+186304
14/04/16 20:35:49 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/04/16 20:35:49 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
14/04/16 20:35:49 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
14/04/16 20:35:49 INFO mapred.MapTask: soft limit at 83886080
14/04/16 20:35:49 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
14/04/16 20:35:49 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
14/04/16 20:35:49 INFO mapred.LocalJobRunner:
14/04/16 20:35:49 INFO mapred.MapTask: Starting flush of map output
14/04/16 20:35:49 INFO mapred.MapTask: Spilling map output
14/04/16 20:35:49 INFO mapred.MapTask: bufstart = 0; bufend = 209667; bufvoid = 104857600
14/04/16 20:35:49 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26192144(104768576); length = 22253/6553600
14/04/16 20:35:49 INFO mapred.MapTask: Finished spill 0
14/04/16 20:35:49 INFO mapred.Task: Task:attempt_local1427968352_0001_m_000001_0 is done. And is in the process of committing
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map
14/04/16 20:35:49 INFO mapred.Task: Task 'attempt_local1427968352_0001_m_000001_0' done.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map task executor complete.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Waiting for reduce tasks
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_r_000000_0
14/04/16 20:35:49 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:49 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@5116331d
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, mergeThreshold=430669056, ioSortFactor=10, memToMemMergeOutputsThreshold=10
14/04/16 20:35:49 INFO reduce.EventFetcher: attempt_local1427968352_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
14/04/16 20:35:49 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1427968352_0001_m_000001_0 decomp: 220797 len: 220801 to MEMORY
14/04/16 20:35:49 INFO reduce.InMemoryMapOutput: Read 220797 bytes from map-output for attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 220797, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->220797
14/04/16 20:35:49 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1427968352_0001_m_000000_0 decomp: 7162100 len: 7162104 to MEMORY
14/04/16 20:35:49 INFO reduce.InMemoryMapOutput: Read 7162100 bytes from map-output for attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 7162100, inMemoryMapOutputs.size() -> 2, commitMemory -> 220797, usedMemory ->7382897
14/04/16 20:35:49 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
14/04/16 20:35:49 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
14/04/16 20:35:49 INFO mapred.Merger: Merging 2 sorted segments
14/04/16 20:35:49 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 7382885 bytes
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merged 2 segments, 7382897 bytes to disk to satisfy reduce memory limit
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merging 1 files, 7382899 bytes from disk
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
14/04/16 20:35:50 INFO mapred.Merger: Merging 1 sorted segments
14/04/16 20:35:50 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 7382889 bytes
14/04/16 20:35:50 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:50 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
14/04/16 20:35:50 INFO mapreduce.Job: map 100% reduce 0%
14/04/16 20:35:51 INFO mapred.Task: Task:attempt_local1427968352_0001_r_000000_0 is done. And is in the process of committing
14/04/16 20:35:51 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:51 INFO mapred.Task: Task attempt_local1427968352_0001_r_000000_0 is allowed to commit now
14/04/16 20:35:51 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1427968352_0001_r_000000_0' to hdfs://...:8020/user/ird2/tech_talks/output/ReduceSideJoinDriver/_temporary/0/task_local1427968352_0001_r_000000
14/04/16 20:35:51 INFO mapred.LocalJobRunner: reduce > reduce
14/04/16 20:35:51 INFO mapred.Task: Task 'attempt_local1427968352_0001_r_000000_0' done.
14/04/16 20:35:51 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_r_000000_0
14/04/16 20:35:51 INFO mapred.LocalJobRunner: reduce task executor complete.
14/04/16 20:35:52 INFO mapreduce.Job: map 100% reduce 100%
14/04/16 20:35:52 INFO mapreduce.Job: Job job_local1427968352_0001 completed successfully
14/04/16 20:35:52 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=14767932
FILE: Number of bytes written=29952985
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=13537382
HDFS: Number of bytes written=2949787
HDFS: Number of read operations=28
HDFS: Number of large read operations=0
HDFS: Number of write operations=5
Map-Reduce Framework
Map input records=343919
Map output records=343919
Map output bytes=6695055
Map output materialized bytes=7382905
Input split bytes=272
Combine input records=0
Combine output records=0
Reduce input groups=5564
Reduce shuffle bytes=7382905
Reduce input records=343919
Reduce output records=5564
Spilled Records=687838
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=92
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=1416101888
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=4574562
File Output Format Counters
Bytes Written=2949787
Driver code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ReduceSideJoinDriver extends Configured implements Tool
{
    @Override
    public int run(String[] args) throws Exception
    {
        // Expects three arguments: users file, ratings file, output directory.
        if (args.length != 3)
        {
            System.err.printf("Usage: %s [generic options] <users> <ratings> <output>\n", getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        Path usersFile = new Path(args[0]);
        Path ratingsFile = new Path(args[1]);
        Job job = Job.getInstance(getConf(), "Aravind - Reduce Side Join");
        // Store a tag under each input file's name in the job configuration.
        job.getConfiguration().setStrings(usersFile.getName(), "user");
        job.getConfiguration().setStrings(ratingsFile.getName(), "rating");
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(TagAndRecord.class);
        TextInputFormat.addInputPath(job, usersFile);
        TextInputFormat.addInputPath(job, ratingsFile);
        TextOutputFormat.setOutputPath(job, new Path(args[2]));
        job.setMapperClass(ReduceSideJoinMapper.class);
        job.setReducerClass(ReduceSideJoinReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception
    {
        int exitCode = ToolRunner.run(new Configuration(), new ReduceSideJoinDriver(), args);
        System.exit(exitCode);
    }
}
Make sure valid copies of the following configuration files are on the Hadoop classpath. By default they are read from the directory /etc/hadoop/conf. This should be done as part of setting up the Hadoop client node.
mapred-site.xml
yarn-site.xml
core-site.xml
If these configuration files are empty, you have to populate them with the right properties. This can be done in two ways:
In Cloudera Manager, click on the YARN service. The Actions menu offers a Deploy Client Configuration option alongside Start, Stop, etc. Use that option to deploy the client configuration.
Sometimes that option may not work, for example if the node is not managed by CM and no YARN gateway is configured on it. In that case use the Download Client Configuration option instead. Extract the downloaded zip file (it contains the files listed above) and copy those files to /etc/hadoop/conf manually.
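To check whether a client configuration file is actually populated, something like the following sketch may help. It uses only the JDK (no Hadoop classes) to read a Hadoop-style site file and report mapreduce.framework.name; when that property is missing, Hadoop falls back to its default of local, which would match the LocalJobRunner lines in the log above. The class name ClientConfCheck and its helper methods are made up for illustration, and the default path is simply the /etc/hadoop/conf location mentioned above.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Hadoop: inspects a *-site.xml file by hand.
public class ClientConfCheck {

    /** Parses <property><name>..</name><value>..</value></property> entries into a map. */
    static Map<String, String> readProperties(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        Map<String, String> props = new HashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            NodeList names = property.getElementsByTagName("name");
            NodeList values = property.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0) {
                props.put(names.item(0).getTextContent().trim(),
                          values.item(0).getTextContent().trim());
            }
        }
        return props;
    }

    /** Returns the configured framework, or "local" (Hadoop's default) when it is unset. */
    static String frameworkName(Map<String, String> props) {
        return props.getOrDefault("mapreduce.framework.name", "local");
    }

    public static void main(String[] args) throws Exception {
        // Assumed default location; pass another path as the first argument.
        String path = args.length > 0 ? args[0] : "/etc/hadoop/conf/mapred-site.xml";
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            System.out.println(path + " -> mapreduce.framework.name = "
                    + frameworkName(readProperties(in)));
        }
    }
}
```

If this reports local on the node you submit from, the job will run in the LocalJobRunner rather than on YARN. Note that a zero-byte file is not valid XML, so the parser will throw on it instead of returning an empty map.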
To run the jar, either the hadoop or the yarn command can be used.
Apparently, you can only submit a Hadoop job from the node designated as the gateway node. Everything worked once I submitted the job from the gateway node.