Increasing io.sort.mb - hadoop

I would really appreciate help finding out what is wrong with my configuration.
I wanted to increase the value of io.sort.mb, so I added the property below to core-site.xml.
<property>
<name>io.sort.mb</name>
<value>350m</value>
</property>
The runtime output below clearly shows that the value of io.sort.mb did not change; the default, io.sort.mb = 100, is still in effect.
13/08/15 16:43:34 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1e5e96c1
13/08/15 16:43:34 INFO mapred.MapTask: numReduceTasks: 1
13/08/15 16:43:34 INFO mapred.MapTask: **io.sort.mb = 100**
13/08/15 16:43:34 INFO mapred.MapTask: data buffer = 79691776/99614720
13/08/15 16:43:34 INFO mapred.MapTask: record buffer = 262144/327680
13/08/15 16:43:34 INFO mapred.MapTask: Starting flush of map output
13/08/15 16:43:34 INFO mapred.MapTask: Finished spill 0
13/08/15 16:43:34 INFO mapred.Task: Task:attempt_local_0001_m_004609_0 is done. And is in the process of commiting
Since that did not work, I added the property to mapred-site.xml instead, but I got the same outcome as above.
Can anyone suggest what I should do?
Thanks in advance.
Haq

According to the article here, io.sort.mb should be 10 * io.sort.factor, provided you have enough RAM.
"core-site.xml"
<property>
<name>io.sort.factor</name>
<value>100</value>
<description>More streams merged at once while sorting files.</description>
</property>
<property>
<name>io.sort.mb</name>
<value>200</value>
<description>Higher memory-limit while sorting data.</description>
</property>
Try changing the sort factor on all nodes as well.

This setting should be in mapred-site.xml instead of core-site.xml.
Refer to: http://hadoop.apache.org/docs/r1.0.4/mapred-default.html
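As a minimal sketch of that suggestion, the property could be declared in mapred-site.xml roughly like this. It assumes a Hadoop 1.x setup matching the r1.0.4 defaults linked above. As far as I know the value is read as a plain integer number of megabytes, so it is written as 350 rather than 350m; the figure itself is only carried over from the question, and the task JVM heap is assumed to be large enough to hold a buffer of that size.

```xml
<!-- mapred-site.xml (sketch; 350 is the value the asker wanted, not a recommendation) -->
<configuration>
  <property>
    <name>io.sort.mb</name>
    <!-- integer number of megabytes, no unit suffix -->
    <value>350</value>
    <description>Buffer memory used while sorting map output.</description>
  </property>
</configuration>
```

On newer releases the task log prints this setting as mapreduce.task.io.sort.mb, so the property name may need to be adjusted to match the Hadoop version in use.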

Related

Mapreduce Job running in local mode instead of cluster

The configuration is set up to run MapReduce jobs in cluster mode on top of YARN, but the job is running in local mode.
I can't figure out what the issue is.
Below is yarn-site.xml (on the master node):
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>namenode:8031</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name> <!-- node manager service -->
<value>mapreduce_shuffle</value> <!-- specifies how the mapper and reducer work -->
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>namenode:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>namenode:8032</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>namenode</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2042</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
yarn-site.xml (at slave node)
<property>
<name>yarn.nodemanager.aux-services</name> <!-- node manager service -->
<value>mapreduce_shuffle</value> <!-- specifies how the mapper and reducer work -->
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>namenode:8031</value> <!-- address of the resource tracker -->
</property>
mapred-site.xml (at master node and slave node)
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
</property>
On submission, the job output looks like this:
18/12/06 16:20:43 INFO input.FileInputFormat: Total input paths to process : 1
18/12/06 16:20:43 INFO mapreduce.JobSubmitter: number of splits:2
18/12/06 16:20:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1556004420_0001
18/12/06 16:20:43 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
18/12/06 16:20:43 INFO mapreduce.Job: Running job: job_local1556004420_0001
18/12/06 16:20:43 INFO mapred.LocalJobRunner: OutputCommitter set in config null
18/12/06 16:20:43 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
18/12/06 16:20:43 INFO mapred.LocalJobRunner: Waiting for map tasks
18/12/06 16:20:43 INFO mapred.LocalJobRunner: Starting task: attempt_local1556004420_0001_m_000000_0
18/12/06 16:20:43 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
18/12/06 16:20:43 INFO mapred.MapTask: Processing split: hdfs://namenode:9001/all-the-news/articles1.csv:0+134217728
18/12/06 16:20:43 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
18/12/06 16:20:43 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
18/12/06 16:20:43 INFO mapred.MapTask: soft limit at 83886080
18/12/06 16:20:43 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
18/12/06 16:20:43 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
18/12/06 16:20:43 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
18/12/06 16:20:44 INFO mapreduce.Job: Job job_local1556004420_0001 running in uber mode : false
18/12/06 16:20:44 INFO mapreduce.Job: map 0% reduce 0%
18/12/06 16:20:49 INFO mapred.LocalJobRunner: map > map
18/12/06 16:20:50 INFO mapreduce.Job: map 1% reduce 0%
18/12/06 16:20:52 INFO mapred.LocalJobRunner: map > map
18/12/06 16:20:55 INFO mapred.LocalJobRunner: map > map
18/12/06 16:20:56 INFO mapreduce.Job: map 2% reduce 0%
18/12/06 16:20:58 INFO mapred.LocalJobRunner: map > map
18/12/06 16:21:01 INFO mapred.LocalJobRunner: map > map
18/12/06 16:21:02 INFO mapreduce.Job: map 3% reduce 0%
18/12/06 16:21:04 INFO mapred.LocalJobRunner: map > map
Why is it running in local mode?
I am running this job on a 200 MB file with 3 nodes: 2 datanodes and 1 namenode.
The /etc/hosts file is shown below:
127.0.0.1 localhost
127.0.1.1 anil-Lenovo-Product
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
192.168.8.98 namenode
192.168.8.99 datanode
192.168.8.100 datanode2
First, check whether these configurations are in effect:
http://{your-resource-manager-host}:8088/conf by default, or
your configured UI address: http://namenode:8088/conf
Then make sure these properties are configured.
In mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
In yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
Restart the YARN service and check whether it works.
Jobs are submitted through the ClientProtocol interface, and one of its two implementations is created when the service starts:
LocalClientProtocolProvider: job IDs are prefixed with job_local
YarnClientProtocolProvider: job IDs are prefixed with job_
The choice depends on the MRConfig.FRAMEWORK_NAME configuration (whose value is "mapreduce.framework.name"); its valid options are classic, yarn, and local.
Good luck!

Hive and Hadoop Running Only Locally

I have configured a 3-node Hadoop cluster and am trying to use Hive on top of it. Hive always seems to run only in local mode. I heard that Hive takes its cluster settings from Hadoop, so I ran a job in Hadoop, and it seems to be running in local mode as well. I have installed Hive on all three nodes. I'm attaching the logs and configuration files. Please ask if you need any further details.
Hive Log:
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_local49819314_0002
INFO : The url to track the job: http://localhost:8080/
INFO : Job running in-process (local Hadoop)
INFO : 2016-01-27 23:56:30,389 Stage-1 map = 100%, reduce = 100%
INFO : Ended Job = job_local49819314_0002
Hadoop Word Count Log:
16/01/27 23:46:20 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/01/27 23:46:20 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/01/27 23:46:20 INFO input.FileInputFormat: Total input paths to process : 1
16/01/27 23:46:20 INFO mapreduce.JobSubmitter: number of splits:1
16/01/27 23:46:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local494116460_0001
16/01/27 23:46:20 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/01/27 23:46:20 INFO mapreduce.Job: Running job: job_local494116460_0001
16/01/27 23:46:20 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/01/27 23:46:20 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/01/27 23:46:20 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/01/27 23:46:20 INFO mapred.LocalJobRunner: Waiting for map tasks
16/01/27 23:46:20 INFO mapred.LocalJobRunner: Starting task: attempt_local494116460_0001_m_000000_0
16/01/27 23:46:20 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/01/27 23:46:20 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/01/27 23:46:20 INFO mapred.MapTask: Processing split: hdfs://master:9000/exercise3:0+18834811
16/01/27 23:46:20 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/01/27 23:46:20 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/01/27 23:46:20 INFO mapred.MapTask: soft limit at 83886080
16/01/27 23:46:20 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/01/27 23:46:20 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/01/27 23:46:20 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/01/27 23:46:21 INFO mapreduce.Job: Job job_local494116460_0001 running in uber mode : false
16/01/27 23:46:21 INFO mapreduce.Job: map 0% reduce 0%
16/01/27 23:46:26 INFO mapred.LocalJobRunner: map > map
16/01/27 23:46:27 INFO mapreduce.Job: map 13% reduce 0%
16/01/27 23:46:29 INFO mapred.LocalJobRunner: map > map
16/01/27 23:46:30 INFO mapreduce.Job: map 19% reduce 0%
16/01/27 23:46:32 INFO mapred.LocalJobRunner: map > map
16/01/27 23:46:33 INFO mapreduce.Job: map 29% reduce 0%
16/01/27 23:46:35 INFO mapred.LocalJobRunner: map > map
16/01/27 23:46:36 INFO mapreduce.Job: map 36% reduce 0%
16/01/27 23:46:38 INFO mapred.LocalJobRunner: map > map
16/01/27 23:46:39 INFO mapreduce.Job: map 45% reduce 0%
16/01/27 23:46:41 INFO mapred.LocalJobRunner: map > map
16/01/27 23:46:42 INFO mapreduce.Job: map 54% reduce 0%
16/01/27 23:46:44 INFO mapred.LocalJobRunner: map > map
16/01/27 23:46:45 INFO mapreduce.Job: map 62% reduce 0%
16/01/27 23:46:46 INFO mapred.LocalJobRunner: map > map
16/01/27 23:46:46 INFO mapred.MapTask: Starting flush of map output
16/01/27 23:46:46 INFO mapred.MapTask: Spilling map output
16/01/27 23:46:46 INFO mapred.MapTask: bufstart = 0; bufend = 21289849; bufvoid = 104857600
16/01/27 23:46:46 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 23806260(95225040); length = 2408137/6553600
16/01/27 23:46:47 INFO mapred.MapTask: Finished spill 0
16/01/27 23:46:47 INFO mapred.Task: Task:attempt_local494116460_0001_m_000000_0 is done. And is in the process of committing
16/01/27 23:46:47 INFO mapred.LocalJobRunner: map
16/01/27 23:46:47 INFO mapred.Task: Task 'attempt_local494116460_0001_m_000000_0' done.
16/01/27 23:46:47 INFO mapred.LocalJobRunner: Finishing task: attempt_local494116460_0001_m_000000_0
16/01/27 23:46:47 INFO mapred.LocalJobRunner: map task executor complete.
16/01/27 23:46:47 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/01/27 23:46:47 INFO mapred.LocalJobRunner: Starting task: attempt_local494116460_0001_r_000000_0
16/01/27 23:46:47 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/01/27 23:46:47 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/01/27 23:46:47 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@15602819
16/01/27 23:46:47 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=333971456, maxSingleShuffleLimit=83492864, mergeThreshold=220421168, ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/01/27 23:46:47 INFO reduce.EventFetcher: attempt_local494116460_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
16/01/27 23:46:47 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local494116460_0001_m_000000_0 decomp: 13082052 len: 13082056 to MEMORY
16/01/27 23:46:47 INFO reduce.InMemoryMapOutput: Read 13082052 bytes from map-output for attempt_local494116460_0001_m_000000_0
16/01/27 23:46:47 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 13082052, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->13082052
16/01/27 23:46:47 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
16/01/27 23:46:47 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/01/27 23:46:47 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
16/01/27 23:46:47 INFO mapred.Merger: Merging 1 sorted segments
16/01/27 23:46:47 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 13082040 bytes
16/01/27 23:46:47 INFO reduce.MergeManagerImpl: Merged 1 segments, 13082052 bytes to disk to satisfy reduce memory limit
16/01/27 23:46:47 INFO reduce.MergeManagerImpl: Merging 1 files, 13082056 bytes from disk
16/01/27 23:46:47 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
16/01/27 23:46:47 INFO mapred.Merger: Merging 1 sorted segments
16/01/27 23:46:47 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 13082040 bytes
16/01/27 23:46:47 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/01/27 23:46:47 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
16/01/27 23:46:47 INFO mapreduce.Job: map 100% reduce 0%
16/01/27 23:46:53 INFO mapred.LocalJobRunner: reduce > reduce
16/01/27 23:46:53 INFO mapreduce.Job: map 100% reduce 85%
16/01/27 23:46:56 INFO mapred.LocalJobRunner: reduce > reduce
16/01/27 23:46:56 INFO mapreduce.Job: map 100% reduce 89%
16/01/27 23:46:59 INFO mapred.LocalJobRunner: reduce > reduce
16/01/27 23:46:59 INFO mapreduce.Job: map 100% reduce 92%
16/01/27 23:47:02 INFO mapred.LocalJobRunner: reduce > reduce
16/01/27 23:47:02 INFO mapreduce.Job: map 100% reduce 96%
16/01/27 23:47:05 INFO mapred.LocalJobRunner: reduce > reduce
16/01/27 23:47:05 INFO mapreduce.Job: map 100% reduce 99%
16/01/27 23:47:08 INFO mapred.LocalJobRunner: reduce > reduce
16/01/27 23:47:08 INFO mapreduce.Job: map 100% reduce 100%
16/01/27 23:47:11 INFO mapred.LocalJobRunner: reduce > reduce
16/01/27 23:47:18 INFO mapred.Task: Task:attempt_local494116460_0001_r_000000_0 is done. And is in the process of committing
16/01/27 23:47:18 INFO mapred.LocalJobRunner: reduce > reduce
16/01/27 23:47:18 INFO mapred.Task: Task attempt_local494116460_0001_r_000000_0 is allowed to commit now
16/01/27 23:47:18 INFO output.FileOutputCommitter: Saved output of task 'attempt_local494116460_0001_r_000000_0' to hdfs://master:9000/output/_temporary/0/task_local494116460_0001_r_000000
16/01/27 23:47:18 INFO mapred.LocalJobRunner: reduce > reduce
16/01/27 23:47:18 INFO mapred.Task: Task 'attempt_local494116460_0001_r_000000_0' done.
16/01/27 23:47:18 INFO mapred.LocalJobRunner: Finishing task: attempt_local494116460_0001_r_000000_0
16/01/27 23:47:18 INFO mapred.LocalJobRunner: reduce task executor complete.
16/01/27 23:47:18 INFO mapreduce.Job: Job job_local494116460_0001 completed successfully
16/01/27 23:47:18 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=26711328
FILE: Number of bytes written=40348644
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=37669622
HDFS: Number of bytes written=12758437
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Map-Reduce Framework
Map input records=65535
Map output records=602035
Map output bytes=21289849
Map output materialized bytes=13082056
Input split bytes=93
Combine input records=602035
Combine output records=58349
Reduce input groups=58349
Reduce shuffle bytes=13082056
Reduce input records=58349
Reduce output records=58349
Spilled Records=116698
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=123
Total committed heap usage (bytes)=848297984
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=18834811
File Output Format Counters
Bytes Written=12758437
Configuration Files
The same configuration is present in all the systems
mapred-site.xml
<configuration>
<property>
<name>mapreduce.job.tracker</name>
<value>master:5431</value>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/home/huser/hadoop-2.7.1/hadoop_tmp/history/intermediate</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/home/huser/hadoop-2.7.1/hadoop_tmp/history/done</value>
</property>
<property>
<name>mapred.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>master:54311</value>
</property>
<property>
<name>mapreduce.jobtracker.http.address</name>
<value>master:50030</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/huser/hadoop-2.7.1/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/huser/hadoop-2.7.1/hadoop_tmp/hdfs/datanode</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
<property>
<name>hadoop.proxyuser.hue.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hue.groups</name>
<value>*</value>
</property>
</configuration>
Bashrc
export JAVA_HOME=/opt/jdk/jdk1.8.0_66
export PATH=$PATH:$JAVA_HOME
# -- HADOOP ENVIRONMENT VARIABLES START -- #
export HADOOP_HOME=/home/huser/hadoop-2.7.1/
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_CONF=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
# -- HADOOP ENVIRONMENT VARIABLES END -- #
# -- Hive Variables Start --#
export HIVE_HOME=/home/huser/apache-hive-1.2.1-bin
export HIVE_CONF=$HIVE_HOME/conf
export PATH=$HIVE_HOME/bin:$PATH
export PATH=$HIVE_HOME/lib:$PATH
export ANT_LIB=/home/huser/apache-ant-1.9.6/lib
# -- Hive Variables End -- #
hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://master/metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value></value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://master:9083</value>
</property>
<property>
<name>mapreduce.job.tracker</name>
<value>master:5431</value>
</property>
</configuration>
Hive Properties
Set hive.exec.mode.local.auto;
+----------------------------------+--+
| set |
+----------------------------------+--+
| hive.exec.mode.local.auto=false |
+----------------------------------+--+
set mapred.job.tracker;
+----------------------------------+--+
| set |
+----------------------------------+--+
| mapred.job.tracker=master:54311 |
+----------------------------------+--+
Running start-dfs.sh and start-yarn.sh starts all the datanodes, the namenode, the resourcemanager, etc. The cluster is working fine; the problem seems to be only with the jobs. I can see that all the datanodes are available at http://master:50070, and there is a job history page at http://master:8088.
If you need any further logs or config files, let me know.
Thanks.
The problem seems to have been in mapred-site.xml.
This is the new file:
<configuration>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/home/huser/hadoop-2.7.1/hadoop_tmp/history/intermediate</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/home/huser/hadoop-2.7.1/hadoop_tmp/history/done</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>master:54311</value>
</property>
<property>
<name>mapreduce.jobtracker.http.address</name>
<value>master:50030</value>
</property>
</configuration>
mapreduce.job.tracker does not seem to be a valid property, so I removed it. The new file also uses mapreduce.framework.name where the old one had mapred.framework.name.
I have also changed
HADOOP_HOME=/home/huser/hadoop-2.7.1/ to
HADOOP_HOME=/home/huser/hadoop-2.7.1, removing the trailing forward slash (/).
I have also changed hive-site.xml to the following:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://master/metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>1</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://master:9083</value>
</property>
</configuration>

Mapreduce throwing OutOfMemoryError for large input file

Hi, I have a MapReduce jar that runs perfectly fine on small input files. By small I mean sample input files that I created with fewer than 10 lines of input. But when I try to run the job on a 1.8 GB input file, I get an OutOfMemoryError. I'm not sure what I'm supposed to do.
Is there any way to limit the number of tasks being spawned, and have fewer tasks run for longer durations?
Around 20 tasks are spawned on the large input file before I get this error. Here's the part of the log that is generated for the first two tasks.
13/12/13 12:00:22 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
13/12/13 12:00:22 INFO mapreduce.Job: Running job: job_local1170901099_0001
13/12/13 12:00:22 INFO mapred.LocalJobRunner: OutputCommitter set in config null
13/12/13 12:00:22 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
13/12/13 12:00:22 INFO mapred.LocalJobRunner: Waiting for map tasks
13/12/13 12:00:22 INFO mapred.LocalJobRunner: Starting task: attempt_local1170901099_0001_m_000000_0
13/12/13 12:00:22 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
13/12/13 12:00:22 INFO mapred.Task: Using ResourceCalculatorProcessTree : null
13/12/13 12:00:22 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/chaitanya.nadig/friendship.txt:0+134217728
13/12/13 12:00:22 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/12/13 12:00:23 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
13/12/13 12:00:23 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
13/12/13 12:00:23 INFO mapred.MapTask: soft limit at 83886080
13/12/13 12:00:23 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
13/12/13 12:00:23 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
13/12/13 12:00:23 INFO mapreduce.Job: Job job_local1170901099_0001 running in uber mode : false
13/12/13 12:00:23 INFO mapreduce.Job: map 0% reduce 0%
13/12/13 12:00:24 INFO mapred.MapTask: Starting flush of map output
13/12/13 12:00:24 INFO mapred.LocalJobRunner: Starting task: attempt_local1170901099_0001_m_000001_0
13/12/13 12:00:24 INFO util.ProcfsBasedProcessTree: ProcfsBasedProcessTree currently is supported only on Linux.
13/12/13 12:00:24 INFO mapred.Task: Using ResourceCalculatorProcessTree : null
13/12/13 12:00:24 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/chaitanya.nadig/friendship.txt:134217728+134217728
13/12/13 12:00:24 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
13/12/13 12:00:24 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
13/12/13 12:00:24 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
13/12/13 12:00:24 INFO mapred.MapTask: soft limit at 83886080
13/12/13 12:00:24 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
13/12/13 12:00:24 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
13/12/13 12:00:25 INFO mapred.MapTask: Starting flush of map output
This is the tail of the log which is generated when the error occurs.
13/12/13 12:00:43 INFO mapred.MapTask: Starting flush of map output
13/12/13 12:00:43 INFO mapred.Task: Task:attempt_local1170901099_0001_m_000020_0 is done. And is in the process of committing
13/12/13 12:00:43 INFO mapred.LocalJobRunner: map
13/12/13 12:00:43 INFO mapred.Task: Task 'attempt_local1170901099_0001_m_000020_0' done.
13/12/13 12:00:43 INFO mapred.LocalJobRunner: Finishing task: attempt_local1170901099_0001_m_000020_0
13/12/13 12:00:43 INFO mapred.LocalJobRunner: Map task executor complete.
13/12/13 12:00:43 WARN mapred.LocalJobRunner: job_local1170901099_0001
java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:403)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at org.apache.hadoop.io.Text.setCapacity(Text.java:266)
at org.apache.hadoop.io.Text.append(Text.java:236)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:238)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
13/12/13 12:00:44 INFO mapreduce.Job: map 100% reduce 0%
13/12/13 12:00:44 INFO mapreduce.Job: Job job_local1170901099_0001 failed with state FAILED due to: NA
13/12/13 12:00:44 INFO mapreduce.Job: Counters: 22
File System Counters
FILE: Number of bytes read=27635962
FILE: Number of bytes written=28018656
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=5338170260
HDFS: Number of bytes written=0
HDFS: Number of read operations=25
HDFS: Number of large read operations=0
HDFS: Number of write operations=1
Map-Reduce Framework
Map input records=0
Map output records=0
Map output bytes=0
Map output materialized bytes=6
Input split bytes=122
Combine input records=0
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=5
Total committed heap usage (bytes)=530186240
File Input Format Counters
Bytes Read=118909386
This answer is late, but I'm posting it in case it helps someone else. The problem was that the file I was trying to process was corrupted. I got a different copy of the file, ran my MR job on it, and everything worked fine.
My first impulse would be to ask what your startup parameters are. Typically, when you run MapReduce and hit an out-of-memory error, you would use something like the following as your startup parameters:
-Dmapred.map.child.java.opts=-Xmx1G -Dmapred.reduce.child.java.opts=-Xmx1G
The key here is that these two amounts are cumulative. So the amounts you specify, added together, should not come close to exceeding the memory available on your system after you start MapReduce.
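As far as I know, the same heap settings can also be placed in mapred-site.xml instead of being passed on the command line. A sketch using the property names from the flags above; the 1G values are simply the ones quoted in this answer, not a recommendation, and putting them in mapred-site.xml is an assumption about where these mapred.* settings belong:

```xml
<!-- mapred-site.xml (sketch; heap sizes copied from the -D flags above) -->
<configuration>
  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx1G</value>
  </property>
  <property>
    <name>mapred.reduce.child.java.opts</name>
    <value>-Xmx1G</value>
  </property>
</configuration>
```

Passing them with -D keeps the change local to one job, while the file changes the default for every job submitted with that configuration.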
It might be late, but I solved this by setting the following parameter to 0.2:
mapred.job.shuffle.input.buffer.percent
This tells the reducer JVM to use only 20% of the heap (0.2) for the shuffle buffer, rather than the default 70% (0.7). You are getting the "Out of heap space" error because the shuffle is asking the JVM for memory that is not available to it; rather than spilling, it just throws the exception. If you ask for only 20%, chances are you will get the memory. Also, once you exceed the allotted memory, the spilling logic kicks in.
Of course, the downside is the slowness.
You can also calculate the amount of available memory at run time and then reset the buffer.
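For reference, a sketch of how that setting might look as a config entry. The property name and the 0.2 value are taken from the answer above; placing it in mapred-site.xml is an assumption based on where the other mapred.* properties live.

```xml
<!-- mapred-site.xml (sketch; 0.2 is the value suggested in the answer above) -->
<configuration>
  <property>
    <name>mapred.job.shuffle.input.buffer.percent</name>
    <!-- fraction of the reducer heap used for shuffle buffers: 0.2 means 20% -->
    <value>0.2</value>
  </property>
</configuration>
```

The value is a fraction between 0 and 1, not a percentage, so 0.2 here means one fifth of the heap.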

Debugging a Tutorial Hadoop Pipes-Project

I am working through this tutorial
and got to the very last part (with some small changes).
Now I am stuck with an error message I can't make sense of.
damian@damian-ThinkPad-T61:~/hadoop-1.1.2$ bin/hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input dft1 -output dft1-out -program bin/word_count
13/06/09 20:17:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/09 20:17:01 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/06/09 20:17:01 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/09 20:17:01 INFO mapred.FileInputFormat: Total input paths to process : 1
13/06/09 20:17:02 INFO filecache.TrackerDistributedCacheManager: Creating word_count in /tmp/hadoop-damian/mapred/local/archive/7642618178782392982_1522484642_696507214/filebin-work-1867423021697266227 with rwxr-xr-x
13/06/09 20:17:02 INFO filecache.TrackerDistributedCacheManager: Cached bin/word_count as /tmp/hadoop-damian/mapred/local/archive/7642618178782392982_1522484642_696507214/filebin/word_count
13/06/09 20:17:02 INFO filecache.TrackerDistributedCacheManager: Cached bin/word_count as /tmp/hadoop-damian/mapred/local/archive/7642618178782392982_1522484642_696507214/filebin/word_count
13/06/09 20:17:02 INFO mapred.JobClient: Running job: job_local_0001
13/06/09 20:17:02 INFO util.ProcessTree: setsid exited with exit code 0
13/06/09 20:17:02 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4200d3
13/06/09 20:17:02 INFO mapred.MapTask: numReduceTasks: 1
13/06/09 20:17:02 INFO mapred.MapTask: io.sort.mb = 100
13/06/09 20:17:02 INFO mapred.MapTask: data buffer = 79691776/99614720
13/06/09 20:17:02 INFO mapred.MapTask: record buffer = 262144/327680
13/06/09 20:17:02 WARN mapred.LocalJobRunner: job_local_0001
java.lang.NullPointerException
at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:103)
at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:68)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
13/06/09 20:17:03 INFO mapred.JobClient: map 0% reduce 0%
13/06/09 20:17:03 INFO mapred.JobClient: Job complete: job_local_0001
13/06/09 20:17:03 INFO mapred.JobClient: Counters: 0
13/06/09 20:17:03 INFO mapred.JobClient: Job Failed: NA
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
Does anyone see where the error is hiding? What is a straightforward way to debug Hadoop Pipes programs?
Thanks!
The exception at
at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:103)
is caused by the following lines in the source:
//Add token to the environment if security is enabled
Token<JobTokenIdentifier> jobToken = TokenCache.getJobToken(conf
.getCredentials());
// This password is used as shared secret key between this application and
// child pipes process
byte[] password = jobToken.getPassword();
The actual NPE is thrown on the final line, because jobToken is null.
As you're using local mode (local job tracker and local file system), I'm not sure that security should be 'enabled'. Do you have either of the following properties configured in your core-site.xml or hdfs-site.xml configuration files (and if so, what are their values)?
hadoop.security.authentication
hadoop.security.authorization
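For reference, here is a sketch of how these two properties would look in core-site.xml with security switched off. The values shown (simple and false) are, as far as I know, the stock defaults, so if the properties are absent from your files those values should already apply.

```xml
<!-- core-site.xml: security disabled (these are the default values) -->
<property>
  <name>hadoop.security.authentication</name>
  <value>simple</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>false</value>
</property>
```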
Possibly because your cluster is running in local mode. Do you have the following property in your mapred-site.xml file?
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>
Let the MapReduce jobs run with the yarn framework.
</description>
</property>
If you don't have this property, your cluster will run in local mode by default. I used to have exactly the same problem in local mode. After I added this property, the cluster ran in distributed mode and the problem went away.
HTH,
Shumin
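Note that mapreduce.framework.name is a YARN-era (Hadoop 2.x) property, while the question uses hadoop-1.1.2. On Hadoop 1.x the job runner is selected by mapred.job.tracker, whose default value is local; that matches the job_local_0001 and LocalJobRunner lines in the log above. A minimal sketch for mapred-site.xml follows. The host and port are placeholders I made up, so substitute wherever your JobTracker actually listens.

```xml
<!-- mapred-site.xml for Hadoop 1.x; localhost:9001 is a hypothetical address -->
<property>
  <name>mapred.job.tracker</name>
  <!-- Any value other than "local" submits jobs to a JobTracker at host:port. -->
  <value>localhost:9001</value>
</property>
```

With this set and the JobTracker and TaskTracker daemons running, the job ID in the log should no longer start with job_local.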

Mahout RecommenderJob not converging

This is my first SO post so please let me know if I've missed out anything important. I am a Mahout/Hadoop beginner, and am trying to put together a distributed recommendation engine.
In order to simulate working on a remote cluster, I have set up Hadoop on my machine to communicate with an Ubuntu VM (using VirtualBox), also located on my machine, which has Hadoop installed on it. This setup seems to be working fine, and I am now trying to run Mahout's 'RecommenderJob' on a (very!) small trial dataset as a test.
The input consists of a .csv file (saved on the Hadoop DFS) containing around 50 user preferences in the format userID, itemID, preference, and the command I am running is:
hadoop jar /Users/MyName/src/trunk/core/target/mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=/user/MyName/Recommendations/input/TestRatings.csv -Dmapred.output.dir=/user/MyName/Recommendations/output -s SIMILARITY_PEARSON_CORELLATION
where TestRatings.csv is the file containing the preferences and output is the desired output directory.
At first the job looks like it's running fine, and I get the following output:
12/12/11 12:26:21 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --maxPrefsPerUser=[10], --maxPrefsPerUserInItemSimilarity=[1000], --maxSimilaritiesPerItem=[100], --minPrefsPerUser=[1], --numRecommendations=[10], --similarityClassname=[SIMILARITY_PEARSON_CORELLATION], --startPhase=[0], --tempDir=[temp]}
12/12/11 12:26:21 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[/user/Naaman/Delphi/input/TestRatings.csv], --maxPrefsPerUser=[1000], --minPrefsPerUser=[1], --output=[temp/preparePreferenceMatrix], --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}
12/12/11 12:26:21 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/12/11 12:26:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/12/11 12:26:22 INFO input.FileInputFormat: Total input paths to process : 1
12/12/11 12:26:22 WARN snappy.LoadSnappy: Snappy native library not loaded
12/12/11 12:26:22 INFO mapred.JobClient: Running job: job_local_0001
12/12/11 12:26:22 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/12/11 12:26:22 INFO mapred.MapTask: io.sort.mb = 100
12/12/11 12:26:22 INFO mapred.MapTask: data buffer = 79691776/99614720
12/12/11 12:26:22 INFO mapred.MapTask: record buffer = 262144/327680
12/12/11 12:26:22 INFO mapred.MapTask: Starting flush of map output
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new compressor
12/12/11 12:26:22 INFO mapred.MapTask: Finished spill 0
12/12/11 12:26:22 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/12/11 12:26:22 INFO mapred.LocalJobRunner:
12/12/11 12:26:22 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/12/11 12:26:22 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/12/11 12:26:22 INFO mapred.ReduceTask: ShuffleRamManager: MemoryLimit=1491035776, MaxSingleShuffleLimit=372758944
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging on-disk files
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging in memory files
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread waiting: Thread for merging on-disk files
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for polling Map Completion Events
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
12/12/11 12:26:23 INFO mapred.JobClient: map 100% reduce 0%
12/12/11 12:26:28 INFO mapred.LocalJobRunner: reduce > copy >
12/12/11 12:26:31 INFO mapred.LocalJobRunner: reduce > copy >
12/12/11 12:26:37 INFO mapred.LocalJobRunner: reduce > copy >
But then the last three lines repeat indefinitely (I left it overnight...), with the two lines:
12/12/11 12:27:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress
12/12/11 12:27:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
repeating every twelve rows.
I'm not sure whether there's something wrong with my input, or whether the tiny size of the trial data is messing things up. Any help and/or advice on the best way to go about this would be much appreciated.
p.s. I was trying to follow the instructions from https://www.box.com/s/041rdjeh7sny128r2uki
This is really a Hadoop or cluster issue. The reducer is waiting on mapper output that never arrives. Look for earlier failures in the map phase.

Resources