MapReduce program run from Eclipse keeps spilling map output - hadoop

I have written a MapReduce program. At first it ran fine, but after I changed something my computer suddenly reported that it was out of memory. I then realized that the job I had run was using a lot of memory, and I don't know why. After I deleted the spill files, the program no longer runs correctly: it spills continuously, and I don't remember which code I changed. Here are my mapper, reducer, driver and the console messages:
Mapper:
package SalesProduct;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class SalesCategoryMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private DoubleWritable one = new DoubleWritable(1);
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // TODO Auto-generated method stub
        String valueString = value.toString();
        StringTokenizer tokenizerArticle = new StringTokenizer(valueString,"\n");
        System.out.println("Here: In map \n");
        while (tokenizerArticle.hasMoreTokens()){
            //StringTokenizer tokenizer = new StringTokenizer(tokenizerArticle.nextToken());
            String[] items = valueString.split("\t");
            String itemName = items[3];
            double itemPrice = Double.parseDouble(items[4]);
            context.write(new Text(itemName), new DoubleWritable(itemPrice));
            //context.write(new Text(itemName), one);
        }
    }
}
Reducer:
package SalesProduct;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
public class SalesItemCategoryReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private DoubleWritable result = new DoubleWritable();
    public void reduce(Text t_key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        Text key = t_key;
        double sum = 0;
        for(DoubleWritable val : values){
            sum = sum + val.get();
        }
        /*
        while(values.hasNext()){
            DoubleWritable tmp = values.next();
            sum = sum + tmp.get();
        }*/
        //result.set(sum);
        context.write(key, result);
    }
}
Driver:
package SalesResult;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SalesItemDriver {
    public static void main(String[] args) throws ClassNotFoundException, IOException, InterruptedException{
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf,"SalesItemDriver");
        job.setJarByClass(SalesItemDriver.class);
        // get category
        job.setMapperClass(SalesProduct.SalesCategoryMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);
        job.setCombinerClass(SalesProduct.SalesItemCategoryReducer.class);
        job.setReducerClass(SalesProduct.SalesItemCategoryReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        // set the input split size
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);
        CombineTextInputFormat.setMinInputSplitSize(job, 2097152);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        //FileOutputFormat.setOutputPath(job, new Path(args[1]));
        //Path path = new Path(args[1]);
        Path path = new Path(args[1]);
        FileSystem fs = FileSystem.get(conf);
        if(fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);
        job.waitForCompletion(true);
    }
}
2017-04-21 22:04:50,780 WARN [org.apache.hadoop.util.NativeCodeLoader] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-21 22:04:51,843 INFO [org.apache.hadoop.conf.Configuration.deprecation] - session.id is deprecated. Instead, use dfs.metrics.session-id
2017-04-21 22:04:51,844 INFO [org.apache.hadoop.metrics.jvm.JvmMetrics] - Initializing JVM Metrics with processName=JobTracker, sessionId=
2017-04-21 22:04:52,132 WARN [org.apache.hadoop.mapreduce.JobResourceUploader] - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2017-04-21 22:04:52,138 WARN [org.apache.hadoop.mapreduce.JobResourceUploader] - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2017-04-21 22:04:52,148 INFO [org.apache.hadoop.mapreduce.lib.input.FileInputFormat] - Total input files to process : 1
2017-04-21 22:04:52,256 INFO [org.apache.hadoop.mapreduce.JobSubmitter] - number of splits:2
2017-04-21 22:04:52,412 INFO [org.apache.hadoop.mapreduce.JobSubmitter] - Submitting tokens for job: job_local1001883244_0001
2017-04-21 22:04:52,646 INFO [org.apache.hadoop.mapreduce.Job] - The url to track the job: http://localhost:8080/
2017-04-21 22:04:52,647 INFO [org.apache.hadoop.mapreduce.Job] - Running job: job_local1001883244_0001
2017-04-21 22:04:52,648 INFO [org.apache.hadoop.mapred.LocalJobRunner] - OutputCommitter set in config null
2017-04-21 22:04:52,653 INFO [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] - File Output Committer Algorithm version is 1
2017-04-21 22:04:52,653 INFO [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2017-04-21 22:04:52,654 INFO [org.apache.hadoop.mapred.LocalJobRunner] - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2017-04-21 22:04:52,717 INFO [org.apache.hadoop.mapred.LocalJobRunner] - Waiting for map tasks
2017-04-21 22:04:52,718 INFO [org.apache.hadoop.mapred.LocalJobRunner] - Starting task: attempt_local1001883244_0001_m_000000_0
2017-04-21 22:04:52,742 INFO [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] - File Output Committer Algorithm version is 1
2017-04-21 22:04:52,742 INFO [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2017-04-21 22:04:52,754 INFO [org.apache.hadoop.yarn.util.ProcfsBasedProcessTree] - ProcfsBasedProcessTree currently is supported only on Linux.
2017-04-21 22:04:52,754 INFO [org.apache.hadoop.mapred.Task] - Using ResourceCalculatorProcessTree : null
2017-04-21 22:04:52,760 INFO [org.apache.hadoop.mapred.MapTask] - Processing split: hdfs://localhost:9000/input/purchases.txt:0+134217728
2017-04-21 22:04:52,837 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 0 kvi 26214396(104857584)
2017-04-21 22:04:52,837 INFO [org.apache.hadoop.mapred.MapTask] - mapreduce.task.io.sort.mb: 100
2017-04-21 22:04:52,837 INFO [org.apache.hadoop.mapred.MapTask] - soft limit at 83886080
2017-04-21 22:04:52,837 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 0; bufvoid = 104857600
2017-04-21 22:04:52,837 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 26214396; length = 6553600
2017-04-21 22:04:52,841 INFO [org.apache.hadoop.mapred.MapTask] - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2017-04-21 22:04:53,652 INFO [org.apache.hadoop.mapreduce.Job] - Job job_local1001883244_0001 running in uber mode : false
2017-04-21 22:04:53,654 INFO [org.apache.hadoop.mapreduce.Job] - map 0% reduce 0%
2017-04-21 22:04:54,718 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
2017-04-21 22:04:54,718 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 0; bufend = 49471275; bufvoid = 104857600
2017-04-21 22:04:54,718 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 26214396(104857584); kvend = 17610700(70442800); length = 8603697/6553600
2017-04-21 22:04:54,718 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 58074971 kvi 14518736(58074944)
2017-04-21 22:04:55,730 INFO [org.apache.hadoop.mapred.MapTask] - Finished spill 0
2017-04-21 22:04:55,738 INFO [org.apache.hadoop.mapred.MapTask] - (RESET) equator 58074971 kv 14518736(58074944) kvi 12367824(49471296)
2017-04-21 22:04:56,831 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
2017-04-21 22:04:56,831 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 58074971; bufend = 2688654; bufvoid = 104857592
2017-04-21 22:04:56,831 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 14518736(58074944); kvend = 5915040(23660160); length = 8603697/6553600
2017-04-21 22:04:56,831 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 11292334 kvi 2823076(11292304)
2017-04-21 22:04:57,661 INFO [org.apache.hadoop.mapred.MapTask] - Finished spill 1
2017-04-21 22:04:57,670 INFO [org.apache.hadoop.mapred.MapTask] - (RESET) equator 11292334 kv 2823076(11292304) kvi 672168(2688672)
2017-04-21 22:04:58,665 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
2017-04-21 22:04:58,665 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 11292334; bufend = 60763609; bufvoid = 104857600
2017-04-21 22:04:58,665 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 2823076(11292304); kvend = 20433780(81735120); length = 8603697/6553600
2017-04-21 22:04:58,665 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 69367289 kvi 17341816(69367264)
2017-04-21 22:04:59,369 INFO [org.apache.hadoop.mapred.MapTask] - Finished spill 2
2017-04-21 22:04:59,377 INFO [org.apache.hadoop.mapred.MapTask] - (RESET) equator 69367289 kv 17341816(69367264) kvi 15190908(60763632)
2017-04-21 22:05:00,401 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
2017-04-21 22:05:00,401 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 69367289; bufend = 13980964; bufvoid = 104857600
2017-04-21 22:05:00,401 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 17341816(69367264); kvend = 8738120(34952480); length = 8603697/6553600
2017-04-21 22:05:00,401 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 22584644 kvi 5646156(22584624)
2017-04-21 22:05:01,083 INFO [org.apache.hadoop.mapred.MapTask] - Finished spill 3
2017-04-21 22:05:01,092 INFO [org.apache.hadoop.mapred.MapTask] - (RESET) equator 22584644 kv 5646156(22584624) kvi 3495248(13980992)
2017-04-21 22:05:02,071 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
2017-04-21 22:05:02,071 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 22584644; bufend = 72055919; bufvoid = 104857600
2017-04-21 22:05:02,071 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 5646156(22584624); kvend = 23256860(93027440); length = 8603697/6553600
2017-04-21 22:05:02,071 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 80659599 kvi 20164892(80659568)
2017-04-21 22:05:02,769 INFO [org.apache.hadoop.mapred.MapTask] - Finished spill 4
2017-04-21 22:05:02,777 INFO [org.apache.hadoop.mapred.MapTask] - (RESET) equator 80659599 kv 20164892(80659568) kvi 18013984(72055936)
2017-04-21 22:05:03,792 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
2017-04-21 22:05:03,792 INFO [org.apache.hadoop.mapred.MapTask] - bufstart = 80659599; bufend = 25273274; bufvoid = 104857600
2017-04-21 22:05:03,792 INFO [org.apache.hadoop.mapred.MapTask] - kvstart = 20164892(80659568); kvend = 11561196(46244784); length = 8603697/6553600
2017-04-21 22:05:03,792 INFO [org.apache.hadoop.mapred.MapTask] - (EQUATOR) 33876954 kvi 8469232(33876928)
2017-04-21 22:05:04,491 INFO [org.apache.hadoop.mapred.MapTask] - Finished spill 5
2017-04-21 22:05:04,499 INFO [org.apache.hadoop.mapred.MapTask] - (RESET) equator 33876954 kv 8469232(33876928) kvi 6318324(25273296)
2017-04-21 22:05:04,755 INFO [org.apache.hadoop.mapred.LocalJobRunner] - map > map
2017-04-21 22:05:05,507 INFO [org.apache.hadoop.mapred.MapTask] - Spilling map output
It keeps spilling until I shut it down.
Why?
I'm so confused ...
I run this program on my own computer, not in the cloud.
My computer has only 5 GB of memory left; does that matter?
It was runnable at first, meaning it could output a file part-00000, although the content was not what I expected...
Now it outputs many files like that.
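A possible cause, offered as a sketch rather than a confirmed diagnosis: the mapper's while (tokenizerArticle.hasMoreTokens()) loop never calls nextToken(), so for any non-empty line the condition stays true and the same record is written again and again, which would fill the sort buffer and trigger one spill after another. Each map() call should already receive a single line from the text input format, so no inner loop should be needed. The plain-Java sketch below leaves out the Hadoop types; SalesLineParser and parse are hypothetical names, the sample line is made up, and the column positions (item name in column 3, price in column 4) are taken from the mapper above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: parses one tab-separated sales record without any loop.
final class SalesLineParser {

    // Returns (itemName, itemPrice), or empty if the line is malformed.
    static Optional<Map.Entry<String, Double>> parse(String line) {
        String[] items = line.split("\t");
        if (items.length < 5) {
            return Optional.empty(); // too few columns: skip instead of throwing
        }
        try {
            double price = Double.parseDouble(items[4]);
            Map.Entry<String, Double> entry = new SimpleEntry<>(items[3], price);
            return Optional.of(entry);
        } catch (NumberFormatException e) {
            return Optional.empty(); // price column is not a number
        }
    }

    public static void main(String[] args) {
        // Made-up sample line in the layout the mapper expects.
        String line = "2012-01-01\t09:00\tSan Jose\tMusic\t214.05\tAmex";
        parse(line).ifPresent(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```

In the mapper, the body of map() would call such a parser once per input line and emit at most one pair. Separately, result.set(sum) is commented out in the reducer, so context.write(key, result) would emit 0.0 for every key until that line is restored.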

Related

MapReduce program to find maximum temperature

I have written a MapReduce program, but the reducer is not working. Below is the code I have written. Please let me know what the mistake in the program is, as I am not getting any error.
Below is the data:
temp1.txt
1993 23
1991 25
1992 56
1991 78
temp2.txt
1991 11
1993 24
1992 35
Mapper:
package p1;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.IntWritable;
import java.io.*;
public class mymaaper extends Mapper <LongWritable,Text,Text, IntWritable>
{
    public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException
    {
        String arr1[]= value.toString().split("\\s");
        String year = arr1[0];
        int temp = Integer.parseInt(arr1[1]);
        con.write(new Text(year), new IntWritable(temp));
        //con.write(new Text(year), new Text(year));
        System.out.println(year+""+temp);
    }
}
Reducer:
package p1;
import java.io.*;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
public class myreducer extends Reducer <Text, IntWritable, Text, IntWritable>
{
    public myreducer()
    {
        System.out.println("myreducer().hashcode="+ hashCode());
    }
    public void reduce(Text key, Iterable<IntWritable> value, Context con) throws IOException, InterruptedException
    {
        System.out.println("reduce(-,-,-)");
        System.out.println("context="+con);
        System.out.println("key="+key);
        System.out.print("All values=");
        int maxvalue =Integer.MIN_VALUE;
        for(IntWritable sw:value)
        {
            maxvalue = Math.max(maxvalue, sw.get());
        }
        con.write(key, new IntWritable(maxvalue));
    }
}
Driver:
package p1;
import java.io.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
public class mydriver
{
    public static void main(String args[]) throws ClassNotFoundException, IOException, InterruptedException
    {
        Path input= new Path("hdfs://localhost:9000/input_temp/");
        Path output= new Path("hdfs://localhost:9000/output_temp/");
        Configuration conf= new Configuration();
        Job j1= Job.getInstance(conf, "maxtemp");
        j1.setJarByClass(mydriver.class);
        j1.setMapperClass(mymaaper.class);
        j1.setReducerClass(myreducer.class);
        j1.setOutputKeyClass(Text.class);
        j1.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j1,input);
        FileOutputFormat.setOutputPath(j1,output);
        output.getFileSystem(conf).delete(output, true);
        System.exit(j1.waitForCompletion(true)? 0 : 1);
    }
}
Output:
2018-09-19 09:42:13,222 WARN util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(60)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-09-19 09:42:22,319 INFO beanutils.FluentPropertyBeanIntrospector (FluentPropertyBeanIntrospector.java:introspect(147)) - Error when creating PropertyDescriptor for public final void org.apache.hadoop.shaded.org.apache.commons.configuration2.AbstractConfiguration.setProperty(java.lang.String,java.lang.Object)! Ignoring this property.
2018-09-19 09:42:22,864 INFO impl.MetricsConfig (MetricsConfig.java:loadFirst(121)) - loaded properties from hadoop-metrics2.properties
2018-09-19 09:42:23,829 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:startTimer(374)) - Scheduled Metric snapshot period at 0 second(s).
2018-09-19 09:42:23,834 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:start(191)) - JobTracker metrics system started
2018-09-19 09:42:26,003 WARN mapreduce.JobResourceUploader (JobResourceUploader.java:uploadResourcesInternal(147)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-09-19 09:42:26,053 WARN mapreduce.JobResourceUploader (JobResourceUploader.java:uploadJobJar(480)) - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2018-09-19 09:42:27,001 INFO input.FileInputFormat (FileInputFormat.java:listStatus(292)) - Total input files to process : 2
2018-09-19 09:42:27,512 INFO mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(205)) - number of splits:2
2018-09-19 09:42:29,048 INFO mapreduce.JobSubmitter (JobSubmitter.java:printTokens(301)) - Submitting tokens for job: job_local342787376_0001
2018-09-19 09:42:29,068 INFO mapreduce.JobSubmitter (JobSubmitter.java:printTokens(302)) - Executing with tokens: []
2018-09-19 09:42:30,382 INFO mapreduce.Job (Job.java:submit(1574)) - The url to track the job: http://localhost:8080/
2018-09-19 09:42:30,387 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1619)) - Running job: job_local342787376_0001
2018-09-19 09:42:30,408 INFO mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(501)) - OutputCommitter set in config null
2018-09-19 09:42:30,469 INFO output.FileOutputCommitter (FileOutputCommitter.java:<init>(140)) - File Output Committer Algorithm version is 2
2018-09-19 09:42:30,478 INFO output.FileOutputCommitter (FileOutputCommitter.java:<init>(155)) - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-09-19 09:42:30,539 INFO mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(519)) - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2018-09-19 09:42:31,402 INFO mapred.LocalJobRunner (LocalJobRunner.java:runTasks(478)) - Waiting for map tasks
2018-09-19 09:42:31,416 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(252)) - Starting task: attempt_local342787376_0001_m_000000_0
2018-09-19 09:42:31,444 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1640)) - Job job_local342787376_0001 running in uber mode : false
2018-09-19 09:42:31,447 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1647)) - map 0% reduce 0%
2018-09-19 09:42:31,768 INFO output.FileOutputCommitter (FileOutputCommitter.java:<init>(140)) - File Output Committer Algorithm version is 2
2018-09-19 09:42:31,778 INFO output.FileOutputCommitter (FileOutputCommitter.java:<init>(155)) - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-09-19 09:42:32,028 INFO mapred.Task (Task.java:initialize(625)) - Using ResourceCalculatorProcessTree : [ ]
2018-09-19 09:42:32,085 INFO mapred.MapTask (MapTask.java:runNewMapper(768)) - Processing split: hdfs://localhost:9000/input_temp/temp1:0+41
2018-09-19 09:42:33,881 INFO mapred.MapTask (MapTask.java:setEquator(1219)) - (EQUATOR) 0 kvi 26214396(104857584)
2018-09-19 09:42:33,888 INFO mapred.MapTask (MapTask.java:init(1012)) - mapreduce.task.io.sort.mb: 100
2018-09-19 09:42:33,888 INFO mapred.MapTask (MapTask.java:init(1013)) - soft limit at 83886080
2018-09-19 09:42:33,889 INFO mapred.MapTask (MapTask.java:init(1014)) - bufstart = 0; bufvoid = 104857600
2018-09-19 09:42:33,890 INFO mapred.MapTask (MapTask.java:init(1015)) - kvstart = 26214396; length = 6553600
2018-09-19 09:42:33,964 INFO mapred.MapTask (MapTask.java:createSortingCollector(409)) - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
199121
1992-5
199310
199152
1993-67
2018-09-19 09:42:35,960 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(628)) -
2018-09-19 09:42:35,992 INFO mapred.MapTask (MapTask.java:flush(1476)) - Starting flush of map output
2018-09-19 09:42:36,001 INFO mapred.MapTask (MapTask.java:flush(1498)) - Spilling map output
2018-09-19 09:42:36,001 INFO mapred.MapTask (MapTask.java:flush(1499)) - bufstart = 0; bufend = 45; bufvoid = 104857600
2018-09-19 09:42:36,007 INFO mapred.MapTask (MapTask.java:flush(1501)) - kvstart = 26214396(104857584); kvend = 26214380(104857520); length = 17/6553600
2018-09-19 09:42:36,175 INFO mapred.MapTask (MapTask.java:sortAndSpill(1696)) - Finished spill 0
2018-09-19 09:42:36,337 INFO mapred.Task (Task.java:done(1232)) - Task:attempt_local342787376_0001_m_000000_0 is done. And is in the process of committing
2018-09-19 09:42:36,419 INFO mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(628)) - map
2018-09-19 09:42:36,426 INFO mapred.Task (Task.java:sendDone(1368)) - Task 'attempt_local342787376_0001_m_000000_0' done.
2018-09-19 09:42:36,571 INFO mapred.Task (Task.java:done(1264)) - Final Counters for attempt_local342787376_0001_m_000000_0: Counters: 22
File System Counters
FILE: Number of bytes read=267
FILE: Number of bytes written=495006
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=41
HDFS: Number of bytes written=0
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Map-Reduce Framework
Map input records=5
Map output records=5
Map output bytes=45
Map output materialized bytes=61
Input split bytes=103
Combine input records=0
Spilled Records=5
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=339
Total committed heap usage (bytes)=167841792
File Input Format Counters
Bytes Read=41
2018-09-19 09:42:36,578 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(277)) - Finishing task: attempt_local342787376_0001_m_000000_0
2018-09-19 09:42:36,581 INFO mapred.LocalJobRunner (LocalJobRunner.java:run(252)) - Starting task: attempt_local342787376_0001_m_000001_0
2018-09-19 09:42:36,606 INFO output.FileOutputCommitter (FileOutputCommitter.java:<init>(140)) - File Output Committer Algorithm version is 2
2018-09-19 09:42:36,607 INFO output.FileOutputCommitter (FileOutputCommitter.java:<init>(155)) - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2018-09-19 09:42:36,609 INFO mapred.Task (Task.java:initialize(625)) - Using ResourceCalculatorProcessTree : [ ]
2018-09-19 09:42:36,644 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1647)) - map 100% reduce 0%
2018-09-19 09:42:36,668 INFO mapred.MapTask (MapTask.java:runNewMapper(768)) - Processing split: hdfs://localhost:9000/input_temp/temp2:0+33
2018-09-19 09:42:37,175 INFO mapred.MapTask (MapTask.java:setEquator(1219)) - (EQUATOR) 0 kvi 26214396(104857584)
2018-09-19 09:42:37,180 INFO mapred.MapTask (MapTask.java:init(1012)) - mapreduce.task.io.sort.mb: 100
2018-09-19 09:42:37,183 INFO mapred.MapTask (MapTask.java:init(1013)) - soft limit at 83886080
2018-09-19 09:42:37,187 INFO mapred.MapTask (MapTask.java:init(1014)) - bufstart = 0; bufvoid = 104857600
2018-09-19 09:42:37,187 INFO mapred.MapTask (MapTask.java:init(1015)) - kvstart = 26214396; length = 6553600
2018-09-19 09:42:37,199 INFO mapred.MapTask (MapTask.java:createSortingCollector(409)) - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
199246
1993-9
199188
1992-2
2018-09-19 09:42:37,354 INFO mapred.MapTask (MapTask.java:flush(1476)) - Starting flush of map output
2018-09-19 09:42:37,355 INFO mapred.MapTask (MapTask.java:flush(1498)) - Spilling map output
2018-09-19 09:42:37,355 INFO mapred.MapTask (MapTask.java:flush(1499)) - bufstart = 0; bufend = 36; bufvoid = 104857600
2018-09-19 09:42:37,355 INFO mapred.MapTask (MapTask.java:flush(1501)) - kvstart = 26214396(104857584); kvend = 26214384(104857536); length = 13/6553600
2018-09-19 09:42:37,419 INFO mapred.MapTask (MapTask.java:sortAndSpill(1696)) - Finished spill 0
2018-09-19 09:42:37,480 INFO mapred.LocalJobRunner (LocalJobRunner.java:runTasks(486)) - map task executor complete.
2018-09-19 09:42:37,498 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(590)) - job_local342787376_0001
java.lang.Exception: java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
at p1.mymaaper.map(mymaaper.java:16)
at p1.mymaaper.map(mymaaper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-09-19 09:42:37,648 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1660)) - Job job_local342787376_0001 failed with state FAILED due to: NA
2018-09-19 09:42:37,786 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1665)) - Counters: 22
File System Counters
FILE: Number of bytes read=267
FILE: Number of bytes written=495006
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=41
HDFS: Number of bytes written=0
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Map-Reduce Framework
Map input records=5
Map output records=5
Map output bytes=45
Map output materialized bytes=61
Input split bytes=103
Combine input records=0
Spilled Records=5
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=339
Total committed heap usage (bytes)=167841792
File Input Format Counters
Bytes Read=41
You are getting an ArrayIndexOutOfBoundsException. I think there is a bad record in one of your input files. Check the length of arr1 in the mapper before indexing into it.
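A minimal sketch of such a check, in plain Java without the Hadoop types. TempLineParser and parse are hypothetical names. Trimming and splitting on \\s+ is an assumption: it tolerates blank lines (for which split returns a single element, so index 1 is out of bounds) as well as repeated spaces between the two fields.

```java
import java.util.Optional;

// Hypothetical helper: validates a "year temperature" record before use.
final class TempLineParser {

    // Returns {year, temperature}, or empty for a bad record.
    static Optional<int[]> parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 2) {
            return Optional.empty(); // blank or truncated line
        }
        try {
            return Optional.of(new int[] {
                Integer.parseInt(parts[0]), Integer.parseInt(parts[1])
            });
        } catch (NumberFormatException e) {
            return Optional.empty(); // non-numeric field
        }
    }

    public static void main(String[] args) {
        String[] lines = {"1993 23", "", "1991  25", "1992"};
        for (String line : lines) {
            System.out.println("[" + line + "] valid=" + parse(line).isPresent());
        }
    }
}
```

In the mapper, a record for which the parser returns empty would simply be skipped (or counted with a Hadoop counter) instead of failing the whole task.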

Does Sqoop spill temporary data to disk?

As I understand Sqoop, it launches a few mappers on different data nodes, each making a JDBC connection to the RDBMS. Once the connection is established, data is transferred to HDFS.
I am just trying to understand: do Sqoop mappers spill data temporarily to disk on the data node? I know spilling happens in MapReduce, but I am not sure about a Sqoop job.
It seems sqoop-import runs in mappers only and doesn't spill, whereas sqoop-merge runs as a full map-reduce job and does spill. You can check this in the JobTracker while a Sqoop import is running.
Have a look at this part of a sqoop import log; it does not spill, it just fetches and writes to HDFS:
INFO [main] ... mapreduce.db.DataDrivenDBRecordReader: Using query: SELECT...
[main] mapreduce.db.DBRecordReader: Executing query: SELECT...
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 1
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
INFO [Thread-16] ...mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false
INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1489705733959_2462784_m_000000_0 is done. And is in the process of committing
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_1489705733959_2462784_m_000000_0' to hdfs://
Have a look at this sqoop-merge log (some rows skipped); it spills to disk (note "Spilling map output" in the log):
INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://bla-bla/part-m-00000:0+48322717
...
INFO [main] org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
...
INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 1024
INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 751619264
INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 1073741824
INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452; length = 67108864
INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO [main] com.pepperdata.supervisor.agent.resource.r: Datanode bla-bla is LOCAL.
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.snappy]
...
INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 184775274; bufvoid = 1073741824
INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 268435452(1073741808); kvend = 267347800(1069391200); length = 1087653/67108864
INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
[main] org.apache.hadoop.mapred.MapTask: Finished spill 0
...Task:attempt_1489705733959_2479291_m_000000_0 is done. And is in the process of committing
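The buffer figures in the sqoop-merge log above can be reproduced with a little arithmetic, which may help when reading such logs. The sketch below assumes the soft limit is the sort buffer size in bytes multiplied by the spill threshold (the mapreduce.map.sort.spill.percent setting). The 0.7 threshold is inferred from the logged values rather than stated in the log, and SpillMath, bufferBytes and softLimit are hypothetical names.

```java
// Hypothetical helper reproducing the buffer arithmetic seen in the log.
final class SpillMath {

    // Buffer size in bytes for a given mapreduce.task.io.sort.mb value.
    static int bufferBytes(int sortMb) {
        return sortMb << 20; // megabytes to bytes
    }

    // Soft limit: once this many bytes are buffered, a spill is started.
    // Single-precision arithmetic is an assumption chosen to match the log.
    static int softLimit(int sortMb, float spillPercent) {
        return (int) (bufferBytes(sortMb) * spillPercent);
    }

    public static void main(String[] args) {
        System.out.println("bufvoid    = " + bufferBytes(1024));
        System.out.println("soft limit = " + softLimit(1024, 0.7f));
    }
}
```

A map task whose total output stays below the soft limit spills only once, during the final flush, which matches the single "Finished spill 0" in the log above, where bufend is 184775274.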

MapReduce File Output Format counter is zero

I am writing MapReduce code for inverted indexing of a file in which each line has the form "Doc_id Title Document Contents".
I cannot figure out why the File Output Format counter is zero even though the MapReduce job completes successfully without any exception.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class InvertedIndex {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, Text> {
        private Text word = new Text();
        private Text docID_Title = new Text();
        //RemoveStopWords is a different class
        static RemoveStopWords rmvStpWrd = new RemoveStopWords();
        //Stemmer is a different class
        Stemmer stemmer = new Stemmer();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            rmvStpWrd.makeStopWordList();
            StringTokenizer itr = new StringTokenizer(value.toString().replaceAll(" [^\\p{L}]", " "));
            //fetching id of the document
            String id = null;
            String title = null;
            if(itr.hasMoreTokens())
                id = itr.nextToken();
            //fetching title of the document
            if(itr.hasMoreTokens())
                title = itr.nextToken();
            String ID_TITLE = id + title;
            if(id!=null)
                docID_Title.set(ID_TITLE);
            while (itr.hasMoreTokens()) {
                /*manipulation of tokens:
                 * First we remove stop words
                 * Then Stem the words
                 */
                String temp = itr.nextToken().toLowerCase();
                if(RemoveStopWords.isStopWord(temp)) {
                    continue;
                }
                else {
                    //now the word is not a stop word
                    //we will stem it
                    char[] a;
                    stemmer.add((a = temp.toCharArray()), a.length);
                    stemmer.stem();
                    temp = stemmer.toString();
                    word.set(temp);
                    context.write(word, docID_Title);
                }
            }//end while
        }//end map
    }//end mapper
    public static class IntSumReducer
            extends Reducer<Text,Text,Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            //to iterate over the values
            Iterator<Text> itr = values.iterator();
            String old = itr.next().toString();
            int freq = 1;
            String next = null;
            boolean isThere = true;
            StringBuilder stringBuilder = new StringBuilder();
            while(itr.hasNext()) {
                //freq counts number of times a word comes in a document
                freq = 1;
                while((isThere = itr.hasNext())) {
                    next = itr.next().toString();
                    if(old == next)
                        freq++;
                    else {
                        //the loop break when we get different docID_Title for the word(key)
                        break;
                    }
                    //if more data is there
                    if(isThere) {
                        old = old +"_"+ freq;
                        stringBuilder.append(old);
                        stringBuilder.append(" | ");
                        old = next;
                        context.write(key, new Text(stringBuilder.toString()));
                        stringBuilder.setLength(0);
                    }
                    else {
                        //for the last key
                        freq++;
                        old = old +"_"+ freq;
                        stringBuilder.append(old);
                        stringBuilder.append(" | ");
                        old = next;
                        context.write(key, new Text(stringBuilder.toString()));
                    }//end else
                }//end while
            }//end while
        }//end reduce
    }//end reducer
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "InvertedIndex");
job.setJarByClass(InvertedIndex.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}//end main
}//end InvertedIndex
This is the output I am getting:
16/10/03 15:34:21 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/10/03 15:34:21 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/10/03 15:34:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/10/03 15:34:22 INFO input.FileInputFormat: Total input paths to process : 1
16/10/03 15:34:22 INFO mapreduce.JobSubmitter: number of splits:1
16/10/03 15:34:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local507694567_0001
16/10/03 15:34:22 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/10/03 15:34:22 INFO mapreduce.Job: Running job: job_local507694567_0001
16/10/03 15:34:22 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/10/03 15:34:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:22 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/10/03 15:34:22 INFO mapred.LocalJobRunner: Waiting for map tasks
16/10/03 15:34:22 INFO mapred.LocalJobRunner: Starting task: attempt_local507694567_0001_m_000000_0
16/10/03 15:34:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:22 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/10/03 15:34:22 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/sonu/ss.txt:0+1002072
16/10/03 15:34:23 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/10/03 15:34:23 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/10/03 15:34:23 INFO mapred.MapTask: soft limit at 83886080
16/10/03 15:34:23 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/10/03 15:34:23 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/10/03 15:34:23 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/10/03 15:34:23 INFO mapreduce.Job: Job job_local507694567_0001 running in uber mode : false
16/10/03 15:34:23 INFO mapreduce.Job: map 0% reduce 0%
16/10/03 15:34:24 INFO mapred.LocalJobRunner:
16/10/03 15:34:24 INFO mapred.MapTask: Starting flush of map output
16/10/03 15:34:24 INFO mapred.MapTask: Spilling map output
16/10/03 15:34:24 INFO mapred.MapTask: bufstart = 0; bufend = 2206696; bufvoid = 104857600
16/10/03 15:34:24 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25789248(103156992); length = 425149/6553600
16/10/03 15:34:24 INFO mapred.MapTask: Finished spill 0
16/10/03 15:34:24 INFO mapred.Task: Task:attempt_local507694567_0001_m_000000_0 is done. And is in the process of committing
16/10/03 15:34:24 INFO mapred.LocalJobRunner: map
16/10/03 15:34:24 INFO mapred.Task: Task 'attempt_local507694567_0001_m_000000_0' done.
16/10/03 15:34:24 INFO mapred.LocalJobRunner: Finishing task: attempt_local507694567_0001_m_000000_0
16/10/03 15:34:24 INFO mapred.LocalJobRunner: map task executor complete.
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Starting task: attempt_local507694567_0001_r_000000_0
16/10/03 15:34:25 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:25 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/10/03 15:34:25 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@5d0e7307
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=333971456, maxSingleShuffleLimit=83492864, mergeThreshold=220421168, ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/10/03 15:34:25 INFO reduce.EventFetcher: attempt_local507694567_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
16/10/03 15:34:25 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local507694567_0001_m_000000_0 decomp: 2 len: 6 to MEMORY
16/10/03 15:34:25 INFO reduce.InMemoryMapOutput: Read 2 bytes from map-output for attempt_local507694567_0001_m_000000_0
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->2
16/10/03 15:34:25 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
16/10/03 15:34:25 INFO mapred.Merger: Merging 1 sorted segments
16/10/03 15:34:25 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merged 1 segments, 2 bytes to disk to satisfy reduce memory limit
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merging 1 files, 6 bytes from disk
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
16/10/03 15:34:25 INFO mapred.Merger: Merging 1 sorted segments
16/10/03 15:34:25 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
16/10/03 15:34:25 INFO mapred.Task: Task:attempt_local507694567_0001_r_000000_0 is done. And is in the process of committing
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO mapred.Task: Task attempt_local507694567_0001_r_000000_0 is allowed to commit now
16/10/03 15:34:25 INFO output.FileOutputCommitter: Saved output of task 'attempt_local507694567_0001_r_000000_0' to hdfs://localhost:9000/user/sonu/output/_temporary/0/task_local507694567_0001_r_000000
16/10/03 15:34:25 INFO mapred.LocalJobRunner: reduce > reduce
16/10/03 15:34:25 INFO mapred.Task: Task 'attempt_local507694567_0001_r_000000_0' done.
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Finishing task: attempt_local507694567_0001_r_000000_0
16/10/03 15:34:25 INFO mapred.LocalJobRunner: reduce task executor complete.
16/10/03 15:34:25 INFO mapreduce.Job: map 100% reduce 100%
16/10/03 15:34:25 INFO mapreduce.Job: Job job_local507694567_0001 completed successfully
16/10/03 15:34:25 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=17342
FILE: Number of bytes written=571556
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2004144
HDFS: Number of bytes written=0
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Map-Reduce Framework
Map input records=53
Map output records=106288
Map output bytes=2206696
Map output materialized bytes=6
Input split bytes=103
Combine input records=106288
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=6
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=12
Total committed heap usage (bytes)=562036736
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1002072
File Output Format Counters
Bytes Written=0
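A possible explanation, judging from the counters above: Combine input records is 106288 but Combine output records is 0, so the combiner (the same IntSumReducer class) emits nothing and the reducer never receives a record. In reduce, old == next compares String references rather than contents, so the inner loop breaks on the first value every time and context.write is never reached. The sketch below shows only the counting step, in plain Java without Hadoop types. The names PostingCounter and buildPostings are hypothetical, and it counts with a map instead of comparing neighbouring values, on the assumption that values for a key may not arrive grouped by document.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: not part of the original job, only an illustration.
final class PostingCounter {

    // Counts how often each docID_Title string occurs among the values of one word
    // and renders the result as "doc_freq | doc_freq | ...".
    static String buildPostings(Iterable<String> docIds) {
        // LinkedHashMap keeps first-seen order; keys are matched with equals(), not ==.
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String doc : docIds) {
            freq.merge(doc, 1, Integer::sum);
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (sb.length() > 0) {
                sb.append(" | ");
            }
            sb.append(e.getKey()).append('_').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("1Intro", "2Guide", "1Intro");
        System.out.println(buildPostings(values)); // prints "1Intro_2 | 2Guide_1"
    }
}
```

In the job itself, the same class should probably not be reused as the combiner, because its output values (docID_freq strings) have a different format from the values the mapper emits.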

ArrayIndexOutOfBoundsException at MapOutputBuffer$Buffer.write in MapTask (Hadoop 2.7.1)

Very odd case of ArrayIndexOutOfBounds in a Scalding-driven job running on Hadoop 2.7.1. Mapper log dump below. It looks like Equator somehow gets set to a negative number in spill 2. Is this normal?
2015-08-12 23:39:19,649 INFO [main] org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2015-08-12 23:39:20,174 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 0 kvi 469762044(1879048176)
2015-08-12 23:39:20,175 INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 1792
2015-08-12 23:39:20,175 INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 187904816
2015-08-12 23:39:20,175 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 1879048192
2015-08-12 23:39:20,175 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 469762044; length = 117440512
2015-08-12 23:39:20,214 INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2015-08-12 23:39:20,216 INFO [main] cascading.flow.hadoop.FlowMapper: cascading version: 2.6.1
2015-08-12 23:39:20,216 INFO [main] cascading.flow.hadoop.FlowMapper: child jvm opts: -Xmx1024m -Djava.io.tmpdir=./tmp
2015-08-12 23:39:20,516 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
2015-08-12 23:39:20,552 INFO [main] cascading.flow.hadoop.FlowMapper: sourcing from: TempHfs["SequenceFile[['docId', 'otherDocId', 'score']]"][9909013673/_pipe_11__pipe_12/]
2015-08-12 23:39:20,552 INFO [main] cascading.flow.hadoop.FlowMapper: sinking to: GroupBy(_pipe_11+_pipe_12)[by:[{1}:'docId']]
2015-08-12 23:39:29,424 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2015-08-12 23:39:29,424 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 108647886; bufvoid = 1879048192
2015-08-12 23:39:29,424 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 469762044(1879048176); kvend = 449947816(1799791264); length = 19814229/117440512
2015-08-12 23:39:29,425 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 839953118 kvi 209988272(839953088)
2015-08-12 23:39:43,985 INFO [SpillThread] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.gz]
2015-08-12 23:39:46,767 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 0
2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 839953118 kv 209988272(839953088) kvi 178264648(713058592)
2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 839953118; bufend = 1014433072; bufvoid = 1879048192
2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 209988272(839953088); kvend = 178264648(713058592); length = 31723625/117440512
2015-08-12 23:39:46,767 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 1696670336 kvi 424167580(1696670320)
2015-08-12 23:40:22,641 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 1
2015-08-12 23:40:22,641 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 1696670336 kv 424167580(1696670320) kvi 392768808(1571075232)
2015-08-12 23:40:22,641 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2015-08-12 23:40:22,641 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 1696670336; bufend = 1869363604; bufvoid = 1879048192
2015-08-12 23:40:22,641 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 424167580(1696670320); kvend = 392768808(1571075232); length = 31398773/117440512
2015-08-12 23:40:22,642 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) -1742031900 kvi 34254072(137016288)
2015-08-12 23:40:47,329 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 2
2015-08-12 23:40:47,330 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator -1742031900 kv 34254072(137016288) kvi 34254072(137016288)
2015-08-12 23:40:47,331 ERROR [main] cascading.flow.stream.TrapHandler: caught Throwable, no trap available, rethrowing
cascading.flow.stream.DuctException: internal error: ['7541904654925238223', '2.812180059539485']
at cascading.flow.hadoop.stream.HadoopGroupByGate.receive(HadoopGroupByGate.java:81)
at cascading.flow.hadoop.stream.HadoopGroupByGate.receive(HadoopGroupByGate.java:37)
at cascading.flow.stream.FunctionEachStage$1.collect(FunctionEachStage.java:80)
at cascading.tuple.TupleEntryCollector.safeCollect(TupleEntryCollector.java:145)
at cascading.tuple.TupleEntryCollector.add(TupleEntryCollector.java:133)
at cascading.operation.Identity$2.operate(Identity.java:137)
at cascading.operation.Identity.operate(Identity.java:150)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
at cascading.flow.stream.SourceStage.run(SourceStage.java:58)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:130)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ArrayIndexOutOfBoundsException
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1453)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1349)
at java.io.DataOutputStream.write(DataOutputStream.java:88)
at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
at org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:273)
at org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:253)
at cascading.tuple.hadoop.io.HadoopTupleOutputStream.writeIntInternal(HadoopTupleOutputStream.java:155)
at cascading.tuple.io.TupleOutputStream.write(TupleOutputStream.java:86)
at cascading.tuple.io.TupleOutputStream.writeTuple(TupleOutputStream.java:64)
at cascading.tuple.hadoop.io.TupleSerializer.serialize(TupleSerializer.java:37)
at cascading.tuple.hadoop.io.TupleSerializer.serialize(TupleSerializer.java:28)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1149)
at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:610)
at cascading.tap.hadoop.util.MeasuredOutputCollector.collect(MeasuredOutputCollector.java:69)
at cascading.flow.hadoop.stream.HadoopGroupByGate.receive(HadoopGroupByGate.java:68)
... 18 more
It is mapreduce.task.io.sort.mb that made the difference. When it is set to 2 GB or larger, the job consistently runs into this problem.
The suggestion is to set it to the value below or smaller:
-Dmapreduce.task.io.sort.mb=1792
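The negative equator in the question's log is consistent with 32-bit integer overflow: the buffer is 1879048192 bytes (1792 MB), which leaves little headroom below Integer.MAX_VALUE (2147483647). The sketch below is only an illustration of that arithmetic, assuming the new equator is computed as (bufindex + offset) % bufvoid with Java int values. The class EquatorOverflow and its methods are hypothetical, bufindex is assumed to equal the logged bufend of spill 2, and the offset is a made-up value chosen to reproduce the logged equator.

```java
// Hypothetical illustration of int overflow in a circular-buffer position.
final class EquatorOverflow {

    // 32-bit version: the sum can exceed Integer.MAX_VALUE and wrap to a negative number.
    // Java's % keeps the sign of the dividend, so the result stays negative.
    static int wrapInt(int bufindex, int offset, int bufvoid) {
        return (bufindex + offset) % bufvoid;
    }

    // 64-bit version of the same formula, for comparison.
    static long wrapLong(long bufindex, long offset, long bufvoid) {
        return (bufindex + offset) % bufvoid;
    }

    public static void main(String[] args) {
        int bufvoid = 1879048192;  // logged bufvoid (io.sort.mb = 1792)
        int bufindex = 1869363604; // logged bufend of spill 2 (assumed to be the write position)
        int offset = 683571792;    // hypothetical offset, chosen to match the logged equator

        System.out.println(wrapInt(bufindex, offset, bufvoid));  // prints -1742031900
        System.out.println(wrapLong(bufindex, offset, bufvoid)); // prints 673887204
    }
}
```

If this is what happens, a negative position would explain the ArrayIndexOutOfBoundsException on the next write, and it suggests that a buffer this close to 2 GB can still overflow while a smaller one leaves more headroom.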
I suspected a threading issue, so I tried the settings below and they worked. I'm not sure whether the cure will stick.
<property>
<name>mapreduce.map.sort.spill.percent</name>
<value>0.8</value>
</property>
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>10</value>
</property>
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>100</value>
</property>
<property>
<name>mapred.map.multithreadedrunner.threads</name>
<value>1</value>
</property>
<property>
<name>mapreduce.mapper.multithreadedmapper.threads</name>
<value>1</value>
</property>

Cannot run the job on hadoop cluster. only runs using LocalJobRunner

I submitted an MR job on CDH5 Beta 2 with the following hadoop jar command:
hadoop jar ./hadoop-examples-0.0.1-SNAPSHOT.jar com.aravind.learning.hadoop.mapred.join.ReduceSideJoinDriver tech_talks/users.csv tech_talks/ratings.csv tech_talks/output/ReduceSideJoinDriver/
I've also tried providing the fs name and job tracker URL explicitly, as below, without any success:
hadoop jar ./hadoop-examples-0.0.1-SNAPSHOT.jar com.aravind.learning.hadoop.mapred.join.ReduceSideJoinDriver -Dfs.default.name=hdfs://abc.com:8020 -Dmapreduce.job.tracker=x.x.x.x:8021 tech_talks/users.csv tech_talks/ratings.csv tech_talks/output/ReduceSideJoinDriver/
The job runs successfully but uses the LocalJobRunner instead of submitting to the cluster. The output is written to HDFS and is correct. I'm not sure what I am doing wrong, so I'd appreciate your input. In both cases the console output is the same:
14/04/16 20:35:44 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/04/16 20:35:44 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/04/16 20:35:45 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
14/04/16 20:35:45 INFO input.FileInputFormat: Total input paths to process : 2
14/04/16 20:35:45 INFO mapreduce.JobSubmitter: number of splits:2
14/04/16 20:35:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1427968352_0001
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/staging/ird21427968352/.staging/job_local1427968352_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/staging/ird21427968352/.staging/job_local1427968352_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/local/localRunner/ird2/job_local1427968352_0001/job_local1427968352_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/04/16 20:35:46 WARN conf.Configuration: file:/tmp/hadoop-ird2/mapred/local/localRunner/ird2/job_local1427968352_0001/job_local1427968352_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/04/16 20:35:46 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
14/04/16 20:35:46 INFO mapreduce.Job: Running job: job_local1427968352_0001
14/04/16 20:35:46 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/04/16 20:35:46 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
14/04/16 20:35:46 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/16 20:35:46 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:46 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:46 INFO mapred.MapTask: Processing split: hdfs://...:8020/user/ird2/tech_talks/ratings.csv:0+4388258
14/04/16 20:35:46 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/04/16 20:35:46 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
14/04/16 20:35:46 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
14/04/16 20:35:46 INFO mapred.MapTask: soft limit at 83886080
14/04/16 20:35:46 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
14/04/16 20:35:46 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
14/04/16 20:35:47 INFO mapreduce.Job: Job job_local1427968352_0001 running in uber mode : false
14/04/16 20:35:47 INFO mapreduce.Job: map 0% reduce 0%
14/04/16 20:35:48 INFO mapred.LocalJobRunner:
14/04/16 20:35:48 INFO mapred.MapTask: Starting flush of map output
14/04/16 20:35:48 INFO mapred.MapTask: Spilling map output
14/04/16 20:35:48 INFO mapred.MapTask: bufstart = 0; bufend = 6485388; bufvoid = 104857600
14/04/16 20:35:48 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 24860980(99443920); length = 1353417/6553600
14/04/16 20:35:49 INFO mapred.MapTask: Finished spill 0
14/04/16 20:35:49 INFO mapred.Task: Task:attempt_local1427968352_0001_m_000000_0 is done. And is in the process of committing
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map
14/04/16 20:35:49 INFO mapred.Task: Task 'attempt_local1427968352_0001_m_000000_0' done.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:49 INFO mapred.MapTask: Processing split: hdfs://...:8020/user/ird2/tech_talks/users.csv:0+186304
14/04/16 20:35:49 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
14/04/16 20:35:49 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
14/04/16 20:35:49 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
14/04/16 20:35:49 INFO mapred.MapTask: soft limit at 83886080
14/04/16 20:35:49 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
14/04/16 20:35:49 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
14/04/16 20:35:49 INFO mapred.LocalJobRunner:
14/04/16 20:35:49 INFO mapred.MapTask: Starting flush of map output
14/04/16 20:35:49 INFO mapred.MapTask: Spilling map output
14/04/16 20:35:49 INFO mapred.MapTask: bufstart = 0; bufend = 209667; bufvoid = 104857600
14/04/16 20:35:49 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26192144(104768576); length = 22253/6553600
14/04/16 20:35:49 INFO mapred.MapTask: Finished spill 0
14/04/16 20:35:49 INFO mapred.Task: Task:attempt_local1427968352_0001_m_000001_0 is done. And is in the process of committing
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map
14/04/16 20:35:49 INFO mapred.Task: Task 'attempt_local1427968352_0001_m_000001_0' done.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO mapred.LocalJobRunner: map task executor complete.
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Waiting for reduce tasks
14/04/16 20:35:49 INFO mapred.LocalJobRunner: Starting task: attempt_local1427968352_0001_r_000000_0
14/04/16 20:35:49 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
14/04/16 20:35:49 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@5116331d
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=652528832, maxSingleShuffleLimit=163132208, mergeThreshold=430669056, ioSortFactor=10, memToMemMergeOutputsThreshold=10
14/04/16 20:35:49 INFO reduce.EventFetcher: attempt_local1427968352_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
14/04/16 20:35:49 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1427968352_0001_m_000001_0 decomp: 220797 len: 220801 to MEMORY
14/04/16 20:35:49 INFO reduce.InMemoryMapOutput: Read 220797 bytes from map-output for attempt_local1427968352_0001_m_000001_0
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 220797, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->220797
14/04/16 20:35:49 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1427968352_0001_m_000000_0 decomp: 7162100 len: 7162104 to MEMORY
14/04/16 20:35:49 INFO reduce.InMemoryMapOutput: Read 7162100 bytes from map-output for attempt_local1427968352_0001_m_000000_0
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 7162100, inMemoryMapOutputs.size() -> 2, commitMemory -> 220797, usedMemory ->7382897
14/04/16 20:35:49 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
14/04/16 20:35:49 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:49 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
14/04/16 20:35:49 INFO mapred.Merger: Merging 2 sorted segments
14/04/16 20:35:49 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 7382885 bytes
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merged 2 segments, 7382897 bytes to disk to satisfy reduce memory limit
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merging 1 files, 7382899 bytes from disk
14/04/16 20:35:50 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
14/04/16 20:35:50 INFO mapred.Merger: Merging 1 sorted segments
14/04/16 20:35:50 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 7382889 bytes
14/04/16 20:35:50 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:50 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
14/04/16 20:35:50 INFO mapreduce.Job: map 100% reduce 0%
14/04/16 20:35:51 INFO mapred.Task: Task:attempt_local1427968352_0001_r_000000_0 is done. And is in the process of committing
14/04/16 20:35:51 INFO mapred.LocalJobRunner: 2 / 2 copied.
14/04/16 20:35:51 INFO mapred.Task: Task attempt_local1427968352_0001_r_000000_0 is allowed to commit now
14/04/16 20:35:51 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1427968352_0001_r_000000_0' to hdfs://...:8020/user/ird2/tech_talks/output/ReduceSideJoinDriver/_temporary/0/task_local1427968352_0001_r_000000
14/04/16 20:35:51 INFO mapred.LocalJobRunner: reduce > reduce
14/04/16 20:35:51 INFO mapred.Task: Task 'attempt_local1427968352_0001_r_000000_0' done.
14/04/16 20:35:51 INFO mapred.LocalJobRunner: Finishing task: attempt_local1427968352_0001_r_000000_0
14/04/16 20:35:51 INFO mapred.LocalJobRunner: reduce task executor complete.
14/04/16 20:35:52 INFO mapreduce.Job: map 100% reduce 100%
14/04/16 20:35:52 INFO mapreduce.Job: Job job_local1427968352_0001 completed successfully
14/04/16 20:35:52 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=14767932
FILE: Number of bytes written=29952985
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=13537382
HDFS: Number of bytes written=2949787
HDFS: Number of read operations=28
HDFS: Number of large read operations=0
HDFS: Number of write operations=5
Map-Reduce Framework
Map input records=343919
Map output records=343919
Map output bytes=6695055
Map output materialized bytes=7382905
Input split bytes=272
Combine input records=0
Combine output records=0
Reduce input groups=5564
Reduce shuffle bytes=7382905
Reduce input records=343919
Reduce output records=5564
Spilled Records=687838
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=92
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=1416101888
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=4574562
File Output Format Counters
Bytes Written=2949787
Driver code
public class ReduceSideJoinDriver extends Configured implements Tool
{
@Override
public int run(String[] args) throws Exception
{
if (args.length != 3)
{
System.err.printf("Usage: %s [generic options] <users> <ratings> <output>\n", getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
Path usersFile = new Path(args[0]);
Path ratingsFile = new Path(args[1]);
Job job = Job.getInstance(getConf(), "Aravind - Reduce Side Join");
job.getConfiguration().setStrings(usersFile.getName(), "user");
job.getConfiguration().setStrings(ratingsFile.getName(), "rating");
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(TagAndRecord.class);
TextInputFormat.addInputPath(job, usersFile);
TextInputFormat.addInputPath(job, ratingsFile);
TextOutputFormat.setOutputPath(job, new Path(args[2]));
job.setMapperClass(ReduceSideJoinMapper.class);
job.setReducerClass(ReduceSideJoinReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String args[]) throws Exception
{
int exitCode = ToolRunner.run(new Configuration(), new ReduceSideJoinDriver(), args);
System.exit(exitCode);
}
}
Make sure valid versions of the following configuration files are on the Hadoop classpath. By default, configuration files are read from the directory /etc/hadoop/conf. This should be done as part of Hadoop client node setup.
mapred-site.xml
yarn-site.xml
core-site.xml
If the configuration files listed above are empty, you have to populate them with the right properties. This can be done in two ways:
In Cloudera Manager, click on the YARN service. In the Actions menu there is an option, Deploy Client Configuration, alongside Start, Stop, etc. Use that option to deploy the client configuration.
Sometimes that option may not work, for example if the node is not managed by CM and a YARN gateway is not configured on it. In that case use Download Client Configuration instead of Deploy Client Configuration, extract the downloaded zip file (it contains the files above), and copy the files to /etc/hadoop/conf manually.
To execute the jar, either the hadoop or the yarn command can be used.
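One way to check whether the client configuration is actually populated: the MapReduce client uses the LocalJobRunner when mapreduce.framework.name resolves to local, which is its default when mapred-site.xml does not set it. The sketch below uses only the JDK's XML parser to look up that property in the contents of a site file. The class FrameworkNameCheck and the method propertyValue are hypothetical helpers, and the XML in main is a made-up sample rather than a real cluster file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads a Hadoop-style *-site.xml and returns one property value.
final class FrameworkNameCheck {

    // Returns the value of the named property, or null if the file does not define it.
    static String propertyValue(String siteXml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(siteXml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                NodeList names = p.getElementsByTagName("name");
                NodeList values = p.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample content; in practice this would be read from /etc/hadoop/conf/mapred-site.xml.
        String xml = "<configuration><property>"
                + "<name>mapreduce.framework.name</name><value>yarn</value>"
                + "</property></configuration>";
        String value = propertyValue(xml, "mapreduce.framework.name");
        System.out.println(value == null ? "not set (client falls back to local)" : value); // prints "yarn"
    }
}
```

If the lookup returns null or local for the file on the submitting node, the job IDs will look like job_local..., as in the question's log.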
Apparently, you can only submit a Hadoop job from the node designated as the gateway node. Everything worked once I submitted the job from the gateway node.

Resources