Hadoop Streaming job failing - hadoop

I am trying to run my first mapreduce job, which aggregates some data from xml files. My job is failing, and as I am a newbie at Hadoop, I would appreciate if someone could please take a look at what is going wrong.
I have:
posts_mapper.py:
#!/usr/bin/env python
import sys
import xml.etree.ElementTree as ET
input_string = sys.stdin.read()
class User(object):
def __init__(self, id):
self.id = id
self.post_type_1_count = 0
self.post_type_2_count = 0
self.aggregate_post_score = 0
self.aggregate_post_size = 0
self.tags_count = {}
users = {}
root = ET.fromstring(input_string)
for child in root.getchildren():
user_id = int(child.get("OwnerUserId"))
post_type = int(child.get("PostTypeId"))
score = int(child.get("Score"))
#view_count = int(child.get("ViewCount"))
post_size = len(child.get("Body"))
tags = child.get("Tags")
if user_id not in users:
users[user_id] = User(user_id)
user = users[user_id]
if post_type == 1:
user.post_type_1_count += 1
else:
user.post_type_2_count += 1
user.aggregate_post_score += score
user.aggregate_post_size += post_size
if tags != None:
tags = tags.replace("<", " ").replace(">", " ").split()
for tag in tags:
if tag not in user.tags_count:
user.tags_count[tag] = 0
user.tags_count[tag] += 1
for i in users:
user = users[i]
out = "%d %d %d %d %d " % (user.id, user.post_type_1_count, user.post_type_2_count, user.aggregate_post_score, user.aggregate_post_size)
for tag in user.tags_count:
out += "%s %d " % (tag, user.tags_count[tag])
print out
posts_reducer.py:
#!/usr/bin/env python
import sys
class User(object):
def __init__(self, id):
self.id = id
self.post_type_1_count = 0
self.post_type_2_count = 0
self.aggregate_post_score = 0
self.aggregate_post_size = 0
self.tags_count = {}
users = {}
for line in sys.stdin:
vals = line.split()
user_id = int(vals[0])
post_type_1 = int(vals[1])
post_type_2 = int(vals[2])
aggregate_post_score = int(vals[3])
aggregate_post_size = int(vals[4])
tags = {}
if len(vals) > 5:
#this means we got tags
for i in range (5, len(vals), 2):
tag = vals[i]
count = int((vals[i+1]))
tags[tag] = count
if user_id not in users:
users[user_id] = User(user_id)
user = users[user_id]
user.post_type_1_count += post_type_1
user.post_type_2_count += post_type_2
user.aggregate_post_score += aggregate_post_score
user.aggregate_post_size += aggregate_post_size
for tag in tags:
if tag not in user.tags_count:
user.tags_count[tag] = 0
user.tags_count[tag] += tags[tag]
for i in users:
user = users[i]
out = "%d %d %d %d %d " % (user.id, user.post_type_1_count, user.post_type_2_count, user.aggregate_post_score, user.aggregate_post_size)
for tag in user.tags_count:
out += "%s %d " % (tag, user.tags_count[tag])
print out
I run the command:
bin/hadoop jar hadoop-streaming-2.6.0.jar -input /stackexchange/beer/posts -output /stackexchange/beer/results -mapper posts_mapper.py -reducer posts_reducer.py -file ~/mapreduce/posts_mapper.py -file ~/mapreduce/posts_reducer.py
and get the output:
packageJobJar: [/home/hduser/mapreduce/posts_mapper.py, /home/hduser/mapreduce/posts_reducer.py, /tmp/hadoop-unjar6585010774815976682/] [] /tmp/streamjob8863638738687983603.jar tmpDir=null
15/03/20 10:18:55 INFO client.RMProxy: Connecting to ResourceManager at Master/10.1.1.22:8040
15/03/20 10:18:55 INFO client.RMProxy: Connecting to ResourceManager at Master/10.1.1.22:8040
15/03/20 10:18:57 INFO mapred.FileInputFormat: Total input paths to process : 10
15/03/20 10:18:57 INFO mapreduce.JobSubmitter: number of splits:10
15/03/20 10:18:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1426769192808_0004
15/03/20 10:18:58 INFO impl.YarnClientImpl: Submitted application application_1426769192808_0004
15/03/20 10:18:58 INFO mapreduce.Job: The url to track the job: http://i-644dd931:8088/proxy/application_1426769192808_0004/
15/03/20 10:18:58 INFO mapreduce.Job: Running job: job_1426769192808_0004
15/03/20 10:19:11 INFO mapreduce.Job: Job job_1426769192808_0004 running in uber mode : false
15/03/20 10:19:11 INFO mapreduce.Job: map 0% reduce 0%
15/03/20 10:19:41 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000006_0, Status : FAILED
15/03/20 10:19:48 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000007_0, Status : FAILED
15/03/20 10:19:50 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000008_0, Status : FAILED
15/03/20 10:19:50 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000009_0, Status : FAILED
15/03/20 10:20:00 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000006_1, Status : FAILED
15/03/20 10:20:08 INFO mapreduce.Job: map 7% reduce 0%
15/03/20 10:20:10 INFO mapreduce.Job: map 20% reduce 0%
15/03/20 10:20:10 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000007_1, Status : FAILED
15/03/20 10:20:11 INFO mapreduce.Job: map 10% reduce 0%
15/03/20 10:20:17 INFO mapreduce.Job: map 20% reduce 0%
15/03/20 10:20:17 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000008_1, Status : FAILED
15/03/20 10:20:19 INFO mapreduce.Job: map 10% reduce 0%
15/03/20 10:20:19 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000009_1, Status : FAILED
15/03/20 10:20:22 INFO mapreduce.Job: map 20% reduce 0%
15/03/20 10:20:22 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000006_2, Status : FAILED
15/03/20 10:20:25 INFO mapreduce.Job: map 40% reduce 0%
15/03/20 10:20:25 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000002_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
15/03/20 10:20:28 INFO mapreduce.Job: map 50% reduce 0%
15/03/20 10:20:28 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000007_2, Status : FAILED
15/03/20 10:20:42 INFO mapreduce.Job: map 50% reduce 17%
15/03/20 10:20:52 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000008_2, Status : FAILED
15/03/20 10:20:54 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000009_2, Status : FAILED
15/03/20 10:20:56 INFO mapreduce.Job: map 90% reduce 0%
15/03/20 10:20:57 INFO mapreduce.Job: map 100% reduce 100%
15/03/20 10:20:58 INFO mapreduce.Job: Job job_1426769192808_0004 failed with state FAILED due to: Task failed task_1426769192808_0004_m_000006
Job failed as tasks failed. failedMaps:1 failedReduces:0

Unfortunately, hadoop does not show stderr for your python mapper/reducer so this output does not give any clue.
I would recommend you the following 2 throubleshooting steps:
Test your mapper/reducer locally:
cat {your_input_files} | ./posts_mapper.py | sort | ./posts_reducer.py
If you did not find any issue on step1, create the map reduce job and check the output logs:
yarn logs -applicationId application_1426769192808_0004
or
hdfs dfs -cat /var/log/hadoop-yarn/apps/{user}/logs/

Related

Map Reduce File Output Counter is zero

I am writing Map Reduce code for Inverted Indexing of a file which contains each line as "Doc_id Title Document Contents".
I am not able to figure out why File output format counter is zero although map reduce jobs are successfully completed without any Exception.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class InvertedIndex {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, Text> {
private Text word = new Text();
private Text docID_Title = new Text();
//RemoveStopWords is a different class
static RemoveStopWords rmvStpWrd = new RemoveStopWords();
//Stemmer is a different class
Stemmer stemmer = new Stemmer();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
rmvStpWrd.makeStopWordList();
StringTokenizer itr = new StringTokenizer(value.toString().replaceAll(" [^\\p{L}]", " "));
//fetching id of the document
String id = null;
String title = null;
if(itr.hasMoreTokens())
id = itr.nextToken();
//fetching title of the document
if(itr.hasMoreTokens())
title = itr.nextToken();
String ID_TITLE = id + title;
if(id!=null)
docID_Title.set(ID_TITLE);
while (itr.hasMoreTokens()) {
/*manipulation of tokens:
* First we remove stop words
* Then Stem the words
*/
String temp = itr.nextToken().toLowerCase();
if(RemoveStopWords.isStopWord(temp)) {
continue;
}
else {
//now the word is not a stop word
//we will stem it
char[] a;
stemmer.add((a = temp.toCharArray()), a.length);
stemmer.stem();
temp = stemmer.toString();
word.set(temp);
context.write(word, docID_Title);
}
}//end while
}//end map
}//end mapper
public static class IntSumReducer
extends Reducer<Text,Text,Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
//to iterate over the values
Iterator<Text> itr = values.iterator();
String old = itr.next().toString();
int freq = 1;
String next = null;
boolean isThere = true;
StringBuilder stringBuilder = new StringBuilder();
while(itr.hasNext()) {
//freq counts number of times a word comes in a document
freq = 1;
while((isThere = itr.hasNext())) {
next = itr.next().toString();
if(old == next)
freq++;
else {
//the loop break when we get different docID_Title for the word(key)
break;
}
//if more data is there
if(isThere) {
old = old +"_"+ freq;
stringBuilder.append(old);
stringBuilder.append(" | ");
old = next;
context.write(key, new Text(stringBuilder.toString()));
stringBuilder.setLength(0);
}
else {
//for the last key
freq++;
old = old +"_"+ freq;
stringBuilder.append(old);
stringBuilder.append(" | ");
old = next;
context.write(key, new Text(stringBuilder.toString()));
}//end else
}//end while
}//end while
}//end reduce
}//end reducer
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "InvertedIndex");
job.setJarByClass(InvertedIndex.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}//end main
}//end InvertexIndex
This is the output I am getting:
16/10/03 15:34:21 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/10/03 15:34:21 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/10/03 15:34:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/10/03 15:34:22 INFO input.FileInputFormat: Total input paths to process : 1
16/10/03 15:34:22 INFO mapreduce.JobSubmitter: number of splits:1
16/10/03 15:34:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local507694567_0001
16/10/03 15:34:22 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/10/03 15:34:22 INFO mapreduce.Job: Running job: job_local507694567_0001
16/10/03 15:34:22 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/10/03 15:34:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:22 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/10/03 15:34:22 INFO mapred.LocalJobRunner: Waiting for map tasks
16/10/03 15:34:22 INFO mapred.LocalJobRunner: Starting task: attempt_local507694567_0001_m_000000_0
16/10/03 15:34:22 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:22 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/10/03 15:34:22 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/sonu/ss.txt:0+1002072
16/10/03 15:34:23 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/10/03 15:34:23 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/10/03 15:34:23 INFO mapred.MapTask: soft limit at 83886080
16/10/03 15:34:23 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/10/03 15:34:23 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/10/03 15:34:23 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/10/03 15:34:23 INFO mapreduce.Job: Job job_local507694567_0001 running in uber mode : false
16/10/03 15:34:23 INFO mapreduce.Job: map 0% reduce 0%
16/10/03 15:34:24 INFO mapred.LocalJobRunner:
16/10/03 15:34:24 INFO mapred.MapTask: Starting flush of map output
16/10/03 15:34:24 INFO mapred.MapTask: Spilling map output
16/10/03 15:34:24 INFO mapred.MapTask: bufstart = 0; bufend = 2206696; bufvoid = 104857600
16/10/03 15:34:24 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25789248(103156992); length = 425149/6553600
16/10/03 15:34:24 INFO mapred.MapTask: Finished spill 0
16/10/03 15:34:24 INFO mapred.Task: Task:attempt_local507694567_0001_m_000000_0 is done. And is in the process of committing
16/10/03 15:34:24 INFO mapred.LocalJobRunner: map
16/10/03 15:34:24 INFO mapred.Task: Task 'attempt_local507694567_0001_m_000000_0' done.
16/10/03 15:34:24 INFO mapred.LocalJobRunner: Finishing task: attempt_local507694567_0001_m_000000_0
16/10/03 15:34:24 INFO mapred.LocalJobRunner: map task executor complete.
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Starting task: attempt_local507694567_0001_r_000000_0
16/10/03 15:34:25 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/10/03 15:34:25 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/10/03 15:34:25 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#5d0e7307
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=333971456, maxSingleShuffleLimit=83492864, mergeThreshold=220421168, ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/10/03 15:34:25 INFO reduce.EventFetcher: attempt_local507694567_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
16/10/03 15:34:25 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local507694567_0001_m_000000_0 decomp: 2 len: 6 to MEMORY
16/10/03 15:34:25 INFO reduce.InMemoryMapOutput: Read 2 bytes from map-output for attempt_local507694567_0001_m_000000_0
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->2
16/10/03 15:34:25 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
16/10/03 15:34:25 INFO mapred.Merger: Merging 1 sorted segments
16/10/03 15:34:25 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merged 1 segments, 2 bytes to disk to satisfy reduce memory limit
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merging 1 files, 6 bytes from disk
16/10/03 15:34:25 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
16/10/03 15:34:25 INFO mapred.Merger: Merging 1 sorted segments
16/10/03 15:34:25 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
16/10/03 15:34:25 INFO mapred.Task: Task:attempt_local507694567_0001_r_000000_0 is done. And is in the process of committing
16/10/03 15:34:25 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/10/03 15:34:25 INFO mapred.Task: Task attempt_local507694567_0001_r_000000_0 is allowed to commit now
16/10/03 15:34:25 INFO output.FileOutputCommitter: Saved output of task 'attempt_local507694567_0001_r_000000_0' to hdfs://localhost:9000/user/sonu/output/_temporary/0/task_local507694567_0001_r_000000
16/10/03 15:34:25 INFO mapred.LocalJobRunner: reduce > reduce
16/10/03 15:34:25 INFO mapred.Task: Task 'attempt_local507694567_0001_r_000000_0' done.
16/10/03 15:34:25 INFO mapred.LocalJobRunner: Finishing task: attempt_local507694567_0001_r_000000_0
16/10/03 15:34:25 INFO mapred.LocalJobRunner: reduce task executor complete.
16/10/03 15:34:25 INFO mapreduce.Job: map 100% reduce 100%
16/10/03 15:34:25 INFO mapreduce.Job: Job job_local507694567_0001 completed successfully
16/10/03 15:34:25 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=17342
FILE: Number of bytes written=571556
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2004144
HDFS: Number of bytes written=0
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Map-Reduce Framework
Map input records=53
Map output records=106288
Map output bytes=2206696
Map output materialized bytes=6
Input split bytes=103
Combine input records=106288
Combine output records=0
Reduce input groups=0
Reduce shuffle bytes=6
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=12
Total committed heap usage (bytes)=562036736
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1002072
File Output Format Counters
Bytes Written=0

How do i know if my hadoop mapreduce application is running in distributed mode

I'm very new in hadoop mapreduce, however i install the multinode cluster but i still get a sequential excution.
How can i work out if my program is running on the other machines in the cluster or not?
This is the result of execution :
Picked up _JAVA_OPTIONS: -Xmx1g
16/06/07 14:49:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/07 14:49:19 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/06/07 14:49:19 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/06/07 14:49:21 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/06/07 14:49:21 INFO input.FileInputFormat: Total input paths to process : 3
16/06/07 14:49:22 INFO mapreduce.JobSubmitter: number of splits:3
16/06/07 14:49:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1881318657_0001
16/06/07 14:49:24 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/06/07 14:49:24 INFO mapreduce.Job: Running job: job_local1881318657_0001
16/06/07 14:49:24 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/06/07 14:49:24 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/06/07 14:49:24 INFO mapred.LocalJobRunner: Waiting for map tasks
16/06/07 14:49:24 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_m_000000_0
16/06/07 14:49:24 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 14:49:24 INFO mapred.MapTask: Processing split: hdfs://master:9000/input/leukemia.txt:0+1172207
16/06/07 14:49:24 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/06/07 14:49:24 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/06/07 14:49:24 INFO mapred.MapTask: soft limit at 83886080
16/06/07 14:49:24 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/06/07 14:49:24 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/06/07 14:49:24 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/06/07 14:49:25 INFO mapreduce.Job: Job job_local1881318657_0001 running in uber mode : false
16/06/07 14:49:25 INFO mapreduce.Job: map 0% reduce 0%
16/06/07 14:49:31 INFO mapred.LocalJobRunner: map > map
16/06/07 14:49:31 INFO mapreduce.Job: map 22% reduce 0%
-3.042421771435325E-9
-3.042421771435325E-9
-3.042421771435325E-9
-3.042421771435325E-9
-3.042421771435325E-9
-2.9889415942690763E-9
-2.9889415942690763E-9
-2.9889415942690763E-9
-2.9287384547432996E-9
-2.898469757139896E-9
-2.898469757139896E-9
-2.880377562441664E-9
-2.880377562441664E-9
-2.880377562441664E-9
-2.8430632294667886E-9
-2.819146987128837E-9
-2.819146987128837E-9
-2.819146987128837E-9
-2.819146987128837E-9
-2.819146987128837E-9
931
16/06/07 15:00:44 INFO mapred.LocalJobRunner: map > map
16/06/07 15:00:44 INFO mapred.MapTask: Starting flush of map output
16/06/07 15:00:44 INFO mapred.MapTask: Spilling map output
16/06/07 15:00:44 INFO mapred.MapTask: bufstart = 0; bufend = 14151; bufvoid = 104857600
16/06/07 15:00:44 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
16/06/07 15:00:46 INFO mapred.MapTask: Finished spill 0
16/06/07 15:00:46 INFO mapred.Task: Task:attempt_local1881318657_0001_m_000000_0 is done. And is in the process of committing
16/06/07 15:00:47 INFO mapred.LocalJobRunner: map
16/06/07 15:00:47 INFO mapred.Task: Task 'attempt_local1881318657_0001_m_000000_0' done.
16/06/07 15:00:47 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_m_000000_0
16/06/07 15:00:47 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_m_000001_0
16/06/07 15:00:48 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 15:00:48 INFO mapred.MapTask: Processing split: hdfs://master:9000/input/leukemia1.txt:0+1172207
16/06/07 15:00:48 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/06/07 15:00:48 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/06/07 15:00:48 INFO mapred.MapTask: soft limit at 83886080
16/06/07 15:00:48 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/06/07 15:00:48 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/06/07 15:00:48 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/06/07 15:00:48 INFO mapreduce.Job: map 100% reduce 0%
16/06/07 15:01:47 INFO mapred.LocalJobRunner: map > map
16/06/07 15:01:48 INFO mapreduce.Job: map 56% reduce 0%
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.0279963370711145E-9
-3.001716001136338E-9
-2.997252637652067E-9
-2.997252637652067E-9
-2.9593407930592893E-9
-2.9178102507568847E-9
-2.9178102507568847E-9
-2.9178102507568847E-9
-2.8542232742481287E-9
-2.8542232742481287E-9
-2.8510431833778047E-9
-2.8510431833778047E-9
-2.8510431833778047E-9
-2.8510431833778047E-9
-2.8222418341121026E-9
-2.8222418341121026E-9
907
16/06/07 15:11:30 INFO mapred.LocalJobRunner: map > map
16/06/07 15:11:30 INFO mapred.MapTask: Starting flush of map output
16/06/07 15:11:30 INFO mapred.MapTask: Spilling map output
16/06/07 15:11:30 INFO mapred.MapTask: bufstart = 0; bufend = 14151; bufvoid = 104857600
16/06/07 15:11:30 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
16/06/07 15:11:30 INFO mapred.MapTask: Finished spill 0
16/06/07 15:11:30 INFO mapred.Task: Task:attempt_local1881318657_0001_m_000001_0 is done. And is in the process of committing
16/06/07 15:11:30 INFO mapred.LocalJobRunner: map
16/06/07 15:11:30 INFO mapred.Task: Task 'attempt_local1881318657_0001_m_000001_0' done.
16/06/07 15:11:30 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_m_000001_0
16/06/07 15:11:30 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_m_000002_0
16/06/07 15:11:30 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 15:11:30 INFO mapred.MapTask: Processing split: hdfs://master:9000/input/leukemia2.txt:0+1172207
16/06/07 15:11:30 INFO mapreduce.Job: map 100% reduce 0%
16/06/07 15:11:31 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/06/07 15:11:31 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/06/07 15:11:31 INFO mapred.MapTask: soft limit at 83886080
16/06/07 15:11:31 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/06/07 15:11:31 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/06/07 15:11:31 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/06/07 15:11:37 INFO mapred.LocalJobRunner: map > map
16/06/07 15:11:38 INFO mapreduce.Job: map 89% reduce 0%
-3.064963887619912E-9
-3.064963887619912E-9
-3.064963887619912E-9
-3.064963887619912E-9
-3.064963887619912E-9
-3.0090989883906007E-9
-2.9474075636124447E-9
-2.9474075636124447E-9
-2.9474075636124447E-9
-2.9388849943338927E-9
-2.9388849943338927E-9
-2.8915704649620403E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
-2.8102046711682226E-9
925
16/06/07 15:20:19 INFO mapred.LocalJobRunner: map > map
16/06/07 15:20:19 INFO mapred.MapTask: Starting flush of map output
16/06/07 15:20:19 INFO mapred.MapTask: Spilling map output
16/06/07 15:20:19 INFO mapred.MapTask: bufstart = 0; bufend = 14151; bufvoid = 104857600
16/06/07 15:20:19 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
16/06/07 15:20:20 INFO mapred.MapTask: Finished spill 0
16/06/07 15:20:20 INFO mapred.Task: Task:attempt_local1881318657_0001_m_000002_0 is done. And is in the process of committing
16/06/07 15:20:22 INFO mapred.LocalJobRunner: map
16/06/07 15:20:22 INFO mapred.Task: Task 'attempt_local1881318657_0001_m_000002_0' done.
16/06/07 15:20:22 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_m_000002_0
16/06/07 15:20:22 INFO mapred.LocalJobRunner: map task executor complete.
16/06/07 15:20:22 INFO mapreduce.Job: map 100% reduce 0%
16/06/07 15:20:23 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/06/07 15:20:23 INFO mapred.LocalJobRunner: Starting task: attempt_local1881318657_0001_r_000000_0
16/06/07 15:20:24 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/06/07 15:20:24 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle#7f5be2d5
16/06/07 15:20:25 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=668309888, maxSingleShuffleLimit=167077472, mergeThreshold=441084544, ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/06/07 15:20:25 INFO reduce.EventFetcher: attempt_local1881318657_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
16/06/07 15:20:28 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1881318657_0001_m_000002_0 decomp: 14157 len: 14161 to MEMORY
16/06/07 15:20:29 INFO reduce.InMemoryMapOutput: Read 14157 bytes from map-output for attempt_local1881318657_0001_m_000002_0
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14157, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->14157
16/06/07 15:20:30 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1881318657_0001_m_000001_0 decomp: 14157 len: 14161 to MEMORY
16/06/07 15:20:30 INFO reduce.InMemoryMapOutput: Read 14157 bytes from map-output for attempt_local1881318657_0001_m_000001_0
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14157, inMemoryMapOutputs.size() -> 2, commitMemory -> 14157, usedMemory ->28314
16/06/07 15:20:30 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1881318657_0001_m_000000_0 decomp: 14157 len: 14161 to MEMORY
16/06/07 15:20:30 INFO reduce.InMemoryMapOutput: Read 14157 bytes from map-output for attempt_local1881318657_0001_m_000000_0
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 14157, inMemoryMapOutputs.size() -> 3, commitMemory -> 28314, usedMemory ->42471
16/06/07 15:20:30 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
16/06/07 15:20:30 INFO mapred.LocalJobRunner: 3 / 3 copied.
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: finalMerge called with 3 in-memory map-outputs and 0 on-disk map-outputs
16/06/07 15:20:30 INFO mapred.Merger: Merging 3 sorted segments
16/06/07 15:20:30 INFO mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 42435 bytes
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: Merged 3 segments, 42471 bytes to disk to satisfy reduce memory limit
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: Merging 1 files, 42471 bytes from disk
16/06/07 15:20:30 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
16/06/07 15:20:30 INFO mapred.Merger: Merging 1 sorted segments
16/06/07 15:20:30 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 42455 bytes
16/06/07 15:20:30 INFO mapred.LocalJobRunner: 3 / 3 copied.
16/06/07 15:20:33 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:33 INFO mapreduce.Job: map 100% reduce 67%
16/06/07 15:20:36 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:38 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
16/06/07 15:20:42 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:42 INFO mapreduce.Job: map 100% reduce 100%
16/06/07 15:20:44 INFO mapred.Task: Task:attempt_local1881318657_0001_r_000000_0 is done. And is in the process of committing
16/06/07 15:20:44 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:44 INFO mapred.Task: Task attempt_local1881318657_0001_r_000000_0 is allowed to commit now
16/06/07 15:20:45 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1881318657_0001_r_000000_0' to hdfs://master:9000/output2/_temporary/0/task_local1881318657_0001_r_000000
16/06/07 15:20:45 INFO mapred.LocalJobRunner: reduce > reduce
16/06/07 15:20:45 INFO mapred.Task: Task 'attempt_local1881318657_0001_r_000000_0' done.
16/06/07 15:20:45 INFO mapred.LocalJobRunner: Finishing task: attempt_local1881318657_0001_r_000000_0
16/06/07 15:20:45 INFO mapred.LocalJobRunner: reduce task executor complete.
16/06/07 15:20:45 INFO mapreduce.Job: Job job_local1881318657_0001 completed successfully
16/06/07 15:20:46 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=177067554
FILE: Number of bytes written=179551452
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=10549863
HDFS: Number of bytes written=42438
HDFS: Number of read operations=37
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
Map-Reduce Framework
Map input records=3
Map output records=3
Map output bytes=42453
Map output materialized bytes=42483
Input split bytes=557
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=42483
Reduce input records=3
Reduce output records=3
Spilled Records=6
Shuffled Maps =3
Failed Shuffles=0
Merged Map outputs=3
GC time elapsed (ms)=227283
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=2477260800
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=42438
peace
By the job ID. Your's says: job_local1881318657_0001 running in uber mode : false. Which is a local job. If you ran on a cluster it would just be the job and the identifiers of the app master and attempts.
You need to check the JobTracker ( default port 50030) and explore the job id details mentioned in the above logs.
You can monitor the jobs at:
localhost:8088

java.lang.NullPointerException at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.close

I am running two map-reduce pairs. The output of first map-reduce is being used as the input for the next map-reduce. In order to do that I have given the job.setOutputFormatClass(SequenceFileOutputFormat.class). While running the following Driver class:
package org;
import org.apache.commons.configuration.ConfigurationFactory;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.VarLongWritable;
import org.apache.mahout.math.VectorWritable;
public class Driver1 extends Configured implements Tool
{
public int run(String[] args) throws Exception
{
if(args.length !=3) {
System.err.println("Usage: MaxTemperatureDriver <input path> <outputpath>");
System.exit(-1);
}
//ConfFactory WorkFlow=new ConfFactory(new Path("/input.txt"),new Path("/output.txt"),TextInputFormat.class,VarLongWritable.class,Text.class,VarLongWritable.class,VectorWritable.class,SequenceFileOutputFormat.class);
Job job = new Job();
Job job1=new Job();
job.setJarByClass(Driver1.class);
job.setJobName("Max Temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
job.setMapperClass(UserVectorMapper.class);
job.setReducerClass(UserVectorReducer.class);
job.setOutputKeyClass(VarLongWritable.class);
job.setOutputValueClass(VectorWritable.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job1.setJarByClass(Driver1.class);
//job.setJobName("Max Temperature");
job1.setInputFormatClass(SequenceFileInputFormat.class);
FileInputFormat.addInputPath(job1, new Path("output/part-r-00000"));
FileOutputFormat.setOutputPath(job1,new Path(args[2]));
job1.setMapperClass(ItemToItemPrefMapper.class);
//job1.setReducerClass(UserVectorReducer.class);
job1.setOutputKeyClass(VectorWritable.class);
job1.setOutputValueClass(VectorWritable.class);
job1.setOutputFormatClass(SequenceFileOutputFormat.class);
System.exit(job.waitForCompletion(true) && job1.waitForCompletion(true) ? 0:1);
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
Driver1 driver = new Driver1();
int exitCode = ToolRunner.run(driver, args);
System.exit(exitCode);
}
}
I am getting the following runtime log.
15/02/24 20:00:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/24 20:00:49 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/02/24 20:00:49 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/02/24 20:00:49 INFO input.FileInputFormat: Total input paths to process : 1
15/02/24 20:00:49 WARN snappy.LoadSnappy: Snappy native library not loaded
15/02/24 20:00:49 INFO mapred.JobClient: Running job: job_local1723586736_0001
15/02/24 20:00:49 INFO mapred.LocalJobRunner: Waiting for map tasks
15/02/24 20:00:49 INFO mapred.LocalJobRunner: Starting task: attempt_local1723586736_0001_m_000000_0
15/02/24 20:00:49 INFO util.ProcessTree: setsid exited with exit code 0
15/02/24 20:00:49 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin#1185f32
15/02/24 20:00:49 INFO mapred.MapTask: Processing split: file:/home/smaiti/workspace/recommendationsy/data.txt:0+1979173
15/02/24 20:00:50 INFO mapred.MapTask: io.sort.mb = 100
15/02/24 20:00:50 INFO mapred.MapTask: data buffer = 79691776/99614720
15/02/24 20:00:50 INFO mapred.MapTask: record buffer = 262144/327680
15/02/24 20:00:50 INFO mapred.JobClient: map 0% reduce 0%
15/02/24 20:00:50 INFO mapred.MapTask: Starting flush of map output
15/02/24 20:00:51 INFO mapred.MapTask: Finished spill 0
15/02/24 20:00:51 INFO mapred.Task: Task:attempt_local1723586736_0001_m_000000_0 is done. And is in the process of commiting
15/02/24 20:00:51 INFO mapred.LocalJobRunner:
15/02/24 20:00:51 INFO mapred.Task: Task 'attempt_local1723586736_0001_m_000000_0' done.
15/02/24 20:00:51 INFO mapred.LocalJobRunner: Finishing task: attempt_local1723586736_0001_m_000000_0
15/02/24 20:00:51 INFO mapred.LocalJobRunner: Map task executor complete.
15/02/24 20:00:51 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin#9cce9
15/02/24 20:00:51 INFO mapred.LocalJobRunner:
15/02/24 20:00:51 INFO mapred.Merger: Merging 1 sorted segments
15/02/24 20:00:51 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 2074779 bytes
15/02/24 20:00:51 INFO mapred.LocalJobRunner:
15/02/24 20:00:51 INFO mapred.Task: Task:attempt_local1723586736_0001_r_000000_0 is done. And is in the process of commiting
15/02/24 20:00:51 INFO mapred.LocalJobRunner:
15/02/24 20:00:51 INFO mapred.Task: Task attempt_local1723586736_0001_r_000000_0 is allowed to commit now
15/02/24 20:00:51 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1723586736_0001_r_000000_0' to output
15/02/24 20:00:51 INFO mapred.LocalJobRunner: reduce > reduce
15/02/24 20:00:51 INFO mapred.Task: Task 'attempt_local1723586736_0001_r_000000_0' done.
15/02/24 20:00:51 INFO mapred.JobClient: map 100% reduce 100%
15/02/24 20:00:51 INFO mapred.JobClient: Job complete: job_local1723586736_0001
15/02/24 20:00:51 INFO mapred.JobClient: Counters: 20
15/02/24 20:00:51 INFO mapred.JobClient: File Output Format Counters
15/02/24 20:00:51 INFO mapred.JobClient: Bytes Written=1012481
15/02/24 20:00:51 INFO mapred.JobClient: File Input Format Counters
15/02/24 20:00:51 INFO mapred.JobClient: Bytes Read=1979173
15/02/24 20:00:51 INFO mapred.JobClient: FileSystemCounters
15/02/24 20:00:51 INFO mapred.JobClient: FILE_BYTES_READ=6033479
15/02/24 20:00:51 INFO mapred.JobClient: FILE_BYTES_WRITTEN=5264031
15/02/24 20:00:51 INFO mapred.JobClient: Map-Reduce Framework
15/02/24 20:00:51 INFO mapred.JobClient: Reduce input groups=943
15/02/24 20:00:51 INFO mapred.JobClient: Map output materialized bytes=2074783
15/02/24 20:00:51 INFO mapred.JobClient: Combine output records=0
15/02/24 20:00:51 INFO mapred.JobClient: Map input records=100000
15/02/24 20:00:51 INFO mapred.JobClient: Reduce shuffle bytes=0
15/02/24 20:00:51 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
15/02/24 20:00:51 INFO mapred.JobClient: Reduce output records=943
15/02/24 20:00:51 INFO mapred.JobClient: Spilled Records=200000
15/02/24 20:00:51 INFO mapred.JobClient: Map output bytes=1874777
15/02/24 20:00:51 INFO mapred.JobClient: Total committed heap usage (bytes)=415760384
15/02/24 20:00:51 INFO mapred.JobClient: CPU time spent (ms)=0
15/02/24 20:00:51 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
15/02/24 20:00:51 INFO mapred.JobClient: SPLIT_RAW_BYTES=118
15/02/24 20:00:51 INFO mapred.JobClient: Map output records=100000
15/02/24 20:00:51 INFO mapred.JobClient: Combine input records=0
15/02/24 20:00:51 INFO mapred.JobClient: Reduce input records=100000
15/02/24 20:00:51 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/02/24 20:00:51 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/02/24 20:00:51 INFO input.FileInputFormat: Total input paths to process : 1
15/02/24 20:00:51 INFO mapred.JobClient: Running job: job_local735350013_0002
15/02/24 20:00:51 INFO mapred.LocalJobRunner: Waiting for map tasks
15/02/24 20:00:51 INFO mapred.LocalJobRunner: Starting task: attempt_local735350013_0002_m_000000_0
15/02/24 20:00:51 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin#1a970
15/02/24 20:00:51 INFO mapred.MapTask: Processing split: file:/home/smaiti/workspace/recommendationsy/output/part-r-00000:0+1004621
15/02/24 20:00:51 INFO mapred.MapTask: io.sort.mb = 100
15/02/24 20:00:51 INFO mapred.MapTask: data buffer = 79691776/99614720
15/02/24 20:00:51 INFO mapred.MapTask: record buffer = 262144/327680
15/02/24 20:00:51 INFO mapred.MapTask: Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader#9cc591
java.lang.NullPointerException
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.close(SequenceFileRecordReader.java:101)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:496)
at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1776)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:778)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/24 20:00:51 INFO mapred.LocalJobRunner: Map task executor complete.
15/02/24 20:00:51 WARN mapred.LocalJobRunner: job_local735350013_0002
java.lang.Exception: java.lang.ClassCastException: class org.apache.mahout.math.VectorWritable
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.ClassCastException: class org.apache.mahout.math.VectorWritable
at java.lang.Class.asSubclass(Class.java:3208)
at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:795)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:964)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:673)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/24 20:00:52 INFO mapred.JobClient: map 0% reduce 0%
15/02/24 20:00:52 INFO mapred.JobClient: Job complete: job_local735350013_0002
15/02/24 20:00:52 INFO mapred.JobClient: Counters: 0
The first exception that I am getting is this:
java.lang.NullPointerException
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.close(SequenceFileRecordReader.java:101)
Please help.
This is mainly because Hadoop is confused while Serializing the data.
Make sure to
You should set Input and output file format class to both the reducers.
Check that Inputformat of second class is OutputFormat of first class.
It might be possible that intermediate file format is different from what the reducer is expecting to read.
Maintain consistent FileFormats across your program.

Hadoop MapReduce: running two jobs inside one Configured/Tool produces no output on 2nd job

I need to run 2 map reduce jobs such that the 2nd takes as input the output from the first job. I'd like to do this within a single invocation, where MyClass extends Configured and implements Tool.
I've written the code, and it works as long as I don't run the two jobs within the same invocation (this works):
hadoop jar myjar.jar path.to.my.class.MyClass -i input -o output -m job1
hadoop jar myjar.jar path.to.my.class.MyClass -i dummy -o output -m job2
But this doesn't:
hadoop jar myjar.jar path.to.my.class.MyClass -i input -o output -m all
(-m stands for "mode")
In this case, the output of the first job does not make it to the mappers of the 2nd job (I figured this out by debugging), but I can't figure out why.
I've seen other posts on chaining, but they are for the "old" mapred api. And I need to run 3rd party code between the jobs, so I don't know if ChainMapper/ChainReducer will work for my use case.
Using hadoop version 1.0.3, AWS Elastic MapReduce distribution.
Code:
import java.io.IOException;
import org.apache.commons.cli.BasicParser;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.OptionBuilder;
import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MyClass extends Configured implements Tool {
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new HBasePrep(), args);
System.exit(res);
}
#Override
public int run(String[] args) throws Exception {
CommandLineParser parser = new BasicParser();
Options allOptions = setupOptions();
Configuration conf = getConf();
String[] argv_ = new GenericOptionsParser(conf, args).getRemainingArgs();
CommandLine cmdLine = parser.parse(allOptions, argv_);
boolean doJob1 = true;
boolean doJob2 = true;
if (cmdLine.hasOption('m')) {
String mode = cmdLine.getOptionValue('m');
if ("job1".equals(mode)) {
doJob2 = false;
} else if ("job2".equals(mode)){
doJob1 = false;
}
}
Path outPath = new Path(cmdLine.getOptionValue("output"), "job1out");
Job job = new Job(conf, "HBase Prep For Data Build");
Job job2 = new Job(conf, "HBase SessionIndex load");
if (doJob1) {
conf = job.getConfiguration();
String[] values = cmdLine.getOptionValues("input");
if (values != null && values.length > 0) {
for (String input : values) {
System.out.println("input:" + input);
FileInputFormat.addInputPaths(job, input);
}
}
job.setJarByClass(HBasePrep.class);
job.setMapperClass(SessionMapper.class);
MultipleOutputs.setCountersEnabled(job, false);
MultipleOutputs.addNamedOutput(job, "sessionindex", TextOutputFormat.class, Text.class, Text.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
job.setOutputFormatClass(HFileOutputFormat.class);
HTable hTable = new HTable(conf, "session");
// Auto configure partitioner and reducer
HFileOutputFormat.configureIncrementalLoad(job, hTable);
FileOutputFormat.setOutputPath(job, outPath);
if (!job.waitForCompletion(true)) {
return 1;
}
// Load generated HFiles into table
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(outPath, hTable);
FileSystem fs = FileSystem.get(outPath.toUri(), conf);
fs.delete(new Path(outPath, "cf"), true); # i delete this because after the hbase build load, it is left an empty directory which causes problems later
}
/////////////////////////////////////////////
// SECOND JOB //
/////////////////////////////////////////////
if (doJob2) {
conf = job2.getConfiguration();
System.out.println("-- job 2 input path : " + outPath.toString());
FileInputFormat.setInputPaths(job2, outPath.toString());
job2.setJarByClass(HBasePrep.class);
job2.setMapperClass(SessionIndexMapper.class);
MultipleOutputs.setCountersEnabled(job2, false);
job2.setMapOutputKeyClass(ImmutableBytesWritable.class);
job2.setMapOutputValueClass(KeyValue.class);
job2.setOutputFormatClass(HFileOutputFormat.class);
HTable hTable = new HTable(conf, "session_index_by_hour");
// Auto configure partitioner and reducer
HFileOutputFormat.configureIncrementalLoad(job2, hTable);
outPath = new Path(cmdLine.getOptionValue("output"), "job2out");
System.out.println("-- job 2 output path: " + outPath.toString());
FileOutputFormat.setOutputPath(job2, outPath);
if (!job2.waitForCompletion(true)) {
return 2;
}
// Load generated HFiles into table
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(outPath, hTable);
}
return 0;
}
public static class SessionMapper extends
Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
private MultipleOutputs<ImmutableBytesWritable, KeyValue> multiOut;
#Override
public void setup(Context context) throws IOException {
multiOut = new MultipleOutputs<ImmutableBytesWritable, KeyValue>(context);
}
#Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
...
context.write(..., ...); # this is called mutiple times
multiOut.write("sessionindex", new Text(...), new Text(...), "sessionindex");
}
}
public static class SessionIndexMapper extends
Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
#Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.write(new ImmutableBytesWritable(...), new KeyValue(...));
}
}
private static Options setupOptions() {
Option input = createOption("i", "input",
"input file(s) for the Map step", "path", Integer.MAX_VALUE,
true);
Option output = createOption("o", "output",
"output directory for the Reduce step", "path", 1, true);
Option mode = createOption("m", "mode",
"what mode ('all', 'job1', 'job2')", "-mode-", 1, false);
return new Options().addOption(input).addOption(output)
.addOption(mode);
}
public static Option createOption(String name, String longOpt, String desc,
String argName, int max, boolean required) {
OptionBuilder.withArgName(argName);
OptionBuilder.hasArgs(max);
OptionBuilder.withDescription(desc);
OptionBuilder.isRequired(required);
OptionBuilder.withLongOpt(longOpt);
return OptionBuilder.create(name);
}
}
Output (single invocation):
input:s3n://...snip...
13/12/09 23:08:43 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/09 23:08:43 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/12/09 23:08:43 INFO compress.CodecPool: Got brand-new compressor
13/12/09 23:08:43 INFO mapred.JobClient: Default number of map tasks: null
13/12/09 23:08:43 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 2
13/12/09 23:08:43 INFO mapred.JobClient: Default number of reduce tasks: 1
13/12/09 23:08:43 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
13/12/09 23:08:43 INFO mapred.JobClient: Setting group to hadoop
13/12/09 23:08:43 INFO input.FileInputFormat: Total input paths to process : 1
13/12/09 23:08:43 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/12/09 23:08:43 WARN lzo.LzoCodec: Could not find build properties file with revision hash
13/12/09 23:08:43 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
13/12/09 23:08:43 WARN snappy.LoadSnappy: Snappy native library is available
13/12/09 23:08:43 INFO snappy.LoadSnappy: Snappy native library loaded
13/12/09 23:08:44 INFO mapred.JobClient: Running job: job_201312062235_0044
13/12/09 23:08:45 INFO mapred.JobClient: map 0% reduce 0%
13/12/09 23:09:09 INFO mapred.JobClient: map 100% reduce 0%
13/12/09 23:09:27 INFO mapred.JobClient: map 100% reduce 100%
13/12/09 23:09:32 INFO mapred.JobClient: Job complete: job_201312062235_0044
13/12/09 23:09:32 INFO mapred.JobClient: Counters: 42
13/12/09 23:09:32 INFO mapred.JobClient: MyCounter1
13/12/09 23:09:32 INFO mapred.JobClient: ValidCurrentDay=3526
13/12/09 23:09:32 INFO mapred.JobClient: Job Counters
13/12/09 23:09:32 INFO mapred.JobClient: Launched reduce tasks=1
13/12/09 23:09:32 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=19693
13/12/09 23:09:32 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/09 23:09:32 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/12/09 23:09:32 INFO mapred.JobClient: Rack-local map tasks=1
13/12/09 23:09:32 INFO mapred.JobClient: Launched map tasks=1
13/12/09 23:09:32 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15201
13/12/09 23:09:32 INFO mapred.JobClient: File Output Format Counters
13/12/09 23:09:32 INFO mapred.JobClient: Bytes Written=1979245
13/12/09 23:09:32 INFO mapred.JobClient: FileSystemCounters
13/12/09 23:09:32 INFO mapred.JobClient: S3N_BYTES_READ=51212
13/12/09 23:09:32 INFO mapred.JobClient: FILE_BYTES_READ=400417
13/12/09 23:09:32 INFO mapred.JobClient: HDFS_BYTES_READ=231
13/12/09 23:09:32 INFO mapred.JobClient: FILE_BYTES_WRITTEN=859881
13/12/09 23:09:32 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2181624
13/12/09 23:09:32 INFO mapred.JobClient: File Input Format Counters
13/12/09 23:09:32 INFO mapred.JobClient: Bytes Read=51212
13/12/09 23:09:32 INFO mapred.JobClient: MyCounter2
13/12/09 23:09:32 INFO mapred.JobClient: ASCII=3526
13/12/09 23:09:32 INFO mapred.JobClient: StatsUnaggregatedMapEventTypeCurrentDay
13/12/09 23:09:32 INFO mapred.JobClient: adProgress0=343
13/12/09 23:09:32 INFO mapred.JobClient: asset=562
13/12/09 23:09:32 INFO mapred.JobClient: podComplete=612
13/12/09 23:09:32 INFO mapred.JobClient: adProgress100=247
13/12/09 23:09:32 INFO mapred.JobClient: adProgress25=247
13/12/09 23:09:32 INFO mapred.JobClient: click=164
13/12/09 23:09:32 INFO mapred.JobClient: adProgress50=247
13/12/09 23:09:32 INFO mapred.JobClient: adCall=244
13/12/09 23:09:32 INFO mapred.JobClient: adProgress75=247
13/12/09 23:09:32 INFO mapred.JobClient: podStart=613
13/12/09 23:09:32 INFO mapred.JobClient: Map-Reduce Framework
13/12/09 23:09:32 INFO mapred.JobClient: Map output materialized bytes=400260
13/12/09 23:09:32 INFO mapred.JobClient: Map input records=3526
13/12/09 23:09:32 INFO mapred.JobClient: Reduce shuffle bytes=400260
13/12/09 23:09:32 INFO mapred.JobClient: Spilled Records=14104
13/12/09 23:09:32 INFO mapred.JobClient: Map output bytes=2343990
13/12/09 23:09:32 INFO mapred.JobClient: Total committed heap usage (bytes)=497549312
13/12/09 23:09:32 INFO mapred.JobClient: CPU time spent (ms)=10120
13/12/09 23:09:32 INFO mapred.JobClient: Combine input records=0
13/12/09 23:09:32 INFO mapred.JobClient: SPLIT_RAW_BYTES=231
13/12/09 23:09:32 INFO mapred.JobClient: Reduce input records=7052
13/12/09 23:09:32 INFO mapred.JobClient: Reduce input groups=246
13/12/09 23:09:32 INFO mapred.JobClient: Combine output records=0
13/12/09 23:09:32 INFO mapred.JobClient: Physical memory (bytes) snapshot=519942144
13/12/09 23:09:32 INFO mapred.JobClient: Reduce output records=7052
13/12/09 23:09:32 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3076526080
13/12/09 23:09:32 INFO mapred.JobClient: Map output records=7052
13/12/09 23:09:32 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://10.91.18.96:9000/path/job1out/_SUCCESS
13/12/09 23:09:32 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://10.91.18.96:9000/path/job1out/sessionindex-m-00000
1091740526
-- job 2 input path : /path/job1out
-- job 2 output path: /path/job2out
13/12/09 23:09:32 INFO mapred.JobClient: Default number of map tasks: null
13/12/09 23:09:32 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 2
13/12/09 23:09:32 INFO mapred.JobClient: Default number of reduce tasks: 1
13/12/09 23:09:33 INFO mapred.JobClient: Setting group to hadoop
13/12/09 23:09:33 INFO input.FileInputFormat: Total input paths to process : 1
13/12/09 23:09:33 INFO mapred.JobClient: Running job: job_201312062235_0045
13/12/09 23:09:34 INFO mapred.JobClient: map 0% reduce 0%
13/12/09 23:09:51 INFO mapred.JobClient: map 100% reduce 0%
13/12/09 23:10:03 INFO mapred.JobClient: map 100% reduce 33%
13/12/09 23:10:06 INFO mapred.JobClient: map 100% reduce 100%
13/12/09 23:10:11 INFO mapred.JobClient: Job complete: job_201312062235_0045
13/12/09 23:10:11 INFO mapred.JobClient: Counters: 27
13/12/09 23:10:11 INFO mapred.JobClient: Job Counters
13/12/09 23:10:11 INFO mapred.JobClient: Launched reduce tasks=1
13/12/09 23:10:11 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=13533
13/12/09 23:10:11 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/09 23:10:11 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/12/09 23:10:11 INFO mapred.JobClient: Launched map tasks=1
13/12/09 23:10:11 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=12176
13/12/09 23:10:11 INFO mapred.JobClient: File Output Format Counters
13/12/09 23:10:11 INFO mapred.JobClient: Bytes Written=0
13/12/09 23:10:11 INFO mapred.JobClient: FileSystemCounters
13/12/09 23:10:11 INFO mapred.JobClient: FILE_BYTES_READ=173
13/12/09 23:10:11 INFO mapred.JobClient: HDFS_BYTES_READ=134
13/12/09 23:10:11 INFO mapred.JobClient: FILE_BYTES_WRITTEN=57735
13/12/09 23:10:11 INFO mapred.JobClient: File Input Format Counters
13/12/09 23:10:11 INFO mapred.JobClient: Bytes Read=0
13/12/09 23:10:11 INFO mapred.JobClient: Map-Reduce Framework
13/12/09 23:10:11 INFO mapred.JobClient: Map output materialized bytes=16
13/12/09 23:10:11 INFO mapred.JobClient: Map input records=0
13/12/09 23:10:11 INFO mapred.JobClient: Reduce shuffle bytes=16
13/12/09 23:10:11 INFO mapred.JobClient: Spilled Records=0
13/12/09 23:10:11 INFO mapred.JobClient: Map output bytes=0
13/12/09 23:10:11 INFO mapred.JobClient: Total committed heap usage (bytes)=434634752
13/12/09 23:10:11 INFO mapred.JobClient: CPU time spent (ms)=2270
13/12/09 23:10:11 INFO mapred.JobClient: Combine input records=0
13/12/09 23:10:11 INFO mapred.JobClient: SPLIT_RAW_BYTES=134
13/12/09 23:10:11 INFO mapred.JobClient: Reduce input records=0
13/12/09 23:10:11 INFO mapred.JobClient: Reduce input groups=0
13/12/09 23:10:11 INFO mapred.JobClient: Combine output records=0
13/12/09 23:10:11 INFO mapred.JobClient: Physical memory (bytes) snapshot=423612416
13/12/09 23:10:11 INFO mapred.JobClient: Reduce output records=0
13/12/09 23:10:11 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3058089984
13/12/09 23:10:11 INFO mapred.JobClient: Map output records=0
13/12/09 23:10:11 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://10.91.18.96:9000/path/job2out/_SUCCESS
13/12/09 23:10:11 WARN mapreduce.LoadIncrementalHFiles: Bulk load operation did not find any files to load in directory /path/job2out. Does it contain files in subdirectories that correspond to column family names?

ORDER BY job failed in the Pig script while running EmbeddedPig using Java

I have this following pig script, which works perfectly using grunt shell (stored the results to HDFS without any issues); however, the last job (ORDER BY) failed if I ran the same script using Java EmbeddedPig. If I replace the ORDER BY job by others, such as GROUP or FOREACH GENERATE, the whole script then succeeded in Java EmbeddedPig. So I think it's the ORDER BY which causes the issue. Anyone has any experience with this? Any help would be appreciated!
The Pig script:
REGISTER pig-udf-0.0.1-SNAPSHOT.jar;
user_similarity = LOAD '/tmp/sample-sim-score-results-31/part-r-00000' USING PigStorage('\t') AS (user_id: chararray, sim_user_id: chararray, basic_sim_score: float, alt_sim_score: float);
simplified_user_similarity = FOREACH user_similarity GENERATE $0 AS user_id, $1 AS sim_user_id, $2 AS sim_score;
grouped_user_similarity = GROUP simplified_user_similarity BY user_id;
ordered_user_similarity = FOREACH grouped_user_similarity {
sorted = ORDER simplified_user_similarity BY sim_score DESC;
top = LIMIT sorted 10;
GENERATE group, top;
};
top_influencers = FOREACH ordered_user_similarity GENERATE com.aol.grapevine.similarity.pig.udf.AssignPointsToTopInfluencer($1, 10);
all_influence_scores = FOREACH top_influencers GENERATE FLATTEN($0);
grouped_influence_scores = GROUP all_influence_scores BY bag_of_topSimUserTuples::user_id;
influence_scores = FOREACH grouped_influence_scores GENERATE group AS user_id, SUM(all_influence_scores.bag_of_topSimUserTuples::points) AS influence_score;
ordered_influence_scores = ORDER influence_scores BY influence_score DESC;
STORE ordered_influence_scores INTO '/tmp/cc-test-results-1' USING PigStorage();
The error log from Pig:
12/04/05 10:00:56 INFO pigstats.ScriptState: Pig script settings are added to the job
12/04/05 10:00:56 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
12/04/05 10:00:58 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
12/04/05 10:00:58 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
12/04/05 10:00:58 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission.
12/04/05 10:00:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/04/05 10:00:58 INFO input.FileInputFormat: Total input paths to process : 1
12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths to process : 1
12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths (combined) to process : 1
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating tmp-1546565755 in /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134-work-6955502337234509704 with rwxr-xr-x
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755
12/04/05 10:00:58 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir.
12/04/05 10:00:58 INFO mapred.TaskRunner: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/pigsample_854728855_1333645258470
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.jar.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.jar.crc
12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.split.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.split.crc
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.splitmetainfo.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.splitmetainfo.crc
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.xml.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.xml.crc
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.jar <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.jar
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.split <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.split
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.splitmetainfo <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.splitmetainfo
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.xml <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.xml
12/04/05 10:00:59 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/04/05 10:00:59 INFO mapred.MapTask: io.sort.mb = 100
12/04/05 10:00:59 INFO mapred.MapTask: data buffer = 79691776/99614720
12/04/05 10:00:59 INFO mapred.MapTask: record buffer = 262144/327680
12/04/05 10:00:59 WARN mapred.LocalJobRunner: job_local_0004
java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:139)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:560)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigFileInputFormat.listStatus(PigFileInputFormat.java:37)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:153)
at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:112)
... 6 more
12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Deleted path /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755
12/04/05 10:00:59 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0004
12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: job job_local_0004 has failed! Stop running all dependent jobs
12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: 100% complete
12/04/05 10:01:04 ERROR pigstats.PigStatsUtil: 1 map reduce job(s) failed!
12/04/05 10:01:04 INFO pigstats.PigStats: Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2-cdh3u3 0.8.1-cdh3u3 cchuang 2012-04-05 10:00:34 2012-04-05 10:01:04 GROUP_BY,ORDER_BY
Some jobs have failed! Stop running all dependent jobs
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_local_0001 0 0 0 0 0 0 0 0 all_influence_scores,grouped_user_similarity,simplified_user_similarity,user_similarity GROUP_BY
job_local_0002 0 0 0 0 0 0 0 0 grouped_influence_scores,influence_scores GROUP_BY,COMBINER
job_local_0003 0 0 0 0 0 0 0 0 ordered_influence_scores SAMPLER
Failed Jobs:
JobId Alias Feature Message Outputs
job_local_0004 ordered_influence_scores ORDER_BY Message: Job failed! Error - NA /tmp/cc-test-results-1,
Input(s):
Successfully read 0 records from: "/tmp/sample-sim-score-results-31/part-r-00000"
Output(s):
Failed to produce result in "/tmp/cc-test-results-1"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local_0001 -> job_local_0002,
job_local_0002 -> job_local_0003,
job_local_0003 -> job_local_0004,
job_local_0004
12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: Some jobs have failed! Stop running all dependent jobs
Make sure PIG_HOME environment variable is set to pig installation.

Resources