Hadoop - Reducer is waiting for Mapper inputs?

As explained in the title, when I execute my Hadoop program (and debug it in local mode), the following happens:
1. All 10 CSV lines in my test data are handled correctly in the Mapper, the Partitioner and the RawComparator (OutputKeyComparatorClass) that is called after the map step. But the OutputValueGroupingComparatorClass's and the ReduceClass's functions do NOT get executed afterwards.
2. My application looks like the following. (Due to space constraints I omit the implementation of the classes I used as configuration parameters until somebody has an idea that involves them; a hypothetical sketch of the comparator pattern follows the code below.)
import java.util.Date;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

public class RetweetApplication {

    public static int DEBUG = 1;
    static String INPUT = "/home/ema/INPUT-H";
    static String OUTPUT = "/home/ema/OUTPUT-H " + (new Date()).toString();

    public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(RetweetApplication.class);

        if (DEBUG > 0) {
            // Force standalone (local) execution against the local file system
            conf.set("mapred.job.tracker", "local");
            conf.set("fs.default.name", "file:///");
            conf.set("dfs.replication", "1");
        }

        FileInputFormat.setInputPaths(conf, new Path(INPUT));
        FileOutputFormat.setOutputPath(conf, new Path(OUTPUT));

        //conf.setOutputKeyClass(Text.class);
        //conf.setOutputValueClass(Text.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);

        conf.setMapperClass(RetweetMapper.class);
        conf.setPartitionerClass(TweetPartitioner.class);
        conf.setOutputKeyComparatorClass(TwitterValueGroupingComparator.class);
        conf.setOutputValueGroupingComparator(TwitterKeyGroupingComparator.class);
        conf.setReducerClass(RetweetReducer.class);

        conf.setOutputFormat(TextOutputFormat.class);

        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
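The omitted comparator classes follow the usual old-API WritableComparator pattern. Purely as a point of reference, here is a minimal, hypothetical sketch of what a grouping comparator such as TwitterKeyGroupingComparator could look like (the grouping criterion shown is made up, not my actual implementation):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical sketch: groups map outputs for one reduce() call by
// comparing only the part of the Text key before the first '|'.
public class TwitterKeyGroupingComparator extends WritableComparator {

    protected TwitterKeyGroupingComparator() {
        super(Text.class, true); // 'true' lets the base class instantiate keys for compare()
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String left = a.toString().split("\\|")[0];
        String right = b.toString().split("\\|")[0];
        return left.compareTo(right);
    }
}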
3. I get the following console output:
12/05/22 03:51:05 INFO mapred.MapTask: io.sort.mb = 100
12/05/22 03:51:05 INFO mapred.MapTask: data buffer = 79691776/99614720
12/05/22 03:51:05 INFO mapred.MapTask: record buffer = 262144/327680
12/05/22 03:51:06 INFO mapred.JobClient: map 0% reduce 0%
12/05/22 03:51:11 INFO mapred.LocalJobRunner: file:/home/ema/INPUT-H/tweets:0+967
12/05/22 03:51:12 INFO mapred.JobClient: map 39% reduce 0%
12/05/22 03:51:14 INFO mapred.LocalJobRunner: file:/home/ema/INPUT-H/tweets:0+967
12/05/22 03:51:15 INFO mapred.MapTask: Starting flush of map output
12/05/22 03:51:15 INFO mapred.MapTask: Finished spill 0
12/05/22 03:51:15 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/05/22 03:51:15 INFO mapred.JobClient: map 79% reduce 0%
12/05/22 03:51:17 INFO mapred.LocalJobRunner: file:/home/ema/INPUT-H/tweets:0+967
12/05/22 03:51:17 INFO mapred.LocalJobRunner: file:/home/ema/INPUT-H/tweets:0+967
12/05/22 03:51:17 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/05/22 03:51:17 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@35eed0
12/05/22 03:51:17 INFO mapred.ReduceTask: ShuffleRamManager: MemoryLimit=709551680, MaxSingleShuffleLimit=177387920
12/05/22 03:51:17 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging on-disk files
12/05/22 03:51:17 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread waiting: Thread for merging on-disk files
12/05/22 03:51:17 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging in memory files
12/05/22 03:51:17 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress
12/05/22 03:51:17 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
12/05/22 03:51:17 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for polling Map Completion Events
12/05/22 03:51:18 INFO mapred.JobClient: map 100% reduce 0%
12/05/22 03:51:23 INFO mapred.LocalJobRunner: reduce > copy >
The last ReduceTask lines ('Need another 1 map output(s) ...' and 'Scheduled 0 outputs ...') repeat endlessly from this point.
4. A lot of threads are still active after the mapper has seen every tuple:
RetweetApplication (1) [Remote Java Application]
OpenJDK Client VM[localhost:5002]
Thread [main] (Running)
Thread [Thread-2] (Running)
Daemon Thread [communication thread] (Running)
Thread [MapOutputCopier attempt_local_0001_r_000000_0.0] (Running)
Thread [MapOutputCopier attempt_local_0001_r_000000_0.1] (Running)
Thread [MapOutputCopier attempt_local_0001_r_000000_0.2] (Running)
Thread [MapOutputCopier attempt_local_0001_r_000000_0.4] (Running)
Thread [MapOutputCopier attempt_local_0001_r_000000_0.3] (Running)
Daemon Thread [Thread for merging on-disk files] (Running)
Daemon Thread [Thread for merging in memory files] (Running)
Daemon Thread [Thread for polling Map Completion Events] (Running)
Is there any reason why Hadoop expects more output from the mapper (see the repeating lines in the log) than I put into the input directory? As already mentioned, I verified in the debugger that ALL inputs are properly processed in the mapper/partitioner/etc.
UPDATE
With the help of Chris (see comments) I found out that my program was NOT started in local mode as I expected: the isLocal variable in the ReduceTask class is set to false, though it should be true.
To me it is absolutely unclear why this happens, since the 3 options that have to be set to enable standalone mode were set the right way. Surprisingly, though the local setting was ignored, the "read from normal disk" setting wasn't, which is very strange IMHO, because I thought local mode and the file:/// protocol were coupled.
While debugging ReduceTask I set the isLocal variable to true by evaluating isLocal=true in my debug view and then tried to execute the rest of the program. It did not work out; this is the stack trace:
12/05/22 14:28:28 INFO mapred.LocalJobRunner:
12/05/22 14:28:28 INFO mapred.Merger: Merging 1 sorted segments
12/05/22 14:28:28 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1956 bytes
12/05/22 14:28:28 INFO mapred.LocalJobRunner:
12/05/22 14:28:29 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: fs.default.name; Ignoring.
12/05/22 14:28:29 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker; Ignoring.
12/05/22 14:28:30 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 0 time(s).
12/05/22 14:28:31 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 1 time(s).
12/05/22 14:28:32 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 2 time(s).
12/05/22 14:28:33 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 3 time(s).
12/05/22 14:28:34 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 4 time(s).
12/05/22 14:28:35 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 5 time(s).
12/05/22 14:28:36 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 6 time(s).
12/05/22 14:28:37 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 7 time(s).
12/05/22 14:28:38 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 8 time(s).
12/05/22 14:28:39 INFO ipc.Client: Retrying connect to server: master/127.0.0.1:9001. Already tried 9 time(s).
12/05/22 14:28:39 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: fs.default.name; Ignoring.
12/05/22 14:28:39 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker; Ignoring.
12/05/22 14:28:39 WARN mapred.LocalJobRunner: job_local_0001
java.net.ConnectException: Call to master/127.0.0.1:9001 failed on connection exception: java.net.ConnectException: Connection refused
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
at org.apache.hadoop.ipc.Client.call(Client.java:1071)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at $Proxy1.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:123)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:446)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
at org.apache.hadoop.ipc.Client.call(Client.java:1046)
... 17 more
12/05/22 14:28:39 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: fs.default.name; Ignoring.
12/05/22 14:28:39 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker; Ignoring.
12/05/22 14:28:39 INFO mapred.JobClient: Job complete: job_local_0001
12/05/22 14:28:39 INFO mapred.JobClient: Counters: 20
12/05/22 14:28:39 INFO mapred.JobClient: File Input Format Counters
12/05/22 14:28:39 INFO mapred.JobClient: Bytes Read=967
12/05/22 14:28:39 INFO mapred.JobClient: FileSystemCounters
12/05/22 14:28:39 INFO mapred.JobClient: FILE_BYTES_READ=14093
12/05/22 14:28:39 INFO mapred.JobClient: FILE_BYTES_WRITTEN=47859
12/05/22 14:28:39 INFO mapred.JobClient: Map-Reduce Framework
12/05/22 14:28:39 INFO mapred.JobClient: Map output materialized bytes=1960
12/05/22 14:28:39 INFO mapred.JobClient: Map input records=10
12/05/22 14:28:39 INFO mapred.JobClient: Reduce shuffle bytes=0
12/05/22 14:28:39 INFO mapred.JobClient: Spilled Records=10
12/05/22 14:28:39 INFO mapred.JobClient: Map output bytes=1934
12/05/22 14:28:39 INFO mapred.JobClient: Total committed heap usage (bytes)=115937280
12/05/22 14:28:39 INFO mapred.JobClient: CPU time spent (ms)=0
12/05/22 14:28:39 INFO mapred.JobClient: Map input bytes=967
12/05/22 14:28:39 INFO mapred.JobClient: SPLIT_RAW_BYTES=82
12/05/22 14:28:39 INFO mapred.JobClient: Combine input records=0
12/05/22 14:28:39 INFO mapred.JobClient: Reduce input records=0
12/05/22 14:28:39 INFO mapred.JobClient: Reduce input groups=0
12/05/22 14:28:39 INFO mapred.JobClient: Combine output records=0
12/05/22 14:28:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
12/05/22 14:28:39 INFO mapred.JobClient: Reduce output records=0
12/05/22 14:28:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
12/05/22 14:28:39 INFO mapred.JobClient: Map output records=10
12/05/22 14:28:39 INFO mapred.JobClient: Job Failed: NA
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
at uni.kassel.macek.rtprep.RetweetApplication.main(RetweetApplication.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Since this stack trace shows me that port 9001 is used during execution, I guess that somehow the XML configuration files overwrite the local settings made in Java (which I use for testing). That is strange, since I read over and over on the internet that Java overwrites XML configuration. If nobody knows how to correct this, I'll try to simply erase all configuration XMLs. Perhaps this solves the problem...
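One way to see which value actually wins is to dump the effective settings from the JobConf right before submitting the job (a small sketch to place in main() just before JobClient.runJob(conf); the property names are the ones set above):

// Sketch: print the effective values to see whether the site XML files
// have overridden the values set programmatically on the JobConf.
System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
System.out.println("fs.default.name    = " + conf.get("fs.default.name"));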
NEW UPDATE
Renaming Hadoop's conf folder solved the problem of the waiting copier, and the program now executes to the end. Sadly, the execution doesn't wait for my debugger anymore, although HADOOP_OPTS is set correctly.
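For reference, the setting I use to make the JVM wait for a debugger is the standard JDWP agent line (assuming port 5002, as in the thread dump above):

export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5002"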
SUMMARY: It's only a configuration issue: XML may (for some configuration parameters) overwrite Java. If somebody knew how I can get debugging to run again, it would be perfect, but for now I'm just glad I don't see this stack trace anymore! ;)
Thank you Chris for your time and efforts!

Sorry I didn't see this before, but you appear to have two important configuration properties set to final in your conf XML files, as denoted by the following log statements:
12/05/22 14:28:29 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: fs.default.name; Ignoring.
12/05/22 14:28:29 WARN conf.Configuration: file:/tmp/hadoop-ema/mapred/local/localRunner/job_local_0001.xml:a attempt to override final parameter: mapred.job.tracker; Ignoring.
This means that your job is unable to actually run in local mode: it starts in local mode, but the reducer reads the serialized job configuration, determines it is not in local mode, and tries to fetch map outputs via the task tracker ports.
You said your fix was to rename the conf folder - this defaults Hadoop back to the default configuration, where these two properties are not marked as 'final'.
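For reference, a property is locked against job-side overrides when it carries a final element in one of the site files. Your core-site.xml presumably contained an entry like this (the value shown is illustrative):

<!-- core-site.xml (illustrative): 'final' makes this value win over conf.set(...) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
  <final>true</final>
</property>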

Related

Nutch REST not working on EMR in distributed mode

I'm running Nutch 2.3 on EMR (AMI version 2.4.2). The crawl steps work fine in local and distributed mode (hadoop -jar apache-nutch-2.3.job <MainClass> <args>), and I am able to call the steps by spinning up the REST service in local mode. But when I try to run the REST service in distributed mode (hadoop -jar apache-nutch-2.3.job org.apache.nutch.api.NutchServer), it receives the calls but does not get the job done. What is the correct way to run Nutch in distributed mode?
Info
When the InjectorJob is run offline in distributed mode, the output is as follows:
COMMAND:
hadoop jar ./apache-nutch-2.3.job org.apache.nutch.crawl.InjectorJob s3://myemrbucket/urls -crawlId 2
15/11/19 09:55:06 INFO crawl.InjectorJob: InjectorJob: starting at 2015-11-19 09:55:06
15/11/19 09:55:06 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: s3://myemrbucket/urls
15/11/19 09:55:06 INFO s3native.NativeS3FileSystem: Created AmazonS3 with InstanceProfileCredentialsProvider
15/11/19 09:55:08 WARN store.HBaseStore: Mismatching schema's names. Mappingfile schema: 'webpage'. PersistentClass schema's name: '2_webpage'Assuming they are the same.
15/11/19 09:55:08 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
15/11/19 09:55:08 INFO mapred.JobClient: Default number of map tasks: null
15/11/19 09:55:08 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 4
15/11/19 09:55:08 INFO mapred.JobClient: Default number of reduce tasks: 0
15/11/19 09:55:10 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
15/11/19 09:55:10 INFO mapred.JobClient: Setting group to hadoop
15/11/19 09:55:10 INFO input.FileInputFormat: Total input paths to process : 1
15/11/19 09:55:10 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
15/11/19 09:55:10 WARN lzo.LzoCodec: Could not find build properties file with revision hash
15/11/19 09:55:10 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
15/11/19 09:55:10 WARN snappy.LoadSnappy: Snappy native library is available
15/11/19 09:55:10 INFO snappy.LoadSnappy: Snappy native library loaded
15/11/19 09:55:10 INFO mapred.JobClient: Running job: job_201511182052_0037
15/11/19 09:55:11 INFO mapred.JobClient: map 0% reduce 0%
15/11/19 09:55:38 INFO mapred.JobClient: map 100% reduce 0%
15/11/19 09:55:43 INFO mapred.JobClient: Job complete: job_201511182052_0037
15/11/19 09:55:43 INFO mapred.JobClient: Counters: 20
15/11/19 09:55:43 INFO mapred.JobClient: Job Counters
15/11/19 09:55:43 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=16424
15/11/19 09:55:43 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/11/19 09:55:43 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/11/19 09:55:43 INFO mapred.JobClient: Rack-local map tasks=1
15/11/19 09:55:43 INFO mapred.JobClient: Launched map tasks=1
15/11/19 09:55:43 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
15/11/19 09:55:43 INFO mapred.JobClient: File Output Format Counters
15/11/19 09:55:43 INFO mapred.JobClient: Bytes Written=0
15/11/19 09:55:43 INFO mapred.JobClient: injector
15/11/19 09:55:43 INFO mapred.JobClient: urls_injected=1
15/11/19 09:55:43 INFO mapred.JobClient: FileSystemCounters
15/11/19 09:55:43 INFO mapred.JobClient: HDFS_BYTES_READ=98
15/11/19 09:55:43 INFO mapred.JobClient: S3_BYTES_READ=61
15/11/19 09:55:43 INFO mapred.JobClient: FILE_BYTES_WRITTEN=36254
15/11/19 09:55:43 INFO mapred.JobClient: File Input Format Counters
15/11/19 09:55:43 INFO mapred.JobClient: Bytes Read=61
15/11/19 09:55:43 INFO mapred.JobClient: Map-Reduce Framework
15/11/19 09:55:43 INFO mapred.JobClient: Map input records=1
15/11/19 09:55:43 INFO mapred.JobClient: Physical memory (bytes) snapshot=193712128
15/11/19 09:55:43 INFO mapred.JobClient: Spilled Records=0
15/11/19 09:55:43 INFO mapred.JobClient: CPU time spent (ms)=3960
15/11/19 09:55:43 INFO mapred.JobClient: Total committed heap usage (bytes)=298319872
15/11/19 09:55:43 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1525059584
15/11/19 09:55:43 INFO mapred.JobClient: Map output records=1
15/11/19 09:55:43 INFO mapred.JobClient: SPLIT_RAW_BYTES=98
15/11/19 09:55:44 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0
15/11/19 09:55:44 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 1
15/11/19 09:55:44 INFO crawl.InjectorJob: Injector: finished at 2015-11-19 09:55:44, elapsed: 00:00:38
When calling it through the REST service, the job gets stuck after giving the following output:
POST ARGS:
{
"crawlId":"11",
"confId":"default",
"type":"INJECT",
"args":{"seedDir":"s3://myemrbucket/urls"}
}
15/11/19 09:46:14 INFO api.NutchServer: Starting NutchServer on port: 8081 with logging level: INFO ...
Nov 19, 2015 9:46:14 AM org.restlet.engine.connector.NetServerHelper start
INFO: Starting the internal [HTTP/1.1] server on port 8081
15/11/19 09:46:14 INFO api.NutchServer: Started NutchServer on port 8081
Nov 19, 2015 9:46:25 AM org.restlet.engine.log.LogFilter afterHandle
INFO: 2015-11-19 09:46:25 1xx.xx.x.xx - - 8081 POST /job/create - 200 28 110 498 http://ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com:8081 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36-
15/11/19 09:46:25 INFO s3native.NativeS3FileSystem: Created AmazonS3 with InstanceProfileCredentialsProvider
15/11/19 09:46:27 WARN store.HBaseStore: Mismatching schema's names. Mappingfile schema: 'webpage'. PersistentClass schema's name: '11_webpage'Assuming they are the same.
15/11/19 09:46:28 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
15/11/19 09:46:28 INFO mapred.JobClient: Default number of map tasks: null
15/11/19 09:46:28 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 4
15/11/19 09:46:28 INFO mapred.JobClient: Default number of reduce tasks: 0
15/11/19 09:46:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
and does not move ahead.

Can't finish MR job when using Sqoop to transfer data from HDFS to MySQL

While transferring data from HDFS to MySQL, a MapReduce job gets spawned, but it gets stuck and never completes.
sqoop export --connect jdbc:mysql://crxy2:3306/test --username root --password 19911130 --table info --export-dir sqoop_export
I see the following in the logs:
Warning: /software/sqoop-1.4.6.alpha/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /software/sqoop-1.4.6.alpha/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /software/sqoop-1.4.6.alpha/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /software/sqoop-1.4.6.alpha/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
15/12/02 01:17:37 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
15/12/02 01:17:37 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/12/02 01:17:37 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/12/02 01:17:37 INFO tool.CodeGenTool: Beginning code generation
15/12/02 01:17:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `info` AS t LIMIT 1
15/12/02 01:17:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `info` AS t LIMIT 1
15/12/02 01:17:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /software/hadoop-2.6.0
Note: /tmp/sqoop-root/compile/344126e97612def1e3976c1978c2e75e/info.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/12/02 01:17:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/344126e97612def1e3976c1978c2e75e/info.jar
15/12/02 01:17:42 INFO mapreduce.ExportJobBase: Beginning export of info
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/software/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/software/hbase-0.98.8-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/12/02 01:17:43 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/12/02 01:17:45 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/12/02 01:17:45 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/12/02 01:17:45 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/12/02 01:17:46 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/12/02 01:17:50 INFO input.FileInputFormat: Total input paths to process : 1
15/12/02 01:17:50 INFO input.FileInputFormat: Total input paths to process : 1
15/12/02 01:17:50 INFO mapreduce.JobSubmitter: number of splits:4
15/12/02 01:17:50 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
15/12/02 01:17:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1449047829255_0001
15/12/02 01:17:51 INFO impl.YarnClientImpl: Submitted application application_1449047829255_0001
15/12/02 01:17:52 INFO mapreduce.Job: The url to track the job: http://crxy2:8088/proxy/application_1449047829255_0001/
15/12/02 01:17:52 INFO mapreduce.Job: Running job: job_1449047829255_0001
15/12/02 01:18:12 INFO mapreduce.Job: Job job_1449047829255_0001 running in uber mode : false
15/12/02 01:18:12 INFO mapreduce.Job: map 0% reduce 0%
15/12/02 01:19:10 INFO mapreduce.Job: map 75% reduce 0%
15/12/02 01:19:12 INFO mapreduce.Job: map 100% reduce 0%
15/12/02 01:29:41 INFO mapreduce.Job: Task Id : attempt_1449047829255_0001_m_000001_0, Status : FAILED
AttemptID:attempt_1449047829255_0001_m_000001_0 Timed out after 600 secs
15/12/02 01:29:42 INFO mapreduce.Job: map 75% reduce 0%
15/12/02 01:29:58 INFO mapreduce.Job: map 100% reduce 0%
15/12/02 01:40:11 INFO mapreduce.Job: Task Id : attempt_1449047829255_0001_m_000001_1, Status : FAILED
AttemptID:attempt_1449047829255_0001_m_000001_1 Timed out after 600 secs
15/12/02 01:40:12 INFO mapreduce.Job: map 75% reduce 0%
15/12/02 01:40:28 INFO mapreduce.Job: map 100% reduce 0%
15/12/02 01:50:41 INFO mapreduce.Job: Task Id : attempt_1449047829255_0001_m_000001_2, Status : FAILED
AttemptID:attempt_1449047829255_0001_m_000001_2 Timed out after 600 secs
15/12/02 01:50:42 INFO mapreduce.Job: map 75% reduce 0%
15/12/02 01:51:00 INFO mapreduce.Job: map 100% reduce 0%
15/12/02 02:01:13 INFO mapreduce.Job: Job job_1449047829255_0001 failed with state FAILED due to: Task failed task_1449047829255_0001_m_000001
Job failed as tasks failed. failedMaps:1 failedReduces:0
15/12/02 02:01:13 INFO mapreduce.Job: Counters: 32
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=370395
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=556
HDFS: Number of bytes written=0
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=4
Launched map tasks=7
Other local map tasks=3
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=2732612
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=2732612
Total vcore-seconds taken by all map tasks=2732612
Total megabyte-seconds taken by all map tasks=2798194688
Map-Reduce Framework
Map input records=0
Map output records=0
Input split bytes=504
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=759
CPU time spent (ms)=5170
Physical memory (bytes) snapshot=245080064
Virtual memory (bytes) snapshot=2529026048
Total committed heap usage (bytes)=46792704
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
15/12/02 02:01:13 INFO mapreduce.ExportJobBase: Transferred 556 bytes in 2,607.4894 seconds (0.2132 bytes/sec)
15/12/02 02:01:13 INFO mapreduce.ExportJobBase: Exported 0 records.
15/12/02 02:01:13 ERROR tool.ExportTool: Error during export: Export job failed!
2015-12-02 08:01:15,791 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=Application Finished - Succeeded TARGET=RMAppManager RESULT=SUCCESS APPID=application_1449047829255_0002
2015-12-02 08:01:15,793 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Application Attempt appattempt_1449047829255_0002_000001 is done. finalState=FINISHED
2015-12-02 08:01:15,793 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1449047829255_0002 requests cleared
2015-12-02 08:01:15,794 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Application removed - appId: application_1449047829255_0002 user: root queue: default #user-pending-applications: 0 #user-active-applications: 0 #queue-pending-applications: 0 #queue-active-applications: 0
2015-12-02 08:01:15,794 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: Application removed - appId: application_1449047829255_0002 user: root leaf-queue of parent: root #applications: 0
2015-12-02 08:01:15,794 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1449047829255_0002,name=info.jar,user=root,queue=default,state=FINISHED,trackingUrl=http://crxy2:8088/proxy/application_1449047829255_0002/jobhistory/job/job_1449047829255_0002,appMasterHost=crxy2,startTime=1449069503787,finishTime=1449072069229,finalStatus=FAILED
2015-12-02 08:01:15,796 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Cleaning master appattempt_1449047829255_0002_000001
2015-12-02 08:01:15,873 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Null container completed...
2015-12-02 08:01:15,873 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Null container completed...
2015-12-02 08:01:16,879 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Null container completed...
The questioner was looking at the wrong logs. He was able to troubleshoot the issue by going through the failed task logs, as suggested in the comments section.
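For anyone hitting the same wall: on a YARN cluster like this one (Hadoop 2.6.0), the per-attempt task logs can usually be pulled with the yarn CLI once the application has finished, provided log aggregation is enabled - a sketch using the application id from the log above:

yarn logs -applicationId application_1449047829255_0001 | less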

Second mapper not getting called - MultipleInputs

My job gets stuck once the first mapper (Reducemapper2) completes, at "map 50% reduce 0%". I tried to debug a lot and googled it as well, but I'm not able to figure out the reason. Below is the driver class.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Reducedriver {
    public static void main(String args[]) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: Worddrivernewapi <input path1> <inputpath2> <output path>");
            System.exit(-1);
        }

        Configuration conf = new Configuration();
        Job job = new Job(conf, "Reducesideexample");
        job.setJarByClass(Reducedriver.class);
        job.setJobName("Reducedriver");

        Path path1 = new Path(args[0]);
        Path path2 = new Path(args[1]);
        MultipleInputs.addInputPath(job, path1, TextInputFormat.class, Reducemapper1.class);
        MultipleInputs.addInputPath(job, path2, TextInputFormat.class, Reducemapper2.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        //job.setMapperClass(Reducemapper1.class);
        job.setPartitionerClass(Reducepartitioner.class);
        //job.setSortComparatorClass(Reducesortcomparator.class);
        job.setGroupingComparatorClass(Reducegroupcomparator.class);
        job.setReducerClass(Reducereducer.class);
        //job.setNumReduceTasks(0);

        job.setMapOutputKeyClass(ReduceWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
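For context, with MultipleInputs both mappers must emit the map output types configured in the driver - ReduceWritable keys and Text values here. A hypothetical sketch of the matching signature (not my actual mapper body):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical: both Reducemapper1 and Reducemapper2 must share this
// output signature, matching setMapOutputKeyClass/setMapOutputValueClass.
public class Reducemapper1 extends Mapper<LongWritable, Text, ReduceWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... parse the line and emit (ReduceWritable, Text) pairs
    }
}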
Could someone help me figure out the issue?
This is pseudo-distributed mode with capacity for 2 mappers and 2 reducers. I have had multiple successful runs within this 2-slot capacity.
Log for a single mapper (JobTracker log):
2015-05-16 11:10:56,630 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2015-05-16 11:10:57,126 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2015-05-16 11:10:57,288 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2015-05-16 11:10:57,309 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@42f93a98
2015-05-16 11:10:57,484 INFO org.apache.hadoop.mapred.MapTask: Processing split: hdfs://localhost:9000/user/hduser/test/mapmainfile.dat:0+40
2015-05-16 11:10:57,512 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2015-05-16 11:10:57,591 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2015-05-16 11:10:57,592 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2015-05-16 11:10:57,607 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library not loaded
2015-05-16 11:10:57,666 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2015-05-16 11:10:57,669 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
From the terminal:
15/05/16 11:10:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/05/16 11:10:50 INFO input.FileInputFormat: Total input paths to process : 1
15/05/16 11:10:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/05/16 11:10:50 WARN snappy.LoadSnappy: Snappy native library not loaded
15/05/16 11:10:50 INFO input.FileInputFormat: Total input paths to process : 1
15/05/16 11:10:51 INFO mapred.JobClient: Running job: job_201505161109_0001
15/05/16 11:10:52 INFO mapred.JobClient: map 0% reduce 0%
15/05/16 11:11:04 INFO mapred.JobClient: map 100% reduce 0%
When I tried to debug through localhost I could see that the first mapper completes and the map progress stops at 50%.
LocalJobRunner log:
15/05/16 11:36:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/16 11:36:08 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/05/16 11:36:08 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/05/16 11:36:08 INFO input.FileInputFormat: Total input paths to process : 1
15/05/16 11:36:08 WARN snappy.LoadSnappy: Snappy native library not loaded
15/05/16 11:36:08 INFO input.FileInputFormat: Total input paths to process : 1
15/05/16 11:36:08 INFO mapred.JobClient: Running job: job_local815502428_0001
15/05/16 11:36:09 INFO mapred.LocalJobRunner: Waiting for map tasks
15/05/16 11:36:09 INFO mapred.LocalJobRunner: Starting task: attempt_local815502428_0001_m_000000_0
15/05/16 11:36:09 INFO util.ProcessTree: setsid exited with exit code 0
15/05/16 11:36:09 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@11507b87
15/05/16 11:36:09 INFO mapred.MapTask: Processing split: file:/home/hduser/hadoop/myexamples/mainmapdatafile.dat:0+137
15/05/16 11:36:09 INFO mapred.MapTask: io.sort.mb = 100
15/05/16 11:36:09 INFO mapred.MapTask: data buffer = 79691776/99614720
15/05/16 11:36:09 INFO mapred.MapTask: record buffer = 262144/327680
15/05/16 11:36:09 INFO mapred.JobClient: map 0% reduce 0%
15/05/16 11:36:18 INFO mapred.LocalJobRunner:
15/05/16 11:36:18 INFO mapred.JobClient: map 6% reduce 0%
15/05/16 11:36:27 INFO mapred.LocalJobRunner:
15/05/16 11:36:28 INFO mapred.JobClient: map 12% reduce 0%
15/05/16 11:36:36 INFO mapred.LocalJobRunner:
15/05/16 11:36:37 INFO mapred.JobClient: map 18% reduce 0%
15/05/16 11:36:45 INFO mapred.LocalJobRunner:
15/05/16 11:36:46 INFO mapred.JobClient: map 25% reduce 0%
15/05/16 11:36:51 INFO mapred.LocalJobRunner:
15/05/16 11:36:52 INFO mapred.JobClient: map 31% reduce 0%
15/05/16 11:36:57 INFO mapred.LocalJobRunner:
15/05/16 11:36:58 INFO mapred.JobClient: map 37% reduce 0%
15/05/16 11:37:03 INFO mapred.LocalJobRunner:
15/05/16 11:37:04 INFO mapred.JobClient: map 43% reduce 0%
15/05/16 11:37:09 INFO mapred.LocalJobRunner:
15/05/16 11:37:10 INFO mapred.JobClient: map 50% reduce 0%
15/05/16 11:37:12 INFO mapred.MapTask: Starting flush of map output
15/05/16 11:37:12 INFO mapred.MapTask: Starting flush of map output
15/05/16 11:37:18 INFO mapred.LocalJobRunner:

Debugging a Tutorial Hadoop Pipes-Project

I am working through this tutorial and got to the very last part (with some small changes). Now I am stuck with an error message I can't make sense of.
damian#damian-ThinkPad-T61:~/hadoop-1.1.2$ bin/hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input dft1 -output dft1-out -program bin/word_count
13/06/09 20:17:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/06/09 20:17:01 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/06/09 20:17:01 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/09 20:17:01 INFO mapred.FileInputFormat: Total input paths to process : 1
13/06/09 20:17:02 INFO filecache.TrackerDistributedCacheManager: Creating word_count in /tmp/hadoop-damian/mapred/local/archive/7642618178782392982_1522484642_696507214/filebin-work-1867423021697266227 with rwxr-xr-x
13/06/09 20:17:02 INFO filecache.TrackerDistributedCacheManager: Cached bin/word_count as /tmp/hadoop-damian/mapred/local/archive/7642618178782392982_1522484642_696507214/filebin/word_count
13/06/09 20:17:02 INFO filecache.TrackerDistributedCacheManager: Cached bin/word_count as /tmp/hadoop-damian/mapred/local/archive/7642618178782392982_1522484642_696507214/filebin/word_count
13/06/09 20:17:02 INFO mapred.JobClient: Running job: job_local_0001
13/06/09 20:17:02 INFO util.ProcessTree: setsid exited with exit code 0
13/06/09 20:17:02 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@4200d3
13/06/09 20:17:02 INFO mapred.MapTask: numReduceTasks: 1
13/06/09 20:17:02 INFO mapred.MapTask: io.sort.mb = 100
13/06/09 20:17:02 INFO mapred.MapTask: data buffer = 79691776/99614720
13/06/09 20:17:02 INFO mapred.MapTask: record buffer = 262144/327680
13/06/09 20:17:02 WARN mapred.LocalJobRunner: job_local_0001
java.lang.NullPointerException
at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:103)
at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:68)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
13/06/09 20:17:03 INFO mapred.JobClient: map 0% reduce 0%
13/06/09 20:17:03 INFO mapred.JobClient: Job complete: job_local_0001
13/06/09 20:17:03 INFO mapred.JobClient: Counters: 0
13/06/09 20:17:03 INFO mapred.JobClient: Job Failed: NA
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
Does anyone see where the error hides? What is a straightforward way for debugging Hadoop Pipes programs?
Thanks!
The exception at
at org.apache.hadoop.mapred.pipes.Application.<init>(Application.java:103)
is caused by the following lines in the source:
//Add token to the environment if security is enabled
Token<JobTokenIdentifier> jobToken = TokenCache.getJobToken(conf
.getCredentials());
// This password is used as shared secret key between this application and
// child pipes process
byte[] password = jobToken.getPassword();
The actual NPE is thrown in the final line, as jobToken is null.
As you're using local mode (local job tracker and local file system), I'm not sure that security should be 'enabled' - do you have either of the following properties configured in your core-site.xml or hdfs-site.xml configuration files (if so, what are their values - see the illustrative snippet after this list):
hadoop.security.authentication
hadoop.security.authorization
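Purely for illustration, such an entry - if present - would look like this:

<!-- core-site.xml (illustrative values) -->
<property>
  <name>hadoop.security.authentication</name>
  <value>simple</value> <!-- 'simple' (the default) or 'kerberos' -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>false</value>
</property>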
Possibly this is because your cluster is running in local mode. Do you have the following property in your mapred-site.xml file?
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>
Let the MapReduce jobs run with the yarn framework.
</description>
</property>
If you don't have this property, your cluster will, by default, run in local mode. I used to have exactly the same problem in local mode. After I added this property, the cluster ran in distributed mode and the problem was gone.
HTH,
Shumin

Mahout RecommenderJob not converging

This is my first SO post, so please let me know if I've missed anything important. I am a Mahout/Hadoop beginner and am trying to put together a distributed recommendation engine.
In order to simulate working on a remote cluster, I have set up hadoop on my machine to communicate with a Ubuntu VM (using VirtualBox), also located on my machine, which has hadoop installed on it. This setup seems to be working fine and I am now trying to run Mahout's 'RecommenderJob' on a (very!) small trial dataset as a test.
The input consists of a .csv file (saved on the Hadoop DFS) containing around 50 user preferences in the format: userID, itemID, preference ... and the command I am running is:
hadoop jar /Users/MyName/src/trunk/core/target/mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=/user/MyName/Recommendations/input/TestRatings.csv -Dmapred.output.dir=/user/MyName/Recommendations/output -s SIMILARITY_PEARSON_CORELLATION
where TestRatings.csv is the file containing the preferences and output is the desired output directory.
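For reference, rows in TestRatings.csv follow that format; the values below are made up for illustration only:

1,101,5.0
1,102,3.5
2,101,2.0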
At first the job looks like it's running fine, and I get the following output:
12/12/11 12:26:21 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --maxPrefsPerUser=[10], --maxPrefsPerUserInItemSimilarity=[1000], --maxSimilaritiesPerItem=[100], --minPrefsPerUser=[1], --numRecommendations=[10], --similarityClassname=[SIMILARITY_PEARSON_CORELLATION], --startPhase=[0], --tempDir=[temp]}
12/12/11 12:26:21 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[/user/Naaman/Delphi/input/TestRatings.csv], --maxPrefsPerUser=[1000], --minPrefsPerUser=[1], --output=[temp/preparePreferenceMatrix], --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]}
12/12/11 12:26:21 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/12/11 12:26:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/12/11 12:26:22 INFO input.FileInputFormat: Total input paths to process : 1
12/12/11 12:26:22 WARN snappy.LoadSnappy: Snappy native library not loaded
12/12/11 12:26:22 INFO mapred.JobClient: Running job: job_local_0001
12/12/11 12:26:22 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/12/11 12:26:22 INFO mapred.MapTask: io.sort.mb = 100
12/12/11 12:26:22 INFO mapred.MapTask: data buffer = 79691776/99614720
12/12/11 12:26:22 INFO mapred.MapTask: record buffer = 262144/327680
12/12/11 12:26:22 INFO mapred.MapTask: Starting flush of map output
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new compressor
12/12/11 12:26:22 INFO mapred.MapTask: Finished spill 0
12/12/11 12:26:22 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/12/11 12:26:22 INFO mapred.LocalJobRunner:
12/12/11 12:26:22 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/12/11 12:26:22 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/12/11 12:26:22 INFO mapred.ReduceTask: ShuffleRamManager: MemoryLimit=1491035776, MaxSingleShuffleLimit=372758944
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging on-disk files
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging in memory files
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread waiting: Thread for merging on-disk files
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for polling Map Completion Events
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
12/12/11 12:26:23 INFO mapred.JobClient: map 100% reduce 0%
12/12/11 12:26:28 INFO mapred.LocalJobRunner: reduce > copy >
12/12/11 12:26:31 INFO mapred.LocalJobRunner: reduce > copy >
12/12/11 12:26:37 INFO mapred.LocalJobRunner: reduce > copy >
But then the last three lines repeat indefinitely (I left it overnight...), with the two lines:
12/12/11 12:27:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress
12/12/11 12:27:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
repeating every twelve rows.
I'm not sure whether there's something wrong with my input, or whether the tiny size of the trial data is messing things up. Any help and/or advice on the best way to go about this would be much appreciated.
p.s. I was trying to follow the instructions from https://www.box.com/s/041rdjeh7sny128r2uki
This is really a Hadoop or cluster issue: the reducer is waiting on mapper output that is not coming. Look for earlier failures in the mapping phase.
