Understanding Hadoop behavior with GZ files - hadoop

I have a small JSON file in two separate folders in my S3 bucket. I ran the same command with the same mapper on those two separately.
NORMAL JSON
$ hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -Dmapred.reduce.tasks=0 -file ./mapper.py -mapper ./mapper.py -input s3://mybucket/normaltest -output smalltest-output
14/08/28 08:33:53 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
packageJobJar: [./mapper.py, /mnt/var/lib/hadoop/tmp/hadoop-unjar6225144044327095484/] [] /tmp/streamjob6947060448653690043.jar tmpDir=null
14/08/28 08:33:56 INFO mapred.JobClient: Default number of map tasks: null
14/08/28 08:33:56 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 160
14/08/28 08:33:56 INFO mapred.JobClient: Default number of reduce tasks: 0
14/08/28 08:33:56 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
14/08/28 08:33:56 INFO mapred.JobClient: Setting group to hadoop
14/08/28 08:33:56 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
14/08/28 08:33:56 WARN lzo.LzoCodec: Could not find build properties file with revision hash
14/08/28 08:33:56 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
14/08/28 08:33:56 WARN snappy.LoadSnappy: Snappy native library is available
14/08/28 08:33:56 INFO snappy.LoadSnappy: Snappy native library loaded
14/08/28 08:33:58 INFO mapred.FileInputFormat: Total input paths to process : 1
14/08/28 08:33:58 INFO streaming.StreamJob: getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
14/08/28 08:33:58 INFO streaming.StreamJob: Running job: job_201408260907_0053
14/08/28 08:33:58 INFO streaming.StreamJob: To kill this job, run:
14/08/28 08:33:58 INFO streaming.StreamJob: /home/hadoop/bin/hadoop job -Dmapred.job.tracker=10.165.13.124:9001 -kill job_201408260907_0053
14/08/28 08:33:58 INFO streaming.StreamJob: Tracking URL: http://ip-10-165-13-124.ec2.internal:9100/jobdetails.jsp?jobid=job_201408260907_0053
14/08/28 08:33:59 INFO streaming.StreamJob: map 0% reduce 0%
14/08/28 08:34:23 INFO streaming.StreamJob: map 1% reduce 0%
14/08/28 08:34:26 INFO streaming.StreamJob: map 2% reduce 0%
14/08/28 08:34:29 INFO streaming.StreamJob: map 9% reduce 0%
14/08/28 08:34:32 INFO streaming.StreamJob: map 45% reduce 0%
14/08/28 08:34:35 INFO streaming.StreamJob: map 56% reduce 0%
14/08/28 08:34:36 INFO streaming.StreamJob: map 57% reduce 0%
14/08/28 08:34:38 INFO streaming.StreamJob: map 84% reduce 0%
14/08/28 08:34:39 INFO streaming.StreamJob: map 85% reduce 0%
14/08/28 08:34:41 INFO streaming.StreamJob: map 99% reduce 0%
14/08/28 08:34:44 INFO streaming.StreamJob: map 100% reduce 0%
14/08/28 08:34:50 INFO streaming.StreamJob: map 100% reduce 100%
14/08/28 08:34:50 INFO streaming.StreamJob: Job complete: job_201408260907_0053
14/08/28 08:34:50 INFO streaming.StreamJob: Output: smalltest-output
In smalltest-output, I get several small files containing a part of the processed JSON.
GZIPed JSON
$ hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -Dmapred.reduce.tasks=0 -file ./mapper.py -mapper ./mapper.py -input s3://weblablatency/gztest -output smalltest-output
14/08/28 08:39:45 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
packageJobJar: [./mapper.py, /mnt/var/lib/hadoop/tmp/hadoop-unjar2539293594337011579/] [] /tmp/streamjob301144784484156113.jar tmpDir=null
14/08/28 08:39:48 INFO mapred.JobClient: Default number of map tasks: null
14/08/28 08:39:48 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 160
14/08/28 08:39:48 INFO mapred.JobClient: Default number of reduce tasks: 0
14/08/28 08:39:48 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
14/08/28 08:39:48 INFO mapred.JobClient: Setting group to hadoop
14/08/28 08:39:48 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
14/08/28 08:39:48 WARN lzo.LzoCodec: Could not find build properties file with revision hash
14/08/28 08:39:48 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
14/08/28 08:39:48 WARN snappy.LoadSnappy: Snappy native library is available
14/08/28 08:39:48 INFO snappy.LoadSnappy: Snappy native library loaded
14/08/28 08:39:50 INFO mapred.FileInputFormat: Total input paths to process : 1
14/08/28 08:39:51 INFO streaming.StreamJob: getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
14/08/28 08:39:51 INFO streaming.StreamJob: Running job: job_201408260907_0055
14/08/28 08:39:51 INFO streaming.StreamJob: To kill this job, run:
14/08/28 08:39:51 INFO streaming.StreamJob: /home/hadoop/bin/hadoop job -Dmapred.job.tracker=10.165.13.124:9001 -kill job_201408260907_0055
14/08/28 08:39:51 INFO streaming.StreamJob: Tracking URL: http://ip-10-165-13-124.ec2.internal:9100/jobdetails.jsp?jobid=job_201408260907_0055
14/08/28 08:39:52 INFO streaming.StreamJob: map 0% reduce 0%
14/08/28 08:40:20 INFO streaming.StreamJob: map 100% reduce 0%
14/08/28 08:40:26 INFO streaming.StreamJob: map 100% reduce 100%
14/08/28 08:40:26 INFO streaming.StreamJob: Job complete: job_201408260907_0055
In smalltest-output I get a correctly parsed file, but as a single file.
Why this difference and what is happening? Is my job not being distributed properly in the gz case?
In my actual use case I need to process ~2000 gz files totalling to around 4GB uncompressed; every 4 hours. So I can't afford any performance issues because of compression.

Gzip is not splittable. You will find bazillions of articles and questions speaking about this issue so I won't go into details.
Your options are:
Don't use Gzip (don't compress or use another splittable compression format)
Use a hack to make GZip splittable, like https://github.com/nielsbasjes/splittablegzip. Each mapper will still have to read the file from the beginning so it's a trade-off. Read the documentation to learn more.
It depends on what you do, but for most processing 4GB of data is nothing. I would make sure that I really need an elephant like Hadoop for my use case. It is scalable but complex, painful to work and usually slow for small data sets.

Related

Nutch Rest not working on EMR in distributed mode

I'm running Nutch 2.3 on EMR (AMI version 2.4.2). The crawl steps are working fine in local and distributed mode (hadoop -jar apache-nutch-2.3.job <MainClass> <args>), and am able to call the steps by spinning up the rest service in local mode. But, when I try to run the rest in distributed mode (hadoop -jar apache-nutch-2.3.job org.apache.nutch.api.NutchServer), the rest is receiving the calls, but is not getting the job done. What is the correct way to run nutch in distributed mode?
Info
When the InjectorJob is run offline in a distributed mode, the output is as follows:
COMMAND:
hadoop jar ./apache-nutch-2.3.job org.apache.nutch.crawl.InjectorJob s3://myemrbucket/urls -crawlId 2
15/11/19 09:55:06 INFO crawl.InjectorJob: InjectorJob: starting at 2015-11-19 09:55:06
15/11/19 09:55:06 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: s3://myemrbucket/urls
15/11/19 09:55:06 INFO s3native.NativeS3FileSystem: Created AmazonS3 with InstanceProfileCredentialsProvider
15/11/19 09:55:08 WARN store.HBaseStore: Mismatching schema's names. Mappingfile schema: 'webpage'. PersistentClass schema's name: '2_webpage'Assuming they are the same.
15/11/19 09:55:08 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
15/11/19 09:55:08 INFO mapred.JobClient: Default number of map tasks: null
15/11/19 09:55:08 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 4
15/11/19 09:55:08 INFO mapred.JobClient: Default number of reduce tasks: 0
15/11/19 09:55:10 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
15/11/19 09:55:10 INFO mapred.JobClient: Setting group to hadoop
15/11/19 09:55:10 INFO input.FileInputFormat: Total input paths to process : 1
15/11/19 09:55:10 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
15/11/19 09:55:10 WARN lzo.LzoCodec: Could not find build properties file with revision hash
15/11/19 09:55:10 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
15/11/19 09:55:10 WARN snappy.LoadSnappy: Snappy native library is available
15/11/19 09:55:10 INFO snappy.LoadSnappy: Snappy native library loaded
15/11/19 09:55:10 INFO mapred.JobClient: Running job: job_201511182052_0037
15/11/19 09:55:11 INFO mapred.JobClient: map 0% reduce 0%
15/11/19 09:55:38 INFO mapred.JobClient: map 100% reduce 0%
15/11/19 09:55:43 INFO mapred.JobClient: Job complete: job_201511182052_0037
15/11/19 09:55:43 INFO mapred.JobClient: Counters: 20
15/11/19 09:55:43 INFO mapred.JobClient: Job Counters
15/11/19 09:55:43 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=16424
15/11/19 09:55:43 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/11/19 09:55:43 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/11/19 09:55:43 INFO mapred.JobClient: Rack-local map tasks=1
15/11/19 09:55:43 INFO mapred.JobClient: Launched map tasks=1
15/11/19 09:55:43 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
15/11/19 09:55:43 INFO mapred.JobClient: File Output Format Counters
15/11/19 09:55:43 INFO mapred.JobClient: Bytes Written=0
15/11/19 09:55:43 INFO mapred.JobClient: injector
15/11/19 09:55:43 INFO mapred.JobClient: urls_injected=1
15/11/19 09:55:43 INFO mapred.JobClient: FileSystemCounters
15/11/19 09:55:43 INFO mapred.JobClient: HDFS_BYTES_READ=98
15/11/19 09:55:43 INFO mapred.JobClient: S3_BYTES_READ=61
15/11/19 09:55:43 INFO mapred.JobClient: FILE_BYTES_WRITTEN=36254
15/11/19 09:55:43 INFO mapred.JobClient: File Input Format Counters
15/11/19 09:55:43 INFO mapred.JobClient: Bytes Read=61
15/11/19 09:55:43 INFO mapred.JobClient: Map-Reduce Framework
15/11/19 09:55:43 INFO mapred.JobClient: Map input records=1
15/11/19 09:55:43 INFO mapred.JobClient: Physical memory (bytes) snapshot=193712128
15/11/19 09:55:43 INFO mapred.JobClient: Spilled Records=0
15/11/19 09:55:43 INFO mapred.JobClient: CPU time spent (ms)=3960
15/11/19 09:55:43 INFO mapred.JobClient: Total committed heap usage (bytes)=298319872
15/11/19 09:55:43 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1525059584
15/11/19 09:55:43 INFO mapred.JobClient: Map output records=1
15/11/19 09:55:43 INFO mapred.JobClient: SPLIT_RAW_BYTES=98
15/11/19 09:55:44 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0
15/11/19 09:55:44 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 1
15/11/19 09:55:44 INFO crawl.InjectorJob: Injector: finished at 2015-11-19 09:55:44, elapsed: 00:00:38
By calling it through the REST, the job gets stuck after giving out the following output:
POST ARGS:
{
"crawlId":"11",
"confId":"default",
"type":"INJECT",
"args":{"seedDir":"s3://myemrbucket/urls"}
}
15/11/19 09:46:14 INFO api.NutchServer: Starting NutchServer on port: 8081 with logging level: INFO ...
Nov 19, 2015 9:46:14 AM org.restlet.engine.connector.NetServerHelper start
INFO: Starting the internal [HTTP/1.1] server on port 8081
15/11/19 09:46:14 INFO api.NutchServer: Started NutchServer on port 8081
Nov 19, 2015 9:46:25 AM org.restlet.engine.log.LogFilter afterHandle
INFO: 2015-11-19 09:46:25 1xx.xx.x.xx - - 8081 POST /job/create - 200 28 110 498 http://ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com:8081 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36-
15/11/19 09:46:25 INFO s3native.NativeS3FileSystem: Created AmazonS3 with InstanceProfileCredentialsProvider
15/11/19 09:46:27 WARN store.HBaseStore: Mismatching schema's names. Mappingfile schema: 'webpage'. PersistentClass schema's name: '11_webpage'Assuming they are the same.
15/11/19 09:46:28 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
15/11/19 09:46:28 INFO mapred.JobClient: Default number of map tasks: null
15/11/19 09:46:28 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 4
15/11/19 09:46:28 INFO mapred.JobClient: Default number of reduce tasks: 0
15/11/19 09:46:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
and does not move ahead.

Second mapper not getting called - MultipleInputs

My job gets stuck once the first mapper (Reducemapper2) gets complete at "map 50% reduce 0%". I tried to debug a lot and googled it as well, but I'm not able to figure out the reason. Below is the driver class.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Reducedriver {
public static void main(String args[]) throws Exception
{
if(args.length!=3)
{
System.err.println("Usage: Worddrivernewapi <input path1> <inputpath2> <output path>");
System.exit(-1);
}
Configuration conf=new Configuration();
Job job=new Job(conf,"Reducesideexample");
job.setJarByClass(Reducedriver.class);
job.setJobName("Reducedriver");
Path path1=new Path(args[0]);
Path path2=new Path(args[1]); MultipleInputs.addInputPath(job,path1,TextInputFormat.class,Reducemapper1.class);
MultipleInputs.addInputPath(job,path2,TextInputFormat.class,Reducemapper2.class);
FileOutputFormat.setOutputPath(job,new Path(args[2]));
//job.setMapperClass(Reducemapper1.class);
job.setPartitionerClass(Reducepartitioner.class);
//job.setSortComparatorClass(Reducesortcomparator.class);
job.setGroupingComparatorClass(Reducegroupcomparator.class);
job.setReducerClass(Reducereducer.class);
//job.setNumReduceTasks(0);
job.setMapOutputKeyClass(ReduceWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Could some one help me in figuring out the issue?
This is a pseudo distributed mode with 2 mapper and reducer capacity. I had multiple successful runs in my 2 node capacity.
Log for a single mapper(Jobtracker log):
2015-05-16 11:10:56,630 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2015-05-16 11:10:57,126 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2015-05-16 11:10:57,288 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2015-05-16 11:10:57,309 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin#42f93a98
2015-05-16 11:10:57,484 INFO org.apache.hadoop.mapred.MapTask: Processing split: hdfs://localhost:9000/user/hduser/test/mapmainfile.dat:0+40
2015-05-16 11:10:57,512 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2015-05-16 11:10:57,591 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2015-05-16 11:10:57,592 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2015-05-16 11:10:57,607 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library not loaded
2015-05-16 11:10:57,666 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2015-05-16 11:10:57,669 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
From the terminal:
15/05/16 11:10:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/05/16 11:10:50 INFO input.FileInputFormat: Total input paths to process : 1
15/05/16 11:10:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/05/16 11:10:50 WARN snappy.LoadSnappy: Snappy native library not loaded
15/05/16 11:10:50 INFO input.FileInputFormat: Total input paths to process : 1
15/05/16 11:10:51 INFO mapred.JobClient: Running job: job_201505161109_0001
15/05/16 11:10:52 INFO mapred.JobClient: map 0% reduce 0%
15/05/16 11:11:04 INFO mapred.JobClient: map 100% reduce 0%.
When I tried to debug through localhost I could see that the first mapper completes and the map progress stops at 50%.
Localjobrunner log:
15/05/16 11:36:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/16 11:36:08 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/05/16 11:36:08 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
15/05/16 11:36:08 INFO input.FileInputFormat: Total input paths to process : 1
15/05/16 11:36:08 WARN snappy.LoadSnappy: Snappy native library not loaded
15/05/16 11:36:08 INFO input.FileInputFormat: Total input paths to process : 1
15/05/16 11:36:08 INFO mapred.JobClient: Running job: job_local815502428_0001
15/05/16 11:36:09 INFO mapred.LocalJobRunner: Waiting for map tasks
15/05/16 11:36:09 INFO mapred.LocalJobRunner: Starting task: attempt_local815502428_0001_m_000000_0
15/05/16 11:36:09 INFO util.ProcessTree: setsid exited with exit code 0
15/05/16 11:36:09 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin#11507b87
15/05/16 11:36:09 INFO mapred.MapTask: Processing split: file:/home/hduser/hadoop/myexamples/mainmapdatafile.dat:0+137
15/05/16 11:36:09 INFO mapred.MapTask: io.sort.mb = 100
15/05/16 11:36:09 INFO mapred.MapTask: data buffer = 79691776/99614720
15/05/16 11:36:09 INFO mapred.MapTask: record buffer = 262144/327680
15/05/16 11:36:09 INFO mapred.JobClient: map 0% reduce 0%
15/05/16 11:36:18 INFO mapred.LocalJobRunner:
15/05/16 11:36:18 INFO mapred.JobClient: map 6% reduce 0%
15/05/16 11:36:27 INFO mapred.LocalJobRunner:
15/05/16 11:36:28 INFO mapred.JobClient: map 12% reduce 0%
15/05/16 11:36:36 INFO mapred.LocalJobRunner:
15/05/16 11:36:37 INFO mapred.JobClient: map 18% reduce 0%
15/05/16 11:36:45 INFO mapred.LocalJobRunner:
15/05/16 11:36:46 INFO mapred.JobClient: map 25% reduce 0%
15/05/16 11:36:51 INFO mapred.LocalJobRunner:
15/05/16 11:36:52 INFO mapred.JobClient: map 31% reduce 0%
15/05/16 11:36:57 INFO mapred.LocalJobRunner:
15/05/16 11:36:58 INFO mapred.JobClient: map 37% reduce 0%
15/05/16 11:37:03 INFO mapred.LocalJobRunner:
15/05/16 11:37:04 INFO mapred.JobClient: map 43% reduce 0%
15/05/16 11:37:09 INFO mapred.LocalJobRunner:
15/05/16 11:37:10 INFO mapred.JobClient: map 50% reduce 0%
15/05/16 11:37:12 INFO mapred.MapTask: Starting flush of map output
15/05/16 11:37:12 INFO mapred.MapTask: Starting flush of map output
15/05/16 11:37:18 INFO mapred.LocalJobRunner:

hadoop test examples to validate the installation

I have successfully configured Hadoop 2.4 on my Ubuntu 14.04 using this tutorial.
http://dogdogfish.com/2014/04/26/installing-hadoop-2-4-on-ubuntu-14-04/
Now after completing installtion how can I perform test on it?
How and where can I get the test data or jar files?
You have some example jars in your hadoop installation directory.
Simplest thing you can do is run the teragen example(or wordcount).
It is the first step in perform terasort.
Steps:
Go to the hadoop installation directory.
Run "hadoop jar hadoop-examples-0.20.2-cdh3u0.jar" to see all the jars you can run.
Go to home/[user] directory and create a file "example.txt" with the following data
"This is a file to test Hadoop Installation example
For the sake of the experiment, consider it to be 1TB"
While you are in that directory, run "hadoop dfs -put examples.txt /" this uploads the file onto your HDFS
Run "hadoop dfs -ls /" to check it is on there
Go to your Hadoop installation directory and run "hadoop jar hadoop-examples-0.20.2-cdh3u0.jar teragen 1000 /user/teragendata" - 1000 is the size data is to be broken into and the other param is the output directory.
On successful execution, you will see something like the text at the bottom.
Now to see how your MR job was run, in your browser open JobTracker and see the completed jobs. "localhost50030/jobtracker.jsp"
cloudera#cloudera-vm:/usr/lib/hadoop$ hadoop jar hadoop-examples-0.20.2-cdh3u0.jar teragen 600 /user/teragendata
Generating 600 using 2 maps with step of 300
14/07/24 09:02:44 INFO mapred.JobClient: Running job: job_201407230030_0008
14/07/24 09:02:45 INFO mapred.JobClient: map 0% reduce 0%
14/07/24 09:02:57 INFO mapred.JobClient: map 100% reduce 0%
14/07/24 09:03:00 INFO mapred.JobClient: Job complete: job_201407230030_0008
14/07/24 09:03:00 INFO mapred.JobClient: Counters: 13
14/07/24 09:03:00 INFO mapred.JobClient: Job Counters
14/07/24 09:03:00 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22008
14/07/24 09:03:00 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/07/24 09:03:00 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/07/24 09:03:00 INFO mapred.JobClient: Launched map tasks=2
14/07/24 09:03:00 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
14/07/24 09:03:00 INFO mapred.JobClient: FileSystemCounters
14/07/24 09:03:00 INFO mapred.JobClient: HDFS_BYTES_READ=164
14/07/24 09:03:00 INFO mapred.JobClient: FILE_BYTES_WRITTEN=105150
14/07/24 09:03:00 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=60000
14/07/24 09:03:00 INFO mapred.JobClient: Map-Reduce Framework
14/07/24 09:03:00 INFO mapred.JobClient: Map input records=600
14/07/24 09:03:00 INFO mapred.JobClient: Spilled Records=0
14/07/24 09:03:00 INFO mapred.JobClient: Map input bytes=600
14/07/24 09:03:00 INFO mapred.JobClient: Map output records=600
14/07/24 09:03:00 INFO mapred.JobClient: SPLIT_RAW_BYTES=164

hadoop streaming job failed Unable to load realm info from SCDynamicStore env: ruby\r: No such file or directory

While running hadoop streaming using ruby as my mapper and reduce functions, I get the following error.
packageJobJar: [summarymapper.rb, wcreducer.rb, /var/lib/hadoop/hadoop-unjar6514686449101598265/] [] /var/folders/md/0ww65qrx1_n1nlhrr7hrs8d00000gn/T/streamjob9165241112855689376.jar tmpDir=null
14/06/25 19:54:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/06/25 19:54:35 WARN snappy.LoadSnappy: Snappy native library not loaded
14/06/25 19:54:35 INFO mapred.FileInputFormat: Total input paths to process : 1
14/06/25 19:54:35 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop/mapred/local]
14/06/25 19:54:35 INFO streaming.StreamJob: Running job: job_201406251944_0005
14/06/25 19:54:35 INFO streaming.StreamJob: To kill this job, run:
14/06/25 19:54:35 INFO streaming.StreamJob: /Users/oladotunopasina/hadoop-1.2.1/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201406251944_0005
14/06/25 19:54:35 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201406251944_0005
14/06/25 19:54:36 INFO streaming.StreamJob: map 0% reduce 0%
14/06/25 19:55:18 INFO streaming.StreamJob: map 100% reduce 100%
14/06/25 19:55:18 INFO streaming.StreamJob: To kill this job, run:
14/06/25 19:55:18 INFO streaming.StreamJob: /Users/oladotunopasina/hadoop-1.2.1/libexec/../bin/hadoop job -Dmapred.job.tracker=localhost:8021 -kill job_201406251944_0005
14/06/25 19:55:18 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201406251944_0005
14/06/25 19:55:18 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201406251944_0005_m_000001
14/06/25 19:55:18 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
On checking the log file produced , I see this
stderr logs
2014-06-25 19:54:38.332 java[8468:1003] Unable to load realm info from SCDynamicStore
env: ruby\r: No such file or directory
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:394)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
I have tried using this thread Hadoop environment variables
but still have no success. Kindly help.
I solved this problem by resaving my .rb files on the mac.It seems like the version I downloaded was saved as a pc file. "\r" is an hidden character present in the mapper and reducer classes.

hadoop showing map reduce percentages running twice

I'm running Apache's Hadoop, and using the grep example provided by that installation. I'm wondering why map reduce percentages show up running twice? I thought they only had to run once; which makes me doubt my understanding of map reduce. I looked it up (http://grokbase.com/t/gg/mongodb-user/125ay1eazq/map-reduce-percentage-seems-running-twice) but there really wasn't an explanation and this link was for MongoDB.
hduser#ubse1:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar grep /user/hduser/grep /user/hduser/grep-output4 ".*woe is me.*"
I'm running this on a project gutenberg .txt file. The output file is correct.
Here is the output for running the command if needed:
12/08/06 06:56:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/08/06 06:56:57 WARN snappy.LoadSnappy: Snappy native library not loaded
12/08/06 06:56:57 INFO mapred.FileInputFormat: Total input paths to process : 1
12/08/06 06:56:58 INFO mapred.JobClient: Running job: job_201208030925_0011
12/08/06 06:56:59 INFO mapred.JobClient: map 0% reduce 0%
12/08/06 06:57:18 INFO mapred.JobClient: map 100% reduce 0%
12/08/06 06:57:30 INFO mapred.JobClient: map 100% reduce 100%
12/08/06 06:57:35 INFO mapred.JobClient: Job complete: job_201208030925_0011
12/08/06 06:57:35 INFO mapred.JobClient: Counters: 30
12/08/06 06:57:35 INFO mapred.JobClient: Job Counters
12/08/06 06:57:35 INFO mapred.JobClient: Launched reduce tasks=1
12/08/06 06:57:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=31034
12/08/06 06:57:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/06 06:57:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/06 06:57:35 INFO mapred.JobClient: Rack-local map tasks=2
12/08/06 06:57:35 INFO mapred.JobClient: Launched map tasks=2
12/08/06 06:57:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=11233
12/08/06 06:57:35 INFO mapred.JobClient: File Input Format Counters
12/08/06 06:57:35 INFO mapred.JobClient: Bytes Read=5592666
12/08/06 06:57:35 INFO mapred.JobClient: File Output Format Counters
12/08/06 06:57:35 INFO mapred.JobClient: Bytes Written=391
12/08/06 06:57:35 INFO mapred.JobClient: FileSystemCounters
12/08/06 06:57:35 INFO mapred.JobClient: FILE_BYTES_READ=281
12/08/06 06:57:35 INFO mapred.JobClient: HDFS_BYTES_READ=5592862
12/08/06 06:57:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=65331
12/08/06 06:57:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=391
12/08/06 06:57:35 INFO mapred.JobClient: Map-Reduce Framework
12/08/06 06:57:35 INFO mapred.JobClient: Map output materialized bytes=287
12/08/06 06:57:35 INFO mapred.JobClient: Map input records=124796
12/08/06 06:57:35 INFO mapred.JobClient: Reduce shuffle bytes=287
12/08/06 06:57:35 INFO mapred.JobClient: Spilled Records=10
12/08/06 06:57:35 INFO mapred.JobClient: Map output bytes=265
12/08/06 06:57:35 INFO mapred.JobClient: Total committed heap usage (bytes)=336404480
12/08/06 06:57:35 INFO mapred.JobClient: CPU time spent (ms)=7040
12/08/06 06:57:35 INFO mapred.JobClient: Map input bytes=5590193
12/08/06 06:57:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=196
12/08/06 06:57:35 INFO mapred.JobClient: Combine input records=5
12/08/06 06:57:35 INFO mapred.JobClient: Reduce input records=5
12/08/06 06:57:35 INFO mapred.JobClient: Reduce input groups=5
12/08/06 06:57:35 INFO mapred.JobClient: Combine output records=5
12/08/06 06:57:35 INFO mapred.JobClient: Physical memory (bytes) snapshot=464568320
12/08/06 06:57:35 INFO mapred.JobClient: Reduce output records=5
12/08/06 06:57:35 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1539559424
12/08/06 06:57:35 INFO mapred.JobClient: Map output records=5
12/08/06 06:57:35 INFO mapred.FileInputFormat: Total input paths to process : 1
12/08/06 06:57:35 INFO mapred.JobClient: Running job: job_201208030925_0012
12/08/06 06:57:36 INFO mapred.JobClient: map 0% reduce 0%
12/08/06 06:57:50 INFO mapred.JobClient: map 100% reduce 0%
12/08/06 06:58:05 INFO mapred.JobClient: map 100% reduce 100%
12/08/06 06:58:10 INFO mapred.JobClient: Job complete: job_201208030925_0012
12/08/06 06:58:10 INFO mapred.JobClient: Counters: 30
12/08/06 06:58:10 INFO mapred.JobClient: Job Counters
12/08/06 06:58:10 INFO mapred.JobClient: Launched reduce tasks=1
12/08/06 06:58:10 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=15432
12/08/06 06:58:10 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/06 06:58:10 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/06 06:58:10 INFO mapred.JobClient: Rack-local map tasks=1
12/08/06 06:58:10 INFO mapred.JobClient: Launched map tasks=1
12/08/06 06:58:10 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14264
12/08/06 06:58:10 INFO mapred.JobClient: File Input Format Counters
12/08/06 06:58:10 INFO mapred.JobClient: Bytes Read=391
12/08/06 06:58:10 INFO mapred.JobClient: File Output Format Counters
12/08/06 06:58:10 INFO mapred.JobClient: Bytes Written=235
12/08/06 06:58:10 INFO mapred.JobClient: FileSystemCounters
12/08/06 06:58:10 INFO mapred.JobClient: FILE_BYTES_READ=281
12/08/06 06:58:10 INFO mapred.JobClient: HDFS_BYTES_READ=505
12/08/06 06:58:10 INFO mapred.JobClient: FILE_BYTES_WRITTEN=42985
12/08/06 06:58:10 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=235
12/08/06 06:58:10 INFO mapred.JobClient: Map-Reduce Framework
12/08/06 06:58:10 INFO mapred.JobClient: Map output materialized bytes=281
12/08/06 06:58:10 INFO mapred.JobClient: Map input records=5
12/08/06 06:58:10 INFO mapred.JobClient: Reduce shuffle bytes=0
12/08/06 06:58:10 INFO mapred.JobClient: Spilled Records=10
EDIT Driver Class for Grep:
Grep.java
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.examples;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/* Extracts matching regexs from input files and counts them. */
public class Grep extends Configured implements Tool {
private Grep() {} // singleton
public int run(String[] args) throws Exception {
if (args.length < 3) {
System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
ToolRunner.printGenericCommandUsage(System.out);
return -1;
}
Path tempDir =
new Path("grep-temp-"+
Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
JobConf grepJob = new JobConf(getConf(), Grep.class);
try {
grepJob.setJobName("grep-search");
FileInputFormat.setInputPaths(grepJob, args[0]);
grepJob.setMapperClass(RegexMapper.class);
grepJob.set("mapred.mapper.regex", args[2]);
if (args.length == 4)
grepJob.set("mapred.mapper.regex.group", args[3]);
grepJob.setCombinerClass(LongSumReducer.class);
grepJob.setReducerClass(LongSumReducer.class);
FileOutputFormat.setOutputPath(grepJob, tempDir);
grepJob.setOutputFormat(SequenceFileOutputFormat.class);
grepJob.setOutputKeyClass(Text.class);
grepJob.setOutputValueClass(LongWritable.class);
JobClient.runJob(grepJob);
JobConf sortJob = new JobConf(getConf(), Grep.class);
sortJob.setJobName("grep-sort");
FileInputFormat.setInputPaths(sortJob, tempDir);
sortJob.setInputFormat(SequenceFileInputFormat.class);
sortJob.setMapperClass(InverseMapper.class);
sortJob.setNumReduceTasks(1); // write a single file
FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
sortJob.setOutputKeyComparatorClass // sort by decreasing freq
(LongWritable.DecreasingComparator.class);
JobClient.runJob(sortJob);
}
finally {
FileSystem.get(grepJob).delete(tempDir, true);
}
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new Grep(), args);
System.exit(res);
}
}
In the file there are the statistics of two jobs: job: job_201208030925_0011 and job: job_201208030925_0012. The percentages belong to these two jobs, hence there are 2 map progress percentages.

Resources