How to put the files into memory using Hadoop Distributed cache? - hadoop

As far as I know, distributed cache copies files to every node, then map or reduce reads the files from the local file system.
My question is: Is there a way that we can put our files into memory using Hadoop distributed cache so that every map or reduce can read files directly from memory?
My MapReduce program distributes a png picture which is about 1M to every node, then every map task reads the picture from the distributed cache and does some image processing with another picture from the input of the map.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
Path[] uris = DistributedCache.getLocalCacheFiles(context
.getConfiguration());
try{
BufferedReader readBuffer1 = new BufferedReader(new FileReader(uris[0].toString()));
String line;
while ((line=readBuffer1.readLine())!=null){
System.out.println(line);
}
readBuffer1.close();
}
catch (Exception e){
System.out.println(e.toString());
}
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
int length=key.getLength();
System.out.println("length"+length);
result.set(sum);
/* key.set("lenght"+lenght);*/
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
final String NAME_NODE = "hdfs://localhost:9000";
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: wordcount <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
DistributedCache.addCacheFile(new URI(NAME_NODE
+ "/dataset1.txt"),
job.getConfiguration());
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Great question. I am trying to solve a similar issue. I don't think Hadoop supports an in-memory cache out of the box. However, it should not be very difficult to run a separate in-memory cache somewhere on the grid for this purpose. We can pass the location of the cache and the name of the parameter as part of the job configuration.
The code example above doesn't answer the original question, and it is also a suboptimal sample. Ideally you should read the cache file in the setup() method and keep in memory whatever you want to use in map(). In the example above the cache file is read once for every key-value pair, which hurts the performance of the MapReduce job.
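A minimal plain-Java sketch of the "read once, keep in memory" idea follows. SideDataCache and its methods are hypothetical names, not part of Hadoop. The assumption is that a Mapper's setup() would call SideDataCache.get(...) with the local path returned by DistributedCache.getLocalCacheFiles and store the result in a field, so that map() never touches the disk. The static map only helps tasks that happen to share a JVM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical helper (not a Hadoop class): keeps side files in memory for the life of the JVM. */
public final class SideDataCache {
    // One entry per path; the value is the full file content.
    private static final ConcurrentMap<String, byte[]> LOADED = new ConcurrentHashMap<>();

    private SideDataCache() {}

    /** Reads the file at localPath on the first call and returns the cached bytes afterwards. */
    public static byte[] get(String localPath) {
        return get(localPath, SideDataCache::readFile);
    }

    /** Same caching logic with a pluggable loader, so it can be exercised without a real file. */
    public static byte[] get(String key, Function<String, byte[]> loader) {
        return LOADED.computeIfAbsent(key, loader);
    }

    private static byte[] readFile(String path) {
        try {
            return Files.readAllBytes(Paths.get(path));
        } catch (IOException e) {
            // Surface the failure instead of silently continuing with missing data.
            throw new UncheckedIOException(e);
        }
    }
}
```

For a file of about 1 MB, such as the picture in the question, holding the bytes in a field for the lifetime of the task should be cheap compared with re-reading the file for every record.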

Related

Hadoop multiple input files error

I'm trying to read two files from HDFS with the code below, but I get the error shown further down.
I am a beginner in MapReduce programming and have been stuck on this problem for a couple of days; any help will be appreciated.
My code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
public class Recipe {
public static class TokenizerMapper1
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String line=value.toString();
word.set(line.substring(2,8));
context.write(word,one);
}
}
public static class TokenizerMapper2
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String line=value.toString();
word.set(line.substring(2,8));
context.write(word,one);
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: recipe <in> <out>");
System.exit(2);
}
@SuppressWarnings("deprecation")
Job job = new Job(conf, "Recipe");
job.setJarByClass(Recipe.class);
job.setMapperClass(TokenizerMapper1.class);
job.setMapperClass(TokenizerMapper2.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
MultipleInputs.addInputPath(job,new Path(args[0]),TextInputFormat.class,TokenizerMapper1.class);
MultipleInputs.addInputPath(job,new Path(args[1]),TextInputFormat.class,TokenizerMapper2.class);
FileOutputFormat.setOutputPath(job, new Path(args[2]));
//FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/in"));
//FileOutputFormat.setOutputPath(job, new Path("hdfs://127.0.0.1:9000/out"));
System.exit(job.waitForCompletion(true) ? 0 : 1);
// job.submit();
}
And I've set the program's run configuration arguments like this:
/in /put
Error:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
at Recipe.main(Recipe.java:121)
There are several issues. The program expects 3 arguments and you are passing only 2. Also, if you have to process multiple input paths with different formats or mappers, you need to use MultipleInputs.
Assume that you invoke the program with /in1 /in2 /out:
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, TokenizerMapper1.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, TokenizerMapper2.class);
You can remove these lines from the code:
job.setMapperClass(TokenizerMapper1.class);
job.setMapperClass(TokenizerMapper2.class);
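The ArrayIndexOutOfBoundsException comes from reading args[2] after a check that only accepts two arguments. A small sketch of validating the argument count before any index is used is shown here; JobArgs is a hypothetical helper name, and the usage string is an assumption based on the /in1 /in2 /out invocation above.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: validates "<in1> <in2> <out>" before any index is dereferenced. */
public final class JobArgs {
    public final List<String> inputs;
    public final String output;

    private JobArgs(List<String> inputs, String output) {
        this.inputs = inputs;
        this.output = output;
    }

    /** Returns the parsed arguments, or throws if the count is not exactly three. */
    public static JobArgs parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                "Usage: recipe <in1> <in2> <out> (got " + args.length + " argument(s))");
        }
        return new JobArgs(Arrays.asList(args[0], args[1]), args[2]);
    }
}
```

With this in place, passing only /in /put fails immediately with a usage message instead of an index error deep inside main.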
Now it works with the following modifications:
Put every file in a separate directory.
Use the real addresses instead of args[], as shown below:
MultipleInputs.addInputPath(job,new Path("hdfs://localhost:9000/in1"),TextInputFormat.class,TokenizerMapper1.class);
MultipleInputs.addInputPath(job,new Path("hdfs://localhost:9000/in2"),TextInputFormat.class,TokenizerMapper1.class);
FileOutputFormat.setOutputPath(job, new Path("hdfs://127.0.0.1:9000/out"));
Specify all input and output paths in the run configuration arguments like this:
127.0.0.1:9000/in1/DNAIn.txt 127.0.0.1:9000/in2/DNAIn2.txt 127.0.0.1:9000/out

Run WordCount without reducer in hadoop

I have installed a Hadoop cluster environment (master and slave). It works smoothly.
I tried wordcount and grep using the (hadoop.example.jar) file and they also work fine.
Now I want to edit (hadoop.example.jar) to run only the mapper without a reducer. Is there a way to do that?
I read some articles that say I have to call setNumReduceTasks(0), but I don't know how to do that using the (hadoop.example.jar) file.
You can't change the hadoop.example.jar file.
You need to create your own custom code and export it as a jar file.
The modified wordcount code should be:
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setNumReduceTasks(0); // map-only job: no reduce phase
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
The original source code

Sampling Records from Hadoop Mapper

I have a dataset whose key consists of 3 parts: a, b and c. In my mapper, I would like to emit records with the key as 'a' and the value as 'a,b,c'
How do I emit 10% of the total records for each 'a' that is detected from the mapper in Hadoop? Should one consider saving the total number of records seen for each 'a' from a previous Map-Reduce job in a temp file?
If you want close to 10%, you can use Random. Here is an example of Mapper:
public class Test extends Mapper<LongWritable, Text, LongWritable, Text> {
private Random r = new Random();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
if (r.nextInt(10) == 0) {
context.write(key, value);
}
}
}
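The sampling decision itself can be tried outside Hadoop. The sketch below is plain Java with hypothetical names (BernoulliSampler, observedFraction) and shows that an independent coin flip per record keeps roughly the requested fraction. The result is only approximate: for an 'a' with few records the kept share can be far from 10%, and an exact 10% per 'a' would need the per-key counts mentioned in the question.

```java
import java.util.Random;

/** Hypothetical helper: keeps each record independently with a fixed probability. */
public final class BernoulliSampler {
    private final Random random;
    private final double rate;

    public BernoulliSampler(double rate, long seed) {
        if (rate < 0.0 || rate > 1.0) {
            throw new IllegalArgumentException("rate must be in [0, 1]: " + rate);
        }
        this.rate = rate;
        this.random = new Random(seed); // fixed seed makes a run reproducible
    }

    /** Returns true when the current record should be emitted. */
    public boolean keep() {
        return random.nextDouble() < rate;
    }

    /** Simulates a stream of records and returns the fraction that was kept. */
    public static double observedFraction(double rate, long seed, int records) {
        BernoulliSampler sampler = new BernoulliSampler(rate, seed);
        int kept = 0;
        for (int i = 0; i < records; i++) {
            if (sampler.keep()) {
                kept++;
            }
        }
        return (double) kept / records;
    }
}
```

In a Mapper, keep() would guard the context.write call in the same way r.nextInt(10) == 0 does above.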
Use this java code to select 10% randomly:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class RandomSample {
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
private Text word = new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// Emit roughly 10% of the records; skipped records write nothing.
// With a null value, TextOutputFormat writes only the key (the input line).
if (Math.random() < 0.1) {
context.write(value, null);
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "randomsample");
job.setJarByClass(RandomSample.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setNumReduceTasks(0);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
And use this bash script to run it
echo "Running Job"
hadoop jar RandomSample.jar RandomSample $1 tmp
echo "copying result to local path (RandomSample)"
hadoop fs -getmerge tmp RandomSample
echo "Clean up"
hadoop fs -rmr tmp
For example, if we name the script random_sample.sh, to select 10% from folder /example/, simply run
./random_sample.sh /example/

Replicated join using distributed cache in Hadoop 0.20

I have been trying the replicated join using the distributed cache on both a cluster and the Karmasphere interface. I have pasted the code below. My program is unable to find the file in the cache.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Hashtable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
// A demonstration of Hadoop's DistributedCache tool
//
public class MapperSideJoinWithDistributedCache extends Configured implements Tool {
private final static String inputa = "C:/Users/LopezGG/workspace/Second_join/input1_1" ;
public static class MapClass extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
private Hashtable<String, String> joinData = new Hashtable<String, String>();
@Override
public void configure(JobConf conf) {
try {
Path [] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
System.out.println("ds"+DistributedCache.getLocalCacheFiles(conf));
if (cacheFiles != null && cacheFiles.length > 0) {
String line;
String[] tokens;
BufferedReader joinReader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
try {
while ((line = joinReader.readLine()) != null) {
tokens = line.split(",", 2);
joinData.put(tokens[0], tokens[1]);
}
} finally {
joinReader.close();
}
}
else
System.out.println("joinreader not set" );
} catch(IOException e) {
System.err.println("Exception reading DistributedCache: " + e);
}
}
public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
String joinValue = joinData.get(key.toString());
if (joinValue != null) {
output.collect(key,new Text(value.toString() + "," + joinValue));
}
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
JobConf job = new JobConf(conf, MapperSideJoinWithDistributedCache.class);
DistributedCache.addCacheFile(new Path(args[0]).toUri(), job);
//System.out.println( DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf));
Path in = new Path(args[1]);
Path out = new Path(args[2]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
job.setJobName("DataJoin with DistributedCache");
job.setMapperClass(MapClass.class);
job.setNumReduceTasks(0);
job.setInputFormat( KeyValueTextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.set("key.value.separator.in.input.line", ",");
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
long time1= System.currentTimeMillis();
System.out.println(time1);
int res = ToolRunner.run(new Configuration(),
new MapperSideJoinWithDistributedCache(),args);
long time2= System.currentTimeMillis();
System.out.println(time2);
System.out.println("millsecs elapsed:"+(time2-time1));
System.exit(res);
}
}
The error I get is
O mapred.MapTask: numReduceTasks: 0
Exception reading DistributedCache: java.io.FileNotFoundException: \tmp\hadoop-LopezGG\mapred\local\archive\-2564469513526622450_-1173562614_1653082827\file\C\Users\LopezGG\workspace\Second_join\input1_1 (The system cannot find the file specified)
ds[Lorg.apache.hadoop.fs.Path;@366a88bb
12/04/24 23:15:01 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/04/24 23:15:01 INFO mapred.LocalJobRunner:
But the task executes to completion. Could someone please help me? I have looked at the other posts and made all the suggested modifications, but it still does not work.
I must confess that I never use the DistributedCache class (I use the -files option via the GenericOptionsParser instead), but I'm not sure that DistributedCache automatically copies local files into HDFS before running your job.
While I can't find any evidence of this in the Hadoop docs, the Pro Hadoop book mentions something to this effect:
http://books.google.com/books?id=8DV-EzeKigQC&pg=PA133&dq=%22The+URI+must+be+on+the+JobTracker+shared+file+system%22&hl=en&sa=X&ei=jNGXT_LKOKLA6AG1-7j6Bg&ved=0CEsQ6AEwAA#v=onepage&q=%22The%20URI%20must%20be%20on%20the%20JobTracker%20shared%20file%20system%22&f=false
In your case, copy the file to HDFS first, and then when you call DistributedCache.addCacheFile, pass the URI of the file in HDFS and see if that works for you.
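Once the cache file is readable, the join in the question is just a lookup in an in-memory table. The sketch below isolates that logic in plain Java so it can be checked without a cluster; ReplicatedJoin, buildTable and join are hypothetical names. Unlike the configure() method in the question, it skips lines that contain no comma, which is an added assumption about how malformed input should be handled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: the in-memory side of a replicated (map-side) join. */
public final class ReplicatedJoin {
    private ReplicatedJoin() {}

    /** Builds the lookup table from "key,rest-of-line" records; lines without a comma are skipped. */
    public static Map<String, String> buildTable(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] tokens = line.split(",", 2);
            if (tokens.length == 2) {
                table.put(tokens[0], tokens[1]);
            }
        }
        return table;
    }

    /** Returns "value,joinValue", or null when the key has no match (inner-join behaviour). */
    public static String join(Map<String, String> table, String key, String value) {
        String joinValue = table.get(key);
        return joinValue == null ? null : value + "," + joinValue;
    }
}
```

In the mapper from the question, buildTable would be fed the lines of the cache file in configure(), and join would replace the body of map().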

MapReduce Job not showing my print statements on the terminal

I am trying to figure out what happens when a MapReduce job runs by placing System.out.println() calls at certain places in the code, but none of those print statements show up on my terminal when the job runs. Can someone help me figure out what exactly I am doing wrong here?
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.StatusReporter;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountJob {
public static int iterations;
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
@Override
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
System.out.println("blalblbfbbfbbbgghghghghghgh");
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
String myWord = itr.nextToken();
int n = 0;
while(n< 5){
myWord = myWord+ "Test my appending words";
n++;
}
System.out.println("Print my word: "+myWord);
word.set(myWord);
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
TaskAttemptID taskid = new TaskAttemptID();
TokenizerMapper my = new TokenizerMapper();
if (args.length != 3) {
System.err.println("Usage: WordCountJob <in> <out> <iterations>");
System.exit(2);
}
iterations = new Integer(args[2]);
Path inPath = new Path(args[0]);
Path outPath = null;
for (int i = 0; i<iterations; ++i){
System.out.println("Iteration number: "+i);
outPath = new Path(args[1]+i);
Job job = new Job(conf, "WordCountJob");
job.setJarByClass(WordCountJob.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, inPath);
FileOutputFormat.setOutputPath(job, outPath);
job.waitForCompletion(true);
inPath = outPath;
}
}
}
This depends on how you are submitting your job. I think you're submitting it using bin/hadoop jar yourJar.jar, right?
Only the System.out.println() calls in your main method show up on the terminal. The mapper and reducer are executed inside Hadoop in a different JVM, and all of their output is redirected to special log files (out/log files).
I would recommend using your own Apache Commons Logging logger:
Log log = LogFactory.getLog(YOUR_MAPPER_CLASS.class);
Then do some info logging:
log.info("Your message");
If you're in "local" mode you can see this log in your shell; otherwise the log is stored on the machine where the task gets executed. Use the JobTracker's web UI to look at these log files; it is quite convenient. By default the JobTracker runs on port 50030.
Alternatively, you can use the MultipleOutputs class and redirect all your log data into one output file (log).
MultipleOutputs<Text, Text> mos = new MultipleOutputs<Text, Text>(context);
Text tKey = new Text("key");
Text tVal = new Text("log message");
mos.write(tKey, tVal, <lOG_FILE>);
