Distributed Cache Hadoop not retrieving the file content - hadoop

I am getting a garbage-like value instead of the data from the file I want to use as a distributed cache.
The Job Configuration is as follows:
Configuration config5 = new Configuration();
JobConf conf5 = new JobConf(config5, Job5.class);
conf5.setJobName("Job5");
conf5.setOutputKeyClass(Text.class);
conf5.setOutputValueClass(Text.class);
conf5.setMapperClass(MapThree4c.class);
conf5.setReducerClass(ReduceThree5.class);
conf5.setInputFormat(TextInputFormat.class);
conf5.setOutputFormat(TextOutputFormat.class);
DistributedCache.addCacheFile(new URI("/home/users/mlakshm/ap1228"), conf5);
FileInputFormat.setInputPaths(conf5, new Path(other_args.get(5)));
FileOutputFormat.setOutputPath(conf5, new Path(other_args.get(6)));
JobClient.runJob(conf5);
In the Mapper, I have the following code:
public class MapThree4c extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private Set<String> prefixCandidates = new HashSet<String>();
    Text a = new Text();
    public void configure(JobConf conf5) {
        Path[] dates = new Path[0];
        try {
            dates = DistributedCache.getLocalCacheFiles(conf5);
            System.out.println("cached files: " + dates);
            String astr = dates.toString();
            a = new Text(astr);
        } catch (IOException ioe) {
            System.err.println("Caught exception while getting cached files: " +
                StringUtils.stringifyException(ioe));
        }
    }
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer st = new StringTokenizer(line);
        st.nextToken();
        String t = st.nextToken();
        String uidi = st.nextToken();
        String uidj = st.nextToken();
        String check = null;
        output.collect(new Text(line), a);
    }
}
The output value I am getting from this mapper is [Lorg.apache.hadoop.fs.Path;@786c1a82 instead of the value from the distributed cache file.

That is what you get when you call toString() on an array, and if you look at the javadocs for DistributedCache.getLocalCacheFiles(), an array of Paths is exactly what it returns. If you need to actually read the contents of the files in the cache, you can open and read them with the standard Java I/O APIs.

From your code:
Path[] dates = DistributedCache.getLocalCacheFiles(conf5);
Implies that:
String astr = dates.toString(); // this is the default String representation of the array object (i.e. dates), which is what you see in the output as [Lorg.apache.hadoop.fs.Path;@786c1a82.
You need to do the following to see the actual paths:
for (Path cacheFile : dates) {
    output.collect(new Text(line), new Text(cacheFile.getName()));
}
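If you actually need the file contents rather than just the paths, a minimal sketch of reading the first cached file with plain Java I/O inside configure() could look like the following (the cachedLines field and the choice of the first cache file are illustrative, not part of the original code):
// additional imports needed for this sketch:
// java.io.BufferedReader, java.io.FileReader, java.util.ArrayList, java.util.List
private List<String> cachedLines = new ArrayList<String>();

public void configure(JobConf conf) {
    try {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
        if (cacheFiles != null && cacheFiles.length > 0) {
            // the returned paths are local files on the task node, so java.io can read them
            BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                cachedLines.add(line);
            }
            reader.close();
        }
    } catch (IOException ioe) {
        System.err.println("Caught exception while reading cached file: " +
            StringUtils.stringifyException(ioe));
    }
}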

Related

How to remove r-00000 extension from reducer output in mapreduce

I am able to rename my reducer output file correctly, but the r-00000 suffix is still persisting.
I have used MultipleOutputs in my reducer class.
Here are the details. Not sure what I am missing or what extra I have to do.
public class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
    private Logger logger = Logger.getLogger(MyReducer.class);
    private MultipleOutputs<NullWritable, Text> multipleOutputs;
    String strName = "";
    public void setup(Context context) {
        logger.info("Inside Reducer.");
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }
    @Override
    public void reduce(NullWritable Key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            final String valueStr = value.toString();
            String[] strArrvalueStr = valueStr.split("\\|!\\|"); // declaration implied by the original code; delimiter assumed
            StringBuilder sb = new StringBuilder();
            sb.append(strArrvalueStr[0] + "|!|");
            multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);
        }
    }
    public void cleanup(Context context) throws IOException,
            InterruptedException {
        multipleOutputs.close();
    }
}
I was able to do it explicitly after my job finishes, and that's OK for me; there is no noticeable delay in the job:
if (b) {
    DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd-HHmm");
    Calendar cal = Calendar.getInstance();
    String strDate = dateFormat.format(cal.getTime());
    FileSystem hdfs = FileSystem.get(getConf());
    FileStatus fs[] = hdfs.listStatus(new Path(args[1]));
    if (fs != null) {
        for (FileStatus aFile : fs) {
            if (!aFile.isDir()) {
                hdfs.rename(aFile.getPath(), new Path(aFile.getPath().toString() + ".txt"));
            }
        }
    }
}
A more suitable approach to the problem would be changing the OutputFormat.
For example, if you are using TextOutputFormat, get the source code of the TextOutputFormat class and modify the method below so that it builds the proper filename (without r-00000). The modified output format then needs to be set in the driver.
public synchronized static String getUniqueFile(TaskAttemptContext context, String name, String extension) {
    /* TaskID taskId = context.getTaskAttemptID().getTaskID();
       int partition = taskId.getId(); */
    StringBuilder result = new StringBuilder();
    result.append(name);
    /*
     * result.append('-');
     * result.append(TaskID.getRepresentingCharacter(taskId.getTaskType()));
     * result.append('-'); result.append(NUMBER_FORMAT.format(partition));
     * result.append(extension);
     */
    return result.toString();
}
So whatever name is passed through MultipleOutputs, the filename will be created according to it.
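For reference, a minimal driver sketch for wiring in such a format; MyTextOutputFormat stands for the copied-and-modified class, and MyDriver and MyReducer are placeholder names:
Configuration conf = new Configuration();
Job job = new Job(conf, "output-without-r-00000");
job.setJarByClass(MyDriver.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
// point the job at the modified output format so the getUniqueFile() above is the one used
job.setOutputFormatClass(MyTextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);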

ArrayIndexOutOfBoundsException in MapReduce

I am getting an ArrayIndexOutOfBoundsException in the map part. My code is below. I am trying to read the input file from HDFS. Is there a better way to read the HDFS file?
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private Text key12 = new Text();
    private Text value = new Text();
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        while ((line = value.toString()) != null) {
            //StringTokenizer tokenizer = new StringTokenizer(line);
            //String field = tokenizer.nextToken();
            String[] parts = line.split(" ");
            if (parts[0].contains("STN") == false) {
                String field = parts[0];
                String month = parts[3];
                String temp;
                if (parts[7].trim().equals("")) {
                    temp = parts[8];
                } else
                    temp = parts[7];
                //tokenizer.nextToken();
                //String month = tokenizer.nextToken();
                month = month.substring(4, 6);
                //String temp = tokenizer.nextToken();
                String val = month + temp;
                key12.set(field);
                value.set(val);
                output.collect(key12, value);
            }
        }
    }
}
There are an awful lot of places where this could go wrong, irrespective of where this particular error is. What if parts doesn't have 9 elements? What if it does have 9 elements but some of them are null? What if line doesn't have a space character in it? What if month only has three characters in it?
Handle all of these situations and your issue will be resolved.
As an aside, use
if(!parts[0].contains("STN"))
instead of
if(parts[0].contains("STN") == false)
and consider extracting some of your Strings (such as "STN" and " ") into private static final String variables. This will greatly improve your performance.
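As an illustration only, a defensive version of the map body could look like the sketch below; with TextInputFormat each call to map() receives a single line, so no while loop is needed. The column count and length checks are assumptions about the data:
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {
    String[] parts = value.toString().split(" ");
    // skip header lines and any line without enough columns to index safely
    if (parts.length < 9 || parts[0].contains("STN")) {
        return;
    }
    String month = parts[3];
    if (month.length() < 6) {
        return; // not enough characters for substring(4, 6)
    }
    String temp = parts[7].trim().equals("") ? parts[8] : parts[7];
    output.collect(new Text(parts[0]), new Text(month.substring(4, 6) + temp));
}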

Hadoop Mapreduce: Custom Input Format

I have a file containing text data with "^" characters in between:
SOME TEXT^GOES HERE^
AND A FEW^MORE
GOES HERE
I am writing a custom input format to delimit the rows using the "^" character, i.e. the output of the mapper should be like:
SOME TEXT
GOES HERE
AND A FEW
MORE GOES HERE
I have written a custom input format which extends FileInputFormat and also a custom record reader that extends RecordReader. The code for my custom record reader is given below. I don't know how to proceed with this code; I am having trouble with the nextKeyValue() method in the while loop. How should I read the data from a split and generate my custom key-value? I am using the new mapreduce package instead of the old mapred package.
public class MyRecordReader extends RecordReader<LongWritable, Text> {
    long start, current, end;
    Text value;
    LongWritable key;
    LineReader reader;
    FileSplit split;
    Path path;
    FileSystem fs;
    FSDataInputStream in;
    Configuration conf;
    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext cont) throws IOException, InterruptedException {
        conf = cont.getConfiguration();
        split = (FileSplit) inputSplit;
        path = split.getPath();
        fs = path.getFileSystem(conf);
        in = fs.open(path);
        reader = new LineReader(in, conf);
        start = split.getStart();
        current = start;
        end = split.getLength() + start;
    }
    @Override
    public boolean nextKeyValue() throws IOException {
        if (key == null)
            key = new LongWritable();
        key.set(current);
        if (value == null)
            value = new Text();
        long readSize = 0;
        while (current < end) {
            Text tmpText = new Text();
            readSize = read // here, how should I read data from the split and generate my key-value?
            if (readSize == 0)
                break;
            current += readSize;
        }
        if (readSize == 0) {
            key = null;
            value = null;
            return false;
        }
        return true;
    }
    @Override
    public float getProgress() throws IOException {
    }
    @Override
    public LongWritable getCurrentKey() throws IOException {
    }
    @Override
    public Text getCurrentValue() throws IOException {
    }
    @Override
    public void close() throws IOException {
    }
}
There is no need to implement that yourself. You can simply set the configuration value textinputformat.record.delimiter to be the circumflex character.
conf.set("textinputformat.record.delimiter", "^");
This should work fine with the normal TextInputFormat.
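For example, a minimal driver sketch (MyDriver and MyMapper are placeholder class names); the property is set on the Configuration the Job is built from:
Configuration conf = new Configuration();
// make TextInputFormat split records on "^" instead of newlines
conf.set("textinputformat.record.delimiter", "^");
Job job = new Job(conf, "caret-delimited-input");
job.setJarByClass(MyDriver.class);
job.setMapperClass(MyMapper.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);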

Accessing file in Mapper through Distributed Cache

I want to access the contents of the distributed cache file in my Mapper. Below is the code I have written, which generates the name of the file in the Distributed Cache. Please help me access the contents of the file.
public class DistCacheExampleMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    Text a = new Text();
    Path[] dates = new Path[0];
    public void configure(JobConf conf) {
        try {
            dates = DistributedCache.getLocalCacheFiles(conf);
            String astr = dates.toString();
            a = new Text(astr);
        } catch (IOException ioe) {
            System.err.println("Caught exception while getting cached files: " +
                StringUtils.stringifyException(ioe));
        }
    }
    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        for (Path cacheFile : dates) {
            output.collect(new Text(line), new Text(cacheFile.getName()));
        }
    }
}
Try this instead in your configure() method:
List<String[]> lines;
Path[] files = new Path[0];
public void configure(JobConf conf) {
    lines = new ArrayList<>();
    BufferedReader SW;
    try {
        files = DistributedCache.getLocalCacheFiles(conf);
        SW = new BufferedReader(new FileReader(files[0].toString()));
        String line;
        while ((line = SW.readLine()) != null) {
            lines.add(line.split(",")); // now, each lines entry is a String array, with each element being a column
        }
        SW.close();
    } catch (IOException ioe) {
        System.err.println("Caught exception while getting cached files: " +
            StringUtils.stringifyException(ioe));
    }
}
This way, you will have the contents of the files in the Distributed Cache (in this case the first file) in the variable lines. Each lines entry is a String array, split on ','. So the first column of the first row is lines.get(0)[0], the third column of the second row is lines.get(1)[2], and so on.
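A sketch of how map() could then use those cached lines; emitting the first column of every cached row is purely illustrative:
@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {
    String line = value.toString();
    for (String[] columns : lines) {
        if (columns.length > 0) {
            // pair every input line with the first column of each cached row
            output.collect(new Text(line), new Text(columns[0]));
        }
    }
}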

Writing a value to file without moving to reducer

I have an input of records like this,
a|1|Y,
b|0|N,
c|1|N,
d|2|Y,
e|1|Y
Now, in the mapper, I have to check the value of the third column. If it is 'Y', that record has to be written directly to the output file without going to the reducer; the 'N' value records, on the other hand, have to move to the reducer for further processing.
So,
a|1|Y,
d|2|Y,
e|1|Y
should not go to reducer but
b|0|N,
c|1|N
should go to reducer and then to output file.
How can I do this?
What you can probably do is use MultipleOutputs to separate out records of 'Y' and 'N' type into two different files from the mappers.
Next, you run separate jobs for the two newly generated 'Y' and 'N' type data sets.
For 'Y' types, set the number of reducers to 0 so that no reducers are used. For 'N' types, process them the way you want using reducers.
Hope this helps.
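A minimal sketch of that MultipleOutputs approach; the named outputs "yes" and "no" and the class name SplitMapper are illustrative, and the driver wiring is shown as comments:
public static class SplitMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> mos;
    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\\|");
        // route the record by its third column
        if (fields.length > 2 && fields[2].trim().startsWith("Y")) {
            mos.write("yes", NullWritable.get(), value);
        } else {
            mos.write("no", NullWritable.get(), value);
        }
    }
    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
// In the driver:
// MultipleOutputs.addNamedOutput(job, "yes", TextOutputFormat.class, NullWritable.class, Text.class);
// MultipleOutputs.addNamedOutput(job, "no", TextOutputFormat.class, NullWritable.class, Text.class);
// job.setNumReduceTasks(0);   // then feed the "no" files into a second job that uses reducers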
See if this works,
public class Xxxx {
    public static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            FileSystem fs = FileSystem.get(context.getConfiguration());
            Random r = new Random();
            FileSplit split = (FileSplit) context.getInputSplit();
            String fileName = split.getPath().getName();
            FSDataOutputStream out = fs.create(new Path(fileName + "-m-" + r.nextInt()));
            String parts[];
            String line = value.toString();
            String[] splits = line.split(",");
            for (String s : splits) {
                parts = s.split("\\|");
                if (parts[2].equals("Y")) {
                    out.writeBytes(line);
                } else {
                    context.write(key, value);
                }
            }
            out.close();
            fs.close();
        }
    }
    public static class MyReducer extends
            Reducer<LongWritable, Text, LongWritable, Text> {
        public void reduce(LongWritable key, Iterable<Text> values,
                Context context) throws IOException, InterruptedException {
            for (Text t : values) {
                context.write(key, t);
            }
        }
    }
    /**
     * @param args
     * @throws IOException
     * @throws InterruptedException
     * @throws ClassNotFoundException
     */
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // TODO Auto-generated method stub
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        conf.set("mapred.job.tracker", "localhost:9001");
        Job job = new Job(conf, "Xxxx");
        job.setJarByClass(Xxxx.class);
        Path outPath = new Path("/output_path");
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        FileInputFormat.addInputPath(job, new Path("/input.txt"));
        FileOutputFormat.setOutputPath(job, outPath);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
In your map function, you will get the input line by line. Split it using | as the delimiter (with the String.split() method, to be exact). Note that split() takes a regular expression, so the pipe character must be escaped.
It will look like this:
String[] line = value.toString().split("\\|");
Access the third element of this array with line[2].
Then, using a simple if-else statement, emit the records with an 'N' value for further processing, as sketched below.
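A minimal sketch of that check inside map(); how the 'Y' records are written out directly is left to whichever mechanism you pick from the answers above:
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // split on the pipe character; it must be escaped because split() takes a regex
    String[] fields = value.toString().split("\\|");
    if (fields.length > 2 && fields[2].trim().startsWith("N")) {
        // 'N' records go on to the reducer
        context.write(key, value);
    }
    // 'Y' records are written directly (e.g. via MultipleOutputs), not emitted here
}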
