How to share a HashMap between Mappers in Hadoop?

Can I share a HashMap across different Mappers with the same values, like a static variable? I am running a job on a Hadoop cluster, and I am trying to share variable values between all mappers, which run on different datanodes.
INPUT ==> FileID FilePath
InputFormat => KeyValueTextInputFormat
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Demo {
    static int termID = 0;

    public static class DemoMapper extends Mapper<Object, Text, IntWritable, Text> {
        // term -> id; note a static map is only shared within one JVM,
        // not across mappers on different datanodes
        static HashMap<String, Integer> termMapping = new HashMap<String, Integer>();
        // term -> count
        static HashMap<String, Integer> termMap = new HashMap<String, Integer>();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // the input value is a file path; open that file
            BufferedReader reader = new BufferedReader(new FileReader(value.toString()));
            String line;
            String currentTerm;
            while ((line = reader.readLine()) != null) {
                StringTokenizer tokenizer = new StringTokenizer(line, " ");
                while (tokenizer.hasMoreTokens()) {
                    currentTerm = tokenizer.nextToken();
                    if (!termMap.containsKey(currentTerm)) {
                        if (!termMapping.containsKey(currentTerm)) {
                            termMapping.put(currentTerm, termID++);
                        }
                        termMap.put(currentTerm, 1);
                    } else {
                        termMap.put(currentTerm, termMap.get(currentTerm) + 1);
                    }
                }
            }
            reader.close();
        }
    }

    public static void main(String[] args) {
    }
}

I don't think you really need to share anything.
All you are doing here is a variety of simple word count (of paths).
Just output (currentTerm, 1) and let the reducer handle the appropriate aggregation. You can also toss in a Combiner for improved performance.
You don't need to worry about duplicates - just look back over the WordCount example.
Also, I think your types should instead be extends Mapper<LongWritable, Text, Text, IntWritable> if you are reading a file and outputting (String, int) data.
There is also a MapWritable class, but that seems like overkill.
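For illustration, a minimal sketch of that word-count approach (class names are mine, and it assumes the mapper receives file contents line by line, with the usual org.apache.hadoop.mapreduce imports):
public static class TermCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text term = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString(), " ");
        while (tokenizer.hasMoreTokens()) {
            term.set(tokenizer.nextToken());
            context.write(term, ONE); // emit (currentTerm, 1); no shared map needed
        }
    }
}

public static class TermCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // the framework groups each term across all mappers
        }
        context.write(key, new IntWritable(sum));
    }
}
Since the reducer's input and output types match, the same class can double as the combiner via job.setCombinerClass(TermCountReducer.class).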

Related

How to remove the r-00000 extension from reducer output in MapReduce

I am able to rename my reducer output file correctly, but the r-00000 suffix still persists.
I have used MultipleOutputs in my reducer class.
Here are the details. I am not sure what I am missing or what extra I have to do.
public class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
    private Logger logger = Logger.getLogger(MyReducer.class);
    private MultipleOutputs<NullWritable, Text> multipleOutputs;
    String strName = "";

    @Override
    public void setup(Context context) {
        logger.info("Inside Reducer.");
        multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            final String valueStr = value.toString();
            // assuming the record is '|!|'-delimited
            final String[] strArrvalueStr = valueStr.split("\\|!\\|");
            StringBuilder sb = new StringBuilder();
            sb.append(strArrvalueStr[0] + "|!|");
            multipleOutputs.write(NullWritable.get(), new Text(sb.toString()), strName);
        }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}
I was able to do it explicitly after my job finishes, and that's OK for me; there is no delay in the job:
if (b) {
    DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd-HHmm");
    Calendar cal = Calendar.getInstance();
    String strDate = dateFormat.format(cal.getTime());
    FileSystem hdfs = FileSystem.get(getConf());
    FileStatus fs[] = hdfs.listStatus(new Path(args[1]));
    if (fs != null) {
        for (FileStatus aFile : fs) {
            if (!aFile.isDir()) {
                hdfs.rename(aFile.getPath(), new Path(aFile.getPath().toString() + ".txt"));
            }
        }
    }
}
A more suitable approach to the problem would be changing the OutputFormat.
For example, if you are using TextOutputFormat, take the source code of the TextOutputFormat class and modify the method below to produce the proper filename (without r-00000). We then need to set the modified output format in the driver.
public synchronized static String getUniqueFile(TaskAttemptContext context, String name, String extension) {
    /*TaskID taskId = context.getTaskAttemptID().getTaskID();
    int partition = taskId.getId();*/
    StringBuilder result = new StringBuilder();
    result.append(name);
    /*
     * result.append('-');
     * result.append(TaskID.getRepresentingCharacter(taskId.getTaskType()));
     * result.append('-'); result.append(NUMBER_FORMAT.format(partition));
     * result.append(extension);
     */
    return result.toString();
}
So whatever name is passed through MultipleOutputs, the filename will be created according to it.
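For instance, the driver wiring might look like this (a sketch; MyTextOutputFormat stands for whatever you named the modified copy):
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "multiple outputs job");
// ... mapper/reducer setup ...
// plug in the copied TextOutputFormat with the modified getUniqueFile()
job.setOutputFormatClass(MyTextOutputFormat.class);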

ArrayIndexOutOfBoundsException in MapReduce

I am getting an ArrayIndexOutOfBoundsException in the map part. My code is below. I am trying to read the input file from HDFS. Is there a better way to read an HDFS file?
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text>
{
    private Text key12 = new Text();
    private Text value = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        while ((line = value.toString()) != null)
        {
            //StringTokenizer tokenizer = new StringTokenizer(line);
            //String field = tokenizer.nextToken();
            String[] parts = line.split(" ");
            if (parts[0].contains("STN") == false)
            {
                String field = parts[0];
                String month = parts[3];
                String temp;
                if (parts[7].trim().equals(""))
                {
                    temp = parts[8];
                }
                else
                    temp = parts[7];
                //tokenizer.nextToken();
                //String month = tokenizer.nextToken();
                month = month.substring(4, 6);
                //String temp = tokenizer.nextToken();
                String val = month + temp;
                key12.set(field);
                value.set(val);
                output.collect(key12, value);
            }
        }
    }
}
There are an awful lot of places where this could go wrong, irrespective of where this particular error is. What if parts doesn't have 9 elements? What if it does have 9 elements but some of them are null? What if line doesn't have a space character in it? What if month only has three characters in it?
Handle all of these situations and your issue will be resolved.
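For instance, a guarded version of the indexing might look like this (a sketch; the bounds follow the fields the question's code touches):
String[] parts = line.split(" ");
// skip records that are too short to index into safely
if (parts.length < 9) {
    return; // or bump a Reporter counter and move on
}
String month = parts[3];
// substring(4, 6) requires at least six characters
if (month.length() < 6) {
    return;
}
month = month.substring(4, 6);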
As an aside, use
if(!parts[0].contains("STN"))
instead of
if(parts[0].contains("STN") == false)
and consider extracting some of your Strings (such as "STN" and " ") into private static final String constants. This will make the code cleaner and easier to maintain.

How to effectively reduce the length of input to mapper

My data has 20 fields in the schema. Only the first three fields matter to my MapReduce program. How can I decrease the size of the input to the mapper so that only the first three fields are received?
1,2,3,4,5,6,7,8...20 columns in schema.
I want the mapper to receive only fields 1, 2, 3 as its offset and value.
Note: I can't use Pig, as other logic is already implemented in MapReduce.
You need a custom RecordReader to do this:
public class TrimmedRecordReader implements RecordReader<LongWritable, Text> {

    private LineRecordReader lineReader;
    private LongWritable lineKey;
    private Text lineValue;

    public TrimmedRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
        lineKey = lineReader.createKey();
        lineValue = lineReader.createValue();
    }

    public boolean next(LongWritable key, Text value) throws IOException {
        if (!lineReader.next(lineKey, lineValue)) {
            return false;
        }
        String[] fields = lineValue.toString().split(",");
        if (fields.length < 3) {
            throw new IOException("Invalid record received");
        }
        key.set(lineKey.get()); // propagate the original byte offset as the key
        value.set(fields[0] + "," + fields[1] + "," + fields[2]);
        return true;
    }

    public LongWritable createKey() {
        return lineReader.createKey();
    }

    public Text createValue() {
        return lineReader.createValue();
    }

    public long getPos() throws IOException {
        return lineReader.getPos();
    }

    public void close() throws IOException {
        lineReader.close();
    }

    public float getProgress() throws IOException {
        return lineReader.getProgress();
    }
}
It should be pretty self-explanatory; it is just a wrapper around LineRecordReader.
Unfortunately, to invoke it you need to extend the InputFormat too. The following is enough:
public class TrimmedTextInputFormat extends FileInputFormat<LongWritable, Text> {

    public RecordReader<LongWritable, Text> getRecordReader(InputSplit input,
            JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(input.toString());
        return new TrimmedRecordReader(job, (FileSplit) input);
    }
}
Just don't forget to set it in the driver.
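With the old mapred API used above, that is one line in the driver (a sketch; MyDriver is a placeholder for your job class):
JobConf conf = new JobConf(MyDriver.class);
// mappers will now only ever see the first three comma-separated fields
conf.setInputFormat(TrimmedTextInputFormat.class);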
You can implement a custom input format in MapReduce to read only the required fields.
Just for reference, the following blog post explains how to read text as paragraphs:
http://blog.minjar.com/post/54759039969/mapreduce-custom-input-formats-reading

Hadoop Mapreduce: Custom Input Format

I have a file containing text with "^" characters in between:
SOME TEXT^GOES HERE^
AND A FEW^MORE
GOES HERE
I am writing a custom input format to delimit the rows using the "^" character, i.e. the output of the mapper should be like:
SOME TEXT
GOES HERE
AND A FEW
MORE GOES HERE
I have written a custom input format which extends FileInputFormat and a custom record reader that extends RecordReader. The code for my custom record reader is given below. I don't know how to proceed with it; I am having trouble with the nextKeyValue() method in the while-loop part. How should I read the data from a split and generate my custom key-value pairs? I am using the new mapreduce package instead of the old mapred package.
public class MyRecordReader extends RecordReader<LongWritable, Text>
{
    long start, current, end;
    Text value;
    LongWritable key;
    LineReader reader;
    FileSplit split;
    Path path;
    FileSystem fs;
    FSDataInputStream in;
    Configuration conf;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext cont) throws IOException, InterruptedException
    {
        conf = cont.getConfiguration();
        split = (FileSplit) inputSplit;
        path = split.getPath();
        fs = path.getFileSystem(conf);
        in = fs.open(path);
        reader = new LineReader(in, conf);
        start = split.getStart();
        current = start;
        end = split.getLength() + start;
    }

    @Override
    public boolean nextKeyValue() throws IOException
    {
        if (key == null)
            key = new LongWritable();
        key.set(current);
        if (value == null)
            value = new Text();
        long readSize = 0;
        while (current < end)
        {
            Text tmpText = new Text();
            readSize = read // here, how should I read data from the split and generate the key/value?
            if (readSize == 0)
                break;
            current += readSize;
        }
        if (readSize == 0)
        {
            key = null;
            value = null;
            return false;
        }
        return true;
    }

    @Override
    public float getProgress() throws IOException
    {
    }

    @Override
    public LongWritable getCurrentKey() throws IOException
    {
    }

    @Override
    public Text getCurrentValue() throws IOException
    {
    }

    @Override
    public void close() throws IOException
    {
    }
}
There is no need to implement that yourself. You can simply set the configuration value textinputformat.record.delimiter to be the circumflex character.
conf.set("textinputformat.record.delimiter", "^");
This should work fine with the normal TextInputFormat.
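For example, in the driver (a sketch; the job name is arbitrary, and the delimiter must be set before the Job is created since Job copies the Configuration):
Configuration conf = new Configuration();
// split records on '^' instead of the newline character
conf.set("textinputformat.record.delimiter", "^");
Job job = Job.getInstance(conf, "caret-delimited records");
// the stock TextInputFormat picks the delimiter up from the configuration
job.setInputFormatClass(TextInputFormat.class);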

Can I get a Partition number of Hadoop?

I am a Hadoop newbie.
I want to get the partition number in the output file.
First, I made a custom partitioner:
public static class MyPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        int numOfChars = key.toString().length();
        return numOfChars % numReduceTasks;
    }
}
It works. But I want to output the partition numbers 'visually' in the Reducer.
How can I get the partition number?
Below is my reducer source.
public static class MyReducer extends Reducer<Text, LongWritable, Text, Text> {
    private Text textList = new Text();

    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        String list = new String();
        for (LongWritable value : values) {
            list = new String(list + "\t" + value.toString());
        }
        textList.set(list);
        context.write(key, textList);
    }
}
I want to append the partition number to 'list' for each key; it will be '0' or '1'.
list = new String(list + "\t" + value.toString() + "\t" + ??);
It would be great if someone could help me.
Update: thanks to the answer, I tried a solution, but it didn't work and I think I did something wrong.
Below is the modified MyPartitioner:
public static class MyPartitioner extends Partitioner {
    public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        int numOfChars = key.toString().length();
        return numOfChars % numReduceTasks;

        private int bring_num = 0;

        public void configure(JobConf job) {
            bring_num = jobConf.getInt(numOfChars & numReduceTasks);
        }
    }
}
Add the code below to the Reducer class to get the partition number into a class variable, which can later be used in the reduce method:
String partition;

protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    partition = conf.get("mapred.task.partition");
}
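Putting that together with the reducer above, a sketch (the StringBuilder rewrite and the "?" default are mine; on newer Hadoop versions the property also appears as mapreduce.task.partition):
public static class MyReducer extends Reducer<Text, LongWritable, Text, Text> {
    private String partition;

    @Override
    protected void setup(Context context) {
        // read the partition number once per reduce task
        partition = context.getConfiguration().get("mapred.task.partition", "?");
    }

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder list = new StringBuilder();
        for (LongWritable value : values) {
            list.append("\t").append(value.toString());
        }
        // append the partition number ('0' or '1' with two reducers)
        list.append("\t").append(partition);
        context.write(key, new Text(list.toString()));
    }
}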
