MapReduce One-to-one processing of multiple input files - hadoop

To clarify: I have a set of input files (say 10) with specific names, and I run a word count job on all of them at once (the input path is the folder). I expect 10 output files with the same names as the input files, i.e. the counts for file1 should be stored in a separate output file named "file1", and so on for every input file.

There are two approaches you can take to achieve multiple outputs.
Use the MultipleOutputs class - refer to the Javadoc (https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html) for details, and to http://appsintheopen.com/posts/44-map-reduce-multiple-outputs for more information about how to implement it.
Another option is LazyOutputFormat; however, it is used in conjunction with MultipleOutputs. For more information about its implementation, refer to https://ssmolen.wordpress.com/2014/07/09/hadoop-mapreduce-write-output-to-multiple-directories-depending-on-the-reduce-key/ .
I feel that using LazyOutputFormat in conjunction with the MultipleOutputs class is the better approach.
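For reference, here is a minimal sketch (not taken from the linked posts) of a word-count reducer using MultipleOutputs; it assumes the mapper prefixed each word with its source file name, e.g. "file1_cat", so the reducer can route each count to a file named after the source. The class name is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PerFileWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Keys look like "file1_cat": the part before the first '_' is the source file name
        String[] parts = key.toString().split("_", 2);
        // The third argument is the base output path, so counts land in files named after the source
        mos.write(new Text(parts[1]), new IntWritable(sum), parts[0]);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

In the driver, register the real output format lazily so that empty default part files are not created:

LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);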

1. Set the number of reduce tasks equal to the number of input files. This will create the same number of output files as well.
2. Add a file prefix to each map output key (word). E.g., when you meet the word "cat" in the file named "file0.txt", you can emit the key "0_cat", or "file0_cat", or anything else that is unique to "file0.txt". Use the context to get the filename each time.
3. Override the default Partitioner to make sure that all map output keys with the prefix "0_" or "file0_" go to the first partition, all keys with the prefix "1_" or "file1_" go to the second, and so on (a sketch of steps 2 and 3 follows below).
4. In the reducer, remove the "x_" or "filex_" prefix from the output key and use it as the name of the output file (using MultipleOutputs). Otherwise, if you don't want MultipleOutputs, you can easily work out the mapping between output files and input files by checking your Partitioner code (e.g., part-00000 will be partition 0's output).
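Here is a minimal sketch of steps 2 and 3, assuming key prefixes that carry the file's index in their digits (e.g. "file0.txt_cat"); the class names are illustrative, not from the answer.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PrefixWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Name of the file the current split belongs to, e.g. "file0.txt"
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        for (String word : line.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(fileName + "_" + word), ONE);
            }
        }
    }
}

class FileIndexPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Everything before the first '_' is the file prefix; its digits give the file index,
        // so keys from "file0.txt" land in partition 0, "file1.txt" in partition 1, and so on
        String prefix = key.toString().split("_", 2)[0];
        int fileIndex = Integer.parseInt(prefix.replaceAll("\\D", ""));
        return fileIndex % numPartitions;
    }
}

In the driver you would set job.setPartitionerClass(FileIndexPartitioner.class) and job.setNumReduceTasks(numberOfInputFiles); the MultipleOutputs reducer sketched earlier already strips the prefix and uses it as the output file name (step 4).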

Related

Part files in mapper Output Represent the Split?

Do the part files that are generated as the output of a mapper-only job (part-m-00000, part-m-00001, and so on) represent the first input split, the second input split, and so on, and are they generated sequentially?
Not necessarily. The split array returned by the getSplits() method is sorted by size, so that the biggest splits go first. This sorted array is passed further down and map tasks are created for each element, so the ordering information is lost in the sort.
Reference: org.apache.hadoop.mapreduce.JobSubmitter class. See method writeSplits(..)
Link to source code:
https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/JobSubmitter.java
Further reading on how the file names are decided:
Once the task id is determined, the name of the file is decided by the getDefaultWorkFile API available in org.apache.hadoop.mapreduce.lib.output.FileOutputFormat class. Here is the documentation:
getDefaultWorkFile
public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException
Get the default path and filename for the output format.
Parameters:
    context - the task context
    extension - an extension to add to the filename
Returns:
    a full path $output/_temporary/$taskid/part-[mr]-$id
This means "part" is postfixed with the task type, 'm' for maps, 'r' for reduces and the task partition number (i.e. task id). For example, the file for the first map of the job the generated name will be 'part-m-00000'.
Javadoc reference: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#getDefaultWorkFile(org.apache.hadoop.mapreduce.TaskAttemptContext, java.lang.String)
The older FileOutputFormat API sitting in org.apache.hadoop.mapred package also works in a similar way. Here is the reference: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getUniqueName(org.apache.hadoop.mapred.JobConf, java.lang.String)
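To illustrate where this API sits, here is a hedged sketch of a custom FileOutputFormat whose getRecordWriter asks getDefaultWorkFile for the task's part-[mr]-NNNNN path before opening the output stream; the class name and the ".txt" extension are illustrative.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PlainTextOutputFormat extends FileOutputFormat<Text, Text> {
    @Override
    public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context) throws IOException {
        // Resolves to $output/_temporary/$taskid/part-[mr]-$id plus the given extension
        Path file = getDefaultWorkFile(context, ".txt");
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        final FSDataOutputStream out = fs.create(file, false);
        return new RecordWriter<Text, Text>() {
            @Override
            public void write(Text key, Text value) throws IOException {
                out.writeBytes(key + "\t" + value + "\n");
            }

            @Override
            public void close(TaskAttemptContext ctx) throws IOException {
                out.close();
            }
        };
    }
}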

Pig load files using tuple's field

I need help with the following use case:
Initially we load some files and process those records (or, more technically, tuples). After this processing we finally have tuples of the form:
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00001, some_field_3)
So basically, the tuples have a file path as the value of one of their fields. (We can obviously transform this into tuples with only one field holding the file path, or into a single tuple with one field containing a delimiter-separated, say comma-separated, string of paths.)
So now I have to load these files in a Pig script, but I am not able to do so. Could you please suggest how to proceed? I thought of using the nested (advanced) foreach operator and tried the following:
data = foreach tuples_with_file_info {
    fileData = load $2 using PigStorage(',');
    ....
    ....
};
However, it's not working.
Edit:
For simplicity, let's assume I have a single tuple with one field holding a file name:
(hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000)
You can't use Pig out of the box to do this.
What I would do is use some other scripting language (bash, Python, Ruby...) to read the file from HDFS and concatenate the file names it contains into a single string that you can then pass as a parameter to a Pig script to use in your LOAD statement. Pig supports globbing, so you can do the following:
a = LOAD '{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}' ...
so all that's left to do is read the file that contains those file names and concatenate them into a glob such as:
{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}
and pass that as a parameter to Pig, so your script would start with:
a = LOAD '$input'
and your Pig call would look like this (quote the glob so the shell does not expand the braces):
pig -f script.pig -param input='{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}'
First, store the tuples_with_file_info into some file:
STORE tuples_with_file_info INTO 'some_temporary_file';
then,
data = LOAD 'some_temporary_file' using MyCustomLoader();
where
MyCustomLoader is nothing but a Pig loader extending LoadFunc, which uses MyInputFormat as its InputFormat.
MyInputFormat is a wrapper over the actual InputFormat (e.g. TextInputFormat) that has to be used to read the actual data from the files (e.g., in my case, from hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000).
In MyInputFormat, override the getSplits method: first read the actual file name(s) from some_temporary_file (you can get that file's location from the Configuration's mapred.input.dir property), then update the same mapred.input.dir property with the retrieved file names, and finally return the result from the wrapped InputFormat (e.g., in my case, TextInputFormat). A sketch follows below.
Notes:
1. You cannot use the setLocation API from the LoadFunc (or some other similar API) to read the contents of some_temporary_file, as its contents will only be available at run time.
2. A doubt may arise: what if the LOAD statement executes before the STORE? This will not happen, because if STORE and LOAD use the same file in the script, Pig ensures that the jobs are executed in the right sequence. For more detail you may read the "Store-load sequences" section on the Pig wiki.
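Here is a hedged sketch of what that getSplits override could look like, assuming some_temporary_file was stored with one HDFS path per line and that mapred.input.dir points only at that file or directory; error handling is simplified and the class name follows the answer.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MyInputFormat extends InputFormat<LongWritable, Text> {
    private final TextInputFormat delegate = new TextInputFormat();

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // At this point mapred.input.dir still points at some_temporary_file
        Path listing = new Path(conf.get("mapred.input.dir"));
        FileSystem fs = listing.getFileSystem(conf);
        StringBuilder realInputs = new StringBuilder();
        for (FileStatus status : fs.listStatus(listing)) {
            // Skip _SUCCESS, _logs and other non-data entries in the STORE output
            if (status.isDirectory() || status.getPath().getName().startsWith("_")) {
                continue;
            }
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (realInputs.length() > 0) {
                        realInputs.append(',');
                    }
                    realInputs.append(line.trim());   // each line is an HDFS path to a data file
                }
            }
        }
        // Re-point the configuration at the actual data files and delegate
        conf.set("mapred.input.dir", realInputs.toString());
        return delegate.getSplits(context);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx)
            throws IOException, InterruptedException {
        return delegate.createRecordReader(split, ctx);
    }
}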

Multiple inputs into MapReduce job

I'm trying to write a MapReduce job which takes a number of delimited input sources. All sources contain the same information, but it may be in different columns and the separator may differ per source. The sources are parsed in the mapper according to a configuration file, which allows users to define the different separators and column mappings.
For example, input1 is parsed using configuration properties
input1.separator=,
input1.id=1
input1.housename=2
input1.age=15
where 1, 2 and 15 are the columns in input1 which relate to those properties.
So, the mapper needs to know which configuration properties to use for each input source. I can't hard-code this, as other people will be running my job and will want to add new inputs without recompiling.
The obvious solution is to extract the file name from the splits and apply configuration that way.
For example, assume I'm inputting two files, "source1.txt" and "source2.txt". I could write my configuration like
source1.separator=,
source1.id=2
...
source2.separator=|
source2.id=4
...
The mapper would get the file name from the splits, and then read the configuration properties with the same prefix.
However, if I'm pointing to folders in a Hive warehouse, I can't use this. I could extract bits of the path and use those, but I don't really feel that's an elegant or sturdy solution. Is there an easier way to do this?
I'm not sure whether MultipleInputs provides PathFilter integration. However, you can apply a PathFilter yourself and feed the matched files to different Mapper types based on your criteria:
FileStatus[] csvfiles = fileSystem.listStatus(new Path("hive/path"),
        new PathFilter() {
            public boolean accept(Path path) {
                return path.getName().matches(".*csv$");
            }
        });
Assign the handling Mapper to this list:
for (int i = 0; i < csvfiles.length; i++) {
    MultipleInputs.addInputPath(job, csvfiles[i].getPath(),
            YourFormat.class, CsvMapper.class);
}
For each file type you have to provide the required regex. Hope you are good at it.
I've solved it. It turns out that the order in which input sources (files or directories) are added to FileInputFormat is maintained and then stored in the job context as mapreduce.input.fileinputformat.inputdir. So, my solution:
Runner.java
for (int i = X; i < ar.length; i++) {
    FileInputFormat.addInputPath(job, new Path(ar[i]));
}
where X is the first integer at which an input path can be found.
InputMapper.java
// Get the name of the input source in the current mapper
Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String filePathString = filePath.toString();
// Get the ordered, comma-separated list of all input sources
String pathMappings = context.getConfiguration()
        .get("mapreduce.input.fileinputformat.inputdir");
As I know the order in which input sources are added to the job, I can have the user set configuration properties using numbers and map those numbers to the order in which the input sources were passed on the CLI.
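A hedged sketch (not the poster's actual code) of how the rest of the mapper could use that information; the numeric property names such as "0.separator" and "0.id" are an assumed convention, and the input paths are assumed to be fully qualified.

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InputMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String separator;
    private int idColumn;

    @Override
    protected void setup(Context context) throws IOException {
        Path splitPath = ((FileSplit) context.getInputSplit()).getPath();
        String[] inputDirs = context.getConfiguration()
                .get("mapreduce.input.fileinputformat.inputdir").split(",");
        int index = 0;
        for (int i = 0; i < inputDirs.length; i++) {
            // A file belonging to an input source starts with that source's path
            if (splitPath.toString().startsWith(inputDirs[i])) {
                index = i;
                break;
            }
        }
        // Look up the per-source properties by position, e.g. "0.separator", "0.id"
        separator = context.getConfiguration().get(index + ".separator");
        idColumn = context.getConfiguration().getInt(index + ".id", 0);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] columns = line.toString().split(Pattern.quote(separator));
        context.write(new Text(columns[idColumn]), line);
    }
}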

Getting output files which contain the value of one key only?

I have a use-case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply outputting each value in the iterator. For example, here's some python streaming code:
for line in sys.stdin:
    data = line.split("\t")
    print data[1]
This method works for a small dataset (around 4GB). Each output file of the job only contains the values for one key.
However, if I increase the size of the dataset (over 40GB) then each file contains a mixture of keys, in sorted order.
Is there an easier way to solve this? I know that the output will be in sorted order and I could simply do a sequential scan and add to files. But it seems that this shouldn't be necessary since Hadoop sorts and splits the keys for you.
The question may not be the clearest, so I'll clarify if anyone has any comments. Thanks.
OK, then create a custom jar implementation of your MapReduce solution and use MultipleTextOutputFormat as the OutputFormat, as explained here. You just have to emit the filename (in your case, the key) as the key in your reducer and the entire payload as the value, and your data will be written to a file named after your key; a sketch follows below.
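As a sketch of that suggestion (using the old org.apache.hadoop.mapred API, where MultipleTextOutputFormat lives; the class name is illustrative):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // "name" is the default part-NNNNN file name; returning the key instead
        // writes every key's records to its own file under the output directory
        return key.toString();
    }
}

Register it on the JobConf with conf.setOutputFormat(KeyBasedOutputFormat.class).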

two file comparison in hdfs

I want to write a MapReduce job to compare two large files in HDFS. Any thoughts on how to achieve that? Or is there any other way to do the comparison? Because the files are very large, I thought MapReduce would be an ideal approach.
Thanks for your help.
You may do this in two steps.
1. First, make the line number part of the text files.
Say the initial file looks like:
I am awesome
He is my best friend
Now convert this to something like:
1,I am awesome
2,He is my best friend
This may well be done by a MapReduce job itself or some other tool.
2. Now write a MapReduce step in which the mapper emits the line number as the key and the rest of the actual sentence as the value. Then, in the reducer, just compare the values. Whenever they don't match, emit the line number (the key) and whatever payloads you want here. Also, if the count of the values is just 1, that is a mismatch too.
EDIT: Better approach
Better still, you can just emit the complete line read by the mapper as the key and make the value a number, say 1. Taking my example above, your mapper output would be as follows:
< I am awesome, 1 >
< He is my best friend, 1 >
In the reducer, just check the count of the values; if it isn't 2, you have a mismatch (a sketch follows below).
But there is one catch in this approach: if exactly the same line can occur at two different places, then instead of checking that the number of values for a given key is 2, you should check that it is a multiple of 2.
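A minimal sketch of this approach with illustrative class names; the reducer flags any line whose total count is not 2 (switch the test to a multiple-of-2 check if duplicate lines are possible, as noted above).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FileCompare {

    public static class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Both files are part of the job's input, so identical lines meet in one reduce call
            context.write(line, ONE);
        }
    }

    public static class MismatchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text line, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            // A total of 2 means the line appears in both files; anything else is a mismatch.
            // If duplicate lines are possible, test (total % 2 != 0) instead, as noted above.
            if (total != 2) {
                context.write(line, new IntWritable(total));
            }
        }
    }
}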
One possible solution could be to emit the line number along with each line in the map job.
Suppose there are two files like the one below:
File 1:
I am here -- line 1
I am awesome -- line 2
You are my best friend -- line 3
File 2 is of a similar kind.
Now your map job output should be like < I am awesome, 2 >, and so on.
Once you are done with the map jobs for both files, you have two records (key, value) with the same key to reduce.
At reduce time, you can either compare the line numbers or generate output for the lines that do not match. If the same line also exists at a different location, the output can indicate that this line is a mismatch.
I have a solution for comparing files with keys. In your case, if you know that your IDs are unique, you could emit the IDs as keys in the map and the entire record as the value. Say your file has lines of the form ID,Line1; then emit the ID as the key and the rest of the record as the value from the mapper.
In the shuffle and sort phase, the IDs will be sorted and you will get an iterator with data from both files, i.e. the records from both files with the same ID end up in the same iterator.
Then, in the reducer, compare both values from the iterator; if they match, move on to the next record. Otherwise, if they do not match, write them to the output.
I have done this and it worked like a charm.
Scenario 1 - No matching key
If there is no matching ID between the two files, the iterator will have only one value.
Scenario 2 - Duplicate keys
If the files have duplicate keys, the iterator will have more than 2 values.
Note: You should compare the values only when the iterator has exactly 2 values.
Tip: The iterator will not always have the values in order. To identify which file a value came from, add a small indicator at the end of the line in the mapper, like
Line1;file1
Line1;file2
Then in the reducer you will be able to identify which value belongs to which input file; a sketch follows below.
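A hedged sketch of this key-based comparison, assuming records of the form ID,rest-of-record and tagging each value with its source file as in the tip above; the class names are illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class IdCompare {

    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumes well-formed "ID,rest-of-record" lines
            String[] parts = line.toString().split(",", 2);
            String source = ((FileSplit) context.getInputSplit()).getPath().getName();
            // Tag the value with its source file, e.g. "Line1;file1"
            context.write(new Text(parts[0]), new Text(parts[1] + ";" + source));
        }
    }

    public static class CompareReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text id, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> records = new ArrayList<>();
            for (Text v : values) {
                records.add(v.toString());
            }
            if (records.size() != 2) {
                // No matching key in the other file, or duplicate keys: report and skip comparison
                context.write(id, new Text("unmatched or duplicate key: " + records));
                return;
            }
            // Strip the trailing ";fileN" tag before comparing
            String first = records.get(0);
            String second = records.get(1);
            String a = first.substring(0, first.lastIndexOf(';'));
            String b = second.substring(0, second.lastIndexOf(';'));
            if (!a.equals(b)) {
                context.write(id, new Text(a + " <> " + b));
            }
        }
    }
}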
