how to design 1 mapper for 1 text file in Mapreduce - hadoop

I am running Mapreduce on hadoop 2.9.0.
My problem:
I have a number of text files (about 10- 100 text files). Each file is very small in terms of size, but due to my logical problem, I need 1 mapper to handle 1 text file. The result of these mappers will be aggregated by my reducers.
I need to design so that the number of mappers always equals number of files. How to do that in Java code? What kind of function that I need to extend?
Thanks a lot.

I've had to do something very similar, and faced similar problems to you.
The way I achieved this was to feed in a text file containing the path's to each file, for example the text file would contain this kind of information:
/path/to/filea
/path/to/fileb
/a/different/path/to/filec
/a/different/path/to/another/called/filed
I'm not sure what exactly you want your mapper's to do, but when creating your job, you want to do the following:
public static void main( String args[] ) {
Job job = Job.getInstance(new Configuration(), 'My Map reduce application');
job.setJarByClass(Main.class);
job.setMapperClass(CustomMapper.class);
job.setInputFormatClass(NLineInputFormat.class);
...
}
Your CustomMapper.class will want to extend Mapper like so:
public class CustomMapper extends Mapper<LongWritable, Text, <Reducer Key>, <Reducer Value> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
Configuration configuration = context.getConfiguration();
ObjectTool tool = new ObjectTool(configuration, new Path(value.toString()));
context.write(<reducer key>, <reducer value>);
}
}
Where ObjectTool is another class which deals with what you want to actually do with your files.
So let me explain broadly what this is doing, the magic here is job.setInputFormatClass(NLineInputFormat.class), but what is it doing exactly?
It's essentially taking your input and splitting the data by each line, and sends each line to a mapper. By having a text file containing each file by a new line, you then create a 1:1 relationship between mappers and files. A great addition to this setup is it allows you to create advanced tooling for the files you want to deal with.
I used this to create a compression tool in HDFS, when I was researching on approaches to this, a lot of people were essentially reading the file to stdout and compressing it that way, however, when it came to doing a checksum on the original file and the file being compressed and decompressed, the results were different. This was due to the type of data in these files, and there was no easy way to implement bytes writeable. (Information on the cat'ing of files to std out can be seen here).
That link also quotes the following:
org.apache.hadoop.mapred.lib.NLineInputFormat is the magic here. It basically tells the job to feed one file per maptask
Hope this helps!

Related

Multiple inputs into MapReduce job

I'm trying to write a MapReduce job which takes a number of delimited input sources. All sources contain the same information, but it may be in different columns and the separator may be different per source. The sources are parsed in the mapper by a configuration file. This configuration file allows users to confine these different separators and column mappings.
For example, input1 is parsed using configuration properties
input1.separator=,
input1.id=1
input1.housename=2
input1.age=15
where 1, 2 and 15 are the columns in input1 which relate to those properties.
So, the mapper needs to know which configuration properties to use for each input source. I can't hard code this as other people will be running my job and will want to add new inputs without requiring a compiler.
The obvious solution is to extract the file name from the splits and apply configuration that way.
For example, assume I'm inputting two files, "source1.txt" and "source2.txt". I could write my configuration like
source1.separator=,
source1.id=2
...
source2.separator=|
source2.id=4
...
The mapper would get the file name from the splits, and then read the configuration properties with the same prefix.
However, if I'm pointing to folders in a Hive warehouse, I can't use this. I could extract bits of the path and use those, but I don't really feel that's an elegant or sturdy solution. Is there an easier way to do this?
I'm not sure whether MultipleInputs provides PathFilter integration. However you can extend one and feed matched files to different Mapper types based on your criteria.
FileStatus[] csvfiles = fileSystem.listStatus(new Path("hive/path"),
new PathFilter() {
public boolean accept(Path path) {
return (path.getName().matches(".*csv$"));
}
});
Assign handling Mapper to this list :
MultipleInputs.addInputPath(job, csvfiles[i].getPath(),
YourFormat.class, CsvMapper.class);
For each file type you have to provide the required regex. Hope you are good at it.
I've solved it. It turns out that the order in which input sources (files or directories) are added to FileInputFormat is maintained, and then stored in the job context as mapreduce.input.fileinputformat.inputdir. So, my solution
Runner.java
for(int i=X; i<ar.length; i++) {
FileInputFormat.addInputPath(job, new Path(ar[i]));
}
where X is the first integer at which an input path can be found.
InputMapper.java
#Get the name of the input source in the current mapper
Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String filePathString = ((FileSplit) context.getInputSplit()).getPath().toString();
#Get the ordered list of all input sources
String pathMappings = context.getConfiguration()
.get("mapreduce.input.fileinputformat.inputdir");
As I know the order in which input sources are added to the job, I can then have the user set configuration properties using numbers, and map the numbers to the order in which input sources were added to the job in the CLI.

in Hadoop, How can you give whole file as input to mapper?

An interviewer recently asked me this question:
I said by configuring block size or split size equal to file size.
He said it is wrong.
Well if you told it like that I think that he didn't like the "configuring block size" part.
EDIT : Somehow I think changing block size is a bad idea because it is global to HDFS.
On the other hand a solution to prevent splitting, would be to set the min split size bigger than the largest file to map.
A cleaner solution would be to subclass the concerned InputFormat implementation. Especially by overriding the isSpitable() method to return false. In your case you could do something like this with FileInputFormat:
public class NoSplitFileInputFormat extends FileInputFormat
{
#Override
protected boolean isSplitable(JobContext context, Path file)
{
return false;
}
}
The interviewer wanted to hear that you can make isSplitable to return false by gzip-compressing the input file.
In this case, MapReduce will do the right thing and not try to split the gzipped file,
since it knows that the input is gzip-compressed (by looking at the filename extension)
and that gzip does not support splitting.
This will work, but at the expense of locality: a single map will process all HDFS blocks, most of which will not be local to the map. Also, with fewer maps, the job is less granular, and so may take longer to run.

Merging hdfs files

I have 1000+ files available in HDFS with a naming convention of 1_fileName.txt to N_fileName.txt. Size of each file is 1024 MB.
I need to merge these files in to one (HDFS)with keeping the order of the file. Say 5_FileName.txt should append only after 4_fileName.txt
What is the best and fastest way to perform this operation.
Is there any method to perform this merging without copying the actual data between data nodes? For e-g: Get the block locations of this files and create a new entry (FileName) in the Namenode with these block locations?
There is no efficient way of doing this, you'll need to move all the data to one node, then back to HDFS.
A command line scriptlet to do this could be as follows:
hadoop fs -text *_fileName.txt | hadoop fs -put - targetFilename.txt
This will cat all files that match the glob to standard output, then you'll pipe that stream to the put command and output the stream to an HDFS file named targetFilename.txt
The only problem you have is the filename structure you have gone for - if you have fixed width, zeropadded the number part it would be easier, but in it's current state you'll get an unexpected lexigraphic order (1, 10, 100, 1000, 11, 110, etc) rather than numeric order (1,2,3,4, etc). You could work around this by amending the scriptlet to:
hadoop fs -text [0-9]_fileName.txt [0-9][0-9]_fileName.txt \
[0-9][0-9[0-9]_fileName.txt | hadoop fs -put - targetFilename.txt
There is an API method org.apache.hadoop.fs.FileUtil.copyMerge that performs this operation:
public static boolean copyMerge(
FileSystem srcFS,
Path srcDir,
FileSystem dstFS,
Path dstFile,
boolean deleteSource,
Configuration conf,
String addString)
It reads all files in srcDir in alphabetical order and appends their content to dstFile.
If you can use spark. It can be done like
sc.textFile("hdfs://...../part*).coalesce(1).saveAsTextFile("hdfs://...../filename)
Hope this works, since spark works in distributed fashion, you wont have to copy filed into one node. Though just a caution, coalescing files in spark can be slow if the files are very large.
Since the file order is important and lexicographical order does not fulfill the purpose, it looks like a good candidate to write a mapper program for this task, which can probably run periodically.
Offcourse there is no reducer, writing this as an HDFS map task is efficient because it can merge these files into one output file without much data movement across data nodes. As the source files are in HDFS, and since mapper tasks will try data affinity, it can merge files without moving files across different data nodes.
The mapper program will need a custom InputSplit (taking file names in the input directory and ordering it as required) and a custom InputFormat.
The mapper can either use hdfs append or a raw output stream where it can write in byte[].
A rough sketch of the Mapper program I am thinking of is something like:
public class MergeOrderedFileMapper extends MapReduceBase implements Mapper<ArrayWritable, Text, ??, ??>
{
FileSystem fs;
public void map(ArrayWritable sourceFiles, Text destFile, OutputCollector<??, ??> output, Reporter reporter) throws IOException
{
//Convert the destFile to Path.
...
//make sure the parent directory of destFile is created first.
FSDataOutputStream destOS = fs.append(destFilePath);
//Convert the sourceFiles to Paths.
List<Path> srcPaths;
....
....
for(Path p: sourcePaths) {
FSDataInputStream srcIS = fs.open(p);
byte[] fileContent
srcIS.read(fileContent);
destOS.write(fileContent);
srcIS.close();
reporter.progress(); // Important, else mapper taks may timeout.
}
destOS.close();
// Delete source files.
for(Path p: sourcePaths) {
fs.delete(p, false);
reporter.progress();
}
}
}
I wrote an implementation for PySpark as we use this quite often.
Modeled after Hadoop's copyMerge() and uses same lower-level Hadoop APIs to achive this.
https://github.com/Tagar/abalon/blob/v2.3.3/abalon/spark/sparkutils.py#L335
It keeps alphabetical order of file names.

Hadoop searching words from one file in another file

I want to build a hadoop application which can read words from one file and search in another file.
If the word exists - it has to write to one output file
If the word doesn't exist - it has to write to another output file
I tried a few examples in hadoop. I have two questions
Two files are approximately 200MB each. Checking every word in another file might cause out of memory. Is there an alternative way of doing this?
How to write data to different files because output of the reduce phase of hadoop writes to only one file. Is it possible to have a filter for reduce phase to write data to different output files?
Thank you.
How I would do it:
split value in 'map' by words, emit (<word>, <source>) (*1)
you'll get in 'reduce': (<word>, <list of sources>)
check source-list (might be long for both/all sources)
if NOT all sources are in the list, emit every time (<missingsource>, <word>)
job2: job.setNumReduceTasks(<numberofsources>)
job2: emit in 'map' (<missingsource>, <word>)
job2: emit for each <missingsource> in 'reduce' all (null, <word>)
You'll end up with as much reduce-outputs as different <missingsources>, each containing the missing words for the document. You could write out the <missingsource> ONCE at the beginning of 'reduce' to mark the files.
(*1) Howto find out the source in map (0.20):
private String localname;
private Text outkey = new Text();
private Text outvalue = new Text();
...
public void setup(Context context) throws InterruptedException, IOException {
super.setup(context);
localname = ((FileSplit)context.getInputSplit()).getPath().toString();
}
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
...
outkey.set(...);
outvalue.set(localname);
context.write(outkey, outvalue);
}
Are you using Hadoop/MapReduce for a specific reason to solve this problem? This sounds like something more suited to a Lucene based application than Hadoop.
If you have to use Hadoop I have a few suggestions:
Your 'documents' will need to be in a format that MapReduce can deal with. The easiest format to use would be a CSV based file with each word in the document on a line. Having PDF etc will not work.
To take a set of words as input to you MapReduce job to compare against the data that the MapReduce processes you could use the Distributed Cache to enable each mapper to build a set of words you want to find in the input. However if your list of words to find it large (you mention 200MB) I doubt this would work. This method is one of the main ways you can do a join in MapReduce however.
The indexing method mentioned in another answer here does also offer possibilities. Again though, the terms indexing a document just make me think Lucene and not hadoop. If you did use this method you would need to make sure the key value contains a document identifier as well as the word, so that you have the word counts contained within each document.
I don't think i've ever produced multiple output files from a MapReduce job. You would need to write some (and it would be very simple) code to process the indexed output into multiple files.
You'll want to do this in two stages, in my opinion. Run the wordcount program (included in the hadoop examples jar) against the two initial documents, this will give you two files, each containing a unique list (with count) of the words in each document. From there, rather than using hadoop do a simple diff on the two files which should answer your question,

Processing files with headers in Hadoop

I want to process a lot of files in Hadoop -- each file has some header information, followed by a lot of records, each stored in a fixed number of bytes. Any suggestions on that?
I think the best solution is to write a custom InputFormat.
There is one solution , you can check the offset of line of files that mapper reads .
It will be zero for the first line in the file . so you can add line in Map as follows:
public void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException
{
if(key.get() > 0)
{
your mapper code
}
}
So, it will skip the first line of the file.
However, its not a good way because in this way this condition will be checked for each line in the file.
Best way is to use your Custom Input Format
In addition to write a custom FileInputFormat, you will also want to make sure that the file is not splitable so the reader knows how to process the records inside the file.

Resources