Part files in mapper Output Represent the Split? - hadoop

Do part files which are generated as an output of a mapper only job as part-m-00000,Part-m-00001,so on represent the first input split, second input split and so on and are they generated sequentially ??

May not be. The split array returned by the getSplits() method is sorted into order based on size, so that the biggest go first. This sorted array is passed farther down and map tasks are created for each element. So, the ordering information would be lost when you do the sort.
Reference: org.apache.hadoop.mapreduce.JobSubmitter class. See method writeSplits(..)
Link to source code:
https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/JobSubmitter.java
Further reading on how the file names are decided:
Once the task id is determined, the name of the file is decided by the getDefaultWorkFile API available in org.apache.hadoop.mapreduce.lib.output.FileOutputFormat class. Here is the documentation:
getDefaultWorkFile
public Path getDefaultWorkFile(TaskAttemptContext context,
String extension)
throws IOException
Get the default path and filename for the output format.
Parameters:
context - the task context
extension - an extension to add to the filename
Returns:
a full path $output/_temporary/$taskid/part-[mr]-$id
This means "part" is postfixed with the task type, 'm' for maps, 'r' for reduces and the task partition number (i.e. task id). For example, the file for the first map of the job the generated name will be 'part-m-00000'.
Javadoc reference: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#getDefaultWorkFile(org.apache.hadoop.mapreduce.TaskAttemptContext, java.lang.String)
The older FileOutputFormat API sitting in org.apache.hadoop.mapred package also works in a similar way. Here is the reference: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getUniqueName(org.apache.hadoop.mapred.JobConf, java.lang.String)

Related

Possible to take multiple input files and not create one RDD in pyspark?

In Hadoop, I can point an app to a path which then the mappers will process the files individually. I have to handle it this way because I need to parse the file name and path to match up with other files that I load directly in the mappers.
In pyspark, passing the path to SparkContext's textFile creates one RDD. Is there any way to replicate the same Hadoop behavior in Spark / pyspark?
I hope this resolve some of your confusions :
sparkContext.wholeTextFiles(path) returns a pairRDD (helpful link: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html)
In short, pairRDD is more like a map (i.e. have key, value)
rdd = sparkContext.wholeTextFiles(path)
def func_work_on_individual_files(x):
# x is a tuple which will receive both (key, value) for the pairRDD Row Elements passed. key -> file path, value -> content of a file with line seperated by '/n' (as you mentioned). To access key use x[0], to access value use x[1].
# your logic to do something useful with file data,
# to get separate lines you can use: x[1].split('\n')
# end function by return the values you want to return out of a file's data.
# I am simply returning the whole content of file
return x[1]
#loop over each of the file in the pairRdd created above
file_contents = rdd.map(func_work_on_individual_files)
#this will create just one partition out of all elements in list (as you mentioned)
consolidated_contents = file_contents.repartition(1)
#Save final output - this will create just one path like Hadoop
consolidated_contents.saveAsTextFile(path)
Pyspark provides a function for this use case: sparkContext.wholeTextFiles(path). It will read a directory of text files and produce a key-value pair, where key is the path of each file and value is the content of each file.

MapReduce One-to-one processing of multiple input files

Please clarify
I have set of input files (say 10) with specific names. I run word count job on all files at once (input path is folder). I am expecting 10 output files with same names as input files. I.e. File1 input should be counted and should be stored in a separate output file with "file1" name. And so on to all files.
There are 2 approaches you can take to achieve multiple outputs
Use MultipleOutputs class - refer this document for information about multipleclassoutput (https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html) , for more information about how to implement refer this http://appsintheopen.com/posts/44-map-reduce-multiple-outputs
Another option is using LazyOuputFormat, however, this is used in conjunction with multipleoutputs, for more information about its implementation refer this ( https://ssmolen.wordpress.com/2014/07/09/hadoop-mapreduce-write-output-to-multiple-directories-depending-on-the-reduce-key/ ).
I feel using LazyOutputFormat in conjunction with MultipleOuputs class is better approach.
Set the number of reduce tasks to be equal to the number of input files. This will create the given number of output files, as well.
Add a file prefix to each map output key (word). E.g., when you meet the word "cat" in file named "file0.txt" you can emit the key "0_cat", or "file0_cat", or anything else that is unique for "file0.txt". Use the context to get each time the filename.
Override the default Partitioner, to make sure that all the map output keys with prefix "0_", or "file0_" will go to the first partition, all the keys with prefix "1_", or "file1_" will go to the second, etc.
In the reducer, remove the "x_" or "filex_" prefix from the output key and use it as the name of the output file (using MultipleOutputs). Otherwise, if you don't want MultipleOutputs, you can easily do the mapping between outputfiles and input files by checking your Partitioner code. (e.g., part-00000 will be the partition 0's output)

Multiple inputs into MapReduce job

I'm trying to write a MapReduce job which takes a number of delimited input sources. All sources contain the same information, but it may be in different columns and the separator may be different per source. The sources are parsed in the mapper by a configuration file. This configuration file allows users to confine these different separators and column mappings.
For example, input1 is parsed using configuration properties
input1.separator=,
input1.id=1
input1.housename=2
input1.age=15
where 1, 2 and 15 are the columns in input1 which relate to those properties.
So, the mapper needs to know which configuration properties to use for each input source. I can't hard code this as other people will be running my job and will want to add new inputs without requiring a compiler.
The obvious solution is to extract the file name from the splits and apply configuration that way.
For example, assume I'm inputting two files, "source1.txt" and "source2.txt". I could write my configuration like
source1.separator=,
source1.id=2
...
source2.separator=|
source2.id=4
...
The mapper would get the file name from the splits, and then read the configuration properties with the same prefix.
However, if I'm pointing to folders in a Hive warehouse, I can't use this. I could extract bits of the path and use those, but I don't really feel that's an elegant or sturdy solution. Is there an easier way to do this?
I'm not sure whether MultipleInputs provides PathFilter integration. However you can extend one and feed matched files to different Mapper types based on your criteria.
FileStatus[] csvfiles = fileSystem.listStatus(new Path("hive/path"),
new PathFilter() {
public boolean accept(Path path) {
return (path.getName().matches(".*csv$"));
}
});
Assign handling Mapper to this list :
MultipleInputs.addInputPath(job, csvfiles[i].getPath(),
YourFormat.class, CsvMapper.class);
For each file type you have to provide the required regex. Hope you are good at it.
I've solved it. It turns out that the order in which input sources (files or directories) are added to FileInputFormat is maintained, and then stored in the job context as mapreduce.input.fileinputformat.inputdir. So, my solution
Runner.java
for(int i=X; i<ar.length; i++) {
FileInputFormat.addInputPath(job, new Path(ar[i]));
}
where X is the first integer at which an input path can be found.
InputMapper.java
#Get the name of the input source in the current mapper
Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String filePathString = ((FileSplit) context.getInputSplit()).getPath().toString();
#Get the ordered list of all input sources
String pathMappings = context.getConfiguration()
.get("mapreduce.input.fileinputformat.inputdir");
As I know the order in which input sources are added to the job, I can then have the user set configuration properties using numbers, and map the numbers to the order in which input sources were added to the job in the CLI.

Getting output files which contain the value of one key only?

I have a use-case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply outputting each value in the iterator. For example, here's some python streaming code:
for line in sys.stdin:
data = line.split("\t")
print data[1]
This method works for a small dataset (around 4GB). Each output file of the job only contains the values for one key.
However, if I increase the size of the dataset (over 40GB) then each file contains a mixture of keys, in sorted order.
Is there an easier way to solve this? I know that the output will be in sorted order and I could simply do a sequential scan and add to files. But it seems that this shouldn't be necessary since Hadoop sorts and splits the keys for you.
Question may not be the clearest, so I'll clarify if anyone has any comments. Thanks
Ok then create a custom jar implementation of your MapReduce solution and go for MultipleTextOutputFormat to be the OutputFormat used as explained here. You just have to emit the filename (in your case the key) as the key in your reducer and the entire payload as the value, and your data will be written in the file named as your key.

Hadoop searching words from one file in another file

I want to build a hadoop application which can read words from one file and search in another file.
If the word exists - it has to write to one output file
If the word doesn't exist - it has to write to another output file
I tried a few examples in hadoop. I have two questions
Two files are approximately 200MB each. Checking every word in another file might cause out of memory. Is there an alternative way of doing this?
How to write data to different files because output of the reduce phase of hadoop writes to only one file. Is it possible to have a filter for reduce phase to write data to different output files?
Thank you.
How I would do it:
split value in 'map' by words, emit (<word>, <source>) (*1)
you'll get in 'reduce': (<word>, <list of sources>)
check source-list (might be long for both/all sources)
if NOT all sources are in the list, emit every time (<missingsource>, <word>)
job2: job.setNumReduceTasks(<numberofsources>)
job2: emit in 'map' (<missingsource>, <word>)
job2: emit for each <missingsource> in 'reduce' all (null, <word>)
You'll end up with as much reduce-outputs as different <missingsources>, each containing the missing words for the document. You could write out the <missingsource> ONCE at the beginning of 'reduce' to mark the files.
(*1) Howto find out the source in map (0.20):
private String localname;
private Text outkey = new Text();
private Text outvalue = new Text();
...
public void setup(Context context) throws InterruptedException, IOException {
super.setup(context);
localname = ((FileSplit)context.getInputSplit()).getPath().toString();
}
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
...
outkey.set(...);
outvalue.set(localname);
context.write(outkey, outvalue);
}
Are you using Hadoop/MapReduce for a specific reason to solve this problem? This sounds like something more suited to a Lucene based application than Hadoop.
If you have to use Hadoop I have a few suggestions:
Your 'documents' will need to be in a format that MapReduce can deal with. The easiest format to use would be a CSV based file with each word in the document on a line. Having PDF etc will not work.
To take a set of words as input to you MapReduce job to compare against the data that the MapReduce processes you could use the Distributed Cache to enable each mapper to build a set of words you want to find in the input. However if your list of words to find it large (you mention 200MB) I doubt this would work. This method is one of the main ways you can do a join in MapReduce however.
The indexing method mentioned in another answer here does also offer possibilities. Again though, the terms indexing a document just make me think Lucene and not hadoop. If you did use this method you would need to make sure the key value contains a document identifier as well as the word, so that you have the word counts contained within each document.
I don't think i've ever produced multiple output files from a MapReduce job. You would need to write some (and it would be very simple) code to process the indexed output into multiple files.
You'll want to do this in two stages, in my opinion. Run the wordcount program (included in the hadoop examples jar) against the two initial documents, this will give you two files, each containing a unique list (with count) of the words in each document. From there, rather than using hadoop do a simple diff on the two files which should answer your question,

Resources