How does MapReduce read from multiple input files? - hadoop

I am developing code to read data and write it into HDFS using MapReduce. However, when I have multiple input files, I don't understand how they are processed. The input path given to the mapper is the name of the directory, as is evident from the output of
String filename = conf1.get("map.input.file");
So how does it process the files in the directory?

In order to get the input file path you can use the context object, like this:
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String inputFilePath = fileSplit.getPath().toString();
As for how multiple files are processed:
Several instances of the mapper are created on the different machines in the cluster, and by default each instance receives a different input file. If a file is bigger than the DFS block size (128 MB by default in recent Hadoop versions), it is further divided into smaller splits, which are then distributed to the mappers.
So you can configure the input size received by each mapper in the following two ways:
change the HDFS block size (e.g. dfs.block.size=1048576)
set the parameter mapred.min.split.size (this only takes effect when it is set larger than the HDFS block size)
Note:
These parameters will only be effective if your input format supports splitting the input files. Common compression codecs (such as gzip) don't support splitting the files, so these will be ignored.
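The interaction between the block size and the minimum split size can be sketched as follows. This is a simplified Python model of the rule that, as far as I know, FileInputFormat applies (split size = max(minSize, min(maxSize, blockSize)), with the last split allowed to run about 10% over). The function names and the sample sizes are illustrative assumptions, not Hadoop APIs.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # assumed: the last split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Model of the split-size rule: clamp the block size between min and max."""
    return max(min_size, min(max_size, block_size))


def split_lengths(file_size, split_size, splittable=True):
    """Return the lengths of the splits a single file would be cut into."""
    if not splittable:
        # e.g. a gzip file: one split, whatever the block size is
        return [file_size]
    lengths = []
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        lengths.append(split_size)
        remaining -= split_size
    if remaining > 0:
        lengths.append(remaining)
    return lengths


if __name__ == "__main__":
    default = compute_split_size(128 * MB)
    print(len(split_lengths(300 * MB, default)))         # 300 MB file, 128 MB splits
    bigger = compute_split_size(128 * MB, min_size=256 * MB)
    print(len(split_lengths(300 * MB, bigger)))          # same file, 256 MB minimum split
    print(len(split_lengths(300 * MB, default, False)))  # non-splittable input
```

In this model a minimum split size below the block size changes nothing, which matches the remark above that the parameter only matters when it exceeds the block size.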

Following on from @Amar's answer, I used a FileStatus object in the following code, as my customised input format would not split the input file.
FileSystem fs = file.getFileSystem(conf);
FileStatus status = fs.getFileStatus(file);
String fileName = status.getPath().toString();

Related

Possible to take multiple input files and not create one RDD in pyspark?

In Hadoop, I can point an app to a path and the mappers will then process the files individually. I have to handle it this way because I need to parse the file name and path to match up with other files that I load directly in the mappers.
In pyspark, passing the path to SparkContext's textFile creates one RDD. Is there any way to replicate the same Hadoop behavior in Spark / pyspark?
I hope this resolves some of your confusion:
sparkContext.wholeTextFiles(path) returns a pair RDD (helpful link: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html)
In short, a pair RDD is more like a map (i.e. each element has a key and a value).
rdd = sparkContext.wholeTextFiles(path)

def func_work_on_individual_files(x):
    # x is a (key, value) tuple for one element of the pair RDD:
    # key -> file path, value -> content of the file with lines separated by '\n' (as you mentioned).
    # To access the key use x[0], to access the value use x[1].
    # Put your logic to do something useful with the file data here;
    # to get separate lines you can use: x[1].split('\n')
    # End the function by returning whatever you want to keep from the file's data.
    # I am simply returning the whole content of the file.
    return x[1]

# Apply the function to each file in the pair RDD created above
file_contents = rdd.map(func_work_on_individual_files)
# This will create just one partition out of all elements (as you mentioned)
consolidated_contents = file_contents.repartition(1)
# Save the final output: this will create just one output directory, like Hadoop.
# output_path must not already exist, so it cannot be the same as the input path.
consolidated_contents.saveAsTextFile(output_path)
PySpark provides a function for this use case: sparkContext.wholeTextFiles(path). It reads a directory of text files and produces key-value pairs, where the key is the path of each file and the value is the content of that file.
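Since the question is about parsing the file name and path for each file, here is a minimal sketch of a function that could be passed to map() on the pair RDD returned by wholeTextFiles. The function name describe_file and the sample path are hypothetical, and the sketch only uses the Python standard library so it can be tried without a cluster.

```python
import posixpath
from urllib.parse import urlparse


def describe_file(pair):
    """Turn a (path, content) pair into (file_name, parent_dir, line_count)."""
    path, content = pair
    # Drop the scheme and host (e.g. hdfs://localhost:9000) and keep the path part
    file_path = urlparse(path).path
    return (
        posixpath.basename(file_path),
        posixpath.dirname(file_path),
        len(content.splitlines()),
    )


if __name__ == "__main__":
    # A hand-made pair in the shape that wholeTextFiles is described as producing
    sample = ("hdfs://localhost:9000/user/demo/data/1/part-r-00000", "a,1\nb,2\n")
    print(describe_file(sample))
    # With Spark this would presumably be used as:
    #   sparkContext.wholeTextFiles(path).map(describe_file)
```

Keeping the per-file logic in a plain function like this makes it easy to test locally before running it on the cluster.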

Zip the output files from MapReduce without merging them

I have a MR job which uses multipleoutput format and outputs 500 files. I want to zip those files without merging them.
You have to use SequenceFileOutputFormat, an OutputFormat that writes keys and values to SequenceFiles in binary (raw) format.
SequenceFile.CompressionType has three variations:
BLOCK : Compress sequences of records together in blocks.
NONE : Do not compress records.
RECORD: Compress values only, each separately.
Key changes in your code.
Path outDir = new Path(WORK_DIR_PREFIX + "/out/" + jobName);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setOutputPath(job, outDir);
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
Have a look at working example on usage of SequenceFileOutputFormat.

Pig load files using tuple's field

I need help for following use case:
Initially we load some files and process those records (or, more technically, tuples). After this processing, we finally have tuples of the form:
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00001, some_field_3)
So basically, each tuple has a file path as the value of one of its fields. (We can obviously transform this into tuples that have only one field holding the file path, or into a single tuple with one field holding a delimiter-separated, say comma-separated, string.)
So now I have to load these files in a Pig script, but I am not able to do so. Could you please suggest how to proceed? I thought of using the advanced foreach operator and tried as follows:
data = foreach tuples_with_file_info {
    fileData = load $2 using PigStorage(',');
    ....
    ....
};
However, it's not working.
Edit:
For simplicity, let's assume I have a single tuple with one field holding the file name:
(hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000)
You can't use Pig out of the box to do it.
What I would do is use some other scripting language (bash, Python, Ruby...) to read the file names from HDFS and concatenate them into a single string that you can then pass as a parameter to a Pig script, to be used in your LOAD statement. Pig supports globbing, so you can do the following:
a = LOAD '{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}' ...
so all that's left to do is read the file that contains those file names, concatenate them into a glob such as:
{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}
and pass that as a parameter to Pig so your script would start with:
a = LOAD '$input'
and your pig call would look like this:
pig -f script.pig -param input='{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}'
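The concatenation step can be sketched in Python as follows. The helper names build_pig_glob and build_pig_command are hypothetical; the sketch assumes the file names have already been read into a list, and it only builds the argument list for the pig call rather than running it.

```python
def build_pig_glob(paths):
    """Join file paths into a single {a,b,c} glob for a Pig LOAD statement."""
    cleaned = [p.strip() for p in paths if p.strip()]
    if not cleaned:
        raise ValueError("no input paths given")
    return "{" + ",".join(cleaned) + "}"


def build_pig_command(script, paths):
    """Return the argument list for the pig call.

    Passing a list to subprocess means no shell is involved,
    so the braces are not subject to shell brace expansion.
    """
    return ["pig", "-f", script, "-param", "input=" + build_pig_glob(paths)]


if __name__ == "__main__":
    paths = [
        "hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000",
        "hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000",
    ]
    print(build_pig_command("script.pig", paths))
    # To actually run it (requires pig on the PATH):
    #   import subprocess
    #   subprocess.run(build_pig_command("script.pig", paths), check=True)
```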
First, store the tuples_with_file_info into some file:
STORE tuples_with_file_info INTO 'some_temporary_file';
then,
data = LOAD 'some_temporary_file' using MyCustomLoader();
where
MyCustomLoader is nothing but a Pig loader extending LoadFunc, which uses MyInputFormat as InputFormat.
MyInputFormat is an encapsulation over the actual InputFormat (e.g. TextInputFormat) which has to be used to read actual data from the files (e.g. in my case from file hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000).
In MyInputFormat, override the getSplits method: first read the actual file name(s) from some_temporary_file (you get this file's name from the Configuration's mapred.input.dir property), then update mapred.input.dir in the same Configuration with the retrieved file names, and finally return the result from the wrapped InputFormat (e.g. in my case TextInputFormat).
Notes:
1. You cannot use the setLocation API of LoadFunc (or some other similar API) to read the contents of some_temporary_file, as its contents will be available only at run time.
2. You may wonder what happens if the LOAD statement executes before the STORE. This will not happen, because when STORE and LOAD use the same file in a script, Pig ensures that the jobs are executed in the right sequence. For more detail, see the section Store-load sequences on the Pig Wiki.

Multiple inputs into MapReduce job

I'm trying to write a MapReduce job which takes a number of delimited input sources. All sources contain the same information, but it may be in different columns and the separator may be different per source. The sources are parsed in the mapper according to a configuration file. This configuration file allows users to define these different separators and column mappings.
For example, input1 is parsed using configuration properties
input1.separator=,
input1.id=1
input1.housename=2
input1.age=15
where 1, 2 and 15 are the columns in input1 which relate to those properties.
So, the mapper needs to know which configuration properties to use for each input source. I can't hard code this as other people will be running my job and will want to add new inputs without requiring a compiler.
The obvious solution is to extract the file name from the splits and apply configuration that way.
For example, assume I'm inputting two files, "source1.txt" and "source2.txt". I could write my configuration like
source1.separator=,
source1.id=2
...
source2.separator=|
source2.id=4
...
The mapper would get the file name from the splits, and then read the configuration properties with the same prefix.
However, if I'm pointing to folders in a Hive warehouse, I can't use this. I could extract bits of the path and use those, but I don't really feel that's an elegant or sturdy solution. Is there an easier way to do this?
I'm not sure whether MultipleInputs provides PathFilter integration. However you can extend one and feed matched files to different Mapper types based on your criteria.
FileStatus[] csvfiles = fileSystem.listStatus(new Path("hive/path"),
    new PathFilter() {
        public boolean accept(Path path) {
            return (path.getName().matches(".*csv$"));
        }
    });
Assign the handling Mapper to each file in this list:
MultipleInputs.addInputPath(job, csvfiles[i].getPath(),
    YourFormat.class, CsvMapper.class);
For each file type you have to provide the required regex.
I've solved it. It turns out that the order in which input sources (files or directories) are added to FileInputFormat is maintained, and then stored in the job context as mapreduce.input.fileinputformat.inputdir. So, my solution:
Runner.java
for (int i = X; i < ar.length; i++) {
    FileInputFormat.addInputPath(job, new Path(ar[i]));
}
where X is the first index at which an input path can be found.
InputMapper.java
// Get the name of the input source in the current mapper
Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String filePathString = filePath.toString();
// Get the ordered list of all input sources
String pathMappings = context.getConfiguration()
    .get("mapreduce.input.fileinputformat.inputdir");
As I know the order in which input sources are added to the job, I can then have the user set configuration properties using numbers, and map the numbers to the order in which input sources were added to the job in the CLI.
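The lookup from the current split's path to its numbered configuration can be sketched like this in Python. It assumes the inputdir property is a comma-separated list of paths with no commas inside the paths themselves; the function name source_index, the sample paths and the property names are made up for illustration.

```python
def source_index(input_dirs, split_path):
    """Return the 1-based position of the input source that contains split_path.

    input_dirs is the comma-separated value of the inputdir property
    (assumption: no commas inside the paths). Returns None if nothing matches.
    """
    dirs = input_dirs.split(",")
    best = None
    for position, directory in enumerate(dirs, start=1):
        prefix = directory.rstrip("/") + "/"
        if split_path == directory or split_path.startswith(prefix):
            # Prefer the longest match in case one input is nested inside another
            if best is None or len(directory) > len(dirs[best - 1]):
                best = position
    return best


if __name__ == "__main__":
    input_dirs = "hdfs://nn/warehouse/source_a,hdfs://nn/warehouse/source_b"
    props = {"input1.separator": ",", "input2.separator": "|"}
    index = source_index(input_dirs, "hdfs://nn/warehouse/source_b/000000_0")
    print(index, props["input%d.separator" % index])
```

Matching on a directory prefix rather than on the file name is what makes this usable for Hive warehouse folders, where the file names themselves carry no information.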

Merging hdfs files

I have 1000+ files available in HDFS with a naming convention of 1_fileName.txt to N_fileName.txt. Size of each file is 1024 MB.
I need to merge these files into one (in HDFS) while keeping the order of the files. Say, 5_fileName.txt should be appended only after 4_fileName.txt.
What is the best and fastest way to perform this operation?
Is there any method to perform this merge without copying the actual data between data nodes? For example: get the block locations of these files and create a new entry (FileName) in the Namenode with these block locations?
There is no efficient way of doing this; you'll need to move all the data to one node, then back to HDFS.
A command line scriptlet to do this could be as follows:
hadoop fs -text *_fileName.txt | hadoop fs -put - targetFilename.txt
This will cat all files that match the glob to standard output; you then pipe that stream to the put command, which writes it to an HDFS file named targetFilename.txt.
The only problem is the filename structure you have chosen: if the number part were fixed width and zero-padded it would be easier, but in its current state you'll get an unexpected lexicographic order (1, 10, 100, 1000, 11, 110, etc.) rather than numeric order (1, 2, 3, 4, etc.). You could work around this by amending the scriptlet to:
hadoop fs -text [0-9]_fileName.txt [0-9][0-9]_fileName.txt \
    [0-9][0-9][0-9]_fileName.txt | hadoop fs -put - targetFilename.txt
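An alternative to one glob pattern per digit count is to sort the file names numerically before handing them to the command. The following Python sketch shows only the ordering step; the function name numeric_order is hypothetical, and it assumes every name starts with an integer followed by an underscore, as in the question.

```python
def numeric_order(names):
    """Sort names of the form <number>_<rest> by their numeric prefix."""
    return sorted(names, key=lambda name: int(name.split("_", 1)[0]))


if __name__ == "__main__":
    names = ["10_fileName.txt", "2_fileName.txt", "1_fileName.txt", "100_fileName.txt"]
    for name in numeric_order(names):
        print(name)
```

The sorted names could then be passed to hadoop fs -text in that order, in the same way the scriptlet above passes several patterns.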
There is an API method org.apache.hadoop.fs.FileUtil.copyMerge that performs this operation:
public static boolean copyMerge(
FileSystem srcFS,
Path srcDir,
FileSystem dstFS,
Path dstFile,
boolean deleteSource,
Configuration conf,
String addString)
It reads all files in srcDir in alphabetical order and appends their content to dstFile.
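To make the described behaviour concrete, here is a local-filesystem analogue in Python. It is not the Hadoop API: copy_merge_local is a made-up name, it works on ordinary directories, and the add_string handling (appended after each file when given) is my reading of the addString parameter.

```python
import os
import tempfile


def copy_merge_local(src_dir, dst_file, add_string=None):
    """Append every regular file in src_dir, in sorted name order, to dst_file."""
    names = sorted(
        n for n in os.listdir(src_dir) if os.path.isfile(os.path.join(src_dir, n))
    )
    with open(dst_file, "wb") as out:
        for name in names:
            with open(os.path.join(src_dir, name), "rb") as src:
                # Copy in chunks so large files are never held in memory at once
                while True:
                    chunk = src.read(64 * 1024)
                    if not chunk:
                        break
                    out.write(chunk)
            if add_string is not None:
                out.write(add_string.encode("utf-8"))
    return names


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "parts")
        os.mkdir(src)
        for name, text in [("part-00001", "b\n"), ("part-00000", "a\n")]:
            with open(os.path.join(src, name), "w") as f:
                f.write(text)
        merged = os.path.join(tmp, "merged.txt")
        copy_merge_local(src, merged)
        with open(merged) as f:
            print(repr(f.read()))
```

Because the order is alphabetical, names such as 1_fileName.txt and 10_fileName.txt would still need zero-padding to come out in numeric order.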
If you can use Spark, it can be done like this:
sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")
Hope this works. Since Spark works in a distributed fashion, you won't have to copy the files onto one node yourself. A word of caution, though: coalescing files in Spark can be slow if the files are very large.
Since the file order is important and lexicographical order does not fulfill the purpose, it looks like a good candidate to write a mapper program for this task, which can probably run periodically.
Of course there is no reducer. Writing this as a map task over HDFS is efficient because it can merge the files into one output file without much data movement across data nodes: the source files are in HDFS, and since mapper tasks try to run close to their data, the files can be merged without being moved across different data nodes.
The mapper program will need a custom InputSplit (taking file names in the input directory and ordering it as required) and a custom InputFormat.
The mapper can either use hdfs append or a raw output stream where it can write in byte[].
A rough sketch of the Mapper program I am thinking of is something like:
public class MergeOrderedFileMapper extends MapReduceBase implements Mapper<ArrayWritable, Text, ??, ??>
{
    FileSystem fs;

    public void map(ArrayWritable sourceFiles, Text destFile, OutputCollector<??, ??> output, Reporter reporter) throws IOException
    {
        // Convert the destFile to Path.
        ...
        // Make sure the parent directory of destFile is created first.
        FSDataOutputStream destOS = fs.append(destFilePath);
        // Convert the sourceFiles to Paths.
        List<Path> srcPaths;
        ....
        ....
        byte[] buffer = new byte[64 * 1024];
        for (Path p : srcPaths) {
            FSDataInputStream srcIS = fs.open(p);
            int n;
            // Copy in chunks; a single read() call is not guaranteed to return the whole file.
            while ((n = srcIS.read(buffer)) > 0) {
                destOS.write(buffer, 0, n);
                reporter.progress(); // Important, else the mapper task may time out.
            }
            srcIS.close();
        }
        destOS.close();
        // Delete source files.
        for (Path p : srcPaths) {
            fs.delete(p, false);
            reporter.progress();
        }
    }
}
I wrote an implementation for PySpark, as we use this quite often.
It is modeled after Hadoop's copyMerge() and uses the same lower-level Hadoop APIs to achieve this.
https://github.com/Tagar/abalon/blob/v2.3.3/abalon/spark/sparkutils.py#L335
It keeps alphabetical order of file names.
