Possible to take multiple input files and not create one RDD in pyspark? - hadoop

In Hadoop, I can point an app to a path which then the mappers will process the files individually. I have to handle it this way because I need to parse the file name and path to match up with other files that I load directly in the mappers.
In pyspark, passing the path to SparkContext's textFile creates one RDD. Is there any way to replicate the same Hadoop behavior in Spark / pyspark?

I hope this clears up some of the confusion:
sparkContext.wholeTextFiles(path) returns a pair RDD (helpful link: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html)
In short, a pair RDD works like a map: each element has a key and a value.
rdd = sparkContext.wholeTextFiles(path)

def func_work_on_individual_files(x):
    # x is a (key, value) tuple from the pair RDD:
    # x[0] is the file path, x[1] is the file's content with lines separated by '\n'.
    # Put your logic for the file data here.
    # To get separate lines you can use: x[1].split('\n')
    # Return whatever you want to keep from the file's data;
    # here I simply return the whole content of the file.
    return x[1]

# Apply the function to each file in the pair RDD created above
file_contents = rdd.map(func_work_on_individual_files)
# Collapse all elements into a single partition
consolidated_contents = file_contents.repartition(1)
# Save the final output as a single part file. output_path is a placeholder for a
# directory that does not exist yet, so it cannot be the input path.
consolidated_contents.saveAsTextFile(output_path)

PySpark provides a function for this use case: sparkContext.wholeTextFiles(path). It reads a directory of text files and produces an RDD of key-value pairs, where each key is the path of a file and each value is the content of that file.
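Since the question is about parsing the file name and path for each file, here is a minimal sketch of a function that could be passed to map() on the result of wholeTextFiles. The function name describe_file and the sample path are made up for illustration, and the Spark call itself appears only in a comment because no SparkContext is created here.

```python
import posixpath
from urllib.parse import urlparse


def describe_file(pair):
    """Turn one (path, content) pair into (file_name, parent_dir, line_count)."""
    path, content = pair
    # Keys are full URIs, so drop the scheme and host before splitting the path.
    location = urlparse(path).path
    parent_dir, file_name = posixpath.split(location)
    return file_name, parent_dir, len(content.splitlines())


# With a live SparkContext `sc` (assumed, not created here) this would be:
#   sc.wholeTextFiles("hdfs:///some/input/dir").map(describe_file).collect()
sample = ("hdfs://localhost:9000/data/1/part-00000", "a\nb\nc\n")
print(describe_file(sample))
```

The file name and parent directory returned here are what you would use to match each file against the other files you load.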

Related

MapReduce One-to-one processing of multiple input files

Please clarify
I have a set of input files (say 10) with specific names. I run a word count job on all files at once (the input path is a folder). I expect 10 output files with the same names as the input files, i.e. the words in File1 should be counted and stored in a separate output file named "file1", and so on for all files.
There are two approaches you can take to achieve multiple outputs:
Use the MultipleOutputs class. See the API documentation (https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html); for more information about how to implement it, see http://appsintheopen.com/posts/44-map-reduce-multiple-outputs
Another option is LazyOutputFormat; however, it is used in conjunction with MultipleOutputs. For more information about its implementation, see https://ssmolen.wordpress.com/2014/07/09/hadoop-mapreduce-write-output-to-multiple-directories-depending-on-the-reduce-key/
I feel that using LazyOutputFormat in conjunction with the MultipleOutputs class is the better approach.
Set the number of reduce tasks to be equal to the number of input files. This will create the given number of output files, as well.
Add a file prefix to each map output key (word). E.g., when you meet the word "cat" in the file named "file0.txt", you can emit the key "0_cat", or "file0_cat", or anything else that is unique to "file0.txt". Use the context to get the file name each time.
Override the default Partitioner, to make sure that all the map output keys with prefix "0_", or "file0_" will go to the first partition, all the keys with prefix "1_", or "file1_" will go to the second, etc.
In the reducer, remove the "x_" or "filex_" prefix from the output key and use it as the name of the output file (using MultipleOutputs). Otherwise, if you don't want MultipleOutputs, you can easily do the mapping between outputfiles and input files by checking your Partitioner code. (e.g., part-00000 will be the partition 0's output)
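The prefix-and-partition steps above can be simulated in plain Python to see how the keys flow. This is only a sketch of the idea, not Hadoop code: map_words, partition_for and run_job are hypothetical names, and the numeric "0_", "1_" prefix scheme is one of the options suggested above.

```python
from collections import defaultdict


def map_words(file_index, text):
    """Emit (prefixed_key, 1) pairs, e.g. ("0_cat", 1) for "cat" in file 0."""
    for word in text.split():
        yield f"{file_index}_{word}", 1


def partition_for(key, num_partitions):
    """Stand-in for a custom partitioner: route a key by its numeric file prefix."""
    prefix = key.split("_", 1)[0]
    return int(prefix) % num_partitions


def run_job(files):
    """files: one text per input file. Returns one word-count dict per file."""
    num_partitions = len(files)  # one reduce task per input file
    partitions = [defaultdict(int) for _ in range(num_partitions)]
    for index, text in enumerate(files):
        for key, count in map_words(index, text):
            word = key.split("_", 1)[1]  # the reducer strips the prefix again
            partitions[partition_for(key, num_partitions)][word] += count
    return [dict(p) for p in partitions]


print(run_job(["cat dog cat", "dog bird"]))
```

Each dict in the result corresponds to one reducer's output, so partition 0 holds only the counts for the first file.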

Part files in mapper Output Represent the Split?

Do the part files generated as output of a mapper-only job (part-m-00000, part-m-00001, and so on) represent the first input split, the second input split, and so on, and are they generated sequentially?
Not necessarily. The split array returned by the getSplits() method is sorted by size so that the biggest splits go first. This sorted array is passed further down and a map task is created for each element, so the original ordering is lost in the sort.
Reference: org.apache.hadoop.mapreduce.JobSubmitter class. See method writeSplits(..)
Link to source code:
https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/JobSubmitter.java
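To illustrate why the size-based sort breaks the link between split order and task order, here is a small sketch in plain Python. The split names and sizes are invented for the example.

```python
# Hypothetical splits in the order getSplits() might return them: (name, size in bytes)
splits = [("split-a", 64), ("split-b", 128), ("split-c", 32)]

# Sort biggest-first, as described above.
by_size = sorted(splits, key=lambda s: s[1], reverse=True)

# Map task i would be created for by_size[i], not for splits[i].
task_order = [name for name, _ in by_size]
print(task_order)
```

Here the first map task would process split-b, even though split-a came first in the original array.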
Further reading on how the file names are decided:
Once the task id is determined, the name of the file is decided by the getDefaultWorkFile API available in org.apache.hadoop.mapreduce.lib.output.FileOutputFormat class. Here is the documentation:
getDefaultWorkFile
public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException
Get the default path and filename for the output format.
Parameters:
context - the task context
extension - an extension to add to the filename
Returns:
a full path $output/_temporary/$taskid/part-[mr]-$id
This means "part" is followed by the task type ('m' for maps, 'r' for reduces) and the task partition number (i.e. the task id). For example, the output file of the first map task of a job is named 'part-m-00000'.
Javadoc reference: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#getDefaultWorkFile(org.apache.hadoop.mapreduce.TaskAttemptContext, java.lang.String)
The older FileOutputFormat API sitting in org.apache.hadoop.mapred package also works in a similar way. Here is the reference: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getUniqueName(org.apache.hadoop.mapred.JobConf, java.lang.String)
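The naming rule described above can be mimicked with a short helper. This is a sketch, not Hadoop's implementation: default_part_name is a hypothetical name, and the five-digit zero padding is taken from the 'part-m-00000' example.

```python
def default_part_name(task_type, partition, extension=""):
    """Build a name of the form part-[mr]-$id, with $id zero-padded to five digits."""
    if task_type not in ("m", "r"):
        raise ValueError("task_type must be 'm' (map) or 'r' (reduce)")
    return f"part-{task_type}-{partition:05d}{extension}"


print(default_part_name("m", 0))
print(default_part_name("r", 12, ".txt"))
```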

Pig load files using tuple's field

I need help for following use case:
Initially we load some files and process those records (or, more technically, tuples). After this processing, we end up with tuples of the form:
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00001, some_field_3)
So basically, each tuple has a file path as the value of one of its fields. (We can obviously transform this into tuples with a single field holding the file path, or into a single tuple with one field holding a delimiter-separated, say comma-separated, string of paths.)
So now I have to load these files in a Pig script, but I am not able to do so. Could you please suggest how to proceed? I thought of using the advanced foreach operator and tried the following:
data = foreach tuples_with_file_info {
fileData = load $2 using PigStorage(',');
....
....
};
However, it's not working.
Edit:
For simplicity, let's assume I have a single tuple with one field holding the file name:
(hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000)
You can't use Pig out of the box to do it.
What I would do is use some other scripting language (bash, Python, Ruby...) to read the file names from HDFS and concatenate them into a single string that you can then pass as a parameter to a Pig script for use in your LOAD statement. Pig supports globbing, so you can do the following:
a = LOAD '{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}' ...
so all that's left to do is read the file that contains those file names, concatenate them into a glob such as:
{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}
and pass that as a parameter to Pig so your script would start with:
a = LOAD '$input'
and your pig call would look like this:
pig -f script.pig -param "input={hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}"
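The glob-building step could be scripted along these lines. This is a minimal sketch: build_pig_command is a hypothetical helper, the paths are the ones from the question, and it only builds the command rather than running Pig.

```python
import shlex


def build_pig_command(paths, script="script.pig"):
    """Join paths into a {a,b,...} glob and return the pig argument list."""
    cleaned = [p.strip() for p in paths if p.strip()]
    if not cleaned:
        raise ValueError("no input paths given")
    glob = "{" + ",".join(cleaned) + "}"
    return ["pig", "-f", script, "-param", f"input={glob}"]


paths = [
    "hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000",
    "hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000",
]
command = build_pig_command(paths)
# shlex.join quotes the braces so a shell will not brace-expand them.
print(shlex.join(command))
```

Passing the list form to subprocess.run avoids the shell altogether, which sidesteps the quoting question.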
First, store the tuples_with_file_info into some file:
STORE tuples_with_file_info INTO 'some_temporary_file';
then,
data = LOAD 'some_temporary_file' using MyCustomLoader();
where
MyCustomLoader is nothing but a Pig loader extending LoadFunc, which uses MyInputFormat as InputFormat.
MyInputFormat is an encapsulation over the actual InputFormat (e.g. TextInputFormat) which has to be used to read actual data from the files (e.g. in my case from file hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000).
In MyInputFormat, override getSplits method; first read the actual file name(s) from the some_temporary_file (You have to get this file name from Configuration's mapred.input.dir property), then update the same Configuration mapred.input.dir with retrieved file names, then return result from wrapped up InputFormat (e.g. in my case TextInputFormat).
Note: 1. You cannot use the setLocation API from the LoadFunc (or some other similar API) to read the contents of some_temporary_file, as its contents will be available only at run time.
2. You may wonder what happens if the LOAD statement executes before the STORE. This will not happen: if STORE and LOAD use the same file in a script, Pig ensures that the jobs are executed in the right sequence. For more detail, read the section "Store-load sequences" on the Pig Wiki.

Multiple inputs into MapReduce job

I'm trying to write a MapReduce job which takes a number of delimited input sources. All sources contain the same information, but it may be in different columns and the separator may differ per source. The sources are parsed in the mapper according to a configuration file. This configuration file allows users to define the separators and column mappings.
For example, input1 is parsed using configuration properties
input1.separator=,
input1.id=1
input1.housename=2
input1.age=15
where 1, 2 and 15 are the columns in input1 which relate to those properties.
So, the mapper needs to know which configuration properties to use for each input source. I can't hard code this as other people will be running my job and will want to add new inputs without requiring a compiler.
The obvious solution is to extract the file name from the splits and apply configuration that way.
For example, assume I'm inputting two files, "source1.txt" and "source2.txt". I could write my configuration like
source1.separator=,
source1.id=2
...
source2.separator=|
source2.id=4
...
The mapper would get the file name from the splits, and then read the configuration properties with the same prefix.
However, if I'm pointing to folders in a Hive warehouse, I can't use this. I could extract bits of the path and use those, but I don't really feel that's an elegant or sturdy solution. Is there an easier way to do this?
I'm not sure whether MultipleInputs provides PathFilter integration. However, you can implement a PathFilter yourself and feed the matched files to different Mapper types based on your criteria.
FileStatus[] csvfiles = fileSystem.listStatus(new Path("hive/path"),
    new PathFilter() {
        public boolean accept(Path path) {
            return path.getName().matches(".*csv$");
        }
    });
Assign the handling Mapper to each file in this list:
MultipleInputs.addInputPath(job, csvfiles[i].getPath(),
    YourFormat.class, CsvMapper.class);
For each file type you have to provide the required regex. Hope you are good at it.
I've solved it. It turns out that the order in which input sources (files or directories) are added to FileInputFormat is maintained, and then stored in the job context as mapreduce.input.fileinputformat.inputdir. So, my solution:
Runner.java
for (int i = X; i < ar.length; i++) {
    FileInputFormat.addInputPath(job, new Path(ar[i]));
}
where X is the first integer at which an input path can be found.
InputMapper.java
// Get the name of the input source in the current mapper
Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String filePathString = filePath.toString();
// Get the ordered list of all input sources
String pathMappings = context.getConfiguration()
    .get("mapreduce.input.fileinputformat.inputdir");
As I know the order in which input sources are added to the job, I can then have the user set configuration properties using numbers, and map the numbers to the order in which input sources were added to the job in the CLI.
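The mapping from a split's file path to a numbered set of properties could look roughly like this. It is a sketch in Python rather than the Java used above, and it rests on several assumptions: the names source_index and settings_for are hypothetical, the "1.separator" style of property key is invented for illustration, and the input paths are assumed to contain no commas.

```python
def source_index(file_path, input_dirs):
    """Return the 1-based position of the input source that contains file_path."""
    sources = [s.rstrip("/") for s in input_dirs.split(",")]
    for position, source in enumerate(sources, start=1):
        if file_path == source or file_path.startswith(source + "/"):
            return position
    raise LookupError(f"{file_path} is not under any configured input source")


def settings_for(file_path, input_dirs, properties):
    """Pick the properties whose numeric prefix matches the source's position."""
    prefix = f"{source_index(file_path, input_dirs)}."
    return {k[len(prefix):]: v for k, v in properties.items() if k.startswith(prefix)}


# Example values only; the real string comes from the job configuration.
input_dirs = "hdfs://nn/warehouse/t1,hdfs://nn/warehouse/t2"
props = {"1.separator": ",", "1.id": "2", "2.separator": "|", "2.id": "4"}
print(settings_for("hdfs://nn/warehouse/t2/000000_0", input_dirs, props))
```

If one input directory is nested inside another, the first match wins in this sketch, so the order of the sources matters.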

Getting output files which contain the value of one key only?

I have a use case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply outputting each value in the iterator. For example, here's some Python streaming code:
import sys
for line in sys.stdin:
    data = line.rstrip("\n").split("\t")
    print(data[1])
This method works for a small dataset (around 4GB). Each output file of the job only contains the values for one key.
However, if I increase the size of the dataset (over 40GB) then each file contains a mixture of keys, in sorted order.
Is there an easier way to solve this? I know that the output will be in sorted order and I could simply do a sequential scan and add to files. But it seems that this shouldn't be necessary since Hadoop sorts and splits the keys for you.
Question may not be the clearest, so I'll clarify if anyone has any comments. Thanks
OK, then create a custom JAR implementation of your MapReduce solution and use MultipleTextOutputFormat as the OutputFormat, as explained here. You just have to emit the file name (in your case the key) as the key in your reducer and the entire payload as the value, and your data will be written to a file named after your key.
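The effect of naming each output file after its key can be mimicked locally in plain Python. Note that this is a swapped-in illustration using itertools.groupby on already-sorted input, not MultipleTextOutputFormat itself; write_by_key is a hypothetical name, and keys are assumed to be safe to use as file names.

```python
import os
import tempfile
from itertools import groupby


def write_by_key(lines, out_dir):
    """Write each key's values to out_dir/<key>; lines must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    written = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        with open(os.path.join(out_dir, key), "w") as handle:
            for _, value in group:
                handle.write(value + "\n")
        written.append(key)
    return written


out_dir = tempfile.mkdtemp()
print(write_by_key(["a\t1\n", "a\t2\n", "b\t3\n"], out_dir))
```

Because groupby only merges adjacent equal keys, this relies on the sorted order that Hadoop already guarantees for reducer input.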

Resources