Pig: Force one mapper per input line/row

I have a Pig Streaming job where the number of mappers should equal the number of rows/lines in the input file. I know that setting
set mapred.min.split.size 16
set mapred.max.split.size 16
set pig.noSplitCombination true
will ensure that each split is 16 bytes. But how do I ensure that each map task gets exactly one line as input? The lines are of variable length, so using a constant for mapred.min.split.size and mapred.max.split.size is not the best solution.
Here is the code I intend to use:
input = load 'hdfs://cluster/tmp/input';
DEFINE CMD `/usr/bin/python script.py`;
OP = stream input through CMD;
dump OP;
SOLVED! Thanks to zsxwing
And, in case anyone else runs into this weird nonsense, know this:
To ensure that Pig creates one mapper for each input file you must set
set pig.splitCombination false
and not
set pig.noSplitCombination true
Why this is the case, I have no idea!

Following your clue, I browsed the Pig source code to find the answer.
Setting pig.noSplitCombination in the Pig script doesn't work. In the Pig script you need to use pig.splitCombination; Pig will then set pig.noSplitCombination in the JobConf according to the value of pig.splitCombination.
If you want to set pig.noSplitCombination directly, you need to use the command line. For example,
pig -Dpig.noSplitCombination=true -f foo.pig
The difference between the two approaches: if you use the set instruction in the Pig script, the value is stored in the Pig properties. If you use -D, it is stored in the Hadoop Configuration.
If you use set pig.noSplitCombination true, then (pig.noSplitCombination, true) is stored in the Pig properties. But when Pig initializes a JobConf, it looks up the value of pig.splitCombination in the Pig properties, so your setting has no effect. Here is the source code. The correct way is set pig.splitCombination false, as you mentioned.
If you use -Dpig.noSplitCombination=true, (pig.noSplitCombination, true) is stored in the Hadoop Configuration. Since the JobConf is copied from the Configuration, the value set with -D is passed directly to the JobConf.
Finally, PigInputFormat reads pig.noSplitCombination from the JobConf to decide whether to combine splits. Here is the source code.
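To make the translation concrete, here is a simplified Java illustration of the behaviour described above. It is not the actual Pig source; the method and variable names are made up, but the key names match the explanation: Pig reads pig.splitCombination from its own properties and writes pig.noSplitCombination into the JobConf.

import java.util.Properties;
import org.apache.hadoop.mapred.JobConf;

class SplitCombinationIllustration {
    // Illustrative only: mimics the translation described above, not Pig's real code.
    static void copySplitCombinationSetting(Properties pigProperties, JobConf jobConf) {
        // `set pig.splitCombination false;` in the script ends up in pigProperties.
        boolean combine = Boolean.parseBoolean(
                pigProperties.getProperty("pig.splitCombination", "true"));
        // PigInputFormat later reads this key from the JobConf.
        jobConf.setBoolean("pig.noSplitCombination", !combine);
    }
}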

Pig load files using tuple's field

I need help with the following use case:
Initially we load some files and process those records (or, more technically, tuples). After this processing, we finally have tuples of the form:
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000, some_field_3)
(some_field_1, hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00001, some_field_3)
So basically, the tuples have a file path as the value of one of their fields (we can obviously transform this into tuples with a single field holding the file path, or into a single tuple with one field holding a delimiter-separated, say comma-separated, string of paths).
So now I have to load these files in the Pig script, but I am not able to do so. Could you please suggest how to proceed? I thought of using the nested foreach operator and tried the following:
data = foreach tuples_with_file_info {
    fileData = load $2 using PigStorage(',');
    ....
    ....
};
However, it's not working.
Edit:
For simplicity, let's assume I have a single tuple with one field containing the file name:
(hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000)
You can't use Pig out of the box to do it.
What I would do is use some other scripting language (bash, Python, Ruby...) to read that file from HDFS and concatenate the file names into a single string that you can then pass as a parameter to a Pig script, for use in your LOAD statement. Pig supports globbing, so you can do the following:
a = LOAD '{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}' ...
so all that's left to do is read the file that contains those file names, concatenate them into a glob such as:
{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}
and pass that as a parameter to Pig so your script would start with:
a = LOAD '$input'
and your Pig call would look like this (the parameter is quoted so the shell does not brace-expand it):
pig -f script.pig -param input='{hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000,hdfs://localhost:9000/user/kailashgupta/data/2/part-r-00000}'
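If you would rather build that glob on the JVM instead of in a shell script, here is a minimal Java sketch of the same idea (the class name BuildGlob and the argument handling are assumptions for illustration): it reads a file containing one HDFS path per line and prints the glob, which a wrapper script can then pass as -param input=... .

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BuildGlob {
    public static void main(String[] args) throws Exception {
        Path listing = new Path(args[0]);   // file containing one HDFS path per line
        FileSystem fs = listing.getFileSystem(new Configuration());
        StringBuilder glob = new StringBuilder("{");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(listing)))) {
            String line;
            boolean first = true;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) continue;
                if (!first) glob.append(',');
                glob.append(line.trim());
                first = false;
            }
        }
        glob.append('}');
        System.out.println(glob);           // pass this value as -param input=...
    }
}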
First, store the tuples_with_file_info into some file:
STORE tuples_with_file_info INTO 'some_temporary_file';
then,
data = LOAD 'some_temporary_file' using MyCustomLoader();
where
MyCustomLoader is nothing but a Pig loader extending LoadFunc, which uses MyInputFormat as its InputFormat.
MyInputFormat is a wrapper over the actual InputFormat (e.g. TextInputFormat) that has to be used to read the actual data from the files (e.g. in my case from the file hdfs://localhost:9000/user/kailashgupta/data/1/part-r-00000).
In MyInputFormat, override the getSplits method: first read the actual file name(s) from some_temporary_file (you have to get this file's location from the Configuration's mapred.input.dir property), then update that same mapred.input.dir property with the retrieved file names, and finally return the result from the wrapped InputFormat (e.g. in my case TextInputFormat). A sketch of this idea follows the notes below.
Note: 1. You cannot use the setLocation API of the LoadFunc (or some other similar API) to read the contents of some_temporary_file, as its contents will only be available at run time.
2. You might wonder: what if the LOAD statement executes before the STORE? This will not happen, because if STORE and LOAD use the same file in the script, Pig ensures the jobs are executed in the right sequence. For more detail, see the "Store-load sequences" section of the Pig Wiki.
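Below is a rough Java sketch of that wrapper InputFormat, under the stated assumptions: some_temporary_file contains one fully qualified path per line, the mapred.input.dir key named above is the one to read and rewrite, and TextInputFormat is the wrapped format. It illustrates the idea, not a tested implementation.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MyInputFormat extends InputFormat<LongWritable, Text> {

    // The actual InputFormat that knows how to read the real data files.
    private final TextInputFormat wrapped = new TextInputFormat();

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();

        // mapred.input.dir initially points at some_temporary_file (assumed to be a single path).
        Path listing = new Path(conf.get("mapred.input.dir"));
        FileSystem fs = listing.getFileSystem(conf);

        // STORE usually writes a directory of part files; collect the readable ones.
        List<Path> listingFiles = new ArrayList<Path>();
        if (fs.getFileStatus(listing).isDirectory()) {
            for (FileStatus st : fs.listStatus(listing)) {
                if (st.isFile() && !st.getPath().getName().startsWith("_")) {
                    listingFiles.add(st.getPath());
                }
            }
        } else {
            listingFiles.add(listing);
        }

        // Read the real input paths, assuming one fully qualified path per line.
        List<String> realPaths = new ArrayList<String>();
        for (Path p : listingFiles) {
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(p)));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    if (!line.trim().isEmpty()) {
                        realPaths.add(line.trim());
                    }
                }
            } finally {
                in.close();
            }
        }

        // Point the job at the real files, then let the wrapped InputFormat do the splitting.
        conf.set("mapred.input.dir", join(realPaths, ","));
        return wrapped.getSplits(context);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx)
            throws IOException, InterruptedException {
        return wrapped.createRecordReader(split, ctx);
    }

    private static String join(List<String> parts, String sep) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            if (sb.length() > 0) sb.append(sep);
            sb.append(part);
        }
        return sb.toString();
    }
}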

Multiple inputs into MapReduce job

I'm trying to write a MapReduce job which takes a number of delimited input sources. All sources contain the same information, but it may be in different columns and the separator may differ per source. The sources are parsed in the mapper using a configuration file. This configuration file allows users to define these different separators and column mappings.
For example, input1 is parsed using configuration properties
input1.separator=,
input1.id=1
input1.housename=2
input1.age=15
where 1, 2 and 15 are the columns in input1 which relate to those properties.
So, the mapper needs to know which configuration properties to use for each input source. I can't hard-code this, as other people will be running my job and will want to add new inputs without recompiling.
The obvious solution is to extract the file name from the splits and apply configuration that way.
For example, assume I'm inputting two files, "source1.txt" and "source2.txt". I could write my configuration like
source1.separator=,
source1.id=2
...
source2.separator=|
source2.id=4
...
The mapper would get the file name from the splits, and then read the configuration properties with the same prefix.
However, if I'm pointing to folders in a Hive warehouse, I can't use this. I could extract bits of the path and use those, but I don't really feel that's an elegant or sturdy solution. Is there an easier way to do this?
I'm not sure whether MultipleInputs provides PathFilter integration. However, you can list the files yourself with a PathFilter and feed the matched files to different Mapper types based on your criteria.
FileStatus[] csvfiles = fileSystem.listStatus(new Path("hive/path"),
        new PathFilter() {
            public boolean accept(Path path) {
                return (path.getName().matches(".*csv$"));
            }
        });
Then assign the handling Mapper to each file in this list:
for (FileStatus csvFile : csvfiles) {
    MultipleInputs.addInputPath(job, csvFile.getPath(),
            YourFormat.class, CsvMapper.class);
}
For each file type you have to provide the required regex. Hope you are good at it.
I've solved it. It turns out that the order in which input sources (files or directories) are added to FileInputFormat is maintained, and then stored in the job context as mapreduce.input.fileinputformat.inputdir. So, here is my solution.
Runner.java
for (int i = X; i < ar.length; i++) {
    FileInputFormat.addInputPath(job, new Path(ar[i]));
}
where X is the index of the first command-line argument that contains an input path.
InputMapper.java
// Get the path of the input source in the current mapper
Path filePath = ((FileSplit) context.getInputSplit()).getPath();
String filePathString = ((FileSplit) context.getInputSplit()).getPath().toString();
// Get the ordered list of all input sources
String pathMappings = context.getConfiguration()
        .get("mapreduce.input.fileinputformat.inputdir");
Since I know the order in which input sources are added to the job, I can have the user set configuration properties using numbers, and map those numbers to the order in which the input sources were added to the job on the command line.
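To make that lookup concrete, here is a hedged sketch that continues from the snippet above (it reuses context, filePathString and pathMappings); property names such as source0.separator and source0.id are illustrative assumptions, not the asker's actual scheme.

// Continues from the snippet above (uses context, filePathString and pathMappings).
// Property names such as "source0.separator" are illustrative assumptions.
String[] orderedInputs = pathMappings.split(",");
int sourceIndex = -1;
for (int i = 0; i < orderedInputs.length; i++) {
    // A file inside a directory input starts with that directory's path;
    // the matching may need tweaking if the paths are qualified differently.
    if (filePathString.startsWith(orderedInputs[i])) {
        sourceIndex = i;
        break;
    }
}
String separator = context.getConfiguration().get("source" + sourceIndex + ".separator", ",");
int idColumn = context.getConfiguration().getInt("source" + sourceIndex + ".id", 0);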

Mapreduce without reducer function

I have a data file, and my task is to use MapReduce to create new data from each line of the file, because the file is huge.
For example, the file contains the expression (3-4*7-4) and I need to randomly create a new expression from it, e.g. (3+4/7*4). When I implement the task using MapReduce, I use the map phase to do the change and the reduce phase just to receive the data from the mapper and sort it. Is it correct to use just the map phase to do the main task?
If you do not need the map results sorted, set the number of reducers to 0 by calling
job.setNumReduceTasks(0);
in your driver code, and the job becomes a map-only job.
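For reference, a minimal map-only driver might look like the sketch below. The class names MapOnlyDriver and ExpressionMapper are placeholders, not code from the question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only expression rewrite");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(ExpressionMapper.class); // placeholder: your mapper that rewrites each line
        job.setNumReduceTasks(0);                   // no reduce phase: mapper output is written straight to HDFS
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}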
Your implementation is correct. Just make sure the keys output from the mapper are all unique if you don't want any expressions that happen to be identical being combined.
For example, since you said you have a huge data file, there may be a possibility that you get two expressions such as 3-4*7-4 and 3*4/7+4 and both new expressions turn out to be 3+4*7-4. If you use the expression as the key, the reducer will only get called once for both expressions. If you don't want this to happen, make sure you use a unique number for each key.
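One simple way to get unique keys, shown as a sketch below (ExpressionMapper and rewrite are placeholders): with text input, the LongWritable key the framework passes to each map call is the byte offset of the line, which is already unique within a file, so it can double as the output key.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExpressionMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // `offset` is the byte offset of the line, unique within the input file,
        // so identical expressions are never merged by a later reduce phase.
        String rewritten = rewrite(line.toString());
        context.write(offset, new Text(rewritten));
    }

    // Placeholder for the random-rewrite logic from the question.
    private String rewrite(String expression) {
        return expression;
    }
}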

Is there any Conditional IF like operator in Apache PIG?

I am writing a Pig script and want to execute a set of statements if a certain condition is satisfied.
I have set a variable and am checking it for a particular value. Suppose:
if flag==0 then
A = LOAD 'file' using PigStorage() as (f1:int, ....);
B = ...;
C = ....;
else
again some Pig Latin statements
Can I do this in a Pig script? If yes, then how can I do it?
Thanks.
Yes, Pig does offer an if-then-else construction, but it is not used in the way you're asking.
Pig's if-then-else is an arithmetic operator invoked with the shorthand "condition ? true_value : false_value" as part of an expression, such as:
X = FOREACH A GENERATE f2, (f2==1?1:COUNT(B));
You have to have already loaded the table A to do this. To execute control flow around entire Pig statements, you'll need something like Oozie, as suggested by Fakrudeen.
You can create a Python wrapper around your Pig script. See Embedded Pig in the docs.
Pig is a data flow language, not a control flow language.
The only construct that comes close is Pig's SPLIT, but it is very limited.
You can use Oozie and its decision construct with two Pig scripts.
Create a UDF (say, in Java) and then embed that into your PIG script. You will need to 'register' the jar file that you generate after writing the UDF.
(Something like this.) Say your Java UDF class is UDFCondition and the generated jar file is PigUDFCondition.jar; then in your Pig code:
register PigUDFCondition.jar
X = foreach A generate UDFCondition(..flag...)
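As an illustration only (the class name UDFCondition comes from the snippet above; the return values are made up), such a UDF might look like:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UDFCondition extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        // The flag field passed in from the script, e.g. UDFCondition(flag).
        Integer flag = (Integer) input.get(0);
        return (flag != null && flag == 0) ? "branch_a" : "branch_b";
    }
}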
There is a CASE statement available from Pig 0.12 onwards.

How does MapReduce read from multiple input files?

I am developing code to read data and write it into HDFS using MapReduce. However, when I have multiple files, I don't understand how they are processed. The input path passed to the mapper is the name of the directory, as is evident from the output of
String filename = conf1.get("map.input.file");
So how does it process the files in the directory?
In order to get the input file path you can use the context object, like this:
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String inputFilePath = fileSplit.getPath().toString();
And as for how multiple files are processed:
Several instances of the mapper function are created on different machines in the cluster. Each instance receives a different input file. If files are bigger than the default DFS block size (128 MB), they are further split into smaller parts, which are then distributed to the mappers.
So you can configure the input size received by each mapper in the following two ways (see the sketch after the note below):
change the HDFS block size (e.g. dfs.block.size=1048576)
set the parameter mapred.min.split.size (this only has an effect when set larger than the HDFS block size)
Note:
These parameters will only be effective if your input format supports splitting the input files. Common compression codecs (such as gzip) don't support splitting the files, so these will be ignored.
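As a small sketch of how those two knobs can be set from driver code (property names as quoted above; the values are arbitrary examples, and dfs.block.size only affects files written with this configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 1048576L);          // 1 MB block size for files written by this job
        conf.setLong("mapred.min.split.size", 268435456L); // 256 MB minimum split size
        Job job = Job.getInstance(conf, "multi-file input example");
        // ... set mapper class, input/output paths, etc. as usual ...
    }
}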
Continuing from Amar's answer, I used a FileStatus object in the following code, as my customised input format would not split the input file.
FileSystem fs = file.getFileSystem(conf);
FileStatus status = fs.getFileStatus(file);     // metadata for the whole (unsplit) input file
String fileName = status.getPath().toString();  // fully qualified input file name
