pig hadoop needed for I want to do? - hadoop

I have a question for you, well a clarification...
I developed a program that uses hadoop map reduce wich gets just a column from a dataset (csv file) and process this data with some functions, so this program is finished, but the real question is
Is a good idea to develop this program in Pig? note that in the process of the file I dont use FILTERS COUNTS or any built in function of Pig...
Am I right if I say that passing this hadoop map reduce program to Pig has no sense? because all my functions will need to be rewrited as a Pig User Defined Function UDF...

in my opinion there is actually not really a need to write your very small script in PigLatin. It'll be more interesting when its getting bigger, because PigLatin requires less lines of code than java, and its more readable, especially to non-java-developer :)

Related

Downloading list of files in parallel in Apache Pig

I have a simple text file which contains list of folders on some FTP servers. Each line is a separate folder. Each folder contains couple of thousand images. I want to connect to each folder, store all files inside that foder in a SequenceFile and then remove that folder from FTP server. I have written a simple pig UDF for this. Here it is:
dirs = LOAD '/var/location.txt' USING PigStorage();
results = FOREACH dirs GENERATE download_whole_folder_into_single_sequence_file($0);
/* I don't need results bag. It is just a dummy bag */
The problem is I'm not sure if each line of input is processed in separate mapper. The input file is not a huge file just couple of hundred lines. If it were pure Map/Reduce then I would use NLineInputFormat and process each line in a separate Mapper. How can I achieve the same thing in pig?
Pig lets you write your own load functions, which let you specify which InputFormat you'll be using. So you could write your own.
That said, the job you described sounds like it would only involve a single map-reduce step. Since using Pig wouldn't reduce complexity in this case, and you'd have to write custom code just to use Pig, I'd suggest just doing it in vanilla map-reduce. If the total file size is Gigabytes or less, I'd just do it all directly on a single host. It's simpler not to use map reduce if you don't have to.
I typically use map-reduce to first load data into HDFS, and then Pig for all data processing. Pig doesn't really add any benefits over vanilla hadoop for loading data IMO, it's just a wrapper around InputFormat/RecordReader with additional methods you need to implement. Plus it's technically possible with Pig that your loader will be called multiple times. That's a gotcha you don't need to worry about using Hadoop map-reduce directly.

Java Vs Scripting for HDFS map/reduce

I am a DB person, so java is new to me. Looking for scripting language for working with HDFS, may be Python I am looking for. But I see in one of the previous question, you mentioned that "Heart Beat" between Name and Data node will not happen if we use scripting language. Why, I could not understand? As we are writing our application logic to process data in the scripts or java code, and how it matter for the "Heart Beat"?
Any idea, on this?
Python is good choice for hadoop if you know already how to code with it. I've used php and perl with success. This part of Hadoop framework is called Streaming.
For "Heart Beat" I believe you are thinking of Counters. They are user defined "variables" that can only be incremented. Hadoop will terminate task attempt if no counters are incremented for 10 minutes. However you shouldn't worry about this as there are system counters that are automatically incremented for you. If you do have a job that takes very long, you can still use counters with Python (Hadoop Streaming) by sending something like this to standard error output:
reporter:counter:MyGroup,MyCounter,1
For more info on counters with Hadoop Streaming see this

how to output to HDFS from mapper directly?

In certain criteria we want the mapper do all the work and output to HDFS, we don't want the data transmitted to reducer(will use extra bandwidth, please correct me if there is case its wrong).
a pseudo code would be:
def mapper(k,v_list):
for v in v_list:
if criteria:
write to HDFS
else:
emit
I found it hard because the only thing we can play with is OutputCollector.
One thing I think of is to exend OutputCollector, override OutputCollector.collect and do the stuff.
Is there any better ways?
You can just set the number of reduce tasks to 0 by using JobConf.setNumReduceTasks(0). This will make the results of the mapper go straight into HDFS.
From the Map-Reduce manual: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem,
into the output path set by setOutputPath(Path). The framework does not sort
the map-outputs before writing them out to the FileSystem.
I'm assuming that you're using streaming, in which case there is no standard way of doing this.
It's certainly possible in a java Mapper. For streaming you'd need amend the PipeMapper java file, or like you say write your own output collector - but if you're going to that much trouble, you might has well just write a java mapper.
Not sending something to the Reducer may not actually save bandwidth if you are still going to write it to the HDFS. The HDFS is still replicated to other nodes and the replication is going to happen.
There are other good reasons to write output from the mapper though. There is a FAQ about this, but it is a little short on details except to say that you can do it.
I found another question which is potentially a duplicate of yours here. That question has answers that are more help if you are writing a Mapper in Java. If you are trying to do this in a streaming way, you can just use the hadoop fs commands in scripts to do it.
We can in fact write output to HDFS and pass it on to Reducer also at the same time. I understand that you are using Hadoop Streaming, I've implemented something similar using Java MapReduce.
We can generate named output files from a Mapper or Reducer using MultipleOutputs. So, in your Mapper implementation after all the business logic for processing input data, you can write the output to MultipleOutputs using multipleOutputs.write("NamedOutputFileName", Outputkey, OutputValue) and for the data you want to pass on to reducer you can write to context using context.write(OutputKey, OutputValue)
I think if you can find something to write the data from mapper to a named output file in the language you are using (Eg: Python) - this will definitely work.
I hope this helps.

Writing to single file from mappers

I am working on mapreduce that is generating CSV file out of some data that is read from HBase. Is there a way to write to single file from mappers without reduce phase (or to merge multiple files generated by mappers at the end of job)? I know that I can set output format to write in file on Job level, is it possible to do similar thing for mappers?
Thanks
It is possible (and not uncommon) to have a Map/Reduce-Job without a reduce phase (example). For that you just use job.setNumReduceTasks(0).
However I am not sure how Job-Output is handled in this case. Ususally you get one result file per reducer. Without reducers I could imagine that you either get one file per mapper or that you cannot produce job output. You will have to try/research that.
If the above does not work for you, you could still use the default Reducer implementation, that just forwards the mapper output (identity function).
Seriously, this is not how MapReduce works.
Why do you even need a Job for that? Write a simple Java application that does the same for you. There are also command line utils that does the same for you.

Is the input file to a MapReduce program mandatory?

I am working on a use case where I generate random data using a map reduce program and I do not require any input file in HDFS. If I don't give input path MR program doesn't work. So, currently I have a dummy input file. Is there any way to avoid this?
Usually MR programs have some sort of data for processing. But, there might be scenarios like Random Generation where is there is no data to be processed. Checkout the TeraGen program for the random number generation which takes number of rows and the output directory as input. Also, I haven't tried the DataGenerator, but it seems interesting.

Resources