How do you deal with empty or missing input files in Apache Pig? - hadoop

Our workflow uses an AWS Elastic MapReduce cluster to run a series of Pig jobs that manipulate a large amount of data into aggregated reports. Unfortunately, the input data is potentially inconsistent and can result in either no input files or 0-byte files being given to the pipeline, or even being produced by some stages of the pipeline.
During a LOAD statement, Pig fails spectacularly if it doesn't find any input files or if any of the input files are 0 bytes.
Is there any good way to work around this (hopefully within the Pig configuration or script or the Hadoop cluster configuration, without writing a custom loader...)?
(Since we're using AWS elastic map reduce, we're stuck with Pig 0.6.0 and Hadoop 0.20.)

(For posterity, a sub-par solution we've come up with:)
To deal with the 0-byte problem, we've found that we can detect the situation and instead insert a file with a single newline. This causes a message like:
Encountered Warning ACCESSING_NON_EXISTENT_FIELD 13 time(s).
but at least Pig doesn't crash with an exception.
Alternatively, we could produce a line with the appropriate number of '\t' characters for that file which would avoid the warning, but it would insert garbage into the data that we would then have to filter out.
These same ideas could be used to solve the no-input-files condition by creating a dummy file, but with the same downsides as listed above.
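A rough sketch of that pre-processing step as a shell script (the input path here is a placeholder, and the column layout of hadoop fs -ls can vary between Hadoop versions):
INPUT=/data/stage1                      # hypothetical input directory
echo > placeholder.txt                  # a file containing a single newline
# Find regular files whose size column is 0 and swap each one for the
# placeholder so that the Pig LOAD has something to read.
hadoop fs -ls "$INPUT" | awk '$1 ~ /^-/ && $5 == 0 {print $NF}' | while read -r f; do
    hadoop fs -rm "$f"
    hadoop fs -copyFromLocal placeholder.txt "$f"
done
rm -f placeholder.txt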

The approach I've been using is to run Pig scripts from a shell script. I have one job that gets data from six different input directories, so I've written a script fragment for each input.
The shell script checks for the existence of each input and assembles the final Pig script from the fragments.
It then executes the final Pig script. I know it's a bit of a Rube Goldberg approach, but so far so good. :-)
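A condensed sketch of that fragment approach (the directory and fragment names are made up):
#!/bin/bash
# Build final.pig from only those fragments whose input directory actually exists.
> final.pig
for name in clicks impressions conversions; do
    if hadoop fs -test -e "/input/$name"; then
        cat "fragments/$name.pig" >> final.pig
    fi
done
cat fragments/aggregate_and_store.pig >> final.pig   # the common tail of the script
pig final.pig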

Related

Downloading list of files in parallel in Apache Pig

I have a simple text file which contains a list of folders on some FTP servers. Each line is a separate folder. Each folder contains a couple of thousand images. I want to connect to each folder, store all the files inside that folder in a SequenceFile, and then remove that folder from the FTP server. I have written a simple Pig UDF for this. Here it is:
dirs = LOAD '/var/location.txt' USING PigStorage();
results = FOREACH dirs GENERATE download_whole_folder_into_single_sequence_file($0);
/* I don't need results bag. It is just a dummy bag */
The problem is that I'm not sure whether each line of input is processed in a separate mapper. The input file is not huge, just a couple of hundred lines. If it were pure Map/Reduce then I would use NLineInputFormat and process each line in a separate mapper. How can I achieve the same thing in Pig?
Pig lets you write your own load functions, which let you specify which InputFormat you'll be using. So you could write your own.
That said, the job you described sounds like it would only involve a single map-reduce step. Since using Pig wouldn't reduce complexity in this case, and you'd have to write custom code just to use Pig, I'd suggest just doing it in vanilla map-reduce. If the total file size is gigabytes or less, I'd just do it all directly on a single host. It's simpler not to use map-reduce if you don't have to.
I typically use map-reduce to first load data into HDFS, and then Pig for all the data processing. Pig doesn't really add any benefit over vanilla Hadoop for loading data, IMO; it's just a wrapper around InputFormat/RecordReader with additional methods you need to implement. Plus, with Pig it's technically possible that your loader will be called multiple times. That's a gotcha you don't need to worry about when using Hadoop map-reduce directly.

Processing logs in Amazon EMR with or without using Hive

I have a lot of log files in my EMR cluster at the path 'hdfs:///logs'. Each log entry spans multiple lines but has starting and ending markers to demarcate between two entries.
Now,
a) not all entries in a log file are useful, and
b) the useful entries need to be transformed, and the output needs to be stored in an output file so that I can efficiently query (using Hive) the output logs later.
I have a Python script which can take a log file and do parts a) and b) above, but I have not written any mappers or reducers.
Hive takes care of mappers and reducers for its queries. Is it possible to use the Python script to run over all the logs and save the output in 'hdfs:///outputlogs', and if so, how?
I am new to MapReduce and have seen some word-count examples, but all of them have a single input file. Where can I find examples that have multiple input files?
I see that you have a two-fold issue here:
1. Having more than one file as input
The same word-count example will work if you pass in more than one file as input. In fact, you can very easily pass a folder name as input instead of a file name, in your case hdfs:///logs.
You may even pass a comma-separated list of paths as input. To do this, instead of the following:
FileInputFormat.setInputPaths(conf, new Path(args[0]));
you may use the following:
FileInputFormat.setInputPaths(job, args[0]);
Note that passing the comma-separated list as args[0] is sufficient.
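For illustration, a sketch of what such an invocation might look like (the jar name, driver class, and paths here are made-up placeholders):
# args[0] is the whole comma-separated list of input paths; args[1] is the output directory.
hadoop jar logjob.jar com.example.LogDriver \
    hdfs:///logs/serverA,hdfs:///logs/serverB,hdfs:///logs/serverC \
    hdfs:///outputlogs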
2. How to convert your logic to MapReduce
This does have a steep learning curve, as you will need to think in terms of keys and values. But I feel that you can just have all the logic in the mapper itself and use an IdentityReducer, like this:
conf.setReducerClass(IdentityReducer.class);
If you spend some time reading examples from the following locations, you should be in a better position to make these decisions:
hadoop-map-reduce-examples ( http://hadoop-map-reduce-examples.googlecode.com/svn/trunk/hadoop-examples/src/ )
http://developer.yahoo.com/hadoop/tutorial/module4.html
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html
The long-term correct way to do this is, as Amar stated, to write a MapReduce job for it.
However, if this is a one-time thing and the data isn't too enormous, it might be simplest/easiest to do it with a plain bash script, since you already have the Python script:
hadoop fs -text /logs/* > input.log
python myscript.py input.log output.log
hadoop fs -copyFromLocal output.log /outputlogs
rm -f input.log output.log
If this is a repeated process - something you want to be reliable and efficient - or if you just want to learn to use MapReduce better, then stick with Amar's answer.
If you already have the logic written and you want to do parallel processing using EMR and/or vanilla Hadoop, you can use Hadoop Streaming: http://hadoop.apache.org/docs/r0.15.2/streaming.html. In a nutshell, your script, which takes data on stdin and writes its output to stdout, can become a mapper.
That way you run the processing over the data in HDFS using the cluster, without needing to repackage your code.
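For example, a rough sketch of a streaming run over the logs (the streaming jar location varies by Hadoop version, and this assumes myscript.py can read raw log text line by line on stdin and print the transformed records to stdout):
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=0 \
    -input hdfs:///logs \
    -output hdfs:///outputlogs \
    -mapper 'python myscript.py' \
    -file myscript.py
# With zero reduce tasks, each mapper's output is written straight to the
# output directory as its own part file.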

Hadoop Load and Store

When I try to run a Pig script which has two STORE statements writing to the same location, like this:
store Alert_Message_Count into 'out';
store Warning_Message_Count into 'out';
It hangs; I mean, it does not proceed after showing 50% done.
Is this wrong? Can't we store both results in the same file (folder)?
HDFS does not have an append mode. So in most cases where you are running map-reduce programs, the output file is opened once, the data is written, and then it is closed. Given this approach, you cannot write data simultaneously onto the same file.
Try writing to separate files and check whether the map-reduce programs still hang. If they do, then there is some other issue.
You can look at the results and the map-reduce logs to analyze what went wrong.
[Edit:]
You cannot write to the same file or append to an existing file; the HDFS append feature is a work in progress.
To work around this you can do one of two things:
1) If Alert_Message_Count and Warning_Message_Count have the same schema, you could use UNION as suggested by Chris.
2) Do post-processing when the schemas are not the same; that is, write a map-reduce program to merge the two separate outputs into one.
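If both outputs are plain text and all you actually need is to land them in one place (this is not the schema-merging map-reduce job described in 2), a shell-level merge is a quick alternative; a rough sketch with made-up directory names:
hadoop fs -getmerge alert_out alerts.txt       # concatenate all part files into one local file
hadoop fs -getmerge warning_out warnings.txt
cat alerts.txt warnings.txt > merged.txt
hadoop fs -mkdir merged_out
hadoop fs -copyFromLocal merged.txt merged_out/part-00000
rm -f alerts.txt warnings.txt merged.txt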
Normally Hadoop MapReduce won't allow you to save job output to a folder that already exists, so I would guess that this isn't possible either (seeing as Pig translates the commands into a series of M/R steps), but I would expect some form of error message rather than the job just hanging.
If you open the cluster job tracker, and look at the logs for the task, does the log yield anything of note which can help diagnose this further?
It might also be worth checking with the Pig mailing lists (if you haven't already).
If you want to append one dataset to another, use the union keyword:
grunt> All_Count = UNION Alert_Message_Count, Warning_Message_Count;
grunt> store All_Count into 'out';

Writing to single file from mappers

I am working on a MapReduce job that generates a CSV file out of some data that is read from HBase. Is there a way to write to a single file from the mappers without a reduce phase (or to merge the multiple files generated by the mappers at the end of the job)? I know that I can set the output format for writing to a file at the job level; is it possible to do something similar for the mappers?
Thanks
It is possible (and not uncommon) to have a Map/Reduce job without a reduce phase (example). For that you just use job.setNumReduceTasks(0).
However, I am not sure how the job output is handled in this case. Usually you get one result file per reducer. Without reducers I could imagine that you either get one file per mapper or that you cannot produce job output. You will have to try/research that.
If the above does not work for you, you could still use the default Reducer implementation, which just forwards the mapper output (identity function).
Seriously, this is not how MapReduce works.
Why do you even need a job for that? Write a simple Java application that does the same for you. There are also command-line utilities that do the same.

Can Apache Pig load data from STDIN instead of a file?

I want to use Apache Pig to transform/join data in two files, but I want to implement it step by step; that is, test it on real data, but at a small size (10 lines, for example). Is it possible for Pig to read from STDIN and output to STDOUT?
Basically, Hadoop supports streaming in various ways, but Pig originally lacked support for loading data through streaming. However, there are some solutions.
You can check out HStreaming:
A = LOAD 'http://myurl.com:1234/index.html' USING HStream('\n') AS (f1, f2);
The answer is no. The data needs to be out on the cluster's data nodes before any MR job can even run over it.
However, if you are using a small sample of data and just want to do something simple, you could use Pig in local mode: write stdin to a local file and run your script over it.
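A tiny sketch of that workflow (the file and script names are made up):
head -n 10 real_input.tsv > sample.tsv   # or paste your 10 test lines: cat > sample.tsv
pig -x local test.pig                    # test.pig LOADs 'sample.tsv' and DUMPs the result
In local mode Pig reads and writes the local filesystem, so the 10-line sample never has to be uploaded to HDFS.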
But the bigger question is why you want to use MR/Pig on a stream of data; it was not, and is not, intended for that type of use.
