Hadoop Load and Store

When I try to run a Pig script which has two "store" statements writing to the same file, like this:
store Alert_Message_Count into 'out';
store Warning_Message_Count into 'out';
It hangs; I mean it does not proceed after showing 50% done.
Is this wrong? Can't we store both results in the same file (folder)?

HDFS does not have an append mode. So in most cases where you are running map-reduce programs, the output file is opened once, data is written, and then it is closed. Given this, you cannot write data simultaneously to the same file.
Try writing to separate files and check whether the map-reduce programs still hang. If they do, there is some other issue.
You can look at the results and the map-reduce logs to analyze what went wrong.
[Edit:]
You cannot write to the same file or append to an existing file; the HDFS append feature is a work in progress.
To work around this you can do two things:
1) If Alert_Message_Count and Warning_Message_Count have the same schema, you could use union as suggested by Chris.
2) Do post-processing when the schemas are not the same, i.e. write a map-reduce program to merge the two separate outputs into one.

Normally Hadoop MapReduce won't allow you to save job output to a folder that already exists, so I would guess that this isn't possible either (seeing as Pig translates the commands into a series of M/R steps) - but I would expect some form of error message rather than it just hanging.
If you open the cluster job tracker, and look at the logs for the task, does the log yield anything of note which can help diagnose this further?
It might also be worth checking with the Pig mailing lists (if you haven't already).
If you want to append one dataset to another, use the union keyword:
grunt> All_Count = UNION Alert_Message_Count, Warning_Message_Count;
grunt> store All_Count into 'out';

Related

Processing logs in Amazon EMR with or without using Hive

I have a lot of log files in my EMR cluster at path 'hdfs:///logs'. Each log entry spans multiple lines but has starting and ending markers to demarcate between two entries.
Now,
a. not all entries in a log file are useful, and
b. the useful entries need to be transformed and the output needs to be stored in an output file, so that I can efficiently query (using Hive) the output logs later.
I have a python script which can simply take a log file and do parts a. and b. mentioned above, but I have not written any mappers or reducers.
Hive takes care of mappers and reducers for its queries. Please tell me if and how it is possible to run the python script over all the logs and save the output in 'hdfs:///outputlogs'?
I am new to MapReduce and have seen some examples of word count, but all of them have a single input file. Where can I find examples which have multiple input files?
I see that you have a two-fold issue here:
Having more than one file as input
The same word count example will work if you pass in more than one file as input. In fact, you can very easily pass a folder name as input instead of a file name, in your case hdfs:///logs.
You may even pass a comma-separated list of paths as input; to do this, instead of using the following:
FileInputFormat.setInputPaths(conf, new Path(args[0]));
you may use the following:
FileInputFormat.setInputPaths(conf, args[0]);
Note that passing just a comma-separated list of paths as args[0] is sufficient.
How to convert your logic to mapreduce
This does have a steep learning curve, as you will need to think in terms of keys and values. But I feel that you can just have all the logic in the mapper itself and use an IdentityReducer, like this:
conf.setReducerClass(IdentityReducer.class);
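To make that concrete, here is a minimal driver sketch using the older mapred API that the snippets above use. The class name and the filter/transform logic inside the mapper are only placeholders for what your python script currently does, and it assumes line-oriented text input, so multi-line log entries would still need a custom InputFormat or extra assembly logic:
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class LogTransformJob {

    // Placeholder mapper: this is where the per-entry filter/transform logic
    // from the python script would be re-implemented.
    public static class LogTransformMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, NullWritable> out, Reporter reporter)
                throws IOException {
            String record = line.toString();
            if (!record.isEmpty()) {                               // stand-in for "is this entry useful?"
                out.collect(new Text(record), NullWritable.get()); // stand-in for the transform
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(LogTransformJob.class);
        conf.setJobName("log-transform");

        // args[0] may be a single directory (e.g. hdfs:///logs)
        // or a comma-separated list of paths.
        FileInputFormat.setInputPaths(conf, args[0]);
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(LogTransformMapper.class);
        conf.setReducerClass(IdentityReducer.class); // just pass the mapper output through

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);

        JobClient.runJob(conf);
    }
}
With the default TextOutputFormat and a NullWritable value, each output line is just the transformed record, which is convenient if you later point a Hive external table at the output directory.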
If you spend some time reading the examples at the following locations, you should be in a better position to make these decisions:
hadoop-map-reduce-examples ( http://hadoop-map-reduce-examples.googlecode.com/svn/trunk/hadoop-examples/src/ )
http://developer.yahoo.com/hadoop/tutorial/module4.html
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html
The long-term correct way to do this is, as Amar stated, to write a MapReduce job to do it.
However, if this is a one-time thing, and the data isn't too enormous, it might be simplest/easiest to do this with a simple bash script since you already have the python script:
hadoop fs -text /logs/* > input.log
python myscript.py input.log output.log
hadoop fs -copyFromLocal output.log /outputlogs
rm -f input.log output.log
If this is a repeated process - something you want to be reliable and efficient - or if you just want to learn to use MapReduce better, then stick with Amar's answer.
If you have the logic already written, and you want to do parallel processing using EMR and/or vanilla Hadoop, you can use Hadoop Streaming: http://hadoop.apache.org/docs/r0.15.2/streaming.html. In a nutshell, your script, which takes data on stdin and writes output to stdout, can become a mapper.
Thus you can run the processing of the data in HDFS on the cluster, without needing to repackage your code.

hadoop/HDFS: Is it possible to write from several processes to the same file?

e.g. create a 20-byte file:
the 1st process will write bytes 0 to 4,
the 2nd bytes 5 to 9,
etc.
I need this to create big files in parallel using my MapReduce job.
Thanks.
P.S. Maybe it is not implemented yet, but it should be possible in general - please point me to where I should dig.
Are you able to explain what you plan to do with this file after you have created it?
If you need to get it out of HDFS before using it, you can let Hadoop M/R create separate files and then use a command like hadoop fs -cat /path/to/output/part* > localfile to combine the parts into a single file saved on the local file system.
Otherwise, there is no way you can have multiple writers open to the same file - reading and writing to HDFS is stream-based, and while you can have multiple readers open (possibly reading different blocks), multiple writers are not possible.
Web downloaders request parts of the file using the Range HTTP header in multiple threads, and then either use tmp files before merging the parts together later (as Thomas Jungblut suggests), or they might be able to make use of random IO, buffering the downloaded parts in memory before writing them to the output file in the correct location. Unfortunately, you don't have the ability to perform random output with Hadoop HDFS.
I think the short answer is no. The way to accomplish this is to write your multiple 'preliminary' files to Hadoop and then M/R them into a single consolidated file. Basically, use Hadoop; don't reinvent the wheel.
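As a rough illustration of that consolidation step (new mapreduce API, hypothetical class names, and assuming plain-text records whose relative order does not matter), every mapper simply re-emits its lines under a single key and one reducer writes them all out, so the job ends with exactly one part-r-00000 file:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConsolidateFiles {

    // Pass every line through unchanged, all under the same (null) key.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), value);
        }
    }

    // The single reducer writes every record into one output file.
    public static class SingleFileReducer
            extends Reducer<NullWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(NullWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "consolidate-preliminary-files");
        job.setJarByClass(ConsolidateFiles.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setReducerClass(SingleFileReducer.class);
        job.setNumReduceTasks(1);                                // one reducer => one output file
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));   // the 'preliminary' output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // the consolidated output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Since a single reducer funnels all of the output through one task, this only makes sense when the consolidated file is of a manageable size.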

Writing to single file from mappers

I am working on a MapReduce job that generates a CSV file out of some data read from HBase. Is there a way to write to a single file from the mappers without a reduce phase (or to merge the multiple files generated by the mappers at the end of the job)? I know that I can set the output format to write to a file at the Job level; is it possible to do a similar thing for the mappers?
Thanks
It is possible (and not uncommon) to have a Map/Reduce job without a reduce phase. For that you just use job.setNumReduceTasks(0).
However, I am not sure how the job output is handled in this case. Usually you get one result file per reducer. Without reducers, I could imagine that you either get one file per mapper or that you cannot produce job output at all; you will have to try/research that.
If the above does not work for you, you could still use the default Reducer implementation, which just forwards the mapper output (an identity function).
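For reference, a minimal map-only job setup can look roughly like the sketch below (new mapreduce API, hypothetical names). The placeholder mapper here reads plain text; in your case it would be an HBase TableMapper set up through TableMapReduceUtil instead. Note that with zero reducers each mapper still writes its own part-m-NNNNN file, so to end up with exactly one CSV you would still have to merge them afterwards (for example with hadoop fs -getmerge) or fall back to a single-reducer job:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyCsvJob {

    // Placeholder mapper: a real job reading from HBase would use a TableMapper
    // here and build the CSV line from each row's cells.
    public static class CsvMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), value);   // emit the CSV line as-is
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "map-only-csv");
        job.setJarByClass(MapOnlyCsvJob.class);

        job.setMapperClass(CsvMapper.class);
        job.setNumReduceTasks(0);                       // no reduce phase at all

        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}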
Seriously, this is not how MapReduce works.
Why do you even need a job for that? Write a simple Java application that does the same for you. There are also command-line utilities that do the same for you.

Can Apache Pig load data from STDIN instead of a file?

I want to use Apache Pig to transform/join data in two files, but I want to implement it step by step, which means testing it on real data but with a small size (10 lines, for example). Is it possible to have Pig read from STDIN and output to STDOUT?
Basically Hadoop supports Streaming in various ways, but Pig originally lacked support for loading data through streaming. However there are some solutions.
You can check out HStreaming:
A = LOAD 'http://myurl.com:1234/index.html' USING HStream('\n') AS (f1, f2);
The answer is no. The data needs to be out on the cluster's data nodes before any MR job can even run over it.
However, if you are using a small sample of data and just want to do something simple, you could use Pig in local mode: write stdin to a local file and run your script over that.
But the bigger question is why you want to use MR/Pig on a stream of data; it was not and is not intended for that type of use.

How do you deal with empty or missing input files in Apache Pig?

Our workflow uses an AWS Elastic MapReduce cluster to run a series of Pig jobs that manipulate a large amount of data into aggregated reports. Unfortunately, the input data is potentially inconsistent and can result in either no input files or 0-byte files being given to the pipeline, or even being produced by some stages of the pipeline.
During a LOAD statement, Pig fails spectacularly if it either doesn't find any input files or any of the input files are 0 bytes.
Is there any good way to work around this (hopefully within the Pig configuration or script or the Hadoop cluster configuration, without writing a custom loader...)?
(Since we're using AWS elastic map reduce, we're stuck with Pig 0.6.0 and Hadoop 0.20.)
(For posterity, a sub-par solution we've come up with:)
To deal with the 0-byte problem, we've found that we can detect the situation and instead insert a file with a single newline. This causes a message like:
Encountered Warning ACCESSING_NON_EXISTENT_FIELD 13 time(s).
but at least Pig doesn't crash with an exception.
Alternatively, we could produce a line with the appropriate number of '\t' characters for that file which would avoid the warning, but it would insert garbage into the data that we would then have to filter out.
These same ideas could be used to address the no-input-files condition by creating a dummy file, but that has the same downsides as listed above.
The approach I've been using is to run pig scripts from a shell. I have one job that gets data from six different input directories. So I've written a fragment for each input file.
The shell checks for the existence of the input file and assembles a final pig script from the fragments.
It then executes the final pig script. I know it's a bit of a Rube Goldberg approach, but so far so good. :-)
