Processing a file using MapReduce - Hadoop

I use a simple Pig script that reads the input .txt file and adds a new field to each line.
The output relation is then stored in Avro format.
Is there any benefit to running such a script in MapReduce mode compared to local mode?
Thank you

In local mode you run your job on your local machine. In MapReduce mode you run your job on a cluster: your file will be split into pieces and processed on several machines in parallel.
So, in theory, if your file is big enough (or there are many files like it to process), you'll be able to finish the job in less time in MapReduce mode.
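For completeness, the mode is selected with Pig's `-x` flag; a sketch, assuming the script from the question is saved as `add_field.pig` (the file name is made up here):

```
# Local mode: runs in a single JVM against the local filesystem.
pig -x local add_field.pig

# MapReduce mode (the default): submits the job to the Hadoop cluster,
# reading input from and writing output to HDFS.
pig -x mapreduce add_field.pig
```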

Related

How to run mapreduce program on large number of files simultaneously?

I am working on large data sets and run a MapReduce program on them. I can easily run MapReduce on a single file of around 3 GB. Now I want to run MapReduce on all the files. Is there any shortcut or technique to run MapReduce on all the files directly?
Using OS-Ubuntu
Hadoop-2.7.1
If you have all the files available, specify a directory or glob pattern as the MapReduce input parameter in place of the file name.
Example:
bin/hadoop jar wc.jar WordCount /user/joe/wordcount/*.txt /user/joe/wordcount/output
If you are receiving files continuously and want to process them as they arrive, you have to run the MapReduce job again and again, because MapReduce is a batch framework.

Adding new files to a running hadoop cluster

Consider that you have 10 GB of data and you want to process it with a MapReduce program using Hadoop. Instead of copying all 10 GB to HDFS at the beginning and then running the program, I want to, for example, copy 1 GB, start the job, and gradually add the remaining 9 GB over time. I wonder if this is possible in Hadoop.
Thanks,
Morteza
Unfortunately this is not possible with MapReduce. When you initiate a MapReduce job, part of the setup process is determining the block locations of your input. If the input is only partially there, the setup will only include those blocks and won't dynamically add inputs later.
If you are looking for a stream processor, have a look at Apache Storm https://storm.apache.org/ or Apache Spark https://spark.apache.org/

Writing MapReduce job to concurrently download files?

Not sure if this is a suitable use case for MapReduce: Part of the OOZIE workflow I'm trying to implement is to download a series of files named with sequential numbers (e.g. 1 through 20). I wanted those files to be downloaded simultaneously (5 files at a time), so I created a python script that creates 5 text files as follows:
1.txt: 1,2,3,4
2.txt: 5,6,7,8
3.txt: 9,10,11,12
4.txt: 13,14,15,16
5.txt: 17,18,19,20
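The splitting above could be sketched in Python like this (a hypothetical reconstruction; the asker's actual script is not shown):

```python
# Distribute file numbers 1..20 into 5 comma-separated lists of 4,
# one list per output text file.
def chunk_numbers(total=20, n_lists=5):
    size = total // n_lists
    return [list(range(i * size + 1, (i + 1) * size + 1))
            for i in range(n_lists)]

for idx, chunk in enumerate(chunk_numbers(), start=1):
    # Write e.g. "1,2,3,4" into 1.txt, "5,6,7,8" into 2.txt, ...
    with open(f"{idx}.txt", "w") as f:
        f.write(",".join(map(str, chunk)))
```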
Then for the next step of the workflow, I created a download.sh shell script that consumes a comma-separated list of numbers and downloads the requested files. In the workflow, I set up a streaming action in Oozie, used the directory containing the files generated above as the input (mapred.input.dir), and used download.sh as the mapper command and "cat" as the reducer command. I assumed Hadoop would spawn a different mapper for each of the input files above.
This seems to work sometimes: it downloads the files correctly, but sometimes it just gets stuck trying to execute, and I don't know why. I noticed this happens when I increase the number of simultaneous downloads (e.g. instead of 4 files per txt file, I would do 20, and so forth).
So my question is: Is this a correct way to implement parallel retrieval of files using MapReduce and OOZIE? If not, how is this normally done using OOZIE? I'm trying to get my CSV files into HDFS prior to running the Hive script, and I'm not sure what the best way would be to achieve that.
After looking deeper into this, it seems that creating an Oozie "fork" node would be the best approach. So I created a fork node, under which I created 6 shell actions that execute download.sh and take the list of file numbers as an argument. I ended up modifying the Python script so it outputs the file numbers that need to be downloaded to STDOUT (instead of saving them on HDFS). I had Oozie capture that output and then pass the numbers as arguments to the download.sh forks.
Cloudera Hue interface does not provide a way to create fork nodes (at least not that I was able to find) so I downloaded the workflow.xml file and added the fork nodes myself and then re-imported it as a new workflow.
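For reference, a minimal fork/join fragment of the kind described might look like the following (action names, the transition targets, and the number of branches are illustrative; only two of the download actions are shown):

```xml
<fork name="download-fork">
    <path start="download-1"/>
    <path start="download-2"/>
    <!-- one path per parallel download action -->
</fork>

<action name="download-1">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>download.sh</exec>
        <argument>1,2,3,4</argument>
        <file>download.sh</file>
    </shell>
    <ok to="download-join"/>
    <error to="kill"/>
</action>

<join name="download-join" to="next-action"/>
```

Each `<path>` starts one branch in parallel, and the workflow continues past the `<join>` only after all branches succeed.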

Hadoop Load and Store

When I try to run a Pig script which has two "store" statements writing to the same location, this way:
store Alert_Message_Count into 'out';
store Warning_Message_Count into 'out';
It hangs; it does not proceed after showing 50% done.
Is this wrong? Can't we store both results in the same file (folder)?
HDFS does not have an append mode. So in most cases where you run MapReduce programs, the output file is opened once, data is written, and the file is closed. Given that, you cannot write data to the same file simultaneously.
Try writing to separate files and check whether the MapReduce programs still hang. If they do, there is some other issue.
You can examine the job output and the MapReduce logs to analyze what went wrong.
[Edit:]
You can not write to the same file or append to an existing file. The HDFS Append feature is a work in progress.
To work on this you can do two things:
1) If you have the same schema content in both Alert_Message_Count and Warning_Message_Count, you could use union as suggested by Chris.
2) Do post-processing when the schemas are not the same. That is, write a MapReduce program to merge the two separate outputs into one.
Normally Hadoop MapReduce won't allow you to save job output to a folder that already exists, so I would guess this isn't possible either (seeing as Pig translates the commands into a series of M/R steps) - but I would expect some form of error message rather than a hang.
If you open the cluster's job tracker and look at the logs for the task, do they yield anything of note that could help diagnose this further?
It might also be worth checking with the Pig mailing lists (if you haven't already).
If you want to append one dataset to another, use the union keyword:
grunt> All_Count = UNION Alert_Message_Count, Warning_Message_Count;
grunt> store All_Count into 'out';

How do you deal with empty or missing input files in Apache Pig?

Our workflow uses an AWS elastic map reduce cluster to run series of Pig jobs to manipulate a large amount of data into aggregated reports. Unfortunately, the input data is potentially inconsistent, and can result in either no input files or 0 byte files being given to the pipeline or even being produced by some stages of the pipeline.
During a LOAD statement, Pig fails spectacularly if it doesn't find any input files or if any of the input files are 0 bytes.
Is there any good way to work around this (hopefully within the Pig configuration or script or the Hadoop cluster configuration, without writing a custom loader...)?
(Since we're using AWS elastic map reduce, we're stuck with Pig 0.6.0 and Hadoop 0.20.)
(For posterity, a sub-par solution we've come up with:)
To deal with the 0-byte problem, we've found that we can detect the situation and instead insert a file with a single newline. This causes a message like:
Encountered Warning ACCESSING_NON_EXISTENT_FIELD 13 time(s).
but at least Pig doesn't crash with an exception.
Alternatively, we could produce a line with the appropriate number of '\t' characters for that file which would avoid the warning, but it would insert garbage into the data that we would then have to filter out.
These same ideas could be used to solve the no-input-files condition by creating a dummy file, but that has the same downsides as listed above.
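The tab-padding idea can be sketched as a small helper (hypothetical; the field count must match the schema of the LOAD statement):

```python
def dummy_line(num_fields):
    # One record with num_fields empty columns:
    # (num_fields - 1) tab separators, terminated by a newline.
    # A loader expecting 4 tab-separated fields gets four empty strings,
    # so Pig sees the right arity and emits no
    # ACCESSING_NON_EXISTENT_FIELD warnings -- at the cost of one
    # garbage row that must be filtered out downstream.
    return "\t" * (num_fields - 1) + "\n"
```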
The approach I've been using is to run Pig scripts from a shell script. I have one job that gets data from six different input directories, so I've written a script fragment for each input file.
The shell script checks for the existence of each input file and assembles a final Pig script from the fragments.
It then executes the final Pig script. I know it's a bit of a Rube Goldberg approach, but so far so good. :-)
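A minimal sketch of that shell wrapper (the input and fragment file names are made up; fragments are assumed to sit alongside the inputs):

```shell
#!/bin/sh
# Assemble final.pig from per-input fragments, skipping any fragment
# whose input file is missing or empty.
FINAL=final.pig
: > "$FINAL"
for input in input_a.txt input_b.txt; do
    fragment="load_${input%.txt}.pig"
    # -s: file exists and has a size greater than zero
    if [ -s "$input" ] && [ -f "$fragment" ]; then
        cat "$fragment" >> "$FINAL"
    fi
done
# pig -x mapreduce "$FINAL"   # then run the assembled script
```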
