How to run a MapReduce program on a large number of files simultaneously? - hadoop

I am working on large data sets and running a MapReduce program on them. I can easily run MapReduce on a single file of around 3 GB. Now I want to run MapReduce on all the files. Is there any shortcut or technique to run MapReduce on all the files directly?
OS: Ubuntu
Hadoop: 2.7.1

If all the files are already available, specify a directory or a glob pattern as the MapReduce input path instead of a single file name.
Example:
bin/hadoop jar wc.jar WordCount /user/joe/wordcount/*.txt /user/joe/wordcount/output
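You can also pass the input directory itself; every file directly inside it then becomes part of the input (assuming the same wc.jar as above):
bin/hadoop jar wc.jar WordCount /user/joe/wordcount /user/joe/wordcount/output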
If files keep arriving continuously and you want to process them as they come in, you will have to run the MapReduce job again and again, because MapReduce is a batch-processing framework.

Related

Processing a file using MapReduce

I use a simple Pig script that reads the input .txt file and adds a new field to each line.
The output relation is then stored into Avro.
Is there any benefit to running such a script in mapreduce mode compared to local mode?
Thank you
In local mode you run your job on your local machine. In mapreduce mode you run your job on a cluster (your file will be split into pieces and processed on several machines in parallel).
So, in theory, if your file is big enough (or there are lots of files like this to process), you'll be able to finish the job in less time in mapreduce mode.
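For illustration, the mode is chosen with Pig's -x flag when the script is launched (script.pig is a placeholder name):
pig -x local script.pig       # run on the local machine against the local file system
pig -x mapreduce script.pig   # run as MapReduce jobs on the cluster against HDFS
mapreduce is the default mode if -x is not given.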

Writing a MapReduce job to concurrently download files?

Not sure if this is a suitable use case for MapReduce: Part of the OOZIE workflow I'm trying to implement is to download a series of files named with sequential numbers (e.g. 1 through 20). I wanted those files to be downloaded simultaneously (5 files at a time), so I created a python script that creates 5 text files as follows:
1.txt: 1,2,3,4
2.txt: 5,6,7,8
3.txt: 9,10,11,12
4.txt: 13,14,15,16
5.txt: 17,18,19,20
Then, for the next step of the workflow, I created a download.sh shell script that consumes a comma-separated list of numbers and downloads the requested files. In the workflow, I set up a streaming action in Oozie, used the directory that contains the files generated above as input (mapred.input.dir), and used download.sh as the mapper command and "cat" as the reducer command. I assumed that Hadoop would spawn a different mapper for each of the input files above.
This seems to work sometimes: it downloads the files correctly, but sometimes it just gets stuck trying to execute and I don't know why. I noticed that this happens when I increase the number of simultaneous downloads (e.g. putting 20 numbers per txt file and so forth).
So my question is: Is this a correct way to implement parallel retrieval of files using MapReduce and Oozie? If not, how is this normally done with Oozie? I'm trying to get my CSV files into HDFS prior to running the Hive script and I'm not sure what the best way would be to achieve that.
After looking deeper into this, it seems that creating an Oozie "fork" node is the best approach. So I created a fork node, under which I created 6 shell actions that execute download.sh and take a list of file numbers as an argument. I ended up modifying the Python script so that it writes the file numbers that need to be downloaded to STDOUT (instead of saving them on HDFS), had Oozie capture that output, and then passed it as arguments to the download.sh forks.
The Cloudera Hue interface does not provide a way to create fork nodes (at least not one that I was able to find), so I downloaded the workflow.xml file, added the fork nodes myself, and re-imported it as a new workflow.
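For reference, the fork/join skeleton in workflow.xml looks roughly like this; the node names are placeholders and the bodies of the shell actions are omitted:
<fork name="download-fork">
    <path start="download-1"/>
    <path start="download-2"/>
</fork>
<action name="download-1">
    <!-- shell action that runs download.sh with one list of file numbers -->
    <ok to="download-join"/>
    <error to="fail"/>
</action>
<action name="download-2">
    <!-- shell action that runs download.sh with another list of file numbers -->
    <ok to="download-join"/>
    <error to="fail"/>
</action>
<join name="download-join" to="next-step"/>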

How to control the number of hadoop streaming output files

Here are the details:
The input files are in the HDFS path /user/rd/input, and the HDFS output path is /user/rd/output.
In the input path, there are 20,000 files from part-00000 to part-19999, each about 64 MB.
What I want to do is write a Hadoop streaming job to merge these 20,000 files into 10,000 files.
Is there a way to merge these 20,000 files into 10,000 files using a Hadoop streaming job? Or, in other words, is there a way to control the number of Hadoop streaming output files?
Thanks in advance!
It looks like right now you have a map-only streaming job. The behavior with a map-only job is to have one output file per map task. There isn't much you can do about changing this behavior.
You can exploit the way MapReduce works by adding a reduce phase with 10,000 reducers. Each reducer will then output one file, so you are left with 10,000 files. Note that your data records will be "scattered" across the 10,000 files; each output file won't simply be two of the input files concatenated. To do this, use the -D mapred.reduce.tasks=10000 flag in your command-line args.
This is probably the default behavior, but you can also specify the identity reducer as your reducer. This doesn't do anything other than pass on the record, which is what I think you want here. Use this flag to do this: -reducer org.apache.hadoop.mapred.lib.IdentityReducer
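Putting both flags together, the streaming command would look roughly like the sketch below. The streaming jar path varies by installation (the one shown is typical for Hadoop 2.x), and cat stands in for whatever mapper command you are already using:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=10000 \
    -input /user/rd/input \
    -output /user/rd/output \
    -mapper cat \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer
Note that -D (a generic option) must come before the streaming-specific options.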

What are the _SUCCESS and part-r-00000 files in Hadoop?

Although I use Hadoop frequently on my Ubuntu machine, I have never thought about the _SUCCESS and part-r-00000 files. The output always resides in the part-r-00000 file, but what is the use of the _SUCCESS file? Why does the output file have the name part-r-00000? Is there any significance or nomenclature, or is it just randomly defined?
See http://www.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/
On the successful completion of a job, the MapReduce runtime creates a _SUCCESS file in the output directory. This may be useful for applications that need to see if a result set is complete just by inspecting HDFS. (MAPREDUCE-947)
This would typically be used by job scheduling systems (such as OOZIE), to denote that follow-on processing on the contents of this directory can commence as all the data has been output.
Update (in response to comment)
The output files are by default named part-x-yyyyy where:
x is either 'm' or 'r', depending on whether the file was written by a map task (in a map-only job) or by a reduce task
yyyyy is the mapper or reducer task number (zero based)
So a job which has 32 reducers will have files named part-r-00000 to part-r-00031, one for each reducer task.
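For example, the output directory of a job with two reducers would contain files along these lines (the paths are illustrative):
/user/joe/wordcount/output/_SUCCESS
/user/joe/wordcount/output/part-r-00000
/user/joe/wordcount/output/part-r-00001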

Restrict Hadoop MapReduce to Specific File Extension

I am trying to run a MapReduce job on my cluster that only runs on files with a specific extension. We have a bunch of heterogeneous data that sits on the cluster, and for this particular job I only want to execute on .jpg files. Is there a way this can be done without restricting it in the mapper? It seems like this should be something easy to do when you execute the job. I'm thinking something like hadoop fs JobName /users/myuser/data/*.jpg /users/myuser/output.
Your example should work as written, but you'll want to check that your job driver calls the input format's setInputPaths(Job, String) method, as this will resolve the glob string "/users/myuser/data/*.jpg" into the individual .jpg files in /users/myuser/data.
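A minimal driver sketch under that assumption, using the new-API FileInputFormat (the class name and job name are placeholders; mapper/reducer setup is omitted):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JpgOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "jpg-only");
        job.setJarByClass(JpgOnlyJob.class);
        // The glob is expanded against HDFS, so only *.jpg files under
        // /users/myuser/data become input splits.
        FileInputFormat.setInputPaths(job, "/users/myuser/data/*.jpg");
        FileOutputFormat.setOutputPath(job, new Path("/users/myuser/output"));
        // ... set mapper, reducer, and key/value classes here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}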
