Use Hadoop Streaming to run binaries via a script

I am new to Hadoop and I am trying to figure out a way to do the following:
I have multiple input image files.
I have binary executables that process these files.
These binary executables write text files as output.
I have a folder that contains all of these executables.
I have a script which runs all of these executables in a certain order, passing the image location as an argument.
My question is this: can I use Hadoop Streaming to process these images via these binaries and emit the results from the text files?
Here is what I am currently trying.
I have my Hadoop cluster running, and I have uploaded my binaries and my images onto HDFS.
I have set up a script which, when Hadoop runs it, should change directory into the folder with the images and execute another script which runs the binaries.
The script then emits the results via stdout.
However, I can't figure out how to have my map script change into the image folder on HDFS and then execute the other script.
Can someone give me a hint?
sudo ./hadoop/bin/hadoop jar ../hduser/hadoop/contrib/streaming/hadoop-streaming-1.1.0.jar \
-numReduceTasks 0 \
-file /home/hduser/RunHadoopJob.sh \
-input /user/hduser/7posLarge \
-output /user/hduser/output5 \
-mapper RunHadoopJob.sh \
-verbose
And my RunHadoopJob.sh:
#!/bin/bash
cd /user/hduser/7posLarge/;
/user/hduser/RunSFM/RunSFM.sh;
My HDFS looks like this:
hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.
Found 4 items
drwxr-xr-x - hduser supergroup 0 2012-11-28 17:32 /user/hduser/7posLarge
drwxr-xr-x - hduser supergroup 0 2012-11-28 17:39 /user/hduser/RunSFM
drwxr-xr-x - root supergroup 0 2012-11-30 14:32 /user/hduser/output5
I know this is not the standard use of MapReduce. I am simply looking for a way to easily spin up multiple jobs of the same program with different input, without writing much overhead. It seems this is possible, looking at the Hadoop Streaming documentation:
"How do I use Hadoop Streaming to run an arbitrary set of
(semi-)independent tasks?
Often you do not need the full power of Map Reduce, but only need to
run multiple instances of the same program - either on different parts
of the data, or on the same data, but with different parameters. You
can use Hadoop Streaming to do this."
If this is not possible, is there another tool on AmazonAWS for example that can do this for me?
UPDATE:
Looks like there are similar solutions but I have trouble following them. They are here and here.

There are several issues when dealing with Hadoop Streaming and binary files:
Hadoop doesn't know by itself how to process image files.
Mappers take their input from stdin line by line, so you need to create an intermediate shell script that writes the image data from stdin to a temp file, which is then passed to the executable.
Just passing the directory location to the executables is not really efficient, since in this case you'll lose data locality. I don't want to repeat the already well-answered questions on this topic, so here are the links:
Using Amazon MapReduce/Hadoop for Image Processing
Hadoop: how to access (many) photo images to be processed by map/reduce?
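To make the stdin-to-temp-file idea concrete, here is a minimal sketch of such an intermediate mapper script. `process_image` is a stand-in for the real binary (an assumption — on a real cluster you would call your shipped executable, e.g. RunSFM.sh, instead), and the demo at the bottom just pipes some fake bytes through it:

```shell
#!/bin/bash
# Sketch of an intermediate mapper script for Hadoop Streaming.
# process_image stands in for the real binary; here it just reports byte count.
process_image() { wc -c < "$1"; }

run_mapper() {
  tmp=$(mktemp)          # buffer the record arriving on stdin
  cat > "$tmp"
  process_image "$tmp"   # hand the temp file to the executable
  rm -f "$tmp"
}

# demo: feed some fake "image" bytes through the mapper
out=$(printf 'fake-image-bytes' | run_mapper)
echo "bytes processed: $out"
```

In a real streaming job, the stdout of `process_image` becomes the mapper's output records.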
Another approach would be to transform the image files into splittable SequenceFiles, i.e., each record would be one image in the SequenceFile. Using this as the input format, the mappers would call the executables on each record they get. Note that you have to provide the executables to the TaskTracker nodes beforehand, with the correct file permissions, so that they can be executed from Java code.
Some more information on this topic:
Hadoop: Example process to generating a SequenceFile with image binaries to be processed in map/reduce

I was able to use a "hack" to prototype a workaround.
I am still trying this out, and I don't think it will work on an elastic cluster, since you would have to recompile your binaries for your cluster's OS architecture. But if you have a private cluster, this may be a solution.
Using Hadoop Streaming you can package your binaries in .jar files and ship them to the nodes, where they will be unpacked before your script runs.
I have my images in pics.jar, and my program, which processes all images found in the directory from which you start it, in BinaryProgramFolder.jar. Inside that folder I have a script which launches the program.
My streaming job ships the images and the binary program + scripts to the node and starts them. Again, this is a workaround hack, not a "real" solution to the problem.
So,
sudo ./hadoop/bin/hadoop jar ../hduser/hadoop/contrib/streaming/hadoop-streaming-1.1.0.jar \
-archives 'hdfs://master:54310/user/hduser/pics.jar#pics','hdfs://master:54310/user/hduser/BinaryProgramFolder.jar#BinaryProgramFolder' \
-numReduceTasks 0 \
-file /home/hduser/RunHadoopJob.sh \
-input /user/hduser/input.txt \
-output /user/hduser/output \
-mapper RunHadoopJob.sh \
-verbose
Filler input file input.txt:
Filler text for streaming job.
RunHadoopJob.sh:
#!/bin/bash
cp -Hr BinaryProgramFolder ./pics;  # copy your unpacked program folder (following the symlink) into the pics directory
cd ./pics;
./BinaryProgramFolder/BinaryProgramLauncScript.sh <params>;  # launch your program via the symlink to the program folder; I used a script, kept in the same folder as the binary, to launch it
NOTE: you must first put your program and images into a jar archive and then copy them to HDFS. Use hadoop fs -copyFromLocal <local file> <HDFS location>
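For reference, the packaging and staging step might look like this (a sketch, not verified on your setup; paths mirror the job above, and jar is the archiver from the JDK):

```shell
# package the images and the program folder as jar archives
jar cf pics.jar -C ./pics .
jar cf BinaryProgramFolder.jar -C ./BinaryProgramFolder .

# stage both archives on HDFS so the -archives option can fetch them
hadoop fs -copyFromLocal pics.jar /user/hduser/pics.jar
hadoop fs -copyFromLocal BinaryProgramFolder.jar /user/hduser/BinaryProgramFolder.jar
```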

Related

Hadoop distcp wrong path still copies - where did data go?

I was running hadoop distcp to copy a whole directory (500GB+) from /path/to/source to /path/to/destination. However, instead of running
$ hadoop distcp /path/to/source /path/to/destination
I did the following by mistake:
$ hadoop distcp /path/to/source path/to/destination
The operation completed like a normal distcp copy, with mapreduce taking some time to run, and of course I did not get my data in /path/to/destination. It was also not in /path/to/source/path/to/destination, or other relative paths I could think of.
Where did the data go? Thanks.
It doesn't go anywhere. If the destination path is not correct, the data stays in the source location.

How does cp command work in Hadoop?

I am reading "Hadoop: The Definitive Guide" and, to explain my question, let me quote from the book:
distcp is implemented as a MapReduce job where the work of copying is done by the
maps that run in parallel across the cluster. There are no reducers. Each file is copied
by a single map, and distcp tries to give each map approximately the same amount of
data by bucketing files into roughly equal allocations. By default, up to 20 maps are used, but this can be changed by specifying the -m argument to distcp.
and in a footnote
Even for a single file copy, the distcp variant is preferred for large files since hadoop fs -cp copies the file
via the client running the command.
I understand why distcp works better for a collection of files, as different mappers run in parallel, each on a single file. But when only a single file is to be copied, why does distcp perform better when the file size is large (according to the footnote)? I am only getting started, so it would be helpful if someone explained how the cp command in Hadoop works and what is meant by "hadoop fs -cp copies the file via the client running the command." I understand Hadoop's write process, explained in the book, where a pipeline of datanodes is formed and each datanode is responsible for writing data to the following datanode in the pipeline.
When a file is copied "via the client", the byte content is streamed from HDFS to the local node running the command, then uploaded back to the destination HDFS location. The blocks are not simply copied directly between datanodes, as you might expect.
Compare that to distcp, which creates smaller, parallel cp tasks spread out over multiple hosts.
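To make the difference concrete, here are the two variants side by side (paths are placeholders):

```shell
# hadoop fs -cp: every byte is streamed through the client machine
hadoop fs -cp /path/to/source/bigfile /path/to/destination/bigfile

# distcp: a map-only MapReduce job does the copying on the cluster itself;
# -m caps the number of parallel map tasks
hadoop distcp -m 20 /path/to/source /path/to/destination
```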

Combine Map output for directory to one file

I have a requirement where I have to merge the output of the mappers of a directory into a single file. Let's say I have a directory A which contains 3 files.
../A/1.txt
../A/2.txt
../A/3.txt
I need to run a mapper to process these files, which should generate one output file. I know a reducer will do that, but I don't want to use reducer logic.
OR
Can I have only one mapper process all the files under a directory?
If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.
For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
Can I have only one mapper to process all the files under a directory?
Have you looked into CombineFileInputFormat? Felix Ren-Chyan Chern writes about setting it up in some detail.

Running commands on Hadoop using Java Runtime.exec()

There is a program called "cufflinks" which is run as follows:
cufflinks -o <output-dir> <input-file>
This program takes 1 file as input and generates 4 files as output in the "output-dir".
I am trying to run the same program on a Hadoop cluster using Runtime.exec() in a mapper class. I am setting
output-dir=/some/path/on/HDFS
I was expecting that the 4 files would be generated on HDFS as output. However, that is not the case, and the output directory on HDFS does not contain any of these 4 files.
I then tried setting
output-dir=/tmp/output/
and it worked.
Can anyone please suggest why it does not work on HDFS? What do I need to do to make it work on HDFS?
Thanks.
The problem is that cufflinks writes its output with regular file operations; to create files in HDFS it would have to use the HDFS API internally. A path like /some/path/on/HDFS passed to an external process is interpreted as a local path on whichever node the mapper runs on, not as an HDFS path.
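One common workaround, sketched below under the assumption that you stay with Runtime.exec(): let cufflinks write to the node's local disk, then push the results into HDFS from the mapper afterwards (all paths and the input file name are placeholders):

```shell
# run the binary against a local scratch directory
LOCAL_OUT=$(mktemp -d)
cufflinks -o "$LOCAL_OUT" input.sam

# upload the result files into HDFS, then clean up the local copy
hadoop fs -mkdir -p /some/path/on/HDFS
hadoop fs -put "$LOCAL_OUT"/* /some/path/on/HDFS/
rm -rf "$LOCAL_OUT"
```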

How can I run the wordCount example in Hadoop?

I'm trying to run the following example in hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
However I don't understand the commands that are being used, specifically how to create an input file, upload it to the HDFS and then run the word count example.
I'm trying the following command:
bin/hadoop fs -put inputFolder/inputFile inputHDFS/
however it says
put: File inputFolder/inputFile does not exist
I have this folder inside the hadoop directory (the parent of "bin"), so why is this happening?
thanks :)
Hopefully this isn't overkill:
Assuming you've installed hadoop (in either local, distributed or pseudo-distributed), you have to make sure hadoop's bin and other misc parameters are in your path. In linux/mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc. - depending on your setup and preferences):
export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin
Then run exec $SHELL or reload your terminal. To verify hadoop is running, type hadoop version and see that no errors are raised. Assuming you followed the instructions on how to set up a single node cluster and started hadoop services with the start-all.sh command, you should be good to go:
In pseudo-dist mode, your file system pretends to be HDFS. So just reference any path like you would with any other linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.
With an actual HDFS running, I use the copyFromLocal command (I find it just works):
$ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your unix username, you can drop the /user/hadoopuser/ part; everything is implicitly done inside your HDFS user dir. Also, if you're using a client machine to run commands on a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:
# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/
For the input, you can use any files that contain text. I used some random files from the Project Gutenberg site.
Last, to run the wordcount example (comes as jar in hadoop distro), just run the command:
$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc
This will read everything in the data/ folder (which can have one or many files) and write everything to the output/wc folder, all on HDFS. If you run this in pseudo-dist mode, there is no need to copy anything; just point it at the proper input and output dirs. Make sure the wc dir doesn't exist, or your job will crash (it cannot write over an existing dir). See this for a better wordcount breakdown.
Again, all this assumes you've made it through the setup stages successfully (no small feat).
Hope this wasn't too confusing - good luck!
