There is a program called "cufflinks" which is run as follows:
cufflinks -o <output-dir> <input-file>
This program takes 1 file as input and generates 4 files as output in the "output-dir".
I am trying to run the same program on a Hadoop cluster using Runtime.exec() in a mapper class. I am setting
output-dir=/some/path/on/HDFS
I was expecting the 4 output files to be generated on HDFS. However, that is not the case: the output directory on HDFS does not contain any of these 4 files.
I then tried setting
output-dir=/tmp/output/
and it worked.
Can anyone please suggest why it does not work on HDFS? What do I need to do to make it work on HDFS?
Thanks.
The problem is that cufflinks writes its output with regular (local) file operations; to create files directly on HDFS it would have to use the HDFS API internally. As it stands, the output path you pass it is treated as a path on the local file system of whichever node the mapper runs on.
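If you cannot change cufflinks itself, a common workaround is to let it write to a node-local temp directory and then push the results into HDFS from the mapper with the FileSystem API. A rough sketch (the class name, temp path and method shape are placeholders, not the asker's actual code):

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CufflinksUpload {
    // called from the mapper after it has decided where the HDFS output should go
    public static void runAndUpload(Configuration conf, String inputFile, String hdfsOutputDir)
            throws Exception {
        File localOut = new File("/tmp/cufflinks-" + System.currentTimeMillis());
        localOut.mkdirs();

        // cufflinks only knows about the local file system, so point it at localOut
        Process p = Runtime.getRuntime()
                .exec(new String[] { "cufflinks", "-o", localOut.getAbsolutePath(), inputFile });
        if (p.waitFor() != 0) {
            throw new RuntimeException("cufflinks failed");
        }

        // now push the generated files into the HDFS output directory
        FileSystem fs = FileSystem.get(conf);
        Path hdfsOut = new Path(hdfsOutputDir);
        fs.mkdirs(hdfsOut);
        for (File f : localOut.listFiles()) {
            fs.copyFromLocalFile(new Path(f.getAbsolutePath()), new Path(hdfsOut, f.getName()));
        }
    }
}

That upload step is the part the cufflinks binary cannot do on its own, which is why /tmp/output/ works while /some/path/on/HDFS just ends up being interpreted as a local path on the node the task ran on.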
Related
I am learning Hadoop and have run into a problem. I ran a MapReduce job and the output was stored in multiple files rather than a single file. I want to append all of them into a single file in HDFS. I know about the appendToFile and getmerge commands, but they only work from the local file system to HDFS or from HDFS to the local file system, not from HDFS to HDFS. Is there any way to append the output files in HDFS into a single file in HDFS without touching the local file system?
The only way to do this within the job itself is to force your MapReduce code to use a single reducer, for example by setting the number of reduce tasks to 1 (or by routing all results to a single key).
However, this defeats the purpose of having a distributed filesystem and multiple processors. All Hadoop jobs should be able to read a directory of files, not be restricted to processing a single file.
If you need a single file to download from HDFS, then you should use getmerge.
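For the single-reducer approach, it is essentially one line in the driver. A minimal sketch; the class names (MyDriver, MyMapper, MyReducer) and the paths are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// inside your driver's main()
Configuration conf = new Configuration();
Job job = new Job(conf, "single output file");   // Job.getInstance(conf, ...) on newer releases
job.setJarByClass(MyDriver.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setNumReduceTasks(1);                        // one reducer -> exactly one part-r-00000 file
FileInputFormat.addInputPath(job, new Path("/user/hduser/input"));
FileOutputFormat.setOutputPath(job, new Path("/user/hduser/output"));
System.exit(job.waitForCompletion(true) ? 0 : 1);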
There is no easy way to do this directly in HDFS, but the trick below works. It is not a scalable solution, since it streams all the data through the client, but it should be fine if the output is not huge.
hadoop fs -cat source_folder_path/* | hadoop fs -put - target_filename
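If you would rather do the same merge from Java instead of the shell, older Hadoop releases (1.x/2.x; the method was removed in 3.0) also provide FileUtil.copyMerge, which concatenates every file under a source directory into one destination file. Like the cat-and-put trick, the data still flows through the machine running the code. A sketch, with placeholder paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// merge every file under the source folder into a single HDFS file;
// "false" keeps the source files, "" inserts nothing between them
FileUtil.copyMerge(fs, new Path("/user/hduser/source_folder_path"),
                   fs, new Path("/user/hduser/target_filename"),
                   false, conf, "");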
I have a requirement where I have to merge the output of the mappers over a directory into a single file. Let's say I have a directory A which contains 3 files.
../A/1.txt
../A/2.txt
../A/3.txt
I need to run a mapper to process these files, and it should generate one output file. I know a reducer will do that, but I don't want to use reducer logic.
OR
Can I have only one mapper process all the files under a directory?
If you set up FUSE to mount your HDFS to a local directory, then your output can go to the mounted filesystem.
For example, we have our HDFS mounted at /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use FUSE to mount HDFS to a local directory, but this was a nice side effect for us.
Can I have only one mapper process all the files under a directory?
Have you looked into CombineFileInputFormat? Felix Ren-Chyan Chern writes about setting it up in some detail.
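As a sketch of what the driver wiring looks like: newer Hadoop releases ship a ready-made CombineTextInputFormat subclass (on older releases you subclass CombineFileInputFormat yourself and provide a RecordReader). The split size and the /A path below are only examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Job job = new Job(new Configuration(), "one mapper, many small files");
job.setInputFormatClass(CombineTextInputFormat.class);
// a large max split size lets many small files be packed into one split,
// so a single mapper can end up processing 1.txt, 2.txt and 3.txt together
CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
FileInputFormat.addInputPath(job, new Path("/A"));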
Basically, what is the major difference between moveFromLocal and copyToLocal versus the put and get commands in the Hadoop CLI?
moveFromLocal: Similar to put command, except that the source localsrc is deleted after it’s copied.
copyToLocal: Similar to get command, except that the destination is restricted to a local file reference.
Source.
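For completeness, these operations also have programmatic counterparts on org.apache.hadoop.fs.FileSystem, in case you ever need them outside the CLI; the paths below are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
// like put/moveFromLocal: copy into HDFS (moveFromLocalFile also deletes the local source)
fs.copyFromLocalFile(new Path("/tmp/report.txt"), new Path("/user/hduser/report.txt"));
fs.moveFromLocalFile(new Path("/tmp/report2.txt"), new Path("/user/hduser/report2.txt"));
// like get/copyToLocal: copy out of HDFS to the local file system
fs.copyToLocalFile(new Path("/user/hduser/report.txt"), new Path("/tmp/report-copy.txt"));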
I am new to Hadoop and I am trying to figure out a way to do the following:
I have multiple input image files.
I have binary executables that process these files.
These binary executables write text files as output.
I have a folder that contains all of these executables.
I have a script which runs all of these executables in a certain order, passing image locations as arguments.
My question is this: can I use Hadoop Streaming to process these images via these binaries and spit out the results from the text files?
Here is what I am currently trying.
I have my Hadoop cluster running. I uploaded my binaries and my images onto HDFS.
I have set up a script which, when Hadoop runs it, should change directory into the folder with the images and execute another script which runs the binaries.
The script then spits out the results via stdout.
However, I can't figure out how to have my map script change into the image folder on HDFS and then execute the other script.
Can someone give me a hint?
sudo ./hadoop/bin/hadoop jar ../hduser/hadoop/contrib/streaming/hadoop-streaming-1.1.0.jar \
-numReduceTasks 0 \
-file /home/hduser/RunHadoopJob.sh \
-input /user/hduser/7posLarge \
-output /user/hduser/output5 \
-mapper RunHadoopJob.sh \
-verbose
And my RunHadoopJob.sh:
#!/bin/bash
cd /user/hduser/7posLarge/;
/user/hduser/RunSFM/RunSFM.sh;
My HDFS looks like this:
hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.
Found 4 items
drwxr-xr-x - hduser supergroup 0 2012-11-28 17:32 /user/hduser/7posLarge
drwxr-xr-x - hduser supergroup 0 2012-11-28 17:39 /user/hduser/RunSFM
drwxr-xr-x - root supergroup 0 2012-11-30 14:32 /user/hduser/output5
I know this is not the standard use of MapReduce. I am simply looking for an easy way, without writing much overhead, to spin up multiple instances of the same program with different inputs across a cluster. It seems like this is possible, looking at the Hadoop Streaming documentation:
"How do I use Hadoop Streaming to run an arbitrary set of
(semi-)independent tasks?
Often you do not need the full power of Map Reduce, but only need to
run multiple instances of the same program - either on different parts
of the data, or on the same data, but with different parameters. You
can use Hadoop Streaming to do this. "
If this is not possible, is there another tool, on Amazon AWS for example, that can do this for me?
UPDATE:
Looks like there are similar solutions but I have trouble following them. They are here and here.
There are several issues when dealing with Hadoop-streaming and binary files:
Hadoop doesn't know by itself how to process image files.
Mappers take their input from stdin line by line, so you need to create an intermediate shell script that writes the image data from stdin to a temp file, which is then passed to the executable.
Just passing the directory location to the executables is not really efficient, since in that case you'll lose data locality. I don't want to repeat the already well-answered questions on this topic, so here are the links:
Using Amazon MapReduce/Hadoop for Image Processing
Hadoop: how to access (many) photo images to be processed by map/reduce?
Another approach would be to transform the image files into splittable SequenceFiles, i.e. each record in the SequenceFile would be one image. Using this as the input format, the mappers would call the executables on each record they get. Note that you have to ship the executables to the TaskTracker nodes beforehand, with the correct file permissions, so that they can be invoked from Java code.
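A rough sketch of such a mapper, assuming the SequenceFile stores file name -> image bytes as Text/BytesWritable pairs (read with SequenceFileInputFormat) and that a hypothetical process_image binary has already been shipped to the nodes and marked executable:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ImageExecMapper extends Mapper<Text, BytesWritable, Text, Text> {
    @Override
    protected void map(Text fileName, BytesWritable imageBytes, Context context)
            throws IOException, InterruptedException {
        // dump the record to a node-local temp file so the external binary can read it
        File tmp = File.createTempFile("img-", ".bin");
        FileOutputStream out = new FileOutputStream(tmp);
        out.write(imageBytes.getBytes(), 0, imageBytes.getLength());
        out.close();

        // run the (pre-shipped, executable) binary on that file
        Process p = new ProcessBuilder("./process_image", tmp.getAbsolutePath()).start();
        int exitCode = p.waitFor();

        context.write(fileName, new Text("exit=" + exitCode));
        tmp.delete();
    }
}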
Some more information on this topic:
Hadoop: Example process to generating a SequenceFile with image binaries to be processed in map/reduce
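Producing such a SequenceFile is a short standalone program. A sketch using the classic createWriter signature from the 1.x API (the local and HDFS paths are hypothetical):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("/user/hduser/images.seq"),
                Text.class, BytesWritable.class);
        // one record per image: key = file name, value = raw bytes
        for (File img : new File("/local/images").listFiles()) {
            byte[] bytes = Files.readAllBytes(img.toPath());
            writer.append(new Text(img.getName()), new BytesWritable(bytes));
        }
        writer.close();
    }
}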
I was able to use a "hack" to prototype a workaround.
I am still trying this out, and I don't think it will work on an elastic cluster, since you would have to recompile your binaries for your cluster's OS architecture. But if you have a private cluster, this may be a solution.
Using Hadoop streaming you can package your binaries in .jar files and ship them to the nodes, where they will get unpacked before your script runs.
I have my images in pics.jar and my program, which processes all images found in the directory it is started from, in BinaryProgramFolder.jar. Inside that folder there is a script which launches the program.
My streaming job ships the images and the binary program + scripts to the node and starts them. Again, this is a workaround hack, not a "real" solution to the problem.
So,
sudo ./hadoop/bin/hadoop jar ../hduser/hadoop/contrib/streaming/hadoop-streaming-1.1.0.jar \
-archives 'hdfs://master:54310/user/hduser/pics.jar#pics','hdfs://master:54310/user/hduser/BinaryProgramFolder.jar#BinaryProgramFolder' \
-numReduceTasks 0 \
-file /home/hduser/RunHadoopJob.sh \
-input /user/hduser/input.txt \
-output /user/hduser/output \
-mapper RunHadoopJob.sh \
-verbose
Filler input file text.txt:
Filler text for streaming job.
RunHadoopJob.sh
#!/bin/bash
cp -Hr BinaryProgramFolder ./pics;  # BinaryProgramFolder is a symlink to the unpacked archive; copy its contents into the pics directory
cd ./pics;
./BinaryProgramFolder/BinaryProgramLauncScript.sh <params>;  # launch your program; I used a launch script that sits in the same folder as the binary
NOTE: you must first put your program and images into a jar archive and then copy them to HDFS. Use hadoop fs -copyFromLocal ./<file location> ./<hadoop fs location>
I'm trying to run the following example in hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
However I don't understand the commands that are being used, specifically how to create an input file, upload it to the HDFS and then run the word count example.
I'm trying the following command:
bin/hadoop fs -put inputFolder/inputFile inputHDFS/
however it says
put: File inputFolder/inputFile does not exist
I have this folder inside the hadoop directory (the directory just above "bin"), so why is this happening?
thanks :)
Hopefully this isn't overkill:
Assuming you've installed hadoop (in either local, distributed or pseudo-distributed mode), you have to make sure hadoop's bin directory is on your path and the related environment variables are set. On linux/mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc. - depending on your setup and preferences):
export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin
Then run exec $SHELL or reload your terminal. To verify the hadoop command is available, type hadoop version and check that no errors are raised. Assuming you followed the instructions on how to set up a single node cluster and started the hadoop services with the start-all.sh command, you should be good to go:
In local (standalone) mode there is no separate HDFS - the default file system is just your local file system - so you can reference any path like you would with any other linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.
With an actual HDFS running, I use the copyFromLocal command (I find it to just work):
$ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your unix username, you can drop the /user/hadoopuser/ part - relative paths are implicitly resolved against your HDFS user directory. Also, if you're using a client machine to run commands on a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:
# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/
For the input file, you can use any file/s that contain text. I used some random files from the Project Gutenberg site.
Last, to run the wordcount example (comes as jar in hadoop distro), just run the command:
$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc
This will read everything in data/ folder (can have one or many files) and write everything to output/wc folder - all on HDFS. If you run this in pseudo-dist, no need to copy anything - just point it to proper input and output dirs. Make sure the wc dir doesn't exist or your job will crash (cannot write over existing dir). See this for a better wordcount breakdown.
Again, all this assumes you've made it through the setup stages successfully (no small feat).
Hope this wasn't too confusing - good luck!