How can I run the wordCount example in Hadoop?

I'm trying to run the following example in Hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
However, I don't understand the commands that are being used, specifically how to create an input file, upload it to HDFS, and then run the word count example.
I'm trying the following command:
bin/hadoop fs -put inputFolder/inputFile inputHDFS/
however it says
put: File inputFolder/inputFile does not exist
I have this folder inside the Hadoop folder (the one that contains "bin"), so why is this happening?
thanks :)

Hopefully this isn't overkill:
Assuming you've installed Hadoop (in local, distributed, or pseudo-distributed mode), you have to make sure Hadoop's bin directory and a few other environment variables are set in your path. On Linux/Mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc. - depending on your setup and preferences):
export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin
Then run exec $SHELL or reload your terminal. To verify Hadoop is set up, type hadoop version and check that no errors are raised. Assuming you followed the instructions on how to set up a single-node cluster and started the Hadoop services with the start-all.sh command, you should be good to go.
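As a quick sanity check (the daemons listed by jps depend on your Hadoop version and mode, so treat them as illustrative):
$ hadoop version   # should print the installed version with no errors
$ jps              # lists the running Java daemons, e.g. NameNode, DataNode, SecondaryNameNode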
In pseudo-dist mode, your file system pretends to be HDFS. So just reference any path like you would with any other linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.
With an actual HDFS running, I use the copyFromLocal command (I find that it just works):
$ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your unix username, you can drop the /user/hadoopuser/ part - it is implicitly assumed to do everything inside your HDFS user dir. Also, if you're using a client machine to run commands on a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:
# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/
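For reference, that client-side config only needs to tell the client where the cluster's namenode is. A minimal sketch of ~/conf/hadoop-cluster.xml (the hostname and port are placeholders, not values from any real cluster):
$ cat > ~/conf/hadoop-cluster.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- fs.default.name on older Hadoop 1.x releases -->
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
EOF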
For the input file, you can use any file(s) that contain text. I used some random files from the Project Gutenberg site.
Last, to run the wordcount example (comes as jar in hadoop distro), just run the command:
$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc
This will read everything in the data/ folder (it can contain one or many files) and write everything to the output/wc folder - all on HDFS. If you run this in pseudo-dist, there's no need to copy anything - just point it at the proper input and output dirs. Make sure the wc dir doesn't exist beforehand, or your job will crash (it cannot write over an existing dir). See this for a better wordcount breakdown.
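Once the job finishes you can inspect the result straight from HDFS. A quick sketch, using the same paths as the command above (part-r-* is the standard reducer output naming):
$ hadoop fs -ls /user/hadoopuser/output/wc
$ hadoop fs -cat /user/hadoopuser/output/wc/part-r-*   # word<TAB>count lines
$ hadoop fs -rm -r /user/hadoopuser/output/wc          # clear it before re-running (older 1.x releases use -rmr)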
Again, all this assumes you've made it through the setup stages successfully (no small feat).
Hope this wasn't too confusing - good luck!

Related

How to copy file from local directory in another drive to HDFS in Apache Hadoop?

I'm new to Apache Hadoop and I'm trying to copy a simple text file from my local directory to HDFS on Hadoop, which is up and running. However, Hadoop is installed in D: while my file is in C:.
If I use the -put or copyFromLocal command in cmd with the file in the aforementioned drive, it doesn't allow me to do that. However, if I place the text file in the same D: drive, the file is correctly uploaded to Hadoop and can be seen on Hadoop localhost. The code that works with the file and Hadoop in the same drive is as follows:
hadoop fs -put /test.txt /user/testDirectory
If my file is in a separate drive, I get the error '/test.txt': No such file or directory. I've tried variations of /C/pathOfFile/test.txt but to no avail, so in short, I need to know how to access a local file in another directory, specifically with respect to the -put command. Any help for this probably amateurish question will be appreciated.
If your current cmd session is in D:\, then your command would look at the root of that drive.
You could try prefixing the path:
file:/C:/test.txt
Otherwise, cd to the path containing your file first, then just -put test.txt or -put .\test.txt
Note: HDFS doesn't know about the difference between C and D unless you actually set fs.defaultFS to be something like file:/D:/hdfs
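Putting both suggestions together (pathOfFile and /user/testDirectory are taken from the question, so adjust them to your setup):
hadoop fs -put file:/C:/test.txt /user/testDirectory
or, from the cmd session, switch to the drive that actually holds the file and use a relative path:
C:
cd \pathOfFile
hadoop fs -put test.txt /user/testDirectory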
From your question I assume that you have installed Hadoop in a Virtual Machine (VM) on a Windows installation. Please provide more details on that if this assumption is incorrect. The issue is that your VM considers drive D: as the local directory, which is where -put and -copyFromLocal can see files. C: is not currently visible to these commands.
You need to mount drive C: to your VM in order to make its files available as local files for Hadoop. There are guides out there depending on your VM. I advise care while doing this, in order not to mishandle any Windows installation files.

Combine Map output for directory to one file

I have a requirement where I have to merge the output of the mappers over a directory into a single file. Let's say I have a directory A which contains 3 files.
../A/1.txt
../A/2.txt
../A/3.txt
I need to run a mapper to process these files, which should generate one output file. I know a reducer will do that, but I don't want to use reducer logic.
OR
Can I have only one mapper to process all the files under a directory?
If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.
For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
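Even without fuse, -getmerge already writes its concatenated result to an ordinary local path, so a rough sketch for the directory in the question (the local and HDFS target paths here are illustrative) would be:
hadoop fs -getmerge /A /tmp/A-merged.txt
hadoop fs -put /tmp/A-merged.txt /A-merged/merged.txt
The fuse mount simply saves you that extra copy back into HDFS when you want the merged file to end up there.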
Can I have only one mapper to process all the files under a directory?
Have you looked into CombinedFileInputFormat? Felix Ren-Chyan Chern writes about setting it up in some detail.

What exactly does this Hadoop command performed in the shell?

I am absolutely new to Apache Hadoop and I am following a video course.
So I have correctly installed Hadoop 1.2.1 on a linux Ubuntu virtual machine.
After the installation the instructor performs this command in the shell:
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
to see that Hadoop is working and that it is correctly installed.
But what exactly does this command do?
This command runs the grep example job defined inside the Hadoop examples jar file (which contains the map, reduce, and driver code). It takes the files in the input folder on HDFS as its input, searches them for every match of the regular expression dfs[a-z.]+, and writes the matched strings together with their counts to the output folder on HDFS.
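To see what the job actually produced, list and print its output folder (these are plain HDFS shell commands; in standalone mode output is just a local directory and cat output/* works):
bin/hadoop fs -ls output
bin/hadoop fs -cat output/*
Each output line is a count followed by the string that matched the regular expression.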

determine which default namenode and namenode_port "hadoop fs -ls" is using

I'm running the below command:
hadoop fs -ls
I know that this command really corresponds to something like the below:
hadoop fs -ls hdfs://<namenode>:<namenode_port>
How do I determine these values of namenode and namenode_port?
I've already tried examining environment variables and looking at the documentation, but I couldn't find anything to do exactly this.
You need to refer to the Hadoop configuration file core-site.xml for the <namenode> and <namenode_port> values. Search for the configuration property fs.defaultFS (current) or fs.default.name (deprecated).
The core-site.xml file is typically located in /etc/hadoop/conf or $HADOOP_HOME/etc/hadoop/conf.
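For example, a quick way to pull the value straight out of the file (the path below is the typical packaged-install default, so adjust it to your installation):
grep -A1 -E "fs.defaultFS|fs.default.name" /etc/hadoop/conf/core-site.xml
The <value> element printed right after the matching <name> element is the hdfs://<namenode>:<namenode_port> URI you are looking for.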
Edit
There is a dynamic way to do this. The hadoop command below will print the HDFS URI.
hdfs getconf -confKey "fs.defaultFS"
However, every time a hadoop command is executed, it loads the client configuration XML files. If you want to test this out, create a separate set of client configurations in another directory by copying the contents of /etc/hadoop/conf to a new directory (say /home/<USER>/hadoop-conf; it should contain core-site.xml, hdfs-site.xml, and mapred-site.xml), override the environment variable with export HADOOP_CONF_DIR=/home/<USER>/hadoop-conf, update core-site.xml, and then test using a hadoop command. Remember this is only for testing purposes; unset the environment variable (unset HADOOP_CONF_DIR) once you are done with your testing.
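A sketch of that test sequence, using the same placeholder paths as above:
mkdir -p /home/<USER>/hadoop-conf
cp /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/mapred-site.xml /home/<USER>/hadoop-conf/
export HADOOP_CONF_DIR=/home/<USER>/hadoop-conf
# edit /home/<USER>/hadoop-conf/core-site.xml as needed, then:
hdfs getconf -confKey "fs.defaultFS"   # should now reflect the copied/edited core-site.xml
hadoop fs -ls                          # other client commands pick up the same configs
unset HADOOP_CONF_DIR                  # back to the system-wide configuration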

How to change hadoop conf directory location?

In the exception stack trace I can see that my hadoop configuration is loaded from /etc/hadoop/conf.empty/hdfs-site.xml
How do I change it to /etc/hadoop/conf/hdfs-site.xml ?
In principle you can load any bash script before loading the hadoop daemon, e.g. in /etc/init.d/hadoop-hdfs-datanode for datanodes.
You can add the following towards the top of this init script:
export HADOOP_CONF_DIR=/etc/hadoop/conf
Also, you can check whether this is overridden by Hadoop in /usr/lib/hadoop/etc/hadoop/hadoop-env.sh. A similar file should also exist at /etc/hadoop/hadoop-env.sh.
The best way is to check which files get sourced when Hadoop runs the init scripts in /etc/init.d/hadoop-hdfs-*.
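One quick way to do that check is to grep for the variable across the init scripts and env files mentioned above:
grep -R "HADOOP_CONF_DIR" /etc/init.d/hadoop-hdfs-* /etc/hadoop/hadoop-env.sh /usr/lib/hadoop/etc/hadoop/hadoop-env.sh
Whichever of these exports the variable last before the daemon starts is the value the daemon ends up using.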
You can always pass a configuration file on the command line (example with ls):
hadoop fs -conf configFile.xml -ls ./
