I'm installing Hadoop by following this link:
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
I have a question about installing and setting up the Hadoop platform in standalone mode.
For the first step, preparing the input files for standalone operation, the site gives the following commands:
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
$ cat output/*
What is this doing? Is it just running an example?
When I issue those commands, I get the error shown in the image below:
What is the problem?
What is this doing? Is it just running an example?
Those commands don't do any serious processing; they just run a predefined example shipped with the Hadoop jar files to make sure you have installed and configured the setup properly.
Assuming you were in the directory "/" while executing the following commands:
1) $ mkdir input : creates a directory called input under the root directory /
2) $ cp etc/hadoop/*.xml input : copies the Hadoop configuration files (*.xml) from /etc/hadoop to /input
3) $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+' :
executes a built-in example class shipped with the Hadoop libraries. This example extracts the parameters starting with dfs from all the Hadoop XML configuration files located under the directory /input and writes the result into the directory /output (created implicitly by Hadoop as part of the run).
4) $ cat output/* : prints the contents of all files under the directory /output to the terminal.
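In plain Unix terms, the grep example does roughly the following (a simplified local sketch for intuition only, not what Hadoop actually executes; the real job also sorts the matches by count in a second MapReduce pass):
$ grep -ohE 'dfs[a-z.]+' /input/*.xml | sort | uniq -c | sort -rn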
What is the problem?
The problem you are facing here is the input path. The relative path is ambiguous and Hadoop could not resolve it. Make sure you are running Hadoop in standalone mode, and then execute the example with absolute paths (for both input and output) as follows:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep /input /output 'dfs[a-z.]+'
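Once the job finishes, you can check the result the same way as before (assuming the absolute /output path above; in standalone mode the output directory is created on the local filesystem, so plain cat works):
$ cat /output/*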
I am absolutely new to Apache Hadoop and I am following a video course.
So I have correctly installed Hadoop 1.2.1 on a Linux Ubuntu virtual machine.
After the installation the instructor runs this command in the shell:
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
to see that Hadoop is working and that it is correctly installed.
But what exactly does this command do?
This command runs the grep job defined inside the Hadoop examples jar file (which contains the map, reduce and driver code). input is the folder in HDFS whose files are searched, output is the folder where the results of the search are written, and dfs[a-z.]+ is the regular expression that you are telling grep to look for in the input.
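To inspect what the job produced, you can list and print the output folder on HDFS (paths assumed from the command above):
bin/hadoop fs -ls output
bin/hadoop fs -cat output/*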
I am learning Hadoop and I have never worked on Unix before, so I am facing a problem here. What I am doing is:
$ hadoop fs -mkdir -p /user/user_name/abcd
Now I am going to put a ready-made file named file.txt into HDFS:
$ hadoop fs -put file.txt /user/user_name/abcd
The file gets stored in HDFS, since it shows up when I run the -ls command.
Now I want to remove this file from HDFS. How should I do this? What command should I use?
If you run the command hadoop fs -usage you'll get a look at what commands the filesystem supports and with hadoop fs -help you'll get a more in-depth description of them.
For removing files the command is simply -rm, with -r added to remove directories recursively. Read the command descriptions and try them out.
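For example, with the path from the question (assuming the file was put exactly there):
$ hadoop fs -rm /user/user_name/abcd/file.txt
# to remove the whole directory and its contents recursively:
$ hadoop fs -rm -r /user/user_name/abcd
# on older 1.x releases the recursive form is hadoop fs -rmr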
I am new to Hadoop and Behemoth, and I followed the tutorial at https://github.com/DigitalPebble/behemoth/wiki/tutorial to generate a Behemoth corpus for a text document, using the following command:
sudo bin/hadoop jar /home/madhumita/behemoth/core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i /home/madhumita/Documents/testFile -o /home/madhumita/behemoth/testGateOpCorpus
I am getting the error:
ERROR util.CorpusGenerator: Input does not exist : /home/madhumita/Documents/testFile
every time I run the command, though I have checked with gedit that the path is correct. I searched online for any similar issues, but I could not find any.
Any ideas as to why it may be happening? If .txt file format is not acceptable, what is the required file format?
Okay, I managed to solve the problem. The input path required was the path to the file on the Hadoop distributed file system, not on the local machine.
So I first copied the local file to /docs/test.txt on HDFS and gave that path as the input parameter. The commands are as follows:
sudo bin/hadoop fs -copyFromLocal /home/madhumita/Documents/testFile/test.txt /docs/test.txt
sudo bin/hadoop jar /home/madhumita/behemoth/core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i /docs/test.txt -o /docs/behemoth/test
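To double-check that the corpus was actually written, you can list the output path (assumed from the command above):
sudo bin/hadoop fs -ls /docs/behemoth/test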
This solves the issue. Thanks to everyone who tried to solve the problem.
To generate the Behemoth corpus directly from the local filesystem, reference the input using the file protocol (file:///):
hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i "file:///home/madhumita/Documents/testFile/test.txt" -o "/docs/behemoth/test"
I'm trying to run the following example in hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
However I don't understand the commands that are being used, specifically how to create an input file, upload it to the HDFS and then run the word count example.
I'm trying the following command:
bin/hadoop fs -put inputFolder/inputFile inputHDFS/
however it says
put: File inputFolder/inputFile does not exist
I have this folder inside the Hadoop folder (the directory that contains "bin"), so why is this happening?
thanks :)
Hopefully this isn't overkill:
Assuming you've installed Hadoop (in either local, distributed or pseudo-distributed mode), you have to make sure Hadoop's bin and sbin directories, plus a couple of environment variables, are set up in your path. In Linux/Mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc. - depending on your setup and preferences):
export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin
Then run exec $SHELL or reload your terminal. To verify the Hadoop binaries are on your path, type hadoop version and check that no errors are raised. Assuming you followed the instructions on how to set up a single node cluster and started the Hadoop services with the start-all.sh command, you should be good to go.
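A couple of quick sanity checks at this point (the exact daemon list depends on your Hadoop version and mode, so treat the names below as an example):
$ hadoop version   # should print the version banner if the PATH entries above took effect
$ jps              # in pseudo-distributed mode this should list daemons such as NameNode and DataNode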
In local (standalone) mode, there is no separate HDFS and jobs read your local filesystem directly, so you can reference any path like you would with any other Linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.
With an actual HDFS running, I use the copyFromLocal command (I find it to just work):
$ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your Unix username, you can drop the /user/hadoopuser/ part - it is implicitly assumed to do everything inside your HDFS user dir. Also, if you're using a client machine to run commands on a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:
# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/
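If the destination directory doesn't exist yet, you can create it first (the -p flag is available in Hadoop 2.x and later):
$ hadoop fs -mkdir -p /user/hadoopuser/data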
For the input file, you can use any file(s) that contain text. I used some random files from the Gutenberg site.
Last, to run the wordcount example (comes as jar in hadoop distro), just run the command:
$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc
This will read everything in the data/ folder (it can contain one or many files) and write everything to the output/wc folder - all on HDFS. If you run this in local (standalone) mode, there is no need to copy anything - just point it at the proper input and output dirs. Make sure the wc dir doesn't exist beforehand or your job will crash (it cannot write over an existing dir). See this for a better wordcount breakdown.
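When the job finishes you can inspect the result directly on HDFS (paths assumed from the command above; the part-* file names follow the usual MapReduce output convention):
$ hadoop fs -ls /user/hadoopuser/output/wc
$ hadoop fs -cat /user/hadoopuser/output/wc/part*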
Again, all this assumes you've made it through the setup stages successfully (no small feat).
Hope this wasn't too confusing - good luck!
I'm trying to execute the code from this post: http://blog.xebia.com/2011/09/27/wiki-pagerank-with-hadoop/
javac shows no errors, but I don't know how to get the output. These are the execution steps I'm following:
$ javac -Xlint -classpath /home/james/Downloads/hadoop-0.20.203.0/hadoop-core-0.20.203.0.jar -d doc WikiPageRanking.java
$ jar -cvf WikiPageRanking.jar -C doc/ .
$ bin/hadoop dfs -mkdir /user/james/wiki/in
$ bin/hadoop dfs -copyFromLocal wiki-micro.txt /user/james/wiki/in
$ bin/hadoop jar WikiPageRanking.jar org.myorg.WikiPageRanking /user/james/wiki/in /user/james/wiki/result
Is this right? I seriously doubt the last step - the input and output paths. In the code they use wiki/in, which is why I gave the same path here, and I have copied my sample dataset to that path. The MapReduce process starts, but I get no output.
What do the following commands give you?
hadoop fs -ls /user/james/wiki/result
hadoop fs -text /user/james/wiki/result/part*
Running a job does not automatically dump its results to the console - they are typically stored in HDFS (in your case under the path /user/james/wiki/result). You can view the contents of this directory using the first command, and assuming there are some part* files, the second command will dump their contents to the console.
Final point to note - if the output format is SequenceFileOutputFormat and you're using custom key / value objects, you'll need to amend the second command to include your jar:
hadoop fs -libjars WikiPageRanking.jar -text /user/james/wiki/result/part*
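If you'd rather have the result as a single local file, -getmerge pulls all the part files down and concatenates them (output path assumed from the question):
hadoop fs -getmerge /user/james/wiki/result wiki-result.txt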