I am absolutely new to Apache Hadoop and I am following a video course.
So I have installed Hadoop 1.2.1 correctly on an Ubuntu Linux virtual machine.
After the installation the instructor runs this command in the shell:
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
to see that Hadoop is working and that it is correctly installed.
But what exactly does this command do?
This command runs the grep example packaged in the Hadoop examples jar file (which contains the map, reduce and driver code). input is the folder whose files are searched, output is the folder where the results of the search are written, and dfs[a-z.]+ is the regular expression you are asking it to grep for in the input.
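For example, on a fresh standalone install you could feed it the stock configuration files and then look at the result. This is only a sketch and assumes the usual Hadoop 1.x layout with a conf/ directory of *.xml files:
mkdir input
cp conf/*.xml input          # sample input: the stock Hadoop config files
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
cat output/*                 # print the matches found by the job
Each line of the result is a count followed by the string that matched the regular expression.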
Hadoop has two startup files (hadoop and hadoop.cmd). Because I use Windows 10, hadoop.cmd is called automatically. I know that the Hadoop version problem can be solved by modifying the exec command in the hadoop file, but the hadoop file is a shell script and the system will automatically call the .cmd file instead. How can I translate
"exec" $ JAVA "-classpath" $ (cygpath- pw "$ CLASSPATH") "$ JAVA_HEAP_MAX $ HADOOP_OPTS $ CLASS" $ # "
into a .cmd command? I don't know the .cmd syntax.
Rather than edit your shell scripts (they work for plenty of other people), you should extract Hadoop to a location whose path contains no spaces or special characters.
For example, c:\hadoop rather than your Downloads folder.
I'm installing Hadoop now by following this link:
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
I have a question about installing and setting up the Hadoop platform in stand-alone mode.
For the first step, creating the input files for the standalone operation, the site gives the following commands:
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
$ cat output/*
What is this doing? Is it just running an example?
When I issue those commands, I get the error shown in the image below:
What is the problem?
What is this doing? Is it just running an example?
Those commands don't process anything of substance; they just execute a predefined example that ships with the Hadoop jar file, to make sure you have installed and configured the setup properly.
Assuming that you were in the directory "/" while executing the following commands:
1) $ mkdir input : creates a directory called input under the root directory /
2) $ cp etc/hadoop/*.xml input : copies the Hadoop conf files (*.xml) from /etc/hadoop to /input
3) $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+' :
Executes a built-in example class shipped with the Hadoop libraries. This example extracts the parameters starting with dfs from all the Hadoop XML conf files located under the directory /input and writes the result into the directory /output (created implicitly by Hadoop as part of the execution).
4) $ cat output/* : prints all the file contents under the directory /output to the terminal (see the sketch below).
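As a quick sanity check after steps 3 and 4, you can list the output directory; a successful run typically leaves an empty _SUCCESS marker plus a part file holding the matches (the exact file names can vary):
$ ls output/
$ cat output/part-r-00000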
What is the problem?
The problem you are facing here is the input path. The path is ambiguous and Hadoop could not resolve it. Make sure you are running Hadoop in standalone mode, and then execute the example with absolute paths (for both input and output), as follows:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep /input /output 'dfs[a-z.]+'
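One rough way to confirm you really are in standalone (local) mode is to check that core-site.xml does not set fs.defaultFS; this is just a sketch and assumes the standard 2.7.2 directory layout:
$ grep -A1 'fs.defaultFS' etc/hadoop/core-site.xml || echo "fs.defaultFS not set - local mode"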
I have an 8.8 GB file on the Hadoop cluster that I'm trying to extract certain lines from for testing purposes.
Seeing that Apache Hadoop 2.6.0 has no split command, how can I do this without having to download the file?
If the file were on a Linux server I would have used:
$ csplit filename %2015-07-17%
The previous command works as desired; is something close to that possible on Hadoop?
You could use a combination of Unix and HDFS commands.
hadoop fs -cat filename.dat | head -250 > /redirect/filename
Or, if the last KB of the file is enough, you could use this:
hadoop fs -tail filename.dat > /redirect/filename
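If you need the lines from a particular date onwards (closer to what the csplit call above does) rather than just the head or tail, the same piping approach works with sed or grep; this is just a sketch reusing the filename and pattern from the question:
hadoop fs -cat filename.dat | sed -n '/2015-07-17/,$p' > /redirect/filename   # from the first match to the end
hadoop fs -cat filename.dat | grep '2015-07-17' > /redirect/filename          # only the matching lines
Bear in mind that -cat streams the whole 8.8 GB file through the client; if you only need a sample, pipe the result into head as well.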
I am new to Hadoop and Behemoth, and I followed the tutorial at https://github.com/DigitalPebble/behemoth/wiki/tutorial to generate a Behemoth corpus for a text document, using the following command:
sudo bin/hadoop jar /home/madhumita/behemoth/core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i /home/madhumita/Documents/testFile -o /home/madhumita/behemoth/testGateOpCorpus
I am getting the error:
ERROR util.CorpusGenerator: Input does not exist : /home/madhumita/Documents/testFile
every time I run the command, even though I have checked with gedit that the path is correct. I searched online for similar issues, but I could not find any.
Any ideas as to why this may be happening? If the .txt file format is not acceptable, what is the required file format?
Okay, I managed to solve the problem. The input path required is the path to the file on the Hadoop distributed file system, not on the local machine.
So first I copied the local file to /docs/test.txt on HDFS and gave this path as the input parameter. The commands are as follows:
sudo bin/hadoop fs -copyFromLocal /home/madhumita/Documents/testFile/test.txt /docs/test.txt
sudo bin/hadoop jar /home/madhumita/behemoth/core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i /docs/test.txt -o /docs/behemoth/test
This solves the issue. Thanks to everyone who tried to solve the problem.
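If you want to verify where things ended up, listing the HDFS paths used above is a quick check (a sketch reusing the same paths, run from the same Hadoop directory):
sudo bin/hadoop fs -ls /docs
sudo bin/hadoop fs -ls /docs/behemoth/test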
To generate a Behemoth corpus directly from the local filesystem, refer to it using the file protocol (file:///):
hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i "file:///home/madhumita/Documents/testFile/test.txt" -o "/docs/behemoth/test"
I'm trying to run the following example in hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
However I don't understand the commands that are being used, specifically how to create an input file, upload it to HDFS, and then run the word count example.
I'm trying the following command:
bin/hadoop fs -put inputFolder/inputFile inputHDFS/
however it says
put: File inputFolder/inputFile does not exist
I have this folder inside the hadoop directory (the one that contains bin), so why is this happening?
thanks :)
Hopefully this isn't overkill:
Assuming you've installed hadoop (in local, distributed or pseudo-distributed mode), you have to make sure hadoop's bin and other misc parameters are in your path. On Linux/Mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc., depending on your setup and preferences):
export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin
Then run exec $SHELL or reload your terminal. To verify hadoop is on your path, type hadoop version and check that no errors are raised. Assuming you followed the instructions on how to set up a single node cluster and started hadoop services with the start-all.sh command, you should be good to go:
In pseudo-dist mode, your file system pretends to be HDFS. So just reference any path like you would with any other linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.
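For instance, with that local set-up the fs commands resolve against your local disk, so you can poke around without copying anything (a sketch; the folder and file names are the ones from the question):
$ hadoop fs -ls ~/inputFolder
$ hadoop fs -cat ~/inputFolder/inputFile | head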
With an actual HDFS running, I use the copyFromLocal command (I find it to just work):
$ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your Unix username, you can drop the /user/hadoopuser/ part - everything is implicitly done inside your HDFS user dir. Also, if you're using a client machine to run commands on a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:
# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/
For the input you can use any file or files that contain text. I used some random files from the Gutenberg site.
Last, to run the wordcount example (comes as jar in hadoop distro), just run the command:
$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc
This will read everything in the data/ folder (which can contain one or many files) and write everything to the output/wc folder - all on HDFS. If you run this in pseudo-dist mode, there's no need to copy anything - just point it at the proper input and output dirs. Make sure the wc dir doesn't exist beforehand or your job will crash (it cannot write over an existing dir). See this for a better wordcount breakdown.
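Once the job finishes you can inspect the result directly from HDFS; a MapReduce job normally leaves one part-r-NNNNN file per reducer plus a _SUCCESS marker, so something like this (reusing the output path from the command above) should show the counts:
$ hadoop fs -ls /user/hadoopuser/output/wc
$ hadoop fs -cat /user/hadoopuser/output/wc/part-r-* | head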
Again, all this assumes you've made it through the setup stages successfully (no small feat).
Hope this wasn't too confusing - good luck!