I am new to Hadoop and Behemoth, and I followed the tutorial at https://github.com/DigitalPebble/behemoth/wiki/tutorial to generate a Behemoth corpus for a text document, using the following command:
sudo bin/hadoop jar /home/madhumita/behemoth/core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i /home/madhumita/Documents/testFile -o /home/madhumita/behemoth/testGateOpCorpus
I am getting the error:
ERROR util.CorpusGenerator: Input does not exist : /home/madhumita/Documents/testFile
every time I run the command, though I have checked with gedit that the path is correct. I searched online for any similar issues, but I could not find any.
Any ideas as to why it may be happening? If .txt file format is not acceptable, what is the required file format?
Okay, I managed to solve the problem. The input path required was the path to the file on the Hadoop Distributed File System (HDFS), not on the local machine.
So first I copied the local file to /docs/test.txt on HDFS and gave this path as the input parameter. The commands are as follows:
sudo bin/hadoop fs -copyFromLocal /home/madhumita/Documents/testFile/test.txt /docs/test.txt
sudo bin/hadoop jar /home/madhumita/behemoth/core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i /docs/test.txt -o /docs/behemoth/test
This solves the issue. Thanks to everyone who tried to solve the problem.
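In hindsight, a quick way to catch this earlier is to confirm the upload with the standard HDFS shell before running the CorpusGenerator (a small sketch, using the same path as above):
# if this cannot find the file, CorpusGenerator will also report "Input does not exist"
sudo bin/hadoop fs -ls /docs/test.txt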
To generate a Behemoth corpus directly from the local filesystem, reference the input using the file:// protocol (file:///):
hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i "file:///home/madhumita/Documents/testFile/test.txt" -o "/docs/behemoth/test"
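Either way, a quick check that the corpus was actually written (a sketch, using the output path from the commands above):
# the Behemoth corpus is written as Hadoop SequenceFile part files under the output directory
hadoop fs -ls /docs/behemoth/test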
I'm installing Hadoop now using the following link:
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
I have a question about installing and setting up the Hadoop platform in standalone mode.
For creating the input files in standalone operation, the site gives the following commands:
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
$ cat output/*
What is this doing? Is it running an example?
When I issue those commands, I get the error shown in the image below:
What is the problem?
What is this doing? Is it running an example?
Those commands don't really process anything of consequence; they just execute a predefined example shipped with the Hadoop examples jar, to make sure you have installed and configured the setup properly.
Assuming that you were in the directory "/" while executing the following commands:
1) $ mkdir input : creates a directory called input under the root directory /
2) $ cp etc/hadoop/*.xml input : copies the Hadoop conf files (*.xml) from /etc/hadoop to /input
3) $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+' :
runs a built-in example class shipped with the Hadoop libraries. This example extracts every parameter that starts with dfs from the Hadoop XML conf files located under the directory /input and writes the result into the directory /output (created implicitly by Hadoop as part of the execution).
4) $ cat output/* : prints the contents of all files under the directory /output to the terminal (the expected output is sketched below).
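Assuming the default, unedited 2.7.2 configuration files, the final cat typically prints a single match, roughly like this (a sketch, not guaranteed output):
$ cat output/*
1       dfsadmin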
What is the problem?
The problem you are facing here is the input path. The path is ambiguous and Hadoop could not resolve it. Make sure you are running Hadoop in standalone mode, and then execute the example with absolute paths (for both input and output) as follows:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep /input /output 'dfs[a-z.]+'
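Since standalone mode runs against the local filesystem, the results can then be checked the same way as in step 4, e.g.:
$ cat /output/*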
I am absolutely new to Apache Hadoop and I am following a video course.
So I have correctly installed Hadoop 1.2.1 on a linux Ubuntu virtual machine.
After the installation the instructor runs this command in the shell:
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
to see that Hadoop is working and that it is correctly installed.
But what exactly does this command do?
This command runs the grep job defined inside the Hadoop examples jar file (which contains the map, reduce and driver code). Here input is the HDFS folder containing the files to search, output is the folder where the results of the search will be written, and dfs[a-z.]+ is the regular expression you are asking grep to look for in the input.
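As a rough end-to-end sketch for a 1.2.1 single-node setup (the conf directory and jar name are assumptions based on a default 1.2.1 install; the fs commands work whether the default filesystem is local or HDFS):
# stage the default conf files as the job input
bin/hadoop fs -put conf input
# run the grep example: search the staged files for the pattern and write matches to output
bin/hadoop jar hadoop-examples-1.2.1.jar grep input output 'dfs[a-z.]+'
# print the matched terms with their counts
bin/hadoop fs -cat 'output/part-*'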
I ran WordCount in Eclipse and my text file exists in HDFS.
Eclipse shows me this error:
Input path does not exist: file:/home/hduser/workspace/sample1user/hduser/test1
Your error shows that the wordcount job is searching for the file in the local filesystem and not in HDFS. Try copying the input file to the local file system.
Post the results of the following commands in your question:
hdfs dfs -ls /home/hduser/workspace/sample1user/hduser/test1
hdfs dfs -ls /home/hduser/workspace/sample1user/hduser
ls -l /home/hduser/workspace/sample1user/hduser/test1
ls -l /home/hduser/workspace/sample1user/hduser
I too ran into a similar issue (I am a beginner too). I gave the full HDFS paths via the run arguments of the WordCount program, like below, and it worked (I was running in pseudo-distributed mode):
hdfs://krish@localhost:9000/user/Perumal/Input hdfs://krish@localhost:9000/user/Perumal/Output
hdfs://krish@localhost:9000 is my HDFS location, and my Hadoop daemons were running during the testing.
Note: This may not be the best practice but it helped me get started!!
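If test1 only exists locally, a rough sketch for staging it on HDFS first (the user names and paths here are just the ones from the posts above, so adjust them to your setup):
# create the input directory on HDFS and upload the local file
hdfs dfs -mkdir -p /user/Perumal/Input
hdfs dfs -put /home/hduser/workspace/sample1user/hduser/test1 /user/Perumal/Input/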
I'm following these instructions to run hadoop:
http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_(Single-Node_Cluster)
however, I couldn't get this command to work:
hadoop-*/bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
All I get is:
Exception in thread "main" java.io.IOException: Error opening job jar: /Users/hadoop/hadoop-1.0.1/hadoop-examples-1.0.1.jargrep
at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
Caused by: java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:127)
at java.util.jar.JarFile.<init>(JarFile.java:135)
at java.util.jar.JarFile.<init>(JarFile.java:72)
at org.apache.hadoop.util.RunJar.main(RunJar.java:88)
I added this to my hadoop-env.sh :
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
but I still get the same error.
Any clues?
When you run the following command:
hadoop-*/bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
grep is the Hadoop program to run, which is part of the examples jar
input is the folder where your source data is, which you should have created on HDFS
output is the folder which will be created to hold the result
'dfs[a-z.]+' is the regular expression passed to the grep program
Because the output ends with "...jargrep", it seems to me that the actual sample application jar is not being picked up, or some info is missing, when the Hadoop command runs. You would need to check that first, and also check whether the regular expression applies to your input data.
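A quick way to check that first point is to see what examples jar actually exists in the install directory (the directory here is the one from the stack trace above):
# lists the examples jar(s) so you can see the exact file name to pass to hadoop jar
ls -l /Users/hadoop/hadoop-1.0.1/*examples*.jar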
I know this is old, but in case anyone else has the same problem and sees this SO question, I want to put up what I did to solve this, as it's very simple.
It looks like there is a typo in the example's instructions. If you look in the Hadoop distribution directory you will notice that the examples file being referred to is actually called hadoop-examples-1.0.4.jar, or whatever version you are using.
So instead of:
hadoop-*/bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
try:
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
I'm trying to run the following example in hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
However I don't understand the commands that are being used, specifically how to create an input file, upload it to the HDFS and then run the word count example.
I'm trying the following command:
bin/hadoop fs -put inputFolder/inputFile inputHDFS/
however it says
put: File inputFolder/inputFile does not exist
I have this folder inside the hadoop folder (the folder that contains "bin"), so why is this happening?
thanks :)
Hopefully this isn't overkill:
Assuming you've installed hadoop (in either local, distributed or pseudo-distributed), you have to make sure hadoop's bin and other misc parameters are in your path. In linux/mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc. - depending on your setup and preferences):
export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin
Then run exec $SHELL or reload your terminal. To verify hadoop is on your path, type hadoop version and check that no errors are raised. Assuming you followed the instructions on how to set up a single node cluster and started the hadoop services with the start-all.sh command, you should be good to go:
In pseudo-dist mode, your file system pretends to be HDFS. So just reference any path like you would with any other linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.
With an actual HDFS running, I use the copyFromLocal command (I find that it just works):
$ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your unix username, you can drop the /user/hadoopuser/ part - it is implicitly assumed to do everything inside your HDFS user dir. Also, if you're using a client machine to run commands on a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:
# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/
For the input file, you can use any file(s) that contain text. I used some random files from the Gutenberg site.
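If you don't have any text files handy, any small file works for a smoke test (a sketch; the file name is hypothetical):
# create a tiny test file, then copy it to HDFS with copyFromLocal as shown above
mkdir -p ~/data
echo "hello hadoop hello hdfs" > ~/data/testfile.txt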
Last, to run the wordcount example (comes as jar in hadoop distro), just run the command:
$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc
This will read everything in the data/ folder (it can have one or many files) and write everything to the output/wc folder - all on HDFS. If you run this in pseudo-dist mode, no need to copy anything - just point it to proper input and output dirs. Make sure the wc dir doesn't exist before running, or your job will crash (it cannot write over an existing dir). See this for a better wordcount breakdown.
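One practical follow-up (a sketch, assuming Hadoop 2.x shell syntax and the same paths as above): remove a stale output dir before re-running, then peek at the result:
# clear a previous output dir so the job does not fail with "output directory already exists"
hadoop fs -rm -r /user/hadoopuser/output/wc
# after the job finishes, print the reducer output (the file may be named part-00000 on older versions)
hadoop fs -cat /user/hadoopuser/output/wc/part-r-00000 | head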
Again, all this assumes you've made it through the setup stages successfully (no small feat).
Hope this wasn't too confusing - good luck!