Doubts regarding Pagerank execution - hadoop

I'm trying to execute the code from this post: http://blog.xebia.com/2011/09/27/wiki-pagerank-with-hadoop/
javac shows no errors, but I don't know how to get the output. These are the execution steps I'm following:
$ javac -Xlint -classpath /home/james/Downloads/hadoop-0.20.203.0/hadoop-core-0.20.203.0.jar -d doc WikiPageRanking.java
$ jar -cvf WikiPageRanking.jar -C doc/ .
$ bin/hadoop dfs -mkdir /user/james/wiki/in
$ bin/hadoop dfs -copyFromLocal wiki-micro.txt /user/james/wiki/in
$ bin/hadoop jar WikiPageRanking.jar org.myorg.WikiPageRanking /user/james/wiki/in /user/james/wiki/result
Is this right? I seriously doubt the last step - the input and output paths! In the code they use wiki/in, so I gave the same path here, and I have copied my sample dataset to that path. The map reduce process starts, but I get no output!

What do the following commands give you:
hadoop fs -ls /user/james/wiki/result
hadoop fs -text /user/james/wiki/result/part*
Running a job does not automatically dump its results to the console - they are most typically stored in HDFS (in your case under /user/james/wiki/result). You can view the contents of this directory using the first command, and assuming there are some part* files, the second command will dump their contents to the console.
Final point to note - if the output format is SequenceFileOutputFormat and you're using custom key / value objects, you'll need to amend the second command to include your jar:
hadoop fs -libjars WikiPageRanking.jar -text /user/james/wiki/result/part*
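For a quick end-to-end check, here is a minimal sketch of inspecting the job output, assuming the paths from the question (the exact part-* file names depend on your Hadoop version and the number of reducers):
hadoop fs -ls /user/james/wiki/result
hadoop fs -text /user/james/wiki/result/part-* | head -n 20
hadoop fs -copyToLocal /user/james/wiki/result result-local
The first command should show one or more part-* files (and, on recent versions, a _SUCCESS marker); the second prints the first few output records to the console, and the third optionally pulls the whole result directory to your local disk.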

Related

hdfs copy multiple files to same target directory

I learned that if you want to copy multiple files from one Hadoop folder to another, it is better to create one big 'hdfs dfs -cp' statement with many sources, instead of running several separate hdfs dfs -cp statements.
By 'better' I mean that it improves the overall time it takes to copy the files: one command is quicker than several separate -cp commands run one after another.
When I do this and my target directory is the same for all files that I want to copy I get a warning.
I'm executing the following command:
hdfs dfs -cp -f /path1/file1 /pathx/target /path2/file2 /pathx/target /path3/file3 /pathx/target
After executing it I get the following warning returned:
cp: `/pathx/target' to `/pathx/target/target': is a subdirectory of itself
Although I get this weird warning the copy itself succeeds like it should.
Is this a bug or am I missing something?
Try the following syntax, which lists all the sources first and the target directory only once at the end:
hadoop fs -cp /path1/file1 /path2/file2 /path3/file3 /pathx/target
Or you could do it like this:
hadoop fs -cp /path1/{file1,file2,file3} /pathx/target
If you want to copy all the files, then:
hadoop fs -cp /path1/* /pathx/target
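For illustration, a minimal sketch of the multi-source form with the hypothetical paths from the question - many sources, one target directory listed exactly once at the end:
hadoop fs -mkdir -p /pathx/target
hadoop fs -cp -f /path1/file1 /path2/file2 /path3/file3 /pathx/target
hadoop fs -ls /pathx/target
The -mkdir -p (available on Hadoop 2.x) just makes sure the target directory exists; the single -cp call copies all three sources without the 'subdirectory of itself' warning, and the -ls confirms that file1, file2 and file3 arrived.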

Installing and setting up hadoop 2.7.2 in stand-alone mode

I'm installing hadoop now using the following link :
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
I have a question about installing and setting up the Hadoop platform in stand-alone mode.
For the first step, creating the input files for standalone operation, the site gives the following commands:
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
$ cat output/*
What is this doing? Is it just running an example?
When I issue those commands, I get the error shown in the image below:
What is the problem?
What is this doing? Is it just running an example?
Those commands don't do any serious processing; they just run a predefined example that ships with the Hadoop jar file, to check that you have installed and configured the setup properly.
Assuming you were in the directory "/" while executing the following commands:
1) $ mkdir input : creates a directory called input under the root directory /
2) $ cp etc/hadoop/*.xml input : copies the Hadoop conf files (*.xml) from /etc/hadoop to /input
3) $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+' :
runs an inbuilt example class shipped with the Hadoop libraries. The example extracts every parameter starting with dfs from the Hadoop XML conf files under the directory /input and writes the result into the directory /output (created implicitly by Hadoop as part of the execution).
4) $ cat output/* : prints the contents of all files under the directory /output to the terminal.
What is the problem?
The problem you are facing here is the input path: the relative path is ambiguous and was not resolved by Hadoop. Make sure you are running Hadoop in standalone mode, and then execute the example with absolute paths (for both input and output) as follows:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep /input /output 'dfs[a-z.]+'
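For example, a minimal end-to-end sketch run from the Hadoop installation directory, using paths under /tmp purely as an illustration (in standalone mode both input and output live on the local filesystem, and the output directory must not already exist):
$ mkdir -p /tmp/grep-input
$ cp etc/hadoop/*.xml /tmp/grep-input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep /tmp/grep-input /tmp/grep-output 'dfs[a-z.]+'
$ cat /tmp/grep-output/*
If everything is set up correctly, the last command typically prints a single line such as "1 dfsadmin", matching what the linked documentation shows.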

How do I remove a file from HDFS

I am learning Hadoop and I have never worked on Unix before, so I am facing a problem here. What I am doing is:
$ hadoop fs -mkdir -p /user/user_name/abcd
Now I am going to put a ready-made file named file.txt into HDFS:
$ hadoop fs -put file.txt /user/user_name/abcd
The file gets stored in HDFS, since it shows up when running the -ls command.
Now I want to remove this file from HDFS. How should I do this? What command should I use?
If you run the command hadoop fs -usage you'll get a list of the commands the filesystem supports, and with hadoop fs -help you'll get a more in-depth description of them.
For removing files the command is simply -rm, with -rm -r for recursively removing directories. Read the command descriptions and try them out.
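As a concrete illustration with the paths from the question (exact flag behaviour can vary slightly between Hadoop versions):
$ hadoop fs -rm /user/user_name/abcd/file.txt
$ hadoop fs -rm -r /user/user_name/abcd
$ hadoop fs -rm -r -skipTrash /user/user_name/abcd
The first removes just the file (it goes to the trash if trash is enabled), the second removes the directory and everything under it, and the third deletes immediately, bypassing the trash.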

WordCount command can't find file location

I run WordCount in Eclipse and my text file exists in HDFS.
Eclipse shows me this error:
Input path does not exist: file:/home/hduser/workspace/sample1user/hduser/test1
Input path does not exist: file:/home/hduser/workspace/sample1user/hduser/test1
Your error shows that WordCount is searching for the file in the local filesystem and not in HDFS. Try copying the input file to the local file system (one way to do that is sketched after the commands below).
Post the results of the following commands in your question:
hdfs dfs -ls /home/hduser/workspace/sample1user/hduser/test1
hdfs dfs -ls /home/hduser/workspace/sample1user/hduser
ls -l /home/hduser/workspace/sample1user/hduser/test1
ls -l /home/hduser/workspace/sample1user/hduser
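If those listings show that the file only exists in HDFS, one option matching the suggestion above is to pull it down to the local path the job is looking for (the HDFS source path here is an assumption, since the question does not state it):
mkdir -p /home/hduser/workspace/sample1user/hduser
hdfs dfs -get /user/hduser/test1 /home/hduser/workspace/sample1user/hduser/test1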
I too ran into a similar issue (I am a beginner too). I gave the full HDFS paths via the program arguments for the WordCount program, like below, and it worked (I was running in pseudo-distributed mode):
hdfs://krish#localhost:9000/user/Perumal/Input hdfs://krish#localhost:9000/user/Perumal/Output
hdfs://krish#localhost:9000 is my HDFS location, and my Hadoop daemons were running during the testing.
Note: This may not be the best practice but it helped me get started!!
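For completeness, a minimal sketch of invoking WordCount from the command line with explicit HDFS URIs (the jar name and class name are assumptions - substitute your own - and the output file name depends on the number of reducers):
hadoop jar wordcount.jar WordCount hdfs://localhost:9000/user/Perumal/Input hdfs://localhost:9000/user/Perumal/Output
hdfs dfs -cat hdfs://localhost:9000/user/Perumal/Output/part-r-00000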

How can I concatenate two files in hadoop into one using Hadoop FS shell?

I am working with Hadoop 0.20.2 and would like to concatenate two files into one using the -cat shell command if possible (source: http://hadoop.apache.org/common/docs/r0.19.2/hdfs_shell.html)
Here is the command I'm submitting (names have been changed):
/path/path/path/hadoop-0.20.2> bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv > /user/username/folder/outputdirectory/
It returns bash: /user/username/folder/outputdirectory/: No such file or directory
I also tried creating that directory and then running it again -- I still got the 'no such file or directory' error.
I have also tried using the -cp command to copy both into a new folder and -getmerge to combine them but have no luck with the getmerge either.
The reason for doing this in hadoop is that the files are massive and would take a long time to download, merge, and re-upload outside of hadoop.
The error relates to you trying to redirect the standard output of the command back to HDFS. There are ways you can do this, for example using the hadoop fs -put command with the source argument being a hyphen:
bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv | hadoop fs -put - /user/username/folder/output.csv
-getmerge also outputs to the local file system, not HDFS
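If you do want to go the -getmerge route anyway, a sketch of the two-step workaround (the merged file names here are assumptions):
hadoop fs -getmerge /user/username/folder merged-local.csv
hadoop fs -put merged-local.csv /user/username/folder/merged.csv
-getmerge concatenates everything in the HDFS folder into a single local file, and -put pushes that merged file back into HDFS; note that this still routes all the data through your local machine.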
Unfortunately there is no efficient way to merge multiple files into one (unless you want to look into Hadoop 'appending', but in your version of Hadoop that is disabled by default and potentially buggy) without copying the files to one machine and then back into HDFS, whether you do that in:
1) a custom map-reduce job with a single reducer and a custom mapper that retains the file ordering (remember each line will be sorted by the keys, so your key will need to be some combination of the input file name and line number, and the value will be the line itself), or
2) the FsShell commands, depending on your network topology - i.e. does your client console have a good-speed connection to the datanodes? This is certainly the least effort on your part, and will probably complete quicker than an MR job to do the same (everything has to go to one machine anyway, so why not your local console?)
To concatenate all files in the folder to an output file:
hadoop fs -cat myfolder/* | hadoop fs -put - myfolder/output.txt
If you have multiple folders on HDFS and you want to concatenate the files in each of those folders, you can use a shell script to do this. (Note: this is not very efficient and can be slow.)
Syntax :
for i in `hadoop fs -ls <folder> | cut -d' ' -f19`; do hadoop fs -cat $i/* | hadoop fs -put - $i/<outputfilename>; done
eg:
for i in `hadoop fs -ls my-job-folder | cut -d' ' -f19`; do hadoop fs -cat $i/* | hadoop fs -put - $i/output.csv; done
Explanation:
So you basically loop over all the folders under my-job-folder and cat each folder's contents into an output file on HDFS.
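To sanity-check the result afterwards, a small follow-up sketch (using the output file name assumed above):
for i in `hadoop fs -ls my-job-folder | cut -d' ' -f19`; do hadoop fs -ls $i/output.csv; done
Each folder under my-job-folder should now contain its own merged output.csv.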
