I'd like to see the source code for certain hadoop commands like -put and -ls. I want to be able to add additional information to the log outputs that are associated with running these commands.
How can I find these?
The source can be found in Apache's SVN.
You'll see files such as Ls.java, CopyCommands.java (defines -put, -copyFromLocal), etc., that all define the HDFS commands that you're looking for.
Is there any way to modify the txt file inside HDFS directly via terminal?
Assume, I have "my_text_file.txt", and I would like to modify it inside HDFS using below command.
$ hdfs dfs -XXXX user/my_text_file.txt
I am interested to know "xxxx" if there exists any.
Please note that I don't want to make modification in local and then copy it to HDFS.
You cannot edit files that are already in HDFS; it is not supported. HDFS works on a "write once, read many" model. So if you want to edit a file, make the changes in a local copy and then move it back to HDFS.
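The download, edit, re-upload cycle described above can be sketched as a small shell function. This is only a sketch under assumptions: the function name edit_hdfs_file and the idea of passing a sed expression are hypothetical, the hdfs client must be on your PATH, and your Hadoop version must support the -f (overwrite) option of -put.

```shell
# Hypothetical helper (not part of Hadoop): apply a sed expression to a file
# stored in HDFS by round-tripping it through a local temporary copy.
# Usage: edit_hdfs_file <hdfs_path> <sed_expression>
edit_hdfs_file() {
    hdfs_path=$1
    sed_expr=$2
    tmp_dir=$(mktemp -d) || return 1
    local_copy="$tmp_dir/$(basename "$hdfs_path")"

    # 1. Download, 2. edit locally, 3. overwrite the copy in HDFS.
    hdfs dfs -get "$hdfs_path" "$local_copy" &&
        sed "$sed_expr" "$local_copy" > "$local_copy.new" &&
        hdfs dfs -put -f "$local_copy.new" "$hdfs_path"
    status=$?

    # Clean up the temporary copy whether or not the steps succeeded.
    rm -rf "$tmp_dir"
    return "$status"
}
```

For example, edit_hdfs_file user/my_text_file.txt 's/old/new/' would replace "old" with "new" on each line. The file is still rewritten as a whole; nothing is edited in place inside HDFS.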
Currently, as explained by @BruceWayne, it's not possible. Files in HDFS are distributed across the cluster, so editing them from the terminal with hdfs commands would be very difficult. Currently these are supported as terminal commands.
You could edit them by locating the data directory of each datanode in the cluster, but that would be troublesome.
Alternatively, you can install HUE. With HUE you can edit files in HDFS through a web UI.
You cannot edit files in HDFS, as it works on the principle of write once, read many. But nowadays, we can edit files using the Hue file browser in Cloudera.
I'd like to see the source code for certain hadoop commands like -put and -ls. I want to be able to add additional information to the log outputs that are associated with running these commands. For example, I want to show the message "Hii user, your file is copying from local file system to hdfs" during the execution of the -get or copyFromLocal command.
I want to make the change in the core files, not in API files like CopyCommands.java (http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/shell/CopyCommands.java?view=markup).
This type of message should be printed when the command is executed.
Can anyone tell me which file I should change?
How can I find these?
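If the goal is only to print a message whenever the command runs, one option that avoids rebuilding Hadoop is a shell wrapper around hadoop fs. This is a minimal sketch, not a change to Hadoop itself: the function name hput is hypothetical, the message text is taken from the question, and hadoop is assumed to be on your PATH.

```shell
# Hypothetical wrapper (not part of Hadoop): print a message, then delegate
# to the real "hadoop fs -put" with the arguments unchanged.
# Usage: hput <local_src>... <hdfs_dst>
hput() {
    echo "Hii user, your file is copying from local file system to hdfs"
    hadoop fs -put "$@"
}
```

The wrapper returns whatever exit status hadoop fs -put returns, so scripts that check for failure keep working.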
Basically, what is the major difference between using moveFromLocal and copyToLocal instead of the put and get commands in the Hadoop CLI?
moveFromLocal: Similar to put command, except that the source localsrc is deleted after it’s copied.
copyToLocal: Similar to get command, except that the destination is restricted to a local file reference.
Source.
I'm trying to run the following example in hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
However I don't understand the commands that are being used, specifically how to create an input file, upload it to the HDFS and then run the word count example.
I'm trying the following command:
bin/hadoop fs -put inputFolder/inputFile inputHDFS/
however it says
put: File inputFolder/inputFile does not exist
I have this folder inside the hadoop folder which is the folder before "bin" so why is this happening?
thanks :)
Hopefully this isn't overkill:
Assuming you've installed hadoop (in either local, distributed or pseudo-distributed), you have to make sure hadoop's bin and other misc parameters are in your path. In linux/mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc. - depending on your setup and preferences):
export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin
Then run exec $SHELL or reload your terminal. To verify hadoop is on your path, type hadoop version and see that no errors are raised. Assuming you followed the instructions on how to set up a single node cluster and started hadoop services with the start-all.sh command, you should be good to go:
In local (standalone) mode, Hadoop uses your local file system in place of HDFS. So just reference any path like you would with any other linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.
With an actual HDFS running, I use the copyFromLocal command (I find it to just work):
$ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your unix username, you can drop the /user/hadoopuser/ part - relative paths are implicitly resolved inside your HDFS user dir. Also, if you're using a client machine to run commands on a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:
# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/
For the input file, you can use any file(s) that contain text. I used some random files from the gutenberg site.
Last, to run the wordcount example (comes as jar in hadoop distro), just run the command:
$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc
This will read everything in the data/ folder (can have one or many files) and write everything to the output/wc folder - all on HDFS. If you run this in local (standalone) mode, no need to copy anything - just point it to the proper input and output dirs. Make sure the wc dir doesn't exist or your job will crash (cannot write over existing dir). See this for a better wordcount breakdown.
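Since the job fails when the output dir already exists, it can help to clear it before each run. Here is a rough sketch of that: the function name run_wordcount is hypothetical, and it assumes a Hadoop version where hadoop fs -test -d and hadoop fs -rm -r are available (older releases used -rmr instead).

```shell
# Hypothetical helper (not part of Hadoop): remove a stale output directory,
# then run the wordcount example.
# Usage: run_wordcount <examples_jar> <input_dir> <output_dir>
run_wordcount() {
    jar=$1
    input_dir=$2
    output_dir=$3

    # "hadoop fs -test -d" exits 0 only if the path exists and is a directory.
    if hadoop fs -test -d "$output_dir"; then
        echo "Removing existing output dir: $output_dir"
        hadoop fs -rm -r "$output_dir" || return 1
    fi

    hadoop jar "$jar" wordcount "$input_dir" "$output_dir"
}
```

Be careful with this on real data: it deletes the previous results without asking.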
Again, all this assumes you've made it through the setup stages successfully (no small feat).
Hope this wasn't too confusing - good luck!
We're using Amazon's Elastic Map Reduce to perform some large file processing jobs. As a part of our workflow, we occasionally need to remove files from S3 that may already exist. We do so using the hadoop fs interface, like this:
hadoop fs -rmr s3://mybucket/a/b/myfile.log
This removes the file from S3 appropriately, but in its place leaves an empty file named "s3://mybucket/a/b_$folder$". As described in this question, Hadoop's Pig is unable to handle these files, so later steps in the workflow can choke on this file.
(Note, it doesn't seem to matter whether we use -rmr or -rm or whether we use s3:// or s3n:// as the scheme: all of these exhibit the described behavior.)
How do I use the hadoop fs interface to remove files from S3 and be sure not to leave these troublesome files behind?
I wasn't able to figure out if it's possible to use the hadoop fs interface in this way. However, the s3cmd interface does the right thing (but only for one key at a time):
s3cmd del s3://mybucket/a/b/myfile.log
This requires configuring a ~/.s3cfg file with your AWS credentials first. s3cmd --configure will interactively help you create this file.
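Because s3cmd del is used here one key at a time, removing several files means calling it in a loop. A minimal sketch, assuming s3cmd is installed and already configured as described above; the function name s3_delete_all is hypothetical.

```shell
# Hypothetical helper (not part of s3cmd): delete several S3 keys by calling
# "s3cmd del" once per key. Returns non-zero if any deletion failed.
# Usage: s3_delete_all s3://bucket/key1 s3://bucket/key2 ...
s3_delete_all() {
    failed=0
    for key in "$@"; do
        s3cmd del "$key" || failed=1
    done
    return "$failed"
}
```

For example: s3_delete_all s3://mybucket/a/b/myfile.log s3://mybucket/a/b/other.log (the second key is made up for illustration).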
This is how the S3 support is implemented in Hadoop; see this: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html.
So use s3cmd.