What is the difference between Hadoop fs and regular Unix commands? - hadoop

I am new to Hadoop and HDFS, and I am trying to understand why the Hadoop fs commands are needed rather than just using the Unix command equivalents. Both seem to work. My first thought was that the Hadoop command interfaced directly with the HDFS namenode and propagated changes to all nodes, but that also seems to happen when I use a plain Unix shell command. I pored over the internet and did not find an easy explanation. Help, or a link to an explanation of the difference, is greatly appreciated.

If you're running HDFS over NFS then you can expect most simple commands to work properly (such as ls, cd, mkdir, mv, rm, chmod, chgrp, chown). You only need the hadoop fs or hdfs dfs command if you are using extended ACLs or want to do other Hadoop-specific things, such as:
altering the replication factor: hadoop fs -setrep
removing files under /user/USERNAME/.Trash: hdfs dfs -expunge
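For example (the path and replication factor below are illustrative, not taken from the question):
hadoop fs -setrep -w 3 /user/USERNAME/some/file    # set the replication factor to 3 and wait for it to finish
hdfs dfs -expunge                                  # permanently remove trashed files that are past the retention interval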

Thanks to the commenter TK421 for making me think this through: this is over NFS, and it's also not straight HDFS but a MapR implementation, so it differs. I found some MapR documentation that explains:
You can also set read, write, and execute permissions on a file or directory for users and groups with standard UNIX commands, when that volume has been mounted through NFS, or using standard hadoop fs commands.
https://mapr.com/docs/52/MapROverview/c_volumes.html
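As a hedged illustration of that statement (the /mapr/my.cluster.com mount point and the volume path are assumptions, not details from the question):
chmod 750 /mapr/my.cluster.com/myvolume/data    # standard UNIX command via the NFS mount
hadoop fs -chmod 750 /myvolume/data             # the same change through the Hadoop FS shell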

The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS, and others. The FS shell is invoked by:
bin/hadoop fs <args>
All FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the Local FS the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as
hdfs://namenodehost/parent/child
or simply as
/parent/child
(given that your configuration is set to point to hdfs://namenodehost).
Most of the commands in FS shell behave like corresponding Unix commands.
You may not find some regular shell commands such as -head, but -tail and -cat are available. Subtle differences from the Unix equivalents are documented per command in the FS shell guide.
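For instance, assuming the configured default file system is hdfs://namenodehost, these two commands are equivalent:
hadoop fs -ls hdfs://namenodehost/parent/child
hadoop fs -ls /parent/child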

Related

Uploading file with square brackets in its name to Hadoop via hadoop fs -put

I have a file that has a square bracket in its name. This file needs to be uploaded to Hadoop via hadoop fs -put. I am using MapR 6.
The following variants all fail with put: unexpected URISyntaxException:
hadoop fs -put aaa[bbb.txt /destination
hadoop fs -put aaa\[bbb.txt /destination
hadoop fs -put "aaa[bbb.txt" /destination
hadoop fs -put "aaa\[bbb.txt" /destination
Did you try "aaa%5Bbbb.txt"?
Hadoop commands such as hadoop fs -put generally do a bad job with escaping names.
That is the bad news.
The good news is that with MapR, you can avoid all of that and simply copy the file to a local mount of the MapR file system using standard Linux commands like cp. There is no need to "upload" anything because MapR feels and acts just like an ordinary file system. You can get the required mount using NFS or the POSIX drivers.
The big benefit is that you get the maturity of the Linux command implementations. That is, those commands (and the shell) do quoting correctly, so you can get the result you want relatively trivially. Just use single quotes and be done with it.
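A minimal sketch of that approach (the /mapr/my.cluster.com mount point and destination directory are assumptions):
cp 'aaa[bbb.txt' /mapr/my.cluster.com/destination/    # single quotes stop the shell from touching the bracket; no Hadoop URI parsing is involved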

How can I append multiple files in HDFS to a single file in HDFS without the help of local file system?

I am learning Hadoop and I have run into a problem. I ran a MapReduce job and the output was stored in multiple files rather than a single file. I want to append all of them into a single file in HDFS. I know about the appendToFile and getmerge commands, but they only work from the local file system to HDFS or from HDFS to the local file system, not from HDFS to HDFS. Is there any way to append the output files in HDFS into a single file in HDFS without touching the local file system?
The only way to do this would be to force your mapreduce code to use one reducer, for example, by sorting all the results by a single key.
However, this defeats the purpose of having a distributed filesystem and multiple processors. All Hadoop jobs should be able to read a directory of files rather than being limited to processing a single file.
If you need a single file to download from HDFS, then you should use getmerge
There is no easy way to do this directly in HDFS, but the trick below works. It is not an ideal solution, but it should be fine if the output is not huge.
hadoop fs -cat source_folder_path/* | hadoop fs -put - target_filename
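For comparison, a hedged sketch with hypothetical paths, covering both routes mentioned above:
hadoop fs -getmerge /user/me/job_output /tmp/merged.txt                           # HDFS to a single local file
hadoop fs -cat /user/me/job_output/part-* | hadoop fs -put - /user/me/merged.txt  # HDFS to a single HDFS file, no local copy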

determine which default namenode and namenode_port "hadoop fs -ls" is using

I'm running the below command:
hadoop fs -ls
I know that this command really corresponds to something like the below:
hadoop fs -ls hdfs://<namenode>:<namenode_port>
How do I determine these values of namenode and namenode_port?
I've already tried examining environment variables and looking at the documentation, but I couldn't find anything to do exactly this.
You need to refer to the Hadoop configuration file core-site.xml for the <namenode> and <namenode_port> values. Look for the configuration property fs.defaultFS (current) or fs.default.name (deprecated).
The core-site.xml file is usually located in /etc/hadoop/conf or $HADOOP_HOME/etc/hadoop/conf.
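The relevant entry looks roughly like this (the host and port are placeholders; 8020 is a common default, not a value read from your cluster):
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenodehost:8020</value>
</property>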
Edit
There is a dynamic way to do this. The hadoop command below prints the HDFS URI:
hdfs getconf -confKey "fs.defaultFS"
However, every time a hadoop command is executed, it loads the client configuration XML files. If you want to test this out, create a separate client configuration in another directory by copying the contents of /etc/hadoop/conf to a new directory (say /home/<USER>/hadoop-conf; it should contain core-site.xml, hdfs-site.xml, and mapred-site.xml), override the environment variable with export HADOOP_CONF_DIR=/home/<USER>/hadoop-conf, update core-site.xml, and then test using a hadoop command. Remember this is only for testing purposes; unset the environment variable (unset HADOOP_CONF_DIR) once you are done with your testing.
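A sketch of that test sequence, using the directory name from the answer:
mkdir -p /home/<USER>/hadoop-conf
cp /etc/hadoop/conf/* /home/<USER>/hadoop-conf/
export HADOOP_CONF_DIR=/home/<USER>/hadoop-conf
hdfs getconf -confKey "fs.defaultFS"    # now reads the copied client configuration
unset HADOOP_CONF_DIR                   # back to the default configuration when done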

How can I run the wordCount example in Hadoop?

I'm trying to run the following example in hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
However I don't understand the commands that are being used, specifically how to create an input file, upload it to the HDFS and then run the word count example.
I'm trying the following command:
bin/hadoop fs -put inputFolder/inputFile inputHDFS/
however it says
put: File inputFolder/inputFile does not exist
I have this folder inside the hadoop folder (the directory that contains "bin"), so why is this happening?
thanks :)
Hopefully this isn't overkill:
Assuming you've installed Hadoop (local, distributed, or pseudo-distributed), you have to make sure Hadoop's bin directory and other miscellaneous parameters are on your path. On Linux/Mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc., depending on your setup and preferences):
export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin
Then run exec $SHELL or reload your terminal. To verify hadoop is running, type hadoop version and see that no errors are raised. Assuming you followed the instructions on how to set up a single node cluster and started hadoop services with the start-all.sh command, you should be good to go:
In pseudo-dist mode, your file system pretends to be HDFS. So just reference any path like you would with any other linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.
With an actual HDFS running, I use the copyFromLocal command (I find it to just work):
$ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your Unix username, you can drop the /user/hadoopuser/ part; everything is implicitly done inside your HDFS user directory. Also, if you're using a client machine to run commands against a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:
# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/
For the input, you can use any file or files that contain text. I used some random files from the Project Gutenberg site.
Last, to run the wordcount example (comes as jar in hadoop distro), just run the command:
$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc
This will read everything in the data/ folder (which can contain one or many files) and write everything to the output/wc folder, all on HDFS. If you run this in pseudo-dist mode, there is no need to copy anything; just point it at the proper input and output dirs. Make sure the wc dir doesn't already exist, or your job will crash (it cannot write over an existing dir). See this for a better wordcount breakdown.
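As a quick sanity check afterwards (assuming the default MapReduce output file naming, which the answer does not show):
$ hadoop fs -ls /user/hadoopuser/output/wc
$ hadoop fs -cat /user/hadoopuser/output/wc/part-r-00000 | head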
Again, all this assumes you've made it through the setup stages successfully (no small feat).
Hope this wasn't too confusing - good luck!

How do I prevent `hadoop fs rmr <uri>` from creating $folder$ files?

We're using Amazon's Elastic Map Reduce to perform some large file processing jobs. As a part of our workflow, we occasionally need to remove files from S3 that may already exist. We do so using the hadoop fs interface, like this:
hadoop fs -rmr s3://mybucket/a/b/myfile.log
This removes the file from S3 appropriately, but in its place leaves an empty file named "s3://mybucket/a/b_$folder$". As described in this question, Hadoop's Pig is unable to handle these files, so later steps in the workflow can choke on them.
(Note, it doesn't seem to matter whether we use -rmr or -rm or whether we use s3:// or s3n:// as the scheme: all of these exhibit the described behavior.)
How do I use the hadoop fs interface to remove files from S3 and be sure not to leave these troublesome files behind?
I wasn't able to figure out if it's possible to use the hadoop fs interface in this way. However, the s3cmd interface does the right thing (but only for one key at a time):
s3cmd del s3://mybucket/a/b/myfile.log
This requires configuring a ~/.s3cfg file with your AWS credentials first. s3cmd --configure will interactively help you create this file.
That is how the S3 support is implemented in Hadoop; see http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html.
So use s3cmd.
