Hadoop file system change directory command

I was going through the list of hadoop fs commands, and I am a little perplexed not to find any "cd" command among them.
Why is that? It might sound like a silly question to experienced Hadoop users, but as a beginner I cannot understand why hadoop fs has no cd command.

Think about it like this:
Hadoop has its own file system, HDFS, which runs on top of an existing file system such as the Linux one. The hadoop fs shell keeps no current (present) working directory between commands, so there is no pwd either.
Let's say we have the following structure in HDFS:
d1/
d2/
f1
d3/
f2
d4/
f3
In your Linux file system you can cd from one directory to another, but would changing directory make sense in Hadoop? Each hadoop fs command runs as a separate client process, so nothing remembers a "current" directory between invocations. HDFS is like a virtual file system: you do not interact with it directly, only through the hadoop command or the job tracker.
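Because the shell is stateless, relative paths passed to hadoop fs are resolved against the user's HDFS home directory (typically /user/<username>). If you miss cd, one option is to emulate it on the client side. The following is only a minimal sketch: HDFS_CWD, hcd, and hpath are hypothetical helper names, not part of Hadoop, and ".." is not handled.

```shell
# Hypothetical helpers: keep an emulated "current directory" in a shell
# variable, because each `hadoop fs` invocation is stateless.
HDFS_CWD="/user/${USER:-hadoop}"   # assumed default HDFS home directory

# hcd DIR: change the emulated working directory (absolute or relative).
hcd() {
  case "$1" in
    /*) HDFS_CWD="$1" ;;
    *)  HDFS_CWD="${HDFS_CWD%/}/$1" ;;
  esac
}

# hpath PATH: print PATH resolved against the emulated working directory.
hpath() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)  printf '%s\n' "${HDFS_CWD%/}/$1" ;;
  esac
}
```

With these defined, something like hcd /d1 followed by hadoop fs -ls "$(hpath d2)" would list /d1/d2.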

HDFS provides features that make it easy to access HDFS (the Hadoop file system) from local machines or edge nodes. You can mount HDFS using either of the following methods. Once HDFS is mounted on your machine, you can use the cd command to browse it (much like mounting a remote network file system such as a NAS).
Fuse-DFS (available from Hadoop 0.20 onwards)
NFSv3 gateway access to HDFS data (available from Hadoop 2.2.0 onwards)

Related

Uploading file in HDFS cluster

I am learning Hadoop and so far I have configured a 3-node cluster:
127.0.0.1 localhost
10.0.1.1 hadoop-namenode
10.0.1.2 hadoop-datanode-2
10.0.1.3 hadoop-datanode-3
My hadoop Namenode directory looks like below
hadoop
bin
data-> ./namenode ./datanode
etc
logs
sbin
--
--
As I understand it, when a large file is uploaded to the cluster, HDFS divides it into blocks. I want to upload a 1 GB file to my cluster and see how it is stored on the datanodes.
Can anyone help me with the commands to upload a file and see where its blocks are stored?
First, check whether the Hadoop tools are in your PATH; if not, I recommend adding them.
One possible way of uploading a file to HDFS:
hadoop fs -put /path/to/localfile /path/in/hdfs
I would suggest you read the documentation and get familiar with the high-level commands first, as it will save you time:
Hadoop Documentation
Start with the "dfs" command, as it is one of the most frequently used.
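To get a feel for how many blocks to expect, here is a rough sketch of the arithmetic. It assumes a block size of 128 MiB, which is the usual dfs.blocksize default in Hadoop 2.x; your cluster may be configured differently. block_count is a hypothetical helper, and the paths in the comments are placeholders.

```shell
# block_count SIZE_BYTES BLOCK_BYTES: number of HDFS blocks a file occupies
# (ceiling division; the last block may be only partly filled).
block_count() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

file_size=$((1024 * 1024 * 1024))    # 1 GiB
block_size=$((128 * 1024 * 1024))    # assumed dfs.blocksize of 128 MiB
block_count "$file_size" "$block_size"    # prints 8

# On a running cluster (paths are placeholders):
#   hadoop fs -put /path/to/localfile /path/in/hdfs
#   hdfs fsck /path/in/hdfs -files -blocks -locations
```

The fsck command with -files -blocks -locations reports each block of the file and the datanodes that hold its replicas, which addresses the "where are the blocks stored" part of the question.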

How does the hadoop directory differ from hadoop-x.x.x?

I am new to Hadoop. Recently, while running MapReduce jobs on an OpenStack Hadoop cluster, I cd'ed into a directory on a datanode machine and found two Hadoop folders: one called "hadoop" and the other "hadoop-2.7.1". Obviously the latter makes more sense, as it tells the Hadoop version. The two folders contain the same sub-directories, so how do they differ from each other? And if I'd like to disable HDFS permission checking on this machine, which one should I use?
Here is a screenshot
As the colors in the screenshot suggest, hadoop is not a separate directory but a symbolic link, evidently pointing to hadoop-2.7.1. Run ls -l to check this.
You should cd into the hadoop directory. It exists intentionally, so that the Hadoop version does not have to be written explicitly. When a new version of Hadoop is deployed, a new versioned directory is created and the hadoop symbolic link is changed to point to the latest versioned directory. Like this:
hadoop-2.7.1
hadoop-2.7.2
hadoop-2.7.3
hadoop -> hadoop-2.7.3
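The layout above can be reproduced in a scratch directory without any Hadoop installation. This is only an illustration of the symlink mechanics; the directory names mirror the listing, and the temporary location is arbitrary.

```shell
# Recreate the layout in a scratch directory (no Hadoop needed).
demo=$(mktemp -d)
mkdir "$demo/hadoop-2.7.1" "$demo/hadoop-2.7.2" "$demo/hadoop-2.7.3"

# Initial deployment: the stable name points at the first version.
ln -s hadoop-2.7.1 "$demo/hadoop"

# Upgrade: repoint the link. -n replaces the link itself instead of
# creating a new link inside the directory it currently points to.
ln -sfn hadoop-2.7.3 "$demo/hadoop"

ls -l "$demo"              # the hadoop entry shows "hadoop -> hadoop-2.7.3"
readlink "$demo/hadoop"    # prints the link target
```

Scripts and environment variables can then refer to the stable hadoop path and keep working across upgrades.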

How can I get spark to access local HDFS on windows?

I have installed both hadoop and spark locally on a windows machine.
I can access HDFS files in hadoop, e.g.,
hdfs dfs -tail hdfs:/out/part-r-00000
works as expected. However, if I try to access the same file from the spark shell, e.g.,
val f = sc.textFile("hdfs:/out/part-r-00000")
I get an error that the file does not exist. Spark can access files in the windows file system using the file:/... syntax, though.
I have set the HADOOP_HOME environment variable to c:\hadoop which is the folder containing the hadoop install (in particular winutils.exe, which seems to be necessary for spark, is in c:\hadoop\bin).
Because it seems that HDFS data is stored in the c:\tmp folder, I was wondering whether there would be a way to let Spark know about this location.
Any help would be greatly appreciated. Thank you.
If you are getting a "file does not exist" error, your Spark application (code snippet) is able to connect to HDFS.
The HDFS file path that you are using seems to be wrong.
This should solve your issue:
val f = sc.textFile("hdfs://localhost:8020/out/part-r-00000")
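The host and port in that URI are not universal: they come from the fs.defaultFS setting in core-site.xml, so localhost:8020 is an assumption about this particular install. A small sketch for building the fully qualified path from that setting follows; hdfs_uri is a hypothetical helper.

```shell
# hdfs_uri DEFAULT_FS PATH: join a filesystem URI and an absolute HDFS path.
hdfs_uri() {
  printf '%s%s\n' "${1%/}" "$2"
}

# The value used in the answer; on a real install, read it instead with:
#   hdfs getconf -confKey fs.defaultFS
default_fs="hdfs://localhost:8020"   # assumption, depends on core-site.xml

hdfs_uri "$default_fs" /out/part-r-00000
```

The printed URI is what you would pass to sc.textFile in the Spark shell.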

Retrieve files from remote HDFS

My local machine does not have an hdfs installation. I want to retrieve files from a remote hdfs cluster. What's the best way to achieve this? Do I need to get the files from hdfs to one of the cluster machines fs and then use ssh to retrieve them? I want to be able to do this programmatically through say a bash script.
Here are the steps:
Make sure there is network connectivity between your host and the target cluster.
Configure your host as a client by installing Hadoop binaries compatible with the cluster's version. The host does not have to run the same operating system as the cluster nodes, as long as the client version is compatible.
Make sure the host has the same configuration files as the cluster (core-site.xml, hdfs-site.xml).
Run the hadoop fs -get command to fetch the files directly.
There are also alternatives:
If WebHDFS/HttpFS is configured, you can download files using curl or even your browser. With WebHDFS configured, you can also write bash scripts around it.
If Hadoop binaries cannot be installed on your host, you can use the following steps instead:
Enable passwordless login from your host to one of the nodes in the cluster.
Run the command ssh <user>@<host> "hadoop fs -get <hdfs_path> <os_path>".
Then use the scp command to copy the files to your host.
You can put the above two commands in one script.
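Both alternatives can be sketched as small shell functions. Treat this as an outline under assumptions: webhdfs_open_url and fetch_via_ssh are hypothetical helper names, all host names and paths are placeholders, and the NameNode HTTP port depends on your version and configuration (50070 is the usual default in Hadoop 2.x, 9870 in Hadoop 3.x).

```shell
# webhdfs_open_url HOST PORT HDFS_PATH: URL for reading a file via WebHDFS.
webhdfs_open_url() {
  printf 'http://%s:%s/webhdfs/v1%s?op=OPEN\n' "$1" "$2" "$3"
}

# fetch_via_ssh USER HOST HDFS_PATH REMOTE_TMP LOCAL_DEST:
# stage the file on a cluster node, then copy it to this machine.
fetch_via_ssh() {
  ssh "$1@$2" "hadoop fs -get '$3' '$4'" &&
    scp "$1@$2:$4" "$5"
}

# Example invocations (host names, port, and paths are placeholders):
#   curl -L -o part-r-00000 "$(webhdfs_open_url namenode.example.com 50070 /out/part-r-00000)"
#   fetch_via_ssh alice edge.example.com /out/part-r-00000 /tmp/part-r-00000 ./part-r-00000
```

curl needs -L because the NameNode answers an OPEN request with a redirect to the datanode that serves the data.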

How do I use the HDFS shell to access two or more remote Hadoop filesystems?

For various reasons, I have one hadoop installation on machine A, a second hadoop installation on cluster B, and a third hadoop installation on cluster C.
When I set up machine A, the xml files were set so that I could use the HDFS shell to find the HDFS on machine A.
I can rewrite the xml files on machine A so that the HDFS shell invoked from machine A sees a different HDFS by default.
However, I would like to be able to access all filesystems conveniently, without resetting the xml files.
Example: while logged in at machine A, I would like to copy a file from cluster B to cluster C with syntax something like:
hdfs dfs -cp hdfs://nn1.exampleB.com/file1 hdfs://nn2.exampleC.com/file2
Currently it seems that syntax does not work (although the errors are varied; sometimes they are EOF; other times they are network timeouts).
Should the above syntax be valid without modifications to the XML configuration files?
You should use the distcp command:
$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
See more here: http://hadoop.apache.org/docs/r0.19.0/distcp.html
