Data lost after shutting down hadoop HDFS? - hadoop

Hi, I'm learning Hadoop and I have a simple question: after I shut down HDFS (by calling hadoop_home/sbin/stop-dfs.sh), is the data on HDFS lost, or can I get it back?

Data wouldn't be lost if you stop HDFS, provided you store the NameNode's and the DataNodes' data in persistent locations specified using the properties:
dfs.namenode.name.dir -> Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. Default value: file://${hadoop.tmp.dir}/dfs/name
dfs.datanode.data.dir -> Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored. Default value: file://${hadoop.tmp.dir}/dfs/data
As you can see, the default values for both properties point to ${hadoop.tmp.dir}, which by default is /tmp. You might already know that the data in /tmp on Unix-based systems gets cleared on reboot.
So, if you specify locations other than /tmp, the Hadoop HDFS daemons will be able to read the data back after a reboot, and hence there is no data loss even on a cluster restart.
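For example, a sketch of an hdfs-site.xml that pins both locations outside /tmp (the /data/hadoop paths below are illustrative; the directories must exist and be writable by the HDFS daemons):

```xml
<!-- hdfs-site.xml: keep NameNode metadata and DataNode blocks out of /tmp -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///data/hadoop/dfs/data</value>
</property>
```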

Please make sure you are not deleting the metadata of the data stored in HDFS. You can ensure this simply by keeping dfs.namenode.name.dir and dfs.datanode.data.dir untouched, i.e. not deleting the paths configured in these properties in your hdfs-site.xml file.

Related

Copy files/chunks from HDFS to local file system of slave nodes

In Hadoop, I understand that the master node (Namenode) is responsible for storing the blocks of data in the slave machines (Datanodes).
When we use -copyToLocal or -get from the master, files can be copied from HDFS to the local storage of the master node. Is there any way the slaves can copy the blocks (data) that are stored on them to their own local file system?
For example, a file of 128 MB could be split between 2 slave nodes, each storing 64 MB. Is there any way for a slave to identify and load this chunk of data to its local file system? If so, how can this be done programmatically? Can the commands -copyToLocal or -get be used in this case also? Please help.
Short Answer: No
The data/files cannot be copied directly from the Datanodes. The reason is that Datanodes store the data but don't have any metadata about the stored files. To them, the blocks are just bits and bytes. The metadata of the files is stored in the Namenode. This metadata contains all the information about the files (name, size, etc.). Along with this, the Namenode keeps track of which blocks of a file are stored on which Datanodes. The Datanodes are also not aware of the ordering of the blocks when a file is split into multiple blocks.
Can the commands -copyToLocal or -get be used in this case also?
Yes, you can simply run these from the slave. The slave will then contact the namenode (if you've configured it properly) and download the data to your local filesystem.
What it doesn't do is a "short-circuit" copy, in which it would just copy the raw blocks between directories. There is also no guarantee it will read the blocks from the local machine at all, as your commandline client doesn't know its location.
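As an illustration (assuming a running cluster and a Hadoop client configured on the slave; the HDFS path is hypothetical), the command is the same as on the master:

```shell
# Run on the slave node: the client asks the NameNode for the file's
# block locations, then streams the blocks from the DataNodes that hold them.
hadoop fs -copyToLocal /user/hadoop/input.txt /tmp/input.txt
# equivalent:
hadoop fs -get /user/hadoop/input.txt /tmp/input.txt
```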
HDFS blocks are stored on the slaves' local FS only. You can dig down into the directory defined under the property dfs.datanode.data.dir.
But you won't get any benefit from reading the blocks directly (without the HDFS API). Also, reading or editing the block files directly can corrupt the file on HDFS.
If you want to store data on a particular slave's local disk, then you will have to implement your own logic for maintaining block metadata (which the Namenode already does for you).
Can you elaborate more why you want to distribute blocks by yourself when Hadoop takes care of all challenges faced in distributed data?
You can copy a particular file or directory from one cluster to another by using distcp.
Usage: hadoop distcp hdfs://source-namenode/path hdfs://target-namenode/path

What is the difference between fs.checkpoint.dir and dfs.name.dir?

Mainly, the dfs.name.dir property is used to store the fsimage of the namenode in a particular location for backup, and the fs.checkpoint.dir property is the location where the fsimage is merged. This is a little bit confusing to me. Can anyone explain it in detail?
dfs.name.dir is the place where the namenode stores the fsimage and edit logs on disk. This is a mandatory location; without it, a hadoop cluster will not start. It is located on the namenode host.
fs.checkpoint.dir is the directory on the local filesystem where the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy. This is not a mandatory location; the hadoop cluster will start without it. It is located on the secondary namenode host.
The fsimage and edit logs are merged periodically through secondary namenode. If secondary is not present, the merging of fsimage and editlogs will happen only at the time of namenode restart.
The explanation of secondary namenode is available in this blog post
dfs.name.dir
It is deprecated and replaced by dfs.namenode.name.dir. It determines where on the local file system the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
This property is used by Name Node.
fs.checkpoint.dir
It is deprecated and replaced by dfs.namenode.checkpoint.dir. It determines where on the local file system the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.
The secondary Name Node merges the fsimage and the edits log files periodically and keeps the edits log size within a limit. It is usually run on a different machine than the primary Name Node, since its memory requirements are of the same order as the primary Name Node's.
The secondary Name Node stores the latest checkpoint in a directory that is structured the same way as the primary Name Node's directory, so that the checkpointed image is always ready to be read by the primary Name Node if necessary.
When started with the -importCheckpoint option, the NameNode will upload the checkpoint from the dfs.namenode.checkpoint.dir directory and then save it to the NameNode directory(s) set in dfs.namenode.name.dir.
The NameNode will fail if a legal image is contained in dfs.namenode.name.dir.
The NameNode verifies that the image in dfs.namenode.checkpoint.dir is consistent, but does not modify it in any way.
Refer to HDFS user guide
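As a sketch, the two locations (using the current, non-deprecated property names) could be set in hdfs-site.xml like this; the /data/hadoop paths are illustrative:

```xml
<!-- hdfs-site.xml on the NameNode host -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/hadoop/dfs/name</value>
</property>
<!-- hdfs-site.xml on the secondary NameNode host -->
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>file:///data/hadoop/dfs/namesecondary</value>
</property>
```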

Where the data is stored in HDFS? Is there a way to change where its stored?

I'm a novice. I have a 3-node cluster. The Name Node, Job Tracker and Secondary Name Node are running on one node, and two data nodes (HData1, HData2) on the other two nodes. If I store data from my local system into HDFS, how do I find which node it resides on? Is there a way I can explicitly specify which data node it has to be stored on?
Thanks in advance!
Yes, you can find it using hadoop fsck <path>.
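For illustration (the HDFS path is hypothetical and this needs a running cluster), fsck can print each block of a file and the datanodes holding its replicas:

```shell
# Show the file, its blocks, and the DataNode (IP:port) of every replica
hadoop fsck /user/hadoop/myfile.txt -files -blocks -locations
```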
you can refer below links
how does hdfs choose a datanode to store
How to explicilty define datanodes to store a particular given file in HDFS?

Add a entire directory to hadoop file system (hdfs)

I have data that is stored in subdirectories and would like to put the parent directory into HDFS. The data is always present in the last-level directories, and the directory structure extends up to 2 levels.
So the structure is [parent_dir]->[sub_directories]->[sub_directories]->data
I tried to add the entire directory by doing
hadoop fs -put parent_dir input
This takes a long, long time! There are possibly 258x258 subdirectories. And it eventually fails with:
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(X.X.X.245:50010, storageID=DS-262356658-X.X.X.245-50010-1394905028736, infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: No space left on devic
I can see the required space on the nodes. What am I doing wrong here ?
Also, the way I was planning to access my files was:
hadoop jar Computation.jar input/*/* output
This worked well for small data set.
That error message is usually fundamentally correct. You may not be taking into account the replication factor for the HDFS filesystem. If your replication factor is set to 3, which is the default, then you need 300GB of storage available to store a 100GB dataset.
There are a couple of things you can do to help get around the issue:
1) Decrease your replication factor (dfs.replication) and your maximum replication (dfs.replication.max) to 2 in your hdfs-site.xml
2) Compress your datasets. Hadoop can operate on bzip2- and gzip-compressed files (though you need to be careful about splitting: gzip files are not splittable, while bzip2 files are)
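The space arithmetic behind the first paragraph can be sketched as follows (the 100 GB figure mirrors the example above; raw_storage_needed is a name made up here for illustration):

```python
def raw_storage_needed(dataset_gb: float, replication: int = 3) -> float:
    """Raw HDFS disk space consumed: every block is stored `replication` times."""
    return dataset_gb * replication

# With the default replication factor of 3, a 100 GB dataset needs 300 GB raw:
print(raw_storage_needed(100))     # 300
# Lowering dfs.replication to 2 cuts that to 200 GB:
print(raw_storage_needed(100, 2))  # 200
```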

HDFS vs LFS - How Hadoop Dist. File System is built over local file system?

From various blogs I read, I comprehended that HDFS is another layer that exists over Local filesystem in a computer.
I also installed hadoop but I have trouble understanding the existence of hdfs layer over local file system.
Here is my question:
Consider I am installing hadoop in pseudo-distributed mode. What happens under the hood during this installation? I added a tmp.dir parameter in the configuration files. Is it the single folder to which the namenode daemon talks when it attempts to access the datanode?
OK, let me give it a try. When you configure Hadoop, it lays down a virtual FS on top of your local FS, which is the HDFS. HDFS stores data as blocks (similar to the local FS, but much, much bigger by comparison) in a replicated fashion. The HDFS directory tree, or filesystem namespace, is hierarchical like that of the local FS, but it is a separate namespace of its own. When you start writing data into HDFS, it eventually gets written onto the local FS, but you can't see it there directly.
The temp directory actually serves 3 purposes :
1- Directory where the namenode stores its metadata, with default value ${hadoop.tmp.dir}/dfs/name; it can be specified explicitly by dfs.name.dir. If you specify dfs.name.dir, then the namenode metadata will be stored in the directory given as the value of this property.
2- Directory where HDFS data blocks are stored, with default value ${hadoop.tmp.dir}/dfs/data and can be specified explicitly by dfs.data.dir. If you specify dfs.data.dir, then the HDFS data will be stored in the directory given as the value of this property.
3- Directory where secondary namenode store its checkpoints, default value is ${hadoop.tmp.dir}/dfs/namesecondary and can be specified explicitly by fs.checkpoint.dir.
So, it's always better to use some proper dedicated location as the values for these properties for a cleaner setup.
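A sketch of an hdfs-site.xml that gives all three a dedicated location (using the older property names this answer uses; the /data/hadoop paths are illustrative):

```xml
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop/dfs/data</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/data/hadoop/dfs/namesecondary</value>
</property>
```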
When access to a particular block of data is required, the metadata stored in the dfs.name.dir directory is searched, and the location of that block on a particular datanode is returned to the client (it is somewhere in the dfs.data.dir directory on that node's local FS). The client then reads the data directly from there (the same holds good for writes as well).
One important point to note here is that HDFS is not a physical FS. It is rather a virtual abstraction on top of your local FS which can't be browsed simply like the local FS. You need to use the HDFS shell or the HDFS webUI or the available APIs to do that.
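To make that point concrete, here is a sketch (the local path assumes the default hadoop.tmp.dir of /tmp/hadoop-<user>; the HDFS path is hypothetical):

```shell
# The HDFS namespace is only visible through the HDFS shell:
hadoop fs -ls /user/hadoop

# On the local FS you only see opaque block files, not your filenames:
ls /tmp/hadoop-$USER/dfs/data/current/   # blk_..., blk_..._*.meta
```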
HTH
When you install hadoop in pseudo-distributed mode, all the HDFS daemons (namenode, datanode and secondary namenode) run on the same machine. The temp dir you configure is where the data node stores the data. So when you look at it from the HDFS point of view, your data is still stored in blocks and read in blocks, which are much bigger than (and an aggregation of) multiple file-system-level blocks.
