Restarting datanodes after reformatting the namenode in a Hadoop cluster

Using the basic configuration provided in the official Hadoop setup documentation, I can run a Hadoop cluster and submit MapReduce jobs.
The problem is that whenever I stop all the daemons and reformat the namenode, the datanode does not start when I subsequently start all the daemons.
I have been looking around for a solution, and it appears the formatting only formats the namenode, so the datanode's disk space needs to be erased as well.
How can I do this? What changes do I need to make to my config files? After those changes are made, how do I delete the correct files when formatting the namenode again?

Specifically, it depends on whether you have set the following two parameters, which can be defined in hdfs-site.xml:
dfs.name.dir: Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.
dfs.data.dir: Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
If you have provided specific directory locations for these two parameters, then you need to delete those directories as well before formatting the namenode.
If you have not provided them, then by default they are created under the following parameter:
hadoop.tmp.dir, which can be configured in core-site.xml.
Again, if you have specified this parameter, then you need to remove that directory before formatting the namenode.
If you have not defined it, then by default it is created under /tmp/hadoop-$username (e.g. /tmp/hadoop-hadoop for the hadoop user), so you need to remove that directory.
Summary: you have to delete the namenode and datanode directories before formatting the filesystem; by default they are created under /tmp/.
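A minimal sketch of the full reset sequence, assuming a Hadoop 2.x single-node install that uses the default layout under hadoop.tmp.dir (/tmp/hadoop-$USER); substitute your own dfs.name.dir / dfs.data.dir locations if you have set them explicitly:
# stop all daemons before touching anything on disk
stop-dfs.sh && stop-yarn.sh
# remove the namenode and datanode storage (default locations shown;
# replace with your dfs.name.dir / dfs.data.dir values if configured)
rm -rf /tmp/hadoop-$USER/dfs/name
rm -rf /tmp/hadoop-$USER/dfs/data
# reformat the namenode and bring the daemons back up
hdfs namenode -format
start-dfs.sh && start-yarn.sh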

Related

How do you create an HDFS data directory?

Every time my Hadoop server reboots, I have to format the namenode to start Hadoop. This removes all of the files in my Hadoop installation.
I need to move my HDFS location from /tmp to a permanent location so that I don't have to format the namenode whenever the server reboots.
I am very new to Hadoop.
How do I create an HDFS data directory in another location?
How do I reference this data directory in the config file so that I don't have to format the namenode?
These two properties in hdfs-site.xml determine where the local files are stored.
The defaults are under /tmp:
dfs.namenode.name.dir
dfs.datanode.data.dir
You typically have to format a namenode only when the HDFS processes failed to terminate correctly (for example, after a power failure or forced shutdown). Running a standby NameNode is encouraged to prevent these scenarios.
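As a sketch of how moving HDFS off /tmp might look (the /usr/local/hadoop_store paths below are arbitrary examples, not defaults; any location that survives reboots will do):
# create permanent storage directories owned by the hadoop user
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode /usr/local/hadoop_store/hdfs/datanode
sudo chown -R $USER:$USER /usr/local/hadoop_store
# point HDFS at them in hdfs-site.xml (a complete minimal file shown for brevity)
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
EOF
# format once; the metadata now persists across reboots
hdfs namenode -format
start-dfs.sh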

Why is there a need for hadoop commands in pseudo-distributed mode?

It might be a stupid question but I needed to know.
For example: why do we need the hadoop fs -ls command to list files? Why can't plain ls be used instead?
In pseudo-distributed mode, is it the case that part of the filesystem is handed over to the Hadoop file system and is only accessible through the Hadoop namenode daemon? That is my guess. Please explain.
ls lists the files available on your local computer's filesystem.
You can set the fs.defaultFS property to file:/// (the default); then both will act the same, but this is not considered pseudo-distributed mode.
Pseudo-distributed mode requires that you specify the datanode and namenode volumes on each respective system in the cluster, and hdfs dfs commands will only list the files that are known to the namenode.
And it's called pseudo-distributed only because it's a single node. Once you have that working, adding another node should be straightforward, given appropriate network connections.
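A quick way to see the difference (a sketch; /tmp is just an example path):
# plain ls sees only your local filesystem
ls /tmp
# hdfs dfs -ls sees only the namespace the namenode manages
hdfs dfs -ls /
# you can still reach the local filesystem through the Hadoop shell
# by using a fully qualified file:// URI
hdfs dfs -ls file:///tmp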

how to save data in HDFS with spark?

I want to use Spark Streaming to retrieve data from Kafka. Now, I want to save my data to a remote HDFS. I know that I have to use the function saveAsTextFile. However, I don't know precisely how to specify the path.
Is it correct if I write this:
myDStream.foreachRDD(frm->{
frm.saveAsTextFile("hdfs://ip_addr:9000//home/hadoop/datanode/myNewFolder");
});
where ip_addr is the IP address of my remote HDFS server,
/home/hadoop/datanode/ is the DataNode HDFS directory created when I installed Hadoop (I don't know if I have to specify this directory), and
myNewFolder is the folder where I want to save my data.
Thanks in advance.
Yassir
The path has to be a directory in HDFS.
For example, if you want to save the files inside a folder named myNewFolder under the root path / in HDFS,
the path to use would be hdfs://namenode_ip:port/myNewFolder/.
On execution of the Spark job, this directory myNewFolder will be created.
The datanode data directory, which is set via dfs.datanode.data.dir in hdfs-site.xml, is where the datanode stores the blocks of the files you put in HDFS; it should not be referenced as an HDFS directory path.
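For instance, assuming the RPC port 9000 from the question and the corrected root-level path above, you could check the result after a batch has been written (saveAsTextFile produces part-NNNNN files inside the directory):
hdfs dfs -ls hdfs://namenode_ip:9000/myNewFolder/
hdfs dfs -cat hdfs://namenode_ip:9000/myNewFolder/part-00000 | head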

Does changing the value of dfs.blocksize affect existing data?

My Hadoop version is 2.5.2. I am changing dfs.blocksize in the hdfs-site.xml file on the master node. I have the following questions:
1) Will this change affect the existing data in HDFS?
2) Do I need to propagate this change to all the nodes in the Hadoop cluster, or is changing it only on the NameNode sufficient?
1) Will this change affect the existing data in HDFS?
No, it will not. Existing files keep their old block size. For the new block size to take effect, you need to rewrite the data: you can either do a hadoop fs -cp or a distcp on your data. The new copy will have the new block size, and you can then delete your old data.
2) Do I need to propagate this change to all the nodes in the Hadoop cluster, or is changing it only on the NameNode sufficient?
In this case you technically only need to change the NameNode. However, that is a very bad idea: you should keep all of your configuration files in sync, for a number of good reasons. When you get more serious about your Hadoop deployment, you should probably start using something like Puppet or Chef to manage your configs.
Also, note that whenever you change a configuration, you need to restart the NameNode and DataNodes in order for them to change their behavior.
Interesting note: you can override the block size for individual files as you write them, e.g. hadoop fs -D dfs.blocksize=134217728 -put a b (dfs.blocksize is the Hadoop 2.x property; older releases used dfs.block.size).
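A sketch of the rewrite-and-verify steps mentioned above (the /data/old and /data/new paths are placeholders):
# rewrite a single file so the copy picks up a 128 MB block size
hadoop fs -D dfs.blocksize=134217728 -cp /data/old/file.txt /data/new/file.txt
# or rewrite a whole tree with distcp
hadoop distcp -D dfs.blocksize=134217728 /data/old /data/new
# verify: %o prints the block size of the stored file
hadoop fs -stat %o /data/new/file.txt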
You should make the change in hdfs-site.xml on all the slaves as well; dfs.blocksize should be consistent across all datanodes.
Changing the block size in hdfs-site.xml will only affect new data.
Which distribution are you using? From your questions it looks like you are using the Apache distribution. The easiest way I can think of is to write a shell script that first deletes hdfs-site.xml on the slaves, like:
ssh username@domain.com 'rm /some/hadoop/conf/hdfs-site.xml'
ssh username@domain2.com 'rm /some/hadoop/conf/hdfs-site.xml'
ssh username@domain3.com 'rm /some/hadoop/conf/hdfs-site.xml'
and then copies the hdfs-site.xml from the master to all the slaves:
scp /hadoop/conf/hdfs-site.xml username@domain.com:/hadoop/conf/
scp /hadoop/conf/hdfs-site.xml username@domain2.com:/hadoop/conf/
scp /hadoop/conf/hdfs-site.xml username@domain3.com:/hadoop/conf/
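The same copy step can be written as a small loop (a sketch; the hostnames are the placeholders used above):
for host in domain.com domain2.com domain3.com; do
  scp /hadoop/conf/hdfs-site.xml username@$host:/hadoop/conf/
done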

How to Switch between namenodes in hadoop?

Pseudo-distributed cluster:
Suppose I first create a namenode on machine "A" with the name "Root1".
This creates an HDFS on that machine.
Now I copy some files to HDFS using copyFromLocal and run some MapReduce jobs.
Then I need to change some files under /conf.
I change the config files, and to make them effective I format the namenode with a new name, "Root2".
If I browse HDFS now, it is empty (meaning it no longer contains the files copied earlier under "Root1").
If I want to see the old files (for "Root1"), is there any way to switch to that HDFS or namenode (from Root2 back to Root1)?
To be clear: did you launch another namenode on your machine?
Type sudo jps in a console or open http://localhost:50070 in a browser and check whether you have more than one namenode running. If there is just one, you have lost your data in HDFS. If you have two namenodes, you can inspect each filesystem in the browser at http://localhost:50070.
Here are instructions for how to launch more than one datanode on one machine.
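A few commands that help with that check (a sketch for a Hadoop 2.x pseudo-distributed setup; 50070 is the default namenode web UI port):
# list the Hadoop daemons that are actually running (NameNode, DataNode, ...)
jps
# print the configured namenode metadata directory (from hdfs-site.xml or the defaults);
# each formatted namespace keeps its fsimage under this directory
hdfs getconf -confKey dfs.namenode.name.dir
# fetch the namenode web UI page, which shows the live filesystem
curl -s http://localhost:50070 | head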
