In Ambari, if a directory mentioned in dfs.datanode.data.dir does not exist, it is created on the root drive - hortonworks-data-platform

I have 3 data nodes: A, B, and C. A and B each contain 3 hard drives, mounted as
/hadoop/data1
/hadoop/data2
/hadoop/data3
On node C, I have only 2 drives mounted:
/hadoop/data1
/hadoop/data2
I have installed HDFS with
dfs.datanode.data.dir = /hadoop/data1,/hadoop/data2,/hadoop/data3
The Ambari installation says that data directories which are not present will be ignored.
But in my case, a new folder (/hadoop/data3) is created on the root drive on node C.
How can I make it ignore the non-existent directory?

I created and assigned a new Host Config Group to node C.
I then added only the required data node directories to the newly created host config group (sketched after the links below).
The relevant guides can be found at these links:
https://community.hortonworks.com/questions/56163/machines-with-various-disk-numbers.html
https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.2.2/bk_ambari-operations/content/using_host_config_groups.html
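For illustration, reusing the mount points from the question, the resulting values look roughly like this (the default config group keeps all three directories, while the Host Config Group assigned to node C lists only the two that exist):

Default config group (nodes A and B):
dfs.datanode.data.dir = /hadoop/data1,/hadoop/data2,/hadoop/data3

Host Config Group for node C:
dfs.datanode.data.dir = /hadoop/data1,/hadoop/data2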

Related

HDFS: Removing data node directory of only one node

We have a Hadoop cluster (HDP 3.1.4 with Ambari 2.7) containing 3 data nodes: data1, data2, and data3, with the following HDFS disks and mount points:
host data[1-2]
/dev/sdb -> /mnt/datadisk1
/dev/sdc -> /mnt/datadisk2
/dev/sdd -> /mnt/datadisk3
host data3
/dev/sdb -> /mnt/datadisk1
/dev/sdc -> /mnt/datadisk2
During cluster setup we set dfs.datanode.data.dir to /mnt/datadisk1,/mnt/datadisk2,/mnt/datadisk3. Now we see that our root partition (/) on the data3 node is running full, because the mount point /mnt/datadisk3 does not exist there, and the (HDFS) data is therefore stored on the root partition instead of being ignored.
Is there a way to remove this wrong path (data3 : /mnt/datadisk3) somehow, without editing the config files directly on the OS (we want to use Ambari)?
#D.Muller You should be able to edit the paths in Ambari. During setup, Ambari will try to add multiple paths to the directory listing. Your paths need to be consistent across all the nodes, which is likely where things got confused for you when a path was missing on a single node. If you log in to Ambari and remove the non-existent disk, you should be able to fix this issue.
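Once the wrong path has been removed from the data3 node's configuration (for example via a Host Config Group override, as in the question above) and the DataNode restarted, the data that was written under /mnt/datadisk3 on the root partition is no longer used. A hedged cleanup sketch for data3, to be run only after confirming block replication is healthy:

# check for missing or under-replicated blocks before deleting anything
hdfs fsck /
# then reclaim the space on the root partition by removing the stray directory
rm -rf /mnt/datadisk3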

DataNode not starting after changing the datanode directories parameter: DiskErrorException

I have added a new disk to the Hortonworks sandbox on Oracle VM, following this example:
https://muffinresearch.co.uk/adding-more-disk-space-to-a-linux-virtual-machine/
I set the owner of the mounted disk directory to hdfs:hadoop recursively and gave it 777 permissions.
I added the mounted disk folder to the datanode directories after a comma, using Ambari. I also tried changing the XML directly.
After the restart, the DataNode always crashes with a DiskErrorException: Too many failed volumes.
Any ideas what I am doing wrong?
I found a workaround for this problem. I had mounted the disk to the /mnt/sdb folder and used that folder as the datanode directories entry, which caused the exception.
But if I create /mnt/sdb/data and use that as the entry instead, the exception disappears and everything works like a charm.
No idea why.
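For reference, the Too many failed volumes message is governed by dfs.datanode.failed.volumes.tolerated, which defaults to 0, so a single volume the DataNode refuses to use is enough to stop it. A minimal sketch of the workaround described above, assuming the disk is already mounted at /mnt/sdb:

# create a subdirectory on the mounted disk and hand it to the HDFS user
mkdir -p /mnt/sdb/data
chown -R hdfs:hadoop /mnt/sdb/data
# then add /mnt/sdb/data (not /mnt/sdb itself) as the new datanode directories entry in Ambari and restart the DataNode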

Hadoop removing a mount point folder from Cloudera

I've searched and read up on Cloudera Hadoop and removing mount point file systems, but I cannot find anything on how to remove them.
I have two SSD drives in each of 6 machines, and when I initially installed Cloudera Hadoop it added all file systems, but I only need two mount points to run a few teragen and terasort jobs.
I need to remove everything except for:
/dev/nvme0n1 and /dev/nvme1n1
In Cloudera Manager you can modify the list of drives used for HDFS data at:
Clusters > HDFS > Configuration > DataNode Default Group (or whatever you may have renamed this to) > DataNode Data Directory
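For example, if the two NVMe drives were mounted at /data/nvme0 and /data/nvme1 (hypothetical mount points; the question only lists the device names), the DataNode Data Directory value would be trimmed down to just those two entries:

/data/nvme0,/data/nvme1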

Restarting datanodes after reformatting the namenode in a Hadoop cluster

Using the basic configuration provided in the official Hadoop setup documentation, I can run a Hadoop cluster and submit MapReduce jobs.
The problem is that whenever I stop all the daemons and reformat the namenode, the datanode does not start when I subsequently start the daemons again.
I've been looking around for a solution, and it appears that this is because formatting only formats the namenode, and the disk space for the datanode needs to be erased as well.
How can I do this? What changes do I need to make to my config files? After those changes are made, how do I delete the correct files when formatting the namenode again?
Specifically, check whether you have configured the two parameters below, which can be defined in hdfs-site.xml:
dfs.name.dir: Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.
dfs.data.dir: Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
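For illustration only (the paths below are examples, not taken from the original post), the entries in hdfs-site.xml might point at locations like:

dfs.name.dir = /hadoop/hdfs/namenode
dfs.data.dir = /hadoop/hdfs/data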
If you have provided specific directory locations for the above two parameters, then you need to delete those directories as well before formatting the namenode.
If you have not provided these two parameters, then by default the directories are created under the location given by the parameter below:
hadoop.tmp.dir, which can be configured in core-site.xml.
Again, if you have specified this parameter, then you need to remove that directory before formatting the namenode.
If you have not defined it, then by default it is created at /tmp/hadoop-$username (for the hadoop user), so you need to remove that directory.
Summary: you have to delete the name node and data node directories before formatting the system. By default they are created under /tmp/.
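A minimal sketch of that sequence, assuming the default /tmp/hadoop-<user> location and the standard start/stop scripts that ship with Hadoop; adjust the rm path if dfs.name.dir / dfs.data.dir point elsewhere, and run the cleanup on every node that stores name node or data node state:

# stop HDFS before touching the storage directories
stop-dfs.sh
# remove the old name node and data node state (default location under /tmp)
rm -rf /tmp/hadoop-$USER
# reformat the name node and bring the daemons back up
hdfs namenode -format
start-dfs.sh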

[hdfs] How to configure a different dfs.datanode.data.dir for each datanode?

I use Ambari to set up a Hadoop cluster.
But when I configure HDFS, I found that if I modify dfs.datanode.data.dir, the change takes effect on all datanodes...
How can I configure different values for each datanode?
For example, there are two disks in machine A, mounted at /data1 and /data2,
but there is only one disk in machine B, mounted at /data1.
So I want to configure dfs.datanode.data.dir as "/data1,/data2" for machine A,
but as only "/data1" for machine B.
HDFS directories that don't exist will be ignored. Put them all in; it won't matter.
Remember that each Hadoop node in the cluster also has its own set of configuration files (under the usual conf/ directory), so you can log in to that datanode machine and change its config files.
The local configuration on the datanode will take effect for that datanode.
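As a sketch, with the machines from the question, the effective per-node values would end up looking like this (set in the local hdfs-site.xml on each machine, or via per-host overrides in Ambari as in the answers above):

machine A:  dfs.datanode.data.dir = /data1,/data2
machine B:  dfs.datanode.data.dir = /data1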
