How to set up multiple local datanodes with the data outside the tmp folder on Mac? Location = null - hadoop

(First post, please do not hesitate to ask for further details.)
I am trying to follow a tutorial (https://openclassrooms.com/fr/courses/4467481-creez-votre-data-lake/4509436-deployez-hdfs-en-production-et-passez-a-l-echelle) to set up multiple datanodes on my laptop.
First we start with a namenode and one data node, then we add other data nodes.
The problem shows up at two steps:
- During namenode -format: I get a "location = null ??"
- I can run one datanode, but not two or more
The procedure for the first part (1 datanode) is:
Edit core-site.xml
Edit hdfs-site.xml
Create the folders matching the paths set above
namenode -format
Launch the namenode
Launch the datanode
Second part (three new datanodes):
Create config files for the new datanodes
Launch the namenode
Launch the datanodes: ./bin/hdfs datanode -conf ./etc/hadoop/datanode1.xml
NB: I define new folders for the data because we don't want it in the tmp folder.
In the second part, the configuration files are edited (see the files below) and we launch the datanodes with them. So, is the "dfs.data.dir" property in hdfs-site.xml useless when there is more than one datanode?
Issues:
When I run namenode -format I get this line: "Re-format filesystem in Storage Directory root= /Users/XXXXX/code/hdfs/namenode; location= null ? (Y or N)." I answer Y.
At this point, I don't understand the location = null. It is not present in the tutorial, so I figured my path was wrong, but when I double-check my files they look fine.
Then I run my namenode: OK.
Then I run my first datanode: OK.
But when I run my second datanode: not OK: "datanode is running as process 72080. Stop it first and ensure /tmp/hadoop-XXXX-datanode.pid file is empty before retry"
I tend to think it is all related to a path issue.
I looked in my tmp folder, and the .pid files are there. I don't understand, because I thought I had chosen not to use the tmp folder thanks to the changes in my configuration.
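From what I read, these .pid files are written by Hadoop's launcher scripts rather than by the HDFS settings, and their directory comes from the HADOOP_PID_DIR environment variable (the file name is derived from the user and the command), so two datanodes started by the same user would collide on the same pid file no matter what dfs.data.dir says. Here is the kind of thing I am tempted to try; the pids folder and the datanode2.xml name are just my own illustrative choices:
# rough sketch: give each extra datanode its own pid directory before launching it
export HADOOP_PID_DIR=/Users/XXXXX/code/hdfs/pids/datanode2   # illustrative path, not from the tutorial
mkdir -p "$HADOOP_PID_DIR"
./bin/hdfs datanode -conf ./etc/hadoop/datanode2.xml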
Here are my files:
hdfs-site.xml:
`<configuration>
<property>
<name>dfs.name.dir</name>
<value>/Users/XXXX/code/hdfs/namenode/</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/Users/XXXX/code/hdfs/datanode/</value>
</property>
</configuration>`
core-site.xml:
`<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>`
datanodeX.xml:
`<configuration>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:50012</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:50077</value>
</property>
<property>
<name>dfs.datanode.ipc.address</name>
<value>0.0.0.0:50022</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/Users/XXXX/code/hdfs/datanode2/</value>
</property>
</configuration>`

Related

"ERROR: Cannot set priority of secondarynamenode process 31231"

I have a problem with Hadoop. I am on macOS and I get an error when I want to launch my nodes.
I installed Hadoop this way:
brew install hadoop
I also configured the different files like this:
hadoop-env.sh:
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk-17.0.2.jdk/Contents/Home"
core-site.xml:
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
<description>A base for other temporary directories</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>localhost:8021</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
I finally executed this command:
hdfs namenode -format
Finally, when I launch ./start-dfs.sh I get this error:
"ERROR: Cannot set priority of secondarynamenode process 31231"
I would like to point out that my main node launches correctly.
I can't find a solution on the internet.
Has anyone faced the same situation as me?
I tried all the suggested solutions, but it doesn't work: localhost: ERROR: Cannot set priority of datanode process 32156
Sincerely,
For people who had the same problem as me, here is a tutorial that might work:
https://techblost.com/how-to-install-hadoop-on-mac-with-homebrew/
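One more thing that helped me: the "Cannot set priority of ... process" line seems to be the launcher script's generic failure message, and the real Java error usually ends up in the daemon's log files, so it is worth reading those first. Something like the following; the Homebrew layout of the log directory is an assumption on my part:
# rough sketch: look for the actual stack trace in the secondarynamenode log
cd "$(brew --prefix hadoop)/libexec/logs"        # usual log dir for a Homebrew install, adjust if yours differs
ls -lt | head                                    # newest log files first
tail -n 50 hadoop-*-secondarynamenode-*.log      # the underlying error is usually near the end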

Set up a Hadoop multi-node cluster on 2 Windows 10 machines

I am trying to set up a multi-node Hadoop cluster between 2 Windows devices. I am using Hadoop 2.9.2.
How can I achieve that, please?
After a lot of trial and error, the following did the job for me.
Do the same configuration as in the previous answer by #AbsoluteBeginner.
Disable the Windows firewall on all machines (I think you could keep it on and just adjust the rules, but that's for you to find out).
Run hdfs namenode -format on all nodes (master and slaves).
Make sure that the datanode folder is empty on all 3 nodes (just Shift+Del).
On the master node run start-all.cmd. All of the following should appear:
50436 NameNode
54696 NodeManager
54744 DataNode
60028 Jps
7340 ResourceManager
On the slave nodes run start-all.cmd. All of the following should appear:
6116 DataNode
2408 Jps
3208 NodeManager
Note: the reason the namenode and resource manager aren't appearing is that they are running on the master node and already occupy the ports; you only need the master's resourcemanager and namenode running.
Note: if you have seen a Linux multi-cluster tutorial, the master node also shows SecondaryNameNode when running jps. I'm not really sure why it doesn't appear on Windows.
Go to master:50070 and navigate to Datanodes; you should see something like this.
Go to master:8088 and navigate to Nodes; you should see something like this.
Install an OpenSSH server on both of your systems using this guide. Generating a new SSH public and private key pair on your local computer is the first step towards authenticating with a remote server without a password. Add the public key to authorized_keys and add your hostname to the list of known hosts. You can find guides on how to do this by searching the internet.
2. Add your Hadoop master and slave IPs to your hosts file. Open "C:\Windows\System32\drivers\etc\hosts"
and add
your-master-ip hadoopMaster
your-slave-ip hadoopSlave
You can use these names in your configuration files.
Much like on Linux systems, these are the steps you have to follow in order to run a Hadoop cluster on Windows:
3. First you need to have Java installed on your system, and JAVA_HOME must be added to your environment variables. You can download Java from the Oracle website and install it.
Download the Hadoop binary files from the Apache website and extract them.
Note that you shouldn't have spaces in your folder names or you might encounter problems.
Next you have to add the Java and Hadoop home and bin folders to your environment variables. Just open the Start menu, type "environment variable", and open the "Edit environment variables" window from Control Panel.
Add
HADOOP_HOME="root of your hadoop extracted folder\hadoop-2.9.2"
HADOOP_BIN="root of your hadoop extracted folder\hadoop-2.9.2\bin"
JAVA_HOME="root of your JDK installation"
Edit your "Path" environment variable and add %JAVA_HOME%, %HADOOP_HOME%, %HADOOP_BIN%, and %HADOOP_HOME%\sbin to your PATH one by one.
You can validate your additions by opening cmd and typing:
echo %HADOOP_HOME%
echo %HADOOP_BIN%
echo %PATH%
CONFIGURING HADOOP:
10. Open "your hadoop root\hadoop-2.9.2\etc\hadoop\hadoop-env.cmd" and add the following lines to the bottom of the file:
set HADOOP_PREFIX=%HADOOP_HOME%
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin
11. Open "your-hadoop-root\hadoop-2.9.2\etc\hadoop\hdfs-site.xml" and add the content below:
<property>
<name>dfs.name.dir</name>
<value>your desired address</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>your desired address</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>hadoopMaster:50070</value>
<description>Your NameNode hostname for http access.</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoopMaster:50090</value>
<description>Your Secondary NameNode hostname for http access.</description>
</property>
12. Edit your core-site.xml and add:
<property>
<name>fs.default.name</name>
<value>hdfs://hadoopMaster:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>your-temp-directory</value>
<description>A base for other temporary directories.</description>
</property>
Open "root to hadoop\hadoop-2.9.2\etc\hadoop\mapred-site.xml" and add below content within tags. If you don’t see mapred-site.xml then open mapred-site.xml.template file and rename it to mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>hadoopMaster:9001</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
14. Edit your yarn-site.xml and add:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>Long running service which executes on Node Manager(s) and provides MapReduce Sort and Shuffle functionality.</description>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
<description>Enable log aggregation so application logs are moved onto hdfs and are viewable via web ui after the application completed. The default location on hdfs is '/log' and can be changed via yarn.nodemanager.remote-app-log-dir property</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoopMaster:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoopMaster:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoopMaster:8032</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoopMaster:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoopMaster:8088</value>
</property>
15. In your slaves file in "root-hadoop-directory/hadoop/etc/hadoop" add
hadoopSlave
16. Do these steps on your slave nodes too.
17. Open cmd and cd to the sbin folder in your Hadoop directory.
18. Format your namenode:
hadoop namenode -format
19. Run the following command:
start-dfs.cmd
Then run:
start-yarn.cmd
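To double-check that the slave actually joined the cluster rather than just starting its own daemons, a quick sanity check I use is jps on each machine plus an HDFS report on the master; this assumes the bin and sbin folders are on your PATH as set up above.
rem lists the Java daemons running on this node (run it on the master and on the slave)
jps
rem run on the master: the "Live datanodes" count in the report should include the slave
hdfs dfsadmin -report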

How to configure Hadoop so that the namenode starts without reformatting on reboot

The namenode is missing every time Hadoop is started after a reboot. hadoop namenode -format corrects this problem, at the cost of losing files through the format. How can I prevent formatting and preserve the namenode files?
Put the following configuration in your hdfs-site.xml:
$ vi hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hduser/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hduser/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
dfs.name.dir determines where on the local file system the DFS namenode should store the name table (fsimage). The name table is replicated in all of the directories, for redundancy.
dfs.data.dir determines where on the local file system a DFS datanode should store its blocks. Data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
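If it helps, a rough sketch of the one-time setup that goes with the snippet above; the hduser account and the exact paths are taken from the example values and may differ on your machine:
# run as the hduser account so the daemons can write here; paths come from the hdfs-site.xml above
mkdir -p /home/hduser/hadoopdata/hdfs/namenode
mkdir -p /home/hduser/hadoop/hadoopdata/hdfs/datanode
# format exactly once; after a reboot just start the daemons again, do not re-format
hdfs namenode -format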

Configuration of Hadoop xml file

What is the right configuration of the hdfs-site.xml file when configuring Hadoop?
On all the websites I see this:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
I used the same configuration but was unable to start the datanode.
Later I changed the datanode configuration to
<name>dfs.datanode.name.dir</name>
instead of
<name>dfs.datanode.data.dir</name>
and it worked.
Which one is right, name.dir or data.dir?
All the websites say data.dir, but that does not work in my case.
Thanks, guys.
The options are dfs.namenode.name.dir and dfs.datanode.data.dir. The first defines the directory where namenode information will be held. The second defines where datanode information will be held.
If the node is a namenode, you need the namenode folder configuration. If it's a datanode, you need the datanode folder configuration. If it's a single-node cluster, you need both, as the node acts as both namenode and datanode.
If that doesn't work, what is the error message you get? Have you formatted the namenode?
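In case it helps with that troubleshooting, something along these lines usually narrows it down; the hadoop_store paths are taken from the question and the $HADOOP_HOME/logs location is only the usual default, so adjust both if your setup differs:
# make sure the storage directories exist and are writable by the user running Hadoop
mkdir -p /usr/local/hadoop_store/hdfs/namenode /usr/local/hadoop_store/hdfs/datanode
sudo chown -R "$USER" /usr/local/hadoop_store        # the datanode will not start if it cannot write here
hdfs namenode -format                                # only needed the first time (or after wiping the store)
tail -n 50 "$HADOOP_HOME"/logs/hadoop-*-datanode-*.log   # the reason the datanode dies is usually logged here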

Failed to get system directory - hadoop

Using a Hadoop multi-node setup (1 master, 1 slave).
After starting start-mapred.sh on the master, I found the error below in the TaskTracker logs (on the slave):
org.apache.hadoop.mapred.TaskTracker: Failed to get system directory
Can someone help me understand what can be done to avoid this error?
I am using
Hadoop 1.2.0
jetty-6.1.26
java version "1.6.0_23"
mapred-site.xml file
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>1</value>
<description>
define mapred.map tasks to be number of slave hosts
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>1</value>
<description>
define mapred.reduce tasks to be number of slave hosts
</description>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/workspace</value>
</property>
</configuration>
It seems that you just added hadoop.tmp.dir and started the job. You need to restart the Hadoop daemons after adding any property to the configuration files. You have specified in your comment that you added this property at a later stage. This means that all the data and metadata, along with other temporary files, is still in the /tmp directory. Copy all of that from there into your /home/hduser/workspace directory, restart Hadoop, and re-run the job.
Do let me know the result. Thank you.
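If it is useful, here is roughly what that looks like with the Hadoop 1.x scripts; the /tmp/hadoop-hduser path is the usual default under hadoop.tmp.dir and is an assumption on my part:
# stop HDFS and MapReduce, move the old data out of /tmp, then restart
bin/stop-all.sh
cp -r /tmp/hadoop-hduser/* /home/hduser/workspace/
bin/start-all.sh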
If it is your Windows PC and you are using Cygwin to run Hadoop, then the TaskTracker will not work.
