What consume the computer memory in Hadoop YARN? - hadoop

I want to ask about hadoop YARN. For example before I start the daemon I have free memory "X" MB, so after I do start-all.sh to start the daemon, my free memory become "Y" MB. So I want to know in particular what service or something that consume my node memory?
Thanks all....

start-all.sh script starts the HDFS daemons: NameNode and DataNode, and YARN daemons: ResourceManager and Nodemanager. Enter the "jps" command, this will list all the running java processes. These daemons consume your CPU Memory.

Related

Hadoop HDFS start up fails requires formatting

I have a multi-node standalone hadoop cluster for HDFS. I am able to load data to HDFS, however everytime I reboot my computer and start the cluster by start-dfs.sh, I don't see the dashboard until I perform hdfs namenode -format which erases all my data.
How do I start hadoop cluster without having to go through hdfs namenode -format?
You need to shutdown hdfs and the namenode cleanly (stop-dfs) before you shutdown your computer. Otherwise, you can corrupt the namenode, causing you to need to format to get back to a clean state

Datanode is not starting in hadoop-hbase start?

I am running the following script to run all the hbase and hadoop processes in my hbase setup in virtual machine.
#!/bin/sh
start-dfs.sh
start-yarn.sh
start-hbase.sh
#hbase-daemon.sh start rest
hbase-daemon.sh start thriftr
Earlier all the processes used to run properly. But recently, I have force shutdown my virtual machine process without stopping the hbase and hadoop related processes. Then my datanode process stopped. Later I have formatted my name node process, using some suggestion on online. Now my name node comes properly but data node process does not come up. When I check the running java process (JPS) the datanode process is missing
4672 NodeManager
5474 ThriftServer
4098 NameNode
4408 SecondaryNameNode
5723 Jps
4555 ResourceManager
5372 HRegionServer
5246 HMaster
5182 HQuorumPeer
But earlier the DataNode process used to come properly. Is it because of formatting my namenode. Do I need to change any config data or someting also?

How to cleaning hadoop mapreduce memory usage?

I want to ask. I can say for example I have 10 MB memory on each node after I activate start-all.sh process. So, I run the namenode, datanode, secondary namenode, dll. But after I've done the hadoop mapreduce job, why the memory for example decrease to 5 MB for example. Whereas, the hadoop mapreduce job has done.
How can it back to the 10 MB free memory? Thanks all....
Maybe you can try the linux clear memory command :
echo 3 > /proc/sys/vm/drop_caches

Job Tracker and TaskTracker in Hadoop2.0

I Installed Hadoop 2.4.X. As expected there is no JobTracker and TaskTracker. Its Yarn based. Is there any way to make it use old JobTracker and TaskTracker for MapReduce and not based on Yarn ? In short can I make JT and TT daemons running on this ?
By default there is no configuration file for map reduce in the 2.4.x installation even though there is a file called mapred-site.xml.template.Rename the file to mapred-site.xml and remember to set the property mapred.framework.name to classic to use the job tracker and tasktracker.Also the start scripts start-all.sh cannot be used as it executes the scripts start-dfs.sh and start-yarn.sh.You need to execute the script that starts jobtracker and tasktracker.
As described above, there is no Jobtracker and Tasktracer in Hadoop 2.0 (yarn). It's better to follow this instruction (http://codesfusion.blogspot.in/2013/10/setup-hadoop-2x-220-on-ubuntu.html) to get the idea, and you will find the processes are as:
25578 ResourceManager
25411 SecondaryNameNode
447 Jps
29464 NameNode
25222 DataNode
25905 NodeManager

What is best way to start and stop hadoop ecosystem, with command line?

I see there are several ways we can start hadoop ecosystem,
start-all.sh & stop-all.sh
Which say it's deprecated use start-dfs.sh & start-yarn.sh.
start-dfs.sh, stop-dfs.sh and start-yarn.sh, stop-yarn.sh
hadoop-daemon.sh namenode/datanode and yarn-deamon.sh resourcemanager
EDIT: I think there has to be some specific use cases for each command.
start-all.sh & stop-all.sh : Used to start and stop hadoop daemons all at once. Issuing it on the master machine will start/stop the daemons on all the nodes of a cluster. Deprecated as you have already noticed.
start-dfs.sh, stop-dfs.sh and start-yarn.sh, stop-yarn.sh : Same as above but start/stop HDFS and YARN daemons separately on all the nodes from the master machine. It is advisable to use these commands now over start-all.sh & stop-all.sh
hadoop-daemon.sh namenode/datanode and yarn-deamon.sh resourcemanager : To start individual daemons on an individual machine manually. You need to go to a particular node and issue these commands.
Use case : Suppose you have added a new DN to your cluster and you need to start the DN daemon only on this machine,
bin/hadoop-daemon.sh start datanode
Note : You should have ssh enabled if you want to start all the daemons on all the nodes from one machine.
Hope this answers your query.
From Hadoop page,
start-all.sh
This will startup a Namenode, Datanode, Jobtracker and a Tasktracker on your machine.
start-dfs.sh
This will bring up HDFS with the Namenode running on the machine you ran the command on. On such a machine you would need start-mapred.sh to separately start the job tracker
start-all.sh/stop-all.sh has to be run on the master node
You would use start-all.sh on a single node cluster (i.e. where you would have all the services on the same node.The namenode is also the datanode and is the master node).
In multi-node setup,
You will use start-all.sh on the master node and would start what is necessary on the slaves as well.
Alternatively,
Use start-dfs.sh on the node you want the Namenode to run on. This will bring up HDFS with the Namenode running on the machine you ran the command on and Datanodes on the machines listed in the slaves file.
Use start-mapred.sh on the machine you plan to run the Jobtracker on. This will bring up the Map/Reduce cluster with Jobtracker running on the machine you ran the command on and Tasktrackers running on machines listed in the slaves file.
hadoop-daemon.sh as stated by Tariq is used on each individual node. The master node will not start the services on the slaves.In a single node setup this will act same as start-all.sh.In a multi-node setup you will have to access each node (master as well as slaves) and execute on each of them.
Have a look at this start-all.sh it call config followed by dfs and mapred
Starting
start-dfs.sh (starts the namenode and the datanode)
start-mapred.sh (starts the jobtracker and the tasktracker)
Stopping
stop-dfs.sh
stop-mapred.sh

Resources