Legacy UI in Hadoop 2.7.0 - hadoop

How can we enable Legacy UI in Hadoop 2.7.0?
http://localhost:50070/dfshealth.html#tab-overview

I assume you are playing with a pseudo-distributed cluster (a single node). The NameNode hosts the web UI, so you should be able to open that page as long as the NameNode is started, even with only a minimal set of properties in hdfs-site.xml and core-site.xml.
So it mostly looks like your NameNode is not running properly. The first thing to do is check the NameNode log (hadoop-hdfs-namenode-...log) and see what happened.
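For reference, here is a minimal sketch of a pseudo-distributed configuration that is enough to bring the NameNode (and its UI on port 50070) up; the values are illustrative and assume everything runs on localhost:

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value> <!-- a single replica is enough on one node -->
  </property>
</configuration>

After formatting the NameNode (hdfs namenode -format) and running start-dfs.sh, jps should list a NameNode process; if it doesn't, the UI on port 50070 will not come up.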

Related

Log aggregation retention in yarn-site: what does it mean, how does it work?

Let's look at https://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn
Here we have something like:
yarn.log-aggregation.retain-seconds
What logs are connected to this option? Hadoop DataNode? NameNode? Yarn Resource Manager?
Should I set it on all hadoop nodes? Where?
If a property starts with yarn., it only applies to YARN services: the ResourceManager, the NodeManagers, and possibly the new Timeline Server (if enabled). YARN-specific configuration settings belong in yarn-site.xml.
Similarly, HDFS-specific configuration settings are found in hdfs-site.xml; they usually start with dfs.
Common settings belong in core-site.xml.
You set up yarn-site.xml on all hosts running YARN services.
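As an illustration (a sketch, with example values), the log-aggregation settings go into yarn-site.xml on the YARN hosts like this:

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value> <!-- keep aggregated container logs for 7 days -->
</property>

These properties control the aggregated container logs that the NodeManagers upload (typically to HDFS) after an application finishes; they do not affect the NameNode or DataNode daemon logs.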

Yarn UI shows no information about applications

I know that a similar question was asked (Applications not shown in yarn UI when running mapreduce hadoop job?),
but the answers did not solve my problem.
I am running Hadoop streaming on Linux 17.01. I set up a cluster with 3 nodes and 1 master node.
When I start Hadoop, I can access localhost:50070 to see other nodes (all nodes are alive).
However, I see no information under "Application" at localhost:8088,
nor from the command "yarn application -list -appStates ALL".
Here is my configuration.
My yarn-site.xml (the same on all nodes):
Here are all the processes on the master node:
The problem may be due to the YARN services running on IPv6. I followed this thread,
https://askubuntu.com/questions/440649/how-to-disable-ipv6-in-ubuntu-14-04
to switch all YARN services to IPv4. However, there are still no tasks displayed in the YARN UI, even though I can see all the nodes in my cluster marked as "active" there.
So, I do not know why this happened. Do you have any suggestion?
Thank you very much.
I haven't typically seen YARN being configured for IPv4, but this property is added to hadoop-env.sh:
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"
You can presumably add a similar variable to yarn-env.sh for YARN_OPTS as well.
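Something along these lines, as an untested sketch:

export YARN_OPTS="$YARN_OPTS -Djava.net.preferIPv4Stack=true"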
However, it's not really clear from your question when, or if, you've actually submitted an application for anything to appear.
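A quick way to test, as a sketch (the examples jar path varies by distribution and version), is to submit one of the bundled example jobs and then list applications again:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar pi 2 10
yarn application -list -appStates ALL

If the job shows up there and under "Applications" in the ResourceManager UI on port 8088, application tracking is working.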

Hadoop client and cluster separation

I am a newbie to Hadoop, and to Linux as well. My professor asked us to separate the Hadoop client and cluster using port mapping or VPN. I don't understand the meaning of such a separation. Can anybody give me a hint?
Now I get the idea of client/cluster separation. I think Hadoop also needs to be installed on the client machine. When the client submits a Hadoop job, it is submitted to the masters of the cluster.
And I have some naive ideas:
1. Create a client machine and install Hadoop on it.
2. Set fs.default.name to hdfs://master:9000.
3. Set dfs.namenode.name.dir to file://master/home/hduser/hadoop_tmp/hdfs/namenode. Is that correct?
4. Beyond that, I don't know how to set dfs.namenode.name.dir and the other configurations.
5. I think the main idea is to set up the configuration files so that jobs run on the Hadoop cluster, but I don't know how to do it exactly.
First of all, this link has detailed information on how the client communicates with the NameNode:
http://www.informit.com/articles/article.aspx?p=2460260&seqNum=2
To my understanding, your professor wants a separate node, acting as the client, from which you can run Hadoop jobs, but that node should not be part of the Hadoop cluster itself.
Consider a scenario where you have to submit a Hadoop job from a client machine that is not part of the existing Hadoop cluster, and you expect the job to be executed on that cluster.
The NameNode and DataNodes form the Hadoop cluster; the client submits the job to the NameNode.
To achieve this, the client should have the same copy of the Hadoop distribution and configuration that is present on the NameNode.
Only then will the client know on which node the JobTracker is running, and the IP of the NameNode for accessing HDFS data.
Go through the configuration on the NameNode.
core-site.xml will have this property:
<property>
<name>fs.default.name</name>
<value>192.168.0.1:9000</value>
</property>
mapred-site.xml will have this property-
<property>
<name>mapred.job.tracker</name>
<value>192.168.0.1:8021</value>
</property>
These two important properties must be copied to the client machine's Hadoop configuration.
You also need to set one additional property in the mapred-site.xml file, to get past a Privileged Action Exception:
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/user</value>
</property>
You also need to update /etc/hosts on the client machine with the IP addresses and hostnames of the NameNode and DataNodes.
Now you can submit a job from the client machine with the hadoop jar command, and the job will be executed on the Hadoop cluster. Note that you shouldn't start any Hadoop services on the client machine.
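As a rough sketch of what that looks like in practice (the hostnames, paths, and jar/class names are placeholders; the configuration directory location depends on your installation):

# copy the cluster's configuration from the NameNode to the client
scp master:/usr/local/hadoop/etc/hadoop/core-site.xml $HADOOP_HOME/etc/hadoop/
scp master:/usr/local/hadoop/etc/hadoop/mapred-site.xml $HADOOP_HOME/etc/hadoop/
# submit from the client; the job runs on the cluster, no daemons run here
hadoop jar my-job.jar MyMainClass /input /output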
Users shouldn't be able to disrupt the functionality of the cluster. That's the meaning. Imagine there is a whole bunch of data scientists that launch their jobs from one of the cluster's masters. In case someone launches a memory-intensive operation, the master processes that are running on the same machine could end up with no memory and crash. That would leave the whole cluster in a failed state.
If you separate client node from master/slave nodes, users could still crash the client, but the cluster would stay up.

Do I need to restart ALL the Hadoop daemons whenever I make changes to the XML configuration files?

Suppose my hadoop cluster is running and I make changes to hdfs-site.xml.
My question is which services/Daemons need to be restarted in this case?
Similarly, which daemons need to be restarted if I make changes to yarn-site.xml, core-site.xml, mapred-site.xml, or allocations.xml?
Or should I restart all daemons in every case mentioned above?
I got the answer to my question.
It depends on which service's configuration properties we are changing.
For example, if we change NameNode properties in hdfs-site.xml, we need to restart the HDFS service.
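For instance, with a vanilla Apache Hadoop install, restarting just the HDFS daemons after an hdfs-site.xml change could look like this (a sketch; managed distributions such as Ambari or Cloudera Manager have their own restart workflows):

$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh
# or restart a single daemon, e.g. only the NameNode
$HADOOP_HOME/sbin/hadoop-daemon.sh stop namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode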

How do I configure and reboot an HDInsight cluster running on Azure?

Specifically, I want to change the maximum number of mappers and the maximum number of reducers for each node in an HDInsight cluster running on Microsoft Azure.
Using remote desktop, I logged in to the head node. I edited the mapred-site.xml file on the head node and changed the mapred.tasktracker.map.tasks.maximum and the mapred.tasktracker.reduce.tasks.maximum values. I tried rebooting the head node, but I was not able to reboot. I used the start-onebox.cmd and stop-onebox.cmd scripts to try and start/stop HDInsight.
I then ran a streaming mapreduce passing the desired number of reducers to the hadoop-streaming.jar, but the number of reducers was still limited by the previous value of mapred.tasktracker.reduce.tasks.maximum. Most of my reducers were pending execution.
Do I need to change the mapred-site.xml file on every node? Is there an easy way to change this, or do I need to remote desktop into every node? How do I reboot or restart the cluster so that my new values are used?
Thanks
I know it has been a while since the question was posted, but I would like to post this for other users who may find it useful.
There are two ways you can change Hadoop configuration files (such as mapred-site.xml, hive-site.xml, etc.) on HDInsight.
Option #1:
This is the easiest - you can supply the Hadoop configuration values per job, as shown in this blog.
Option #2:
You can customize the HDInsight cluster with Hadoop configuration values during provisioning of the cluster, as shown in this blog.
Manually modifying a config file is not supported, and the change will be lost when the Azure VM gets re-imaged.
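As an illustration of option #1 for a streaming job, job-level settings can be passed with the generic -D option when submitting (a sketch; the streaming jar path and the mapper/reducer scripts are placeholders):

hadoop jar hadoop-streaming.jar -D mapred.reduce.tasks=16 -input /example/input -output /example/output -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py

Note that this only works for job-level properties; daemon-level settings such as mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum are read by the TaskTrackers at startup, so they have to be set through cluster customization (option #2).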
