Hadoop slave tasktracker error "retrying to connect" - hadoop

I have a simple setup on EC2
One Master
One Slave
When I do JPS, I see everything except the tasktracker process. When I execute start-all.sh/stop-all.sh, I see the tasktracker message start,logging,stopping etc. I am able to run my pig commands in Mapreduce mode as well on the slave. But I cant do:
Cannot access the http://ec2machineSlave.com:50060/tasktracker.jsp
JPS does not show tasktracker
On Slave get error Retrying to connect to server ec2machineMaster/ipaddress:54300. Already tried x time(s)
I can SSH between Master to Slave. Created a id_dsa key on master, copied the id_dsa.pub to slave and did all the permission thing.
Can anyone suggest what could be wrong or if I need to trying something else.

Related

Hadoop: Unable to connect to Web GUI

Introduction: I'm using Ubuntu 18.04.2 LTS on which I'm trying to set up a Hadoop 3.2 Single Node Cluster. The installation goes perfectly fine, and I have Java installed. JPS is working as well.
Issue: I'm trying to connect to the Web GUI at localhost:50070, but I'm unable to. I'm attaching a snippet of my console when I execute ./start-all.sh:
root#it-research:/usr/local/hadoop/sbin# ./start-all.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [it-research]
Starting resourcemanager
Starting nodemanagers
pdsh#it-research: localhost: ssh exited with exit code 1
root#it-research:/usr/local/hadoop/sbin# jps
6032 Jps
3154 SecondaryNameNode
2596 NameNode
I'm unable to resolve localhost: ssh exited with exit code 1
Solutions I've tried:
Set up password-less SSH
Set up NameNode User
Set up PDSH to work with SSH
I've also added master [myIPAddressv4Here] in /etc/hosts file and tried connecting to master:50070. but still facing the same issue
Expected Behaviour: I should be able to connect to the Web GUI when I go to localhost:50070, but I can't.
Please let me know if there's some more information I should provide.
The port number for Hadoop 3.x is 9870, so localhost:9870 should work.

Ambari show namenode is stop but actually namenode is still working

We are using HDP 2.7.1.2.3 with Ambari 2.1.2
After finish setup, every node status is correct.
But oneday ambari suddenly show namdenode is stopped.(we don't change any config of ambari or namenode)
However, we still can use HBASE and run MapReduce.
we think name node status should be normal.
We try to restart namenode and check ambari-server log
It shows:
ServiceComponentHostImpl:949 - Host role transitioned to a new state, serviceComponentName=NAMENODE, oldState=STARTING, currentState=STARTED
HeartBeatHandler:657 - State of service component NAMENODE of service HDFS of cluster wae has changed from STARTED to INSTALLED
we don't understand why its status change from "STARTED" to "INSTALLED".
In namenode side, we check ambari-agent.log
It shows one warning:
[Alert][namenode_directory_status] HA nameservice value is present but there are no aliases for {{hdfs-site/dfs.ha.namenodes.{{ha-nameservice}}}}
We think it is irrelevant.
What's the reason that ambari think namenode is stopped?
Is there any way that we can fix this issue?
Run the command ambari-server restart from linux terminal in Ambari server node
Run the command ambari-agent restart from linux terminal in all the nodes in the cluster.
You can run the command hdfs dfsadmin -report from the terminal as hdfs user to confirm all the nodes are up and running.

org.apache.hadoop.hbase.PleaseHoldException: Master is initializing

I am trying to setup the multinode cluster of Hbase. When i do the jps on slave i get
5780 Jps
5558 HQuorumPeer
5684 HRegionServer
1963 DataNode
2093 TaskTracker
similarly on master i get
4254 SecondaryNameNode
15226 Jps
14982 HMaster
3907 NameNode
14921 HQuorumPeer
4340 JobTracker
EVerything is runnnig properly. But when i try to create table on hbase shell. It gives an error
ERROR: org.apache.hadoop.hbase.PleaseHoldException: org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
regionserver log of my slave(where region server is running):
2013-06-11 13:09:53,119 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at localhost,60000,137093$
2013-06-11 13:10:53,190 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed servers list: localhost/127.0.0.1:60000
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:425)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1124)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
at $Proxy8.getProtocolVersion(Unknown Source)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:2037)
at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2083)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:744)
at java.lang.Thread.run(Thread.java:722)
2013-06-11 13:10:53,391 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at localhost,60000,137093$
FYI, i have also took care of /etc/hosts file on both master and slave.
127.0.0.1 localhost
127.0.0.1 naresh-PC
I again did changes in /etc/hosts file 127.0.1.1 to naresh-PC. But still getting this error
2013-06-11 14:51:17,781 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at naresh-pc,60000,137094$
2013-06-11 14:52:17,817 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
java.net.UnknownHostException: unknown host: naresh-pc
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.<init>(HBaseClient.java:276)
at org.apache.hadoop.hbase.ipc.HBaseClient.createConnection(HBaseClient.java:255)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1111)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
at $Proxy8.getProtocolVersion(Unknown Source)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:2037)
at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2083)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:744)
at java.lang.Thread.run(Thread.java:722)
Try clearing all the states in Zookeeper.
Stop Zookeeper
Wipe the Zookeeper data directory
Start Zookeeper
I was getting the same issue and followed this approach and it worked fine.
You need to change the configuration on the slave node to point at the master. It is currently pointing to localhost and not connecting to the actual master:
"org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This
server is in the failed servers list: localhost/127.0.0.1:60000 at "
I'm hosting my own cluster inside Docker. Here's what worked in my case. I grepped the HBase log file for errors and found "Master passed us a different hostname to use"
`[root#docker-iop bin]# grep ERROR /var/log/hbase/hbase-hbase-regionserver-bi-mgmt01.local.log
2016-10-06 00:05:29,816 ERROR [regionserver/bi-mgmt01.local/111.11.2.3:16020] regionserver.HRegionServer: Master passed us a different hostname to use; was=my-host-name, but now=111.22.33.444'
I mapped my-host-name to 111.22.333.444 in my hosts file, restarted HBase and it worked.
I also had the same issue with a fully distributed hbase cluster with the configuration below.
Master Node (Node-A)
Backup Masters ($HBASE_HOME/conf/backup-masters) (Node-B & Node-C)
3 Replication servers (Node-A, Node-B & Node-C)
RCA:
The backup-masters nodes were attempted to be started when the cluster started.
Solution
I removed the backup masters by making $HBASE_HOME/conf/backup-masters empty in all hbase nodes.
So I had a cluster running without backup masters.
I wonder if the master node and master nodes must not also function as regionservers? The HBase documentation says otherwise though.
I came across the same issue and could not find anything, it turns out I was copy pasting from the Hbase documentation (https://hbase.apache.org/book.html#shell_exercises). I believe some character in there may be creating the error, so try to manually enter:
create 'test', 'cf'
We resolved this issue. Solution is to
stop Hbase
log to zookeeper-client as root
execute command rmr /hbase-unsecure/meta-region-server
start Hbase
We stop/start Hbase through Ambari UI, delete /hbase... through server bash shell.
[root#s1 ~]# zookeeper-client
Connecting to localhost:2181
.......
[zk: localhost:2181(CONNECTED) 0] rmr /hbase-unsecure/meta-region-server
I use docker/docker-compose to set up my distributed hbase, after I make changes, I can not create table in hbase shell.
I docker rm all the related images, and rebuild them. It works. Also, simply rebuilding the images doesn't work...

Hadoop installation - Datanode running, but not showing in JPS

I have installed CDH3U5 on a 2 node cluster. Everything seems to run fine such as all the services, web UI, MR jobs, HDFS shell commands. However, interestingly, when I started the datanode service, it gave me an OK message that datanode is running as process say X. But when I run JPS, I do not see the label "Datanode" for the process. So the output looks like -
17153 TaskTracker
18908 Jps
16267
The process ID - 16267 is the Datanode process. All other checkpoints have passed. So this seems weird. The same thing happens on the other node in the cluster. Any insight into this behavior and if this is something that needs fixing would be helpful.
can you check the following and reply?
- web interface for namenode and what does it show there for livenode
- logfiles for datanode to see if any exception
- if datanode is pingable/ssh from namenode and viceversa
If all the above look ok I'm not sure what the problem is but to fix you can
- stop all hadoop deamons
- delete temp directory pointed in conf/core-site.xml for both NN and DN
- format namenode
- start deamon

Hadoop master cannot start slave with different $HADOOP_HOME

In master, the $HADOOP_HOME is /home/a/hadoop, the slave's $HADOOP_HOME is /home/b/hadoop
In master, when I try to using start-all.sh, then the master name node start successfuly, but fails to start slave's data node with following message:
b#192.068.0.2: bash: line 0: cd: /home/b/hadoop/libexec/..: No such file or directory
b#192.068.0.2: bash: /home/b/hadoop/bin/hadoop-daemon.sh: No such file or directory
any idea on how to specify the $HADOOP_HOME for slave in master configuration?
I don't know of a way to configure different home directories for the various slaves from the master, but the Hadoop FAQ says that the Hadoop framework does not require ssh and that the DataNode and TaskTracker daemons can be started manually on each node.
I would suggest writing you own scripts to start things that take into account the specific environments of your nodes. However, make sure to include all the slaves in the master's slave file. It seems that this is necessary and that the heart beats are not enough for a master to add slaves.

Resources