Hadoop 2.7.5 Yarn HA conflict and Bug - hadoop

I have set up a Hadoop HA cluster with standby nodes for both the namenode and the resourcemanager, so that the namenode and resourcemanager processes run on both node A and node B, with one node acting as standby for each role.
When I shut down the active node (A or B), the other becomes active (tested). But if I start only one node (A or B), the namenode process is fine while the resourcemanager process is unresponsive! I checked the logs and found the following:
2018-01-27 17:01:43,371 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/192.168.1.210:8020. Already tried 68 time(s); retry policy is RetryPolicy[MultipleLinearRandomRetry[500x2000ms], TryOnceThenFail]
Node A/192.168.1.210 is the node that was not started. The resourcemanager is stuck in a loop trying to connect to node A, even though the namenode process is active on node B.
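For context, here is a minimal sketch of the HA-related hdfs-site.xml settings such a setup relies on; the nameservice name mycluster and the hostnames nodeA/nodeB are placeholders, not the actual values from my cluster:
<!-- hdfs-site.xml: logical nameservice spanning both namenodes -->
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>nodeA:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>nodeB:8020</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
Clients that address hdfs://mycluster can fail over between nn1 and nn2, while anything pointed at a single namenode host can only keep retrying that host.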
I set the following property to decrease the (500*2000) retry window (referenced here):
<property>
<name>yarn.resourcemanager.fs.state-store.retry-policy-spec</name>
<value>1000,10</value>
<!-- <value>2000,10</value> also checked -->
</property>
But it has no effect on the resourcemanager's behavior!
Is this a bug, or am I doing something wrong?

Since Hadoop 2.8, the property yarn.node-labels.fs-store.retry-policy-spec has been added to control this situation. Adding the following property solved the problem:
<property>
<name>yarn.node-labels.fs-store.retry-policy-spec</name>
<value>1000, 10</value>
</property>
Now it retries 10 times, sleeping 1000 ms between attempts, and after that it switches to the other namenode.

Related

YARN: No Active Cluster Nodes

I successfully installed Hadoop and can see the ResourceManager on the master node and NodeManagers on the slave nodes. But when I check http://hadoop-master:8088/cluster, it shows no active cluster nodes.
I also checked YARN's logs, which say: java.net.ConnectException: Your endpoint configuration is wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort
But I don't know what I did wrong; here is the configuration:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
As the image here shows, there are no active nodes:
https://i.imgur.com/mbSCiKO.png
jps on slave nodes:
9011 DataNode
10093 NodeManager
10446 Jps
jps on Master node:
32546 ResourceManager
25176 NameNode
25643 SecondaryNameNode
17629 Jps
Check that the hostname is correct on each machine:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
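As a quick sanity check (standard commands, not part of the original answer), confirm that the name resolves to the same address on the master and on every slave node:
# run on each node; both should resolve hadoop-master consistently
getent hosts hadoop-master
ping -c 1 hadoop-master
If a slave resolves hadoop-master to 127.0.0.1 (a common /etc/hosts mistake), the NodeManager registers against the wrong address and never shows up as an active node.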

Not able to run dump in pig

I am trying to dump a relation but I am getting the following error.
I have tried start-all.sh and tried formatting the namenode using hadoop namenode -format.
But I can't figure out what is wrong.
Error:
Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
Start the JobHistoryServer
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
Pig, when run in mapreduce mode, expects the JobHistoryServer to be available.
To configure the JobHistoryServer, add these properties to mapred-site.xml, replacing hostname with the actual name of the host where the process is started:
<property>
<name>mapreduce.jobhistory.address</name>
<value>hostname:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hostname:19888</value>
</property>
I would first make sure I can connect to the namenode from an hdfs client on an edge node. If not, there is some problem or inconsistency with your namenode configuration in core-site.xml, either with the port or the hostname.
Once you can run the command below without any issues, and the namenode is not in safe mode (which prevents any writes) on http://namenode_host:50070:
hadoop fs -ls /
then I would proceed with Pig. Based on your error, the hdfs client is unable to reach the namenode for some reason, which could be a firewall or configuration issue.
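A short sketch of those checks from the edge node (standard HDFS commands, assuming a Hadoop 2.x client):
# confirm which namenode URI the client is actually configured with
hdfs getconf -confKey fs.defaultFS
# confirm the namenode is reachable and not in safe mode
hdfs dfsadmin -safemode get
hadoop fs -ls /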

HBase fails to start in single node cluster mode on Mac OSX

I am trying to get a personal HBase development environment set up. I have hdfs and yarn running, but cannot get HBase to start.
I have started up hadoop 2.7.1 by running start-dfs.sh and start-yarn.sh. I have verified these are running by testing hdfs dfs -mkdir /test, running a sample MR job bundled in the examples, and browsing HDFS at port 50070.
I have started zookeeper 3.4.6 on port 2181 and set its dataDir. My zoo.cfg has:
dataDir=/Users/.../tools/hd/zookeeper_data
clientPort=2181
I see its zookeeper_server.PID file in the dataDir I chose, and when I run jps I see the following:
51074 NodeManager
50743 DataNode
50983 ResourceManager
50856 SecondaryNameNode
57848 QuorumPeerMain
58731 Jps
50653 NameNode
QuorumPeerMain above matches the PID in zookeeper_server.PID, as I would expect. Is this expectation correct? From what I have done so far, should I expect any more processes to be showing here?
I installed hbase-1.1.2 and configured hbase-site.xml. I set hbase.rootdir to hdfs://localhost:8200/hbase (my hdfs is running at localhost:8200), and I set hbase.zookeeper.property.dataDir to my zookeeper's dataDir, expecting that HBase will use this property to find the PID of a running zookeeper. Is this expectation correct, or have I misunderstood? The config in hbase-site.xml is:
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:8020/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>Users/.../tools/hd/zookeeper_data</value>
</property>
When I run start-hbase.sh my server fails to start. I see this log message:
2015-09-26 19:32:43,617 ERROR [main] master.HMasterCommandLine: Master exiting
To investigate, I ran hbase master start and got more detail:
2015-09-26 19:41:26,403 INFO [Thread-1] server.NIOServerCnxn: Stat command output
2015-09-26 19:41:26,405 INFO [Thread-1] server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:63334 (no session established for client)
2015-09-26 19:41:26,406 INFO [main] zookeeper.MiniZooKeeperCluster: Started MiniZooKeeperCluster and ran successful 'stat' on client port=2182
Could not start ZK at requested port of 2181. ZK was started at port: 2182. Aborting as clients (e.g. shell) will not be able to find this ZK quorum.
2015-09-26 19:41:26,406 ERROR [main] master.HMasterCommandLine: Master exiting
java.io.IOException: Could not start ZK at requested port of 2181. ZK was started at port: 2182. Aborting as clients (e.g. shell) will not be able to find this ZK quorum.
at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:214)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:139)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2304)
So I have a few questions:
Should I be trying to set up a zookeeper before running HBase?
Why when I have started a zookeeper and told HBase where its dataDir is, does HBase try to start its own zookeeper?
Anything obviously stupid/misguided in the above?
The script you are using to start hbase, start-hbase.sh, will try to start the following components, in order:
zookeeper
hbase master
hbase regionserver
hbase master-backup
So, you could either stop the zookeeper that you started yourself, or start the daemons individually:
# start hbase master
bin/hbase-daemon.sh --config ${HBASE_CONF_DIR} start master
# start region server
bin/hbase-daemons.sh --config ${HBASE_CONF_DIR} --hosts ${HBASE_CONF_DIR}/regionservers start regionserver
HBase standalone starts its own zookeeper (if you run start-hbase.sh), but if that zookeeper fails to start or keep running, the other HBase daemons that need it won't work.
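If you would rather keep using the ZooKeeper you started yourself, one common approach (an assumption on my side, not stated in the answer above) is to tell HBase not to manage ZooKeeper and to point it at the external quorum:
# conf/hbase-env.sh
export HBASE_MANAGES_ZK=false
<!-- hbase-site.xml -->
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
With hbase.cluster.distributed left at false, HBase runs in standalone mode and spins up its own MiniZooKeeperCluster, which is exactly the port-2181-vs-2182 conflict shown in the log above.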
Make sure you explicitly set the properties for your interface lo0 in the hbase-site.xml file:
<property>
<name>hbase.zookeeper.dns.interface</name>
<value>lo0</value>
</property>
<property>
<name>hbase.regionserver.dns.interface</name>
<value>lo0</value>
</property>
<property>
<name>hbase.master.dns.interface</name>
<value>lo0</value>
</property>
I found that when my wifi was on, zookeeper failed to start if these entries were missing.

Configure edge node to launch Hadoop jobs on cluster running on a private network

I am trying to set up an edge node for a cluster at my workplace. The cluster is CDH 5.* Hadoop YARN and has its own internal private high-speed network. The edge node is outside the private network.
I ran the steps for the hadoop client setup and configured core-site.xml:
sudo apt-get install hadoop-client
Since the cluster is hosted on its own private network, the IP addresses in the internal network are different:
10.100.100.1 - Namenode
10.100.100.2 - Data Node 1
10.100.100.4 - Data Node 2
10.100.100.6 - Data Node 3
To handle this, I requested the cluster admin to add the following properties to hdfs-site.xml on the namenode so that the listening ports are not open only to the internal IP range:
<property>
<name>dfs.namenode.servicerpc-bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>dfs.namenode.http-bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>dfs.namenode.https-bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>dfs.namenode.rpc-bind-host</name>
<value>0.0.0.0</value>
</property>
After these settings were applied and the services were restarted, I am able to run the following command:
hadoop fs -ls /user/hduser/testData/XML_Flows/test/test_input/*
This works fine. But when I try to cat the file, I get the following error:
administrator@administrator-Virtual-Machine:/etc/hadoop/conf.empty$ hadoop fs -cat /user/hduser/testData/XML_Flows/test/test_input/*
15/05/04 15:39:02 WARN hdfs.BlockReaderFactory: I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.100.100.6:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3035)
at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:744)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:659)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:327)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:574)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:797)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:844)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:78)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:104)
at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:99)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:306)
at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:278)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
15/05/04 15:39:02 WARN hdfs.DFSClient: Failed to connect to /10.100.100.6:50010 for block, add to deadNodes and continue. org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.100.100.6:50010]
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.100.100.6:50010]
The same error message is repeated multiple times.
I copied the rest of the XMLs, i.e. hdfs-site.xml, yarn-site.xml, and mapred-site.xml, from the cluster data nodes to be on the safe side, but I still got the same error. Does anyone have any idea about this error, or how to make edge nodes work with clusters running on a private network?
The username on the edge node is "administrator" whereas the cluster is configured with the "hduser" id. Could this be a problem? I have configured passwordless login between the edge node and the namenode.
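One note worth adding (my assumption, not something stated in this thread): the bind-host changes only affect the namenode; reads still go directly to the datanodes' private addresses (e.g. 10.100.100.6:50010), which the edge node cannot reach, hence the connect timeout. If the datanode hostnames are resolvable from the edge node, the HDFS client can be asked to connect by hostname instead of by the IP the namenode reports:
<!-- hdfs-site.xml on the edge node -->
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
The administrator/hduser mismatch would show up as a permission error, not as a connect timeout.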

Hadoop: datanode not connecting to namenode; on localhost:50070 the cluster summary shows 0

Logs:
2014-05-12 16:41:26,773 INFO org.apache.hadoop.ipc.RPC: Server at namenode/192.168.12.196:10001 not available yet, Zzzzz...
2014-05-12 16:41:28,777 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/192.168.12.196:10001. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://user@namenode:10001</value>
</property>
</configuration>
I put this in etc/hosts:
192.168.12.196 namenode
in masters:
user@namenode
in slaves:
localhost
and my namenode is on user@192.168.12.196
If I do jps on every node, it shows the datanode, namenode, and job/tasktracker working fine.
You need to change localhost to namenode in the slaves and masters files and restart Hadoop; then it will work fine.
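For illustration (my sketch, not part of the original answer), after that change the files on the namenode would look like:
masters:
namenode
slaves:
namenode
Entries in these files are plain hostnames, one per line; the start scripts ssh to each listed host to launch the corresponding daemon.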
Thanks for your comment.
If I put the hostname in the namenode's slaves file, it runs a datanode and the namenode on the same node.
The configuration of my masters and slaves files is as follows:
on the namenode's masters:
user@namenode
on the namenode's slaves:
hdname1@data1 (data1 maps to the node's IP and hdname1 is the user)
hdname2@data2
on the datanode's masters:
user@namenode
on the datanode's slaves:
hdname1@data1
