Copy the HDFS metadata from the active NameNode to the standby NameNode.
bin/hdfs namenode -bootstrapStandby
19/09/01 02:40:38 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
19/09/01 02:40:38 INFO namenode.NameNode: createNameNode [-bootstrapStandby]
19/09/01 02:40:39 WARN common.Util: Path /home/kenny/hadoop-2.7.3/namenode should be specified as a URI in configuration files. Please update hdfs configuration.
19/09/01 02:40:39 WARN common.Util: Path /home/kenny/hadoop-2.7.3/namenode should be specified as a URI in configuration files. Please update hdfs configuration.
19/09/01 02:40:41 INFO ipc.Client: Retrying connect to server: node-master-2/192.168.1.170:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
19/09/01 02:40:42 INFO ipc.Client: Retrying connect to server: node-master-2/192.168.1.170:9000. Already tried 1 time(s); retry policy is RetryUpToMaximumCoun
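The repeated "Retrying connect" messages mean the NameNode at node-master-2:9000 is not reachable, so -bootstrapStandby has nothing to copy from. A minimal sanity check before re-running the bootstrap, assuming node-master-2 is meant to be the active NameNode and the Hadoop 2.7.3 layout used above:
# On node-master-2: confirm the NameNode process is actually up
jps | grep NameNode
# If it is not, start it (the JournalNodes must be running as well)
sbin/hadoop-daemon.sh start namenode
# Then, back on the standby host, re-run the bootstrap
bin/hdfs namenode -bootstrapStandby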
Core-site setting
core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://ha-cluster</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/home/kenny/hadoop-2.7.3/jn</value>
</property>
hdfs-site setting
hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/kenny/hadoop-2.7.3/namenode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ha-cluster</value>
</property>
<property>
<name>dfs.ha.namenodes.ha-cluster</name>
<value>node-master,node-master-2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ha-cluster.node-master</name>
<value>node-master:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ha-cluster.node-master-2</name>
<value>node-master-2:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.ha-cluster.node-master</name>
<value>node-master:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.ha-cluster.node-master-2</name>
<value>node-master-2:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node-master:8485;node-master-2:8485;node1:8485/ha-cluster</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.ha-cluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>node-master:2181,node-master-2:2181,node1:2181</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/kenny/.ssh/id_rsa</value>
</property>
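For reference, a typical first-time bring-up order for this QJM-based HA pair, using standard Hadoop 2.7.x commands run from the install directory (a sketch; adjust to your deployment):
# 1. Start a JournalNode on node-master, node-master-2 and node1
sbin/hadoop-daemon.sh start journalnode
# 2. Format and start the first NameNode (format only on a fresh cluster)
bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode
# 3. On the other NameNode host, copy the metadata and start it
bin/hdfs namenode -bootstrapStandby
sbin/hadoop-daemon.sh start namenode
# 4. Initialize the failover state in ZooKeeper and start a ZKFC on each NameNode host
bin/hdfs zkfc -formatZK
sbin/hadoop-daemon.sh start zkfc
# 5. Start the DataNodes (uses the slaves file)
sbin/hadoop-daemons.sh start datanode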
Related
In Hadoop 3.1.0 the NameNode is working, but the DataNode is not starting and shows the message below:
STARTUP_MSG: build = https://github.com/apache/hadoop -r 16b70619a24cdcf5d3b0fcf4b58ca77238ccbe6d; compiled by 'centos' on 2018-03-30T00:00Z
STARTUP_MSG: java = 1.8.0_231
************************************************************/
2019-11-13 20:58:38,398 INFO checker.ThrottledAsyncChecker: Scheduling a check for [DISK]file:/C:/Appliacation/hadoop-3.1.0/data/datanode
2019-11-13 20:58:38,436 WARN checker.StorageLocationChecker: Exception checking StorageLocation [DISK]file:/C:/Appliacation/hadoop-3.1.0/data/datanode
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat(Ljava/lang/String;)Lorg/apache/hadoop/io/nativeio/NativeIO$POSIX$Stat;
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.stat(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.getStat(NativeIO.java:455)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfoByNativeIO(RawLocalFileSystem.java:796)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:710)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.getPermission(RawLocalFileSystem.java:678)
at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:191)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:98)
at org.apache.hadoop.hdfs.server.datanode.StorageLocation.check(StorageLocation.java:239)
at org.apache.hadoop.hdfs.server.datanode.StorageLocation.check(StorageLocation.java:52)
at org.apache.hadoop.hdfs.server.datanode.checker.ThrottledAsyncChecker$1.call(ThrottledAsyncChecker.java:142)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-11-13 20:58:38,436 ERROR datanode.DataNode: Exception in secureMain
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
at org.apache.hadoop.hdfs.server.datanode.checker.StorageLocationChecker.check(StorageLocationChecker.java:220)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2762)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2677)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2719)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2863)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2887)
2019-11-13 20:58:38,436 INFO util.ExitUtil: Exiting with status 1: org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 0, volumes configured: 1, volumes failed: 1, volume failures tolerated: 0
2019-11-13 20:58:38,451 INFO datanode.DataNode: SHUTDOWN_MSG:
I had the same issue. I had to replace some binaries in the bin folder (see Hadoop-3.1.2: Datanode and Nodemanager shuts down), and I also made some changes in the configuration files, as follows:
1. Edit file core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://0.0.0.0:19000</value>
</property>
</configuration>
2. Edit file hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///C:/hadoop-3.1.0/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///C:/hadoop-3.1.0/data/datanode</value>
</property>
</configuration>
3. Edit file workers
localhost
4. Edit file mapred-site.xml
<configuration>
<property>
<name>mapreduce.job.user.name</name>
<value>%USERNAME%</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.apps.stagingDir</name>
<value>/user/%USERNAME%/staging</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>local</value>
</property>
</configuration>
5. Edit file yarn-site.xml
<configuration>
<property>
<name>yarn.server.resourcemanager.address</name>
<value>0.0.0.0:8020</value>
</property>
<property>
<name>yarn.server.resourcemanager.application.expiry.interval</name>
<value>60000</value>
</property>
<property>
<name>yarn.server.nodemanager.address</name>
<value>0.0.0.0:45454</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.server.nodemanager.remote-app-log-dir</name>
<value>/app-logs</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/dep/logs/userlogs</value>
</property>
<property>
<name>yarn.server.mapreduce-appmanager.attempt-listener.bindAddress</name>
<value>0.0.0.0</value>
</property>
<property>
<name>yarn.server.mapreduce-appmanager.client-service.bindAddress</name>
<value>0.0.0.0</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>-1</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>%HADOOP_CONF_DIR%,%HADOOP_COMMON_HOME%/share/hadoop/common/*,%HADOOP_COMMON_HOME%/share/hadoop/common/lib/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*</value>
</property>
</configuration>
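With those files in place, the UnsatisfiedLinkError in the log above usually comes down to the Windows native binaries (winutils.exe and hadoop.dll) missing from, or not matching, the bin folder, which is what the replaced binaries fix. A rough restart sequence on Windows after swapping them in (a sketch, not an exact transcript; %HADOOP_HOME% is assumed to point at your Hadoop 3.1.0 install directory):
:: Format the NameNode only on a fresh install (this wipes existing HDFS metadata)
hdfs namenode -format
:: Start the HDFS and YARN daemons
%HADOOP_HOME%\sbin\start-dfs.cmd
%HADOOP_HOME%\sbin\start-yarn.cmd
:: NameNode, DataNode, ResourceManager and NodeManager should all show up
jps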
I've set up a Hadoop 2.7.5 HA cluster and am running Flink 1.4.0 applications using the default YARN queue. I decided to categorize applications and run them on exclusive NodeManagers, so I labeled three nodes (each 4 cores and 2 GB RAM) as stream in queue streamQ, and three nodes (each 1 core and 1 GB RAM) as online in queue onlineQ. All the settings are displayed in the YARN web UI as desired and the nodes are identified.
Here is the capacity-scheduler.xml:
<property>
<name>yarn.scheduler.capacity.maximum-applications</name>
<value>10000</value>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.1</value>
</property>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
</property>
<property>
<name>yarn.scheduler.capacity.node-locality-delay</name>
<value>40</value>
</property>
<property>
<name>yarn.scheduler.capacity.queue-mappings</name>
<value></value>
</property>
<property>
<name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
<value>false</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>streamQ,onlineQ</value>
</property>
<!-- streamQ settings -->
<property>
<name>yarn.scheduler.capacity.root.streamQ.capacity</name>
<value>0</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.accessible-node-labels</name>
<value>stream</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.accessible-node-labels.stream.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.accessible-node-labels.stream.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.default-node-label-expression</name>
<value>stream</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.acl_submit_applications</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.acl_administer_queue</name>
<value>*</value>
</property>
<!-- onlineQ settings -->
<property>
<name>yarn.scheduler.capacity.root.onlineQ.capacity</name>
<value>0</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.accessible-node-labels</name>
<value>online</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.accessible-node-labels.online.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.accessible-node-labels.online.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.default-node-label-expression</name>
<value>online</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.acl_submit_applications</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.acl_administer_queue</name>
<value>*</value>
</property>
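For completeness, the stream and online node labels themselves have to be created and attached to the NodeManagers before this scheduler configuration does anything; a sketch with yarn rmadmin (node-label support must be enabled via yarn.node-labels.enabled, and the host names below are placeholders for the actual labeled nodes):
# Register the two labels with the ResourceManager
yarn rmadmin -addToClusterNodeLabels "stream,online"
# Attach the labels to the NodeManagers (replace the host names with yours)
yarn rmadmin -replaceLabelsOnNode "stream-host-1=stream stream-host-2=stream stream-host-3=stream"
yarn rmadmin -replaceLabelsOnNode "online-host-1=online online-host-2=online online-host-3=online"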
I run the command to start the Flink session on an edge node, with all Hadoop configuration the same as the cluster:
yarn-session.sh -n 2 -jm 768 -tm 768 -nm flink -z flink_zoo -s 3 -qu streamQ
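For readability, here is the same command with the Flink 1.4 yarn-session options spelled out (same values as above):
# -n 2         number of YARN containers to allocate (one TaskManager each)
# -jm 768      JobManager container memory in MB
# -tm 768      TaskManager container memory in MB
# -nm flink    application name shown in YARN
# -z flink_zoo ZooKeeper namespace used for HA sub-paths
# -s 3         processing slots per TaskManager
# -qu streamQ  YARN queue to submit to
yarn-session.sh -n 2 -jm 768 -tm 768 -nm flink -z flink_zoo -s 3 -qu streamQ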
It successfully uploads the Flink libs to HDFS, and in the YARN web UI I can see the application, but when it attempts to get resources it says:
2018-01-28 10:02:04,087 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster
Here is the whole logs:
2018-01-28 10:00:09,648 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
2018-01-28 10:00:09,649 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-01-28 10:00:09,650 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 768
2018-01-28 10:00:09,650 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 768
2018-01-28 10:00:09,650 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2018-01-28 10:00:09,650 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2018-01-28 10:00:09,650 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2018-01-28 10:00:09,650 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: web.port, 8081
2018-01-28 10:00:10,003 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-01-28 10:00:10,069 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to manager (auth:SIMPLE)
2018-01-28 10:00:10,377 INFO org.apache.flink.yarn.YarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=768, taskManagerMemoryMB=768, numberTaskManagers=2, slotsPerTaskManager=3}
2018-01-28 10:00:10,747 WARN org.apache.flink.yarn.YarnClusterDescriptor - The configuration directory ('/opt/flink/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
2018-01-28 10:00:10,751 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/conf/log4j.properties to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/log4j.properties
2018-01-28 10:00:11,123 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/lib/log4j-1.2.17.jar to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/lib/log4j-1.2.17.jar
2018-01-28 10:00:11,384 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/lib/flink-dist_2.11-1.4.0.jar to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/lib/flink-dist_2.11-1.4.0.jar
2018-01-28 10:00:30,986 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/lib/flink-shaded-hadoop2-uber-1.4.0.jar to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/lib/flink-shaded-hadoop2-uber-1.4.0.jar
2018-01-28 10:00:40,852 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/lib/flink-python_2.11-1.4.0.jar to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/lib/flink-python_2.11-1.4.0.jar
2018-01-28 10:00:41,017 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/lib/slf4j-log4j12-1.7.7.jar to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/lib/slf4j-log4j12-1.7.7.jar
2018-01-28 10:00:41,250 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/conf/logback.xml to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/logback.xml
2018-01-28 10:00:41,386 INFO org.apache.flink.yarn.Utils - Copying from file:/opt/flink/lib/flink-dist_2.11-1.4.0.jar to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/flink-dist_2.11-1.4.0.jar
2018-01-28 10:01:02,966 INFO org.apache.flink.yarn.Utils - Copying from /tmp/application_1517118829753_0002-flink-conf.yaml285707454205346702.tmp to hdfs://ha-cluster/user/manager/.flink/application_1517118829753_0002/application_1517118829753_0002-flink-conf.yaml285707454205346702.tmp
2018-01-28 10:01:03,601 INFO org.apache.flink.yarn.YarnClusterDescriptor - Submitting application master application_1517118829753_0002
2018-01-28 10:01:03,782 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1517118829753_0002
2018-01-28 10:01:03,783 INFO org.apache.flink.yarn.YarnClusterDescriptor - Waiting for the cluster to be allocated
2018-01-28 10:01:03,796 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deploying cluster, current state ACCEPTED
What is the problem?
Editing capacity-scheduler.xml solved the problem:
<!-- configuration of queue-root -->
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>streamQ,onlineQ</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.accessible-node-labels</name>
<value>*</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.accessible-node-labels.stream.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.accessible-node-labels.online.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default-node-label-expression</name>
<value>*</value>
</property>
<!-- configuration of queue-streamQ -->
<property>
<name>yarn.scheduler.capacity.root.streamQ.capacity</name>
<value>50</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.accessible-node-labels</name>
<value>stream</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.accessible-node-labels.stream.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.accessible-node-labels.online.capacity</name>
<value>0</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.streamQ.default-node-label-expression</name>
<value>stream</value>
</property>
<!-- configuration of queue-onlineQ -->
<property>
<name>yarn.scheduler.capacity.root.onlineQ.capacity</name>
<value>50</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.accessible-node-labels</name>
<value>online</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.accessible-node-labels.online.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.accessible-node-labels.stream.capacity</name>
<value>0</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.onlineQ.default-node-label-expression</name>
<value>online</value>
</property>
</configuration>
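A change like this can be applied without restarting the ResourceManager; a short sketch (run as the YARN admin user):
# Reload queue and label capacities from capacity-scheduler.xml
yarn rmadmin -refreshQueues
# Optionally confirm that the new capacities took effect
yarn queue -status streamQ
yarn queue -status onlineQ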
Please check your Flink app logs to see if there is some issue when connecting to the YARN ResourceManager. I also encountered this issue when using Flink on YARN with HA. I am not sure if I was the only one.
I have a cluster consisting of three nodes:
hadoop-master (namenode) 192.168.4.128
hadoop-slave-1 (secondary name node ) 192.168.4.111
hadoop-slave-3 (data node ) 192.168.4.106
The jps command on hadoop-master shows:
15799 JournalNode
15929 Jps
14978 QuorumPeerMain
but when executing the command hdfs zkfc –formatZK on the NameNode,
I get this error:
17/03/30 07:33:09 INFO zookeeper.ZooKeeper: Session: 0x15b1ecb76480000 closed
17/03/30 07:33:09 FATAL tools.DFSZKFailoverController: Got a fatal error, exiting now
org.apache.hadoop.HadoopIllegalArgumentException: Bad argument: –formatZK
at org.apache.hadoop.ha.ZKFailoverController.badArg(ZKFailoverController.java:251)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:214)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:181)
17/03/30 07:33:09 WARN ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x15b1ecb76480000
17/03/30 07:33:09 INFO zookeeper.ClientCnxn: EventThread shut down
my zoo.cfg is
initLimit=10
syncLimit=5
dataDir=/usr/local/zookeeper/data/
clientPort=2181
dataLogDir=/usr/local/log/
server.1=hadoop-master:2888:3888
server.2=hadoop-slave-1:2889:3889
server.3=hadoop-slave-2:2890:3890
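As a side note, those server.N entries only work if each ZooKeeper node has a matching myid file in the configured dataDir; a quick check for the zoo.cfg above, creating the files if they are missing (run each line on its own host):
echo 1 > /usr/local/zookeeper/data/myid   # on hadoop-master
echo 2 > /usr/local/zookeeper/data/myid   # on hadoop-slave-1
echo 3 > /usr/local/zookeeper/data/myid   # on hadoop-slave-2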
my slaves file is
hadoop-slave-1
hadoop-slave-2
hadoop-master
my core-site.xml
<property>
<name>dfs.tmp.dir</name>
<value>/opt/hadoop/data15</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:8020</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/usr/local/journal/node/local/data</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp</value>
</property>
my hdfs-site.xml is
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/data16</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/data17</value>
<final>true</final>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop-slave-1:50090</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
<final>true</final>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>hadoop-master,hadoop-slave-1</value>
<final>true</final>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.hadoop-master</name>
<value>hadoop-master:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.hadoop-slave-1</name>
<value>hadoop-slave-1:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.hadoop-master</name>
<value>hadoop-master:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.hadoop-slave-1</name>
<value>hadoop-slave-1:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop-master:8485;hadoop-slave-2:8485;hadoop-slave-1:8485/mycluster</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop-master:2181,hadoop-slave-1:2181,hadoop-slave-2:2181</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>root/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>3000</value>
</property>
I applied stop-dfs.sh on all nodes before executing hdfs zkfc –formatZK on the NameNode (hadoop-master).
Is there any wrong configuration?
And is it necessary to issue hdfs namenode -format before executing
hdfs zkfc –formatZK?
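Worth noting: the FATAL "Bad argument: –formatZK" message above shows an en dash (–) rather than a plain ASCII hyphen, which typically happens when a command is copied from a web page or a document; retyping the flag by hand avoids it:
hdfs zkfc -formatZK   # ASCII hyphen, not an en dash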
I configured short-circuit settings in both hdfs-site.xml and hbase-site.xml, and I ran importtsv on HBase to import data from HDFS into HBase on the HBase cluster. I looked over the log on each DataNode, and all DataNodes have the ConnectException I mentioned in the title.
2017-03-31 21:59:01,273 WARN [main] org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory: error creating DomainSocket
java.net.ConnectException: connect(2) error: No such file or directory when trying to connect to '50010'
at org.apache.hadoop.net.unix.DomainSocket.connect0(Native Method)
at org.apache.hadoop.net.unix.DomainSocket.connect(DomainSocket.java:250)
at org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory.createSocket(DomainSocketFactory.java:164)
at org.apache.hadoop.hdfs.BlockReaderFactory.nextDomainPeer(BlockReaderFactory.java:753)
at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:469)
at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:783)
at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:717)
at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:421)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:332)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:617)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:841)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:889)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:696)
at java.io.DataInputStream.readByte(DataInputStream.java:265)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
at org.apache.hadoop.io.WritableUtils.readVIntInRange(WritableUtils.java:348)
at org.apache.hadoop.io.Text.readString(Text.java:471)
at org.apache.hadoop.io.Text.readString(Text.java:464)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:751)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2017-03-31 21:59:01,277 WARN [main] org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache: ShortCircuitCache(0x34f7234e): failed to load 1073750370_BP-642933002-"IP_ADDRESS"-1490774107737
EDIT
hadoop 2.6.4
hbase 1.2.3
hdfs-site.xml
<property>
<name>dfs.namenode.dir</name>
<value>/home/hadoop/hdfs/nn</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/home/hadoop/hdfs/snn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hdfs/dn</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop1:50070</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop1:50090</value>
</property>
<property>
<name>dfs.namenode.rpc-address</name>
<value>hadoop1:8020</value>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>50</value>
</property>
<property>
<name>dfs.datanode.handler.count</name>
<value>50</value>
</property>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.block.local-path-access.user</name>
<value>hbase</value>
</property>
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>775</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>_PORT</value>
</property>
<property>
<name>dfs.client.domain.socket.traffic</name>
<value>true</value>
</property>
hbase-site.xml
<property>
<name>hbase.rootdir</name>
<value>hdfs://hadoop1/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>hadoop1,hadoop2,hadoop3,hadoop4,hadoop5,hadoop6,hadoop7,hadoop8</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>hbase.regionserver.handler.count</name>
<value>50</value>
</property>
<property>
<name>hfile.block.cache.size</name>
<value>0.5</value>
</property>
<property>
<name>hbase.regionserver.global.memstore.size</name>
<value>0.3</value>
</property>
<property>
<name>hbase.regionserver.global.memstore.size.lower.limit</name>
<value>0.65</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>_PORT</value>
</property>
Short-circuit reads make use of a UNIX domain socket. This is a special path in the filesystem that allows the client and the DataNodes to communicate. You will need to set a path (not a port) to this socket. The DataNode should be able to create this path.
The parent directory of the path value (for example, /var/lib/hadoop-hdfs/) must exist and should be owned by the Hadoop superuser. Also make sure no user except the HDFS user or root has access to this path.
mkdir /var/lib/hadoop-hdfs/
chown hdfs_user:hdfs_user /var/lib/hadoop-hdfs/
chmod 750 /var/lib/hadoop-hdfs/
Add this property to hdfs-site.xml on all datanodes and clients.
<property>
<name>dfs.domain.socket.path</name>
<value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
Restart the services after making the changes.
Note: Paths under /var/run or /var/lib are commonly used.
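A minimal restart sketch for a plain Apache install (assuming the stock hadoop-daemon.sh and hbase-daemon.sh scripts; with a cluster manager, restart HDFS and HBase through it instead):
# On each DataNode, after adding dfs.domain.socket.path to hdfs-site.xml
sbin/hadoop-daemon.sh stop datanode
sbin/hadoop-daemon.sh start datanode
# On each RegionServer, so the HBase side picks up the new socket path
bin/hbase-daemon.sh stop regionserver
bin/hbase-daemon.sh start regionserver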
I created an HA cluster with one DataNode, an active NameNode, a standby NameNode, and three JournalNodes.
When I put a file into HDFS I get this error:
put: Operation category READ is not supported in state standby
The put command:
./hadoop fs -put golnaz.txt /user/input
NameNode log:
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy15.rollEditLog(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:148)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:273)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$600(EditLogTailer.java:61)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:315)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297)
2016-09-15 02:07:23,961 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9000, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 103.41.177.161:45797 Call#11403 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category JOURNAL is not supported in state standby
2016-09-15 02:07:30,547 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 9000, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 103.41.177.160:39200 Call#11404 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category JOURNAL is not supported in state standby
Error in the SecondaryNameNode log:
ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category JOURNAL is not supported in state standby
And this is hdfs-site.xml:
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/root/hadoopstorage/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/root/hadoopstorage/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>ha-cluster</value>
</property>
<property>
<name>dfs.ha.namenodes.ha-cluster</name>
<value>NameNode,Standby</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ha-cluster.NameNode</name>
<value>103.41.177.161:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.ha-cluster.Standby</name>
<value>103.41.177.162:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.ha-cluster.NameNode</name>
<value>103.41.177.161:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.ha-cluster.Standby</name>
<value>103.41.177.162:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://103.41.177.161:8485;103.41.177.162:8485;103.41.177.160:8485/ha-cluster</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.ha-cluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>3000</value>
</property>
</configuration>
You didn't mention anything about automatic failover (ZKFC & ZooKeeper). Without it, HDFS won't fail over automatically.
You may try the following: check whether both of your NameNodes are in the standby state, by looking at the NameNode web consoles (or using the getServiceState command from the administrative commands). If so, manually trigger the transition using the -transitionToActive command and tail the NameNode logs at the same time. In case of a transition failure, update your post with the NameNode logs.
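The checks above map onto the hdfs haadmin tool; with the NameNode IDs from the configuration in the question (NameNode and Standby under the ha-cluster nameservice), a sketch:
# See which state each NameNode reports
hdfs haadmin -getServiceState NameNode
hdfs haadmin -getServiceState Standby
# If both report standby, manually promote one and watch its log
hdfs haadmin -transitionToActive NameNode
tail -f $HADOOP_HOME/logs/hadoop-*-namenode-*.log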