HBase distributed log splitting keeps failing because it is unable to get a lease

We used up all the free space on our test HDFS cluster so HBase crashed. After cleaning up some space, we were able to restart HBase, but after the startup a distributed log split job keeps failing.
The job looks like this:
Splitting log file hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 into a temporary staging area.
The RegionServers try to get a lease on the file for some time:
2013-10-24 11:50:47,662 DEBUG org.apache.hadoop.hbase.regionserver.SplitLogWorker: tasks arrived or departed
2013-10-24 11:50:47,671 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker host-4,60020,1382614844870 acquired task /hbase/splitlog/hdfs%3A%2F%2F192.168.249.1%3A9000%2Fhdfs%2Fhbase%2F.logs%2Fhost-3%2C60020%2C1382113928374-splitting%2Fhost-3%252C60020%252C1382113928374.1382523937002
2013-10-24 11:50:47,672 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog: hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002, length=41274332
2013-10-24 11:50:47,672 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering lease on dfs file hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002
2013-10-24 11:50:47,673 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: recoverLease=false, attempt=0 on file=hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 after 1ms
2013-10-24 11:50:50,674 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: recoverLease=false, attempt=1 on file=hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 after 3002ms
2013-10-24 11:50:51,674 DEBUG org.apache.hadoop.hbase.util.FSHDFSUtils: isFileClosed not available
2013-10-24 11:51:51,680 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: recoverLease=false, attempt=2 on file=hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 after 64008ms
Then the Master aborts the job:
2013-10-24 11:55:48,685 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
2013-10-24 11:55:48,687 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 interrupted, resigning
java.io.InterruptedIOException
at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:136)
at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:54)
at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:780)
at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:414)
at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:112)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:211)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:179)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:118)
... 9 more
It seems to me that the problem is that the RegionServers are unable to get a lease on this file because it is already open, so I checked with sudo -u hdfs hadoop fsck /hdfs/hbase/.logs/ -openforwrite, and it confirms:
OPENFORWRITE: /hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 41274332 bytes, 1 block(s), OPENFORWRITE:
/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002: Under replicated blk_1073337163743094520_3534698. Target Replicas is 3 but found 2 replica(s).
I tried to shut down HBase, but the file stays OPENFORWRITE. How can I remove this flag?
PS: Hadoop 1.0.1, HBase 0.94.12
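For anyone who wants to script the fsck check above, here is a minimal Python sketch. It assumes the fsck output keeps the shape shown in the paste (path, byte count, block count, then OPENFORWRITE:). The function name open_for_write and the regular expression are my own, not part of Hadoop, and the sketch only lists the affected files; it does not clear the flag.

```python
import re

# Matches a line such as:
#   /path/to/file 41274332 bytes, 1 block(s), OPENFORWRITE:
# The optional leading "OPENFORWRITE: " copes with output that was
# flattened onto one line, as in the paste above.
OPEN_FILE = re.compile(
    r"^(?:OPENFORWRITE:\s*)?(?P<path>/\S+)\s+(?P<size>\d+) bytes, "
    r"(?P<blocks>\d+) block\(s\),\s+OPENFORWRITE:"
)


def open_for_write(fsck_output):
    """Return (path, size_in_bytes, block_count) for each file reported as open for write."""
    found = []
    for line in fsck_output.splitlines():
        match = OPEN_FILE.match(line.strip())
        if match:
            found.append((match["path"], int(match["size"]), int(match["blocks"])))
    return found


# Sample text copied from the fsck output quoted in the question.
sample = (
    "OPENFORWRITE: /hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/"
    "host-3%2C60020%2C1382113928374.1382523937002 41274332 bytes, 1 block(s), OPENFORWRITE:\n"
    "/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/"
    "host-3%2C60020%2C1382113928374.1382523937002: Under replicated "
    "blk_1073337163743094520_3534698. Target Replicas is 3 but found 2 replica(s)."
)

for path, size, blocks in open_for_write(sample):
    print(path, size, blocks)
```

In practice you would feed it the captured output of the fsck command from the question instead of the sample string.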

Related

Failed to start namenode: java.lang.IllegalStateException

I am using an Apache Hadoop 2.7.1 high-availability cluster that consists of two name nodes (mn1, mn2) and 3 journal nodes.
While working on the cluster I ran into the following problem: when I run start-dfs.sh, mn1 is standby and mn2 is active, but after that, if one of these two namenodes goes down, it cannot be started again.
Here are the last lines of the log of one of these two name nodes:
2017-08-05 09:37:21,063 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Need to save fs image? false (staleImage=true, haEnabled=true, isRollingUpgrade=false)
2017-08-05 09:37:21,063 INFO org.apache.hadoop.hdfs.server.namenode.NameCache: initialized with 3 entries 72 lookups
2017-08-05 09:37:21,088 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 7052 msecs
2017-08-05 09:37:21,300 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: RPC server is binding to mn2:8020
2017-08-05 09:37:21,304 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
2017-08-05 09:37:21,316 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020
2017-08-05 09:37:21,353 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemState MBean
2017-08-05 09:37:21,354 WARN org.apache.hadoop.hdfs.server.common.Util: Path /opt/hadoop/metadata_dir should be specified as a URI in configuration files. Please update hdfs configuration.
2017-08-05 09:37:21,361 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.lang.IllegalStateException
at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getNumUnderConstructionBlocks(LeaseManager.java:119)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCompleteBlocksTotal(FSNamesystem.java:5741)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startCommonServices(FSNamesystem.java:1063)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:678)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:664)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:811)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:795)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
2017-08-05 09:37:21,364 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2017-08-05 09:37:21,365 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at mn2/192.168.25.22
************************************************************/
This may be because:
1. The NameNode port may have been changed on each node.
This is a particularly vexing problem.
Swallow IllegalStateExceptions thrown by removeShutdownHook in FileSystem. The javadoc states:
public boolean removeShutdownHook(Thread hook)
Throws:
IllegalStateException - If the virtual machine is already in the process of shutting down
So if we are getting this exception, it MEANS we are already in the process of shutdown, so we CANNOT, try what we may, removeShutdownHook. If Runtime had a method Runtime.isShutdownInProgress(), we could have checked for it before the removeShutdownHook call. As it stands, there is no such method. In my opinion, this would be a good patch regardless of the needs for this JIRA.
Do not send SIGTERMs from the NM to the MR-AM in the first place. Rather, we should expose a mechanism for the NM to politely tell the AM that it is no longer needed and should shut down ASAP. Even after this, if an admin were to kill the MRAppMaster with a SIGTERM, the JobHistory would be lost, defeating the purpose of 3614.
I discovered that my problem was in the journal node and not in the namenode, even though the namenode log shows the error mentioned in the question.
jps lists the journal node, but that is misleading: the journal node service is actually shut down even though it appears in the jps output.
As a solution I ran hadoop-daemon.sh stop journalnode, then hadoop-daemon.sh start journalnode, and then the namenode started to work again.
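Since jps can list a journal node that is not actually serving, a connection test is a more direct check. Below is a minimal Python sketch. It assumes the journal nodes listen on 8485, which is the default for dfs.journalnode.rpc-address unless your hdfs-site.xml overrides it. The function names and the host names in the usage note are made up for illustration.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


def report(hosts, port=8485):
    """Map each host to True (accepting connections) or False (not reachable).

    8485 is assumed to be the JournalNode RPC port; adjust it if your
    configuration uses a different value.
    """
    return {host: port_is_open(host, port) for host in hosts}
```

For example, report(["jn1", "jn2", "jn3"]) (hypothetical host names) returns a dict showing which journal nodes accept connections. A host that shows up in jps but maps to False here is a candidate for the stop/start cycle described above.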

Yarn MapReduce approximate-pi example fails exit code 1 when run as non-hadoop user

I am running a small private cluster of linux machines with Hadoop 2.6.2 and yarn. I launch yarn jobs from a linux edge node. The canned Yarn example to approximate the value of pi works perfectly when run by the hadoop (superuser, owner of the cluster) user, but fails when run from my personal account on the edge node. In both cases (hadoop, me) I run the job exactly like this:
clott@edge: /home/hadoop/hadoop-2.6.2/bin/yarn jar /home/hadoop/hadoop-2.6.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.2.jar pi 2 5
It fails; the full output is below. I think the file-not-found exception is totally bogus. I think something causes the launch of the container to fail, and so there's no output to be found. What causes container launches to fail, and how can this be debugged?
Because this identical command works perfectly when run by the hadoop user but not when run by a different account on the same edge node, I suspect a permission or other yarn configuration problem; I don't suspect a missing-jar file problem. My personal account uses the same environment variables as the hadoop account, for what that's worth.
These questions are similar but I didn't find a solution:
https://issues.cloudera.org/browse/DISTRO-577
Running a map reduce job as a different user
Yarn MapReduce Job Issue - AM Container launch error in Hadoop 2.3.0
I have tried these remedies without any success:
In core-site.xml, set the value of hadoop.tmp.dir to /tmp/temp-${user.name}
Add my personal user account to every node in the cluster
I guess that many installations run with just a single user, but I'm trying to allow two people to work together on the cluster without trashing each other's intermediate results. Am I totally nuts?
Full output:
Number of Maps = 2
Samples per Map = 5
Wrote input for Map #0
Wrote input for Map #1
Starting Job
15/12/22 15:29:18 INFO client.RMProxy: Connecting to ResourceManager at ac1.mycompany.com/1.2.3.4:8032
15/12/22 15:29:18 INFO input.FileInputFormat: Total input paths to process : 2
15/12/22 15:29:19 INFO mapreduce.JobSubmitter: number of splits:2
15/12/22 15:29:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1450815437271_0002
15/12/22 15:29:19 INFO impl.YarnClientImpl: Submitted application application_1450815437271_0002
15/12/22 15:29:19 INFO mapreduce.Job: The url to track the job: http://ac1.mycompany.com:8088/proxy/application_1450815437271_0002/
15/12/22 15:29:19 INFO mapreduce.Job: Running job: job_1450815437271_0002
15/12/22 15:29:31 INFO mapreduce.Job: Job job_1450815437271_0002 running in uber mode : false
15/12/22 15:29:31 INFO mapreduce.Job: map 0% reduce 0%
15/12/22 15:29:31 INFO mapreduce.Job: Job job_1450815437271_0002 failed with state FAILED due to: Application application_1450815437271_0002 failed 2 times due to AM Container for appattempt_1450815437271_0002_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://ac1.mycompany.com:8088/proxy/application_1450815437271_0002/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1450815437271_0002_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
15/12/22 15:29:31 INFO mapreduce.Job: Counters: 0
Job Finished in 13.489 seconds
java.io.FileNotFoundException: File does not exist: hdfs://ac1.mycompany.com/user/clott/QuasiMonteCarlo_1450816156703_163431099/out/reduce-out
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1122)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1817)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1841)
at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:314)
at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Yes, Manjunath Ballur, you were right: it was a permissions problem! I finally learned how to preserve the yarn application logs, which clearly revealed the problem. Here are the steps:
Edit yarn-site.xml and add a property to delay yarn log deletion:
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>600</value>
</property>
Push yarn-site.xml to all nodes (ARGH I forgot this for a long time) and restart cluster.
Run the yarn example to estimate pi as shown above; it fails. Look at http://namenode:8088/cluster/apps/FAILED to see the failed application, click on the link for the most recent failure, and look at the bottom to see which nodes in the cluster were used.
Open a window on one of the nodes in the cluster where the app failed. Find the job directory, which in my case was
~hadoop/hadoop-2.6.2/logs/userlogs/application_1450815437271_0004/container_1450815437271_0004_01_000001/
Et voila, I saw files stdout (only log4j bitching), stderr (nearly empty) and syslog (winner winner chicken dinner). In the syslog file I found this gem:
2015-12-23 08:31:42,376 INFO [main] org.apache.hadoop.service.AbstractService: Service JobHistoryEventHandler failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=clott, access=EXECUTE, inode="/tmp/hadoop-yarn/staging/history":hadoop:supergroup:drwxrwx---
So the problem was permissions on hdfs:///tmp/hadoop-yarn/staging/history. A simple chmod 777 put me right, I'm not fighting the group perms anymore. Now a non-hadoop non-superuser can run a yarn job.
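To pull the relevant fields out of a syslog line like the one above without reading it by eye, here is a minimal Python sketch. It assumes the message format matches the line quoted above (user=, access=, inode="path":owner:group:mode). The function name parse_denial and the regular expression are my own, not anything shipped with Hadoop.

```python
import re

# Pattern for the "Permission denied" text as it appears in the syslog above.
DENIED = re.compile(
    r'Permission denied: user=(?P<user>[^,]+), access=(?P<access>\w+), '
    r'inode="(?P<path>[^"]+)":(?P<owner>[^:]+):(?P<group>[^:]+):'
    r'(?P<mode>[-dlrwxtTsS]{10})'
)


def parse_denial(log_line):
    """Return the fields of an HDFS 'Permission denied' message as a dict, or None."""
    match = DENIED.search(log_line)
    return match.groupdict() if match else None


# The message text from the syslog quoted in this answer.
line = (
    "org.apache.hadoop.security.AccessControlException: Permission denied: "
    "user=clott, access=EXECUTE, "
    'inode="/tmp/hadoop-yarn/staging/history":hadoop:supergroup:drwxrwx---'
)

info = parse_denial(line)
print(info["user"], info["access"], info["path"], info["mode"])
```

The parsed mode drwxrwx--- shows that "other" has no bits at all, so any user who is neither hadoop nor a member of supergroup cannot even traverse the directory. That is why the chmod helped.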

Getting java.net.SocketTimeoutException when trying to run Hadoop MapReduce on a fresh install of Hortonworks

I have a fresh install of Hortonworks version 2.3_1 for Oracle VirtualBox and I get a java.net.SocketTimeoutException whenever I try to run a MapReduce job. I changed nothing other than the memory and the cores available to the VM.
Full text of run:
WARNING: Use "yarn jar" to launch YARN applications.
15/09/01 01:15:17 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/09/01 01:15:20 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/09/01 01:16:19 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/09/01 01:18:09 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-601678901-10.0.2.15-1439987491556:blk_1073742292_1499
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.2.15:52924 remote=/10.0.2.15:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2280)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:749)
15/09/01 01:18:11 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/root/.staging/job_1441069639378_0001
Exception in thread "main" java.io.IOException: All datanodes DatanodeInfoWithStorage[10.0.2.15:50010,DS-56099a5f-3cb3-426e-8e1a-ff3b53df9bf2,DISK] are bad. Aborting...
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
Full name of the OVA file I am using: Sandbox_HDP_2.3_1_virtualbox.ova
My host is a Windows 7 Home Premium machine with eight threads of execution (four hyperthreaded cores, I think).
The problem was exactly what it seemed: a timeout error. I fixed it by going to the hadoop config folder and raising all the timeouts as well as the number of retries (although from the log that didn't come into play), and by stopping unnecessary services on both the host and guest operating systems.
Thanks, sunrise76; one of those issues pointed me to the config folder.

Job tracker is not starting up

I am installing CDH4.6.0 with the help of this site. I run start-all.sh to start the services:
/etc/init.d/hadoop-hdfs-namenode start
/etc/init.d/hadoop-hdfs-datanode start
/etc/init.d/hadoop-hdfs-secondarynamenode start
/etc/init.d/hadoop-0.20-mapreduce-jobtracker start
/etc/init.d/hadoop-0.20-mapreduce-tasktracker start
bin/bash [to start bash prompt after starting services]
After executing these instructions as part of a Dockerfile, like
CMD ["start-all.sh"]
it starts all the services.
When I run jps, I can see only
jps
Namenode
Datanode
Secondary Namenode
Tasktracker
But the JobTracker is not started. The log is as follows:
2015-01-23 07:26:46,706 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
2015-01-23 07:26:46,735 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 8021
2015-01-23 07:26:46,735 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030
2015-01-23 07:26:47,725 INFO org.apache.hadoop.mapred.JobTracker: Creating the system directory
2015-01-23 07:26:47,750 WARN org.apache.hadoop.mapred.JobTracker: Failed to operate on mapred.system.dir (hdfs://localhost:8020/var/lib/hadoop-hdfs/cache/mapred/mapred/system) because of permissions.
2015-01-23 07:26:47,750 WARN org.apache.hadoop.mapred.JobTracker: This directory should be owned by the user 'mapred (auth:SIMPLE)'
2015-01-23 07:26:47,751 WARN org.apache.hadoop.mapred.JobTracker: Bailing out ...
org.apache.hadoop.security.AccessControlException: Permission denied: user=mapred, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
But when I start it again from the bash prompt, it works. Why? Any suggestions?
I can see from the log that the JobTracker comes up on port 8021, so why is it trying to operate on port 8020? Is that a problem? If so, how do I tackle it?
It seems the mapred user doesn't have permission to write files/directories inside the HDFS root directory.
Switch to the hdfs user and grant the necessary permission to the mapred user before starting the mapreduce service.
sudo -su hdfs ;
hadoop fs -chmod 777 /
/etc/init.d/hadoop-0.20-mapreduce-jobtracker stop; /etc/init.d/hadoop-0.20-mapreduce-jobtracker start
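To see why the original mode denies mapred, and which changes would let it through, here is a minimal Python sketch of the owner/group/other check. It is my own simplification of the POSIX-style model that HDFS permissions follow. It is not Hadoop code, the function name allowed is made up, and it ignores the superuser bypass and ACLs.

```python
def allowed(user, user_groups, owner, group, mode, access):
    """Evaluate an ls-style mode string such as "drwxr-xr-x".

    access is one of "READ", "WRITE", "EXECUTE". Only one class is
    consulted: owner if the user owns the inode, else group if the user
    belongs to the inode's group, else other. The superuser bypass and
    ACLs are deliberately not modelled.
    """
    bits = mode[1:]                      # drop the file-type character
    if user == owner:
        triplet = bits[0:3]
    elif group in user_groups:
        triplet = bits[3:6]
    else:
        triplet = bits[6:9]
    index = {"READ": 0, "WRITE": 1, "EXECUTE": 2}[access]
    # "-" means the bit is unset; "T"/"S" are special bits without execute.
    return triplet[index] not in "-TS"


# The situation from the log: "/" is hdfs:supergroup drwxr-xr-x.
print(allowed("mapred", {"mapred"}, "hdfs", "supergroup", "drwxr-xr-x", "WRITE"))

# After chmod 777 on "/", as suggested above.
print(allowed("mapred", {"mapred"}, "hdfs", "supergroup", "drwxrwxrwx", "WRITE"))

# A directory owned by mapred, which is what the JobTracker warning asks for.
print(allowed("mapred", {"mapred"}, "mapred", "supergroup", "drwxr-xr-x", "WRITE"))
```

The third call reflects the hint in the log itself ("This directory should be owned by the user 'mapred'"): giving mapred ownership of its system directory satisfies the check without opening up the HDFS root to everyone.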

Hortonworks HA Namenodes gives an error "Operation category READ is not supported in state standby"

The active namenode (host1) of my Hadoop HA cluster suddenly switched over to the standby namenode (host2). I could not find any error in the Hadoop logs (on any server) to identify the root cause.
After the namenodes switched, the following error appeared frequently in the HDFS logs and none of the applications could read files from HDFS.
2014-07-17 01:58:53,381 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(6769)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
Once I restart the new active node (host2), the active role switches back to the new standby node (host1). Then the cluster works as normal and users can retrieve the HDFS files again.
I'm using Hortonworks 2.1.2.0 and HDFS version 2.4.0.2.1
Edit: 21st July 2014
The following entries were found in the active namenode's log when the active-standby switch happened:
NT_SETTINGS-1675610.csv dst=null perm=null
2014-07-20 09:06:44,746 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7755)) - allowed=true ugi=storm (auth:SIMPLE) ip=/10.0.1.50 cmd=getfileinfo src=/user/tungsten/staging/LEAPSET/PRODUCTS/PRODUCTS-1380186.csv dst=null perm=null
2014-07-20 09:06:44,747 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7755)) - allowed=true ugi=storm (auth:SIMPLE) ip=/10.0.1.50 cmd=getfileinfo src=/user/tungsten/staging/LEAPSET/MERCHANT_SETTINGS/MERCHANT_SETTINGS-1695794.csv dst=null perm=null
2014-07-20 09:06:44,747 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7755)) - allowed=true ugi=storm (auth:SIMPLE) ip=/10.0.1.50 cmd=getfileinfo src=/user/tungsten/staging/LEAPSET/PRODUCTS/PRODUCTS-1399541.csv dst=null perm=null
2014-07-20 09:06:44,748 INFO namenode.FSNamesystem (FSNamesystem.java:stopActiveServices(1095)) - Stopping services started for active state
2014-07-20 09:06:44,750 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1153)) - Ending log segment 842249
2014-07-20 09:06:44,752 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(673)) - Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 4 35
2014-07-20 09:06:44,774 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(673)) - Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 24 37
2014-07-20 09:06:44,805 INFO namenode.FSNamesystem (FSNamesystem.java:run(4362)) - NameNodeEditLogRoller was interrupted, exiting
2014-07-20 09:06:44,824 INFO namenode.FileJournalManager (FileJournalManager.java:finalizeLogSegment(130)) - Finalizing edits file /ebs/hadoop/hdfs/namenode/current/edits_inprogress_0000000000000842249 -> /ebs/hadoop/hdfs/namenode/current/edits_0000000000000842249-0000000000000842250
2014-07-20 09:06:44,874 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(168)) - Shutting down CacheReplicationMonitor
2014-07-20 09:06:44,876 INFO namenode.FSNamesystem (FSNamesystem.java:startStandbyServices(1136)) - Starting services required for standby state
2014-07-20 09:06:44,927 INFO ha.EditLogTailer (EditLogTailer.java:(117)) - Will roll logs on active node at hadoop-client-us-west-1b/10.0.254.10:8020 every 120 seconds.
2014-07-20 09:06:44,929 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:start(129)) - Starting standby checkpoint thread...
Checkpointing active NN at http://hadoop-client-us-west-1b:50070
Serving checkpoints at http://hadoop-client-us-west-1a:50070
2014-07-20 09:06:44,930 INFO ipc.Server (Server.java:run(2027)) - IPC Server handler 3 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.0.1.50:57297 Call#8431877 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-07-20 09:06:44,930 INFO ipc.Server (Server.java:run(2027)) - IPC Server handler 16 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.0.1.50:57294 Call#130105071 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-07-20 09:06:44,940 INFO ipc.Server (Server.java:run(2027)) - IPC Server handler 14 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.0.1.50:57294 Call#130105072 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
Edit: 13th August 2014
We were able to find the root cause of the namenode switching: the namenode was receiving a large number of file info requests, and then the switch happened.
But we still could not resolve the "Operation category READ is not supported in state standby" error.
Edit: 7th December 2014
We found that, as a solution, the application needs to connect manually to the current active namenode once the previously active namenode has failed. Traffic for namenodes in HA mode is not automatically directed to the active node.
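The manual approach described above amounts to "try each namenode until one is not in standby". Here is a minimal Python sketch of that idea. Everything in it is hypothetical: StandbyError stands in for org.apache.hadoop.ipc.StandbyException, call_active is a made-up helper, and the fake operation only simulates a client call. It is not an HDFS client.

```python
class StandbyError(Exception):
    """Hypothetical stand-in for org.apache.hadoop.ipc.StandbyException."""


def call_active(namenodes, operation):
    """Run operation(namenode) against each candidate until one is not in standby.

    Returns (namenode, result) for the first namenode that answers.
    Raises StandbyError if every candidate reports standby.
    """
    for namenode in namenodes:
        try:
            return namenode, operation(namenode)
        except StandbyError:
            continue  # this candidate is standby; try the next one
    raise StandbyError("no active namenode among: " + ", ".join(namenodes))


def fake_get_file_info(namenode):
    # Hypothetical client call: host1 pretends to be in standby state.
    if namenode == "host1":
        raise StandbyError("Operation category READ is not supported in state standby")
    return {"path": "/user/tungsten/staging", "served_by": namenode}


served_by, result = call_active(["host1", "host2"], fake_get_file_info)
print(served_by, result)
```

HDFS clients that address the cluster through a logical nameservice with a failover proxy provider configured normally get this retry behaviour from the client library, so a hand-rolled loop like this is only a fallback for applications that talk to a fixed namenode address.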
I had the same issue. You need to update the client libraries. Use Ambari to set up Spark and have it install the client on the server. Then set your SPARK_HOME environment variable.
