Unable to open Cassandra on Mac - cassandra-2.0

I am using the tutorial http://www.datastax.com/2012/01/working-with-apache-cassandra-on-mac-os-x
I get the following warnings and errors when I try to start Cassandra:
Class JavaLaunchHelper is implemented in both
/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/bin/java
and
/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/libinstrument.dylib.
One of the two will be used. Which one is undefined. CompilerOracle:
inline org/apache/cassandra/db/AbstractNativeCell.compareTo
(Lorg/apache/cassandra/db/composites/Composite;)I
WARN 16:12:32 JNA link failure, one or more native method will be
unavailable. WARN 16:12:32 JMX is not enabled to receive remote
connections. Please see cassandra-env.sh for more info. INFO 16:12:32
Initializing SIGAR library WARN 16:12:32 Cassandra server running in
degraded mode. Is swap disabled? : false, Address space adequate? :
false, nofile limit adequate? : true, nproc limit adequate? : false
ERROR 16:12:34 Exiting due to error while processing commit log during
initialization.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
Could not read commit log descriptor in file
./../data/commitlog/CommitLog-5-1446227619917.log at
org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:622)
[apache-cassandra-2.2.1.jar:2.2.1] at
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:302)
[apache-cassandra-2.2.1.jar:2.2.1] at
org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:147)
[apache-cassandra-2.2.1.jar:2.2.1] at
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:189)
[apache-cassandra-2.2.1.jar:2.2.1] at
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:169)
[apache-cassandra-2.2.1.jar:2.2.1] at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:266)
[apache-cassandra-2.2.1.jar:2.2.1] at
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:488)
[apache-cassandra-2.2.1.jar:2.2.1] at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:595)
[apache-cassandra-2.2.1.jar:2.2.1]

Please delete the files under data/commitlog to proceed in such cases,
but be aware that this method may discard writes that had not yet been flushed out of the commit log, i.e. you may lose data.
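A minimal sketch of that workaround, assuming a tarball install whose data directory matches the path in the error above (adjust the paths to your layout):

    # Stop Cassandra first, then remove the commit log segments named in
    # the error; they will be recreated on the next start.
    rm /path/to/cassandra/data/commitlog/CommitLog-*.log
    # Start Cassandra again in the foreground to watch the log output.
    bin/cassandra -f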

Related

DSE is not starting, stating it is unable to write to the commit log directory

I am getting the below error while starting DSE:
ERROR [main] 2020-02-26 13:08:33,269 DseModule.java:97 - {}. Exiting...
com.google.inject.CreationException: Unable to create injector, see the following errors:
1) An exception was caught and reported. Message: Unable to check disk space available to /u01/dse_ops/logs. Perhaps the Cassandra user does not have the necessary permissions
at com.datastax.bdp.DseModule.configure(Unknown Source)
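The message itself points at directory permissions. A hedged sketch of the usual fix, assuming DSE runs as the cassandra user (check your service definition for the actual account):

    # Give the DSE/Cassandra service user ownership of the log directory
    # named in the error, then try starting DSE again.
    sudo chown -R cassandra:cassandra /u01/dse_ops/logs
    sudo chmod -R u+rwX /u01/dse_ops/logs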

Java Heap Space Error | OutOfMemory while deploying EAR in WebSphere

I am getting an OutOfMemory error while deploying an EAR file (size: 230 MB) on a WebSphere server.
Sometimes the deployment succeeds after increasing the heap size.
I have analyzed the heap dump and found leak suspects, but I am not sure how to proceed from here.
Leak suspect : 217,295,824 bytes (87.23 %) of Java heap is used by 105
instances of java/util/WeakHashMap$Entry
Contains 3 instances of the following leak suspects:
- array of java/lang/Object holding 16,235,440 bytes at 0x6a696c8
- array of java/lang/Object holding 101,373,968 bytes at 0x1125c240
- array of java/lang/Object holding 13,602,688 bytes at 0x5290818
Total size : 217,295,824 bytes
Size : 1,040 bytes
Name : array of java/util/WeakHashMap$Entry
Number of children : 105
Number of parents : 1
Owner address : 0x2e41fd0
Owner object : java/util/WeakHashMap
Address : 0xb4c2dc0
First single ancestor : org/eclipse/jst/j2ee/internal/archive/JavaEEArchiveUtilities at 0xb4c2dc0
and I am getting the below error in the WAS logs:
[main] INFO deploylib - Installing application...
ADMA5016I: Installation of Kijkglas-ear-1905.01.35 started.
ADMA5058I: Application and module versions are validated with versions of deployment targets.
ADMA5018I: The EJBDeploy program is running on file /tmp/app6232412827642995266.ear.
Starting workbench.
EJB Deploy configuration directory: /var/was/profiles/AdminAgent01/ejbdeploy/configuration/
framework search path: /opt/IBM/WebSphere/8.5/deploytool/itp/plugins
build:RADWEJB95-I20150829_0214
Creating the project.
JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2019/06/07 10:42:59 - please wait.
JVMDUMP032I JVM requested System dump using '/var/was/profiles/AdminAgent01/core.20190610.104259.30244.0001.dmp' in response to an event
JVMDUMP010I System dump written to /var/was/profiles/AdminAgent01/core.20190610.104259.30244.0001.dmp
JVMDUMP032I JVM requested Heap dump using '/var/was/profiles/AdminAgent01/heapdump.20190610.104259.30244.0002.phd' in response to an event
JVMDUMP010I Heap dump written to /var/was/profiles/AdminAgent01/heapdump.20190610.104259.30244.0002.phd
JVMDUMP032I JVM requested Java dump using '/var/was/profiles/AdminAgent01/javacore.20190610.104259.30244.0003.txt' in response to an event
JVMDUMP010I Java dump written to /var/was/profiles/AdminAgent01/javacore.20190610.104259.30244.0003.txt
JVMDUMP032I JVM requested Snap dump using '/var/was/profiles/AdminAgent01/Snap.20190610.104259.30244.0004.trc' in response to an event
JVMDUMP010I Snap dump written to /var/was/profiles/AdminAgent01/Snap.20190610.104259.30244.0004.trc
JVMDUMP013I Processed dump event "systhrow", detail "java/lang/OutOfMemoryError".
An unexpected exception was thrown. Halting execution. Shutting down workbench.
Error executing deployment: java.lang.OutOfMemoryError. Error is Java heap space.
java.lang.OutOfMemoryError: Java heap space
at java.lang.Throwable.fillInStackTrace(Native Method)
at java.lang.Throwable.<init>(Throwable.java:67)
at java.lang.Throwable.<init>(Throwable.java:78)
at java.lang.Error.<init>(Error.java:82)
at java.lang.VirtualMachineError.<init>(VirtualMachineError.java:64)
at java.lang.OutOfMemoryError.<init>(OutOfMemoryError.java:69)
at java.lang.String.<init>(String.java:207)
at java.util.jar.Attributes.read(Attributes.java:424)
at java.util.jar.Manifest.read(Manifest.java:264)
at java.util.jar.Manifest.<init>(Manifest.java:82)
at java.util.jar.JarFile.getManifestFromReference(JarFile.java:200)
at java.util.jar.JarFile.getManifest(JarFile.java:182)
at sun.net.www.protocol.jar.URLJarFile.isSuperMan(URLJarFile.java:187)
at sun.net.www.protocol.jar.URLJarFile.getManifest(URLJarFile.java:155)
at java.util.jar.JarFile.maybeInstantiateVerifier(JarFile.java:387)
at java.util.jar.JarFile.getInputStream(JarFile.java:488)
at sun.net.www.protocol.jar.JarURLConnection.getInputStream(JarURLConnection.java:178)
at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
at org.apache.xerces.impl.xs.opti.SchemaParsingConfig.parse(Unknown Source)
at org.apache.xerces.impl.xs.opti.SchemaParsingConfig.parse(Unknown Source)
at org.apache.xerces.impl.xs.opti.SchemaDOMParser.parse(Unknown Source)
at org.apache.xerces.impl.xs.traversers.XSDHandler.getSchemaDocument(Unknown Source)
at org.apache.xerces.impl.xs.traversers.XSDHandler.resolveSchema(Unknown Source)
at org.apache.xerces.impl.xs.traversers.XSDHandler.constructTrees(Unknown Source)
at org.apache.xerces.impl.xs.traversers.XSDHandler.constructTrees(Unknown Source)
at org.apache.xerces.impl.xs.traversers.XSDHandler.parseSchema(Unknown Source)
at org.apache.xerces.impl.xs.XMLSchemaLoader.loadSchema(Unknown Source)
at org.apache.xerces.impl.xs.XMLSchemaValidator.findSchemaGrammar(Unknown Source)
at org.apache.xerces.impl.xs.XMLSchemaValidator.handleStartElement(Unknown Source)
at org.apache.xerces.impl.xs.XMLSchemaValidator.startElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl$NSContentDispatcher.scanRootElementHook(Unknown Source)
EJBDeploy level: #build#
ADMA5008E: The EJBDeploy program failed on file /tmp/app6232412827642995266.ear. Exception: com.ibm.etools.ejbdeploy.EJBDeploymentException: Error executing EJBDeploy
ADMA0063E: An error occurred during Enterprise JavaBeans (EJB) deployment. Exception: com.ibm.etools.ejbdeploy.EJBDeploymentException: Error executing EJBDeploy
ADMA5011I: The cleanup of the temp directory for application Kijkglas-ear-1905.01.35 is complete.
ADMA5014E: The installation of application Kijkglas-ear-1905.01.35 failed.
2019-06-10 10:43:05,625
[main] FATAL deploylib - Jython Exception in deploy.py : 2019-06-10 10:43:05,630
[main] FATAL deploylib - Traceback (most recent call last): 2019-06-10 10:43:05,630
[main] FATAL deploylib - File "/opt/Nolio/work/WAS/gMyAppWA/all/1905.01.35/deploylib/cfgfiles/gMyAppWA-assembled.cfg", line 606, in ? application.installApplication() 2019-06-10 10:43:05,630
[main] FATAL deploylib - File "<string>", line 779, in installApplication 2019-06-10 10:43:05,630
[main] FATAL deploylib - com.ibm.ws.scripting.ScriptingException: com.ibm.ws.scripting.ScriptingException: WASX7132E: Application install for /opt/Nolio/work/WAS/gMyAppWA/all/1905.01.35/Kijkglas-ear-1905.01.35.ear failed: see previous messages for details. [2019-06-10 10:43:05] [/opt/Nolio/work/WAS/gMyAppWA/all/1905.01.35/deploylib/deploy.ksh] [ERROR] Command /var/was/profiles/AppSrv01/bin/wsadmin.sh -javaoption -Duser.timezone=CET -f deploy.py /opt/Nolio/work/WAS/gMyAppWA/all/1905.01.35/deploylib/cfgfiles/gMyAppWA-assembled.cfg /opt/Nolio/work/WAS/gMyAppWA/all/1905.01.35/deploylib/cfgfiles/gMyAppWA.TST failed. [2019-06-10 10:43:05] [/opt/Nolio/work/WAS/gMyAppWA/all/1905.01.35/deploylib/deploy.ksh] [INFO ] See also deploy.log and wsadmin.log in deploylib-8.1.4 directory.
See /opt/Nolio/work/WAS/log/gMyAppWA/all/stdout.log.2019-06-10_10_37_57_285 and /opt/Nolio/work/WAS/gMyAppWA/all/1905.01.35/deploylib/deploy.log for more information
Is there any rogue process or something blocking in the background?
You didn't mention your version, your topology (single server or network deployment), or the way you deploy your app (console, wsadmin, or other).
As you can see in the log, there is an OutOfMemoryError during the EJBDeploy call.
You need to increase the memory for EJBDeploy; you can set it either in the script file or at the OS level. Check this post: Getting OutofMemory condition while deploying a large application in WebSphere Application Server
1) Set it in the install-root/deploytool/itp/ejbdeploy.sh file:
add EJBDEPLOY_JVM_HEAP="-Xms1024m -Xmx1024m" at the beginning of the
ejbdeploy.sh file.
2) Set it in the operating system environment:
set EJBDEPLOY_JVM_HEAP='-Xms1024m -Xmx1024m' as an OS environment
variable.
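For example, a minimal sketch of the OS-level variant for a Linux shell (where you export it, and the heap sizes themselves, are up to you):

    # Export the heap setting in the environment of the user that runs
    # the deployment (e.g. from ~/.profile), then re-run the install.
    export EJBDEPLOY_JVM_HEAP='-Xms1024m -Xmx1024m'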
I'd also increase the memory for your admin server in the AdminAgent01 profile, as it looks like you are using an admin agent.
I recommend not updating ejbdeploy.sh: when you update WebSphere to a new fix pack, ejbdeploy.sh will be restored and your change will be lost.
Increasing the heap size through the admin console:
Log in to the admin console.
Go to "System administration" > "Deployment manager" > "Configuration" tab > "Server Infrastructure" section on the right > "Java and Process Management" > "Process definition".
In the "Additional Properties" section on the right, open "Environment Entries".
Create a "New" entry with the name EJBDEPLOY_JVM_HEAP and the value "-Xms256m -Xmx1024m".
Save and synchronize.
Restart the deployment manager server.

Kafka fails on start due to topic not being loaded

I have set up a Kafka server and a ZooKeeper instance on a Windows machine with help from here. I was able to create a topic, MTETest (as in the log below), and to produce and consume messages on this topic.
When I try to stop and start Kafka and ZooKeeper using the batch files that came with the installation, in an administrator command prompt, the Kafka server is unable to start, with the message below:
[2017-11-30 21:26:24,601] ERROR There was an error in one of the
threads during logs loading: java.nio.file.FileSystemException:
C:SourceKafkakafka_2.11-0.11.0.1\MTETest-0\00000000000000000000.timeindex:
The process cannot access the file because it is being used by another
process. (kafka.log.LogManager) [2017-11-30 21:26:24,603] FATAL
[Kafka Server 0], Fatal error during KafkaServer startup. Prepare to
shutdown (kafka.server.KafkaServer) java.nio.file.FileSystemException:
C:SourceKafkakafka_2.11-0.11.0.1\MTETest-0\00000000000000000000.timeindex:
The process cannot access the file because it is being used by another
process.
at
sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
at
sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at
sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
at
sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
at
sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
at java.nio.file.Files.deleteIfExists(Files.java:1165) at
kafka.log.Log$$anonfun$loadSegmentFiles$3.apply(Log.scala:318) at
kafka.log.Log$$anonfun$loadSegmentFiles$3.apply(Log.scala:279) at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at kafka.log.Log.loadSegmentFiles(Log.scala:279) at
kafka.log.Log.loadSegments(Log.scala:383) at
kafka.log.Log.(Log.scala:186) at
kafka.log.Log$.apply(Log.scala:1609) at
kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$5$$anonfun$apply$12$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:172)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57) at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748) [2017-11-30 21:26:24,606]
WARN Found a corrupted index file due to requirement failed: Corrupt
index found, index file
(C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1\MTETestTopic-0\00000000000000000000.index)
has non-zero size but the last offset is 0 which is no larger than the
base offset 0.}. deleting
C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1\MTETestTopic-0\00000000000000000000.timeindex,
C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1\MTETestTopic-0\00000000000000000000.index,
and
C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1\MTETestTopic-0\00000000000000000000.txnindex
and rebuilding index... (kafka.log.Log) [2017-11-30 21:26:24,609] INFO
[Kafka Server 0], shutting down (kafka.server.KafkaServer) [2017-11-30
21:26:24,613] INFO Terminate ZkClient event thread.
(org.I0Itec.zkclient.ZkEventThread) [2017-11-30 21:26:24,615] WARN
Found a corrupted index file due to requirement failed: Corrupt index
found, index file
(C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1__consumer_offsets-0\00000000000000000000.index)
has non-zero size but the last offset is 0 which is no larger than the
base offset 0.}. deleting
C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1__consumer_offsets-0\00000000000000000000.timeindex,
C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1__consumer_offsets-0\00000000000000000000.index,
and
C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1__consumer_offsets-0\00000000000000000000.txnindex
and rebuilding index... (kafka.log.Log) [2017-11-30 21:26:24,616] INFO
Session: 0x1600d98747a0001 closed (org.apache.zookeeper.ZooKeeper)
[2017-11-30 21:26:24,623] INFO EventThread shut down for session:
0x1600d98747a0001 (org.apache.zookeeper.ClientCnxn) [2017-11-30
21:26:24,625] INFO [Kafka Server 0], shut down completed
(kafka.server.KafkaServer) [2017-11-30 21:26:24,626] FATAL Exiting
Kafka. (kafka.server.KafkaServerStartable) [2017-11-30 21:26:24,628]
INFO [Kafka Server 0], shutting down (kafka.server.KafkaServer)
I have tried changing the setting delete.topic.enable to true in Kafka's server.properties, as suggested in a similar question here, but it did not help. Also, I did not open the topic or its related files manually. Has anyone faced this issue? Is this problem specific to Windows?
Kafka loads successfully when I delete the topic and the physical folders Kafka created for it, but that is not the right thing to do. Please suggest the correct solution.
Thanks.
According to the error message:
The process cannot access the file because it is being used by another process.
you have another process already using this file, and it is preventing Kafka from starting. See https://serverfault.com/questions/1966/how-do-you-find-what-process-is-holding-a-file-open-in-windows to find which process is holding the file open.
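A hedged sketch using the Sysinternals Handle tool (not bundled with Windows, so download it first; the topic directory name is taken from the log above):

    :: Run from an elevated command prompt. Lists every process holding
    :: a handle to a path containing the given fragment.
    handle.exe MTETest-0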

hdfs data node disconnected from namenode

From time to time I get the following errors in Cloudera Manager:
This DataNode is not connected to one or more of its NameNode(s).
and
The Cloudera Manager agent got an unexpected response from this role's web server.
(usually together, sometimes only one of them)
In most references to these errors on SO and Google, the issue is a configuration problem (and the data node never connects to the name node).
In my case the data nodes usually connect at start-up but lose the connection after some time, so it doesn't appear to be a bad configuration.
Any other options?
Is it possible to force the data node to reconnect to the name node?
Is it possible to "ping" the name node from the data node (i.e., simulate the data node's connection attempt)?
Could it be some kind of resource problem (too many open files / connections)? A quick check for this is sketched below.
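For the resource question, a hedged sketch of a quick check on the datanode host (the hdfs account name is an assumption; use whatever user your DataNode runs as):

    # Compare the per-process open-file limit with what is actually open:
    ulimit -n
    lsof -u hdfs | wc -l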
Sample logs (the errors vary from time to time):
2014-02-25 06:39:49,179 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: exception:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
2014-02-25 06:39:49,180 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.56.144.18:50010, dest: /10.56.144.28:48089, bytes: 132096, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1315770947_27, offset: 0, srvID: DS-990970275-10.56.144.18-50010-1384349167420, blockid: BP-1381780028-10.56.144.16-1384349161741:blk_-8718668700255896235_5121440, duration: 480291679056
2014-02-25 06:39:49,180 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.56.144.18, storageID=DS-990970275-10.56.144.18-50010-1384349167420, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=cluster16;nsid=7043943;c=0):Got exception while serving BP-1381780028-10.56.144.16-1384349161741:blk_-8718668700255896235_5121440 to /10.56.144.28:48089
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
2014-02-25 06:39:49,181 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: host.com:50010:DataXceiver error processing READ_BLOCK operation src: /10.56.144.28:48089 dest: /10.56.144.18:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
Hadoop uses specific ports to communicate between the DataNode and the NameNode. It could be that a firewall is blocking those specific ports. Check the default ports on the Cloudera website and test the connectivity to the NameNode on those ports.
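A hedged sketch of such a test from a datanode host (the hostname and ports are illustrative; substitute the defaults listed by Cloudera for your CDH version):

    # Test the NameNode IPC port (often 8020) and its web UI port
    # (often 50070 on clusters of this vintage):
    nc -zv namenode.example.com 8020
    nc -zv namenode.example.com 50070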
If you're using Linux, then please make sure that you have configured these properties correctly:
Disable SELinux:
type the command getenforce on the CLI; if it shows "enforcing", SELinux is enabled. Change it in the /etc/selinux/config file.
Disable the firewall.
Make sure you have the NTP service installed.
Make sure your server can SSH to all client nodes.
Make sure all the nodes have an FQDN (Fully Qualified Domain Name) and have an entry in /etc/hosts with name and IP.
If these settings are in place and the problem persists, please attach the log of one of your datanodes that got disconnected.
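A hedged sketch of the first two checks on a RHEL/CentOS-style host (commands differ on other distributions, and a full SELinux disable only takes effect after a reboot):

    getenforce                   # "Enforcing" means SELinux is active
    # Persistently disable SELinux in its config file:
    sudo sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
    # Stop the firewall (older init systems; newer ones use firewalld/systemctl):
    sudo service iptables stop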
I ran into this error:
"This DataNode is not connected to one or more of its NameNode(s)."
and I solved it by turning off safe mode and restarting the HDFS service.
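For reference, a minimal sketch of leaving safe mode by hand (run as the HDFS superuser; the restart itself is done from Cloudera Manager or your init scripts):

    hdfs dfsadmin -safemode leave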
I realize you took some steps to test this, but intermittent disconnects still make it sound like a connectivity issue.
If nodes really don't come back after a disconnect, that may be a configuration issue, which could well be completely independent from the reason why they disconnect in the first place.

Hadoop NameNode failed to start, Error: FSNamesystem initialization failed. java.io.FileNotFoundException

The exception I am getting is:
2011-07-13 12:04:13,006 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.FileNotFoundException: File does not exist: /opt/data/tmp/mapred/system/job_201107041958_0120/j^#^#^#^#^#^#
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetPermission(FSDirectory.java:544)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:724)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
2011-07-13 14:45:02,780 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.FileNotFoundException: File does not exist: /opt/data/tmp/mapred/system/job_201107041958_0120/j^#^#^#^#^#^#
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetPermission(FSDirectory.java:544)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:724)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
The problem (the edit logs got corrupted) got resolved. I used the -importCheckpoint option.
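For reference, a sketch of that recovery (Hadoop 0.20/1.x-era syntax, matching the vintage of this question; the NameNode loads the image from fs.checkpoint.dir, and it refuses to run if dfs.name.dir still contains a legal image):

    hadoop namenode -importCheckpoint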
I would just like to describe the possible scenario/reason for how the edit log corruption might have happened (correct me if I am wrong).
Below were the typical configurations in hdfs-site.xml
hadoop.tmp.dir : /opt/data/tmp
dfs.name.dir : /opt/data/name
dfs.data.dir : /opt/data/data
mapred.local.dir : ${hadoop.tmp.dir}/mapred/local
/opt/data is a mounted storage volume, 50 GB in size. The NameNode, SecondaryNameNode (${hadoop.tmp.dir}/dfs/namesecondary), and DataNode directories were all configured within /opt/data itself.
Once I moved a 3.6 GB compressed (bz2) file there, I guess the disk usage of /opt/data could have reached 100% (I checked with df -h after this incident). Then I ran Hive with a simple "Select" query; its job.jar files also need to be created within the same directory, which already had no space. This is how the edit log corruption could have occurred.
This was really a good lesson for me! I have now changed those configurations.
