zookeeper unexpected exception causing shutdown while sock still open - hadoop

I am seeing quite a few major issues in a Cloudera Hadoop 2.0 cluster that coincide with the following ZooKeeper errors, which occur many times a day.
I am unable to find the root cause of this.
Any help is appreciated.
2016-04-11 14:48:30,872 ERROR org.apache.zookeeper.server.quorum.LearnerHandler: Unexpected exception causing shutdown while sock still open
2016-04-11 14:48:49,584 ERROR org.apache.zookeeper.server.quorum.LearnerHandler: Unexpected exception causing shutdown while sock still open
2016-04-11 14:49:07,239 ERROR org.apache.zookeeper.server.quorum.LearnerHandler: Unexpected exception causing shutdown while sock still open
2016-04-11 14:49:25,291 ERROR org.apache.zookeeper.server.quorum.LearnerHandler: Unexpected exception causing shutdown while sock still open
2016-04-11 14:49:42,779 ERROR org.apache.zookeeper.server.quorum.LearnerHandler: Unexpected exception causing shutdown while sock still open
2016-04-11 14:50:00,613 ERROR org.apache.zookeeper.server.quorum.LearnerHandler: Unexpected exception causing shutdown while sock still open
2016-04-11 14:50:17,976 ERROR org.apache.zookeeper.server.quorum.LearnerHandler: Unexpected exception causing shutdown while sock still open
2016-04-11 14:50:35,957 ERROR org.apache.zookeeper.server.quorum.LearnerHandler: Unexpected exception causing shutdown while sock still open
2016-04-11 14:50:54,676 ERROR org.apache.zookeeper.server.quorum.LearnerHandler: Unexpected exception causing shutdown while sock still open

This turned out to be a two-layer issue:
The error above was happening while leader election was taking too long.
Leader election was taking too long because of corrupt data on one of the 3 ZooKeeper servers.
Once the data files under /var/lib/zookeeper on that server were blown away and ZooKeeper was restarted, leader election succeeded, which in turn stopped the errors above.
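For anyone hitting the same thing, here is a rough sketch of that recovery on the affected node. The service name, data directory and client port are assumptions about a typical Cloudera-style install; adjust them to your deployment, and only do this on the one server with bad data so the other two members can re-seed it.

sudo service zookeeper-server stop
# Move the transaction logs and snapshots aside rather than deleting them outright,
# so they can still be inspected later; the myid file stays in place.
sudo mv /var/lib/zookeeper/version-2 /var/lib/zookeeper/version-2.corrupt
sudo service zookeeper-server start
# Confirm the ensemble has a leader again: one node should report "Mode: leader",
# the others "Mode: follower".
echo srvr | nc localhost 2181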

Related

PITEST mutationCoverage is returning SocketException

While running clean test verify org.pitest:pitest-maven:mutationCoverage, I am getting the exception below.
PIT >> INFO : MINION :
at org.pitest.testapi.execute.containers.UnContainer.execute(UnContainer.java:31)
at org.pitest.testapi.execute.Pitest.executeTests(Pitest.java:57)
at org.pitest.testapi.execute.Pitest.run(Pitest.java:48)
at org.pitest.coverage.execute.CoverageWorker.run(C
Caused by: java.net.SocketException: Software caused connection abort: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java
The microservice has many scenarios to execute, and I would like to know how this can be fixed. I see some details at http://pitest.org/faq/ under the section
"PIT is taking forever to run", but I am not sure whether there is a way to increase the thread count.
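For reference, the pitest-maven plugin does expose a threads parameter (and a timeoutConstant for slow tests), so one option is to raise the thread count in the plugin configuration. A minimal sketch follows; the plugin version and the values are illustrative, not taken from the original setup.

<plugin>
  <groupId>org.pitest</groupId>
  <artifactId>pitest-maven</artifactId>
  <version>1.4.10</version>
  <configuration>
    <!-- run mutation analysis on several threads instead of one -->
    <threads>4</threads>
    <!-- extra milliseconds added to each test's timeout, for slow scenarios -->
    <timeoutConstant>8000</timeoutConstant>
  </configuration>
</plugin>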

Kafka fails on start due to topic not being loaded

I have set up a Kafka server and a ZooKeeper on a Windows machine with help from here. I was able to successfully create a topic, MTETest, as in the log below, and to produce and consume messages on this topic.
On trying to stop and start Kafka and ZooKeeper using the batch files that came with the installation, in an administrator command prompt, I am facing a problem: the Kafka server is unable to start, with the message below.
[2017-11-30 21:26:24,601] ERROR There was an error in one of the threads during logs loading: java.nio.file.FileSystemException: C:\Source\Kafka\kafka_2.11-0.11.0.1\MTETest-0\00000000000000000000.timeindex: The process cannot access the file because it is being used by another process. (kafka.log.LogManager)
[2017-11-30 21:26:24,603] FATAL [Kafka Server 0], Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.nio.file.FileSystemException: C:\Source\Kafka\kafka_2.11-0.11.0.1\MTETest-0\00000000000000000000.timeindex: The process cannot access the file because it is being used by another process.
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
at java.nio.file.Files.deleteIfExists(Files.java:1165)
at kafka.log.Log$$anonfun$loadSegmentFiles$3.apply(Log.scala:318)
at kafka.log.Log$$anonfun$loadSegmentFiles$3.apply(Log.scala:279)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at kafka.log.Log.loadSegmentFiles(Log.scala:279)
at kafka.log.Log.loadSegments(Log.scala:383)
at kafka.log.Log.<init>(Log.scala:186)
at kafka.log.Log$.apply(Log.scala:1609)
at kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$5$$anonfun$apply$12$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:172)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2017-11-30 21:26:24,606] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1\MTETestTopic-0\00000000000000000000.index) has non-zero size but the last offset is 0 which is no larger than the base offset 0.}. deleting C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1\MTETestTopic-0\00000000000000000000.timeindex, C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1\MTETestTopic-0\00000000000000000000.index, and C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1\MTETestTopic-0\00000000000000000000.txnindex and rebuilding index... (kafka.log.Log)
[2017-11-30 21:26:24,609] INFO [Kafka Server 0], shutting down (kafka.server.KafkaServer)
[2017-11-30 21:26:24,613] INFO Terminate ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
[2017-11-30 21:26:24,615] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1__consumer_offsets-0\00000000000000000000.index) has non-zero size but the last offset is 0 which is no larger than the base offset 0.}. deleting C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1__consumer_offsets-0\00000000000000000000.timeindex, C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1__consumer_offsets-0\00000000000000000000.index, and C:\Source\Kafka\kafka_2.11-0.11.0.1\SourceKafkakafka_2.11-0.11.0.1__consumer_offsets-0\00000000000000000000.txnindex and rebuilding index... (kafka.log.Log)
[2017-11-30 21:26:24,616] INFO Session: 0x1600d98747a0001 closed (org.apache.zookeeper.ZooKeeper)
[2017-11-30 21:26:24,623] INFO EventThread shut down for session: 0x1600d98747a0001 (org.apache.zookeeper.ClientCnxn)
[2017-11-30 21:26:24,625] INFO [Kafka Server 0], shut down completed (kafka.server.KafkaServer)
[2017-11-30 21:26:24,626] FATAL Exiting Kafka. (kafka.server.KafkaServerStartable)
[2017-11-30 21:26:24,628] INFO [Kafka Server 0], shutting down (kafka.server.KafkaServer)
I have tried changing the setting delete.topic.enable to true in the Kafka server.properties, as suggested in a similar question here, but it did not help. Also, I did not open the topic or its related files manually. Has anyone faced this issue? Is this problem specific to Windows?
Kafka starts successfully when I delete the topic and the physical folders Kafka created for it, but that is not the right approach. Please suggest the correct solution.
Thanks.
According to the error message:
The process cannot access the file because it is being used by another process.
another process already has this file open and is preventing Kafka from starting. See https://serverfault.com/questions/1966/how-do-you-find-what-process-is-holding-a-file-open-in-windows for how to find that process.
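For example, the Sysinternals Handle utility can show which process has the segment file open (the file-name fragment below is taken from the log above and is only illustrative):

handle.exe 00000000000000000000.timeindex

A common culprit on Windows is a leftover java.exe from the previous Kafka run that never fully shut down; once you know its PID you can end it from Task Manager or with taskkill /PID <pid> /F and then start Kafka again.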

Connection refused error in worker logs - apache storm

I see the error below in the worker logs. It happens almost every millisecond, but the cluster is running fine. I would like to know what these errors mean and why they might occur.
This happens on all the worker nodes.
2016-05-12T15:32:53.514-0500 b.s.m.n.Client [ERROR] connection attempt 3 to Netty-Client-xxxxx.hq.abc.com/xx.xx.xxx.xx:6700 failed: java.net.ConnectException: Connection refused: xxxxx.hq.abc.com/xx.xx.xxx.xxx:6700
And after some time I see this:
2016-05-12T15:44:25.940-0500 b.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-xxxxx.hq.abc.com/xx.xx.xxx.xxx:6700 is being closed
After struggling with this problem forever, I found that setting the storm.local.hostname property in the storm.yaml file solved the issue for me. On my laptop, I set storm.zookeeper.servers, nimbus.host and storm.local.hostname all to "localhost".
I am using Storm version 0.10.2.
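For a single-laptop setup like the one described, the relevant storm.yaml entries look roughly like this (the values simply spell out the localhost settings above; adjust them for a real cluster):

storm.zookeeper.servers:
  - "localhost"
nimbus.host: "localhost"
# hostname this node advertises to the rest of the cluster; setting it explicitly
# avoids relying on whatever the OS resolves, which appeared to be the source of
# the Netty connection-refused errors here
storm.local.hostname: "localhost"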

Ambari Server fatal error

WARN [qtp-ambari-agent-66] nio:720 - javax.net.ssl.SSLException: Received fatal alert: unknown_ca
I am using ambari-server version 2.1.0 with JDK 1.8. This error starts at the time of registering the nodes.
At the same time, the agent is showing this error:
Server at https://ip-xxx-xx-xx-Xx.internal:8440 is not reachable, sleeping for 10 seconds...
Run this command on each node:
sed -i 's/verify=platform_default/verify=disable/' /etc/python/cert-verification.cfg
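For context, that sed swaps verify=platform_default for verify=disable in /etc/python/cert-verification.cfg, which turns off Python's certificate verification for the agent's HTTPS connection to the server; it is a workaround rather than a fix for the unknown CA itself. A sketch of the result, assuming the stock RHEL/CentOS layout of that file, plus the agent restart needed for the change to take effect:

# /etc/python/cert-verification.cfg after the edit
[https]
verify=disable

# then, on each node, restart the agent so the new setting is picked up
ambari-agent restart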

DB connection issue with Play Framework 2.1 and BoneCP 0.8.0: This connection has been closed

I was facing an issue with BoneCP 0.7.1 on a Play Framework app using PostgreSQL 9.2.4 on Heroku. It seems this version had a DB connection leak, which after several DB accesses caused the following error:
[error] c.j.b.h.AbstractConnectionHook - Failed to acquire connection Sleeping for 1000ms and trying again. Attempts left: 1. Exception: null.Message:FATAL: too many connections for role "eonqhnjenuislk" Database warning
[error] c.j.b.PoolWatchThread - Error in trying to obtain a connection. Retrying in 1000ms
org.postgresql.util.PSQLException: FATAL: too many connections for role "eonqhnjenuislk"
at org.postgresql.core.v3.ConnectionFactoryImpl.readStartupMessages(ConnectionFactoryImpl.java:469) ~[postgresql-9.1-901.jdbc4.jar:na]
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:112) ~[postgresql-9.1-901.jdbc4.jar:na]
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:66) ~[postgresql-9.1-901.jdbc4.jar:na]
at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:125) ~[postgresql-9.1-901.jdbc4.jar:na]
at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:30) ~[postgresql-9.1-901.jdbc4.jar:na]
at org.postgresql.jdbc3g.AbstractJdbc3gConnection.<init>(AbstractJdbc3gConnection.java:22) ~[postgresql-9.1-901.jdbc4.jar:na]
As every connection in the pool was acquired and retained, the application was not reachable anymore until I restarted it.
Then I heard that this issue was corrected in BoneCP 0.8.0, so I upgraded the lib. But the issue does not seem to be completely fixed. Connections are no longer retained, which keeps the application reachable at any time, but sometimes a DB connection closes suddenly. The app then throws the following error, causing a 500 error for the end users:
javax.persistence.PersistenceException: org.postgresql.util.PSQLException: This connection has been closed.
at com.avaje.ebeaninternal.server.transaction.TransactionManager.createTransaction(TransactionManager.java:331)
at com.avaje.ebeaninternal.server.core.DefaultServer.createServerTransaction(DefaultServer.java:2056)
at com.avaje.ebeaninternal.server.core.BeanRequest.createImplicitTransIfRequired(BeanRequest.java:58)
at com.avaje.ebeaninternal.server.core.PersistRequest.initTransIfRequired(PersistRequest.java:81)
at com.avaje.ebeaninternal.server.persist.DefaultPersister.executeSqlUpdate(DefaultPersister.java:146)
at com.avaje.ebeaninternal.server.core.DefaultServer.execute(DefaultServer.java:1928)
at com.avaje.ebeaninternal.server.core.DefaultServer.execute(DefaultServer.java:1935)
at com.avaje.ebeaninternal.server.core.DefaultSqlUpdate.execute(DefaultSqlUpdate.java:148)
at actor.PublicParkingPlacesActor$1.apply(PublicParkingPlacesActor.java:41)
at actor.PublicParkingPlacesActor$1.apply(PublicParkingPlacesActor.java:26)
at play.libs.F$Promise$PromiseActor.onReceive(F.java:425)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:159)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:425)
at akka.actor.ActorCell.invoke(ActorCell.scala:386)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:230)
at akka.dispatch.Mailbox.run(Mailbox.scala:212)
at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:502)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:262)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: org.postgresql.util.PSQLException: This connection has been closed.
at org.postgresql.jdbc2.AbstractJdbc2Connection.checkClosed(AbstractJdbc2Connection.java:714)
at org.postgresql.jdbc2.AbstractJdbc2Connection.setAutoCommit(AbstractJdbc2Connection.java:661)
at com.jolbox.bonecp.ConnectionHandle.setAutoCommit(ConnectionHandle.java:1292)
at play.api.db.BoneCPApi$$anon$1.onCheckOut(DB.scala:328)
at com.jolbox.bonecp.AbstractConnectionStrategy.postConnection(AbstractConnectionStrategy.java:75)
at com.jolbox.bonecp.AbstractConnectionStrategy.getConnection(AbstractConnectionStrategy.java:92)
at com.jolbox.bonecp.BoneCP.getConnection(BoneCP.java:553)
at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:131)
at play.db.ebean.EbeanPlugin$WrappingDatasource.getConnection(EbeanPlugin.java:146)
at com.avaje.ebeaninternal.server.transaction.TransactionManager.createTransaction(TransactionManager.java:297)
... 20 more
Thanks a lot for your help!
EDIT:
DB configuration:
db.default.isolation=READ_COMMITTED
db.default.partitionCount=2
db.default.maxConnectionsPerPartition=10
db.default.minConnectionsPerPartition=5
db.default.acquireIncrement=1
db.default.acquireRetryAttempts=2
db.default.acquireRetryDelay=5 seconds
db.default.connectionTimeout=10 second
db.default.idleMaxAge=10 minute
db.default.idleConnectionTestPeriod=5 minutes
db.default.initSQL="SELECT 1"
db.default.maxConnectionAge=1 hour
EDIT 2:
Here is the DB config I set according to this post: Heroku/Play/BoneCp connection issues.
These changes reduce the number of "This connection has been closed" errors, but I still get 1 or 2 of them per day, which makes some HTTP requests fail. So the issue is still not fixed:
db.default.isolation=READ_COMMITTED
db.default.partitionCount=2
db.default.maxConnectionsPerPartition=10
db.default.minConnectionsPerPartition=5
db.default.acquireIncrement=1
db.default.acquireRetryAttempts=2
db.default.acquireRetryDelay=5 seconds
db.default.connectionTimeout=10 seconds
db.default.idleMaxAge=10 minutes
db.default.idleConnectionTestPeriod=30 seconds
db.default.initSQL="SELECT 1"
db.default.maxConnectionAge=30 minutes
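One observation about both configs above: the effective pool size is partitionCount x maxConnectionsPerPartition = 2 x 10 = 20 connections, which is right at the connection cap of the smaller Heroku Postgres plans, so any extra session (a psql console, a second dyno, a connection BoneCP has not yet noticed is dead) can be enough to trigger the "too many connections for role" error shown at the top. A slightly more conservative sizing is sketched below; the numbers are illustrative, not part of the original configuration:

db.default.partitionCount=2
# 2 x 7 = 14 connections, leaving headroom under a 20-connection cap
db.default.maxConnectionsPerPartition=7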
