OpenDaylight Data Tree poisoned after leader switch

We are using the Nitrogen (SR1) version of ODL. We are running a 2-node cluster, and when the follower becomes leader (after the server has been running for 5-6 hours) we observe the exception below in karaf.log. When this exception occurs, we are unable to access MD-SAL for any read/write operations.
We call the switchAllLocalShardsState API to make the follower the new leader, based on the "Akka Member Removed" event.
1) What do "poisoned" and "NoProgressException" refer to here?
2) Does the TERM that we pass as an argument to switchAllLocalShardsState cause this issue? If yes, please explain the significance of TERM, and also let us know why we are facing this issue only after running the server for a long time.
2018-06-04 11:58:25,452 | ERROR | tAdminThread #20 | 130 - com.fujitsu.fnc.sdn.fw-scheduler-odl - 5.1.0.SNAPSHOT | SchedulerServiceImpl | Get Schedule List Transaction failed : ReadFailedException{message=read execution failed, errorList=[RpcError [message=read execution failed, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=ReadFailedException{message=read execution failed, errorList=[RpcError [message=read execution failed, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.cluster.access.client.NoProgressException: No progress in 31198 seconds]]}]]}
2018-06-04 11:58:25,452 | WARN | tAdminThread #12 | 261 - com.fujitsu.fnc.sdnfw.security-odl - 5.1.0.SNAPSHOT | IDMLightServer | getContainer: {}
java.util.concurrent.ExecutionException: ReadFailedException{message=read execution failed, errorList=[RpcError [message=read execution failed, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.cluster.access.client.NoProgressException: No progress in 31198 seconds]]}
at org.opendaylight.yangtools.util.concurrent.MappingCheckedFuture.wrapInExecutionException(MappingCheckedFuture.java:65)[583:org.opendaylight.yangtools.util:1.2.1]
at org.opendaylight.yangtools.util.concurrent.MappingCheckedFuture.get(MappingCheckedFuture.java:78)[583:org.opendaylight.yangtools.util:1.2.1]
at com.fujitsu.fnc.sdnfw.aaa.idmlight.impl.IDMLightServer.getUsermgmt(IDMLightServer.java:3056)[261:com.fujitsu.fnc.sdnfw.security-odl:5.1.0.SNAPSHOT]
at com.fujitsu.fnc.sdnfw.aaa.idmlight.impl.IDMLightServer.init(IDMLightServer.java:1977)[261:com.fujitsu.fnc.sdnfw.security-odl:5.1.0.SNAPSHOT]
at com.fujitsu.fnc.sdnfw.aaa.idmlight.impl.IDMLightServer.handleEvent(IDMLightServer.java:3251)[261:com.fujitsu.fnc.sdnfw.security-odl:5.1.0.SNAPSHOT]
at Proxyadd8c855_db4c_4e72_b600_2ca57cba8d4d.handleEvent(Unknown Source)[:]
at org.apache.felix.eventadmin.impl.handler.EventHandlerProxy.sendEvent(EventHandlerProxy.java:415)[393:org.apache.karaf.services.eventadmin:4.0.10]
at org.apache.felix.eventadmin.impl.tasks.HandlerTask.run(HandlerTask.java:90)[393:org.apache.karaf.services.eventadmin:4.0.10]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)[:1.8.0_66]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)[:1.8.0_66]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)[:1.8.0_66]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)[:1.8.0_66]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_66]
Suppressed: java.lang.IllegalStateException: Uncaught exception occured during closing transaction
at org.opendaylight.controller.cluster.databroker.AbstractDOMBrokerTransaction.closeSubtransactions(AbstractDOMBrokerTransaction.java:92)[497:org.opendaylight.controller.sal-distributed-datastore:1.6.1]
at org.opendaylight.controller.cluster.databroker.DOMBrokerReadOnlyTransaction.close(DOMBrokerReadOnlyTransaction.java:50)[497:org.opendaylight.controller.sal-distributed-datastore:1.6.1]
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMReadTransactionAdapter.close(BindingDOMReadTransactionAdapter.java:36)[484:org.opendaylight.controller.sal-binding-broker-impl:1.6.1]
at com.fujitsu.fnc.sdnfw.aaa.idmlight.impl.IDMLightServer.getUsermgmt(IDMLightServer.java:3062)[261:com.fujitsu.fnc.sdnfw.security-odl:5.1.0.SNAPSHOT]
... 10 more
Caused by: java.lang.IllegalStateException: Connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=1}, cookie=0, poisoned=org.opendaylight.controller.cluster.access.client.NoProgressException: No progress in 31198 seconds, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data#u446.nms.fnc.fujitsu.com:2550/user/shardmanager-config/member-2-shard-default-config#-1532234096], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=absent}} has been poisoned
at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.commonEnqueue(AbstractClientConnection.java:198)[467:org.opendaylight.controller.cds-access-client:1.2.1]

The NoProgressException indicates you've enabled the new tell-based protocol that was initially introduced in Nitrogen. It's designed to be more resilient to transient communication failures, but it is still experimental and there have been fixes since Nitrogen. I would suggest disabling it.
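If it helps to locate the switch: in the stock Karaf distribution this is controlled per datastore in etc/org.opendaylight.controller.cluster.datastore.cfg (file name per the ODL clustering docs; verify against your release). A minimal sketch:

# etc/org.opendaylight.controller.cluster.datastore.cfg
# Revert to the default ask-based protocol; the tell-based protocol
# introduced in Nitrogen is the experimental one producing the
# NoProgressException/poisoning seen above.
use-tell-based-protocol=false

As far as I know the setting is read at datastore startup, so the controller needs a restart to pick it up.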
Also, there's a better way to implement a 2-node primary/secondary setup: make the secondary non-voting, then promote the secondary to leader by switching it to voting when the primary fails. This is documented in the online clustering guide in the Geo Redundancy section, which describes it for 6 nodes, but you can use it for 2 nodes; it's the same concept. A sketch of the promotion step follows.
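Here is a sketch of the promotion using the cluster-admin RPCs (member names, credentials, and the RESTCONF URL are placeholders; check the input structure against the cluster-admin.yang shipped with your release):

# Flip member-2 (the surviving secondary) to voting and member-1 to
# non-voting so that member-2 can win the leader election.
curl -u admin:admin -H "Content-Type: application/json" -X POST \
  http://localhost:8181/restconf/operations/cluster-admin:change-member-voting-states-for-all-shards \
  -d '{"input": {"member-voting-state": [
        {"member-name": "member-1", "voting": false},
        {"member-name": "member-2", "voting": true}]}}'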

Related

ORACLE SQL Errors 127 and 396

For a long time, Oracle SQL Developer, as soon as I start the program, has given me these two errors, which I have not been able to solve or understand.
1st error
LEVEL: SEVERE
Sequence: 396
Elapsed: 60969
Source: oracle.dbtools.raptor.backgroundTask.RaptorTaskManager$1
Message: NORTHWIND expired 0.003s ago (see "Caused by:" stack trace below); reported from ExpiredTextBuffer on thread AWT-EventQueue-0 activate AWT event:
java.awt.event.FocusEvent[FOCUS_LOST,permanent,opposite=null,cause=CLEAR_GLOBAL_FOCUS_OWNER] on Editor2_MAIN at oracle.ide.model.ExpiredTextBuffer.newExpiredTextBufferException(ExpiredTextBuffer.java:55)
2nd error
LEVEL: SEVERE
Sequence: 127
Elapsed: 0
Source: oracle.ide.extension.HashStructureHook
Message: Unexpected runtime exception while delivering HashStructureHookEvent
I have tried reinstalling everything, and it is not due to lack of resources either, since the PC is quite powerful (Ryzen 9, 32 GB of RAM).

PITEST mutationCoverage is returning SocketException

While running mvn clean test verify org.pitest:pitest-maven:mutationCoverage, I am getting the exception below.
PIT >> INFO : MINION :
at org.pitest.testapi.execute.containers.UnContainer.execute(UnContainer.java:31)
at org.pitest.testapi.execute.Pitest.executeTests(Pitest.java:57)
at org.pitest.testapi.execute.Pitest.run(Pitest.java:48)
at org.pitest.coverage.execute.CoverageWorker.run(C
Caused by: java.net.SocketException: Software caused connection abort: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java
The microservice has many scenarios to execute; I would like to know how this can be fixed. I see some details at http://pitest.org/faq/ under the section "PIT is taking forever to run", but I am not sure if there is a way to increase the thread count.
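For what it's worth, the pitest-maven plugin does expose a threads option (and a timeoutConstant for slow scenarios); a minimal sketch of the plugin configuration, with an illustrative version number:

<plugin>
  <groupId>org.pitest</groupId>
  <artifactId>pitest-maven</artifactId>
  <version>1.4.3</version>
  <configuration>
    <!-- Run mutation analysis minions on several threads instead of one. -->
    <threads>4</threads>
    <!-- Extra milliseconds before a mutated test run is considered hung. -->
    <timeoutConstant>10000</timeoutConstant>
  </configuration>
</plugin>

Note that the SocketException is often a sign that a minion JVM died or timed out mid-run, so raising the timeout can matter as much as the thread count.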

Sqoop export to Teradata failing with error - Task attempt failed to report status for 600 seconds. Killing

I am facing task attempts failing with the error below, related to Teradata export (batch insert) jobs. Other jobs exporting data to Oracle etc. are running fine.
Task attempt_1234_m_000000_0 failed to report status for 600 seconds. Killing!,
java.lang.Throwable: Child Error at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: ExitCodeException exitCode=255: at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
at org.apache.hadoop.util.Shell.run(Shell.java:460) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:707)
at org.apache.hadoop.mapred.LinuxTaskController.createLogDir(LinuxTaskController.java:313) at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:295)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:215)
I can also see below error message from task stdout logs :
"main" prio=10 tid=0x00007f8824018800 nid=0x3395 runnable [0x00007f882bffb000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at com.teradata.jdbc.jdbc_4.io.TDNetworkIOIF.read(TDNetworkIOIF.java:693)
at com.teradata.jdbc.jdbc_4.io.TDPacketStream.readStream(TDPacketStream.java:774)
Hadoop version : Hadoop 2.5.0-cdh5.3.8
Specifically, it would be really helpful if you could tell me why this issue is happening.
Is this an issue related to a limit on the number of connections to Teradata?
Found the root cause of the issue:
- The Sqoop task is inserting around 4 million records into Teradata, and hence the task is fairly long-running.
- The insert query, since it is long-running, goes into the Teradata delay queue (workload management at the Teradata end, set by the DBAs), and hence the Sqoop MapReduce task does not get a response from Teradata for 600 seconds.
- Since the default task timeout is 600 seconds, the transaction was aborted by the map task, resulting in task failure.
Ref: http://apps.teradata.com/TDMO/v08n04/Tech2Tech/TechSupport/RoadRules.aspx
Solution:
1) Increase the task timeout at the MapReduce end (see the sketch below).
2) Change the configuration related to the delay queue at the Teradata end for the specific user.
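A sketch of option 1, passing the timeout as a generic Hadoop property on the Sqoop command line (the property is mapred.task.timeout on MRv1/CDH 5.x; the connection details below are placeholders):

# Raise the task timeout from the default 600 s to 30 minutes so the
# map task survives the time spent in Teradata's delay queue.
sqoop export \
  -D mapred.task.timeout=1800000 \
  --connect jdbc:teradata://tdhost/DATABASE=mydb \
  --table MY_TABLE \
  --export-dir /user/hive/warehouse/my_table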

Connection refused error in worker logs - apache storm

I see the error below in the worker logs. It happens almost every millisecond, but the cluster is running fine. I wanted to know what these errors mean and whether anyone has an idea why this would occur.
This happens on all the worker nodes.
2016-05-12T15:32:53.514-0500 b.s.m.n.Client [ERROR] connection attempt 3 to Netty-Client-xxxxx.hq.abc.com/xx.xx.xxx.xx:6700 failed: java.net.ConnectException: Connection refused: xxxxx.hq.abc.com/xx.xx.xxx.xxx:6700
And after some time I see this:
2016-05-12T15:44:25.940-0500 b.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-xxxxx.hq.abc.com/xx.xx.xxx.xxx:6700 is being closed
After struggling with this problem forever, I found that setting the storm.local.hostname property in the storm.yaml file solved the issue for me. On my laptop, I set storm.zookeeper.servers, nimbus.host, and storm.local.hostname all to "localhost".
I am using Storm version 0.10.2.
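The relevant storm.yaml entries for that single-machine setup would look roughly like this (hostnames are placeholders; on a real cluster each worker node would set storm.local.hostname to its own resolvable name):

# storm.yaml - make Storm advertise a hostname that other workers'
# Netty clients can actually reach.
storm.zookeeper.servers:
  - "localhost"
nimbus.host: "localhost"
storm.local.hostname: "localhost"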

SEVERE error writing to S3 backup

I'm running OpsCenter 5.1.1 with DataStax Enterprise 4.5.1. It's a 3-node cluster on AWS, and I'm backing up to S3 (still...). I've started seeing a new error; I think this is a different error than any I've posted before.
$ cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.8.39 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
I am seeing this error in the agent.log file
node1_agent.log: SEVERE: error after writing 15736832/16777216 bytes to https://cassandra-dev-bkup.s3.amazonaws.com/snapshots/407bb4b1-5c91-43fe-9d4f-767115668037/sstables/1430904167-reporting_test-transaction_lookup-jb-288-Index.db?partNumber=2&uploadId=.MA3X4RYssg7xL_Hr7Msgze.J4exDq9zZ_0Y7qEj9gZhJ570j73kZNr5_nbxactmPMJeKf0XyZfEC0KAplWOz9lpyRCtNeeDCvCmtEXDchH8F1J2c57aq4MrxfBcyiZr
java.io.IOException: Error writing request body to server
at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3192)
at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3175)
at com.google.common.io.CountingOutputStream.write(CountingOutputStream.java:53)
at com.google.common.io.ByteStreams.copy(ByteStreams.java:179)
at org.jclouds.http.internal.JavaUrlHttpCommandExecutorService.writePayloadToConnection(JavaUrlHttpCommandExecutorService.java:308)
at org.jclouds.http.internal.JavaUrlHttpCommandExecutorService.convert(JavaUrlHttpCommandExecutorService.java:192)
at org.jclouds.http.internal.JavaUrlHttpCommandExecutorService.convert(JavaUrlHttpCommandExecutorService.java:72)
at org.jclouds.http.internal.BaseHttpCommandExecutorService.invoke(BaseHttpCommandExecutorService.java:95)
at org.jclouds.rest.internal.InvokeSyncToAsyncHttpMethod.invoke(InvokeSyncToAsyncHttpMethod.java:128)
at org.jclouds.rest.internal.InvokeSyncToAsyncHttpMethod.apply(InvokeSyncToAsyncHttpMethod.java:94)
at org.jclouds.rest.internal.InvokeSyncToAsyncHttpMethod.apply(InvokeSyncToAsyncHttpMethod.java:55)
at org.jclouds.rest.internal.DelegatesToInvocationFunction.handle(DelegatesToInvocationFunction.java:156)
at org.jclouds.rest.internal.DelegatesToInvocationFunction.invoke(DelegatesToInvocationFunction.java:123)
at com.sun.proxy.$Proxy48.uploadPart(Unknown Source)
at org.jclouds.aws.s3.blobstore.strategy.internal.SequentialMultipartUploadStrategy.prepareUploadPart(SequentialMultipartUploadStrategy.java:111)
at org.jclouds.aws.s3.blobstore.strategy.internal.SequentialMultipartUploadStrategy.execute(SequentialMultipartUploadStrategy.java:93)
at org.jclouds.aws.s3.blobstore.AWSS3BlobStore.putBlob(AWSS3BlobStore.java:89)
at org.jclouds.blobstore2$put_blob.doInvoke(blobstore2.clj:246)
at clojure.lang.RestFn.invoke(RestFn.java:494)
at opsagent.backups.destinations$create_blob$fn__12007.invoke(destinations.clj:69)
at opsagent.backups.destinations$create_blob.invoke(destinations.clj:64)
at opsagent.backups.destinations$fn__12170.invoke(destinations.clj:192)
at opsagent.backups.destinations$fn__11799$G__11792__11810.invoke(destinations.clj:24)
at opsagent.backups.staging$start_staging_BANG_$fn__12338$state_machine__7576__auto____12339$fn__12344$fn__12375.invoke(staging.clj:61)
at opsagent.backups.staging$start_staging_BANG_$fn__12338$state_machine__7576__auto____12339$fn__12344.invoke(staging.clj:59)
at opsagent.backups.staging$start_staging_BANG_$fn__12338$state_machine__7576__auto____12339.invoke(staging.clj:56)
at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:940)
at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:944)
at clojure.core.async.impl.ioc_macros$take_BANG_$fn__7592.invoke(ioc_macros.clj:953)
at clojure.core.async.impl.channels.ManyToManyChannel$fn__4097.invoke(channels.clj:102)
at clojure.lang.AFn.run(AFn.java:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
TL;DR:
Your SSTable, which is 38866048 bytes, is both on your filesystem and on S3. This means the file has transferred over and you are in good shape. There is no need to worry about this error (though I opened an internal ticket to handle this kind of exception rather than throwing a dump).
Details: a summary of what I suspect happened:
1) There was a file transfer error when you reached 15736832 bytes of the 16777216-byte slice of the SSTable.
2) At this point OpsCenter did not finish transferring the table, nor did it leave a partial version in S3.
3) Another backup attempt later on moved the SSTable with no error, and a valid backup exists.
