WebSphere server that may be hung - websphere

I am getting the below error. Kindly help.
[8/5/14 21:06:54:277 GMT-08:00] 00000091 DiscoveryTx W DCSV1115W: DCS Stack DefaultCoreGroup at Member PT_STS_HK_CELL\PT_STS_HK_APP_Node02\PT_STS_QLCOMM_CL02: Member PT_STS_HK_CELL\PT_STS_HK_APP_Node02\nodeagent connection was closed. Member will be removed from view. DCS connection status is Discovery|Ptp, transmitter closed.
[8/5/14 21:07:23:562 GMT-08:00] 00000010 MbuRmmAdapter W DCSV1115W: DCS Stack DefaultCoreGroup at Member PT_STS_HK_CELL\PT_STS_HK_APP_Node02\PT_STS_QLCOMM_CL02: Member PT_STS_HK_CELL\PT_STS_HK_APP_Node02\PT_STS_PYMTCAPTURE_CL02 connection was closed. Member will be removed from view. DCS connection status is View|Gossip, this member is suspected by the other member.
[8/5/14 21:08:00:079 GMT-08:00] 00000091 DiscoveryTx W DCSV1115W: DCS Stack DefaultCoreGroup at Member PT_STS_HK_CELL\PT_STS_HK_APP_Node02\PT_STS_QLCOMM_CL02: Member PT_STS_HK_CELL\PT_STS_HK_APP_Node02\PT_STS_DOWNSTREAM_CL02 connection was closed. Member will be removed from view. DCS connection status is Discovery|Ptp, transmitter closed.
[8/5/14 21:08:16:296 GMT-08:00] 00000010 RmmPtpGroup W DCSV1112W: DCS Stack DefaultCoreGroup at Member PT_STS_HK_CELL\PT_STS_HK_APP_Node02\PT_STS_QLCOMM_CL02: Member PT_STS_HK_CELL\PT_STS_HK_APP_Node02\PT_STS_DOWNSTREAM_CL02 failed to respond to periodic heartbeats. Member will be removed from view. Configured Timeout is 180000 milliseconds. DCS logical channel is View|Ptp.
[8/5/14 21:08:29:236 GMT-08:00] 00000091 DiscoveryTx W DCSV1115W: DCS Stack DefaultCoreGroup at Member PT_STS_HK_CELL\PT_STS_HK_APP_Node02\PT_STS_QLCOMM_CL02: Member PT_STS_HK_CELL\PT_STS_HK_DMGR_Node\dmgr connection was closed. Member will be removed from view. DCS connection status is Discovery|Ptp, transmitter closed.
[8/5/14 21:10:20:892 GMT-08:00] 00000018 ApplicationMo W DCSV0004W: DCS Stack DefaultCoreGroup at Member PT_STS_HK_CELL\PT_STS_HK_APP_Node02\PT_STS_QLCOMM_CL02: Did not receive adequate CPU time slice. Last known CPU usage time at 21:03:08:272 GMT-08:00. Inactivity duration was 402 seconds.
[8/5/14 21:11:14:131 GMT-08:00] 00000043 ThreadMonitor W WSVR0605W: Thread "WMQJCAResourceAdapter : 5" (00000067) has been active for 657039 milliseconds and may be hung. There is/are 2 thread(s) in total in the server that may be hung.
at com.ibm.ejs.ras.TraceLogger.doLog(TraceLogger.java:332)
at com.ibm.ejs.ras.TraceLogger.processEvent(TraceLogger.java:319)
at com.ibm.ws.logging.WsHandlerWrapper.publish(WsHandlerWrapper.java:43)
at java.util.logging.Logger.log(Logger.java:1121)
at com.ibm.ejs.ras.Tr.logToJSR47Logger(Tr.java:1681)
at com.ibm.ejs.ras.Tr.fireEvent(Tr.java:1643)
at com.ibm.ejs.ras.Tr.fireTraceEvent(Tr.java:1565)
at com.ibm.ejs.ras.Tr.entry(Tr.java:816)
at com.ibm.ws.sib.utils.ras.SibTr.entry(SibTr.java:912)
at com.ibm.ws.wmqcsi.trace.TraceImpl.methodExit(TraceImpl.java:349)
at com.ibm.msg.client.commonservices.trace.Trace.methodExitInternal(Trace.java:715)
at com.ibm.msg.client.commonservices.trace.Trace.exit(Trace.java:628)
at com.ibm.msg.client.wmq.v6.jms.internal.JMSMessage._setJMSXObjectProperty(JMSMessage.java:3928)
at com.ibm.msg.client.wmq.v6.jms.internal.MQJMSMessage.write(MQJMSMessage.java:1223)
at com.ibm.msg.client.wmq.v6.jms.internal.MQMessageProducer.sendInternal(MQMessageProducer.java:1139)
at com.ibm.msg.client.wmq.v6.jms.internal.MQMessageProducer.send(MQMessageProducer.java:768)
at com.ibm.msg.client.wmq.v6.jms.internal.MQMessageProducer.send(MQMessageProducer.java:2713)
at com.ibm.msg.client.jms.internal.JmsMessageProducerImpl.sendMessage(JmsMessageProducerImpl.java:872)
at com.ibm.msg.client.jms.internal.JmsMessageProducerImpl.send_(JmsMessageProducerImpl.java:727)
at com.ibm.msg.client.jms.internal.JmsMessageProducerImpl.send(JmsMessageProducerImpl.java:398)
at com.ibm.mq.jms.MQMessageProducer.send(MQMessageProducer.java:281)
at com.ibm.ejs.jms.JMSQueueSenderHandle.send(JMSQueueSenderHandle.java:204)

You are receiving CPU starvation errors (the DCSV0004W messages). This could be because you are thrashing the garbage collector, your heap is not big enough, or something else on the machine is taking up the CPU time. You need to find the process or processes that are consuming the CPU and examine why they are running high.
Regards,
Brian
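One way to check whether the garbage collector is the culprit is to look at how much of the JVM's uptime is spent in GC. Below is a minimal, hedged sketch using the standard java.lang.management MXBeans; it would need to run inside the affected server JVM (for example from a diagnostic servlet), and the 10% threshold in the comment is a rule of thumb, not an IBM figure. Enabling verbose GC in the WebSphere JVM gives the same information without code.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverheadCheck {
    public static void main(String[] args) {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        long totalGcMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": collections=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
            totalGcMs += gc.getCollectionTime();
        }
        // Rule of thumb (an assumption, not an IBM figure): if the JVM spends more than
        // roughly 10% of its uptime in GC, the heap is likely too small or is being churned.
        System.out.printf("GC time is %.1f%% of JVM uptime%n", 100.0 * totalGcMs / uptimeMs);
    }
}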

The log entry starting with
ThreadMonitor W WSVR0605W: Thread "WMQJCAResourceAdapter : 5" (00000067) has been
active for 657039 milliseconds and may be hung.
indicates that this thread has been active for that period of time, BUT the thread stack it prints shows only where the thread is at the moment the log entry is generated. The thread could have been stuck for 90% of that time at a different point in the code; the stack trace is just where it is now.
What that particular thread is doing at that point is appending an entry to the trace logs while the application attempts to send an MQ JMS message. So there is no indication that the thread is actually hung at that point.
A couple of things to try:
- Investigate the CPU usage, as the CPU starvation messages indicate that is a problem.
- Search the SystemOut.log for the corresponding messages saying the threads are no longer hung.
- Take javacores at 2 minute intervals to see which threads are moving (see the sketch after this list).
- Turn off trace unless you need it.
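The javacore comparison can also be approximated in code. The following is a hedged sketch (the 2-minute interval and output format are choices of this example, and it has to run inside the affected JVM, e.g. from a diagnostic servlet) that uses the standard ThreadMXBean API to print each thread's state and top stack frame in two samples; a thread whose top frame does not move between samples is a hung-thread candidate. Javacores triggered externally (kill -3 on the server PID) give the same information with more detail.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadMovementSampler {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int sample = 0; sample < 2; sample++) {
            System.out.println("=== sample " + (sample + 1) + " ===");
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                StackTraceElement[] stack = info.getStackTrace();
                String top = stack.length > 0 ? stack[0].toString() : "<no frames>";
                System.out.println(info.getThreadName() + " [" + info.getThreadState() + "] at " + top);
            }
            if (sample == 0) {
                Thread.sleep(2 * 60 * 1000L); // 2-minute interval, as suggested above
            }
        }
    }
}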

This is a general error that can be encountered during the server start phase.
The basic idea is that when you start the server, threads are initialized for the process/job you want to run on the server.
A thread may be waiting on resources it needs to run that process/job, and while those resources are unavailable the thread can appear hung.
One way to fix it: kill the background process that is holding those resources and causing the thread to hang.
Then start the server again.

Do the following steps:
- Ensure that the Deployment Manager is up and running
- Verify that the app server and node agent are stopped - no java processes related to the node agent or app server are running
- Go to NODE_PROFILE\bin (not the deployment manager profile)
- Run syncNode.sh/bat
- Run startNode.sh/bat
- If the node agent starts successfully you should be able to start the server from the command line or the web console

Related

(Weird) - ORA-12516 - TNS:listener could not find available handler with matching protocol stack [with only one active connection]

I am trying to run a Spring loader. The loader takes data from a CSV file and inserts it into an Oracle database. It starts well, but after processing some records I get the below error.
'ORA-12516 - TNS:listener could not find available handler with matching protocol stack'
Note : No other jobs were running at that time. Only this job was running.
processes -- 45 (CURRENT_UTILIZATION) -- 51 (MAX_UTILIZATION)
sessions -- 53 (CURRENT_UTILIZATION) -- 61 (MAX_UTILIZATION)
show parameter processes (processes - integer - 300)
show parameter session (sessions - integer - 480)
The thing is, the same batch program runs fine on another server which has the same configuration as above.
Since this is a new server, is there anything I am missing with regard to Oracle? Can someone guide me?

MQ slow dequeuing rates on an XMITQ

We have been facing an issue where the message rate of an XMITQ is very slow compared with what should be normal performance.
We have many other qmgrs with bigger MQ flows that do not have the same issue.
Our HUB qmgr connects to a business line's HUB qmgr in the same company, and even with the destination queues on their side empty the flow is really slow.
At the OS and network level they say nothing can be done. At the MQ level we have changed the buffer sizes so they match the OS level and use the system TCP windows.
Now at the MQ level we have the SDR channel set up with BATCHSZ 100, but it seems the receiver is configured with 30.
We noticed that because we see messages flow in batches of 30 messages. Also, not sure if it is related, but we see the XMITQ always has 30 uncommitted messages.
Our questions, for advice:
Would increasing the BATCHSZ parameter on the SDR/RCVR help the performance?
Is there any other parameter at MQ level that could help it?
DIS CHS(NAME) ALL
AMQ8417: Display Channel Status details.
CHANNEL(QMGRA.QMGRB.T7) CHLTYPE(SDR)
BATCHES(234) BATCHSZ(30)
BUFSRCVD(235) BUFSSENT(6391)
BYTSRCVD(6996) BYTSSENT(14396692)
CHSTADA(2020-04-16) CHSTATI(14.38.17)
COMPHDR(NONE,NONE) COMPMSG(NONE,NONE)
COMPRATE(0,0) COMPTIME(0,0)
CONNAME(159.50.69.38(48702)) CURLUWID(398F3E5EEA43381C)
CURMSGS(30) CURRENT
CURSEQNO(43488865) EXITTIME(0,0)
HBINT(300) INDOUBT(YES)
JOBNAME(000051FC00000001) LOCLADDR(10.185.8.122(54908))
LONGRTS(999999999) LSTLUWID(398F3E5EE943381C)
LSTMSGDA(2020-04-16) LSTMSGTI(14.49.46)
LSTSEQNO(43488835) MCASTAT(RUNNING)
MONCHL(HIGH) MSGS(6386)
NETTIME(2789746,3087573) NPMSPEED(NORMAL)
RQMNAME(QMGRB) SHORTRTS(10)
SSLCERTI(*******************)
SSLKEYDA( ) SSLKEYTI( )
SSLPEER(*******************)
SSLRKEYS(0) STATUS(RUNNING)
STOPREQ(NO) SUBSTATE(RECEIVE)
XBATCHSZ(23,7) XMITQ(QMGRB.X7)
XQTIME(215757414,214033427) RVERSION(08000008)
RPRODUCT(MQMM)
qm.ini:
Log:
   LogPrimaryFiles=10
   LogSecondaryFiles=10
   LogFilePages=16384
   LogType=LINEAR
   LogBufferPages=4096
   LogPath=/apps/wmq/QMGR/log/QMGR/
   LogWriteIntegrity=SingleWrite
Service:
   Name=AuthorizationService
   EntryPoints=13
TCP:
   SvrSndBuffSize=0
   SvrRcvBuffSize=0
ServiceComponent:
   Service=AuthorizationService
   Name=MQSeries.UNIX.auth.service
   Module=/opt/mqm75/lib64/amqzfu
   ComponentDataSize=0
Channels:
   MaxChannels=500
UPDATED: 15:41 GMT
Just to update the information: both sides are now running with BATCHSZ 100 and performance seems slightly better.
AMQ8417: Display Channel Status details.
CHANNEL(QMGRA.QMGRB.T7) CHLTYPE(SDR)
BATCHES(403) BATCHSZ(100)
BUFSRCVD(405) BUFSSENT(23525)
BYTSRCVD(11756) BYTSSENT(53751066)
CHSTADA(2020-04-17) CHSTATI(15.13.51)
COMPHDR(NONE,NONE) COMPMSG(NONE,NONE)
COMPRATE(0,0) COMPTIME(0,0)
CONNAME(159.50.69.38(48702)) CURLUWID(6D66985E94343410)
CURMSGS(0) CURRENT
CURSEQNO(44115897) EXITTIME(0,0)
HBINT(300) INDOUBT(NO)
JOBNAME(0000172A00000001) LOCLADDR(10.185.8.122(2223))
LONGRTS(999999999) LSTLUWID(6D66985E93343410)
LSTMSGDA(2020-04-17) LSTMSGTI(15.30.06)
LSTSEQNO(44115897) MCASTAT(RUNNING)
MONCHL(HIGH) MSGS(23505)
NETTIME(101563,480206) NPMSPEED(NORMAL)
RQMNAME(QMGRB) SHORTRTS(10)
SSLCERTI(*************************************)
SSLKEYDA( ) SSLKEYTI( )
SSLPEER(****************************)
SSLRKEYS(0) STATUS(RUNNING)
STOPREQ(NO) SUBSTATE(MQGET)
XBATCHSZ(1,1) XMITQ(QMGRB.X7)
XQTIME(191225,794134) RVERSION(08000008)
RPRODUCT(MQMM)
AMQ8450: Display queue status details.
QUEUE(QMGRB.X7) TYPE(QUEUE)
CURDEPTH(0) IPPROCS(1)
LGETDATE(2020-04-17) LGETTIME(15.30.06)
LPUTDATE(2020-04-17) LPUTTIME(15.30.06)
MEDIALOG(S2488154.LOG) MONQ(LOW)
MSGAGE(0) OPPROCS(9)
QTIME(794134, 191225) UNCOM(NO)
I'll put a few observations in this answer, but based on any further feedback I may add more.
You are running a very old version of the software on the sender side; MQ 7.5 went out of support almost two years ago (April 30th, 2018). IBM will, for a cost, provide extended support for an additional three years, so maybe you fall in that group. The 7.5.0.2 maintenance release itself came out on July 11th, 2013, so it is almost seven years old at this point. I would strongly suggest you move to a newer version.
Note that MQ v8.0 goes out of support April 30th, 2020, and IBM just announced a few days ago that MQ v9.0 goes out of support September 30th, 2021. When you do migrate you should go with either 9.1, which has no announced end of support (they give five years minimum, so it could be 2023 at the earliest), or the next version of MQ, which should be out some time this year.
You mention setting the following:
TCP:
   SvrSndBuffSize=0
   SvrRcvBuffSize=0
The above settings apply to the SVRCONN end of a client connection. You can see this in the MQ v7.5 Knowledge Center page WebSphere MQ>Configuring>Changing configuration information>Changing queue manager configuration information>TCP, LU62, NETBIOS, and SPX:
SvrSndBuffSize=32768|number
The size in bytes of the TCP/IP send buffer used by the server end of a client-connection
server-connection channel.
SvrRcvBuffSize=32768|number
The size in bytes of the TCP/IP receive buffer used by the server end of a client-connection
server-connection channel.
At IBM MQ v7.5.0.2, APAR IV58073 introduced the concept of setting the various buffer settings to a value of 0, which means the operating system defaults are used. Unfortunately, like many things in the Knowledge Center, it does not look like IBM documented this correctly for 7.5.
You can, however, review the IBM MQ v8.0 Knowledge Center to get the full picture regarding these settings on the page Configuring>Changing configuration information>Changing queue manager configuration information>TCP, LU62, and NETBIOS. Specifically, you would want to set these two settings to have any impact on your sender channel:
SndBuffSize=number| 0
The size in bytes of the TCP/IP send buffer used by the sending end of
channels. This stanza value can be overridden by a stanza more
specific to the channel type, for example RcvSndBuffSize. If the
value is set as zero, the operating system defaults are used. If no
value is set, then the IBM MQ default, 32768, is used.
RcvSndBuffSize=number| 0
The size in bytes of the TCP/IP send buffer used by the sender end of
a receiver channel. If the value is set as zero, the operating system
defaults are used. If no value is set, then the IBM MQ default, 32768,
is used.
Starting at IBM MQ v8.0 any newly created queue manager will have all of the following in the qm.ini:
TCP:
   SndBuffSize=0
   RcvBuffSize=0
   RcvSndBuffSize=0
   RcvRcvBuffSize=0
   ClntSndBuffSize=0
   ClntRcvBuffSize=0
   SvrSndBuffSize=0
   SvrRcvBuffSize=0
However, any queue manager that is upgraded will not get those settings by default: if they are not present they will not be added, and if they are present they will remain the same. If a setting is not present then, as quoted above, "the IBM MQ default, 32768, is used."
I had extensive discussions with IBM support on this topic and came to the conclusion that they see no reason not to set these to 0 and only benefit in doing so, but out of an abundance of caution they do not change them to 0 for you.
I would recommend you add all of those to your qm.ini, but at minimum add the two I highlighted above.
These are good settings to implement, but they may not solve your problem if nothing changed recently on either end. If, however, something did change, for example a network difference or MQ being upgraded to 8.0.0.8 on the remote side, then these settings just might solve your problem.
In the channel status output two values are interesting:
NETTIME(2789746,3087573)
XQTIME(215757414,214033427)
NETTIME means that based on recent activity it took 2.7 seconds to receive a response from the RCVR channel, and over a longer period of time it took 3.1 seconds to receive a response from the RCVR channel. Can you compare this to a TCP ping from the sender channel server to the receiver channel server? 2.7 seconds for a response over the network seems excessive. In the presentation Keeping MQ Channels Up and Running given at Capitalware's MQ Technical Conference v2.0.1.4, Paul Clarke, who used to work for IBM, states "NETTIME only measures network time, and does not include the MQCMIT for example".
XQTIME means that based on both recent activity and the longer period it took ~215 seconds for a message on the XMITQ to be picked up by the SDR channel to be sent.
See below for how IBM documents these:
NETTIME
Amount of time, displayed in microseconds, to send a request to the remote end of the channel and receive a response. This time only measures the network time for such an operation. Two values are displayed:
A value based on recent activity over a short period.
A value based on activity over a longer period.
XQTIME
The time, in microseconds, that messages remained on the transmission queue before being retrieved. The time is measured from when the message is put onto the transmission queue until it is retrieved to be sent on the channel and, therefore, includes any interval caused by a delay in the putting application.
Two values are displayed:
A value based on recent activity over a short period.
A value based on activity over a longer period.
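To put a number on the raw network latency that NETTIME is being compared against, here is a hedged sketch of a "TCP ping": it simply times TCP connects from the sender's host to the receiver's listener. The host and port below are placeholders (taken loosely from the CONNAME shown above); substitute the real receiver address, and keep in mind this measures only connection setup time, not an MQ round trip.

import java.net.InetSocketAddress;
import java.net.Socket;

public class TcpPing {
    public static void main(String[] args) throws Exception {
        String host = "159.50.69.38"; // placeholder: receiver channel listener host
        int port = 48702;             // placeholder: receiver channel listener port
        for (int attempt = 1; attempt <= 5; attempt++) {
            long start = System.nanoTime();
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 10_000);
            }
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("connect " + attempt + ": " + micros + " microseconds");
        }
    }
}

Connect times in the low tens of thousands of microseconds would be in line with the "10,000 would be a good value" guidance later in this thread; values approaching the millions seen in NETTIME would point at the network or the remote end.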
Information on the BATCHSZ channel parameter can be found in the IBM MQ v8.0 Knowledge Center page Reference>Configuration reference>Channel attributes>Channel attributes in alphabetical order>Batch size (BATCHSZ). I have quoted it and highlighted a few areas in bold.
This attribute is the maximum number of messages to be sent before a sync point is taken.
The batch size does not affect the way the channel transfers messages; messages are always transferred individually, but are committed or backed out as a batch.
To improve performance, you can set a batch size to define the maximum number of messages to be transferred between two sync points. The batch size to be used is negotiated when a channel starts, and the lower of the two channel definitions is taken. On some implementations, the batch size is calculated from the lowest of the two channel definitions and the two queue manager MAXUMSGS values. The actual size of a batch can be less; for example, a batch completes when there are no messages left on the transmission queue or the batch interval expires.
A large value for the batch size increases throughput, but recovery times are increased because there are more messages to back out and send again. The default BATCHSZ is 50, and you are advised to try that value first. You might choose a lower value for BATCHSZ if your communications are unreliable, making the need to recover more likely.
This attribute is valid for channel types of:
Sender
Server
Receiver
Requester
Cluster sender
Cluster receiver
Follow up questions:
Are the messages that are PUT to this XMITQ persistent?
Answer: Yes, in our PROD env all messages are persistent.
Have you had a recent increase in volume going to this XMITQ?
Answer: No, we use a monitoring tool; we extracted a report that shows a very similar message rate during the period, the same rate over the last 2 weeks.
Do the putting applications set MQPMO_SYNCPOINT and then commit after 1 or more messages are PUT to the queue (see the sketch after these questions)?
Answer: I will check with the application team.
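For reference, the pattern being asked about looks roughly like the hedged sketch below, written against the IBM MQ classes for Java. The queue manager and queue names are placeholders, and whether your applications actually do this is exactly what needs to be confirmed; messages put under MQPMO_SYNCPOINT stay uncommitted (and invisible to the channel) until the application commits.

import com.ibm.mq.MQMessage;
import com.ibm.mq.MQPutMessageOptions;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.CMQC;

public class SyncpointPutSketch {
    public static void main(String[] args) throws Exception {
        MQQueueManager qmgr = new MQQueueManager("QMGRA");                      // placeholder name
        MQQueue queue = qmgr.accessQueue("APP.REMOTE.QUEUE", CMQC.MQOO_OUTPUT); // placeholder name
        MQPutMessageOptions pmo = new MQPutMessageOptions();
        pmo.options = CMQC.MQPMO_SYNCPOINT; // put inside a unit of work
        for (int i = 0; i < 10; i++) {
            MQMessage msg = new MQMessage();
            msg.writeString("payload " + i);
            queue.put(msg, pmo);
        }
        // Commit promptly; a long gap between puts and commit leaves uncommitted
        // messages on the XMITQ and delays the sender channel.
        qmgr.commit();
        queue.close();
        qmgr.disconnect();
    }
}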
A couple of things:
You have XBATCHSZ(1,1) so your recent batch size is 1 message per batch.
Total messages 23505, batches 403, so an average of 58 messages per batch. If your recent batch size is 1, then you must have had some larger (100?) batch sizes.
XQTIME 191225 is the number of microseconds messages were on the xmit queue before being sent. This is only about 0.2 seconds!
NETTIME 101563 microseconds. This is a long time (about 0.1 seconds); 10,000 would be a good value. Compare this with a "TCP ping".
BUFSSENT 23525 is similar to the number of messages, so the message size is typically under 32K. BYTSSENT divided by messages gives about 2286 bytes, so these are small messages.
Things to check:
- The queue at the remote end. Has it filled up? This would cause messages to build up on the sending side.
- The NETTIME seems very long. Compare this with a TCP ping. NETTIME can include slow IO at the remote end, or a full queue at the remote end.
- XQTIME is high. This could be caused by sending applications not committing, or by slow disk IO.
I wrote "Why is my xmit queue filling up" in this blog
*Search for the title
have a read.
Capture these metrics over a day and see if they are typical
Regards,
Colin Paice

Jco Adapter pooling performance deadlock?

We're running an enterprise-scale SAP application with front-end Spring Boot clients connecting via the JCo adapter 3.0 on Oracle VM, using the connection pool (size 100). We're experiencing unsystematic long-running requests (> 10 s) that are not visible in the SAP application server log, i.e. the bottleneck does not appear to be on the SAP side.
Looking at the trace files (level 4) for an example request, we can see that the time seems to be lost when the adapter thread tries to get the client from the pool (other threads continue execution; the irrelevant threads have been removed for clarity):
[20:05:50:259]: [JCoAPI] JCoContext.isStateful(P-foo-CPIC0) in session ID Client-53-1 returns false
[20:05:50:259]: [JCoAPI] JCoContext.begin(P-foo-CPIC0) in session ID Client-53-1
[20:05:50:259]: [JCoAPI] Started context for session Client-53-1
[20:05:50:259]: [JCoAPI] JCoContext.begin() for destination PFOO_200 (P-foo-CPIC0) on context with id Client-53-1; current state counter is 1
[20:05:50:259]: [JCoAPI] destination PFOO_200 destinationID=P-foo-CPIC0 executes Z_foo sessionID=Client-53-1, threadID=0x35
[20:05:50:259]: [JCoAPI] Context.getConnection on destination PFOO_200 (state: destination = STATEFUL, default = STATELESS)
[20:05:50:259]: [JCoAPI] PoolingFactory.getClient() on pool P-foo-CPIC0
--> time lost here
[20:06:20:840]: [JCoAPI] PoolingFactory.getClient() returns handle [3/84977415]
[20:06:20:840]: [JCoAPI] Context.getConnection on destination PFOO_200 nothing found in the context - got client from ConnectionManager [3/84977415]
[20:06:20:840]: [JCoAPI] JCoClient before execute(Z_foo) on handle [3/84977415]
[20:06:20:840]: [JCoRFC] Executing function Z_foo on handle [3/84977415]
[20:06:20:866]: [JCoAPI] JCoClient after execute(Z_foo) on handle [3/84977415] returns after 26 ms
[20:06:20:866]: [JCoAPI] Context.releaseConnection on destination PFOO_200 [3/84977415]
[20:06:20:867]: [JCoAPI] JCoContext.end(P-foo-CPIC0) in session ID Client-53-1
[20:06:20:867]: [JCoAPI] PoolingFactory.releaseClient() handle [3/84977415] into pool P-foo-CPIC0 [pool size: 3, peak limit: 100, waiting threads: 0, currently used: 1]
[20:06:20:879]: [JCoAPI] Finished context for session Client-53-1
[20:06:20:879]: [JCoAPI] JCoContext.end() for destination PFOO_200 (P-foo-CPIC0) on context with id Client-53-1; current state counter is 0
For a typical request this step is handled in milliseconds.
Are there any known limitations or configurations regarding pool handling for the JCo adapter, either on the adapter or on the SAP side?
Update: we're on JCo adapter 3.0.16 and will double-check 3.0.17 now. DNS seems unlikely since we're monitoring dig/nslookup and they run without delays.
Which JCo patch level do you use?
Did you try to update to the latest JCo patch level 3.0.17 first?
In your time gap the RFC connection will be opened and the RFC logon will be done, if the pool is empty at that time. Did you have a closer look with a higher trace level, or did you have a look into the RFC trace?
This can be anything from not having a free dialog work process at ABAP side, to SAP system database issues (required for the RFC logon authentication checks), slow response times from the SAP message server (if using load balanced logons), SNC handshake issues (if using SNC) or general network issues with the DNS (try using the IP address instead of a hostname).
Another point worth checking: you say your connection pool has size 100. Is it possible that your program has more than 100 threads? Then it may happen from time to time that all connections are busy in other threads and the current thread has to wait until a function call in another thread completes and a connection is returned to the pool.
(How long a thread waits on an empty pool can be customized via the "pool wait time" parameter; see the configuration sketch below.)
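As a hedged illustration of where those pool parameters live, the sketch below writes a file-based JCo destination with explicit pooling settings. The destination name PFOO_200 matches the trace above, but the connection values and the chosen capacities/timeouts are placeholders, not values from the original post; adjust them to your landscape.

import java.io.FileOutputStream;
import java.util.Properties;

import com.sap.conn.jco.ext.DestinationDataProvider;

public class JcoPoolConfigSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Connection parameters (placeholders).
        props.setProperty(DestinationDataProvider.JCO_ASHOST, "sap-host.example.com");
        props.setProperty(DestinationDataProvider.JCO_SYSNR, "00");
        props.setProperty(DestinationDataProvider.JCO_CLIENT, "200");
        props.setProperty(DestinationDataProvider.JCO_USER, "RFC_USER");
        props.setProperty(DestinationDataProvider.JCO_PASSWD, "secret");
        // Pooling: keep some connections open, cap the peak, and bound the wait on an empty pool.
        props.setProperty(DestinationDataProvider.JCO_POOL_CAPACITY, "20");   // idle connections kept open
        props.setProperty(DestinationDataProvider.JCO_PEAK_LIMIT, "100");     // hard upper limit
        props.setProperty(DestinationDataProvider.JCO_MAX_GET_TIME, "10000"); // ms to wait for a free connection
        // JCoDestinationManager.getDestination("PFOO_200") picks this file up.
        try (FileOutputStream out = new FileOutputStream("PFOO_200.jcoDestination")) {
            props.store(out, "JCo destination with explicit pool settings");
        }
    }
}

With a bounded JCO_MAX_GET_TIME, pool exhaustion surfaces as an exception rather than as a silent multi-second gap in the trace, which makes it easier to tell "pool too small / too many threads" apart from slow RFC logons.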

Socket Exception while running load test with Self Provisioned test rig

I am getting a socket exception while running a load test on a self-provisioned test rig.
I trigger those load tests on the agent machine (self-provisioned test rig) from my local machine.
Note: for the first 2 to 3 minutes test iterations pass; after that we get the socket exception.
Below is the error message :
A connection attempt failed because the connected party did not
properly respond after a period of time, or established connection
failed because connected host has failed to respond.
Below are the stack trace details :
at System.Net.Sockets.Socket.EndConnect(IAsyncResult asyncResult) at
System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure,
Socket s4, Socket s6, Socket& socket, IPAddress& address,
ConnectSocketState state, IAsyncResult asyncResult, Exception&
exception)
Run Time - 20min
Sample rate - 10sec
warm up duration 10sec
number of agents used - 2
Load pattern :
initial load - 10user
max user count - 300
step duration - 10sec
step user count - 10
Although I have changed the above values, I am still getting the exception in the same way.
I am using Visual Studio 2015 Enterprise.
The question states: start with 10 users, every 10 seconds add 10 users to a maximum of 300. Thus after 29 increments there will be 300 users and that will take 29*10 seconds which is 4m50s. The test will thus (attempt to) run with the maximum load of 300 users for the remaining 15m10s.
Given that all tests pass for the first 2 or 3 minutes, plus the error message, this suggests that you are overloading some part of the network. It might be the agents, it might be the servers, or it might be the connections between them. Some network components have a maximum number of active connections, and 300 users might be too many.
Increasing the load so rapidly means you do not clearly know what the limiting value is. The sampling rate (every 10 seconds) seems high. At each sampling interval a lot of data is transferred (i.e. the sample data) and that can swamp parts of the network. You should look at the network counters for the agents and controller, and also for the servers if available.
I recommend changing the load test steps to add 10 users every 30 seconds, so it takes about 15 minutes to reach 300 users. It may also be worth reducing the sample rate to every 20 seconds.

Weblogic Stuck Thread on JDBC call

We frequently get a series of Stuck threads on our Weblogic servers. I've analyzed this over a period of time.
What I'd like to understand is whether this stuck-thread stack indicates it is still reading data from the open socket to the database, since the queries are simple SELECT statements.
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at oracle.net.ns.Packet.receive(Packet.java:239)
at oracle.net.ns.DataPacket.receive(DataPacket.java:92)
We've run netstat and other commands, the sockets from the Weblogic app server to the Database match the number of connections in the pool.
Any ideas what else we should be investigating here?
Stack trace of thread dump:
"[STUCK] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x61a5b000 nid=0x25f runnable [0x6147b000..0x6147eeb0]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at oracle.net.ns.Packet.receive(Packet.java:239)
at oracle.net.ns.DataPacket.receive(DataPacket.java:92)
at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:172)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:117)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:92)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:77)
at oracle.jdbc.driver.T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1023)
at oracle.jdbc.driver.T4CMAREngine.unmarshalSB1(T4CMAREngine.java:999)
at oracle.jdbc.driver.T4C8Oall.receive(T4C8Oall.java:584)
at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:183)
at oracle.jdbc.driver.T4CStatement.fetch(T4CStatement.java:1000)
at oracle.jdbc.driver.OracleResultSetImpl.close_or_fetch_from_next(OracleResultSetImpl.java:314)
- locked <0x774546e0> (a oracle.jdbc.driver.T4CConnection)
at oracle.jdbc.driver.OracleResultSetImpl.next(OracleResultSetImpl.java:228)
- locked <0x774546e0> (a oracle.jdbc.driver.T4CConnection)
at weblogic.jdbc.wrapper.ResultSet_oracle_jdbc_driver_OracleResultSetImpl.next(Unknown Source)
The bit starting from weblogic.work.ExecuteThread.run to here has been omitted. We have 8 sets of thread dumps, and each shows the thread waiting on the same line, with the same object locked:
at oracle.jdbc.driver.OracleResultSetImpl.close_or_fetch_from_next(OracleResultSetImpl.java:314)
- locked <0x774546e0> (a oracle.jdbc.driver.T4CConnection)
At the time the stack was printed, the thread seems blocked waiting for more data from the server:
at oracle.jdbc.driver.OracleResultSetImpl.next(OracleResultSetImpl.java:228)
Maybe it is just the query that is taking longer than StuckThreadMaxTime, and WebLogic issues a warning.
If possible I would try to:
- Find which query or queries are getting the threads stuck and check their execution time (see the sketch after this list for one way to bound it)
- Use Wireshark to analyze the communication with the database
- Have a look at the driver source code (the JD decompiler comes to mind) to understand the stack trace
- If you use the WebLogic debug flag -Dweblogic.debug.DebugJDBCSQL you will be able to trace the SQL which is actually being executed
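If a slow SELECT turns out to be the cause, one hedged mitigation is to bound the statement execution time so the query fails fast instead of holding an execute thread past StuckThreadMaxTime (600 seconds by default). The JNDI name, query, and 60-second timeout below are placeholders for illustration, not values from the original post.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import javax.naming.InitialContext;
import javax.sql.DataSource;

public class BoundedQueryExample {
    public static void runQuery() throws Exception {
        // Hypothetical JNDI name; use the data source configured in your WebLogic domain.
        DataSource ds = (DataSource) new InitialContext().lookup("jdbc/AppDS");
        try (Connection con = ds.getConnection();
             Statement stmt = con.createStatement()) {
            // Abort the statement if it runs longer than 60 seconds; a timeout
            // surfaces as a SQLException instead of a [STUCK] thread warning.
            stmt.setQueryTimeout(60);
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM some_table")) {
                while (rs.next()) {
                    // process rows ...
                }
            }
        }
    }
}

The Oracle thin driver also honors the connection property oracle.jdbc.ReadTimeout (milliseconds), which limits how long socketRead0 can block waiting for the database to respond.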

Resources