TIBCO FT issue because of heartbeat

I have two instances of my Java application (A and B) running on the same machine, where B is blocked waiting for the activate() callback.
This is how I register my callback:
new TibrvFtMember(this.tibrvQueue,
                  this.orv,
                  this.transport,
                  MessagingProps.TIBCO_GROUP_FT, // Group name
                  wt,   // Weight
                  1,    // Active members
                  2.0,  // Heartbeat (secs.)
                  0.0,  // Prep. (secs.)
                  2.5,  // Activation (secs.)
                  null);
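For context, a self-contained version of this registration might look roughly like the sketch below. It is only an illustration based on the standard TIBCO Rendezvous Java API (TibrvFtMember, TibrvFtMemberCallback); the transport parameters, group name, weight, and interval values are assumptions, not taken from the question.
import com.tibco.tibrv.*;

// Sketch of FT member registration plus the activate/deactivate callback.
// Transport parameters, group name, weight, and intervals are placeholders.
public class FtMemberSketch implements TibrvFtMemberCallback {

    public static void main(String[] args) throws Exception {
        Tibrv.open(Tibrv.IMPL_NATIVE);

        TibrvQueue queue = new TibrvQueue();
        TibrvTransport transport = new TibrvRvdTransport("7500", null, "tcp:7500");

        TibrvFtMember member = new TibrvFtMember(queue,
                                                 new FtMemberSketch(), // FT callback
                                                 transport,
                                                 "FT.GROUP.EXAMPLE",   // group name
                                                 50,                   // weight
                                                 1,                    // active members
                                                 2.0,                  // heartbeat (secs.)
                                                 0.0,                  // preparation (secs.)
                                                 2.5,                  // activation (secs.)
                                                 null);

        // FT events (ACTIVATE / DEACTIVATE) are delivered through this queue.
        while (true) {
            queue.dispatch();
        }
    }

    public void onFtAction(TibrvFtMember member, String groupName, int action) {
        if (action == TibrvFt.ACTIVATE) {
            // become the active instance and start publishing
        } else if (action == TibrvFt.DEACTIVATE) {
            // stop publishing
        } else if (action == TibrvFt.PREPARE_TO_ACTIVATE) {
            // get ready to take over
        }
    }
}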
I send a heartbeat message every 100 milliseconds to show that my application is alive and on time:
msg.add("Data", ++seq + ":" + MessagingProps.TIBCO_SUBJECT_FT);
msg.setSendSubject(MessagingProps.TIBCO_SUBJECT_FT);
transport.send(msg);
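For illustration, a hedged sketch of such a publishing loop using a ScheduledExecutorService is shown below; the class name and subject parameter are placeholders, and the TibrvMsg calls simply mirror the fragment above.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import com.tibco.tibrv.TibrvException;
import com.tibco.tibrv.TibrvMsg;
import com.tibco.tibrv.TibrvTransport;

// Publishes an application heartbeat every 100 ms on a dedicated scheduler.
// The message construction mirrors the fragment above; the subject is a placeholder.
class HeartbeatPublisher {
    private final AtomicLong seq = new AtomicLong();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void start(TibrvTransport transport, String subject) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                TibrvMsg msg = new TibrvMsg();
                msg.add("Data", seq.incrementAndGet() + ":" + subject);
                msg.setSendSubject(subject);
                transport.send(msg);
            } catch (TibrvException e) {
                // log and keep the schedule running
            }
        }, 0, 100, TimeUnit.MILLISECONDS);
    }
}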
For some reason (e.g., a GC pause), if my active application "A" is delayed in sending its heartbeat, my passive application "B" is immediately activated while "A" is still active, and deactivate is not invoked on "A". As soon as application "A" sends a heartbeat again, deactivate is invoked on application "B". This gives erroneous behavior for a few seconds, since both applications are active at the same time.
This happens at random, unpredictable times. Our application must not publish duplicate messages at the same time, so this has a huge impact.
Kindly help me understand how I can overcome this issue.

Related

MassTransit timeouts under load on .NET Framework under IIS

Under load in production we receive "RabbitMQ.Client.Exceptions.ConnectFailureException" (connection failed) and "MassTransit.RequestTimeoutException" (timeout waiting for response). The consumer does receive the message and sends the response back. It's as if the web app isn't listening, or is unable to accept the connection.
We're running an ASP.NET web application (not MVC) on .NET Framework 4.6.2, on Windows Server 2019, under IIS, using MassTransit 7.0.4. In production, under load, we get exceptions dealing with sockets on RabbitMQ or timeouts from MassTransit. It's difficult to reproduce them in dev. RabbitMQ is mirrored; the problem seems to start once we turn on a high-load service that bumps traffic from 140 messages/sec to 250 messages/sec.
I have a few questions about the code architecture, and then if anyone else is running into these kinds of timeout issues.
Questions:
Should I have static scope for the IBusControl? I.e., should it be static inside Global.asax? And does it matter at all if it's a singleton underneath?
Should I create a new IBusControl and start it per request (maybe in Application_BeginRequest)? Would that make a difference?
Would adding another worker process affect the total number of open connections I'm able to make, if this is a resource issue (exhausting threads, connections, or some other resource)?
Exceptions:
MassTransit.RequestTimeoutException
Timeout Waiting for response
Stacktrace:
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification
MassTransit.Clients.ResponseHandlerConnectionHandle`1+<GetTask>d_11.MoveNext
System.Threading.ExecutionContext.RunInternal
RabbitMQ.Client.Exceptions.ConnectFailureException
Connection failed
Stacktrace:
RabbitMQ.Client.Impl.SocketFrameHandler.ConnectOrFail
RabbitMQ.Client.Impl.SocketFrameHandler.ConnectUsingAddressFamily
RabbitMQ.Client.Impl.SocketFrameHandler..ctor
RabbitMQ.Client.ConnectionFactory.CreateFrameHandler
RabbitMQ.Client.EndPointResolverExtensions.SelectOne
RabbitMQ.Client.ConnectionFactory.CreateConnection
How Our Code Works (overview)
Static IBusControl that is instantiated the first time someone tries to produce a message. The whole connection-and-send code is a little too large to include here (connection factory and other metric classes), but below are the interesting parts.
static IBusControl B;

B = Bus.Factory.CreateUsingRabbitMq(x =>
{
    hostAddress = host.HostAddress;
    x.Host(new Uri(host.HostAddress), h =>
    {
        h.Username(host.UserName);
        h.Password(host.Password);
    });
    x.Durable = false;
    x.SetQueueArgument("x-message-ttl", 600000);
});
B.Start(new TimeSpan(0, 0, 10));

// Then send the actual messages
// Generic with TRequest and TResponse : class BaseMessage
// Pulling the code out of a few different classes
string serviceAddressString = string.Format("{0}/{1}?durable={2}",
    HostAddress,
    ChkMassTransit.QueueName(typeof(TRequest), typeof(TResponse)),
    false ? "true" : "false");
Uri serviceAddress = new Uri(serviceAddressString);
RequestTimeout rt = RequestTimeout.After(0, 0, 0, 0, timeout.Value);
IRequestClient<TRequest> reqClient = B.CreateRequestClient<TRequest>(serviceAddress, rt);
var v = reqClient.GetResponse<TResponse>(request, sendInfo.CT, sendInfo.RT);
if (v.Wait(timeoutMS)) { /* do some stuff */ }
First, I find your lack of async disturbing. Using Wait or anything like it on TPL-based code is a recipe for death and destruction, pain and suffering, dogs and cats living together, etc.
Yes, you should have a single bus instance that is started when the application starts. Since you're doing request/response, set AutoStart = true on the bus configurator to make sure it's all warmed up and ready.
Never, no, one bus only!
Each bus instance only has a single connection, so you shouldn't see any resource issues related to capacity on RabbitMQ.
MassTransit 7.0.4 is really old; you might consider the easy upgrade to 7.3.1 and see if that improves things for you. It's the last version of the v7 codebase available.

How can I terminate myself if I run too long?

I have an application that runs periodically (it's a scheduled task). The task is launched once a minute, and normally only takes a few seconds to do its business, then exits.
But there's a ~1 in 80,000 chance (every two or three months) that the application will hang. The root cause is that we're using the Microsoft ServerXmlHttpRequest component to perform some work, and sometimes it just decides to hang. The virtue of ServerXmlHttpRequest over XmlHttpRequest is that the latter is not recommended for important scenarios, such as those where reliability and security matter (which is true of an unattended server component):
The ServerXMLHTTP object offers functionality similar to that of the XMLHTTP object. Unlike XMLHTTP, however, the ServerXMLHTTP object does not rely on the WinInet control for HTTP access to remote XML documents. ServerXMLHTTP uses a new HTTP client stack. Designed for server applications, this server-safe subset of WinInet offers the following advantages:
Reliability — The HTTP client stack offers longer uptimes. WinInet features that are not critical for server applications, such as URL caching, auto-discovery of proxy servers, HTTP/1.1 chunking, offline support, and support for Gopher and FTP protocols are not included in the new HTTP subset.
Security — The HTTP client stack does not allow a user-specific state to be shared with another user's session. ServerXMLHTTP provides support for client certificates.
The job is being run as a scheduled task. I need the task to continue to run periodically, killing the existing process if it has hung.
The Windows Task Scheduler does have an option to forcibly stop a task that is running too long.
The only downside to that approach is that it simply doesn't work: it does not stop the task, and the hung process keeps running.
Given that I cannot trust the Microsoft ServerXmlHttpRequest not to arbitrarily lock up, and the Task Scheduler is unable to terminate the scheduled task, I need some way to do it myself.
Jobs
I tried looking into using the Job Objects API:
A job object allows groups of processes to be managed as a unit. Job objects are namable, securable, sharable objects that control attributes of the processes associated with them. A job can enforce limits such as working set size, process priority, and end-of-job time limit on each process that is associated with the job.
That one note sounded like exactly what i needed:
A job can enforce limits such as end-of-job time limit on each process that is associated with the job.
The only downside to that approach is that it does not work: jobs cannot impose a wall-clock time limit on a process. They can only impose a user-mode time limit:
PerProcessUserTimeLimit
If LimitFlags specifies JOB_OBJECT_LIMIT_PROCESS_TIME, this member is the per-process user-mode execution time limit, in 100-nanosecond ticks.
If the process is idle (for example, sitting in MsgWaitForSingleObject, as ServerXmlHttpRequest does), then it accumulates no user time. I tested it: I created a job with a 1-second time limit and placed my own process into it. As long as I don't move the mouse over my test application, it quite happily sits there for longer than one second.
Watchdog Thread
The only other technique I can think of, given that my main thread is indefinitely blocked, is another thread: spawn a watchdog thread that sleeps for my three minutes and then calls ExitProcess:
Int32 watchdogTimeoutSeconds = FindCmdLineSwitch("watchdog", 0);
if (watchdogTimeoutSeconds > 0)
{
    Thread thread = new Thread(KillMeCallback, new IntPtr(watchdogTimeoutSeconds));
    thread.Start();
}

void KillMeCallback(IntPtr data)
{
    Int32 secondsUntilProcessIsExited = data.ToInt32();
    if (secondsUntilProcessIsExited <= 0)
        return;

    Sleep(secondsUntilProcessIsExited * 1000); // seconds --> milliseconds

    LogToEventLog(ExtractFilename(Application.ExeName),
        "Watchdog fired after " + secondsUntilProcessIsExited.ToString() + " seconds. Process will be forcibly exited.",
        EVENTLOG_WARNING_TYPE, 999);

    ExitProcess(999);
}
And that works. The only downside is that it's a bad idea.
Can anyone think of anything better?
Edit
For now I will implement a command-line switch:
Contoso.exe /watchdog 180
so the process will be exited after 180 seconds. This means the duration is configurable, and the watchdog can easily be removed completely in the field.
I went the route of passing a special watchdog option to my process on the command line:
>Contoso.exe /watchdog 180
During initialization I check for the presence of the watchdog option, with an integer number of seconds after it:
String s = Toolkit.FindCmdLineOption("watchdog", ["/", "-"]);
if (s != "")
{
    Int32 seconds = StrToIntDef(s, 0);
    if (seconds > 0)
        RunInThread(WatchdogProc, Pointer(seconds));
}
and my thread procedure:
void WatchdogProc(Pointer Data)
{
    Int32 secondsUntilProcessIsExited = Int32(Data);
    if (secondsUntilProcessIsExited <= 0)
        return;

    Sleep(secondsUntilProcessIsExited * 1000); // seconds -> milliseconds

    LogToEventLog(ExtractFileName(ParamStr(0)),
        Format("Watchdog fired after %d seconds. Process will be forcibly exited.", secondsUntilProcessIsExited),
        EVENTLOG_WARNING_TYPE, 999);

    ExitProcess(2);
}
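For comparison, the same watchdog idea in Java might look like the sketch below. This is only an illustration (not the code above); Runtime.halt() is used so that a hung shutdown hook cannot delay the exit, and the timeout and exit code are arbitrary.
// Watchdog sketch: a daemon thread sleeps for the configured number of
// seconds and then forcibly terminates the process, so a hung main thread
// cannot block it. Timeout and exit code are illustrative.
public final class Watchdog {

    public static void arm(int timeoutSeconds) {
        if (timeoutSeconds <= 0)
            return;
        Thread watchdog = new Thread(() -> {
            try {
                Thread.sleep(timeoutSeconds * 1000L);
            } catch (InterruptedException e) {
                return; // watchdog was disarmed
            }
            System.err.println("Watchdog fired after " + timeoutSeconds
                    + " seconds. Process will be forcibly exited.");
            Runtime.getRuntime().halt(2); // bypasses shutdown hooks, like ExitProcess
        }, "watchdog");
        watchdog.setDaemon(true); // never keeps the process alive on its own
        watchdog.start();
    }

    public static void main(String[] args) {
        arm(180);         // e.g. parsed from a "/watchdog 180" command-line option
        doPeriodicWork(); // the real work that might hang
    }

    private static void doPeriodicWork() {
        // ...
    }
}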

Spring @Async cancel and start?

I have a Spring MVC app where a user can kick off report generation via a button click. This process could take anywhere from a few minutes up to 10-20 minutes.
I use Spring's @Async annotation around the service call so that report generation happens asynchronously, while I pop up a message to the user indicating a job is currently running.
Now what I want is this: if another user (an admin) kicks off report generation via the button, it should cancel/stop the currently running @Async task and start the new task.
To do this, I do the following:
.. ..
future = getCurrentTask(id); // returns the current task for given report id
if (!future.isDone())
    future.cancel(true);

service.generateReport(id);
How can I make it so that "service.generateReport" waits until the cancelled task's running threads have actually been stopped?
According to the documentation, after I call future.cancel(true), isDone() will return true and isCancelled() will return true as well. So there is no way of knowing whether the job has actually finished cancelling.
I can only start a new report generation once the old one is cancelled or completed, so that it does not produce dirty data.
From the documentation of the cancel() method:
Subsequent calls to isCancelled() will always return true if this method returned true
Try this.
future = getCurrentTask(id); // returns the current task for the given report id
if (!future.isDone()) {
    boolean terminatedImmediately = future.cancel(true);
    if (terminatedImmediately) {
        service.generateReport(id);
    } else {
        // Inform the user the existing job couldn't be stopped and to try again later.
    }
}
Assuming the code above runs in thread A, and your recently cancelled report is running in thread B, you need thread A to stop before service.generateReport(id) and wait until thread B has completed or been cancelled.
One approach to achieve this is to use a Semaphore. Assuming there can be only one report running concurrently, first create a semaphore object accessible by all threads (normally on the report runner service class):
Semaphore semaphore = new Semaphore(1);
At any point in your code where you need to run the report, call the acquire() method. This method blocks until a permit is available. Similarly, when the report execution is finished or cancelled, make sure release() is called. The release method puts the permit back and wakes up other waiting threads.
semaphore.acquire();
// run report..
semaphore.release();
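Putting the pieces together, a sketch of a report runner guarded this way might look like the following. The class and method names are illustrative (they are not from the question), and the permit is released in a finally block so a cancelled or failed run always frees it.
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.AsyncResult;
import org.springframework.stereotype.Service;

// Illustrative sketch: only one report runs at a time; the permit is always
// released, whether the run completes, fails, or is interrupted by cancel(true).
@Service
public class ReportService {

    private final Semaphore semaphore = new Semaphore(1);

    @Async
    public Future<String> generateReport(long reportId) throws InterruptedException {
        semaphore.acquire(); // blocks this worker until the previous run releases its permit
        try {
            // Long-running report generation; it should check
            // Thread.currentThread().isInterrupted() periodically so that
            // future.cancel(true) can actually stop it.
            return new AsyncResult<>("report-" + reportId);
        } finally {
            semaphore.release();
        }
    }
}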

What is the cost of creating actors in Akka?

Consider a scenario in which I am implementing a system that processes incoming tasks using Akka. I have a primary actor that receives tasks and dispatches them to some worker actors that process the tasks.
My first instinct is to implement this by having the dispatcher create an actor for each incoming task. After the worker actor processes the task it is stopped.
This seems to be the cleanest solution for me, since it adheres to the principle of "one task, one actor". The other solution would be to reuse actors, but this involves the extra complexity of cleanup and some pool management.
I know that actors in Akka are cheap. But I am wondering whether there is an inherent cost associated with repeatedly creating and deleting actors. Is there any hidden cost in the data structures Akka uses for the bookkeeping of actors?
The load should be on the order of tens or hundreds of tasks per second; think of it as a production web server that creates one actor per request.
Of course, the right answer lies in profiling and fine-tuning the system based on the type of incoming load.
But I wondered whether anyone could share something from their own experience.
LATER EDIT:
I should give more details about the task at hand:
Only N tasks can be active at any given time. As @drexin pointed out, this would be easily solvable using routers. However, the execution of a task isn't a simple run-and-be-done affair.
Tasks may require information from other actors or services and thus may have to wait and go to sleep, releasing their execution slot. The slot can then be taken by another waiting actor, which now has the opportunity to run. You could draw an analogy with the way processes are scheduled on a single CPU.
Each worker actor needs to keep some state regarding the execution of the task.
Note: I appreciate alternative solutions to my problem, and I will certainly take them into consideration. However, I would also like an answer to the main question regarding the intensive creation and deletion of actors in Akka.
You should not create an actor for every request; you should instead use a router to dispatch the messages to a dynamic number of actors. That's what routers are for. Read this part of the docs for more information: http://doc.akka.io/docs/akka/2.0.4/scala/routing.html
edit:
Creating top-level actors (system.actorOf) is expensive, because every top-level actor also initializes an error kernel, and those are expensive. Creating child actors (context.actorOf inside an actor) is much cheaper.
But I still suggest you rethink this, because depending on the frequency of creation and deletion of actors, you will also put additional pressure on the GC.
edit2:
And most importantly, actors are not threads! So even if you create 1M actors, they will only run on as many threads as the pool has. Depending on the throughput setting in the config, every actor will process n messages before the thread is released back to the pool.
Note that blocking a thread (which includes sleeping) will NOT return it to the pool!
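To make the router suggestion concrete, here is a hedged sketch using the classic Akka Java API (a newer API than the 2.0.4 docs linked above); Worker and Task are placeholder types, and the pool size is illustrative.
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.routing.RoundRobinPool;

// Sketch: a fixed pool of worker actors behind a round-robin router,
// instead of one short-lived actor per task.
public class RouterSketch {

    // Placeholder task message.
    static final class Task {
        final String payload;
        Task(String payload) { this.payload = payload; }
    }

    // Worker actor that processes tasks; it is created once and reused.
    static class Worker extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(Task.class, task -> {
                        // ... process the task, possibly replying to getSender() ...
                    })
                    .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("tasks");
        ActorRef workers = system.actorOf(
                new RoundRobinPool(8).props(Props.create(Worker.class)), "workers");
        workers.tell(new Task("example"), ActorRef.noSender());
    }
}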
An actor which will receive one message right after its creation and die right after sending the result can be replaced by a future. Futures are more lightweight than actors.
You can use pipeTo to receive the future result when it's done. For instance, in your actor launching the computations:
def receive = {
  case t: Task   => future { executeTask(t) }.pipeTo(self)
  case r: Result => processTheResult(r)
}
where executeTask is your function taking a Task and returning a Result.
However, I would reuse actors from a pool through a router, as explained in @drexin's answer.
I've tested creating 10000 remote actors from a main context via a root actor, using the same scheme as in the prod module where a single actor was created. MBP 2.5 GHz x2:
in main: main ? root // main asks root to create an actor
in main: actorOf(child) // create a child
in root: watch(child) // watch lifecycle messages
in root: root ? child // wait for response (connection check)
in child: child ! root // response (connection ok)
in root: root ! main // notify created
Code:
def start(userName: String) = {
logger.error("HELLOOOOOOOO ")
val n: Int = 10000
var t0, t1: Long = 0
t0 = System.nanoTime
for (i <- 0 to n) {
val msg = StartClient(userName + i)
Await.result(rootActor ? msg, timeout.duration).asInstanceOf[ClientStarted] match {
case succ @ ClientStarted(userName) =>
// logger.info("[C][SUCC] Client started: " + succ)
case _ =>
logger.error("Terminated on waiting for response from " + i + "-th actor")
throw new RuntimeException("[C][FAIL] Could not start client: " + msg)
}
}
t1 = System.nanoTime
logger.error("Starting of a single actor of " + n + ": " + ((t1 - t0) / 1000000.0 / n.toDouble) + " ms")
}
The result:
Starting of a single actor of 10000: 0.3642917 ms
There was a message stating that "Slf4jEventHandler started" between "HELLOOOOOOOO" and "Starting of a single", so the experiment seems even more realistic (?)
The dispatcher was the default one (a PinnedDispatcher, which starts a new thread each and every time), and it seemed like all of that costs about what Thread.start() has cost for a long, long time since Java 1: 500K-1M cycles or so ^)
That's why I changed the code inside the loop to just new java.lang.Thread().start()
The result:
Starting of a single actor of 10000: 0.1355219 ms
Actors make great finite state machines, so let that help drive your design here. If your request-handling state is greatly simplified by having one actor per request, then do that. As a rule of thumb, I find that actors are particularly good at managing more than two states.
A common approach, though, is one request-handling actor that references request state from within a collection it maintains as part of its own state. Note that this can also be achieved with an Akka reactive stream and the use of the scan stage.

Measuring cross-process latency on Windows

I am adding latency measurement to a communication middleware I am building. The way I have it working is that I periodically send a probe msg from my publishing apps. Subscribing apps receive this probe, cache it, and send an echo back at a time of their choosing, noting how much time the msg was kept "on hold". The publishing app receives these echoes and calculates latency as (now() - time_sent - time_on_hold) / 2.
This kind of works, but the numbers are vastly different (3x) when "time on hold" is greater than 0. That is, if I echo the msg back immediately I get around 50us on my dev env, but if I wait and then send the msg back, the time jumps to 150us (even though I discount whatever time I was on hold). I use QueryPerformanceCounter for all measurements.
This is all inside a single Windows 7 box. What am I missing here?
TIA.
A bit more information. I am using the following to measure time:
static long long timeFreq;
static struct Init
{
Init()
{
QueryPerformanceFrequency((LARGE_INTEGER*) &timeFreq);
}
} init;
long long OS::now()
{
long long result;
QueryPerformanceCounter((LARGE_INTEGER*)&result);
return result;
}
double OS::secondsDiff(long long ts1, long long ts2)
{
return (double) (ts1-ts2)/timeFreq;
}
On the publish side I do something like:
Probe p;
p.sentTimeStamp = OS::now();
send(p);
Response r = recv();
latency = OS::secondsDiff(OS::now(), r.sentTimeStamp) - r.secondsOnHoldOnReceiver;
And on the receiver side:
Probe p = recv();
long long received = OS::now();
sleep();
Response r;
r.sentTimeStamp = p.sentTimeStamp;
r.secondsOnHoldOnReceiver = OS::secondsDiff(OS::now(), received);
send(r);
OK, I have edited my answer to reflect your update. Sorry for the delay, but I didn't notice that you had elaborated on the question by posting an answer.
It seems that, functionally, you are doing nothing wrong.
I think that when you distribute your application outside of localhost conditions, the additional 100us (if it is indeed roughly constant) will pale into insignificance compared to the average latency of a functioning network.
For the purposes of answering your question, I think there is a thread/interrupt scheduling issue on the server side that needs to be investigated, as you do not seem to be doing anything on the client that is not accounted for.
Try the following test scenario:
Send two Probes to clients A and B. (all localhost)
Send the Probe to 'Client B' one second (or X/2 seconds) after you send the probe to Client A.
Ensure that 'Client A' waits for two seconds (or X seconds) and 'Client B' waits one second (or X/2 seconds).
The idea is that, hopefully, both clients will send back their probe answers at roughly the same time, and both after a sleep/wait (the action that exposes the problem). The objective is to see whether one client's response 'wakes up' the publisher so that the next client's answer is processed immediately.
If one of these returned probes does not show the anomaly (most likely the second response), it could point to the fact that the publisher thread is waking from a sleep cycle (on receiving the 1st response) and is immediately available to process the second response.
Again, if it turns out that the 100us delay is roughly constant, it will be within +-10% of 1 ms, which is a timeframe appropriate for real-world network conditions.
