Server cascading failure - spring

This might be a total noob question.
We migrated to AWS a week ago. We have two separate apps; call them App1 and App2. For every request that App1 receives, it makes a web service call to App2 with a read timeout of 2 seconds, so if the response isn't delivered within 2 seconds, the call is aborted. However, the App2 server has some problems that sometimes make it go down. The problem is that whenever the App2 server goes down, the App1 server goes down with it, and when App2 comes back up, App1 immediately comes back up as well.
This is a weird problem. What do you guys think is happening?
Any help will be greatly appreciated.

My guess is that requests are piling up on app 1 (due to increased latency) as app 2 goes down, which eventually causes app 1 to become unresponsive as well. I would also look into what actually happens when you abort your request after the two second timeout. Are you actually making sure the connection is aborted? If not, you may be using up system resources for dead connections.
But the above is just guessing in the dark; I think we need more information to make more educated guesses :).

Related

SignalR combined with load balancer missing messages

I have 2 web servers (IIS 8.5) behind a hardware firewall and our application uses SignalR for some real-time updates. We are using SQL Server as the backplane to help us work in this load balanced environment. Additionally we are using sticky sessions on the load balancer to help us keep the users on the same web server during their session. When we are running in this hardware configuration we lose at least 1/3 of our messages. Sometimes we get all the expected messages but more often than not we are missing plenty.
When we are running on a single web server all messages are received. Does anyone have any suggestions for troubleshooting this problem? We've turned on logs (both client & server) and nothing looks like it's missing or broken. We're really stumped.
EDIT---
Some additional details that I hope will shed light on the situation.
Server to Client messages are getting lost. Pretty much all our communication is Server to Client.
We are using sticky session just based on IP and limited to 5 minutes but we're losing messages within that 5 minutes.
This is some old SignalR code that has been only minimally touched since SignalR 1 (or even older). We are keeping an in-memory list of users along with their connections and we use that list to send notices back to the client. It seems most likely that this is the cause of the trouble, but with sticky sessions the user should be stuck to the same server for at least the 5 minutes, right?
This list of users maps Username to connection id. This is useful when our backend services (on another machine) sends a message back with the username not the connection id.

Finally resolved this. There were really 2 issues. The first is that we were using an in-memory list of users, as mentioned in the edit above. Once we realized that wasn't going to work across machines, we removed it. That also led us to the second issue, which was how SignalR 2 uses the IUserIdProvider: our call should have been
Clients.User(userId).send(message)
instead of
context.Clients.Client(connection)
This code had existed since we first started using SignalR many years ago and never got properly updated as we upgraded SignalR versions.

Have the same machineKey specified in your web.config on both servers.

What exactly does a HTTP or jquery $.ajax timeout mean?

When I issue an $.ajax query with a timeout: parameter, and my timeout is met such that error: is invoked, what does that mean?
More specifically:
Does that mean the server received the request, but is still processing it? That may mean some effect may occur, so I may have to cancel it on the server, or somehow invalidate data that was already partially written to a database.
Or does that mean I was never able to reach the server at all? This is nice to know, since then I don't have to deal with partial data from a server "save".
Or does that mean the request made it part of the way, and now we've lost track of it? In this case, I'd have to actually ask the server, "Oh hey, about that request I sent a while ago... did you get that one? Yeah? Okay, ignore that last save."
OS commands like tracert make it clear that a TCP packet may pass through many hosts, so if one becomes unresponsive, it's hard to tell whether the server got the request or not. But some protocols require an echo-back before something is considered received (so I'm not sure whether this is down to HTTP or to Apache).

It is how long the client will wait to hear from the server before giving up.
The server may or may not have done its part. The only way for the client to know is for the client to be notified. Since you don't want to leave a process or a human waiting forever, a timeout lets you specify how long to wait for success before giving up.

WCF - WebHttpBinding - RESTful - Performance Issue

First-time poster, so go easy on me.
I am currently trying to address a performance issue when hitting my web service after a one minute period of inactivity. Literally after one minute of THAT user not hitting the web service then the next call will take 15 seconds before actually hitting the service operation. If you keep making random (not the same service operation just so you guys don't think it is "caching" the call) service operation calls the service returns immediately (less than a second).
Here are some "timings" I decided to take so you can see how I came to the one minute of inactivity:
2:04PM
2:16PM --15 seconds
2:21PM --15 seconds
2:24PM --15 seconds
2:25PM --15 seconds
Again, if you hit the web service continuously without a one minute period of inactivity then ALL methods will return in less than a second.
Here are some details regarding my web service:
WCF, WebHttpBinding, RESTful, using HTTPs.
Basic Authentication + Custom Authentication using IDispatchMessageInspector. Authentication happens with EVERY call (except to the Initializer.aspx page).
Custom Initialization.aspx page has been created which is called every night after the Application Pool is recycled. This page caches a bunch of global data used by all users along with starting that compile.
Application Pool ONLY recycles every night at 2AM. Worker threads are never killed off because timeout is disabled.
I heard about ReliableSession but as the setting implies that sounds like it would only work for PerSession, not PerCall.
Is there any way to resolve this, or am I stuck resorting to "pinging" the server every 45 seconds using a dummy service operation?

Found out the issue. We have multiple domain controllers. When the user was getting authenticated it would start from the forest level and work its way down to the actual domain controller that server resided on. The firewalls that were put in place were blocking all domain controllers except what the server resided on.
So basically, it would fail to communicate to the N+ domain controllers until it finally reached the only one it could.
You can fix this a number of ways but we just created firewall rules to allow the web server to communicate to the domain controller the users needed to be authenticated against.

Best practice for updating Go web application

I am wondering what would be the best practice for deploying updates to an (MVC) Go web application. Imagine the following scenario:
1) Code and test some changes for my Go web application
2) Deploy the update without interrupting anyone currently using the previous version.
I don't know how to make sure point 2) is covered: when somebody sends a request to the server and I rebuild/restart it at just that moment, they get an error, even if the request only uses a part of the code I did not touch or that is backwards-compatible, or if I just added a new request handler.
Maybe I'm missing something trivial or a well-known pattern, as I am just in the process of learning Go and my previous web applications were ASP.NET or PHP applications, where this was no issue because I did not need to restart the web server on code changes.

It's not just an issue with Go, but in general we can divide the problem into two separate ones:
Making sure current requests do not get terminated and affect user experience.
Making sure there is no down-time in which new requests cannot be handled.
The first one is easier to tackle: you just don't violently kill your server, but tell it to exit, causing a "drain phase" in which it does not accept new requests, only finishes the currently running requests, and then exits. This can be done by listening for signals, for example, and putting the app into a special state.
This wasn't trivial in Go before 1.8, as the default http server didn't support shutting down (since Go 1.8, http.Server has a Shutdown method). You can start a server with a net.Listener, and then keep a reference to it and close it when the time is due.
Now, doing only approach one and then starting the service again will cause new requests not to be accepted while this is going on, and we all know this can take a number of seconds in extreme cases.
So what we need is another instance of the server already running with the new code, the instant the old one is not responding to new requests, right? That can be done in several ways:
Having more than one server, and a load-balancer on top of them, allowing one (or more) server to take the load while we restart another. That's the simplest way, and the way most people do it. If you need N servers to take the load of your users, just keep N+1 and restart one at a time.
Using socket sharing tricks. In newer Linux kernels, many processes can listen and accept on the same port. What you do is simply start the new instance and then tell the old one to finish and exit. This way there is no pause. This is done by setting SO_REUSEPORT on the listening socket.
The above can be automated with ready-to-ship solutions like Einhorn, which deals with all the details for you; see https://github.com/stripe/einhorn
Another approach is documented in this blog post: http://blog.nella.org/?p=879
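
As a rough illustration of the "drain phase" described above, here is a minimal sketch that relies on http.Server.Shutdown, which the standard library has provided since Go 1.8. The helper names `run` and `demo`, the 5-second drain window, and the choice of SIGINT/SIGTERM as the stop signals are assumptions for this example. To keep the program self-terminating, the demo delivers the stop signal itself while one slow request is in flight; in a real deployment the signal would come from the deploy script or process manager.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"log"
	"net"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// run serves h on ln until a signal arrives on stop, then drains: it stops
// accepting new connections and waits up to drain for in-flight requests.
func run(ln net.Listener, h http.Handler, stop <-chan os.Signal, drain time.Duration) error {
	srv := &http.Server{Handler: h}
	serveErr := make(chan error, 1)
	go func() { serveErr <- srv.Serve(ln) }()

	select {
	case err := <-serveErr:
		return err // Serve failed before any shutdown was requested
	case <-stop:
	}

	ctx, cancel := context.WithTimeout(context.Background(), drain)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return err // drain window expired with requests still running
	}
	if err := <-serveErr; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

// demo sends one slow request, delivers a stop signal while that request
// is still running, and returns the response body the client received.
func demo() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0") // any free local port
	if err != nil {
		return "", err
	}
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, os.Interrupt, syscall.SIGTERM) // real deploy signals
	defer signal.Stop(stop)

	started := make(chan struct{})
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		close(started)
		time.Sleep(200 * time.Millisecond) // pretend to do work
		fmt.Fprint(w, "done")
	})
	exited := make(chan error, 1)
	go func() { exited <- run(ln, slow, stop, 5*time.Second) }()

	type result struct {
		body string
		err  error
	}
	got := make(chan result, 1)
	go func() {
		resp, err := http.Get("http://" + ln.Addr().String())
		if err != nil {
			got <- result{"", err}
			return
		}
		defer resp.Body.Close()
		b, err := io.ReadAll(resp.Body)
		got <- result{string(b), err}
	}()

	select {
	case <-started: // the request is now in flight
	case r := <-got:
		return r.body, r.err // request ended without reaching the handler
	}
	stop <- os.Interrupt // simulate the signal a deploy script would send
	r := <-got
	if r.err != nil {
		return "", r.err
	}
	return r.body, <-exited
}

func main() {
	body, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("in-flight request completed with body:", body)
}
```

This only covers the first half of the problem (not cutting off running requests). Avoiding the gap before the new instance starts still needs one of the approaches listed above, such as a load balancer or SO_REUSEPORT.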

ColdFusion sessions not being timed out

We have 2 core applications running on our servers on CF 8, and both have the exact same session timeout set in the application CFC (2 hours at the moment). However, we're seeing that sessions are spiralling out of control for one of the applications (currently at 120,000+ on one server), let's call it AppA, whereas AppB seems fine (and AppB is the one we'd expect a lot more traffic to).
So I did some further digging and found out that most of the sessions for AppA have been idle for many hours with the highest value I've seen so far being over 11 hours.
We're not actually doing that much with sessions so I'm a little confused as to why they're not being timed out as expected. Also I've dumped the this scope in the application CFC and it is showing the expected value for sessionTimeout.
The only thing I had noticed is that in one instance we're assigning a variable on the Request scope from a Session variable. If it were a different scope I would maybe think that is causing some sort of reference that GC (or whatever) can't clear.

In terms of the spiral, I'd say that's to do with some requests which aren't passing the CFID/CFTOKEN through to maintain the session. This could be web service calls, CFHTTP requests, search engine bots, etc. It sounds like one of your apps is experiencing this. If this is the case, then for CFHTTP, pass the CFID/CFTOKEN through to maintain sessions. Web services are a bit trickier: you'll need to create a 'key' which is passed back and forth, which is a whole separate topic! Bots can be handled by having some conditionals to set the session timeout value.
For the 11 hours, I'd say that's due to the session being kept alive by something. Some continual polling? A recurring AJAX request? It would have to be something that continues to pass the ID/TOKEN through.

I used to get server lockups in CF6.1 when I was persisting CFCs in the application or session scopes. Now I instantiate them in the request scope and the lockups stopped happening (with no noticeable performance drop). Maybe you have a similar issue.

Actually, it turns out the sessions were started from another app which wasn't overriding the default value in the base Application.cfc (including the application name).
