WinHTTP error 12175 after days and a huge number of queries - windows

I am using the Windows WinHTTP library to perform HTTP and HTTPS queries, and am sometimes getting a WinHTTP 12175 error, usually after several days of operation (sometimes weeks) and hundreds of thousands to millions of queries.
When that happens, the only way to "fix" the error is to restart the service, and occasionally the errors will not go away until the Windows server itself is restarted.
Interestingly enough, this error appears "gradually": first some HTTPS queries get it, then after some time the service gets it for 100% of HTTPS queries, and later on it pops up even for regular HTTP queries (the service makes lots of HTTP and HTTPS queries, on the order of dozens per second, to many servers).
The HTTP connections are made directly, without any proxies. The OS is Windows Server 2012 R2.
AFAICT there is no memory leak or corruption; apart from the 12175 errors, the rest of the service operates just fine: no access violations or suspicious exceptions, the service responds to queries normally, etc.
I have a suspicion this could be related to Windows Update doing OS certificate updates, renewing credentials or something, but I could never positively confirm it.
Has anyone else observed this behavior? Are there any workarounds?
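For what it's worth, 12175 is ERROR_WINHTTP_SECURE_FAILURE, which fits the certificate suspicion. A minimal diagnostic sketch (the logging is illustrative): register a WinHTTP status callback so that the next time the error appears, you can see which TLS/certificate check actually failed.

    // Minimal diagnostic sketch; the logging is illustrative.
    // When a request fails with ERROR_WINHTTP_SECURE_FAILURE (12175),
    // this callback reports which certificate/TLS check failed.
    #include <windows.h>
    #include <winhttp.h>   // link with winhttp.lib
    #include <cstdio>

    void CALLBACK SecureFailureCallback(HINTERNET, DWORD_PTR,
                                        DWORD dwStatus, LPVOID lpInfo, DWORD)
    {
        if (dwStatus != WINHTTP_CALLBACK_STATUS_SECURE_FAILURE)
            return;
        DWORD flags = *static_cast<DWORD*>(lpInfo);
        if (flags & WINHTTP_CALLBACK_STATUS_FLAG_CERT_REV_FAILED)
            std::printf("revocation check failed\n");
        if (flags & WINHTTP_CALLBACK_STATUS_FLAG_CERT_REVOKED)
            std::printf("certificate revoked\n");
        if (flags & WINHTTP_CALLBACK_STATUS_FLAG_CERT_DATE_INVALID)
            std::printf("certificate expired or not yet valid\n");
        if (flags & WINHTTP_CALLBACK_STATUS_FLAG_SECURITY_CHANNEL_ERROR)
            std::printf("Schannel error\n");
    }

    // After WinHttpOpen, on the session handle hSession:
    //   WinHttpSetStatusCallback(hSession, SecureFailureCallback,
    //                            WINHTTP_CALLBACK_FLAG_SECURE_FAILURE, 0);

If the flags show revocation checks failing more and more often over time, that points at CRL/OCSP fetching or the machine's certificate stores rather than at your code.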

Related

Intermittent issues with MSMQ

We have intermittent issues with MSMQ communication cross domain.
Applications residing in domain A (the client) and domain B (the server) are communicating. The client can always send messages to the server. The queues (DIRECT=OS format) are created and in a "Connected" state on the server. Sometimes, however, when the server responds, it generates a "BadDestinationQueue" error.
This can go on for anywhere between 10 minutes and 1 hour, after which the error is gone and communication is normal. We see nothing abnormal and no obvious temporal pattern.
Clients in domain B talking to the server in B have no such issues.
Are there any settings that could cause this? The intermittency suggests some sort of cache somewhere. The DNS entries are up to date, and the machines are discoverable by hostname even while MSMQ insists the destination queue does not exist.
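If you want to test the name-resolution angle, here is a small sketch (the queue path and IP address are made up): opening the destination queue with a DIRECT=TCP format name bypasses hostname resolution entirely, so if that form never fails while DIRECT=OS does, the culprit is name resolution or its caches rather than MSMQ itself.

    // Hypothetical test: compare a hostname-based (DIRECT=OS) format name
    // with an IP-based (DIRECT=TCP) one; both paths are placeholders.
    #include <windows.h>
    #include <mq.h>       // link with mqrt.lib
    #include <cstdio>

    int main()
    {
        const WCHAR* formats[] = {
            L"DIRECT=OS:serverB\\private$\\replies",
            L"DIRECT=TCP:10.0.0.12\\private$\\replies",
        };
        for (const WCHAR* fn : formats) {
            QUEUEHANDLE hQueue = nullptr;
            HRESULT hr = MQOpenQueue(fn, MQ_SEND_ACCESS, MQ_DENY_NONE, &hQueue);
            // Success over TCP while DIRECT=OS intermittently fails would
            // point at NetBIOS/DNS caching rather than MSMQ itself.
            std::wprintf(L"%s -> 0x%08lx\n", fn, hr);
            if (SUCCEEDED(hr))
                MQCloseQueue(hQueue);
        }
        return 0;
    }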

IE11 is very slow on HTTPS sites

We have an internal web application, and since recently it is very slow in IE11. Chrome and Firefox do not have this problem.
We have an HTTP version of the same website, and this is fast with up to 6 concurrent HTTP sessions to the server and session persistence.
However, when we change to HTTPS, all this changes.
We still have session persistence, but only 2 simultaneous sessions to the servers, and each request seems to take ages. It is not the server response that is slow, but the "start" time before IE11 sends the request (and the next one). The connection diagram changes to a staircase, issuing one request at a time, one after the other, with each request taking 100-200 ms even if it returns only a couple of bytes.
The workstation has Symantec Endpoint Protection installed (12.1), and also Digital Guardian software (4.7).

TFS 503 Service Unavailable when retrieving files

We have a TFS 2013 server which I use only for source control. I just got a new desktop PC with Windows 10 and Visual Studio 2017. I can connect to TFS successfully and start to pull down code, but after the directory structure has been built locally and files start to come down, several succeed, then the rest fail with 503 Service Unavailable and the connection is lost, forcing me to reconnect.
I can connect again a few seconds later, and keep trying, and it happens again.
If I pull down files one at a time, it seems to be ok, but when I try to pull an entire project, it blows up.
This happens on multiple projects on the server.
In general, I see 503 errors in ASP.NET when I overload an application to the point where it crashes the application pool. I don't know if that's what's happening here, but if it is, I wonder whether VS is pulling down the code too fast, maybe with more concurrent threads than the older version of TFS can handle, which crashes it.
If that's the case, is there anything I can do to change that on my machine? Or do you have any other ideas on what could cause this?
As of now, I don't have access to the server to grab the event logs, but I can put in a request if it's not something I can figure out and fix myself.
Last time I encountered this, I was behind a rate-limiting proxy; 503 is a common error code for temporary service interruptions such as rate limiting. nginx's rate-limiting documentation, for example, puts it like this:
Excessive requests are delayed until their number exceeds the maximum burst size, in which case the request is terminated with an error 503 (Service Temporarily Unavailable).
The IIS service that hosts TFS can also be configured with request rate limiting:
maxBandwidth (optional uint attribute)
Specifies the maximum network bandwidth, in bytes per second, used for a site. Use this setting to help prevent overloading the network with IIS activity.
maxConnections (optional uint attribute)
Specifies the maximum number of connections for a site. Use this setting to limit the number of simultaneous client connections.
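These are attributes of the per-site <limits> element in IIS's applicationHost.config. A rough illustration; the site name, id, and values below are made up:

    <!-- hypothetical per-site limits in applicationHost.config -->
    <site name="Team Foundation Server" id="2">
        <limits maxBandwidth="1048576" maxConnections="100" />
    </site>

A maxConnections value lower than the number of parallel downloads Visual Studio opens could produce exactly this kind of mid-transfer failure.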
Visual Studio 2017 has optimized the way it downloads, using multiple parallel threads (at least 8, if I remember correctly). This can trigger rate limiting in some proxies, and lots of small code files can easily trip a rate limit.
If it's possible to reach the TFS server directly, you could try adding it to the proxy exclusion list to rule out whether the proxy on your end is causing the interruption. But an HTTP server such as nginx could also be deployed as a reverse proxy on the server end as a security measure, in which case the server admin must change the limits.
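To illustrate the exclusion-list idea for Visual Studio specifically (the host name is a placeholder), a standard .NET defaultProxy bypass can be added to devenv.exe.config:

    <!-- hypothetical: bypass the proxy for the TFS host (address is a regex) -->
    <system.net>
      <defaultProxy>
        <proxy usesystemdefault="true" />
        <bypasslist>
          <add address="tfs\.example\.local" />
        </bypasslist>
      </defaultProxy>
    </system.net>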
I had a similar issue when accessing our TFS Git repo from IntelliJ. All of a sudden, it errored with "unable to access <repo url>: The requested URL returned error: 503".
What helped:
In IntelliJ, under VCS > Git > Remotes, change the URL to a false one (so that validation fails) and then back to the valid one.
In my case a login request then appeared, and I finally regained access to the repo.

OpenFire, HTTP-BIND and performance

I'm looking into getting an Openfire server started and setting up a Strophe.js client to connect to it. My concern is that using http-bind might be costly in terms of performance compared to a plain XMPP connection.
Can anyone tell me whether my concern is relevant or not? And if so, to what extent?
The alternative would be to use a flash proxy for all communication with OpenFire.
Thank you
BOSH is more verbose than normal XMPP, especially when idle. An idle BOSH connection might generate about two HTTP requests per minute, while a normal connection can sit idle for hours or even days without sending a single packet (in theory; in practice you'll have pings and keepalives to combat NATs and broken firewalls).
But, the only real way to know is to benchmark. Depending on your use case, and what your clients are (will be) doing, the difference might be negligible, or not.
Basics:
Socket - essentially zero overhead on an idle session.
HTTP - requests are sent even on an idle session.
I doubt that you will have 1M users at once, but if you are aiming for that, then a connection-less protocol like HTTP will be much better, as I'm not sure any OS can support that volume of connected sockets.
Also, you can tie your Openfire servers together into a farm and get nice scalability there.
We used Openfire and BOSH with about 400 concurrent users in the same MUC channel.
What we noticed is that Openfire leaks memory. We had about 1.5-2 GB of memory in use and got constant out-of-memory exceptions.
Openfire's BOSH implementation is also pretty bad. We then switched to Punjab, which was better but couldn't solve the Openfire issue.
We're now using ejabberd with its built-in http-bind implementation, and it scales pretty well. Load on the server running ejabberd is nearly zero.
At the moment we face the problem that our 5 web servers, which we use to handle the chat load, are sometimes overloaded at about 200 connected users.
I'm trying to use WebSockets now, but it seems they don't work yet.
Maybe routing the http-bind traffic not via an Apache rewrite rule but directly through a load balancer/proxy would solve the issue, but I couldn't find a way to do this at the moment.
Hope this helps.
I ended up using node.js and http://code.google.com/p/node-xmpp-bosh as I faced some difficulties connecting directly to Openfire via BOSH.
I have a production site running with node.js configured to proxy all BOSH requests, and it works like a charm (around 50 concurrent users). The only downside so far: in the Openfire admin console you will not see the actual IP addresses of the connected clients; only the local server address shows up, as Openfire gets the connections from the node.js server.

Detecting dead applications while server is alive in NLB

Windows NLB works great and removes a computer from the cluster when the computer is dead.
But what happens if the application dies but the server still works fine? How have you solved this issue?
Thanks
By not using NLB.
Hardware load balancers often have configurable "probe" functions to determine if a server is responding to requests. This can be by accessing the real application port/URL, or some specific "healthcheck" URL that returns only if the application is healthy.
Other options on these look at the queue depth or the time taken to respond to requests.
Cisco put it like this:
The Cisco CSM continually monitors server and application availability using a variety of probes, in-band health monitoring, return code checking, and the Dynamic Feedback Protocol (DFP). When a real server or gateway failure occurs, the Cisco CSM redirects traffic to a different location. Servers are added and removed without disrupting service; systems easily are scaled up or down.
(from here: http://www.cisco.com/en/US/products/hw/modules/ps2706/products_data_sheet09186a00800887f3.html#wp1002630)
Presumably with Windows NLB there is some way to programmatically set the weight of nodes? The nodes should self-monitor, and if there is some problem (e.g. a particular node is low on disk space), set their weight to zero so they receive no further traffic.
However, this needs to be carefully engineered and have further human monitoring to ensure that you don't end up with a situation where one fault causes the entire cluster to announce itself down.
You can't really hope to deal with a "Byzantine generals" situation in network load balancing; an appropriately broken node may think it's fine and appear fine while being completely unable to do any actual work. The trick is to minimise the possibility of these situations happening in production.
There are multiple levels of health check for a network application:
1. Is the server machine up?
2. Is the application (service) running?
3. Is the service accepting network connections?
4. Does the service respond appropriately to an "are you OK" request?
5. Does the service perform real work? (This also exercises the back-end systems behind the service you are probing.)
My experience with NLB may be incomplete, but I'll describe what I know. NLB can do levels 1 and 2. With custom coding you can add the other levels with varying difficulty; with some network architectures this can be very difficult.
Most hardware load balancers from vendors like Cisco or F5 can easily be configured to do levels 3 or 4. Level 5 testing still requires custom coding.
We start in a situation where all nodes are part of the cluster but inactive.
We run a custom service monitor which makes a request to the service locally via the external interface. If the response is successful, we start the node (allow it to start handling NLB traffic); if the response fails, we stop the node from receiving traffic. A sketch of such a monitor is below.
All the intermediate steps described by Darron are irrelevant; whether it worked or not is the only thing we care about. If the machine is inaccessible, the rest of the NLB cluster will treat it as failed.
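A minimal sketch of that monitor, assuming the service exposes an HTTP health URL (the host name and path below are placeholders) and using wlbs.exe, the command-line control tool that ships with Windows NLB:

    // Minimal self-monitor sketch. Assumptions: the service exposes an HTTP
    // health URL (host/path below are placeholders), and wlbs.exe is
    // available on the node.
    #include <windows.h>
    #include <winhttp.h>   // link with winhttp.lib
    #include <cstdlib>

    static bool ProbeHealthy()
    {
        bool ok = false;
        HINTERNET hSession = WinHttpOpen(L"nlb-monitor/1.0",
                                         WINHTTP_ACCESS_TYPE_NO_PROXY,
                                         WINHTTP_NO_PROXY_NAME,
                                         WINHTTP_NO_PROXY_BYPASS, 0);
        if (!hSession) return false;
        // Probe via the external interface, as described above.
        HINTERNET hConnect = WinHttpConnect(hSession, L"app.example.local", 80, 0);
        HINTERNET hRequest = hConnect
            ? WinHttpOpenRequest(hConnect, L"GET", L"/health", nullptr,
                                 WINHTTP_NO_REFERER,
                                 WINHTTP_DEFAULT_ACCEPT_TYPES, 0)
            : nullptr;
        if (hRequest
            && WinHttpSendRequest(hRequest, WINHTTP_NO_ADDITIONAL_HEADERS, 0,
                                  nullptr, 0, 0, 0)
            && WinHttpReceiveResponse(hRequest, nullptr)) {
            DWORD status = 0, len = sizeof(status);
            ok = WinHttpQueryHeaders(hRequest,
                     WINHTTP_QUERY_STATUS_CODE | WINHTTP_QUERY_FLAG_NUMBER,
                     WINHTTP_HEADER_NAME_BY_INDEX, &status, &len,
                     WINHTTP_NO_HEADER_INDEX) && status == 200;
        }
        if (hRequest) WinHttpCloseHandle(hRequest);
        if (hConnect) WinHttpCloseHandle(hConnect);
        WinHttpCloseHandle(hSession);
        return ok;
    }

    int main()
    {
        for (;;) {
            // "drainstop" finishes existing connections before taking the
            // node out of rotation; "start" puts it back in.
            std::system(ProbeHealthy() ? "wlbs start" : "wlbs drainstop");
            Sleep(30000);   // probe every 30 seconds
        }
    }

In a real deployment you would debounce the decision (e.g. require several consecutive failures before draining) so a single slow response doesn't flap the node; that also guards against the one-fault-takes-down-the-whole-cluster scenario mentioned above.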
