Flask & NGINX performance tuning

I have a RESTful web service written with the Flask framework.
To keep it simple, let's assume that all requests are GET, i.e. only a request line plus headers.
Request buffering usually takes about 100 ms; processing takes about 1 second per request.
During stress tests I found an issue:
When lots of clients (hundreds or more) open a connection at the same time, there's a delay between the connections being opened and the start of processing.
It turned out that Flask reads the header part of each HTTP request as soon as the connection is opened. More connections -> more headers to read -> higher network load -> longer request buffering time.
For example, 100 simultaneously opened connections will start buffering together and take 0.1*100 = 10 seconds to buffer. They will then reach the processing stage together.
My intention is to reach 2 goals:
Primary: start processing as quickly as possible.
Secondary: buffer all requests as quickly as possible.
Despite the apparent contradiction, both can be achieved by following one rule:
buffer fewer connections when processing is starved.
Once again, to keep it simple, I want my server to buffer only 10 connections at a time (or per second). All other connections should be accepted by the server socket and wait patiently for their turn. Alternatively, accept only 10 connections per second (and still queue the rest rather than discarding bursts).
I tried to do it with an NGINX reverse proxy using limit_req:
http {
    limit_req_zone $server_name zone=processing_limit:10m rate=1r/s;
}

location ~ /process {
    # Limit requests
    limit_req zone=processing_limit burst=1000;

    include proxy_params;
    proxy_pass_header Server;
    proxy_request_buffering on;
    proxy_buffering off;
    client_max_body_size 100m;
    proxy_pass http://127.0.0.1:8081;
}
But NGINX also buffers all connections together and only forwards one request per second to the backend server.
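For reference, the goal of buffering only about 10 connections at a time could also be approached at the application layer rather than in NGINX. This is only a rough sketch, assuming the Flask app were served with gevent (which is not part of the setup above); myapp.app is a placeholder for the actual application:

# Sketch: cap how many connections are read (buffered) and processed at once.
# Connections beyond the pool size wait in the kernel's listen backlog until
# a slot frees up; nothing is discarded.
from gevent.pool import Pool
from gevent.pywsgi import WSGIServer

from myapp import app  # hypothetical module exposing the Flask application

pool = Pool(10)  # at most 10 connections handled concurrently
WSGIServer(("127.0.0.1", 8081), app, spawn=pool).serve_forever()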

Related

Redirect to https increases site load time with Varnish

I want to redirect every HTTP request to HTTPS. I have modified my VCL file as follows:
import std;

sub vcl_recv {
    if (std.port(server.ip) != 443) {
        set req.http.location = "https://" + req.http.host + req.url;
        return(synth(301));
    }
}

sub vcl_synth {
    if (resp.status == 301 || resp.status == 302) {
        set resp.http.location = req.http.location;
        return (deliver);
    }
}
It works, but it increases my site load time to approximately 1 second, whether the request is HTTP or HTTPS.
Is there any alternative approach I can use, or any way to improve load-time performance?
Varnish hardly adds latency to the request/response flow.
If the HTTPS resource that is redirected to is not cached, you will depend on the performance of your origin server. If the origin server is slow, the loading time will be slow.
Browser timing breakdown
Please analyse the breakdown of the loading time for that resource in the "Network" tab of your browser's inspector.
For the page that is loaded, you can click the "Timings" tab in Firefox to see the breakdown.
If the Waiting timer is high, this means the server is slow.
If the Receiving timer is high, this means the network is slow.
Varnish timing breakdown
The varnishlog program allows you to inspect detailed transactions logs for Varnish.
Running varnishlog -g request -i ReqUrl -i Timestamp will list the URL and the loading times of every transaction.
Here's an example:
root@varnish:/etc/varnish# varnishlog -c -g request -i ReqUrl -i Timestamp
* << Request >> 163843
- Timestamp Start: 1645036124.028073 0.000000 0.000000
- Timestamp Req: 1645036124.028073 0.000000 0.000000
- ReqURL /
- Timestamp Fetch: 1645036124.035310 0.007237 0.007237
- Timestamp Process: 1645036124.035362 0.007288 0.000051
- Timestamp Resp: 1645036124.035483 0.007409 0.000120
Timestamps are expressed in UNIX epoch time. After the absolute timestamp, the first timer in every Timestamp log line is the total time since the start of the transaction, and the second timer is the time spent in that individual part.
In this example the total server-side loading time was 0.007409 seconds. The backend fetch was responsible for 0.007237 seconds of it.
Varnish itself spent virtually no time before the fetch (the Start and Req timestamps are identical), 0.000051 seconds on processing after the fetch, and 0.000120 seconds on sending the response.
You can use this command to inspect server-side performance at your end. It also allows you to inspect whether or not Varnish is to blame for any incurred latency.
When there's no Fetch line it means you're dealing with a cache hit and no backend fetch was required.
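To illustrate the meaning of the two timers, here is a small Python sketch that parses the Resp line from the example output above (the line is copied verbatim; nothing else is assumed):

# Parse one Timestamp record from the varnishlog output shown above.
line = "- Timestamp Resp: 1645036124.035483 0.007409 0.000120"
_, tag, event, absolute, since_start, since_last = line.split()

print(event.rstrip(":"))   # "Resp"
print(float(absolute))     # absolute UNIX time of the event
print(float(since_start))  # 0.007409 -> total time since the start of the transaction
print(float(since_last))   # 0.000120 -> time spent in this individual part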
Conclusion
Combine server-side timing breakdown and client-side timing breakdown to form a conclusion on what causes the delay. Based on that information you can improve the component that is causing this delay.
It might help to add synthetic("") to your vcl_synth {} to make sure that the redirects are sent with an empty body, but I agree that the code should not increase the response time by any significant amount.

ChromeCast - Stream Calls Failing when Stalled for long time

I'm attempting to play a live stream on ChromeCast. The stream is cast fine and playback starts appropriately.
However, when I play the stream for longer (somewhere between 2 and 15 minutes), the player stops playing and I get MediaStatus.IDLE_REASON_ERROR in my RemoteMediaClient.Callback.
When looking at the console logs from the ChromeCast I see that 3-4 calls have failed. Here are the logs:
14:50:26.931 GET https://... 0 ()
14:50:27.624 GET https://... 0 ()
14:50:28.201 GET https://... 0 ()
14:50:29.351 GET https://... 0 ()
14:50:29.947 media_player.js:64 [1381.837s] [cast.player.api.Host] error: cast.player.api.ErrorCode.NETWORK/3126000
Looking at Cast MediaPlayer.ErrorCode, error 312.* is:
Failed to retrieve the media (bitrated) playlist m3u8 file with three retries.
Developers need to validate that their playlists are indeed available. It could be the case that a user cannot reach the playlist as well.
I checked, and the playlist was available. So I thought perhaps the server wasn't responding, and looked at the response logs for the network calls.
Successful request vs. stalled request (screenshots of the timing breakdowns).
Note that the stall time of these requests far exceeds the usual stall time.
The ChromeCast isn't actually making these calls at all; the requests are simply stalled for a long time until they are cancelled. All the successful requests are stalled for less than 14 ms (mostly under 2 ms).
The Network Analysis Timing Breakdown provides three reasons for stalling:
There are higher priority requests.
There are already six TCP connections open for this origin, which is the limit. Applies to HTTP/1.0 and HTTP/1.1 only.
The browser is briefly allocating space in the disk cache.
While I believe the first one should not be the case, the latter two can be. However, in both cases I believe the fault lies with cast.player.
Am I doing something wrong?
Has anyone else faced the same issue? Is there any way to either fix it or come up with a workaround?

When does nginx $upstream_response_time start/stop specifically

Does anyone know when, specifically, the clock for $upstream_response_time begins and ends?
The documentation seems a bit vague:
keeps time spent on receiving the response from the upstream server; the time is kept in seconds with millisecond resolution. Times of several responses are separated by commas and colons like addresses in the $upstream_addr variable.
There is also an $upstream_header_time value, which adds more confusion.
I assume $upstream_connect_time stops once the connection is established, but before it is accepted upstream?
After this what does $upstream_response_time include?
Time spent waiting for upstream to accept?
Time spent sending the request?
Time spent sending the response header?
A more specific definition is in their blog.
$request_time – Full request time, starting when NGINX reads the first byte from the client and ending when NGINX sends the last byte of the response body
$upstream_connect_time – Time spent establishing a connection with an upstream server
$upstream_header_time – Time between establishing a connection to an upstream server and receiving the first byte of the response header
$upstream_response_time – Time between establishing a connection to an upstream server and receiving the last byte of the response body
So:
$upstream_header_time is included in $upstream_response_time.
Time spent connecting to the upstream is not included in either of them.
Time spent sending the response to the client is not included in either of them.
I've investigated and debugged the behavior around this, and it turned out as follows:
$upstream_connect_time: starts before Nginx establishes the TCP connection with the upstream server; ends before Nginx sends the HTTP request to the upstream server.
$upstream_header_time: starts before Nginx establishes the TCP connection with the upstream server; ends after Nginx receives and processes the headers of the HTTP response from the upstream server.
$upstream_response_time: starts before Nginx establishes the TCP connection with the upstream server; ends after Nginx receives and processes the HTTP response from the upstream server.
source code
I'll explain how the values of $upstream_connect_time and $upstream_response_time differ, as that's what I was primarily interested in.
The value of u->state->connect_time, which represents $upstream_connect_time in milliseconds, is set in the following section: https://github.com/nginx/nginx/blob/3334585539168947650a37d74dd32973ab451d70/src/http/ngx_http_upstream.c#L2073
if (u->state->connect_time == (ngx_msec_t) -1) {
    u->state->connect_time = ngx_current_msec - u->start_time;
}
Whereas the value of u->state->response_time, which represents $upstream_response_time in milliseconds, is set in the following section: https://github.com/nginx/nginx/blob/3334585539168947650a37d74dd32973ab451d70/src/http/ngx_http_upstream.c#L4432
if (u->state && u->state->response_time == (ngx_msec_t) -1) {
    u->state->response_time = ngx_current_msec - u->start_time;
}
You can see that both values are calculated from u->start_time, which is recorded just before the connection is established, as defined in https://github.com/nginx/nginx/blob/3334585539168947650a37d74dd32973ab451d70/src/http/ngx_http_upstream.c#L1533
(note that ngx_event_connect_peer is the function that establishes the TCP connection between nginx worker processes and upstream servers).
Therefore, both values include the time taken to establish the TCP connection. You can check this by doing live debugging with, for example, gdbserver.
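To make the relationship concrete, here is a small sketch with hypothetical values (not taken from a real log), following the reading above that both $upstream_header_time and $upstream_response_time are measured from just before the TCP connection is established:

# Hypothetical values, in seconds, as they might appear in an access log.
upstream_connect_time = 0.004
upstream_header_time = 0.120
upstream_response_time = 0.480

# Time from connection established until the first header byte arrived
# (includes sending the request and the upstream's processing time).
wait_for_headers = upstream_header_time - upstream_connect_time   # -> 0.116

# Time spent receiving the rest of the response after the headers.
body_transfer = upstream_response_time - upstream_header_time     # -> 0.36

print(round(wait_for_headers, 3), round(body_transfer, 3))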

Strategies in reducing network delay from 500 milliseconds to 60-100 milliseconds

I am building autocomplete functionality and realized that the amount of time taken between the client and the server is too high (in the range of 450-700 ms).
My first stop was to check whether this is a result of server delay.
But as the Nginx logs show, the request time (the last column) is almost always 0.001 seconds, so that is hardly a cause for concern.
So it became very evident that I am losing time between the server and the client. My benchmark is Google Instant's response time, which is almost always in the range of 30-40 milliseconds: orders of magnitude lower.
Although it's easy to say that Google has massive infrastructure to deliver at this speed, I wanted to push myself to learn whether this is possible for someone who is not at that level. If not 60 milliseconds, I want to shave off at least 100-150 milliseconds.
Here are some of the strategies I’ve managed to learn.
Enable httpd slowstart and initcwnd
Ensure SPDY if you are on https
Ensure results are http compressed
Etc.
What are the other things I can do here? For example:
Does having a persistent connection help?
Should I reduce the response size dramatically?
Edit:
Here are the ping and traceroute numbers. The site is served via Cloudflare from a Linode machine in Fremont.
mymachine-Mac:c name$ ping site.com
PING site.com (160.158.244.92): 56 data bytes
64 bytes from 160.158.244.92: icmp_seq=0 ttl=58 time=95.557 ms
64 bytes from 160.158.244.92: icmp_seq=1 ttl=58 time=103.569 ms
64 bytes from 160.158.244.92: icmp_seq=2 ttl=58 time=95.679 ms
^C
--- site.com ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 95.557/98.268/103.569/3.748 ms
mymachine-Mac:c name$ traceroute site.com
traceroute: Warning: site.com has multiple addresses; using 160.158.244.92
traceroute to site.com (160.158.244.92), 64 hops max, 52 byte packets
1 192.168.1.1 (192.168.1.1) 2.393 ms 1.159 ms 1.042 ms
2 172.16.70.1 (172.16.70.1) 22.796 ms 64.531 ms 26.093 ms
3 abts-kk-static-ilp-241.11.181.122.airtel.in (122.181.11.241) 28.483 ms 21.450 ms 25.255 ms
4 aes-static-005.99.22.125.airtel.in (125.22.99.5) 30.558 ms 30.448 ms 40.344 ms
5 182.79.245.62 (182.79.245.62) 75.568 ms 101.446 ms 68.659 ms
6 13335.sgw.equinix.com (202.79.197.132) 84.201 ms 65.092 ms 56.111 ms
7 160.158.244.92 (160.158.244.92) 66.352 ms 69.912 ms 81.458 ms
I may well be wrong, but personally I smell a rat. Your times aren't justified by your setup; I believe that your requests ought to run much faster.
If at all possible, generate a short query using curl and intercept it with tcpdump on both the client and the server.
It could be a bandwidth/concurrency problem on the hosting. Check out its diagnostic panel, or try estimating the traffic.
You can try saving a response to a static file and then requesting that file (taking care not to trigger the local browser cache), to see whether the problem might be in processing the data, either server side or client side; a rough timing sketch follows after these suggestions.
Does this slowness affect every request, or only the autocomplete ones? If the latter, and no matter what nginx says, it might be some inefficiency or delay in retrieving or formatting the autocompletion data for output.
Also, you can try serving a static response bypassing nginx altogether, in case this is an issue with nginx (and for that matter: have you checked nginx's error log?).
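As a rough illustration of the static-file comparison suggested above, one could time the dynamic endpoint against a saved copy from the same client. The URL paths below are placeholders, not taken from the question:

# Compare a dynamic autocomplete response with a pre-saved static copy.
import time
import requests

urls = [
    "https://site.com/autocomplete?q=test",       # dynamic endpoint (hypothetical path)
    "https://site.com/static/autocomplete.json",  # saved response served as a static file (hypothetical path)
]

for url in urls:
    start = time.monotonic()
    response = requests.get(url, headers={"Cache-Control": "no-cache"})
    elapsed = time.monotonic() - start
    print(url, response.status_code, f"{elapsed:.3f}s")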
One approach I didn't see you mention is to use SSL sessions: you can add the following to your nginx conf to make sure that an SSL handshake (a very expensive process) does not happen with every connection request:
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 10m;
See "HTTPS server optimizations" here:
http://nginx.org/en/docs/http/configuring_https_servers.html
I would recommend using New Relic if you aren't already. It is possible that the server-side code you have could be the issue. If you think that might be the issue, there are quite a few free code profiling tools.
You may want to consider preloading the autocomplete options in the background while the page is rendered, and then saving a trie (or whatever structure you use) on the client in local storage. When the user starts typing in the autocomplete field, you would not need to send any requests to the server; instead, you would query local storage.
Web SQL Database and IndexedDB introduce databases to the clientside. Instead of the common pattern of posting data to the server via XMLHttpRequest or form submission, you can leverage these clientside databases. Decreasing HTTP requests is a primary target of all performance engineers, so using these as a datastore can save many trips via XHR or form posts back to the server. localStorage and sessionStorage could be used in some cases, like capturing form submission progress, and have been seen to be noticeably faster than the client-side database APIs.

For example, if you have a data grid component or an inbox with hundreds of messages, storing the data locally in a database will save you HTTP roundtrips when the user wishes to search, filter, or sort. A list of friends or a text input autocomplete could be filtered on each keystroke, making for a much more responsive user experience.
http://www.html5rocks.com/en/tutorials/speed/quick/#toc-databases

ZMQ data transfer latency from one process to another?

When using ZMQ to transfer data, the sending side is fast and the data is huge, but the receiving side processes it slowly, so data accumulates between the two processes. Does anyone know how to solve this problem? Thanks.
Instead of sending all the data at once, send it in chunks. Something like this...
Client requests file 'xyz' from server
Server responds with file size only, ex: 10Mb
Client sets chunk size accordingly, ex: 1024b
Client sends read requests to server for chunks of data:
client -> server: give me 0 to 1023 bytes for file 'xyz'
server -> client: 1st chunk
client -> server: give me 1024 to 2047 bytes for file 'xyz'
server -> client: 2nd chunk
...and so on.
For each response, the client persists the chunk to disk.
This approach allows the client to throttle the rate at which data is transmitted from the server. Also, in case of a network failure, since each chunk is persisted, there's no need to read the file from the beginning; the client can resume requesting chunks from the point where the last response failed.
You mentioned nothing about language bindings, but this solution should be trivial to implement in just about any language.
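As a rough sketch of that protocol in Python with pyzmq, it could look something like this (the file name, endpoint, and chunk size are just placeholders):

import os
import zmq

CHUNK_SIZE = 1024  # bytes per chunk, as in the example above

def server(path="xyz", endpoint="tcp://*:5555"):
    # Replies to "size" and "read <offset>" requests for a single file.
    sock = zmq.Context.instance().socket(zmq.REP)
    sock.bind(endpoint)
    while True:
        cmd, offset = sock.recv_multipart()
        if cmd == b"size":
            sock.send(str(os.path.getsize(path)).encode())
        else:  # b"read"
            with open(path, "rb") as f:
                f.seek(int(offset))
                sock.send(f.read(CHUNK_SIZE))

def client(out_path="xyz.copy", endpoint="tcp://localhost:5555"):
    # Asks for the file size, then pulls the file one chunk at a time,
    # persisting each chunk to disk before requesting the next one.
    sock = zmq.Context.instance().socket(zmq.REQ)
    sock.connect(endpoint)
    sock.send_multipart([b"size", b"0"])
    total = int(sock.recv())
    with open(out_path, "wb") as f:
        offset = 0
        while offset < total:
            sock.send_multipart([b"read", str(offset).encode()])
            chunk = sock.recv()
            f.write(chunk)
            offset += len(chunk)

Because each chunk is written out before the next request is sent, the client controls the transfer rate and can resume from the last persisted offset after a failure.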
