WinHttpWriteData completion - winapi

I'm using WinHTTP to transfer large files to a PHP-based web server and I want to display the progress and an estimated speed. After reading the docs I have decided to use chunked transfer encoding. The files get transferred correctly but there is an issue with estimating the time that I cannot solve.
I'm using a loop to send chunks with WinHttpWriteData (chunk header + data + chunk footer) and I compute the time difference between start and finish with GetTickCount. I have a fixed bandwidth of 4 Mbit/s configured on my router in order to test the correctness of my estimation.
The typical time difference for chunks of 256KB is between 450 - 550ms, which is correct. The problem is that once in a while (few seconds/tens of seconds) WinHttpWriteData returns really really fast, like 4-10ms, which is obviously not possible. The next difference is much higher than the average 500ms.
Why does WinHttpWriteData confirm, either synchronously or asynchronously, that it has written the data to the destination when, in reality, the data is still being transferred? Any ideas?
Oversimplified, my code looks like:
while (dataLeft) {
    t1 = GetTickCount();
    WinHttpWriteData(hRequest, chunkHdr, chunkHdrLen, NULL);
    WinHttpWriteData(hRequest, actualData, actualDataLen, NULL);
    WinHttpWriteData(hRequest, chunkFtr, chunkFtrLen, NULL);
    t2 = GetTickCount();
    tdif = t2 - t1;
}

This is simply the nature of how sockets work in general.
Whether you call a lower-level function like send() or a higher-level function like WinHttpWriteData(), the functions return success/failure based on whether or not they are able to pass data to the underlying socket layer in the kernel. The kernel queues up data for eventual transmission in the background. It does not report back when the data is actually transmitted, or whether the receiver has acknowledged it. The kernel happily accepts new data as long as there is room in the queue, even if it will take a while to actually transmit. Otherwise, it will block the sender until room becomes available in the queue.
If you need to monitor actual transmission speed, you have to monitor the low level network activity directly, such as with a packet sniffer or driver hook. Otherwise, you can only monitor how fast you are able to pass data to the kernel (which is usually good enough for most purposes).
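Since individual write timings are bursty for the reason above, one way to get a usable speed estimate is to average over several recent writes instead of timing a single chunk. Here is a minimal Python sketch of that idea. The class name, the sample count and the timestamps are all made up for illustration; in the WinHTTP loop you would feed it the GetTickCount values taken before and after each chunk.

```python
from collections import deque


class WindowedRate:
    """Average throughput over the last few writes (hypothetical helper)."""

    def __init__(self, max_samples=8):
        # Each sample is (start_time, end_time, byte_count); old ones fall off.
        self.samples = deque(maxlen=max_samples)

    def add(self, start, end, nbytes):
        self.samples.append((start, end, nbytes))

    def bytes_per_second(self):
        if not self.samples:
            return 0.0
        # Time from the start of the oldest write to the end of the newest.
        elapsed = self.samples[-1][1] - self.samples[0][0]
        if elapsed <= 0:
            return 0.0
        return sum(n for _, _, n in self.samples) / elapsed


rate = WindowedRate()
chunk = 256 * 1024
rate.add(0.0, 0.5, chunk)      # a normal write
rate.add(0.5, 0.505, chunk)    # returns almost at once: the kernel buffered it
rate.add(0.505, 1.5, chunk)    # the next write blocks for longer
print(rate.bytes_per_second())
```

The fast write and the slow write that follows it cancel out inside the window, so the reported rate stays close to what the link actually carries.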



We are developing a project using Angular in the front and Spring at the backend. Nothing new. But we have set-up the backend to use HTTP2 and from time to time we find weird problems.
Today I started playing with "Network Log Export" from chrome and I found this interesting piece of information in the HTTP2_SESSION line of the log.
t=43659 [st=41415] HTTP2_SESSION_RECV_GOAWAY
--> active_streams = 4
--> debug_data = "Connection [263], Too much overhead so the connection will be closed"
--> error_code = "11 (ENHANCE_YOUR_CALM)"
--> last_accepted_stream_id = 77
--> unclaimed_streams = 0
t=43659 [st=41415] HTTP2_SESSION_CLOSE
--> description = "Connection closed"
--> net_error = -100 (ERR_CONNECTION_CLOSED)
t=43661 [st=41417] -HTTP2_SESSION
It looks like the root of the problem for the ERR_CONNECTION_CLOSED is that the server decides there is too much overhead from the same client and closes the connection.
The question is: can we tune the server to accept overhead up to a certain limit? How? I believe this is something we should be able to tune in Spring or Tomcat or somewhere there.
The overhead protection was put in place in response to a collection of CVEs reported against HTTP/2 in the middle of 2019. While Tomcat wasn't directly affected (the malicious input didn't trigger excessive load) we did take steps to block input that matched the malicious profile.
From your GitHub comment, you see issues with POSTs. That strongly suggests that the client is sending the POST data in multiple small packets rather than a smaller number of larger packets. Some clients (e.g. Chrome) are known to do this occasionally due to the way they buffer data.
A number of the HTTP/2 DoS attacks could be summarized as sending more overhead than data. While Tomcat wasn't directly affected, we took the decision to monitor for clients operating in this way and drop connections if any were found on the grounds that the client was likely to be malicious.
Generally, data packets reduce the overhead count, non-data packets increase the overhead count and (potentially) malicious packets increase the overhead count significantly. The idea is that an established, generally well-behaved, connection should be able to survive the occasional 'suspect' packet but any more than that will quickly trigger the connection to be closed.
In terms of small POST packets, the key configuration setting is overheadDataThreshold:
The overhead count starts at -10. For every DATA frame received it is reduced by 1. For every SETTINGS, PRIORITY and PING frame it is increased by overheadCountFactor. If the overhead count goes above 0, the connection is closed.
In addition, if the average size of a received non-final DATA frame and the previously received DATA frame (on that same stream) is less than overheadDataThreshold then the overhead count is increased by overheadDataThreshold/(average size of current and previous DATA frames). In this way, the smaller the DATA frame, the greater the increase in the overhead. A small number of small non-final DATA frames should be enough to trigger connection closure.
The averaging is there so buffering such as exhibited by Chrome does not trigger the overhead protection.
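To get a feel for how quickly small frames add up, here is a rough Python sketch of the counting rule as described above. It is not Tomcat's actual code. The function name, the frame tuples, the integer division and the choice to skip the average for the first DATA frame are all assumptions made for illustration, and it models a single stream only. overhead_count_factor has no default here because the right value depends on your Tomcat configuration.

```python
def simulate_overhead(frames, overhead_count_factor, overhead_data_threshold=1024):
    """Replay frames and report where the connection would be closed.

    frames is a list of (kind, size, final) tuples, e.g. ("DATA", 100, False).
    Returns (index_of_closing_frame or None, final_overhead_count).
    """
    count = -10                # the overhead count starts at -10
    prev_data_size = None      # assumption: first DATA frame has no average
    for index, (kind, size, final) in enumerate(frames):
        if kind == "DATA":
            count -= 1
            if not final and prev_data_size is not None:
                # Average of this frame and the previous one (at least 1).
                average = max(1, (size + prev_data_size) // 2)
                if average < overhead_data_threshold:
                    count += overhead_data_threshold // average
            prev_data_size = size
        elif kind in ("SETTINGS", "PRIORITY", "PING"):
            count += overhead_count_factor
        if count > 0:          # above 0 means the connection is closed
            return index, count
    return None, count


small_frames = [("DATA", 100, False)] * 5
print(simulate_overhead(small_frames, overhead_count_factor=1))
print(simulate_overhead(small_frames, overhead_count_factor=1,
                        overhead_data_threshold=0))
```

Under these assumptions a run of 100-byte non-final frames pushes the count above zero on the third frame, while a threshold of 0 means the small-frame penalty never applies.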
To diagnose this problem you need to look at the logs to see what size non-final DATA frames are being sent by the client. I suspect that will show a series of non-final DATA frames with size less than 1024 (the default for overheadDataThreshold).
To fix the issue my recommendation is to look at the client first. Why is it sending small non-final DATA frames and what can be done to stop it?
If you need an immediate mitigation then you can reduce overheadDataThreshold. The information you get on DATA frame sizes sent by the client should guide you as to what to set this to. It needs to be smaller than DATA frames being sent by the client. In extremis you can set overheadDataThreshold to zero to disable the protection.

Auto save performance for rdbms

In my app user types in some content which I would like to auto save as the user types. The save call is not for every keystroke, rather I do autosave only when user pauses for more than 200ms. So in a typical paragraph there are 15-20 server calls. The content will not be read very often, so I need to optimize the writes.
I have to save data on MSSQL Server because of legacy code reasons. I'm getting 10 seconds avg response time in my load test. How do I improve the performance?
One approach I'm considering is instead of directly saving data in mssql I'll save it in Cassandra or redis, then eventually(maybe at regular time intervals) write it to mssql.
Another approach is instead of doing frequent updates, I'll insert new record for each auto save. Then a background process will clean up all records except for latest, every few minutes.
I replaced the existing logic with simple update calls to 2 tables and now I am seeing improvements. There was a long stored procedure which was taking up to 10 seconds under load. So for now I have a hold on the problem. Still, I would like to know if there is something I can do on the application server layer to reduce frequent DB calls.
It is quite hard to answer your question directly, but here are some hints based on what we do in a multiple active user situation.
If you are writing/triggering on every keystroke, pass the keystroke to a background thread and do not perform the database write, or any network call, while blocking the users typing. A fast typist can hit 20 keystrokes/second, and you cannot afford to introduce latency.
If recording on a web page, you might be able to use localStorage. Do not issue an AJAX-style call on every keystroke, as there is a limit to outstanding requests. You need to implement some kind of buffered send. Remember that network calls in the real world can take on the order of 300 ms just to traverse the network.
Do you really need to save every keystroke, or is every N seconds acceptable? Every save operation will eventually turn into a disk operation, so you really want to coalesce as many saves as possible. The quickest way to do something is not to do it at all.
If you are recording to a database, then it is often quicker to update an existing row, if you can fetch it by direct key first. Unfortunately, it can sometimes be quicker to insert a new row and clean up the excess later. This tends to be true if the table has few indexes. Which is quicker depends on the database engine in use and how it is being used. We use both methods.
When using a database keep in mind that they often keep journals of some kind, so if you are updating frequently you might create a large load on the journal files.
If you are using techniques (Using C terminology) like fopen, fwrite these can perform very well, but if you are worried about system failure recovery, you may need to call fsync, which then limits your maximum performance rate. If you need fsync, a database might be better.
You might like to consider writing to a transactionlog table very frequently, and then posting to the real storage every N seconds. For example, if I am typing a customers name I might record every keystroke into a keylog table, and then have a background job read the keylog table and transfer the data to customers table. This helps reduce the operations to the customers table while also allowing the keylog table to be optimised to recording keystrokes. But, at the cost of more code server side.
Overall, you want logic like this
On keyup handler
    Add keystroke to background queue
    Wake background thread
Background thread
    Read/remove ALL data from background queue
    If no data, wait for wakeup and repeat
    Write to database/network/file etc. as one operation (this can now be synchronous calls)
    Optionally some velocity control; a simple one is sleep(50 ms) or sleep(2 s)
Keep in mind with the above the user can type and immediately hit close, so your final buffer write might not have flushed yet. You need to handle this.
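As a rough illustration of the logic above, here is a minimal Python sketch of a coalescing background writer, including a close() that flushes whatever is still queued. The class name and the write_batch callback are hypothetical; write_batch stands in for whatever single database, network or file write you actually perform.

```python
import queue
import threading
import time


class CoalescingWriter:
    """Collect items from the UI side and write them in batches."""

    def __init__(self, write_batch, min_interval=0.05):
        self._write_batch = write_batch    # hypothetical: one save per batch
        self._min_interval = min_interval  # simple velocity control
        self._queue = queue.Queue()
        self._stop = object()              # sentinel that requests shutdown
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def add(self, item):
        # Called from the key handler: never blocks on the database.
        self._queue.put(item)

    def close(self):
        # Handle "type and immediately close": flush the rest, then stop.
        self._queue.put(self._stop)
        self._thread.join()

    def _run(self):
        stopping = False
        while not stopping:
            batch = [self._queue.get()]    # wait for wakeup
            while True:                    # then drain everything queued
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            stopping = any(item is self._stop for item in batch)
            batch = [item for item in batch if item is not self._stop]
            if batch:
                self._write_batch(batch)   # one synchronous write
            if not stopping:
                time.sleep(self._min_interval)


saved = []
writer = CoalescingWriter(saved.append, min_interval=0.01)
for key in "hello":
    writer.add(key)
writer.close()
print(saved)
```

How many batches you see depends on timing, but every keystroke ends up in exactly one of them and in the original order.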
If you get this correct, the user will not notice any delay. In our usage, we are recording around 1000 keystrokes/sec on average, all of which are routed over private networks to central points. This load is barely a blip; even network monitoring does not see such a small amount of traffic.
Good luck.

MPI non-blocking communication and pthreads difference?

In MPI, there are non-blocking calls like MPI_Isend and MPI_Irecv.
If I am working on a p2p project, the Server would listen to many clients.
One way to do it:
for (int i = 1; i < highest_rank; i++) {
    MPI_Irecv(...., i, ...., statuses[i]); // listening to all slaves
}
for (int i = 1; i < highest_rank; i++) {
    // if true, do something
}
Another old way that I could do it is:
Server creating many POSIX threads, pass in a function,
that function will call MPI_Recv and loop forever.
Theoretically, which one would perform faster on the server end? If there is another better way to write the server, please let me know as well.
The latter solution does not seem very efficient to me because of all the overhead from managing the pthreads inside a MPI process.
Anyway, I would rewrite your MPI code as:
for (int i = 1; i < highest_rank; i++) {
    MPI_Irecv(...., i, ...., &requests[i]); // listening to all slaves
}
// requests[0] is unused here and must be set to MPI_REQUEST_NULL
MPI_Waitany(highest_rank, requests, &index, &status);
// do something useful
Even better you can use MPI_Recv with MPI_ANY_SOURCE as the rank of the source of a message. It seems like your server does not have anything to do except serving request therefore there is no need to use an asynchronous recv.
Code would be:
MPI_Recv(..., MPI_ANY_SOURCE, REQUEST_TAG, MPI_comm, &status);
// retrieve the client id from status.MPI_SOURCE and do something
When calling MPI_Irecv, it is NOT safe to test the recv buffer until AFTER MPI_Test* or MPI_Wait* have been called and successfully completed. The behavior of directly testing the buffer without making those calls is implementation dependent (and ranges from not so bad to a segfault).
Setting up a 1:1 mapping with one MPI_Irecv for each remote rank can be made to work. Depending on the amount of data that is being sent, and the lifespan of that data once received, this approach may consume an unacceptable amount of system resources. Using MPI_Testany or MPI_Testall will likely provide the best balance between message processing and CPU load. If there is no non-MPI processing that needs to be done while waiting on incoming messages, MPI_Waitany or MPI_Waitall may be preferable.
If there are outstanding MPI_Irecv calls, but the application has reached the end of normal processing, it is "necessary" to MPI_Cancel those outstanding calls. Failing to do that may be caught in MPI_Finalize as an error.
A single MPI_Irecv (or just MPI_Recv, depending on how aggressive the message handling needs to be) on MPI_ANY_SOURCE also provides a reasonable solution. This approach can also be useful if the amount of data received is "large" and can be safely discarded after processing. Processing a single incoming buffer at a time can reduce the total system resources required, at the expense of serializing the processing.
Let me just comment on your idea to use POSIX threads (or whatever other threading mechanism there might be). Making MPI calls from multiple threads at the same time requires that the MPI implementation is initialised with the highest level of thread support of MPI_THREAD_MULTIPLE:
int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided != MPI_THREAD_MULTIPLE)
    printf("Error: MPI does not provide full thread support!\n");
Although the option to support concurrent calls from different threads was introduced in the MPI standard quite some time ago, there are still MPI implementations that struggle to provide fully working multithreaded support. MPI is all about writing portable (at least in theory) applications, but in this case real life differs badly from theory. For example, one of the most widely used open-source MPI implementations, Open MPI, still does not support native InfiniBand communication (InfiniBand is the very fast low-latency fabric used in most HPC clusters nowadays) when initialised at the MPI_THREAD_MULTIPLE level, and therefore switches to different transports that are often much slower and have higher latency, like TCP/IP over regular Ethernet or IP-over-InfiniBand. There are also some supercomputer vendors whose MPI implementations do not support MPI_THREAD_MULTIPLE at all, often because of the way the hardware works.
Besides, MPI_Recv is a blocking call which poses problems with proper thread cancellation (if necessary). You have to make sure that all threads escape the infinite loop somehow, e.g. by having each worker send a termination message with the appropriate tag or by some other protocol.

boost::asio sending data faster than receiving over TCP. Or how to disable buffering

I have created a client/server program, the client starts
an instance of Writer class and the server starts an instance of
Reader class. Writer will then write a DATA_SIZE bytes of data
asynchronously to the Reader every USLEEP milliseconds.
Every successive async_write request by the Writer is done
only if the "on write" handler from the previous request had
been called.
The problem is, If the Writer (client) is writing more data into the
socket than the Reader (server) is capable of receiving this seems
to be the behaviour:
Writer will start writing into (I think) system buffer and even
though the data had not yet been received by the Reader it will be
calling the "on write" handler without an error.
When the buffer is full, boost::asio won't fire the "on write"
handler anymore, until the buffer gets smaller.
In the meanwhile, the Reader is still receiving small chunks
of data.
The fact that the Reader keeps receiving bytes after I close
the Writer program seems to prove this theory correct.
What I need to achieve is to prevent this buffering because the
data need to be "real time" (as much as possible).
I'm guessing I need to use some combination of the socket options that
asio offers, like the no_delay or send_buffer_size, but I'm just guessing
here as I haven't had success experimenting with these.
I think that the first solution that one can think of is to use
UDP instead of TCP. This will be the case as I'll need to switch to
UDP for other reasons as well in the near future, but I would
first like to find out how to do it with TCP just for the sake
of having it straight in my head in case I'll have a similar
problem some other day in the future.
NOTE1: Before I started experimenting with asynchronous operations in asio library I had implemented this same scenario using threads, locks and asio::sockets and did not experience such buffering at that time. I had to switch to the asynchronous API because asio does not seem to allow timed interruptions of synchronous calls.
NOTE2: Here is a working example that demonstrates the problem:
EDIT: I've done one more test. In my NOTE1 I mentioned that when I was using asio::iosockets I did not experience this buffering. So I wanted to be sure and created this test: It turns out that the buffering is there even with asio::iosockets, so there must have been something else that caused it to go smoothly, possibly lower FPS.
TCP/IP is definitely geared for maximizing throughput as intention of most network applications is to transfer data between hosts. In such scenarios it is expected that a transfer of N bytes will take T seconds and clearly it doesn't matter if receiver is a little slow to process data. In fact, as you noticed TCP/IP protocol implements the sliding window which allows the sender to buffer some data so that it is always ready to be sent but leaves the ultimate throttling control up to the receiver. Receiver can go full speed, pace itself or even pause transmission.
If you don't need throughput and instead want to guarantee that the data your sender is transmitting is as close to real time as possible, then what you need is to make sure the sender doesn't write the next packet until it receives an acknowledgement from the receiver that it has processed the previous data packet. So instead of blindly sending packet after packet until you are blocked, define a message structure for control messages to be sent from the receiver back to the sender.
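A minimal sketch of such an application-level acknowledgement, written in Python over a local socket pair rather than boost::asio. The 4-byte length prefix and the single ACK byte are an invented wire format for illustration only; the point is just that the sender does not write the next message until the receiver has confirmed processing the previous one.

```python
import socket
import struct
import threading

ACK = b"\x06"  # hypothetical one-byte control message


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def send_with_ack(sock, payload):
    # Length-prefixed message, then block until the receiver acknowledges it.
    sock.sendall(struct.pack("!I", len(payload)) + payload)
    if recv_exact(sock, 1) != ACK:
        raise ConnectionError("unexpected control message")


def receive_and_ack(sock, handle):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    handle(recv_exact(sock, length))   # process first...
    sock.sendall(ACK)                  # ...then tell the sender to continue


def demo(messages):
    sender, receiver = socket.socketpair()
    received = []

    def reader():
        for _ in messages:
            receive_and_ack(receiver, received.append)

    thread = threading.Thread(target=reader)
    thread.start()
    for message in messages:
        send_with_ack(sender, message)  # at most one message in flight
    thread.join()
    sender.close()
    receiver.close()
    return received


print(demo([b"frame-1", b"frame-2", b"frame-3"]))
```

Because only one message is ever in flight, nothing piles up in the send buffer, at the cost of one round trip per message.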
Obviously with this approach, your trade off is that each sent packet is closer to real-time of the sender but you are limiting how much data you can transfer while slightly increasing total bandwidth used by your protocol (i.e. additional control messages). Also keep in mind that "close to real-time" is relative because you will still face delays in the network as well as ability of the receiver to process data. So you might also take a look at the design constraints of your specific application to determine how "close" do you really need to be.
If you need to be very close, but at the same time you don't care if packets are lost because old packet data is superseded by new data, then UDP/IP might be a better alternative. However, a) if you have reliable delivery requirements, you might end up reinventing a portion of TCP/IP's wheel, b) keep in mind that certain networks (corporate firewalls) tend to block UDP/IP while allowing TCP/IP traffic, and c) even UDP/IP won't be exact real-time.

Distributed time synchronization and web applications

I'm currently trying to build an application that inherently needs good time synchronization across the server and every client. There are alternative designs for my application that can do away with this need for synchronization, but my application quickly begins to suck when it's not present.
In case I am missing something, my basic problem is this: firing an event in multiple locations at exactly the same moment. As best I can tell, the only way of doing this requires some kind of time synchronization, but I may be wrong. I've tried modeling the problem differently, but it all comes back to either a) a sucky app, or b) requiring time synchronization.
Let's assume I Really Really Do Need synchronized time.
My application is built on Google AppEngine. While AppEngine makes no guarantees about the state of time synchronization across its servers, usually it is quite good, on the order of a few seconds (i.e. better than NTP), however sometimes it sucks badly, say, on the order of 10 seconds out of sync. My application can handle 2-3 seconds out of sync, but 10 seconds is out of the question with regards to user experience. So basically, my chosen server platform does not provide a very reliable concept of time.
The client part of my application is written in JavaScript. Again we have a situation where the client has no reliable concept of time either. I have done no measurements, but I fully expect some of my eventual users to have computer clocks that are set to 1901, 1970, 2024, and so on. So basically, my client platform does not provide a reliable concept of time.
This issue is starting to drive me a little mad. So far the best thing I can think to do is implement something like NTP on top of HTTP (this is not as crazy as it may sound). This would work by commissioning 2 or 3 servers in different parts of the Internet, and using traditional means (PTP, NTP) to try to ensure their sync is at least on the order of hundreds of milliseconds.
I'd then create a JavaScript class that implemented the NTP intersection algorithm using these HTTP time sources (and the associated roundtrip information that is available from XMLHTTPRequest).
As you can tell, this solution also sucks big time. Not only is it horribly complex, but it only solves one half of the problem, namely giving the clients a good notion of the current time. I then have to compromise on the server, either by allowing the clients to tell the server the current time according to them when they make a request (big security no-no, but I can mitigate some of the more obvious abuses of this), or having the server make a single request to one of my magic NTP-over-HTTP servers, and hoping that request completes speedily enough.
These solutions all suck, and I'm lost.
Reminder: I want a bunch of web browsers, hopefully as many as 100 or more, to be able to fire an event at exactly the same time.
Let me summarize, to make sure I understand the question.
You have an app that has a client and server component. There are multiple servers that can each be servicing many (hundreds) of clients. The servers are more or less synced with each other; the clients are not. You want a large number of clients to execute the same event at approximately the same time, regardless of which server happens to be the one they connected to initially.
Assuming that I described the situation more or less accurately:
Could you have the servers keep certain state for each client (such as initial time of connection -- server time), and when the time of the event that will need to happen is known, notify the client with a message containing the number of milliseconds after the beginning value that need to elapse before firing the event?
To illustrate:
client A connects to server S at time t0 = 0
client B connects to server S at time t1 = 120
server S decides an event needs to happen at time t3 = 500
server S sends a message to A:
S->A : {eventName, 500}
server S sends a message to B:
S->B : {eventName, 380}
This does not rely on the client time at all; just on the client's ability to keep track of time for some reasonably short period (a single session).
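The arithmetic in the illustration can be sketched in a few lines of Python. The function name and the dictionary of connection times are made up for the example; all times are server time in milliseconds.

```python
def delay_for_client(event_time_ms, connect_time_ms):
    """Milliseconds the client should wait, counted from its connection time."""
    # Never return a negative delay if the event time has already passed.
    return max(0, event_time_ms - connect_time_ms)


connect_times = {"A": 0, "B": 120}   # server time at which each client connected
event_time = 500                      # server time at which the event must fire

delays = {name: delay_for_client(event_time, t0)
          for name, t0 in connect_times.items()}
print(delays)
```

Each client then only needs a local timer that started when it connected; its wall-clock setting never enters the calculation.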
It seems to me like you're needing to listen to a broadcast event from a server in many different places. Since you can accept 2-3 seconds variation you could just put all your clients into long-lived comet-style requests and just get the response from the server? Sounds to me like the clients wouldn't need to deal with time at all this way ?
You could use ajax to do this, so you'd be avoiding any client-side lockups while waiting for new data.
I may be missing something totally here.
This works if you can assume that the clocks are reasonably stable, that is, they are set wrong but ticking at more-or-less the right rate.
Have the servers get their offset from a single defined source (e.g. one of your servers, or a database server or something).
Then have each client calculate its offset from its server (with possible round-trip complications if you want lots of accuracy).
Store that, then use the combined offset on each client to trigger the event at the right time.
(client-time-to-trigger-event) = (scheduled-time) + (client-to-server-difference) + (server-to-reference-difference)
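Here is a small Python sketch of that calculation. It assumes the differences are defined as (client clock minus server clock) and (server clock minus reference clock), and it handles the round trip by assuming the request and response take about equal time, so the server's timestamp corresponds to the midpoint of the round trip. The function names and sample numbers are invented for illustration.

```python
def estimate_client_minus_server(t_send, server_time, t_recv):
    """Estimate (client clock - server clock) from one request/response.

    t_send and t_recv are read from the client clock; server_time is the
    server's clock reading returned in the response. Assumes symmetric latency.
    """
    midpoint = (t_send + t_recv) / 2
    return midpoint - server_time


def client_trigger_time(scheduled_time, client_minus_server, server_minus_reference):
    # Same formula as above: scheduled time plus both differences.
    return scheduled_time + client_minus_server + server_minus_reference


# Example: the client clock runs 5000 ms ahead of its server, the round trip
# takes 200 ms, and the server runs 300 ms ahead of the reference source.
offset = estimate_client_minus_server(t_send=105000, server_time=100100,
                                      t_recv=105200)
print(offset)
print(client_trigger_time(200000, offset, 300))
```

If latency is not symmetric the estimate is off by up to half the round-trip time, which is usually fine when a couple of seconds of slack is acceptable.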
Time synchronization is very hard to get right and in my opinion the wrong way to go about it. You need an event system which can notify registered observers every time an event is dispatched (observer pattern). All observers will be notified simultaneously (or as close as possible to that), removing the need for time synchronization.
To accommodate latency, the browser should be sent the timestamp of the event dispatch, and it should wait a little longer than what you expect the maximum latency to be. This way all events will be fired up at the same time on all browsers.
Google found a way to define time as being absolute. That sounds heretical to a physicist and with respect to General Relativity: time flows at a different pace depending on your position in space and time, on Earth, in the Universe ...
You may want to have a look at Google Spanner database:
I guess it is used now by Google and will be available through Google Cloud Platform.
