What can cause IDXGISwapChain2's frame latency waitable to timeout?
I'm trying to implement this recommendation from Microsoft: Reduce latency with DXGI 1.3 swap chains, but this code here from the article:
DWORD result = WaitForSingleObjectEx(
m_frameLatencyWaitableObject,
1000, // 1 second timeout (shouldn't ever occur)
true
);
sometimes (not always) returns WAIT_TIMEOUT although the article says it shouldn't ever occur. I probably do something I shouldn't, but what can cause this waitable to timeout?
Related
I’m seeing some strange results when trying to time some WebGL using the disjoint timer query extension.
I have written some simple WebGL code to reproduce the issue we are seeing here : https://jsfiddle.net/d79q3mag/2/.
const texImage2DQuery = gl.createQuery();
gl.beginQuery(ext.TIME_ELAPSED_EXT, texImage2DQuery);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA32F, 512, 512, 0, gl.RGBA, gl.FLOAT, buffer);
gl.endQuery(ext.TIME_ELAPSED_EXT);
tex2dQuerys.push(texImage2DQuery);
const drawQuery = gl.createQuery();
gl.beginQuery(ext.TIME_ELAPSED_EXT, drawQuery);
gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
gl.endQuery(ext.TIME_ELAPSED_EXT);
drawQuerys.push(drawQuery);
tryGetNextQueryResult(tex2dQuerys, tex2dQueryResults);
tryGetNextQueryResult(drawQuerys, drawQueryResults);
The render function uses the timer extension to individually time the texImage2D call and the drawArrays call. If I graph the results of the draw call I see some pretty large spikes (Some as high as 12ms! but the majority of spikes in the 2ms to 4ms range) :
However, if I increase the framerate from 30fps to 60fps the results improve (largest spike 1.8ms most spikes between 1ms and 0.4ms) :
I have also noticed that if I don’t time the texImage2D function (https://jsfiddle.net/q01aejnv/2/) then the spikes in the times for the drawArrays call also disappear at 30FPS (spikes between 1.6ms and 0.2 ms).
I’m using a Quadro P4000 with chrome 81.
In the Nvidia control panel Low latency mode is set to Ultra and the power management mode is set to Prefer maximum performance.
These results were gathered using D3D11 as the ANGLE graphics backend in chrome.
There seems to be 2 confusing things here. First is that a higher framerate seems to improve the draw times. Second is that timing the texImage2D seems to be affecting the draw times.
When doing timing things you probably need to gl.flush after gl.endQuery
Explanation: WebGL functions add commands to a command buffer. Some other process reads that buffer but there is overhead in telling the other process "hey! I added some commands for you!". So, in general, WebGL commands don't always do they "Hey!" part they just insert the commands. WebGL will do an auto flush (a "Hey! Execute this!) at various points automatically but when doing timing you may need to add a gl.flush.
Note: Like I said above there is overhead to flushing. In a normal program it's rarely important to do a flush.
I have read a bunch of posts on SO regarding the computation of end-to-end delay in Veins, but have not found an answer to be fulfilling in explaining why the delay is seemingly too low.
I am using:
Veins 4.7
Sumo 0.32.0
Omnetpp 5.3
Channel switching is turned off.
I have the following code, sending a message from the transmitting node:
if(sendMessage) {
WaveShortMessage* wsm = new WaveShortMessage();
sendDown(wsm);
}
The receiving node computes the delay using the wsm creation time, but I have also tried setting the timestamp on the transmitting side. The result is the same.
simtime_t delay = simTime() - wsm -> getCreationTime();
delayVector.record(delay);
The sample output for the delay vector is as follows:
Item# Event# Time Value
0 165 14.400239402394 2.39402394E-4
1 186 14.500240403299 2.40403299E-4
2 207 14.600241404069 2.41404069E-4
3 228 14.700242404729 2.42404729E-4
Which means that the end-to-end delay (from creation to reception) is equivalent to roughly a quarter of a millisecond, which seems to be quite low - and a fair bit below what is typically reported in the literature. This seems to be consistent with what other people have reported as being an issue (e.g. end to end delay in Veins)
Am I missing something in this computation? I have tried adding load on the network by adding a high number of vehicular nodes (21 nodes within a 1000x50 sandbox on a straight highway, with an average speed of 50 km/h), but the result seems to be the same. The difference is negligible. I have read several research papers that suggest that end-to-end delay should increase dramatically in high vehicular densities.
This end-to-end delay is to be expected. If your application's simulation model does not explicitly model processing delay (e.g., by an application running on a slow general purpose computer), all you would expect to delay a frame is propagation delay (lightspeed, so negligible here) and queueing delay on the MAC (time from inserting frame into TX queue until transmission finishes).
To give an example, for a 2400 bit frame sent at 6 Mbit/s this delay is roughly 0.45 ms. You are likely using slightly shorter frames, so your values appear to be reasonable.
For background information, see F. Klingler, F. Dressler, C. Sommer: "The Impact of Head of Line Blocking in Highly Dynamic WLANs" (DOI 10.1109/TVT.2018.2837157), which also includes a comparison of theory vs. Veins vs. real measurements.
I'm experimenting with writing a simplistic single-AU play-through based, (almost)-no-latency tracking phase vocoder prototype in C. It's a standalone program. I want to find how much processing load can a single render callback safely bear, so I prefer keeping off async DSP.
My concept is to have only one pre-determined value which is window step, or hop size or decimation factor (different names for same term used in different literature sources). This number would equal inNumberFrames, which somehow depends on the device sampling rate (and what else?). All other parameters, such as window size and FFT size would be set in relation to the window step. This seems the simplest method for keeipng everything inside one callback.
Is there a safe method to machine-independently and safely guess or query which could be the inNumberFrames before the actual rendering starts, i.e. before calling AudioOutputUnitStart()?
The phase vocoder algorithm is mostly standard and very simple, using vDSP functions for FFT, plus custom phase integration and I have no problems with it.
Additional debugging info
This code is monitoring timings within the input callback:
static Float64 prev_stime; //prev. sample time
static UInt64 prev_htime; //prev. host time
printf("inBus: %d\tframes: %d\tHtime: %lld\tStime: %7.2lf\n",
(unsigned int)inBusNumber,
(unsigned int)inNumberFrames,
inTimeStamp->mHostTime - prev_htime,
inTimeStamp->mSampleTime - prev_stime);
prev_htime = inTimeStamp->mHostTime;
prev_stime = inTimeStamp->mSampleTime;
Curious enough, the argument inTimeStamp->mSampleTime actually shows the number of rendered frames (name of the argument seems somewhat misguiding). This number is always 512, no matter if another sampling rate has been re-set through AudioMIDISetup.app at runtime, as if the value had been programmatically hard-coded. On one hand, the
inTimeStamp->mHostTime - prev_htime
interval gets dynamically changed depending on the sampling rate set in a mathematically clear way. As long as sampling rate values match multiples of 44100Hz, actual rendering is going on. On the other hand 48kHz multiples produce the rendering error -10863 ( =
kAudioUnitErr_CannotDoInCurrentContext ). I must have missed a very important point.
The number of frames is usually the sample rate times the buffer duration. There is an Audio Unit API to request a sample rate and a preferred buffer duration (such as 44100 and 5.8 mS resulting in 256 frames), but not all hardware on all OS versions honors all requested buffer durations or sample rates.
Assuming audioUnit is an input audio unit:
UInt32 inNumberFrames = 0;
UInt32 propSize = sizeof(UInt32);
AudioUnitGetProperty(audioUnit,
kAudioDevicePropertyBufferFrameSize,
kAudioUnitScope_Global,
0,
&inNumberFrames,
&propSize);
This number would equal inNumberFrames, which somehow depends on the device sampling rate (and what else?)
It depends on what you attempt to set it to. You can set it.
// attempt to set duration
NSTimeInterval _preferredDuration = ...
NSError* err;
[[AVAudioSession sharedInstance]setPreferredIOBufferDuration:_preferredDuration error:&err];
// now get the actual duration it uses
NSTimeInterval _actualBufferDuration;
_actualBufferDuration = [[AVAudioSession sharedInstance] IOBufferDuration];
It would use a value roughly around the preferred value you set. The actual value used is a time interval based on a power of 2 and the current sample rate.
If you are looking for consistency across devices, choose a value around 10ms. The worst performing reasonable modern device is iOS iPod touch 16gb without the rear facing camera. However, this device can do around 10ms callbacks with no problem. On some devices, you "can" set the duration quite low and get very fast callbacks, but often times it will crackle up because the processing is not finished in the callback before the next callback happens.
How do I find the optimal chunk size for multiprocessing.Pool instances?
I used this before to create a generator of n sudoku objects:
processes = multiprocessing.cpu_count()
worker_pool = multiprocessing.Pool(processes)
sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)
To measure the time, I use time.time() before the snippet above, then I initialize the pool as described, then I convert the generator into a list (list(sudokus)) to trigger generating the items (only for time measurement, I know this is nonsense in the final program), then I take the time using time.time() again and output the difference.
I observed that the chunk size of n // processes + 1 results in times of around 0.425 ms per object. But I also observed that the CPU is only fully loaded the first half of the process, in the end the usage goes down to 25% (on an i3 with 2 cores and hyper-threading).
If I use a smaller chunk size of int(l // (processes**2) + 1) instead, I get times of around 0.355 ms instead and the CPU load is much better distributed. It just has some small spikes down to ca. 75%, but stays high for much longer part of the process time before it goes down to 25%.
Is there an even better formula to calculate the chunk size or a otherwise better method to use the CPU most effective? Please help me to improve this multiprocessing pool's effectiveness.
This answer provides a high level overview.
Going into detais, each worker is sent a chunk of chunksize tasks at a time for processing. Every time a worker completes that chunk, it needs to ask for more input via some type of inter-process communication (IPC), such as queue.Queue. Each IPC request requires a system call; due to the context switch it costs anywhere in the range of 1-10 μs, let's say 10 μs. Due to shared caching, a context switch may hurt (to a limited extent) all cores. So extremely pessimistically let's estimate the maximum possible cost of an IPC request at 100 μs.
You want the IPC overhead to be immaterial, let's say <1%. You can ensure that by making chunk processing time >10 ms if my numbers are right. So if each task takes say 1 μs to process, you'd want chunksize of at least 10000.
The main reason not to make chunksize arbitrarily large is that at the very end of the execution, one of the workers might still be running while everyone else has finished -- obviously unnecessarily increasing time to completion. I suppose in most cases a delay of 10 ms is a not a big deal, so my recommendation of targeting 10 ms chunk processing time seems safe.
Another reason a large chunksize might cause problems is that preparing the input may take time, wasting workers capacity in the meantime. Presumably input preparation is faster than processing (otherwise it should be parallelized as well, using something like RxPY). So again targeting the processing time of ~10 ms seems safe (assuming you don't mind startup delay of under 10 ms).
Note: the context switches happen every ~1-20 ms or so for non-real-time processes on modern Linux/Windows - unless of course the process makes a system call earlier. So the overhead of context switches is no more than ~1% without system calls. Whatever overhead you're creating due to IPC is in addition to that.
Nothing will replace the actual time measurements. I wouldn't bother with a formula and try a constant such as 1, 10, 100, 1000, 10000 instead and see what works best in your case.
Question about the MAC-protocol of 802.11 Wifi.
We have learned that when a station has received the data it waits for SIFS time. Then it sends the packet. When searching online the reason that is always mentioned is to give ACK packets a higher priority. This is understandable since a station first has to wait DIFS time when it wants to send normal data (and DIFS is larger than SIFS).
But why wait at all? Why not immediately send the ACK? The station knows the data has arrived and the CRC is correct, so why wait?
It is theoretically possible to know that the CRC is correct at the exact end of the received data from the wire, but in practice, you need to accumulate all the samples in the last block in order to run the IFFT, deconvolution, FEC, and then, finally, at the very end, after finally getting the input data out of the waveform, do you know that the CRC is correct. Also, you sometimes need to turn on transmit circuitry to send the ACK, which can hamper receive performance. If each step in the processing chain were instantaneous, and if the transmit circuitry definitely didn't interfere with the receive circuitry, and if there were no lead-time necessary for building the waveform for the ACK, it'd be possible to send the ACK immediately after getting the last bit of the wave-form. But, while each element in this chain takes some deterministic time, it is not instantaneous. SIFS gives the receiver time to get the data from the PHY, verify it, and send the response.
Disclaimer: I'm more familiar with Homeplug than 802.11.
It is like that because Distributed Coordination Function (DCF) and Point Coordination Function (PCF) mode can coexist within one cell. That is a base station may use polling while at the same time the cell can use disitributed coordination using CSMA/CA.
So during SIFS, control frames or next fragment may be sent. During PIFS, PCF frames may be sent and during DIFS DCF frames may be sent. During SIFS and PIFS, PCF can work its magic.
Even though not all base stations support PCF all stations must wait since some may support it.
Update:
The way I understand this now is that during SIFS the station may send RTS,CTS or ACK and have enough time to switch back to receiving mode before the sender starts to transmit. If that's correct, it will send ACK during SIFS. Then it will change to receive mode and wait until SIFS elapses. When SIFS has elapsed the transmitter will start sending.
Also, PCF is controlled by PIFS which comes after SIFS and before DIFS and is therefor not relevant for this discussion (my mistake). That is, SIFS < PIFS < DIFS < EIFS.
Sources: This PDF (page 8) and Computer Networks by Andrew S. Tanenbaum
SIFS = RTT (based on PHY Transmission rate) + FRAME PROCESSING DELAY AT RECEIVER (PHY PROCESSING DELAY + MAC PROCESSING DELAY) + FRAME PROCESSING DELAY (FOR COMPOSING RESPONSE CTS/ACK)+ RF TUNER DELAY (CHANGE FROM RX to TX)
A the Transmitter side, after last PHY Symbol (of RTS, e.g), the time required to change to RX mode (at RF). So, I would see SIFS as a RX side calculation than a TX side.
I can't say for sure but it sounds like an optimization strategy similar to IP. If you don't require an ACK for every data packet, it makes sense to hold off for a bit so that, if more data packets arrive, you can acknowledge them all with a single ACK.
Example: client sends 400 packets really fast to the server. Rather than the server sending back 400 ACKs, it can simply wait until the client takes a breather before sending a single ACK back. Combined with the likelihood that the client will take a breather even under heavy load (it has to as its unacknowledged-packets buffer fills up), this would be workable.
This is possible in systems where the ACK(n) means "I've received everything up to and including packet # n.
You'll get better performance and less traffic by using such a strategy. As long as the wait-before-sending-ack time on the receiver is less than the retransmit-if-no-ack-before time on the sender (taking transmission delays into account), there should be no problem.
Quick crash-course on 802.11:
802.11 is a essentially a giant system of timers. The most common implementations of 802.11 utilize the distributed coordination function, DCF. The DCF allows for nodes to come in and out of the range of a radio channel being used for 802.11 and coordinate in a distributed fashion who should be sending and receiving data (ignoring hidden and exposed node problems for this discussion). Before any node can begin sending data on the channel they all must wait a period of DIFS, in which the channel is determined to be idle, if it is idle during a DIFS period the first node to grab the channel begins transmitting. In standard 802.11, i.e. non-802.11e implementations and non 802.11n, every single data packet that gets transmitted needs to be acknowledged by a physical layer, PHY, acknowledgment packet, irregardless of the upper layer protocol being used. After a data packet gets sent a SIFS time period needs to expire, after SIFS expires control frames destined for the node that has "taken" control of the channel may be sent, in this instance and acknowledgment frame is transmitted. SIFS allows the node that sent the data packet to switch from transmitting to receiving mode. If a packet does get lost and no ACK is received after SIFS/ACK timeout occurs, then exponential back-off is invoked. Exponential back-off, a.k.a contention window (CW), begins at a value CWmin, in some linux implementation this is 15 slot times, where a slot time varies depending on the 802.11 protocol that is being used. The CW value is then chosen from 1 to whatever the upper limit that has been calculated for CW. If the current packet was lost, then the CW is incremented from 15 to 30, and then a random value is chosen between 1 and 30. Every-time there is a consecutive lose the CW doubles up to 1023, at which point it hits a limit. Once a packet is received successfully the CW is reset back to CWmin.
In regards to 802.11n / 802.11e:
Every data packet still needs to be acknowledged, but when using 802.11e (implemented into 802.11n) multiple data packets can be aggregated together in two different ways A-MSDU or A-MPDU. A-MSDU is a jumbo-frame that has one checksum for the entire aggregated packet being sent, within it are many sub-frames that contain each of the data frames that needed to be sent. If there is any error in the A-MSDU frame and it needs to be retransmitted, then every sub-frame is required to be resent. However, when using A-MPDU, each sub-frame has a small header and checksum that allow for any sub-frame that has an error in it to be retransmitted by itself/within another aggregated frame the next time the sending nodes gains the channel. With these aggregated packet sending schemes there is the notion of the block-ack. The block-ack contains a bitmap of the frames from a starting sequence number that were just sent in the aggregated packet and received correctly or incorrectly. Usage of aggregated frame sending greatly improves throughput performance as more data can be sent per channel acquisition by a sending node, also allowing out-of-order packet sending. However, out-order packet sending greatly complicates the 802.11 MAC layer.
SIFS=D+M+Rx/Tx
Where,
D=(Receiver delay (RF delay) and decoding of physical layer convergence procedure (PLCP) preamble/header)
M=(MAC processing delay)
Rx/Tx=(transceiver turnaround time)
Above all the delays can not be avoided so It has to wait SIFS time before sending acknowledgement