What is the difference between latency, bandwidth and throughput? - performance

I am struggling to draw a clear line between latency, bandwidth and throughput.
Can someone explain them to me in simple terms, with easy examples?

Water Analogy:
Latency is the amount of time it takes to travel through the tube.
Bandwidth is how wide the tube is.
The rate of water flow is the throughput.
Vehicle Analogy:
Vehicle travel time from source to destination is latency.
Types of Roadways are bandwidth.
Number of Vehicles traveling is throughput.

When a SYN packet is sent using TCP, the sender waits for a SYN+ACK response; the time between sending and receiving is the (round-trip) latency. It's a function of one variable, i.e. time.
If we're doing this on a 100Mbit connection, that is the theoretical bandwidth we have, i.e. how many bits per second we can send.
If I compress a 1000Mbit file to 100Mbit and send it over the 100Mbit line, the transfer takes one second, so my effective throughput could be considered 1Gbit per second. Theoretical throughput and theoretical bandwidth are the same on this network, so why am I saying the throughput is 1Gbit per second?
When talking about throughput, I hear it most in relation to an application, i.e. the 1Gbit throughput example I gave assumed compression at some layer in the stack, and we measured throughput there. The throughput of the actual network did not change, but the application throughput did. Sometimes throughput means actual throughput, i.e. a 100Mbit connection is the theoretical bandwidth and also the theoretical throughput in bps, but that is highly unlikely to be what you'll actually get.
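To make the compression example concrete, here is a minimal sketch of the arithmetic. The function name and parameters are made up for illustration, and it assumes compression and decompression cost no time at all.

```python
MBIT = 1_000_000  # bits in one megabit


def effective_throughput_bps(original_bits, compressed_bits, link_bps):
    """Application-level throughput when compressed data crosses the link.

    Assumption: compressing and decompressing take no time.
    """
    # Time on the wire depends only on what is actually transmitted.
    transfer_seconds = compressed_bits / link_bps
    # The application ends up with `original_bits` of data after that time.
    return original_bits / transfer_seconds


# 1000 Mbit file, compressed to 100 Mbit, sent over a 100 Mbit/s link.
app_bps = effective_throughput_bps(1000 * MBIT, 100 * MBIT, 100 * MBIT)
print(app_bps / MBIT, "Mbit/s seen by the application")
```

The link itself never carries more than 100 Mbit/s; only the layer above the compression sees the higher figure.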
Throughput is also used in terms of whole systems ie Number of Dogs washed per day or Number of Bottles filled per hour. You don't often use bandwidth in this way.
Note, bandwidth in particular has other common meanings, I've assumed networking because this is stackoverflow but if it was a maths or amateur radio forum I might be talking about something else entirely.
https://en.wikipedia.org/wiki/Bandwidth
https://en.wikipedia.org/wiki/Latency
This is worth reading on throughput.
https://en.wikipedia.org/wiki/Throughput

Here is my bit in a language which I can understand
When you go to buy a water pipe, there are two completely independent parameters that you look at: the diameter of the pipe and its length. The diameter determines the throughput of the pipe and the length determines the latency, i.e., the time it will take for a water droplet to travel across the pipe. The key point to note is that the length and diameter are independent, and thus so are the latency and throughput of a communication channel.
More formally, throughput is defined as the amount of water entering or leaving the pipe every second, and latency is the average time required for a droplet to travel from one end of the pipe to the other.
Let’s do some math:
For simplicity, assume that our pipe is a 4inch x 4inch square and its length is 12inches. Now assume that each water droplet is a 0.1in x 0.1in x 0.1in cube. Thus, in one cross section of the pipe, I will be able to fit 1600 water droplets. Now assume that water droplets travel at a rate of 1 inch/second.
Throughput: Each set of droplets will move into the pipe in 0.1 seconds. Thus, 10 sets will move in 1 second, i.e., 16000 droplets will enter the pipe per second. Note that this is independent of the length of the pipe.
Latency: At one inch/second, it will take 12 seconds for droplet A to get from one end of the pipe to the other regardless of pipe’s diameter. Hence the latency will be 12 seconds.
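The same arithmetic as a small sketch, in case it helps to play with the numbers. The function and parameter names are my own, and exact fractions are used so that the 0.1 inch droplet size does not pick up floating-point noise.

```python
from fractions import Fraction


def pipe_metrics(side_in, length_in, droplet_in, speed_in_per_s):
    """Return (droplets per second, seconds of latency) for a square pipe."""
    side, length = Fraction(side_in), Fraction(length_in)
    droplet, speed = Fraction(droplet_in), Fraction(speed_in_per_s)

    # Droplets that fit in one cross section (one "set" of droplets).
    droplets_per_layer = (side / droplet) ** 2
    # How many such sets enter the pipe each second.
    layers_per_second = speed / droplet

    throughput = droplets_per_layer * layers_per_second  # ignores length
    latency = length / speed                             # ignores cross section
    return throughput, latency


droplet = Fraction(1, 10)
print(pipe_metrics(4, 12, droplet, 1))  # the pipe from the example
print(pipe_metrics(4, 24, droplet, 1))  # twice as long: only latency changes
print(pipe_metrics(8, 12, droplet, 1))  # twice as wide: only throughput changes
```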

I would like to supplement the answers already written with another distinction between latency and throughput, relevant to the concept of pipelining. For that purpose I'll use an example from daily life, the preparation of clothes: to get them ready, we have to (i) wash them, (ii) dry them and (iii) iron them. Each of these tasks needs an amount of time, let's say A, B and C respectively. Every batch of clothes will need a total of A+B+C time until it is ready. This is the latency of the total process. However, since i, ii and iii are separate sub-processes, you may start washing the 3rd batch of clothes while the 2nd one is drying and the 1st batch is being ironed, etc. (pipeline). Then every batch of clothes after the 1st will be ready max(A,B,C) time after the previous one. Throughput would be measured in batches of clothes per unit of time, equal to 1/[max(A,B,C)].
That being said, this answer tries to highlight that when we only know the latency of a system, we do not necessarily know its throughput. These are truly different metrics and not just another way to express the same information.
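A small sketch of the laundry pipeline, with made-up stage times (30, 60 and 20 minutes for A, B and C). It assumes every batch needs the same times, each stage handles one batch at a time, and all batches are waiting at time zero.

```python
def pipeline_metrics(stage_times):
    """Return (latency of one batch, steady-state throughput in batches per time unit)."""
    latency = sum(stage_times)          # A + B + C
    throughput = 1 / max(stage_times)   # limited by the slowest stage
    return latency, throughput


def finish_time(batch_number, stage_times):
    """Time at which the given batch (1-based) leaves the last stage."""
    # The first batch needs the full latency; each later batch follows
    # one bottleneck interval behind the previous one.
    return sum(stage_times) + (batch_number - 1) * max(stage_times)


stages = [30, 60, 20]  # hypothetical minutes for wash, dry, iron
print(pipeline_metrics(stages))
print([finish_time(n, stages) for n in (1, 2, 3)])
```

Shortening the wash or the ironing would reduce the latency but leave the throughput unchanged, because the dryer is the bottleneck.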

Latency: Elapsed time of an event.
e.g. Walking from point A to B takes one minute, so the latency is one minute.
Throughput: The number of events that can be executed per unit of time.
e.g. Bandwidth is a measure of throughput.
We can increase bandwidth to improve throughput, but it won't improve latency.
Take the RPC case: there are two components to the latency of message communication in a distributed system. The first component is the hardware overhead and the second is the software overhead.
The hardware overhead depends on how the network is interfaced with the computer; this is managed mostly by the network controller.
I wrote a blog about it :)
https://medium.com/#nbosco/latency-vs-throughput-d7a4459b5cdb

Bandwidth is a measure of data per second, which is equal to the temporal speed of such data multiplied by the number of spatial multiplexing channels, so essentially in the water pipe analogy it is flow velocity * diameter. In digital signal processing, the temporal speed of the data is constrained by the frequency bandwidth of the channel and the SNR.
Latency is the physical length of the channel (in terms of the number of bits it can hold in flight) divided by the bandwidth. Latency increases when transmitter and receiver get further apart spatially, but bandwidth does not change because the transmitter's layer 1 can still send at the same speed. Latency also increases when there's an intermediate node or receiving node that buffers, processes or delays the data while still having the same bandwidth: it might take a while for the first packets of a download to come in, but when they do, it will hopefully be at full bandwidth. Of course, that assumes that the transmitter's protocol stack doesn't need to wait around for control packets from the receiver, like a TCP ACK or a layer 2 ACK.
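The "bits in flight" relation can be written down directly. This is only a sketch with illustrative names; the 100 Mbit/s link and the 0.25 s one-way delay are assumed numbers, not taken from any measurement.

```python
def bits_in_flight(bandwidth_bps, one_way_latency_s):
    """Number of bits the channel holds at once (the bandwidth-delay product)."""
    return bandwidth_bps * one_way_latency_s


def latency_seconds(bits, bandwidth_bps):
    """The relation read the other way: channel 'length' in bits over bandwidth."""
    return bits / bandwidth_bps


link_bps = 100_000_000   # assumed 100 Mbit/s link
delay_s = 0.25           # assumed one-way delay

in_flight = bits_in_flight(link_bps, delay_s)
print(in_flight, "bits are on the wire at any moment")
print(latency_seconds(in_flight, link_bps), "seconds of latency")
```

Doubling the distance doubles the bits in flight and the latency, while the rate at which bits leave the transmitter stays the same.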

Related

Will memory that is physically closer to the CPU perform faster than memory physically farther away?

I know this may sound like a silly question considering the speeds at which computers work, but say a certain address in RAM is physically closer to the CPU on the motherboard than a memory address that is located as far as possible from the CPU. Will this have an effect on the speed at which the closer memory address is accessed compared to the farthest memory address?
If you're talking about NUMA, i.e. accessing RAM connected to this socket vs. going over the interconnect to access RAM connected to another socket, then yes, this is a well-known effect. Otherwise, no.
Also note that signal travel time over the external memory bus is only a tiny fraction of the total cache-miss latency cost for a CPU core. Queuing inside the CPU, time to check L3 cache, and the internal bus between cores and memory controllers all add up. Tightening DDR4 CAS latency by 1 whole memory cycle makes only a small (but measurable) difference to overall memory performance (see hardware review sites benchmarking memory overclocking), other timings even less so.
No, DDR4 (and earlier) memory busses are synced to a clock and expect a response at a specific number of memory-clock cycles (see footnote 1) after a command (so the controller can pipeline requests without causing overlap). See What Every Programmer Should Know About Memory? for more about DDR memory commands and memory timings (and CAS latency vs. other timings).
(Wikipedia's introduction to SDRAM mentions that earlier DRAM standards were asynchronous, so yes they maybe could just reply as soon as they had the data ready. If that happened to be a whole clock cycle early, a speedup was perhaps possible.)
So memory latency is discrete, not continuous, and being 1 mm closer can't make it fractions of a nanosecond faster. The only plausible effect is if you socket all the memory into DIMM slots in a way that enables you to run tighter timings and/or a faster memory clock than with some other arrangement. Go read about memory overclocking if you want real-world experience with people who try to push systems to the limits of stability. What's best may depend on the motherboard; physical length of traces isn't the only consideration.
AFAIK, all real-world motherboard firmwares insist on using the same timings for all DIMMs on all memory channels (see footnote 2).
So even if one DIMM could theoretically support tighter timings than another (e.g. because of shorter or less noisy traces, or less signal reflection because it's at the end instead of the middle of some traces), you couldn't actually configure a system to make that happen. Physical proximity isn't the only thing that could help.
(This is probably a good thing; interleaving physical address space across multiple DRAM channels allows sequential reads/writes to benefit from the aggregate bandwidth of all channels. But if they ran at different speeds, you might have more contention for shared busses between controllers and cores, and more time left unused.)
Memory frequency and timings are usually chosen by the firmware after reading the SPD ROM on each DIMM (memory module) to find out what memory is installed and what timings each DIMM is rated for at what frequencies.
Footnote 1: I'm not sure how transmission-line propagation delays over memory traces are accounted for when the memory controller and DIMM agree on how many cycles there should be after a read command before the DIMM starts putting data on the bus.
The CAS latency is a timing number that the memory controller programs into the "mode register" of each DIMM.
Presumably the number the DIMM sees is the actual number it uses, and the memory controller has to account for the round-trip propagation delay to know when to really expect a read burst to start arriving. Other command latencies are just times between sending different commands so propagation delay doesn't matter: the gap at the sending side equals the gap at the receiving side.
But the CAS latency seen by the memory controller includes the round-trip propagation delay for signals to go over the wires to the DIMM and back. Modern systems with DDR4-4000 have a clock that runs at 2GHz, cycle time of half a nanosecond (and transferring data on the rising and falling edge).
At light speed, 0.5ns is "only" about 15 cm, half of one of Grace Hopper's nanoseconds, and with transmission-line effects could be somewhat shorter (like maybe 2/3rd of that). On a big server motherboard it's certainly plausible that some DIMMs are far enough away from the CPU for traces to be that long.
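A quick sketch to reproduce those numbers. The helper names are mine; the 2/3 velocity factor is just the rough guess mentioned above, not a measured property of any particular board.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def memory_clock_period_ns(mega_transfers_per_s):
    """Clock period for DDR memory, e.g. 4000 for DDR4-4000."""
    clock_mhz = mega_transfers_per_s / 2  # DDR moves data on both clock edges
    return 1000 / clock_mhz               # 1000 ns per microsecond / MHz


def distance_cm(time_ns, velocity_factor=1.0):
    """Distance a signal covers in time_ns at a fraction of light speed."""
    return SPEED_OF_LIGHT_M_PER_S * velocity_factor * time_ns * 1e-9 * 100


period = memory_clock_period_ns(4000)
print(period, "ns per memory clock cycle")
print(round(distance_cm(period), 1), "cm at light speed")
print(round(distance_cm(period, 2 / 3), 1), "cm at 2/3 of light speed (assumed)")
```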
The rated speeds on memory DIMMs are somewhat conservative, so they're still supposed to work at that speed even when the DIMM is as far from the CPU as DDR4 standards allow. I don't know the details, but I assume JEDEC considers this when developing DDR SDRAM standards.
If there's a "data valid" pin the DIMM asserts at the start of the read burst, that would solve the problem, but I haven't seen a mention of that on Wikipedia.
Timings are those numbers like 9-9-9-24, with the first one being CAS latency, CL. https://www.hardwaresecrets.com/understanding-ram-timings/ was an early google hit if you want to read more from a perf-tuning PoV. Also described in Ulrich Drepper's "What Every Programmer Should Know about Memory" linked earlier, from a how-it-works PoV. Note that the higher the memory clock speed, the less real time (in nanoseconds) a given number of cycles is. So CAS latency and other timings have stayed nearly constant in nanoseconds, or even dropped, as clock frequencies have increased. https://www.crucial.com/articles/about-memory/difference-between-speed-and-latency shows a table.
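To see why the cycle counts go up while the nanoseconds don't, here is a tiny conversion sketch. The function name is mine, and the two modules (DDR3-1600 CL9 and DDR4-3200 CL16) are just commonly seen ratings picked for illustration, not figures from the linked table.

```python
def cas_latency_ns(cas_cycles, mega_transfers_per_s):
    """Convert a CAS latency in memory-clock cycles to nanoseconds."""
    # The memory clock runs at half the transfer rate (double data rate),
    # so one cycle lasts 2000 / (MT/s) nanoseconds.
    return cas_cycles * 2000 / mega_transfers_per_s


# Illustrative ratings only.
print("DDR3-1600 CL9 :", cas_latency_ns(9, 1600), "ns")
print("DDR4-3200 CL16:", cas_latency_ns(16, 3200), "ns")
```

The CL number nearly doubles between the two, but the real-time latency stays in the same ballpark.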
Footnote 2: Unless we're talking about special faster memory for use as a scratchpad or cache for the larger main memory, but still off-chip. e.g. the 16GB of MCDRAM on Xeon Phi cards, separate from the 384 GB of regular DDR4. But faster memories are usually soldered down so timings are fixed, not socketed DIMMs. So I think it's fair to say that all DIMMs in a system will run with the same timings.
Other random notes:
https://www.overclock.net/threads/ram-4x-sr-or-2x-dr-for-ryzen-3000.1729606/ contained some discussion of motherboards with a "T-topology" vs. "daisy chain" for the layout of their DIMM sockets. This seems pretty self-explanatory terminology: a T would be when each of the 2 DIMMs on a channel are on opposite sides of the CPU, about equidistant from the pins. vs. "daisy chain" when both DIMMs for the same channel are on the same side of the CPU, with one farther away than the other.
I'm not sure what the recommended practice is for using the closer or farther socket. Signal reflection could be more of a concern with the near socket because it's not the end of the trace.
If you have multiple DIMMs on the same memory channel, the DDR4 protocol may require that they all run at the same timings. (Such DIMMs see each other's commands, except there's a "chip-select" pin that the memory controller can control independently for each DIMM to select which one the command is for.)
But in theory a CPU could be designed to run its different memory channels at different frequencies, or at least different timings at the same frequency if the memory controllers all share a clock. And of course in a multi-socket system, you'd expect no physical / electrical obstacle to programming different timings for the different sockets.
(I haven't played around in the BIOS on a multi-socket system for years, not since I was a cluster sysadmin in AMD K8 / K10 days). So IDK, it's possible that some BIOS might have options to control different timings for different sockets, or simply allow different auto-detect if you use slower RAM in one socket than in others. But given the price of servers and how few people run them as hobby machines, it's unlikely that vendors would bother to support or validate such a config.

Why does more Bandwidth guarantee high bit rate?

The definition of bandwidth is a frequency range, and it seems to be correct to say that higher bandwidth guarantees a higher data rate.
However, I do not understand why it does.
Data rate depends on the modulation scheme, and nowadays QAM, which is a combination of ASK and PSK, is the most widely used scheme.
I understand that FSK uses more frequencies, so it needs more bandwidth, but I do not understand why ASK and PSK need more bandwidth.
(If QAM did not need more bandwidth, QAM could be used in a small bandwidth, which would mean that bandwidth has nothing to do with data rate.)
As I understand it, ASK does not need more bandwidth. If the transmission power at the transmitter is higher, the amplitude of the wave is larger. In that sense, ASK can be achieved by transmission power control.
Furthermore, PSK can be constructed by delaying the signal. As far as I know, the phase angle is determined by the delay of the wave (in time).
If what I explained is correct, why does high bandwidth guarantee a high data rate?
In communications engineering, bandwidth is the measure of the width of a range of frequencies, measured in Hertz.
Rate is the number of transmitted bits per time unit, usually seconds, so it's measured in bit/second. Equivalently, it can be given in symbols/time unit.
The rate is proportional to the system bandwidth. The Shannon Capacity is one theoretical way to see this relation, as it provides the maximum number of bits transmitted for a given system bandwidth in the presence of noise.
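For reference, the Shannon capacity of a noisy channel is C = B * log2(1 + S/N), with B the bandwidth in Hz and S/N the linear (not dB) signal-to-noise ratio. Below is a minimal sketch of that formula; the function name and the example numbers (4000 Hz, S/N of 63) are arbitrary and chosen only so the result comes out round.

```python
import math


def shannon_capacity_bps(bandwidth_hz, snr_linear):
    """Upper bound on the error-free bit rate of a noisy channel."""
    return bandwidth_hz * math.log2(1 + snr_linear)


# Arbitrary illustrative numbers: log2(1 + 63) = 6 bits per second per Hz.
print(shannon_capacity_bps(4000, 63))   # capacity of a 4 kHz channel
print(shannon_capacity_bps(8000, 63))   # twice the bandwidth, same S/N
```

At a fixed signal-to-noise ratio, doubling the bandwidth doubles the capacity, which is the proportionality mentioned above.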
We can consider the bandwidth as the diameter of a water pipe. A larger pipe can carry a larger volume of water, so more water can be delivered between two points with a larger pipe. The size of the pipe (bandwidth) determines the maximum quantity of water (data) that can flow at a particular time, so the more bandwidth, the more data can be transferred between two nodes. Increasing bandwidth can therefore increase the data transfer rate. The actual data transfer rate also varies with the distance between the two nodes, the efficiency of the medium used, etc., so higher bandwidth does not always guarantee a higher data transfer rate. Fundamentally they are not the same thing: bandwidth is the capacity, and data transfer can be considered as consumption of that bandwidth.
You might want to check out the Nyquist-Shannon Sampling Theorem. In a nutshell it says that the bandwidth limits how much "data" can be transmitted. Further the Shannon–Hartley theorem states how much "data" can be transmitted using a given bandwidth (because of noise).
For example, in (A)DSL using QAM64: 4000 baud per channel, 6 bits per symbol and 62 upstream channels yields:
6 * 4000 * 62 = 1,488,000 bit/s = 1.488 Mbit/s
Hope this helps ^^
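The (A)DSL arithmetic above as a sketch. The function name is made up, and it assumes every channel uses the same constellation, which real DSL modems do not necessarily do.

```python
import math


def aggregate_bit_rate(constellation_points, symbols_per_s, channels):
    """Total bit rate when every channel uses the same QAM constellation."""
    # A constellation with 2**k points carries k bits per symbol.
    bits_per_symbol = int(math.log2(constellation_points))
    return bits_per_symbol * symbols_per_s * channels


rate = aggregate_bit_rate(64, 4000, 62)  # QAM64, 4000 baud, 62 channels
print(rate, "bit/s =", rate / 1_000_000, "Mbit/s")
```

More bandwidth means room for more channels (or more symbols per second per channel), and the total scales with both.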

Adjusting parameter - Machine learning

I am transferring some data remotely, packet by packet.
Before sending each packet I need to sleep for some time (milliseconds). After transferring each file I get feedback: fail or success.
Of course, the smaller the delay, the lower the success rate will be; however, the transfer time will be shorter.
My goal is to automatically adjust the current delay so that the average SUCCESS RATE equals some constant (say 98%).
Intuitively I assume:
After each successful transfer I'll decrease the current delay
After each unsuccessful transfer I'll increase the current delay
Over time I'll modify the current delay less (fade)
What algorithms would you suggest for efficient (from viewpoint of time to learn, memory) finding optimal parameter value?
You are essentially describing a network congestion solution. Look at http://en.wikipedia.org/wiki/Network_congestion_avoidance#Avoidance for much more information on the subject.
One algorithm that might suit you well is to decrease the time you wait after each successful transfer. After an unsuccessful transfer increase the time (either by a set amount or dynamically) and repeat indefinitely. I wish I could remember the specific name for this algorithm but at the moment it is escaping me.
Note that if you are truly sending packets over a real network and not just a toy network, "optimal" is not a constant, as the network is always in a state of change.
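One possible way to sketch the "decrease on success, increase on failure" idea is below. This is not the named algorithm I couldn't recall, and the class, its parameters and the default step size are all hypothetical. The only trick is the ratio of the two step sizes: the expected change per transfer is step * (target - actual success rate), so the delay stops drifting when the success rate sits at the target.

```python
class DelayTuner:
    """Nudges a send delay so the long-run success rate approaches a target."""

    def __init__(self, initial_delay_ms, target_success=0.98,
                 step_ms=1.0, min_delay_ms=0.0):
        self.delay_ms = initial_delay_ms
        self.target_success = target_success
        self.step_ms = step_ms          # hypothetical tuning knob
        self.min_delay_ms = min_delay_ms

    def update(self, success):
        """Feed back one transfer result and return the new delay."""
        if success:
            # Small step down: successes are common at the target rate.
            self.delay_ms -= self.step_ms * (1.0 - self.target_success)
        else:
            # Large step up: failures are rare at the target rate.
            self.delay_ms += self.step_ms * self.target_success
        self.delay_ms = max(self.min_delay_ms, self.delay_ms)
        return self.delay_ms


tuner = DelayTuner(initial_delay_ms=10.0)
for outcome in (True, True, False, True):
    print(round(tuner.update(outcome), 2))
```

Keeping step_ms fixed (rather than fading it to zero) lets the delay keep following a network whose conditions change; shrinking it over time trades that tracking ability for less jitter.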

ZMQ throughput optimization

I developed an application whose zmq message sizes vary widely. On average they are ~177 bytes, but in reality most messages are very small (< 20 bytes) and just a few messages are very big (> 3000 bytes).
Now the network is the limiting factor (1 Gbit Ethernet). I can reach ~50 MByte/s. Another benchmark told me that the network throughput can reach ~85 MByte/s with a packet size of > 256 bytes.
I think my results are that low because most packets are very small. Am I right? Is there a possibility to optimize zmq so that my application can use the whole bandwidth, too? Extended batching, for example?
Regards
The ZeroMQ guide illustrates the Black Box Pattern for high-speed subscribers. In essence, it uses a two-stream approach (per node), where each stream has its own I/O thread and subscriber, both of which are bound to a specific network interface (NIC) and core, so you'll need two network adapters and multiple cores per node for this to work. You can read the full details in the guide.

What is the latency (or delay) time for callbacks from the waveOutWrite API method?

I'm having a debate with some developers on another forum about accurately generating MIDI events (Note On messages and so forth). The human ear is pretty sensitive to slight timing inaccuracies, and I think their main problem comes from their use of relatively low-resolution timers which quantize their events around 15 millisecond intervals (which is large enough to cause perceptible inaccuracies).
About 10 years ago, I wrote a sample application (Visual Basic 5 on Windows 95) that was a combined software synthesizer and MIDI player. The basic premise was a leapfrog-buffer playback system with each buffer being the duration of a sixteenth note (example: with 120 quarter-notes per minute, each quarter-note was 500 ms and thus each sixteenth-note was 125 ms, so each buffer is 5513 samples). Each buffer was played via the waveOutWrite method, and the callback function from this method was used to queue up the next buffer and also to send MIDI messages. This kept the WAV-based audio and the MIDI audio synchronized.
To my ear, this method worked perfectly - the MIDI notes did not sound even slightly out of step (whereas if you use an ordinary timer accurate to 15 ms to play MIDI notes, they will sound noticeably out of step).
In theory, this method would produce MIDI timing accurate to the sample, or 0.0227 milliseconds (since there are 44.1 samples per millisecond). I doubt that this is the true latency of this approach, since there is presumably some slight delay between when a buffer finishes and when the waveOutWrite callback is notified. Does anyone know how big this delay would actually be?
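For anyone checking the numbers in the question, here is the arithmetic as a short sketch. The helper names are invented for this post, and the buffer length is rounded up to a whole sample, which is how 5512.5 becomes 5513.

```python
import math

SAMPLE_RATE_HZ = 44_100


def sixteenth_note_ms(quarter_notes_per_minute):
    """Duration of a sixteenth note in milliseconds."""
    return 60_000 / quarter_notes_per_minute / 4


def buffer_samples(quarter_notes_per_minute, sample_rate_hz=SAMPLE_RATE_HZ):
    """Samples in one sixteenth-note buffer, rounded up to a whole sample."""
    duration_ms = sixteenth_note_ms(quarter_notes_per_minute)
    return math.ceil(duration_ms * sample_rate_hz / 1000)


def sample_period_ms(sample_rate_hz=SAMPLE_RATE_HZ):
    """Theoretical timing resolution: the duration of one sample."""
    return 1000 / sample_rate_hz


print(sixteenth_note_ms(120), "ms per sixteenth note")
print(buffer_samples(120), "samples per buffer")
print(round(sample_period_ms(), 4), "ms per sample")
```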
The Windows scheduler runs at either 10ms or 16ms intervals by default depending on the processor. If you use the timeBeginPeriod() API you can change this interval (at a fairly significant power consumption cost).
In Windows XP and Windows 7, the wave APIs run with a latency of about 30ms, for Windows Vista the wave APIs have a latency of about 50ms. You then need to add in the audio engine latency.
Unfortunately I don't have numbers for the engine latency in one direction, but we do have some numbers regarding engine latency - we ran a test that played a tone looped back through a USB audio device and measured the round-trip latency (render to capture). On Vista the round trip latency was about 80ms with a variation of about 10ms. On Win7 the round trip latency was about 40ms with a variation of about 5ms. YMMV however since the amount of latency introduced by the audio hardware is different for each piece of hardware.
I have absolutely no idea what the latency was for the XP audio engine or the Win9x audio stack.
At the very basic level, Windows is a multi-threaded OS, and it schedules threads with 100ms time slices.
This means that, if there is no CPU contention, the delay between the end of the buffer and the waveOutWrite callback could be arbitrarily short. Or, if there are other busy threads, you may have to wait up to 100ms per thread.
In the best case, however, CPU clock speeds are in the GHz range now, which puts an absolute lower bound on how fast the callback can be called on the order of 0.0000000001 seconds (0.1 ns).
Unless you can figure out the maximum number of waveOutWrite callbacks you can process in a single second, which could imply the latency of each call, I think the latency is going to be orders of magnitude below perception most of the time, unless there are too many busy threads, in which case it's going to go horribly, horribly wrong.
To add to the great answers above.
Your question is about a latency that Windows neither promised nor cared about. As such, it might be quite different depending on OS version, hardware and other factors. The WaveOut API, and DirectSound too (not sure about WASAPI, but I guess it is also true for this latest Vista+ audio API), are all designed for buffered audio output. Specific callback accuracy is not required as long as you queue the next buffer in time, while the current one is still being played.
When you start audio playback, you make a few assumptions, such as no underflows during playback, continuous output, and an audio clock rate that is exactly what you expect, such as 44,100 Hz precisely. Then you do simple math to schedule your wave output in time, converting time to samples and then to bytes.
Sadly, the effective playback rate is not precise; e.g. the real hardware sampling rate may be 44,100 Hz -3%, and in the long run the time-to-byte math might let you down. There have been attempts to compensate for this effect, such as making the audio hardware the playback clock and synchronizing video to it (this is how players work), and rate-matching techniques that match the incoming data rate to the actual playback rate of the hardware. Both make absolute time measurements, and the latency in question, rather speculative.
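To illustrate the time-to-bytes math and how a rate mismatch accumulates, here is a rough sketch. The function names are mine, and 16-bit stereo PCM is an assumption; the 3% mismatch is the exaggerated figure from the paragraph above.

```python
def time_to_bytes(duration_ms, sample_rate_hz=44_100,
                  channels=2, bytes_per_sample=2):
    """Bytes of PCM needed for duration_ms (assumes 16-bit stereo by default)."""
    samples = duration_ms * sample_rate_hz // 1000
    return samples * channels * bytes_per_sample


def drift_ms(elapsed_ms, nominal_rate_hz, actual_rate_hz):
    """How far real playback lags (+) or leads (-) a schedule that was
    computed from the nominal rate, after elapsed_ms of scheduled audio."""
    return elapsed_ms * (nominal_rate_hz / actual_rate_hz - 1)


print(time_to_bytes(1000), "bytes for one second of audio")
# One minute of audio scheduled at 44,100 Hz, played by hardware running 3% slow.
print(round(drift_ms(60_000, 44_100, 44_100 * 0.97)), "ms behind after one minute")
```

Even a mismatch a hundred times smaller would add up to tens of milliseconds within a few minutes, which is why players slave their timing to the audio clock instead of the wall clock.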
On top of this come the API latencies: 20 ms, 30 ms, 50 ms and so on. For a long time now the waveOut API has been a layer on top of other APIs. This means that some processing takes place before data actually reaches the hardware, and this processing requires that you hand off the queued data well in advance, or the data won't reach the hardware in time. Say you attempt to queue your data in 10 ms buffers right before playback time: the API will accept this data, but it will be late passing it downstream, and there will be silence or comfort noise on the speakers.
This is also related to the callbacks that you receive. You could say that you don't care about the latency of buffers and that what matters to you is precise callback time. However, since the API is layered, you receive the callback at the accuracy of the inner-layer synchronization: the second inner layer notifies on a free buffer, and the first inner layer updates its records and checks whether it can release your buffer too (and those buffers don't have to match, either). This makes callback accuracy expectations really weak and unreliable.
Given that I have not touched the waveOut API for quite some time, if such a question of synchronization accuracy came up, I would probably first think of two things:
Windows provides access to the audio hardware clock (I am aware of the IReferenceClock interface available through DirectShow, and it probably comes from another lower-level facility which is also accessible), and having that available I would try to synchronize with it
The latest audio API from Microsoft, WASAPI, provides special support for low-latency audio, with new features like better media thread scheduling, exclusive mode streams and <10 ms latency for PCM; this is where better sync is to be looked for

Resources