Bandwidth is defined as a range of frequencies, and it seems to be correct to say that higher bandwidth guarantees a higher data rate.
However, I do not understand why.
The data rate depends on the modulation scheme, and nowadays QAM, which is a combination of ASK and PSK, is the most widely used scheme.
I understand that FSK uses more frequencies, so it needs more bandwidth, but I do not understand why ASK and PSK need more bandwidth.
(If QAM did not need more bandwidth, QAM could be used in a small bandwidth, and that would mean bandwidth has nothing to do with data rate.)
As I understand it, ASK does not need more bandwidth. If the transmission power in the transmitter is bigger, the amplitude of the wave will be bigger. In that sense, ASK can be achieved by transmission power control.
Furthermore, PSK can be constructed by delaying the signal. As far as I know, the phase angle is determined by the delay of the wave (in time).
If what I explained is correct, why does high bandwidth guarantee a high data rate?
In communications engineering, bandwidth is the width of a range of frequencies, measured in hertz (Hz).
Rate is the number of transmitted bits per time unit, usually seconds, so it's measured in bit/s. Alternatively, it can be given in symbols per time unit (baud).
The achievable rate is proportional to the system bandwidth (for a fixed signal-to-noise ratio). The Shannon capacity is one theoretical way to see this relation, as it gives the maximum number of bits per second that can be transmitted over a given bandwidth in the presence of noise.
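A quick numeric sketch of the Shannon–Hartley bound, C = B·log2(1 + S/N), makes the proportionality concrete. The 30 dB signal-to-noise ratio and the two bandwidths are made-up illustration values, and the function names are hypothetical; the point is only that at a fixed SNR, doubling B doubles the bound. It also touches on the QAM question: packing more bits into each symbol requires a higher SNR, so modulation alone cannot substitute for bandwidth indefinitely.

```python
import math


def db_to_linear(snr_db: float) -> float:
    """Convert a signal-to-noise ratio from decibels to a linear power ratio."""
    return 10.0 ** (snr_db / 10.0)


def shannon_capacity_bps(bandwidth_hz: float, snr_linear: float) -> float:
    """Shannon-Hartley upper bound on the error-free bit rate: C = B * log2(1 + S/N)."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)


# Illustrative (made-up) channel: 30 dB SNR, evaluated at two bandwidths.
snr = db_to_linear(30.0)
for bandwidth in (1e6, 2e6):
    capacity = shannon_capacity_bps(bandwidth, snr)
    print(f"B = {bandwidth / 1e6:.0f} MHz -> C <= {capacity / 1e6:.2f} Mbit/s")
```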
We can think of bandwidth as the diameter of a water pipe. A larger pipe can carry a larger volume of water, so more water can be delivered between two points. How large the pipe (bandwidth) is determines the maximum quantity of water (data) that can flow at a particular time. So the more bandwidth, the more data can be transferred between two nodes, and increasing bandwidth can increase the data transfer rate. However, the actual transfer rate also varies with the distance between the two nodes, the efficiency of the medium used, etc., so higher bandwidth does not always guarantee a higher data transfer rate. In that sense, the two are not fundamentally the same quantity; data transfer can be considered consumption of bandwidth.
You might want to check out the Nyquist–Shannon sampling theorem. In a nutshell, it says that the bandwidth limits how much "data" can be transmitted. Further, the Shannon–Hartley theorem states how much "data" can be transmitted over a given bandwidth in the presence of noise.
For example, (A)DSL using QAM-64 at 4000 baud per channel, 6 bits per baud, and 62 upstream channels yields:
6 × 4000 × 62 = 1,488,000 bit/s = 1.488 Mbit/s
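The same arithmetic as a small function, as a sketch: a QAM-M constellation carries log2(M) bits per symbol (6 for QAM-64), and the per-channel symbol rate is what the channel's bandwidth limits, so adding channels (bandwidth) adds rate proportionally. The function names are hypothetical, and the result is a raw rate that ignores error-correction and framing overhead.

```python
import math


def qam_bits_per_symbol(constellation_size: int) -> int:
    """Bits carried by one symbol of an M-point constellation (M must be a power of two)."""
    bits = int(math.log2(constellation_size))
    if 2 ** bits != constellation_size:
        raise ValueError("constellation size must be a power of two")
    return bits


def aggregate_rate_bps(constellation_size: int, baud_per_channel: int, channels: int) -> int:
    """Raw bit rate = bits/symbol * symbols/s per channel * number of channels."""
    return qam_bits_per_symbol(constellation_size) * baud_per_channel * channels


# The QAM-64, 4000 baud, 62-channel figures from the answer above.
rate = aggregate_rate_bps(64, 4000, 62)
print(f"{rate} bit/s = {rate / 1e6} Mbit/s")  # 1488000 bit/s = 1.488 Mbit/s
```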
Hope this helps ^^
CPU frequency and CPU usage are the main factors that impact energy consumption (as far as I know). However, which is better from an energy-saving perspective, i.e. for running a task with minimum energy consumption:
Option 1: Maximum CPU frequency with minimum usage
Option 2: Maximum CPU usage with min frequency.
Work per unit time scales approximately linearly with CPU frequency (a bit less than linearly, because at a higher CPU frequency each DRAM access costs more clock cycles).
CPU power has two components: switching (dynamic) power, which scales with f^3 (because voltage has to increase for higher frequency, and the transistors charge and discharge capacitances, costing energy proportional to V^2, more often); and leakage power, which doesn't vary as dramatically. At high frequency dynamic power dominates, but as you lower the frequency, leakage eventually becomes significant. The smaller your transistors, the more significant leakage is.
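To see where the f^3 comes from, here is a sketch using the textbook CMOS switching-power estimate P = activity·C·V²·f, under the simplifying assumption (not true of real chips, which have a minimum voltage) that the required voltage scales linearly with frequency. The capacitance and volts-per-GHz constants are made up; only the ratios matter. Doubling f multiplies power by 8 and energy per cycle by 4.

```python
def dynamic_power_w(cap_farads: float, voltage_v: float, freq_hz: float, activity: float = 1.0) -> float:
    """Classic CMOS switching-power estimate: P = activity * C * V^2 * f."""
    return activity * cap_farads * voltage_v ** 2 * freq_hz


# Made-up parameters; the linear voltage/frequency relation is a simplifying assumption.
C_EFF = 1e-9      # effective switched capacitance, farads
V_PER_GHZ = 0.5   # volts required per GHz of clock


def power_at(freq_ghz: float) -> float:
    """Dynamic power when the supply voltage is scaled in proportion to frequency."""
    voltage = V_PER_GHZ * freq_ghz
    return dynamic_power_w(C_EFF, voltage, freq_ghz * 1e9)


p1, p2 = power_at(2.0), power_at(4.0)
print(f"power, 4 GHz vs 2 GHz: {p2 / p1:.1f}x")                            # scales ~f^3
print(f"energy per cycle, 4 GHz vs 2 GHz: {(p2 / 4e9) / (p1 / 2e9):.1f}x")  # scales ~f^2
```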
System-wide, there's also power drawn by other components, such as DRAM, that doesn't change much or at all with CPU frequency.
Min frequency is more efficient, unless the minimum is far below the best frequency for work per energy. (Some parts of power decrease with frequency, others like leakage current and DRAM refresh don't).
Frequencies below the maximum give more work per unit of energy (better task efficiency), down to a certain point, like 800 MHz on a Skylake CPU on Intel's 14 nm process. If there's work to be done, there's no gain from dropping below that; just race-to-sleep at that most efficient frequency. (Below that point power would still decrease, but the work rate would decrease more.)
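A toy model of that sweet spot, as a sketch: energy per task = (k·f^3 + P_static) × (cycles / f), where P_static stands for leakage plus frequency-independent system power such as DRAM. The constants are made up to show the shape of the curve, and the model assumes the machine drops to near-zero power (sleeps) once the task is done. Setting dE/df = 0 gives f* = (P_static / 2k)^(1/3); below f* the static term dominates and energy per task rises again.

```python
def energy_per_task_j(freq_ghz: float, cycles: float, k_w_per_ghz3: float, static_w: float) -> float:
    """Energy = power * time, with power = k*f^3 (dynamic) + static (leakage, DRAM, ...)."""
    time_s = cycles / (freq_ghz * 1e9)
    power_w = k_w_per_ghz3 * freq_ghz ** 3 + static_w
    return power_w * time_s


# Hypothetical parameters chosen only to illustrate the shape of the curve.
CYCLES = 1e9
K = 1.0        # W per GHz^3 of dynamic power
STATIC = 2.0   # W of frequency-independent power

freqs = [round(0.4 + 0.1 * i, 1) for i in range(37)]   # 0.4 .. 4.0 GHz
best = min(freqs, key=lambda f: energy_per_task_j(f, CYCLES, K, STATIC))
analytic = (STATIC / (2 * K)) ** (1 / 3)              # from dE/df = 0
print(f"lowest energy per task near {best} GHz (analytic optimum {analytic:.2f} GHz)")
```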
The IDF2015 slides on Skylake power management, https://en.wikichip.org/wiki/File:Intel_Architecture,_Code_Name_Skylake_Deep_Dive-_A_New_Architecture_to_Manage_Power_Performance_and_Energy_Efficiency.pdf, cover a lot of that general-case material well. Unfortunately I don't know where to find a copy of the audio from Efraim Rotem's talk; it was up for a year or so afterward, but the original link is dead now. :/
Also, on dynamic power in general (from switching, not leakage) scaling with frequency cubed if you adjust voltage as well as frequency, see Modern Microprocessors: A 90-Minute Guide! and:
https://electronics.stackexchange.com/questions/614018/why-does-switching-cause-power-dissipation
https://electronics.stackexchange.com/questions/258724/why-do-cpus-need-so-much-current
https://electronics.stackexchange.com/questions/548601/why-does-decreasing-the-cmos-supply-voltage-also-decrease-the-maximum-circuit-fr
I know this may sound like a silly question considering the speeds at which computers work, but say a certain address in RAM is physically closer to the CPU on the motherboard than another address located as far from the CPU as possible. Will this have an effect on the speed at which the closer address is accessed compared to the farthest one?
If you're talking about NUMA, i.e. accessing RAM connected to this socket vs. going over the interconnect to access RAM connected to another socket, then yes, this is a well-known effect. Otherwise, no.
Also note that signal travel time over the external memory bus is only a tiny fraction of the total cache-miss latency for a CPU core. Queuing inside the CPU, the time to check L3 cache, and the internal bus between cores and memory controllers all add up. Tightening DDR4 CAS latency by one whole memory cycle makes only a small (but measurable) difference to overall memory performance (see hardware review sites benchmarking memory overclocking); other timings matter even less.
No, DDR4 (and earlier) memory buses are synced to a clock and expect a response a specific number of memory-clock cycles (footnote 1) after a command, so the controller can pipeline requests without causing overlap. See What Every Programmer Should Know About Memory for more about DDR memory commands and memory timings (and CAS latency vs. other timings).
(Wikipedia's introduction to SDRAM mentions that earlier DRAM standards were asynchronous, so those chips could presumably reply as soon as they had the data ready. If that happened a whole clock cycle early, a speedup was perhaps possible.)
So memory latency is discrete, not continuous, and being 1 mm closer can't make it a fraction of a nanosecond faster. The only plausible effect is if you populate the DIMM slots in a way that lets you run tighter timings and/or a faster memory clock than some other arrangement would. Read about memory overclocking for the real-world experience of people who push systems to the limits of stability. What's best may depend on the motherboard; physical trace length isn't the only consideration.
AFAIK, all real-world motherboard firmware insists on using the same timings for all DIMMs on all memory channels (footnote 2).
So even if one DIMM could theoretically support tighter timings than another (e.g. because of shorter or less noisy traces, or less signal reflection because it's at the end instead of the middle of some traces), you couldn't actually configure a system to make that happen. Physical proximity isn't the only thing that could help.
(This is probably a good thing; interleaving physical address space across multiple DRAM channels allows sequential reads/writes to benefit from the aggregate bandwidth of all channels. But if they ran at different speeds, you might have more contention for shared busses between controllers and cores, and more time left unused.)
Memory frequency and timings are usually chosen by the firmware after reading the SPD ROM on each DIMM (memory module) to find out what memory is installed and what timings each DIMM is rated for at what frequencies.
Footnote 1: I'm not sure how transmission-line propagation delays over memory traces are accounted for when the memory controller and DIMM agree on how many cycles there should be after a read command before the DIMM starts putting data on the bus.
The CAS latency is a timing number that the memory controller programs into the "mode register" of each DIMM.
Presumably the number the DIMM sees is the actual number it uses, and the memory controller has to account for the round-trip propagation delay to know when to really expect a read burst to start arriving. Other command latencies are just times between sending different commands so propagation delay doesn't matter: the gap at the sending side equals the gap at the receiving side.
But the CAS latency seen by the memory controller includes the round-trip propagation delay for signals to travel over the wires to the DIMM and back. Modern systems with DDR4-4000 have a memory clock that runs at 2 GHz, a cycle time of half a nanosecond (transferring data on both the rising and falling edges).
At light speed, 0.5 ns is "only" about 15 cm, half of one of Grace Hopper's nanoseconds, and with transmission-line effects the distance could be somewhat shorter (maybe 2/3 of that). On a big server motherboard it's certainly plausible that some DIMMs are far enough from the CPU for traces to be that long.
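The distance arithmetic, as a sketch: one clock period times the signal velocity. The 2/3 velocity factor is the rough figure from the text above, not a measured value for any particular board, and the function name is hypothetical. Because the CAS latency seen by the controller includes a round trip, the one-way trace length that eats a whole cycle is half of these figures.

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum, m/s


def distance_per_cycle_cm(clock_hz: float, velocity_factor: float = 1.0) -> float:
    """How far a signal travels during one clock period, in centimetres."""
    period_s = 1.0 / clock_hz
    return C_M_PER_S * velocity_factor * period_s * 100.0


# DDR4-4000: 4000 MT/s on a 2 GHz clock (data on both edges).
print(f"{distance_per_cycle_cm(2e9):.1f} cm per cycle in vacuum")
print(f"{distance_per_cycle_cm(2e9, velocity_factor=2/3):.1f} cm per cycle at ~2/3 c (assumed)")
```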
The rated speeds of memory DIMMs are somewhat conservative, so they're still supposed to work at that speed even when placed as far away as the DDR4 standard allows. I don't know the details, but I assume JEDEC considers this when developing DDR SDRAM standards.
If there's a "data valid" pin the DIMM asserts at the start of the read burst, that would solve the problem, but I haven't seen a mention of that on Wikipedia.
Timings are those numbers like 9-9-9-24, the first one being CAS latency (CL). https://www.hardwaresecrets.com/understanding-ram-timings/ was an early Google hit if you want to read more from a perf-tuning PoV. Timings are also described in Ulrich Drepper's "What Every Programmer Should Know About Memory", linked earlier, from a how-it-works PoV. Note that the higher the memory clock speed, the less real time (in nanoseconds) a given number of cycles represents. So CAS latency and other timings have stayed nearly constant in nanoseconds as clock frequencies have increased, or have even dropped. https://www.crucial.com/articles/about-memory/difference-between-speed-and-latency shows a table.
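A sketch of the cycles-to-nanoseconds conversion: DDR transfers twice per clock, so a DDR-N module's clock runs at N/2 MHz and one cycle lasts 2000/N ns. The speed/CL pairings below are example combinations for illustration only, not figures from the linked table.

```python
def cas_latency_ns(cl_cycles: int, data_rate_mts: int) -> float:
    """Convert CAS latency from memory-clock cycles to nanoseconds.

    DDR transfers twice per clock, so the clock is data_rate / 2 MHz and one
    cycle lasts 2000 / data_rate nanoseconds.
    """
    return cl_cycles * 2000.0 / data_rate_mts


# Example speed/CL pairings, for illustration only.
for name, rate, cl in [("DDR3-1600", 1600, 9), ("DDR4-3200", 3200, 16), ("DDR4-4000", 4000, 19)]:
    print(f"{name} CL{cl}: {cas_latency_ns(cl, rate):.2f} ns")
```

The cycle count roughly doubles across these examples while the nanosecond latency barely moves, which is the point made above.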
Footnote 2: Unless we're talking about special faster memory for use as a scratchpad or cache for the larger main memory, but still off-chip. e.g. the 16GB of MCDRAM on Xeon Phi cards, separate from the 384 GB of regular DDR4. But faster memories are usually soldered down so timings are fixed, not socketed DIMMs. So I think it's fair to say that all DIMMs in a system will run with the same timings.
Other random notes:
https://www.overclock.net/threads/ram-4x-sr-or-2x-dr-for-ryzen-3000.1729606/ contained some discussion of motherboards with a "T-topology" vs. "daisy chain" for the layout of their DIMM sockets. This seems pretty self-explanatory terminology: a T would be when each of the 2 DIMMs on a channel are on opposite sides of the CPU, about equidistant from the pins. vs. "daisy chain" when both DIMMs for the same channel are on the same side of the CPU, with one farther away than the other.
I'm not sure what the recommended practice is for using the closer or farther socket. Signal reflection could be more of a concern with the near socket because it's not the end of the trace.
If you have multiple DIMMs on the same memory channel, the DDR4 protocol may require that they all run at the same timings. (Such DIMMs see each other's commands, except that there's a "chip-select" pin that the memory controller can drive independently for each DIMM to indicate which one a command is for.)
But in theory a CPU could be designed to run its different memory channels at different frequencies, or at least different timings at the same frequency if the memory controllers all share a clock. And of course in a multi-socket system, you'd expect no physical / electrical obstacle to programming different timings for the different sockets.
(I haven't played around in the BIOS of a multi-socket system for years, not since I was a cluster sysadmin in the AMD K8 / K10 days.) So IDK; it's possible that some BIOSes have options to set different timings for different sockets, or simply allow different auto-detected timings if you use slower RAM in one socket than in the others. But given the price of servers and how few people run them as hobby machines, it's unlikely that vendors would bother to support or validate such a config.
I am struggling to draw a clear line between latency, bandwidth and throughput.
Can someone explain them to me in simple terms, with easy examples?
Water Analogy:
Latency is the amount of time it takes water to travel through the tube.
Bandwidth is how wide the tube is.
The rate of water flow is the throughput.
Vehicle Analogy:
Vehicle travel time from source to destination is latency.
The type of roadway (e.g. how many lanes it has) is bandwidth.
The number of vehicles traveling per unit of time is throughput.
When a SYN packet is sent using TCP, the sender waits for a SYN+ACK response; the time between sending and receiving is the latency. It's measured in a single dimension, i.e. time.
If we're doing this on a 100 Mbit/s connection, that is the theoretical bandwidth we have, i.e. how many bits per second we can send.
If I compress a 1000 Mbit file to 100 Mbit and send it over the 100 Mbit/s line, my effective throughput could be considered 1 Gbit/s. Theoretical throughput and theoretical bandwidth are the same on this network, so why am I saying the throughput is 1 Gbit/s?
When people talk about throughput, I hear it most in relation to an application, i.e. the 1 Gbit/s throughput example I gave assumed compression at some layer in the stack, and we measured throughput there. The throughput of the actual network did not change, but the application throughput did. Sometimes throughput refers to actual (measured) throughput: a 100 Mbit/s connection has a theoretical bandwidth, and a theoretical throughput, of 100 Mbit/s, but that is highly unlikely to be what you'll actually get.
Throughput is also used for whole systems, e.g. number of dogs washed per day or number of bottles filled per hour. You don't often use bandwidth in this way.
Note that bandwidth in particular has other common meanings. I've assumed networking because this is Stack Overflow, but on a maths or amateur radio forum I might be talking about something else entirely.
https://en.wikipedia.org/wiki/Bandwidth
https://en.wikipedia.org/wiki/Latency
This is worth reading on throughput.
https://en.wikipedia.org/wiki/Throughput
Here is my bit in a language which I can understand
When you go to buy a water pipe, there are two completely independent parameters that you look at: the diameter of the pipe and its length. The diameter determines the throughput of the pipe, and the length determines the latency, i.e. the time it will take for a water droplet to travel across the pipe. The key point is that length and diameter are independent, and thus so are the latency and throughput of a communication channel.
More formally, throughput is defined as the amount of water entering or leaving the pipe every second, and latency is the average time required for a droplet to travel from one end of the pipe to the other.
Let’s do some math:
For simplicity, assume that our pipe has a 4 in × 4 in square cross-section and a length of 12 in. Now assume that each water droplet is a 0.1 in × 0.1 in × 0.1 in cube. Thus, one cross-section of the pipe can fit 40 × 40 = 1600 water droplets. Now assume that water droplets travel at a rate of 1 inch/second.
Throughput: Each set of droplets will move into the pipe in 0.1 seconds. Thus, 10 sets will move in 1 second, i.e., 16000 droplets will enter the pipe per second. Note that this is independent of the length of the pipe.
Latency: At one inch/second, it will take 12 seconds for droplet A to get from one end of the pipe to the other regardless of pipe’s diameter. Hence the latency will be 12 seconds.
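The same calculation as a function, as a sketch under the answer's assumptions (square pipe, cubic droplets, every droplet moving at the same speed); `pipe_metrics` is a hypothetical name. Throughput depends only on the cross-section and speed, and latency only on the length and speed, so changing one dimension leaves the other metric untouched.

```python
def pipe_metrics(side_in: float, length_in: float, droplet_in: float, speed_in_per_s: float):
    """Return (throughput in droplets/s, latency in s) for the square-pipe model."""
    # Assumes the side is a whole multiple of the droplet size; round() absorbs float noise.
    per_layer = round(side_in / droplet_in) ** 2     # droplets in one cross-section
    layers_per_s = speed_in_per_s / droplet_in       # cross-sections entering per second
    throughput = per_layer * layers_per_s
    latency = length_in / speed_in_per_s
    return throughput, latency


print(pipe_metrics(4, 12, 0.1, 1))   # the answer's pipe
print(pipe_metrics(4, 24, 0.1, 1))   # twice as long: same throughput, double latency
print(pipe_metrics(8, 12, 0.1, 1))   # twice as wide: 4x throughput, same latency
```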
I would like to supplement the answers already written with another distinction between latency and throughput, relevant to the concept of pipelining. For that purpose I'll use an everyday example, the preparation of clothes: to get them ready, we have to (i) wash them, (ii) dry them, and (iii) iron them. Each of these tasks takes some time, let's say A, B and C respectively. Every batch of clothes needs a total of A+B+C time until it is ready; this is the latency of the whole process. However, since i, ii and iii are separate sub-processes, you may start washing the 3rd batch of clothes while the 2nd one is drying and the 1st batch is being ironed, and so on (pipelining). Then every batch of clothes after the 1st will be ready max(A,B,C) time after the previous one. Throughput would be measured in batches of clothes per unit time, equal to 1/max(A,B,C).
That being said, this answer tries to highlight that when we only know the latency of a system, we do not necessarily know its throughput. These are truly different metrics and not just another way to express the same information.
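A sketch of the laundry pipeline in numbers. The stage times (wash 30, dry 60, iron 20 minutes) are made up, and `pipeline_stats` is a hypothetical helper. It assumes each stage handles one batch at a time; the first batch takes A+B+C, and each later batch finishes max(A,B,C) after the previous one.

```python
def pipeline_stats(stage_times, batches):
    """Latency, steady-state throughput and total time for a simple linear pipeline."""
    latency = sum(stage_times)                    # one batch passes through every stage
    bottleneck = max(stage_times)                 # the slowest stage paces the pipeline
    throughput = 1.0 / bottleneck                 # batches completed per unit time
    total = latency + (batches - 1) * bottleneck  # first batch, then one per bottleneck period
    return latency, throughput, total


# Hypothetical stage times in minutes: wash, dry, iron.
lat, thr, total = pipeline_stats([30, 60, 20], batches=4)
print(f"latency {lat} min, throughput {thr * 60:.1f} batches/hour, 4 batches in {total} min")
print(f"without pipelining: {4 * lat} min")
```

Latency is the same 110 minutes either way; only the throughput changes.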
Latency: elapsed time of an event.
e.g. walking from point A to B takes one minute, so the latency is one minute.
Throughput: the number of events that can be executed per unit of time.
e.g. bandwidth is a measure of throughput.
We can increase bandwidth to improve throughput, but it won't improve latency.
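A simple transfer-time model shows why extra bandwidth helps bulk transfers but not small messages. It is a sketch that assumes time = one-way latency + size / bandwidth and ignores protocol handshakes; the 50 ms latency and the message sizes are made-up values.

```python
def transfer_time_s(size_bits: float, bandwidth_bps: float, latency_s: float) -> float:
    """Time until the last bit arrives: one-way latency plus serialization time."""
    return latency_s + size_bits / bandwidth_bps


# Hypothetical link with 50 ms one-way latency, at two bandwidths.
for bw in (100e6, 1e9):
    small = transfer_time_s(8_000, bw, 0.05)   # a 1 kB message
    large = transfer_time_s(8e9, bw, 0.05)     # a 1 GB file
    print(f"{bw / 1e6:.0f} Mbit/s: 1 kB in {small * 1e3:.3f} ms, 1 GB in {large:.2f} s")
```

A 10x increase in bandwidth cuts the 1 GB transfer from about 80 s to about 8 s, while the 1 kB message stays at about 50 ms, dominated by latency.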
Take the RPC case: there are two components to the latency of message communication in a distributed system. The first is the hardware overhead and the second is the software overhead.
The hardware overhead is dependent on how the network is interfaced with the computer, this is managed mostly by the network controller.
I wrote a blog about it :)
https://medium.com/#nbosco/latency-vs-throughput-d7a4459b5cdb
Bandwidth is a measure of data per second, which is equal to the temporal speed of such data multiplied by the number of spatial multiplexing channels, so essentially in the water pipe analogy it is flow velocity * diameter. In digital signal processing, the temporal speed of the data is constrained by the frequency bandwidth of the channel and the SNR.
Latency is the physical length of the channel (in terms of the number of bits it can hold in flight) divided by the bandwidth. Latency increases when the transmitter and receiver get further apart spatially, but bandwidth does not change, because the transmitter's layer 1 can still send at the same speed. Latency also increases when there's an intermediate or receiving node that buffers, processes or delays the data but still has the same bandwidth: it might take a while for the first packets of a download to come in, but when they do, they will hopefully arrive at full bandwidth. Of course, that assumes that the transmitter's protocol stack doesn't need to wait for control packets from the receiver, like TCP ACKs or layer 2 ACKs.
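A sketch of the bits-in-flight relation (the bandwidth-delay product) and of the ACK caveat. The 100 Mbit/s link, 20 ms one-way latency and 64 KiB window are hypothetical values. When a sender may have at most one window of unacknowledged data outstanding per round trip, throughput is capped at window / RTT no matter how much bandwidth the link has.

```python
def bits_in_flight(bandwidth_bps: float, one_way_latency_s: float) -> float:
    """Bandwidth-delay product: how many bits the channel holds at once."""
    return bandwidth_bps * one_way_latency_s


def latency_from_bits_in_flight(bits: float, bandwidth_bps: float) -> float:
    """The answer's definition, rearranged: latency = bits in flight / bandwidth."""
    return bits / bandwidth_bps


def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Ceiling when at most one window of unacknowledged data can be outstanding per RTT."""
    return window_bytes * 8 / rtt_s


# Hypothetical 100 Mbit/s link with 20 ms one-way latency (40 ms RTT).
bdp = bits_in_flight(100e6, 0.020)
print(f"{bdp:.0f} bits in flight; latency = {latency_from_bits_in_flight(bdp, 100e6) * 1e3:.0f} ms")
print(f"64 KiB window: at most {window_limited_throughput_bps(65536, 0.040) / 1e6:.1f} Mbit/s")
```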
I basically need some help to explain/confirm some experimental results.
Basic Theory
A common idea expressed in papers on DVFS is that execution times have on-chip and off-chip components. On-chip components of execution time scale linearly with CPU frequency whereas the off-chip components remain unaffected.
Therefore, for CPU-bound applications, there is a linear relationship between CPU frequency and instruction-retirement rate. On the other hand, for a memory bound application where the caches are often missed and DRAM has to be accessed frequently, the relationship should be affine (one is not just a multiple of the other, you also have to add a constant).
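One way to make the on-chip/off-chip model concrete, as a sketch: time = core_cycles / f + off-chip time, so execution time is affine in the clock period 1/f and the retirement rate grows more slowly than f once the off-chip term is non-zero. The per-node costs (5 instructions, 5 core cycles, an 80 ns DRAM round trip) are hypothetical, not measurements from this experiment.

```python
def retirement_rate_per_ns(freq_ghz: float, instructions: float,
                           core_cycles: float, offchip_ns: float) -> float:
    """Instructions retired per ns when time = core_cycles / f + offchip_ns.

    With f in GHz, core_cycles / f is already in nanoseconds.
    """
    time_ns = core_cycles / freq_ghz + offchip_ns
    return instructions / time_ns


# Hypothetical per-list-node costs: 5 instructions, 5 core cycles,
# plus an assumed 80 ns DRAM round trip in the memory-bound case.
for label, offchip in (("CPU-bound", 0.0), ("memory-bound", 80.0)):
    r1 = retirement_rate_per_ns(1.0, 5, 5, offchip)
    r2 = retirement_rate_per_ns(2.0, 5, 5, offchip)
    print(f"{label}: doubling f multiplies the rate by {r2 / r1:.2f}")
```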
Experiment
I was doing experiments looking at how CPU frequency affects instruction-retirement rate and execution time under different levels of memory-boundedness.
I wrote a test application in C that traverses a linked list. I effectively create a linked list whose individual nodes have sizes equal to the size of a cache-line (64 bytes). I allocated a large amount of memory that is a multiple of the cache-line size.
The linked list is circular such that the last element links to the first element. Also, this linked list randomly traverses through the cache-line sized blocks in the allocated memory. Every cache-line sized block in the allocated memory is accessed, and no block is accessed more than once.
Because of the random traversal, I assumed it should not be possible for the hardware to use any prefetching. Basically, by traversing the list, you have a sequence of memory accesses with no stride pattern, no temporal locality, and no spatial locality. Also, because this is a linked list, one memory access cannot begin until the previous one completes. Therefore, the memory accesses should not be parallelizable.
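The question doesn't say how the random visiting order was generated. One standard way to get a single cycle that visits every block exactly once is Sattolo's algorithm, sketched here in index form. In the C program, index i would correspond to the i-th 64-byte block and `next_index[i]` to that node's next pointer. The function names are hypothetical.

```python
import random


def random_cycle(n, seed=None):
    """Return next_index[] forming one random cycle through all n nodes (Sattolo's algorithm)."""
    rng = random.Random(seed)
    next_index = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)   # 0 <= j < i; excluding j == i is what guarantees a single cycle
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index


def traverse(next_index, start=0):
    """Follow the 'pointers' from start until returning to it; return the visit order."""
    order = [start]
    i = next_index[start]
    while i != start:
        order.append(i)
        i = next_index[i]
    return order


nxt = random_cycle(16, seed=1)
print(traverse(nxt))   # every index 0..15 exactly once, in a shuffled order
```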
When the amount of allocated memory is small enough, you should have no cache misses beyond initial warm up. In this case, the workload is effectively CPU bound and the instruction-retirement rate scales very cleanly with CPU frequency.
When the amount of allocated memory is large enough (bigger than the LLC), you should be missing the caches. The workload is memory bound and the instruction-retirement rate should not scale as well with CPU frequency.
The basic experimental setup is similar to the one described here:
"Actual CPU Frequency vs CPU Frequency Reported by the Linux "cpufreq" Subsystem".
The above application is run repeatedly for some duration. At the start and end of the duration, the hardware performance counter is sampled to determine the number of instructions retired over the duration. The length of the duration is measured as well. The average instruction-retirement rate is measured as the ratio between these two values.
This experiment is repeated across all the possible CPU frequency settings using the "userspace" CPU-frequency governor in Linux. Also, the experiment is repeated for the CPU-bound case and the memory-bound case as described above.
Results
The two following plots show results for the CPU-bound case and memory-bound case respectively. On the x-axis, the CPU clock frequency is specified in GHz. On the y-axis, the instruction-retirement rate is specified in (1/ns).
A marker is placed for each repetition of the experiment described above. The line shows what the result would be if the instruction-retirement rate increased at the same rate as CPU frequency and passed through the lowest-frequency marker.
Results for the CPU-bound case.
Results for the memory-bound case.
The results make sense for the CPU-bound case, but not as much for the memory-bound case. All the markers for the memory-bound case fall below the line, which is expected because the instruction-retirement rate should not increase at the same rate as CPU frequency for a memory-bound application. The markers appear to fall on straight lines, which is also expected.
However, there appear to be step changes in the instruction-retirement rate as the CPU frequency changes.
Question
What is causing the step changes in the instruction-retirement rate? The only explanation I could think of is that the memory controller is somehow changing the speed and power-consumption of memory with changes in the rate of memory requests. (As instruction-retirement rate increases, the rate of memory requests should increase as well.) Is this a correct explanation?
You seem to have exactly the results you expected: a roughly linear trend for the CPU-bound program, and a shallower affine one for the memory-bound case (which is less affected by CPU frequency). You will need a lot more data to determine whether they are consistent steps or, as I suspect, mostly random jitter depending on how 'good' the list is.
The CPU clock will affect bus clocks, which will affect timings and so on; synchronisation between differently clocked buses is always challenging for hardware designers. Interestingly, your steps are spaced about 400 MHz apart, but I wouldn't draw too much from this. Generally, this kind of stuff is way too complex and hardware-specific to be properly analysed without 'inside' knowledge of the memory controller used, etc.
(please draw nicer lines of best fit)
This is a very interesting topic. They use the following formula to compute the break-even access interval:
BreakEvenIntervalinSeconds = (PagesPerMBofRAM / AccessesPerSecondPerDisk) × (PricePerDiskDrive / PricePerMBofRAM).
It is derived using formulas for the cost of RAM to hold a page in the buffer pool and the cost of a (fractional) disk to perform I/O every time a page is needed, equating these two costs, and solving the equation for the interval between accesses.
So the cost of disk I/O per access is PricePerDiskDrive / AccessesPerSecondPerDisk. My question is: why is the disk I/O cost per access computed like this?
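One way to read that term, as a sketch: PricePerDiskDrive / AccessesPerSecondPerDisk is the price of the fraction of a disk needed to sustain one access per second. A page read every T seconds needs 1/T accesses per second, so its disk cost is that price divided by T. Setting this equal to the RAM cost of holding the page (PricePerMBofRAM / PagesPerMBofRAM) and solving for T gives the formula above. All prices and rates below are made-up illustration values, and the function names are hypothetical.

```python
def disk_cost_per_access_rate(price_per_disk: float, accesses_per_s_per_disk: float) -> float:
    """Dollars of disk needed to sustain one access per second."""
    return price_per_disk / accesses_per_s_per_disk


def ram_cost_per_page(price_per_mb_ram: float, pages_per_mb_ram: float) -> float:
    """Dollars of RAM needed to keep one page resident."""
    return price_per_mb_ram / pages_per_mb_ram


def break_even_interval_s(pages_per_mb_ram, accesses_per_s_per_disk, price_per_disk, price_per_mb_ram):
    """The quoted formula: (pages/MB / accesses/s) * (price per disk / price per MB of RAM)."""
    return (pages_per_mb_ram / accesses_per_s_per_disk) * (price_per_disk / price_per_mb_ram)


# Made-up inputs: 8 KB pages (128 per MB), 100 accesses/s per disk, $100 disk, $0.01 per MB of RAM.
t_formula = break_even_interval_s(128, 100, 100.0, 0.01)
# Same result from equating disk cost (per-access-rate cost / T) with RAM cost per page.
t_derived = disk_cost_per_access_rate(100.0, 100) / ram_cost_per_page(0.01, 128)
print(f"break-even interval: {t_formula:.0f} s (derived: {t_derived:.0f} s)")
```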
The underlying assumption is that the limit to the life of a disk is how many disk seeks there are, while RAM has a fixed cost for its size, and a fixed lifetime regardless of how often it is accessed. This is reasonable because seeking to disk causes physical wear and tear, and when the disk goes, you lose the whole disk. By contrast RAM has no physical moving parts, and so does not wear out with use.
With that assumption, the cost of keeping data on disk depends on the frequency of access and the cost of the disk. The cost of keeping data in RAM depends on how much RAM you're using. What they are trying to find is the break even point between where it is cheaper to keep data on disk or in RAM.
However the equation as given is incomplete. While that equation identifies relevant factors, there is an important constant of proportionality missing. How many accesses can the average hard drive sustain? How long does RAM last on average? Those enter into the costs for keeping data on hard drives and RAM, and without them you are comparing apples and oranges.
This is indicative of my impression of the whole paper. It says a lot, at great length, about an important topic, but the analysis is sloppy. They leave critical things out and don't do enough to help people understand what they are thinking and when their analysis applies to what you are doing. For instance, if you are trying to maintain a low-latency system, you have to keep all of your data in RAM. Period. If you're processing large data sets and don't want to pay to keep them all in RAM, then you will be streaming data to/from disk. If you're keeping data in a redundant format, for instance RAID, you are doing more seeks per read than they admit.