Independent memory channels: What does it mean for a programmer

Independent memory channels: What does it mean for a programmer - memory-management

I read the Datasheet for an Intel Xeon Processor and saw the following:
The Integrated Memory Controller (IMC) supports DDR3 protocols with four
independent 64-bit memory channels with 8 bits of ECC for each channel (total of
72-bits) and supports 1 to 3 DIMMs per channel depending on the type of memory
installed.
I need to know what this exactly means from a programmers view.
The documentation on this seems to be rather sparse and I don't have someone from Intel at hand to ask ;)
Can this memory controller execute 4 loads of data simultaneously from non-adjacent memory regions (and request each data from up to 3 memory DIMMs)? I.e. 4x64 Bits, striped from up to 3 DIMMs, e.g:
| X | _ | X | _ | X | _ | X |
(X is loaded data, _ an arbitrarily large region of unloaded data)
Can this IMC execute 1 load which will load up to 1x256 Bits from a contiguous memory region.
| X | X | X | X | _ | _ | _ | _ |

This seems to be implementation specific, depending on compiler, OS and memory controller. The standard is available at: http://www.jedec.org/standards-documents/docs/jesd-79-3d . It seems that if your controller is fully compliant there are specific bits that can be set to indicate interleaved or non-interleaved mode. See page 24,25 and 143 of the DDR3 Spec, but even in the spec details are light.
For the i7/i5/i3 series specifically, and likely all newer Intel chips the memory is interleaved as in your first example. For these newer chips and presumably a compiler that supports it, yes one Asm/C/C++ level call to load something large enough to be interleaved/striped would initiate the required amount of independent hardware channel level loads to each channel of memory.
In the Triple channel section in of the Multichannel memory page on wikipedia there is a small list of CPUs that do this, likely it is incomplete: http://en.wikipedia.org/wiki/Multi-channel_memory_architecture

Related

What is the "spool time" of Intel Turbo Boost?

Just like a turbo engine has "turbo lag" due to the time it takes for the turbo to spool up, I'm curious what is the "turbo lag" in Intel processors.
For instance, the i9-8950HK in my MacBook Pro 15" 2018 (running macOS Catalina 10.15.7) usually sits around 1.3 GHz when idle, but when I run a CPU-intensive program, the CPU frequency shoots up to, say 4.3 GHz or so (initially). The question is: how long does it take to go from 1.3 to 4.3 GHz? 1 microsecond? 1 milisecond? 100 miliseconds?
I'm not even sure this is up to the hardware or the operating system.
This is in the context of benchhmarking some CPU-intensive code which takes a few 10s of miliseconds to run. The thing is, right before this piece of CPU-intensive code is run, the CPU is essentially idle (and thus the clock speed will drop down to say 1.3 GHz). I'm wondering what slice of my benchmark is running at 1.3 GHz and what is running at 4.3 GHz: 1%/99%? 10%/90%? 50%/50%? Or even worse?
Depending on the answer, I'm thinking it would make sense to run some CPU-intensive code prior to starting the benchmark as a way to "spool up" TurboBoost. And this leads to another question: for how long should I run this "spooling-up" code? Probably one second is enough, but what if I'm trying to minimize this -- what's a safe amount of time for "spooling-up" code to run, to make sure the CPU will run the main code at the maximum frequency from the very first instruction executed?

Evaluation of CPU frequency transition latency paper presents transition latencies of various Intel processors. In brief, the latency depends on the state in which the core currently is, and what is the target state. For an evaluated Ivy Bridge processor (i7-3770 # 3.4 GHz) the latencies varied from 23 (1.6 GH -> 1.7 GHz) to 52 (2.0 GHz -> 3.4 GHz) micro-seconds.
At Hot Chips 2020 conference a major transition latency improvement of the future Ice Lake processor has been presented, which should have major impact mostly at partially vectorised code which uses AVX-512 instructions. While these instructions do not support as high frequencies as SSE or AVX-2 instructions, using an island of these instructions cause down- and following up-scaling of the processor frequency.
Pre-heating a processor obviously makes sense, as well as "pre-heating" memory. One second of a prior workload is enough to reach the highest available turbo frequency, however you should take into account also temperature of the processor, which may down-scale the frequency (actually CPU core and uncore frequencies if speaking about one of the latest Intel processors). You are not able to reach the temperature limit in a second. But it depends, what you want to measure by your benchmark, and if you want to take into account the temperature limit. When speaking about temperature limit, be aware that your processor also has a power limit, which is another possible reason for down-scaling the frequency during the application run.
Another think that you should take into account when benchmarking your code is that its runtime is very short. Be aware of the runtime/resources consumption measurement reliability. I would suggest an artificially extending the runtime (run the code 10 times and measure the overall consumption) for better results.

I wrote some code to check this, with the aid of the Intel Power Gadget API. It sleeps for one second (so the CPU goes back to its slowest speed), measures the clock speed, runs some code for a given amount of time, then measures the clock speed again.
I only tried this on my 2018 15" MacBook Pro (i9-8950HK CPU) running macOS Catalina 10.15.7. The specific CPU-intensive code being run between clock speed measurements may also influence the result (is it integer only? FP? SSE? AVX? AVX-512?), so don't take these as exact numbers, but only order-of-magnitude/ballpark figures. I have no idea how the results translate into different hardware/OS/code combinations.
The minimum clock speed when idle in my configuration is 1.3 GHz. Here's the results I obtained in tabular form.
+--------+-------------+
| T (ms) | Final clock |
| | speed (GHz) |
+--------+-------------+
| <1 | 1.3 |
| 1..3 | 2.0 |
| 4..7 | 2.5 |
| 8..10 | 2.9 |
| 10..20 | 3.0 |
| 25 | 3.0-3.1 |
| 35 | 3.3-3.5 |
| 45 | 3.5-3.7 |
| 55 | 4.0-4.2 |
| 66 | 4.6-4.7 |
+--------+-------------+
So 1 ms appears to be the minimum amount of time to get any kind of change. 10 ms gets the CPU to its nominal frequency, and from then on it's a bit slower, apparently over 50 ms to reach maximum turbo frequencies.

What are the advantages of 16-bit adressing on IEEE 802.15.4 networks?

The only advantage I can think of using 16-bit instead of 64-bit addressing on a IEEE 802.15.4 network is that 6 bytes are saved in each frame. There might be a small win for memory constrained devices as well (microcontrollers), especially if they need to keep a list of many addresses.
But there are a couple of drawbacks:
A coordinator must be present to deal out short addresses
Big risk of conflicting addresses
A device might be assigned a new address without other nodes knowing
Are there any other advantages of short addressing that I'm missing?

You are correct in your reasoning, it saves 6 bytes which is a non-trivial amount given the packet size limit. This is also done with PanId vs ExtendedPanId addressing.
You are inaccurate about some of your other points though:
The coordinator does not assign short addresses. A device randomly picks one when it joins the network.
Yes, there is a 1/65000 or so chance for a collision. When this happens, BOTH devices pick a new short address and notify the network that there was an address conflict. (In practice I've seen this happen all of twice in 6 years)
This is why the binding mechanism exists. You create a binding using the 64-bit address. When transmission fails to a short address, the 64-bit address can be used to relocate the target node and correct the routing.

The short (16-bit) and simple (8-bit) addressing modes and the PAN ID Compression option allow a considerable saving of bytes in any 802.15.4 frame. You are correct that these savings are a small win for the memory-constrained devices that 802.15.4 is design to work on, however the main goal of these savings are for the effect on the radio usage.
The original design goals for 802.15.4 were along the lines of 10 metre links, 250kbit/s, low-cost, battery operated devices.
The maximum frame length in 802.15.4 is 128 bytes. The "full" addressing modes in 802.15.4 consist of a 16-but PAN ID and a 64-bit Extended Address for both the transmitter and receiver. This amounts to 20 bytes or about 15% of the available bytes in the frame. If these long addresses had to be used all of the time there would be a significant impact on the amount of application data that could be sent in any frame AND on the energy used to operate the radio transceivers in both Tx and Rx.
The 802.15.4 MAC layer defines an association process that can be used to negotiate and use shorter addressing mechanisms. The addressing that is typically used is a single 16-bit PAN ID and two 16-bit Short Ids, which amounts to 6 bytes or about 5% of the available bytes.
On your list of drawbacks:
Yes, a coordinator must hand out short addresses. How the addresses are created and allocated is not specified but the MAC layer does have mechanisms for notifying the layers above it that there are conflicts.
The risk of conflicts is not large as there are 65533 possible address to be handed out and 802.15.4 is only worried about "Layer 2" links (NB: 0xFFFF and 0xFFFE are special values). These addresses are not routable/routing/internetworking addresses (well, not from 802.15.4's perspective).
Yes, I guess a device might get a new address without the other nodes knowing but I have a hunch this question has more to do with ZigBee's addressing than with the 802.15.4 MAC addressing. Unfortunately I do not know much about ZigBee's addressing so I can't comment too much here.
I think it is important for me to point out that 802.15.4 is a layer 1 and layer 2 specification and the ZigBee is layer 3 up, i.e. ZigBee sits on top of 802.15.4.
This table is not 100% accurate, but I find it useful to think of 802.15.4 in this context:
+---------------+------------------+------------+
| Application | HTTP / FTP /Etc | COAP / Etc |
+---------------+------------------+------------+
| Transport | TCP / UDP | |
+---------------+------------------+ ZigBee |
| Network | IP | |
+---------------+------------------+------------+
| Link / MAC | WiFi / Ethernet | 802.15.4 |
| | Radio | Radio |
+---------------+------------------+------------+

micro-programmed control circuit and one questions

I ran into a question:
in digital system with micro-programmed control circuit, total of distinct operation pattern of 32 signal is 450. if the micro-programmed memory contains 1K micro instruction, by using Nano memory, how many bits is reduced from micro-programmed memory?
1) 22 Kbits
2) 23 Kbits
3) 450 Kbits
4) 450*32 Kbits
I read in my notes, that (1) is true, but i couldn't understand how we get this?
Edit: Micro instructions are stored in the micro memory (control memory). There is a chance that a group of micro instructions may occur several times in a micro program. As a result the more memory space isneeded.By making use of the nano memory we can have significant saving in the memory when a group of micro operations occur several times in a micro program. Please see for nano technique ref:

Control Units
Back in the day, before .NET, when you actually had to know what a computer was, before you could make it do stuff. This question would have gotten a ton of answers.
Except, back then, the internet wasn't really a thing, and Stack overflow was not really a problem, as the concept of a stack and a heap, wasn't really a standard..
So just to make sure that we are in fact talking about the same thing, I will just tr to explain this..
The control unit in a digital computer initiates sequences of microoperations. In a bus-oriented system, the control signals that specify microoperations are
groups of bits that select the paths in multiplexers, decoders, and ALUs.
So we are looking at the control unit, and the instruction set for making it capable of actually doing stuff.
We are dealing with what steps should happen, when the compiled assembly requests a bit shift, clear a register, or similar "low level" stuff.
Some of theese instructions may be hardwired, but usually not all of them.
Micro-programs
Quote: "Microprogramming is an orderly method of designing the control unit
of a conventional computer"
(http://www2.informatik.hu-berlin.de/rok/ca/data/slides/english/ca9.pdf)
The control variables, for the control unit can be represented by a string of 1’s and 0’s called a "control word". A microprogrammed control unit is a control unit whose binary control variables are not hardwired, but are stored in a memory. Before we optimized stuff we called this memory the micro memory ;)
Typically we would actually be looking at two "memories" a control memory, and a main memory.
the control memory is for the microprogram,
and the main memory is for instructions and data
The process of code generation for the control memory is called
microprogramming.
... ok?
Transfer of information among registers in the processor is through MUXs rather
than a bus, we typically have a few register, some of which are familiar to programmers, some are not. The ones that should ring a bell for most in here, is the processor registers. The most common 4 Processor registers are:
Program counter – PC
Address register – AR
Data register – DR
Accumulator register - AC
Examples where microcode uses processor registers to do stuff
Assembly instruction "ADD"
pseudo micro code: " AC ← AC + M[EA] " where M[EA] is data from main memory register
control word: 0000
Assembly instruction "BRANCH"
pseudo micro code "If (AC < 0) then (PC ← EA) "
control word: 0001
Micro-memory
The micro memory only concerns how we organize whats in the control memory.
However when we have big instruction sets, we can do better than simply storing all the instructions. We can subdivide the control memory into "control memory" and "nano memory" (since nano is smaller than micro right ;) )
This is good as we don't waste a lot of valuable space (chip area) on microcode.
The concept of nano memory is derived from a combination of vertical and horizontal instructions, but also provides trade-offs between them.
The motorola M68k microcomputer is one the earlier and popular µComputers with this nano memory control design. Here it was shown that a significant saving of memory could be achieved when a group of micro instructions occur often in a microprogram.
Here it was shown that by structuring the memory properly, that a few bits could be used to address the instructions, without a significant cost to speed.
The reduction was so that only the upper log_2(n) bits are required to specify the nano-address, when compared to the micro-address.
what does this mean?
Well let's stay with the M68K example a bit longer:
It had 640 instructions, out of which only 280 where unique.
had the instructions been coded as simple micro memory, it would have taken up:
640x70 bits. or 44800 bits
however, as only the 280 unique instructions where required to fill all 70 bits, we could apply the nano memory technique to the remaining instructions, and get:
8 < log_2(640-280) < 9 = 9
640*9 bit micro control store, and 280x70 bit nano memory store
total of 25360 bits
or a memory savings of 19440 bits.. which could be laid out as main memory for programmers :)
this shows that the equation:
S = Hm x Wm + Hn x Wn
where:
Hm = Number of words High Level
Wm = Length of words in High Level
Hn = Number of Low Level words
Wn = Length of low level words
S = Control Memory Size (with Nano memory technique)
holds in real life.
note that, micro memory is usually designed vertically (Hm is large, Wm is small) and nano programs are usually opposite Hn small, Wn Large.
Back to the question
I had a few problems understanding the wording of the problem, - that may because my first language is Danish, but still I tried to make some sense of it and got to:
proposition 1:
1000 instructions
32 bits
450 uniques
µCode:
1000 * 32 = 32.000 bits
bit width required for nano memory:
log2(1000-450) > 9 => 10
450 * 32 = 14400
(1000-450) * 10 = 5500
32000 - (14400 + 5500) = 12.100 bits saved
Which is not any of your answers.
please provide clarification?
UPDATE:
"the control word is 32 bit. we can code the 450 pattern with 9 bit and we use these 9 bits instead of 32 bit control word. reduce memory from 1000*(32+x) to 1000*(9+x) is equal to 23kbits. – Ali Movagher"
There is your problem, we cannot code the 450 pattern with 9 bits, as far as I can see we need 10..

Is there a performance impact by turning on LARGEADDRESSAWARE?

I am keen to know how the /LARGEADDRESSAWARE switch works and cannot find much about the implementation details.
Can anybody describe what is happening when the switch is used and its consequences (aside from allowing a process to access more memory)?

I have run a simple benchmark using the SLAM++ library on Venice, Sphere and 100k datasets:
Dataset | Time x86 | Time x86 /LARGEADDRESSAWARE | Time x64
Venice | bad_alloc | 4276.524971 | 300.315826 sec
Sphere | 2.946498 | 3.605073 | 1.727466 sec
100k | 46.402400 | 50.196711 | 32.774384 sec
All times are in seconds. There you have it - the performance toll can be substantial. This is mostly doing BLAS operations, sometimes accelerated using SSE, and the whole thing is quite memory bound. Note that the peak memory usage on Venice in x86 was slightly over 3.5 GB (I believe it can be up to 4 GB in an x64 system), in x64 it was a bit under 4.3 GB. The other datasets use much less memory, well below 2 GB.
In case of x86 /LARGEADDRESSAWARE on Venice, t seemed that the OS wants to keep most of the >2 GB in the paging file, although the memory usage jumped to >3 GB when the data was accessed - so the extra cost may stem from aggressive paging. Also, there is some advantage to arithmetic operations in x64 over x86 (the program can use extra registers, etc.), which is probably why ordinary x86 is slower than x64 on the small datasets.
This was measured on a machine with 2x AMD Opteron 2356 SE and 16 GB of 667 MHz DDR2, running Windows Server 2003 x64.
On Windows 7, Intel Core i7-2620M, 8 GB 1333 MHz DDR3 machine:
Dataset | Time x86 | Time x86 /LARGEADDRESSAWARE | Time x64
Venice | bad_alloc | 203.139716 | 115.641463 sec
Sphere | 1.714962 | 1.814261 | 0.870865 sec
100k | 18.040907 | 18.091992 | 13.660002 sec
This has quite a similar behavior, x64 is faster than x86, and /LARGEADDRESSAWARE is slower (although not so much slower as in the previous case - it likely depends on a CPU or on an OS).

Historically, 32-bit Windows systems would have a virtual memory layout where only the low 2 GB of process' address space would be used by the application; the upper 2 GB would be reserved for the kernel. This was the documented behavior. Changing the documented behavior is not cool, unless it's explicitly opt-in. That's what /LARGEADDRESSAWARE is for. It triggers a flag in the executable header that tells the system that the program won't mind using addresses above the 2GB boundary. With that flag, the system can allocate addresses from the low 3 GB and the upper 1 GB is for kernel.
How would you have to code the app so that this was a breaking change is a whole another question. Maybe some people would cast addresses to signed ints and compare them; that'd break if the addresses had bit 31 set.
EDIT: there's no performance impact from the switch per se. However, if the application routinely encounters memory loads over 2 GB, you can get some performance from caching more aggressively. Without the 3GB switch, an app can't consume over 2GB of virtual memory; with the switch, up to three.

/LARGEADDRESSAWARE does not have a performance impact, because it does not affect code generation.
Programs that do not set this flag only get virtual memory addresses < 2^31. Programs with this flag set may get virtual addresses > 2^31.
This is significant, because there may be subtle bugs in programs where they rely on signed integer math.
E.g. pointer casting to int:
void* p0 = ...; // from somewhere
void* p1 = ...; // from somewhere else
assert( p1 > p0 );
int diff = (int)p1 - (int)p0;
This will break in the presence of addresses > 2 GB.
So to be conservative, the OS does treat programs that do not have this flag set, as it 'may do something bad when encountering addresses > 2 GB'.
On the other hand on x86 systems, setting the /3GB flag reduces the amount of virtual memory the kernel has available, which might impact its performance.

Little Endian Cabling?

I have a doubt about endian-ness concept.please don't refer me to wikipedia, i've already read it.
Endian-ness, Isn't it just the 2 ways that the hardware cabling(between memory, and registers, through data bus) has been implemented in a system?
In my understanding, below picture is a little endian implementation(follow horizontal line from a memory address (e.g 4000) and then vertical line to reach to the low/high part of the register please)
As you see little memory addresses have been physically connected to low-part of 4-byte register.
I think that it does not related at all to READ and WRITE instructions in any language(e.g. LDR in ARM).
1-byte memory address:
- 4000 value:XX ------------------|
- 4001 value:XX ---------------| |
- 4002 value:XX ------------| | |
- 4003 value:XX ---------| | | |
| | | |
general-purpose register:XX XX XX XX

Yes and no. (I can't see your diagram, but I think I understand what you're asking). The way data lines are physically connected in the hardware can determine/control whether the representation in memory is treated as big or little endian. However, there is more to it than this; little endian is a means of representation, so for instance data stored on magnetic storage (in a file) might be coded using little endian representation or big endian representation and obviously at this level the hardware is not important.
Furthermore, some 8 bit microcontrollers can perform 16 bit operations, which are performed at the hardware level using two separate memory accesses. They can therefore use either little or big endian representation independent of bus design and ALU connection.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio