How do "per-cpu" variables work? [duplicate] - memory-management

On a multiprocessor, each core can have its own variables. I assumed these are distinct variables at different addresses, even though they live in the same process and share the same name.
But I am wondering: how does the kernel implement this? Does it set aside a piece of memory to hold all the per-cpu copies, and each time redirect the pointer to the right address with an offset or something similar?

Normal global variables are not per CPU. Automatic variables are on the stack, and different CPUs use different stacks, so naturally they get separate variables.
I guess you're referring to Linux's per-CPU variable infrastructure.
Most of the magic is here (asm-generic/percpu.h):
extern unsigned long __per_cpu_offset[NR_CPUS];
#define per_cpu_offset(x) (__per_cpu_offset[x])
/* Separate out the type, so (int[3], foo) works. */
#define DEFINE_PER_CPU(type, name) \
__attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
/* var is in discarded region: offset to particular copy we want */
#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
#define __get_cpu_var(var) per_cpu(var, smp_processor_id())
The macro RELOC_HIDE(ptr, offset) simply advances ptr by the given offset in bytes (regardless of the pointer type).
What does it do?
When defining DEFINE_PER_CPU(int, x), an integer per_cpu__x is created in the special .data.percpu section.
When the kernel is loaded, this section is loaded multiple times - once per CPU (this part of the magic isn't in the code above).
The __per_cpu_offset array is filled with the distances between the copies. Supposing 1000 bytes of per cpu data are used, __per_cpu_offset[n] would contain 1000*n.
The symbol per_cpu__x will be relocated, during load, to CPU 0's per_cpu__x.
__get_cpu_var(x), when running on CPU 3, will translate to *RELOC_HIDE(&per_cpu__x, __per_cpu_offset[3]). This starts with CPU 0's x, adds the offset between CPU 0's data and CPU 3's, and eventually dereferences the resulting pointer.

Related

How to Detect L0,L1,L2 Cache Possible overflow just by looking at the kernel Code?

I have an RX 570. This is the information I received from clGetDeviceInfo:
MaxComputeUnitPerGPU: 32
MaxWorkGroupSize: 256
MaxWorkItemSize: 256
MaxGlobalMemoryOfDevice: 4294967296
MaxPrivateMemoryBytesPerWorkGroup: 16384
MaxLocalMemoryBytesPerWorkGroup: 32768
If I have 256 work groups and 256 work items per work group, it would mean:
64 bytes of private (L1?) memory per work item (16384/256)
32768 bytes of local (L2?) memory per work group
And if I use 17 floats, would it overflow to L2?
or
If I use 15 floats and 2 private floats, would it overflow to L2?
Also, is float the same as private float? (Answer: same by default, per doqtor.)
or
If I use 16 floats and use functions like pow, sqrt and clamp, would a register (L1?) overflow occur?
Variables without an address space qualifier are private by default. From the OpenCL docs:
Variables inside a __kernel function not declared with an address
space qualifier, all variables inside non-kernel functions, and all
function arguments are in the __private or private address space.
Variables declared as pointers are considered to point to the
__private address space if an address space qualifier is not specified.
Private variables are stored in registers on GPU. If the kernel uses more registers than available, some variables are stored instead in global memory (register spilling).
To add to doqtor's answer, you can detect register spilling by doing roofline analysis if you are at the bandwidth limit. You can count the number of FLOPs and memory transfers from the program binaries (string binaries = program.getInfo<CL_PROGRAM_BINARIES>()[0];). If you are very close to the bandwidth limit, then there is no spilling. If you increase the number of private variables from that point, for example with a matrix multiplication in private memory, and performance drops significantly, then you have a register spill: private variables are suddenly read from global memory, and since you were already at the bandwidth limit, the additional global memory traffic leads to a slowdown.

x86 store when data is in 2 different blocks

Suppose 32-bit Linux: the alignment rules say, for example, that doubles (8 bytes) need only be aligned to 4 bytes. This means that, if we assume 64-byte cache blocks (a typical value for modern processors), we can have a double starting at offset 60, which means the double spans 2 different cache blocks.
It could even happen that the two halves of the double sit in 2 different cache blocks located in 2 different 4 KB pages.
After this brief introduction to put the question in context, I have a couple of doubts:
1- For assembly programming where we seek maximum performance, is it recommended to prevent these things from happening by using alignment directives? Or, for some reason I'm unaware of, does aligning the double so it sits in only 1 block not imply any performance change?
2- How will the store instruction be decoded in the mentioned case? (Suppose a modern Intel microarchitecture.) I mean, I know that a normal x86 store instruction is decoded into a micro-fused pair of store-address and store-data uops, but in this case, where 2 different cache blocks (and maybe even 2 different 4 KB pages) are involved, will it be decoded into 2 micro-fused pairs (one for the first 4 bytes of the double and another for the last 4 bytes)? Or will it be decoded into a single micro-fused pair, with the store-address and store-data uops each doing twice the work before finally being able to leave the execution port?
Yes, of course you should align a double whenever possible, like compilers do except when forced by ABI struct-layout rules to misalign them. (The ABI was designed when i386 was current so a double always required 2 loads anyway.)
The current version of the i386 System V ABI requires 16-byte stack alignment, so local doubles (that have to get spilled at all instead of kept in regs) can be aligned, and malloc has to return memory suitable for any type, and alignof(max_align_t) = 16 on 32-bit Linux (8 on 32-bit Windows) so 32-bit malloc will always give you at least 16 (or 8)-byte aligned memory. And of course in static storage you control the alignment with align (NASM) or .p2align (GAS) directives.
For the perf downsides of cacheline splits and page splits, see How can I accurately benchmark unaligned access speed on x86_64
re: decoding: The address isn't known at decode time, so obviously any effects of a line split or page split are resolved later. For stores, probably no effect until the store-buffer entry has to commit to L1d cache. See Are two store buffer entries needed for split line/page stores on recent Intel? - probably not; allocating a 2nd entry after executing the store-address uop is implausible.
For loads, re-running the load through the execution unit to get the other half (or whatever uneven split), using internal line-split buffers to combine data. (Not re-dispatching from the RS, just internally handled in the load port. But the RS does aggressively replay uops waiting for the result of a load.)
Re-running the store-data uop for a misaligned store seems unlikely, too. I don't think we see extra counts for uops_dispatched_port.port_4 perf events.

A heap manager for C/Pascal that automatically fills freed memory with zero bytes

What do you think about an option to fill freed (not actually used) pages with zero bytes? This may improve performance under Windows, and also under VMware and other virtual machine environments. For example, VMware and Hyper-V calculate a hash of memory pages and, if the contents are the same, mark such a page as "shared" inside a virtual machine and between virtual machines on the same host, until the page is modified. This effectively decreases memory consumption. Windows does something similar - it handles zero pages differently, treating them as free.
We could have a heap manager that automatically fills memory with zeros when we call FreeMem/ReallocMem. As an alternative, we could have a function that zeroizes empty memory on demand, i.e. only when it is explicitly called. Of course, this function would have to be thread-safe.
The drawback of filling memory with zeros is touching the memory, which might already have been paged out, thus issuing page faults. Besides that, memory store operations take time, so our program will be slower, albeit to an unknown extent (maybe negligibly).
If we manage to fill 4 KB pages completely with zeros, the hypervisor or Windows will recognize them as zero pages. But even partial zeroizing may be beneficial, since the hypervisor may compress pages using LZ or similar algorithms to save physical memory.
I just want to know your opinion whether the benefits of filling emptied heap memory with zero bytes by the heap manager itself will outweigh the disadvantages of such a technique.
Is zeroizing worth its price when we buy reduced physical memory consumption?
When you have a page whose contents you no longer care about but you still want to keep it allocated, you can call VirtualAlloc (and variants) and pass the MEM_RESET flag.
From VirtualAlloc on MSDN:
MEM_RESET
Indicates that data in the memory range specified by lpAddress and
dwSize is no longer of interest. The pages should not be read from or
written to the paging file. However, the memory block will be used
again later, so it should not be decommitted. This value cannot be
used with any other value.
Using this value does not guarantee that
the range operated on with MEM_RESET will contain zeros. If you want
the range to contain zeros, decommit the memory and then recommit it.
This gives the best of both worlds - you don't have the cost of zeroing the memory, and the system does not have the cost of paging it back in. You get to take advantage of the well-tuned memory manager which already has a zero-pool.
Similar functionality also exists on Linux under the MADV_FREE (or MADV_DONTNEED for POSIX) flag to madvise. glibc uses this in the implementation of its heap:
/*
* Stack:
* int shrink_heap (heap_info *h, long diff)
* int heap_trim (heap_info *heap, size_t pad) at arena.c:660
* void _int_free (mstate av, mchunkptr p, int have_lock) at malloc.c:4097
* void __libc_free (void *mem) at malloc.c:2948
* void free(void *mem)
*/
static int
shrink_heap (heap_info *h, long diff)
{
  long new_size;

  new_size = (long) h->size - diff;
  /* ... snip ... */
  __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
  /* ... snip ... */
  h->size = new_size;
  return 0;
}
If your heap is in user space this will never work. The kernel can only trust itself, not user space. If the kernel zeros a page, it can treat it as zero. If user space says it zeroed a page, the kernel would still have to check that. It might just as well zero it. One thing user space can do is to discard pages. Which marks them as "don't care". Then a kernel can treat them as zero. But manually zeroing pages in user space is futile.

How does a CPU know if an address in RAM contains an integer, a pre-defined CPU instruction, or any other kind of data?

The reason this gets me confused is that all addresses hold a sequence of 1's and 0's. So how does the CPU differentiate, let's say, 00000100 (integer) from 00000100 (CPU instruction)?
First of all, different commands have different values (opcodes). That's how the CPU knows what to do.
Finally, the question remains: what's a command, and what's data?
Modern PCs use the von Neumann architecture (https://en.wikipedia.org/wiki/John_von_Neumann), where data and opcodes are stored in the same memory space. (There are architectures separating these two, such as the Harvard architecture.)
Explaining everything in detail would be beyond the scope of Stack Overflow; most likely the character limit per post would not be sufficient.
To answer the question in as few words as possible (everyone actually working at this level would kill me for the shortcuts in this explanation):
Data in the memory is stored at certain addresses.
Each CPU instruction basically consists of 3 different addresses (NOT values - just addresses!):
An address saying what to do
An address of a value
An address of an additional value
So, assuming an addition should be performed and you have 3 addresses available in memory, the application would store (in the case of 5+7; I used "verbs" for the instructions):
Adress | Stored Value
1 | ADD
2 | 5
3 | 7
Finally the CPU receives the instruction 1 2 3, which then means ADD 5 7 (these things are order-sensitive! [Command] [v1] [v2])... And now things are getting complicated.
The CPU will move these values (actually not the values, just the addresses of the values) into its registers and then process them. The exact registers to choose depend on data type, data size and opcode.
In the case of the command #1 #2 #3, the CPU will first read these memory addresses, and then know that ADD 5 7 is desired.
Based on the opcode for ADD the CPU will know:
Put address #2 into r1
Put address #3 into r2
Read the memory value stored at the address in r1
Read the memory value stored at the address in r2
Add both values
Write the result somewhere in memory
Store the address of where I put the result into r3
Store the address stored in r3 into the memory address stored in r1.
Note that this is simplified. Actually, the CPU needs exact instructions on whether it is handling a value or an address. In assembly this is done by using
eax (means the value stored in register eax)
[eax] (means the value stored in memory at the address stored in register eax)
The CPU cannot perform calculations on values stored in memory, so it is quite busy moving values from memory to registers and from registers to memory.
i.e. If you have
eax = 0x2
and in memory
0x2 = 110011
and the instruction
MOV ebx, [eax]
this means: move the value currently stored at the address that is currently stored in eax into the register ebx. So finally
ebx = 110011
(This is happening EVERY TIME the CPU does a single calculation! Memory -> register -> memory.)
Finally, the demanding application can read its predefined memory address #2,
resulting in address #2568, and then knows that the outcome of the calculation is stored at address #2568. Reading that address will result in the value 12 (5+7).
This is just a tiny example of what's going on. For a more detailed introduction, refer to http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
One cannot really grasp the amount of data movement and calculation done for a simple addition of 2 values. Doing what a CPU does (on paper) would take you several minutes just to calculate "5+7", since there is no "5" and no "7" - everything is hidden behind an address in memory, pointing to some bits, resulting in different values depending on what the bits at address 0x1 are instructing...
Short form: The CPU does not know what's stored there, but the instructions tell the CPU how to interpret it.
Let's have a simplified example.
If the CPU is told to add a word (let's say, a 32-bit integer) stored at location X, it fetches the content of that address and adds it.
If the program counter reaches the same location, the CPU will again fetch this word and execute it as a command.
The CPU (other than security stuff like the NX bit) is blind to whether it's data or code.
The only way data doesn't accidentally get executed as code is by carefully organizing the code to never refer to a location holding data with an instruction meant to operate on code.
When a program is started, the processor starts executing it at a predefined spot. The author of a program written in machine language will have intentionally put the beginning of their program there. From there, that instruction will always end up setting the next location the processor will execute to somewhere this is an instruction. This continues to be the case for all of the instructions that make up the program, unless there is a serious bug in the code.
There are two main ways instructions can set where the processor goes next: jumps/branches, and not explicitly specifying. If the instruction doesn't explicitly specify where to go next, the CPU defaults to the location directly after the current instruction. Contrast that with jumps and branches, which encode the address of the next instruction. Jumps always jump to the place specified. Branches check if a condition is true. If it is, the CPU will jump to the encoded location. If the condition is false, it will simply go to the instruction directly after the branch.
Additionally, a machine-language program should never write data to a location that holds instructions, or some instruction at a future point in the program could try to run what was overwritten with data. Having that happen could cause all sorts of bad things. The data there could have an "opcode" that doesn't match anything the processor knows how to do. Or, the data there could tell the computer to do something completely unintended. Either way, you're in for a bad day. Be glad that your compiler never messes up and accidentally inserts something that does this.
Unfortunately, sometimes the programmer using the compiler messes up, and does something that tells the CPU to write data outside of the area they allocated for data. (A common way this happens in C/C++ is to allocate an array L items long, and use an index >=L when writing data.) Having data written to an area set aside for code is what buffer overflow vulnerabilities are made of. Some program may have a bug that lets a remote machine trick the program into writing data (which the remote machine sent) beyond the end of an area set aside for data, and into an area set aside for code. Then, at some later point, the processor executes that "data" (which, remember, was sent from a remote computer). If the remote computer/attacker was smart, they carefully crafted the "data" that went past the boundary to be valid instructions that do something malicious. (To give them more access, destroy data, send back sensitive data from memory, etc).
This is because an ISA must define what a valid set of instructions is and how operands are encoded: memory addresses, registers, or literals.
See this for more general info on how an ISA is designed:
https://en.wikipedia.org/wiki/Instruction_set
In short, the operating system tells it where the next instruction is. In the case of x64 there is a special register called rip (instruction pointer) which holds the address of the next instruction to be executed. The CPU will automatically read the data at this address, decode and execute it, and increment rip by the number of bytes in the instruction.
Generally, the OS can mark regions of memory (pages) as holding executable code or not. If an error or exploit tries to modify executable memory an error should occur, similarly if the CPU finds itself trying to execute non-executable memory it will/should also signal an error and terminate the program. Now you're into the wonderful world of software viruses!

Memory management in Forth

So I'm just learning Forth and was curious if anyone could help me understand how memory management generally works. At the moment I only have (some) experience with the C stack-vs-heap paradigm.
From what I understand, one can allocate in the Dictionary, or on the heap. Is the Dictionary faster/preferred like the stack in C? But unlike in C, there aren't scopes and automatic stack reclamation, so I'm wondering if one only uses the dictionary for global data structures (if at all).
As far as the heap goes, is it pretty much like C? Is heap management a standard (ANS) concept, or is it implementation-defined?
It is not "dictionary or heap" - the equivalent of the heap is the dictionary. However, it has the severe limitation that it acts more like a stack than a heap: new words are added to the end of the dictionary (allocation by ALLOT, freeing by FORGET or FREE - but freeing all newer words as well, acting more like multiple POPs).
An implementation can control the memory layout and thus implement a traditional heap (or garbage collection). An example is A FORTH implementation of the Heap Data Structure for Memory Management (1984). Another implementation is Dynamic Memory Heaps for Quartus Forth (2000).
A lot is implementation-dependent or in extensions. For instance, the memory layout often has the two block buffers (located by BLOCK and TIB), the text input buffer, and the values and low-level/primitive functions of the language in the lowest portion; the dictionary in the middle (growing upwards); and the return stack and the parameter stack at the top 1.
The address of the first available byte above the dictionary is returned by HERE (it changes as the dictionary expands).
There is also a scratchpad area above the dictionary (address returned by PAD) for temporarily storing data. The scratchpad area can be regarded as free memory.
The preferred mode of operation is to use the stack as much as possible instead of local variables or a heap.
1 p. 286 (about a particular edition of Forth, MMSFORTH) in chapter "FORTH's Memory, Dictionary, and Vocabularies", Forth: A text and a reference. Mahlon G. Kelly and Nicholas Spies. ISBN 0-13-326349-5 / 0-13-326331-2 (pbk.). 1986 by Prentice-Hall.
The fundamental question may not have been answered in a way that a new Forth user would require so I will take a run at it.
Memory in Forth can be very target dependent so I will limit the description to the simplest model, that being a flat memory space, where code and data live together happily. (as opposed to segmented memory models, or FLASH memory for code and RAM for data or other more complicated models)
The Dictionary typically starts at the bottom of memory and is allocated upwards by the Forth system. The two stacks, in a simple system would exist in high memory and typically have two CPU registers pointing to them. (Very system dependent)
At the most fundamental level, memory is allocated simply by changing the value of the dictionary pointer variable. (sometimes called DP)
The programmer does not typically access this variable directly but rather uses some higher level words to control it.
As mentioned the Forth word HERE returns the next available address in the dictionary space. What was not mentioned was that HERE is defined by fetching the value of the variable DP. (system dependency here but useful for a description)
In Forth HERE might look like this:
: HERE ( -- addr) DP @ ;
That's it.
To allocate some memory we need to move HERE upwards and we do that with the word ALLOT.
The Forth definition for ALLOT simply takes a number from the parameter stack and adds it to the value in DP. So it is nothing more than:
: ALLOT ( n --) DP +! ; \ '+!' adds n to the contents of the variable DP
ALLOT is used by the FORTH system when we create a new definition so that what we created is safely inside 'ALLOTed' memory.
Something that is not immediately obvious is that ALLOT can take a negative number, so it is possible to move the dictionary pointer up or down. So you could allocate some memory like this:
HEX 100 ALLOT
And free it up like this:
HEX -100 ALLOT
All this to say that this is the simplest form of memory management in a Forth system. An example of how this is used can be seen in the definition of the word BUFFER:
: BUFFER: ( n --) CREATE ALLOT ;
BUFFER: "creates" a new name in the dictionary (CREATE uses ALLOT to make space for the name, by the way), then ALLOTs n bytes of memory right after the name and any associated housekeeping bytes your Forth system might use.
So now to allocate a block of named memory we just type:
MARKER FOO \ mark where the memory ends right now
HEX 2000 BUFFER: IN_BUFFER
Now we have an 8K-byte buffer called IN_BUFFER. If we wanted to reclaim that space in Standard Forth, we could type FOO and everything allocated in the dictionary after FOO would be removed from the Forth system.
But if you want temporary memory space, EVERYTHING above HERE is free to use!
So you can simply point to an address and use it if you want to like this
: MYMEMORY here 200 + ; \ MYMEMORY points to un-allocated memory above HERE
\ MYMEMORY moves with HERE. be aware.
MYMEMORY HEX 1000 ERASE \ fill it with 4K bytes of zero
Forth has typically been used for high performance embedded applications where dynamic memory allocation can cause un-reliable code so static allocation using ALLOT was preferred. However bigger systems have a heap and use ALLOCATE, FREE and RESIZE much like we use malloc etc. in C.
BF
Peter Mortensen laid it out very well. I'll add a few notes that might help a C programmer some.
The stack is closest to what C terms "auto" variables, and what are commonly called local variables. You can give your stack values names in some forths, but most programmers try to write their code so that naming the values is unnecessary.
The dictionary can best be viewed as "static data" from a C programming perspective. You can reserve ranges of addresses in the dictionary, but in general you will use ALLOT and related words to create static data structures and pools which do not change size after allocation. If you want to implement a linked list that can grow in real time, you might ALLOT enough space for the link cells you will need, and write words to maintain a free list of cells you can draw from. There are naturally implementations of this sort of thing available, and writing your own is a good way to hone pointer management skills.
Heap allocation is available in many modern Forths, and the standard defines ALLOCATE, FREE and RESIZE words that work in a way analogous to malloc(), free(), and realloc() in C. Where the bytes are allocated from will vary from system to system. Check your documentation. It's generally a good idea to store the address in a variable or some other more permanent structure than the stack so that you don't inadvertently lose the pointer before you can free it.
As a side note, these words (along with the file i/o words) return a status on the stack that is non-zero if an error occurred. This convention fits nicely with the exception handling mechanism, and allows you to write code like:
variable PTR
1024 allocate throw PTR !
\ do some stuff with PTR
PTR @ free throw
0 PTR !
Or for a more complex if somewhat artificial example of allocate/free:
\ A simple 2-cell linked list implementation using allocate and free
: >link ( a -- a ) ;
: >data ( a -- a ) cell + ;
: newcons ( a -- a ) \ make a cons cell that links to the input
2 cells allocate throw tuck >link ! ;
: linkcons ( a -- a ) \ make a cons cell that gets linked by the input
0 newcons dup rot >link ! ;
: makelist ( n -- a ) \ returns the head of a list of the numbers from 0..n
0 newcons dup >r
over 0 ?do
i over >data ! linkcons ( a -- a )
loop >data ! r> ;
: walklist ( a -- )
begin dup >data ? >link @ dup 0= until drop ;
: freelist ( a -- )
begin dup >link @ swap free throw dup 0= until drop ;
: unittest 10 makelist dup walklist freelist ;
Some Forth implementations support local variables on the return stack frame and allocating memory blocks. For example in SP-Forth:
lib/ext/locals.f
lib/ext/uppercase.f
100 CONSTANT /buf
: test ( c-addr u -- ) { \ len [ /buf 1 CHARS + ] buf }
buf SWAP /buf UMIN DUP TO len CMOVE
buf len UPPERCASE
0 buf len + C! \ just for illustration
buf len TYPE
;
S" abc" test \ --> "ABC"
With Forth you enter a different world.
In a typical Forth like ciforth on linux (and assuming 64 bits) you can configure your Forth to have a linear memory space that is as large as your swap space (e.g. 128 Gbyte). That is yours to fill in with arrays, linked lists, pictures whatever. You do this interactively, typically by declaring variable and including files. There are no restrictions. Forth only provides you with a HERE pointer to help you keep track of memory you have used up. Even that you can ignore, and there is even a word in the 1994 standard that provides scratch space that floats in the free memory (PAD).
Is there something like malloc() free() ? Not necessarily. In a small kernel of a couple of dozen kilobytes,no. But you can just include a file with an ALLOCATE / FREE and set aside a couple of Gbyte to use for dynamic memory.
As an example I'm currently working with tiff files. A typical 140 Mbyte picture takes a small chunk out of the dictionary advancing HERE.
Rows of pixels are transformed, decompressed etc. For that I use dynamic memory, so I ALLOCATE space for the decompression result of a row. I've to manually FREE them again when the results have been used up for another transformation. It feels totally different from c. There is more control and more danger.
Your question about scopes etc. In Forth if you know the address, you can access the data structure. Even if you jotted F7FFA1003 on a piece of paper. Trying to make programs safer by separate name spaces is not prominent in Forth style. So called wordlist (see also VOCABULARY) provide facilities in that direction.
There's a little elephant hiding in a big FORTH memory management room, and I haven't seen too many people mention it.
The canonical FORTH has, at the very least, a non-addressable parameter stack. This is the case in all FORTH hardware implementations I'm aware of (usually originating with Chuck Moore) that have a hardware parameter stack: it's not mapped into the addressable memory space.
What does "non-addressable" mean? It means: you can't have pointers to the parameter stack, i.e. there are no means to get addresses of things on that stack. The stack is a "black box" that you can only access via the stack API (opcodes if it's a hardware stack), without bypassing it - and only that API will modify its contents.
This implies no aliasing between parameter stack and memory accesses using pointers - via @ and ! and the like. This enables efficient code generation with small effort, and indeed it makes decent generated code in FORTH systems orders of magnitude easier to obtain than with C and C++.
This of course breaks down when pointers can be obtained to the parameter stack. A well designed system would probably have guarded API for such access, since within the guards the code generator has to spill everything from registers to stack - in absence of full data flow analysis, that is.
DFA and other "expensive" optimization techniques are not of course impossible in FORTH, it's just that they are a bit larger in scope than many a practical FORTH system. They can be done very cleanly in spite of that (I'm using CFA, DFA and SSA optimizations in an in-house FORTH implementation, and the whole thing has less source code, comments included, than the utility classes in LLVM... - classes that are used all over the place, but that don't actually do anything related to compiling or code analysis).
A practical FORTH system can also place aliasing limitations on the return stack contents, namely that the return addresses themselves don't alias. That way control flow can be analyzed optimistically, only taking into account explicit stack accesses via R@, >R and R>, while letting you place addressable local variables on that stack - that's typically done when a variable is larger than a cell or two, or would be awkward to keep around on the parameter stack.
In C and C++, aliasing between automatic "local" variables and pointers is a big problem, because only large compilers with big optimizers can afford to prove lack of aliasing and forgo register reloads/spills when intervening pointer dereferences take place. Small compilers, to remain compliant and not generate broken code, have to pessimize and assume that accesses via char* alias everything, and accesses via Type* alias that type and others "like it" (e.g. derived types in C++). That char* aliases all things in C is a prime example of where you pay a big price for a feature you didn't usually intend to use.
Usually, forcing an unsigned char type for characters, and re-writing the string API using this type, lets you not use char* all over the place and lets the compiler generate much better code. Compilers of course add lots of analysis passes to minimize the fallout from this design fiasco... And all it'd take to fix in C is having a byte type that aliases every other type, and is compatible with arbitrary pointers, and has the size of the smallest addressable unit of memory. The reuse of void in void* to mean "pointer to anything" was, in hindsight, a mistake, since returning void means returning nothing, whereas pointing to void absolutely does not mean "pointing to nothing".
My idea is published at https://sites.google.com/a/wisc.edu/memorymanagement
I'm hoping to put Forth code on GitHub soon.
If you have an array (or several) with each array having a certain number of items of a certain size, you can pair a single-purpose stack to each array. The stack is initialized with the address of each array item. To allocate an array item, pop an address off the stack. To deallocate an array item, push its address onto the stack.
