SystemStackError when pushing more than 130798 objects into an array - ruby

I am trying to understand why pushing many (in my case 130798) objects in an array returns a SystemStackError.
big = Array.new(130797, 1)
[].push(*big) && false
=> false
bigger = Array.new(130798, 1)
[].push(*bigger) && false
=> SystemStackError: stack level too deep
from (irb):104
from /Users/julien/.rbenv/versions/2.2.0/bin/irb:11:in `<main>'
I was able to reproduce it on MRI 1.9.3 and 2.2.0 while no errors were raised on Rubinius (2.5.2).
I understand this is due to the way Arrays are implemented in MRI, but I don't quite understand why a SystemStackError is raised.

Ruby's error message ("stack level too deep") isn't accurate here - what Ruby is really saying is "I ran out of stack memory", which is usually caused by infinite recursion, but in this case, is caused by you passing more arguments than Ruby has memory allocated to handle.
Ruby 2.0+ has a maximum stack size controlled by RUBY_THREAD_VM_STACK_SIZE (prior to 2.0 this was controlled by the C limits, set via ulimit). Each argument passed to a method gets pushed onto the thread's stack; if you push more arguments onto the stack than RUBY_THREAD_VM_STACK_SIZE has room to accommodate, you'll get a SystemStackError. You can see this limit from IRB:
RubyVM::DEFAULT_PARAMS[:thread_vm_stack_size]
=> 1048576
By default, each thread has 1MB of stack it can use. Ruby Fixnums are 8 bytes large, and on my system, I overflow at 130808 arguments, or 1046464 bytes allocated, leaving 2112 bytes allocated for the rest of the call stack. By using the splat operator (*) you are saying "take this list of 130798 Fixnums and expand it into 130798 arguments to be passed on the stack"; you simply don't have enough stack memory allocated to hold them all.
If you need to, you can increase RUBY_THREAD_VM_STACK_SIZE when you invoke Ruby:
$ RUBY_THREAD_VM_STACK_SIZE=2097152 irb
> [].push(*Array.new(150808, 1)); nil
=> nil
And this will increase the number of arguments you can pass. However, it also means that each thread will allocate twice as much stack, which is probably not desirable. You should also note that Fibers have a separate stack allocation setting, which is typically substantially smaller, since Fibers are designed to be lightweight and disposable.
Very rarely should you ever need to pass that much data on the stack; typically, if you need to pass a large amount of data to a method, you would pass an object as an argument (that is, on the stack, such as a Hash or Array) whose storage is allocated on the heap, so your stack usage is measured in bytes even if your heap usage is measured in megabytes. That is, you would pass your very large array to your method (which could hold gigabytes of data on the heap without issue), then you would iterate that array in your method.
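The figures quoted in this answer can be checked with a little arithmetic. The sketch below is in Python rather than Ruby and only re-derives the numbers; it assumes, as the answer does, that each splatted argument costs one 8-byte slot on the VM stack.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumption: each splatted argument occupies one 8-byte VM stack slot.
VM_STACK_BYTES = 1_048_576   # RubyVM::DEFAULT_PARAMS[:thread_vm_stack_size]
SLOT_BYTES = 8               # one Fixnum-sized slot on a 64-bit build
OVERFLOW_AT = 130_808        # argument count at which the answer's system overflowed

total_slots = VM_STACK_BYTES // SLOT_BYTES
bytes_for_args = OVERFLOW_AT * SLOT_BYTES
headroom = VM_STACK_BYTES - bytes_for_args

print(total_slots)      # 131072
print(bytes_for_args)   # 1046464
print(headroom)         # 2112
```

The exact overflow point differs between machines (130798 in the question, 130808 here) because the rest of the call stack uses a slightly different amount of space in each session.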

Related

Does PyTorch allocate GPU memory eagerly?

Consider the following script:
import torch

def unnecessary_compute():
    x = torch.randn(1000, 1000, device='cuda')
    l = []
    for i in range(5):
        print(i, torch.cuda.memory_allocated())
        l.append(x**i)

unnecessary_compute()
Running this script with PyTorch (1.11) generates the following output:
0 4000256
1 8000512
2 12000768
3 16001024
4 20971520
Given that PyTorch uses asynchronous computation and we never evaluated the contents of l or of a tensor that depends on l, why did PyTorch eagerly allocate GPU memory to the new tensors? Is there a way of invoking these tensors in an utterly lazy way (i.e., without triggering GPU memory allocation before it is required)?
torch.cuda.memory_allocated() returns the memory that has been allocated, not the memory that has been "used".
In a typical GPU compute pipeline, you would record operations in a queue along with whatever synchronization primitives your API offers. The GPU will then dequeue and execute those operations, respecting the enqueued synchronization primitives. However, GPU memory allocation is not usually an operation which even goes on the queue. Rather, there's usually some sort of fundamental instruction that the CPU can issue to the GPU in order to allocate memory, just as recording operations is another fundamental instruction. This means that the memory necessary for a GPU operation has to be allocated before the operation has even been enqueued; there is no "allocate memory" operation in the queue to synchronize with.
Consider Vulkan as a simple example. Rendering operations are enqueued on a graphics queue. However, memory is typically allocated via calls to vkAllocateMemory(), which does not accept any sort of queue at all; it only accepts the device handle and information about the allocation (size, memory type, etc). From my understanding, the allocation is done "immediately" / synchronously (the memory is safe to use by the time the function call returns on the CPU).
I don't know enough about GPUs to explain why this is the case, but I'm sure there's a good reason. And perhaps the limitations vary from device to device. But if I were to guess, memory allocation probably has to be a fairly centralized operation; it can't be done by just any core executing recorded operations on a queue. This would make sense, at least; the space of GPU memory is usually shared across cores.
Let's apply this knowledge to answer your question: When you call l.append(x**i), you're trying to record a compute operation. That operation will require memory to store the result, and so PyTorch is likely allocating the memory prior to enqueuing the operation. This explains the behavior you're seeing.
However, this doesn't invalidate PyTorch's claims about asynchronous compute. The memory might be allocated synchronously, but it won't be populated with the result of the operation until the operation has been dequeued and completed by the GPU, which indeed happens asynchronously.
I was able to reproduce your problem. I cannot really tell you why it behaves like that. I just think the (randomly) initialized tensor needs a certain amount of memory. For instance, if you call x = torch.randn(0,0, device='cuda') the tensor does not allocate any GPU memory, and x = torch.zeros(1000,1000, device='cuda') allocates 4000256 as in your example.
To load the tensors lazily, I suggest you create them on the CPU and move them to the GPU shortly before using them. It is a kind of speed/memory tradeoff. I changed your code accordingly:
import torch

def unnecessary_compute():
    x = torch.randn(1000, 1000, device='cpu')
    l = []
    for i in range(5):
        print(i, torch.cuda.memory_allocated())
        l.append(x**i)
    print("Move to cuda")
    for i, tensor_x in enumerate(l):
        l[i] = tensor_x.to('cuda')
        print(i, torch.cuda.memory_allocated())

unnecessary_compute()
that produced the following output:
0 0
1 0
2 0
3 0
4 0
Move to cuda
0 4000256
1 8000512
2 12000768
3 16001024
4 20971520
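The 4000256 figure in both outputs is slightly more than the 4000000 bytes that 1000 x 1000 float32 values need. A minimal sketch of the arithmetic follows, assuming the allocator rounds every request up to a multiple of 512 bytes. That granularity is an assumption that happens to fit the numbers; it is not stated anywhere in the thread.

```python
# Why 1000 x 1000 float32 values can show up as 4000256 bytes, not 4000000.
# Assumption: each request is rounded up to a multiple of 512 bytes.
BLOCK_BYTES = 512      # assumed allocator granularity
FLOAT32_BYTES = 4

def round_up(nbytes: int, multiple: int = BLOCK_BYTES) -> int:
    """Round nbytes up to the next multiple of `multiple`."""
    return (nbytes + multiple - 1) // multiple * multiple

raw = 1000 * 1000 * FLOAT32_BYTES
per_tensor = round_up(raw)

print(raw)             # 4000000
print(per_tensor)      # 4000256
print(4 * per_tensor)  # 16001024, the value printed once four tensors exist
```

The last line of each output (20971520, exactly 20 MiB) does not follow this pattern, which suggests the allocator also works with larger blocks; that is not modelled here.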

How to fix Run Time Error '7' Out of memory in visual basic 6?

I am trying to ZIP a folder with subfolders and files in VB6. To do that, I read each file and store them one by one in a byte array using ReDim Preserve. But large folders, larger than 130MB, throw an Out of Memory error. I have 8 GB of RAM in my PC, so that shouldn't be a problem. So, is there some limitation in Visual Basic 6 that prevents us from using more than 150MB of memory?
'Length of a particular File is determined
lngFileLen = FileLen(a_strFilePath)
DoEvents
If lngFileLen <> 0 Then
m_lngPtr = m_lngPtr + lngFileLen
'Next line Throws error once m_lngPtr reaches around 150 MB
ReDim Preserve arrFileBuffer(1 To m_lngPtr)
First of all, VB6 arrays can only be resized to a maximum of 2,147,483,647 elements. However, since that's also the upper limit of a Long in VB6, it seems like that's unlikely to be the problem. However, even though it may be allowed to make an array that big, it's running in a 32-bit process, so it's still subject to the limit of 2GB of addressable memory for the whole process. Since the VB6 run-time has some overhead, it's using some of that memory for other things, and since your program is likely doing other things too, that will be using up some memory too.
In addition to that, when you create an array, the system has to find that number of bytes of contiguous memory. So, even when there is enough memory available, within the 2GB limit, if it's sufficiently fragmented, you can still get out of memory errors. For that reason, creating gigantic arrays is always a concern.
Next, you are using ReDim Preserve, which requires twice the memory. When you resize the array like that, what it actually has to do, under the hood, is create a second array of the new size and then copy all of the other data out of the old array into the new one. Once it's done copying all the data out of the source array, it can then delete it, but while it's performing the copy, it needs to hold both the old array and the new array in memory simultaneously. That means that in a best case scenario, even if there was no other allocated memory or fragmentation, the maximum memory size of an array that you could resize would be 1GB.
Finally, in your example, you never showed what the data type of the array was. If it's an array of bytes, you should be good, I think (where the memory size of the array would only be slightly more than its length in elements). However, if, for instance, it's an array of strings or variants, then I believe that's going to require a minimum of 4 bytes per element, thereby more-than-quadrupling the memory size of the array.
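To make the cost of resizing once per file concrete, here is a small model of the copy-on-resize behaviour described above. It is written in Python rather than VB6, and the folder contents (fifteen 10 MB files) are hypothetical. It only counts bytes and allocates nothing.

```python
# Illustrative model (Python, not VB6) of growing a buffer the way the answer
# describes ReDim Preserve: allocate the new size, then copy the old contents.
def grow_per_file(file_sizes):
    """Resize once per file; return (bytes_copied, peak_bytes_held)."""
    size = copied = peak = 0
    for n in file_sizes:
        new_size = size + n
        copied += size                      # old contents are copied across
        peak = max(peak, size + new_size)   # old and new arrays coexist briefly
        size = new_size
    return copied, peak

def size_once(file_sizes):
    """Sum the lengths first, then allocate a single array."""
    return 0, sum(file_sizes)

MB = 1024 * 1024
files = [10 * MB] * 15   # hypothetical folder: fifteen 10 MB files

copied, peak = grow_per_file(files)
print(copied // MB, peak // MB)   # 1050 290
copied, peak = size_once(files)
print(copied // MB, peak // MB)   # 0 150
```

Under these assumptions, summing the file lengths first and dimensioning the array once avoids both the repeated copying and the doubled peak.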

Faster memory allocation and freeing algorithm than multiple Free List method

We allocate and free many memory blocks. We use the memory heap. However, heap access is costly.
For faster allocation and freeing, we adopted a global Free List. Because the program is multithreaded, the Free List is protected by a Critical Section. However, the Critical Section becomes a bottleneck for parallelism.
To remove the Critical Section, we assigned a Free List to each thread, i.e. via Thread Local Storage. However, thread T1 always allocates memory blocks and thread T2 always frees them, so the Free List in thread T2 keeps growing and the Free List brings no benefit.
Despite the bottleneck of the Critical Section, we adopted it again, in a different way. We prepare several Free Lists, each with its own Critical Section, giving Free Lists 0 to N-1 and Critical Sections 0 to N-1. We also keep an atomically updated integer that cycles through 0, 1, 2, ... N-1 and then starts again at 0. For each allocation and free, we read the integer value X, advance it, enter the X-th Critical Section, then access the X-th Free List. However, this is considerably slower than the previous method (using Thread Local Storage). The atomic operation gets slower as the number of threads grows.
Because updating the integer non-atomically causes no corruption, we made the update non-atomic. However, because the integer value is sometimes stale, different threads often end up accessing the same Critical Section and Free List. This causes the bottleneck again, though less severely than with the previous method.
Instead of the integer value, we then hashed the thread ID to the range 0 to N-1, and the performance got better.
I guess there must be a much better way of doing this, but I cannot find one. Are there any ideas for improving what we have made?
Dealing with heap memory is a task for the OS. Nothing guarantees you can do a better/faster job than the OS does.
But there are some conditions where you can get a bit of improvement, especially when you know something about your memory usage that is unknown to the OS.
I'm writing my untested idea here; I hope you get some benefit from it.
Let's say you have T threads, all of them reserving and freeing memory. The main goal is speed, so I'll try not to use TLS, nor critical blocking, nor atomic ops.
If (repeat: if, if, if) the app can fit to several discrete sizes of memory blocks (not random sizes, so as to avoid fragmentation and unusable holes) then start by asking the OS for a number of these discrete blocks.
For example, you have an array of n1 blocks each of size size1, an array of n2 blocks each of size size2, an array of n3... and so on. Each array is two-dimensional; the second field just stores a flag for used/free block. If your arrays are very large then it's better to use a dedicated array for the flags (since contiguous memory access is generally faster).
Now, someone asks for a block of memory of size sB. A specialized function (or object or whatever) picks the array whose block size is greater than or equal to sB, and then selects a block by looking at the used/free flag. Just before ending this task the proper block-flag is set to "used".
When two or more threads ask for blocks of the same size the flag may get corrupted. Using TLS would solve this issue, and so would critical blocking. I think you can set a bool flag at the beginning of the search in the flags array that makes the other threads wait until the flag changes, which only happens after the block-flag changes. In pseudocode:
MemoryGetter(sB)
{
    // Select which array to use, depending on 'sB'
    arrMatch = NONE
    for (i = 0; i < numOfarrays; i++)
        if (sizeOfArr(i) >= sB)
        {
            arrMatch = i
            break // exit for
        }
    if (arrMatch == NONE)
        return NULL // no block size is large enough for 'sB'

    // Wait if another thread wants a block from the same arrMatch array.
    // NOTE: this check and the assignment below are two separate steps, so two
    // threads can both get past the wait; a real version needs them to be a
    // single indivisible test-and-set.
    while (searching(arrMatch) == true)
        ; // wait

    // Block other threads wanting a block from the same arrMatch array
    searching(arrMatch) = true

    // Get the first free block
    selectedBlock = NULL
    for (i = 0; i < numOfBlocks; i++)
        if (arrOfUsed(arrMatch, i) != true)
        {
            selectedBlock = addressOf(....)
            // Mark the block as used
            arrOfUsed(arrMatch, i) = true
            break // exit for
        }

    // Allow other threads
    searching(arrMatch) = false

    return selectedBlock // NOTE: selectedBlock == NULL means no free block
}
Freeing a block is easier, just mark it as free, no thread concurrency issue.
Dealing with no free blocks is up to you (wait, use a bigger block, ask OS for more, etc).
Note that the whole memory is reserved from the OS at app start, which can be a problem.
If this idea makes your app faster, let me know. What I can say for sure is that memory usage will be higher than with normal OS requests, but not by much if you choose "good" sizes, meaning the ones used most often.
Some improvements are possible:
Cache the last freed block (per size) so as to avoid the search.
Start with fewer blocks, and ask the OS for more memory only when needed. Play with the number of blocks for each size depending on your app, and find the optimal case.
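A minimal sketch of the size-class idea from this answer, written in Python for illustration only. It swaps the spin-wait flag for an ordinary lock per size class (threading.Lock), so it is not lock-free as the answer hoped. It also hands back (class, block) index pairs rather than real addresses. The class name SizeClassPool and its methods are made up for this sketch.

```python
import bisect
import threading

class SizeClassPool:
    """Toy model: fixed block sizes, one used/free flag list and one lock per size."""

    def __init__(self, classes):
        # classes: {block_size: number_of_blocks}; sizes are kept sorted for lookup
        self.sizes = sorted(classes)
        self.used = [[False] * classes[s] for s in self.sizes]
        self.locks = [threading.Lock() for _ in self.sizes]

    def get(self, size):
        """Return (class_index, block_index), or None if nothing fits or is free."""
        c = bisect.bisect_left(self.sizes, size)   # first class with block size >= size
        if c == len(self.sizes):
            return None
        with self.locks[c]:                        # stands in for the 'searching' flag
            for i, in_use in enumerate(self.used[c]):
                if not in_use:
                    self.used[c][i] = True
                    return (c, i)
        return None

    def release(self, handle):
        """Mark the block as free again."""
        c, i = handle
        with self.locks[c]:
            self.used[c][i] = False

pool = SizeClassPool({64: 2, 256: 1})
a = pool.get(50)      # fits the 64-byte class
b = pool.get(64)
c = pool.get(10)      # the 64-byte class is exhausted by now
d = pool.get(100)     # goes to the 256-byte class
print(a, b, c, d)     # (0, 0) (0, 1) None (1, 0)
pool.release(a)
print(pool.get(1))    # (0, 0)
```

What to do when a class is exhausted (wait, fall through to a bigger class, ask the OS for more) is left open here, as it is in the answer.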

Is it okay to use dictionary memory without 'allot'?

I am doing a programming exercise where I'm trying to do the same thing in different ways. (I happen to be adding two 3 element vectors together in Forth). In one of my revisions I used the return stack to store temporary values (so I am using that feature), but in addition to that I am considering using un-allocated memory as temporary storage.
I created two words to access this memory:
: front! here + ! ;
: front@ here + @ ;
I tried it in my experiment, and it seemed to work for what I was doing. I don't have any intention of using this memory after my routines are done. And I am staying within the dictionary, whose memory has already been given to the program.
But, my gut still tells me that this is a bad thing to do. Is this such a bad thing?
If it matters, I'm using Gforth.
Language-lawyer strictly speaking, no. ANS Forth 3.3.3.2 states:
A program may perform address arithmetic within contiguously allocated regions.
You are performing address arithmetic outside any allocated region.
However, it might be perfectly fine in some particular implementation. Such as gforth.
Note that there is a word called PAD, which returns an address to a temporary memory region.
It's okay if you know what you are doing, but PAD is a better place than HERE to do it. There is also the alternative ALLOCATE and FREE:
ALLOCATE ( u -- a-addr ior )
Allocate u address units of contiguous data space. The data-space
pointer is unaffected by this operation. The initial content of the
allocated space is undefined.
If the allocation succeeds, a-addr is the aligned starting address of
the allocated space and ior is zero.
If the operation fails, a-addr does not represent a valid address and
ior is the implementation-defined I/O result code.
FREE ( a-addr -- ior )
Return the contiguous region of data space indicated by a-addr to the
system for later allocation. a-addr shall indicate a region of data
space that was previously obtained by ALLOCATE or RESIZE. The
data-space pointer is unaffected by this operation.
If the operation succeeds, ior is zero. If the operation fails, ior is
the implementation-defined I/O result code.
American National Standard for Information Systems
: front! here + ! ;
What's the stack diagram? I guess ( n offset_in_cells -- )?

Memory management in Forth

So I'm just learning Forth and was curious if anyone could help me understand how memory management generally works. At the moment I only have (some) experience with the C stack-vs-heap paradigm.
From what I understand, one can allocate in the Dictionary, or on the heap. Is the Dictionary faster/preferred like the stack in C? But unlike in C, there aren't scopes and automatic stack reclamation, so I'm wondering if one only uses the dictionary for global data structures (if at all).
As far as the heap goes, is it pretty much like C? Is heap management a standard (ANS) concept, or is it implementation-defined?
It is not a choice between the dictionary and the heap: the dictionary is the equivalent of the heap. However, it has the severe limitation that it acts more like a stack than a heap: new words are added to the end of the dictionary (allocation by ALLOT and freeing by FORGET or FREE, which also frees all newer words, acting more like multiple POPs).
An implementation can control the memory layout and thus implement a traditional heap (or garbage collection). An example is A FORTH implementation of the Heap Data Structure for Memory Management (1984). Another implementation is Dynamic Memory Heaps for Quartus Forth (2000).
A lot is implementation dependent or extensions. For instance, the memory layout is often with the two block buffers (location by BLOCK and TIB), the text input buffer and values and low-level/primitive functions of the language, in the lowest portion, dictionary in the middle (growing upwards) and the return stack and the parameter stack at the top 1.
The address of the first available byte above the dictionary is returned by HERE (it changes as the dictionary expands).
There is also a scratchpad area above the dictionary (address returned by PAD) for temporarily storing data. The scratchpad area can be regarded as free memory.
The preferred mode of operation is to use the stack as much as possible instead of local variables or a heap.
1 p. 286 (about a particular edition of Forth, MMSFORTH) in chapter "FORTH's Memory, Dictionary, and Vocabularies", Forth: A text and a reference. Mahlon G. Kelly and Nicholas Spies. ISBN 0-13-326349-5 / 0-13-326331-2 (pbk.). 1986 by Prentice-Hall.
The fundamental question may not have been answered in a way that a new Forth user would require so I will take a run at it.
Memory in Forth can be very target dependent so I will limit the description to the simplest model, that being a flat memory space, where code and data live together happily. (as opposed to segmented memory models, or FLASH memory for code and RAM for data or other more complicated models)
The Dictionary typically starts at the bottom of memory and is allocated upwards by the Forth system. The two stacks, in a simple system would exist in high memory and typically have two CPU registers pointing to them. (Very system dependent)
At the most fundamental level, memory is allocated simply by changing the value of the dictionary pointer variable. (sometimes called DP)
The programmer does not typically access this variable directly but rather uses some higher level words to control it.
As mentioned the Forth word HERE returns the next available address in the dictionary space. What was not mentioned was that HERE is defined by fetching the value of the variable DP. (system dependency here but useful for a description)
In Forth HERE might look like this:
: HERE ( -- addr) DP @ ;
That's it.
To allocate some memory we need to move HERE upwards and we do that with the word ALLOT.
The Forth definition for ALLOT simply takes a number from the parameter stack and adds it to the value in DP. So it is nothing more than:
: ALLOT ( n --) DP +! ; \ '+!' adds n to the contents of variable DP
ALLOT is used by the FORTH system when we create a new definition so that what we created is safely inside 'ALLOTed' memory.
Something that is not immediately obvious is that ALLOT can take a negative number, so it is possible to move the dictionary pointer up or down. So you could allocate some memory and return it like this:
HEX 100 ALLOT
And free it up like this:
HEX -100 ALLOT
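For readers more at home outside Forth, the same pointer-bumping scheme can be modelled in a few lines of Python. This is only a toy illustration of DP, HERE and ALLOT as described above, not how any particular Forth implements them. The class name Dictionary and the range check are additions for the sketch.

```python
class Dictionary:
    """Toy model of Forth dictionary space: a flat byte array plus a pointer."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self.dp = 0                      # dictionary pointer (the DP variable)

    def here(self):
        """HERE: the next free address."""
        return self.dp

    def allot(self, n):
        """ALLOT: add n (which may be negative) to DP."""
        new_dp = self.dp + n
        if not 0 <= new_dp <= len(self.memory):
            # Range check added for the sketch; a simple Forth may not check at all.
            raise MemoryError("dictionary pointer out of range")
        self.dp = new_dp

d = Dictionary(0x1000)
start = d.here()
d.allot(0x100)             # HEX 100 ALLOT
print(d.here() - start)    # 256
d.allot(-0x100)            # HEX -100 ALLOT
print(d.here() == start)   # True
```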
All this to say that this is the simplest form of memory management in a Forth system. An example of how this is used can be seen in the definition of the word BUFFER:
: BUFFER: ( n --) CREATE ALLOT ;
BUFFER: "creates" a new name in the dictionary (CREATE uses ALLOT to make space for the name, by the way), then ALLOTs n bytes of memory right after the name and any associated housekeeping bytes your Forth system might use.
So now to allocate a block of named memory we just type:
MARKER FOO \ mark where the memory ends right now
HEX 2000 BUFFER: IN_BUFFER
Now we have an 8K byte buffer called IN_BUFFER. If we wanted to reclaim that space in Standard Forth we could type FOO and everything allocated in the Dictionary after FOO would be removed from the Forth system.
But if you want temporary memory space, EVERYTHING above HERE is free to use!
So you can simply point to an address and use it if you want to like this
: MYMEMORY here 200 + ; \ MYMEMORY points to un-allocated memory above HERE
\ MYMEMORY moves with HERE. be aware.
MYMEMORY HEX 1000 ERASE \ fill 4K bytes with zero (hex 1000 = 4096)
Forth has typically been used for high performance embedded applications where dynamic memory allocation can cause un-reliable code so static allocation using ALLOT was preferred. However bigger systems have a heap and use ALLOCATE, FREE and RESIZE much like we use malloc etc. in C.
Peter Mortensen laid it out very well. I'll add a few notes that might help a C programmer some.
The stack is closest to what C terms "auto" variables, and what are commonly called local variables. You can give your stack values names in some forths, but most programmers try to write their code so that naming the values is unnecessary.
The dictionary can best be viewed as "static data" from a C programming perspective. You can reserve ranges of addresses in the dictionary, but in general you will use ALLOT and related words to create static data structures and pools which do not change size after allocation. If you want to implement a linked list that can grow in real time, you might ALLOT enough space for the link cells you will need, and write words to maintain a free list of cells you can draw from. There are naturally implementations of this sort of thing available, and writing your own is a good way to hone pointer management skills.
Heap allocation is available in many modern Forths, and the standard defines ALLOCATE, FREE and RESIZE words that work in a way analogous to malloc(), free(), and realloc() in C. Where the bytes are allocated from will vary from system to system. Check your documentation. It's generally a good idea to store the address in a variable or some other more permanent structure than the stack so that you don't inadvertently lose the pointer before you can free it.
As a side note, these words (along with the file i/o words) return a status on the stack that is non-zero if an error occurred. This convention fits nicely with the exception handling mechanism, and allows you to write code like:
variable PTR
1024 allocate throw PTR !
\ do some stuff with PTR
PTR @ free throw
0 PTR !
Or for a more complex if somewhat artificial example of allocate/free:
\ A simple 2-cell linked list implementation using allocate and free
: >link ( a -- a ) ;
: >data ( a -- a ) cell + ;
: newcons ( a -- a ) \ make a cons cell that links to the input
2 cells allocate throw tuck >link ! ;
: linkcons ( a -- a ) \ make a cons cell that gets linked by the input
0 newcons dup rot >link ! ;
: makelist ( n -- a ) \ returns the head of a list of the numbers from 0..n
0 newcons dup >r
over 0 ?do
i over >data ! linkcons ( a -- a )
loop >data ! r> ;
: walklist ( a -- )
begin dup >data ? >link @ dup 0= until drop ;
: freelist ( a -- )
begin dup >link @ swap free throw dup 0= until drop ;
: unittest 10 makelist dup walklist freelist ;
Some Forth implementations support local variables on the return stack frame and allocating memory blocks. For example in SP-Forth:
lib/ext/locals.f
lib/ext/uppercase.f
100 CONSTANT /buf
: test ( c-addr u -- ) { \ len [ /buf 1 CHARS + ] buf }
buf SWAP /buf UMIN DUP TO len CMOVE
buf len UPPERCASE
0 buf len + C! \ just for illustration
buf len TYPE
;
S" abc" test \ --> "ABC"
With Forth you enter a different world.
In a typical Forth like ciforth on Linux (and assuming 64 bits) you can configure your Forth to have a linear memory space that is as large as your swap space (e.g. 128 Gbyte). That is yours to fill in with arrays, linked lists, pictures, whatever. You do this interactively, typically by declaring variables and including files. There are no restrictions. Forth only provides you with a HERE pointer to help you keep track of memory you have used up. Even that you can ignore, and there is even a word in the 1994 standard that provides scratch space that floats in the free memory (PAD).
Is there something like malloc() and free()? Not necessarily. In a small kernel of a couple of dozen kilobytes, no. But you can just include a file with an ALLOCATE / FREE and set aside a couple of Gbyte to use for dynamic memory.
As an example I'm currently working with tiff files. A typical 140 Mbyte picture takes a small chunk out of the dictionary advancing HERE.
Rows of pixels are transformed, decompressed etc. For that I use dynamic memory, so I ALLOCATE space for the decompression result of a row. I have to manually FREE them again when the results have been used up for another transformation. It feels totally different from C. There is more control and more danger.
As for your question about scopes etc.: in Forth, if you know the address, you can access the data structure. Even if you jotted F7FFA1003 on a piece of paper. Trying to make programs safer by separate name spaces is not prominent in Forth style. So-called wordlists (see also VOCABULARY) provide facilities in that direction.
There's a little elephant hiding in a big FORTH memory management room, and I haven't seen too many people mention it.
The canonical FORTH has, at the very least, a non-addressable parameter stack. This is the case in all FORTH hardware implementations I'm aware of (usually originating with Chuck Moore) that have a hardware parameter stack: it's not mapped into the addressable memory space.
What does "non-addressable" mean? It means: you can't have pointers to the parameter stack, i.e. there are no means to get addresses of things on that stack. The stack is a "black box" that you can only access via the stack API (opcodes if it's a hardware stack), without bypassing it - and only that API will modify its contents.
This implies no aliasing between parameter stack and memory accesses using pointers - via @ and ! and the like. This enables efficient code generation with small effort, and indeed it makes decent generated code in FORTH systems orders of magnitude easier to obtain than with C and C++.
This of course breaks down when pointers can be obtained to the parameter stack. A well designed system would probably have guarded API for such access, since within the guards the code generator has to spill everything from registers to stack - in absence of full data flow analysis, that is.
DFA and other "expensive" optimization techniques are not of course impossible in FORTH, it's just that they are a bit larger in scope than many a practical FORTH system. They can be done very cleanly in spite of that (I'm using CFA, DFA and SSA optimizations in an in-house FORTH implementation, and the whole thing has less source code, comments included, than the utility classes in LLVM... - classes that are used all over the place, but that don't actually do anything related to compiling or code analysis).
A practical FORTH system can also place aliasing limitations on the return stack contents, namely that the return addresses themselves don't alias. That way control flow can be analyzed optimistically, only taking into account explicit stack accesses via R@, >R and R>, while letting you place addressable local variables on that stack - that's typically done when a variable is larger than a cell or two, or would be awkward to keep around on the parameter stack.
In C and C++, aliasing between automatic "local" variables and pointers is a big problem, because only large compilers with big optimizers can afford to prove lack of aliasing and forgo register reloads/spills when intervening pointer dereferences take place. Small compilers, to remain compliant and not generate broken code, have to pessimize and assume that accesses via char* alias everything, and accesses via Type* alias that type and others "like it" (e.g. derived types in C++). That char* aliases all things in C is a prime example of where you pay a big price for a feature you didn't usually intend to use.
Usually, forcing an unsigned char type for characters, and re-writing the string API using this type, lets you not use char* all over the place and lets the compiler generate much better code. Compilers of course add lots of analysis passes to minimize the fallout from this design fiasco... And all it'd take to fix in C is having a byte type that aliases every other type, and is compatible with arbitrary pointers, and has the size of the smallest addressable unit of memory. The reuse of void in void* to mean "pointer to anything" was, in hindsight, a mistake, since returning void means returning nothing, whereas pointing to void absolutely does not mean "pointing to nothing".
My idea is published at https://sites.google.com/a/wisc.edu/memorymanagement
I'm hoping to put forth code on github soon.
If you have an array (or several) with each array having a certain number of items of a certain size, you can pair a single-purpose stack to each array. The stack is initialized with the address of each array item. To allocate an array item, pop an address off the stack. To deallocate an array item, push its address onto the stack.
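As a rough illustration of that pairing, here is a sketch in Python (the author's own code is not shown in this thread, so this is not it). Offsets into a bytearray stand in for addresses, and the name StackPool is made up for the example.

```python
class StackPool:
    """Fixed-size items in one array, paired with a stack of free item offsets."""

    def __init__(self, item_size, count):
        self.item_size = item_size
        self.storage = bytearray(item_size * count)
        # The stack starts out holding the offset ("address") of every item.
        self.free = [i * item_size for i in range(count)]

    def allocate(self):
        """Pop a free offset; raises MemoryError when the pool is exhausted."""
        if not self.free:
            raise MemoryError("pool exhausted")
        return self.free.pop()

    def deallocate(self, offset):
        """Push the offset back so that it can be handed out again."""
        self.free.append(offset)

pool = StackPool(item_size=16, count=3)
a = pool.allocate()
b = pool.allocate()
print(a, b)              # 32 16
pool.deallocate(a)
print(pool.allocate())   # 32
```

Both operations are a single push or pop, so there is no search for a free slot.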

Resources