How can I read received packets with an NDIS filter driver? - windows

I am currently experimenting with the NDIS driver samples.
I am trying to print the packets' contents (including the MAC addresses, EtherType, and the data).
My first guess was to implement this in the function FilterReceiveNetBufferLists. Unfortunately, I am not sure how to extract the packet contents from the NetBufferLists.

That's the right place to start. Consider this code:
void FilterReceiveNetBufferLists(..., NET_BUFFER_LIST *nblChain, ...)
{
    UCHAR buffer[14];   // scratch space for the 14-byte Ethernet header
    UCHAR *header;

    for (NET_BUFFER_LIST *nbl = nblChain; nbl; nbl = nbl->Next) {
        header = NdisGetDataBuffer(nbl->FirstNetBuffer, sizeof(buffer), buffer, 1, 1);
        if (!header)
            continue;   // couldn't get a contiguous view of the header

        DbgPrint("MAC address: %02x-%02x-%02x-%02x-%02x-%02x\n",
                 header[0], header[1], header[2],
                 header[3], header[4], header[5]);
    }

    NdisFIndicateReceiveNetBufferLists(..., nblChain, ...);
}
There are a few points to consider about this code.
The NDIS datapath uses the NET_BUFFER_LIST (nbl) as its primary data structure. An nbl represents a set of packets that all have the same metadata. For the receive path, nobody really knows much about the metadata, so that set always has exactly 1 packet in it. In other words, the nbl is a list... of length 1. For the receive path, you can count on it.
The nbl is a list of one or more NET_BUFFER (nb) structures. An nb represents a single network frame (subject to LSO or RSC). So the nb corresponds most closely to what you think of as a packet. Its metadata is stored on the nbl that contains it.
Within an nb, the actual packet payload is stored as one or more buffers, each represented as an MDL. Mentally, you should pretend the MDLs are just concatenated together. For example, the network headers might be in one MDL, while the rest of the payload might be in another MDL.
Finally, for performance, NDIS gives as many NBLs to your LWF as possible. This means there's a list of one or more NBLs.
Put it all together, and you have:
- Your function receives a list of NBLs.
- Each NBL contains exactly 1 NB (on the receive path).
- Each NB contains a list of MDLs.
- Each MDL points to a buffer of payload.
So in our example code above, the for-loop iterates along that first bullet point: the chain of NBLs. Within the loop, we only need to look at nbl->FirstNetBuffer, since we can safely assume there is no other nb besides the first.
It's inconvenient to have to fiddle with all those MDLs directly, so we use the helper routine NdisGetDataBuffer. You tell this guy how many bytes of payload you want to see, and he'll give you a pointer to a contiguous range of payload.
In the good case, your buffer is contained in a single MDL, so NdisGetDataBuffer just gives you a pointer back into that MDL's buffer.
In the slow case, your buffer straddles more than one MDL, so NdisGetDataBuffer carefully copies the relevant bit of payload into a scratch buffer that you provided.
The latter case can be fiddly, if you're trying to inspect more than a few bytes. If you're reading all 1500 bytes of the packet, you can't just allocate 1500 bytes on the stack (kernel stack space is scarce, unlike usermode), so you have to allocate it from the pool. Once you figure that out, note it will slow things down to copy all 1500 bytes of data into a scratch buffer for every packet. Is the slowdown too much? It depends on your needs. If you're only inspecting occasional packets, or if you're deploying the LWF on a low-throughput NIC, it won't matter. If you're trying to get beyond 1Gbps, you shouldn't be memcpying so much data around.
Also note that if you ultimately want to modify the packet, you'll need to be wary of NdisGetDataBuffer. It can give you a copy of the data (stored in your local scratch buffer), so if you modify the payload, those changes won't actually stick to the packet.
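For illustration, here is a minimal sketch of the pool-allocated scratch-buffer approach described above. MAX_FRAME and the pool tag are arbitrary choices, and error handling is pared down:
#define MAX_FRAME 1514   // illustrative: a full untagged Ethernet frame

UCHAR *scratch = ExAllocatePoolWithTag(NonPagedPoolNx, MAX_FRAME, 'FwLS');
if (scratch != NULL) {
    for (NET_BUFFER_LIST *nbl = nblChain; nbl; nbl = nbl->Next) {
        NET_BUFFER *nb = nbl->FirstNetBuffer;
        ULONG frameLen = NET_BUFFER_DATA_LENGTH(nb);
        ULONG bytes = frameLen < MAX_FRAME ? frameLen : MAX_FRAME;
        UCHAR *data = NdisGetDataBuffer(nb, bytes, scratch, 1, 0);
        if (data != NULL) {
            // data points either into the original MDL (fast case)
            // or into scratch (slow case, a copy was made).
            // ... inspect `bytes` bytes of payload here ...
        }
    }
    ExFreePoolWithTag(scratch, 'FwLS');
}
Note the caveat from above still applies: if data ends up pointing into scratch, writes through it won't touch the actual packet.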
What if you do need to scale to high throughputs, or modify the payload? Then you need to work out how to manipulate the MDL chain. That's a bit confusing at first, but spend a little time with the documentation and draw yourself some whiteboard diagrams.
I suggest first starting out by understanding an MDL. From networking's point of view, an MDL is just a fancy way of holding a { char * buffer, size_t length }, along with a link to the next MDL.
Next, consider the NB's DataOffset and DataLength. These conceptually move the buffer boundaries in from the beginning and the end of the buffer. They don't really care about MDL boundaries -- for example, you can reduce the length of the packet payload by decrementing DataLength, and if that means that one or more MDLs are no longer contributing any buffer space to the packet payload, it's no big deal, they're just ignored.
Finally, add on top CurrentMdl and CurrentMdlOffset. These are redundant with everything above, but they exist for (microbenchmark) performance. You aren't required to even think about them if you're reading the NB, but if you are editing the size of the NB, you do need to update them.
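If you do go down that road, a hedged sketch of walking one nb's MDL chain by hand might look like this (read-only inspection; the comments mark the DataLength/CurrentMdlOffset bookkeeping discussed above):
NET_BUFFER *nb = nbl->FirstNetBuffer;
ULONG remaining = NET_BUFFER_DATA_LENGTH(nb);
ULONG offset = NET_BUFFER_CURRENT_MDL_OFFSET(nb);

for (MDL *mdl = NET_BUFFER_CURRENT_MDL(nb); mdl && remaining; mdl = mdl->Next) {
    UCHAR *va = (UCHAR *)MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
    if (va == NULL)
        break;                          // mapping failed; give up on this packet

    ULONG len = MmGetMdlByteCount(mdl) - offset;
    if (len > remaining)
        len = remaining;                // the last MDL may extend past DataLength

    // ... inspect `len` bytes starting at va + offset ...

    remaining -= len;
    offset = 0;                         // only the first MDL has a nonzero offset
}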

Related

Indexing text file for microcontroller

I need to search for a specific record in a large file. The search will be performed on a microcontroller (ESP8266), so I'm working with limited storage and RAM.
The list looks like this:
BSSID,data1,data2
001122334455,float,float
001122334466,float,float
...
I was thinking of using an index to speed up the search. The data are static, and the index will be built on a computer and then loaded onto the microcontroller.
What I've done so far is very simplistic.
I created an index on the first byte of the BSSID; each entry points at the first and last offsets of the values with that BSSID prefix.
The performance is terrible, but the index file is very small and uses very little RAM. I thought about going further with this method by looking at the first two bytes, but the index table would be 256 times larger, resulting in a table 1/3 the size of the data file.
This is the index with the first method:
00,0000000000,0000139984
02,0000139984,0000150388
04,0000150388,0000158812
06,0000158812,0000160900
08,0000160900,0000171160
What indexing algorithm do you suggest that I use?
EDIT: Sorry, I didn't include enough background before. I'm storing the data and index file on the flash memory of the chip. I have 30000 records at the moment, but this number could potentially grow until the chip's memory limit is hit. The set is indeed static once stored on the microcontroller, but it can be updated later with the help of a computer. The data isn't spread symmetrically between indexes. My goal is to find a good compromise between search speed, index size, and RAM used.
I'm not sure where you're stuck, but I can comment on what you've done so far.
Most of all, the way to determine the "best" method is to
define "best" for your purposes;
research indexing algorithms (basic ones have been published for over 50 years);
choose a handful to implement;
Evaluate those implementations according to your definition of "best".
Keep in mind your basic resource restriction: you have limited RAM. If method requires more RAM than you have, it doesn't work, and is therefore infinitely slower than any method that does work.
You've come close to a critical idea, however: you want your index table to expand to consume any free RAM, using that space as effectively as possible. If you can index 16 bits instead of 8 and still fit the table comfortably into your available space, then you've cut down your linear search time by roughly a factor of 256.
Indexing considerations
Don't put the ending value in each row: it's identical to the starting value in the next row. Omit that, and you save one word in each row of the table, giving you twice the table room.
Will you get better performance if you slice the file into equal parts (the same quantity of BSSIDs for each row of your table) and then store the entire starting BSSID with its record number? If your BSSIDs are heavily clumped, this might improve your overall processing, even though your table has fewer rows. You can't use a direct index in this case; you have to search the first column to get the proper starting point.
Does that move you toward a good solution?
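If it helps, here is a rough C sketch of the equal-slices lookup described above. All names are illustrative, and the table itself would be built on the PC:
#include <stdint.h>
#include <string.h>

#define SLICES 256

struct slice { uint8_t first_bssid[6]; uint32_t file_off; };
extern const struct slice index_tab[SLICES];   /* generated offline */

/* Return the file offset of the slice that may contain key:
   binary search for the greatest first_bssid <= key. */
uint32_t slice_for(const uint8_t key[6])
{
    int lo = 0, hi = SLICES - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (memcmp(index_tab[mid].first_bssid, key, 6) <= 0)
            lo = mid;
        else
            hi = mid - 1;
    }
    return index_tab[lo].file_off;   /* then scan records from here */
}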
Not sure how much memory you got (I am not familiar with that MCU), but do not forget that these tables are static/constant, so they can be stored in EEPROM instead of RAM. Some chips have quite a lot of EEPROM, usually way more than RAM...
Assume your file is sorted by the index. So you have (assuming a 32-bit address) per entry:
BYTE ix; DWORD beg, end;
Why not this:
struct entry { DWORD beg, end; };
entry ix0[256];
where the first BYTE is also the address into the index array. This will save one byte per entry.
Now, as Prune suggested, you can ignore the end address, as you will scan the following entries in the file anyway until you hit the correct index or an index with a different first BYTE. So you can use:
DWORD ix[256];
where you have only the start address beg.
Now, we do not know how many entries you actually have, nor how many entries will share the same second BYTE of the index. So we cannot make any further assumptions to improve...
You wanted to do something like:
DWORD ix[65536];
But you do not have enough memory for it... How about doing something like this instead:
const int N = 1024;                        // number of entries you can store
const int dix = (max_index_value + 1) / N;
const DWORD ix[N] = { ..... };
so each entry ix[i] will cover all the indexes from i*dix to ((i+1)*dix)-1. So to find an index, you do this:
i = ix[index / dix];
for (; i < file_size; )
{
    read entry from file at i-th position;
    update position i;
    if (file_index == index) { do your stuff; break; }
    if (file_index >  index) { index not found; break; }
}
To improve performance you can rewrite this linear scan into a binary search between the addresses ix[index/dix] and ix[(index/dix)+1] (or the file size for the last index)... assuming each entry in the file has the same size... A sketch follows below.
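As a hedged illustration of that last refinement, assuming fixed-size records that begin with their index value (read_index_at() is a placeholder for your flash/file read; N, REC_SIZE, and the names are illustrative):
#include <stdint.h>

#define N        1024   /* entries in ix[], as above */
#define REC_SIZE 16     /* illustrative fixed record size in bytes */

extern uint32_t ix[N];          /* start address of each slice */
extern uint32_t file_size;
extern uint32_t read_index_at(uint32_t pos);   /* reads a record's index field */

/* Binary search inside the slice that covers `index`; returns 1 and
   the record position on success, 0 if not found. */
int find(uint32_t index, uint32_t dix, uint32_t *pos_out)
{
    uint32_t k  = index / dix;
    uint32_t lo = ix[k] / REC_SIZE;                                  /* first record no. */
    uint32_t hi = ((k + 1 < N) ? ix[k + 1] : file_size) / REC_SIZE;  /* one past last */

    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        uint32_t v = read_index_at(mid * REC_SIZE);
        if (v < index)      lo = mid + 1;
        else if (v > index) hi = mid;
        else { *pos_out = mid * REC_SIZE; return 1; }
    }
    return 0;
}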

How does cacheline to register data transfer work?

Suppose I have an int array of 10 elements. With a 64-byte cache line, a line can hold 16 array elements, from arr[0] to arr[15].
I would like to know what happens when you fetch, for example, arr[5] from the L1 cache into a register. How does this operation take place? Can the CPU pick an offset into a cache line and read the next n bytes?
The cache will usually provide the full line (64B in this case), and a separate component in the MMU would rotate and cut the result (usually some barrel shifter), according to the requested offset and size. You would usually also get some error checks (if the cache supports ECC mechanisms) along the way.
Note that caches are often organized in banks, so a read may have to fetch bytes from multiple locations. By providing a full line, the cache can construct the bytes in proper order first (and perform the checks), before letting the MMU pick the relevant part.
Some designs focusing on power saving may decide to implement lower granularity, but this is often only adding complexity as you may have to deal with more cases of line segments being split.

MME Audio Output Buffer Size

I am currently playing around with outputting FP32 samples via the old MME API (waveOutXxx functions). The problem I've bumped into is that if I provide a buffer length that does not evenly divide the sample rate, certain audible clicks appear in the audio stream; when recorded, it looks like some of the samples are lost (I'm generating a sine wave for the test). Currently I am using the "magic" value of 2205 samples per buffer for 44100 sample rate.
The question is, does anybody know the reason for these dropouts and if there is some magic formula that provides a way to compute the "proper" buffer size?
The safe alignment of data buffers is the value of nBlockAlign in the WAVEFORMATEX structure:
Software must process a multiple of nBlockAlign bytes of data at a time. Data written to and read from a device must always start at the beginning of a block. For example, it is illegal to start playback of PCM data in the middle of a sample (that is, on a non-block-aligned boundary).
For PCM formats this is the amount of bytes for single sample across all channels. Non-PCM formats have their own alignments, often equal to length of format-specific block, e.g. 20 ms.
Back when waveOutXxx was the primary API for audio, carrying over unaligned bytes was an unreasonable burden for the API and unneeded performance overhead. Right now this API is a compatibility layer on top of other audio APIs, and I suppose that unaligned bytes are simply stripped so that the rest of the content still plays, rather than rejecting the whole buffer over a small, non-fatal inaccuracy on the caller's part.
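As a hedged sketch of the alignment rule above (the 50 ms figure is an arbitrary choice, not something the API mandates):
#include <windows.h>
#include <mmsystem.h>

/* Pick a buffer size of roughly target_ms milliseconds, rounded down
   to a whole number of blocks so waveOutWrite never sees a partial sample. */
DWORD buffer_bytes(const WAVEFORMATEX *wfx, DWORD target_ms)
{
    DWORD bytes = wfx->nAvgBytesPerSec * target_ms / 1000;
    return bytes - bytes % wfx->nBlockAlign;   /* round down to a block boundary */
}
For 32-bit float stereo, nBlockAlign is 2 channels * 32 bits / 8 = 8 bytes, so this only ever trims a few bytes off the target.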
If you fill the audio buffer with sine samples and play it looped, it will very easily click unless the buffer length is a multiple of the wave's period, as you said... The audible click is in fact a discontinuity in the wave. A more advanced technique is to fill the buffer dynamically: set a callback notification as the buffer pointer advances, and fill the buffer with the appropriate data at the appropriate offset. I would use a larger buffer, as 2205 samples is too short to get an async notification, calculate the data, and write the buffer, all while playing; but that depends on the CPU power.
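To make the looping point concrete, here is a small hedged sketch: with a whole number of cycles per buffer, the last sample wraps seamlessly to the first. 441 Hz is chosen only because it divides 44100 evenly; the names are illustrative:
#include <math.h>

#define SAMPLE_RATE 44100
#define FREQ        441.0f          /* divides the sample rate: one cycle = 100 samples */
#define CYCLES      50              /* whole cycles per buffer */
#define FRAMES      (CYCLES * 100)  /* 5000 samples; ends exactly on a cycle boundary */

/* Fill buf with FRAMES float samples of a sine that loops without a click. */
void fill_sine(float *buf)
{
    for (int i = 0; i < FRAMES; i++)
        buf[i] = sinf(2.0f * 3.14159265f * FREQ * i / SAMPLE_RATE);
}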

Provide several kernel buffers through mmap

I have a kernel driver which allocates several buffers in kernel space (physically contiguous, aligned to page boundaries, and consisting of an integral number of pages).
Next, I need to make my driver able to mmap some of these buffers to userspace (one buffer per mmap() call, of course). The driver registers a single character device for that purpose.
Userspace program must be able to tell kernel which buffer it wants to mmap (for example, by specifying its index or unique ID, or physical address previously resolved through ioctl()).
I want to do so by using mmap()'s offset parameter, for example (from userspace):
mapped_ptr = mmap(NULL, buf_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, (MAGIC + buffer_id) * PAGE_SIZE);
Where "MAGIC" is some magic number, and buffer_id is the buffer ID which I want to mmap.
Next, in the kernel part there will be something like this:
static int my_dev_mmap(struct file *filp, struct vm_area_struct *vma)
{
    int bufferID = vma->vm_pgoff - MAGIC;
    /*
     * Convert bufferID to PFN by looking through driver's buffer descriptors
     * Check length = vma->vm_end - vma->vm_start
     * Call remap_pfn_range()
     */
}
But I think this is a somewhat dirty way to do it, because "offset" in mmap() is not supposed to specify an index or identifier; its role is to give the number of bytes (or pages) skipped from the beginning of the mmap-ed device (or file) memory (which is supposed to be contiguous, right?).
However, I've already seen some drivers in mainline which use "offset" to distinguish between mmap-ed buffers.
Are there any alternative solutions to this?
P.S.
I need all this just because I'm dealing with an unusual SoC graphics controller, which can operate only on physically contiguous memory buffers aligned to an 8-byte boundary. So I can only allocate such buffers in kernel space and pass them to user space via mmap().
Most of the controller programming (composing instruction batches and pushing them to the kernel driver) is performed in user space.
Also, I can't just allocate a single big chunk of physically contiguous memory, because in that case it would need to be really big (e.g., 16+ MiB) and alloc_pages_exact() would fail.
I don't see anything wrong with using the offset to pass the index in from userspace to your driver. If it bugs you, then just look at your driver as assembling a large buffer out of individual pages that it wants to present to userspace as virtually contiguous, so that the offset really is an offset into this buffer. But really in my opinion there's nothing wrong with doing things this way.
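For what it's worth, a hedged sketch of the handler along those lines might look like this. my_buffers[], MY_BUF_COUNT, and MAGIC are hypothetical driver-side definitions, not taken from any mainline driver:
static int my_dev_mmap(struct file *filp, struct vm_area_struct *vma)
{
    int id = vma->vm_pgoff - MAGIC;          /* the offset carries an ID, not a real offset */
    unsigned long len = vma->vm_end - vma->vm_start;

    if (id < 0 || id >= MY_BUF_COUNT)
        return -EINVAL;
    if (len > my_buffers[id].size)
        return -EINVAL;                      /* don't let userspace map past the buffer */

    return remap_pfn_range(vma, vma->vm_start,
                           my_buffers[id].phys >> PAGE_SHIFT,
                           len, vma->vm_page_prot);
}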
Another alternative, if you can use kernel 3.5 or newer, might be to use the "Contiguous Memory Allocator" (CMA) -- look at <linux/dma-contiguous.h> and drivers/base/dma-contiguous.c for more information. There's also https://lwn.net/Articles/486301/ as a reference but I don't know how much (if anything) changed between that article and getting the code merged into mainline.
Finally, I've chosen to mmap exactly one buffer per opened device file descriptor (struct file in the kernel) and implement control through ioctl(): one IOCTL for allocating a new buffer, one for attaching to an already-allocated buffer with a known ID, and another one to get information about a buffer.
Usually, userspace will mmap() about 10..20 buffers at the same time, so this is a nice and clean solution for this case.
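A hedged sketch of what that ioctl() interface could look like (all names and the 'G' magic are illustrative, not taken from the actual driver):
#include <linux/ioctl.h>
#include <linux/types.h>

struct my_buf_info {
    __u32 id;      /* buffer ID */
    __u32 size;    /* length in bytes */
};

#define MY_IOC_ALLOC  _IOWR('G', 0, struct my_buf_info)  /* allocate a new buffer */
#define MY_IOC_ATTACH _IOW('G', 1, __u32)                /* attach this fd to a buffer ID */
#define MY_IOC_INFO   _IOR('G', 2, struct my_buf_info)   /* query the attached buffer */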

Memory management in Forth

So I'm just learning Forth and was curious if anyone could help me understand how memory management generally works. At the moment I only have (some) experience with the C stack-vs-heap paradigm.
From what I understand, one can allocate in the Dictionary, or on the heap. Is the Dictionary faster/preferred like the stack in C? But unlike in C, there aren't scopes and automatic stack reclamation, so I'm wondering if one only uses the dictionary for global data structures (if at all).
As far as the heap goes, is it pretty much like C? Is heap management a standard (ANS) concept, or is it implementation-defined?
It is not "dictionary or heap": the equivalent of the heap is the dictionary. However, it comes with the severe limitation that it acts more like a stack than a heap: new words are added to the end of the dictionary (allocation by ALLOT), and freeing (by FORGET or FREE) frees all newer words as well, acting more like multiple POPs.
An implementation can control the memory layout and thus implement a traditional heap (or garbage collection). An example is A FORTH implementation of the Heap Data Structure for Memory Management (1984). Another implementation is Dynamic Memory Heaps for Quartus Forth (2000).
A lot is implementation-dependent or comes from extensions. For instance, the memory layout often puts the two block buffers (located via BLOCK), the text input buffer (TIB), and the values and low-level/primitive functions of the language in the lowest portion; the dictionary in the middle (growing upwards); and the return stack and the parameter stack at the top.1
The address of the first available byte above the dictionary is returned by HERE (it changes as the dictionary expands).
There is also a scratchpad area above the dictionary (address returned by PAD) for temporarily storing data. The scratchpad area can be regarded as free memory.
The preferred mode of operation is to use the stack as much as possible instead of local variables or a heap.
1 p. 286 (about a particular edition of Forth, MMSFORTH) in chapter "FORTH's Memory, Dictionary, and Vocabularies", Forth: A text and a reference. Mahlon G. Kelly and Nicholas Spies. ISBN 0-13-326349-5 / 0-13-326331-2 (pbk.). 1986 by Prentice-Hall.
The fundamental question may not have been answered in a way that a new Forth user would require so I will take a run at it.
Memory in Forth can be very target-dependent, so I will limit the description to the simplest model, that being a flat memory space where code and data live together happily (as opposed to segmented memory models, FLASH memory for code and RAM for data, or other more complicated models).
The Dictionary typically starts at the bottom of memory and is allocated upwards by the Forth system. The two stacks, in a simple system, would exist in high memory and typically have two CPU registers pointing to them. (Very system dependent.)
At the most fundamental level, memory is allocated simply by changing the value of the dictionary pointer variable. (sometimes called DP)
The programmer does not typically access this variable directly but rather uses some higher level words to control it.
As mentioned the Forth word HERE returns the next available address in the dictionary space. What was not mentioned was that HERE is defined by fetching the value of the variable DP. (system dependency here but useful for a description)
In Forth HERE might look like this:
: HERE ( -- addr) DP @ ;
That's it.
To allocate some memory we need to move HERE upwards and we do that with the word ALLOT.
The Forth definition for ALLOT simply takes a number from the parameter stack and adds it to the value in DP. So it is nothing more than:
: ALLOT ( n --) DP +! ; \ '+!' adds n to the contents of the variable DP
ALLOT is used by the FORTH system when we create a new definition so that what we created is safely inside 'ALLOTed' memory.
Something that is not immediately obvious is that ALLOT can take a negative number, so it is possible to move the dictionary pointer up or down. So you could allocate some memory like this:
HEX 100 ALLOT
And free it up like this:
HEX -100 ALLOT
All this to say that this is the simplest form of memory management in a Forth system. An example of how this is used can be seen in the definition of the word BUFFER:
: BUFFER: ( n --) CREATE ALLOT ;
BUFFER: "creates" a new name in the dictionary (create uses allot to make space for the name by the way) then ALLOTs n bytes of memory right after the name and any associated housekeeping bytes your Forth system might use
So now to allocate a block of named memory we just type:
MARKER FOO \ mark where the memory ends right now
HEX 2000 BUFFER: IN_BUFFER
Now we have an 8K byte buffer called IN_BUFFER. If we wanted to reclaim that space in Standard Forth, we could type FOO and everything allocated in the Dictionary after FOO would be removed from the Forth system.
But if you want temporary memory space, EVERYTHING above HERE is free to use!
So you can simply point to an address and use it if you want to like this
: MYMEMORY here 200 + ; \ MYMEMORY points to un-allocated memory above HERE
\ MYMEMORY moves with HERE. be aware.
MYMEMORY HEX 1000 ERASE \ fill it with 4K bytes of zero
Forth has typically been used for high-performance embedded applications where dynamic memory allocation can cause unreliable code, so static allocation using ALLOT was preferred. However, bigger systems have a heap and use ALLOCATE, FREE and RESIZE much as we use malloc() etc. in C.
Peter Mortensen laid it out very well. I'll add a few notes that might help a C programmer some.
The stack is closest to what C terms "auto" variables, and what are commonly called local variables. You can give your stack values names in some forths, but most programmers try to write their code so that naming the values is unnecessary.
The dictionary can best be viewed as "static data" from a C programming perspective. You can reserve ranges of addresses in the dictionary, but in general you will use ALLOT and related words to create static data structures and pools which do not change size after allocation. If you want to implement a linked list that can grow in real time, you might ALLOT enough space for the link cells you will need, and write words to maintain a free list of cells you can draw from. There are naturally implementations of this sort of thing available, and writing your own is a good way to hone pointer management skills.
Heap allocation is available in many modern Forths, and the standard defines ALLOCATE, FREE and RESIZE words that work in a way analogous to malloc(), free(), and realloc() in C. Where the bytes are allocated from will vary from system to system. Check your documentation. It's generally a good idea to store the address in a variable or some other more permanent structure than the stack so that you don't inadvertently lose the pointer before you can free it.
As a side note, these words (along with the file i/o words) return a status on the stack that is non-zero if an error occurred. This convention fits nicely with the exception handling mechanism, and allows you to write code like:
variable PTR
1024 allocate throw PTR !
\ do some stuff with PTR
PTR @ free throw
0 PTR !
Or for a more complex if somewhat artificial example of allocate/free:
\ A simple 2-cell linked list implementation using allocate and free
: >link ( a -- a ) ;
: >data ( a -- a ) cell + ;
: newcons ( a -- a ) \ make a cons cell that links to the input
2 cells allocate throw tuck >link ! ;
: linkcons ( a -- a ) \ make a cons cell that gets linked by the input
0 newcons dup rot >link ! ;
: makelist ( n -- a ) \ returns the head of a list of the numbers from 0..n
0 newcons dup >r
over 0 ?do
i over >data ! linkcons ( a -- a )
loop >data ! r> ;
: walklist ( a -- )
begin dup >data ? >link @ dup 0= until drop ;
: freelist ( a -- )
begin dup >link @ swap free throw dup 0= until drop ;
: unittest 10 makelist dup walklist freelist ;
Some Forth implementations support local variables on the return-stack frame, as well as allocating memory blocks. For example, in SP-Forth:
lib/ext/locals.f
lib/ext/uppercase.f
100 CONSTANT /buf
: test ( c-addr u -- ) { \ len [ /buf 1 CHARS + ] buf }
buf SWAP /buf UMIN DUP TO len CMOVE
buf len UPPERCASE
0 buf len + C! \ just for illustration
buf len TYPE
;
S" abc" test \ --> "ABC"
With Forth you enter a different world.
In a typical Forth like ciforth on Linux (and assuming 64 bits) you can configure your Forth to have a linear memory space that is as large as your swap space (e.g. 128 Gbyte). That is yours to fill in with arrays, linked lists, pictures, whatever. You do this interactively, typically by declaring variables and including files. There are no restrictions. Forth only provides you with a HERE pointer to help you keep track of memory you have used up. Even that you can ignore, and there is even a word in the 1994 standard that provides scratch space that floats in the free memory (PAD).
Is there something like malloc() and free()? Not necessarily. In a small kernel of a couple of dozen kilobytes, no. But you can just include a file with ALLOCATE / FREE and set aside a couple of Gbyte to use for dynamic memory.
As an example, I'm currently working with TIFF files. A typical 140 Mbyte picture takes a small chunk out of the dictionary, advancing HERE.
Rows of pixels are transformed, decompressed, etc. For that I use dynamic memory, so I ALLOCATE space for the decompression result of a row. I have to manually FREE them again when the results have been used up for another transformation. It feels totally different from C. There is more control and more danger.
Regarding your question about scopes: in Forth, if you know the address, you can access the data structure, even if you jotted F7FFA1003 down on a piece of paper. Trying to make programs safer through separate name spaces is not prominent in Forth style. So-called wordlists (see also VOCABULARY) provide facilities in that direction.
There's a little elephant hiding in a big FORTH memory management room, and I haven't seen too many people mention it.
The canonical FORTH has, at the very least, a non-addressable parameter stack. This is the case in all FORTH hardware implementations I'm aware of (usually originating with Chuck Moore) that have a hardware parameter stack: it's not mapped into the addressable memory space.
What does "non-addressable" mean? It means: you can't have pointers to the parameter stack, i.e. there are no means to get addresses of things on that stack. The stack is a "black box" that you can only access via the stack API (opcodes if it's a hardware stack), without bypassing it - and only that API will modify its contents.
This implies no aliasing between parameter stack and memory accesses using pointers - via @ and ! and the like. This enables efficient code generation with small effort, and indeed it makes decent generated code in FORTH systems orders of magnitude easier to obtain than with C and C++.
This of course breaks down when pointers can be obtained to the parameter stack. A well designed system would probably have guarded API for such access, since within the guards the code generator has to spill everything from registers to stack - in absence of full data flow analysis, that is.
DFA and other "expensive" optimization techniques are not of course impossible in FORTH, it's just that they are a bit larger in scope than many a practical FORTH system. They can be done very cleanly in spite of that (I'm using CFA, DFA and SSA optimizations in an in-house FORTH implementation, and the whole thing has less source code, comments included, than the utility classes in LLVM... - classes that are used all over the place, but that don't actually do anything related to compiling or code analysis).
A practical FORTH system can also place aliasing limitations on the return stack contents, namely that the return addresses themselves don't alias. That way control flow can be analyzed optimistically, only taking into account explicit stack accesses via R@, >R and R>, while letting you place addressable local variables on that stack - that's typically done when a variable is larger than a cell or two, or would be awkward to keep around on the parameter stack.
In C and C++, aliasing between automatic "local" variables and pointers is a big problem, because only large compilers with big optimizers can afford to prove lack of aliasing and forgo register reloads/spills when intervening pointer dereferences take place. Small compilers, to remain compliant and not generate broken code, have to pessimize and assume that accesses via char* alias everything, and accesses via Type* alias that type and others "like it" (e.g. derived types in C++). That char* aliases all things in C is a prime example of where you pay a big price for a feature you didn't usually intend to use.
Usually, forcing an unsigned char type for characters, and re-writing the string API using this type, lets you not use char* all over the place and lets the compiler generate much better code. Compilers of course add lots of analysis passes to minimize the fallout from this design fiasco... And all it'd take to fix in C is having a byte type that aliases every other type, and is compatible with arbitrary pointers, and has the size of the smallest addressable unit of memory. The reuse of void in void* to mean "pointer to anything" was, in hindsight, a mistake, since returning void means returning nothing, whereas pointing to void absolutely does not mean "pointing to nothing".
My idea is published at https://sites.google.com/a/wisc.edu/memorymanagement
I'm hoping to put Forth code on GitHub soon.
If you have an array (or several) with each array having a certain number of items of a certain size, you can pair a single-purpose stack to each array. The stack is initialized with the address of each array item. To allocate an array item, pop an address off the stack. To deallocate an array item, push its address onto the stack.
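In C terms, that idea is roughly the following sketch (names and sizes are illustrative):
#include <stddef.h>

#define NITEMS 64

typedef struct { char payload[32]; } item;

static item  pool[NITEMS];          /* the array itself */
static item *free_stack[NITEMS];    /* single-purpose stack of free addresses */
static int   top;                   /* number of free items on the stack */

/* Initialize: push the address of every array item. */
void pool_init(void)
{
    for (top = 0; top < NITEMS; top++)
        free_stack[top] = &pool[top];
}

item *pool_alloc(void)   { return top > 0 ? free_stack[--top] : NULL; }  /* pop */
void  pool_free(item *p) { free_stack[top++] = p; }                      /* push */
Allocation and deallocation are both O(1), and the stack can never overflow because it only ever holds addresses from its paired array.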
