This is a homework question from compiler design course. I just need an explanation of certain parts of the question.
It is claimed that returning blocks to the standard memory manager
would require much administration. Why
is it not enough to have a single
counter per block, which holds the
number of busy records for that block,
and to return the block when it
reaches 0?
The context it appears in talks about linked lists.
The answer from the answer sheet states:
How do you find this counter starting
from the pointer to the record and how
do you get the pointer by which to
return the block?
Coming from a C-based background, could someone explain to me:
what a block is?
what the counter does?
what a busy record is?
A reference to documents that provide a walk-through of what happens during this counting phase. Diagrams would be helpful.
Thanks.
I think it may help if I change some terms, to better explain what I am guessing is going on.
If you have a page of memory, we can say a page is 8k in size. This is the minimum size that is allocated by the memory manager.
You have 10 requests of 100 bytes each, so 1000 bytes are in various places on the page.
The counter would be 10, but how do you know what has actually been freed or what is still allocated? The 10 requests may not be contiguous, since there may have been other requests on the page that have already been freed.
So, we have 10 busy records.
Now, you would need to come up with your own answers to the question in the answer sheet, but hopefully looking at an example makes it simpler.
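To make that concrete, here is a minimal C sketch of what such a per-block counter could look like (the 8 KB size, the names, and the alignment trick are all assumptions for illustration, not something from the book): the counter lives in a header at the start of each block, and the answer sheet's hint is exactly that, given only a pointer to a record somewhere inside the block, you need some extra convention, such as allocating blocks on block-size-aligned addresses, to find that header and the block's own address again.

#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 8192   /* the 8 KB page from the example above */

struct block_header {
    size_t busy_records;  /* number of records in this block still in use */
    /* free-list pointers, size information, etc. would also live here */
};

/* Works only if every block starts on a BLOCK_SIZE-aligned address:
   mask off the low bits of any record pointer to get back to the header. */
static struct block_header *header_of(void *record)
{
    return (struct block_header *)((uintptr_t)record & ~(uintptr_t)(BLOCK_SIZE - 1));
}

/* Freeing a record then becomes: decrement busy_records, and when it hits 0
   the whole block (whose address is the header itself) can be returned. */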
A "block" most likely is a basic block.
I'm not familiar with the term "busy record"; most likely, it refers to some data flow analysis result for variables (i.e. variables might be considered "busy"). Several definitions seem plausible:
a variable may be considered "busy" if it holds a value (i.e. has been "written") which also will be "read" (meaning that you can't eliminate the variable easily)
a variable may be considered "busy" if it is used "often" (more often than other variables), meaning that you should try to allocate it to a register.
However, you should really find out how the term was defined in your course.
Then, the counter would count, per basic block, the number of variables that are busy. Why that counter may become 0 after some processing is unclear - most likely, "busy" has yet another meaning in your course.
block is? The manager divides the memory space into blocks. One or more blocks make up a memory area that the user can access contiguously. If more memory is required, the manager adds extra block(s) to that memory area, always trying to give the user contiguous blocks.
the counter does? A specific block may be used by different users, that is, the memory area is shared by multiple users, and the counter tracks that usage.
a busy record is? The count value stored in the "counter" above.
for example:
struct block {
    struct block *next;   /* next block in the list */
    long counter;         /* the busy record */
};
EDIT: changing "area" to "user"
struct user {
    struct block *head;
    ...
};
EDIT: answer the question "why is a counter not enough for a block?"
You need more information when moving a block from a "free block list" to an "allocated block list" or vice versa, e.g. an ordering used to locate a position in a list quickly. Though I am just guessing on this point.
The question is more about how DRAM works.
(speaking in C terms) If I have a local(located on stack) variable and a global(static or dynamically allocated) variable, which one would be accessed faster?
Considering that neither one is cached or placed in a register!
So the actual question is whether it is faster to retrieve data that is close to a previously touched area than to retrieve data that is in a completely different place, say, where the row address and column address differ from the previous access.
If there's indeed a difference in access times, why?
There’s no difference in general. DRAM works the same whether a given address is on the stack or the heap. In practice, there are several cases where a local variable is often faster:
The first few bytes of the stack are practically always in the cache, and the first time you access a static variable, it probably will not be.
Compilers can often statically analyze the lifetime of a local variable and optimize it into a register, eliminating the memory access entirely, whereas a global variable usually must be loaded and stored, because another part of the program might have changed it before and could refer to it later.
On many architectures, the machine instruction to access a memory location relative to the stack pointer is more efficient than the machine instructions to access an arbitrary static address.
Complicating things is that “local/global” is probably not the distinction you really mean. For example, many languages have “static local” variables that are implemented like globals, but lexically local, and “thread-local” variables that are lexically non-local but stored on the stack. And if you pass a locally-allocated variable by reference far down the call chain, it will eventually fall out of the cache and behave exactly like a global.
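To make the second case above concrete (the compiler keeping a local in a register), here is a small C sketch; the names are just for illustration and nothing here comes from the original answer. A decent compiler will usually keep the local accumulator in a register for the whole loop, while the global generally has to be loaded and stored through memory because other code may observe it.

int g_total;                      /* global: usually read/written through memory */

int sum(const int *a, int n)
{
    int s = 0;                    /* local: typically lives in a register */
    for (int i = 0; i < n; i++)
        s += a[i];
    g_total += s;                 /* this update generally needs real loads/stores */
    return s;
}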
So the question was
whether it is faster to retrieve data that is close to previously touched
area than to retrieve data that is in completely different place
The answer is yes, it is faster.
TL;DR: DRAM has a buffer (a cache, if you please, though it's not really a cache).
The reason for that is how DRAM works.
A SIMM is 1 or 2 ranks that consist of multiple DRAM chips (ICs).
Each IC consists of multiple banks (rows of bytes + row/column decoder + row buffer).
If ICs are numbered 0 through K, banks 0 through M, and rows 0 through N, then rows (0, m, n), (1, m, n) ... (K, m, n) constitute a memory page (data of successive addresses).
(a common case) If a given SIMM has 8 ICs per rank and a bank has 1024 columns (each is a byte), a memory page (or the overall buffered memory) is 8 KB in size.
With that said, if you access an address that is on the same memory page as the last address requested for that same bank, only the column decoder needs to be engaged, which is roughly 2 times faster than when the address is on a different page. Note: the 2x difference is only relative to DRAM itself, not to the overall time to get the data to the CPU, which would still be on the order of 100 ns.
There are a lot more details to be added, but I'm not proficient enough to do that.
P.S. This topic is not widely discussed, and all of the above is just a very short overview of what made sense to me from examining some not very well written information.
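As a rough illustration of the locality effect, a toy benchmark like the one below (POSIX clock_gettime; the array size and the stride constant are arbitrary choices) will show sequential access running much faster than scattered access over the same array. Be aware that on a real machine the CPU caches and the hardware prefetcher dominate what you measure here, not the DRAM row buffer alone, so treat it as a demonstration of "data close to previously touched data is cheaper", not a DRAM-only measurement.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64LL * 1024 * 1024)      /* 64 Mi ints, far larger than any cache */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (long long i = 0; i < N; i++) a[i] = (int)i;

    volatile long long sink = 0;
    double t0 = seconds();
    for (long long i = 0; i < N; i++) sink += a[i];              /* sequential: good locality */
    double t1 = seconds();
    for (long long i = 0; i < N; i++) sink += a[(i * 4099) % N]; /* scattered: poor locality */
    double t2 = seconds();

    printf("sequential %.3f s, scattered %.3f s\n", t1 - t0, t2 - t1);
    free(a);
    return 0;
}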
I have a conceptual doubt about the way the Linux kernel manages free blocks. Here is what I have interpreted from my reading so far.
The buddy allocator is an allocation scheme that combines a normal power-of-2 allocator with free-block coalescing.
When we need a block of a size that is not available, it divides a larger block into two. Those two blocks are buddies, which is probably why it is called the buddy allocator.
Through a source I learnt that an array of free_area_t structs is maintained, one for each order, each pointing to a linked list of blocks of pages that are free.
Which I found in <linux/mm.h>
typedef struct free_area_struct {
    struct list_head free_list;
    unsigned long *map;
} free_area_t;
The free_list appears to be a linked list of page blocks. My question is whether it is a list of free pages or used pages.
And map appears to be a bitmap that represents the state of a pair of buddies.
My question is: how can a single bit hold the state for a pair of buddies? If I use one block of a buddy pair for an allocation and leave the other free, what would the state be, and how can that be stored in a single bit? Does the bit represent the entire power-of-two block, which can be divided into two parts when we need a block size that is not available, so the allocated half is the buddy of the other, free half? If one half is allocated and the other remains free, what will the value of map be? What if both are free? What if both are allocated? How can a binary value represent three states of a block?
Edit: After further reading, the first doubt is cleared. The source says: if a free block of the requested order cannot be found, a higher-order block is split into two buddies; one is allocated and the other is placed on the free list for the lower order. So it is a linked list of free pages.
map represents the state of a single memory block at the lowest level.
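For what it's worth, in the 2.4-era allocator that Mel Gorman's "Understanding the Linux Virtual Memory Manager" describes, the bit does not need to encode three states: it covers a buddy pair and is simply toggled on every allocation and every free of either buddy. So bit 0 means "both free or both allocated" and bit 1 means "exactly one of the pair is in use", and at free time, if the bit reads 0 after toggling, the buddy must be free as well and the two can be coalesced. Here is a rough sketch of that logic using the free_area_t from the snippet above; the function and the bit arithmetic are simplified placeholders, not the kernel's real code.

/* Sketch only: 'idx' is the block's index at this order, so idx >> 1 is
   the buddy pair it belongs to. */
void free_block(unsigned long idx, int order, free_area_t *area)
{
    unsigned long pair = idx >> 1;                      /* two buddies share one bit */
    unsigned long word = pair / (8 * sizeof(unsigned long));
    unsigned long bit  = pair % (8 * sizeof(unsigned long));

    area[order].map[word] ^= 1UL << bit;                /* toggle on every alloc/free */

    if (!(area[order].map[word] & (1UL << bit))) {
        /* bit is now 0: the buddy is free as well, so remove the buddy from
           this order's free list and insert the merged block one order up */
    } else {
        /* bit is 1: the buddy is still allocated, so just put this block on
           the free list for the current order */
    }
}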
The OP here mentions in the final post (4th or so para from bottom):
"Now one thing that always bothered me about this is all the child
pointer checking. There are usually a lot of null pointers, and
waiting on memory access to fill the cache with zeros just seems
stupid. Over time I added a byte that contains a 1 or 0 to tell if
each of the pointers is NULL. At first this just reduced the cache
waste. However, I've managed cram 9 distance comparisons, 8 pointer
bits, and 3 direction bits all through some tables and logic to
generate a single switch variable that allows the cases to skip the
pointer checks and only call the relevant children directly. It is in
fact faster than the above, but a lot harder to explain if you haven't
seen this version."
He is referring to octrees as the data structure for real-time volume rendering. These would be allocated on the heap, due to their size. What I am trying to figure out is:
(a) Are his assumptions in terms of waiting on memory access, valid? My understanding is that he's referring to waiting on a full run out to main memory to fetch data, since he's assuming it won't be found in the cache due to generally not-too-good locality of reference when using dynamically-allocated octrees (common for this data structure in this sort of application).
(b) Should (a) prove to be true, I am trying to figure out how this workaround
Over time I added a byte that contains a 1 or 0 to tell if each of the
pointers is NULL.
would be implemented without still using the heap, and thus still incurring the same overhead, since I assume it would need to be stored in the octree node.
(a) Yes, his concerns about memory wait time are valid. In this case, he seems to be worried about the size of the node itself in memory; just the children take up 8 pointers, which is 64 bytes on a 64-bit architecture, or one cache line just for the children.
(b) That bitfield is stored in the node itself, but now takes up only 1 byte (1 bit for 8 pointers). It's not clear to me that this is an advantage though, as the line(s) containing the children will get loaded anyway when they are searched. However, he's apparently doing some bit tricks that allow him to determine which children to search with very few branches, which may increase performance. I wish he had some benchmarks that would show the benefit.
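A hedged sketch of the idea (the names and layout here are mine, not the linked post's): keep one mask byte per node whose bits say which children exist; the traversal then tests bits of a byte that is already in the cache line it just loaded, instead of dereferencing and testing eight pointers.

#include <stdint.h>

struct octree_node {
    uint8_t child_mask;                 /* bit i set => children[i] is non-NULL */
    struct octree_node *children[8];
};

void visit(struct octree_node *n, void (*fn)(struct octree_node *))
{
    fn(n);
    unsigned mask = n->child_mask;
    while (mask) {
        int i = __builtin_ctz(mask);    /* lowest set bit (GCC/Clang builtin) */
        visit(n->children[i], fn);
        mask &= mask - 1;               /* clear that bit */
    }
}

Whether this wins depends on how often nodes are mostly empty; as noted above, the children's cache line gets loaded anyway once you actually recurse into them.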
Let's say I have an allocation in memory containing a string, "ABCDEFG", but I only have a pointer to the 'E'. Is it possible, on win32, to free that block, given a pointer that is within the block, but not at the start? Any allocation method would work, but a Heap* function would be the path of least resistance.
If not a native solution, have there been any custom memory managers written which offer this feature?
EDIT: This isn't an excuse to be sloppy. I'm developing an automatic memory management system using 100% compile-time metadata. This odd requirement seems to be the only thing standing in the way of getting it working, and even then it's needed only for data types based on arrays (which are slicable).
It would be possible for the memory allocation routines in the runtime library to check a given memory address against the beginning and end of every allocated block. That search accomplished, it would be easy to release the block from the beginning.
Even with clever algorithms behind it, this would incur some kind of search with each memory deallocation. And why? Just to support erroneous programs too stupid to keep track of the beginning of the blocks of memory they allocated?
The standard C idiom thrives on treating blocks of allocated memory like arrays. The pointer returned from *alloc is a pointer to the beginning of an array, and the pointer can be used with subscripts to access any element of that array, subscripts starting at 0. This has worked well enough for 40 years that I can't think of a sensible reason to introduce a change here.
I suppose if you know what the malloc() guard blocks look like, you could write a function that backs up from the pointer you pass it until it finds a 'best guess' of the original memory address and then calls free(). Why not just keep a copy of the base pointer around?
If you use VirtualAlloc to allocate memory, you can use VirtualQuery to figure out which block a pointer belongs to. Once you have the base address, you can pass this to VirtualFree to free the entire block.
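Sketching that out (this relies only on documented Win32 behaviour: VirtualQuery fills in the AllocationBase of the region containing the pointer, and VirtualFree with MEM_RELEASE expects exactly that base address and a size of 0):

#include <windows.h>

/* Free a VirtualAlloc'd region given a pointer anywhere inside it. */
BOOL FreeFromInteriorPointer(void *p)
{
    MEMORY_BASIC_INFORMATION mbi;

    if (VirtualQuery(p, &mbi, sizeof mbi) == 0)
        return FALSE;                            /* not a valid address */

    /* AllocationBase is the address VirtualAlloc originally returned. */
    return VirtualFree(mbi.AllocationBase, 0, MEM_RELEASE);
}

Note this only applies to memory allocated with VirtualAlloc (page granularity), not to HeapAlloc blocks.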
Assume you have a reference counted object in shared memory. The reference count represents the number of processes using the object, and processes are responsible for incrementing and decrementing the count via atomic instructions, so the reference count itself is in shared memory as well (it could be a field of the object, or the object could contain a pointer to the count, I'm open to suggestions if they assist with solving this problem). Occasionally, a process will have a bug that prevents it from decrementing the count. How do you make it as easy as possible to figure out which process is not decrementing the count?
One solution I've thought of is giving each process a UID (maybe their PID). Then when processes decrement, they push their UID onto a linked list stored alongside the reference count (I chose a linked list because you can atomically append to head with CAS). When you want to debug, you have a special process that looks at the linked lists of the objects still alive in shared memory, and whichever apps' UIDs are not in the list are the ones that have yet to decrement the count.
The disadvantage to this solution is that it has O(N) memory usage where N is the number of processes. If the number of processes using the shared memory area is large, and you have a large number of objects, this quickly becomes very expensive. I suspect there might be a halfway solution where with partial fixed size information you could assist debugging by somehow being able to narrow down the list of possible processes even if you couldn't pinpoint a single one. Or if you could just detect which process hasn't decremented when only a single process hasn't (i.e. unable to handle detection of 2 or more processes failing to decrement the count) that would probably still be a big help.
(There are more 'human' solutions to this problem, like making sure all applications use the same library to access the shared memory region, but if the shared area is treated as a binary interface and not all processes are going to be applications written by you that's out of your control. Also, even if all apps use the same library, one app might have a bug outside the library corrupting memory in such a way that it's prevented from decrementing the count. Yes I'm using an unsafe language like C/C++ ;)
Edit: In single process situations, you will have control, so you can use RAII (in C++).
You could do this using only a single extra integer per object.
Initialise the integer to zero. When a process increments the reference count for the object, it XORs its PID into the integer:
object.tracker ^= self.pid;
When a process decrements the reference count, it does the same.
If the reference count is ever left at 1, then the tracker integer will be equal to the PID of the process that incremented it but didn't decrement it.
This works because XOR is associative and commutative ((A ^ B) ^ C == A ^ (B ^ C), and A ^ B == B ^ A), so if a process XORs the tracker with its own PID an even number of times, it's the same as XORing it with PID ^ PID - that's zero, which leaves the tracker value unaffected.
You could alternatively use an unsigned value (which is defined to wrap rather than overflow) - adding the PID when incrementing the usage count and subtracting it when decrementing it.
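A minimal C sketch of the XOR scheme, assuming C11 atomics are usable on the shared-memory segment (the struct and function names are illustrative):

#include <stdatomic.h>
#include <unistd.h>                    /* getpid() */

struct shared_obj {
    atomic_long refcount;
    atomic_long tracker;               /* XOR of the PIDs of current holders */
};

void obj_acquire(struct shared_obj *o)
{
    atomic_fetch_add(&o->refcount, 1);
    atomic_fetch_xor(&o->tracker, (long)getpid());
}

void obj_release(struct shared_obj *o)
{
    atomic_fetch_xor(&o->tracker, (long)getpid());
    atomic_fetch_add(&o->refcount, -1);
}

/* If refcount is stuck at 1, tracker holds the PID of the process
   that incremented but never decremented. */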
Fundamentally, shared-memory shared state is not a robust solution, and I don't know of a way of making it robust.
Ultimately, if a process exits all its non-shared resources are cleaned up by the operating system. This is incidentally the big win from using processes (fork()) instead of threads.
However, shared resources are not. File handles that others have open are obviously not closed, and neither is shared memory. Shared resources are only cleaned up after the last process sharing them exits.
Imagine you have a list of PIDs in the shared memory. A process could scan this list looking for zombies, but then PIDs can get reused, or the app might have hung rather than crashed, or...
My recommendation is that you use pipes or other message passing primitives between each process (sometimes there is a natural master-slave relationship, other times all need to talk to all). Then you take advantage of the operating system closing these connections when a process dies, and so your peers get signalled in that event. Additionally you can use ping/pong timeout messages to determine if a peer has hung.
If, after profiling, it is too inefficient to send the actual data in these messages, you could use shared memory for the payload as long as you keep the control channel over some kind of stream that the operating system clears up.
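As a tiny illustration of that point (the names are mine): when the peer process exits or crashes, the kernel closes its end of the pipe or socket, and the survivor's blocking read() returns 0, so peer death is detected without any shared-memory bookkeeping.

#include <unistd.h>

/* Returns when the peer on the other end of read_fd has gone away. */
void wait_for_peer_exit(int read_fd)
{
    char buf[256];
    for (;;) {
        ssize_t n = read(read_fd, buf, sizeof buf);
        if (n == 0)                  /* EOF: peer closed or died */
            return;
        if (n < 0)                   /* error (EINTR could be retried instead) */
            return;
        /* n > 0: a control/ping message arrived; handle it and keep waiting */
    }
}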
The most efficient tracing systems for resource ownership don't even use reference counts, let alone lists of reference-holders. They just have static information about the layouts of every data type that might exist in memory, also the shape of the stack frame for every function, and every object has a type indicator. So a debugging tool can scan the stack of every thread, and follow references to objects recursively until it has a map of all the objects in memory and how they refer to each other. But of course systems that have this capability also have automatic garbage collection anyway. They need help from the compiler to gain all that information about the layout of objects and stack frames, and such information cannot actually be reliably obtained from C/C++ in all cases (because object references can be stored in unions, etc.) On the plus side, they perform way better than reference counting at runtime.
Per your question, in the "degenerate" case, all (or almost all) of your process's state would be held in shared memory - apart from local variables on the stack. And at that point you would have the exact equivalent of a multi-threaded program in a single process. Or to put it another way, processes that share enough memory start to become indistinguishable from threads.
This implies that you needn't specify the "multiple processes, shared memory" part of your question. You face the same problem anyone faces when they try to use reference counting. Those who use threads (or make unrestrained use of shared memory; same thing) face another set of problems. Put the two together and you have a world of pain.
In general terms, it's good advice not to share mutable objects between threads, where possible. An object with a reference count is mutable, because the count can be modified. In other words, you are sharing mutable objects between (effective) threads.
I'd say that if your use of shared memory is complex enough to need something akin to GC, then you've almost got the worst of both worlds: the expense of process creation without the advantages of process isolation. You've written (in effect) a multi-threaded application in which you are sharing mutable objects between threads.
Local sockets are a very cross-platform and very fast API for interprocess communication; the only one that works basically identically on all Unices and Windows. So consider using that as a minimal communication channel.
By the way, are you consistently using smart pointers in the processes that hold references? That's your only hope of getting reference counting even half right.
Use the following (a sketch; find_free_slot is a hypothetical helper and the atomics are GCC-style builtins):

int pids[MAX_PROCS];   /* all zero initially; 0 means "slot free" */
int counter;

Increment:

int i;
do {
    i = find_free_slot(pids);   /* find i such that pids[i] == 0 (optimistic) */
} while (!__sync_bool_compare_and_swap(&pids[i], 0, my_pid));
my_pos = i;
__sync_fetch_and_add(&counter, 1);

Decrement:

pids[my_pos] = 0;
__sync_fetch_and_sub(&counter, 1);
So you know all the processes currently using this object.
Make MAX_PROCS big enough and search for a free slot starting at a random position; if the number of processes is significantly lower than MAX_PROCS, the search will be very fast.
Besides doing things yourself, you can also use a tool like AQTime, which has a reference-counting memory checker.