Memory Management - WorstFit vs. BestFit algorithms

I understand the idea behind both BestFit and WorstFit memory schedulers.
I am wondering which approach yields the lowest time in the job queue.
Since WorstFit slows the rate at which small holes in memory are made, does that mean that it will result in a lower average job queue wait time?

I have discovered the answer. For future viewers: Worst Fit maintains, on average, a lower job queue time. This follows directly from Worst Fit's defining characteristic of always allocating from the largest hole, which leaves the biggest possible leftover.
With a minimal memory compaction approach (only merging empty frames that are adjacent both in memory and in the linked list), Worst Fit postpones the creation of unusable slivers of free memory.
However, with a more complete compaction algorithm (merging frames that are adjacent in memory regardless of their position in the linked list), Worst Fit and Best Fit behave almost identically. Although they choose their frames differently, in both cases the OS works hard enough at compaction to keep producing free spaces large enough to allocate to incoming processes.
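As a rough illustration of the difference before compaction kicks in, here is a toy C++ sketch; the hole-list model and the pickHole name are mine, not from the question or any textbook:
#include <vector>

// Toy model: `holes` is a list of free-hole sizes. Return the index of the
// hole each policy would choose for a request of `size`, or -1 if none fits.
int pickHole(const std::vector<int>& holes, int size, bool worstFit) {
    int best = -1;
    for (int i = 0; i < (int)holes.size(); ++i) {
        if (holes[i] < size) continue;                 // hole too small
        if (best == -1 ||
            (worstFit ? holes[i] > holes[best]         // worst fit: biggest hole
                      : holes[i] < holes[best]))       // best fit: tightest hole
            best = i;
    }
    return best;
}
With holes of sizes {60, 25, 12} and a request of 10, best fit takes the 12-unit hole and immediately leaves a 2-unit sliver, while worst fit takes the 60-unit hole and leaves a 50-unit remainder that later requests can still use. That is exactly the sliver-postponing behaviour described above, and it is also why the difference fades once compaction merges holes aggressively.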

Related

Practical Efficiency of binary search

When searching for an element, or an insertion point, in a sorted array, there are basically two approaches: straight search (element by element) or binary search. From the time complexities O(n) vs O(log(n)) we know that binary search is ultimately more efficient; however, this does not automatically imply that binary search will always be faster than "normal" search.
My question therefore is: can binary search be practically less efficient than "normal" search for low n? If yes, can we estimate the point at which binary search becomes more efficient?
Thanks!
Yes, a binary search can be practically less efficient than a "normal" linear search for small n. However, it is very hard to estimate the point at which a binary search becomes more efficient (if it is possible at all), because it depends heavily on the problem (e.g. data type, search predicate), the hardware (e.g. processor, RAM), and even the dynamic state of the hardware at the moment the search is performed, as well as the actual data in the sorted array on modern systems.
The first reason a binary search can be less efficient is vectorization. Modern processors support SIMD instructions that operate on fairly wide vectors, so a linear search can process many items per cycle, and processors can often execute a few SIMD instructions in parallel per cycle. While linear searches can often be trivially vectorized, that is not the case for binary searches, which are almost inherently sequential. Keep in mind that vectorization is not always possible, nor is it always done automatically by compilers, especially for non-trivial data types (e.g. composite data structures, pointer-based types) or non-trivial search predicates (e.g. ones with conditionals or memory indirections).
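For concreteness, the kind of linear search a compiler can often auto-vectorize looks like the sketch below; this is a generic illustration in C++, not code from the answer, and whether it actually gets vectorized depends on the compiler, the flags and the target:
#include <cstddef>

// Branch-light linear search over a sorted array: returns the index of the
// first element >= key (the insertion point). The loop body is just a compare
// and an add, which compilers can often turn into SIMD instructions.
std::size_t lowerBoundLinear(const int* data, std::size_t n, int key) {
    std::size_t pos = 0;
    for (std::size_t i = 0; i < n; ++i)
        pos += (data[i] < key);   // no early exit, no data-dependent branch
    return pos;
}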
The second reason a binary search can be less efficient is branch predictability. Modern processors try to predict branches ahead of time to avoid pipeline stalls. If the prediction succeeds, the branch is essentially free; otherwise the processor can stall for several cycles (up to dozens). A branch that is always true or always false is easy to predict; a branch taken essentially at random cannot be predicted and causes stalls. Because the array is sorted, the branches in a linear search are easy to predict (each comparison goes the same way until the element is found), while this is clearly not the case for a binary search. As a result, the speed of a search depends on the searched item and on the data inside the sorted array.
The same applies to cache misses and memory fetches: because the latency of RAM is very high compared to executing arithmetic instructions, modern processors contain dedicated hardware prefetching units that try to predict upcoming memory fetches and load the data ahead of time to avoid cache misses. Prefetchers are good at predicting linear/contiguous accesses but very bad at random ones. The memory accesses of a linear search are trivial to prefetch, while those of a binary search look essentially random to most processors. A cache miss during a binary search can stall the processor for many cycles. If the sorted array is already loaded in cache, however, a binary search over it can be much faster.
And that is still not the whole story: using wide SIMD instructions or incurring cache misses can affect the frequency of the computing core, and so the speed of the algorithm. The size of the data type also matters a lot, since memory throughput is limited and strided accesses are slower than contiguous ones. One should also account for the additional complexity of a binary search compared to a linear one (i.e. often more instructions to execute). I have surely missed some important points in the above list.
As a programmer, you may need to define a threshold to choose which algorithm to use. If you really need one, the best solution is to find it automatically using a benchmark or autotuning methods. Practical experimentation shows that, for a fixed context (data type, cache state, etc.), this threshold has shifted over the last decades in favour of linear searches (so the thresholds are generally increasing over time).
My personal advice is not to use a binary search for values of n smaller than 256 / data_type_size_in_bytes with trivial/native data types on mainstream processors. I think it is a good idea to use a binary search when n is bigger than about 1000, or when the data type is non-trivial, or when the predicate is expensive.
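If you do want a concrete cutoff, a hybrid like the sketch below is a common pattern; the findSorted name and the SMALL_N = 64 value are placeholders of mine, to be replaced by whatever your own benchmark says:
#include <algorithm>
#include <cstddef>

// Hybrid lookup in a sorted range: linear scan below a tuned threshold,
// std::lower_bound above it. Returns `last` when the key is absent.
constexpr std::size_t SMALL_N = 64;   // assumed crossover; measure to tune

const int* findSorted(const int* first, const int* last, int key) {
    if (static_cast<std::size_t>(last - first) <= SMALL_N) {
        for (const int* p = first; p != last; ++p)
            if (*p == key) return p;  // simple, predictable, vectorizable
        return last;
    }
    const int* p = std::lower_bound(first, last, key);
    return (p != last && *p == key) ? p : last;
}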

What type of input would slow down execution time of dynamic memory allocators malloc() and free()?

I am curious about calculating Worst Case Execution Time of a real-time system and I am trying to find extreme scenarios to predict a worst case time.
What type of input scenarios would slow down the dynamic memory allocation? Thank you.
The free list being empty would be one case, requiring new memory from the OS. The free list being huge but filled with blocks too small to satisfy the current request would be another: that could mean walking a long list before finding a usable block, or falling back to another way of getting new memory.
So obviously you'd want to design an allocator's data structures to avoid that problem, perhaps by grouping free lists by size, especially when real-time worst case is a concern.
That's just off the top of my head, and not something I've been involved with designing, so it's certainly not an exhaustive list.
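To make the "group free lists by size" idea concrete, here is a minimal C++ sketch assuming power-of-two size classes; the names and structure are illustrative, not taken from any real allocator:
#include <cstddef>
#include <vector>

// Segregated free lists: class c holds blocks whose size is in [2^c, 2^(c+1)),
// so a request only probes classes whose blocks are guaranteed to be big
// enough. The search is bounded by the number of classes, not by how many
// small blocks sit on one big mixed free list.
struct FreeBlock { void* addr; std::size_t size; };

struct SegregatedFreeLists {
    static constexpr int kClasses = 64;              // assumes 64-bit std::size_t
    std::vector<FreeBlock> buckets[kClasses];

    static int classFloor(std::size_t s) {           // largest c with (1 << c) <= s
        int c = 0;
        while (c + 1 < kClasses && (std::size_t{1} << (c + 1)) <= s) ++c;
        return c;
    }
    static int classCeil(std::size_t s) {            // smallest c with (1 << c) >= s
        int c = 0;
        while (c < kClasses && (std::size_t{1} << c) < s) ++c;
        return c;
    }

    void release(FreeBlock b) { buckets[classFloor(b.size)].push_back(b); }

    bool acquire(std::size_t size, FreeBlock& out) {
        for (int c = classCeil(size); c < kClasses; ++c) {
            if (!buckets[c].empty()) {               // every block here is >= size
                out = buckets[c].back();
                buckets[c].pop_back();
                return true;
            }
        }
        return false;                                // fall back to asking the OS
    }
};
The trade-off is that a slightly smaller block in the class below the starting one gets skipped even when it would fit; that is the usual price paid for a search bounded by the number of size classes rather than the number of free blocks.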

Algorithms for memory allocation which produces low fragmentation

I've read Modern Operating Systems, 4th Edition, by Andrew Tanenbaum, which presents some ways to handle memory management (with bitmaps, with linked lists) and some of the algorithms that can be used to allocate memory (first fit, next fit, best fit, worst fit, quick fit); they all differ, but none of them is the best in every respect.
I'm trying to make my own memory allocator that prioritizes, first, external fragmentation as low as possible (blocks of free memory that are too small to be used) and, second, the speed of allocation/deallocation. I implemented worst fit (thinking this would produce as little external fragmentation as possible, because it always chooses the biggest contiguous space of memory when allocating, so the remainder is more likely to be big enough to be used later for another allocation). I implemented it using a list of free spaces sorted in descending order and a set of allocated spaces sorted by address. The complexity for allocation is O(1) plus the cost of keeping the list sorted, and for deallocation O(log n1) for finding the address plus O(n2) for walking the list of free spaces and inserting the freed block, where n1 is the number of elements in the set and n2 the number of elements in the list.
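For reference, here is a minimal C++ sketch of that structure as I read the description (the names are mine, and it deliberately skips coalescing of adjacent free blocks on deallocation):
#include <cstddef>
#include <map>
#include <set>

// Free blocks ordered by size, largest first, so worst fit is simply "take
// the front". Allocated blocks are keyed by address for O(log n) lookup.
struct Block { std::size_t addr, size; };
struct BySizeDesc {
    bool operator()(const Block& a, const Block& b) const { return a.size > b.size; }
};

struct WorstFitAllocator {
    std::multiset<Block, BySizeDesc> freeBlocks;    // descending by size
    std::map<std::size_t, std::size_t> allocated;   // addr -> size

    bool allocate(std::size_t size, std::size_t& addr) {
        if (freeBlocks.empty() || freeBlocks.begin()->size < size) return false;
        Block biggest = *freeBlocks.begin();        // worst fit: the largest hole
        freeBlocks.erase(freeBlocks.begin());
        addr = biggest.addr;
        allocated[addr] = size;
        if (biggest.size > size)                    // re-insert the leftover piece
            freeBlocks.insert({biggest.addr + size, biggest.size - size});
        return true;
    }

    void deallocate(std::size_t addr) {
        auto it = allocated.find(addr);             // O(log n) by address
        if (it == allocated.end()) return;
        freeBlocks.insert({addr, it->second});      // no coalescing in this sketch
        allocated.erase(it);
    }
};
Using a balanced tree (std::multiset here) instead of a plain sorted list also keeps the cost of maintaining the size ordering at O(log n) per operation in this sketch.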
I have multiple questions. First is how can I improve the algorithm? Second, what other algorithms used for memory allocation exists that will prioritize the fragmentation of memory? Third, are there any improved versions of the algorithms that I listed that will prioritize the fragmentation of memory? I want to know as many algorithms/methods of improving the algorithms that I know, that will reduce the external fragmentation.

Suitability of parallel computation for comparisons over a large dataset

Suppose the following hypothetical task:
I am given a single integer A (say, 32-bit) and a large array of integers B (same type). The size of the array is fixed at runtime (it doesn't grow mid-run) but otherwise arbitrary, except that it always fits inside either RAM or VRAM (whichever is smaller). For the sake of this scenario, the array can sit in either RAM or VRAM; ignore any time cost in transferring this initial data set at start-up.
The task is to compare A against each element of B and return true only if the test is true against ALL of them, returning false otherwise. For the sake of this scenario, let the test be the greater-than comparison (although I'd be interested if your answer changes for slightly more complex comparisons).
A naïve parallel implementation could involve slicing up the set B and distributing the comparison workload across multiple cores. Each core's workload would then be entirely independent, save for when a failed comparison interrupts all the others, since the result is immediately false. Interrupts play a role in this implementation, although I'd imagine an ever-decreasing one, probabilistically, as the array of integers gets larger.
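To make that naïve slicing concrete, here is one possible CPU-side sketch in C++; the thread count and the relaxed atomic flag standing in for the "interrupt" are assumptions, not a tuned implementation:
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Naive slicing: each thread checks its chunk of B against A and raises a
// shared flag as soon as one comparison fails, so the other threads stop early.
bool greaterThanAll(int A, const std::vector<int>& B, unsigned numThreads = 4) {
    std::atomic<bool> failed{false};
    std::vector<std::thread> workers;
    std::size_t chunk = (B.size() + numThreads - 1) / numThreads;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(B.size(), begin + chunk);
            for (std::size_t i = begin; i < end && !failed.load(std::memory_order_relaxed); ++i)
                if (!(A > B[i])) failed.store(true, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return !failed.load();
}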
My question is three-fold:
Would such a scenario be suitable for parallel processing on a GPU? If so, under what circumstances? Or is this a misleading case where a straightforward CPU implementation is actually the fastest?
Can you suggest an improved parallel algorithm over the naïve one?
Can you suggest any reading to gain intuition on deciding such problems?
If I understand your questions correctly, what you are trying to perform is a reduction. The operation in question is equivalent to a MATLAB/NumPy all(A > B). To answer the three parts:
Yes. Reductions on GPUs/multicore CPUs can be faster than their sequential counterparts. See the presentation on GPU reductions here.
The presentation should give you a hierarchical approach to reduction. A more modern approach would be to use atomic operations on shared memory and global memory, as well as warp aggregation. However, if you do not wish to deal with the intricate details of GPU implementations, you can use a highly optimized library such as CUB. (A CPU-side sketch of the same reduction pattern follows after this answer.)
See 1 and 2.
Good luck! Hope this helps.
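As promised above, a CPU-side sketch of the same reduction pattern, assuming OpenMP is available (compile with -fopenmp); the GPU versions in the slides follow the same slice-then-combine shape, just with shared memory and warps:
#include <cstddef>
#include <vector>

// Logical-AND reduction over "A > B[i]". Each OpenMP thread reduces its own
// slice and the partial results are combined at the end.
bool allGreater(int A, const std::vector<int>& B) {
    int ok = 1;
#pragma omp parallel for reduction(&& : ok)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(B.size()); ++i)
        ok = ok && (A > B[i]);
    return ok != 0;
}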
I think this is a situation where you'll derive minimal benefit from the use of a GPU. I also think this is a situation where it'll be difficult to get good returns on any form of parallelism.
Comments on the speed of memory versus CPUs
Why do I believe this? Behold: the performance gap (in terrifyingly unclear units).
The point here is that CPUs have gotten very fast. And, with SIMD becoming a thing, they are poised to become even faster.
In the meantime, memory is getting faster much more slowly. Not shown on the chart are the memory buses, which ferry data to and from the CPU. Those are also getting faster, but at a slow rate.
Since RAM and hard drives are slow, CPUs try to store data in "little RAMs" known as the L1, L2, and L3 caches. These caches are super-fast, but super-small. However, if you can design an algorithm to repeatedly use the same memory, these caches can speed things up by an order of magnitude. For instance, this site discusses optimizing matrix multiplication for cache reuse. The speed-ups are dramatic:
The speed of the naive implementation (3Loop) drops precipitously for everything above a 350x350 matrix. Why is this? Because double-precision numbers (8 bytes each) are being used, this is the point at which the 1 MB L2 cache on the test machine gets filled. All the speed gains you see in the other implementations come from strategically reusing memory so this cache doesn't empty as quickly.
Caching in your algorithm
Your algorithm, by definition, does not reuse memory. In fact, it has the lowest possible rate of memory reuse. That means you get no benefit from the L1, L2, and L3 caches. It's as though you've plugged your CPU directly into the RAM.
How do you get data from RAM?
Picture a simplified diagram of a CPU: each core has its own dedicated L1 cache, core pairs share L2 caches, and RAM is shared between everyone and accessed via a bus.
This means that if two cores want to get something from RAM at the same time, only one of them is going to be successful. The other is going to be sitting there doing nothing. The more cores you have trying to get stuff from RAM, the worse this is.
For most code, the problem's not too bad since RAM is accessed infrequently. But for your code, the performance gap I talked about earlier, coupled with your algorithm's un-cacheable design, means that most of your code's time is spent getting stuff from RAM. That means the cores are almost always in conflict with each other over limited memory bandwidth.
What about using a GPU?
A GPU doesn't really fix things: most of your time will still be spent pulling stuff from RAM, except that rather than having one slow bus (from the CPU to RAM), you now have two (the second being the bus between the CPU and the GPU).
Whether you get a speed-up depends on the relative speeds of the CPU, the CPU-GPU bus, and the GPU. I suspect you won't get much of one, though. GPUs are good for SIMD-type operations, or maps. The operation you describe is a reduction or fold, which does not parallelize as naturally as a map. And since your mapped function (the comparison) is extremely simple, the GPU will spend most of its time on the reduction itself.
tl;dr
This is a memory-bound operation: more cores and GPUs are not going to fix that.
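As a rough worked example of what memory-bound means here (the numbers are assumptions, not measurements): an array of 2^28 32-bit values is about 1 GiB. At a sustained memory bandwidth of, say, 25 GB/s, just streaming it through the CPU once takes roughly 1 GiB / 25 GB/s ≈ 40 ms, and no number of extra cores can push the scan below that bound, because each comparison costs far less than the load that feeds it.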
"ignore any time cost in transferring this initial data set at start-up"
If there are only a few false conditions in millions or billions of elements, you can try an OpenCL kernel along these lines:
// A = 5 and B = arr; result[0] must be initialised to zero before launch
__kernel void check_all(__global const int* arr, __global int* result)
{
    int id = get_global_id(0);
    if (arr[id] != 5)              // the condition failed for this element
        atomic_add(result, 1);     // count the failures
}
This is about as fast as it gets: result[0] stays zero only if the condition holds for every element.
If you are not sure whether there are only a few falses or millions of them (which would make the atomic operations slow), you can add a single preprocessing pass to reduce the number of falses:
__kernel void reduce_falses(__global const int* arr, __global int* arr2)
{
    int id = get_global_id(0);
    // (a real version would stage the chunk in local/private memory first)
    int anyFalse = 0;
    for (int i = 0; i < 128; ++i)          // scan this work-item's 128-element chunk
        anyFalse |= (arr[id * 128 + i] != 5);
    for (int i = 0; i < 128; ++i)          // rewrite the chunk with at most one failing value
        arr2[id * 128 + i] = (anyFalse && i == 0) ? 0 : 5;
}
This copies the whole array to another, but if the time of transferring from the host device can be ignored, this copy can be ignored too. On top of that, the two kernels themselves shouldn't add more than about 1 ms of overhead (not counting the memory reads and writes).
If the data fits in cache, the second kernel (the one with the atomic function) will access it there instead of in global memory.
If the transfer time becomes a concern, you can hide its latency with pipelined upload/compute/download operations, provided the work can be split into independent slices of the array.

Complexity analysis of queues

I have two queues, one is implemented using an array for storage and the other is implemented using a linked list.
I believe that the complexity of the enqueue and dequeue operations looks like this -
LinkedListQueue Enqueue - O(1)
LinkedListQueue Dequeue - O(1)
ArrayQueue Enqueue - O(1)
ArrayQueue Dequeue - O(n)
So I tested both queues by adding and removing strings from them. Here are the results I got, time taken is in milliseconds -
Does my Big Oh complexity analysis stack up with these results? The complexity of LLQueue Enqueue I have is O(1) but as you can see it ranges from 2ms for 1000 strings to 79ms for 100000 strings. Is that expected?
And I have ArrayQueue Dequeue marked down as O(n), do those results look like O(n)? There is a huge jump between 20000 and 50000 strings, but it only doubles in time when dequeuing 100000 strings instead of 50000. That seems a bit odd...
First, I am surprised that your Dequeue operation on ArrayQueue is so expensive. Are you shifting the entire queue contents on each dequeue? There is a much better approach - look up "circular queue" online.
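For reference, a circular queue just keeps head and tail indices that wrap around a fixed-size buffer, so a dequeue is an index bump rather than a shift of every remaining element. A minimal sketch (in C++ rather than whatever language you are using, and without the growing a real implementation might want):
#include <cstddef>
#include <string>
#include <vector>

// Ring-buffer queue: head/tail indices wrap around a fixed-size array,
// so both enqueue and dequeue are O(1) with no element shifting.
class CircularQueue {
    std::vector<std::string> buf;
    std::size_t head = 0, tail = 0, count = 0;
public:
    explicit CircularQueue(std::size_t capacity) : buf(capacity) {}

    bool enqueue(const std::string& s) {
        if (count == buf.size()) return false;   // full (a real one might grow here)
        buf[tail] = s;
        tail = (tail + 1) % buf.size();
        ++count;
        return true;
    }

    bool dequeue(std::string& out) {
        if (count == 0) return false;            // empty
        out = buf[head];
        head = (head + 1) % buf.size();
        --count;
        return true;
    }
};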
Now, onto the measurements. The big-Oh complexity only tells a part of the story. When you start measuring the running times, you will see a variety of complex effects, such as the following:
Caches
It seems like your test enqueues a bunch of elements into a queue and then dequeues them. If the entire queue fits into an L1 cache on your processor, all memory operations will be fast.
If the size of the queue exceeds the L1 cache, then data will need to spill into L2 cache, which results in maybe a 2-3x slowdown. If the size of the queue exceeds the L2 cache, the data will have to go into the main memory and you'll get another big drop in performance - say 5x.
Garbage collection
From the naming of your types, I would guess that you are using a language with garbage collection, like Java or C#. If that's the case, then your program can pause at any time (well, not quite "any time", but that's not important now) and spend some time cleaning up memory.
Other effects
I would guess that the two effects I mentioned above are the ones you'd be most likely to run into in a scenario like yours, but there are many other complex behaviors in the compiler, the VM, the OS, and the hardware that make it hard to interpret performance measurements.
It can be a great learning experience - and also fun - to try to figure out why exactly your performance measurements came out a certain way, but it is certainly not trivial.
