Is it possible for fixed-size partitioning to suffer from external fragmentation?
My instructor said it's not possible and that fixed-size partitioning can only suffer from internal fragmentation. But consider this case: a fixed-size memory of 30 KB, divided into 3 partitions of 10 KB each, with a process of 10 KB residing in the middle partition. Now a new process of 20 KB requires memory, but it can't be allocated because even though the required amount of memory is free, it is not contiguous. Isn't this external fragmentation?
No.
With fixed-size partitioning you can't allocate anything larger than a partition, so even if all partitions were empty the allocation would fail because the request is larger than a partition (20 KiB > 10 KiB).
For allocations that are possible (not larger than a partition), external fragmentation cannot occur (mostly because the wasted space shows up as internal fragmentation instead).
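To make that rule concrete, here is a minimal C sketch of a fixed-size partition allocator (the partition count, sizes, and function names are illustrative, not from the question):

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_PARTITIONS 3
#define PARTITION_SIZE (10 * 1024)   /* 10 KiB per partition */

static bool partition_used[NUM_PARTITIONS];

/* Returns the partition index, or -1 if the request cannot be satisfied. */
static int allocate(size_t request)
{
    if (request > PARTITION_SIZE)
        return -1;  /* rejected outright: a process never spans partitions */

    for (int i = 0; i < NUM_PARTITIONS; i++) {
        if (!partition_used[i]) {
            partition_used[i] = true;
            /* PARTITION_SIZE - request bytes are wasted inside this
               partition: that waste is internal fragmentation. */
            return i;
        }
    }
    return -1;  /* all partitions busy */
}

int main(void)
{
    partition_used[1] = true;                  /* 10 KiB process in the middle */
    printf("20 KiB request -> %d\n", allocate(20 * 1024));  /* -1: too large */
    printf("10 KiB request -> %d\n", allocate(10 * 1024));  /*  0: fits      */
    return 0;
}
```

The 20 KiB request fails outright because no partition could ever hold it, not because free memory happens to be scattered.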
Related
To use cache memory, main memory is divided into cache lines, typically 32 or 64 bytes long. An entire cache line is cached at once. What is the advantage of caching an entire line instead of a single byte or word at a time?
This is done to exploit the principle of locality, spatial locality to be precise. This principle states that bytes that lie close together in memory are likely to be referenced close together in time by a program. This is immediately apparent when accessing large arrays in loops. However, it is not always true (e.g. pointer-based memory access), and hence it is not advisable to fetch data from memory at a granularity larger than a cache line (in case the program does not have locality of reference), since cache is a very limited and valuable resource.
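As a rough illustration (assuming 64-byte cache lines and 4-byte ints, so 16 ints per line), compare a unit-stride array loop with a pointer-chasing loop:

```c
#include <stddef.h>

/* Summing an array with unit stride: after the miss that loads a[i],
   the next 15 elements (with 64-byte lines and 4-byte ints) are already
   in the cache, so roughly 1 miss per 16 accesses. */
long sum_sequential(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Chasing pointers scattered across memory gets no such benefit:
   each node is likely on a different line, so roughly 1 miss per access. */
struct node { int value; struct node *next; };

long sum_list(const struct node *p)
{
    long s = 0;
    for (; p != NULL; p = p->next)
        s += p->value;
    return s;
}
```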
Having the cache block size equal to the smallest addressable size would mean that, if a larger access size is supported, multiple tags would have to be checked for such larger accesses. While parallel tag checking is often used for set-associative caches, a four-fold increase (8-bit blocks compared to 32-bit accesses) in the number of tags to check would increase access latency and greatly increase energy cost. In addition, this introduces the possibility of partial hits for larger accesses, increasing the complexity of sending the data to a dependent operation or internal storage. While data can be speculatively sent by assuming a full hit (so latency need not be hurt by the possibility of partial hits), the complexity budget is better not spent on supporting partial hits.
32-bit cache blocks, when the largest access size is 32 bits, would avoid the above-mentioned issues, but would use a significant fraction of storage for tags. E.g., a 16KiB direct-mapped cache in a 32-bit address space would use 18 bits for the address portion of the tag; even without additional metadata such as coherence state, tags would use 36% of the storage. (Additional metadata might be avoided by having a 16KiB region of the address space be non-cacheable; a tag matching this address region would be interpreted as "invalid".)
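For what it's worth, a small sketch of where the 18 bits and 36% come from (parameters as in the example above):

```c
#include <stdio.h>

int main(void)
{
    /* 16 KiB direct-mapped cache, 4-byte (32-bit) blocks, 32-bit addresses */
    int cache_bytes = 16 * 1024;
    int block_bytes = 4;
    int blocks      = cache_bytes / block_bytes;     /* 4096 blocks         */
    int offset_bits = 2;                             /* log2(block_bytes)   */
    int index_bits  = 12;                            /* log2(blocks)        */
    int tag_bits    = 32 - index_bits - offset_bits; /* 18 bits             */

    double tag_fraction = (double)tag_bits / (tag_bits + 8 * block_bytes);
    printf("%d blocks -> %d index bits, %d tag bits, tag share: %.0f%%\n",
           blocks, index_bits, tag_bits, 100.0 * tag_fraction);  /* ~36% */
    return 0;
}
```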
Besides the storage overhead, having more tag data tends to increase latency (smaller tag storage facilitates earlier way selection) and access energy. In addition, having a smaller number of blocks for a cache of a given size makes way prediction and memoization easier; these are used to reduce latency and/or access energy.
(The storage overhead can be a significant factor when it allows tags to be on chip while data is too large to fit on chip. If data uses a denser storage type than tags (e.g., data in DRAM and tags in SRAM with a four-fold difference in storage density), lower tag overhead becomes more significant.)
If caches only exploited temporal locality (the reuse of a memory location within a "short" period of time), this would typically be the most attractive block size. However, spatial locality of access (accesses to locations near an earlier access often being close in time) is common. Taken control flow instructions are typically less than a sixth of all instructions, and many branches and jumps are short (so the branch/jump target is somewhat likely to be within the same cache block as the branch/jump instruction if each cache block holds four or more instructions). Stack frames are local to a function (concentrating the timing of accesses, especially for leaf functions, which are common). Array accesses often use unit stride or very small strides. Members of a structure/object tend to be accessed nearby in time (conceptually related data tends to be related in action/purpose and so accessed nearer in time). Even some memory allocation patterns bias access toward spatial locality: related structures/objects are often allocated nearby in time, and if the preferred free memory is not fragmented (which is the case if spatially local allocations are also freed nearby in time, if little memory has been freed, or if the allocator is clever in reducing fragmentation), then such allocations are also more likely to be spatially local.
With multiple caches, coherence overhead also tends to be reduced with larger cache blocks (under the assumption of spatial locality). On the other hand, false sharing increases coherence overhead (much as a lack of spatial locality increases capacity and conflict misses).
In this sense, larger cache blocks can be viewed as a simple form of prefetching (even with respect to coherence). Prefetching trades bandwidth and cache capacity for a reduction in latency via cache hits (as well as from increasing the useful queue size and scheduling flexibility). One could gain the same benefit by always prefetching a chunk of memory into multiple small cache blocks, but the capacity benefit of finer-grained eviction would be modest because spatial locality of use is common. In addition, to avoid prefetching data that is already in the cache, the tags for the other blocks would have to be probed to check for hits.
With simple modulo-power-of-two indexing and modest associativity, two spatially nearby blocks are more likely to conflict with and evict earlier other blocks that also have spatial locality (index A and index B will have the same spatial-locality relationship for all addresses mapping to indexes within a larger address range). With LRU-oriented replacement, accesses within a larger cache block reduce the chance of a too-early eviction when spatial locality is common, at the cost of some capacity and conflict misses.
(For a direct-mapped cache, there is no difference between always prefetching a multi-block aligned chunk and using a larger cache block, so paying the extra tag overhead would be pointless.)
Prefetching into a smaller buffer would avoid cache pollution from unused data, increasing the benefit of a smaller block size, but it also reduces the temporal scope of the spatial locality. A four-entry prefetch buffer would only support spatial locality within four cache misses; this would catch most stream-like accesses (rarely will more than four new "streams" be active at the same time) and many other cases of spatial locality, but some spatial locality plays out over a larger span of time.
Mandatory prefetching (whether from larger cache blocks or a more flexible mechanism) provides significant bandwidth advantages. First, the address and request type overhead is spread over a larger amount of data. 32 bits of address and request type overhead per 32 bit access uses 50% of the bandwidth for non-data but less than 12% when 256 bits of data are transferred.
Second, the memory controller processing and scheduling overhead can be more easily averaged over more transferred data.
Finally, DRAM chips can provide greater bandwidth by exploiting internal prefetch. Even in the days of Fast Page Mode DRAM, accesses within the same DRAM page were faster and provided higher bandwidth (less page precharge and activation overhead); while non-mandatory prefetch could exploit this and be more general, the control and communication overheads would be larger. Modern DRAMs have minimum burst lengths (and burst chop merely drops part of the DRAM-chip-internal prefetch; the internal access energy and array occupancy are not reduced).
The ideal cache block size depends on workload ('natural' algorithm choices and legacy optimization assumptions, data set sizes and complexity, etc.), cache sizes and associativity (larger and more associative caches encourage larger blocks), available bandwidth, use of in-cache data compression (which tends to encourage larger blocks), cache block sectoring (where validity/coherence state is tracked at finer granularity than the address), and other factors.
The main advantage of caching an entire line is that the probability of the next access being a cache hit is increased.
From Tanenbaum's "Modern Operating Systems" book:
Cache-hit: When the program needs to read a memory word, the cache hardware checks to see if the line needed is in the cache.
If we don't have a cache hit, then a cache miss occurs and a memory request is sent to the main memory.
As a result, more time will be spent completing the request, since accessing main memory is costly.
We can say that caching an entire line increases the probability that a memory request completes at the cache's fast (roughly two-cycle) hit time instead of requiring a trip to main memory.
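As a rough illustration of why that matters, here is the standard average-access-time calculation with assumed numbers (the two-cycle hit time follows the book's figure; the miss penalty and hit rates are made up):

```c
#include <stdio.h>

int main(void)
{
    double hit_time     = 2.0;     /* cycles for a cache hit (as in the book) */
    double miss_penalty = 100.0;   /* assumed cycles to reach main memory     */

    /* Average memory access time = hit_time + miss_rate * miss_penalty */
    double hit_rates[] = { 0.80, 0.95 };  /* e.g. before/after exploiting lines */
    for (int i = 0; i < 2; i++) {
        double miss_rate = 1.0 - hit_rates[i];
        printf("hit rate %.0f%% -> average access time %.1f cycles\n",
               100.0 * hit_rates[i], hit_time + miss_rate * miss_penalty);
    }
    return 0;
}
```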
Suppose I have two processes of 50 bytes each and only one partition of 100 bytes.
Suppose the first process takes up the partition and 50 bytes remain.
Can the second process reside in the same partition, given that free space is available, or will internal fragmentation occur?
Also is it true that if internal fragmentation is present then external fragmentation is also present?
Since there is only one partition, internal fragmentation will occur: internal fragmentation is the space left unused inside a partition, and in this example 50 bytes will be left free inside the partition. The second process cannot reside in that leftover space, because a fixed partition holds only one process at a time.
Also is it true that if internal fragmentation is present then
external fragmentation is also present?
No, it's not true. External fragmentation is the free space left when you don't have a large enough block: there are free blocks (or partitions) available, but none of them alone can satisfy the requirement.
So basically the total amount of free space is greater than or equal to the space required, but it is not contiguous; this is called external fragmentation.
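A tiny sketch of that condition (the hole sizes are made up for illustration):

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative memory map (sizes in KB): three free holes, none adjacent. */
    int holes[] = { 30, 20, 40 };
    int request = 60;

    int total = 0, largest = 0;
    for (int i = 0; i < 3; i++) {
        total += holes[i];
        if (holes[i] > largest)
            largest = holes[i];
    }
    /* total = 90 >= 60, yet the largest hole = 40 < 60, so the request fails.
       That gap between "enough free memory in total" and "no single hole big
       enough" is external fragmentation. */
    printf("total free %d KB, largest hole %d KB, request %d KB -> %s\n",
           total, largest, request, request <= largest ? "fits" : "fails");
    return 0;
}
```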
In contiguous memory allocation we have the problem of external fragmentation, but can't we just combine all the available small holes of free memory to create a big one according to our requirement?
Yes, we can combine all the free memory that is scattered over physical memory and is not contiguous. We need an algorithm to move all the used memory allocations to one side of memory so that contiguous free memory becomes available. This method is called compaction. Compaction is quite expensive to execute and takes time.
Also, only memory whose addresses are bound dynamically can be relocated.
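A minimal sketch of the idea (the segment layout and names are invented): compaction slides every allocated segment toward one end and updates its base address, leaving a single contiguous hole.

```c
#include <stdio.h>

struct segment {
    int base;   /* start address */
    int size;   /* length        */
};

/* Slide all allocated segments down to the low end of memory so the
   remaining free space becomes one contiguous hole.  This works only if
   the segments can be relocated, i.e. addresses are bound dynamically. */
static void compact(struct segment *seg, int n)
{
    int next_free = 0;
    for (int i = 0; i < n; i++) {
        seg[i].base = next_free;   /* in a real system: copy the memory and
                                      update the relocation register/tables */
        next_free += seg[i].size;
    }
}

int main(void)
{
    struct segment segs[] = { { 0, 100 }, { 300, 50 }, { 600, 200 } };
    compact(segs, 3);
    for (int i = 0; i < 3; i++)
        printf("segment %d -> base %d size %d\n", i, segs[i].base, segs[i].size);
    return 0;
}
```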
Adding to varun's answer:
Compaction may not always be possible, i.e. if address binding is static, the program's address space cannot be relocated. It can be done only if relocation is dynamic and is performed at execution time.
One other solution is to increase the block size, but that is a trade-off.
Refer to this for the effect of block size on fragmentation.
I have a contiguous memory region of 1024 buffers, each buffer being 2K bytes. I use a linked list to keep a record of available buffers (a buffer here can be thought of as being used by a producer and a consumer). After some operations, the order of buffers in the linked list becomes random.
Modern computer architectures favour compact data and locality a lot. They cache neighbouring data when a location needs to be accessed. The cache line of my computer is 64 bytes (corrected from 64K).
Question 1. In my case, are there a lot of cache misses because my access pattern is random?
Question 2. What is the size of neighbouring data a modern computer caches? I think if you access a location in an array of integers, it will cache the neighbouring integers. But my unit of data (2K) is much larger than an int (4 bytes), so I am not sure how many neighbours will be cached.
First of all, I doubt that "the cache line of my computer is 64K bytes"; it is most likely 64 bytes. Let me try to answer your questions:
Question 1. In my case, are there a lot of cache misses because my access pattern is random?
Not necessarily. It depends on how many operations you do on a buffer once it is cached.
So if you cache a 2K buffer and do lots of sequential work on it, your cache hit rate would be good. As Paul suggested, this works even better with hardware prefetching enabled.
However, if you constantly jump between buffers and do a relatively small amount of work on each buffer, the cache hit rate will drop.
However, 1024 x 2KB = 2MB, so that could be the size of an L2 cache (if you also have an L3, then the L2 is generally smaller). So even if you miss in L1, there's a high chance that in both cases you will hit in L2.
Question 2. What is the size of neighbouring data a modern computer caches?
Usually the number of neighbors fetched is given by the cache line size. If the line size is 64B, you can fetch 16 integer values, so on each read you fill a cache line. However, you also need to take prefetching into consideration: if your CPU detects that memory reads are sequential, it will prefetch more neighbors and bring in more cache lines in advance.
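A rough sketch of the two access patterns discussed above, assuming 2 KB buffers as in the question (the structure and function names are invented):

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_SIZE 2048          /* 2 KB buffer = 32 cache lines at 64 B/line */

struct buffer {
    uint8_t data[BUF_SIZE];
    struct buffer *next;       /* free/used list link; order may be random */
};

/* Good case: finish all the work on one buffer before moving on.
   The first touch of each 64-byte line misses; the rest of that line hits. */
uint32_t checksum_buffer(const struct buffer *b)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < BUF_SIZE; i++)
        sum += b->data[i];
    return sum;
}

/* Worse case: touch one byte of each buffer while walking the (randomly
   ordered) list.  Nearly every access lands on a cold cache line. */
uint32_t checksum_first_bytes(const struct buffer *list)
{
    uint32_t sum = 0;
    for (const struct buffer *b = list; b != NULL; b = b->next)
        sum += b->data[0];
    return sum;
}
```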
Hope this helps!
I want to make sure I am getting this concept right:
In Hadoop: The Definitive Guide it is stated that "the goal while designing a file system is always to reduce the number of seeks in comparison to the amount of data to be transferred." In this statement, the author is referring to the "seeks()" of Hadoop logical blocks, right?
I am thinking that no matter how big the Hadoop block size is (64MB, 128MB, or bigger), the number of seeks over the physical blocks (which are usually 4KB or 8KB) that the underlying filesystem (e.g. ext3/fat) has to perform will be the same.
Example: To keep the numbers simple, assume the underlying file system block size is 1MB. We want to read a file of size 128MB. If the Hadoop block size is 64MB, the file occupies 2 blocks. When reading, there are 128 seeks. If the Hadoop block size is increased to 128MB, the number of seeks performed by the file system is still 128. In the second case, Hadoop will perform 1 seek instead of 2.
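A quick sketch of that arithmetic (block sizes as in the example; it only counts block boundaries and ignores metadata I/O):

```c
#include <stdio.h>

int main(void)
{
    long file_mb     = 128;
    long fs_block_mb = 1;      /* underlying file system block (example)  */
    long hdfs_64_mb  = 64;     /* two candidate HDFS block sizes          */
    long hdfs_128_mb = 128;

    /* Physical seeks depend only on the underlying file system block size. */
    printf("physical seeks: %ld in both cases\n", file_mb / fs_block_mb);

    /* Hadoop-level "seeks" (block boundaries) depend on the HDFS block size. */
    printf("HDFS blocks: %ld at 64 MB, %ld at 128 MB\n",
           file_mb / hdfs_64_mb, file_mb / hdfs_128_mb);
    return 0;
}
```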
Is my understanding correct?
If I am correct, a substantial performance improvement from increasing the block size will only be observed for very large files, right? I am thinking that for files in the ~1 GB size range, reducing the number of seeks from ~20 (64MB block size) to ~10 (128MB block size) shouldn't make much of a difference, right?
You are correct that increasing the file system block size will improve performance. Linux requires that the block size be less than or equal to the page size, and the base x86 page size is 4K; therefore, the largest block size that you can use is 4K even if the file system can support larger block sizes. The performance benefits of a large block size and page size are significant: fewer read/write system calls, fewer rotational delays and seeks (and that is before even considering SSDs), fewer context switches, improved cache locality, fewer TLB misses, etc. This is all goodness.
I analytically modeled the benefits of various block sizes based on our disk usage pattern and in some cases predicted order of magnitude improvements from the disk subsystem. This would shift the performance bottleneck elsewhere.
You are correct that substantial performance gains are possible. Unfortunately, a certain engineer who controls such improvements sees no value in page sizes larger than 4K. He mocks enterprise users who need high performance from largely homogeneous workloads on big iron and focuses on heterogeneous workloads that are run interactively on desktop or laptop systems, where high performance is unimportant.