Dynamic array - Does it deallocate memory when elements are removed? - data-structures

Going by the article in wikipedia Dynamic Array
It automatically allocates memory in geometrical progression amounts as the last empty memory cell is filled up and then copies entire data to the new array. What happens when one removes elements in quantity larger than the amount by which it was increased? Does it automatically deallocate memory too? Or does it leave it as it is?
For example in the image on the top right of the above wikipedia link,
after the last step 2|7|1|3|8|4| one removes all the elements except 2. What happens then? Does it allocate memory of smaller size and copy the entire contents to the new one?
Side question: How or what decides what would be initial amount of memory allocated to a dynamic array?

The article you cite replies your question:
"Many dynamic arrays also deallocate some of the underlying storage if its size drops below a certain threshold, [...]"
It's really worth reading ;-)
For the cases where you know in advance that you will need a specific size, some implementations provide a specific method ("reserve()", in the C++ Standard Library).

Related

Disadvantage of Using Linked Lists in Memory Management

I'm kinda confused as to what the primary disadvantage of using a linked list would be in maintaining a list of free disk blocks. My professor said that using a bit map would help solve said problem. Why does using a bit map solve this problem?
To narrow down my questions:
What is the primary disadvantage of using a linked list in maintaining a list of free disk blocks?
Why does using a bit map solve this problem/disadvantage?
Hi,
What is the primary disadvantage of using a linked list in maintaining a list of free disk blocks?
This scheme is not very efficient since to traverse the list, we must read each block requiring substantial time.
The second disadvantage is additional memory requirement for maintaining linked list of all free disk blocks.
Why does using a bit map solve this problem/disadvantage?
Most often the list of free disk spaces is implemented as a bit map or bit vector. Each block is represented by a single bit. 0(zero) is marked as a free block whereas 1 is for allocated block. So, no need of extra extra memory to store free disk space.
Fast random access allocation check: Checking if a sector is free is as simple as checking the corresponding bit. so traversal is faster than LinkedList.
Other Advantage of using Bit Map:
Fast deletion: Data need not be overwritten on delete, flipping the corresponding bit is sufficient
May this helps you. Fill free for further clarification.
Regards,
Bhavik
The correct solution was given by #FullDecent in the comments to the other answer (he deserves your bounty). To elaborate:
Assuming that the disk drive in question is of the older, conventional type, with a spinning storage surface and a read/write head that physically moves radially across the surface...
In general it is good for files to be stored as contiguously on disk as possible, so that multiple blocks can be read sequentially. If a file is "fragmented" (its blocks are scattered in different places on the disk), the drive head will need to be repositioned several times to read the entire file. Repositioning of the head is one of the most time-consuming operation involved in a disk read (second only to starting the disk spinning after it has been stopped). Hence the procedure known as "defragmentation" or "defragging", which rearranges the used blocks on a disk to make all files contiguous.
With a linked list of free blocks, allocation involves taking blocks from the front of the list, and deallocation involves adding freed blocks to the front of the list. Hence the list can get messy, with blocks that are not adjacent on the disk frequently being adjacent in the list. To find a contiguous stretch of free blocks large enough for a large file, it may be necessary to scan a significant fraction of the list.
With a bitmap, it will still be necessary to scan for a large contiguous free block section, but this is easier since 8, 16, 32, or 64 bits (depending on the hardware's word size) can be checked in a single operation.

Understanding the Buddy Allocator

I have a conceptual doubt in understanding the way Linux Kernel manages Free blocks. Here is what I interpreted through reading so far.
The Buddy Allocator implementation is allocation scheme that combines a normal power-of-2 allocation.
At times when we need a block of size which is not available, it divides the large block into two. Those two blocks are Buddies, probably hence it is called the Buddy Allocator.
Through a source I learnt that an array of free_area_t structs are maintained for each order that points to a linked list of blocks of pages that are free.
Which I found in <linux/mm.h>
typedef struct free_area_struct {
struct list_head free_list;
unsigned long *map;
} free_area_t;
The free_list appear to be a linked-list of page blocks? My question is, whether it is a list of Free pages or Used pages?
And map appears to be a bitmap that represents the state of a pair of buddies.
My question is How can it be a single-bit that holds the state bit for a pair of buddies? Because if, I use one of the block in a Buddy-pair to allocats, and the other left free, what would be the state then, and how is that managed to be stored in a single bit? Does it represent the entire block of the size of power-of-two, which can be divided in two parts when we need a block size which is not available, so the allocated half is Buddy of the other half which is free? If this is the case that half is being allocated and half remains free, then what will be status of map ? What if both are free? and what if both are allocated? How can be a binary value representing 3 states of a block?
Edit: After further reading, the first doubt is cleared. Which says: If a free block cannot be found of the requested order, a higher order block
is split into two buddies. One is allocated and the other is placed on the free list for
the lower order. So it is linked list of free pages.
map represents the state of a single memory block at the lowest level.

Heap Type Implementation

I was implementing a heap sort and I start wondering about the different implementations of heaps. When you don need to access the elements by index(like in a heap sort) what are the pros and cons of implementing a heap with an array or doing it like any other linked data structure.
I think it's important to take into account the memory wasted by the nodes and pointers vs the memory wasted by empty spaces in an array, as well as the time it takes to add or remove elements when you have to resize the array.
When I should use each one and why?
As far as space is concerned, there's very little issue with using arrays if you know how much is going into the heap ahead of time -- your values in the heap can always be pointers to the larger structures. This may afford for better cache localization on the heap itself, but you're still going to have to go out someplace to memory for extra data. Ideally, if your comparison is based on a small morsel of data (often just a 4 byte float or integer) you can store that as the key with a pointer to the full data and achieve good cache coherency.
Heap sorts are already not particularly good on cache hits throughout traversing the heap structure itself, however. For small heaps that fit entirely in L1/L2 cache, it's not really so bad. However, as you start hitting main memory performance will dive bomb. Usually this isn't an issue, but if it is, merge sort is your savior.
The larger problem comes in when you want a heap of undetermined size. However, this still isn't so bad, even with arrays. Anymore, in non-embedded environments with nice, pretty memory systems growing an array with some calls (e.g. realloc, please forgive my C background) really isn't all that slow because the data may not need to physically move in memory -- just some address pointer magic for most of it. Added to the fact that if you use a array-size-doubling strategy (array is too small, double the size in a realloc call) you're still ending up with an O(n) amortized cost with relatively few reallocs and at most double wasted space -- but hey, you'd get that with linked lists anyways if you're using a 32-bit key and 32-bit pointer.
So, in short, I'd stick with arrays for the smaller base data structures. When the heap goes away, so do the pointers I don't need anymore with a single deallocation. However, it's easier to read pointer-based code for heaps in my opinion since dealing with the indexing magic isn't quite as straightforward. If performance and memory aren't a concern, I'd recommend that to anyone in a heartbeat.

Data structure and algorithm for representing/allocating free space in a file

I have a file with "holes" in it and want to fill them with data; I also need to be able to free "used" space and make free space.
I was thinking of using a bi-map that maps offset and length. However, I am not sure if that is the best approach if there are really tiny gaps in the file. A bitmap would work but I don't know how that can be easily switched to dynamically for certain regions of space. Perhaps some sort of radix tree is the way to go?
For what it's worth, I am up to speed on modern file system design (ZFS, HFS+, NTFS, XFS, ext...) and I find their solutions woefully inadequate.
My goals are to have pretty good space savings (hence the concern about small fragments). If I didn't care about that, I would just go for two splay trees... One sorted by offset and the other sorted by length with ties broken by offset. Note that this gives you amortized log(n) for all operations with a working set time of log(m)... Pretty darn good... But, as previously mentioned, does not handle issues concerning high fragmentation.
I have shipped commercial software that does just that. In the latest iteration, we ended up sorting blocks of the file into "type" and "index," so you could read or write "the third block of type foo." The file ended up being structured as:
1) File header. Points at master type list.
2) Data. Each block has a header with type, index, logical size, and padded size.
3) Arrays of (offset, size) tuples for each given type.
4) Array of (type, offset, count) that keeps track of the types.
We defined it so that each block was an atomic unit. You started writing a new block, and finished writing that before starting anything else. You could also "set" the contents of a block. Starting a new block always appended at the end of the file, so you could append as much as you wanted without fragmenting the block. "Setting" a block could re-use an empty block.
When you opened the file, we loaded all the indices into RAM. When you flushed or closed a file, we re-wrote each index that changed, at the end of the file, then re-wrote the index index at the end of the file, then updated the header at the front. This means that changes to the file were all atomic -- either you commit to the point where the header is updated, or you don't. (Some systems use two copies of the header 8 kB apart to preserve headers even if a disk sector goes bad; we didn't take it that far)
One of the block "types" was "free block." When re-writing changed indices, and when replacing the contents of a block, the old space on disk was merged into the free list kept in the array of free blocks. Adjacent free blocks were merged into a single bigger block. Free blocks were re-used when you "set content" or for updated type block indices, but not for the index index, which always was written last.
Because the indices were always kept in memory, working with an open file was really fast -- typically just a single read to get the data of a single block (or get a handle to a block for streaming). Opening and closing was a little more complex, as it needed to load and flush the indices. If it becomes a problem, we could load the secondary type index on demand rather than up-front to amortize that cost, but it never was a problem for us.
Top priority for persistent (on disk) storage: Robustness! Do not lose data even if the computer loses power while you're working with the file!
Second priority for on-disk storage: Do not do more I/O than necessary! Seeks are expensive. On Flash drives, each individual I/O is expensive, and writes are doubly so. Try to align and batch I/O. Using something like malloc() for on-disk storage is generally not great, because it does too many seeks. This is also a reason I don't like memory mapped files much -- people tend to treat them like RAM, and then the I/O pattern becomes very expensive.
For memory management I am a fan of the BiBOP* approach, which is normally efficient at managing fragmentation.
The idea is to segregate data based on their size. This, way, within a "bag" you only have "pages" of small blocks with identical sizes:
no need to store the size explicitly, it's known depending on the bag you're in
no "real" fragmentation within a bag
The bag keeps a simple free-list of the available pages. Each page keeps a free-list of available storage units in an overlay over those units.
You need an index to map size to its corresponding bag.
You also need a special treatment for "out-of-norm" requests (ie requests that ask for allocation greater than the page size).
This storage is extremely space efficient, especially for small objects, because the overhead is not per-object, however there is one drawback: you can end-up with "almost empty" pages that still contain one or two occupied storage units.
This can be alleviated if you have the ability to "move" existing objects. Which effectively allows to merge pages.
(*) BiBOP: Big Bag Of Pages
I would recommend making customized file-system (might contain one file of course), based on FUSE. There are a lot of available solutions for FUSE you can base on - I recommend choosing not related but simplest projects, in order to learn easily.
What algorithm and data-structure to choose, it highly deepens on your needs. It can be : map, list or file split into chunks with on-the-fly compression/decompression.
Data structures proposed by you are good ideas. As you clearly see there is a trade-off: fragmentation vs compaction.
On one side - best compaction, highest fragmentation - splay and many other kinds of trees.
On another side - lowest fragmentation, worst compaction - linked list.
In between there are B-Trees and others.
As you I understand, you stated as priority: space-saving - while taking care about performance.
I would recommend you mixed data-structure in order to achieve all requirements.
a kind of list of contiguous blocks of data
a kind of tree for current "add/remove" operation
when data are required on demand, allocate from tree. When deleted, keep track what's "deleted" using tree as well.
mixing -> during each operation (or on idle moments) do "step by step" de-fragmentation, and apply changes kept in tree to contiguous blocks, while moving them slowly.
This solution gives you fast response on demand, while "optimising" stuff while it's is used, (For example "each read of 10MB of data -> defragmantation of 1MB) or in idle moments.
The most simple solution is a free list: keep a linked list of free blocks, reusing the free space to store the address of the next block in the list.

in-place realloc with gcc/linux

Is there such a thing? I mean some function that would reallocate memory without moving it if possible or do nothing if not possible. In Visual C there is _expand which does what I want. Does anybody know about equivalents for other platforms, gcc/linux in particular? I'm mostly interested in shrinking memory in-place when possible (and standard realloc may move memory even when its size decreases, in case somebody asks).
I know there is no standard way to do this, and I'm explicitly asking for implementation-dependent dirty hackish tricks. List anything you know that works somewhere.
Aside from using mmap and munmap to eliminate the excess you don't need (or mremap, which could do the same but is non-standard), there is no way to reduce the size of an allocated block of memory. And mmap has page granularity (normally 4k) so unless you're dealing with very large objects, using it would be worse than just leaving the over-sized objects and not shrinking them at all.
With that said, shrinking memory in-place is probably not a good idea, since the freed memory will be badly fragmented. A good realloc implementation will want to move blocks when significantly shrinking them as an opportunity to defragment memory.
I would guess your situation is that you have an allocated block of memory with lots of other structures holding pointers into it, and you don't want to invalidate those pointers. If this is the case, here is a possible general solution:
Break your resizable object up into two allocations, a "head" object of fixed size which points to the second variable-sized object.
For other objects which need to point into the variable-size object, store a pointer to the head object and an integer offset (size_t or ptrdiff_t) into the variable-size object.
Now, even if the variable-size object moves to a new address, none of the references to it are invalidated.
If you're using these objects from multiple threads, you should put a read-write lock in the head object, read-locking it whenever you need to access the variable-sized object, and write-locking it whenever resizing the variable-sized object.
A similar question was asked on another forum. One of the more reasonable answers I saw involved using mmap for the initial allocation (using the MAP_ANONYMOUS flag) and calling mremap without the MREMAP_MAYMOVE flag. A limitation of this approach, though, is that the allocation sizes must be exact multiples to the system's page size.

Resources