The data structure of libev watchers

Libev uses three data structures to store different watchers.
Heap: for watchers sorted by time, such as ev_timer and ev_periodic.
Linked list: for watchers such as ev_io, ev_signal and ev_child.
Array: for watchers such as ev_prepare, ev_check and ev_async.
Using a heap to store timer watchers is an obvious choice, but what are the criteria for choosing between a linked list and an array?
The data structure that stores ev_io watchers seems a little more complex: it is an array indexed by fd, and each element of the array is a linked list of ev_io watchers. Is a linked list used as the element because it makes allocating space for the array more convenient?
Or is it simply because insertions and removals of ev_io watchers are more frequent, while ev_prepare watchers are more stable?
Or is there some other reason?

The expectation is that there are only a few (usually one, and almost always at most two) io watchers for the same fd (similarly for signals). Putting the list links into the watcher means no extra allocations are required, as there would be if a separately allocated array of watchers were kept per fd. If a lot of I/O watchers were active on the same fd, then this linked-list approach would be slower.
Arrays are used because insertion and removal is very fast (the watcher stores the array index). Using a 4-byte index also reduces memory requirements on 64 bit machines (12 bytes per watcher as opposed to e.g. 16 for a doubly-linked list), and using an array of pointers means that all pointers are near each other in memory, which improves cache efficiency when scanning, which is a frequent operation for some watchers.
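As a hedged illustration of the stored-index trick (names and layout invented here, not libev's actual source): removal swaps the last element of the pointer array into the vacated slot and patches its stored index, so both starting and stopping a watcher are O(1).

    #include <stdlib.h>

    /* Illustrative sketch only; error checking omitted and growth is not geometric. */
    typedef struct watcher {
        int active;                 /* 1-based index into the array, 0 = not started */
        /* ... callback, user data ... */
    } watcher;

    static watcher **checks;        /* flat array of pointers, contiguous in memory */
    static int checkcnt;

    static void check_start (watcher *w)
    {
        checks = realloc (checks, sizeof (watcher *) * (checkcnt + 1));
        checks[checkcnt] = w;
        w->active = ++checkcnt;     /* the watcher remembers where it lives */
    }

    static void check_stop (watcher *w)
    {
        int i = w->active - 1;
        checks[i] = checks[--checkcnt];   /* move the last watcher into the hole */
        checks[i]->active = i + 1;        /* and patch its stored index */
        w->active = 0;
    }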
Cache efficiency is also the reason why a 4-heap is used over a 2-heap, and the reason why the time values are copied to the heap data structure, to avoid having to access the watcher memory on heap operations.
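For reference, a minimal sketch of generic 0-based 4-ary heap arithmetic with the expiry time cached next to the watcher pointer; libev's actual heap code differs in detail.

    /* Illustrative only: a 4-heap element carrying a cached timestamp. */
    typedef struct {
        double at;       /* cached expiry time, so sifting never touches watcher memory */
        void  *w;        /* pointer back to the timer watcher */
    } heap_elem;

    #define HPARENT(i)   (((i) - 1) / 4)
    #define HCHILD(i, k) (4 * (i) + (k) + 1)   /* k = 0..3: four children per node */

    /* Sift element i towards the root until the heap property holds again. */
    static void upheap (heap_elem *heap, int i)
    {
        heap_elem e = heap[i];

        while (i && heap[HPARENT (i)].at > e.at)
        {
            heap[i] = heap[HPARENT (i)];
            i = HPARENT (i);
        }

        heap[i] = e;
    }

Because the four children of a node occupy adjacent array slots, each level of a sift touches memory that usually shares a cache line.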
Child watchers actually use a fixed-size hash table, and again the expectation is that the number of child watchers per hash bucket is small, so the list pointers become part of the watcher data structure.

Probably the reason is that in the typical scenario ev_io watchers need to be looked up by fd. The underlying mechanism (epoll, select, or whatever) reports an fd that has some event pending; libev then simply uses that fd as an index and iterates over the linked list of watchers that need to be invoked. So it can fire events quickly.
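A hedged sketch of that dispatch path, with invented names rather than libev's real API: the fd reported by the backend indexes straight into an array, and each array element is a singly-linked list whose "next" link is embedded in the watcher itself, so starting a watcher needs no allocation.

    /* Illustrative types only; libev's real structures differ. */
    typedef struct io_watcher {
        struct io_watcher *next;   /* intrusive list link, stored inside the watcher */
        int fd;
        int events;                /* interest set, e.g. READ/WRITE flags */
        void (*cb) (struct io_watcher *w, int revents);
    } io_watcher;

    typedef struct {
        io_watcher *head;          /* all watchers registered for this fd */
    } fd_info;

    static fd_info *fds;           /* array indexed by fd; growing it on demand is omitted */

    static void io_start (io_watcher *w)
    {
        w->next = fds[w->fd].head; /* O(1) push onto the per-fd list, no allocation */
        fds[w->fd].head = w;
    }

    /* Called when the backend (epoll, select, ...) reports events on fd. */
    static void fd_event (int fd, int revents)
    {
        for (io_watcher *w = fds[fd].head; w; w = w->next)
            if (w->events & revents)
                w->cb (w, revents);
    }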

Related

List Cache Behavior

OCaml From the Ground Up states that ...
At the machine level, a linked list is a pair of a head value and a pointer to the tail.
I have heard that linked lists (in imperative languages) tend to be slow because of cache misses, memory overhead and pointer chasing. I am curious if OCaml's garbage collector or memory management system avoids any of these issues, and if they do what sort of techniques or optimizations they employ internally that might be different from linked lists in other languages.
OCaml manages its own memory: it calls system-level memory allocation and deallocation primitives on its own terms (e.g. it can allocate a big chunk of heap memory at the start of the program and manage OCaml values on it), so if the compiler and/or the runtime knows that you are allocating a list of a fixed size, it can arrange for the cells to be close to each other in memory. And since there is no pointer type in the language itself, it can move values around during garbage collection to avoid memory fragmentation, something a language like C or C++ cannot do (or only with great effort to maintain the abstraction while allowing moves).
Those are general pointers about how garbage collected languages (imperative or not) may optimize memory management, but Understanding the Garbage Collector has more details about how the garbage collector actually works in OCaml.
A linked list is indeed a horrible structure to iterate over in general.
But this is mitigated a lot by the way OCaml allocates memory and how lists are created most of the time.
In OCaml the GC allocates a large block of memory as its (minor) heap and maintains a pointer to the end of the used portion. An allocation simply advances the pointer by the needed amount of memory.
Combine that with the fact that most of the time lists are constructed in a very short time. Often the list creation is the only thing allocating memory. Think of List.map for example, or List.rev. That will produce a list where the nodes of the list are contiguous in memory. So the linked list isn't jumping all over the address space but is rather contained on a small chunk. This allows caching to work far better than you would expect for a linked list. Iterating the list will actually access memory sequentially.
The above means that a lot of lists are much more ordered than in other languages. And a lot of the time lists are temporary and will be purely in cache. It performs a lot better than it ought to.
There is another layer to this. In OCaml the garbage collector is generational. New values are created on the minor heap, which is scanned frequently, so temporary values are quickly reclaimed. Values that remain alive on the minor heap are copied to the major heap, which is scanned less frequently. The copy operation compacts the values, eliminating any holes caused by values that are no longer alive. This brings list nodes closer together again if the list had gaps in the first place. The same thing happens when the major heap is compacted: values that were allocated close together in time are brought nearer to each other.
While none of that guarantees that lists will be contiguous in memory, it seems to avoid a lot of the bad effects associated with linked lists in other languages. Nonetheless, you shouldn't use lists when you need to iterate over data, or worse, access the n-th node, frequently; use an array instead. Appending is bad too unless your list is small (and it will overflow the stack for large lists). Due to the latter you often build a list in reverse, adding items to the front instead of appending at the end, and then reverse the list as a final step. And that final List.rev will give you a perfectly contiguous list.
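To make the "allocation is just a pointer bump" point concrete, here is a rough C sketch of the idea; the real OCaml minor heap adds block headers, GC triggers and so on, but the contiguity effect is the same: cells allocated back to back sit at adjacent addresses.

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy bump allocator standing in for the minor heap (no bounds check). */
    static char  *heap;
    static size_t heap_used, heap_size = 1 << 20;

    static void *bump_alloc (size_t n)
    {
        void *p = heap + heap_used;   /* allocation = advance a pointer */
        heap_used += n;               /* no free list, no searching */
        return p;
    }

    /* Roughly what an OCaml cons cell looks like at the machine level. */
    typedef struct cell { long head; struct cell *tail; } cell;

    int main (void)
    {
        heap = malloc (heap_size);

        /* Build [0; 1; 2; 3] by prepending; each new cell follows the previous one. */
        cell *list = NULL;
        for (long i = 3; i >= 0; i--) {
            cell *c = bump_alloc (sizeof *c);
            c->head = i;
            c->tail = list;
            list = c;
        }

        /* The printed addresses differ by exactly sizeof (cell): walking the list
         * touches consecutive memory, which is what makes it cache friendly. */
        for (cell *c = list; c; c = c->tail)
            printf ("%ld at %p\n", c->head, (void *) c);

        free (heap);
        return 0;
    }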

Does msync performance depend on the size of the provided range?

I'm making many small random writes to an mmapped file. I want to ensure consistency, so from time to time I use msync, but I don't want to keep track of every single small write that I made. In the current Linux kernel implementation, is there a performance penalty for using msync on the whole file? For example, if the file is 100 GB but I only made 10 MB of changes in total? Is the kernel looping over every page in the range provided to msync to find the dirty pages to flush, or are those kept in some sort of linked list or other structure?
TL;DR: no, it doesn't; the kernel structures that hold the needed information are designed to make the operation efficient regardless of range size.
Pages of mappable objects are kept in a radix tree; however, the Linux kernel implementation of radix trees has an additional special feature: entries can be marked with up to 3 different marks, and marked entries can be found and iterated over a lot faster. The actual data structure used is called "XArray"; you can find more information about it in this LWN article or in Documentation/core-api/xarray.rst.
Dirty pages have a special mark which can be set (PAGECACHE_TAG_DIRTY), allowing them to be quickly found when writeback is needed (e.g. msync, fsync, etc.). Furthermore, XArrays provide an O(1) mechanism to check whether any entry exists with a given mark, so in the case of pages it can be quickly determined whether a writeback is needed at all, even before looking for dirty pages.
In conclusion, you should not incur a noticeable performance penalty when calling msync on the entire mapping as opposed to only a smaller range of actually modified pages.
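For illustration, a minimal example (error handling trimmed, file name and offsets made up) of a few small writes followed by an msync over the whole mapping:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main (void)
    {
        int fd = open ("data.bin", O_RDWR);            /* hypothetical, suitably large file */
        if (fd < 0) { perror ("open"); return 1; }

        off_t len = lseek (fd, 0, SEEK_END);
        char *map = mmap (NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror ("mmap"); return 1; }

        /* A couple of small scattered writes somewhere in the mapping. */
        memcpy (map + 4096,  "hello", 5);
        memcpy (map + 65536, "world", 5);

        /* Sync the entire range; only pages tagged dirty are written back. */
        if (msync (map, (size_t) len, MS_SYNC) != 0)
            perror ("msync");

        munmap (map, (size_t) len);
        close (fd);
        return 0;
    }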

How to implement a Linked List in secondary memory and not in RAM?

I'm curious to know whether there is a way to implement any data structure in secondary storage.
Sure. Traditionally, many data structures we use today have been used as on-disk structures. However, it's much harder to dynamically add and delete elements, especially if they're different sizes. The hard disk will also seek a lot if the elements are scattered throughout the file, and that will really slow down your program.
To save a linked list in a file, your "next" pointer is typically the offset of the next element within the file. To read the next element, you seek to that offset, and read the structure into RAM.
That's if you want one big file with your entire linked list. Another way you can put a linked list onto secondary storage is to have each entry be a separate file, and your "next" pointer is the filename of the next element. That makes it easier to add and remove elements (that's just file creation and deletion, and updating the pointers if needed), but does even more seeking.
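A hedged sketch of the single-file variant: each node stores a fixed-size payload plus the file offset of the next node, and offset 0 (where a header would live) doubles as the null pointer. Field names are invented, and a real format would also pin down endianness and padding.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        char     data[64];
        uint64_t next;          /* byte offset of the next node within the file, 0 = end */
    } disk_node;

    /* Follow one "pointer": seek to the offset and read the node into RAM.
     * (Use fseeko instead of fseek for files larger than 2 GB.) */
    static int read_node (FILE *f, uint64_t offset, disk_node *out)
    {
        if (fseek (f, (long) offset, SEEK_SET) != 0)
            return -1;
        return fread (out, sizeof *out, 1, f) == 1 ? 0 : -1;
    }

    /* Walk the list starting from the offset of the first node. */
    static void walk (FILE *f, uint64_t first)
    {
        disk_node n;
        for (uint64_t off = first; off != 0 && read_node (f, off, &n) == 0; off = n.next) {
            n.data[sizeof n.data - 1] = '\0';   /* be defensive about what came off disk */
            printf ("node at %llu: %s\n", (unsigned long long) off, n.data);
        }
    }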

Data structure and algorithm for representing/allocating free space in a file

I have a file with "holes" in it and want to fill them with data; I also need to be able to free "used" space and make free space.
I was thinking of using a bi-map that maps offset and length. However, I am not sure if that is the best approach if there are really tiny gaps in the file. A bitmap would work, but I don't know how it could easily be made dynamic for certain regions of the space. Perhaps some sort of radix tree is the way to go?
For what it's worth, I am up to speed on modern file system design (ZFS, HFS+, NTFS, XFS, ext...) and I find their solutions woefully inadequate.
My goals are to have pretty good space savings (hence the concern about small fragments). If I didn't care about that, I would just go for two splay trees... One sorted by offset and the other sorted by length with ties broken by offset. Note that this gives you amortized log(n) for all operations with a working set time of log(m)... Pretty darn good... But, as previously mentioned, does not handle issues concerning high fragmentation.
I have shipped commercial software that does just that. In the latest iteration, we ended up sorting blocks of the file into "type" and "index," so you could read or write "the third block of type foo." The file ended up being structured as:
1) File header. Points at master type list.
2) Data. Each block has a header with type, index, logical size, and padded size.
3) Arrays of (offset, size) tuples for each given type.
4) Array of (type, offset, count) that keeps track of the types.
We defined it so that each block was an atomic unit. You started writing a new block, and finished writing that before starting anything else. You could also "set" the contents of a block. Starting a new block always appended at the end of the file, so you could append as much as you wanted without fragmenting the block. "Setting" a block could re-use an empty block.
When you opened the file, we loaded all the indices into RAM. When you flushed or closed a file, we re-wrote each index that changed, at the end of the file, then re-wrote the index index at the end of the file, then updated the header at the front. This means that changes to the file were all atomic -- either you commit to the point where the header is updated, or you don't. (Some systems use two copies of the header 8 kB apart to preserve headers even if a disk sector goes bad; we didn't take it that far)
One of the block "types" was "free block." When re-writing changed indices, and when replacing the contents of a block, the old space on disk was merged into the free list kept in the array of free blocks. Adjacent free blocks were merged into a single bigger block. Free blocks were re-used when you "set content" or for updated type block indices, but not for the index index, which always was written last.
Because the indices were always kept in memory, working with an open file was really fast -- typically just a single read to get the data of a single block (or get a handle to a block for streaming). Opening and closing was a little more complex, as it needed to load and flush the indices. If it becomes a problem, we could load the secondary type index on demand rather than up-front to amortize that cost, but it never was a problem for us.
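For concreteness, hypothetical C structs for a layout along those lines; the field names are invented here and are not the shipped format.

    #include <stdint.h>

    /* 1) File header at offset 0, pointing at the master type list. */
    typedef struct {
        uint32_t magic, version;
        uint64_t type_index_offset;   /* where the array of type_entry records lives */
        uint32_t type_count;
    } file_header;

    /* 2) Each data block is preceded by a small header. */
    typedef struct {
        uint32_t type;
        uint32_t index;               /* "the third block of type foo" */
        uint64_t logical_size;        /* bytes actually used */
        uint64_t padded_size;         /* bytes reserved on disk */
    } block_header;

    /* 3) Per-type index: one (offset, size) tuple per block of that type. */
    typedef struct {
        uint64_t offset;
        uint64_t size;
    } block_ref;

    /* 4) Master list: one entry per type, pointing at that type's block_ref array. */
    typedef struct {
        uint32_t type;
        uint64_t offset;
        uint32_t count;
    } type_entry;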
Top priority for persistent (on disk) storage: Robustness! Do not lose data even if the computer loses power while you're working with the file!
Second priority for on-disk storage: Do not do more I/O than necessary! Seeks are expensive. On Flash drives, each individual I/O is expensive, and writes are doubly so. Try to align and batch I/O. Using something like malloc() for on-disk storage is generally not great, because it does too many seeks. This is also a reason I don't like memory mapped files much -- people tend to treat them like RAM, and then the I/O pattern becomes very expensive.
For memory management I am a fan of the BiBOP* approach, which is normally efficient at managing fragmentation.
The idea is to segregate data based on their size. This way, within a "bag" you only have "pages" of small blocks with identical sizes:
no need to store the size explicitly, it's known depending on the bag you're in
no "real" fragmentation within a bag
The bag keeps a simple free-list of the available pages. Each page keeps a free-list of available storage units in an overlay over those units.
You need an index to map size to its corresponding bag.
You also need a special treatment for "out-of-norm" requests (i.e. requests that ask for an allocation greater than the page size).
This storage is extremely space efficient, especially for small objects, because the overhead is not per-object. However, there is one drawback: you can end up with "almost empty" pages that still contain one or two occupied storage units.
This can be alleviated if you have the ability to "move" existing objects, which effectively allows pages to be merged.
(*) BiBOP: Big Bag Of Pages
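A rough C sketch of the BiBOP idea, illustrative only: each bag handles one block size, pages belong to exactly one bag, and the free slots inside a page are chained through the slots themselves, so no per-object header is needed. A real BiBOP would page-align the pages so the owning bag can be derived from any object's address; error handling and freeing are omitted here.

    #include <stdlib.h>

    #define PAGE_SIZE 4096

    /* A free slot's first bytes are reused as the link to the next free slot. */
    typedef struct slot { struct slot *next_free; } slot;

    typedef struct page {
        struct page *next;        /* next page in the same bag */
        slot        *free_list;   /* free slots inside this page */
    } page;

    typedef struct bag {
        size_t  block_size;       /* every slot in this bag's pages has this size */
        page   *pages;
    } bag;

    /* Size-class index: bags[k] serves allocations of at most (k + 1) * 16 bytes. */
    static bag bags[PAGE_SIZE / 32];

    static page *new_page (bag *b)
    {
        page *p = malloc (PAGE_SIZE);
        p->next = b->pages;
        b->pages = p;
        p->free_list = NULL;

        /* Carve the rest of the page into fixed-size slots and chain them. */
        char *mem = (char *) (p + 1);
        for (size_t off = 0; off + b->block_size <= PAGE_SIZE - sizeof (page); off += b->block_size) {
            slot *s = (slot *) (mem + off);
            s->next_free = p->free_list;
            p->free_list = s;
        }
        return p;
    }

    static void *bag_alloc (size_t n)
    {
        if (n == 0 || n > PAGE_SIZE / 2)
            return NULL;                          /* "out-of-norm" requests need a separate path */

        bag *b = &bags[(n + 15) / 16 - 1];        /* map the size to its bag */
        if (!b->block_size)
            b->block_size = ((n + 15) / 16) * 16;

        page *p = b->pages;
        while (p && !p->free_list)                /* find a page with a free slot */
            p = p->next;
        if (!p)
            p = new_page (b);

        slot *s = p->free_list;                   /* pop it: no size header per object */
        p->free_list = s->next_free;
        return s;
    }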
I would recommend building a customized file system (it might contain only one file, of course) based on FUSE. There are a lot of existing FUSE projects you can base your work on; I recommend picking the simplest ones, even if unrelated, so they are easy to learn from.
Which algorithm and data structure to choose depends heavily on your needs. It could be a map, a list, or a file split into chunks with on-the-fly compression/decompression.
The data structures you propose are good ideas. As you can clearly see, there is a trade-off: fragmentation vs. compaction.
On one side, best compaction and highest fragmentation: splay trees and many other kinds of trees.
On the other side, lowest fragmentation and worst compaction: a linked list.
In between there are B-trees and others.
As I understand it, you stated space saving as the priority, while still taking care of performance.
I would recommend a mixed data structure in order to meet all the requirements:
a kind of list of contiguous blocks of data
a kind of tree for the current "add/remove" operations
when data is requested, allocate from the tree; when data is deleted, keep track of what is "deleted" using the tree as well
mixing: during each operation (or in idle moments) do "step by step" defragmentation, applying the changes kept in the tree to the contiguous blocks while moving them slowly
This solution gives you fast responses on demand while "optimising" the structure as it is used (for example, each read of 10 MB of data triggers defragmentation of 1 MB) or in idle moments.
The simplest solution is a free list: keep a linked list of free blocks, reusing the free space to store the address of the next block in the list.
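A minimal sketch of that trick, with an invented layout and error handling omitted: the file header holds the offset of the first free block, and the first bytes of each free block hold the offset of the next one, so the list costs no extra storage.

    #include <stdint.h>
    #include <stdio.h>

    /* Offset 0 holds the head of the free list; a value of 0 means "no free blocks". */

    static uint64_t read_u64 (FILE *f, uint64_t off)
    {
        uint64_t v = 0;
        fseek (f, (long) off, SEEK_SET);
        fread (&v, sizeof v, 1, f);
        return v;
    }

    static void write_u64 (FILE *f, uint64_t off, uint64_t v)
    {
        fseek (f, (long) off, SEEK_SET);
        fwrite (&v, sizeof v, 1, f);
    }

    /* Pop a free block, or return 0 if there is none (the caller then grows the file). */
    static uint64_t alloc_block (FILE *f)
    {
        uint64_t head = read_u64 (f, 0);
        if (head)
            write_u64 (f, 0, read_u64 (f, head));   /* header now points at the next free block */
        return head;
    }

    /* Push a no-longer-needed block onto the free list. */
    static void free_block (FILE *f, uint64_t off)
    {
        write_u64 (f, off, read_u64 (f, 0));        /* the block remembers the old head */
        write_u64 (f, 0, off);                      /* the header points at the freed block */
    }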

Searching for membership in array of ranges

As part of our system simulation, I'm modeling a memory space with 64-bit addressing using a sparse memory array and keeping a list of objects to keep track of buffers that are allocated within the memory space. Buffers are allocated and de-allocated dynamically.
I have a function that searches for a given address or address range within the allocated buffers to see if accesses to the memory model are in allocated space or not, and my first cut "search through all the buffers until you find a match" is slowing down our simulations by 10%. Our UUT does a lot of memory accesses that have to be vetted by the simulation.
So, I'm trying to optimize. The memory buffer objects contain a starting address and a length. I'm thinking about sorting the object array by starting address at object creation, and then, when the checking function is called, doing a binary search through the array looking to see if a given address falls within a start/end range.
Are there any better/faster ways to do this? There must be some faster/cooler algorithm out there using heaps or hash signatures or some-such, right?
Binary search through a sorted array works but makes allocation/deallocation slow.
A simple approach is to use an ordered binary tree (red-black tree, AVL tree, etc.) indexed by the starting address, so that insertion (allocation), removal (deallocation) and searching are all O(log n). Most modern languages already provide such a data structure (e.g. C++'s std::map).
My first thought was also binary search and I think that it is a good idea. You should be able to insert and remove quickly too. Using a hash would just make you put the addresses in buckets (in my opinion) and then you'd get to the right bucket quickly (and then have to search through the bucket).
Basically your problem is that you have defined intervals of "valid" memory, memory outside those intervals is "invalid", and you want to check for a given address whether it is inside a valid memory block or not.
You can definitely do this by storing the start addresses of all allocated blocks in a binary tree; then search for the largest start address at or below the queried address, and verify that the queried address falls within the length of that block. This gives you O(log n) query time where n = number of allocated blocks. The same query can of course also be used to actually find the block itself, so you can also read the contents of the block at the given address, which I guess you need as well.
However, this is not the most efficient scheme. Instead, you could additionally use one-dimensional spatial subdivision trees to mark invalid memory areas. For example, use a tree with a branching factor of 256 (corresponding to 8 bits) that maps all those 16 kB blocks that contain only invalid addresses to "1" and all others to "0"; the tree will have only two levels and will be very efficient to query. When you see an address, first ask this tree whether it is certainly invalid; only when it is not, query the other one. This will speed things up ONLY IF YOU ACTUALLY GET LOTS OF INVALID MEMORY REFERENCES; if all the memory references are actually valid and you're just asserting, you won't save anything. But you can also flip this idea around and use the tree to mark all those 16 kB or 256 B blocks that contain only valid addresses; how big the tree grows depends on how your simulated memory allocator works.
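A hedged C sketch of the ordered lookup described above: keep the buffers sorted by starting address, binary-search for the greatest start at or below the queried address, then check the length. Type and function names are invented.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint64_t start;
        uint64_t length;
    } buffer;

    /* bufs must be sorted by start; returns the containing buffer or NULL. */
    static const buffer *find_buffer (const buffer *bufs, size_t n, uint64_t addr)
    {
        size_t lo = 0, hi = n;            /* candidate range [lo, hi) */

        while (lo < hi) {                 /* count the buffers whose start is <= addr */
            size_t mid = lo + (hi - lo) / 2;
            if (bufs[mid].start <= addr)
                lo = mid + 1;
            else
                hi = mid;
        }

        if (lo == 0)
            return NULL;                  /* every buffer starts above addr */

        const buffer *b = &bufs[lo - 1];  /* the last buffer starting at or below addr */
        return addr - b->start < b->length ? b : NULL;   /* subtract first to avoid overflow */
    }

Storing the same (start, length) entries in a balanced tree keyed by start gives the O(log n) insertion and removal mentioned above, at the cost of losing the flat array's cache friendliness.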

Resources