Durably reordering values on disk - algorithm

I have a large number (100s of millions) of fixed-size values stored in a random order on a disk. I have the same set of values stored in memory, in a different order. I need to store the values in the order they are in memory, on disk. The challenge is this: I need to keep at least one copy of each value on disk at any one time – i.e. it needs to be durable.
I have quite a bit of RAM to work with (the values take up only about 60%), a lot of ephemeral storage, but only a very small amount of space on the durable disk, enough for less than a million of the values.
Given a value on disk, I can find it in memory very quickly. But the converse is not true: given a value in memory, it is very slow to find it on disk.
Given these limitations, what's the best algorithm to transfer the order of the values from memory, to disk, as fast as possible?

Sounds like you have a sorting problem, where your comparator is the order of elements in RAM (element x is 'bigger' than element y, if x appears after y in RAM).
It can be solved using an external sort.
Note that if you allow duplicates, some more processing needs to be done in order to make sure your comparator is valid (can be solved by enumerating the identical values, and assigning a 'dupe_id' to each duplicate - both in RAM and on disk)
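A minimal sketch of that comparator, assuming the values are hashable and using Python's in-memory `sorted` as a stand-in for a real external merge sort. The helper names `rank_keys` and `reorder_like_memory` are made up for illustration, and the sketch does not address the durability or disk-space limits from the question:

```python
from collections import defaultdict

def rank_keys(values):
    """Pair each value with a running duplicate counter (the 'dupe_id')."""
    seen = defaultdict(int)
    keys = []
    for v in values:
        keys.append((v, seen[v]))  # (value, dupe_id)
        seen[v] += 1
    return keys

def reorder_like_memory(disk_values, memory_values):
    # Rank of each (value, dupe_id) pair is its position in RAM.
    rank = {key: pos for pos, key in enumerate(rank_keys(memory_values))}
    # Sort the on-disk values by that rank. A real run would hand the same
    # key function to an external merge sort working on bounded-size chunks.
    disk_keys = rank_keys(disk_values)
    return [value for value, _ in sorted(disk_keys, key=rank.__getitem__)]
```

For example, `reorder_like_memory([3, 1, 2, 1], [1, 2, 1, 3])` returns the values in the in-memory order, with the two copies of `1` kept apart by their dupe_id.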


Does use of hash tables cause memory fragmentation?

My understanding of hash tables is that they use hash functions to relate keys to locations in memory, with a total number of "buckets" pre-allocated in memory. The goal is for there to be enough buckets that I don't have to use chaining, which would slow my ideal O(1) access time to O(n/m), where n is the number of unique keys to store and m is the number of buckets.
So if I have 1000 unique items to store, I'll want no less than 1000 buckets, and perhaps a lot more to minimize probability of having to use my chained linked list. If this weren't the case, we'd expect the average hash table to have many, many collisions. Now if we've got 1000 pre-allocated buckets, that means I have 1000 bytes of allocated memory, distributed around my memory. Thus every single unique key in my hash table results in a fragment of memory, fragmenting my RAM.
Does this mean that the use of hash tables is basically guaranteed to result in an amount of fragmentation proportional to the number of unique keys? Further, this seems to indicate that you can greatly minimize fragmentation using some statistics to pick the number of buckets, if you know how many unique keys there are going to be. Is this the case?
"1000 bytes of allocated memory, distributed around my memory"
No, you have one array of 1000 entries (of some size which is almost certainly larger than 1 byte per entry).
If each entry is big enough to handle the non-collision case in-place, no extra dynamic allocation is required until you have a collision. (e.g. maybe you use a union and a 1-bit flag to indicate whether this entry is a stand-alone bucket or whether it's a pointer to a linked list.)
If not, then when you write an entry, space needs to be allocated for it and a pointer stored in the table array itself. (e.g. a key-value hash table with small keys but large values). An empty hash table can still be full of NULL pointers.
You might still want it to hold structs of pointer and hash value (for single-member buckets). Then you can reject definitely-not-present queries without another level of indirection if the full hash value doesn't match the query; e.g. for a 32 or 64-bit hash that's many more bits than the 10 bits for indexing a 1024-entry table.
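A rough sketch of that layout, assuming a chained table whose slots are one pre-allocated array and whose entries carry the full hash. The class name `Table` and its methods are hypothetical, and Python lists only approximate the in-place storage a C struct would give you:

```python
class Table:
    """Chained hash table: one pre-allocated slot array, chains made on demand."""

    def __init__(self, n_buckets=1024):
        # A single array of slots, all empty to start with (like NULL pointers).
        self.slots = [None] * n_buckets

    def put(self, key, value):
        h = hash(key)
        i = h % len(self.slots)
        if self.slots[i] is None:
            self.slots[i] = []            # allocate only when the slot is first used
        chain = self.slots[i]
        for n, (entry_hash, entry_key, _) in enumerate(chain):
            if entry_hash == h and entry_key == key:
                chain[n] = (h, key, value)  # overwrite an existing key
                return
        chain.append((h, key, value))

    def get(self, key, default=None):
        h = hash(key)
        chain = self.slots[h % len(self.slots)]
        if chain is None:
            return default
        for entry_hash, entry_key, entry_value in chain:
            # Comparing the full hash first rejects most non-matching keys
            # without touching the key itself.
            if entry_hash == h and entry_key == key:
                return entry_value
        return default
```

The point of the sketch is that an empty table is one contiguous array, and extra allocations happen only as entries are written.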
To reduce overall fragmentation, you can use a slab allocator or other technique for carving nodes out of a contiguous block you get from a global allocator. Having the hash table maintain its own private free-list could help with spatial locality of the linked-list nodes, so they're at least not scattered across many different virtual pages (TLB misses) and hopefully not DRAM pages (even slower cache misses).

Why are tries slower than hash tables when stored on-disk?

I heard that tries are less efficient than hash tables for performing lookups when the data structures are stored on disk rather than main memory. Why would this be the case?
On disk, random access is slow because in order to read bytes at a particular location, the hard drive has to move the read head and wait for the platter to spin those bytes under it. A random access on disk can be hundreds of thousands of times slower than a comparable access to RAM.
On top of this, whenever you read data from disk, a block of memory called a page is read from disk, not just the bytes you asked for. This means that if you read some data from disk, accessing the bytes near that byte will likely be very fast because that data will have been read from the same page and loaded into RAM. This means that sequential access in an array on disk will be fast, since after the first (slow) read to get the bytes for the first array element to read, the bytes for the next array elements will probably already be loaded and available.
Think about what this means for tries versus linear probing hash tables. A trie is a tree structure where lookups require following lots of pointers to nodes laid out in no particular order in memory. This means that the cost of a trie lookup will likely be one disk read per character of the string, which is terribly inefficient. On the other hand, if you have a hash table using linear probing, the cost of a lookup will (roughly) be the cost of one disk read, since after finding the initial spot in the table where the value should be, the subsequent array reads should not require further disk reads.
Note that not all tries and all hash tables have this property. Cache-oblivious tries are tries that are specifically constructed to minimize disk reads and can be very quick in external memory. Many hash tables, such as chained hash tables or double hashing tables, have more scattered lookup patterns and thus incur more disk reads.
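To make the comparison concrete, here is a small counting sketch. The page size, slot size, and the `node_page` layout table are all assumptions chosen for illustration, and wrap-around at the end of the hash table is ignored:

```python
PAGE_SIZE = 4096   # assumed page size in bytes
SLOT_SIZE = 16     # assumed size of one hash-table slot in bytes

def pages_touched_linear_probe(start_slot, probes):
    """Distinct pages read when scanning `probes` consecutive slots."""
    first = (start_slot * SLOT_SIZE) // PAGE_SIZE
    last = ((start_slot + probes - 1) * SLOT_SIZE) // PAGE_SIZE
    return last - first + 1

def pages_touched_trie(key, node_page):
    """Distinct pages read when walking the root plus one node per character.

    `node_page` maps a key prefix to the page holding that node
    (a hypothetical layout used only for this illustration).
    """
    return len({node_page[key[:i]] for i in range(len(key) + 1)})
```

With these numbers, five probes starting at slot 100 stay inside one page, while a lookup of "cat" in a trie whose nodes sit on four different pages needs four page reads.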
Hope this helps!

Accessing elements within the same set in an n-way set associative cache

Is there any way to guarantee you access only blocks that map to the same set in an n-way set associative cache if you don't know the level of associativity nor the size of the cache itself? I know that given either level of associativity or cache size it's possible to do this, but in this particular situation all I've got is a low-balled estimate of the cache size. I've thought about it for a while and I'm starting to believe it's not possible, but I'm not definitively sure.
For the sake of this question please assume that it's impossible to obtain the level of associativity or the cache size by any means.
The reason for this is that I'm trying to quantitatively determine the level of associativity, but the algorithm I used to quantitatively determine cache size only gives exact results for cache sizes that are a power of two and it gives the nearest power of two estimate otherwise. Unfortunately the machine I'm currently running on has a 3MB L2 cache.
After doing more research and asking a professor of computer architecture, it would seem there is no foolproof way to guarantee you will only access blocks that map to the same set if you don't know cache size or the associativity of the cache.
Given the cache size, N, you can access elements of an array that are separated by N bytes or any multiple of N bytes and each of the blocks that are subsequently pulled in will be mapped to the same set. This is the easiest way to guarantee you access only blocks that map to the same set.
If you do not know the cache size, the best you can do is estimate. For example, if you access elements of an array that are separated by 32MB then you are guaranteed to access only blocks that map to the same set for any cache size that is a power of two up to 32MB. Caches whose size is not a power of two (such as the 3MB L2 above) do not have this guarantee.
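The stride argument can be checked with a small model. This is only a sketch: it assumes 64-byte lines and plain modulo indexing on the line number, which real hardware (especially with non-power-of-two set counts or hashed indexing) may not use. The function names are made up for illustration:

```python
LINE = 64          # assumed cache-line size in bytes
MB = 1 << 20

def set_index(addr, cache_size, ways, line=LINE):
    """Set an address maps to, assuming plain modulo indexing (an assumption)."""
    n_sets = cache_size // (ways * line)
    return (addr // line) % n_sets

def same_set(stride, count, cache_size, ways):
    """True if `count` accesses spaced `stride` bytes apart all land in one set."""
    sets = {set_index(i * stride, cache_size, ways) for i in range(count)}
    return len(sets) == 1
```

Under this model, a stride equal to the cache size always stays in one set (`same_set(3 * MB, 16, 3 * MB, 8)` is true), and a 32MB stride works for a 4MB 16-way cache, but for a hypothetical 3MB 8-way cache the 32MB stride spreads over several sets.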

Is there a speed difference in ordering by int vs. float?

When retrieving entries in a database, is there a difference between storing values as a float or decimal vs. an int when using ORDER BY in a SELECT statement?
It depends. You didn't specify the RDBMS so I can only speak to SQL Server specifically but data types have different storage costs associated with them. Ints range from 1 to 8 bytes, Decimals are 5-17 and floats are 4 to 8 bytes.
The RDBMS will need to read data pages off disk to find your data (worst case) and it can only fit so many rows on an 8k page of data. So, if you have 17-byte decimals, you're going to get roughly 1/17th the number of values per page read off disk compared with what you could have had if you sized your data correctly and used a tinyint with a 1-byte cost to store X.
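As back-of-the-envelope arithmetic for the 1-byte versus 17-byte comparison, assuming a nominal 8192-byte page and ignoring the page header, row overhead, and any other columns (so the real numbers will be lower):

```python
PAGE_BYTES = 8192  # nominal 8 KB page; header and per-row overhead ignored

def rows_per_page(column_bytes):
    """How many values of the given width fit on one page in this simple model."""
    return PAGE_BYTES // column_bytes

for name, width in [("tinyint", 1), ("int", 4), ("float", 8), ("decimal(38)", 17)]:
    print(f"{name:12} {rows_per_page(width):5} values per page")
```

In this model a page holds 8192 tinyint values but only 481 of the 17-byte decimals, which is where the factor of about 17 in reads comes from.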
That storage cost will have a cascading effect when you go to sort (order by) your data. It will attempt to sort in memory but if you have a bazillion rows and are starved for memory it may dump to temp storage for the sort and you're paying that cost over and over.
Indexes may help as the data can be stored in a sorted manner but again, getting that data into memory may not be as efficient for obese data types.
@Bohemian makes a fine point about the CPU efficiency of integer vs floating point comparisons but it is amazingly rare for the CPU to be spiked on a database server. You are far more likely to be constrained by the disk IO subsystem and memory which is why my answer focuses on the speed difference between getting that data into the engine for it to perform the sort operation vs the CPU cost of comparison.
(Edited) Since both int and float occupy exactly the same space on disk, and of course in memory, i.e. 32 bits, the only differences are in the way they are processed.
int should be faster to sort than float, because the comparison is simpler: Processors can compare ints in one machine cycle, but a float's bits have to be "interpreted" to get a value before comparing (not sure how many cycles, but probably more than one, although some CPUs may have special support for float comparison).
In general, the choice of datatypes should be driven by whether the datatype is appropriate for storing the values that are required to be stored. If a given datatype is inadequate, it doesn't matter how efficient it is.
In terms of disk i/o the speed difference is second order. Don't worry about second order effects until your design is good with regard to first order effects.
Correct index design will result in a huge decrease in delays when a query can be retrieved in sorted order to begin with. However, speeding up that query is done at the cost of slowing down other processes, like processes that modify the indexed data. The trade off has to be considered to see whether it's worth it.
In short, worry about the stuff that's going to double your disk i/o or worse before you worry about the stuff that's going to add 10% to your disk i/o.

Data structure and algorithm for representing/allocating free space in a file

I have a file with "holes" in it and want to fill them with data; I also need to be able to free "used" space and make free space.
I was thinking of using a bi-map that maps offset and length. However, I am not sure if that is the best approach if there are really tiny gaps in the file. A bitmap would work but I don't know how that can be easily switched to dynamically for certain regions of space. Perhaps some sort of radix tree is the way to go?
For what it's worth, I am up to speed on modern file system design (ZFS, HFS+, NTFS, XFS, ext...) and I find their solutions woefully inadequate.
My goals are to have pretty good space savings (hence the concern about small fragments). If I didn't care about that, I would just go for two splay trees... One sorted by offset and the other sorted by length with ties broken by offset. Note that this gives you amortized log(n) for all operations with a working set time of log(m)... Pretty darn good... But, as previously mentioned, does not handle issues concerning high fragmentation.
I have shipped commercial software that does just that. In the latest iteration, we ended up sorting blocks of the file into "type" and "index," so you could read or write "the third block of type foo." The file ended up being structured as:
1) File header. Points at master type list.
2) Data. Each block has a header with type, index, logical size, and padded size.
3) Arrays of (offset, size) tuples for each given type.
4) Array of (type, offset, count) that keeps track of the types.
We defined it so that each block was an atomic unit. You started writing a new block, and finished writing that before starting anything else. You could also "set" the contents of a block. Starting a new block always appended at the end of the file, so you could append as much as you wanted without fragmenting the block. "Setting" a block could re-use an empty block.
When you opened the file, we loaded all the indices into RAM. When you flushed or closed a file, we re-wrote each index that changed, at the end of the file, then re-wrote the index index at the end of the file, then updated the header at the front. This means that changes to the file were all atomic -- either you commit to the point where the header is updated, or you don't. (Some systems use two copies of the header 8 kB apart to preserve headers even if a disk sector goes bad; we didn't take it that far)
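The "header last" ordering can be sketched as follows. This is not the format described above: it assumes a toy header consisting of a single 64-bit offset to the master index, assumes that overwriting those 8 bytes is effectively atomic, and the name `commit` is made up for illustration:

```python
import os
import struct

HEADER = struct.Struct("<Q")  # assumed header: one 64-bit offset to the master index

def commit(path, index_bytes):
    """Append a new index, flush it, and only then repoint the header at it."""
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        index_offset = f.tell()
        f.write(index_bytes)                 # 1. append the new index at the end
        f.flush()
        os.fsync(f.fileno())                 #    make sure it is on disk
        f.seek(0)
        f.write(HEADER.pack(index_offset))   # 2. switch the header to the new index
        f.flush()
        os.fsync(f.fileno())
    return index_offset
```

If the process dies before step 2, the header still points at the previous index, so a reader sees the old, consistent state.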
One of the block "types" was "free block." When re-writing changed indices, and when replacing the contents of a block, the old space on disk was merged into the free list kept in the array of free blocks. Adjacent free blocks were merged into a single bigger block. Free blocks were re-used when you "set content" or for updated type block indices, but not for the index index, which always was written last.
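Merging adjacent free blocks might look like this, assuming the free list is kept in memory as a list of (offset, size) tuples sorted by offset. The function name `release` is hypothetical, and this is a sketch rather than the shipped implementation:

```python
import bisect

def release(free, offset, size):
    """Insert (offset, size) into the sorted free list, merging neighbours."""
    i = bisect.bisect_left(free, (offset, size))
    # Merge with the preceding block if it ends exactly where this one starts.
    if i > 0 and free[i - 1][0] + free[i - 1][1] == offset:
        i -= 1
        offset, size = free[i][0], free[i][1] + size
        del free[i]
    # Merge with the following block if it starts exactly where this one ends.
    if i < len(free) and offset + size == free[i][0]:
        size += free[i][1]
        del free[i]
    free.insert(i, (offset, size))
    return free
```

For example, releasing (10, 20) into `[(0, 10), (30, 10)]` closes the gap and leaves a single free block `(0, 40)`.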
Because the indices were always kept in memory, working with an open file was really fast -- typically just a single read to get the data of a single block (or get a handle to a block for streaming). Opening and closing was a little more complex, as it needed to load and flush the indices. If that had become a problem, we could have loaded the secondary type index on demand rather than up-front to amortize that cost, but it never was a problem for us.
Top priority for persistent (on disk) storage: Robustness! Do not lose data even if the computer loses power while you're working with the file!
Second priority for on-disk storage: Do not do more I/O than necessary! Seeks are expensive. On Flash drives, each individual I/O is expensive, and writes are doubly so. Try to align and batch I/O. Using something like malloc() for on-disk storage is generally not great, because it does too many seeks. This is also a reason I don't like memory mapped files much -- people tend to treat them like RAM, and then the I/O pattern becomes very expensive.
For memory management I am a fan of the BiBOP* approach, which is normally efficient at managing fragmentation.
The idea is to segregate data based on their size. This way, within a "bag" you only have "pages" of small blocks with identical sizes:
- no need to store the size explicitly, it's known depending on the bag you're in
- no "real" fragmentation within a bag
The bag keeps a simple free-list of the available pages. Each page keeps a free-list of available storage units in an overlay over those units.
You need an index to map size to its corresponding bag.
You also need a special treatment for "out-of-norm" requests (ie requests that ask for allocation greater than the page size).
This storage is extremely space efficient, especially for small objects, because the overhead is not per-object. However, there is one drawback: you can end up with "almost empty" pages that still contain one or two occupied storage units.
This can be alleviated if you have the ability to "move" existing objects, which effectively allows you to merge pages.
(*) BiBOP: Big Bag Of Pages
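A simplified sketch of the idea, assuming a 4096-byte page and a handful of made-up size classes. For brevity it keeps one flat free list of (page, slot) pairs per bag instead of the per-page free lists described above, and the class names are hypothetical:

```python
PAGE_SIZE = 4096  # assumed page size in bytes

class Bag:
    """All pages in one bag hold storage units of a single fixed size."""

    def __init__(self, unit_size):
        self.unit_size = unit_size
        self.units_per_page = PAGE_SIZE // unit_size
        self.free = []      # free (page, slot) pairs
        self.n_pages = 0

    def alloc(self):
        if not self.free:
            # Grow by one page and put all of its slots on the free list.
            page = self.n_pages
            self.n_pages += 1
            self.free.extend((page, s) for s in reversed(range(self.units_per_page)))
        return self.free.pop()

    def release(self, page, slot):
        self.free.append((page, slot))

class BiBOP:
    def __init__(self, sizes=(16, 32, 64, 128, 256)):
        self.bags = {s: Bag(s) for s in sizes}  # index: size class -> bag

    def alloc(self, nbytes):
        """Return (size_class, page, slot) from the smallest bag that fits."""
        for size in sorted(self.bags):
            if nbytes <= size:
                return (size, *self.bags[size].alloc())
        raise ValueError("out-of-norm request: needs separate handling")
```

Note that the returned handle never stores the object's size: the size class of the bag already implies it.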
I would recommend making a customized file system (which might contain just one file, of course), based on FUSE. There are a lot of FUSE-based projects you can build on; I recommend picking the simplest ones, even if they are unrelated to your problem, so they are easy to learn from.
Which algorithm and data structure to choose depends heavily on your needs. It could be a map, a list, or a file split into chunks with on-the-fly compression/decompression.
The data structures you proposed are good ideas. As you can clearly see, there is a trade-off: fragmentation vs compaction.
On one side - best compaction, highest fragmentation - splay and many other kinds of trees.
On another side - lowest fragmentation, worst compaction - linked list.
In between there are B-Trees and others.
As I understand it, you stated space saving as your priority, while still caring about performance.
I would recommend a mixed data structure in order to meet all the requirements:
- a kind of list of contiguous blocks of data
- a kind of tree for the current "add/remove" operations
- when data is required on demand, allocate from the tree; when data is deleted, keep track of what is "deleted" using the tree as well
- mixing: during each operation (or in idle moments) do "step by step" defragmentation, applying the changes kept in the tree to the contiguous blocks while moving them slowly
This solution gives you a fast response on demand while "optimising" the layout as it is used (for example, "each read of 10MB of data -> defragmentation of 1MB") or in idle moments.
The most simple solution is a free list: keep a linked list of free blocks, reusing the free space to store the address of the next block in the list.
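A minimal sketch of such a free list, assuming fixed-size blocks and using a `bytearray` as a stand-in for the file. The block size, the sentinel value, and the class name are assumptions made for illustration:

```python
import struct

BLOCK = 64          # assumed fixed block size in bytes
NIL = 0xFFFFFFFF    # sentinel meaning "no next block"

class FreeList:
    def __init__(self, n_blocks):
        self.data = bytearray(n_blocks * BLOCK)  # stands in for the file
        self.head = NIL
        for b in reversed(range(n_blocks)):      # initially every block is free
            self.free(b)

    def free(self, block):
        # Store the old head inside the freed block, then make it the new head.
        struct.pack_into("<I", self.data, block * BLOCK, self.head)
        self.head = block

    def alloc(self):
        if self.head == NIL:
            raise MemoryError("no free blocks")
        block = self.head
        # The first 4 bytes of a free block hold the index of the next free block.
        (self.head,) = struct.unpack_from("<I", self.data, block * BLOCK)
        return block
```

The list costs no extra space: the "next" pointer lives in bytes that are unused anyway while the block is free.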
