custom cache alignment in rust - caching

How can I optimize the performance of my RowMatrix struct in Rust for a large number of rows?
I have a matrix defined in a RowMajor form using a struct in Rust as follows:
pub struct RowMatrix
{
    data: Vec<[usize; 8]>,
    width: usize,
}
Each row is broken down into arrays of 8 elements, which are stacked one after the other in the data vector. For example, if the width is 64, then the first 8 elements in the vector represent the first row, the next 8 elements represent the second row, and so on.
I need to perform operations on individual arrays belonging to two separate rows of this matrix at the same index. For example, if I want to perform an operation on the 2nd array segment of the 1st and 10th row, I would pick the 2nd and 74th elements from the data vector respectively. The array elements will always be from the same array segment.
This operation is performed a number of times with different row pairs and when the number of rows in the matrix is small, I don't see any issues with the performance. However, when the number of rows is significant, I'm seeing a significant degradation in performance, which I attribute to frequent cache misses.
Is there a way to custom-align my struct along the cache line to reduce cache misses without changing the struct definition? I want to control the layout of elements in memory at a fine-grained level, like keeping elements that are 8 elements apart in cache (if 64 is the width of the matrix).
I used the repr(align(x)) attribute to specify the alignment of the struct, but I don't think it's helping: I think it still keeps the array elements in sequential order, and in the case of a big matrix the respective elements might not be in the cache.

#[repr(align(N))] can only affect the items stored in the struct itself (the Vec's pointer, length and capacity, plus your width). Since a Vec is little more than a pointer to the data, the layout behind it is dictated entirely by its implementation, and there is no way for you to affect it directly. So "without changing the struct definition" it is not possible to change the layout. You can, however, create a custom Vec-like type or manage the memory yourself directly in RowMatrix.
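As a minimal sketch of one way to take control of the layout, the element type can be wrapped in a type of its own that carries the alignment. This does change the struct definition, so it is not what the question asked for. The names Segment and AlignedRowMatrix are hypothetical, and 64 bytes is an assumed cache line size. Alignment alone does not bring rows that are far apart into the cache together; it only guarantees that a segment does not straddle two cache lines.

```rust
/// Hypothetical wrapper: one 8-element segment, aligned to an assumed
/// 64-byte cache line. A Vec<Segment> allocates with this alignment.
#[repr(align(64))]
#[derive(Clone, Copy, Debug, Default)]
pub struct Segment(pub [usize; 8]);

/// Hypothetical variant of RowMatrix that stores aligned segments.
pub struct AlignedRowMatrix {
    data: Vec<Segment>,
    width: usize,
}

impl AlignedRowMatrix {
    /// Creates a zeroed matrix; `width` must be a multiple of 8.
    pub fn new(rows: usize, width: usize) -> Self {
        assert!(width % 8 == 0, "width must be a multiple of 8");
        Self {
            data: vec![Segment::default(); rows * (width / 8)],
            width,
        }
    }

    /// Position in `data` of segment `seg` of row `row` (both 0-based).
    pub fn index_of(&self, row: usize, seg: usize) -> usize {
        row * (self.width / 8) + seg
    }

    /// Borrows segment `seg` of row `row` (both 0-based).
    pub fn segment(&self, row: usize, seg: usize) -> &Segment {
        &self.data[self.index_of(row, seg)]
    }
}
```

With a width of 64, index_of(9, 1) is 73, which is the 74th element from the question's example (2nd segment of the 10th row).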

Related

Does a "Pyramid List" data-structure already exist?

I am thinking about data structures which can be used in environments such as embedded/memory-constrained/filesystem and came upon an idea for a list-like data structure which has O(1) {access, insert, pop} while also always having O(1) push (non-amortized), even if it can only be grown by a constant amount (i.e. 4KiB). I cannot find an example of it anywhere and am wondering if it exists, and if so if anyone knows of a reference implementation.
The basic structure would look something like this:
PyramidList contains
a size_t numSlots
a size_t sizeSlots
a void** slots pointer to an array of pointers of size sizeSlots with pointers to values in indexes up to numSlots
The void **slots array has the following structure for each index. These are structured in such a way that 2^i = maxValues where i is the index and maxValues is the maximum number of values that can exist at that index or less (i.e. the sum of the count of all values up to that index)
index 0: contains a pointer directly to a single value (2^0 = 1)
index 1: contains a pointer directly to a single value (2^1 = 2)
index 2: contains a pointer to an array of two values (2^2 = 4)
index 3: contains a pointer to an array of four values (2^3 = 8)
index 4: contains a pointer to an array of eight values (2^4 = 16)
.. etc
index M: contains a pointer to an array of MAX_NUM_VALUES (2^M = MAX_NUM_VALUES*2)
index M+1: contains a pointer to an array of MAX_NUM_VALUES
index M+2: contains a pointer to an array of MAX_NUM_VALUES
etc
Now, suppose I want to access index i. I can use the BSR instruction to get the "power of 2" of the index. If it is less than the power of 2 of MAX_NUM_VALUES then I have my index. If it is larger than the power of 2 of MAX_NUM_VALUES I can act accordingly (subtract and divide). Therefore I can look up the array/single-value in O(1) time and then access the index I want in O(1) as well. Pushing to the PyramidList requires (at most):
allocating a new array of MAX_NUM_VALUES and adding its pointer to slots (in some cases slots might not be able to hold it and would have to be grown as well, so this is only really always O(1) up to some limit, but that limit is likely to be extreme for the use cases here)
inserting the value into the proper index
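The index lookup described above can be sketched as follows. This is only an illustration of the arithmetic: the function name locate and the parameter max_log2 (the base-2 logarithm of MAX_NUM_VALUES, assumed to be a power of two) are hypothetical, and leading_zeros stands in for the BSR instruction.

```rust
/// Maps a 0-based element index to (slot, offset within that slot).
/// `max_log2` is log2(MAX_NUM_VALUES); slot M = max_log2 + 1 is the last
/// slot that still doubles in size, later slots all hold MAX_NUM_VALUES.
pub fn locate(index: usize, max_log2: u32) -> (usize, usize) {
    if index == 0 {
        return (0, 0); // slot 0 holds the single first value
    }
    // Position of the highest set bit, i.e. floor(log2(index)).
    let bit = usize::BITS - 1 - index.leading_zeros();
    if bit <= max_log2 {
        // Doubling region: slot bit + 1 holds indices 2^bit .. 2^(bit + 1).
        ((bit + 1) as usize, index - (1usize << bit))
    } else {
        // Fixed-size region: subtract the doubling region, then divide.
        let rest = index - (1usize << (max_log2 + 1));
        let slot = (max_log2 + 2) as usize + (rest >> max_log2);
        (slot, rest & ((1usize << max_log2) - 1))
    }
}
```

For example, with MAX_NUM_VALUES = 4 (max_log2 = 2), index 7 is the last value of slot 3 and index 8 is the first value of slot 4, the first fixed-size array.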
A few other benefits
Works great for (embedded/file-system/kernel/etc) memory managers that have a maximum alloc size (i.e. can only allocate 4KiB chunks)
Works great when you truly don't know how large your vector is likely to be. Starts out extremely small and grows by known amounts
Always having (near) constant insertion may be useful for timing-critical interrupts/etc
Does not leave fragmented space behind when growing. Might be great for appending records into a file.
Disadvantages
Is likely less performant (amortized) than a contiguous vector in nearly every way (even insertion). Moving memory is typically less expensive than adding a dereference for every operation, so the amortized cost of a vector is still probably smaller.
Also, it is not truly always O(1) since the slots vector has to be grown when all the slots are full, but this only happens when currentNumSlots*2*MAX_NUM_VALUES have been added since the last growth.
When you exceed the capacity of an array of size X, and so allocate a new array of size 2X, you can then incrementally move the X items from the old array into the start of the new array over the next X append operations. After that the old array can be discarded when the new array is full, just before you have to allocate a new array of size 4X.
Therefore, it is not necessary to maintain this list of increasing-size arrays in order to achieve O(1) appends (assuming that allocation is O(1)). Incremental doubling is a well-known technique in the de-amortization business, so I think most people desiring this sort of behaviour would turn to that first.
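A rough sketch of incremental doubling in safe Rust is shown below. It is a variation on the description above rather than a literal implementation: it uses VecDeque so that migrated items can be pushed onto the front of the new buffer, and it moves one old item per append, starting from the back of the old buffer. The type name IncrementalVec is hypothetical, and allocation is assumed to be O(1) as in the answer.

```rust
use std::collections::VecDeque;

/// Growable list whose push never copies more than one old element.
pub struct IncrementalVec<T> {
    old: VecDeque<T>, // logical prefix still waiting to be migrated
    cur: VecDeque<T>, // current buffer, holds the logical suffix
    cap: usize,       // capacity we allow `cur` to reach
}

impl<T> IncrementalVec<T> {
    pub fn new() -> Self {
        Self { old: VecDeque::new(), cur: VecDeque::with_capacity(1), cap: 1 }
    }

    pub fn len(&self) -> usize {
        self.old.len() + self.cur.len()
    }

    pub fn push(&mut self, value: T) {
        if self.cur.len() == self.cap {
            // `cur` is full, so every old item has already been migrated.
            debug_assert!(self.old.is_empty());
            self.cap *= 2;
            let bigger = VecDeque::with_capacity(self.cap);
            // The full buffer becomes `old`; the empty old one is dropped.
            self.old = std::mem::replace(&mut self.cur, bigger);
        }
        self.cur.push_back(value);
        // Migrate one item per append: last of `old` goes to front of `cur`.
        if let Some(moved) = self.old.pop_back() {
            self.cur.push_front(moved);
        }
    }

    pub fn get(&self, index: usize) -> Option<&T> {
        if index < self.old.len() {
            self.old.get(index)
        } else {
            self.cur.get(index - self.old.len())
        }
    }
}
```

Each append adds at most two items to the new buffer (the new value and one migrated value), so a buffer of capacity 2X fills up exactly when the X old items have all been moved.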
Nothing like this is commonly used, because memory allocation can almost never be considered O(1). Applications that can't afford to copy a block all at once generally can't afford to use any kind of dynamic memory allocation at all.

Hashmap hashcode to internal table index conversion

Hashmaps are usually implemented using an internal array (table) of buckets. When accessing a hashmap by key, we get the key's hashcode using a key-type-specific (logic-type-specific) hash function. Then we need to map the hashcode to an actual index in the internal bucket table.
key -> (hash function) -> hashcode -> (???) -> index in internal table
Sometimes the internal table can shrink or expand, depending on the hashmap's fill ratio. Then the hashcode->index conversion method probably has to change a bit.
For example our hash function returns 32 bit unsigned integer value and
moment A: internal table has capacity 10000
moment B: internal table has capacity 100000
What algorithms or approaches are usually used to perform the hashcode->internal table index conversion? How is the table resizing issue solved for them?
Usually, a simple modulo will do the job.
To take a quick example from Wikipedia, it's as simple as that:
hash = hashfunc(key)
index = hash % array_size
As you said, the resizing happens depending on the hashmap's fill ratio. The array is reallocated (see realloc()), then the indices are recalculated given the new array size, and the values are copied to their new indices.
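The modulo mapping and the recalculation on resize can be sketched like this. The function names bucket_index and resize are hypothetical, and the table is a simple vector of chained buckets that stores each entry's hash code next to its value. It is an illustration of the idea, not how any particular library implements it.

```rust
/// Maps a 32-bit hash code to a bucket index with a plain modulo.
pub fn bucket_index(hash: u32, capacity: usize) -> usize {
    (hash as usize) % capacity
}

/// Builds a table of `new_capacity` buckets and moves every entry of the
/// old table to the bucket given by its hash code and the new capacity.
pub fn resize<V>(old: Vec<Vec<(u32, V)>>, new_capacity: usize) -> Vec<Vec<(u32, V)>> {
    let mut table: Vec<Vec<(u32, V)>> = (0..new_capacity).map(|_| Vec::new()).collect();
    for bucket in old {
        for (hash, value) in bucket {
            // Same hash code, new capacity: the index usually changes.
            table[bucket_index(hash, new_capacity)].push((hash, value));
        }
    }
    table
}
```

With the capacities from the question, a hash code of 123456 lands in bucket 3456 at moment A (capacity 10000) and in bucket 23456 at moment B (capacity 100000).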
I wrote about this here and here.
When you increase the size of your vector of indices, you can be sure that the algorithm that worked well on the shorter vector will work less well on the longer one. It is possible to test beforehand and have new algorithms ready to put in place when you make the vector longer. Or, as the number of occupied indices in the current vector increases, have a background, lower-priority thread that tests different algorithms on the data.
As the example in one of my answers shows, a "new algorithm" need be nothing more than a different pair of matched prime numbers.

Representing 2D data optimized by row vs by column vs flat

In D3 I need to visualize loading lab samples into plastic 2D plates of 8 rows x 12 columns or similar. Sometimes I load a row at a time, sometimes a column at a time, occasionally I load flat 1D 0..95, or other orderings. Should the base D3 data() structure nest rows in columns (or vice verse) or should I keep it one dimensional?
Representing the data optimized for columns (columns[rows[]]) makes code complex when loading by rows, and vice versa. Representing it flat [0..95] is universal, but it requires calculating all row and column references for 2D modes. I'd rather reference all orderings out of a common base, but so far it's a win-lose proposition. I lean toward 1D flat and doing the math. Is there a win-win? Is there a way to parameterize or invert the ordering and have it optimized for both ways?
I believe in your case the best implementation would be an Associative array (specifically, a hash table implementation of it). Keys would be coordinates and values would be your stored data. Depending on your programming language you would need to handle keys in one way or another.
Example:
[0,0,0] -> someData(1,2,3,4,5)
[0,0,1] -> someData(4,2,4,6,2)
[0,0,2] -> someData(2,3,2,1,5)
Using a simple associative array would give you great insertion speeds and reading speeds, however code would become a mess if some complex selection of data blocks is required. In that case, using some database could be reasonable (though slower than a hashmap implementation of associative array). It would allow you to query some specific data in batches. For example, you could get whole row (or several rows) of data using one simple query:
SELECT * FROM data WHERE x=1 AND y=2 ORDER BY z ASC
Or, let's say selecting a 2x2x2 cube from the middle of 3d data:
SELECT * FROM data WHERE x >= 5 AND x <= 6 AND y >= 10 AND y <= 11 AND z >= 3 AND z <= 4 ORDER BY x ASC, y ASC, z ASC
EDIT:
On second thought, if the sizes of the dimensions won't change during runtime, you should go with a 1-dimensional array and do all the math yourself, as it is the fastest solution. If you initialize a 3-dimensional array as an array of arrays of arrays, every read/write of an element requires 2 additional hops in memory to find the required address. However, a function like:
int pos(int w, int h, int x, int y, int z) { return z*w*h + y*w + x; } // w, h - dimensions; x, y, z - position
would be inlined by most compilers and be pretty fast.
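For the 8 x 12 plate from the question, the flat layout plus a little index math might look like the sketch below. It is written in Rust for illustration only (the question itself is about D3), and all names are hypothetical. Row-wise and column-wise loading then become two different orderings over the same flat array.

```rust
pub const ROWS: usize = 8;
pub const COLS: usize = 12;

/// Flat (row-major) index of the well at (row, col).
pub fn flat_index(row: usize, col: usize) -> usize {
    row * COLS + col
}

/// Inverse mapping: (row, col) of a flat index.
pub fn row_col(index: usize) -> (usize, usize) {
    (index / COLS, index % COLS)
}

/// Flat indices in the order used when loading one row at a time.
pub fn row_major_order() -> Vec<usize> {
    (0..ROWS * COLS).collect()
}

/// Flat indices in the order used when loading one column at a time.
pub fn column_major_order() -> Vec<usize> {
    (0..COLS)
        .flat_map(|col| (0..ROWS).map(move |row| flat_index(row, col)))
        .collect()
}
```

Any other loading order can be expressed the same way, as a list of flat indices into the one shared array.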

Data structure for tiled map

I want to make an infinite tiled map, from (-max_int,-max_int) to (max_int,max_int), so I'm going to make a basic structure, the chunk: each chunk contains char tiles[w][h] and its int x, y coordinates. For example, with h=w=10, tile(15,5) is in chunk(1,0) at local coordinate (5,5), and tile(-25,-17) is in chunk(-3,-2) at (5,3), and so on. Now there can be any number of chunks, and I need to store them and access them easily in O(log n) or better (O(1) if possible, though I doubt it is). It should be easy to add, find and (not a must) remove chunks. So what data structure should I use?
Read into KD-tree or Quad-tree (the 2d variant of Octree). Both of these might be a big help here.
So all your space is split into chunks (rectangular clusters). In general, the problem is storing data in a sparse matrix (the clusters are already implemented). Why not use two-level dictionary-like containers? I.e. an rb-tree keyed by row index whose values are rb-trees keyed by column index. Or, if you are lucky, you can use hashes to get your O(1). In both cases, if you can't find a row, you allocate it in the container and create a new container as its value, initially holding only a single chunk. Of course, allocating a new chunk in an existing row will be a bit faster than in a new one, and I guess that's the only issue with this approach.
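A minimal sketch of the hash-based variant, assuming 10 x 10 chunks as in the question: it uses a single HashMap keyed by the chunk coordinate pair rather than the two-level container described above, and the names Chunk, TileMap and split are hypothetical. Euclidean division gives the floor behaviour needed for negative tile coordinates.

```rust
use std::collections::HashMap;

pub const N: usize = 10; // chunk width and height, as in the question

pub struct Chunk {
    pub tiles: [[u8; N]; N],
}

/// Splits a tile coordinate into (chunk coordinate, local coordinate).
pub fn split(x: i32, y: i32) -> ((i32, i32), (usize, usize)) {
    let n = N as i32;
    // div_euclid rounds toward negative infinity for a positive divisor,
    // and rem_euclid is always in 0..n, so negative tiles work too.
    let chunk = (x.div_euclid(n), y.div_euclid(n));
    let local = (x.rem_euclid(n) as usize, y.rem_euclid(n) as usize);
    (chunk, local)
}

#[derive(Default)]
pub struct TileMap {
    chunks: HashMap<(i32, i32), Chunk>,
}

impl TileMap {
    /// Writes a tile, allocating its chunk on first use.
    pub fn set(&mut self, x: i32, y: i32, tile: u8) {
        let (chunk, (lx, ly)) = split(x, y);
        let entry = self
            .chunks
            .entry(chunk)
            .or_insert_with(|| Chunk { tiles: [[0; N]; N] });
        entry.tiles[lx][ly] = tile;
    }

    /// Reads a tile; None if its chunk has never been allocated.
    pub fn get(&self, x: i32, y: i32) -> Option<u8> {
        let (chunk, (lx, ly)) = split(x, y);
        self.chunks.get(&chunk).map(|c| c.tiles[lx][ly])
    }
}
```

split(15, 5) yields chunk (1, 0) with local (5, 5), and split(-25, -17) yields chunk (-3, -2) with local (5, 3), matching the examples in the question.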

Efficient mapping from 2^24 values to a 2^7 index

I have a data structure that stores amongst others a 24-bit wide value. I have a lot of these objects.
To minimize storage cost, I calculated the 2^7 most important values out of the 2^24 possible values and stored them in a static array. Thus I only have to save a 7-bit index to that array in my data structure.
The problem is: I get these 24-bit values and I have to convert them to my 7-bit index on the fly (no preprocessing possible). The computation is basically a search which one out of 2^7 values fits best. Obviously, this takes some time for a big number of objects.
An obvious solution would be to create a simple mapping array of bytes with the length 2^24. But this would take 16 MB of RAM. Too much.
One observation of the 16 MB array: On average 31 consecutive values are the same. Unfortunately there are also a number of consecutive values that are different.
How would you implement this conversion from a 24-bit value to a 7-bit index saving as much CPU and memory as possible?
Hard to say without knowing what the definition is of "best fit". Perhaps a kd-tree would allow a suitable search based on proximity by some metric or other, so that you quickly rule out most candidates, and only have to actually test a few of the 2^7 to see which is best?
This sounds similar to the problem that an image processor has when reducing to a smaller colour palette. I don't actually know what algorithms/structures are used for that, but I'm sure they're look-up-able, and might help.
As an idea...
Up the index table to 8 bits, then xor all 3 bytes of the 24 bit word into it.
then your table would consist of this 8 bit hash value, plus the index back to the original 24 bit value.
Since your data is RGB like, a more sophisticated hashing method may be needed.
bit24var & 0xff gives you the rightmost char (byte).
(bit24var >> 8) & 0xff gives you the one beside it.
(bit24var >> 16) & 0xff gives you the one beside that.
Yes, you are thinking correctly. It is quite likely that one or more of the 24 bit values will hash to the same index, due to the pigeonhole principle.
One method of resolving a hash clash is to use some sort of chaining.
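The XOR hash with chaining could be sketched as follows. The names xor_hash and PaletteTable are hypothetical. Note the limitation: this only finds exact matches among the stored important values and does not by itself perform the "best fit" search from the question.

```rust
/// XORs the three bytes of a 24-bit value into an 8-bit hash.
pub fn xor_hash(value: u32) -> u8 {
    let b0 = (value & 0xff) as u8;
    let b1 = ((value >> 8) & 0xff) as u8;
    let b2 = ((value >> 16) & 0xff) as u8;
    b0 ^ b1 ^ b2
}

/// 256 chained buckets of (24-bit value, index into the important array).
pub struct PaletteTable {
    buckets: Vec<Vec<(u32, u8)>>,
}

impl PaletteTable {
    /// Builds the table from the important values (at most 256 of them).
    pub fn new(important: &[u32]) -> Self {
        assert!(important.len() <= 256);
        let mut buckets: Vec<Vec<(u32, u8)>> = vec![Vec::new(); 256];
        for (index, &value) in important.iter().enumerate() {
            buckets[xor_hash(value) as usize].push((value, index as u8));
        }
        Self { buckets }
    }

    /// Returns the stored index if `value` is one of the important values.
    pub fn lookup(&self, value: u32) -> Option<u8> {
        self.buckets[xor_hash(value) as usize]
            .iter()
            .find(|&&(stored, _)| stored == value) // walk the chain on a clash
            .map(|&(_, index)| index)
    }
}
```

Values such as 0x000001 and 0x000100 hash to the same bucket, which is the clash the chain resolves.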
Another idea would be to put your important values in a different array, then simply search it first. If you don't find an acceptable answer there, then you can, shudder, search the larger array.
How many of the 2^24 values do you have? Can you sort these values and count them by counting the number of consecutive values?
Since you already know which of the 2^24 values you need to keep (i.e. the 2^7 values you have determined to be important), we can simply just filter incoming data and assign a value, starting from 0 and up to 2^7-1, to these values as we encounter them. Of course, we would need some way of keeping track of which of the important values we have already seen and assigned a label in [0,2^7) already. For that we can use some sort of tree or hashtable based dictionary implementation (e.g. std::map in C++, HashMap or TreeMap in Java, or dict in Python).
The code might look something like this (I'm using a much smaller range of values):
import random

def make_mapping(data, important):
    mapping = dict()  # dictionary to hold the final mapping
    next_index = 0  # the next free label that can be assigned to an incoming value
    for elem in data:
        if elem in important:  # check that the element is important
            if elem not in mapping:  # check that this element hasn't been assigned a label yet
                mapping[elem] = next_index
                next_index += 1  # this label is assigned, the next new important value will get the next label
    return mapping

if __name__ == '__main__':
    important_values = [1, 5, 200000, 6, 24, 33]
    data = list(range(0, 300000))  # list() so that random.shuffle can reorder it in place
    random.shuffle(data)
    answer = make_mapping(data, important_values)
    print(answer)
You can make the search much faster by using a hash- or tree-based set data structure for the set of important values. That would make the entire procedure O(n*log(k)) (or O(n) if it is a hashtable), where n is the size of the input and k is the size of the set of important values.
Another idea is to represent the 24BitValue array in a bit map. A nice unsigned char can hold 8 bits, so one would need 2^24 / 8 = 2^21 array elements. That's 2,097,152 bytes (2 MiB). If the corresponding bit is set, then you know that that specific 24BitValue is present in the array and needs to be checked.
One would need an iterator, to walk through the array and find the next set bit. Some machines actually provide a "find first bit" operation in their instruction set.
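A small sketch of such a bit map with a "find next set bit" helper is given below. It packs the bits into 64-bit words instead of unsigned chars (same 2 MiB total), and the type name Bitmap24 is hypothetical. trailing_zeros plays the role of the hardware "find first bit" operation.

```rust
/// One presence bit for each of the 2^24 possible values.
pub struct Bitmap24 {
    words: Vec<u64>, // 2^24 / 64 = 2^18 words, 2 MiB in total
}

impl Bitmap24 {
    pub fn new() -> Self {
        Self { words: vec![0u64; 1 << 18] }
    }

    /// Marks `value` (must be below 2^24) as present.
    pub fn set(&mut self, value: u32) {
        assert!(value < (1 << 24));
        self.words[(value / 64) as usize] |= 1u64 << (value % 64);
    }

    pub fn contains(&self, value: u32) -> bool {
        value < (1 << 24) && (self.words[(value / 64) as usize] >> (value % 64)) & 1 == 1
    }

    /// Smallest present value that is >= `from`, if any.
    pub fn next_set(&self, from: u32) -> Option<u32> {
        let mut w = (from / 64) as usize;
        if w >= self.words.len() {
            return None;
        }
        // Ignore the bits below `from` in the first word.
        let mut word = self.words[w] & (!0u64 << (from % 64));
        loop {
            if word != 0 {
                return Some(w as u32 * 64 + word.trailing_zeros());
            }
            w += 1;
            if w == self.words.len() {
                return None;
            }
            word = self.words[w];
        }
    }
}
```

Walking all present values is then a loop that calls next_set(previous + 1) until it returns None.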
Good luck on your quest.
Let us know how things turn out.
Evil.
