Calculating size class - algorithm

High-performance malloc implementations often use segregated free lists; that is, each of the more common (smaller) sizes gets its own separate free list.
A first attempt at this could say, below a certain threshold, the size class is just the size divided by 8, rounded up. But actual implementations have more nuance, arranging the recognized size classes on something like an exponential curve (but gentler than simply doubling at each step), e.g. http://jemalloc.net/jemalloc.3.html
I'm trying to figure out how to convert a size to a size class on some such curve. Now, in principle this is not difficult; there are several ways to do it. But to achieve the desired goal of speeding up the common case, it really needs to be fast, preferably only a few instructions.
What's the fastest way to do this conversion?

In the dark ages, when I used to worry about those sorts of things, I just iterated through all the possible sizes starting at the smallest one.
This actually makes a lot of sense, since allocating memory strongly implies work outside of the actual allocation -- like initializing and using that memory -- that is proportional to the allocation size. In all but the smallest allocations, that overhead will swamp whatever you spend to pick a size class.
Only the small ones really need to be fast.

Let's assume you want to use all the power-of-two sizes plus the halfway point between each pair, i.e. 8, 12, 16, 24, 32, 48, 64, ... 4096.
Check that the value is less than or equal to 4096; I have arbitrarily chosen that as the highest allocatable size for this example.
Take the size of your struct and find the position of its highest set bit. Multiply that position by two, add 1 if the next bit down is also set, and add one more if the value is higher than the value those two bits represent; the result is an index into the size list. This should be 5-6 ASM instructions.
So for 26 = 16 + 8 + 2, the set bits are at positions 4, 3 and 1; the index is 4*2 + 1 + 1 = 10, so the 10th index (the 32-byte list) is chosen.
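A minimal C sketch of that computation, assuming GCC/Clang's __builtin_clz, an 8-byte minimum allocation, and a size list laid out as 8, 12, 16, 24, ... (the function name is just illustrative):

#include <stdint.h>

static inline unsigned size_to_class(uint32_t size)
{
    if (size < 8)
        size = 8;                                 /* assumed minimum allocation */
    unsigned high = 31 - __builtin_clz(size);     /* position of the highest set bit */
    unsigned idx = high * 2;
    if (size & (1u << (high - 1)))                /* next bit down set: the "+half" step */
        idx += 1;
    uint32_t base = (1u << high) | (size & (1u << (high - 1)));
    if (size > base)                              /* round up past the two-bit value */
        idx += 1;
    return idx;       /* 8 -> 6, 12 -> 7, 16 -> 8, 26 -> 10 (the 32-byte list), ... */
}

The caller would still check size <= 4096 first, as described above, and hand anything larger to the general allocator.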
Your system might require some minimum allocation size.
Also, if you're doing a lot of allocations, consider using a pool allocator that is private to your program, with backup from the system allocator.

Related

avx512 strided gather with arbitrary stride

I know in AVX512 you can perform strided gathers with strides of 1,2,4,8. However, what if I have an arbitrary stride that can be anywhere between 10 and 1000? The stride is known at compile time. I understand that the instruction then won't be the bottleneck; the memory probably will be. Is _mm512_set_ps the most effective way to do this?
strided gathers with strides of 1,2,4,8
No, there's no special support for that; maybe you're thinking of ARM/ARM64 NEON vld4 4-way deinterleave?
In x86 you can use 1, 2, 4, or 8 as a scale factor for an index vector for vpgatherdd / vpgatherdps, but if you just want every 2nd element it's better to manually shuffle (e.g. _mm512_permutex2var_ps to grab alternate floats from 2 input vectors), getting many useful elements with one wide load instead of accessing cache once per element.
But in your case, with a minimum stride of 10, at most 2 elements will come from the same 16 x 32-bit 512-bit vector, and with wider strides not even one per vector.
So you can use vpgatherdps with _mm512_add_epi32(idx, _mm512_set1_epi32(16 * stride)) in a loop. Or better, just use a fixed vector of indices and increment the base pointer. You might generate that vector of indices with _mm512_mullo_epi32(_mm512_setr_epi32(0,1,2,3,...,15), _mm512_set1_epi32(stride)). Since a float is 4 bytes wide, use a scale factor of 4 with your gathers.
Even if you need to handle huge arrays, incrementing the pointer instead of the vector elements avoids any need for 64-bit indices, as well as minimizing the number of vector uops. (Valuable when using 512-bit vectors on current CPUs.)
IIRC, Intel's optimization manual has a section about strided loads and the tradeoff in manual gather vs. using gather instructions. Gather instructions become relatively better the wider your vectors are (2/clock load throughput but only 1/clock shuffle throughput for most shuffles), so especially for 512-bit vectors it's likely a win to use vector shuffles.
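A rough C sketch of the fixed-index-vector approach described above, assuming AVX-512F, a stride given in elements, and a made-up function name; the final n % 16 elements are left to scalar cleanup:

#include <immintrin.h>
#include <stddef.h>

void gather_strided(float *dst, const float *src, size_t n, int stride)
{
    /* one fixed index vector: 0, stride, 2*stride, ..., 15*stride */
    const __m512i idx = _mm512_mullo_epi32(
        _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
        _mm512_set1_epi32(stride));

    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m512 v = _mm512_i32gather_ps(idx, src, 4);   /* scale 4: floats are 4 bytes */
        _mm512_storeu_ps(dst + i, v);
        src += 16 * (size_t)stride;   /* advance the base pointer, not the indices,
                                         so 32-bit indices suffice even for huge arrays */
    }
}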

Using a slice instead of list when working with large data volumes in Go

I have a question on the utility of slices in Go. I have just seen Why are lists used infrequently in Go? and Why use arrays instead of slices? but had some questions which I did not see answered there.
In my application:
I read a CSV file containing approx 10 million records, with 23 columns per record.
For each record, I create a struct and put it into a linked list.
Once all records have been read, the rest of the application logic works with this linked list (the processing logic itself is not relevant for this question).
The reason I prefer a list and not a slice is due to the large amount of contiguous memory an array/slice would need. Also, since I don't know the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Every Go tutorial or article I read seems to suggest that I should use slices and not lists (as a slice can do everything a list can, but do it better somehow). However, I don't see how or why a slice would be more helpful for what I need? Any ideas from anyone?
... approx 10 million records, with 23 columns per record ... The reason I prefer a list and not a slice is due to the large amount of contiguous memory an array/slice would need.
This contiguous memory is its own benefit as well as its own drawback. Let's consider both parts.
(Note that it is also possible to use a hybrid approach: a list of chunks. This seems unlikely to be very worthwhile here though.)
Also, since I don't know the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Clearly, if there are n records, and you allocate and fill in each one once (using a list), this is O(n).
If you use a slice, and allocate a single extra slice entry every time, you start with none, grow it to size 1, then copy the 1 to a new array of size 2 and fill in item #2, grow it to size 3 and fill in item #3, and so on. The first of the n entities is copied n times, the second is copied n-1 times, and so on, for n(n+1)/2 = O(n²) copies. But if you use a multiplicative expansion technique (which Go's append implementation does), this drops to O(log n) reallocations. Each one does copy more bytes though. It ends up being O(n), amortized (see Why do dynamic arrays have to geometrically increase their capacity to gain O(1) amortized push_back time complexity?).
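Writing out the copy counts from the two strategies above, using the paragraph's own counting:

grow by one each time:  n + (n-1) + ... + 1 = n(n+1)/2 = O(n²) element copies
double each time:       1 + 2 + 4 + ... + n < 2n       = O(n) element copies

so the geometric strategy costs O(1) amortized per append.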
The space used with the slice is obviously O(n). The space used for the linked list approach is O(n) as well (though the records now require at least one forward pointer so you need some extra space per record).
So in terms of the time needed to construct the data, and the space needed to hold the data, it's O(n) either way. You end up with the same total memory requirement. The main difference, at first glance anyway, is that the linked-list approach doesn't require contiguous memory.
So: What do we lose when using contiguous memory, and what do we gain?
What we lose
The thing we lose is obvious. If we already have fragmented memory regions, we might not be able to get a contiguous block of the right size. That is, given:
used: 1 MB (starting at base, ending at base+1M)
free: 1 MB (starting at +1M, ending at +2M)
used: 1 MB (etc)
free: 1 MB
used: 1 MB
free: 1 MB
we have a total of 6 MB, 3 used and 3 free. We can allocate three 1 MB blocks, but we can't allocate one 3 MB block unless we can somehow compact the three "used" regions.
Since Go programs tend to run in virtual memory on large-memory-space machines (virtual sizes of 64 GB or more), this tends not to be a big problem. Of course everyone's situation differs, so if you really are VM-constrained, that's a real concern. (Other languages have compacting GC to deal with this, and a future Go implementation could at least in theory use a compacting GC.)
What we gain
The first gain is also obvious: we don't need pointers in each record. This saves some space; the exact amount depends on the size of the pointers, whether we're using singly linked lists, and so on. Let's just assume two 8-byte pointers, or 16 bytes per record. Multiply by 10 million records and we're looking pretty good here: we've saved 160 MBytes. (Go's container/list implementation uses a doubly linked list, and on a 64-bit machine, this is the size of the per-element threading needed.)
We gain something less obvious at first, though, and it's huge. Because Go is a garbage-collected language, every pointer is something the GC must examine at various times. The slice approach has zero extra pointers per record; the linked-list approach has two. That means that the GC system can avoid examining the nonexistent 20 million pointers (in the 10 million records).
Conclusion
There are times to use container/list. If your algorithm really calls for a list and is significantly clearer that way, do it that way, unless and until it proves to be a problem in practice. Or, if you have items that can be on some collection of lists—items that are actually shared, but some of them are on the X list and some are on the Y list and some are on both—this calls for a list-style container. But if there's an easy way to express something as either a list or a slice, go for the slice version first. Because slices are built into Go, you also get the type safety / clarity mentioned in the first link (Why are lists used infrequently in Go?).

How do I correctly adjust the Argon2 parameters in Go to consume less memory?

Argon2 by design is memory hungry. In the semi-official Go implementation, the following parameters are recommended when using IDKey:
key := argon2.IDKey([]byte("some password"), salt, 1, 64*1024, 4, 32)
where 1 is the time parameter and 64*1024 is the memory parameter. This means the library will create a 64MB buffer when hashing a value. In scenarios where many hashing procedures might run at the same time this creates high pressure on the host memory.
In cases where this is too much memory consumption it is advised to decrease the memory parameter and increase the time factor:
The draft RFC recommends[2] time=1, and memory=64*1024 is a sensible number. If using that amount of memory (64 MB) is not possible in some contexts then the time parameter can be increased to compensate.
So, assuming I would like to limit memory consumption to 16MB (1/4 of the recommended 64MB), it is still unclear to me how I should be adjusting the time parameter: is this supposed to be times 4 so that the product of memory and time stays the same? Or is there some other logic behind the correlation of time and memory at play?
The draft RFC recommends[2] time=1, and memory=64*1024 is a sensible number. If using that amount of memory (64 MB) is not possible in some contexts then the time parameter can be increased to compensate.
I think the key here is the word "to compensate", so in this context it is trying to say: to achieve similar hashing complexity as IDKey([]byte("some password"), salt, 1, 64*1024, 4, 32), you can try IDKey([]byte("some password"), salt, 4, 16*1024, 4, 32).
But if you want to decrease the complexity of the hashing (and decrease the performance overhead), you can decrease the memory parameter without regard to the time parameter.
is this supposed to be times 4 so that the product of memory and time stays the same?
I don't think so. I believe the memory here means the length of the result hash, but the time parameter could mean "how many times the hashing result needs to be re-hashed until I get the end result".
So these 2 parameters are independent of each other. They just control how much "brute-force cost savings due to time-memory tradeoffs" you want to achieve.
Difficulty is roughly equal to time_cost * memory_cost (and possibly / parallelism). So if you 0.25x the memory cost, you should 4x the time cost. See also this answer.
// The time parameter specifies the number of passes over the memory and the
// memory parameter specifies the size of the memory in KiB.
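Taking the difficulty ≈ time_cost * memory_cost approximation above at face value, the "compensated" settings keep the product constant:

1 * 64*1024 = 65536    (recommended: time=1, memory=64 MB)
4 * 16*1024 = 65536    (time=4, memory=16 MB gives the same product)

which matches the IDKey(..., 4, 16*1024, 4, 32) call suggested above.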
Check out the Argon2 API itself. I'm going to cross-reference a little bit and use the argon2-cffi documentation. The Go interface exposes the same parameters as the reference C implementation, so the prototype should be the same.
Parameters
time_cost (int) – Defines the amount of computation realized and therefore the execution time, given in number of iterations.
memory_cost (int) – Defines the memory usage, given in kibibytes.
parallelism (int) – Defines the number of parallel threads (changes the resulting hash value).
hash_len (int) – Length of the hash in bytes.
salt_len (int) – Length of random salt to be generated for each password.
encoding (str) – The Argon2 C library expects bytes. So if hash() or verify() are passed an unicode string, it will be encoded using this encoding.
type (Type) – Argon2 type to use. Only change for interoperability with legacy systems.
Indeed, if we look at the Go docs:
// The draft RFC recommends[2] time=1, and memory=64*1024 is a sensible number.
// If using that amount of memory (64 MB) is not possible in some contexts then
// the time parameter can be increased to compensate.
//
// The time parameter specifies the number of passes over the memory and the
// memory parameter specifies the size of the memory in KiB. For example
// memory=64*1024 sets the memory cost to ~64 MB. The number of threads can be
// adjusted to the numbers of available CPUs. The cost parameters should be
// increased as memory latency and CPU parallelism increases. Remember to get a
// good random salt.
I'm not 100% clear on the impact of thread count, but I believe it does parallelize the hashing, and like any multithreaded job, this reduces the total time taken to roughly 1/N for N cores. Apparently, you should essentially set the parallelism to the CPU count.

How to automatically calculate the block and grid size of a 2D image in CUDA?

I know the ideas of block and grid in CUDA, and I'm wondering if there is a well-written helper function that can determine the best block and grid size for any given 2D image.
For example, for the 512x512 image mentioned in this thread, the grid is 64x64 and the block is 8x8.
However, sometimes my input image may not be a power of 2; it may be 317x217 or something like that. In this case, maybe the grid should be 317x1 and the block should be 1x217.
So if I have an application that accepts an image from the user and uses CUDA to process it, how can it automatically determine the size and dimensions of the block and grid, when the user can input an image of any size?
Is there an existing helper function or class that handles this problem?
Usually you want to choose the size of your blocks based on your GPU architecture, with the goal of maintaining 100% occupancy on the Streaming Multiprocessor (SM). For example, the GPUs at my school can run 1536 threads per SM, and up to 8 blocks per SM, but each block can only have up to 1024 threads in each dimension. So if I were to launch a 1d kernel on the GPU, I could max out a block with 1024 threads, but then only 1 block would be on the SM (66% occupancy). If I instead chose a smaller number, like 192 threads or 256 threads per block, then I could have 100% occupancy with 6 and 8 blocks respectively on the SM.
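Spelling out the occupancy arithmetic for that example (1536 resident threads and up to 8 resident blocks per SM):

1 block  * 1024 threads = 1024 / 1536 ≈ 66.7% occupancy
6 blocks *  256 threads = 1536 / 1536 = 100% occupancy
8 blocks *  192 threads = 1536 / 1536 = 100% occupancy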
Another thing to consider is the amount of memory that must be accessed vs the amount of computation to be done. In many imaging applications, you don't just need the value at a single pixel, rather you need the surrounding pixels as well. Cuda groups its threads into warps, which step through every instruction simultaneously (currently, there are 32 threads to a warp, though that may change). Making your blocks square generally minimizes the amount of memory that needs to be loaded vs the amount of computation that can be done, making the GPU more efficient. Likewise, blocks that are a power of 2 load memory more efficiently (if properly aligned with memory addresses) since Cuda loads memory lines at a time instead of by single values.
So for your example, even though it might seem more effective to have a grid that is 317x1 and blocks that are 1x217, your code will likely be more efficient if you launch blocks that are 16x16 on a grid that is 20x14 as it will lead to better computation/memory ratio and SM occupancy. This does mean, though, that you will have to check within the kernel to make sure the thread is not out of the picture before trying to access memory, something like
const int thread_id_x = blockIdx.x*blockDim.x+threadIdx.x;
const int thread_id_y = blockIdx.y*blockDim.y+threadIdx.y;
if(thread_id_x < pic_width && thread_id_y < pic_height)
{
//Do stuff
}
Lastly, you can determine the lowest number of blocks you need in each grid dimension that completely covers your image with (N+M-1)/M, where N is the total number of threads needed in that dimension (the image size in pixels) and M is the number of threads per block in that dimension.
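For the 317x217 example with 16x16 blocks, that formula reproduces the 20x14 grid mentioned above; a small host-side sketch (variable names are just illustrative):

const unsigned block_x = 16, block_y = 16;
const unsigned grid_x = (317 + block_x - 1) / block_x;   /* = 20 */
const unsigned grid_y = (217 + block_y - 1) / block_y;   /* = 14 */
/* launch with <<<dim3(grid_x, grid_y), dim3(block_x, block_y)>>> and rely on
   the in-kernel bounds check shown above for the partially filled edge blocks */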
It depends on how you deal with the image. If each thread only processes a single pixel separately, for example adding 3 to each pixel value, you can just assign one dimension to your block size and the other to your grid size (just don't go out of range). But if you want to do something like a filter or an erode, this kind of operation often needs to access the pixels near the center pixel, in a 3*3 or 9*9 neighbourhood. Then the block should be 8*8 as you mentioned, or some other value. And you'd better use texture memory, because when threads access global memory, they do so a warp (32 threads) at a time within a block.
So there isn't a function like the one you described. The number of threads and blocks depends on how you process the data; it is not universal.

overlapping variables and performance

Sorry in advance if I have some of this wrong. I may edit to correct later if it's not too disruptive.
When multiple variables are declared in adjacent memory, as I understand it, on a very low level, registers are created that encapsulate a number of bytes, commonly 1, 2, 4 or 8. This allows those bit ranges to be binary rotated, as well as thought of by the processor as numbers and so mutated with simple mathematics such as add, subtract, multiply and divide.
There may be abstraction reasons for not overlapping these ranges, but as many languages consider instructions to occur in a well-defined sequential order that the coder will be aware of, are there any performance reasons not to overlap one or more in adjacent bytes of allocated memory?
For example, in a block of allocated memory where every bit starts as 0, bytes 0 to 3 could be used as an int, as well as bytes 1 to 4. The first could be set to a value before the second range was multiplied by 3.
If there are performance reasons not to, are they overcome by otherwise having to copy values in and out of completely new variables and perform more complicated processes to achieve certain algorithms that could otherwise be done at a very low level?
There is nothing wrong with this trick when it is done in assembly: optimizers have been routinely making use of knowing where parts of an integer are to save CPU cycles and reduce the size of the code. For example, when a 32-bit integer variable is initialized to a value that fits in only 16 bits, optimizing compilers would replace the instruction that stores a 32-bit value in memory with a faster instruction that stores a 16-bit value to the lower bits of the variable, and clear the upper 16 bits. Moreover, many optimizers would go even further: if a constant is divisible by 2^16, they would store the value divided by 2^16 to the upper 16 bits, and clear lower 16 bits.
Some architectures restrict such manipulations to addresses of certain properties, for example, by requiring all 4-byte memory load/store instructions to be done at addresses divisible by four. These restrictions may reduce applicability of partial-value writing tricks.
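To make the question's bytes-0-to-3 / bytes-1-to-4 scenario concrete, here is a minimal C sketch; memcpy keeps the unaligned, overlapping accesses within defined behaviour, and compilers typically lower such fixed-size memcpy calls to single load/store instructions on architectures that permit unaligned access:

#include <stdint.h>
#include <string.h>

void overlap_demo(unsigned char *buf)    /* buf points to at least 5 zeroed bytes */
{
    uint32_t a = 1000;                   /* "variable" occupying bytes 0..3 */
    memcpy(buf, &a, sizeof a);

    uint32_t b;                          /* overlapping "variable" in bytes 1..4 */
    memcpy(&b, buf + 1, sizeof b);       /* note: the value b sees depends on endianness */
    b *= 3;
    memcpy(buf + 1, &b, sizeof b);
}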
