Does append() always extend by the minimal capacity needed? - go

When learning about slices, I have this doubt: does append() always extend the capacity by the minimal amount needed?
a := make([]byte, 0)
a = append(a, 1, 2, 3)
cap(a) == 3 // will this always be true?
// or the assumption may not hold since the underlying implementation of append()
// is not specified.

No, it's not guaranteed in this case. The specification says:
append(s S, x ...T) S // T is the element type of S
If the capacity of s is not large enough to fit the additional values, append allocates a new, sufficiently large slice that fits both the existing slice elements and the additional values. Thus, the returned slice may refer to a different underlying array.
(Emphasis mine)
In your case, clearly any capacity >= 3 is sufficiently large, so you can rely on cap >= 3, but you cannot rely on cap == 3.
Of course you can assume cap in this case will not be, say, 1e6 or 1e9 or 1e12. However, the exact enlarging strategy (when and how large a new backing array is allocated) is intentionally not specified in every detail, to allow the compiler developers to experiment with the knobs attached to this mechanism.

I would add that not only is the capacity not guaranteed to equal the length; in fact, for large lengths, the resulting slice will almost never have a capacity equal to its length.
append() is promoted as the replacement for the vector package. For that to work, the complexity of appending must match the complexity in the vector package, which means that appending an element must have amortized O(1) complexity. Although this complexity is not guaranteed by the language specification, it must hold for the patterns append() is now used for in the Go community to work efficiently.
For append() to be amortized O(1), it must expand the capacity by a fixed percentage of the current capacity each time it runs out of space, for example by doubling it. Think about it: if the capacity doubles every time it runs out, the length and the capacity can only be equal when the length is exactly a power of 2 (assuming it started out as a power of 2), which is not frequent.
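As a small illustration (the exact capacities are an implementation detail of the Go runtime and vary between versions and element sizes), you can watch the capacity jump ahead of the length:

package main

import "fmt"

func main() {
    a := make([]byte, 0)
    a = append(a, 1, 2, 3)
    fmt.Println(len(a), cap(a)) // len is 3; cap is only guaranteed to be >= 3

    // Append one element at a time and print the capacity whenever it changes.
    // Typical runtimes grow it roughly geometrically rather than by 1.
    var b []byte
    prev := -1
    for i := 0; i < 1000; i++ {
        b = append(b, byte(i))
        if cap(b) != prev {
            prev = cap(b)
            fmt.Printf("len=%4d cap=%4d\n", len(b), cap(b))
        }
    }
}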

Related

Slice declarations like [0:2]

I don't understand slice declarations in Go.
To me, a declaration for the first and second element of an array should be 0:1.
But it is 0:2. Why? How should I read this: from zero to 2 minus 1, every time?
var slice = array[0:2]
Slice bounds are half-open; this is very standard across many programming languages. One advantage is that it makes the length of the range apparent (2-0=2). Specifically, it's common to do this:
s[start:start+len]
And it's obvious that this selects len elements from the slice, starting with start. If the range were fully closed (both bounds included), there would have to be a lot of -1s in code to deal with slicing and subslicing.
It works similarly in C++ ranges and Python, etc. Here's some reasoning from a C++ answer, attributed to Dijkstra:
You want the size of the range to be a simple difference end − begin;
including the lower bound is more "natural" when sequences degenerate to empty ones, and also because the alternative (excluding the lower bound) would require the existence of a "one-before-the-beginning" sentinel value.
A slice is formed by specifying two indices, a low and high bound, separated by a colon:
a[low : high]
This selects a half-open range which includes the first element, but excludes the last one.
This is from Golang's page on slices https://tour.golang.org/moretypes/7
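To make the half-open convention concrete, here is a small example (the variable names are mine) showing that a[low:high] and s[start:start+len] select exactly high-low and len elements respectively:

package main

import "fmt"

func main() {
    array := [5]byte{10, 20, 30, 40, 50}

    s := array[0:2]        // the first and second elements
    fmt.Println(s, len(s)) // [10 20] 2

    start, length := 1, 3
    t := array[start : start+length] // exactly `length` elements starting at `start`
    fmt.Println(t, len(t))           // [20 30 40] 3
}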

Counting Sort different approach

In a counting sort algorithm, we initialize a count array with a size equal to the maximum value in the given array. The runtime of this method is O(n + max value). However, with an extra loop we can find the minimum and maximum values of the given array first:
min, max := given[0], given[0]
for i := 1; i < len(given); i++ {
    if given[i] > max {
        max = given[i]
    }
    if given[i] < min {
        min = given[i]
    }
}
Then we use that data to create the count array, let's say covering only 95-100. We could decrease the runtime tremendously in some cases. However, I haven't seen an approach like this. Would it still be a counting sort algorithm, or does it have another name that I don't know?
Counting sort is typically used when we know upfront that values will be restricted to a certain range.
This range doesn't need to start at zero; it's absolutely fine to use an array of length six whose elements represent the counts of values 95 through 100 (or, for that matter, the counts of values from −2 to 3). So, yes, your approach is still "counting sort".
But if you don't know this restriction upfront, you're not likely to get faster results by doing a complete pass over the data to check.
For example: suppose you have 1,000,000 elements, and you know they're all somewhere in the range 0–200, but you think they're probably all in a much narrower range. Well, the cost of prescanning the entire input array is going to be greater than the cost of working with a 201-element working array, which means it costs more than it can possibly save compared to just doing a counting sort with the range 0–200.
Runtime of this method is O(n + Max value).
The runtime is O(max(num_elements, range_size)), which — due to the magic of Landau (big-O) notation — is the same as O(num_elements + range_size). Your approach only affects the asymptotic complexity if max_value is asymptotically greater than both num_elements and range_size.
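As a sketch of the variant described in the question (the function and parameter names are illustrative, not from any library), the count array can be offset by the minimum once the extra min/max pass has run:

// countingSortRange assumes min and max were found by an extra O(n) pass,
// as in the question's loop, and counts occurrences over [min, max] only.
func countingSortRange(a []int, min, max int) []int {
    counts := make([]int, max-min+1) // size = range, not maximum value
    for _, x := range a {
        counts[x-min]++
    }
    out := make([]int, 0, len(a))
    for offset, c := range counts {
        for ; c > 0; c-- {
            out = append(out, min+offset)
        }
    }
    return out
}

For values known to lie in 95-100 this allocates a 6-element count array instead of a 101-element one, which is exactly the trade-off discussed above.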

How to get N greatest elements out of M elements using CUDA, where N << M?

I am just wondering whether there is any efficient ways of getting N greatest elements out of M elements, where N is much smaller than M (e.g. N = 10, and M = 1000) using the GPU.
The problem is that, due to the large size of the input data, I really do not want to transfer the data from the GPU to the CPU and then get it back. However, exact sorting does not seem to work well because of thread divergence and the time wasted on sorting elements that we do not really care about (in the case above, the don't-care elements are ranks 11 through 1000).
If N is small enough that the N largest values can be kept in shared memory, that would allow a fast implementation that only reads through your array of M elements in global memory once and then immediately writes out these N largest values. Implementation becomes simpler if N also doesn't exceed the maximum number of threads per block.
Contrary to serial programming, I would not use a heap (or other more complicated data structure), but just a sorted array. There is plenty of parallel hardware on an SM that would go unused when traversing a heap. The entire thread block can be used to shift the elements of the shared memory array that are smaller than the newly incoming value.
If N<=32, a neat solution is possible that keeps a sorted list of the N largest numbers in registers, using warp shuffle functions.
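The details above are GPU-specific, but the data flow is easy to sketch serially. The following Go sketch (the names are mine; it is not a CUDA implementation) keeps the N largest values seen so far in a small buffer sorted in descending order and shifts smaller entries down when a new value qualifies; on the GPU, that shifting is what the whole thread block would do cooperatively in shared memory:

import "sort"

// topN scans data once and keeps the n largest values seen so far,
// sorted in descending order.
func topN(data []float32, n int) []float32 {
    best := make([]float32, 0, n)
    for _, x := range data {
        if len(best) == n && x <= best[n-1] {
            continue // fast path: smaller than everything we keep
        }
        // Insertion point that preserves descending order.
        i := sort.Search(len(best), func(j int) bool { return best[j] < x })
        if len(best) < n {
            best = append(best, 0) // grow by one before shifting
        }
        copy(best[i+1:], best[i:]) // shift smaller values down, dropping the last one
        best[i] = x
    }
    return best
}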

Running maximum of changing array of fixed size

At first, I am given an array of fixed size, call it v. The typical size of v would be a few thousand entries. I start by computing the maximum of that array.
Following that, I am periodically given a new value for v[i] and need to recompute the value of the maximum.
What is a practically fast way (average time) of computing that maximum?
Edit: we can assume that the process is:
1) uniformly choosing a random entry;
2) changing its value to a value drawn uniformly from [0,1].
I believe this specifies the problem a bit better and allows an unequivocal "best answer" (which will depend on the array size).
You can maintain a max-heap over that array. Each heap element can be an index into the array, and for every element of the array you should also keep an index of its position in the max-heap. Then every time v[i] is changed, you only need O(log(n)) work to maintain the heap (if v[i] increased, it moves up in the heap; if v[i] decreased, it moves down in the heap).
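A hedged Go sketch of this idea (the type and method names are mine), using container/heap's Fix to restore order after a single value changes and a pos slice so the changed array index can be located inside the heap in O(1):

package main

import (
    "container/heap"
    "fmt"
)

// indexedMaxHeap arranges the indices of v as a max-heap on v's values;
// pos[i] records where array index i currently sits in the heap.
type indexedMaxHeap struct {
    v    []float64 // the underlying array
    heap []int     // heap of indices into v
    pos  []int     // pos[i] = position of index i within heap
}

func (h *indexedMaxHeap) Len() int           { return len(h.heap) }
func (h *indexedMaxHeap) Less(a, b int) bool { return h.v[h.heap[a]] > h.v[h.heap[b]] } // max-heap
func (h *indexedMaxHeap) Swap(a, b int) {
    h.heap[a], h.heap[b] = h.heap[b], h.heap[a]
    h.pos[h.heap[a]] = a
    h.pos[h.heap[b]] = b
}
func (h *indexedMaxHeap) Push(x interface{}) {} // unused: the heap has fixed size
func (h *indexedMaxHeap) Pop() interface{}   { return nil }

func newIndexedMaxHeap(v []float64) *indexedMaxHeap {
    h := &indexedMaxHeap{v: v, heap: make([]int, len(v)), pos: make([]int, len(v))}
    for i := range v {
        h.heap[i] = i
        h.pos[i] = i
    }
    heap.Init(h)
    return h
}

// Update changes v[i] and restores the heap in O(log n); heap.Fix sifts the
// entry up or down as needed, matching the movement described above.
func (h *indexedMaxHeap) Update(i int, val float64) {
    h.v[i] = val
    heap.Fix(h, h.pos[i])
}

// Max returns the current maximum of v.
func (h *indexedMaxHeap) Max() float64 { return h.v[h.heap[0]] }

func main() {
    v := []float64{0.3, 0.9, 0.1, 0.5}
    h := newIndexedMaxHeap(v)
    fmt.Println(h.Max()) // 0.9
    h.Update(1, 0.2)     // the old max decreases
    fmt.Println(h.Max()) // 0.5
}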
If the changes to the array are random, e.g. v[rand()%size] = rand(), then most of the time the max won't decrease.
There are two main ways I can think of to handle this: keep the full collection sorted on the fly, or track just the few (or one) highest elements. The choice depends on the relative importance of worst-case, average case, and fast-path. (Including code and data cache footprint of the common case where the change doesn't affect anything you're tracking.)
Really low complexity / overhead / code size: O(1) average case, O(N) worst case.
Just track the current max, (and optionally its position, if you can't get the old value to see if it == max before applying the change). On the rare occasion that the element holding the max decreased, rescan the whole array. Otherwise just see if the new element is greater than max.
The average complexity should be O(1) amortized: O(N) for N changes, since on average one of N changes affects the element holding the max. (And only half those changes decrease it).
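A minimal Go sketch of this first option (the names are mine): track the max and its position, and fall back to a full rescan only when the element holding the max decreases.

// trackedMax keeps just the current maximum of v and its position.
type trackedMax struct {
    v      []float64
    max    float64
    maxPos int
}

func newTrackedMax(v []float64) *trackedMax {
    t := &trackedMax{v: v}
    t.rescan()
    return t
}

func (t *trackedMax) rescan() { // O(N), but rare
    t.maxPos = 0
    for i, x := range t.v {
        if x >= t.v[t.maxPos] {
            t.maxPos = i
        }
    }
    t.max = t.v[t.maxPos]
}

// Set updates v[i] and the tracked max: O(1) unless the element holding
// the max decreased, which forces a full rescan.
func (t *trackedMax) Set(i int, val float64) {
    t.v[i] = val
    switch {
    case val >= t.max:
        t.max, t.maxPos = val, i
    case i == t.maxPos:
        t.rescan()
    }
}

func (t *trackedMax) Max() float64 { return t.max }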
A bit more overhead and code size, but less frequent scans of the full array: O(1) typical case, O(N) worst case.
Keep a priority queue of the 4 or 8 highest elements in the array (position and value). When an element in the PQueue is modified, remove it from the PQueue. Try to re-add the new value to the PQueue, but only if it won't be the smallest element. (It might be smaller than some other element we're not tracking). If the PQueue is empty, rescan the array to rebuild it to full size. The current max is the front of the PQueue. Rescanning the array should be quite rare, and in most cases we only have to touch about one cache line of data holding our PQueue.
Since the small PQueue needs to support fast access to the smallest and the largest element, and even finding elements that aren't the min or max, a sorted-array implementation probably makes the most sense, rather than a heap. If it's only 8 elements, a linear search is probably best, too (from the smallest element upwards, so the search stops right away if the old value of the modified element is less than the smallest value in the PQueue).
If you want to optimize the fast-path (position modified wasn't in the PQueue), you could store the PQueue as struct pqueue { unsigned pos[8]; int val[8]; }, and use vector instructions (e.g. x86 SSE/AVX2) to test i against all 8 positions in one or two tests. Hrm, actually just checking the old val to see if it's less than PQ.val[0] should be a good fast-path.
To track the current size of the PQueue, it's probably best to use a separate counter, rather than a sentinel value in pos[]. Checking for the sentinel every loop iteration is probably slower. (esp. since you'd prob. need to use pos to hold the sentinel values; maybe make it signed after all and use -1?) If there was a sentinel you could use in val[], that might be ok.
Slower O(log N) average case, but no full-rescan worst case.
Xiaotian Pei's solution of making the whole array a heap. (This doesn't work if the ordering of v[] matters. You could keep all the elements in a Heap as well as in the ordered array, but that sounds cumbersome.) Re-heapifying after changing a random element will probably write several other cache lines every time, so the common case is much slower than for the methods that only track the top one or few elements.
Something else clever I haven't thought of?

Realistic usage of unrolled skip lists

Why is there no information on Google / Wikipedia about unrolled skip lists, i.e. a combination of an unrolled linked list and a skip list?
Probably because it wouldn't typically give you much of a performance improvement, if any, and it would be somewhat involved to code correctly.
First, the unrolled linked list typically uses a pretty small node size. As the Wikipedia article says: "just large enough so that the node fills a single cache line or a small multiple thereof." On modern Intel processors, a cache line is 64 bytes. Skip list nodes have, on average, two pointers per node, which means an average of 16 bytes per node for the forward pointers. Plus whatever the data for the node is: 4 or 8 bytes for a scalar value, or 8 bytes for a reference (I'm assuming a 64 bit machine here).
So figure 24 bytes, total, for an "element." Except that the elements aren't fixed size. They have a varying number of forward pointers. So you either need to make each element a fixed size by allocating an array for the maximum number of forward pointers for each element (which for a skip list with 32 levels would require 256 bytes), or use a dynamically allocated array that's the correct size. So your element becomes, in essence:
struct UnrolledSkipListElement
{
    void* data;                                // 64-bit pointer to data item
    UnrolledSkipListElement* forward_pointers; // dynamically allocated
};
That would reduce your element size to just 16 bytes. But then you lose much of the cache-friendly behavior that you got from unrolling. To find out where you go next, you have to dereference the forward_pointers array, which is going to incur a cache miss, and therefore eliminate the savings you got by doing the unrolling. In addition, that dynamically allocated array of pointers isn't free: there's some (small) overhead involved in allocating that memory.
If you can find some way around that problem, you're still not going to gain much. A big reason for unrolling a linked list is that you must visit every node (up to the node you find) when you're searching it. So any time you can save with each link traversal adds up to very big savings. But with a skip list you make large jumps. In a perfectly organized skip list, for example, you could skip half the nodes on the first jump (if the node you're looking for is in the second half of the list). If your nodes in the unrolled skip list only contain four elements, then the only savings you gain will be at levels 0, 1, and 2. At higher levels you're skipping more than three nodes ahead and as a result you will incur a cache miss.
So the skip list isn't unrolled because it would be somewhat involved to implement and it wouldn't give you much of a performance boost, if any. And it might very well cause the list to be slower.
Linked list complexity is O(N).
Skip list complexity is O(log N).
Unrolled linked list complexity can be calculated as follows:
O(N / (M / 2) + log M) = O(2N/M + log M)
where M is the number of elements in a single node.
Because log M is not significant, the unrolled linked list complexity is O(N/M).
If we were to combine a skip list with an unrolled linked list, the new complexity would be
O(log N + "something from the unrolled linked list, such as N1/M")
This means the "new" complexity will not be as good as one might first think. The new complexity might even be worse than the original O(log N). The implementation will be more complex as well, so the gain is questionable and rather dubious.
Also, since a single node will hold lots of data but only a single "forward" array, the "tree" will not be as well balanced either, and this will hurt the O(log N) part of the equation.

Resources