Performance comparison between in-place vs out-of-pace matrix transposition - performance

Wikipedia has an extensive summary of methods for in-place matrix transposition.
These methods look hard to implement, before committing to implementing them I'm wondering if there are any benchmarks or other evidence that would say these techniques outperform out-of-place matrix transposition in terms of walltime (on CPU/GPU/ any architecture)?
Do they all run slower compared to using an out-of-place transpose where the data is copied to a different destination?

In short, I don't think I have ever seen an in-place exchange being faster than out of place. Just think about "when did you last chose an in-place sort instead of the out of place version".
But the reason to use in-place is usually memory considerations or allocation considerations. In which case you are bound to use it.
For an out of place exchange, you also have to think about cache, cache and cache, subdivide your problem into smaller parts until both target and source can comfortably be in the cache at the same time.

Related

CUDA parallel sorting algorithm vs single thread sorting algorithms

I have a large amount of data which i need to sort, several million array each with tens of thousand of values. What im wondering is the following:
Is it better to implement a parallel sorting algorithm, on the GPU, and run it across all the arrays
OR
implement a single thread algorithm, like quicksort, and assign each thread, of the GPU, a different array.
Obviously speed is the most important factor. For single thread sorting algorithm memory is a limiting factor. Ive already tried to implement a recursive quicksort but it doesnt seem to work for large amounts of data so im assuming there is a memory issue.
Data type to be sorted is long, so i dont believe a radix sort would be possible due to the fact that it a binary representation of the numbers would be too long.
Any pointers would be appreciated.
Sorting is an operation that has received a lot of attention. Writing your own sort isn't advisable if you are interested in high performance. I would consider something like thrust, back40computing, moderngpu, or CUB for sorting on the GPU.
Most of the above will be handling an array at a time, using the full GPU to sort an array. There are techniques within thrust to do a vectorized sort which can handle multiple arrays "at once", and CUB may also be an option for doing a "per-thread" sort (let's say, "per thread block").
Generally I would say the same thing about CPU sorting code. Don't write your own.
EDIT: I guess one more comment. I would lean heavily towards the first approach you mention (i.e. not doing a sort per thread.) There are two related reasons for this:
Most of the fast sorting work has been done along the lines of your first method, not the second.
The GPU is generally better at being fast when the work is well adapted for SIMD or SIMT. This means we generally want each thread to be doing the same thing and minimizing branching and warp divergence. This is harder to achieve (I think) in the second case, where each thread appears to be following the same sequence but in fact data dependencies are causing "algorithm divergence". On the surface of it, you might wonder if the same criticism might be levelled at the first approach, but since these libraries I mention arer written by experts, they are aware of how best to utilize the SIMT architecture. The thrust "vectorized sort" and CUB approaches will allow multiple sorts to be done per operation, while still taking advantage of SIMT architecture.

If consistent hash is efficient,why don't people use it everywhere?

I was asked some shortcommings of consistent hash. But I think it just costs a little more than a traditional hash%N hash. As the title mentioned, if consistent hash is very good, why not we just use it?
Do you know more? Who can tell me some?
Implementing consistent hashing is not trivial and in many cases you have a hash table that rarely or never needs remapping or which can remap rather fast.
The only substantial shortcoming of consistent hashing I'm aware of is that implementing it is more complicated than simple hashing. More code means more places to introduce a bug, but there are freely available options out there now.
Technically, consistent hashing consumes a bit more CPU; consulting a sorted list to determine which server to map an object to is an O(log n) operation, where n is the number of servers X the number of slots per server, while simple hashing is O(1).
In practice, though, O(log n) is so fast it doesn't matter. (E.g., 8 servers X 1024 slots per server = 8192 items, log2(8192) = 13 comparisons at most in the worst case.) The original authors tested it and found that computing the cache server using consistent hashing took only 20 microseconds in their setup. Likewise, consistent hashing consumes space to store the sorted list of server slots, while simple hashing takes no space, but the amount required is minuscule, on the order of Kb.
Why is it not better known? If I had to guess, I would say it's only because it can take time for academic ideas to propagate out into industry. (The original paper was written in 1997.)
I assume you're talking about hash tables specifically, since you mention mod N. Please correct me if I'm wrong in that assumption, as hashes are used for all sorts of different things.
The reason is that consistent hashing doesn't really solve a problem that hash tables pressingly need to solve. On a rehash, a hash table probably needs to reassign a very large fraction of its elements no matter what, possibly a majority of them. This is because we're probably rehashing to increase the size of our table, which is usually done quadratically; it's very typical, for instance, to double the amount of nodes, once the table starts to get too full.
So in consistent hashing terms, we're not just adding a node; we're doubling the amount of nodes. That means, one way or another, best case, we're moving half of the elements. Sure, a consistent hashing technique could cut down on the moves, and try to approach this ideal, but the best case improvement is only a constant factor of 2x, which doesn't change our overall complexity.
Approaching from the other end, hash tables are all about cache performance, in most applications. All interest in making them go fast is on computing stuff as quickly as possible, touching as little memory as possible. Adding consistent hashing is probably going to be more than a 2x slowdown, no matter how you look at this; ultimately, consistent hashing is going to be worse.
Finally, this entire issue is sort of unimportant from another angle. We want rehashing to be fast, but it's much more important that we don't rehash at all. In any normal practical scenario, when a programmer sees he's having a problem due to rehashing, the correct answer is nearly always to find a way to avoid (or at least limit) the rehashing, by choosing an appropriate size to begin with. Given that this is the typical scenario, maintaining a fairly substantial side-structure for something that shouldn't even be happening is obviously not a win, and again, makes us overall slower.
Nearly all of the optimization effort on hash tables is either in how to calculate the hash faster, or how to perform collision resolution faster. These are things that happen on a much smaller time scale than we're talking about for consistent hashing, which is usually used where we're talking about time scales measured in microseconds or even milliseconds because we have to do I/O operations.
The reason is because Consistent Hashing tends to cause more work on the Read side for range scan queries.
For example, if you want to search for entries that are sorted by a particular column then you'd need to send the query to EVERY node because consistent hashing will place even "adjacent" items in separate nodes.
It's often preferred to instead use a partitioning that is going to match the usage patterns. Better yet replicate the same data in a host of different partitions/formats

What can be parameters other than time and space while analyzing certain algorithms?

I was interested to know about parameters other than space and time during analysing the effectiveness of an algorithms. For example, we can focus on the effective trap function while developing encryption algorithms. What other things can you think of ?
First and foremost there's correctness. Make sure your algorithm always works, no matter what the input. Even for input that the algorithm is not designed to handle, you should print an error mesage, not crash the entire application. If you use greedy algorithms, make sure they truly work in every case, not just a few cases you tried by hand.
Then there's practical efficiency. An O(N2) algorithm can be a lot faster than an O(N) algorithm in practice. Do actual tests and don't rely on theoretical results too much.
Then there's ease of implementation. You usually don't need the best intro sort implementation to sort an array of 100 integers once, so don't bother.
Look for worst cases in your algorithms and if possible, try to avoid them. If you have a generally fast algorithm but with a very bad worst case, consider detecting that worst case and solving it using another algorithm that is generally slower but better for that single case.
Consider space and time tradeoffs. If you can afford the memory in order to get better speeds, there's probably no reason not to do it, especially if you really need the speed. If you can't afford the memory but can afford to be slower, do that.
If you can, use existing libraries. Don't roll your own multiprecision library if you can use GMP for example. For C++, stuff like boost and even the STL containers and algorithms have been worked on for years by an army of people and are most likely better than you can do alone.
Stability (sorting) - Does the algorithm maintain the relative order of equal elements?
Numeric Stability - Is the algorithm prone to error when very large or small real numbers are used?
Correctness - Does the algorithm always give the correct answer? If not, what is the margin of error?
Generality - Does the algorithm work in many situation (e.g. with many different data types)?
Compactness - Is the program for the algorithm concise?
Parallelizability - How well does performance scale when the number of concurrent threads of execution are increased?
Cache Awareness - Is the algorithm designed to maximize use of the computer's cache?
Cache Obliviousness - Is the algorithm tuned for particulary cache-sizes / cache-line-sizes or does it perform well regardless of the parameters of the cache?
Complexity. 2 algorithms being the same in all other respects, the one that's much simpler is going to be a much better candidate for future customization and use.
Ease of parallelization. Depending on your use case, it might not make any difference or, on the other hand, make the algorithm useless because it can't use 10000 cores.
Stability - some algorithms may "blow up" with certain test conditions, e.g. take an inordinately long time to execute, or use an inordinately large amount of memory, or perhaps not even terminate.
For algorithms that perform floating point operations, the accumulation of round-off error is often a consideration.
Power consumption, for embedded algorithms (think smartcards).
One important parameter that is frequently measure in the analysis of algorithms is that of Cache hits and cache misses. While this is a very implementation and architecture dependent issue, it is possible to generalise somewhat. One particularly interesting property of the algorithm is being Cache-oblivious, which means that the algorithm will use the cache optimally on multiple machines with different cache sizes and structures without modification.
Time and space are the big ones, and they seem so plain and definitive, whereby they should often be qualified (1). The fact that the OP uses the word "parameter" rather than say "criteria" or "properties" is somewhat indicative of this (as if a big O value on time and on space was sufficient to frame the underlying algorithm).
Other criteria include:
domain of applicability
complexity
mathematical tractability
definitiveness of outcome
ease of tuning (may be tied to "complexity" and "tactability" afore mentioned)
ability of running the algorithm in a parallel fashion
(1) "qualified": As hinted in other answers, a -technically- O(n^2) algorithm may be found to be faster than say an O(n) algorithm, in 90% of the cases (which, btw, may turn out to be 100% of the practical cases)
worst case and best case are also interesting, especially when linked to some conditions in the input. if your input data shows some properties, an algorithm, by taking advantage of this property, may perform better that another algorithm which performs the same task but does not use that property.
for example, many sorting algorithm perform very efficiently when input are partially ordered in a specific way which minimizes the number of operations the algorithm has to execute.
(if your input is mostly sorted, an insertion sort will fit nicely, while you would never use that algorithm otherwise)
If we're talking about algorithms in general, then (in the real world) you might have to think about CPU/filesystem(read/write operations)/bandwidth usage.
True they are way down there in the list of things you need worry about these days, but given a massive enough volume of data and cheap enough infrastructure you might have to tweak your code to ease up on one or the other.
What you are interested aren’t parameters, rather they are intrinsic properties of an algorithm.
Anyway, another property you might be interested in, and analyse an algorithm for, concerns heuristics (or rather, approximation algorithms), i.e. algorithms which don’t find an exact solution but rather one that is (hopefully) good enough.
You can analyze how far a solution is from the theoretical optimal solution in the worst case. For example, an existing algorithm (forgot which one) approximates the optimal travelling salesman tour by a factor of two, i.e. in the worst case it’s twice as long as the optimal tour.
Another metric concerns randomized algorithms where randomization is used to prevent unwanted worst-case behaviours. One example is randomized quicksort; quicksort has a worst-case running time of O(n2) which we want to avoid. By shuffling the array beforehand we can avoid the worst-case (i.e. an already sorted array) with a very high probability. Just how high this probability is can be important to know; this is another intrinsic property of the algorithm that can be analyzed using stochastic.
For numeric algorithms, there's also the property of continuity: that is, whether if you change input slightly, output also changes only slightly. See also Continuity analysis of programs on Lambda The Ultimate for a discussion and a link to an academical paper.
For lazy languages, there's also strictness: f is called strict if f _|_ = _|_ (where _|_ denotes the bottom (in the sense of domain theory), a computation that can't produce a result due to non-termination, errors etc.), otherwise it is non-strict. For example, the function \x -> 5 is non-strict, because (\x -> 5) _|_ = 5, whereas \x -> x + 1 is strict.
Another property is determinicity: whether the result of the algorithm (or its other properties, such as running time or space consumption) depends solely on its input.
All these things in the other answers about the quality of various algorithms are important and should be considered.
But time and space are two things that vary at some rate compared to the size of the input (n). So what else can vary according to n?
There are several that are related to I/O. For example, the number of writes to a disk is an important one, which may not be directly shown by space and time estimates alone. This becomes particularly important with flash memory, where the number of writes to the same memory location is the significant metric in some algorithms.
Another I/O metric would be "chattiness". A networking protocol might send shorter messages more often adding up to the same space and time as another networking protocol, but some aspect of the system (perhaps billing?) might make minimizing either the size or number of the messages desireable.
And that brings us to Cost, which is a very important algorithmic consideration sometimes. The cost of an algorithm may be affected by both space and time in different amounts (consider the separate costing of server storage space and gigabits of data transfer), but the cost is the thing that you wish to minimize overall, so it may have its own big-O estimations.

Efficient reordering of large dataset to maximize memory cache effectiveness

I've been working on a problem which I thought people might find interesting (and perhaps someone is aware of a pre-existing solution).
I have a large dataset consisting of a long list of pairs of pointers to objects, something like this:
[
(a8576, b3295),
(a7856, b2365),
(a3566, b5464),
...
]
There are way too many objects to keep in memory at any one time (potentially hundreds of gigabytes), so they need to be stored on disk, but can be cached in memory (probably using an LRU cache).
I need to run through this list processing every pair, which requires that both objects in the pair be loaded into memory (if they aren't already cached there).
So, the question: is there a way to reorder the pairs in the list to maximize the effectiveness of an in-memory cache (in other words: minimize the number of cache misses)?
Notes
Obviously, the re-ordering algorithm should be as fast as possible, and shouldn't depend on being able to have the entire list in memory at once (since we don't have enough RAM for that) - but it could iterate over the list several times if necessary.
If we were dealing with individual objects, not pairs, then the simple answer would be to sort them. This obviously won't work in this situation because you need to consider both elements in the pair.
The problem may be related to that of finding a minimum graph cut, but even if the problems are equivalent, I don't think solutions to min-cut meet
My assumption is that the heuristic would stream the data off the disk, and write it back in chunks in a better order. It may need to iterate over this several times.
Actually it may not just be pairs, it could be triplets, quadruplets, or more. I'm hoping that an algorithm that does this for pairs can be easily generalized.
Your problem is related to a similar one for computer graphics hardware:
When rendering indexed vertices in a triangle mesh, typically the hardware has a cache of most recently transformed vertices (~128 the last time I had to worry about it, but suspect the number is larger these days). Vertices not cached need a relatively expensive transform operation to calculate. "Mesh optimisation" to restructure triangle meshes to optimise cache usage used to be a pretty hot research topic. Googling
vertex cache optimisation
(or optimization :^) might find you some interesting material relevant to your problem. As other posters suggest, I suspect doing this effectively will depend on exploiting any inherent coherence in your data.
Another thing to bear in mind: as an LRU cache becomes overloaded it can be well worth changing to an MRU replacement strategy to at least hold some of the items in memory (rather than turning over the entire cache each pass). I seem to remember John Carmack has written some good material on this subject in connection with Direct3D texture caching strategies.
For start, you could mmap the list. That works if there's enough address space, not memory, e.g. on 64-bit CPUs. This makes it easier to access the elements in order.
You could sort that list according to a minimum distance in cache which considers both elements, which works well if the objects are in a contiguous space. The sorting function could be something like: compare (a, b) to (c, d) = (a - c) + (b - d) (which looks like a Hamming distance). Then you pull in slices of the object store and process according to the list.
EDIT: fixed a mistake in the distance.
Even though you're not just sorting this list, the general pattern of a multiway merge sort might be applicable - that is, consider some kind of (possibly recursive) breakdown of the set into smaller sets that can be dealt with in memory separately, and then a second phase where small chunks of the previously dealt-with sets can all be combined together. Even not knowing the specific nature of what you're doing with the pairs, it's safe to say that many algorithmic problems are made much more straightforward when you're dealing with sorted data (including graph problems, which might be what you have on your hands here).
I think the answer to this question is going to depend very heavily on exactly the access pattern of the pair of objects. As you said, just sorting the pointers would be best in a simple, non-paired case. In a more complex case it may still make sense to sort by one of the halves of the pair if the pattern is such that locality for those values is more important (if, for example, these are key/value pairs and you are doing a lot of searches, locality for the keys is infinitely more important than for the values).
So, really, my answer is that this question can't be answered in a general case.
For storing your structure, what you actually want is probably a B-tree. These are designed for what you're talking about--keeping track of large collections where you don't want to (or can't) keep the whole thing in memory.

Is it worthwhile to use a bit vector/array rather than a simple array of bools?

When I want an array of flags it has typically pained me to use an entire byte (or word) to store each one, as would be the result if I made an array of bools or some other numeric type that could be set to 0 or 1. But now I wonder whether using a structure that is more space-efficient is worth it given the (albeit hopefully very slight) additional overhead of shifting and bit testing.
In my company we use Rogue Wave tools (though hopefully not for much longer) and it's their RWBitVec that I've used for this purpose up until now.
It's mostly about saving memory. If your array of bools is large enough that a 8x improvement on storage space is meaningful, then by all means, use a bitarray.
Note that the memory access is pretty expensive compared to the shift/and, so the bitarray approach is slightly faster than the array-of-chars. Basically it comes down to memory versus programmer time. Remember that premature optimization is a waste of time. I'd use whichever approach is the easiest to develop, and then refactor only after it shows that it's a primary performance bottleneck.
Don't use vector<bool>, it's not really a Container:
http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=98
Use std::bitset (for fixed size bitsets) and boost::dynamic_bitset (for resizeable ones) where appropriate. They aren't Containers either, but they don't look as if they ought to be, so are less likely to cause confusion.
Whether the trade-off is worth it depends, obviously, on how big the arrays are in your program. I think you're right that the overhead of bit access is usually negligible, but if the memory overhead is negligible too then you've nothing to go on there either.
bitsets have the advantage that they do exactly what they say on the tin - none of this "declare an array of chars/ints, but the only legal values are 0 and 1" nonsense. Your code will read about the same as if you'd used an array.
I wrote some code once to unpack a bitmap image line into separate bytes per pixel, then pack it back again after processing. For the code I was benchmarking, it was actually faster to do it that way than to work at the bit level.
I've used a bit array for indexing a HUGE tree. The algorithm was:
Check bitarray if entry exists
if entry doesn't exists
return null
else do binary search in tree
return value
The advantage is that the Tree has huge enough that searching for a non existent entry would cause several cache misses before completing. Thus the algorithm was taking longer or not depending on the existence of the value.
However adding that initial bit array search meant I'd reduce cache misses, and would avoid searching the tree at all if the answer wasn't there. By adding this extra step the algorithm became much more robust (actual performance time on a Computer, became nearly linear although the Big-O would say differently), and overall performance increased by an order of magnitude.
Like they say sometimes taking hardware into consideration is more important than the "ideal" mathematical algorithm.
Modern computers have barrel shifters so that a shift of any number of bits up to 31 takes a few cycles (less than many other instructions). Compilers take advantage of this and bit operations are not only space efficient but in most cases time efficient.
But it really depends on how you're using and testing the bits - there are some inefficient methods that would make using a whole integer faster.
-Adam
Is it worth it? Only if you know that you have a problem with memory usage.
But unless you're either:
Working on an embedded processor with very limited resources, or
Storing an astronomical number of bools
then the answer is no. You'll have to work somewhat harder to achieve the same level of readability in your source by using a bitmap than you will using bools, and unless you're operating under either of the previous two conditions you'll likely find that it doesn't make any noticeable difference to your memory footprint.

Resources