Is sqrt still slow in Delphi 2009? - performance

Is sqrt still slow in Delphi 2009?
Are the old tricks (lookup table, approx functions) still useful?

If you are dealing with a small set of really large numbers then a lookup table will most always be faster. While if you are dealing with a large set of small numbers then even a slow routine may be faster then maintaining a large table.
I looked in System.pas (where SQRT is located) and while there are a number of blocks marked as licensed from the Fastcode project, SQRT is not. In fact, it just makes an assembly call to FSQRT, so it most likely hasn't changed. So if it was comparatively slow at one point then it most likely still is as slow (although your CPU may be a lot faster and optimized for that now . . . .)
My recommendation is to look at your usage and test.

Many moons ago I had an app that computed distance to sort vectors. I realized sorting by the un-sqrt-ed values was the same so I skipped it altogether. Sorted by distance^2 and saved some time.

My answer to questions of efficiency is to just apply the simplest method first, then optimize later. Making an operation faster when there's no need for it to be all that fast in the first place seems silly.
I digress, however, since my answer is about efficiency, not the historical aspects of your question.

Related

Efficient Data Structures in Maple

I'm working with a large amount of data in Maple and I need to know the most efficient way to store it. I started with lists, but I quickly learned how inefficient those are so I have since replaced them. Now I'm using a mixture of Arrays (for structures with a fixed length) and tables (for structures with variable length), but my code actually runs significantly slower than it did when I was only using lists.
So here are my questions:
What is the most efficient data structure to use in Maple for a static-length set of data? for a variable-length set?
Are there any "gotchas" I need to be aware of when using these structures as parameters in a recursive proc? If using Arrays or tables, does each one need to be copied for each iteration to avoid clobbering data?
I think I can wrap this one up now. I made a few performance improvements, mostly just small tweaks that only helped a bit, but I did manage a big improvement by removing as many instances of the copy command as I could (I used it on arrays and tables). It turns out this is what was causing my array/table implementation to be slower than my list-only implementation. But the code still didn't run as fast as I needed, so I re-wrote it in C#. That's probably not the best solution for "how to improve Maple efficiency", but it sure does run a lot faster now.

Memory and time issues when dividing two matrices

I have two sparse matrices in matlab
M1 of size 9thousandx1.8million and M2 of size 1.8millionx1.8million.
Now I need to calculate the expression
M1/M2
and it took me like an hour. Is it normal? Is there any efficient way in matlab so that I can overcome this time issue. I mean it's a lot and if I make number of iterations then it will keep on taking 1 hour. Any suggestion?
A quick back-of-the-envelope calculation based on assuming some iterative method like conjugate gradient or Kaczmarz method is used, and plugging in the sizes makes me believe that an hour isn't bad.
Because of the tridiagonality the matrix that's being "inverted" (if not explicitly), both of those methods are going to take a number of instructions near "some near-unity scalar factor" times ~9000 times 1.8e6 times "the number of iterations required for convergence". The product of the two things in quotes is probably around 50 (minimum) to around 1000 (maximum). I didn't cherry pick these to make your math work, these are about what I'd expect from having done these. If you assume about 1e9 instructions per second (which doesn't account much for memory access etc.) you get around 13 minutes to around 4.5 hours.
Thus, it seems in the right range for an algorithm that's exploiting sparsity.
Might be able to exploit it better yourself if you know the structure, but probably not by much.
Note, this isn't to say that 13 minutes is achievable.
Edit: One side note, I'm not sure what's being used, but I assumed iterative methods. It's also possible that direct methods are used (like explained here). These methods can be very efficient for sparse systems if you exploit the sparsity right. It's very possible that Matlab is using these by default, but it's worth investigating what Matlab is doing in your case.
In my limited experience, iterative methods were usually preferred over direct methods as the size of the systems get large (yours is large.) Our linear systems worked out to be block tridiagonal as well, as they often do in image processing.

How to speed up 'sort' function in Matlab?

I was using the built-in sort function of Matlab:
[temp, Idx] = sort(M,2);
I would like to have the sorted index of each row of M, which is a matrix of size > 50k.
I searched hard but did not find anything.. It would be greatly appreciated if you have any comments!
To get a sense of how much room for improvement you have, I would suggest writing a test program in C and use qsort or in C++ and user sort and carefully time it on 7000 inputs of size 7000 (or whatever setup you have in Matlab).
I'm going to give you my estimate: probably Matlab's sort runs (on properly vectorized code, like yours) as fast as C++, and you're just seeing the effect of running an algorithm that takes O(n^2 log n). It is reported in Matlab's marketing material that its sort function was faster than C's qsort, but take it with a grain of salt.
The best way to speed up that sort is to get a faster computer. It will speed everything else up too. :)
The fact is, you can rarely speed up a single call to something like a sort. MATLAB is already doing that in an efficient manner, using an optimized code internally. (Reread the carlosdc answer.) The things you can sometimes get a boost on are tools that are written in MATLAB itself.
So, what can you do? Short of buying that new computer, you can look at your overall code. One single sort of that size is never that big of a problem. But the reason for doing that sort over and over again is. Think carefully about the code, about whether you can change the flow or avoid a many times repeated sort. Algorithm change is often a FAR bigger source of improvement than the wee bit you would ever get even if you could improve that sort.
Sorting is fundamentally O(n log n).
As long as you have a reasonably efficient implementation, this is unlikely to change much.
That said, as Andrew Janke's comment suggests, multi-threading can improve things dramatically.
GPU programming can be a way to get massive speedups. If you have R2010b or later, you may be able to use accelerated versions of built-in functions like sort from Mathworks.
Otherwise, write a mex wrapper around the CUDA Thrust library which includes a sort.
You could write your own sort function in C/C++ as MEX. MATLAB documentation has examples for it.
There exist many sort algorithms which are better then other in edge cases, for example almost sorted data or stability (which does not matter in MATLAB because all its types are value types).
Is your data numeric or strings? For strings there are probably special algorithms for ASCII sort, sometimes natural sort is preferable.

If consistent hash is efficient,why don't people use it everywhere?

I was asked some shortcommings of consistent hash. But I think it just costs a little more than a traditional hash%N hash. As the title mentioned, if consistent hash is very good, why not we just use it?
Do you know more? Who can tell me some?
Implementing consistent hashing is not trivial and in many cases you have a hash table that rarely or never needs remapping or which can remap rather fast.
The only substantial shortcoming of consistent hashing I'm aware of is that implementing it is more complicated than simple hashing. More code means more places to introduce a bug, but there are freely available options out there now.
Technically, consistent hashing consumes a bit more CPU; consulting a sorted list to determine which server to map an object to is an O(log n) operation, where n is the number of servers X the number of slots per server, while simple hashing is O(1).
In practice, though, O(log n) is so fast it doesn't matter. (E.g., 8 servers X 1024 slots per server = 8192 items, log2(8192) = 13 comparisons at most in the worst case.) The original authors tested it and found that computing the cache server using consistent hashing took only 20 microseconds in their setup. Likewise, consistent hashing consumes space to store the sorted list of server slots, while simple hashing takes no space, but the amount required is minuscule, on the order of Kb.
Why is it not better known? If I had to guess, I would say it's only because it can take time for academic ideas to propagate out into industry. (The original paper was written in 1997.)
I assume you're talking about hash tables specifically, since you mention mod N. Please correct me if I'm wrong in that assumption, as hashes are used for all sorts of different things.
The reason is that consistent hashing doesn't really solve a problem that hash tables pressingly need to solve. On a rehash, a hash table probably needs to reassign a very large fraction of its elements no matter what, possibly a majority of them. This is because we're probably rehashing to increase the size of our table, which is usually done quadratically; it's very typical, for instance, to double the amount of nodes, once the table starts to get too full.
So in consistent hashing terms, we're not just adding a node; we're doubling the amount of nodes. That means, one way or another, best case, we're moving half of the elements. Sure, a consistent hashing technique could cut down on the moves, and try to approach this ideal, but the best case improvement is only a constant factor of 2x, which doesn't change our overall complexity.
Approaching from the other end, hash tables are all about cache performance, in most applications. All interest in making them go fast is on computing stuff as quickly as possible, touching as little memory as possible. Adding consistent hashing is probably going to be more than a 2x slowdown, no matter how you look at this; ultimately, consistent hashing is going to be worse.
Finally, this entire issue is sort of unimportant from another angle. We want rehashing to be fast, but it's much more important that we don't rehash at all. In any normal practical scenario, when a programmer sees he's having a problem due to rehashing, the correct answer is nearly always to find a way to avoid (or at least limit) the rehashing, by choosing an appropriate size to begin with. Given that this is the typical scenario, maintaining a fairly substantial side-structure for something that shouldn't even be happening is obviously not a win, and again, makes us overall slower.
Nearly all of the optimization effort on hash tables is either in how to calculate the hash faster, or how to perform collision resolution faster. These are things that happen on a much smaller time scale than we're talking about for consistent hashing, which is usually used where we're talking about time scales measured in microseconds or even milliseconds because we have to do I/O operations.
The reason is because Consistent Hashing tends to cause more work on the Read side for range scan queries.
For example, if you want to search for entries that are sorted by a particular column then you'd need to send the query to EVERY node because consistent hashing will place even "adjacent" items in separate nodes.
It's often preferred to instead use a partitioning that is going to match the usage patterns. Better yet replicate the same data in a host of different partitions/formats

Is it worthwhile to use a bit vector/array rather than a simple array of bools?

When I want an array of flags it has typically pained me to use an entire byte (or word) to store each one, as would be the result if I made an array of bools or some other numeric type that could be set to 0 or 1. But now I wonder whether using a structure that is more space-efficient is worth it given the (albeit hopefully very slight) additional overhead of shifting and bit testing.
In my company we use Rogue Wave tools (though hopefully not for much longer) and it's their RWBitVec that I've used for this purpose up until now.
It's mostly about saving memory. If your array of bools is large enough that a 8x improvement on storage space is meaningful, then by all means, use a bitarray.
Note that the memory access is pretty expensive compared to the shift/and, so the bitarray approach is slightly faster than the array-of-chars. Basically it comes down to memory versus programmer time. Remember that premature optimization is a waste of time. I'd use whichever approach is the easiest to develop, and then refactor only after it shows that it's a primary performance bottleneck.
Don't use vector<bool>, it's not really a Container:
http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=98
Use std::bitset (for fixed size bitsets) and boost::dynamic_bitset (for resizeable ones) where appropriate. They aren't Containers either, but they don't look as if they ought to be, so are less likely to cause confusion.
Whether the trade-off is worth it depends, obviously, on how big the arrays are in your program. I think you're right that the overhead of bit access is usually negligible, but if the memory overhead is negligible too then you've nothing to go on there either.
bitsets have the advantage that they do exactly what they say on the tin - none of this "declare an array of chars/ints, but the only legal values are 0 and 1" nonsense. Your code will read about the same as if you'd used an array.
I wrote some code once to unpack a bitmap image line into separate bytes per pixel, then pack it back again after processing. For the code I was benchmarking, it was actually faster to do it that way than to work at the bit level.
I've used a bit array for indexing a HUGE tree. The algorithm was:
Check bitarray if entry exists
if entry doesn't exists
return null
else do binary search in tree
return value
The advantage is that the Tree has huge enough that searching for a non existent entry would cause several cache misses before completing. Thus the algorithm was taking longer or not depending on the existence of the value.
However adding that initial bit array search meant I'd reduce cache misses, and would avoid searching the tree at all if the answer wasn't there. By adding this extra step the algorithm became much more robust (actual performance time on a Computer, became nearly linear although the Big-O would say differently), and overall performance increased by an order of magnitude.
Like they say sometimes taking hardware into consideration is more important than the "ideal" mathematical algorithm.
Modern computers have barrel shifters so that a shift of any number of bits up to 31 takes a few cycles (less than many other instructions). Compilers take advantage of this and bit operations are not only space efficient but in most cases time efficient.
But it really depends on how you're using and testing the bits - there are some inefficient methods that would make using a whole integer faster.
-Adam
Is it worth it? Only if you know that you have a problem with memory usage.
But unless you're either:
Working on an embedded processor with very limited resources, or
Storing an astronomical number of bools
then the answer is no. You'll have to work somewhat harder to achieve the same level of readability in your source by using a bitmap than you will using bools, and unless you're operating under either of the previous two conditions you'll likely find that it doesn't make any noticeable difference to your memory footprint.

Resources