Is this linear search implementation actually useful?

In Matters Computational I found this interesting linear search implementation (it's actually my Java implementation ;-)):
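public static int linearSearch(int[] a, int key) {
    int high = a.length - 1;
    int tmp = a[high];
    // put a sentinel at the end of the array
    a[high] = key;
    int i = 0;
    // the sentinel guarantees termination, so no bounds test is needed
    while (a[i] != key) {
        i++;
    }
    // restore original value
    a[high] = tmp;
    if (i == high && key != tmp) {
        return -1; // NOT_CONTAINED: only the sentinel matched
    }
    return i;
}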
It basically uses a sentinel, which is the searched-for value, so that you always find the value and don't have to check for array boundaries. The last element is stored in a temp variable, and then the sentinel is placed at the last position. When the value is found (remember, it is always found due to the sentinel), the original element is restored and the index is checked: if it is the last index and the original last element is not equal to the searched-for value, -1 (NOT_CONTAINED) is returned; otherwise the index is returned.
While I found this implementation really clever, I wonder if it is actually useful. For small arrays, it seems to be always slower, and for large arrays it only seems to be faster when the value is not found. Any ideas?
The original implementation was written in C++, so that could make a difference.
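For comparison, here is a minimal C++ sketch of the same sentinel idea. It is not the code from Matters Computational; the name sentinel_search, the long return type, and the empty-vector guard are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Sentinel linear search (illustrative sketch). Returns the index of the
// first element equal to key, or -1 if key is absent. The vector is modified
// temporarily, so concurrent calls on the same data are not safe.
long sentinel_search(std::vector<int>& a, int key) {
    if (a.empty()) return -1;            // no slot available for a sentinel
    const std::size_t high = a.size() - 1;
    const int saved = a[high];           // remember the real last element
    a[high] = key;                       // plant the sentinel
    std::size_t i = 0;
    while (a[i] != key) ++i;             // one test per iteration, no bounds test
    a[high] = saved;                     // restore the original value
    if (i == high && saved != key) return -1;  // only the sentinel matched
    return static_cast<long>(i);
}
```

Unlike Java, C++ does not bounds-check operator[] on each access, so the loop here really does perform a single comparison per element.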

It's not thread-safe. For example, you can lose your a[high] value if a second thread starts after the first has changed a[high] to key: the second thread records key in tmp, and if it finishes after the first thread has restored a[high] to its original value, it will restore a[high] to what it first saw, which was the first thread's key.
It's also not useful in Java, since the JVM will include bounds checks on your array accesses, so your while loop is checking that you're not going past the end of your array anyway.

Will you ever notice any speed increase from this? No
Will you notice a lack of readability? Yes
Will you notice an unnecessary array mutation that might cause concurrency issues? Yes
Premature optimization is the root of all evil.

Doesn't seem particularly useful. The "innovation" here is just to get rid of the iteration test by combining it with the match test. Modern processors spend 0 time on iteration checks these days (all the computation and branching gets done in parallel with the match test code).
In any case, binary search kicks the ass of this code on large arrays, and is comparable on small arrays. Linear search is so 1960s.

See also the 'finding a tiger in Africa' joke.
Punchline: an experienced programmer places a tiger in Cairo so that the search terminates.

A sentinel search goes back to Knuth. Its value is that it reduces the number of tests in a loop from two ("does the key match? Am I at the end?") to just one.
Yes, it's useful, in the sense that it should significantly reduce search times for modest-size unordered arrays, by virtue of eliminating conditional branch mispredictions. This also reduces insertion times (code not shown by the OP) for such arrays, because you don't have to order the items.
If you have larger arrays of ordered items, a binary search will be faster, at the cost of larger insertion time to ensure the array is ordered.
For even larger sets, a hash table will be the fastest.
The real question is what is the distribution of sizes of your arrays?

Yes, it is useful, because the while loop has only one comparison per iteration, as opposed to two in the standard search.
That can make it up to twice as fast. It is given as an optimization in Knuth Vol. 3.


Is there any probabilistic data structure that reduces the space complexity of a large number of counters?

Basically I need to keep track of a large number of counters. I can increment or decrement each counter by name. The simplest way to do so is to use a hash table, using counter_name as key and its corresponding count as the value for that key.
The counters don't need to be 100% accurate, approximate values for count are fine. So I'm wondering if there is any probabilistic data structure that can reduce the space complexity of N counters to lower than O(N), kinda similar to how HyperLogLog reduces the memory requirement of counting N items by giving only an approximate result. Any ideas?
In my opinion, the thing you are looking for is Count-min sketch.
Reading a stream of elements a1, a2, a3, ..., an where there can be a lot of repeated elements, at any time it will give you the answer to the following question: how many ai elements have you seen so far?
Basically, your unique elements correspond one-to-one to your counters. Count-min sketch allows you to adjust parameters to trade memory for accuracy.
P.S. I described some other popular probabilistic data structures here.
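To make the idea concrete, here is a minimal count-min sketch in C++. It is only a sketch under assumptions: the class name CountMinSketch, the width/depth parameters, and the salted std::hash mixing are illustrative choices, not taken from any particular library. A careful implementation would use pairwise-independent hash functions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Illustrative count-min sketch: depth rows of width counters each.
// Memory is width * depth counters, no matter how many names are seen.
class CountMinSketch {
public:
    CountMinSketch(std::size_t width, std::size_t depth)
        : width_(width), depth_(depth), table_(width * depth, 0) {}

    // Add delta (may be negative) to the counter called name.
    void add(const std::string& name, std::int64_t delta = 1) {
        for (std::size_t row = 0; row < depth_; ++row)
            table_[row * width_ + slot(name, row)] += delta;
    }

    // Estimate = minimum over the rows. As long as no true count ever goes
    // negative, collisions can only inflate a cell, so this never undershoots.
    std::int64_t estimate(const std::string& name) const {
        std::int64_t best = std::numeric_limits<std::int64_t>::max();
        for (std::size_t row = 0; row < depth_; ++row)
            best = std::min(best, table_[row * width_ + slot(name, row)]);
        return best;
    }

private:
    // Assumed hashing scheme: std::hash salted per row, then a 64-bit mixer.
    std::size_t slot(const std::string& name, std::size_t row) const {
        std::uint64_t h = std::hash<std::string>{}(name)
                        + (row + 1) * 0x9e3779b97f4a7c15ULL;
        h ^= h >> 30; h *= 0xbf58476d1ce4e5b9ULL;
        h ^= h >> 27; h *= 0x94d049bb133111ebULL;
        h ^= h >> 31;
        return static_cast<std::size_t>(h % width_);
    }

    std::size_t width_;
    std::size_t depth_;
    std::vector<std::int64_t> table_;
};
```

A larger width lowers the overestimate caused by collisions, and a larger depth lowers the chance that every row collides at once; that is the memory-for-accuracy trade mentioned above.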
Stefan Haustein's correct that the names are likely to take more space than the counters, and you may be able to prioritise certain names as he suggests, but failing that you can consider how best to store the names. If they're fairly short (e.g. 8 characters or less), you might consider using a closed hashing table that stores them directly in the buckets. If they're long, you could store them contiguously (NUL terminated) in a block of memory, and in the hash table store the offset into that block of their first character.
For the counter itself, you can save space by using a probabilistic approach as follows:
template <typename T, typename Q = unsigned>
class Approx_Counter
{
  public:
    Approx_Counter() : n_(0) { }
    Approx_Counter& operator++()
    {
        // the first two calls always count; later ones bump n_ with a 1 in 2^n_ chance
        if (n_ < 2 || rand() % (operator Q()) == 0)
            ++n_;
        return *this;
    }
    operator Q() const { return n_ < 2 ? Q(n_) : Q(1) << n_; }
  private:
    T n_;
};
Then you can use e.g. Approx_Counter<unsigned char, unsigned long>. Swap out rand() for a C++11 generator if you care.
The idea's simple:
when n_ is 0, ++ has definitely not been invoked
when n_ is 1, ++ has definitely been invoked exactly once
when n_ >= 2, it indicates ++ has probably been invoked about 2^n_ times
To keep that last implication in line with the number of ++ invocations actually made, each invocation has a 1 in 2^n_ chance of actually incrementing n_ again.
Just make sure your rand() or substitute returns values much larger than the largest counter value you want to track, otherwise you'll get rand() % (operator Q()) == 0 too often and increment inappropriately.
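If you do want a C++11 generator, the same scheme could be sketched roughly as follows. The class name ApproxCounter and the increment/value member functions are hypothetical names for this illustration. Drawing from std::uniform_int_distribution over [0, value() - 1] sidesteps the RAND_MAX concern, because the range always matches the current estimate.

```cpp
#include <cstdint>
#include <random>

// Illustrative variant of the counter above with an injected random source.
// Stored exponent n_: 0 and 1 are exact; n_ >= 2 stands for roughly 2^n_ calls.
template <typename T = std::uint8_t, typename Q = std::uint64_t>
class ApproxCounter {
public:
    template <typename Rng>
    ApproxCounter& increment(Rng& rng) {
        if (n_ < 2) {
            ++n_;                                   // first two calls always count
        } else {
            // bump the exponent with probability 1 / 2^n_
            std::uniform_int_distribution<Q> pick(0, value() - 1);
            if (pick(rng) == 0) ++n_;
        }
        return *this;
    }

    // Approximate number of increment() calls made so far.
    Q value() const { return n_ < 2 ? Q(n_) : Q(1) << n_; }

private:
    T n_ = 0;
};
```

Usage would look something like `std::mt19937 rng(seed); ApproxCounter<> c; c.increment(rng);`, with one generator shared across all counters.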
That said, having a smaller counter doesn't help much if you have pointers or offsets to it, so you'll want to squeeze the counter into the bucket too, another reason to prefer your own closed hashing implementation if you genuinely need to tighten up memory usage but want to stick with a hash table (a trie is another possibility).
The above is still O(N) in counter space, just with a smaller constant. For genuinely < O(N) options, you need to consider whether/how keys are related, such that incrementing a counter might reasonably impact multiple keys. You've given us no insights in your question to date.
The names probably take up more space than the counters.
How about having a fixed number of counters and only keep the ones with the highest counts, plus some kind of LRU mechanism to allow new counters to rise to the top? I guess it really depends on your use case...
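One way to sketch the fixed-number-of-counters idea is the Space-Saving heuristic, which is a swapped-in technique and not necessarily what was meant above: when the table is full, the smallest counter is evicted and the newcomer inherits its count plus one. The class name BoundedCounters is hypothetical, the linear scan for the minimum is kept for brevity, and decrements are not handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Illustrative Space-Saving style table: at most capacity names are tracked.
class BoundedCounters {
public:
    explicit BoundedCounters(std::size_t capacity) : capacity_(capacity) {}

    void increment(const std::string& name) {
        if (capacity_ == 0) return;
        auto it = counts_.find(name);
        if (it != counts_.end()) { ++it->second; return; }
        if (counts_.size() < capacity_) { counts_.emplace(name, 1); return; }
        // Table is full: evict the smallest counter (linear scan) and let
        // the new name inherit its count + 1, which may overestimate.
        auto victim = std::min_element(
            counts_.begin(), counts_.end(),
            [](const Entry& a, const Entry& b) { return a.second < b.second; });
        const std::uint64_t inherited = victim->second + 1;
        counts_.erase(victim);
        counts_.emplace(name, inherited);
    }

    // Returns 0 for names that are not currently tracked.
    std::uint64_t estimate(const std::string& name) const {
        auto it = counts_.find(name);
        return it == counts_.end() ? 0 : it->second;
    }

private:
    using Entry = std::unordered_map<std::string, std::uint64_t>::value_type;
    std::size_t capacity_;
    std::unordered_map<std::string, std::uint64_t> counts_;
};
```

Heavy hitters tend to stay in the table, while rare names keep displacing each other at the bottom, which is roughly the behaviour the suggestion asks for.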

How does JIT optimize branching while processing elements of collections? (in Scala)

This is a question about performance of code written in Scala.
Consider the following two code snippets, assume that x is some collection containing ~50 million elements:
def process(x: Traversable[T]) = {
  processFirst(x.head)
  x reduce processPair
  processLast(x.last)
}
Versus something like this (assume for now we have some way to determine if we're operating on the first element versus the last element):
def isFirstElement[T](x: T) = ???
def isLastElement[T](x: T) = ???
def process(x: Traversable[T]) = {
  x reduce {
    (left, right) =>
      if (isFirstElement(left))
        processFirst(left)
      else if (isLastElement(right))
        processLast(right)
      processPair(left, right)
  }
}
Which approach is faster? and for ~50 million elements, how much faster?
It seems to me that the first example would be faster because there are fewer conditional checks occurring for all but the first and last elements. However for the latter example there is some argument to suggest that the JIT might be clever enough to optimize away those additional head/last conditional checks that would otherwise occur for all but the first/last elements.
Is the JIT clever enough to perform such optimizations? The obvious advantage of the latter approach is that all the business logic can be placed in the same function body, while in the former case it must be partitioned into three separate function bodies invoked separately.
** EDIT **
Thanks for all the great responses. While I am leaving the second code snippet above to illustrate its incorrectness, I want to revise the first approach slightly to reflect better the problem I am attempting to solve:
// x is some iterator
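def process(x: Iterator[T]) = {
  if (x.hasNext) {
    var previous = x.next()
    var current: T = null.asInstanceOf[T] // nullable placeholder until the loop runs
    processFirst(previous)
    while (x.hasNext) {
      current = x.next()
      processPair(previous, current)
      previous = current
    }
    processLast(previous)
  }
}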
While there are no additional checks occurring in the body, there is an additional reference assignment that appears to be unavoidable (previous = current). This is also a much more imperative approach that relies on nullable mutable variables. Implementing this in a functional yet high performance manner would be another exercise for another question.
How does this code snippet stack up against the last of the two examples above (the single-iteration block approach containing all the branches)? The other thing I realize is that the latter of the two examples is also broken on collections containing fewer than two elements.
If your underlying collection has an inexpensive head and last method (not true for a generic Traversable), and the reduction operations are relatively inexpensive, then the second way takes about 10% longer (maybe a little less) than the first on my machine. (You can use a var to get first, and you can keep updating a second var with the right argument to obtain last, and then do the final operation outside of the loop.)
If you have an expensive last (i.e. you have to traverse the whole collection), then the first operation takes about 10% longer (maybe a little more).
Mostly you shouldn't worry too much about it and instead worry more about correctness. For instance, in a 2-element list your second code has a bug (because there is an else instead of a separate test). In a 1-element list, the second code never calls reduce's lambda at all, so again fails to work.
This argues that you should do it the first way unless you're sure last is really expensive in your case.
Edit: if you switch to a manual reduce-like-operation using an iterator, you might be able to shave off up to about 40% of your time compared to the expensive-last case (e.g. list). For inexpensive last, probably not so much (up to ~20%). (I get these values when operating on lengths of strings, for example.)
First of all, note that, depending on the concrete implementation of Traversable, doing something like x.last may be really expensive. Like, more expensive than all the rest of what's going on here.
Second, I doubt the cost of conditionals themselves is going to be noticeable, even on a 50 million collection, but actually figuring out whether a given element is the first or the last, might again, depending on implementation, get pricey.
Third, JIT will not be able to optimize the conditionals away: if there was a way to do that, you would have been able to write your implementation without conditionals to begin with.
Finally, if you are at a point where it starts looking like an extra if statement might affect performance, you might consider switching to Java or even C. Don't get me wrong, I love Scala, it is a great language, with lots of power and useful features, but being super-fast just isn't one of them.

Maps in Go - how to avoid double key lookup?

Suppose I want to update some existing value in a map, or do something else if the key is not found. How do I do this, without performing 2 lookups? What's the golang equivalent of the following C++ code:
auto it = m.find(key);
if (it != m.end()) {
    // update the value, without performing a second lookup
    it->second = calc_new_value(it->second);
} else {
    // do something else
    m.insert(make_pair(key, 42));
}
Go does not expose the map's internal (key,value) pair data structure like C++ does, so you can't replicate this exactly.
One possible workaround would be to make the values of your map pointers, so you can keep the same values in the map but update what they point to. For example, if m is a map[int]*int, you could change a value with:
if v, ok := m[10]; ok {
    *v = 42 // updates the stored value through the pointer, no second lookup
}
With that said, I wouldn't be surprised if the savings from reducing the number of hash lookups will be eaten by the additional memory management overhead. So it would be worth benchmarking whatever solution you settle on.
You cannot. The situation is actually the same with Python dicts. However it shouldn't matter. Both lookup and assignment to a Go map are amortized O(1). Combining the two operations has the same time complexity.

Are boolean operations slower than mathematical operations in loops?

I really tried to find something about this kind of operation, but I couldn't find specific information about my question. It's simple: are boolean operations slower than typical math operations in loops?
For example, this can be seen when working with some kind of sorting. The method will iterate and compare X with Y. But is this slower than a summation or subtraction loop?
Boolean comparisons
for(int i=1; i<Vector.Length; i++) if(Vector[i-1] < Vector[i])
Versus summation:
Double sum = 0;
for(int i=0; i<Vector.Length; i++) sum += Vector[i];
(Talking about big length loops)
Which is faster for the processor to complete?
Do booleans require more operations in order to return "true" or "false" ?
Short version
There is no correct answer because your question is not specific enough (the two examples of code you give don't achieve the same purpose).
If your question is:
Is bool isGreater = (a > b); slower or faster than int sum = a + b;?
Then the answer would be: It's about the same unless you're very very very very very concerned about how many cycles you spend, in which case it depends on your processor and you need to read its documentation.
If your question is:
Is the first example I gave going to iterate slower or faster than the second example?
Then the answer is: It's going to depend primarily on the values the array contains, but also on the compiler, the processor, and plenty of other factors.
Longer version
On most processors a boolean operation has no reason to be significantly slower or faster than an addition: both are basic instructions, even though a comparison may take two of them (subtracting, then comparing to zero). The number of cycles it takes to decode the instruction depends on the processor and might be different, but a few cycles won't make a lot of difference unless you're in a critical loop.
In the example you give though, the if condition could potentially be harmful, because of instruction pipelining. Modern processors try very hard to guess what the next bunch of instructions are going to be so they can pre-fetch them and treat them in parallel. If there is branching, the processor doesn't know if it will have to execute the then or the else part, so it guesses based on the previous times.
If the result of your condition is the same most of the time, the processor will likely guess it right and this will go well. But if the result of the condition keeps changing, then the processor won't guess correctly. When such a branch misprediction happens, the processor has to throw away the contents of the pipeline and start over, because the work it did speculatively turned out to be moot. That. does. hurt.
You can try it yourself: measure the time it takes to run your loop over a million elements when they are of same, increasing, decreasing, alternating, or random value.
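A rough harness for that experiment might look like the C++ sketch below. The names count_rising, make_input, Pattern, and time_us are made up for this example. Treat the timings as indicative only: an optimizing compiler may turn the if into branch-free or vectorized code, which hides the effect, so compare builds with and without optimization.

```cpp
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Counts positions where v[i-1] < v[i]; the if is the branch under test.
std::size_t count_rising(const std::vector<int>& v) {
    std::size_t n = 0;
    for (std::size_t i = 1; i < v.size(); ++i)
        if (v[i - 1] < v[i]) ++n;
    return n;
}

enum class Pattern { Same, Increasing, Decreasing, Alternating, Random };

// Builds n values following the requested pattern (seed only affects Random).
std::vector<int> make_input(Pattern p, std::size_t n, unsigned seed = 42) {
    std::vector<int> v(n);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 1000000);
    for (std::size_t i = 0; i < n; ++i) {
        switch (p) {
            case Pattern::Same:        v[i] = 7; break;
            case Pattern::Increasing:  v[i] = static_cast<int>(i); break;
            case Pattern::Decreasing:  v[i] = static_cast<int>(n - i); break;
            case Pattern::Alternating: v[i] = static_cast<int>(i % 2); break;
            case Pattern::Random:      v[i] = dist(rng); break;
        }
    }
    return v;
}

// Wall-clock time of one pass in microseconds; the count is returned through
// result so the compiler cannot discard the loop.
double time_us(const std::vector<int>& v, std::size_t& result) {
    const auto t0 = std::chrono::steady_clock::now();
    result = count_rising(v);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}
```

Call time_us once per pattern on a million elements, repeat each measurement several times, and compare the medians rather than single runs.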
Which leads me to the conclusion: processors have become seriously complex beasts and there are no golden answers, just rules of thumb, so you need to measure and profile. You can also read what other people have measured to get an idea of what you should or should not do.
Have fun experimenting. :)

Sum reduction of binary sequence

Consider a binary sequence: 1 1 0 0 0 1 1 1
I have to find sum of this series (actually in parallel)
Sum =1+1+0+0+0+1+1+1= 5
This seems like a waste of resources: why invest time in adding 0s?
Is there any clever way to sum this sequence so I can avoid unnecessary additions?
Operate at the byte level rather than the bit level. Use a small LUT to convert a byte to a population count. That way you're only doing one lookup and one add per 8 bits. Unless your data is likely to be very sparse this should be quite efficient.
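A minimal C++ sketch of the byte-level lookup table, assuming the bits are packed into an array of bytes. The function names make_popcount_table and count_ones are illustrative.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Builds the 256-entry table: table[b] = number of 1 bits in byte value b.
std::array<std::uint8_t, 256> make_popcount_table() {
    std::array<std::uint8_t, 256> t{};   // t[0] = 0
    for (std::size_t b = 1; b < 256; ++b)
        t[b] = static_cast<std::uint8_t>((b & 1) + t[b >> 1]);
    return t;
}

// Sums the set bits in len bytes: one table lookup and one add per byte.
std::size_t count_ones(const std::uint8_t* data, std::size_t len) {
    static const std::array<std::uint8_t, 256> table = make_popcount_table();
    std::size_t total = 0;
    for (std::size_t i = 0; i < len; ++i)
        total += table[data[i]];
    return total;
}
```

For the sequence in the question, packed into the single byte 0xC7 (11000111), count_ones returns 5 after one lookup.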
Well it depends on how you store your bitset.
If it's an array, then you can't do more than a plain for. If you want to do this in parallel, just split the array in chunks and process them concurrently.
If we are talking about a bitset (storing the bits in a native (32/64-bit) integer type), then the simplest way to count bits would be this one:
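unsigned int bitset = 0xC7;  // example input: 11000111
int s = 0;
for (; bitset; s++)
    bitset &= bitset - 1;    // clear the lowest set bit
This clears the lowest set bit at every step, so it takes O(s) steps, where s is the number of set bits.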
Of course, you can combine these two methods if you need more than 32/64 bits
I dunno why people are answering without even looking at the link in the first comment on the question. You can easily do better than O(size_of_bitset), at least when it comes to the constant factor.
You could use this method (found in link by J.F. Sebastian):
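#include <cstddef>

inline int count_bits(unsigned int bitset) {
    int sum = 0;
    for (; bitset; sum++) bitset &= bitset - 1;  // clear the lowest set bit
    return sum;
}

int main(void) {
    // N is a compile-time constant defined elsewhere; fill array before counting
    unsigned int array[N];
    int total_sum = 0;
    #pragma omp parallel for reduction(+:total_sum)
    for (size_t i = 0; i < N; i++) {
        total_sum += count_bits(array[i]);
    }
}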
This counts the set bits in the array's memory range in parallel. The inline is important to avoid unnecessary copying; the compiler should also optimize it much better.
You can swap count_bits for anything faster that counts the bits in an integer, if you find one. This version has a complexity of O(bits_set) (not the size of the bit set!).
Invoking the parallel construct introduces quite a lot of overhead compared to a single summation, so the input needs to be quite large to compensate.
The parallelism is done via OpenMP. Each thread's partial sum is added up at the end of the parallel loop and stored in total_sum. Note that total_sum is private to each thread inside the loop because of the reduction clause.
You could alter the code to count the bits set in an arbitrary memory region, but memory alignment matters when you perform operations at such a low level.
As far as I can see, it would be wasteful to try to handle the zeros specially. As @bdares said, addition is really cheap. At a minimum, you'll need to execute N instructions to sum up an N-bit sequence; that is the case if you unconditionally sum every bit. If you add a test to see whether the bit is a 0 or 1, that's another instruction that needs to be executed for each bit. Even if there's no branch penalty, you're executing a minimum of 1 instruction for every bit (the conditional test), and then you're also executing the original instruction (the add) for any bits that are equal to 1. So even without a branch penalty, this takes more time to execute.
@bdares mentions that the compiler will optimize out the branches, but that's only if the value of each bit is known at compile time, and if you know the values of the bits at compile time, you should just add them up yourself in advance.
There might be some cute things you can do with bit twiddling. For instance, if you take the bits two at a time you're adding up values of 0, 1, 2, or 3, and only have half as many additions to do. There may be something you can then do with the result to convert it into the value you want, but I haven't actually thought about how to do that.
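The bits-two-at-a-time idea does carry through. A well-known form of it is the parallel (SWAR) bit count, sketched here in C++ for a 32-bit word; the function name popcount32 is chosen for this example. Each step adds neighbouring fields inside the word, doubling the field width each time.

```cpp
#include <cstdint>

// Parallel (SWAR) bit count of a 32-bit word, with no per-bit loop or branch.
unsigned popcount32(std::uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);                  // each 2-bit field holds 0..2
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);  // each 4-bit field holds 0..4
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                  // each byte holds 0..8
    return (v * 0x01010101u) >> 24;                    // add the four bytes together
}
```

This takes a fixed, small number of operations per 32 bits regardless of how many bits are set. On compilers and CPUs that offer a population-count instruction or intrinsic, that is likely faster still.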
