Binary search within BTree to improve performance - algorithm

I'm reading through Cormen et. al. again, specifically the chapter on BTrees, and I am implementing one in C. However, I had a question about whether I could improve performance in searching by using binary search rather than linear search. The pseudocode given in the book looks something like:
ordered_pair* btree_search(btreenode* root, double val){
i = 0;
while(i < root->num_vals && val > x->vals[i]) i++;
if(i < x->num_keys && val == x->vals[i]) return ordered_pair(x, i);
else if (x->leaf) return NULL;
else {
disk_read(x->children[i]);
return btree_search(x->children[i], val);
}
}
(I have modified it to look like C and used index 0 rather than 1)
My question(s):
This looks like a linear search. However, since each collection of keys in a BTree node is implemented as an array, couldn't I use binary search instead? Would that lessen the time complexity of searching each node from O(n) to O(lg(n))? Or does reading from the disk make a binary search here rather pointless. The reason I'm asking this is because it seems relatively trivial to implement, and I am confused why Cormen et. al., doesn't mention it at all. Or perhaps I am just missing something.
If you take the time to answer or attempt to answer this question, thank you for your time!

Yes, you can definitely use a binary search.
Whether or not there's an advantage in it depends on how big your blocks are and how much it cost to read them, but as you say, it's not difficult, and it's not going to be slower.
I would always do this with a binary search.
Perhaps the authors just didn't want to complicate the lesson on B-Trees, or maybe they they are assuming that you've already spent a lot more time reading those blocks in the first place.

Related

Hashing algorithms for data summary

I am on the search for a non-cryptographic hashing algorithm with a given set of properties, but I do not know how to describe it in Google-able terms.
Problem space: I have a vector of 64-bit integers which are mostly linearlly distributed throughout that space. There are two exceptions to this rule: (1) The number 0 occurs considerably frequently and (2) if a number x occurs, it is more likely to occur again than 2^-64. The goal is, given two vectors A and B, to have a convenient mechanism for quickly detecting if A and B are not the same. Not all vectors are of fixed size, but any vector I wish to compare to another will have the same size (aka: a size check is trivial).
The only special requirement I have is I would like the ability to "back out" a piece of data. In other words, given A[i] = x and a hash(A), it should be cheap to compute hash(A) for A[i] = y. In other words, I want a non-cryptographic hash.
The most reasonable thing I have come up with is this (in Python-ish):
# Imagine this uses a Mersenne Twister or some other seeded RNG...
NUMS = generate_numbers(seed)
def hash(a):
out = 0
for idx in range(len(a)):
out ^= a[idx] ^ NUMS[idx]
return out
def hash_replace(orig_hash, idx, orig_val, new_val):
return orig_hash ^ (orig_val ^ NUMS[idx]) ^ (new_val ^ NUMS[idx])
It is an exceedingly simple algorithm and it probably works okay. However, all my experience with writing hashing algorithms tells me somebody else has already solved this problem in a better way.
I think what you are looking for is called homomorphic hashing algorithm and it has already been discussed Paillier cryptosystem.
As far as I can see from that discussion, there are no practical implementation nowadays.
The most interesting feature, the one for which I guess it fits your needs, is that:
H(x*y) = H(x)*H(y)
Because of that, you can freely define the lower limit of your unit and rely on that property.
I've used the Paillier cryptosystem a few years ago (there was a Java implementation somewhere, but I don't have anymore the link) during my studies, but it's far more complex in respect of what you are looking for.
It has interesting feature under certain constraints, like the following one:
n*C(x) = C(n*x)
Again, it looks to me similar to what you are looking for, so maybe you should search for this family of hashing algorithms. I'll have a try with Google searching for a more specific link.
References:
This one is quite interesting, but maybe it is not a viable solution because of your space that is [0-2^64[ (unless you accept to deal with big numbers).

How would i sort a queue using only one additional queue

So basically, Im asked to sort a queue, and i can only use one helper queue , and am only allowed constant amount of additional memory.
How would i sort a queue using only one additional queue ?
Personally, I don't like interview questions with arbitrary restrictions like this, unless it actually reflects the conditions you have to work with at the company. I don't think it actually finds qualified candidates or rather, I don't think it accurately eliminates unqualified ones.
When I did technical interviews for my company, all of the questions were realistic and relevant. Setting that aside.
While there are surely several ways to solve this, and using recursion is certainly one, I would hope that if you tried to solve it with recursion they would have asked you to do it over without recursion, considering they are placing limits on the memory you can use and the data structures. Otherwise, they might as well as given you two queues, a stack and as much memory as you want.
Here is an example of one way to solve it. At the end of this, queueB should be sorted. It uses just one extra local variable. This was done in C#.
queueB.Enqueue(queueA.Dequeue());
while (queueA.Count > 0)
{
int next = queueA.Dequeue();
for (int i = 0; i < queueB.Count; ++i)
{
if (queueB.Peek() < next)
{
queueB.Enqueue(queueB.Dequeue());
}
else
{
queueB.Enqueue(next);
next = queueB.Dequeue();
}
}
queueB.Enqueue(next);
}

question abouut string sort

i have question from programming pearls
problem is following
show how to use Lomuto's partitioning scheme to sort varying length bit strings in time proportional to the sum of their length
and algorithm is following
each record in x[0..n-1] has an integer length and pointer to the array bit[0..length-1]
code
void bsort(l,u,depth)
{
if (l>=u) return;
for (int i=l;i<u;i++)
if (x[i].length<depth)
swap(i,l++);
m=l;
if (x[i].bit[depth] == 0) swap(i,m++);
bsort(l,m-1,depth+1);
bsort(m,u,depth+1);
}
I need the following things:
how this algorithm works
how implement in java?
It's essentially almost the same in Java. If you know Java, which I assume, this shouldn't take more than a few minutes to port. I'm sure we'd be more than happy to give you some pointers as to how the algorithm works, but I'd like to see some work from you first. Take a pencil and some paper and trace the code. That's going to be your best bet with recursion.

Is this linear search implementation actually useful?

In Matters Computational I found this interesting linear search implementation (it's actually my Java implementation ;-)):
public static int linearSearch(int[] a, int key) {
int high = a.length - 1;
int tmp = a[high];
// put a sentinel at the end of the array
a[high] = key;
int i = 0;
while (a[i] != key) {
i++;
}
// restore original value
a[high] = tmp;
if (i == high && key != tmp) {
return NOT_CONTAINED;
}
return i;
}
It basically uses a sentinel, which is the searched for value, so that you always find the value and don't have to check for array boundaries. The last element is stored in a temp variable, and then the sentinel is placed at the last position. When the value is found (remember, it is always found due to the sentinel), the original element is restored and the index is checked if it represents the last index and is unequal to the searched for value. If that's the case, -1 (NOT_CONTAINED) is returned, otherwise the index.
While I found this implementation really clever, I wonder if it is actually useful. For small arrays, it seems to be always slower, and for large arrays it only seems to be faster when the value is not found. Any ideas?
EDIT
The original implementation was written in C++, so that could make a difference.
It's not thread-safe, for example, you can lose your a[high] value through having a second thread start after the first has changed a[high] to key, so will record key to tmp, and finish after the first thread has restored a[high] to its original value. The second thread will restore a[high] to what it first saw, which was the first thread's key.
It's also not useful in java, since the JVM will include bounds checks on your array, so your while loop is checking that you're not going past the end of your array anyway.
Will you ever notice any speed increase from this? No
Will you notice a lack of readability? Yes
Will you notice an unnecessary array mutation that might cause concurrency issues? Yes
Premature optimization is the root of all evil.
Doesn't seem particularly useful. The "innovation" here is just to get rid of the iteration test by combining it with the match test. Modern processors spend 0 time on iteration checks these days (all the computation and branching gets done in parallel with the match test code).
In any case, binary search kicks the ass of this code on large arrays, and is comparable on small arrays. Linear search is so 1960s.
See also the 'finding a tiger in africa' joke.
Punchline = An experienced programmer places a tiger in cairo so that the search terminates.
A sentinel search goes back to Knuth. It value is that it reduces the number of tests in a loop from two ("does the key match? Am I at the end?") to just one.
Yes, its useful, in the sense that it should significantly reduce search times for modest size unordered arrays, by virtue of eliminating conditional branch mispredictions. This also reduces insertion times (code not shown by the OP) for such arrays, because you don't have to order the items.
If you have larger arrays of ordered items, a binary search will be faster, at the cost of larger insertion time to ensure the array is ordered.
For even larger sets, a hash table will be the fastest.
The real question is what is the distribution of sizes of your arrays?
Yes - it does because while loop doesn't have 2 comparisons as opposed to standard search.
It is twice as fast.It is given as optimization in Knuth Vol 3.

Why should recursion be preferred over iteration?

Iteration is more performant than recursion, right? Then why do some people opine that recursion is better (more elegant, in their words) than iteration? I really don't see why some languages like Haskell do not allow iteration and encourage recursion? Isn't that absurd to encourage something that has bad performance (and that too when more performant option i.e. recursion is available) ? Please shed some light on this. Thanks.
Iteration is more performant than
recursion, right?
Not necessarily.
This conception comes from many C-like languages, where calling a function, recursive or not, had a large overhead and created a new stackframe for every call.
For many languages this is not the case, and recursion is equally or more performant than an iterative version. These days, even some C compilers rewrite some recursive constructs to an iterative version, or reuse the stack frame for a tail recursive call.
Try implementing depth-first search recursively and iteratively and tell me which one gave you an easier time of it. Or merge sort. For a lot of problems it comes down to explicitly maintaining your own stack vs. leaving your data on the function stack.
I can't speak to Haskell as I've never used it, but this is to address the more general part of the question posed in your title.
Haskell do not allow iteration because iteration involves mutable state (the index).
As others have stated, there's nothing intrinsically less performant about recursion. There are some languages where it will be slower, but it's not a universal rule.
That being said, to me recursion is a tool, to be used when it makes sense. There are some algorithms that are better represented as recursion (just as some are better via iteration).
Case in point:
fib 0 = 0
fib 1 = 1
fib n = fib(n-1) + fib(n-2)
I can't imagine an iterative solution that could possibly make the intent clearer than that.
Here is some information on the pros & cons of recursion & iteration in c:
http://www.stanford.edu/~blp/writings/clc/recursion-vs-iteration.html
Mostly for me Recursion is sometimes easier to understand than iteration.
Iteration is just a special form of recursion.
Recursion is one of those things that seem elegant or efficient in theory but in practice is generally less efficient (unless the compiler or dynamic recompiler) is changing what the code does. In general anything that causes unnecessary subroutine calls is going to be slower, especially when more than 1 argument is being pushed/popped. Anything you can do to remove processor cycles i.e. instructions the processor has to chew on is fair game. Compilers can do a pretty good job of this these days in general but it's always good to know how to write efficient code by hand.
Several things:
Iteration is not necessarily faster
Root of all evil: encouraging something just because it might be moderately faster is premature; there are other considerations.
Recursion often much more succinctly and clearly communicates your intent
By eschewing mutable state generally, functional programming languages are easier to reason about and debug, and recursion is an example of this.
Recursion takes more memory than iteration.
I don't think there's anything intrinsically less performant about recursion - at least in the abstract. Recursion is a special form of iteration. If a language is designed to support recursion well, it's possible it could perform just as well as iteration.
In general, recursion makes one be explicit about the state you're bringing forward in the next iteration (it's the parameters). This can make it easier for language processors to parallelize execution. At least that's a direction that language designers are trying to exploit.
As a low level ITERATION deals with the CX registry to count loops, and of course data registries.
RECURSION not only deals with that it also adds references to the stack pointer to keep references of the previous calls and then how to go back.-
My University teacher told me that whatever you do with recursion can be done with Iterations and viceversa, however sometimes is simpler to do it by recursion than Iteration (more elegant) but at a performance level is better to use Iterations.-
In Java, recursive solutions generally outperform non-recursive ones. In C it tends to be the other way around. I think this holds in general for adaptively compiled languages vs. ahead-of-time compiled languages.
Edit:
By "generally" I mean something like a 60/40 split. It is very dependent on how efficiently the language handles method calls. I think JIT compilation favors recursion because it can choose how to handle inlining and use runtime data in optimization. It's very dependent on the algorithm and compiler in question though. Java in particular continues to get smarter about handling recursion.
Quantitative study results with Java (PDF link). Note that these are mostly sorting algorithms, and are using an older Java Virtual Machine (1.5.x if I read right). They sometimes get a 2:1 or 4:1 performance improvement by using the recursive implementation, and rarely is recursion significantly slower. In my personal experience, the difference isn't often that pronounced, but a 50% improvement is common when I use recursion sensibly.
I find it hard to reason that one is better than the other all the time.
Im working on a mobile app that needs to do background work on user file system. One of the background threads needs to sweep the whole file system from time to time, to maintain updated data to the user. So in fear of Stack Overflow , i had written an iterative algorithm. Today i wrote a recursive one, for the same job. To my surprise, the iterative algorithm is faster: recursive -> 37s, iterative -> 34s (working over the exact same file structure).
Recursive:
private long recursive(File rootFile, long counter) {
long duration = 0;
sendScanUpdateSignal(rootFile.getAbsolutePath());
if(rootFile.isDirectory()) {
File[] files = getChildren(rootFile, MUSIC_FILE_FILTER);
for(int i = 0; i < files.length; i++) {
duration += recursive(files[i], counter);
}
if(duration != 0) {
dhm.put(rootFile.getAbsolutePath(), duration);
updateDurationInUI(rootFile.getAbsolutePath(), duration);
}
}
else if(!rootFile.isDirectory() && checkExtension(rootFile.getAbsolutePath())) {
duration = getDuration(rootFile);
dhm.put(rootFile.getAbsolutePath(), getDuration(rootFile));
updateDurationInUI(rootFile.getAbsolutePath(), duration);
}
return counter + duration;
}
Iterative: - iterative depth-first search , with recursive backtracking
private void traversal(File file) {
int pointer = 0;
File[] files;
boolean hadMusic = false;
long parentTimeCounter = 0;
while(file != null) {
sendScanUpdateSignal(file.getAbsolutePath());
try {
Thread.sleep(Constants.THREADS_SLEEP_CONSTANTS.TRAVERSAL);
} catch (InterruptedException e) {
e.printStackTrace();
}
files = getChildren(file, MUSIC_FILE_FILTER);
if(!file.isDirectory() && checkExtension(file.getAbsolutePath())) {
hadMusic = true;
long duration = getDuration(file);
parentTimeCounter = parentTimeCounter + duration;
dhm.put(file.getAbsolutePath(), duration);
updateDurationInUI(file.getAbsolutePath(), duration);
}
if(files != null && pointer < files.length) {
file = getChildren(file,MUSIC_FILE_FILTER)[pointer];
}
else if(files != null && pointer+1 < files.length) {
file = files[pointer+1];
pointer++;
}
else {
pointer=0;
file = getNextSybling(file, hadMusic, parentTimeCounter);
hadMusic = false;
parentTimeCounter = 0;
}
}
}
private File getNextSybling(File file, boolean hadMusic, long timeCounter) {
File result= null;
//se o file é /mnt, para
if(file.getAbsolutePath().compareTo(userSDBasePointer.getParentFile().getAbsolutePath()) == 0) {
return result;
}
File parent = file.getParentFile();
long parentDuration = 0;
if(hadMusic) {
if(dhm.containsKey(parent.getAbsolutePath())) {
long savedValue = dhm.get(parent.getAbsolutePath());
parentDuration = savedValue + timeCounter;
}
else {
parentDuration = timeCounter;
}
dhm.put(parent.getAbsolutePath(), parentDuration);
updateDurationInUI(parent.getAbsolutePath(), parentDuration);
}
//procura irmao seguinte
File[] syblings = getChildren(parent,MUSIC_FILE_FILTER);
for(int i = 0; i < syblings.length; i++) {
if(syblings[i].getAbsolutePath().compareTo(file.getAbsolutePath())==0) {
if(i+1 < syblings.length) {
result = syblings[i+1];
}
break;
}
}
//backtracking - adiciona pai, se tiver filhos musica
if(result == null) {
result = getNextSybling(parent, hadMusic, parentDuration);
}
return result;
}
Sure the iterative isn't elegant, but alhtough its currently implemented on an ineficient way, it is still faster than the recursive one. And i have better control over it, as i dont want it running at full speed, and will let the garbage collector do its work more frequently.
Anyway, i wont take for granted that one method is better than the other, and will review other algorithms that are currently recursive. But at least from the 2 algorithms above, the iterative one will be the one in the final product.
I think it would help to get some understanding of what performance is really about. This link shows how a perfectly reasonably-coded app actually has a lot of room for optimization - namely a factor of 43! None of this had anything to do with iteration vs. recursion.
When an app has been tuned that far, it gets to the point where the cycles saved by iteration as against recursion might actually make a difference.
Recursion is the typical implementation of iteration. It's just a lower level of abstraction (at least in Python):
class iterator(object):
def __init__(self, max):
self.count = 0
self.max = max
def __iter__(self):
return self
# I believe this changes to __next__ in Python 3000
def next(self):
if self.count == self.max:
raise StopIteration
else:
self.count += 1
return self.count - 1
# At this level, iteration is the name of the game, but
# in the implementation, recursion is clearly what's happening.
for i in iterator(50):
print(i)
I would compare recursion with an explosive: you can reach big result in no time. But if you use it without cautions the result could be disastrous.
I was impressed very much by proving of complexity for the recursion that calculates Fibonacci numbers here. Recursion in that case has complexity O((3/2)^n) while iteration just O(n). Calculation of n=46 with recursion written on c# takes half minute! Wow...
IMHO recursion should be used only if nature of entities suited for recursion well (trees, syntax parsing, ...) and never because of aesthetic. Performance and resources consumption of any "divine" recursive code need to be scrutinized.
Iteration is more performant than recursion, right?
Yes.
However, when you have a problem which maps perfectly to a Recursive Data Structure, the better solution is always recursive.
If you pretend to solve the problem with iterations you'll end up reinventing the stack and creating a messier and ugly code, compared to the elegant recursive version of the code.
That said, Iteration will always be faster than Recursion. (in a Von Neumann Architecture), so if you use recursion always, even where a loop will suffice, you'll pay a performance penalty.
Is recursion ever faster than looping?
"Iteration is more performant than recursion" is really language- and/or compiler-specific. The case that comes to mind is when the compiler does loop-unrolling. If you've implemented a recursive solution in this case, it's going to be quite a bit slower.
This is where it pays to be a scientist (testing hypotheses) and to know your tools...
on ntfs UNC max path as is 32K
C:\A\B\X\C.... more than 16K folders can be created...
But you can not even count the number of folders with any recursive method, sooner or later all will give stack overflow.
Only a Good lightweight iterative code should be used to scan folders professionally.
Believe or not, most top antivirus cannot scan maximum depth of UNC folders.

Resources