I need to find an array/linked-list algorithm that solves the following problem:
There are n different clubs, and there are 10 universities.
At each university there is a branch of each club.
At the beginning I have k scholarships to grant for each one of the clubs at each one of the universities (a total of 10*n*k).
In addition, I can add more scholarships to the budget later.
Let's assume we can initialize any data structure in O(1) time.
We need to support the following operations:
grant(i,j) - grant a scholarship to club i (between 1 and n) at university j (between 1 and 10). Time: O(1)
add(i,j) - add another scholarship to the budget of club i at university j. Time: O(1)
insight(m) - print the m (a number between 1 and n) clubs that have received the most scholarships so far. Time: O(m)
for example, after the lines:
grant(1,3)
grant(1,2)
grant(2,4)
grant(2,5)
grant(3,1)
insight(2)
I should print:
club 1 got 2 scholarships
club 2 got 2 scholarships
I'm allowed a space complexity of O(n*k).
I need only an algorithm.
I want to use the space, instead of time, to keep the data structure sorted for the insight operation, but I can't find a way, because the add operation can add as many scholarships as it wants.
Assuming you have a hash set data structure with O(1) insertion and deletion which can be traversed (e.g. using a doubly linked list threaded through the inserted values):
I have a solution with O(n + k) space complexity, O(1) grant and O(m) insight. My solution ignores universities entirely, since you want to sort by total scholarships awarded over all universities.
A minor simplification
I will first give an easier to explain version with an O(m + k) insight function, before modifying it to make it O(m).
Data structures
There is a doubly linked list of hash sets, where each hash set contains the club numbers of all clubs that have been granted a certain number of scholarships: the first node contains all clubs that have been granted 0 scholarships, the second node those granted 1, etc. Initially there will only be one node in the list, with a hash set containing all the club numbers, since initially no club has been granted any scholarships.
There is also an array of length n where index i in the array contains the node in the aforementioned doubly linked list which has the hash set that contains club i.
There is also a variable to store the tail of the doubly linked list, which is where the hash set of the clubs that gave the most scholarships is stored.
Granting a scholarship
First find the node that contains the club i using the array of nodes. The club is removed from the hash set of this node. If no next node exists, create it with an empty hash set, and connect it with the next and previous pointers and update the variable that points to the tail of the list. Then, add the club to the hash set of the next node in the doubly linked list.
Update the array of nodes at index i to point to the next node in the list.
Insight
Use the variable that points to the tail of the double linked list to traverse the list in reverse, iterating over the hash sets and printing items until m have been printed.
Complexities
The doubly linked list has a node added for each number of scholarships that has been awarded. At most k scholarships can be awarded, so there are at most k nodes. Each node contains a hash set, and the total number of items across all the hash sets is exactly n. The array has length n. This gives the space complexity of O(n + k).
grant contains no loops and all operations are O(1), so it is O(1).
The insight function may be O(m) in some cases, such as when the hash set of the tail node already contains at least m clubs. However, in other cases there may be many nodes with empty hash sets that need to be iterated over to find the next item to print, for example:
grant(1)
grant(7)
grant(7)
grant(7)
grant(7)
grant(7)
grant(7)
insight(2)
This creates a linked list with the tail node just containing 7, then (going towards the head) 4 empty nodes, then a node with 1, then a node with all other clubs. insight thus may have to traverse up to k (the maximum length of the list) nodes, making it O(m + k).
Improving to O(m) insight
To fix this problem, each node should also store the number of scholarships granted to the clubs in its hash set. This allows the empty nodes to be eliminated: a node containing only a single club can, when that club is granted a scholarship, simply increment its number rather than creating any new nodes. Unfortunately, this makes the grant code less neat: when granting a scholarship we now need to check whether the number of the next node is exactly one more than the number of the current node, and delete emptied nodes. It does work, though. Since there are no empty nodes, the number of nodes to iterate through is at most m, making insight O(m).
An implementation
In case any of that wasn't clear, here is a reference implementation of the final data structure in Python:
from dataclasses import dataclass, field
from typing import *

@dataclass
class Node:
    clubs: Set[int]
    number: int = 0
    # repr=False avoids infinite recursion when printing a node
    prev: Optional['Node'] = field(default=None, repr=False)
    next: Optional['Node'] = field(default=None, repr=False)

class ScholarshipCounter:
    def __init__(self, n):
        # the tail of the list
        self.greatest_node = Node(clubs=set(range(n)))
        # the array pointing to the node that contains each club
        self.club_nodes: List[Node] = [self.greatest_node] * n

    def grant(self, club):
        # the node such that `node.clubs` contains `club`
        node: Node = self.club_nodes[club]
        if len(node.clubs) == 1 and not (node.next and node.next.number == node.number + 1):
            # if there is only one club, increment the number,
            # unless the next node has the correct number, in which case
            # the club is moved to that node instead (in the else branch)
            node.number += 1
        else:
            # remove the club from this node so it can be added to the next
            node.clubs.remove(club)
            if not node.next:
                # create a next node if there isn't one
                self.greatest_node = Node(clubs=set(), number=node.number + 1)
                node.next = self.greatest_node
                self.greatest_node.prev = node
            elif node.next.number != node.number + 1:
                # if the number on the next node is not one more than this node's number,
                # insert a new node with the desired number in between the two
                new_node = Node(clubs=set(), number=node.number + 1)
                new_node.next = node.next
                node.next.prev = new_node
                node.next = new_node
                new_node.prev = node
            # finish moving the club to the next node
            node.next.clubs.add(club)
            # update the entry in the array of nodes
            self.club_nodes[club] = node.next
            # if the node is now empty, delete it
            if not node.clubs:
                if node.prev:
                    node.prev.next = node.next
                # there is always a next node here, so no need to check
                node.next.prev = node.prev

    def insight(self, m):
        return_list = []
        # iterate starting at the tail, where the clubs with the most scholarships are
        node = self.greatest_node
        while len(return_list) < m and node is not None:
            # iterate over this node's hash set of clubs
            for club in node.clubs:
                return_list.append((club, node.number))
                # if enough have been found, stop
                if len(return_list) == m:
                    break
            node = node.prev
        return return_list

if __name__ == '__main__':
    sc = ScholarshipCounter(n=10)
    sc.grant(1)
    sc.grant(1)
    sc.grant(2)
    sc.grant(2)
    for i in range(10):
        sc.grant(3)
    print(sc.greatest_node)
    for club, number in sc.insight(3):
        print(f'club {club} got {number} scholarships')
It's easy to maintain a list of clubs sorted by number of scholarships. The key is that each club's count only changes by 1 on each grant.
A data structure that works is a doubly-linked list of all the distinct scholarship counts. The node for each count has a doubly-linked list of all the clubs that have that number of scholarships. Each club's node has a pointer to the count node that contains its list. You also need an auxiliary index of club node by club ID or whatever.
When you grant a scholarship:
Get the club's node and the containing count node.
Look to the count node's right. If there's no node for count+1, then make one.
Move the club's node from the count node's list to the count+1 node's list.
If count node is now empty, delete it.
To get the m clubs with the most scholarships, you just start at the high end of the count list, and output all the clubs for each count until you have enough.
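In case a concrete version helps, here is a hedged Python sketch of this count-node structure. Dicts stand in for the list nodes, a hash set stands in for the per-count doubly-linked club list, and all names (CountList, grant, top) are mine:

```python
class CountList:
    """Doubly-linked list of distinct scholarship counts; each node holds
    the set of clubs that currently have that count."""
    def __init__(self, n):
        # one node for count 0, containing every club
        self.head = {'count': 0, 'clubs': set(range(n)), 'prev': None, 'next': None}
        self.tail = self.head
        # auxiliary index: club id -> containing count node
        self.where = {c: self.head for c in range(n)}

    def grant(self, club):
        node = self.where[club]
        nxt = node['next']
        if nxt is None or nxt['count'] != node['count'] + 1:
            # no node for count+1 to the right: make one
            nxt = {'count': node['count'] + 1, 'clubs': set(),
                   'prev': node, 'next': node['next']}
            if node['next']:
                node['next']['prev'] = nxt
            node['next'] = nxt
            if node is self.tail:
                self.tail = nxt
        # move the club from the count node to the count+1 node
        node['clubs'].remove(club)
        nxt['clubs'].add(club)
        self.where[club] = nxt
        if not node['clubs']:
            # the count node is now empty: delete it
            if node['prev']:
                node['prev']['next'] = nxt
            else:
                self.head = nxt
            nxt['prev'] = node['prev']

    def top(self, m):
        # walk from the high end of the count list
        out, node = [], self.tail
        while node and len(out) < m:
            for c in node['clubs']:
                if len(out) == m:
                    break
                out.append((c, node['count']))
            node = node['prev']
        return out
```

With three clubs, after granting twice to club 0 and once to club 1, `top(2)` reports club 0 with 2 scholarships and club 1 with 1.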
I am trying to find the smallest value in a max-heap (stored in an array) recursively, without reversing the array. I have some problems trying to define the recursive case. How do I give the correct index to the recursive call (starting the index from 1 instead of 0)? If a node is stored in slot i, then I know that its left and right children are stored in slots 2*i and 2*i+1 respectively, and so are their own left and right children. How do I pass this information recursively?
pseudo-code:
smallest(size, index_of_parent_node) {
    i = size / 2
    if (i == 0)
        return A[i]
    else
        return min(smallest(size/2, index_of_left_of_parent),
                   smallest(size/2, index_of_right_of_parent))
}
The current implementation does not work because it will not look at all the leaf nodes, and the minimum element is always one of the leaf nodes.
If you want to do it recursively, you can start from the root node of the max-heap and take the minimum of its two subtrees recursively, like below:
def getSmallestNumber(maxHeapArray, size):
    # assuming maxHeapArray has at least one element
    # and 1-based indexing
    return helper(maxHeapArray, size, 1)

def helper(maxHeapArray, size, currentIndex):
    # past the end of the heap: contributes nothing to the minimum
    if currentIndex > size:
        return float('inf')
    currentNumber = maxHeapArray[currentIndex]
    leftIndex = 2 * currentIndex
    rightIndex = 2 * currentIndex + 1
    leftMin = helper(maxHeapArray, size, leftIndex)
    rightMin = helper(maxHeapArray, size, rightIndex)
    return min(currentNumber, leftMin, rightMin)
You can also do a linear traversal of the complete array, or of just the half of it that holds the leaves; either way, getting the minimum element from a max-heap takes O(n) time.
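Since the minimum of a max-heap must be a leaf, the linear scan only needs the second half of the array. A minimal sketch with 0-based indexing (the function name is mine):

```python
def min_of_max_heap(heap):
    # in a 0-based array heap of n elements, all leaves occupy
    # indices n // 2 .. n - 1, and the minimum must be a leaf
    n = len(heap)
    return min(heap[n // 2:])
```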
The problem I've seen is as below; does anyone have an idea about it?
http://judgecode.com/problems/1011
Given a permutation of integers from 0 to n - 1, sorting them is easy. But what if you can only swap a pair of integers every time?
Please calculate the minimal number of swaps
One classic algorithm seems to be permutation cycles (https://en.wikipedia.org/wiki/Cycle_notation#Cycle_notation). The number of swaps needed equals the total number of elements subtracted by the number of cycles.
For example:
1 2 3 4 5
2 5 4 3 1
Start with 1 and follow the cycle:
1 down to 2, 2 down to 5, 5 down to 1.
1 -> 2 -> 5 -> 1
3 -> 4 -> 3
We would need to swap index 1 with 5, then index 5 with 2; as well as index 3 with index 4. Altogether 3 swaps, which is n minus the 2 cycles. We subtract the number of cycles from n because the cycle sizes sum to n, and each cycle needs one swap fewer than the number of elements in it.
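The cycle-counting idea above can be sketched in Python for a 0-based permutation of 0..n-1 (the function name is mine):

```python
def min_swaps(perm):
    # count the permutation cycles; the answer is n minus the cycle count
    n = len(perm)
    seen = [False] * n
    cycles = 0
    for i in range(n):
        if not seen[i]:
            cycles += 1
            # follow the cycle starting at i, marking every index visited
            j = i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return n - cycles
```

For the example above (2 5 4 3 1, shifted to 0-based as [1, 4, 3, 2, 0]) there are 2 cycles, giving 5 - 2 = 3 swaps.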
Here is a simple implementation in C for the above problem. The algorithm is similar to User גלעד ברקן:
Store the position of every element of a[] in b[]. So, b[a[i]] = i
Iterate over the initial array a[] from left to right.
At position i, check if a[i] is equal to i. If yes, then keep iterating.
If no, then it's time to swap. Look at the logic in the code closely to see how the swapping takes place; this is the most important step, as both arrays a[] and b[] need to be modified. Increase the count of swaps.
Here is the implementation:
long long sortWithSwap(int n, int *a) {
    /* temporary array keeping track of the position of every element */
    int *b = (int*) malloc(sizeof(int) * n);
    int i, valai, posi;
    for (i = 0; i < n; i++) {
        b[a[i]] = i;
    }
    long long ans = 0;
    for (i = 0; i < n; i++) {
        if (a[i] != i) {
            valai = a[i];
            posi = b[i];
            a[b[i]] = a[i];
            a[i] = i;
            b[i] = i;
            b[valai] = posi;
            ans++;
        }
    }
    free(b);
    return ans;
}
The essence of solving this problem lies in the following observations:
1. The elements in the array do not repeat.
2. The range of elements is from 0 to n-1, where n is the size of the array.
The way to approach it
Once you have understood the approach, you can solve the problem in linear time.
Imagine how the array would look after sorting all the entries: it will satisfy arr[i] == i for all entries. Is that convincing?
First create a bool array named FIX, where FIX[i] == true if the ith location is fixed; initialize this array with false.
Start checking the original array for the match arr[i] == i. As long as this condition holds, everything is okay; while traversing the array, also update FIX[i] = true. The moment you find arr[i] != i, you need to do something: arr[i] must hold some value x such that x > i. How do we guarantee that? The guarantee comes from the fact that the elements in the array do not repeat, so if the array is sorted up to index i, the element at position i cannot come from the left, only from the right.
The value x is essentially an index, because the array only has elements from 0 to n-1, and in the sorted array every element i must be at location i.
arr[i] == x means that not only is element i not at its correct position, but element x is also missing from its place.
Now, to fix the ith location you need to look at the xth location, because maybe the xth location holds i; then you swap the elements at indices i and x and the job is done in one swap. But wait, it is not guaranteed that index x holds i. It may instead hold some value y, which again will be greater than i, because the array is only sorted up to location i.
So you try to fix position x first, and similarly keep fixing until you encounter element i at some location.
The way to do this is to follow the links from arr[i] until you hit element i at some index.
It is guaranteed that you will hit i at some location by following the links in this way. Why? Try proving it; make some examples and you will see it.
Now fix all the indices you saw on the path from index i to that index (call it j). The path you followed is a cycle, and for every index on it, the correct value is stored at the previous index (the index from which you reached it). Once you see that, you can fix all those indices and mark them as true in the FIX array. Then go ahead with the next index of the array and do the same thing until the whole array is fixed.
That was the complete idea. To only count the number of swaps, note that once you have found a cycle of L elements, you need L-1 swaps to fix it; after doing that, continue with the rest of the array. That is how you count the number of swaps.
Please let me know if you have any doubts about the approach.
You may also ask for C/C++ code help. Happy to help :-)
Source : Google Interview Question
Write a routine to ensure that identical elements in the input are maximally spread in the output.
Basically, we need to place the same elements in such a way that the TOTAL spreading is as maximal as possible.
Example:
Input: {1,1,2,3,2,3}
Possible Output: {1,2,3,1,2,3}
Total dispersion = difference between positions of the 1's + 2's + 3's = (4-1) + (5-2) + (6-3) = 9.
I am NOT AT ALL sure if there is an optimal polynomial-time algorithm available for this. Also, no other detail is provided for the question other than this.
What I thought is: calculate the frequency of each element in the input, then arrange them in the output, one distinct element at a time, until all the frequencies are exhausted.
I am not sure of my approach.
Any approaches/ideas, people?
I believe this simple algorithm would work:
count the number of occurrences of each distinct element.
make a new list
add one instance of all elements that occur more than once to the list (order within each group does not matter)
add one instance of all unique elements to the list
add one instance of all elements that occur more than once to the list
add one instance of all elements that occur more than twice to the list
add one instance of all elements that occur more than three times to the list
...
Now, this will intuitively not give a good spread:
for {1, 1, 1, 1, 2, 3, 4} ==> {1, 2, 3, 4, 1, 1, 1}
for {1, 1, 1, 2, 2, 2, 3, 4} ==> {1, 2, 3, 4, 1, 2, 1, 2}
However, I think this is the best spread you can get given the scoring function provided.
Since the dispersion score counts the sum of the distances instead of the squared sum of the distances, you can have several duplicates close together, as long as you have a large gap somewhere else to compensate.
for a sum-of-squared-distances score, the problem becomes harder.
Perhaps the interview question hinged on the candidate recognizing this weakness in the scoring function?
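The layered construction above (duplicates, then uniques, then one more copy of each duplicate per round) can be sketched in Python; the function name is mine, and it relies on Counter preserving first-seen order:

```python
from collections import Counter

def layered_spread(data):
    cnt = Counter(data)
    dups = [x for x in cnt if cnt[x] > 1]
    uniques = [x for x in cnt if cnt[x] == 1]
    # one instance of every duplicate, then all uniques
    out = dups + uniques
    # then one more copy of each remaining duplicate per round
    remaining = {x: cnt[x] - 1 for x in dups}
    while any(remaining.values()):
        for x in dups:
            if remaining[x] > 0:
                out.append(x)
                remaining[x] -= 1
    return out
```

This reproduces the two examples above: {1,1,1,1,2,3,4} becomes {1,2,3,4,1,1,1} and {1,1,1,2,2,2,3,4} becomes {1,2,3,4,1,2,1,2}.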
In Perl:
@a = (9,9,9,2,2,2,1,1,1);
Then make a hash table of the counts of the different numbers in the list, like a frequency table:
map { $x{$_}++ } @a;
Then repeatedly walk through all the keys found, with the keys in a known order, and add one of each remaining number to an output list until all the counts are exhausted:
@r = ();
$g = 1;
while ($g == 1) {
    $g = 0;
    for my $n (sort keys %x) {
        if ($x{$n} > 0) {
            push @r, $n;
            $x{$n}--;
            $g = 1;
        }
    }
}
I'm sure that this could be adapted to any programming language that supports hash tables
Python code for the algorithm suggested by Vorsprung and HugoRune:
from collections import Counter, defaultdict

def max_spread(data):
    cnt = Counter(data)
    res, num = [], list(cnt)
    while len(cnt) > 0:
        for i in num:
            if cnt[i] > 0:
                res.append(i)
                cnt[i] -= 1
                if cnt[i] == 0:
                    del cnt[i]
    return res

def calc_spread(data):
    d = defaultdict(list)
    for i, v in enumerate(data):
        d[v].append(i)
    return sum(max(x) - min(x) for x in d.values())
HugoRune's answer takes some advantage of the unusual scoring function but we can actually do even better: suppose there are d distinct non-unique values, then the only thing that is required for a solution to be optimal is that the first d values in the output must consist of these in any order, and likewise the last d values in the output must consist of these values in any (i.e. possibly a different) order. (This implies that all unique numbers appear between the first and last instance of every non-unique number.)
The relative order of the first copies of non-unique numbers doesn't matter, and likewise nor does the relative order of their last copies. Suppose the values 1 and 2 both appear multiple times in the input, and that we have built a candidate solution obeying the condition I gave in the first paragraph that has the first copy of 1 at position i and the first copy of 2 at position j > i. Now suppose we swap these two elements. Element 1 has been pushed j - i positions to the right, so its score contribution will drop by j - i. But element 2 has been pushed j - i positions to the left, so its score contribution will increase by j - i. These cancel out, leaving the total score unchanged.
Now, any permutation of elements can be achieved by swapping elements in the following way: swap the element in position 1 with the element that should be at position 1, then do the same for position 2, and so on. After the ith step, the first i elements of the permutation are correct. We know that every swap leaves the scoring function unchanged, and a permutation is just a sequence of swaps, so every permutation also leaves the scoring function unchanged! This is true for the d elements at both ends of the output array.
When 3 or more copies of a number exist, only the position of the first and last copy contribute to the distance for that number. It doesn't matter where the middle ones go. I'll call the elements between the 2 blocks of d elements at either end the "central" elements. They consist of the unique elements, as well as some number of copies of all those non-unique elements that appear at least 3 times. As before, it's easy to see that any permutation of these "central" elements corresponds to a sequence of swaps, and that any such swap will leave the overall score unchanged (in fact it's even simpler than before, since swapping two central elements does not even change the score contribution of either of these elements).
This leads to a simple O(nlog n) algorithm (or O(n) if you use bucket sort for the first step) to generate a solution array Y from a length-n input array X:
Sort the input array X.
Use a single pass through X to count the number of distinct non-unique elements. Call this d.
Set i, j and k to 0.
While i < n:
If X[i+1] == X[i], we have a non-unique element:
Set Y[j] = Y[n-j-1] = X[i].
Increment i twice, and increment j once.
While X[i] == X[i-1]:
Set Y[d+k] = X[i].
Increment i and k.
Otherwise we have a unique element:
Set Y[d+k] = X[i].
Increment i and k.
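The steps above can be sketched in Python with 0-based indices and explicit bounds checks added (the function name is mine):

```python
from collections import Counter

def build_spread(data):
    # step 1: sort the input
    X = sorted(data)
    n = len(X)
    # step 2: count the distinct non-unique elements
    cnt = Counter(X)
    d = sum(1 for v in cnt.values() if v > 1)
    Y = [None] * n
    i = j = k = 0
    while i < n:
        if i + 1 < n and X[i + 1] == X[i]:
            # non-unique element: first copy near the front,
            # last copy mirrored near the back
            Y[j] = Y[n - j - 1] = X[i]
            i += 2
            j += 1
            # any further copies go into the central region
            while i < n and X[i] == X[i - 1]:
                Y[d + k] = X[i]
                i += 1
                k += 1
        else:
            # unique element: goes into the central region
            Y[d + k] = X[i]
            i += 1
            k += 1
    return Y
```

For {1,1,2,3,2,3} this produces [1, 2, 3, 3, 2, 1], whose dispersion is (5-0) + (4-1) + (3-2) = 9, matching the example's optimum.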
Jump Game:
Given an array, start from the first element and reach the last by jumping. The jump length can be at most the value at the current position in the array. The optimum result is when you reach the goal in minimum number of jumps.
What is an algorithm for finding the optimum result?
An example: given array A = {2,3,1,1,4} the possible ways to reach the end (index list) are
0,2,3,4 (jump 2 to index 2, then jump 1 to index 3 then 1 to index 4)
0,1,4 (jump 1 to index 1, then jump 3 to index 4)
Since second solution has only 2 jumps it is the optimum result.
Overview
Given your array a and the index of your current position i, repeat the following until you reach the last element.
Consider all candidate "jump-to elements" in a[i+1] to a[a[i] + i]. For each such element at index e, calculate v = a[e] + e. If one of the elements is the last element, jump to the last element. Otherwise, jump to the element with the maximal v.
More simply put, of the elements within reach, look for the one that will get you furthest on the next jump. We know this selection, x, is the right one because compared to every other element y you can jump to, the elements reachable from y are a subset of the elements reachable from x (except for elements from a backward jump, which are obviously bad choices).
This algorithm runs in O(n) because each element need be considered only once (elements that would be considered a second time can be skipped).
Example
Consider the array of values a, indices i, and sums of index and value v.
i -> 0 1 2 3 4 5 6 7 8 9 10 11 12
a -> [4, 11, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
v -> 4 12 3 4 5 6 7 8 9 10 11 12 13
Start at index 0 and consider the next 4 elements. Find the one with maximal v. That element is at index 1, so jump to 1. Now consider the next 11 elements. The goal is within reach, so jump to the goal.
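This greedy selection can also be written in the equivalent "reach" formulation: within the current jump's window, remember the farthest index reachable, and start a new jump when the window ends. A minimal sketch (the names are mine):

```python
def min_jumps(a):
    n = len(a)
    if n <= 1:
        return 0
    jumps = 0
    cur_end = 0   # farthest index reachable with `jumps` jumps
    farthest = 0  # farthest index reachable with one more jump
    for i in range(n - 1):
        farthest = max(farthest, i + a[i])
        if i == cur_end:
            # the current window is exhausted: take another jump
            jumps += 1
            cur_end = farthest
            if cur_end >= n - 1:
                break
    return jumps
```

For A = {2,3,1,1,4} this returns 2, and for the 13-element example above it also returns 2.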
Dynamic programming.
Imagine you have an array B where B[i] is the minimum number of steps needed to reach index i in your array A. Your answer, of course, is then in B[n], given that A has n elements and indices start from 1. Assume C[i] = j means that you jumped from index j to index i (this is used to recover the path taken later).
So, the algorithm is the following:
set B[i] to infinity for all i
B[1] = 0; <-- zero steps to reach B[1]
for i = 1 to n-1 <-- Each step updates possible jumps from A[i]
for j = 1 to A[i] <-- Possible jump sizes are 1, 2, ..., A[i]
if i+j > n <-- Array boundary check
break
if B[i+j] > B[i]+1 <-- If this path to B[i+j] was shorter than previous
B[i+j] = B[i]+1 <-- Keep the shortest path value
C[i+j] = i <-- Keep the path itself
The number of jumps needed is B[n]. The path that needs to be taken is:
1 -> C[1] -> C[C[1]] -> C[C[C[1]]] -> ... -> n
Which can be restored by a simple loop.
The algorithm is of O(min(k,n)*n) time complexity and O(n) space complexity. n is the number of elements in A and k is the maximum value inside the array.
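A hedged Python version of this DP, shifted to 0-based indices (the function name is mine; path recovery assumes the last index is reachable):

```python
def min_jumps_dp(A):
    # B[i] = minimum number of jumps to reach index i; C[i] = predecessor of i
    n = len(A)
    INF = float('inf')
    B = [INF] * n
    C = [0] * n
    B[0] = 0
    for i in range(n - 1):
        if B[i] == INF:
            continue  # index i is unreachable
        for j in range(1, A[i] + 1):
            if i + j >= n:
                break  # array boundary check
            if B[i + j] > B[i] + 1:
                B[i + j] = B[i] + 1  # keep the shortest path value
                C[i + j] = i         # and the path itself
    # recover the path by walking the predecessors back from the last index
    path = [n - 1]
    while path[-1] != 0:
        path.append(C[path[-1]])
    return B[n - 1], path[::-1]
```

For A = {2,3,1,1,4} it returns 2 jumps with path 0 -> 1 -> 4, the optimum from the question.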
Note
I am keeping this answer, but cheeken's greedy algorithm is correct and more efficient.
Construct a directed graph from the array, e.g. i -> j if |i-j| <= x[i] (basically, if you can move from i to j in one hop, put i -> j as an edge in the graph). Now find the shortest path from the first node to the last.
FWIW, you can use Dijkstra's algorithm to find the shortest route. Complexity is O(|E| + |V| log |V|). Since |E| < n^2, this becomes O(n^2).
We can keep track of the farthest index we can currently jump to, and while scanning the indices in between, whenever some index lets us jump beyond that farthest index, we update it.
A simple O(n) time complexity solution (this version checks whether the end is reachable):
public boolean canJump(int[] nums) {
    int far = 0;
    for (int i = 0; i < nums.length; i++) {
        if (i <= far) {
            far = Math.max(far, i + nums[i]);
        } else {
            return false;
        }
    }
    return true;
}
Start from the left end and traverse till a number equals its index; use the maximum of such numbers. For example, if the list is
list:  2738|4|6927
index: 0123|4|5678
Once you've got this, repeat the above step from this number till you reach the extreme right.
273846927
000001234
In case you don't find anything matching the index, use the digit with the farthest index whose value is greater than its index; in this case 7. (Because the index will soon be greater than the number, you can probably just count to 9 indices.)
basic idea:
start building the path from the end to the start by finding all array elements from which it is possible to make the last jump to the target element (all i such that A[i] >= target - i).
treat each such i as the new target and find a path to it (recursively).
choose the minimal length path found, append the target, return.
simple example in python:
ls1 = [2,3,1,1,4]
ls2 = [4,11,1,1,1,1,1,1,1,1,1,1,1]
# finds the shortest path in ls to the target index tgti
def find_path(ls,tgti):
# if the target is the first element in the array, return it's index.
if tgti<= 0:
return [0]
# for each 0 <= i < tgti, if it it possible to reach
# tgti from i (ls[i] <= >= tgti-i) then find the path to i
sub_paths = [find_path(ls,i) for i in range(tgti-1,-1,-1) if ls[i] >= tgti-i]
# find the minimum length path in sub_paths
min_res = sub_paths[0]
for p in sub_paths:
if len(p) < len(min_res):
min_res = p
# add current target to the chosen path
min_res.append(tgti)
return min_res
print find_path(ls1,len(ls1)-1)
print find_path(ls2,len(ls2)-1)
>>>[0, 1, 4]
>>>[0, 1, 12]