Maximum and minimum number of keys in this B-tree - b-tree

This is from a homework assignment:
Assume that each page (disk block) has 16K bytes and each KVP has 8 bytes. Thus
we decide to use a B-tree of minsize (16000/8)/2 = 1000. Let T be such a B-tree and
suppose that height of T is 3. What is the minimum and maximum number of keys
that can be stored in T? Briefly justify your answer.
Note the following due to the properties of B-trees:
Each node has at most 2000 keys
Each node has at least 1000 keys (except for the root node)
I am having trouble understanding how the memory is limiting the number of keys.
It seems to me that since each page has 16000 bytes of space and each key takes up 8 bytes, each page can store 2000 keys, which is the maximum number of keys that can be stored in each node anyway.
The following are my calculations:
Minimum number of keys = 1000(1001)(2) + 1 = 2002001 keys at minimum
(Since the root is not constrained to having at least 1000 keys)
Maximum number of keys = 2000(2001)(2001) = 8008002000 keys at maximum
I feel I am missing something vital as the question cannot be this simple.
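For what it's worth, here is a small Go sketch (mine, not part of the assignment) that just mechanizes the counting under the stated node bounds (at most 2000 keys per node, at least 1000 per non-root node), taking "height 3" to mean three levels of nodes; it ignores how keys and pointers actually fit on the disk pages:

package main

import "fmt"

// Counts keys in a B-tree using the node bounds stated in the question:
// every node holds at most 2000 keys, every non-root node holds at least
// 1000 keys, and the root holds at least 1 key. "height" is taken to mean
// the number of node levels; the course's convention may differ.
func minKeys(height int) int {
    if height == 1 {
        return 1 // a lone root needs only one key
    }
    return 1 + 2*minSubtree(height-1) // minimal root: 1 key, 2 children
}

func minSubtree(height int) int {
    if height == 1 {
        return 1000
    }
    return 1000 + 1001*minSubtree(height-1) // 1000 keys => 1001 children
}

func maxKeys(height int) int {
    if height == 1 {
        return 2000
    }
    return 2000 + 2001*maxKeys(height-1) // 2000 keys => 2001 children
}

func main() {
    fmt.Println(minKeys(3), maxKeys(3))
}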

Somewhat blatant hint: Each non-leaf node has a right and a left child. Plus, there are pointers to key/value pairs, however they might be stored. (1000 seems like a lot...) Think about how you're going to store those 1000+ data points.
    +--------------+
    |     Root     |
    | Left   Right |
    +---+------+---+
        |      |
        |      +---+----------+
        |      |   Level 2    +---Data: List, hash table, whatever
        |      | Left   Right |
        |      +---+------+---+
        |          |      |
        |         Etc    Etc
        |
    +---+----------+
    |   Level 2    +---Data: List, hash table, whatever
    | Left   Right |
    +---+------+---+
        |      |
       Etc    Etc

Related

Proficient method to use adjacency list and matrix for graphs in golang?

I want to practice graphs in golang using an adjacency list and an adjacency matrix. For that, the user inputs the number of nodes (n) and edges (e).
In golang, I cannot dynamically initialise the size of the matrix as (var arr [e][e]int). I can only do it as (var arr [6][6]int); otherwise I get an error saying that the array size has to be constant. How can I implement this?
My other question is about the adjacency list. I have to store, at each position in the array, the nodes that this node is connected to. Example:
0 | 1 | 2 | 3 | 4 | 5 |
{} | {2,3} | {1,4} | {1,4,5} | {2,3,5} | {3,4} |
Which data structure should I use for this?
I hope I was able to convey my problem.
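A minimal sketch of one common approach in Go (the input handling is omitted): slices, unlike arrays, can be sized at runtime, so they work for the matrix, and a slice of neighbour slices serves as the adjacency list. The edges below reproduce the example from the question:

package main

import "fmt"

func main() {
    n := 6 // number of nodes; in practice read from user input

    // Adjacency matrix: a slice of slices can be sized at runtime,
    // unlike a fixed-size array such as [6][6]int, whose dimensions
    // must be compile-time constants.
    matrix := make([][]int, n)
    for i := range matrix {
        matrix[i] = make([]int, n)
    }

    // Adjacency list: one slice of neighbour indices per node.
    list := make([][]int, n)

    addEdge := func(u, v int) {
        matrix[u][v], matrix[v][u] = 1, 1
        list[u] = append(list[u], v)
        list[v] = append(list[v], u)
    }

    // The edges from the example table in the question.
    addEdge(1, 2)
    addEdge(1, 3)
    addEdge(2, 4)
    addEdge(3, 4)
    addEdge(3, 5)
    addEdge(4, 5)

    fmt.Println(matrix)
    fmt.Println(list)
}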

Data structure to achieve random delete and insert where elements are weighted in [a,b]

I would like to design a data structure and algorithm such that, given an array of elements where each element has a weight in [a,b], I can achieve constant-time insertion and deletion. Deletion is performed randomly, where the probability of an element being deleted is proportional to its weight.
I do not believe there is a deterministic algorithm that can achieve both operations in constant time, but are there randomized algorithms that can accomplish this?
I don't know if O(1) worst-case time is impossible; I don't see any particular reason it should be. But it's definitely possible to have a simple data structure which achieves O(1) expected time.
The idea is to store a dynamic array of pairs (or two parallel arrays), where each item is paired with its weight. Insertion is done by appending, in O(1) amortised time, and an element can be removed by index by swapping it with the last element, so that it can be removed from the end of the array in O(1) time. To sample a random element from the weighted distribution, choose a random index and generate a uniform random number in the half-open interval [0, 2), 2 being the maximum possible weight here; if that number is less than the element's weight, select the element at that index, otherwise repeat the process until an element is selected. The idea is that each index is equally likely to be chosen, and the probability that it is kept rather than rejected is proportional to its weight.
This is a Las Vegas algorithm, meaning it is expected to complete in a finite time, but with very low probability it can take arbitrarily long to complete. The number of iterations required to sample an element will be highest when every weight is exactly 1, in which case it follows a geometric distribution with parameter p = 1/2, so its expected value is 2, a constant which is independent of the number of elements in the data structure.
In general, if all weights are in an interval [a, b] for real numbers 0 < a <= b, then the expected number of iterations is at most b/a. This is always a constant, but it is potentially a large constant (i.e. it takes many iterations to select a single sample) if the lower bound a is small relative to b.
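Here is a minimal Go sketch of that structure (the type and method names are mine); it assumes the maximum possible weight b, here 2.0, is known up front:

package main

import (
    "fmt"
    "math/rand"
)

// weightedSet: a dynamic array of (value, weight) pairs with O(1) insertion,
// O(1) deletion by index, and O(1) expected-time weighted sampling by rejection.
// All weights are assumed to lie in (0, maxWeight].
type weightedSet struct {
    values    []string
    weights   []float64
    maxWeight float64
}

// insert appends the pair in O(1) amortised time.
func (s *weightedSet) insert(v string, w float64) {
    s.values = append(s.values, v)
    s.weights = append(s.weights, w)
}

// removeAt deletes the element at index i by swapping it with the last element.
func (s *weightedSet) removeAt(i int) {
    last := len(s.values) - 1
    s.values[i], s.weights[i] = s.values[last], s.weights[last]
    s.values = s.values[:last]
    s.weights = s.weights[:last]
}

// sampleIndex repeatedly picks a uniform index and accepts it with probability
// weight/maxWeight, so index i is returned with probability proportional to its weight.
func (s *weightedSet) sampleIndex() int {
    for {
        i := rand.Intn(len(s.values))
        if rand.Float64()*s.maxWeight < s.weights[i] {
            return i
        }
    }
}

func main() {
    s := &weightedSet{maxWeight: 2.0}
    s.insert("v1", 1.0)
    s.insert("v2", 1.5)
    s.insert("v3", 1.5)
    s.insert("v4", 2.0)
    s.insert("v5", 1.0)

    i := s.sampleIndex() // weighted random choice
    fmt.Println(s.values[i])
    s.removeAt(i) // weighted random deletion
    fmt.Println(s.values)
}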
This is not an answer per se, but just a tiny example to illustrate the algorithm devised by @kaya3.
| value | weight |
| v1 | 1.0 |
| v2 | 1.5 |
| v3 | 1.5 |
| v4 | 2.0 |
| v5 | 1.0 |
| total | 7.0 |
The total weight is 7.0. It's easy to maintain in O(1) by storing it somewhere and increasing/decreasing it at each insertion/removal.
The probability of each element is simply its weight divided by the total weight.
| value | proba | decimal
| v1 | 1.0/7 | 0.1428...
| v2 | 1.5/7 | 0.2142...
| v3 | 1.5/7 | 0.2142...
| v4 | 2.0/7 | 0.2857...
| v5 | 1.0/7 | 0.1428...
Using @kaya3's algorithm, if we draw a random index, then the probability of drawing each value is 1/size (1/5 here).
The chance of being rejected is 50% for v1, 25% for v2 and v3, and 0% for v4. So in the first round, the probabilities of being selected are:
| value | proba | decimal
| v1 | 2/20 | 0.10
| v2 | 3/20 | 0.15
| v3 | 3/20 | 0.15
| v4 | 4/20 | 0.20
| v5 | 2/20 | 0.10
| total | 14/20 | (70%)
Then the probability of needing a 2nd round is 30% (6/20), and the probability of drawing each index in that round is (6/20)/5 = 3/50:
| value | proba 2 rounds | decimal
| v1 | 2/20 + 6/200 | 0.130
| v2 | 3/20 + 9/200 | 0.195
| v3 | 3/20 + 9/200 | 0.195
| v4 | 4/20 + 12/200 | 0.260
| v5 | 2/20 + 6/200 | 0.130
| total | 14/20 + 42/200 | (91%)
The probability of reaching a 3rd round is 9%, that is 9/500 for each index:
| value | proba 3 rounds | decimal
| v1 | 2/20 + 6/200 + 18/2000 | 0.1390
| v2 | 3/20 + 9/200 + 27/2000 | 0.2085
| v3 | 3/20 + 9/200 + 27/2000 | 0.2085
| v4 | 4/20 + 12/200 + 36/2000 | 0.2780
| v5 | 2/20 + 6/200 + 18/2000 | 0.1390
| total | 14/20 + 42/200 + 126/2000 | (97.3%)
So we see that the series converges to the correct probabilities. The numerators are multiples of the weights, so it's clear that the relative weight of each element is respected.
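A small Go snippet, summing the same geometric series in closed form, confirms the limits for the weights in the table above:

package main

import "fmt"

func main() {
    weights := []float64{1.0, 1.5, 1.5, 2.0, 1.0} // v1..v5
    n := float64(len(weights))

    // Probability that one round ends with no selection: a uniform index is
    // drawn, then accepted with probability weight/2 (2 = maximum weight).
    reject := 0.0
    for _, w := range weights {
        reject += (1.0 / n) * (1.0 - w/2.0)
    }

    // Summing over rounds gives a geometric series, so the probability of
    // ending on element i is (w_i / (2n)) / (1 - reject) = w_i / total.
    for i, w := range weights {
        fmt.Printf("v%d: %.4f\n", i+1, (w/(2.0*n))/(1.0-reject))
    }
}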
This is a sketch of an answer.
With all weights equal to 1, we can maintain a random permutation of the inputs.
Each time an element is inserted, put it at the end of the array, then pick a random position i in the array, and swap the last element with the element at position i.
(It may well be a no-op if the random position turns out to be the last one.)
When deleting, just delete the last element.
Assuming we can use a dynamic array with O(1) (worst case or amortized) insertion and deletion, this does both insertion and deletion in O(1).
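A quick Go sketch of this unit-weight version (the names are mine, not from the answer):

package main

import (
    "fmt"
    "math/rand"
)

// unitSet keeps its elements in a uniformly random order, so removing the
// last element is the same as removing a uniformly random one.
type unitSet struct {
    items []int
}

func (s *unitSet) insert(x int) {
    s.items = append(s.items, x)
    i := rand.Intn(len(s.items)) // may be the last position, i.e. a no-op swap
    last := len(s.items) - 1
    s.items[i], s.items[last] = s.items[last], s.items[i]
}

func (s *unitSet) deleteRandom() int {
    last := len(s.items) - 1
    x := s.items[last]
    s.items = s.items[:last]
    return x
}

func main() {
    var s unitSet
    for x := 1; x <= 5; x++ {
        s.insert(x)
    }
    fmt.Println(s.deleteRandom(), s.items)
}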
With weights 1 and 2, a similar structure may be used.
Perhaps each element of weight 2 should be put twice instead of once.
Perhaps when an element of weight 2 is deleted, its other copy should also be deleted.
So we should in fact store indices instead of the elements, and another array, locations, which stores and tracks the two indices for each element. The swaps should keep this locations array up-to-date.
Deleting an arbitrary element can be done in O(1) similarly to inserting: swap with the last one, delete the last one.

Violation in deterministic skip-list topdown insertion

Suppose I'm given a skip-list, with an order of 3.
HEAD
level 3 |----------------------------------------------> X
        |                      |---|
level 2 | -------------------> |   | ----------------->  X
        |   |---|    |---|     |---|    |---|
level 1 | ->|   | -> |   | --> |   | -> |   | --------->  X
        |   |---|    |---|     |---|    |---|
        |   | 20|    |100|     |150|    |200|
        |   |---|    |---|     |---|    |---|
minlimit = ceil(order/2) - 1 = 1
maxlimit = order - 1 = 2
So essentially it's a 1-2 skip-list.
If I want to insert 50 using the top-down insertion algorithm, it'll raise the level of node 100 before dropping into the gap between HEAD and 150, and insert 50 right before 100. Now a violation occurs, as there are no nodes between 100 and 150, while there should be at least one node of height h-1 in that gap since minlimit = 1.
What am I doing wrong?
If I want to insert 50 by the top-down insertion algorithm, it'll raise the level of node 100 before dropping into gap between Head and 150 and insert 50 right before 100
Why are you doing this?
The first reference I found for deterministic 1-2 skip lists (the paper available as a PDF per your link) says:
As noted in [...], insertions in ... can be performed top-down, ... Adopting this approach, we insert an element in a 1-2-3 skip list by splitting any gap of size 3 into two gaps of size 1, when searching for the element to be inserted. We ensure in this way that the structure retains the gap invariant with or without the inserted element.
To be more precise, we start our search at the header, and at level 1 higher than the height of the skip list. When we find the gap that we are going to drop, we look at the level below and if we see 3 nodes of the same height in a row, we raise the middle one; after that we drop down a level. When we reach the bottom level, we simply insert a new node of height 1.
According to this, you should start at level 3, and look at level 2 below. There are not 3 nodes of the same height in a row here - only the single node 150 - and so you don't need to raise anything. Now, drop down to level 2 in the gap [HEAD,150].
Does that start to address your confusion?
If I want to insert 50 by the top-down insertion algorithm, it'll raise the level of node 100 before dropping into gap between Head and 150 and insert 50 right before 100.
It would not raise the level of node 100. Rather, it would raise the level of node 20. According to the algorithm, whenever you have reached the maxlimit of nodes in a gap, you raise the level of the ceil(maxlimit/2)-th node in that gap.
In this instance, when the level of node 20 is raised to level 2, there is no level-1 node between HEAD and node 20, but that does not cause any structural violation. The original definition of deterministic skip lists, as described in the paper by Munro et al., reads thus:
Assuming that in a skip list of n elements there exists a 0th and a (n+1)st node of height 1 higher than the height of the skip list, we require that between any two nodes of height h (h > 1) or higher, there exist either 1 or 2 nodes of height h – 1.

Faster computation to get multiples of number at different levels

Here is the scenario:
We have several items that are shipped to many stores. We want to be able to allocate a certain quantity of each item to a store based on need. Each of these stores is also associated to a specific warehouse.
The catch is that at the warehouse level, the total quantity of each item must be a multiple of a number (6 for example).
I have already calculated out the quantity needed by each store at store level, but they do not sum up to a multiple of 6 at the warehouse level.
My solution was this using Excel:
Using a SUMIFS formula to keep track of the sum of each item allocated at the warehouse level, then another MOD(6) formula that calculates what remains until a multiple of 6. My VBA code then loops through and subtracts 1 (if MOD <= 3) or adds 1 (if MOD > 3) to the store-level units needed until MOD = 0 for all rows.
Now this works for me, but is extremely slow even when I have just ~5000 rows.
I am looking for a faster solution, because every time I subtract/add to the units needed, the SUMIFS and MOD need to be calculated again.
EDIT: (trying to be clearer)
I have a template file that I paste my data into with the following setup:
+------+-------+-----------+----------+--------------+--------+
| Item | Store | Warehouse | StoreQty | WarehouseQty | Mod(6) |
+------+-------+-----------+----------+--------------+--------+
| 1 | 1 | 1 | 2 | 8 | 2 |
| 1 | 2 | 1 | 3 | 8 | 2 |
| 1 | 3 | 1 | 1 | 8 | 2 |
| 1 | 4 | 1 | 2 | 8 | 2 |
| 2 | 1 | 2 | 1 | 4 | 2 |
| 2 | 2 | 2 | 3 | 4 | 2 |
+------+-------+-----------+----------+--------------+--------+
Currently the WarehouseQty column is a SUMIFS formula summing up the StoreQty for each Item-Store combo associated with the Warehouse. So I guess the Warehouse/WarehouseQty columns are actually duplicated several times, once for every Item-Store combo that shows up. The WarehouseQty is the one that needs to be a multiple of 6.
Screen updating can be turned OFF to speed up lengthy computations like this:
Application.ScreenUpdating = FALSE
The opposite assignment turns screen updating back on again.
Put the data into an array first, rather than working on the cells, then put the data back after you have manipulated it - this will be much faster.
An example which uses your criteria:
Option Explicit

Sub test()
    Dim q() 'this is what will be used for the range
    Dim i As Long
    q = Range("C2:C41").Value 'put the data into the array - *ALWAYS* 2 dimensions, even if a single column
    For i = LBound(q) To UBound(q) 'use this, in case it's a dynamic array - 1 to 40 would have worked here
        Select Case q(i, 1) Mod 6 'calculate the remainder
            Case 0 To 3
                q(i, 1) = q(i, 1) - (q(i, 1) Mod 6) 'round down to a multiple of 6
            Case 4 To 5
                q(i, 1) = q(i, 1) - (q(i, 1) Mod 6) + 6 'round up for the larger remainders
        End Select
    Next i
    Range("D2:D41").Value = q 'drop the data back
End Sub
My guess is that you will find stopping the screen refresh helps quite a chunk, and that you therefore won't need any more suggestions.
Another option would be to reduce your adjustment to a quantity divisible by 6 down to a handful of If statements, depending on the value of Mod(6), rather than adding or subtracting one unit at a time.
You could also address how you sum up the number of a particular item across all stores: using a pivot table and reading the sum totals from there is a lot quicker than using SUMIFS in a macro.
Based on your modifications to the question:
You're correct that you could have huge amounts of replication doing the calculation row by row, as well as adjusting the quantity by a single unit at a time even though you know exactly how many units you need to add / remove from the mod(6) formula.
Could you not create a new sheet with all your possible combinations of product ID and store? You could then use SUMIFS() for each of these unique combinations and, in a final step, round up/down at the warehouse level.
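Outside Excel, the same idea (sum each item/warehouse group once, then round the group total to a multiple of 6 and spread the difference over its rows) looks roughly like this sketch in Go, using the sample rows from the question:

package main

import "fmt"

// row mirrors the sheet columns from the question.
type row struct {
    item, store, warehouse, storeQty int
}

func main() {
    rows := []row{
        {1, 1, 1, 2}, {1, 2, 1, 3}, {1, 3, 1, 1}, {1, 4, 1, 2},
        {2, 1, 2, 1}, {2, 2, 2, 3},
    }

    // Sum each item/warehouse group once (the SUMIFS step) and remember which
    // rows belong to it, so the adjustment can be spread over them afterwards.
    type key struct{ item, warehouse int }
    totals := map[key]int{}
    members := map[key][]int{}
    for i, r := range rows {
        k := key{r.item, r.warehouse}
        totals[k] += r.storeQty
        members[k] = append(members[k], i)
    }

    // Round each group total to a multiple of 6 (down if Mod <= 3, up otherwise)
    // and distribute the difference one unit at a time across the group's rows.
    for k, total := range totals {
        delta := -(total % 6)
        if total%6 > 3 {
            delta = 6 - total%6
        }
        for i := 0; delta != 0; i = (i + 1) % len(members[k]) {
            r := &rows[members[k][i]]
            switch {
            case delta > 0:
                r.storeQty++
                delta--
            case r.storeQty > 0: // never push a store below zero
                r.storeQty--
                delta++
            }
        }
    }

    fmt.Println(rows)
}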

golang array referencing eg. b[1:4] references elements 1,2,3

The golang blog states:
"A slice can also be formed by "slicing" an existing slice or array. Slicing is done by specifying a half-open range with two indices separated by a colon. For example, the expression b[1:4] creates a slice including elements 1 through 3 of b (the indices of the resulting slice will be 0 through 2)."
Can someone please explain to me the logic in the above? I.e., why doesn't b[1:4] reference elements 1 through 4? Is this consistent with other array referencing?
Indexes point to the "start" of the element. This is shared by all languages using zero-based indexing:
        | 0 | first | 1 | second | 2 | third | 3 | fourth | 4 | fifth | 5 |
[0]   =   ^
[0:1] =   ^---------->^
[1:4] =               ^----------------------->^
[0:5] =   ^--------------------------------------------------------------->^
It's also common to support negative indexing, although Go doesn't allow this:
        |-6 |       |-5 |        |-4 |       |-3 |        |-2 |       |-1 |
        | 0 | first | 1 | second | 2 | third | 3 | fourth | 4 | fifth | 5 |
The reason is given in the Go Language Specification section on Slices.
For a string, array, or slice a, the primary expression
a[low : high]
constructs a substring or slice. The index expressions low and high select which elements appear in the result. The result has indexes starting at 0 and length equal to high - low.
For convenience, any of the index expressions may be omitted. A missing low index defaults to zero; a missing high index defaults to the length of the sliced operand.
It's easy and efficient to calculate the length of the slice as high - low.
Half-open intervals make sense for many reasons, when you get down to it. For instance, with a half-open interval like this, the number of elements is:
n = end - start
which is a pretty nice and easy formula. For a closed interval, it would be:
n = (end - start) + 1
which is (not a lot, but still) more complicated.
It also means that, for e.g. a string s, the entire string is s[0:len(s)], which also seems intuitive. If the interval were closed, to get the entire string you would need s[0:len(s)-1].
Go uses half-open intervals for slices like many other languages. In a more mathematical notation, the slice b[1:4] is the interval [1,4) which excludes the upper endpoint.
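A small runnable sketch of the quoted example:

package main

import "fmt"

func main() {
    b := []string{"first", "second", "third", "fourth", "fifth"}

    s := b[1:4] // half-open: elements at indices 1, 2 and 3 of b
    fmt.Println(s)      // [second third fourth]
    fmt.Println(len(s)) // 3, i.e. high - low = 4 - 1

    fmt.Println(b[0:len(b)]) // the whole slice; same as b[:]
}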
