Help with B-tree homework

I need to do a preorder traversal of a B-tree and, among other things, print the following information for each page (which is the same thing as a node):
The B-Tree page number
The value of each B-Tree page pointer (e.g., address, byte offset, RRN).
My questions are:
1. How do you figure out the byte offset? What is it offset from?
2. Isn't the RRN the same as the page number?
Note: a B-tree is NOT a binary tree. B-trees can have multiple keys in each node, and a node with n keys has n+1 child pointers.

The byte offset is probably the offset of the record from the beginning of the page.
I think the RRN is the relative record number. So if a record is the 5th record in the page, its RRN would be 5.
You need to know the page layout to know how to interpret the information in a page/node. Many solutions are possible.
What code do you have to write, and what code is given to you? I need to know more about exactly what the assignment is asking you to do before I can be of any more help.
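In the meantime, here's a rough sketch of a preorder traversal under one possible layout (fixed-size pages addressed by RRN). The Page class, the field names, and the page size below are assumptions for illustration, not anything from your assignment:

PAGE_SIZE = 512  # assumed fixed page size in bytes

class Page:
    def __init__(self, rrn, keys, children=None):
        self.rrn = rrn                  # relative record number of this page in the index file
        self.keys = keys                # the keys stored in this page
        self.children = children or []  # child pages; len(keys) + 1 of them, or [] for a leaf

def preorder(page):
    if page is None:
        return
    byte_offset = page.rrn * PAGE_SIZE  # with fixed-size pages, offset from the start of the file
    print("page/RRN:", page.rrn, "byte offset:", byte_offset, "keys:", page.keys)
    for child in page.children:
        preorder(child)

# tiny example: a root page with two leaf pages
root = Page(0, [20], [Page(1, [5, 10]), Page(2, [30, 40])])
preorder(root)

Under that layout the RRN and the page number are effectively the same thing, and a page's byte offset is just RRN * page size from the start of the file; your assignment's file format may define these differently.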

Related

How to implement dynamic indexes?

I know the title may be a little confusing; however, my actual question is fairly basic, I think.
I'm working on a brand new LRU implementation, for which I use an Index Table that maps the name of an incoming packet to the index where the packet's content is stored in the CS (content store).
Each incoming packet is stored in the CS and can be addressed through the Index Table.
Now suppose a new packet arrives. Under LRU, its index must be set to the top of the CS (zero), and as a result all the other indexes need to be updated: each of them has to be incremented.
One obvious solution is to loop over all entries in the Index Table and increment them.
Is there a better solution, or a data structure that is typically used for such a problem?
I don't see how you are establishing the order of your cache in the description. But to answer your question, it's possible to reduce the LRU store method to O(1) time complexity.
The classical way to do it is to have these two data structures:
Doubly linked list: maintains the recency order of the cache. Each node stores a data element (this plays the role of your content store).
Hash map: associates each key with a pointer to its node in the linked list (this plays the role of your index table).
When you access data that is already stored in your cache, it must move to the head of the list, so you unlink the corresponding node from the linked list (in O(1) time, because you have access to its previous and next nodes) and re-insert it at the head.
For new data it is even simpler: just insert a node at the head of the list and store the key and a pointer to that node in the hash map.
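A minimal sketch of those two structures working together (the class and method names are my own, not from your code):

class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.prev = self.next = None

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity      # assumed to be >= 1
        self.map = {}                 # key -> Node (your index table)
        self.head = Node(None, None)  # dummy head: most recently used side
        self.tail = Node(None, None)  # dummy tail: least recently used side
        self.head.next = self.tail
        self.tail.prev = self.head

    def _unlink(self, node):
        node.prev.next = node.next
        node.next.prev = node.prev

    def _push_front(self, node):
        node.next = self.head.next
        node.prev = self.head
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        node = self.map.get(key)
        if node is None:
            return None
        self._unlink(node)            # move to the front in O(1)
        self._push_front(node)
        return node.value

    def put(self, key, value):
        if key in self.map:
            node = self.map[key]
            node.value = value
            self._unlink(node)
            self._push_front(node)
            return
        if len(self.map) >= self.capacity:
            lru = self.tail.prev      # evict the least recently used entry
            self._unlink(lru)
            del self.map[lru.key]
        node = Node(key, value)
        self.map[key] = node
        self._push_front(node)

No index in the table ever has to be shifted or incremented; recency is encoded entirely by the node's position in the list.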

How does a leaf node split in physical space in InnoDB?

If the keys are inserted in ascending order, according to normal B+-tree characteristics, when the leaf page is full, it will split and there will be a new page introduced to the B+-tree.
For instance, suppose a leaf page can hold up to 3 keys:
(page0) |1|2|3|
Then key 4 is inserted and the page splits:
        (page0) |1|3|*|
        /             \
(page1) |1|2|*|    (page2) |3|4|*|
After this, later keys will be inserted into page2 until the next split since they are in ascending order. All previous pages will remain half full.
In my example, I guess this will cause space to be wasted. However, for a real database that seems unreasonable, which really confuses me. I've read Jeremy Cole's B+Tree index structures in InnoDB, but I have probably misunderstood something.
Without additional optimizations, you're absolutely correct that as an index page filled it would be split in half and then remain half-filled forever. However, InnoDB optimizes index fill based on its perception of the insertion order. That is, if it detects that insertion is being done in-order (ascending or descending) it will, instead of splitting a page in half, just create a new empty page for an insertion at the "edge" of the page.
There is some information about this in the MySQL manual section The Physical Structure of an InnoDB Index. Additionally I illustrate an example of this behavior in my post Visualizing the impact of ordered vs. random index insertion in InnoDB.
In The physical structure of InnoDB index pages I describe the Last Insert Position, Page Direction, and Number of Inserts in Page Direction fields of each index page. This is how the tracking for ascending vs. descending order is done (as left vs. right, though). With each insert, the last inserted record is compared to the currently inserted one, and if the insert is in the same "direction", the counter is incremented. This counter is then checked to determine the page split behavior; whether to split in half or create a new, empty page.
In practice, this optimization is not perfect, and there's a big difference between insertions being mostly in-order, and exactly in-order. If inserts are only mostly in-order it can mean that the page direction may never get appropriately set, and pages will end up half-filled (as you described).
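Very roughly, the decision logic looks something like the sketch below. This is my own simplification for illustration only, not InnoDB's code; the threshold value and all names are made up:

class PageState:
    """Per-page tracking, loosely modeled on the Last Insert Position,
    Page Direction, and Number of Inserts in Page Direction fields."""
    def __init__(self):
        self.last_insert_key = None
        self.direction = None      # "left" or "right"
        self.n_direction = 0       # consecutive inserts in the same direction

DIRECTION_THRESHOLD = 3            # made-up threshold, purely illustrative

def record_insert(page, key):
    if page.last_insert_key is not None:
        direction = "right" if key > page.last_insert_key else "left"
        if direction == page.direction:
            page.n_direction += 1
        else:
            page.direction = direction
            page.n_direction = 0
    page.last_insert_key = key

def choose_split(page):
    """Called when the page is full and another key must go in."""
    if page.n_direction >= DIRECTION_THRESHOLD:
        return "create a new, empty page at the edge"   # sequential-insert case
    return "split the page in half"                     # random-insert case

page = PageState()
for k in [10, 20, 30, 40, 50]:     # ascending inserts
    record_insert(page, k)
print(choose_split(page))           # -> create a new, empty page at the edge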

Suitable data structure for finding a person's phone number, given their name?

Suppose you want to write a program that implements a simple phone book. Given a particular name, you want to be able to retrieve that person's phone number as quickly as possible. What data structure would you use to store the phone book, and why?
The text below answers your question.
In computer science, a hash table or hash map is a data structure that
uses a hash function to map identifying values, known as keys (e.g., a
person's name), to their associated values (e.g., their telephone
number). Thus, a hash table implements an associative array. The hash
function is used to transform the key into the index (the hash) of an
array element (the slot or bucket) where the corresponding value is to
be sought.
The text is from the Wikipedia article on hash tables.
There are further topics to be aware of, like collisions and hash functions; check the Wikipedia page for details.
I respect & love hash tables :) but even a balanced binary tree would be fine for your phone book application, giving you logarithmic complexity in the worst case and sparing you the need for good hash functions, collision handling, etc., which matters more for huge amounts of data.
When I talk about huge data, I mean something related to storage. Every time you fill all of the buckets in a hash table you will need to allocate new storage and re-hash everything. This can be avoided if you know the size of the data ahead of time. Balanced trees won't run into these problems. The domain needs to be considered too when designing data structures; for example, on small devices, storage matters a lot.
I was wondering why tries didn't come up in any of the answers.
A trie is well suited to phone-book-style data.
It also saves space compared to a hash table at (almost) the same retrieval cost, assuming a constant-size alphabet and constant-length names.
Tries also facilitate the prefix matches that are sometimes required while searching.
A dictionary is both dynamic and fast.
You want a dictionary, where you use the name as the key, and the number as the data stored. Check this out: http://en.wikipedia.org/wiki/Dictionary_%28data_structure%29
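For example, in Python (the names and numbers are just made-up sample data; Python's dict happens to be a hash-based dictionary):

phone_book = {}
phone_book["Alice"] = "555-0100"    # insert: name -> number
phone_book["Bob"] = "555-0199"
print(phone_book.get("Alice"))      # "555-0100"; average-case O(1) lookup
print(phone_book.get("Carol"))      # None if the name isn't in the book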
Why not use a singly linked list? Each node will have the name, number and link information.
One drawback is that your search might take some time since you'll have to traverse the entire list from link to link. You might order the list at the time of node insertion itself!
PS: To make the search a tad bit faster, maintain a link to the middle of the list. Search can continue to the left or right of the list based on the value of the "name" field at this node. Note that this requires a doubly linked list.

Efficient mapping from 2^24 values to a 2^7 index

I have a data structure that stores amongst others a 24-bit wide value. I have a lot of these objects.
To minimize storage cost, I calculated the 2^7 most important values out of the 2^24 possible values and stored them in a static array. Thus I only have to save a 7-bit index to that array in my data structure.
The problem is: I get these 24-bit values and I have to convert them to my 7-bit index on the fly (no preprocessing possible). The computation is basically a search for which one of the 2^7 values fits best. Obviously, this takes some time for a large number of objects.
An obvious solution would be to create a simple mapping array of bytes with the length 2^24. But this would take 16 MB of RAM. Too much.
One observation of the 16 MB array: On average 31 consecutive values are the same. Unfortunately there are also a number of consecutive values that are different.
How would you implement this conversion from a 24-bit value to a 7-bit index saving as much CPU and memory as possible?
Hard to say without knowing what the definition is of "best fit". Perhaps a kd-tree would allow a suitable search based on proximity by some metric or other, so that you quickly rule out most candidates, and only have to actually test a few of the 2^7 to see which is best?
This sounds similar to the problem that an image processor has when reducing to a smaller colour palette. I don't actually know what algorithms/structures are used for that, but I'm sure they're look-up-able, and might help.
As an idea...
Up the index table to 8 bits, then XOR all 3 bytes of the 24-bit word into it.
Then your table would consist of this 8-bit hash value, plus the index back to the original 24-bit value.
Since your data is RGB-like, a more sophisticated hashing method may be needed.
bit24var & 0x00ff gives you the rightmost byte.
(bit24var >> 8) & 0x00ff gives you the one beside it.
(bit24var >> 16) & 0x00ff gives you the one beside that.
Yes, you are thinking correctly. It is quite likely that one or more of the 24-bit values will hash to the same index, due to the pigeonhole principle.
One method of resolving a hash clash is to use some sort of chaining.
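A rough sketch of that idea (the function and variable names are mine; it handles exact matches against the 2^7 important values, with chaining for clashes):

def hash8(v24):
    # XOR the three bytes of the 24-bit value into one 8-bit hash
    return (v24 ^ (v24 >> 8) ^ (v24 >> 16)) & 0xFF

def build_table(important_values):
    # 256 chains of (value24, index7) pairs, built once from the 2^7 important values
    buckets = [[] for _ in range(256)]
    for index7, v24 in enumerate(important_values):
        buckets[hash8(v24)].append((v24, index7))
    return buckets

def lookup(buckets, v24):
    for value, index7 in buckets[hash8(v24)]:   # resolve hash clashes by chaining
        if value == v24:
            return index7
    return None   # not an exact hit; fall back to the best-fit search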
Another idea would be to put your important values in a different array, then simply search it first. If you don't find an acceptable answer there, then you can, shudder, search the larger array.
How many of the 2^24 values do you actually have? Could you sort them and count them by counting runs of consecutive values?
Since you already know which of the 2^24 values you need to keep (i.e. the 2^7 values you have determined to be important), we can simply filter the incoming data and assign a value, starting from 0 and going up to 2^7-1, to these values as we encounter them. Of course, we need some way of keeping track of which of the important values we have already seen and assigned a label in [0, 2^7). For that we can use some sort of tree or hash-table-based dictionary implementation (e.g. std::map in C++, HashMap or TreeMap in Java, or dict in Python).
The code might look something like this (I'm using a much smaller range of values):
import random

def make_mapping(data, important):
    mapping = dict()   # dictionary to hold the final mapping
    next_index = 0     # the next free label that can be assigned to an incoming value
    for elem in data:
        if elem in important:        # check that the element is important
            if elem not in mapping:  # check that this element hasn't been assigned a label yet
                mapping[elem] = next_index
                next_index += 1      # the next new important value will get the next label
    return mapping

if __name__ == '__main__':
    important_values = [1, 5, 200000, 6, 24, 33]
    data = list(range(0, 300000))
    random.shuffle(data)
    answer = make_mapping(data, important_values)
    print(answer)
You can make the search much faster by using a hash- or tree-based set data structure for the set of important values. That would make the entire procedure O(n*log(k)) (or O(n) if it is a hash table), where n is the size of the input and k is the number of important values.
Another idea is to represent the 24-bit values in a bitmap. An unsigned char can hold 8 bits, so one would need 2^21 array elements to cover all 2^24 possible values; that's 2,097,152 bytes, about 2 MB. If the corresponding bit is set, then you know that that specific 24-bit value is present in the array and needs to be checked.
One would need an iterator, to walk through the array and find the next set bit. Some machines actually provide a "find first bit" operation in their instruction set.
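A minimal sketch of the bitmap idea (the names are mine):

bitmap = bytearray(1 << 21)             # 2^24 bits packed 8 per byte = 2 MB

def set_bit(v24):
    bitmap[v24 >> 3] |= 1 << (v24 & 7)

def test_bit(v24):
    return (bitmap[v24 >> 3] >> (v24 & 7)) & 1

set_bit(0x123456)
print(test_bit(0x123456))   # 1
print(test_bit(0x123457))   # 0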
Good luck on your quest.
Let us know how things turn out.
Evil.

Compression and Lookup of huge list of words

I have a huge list of multi-byte sequences (lets call them words) that I need to store in a file and that I need to be able to lookup quickly. Huge means: About 2 million of those, each 10-20 bytes in length.
Furthermore, each word shall have a tag value associated with it, so that I can use it to reference more (external) data for each item (hence, a spellchecker's dictionary won't work here, as that only provides a hit test).
If this were just in memory, and if memory was plenty, I could simply store all words in a hashed map (aka dictionary, aka key-value pairs), or in a sorted list for a binary search.
However, I'd like to compress the data highly, and would also prefer not to have to read the data into memory but rather search inside the file.
As the words are mostly based on the English language, there's a certain likelihood that certain "syllables" occur in the words more often than others, which is probably helpful for an efficient algorithm.
Can someone point me to an efficient technique or algorithm for this?
Or even code examples?
Update
I figure that a DAWG, or anything similar that routes paths into common suffixes, won't work for me, because then I won't be able to tag each complete word path with an individual value. If I were to detect common suffixes, I'd have to put them into their own dictionary (lookup table) so that a trie node could reference them, yet each node would keep its own ending node for storing that path's tag value.
In fact, that's probably the way to go:
Instead of building the tree nodes for single chars only, I could try to find often-used character sequences, and make a node for those as well. That way, single nodes can cover multiple chars, maybe leading to better compression.
Now, if that's viable, how would I actually find often-used sub-sequences in all my phrases?
With about 2 million phrases consisting of usually 1-3 words, it'll be tough to run all permutations of all possible substrings...
There exists a data structure called a trie. I believe that this data structure is perfectly suited to your requirements. Basically, a trie is a tree where each node is a letter and each node has child nodes. In a letter-based trie, there would be up to 26 children per node.
Depending on what language you are using, this may be easier or better to store as a variable-length list during creation.
This structure gives:
a) Fast searching. Following a word of length n, you can find the string in n links in the tree.
b) Compression. Common prefixes are stored.
Example: The words BANANA and BANAL both share the B, A, N, A nodes, and then that last (A) node will have 2 children, L and N. Your nodes can also store other information about the word (for example, a tag value, as in the sketch below).
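A minimal sketch in Python, including a per-word tag value stored at the terminal node (the class and attribute names are my own; real implementations pack this far more compactly):

class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.tag = None      # tag value if a complete word ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, tag):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.tag = tag       # the word's individual tag lives at its final node

    def lookup(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.tag      # None if word is only a prefix of a stored word

t = Trie()
t.insert("banana", 1)
t.insert("banal", 2)
print(t.lookup("banana"), t.lookup("banal"), t.lookup("ban"))   # 1 2 None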
(http://en.wikipedia.org/wiki/Trie)
Andrew JS
I would recommend using a Trie or a DAWG (directed acyclic word graph). There is a great lecture from Stanford on doing exactly what you want here: http://academicearth.org/lectures/lexicon-case-study
Have a look at the paper "How to squeeze a lexicon". It explains how to build a minimized finite-state automaton (which is just another name for a DAWG) with a one-to-one mapping of words to numbers and vice versa. Exactly what you need.
You should get familiar with indexed files.
Have you tried just using a hash map? Thing is, on a modern OS architecture, the OS will use virtual memory to swap out unused memory segments to disk anyway. So it may turn out that just loading it all into a hash map is actually efficient.
And as jkff points out, your list would only be about 40 MB, which is not all that much.
