What indicates if a string exists or not in a trie - data-structures

Every video or explaination in books are not clear or agree if the ending point of a word/string, has a special key that is returned or a boolean true is returned.
e.g. We know that the trie has the word "paper" and now I want to find the word "pap", what indicates that this is not a word, and how can this become a found word in my trie.
Edit: If I am at the end of a branch, will the array the belongs to the last node, have nullptr's on each index?

Either is fine. If the trie is just used to test for membership, then all a node needs is a boolean that indicates whether or not it's a terminal node. If the trie is used to associate values with keys, then the terminal nodes will contain those values (or pointers to them). If there's a distinguished nil/null value, then there's no need for a separate boolean — a node is a terminal iff it contains a non-nil value. But storing the boolean is also perfectly acceptable. None of this makes any real difference to the nature of the data structure.

Related

Will key in the index be removed after deletion in B Plus tree?

I'm a little confused with the deletion in B+ tree. I searched a lot in Google and find that there are two implementation when the key you want to delete appears in the index:
Delete the key in the index
Keep the key in the index
Algorithm from https://www.javatpoint.com/b-plus-tree-deletion uses the first way.
Algorithm from https://www.cs.princeton.edu/courses/archive/fall08/cos597A/Notes/BplusInsertDelete.pdf uses the second way.
So I really want to know which one is right.
But I'm more inclined to take that as an undefined behavior. At this point, could someone help me figure out the advantage and disadvantage between them? And how to choose between them?
Thanks in advance.
Both methods are correct.
The difference that you highlight is not so much in deleting/not-deleting internal keys, but in updating/not-updating them.
Obviously, when you delete a value (i.e. a key in a leaf node), the b-plus-tree property is not violated: all child values are still within the range dictated by the parent information. You can never break this range-rule by merely removing a value from a leaf. This rule is also still valid when you update the internal key(s) in the path to that leaf (according to method 1), which is only necessary when the deleted value was the left-most one in its leaf.
Note that the two methods may produce quite different trees after a long sequence of the same operations (insert, delete).
But on average the second method will have slightly less work to do. This difference is not significant though.

Storing arrangement of same word efficiently

I had a question related to Dictionary storage.
I was reading about Trie Data-structures and so far I have read that it works pretty well as prefix tree. But, I came to Trie - DS in efforts to see if it can reduce the storage of arrangement of letters formed through same word efficiently.
For ex : words "ANT", "TAN" and NAT have same letters but according to Trie it goes on to create two separate paths for these words. I can understand that Trie is meant for prefix storage and reduce redundancy. But can anyone help me in reducing the redundancy here.
One way I was thinking was to change the behavior of Trie as to each node has a status of 'word complete'; In addition if I put 'word start' status too I can make this work as below :
A
N - A - T
T - A - N
Now, every time I can check if the word is starting form here and go till the end.
Does this makes sense ? and if this is feasible ?
Or is their any better method to do this ?
Thanks
You can use 2 tries and also store the reverse trie. Then you can use a wildcard expansion everywhere in the search for example you can split the search word into 2 half and search for one half by the prefix and the other half by its suffix:http://phpir.com/tries-and-wildcards/. When you concatenate the 2 you can efficient search with a wildcard.
If you add a status field to each node you will increase the memory cost of your tree (assuming 8-bit chars) by a possibly not insignificant portion.
I understand that you want to reduce the number of letters in the DS, but you have to consider what happens if some contents are subsets of other contents, e.g. how ANTAN would be represented. Think about the minimal number of chars (128) as nodes of a fully connected graph. Obviously all words are stored in this graph, however it is not suitable to store any specific words. There is no way of telling where words end. The information stored in a trie is not just letters, but complete and properly terminated words.
If you add a marker as you suggest, how will you be able to encode this: SUPERCHARGED, SUPER, PERCH. You would set word_starts at S and P and word_ends at R and H. How would you know that SUPERCH and PER are not contained? You could instead use a non-zero label and assign number-pairs to the beginning and end of words: S:1 P:2 R:1 H:2. To make sure that start and end can occur at the same letter, you would have to use specific bits as labels.
You could then use NATANT as minimal flat representation and N:001 A:000 T:011 A:100 N:010 T: 100. This requires #words bit for the marker in the worst case: A, AA, AAA.... If you would store that in a tree however, you would have to look for the other marker, which is not an operation supported by trees. So I see no good way of using a marker.
From an information theoretical point I think the critical issue here is to properly encode the length, ordering and contents of a word in a unique way for each possible combination of these.
I originally meant to just comment, but it got a bit lengthy. I am not sure if this answers your question, but I hope it helps.
Are you hoping that any search for "ant" also brings up "tan" and "nat"?
If so, then use a TrieMap, always sort keys before reads/writes and map to a container of all words in that "anagram class."
If you're just looking for ideas to reduce the space overhead of using a Trie, then look no further. I've found burst trie's to be very space efficient. I wrote my own burst trie in Scala that also re-uses some ideas that I found in GWT's trie implementation.

Hash Table and Substring Matching

I have hundreds of keys for example like:
redapple
maninred
foraman
blueapple
i have data related to these keys, data is a string and has related key at the end.
redapple: the-tree-has-redapple
maninred: she-saw-the-maninred
foraman: they-bought-the-present-foraman
blueapple: it-was-surprising-but-it-was-a-blueapple
i am expected to use hash table and hash function to record the data according to keys and i am expected to be able to retieve data from table.
i know to use hash function and hash table, there is no problem here.
But;
i am expected to give the program a string which takes place as a substring and retrieve the data for the matching keys.
For example:
i must give "red" and must be able to get
redapple: the-tree-has-redapple
maninred: she-saw-the-maninred
as output.
or
i must give "apple" and must be able to get
redapple: the-tree-has-redapple
blueapple: it-was-surprising-but-it-was-a-blueapple
as output.
i only can think to search all keys if they has a matching substring, is there some other solution? If i search all the key strings for every query, use of hashing is unneeded, meaningless, is it?
But, searching all keys for substring is O(N), i am expected to solve the problem with O(1).
With hashing i can hash a key e.g. "redapple" to e.g. 943, and "maninred" to e.g. 332.
And query man give the string "red" how can i found out from 943 and 332 that the keys has "red" substring? It is out of my cs thinking skills.
Thanks for any advise, idea.
Possible you should use the invert index for n-gramm, the same approach is used for spell correction. For word redapple you will have following set of 3-gramms red, eda, dap, app, ppl, ple. For each n-gramm you will have a list of string in which contains it. For example for red it will be
red -> maninred, redapple
words in this list must be ordered. When you want to find the all string that contains a a give substring, you dived the substring on n-gramm and intercept the list of words for n-gramm.
This alogriphm is not O(n), but it practice it has enough speed.
It cannot be nicely done in a hash table. Given a a substring - you cannot predict the hashed result of the entire string1
A reasonable alternative is using a suffix tree. Each terminal in the suffix tree will hold list of references of the complete strings, this suffix is related to.
Given a substring t, if it is indeed a substring of some s in your collection, then there is a suffix x of s - such that t is a prefix of x. By traversing the suffix tree while reading t, and find all the terminals reachable from the the node you reached from there. These terminals contain all the needed strings.
(1) assuming reasonable hash function, if hashCode() == 0 for each element, you can obviously predict the hash value.
I have researched this problem recently and i'm sure that this can not be done. I hope hash table will help me improve speed of searching like you but it makes me disapointed.

Compression and Lookup of huge list of words

I have a huge list of multi-byte sequences (lets call them words) that I need to store in a file and that I need to be able to lookup quickly. Huge means: About 2 million of those, each 10-20 bytes in length.
Furthermore, each word shall have a tag value associated with it, so that I can use that to reference more (external) data for each item (hence, a spellchecker's dictionary is not working here as that only provides a hit-test).
If this were just in memory, and if memory was plenty, I could simply store all words in a hashed map (aka dictionary, aka key-value pairs), or in a sorted list for a binary search.
However, I'd like to compress the data highly, and would also prefer not to have to read the data into memory but rather search inside the file.
As the words are mostly based on the english language, there's a certain likelyness that certain "sillables" in the words occur more often than others - which is probably helpful for an efficient algorithm.
Can someone point me to an efficient technique or algorithm for this?
Or even code examples?
Update
I figure that DAWG or anything similar routes the path into common suffixes this way won't work for me, because then I won't be able to tag each complete word path with an individual value. If I were to detect common suffixes, I'd have to put them into their own dictionary (lookup table) so that a trie node could reference them, yet the node would keep its own ending node for storing that path's tag value.
In fact, that's probably the way to go:
Instead of building the tree nodes for single chars only, I could try to find often-used character sequences, and make a node for those as well. That way, single nodes can cover multiple chars, maybe leading to better compression.
Now, if that's viable, how would I actually find often-used sub-sequences in all my phrases?
With about 2 million phrases consisting of usually 1-3 words, it'll be tough to run all permutations of all possible substrings...
There exists a data structure called a trie. I believe that this data structure is perfectly suited for your requirements. Basically a trie is a tree where each node is a letter and each node has child nodes. In an letter based trie, there would be 26 children per node.
Depending on what language you are using this may be easier or better to store as a variable length list while creation.
This structure gives:
a) Fast searching. Following a word of length n, you can find the string in n links in the tree.
b) Compression. Common prefixes are stored.
Example: The word BANANA and BANAL both will have B,A,N,A nodes equal and then the last (A) node will have 2 children, L and N. Your Nodes can also stored other information about the word.
(http://en.wikipedia.org/wiki/Trie)
Andrew JS
I would recommend using a Trie or a DAWG (directed acyclic word graph). There is a great lecture from Stanford on doing exactly what you want here: http://academicearth.org/lectures/lexicon-case-study
Have a look at the paper "How to sqeeze a lexicon". It explains how to build a minimized finite state automaton (which is just another name for a DAWG) with a one-to-one mapping of words to numbers and vice versa. Exactly what you need.
You should get familiar with Indexed file.
Have you tried just using a hash map? Thing is, on a modern OS architecture, the OS will use virtual memory to swap out unused memory segments to disk anyway. So it may turn out that just loading it all into a hash map is actually efficient.
And as jkff points out, your list would only be about 40 MB, which is not all that much.

What is a quad linked list?

I'm currently working on implementing a list-type structure at work, and I need it to be crazy effective. In my search for effective data structures I stumbled across a patent for a quad liked list, and this sparked my interest enough to make me forget about my current task and start investigating the quad list instead. Unfortunately, internet was very secretive about the whole thing, and google didn't produce much in terms of usable results. The only explanation I got was the patent description that stated:
A quad linked data structure that provides bidirectional search capability for multiple related fields within a single record. The data base is searched by providing sets of pointers at intervals of N data entries to accommodate a binary search of the pointers followed by a linear search of the resultant range to locate an item of interest and its related field.
This, unfortunately, just makes me more puzzled, as I cannot wrap my head around the non-layman explanation. So therefore I turn to you all in hope that you can explain to me what this quad linked history really is, as I know not knowing will drive me up and over the walls pretty quickly.
Do you know what a quad linked list is?
I can't be sure, but it sounds a bit like a skip list.
Even if that's not what it is, you might find skip lists handy. (To the best of my knowledge they are unidirectional, however.)
I've not come across the term formally before, but from the patent description, I can make an educated guess.
A linked list is one where each node has a link to the next...
a -->-- b -->-- c -->-- d -->-- null
A doubly linked list means each node holds a link to its predecessor as well.
--<-- --<-- --<--
| | | |
a -->-- b -->-- c -->-- d -->-- null
Let's assume the list is sorted. If I want to perform binary search, I'd normally go half way down the list to find the middle node, then go into the appropriate interval and repeat. However, linked list traversal is always O(n) - I have to follow all the links. From the description, I think they're just adding additional links from a node to "skip" a fixed number of nodes ahead in the list. Something like...
--<-- --<-- --<--
| | | |
a -->-- b -->-- c -->-- d -->-- null
| |
|----------->-----------|
-----------<-----------
Now I can traverse the list more rapidly, especially if I chose the extra link targets carefully (i.e. ensure they always go back/forward half of the offset of the item they point from in the list length). I then find the rough interval I want with these links, and use the normal links to find the item.
This is a good example of why I hate software patents. It's eminently obvious stuff, wrapped in florid prose to confuse people.
I don't know if this is exactly a "quad-linked list", but it sounds like something like this:
struct Person {
// Normal doubly-linked list.
Customer *nextCustomer;
Customer *prevCustomer;
std::string firstName;
Customer *nextByFirstName;
Customer *prevByFirstName;
std::string lastName;
Customer *nextByLastName;
Customer *prevByLastName;
};
That is: you maintain several orderings through your collection. You can easily navigate in firstName order, or in lastName order. It's expensive to keep the links up to date, but it makes navigation quite quick.
Of course, this could be something completely different.
My reading of it is that a quad linked list is one which can be traversed (backwards or forwards) in O(n) in two different ways, ie sorted according to FieldX or FieldY:
(a) generating first and second sets
of link pointers, wherein the first
set of link pointers points to
successor elements of the set of
related records when the records are
ordered with respect to the fixed ID
field, and the second set of link
pointers points to predecessor
elements of the set of related records
when the records are ordered with
respect to the fixed ID field;
(b) generating third and fourth sets
of link pointers, wherein the third
set of link pointers points to
successor elements of the set of
related records when the records are
ordered with respect to the variable
ID field, and the fourth set of link
pointers points to predecessor
elements of the set of related records
when the records are ordered with
respect to the variable ID field;
So if you had a quad linked list of employees you could store it sorted by name AND sorted by age, and enumerate either in O(n).
One source of the patent is this. There are, it appears, two claims, the second of which is more nearly relevant:
A computer implemented method for organizing and searching a set of related records, wherein each record includes:
i) a fixed ID field; and
ii) a variable ID field; the method comprising the steps of:
(a) generating first and second sets of link pointers, wherein the first set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the fixed ID field, and the second set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the fixed ID field;
(b) generating third and fourth sets of link pointers, wherein the third set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the variable ID field, and the fourth set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the variable ID field;
(c) generating first and second sets of field pointers, wherein the first set of field pointers includes an ordered set of pointers that point to every Nth fixed ID field when the records are ordered with respect to the fixed ID field, and the second set of pointers includes an ordered set of pointers that point to every Nth variable ID field when the records are ordered with respect to the variable ID field;
(d) when searching for a particular record by reference to its fixed ID field, conducting a binary search of the first set of field pointers to determine an initial pointer and a final pointer defining a range within which the particular record is located;
(e) examining by linear scarch, the fixed ID fields within the range determined in step (d) to locate the particular record;
(f) when searching for a particular record by reference to its variable ID field, conducting a binary search of the second set of field pointers to determine an initial pointer and a final pointer defining a range within which the particular record is located;
(g) examining, by linear search, the variable ID fields within the range determined in step (f) to locate the particular record.
When you work through the patent gobbledegook, I think it means approximately the same as having two skip lists (one for forward search, one for backwards search) on each of two keys (hence 4 lists in total, and the name 'quad-list'). I don't think it is a very good patent - it looks to be an obvious application of skip lists to a data set where you have two keys to search on.
The description isn't particularly good, but as best I can gather, it sounds like a less-efficient skip list.

Resources