I'm reading the post at https://linux.thai.net/~thep/datrie/ and, at the beginning of the section Double-Array Trie, it says:
The tripple-array structure for implementing trie appears to be well defined,
but is still not practical to keep in a single file.
The next/check pool may be able to keep in a single array of integer couples,
but the base array does not grow in parallel to the pool,
and is therefore usually split.
What does "the base array is usually split" mean, and why?
I'd like to understand what the benefits are of using a double-array trie instead of a triple-array trie.
I can partially answer your question.
A triple-array trie has three arrays: base, next and check. The base array holds the distinct states of the trie. The next array stores states many times over: once where a state is the start state, and again each time another state transitions to it. The check array records which state owns each transition.
One way to model a triple-array trie is a structure with the three arrays base, next and check. This is the basic implementation.
trie {
    base: array<S>;
    next: array<S>;
    check: array<S>;
}
Because next and check both hold meaningful data (the target state and its owner) for the transition at the same index, we can group them in a pair. The data structure then has two arrays: the base array and the pair array, which holds the next and check data in one place.
trie {
    base: array<S>;
    transition: array<Pair>;
}
Pair {
    next: S;
    check: S;
}
We could also have this implementation:
trie {
    transition: array<Triple>;
}
Triple {
    base: S;
    next: S;
    check: S;
}
This is a bad implementation: it looks like the first one, but the base array data is duplicated for each transition.
In the second implementation, where base is split from next and check, we can retrieve the next and check data at the same time, and we don't duplicate the base info as in the third.
In the double-array trie, base and next are merged into a single array, still called base, so only base and check remain. The separate array is dropped because it is not really necessary, and having one less array to store and manage is valuable.
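To make the difference concrete, here is a minimal Java sketch of how a lookup walks each layout. It assumes the transition rules as I understand them from the linked post (triple-array: t = base[s] + c, accept if check[t] == s, then go to next[t]; double-array: same test, but the new state is t itself). The class name, the symbol codes a=1, b=2, c=3, and the hand-built arrays for the words "ab" and "ac" are all made up for illustration.

```java
final class ArrayTrieSketch {
    // Hypothetical symbol codes: a=1, b=2, c=3.
    static int code(char ch) { return ch - 'a' + 1; }

    // Triple-array layout. States: 0 = root, 1 = "a", 2 = "ab", 3 = "ac".
    // BASE has one entry per state; NEXT/CHECK have one entry per pool cell,
    // so the two sizes are independent (4 states, 5 cells here).
    static final int[] T_BASE  = {0, 1, 0, 0};
    static final int[] T_NEXT  = {-1, 1, -1, 2, 3};
    static final int[] T_CHECK = {-1, 0, -1, 1, 1};

    /** Returns the state reached after reading word, or -1 if there is no such path. */
    static int walkTriple(String word) {
        int s = 0;
        for (char ch : word.toCharArray()) {
            int t = T_BASE[s] + code(ch);
            if (t < 0 || t >= T_CHECK.length || T_CHECK[t] != s) return -1;
            s = T_NEXT[t];              // the cell stores the target state
        }
        return s;
    }

    // Double-array layout. The cell index itself is the state number,
    // so BASE and CHECK always have the same length.
    // States: 0 = root, 2 = "a", 3 = "ab", 4 = "ac" (cell 1 is unused).
    static final int[] D_BASE  = {1, 0, 1, 0, 0};
    static final int[] D_CHECK = {-1, -1, 0, 2, 2};

    /** Returns the state reached after reading word, or -1 if there is no such path. */
    static int walkDouble(String word) {
        int s = 0;
        for (char ch : word.toCharArray()) {
            int t = D_BASE[s] + code(ch);
            if (t < 0 || t >= D_CHECK.length || D_CHECK[t] != s) return -1;
            s = t;                      // the cell index is the target state
        }
        return s;
    }
}
```

In the triple-array version, base grows with the number of states while next and check grow with the number of pool cells, which seems to be what the quoted passage means by base not growing "in parallel" with the pool.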
I have a large list of some elements sorted by their probabilities:
data class Element(val value: String, val probability: Float)
val sortedElements = listOf(
    Element("dddcccdd", 0.7f),
    Element("aaaabb", 0.2f),
    Element("bbddee", 0.1f)
)
Now I need to perform prefix searches on this list: find the items that start with one prefix, then with the next prefix, and so on (the results still need to be sorted by probability).
val filteredElements1 = sortedElements
    .filter { it.value.startsWith("aa") }
val filteredElements2 = sortedElements
    .filter { it.value.startsWith("bb") }
Each "request" for elements filtered by some prefix takes O(n) time, which is too slow for a large list.
If I didn't care about the order of the elements (their probabilities), I could sort the elements lexicographically and perform a binary search: sorting takes O(n*log n) time and each request -- O(log n) time.
Is there any way to speed up these operations without losing the probability ordering of the elements? Maybe there is some kind of special data structure that is suitable for this task?
You can read more about the trie data structure at https://en.wikipedia.org/wiki/Trie
This could be really useful for your use case.
LeetCode has another very detailed explanation of it, which you can find here: https://leetcode.com/articles/implement-trie-prefix-tree/
Hope this helps
If your list does not change often, you could create a HashMap in which each existing prefix is a key that refers to a collection (sorted by probability) of all entries it is a prefix of.
Getting all entries for a given prefix then takes ~O(1).
Be careful: the map gets really big, and creating it takes quite some time.
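A minimal Java sketch of that prefix map, under the assumption that the input is already sorted by descending probability (so appending in input order keeps every per-prefix list sorted too). The class and method names are made up for illustration, and only the string values are indexed to keep it short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PrefixIndex {
    private final Map<String, List<String>> byPrefix = new HashMap<>();

    /** Expects values in descending probability order; that order is preserved per prefix. */
    PrefixIndex(List<String> valuesByDescendingProbability) {
        for (String value : valuesByDescendingProbability) {
            // Register the value under every non-empty prefix of itself.
            for (int end = 1; end <= value.length(); end++) {
                byPrefix.computeIfAbsent(value.substring(0, end), k -> new ArrayList<>())
                        .add(value);
            }
        }
    }

    /** Returns the matching values in probability order, or an empty list. */
    List<String> startingWith(String prefix) {
        return byPrefix.getOrDefault(prefix, Collections.emptyList());
    }
}
```

The memory cost is one list entry per character of every value, which is the "really big" map the answer warns about. A trie whose nodes each hold such a list stores the same entries but shares the prefix strings themselves.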
We're learning about hash tables in my data structures and algorithms class, and I'm having trouble understanding separate chaining.
I know the basic premise: each bucket has a pointer to a Node that contains a key-value pair, and each Node contains a pointer to the next (potential) Node in the current bucket's mini linked list. This is mainly used to handle collisions.
Now, suppose for simplicity that the hash table has 5 buckets. Suppose I wrote the following lines of code in my main after creating an appropriate hash table instance.
myHashTable["rick"] = "Rick Sanchez";
myHashTable["morty"] = "Morty Smith";
Let's imagine whatever hashing function we're using just so happens to produce the same bucket index for both string keys rick and morty. Let's say that bucket index is index 0, for simplicity.
So at index 0 in our hash table, we have two nodes with values of Rick Sanchez and Morty Smith, in whatever order we decide to put them in (the first pointing to the second).
When I want to display the corresponding value for rick, which is Rick Sanchez per our code here, the hashing function will produce the bucket index of 0.
How do I decide which node needs to be returned? Do I loop through the nodes until I find the one whose key matches rick?
To resolve hash table collisions, that is, to put or get an item whose hash value collides with another one, you fall back on the data structure that backs each bucket; this is generally a linked list. A collision is the worst case for a hash table: you end up with an O(n) operation to reach the correct item in the linked list. That is, a loop, as you said, that searches for the item with the matching key. But if the bucket is backed by a structure such as a balanced tree, the search can take O(log n) time, as in the Java 8 implementation.
As JEP 180: Handle Frequent HashMap Collisions with Balanced Trees says:
The principal idea is that once the number of items in a hash bucket
grows beyond a certain threshold, that bucket will switch from using a
linked list of entries to a balanced tree. In the case of high hash
collisions, this will improve worst-case performance from O(n) to
O(log n).
This technique has already been implemented in the latest version of
the java.util.concurrent.ConcurrentHashMap class, which is also slated
for inclusion in JDK 8 as part of JEP 155. Portions of that code will
be re-used to implement the same idea in the HashMap and LinkedHashMap
classes.
I strongly suggest always looking at an existing implementation; the Java 7 one is a good example. Doing so will improve your code-reading skills, which matter almost more than writing skills, since you read code more often than you write it. I know it takes more effort, but it will pay off.
For example, take a look at the Hashtable.get method from Java 7:
public synchronized V get(Object key) {
    Entry<?,?> tab[] = table;
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % tab.length;
    for (Entry<?,?> e = tab[index] ; e != null ; e = e.next) {
        if ((e.hash == hash) && e.key.equals(key)) {
            return (V)e.value;
        }
    }
    return null;
}
Here we see that the condition (e.hash == hash) && e.key.equals(key) is what finds the correct item with the matching key.
And here is the full source code: Hashtable.java
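For a version small enough to run by hand, here is a rough separate-chaining sketch in Java. It is not taken from the JDK; the class name and the choice of String keys and values are assumptions for illustration. Creating the table with a single bucket forces every key to collide, so the rick and morty case from the question can be reproduced.

```java
final class ChainedTable {
    private static final class Node {
        final String key;
        String value;
        final Node next;

        Node(String key, String value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] buckets;

    ChainedTable(int bucketCount) {
        buckets = new Node[bucketCount];
    }

    private int indexFor(String key) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    void put(String key, String value) {
        int i = indexFor(key);
        for (Node n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) {   // key already present: overwrite
                n.value = value;
                return;
            }
        }
        buckets[i] = new Node(key, value, buckets[i]);  // prepend to the chain
    }

    String get(String key) {
        // Walk the chain and compare keys, not just bucket indexes.
        for (Node n = buckets[indexFor(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                return n.value;
            }
        }
        return null;
    }
}
```

With new ChainedTable(1), putting "rick" and "morty" places both nodes in bucket 0, and get("rick") still returns the right value because the loop compares the stored key of each node.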
My End Goal:
Create the implementation of a hash table from scratch. The twist: if the number of entries in a hash bucket is greater than 10, the bucket is stored as a Binary Search Tree; otherwise it is stored as a Linked List.
To my knowledge, the only way to achieve this is through an
enum class type_name { a, b };
My Question: Can 'a', and 'b' be classes?
Thought Process:
So to implement a hash table, I am thinking of making an array of the enumerated class. This way, as soon as the Linked List at any index of the array grows past 10 entries, it will be replaced with a Binary Search Tree.
If this is not possible, what would be the best way to achieve this? My implementations of Linked List and Binary Search Tree are complete and work perfectly.
Note: I am not looking for a complete implementation or full code. I would like to be able to code it myself, but I think my theory is flawed.
Visualization of My Idea
----------------------------------H A S H T A B L E---------------------------------------
enum class Hash { LinkedList, Tree };
INDEXES: 0 1 2 3 4
Hash eg = new Hash [ LinkedList, LinkedList, LinkedList, LinkedList, LinkedList ]
//11th element is inserted into eg[2]
//Method to Replace Linked List with Binary Search Tree
if (eg[2].getSize() > 10) {
    Tree toReplace;
    Node *follow = eg[2].headptr; //Each linked list is made of connected nodes
    //headptr is a pointer to the first element of the linked list
    while ( follow != nullptr ){
        toReplace.insert(follow->value);
        follow = follow->next; //next is the pointer to the next element in the linked list
    }
}
//Now, the Linked List at eg[2] is replaced with a Binary Search Tree
Hash eg = new Hash [ LinkedList, LinkedList, Tree, LinkedList, LinkedList ]
Short answer: No.
An enumeration is a distinct type whose value is restricted to a range
of values (see below for details), which may include several
explicitly named constants ("enumerators"). The values of the
constants are values of an integral type known as the underlying type
of the enumeration.
http://en.cppreference.com/w/cpp/language/enum
Classes will not be 'values of an integral type'.
You may be able to achieve what you want with a tuple.
http://en.cppreference.com/w/cpp/utility/tuple
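As a different route from the tuple suggestion, one common pattern is to give both containers a shared interface and store that interface in the array, so a slot can be swapped from one implementation to the other. The sketch below is in Java and is only an illustration of the idea: every name in it is made up, it uses java.util.TreeSet as a stand-in for a hand-written Binary Search Tree, and it stores plain int values. In C++ the analogous shape would be an abstract base class with virtual member functions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

interface Bucket {
    void add(int value);
    boolean contains(int value);
    int size();
    Iterable<Integer> values();
}

final class ListBucket implements Bucket {
    private final List<Integer> items = new ArrayList<>();
    public void add(int value) { if (!items.contains(value)) items.add(value); }
    public boolean contains(int value) { return items.contains(value); }
    public int size() { return items.size(); }
    public Iterable<Integer> values() { return items; }
}

final class TreeBucket implements Bucket {
    private final TreeSet<Integer> items = new TreeSet<>();  // stand-in for a custom BST
    public void add(int value) { items.add(value); }
    public boolean contains(int value) { return items.contains(value); }
    public int size() { return items.size(); }
    public Iterable<Integer> values() { return items; }
}

final class HybridTable {
    static final int THRESHOLD = 10;
    private final Bucket[] buckets;

    HybridTable(int bucketCount) {
        buckets = new Bucket[bucketCount];
        for (int i = 0; i < bucketCount; i++) buckets[i] = new ListBucket();
    }

    void insert(int value) {
        int i = Math.floorMod(value, buckets.length);
        buckets[i].add(value);
        // Once a list bucket holds more than THRESHOLD entries, copy it into a tree.
        if (buckets[i] instanceof ListBucket && buckets[i].size() > THRESHOLD) {
            Bucket tree = new TreeBucket();
            for (int v : buckets[i].values()) tree.add(v);
            buckets[i] = tree;
        }
    }

    boolean contains(int value) {
        return buckets[Math.floorMod(value, buckets.length)].contains(value);
    }

    boolean isTree(int index) { return buckets[index] instanceof TreeBucket; }
}
```

The table itself never needs to know which kind of bucket sits in a slot, except at the one place where it decides to convert.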
Imagine you have an MD5 sum that was calculated from an array of N 64-byte elements. I want to replace an element at an arbitrary index in the source array with a new element. Then, instead of recalculating the MD5 sum by re-running it through an MD5 function, I would like to "subtract" the old element from the result and "add" the new piece of data to it.
To be a bit more clear, here's some pseudo-Scala:
class Block {
    var summary: MD5Result
    // The key reason behind this question is that the elements might not be
    // loaded. With a large array, it can get expensive to load everything just to
    // update a single thing.
    var data: Array[Option[Element]]
    def replaceElement(block: Block, index: Integer, newElement: Element) = {
        // we can know the element that we're replacing
        val oldElement = block.data(index) match {
            case Some(x) => x
            case None => loadData(index) // <- this is expensive
        }
        // update the MD5 using this magic function
        summary = replaceMD5(summary, index, oldElement, newElement)
    }
}
Is replaceMD5 implementable? While all signs point to "this is breaking a (weak) cryptographic hash," the actual MD5 algorithm seems to support doing this (but I might be missing something obvious).
I think I better understand what you want to do now. My solution below assumes nothing about MD5 computation, but involves a tradeoff between IO and storing a large number of MD5 hashes. Instead of computing the simple MD5 hash of the entire dataset, it computes a different MD5 hash that nevertheless should have the same important property: that any change to any element (drastically) changes it.
At the outset, decide on a block size b such that
you can afford to read b values from disk (or whatever IO you're talking about) per change of element, and
you can afford to keep 2n/b MD5 hashes in memory.
Create a binary tree of MD5 hashes. Each leaf in this tree will be the MD5 hash of a size-b block. Each internal node is the MD5 hash of its two children. We will use the hash of the root of this tree as "the" MD5 hash.
When element i changes, read in the b elements in block RoundDown(i/b), compute the new MD5 hash for this, and then propagate the changes up the tree (this will take at most log2(n) steps).
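The tree of hashes described above can be sketched roughly as follows in Java. This is one possible layout, not the only one: the class name is made up, the tree is stored in a flat array (node i has children 2i+1 and 2i+2), the number of leaves is padded to a power of two with hashes of empty blocks, and each "block" is passed in as the raw bytes of its b elements.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class Md5Tree {
    private final byte[][] nodes;   // flat binary tree of MD5 digests
    private final int leafCount;    // number of leaves, padded to a power of two

    Md5Tree(byte[][] blocks) {
        int n = 1;
        while (n < blocks.length) n *= 2;
        leafCount = n;
        nodes = new byte[2 * n - 1][];
        for (int i = 0; i < n; i++) {
            // Leaves: hash of each block (padding leaves hash an empty block).
            nodes[n - 1 + i] = md5(i < blocks.length ? blocks[i] : new byte[0]);
        }
        for (int i = n - 2; i >= 0; i--) {
            nodes[i] = combine(i);
        }
    }

    /** Re-hash one block, then only the nodes on its path to the root. */
    void updateBlock(int blockIndex, byte[] newBlockContents) {
        int i = leafCount - 1 + blockIndex;
        nodes[i] = md5(newBlockContents);
        while (i > 0) {
            i = (i - 1) / 2;
            nodes[i] = combine(i);
        }
    }

    /** The digest used as "the" hash of the whole data set. */
    byte[] root() {
        return nodes[0].clone();
    }

    private byte[] combine(int i) {
        byte[] left = nodes[2 * i + 1];
        byte[] right = nodes[2 * i + 2];
        byte[] both = Arrays.copyOf(left, left.length + right.length);
        System.arraycopy(right, 0, both, left.length, right.length);
        return md5(both);
    }

    private static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

An update touches one leaf plus one node per tree level, so only the changed block has to be read from storage; the resulting root is not the plain MD5 of the whole array, which matches the tradeoff the answer describes.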
I have a large number of fixed-size records. Each record has lots of fields, ID and Value among them. I am wondering what kind of data structure would be best so that I can
locate a record by its (unique) ID very fast,
list the 100 records with the biggest values.
A max-heap seems to work, but it is far from perfect; do you have a smarter solution?
Thank you.
A hybrid data structure will most likely be best. For efficient lookup by ID a good structure is obviously a hash-table. To support top-100 iteration a max-heap or a binary tree is a good fit. When inserting and deleting you just do the operation on both structures. If the 100 for the iteration case is fixed, iteration happens often and insertions/deletions aren't heavily skewed to the top-100, just keep the top 100 as a sorted array with an overflow to a max-heap. That won't modify the big-O complexity of the structure, but it will give a really good constant factor speed-up for the iteration case.
I know you want a pseudo-code algorithm, but in Java, for example, I would use a TreeSet and add all the records as (ID, value) pairs.
The tree will keep them sorted by value, so querying the first 100 will give you the top 100. Retrieving by ID will be straightforward.
I think the underlying structure is called a binary tree or balanced tree, but I'm not sure.
Max heap would match the second requirement, but hash maps or balanced search trees would be better for the first one. Make the choice based on frequency of these operations. How often would you need to locate a single item by ID and how often would you need to retrieve top 100 items?
Pseudo code:
add(Item t)
{
    //Add the same object instance to both data structures
    heap.add(t);
    hash.add(t);
}
remove(int id)
{
    heap.removeItemWithId(id);//this is gonna be slow
    hash.remove(id);
}
getTopN(int n)
{
    return heap.topNitems(n);
}
getItemById(int id)
{
    return hash.getItemById(id);
}
updateValue(int id, String value)
{
    Item t = hash.getItemById(id);
    //now t is the same object referred to by the heap and hash
    t.value = value;
    //the hash is unaffected, but the heap is ordered by value,
    //so t's position in it has to be restored
    heap.restoreOrder(t);
}
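A runnable Java sketch of the same two-structure idea, with some liberties: it swaps the heap for a java.util.TreeSet ordered by value (so removal by ID is O(log n) instead of slow), treats the value as a long, and uses made-up class and method names. The ID is used as a tie-breaker so that two records with equal values can coexist in the set.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class RecordStore {
    static final class Rec {
        final int id;
        final long value;

        Rec(int id, long value) {
            this.id = id;
            this.value = value;
        }
    }

    private final Map<Integer, Rec> byId = new HashMap<>();
    // Largest value first; ties broken by ID so distinct records never compare equal.
    private final TreeSet<Rec> byValue = new TreeSet<>(
            Comparator.comparingLong((Rec r) -> r.value).reversed()
                      .thenComparingInt(r -> r.id));

    /** Inserts a record, or replaces the value of an existing ID. */
    void put(int id, long value) {
        Rec old = byId.get(id);
        if (old != null) {
            byValue.remove(old);   // remove under the old value before re-adding
        }
        Rec rec = new Rec(id, value);
        byId.put(id, rec);
        byValue.add(rec);
    }

    void remove(int id) {
        Rec old = byId.remove(id);
        if (old != null) {
            byValue.remove(old);
        }
    }

    Rec get(int id) {
        return byId.get(id);
    }

    /** Returns up to n records with the biggest values, biggest first. */
    List<Rec> top(int n) {
        List<Rec> out = new ArrayList<>();
        for (Rec r : byValue) {
            if (out.size() == n) break;
            out.add(r);
        }
        return out;
    }
}
```

Records are immutable here on purpose: changing a value while the record sits inside the ordered set would leave it in the wrong position, so an update is done as remove plus re-insert.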