MIN_TREEIFY_CAPACITY in HashMap - java-8

The documentation says the following:
/**
* The smallest table capacity for which bins may be treeified.
* (Otherwise the table is resized if too many nodes in a bin.)
* Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
* between resizing and treeification thresholds.
*/
Could you explain the rationale or logic behind having this parameter as at least 4 * TREEIFY_THRESHOLD?

MIN_TREEIFY_CAPACITY means that the total number of buckets in the HashMap must be at least 64 before a bucket can be transformed from a linked list into a red-black tree (a self-balancing BST).
The TREEIFY_THRESHOLD condition must also be met, i.e. the bucket in question must have 8 or more entries.

This constant basically says not to start making buckets into trees if our hash map is very small — it should resize to be larger first instead.

As of Java 8, when the number of entries in a single bucket's linked list reaches 8 (TREEIFY_THRESHOLD) and the table has at least MIN_TREEIFY_CAPACITY (64) buckets, that linked list is converted to a balanced red-black tree. This improves the worst-case lookup within such a bucket from O(n) to O(log n).
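To make the rule concrete, here is a minimal Java sketch of the choice that HashMap.treeifyBin makes once a bin reaches TREEIFY_THRESHOLD entries; the constants mirror the real ones, but the class and method names here are illustrative, not the actual OpenJDK code.

public class TreeifyDecision {
    static final int TREEIFY_THRESHOLD = 8;
    static final int MIN_TREEIFY_CAPACITY = 64; // at least 4 * TREEIFY_THRESHOLD

    // Called when a bin's linked list has just grown to TREEIFY_THRESHOLD entries.
    static String onBinTooLong(int tableCapacity) {
        if (tableCapacity < MIN_TREEIFY_CAPACITY) {
            // The table is still small: a long bin is likely just a symptom of too few
            // buckets, so the map resizes (doubles the table) instead of building a tree.
            return "resize";
        }
        // The table is already large: the collisions come from the keys themselves,
        // so the bin's linked list is converted into a red-black tree.
        return "treeify";
    }

    public static void main(String[] args) {
        System.out.println(onBinTooLong(16)); // resize
        System.out.println(onBinTooLong(64)); // treeify
    }
}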

Related

algorithm to find duplicated byte sequences

Hello everyone reading this post. Please help me with the following task:
The input is an array of bytes. I need to detect duplicated sequences of bytes so that the duplicates can be compressed. Can anyone suggest a suitable algorithm?
Usually this type of problem comes with CPU/memory tradeoffs.
The extreme solutions would be (1) and (2) below; you can improve from there:
(1) High CPU / low memory - iterate over all possible sequence lengths and starting offsets and compare to find duplicates (two nested for loops).
(2) Low CPU / high memory - create a lookup table (hash map) of all required combinations and lengths, traverse the array and add to the table, then traverse the table and find your candidates.
To improve from here: if you want the idea of (2) with lower memory, decrease the lookup table size so it gets more hits and resolve the resulting collisions later; if you want faster lookup by sequence length, create a separate lookup table per length.
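As a concrete illustration of approach (2) for a single fixed sequence length, here is a minimal Java sketch; the class, method, and parameter names are my own, and in practice a rolling hash would make a cheaper table key than the hex string used here.

import java.util.*;

public class DuplicateSequences {
    // Map each length-L byte sequence to every position where it starts,
    // then keep only the sequences that occur more than once.
    static Map<String, List<Integer>> findDuplicates(byte[] data, int L) {
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int i = 0; i + L <= data.length; i++) {
            String key = toHex(data, i, L); // the window rendered as a hex string
            positions.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
        }
        positions.values().removeIf(p -> p.size() < 2); // drop non-duplicates
        return positions;
    }

    static String toHex(byte[] data, int off, int len) {
        StringBuilder sb = new StringBuilder();
        for (int i = off; i < off + len; i++) sb.append(String.format("%02x", data[i]));
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 1, 2, 3, 9};
        System.out.println(findDuplicates(data, 3)); // {010203=[0, 4]}
    }
}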
Build a tree whose branches are all possible byte values (256 branches).
Then traverse the array, building new sub-branches. At each node, store a list of positions where this sequence is found.
For example: let's say you are at node AC,40,2F. This sequence in the tree means: "Byte AC was found at position xx (one of the positions stored in that node). The next byte, 40, was at position yy = xx+1 (among others). The byte 2F was at position zz = yy+1."
Now you want to "compress" only sequences of some size (e.g. 5). So traverse the tree and pay attention to depths 5 or more. In a node at depth 5 you have already stored all positions where such a sequence (or a longer one) is found in the array. Those are the positions you want to record in your compressed file.
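Here is a minimal Java sketch of that trie idea; all class and field names are illustrative, and a HashMap of children stands in for the full 256-branch array.

import java.util.*;

class ByteTrieNode {
    Map<Byte, ByteTrieNode> children = new HashMap<>(); // sparse map instead of 256 slots
    List<Integer> positions = new ArrayList<>();        // where this byte sequence starts
}

class ByteTrie {
    final ByteTrieNode root = new ByteTrieNode();

    // Insert every window of up to maxDepth bytes starting at each position of the array.
    void build(byte[] data, int maxDepth) {
        for (int start = 0; start < data.length; start++) {
            ByteTrieNode node = root;
            for (int d = 0; d < maxDepth && start + d < data.length; d++) {
                node = node.children.computeIfAbsent(data[start + d], b -> new ByteTrieNode());
                node.positions.add(start); // the sequence data[start .. start+d] occurs here
            }
        }
    }
}

After build, any node at depth 5 whose positions list has two or more entries corresponds to a repeated sequence of 5 (or more) bytes, which is exactly what you would record in the compressed output.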

Time complexity of unordered_set<int> find method

What is the time complexity of find method in unordered_set<int>?
And also, is it possible to change the hash functions?
what is the time complexity of find method in unordered_set?
...it's right there in the page you linked:
Complexity:
Average case: constant.
Worst case: linear in container size.
and also is it possible to change the hash functions?
Yes. Again, look at the documentation!
std::unordered_set takes a Hash template parameter. It's a customization point where you can inject your own hashing logic. A custom Hash must satisfy the Hash concept.
I guess you are getting confused by the default max_load_factor being 1. When you insert an int x into the unordered_set, it goes into bucket i (i = x % number of buckets; the default std::hash<int> typically maps each int to itself). So even though the hash function itself has no collisions, the mod operation can still produce "collisions" in some cases. For example, if the set happens to have 5 buckets and you insert 1, 4 and 6 in that order, both 1 and 6 end up in bucket 1 (1 % 5 == 6 % 5 == 1), and find will need to walk through that bucket to locate them.

The number of buckets is only increased when the load factor reaches the max load factor. The load factor is the average number of elements per bucket (the number of elements divided by the number of buckets). So you can have more than one element in a bucket, and you can even have all elements in the same bucket. In that case, finding an element that is in the set requires a sequential search inside the bucket, which is O(n). Here you have an example:
#include <unordered_set>

std::unordered_set<int> n;  // the example assumes the table currently has 11 buckets
n.insert(1);                // 1 % 11 == 1
n.insert(12);               // 12 % 11 == 1
n.insert(23);               // 23 % 11 == 1
n.insert(34);               // 34 % 11 == 1
n.insert(45);               // 45 % 11 == 1 -- all five end up in bucket 1
In that case, every int is in bucket 1, so when you look for 56 (56 % 11 == 1) you need to go through the whole bucket (size n, so O(n)). The load factor is 0.4545 (5 elements / 11 buckets), so no buckets are added. You can reduce the max_load_factor (some languages use a default load factor of 0.75), but that would increase the number of rehashes, since you would need to reserve buckets more frequently. Reserving is amortized constant, much like the growth strategy std::vector uses; that is why in the example we have 11 buckets.

hash table about the load factor

I'm studying about hash table for algorithm class and I became confused with the load factor.
Why is the load factor, n/m, significant with 'n' being the number of elements and 'm' being the number of table slots?
Also, why does this load factor equal the expected length of n(j), the linked list at slot j in the hash table when all of the elements are stored in a single slot?
The crucial property of a hash table is the expected constant time it takes to look up an element.*
In order to achieve this, the implementer of the hash table has to make sure that every query to the hash table returns within some fixed number of steps.
If you have a hash table with m buckets and you add elements indefinitely (i.e. n >> m), then the lists will also grow, and you can no longer guarantee expected constant time for lookups; instead you get linear time, since the time needed to traverse the ever-growing linked lists outweighs the lookup of the bucket itself.
So, how can we make sure the lists don't grow? Well, you have to make sure that the length of each list is bounded by some fixed constant - how do we do that? Well, we have to add additional buckets.
If the hash table is well implemented, then the hash function used to map the elements to buckets should distribute the elements evenly across the buckets. If the hash function does this, then the lengths of the lists will be roughly the same.
How long is one of the lists if the elements are distributed evenly? Clearly, it is the total number of elements divided by the number of buckets, i.e. the load factor n/m (the number of elements per bucket is the expected/average length of each list).
Hence, to ensure constant time look up, what we have to do is keep track of the load factor (again: expected length of the lists) such that, when it goes above the fixed constant we can add additional buckets.
Of course, there are more problems which come in, such as how to redistribute the elements you already stored or how many buckets should you add.
The important message to take away, is that the load factor is needed to decide when to add additional buckets to the hash table - that's why it is not only 'important' but crucial.
Of course, if you map all the elements to the same bucket, then the average length of each list won't be worth much. All this stuff only makes sense, if you distribute evenly across the buckets.
*Note the expected - I can't emphasize this enough. It's typical to hear "hash tables have constant lookup time". They do not! The worst case is always O(n) and you can't make that go away.
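To make the resizing rule concrete, here is a toy Java sketch of a chained hash set that grows its bucket array whenever n/m crosses a fixed maximum load factor; the names and the 0.75 threshold are illustrative choices, not taken from any particular library.

import java.util.*;

class ToyHashSet {
    static final double MAX_LOAD_FACTOR = 0.75; // example threshold
    private List<List<Integer>> buckets = newBuckets(8);
    private int n = 0; // number of stored elements

    void add(int x) {
        if (contains(x)) return;
        if ((n + 1.0) / buckets.size() > MAX_LOAD_FACTOR) rehash(2 * buckets.size());
        buckets.get(indexFor(x)).add(x);
        n++;
    }

    boolean contains(int x) {
        return buckets.get(indexFor(x)).contains(x); // expected O(1 + n/m) chain scan
    }

    private int indexFor(int x) { return Math.floorMod(Integer.hashCode(x), buckets.size()); }

    private void rehash(int m) {
        List<List<Integer>> old = buckets;
        buckets = newBuckets(m);
        for (List<Integer> chain : old)
            for (int x : chain) buckets.get(indexFor(x)).add(x); // redistribute into new buckets
    }

    private static List<List<Integer>> newBuckets(int m) {
        List<List<Integer>> b = new ArrayList<>();
        for (int i = 0; i < m; i++) b.add(new LinkedList<>());
        return b;
    }
}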
Adding to the existing answers, let me just put in a quick derivation.
Consider an arbitrarily chosen bucket in the table. Let X_i be the indicator random variable that equals 1 if the ith element is inserted into this bucket and 0 otherwise.
We want to find E[X_1 + X_2 + ... + X_n].
By linearity of expectation, this equals E[X_1] + E[X_2] + ... + E[X_n].
Now we need to find the value of E[X_i]. This is simply (1/m)*1 + (1 - 1/m)*0 = 1/m by the definition of expected value. Summing these values over all i, we get 1/m + 1/m + ... + 1/m (n times), which equals n/m. We have just found the expected number of elements inserted into an arbitrary bucket, and this is the load factor.
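In LaTeX notation, the whole derivation in one line (same X_i as above):

E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n}\left(\frac{1}{m}\cdot 1 + \left(1-\frac{1}{m}\right)\cdot 0\right) = \frac{n}{m}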

Why not use alternate elements as vertical search nodes in Skip Lists?

In most implementations of skip lists I've seen, they use a randomized algorithm to decide if an element must be copied into the upper level.
But I think using odd indexed elements at each level to have copies in the upper level will give us logarithmic search complexity. Why isn't this used?
E.g. :
Data : 1 2 3 4 5 6 7 8 9
Skip List:
1--------------------
1--------------------9
1--------5----------9
1----3---5----7----9
1-2-3-4-5-6-7-8-9
It is not used because, while it is easy to maintain while building the list from scratch, it becomes a problem when you add or remove elements in an existing list: elements that used to be odd indexed are even indexed afterwards, and vice versa.
In your example, assume you now add 3.5; just to maintain the DS as you described, it will require O(k + k/2 + k/4 + ...) changes, where k is the number of elements AFTER the element you have just inserted.
This basically gives you O(n/2 + n/4 + ...) = O(n) add/remove complexity on average.
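For contrast, here is a minimal Java sketch of the randomized promotion rule real skip lists use instead of the deterministic "every other element" rule; because a node's height is fixed by coin flips at insertion time and is independent of its neighbours, adding or removing an element never forces other nodes to change levels. The class and method names are illustrative.

import java.util.concurrent.ThreadLocalRandom;

class SkipListLevels {
    // Flip fair coins: the new node climbs one level per head, up to maxLevel.
    static int randomLevel(int maxLevel) {
        int level = 1;
        while (level < maxLevel && ThreadLocalRandom.current().nextBoolean()) {
            level++; // with probability 1/2 the node also appears one level higher
        }
        return level; // expected height is about 2, so ~half the nodes survive to each next level
    }
}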
Because if you start deleting or inserting nodes in the middle, the structure quickly requires rebalancing or it loses its logarithmic guarantees on access and update.
Actually there is a structure very similar to what you suggest, an interval tree, which gets around the update problem by not using actual elements as intermediate node labels. It can also require some care to get good performance.

min/max number of records on a B+Tree?

I was looking at the best & worst case scenarios for a B+Tree (http://en.wikipedia.org/wiki/B-tree#Best_case_and_worst_case_heights) but I don't know how to use this formula with the information I have.
Let's say I have a tree B with 1,000 records. What is the maximum (and minimum) number of levels B can have?
I can have as many/as few keys on each page as I want. I can also have as many/as few pages as I want.
Any ideas?
(In case you are wondering, this is not a homework question, but it will surely help me understand some stuff for hw.)
I don't have the math handy, but...
Basically, the primary factor to tree depth is the "fan out" of each node in the tree.
Normally, in a simple B-Tree, the fan out is 2: 2 nodes as children for each node in the tree.
But with a B+Tree, typically they have a fan out much larger.
One factor that comes in to play is the size of the node on disk.
For example, if you have a 4K page size with, say, 4000 bytes of free space (not including any other pointers or other metadata related to the node), and let's say that a pointer to any other node in the tree is a 4-byte integer. If your B+Tree is in fact storing 4-byte integers, then the combined size (4 bytes of pointer information + 4 bytes of key information) = 8 bytes, and 4000 free bytes / 8 bytes == 500 possible children.
That gives you a fan out of 500 for this contrived case.
So, with one page of index, i.e. the root node, or a height of 1 for the tree, you can reference 500 records. Add another level, and you're at 500*500, so for 501 4K pages, you can reference 250,000 rows.
Obviously, the larger the key size, or the smaller the page size of your node, the lower the fan out that the tree is capable of. If you allow variable-length keys in each node, then the fan out can easily vary.
But hopefully you can see the gist of how this all works.
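A quick back-of-the-envelope version of that arithmetic, with the same example assumptions (4K page, 4-byte keys, 4-byte child pointers; none of these numbers are fixed rules):

public class FanOut {
    public static void main(String[] args) {
        int usablePageBytes = 4000;
        int keyBytes = 4, pointerBytes = 4;
        int fanOut = usablePageBytes / (keyBytes + pointerBytes);      // 500 children per node
        System.out.println("fan out: " + fanOut);                      // 500
        System.out.println("two levels reach: " + (long) fanOut * fanOut + " records"); // 250000
    }
}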
It depends on the arity of the tree. You have to define this value. If you say that each node can have 4 children and you have 1000 records, then the height is:
Best case: ceil(log_4(1000)) = 5
Worst case: ceil(log_{4/2}(1000)) = 10
The arity is m and the number of records is n.
The best and worst cases depend on the number of children each node can have. For the best case, we consider the case when each node has the maximum number of children (i.e. m for an m-ary tree), with each node holding m-1 keys. So,
1st level (or root) has m-1 entries
2nd level has m*(m-1) entries (since the root has m children with m-1 keys each)
3rd level has m^2*(m-1) entries
....
Hth level has m^(H-1)*(m-1) entries
Thus, if H is the height of the tree, the total number of entries is equal to n=m^H-1
which is equivalent to H=log_m(n+1)
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the best case height will be equal to log_m(1000+1)
Similarly, for the worst case scenario:
Level 1 (root) has at least 1 entry (and at least 2 children)
2nd level has at least 2*(d-1) entries (where d = ceil(m/2) is the minimum number of children each internal node (except the root) can have)
3rd level has 2d*(d-1) entries
...
Hth level has 2*d^(H-2)*(d-1) entries
Thus, if H is the height of the tree, the total number of entries is equal to n = 2*d^(H-1) - 1, which is equivalent to H = log_d((n+1)/2) + 1
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the worst case height will be equal to log_d((1000+1)/2) + 1
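To check these formulas against the earlier answer, here is a small Java sketch that plugs n = 1000 into them; the branching factor m = 4 is just an example value, and d = ceil(m/2).

public class BPlusTreeHeight {
    public static void main(String[] args) {
        int n = 1000;
        int m = 4;           // maximum number of children per node (example value)
        int d = (m + 1) / 2; // minimum children per internal node, ceil(m/2)

        // Best case (every node full):       n = m^H - 1        =>  H = log_m(n + 1)
        double best = Math.log(n + 1) / Math.log(m);
        // Worst case (every node half full): n = 2*d^(H-1) - 1  =>  H = log_d((n + 1) / 2) + 1
        double worst = Math.log((n + 1) / 2.0) / Math.log(d) + 1;

        System.out.printf("best case:  %.2f -> %d levels%n", best, (int) Math.ceil(best));   // 5
        System.out.printf("worst case: %.2f -> %d levels%n", worst, (int) Math.ceil(worst)); // 10
    }
}

These round up to 5 and 10 levels, matching the earlier answer's ceil(log_4(1000)) and ceil(log_2(1000)) figures.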
