Hashing Access time with multi variable key - algorithm

Suppose a dictionary has two-variable keys instead of single-variable keys, like
dictionary[3,5] = Something
dictionary[1,2] = Something
dictionary[3,1] = Something
Would the search time still be O(1)? If I need to find out whether dictionary[1,5] exists, would that take constant time?
Thanks in advance.

When you do a lookup in a hash table, the cost involved is the cost of
hashing the item to look up, and
comparing that item against (an expected O(1) number of) other entries in the table.
We can write the expected cost of a hash table lookup as O(hash-cost + compare-cost).
In your case, the cost of hashing a pair instead of a single element is still O(1) - just hash each element of the pair and apply some hash combination step to the two values. Similarly, the cost of comparing two pairs is also O(1) (assuming each individual element of the pair can be compared in constant time). As a result, a lookup will still be (expected) constant time.
The above argument generalizes to any fixed-size tuple (a triple, for example) as a key. You typically have to worry about the cost of hashing and comparing keys only when they have variable length, as would be the case if you were hashing strings with no length restriction.
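As a minimal sketch of the pair-key case, assuming Python, where tuples are hashable and can be used directly as dictionary keys. The `combine_hashes` helper is hypothetical and only illustrates a "hash combination step"; it is not the function CPython actually uses for tuples.

```python
# Tuples are hashable, so a pair can serve as a single dictionary key.
table = {}
table[(3, 5)] = "a"
table[(1, 2)] = "b"
table[(3, 1)] = "c"


def combine_hashes(x, y):
    """Illustrative combination step (hypothetical, not CPython's tuple hash)."""
    # Two constant-time hashes plus constant-time arithmetic: still O(1).
    return (hash(x) * 31 + hash(y)) & 0xFFFFFFFF


has_1_5 = (1, 5) in table   # expected O(1) membership test
has_3_1 = (3, 1) in table
```

Both membership tests hash two integers and compare at most a few stored pairs, so the key having two components does not change the expected constant cost.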

Yes, and this is nothing new. Dictionaries commonly have string keys, and if you view a string as an array of characters, the key is already a list of chars. By the same reasoning, your dictionary also works in O(1) (as long as the key length is constant).

Related

What is the run-time of inserting the words in a string into a hash table?

More info:
n is the number of characters in the string
the hash table should keep track of each word's frequency; i.e., the hash table should store key-value pairs, where the key is a word in the input string, and the value is the number of times that word occurs in the input string
We've had some heated debates about this question at work, and I'd like to see what you guys think the answer is.
An important thing to consider when implementing the insert function is how collisions are handled and which resolution technique is used. This has a large influence on both put() and get() operations.
Collision resolution is implemented differently in each library. The core idea is to keep all colliding keys in the same bucket and, during retrieval, to traverse the colliding keys and apply an equality check to find the given key. Note that both the keys and the values must be kept in the bucket to facilitate this equality check.
So the keys (the words) are also stored in the hash table along with the counts.
Another thing to consider is that, during insertion, a hash code is generated for the given key. We can consider this to be constant, O(1), for every key.
Now, answering the question.
Given a string of length 'n'
Inserting all the words and frequencies involves the following steps.
1. Split the given string into words, with the given delimiter - O(n)
2. For word in words - O(n)
    # Considering the copy of a word of length k as constant and very small compared to 'n'.
    # And collision resolution cost amortized across all inserts.
    if MAP.exists(word) - O(1)
        MAP.set(word, MAP.get(word)+1) - amortized O(1)
    else
        MAP.set(word, 1) - O(1)
Overall, the run time for inserting the words of a string into a hash table is O(n), because the for loop runs 'n/k' times and 'k' is constant and small compared to n.
If H is your hashtable mapping words to counts, then H[s] and H[s] = <new value> are both O(len(s)). That's because computing the hashcode for s requires you to read every character of s, and once you've found the relevant slot in the hashtable, you need to compare s to whatever's stored there. Of course, the usual hashtable complexities apply too: an expected O(1) of these comparisons are performed.
With respect to your original problem, you can break your string of length n into words in O(n) time. Then for each word, you need an O(len(word)) operation to update the hashtable. Over all the words, that is O(len(word_1) + len(word_2) + ... + len(word_m)) = O(n), since the sum of the lengths of the words is at most n, the length of the original string.
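A minimal sketch of that word-count loop, assuming Python and whitespace as the delimiter (the question does not specify one). The function name `count_words` is made up for illustration.

```python
def count_words(text):
    """Return a dict mapping each whitespace-separated word to its frequency."""
    counts = {}
    for word in text.split():   # splitting reads every character once: O(n)
        # Hashing and comparing `word` costs O(len(word));
        # summed over all words this is at most n.
        counts[word] = counts.get(word, 0) + 1
    return counts


freq = count_words("the cat and the hat and the bat")
```

Here `freq["the"]` is 3 and `freq["and"]` is 2; the total work is proportional to the length of the input string.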

Hashing analysis in hashtable

The search time for a hash value is O(1+alpha) , where
alpha = number of elements/size of table
I don't understand why the 1 is added?
The expected number of elements examined is
$\frac{1}{n}\sum_{i=1}^{n}\left(1+\frac{i-1}{m}\right)$
I don't understand this either. How is it derived?
(I know how to evaluate the expression above, but I want to understand how one arrives at it.)
EDIT : n is number of elements present and m is the number of slots or the size of the table
I don't understand why the 1 is added?
The O(1) is there to tell that even if there is no element in a bucket or the hash table at all, you'll have to compute the key hash value and thus it won't be instantaneous.
Your second part needs more precision. See my comments.
EDIT:
Your second expression comes from an "amortized analysis": the idea is to consider a sequence of n insertions into an initially empty hash table. Looking up the i-th inserted element takes O(1) for hashing plus an expected (i-1)/m for searching the bucket, assuming the i-1 earlier elements are spread evenly over the m buckets. Averaging over all n elements and resolving the sum gives the O(1+alpha) time.
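For completeness, the sum can be resolved as follows, using the question's n (number of elements), m (number of slots) and alpha = n/m. This is only the arithmetic, under the even-distribution assumption stated in the answer.

```latex
\frac{1}{n}\sum_{i=1}^{n}\left(1+\frac{i-1}{m}\right)
  = 1 + \frac{1}{nm}\sum_{i=1}^{n}(i-1)
  = 1 + \frac{1}{nm}\cdot\frac{n(n-1)}{2}
  = 1 + \frac{n-1}{2m}
  = 1 + \frac{\alpha}{2} - \frac{\alpha}{2n}
  = O(1+\alpha)
```

The leading 1 is the element that is found (plus the hashing itself); the remaining term is the expected number of other elements sharing its bucket.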

Hashtable and the bucket array

I read that a hash table has a bucket array, but I don't understand what that bucket array contains.
Does it contain the hashing index? The entry (key/value pair)? Both?
This image, for me, is not very clear:
(reference)
So, which is a bucket array?
The array index is mostly equivalent to the hash value (well, the hash value mod the size of the array), so there's no need to store that in the array at all.
As to what the actual array contains, there are a few options:
If we use separate chaining:
A reference to a linked-list of all the elements that have that hash value. So:
LinkedList<E>[]
A linked-list node (i.e. the head of the linked-list) - similar to the first option, but we instead just start off with the linked-list straight away without wasting space by having a separate reference to it. So:
LinkedListNode<E>[]
If we use open addressing, we're simply storing the actual element. If there's another element with the same hash value, we use some reproducible technique to find a place for it (e.g. we just try the next position). So:
E[]
There may be a few other options, but the above are the best-known, with separate-chaining being the most popular (to my knowledge)
* I'm assuming some familiarity with generics and Java/C#/C++ syntax - E here is simply the type of the element we're storing, LinkedList<E> means a LinkedList storing elements of type E. X[] is an array containing elements of type X.
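A minimal sketch of the separate-chaining option, assuming Python and using a plain list per bucket in place of a linked list. The class name `ChainedHashTable` and its methods are made up for illustration.

```python
class ChainedHashTable:
    """Separate chaining: each bucket holds a list of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        # The bucket array. The index is hash(key) % num_buckets,
        # so the index itself is never stored.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new (possibly colliding) key

    def get(self, key, default=None):
        for k, v in self._bucket(key):    # scan only the selected bucket
            if k == key:
                return v
        return default


table = ChainedHashTable(num_buckets=4)
table.put(1, "one")
table.put(5, "five")   # 1 % 4 == 5 % 4, so both keys share bucket 1
table.put(1, "uno")
```

The stored key is what distinguishes the two entries in bucket 1, which is why the bucket keeps keys alongside values.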
What goes into the bucket array depends a lot on what is stored in the hash table, and also on the collision resolution strategy.
When you use linear probing or another open addressing technique, your bucket table stores keys or key-value pairs, depending on the use of your hash table *.
When you use a separate chaining technique, then your bucket array stores pairs of keys and the headers of your chaining structure (e.g. linked lists).
The important thing to remember about the bucket array is that it establishes a mapping between a hash code and a group of zero or more keys. In other words, given a hash code and a bucket array, you can find out, in constant time, what are the possible keys associated with this hash code (enumerating the candidate keys may be linear, but finding the first one needs to be constant time in order to meet hash tables' performance guarantee of amortized constant time insertions and constant-time searches on average).
* If your hash table is used for checking membership (i.e. it represents a set of keys) then the bucket array stores keys; otherwise, it stores key-value pairs.
In practice, it contains a linked list of the entries that have been computed (by hashing the key) to go into that bucket.
In a hash table there are collisions most of the time, that is, different elements have the same hash value. Elements with the same hash value are stored in one bucket, so for each hash value you have a bucket containing all elements that have this hash value.
A bucket is a linked list of key-value pairs. The hash index tells you "which bucket", and the "key" in the key-value pair tells you "which entry in that bucket".
Also check out
hashing in Java -- structure & access time; I go into more detail there.

Programming : find the first unique string in a file in just 1 pass

Given a very long list of product names, find the first product name that is unique (occurs exactly once). You can only iterate over the file once.
I am thinking of using a hashmap and storing the (key, count) entries in a doubly linked list,
basically a linked hashmap.
Can anyone optimize this or suggest a better approach?
Since you can only iterate the list once, you have to store:
- each string that occurs exactly once, because it could be the output
- their relative position within the list
- each string that occurs more than once (or their hash, if you're not afraid)
Notably, you don't have to store the relative positions of strings that occur more than once.
You need:
- efficient storage of the set of strings. A hash set is a good candidate, but a trie could offer better compression depending on the set of strings.
- efficient lookup by value. This rules out a bare list. A hash set is the clear winner, but a trie also performs well. You can store the leaves of the trie in a hash set.
- efficient lookup of the minimum. This asks for a linked list.
Conclusion:
Use a linked hash-set for the set of strings, and a flag indicating if they're unique. If you're fighting for memory, use a linked trie. If a linked trie is too slow, store the trie leaves in a hash map for look-up. Include only the unique strings in the linked list.
In total, your nodes could look like: Node:{Node[] trieEdges, Node trieParent, String inEdge, Node nextUnique, Node prevUnique}; Node firstUnique, Node[] hashMap
If you strive for ease of implementation, you can have two hash-sets instead (one linked).
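A minimal sketch of the "two hash sets, one linked" variant, assuming Python 3.7+, where a plain `dict` preserves insertion order and can stand in for the linked hash set. The function name `first_unique` is made up for illustration.

```python
def first_unique(names):
    """Return the first name occurring exactly once, in a single pass, or None."""
    unique = {}        # insertion-ordered; plays the role of the linked hash set
    repeated = set()   # names seen more than once; no position is needed
    for name in names:             # the only pass over the input
        if name in repeated:
            continue
        if name in unique:
            del unique[name]       # second occurrence: no longer a candidate
            repeated.add(name)
        else:
            unique[name] = None
    return next(iter(unique), None)


result = first_unique(["pen", "ink", "pen", "cap", "ink"])
```

Here `result` is `"cap"`. If many early candidates get deleted, `collections.OrderedDict` may be a better fit, since it keeps an explicit linked list of the remaining keys.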
The following algorithm solves it in O(N+M) time.
where
N=number of strings
M=total number of characters put together in all strings.
The steps are as follows:
`1. Create a hash value for each string`
`2. Xor it and find the one which didn't have a pair`
XOR has the useful property that a xor a = 0 and b xor 0 = b.
Tips to generate the hash value for a string:
Use a 27 base number system, and give a a value of 1, b a value of 2 and so on till z which gets 26, and so if string is "abc" , we compute hash value as:
H = 3*(27 power 0) + 2*(27 power 1) + 1*(27 power 2)
=786
You could use the modulus operator to make hash values small enough to fit in 32-bit integers. If you do that, keep an eye out for collisions, which are basically two strings that are different but get the same hash value due to the modulus operation.
Mostly, I guess you won't be needing it.
So compute the hash for each string, then start from the first hash and keep XOR-ing; the result will hold the hash value of the string that didn't have a pair.
Caution: This is useful only when all the other strings occur in pairs, and it yields the hash of the unpaired string rather than the string itself. Still, this is a good idea to start with; that's why I answered it.
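A minimal sketch of the XOR idea, assuming Python, lowercase a-z input, and that every string except one occurs exactly twice. The helper names are made up for illustration, and no modulus is applied because Python integers do not overflow.

```python
from functools import reduce
from operator import xor


def base27_hash(word):
    """Base-27 hash for lowercase a-z, with a=1, ..., z=26."""
    h = 0
    for ch in word:
        h = h * 27 + (ord(ch) - ord("a") + 1)
    return h


def unpaired_hash(words):
    """XOR all hashes; paired hashes cancel, leaving the unpaired one."""
    return reduce(xor, (base27_hash(w) for w in words), 0)


words = ["abc", "xy", "abc", "q", "xy"]
leftover = unpaired_hash(words)
```

Here `base27_hash("abc")` is 786, matching the worked value, and `leftover` equals `base27_hash("q")`. Mapping the leftover hash back to its string still needs a lookup table or a second pass.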
Using a linked hashmap is obvious enough. Otherwise, you could use a TreeMap style data structure where the strings are ordered by count. So as soon as you are done reading the input, the root of your tree is unique if a unique string exists. Unlike a linked hash map, insertion takes at most O(log n) as opposed to O(n). You can read up on TreeMaps for insight on how to augment a basic TreeMap into what you need. Also in your linked hashmap you may have to travel O(n) to find your first unique key. With a TreeMap style data structure, your look up is O(1) -- the root. Even if more unique keys exist, the first one you encountered will be the root. The subsequent ones will be children of the root.

Determining if a sequence T is a sorting of a sequence S in O(n) time

I know that one can easily determine if a sequence is sorted in O(n) time. However, how can we ensure that some sequence T is indeed the sorting of elements from sequence S in O(n) time?
That is, someone might have an algorithm that outputs some sequence T that is indeed in sorted order, but may not contain any elements from sequence S, so how can we check that T is indeed a sorted sequence of S in O(n) time?
Get the length L of S.
Check the length of T as well. If they differ, you are done!
Let Hs be a hash map with something like 2L buckets of all elements in S.
Let Ht be a hash map (again, with 2L buckets) of all elements in T.
For each element in T, check that it exists in Hs.
For each element in S, check that it exists in Ht.
This will work if the elements are unique in each sequence. See wcdolphin's answer for the small changes needed to make it work with non-unique sequences.
I have NOT taken memory consumption into account. Creating two hash maps, each double the size of its sequence, may be expensive. This is the usual tradeoff between speed and memory.
While Emil's answer is very good, you can do slightly better.
Fundamentally, in order for T to be a reordering of S it must contain all of the same elements. That is to say, every element must occur the same number of times in T as in S. Thus, we will:
Create a Hash table of all elements in S, mapping from the 'Element' to the number of occurrences.
Iterate through every element in T, decrementing the number of times the current element occurred.
If the number of occurrences is zero, remove it from the hash.
If the current element is not in the hash, T is not a reordering of S.
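A minimal sketch of the counting approach combined with the sortedness check, assuming Python, hashable and comparable elements, and non-decreasing order. The function name `is_sorting_of` is made up for illustration.

```python
from collections import Counter


def is_sorting_of(t, s):
    """Return True if t is sorted (non-decreasing) and is a permutation of s."""
    if len(t) != len(s):
        return False
    # O(n) sortedness check on t.
    if any(t[i] > t[i + 1] for i in range(len(t) - 1)):
        return False
    remaining = Counter(s)       # element -> number of occurrences in s
    for x in t:
        if remaining[x] == 0:    # x is absent from s, or appears too often in t
            return False
        remaining[x] -= 1
    return True                  # equal lengths mean every count reached zero


ok = is_sorting_of([1, 2, 2, 3], [2, 3, 1, 2])
```

Each step is a single pass with expected O(1) hash operations per element, so the whole check is expected O(n).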
Create a hash map of both sequences. Use the character as key, and the count of the character as value. If a character has not been added yet add it with a count of 1. If a character has already been added increase its count by 1.
Verify that for each character in the input sequence that the hash map of the sorted sequence contains the character as key and has the same count as value.
I believe this is an O(n^2) problem because:
Assume the data structure you use to store the elements is a linked list, so that removing an element takes minimal operations.
You will be doing an S.contains(element of T) for every element of T, and one check that they are the same size.
You cannot assume that S is ordered, and therefore need to do an element-by-element comparison for every element.
The worst case would be if S is the reverse of T.
This would mean that for element (0+x) of T you would do (n-x) comparisons, if you remove each successfully matched element.
This results in (n*(n+1))/2 operations, which is O(n^2).
There might be some other, cleverer algorithm out there, though.
