Programming: find the first unique string in a file in just 1 pass - algorithm

Given a very long list of Product Names, find the first product name that is unique (occurs exactly once). You can only iterate over the file once.
I am thinking of using a hashmap and storing the (key, count) pairs in a doubly linked list, basically a linked hashmap.
Can anyone optimize this or suggest a better approach?

Since you can only iterate over the list once, you have to store:
- each string that occurs exactly once, because it could be the output,
- the relative positions of those strings within the list,
- each string that occurs more than once (or their hashes, if you're not afraid of collisions).
Notably, you don't have to store the relative positions of strings that occur more than once.
You need:
- efficient storage of the set of strings. A hash set is a good candidate, but a trie could offer better compression depending on the set of strings;
- efficient lookup by value. This rules out a bare list. A hash set is the clear winner, but a trie also performs well. You can store the leaves of the trie in a hash set;
- efficient lookup of the minimum (the earliest remaining unique string). This calls for a linked list.
Conclusion:
Use a linked hash-set for the set of strings, with a flag indicating whether each is unique. If you're fighting for memory, use a linked trie. If a linked trie is too slow, store the trie leaves in a hash map for look-up. Include only the unique strings in the linked list.
In total, your nodes could look like: Node { Node[] trieEdges, Node trieParent, String inEdge, Node nextUnique, Node prevUnique }, plus a Node firstUnique and a Node[] hashMap.
If you strive for ease of implementation, you can have two hash-sets instead (one linked).
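For concreteness, here is a minimal sketch of the linked-hashmap idea in Java, assuming the product names arrive as an Iterable<String> (the class and method names are just illustrative): count occurrences in a single pass over the input, then scan the map, which preserves first-seen order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: one pass over the product names, counting occurrences while
// LinkedHashMap preserves first-seen order. The second loop runs over the map,
// not the file, so the one-pass constraint on the input still holds.
public class FirstUnique {
    public static String firstUnique(Iterable<String> names) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : names) {                 // the single pass over the input
            counts.merge(name, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == 1) {                // first key seen exactly once
                return e.getKey();
            }
        }
        return null;                                // no unique name exists
    }
}
```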

The following algorithm solves it in O(N + M) time, where N is the number of strings and M is the total number of characters across all strings.
The steps are as follows:
`1. Create a hash value for each string`
`2. Xor it and find the one which didn't have a pair`
XOR has the useful property that a xor a = 0 and b xor 0 = b.
Tips to generate the hash value for a string:
Use a base-27 number system: give 'a' the value 1, 'b' the value 2, and so on up to 'z', which gets 26. So if the string is "abc", we compute the hash value as:
H = 3*(27^0) + 2*(27^1) + 1*(27^2) = 786
You could use the modulus operator to make the hash values small enough to fit in 32-bit integers. If you do that, keep an eye out for collisions, which are two different strings that end up with the same hash value because of the modulus operation. In most cases you probably won't need it.
So compute the hash for each string, then start from the first hash and keep XOR-ing; the result will be the hash value of the string that didn't have a pair.
Caution: this is useful only when strings occur in pairs. Still, it is a good idea to start with, which is why I answered.
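As a toy illustration (not a general solution, per the caution above), here is a sketch in Java, assuming every string appears exactly twice except one and consists of lowercase a-z; the modulus constant is an illustrative choice, not part of the original answer.

```java
// Sketch of the XOR idea under the strict precondition stated above: every
// string appears exactly twice except one, which appears exactly once.
// hash() is the base-27 scheme described; the modulus keeps values small but
// introduces a (small) risk of collisions.
public class XorUnique {
    static final long MOD = 1_000_000_007L;   // illustrative modulus

    static long hash(String s) {
        long h = 0;
        for (char c : s.toCharArray()) {      // assumes lowercase a-z
            h = (h * 27 + (c - 'a' + 1)) % MOD;
        }
        return h;
    }

    // Returns the hash of the string without a pair; a separate table
    // (hash -> string) would be needed to recover the string itself.
    static long unpairedHash(Iterable<String> strings) {
        long acc = 0;
        for (String s : strings) {
            acc ^= hash(s);                   // pairs cancel out: a ^ a = 0
        }
        return acc;
    }
}
```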

Using a linked hashmap is obvious enough. Otherwise, you could use a TreeMap-style data structure where the strings are ordered by count. As soon as you are done reading the input, the root of your tree is unique if a unique string exists. Unlike a linked hashmap, insertion takes at most O(log n) as opposed to O(n). You can read up on TreeMaps for insight on how to augment a basic TreeMap into what you need. Also, in your linked hashmap you may have to travel O(n) to find your first unique key; with a TreeMap-style data structure, the lookup is O(1): the root. Even if more unique keys exist, the first one you encountered will be the root, and the subsequent ones will be children of the root.
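A rough sketch of that idea in Java, using a TreeSet ordered by (count, first-seen position) instead of a hand-augmented TreeMap; note that in a balanced search tree the candidate sits at the minimum (leftmost) element rather than literally at the root, but the O(log n)-per-insert bookkeeping is the same. Class and field names are illustrative.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Entries are ordered by (count, first-seen index). After the single pass,
// the minimum element is the first unique string if its count is 1.
public class FirstUniqueByCount {
    static final class Entry {
        final String s;
        final long firstIndex;
        long count = 1;
        Entry(String s, long firstIndex) { this.s = s; this.firstIndex = firstIndex; }
    }

    public static String firstUnique(Iterable<String> names) {
        Map<String, Entry> byName = new HashMap<>();
        TreeSet<Entry> ordered = new TreeSet<>(
                Comparator.<Entry>comparingLong(e -> e.count)
                          .thenComparingLong(e -> e.firstIndex));
        long i = 0;
        for (String name : names) {
            Entry e = byName.get(name);
            if (e == null) {
                e = new Entry(name, i);
                byName.put(name, e);
                ordered.add(e);
            } else {
                ordered.remove(e);        // remove before mutating the sort key
                e.count++;
                ordered.add(e);           // re-insert so the ordering stays valid
            }
            i++;
        }
        Entry min = ordered.isEmpty() ? null : ordered.first();
        return (min != null && min.count == 1) ? min.s : null;
    }
}
```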

Related

Hashing Access time with multi variable key

Suppose a dictionary has 2 variable keys instead of 1, like:
dictionary[3,5] = Something
dictionary[1,2] = Something
dictionary[3,1] = Something
Would the search time still be O(1)? In case I need to find whether dictionary[1,5] exists, would it yield constant time?
Thanks in advance.
When you do a lookup in a hash table, the cost involved is the cost of:
- hashing the item to look up, and
- comparing that item against (an expected O(1) number of) other entries in the table.
We can write the expected cost of a hash table lookup as O(hash-cost + compare-cost).
In your case, the cost of hashing a pair instead of a single element is still O(1) - just hash each element of the pair and apply some hash combination step to the two values. Similarly, the cost of comparing two pairs is also O(1) (assuming each individual element of the pair can be compared in constant time). As a result, a lookup will still be (expected) constant time.
The above argument generalizes to any fixed-size tuple used as a key. You typically have to worry about the cost of hashing and comparing keys when they have variable length, as would be the case if you were hashing strings with no length restriction.
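To make the constant-cost argument concrete, here is a small Java illustration, with an illustrative pair class and Objects.hash as the combination step:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hashing a pair is just hashing each element and combining the two values,
// so a dictionary keyed on pairs still gives expected O(1) lookups.
public class PairKeyDemo {
    static final class IntPair {
        final int a, b;
        IntPair(int a, int b) { this.a = a; this.b = b; }

        @Override public int hashCode() {
            return Objects.hash(a, b);            // O(1) hash combination
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof IntPair)) return false;
            IntPair p = (IntPair) o;
            return a == p.a && b == p.b;          // O(1) comparison
        }
    }

    public static void main(String[] args) {
        Map<IntPair, String> dictionary = new HashMap<>();
        dictionary.put(new IntPair(3, 5), "Something");
        dictionary.put(new IntPair(1, 2), "Something");
        dictionary.put(new IntPair(3, 1), "Something");
        System.out.println(dictionary.containsKey(new IntPair(1, 5)));  // false, in expected O(1) time
    }
}
```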
Yes. This is not new. Ordinarily, you can have a dictionary with string keys. If you see a string as an array of characters, you already have a list of chars as the key. So, by the same reasoning, you can say your dictionary works in O(1) too (if the length of the key is constant).

A red black tree with the same key multiple times: store collections in the nodes or store them as multiple nodes?

Apparently you could do either, but the former is more common.
Why would you choose the latter and how does it work?
I read this: http://www.drdobbs.com/cpp/stls-red-black-trees/184410531, which made me think that they did. It says:
insert_always is a status variable that tells rb_tree whether multiple instances of the same key value are allowed. This variable is set by the constructor and is used by the STL to distinguish between set and multiset and between map and multimap. set and map can only have one occurrence of a particular key, whereas multiset and multimap can have multiple occurrences.
Although now I think it doesn't necessarily mean that; they might still be using containers.
I'm thinking all the nodes with the same key would have to be in a row, because you either have to store all nodes with the same key on the right side or the left side. So if you store equal nodes to the right and insert 1000 1s and one 2, you'd basically have a linked list, which would ruin the properties of the red black tree.
Is the reason why i can't find much on it that it's just a bad idea?
Downsides of storing duplicates as multiple nodes:
- It expands the tree size, which makes searches slower.
- If you want to retrieve all values for key K, you need M*log(N) time, where N is the total number of nodes and M is the number of values for key K, unless you introduce extra code (which complicates the data structure) to implement a linked list over these values. (When storing a collection, it only takes log(N) time, and it's simple to implement.)
- Deletion is more costly: with the multi-node method you need to remove a node on every delete, whereas with collection storage you only remove the node for key K when the last value of key K is deleted.
I can't think of any good side of the multi-node method.
Binary search trees by definition cannot contain duplicates. If you used them to produce a sorted list, throwing out the duplicates would produce an incorrect result.
I was working on an implementation of red-black trees in PHP when I ran into the duplicate issue. We are going to use the tree for sorting and searching.
I am considering adding an occurrence value to the node data type. When a duplicate is encountered, just increment the occurrence count. When walking the tree to produce output, just repeat the value by its number of occurrences. I think I would still have a valid BST and avoid a whole chain of duplicate values, which preserves the optimal search time.
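A sketch of that occurrence-count idea, shown in Java rather than PHP; the red-black fix-up (rotations and recolouring) is omitted since only the duplicate handling is of interest here, and the class names are illustrative.

```java
// Duplicates never create new nodes, they just bump a counter, so the tree
// keeps the strict "no duplicate keys" BST property.
class RBNode {
    int key;
    int occurrences = 1;          // how many times this key was inserted
    boolean red = true;
    RBNode left, right, parent;

    RBNode(int key) { this.key = key; }
}

class CountingRBTree {
    RBNode root;

    void insert(int key) {
        RBNode parent = null, cur = root;
        while (cur != null) {
            parent = cur;
            if (key == cur.key) { cur.occurrences++; return; }  // duplicate: just count it
            cur = (key < cur.key) ? cur.left : cur.right;
        }
        RBNode node = new RBNode(key);
        node.parent = parent;
        if (parent == null) root = node;
        else if (key < parent.key) parent.left = node;
        else parent.right = node;
        // ... standard red-black fix-up (rotations / recolouring) would go here
    }

    // In-order walk that repeats each key 'occurrences' times, as described above.
    void inorder(RBNode n, java.util.List<Integer> out) {
        if (n == null) return;
        inorder(n.left, out);
        for (int i = 0; i < n.occurrences; i++) out.add(n.key);
        inorder(n.right, out);
    }
}
```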

Given a flat file of IP Ranges and mappings, find a city given an IP

This is the question:
Given a flat text file that contains a range of IP addresses that map
to a location (e.g.
192.168.0.0-192.168.0.255 = Boston, MA), come up with an algorithm that will find a city for a specific ip address if a mapping exists.
My only idea is parse the file, and turn the IP ranges into just ints (multiplying by 10/100 if it's missing digits) and place them in a list, while also putting the lower of the ranges into a hash as the key with the location as a value. Sort the list and perform a slightly modified binary search. If the index is odd, -1 and look in the hash. If it's even, just look in the hash.
Any faults in my plans, or better solutions?
Your approach seems perfectly reasonable.
If you are interested in doing a bit of research / extra coding, there are algorithms that will asymptotically outperform the standard binary search technique, relying on the fact that your IP addresses can be interpreted as integers in the range from 0 to 2^32 - 1. For example, the van Emde Boas tree and y-fast trie data structures can implement the predecessor search operation that you're looking at in time O(log log U), where U is the maximum possible IP address, as opposed to the O(log N) approach that binary search uses. The constant factors are higher, though, which means there is no guarantee that this approach will be faster, but it might be worth exploring.
Hope this helps!
The problem smells of ranges, and one of the good data structures for this problem is a segment tree; there are plenty of resources on it to help you get started.
The root of the segment tree can represent the addresses (0.0.0.0 - 255.255.255.255). The left sub-tree would represent the addresses (0.0.0.0 - 127.255.255.255) and the right sub-tree would represent the range (128.0.0.0 - 255.255.255.255), and so on. This will go on till we reach ranges which cannot be further sub-divided. Say, if we have the range 32.0.0.0 - 63.255.255.255, mapped to some arbitrary city, it will be a leaf node, we will not further subdivide that range when we arrive there, and tag it to the specific city.
To search for a specific mapping, we follow the tree, just as we do in a Binary Search Tree. If your IP lies in the range of the left sub-tree, move to the left sub-tree, else move to the right sub-tree.
The good parts:
You need not have all sub-trees, only add the sub-trees which are required. For example, if in your data, there is no city mapped for the range (0.0.0.0 - 127.255.255.255), we will not construct that sub-tree.
We are space efficient. If the entire range is mapped to one city, we will create only the root node!
This is a dynamic data-structure. You can add more cities, split-up ranges later on, etc.
You will be making a constant number of operations, since the maximum depth of the tree is 4 x log2(256) = 32. For this particular problem it turns out that segment trees are as fast as van Emde Boas trees, and require less space (O(N)).
This is a simple but non-trivial data structure, which is better than sorting because it is dynamic, and easier to explain to your interviewer than van Emde Boas trees.
This is one of the easiest non-trivial data-structures to code :)
Please note that in some Segment Tree tutorials, they use arrays to represent the tree. This is probably not what you want, since we would not be populating the entire tree, so dynamically allocating nodes, just like we do in a standard Binary Tree is the best.
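Here is one possible rendering of such a dynamically allocated segment tree over the 32-bit address space, sketched in Java; the class names and the ipToLong helper are illustrative, and overlapping or overriding ranges are not treated specially.

```java
// Ranges are inserted as [lo, hi] pairs of addresses converted to unsigned
// 32-bit values (held in longs to avoid sign trouble); nodes are created
// lazily, and a node whose whole range maps to one city just stores that city
// and stops subdividing.
public class IpSegmentTree {
    static final long MIN = 0L, MAX = 0xFFFFFFFFL;   // 0.0.0.0 .. 255.255.255.255

    static final class Node {
        String city;              // set when this node's entire range maps to one city
        Node left, right;
    }

    final Node root = new Node();

    static long ipToLong(String ip) {
        String[] p = ip.split("\\.");
        return (Long.parseLong(p[0]) << 24) | (Long.parseLong(p[1]) << 16)
             | (Long.parseLong(p[2]) << 8)  |  Long.parseLong(p[3]);
    }

    public void addRange(String fromIp, String toIp, String city) {
        insert(root, MIN, MAX, ipToLong(fromIp), ipToLong(toIp), city);
    }

    private void insert(Node node, long lo, long hi, long insLo, long insHi, String city) {
        if (insLo <= lo && hi <= insHi) {     // node range fully covered: tag it and stop
            node.city = city;
            return;
        }
        long mid = lo + (hi - lo) / 2;
        if (insLo <= mid) {
            if (node.left == null) node.left = new Node();
            insert(node.left, lo, mid, insLo, insHi, city);
        }
        if (insHi > mid) {
            if (node.right == null) node.right = new Node();
            insert(node.right, mid + 1, hi, insLo, insHi, city);
        }
    }

    // Walk down at most 32 levels, remembering the most specific city seen.
    public String lookup(String ip) {
        long target = ipToLong(ip), lo = MIN, hi = MAX;
        Node node = root;
        String city = null;
        while (node != null) {
            if (node.city != null) city = node.city;
            long mid = lo + (hi - lo) / 2;
            if (target <= mid) { node = node.left;  hi = mid; }
            else               { node = node.right; lo = mid + 1; }
        }
        return city;
    }
}
```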
My only idea is parse the file, and turn the IP ranges into just ints (multiplying by 10/100 if it's missing digits)...
If following this approach, you would probably want to multiply by 256^3, 256^2, 256 and 1 respectively for A, B, C and D in an address A.B.C.D. That effectively recreates the IP address as a 32-bit unsigned number.
... and place them in a list, while also putting the lower of the ranges into a hash as the key with the location as a value. Sort the list and perform a slightly modified binary search. If the index is odd, -1 and look in the hash. If it's even, just look in the hash.
I would suggest creating a contiguous array (a std::vector) containing structs with the lower and upper ranges (and location name - discussed below). Then as you say you can binary search for a range including a specific value, without any odd/even hassles.
Using the lower end of the range as a key in a hash is one way to avoid storing the location names in the array, but given the average number of characters in a city name, the likely size of pointers, and the choice between a sparsely populated hash table, lengthy displacement lists to search through successive alternative buckets, or further indirection to arbitrary-length containers, you'd need to be pretty desperate to bother. In the first instance, storing the location in a struct alongside the IP value range seems good.
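A sketch of that contiguous-array-plus-binary-search approach, shown here in Java (the answer mentions std::vector, but the structure is the same); the parsing and field names are illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Each record holds the numeric lower and upper bound of a range plus the
// city, the array is sorted by the lower bound, and a binary search finds the
// candidate range for a query.
public class IpRangeTable {
    static final class Range {
        final long lo, hi;
        final String city;
        Range(long lo, long hi, String city) { this.lo = lo; this.hi = hi; this.city = city; }
    }

    private final List<Range> ranges = new ArrayList<>();

    static long ipToLong(String ip) {
        String[] p = ip.split("\\.");
        // A.B.C.D  ->  A*256^3 + B*256^2 + C*256 + D
        return (Long.parseLong(p[0]) << 24) | (Long.parseLong(p[1]) << 16)
             | (Long.parseLong(p[2]) << 8)  |  Long.parseLong(p[3]);
    }

    // Call once per line like "192.168.0.0-192.168.0.255 = Boston, MA".
    public void addLine(String line) {
        String[] halves = line.split("=", 2);
        String[] ips = halves[0].trim().split("-");
        ranges.add(new Range(ipToLong(ips[0]), ipToLong(ips[1]), halves[1].trim()));
    }

    public void finishLoading() {
        ranges.sort(Comparator.comparingLong(r -> r.lo));
    }

    // Binary search for the last range whose lower bound is <= ip, then check
    // that the upper bound also covers it.
    public String lookup(String ip) {
        long target = ipToLong(ip);
        int lo = 0, hi = ranges.size() - 1, best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (ranges.get(mid).lo <= target) { best = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        if (best >= 0 && target <= ranges.get(best).hi) return ranges.get(best).city;
        return null;   // no mapping exists
    }
}
```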
Alternatively, you could create a tree based on e.g. the individual 0-255 IP values: each level in the tree could be either an array of 256 values for direct indexing, or a sorted array of populated values. That can reduce the number of IP value comparisons you're likely to need to make (O(log2N) to O(1)).
In your example, 192.168.0.0-192.168.0.255 = Boston, MA.
Will the first three octets (192.168.0) be the same for both IP addresses in the entry?
Also, will the first three octets be unique for a city?
If so, then this problem can be solved more easily.

Designing small comparable objects

Intro
Consider you have a list of key/value pairs:
(0,a) (1,b) (2,c)
You have a function that inserts a new value between two existing pairs, and you need to give it a key that keeps the order:
(0,a) (0.5,z) (1,b) (2,c)
Here the new key was chosen as the average of the keys of the bounding pairs.
The problem is that your list may see millions of inserts. If these inserts all land close to each other, you may end up with keys like 2^(-1000000), which are not easily storable in any standard or special number class.
The problem
How can you design a system for generating keys that:
Gives the correct result (larger/smaller than) when compared to all the rest of the keys.
Takes up only O(logn) memory (where n is the number of items in the list).
My tries
First I tried different number classes, like fractions and even polynomials, but I could always find examples where the key size would grow linearly with the number of inserts.
Then I thought about saving pointers to a number of other keys and storing the less-than/greater-than relationships, but that would always require at least O(sqrt(n)) memory and time for comparison.
Extra info: Ideally the algorithm shouldn't break when pairs are deleted from the list.
I agree with snowlord. A tree would be ideal in this case. A red-black tree would prevent things from getting unbalanced. If you really need keys, though, I'm pretty sure you can't do better than using the average of the keys on either side of the value you need to insert. That will increase your key length by 1 bit each time. What I recommend is renormalizing the keys periodically. Every x inserts, or whenever you detect keys being generated too close together, renumber everything from 1 to n.
Edit:
You don't need to compare keys if you're inserting by position instead of key. The compare function for the red-black tree would just use the order in the conceptual list, which lines up with in-order in the tree. If you're inserting in position 4 in the list, insert a node at position 4 in the tree (using in-ordering). If you're inserting after a certain node (such as "a"), it's the same. You might have to use your own implementation if whatever language/library you're using requires a key.
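A minimal sketch of the renumbering idea in Java, assuming integer keys spaced by a large gap; the gap size and class names are illustrative choices, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

// Keys are longs spaced by a large gap, a new key is the midpoint of its
// neighbours, and when two neighbours get too close the whole list is
// renumbered. Keys stay O(log n) bits between renumberings.
public class OrderedKeys {
    static final long GAP = 1L << 32;

    static final class Pair {
        long key;
        final String value;
        Pair(long key, String value) { this.key = key; this.value = value; }
    }

    private final List<Pair> list = new ArrayList<>();   // kept in key order

    // Insert value at list position pos (0-based), assigning it an ordering key.
    public void insertAt(int pos, String value) {
        long lo = (pos == 0) ? 0 : list.get(pos - 1).key;
        long hi = (pos == list.size()) ? lo + 2 * GAP : list.get(pos).key;
        if (hi - lo < 2) {                 // no room between neighbours
            renumber();
            insertAt(pos, value);
            return;
        }
        list.add(pos, new Pair(lo + (hi - lo) / 2, value));
    }

    // Spread keys evenly again; this is the renormalization step.
    private void renumber() {
        long k = GAP;
        for (Pair p : list) { p.key = k; k += GAP; }
    }
}
```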
I don't think you can avoid keys of size O(n) without reassigning keys during operation.
As a practical solution I would build an inverted search tree, with pointers from the children to the parents, where each pointer is marked whether it is coming from a left or right child. To compare two elements you need to find the closest common ancestor, where the path to the elements diverges.
Reassigning keys is then rebalancing of the tree, you can do that by some rotation that doesn't change the order.

Find a common element within N arrays

If I have N arrays, what is the best way (in terms of time complexity; space is not important) to find the common elements? You can stop after finding just one element.
Edit: The elements are all Numbers.
Edit: These are unsorted. Please do not sort and scan.
This is not a homework problem. Somebody asked me this question a long time ago. He was using a hash to solve the problem and asked me if I had a better way.
Create a hash index with elements as keys and counts as values. Loop through all values and update the counts in the index. Afterwards, run through the index and check which elements have count = N. Looking up an element in the index is O(1), so combined with looping through all M elements this is O(M).
If you want to keep the order specific to a certain input array, loop over that array and test the element counts in the index in that order.
Some special cases:
- If you know that the elements are (positive) integers with a maximum that is not too high, you can just use a normal array as the "hash" index to keep counts, where the numbers are simply array indices.
- I've assumed that in each array each number occurs only once. Adapting it for more occurrences should be easy (set the i-th bit in the count for the i-th array, or only update if the current element count == i-1).
EDIT: when I answered, the question did not yet include the part about "a better way" than hashing.
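A sketch of this counting approach in Java, using a per-array set so that each array contributes at most one count per distinct value (the simple single-occurrence variant, not the bit-mask extension mentioned above); names are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A value whose count reaches N appears in every array. The final loop over
// the first array preserves that array's order, as suggested above.
public class CommonElement {
    public static Integer findCommon(int[][] arrays) {
        int n = arrays.length;
        Map<Integer, Integer> counts = new HashMap<>();
        for (int[] array : arrays) {
            Set<Integer> seen = new HashSet<>();
            for (int value : array) {
                if (seen.add(value)) {                    // count each value once per array
                    counts.merge(value, 1, Integer::sum);
                }
            }
        }
        for (int value : arrays[0]) {                     // keep the first array's order
            if (counts.get(value) == n) {
                return value;                             // first element common to all arrays
            }
        }
        return null;                                      // no common element
    }
}
```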
The most direct method is to intersect the first 2 arrays and then intersect the result with each of the remaining N-2 arrays.
If 'intersection' is not defined in the language you're working in, or you require a more specific answer (i.e. you need the answer to 'how do you do the intersection'), then modify your question as such.
Without sorting there isn't an optimized way to do this based on the information given (i.e. sorting and positioning all elements relative to each other, then iterating over the length of the arrays checking for elements present in all the arrays at once).
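A sketch of this pairwise-intersection approach in Java, with HashSet.retainAll standing in for the 'intersection' primitive; the class name is illustrative.

```java
import java.util.HashSet;
import java.util.Set;

// Intersect the first two arrays, then intersect that result with each
// remaining array, stopping early if the running intersection becomes empty.
public class IterativeIntersection {
    public static Set<Integer> commonElements(int[][] arrays) {
        Set<Integer> common = new HashSet<>();
        for (int value : arrays[0]) common.add(value);
        for (int i = 1; i < arrays.length && !common.isEmpty(); i++) {
            Set<Integer> current = new HashSet<>();
            for (int value : arrays[i]) current.add(value);
            common.retainAll(current);        // keep only values present in both
        }
        return common;                        // empty set means no common element
    }
}
```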
The question asks whether there is a better way than hashing. There is no better way (i.e. better time complexity) than hashing, since the time to hash each element is typically constant. Empirical performance is also favorable, particularly if the range of values can be mapped one-to-one to an array maintaining counts. The time is then proportional to the number of elements across all the arrays. Sorting will not give better complexity, since it still needs to visit each element at least once, and then there is the log N for sorting each array.
Back to hashing: from a performance standpoint, you will get the best empirical performance by not processing each array fully, but processing only a block of elements from each array before moving on to the next array. This takes advantage of the CPU cache. It also results in fewer elements being hashed in favorable cases when common elements appear in the same regions of the arrays (e.g. common elements at the start of all arrays). Worst-case behaviour is no worse than hashing each array in full; it merely means all elements are hashed.
I don't think the approach suggested by catchmeifyoutry will work.
Let us say you have two arrays:
1: {1,1,2,3,4,5}
2: {1,3,6,7}
Then the answer should be 1 and 3. But if we use the hashtable approach, 1 will have count 3 and we will never find 1 in this situation.
The problem becomes more complex if we have input something like this:
1: {1,1,1,2,3,4}
2: {1,1,5,6}
Here I think we should give the output as 1,1. The suggested approach fails in both cases.
Solution: read the first array and put it into a hashtable. If we find the same key again, don't increment the counter. Read the second array in the same manner. Now the hashtable holds the common elements, which have a count of 2.
But again, this approach will fail on the second input set I gave earlier.
I'd first start with the degenerate case, finding common elements between 2 arrays (more on this later). From there I'll have a collection of common values which I will use as an array itself and compare it against the next array. This check would be performed N-1 times or until the "carry" array of common elements drops to size 0.
One could speed this up, I'd imagine, by divide-and-conquer, splitting the N arrays into the end nodes of a tree. The next level up the tree is N/2 common element arrays, and so forth and so on until you have an array at the top that is either filled or not. In either case, you'd have your answer.
Without sorting and scanning, the best operational speed you'll get for comparing 2 arrays for common elements is O(N^2).
