Design a data structure for a telephone directory - algorithm

Design a data structure for a telephone directory that stores names and phone numbers so that we can look up a number given a name, and vice versa.
We can use two hash maps, as follows:
Map<String,int>
Map<int,String>
But that requires twice the memory. Can anyone suggest another solution?

One person can have more than one number, and one number can belong to more than one person (members of a family). And as Nick said, a telephone number in the general case can contain non-numeric characters. All considered, instead of Map<String,int> you might use Map<String,List<String>> or, to avoid redundancy, store only pointers to strings (in C++ terms): Map<String*,List<String*>>.
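A minimal Python sketch of that idea, assuming phone numbers are kept as strings and that both directions map to sets. The class and method names (PhoneDirectory, add, numbers_for, names_for) are hypothetical, chosen only for illustration:

```python
from collections import defaultdict


class PhoneDirectory:
    """Two-way index: name -> numbers and number -> names."""

    def __init__(self):
        self._numbers_by_name = defaultdict(set)
        self._names_by_number = defaultdict(set)

    def add(self, name, number):
        # Both indexes hold references to the same str objects, so the text
        # of each name and number is stored once, not copied per index.
        self._numbers_by_name[name].add(number)
        self._names_by_number[number].add(name)

    def numbers_for(self, name):
        # .get() avoids creating an empty entry for unknown names.
        return set(self._numbers_by_name.get(name, ()))

    def names_for(self, number):
        return set(self._names_by_number.get(number, ()))


directory = PhoneDirectory()
directory.add("John", "+1 (555) 0100")
directory.add("John", "555-0101")
directory.add("Mary", "555-0101")  # a shared family number
```

The two indexes still cost two sets of hash-table slots, but the string data itself is shared between them.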

A bimap (or "bidirectional map") is a map that preserves the uniqueness of its values as well as that of its keys. This constraint enables bimaps to support an "inverse view", which is another bimap containing the same entries as this bimap but with reversed keys and values.
http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/collect/BiMap.html
BiMap<String, Integer> biMap = HashBiMap.create();
biMap.put("Mike", 123);
biMap.put("John", 456);
biMap.put("Steven", 789);
BiMap<Integer, String> invertedBiMap = biMap.inverse();
Edit: Multimaps
Multimap<String, String> multimap = HashMultimap.create();
multimap.put("John", "111111111");
multimap.put("John", "222222222");
multimap.put("Mike", "333333333");
System.out.println(multimap.get("John")); //[222222222, 111111111]
for (Map.Entry<String, String> entry : multimap.entries()) {
    if (entry.getValue().equals("222222222")) {
        System.out.println(entry.getKey()); //John
    }
}
//or
Multimap<String, String> inverted = HashMultimap.create();
Multimaps.invertFrom(multimap, inverted);
System.out.println(inverted.get("222222222")); //[John]

Another possibility is to store the strings in a trie, with a '$' sign marking the end of each string. Use doubly linked pointers at each step (parent to child and child to parent), and keep a doubly linked pointer from each '$' (end of name) to its number (in an array or a list).
Now, when you want to get a phone number from a name:
Find the '$' indicating the end of the word for this string.
It is connected to a number in the list: that is your number.
When you want to get a name from a phone number:
Find the number; it is connected to a '$' sign.
Follow the up-links all the way to the root; this gives you the name (reversed).
Reverse it and you are done.
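The steps above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: one number per name, names never contain '$', and a dict (end_node_by_number) stands in for "find the number" instead of an array or list. All class and attribute names are hypothetical.

```python
class TrieNode:
    def __init__(self, char=None, parent=None):
        self.char = char        # character on the edge from the parent
        self.parent = parent    # up-link used for number -> name lookups
        self.children = {}      # down-links used for name -> number lookups
        self.number = None      # set only on '$' terminator nodes


class TrieDirectory:
    END = "$"  # end-of-name marker

    def __init__(self):
        self.root = TrieNode()
        self.end_node_by_number = {}  # number -> its '$' node

    def add(self, name, number):
        node = self.root
        for ch in name + self.END:
            if ch not in node.children:
                node.children[ch] = TrieNode(ch, node)
            node = node.children[ch]
        node.number = number
        self.end_node_by_number[number] = node

    def number_for(self, name):
        node = self.root
        for ch in name + self.END:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.number

    def name_for(self, number):
        node = self.end_node_by_number.get(number)
        if node is None:
            return None
        chars = []
        node = node.parent              # skip the '$' node itself
        while node.parent is not None:  # stop once the root is reached
            chars.append(node.char)
            node = node.parent
        return "".join(reversed(chars))  # up-links yield the name reversed
```

Names that share a prefix (for example "Mike" and "Michael") share trie nodes, which is where the space saving over two independent maps would come from.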
Also, as I said in the comment (regarding the double-map approach): assuming your strings are fairly large and the map holds a pointer/reference to each string (not the string itself), the storage needed will not be double, but considerably less.

Using a binary search tree to implement the telephone directory is the best way to do it. Think about how the contact list on a mobile phone works in practice: it is sorted alphabetically. A hash-based map does not keep its entries in sorted order, and re-sorting them for every listing is inefficient.
A binary search tree avoids this: each new entry is inserted at its ordered position, so no separate sorting step is needed. Remember that in a binary search tree, left_subtree < root and root < right_subtree.

Related

Can I use mapreduce with a pair of Keys and a pair of values?

My question is theoretical.
I'm trying to design a MapReduce example for big data processing.
My case requires a pair of keys to be mapped to a pair of values.
For example, suppose we have the text below:
"Bachelors in Engineering has experience of 5 years"
I am trying to count the words Engineering and Experience in such a way that I have a separate value for each word.
So my key would be (Engineering,Experience) and my value would be (1,1) for the example text above.
Note that there is a relationship between the two keys in my homework, so I want them both in one key-value pair to determine whether both keys are mentioned in a text file, only one is mentioned, or neither is.
Please let me know whether this is possible in MapReduce.
Having a string key of "(Engineering,Experience)" is no different than just having a String of one of those words.
If you want a more custom type, you will need to implement the Writable interface (or WritableComparable, which Hadoop requires for keys).
Similarly, for the value, you could put the entire tuple in a Text and parse it later, or you can create your own Writable implementation that stores two integers.
Thanks for the answer, but I figured I could use "Engineering Experience" as a string for the key.

What kind of data structure will be best for storing a key-value pair where the value will be a String for some key and a List<String> for some keys?

For example, key 1 will have values "A","B","C" but key 2 will have value "D". If I use
Map<String, List<String>>
I need to populate the List<String> even when I have only a single String value.
What data structure should be used in this case?
Map<String,List<String>> would be the standard way to do it (using a size-1 list when there is only a single item).
You could also have something like Map<String, Object> (which should work in either Java or presumably C#, to name two), where the value is either a List<String> or a String. This would be fairly bad practice, though: it hurts readability (you can't tell what the Object represents just from seeing the type), and the casts are only checked at runtime, which isn't ideal, among other things.
It does, however, depend on what type of queries you plan to run. Map<String,Set<String>> might be a good idea if you plan on doing existence checks on the values and the collection can be large. Set<StringPair> (where StringPair is a class with 2 String members) is another option if there are plenty of keys with only 1 mapped value. There are plenty of solutions that would be more appropriate under various circumstances; it basically comes down to looking at the queries you want to perform and picking a structure accordingly.
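As an illustration of the "size-1 list" convention, here is a minimal Python sketch; it assumes the keys and values from the question, and the variable name values_by_key is hypothetical. A defaultdict creates the list on first use, so single-valued keys need no special handling:

```python
from collections import defaultdict

# Every key maps to a list, even when it has only one value.
values_by_key = defaultdict(list)

pairs = [("key1", "A"), ("key1", "B"), ("key1", "C"), ("key2", "D")]
for key, value in pairs:
    values_by_key[key].append(value)  # the list is created on first access

for key, values in values_by_key.items():
    print(key, values)
```

Callers can then treat every value uniformly as a list, which is the main argument for this layout over a mixed String-or-List value.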

Hashing table design in C

I have a design issue regarding a hash function.
In my program I am using a hash table of size 2^13, where the slot is calculated from the value of the node (the hash key) that I want to insert.
Now, say each node has two values |A|B|, but I insert it into the hash table using A.
Later on, I want to search for a particular node by B, not A.
Is it possible to do it that way? If yes, could you highlight some design approaches?
The constraint is that I have to use A as the hash key.
Sorry, I can't share the code. Small example:
Value[] = {Part1, Part2, Part3};
insert(value)
check_for_index(value.part1)
value.part1 to be used to calculate the index of the slot.
Once slot is found then insert the "value"
Later on,
search_in_hash(part2)
check_for_index("But here I need the value.part1 to check for slot index")
So, how can I relate part1, part2 and part3 such that later on I can find the slot by either part2 or part3?
If the problem statement is vague, kindly let me know.
Unless you intend to do a search element-by-element (in which case you don't need a hash, just a plain list), then what you basically ask is - can I have a hash such that hash(X) == hash(Y), but X!=Y, so that you could map to a location using part1 and then map to the same one using part2 or 3. That completely goes against what hashing stands for.
What you should do is (as viraptor also suggested), create 3 structures, each hashed using a different part of the value, and push the full value to all 3. Then when you need to search use the proper hash by the part you want to search by.
For example:
value[] = {part1, part2, part3};
hash1.insert(part1, value)
hash2.insert(part2, value)
hash3.insert(part3, value)
then
hash2.search_in_hash(part2)
or
hash3.search_in_hash(part3)
Both lookups should return the exact same value.
Also make sure that all data manipulations (removing values, changing them) are applied to all 3 structures together. For example:
value = hash2.search_in_hash(part2)
hash1.remove(value.part1)
hash2.remove(part2) // you can assert that part2 == value.part2
hash3.remove(value.part3)
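The same scheme as a small runnable Python sketch, with built-in dicts standing in for the three hash tables. It assumes each part is unique across values; the names Value, MultiIndex and remove_by_part2 are hypothetical:

```python
from collections import namedtuple

Value = namedtuple("Value", ["part1", "part2", "part3"])


class MultiIndex:
    """One hash table per searchable part, all pointing at the same value."""

    def __init__(self):
        self.by_part1 = {}
        self.by_part2 = {}
        self.by_part3 = {}

    def insert(self, value):
        # Push the full value into all three tables.
        self.by_part1[value.part1] = value
        self.by_part2[value.part2] = value
        self.by_part3[value.part3] = value

    def remove_by_part2(self, part2):
        # Look the value up once, then remove it from every table so the
        # three indexes never disagree.
        value = self.by_part2.pop(part2)
        del self.by_part1[value.part1]
        del self.by_part3[value.part3]
        return value
```

In C the equivalent would be three tables of pointers to one shared node, so the node itself is not duplicated.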

Method to determine if keys of a dictionary are in sequence or a range

I'm trying to determine when to remove entries from a SortedDictionary once a sequence is found, i.e. when the keys form a sequence 1,2,3,4,5,6,7,8,9,10, etc.
I have:
SortedDictionary<int, string>
It's hard to explain. I'm adding pairs where the key can be any integer value, generally on a random-ish basis. So, the program may add
<2,"jim"> <15,"Jack"> <62,"jill"> and so on.
So when it executes, the dictionary is going to be filled with a sorted list that is not necessarily in sequence, but I want to check whether, say, the keys 1..10 are all present in a proper sequence, i.e. 1,2,3,4,5,6,7,8,9,10.
The background is that I've got stuff coming in from a messaging pipe, which is not in order. It goes into this dictionary, and then on another thread I check the dictionary; if the check succeeds for the range I provide, that thread removes the range from the dictionary and enqueues it, in order, onto a ConcurrentQueue. Fundamentally, it is an unordered-to-ordered exchange.
Any help is appreciated.
Bob.
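If you get the highest and lowest keys, then the count tells you whether you've got a sequence: because the keys are unique integers, they are consecutive exactly when the count equals highest - lowest + 1.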
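A minimal Python sketch of that counting check for a requested range, assuming the keys are available as an ascending list of unique integers (as a sorted dictionary's key view would provide). The function name has_full_range is hypothetical:

```python
from bisect import bisect_left, bisect_right


def has_full_range(sorted_keys, lo, hi):
    """True if every integer in lo..hi appears in sorted_keys.

    sorted_keys must be ascending and free of duplicates.
    """
    # Number of keys k with lo <= k <= hi, found by two binary searches.
    count = bisect_right(sorted_keys, hi) - bisect_left(sorted_keys, lo)
    # Unique integers in [lo, hi] fill the range only if there are
    # exactly hi - lo + 1 of them.
    return count == hi - lo + 1
```

For example, has_full_range([1, 2, 3, 4, 5, 15, 62], 1, 5) holds, while the same call with hi=6 does not, because 6 is missing.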

Creating an id from name and address data. Hash/Digest

My problem:
I'm looking for a way to represent a person's name and address as an encoded id. The id should contain only alpha-numeric characters, be collision-proof, and use the smallest number of characters possible. My first thought was to simply use a cryptographic hash function like MD5 or SHA1, but this seems like overkill (security isn't important, and it doesn't need to be one-way) and I'd prefer to find something that produces a shorter id. Does anyone know of an existing algorithm that fits this problem?
In other words, what is the best way to implement the following function so that the return value is the same consistently for the same input, collisions are unlikely, and ids are less than 20 characters?
>>> make_fake_id(fname = 'Oscar', lname = 'Grouch', stnum = '1', stname = 'Sesame', zip = '12345')
N1743123734
Application Context (for those that are interested):
This will be used for a record linkage app. Given an input name and address, we search a very large database for the best match and return the database id and other data (how we do this is not important here). If there isn't a match, I need to generate this pseudo/generated/derived id from the search input (the entity's name and address data). Every search record should result in an output record with either a real id (the actual database id resulting from a match/link) or this pseudo/generated/derived id. The pseudo id will be prefixed with a character (e.g. N) to differentiate it from a real id.
I know you said no to MD5 and SHA1, but I think you should consider them anyway. As well as being well-studied hashing algorithms, their length gives you more protection against possible collisions. No hash is collision-proof, but the cryptographic ones are generally less collision-prone than something you could come up with yourself.
Use a cryptographic hash for its collision resistance, not its other qualities
Use as many bytes from the hash as you want (truncate)
convert to alpha-numeric characters
You can also truncate the alpha-numeric string instead of the hash
An easy way to do this: hash the data, encode in base64, remove all non-alpha-numeric characters, truncate.
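import base64, hashlib, re
N_HASH_CHARS = 11
def digest(name, address):
    # Hash the joined fields, then base64-encode the raw digest bytes.
    raw = hashlib.md5((name + "|" + address).encode("utf-8")).digest()
    b64 = base64.b64encode(raw).decode("ascii")
    # Drop '+', '/' and '=' so only alphanumeric characters remain.
    alnum_hash = re.sub(r'[^a-zA-Z0-9]', "", b64)
    return alnum_hash[:N_HASH_CHARS]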
How many alpha-numeric characters should you keep? Each character gives you around 5.95 bits of entropy (log(62,2)). 11 characters give you 65.5 bits of entropy, so by the birthday bound a collision only becomes likely once you reach roughly 2**32.7 users (about 7 billion).
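The arithmetic can be checked with a few lines of Python. This sketch assumes ids are uniformly distributed over the 62-character alphabet and uses the standard birthday approximation; the helper name collision_probability is hypothetical:

```python
import math

ALPHABET_SIZE = 62   # a-z, A-Z, 0-9
ID_LENGTH = 11

bits_per_char = math.log2(ALPHABET_SIZE)   # about 5.95
total_bits = ID_LENGTH * bits_per_char     # about 65.5
birthday_bound = 2 ** (total_bits / 2)     # about 7.2e9 ids


def collision_probability(n_ids, bits):
    """Birthday approximation: 1 - exp(-n^2 / 2^(bits + 1))."""
    return -math.expm1(-(n_ids ** 2) / 2 ** (bits + 1))


print(round(bits_per_char, 2), round(total_bits, 1), f"{birthday_bound:.2e}")
```

Under the same approximation, the chance of at least one collision is about 39% at the birthday bound itself and far smaller for populations well below it, so the right id length depends on how many records you expect.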
A good solution is somewhat dependent on your application. Do you know how many users and what the set of all users is? If you provide more details you would get better help.
I agree with the other poster suggesting serial numbers. OTOH, if you really, really really want to do something else:
Create a SHA1 hash from the data, and store it in a table with a serial number field.
Then, when you get the data, calculate the hash, look it up on the table, get the serial, and that's your id. If it's not on the table, insert it.
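A minimal Python sketch of that lookup-or-insert step. It assumes an in-memory dict in place of the database table, an already canonicalized input string, and the "N" prefix from the question; the names id_for and _serial_by_hash are hypothetical:

```python
import hashlib
import itertools

_serial_by_hash = {}               # stands in for the hash -> serial table
_next_serial = itertools.count(1)  # stands in for a database sequence


def id_for(record):
    """Return a stable id for record, assigning a new serial if unseen."""
    key = hashlib.sha1(record.encode("utf-8")).hexdigest()
    if key not in _serial_by_hash:
        # First time this data is seen: insert it with the next serial.
        _serial_by_hash[key] = next(_next_serial)
    return "N{}".format(_serial_by_hash[key])
```

The ids stay short because only the serial is exposed; the trade-off is that the table has to be persisted and shared by everything that generates ids.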
I wonder whether you intend to "assign" these ids to the users? If so, I would expect your users to hate anything that you propose; who would want a user id of "AAAAA01"?
So, if these ids are visible to the user, then you should just let them pick what they like and check them for uniqueness (easy). If they are not visible to the user (e.g., internal primary key), then just generate them sequentially using an appropriate technique such as an Oracle Sequence or SQL Server AutoNumber (also easy).
If these ids are an attempt to detect a user that is registering more than once, then I would agree that you should consider a cryptographic hash followed by a full comparison of the registration data (name, address, etc.). However, to be usable, you will need to translate the data into a canonical form (standardized letter case, whitespace, canonical street address, etc.) before computing the hash or making the comparison. Otherwise, you will mismatch based on trivial differences.
EDIT: Now that I understand the problem space better based on your edits, I think that it is highly unlikely that your algorithm (so far) will catch most matches. Beyond my suggestion to canonicalize the inputs, I recommend that you consider an approach that results in a ranked list of a handful of possible matches (to be resolved by a human if possible) rather than an all-or-nothing attempt at a single match. In other words, I recommend a search approach rather than a lookup approach.
Is that feasible in your situation?
Well, if there's more than one person at the same address with the same name, you're toast here (without adding code to detect this and add a discriminator of some kind).
But assuming that is not an issue, the street address and zip code portion of the full address is sufficient to guarantee uniqueness there, so adding enough data from the name should take care of the issue.
Do you have access to a database, or another persistence mechanism, where you could generate and maintain key values for each address? If so, keep the address and individual entities in two keyed dictionary structures, where the key is autogenerated for each new distinct address or person encountered, and then use the autogenerated alpha-numeric key.
You could use AAAAA01 for first person at first address,
AAAAA02 for second person at first address,
AAAAB07 for the seventh resident at the second address, etc.
If you don't have any way to generate and maintain these entity-key mappings, then you need to use the full street address/zip and full name, or a hash value of the same, although the hash value approach has a small chance of generating duplicates.

Resources