Data Structure for phone book [duplicate] - algorithm

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
storing 1 million phone numbers
How to design a data structure for a phone address book with 3 fields
name, phone number , address
one must be able to search this phone book on any of the 3 fields
Hash table wouldn't work because all the three fields should hash to the same value which is i think impossible. I thought about trie and other data structures too but couldn't think of a proper answer.

You Should use TRIE data Structure for Implementing Phonebook. TRIE is an ordered tree data structure that uses strings as keys. Unlike Binary Trees, TRIE does not store keys associated with the node.
Good example

You could accomplish this with a single hash table or other type of associative array (if you wanted to). For each person, just have three keys in the table (name, address, phone) all pointing to the same record.

I think a combination of a trie (each phone book entry is one leaf) and two skip lists (one for each name and address) could turn out to be effective.
Just assign each node one set of pointers to move along the name axis, and one set of pointers to move along the address axis (that is, to traverse the skip lists).

You can't exactly sort something in three ways at the same time. Nor can you feasibly build a single hash table which allows lookup with only a third of the key.
What you probably want to do is basically what databases do:
Store one (possibly unsorted) master list of all your records.
For each column you want to be able to search on, build some kind of lookup structure which returns a pointer/index into the master list.
So, for example, you build a flat array of {name, phone, address} structs in whatever order you want, and then for each row, put a (phone -> row#) mapping into a hash table. Non-unique columns could hash to a list of row numbers, or you could put them in a binary tree where duplicate keys aren't an issue.
As far as space requirements, you basically end up storing every element twice, so your space requirement will at least double. On top of this you've got the overhead from the data structures themselves; keeping three hash tables loaded at ~70% capacity, your storage requirements increase by at least 2.4 times.
You can do away with one of these auxiliary lookup structures by keeping your main table sorted on one of the columns, so you can search on it directly in O(logN). However, this makes inserting/deleting rows very expensive (O(N)), but if your data is fairly static, this isn't much of an issue. And if this is the case, sorted arrays would be the most space-efficient choice for your auxiliary lookups as well.

in a phone book, the telephone number should be unique, address is unique, but the name could be duplicated.
so perhaps you can use hash table combine with linked list to approach this.
you can use any one or combination of the 'name, address, phone number' as hash key, if you simply use name as hash key, then linked list is needed to store the duplicated entries.
in this approach, search based on the hash key is O(1) efficiency, but search based on the other two will be O(n).

C or C++ or C#?
Use a list of classes
public class PhoneBook
{
public string name;
public string phoneNumber;
public string address;
}
place this in a list and you have a phone book

In C, I think a struct is the best option.
typedef struct _Contact Contact;
struct _Contact
{
char* name;
char* number;
char* address;
};
Contact* add_new_contact( char* name, char* number, char* address )
{
Contact* c = (Contact*) malloc( sizeof( Contact ) );
c->name = name;
c->number = number;
c->address = address;
return c;
}
Contact* phone_book [ 20 ]; /* An array of Contacts */
Use the standard string functions ( <string.h> or if using a C++ compiler, <cstring> ) or something like the glib for searching the names, numbers etc.
Here's a simple example:
Contact* search_for_number( Contact* phone_book[], const char* number )
{
register int i;
for( i = 0; i < sizeof( phone_book ); i++)
{
if ( strcmp( phone_book[i]->number, number ) == 0 ) return phone_book[i];
}
return NULL;
}
There is also a good, helpful code example over here.
Alternatively
You may be able to use linked lists, but since C or the C standard library doesn't provide linked-lists, you either need to implement it yourself, or to use a third-party library.
I suggest using the g_linked_list in the glib.

Related

Hash Tables and Separate Chaining: How do you know which value to return from the bucket's list?

We're learning about hash tables in my data structures and algorithms class, and I'm having trouble understanding separate chaining.
I know the basic premise: each bucket has a pointer to a Node that contains a key-value pair, and each Node contains a pointer to the next (potential) Node in the current bucket's mini linked list. This is mainly used to handle collisions.
Now, suppose for simplicity that the hash table has 5 buckets. Suppose I wrote the following lines of code in my main after creating an appropriate hash table instance.
myHashTable["rick"] = "Rick Sanchez";
myHashTable["morty"] = "Morty Smith";
Let's imagine whatever hashing function we're using just so happens to produce the same bucket index for both string keys rick and morty. Let's say that bucket index is index 0, for simplicity.
So at index 0 in our hash table, we have two nodes with values of Rick Sanchez and Morty Smith, in whatever order we decide to put them in (the first pointing to the second).
When I want to display the corresponding value for rick, which is Rick Sanchez per our code here, the hashing function will produce the bucket index of 0.
How do I decide which node needs to be returned? Do I loop through the nodes until I find the one whose key matches rick?
To resolve Hash Tables conflicts, that's it, to put or get an item into the Hash Table whose hash value collides with another one, you will end up reducing a map to the data structure that is backing the hash table implementation; this is generally a linked list. In the case of a collision this is the worst case for the Hash Table structure and you will end up with an O(n) operation to get to the correct item in the linked list. That's it, a loop as you said, that will search the item with the matching key. But, in the cases that you have a data structure like a balanced tree to search, it can be O(logN) time, as the Java8 implementation.
As JEP 180: Handle Frequent HashMap Collisions with Balanced Trees says:
The principal idea is that once the number of items in a hash bucket
grows beyond a certain threshold, that bucket will switch from using a
linked list of entries to a balanced tree. In the case of high hash
collisions, this will improve worst-case performance from O(n) to
O(log n).
This technique has already been implemented in the latest version of
the java.util.concurrent.ConcurrentHashMap class, which is also slated
for inclusion in JDK 8 as part of JEP 155. Portions of that code will
be re-used to implement the same idea in the HashMap and LinkedHashMap
classes.
I strongly suggest to always look at some existing implementation. To say about one, you could look at the Java 7 implementation. That will increase your code reading skills, that is almost more important or you do more often than writing code. I know that it is more effort but it will pay off.
For example, take a look at the HashTable.get method from Java 7:
public synchronized V get(Object key) {
Entry<?,?> tab[] = table;
int hash = key.hashCode();
int index = (hash & 0x7FFFFFFF) % tab.length;
for (Entry<?,?> e = tab[index] ; e != null ; e = e.next) {
if ((e.hash == hash) && e.key.equals(key)) {
return (V)e.value;
}
}
return null;
}
Here we see that if ((e.hash == hash) && e.key.equals(key)) is trying to find the correct item with the matching key.
And here is the full source code: HashTable.java

GO: Best way to search element on structs [duplicate]

This question already has answers here:
How to search for an element in a golang slice
(8 answers)
Closed last month.
I'm working with go structs, Now I have a struct that has more structs in it, in this case I need to find out the value of the Id in an slice. I only have the name of an element in the last structure,
The way that I'm do it now, is reading each element of a slice called Genes till find my string Name.
Are there a better practice to find my string Name?
type GenresResponse struct {
Count int `xml:"count,attr"`
PageIndex int `xml:"page_index,attr"`
PageSize int `xml:"page_size,attr"`
NumOfResults int `xml:"num_of_results,attr"`
TotalPages int `xml:"total_pages,attr"`
Genes []Gene `xml:"gene"`
}
type Gene struct {
Category string `xml:"category,attr"`
Id string `xml:"id,attr"`
Translations Translations `xml:"translations"`
}
type Translations struct{
Translation Translation `xml:"translation"`
}
type Translation struct{
Lang string `xml:"lang,attr"`
Name string `xml:"name"`
}
And this is the way that I'm reading it
idToFind := "0"
for _, genreItem := range responseStruct.Genes {
if strings.ToLower(genreItem.Translations.Translation.Name) == strings.ToLower(myNameValue){
idToFind = genreItem.Id
break
}
}
Your code seems to be working fine and to my knowledge there isn't any "better" way to do a linear search.
But if you're dealing with a lot of data (especially when dealing with a high amount of searching), you might want to use a scheme were the Gene array is sorted (by name in this case). In this case various faster searching algorithms (like binary search) can be applied, which lowers the complexity of searching from O(x) to O(log(x)). This can make a big difference when searching big amounts of data.
More on the binary search algorithm can be found on Wikipedia: http://en.wikipedia.org/wiki/Binary_search_algorithm
Go also includes a default package which can handle sorting and binary search, especially the examples could be quite useful: http://golang.org/pkg/sort/
Go works well with json. As an option, of course, it is not optimal from the point of view of memory and CPU. But, you can apply marshaling and search through the text of the entire structure...

Data structure for occurrence counting in long tail distribution

I have a big list of elements (tens of millions).
I am trying to count the number of occurrence of several subset of these elements.
The occurrence distribution is long-tailed.
The data structure currently looks like this (in an OCaml-ish flavor):
type element_key
type element_aggr_key
type raw_data = element_key list
type element_stat =
{
occurrence : (element_key, int) Hashtbl.t;
}
type stat =
{
element_stat_hashtable : (element_aggr_key, element_stat) Hashtbl.t;
}
Element_stat currently use hashtable where the key is each elements and the value is an integer. However, this is inefficient because when many elements have a single occurrence, the occurrence hashtable is resized many times.
I cannot avoid resizing the occurrence hashtable by setting a big initial size because there actually are many element_stat instances (the size of hashtable in stat is big).
I would like to know if there is a more efficient (memory-wise and/or insertion-wise) data structure for this use-case. I found a lot of existing data structure like trie, radix tree, Judy array. But I have trouble understanding their differences and whether they fit my problem.
What you have here is a table mapping element_aggr_key to tables that in turn map element_key to int. For all practical purposes, this is equivalent to a single table that maps element_aggr_key * element_key to int, so you could do:
type stat = (element_aggr_key * element_key, int) Hashtbl.t
Then you have a single hash table, and you can give it a huge initial size.

Efficient data structure/algorithm for transliteration based word lookup

I'm looking for a efficient data structure/algorithm for storing and searching transliteration based word lookup (like google do: http://www.google.com/transliterate/ but I'm not trying to use google transliteration API). Unfortunately, the natural language I'm trying to work on doesn't have any soundex implemented, so I'm on my own.
For an open source project currently I'm using plain arrays for storing word list and dynamically generating regular expression (based on user input) to match them. It works fine, but regular expression is too powerful or resource intensive than I need. For example, I'm afraid this solution will drain too much battery if I try to port it to handheld devices, as searching over thousands of words with regular expression is too much costly.
There must be a better way to accomplish this for complex languages, how does Pinyin input method work for example? Any suggestion on where to start?
Thanks in advance.
Edit: If I understand correctly, this is suggested by #Dialecticus-
I want to transliterate from Language1, which has 3 characters a,b,c to Language2, which has 6 characters p,q,r,x,y,z. As a result of difference in numbers of characters each language possess and their phones, it is not often possible to define one-to-one mapping.
Lets assume phonetically here is our associative arrays/transliteration table:
a -> p, q
b -> r
c -> x, y, z
We also have a valid word lists in plain arrays for Language2:
...
px
qy
...
If the user types ac, the possible combinations become px, py, pz, qx, qy, qz after transliteration step 1. In step 2 we have to do another search in valid word list and will have to eliminate everyone of them except px and qy.
What I'm doing currently is not that different from the above approach. Instead of making possible combinations using the transliteration table, I'm building a regular expression [pq][xyz] and matching that with my valid word list, which provides the output px and qy.
I'm eager to know if there is any better method than that.
From what I understand, you have an input string S in an alphabet (lets call it A1) and you want to convert it to the string S' which is its equivalent in another alphabet A2. Actually, if I understand correctly, you want to generate a list [S'1,S'2,...,S'n] of output strings which might potentially be equivalent to S.
One approach that comes to mind is for each word in the list of valid words in A2 generate a list of strings in A1 that matches the. Using the example in your edit, we have
px->ac
qy->ac
pr->ab
(I have added an extra valid word pr for clarity)
Now that we know what possible series of input symbols will always map to a valid word, we can use our table to build a Trie.
Each node will hold a pointer to a list of valid words in A2 that map to the sequence of symbols in A1 that form the path from the root of the Trie to the current node.
Thus for our example, the Trie would look something like this
Root (empty)
| a
|
V
+---Node (empty)---+
| b | c
| |
V V
Node (px,qy) Node (pr)
Starting at the root node, as symbols are consumed transitions are made from the current node to its child marked with the symbol consumed until we have read the entire string. If at any point no transition is defined for that symbol, the entered string does not exist in our trie and thus does not map to a valid word in our target language. Otherwise, at the end of the process, the list of words associated with the current node is the list of valid words the input string maps to.
Apart from the initial cost of building the trie (the trie can be shipped pre-built if we never want the list of valid words to change), this takes O(n) on the length of the input to find a list of mapping valid words.
Using a Trie also provide the advantage that you can also use it to find the list of all valid words that can be generated by adding more symbols to the end of the input - i.e. a prefix match. For example, if fed with the input symbol 'a', we can use the trie to find all valid words that can begin with 'a' ('px','qr','py'). But doing that is not as fast as finding the exact match.
Here's a quick hack at a solution (in Java):
import java.util.*;
class TrieNode{
// child nodes - size of array depends on your alphabet size,
// her we are only using the lowercase English characters 'a'-'z'
TrieNode[] next=new TrieNode[26];
List<String> words;
public TrieNode(){
words=new ArrayList<String>();
}
}
class Trie{
private TrieNode root=null;
public void addWord(String sourceLanguage, String targetLanguage){
root=add(root,sourceLanguage.toCharArray(),0,targetLanguage);
}
private static int convertToIndex(char c){ // you need to change this for your alphabet
return (c-'a');
}
private TrieNode add(TrieNode cur, char[] s, int pos, String targ){
if (cur==null){
cur=new TrieNode();
}
if (s.length==pos){
cur.words.add(targ);
}
else{
cur.next[convertToIndex(s[pos])]=add(cur.next[convertToIndex(s[pos])],s,pos+1,targ);
}
return cur;
}
public List<String> findMatches(String text){
return find(root,text.toCharArray(),0);
}
private List<String> find(TrieNode cur, char[] s, int pos){
if (cur==null) return new ArrayList<String>();
else if (pos==s.length){
return cur.words;
}
else{
return find(cur.next[convertToIndex(s[pos])],s,pos+1);
}
}
}
class MyMiniTransliiterator{
public static void main(String args[]){
Trie t=new Trie();
t.addWord("ac","px");
t.addWord("ac","qy");
t.addWord("ab","pr");
System.out.println(t.findMatches("ac")); // prints [px,qy]
System.out.println(t.findMatches("ab")); // prints [pr]
System.out.println(t.findMatches("ba")); // prints empty list since this does not match anything
}
}
This is a very simple trie, no compression or speedups and only works on lower case English characters for the input language. But it can be easily modified for other character sets.
I would build transliterated sentence one symbol at the time, instead of one word at the time. For most languages it is possible to transliterate every symbol independently of other symbols in the word. You can still have exceptions as whole words that have to be transliterated as complete words, but transliteration table of symbols and exceptions will surely be smaller than transliteration table of all existing words.
Best structure for transliteration table is some sort of associative array, probably utilizing hash tables. In C++ there's std::unordered_map, and in C# you would use Dictionary.

Store and update huge (and sparse?) multi-dimensional array efficiently to count conditional probabilities

Just for fun I would like to count the conditional probabilities that a word (from a natural language) appears in a text, depending on the last and next to last word. I.e. I would take a huge bunch of e.g. English texts and count how often each combination n(i|jk) and n(jk) appears (where j,k,i are sucsessive words).
The naive approach would be to use a 3-D array (for n(i|jk)), using a mapping of words to position in 3 dimensions. The position look-up could be done efficiently using tries (at least that's my best guess), but already for O(1000) words I would run into memory constraints. But I guess that this array would be only sparsely filled, most entries being zero, and I would thus waste lots of memory. So no 3-D array.
What data structure would be suited better for such a use case and still be efficient to do a lot of small updates like I do them when counting the appearances of the words? (Maybe there is a completely different way of doing this?)
(Of course I also need to count n(jk), but that's easy, because it's only 2-D :)
The language of choice is C++ I guess.
C++ code:
struct bigram_key{
int i, j;// words - indexes of the words in a dictionary
// a constructor to be easily constructible
bigram_key(int a_i, int a_j):i(a_i), j(a_j){}
// you need to sort keys to be used in a map container
bool operator<(bigram_key const &other) const{
return i<other.i || (i==other.i && j<other.j);
}
};
struct bigram_data{
int count;// n(ij)
map<int, int> trigram_counts;// n(k|ij) = trigram_counts[k]
}
map<bigram_key, bigram_data> trigrams;
The dictionary could be a vector of all found words like:
vector<string> dictionary;
but for better lookup word->index it could be a map:
map<string, int> dictionary;
When you read a new word. You add it to the dictionary and get its index k, you already have i and j indexes of the previous two words so then you just do:
trigrams[bigram_key(i,j)].count++;
trigrams[bigram_key(i,j)].trigram_counts[k]++;
For better performance you may search for bigram only once:
bigram_data &bigram = trigrams[bigram_key(i,j)];
bigram.count++;
bigram.trigram_counts[k]++;
Is it understandable? Do you need more details?

Resources