Quickly find string in long array - ruby

I have a very long array (~2 million) of strings (ISBN for books).
I am parsing an XML file that adds to my existing library. Instead of hitting the database on every book in the XML file, I've loaded the existing library of ISBNs into an array.
If a new ISBN is found, I create a new book model. If the ISBN already exists, I update a column for that title.
Currently I'm using index to find each ISBN from the XML file:
array.index(ISBN)
I also experimented with converting the array to a hash of ISBN keys and that's a little faster.
hash[ISBN]
Any ideas for doing the lookup faster? The array and hash methods are giving me roughly 15 and 20 checks/sec, respectively.

Hash lookup is about the fastest you can get, I think. Its lookup time is constant on average with respect to the size of the collection. If that is not fast enough, something is probably wrong in other parts of your code.
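To make that concrete, here is a minimal Ruby sketch (the variable names and sample ISBNs are made up) that keeps the existing library in a Set, which is backed by a Hash and gives the same constant-time average lookups as hash[ISBN]:

require 'set'

# Hypothetical data: in the real code these would come from the database
# and from the XML parser.
existing_isbns = ["9780306406157", "9781861972712"].to_set
xml_isbns      = ["9780306406157", "9780140449136"]

xml_isbns.each do |isbn|
  if existing_isbns.include?(isbn)   # O(1) on average, like hash[isbn]
    # update the column for the existing title
  else
    # create a new book model
    existing_isbns.add(isbn)         # keep the set current for later checks
  end
end

If the Hash version still only manages ~20 checks/sec, the time is almost certainly going into the per-book database writes or the XML parsing rather than into the lookup itself.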

Related

Inserting an element into a full hash table with a constant number of buckets

I am studying hash tables at the moment and have a question about implementations with a fixed number of buckets.
Suppose we have a hash table with 23 buckets (for example). Let's use the simplest hash function (hash_value = key % table_size), with the keys being integers only. If one bucket can hold at most one element (no separate chaining), does that mean that once all buckets are full we can no longer insert any element into the table at all? Or will we have to replace the element that has the same hash value with the new element?
I understand that I am imposing a lot of constraints, and a real implementation might never look like this, but I want to be sure I understand this particular case.
A real implementation usually allows a hash table to resize, but resizing takes a long time and is undesirable, so it is done rarely. For a truly fixed-size hash table, an insert would probably return an error code or throw an exception and leave it to the user to handle.
Or will we have to replace the element that has the same hash value with the new element?
In Java's HashMap, if you add a key equal to one already present in the table, only the value associated with that key is replaced by the new one; this never happens merely because two different keys hash to the same value.
Yes. An open-addressing hash table - which is what you are describing - has a fixed number of slots, so it can fill up.
However, implementations will usually respond by copying all contents into a new, bigger table. In fact, they normally won't wait for the table to fill entirely, but use some criterion - for example the fraction of slots in use (sometimes called the "load factor") - to decide when it's time to expand.
Some implementations will also "shrink" themselves to a smaller table if the load factor becomes too small due to deletions.
You'd probably find reading Google's hash table implementation, which includes some documentation of its internals, to be a good learning experience.
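As a rough illustration of the points above, here is a Ruby sketch (not how any real library implements it): a table with a fixed number of buckets and hash = key % size that simply refuses inserts once it is full, and a growing variant that re-hashes everything into a bigger table once the load factor crosses a threshold. Unlike the question's setup it uses linear probing, so an insert only fails when every bucket is occupied rather than on the first collision.

# Fixed number of buckets, open addressing with linear probing.
class FixedTable
  def initialize(buckets)
    @slots = Array.new(buckets)
    @count = 0
  end

  attr_reader :count

  def insert(key)
    return false if @count == @slots.length        # completely full: give up
    i = key % @slots.length
    i = (i + 1) % @slots.length until @slots[i].nil? || @slots[i] == key
    @count += 1 if @slots[i].nil?
    @slots[i] = key
    true
  end

  def include?(key)
    i = key % @slots.length
    @slots.length.times do
      return true  if @slots[i] == key
      return false if @slots[i].nil?
      i = (i + 1) % @slots.length
    end
    false
  end
end

# What real implementations do instead: grow and re-hash once the load factor
# (fraction of buckets in use) crosses a threshold, so the table never fills up.
class GrowingTable < FixedTable
  LOAD_FACTOR = 0.75

  def insert(key)
    if (count + 1).to_f / @slots.length > LOAD_FACTOR
      old = @slots.compact
      @slots = Array.new(@slots.length * 2)
      @count = 0
      old.each { |k| super(k) }                    # re-hash every existing key
    end
    super
  end
end

fixed = FixedTable.new(23)
(0...23).each { |k| fixed.insert(k) }
fixed.insert(100)                                  # => false: every bucket is taken

grower = GrowingTable.new(23)
(0...100).each { |k| grower.insert(k) }            # grows instead of failing
grower.include?(99)                                # => true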

C++ Integer Trie implementation using a hash_map to reduce memory consumption

I have to implement a trie of codes of a given fixed length. Each code is a sequence of integers, and since some patterns are common, I decided to implement a trie in order to store all the codes.
I also need to iterate through the codes in lexicographic order, and I expect to work with millions (maybe billions) of codes.
This is why I considered implementing this particular trie as a dictionary where each key is the index of a given prefix.
Say key 0 holds the list of the root prefix's children, and for each one I save the index of the corresponding entry in the dictionary...
Example: If my first insertion is the code 231, then the dictionary would look like:
[0]->{(2,1)}
[1]->{(3,2)}
[2]->{(1,endMark)}
This way, if my second insertion were 243, the dictionary would be updated like this:
[0]->{(2,1)}
[1]->{(3,2),(4,3)} *Here each list is sorted using a flat_map
[2]->{(1,endMark)}
[3]->{(3,endMark)}
My problem is that I have been using a vector for this purpose, because keeping the whole dictionary in contiguous memory gives me better performance while iterating over it.
Now, when I need to work with big instances of my problem, the vector's resizing means I cannot work with millions of codes (memory consumption could be as much as 200 GB).
I have now tried Google's sparse hash instead of the vector, and my question is: do you have any suggestions? Any other alternative in mind? Is there another way to work with integers as keys to improve performance?
I know I won't have any collisions because each key is different from the rest.
Best regards,
Quentin
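For reference, the layout described in the question can be sketched in a few lines of Ruby (the question is about C++, but the structure carries over): a dictionary from node index to a sorted list of [symbol, child index] pairs, with END_MARK as a made-up end-of-code marker. Running it on the two codes above reproduces the example.

END_MARK = :end                           # made-up marker for "end of code"

# nodes[i] is the sorted list of [symbol, child_index] pairs of node i;
# node 0 is the root (the empty prefix).
nodes = Hash.new { |h, k| h[k] = [] }
nodes[0]                                  # create the root entry

def insert(nodes, code)
  node = 0
  code.each_with_index do |sym, pos|
    last  = (pos == code.length - 1)
    entry = nodes[node].assoc(sym)
    if entry
      node = entry[1]
    else
      child = last ? END_MARK : nodes.size
      nodes[node] << [sym, child]
      nodes[node].sort_by!(&:first)       # children kept ordered, like the flat_map
      nodes[child] unless last            # create the child node's entry
      node = child
    end
  end
end

insert(nodes, [2, 3, 1])
insert(nodes, [2, 4, 3])
# nodes => {0=>[[2, 1]], 1=>[[3, 2], [4, 3]], 2=>[[1, :end]], 3=>[[3, :end]]}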

Which data structure to use for fast search? (C#)

I need an efficient data structure for searching. Currently I am using a simple List from System.Collections.Generic until I find a good solution.
The user can add/remove strings at runtime by clicking on them in a list. But the main operation is searching, because every time the user wants to see the list I need to check, for every entry, whether the user has already clicked on it before. The list may contain about 100-1000 entries, of which the user can choose about 100. The list of chosen strings will also be saved to disk as a string array and needs to be loaded again, so the data structure should be fast to rebuild from a string array if the array was saved in the correct order.
I thought about using an AVL tree. Is it a good solution? Or would hashing be possible (I don't know the strings that can be chosen at compile time)?
Since you are using C#, I would recommend Dictionary. It lets you store strings in a hash map (comparable to Java's Map). The benefit of storing strings in a Dictionary is that of a hash table: constant average time for searching, inserting, and deleting. Should you want to check for duplicate values, see: Finding duplicate values in dictionary and print Key of the duplicate element.

Suitable data structure for finding a person's phone number, given their name?

Suppose you want to write a program that implements a simple phone book. Given a particular name, you want to be able to retrieve that person's phone number as quickly as possible. What data structure would you use to store the phone book, and why?
The text below answers your question.
In computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys (e.g., a person's name), to their associated values (e.g., their telephone number). Thus, a hash table implements an associative array. The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.
The text is from the Wikipedia article on hash tables.
There are further discussions there - collisions, hash functions, and so on; check the wiki page for details.
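In Ruby terms, for example (entries made up), that is simply:

phone_book = { "Alice Smith" => "555-0100", "Bob Jones" => "555-0187" }
phone_book["Alice Smith"]   # => "555-0100", constant time on average
phone_book["Zoe"]           # => nil: the name is not in the book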
I respect & love hash tables :) but even a balanced binary tree would be fine for your phone book application, giving you logarithmic complexity in the worst case and sparing you the need for good hash functions, collision handling, etc., which becomes more of a concern with huge amounts of data.
When I talk about huge data, I mean something related to storage. Every time you fill all of the buckets in a hash table, you need to allocate new storage and re-hash everything. This can be avoided if you know the size of the data ahead of time. Balanced trees don't run into these problems. The domain also needs to be considered when designing data structures; for example, on small devices storage matters a lot.
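Ruby's standard library has no balanced tree, so as a stand-in here is a sketch of the same idea with a sorted array and binary search, which also gives O(log n) worst-case lookups without any hashing (sample data made up):

entries = [
  ["Alice Smith", "555-0100"],
  ["Bob Jones",   "555-0187"],
  ["Carol White", "555-0123"],
].sort_by(&:first)                       # keep the book ordered by name

def lookup(entries, name)
  hit = entries.bsearch { |entry_name, _| name <=> entry_name }
  hit && hit[1]                          # return the number, or nil if absent
end

lookup(entries, "Bob Jones")   # => "555-0187"
lookup(entries, "Dave")        # => nil

A real balanced tree would additionally keep inserts and deletes at O(log n); the sorted array only keeps the lookup cheap.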
I was wondering why tries didn't come up in any of the answers.
Tries are well suited to phone-book-style data.
They also save space compared to a hash table at (almost) the same retrieval cost, assuming a constant-size alphabet and constant-length names.
Tries also facilitate the prefix matches sometimes required while searching.
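A small Ruby sketch of that idea (nested hashes as trie nodes; :number is a made-up key for the value stored at a complete name), including the prefix lookup mentioned above:

# Each trie node is a Hash from character to child node; the special key
# :number holds the phone number stored at a complete name.
trie = {}

def trie_insert(trie, name, number)
  node = name.chars.inject(trie) { |n, ch| n[ch] ||= {} }
  node[:number] = number
end

# Every [name, number] pair whose name starts with the given prefix.
def trie_prefix(trie, prefix)
  node = prefix.chars.inject(trie) { |n, ch| n && n[ch] }
  node ? collect_names(node, prefix) : []
end

def collect_names(node, prefix)
  results = []
  results << [prefix, node[:number]] if node.key?(:number)
  node.each do |ch, child|
    results.concat(collect_names(child, prefix + ch)) unless ch == :number
  end
  results
end

trie_insert(trie, "alice",  "555-0100")
trie_insert(trie, "albert", "555-0187")
trie_insert(trie, "bob",    "555-0123")
trie_prefix(trie, "al")   # => [["alice", "555-0100"], ["albert", "555-0187"]]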
A dictionary is both dynamic and fast.
You want a dictionary, where you use the name as the key, and the number as the data stored. Check this out: http://en.wikipedia.org/wiki/Dictionary_%28data_structure%29
Why not use a singly linked list? Each node holds the name, the number, and the link information.
One drawback is that your search might take some time, since you'll have to traverse the entire list link by link. You could keep the list ordered by inserting each node in sorted position.
PS: To make the search a tad faster, maintain a link to the middle of the list. The search can then continue to the left or right of the list based on the value of the "name" field at that node. Note that this requires a doubly linked list.
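A rough Ruby sketch of that last idea (the list is built already sorted by name, and keeping the middle pointer up to date across insertions is left out for brevity; all data is made up):

Node = Struct.new(:name, :number, :prev, :next)

entries = [["alice", "555-0100"], ["bob", "555-0187"], ["carol", "555-0123"],
           ["dave", "555-0144"], ["erin", "555-0179"]]
nodes = entries.map { |name, number| Node.new(name, number) }
nodes.each_cons(2) { |a, b| a.next = b; b.prev = a }   # doubly link the nodes
middle = nodes[nodes.length / 2]

# Compare against the middle node's name to decide which direction to walk.
def find_number(middle, name)
  dir  = name < middle.name ? :prev : :next
  node = middle
  while node
    return node.number if node.name == name
    node = node.send(dir)
  end
  nil
end

find_number(middle, "bob")    # => "555-0187"
find_number(middle, "frank")  # => nil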

How to match against a large collection efficiently

I have a large collection of objects of type foo. Each object of type foo has say 100 properties (all strings) plus an id. An object of type bar also has these 100 properties.
I want to find the matching object of type foo from the collection where all these properties match with that of bar.
Aside from the brute-force method, is there an elegant algorithm where we calculate a signature for each foo object once, do the same for the bar object, and then match more efficiently?
The foos are in the thousands and the bars are in the millions.
Darth Vader has a point there... and I never thought that I'd be siding with the dark side!
I'll go over what I think are the best tools for the trade:
Embedded database: Google's LevelDB - it's faster than most database solutions out there.
Hashing function: Google's CityHash - it's fast and it offers excellent hashing!
JSON Serialization
The Embedded Database
The goal of using an embedded database is that you will get performance that will beat most database solutions that you're likely to encounter. We can talk about just how fast LevelDB is, but plenty of other people have already talked about it quite a bit so I won't waste time. The embedded database allows you to store key/value pairs and quickly find them in your database.
The Hashing Function
A good hashing function will be fast and it will provide a good distribution of non-repeatable hashes. CityHash is very fast and it has very good distribution, but again: I won't waste time since a lot of other people have already talked about the performance of CityHash. You would use the hashing function to hash your objects and then use the unique key to look them up in the database.
JSON Serialization
JSON Serialization is the antithesis of what I've shown above: it's very slow and it will diminish any performance gain you achieved with CityHash, but it gives you a very simple way to hash an entire object. You serialize the object to a JSON string, then you hash the string using CityHash. Despite the fact that you've lost the performance gains of CityHash because you spent so much time serializing the object to JSON, you will still reap the benefits of having a really good hashing function.
The Conclusion
You can store billions of records in LevelDB and you will be able to quickly retrieve the exact value you're looking for just by providing the hash for it.
In order to generate a key, you can use JSON serialization and CityHash to hash the JSON string.
Use the key to find the matching object!
Enjoy!
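A Ruby sketch of that pipeline, with Digest::SHA256 from the standard library standing in for CityHash and a plain in-memory Hash standing in for LevelDB (the property names and ids are invented); the important detail is sorting the properties so the JSON, and therefore the signature, is canonical:

require 'json'
require 'digest'

# Signature of an object: serialize its properties in a fixed key order,
# then hash the JSON string.
def signature(properties)
  Digest::SHA256.hexdigest(JSON.generate(properties.sort.to_h))
end

foos = [
  { id: 1, props: { "author" => "Dumas", "title" => "The Three Musketeers" } },
  { id: 2, props: { "author" => "Verne", "title" => "Around the World" } },
]

# Index the (thousands of) foos by signature once...
index = foos.to_h { |foo| [signature(foo[:props]), foo[:id]] }

# ...then each of the (millions of) bars costs a single lookup.
bar = { "title" => "Around the World", "author" => "Verne" }
index[signature(bar)]   # => 2

If a hash collision is unacceptable, keep the candidate id from the lookup and verify the full 100 properties against it before declaring a match.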
If ALL the properties match, that means they are actually the same object - is that correct?
In any case, you want to use a Map/Dictionary/Table with a good hashing algorithm to find matching objects.
Whichever language you are using, you should override the GetHashCode and Equals methods (or their equivalents) to implement it.
If you have a good hashing algorithm, your access time will be O(1); otherwise it can be up to O(n).
Given your memory limitation, you want to store the foos in the map; storing the bars might require a lot of space, which you might not have.
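In Ruby the methods to override are eql? and hash rather than GetHashCode and Equals; here is a sketch with invented Foo and Bar classes whose equality is derived entirely from their properties, so a plain Hash keyed by foos can be probed with a bar:

class Foo
  attr_reader :id, :properties

  def initialize(id, properties)
    @id, @properties = id, properties
  end

  # Two objects count as "the same" when all properties match, so both the
  # hash code and equality are derived from the properties alone.
  def hash
    properties.hash
  end

  def eql?(other)
    properties == other.properties
  end
end

Bar = Struct.new(:properties) do
  def hash
    properties.hash
  end

  def eql?(other)
    properties == other.properties
  end
end

foos  = [Foo.new(1, { "a" => "x" }), Foo.new(2, { "a" => "y" })]
index = foos.to_h { |f| [f, f.id] }   # store the foos in the map, as suggested

bar = Bar.new({ "a" => "y" })
index[bar]   # => 2, found via bar's hash and eql?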
Hashing is nice and simple to implement, but I want to suggest this algorithm:
Map your 100 string properties to one big string (for example, concatenate them with a fixed length per property); that string should be a unique id for the object. So we have about a thousand strings in the first set and a million strings in the second.
The problem reduces to checking, for each string in the second set, whether the first set contains it.
Build a trie data structure on the first set.
The complexity of checking whether a string S is in the trie is O(|S|), where |S| is the length of S.
So the complexity of the algorithm is O(Sum(|Ai|) + Sum(|Bi|)) = O(max(Sum(|Ai|), Sum(|Bi|))) = O(Sum(|Bi|)) for your problem, where Ai are the unique id strings of the first set and Bi those of the second.
UPDATE:
The trie takes O(Sum(|Ai|) * |Alphabet|) space in the worst case.
