Efficient storage of external index of strings - algorithm

Say you have a large collection with n objects on disk and each one has a variable-sized string. What are common, efficient ways to build an index of those objects using plain string comparison? Storing the whole strings in the index would be prohibitive in the long run due to size and I/O, but since disks have high latency, storing only references isn't a good idea either.
I've been thinking of using a B-tree-like design combined with tries, but I can't find any database implementation using this approach. In fact, it's hard to find how major databases implement indexes for strings (it probably gets lost in the vast results for SQL-level information).
TIA!
EDIT: changed title from "Efficient external sorting and searching of stored objects with large strings" to "Efficient storage of external index of strings."

A "prefix B-tree" or "simple prefix B-tree" would probably be helpful here.
A "simple prefix B-tree" is a bit simpler, just storing the shortest prefix that separates two items, without trying to eliminate redundancy within those prefixes (e.g. for 'astronomy' and 'azimuth', it would store just 'as' and 'az', but not try to keep from duplicating the 'a').
A "prefix B-tree" is close to what you've described -- something like a trie, but in a B-tree structure to give good characteristics when stored primarily on disk. Nonetheless, it's intended to remove (most of) the redundancy within the prefixes that form the index.
There is one other question: do you really need to traverse the records in order, or do you just need to look up a specified record quickly? If the latter is adequate, you might be able to use extendible hashing instead. Extendible hashing has been around (in a number of different forms) for a few decades, and still works pretty well. The general idea is fairly simple: hash the strings to create keys of fixed length, then create some sort of tree of those fixed-length pseudo-keys. As with (almost) any hash, you have to be prepared to deal with collisions. As with other hash tables, the details of the hashing and collision resolution vary (though probably not quite as much with extendible hashing as with in-memory hashing).
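A minimal in-memory sketch of extendible hashing follows; it uses the flat-directory form rather than a tree of pseudo-keys, and omits the overflow pages and persistence a real on-disk version would need:

    class Bucket:
        def __init__(self, depth, capacity=4):
            self.depth = depth        # local depth
            self.capacity = capacity
            self.items = {}           # key -> value

    class ExtendibleHash:
        """Toy extendible hash table; on disk, each bucket would be one page."""

        def __init__(self, capacity=4):
            self.global_depth = 1
            self.capacity = capacity
            self.directory = [Bucket(1, capacity), Bucket(1, capacity)]

        def _index(self, key):
            # directory index = low `global_depth` bits of the hash (the pseudo-key)
            return hash(key) & ((1 << self.global_depth) - 1)

        def get(self, key):
            return self.directory[self._index(key)].items.get(key)

        def put(self, key, value):
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < bucket.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)
            self.put(key, value)      # retry; the target bucket now (usually) has room

        def _split(self, bucket):
            if bucket.depth == self.global_depth:
                # double the directory; the new slots alias the existing buckets
                self.directory += self.directory
                self.global_depth += 1
            bucket.depth += 1
            sibling = Bucket(bucket.depth, bucket.capacity)
            bit = 1 << (bucket.depth - 1)
            old_items, bucket.items = bucket.items, {}
            for k, v in old_items.items():          # redistribute by the new bit
                (sibling if hash(k) & bit else bucket).items[k] = v
            for i in range(len(self.directory)):    # repoint half of the aliases
                if self.directory[i] is bucket and i & bit:
                    self.directory[i] = sibling

    eh = ExtendibleHash()
    for i, word in enumerate(["astronomy", "azimuth", "card", "car", "apple", "pear"]):
        eh.put(word, i)
    print(eh.get("azimuth"), eh.get("missing"))  # 1 None

On disk each bucket would be one page, so a lookup costs one directory probe plus one page read; a bucket splits when it overflows, and the directory doubles only when the overflowing bucket is already at the global depth.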
As for real use, major DBMSs and DBMS-like systems use all of the above. B-tree variants are probably the most common in the general-purpose DBMS market (e.g. Oracle or MS SQL Server). Extendible hashing is used in a fair number of more specialized products (e.g. Lotus Domino Server).

What are you doing with the objects?
If you're running a large system that needs low latency to handle lots of concurrent requests, then I'd store the objects in a database and have it take care of the sorting and indexing. That would be much simpler than implementing a B-tree from scratch and possibly getting it wrong.
DBMSs also have caching and various other features that might make your life easier.
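For example, even an embedded engine like SQLite maintains a B-tree index for you; a rough Python sketch (table and column names are made up for the example):

    import sqlite3

    conn = sqlite3.connect("objects.db")
    conn.execute("CREATE TABLE IF NOT EXISTS objects (name TEXT, payload BLOB)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_objects_name ON objects (name)")
    conn.execute("INSERT INTO objects VALUES (?, ?)", ("azimuth", b"..."))
    conn.commit()

    # The index makes exact-match and range lookups on `name` cheap; the engine
    # also handles paging and caching for you.
    row = conn.execute("SELECT payload FROM objects WHERE name = ?",
                       ("azimuth",)).fetchone()
    print(row)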

Start by being clear what you want. Do you want to sort them or index them? Sorting is likely to require moving at least some of the items on disk, but indexing would likely leave them where they are.
If you really want to sort them, Knuth's "The Art of Computer Programming", volume 3, covers sorting and searching in about as much detail as you're likely to want.

Related

Working with a Set that does not fit in memory

Let's say I have a huge list of fixed-length strings, and I want to be able to quickly determine if a new given string is part of this huge list.
If the list remains small enough to fit in memory, I would typically use a set: I would feed it first with the list of strings, and by design, the data structure would allow me to quickly check whether or not a given string is part of the set.
But as far as I can see, the various standard implementations of this data structure store data in memory, and I already know that the huge list of strings won't fit in memory, so I'll somehow need to store this list on disk.
I could rely on something like SQLite to store the strings in an indexed table, then query the table to know whether a string is part of the initial set or not. However, using SQLite for this seems unnecessarily heavy to me, as I definitely don't need all the querying features it supports.
Have you guys faced this kind of problem before? Do you know of any library that might be helpful? (I'm quite language-agnostic, so feel free to throw whatever you have.)
There are multiple ways to efficiently find whether a string is part of a huge set of strings.
A first solution is to use a trie to make the set much more compact. Many strings will likely start with the same prefix, and storing that prefix over and over is not space efficient. This alone may or may not be enough to keep the full set in memory. If not, the upper part of the trie can be kept in memory, referencing leaf-like nodes stored on disk. This enables the application to quickly find which leaf nodes need to be loaded, at a relatively small cost. If the number of strings is not too huge, the leaf parts of the trie hanging off a given in-memory node can even be loaded in one big sequential chunk from the storage device.
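As a point of reference, a minimal in-memory trie for membership tests might look like the sketch below; an external version would serialize the lower subtrees to disk pages and keep only the upper levels resident, as described above.

    class TrieNode:
        __slots__ = ("children", "terminal")
        def __init__(self):
            self.children = {}     # char -> TrieNode
            self.terminal = False  # True if a stored string ends here

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def add(self, word):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.terminal = True

        def __contains__(self, word):
            node = self.root
            for ch in word:
                node = node.children.get(ch)
                if node is None:
                    return False
            return node.terminal

    t = Trie()
    for w in ("astronomy", "astronaut", "azimuth"):
        t.add(w)              # the shared prefixes "a"/"astro" are stored once
    print("astronaut" in t)   # True
    print("astro" in t)       # False (only a prefix, not a stored string)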
Another solution is to use a hash table to quickly find whether a given string exists in the set with low latency (e.g. with only 2 fetches). The idea is simply to hash the searched string and perform a lookup at a specific slot of a big array stored on the storage device. Open addressing can be used to make the structure more compact, at the expense of a possibly higher latency, while only 2 fetches are needed with closed addressing (separate chaining): the first gets the location of the bucket's item list for the given hash, and the second gets the actual items.
One simple way to implement such data structures so they work against a storage device is to use memory mapping. Memory mapping lets you access data on a storage device transparently, as if it were in memory (whatever the language used). However, the cost of accessing the data is that of the storage device, not that of RAM, so the data structure's layout should still be adapted to memory-mapped use for good performance.
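Putting the last two ideas together, here is a rough sketch of an on-disk, memory-mapped hash table of fixed-length strings using open addressing with linear probing. The 16-byte slot size and the table size are arbitrary assumptions; a real implementation would also handle deletions, resizing, and tables that get close to full.

    import hashlib
    import mmap

    SLOT = 16          # fixed string length (assumed); one slot per string
    NSLOTS = 1 << 20   # table size; keep it well above the number of strings

    def _home_slot(key: bytes) -> int:
        digest = hashlib.blake2b(key, digest_size=8).digest()
        return int.from_bytes(digest, "big") % NSLOTS

    def create(path):
        with open(path, "wb") as f:
            f.truncate(SLOT * NSLOTS)   # empty slots are all zero bytes

    def add(mm, key: bytes):
        i = _home_slot(key)
        while mm[i * SLOT:(i + 1) * SLOT].rstrip(b"\0") not in (b"", key):
            i = (i + 1) % NSLOTS        # linear probing past occupied slots
        mm[i * SLOT:(i + 1) * SLOT] = key.ljust(SLOT, b"\0")

    def contains(mm, key: bytes) -> bool:
        i = _home_slot(key)
        while True:
            slot = mm[i * SLOT:(i + 1) * SLOT]
            if slot == b"\0" * SLOT:    # empty slot: the key is not present
                return False
            if slot.rstrip(b"\0") == key:
                return True
            i = (i + 1) % NSLOTS

    create("strings.idx")
    with open("strings.idx", "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        add(mm, b"azimuth")
        print(contains(mm, b"azimuth"), contains(mm, b"astronomy"))  # True False

Each lookup touches only one or a few pages of the file, and the OS page cache keeps the hot pages in memory.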
Finally, you can cache data so that some fetches can be much faster. One way to do that is to use a Bloom filter. A Bloom filter is a very compact, probabilistic, hash-based data structure. It can be used to cache membership information in memory without actually storing any string. False positives are possible, but false negatives are not, so it is good at cheaply rejecting searched strings that are often not in the set, without any (slow) fetch from the storage device. A big Bloom filter can provide very good accuracy. This data structure needs to be combined with the ones above if deterministic results are required. LRU/LFU caches might also help, depending on the distribution of the searched items.
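A minimal Bloom filter sketch (the bit count and hash count are arbitrary; in practice you would derive them from the expected number of items and the acceptable false-positive rate):

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=8_000_000, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, item: str):
            # derive `num_hashes` independent positions by salting one hash function
            for i in range(self.num_hashes):
                digest = hashlib.blake2b(item.encode(), digest_size=8,
                                         salt=i.to_bytes(8, "big")).digest()
                yield int.from_bytes(digest, "big") % self.num_bits

        def add(self, item: str):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item: str) -> bool:
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("azimuth")
    print(bf.might_contain("azimuth"))    # always True: no false negatives
    print(bf.might_contain("astronomy"))  # very likely False; True would be a false positive

Only when might_contain returns True do you pay for a fetch against the on-disk trie or hash table.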

What is the utility of treap data structure?

I am currently studying advanced data structures and I came across a weird data structure called a treap. I understand what a treap is, but I can't seem to find its utility in a valid use-case scenario.
Why should you use such a data structure, and in what types of problems/conditions are treaps best used?
I find myself much more comfortable using hash maps, min/max heaps, binary search trees, or balanced binary search trees, but I can't tell why you should use a treap.
They are easier to implement than most balanced search trees, and more importantly, that makes them easier to modify/maintain in the future if you want to make slight variations on them or change them in some way. They also allow for efficient parallel versions of the set operations union/intersect/difference, which is extremely valuable. Using them simultaneously as a heap and a binary search tree isn't really very handy unless the values you use for priorities happen to be really nicely randomly distributed/permuted. I suppose there might be a case where that would be handy, but it seems really unlikely. Values that randomly distributed are usually more like hash keys, which typically aren't useful as ordered data. How often do you want to pull people out in order of their SSNs? I guess it's possible but unlikely.
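For reference, here is a minimal treap sketch showing how little code the balancing takes: each node gets a random priority, and an ordinary BST insertion is followed by rotations that restore the (max-)heap property on priorities.

    import random

    class Node:
        __slots__ = ("key", "priority", "left", "right")
        def __init__(self, key):
            self.key = key
            self.priority = random.random()   # the random priority does the balancing
            self.left = self.right = None

    def rotate_right(n):
        l = n.left
        n.left, l.right = l.right, n
        return l

    def rotate_left(n):
        r = n.right
        n.right, r.left = r.left, n
        return r

    def insert(root, key):
        """Plain BST insert, then rotate the new node up while it outranks its parent."""
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
            if root.left.priority > root.priority:
                root = rotate_right(root)
        elif key > root.key:
            root.right = insert(root.right, key)
            if root.right.priority > root.priority:
                root = rotate_left(root)
        return root

    def contains(root, key):
        while root is not None:
            if key == root.key:
                return True
            root = root.left if key < root.key else root.right
        return False

    root = None
    for k in (5, 1, 9, 3, 7):
        root = insert(root, k)
    print(contains(root, 3), contains(root, 4))  # True False

Because the priorities are random, the tree is balanced in expectation without the case analysis of red-black or AVL rebalancing; split and merge are similarly short, which is what makes the parallel set operations mentioned above practical.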

A data structure with certain properties

I want to implement a data structure myself in C++11. What I'm planning is a data structure with the following properties:
search: O(log n)
insert: O(log n)
delete: O(log n)
iterate: O(n)
What I have been thinking about after some research is implementing a balanced binary search tree. Are there other structures that would fulfill my needs? I am completely new to this topic and thought a question here would give me a good jump start.
First of all, using the existing standard library data types is definitely the way to go for production code. But since you are asking how to implement such data structures yourself, I assume this is mainly an educational exercise for you.
Binary search trees of some form (https://en.wikipedia.org/wiki/Self-balancing_binary_search_tree#Implementations) or B-trees (https://en.wikipedia.org/wiki/B-tree) and hash tables (https://en.wikipedia.org/wiki/Hash_table) are definitely the data structures that are usually used to accomplish efficient insertion and lookup. If you want to go wild you can combine the two by using a tree instead of a linked list to handle hash collisions (although this has a good potential to actually make your implementation slower if you don't make massive mistakes in sizing your hash table or in choosing an adequate hash function).
Since I'm assuming you want to learn something, you might want to have a look at minimal perfect hashing in the context of hash tables (https://en.wikipedia.org/wiki/Perfect_hash_function), although it only has uses in special applications (I had the opportunity to use a minimal perfect hash function exactly once). But it sure is fascinating. As you can see from the link above, the botany of search trees is virtually limitless in scope, so you can also go wild on that front.

Overhead and (in)efficiency of NoSQL databases?

I have a question about NoSQL type databases, in particular MongoDB, but it applies in general to most key-value or document based storages. Some of the selling points of NoSQL are speed and scalability, but it seems to me that there is significant overhead compared to relational databases.
You have lots of duplication because (almost) everything is unnormalized. You can't do much about it because this is kind of the point of such databases. I'm more concerned about the next ones:
There is a lot of overhead because, if you have a JSON document, you have to save all the keys (and all the structural information) with each document. So for 10000 rows, you'll have to save the strings 'age', 'name', ... 10000 times.
The database can't do a lot of clever stuff like creating indices or binary trees (to save time) or storing integers in a compact way (because one of the free-form documents could have a string where all the others have an int, etc.)
I know you can write your own views or map/reduce algorithms to get something like an index, but it seems at first glance that for the general case NoSQL must be terribly inefficient space and CPU wise.
Is it really that bad? What kinds of optimizations are in place in NoSQL databases (say MongoDB)? What's the overhead in storing lots of identical complex JSON documents compared to using a relational database?
First, any overhead or inefficiency more often than not simply represents a choice of priorities; an overhead somewhere gives you an advantage somewhere else.
As for your specific points, I think the answers will depend a lot on the exact NoSQL product, even among the key-value or document-based subgroup, but here are some thoughts:
1- You have lots of duplication because (almost) everything is unnormalized. You can't do much about it because this is kind of the point of such databases.
Actually, most (if not all) key-value databases can be used with any schema you want. So you can have a "normalized schema" laid upon a key-value store, resulting in no duplication. Don't forget that there are SQL solutions available for some (or most?) key-value databases.
2- There is a lot of overhead because, if you have a JSON document, you have to save all the keys (and all the structural information) with each document. So for 10000 rows, you'll have to save the strings 'age', 'name', ... 10000 times.
I guess this depends on how the database engine is implemented, but compression - either full-blown or a simple "tokenization" of the repeated key names - can be used, and results in no significant overhead there either (a toy sketch of the tokenization idea follows the last point below).
3- The database can't do a lot of clever stuff like creating indices or binary trees (to save time) or storing integers in a compact way (because one of the free-form documents could have a string where all the others have an int, etc.)
Again, nothing prevents a key-value or document-based database from using any kind of tree under the hood, or from storing integers in a compact way (for example, it can keep a simple binary flag indicating whether the data is stored as a string or as a "compact integer"). As for creating indices, that is also possible (for the same reasons stated in 1, or done manually by the application).
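To illustrate points 2 and 3, here is a toy encoding (not how MongoDB or any particular engine actually stores documents) in which repeated field names are replaced by one-byte tokens from a per-collection dictionary, and a type flag lets integers be stored compactly:

    import json
    import struct

    # Hypothetical per-collection dictionary, stored once rather than in every document.
    FIELD_TOKENS = {"name": 0, "age": 1}

    def encode(doc: dict) -> bytes:
        """Encode a flat document as (field token, type flag, value) records."""
        out = bytearray()
        for field, value in doc.items():
            out.append(FIELD_TOKENS[field])            # 1 byte instead of the key string
            if isinstance(value, int):
                out.append(0x01)                       # type flag: 64-bit integer
                out += struct.pack("<q", value)
            else:
                raw = str(value).encode("utf-8")
                out.append(0x02)                       # type flag: length-prefixed string
                out += struct.pack("<I", len(raw)) + raw
        return bytes(out)

    doc = {"name": "Alice", "age": 30}
    print(len(json.dumps(doc)), len(encode(doc)))      # the binary form repeats no key names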

NoSQL or YesSQL

I have a huge dictionary of words:
"word1" => [value1]
"word2" => [value2]
"word3" => [value3, value2]
...
"word400000000" => [value455, value3435, ..., value3423]
The number of words is really big.
Now I want to be able to retrieve, really fast, all the values pointed to by a word. word is a string value.
What are the best tools to use? I thought of a simple DB solution, but the DBA guys said it would not be really fast.
So, before I open Cormen's book, are there some ready-made solutions for this problem?
Look at key/value storage engines such as Berkeley DB. They are very fast at that sort of thing.
In RDBMSs (YesSQL) you will most probably search the values with LIKE or = operators over all records, i.e. search will take O(n). What you actually need is a data structure called an inverted index, which allows you to find the list of needed values in O(1). For a description of the structure and algorithms see the Wikipedia article; for ready-to-use tools, keep reading.
There are plenty of implementations of inverted indexes in search engines like Lucene/Solr and Sphinx (which, by the way, supports several databases as data sources), and also in some key-value stores like Berkeley DB or Apache Cassandra. The distinction between search engines and key-value stores is that:
Search engines implement the inverted index more directly (AFAIK, key-value DBs use BigTable-like structures, which are much more complex than the inverted index itself).
Search engines have plenty of tools for text analysis (parsing, stemming). I don't know if you actually need them, but if you do, use a search engine.
Key-value DBs are real databases. That is, unlike search engines they have real data types, not only strings. Moreover, some such DBs (e.g. Berkeley DB) can store programming-language-native data types without converting them to any internal format. So, if you need a real database with all its features, use a key-value store.
Also note that an inverted index is a really simple structure, so you can easily implement it yourself if none of the previous options is suitable for you.
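As a starting point, the in-memory core of an inverted index is only a few lines; persisting it would just mean storing each postings set in one of the key-value stores mentioned above.

    from collections import defaultdict

    # word -> set of value ids; in the question's notation, "word3" => [value3, value2]
    index = defaultdict(set)

    def add(word: str, value_id: int):
        index[word].add(value_id)

    def lookup(word: str) -> set:
        return index.get(word, set())   # one hash lookup, independent of corpus size

    add("word1", 1)
    add("word3", 3)
    add("word3", 2)
    print(lookup("word3"))  # {2, 3} (in some order)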
It really depends on what behavior you want. If you just want to be able to do an exact text search, then a hash table is probably a really great idea. It has expected O(1) lookup, which is about as fast as you're going to get.
If you need the elements in sorted order (for example, so you can iterate across them in a reasonable order), then one of the myriad balanced search trees might be a good candidate; for example, a red-black tree or an AVL tree.
If you're working with a huge data set that can't all fit into main memory, then a very good choice might be a B-tree, a type of balanced search tree that minimizes the number of disk reads required to find a given element. Most database systems use some flavor of B-tree for their lookups.
You can use Cassandra (http://cassandra.apache.org/). It is easy to get started with, has plenty of documentation, and is a really fast solution for your problem.
Hope this helps,
If you know that you will only want to search for values based on words and not the other way around, use a simple Key-Value store. Maybe Redis would be best.
If you think you will ever need to search based on the values, then you'll likely need Secondary Indices or off-line MapReduce jobs. Maybe Cassandra would be best.
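To make the Redis suggestion concrete, the word-to-values mapping in the question translates directly onto Redis sets; a rough sketch with the redis-py client (the connection details and key naming are assumptions):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # word -> set of values, e.g. "word3" => [value3, value2]
    r.sadd("word3", "value3", "value2")

    # Retrieving everything a word points to is a single call, proportional to the set size.
    print(r.smembers("word3"))  # {b'value3', b'value2'}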

Resources