NoSQL or YesSQL - algorithm

I have a huge dictionary of words:
"word1" => [value1]
"word2" => [value2]
"word3" => [value3, value2]
...
"word400000000" => [value455, value3435, ..., value3423]
The number of words is really big.
Now I want to be able to retrieve, really fast, all the values pointed to by a given word, where the word is a string.
What are the best tools to use? I thought of a simple DB solution, but the DBA guys said that it would not be really fast.
So, before I open Cormen's book, are there any ready-made solutions for this problem?

Look at key/value storage engines such as Berkeley DB. They are very fast at that sort of thing.
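As a rough sketch of the key/value approach, the snippet below uses Python's standard-library dbm.dumb module as a stand-in for an engine such as Berkeley DB (an assumption made so the example runs without extra packages). The file name, the helper names build_store and lookup, and the JSON encoding of the value lists are hypothetical choices, not part of any particular product.

```python
import dbm.dumb
import json
import os
import tempfile

def build_store(path, entries):
    """Write word -> list-of-values pairs into an on-disk key/value file."""
    with dbm.dumb.open(path, "c") as db:
        for word, values in entries.items():
            # dbm stores bytes, so the list of values is serialized as JSON.
            db[word.encode("utf-8")] = json.dumps(values).encode("utf-8")

def lookup(path, word):
    """Return the list of values for a word, or an empty list if it is absent."""
    with dbm.dumb.open(path, "r") as db:
        raw = db.get(word.encode("utf-8"))
        return json.loads(raw) if raw is not None else []

# Hypothetical sample data, stored in a temporary directory.
store_path = os.path.join(tempfile.mkdtemp(), "words")
build_store(store_path, {"word1": ["value1"], "word3": ["value3", "value2"]})
print(lookup(store_path, "word3"))
```

A dedicated engine would replace dbm.dumb here; the access pattern (one key lookup per word) stays the same.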

In RDBMSs (YesSQL) you will most probably search values with the LIKE or = operators, and without an index on the word column that means checking all records, i.e. the search takes O(n). What you actually need is a data structure called an inverted index, which lets you find the list of values for a word in expected O(1) time when it is backed by a hash table. For a description of the structure and algorithms see the Wikipedia article; for ready-to-use tools keep reading.
There are plenty of implementations of the inverted index in search engines like Lucene/Solr and Sphinx (which, by the way, supports several databases as data sources), and also in some key-value stores like Berkeley DB or Apache Cassandra. The distinction between search engines and key-value stores is this:
Search engines implement the inverted index more directly (AFAIK, key-value DBs use BigTable-like structures that are much more complex than the inverted index itself).
Search engines have plenty of tools for text analysis (parsing, stemming). I don't know if you actually need them, but if you do, use a search engine.
Key-value DBs are real databases, i.e., unlike search engines they have real data types, not only strings. Moreover, some of these DBs (e.g. Berkeley DB) can store a programming language's native data types without converting them to an internal format. So, if you need a real database with all its features, use a key-value store.
Also note that an inverted index is a really simple structure, so you can easily implement it yourself if none of the previous options suits you.
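To illustrate how simple the structure is, here is a minimal in-memory sketch of an inverted index. It assumes whitespace tokenization and lowercase folding, which real search engines replace with proper text analysis; the function name and the sample documents are made up for the example.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: words are separated by whitespace and compared case-insensitively.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical sample documents.
documents = {
    1: "fast key value store",
    2: "inverted index for fast search",
    3: "search engine index",
}
index = build_inverted_index(documents)
print(index.get("fast", []))
print(index.get("index", []))
```

Each lookup is a single dictionary access, which is where the expected constant-time behavior comes from.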

It really depends on what behavior you want. If you just want to be able to do an exact text search, then a hash table is probably a really great idea. It has expected O(1) lookup, which is about as fast as you're going to get.
If you need the elements in sorted order (for example, so you can iterate across them in a reasonable order), then one of the myriad balanced search trees might be a good candidate; for example, a red-black tree or an AVL tree.
If you're working with a huge data set that can't all fit into main memory, then a very good choice might be a B-tree, which is a balanced multiway (not binary) search tree that minimizes the number of disk reads required to find a given element. Most database systems use some flavor of B-trees for their lookups.
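The difference between exact lookup and ordered access can be sketched as follows. Python's standard library has no balanced tree map, so a sorted list searched with bisect stands in for one here (an assumption for illustration; it gives O(log n) search but not O(log n) insertion). The word list and helper names are hypothetical.

```python
import bisect

# Hypothetical sample keys, kept in sorted order.
words = sorted(["apple", "apply", "banana", "band", "bandana", "cherry"])

def contains(sorted_words, word):
    """Exact-match search by binary search, O(log n) comparisons."""
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

def with_prefix(sorted_words, prefix):
    """All words starting with prefix; only possible because the keys are ordered.

    Assumption: no key contains the highest code point U+10FFFF,
    which is used as an upper sentinel.
    """
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_left(sorted_words, prefix + "\U0010ffff")
    return sorted_words[lo:hi]

print(contains(words, "band"))
print(with_prefix(words, "ban"))
```

A hash table answers the first kind of query faster, but it cannot answer the second kind without scanning every key.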

You can use Cassandra (http://cassandra.apache.org/). It is easy to get started with, has plenty of documentation, and is a really fast solution for your problem.
Hope this helps,

If you know that you will only want to search for values based on words and not the other way around, use a simple Key-Value store. Maybe Redis would be best.
If you think you will ever need to search based on the values, then you'll likely need Secondary Indices or off-line MapReduce jobs. Maybe Cassandra would be best.

Related

What is the utility of treap data structure?

I am currently studying advanced data structures and I came across a weird data structure called a treap. I understand what a treap is, but I can't seem to find its utility in a valid use case.
Why should you use such a data structure, and in what types of problems or conditions are treaps best used?
I find myself much more inclined to use hash maps, min/max heaps, binary search trees or balanced binary search trees, but I can't tell why you should use a treap.
They are easier to implement and, more importantly, that makes them easier to modify and maintain in the future if you want to make slight variations on them or change them in some way. They also allow for efficient parallel versions of the set operations Union/Intersect/Difference, which is extremely valuable. Using them simultaneously as a heap and a binary tree isn't really very handy unless the values you use for priorities happen to be nicely randomly distributed/permuted. I suppose there might be a case where that would be handy, but it seems really unlikely. Data that randomly distributed is usually more like a hash key, which typically isn't useful as ordered data. How often do you want to pull people out in order of their SSNs? I guess it's possible but unlikely.

A data structure with certain properties

I want to implement a data structure myself in C++11. What I'm planning to do is having a data structure with the following properties:
search. O(log(n))
insert. O(log(n))
delete. O(log(n))
iterate. O(n)
What I have been thinking about after research was implementing a balanced binary search tree. Are there other structures that would fulfill my needs? I am completely new to this topic and thought a question here would give me a good jumpstart.
First of all, using the existing standard library data types is definitely the way to go for production code. But since you are asking how to implement such data structures yourself, I assume this is mainly an educational exercise for you.
Binary search trees of some form (https://en.wikipedia.org/wiki/Self-balancing_binary_search_tree#Implementations), B-trees (https://en.wikipedia.org/wiki/B-tree) and hash tables (https://en.wikipedia.org/wiki/Hash_table) are definitely the data structures that are usually used to accomplish efficient insertion and lookup. If you want to go wild you can combine the two by using a tree instead of a linked list to handle hash collisions (although this is quite likely to make your implementation slower, unless you have made massive mistakes in sizing your hash table or in choosing an adequate hash function).
Since I'm assuming you want to learn something, you might want to have a look at minimal perfect hashing in the context of hash tables (https://en.wikipedia.org/wiki/Perfect_hash_function) although this only has uses in special applications (I had the opportunity to use a perfect minimal hash function exactly once). But it sure is fascinating. As you can see from the link above, the botany of search trees is virtually limitless in scope so you can also go wild on that front.

Overhead and (in)efficiency of NoSQL databases?

I have a question about NoSQL type databases, in particular MongoDB, but it applies in general to most key-value or document based storages. Some of the selling points of NoSQL are speed and scalability, but it seems to me that there is significant overhead compared to relational databases.
You have lots of duplication because (almost) everything is unnormalized. You can't do much about it because this is kind of the point of such databases. I'm more concerned about the next ones:
There is a lot of overhead because, if you have a JSON document, you have to save all the keys (and all the structural information) with each document. So for 10000 rows, you'll have to save the strings 'age', 'name', ... 10000 times.
The database can't do a lot of clever stuff like creating indices or binary trees (to save time) or storing integers in a compact way (because one of the free-form documents could have a string where all the others have an int, etc.)
I know you can write your own views or map/reduce algorithms to get something like an index, but it seems at first glance that for the general case NoSQL must be terribly inefficient space and CPU wise.
Is it really that bad? What kinds of optimizations are in place in NoSQL databases (say MongoDB)? What's the overhead in storing lots of identical complex JSON documents compared to using a relational database?
First, any overhead or inefficiency more often than not simply represents a choice of priorities; an overhead somewhere gives you an advantage somewhere else.
As for your specific points, again, I think the answers will depend a lot on the exact NoSQL product, even within the key-value or document-based subgroup, but here are some thoughts:
1- You have lots of duplication because (almost) everything is unnormalized. You can't do much about it because this is kind of the point of such databases.
Actually, most (if not all) key-value databases can be used with any schema you want. So you can have a "normalized schema" laid upon a key-value store, resulting in no duplication. Don't forget that there are SQL solutions available for some (or most?) key-value databases.
2- There is a lot of overhead because, if you have a JSON document, you have to save all the keys (and all the structural information) with each document. So for 10000 rows, you'll have to save the strings 'age', 'name', ... 10000 times.
I guess this depends on how the database engine is implemented, but compression, whether sophisticated or a simple "tokenization" of key names, can be used and results in no significant overhead there either.
3- The database can't do a lot of clever stuff like creating indices or binary trees (to save time) or storing integers in a compact way (because one of the free-form documents could have a string where all the others have an int, etc.)
Again, nothing prevents a key-value or document-based database from using any kind of tree under the hood or from storing integers in a compact way (for example, it can have a simple binary flag to indicate whether the data is stored as a string or as a "compact integer"). As for creating indices, that is also possible (for the same reasons stated in 1, or done manually by the application).
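The "tokenization" idea from point 2 can be sketched as follows: field names are stored once in a shared table and each document keeps only small integer ids. This is a hypothetical illustration of the technique, not a description of how MongoDB or any other specific engine stores documents; the class and method names are made up.

```python
class FieldNameTable:
    """Stores each field name once and replaces it by a small integer id in documents."""

    def __init__(self):
        self.ids = {}    # field name -> id
        self.names = []  # id -> field name

    def encode(self, document):
        encoded = {}
        for name, value in document.items():
            if name not in self.ids:
                self.ids[name] = len(self.names)
                self.names.append(name)
            encoded[self.ids[name]] = value
        return encoded

    def decode(self, encoded):
        return {self.names[field_id]: value for field_id, value in encoded.items()}

# Hypothetical sample rows with identical structure.
table = FieldNameTable()
rows = [{"name": "Ann", "age": 30}, {"name": "Bob", "age": 41}]
stored = [table.encode(row) for row in rows]
print(table.names)   # each key string is kept only once
print(stored[1])     # documents carry ids instead of key strings
```

With 10000 such rows, the strings 'name' and 'age' would still appear only once in the table.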

External store for complex collections that can be accessed by Key-Value

Problem
I need a key-value store that can store values of the following form:
DS<DS<E>>
where the data structure DS can be
either a List, SortedSet or an Array
and E can be either a String or byte-array.
It is very expensive to generate this data and so once I put it into the store, I will only perform read queries on it. Essentially it is a complex object cache with no eviction.
Example Application
A (possibly bad, but sufficient to clarify) example of an application is storing tokenized sentences from a document where you need to be able to quickly access the qth word of the pth sentence given documentID. In this case, I would be storing it as a K-V pair as follows:
K - docID
V - List<List<String>>
String word = map.get(docID).get(p).get(q);
I prefer to avoid app-integrated Map solutions (such as EhCache within Java).
I have worked with Redis but it doesn't appear to support the second layer of data-structure complexity. Any other K-V solutions that can help my use case?
Update:
I know that I could serialize/deserialize my object but I was wondering if there is any other solution.
In terms of platform choice you have two options. A full document database will support arbitrarily complex objects, but won't have built-in commands for working with specific data structures. Something like Redis, which does have optimised code for specific data structures, can't support all possible data structures.
You can actually get pretty close with Redis by using ids instead of the nested data structure. DS1<DS2<E>> becomes DS1<int> and DS2<E>, with the int from DS1 and a prefix giving you the key holding DS2.
With this structure you can access any E with only two operations. In some cases you will be able to get that down to a single operation by knowing what the id of DS2 will be for a given query.
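The id-based layout can be sketched like this. A plain Python dict stands in for the Redis keyspace so the example runs on its own (an assumption; with a real Redis server each indexed read below would be a list command such as LINDEX). The key prefixes "doc:" and "sentence:" and the helper names are hypothetical.

```python
import itertools

store = {}                    # stand-in for the Redis keyspace: key -> list
_next_id = itertools.count(1) # hypothetical id generator for the inner lists

def put_document(doc_id, sentences):
    """Store List<List<String>> as one list of ids plus one list per inner sentence."""
    sentence_ids = []
    for words in sentences:
        sid = next(_next_id)
        store[f"sentence:{sid}"] = list(words)   # DS2<E>
        sentence_ids.append(sid)
    store[f"doc:{doc_id}"] = sentence_ids        # DS1<int>

def get_word(doc_id, p, q):
    """Fetch the q-th word of the p-th sentence with two reads."""
    sid = store[f"doc:{doc_id}"][p]      # first operation: read the id from the outer list
    return store[f"sentence:{sid}"][q]   # second operation: read the element from the inner list

put_document("d1", [["the", "cat"], ["a", "dog", "barks"]])
print(get_word("d1", 1, 2))
```

If the inner ids were derived from the document id and position instead of a counter, the first read could be skipped, which is the single-operation case mentioned above.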
I hesitate to "recommend" it, but one of the only storage engines I know of which handles multi-dimensional data of this sort efficiently is InterSystems Caché. I had to use it at my last job, mostly coding against it using its built-in MUMPS-based language. I would not recommend the native approach, unless you hate yourself or your developers. However, they do have decent Java adapters, and Java appears to be what you're using. I've seen it handle billions of records, efficiently stored in nested binary tree tables. There is no practical limit to the depth (number of dimensions) you can use. However, this is very much a proprietary solution. There is an open-source alternative called GT.M, but I don't know how compatible it is with languages that aren't M or C.
Any Key-Value store supports complex values, you just need to serialize/deserialize the data.
If you want fast retrieval only for specific parts of the data, you could use a more complex Key. In your example this would be:
K - tuple(docID, p, q)
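A small sketch of the composite-key idea: every word gets its own entry whose key combines the document id with both positions. The "docID:p:q" string format and the flatten helper are assumptions for illustration (they presume document ids contain no colon); a dict stands in for the key-value store.

```python
def flatten(doc_id, sentences):
    """Yield one (key, word) pair per word, keyed by document, sentence and word position."""
    for p, sentence in enumerate(sentences):
        for q, word in enumerate(sentence):
            yield f"{doc_id}:{p}:{q}", word

# Hypothetical document with two tokenized sentences.
kv = dict(flatten("doc7", [["hello", "world"], ["key", "value", "store"]]))
print(kv["doc7:1:2"])
```

The trade-off is more, smaller entries in exchange for reading a single word without deserializing the whole document.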

Efficient storage of external index of strings

Say you have a large collection of n objects on disk and each one has a variable-sized string. What are the common practices for efficiently indexing those objects by plain string comparison? Storing the whole strings in the index would be prohibitive in the long run due to size and I/O, but since disks have high latency, storing only references isn't a good idea either.
I've been thinking on using a B-Tree-like design with tries but can't find any database implementation using this approach. In fact, it's hard to find how major databases implement indexes for strings (it probably gets lost in the vast results for SQL-level information.)
TIA!
EDIT: changed title from "Efficient external sorting and searching of stored objects with large strings" to "Efficient storage of external index of strings."
A "prefix B-tree" or "simple prefix B-tree" would probably be helpful here.
A "simple prefix B-tree" is a bit simpler, just storing the shortest prefix that separates two items, without trying to eliminate redundancy within those prefixes (e.g. for 'astronomy' and 'azimuth', it would store just 'as' and 'az', but not try to keep from duplicating the 'a').
A "prefix B-tree" is close to what you've described -- something like a trie, but in a B-tree structure to give good characteristics when stored primarily on disk. Nonetheless, it's intended to remove (most of) the redundancy within the prefixes that form the index.
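The prefix shortening described above can be sketched in a few lines. The first helper reproduces the 'as'/'az' example; the second computes a single shortest separator between two adjacent keys, which is one common way index nodes are kept small. Both function names are hypothetical and the sketch ignores how the prefixes are laid out in B-tree pages.

```python
def distinguishing_prefixes(a, b):
    """Shortest prefixes of a and b that are enough to tell the two keys apart."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return a[:i + 1], b[:i + 1]

def shortest_separator(left, right):
    """Shortest prefix s of right with left < s <= right (assumes left < right)."""
    for n in range(1, len(right) + 1):
        if right[:n] > left:
            return right[:n]
    return right

print(distinguishing_prefixes("astronomy", "azimuth"))
print(shortest_separator("astronomy", "azimuth"))
```

Any key that compares below the separator belongs to the left subtree, so the full strings only need to live in the leaves.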
There is one other question: do you really need to traverse the records in order, or do you just need to look up a specified record quickly? If the latter is adequate, you might be able to use extendible hashing instead. Extendible hashing has been around (in a number of different forms) for a few decades, and still works pretty well. The general idea is fairly simple: hash the strings to create keys of fixed length, then create some sort of tree of those fixed-length pseudo-keys. As with (almost) any hash, you have to be prepared to deal with collisions. As with other hash tables, the details of the hashing and collision resolution vary (though probably not quite as much with extendible hashing as in-memory hashing).
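As a rough in-memory sketch of extendible hashing: a directory indexed by the low bits of the hash points to fixed-capacity buckets, a full bucket is split in two, and the directory doubles only when the splitting bucket already uses all directory bits. The class names, the tiny bucket size and the use of SHA-1 as the hash are assumptions for illustration; an on-disk version would map each bucket to a page.

```python
import hashlib

class Bucket:
    def __init__(self, depth):
        self.depth = depth  # number of hash bits this bucket distinguishes (local depth)
        self.items = {}

class ExtendibleHash:
    def __init__(self, bucket_size=2):
        self.bucket_size = bucket_size
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    @staticmethod
    def _hash(key):
        # Fixed-length pseudo-key derived from the variable-length string.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def _index(self, key):
        return self._hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key, default=None):
        return self.directory[self._index(key)].items.get(key, default)

    def put(self, key, value):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < self.bucket_size:
                bucket.items[key] = value
                return
            self._split(bucket)  # bucket is full: split it and retry

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # Double the directory; both halves initially share the same buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        high_bit = 1 << (bucket.depth - 1)
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            target = sibling if self._hash(k) & high_bit else bucket
            target.items[k] = v
        # Redirect the directory slots whose newly significant bit is set.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling

table = ExtendibleHash(bucket_size=2)
for i in range(100):
    table.put(f"word{i}", i)
print(table.get("word42"), table.global_depth)
```

A lookup touches one directory slot and one bucket, which is why the scheme works well when each bucket is a disk page.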
As for real use, major DBMS and DBMS-like systems use all of the above. B-tree variants are probably the most common in the general purpose DBMS market (e.g. Oracle or MS SQL Server). Extendible hashing is used in a fair number of more-specialized products (e.g., Lotus Domino Server).
What are you doing with the objects?
If you're running a large system that needs low latency to handle lots of concurrent requests, then I'd store the objects in a database and have it take care of the sorting and indexing. This would be much simpler than implementing a B-tree from scratch and risking a buggy implementation.
DBMSs also have caching and various other features that might make your life easier.
Start by being clear what you want. Do you want to sort them or index them? Sorting is likely to require moving at least some of the items on disk, but indexing would likely leave them where they are.
If you really want to sort them, volume three of Knuth's "The Art of Computer Programming" covers sorting and searching in about as much detail as you're likely to want.

Resources