What is the best performing general purpose data structure? [closed] - data-structures

I'm looking to create a library where one of the things users will need to do is store and retrieve data, along with an index. I don't know which they'll be doing more of: inserting, reading/writing, deleting, or random searching.
What kind of data structure would you use so they get the best performance in general? How would your proposed data structure compare, performance-wise, in each scenario?
Thinking a hash table or an AVL tree? Or some combination of data structures? A linked list of arrays?
What would be cool is if it self optimized, so it sees the user is doing more inserts or reads or random searches, so future inserts are optimized for that.

There's no single best data structure that does all of this; if there were, I promise you everyone would be using it. However, there are a couple of very reasonable options available.
The first question to think about is what you need to do with the data. If you're just storing items and looking them up later on, and all you need to do is add, remove, and look up items, then you might want to look more toward various flavors of hash tables. On the other hand, if you're looking for the ability to process items in sorted order, then hash tables are probably out and you should look more toward balanced trees.
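To make that concrete, here is a small C++ sketch contrasting the two: std::unordered_map (a hash table) gives expected O(1) lookups but unspecified iteration order, while std::map (typically a red-black tree) gives O(log n) lookups plus sorted iteration.

    #include <iostream>
    #include <map>
    #include <string>
    #include <unordered_map>

    int main() {
        // Hash table: expected O(1) lookup, but iteration order is unspecified.
        std::unordered_map<std::string, int> byHash{{"carol", 3}, {"alice", 1}, {"bob", 2}};

        // Balanced tree (typically red-black): O(log n) lookup, sorted iteration.
        std::map<std::string, int> byTree{{"carol", 3}, {"alice", 1}, {"bob", 2}};

        for (const auto& [k, v] : byTree)
            std::cout << k << " -> " << v << '\n';  // prints alice, bob, carol in order
    }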
The next question is what type of data you're storing. If each item has some associated key, what kind of key is it? Both hash tables and BSTs are great in general, but more specialized data structures exist for particular key types, such as tries for string keys and structures tailored to integers.
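As one illustration of a key-specialized structure, here is a minimal trie sketch. It assumes lowercase ASCII keys, and lookup cost depends on the key's length rather than on how many keys are stored.

    #include <array>
    #include <memory>
    #include <string>

    // A minimal trie over lowercase ASCII keys: a sketch, not production code.
    struct TrieNode {
        std::array<std::unique_ptr<TrieNode>, 26> child;
        bool isKey = false;
    };

    void insert(TrieNode& root, const std::string& key) {
        TrieNode* node = &root;
        for (char c : key) {
            auto& slot = node->child[c - 'a'];
            if (!slot) slot = std::make_unique<TrieNode>();
            node = slot.get();
        }
        node->isKey = true;
    }

    bool contains(const TrieNode& root, const std::string& key) {
        const TrieNode* node = &root;
        for (char c : key) {
            node = node->child[c - 'a'].get();
            if (!node) return false;
        }
        return node->isKey;
    }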
From there you should think about how much data you're storing. If you're storing a couple hundred megabytes and things fit comfortably in RAM, you might not need to do anything special here. But if you have a truly huge amount of data and things don't fit into RAM, you'll need to look into external data structures like B-trees.
Another question to consider is what kind of performance guarantees you want. Most hash tables require some sort of dynamic resizing as the number of items increases, which can lead to infrequent but expensive rebuild operations that can slow things down. If you absolutely need real-time performance, this won't work for you. If you're okay with that, then go for it!
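If you know the rough element count ahead of time, one common mitigation is to reserve capacity up front so the rehashes never fire mid-stream. A sketch with std::unordered_map (the one-million figure is purely illustrative):

    #include <unordered_map>

    int main() {
        std::unordered_map<int, int> table;
        // One bucket allocation up front instead of repeated rehashes as it grows.
        table.reserve(1'000'000);  // illustrative size, not a recommendation
        for (int i = 0; i < 1'000'000; ++i)
            table.emplace(i, i * 2);
    }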
And let's suppose you've then narrowed things down to, say, "a hash table" or "a balanced BST." Now you have to select which type to use! For hash tables, simple structures like linear probing hash tables or chained hashing often need some performance tuning to be maximally efficient. Newer approaches like cuckoo hashing can give better memory performance in some cases, while engineered approaches like Google's flat_hash_map are extremely optimized for the x86 architecture. For BSTs, you might want something like an AVL tree if you have way more lookups than insertions or deletions, since AVL trees have a low height, but you might also want to look at red/black trees if insertions and deletions are more common, and perhaps into more modern trees like RAVL or WAVL trees if you really have a lot of deletions.
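As one small example of that tuning, a chained table such as std::unordered_map exposes its load factor; lowering the threshold trades memory (more, mostly empty, buckets) for shorter chains. The 0.5 below is just one plausible setting, not a recommendation:

    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> table;
        // Lower load factor: more buckets, but shorter collision chains.
        table.max_load_factor(0.5f);  // the default is 1.0
        table.insert({"example", 42});
    }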
All of this is to say that the answer is "it depends." The more you know about your particular application, the better a data structure you'll be able to pick. And, sadly, there is no One Data Structure To Rule Them All. :-)

Related

What is the utility of treap data structure?

I am currently studying advanced data structures and I came across a weird data structure called a treap. I understand what a treap is, but I can't seem to find its utility in a valid use-case scenario.
Why should you use such a data structure, and in what types of problems/conditions are treaps best used?
I find myself much more inclined to use hash maps, min/max heaps, binary search trees, or balanced binary search trees, but I can't tell why you should use a treap.
They are easier to implement and, more importantly, that makes them easier to modify and maintain in the future if you want to make slight variations or change them in some way. They also allow for efficient parallel versions of the set operations union/intersect/difference, which is extremely valuable. Using them simultaneously as a heap and a binary tree isn't really very handy unless the values you use for priorities happen to be nicely randomly distributed/permuted. I suppose there might be a case where that would be handy, but it seems really unlikely. Values that randomly distributed are usually more like hash keys, which typically aren't useful as ordered data. How often do you want to pull people out in order of their SSNs? I guess it's possible, but unlikely.
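For a feel of why treaps are considered easy to implement, here is a sketch of insertion: nodes are BST-ordered on the key and heap-ordered on a random priority, and two rotations maintain both invariants. Deletion and the parallel set operations are omitted for brevity.

    #include <cstdlib>

    // Sketch of a treap: BST-ordered on `key`, heap-ordered on random `priority`.
    struct Node {
        int key;
        int priority;
        Node* left = nullptr;
        Node* right = nullptr;
        explicit Node(int k) : key(k), priority(std::rand()) {}
    };

    Node* rotateRight(Node* y) {  // lift y->left above y
        Node* x = y->left;
        y->left = x->right;
        x->right = y;
        return x;
    }

    Node* rotateLeft(Node* x) {   // lift x->right above x
        Node* y = x->right;
        x->right = y->left;
        y->left = x;
        return y;
    }

    Node* insert(Node* root, int key) {
        if (!root) return new Node(key);
        if (key < root->key) {
            root->left = insert(root->left, key);
            if (root->left->priority > root->priority) root = rotateRight(root);
        } else {
            root->right = insert(root->right, key);
            if (root->right->priority > root->priority) root = rotateLeft(root);
        }
        return root;
    }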

A data structure with certain properties

I want to implement a data structure myself in C++11. What I'm planning is a data structure with the following properties:
search: O(log n)
insert: O(log n)
delete: O(log n)
iterate: O(n)
What I had been thinking about after some research was implementing a balanced binary search tree. Are there other structures that would fulfill my needs? I am completely new to this topic and thought a question here would give me a good jumpstart.
First of all, using the existing standard library data types is definitely the way to go for production code. But since you are asking how to implement such data structures yourself, I assume this is mainly an educational exercise for you.
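For reference, std::set (usually a red-black tree underneath) already meets all four bounds from the question:

    #include <iostream>
    #include <set>

    int main() {
        std::set<int> s;                    // typically a red-black tree
        s.insert(42);                       // insert: O(log n)
        bool found = s.count(42) > 0;       // search: O(log n)
        s.erase(42);                        // delete: O(log n)
        for (int x : s) std::cout << x;     // iterate: O(n), in sorted order
        (void)found;
    }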
Binary search trees of some form (https://en.wikipedia.org/wiki/Self-balancing_binary_search_tree#Implementations) or B-trees (https://en.wikipedia.org/wiki/B-tree) and hash tables (https://en.wikipedia.org/wiki/Hash_table) are definitely the data structures that are usually used to accomplish efficient insertion and lookup. If you want to go wild you can combine the two by using a tree instead of a linked list to handle hash collisions (although this has a good potential to actually make your implementation slower if you don't make massive mistakes in sizing your hash table or in choosing an adequate hash function).
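Here is a rough sketch of that hybrid; the class name and the fixed bucket count of 64 are made up for illustration. Each collision bucket is a balanced tree, so a badly loaded bucket degrades to O(log n) rather than O(n):

    #include <functional>
    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical hash table whose collision buckets are balanced trees.
    class TreeBucketMap {
        std::vector<std::map<std::string, int>> buckets_{64};  // fixed size for the sketch
    public:
        void insert(const std::string& key, int value) {
            buckets_[std::hash<std::string>{}(key) % buckets_.size()][key] = value;
        }
        const int* find(const std::string& key) const {
            const auto& bucket = buckets_[std::hash<std::string>{}(key) % buckets_.size()];
            auto it = bucket.find(key);
            return it == bucket.end() ? nullptr : &it->second;
        }
    };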
Since I'm assuming you want to learn something, you might want to have a look at minimal perfect hashing in the context of hash tables (https://en.wikipedia.org/wiki/Perfect_hash_function) although this only has uses in special applications (I had the opportunity to use a perfect minimal hash function exactly once). But it sure is fascinating. As you can see from the link above, the botany of search trees is virtually limitless in scope so you can also go wild on that front.

Optimal physical orderings of nodes

First, read this:
TPT paper
I was wondering what other options might exist for arranging nodes to boost performance. Anything from post-parent order in a byte array, like TPTs, to something more like a k-order B-tree: what good options are known at the moment?
A bit more on the problem:
I have an extremely fast way of finding elements within a sparse set, given some concept of adjacency to a given pointer. I was wondering how I could best take advantage of this in storing a patricia trie.
You can make assumptions about whether the trie will be random-access, read only, write-seldom, or add-only. Please note them if you do, but I've actually used a TPT and the gains were pretty significant so I'm willing to consider certain constraints.
Update
I guess in some senses this was a little unclear. What I'm looking for here is ways of arranging things in memory that optimize one performance metric or another. The TPTs, through some tricks, use node order to optimize disk reads and space-per-node. I'm curious about:
Total deletion, where the structure is removed from memory entirely.
Inserts, particularly in densely populated structures.
Deletes, again, particularly in densely populated structures.
A DAWG or a minimal DFA (see this question or the paper "How to squeeze a lexicon") may be even better than a TPT, because the total size is smaller.

Efficient storage of external index of strings

Say you have a large collection of n objects on disk, each with a variable-sized string. What are common practices for efficiently indexing those objects using plain string comparison? Storing the whole strings in the index would be prohibitive in the long run due to size and I/O, but since disks have high latency, storing only references isn't a good idea either.
I've been thinking of using a B-tree-like design with tries, but I can't find any database implementation using this approach. In fact, it's hard to find how major databases implement indexes for strings (it probably gets lost in the vast results for SQL-level information).
TIA!
EDIT: changed title from "Efficient external sorting and searching of stored objects with large strings" to "Efficient storage of external index of strings."
A "prefix B-tree" or "simple prefix B-tree" would probably be helpful here.
A "simple prefix B-tree" is a bit simpler, just storing the shortest prefix that separates two items, without trying to eliminate redundancy within those prefixes (e.g. for 'astronomy' and 'azimuth', it would store just 'as' and 'az', but not try to keep from duplicating the 'a').
A "prefix B-tree" is close to what you've described -- something like a trie, but in a B-tree structure to give good characteristics when stored primarily on disk. Nonetheless, it's intended to remove (most of) the redundancy within the prefixes that form the index.
There is one other question: do you really need to traverse the records in order, or do you just need to look up a specified record quickly? If the latter is adequate, you might be able to use extendible hashing instead. Extendible hashing has been around (in a number of different forms) for a few decades, and still works pretty well. The general idea is fairly simple: hash the strings to create keys of fixed length, then create some sort of tree of those fixed-length pseudo-keys. As with (almost) any hash, you have to be prepared to deal with collisions. As with other hash tables, the details of the hashing and collision resolution vary (though probably not quite as much with extendible hashing as in-memory hashing).
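Here is a sketch of the fixed-length pseudo-key idea only (there's no directory doubling, so this is not full extendible hashing, and Offset standing in for a disk location is an assumption): hash each string to a 64-bit key, index the keys, and resolve collisions by comparing the stored strings.

    #include <cstdint>
    #include <functional>
    #include <map>
    #include <string>
    #include <utility>

    using Offset = std::uint64_t;  // hypothetical on-disk record location

    // Tree of fixed-length pseudo-keys; collisions share a pseudo-key.
    std::multimap<std::uint64_t, std::pair<std::string, Offset>> idx;

    void add(const std::string& s, Offset where) {
        idx.emplace(std::hash<std::string>{}(s), std::make_pair(s, where));
    }

    bool lookup(const std::string& s, Offset& out) {
        auto range = idx.equal_range(std::hash<std::string>{}(s));
        for (auto it = range.first; it != range.second; ++it)  // resolve collisions
            if (it->second.first == s) { out = it->second.second; return true; }
        return false;
    }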
As for real use, major DBMS and DBMS-like systems use all of the above. B-tree variants are probably the most common in the general purpose DBMS market (e.g. Oracle or MS SQL Server). Extendible hashing is used in a fair number of more-specialized products (e.g., Lotus Domino Server).
What are you doing with the objects?
If you're running a large system that needs low latency to handle lots of concurrent requests, then I'd store the objects in a database and have it take care of the sorting and indexing. That would be much simpler than implementing a B-tree from scratch and possibly having it be buggy.
DBMSs also have caching and various other features that might make your life easier.
Start by being clear what you want. Do you want to sort them or index them? Sorting is likely to require moving at least some of the items on disk, but indexing would likely leave them where they are.
If you really want to sort them, Knuth's "The Art of Computer Programming" volume three covers sorting and searching in about as much detail as you're likely to want.

Skip Lists -- ever used them? [closed]

I'm wondering whether anyone here has ever used a skip list. It looks to have roughly the same advantages as a balanced binary tree but is simpler to implement. If you have, did you write your own, or use a pre-written library (and if so, what was its name)?
My understanding is that they're not so much a useful alternative to binary trees (e.g. red-black trees) as they are to B-trees for database use, so that you can keep the # of levels down to a feasible minimum and deal w/ base-K logs rather than base-2 logs for performance characteristics. The algorithms for probabilistic skip-lists are (IMHO) easier to get right than the corresponding B-tree algorithms. Plus there's some literature on lock-free skip lists. I looked at using them a few months ago but then abandoned the effort on discovering the HDF5 library.
Literature on the subject:
Papers by Bill Pugh:
A skip list cookbook
Skip lists: A probabilistic alternative to balanced trees
Concurrent Maintenance of Skip Lists
Non-academic papers/tutorials:
Eternally Confuzzled (has some discussion on several data structures)
"Skip Lists" by Thomas A. Anastasio
Actually, for one of my projects, I am implementing my own full STL. And I used a skiplist to implement my std::map. The reason I went with it is that it is a simple algorithm which is very close to the performance of a balanced tree but has much simpler iteration capabilities.
Also, Qt4's QMap was a skip list as well, which was the original inspiration for using one in my std::map.
Years ago I implemented my own for a probabilistic algorithms class. I'm not aware of any library implementations, but it's been a long time. It is pretty simple to implement. As I recall they had some really nice properties for large data sets and avoided some of the problems of rebalancing. I think the implementation is also simpler than binary trees in general. There is a nice discussion and some sample C++ code here:
http://www.ddj.us/cpp/184403579?pgno=1
There's also an applet with a running demonstration. Cute 90's Java shininess here:
http://www.geocities.com/siliconvalley/network/1854/skiplist.html
Java 1.6 (Java SE 6) introduced ConcurrentSkipListSet and ConcurrentSkipListMap to the collections framework. So, I'd speculate that someone out there is really using them.
Skiplists tend to offer far less contention for locks in a multithreaded situation, and (probabilistically) have performance characteristics similar to trees.
See the original paper [pdf] by William Pugh.
I implemented a variant that I termed a Reverse Skip List for a rules engine a few years ago. Much the same, but the reference links run backward from the last element.
This is because it was faster for inserting sorted items that were most likely towards the back-end of the collection.
It was written in C# and took a few iterations to get working successfully.
The skip list has the same logarithmic time bounds for searching as is achieved by the binary search algorithm, yet it extends that performance to update methods when inserting or deleting entries. Nevertheless, the bounds are expected for the skip list, while binary search of a sorted table has a worst-case bound.
Skip lists are easy to implement, but you have to be careful adjusting the pointers during insertion and deletion. I haven't used one in a real program, but I have done some runtime profiling. Skip lists are different from search trees. The similarity is that they give expected O(log n) over a sequence of dictionary operations, just like a splay tree. A skip list is better than an unbalanced search tree, but not better than a balanced tree.
Every skip list node has forward pointers which represent the current->next() connections at the different levels of the skip list. Typically this level is bounded at a maximum of about ln(N), so if N = 1 million the level is 13. There will be that many pointers per node, and in Java this means twice the number of pointers for implementing reference data types, whereas a balanced search tree has fewer and gives the same runtime!
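To make the pointer layout concrete, here is a sketch of a node plus the usual geometric level draw. The p = 1/2 choice and the level cap are assumptions; Pugh's paper analyzes other values of p, and the expected height depends on that choice.

    #include <cstdlib>
    #include <vector>

    constexpr int kMaxLevel = 20;  // cap chosen for the sketch

    // forward[i] is the next node at level i; taller nodes appear in more lists.
    struct SkipNode {
        int key;
        std::vector<SkipNode*> forward;
        SkipNode(int k, int level) : key(k), forward(level, nullptr) {}
    };

    // Geometric with p = 1/2: each node's expected height is 2 pointers.
    int randomLevel() {
        int level = 1;
        while (level < kMaxLevel && (std::rand() & 1)) ++level;
        return level;
    }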
Skip list vs. splay tree vs. hash table: as profiled for dictionary lookup ops, a lock-stripped hash table gave results in under 0.010 ms, whereas a splay tree gave ~1 ms and a skip list ~720 ms.
