Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
I'm wondering whether anyone here has ever used a skip list. It looks to have roughly the same advantages as a balanced binary tree but is simpler to implement. If you have, did you write your own, or use a pre-written library (and if so, what was its name)?
My understanding is that they're not so much a useful alternative to binary trees (e.g. red-black trees) as they are to B-trees for database use, so that you can keep the # of levels down to a feasible minimum and deal w/ base-K logs rather than base-2 logs for performance characteristics. The algorithms for probabilistic skip-lists are (IMHO) easier to get right than the corresponding B-tree algorithms. Plus there's some literature on lock-free skip lists. I looked at using them a few months ago but then abandoned the effort on discovering the HDF5 library.
literature on the subject:
Papers by Bill Pugh:
A skip list cookbook
Skip lists: A probabilistic alternative to balanced trees
Concurrent Maintenance of Skip Lists
non-academic papers/tutorials:
Eternally Confuzzled (has some discussion on several data structures)
"Skip Lists" by Thomas A. Anastasio
Actually, for one of my projects, I am implementing my own full STL. And I used a skiplist to implement my std::map. The reason I went with it is that it is a simple algorithm which is very close to the performance of a balanced tree but has much simpler iteration capabilities.
Also, Qt4's QMap was a skiplist as well which was the original inspiration for my using it in my std::map.
Years ago I implemented my own for a probabilistic algorithms class. I'm not aware of any library implementations, but it's been a long time. It is pretty simple to implement. As I recall they had some really nice properties for large data sets and avoided some of the problems of rebalancing. I think the implementation is also simpler than binary tries in general. There is a nice discussion and some sample c++ code here:
http://www.ddj.us/cpp/184403579?pgno=1
There's also an applet with a running demonstration. Cute 90's Java shininess here:
http://www.geocities.com/siliconvalley/network/1854/skiplist.html
Java 1.6 (Java SE 6) introduced ConcurrentSkipListSet and ConcurrentSkipListMap to the collections framework. So, I'd speculate that someone out there is really using them.
Skiplists tend to offer far less contention for locks in a multithreaded situation, and (probabilistically) have performance characteristics similar to trees.
See the original paper [pdf] by William Pugh.
I implemented a variant that I termed a Reverse Skip List for a rules engine a few years ago. Much the same, but the reference links run backward from the last element.
This is because it was faster for inserting sorted items that were most likely towards the back-end of the collection.
It was written in C# and took a few iterations to get working successfully.
The skip list has the same logarithmic time bounds for searching as is achieved by the binary search algorithm, yet it extends that performance to update methods when inserting or deleting entries. Nevertheless, the bounds are expected for the skip list, while binary search of a sorted table has a worst-case bound.
Skip Lists are easy to implement. But, adjusting the pointers on a skip list in case of insertion and deletion you have to be careful. Have not used this in a real program but, have doen some runtime profiling. Skip lists are different from search trees. The similarity is that, it gives average log(n) for a period of dictionary operations just like the splay tree. It is better than an unbalanced search tree but is not better than a balanced tree.
Every skip list node has forward pointers which represent the current->next() connections to the different levels of the skip list. Typically this level is bounded at a maximum of ln(N). So if N = 1million the level is 13. There will be that much pointers and in Java this means twie the number of pointers for implementing reference data types. where as a balanced search tree has less and it gives same runtime!!.
SkipList Vs Splay Tree Vs Hash As profiled for dictionary look up ops a lock stripped hashtable will give result in under 0.010 ms where as a splay tree gives ~ 1 ms and skip list ~720ms.
Related
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
Cross posted: Need a good overview for Succinct Data Structure algorithms
Since I knew about Succinct Data Structures I'm in a desperate need of a good overview of most recent developments in that area.
I have googled and read a lot of articles I could see in top of google results on requests from top of my head. I still suspect I have missed something important here.
Here are topics of particular interest for me:
Succinct encoding of binary trees with efficient operations of getting parent, left/right child, number of elements in a subtree.
The main question here is as following: all approaches I know of assume the tree nodes enumerated in breath-first order (like in the pioneer work in this area Jacobson, G. J (1988). Succinct static data structures), which does not seem appropriate for my task. I deal with huge binary trees given in depth-first layout and the depth-first node indices are keys to other node properties, so changing the tree layout has some cost for me which I'd like to minimize. Hence the interest in getting references to works considering other then BF tree layouts.
Large variable-length items arrays in external memory. The arrays are immutable: I don't need to add/delete/edit the items. The only requirement is O(1) element access time and as low overhead as possible, better then straightforward offset and size approach. Here is some statistics I gathered about typical data for my task:
typical number of items - hundreds of millions, up to tens of milliards;
about 30% of items have length not more then 1 bit;
40%-60% items have length less then 8 bits;
only few percents of items have length between 32 and 255 bits (255 bits is the limit)
average item length ~4 bit +/- 1 bit.
any other distribution of item lengths is theoretically possible but all practically interesting cases have statistics close to the described above.
Links to articles of any complexity, tutorials of any obscurity, more or less documented C/C++ libraries, - anything what was useful for you in similar tasks or what looks like that by your educated guess - all such things are gratefully appreciated.
Update: I forgot to add to the question 1: binary trees I'm dealing with are immutable. I have no requirements for altering them, all i need is only traversing them in various ways always moving from node to children or to parent, so that the average cost of such operations was O(1).
Also, typical tree has milliards of nodes and should not be fully stored in RAM.
Update 2 Just if someone interested. I got a couple of good links in https://cstheory.stackexchange.com/a/11265/9276.
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Practical uses of different data structures
Could please anyone point me out to a brief summary which describes real-life applications of various data structures? I am looking for ready-to-use summary not a reference to the Cormen's book :)
For example, almost every article says what a Binary tree is; but they doesn't provide with examples when they really should be used in real-life; the same for other data structures.
Thank you,
Data structures are so widely used that this summary will be actually enormous. The simplest cases are used almost every day -- hashmaps for easy searching of particular item. Linked lists -- for easy adding/removing elements(you can for example describe object's properties with linked lists and you can easily add or remove such properties). Priority queues -- used for many algorithms (Dijsktra's algorithm, Prim's algorithm for minimum spanning tree, Huffman's encoding). Trie for describing dictionary of words. Bloom filters for fast and cheap of memory search (your email's spam filter may use this). Data structures are all around us -- you really should study and understand them and then you can find application for them everywhere.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
I am interested in good literature about spatial indexes. Which one are in use, comparisons between them in speed, space requirements, spatial queries performance when using them etc.
I used to use a kind of home-grown QuadTree for spatial indexing (well before I learned the word "quadtree"). For ordinary kinds of spatial data (I deal with street map data), they are fast to create and fast to query, but they scan too many leaf nodes during queries. Specifically, with reasonable node sizes (50-100), my quadtree tended to produce around 300 results for a point query, i.e. 3-6 leaf nodes apply (very rough ballpark; results are highly variable.)
Nowadays, my preferred data structure is the the R*tree. I wrote and tested an implementation myself that obtained very good results. My code for building an R*tree is very slow compared to my QuadTree code, but the bounding boxes on the leaf nodes end up very well organized; at least half of the query space is answered by only one leaf node (i.e. if you do a random point query, there is a good chance that only a single leaf node is returned), and something like 90% of the space is covered by two nodes or less. So with a node size of 80 elements, I'd typically get 80 or 160 results from a point query, with the average closer to 160 (since a few queries do return 3-5 nodes). This holds true even in dense urban areas of the map.
I know this because I wrote a visualizer for my R* tree and the graphical objects inside it, and I tested it on a large dataset (600,000 road segments). It performs even better on point data (and other data in which bounding boxes rarely overlap). If you implement an R* tree I urge you to visualize the results, because when I wrote mine it had multiple bugs that lowered the efficiency of the tree (without affecting correctness), and I was able to tweak some of the decision-making to get better results. Be sure to test on a large dataset, as it will reveal problems that a small dataset does not. It may help to decrease the fan-out (node size) of the tree for testing, to see how well the tree works when it is several levels deep.
I'd be happy to give you the source code except that I would need my employer's permission. You know how it is. In my implementation I support forced reinsertion, but my PickSplit and
insertion penalty have been tweaked.
The original paper, The R* tree: An Efficient and Robust Access Method for Points and Rectangles, is missing dots for some reason (no periods and no dots on the "i"s). Also, their terminology is a bit weird, e.g. when they say "margin", what they mean is "perimeter".
The R* tree is a good choice if you need a data structure that can be modified. If you don't need to modify the tree after you first create it, consider bulk loading algorithms. If you only need to modify the tree a small amount after bulk loading, ordinary R-tree algorithms will be good enough. Note that R*-tree and R-tree data is structurally identical; only the algorithms for insertion (and maybe deletion? I forget) are different. R-tree is the original data structure from 1984; here's a link to the R-tree paper.
The kd-tree looks efficient and not too difficult to implement, but it can only be used for point data.
By the way, the reason I focus on leaf nodes so much is that
I need to deal with disk-based spatial indexes. You can generally cache all the inner nodes in memory because they are a tiny fraction of the index size; therefore the time it takes to scan them is tiny compared to the time required for a leaf node that is not cached.
I save a lot of space by not storing bounding boxes for the elements in the spatial index, which means I have to actually test the original geometry of each element to answer a query. Thus it's even more important to minimize the number of leaf nodes touched.
I developed a algorithm for quadrant based fast search and publushed it on ddj.com a couple of years ago. Maybe it's interesting for you:
Accelerated Search For the Nearest Line
http://drdobbs.com/windows/198900559
I recently came across the data structure known as a skip list. It seems to have very similar behavior to a binary search tree.
Why would you ever want to use a skip list over a binary search tree?
Skip lists are more amenable to concurrent access/modification. Herb Sutter wrote an article about data structure in concurrent environments. It has more indepth information.
The most frequently used implementation of a binary search tree is a red-black tree. The concurrent problems come in when the tree is modified it often needs to rebalance. The rebalance operation can affect large portions of the tree, which would require a mutex lock on many of the tree nodes. Inserting a node into a skip list is far more localized, only nodes directly linked to the affected node need to be locked.
Update from Jon Harrops comments
I read Fraser and Harris's latest paper Concurrent programming without locks. Really good stuff if you're interested in lock-free data structures. The paper focuses on Transactional Memory and a theoretical operation multiword-compare-and-swap MCAS. Both of these are simulated in software as no hardware supports them yet. I'm fairly impressed that they were able to build MCAS in software at all.
I didn't find the transactional memory stuff particularly compelling as it requires a garbage collector. Also software transactional memory is plagued with performance issues. However, I'd be very excited if hardware transactional memory ever becomes common. In the end it's still research and won't be of use for production code for another decade or so.
In section 8.2 they compare the performance of several concurrent tree implementations. I'll summarize their findings. It's worth it to download the pdf as it has some very informative graphs on pages 50, 53, and 54.
Locking skip lists is insanely fast. They scale incredibly well with the number of concurrent accesses. This is what makes skip lists special, other lock based data structures tend to croak under pressure.
Lock-free skip lists are consistently faster than locking skip lists but only barely.
transactional skip lists are consistently 2-3 times slower than the locking and non-locking versions.
locking red-black trees croak under concurrent access. Their performance degrades linearly with each new concurrent user. Of the two known locking red-black tree implementations, one essentially has a global lock during tree rebalancing. The other uses fancy (and complicated) lock escalation but still doesn't significantly outperform the global lock version.
lock-free red-black trees don't exist (no longer true, see Update).
transactional red-black trees are comparable with transactional skip-lists. That was very surprising and very promising. Transactional memory, though slower if far easier to write. It can be as easy as quick search and replace on the non-concurrent version.
Update
Here is paper about lock-free trees: Lock-Free Red-Black Trees Using CAS.
I haven't looked into it deeply, but on the surface it seems solid.
First, you cannot fairly compare a randomized data structure with one that gives you worst-case guarantees.
A skip list is equivalent to a randomly balanced binary search tree (RBST) in the way that is explained in more detail in Dean and Jones' "Exploring the Duality Between Skip Lists and Binary Search Trees".
The other way around, you can also have deterministic skip lists which guarantee worst case performance, cf. Munro et al.
Contra to what some claim above, you can have implementations of binary search trees (BST) that work well in concurrent programming. A potential problem with the concurrency-focused BSTs is that you can't easily get the same had guarantees about balancing as you would from a red-black (RB) tree. (But "standard", i.e. randomzided, skip lists don't give you these guarantees either.) There's a trade-off between maintaining balancing at all times and good (and easy to program) concurrent access, so relaxed RB trees are usually used when good concurrency is desired. The relaxation consists in not re-balancing the tree right away. For a somewhat dated (1998) survey see Hanke's ''The Performance of Concurrent Red-Black Tree Algorithms'' [ps.gz].
One of the more recent improvements on these is the so-called chromatic tree (basically you have some weight such that black would be 1 and red would be zero, but you also allow values in between). And how does a chromatic tree fare against skip list? Let's see what Brown et al. "A General Technique for Non-blocking Trees" (2014) have to say:
with 128 threads, our algorithm outperforms Java’s non-blocking skiplist
by 13% to 156%, the lock-based AVL tree of Bronson et al. by 63% to 224%, and a RBT that uses software transactional memory (STM) by 13 to 134 times
EDIT to add: Pugh's lock-based skip list, which was benchmarked in Fraser and Harris (2007) "Concurrent Programming Without Lock" as coming close to their own lock-free version (a point amply insisted upon in the top answer here), is also tweaked for good concurrent operation, cf. Pugh's "Concurrent Maintenance of Skip Lists", although in a rather mild way. Nevertheless one newer/2009 paper "A Simple Optimistic skip-list Algorithm" by Herlihy et al., which proposes a supposedly simpler (than Pugh's) lock-based implementation of concurrent skip lists, criticized Pugh for not providing a proof of correctness convincing enough for them. Leaving aside this (maybe too pedantic) qualm, Herlihy et al. show that their simpler lock-based implementation of a skip list actually fails to scale as well as the JDK's lock-free implementation thereof, but only for high contention (50% inserts, 50% deletes and 0% lookups)... which Fraser and Harris didn't test at all; Fraser and Harris only tested 75% lookups, 12.5% inserts and 12.5% deletes (on skip list with ~500K elements). The simpler implementation of Herlihy et al. also comes close to the lock-free solution from the JDK in the case of low contention that they tested (70% lookups, 20% inserts, 10% deletes); they actually beat the lock-free solution for this scenario when they made their skip list big enough, i.e. going from 200K to 2M elements, so that the probability of contention on any lock became negligible. It would have been nice if Herlihy et al. had gotten over their hangup over Pugh's proof and tested his implementation too, but alas they didn't do that.
EDIT2: I found a (2015 published) motherlode of all benchmarks: Gramoli's "More Than You Ever Wanted to Know about Synchronization. Synchrobench, Measuring the Impact of the Synchronization on Concurrent Algorithms": Here's a an excerpted image relevant to this question.
"Algo.4" is a precursor (older, 2011 version) of Brown et al.'s mentioned above. (I don't know how much better or worse the 2014 version is). "Algo.26" is Herlihy's mentioned above; as you can see it gets trashed on updates, and much worse on the Intel CPUs used here than on the Sun CPUs from the original paper. "Algo.28" is ConcurrentSkipListMap from the JDK; it doesn't do as well as one might have hoped compared to other CAS-based skip list implementations. The winners under high-contention are "Algo.2" a lock-based algorithm (!!) described by Crain et al. in "A Contention-Friendly Binary Search Tree" and "Algo.30" is the "rotating skiplist" from "Logarithmic data structures for
multicores". "Algo.29" is the "No hot spot non-blocking skip
list". Be advised that Gramoli is a co-author to all three of these winner-algorithm papers. "Algo.27" is the C++ implementation of Fraser's skip list.
Gramoli's conclusion is that's much easier to screw-up a CAS-based concurrent tree implementation than it is to screw up a similar skip list. And based on the figures, it's hard to disagree. His explanation for this fact is:
The difficulty in designing a tree that is lock-free stems from
the difficulty of modifying multiple references atomically. Skip lists
consist of towers linked to each other through successor pointers and
in which each node points to the node immediately below it. They are
often considered similar to trees because each node has a successor
in the successor tower and below it, however, a major distinction is
that the downward pointer is generally immutable hence simplifying
the atomic modification of a node. This distinction is probably
the reason why skip lists outperform trees under heavy contention
as observed in Figure [above].
Overriding this difficulty was a key concern in Brown et al.'s recent work.
They have a whole separate (2013) paper "Pragmatic Primitives for Non-blocking Data Structures"
on building multi-record LL/SC compound "primitives", which they call LLX/SCX, themselves implemented using (machine-level) CAS. Brown et al. used this LLX/SCX building block in their 2014 (but not in their 2011) concurrent tree implementation.
I think it's perhaps also worth summarizing here the fundamental ideas
of the "no hot spot"/contention-friendly (CF) skip list. It addapts an essential idea from the relaxed RB trees (and similar concrrency friedly data structures): the towers are no longer built up immediately upon insertion, but delayed until there's less contention. Conversely, the deletion of a tall tower can create many contentions;
this was observed as far back as Pugh's 1990 concurrent skip-list paper, which is why Pugh introduced pointer reversal on deletion (a tidbit that Wikipedia's page on skip lists still doesn't mention to this day, alas). The CF skip list takes this a step further and delays deleting the upper levels of a tall tower. Both kinds of delayed operations in CF skip lists are carried out by a (CAS based) separate garbage-collector-like thread, which its authors call the "adapting thread".
The Synchrobench code (including all algorithms tested) is available at: https://github.com/gramoli/synchrobench.
The latest Brown et al. implementation (not included in the above) is available at http://www.cs.toronto.edu/~tabrown/chromatic/ConcurrentChromaticTreeMap.java Does anyone have a 32+ core machine available? J/K My point is that you can run these yourselves.
Also, in addition to the answers given (ease of implementation combined with comparable performance to a balanced tree). I find that implementing in-order traversal (forwards and backwards) is far simpler because a skip-list effectively has a linked list inside its implementation.
In practice I've found that B-tree performance on my projects has worked out to be better than skip-lists. Skip lists do seem easier to understand but implementing a B-tree is not that hard.
The one advantage that I know of is that some clever people have worked out how to implement a lock-free concurrent skip list that only uses atomic operations. For example, Java 6 contains the ConcurrentSkipListMap class, and you can read the source code to it if you are crazy.
But it's not too hard to write a concurrent B-tree variant either - I've seen it done by someone else - if you preemptively split and merge nodes "just in case" as you walk down the tree then you won't have to worry about deadlocks and only ever need to hold a lock on two levels of the tree at a time. The synchronization overhead will be a bit higher but the B-tree is probably faster.
From the Wikipedia article you quoted:
Θ(n) operations, which force us to visit every node in ascending order (such as printing the entire list) provide the opportunity to perform a behind-the-scenes derandomization of the level structure of the skip-list in an optimal way, bringing the skip list to O(log n) search time. [...]
A skip list, upon which we have not
recently performed [any such] Θ(n) operations, does not
provide the same absolute worst-case
performance guarantees as more
traditional balanced tree data
structures, because it is always
possible (though with very low
probability) that the coin-flips used
to build the skip list will produce a
badly balanced structure
EDIT: so it's a trade-off: Skip Lists use less memory at the risk that they might degenerate into an unbalanced tree.
Skip lists are implemented using lists.
Lock free solutions exist for singly and doubly linked lists - but there are no lock free solutions which directly using only CAS for any O(logn) data structure.
You can however use CAS based lists to create skip lists.
(Note that MCAS, which is created using CAS, permits arbitrary data structures and a proof of concept red-black tree had been created using MCAS).
So, odd as they are, they turn out to be very useful :-)
Skip Lists do have the advantage of lock stripping. But, the runt time depends on how the level of a new node is decided. Usually this is done using Random(). On a dictionary of 56000 words, skip list took more time than a splay tree and the tree took more time than a hash table. The first two could not match hash table's runtime. Also, the array of the hash table can be lock stripped in a concurrent way too.
Skip List and similar ordered lists are used when locality of reference is needed. For ex: finding flights next and before a date in an application.
An inmemory binary search splay tree is great and more frequently used.
Skip List Vs Splay Tree Vs Hash Table Runtime on dictionary find op
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
What are the most common problems that can be solved with both these data structures?
It would be good for me to have also recommendations on books that:
Implement the structures
Implement and explain the reasoning of the algorithms that use them
The first thing I think about when I read this question is: what types of things use graphs/trees? and then I think backwards to how I could use them.
For example, take two common uses of a tree:
The DOM
File systems
The DOM, and XML for that matter, resemble tree structures.
It makes sense, too. It makes sense because of how this data needs to be arranged. A file system, too. On a UNIX system there's a root node, and branching down below. When you mount a new device, you're attaching it onto the tree.
You should also be asking yourself: does the data fall into this type of structure? Create data structures that make sense to the problem and the rest will follow.
As far as being easier, I think thats relative. Are you good with recursive functions to traverse a tree/graph? What if you need to balance the tree?
Think about a program that solves a word search puzzle. You could map out all the letters of the word search into a graph and check surrounding nodes to see if that string is matching any of the words. But couldn't you just do the same with with a single array? All you really need to do is move an index to check letters to the left and right, and by the width to check above and below letters. Solving this problem with a graph isn't difficult, but it can create a lot of extra work and difficulty if you're not comfortable with using them - of course that shouldn't discourage you from doing it, especially if you are learning about them.
I hope that helps you think about these structures. As for a book recommendation, I'd have to go with Introduction to Algorithms.
Circuit diagrams.
Compilation (Directed Acyclic graphs)
Maps. Very compact as graphs.
Network flow problems.
Decision trees for expert systems (sic)
Fishbone diagrams for fault finding, process improvment, safety analysis. For bonus points, implement your error recovery code as objects that are the fishbone diagram.
Just about every problem can be re-written in terms of graph theory. I'm not kidding, look at any book on NP complete problems, there are some pretty wacky problems that get turned into graph theory because we have good tools for working with graphs...
The Algorithm Design Manual contains some interesting case studies with creative use of graphs. Despite its name, the book is very readable and even entertaining at times.
There's a course for such things at my university: CSE 326. I didn't think the book was too useful, but the projects are fun and teach you a fair bit about implementing some of the simpler structures.
As for examples, one of the most common problems (by number of people using it) that's solved with trees is that of cell phone text entry. You can use trees, not necessarily binary, to represent the space of possible words that can come out of any given list of numbers that a user punches in very quickly.
Algorithms for Java: Part 5 by Robert Sedgewick is all about graph algorithms and datastructures. This would be a good first book to work through if you want to implement some graph algorithms.
Scene graphs for drawing graphics in games and multimedia applications heavily use trees and graphs. Nodes represents objects to be rendered, transformations, controls, groups, ...
Scene graphs usually have multiple layers and attributes which mean that you can draw only some node of a graph (attributes) in a specified order (layers). Depending on the kind of scene graph you have it can have two parralel structures: declarations and instantiation. Th
#DavidJoiner / all:
FWIW: A new version of the Algorithm Design Manual is due out any day now.
The entire course that he Prof Skiena developed this book for is also available on the web:
http://www.cs.sunysb.edu/~algorith/video-lectures/2007-1.html
Trees are used a lot more in functional programming languages because of their recursive nature.
Also, graphs and trees are a good way to model a lot of AI problems.
Games often use graphs to facilitate finding paths across the game world. The graph representation of the world can have algorithms such as breadth-first search or A* in order to find a route across it.
They also often use trees to represent entities within the world. If you have thousands of entities and need to find one at a certain position then iterating linearly through a list can be inefficient, especially if you need to do it often. Therefore the area can be subdivided into a tree to allow it to be searched more quickly. Just as a linear space can be efficiently searched with a binary search (and thus divided into a binary tree), 2D space can be divided into a quadtree and 3D space into an octree.