Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
Cross posted: Need a good overview for Succinct Data Structure algorithms
Since I learned about succinct data structures, I have been in desperate need of a good overview of the most recent developments in that area.
I have googled and read many of the articles that came up near the top of the results for the queries I could think of, but I still suspect I have missed something important.
Here are topics of particular interest for me:
Succinct encoding of binary trees with efficient operations for getting the parent, the left/right child, and the number of elements in a subtree.
The main question here is as follows: all the approaches I know of assume that the tree nodes are enumerated in breadth-first order (as in the pioneering work in this area, Jacobson, G. J. (1988), Succinct static data structures), which does not seem appropriate for my task. I deal with huge binary trees given in a depth-first layout, and the depth-first node indices are keys to other node properties, so changing the tree layout has a cost for me that I would like to minimize. Hence my interest in references to work that considers tree layouts other than breadth-first.
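To make the layout concrete, here is a deliberately naive sketch of one DFS-friendly encoding I have in mind: the tree is written in preorder as a bitvector (a 1 for every node, a 0 for every null pointer), so node indices are exactly the depth-first ranks. The navigation below uses linear scans; as far as I understand, a succinct structure would answer the same "excess" queries in O(1) with o(n) extra bits (rank/select/findclose) while keeping the depth-first layout. All names are made up for illustration.

    #include <cstddef>
    #include <vector>

    // Preorder (DFS) encoding of a binary tree:
    //   enc(null) = 0
    //   enc(node) = 1, enc(left), enc(right)
    // A tree with n nodes takes 2n+1 bits; node i (0-based, in DFS order)
    // is the (i+1)-th 1-bit of the sequence.
    struct DfsTree {
        std::vector<bool> bits;                        // stand-in for a real bitvector with rank/select
        static constexpr std::size_t npos = static_cast<std::size_t>(-1);

        // Position of the (i+1)-th 1-bit (naive select: O(n) here, O(1) in a succinct structure).
        std::size_t pos_of(std::size_t i) const {
            std::size_t seen = 0;
            for (std::size_t p = 0; p < bits.size(); ++p)
                if (bits[p] && seen++ == i) return p;
            return npos;
        }
        // DFS index of the node whose encoding starts at position p (naive rank).
        std::size_t idx_at(std::size_t p) const {
            std::size_t ones = 0;
            for (std::size_t q = 0; q < p; ++q) ones += bits[q];
            return ones;
        }
        // End of the subtree encoding that starts at position p: the first position
        // where the running excess (+1 per 1-bit, -1 per 0-bit) reaches -1.
        std::size_t subtree_end(std::size_t p) const {
            long excess = 0;
            for (std::size_t q = p; q < bits.size(); ++q) {
                excess += bits[q] ? 1 : -1;
                if (excess == -1) return q;
            }
            return npos;
        }

        std::size_t left_child(std::size_t i) const {  // left child is the next node in DFS order
            std::size_t p = pos_of(i);
            return bits[p + 1] ? i + 1 : npos;
        }
        std::size_t right_child(std::size_t i) const { // skip the left subtree, then test one bit
            std::size_t p = pos_of(i);
            std::size_t after_left = bits[p + 1] ? subtree_end(p + 1) + 1 : p + 2;
            return bits[after_left] ? idx_at(after_left) : npos;
        }
        std::size_t subtree_size(std::size_t i) const { // a k-node subtree spans 2k+1 bits
            std::size_t p = pos_of(i);
            return (subtree_end(p) - p) / 2;
        }
    };

Parent navigation can be done with an analogous backward scan (the enclose/findopen primitive of balanced-parentheses structures); I left it out for brevity.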
Large variable-length item arrays in external memory. The arrays are immutable: I don't need to add/delete/edit the items. The only requirement is O(1) element access time with as little overhead as possible, better than the straightforward offset-and-size approach (see the sketch after the statistics below). Here are some statistics I gathered about typical data for my task:
typical number of items: hundreds of millions, up to tens of billions;
about 30% of items have a length of not more than 1 bit;
40%-60% of items have a length of less than 8 bits;
only a few percent of items have a length between 32 and 255 bits (255 bits is the limit);
the average item length is ~4 bits +/- 1 bit;
any other distribution of item lengths is theoretically possible, but all practically interesting cases have statistics close to those described above.
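To illustrate what I mean by "better than the straightforward offset-and-size approach", here is a hedged sketch of a sampled-offset directory (all names invented): items are concatenated into one bit buffer, an absolute 64-bit offset is sampled every 256 items, and each item stores only a 16-bit delta from its block's sample (with 255-bit items and 256 items per block the delta is at most 255*255 < 2^16). Access stays O(1) at roughly 16.25 bits of overhead per item, which is still far above the entropy of ~4-bit items; that gap is exactly why I am asking for better schemes (Elias-Fano over the offsets, for example). The toy below limits payloads to 64 bits just to keep the code short.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Immutable array of variable-length bit items with O(1) access.
    // Layout: one big bit buffer + sampled absolute offsets + per-item deltas.
    class PackedItems {
        static constexpr std::size_t kSample = 256;    // items per sample block
        std::vector<std::uint64_t> bits_;              // concatenated item payloads
        std::vector<std::uint64_t> sample_;            // absolute bit offset of every 256th item
        std::vector<std::uint16_t> delta_;             // offset of item i relative to its sample
        std::uint64_t total_bits_ = 0;                 // running end offset (= offset of item n)

    public:
        // Append during construction; the array is immutable afterwards.
        void push_back(std::uint64_t value, unsigned len_bits) {   // len_bits <= 64 in this toy
            std::size_t i = delta_.size();
            if (i % kSample == 0) sample_.push_back(total_bits_);
            delta_.push_back(static_cast<std::uint16_t>(total_bits_ - sample_.back()));
            while ((total_bits_ + len_bits + 63) / 64 > bits_.size()) bits_.push_back(0);
            for (unsigned b = 0; b < len_bits; ++b)                // store the payload bit by bit
                if ((value >> b) & 1)
                    bits_[(total_bits_ + b) / 64] |= (1ULL << ((total_bits_ + b) % 64));
            total_bits_ += len_bits;
        }

        std::uint64_t offset(std::size_t i) const {    // O(1)
            if (i == delta_.size()) return total_bits_;
            return sample_[i / kSample] + delta_[i];
        }
        unsigned length(std::size_t i) const {         // O(1): next offset minus this one
            return static_cast<unsigned>(offset(i + 1) - offset(i));
        }
        std::uint64_t get(std::size_t i) const {       // O(1) up to word operations
            std::uint64_t off = offset(i), v = 0;
            for (unsigned b = 0; b < length(i); ++b)
                v |= static_cast<std::uint64_t>((bits_[(off + b) / 64] >> ((off + b) % 64)) & 1) << b;
            return v;
        }
    };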
Links to articles of any complexity, tutorials of any obscurity, more or less documented C/C++ libraries - anything that was useful for you in similar tasks, or that looks useful by your educated guess - all such things are gratefully appreciated.
Update: I forgot to add to question 1: the binary trees I'm dealing with are immutable. I have no requirement to alter them; all I need is to traverse them in various ways, always moving from a node to its children or to its parent, so that the average cost of such operations is O(1).
Also, a typical tree has billions of nodes and should not be stored entirely in RAM.
Update 2: In case anyone is interested, I got a couple of good links in https://cstheory.stackexchange.com/a/11265/9276.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 9 years ago.
I found the question below in a question bank and I'm looking for some help with it.
For each of the following situations, select the best data structure and justify your selection.
The data structures should be selected from the following possibilities: unordered list, ordered array, heap, hash table, binary search tree.
(a) (4 points) Your structure needs to store a potentially very large number of records, with the data being added as it arrives. You need to be able to retrieve a record by its primary key, and these keys are random with respect to the order in which the data arrives. Records also may be deleted at random times, and all modifications to the data need to be completed just after they are submitted by the users. You have no idea how large the dataset could be, but the data structure implementation needs to be ready in a few weeks. While you are designing the program, the actual programming is going to be done by a co-op student.
For the answer, I thought BST would be the best choice.
Since the size is not clear, a hash table is not a good choice.
Since deletions are required, a heap is not acceptable either.
Is my reasoning correct?
(b) (4 points) You are managing data for the inventory of a large warehouse store. New items (with new product keys) are added to and deleted from the inventory system every week, but this is done while stores are closed for 12 consecutive hours.
Quantities of items are changed frequently: incremented as they are stocked, and decremented as they are sold. Stocking and selling items requires the item to be retrieved from the system using its product key.
It is also important that the system be robust, well-tested, and have predictable behaviour. Delays in retrieving an item are not acceptable, since it could cause problems for the sales staff. The system will potentially be used for a long time, though largely it is only the front end that is likely to be modified.
For this part I thought heapsort, but I have no idea how to justify my answer.
Could you please help me?
(a) needs fast insertion and deletion and you need retrieval based on key. Thus I'd go with a hashtable or a binary search tree. However, since the size is not known in advance and there's that deadline constraint, I'd say the binary search tree is the best alternative.
(b) You have enough time to process data after insertion/deletion but need O(1) random access. An ordered array should do the trick.
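For what it's worth, here is a tiny illustration of the operations the two answers rely on (the record type and keys are invented). For (a), std::map is typically a red-black tree, so it needs no up-front sizing and gives O(log n) insert/erase/find as data arrives; for (b), a sorted std::vector rebuilt during the nightly 12-hour window serves predictable O(log n) binary-search lookups by product key during the day.

    #include <algorithm>
    #include <map>
    #include <string>
    #include <vector>

    struct Record { long key; std::string data; };    // made-up record type

    int main() {
        // (a) keys arrive in random order, deletes happen at random times:
        // a balanced BST (std::map) keeps every operation at O(log n).
        std::map<long, Record> live;
        live[42] = {42, "first"};
        live[7]  = {7, "second"};
        live.erase(42);                               // delete by primary key
        bool found = live.find(7) != live.end();      // retrieve by primary key

        // (b) bulk changes happen during the 12-hour closed window:
        // rebuild a sorted array offline, then serve predictable lookups during the day.
        std::vector<Record> inventory = {{3, "bolts"}, {1, "nuts"}, {2, "screws"}};
        std::sort(inventory.begin(), inventory.end(),
                  [](const Record& a, const Record& b) { return a.key < b.key; });
        auto it = std::lower_bound(inventory.begin(), inventory.end(), 2L,
                                   [](const Record& r, long k) { return r.key < k; });
        bool in_stock = it != inventory.end() && it->key == 2;
        return found && in_stock ? 0 : 1;
    }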
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 2 years ago.
I need to maintain a large directed graph G, with possibly millions of nodes and edges. Potentially it may not fit in memory.
Some frequent operations I need to perform on this graph include:
Each node/edge will have user-defined properties associated with it, such as access count, weight, etc.
For each node (vertex), I will need to perform efficient queries based on the property values. For example, find the nodes whose X value is greater than v1 but less than v2. This probably requires building an index on certain fields.
I will need to find all incoming and outgoing edges of a given node and update the weights of the edges.
I will need to do local (DFS-based) traversal from a given node and return all paths which satisfy a certain user-defined predicate (this predicate may use the property values of the nodes/edges in a path).
I will need to add/delete nodes/edges efficiently. This is not performed as often as operations 1, 2, and 3, though.
Potentially there are some hot spots in the graph which get accessed much more often than the other parts, and I would like to cache these hot spots in memory.
What is the efficient way to achieve this with the least implementation effort?
I am looking at some disk-based graph databases, such as Neo4j/InfiniteGraph/DEX. Even though they support all the above operations, they seem to be overkill, since I don't need a lot of the features they offer, such as consistency/concurrency control or cluster-based replication. Also, a lot of them are based on Java, and I prefer something with a C/C++ interface.
Basically I just need an on-disk graph library which handles persistence, queries on nodes, and local traversal efficiently. Do you have any recommendation for an existing (open source) project which I can use? If not, what's the best way to implement such a thing?
I have seen some large graphs with millions upon millions of nodes. What I recommend is that, past a certain point, you do a weighted compression: take N nodes and compress them into N/M nodes, using averages and weights, and then rebuild the graph.
You would recompress after every so many nodes, of your choice. The reason is that as everything gets huge, you will be able to, in a sense, normalize it over time.
You have a directed graph. As you pass over larger and larger groups of nodes, you can say that if A>B>(E&D)>H, then you can also record things like A>H.
This allows you to determine common routes between nodes, based on the shortest jumps between nodes. If a route isn't in the compressed list, it will at least head towards a certain area and can, depending on the case, be decompressed in some sense.
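If I read this right, the compression step amounts to replacing pass-through nodes with shortcut edges that carry combined weights. A hedged sketch of that idea (names invented; a real version would pick which nodes to collapse from your access counts and keep a mapping so routes can be "decompressed" again):

    #include <map>
    #include <utility>

    // Edge weights keyed by (from, to).
    using Graph = std::map<std::pair<int, int>, double>;

    // Collapse a node that has exactly one incoming edge (pred -> mid) and one
    // outgoing edge (mid -> succ) into a single shortcut edge pred -> succ,
    // accumulating the weights. E.g. A>B>H becomes A>H.
    void collapse(Graph& g, int pred, int mid, int succ) {
        double w_in  = g[{pred, mid}];
        double w_out = g[{mid, succ}];
        g.erase({pred, mid});
        g.erase({mid, succ});
        g[{pred, succ}] += w_in + w_out;   // keep any existing pred->succ edge, add to it
    }

    int main() {
        Graph g;
        g[{'A', 'B'}] = 1.0;
        g[{'B', 'H'}] = 2.5;
        collapse(g, 'A', 'B', 'H');        // g now holds only A>H with weight 3.5
        return g.count({'A', 'H'}) ? 0 : 1;
    }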
Closed. This question is off-topic. It is not currently accepting answers.
Closed 11 years ago.
I am thinking of something like a three-dimensional array indexed by the x, y, z coordinates. But that would be a waste of memory, since a lot of block spaces are empty. Another solution would be a hashmap ((x,y,z) -> BlockObject), but that doesn't seem too efficient either.
When I say efficient, I do not mean optimal; it simply means it would be enough to run smoothly on a modern-day computer. Keep in mind that the worlds generated by Minecraft are quite huge, so efficiency is important regardless. There is also tons of metadata that needs to be stored.
As noted in my comment, I have no idea how Minecraft does this, but a common, efficient way of representing this sort of data is an octree: http://en.wikipedia.org/wiki/Octree. The general idea is that it's like a binary tree, but in three-space. You recursively divide each block of space in each dimension to get eight smaller blocks, and each block contains pointers to the smaller blocks and a pointer to its parent block.
This allows you to be efficient about storing large blocks of the same material (e.g., "empty space"), because you can terminate the recursion whenever you get to a block that is made up of all the same thing, even if you haven't recursed down to the level of individual "cube" units.
Also, this means that you can efficiently find all the cubes in a given region by taking your current block and going up the tree just far enough to get to a block that contains all you can see -- and that way, you can very easily ignore all the cubes that are somewhere else.
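A hedged sketch of that idea (all types invented for illustration): each node either records that its whole cube is a single material or holds eight children, so lookups descend only as deep as the world actually varies, and large uniform regions such as air or stone cost a single node.

    #include <array>
    #include <cstdint>
    #include <memory>

    using Material = std::uint8_t;                 // e.g. 0 = air, 1 = stone, ...

    struct OctreeNode {
        Material material = 0;                     // meaningful when 'uniform' is true
        bool uniform = true;                       // whole cube is one material
        std::array<std::unique_ptr<OctreeNode>, 8> child;   // used only when !uniform
    };

    // Which of the 8 sub-cubes of a cube with side 'size' contains (x, y, z)?
    static int octant(int x, int y, int z, int size) {
        int half = size / 2;
        return (x >= half ? 1 : 0) | (y >= half ? 2 : 0) | (z >= half ? 4 : 0);
    }

    Material get(const OctreeNode& n, int x, int y, int z, int size) {
        if (n.uniform || size == 1) return n.material;
        int half = size / 2, o = octant(x, y, z, size);
        if (!n.child[o]) return n.material;        // unexpanded child inherits the parent's material
        return get(*n.child[o], x % half, y % half, z % half, half);
    }

    void set(OctreeNode& n, int x, int y, int z, int size, Material m) {
        if (size == 1) { n.material = m; n.uniform = true; return; }
        if (n.uniform && n.material == m) return;  // already all 'm': nothing to do
        if (n.uniform) {                           // split: children start as copies of the parent
            for (auto& c : n.child) { c.reset(new OctreeNode()); c->material = n.material; }
            n.uniform = false;
        }
        int half = size / 2, o = octant(x, y, z, size);
        set(*n.child[o], x % half, y % half, z % half, half, m);
        // (a production version would also re-merge children that became identical)
    }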
If you're interested in exploring alternative means of representing Minecraft world (chunk) data, you can also look into the idea of bitstrings. Each 'chunk' comprises a volume of 16*16*128 blocks, where a position within a 16*16 layer can adequately be represented by a single byte, and a layer can be consolidated into a binary string.
As this approach is highly specific to a certain goal of trading client-computation vs highly optimized storage and transfer time, it seems imprudent to attempt to explain all the details, but I have created a specification for just this purpose, if you're interested.
Using this method, the storage cost is drastically different from the current 1 byte per block; instead it is 'variable-bit-rate': ((1 bit per block, rounded up to a multiple of 8) * (number of unique layers a blocktype appears in within a chunk) + 2 bytes).
This is then summed over the unique blocktypes in that chunk.
Pretty much only in deliberate edge cases can this be more expensive than a normally structured chunk; in excess of 99% of Minecraft chunks are naturally generated, and in many of my tests they benefit from this variable-bit representation by a ratio of 8:1 or more.
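To illustrate my reading of that formula (treat the exact record layout as an assumption): for each blocktype present in a chunk you store a small header plus one 16*16 = 256-bit (32-byte) presence bitmap per layer in which that blocktype occurs; layers where it does not occur cost nothing.

    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <vector>

    const int W = 16, L = 16, H = 128;             // chunk dimensions: 16 x 16 x 128

    // chunk[y][z][x] = blocktype id, with y as the layer index (0..127)
    using Chunk = std::uint8_t[H][L][W];

    struct LayerMask {
        std::uint8_t layer = 0;                    // which of the 128 layers
        std::uint8_t bits[32] = {};                // 16*16 presence bits, 1 bit per block
    };

    // One record per blocktype that appears in the chunk:
    // 2-byte header (blocktype id + layer count) + 32 bytes per occupied layer.
    std::map<std::uint8_t, std::vector<LayerMask>> encode(const Chunk& c, std::size_t* out_bytes) {
        std::map<std::uint8_t, std::vector<LayerMask>> rec;
        for (int y = 0; y < H; ++y) {
            std::map<std::uint8_t, LayerMask> masks;   // blocktypes seen in this layer
            for (int z = 0; z < L; ++z)
                for (int x = 0; x < W; ++x) {
                    LayerMask& m = masks[c[y][z][x]];
                    m.layer = static_cast<std::uint8_t>(y);
                    int bit = z * W + x;
                    m.bits[bit / 8] |= static_cast<std::uint8_t>(1u << (bit % 8));
                }
            for (auto& kv : masks) rec[kv.first].push_back(kv.second);
        }
        std::size_t bytes = 0;
        for (auto& kv : rec) bytes += 2 + 32 * kv.second.size();   // the cost formula above
        if (out_bytes) *out_bytes = bytes;
        return rec;
    }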
Your best bet is to decompile Minecraft and look at the source. Modifying Minecraft: The Source Code is a nice walkthrough on how to do that.
Minecraft is very far from efficient. It just stores "chunks" of data.
Check out the "Map formats" in the Development Resources at Minecraft Wiki. AFAIK, the internal representation is exactly the same.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
I am interested in good literature about spatial indexes: which ones are in use, comparisons between them in terms of speed and space requirements, spatial query performance when using them, etc.
I used to use a kind of home-grown QuadTree for spatial indexing (well before I learned the word "quadtree"). For ordinary kinds of spatial data (I deal with street map data), they are fast to create and fast to query, but they scan too many leaf nodes during queries. Specifically, with reasonable node sizes (50-100), my quadtree tended to produce around 300 results for a point query, i.e. 3-6 leaf nodes apply (very rough ballpark; results are highly variable.)
Nowadays, my preferred data structure is the R*-tree. I wrote and tested an implementation myself that obtained very good results. My code for building an R*-tree is very slow compared to my QuadTree code, but the bounding boxes on the leaf nodes end up very well organized; at least half of the query space is answered by only one leaf node (i.e. if you do a random point query, there is a good chance that only a single leaf node is returned), and something like 90% of the space is covered by two nodes or fewer. So with a node size of 80 elements, I'd typically get 80 or 160 results from a point query, with the average closer to 160 (since a few queries do return 3-5 nodes). This holds true even in dense urban areas of the map.
I know this because I wrote a visualizer for my R* tree and the graphical objects inside it, and I tested it on a large dataset (600,000 road segments). It performs even better on point data (and other data in which bounding boxes rarely overlap). If you implement an R* tree I urge you to visualize the results, because when I wrote mine it had multiple bugs that lowered the efficiency of the tree (without affecting correctness), and I was able to tweak some of the decision-making to get better results. Be sure to test on a large dataset, as it will reveal problems that a small dataset does not. It may help to decrease the fan-out (node size) of the tree for testing, to see how well the tree works when it is several levels deep.
I'd be happy to give you the source code, except that I would need my employer's permission. You know how it is. In my implementation I support forced reinsertion, but my PickSplit and insertion penalty have been tweaked.
The original paper, The R* tree: An Efficient and Robust Access Method for Points and Rectangles, is missing dots for some reason (no periods and no dots on the "i"s). Also, their terminology is a bit weird, e.g. when they say "margin", what they mean is "perimeter".
The R* tree is a good choice if you need a data structure that can be modified. If you don't need to modify the tree after you first create it, consider bulk loading algorithms. If you only need to modify the tree a small amount after bulk loading, ordinary R-tree algorithms will be good enough. Note that R*-tree and R-tree data is structurally identical; only the algorithms for insertion (and maybe deletion? I forget) are different. R-tree is the original data structure from 1984; here's a link to the R-tree paper.
The kd-tree looks efficient and not too difficult to implement, but it can only be used for point data.
By the way, the reason I focus on leaf nodes so much is that I need to deal with disk-based spatial indexes. You can generally cache all the inner nodes in memory because they are a tiny fraction of the index size; therefore the time it takes to scan them is tiny compared to the time required for a leaf node that is not cached.
I save a lot of space by not storing bounding boxes for the elements in the spatial index, which means I have to actually test the original geometry of each element to answer a query. Thus it's even more important to minimize the number of leaf nodes touched.
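For context, here is a hedged sketch of the point query those numbers are about (the node layout is invented for illustration): the query descends only into children whose bounding rectangles contain the point, so the cost is dominated by how many leaf rectangles end up containing it, which is exactly what the R*-tree split heuristics try to minimize.

    #include <cstddef>
    #include <vector>

    struct Rect {
        double x1, y1, x2, y2;                           // min / max corners
        bool contains(double x, double y) const {
            return x1 <= x && x <= x2 && y1 <= y && y <= y2;
        }
    };

    struct RTreeNode {
        bool leaf = true;
        std::vector<Rect>       boxes;                   // MBR of each child / entry
        std::vector<RTreeNode*> children;                // used when !leaf
        std::vector<int>        ids;                     // element ids, used when leaf
    };

    // Collect all element ids whose bounding box contains (x, y). For a disk-based
    // index, every leaf reached here is roughly one page read, which is why the
    // number of leaves touched matters so much.
    void point_query(const RTreeNode* n, double x, double y, std::vector<int>& out) {
        for (std::size_t i = 0; i < n->boxes.size(); ++i) {
            if (!n->boxes[i].contains(x, y)) continue;
            if (n->leaf) out.push_back(n->ids[i]);       // candidate: test the real geometry next
            else         point_query(n->children[i], x, y, out);
        }
    }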
I developed an algorithm for quadrant-based fast search and published it on ddj.com a couple of years ago. Maybe it's interesting for you:
Accelerated Search For the Nearest Line
http://drdobbs.com/windows/198900559
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
I'm wondering whether anyone here has ever used a skip list. It looks to have roughly the same advantages as a balanced binary tree but is simpler to implement. If you have, did you write your own, or use a pre-written library (and if so, what was its name)?
My understanding is that they're not so much a useful alternative to binary trees (e.g. red-black trees) as they are to B-trees for database use, so that you can keep the # of levels down to a feasible minimum and deal w/ base-K logs rather than base-2 logs for performance characteristics. The algorithms for probabilistic skip-lists are (IMHO) easier to get right than the corresponding B-tree algorithms. Plus there's some literature on lock-free skip lists. I looked at using them a few months ago but then abandoned the effort on discovering the HDF5 library.
literature on the subject:
Papers by Bill Pugh:
A skip list cookbook
Skip lists: A probabilistic alternative to balanced trees
Concurrent Maintenance of Skip Lists
non-academic papers/tutorials:
Eternally Confuzzled (has some discussion on several data structures)
"Skip Lists" by Thomas A. Anastasio
Actually, for one of my projects, I am implementing my own full STL. And I used a skiplist to implement my std::map. The reason I went with it is that it is a simple algorithm which is very close to the performance of a balanced tree but has much simpler iteration capabilities.
Also, Qt4's QMap was a skip list as well, which was the original inspiration for using one in my std::map.
Years ago I implemented my own for a probabilistic algorithms class. I'm not aware of any library implementations, but it's been a long time. It is pretty simple to implement. As I recall, they had some really nice properties for large data sets and avoided some of the problems of rebalancing. I think the implementation is also simpler than binary trees in general. There is a nice discussion and some sample C++ code here:
http://www.ddj.us/cpp/184403579?pgno=1
There's also an applet with a running demonstration. Cute 90's Java shininess here:
http://www.geocities.com/siliconvalley/network/1854/skiplist.html
Java 1.6 (Java SE 6) introduced ConcurrentSkipListSet and ConcurrentSkipListMap to the collections framework. So, I'd speculate that someone out there is really using them.
Skiplists tend to offer far less contention for locks in a multithreaded situation, and (probabilistically) have performance characteristics similar to trees.
See the original paper [pdf] by William Pugh.
I implemented a variant that I termed a Reverse Skip List for a rules engine a few years ago. Much the same, but the reference links run backward from the last element.
This is because it was faster for inserting sorted items that were most likely towards the back-end of the collection.
It was written in C# and took a few iterations to get working successfully.
The skip list has the same logarithmic time bounds for searching as is achieved by the binary search algorithm, yet it extends that performance to update methods when inserting or deleting entries. Nevertheless, the bounds are expected for the skip list, while binary search of a sorted table has a worst-case bound.
Skip lists are easy to implement, but you have to be careful when adjusting the pointers during insertion and deletion. I have not used one in a real program, but I have done some runtime profiling. Skip lists are different from search trees; the similarity is that they give an average of O(log n) per dictionary operation, much like a splay tree. They are better than an unbalanced search tree but not better than a balanced tree.
Every skip list node has forward pointers which represent the current->next() connections at the different levels of the skip list. Typically the number of levels is bounded at a maximum of about ln(N), so if N = 1 million the level is about 13. There will be that many pointers per node, and in Java this means twice the number of pointers when implementing reference data types, whereas a balanced search tree has fewer pointers and gives the same runtime.
Skip list vs. splay tree vs. hash: as profiled for dictionary lookup ops, a lock-stripped hashtable gives results in under 0.010 ms, whereas a splay tree gives ~1 ms and a skip list ~720 ms.
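For anyone who wants to see how small the core really is, here is a hedged sketch of a probabilistic skip list (search and insert only; deletion and cleanup are omitted, and all names are invented). Each node carries one forward pointer per level it participates in, and levels are chosen by coin flips, which is where the expected O(log n) behaviour discussed above comes from.

    #include <cstdlib>
    #include <vector>

    struct SkipNode {
        int key;
        std::vector<SkipNode*> next;                 // one forward pointer per level
        SkipNode(int k, int levels) : key(k), next(levels, nullptr) {}
    };

    class SkipList {
        static constexpr int kMaxLevel = 20;         // enough for ~1M keys with p = 1/2
        SkipNode head_{0, kMaxLevel};                // sentinel; its key is unused
        int level_ = 1;                              // number of levels currently in use

        static int random_level() {                  // coin flips: P(level >= k) = 2^-(k-1)
            int lvl = 1;
            while (lvl < kMaxLevel && (std::rand() & 1)) ++lvl;
            return lvl;
        }

    public:
        bool contains(int key) const {
            const SkipNode* n = &head_;
            for (int lvl = level_ - 1; lvl >= 0; --lvl)          // drop down level by level
                while (n->next[lvl] && n->next[lvl]->key < key)
                    n = n->next[lvl];
            n = n->next[0];
            return n && n->key == key;
        }

        void insert(int key) {
            SkipNode* update[kMaxLevel];                         // last node before 'key' per level
            SkipNode* n = &head_;
            for (int lvl = level_ - 1; lvl >= 0; --lvl) {
                while (n->next[lvl] && n->next[lvl]->key < key)
                    n = n->next[lvl];
                update[lvl] = n;
            }
            if (n->next[0] && n->next[0]->key == key) return;    // already present
            int lvl = random_level();
            for (int i = level_; i < lvl; ++i) update[i] = &head_;
            if (lvl > level_) level_ = lvl;
            SkipNode* node = new SkipNode(key, lvl);             // leaked on purpose in this toy
            for (int i = 0; i < lvl; ++i) {                      // splice into every level it joins
                node->next[i] = update[i]->next[i];
                update[i]->next[i] = node;
            }
        }
    };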