How to implement an efficient excel-like app - data-structures

I need to implement an efficient excel-like app.
I'm looking for a data structure that will:
Store the data in an efficient manner (for example - I don't want
to pre-allocate memory for unused cells).
Allow efficient update when the user changes a formula in one of the cells
Any ideas?
Thanks,
Li

In this case, you're looking for an online dictionary structure. This is a category of structures which allow you to associate one morsel of data (in this case, the coordinates that represent the cell) with another (in this case, the cell contents or formula). The "online" adjective means dictionary entries can be added, removed, or changed in real time.
There are many such structures. To name some of the more common ones: hash tables, binary trees, skip lists, linked lists, and even plain arrays.
Of course, some of these are more efficient than others (depending on implementation and the number of entries). Typically I use hash tables for this sort of problem.
However, if you need to do range queries ("modify all of the cells in this range"), you may be better off with a binary tree or a more complicated spatial structure -- but that's unlikely to be necessary given the simple requirements of the problem.
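To make the hash-table suggestion concrete, here is a minimal sketch of sparse cell storage in C++: cells are addressed by (row, column) and only cells that actually hold data occupy memory. The type names and the string-valued cell content are illustrative assumptions, not a full spreadsheet design.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Coordinates identifying one cell.
struct CellCoord {
    std::uint32_t row;
    std::uint32_t col;
    bool operator==(const CellCoord& o) const {
        return row == o.row && col == o.col;
    }
};

// Hash the coordinate pair by packing it into a single 64-bit key.
struct CellCoordHash {
    std::size_t operator()(const CellCoord& c) const {
        return std::hash<std::uint64_t>()(
            (static_cast<std::uint64_t>(c.row) << 32) | c.col);
    }
};

// Only cells the user actually fills in are ever allocated.
using SparseSheet = std::unordered_map<CellCoord, std::string, CellCoordHash>;
```

Lookup, insertion, and update are all expected O(1), which covers the "efficient update when the user changes a formula" requirement for a single cell.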

Related

How are large keys and values stored in b-trees with small sectors?

I've been making a key-value store that saves to disk as a personal project, using a B-tree as my data structure, but I want to support large keys and values, as many other key-value stores such as Redis do.
How should large keys and values be stored within a B-tree when the sector size is as small as 512 bytes? If you allow larger keys and values, how many keys should you allow per node, and should I consider another data structure for storing variable-sized data?
You can either define overflow pages to form nodes out of a linked list of pages, or you can refer to keys and values via pointers stored in b-tree leaf nodes. The pointers can refer to a linked list of pages or a special kind of sub-tree. You can store some inline content in the leaf node if this reduces wastage due to unfilled pages.
How many keys to allow per node when going for the overflow design? As few as possible. The design doesn't scale as the linked list grows: if you need to store very large values, you can see how it becomes quite expensive, because you have to scan and skip over so many extra pages.
The pointer-based approach scales better, but for it to be most effective for keys, as much of the key as possible should be inlined; otherwise you always have to follow pointers when doing searches. You can potentially apply a prefix-compression technique in which a common prefix is stored once. This allows more of the key to fit in the page, reducing the likelihood of having to follow a pointer.
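The inline-plus-overflow idea can be sketched as a leaf-entry layout. This is a minimal illustration, assuming 512-byte pages; the field names, the 16-byte inline prefix, and the `prefix_may_match` helper are invented for the example and are not from any particular store.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// One entry in a B-tree leaf: a short key prefix is stored inline so
// most comparisons avoid I/O, and long keys overflow to a page chain.
struct LeafEntry {
    std::array<char, 16> key_prefix{};  // inline portion of the key
    std::uint16_t key_len = 0;          // full key length in bytes
    std::uint64_t overflow_page = 0;    // 0 = key fits entirely inline
};

// Conservative filter: returns false only when the inline prefix alone
// proves a mismatch; a true result may still require reading the
// overflow page to confirm a full match.
inline bool prefix_may_match(const LeafEntry& e, const std::string& key) {
    std::size_t n = std::min(e.key_prefix.size(), key.size());
    if (std::memcmp(e.key_prefix.data(), key.data(), n) != 0) return false;
    // Prefixes agree; an inline key must also match on length, while a
    // key with an overflow page needs a follow-up read to decide.
    return key.size() == e.key_len || e.overflow_page != 0;
}
```

The point of the layout is that a search scanning a node touches only the fixed-size entries; the overflow pointer is followed only for the rare entries whose prefix comparison is inconclusive.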

A data structure with certain properties

I want to implement a data structure myself in C++11. What I'm planning to do is have a data structure with the following properties:
search: O(log n)
insert: O(log n)
delete: O(log n)
iterate: O(n)
After some research, what I have been thinking about is implementing a balanced binary search tree. Are there other structures that would fulfill my needs? I'm completely new to this topic and thought a question here would give me a good jumpstart.
First of all, using the existing standard library data types is definitely the way to go for production code. But since you are asking how to implement such data structures yourself, I assume this is mainly an educational exercise for you.
Binary search trees of some form (https://en.wikipedia.org/wiki/Self-balancing_binary_search_tree#Implementations) or B-trees (https://en.wikipedia.org/wiki/B-tree) and hash tables (https://en.wikipedia.org/wiki/Hash_table) are definitely the data structures that are usually used to accomplish efficient insertion and lookup. If you want to go wild you can combine the two by using a tree instead of a linked list to handle hash collisions (although this has a good potential to actually make your implementation slower if you don't make massive mistakes in sizing your hash table or in choosing an adequate hash function).
Since I'm assuming you want to learn something, you might want to have a look at minimal perfect hashing in the context of hash tables (https://en.wikipedia.org/wiki/Perfect_hash_function), although this only has uses in special applications (I had the opportunity to use a minimal perfect hash function exactly once). But it sure is fascinating. As you can see from the link above, the botany of search trees is virtually limitless in scope, so you can also go wild on that front.
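For reference, the C++11 standard library already guarantees exactly the listed complexities: std::map is required to be an ordered associative container (typically implemented as a red-black tree), giving O(log n) search/insert/erase and O(n) in-order iteration. A quick sketch, useful as a baseline before writing your own tree:

```cpp
#include <map>
#include <string>
#include <vector>

// Collect the keys of a std::map in sorted order: a single O(n)
// in-order walk, demonstrating the "iterate" requirement.
std::vector<int> sorted_keys(const std::map<int, std::string>& m) {
    std::vector<int> keys;
    for (const auto& kv : m) keys.push_back(kv.first);
    return keys;
}
```

Reimplementing a red-black or AVL tree and checking your results against std::map is a good way to structure the exercise.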

Why is lookup in an Array O(1)?

I believe that in some languages other than Ruby, an Array lookup is O(1) because you know where the data starts, and you multiply the index by the size of the data the array is holding, and then access that memory location.
However, in Ruby, an Array can have objects from different classes, so how does it manage to do a lookup of O(1) complexity?
What @Neil Slater said, with a little more detail…
There are basically two plausible approaches to storing an array of heterogeneous objects of differing sizes:
1. Store the objects as a singly- or doubly-linked list, with the storage space for each individual object preceded by pointer(s) to the preceding and/or following objects. This structure has the advantage of making it very easy to insert new objects at arbitrary points without shifting around the rest of the array, but the huge downside is that looking up an object by its position is generally O(N), since you have to start from one end of the list and jump through it node-by-node until you arrive at the n-th one.
2. Store a table or array of constant-sized pointers to the individual objects. Since this lookup table contains constant-sized items in a contiguous ordered layout, looking up the address of an individual object is O(1); the table is just a C-style array, in which lookup takes only one-to-a-few machine instructions, even on RISC CPU architectures.
(The allocation strategies for storing the individual objects are also interesting and complex, but not immediately relevant to your question.)
Dynamic languages like Perl/Python/Ruby pretty much all opt for #2 for their general-purpose list/array types. In other words, they make lookup more efficient than inserting objects at random locations in the list, which is the better choice for many applications.
I'm not familiar with the implementation details for Ruby, but they are likely quite similar to those of Python's list type, whose performance and design is explained in wonderful detail at effbot.org.
Its implementation probably contains an array of memory addresses pointing to the actual objects. Therefore it can still perform a lookup without looping through the array.
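Approach #2 can be sketched in a few lines of C++: a "heterogeneous array" is really a contiguous array of same-sized pointers, so indexing is one address computation regardless of what each element points to. The `Object`/`IntObj`/`StrObj` names are stand-ins for a dynamic language's tagged object headers, invented for the example.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Stand-in for a dynamic language's object header.
struct Object {
    virtual ~Object() = default;
};
struct IntObj : Object { int v; explicit IntObj(int v) : v(v) {} };
struct StrObj : Object {
    std::string s;
    explicit StrObj(std::string s) : s(std::move(s)) {}
};

// Lookup is base + index * sizeof(Object*): O(1), independent of what
// kind of object each slot refers to.
inline Object* lookup(Object* const* table, std::size_t i) {
    return table[i];
}
```

The pointers all have the same size even though the objects behind them do not, which is exactly why mixed-type arrays keep O(1) indexing.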

Is a linked list in a B-tree node superior to an array?

I want to implement a B-tree index for my database.
I have read many data structure and algorithm books to learn how to do it. All implementations use an array to save data and child indexes.
Now I want to know: is a linked list in B-tree node superior to an array?
There are some ideas I've thought about:
when splitting a node, the copy operation will be quicker than with an array.
when inserting data into the middle or at the head of an array, it is slower than inserting into a linked list.
The linked list is not better; in fact, a simple array is not better either (except for its simplicity, which is a good argument for it, and its search speed when sorted).
You have to realize that the "array" implementation is more a "reference" implementation than a true full power implementation. For example, the implementation of the data/key pairs inside a B-Tree node in commercial implementations uses many strategies to solve two problems: storage efficiency and efficient search of keys in the node.
With regard to efficient search, an array of key/value pairs with an internal balanced tree structure on top of it can make insertion/deletion/search O(log N); for large B-tree nodes this makes sense.
With regard to memory efficiency, the nature of the data in the keys and values is very important. For example, lexicographical keys can be shortened by sharing a common prefix (e.g. "good" and "great" have "g" in common), and the data might be compressed as well using any scheme relevant to its nature. Compressing keys is more complex, as you will want to preserve the lexicographical ordering. Remember that the more data and keys you pack into a node, the fewer disk accesses you need.
The time to split a node is only partially relevant, as it is smaller than the time to read or write a node on typical media by several orders of magnitude. On SSDs and extremely fast disks (within 10 to 20 years, disks are expected to be as fast as RAM), much research is being conducted to find a successor to B-trees; stratified B-trees are one example.
If the BTree is itself stored on the disk then a linked list will make it very complicated to maintain.
Keep the B-Tree structure compact. This will allow more nodes per page, locality of data and allowing caching of more nodes, and fewer disk reads/cache misses.
Use an array.
The perceived in-memory computational benefits are inconsequential.
So, in short, no, a linked list is not superior.
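A minimal sketch of why the in-node array wins in practice: keys stay contiguous and sorted, so search is a cache-friendly binary search and an insert is a single block move. The capacity of 16 is an arbitrary illustration; a real node would be sized to the page.

```cpp
#include <algorithm>
#include <cstddef>

// A toy B-tree node keeping its keys in a sorted, contiguous array.
struct Node {
    int keys[16];
    std::size_t count = 0;

    bool insert(int key) {
        if (count == 16) return false;  // caller would split the node
        // O(log n) binary search for the insertion point.
        int* pos = std::lower_bound(keys, keys + count, key);
        // One contiguous shift right -- a single cheap memory move.
        std::copy_backward(pos, keys + count, keys + count + 1);
        *pos = key;
        ++count;
        return true;
    }
};
```

With a linked list inside the node, the search alone would be O(n) pointer chasing with a cache miss per hop, which dwarfs the cost of the shift the list was meant to avoid.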
A B-tree is typically used in databases where the data is stored on disk and you want to minimize the number of blocks you read. I do not think your proposal would be efficient in that case (although it might be beneficial if you can load all the data into RAM).
If you want to perform those two operations effectively you should use a Skip List (http://en.wikipedia.org/wiki/Skip_list). Performance-wise it will be similar to what you have outlined.

What data structure do vectors in Clojure use?

What data structure does Clojure use to implement its vector type?
I ask because they have some interesting complexity properties. It is cheap (O(log32(N))) to index in to them, and you can get a new copy with any item changed cheaply.
This would lead me to think that it is based on a (really wide) tree, but that wouldn't explain why it is cheap to add to one end but not the other. You also can't cheaply insert or delete elements in the middle of a vector.
Yes, they are wide trees. http://blog.higher-order.net/2009/02/01/understanding-clojures-persistentvector-implementation.html and http://hypirion.com/musings/understanding-persistent-vector-pt-1 are two article series describing in more detail how they work.
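The O(log32 N) indexing cost comes from the index arithmetic itself: in a 32-way trie like Clojure's PersistentVector, each 5-bit group of the index selects one child, so a lookup walks roughly log32(N) levels. A sketch of just the index math (not the full persistent structure):

```cpp
#include <cstdint>

// Which child slot (0..31) does `index` select at a given tree level?
// Level 0 is the leaf; each higher level consumes the next 5 bits.
inline std::uint32_t child_slot(std::uint32_t index, std::uint32_t level) {
    return (index >> (5 * level)) & 0x1F;
}
```

Appending at the end is cheap because only the rightmost path of the tree changes (and the tail node is often updated without touching the tree at all), while inserting in the middle would shift every subsequent index, invalidating the bit-slicing above.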
