Best data structure to implement an Excel spreadsheet - data-structures

How can we implement an Excel-like spreadsheet that supports creation and deletion of rows, creation and deletion of cells, and modification of the data inside any cell?
I am looking for the best data structure to implement this.

The problem statement is a little vague in my opinion. We have no information about which operations will be most frequent, or even about the amount of data this data structure is going to hold.
So let's assume there can be a fair amount of data, and that the main operations are addition and deletion of rows and cells.
If I had to implement an Excel-like spreadsheet with a custom data structure, I would make each row a node of a linked list. This is helpful because, as opposed to an (n-dimensional) array, the memory can be assigned in a non-contiguous manner; it also makes adding and deleting rows much easier.
Inside each node we can have an array of strings to hold the cell values, and an Id field to hold the Id of the row.
The head node of the structure holds the column names in its string array, so each column is mapped to an index of the array.
To add a row: it is an insert into the linked list. Make a new row node and append it at the end.
To delete a row: same as deleting a node from a linked list.
To add/update a cell value: you know the row Id and the column name, so you can look up the column's index in the head node's array. Once you have the node corresponding to the row, access that index of its string array to add/read/update/delete the cell value.
To optimize node access, you can keep an index over the linked list to locate a node by row Id quickly: for example, store a row-Id-to-node-pointer mapping in an auxiliary map or array so that inserting rows in between is also fast. A sketch of this design follows below.
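A minimal Python sketch of the design above, using illustrative names (Sheet, RowNode) that are not from the original answer. Here the column map lives on the sheet object rather than in a head node, the rows form a doubly linked list so unlinking is O(1), and an auxiliary dict maps row Ids to nodes:

```python
class RowNode:
    def __init__(self, row_id, num_cols):
        self.row_id = row_id
        self.cells = [""] * num_cols   # one string per column
        self.prev = None
        self.next = None

class Sheet:
    def __init__(self, columns):
        self.columns = {name: i for i, name in enumerate(columns)}
        self.head = None     # first data row
        self.tail = None     # last data row, for O(1) append
        self.index = {}      # row_id -> RowNode (the auxiliary map)

    def add_row(self, row_id):
        node = RowNode(row_id, len(self.columns))
        if self.tail is None:
            self.head = self.tail = node
        else:
            self.tail.next, node.prev = node, self.tail
            self.tail = node
        self.index[row_id] = node

    def delete_row(self, row_id):
        node = self.index.pop(row_id)
        if node.prev: node.prev.next = node.next
        else: self.head = node.next
        if node.next: node.next.prev = node.prev
        else: self.tail = node.prev

    def set_cell(self, row_id, column, value):
        self.index[row_id].cells[self.columns[column]] = value

    def get_cell(self, row_id, column):
        return self.index[row_id].cells[self.columns[column]]
```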
However, I would reiterate that the implementation should be chosen on a use-case basis. If there are heavy column addition/deletion operations, for example, this design will be quite slow. Each kind of use case comes with different trade-offs.

I think the easy way to go about this is to simply use a JSON structure to hold each row, with column names as keys and the cell values as values. This handles null/empty values quite easily.
A spreadsheet is essentially similar to a table: changes can be made to any cell in any row. Hence going with a simple list structure would not be too bad. The downside is that deletion and insertion of rows in the middle is not performant, but insertion of rows at the end, which is the most common use case, and modification of cells are both easy.
A linked-list structure would make in-between insertion and deletion faster, but it would hurt random access badly, so a simple list of JSON objects is the better choice. For example:
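A quick sketch of the JSON-row idea, using Python dicts to stand in for JSON objects; missing keys naturally model empty cells:

```python
rows = []
rows.append({"Name": "Alice", "Age": 30})   # append at end: O(1)
rows.append({"Name": "Bob"})                # "Age" cell left empty
rows[1]["Age"] = 25                         # update a cell in place
del rows[0]                                 # delete a row: O(n) shift
print(rows)                                 # [{'Name': 'Bob', 'Age': 25}]
```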

Related

How to implement dynamic indexes?

I know the title may be a little confusing; however, my actual question is basic, I think.
I'm working on a brand-new LRU implementation, for which I use an Index Table that maps the name of an incoming packet to the index at which the packet's content is stored in the CS (content store).
Each incoming packet is stored in the CS and can be addressed through the Index Table.
Now suppose a new packet arrives. Under LRU, its index must be set to the top of the CS (zero), and all the other indexes have to be updated: each of them needs to be incremented as a result.
One obvious solution is to loop over all entries in the Index Table and increment them.
Is there any solution or data structure that is used for such a problem?
I don't see how you are establishing the order of your cache in the description. But to answer your question, it's possible to reduce the LRU store method to O(1) time complexity.
The classical way to do it is to have these two data structures:
A doubly linked list, for order in the cache. Each node stores a data element (it plays the role of your content store).
A HashMap that associates each key with a pointer to its node in the linked list (it plays the role of your index table).
When you access data that is already stored in your cache, it must move to the top of the list, so you delete the corresponding node from the linked list (in O(1) time, because you have access to its previous and next nodes) and re-insert it at the head.
For new data it is simpler: just store it at the head of the list and put the (key, node pointer) pair in the hashmap.
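A sketch of that classic design; the capacity handling (evicting from the tail) is an assumption added to make the example complete:

```python
class _Node:
    __slots__ = ("key", "value", "prev", "next")
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.prev = self.next = None

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.map = {}                    # key -> node (the "index table")
        self.head = self.tail = None     # head = most recently used

    def _unlink(self, node):
        if node.prev: node.prev.next = node.next
        else: self.head = node.next
        if node.next: node.next.prev = node.prev
        else: self.tail = node.prev

    def _push_front(self, node):
        node.prev, node.next = None, self.head
        if self.head: self.head.prev = node
        self.head = node
        if self.tail is None: self.tail = node

    def get(self, key):
        node = self.map.get(key)
        if node is None:
            return None
        self._unlink(node)               # O(1) move-to-front, no index shifting
        self._push_front(node)
        return node.value

    def put(self, key, value):
        if key in self.map:
            node = self.map[key]
            node.value = value
            self._unlink(node)
            self._push_front(node)
            return
        if len(self.map) >= self.capacity:
            lru = self.tail              # evict the least recently used
            self._unlink(lru)
            del self.map[lru.key]
        node = _Node(key, value)
        self.map[key] = node
        self._push_front(node)
```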

Inserting an element to a full hash table with a constant number of buckets

I am studying hash tables at the moment and have a question about an implementation with a fixed number of buckets.
Suppose we have a hash table with 23 buckets (for example). Let's use the simplest hash function (hash_value = key % table_size), with the keys being integers only. If one bucket can hold at most one element (no separate chaining), does that mean that once all buckets are full we will no longer be able to insert any element into the table at all? Or will we have to replace the element that has the same hash value with the new element?
I do understand that I am imposing a lot of constraints, and a real implementation might never look like that, but I want to be sure I understand this particular case.
A real implementation usually allows a hash table to resize, but resizing takes a long time and is often undesired. With a fixed-size hash table, an insert into a full table would probably return an error code or throw an exception for the user to handle.
Or will we have to actually replace element that has the same hash value with a new element?
In Java's HashMap, if you add a key equal to one already present in the table, only the value associated with that key is replaced by the new one; this never happens merely because two different keys hash to the same value.
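To make the fixed-size case concrete, here is a minimal sketch (all names are illustrative) of a table that resolves collisions by linear probing, replaces the value when the same key is inserted again, and raises once every bucket is occupied rather than evicting anything:

```python
class FixedHashTable:
    def __init__(self, size=23):
        self.slots = [None] * size            # each slot: (key, value) or None

    def insert(self, key, value):
        start = key % len(self.slots)
        for step in range(len(self.slots)):   # probe at most `size` slots
            i = (start + step) % len(self.slots)
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)  # empty slot, or same key: replace
                return
        raise OverflowError("hash table is full")
```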
Yes. An "open" hash table - which you are describing - has a fixed size, so it can fill up.
However, implementations will usually respond by copying all contents into a new, bigger table. In fact, they normally won't wait for the table to fill entirely, but use some criterion - for example, the fraction of space in use (often called the "load factor") - to decide when it's time to expand.
Some implementations will also "shrink" themselves into a smaller table if the load factor becomes too small due to deletions. A sketch of the growth step follows below.
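A minimal sketch of that growth step, continuing the linear-probing layout from the example above; a real table would call this once the load factor crosses some threshold (say 0.7) instead of waiting until it is full:

```python
def grow(slots):
    # Allocate a bigger array and re-insert every entry; positions
    # change because the modulus changes with the table size.
    new_slots = [None] * (2 * len(slots) + 1)
    for entry in slots:
        if entry is None:
            continue
        key, value = entry
        i = key % len(new_slots)
        while new_slots[i] is not None:       # linear probe to a free slot
            i = (i + 1) % len(new_slots)
        new_slots[i] = (key, value)
    return new_slots
```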
You'd probably find reading Google's hash table implementation, which includes some documentation of its internals, to be a good learning experience.

How to implement a collection that supports real-time filtering?

I want to implement a mutable sequential collection FilteredList that wraps another collection List and filters it based on a predicate.
Both the wrapped List and the exposed FilteredList are mutable and observable, and should be synchronized (so for example, if someone adds an element to List that element should appear in the correct position in FilteredList, and vice versa).
Elements that don't satisfy the predicate can still be added to FilteredList, but they will not be visible (they will still appear in the inner list).
The collections should support:
Insert(index, value), which inserts element value at position index, pushing subsequent elements forward.
Remove(index), which removes the element at position index, moving all subsequent elements back.
Update(index, value), which updates the element at position index to be value.
I'm having trouble coming up with a good synchronization mechanism.
I don't have any strict complexity bounds, but real world efficiency is important.
The best way to avoid synchronization difficulties is to create a data structure that doesn't need them: use a single data structure to present the filtered and unfiltered data.
You should be able to do that with a modified skip list (actually, an indexable skip list), which will give you O(log n) access by index.
What you do is maintain two separate sets of forward pointers for each node, rather than just one. One set is for the unfiltered list, as in a normal skip list, and the other set is for the filtered list.
Adding to or removing from the list works the same way for the filtered and unfiltered views: you find the node at the given index by following the appropriate filtered or unfiltered links, and then add or remove the node, updating both sets of link pointers.
This should be more efficient than a standard sequential list, because insertion and removal don't incur the cost of moving items up or down to make a hole or fill a gap; it's all done with references.
It takes a little more space per node, though. On average, a skip list requires two extra references per node. Since you're building what is in effect two skip lists in one, expect your nodes to require four extra references per node on average; the node layout would look something like the sketch below.
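This is not a full implementation, only a sketch of the node layout being described, with two independent sets of forward links (plus widths, which are what make a skip list indexable); the field names are illustrative:

```python
class DualSkipNode:
    def __init__(self, value, level):
        self.value = value
        # Links and index widths for the unfiltered view.
        self.next_all = [None] * level
        self.width_all = [0] * level
        # A second, independent set of links for the filtered view;
        # nodes failing the predicate simply aren't linked here.
        self.next_filtered = [None] * level
        self.width_filtered = [0] * level
```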
Edit after comment
If, as you say, you don't control List, then you still maintain the dual skip list I described, but the data stored in it is just an index into List. You said that List is observable, so you get notified of all insert and delete operations and should be able to maintain the index by reacting to those notifications.
When somebody wants to operate on FilteredList, you use the filtered index links to find the List index of the FilteredList record the user wants to affect. Then you pass the request on to List, using the translated index, and finally react to the resulting observable notification from List.
Basically, you're just maintaining a secondary index into List, so that you can translate FilteredList indexes into List indexes.

Sorting application difficulty

Currently I am reading a book on algorithms and found this usage of sorting.
Reconstructing the original order - How can we restore the original arrangement of a set of items after we permute them for some application? Add an extra field to the data record for the item, such that the i-th record sets this field to i. Carry this field along whenever you move the record, and later sort on it when you want the initial order back.
I've been trying hard to understand what it means, and I have failed miserably. Can somebody please help?
Suppose you have a list of items in random order:
itemC, itemB, itemA, itemD
you sorted them up:
itemA, itemB, itemC, itemD
and you didn't have enough memory to store them in a separate location, so the original sequence is lost. Moreover, the original order is random, and it would be problematic or impossible to restore it.
The quoted passage gives a solution to this problem.
Add an extra field to the data record for the item, such that i-th record sets this field to i
So, we add an extra field for each of the items:
(itemC,1), (itemB,2), (itemA,3), (itemD, 4)
And after sort we have:
(itemA,3), (itemB,2), (itemC,1), (itemD, 4)
So we can easily restore the initial order by sorting on the additional field.
Let's say you have the data in an array, because it's the simplest structure that I can use to exemplify.
So, your node (i.e., element of the array) may look like this:
(some data type) data
The algorithm suggests adding an integer field, so it looks like this:
(some data type) data,
int position
And then you fill the positions with the actual indexes. Something like this pseudocode:
for current: 0 to lastElement
    array[current].position = current
(that's not written in any language I know of, but it should be readable)
After doing that, you shuffle it (or re-sort it) however you need to.
When you want to restore the original ordering, all you need to do is sort by the position field. Concretely:
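A tiny end-to-end sketch of the whole technique in Python: tag each record with its index, permute freely, then sort on the tag:

```python
records = ["itemC", "itemB", "itemA", "itemD"]
tagged = [(value, i) for i, value in enumerate(records)]  # add position field

tagged.sort(key=lambda r: r[0])         # permute (here: sort by value)
print(tagged)                           # [('itemA', 2), ('itemB', 1), ('itemC', 0), ('itemD', 3)]

tagged.sort(key=lambda r: r[1])         # sort on the position field
print([value for value, _ in tagged])   # original order restored
```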
Well, basically it's saying that you need some way to keep track of the original order (which is destroyed by the permutation). One option would be to simply reverse the permutation (check out Steve Jessop's informative answer here).
Another option to invert the permutation requires fewer processing steps but more memory: each node in your input set gets an extra ID field, and all the elements in the input set start out sorted on this field. Once you apply the permutation, the IDs are obviously no longer in sorted order. If you wish to invert the permutation, all you have to do is sort the list on this field again.
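For comparison, explicitly inverting a permutation is a single O(n) pass; this sketch assumes perm[i] gives the new position of the element that was originally at index i:

```python
def invert(perm):
    # inv[p] answers: which original index ended up at position p?
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv
```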

Best way to remove an entry from a hash table

What is the best way to remove an entry from a hash table that uses linear probing? One way would be to use a flag to indicate deleted elements. Are there any ways better than this?
An easy technique is to:
1. Find and remove the desired element.
2. Go to the next bucket.
3. If the bucket is empty, quit.
4. If the bucket is full, delete the element in that bucket and re-add it to the hash table using the normal means. The item must be removed before re-adding, because it is likely to be re-added into its original spot.
5. Repeat from step 2.
This technique keeps your table tidy at the expense of slightly slower deletions. A sketch follows below.
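A Python sketch of those steps for a linear-probing table stored as a list of (key, value) pairs or None; the helper names are mine, and it assumes the table is never completely full:

```python
def insert(table, key, value):
    i = key % len(table)
    while table[i] is not None and table[i][0] != key:
        i = (i + 1) % len(table)      # linear probe
    table[i] = (key, value)

def remove(table, key):
    i = key % len(table)
    while table[i] is not None and table[i][0] != key:
        i = (i + 1) % len(table)
    if table[i] is None:
        return                        # key not present
    table[i] = None                   # step 1: remove the desired element
    j = (i + 1) % len(table)
    while table[j] is not None:       # steps 2-5: walk the rest of the cluster
        k, v = table[j]
        table[j] = None               # must remove before re-adding
        insert(table, k, v)           # re-add using the normal means
        j = (j + 1) % len(table)
```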
It depends on how you handle overflow and whether (1) the item being removed is in an overflow slot or not, and (2) if there are overflow items beyond the item being removed, whether they have the hash key of the item being removed or possibly some other hash key. [Overlooking that double condition is a common source of bugs in deletion implementations.]
If collisions overflow into a linked list, it is pretty easy: you're either popping the head of the list (which may leave it empty) or deleting a member from the middle or end of the linked list. Those are fun and not particularly difficult. There are further optimizations to avoid excessive memory allocations and frees that make this even more efficient.
For linear probing, Knuth suggests that a simple approach is to have a way to mark a slot as empty, deleted, or occupied. Mark a removed occupant slot as deleted so that overflow by linear probing will skip past it, but if an insertion is needed, you can fill the first deleted slot that you passed over [The Art of Computer Programming, vol.3: Sorting and Searching, section 6.4 Hashing, p. 533 (ed.2)]. This assumes that deletions are rather rare.
Knuth gives a nice refinement as Algorithm R6.4 [pp. 533-534] that instead marks the cell as empty rather than deleted, and then finds ways to move table entries back closer to their initial-probe location by moving the hole that was just made until it ends up next to another hole.
Knuth cautions that this will move existing still-occupied slot entries and is not a good idea if pointers to the slots are being held onto outside of the hash table. [If you have garbage-collected- or other managed-references in the slots, it is all right to move the slot, since it is the reference that is being used outside of the table and it doesn't matter where the slot that references the same object is in the table.]
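A hedged sketch of the hole-moving idea (the same principle as Knuth's refinement, though not his exact algorithm): after emptying a slot, walk forward through the cluster and pull back any entry whose home slot lies at or before the hole:

```python
def backward_shift_delete(table, i):
    # `table` is a list of (key, value) or None; slot i is being vacated.
    size = len(table)
    table[i] = None
    j = (i + 1) % size
    while table[j] is not None:
        home = table[j][0] % size
        # Move the entry into the hole if the hole sits on the probe
        # path from its home slot; the hole then advances to j.
        if (j - home) % size >= (j - i) % size:
            table[i] = table[j]
            table[j] = None
            i = j
        j = (j + 1) % size
```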
The Python hash table implementation (arguably very fast) uses dummy elements to mark deletions. As you grow or shrink the table (assuming you're not using a fixed-size table), you can drop the dummies at the same time.
If you have access to a copy, have a look at the article in Beautiful Code about the implementation.
The best general solutions I can think of include:
If you can use a non-const iterator (à la the C++ STL or Java), you should be able to remove items as you encounter them. Presumably, though, you wouldn't be asking this question unless you're using a const iterator or an enumerator that would be invalidated if the underlying collection were modified.
As you said, you could set a deleted flag within the contained object. This doesn't release any memory or reduce collisions on the key, though, so it's not the best solution. It also requires adding a property to a class where it probably doesn't belong. If this bothers you as much as it would me, or if you simply can't add a flag to the stored object (perhaps because you don't control the class), you could store these flags in a separate hash table. This requires the most long-term memory use.
Push the keys of the to-be-removed items into a vector or array list while traversing the hash table. After releasing the enumerator, loop through this secondary list and remove the keys from the hash table. If you have a lot of items to remove and/or the keys are large (which they shouldn't be), this may not be the best solution.
If you're going to end up removing more items from the hash table than you're leaving in there, it may be better to create a new hash table, and as you traverse your original one, add to the new hash table only the items you're going to keep. Then replace your reference(s) to the old hash table with the new one. This saves a secondary list iteration, but it's probably only efficient if the new hash table will have significantly fewer items than the original one, and it definitely only works if you can change all the references to the original hash table, of course.
If your hash table gives you access to its collection of keys, you may be able to iterate through those and remove items from the hash table in one pass.
If your hash table or some helper in your library provides predicate-based collection modifiers, you may have a Remove() function to which you can pass a lambda expression or function pointer to identify the items to remove (examples below).
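For instance, in Python, where dicts are hash tables, the last few options look like this:

```python
table = {1: "a", 2: "b", 3: "c", 4: "d"}
doomed = lambda key, value: key % 2 == 0    # predicate: items to remove

# Collect keys while iterating, then remove after iteration finishes.
to_remove = [k for k, v in table.items() if doomed(k, v)]
for k in to_remove:
    del table[k]

# Or: build a replacement table containing only the items you keep.
table = {k: v for k, v in table.items() if not doomed(k, v)}
```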
A common technique when time is a factor is to have a second table of deleted items, and clean up the main table when you have time. Commonly used in search engines.
How about enhancing the hash table to contain pointers like a linked list?
When you insert, if the bucket is full, create a pointer from this bucket to the bucket where the new entry is stored.
Deleting something from the hash table then becomes equivalent to deleting a node from a linked list.
