How to implement a collection that supports real-time filtering? - data-structures

I want to implement a mutable sequential collection FilteredList that wraps another collection List and filters it based on a predicate.
Both the wrapped List and the exposed FilteredList are mutable and observable, and should be synchronized (so for example, if someone adds an element to List that element should appear in the correct position in FilteredList, and vice versa).
Elements that don't satisfy the predicate can still be added to FilteredList, but they will not be visible (they will still appear in the inner list).
The collections should support:
Insert(index, value), which inserts the element value at position index, pushing subsequent elements forward.
Remove(index), which removes the element at position index, moving all subsequent elements back.
Update(index, value), which updates the element at position index to be value.
I'm having trouble coming up with a good synchronization mechanism.
I don't have any strict complexity bounds, but real world efficiency is important.

The best way to avoid synchronization difficulties is to create a data structure that doesn't need them: use a single data structure to present the filtered and unfiltered data.
You should be able to do that with a modified skip list (actually, an indexable skip list), which will give you O(log n) access by index.
What you do is maintain two separate sets of forward pointers for each node rather than just one. One set is for the unfiltered list, as in a normal skip list; the other set is for the filtered list.
Adding to or removing from the list works the same way for the filtered and unfiltered views: you find the node at the given index by following the appropriate filtered or unfiltered links, then add or remove the node, updating both sets of link pointers.
This should be more efficient than a standard sequential list, because insertion and removal don't incur the cost of moving items up or down to make a hole or fill a gap; it's all done with references.
It takes a little more space per node, though. On average, a skip list requires two extra references per node. Since you're building what is in effect two skip lists in one, expect your nodes to require four extra references each, on average.
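To make this concrete, here is a minimal sketch of what such a node might look like in C++. The names and layout are illustrative, not a full implementation; the "width" counters are what make the skip list indexable:

    #include <cstddef>
    #include <vector>

    struct Node {
        int value;                               // the stored element
        bool visible;                            // does value satisfy the predicate?
        std::vector<Node*> forwardAll;           // one link per level: unfiltered list
        std::vector<std::size_t> widthAll;       // nodes skipped by each link
        std::vector<Node*> forwardFiltered;      // one link per level: filtered list
        std::vector<std::size_t> widthFiltered;  // visible nodes skipped by each link
    };

On insert you splice the node into forwardAll at every level as usual; one way to handle hidden elements is to splice a node into forwardFiltered only while it is visible, with a hidden node merely incrementing the widths of the filtered links that span it.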
Edit after comment
If, as you say, you don't control List, then you can still maintain the dual skip list I described, but the data stored in the skip list is just an index into List. You said that List is observable, so you get notification of all insert and delete operations, and you should be able to maintain the index by reacting to those notifications.
When somebody wants to operate on FilteredList, you use the filtered index links to find the List index of the FilteredList record the user wanted to affect. Then you pass the request on to List, using the translated index, and react to the observable notification from List.
Basically, you're just maintaining a secondary index into List, so that you can translate FilteredList indexes into List indexes.
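As a sketch of that translation step, here is the standard indexable-skip-list rank search in C++, reusing the node layout sketched earlier but extended with a hypothetical listIndex field (kept up to date via the observable notifications):

    #include <cstddef>
    #include <vector>

    struct Node {  // the earlier sketch, plus the stored List index
        std::vector<Node*> forwardFiltered;
        std::vector<std::size_t> widthFiltered;
        std::size_t listIndex;
    };

    // Translate a FilteredList index into a List index by walking only
    // the filtered links.
    std::size_t ToListIndex(Node* head, std::size_t filteredIndex, int topLevel) {
        Node* node = head;
        std::size_t remaining = filteredIndex + 1;  // 1-based rank to travel
        for (int level = topLevel; level >= 0; --level) {
            while (node->forwardFiltered[level] != nullptr &&
                   node->widthFiltered[level] <= remaining) {
                remaining -= node->widthFiltered[level];
                node = node->forwardFiltered[level];
            }
        }
        return node->listIndex;  // this entry's position in the wrapped List
    }

The Insert/Remove/Update calls on FilteredList then become: translate the index, forward the call to List, and patch the skip list when List's change notification arrives.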

Related

Best statically allocated data structure for writing and extending contiguous blocks of data?

Here's what I want to do:
I have an arbitrary number of values of different kinds: string, int, float, bool, etc. that I need to store somehow. Multiple elements are often written and read as a whole, forming "contiguous blocks" that can also be extended and shortened at the user's wish, and even elements in the middle might be taken out. Also, the whole thing should be statically allocated.
I was thinking about using some kind of statically allocated forward list. The way I imagine this working is to define an array of a struct containing one std::variant field and a "previous head" field, which always points to the location of the previous head of the list. A new element is always placed at the globally known "head" location, and stores the previous head inside its "previous head" field. This way I can keep track of holes inside my list, because once an element is taken out, its location is written to the global head and will be filled by subsequent inserts.
This approach has downsides, however: when a "contiguous block" is extended, it may be that further elements of other blocks have already queued up in the list past its last element. So I either need to move all subsequent entries, or copy over the last element of the block and insert a link object that allows me to jump to the new location when traversing the contiguous block.
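For reference, here is a minimal C++ sketch of the pool being described; the capacity, names, and value types are assumptions, and only the hole-tracking (free-list) mechanics are shown, not the block-extension handling:

    #include <array>
    #include <cstddef>
    #include <string>
    #include <variant>

    constexpr std::size_t Capacity = 1024;  // assumed static size

    struct Slot {
        std::variant<std::monostate, std::string, int, float, bool> value;
        std::size_t previousHead;  // the "previous head" link from the question
    };

    std::array<Slot, Capacity> pool;  // statically allocated storage
    std::size_t head = 0;             // next free slot; initialize the pool so that
                                      // pool[i].previousHead == i + 1 chains all slots

    std::size_t takeSlot() {             // O(1): where the next element goes
        std::size_t slot = head;
        head = pool[slot].previousHead;  // pop the free-slot stack
        return slot;
    }

    void freeSlot(std::size_t slot) {    // O(1): element taken out of the middle
        pool[slot].previousHead = head;  // push the hole onto the free-slot stack
        head = slot;
    }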
The priorities for optimizing this data structure are as follows (by number of use cases):
Initially write contiguous blocks
Read the whole data structure
Add new elements to contiguous blocks
Remove elements from contiguous blocks
At the moment my data structure has time complexity O(1) for writes, O(n) for contiguous reads (with the caveat that, in the worst case, there is a jump to another location inside the array every other element), O(1) for adding new elements, and O(1) for removing elements. However, space usage is 2n in the worst case (when I have to jump every second element, every other slot is lost to a "link").
What I'm wondering now is: is the described approach the best viable way to accomplish what I'm trying to do, or is there a better data structure? Is there an established name for this data structure?

Best Data Structure to implement an Excel Spreadsheet

How can we implement an Excel-like spreadsheet that supports creation and deletion of rows, creation and deletion of cells, and can also modify the data inside any cell?
I was looking for the best data structure to implement this.
The problem statement is a little vague, in my opinion. We don't have any information about which operations will be very frequent, or even about the amount of data this data structure is going to hold.
So let's assume there can be a fair amount of data, and that the operations are addition and deletion of rows and cells.
For an Excel-like spreadsheet, if I had to implement it with a custom data structure, I would make each row a node of a linked list. This is helpful because, as opposed to an (n-dimensional) array, the memory can be assigned in a non-contiguous manner, which also makes adding and deleting rows much easier.
Inside each node, we can have an array of strings to hold the cell values and an Id field to hold the Id of the row.
The head node of the structure will have the column names as the values of its string array, so each column is mapped to an index of the array.
To add a row: it's an insert into the linked list. Make a new row node and append it at the end.
To delete a row: same as deleting a node from a linked list.
To add/update a cell value: you know the row Id and the column name, so you can look up the column's index in the array from the head node. Once you have the node corresponding to the row, access that index of its string array to add/read/update/delete the cell's value.
To optimize node access, you can keep an index on the linked list to locate a node by row Id quickly: store a row-Id-to-node-pointer mapping in an auxiliary map or array, so that inserting rows in between is also fast.
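A rough C++ sketch of that layout (the field names are illustrative):

    #include <list>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Row {
        int id;                          // the row Id
        std::vector<std::string> cells;  // one string per column
    };

    std::list<Row> sheet;  // the head node's cells hold the column names

    // The auxiliary row-Id-to-node index mentioned above, for fast lookup
    // and for fast insertion next to a known row.
    std::unordered_map<int, std::list<Row>::iterator> rowIndex;

Accessing a cell is then rowIndex[id]->cells[columnIndex], where columnIndex comes from searching the head node's cells for the column name.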
However, I would reiterate that the implementation should be chosen on a use-case basis. If there are heavy column addition/deletion operations, for example, this design will be quite slow. Each kind of use case comes with its own trade-offs.
I think the easy way to go ahead with this is to simply use a JSON structure to hold each row, with column names as keys and the cell values as values (e.g. {"Name": "Ada", "Score": "42"}). This handles null/empty values quite easily.
A spreadsheet is essentially similar to a table: changes can be made to any cell at any row. Hence going with a simple list structure would not be too bad. The downside is that deletion and insertion of in-between rows is not performant; but insertion of rows at the end, which is the most common use case, and modification of cells are quite easy.
To facilitate faster insertion and deletion, a linked-list structure would help, but it affects random access adversely, so a simple list of JSON objects is the better choice.

Heap-like data structure with fast random access?

My situation is the following:
I have a collection of entities, each of which has a "goodness" property.
I wish to grab the entities one at a time, from "best" to "worst."
After a "best" entity is grabbed, the "goodness" properties of several (relatively few) of my other entities change, and this change must be incorporated into my upcoming decision of the next "best" entity to grab.
Some (relatively few) entities may become "worthless" after a grab, and these should be removed from my collection.
It is easy for me to construct, given the entity that I just grabbed, the set of now-"dirty" objects, that is, the set of entities which potentially have a now-different "goodness," or have become "worthless."
So, I need a data structure that allows me to:
Quickly grab the "biggest" of a collection (as in, a max-heap).
Quickly update the underlying ordering of the objects in my collection to accommodate the situation described above. (Easy to do in a heap, if we can access the dirty objects' locations, e.g. array indices, within the underlying heap implementation.)
There is a guarantee that there are no collisions among the entries of my collection. (The entries are references to the entities I described above.)
The idea I have is to use a max-heap together with an unordered map, keyed on the heap entries, and having values equal to, e.g., the objects' respective indices in the underlying array in the heap implementation.
What I'm wondering is whether there may be a data structure which is better for this situation.
If few members are affected when the best entity is grabbed, then you might be able to improve the runtime by using a linked list and an unordered map (each with the original set of entities), and a max heap. After removing the best entity from the end of the linked list you'll use the map to locate the affected entities, removing them from the list and adding the non-worthless entities to the max heap. Thereafter, the next best entity is the greater of the entity at the end of the list or the max entity in the heap. The advantage of this setup is that removal from the linked list is a constant time operation, and insertion into the max heap will be a relatively small (compared to the total number of entities) log time operation.
Because entities' values can only get worse, you can lazily remove them from the linked list - if the item is worthless then remove it, and if its value has changed then flag it as "changed." Check the "changed" flag on the entity at the end of the linked list, and if it's "true" then remove the entity and add it to the max-heap. The advantage of lazy updates is that you usually won't need to update items that are in the heap (you'll just need to update the value of items in the linked list), and if an item is changed and then later made worthless then you can remove it from the linked list without ever having to add it to the heap.
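For comparison, here is a compact C++ sketch of the question's own idea: a max-heap plus an unordered map from entity to heap slot, so a dirty entity can be re-sifted in O(log n). The integer entity ids and the goodness callback are placeholders:

    #include <functional>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct IndexedMaxHeap {
        std::vector<int> heap;                      // entity ids in heap order
        std::unordered_map<int, std::size_t> pos;   // entity id -> slot in heap
        std::function<double(int)> goodness;        // assumed scoring callback

        void swapAt(std::size_t a, std::size_t b) {
            std::swap(heap[a], heap[b]);
            pos[heap[a]] = a;
            pos[heap[b]] = b;
        }
        void siftUp(std::size_t i) {
            while (i > 0 && goodness(heap[(i - 1) / 2]) < goodness(heap[i])) {
                swapAt(i, (i - 1) / 2);
                i = (i - 1) / 2;
            }
        }
        void siftDown(std::size_t i) {
            for (;;) {
                std::size_t best = i, l = 2 * i + 1, r = 2 * i + 2;
                if (l < heap.size() && goodness(heap[l]) > goodness(heap[best])) best = l;
                if (r < heap.size() && goodness(heap[r]) > goodness(heap[best])) best = r;
                if (best == i) return;
                swapAt(i, best);
                i = best;
            }
        }
        // Re-establish the heap property for one dirty entity in O(log n).
        void update(int id) { siftUp(pos[id]); siftDown(pos[id]); }
    };

The linked-list variant described above adds a lazy path on top of this: most dirty entities never need to enter the heap at all.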

Which Data Structure should I choose?

I am thinking of listing the top scores for a game. More than 300,000 players are to be listed in order of their top score. Players can update their high score by typing in their name and their new top score. Only 10 scores show up at a time, and the user can type in which place they want to start with: if they type "100100", the whole list should refresh and show them the 100,100th through the 100,109th scores. So what data structure should I use in this case? I am thinking of using a hash table with users' names as keys; it would take constant time to update their scores. But what if a user's previous score is in 100,100th place, and after updating, their score becomes the highest in the whole list? With a hash table that would take linear time, since I'd need to compare each score in the list to make sure it's the highest one. Is there a better data structure to choose besides a hash table?
You should choose the data structure that is optimized for the most common operation. By your description of an ordered list, the most common operation will probably be viewing the list (and jumping around in it).
If you use a hashtable with the user's names as keys, then it will be very expensive to display the list ordered by score, and very expensive to compute different views when viewers skip around in the list.
Instead, using a simple list sorted by score will make all of the "view" operations very cheap and very easy to implement. When a user updates their score, simply do a linear (O(n)) search for the user by name and remove their old entry. Then, since the list is sorted, you can search it in O(log n) time to find where to re-insert their new entry in the list.
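A sketch of that approach in C++ (the record layout is an assumption):

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Entry { int score; std::string name; };

    std::vector<Entry> board;  // kept sorted, highest score first

    void updateScore(const std::string& name, int newScore) {
        // O(n): find and remove the player's old entry, if present.
        auto old = std::find_if(board.begin(), board.end(),
                                [&](const Entry& e) { return e.name == name; });
        if (old != board.end()) board.erase(old);
        // O(log n) to find the insertion point (plus the O(n) shift on insert).
        auto at = std::lower_bound(board.begin(), board.end(), newScore,
                                   [](const Entry& e, int s) { return e.score > s; });
        board.insert(at, Entry{newScore, name});
    }

Showing scores 100,100 through 100,109 is then just reading board[100099] through board[100108].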
Use a map (ordered tree) based container keyed by score, plus a hash keyed by name. Let the values be links to your entities, stored in a list or array, etc. That is, store the data however you like and build indices for each kind of access you need to be fast.
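One way that combination might look in C++ (the names are illustrative):

    #include <functional>
    #include <map>
    #include <string>
    #include <unordered_map>

    using ScoreMap = std::multimap<int, std::string, std::greater<int>>;

    ScoreMap byScore;                                             // ordered by score, highest first
    std::unordered_map<std::string, ScoreMap::iterator> byName;   // name -> entry in byScore

    void setScore(const std::string& name, int newScore) {
        auto it = byName.find(name);
        if (it != byName.end()) byScore.erase(it->second);  // drop the old entry
        byName[name] = byScore.emplace(newScore, name);     // O(log n) re-insert
    }

Note that jumping to an arbitrary rank in a plain std::multimap is still O(n); an order-statistic tree would be needed to make that view operation O(log n) as well.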

Sorting application difficulty

Currently I am reading a book on algorithms and found this usage of sorting.
Reconstructing the original order - How can we restore the original arrangement of a set of items after we permute them for some application? Add an extra field to the data record for the item, such that the i-th record sets this field to i. Carry this field along whenever you move the record, and later sort on it when you want the initial order back.
I've been trying hard to understand what it means, and I've failed miserably. Please, can somebody help?
Suppose you have a list of items in random order:
itemC, itemB, itemA, itemD
you sort them:
itemA, itemB, itemC, itemD
and you didn't have enough memory to store them in a separate location, so the original sequence is lost. Moreover, the original order is random, so it would be problematic or impossible to restore.
The quoted passage gives a solution to this problem.
Add an extra field to the data record for the item, such that i-th record sets this field to i
So, we add an extra field for each of the items:
(itemC,1), (itemB,2), (itemA,3), (itemD, 4)
And after sort we have:
(itemA,3), (itemB,2), (itemC,1), (itemD, 4)
So we can easily restore the initial order by sorting on the additional field.
Let's say you have the data in an array, because that's the simplest structure to use as an example.
So, your node (i.e., element of the array) may look like this:

    SomeDataType data;

The algorithm is suggesting you add an integer field, so it looks like this:

    SomeDataType data;
    int position;

And then you fill the positions with the actual indices. Something like this:

    for (int current = 0; current <= lastElement; ++current)
        array[current].position = current;

(written here in C-style syntax, but it reads much the same in any language)
After doing that, you shuffle it (re-sort it) for whatever you need to.
When you want to restore the original ordering, all you need to do is sort by the position field.
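Putting the whole trick together in a runnable C++ snippet (the shuffle here stands in for whatever permutation the application performs):

    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <vector>

    struct Record { char data; int position; };

    int main() {
        std::vector<Record> v{{'C', 0}, {'B', 0}, {'A', 0}, {'D', 0}};
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i].position = static_cast<int>(i);           // remember the original order
        std::shuffle(v.begin(), v.end(), std::mt19937{});  // permute for the application
        std::sort(v.begin(), v.end(),
                  [](const Record& a, const Record& b) { return a.position < b.position; });
        // v is now back in its original order: C, B, A, D.
    }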
Well, basically it's saying that you need some way to keep track of the original order (which is destroyed by the permutation). One option would be to simply invert the permutation (check out Steve Jessop's informative answer here).
Another option to invert the permutation requires fewer processing steps but more memory. More specifically, each node in your input set gets an extra ID field, and all the elements in the input set start out sorted on this field. Once you apply the permutation, the IDs are obviously no longer in sorted order. If you wish to invert the permutation, all you have to do is sort the list again on this field.
