How to store and delete sorted items in a file - algorithm

I am trying to store elements in a file in a sorted order.
The elements will be in the following format:
1 MessageA
2 MessageB
.
.
54 MessageM
68 MessageN
Each element has a number (timestamp) and a message (of variable size).
The elements must be sorted by timestamp.
The operations allowed are insert and delete (pop).
(Growing file size is not an issue.)
We can delete only the lowest element (i.e. delete one after another).
Currently I have implemented it as a linked list, which is very slow on inserts when the number of elements is large.
What would be the most efficient data structure to store this?

I'm not sure whether you want to delete the oldest or the newest element, but you should probably look into stacks and queues.
Stacks are First In, Last Out, meaning that the element inserted first will be popped last, as with a real stack, hence the name. Here the popped element would be the newest one.
Queues are First In, First Out. Here the deleted (dequeued) element is the oldest one still present in the queue.
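For illustration, here is a minimal in-memory Python sketch of the two behaviours (not file-backed; the (timestamp, message) pairs mirror the format in the question):

```python
from collections import deque

# Queue (FIFO): append at one end, pop from the other.
queue = deque()
queue.append((1, "MessageA"))
queue.append((2, "MessageB"))
oldest = queue.popleft()    # (1, "MessageA") -- removes the oldest entry

# Stack (LIFO): append and pop at the same end.
stack = [(1, "MessageA"), (2, "MessageB")]
newest = stack.pop()        # (2, "MessageB") -- removes the newest entry
```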

Related

Best statically allocated data structure for writing and extending contiguous blocks of data?

Here's what I want to do:
I have an arbitrary number of values of different kinds: string, int, float, bool, etc. that I need to store somehow. Multiple elements are often written and read as a whole, forming "contiguous blocks" that can also be extended and shortened at the user's wish, and even elements in the middle might be taken out. Also, the whole thing should be statically allocated.
I was thinking about using some kind of statically allocated forward list. The way I imagine this to work is defining an array of a struct containing one std::variant field and a "previous head" field, which always points to the location of the previous head of the list. A new element is always placed at the globally known "head", which it stores inside its "previous head" field. This way I can keep track of holes inside my list, because once an element is taken out, its location is written to the global head and will be filled by subsequent inserts.
This approach, however, has downsides: when a "contiguous block" is extended, further elements of other blocks may already have queued up in the list past its last element. So I either need to move all subsequent entries, or copy over the last element in the previous list and insert a link object that allows me to jump to the new location when traversing the contiguous block.
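The hole-reuse part of this idea is essentially a free list over a fixed-size slot pool. A rough Python sketch of just that part (names like SlotPool are made up for illustration; the per-block linking and the std::variant payload are left out):

```python
class SlotPool:
    """Fixed-size pool; free slots are chained through `head` (a free list)."""
    def __init__(self, capacity):
        self.values = [None] * capacity
        # Initially every slot is free: slot i links to slot i+1, the last one to -1.
        self.next_free = list(range(1, capacity)) + [-1]
        self.head = 0                      # index of the first free slot

    def alloc(self, value):
        """Store a value in the first free slot in O(1) and return its index."""
        if self.head == -1:
            raise MemoryError("pool exhausted")
        slot = self.head
        self.head = self.next_free[slot]   # advance the free list
        self.values[slot] = value
        return slot

    def free(self, slot):
        """Release a slot; it becomes the new head of the free list, O(1)."""
        self.values[slot] = None
        self.next_free[slot] = self.head
        self.head = slot
```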
The priorities for optimizing this data structure are the following (by number of use cases):
Initially write contiguous blocks
Read the whole data structure
Add new elements to contiguous blocks
Remove elements from contiguous blocks
At the moment my data structure has a time complexity of O(1) for writes, O(n) for contiguous reads (with the caveat that in the worst case there is a jump to another location inside the array every other element), O(1) for adding new elements, and O(1) for removing elements. However, the space complexity is S(2n) in the worst case (when a jump is needed every other element, every second slot is lost to a "link").
What I'm wondering now is: is the described approach the best viable way to accomplish what I'm trying to do, or is there a better data structure? Is there an official name for this data structure?

Constant time to access first and last node

How would you modify a linked-list-based queue so that the first and last nodes can be accessed in constant time, regardless of the number of nodes in the queue?
You keep two pointer/reference variables, one for the head of the queue and one for the tail. When you insert an item, you set the tail to the last inserted item, and when you remove an item, the head moves to the next item in the queue. Since you have variables for both the head and the tail, accessing them is a constant-time operation.
This is the standard way to build a queue on top of a linked list anyway; it is what you need to insert and remove items from the queue, nothing special here.
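For example, a minimal Python sketch of such a queue (class and method names are illustrative):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

class LinkedQueue:
    """Singly linked queue with head and tail pointers: O(1) access to both ends."""
    def __init__(self):
        self.head = None    # front of the queue (next item to be removed)
        self.tail = None    # back of the queue (last inserted item)

    def enqueue(self, value):
        node = Node(value)
        if self.tail is None:          # empty queue: both ends are the new node
            self.head = self.tail = node
        else:
            self.tail.next = node      # link after the current tail
            self.tail = node           # tail now points at the last inserted item

    def dequeue(self):
        if self.head is None:
            raise IndexError("dequeue from empty queue")
        node = self.head
        self.head = node.next          # head moves to the next item
        if self.head is None:          # queue became empty
            self.tail = None
        return node.value
```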

Is there such a data-structure that combines queue and hashmap?

Is there such a data structure that combines a Queue and a Hashmap?
In addition to the FIFO (enqueue/dequeue) behaviour that a queue normally has, I want:
when enqueuing, always enqueue with a key,
when peeking without a key, return the head of the queue,
when peeking with a key, return the first element enqueued with this key,
when dequeuing without a key, remove the first element ever enqueued,
when dequeuing with a key, remove all elements having that key.
I wonder if such a data structure already exists in the wild?
No, there is not. But you can combine the two to achieve the behavior you want (though you will have to make trade-offs along the way).
To do so, you will store:
A HashMap where the values are references to items in the queue: HashMap<Key, ReferenceToFIFOElement> or HashMap<Key, Set<ReferenceToFIFOElement>>.
An actual FIFO queue: FIFO<Item>
When you enqueue, you first add your element at the back of the queue. Then you update the hashmap with a reference to this newly created element if the key was not registered yet (or, in the set case, add that reference to the bucket mapped to the given key).
Peeking is easy: just look up the key and access the referenced item (the first referenced item in the set case, or the head of the queue if no key was provided).
Dequeuing is where the real tradeoff will take place:
If you only store, in the hashmap, a reference to the first item inserted with a given key, then you will have to iterate over the whole queue, starting from that item. This means an overall higher time complexity.
If you store all the references to items with a given key in the hashmap (using a set), then you will just have to iterate over that set and remove the referenced elements from the queue. This increases the space complexity of the data structure.
However, in reality it can be more complicated depending on the data structure you choose to place under the hood of the FIFO:
Array list: cache friendly, random access... but it can require reallocation as you insert/delete elements, which invalidates references, so store indices instead of actual references.
Linked list: not cache friendly but insertion and deletion are guaranteed to be O(1).
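A Python sketch of the "set of references" variant, using a doubly linked list as the FIFO so that arbitrary nodes can be unlinked in O(1) (class and method names are made up for illustration):

```python
from collections import defaultdict

class _Node:
    __slots__ = ("key", "value", "prev", "next")
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.prev = self.next = None

class KeyedQueue:
    """FIFO queue plus a key -> nodes index."""
    def __init__(self):
        self.head = self.tail = None       # doubly linked list: head = oldest
        self.by_key = defaultdict(list)    # key -> its nodes, in insertion order

    def enqueue(self, key, value):
        node = _Node(key, value)
        if self.tail is None:
            self.head = self.tail = node
        else:
            self.tail.next, node.prev = node, self.tail
            self.tail = node
        self.by_key[key].append(node)

    def peek(self, key=None):
        if key is None:
            return self.head.value if self.head else None
        nodes = self.by_key.get(key)
        return nodes[0].value if nodes else None

    def _unlink(self, node):
        if node.prev:
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next:
            node.next.prev = node.prev
        else:
            self.tail = node.prev

    def dequeue(self, key=None):
        if key is None:                    # remove the oldest element
            node = self.head
            if node is None:
                return None
            self._unlink(node)
            self.by_key[node.key].remove(node)
            if not self.by_key[node.key]:
                del self.by_key[node.key]
            return node.value
        nodes = self.by_key.pop(key, [])   # remove all elements with this key
        for node in nodes:
            self._unlink(node)
        return [node.value for node in nodes]
```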

Efficient insertion in sorted collection

I have a collection of 10 messages sorted by the number of likes each message has. Periodically I update that collection, replacing some of the old messages with new ones that got more likes in the meantime, so that the collection again contains 10 messages and is sorted by number of likes.
I have an API to insert or remove a message from the collection relative to an existing member message: insert(message_id, relative_to, above_or_bellow) and remove(message_id). I want to minimize the number of API calls by optimizing the positions where I insert new messages, so that the collection is sorted and 10 long at the end (during the process, length and order are irrelevant, only at the end).
I know I can calculate the new collection and then replace just the messages that don't match their new position, but I believe it can be further optimized and that algorithms for this already exist.
Edit:
Note the word "periodically": messages do not come one by one. Over a time interval I collect new messages, sort them, and build a new collection, which I then publish on the site via the API. So I do have two collections: one is a simple array in memory, and the other is on the site.
The idea is to reuse the already inserted messages that should be kept, and their order, in the updated collection, to save HTTP API calls. I believe there are existing algorithms I could reuse to transform the existing collection into the already known resulting collection with a minimal number of insert and remove operations.
First remove all messages that are no longer in the top 10 liked messages.
In order to get the most from the existing list, we should now look for the longest subsequence of messages that is already ordered by their likes (we can use the algorithm described in "How to determine the longest increasing subsequence using dynamic programming?", with the number of likes as the value).
We would then remove all other messages (those not in the subsequence) and insert the missing ones in their proper order.
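A rough Python sketch of that plan (function names are made up; each id in the returned insertions list would be inserted relative to its neighbour in the desired order, which is already on the site by the time it is processed):

```python
from bisect import bisect_left

def longest_increasing_subsequence(seq):
    """Indices of one longest strictly increasing subsequence (patience sorting)."""
    tails, tails_idx, prev = [], [], [-1] * len(seq)
    for i, x in enumerate(seq):
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
            tails_idx.append(i)
        else:
            tails[pos], tails_idx[pos] = x, i
        prev[i] = tails_idx[pos - 1] if pos > 0 else -1
    out = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:                      # walk the predecessor chain backwards
        out.append(i)
        i = prev[i]
    return out[::-1]

def plan_api_calls(current, desired):
    """current: message ids in their on-site order.
    desired: the new top-10 ids in target (likes) order.
    Returns (removals, kept, insertions)."""
    rank = {m: r for r, m in enumerate(desired)}
    removals = [m for m in current if m not in rank]        # step 1: drop stale messages
    remaining = [m for m in current if m in rank]
    keep_idx = set(longest_increasing_subsequence([rank[m] for m in remaining]))
    kept = [m for i, m in enumerate(remaining) if i in keep_idx]
    removals += [m for i, m in enumerate(remaining) if i not in keep_idx]  # step 2: drop out-of-order ones
    kept_set = set(kept)
    insertions = [m for m in desired if m not in kept_set]  # step 3: insert the rest
    return removals, kept, insertions
```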
I think you only need to keep one list/vector of messages and keep it sorted at all times and up to date with every new message.
Since this collection will always be sorted, and assuming it has random access, you could use binary search to find the insertion point, i.e. O(log M) where M is your maximum list size, e.g. 10. But inserting there still requires O(M) to shift the elements. Therefore, I would just use a linked list and iterate over it while the message to insert (or update) has fewer likes than the current one.
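For M = 10 the shifting cost is negligible either way; a minimal Python sketch of the sorted-array variant using bisect.insort (ascending by likes; names are illustrative and likes are assumed not to change after insertion):

```python
from bisect import insort

top = []   # (likes, message_id) tuples, kept sorted ascending by likes

def add_message(likes, message_id, max_size=10):
    insort(top, (likes, message_id))   # O(log M) to find the spot, O(M) to shift
    if len(top) > max_size:
        top.pop(0)                     # drop the least-liked message
```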

How to implement a collection that supports real-time filtering?

I want to implement a mutable sequential collection FilteredList that wraps another collection List and filters it based on a predicate.
Both the wrapped List and the exposed FilteredList are mutable and observable, and should be synchronized (so for example, if someone adds an element to List that element should appear in the correct position in FilteredList, and vice versa).
Elements that don't satisfy the predicate can still be added to FilteredList, but they will not be visible (they will still appear in the inner list).
The collections should support:
Insert(index,value) which inserts an element value at position index, pushing elements forward.
Remove(index) which removes the element at position index, moving all subsequent elements back.
Update(index, value), which updates the element at position index to be value.
I'm having trouble coming up with a good synchronization mechanism.
I don't have any strict complexity bounds, but real world efficiency is important.
The best way to avoid synchronization difficulties is to create a data structure that doesn't need them: use a single data structure to present the filtered and unfiltered data.
You should be able to do that with a modified skip list (actually, an indexable skip list), which will give you O(log n) access by index.
What you do is maintain two separate sets of forward pointers for each node, rather than just one set. One set is for the unfiltered list, as in a normal skip list, and the other set is for the filtered list.
Adding to or removing from the list is the same for the filtered and unfiltered lists. That is, you find the node at the given index by following the appropriate filtered or unfiltered links, then add or remove the node, updating both sets of link pointers.
This should be more efficient than a standard sequential list, because insertion and removal don't incur the cost of moving items up or down to make a hole or fill a gap; it's all done with references.
It takes a little more space per node, though. On average, a skip list requires two extra references per node. Since you're building what is in effect two skip lists in one, expect your nodes to require, on average, four extra references per node.
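This is not the full indexable skip list, but here is a single-level Python sketch of the "two sets of forward pointers" idea; a real version would add the skip-list levels and width counters to get O(log n) positional access (names are illustrative):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next_all = None        # link in the unfiltered chain
        self.next_visible = None    # link in the filtered chain (set only if the predicate holds)

class DualList:
    """One structure, two views: every node is threaded into the 'all' chain,
    and nodes satisfying the predicate are also threaded into the 'visible' chain."""
    def __init__(self, predicate):
        self.predicate = predicate
        self.head_all = None
        self.head_visible = None

    def push_front(self, value):
        """Insert at the front of both chains in O(1); an index-based insert would
        instead follow the appropriate chain to find the spot, as described above."""
        node = Node(value)
        node.next_all, self.head_all = self.head_all, node
        if self.predicate(value):
            node.next_visible, self.head_visible = self.head_visible, node
        return node

    def iter_all(self):
        node = self.head_all
        while node:
            yield node.value
            node = node.next_all

    def iter_visible(self):
        node = self.head_visible
        while node:
            yield node.value
            node = node.next_visible
```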
Edit after comment
If, as you say, you don't control List, then you still maintain the dual skip list I described, but the data stored in it is just the index into List. You said that List is observable, so you get notifications of all insert and delete operations, and you should be able to maintain the index by reacting to those notifications.
When somebody wants to operate on FilteredList, you use the filtered index links to find the List index of the FilteredList record the user wants to affect. Then you pass the request on to List, using the translated index. And then you react to the observable notification from List.
Basically, you're just maintaining a secondary index into List, so that you can translate FilteredList indexes into List indexes.
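To make the translation idea concrete, here is a deliberately naive Python sketch that just rebuilds a plain secondary index on every change notification; the skip-list version described above is what avoids the O(n) rebuild (names and the notification hook are assumptions):

```python
class FilteredIndex:
    """Secondary index mapping FilteredList positions to positions in the underlying list."""
    def __init__(self, underlying, predicate):
        self.underlying = underlying      # the wrapped List (here: a plain Python list)
        self.predicate = predicate
        self.visible = []                 # underlying indices of elements passing the predicate
        self.rebuild()

    def rebuild(self):
        """Call this from the List change notification."""
        self.visible = [i for i, v in enumerate(self.underlying)
                        if self.predicate(v)]

    def to_underlying(self, filtered_index):
        return self.visible[filtered_index]

    def remove(self, filtered_index):
        """Remove via the FilteredList view: translate the index, mutate List, re-index."""
        del self.underlying[self.to_underlying(filtered_index)]
        self.rebuild()
```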
