Finding proper data structure c++ - data-structures

I was looking for some simple implemented data structure which gets my needs fulfilled in least possible time (in worst possible case) :-
(1)To pop nth element (I have to keep relative order of elements intact)
(2)To access nth element .
I couldn't use array because it can't pop and i dont want to have a gap after deleting ith element . I tried to remove the gap , by exchanging nth element with next again with next untill last but that proves time ineffecient though array's O(1) is unbeatable .
I tried using vector and used 'erase' for popup and '.at()' for access , but even this is not cheap for time effeciency though its better than array .

What you can try is skip list - it support the operation you are requesting in O(log(n)). Another option would be tiered vector that is just slightly easier to implement and takes O(sqrt(n)). both structures are quite cool but alas not very popular.

Well , tiered vector implemented on array would i think best fit your purpose . Though the tiered vector concept may be knew and little tricky to understand at first but then once you get it , it opens lot of question and you get a handy weapon to tackle many question's data structure part very effeciently . So it is recommended that you master tiered vectors implementation.

An array will give you O(1) lookup but O(n) delete of the element.
A list will give you O(n) lookup bug O(1) delete of the element.
A binary search tree will give you O(log n) lookup with O(1) delete of the element. But it doesn't preserve the relative order.
A binary search tree used in conjunction with the list will give you the best of both worlds. Insert a node into both the list (to preserve order) and the tree (fast lookup). Delete will be O(1).
struct node {
node* list_next;
node* list_prev;
node* tree_right;
node* tree_left;
// node data;
};
Note that if the nodes are inserted into the tree using the index as the sort value, you will end up with another linked list pretending to be a tree. The tree can be balanced however in O(n) time once it is built which you would only have to incur once.
Update
Thinking about this more this might not be the best approach for you. I'm used to doing lookups on the data itself not its relative position in a set. This is a data centric approach. Using the index as the sort value will break as soon as you remove a node since the "higher" indices will need to change.

Warning: Don't take this answer seriously.
In theory, you can do both in O(1). Assuming this are the only operations you want to optimize for. The following solution will need lots of space (and it will leak space), and it will take long to create the data structure:
Use an array. In every entry of the array, point to another array which is the same, but with that entry removed.

Related

Binary Search-accessing the middle element drawback

I am studying from my course book on Data Structures by Seymour Lipschutz and I have come across a point I don’t fully understand..
Binary Search Algorithm assumes that one has direct access to middle element in the list. This means that the list must be stored in some typeof linear array.
I read this and also recognised that in Python you can have access to the middle element at all times. Then the book goes onto say:
Unfortunately, inserting an element in an array requires elements to be moved down the list, and deleting an element from an array requires element to be moved up the list.
How is this a Drawback ?
Won’t we still be able to access the middle element by dividing the length of array by 2?
In the case where the array will not be modified, the cost of insertion and deletion are not relevant.
However, if an array is to be used to maintain a sorted set of non-fixed items, then insertion and deletion costs are relevant. In this case, binary search can be used to find items (possibly for deletion) and/or find where new items should be inserted. The drawback is that insertion and deletion require movement of other elements.
Python's bisect module provides binary search functionality that can be used for locating insertion points for maintaining sorted order. The drawback mentioned applies.
In some cases, a binary search tree may be a preferable alternative to a sorted array for maintaining a sorted set of non-fixed items.
It seems that author compares array-like structures and linked list
The first (array, Python and Java list, C++ vector) allows fast and simple access to any element by index, but appending, inserting or deletion might cause memory redistribution.
For the second we cannot address i-th element directly, we need to traverse list from the beginning, but when we have element - we can insert or delete quickly.

Binary Search Trees when compared to sorted linked lists

if we have our elements in a sorted circular double linked list the order of operations (insert delete Max Min successor predecessor) are the same or even better than the binary search tree . so why we use them ?
is it because data structure authors want to familiarize reader with general concept of tree as a data structure with some simple examples ?
i have read some same questions but the questions were (inconsiderately !) asked with arrays instead of linked lists and answers were not useful for linked lists! since most of them addressed the problem of shifting the elements in the array for insertion.
A linked list is not "addressable", in a sense that, if you want to access the element in the middle of the list, for example to do binary search, you will have to walk the list. That is to say the performance of list.get(index) is O(n). If you back it up with any data structure that gives you O(1) performance, it will be an array in the end. Then we will be back to the problem of allocating extra space and shifting the elements, which is not as efficient as binary search trees.
Actually the binary search in double circular linked list cannot be done since in a binary search we need the middle element but we cannot access the middle element in a linked list unless we pay theta(n/2) and so on ( half of the first half (1/4) or half of the second half (3/4) ) .
but the idea of making a binary search tree stems from this idea.we almost keep the middle of each part of our data to use it for searching and other ends .

Efficiently Filtering out Sorted Data With A Second Predicate

Lets say that I have a list (or a hashmap etc., whatever makes this the fastest) of objects that contain the following fields: name, time added, and time removed. The list given to me is already sorted by time removed. Now, given a time T, I want to filter (remove from the list) out all objects of the list where:
the time T is greater than an object's time removed OR T is less than an object's time added.
So after processing, the list should only contain objects where T falls in the range specified by time added and time removed.
I know I can do this easily in O(n) time by going through each individual object, but I was wondering if there was a more efficient way considering the list was already sorted by the first predicate (time removed).
*Also, I know I can easily remove all objects with time removed less than T because the list is presorted (possibly in O(log n) time since I do a binary search to find the first element that is less than and then remove the first part of the list up to that object).
(Irrelevant additional info: I will be using C++ for any code that I write)
Unfortunately you are stuck with a O(n) being your fastest option. That is unless their are hidden requirements about the difference between time added and time removed (such as a max time span) that can be exploited.
As you said you can start the search where the time removed equals (or is the first greater than) the time removed. Unfortunately you'll need to go through the rest of the list to see if time added is less than your time.
Because a comparative sort is at best O(n*log(n)) you cannot sort the objects again to improve your performance.
One thing, based on the heuristics of the application it may be beneficial to receive the data in order of date added but that is between you and wherever you get the data from.
Let's examine the data structures you offered:
A list (usually implemented as a linked list, or a dynamic array), or a hash map.
Linked List: Cannot do binary search, finding first occurance of an
element (even if list is sorted) is done in O(n), so no benefit
from the fact the data is sorted.
Dynamic Array: Removing a single element (or more) from arbitrary location requires shifting all the following elements to the left, and thus is O(n). You cannot remove elements from the list better than O(n), so no gain here from the fact the DS is sorted.
HashMap: is unsorted by definition. Also, removing k elements is O(k), no way to go around this.
So, you cannot even improve performance from O(n) to O(logn) for the same field the list was sorted by.
Some data structures such as B+ trees do allow efficient range queries, and you can pretty efficiently [O(logn)] remove a range of elements from the tree.
However, it does not help you to filter the data of the 2nd field, which the tree is unsorted by, and to filter according to it (unless there is some correlation you can exploit) - will still need O(n) time.
If all you are going to do is to later on iterate the new list, you can push the evaluation to the iteration step, but there won't be any real benefit from it - only delaying the processing to when it's needed, and avoiding it, if it is not needed.

Sorting a binary search tree on different key value

Say I have a binary tree with the following definition for a node.
struct node
{
int key1 ;
int key2 ;
}
The binary search tree is created on the basis of key1. Now is it possible to rearrange the binary search tree on basis of key2 in O(1) space. Although I can do this in variable space using an array of pointers to nodes.
The actual problem where I require this is "counting number of occurrences of unique words in a file and displaying the result in decreasing order of frequency."
Here, a BST node is
{
char *word;
int freq ;
}
The BST is first created on basis of alphabetic order of words and finally I want it on basis of freq.
Am I wrong at choice of data structure i.e a BST?
I think you can create a new tree sorted by freq and push there all elements popping them from an old tree.
That could be O(1) though likely more like O(log N) which isn't big anyway.
Also, I don't know how you call it in C#, but in Python you can use list but sort it by two different keys in-place.
Map, BST are good if you need to have sorted output for your dictionnary.
And it is good if you need to mix up add, remove and lookup operations.
I don't think this is your need here. You load the dictionnary, sort it, then do only look up in it, that's right ?
In this case a sorted array is probably a better container. (See Item 23 from Effective STL from Scott Meyer).
(Update: simply consider that a map could generate more memory cache misses than a sorted array, as an array get its data contiguous in memory, and as each node in a map contain 2 pointers to other nodes in the map. When your objects are simple and take not much space in memory, a sorted vector is probable a better option. I warmly recommand you to read that item from Meyer's book)
About the kind of sort you are talking about, you will need that algorithm from the stl:
stable_sort.
The idea is to sort the dictionnary, then sort with stable_sort() on the frequence key.
It will give something like that (not tested actually, but you got the idea):
struct Node
{
char * word;
int key;
};
bool operator < (const Node& l, const Node& r)
{
return std::string(l.word) < std::string(r.word));
}
bool freq_comp(const Node& l, const Node& r)
{
return l.key < r.key;
}
std::vector<node> my_vector;
... // loading elements
sort(vector.begin(), vector.end());
stable_sort(vector.begin(), vector.end(), freq_comp);
Using a HashTable (Java) or Dictionary (.NET) or equivalent data structure in your language of choice (hash_set or hash_map in STL) will give you O(1) inserts during the counting phase, unlike the binary search tree which would be somewhere from O(log n) to O(n) on insert depending on whether it balances itself. If performance is really that important just make sure you try to initialize your HashTable to a large enough size that it won't need to resize itself dynamically, which can be expensive.
As for listing by frequency, I can't immediately think of a tricky way to do that without involving a sort, which would be O(n log n).
Here is my suggestion for re-balancing the tree based off of the new keys (well, I have 2 suggestions).
The first and more direct one is to somehow adapt Heapsort's "bubble-up" function (to use Sedgewick's name for it). Here is a link to wikipedia, there they call it "sift-up". It is not designed for an entirely-unbalanced tree (which is what you'd need), but I believe it demonstrates the basic flow of an in-place reordering of a tree. It may be a bit hard to follow because the tree is in fact stored in array rather than a tree (though the logic in a sense treats it as a tree) --- perhaps, though, you'll find such an array-based representation is best! Who knows.
The more crazy-out-there suggestion of mine is to use a splay tree. I think they're nifty, and here's the wiki link. Basically, whichever element you access is "bubbled up" to the top, but it maintains the BST invariants. So you maintain the original Key1 for building the initial tree, but hopefully most of the "higher-frequency" values will also be near the top. This may not be enough (as all it will mean is that higher-frequency words will be "near" the top of the tree, not necessarily ordered in any fashion), but if you do happen to have or find or make a tree-balancing algorithm, it may run a lot faster on such a splay tree.
Hope this helps! And thank you for an interesting riddle, this sounds like a good Haskell project to me..... :)
You can easily do this in O(1) space, but not in O(1) time ;-)
Even though re-arranging a whole tree recursively until it is sorted again seems possible, it is probably not very fast - it may be O(n) at best, probably worse in practice. So you might get a better result by adding all nodes to an array once you are done with the tree and just sorting this array using quicksort on frequency (which will be O(log n) on average). At least that's what I would do. Even tough it takes extra space it sounds more promising to me than re-arranging the tree in place.
One approach you could consider is to build two trees. One indexed by word, one indexed by freq.
As long as the tree nodes contain a pointer to the data node, you could access if via the word-based tree to update the info, but later access it by the freq-based tree to output.
Although, if speed is really that important, I'd be looking to get rid of the string as a key. String comparisons are notoriously slow.
If speed is not important, I think your best bet is to gather the data based on word and re-sort based on freq as yves has suggested.

What sort of sorted datastructure is optimized for finding items within a range?

Say I have a bunch of objects with dates and I regularly want to find all the objects that fall between two arbitrary dates. What sort of datastructure would be good for this?
A binary search tree sounds like what you're looking for.
You can use it to find all the objects in O(log(N) + K), where N is the total number of objects and K is the number of objects that are actually in that range. (provided that it's balanced). Insertion/removal is O(log(N)).
Most languages have a built-in implementation of this.
C++:
http://www.cplusplus.com/reference/stl/set/
Java:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
You can find the lower bound of the range (in log(n)) and then iterate from there until you reach the upper bound.
Assuming you mean by date when you say sorted, an array will do it.
Do a binary search to find the index that's >= the start date. You can then either do another search to find the index that's <= the end date leaving you with an offset & count of items, or if you're going to process them anyway just iterate though the list until you exceed the end date.
It's hard to give a good answer without a little more detail.
What kind of performance do you need?
If linear is fine then I would just use a list of dates and iterate through the list collecting all dates that fall within the range. As Andrew Grant suggested.
Do you have duplicates in the list?
If you need to have repeated dates in your collection then most implementations of a binary tree would probably be out. Something like Java's TreeSet are set implementations and don't allow repeated elements.
What are the access characteristics? Lots of lookups with few updates, vice-versa, or fairly even?
Most datastructures have trade-offs between lookups and updates. If you're doing lots of updates then some datastructure that are optimized for lookups won't be so great.
So what are the access characteristics of the data structure, what kind of performance do you need, and what are structural characteristics that it must support (e.g. must allow repeated elements)?
If you need to make random-access modifications: a tree, as in v3's answer. Find the bottom of the range by lookup, then count upwards. Inserting or deleting a node is O(log N). stbuton makes a good point that if you want to allow duplicates (as seems plausible for datestamped events), then you don't want a tree-based set.
If you do not need to make random-access modifications: a sorted array (or vector or whatever). Find the location of the start of the range by binary chop, then count upwards. Inserting or deleting is O(N) in the middle. Duplicates are easy.
Algorithmic performance of lookups is the same in both cases, O(M + log N), where M is the size of the range. But the array uses less memory per entry, and might be faster to count through the range, because after the binary chop it's just forward sequential memory access rather than following pointers.
In both cases you can arrange for insertion at the end to be (amortised) O(1). For the tree, keep a record of the end element at the head, and you get an O(1) bound. For the array, grow it exponentially and you get amortised O(1). This is useful if the changes you make are always or almost-always "add a new event with the current time", since time is (you'd hope) a non-decreasing quantity. If you're using system time then of course you'd have to check, to avoid accidents when the clock resets backwards.
Alternative answer: an SQL table, and let the database optimise how it wants. And Google's BigTable structure is specifically designed to make queries fast, by ensuring that the result of any query is always a consecutive sequence from a pre-prepared index :-)
You want a structure that keeps your objects sorted by date, whenever you insert or remove a new one, and where finding the boundary for the segment of all objects later than or earlier than a given date is easy.
A heap seems the perfect candidate. In practical applications, heaps are simply represented by an array, where all the objects are stored in order. Seeing that sorted array as a heap is simply a way to make insertions of new objects and deletions happen in the right place, and in O(log(n)).
When you have to find all the objects between date A (excluded) and B (included), find the position of A (or the insert position, that is, the position of the earlier element later than A), and the position of B (or the insert position of B), and return all the objects between those positions (which is simply the section between those positions in the array/heap)

Resources