Data Structure for Ascending Order Key Value Pairs with Further Insertion - data-structures

I am implementing a table in which each entry consists of two integers. The entries must be ordered in ascending order by key (according to the first integer of each set). All elements will be added to the table as the program is running and must be put in the appropriate slot. Time complexity is of utmost importance and I will only use the insert, remove, and iterate functions.
Which Java data structure is ideal for this implementation?
I was thinking LinkedHashMap, as it maps keys to values (each entry in my table is two values). It also provides O(1) insert/remove functionality. However, it is not sorted. If entries can be efficiently inserted in appropriate order as they come in, this is not a bad idea as the data structure would be sorted. But I have not read or thought of an efficient way to do this. (Maybe like a comparator)?
TreeMap has a time complexity of log(n) for both add and remove. It maintains sorted order and has an iterator. But can we do better than than log(n)?
LinkedList has O(1) add/remove. I could insert with a loop, but this seems inefficient as well.
It seems like TreeMap is the way to go. But I am not sure.
Any thoughts on the ideal data structure for this program are much appreciated. If I have missed an obvious answer, please let me know.
(It can be a data structure with a Set interface, as there will not be duplicates.)

A key-value pair suggests for a Map. As you need key based ordering it narrows down to a SortedMap, in your case a TreeMap. As far as keeping sorting elements in a data structure, it can't get better than O(logn). Look no further.

The basic idea is that you need to insert the key at a proper place. For that your code needs to search for that "proper place". Now, for searching like that, you cannot perform better than a binary search, which is log(n), which is why I don't think you can perform an insert better than log(n).
Hence, again, a TreeMap would be that I would advise you to use.
Moreover, if the hash values, that you state, (specially because there are no duplicates) can be enumerated (as in integer number, serial numbers or so), you could try using statically allocated arrays for doing that. Then you might get a complexity of O(1) perhaps!

Related

Data Structure for tuple indexing

I need a data structure that stores tuples and would allow me to do a query like: given tuple (x,y,z) of integers, find the next one (an upped bound for it). By that I mean considering the natural ordering (a,b,c)<=(d,e,f) <=> a<=d and b<=e and c<=f. I have tried MSD radix sort, which splits items into buckets and sorts them (and does this recursively for all positions in the tuples). Does anybody have any other suggestion? Ideally I would like the abouve query to happen within O(log n) where n is the number of tuples.
Two options.
Use binary search on a sorted array. If you build the keys ( assuming 32bit int)' with (a<<64)|(b<<32)|c and hold them in a simple array, packed one beside the other, you can use binary search to locate the value you are searching for ( if using C, there is even a library function to do this), and the next one is simply one position along. Worst case Performance is O(logN), and if you can do http://en.wikipedia.org/wiki/Interpolation_search then you might even approach O(log log N)
Problem with binary keys is might be tricky to add new values, might need gyrations if you will exceed available memory. But it is fast, only a few random memory accesses on average.
Alternatively, you could build a hash table by generating a key with a|b|c in some form, and then have the hash data pointing to a structure that contains the next value, whatever that might be. Possibly a little harder to create in the first place as when generating the table you need to know the next value already.
Problems with hash approach are it will likely use more memory than binary search method, performance is great if you don't get hash collisions, but then starts to drop off, although there a variations around this algorithm to help in some cases. Hash approach is possibly much easier to insert new values.
I also see you had a similar question along these lines, so I guess the guts of what I am saying is combine A,b,c to produce a single long key, and use that with binary search, hash or even b-tree. If the length of the key is your problem (what language), could you treat it as a string?
If this answer is completely off base, let me know and I will see if I can delete this answer, so you questions remains unanswered rather than a useless answer.

How to remove duplicates from a file?

How to remove duplicates from a large file of large numbers ? This is an interview question about algorithms and data structures rather than sort -u and stuff like that.
I assume there that the file does not fit in memory and the numbers range is large enough so I cannot use in-memory count/bucket sort.
The only option is see is to sort the file (e.g. merge sort) and pass the sorted file again to filter out duplicates.
Does it make sense. Are there other options?
You won't even need separate pass over sorted data if you use a duplicates-removing variant of "merge" (a.k.a. "union") in your mergesort. Hash table should be empty-ish to perform well, i.e. be even bigger than the file itself - and we're told that the file itself is big.
Look up multi-way merge (e.g. here) and external sorting.
Yes, the solution makes sense.
An alternative is build a file-system-based hash table, and maintain it as a set. First iterate on all elements and insert them to your set, and later - in a second iteration, print all elements in the set.
It is implementation and data dependent which will perform better, in terms of big-O complexity, the hash offers O(n) time average case and O(n^2) worst case, while the merge sort option offers more stable O(nlogn) solution.
Mergesort or Timsort (which is an improved mergesort) is a good idea. EG: http://stromberg.dnsalias.org/~strombrg/sort-comparison/
You might also be able to get some mileage out of a bloom filter. It's a probabilistic datastructure that has low memory requirements. You can adjust the error probability with bloom filters. EG: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/ You could use one to toss out values that are definitely unique, and then scrutinize the values that are probably not unique more closely via some other method. This would be especially valuable if your input dataset has a lot of duplicates. It doesn't require comparing elements directly, it just hashes the elements using a potentially-large number of hash functions.
You could also use an on-disk BTree or 2-3 Tree or similar. These are often stored on disk, and keep key/value pairs in key order.

Should you sort a list when getting or setting it?

A decision I often run into is when to sort a list of items. When an item is added, keeping the list sorted at all times, or when the list is accessed.
Is there a best practice for better performance, or is it just the matter of saying: if the list is mostly accessed, sort it when it is changed or vice versa.
Sorting the list at every acccess is a bad idea. You have to have a flag which you set when the collection is modified. Only if this flag is set, you need to sort and then reset the flag.
But the best is if you have a data structure which is per definition always sorted. That means, if you insert a new element, the element is automatically inserted at the right index, thus keeping the collection sorted.
I don't know which platform / framework you are using. I know .NET provides a SortedList class which manages that kind of insertion-sort algorithm for you.
The answer is a big depends. You should profile and apply a strategy that is best for your case.
If you want performance on access/finding elements a good decision will be to maintain the list sorted using InsertionSort (http://en.wikipedia.org/wiki/Insertion_sort).
Sorting list on access may be an option only on some very particular scenarios, when are many insertions, low access and performance is not very important.
But, there are many other options: like maintain a var that say "list is sorted" and sort at every n-th insertion, on idle or on access (if you need).
I'm used to think in this way:
If the list is filled all at once and only after this is read, then add elements in non-sorted order and sort it just at the end of filling (in complexity terms it requires O(n log n) plus the complexity of filling, and that's usually faster than sorting while adding elements)
Conversely, if the list needs to be read before it is completely filled, then you have to add elements in sorted order (maybe using some special data structure doing the work for you, like sortedlist, red-black tree etc.)

Best Data Structure to Store Large Amounts of Data with Dynamic and Non-unique Keys?

Basically, I have a large number of C structs to keep track of, that are essentially:
struct Data {
int key;
... // More data
};
I need to periodically access lots (hundreds) of these, and they must be sorted from lowest to highest key values. The keys are not unique and they will be changed over the course of the program. To make matters even more interesting, the majority of the structures will be culled (based on criteria completely unrelated to the key values) from the pool right before being sorted, but I still need to keep references to them.
I've looked into using a binary search tree to store them, but the keys are not guaranteed to be unique and I'm not entirely sure how to restructure the tree once a key is changed or how to cull specific structures.
To recap in case that was unclear above, I need to:
Store a large number of structures with non-unique and dynamic keys.
Cull a large percentage of the structures (but not free them entirely because different structures are culled each time).
Sort the remaining structures from highest to lowest key value.
What data structure/algorithms would you use to solve this problem? The method needs to be as fast and/or memory efficient as possible, since this is a real-time application.
EDIT: The culling is done by iterating over all of the objects and making a decision for each one. The keys change between the culling/sorting runs. I should have stated that they don't change a lot, but they do change, and they can change multiple times between the culling/sorting runs. (If it helps, the key for each structure is actually a z-order for a Sprite. They need to be sorted before each drawing loop so the Sprites with lower z-orders are drawn first.)
Just stick 'em all in a big array.
When the time comes to do the cull and sort, start by doing the sort. Do an insertion sort. That's right - nothing clever, just an insertion sort.
After the sort, go through the sorted array, and for each object, make the culling decision, then immediately output the object if it isn't culled.
This is about as memory-efficient as it gets. It should also require very little computation: there's no bookkeeping on updates between cull/sort passes, and the sort will be cheap - because insertion sort is adaptive, and for an almost-sorted array like this, it will be almost O(n). The one thing it doesn't do is cache locality: there will be two separate passes over the array, for the sort, and the cull/output.
If you demand more cleverness, then instead of an insertion sort, you could use another adaptive, in-place sort that's faster. Timsort and smoothsort are good candidates; both are utterly fiendish to implement.
The big alternative to this is to only sort unculled objects, using a secondary, temporary, list of such objects which you sort (or keep in a binary tree or whatever). But the thing is, if the keys don't change that much, then the win you get from using an adaptive sort on an almost-sorted array will (i reckon!) outweigh the win you would get from sorting a smaller dataset. It's O(n) vs O(n log n).
The general solution to this type of problem is to use a balanced search tree (e.g. AVL tree, red-black tree, B-tree), which guarantees O(log n) time (almost constant, but not quite) for insertion, deletion, and lookup, where n is the number of items currently stored in the tree. Guaranteeing no key is stored in the tree twice is quite trivial, and is done automatically by many implementations.
If you're working in C++, you could try using std::map<int, yourtype>. If in C, find or implement some simple binary search tree code, and see if it's fast enough.
However, if you use such a tree and find it's too slow, you could look into some more fine-tuned approaches. One might be to put your structs in one big array, radix sort by the integer key, cull on it, then re-sort per pass. Another approach might be to use a Patricia tree.

What sort of sorted datastructure is optimized for finding items within a range?

Say I have a bunch of objects with dates and I regularly want to find all the objects that fall between two arbitrary dates. What sort of datastructure would be good for this?
A binary search tree sounds like what you're looking for.
You can use it to find all the objects in O(log(N) + K), where N is the total number of objects and K is the number of objects that are actually in that range. (provided that it's balanced). Insertion/removal is O(log(N)).
Most languages have a built-in implementation of this.
C++:
http://www.cplusplus.com/reference/stl/set/
Java:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/TreeSet.html
You can find the lower bound of the range (in log(n)) and then iterate from there until you reach the upper bound.
Assuming you mean by date when you say sorted, an array will do it.
Do a binary search to find the index that's >= the start date. You can then either do another search to find the index that's <= the end date leaving you with an offset & count of items, or if you're going to process them anyway just iterate though the list until you exceed the end date.
It's hard to give a good answer without a little more detail.
What kind of performance do you need?
If linear is fine then I would just use a list of dates and iterate through the list collecting all dates that fall within the range. As Andrew Grant suggested.
Do you have duplicates in the list?
If you need to have repeated dates in your collection then most implementations of a binary tree would probably be out. Something like Java's TreeSet are set implementations and don't allow repeated elements.
What are the access characteristics? Lots of lookups with few updates, vice-versa, or fairly even?
Most datastructures have trade-offs between lookups and updates. If you're doing lots of updates then some datastructure that are optimized for lookups won't be so great.
So what are the access characteristics of the data structure, what kind of performance do you need, and what are structural characteristics that it must support (e.g. must allow repeated elements)?
If you need to make random-access modifications: a tree, as in v3's answer. Find the bottom of the range by lookup, then count upwards. Inserting or deleting a node is O(log N). stbuton makes a good point that if you want to allow duplicates (as seems plausible for datestamped events), then you don't want a tree-based set.
If you do not need to make random-access modifications: a sorted array (or vector or whatever). Find the location of the start of the range by binary chop, then count upwards. Inserting or deleting is O(N) in the middle. Duplicates are easy.
Algorithmic performance of lookups is the same in both cases, O(M + log N), where M is the size of the range. But the array uses less memory per entry, and might be faster to count through the range, because after the binary chop it's just forward sequential memory access rather than following pointers.
In both cases you can arrange for insertion at the end to be (amortised) O(1). For the tree, keep a record of the end element at the head, and you get an O(1) bound. For the array, grow it exponentially and you get amortised O(1). This is useful if the changes you make are always or almost-always "add a new event with the current time", since time is (you'd hope) a non-decreasing quantity. If you're using system time then of course you'd have to check, to avoid accidents when the clock resets backwards.
Alternative answer: an SQL table, and let the database optimise how it wants. And Google's BigTable structure is specifically designed to make queries fast, by ensuring that the result of any query is always a consecutive sequence from a pre-prepared index :-)
You want a structure that keeps your objects sorted by date, whenever you insert or remove a new one, and where finding the boundary for the segment of all objects later than or earlier than a given date is easy.
A heap seems the perfect candidate. In practical applications, heaps are simply represented by an array, where all the objects are stored in order. Seeing that sorted array as a heap is simply a way to make insertions of new objects and deletions happen in the right place, and in O(log(n)).
When you have to find all the objects between date A (excluded) and B (included), find the position of A (or the insert position, that is, the position of the earlier element later than A), and the position of B (or the insert position of B), and return all the objects between those positions (which is simply the section between those positions in the array/heap)

Resources