Sorting application difficulty - algorithm

Currently I am reading a book on algorithms and found this usage of sorting.
Reconstructing the original order - How can we restore the original arrangement of a set of items after we permute them for some application? Add an extra field to the data record for the item, such that i-th record sets this field to i. Carry this field along whenever you move the record, and later sort on it when you want the initial order back.
I've been trying hard to understand what this means, and I failed miserably. Could somebody please help?

Suppose you have a list of items in random order:
itemC, itemB, itemA, itemD
You sort them:
itemA, itemB, itemC, itemD
and you didn't have enough memory to keep a copy in a separate location, so the original sequence is lost. Moreover, the original order was random, so it will be problematic or impossible to restore it.
This article gives a solution to this problem.
Add an extra field to the data record for the item, such that i-th record sets this field to i
So, we add an extra field for each of the items:
(itemC,1), (itemB,2), (itemA,3), (itemD, 4)
And after sorting we have:
(itemA,3), (itemB,2), (itemC,1), (itemD, 4)
So we can easily restore the initial order by sorting on the additional field.

Let's say you have the data in an array, because it's the simplest structure that I can use to exemplify.
So, your node (i.e., element of the array) may look like this:
(some data type) data
The algorithm suggests that you add an integer field, so it looks like this:
(some data type) data,
int position
And then, you fill the positions with the actual index. Something like this pseudocode:
for current: 0 to lastElement
array[current].position = current
(that's not written in any language I know of, but it should be readable)
After doing that, you shuffle it (re-sort it) however you need to.
When you want to restore the original ordering, all you need to do is sort by the position field.
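A minimal Python sketch of the idea, in case a runnable version helps. The names Record, tag_positions and restore_original_order are made up for this illustration and do not come from the book.

```python
from dataclasses import dataclass


@dataclass
class Record:
    data: str
    position: int = -1  # original index; filled in before any reordering


def tag_positions(records):
    """Store each record's current index in its position field."""
    for i, rec in enumerate(records):
        rec.position = i


def restore_original_order(records):
    """Sort on the carried field to get the initial order back."""
    records.sort(key=lambda rec: rec.position)


records = [Record("itemC"), Record("itemB"), Record("itemA"), Record("itemD")]
tag_positions(records)

# The "application" permutation: here, sorting by the data itself.
records.sort(key=lambda rec: rec.data)
sorted_view = [rec.data for rec in records]

restore_original_order(records)
restored = [rec.data for rec in records]
```

The position field travels with each record because it is part of the record, so whatever reordering happens in between cannot lose it.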

Well, basically it's saying that you need some way to keep track of the original order (which is destroyed by the permutation). One option would be to simply reverse the permutation (check out Steve Jessop's informative answer here).
Another option for inverting the permutation requires fewer processing steps, but more memory. More specifically, each node in your input set gets an extra ID field, and the elements of the input set start out sorted on this field. Once you apply the permutation, the IDs are obviously no longer in sorted order. If you wish to invert the permutation, all you have to do is sort the list again based on this field.

Related

Best DataStructure to implement Excel Spreadsheet

How can we implement an Excel spreadsheet that supports creating and deleting rows, creating and deleting cells, and modifying the data inside any cell?
I am looking for the best data structure to implement this.
The problem statement is a little vague in my opinion. We have no information about which operations will be frequent, or even about how much data this data structure is going to hold.
So let's assume there can be a fair amount of data, and that the operations are addition and deletion of rows and cells.
For an Excel spreadsheet, if I had to implement it with a custom data structure, I would make each row a node of a linked list. This helps because, as opposed to an (n-dimensional) array, the memory can be allocated non-contiguously. It also makes adding and deleting rows much easier.
Inside each node, we can have an array of strings to hold the cell values and an Id field to hold the Id of the row.
The head node of the data structure holds the column names in its string array, so in effect each column is mapped to an index of the array.
To add a row: this is an insert into the linked list. Make a new row and append it at the end.
To delete a row: same as deleting a node from a linked list.
To add/update a cell value: you know the row Id and the column name, so you can look up the column's index in the array from the head node. Once you have the node corresponding to the row, access that index of its string array to add/read/update/delete the value of the cell.
To optimize node access, you can keep an index on the linked list so that a node can be located easily by row Id. A further optimization would be to store a mapping from row Id to node pointer in an auxiliary map or array, so that inserting rows in the middle is also fast.
However, I would reiterate that the implementation should be chosen based on the use case. If there are heavy column addition/deletion operations, for example, this design will be quite slow. There are different trade-offs for each kind of use case.
I think the easy way to go ahead with this is to simply use a JSON structure to hold each row. Column names as keys and the cell values as values. This handles null/empty values quite easily.
A spreadsheet is essentially a table: changes can be made to any cell in any row. Hence going with a simple list structure would not be too bad. The downside is that deleting and inserting rows in the middle is not performant. But inserting rows at the end, which is the most common use case, and modifying cells are both quite easy.
A linked list structure would make insertion and deletion faster, but it would hurt random access, so a simple list of JSON objects would be the better choice.
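Here is a rough Python sketch of the "list of JSON-like objects" idea, with a plain dict standing in for each JSON row. The Sheet class and its method names are invented for this example, and it ignores formulas, formatting and everything else a real spreadsheet needs.

```python
class Sheet:
    """Rows are kept in a list; each row maps column name -> cell value."""

    def __init__(self):
        self.rows = []

    def append_row(self):
        """Add an empty row at the end and return its index (cheap)."""
        self.rows.append({})
        return len(self.rows) - 1

    def insert_row(self, index):
        """Insert an empty row in the middle (O(n): later rows shift)."""
        self.rows.insert(index, {})

    def delete_row(self, index):
        """Delete a row (O(n): later rows shift)."""
        del self.rows[index]

    def set_cell(self, row, column, value):
        self.rows[row][column] = value

    def get_cell(self, row, column):
        # A missing key is simply an empty cell.
        return self.rows[row].get(column)

    def clear_cell(self, row, column):
        self.rows[row].pop(column, None)


sheet = Sheet()
first = sheet.append_row()
second = sheet.append_row()
sheet.set_cell(first, "A", "name")
sheet.set_cell(second, "B", 42)
sheet.insert_row(1)  # the row holding 42 moves from index 1 to index 2
```

Empty cells cost nothing here because they are just absent keys, which is the null/empty handling mentioned above.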

How to implement a collection that supports real-time filtering?

I want to implement a mutable sequential collection FilteredList that wraps another collection List and filters it based on a predicate.
Both the wrapped List and the exposed FilteredList are mutable and observable, and should be synchronized (so for example, if someone adds an element to List that element should appear in the correct position in FilteredList, and vice versa).
Elements that don't satisfy the predicate can still be added to FilteredList, but they will not be visible (they will still appear in the inner list).
The collections should support:
Insert(index,value) which inserts an element value at position index, pushing elements forward.
Remove(index) which removes the element at position index, moving all subsequent elements back.
Update(index, value), which updates the element at position index to be value.
I'm having trouble coming up with a good synchronization mechanism.
I don't have any strict complexity bounds, but real world efficiency is important.
The best way to avoid synchronization difficulties is to create a data structure that doesn't need them: use a single data structure to present the filtered and unfiltered data.
You should be able to do that with a modified skip list (actually, an indexable skip list), which will give you O(log n) access by index.
What you do is maintain two separate sets of forward pointers for each node, rather than just one set. The one set is for the unfiltered list, as in the normal skip list, and the other set is for the filtered list.
Adding to or removing from the list is the same for the filtered and unfiltered lists. That is, you find the node at index by following the appropriate filtered or unfiltered links, and then add or remove the node, updating both sets of link pointers.
This should be more efficient than a standard sequential list, because insertion and removal don't incur the cost of moving items up or down to make a hole or fill a gap; it's all done with references.
It takes a little more space per node, though. On average, a skip list requires two extra references per node. Since you're building what is in effect two skip lists in one, expect your nodes to require, on average, four extra references per node.
Edit after comment
If, as you say, you don't control List, then you still maintain this dual skip list that I described. But the data stored in the skip list is just the index into List. You said that List is observable, so you get notification of all insert and delete operations, so you should be able to maintain an index by reacting to all notifications.
When somebody wants to operate on FilteredList, you use the filtered index links to find the List index of the FilteredList record the user wanted to affect. Then you pass the request onto List, using the translated index. And then you react to the observable notification from List.
Basically, you're just maintaining a secondary index into List, so that you can translate FilteredList indexes into List indexes.
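To make the index translation concrete, here is a simplified Python sketch. Note that it swaps the dual skip list for a plain sorted list of inner indexes, so the shifts on insert and remove are O(n) rather than O(log n); it only illustrates the bookkeeping. FilteredIndex and the on_inserted / on_removed / on_updated hooks are hypothetical names standing in for whatever notifications your observable List actually raises.

```python
from bisect import bisect_left


class FilteredIndex:
    """Maps FilteredList positions to positions in the inner list."""

    def __init__(self, inner, predicate):
        self.predicate = predicate
        # Sorted inner indexes of the elements that pass the predicate.
        self.visible = [i for i, v in enumerate(inner) if predicate(v)]

    def to_inner_index(self, filtered_index):
        return self.visible[filtered_index]

    def on_inserted(self, inner_index, value):
        """Call after a value was inserted into the inner list."""
        start = bisect_left(self.visible, inner_index)
        for k in range(start, len(self.visible)):
            self.visible[k] += 1  # everything at or after the hole moves up
        if self.predicate(value):
            self.visible.insert(start, inner_index)

    def on_removed(self, inner_index):
        """Call after the element at inner_index was removed."""
        start = bisect_left(self.visible, inner_index)
        if start < len(self.visible) and self.visible[start] == inner_index:
            del self.visible[start]
        for k in range(start, len(self.visible)):
            self.visible[k] -= 1  # everything after the gap moves down

    def on_updated(self, inner_index, value):
        """Call after the element at inner_index was replaced by value."""
        k = bisect_left(self.visible, inner_index)
        present = k < len(self.visible) and self.visible[k] == inner_index
        if self.predicate(value) and not present:
            self.visible.insert(k, inner_index)
        elif not self.predicate(value) and present:
            del self.visible[k]


inner = [1, 2, 3, 4, 5, 6]
index = FilteredIndex(inner, lambda v: v % 2 == 0)

inner.insert(0, 8)
index.on_inserted(0, 8)

del inner[3]  # removes the value 3, which was never visible
index.on_removed(3)

filtered_values = [inner[i] for i in index.visible]
```

An operation on FilteredList would call to_inner_index first, forward the request to List with the translated index, and then let the resulting notification update the index.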

Which Data Structure should I choose?

I want to list the top scores for a game. More than 300,000 players are to be listed in order of their top score. Players can update their high score by typing in their name and their new top score. Only 10 scores are shown at a time, and the user can type in which place they want to start with. So if they type "100100", the whole list should refresh and show the 100,100th score through the 100,109th score. What data structure should I use in this case? I am thinking of using a hash table with the users' names as keys, so that updating a score takes constant time. But what if a user's previous score was in 100,100th place, and after the update it becomes the highest one in the whole list? With a hash table that would take linear time, since I would need to compare against every score in the list to make sure it is the highest one. Is there a better data structure to choose besides a hash table?
You should choose the data structure that is optimized for the most common operation. By your description of an ordered list probably the most common operation will be viewing the list (and jumping around in it).
If you use a hashtable with the user's names as keys, then it will be very expensive to display the list ordered by score, and very expensive to compute different views when viewers skip around in the list.
Instead, using a simple list sorted by score will make all of the "view" operations very cheap and very easy to implement. When a user updates their score, simply do a linear (O(n)) search for the user by name and remove their old entry. Then, since the list is sorted, you can search it in O(log n) time to find where to re-insert their new entry in the list.
Use a map (ordered tree) based container keyed by score, plus a hash keyed by name. Let the values be links to your entities, stored in a list, an array, etc. In other words, store the data however you like and build indices for the different kinds of access that need to be fast.
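A small Python sketch of the "two indices" idea, under the assumption that a sorted list kept in order with the standard bisect module is an acceptable stand-in for the ordered tree. That makes lookups O(log n), but inserting into or deleting from the middle of the list still shifts elements; a balanced tree that tracks subtree sizes would avoid that. Leaderboard, update and page are names made up for this example.

```python
from bisect import bisect_left, insort


class Leaderboard:
    def __init__(self):
        self.score_by_name = {}  # hash index: name -> current top score
        self.ranked = []         # sorted (-score, name) pairs, best first

    def update(self, name, score):
        """Record a new top score for name, replacing any old entry."""
        old = self.score_by_name.get(name)
        if old is not None:
            # The hash gives the old score, so the old entry can be
            # located by binary search instead of a linear scan.
            i = bisect_left(self.ranked, (-old, name))
            del self.ranked[i]
        self.score_by_name[name] = score
        insort(self.ranked, (-score, name))

    def page(self, start_place, count=10):
        """Return (name, score) pairs starting at a 1-based place."""
        first = start_place - 1
        return [(name, -neg) for neg, name in self.ranked[first:first + count]]


board = Leaderboard()
board.update("ann", 50)
board.update("bob", 70)
board.update("cy", 60)
board.update("ann", 90)  # ann jumps from last place to first
top_two = board.page(1, 2)
from_third = board.page(3)
```

Showing places 100,100 through 100,109 is then just a slice, which is the operation the question expects to be most frequent.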

Do I need to implement a b-tree search for this?

I have an array of integers, which could run into the hundreds of thousands (or more), sorted numerically ascending since that's how they were originally stacked.
I need to be able to query the array to get the index of its first occurrence of a number >= some input, as efficiently as possible. The only way I would know how to do this without even thinking about it would be to iterate through the array testing the condition until it returns true, at which point I'd stop iterating. However, this is the most expensive solution to this problem and I'm looking for the best algorithm to solve it.
I'm coding in Objective-C, but I'll give an example in JavaScript to broaden the audience of people who are able to respond.
// Sample set
var numbers = [1, 7, 23, 23, 23, 89, 1002, 1003];
var indexAfter100 = getIndexOfValueGreaterThan(100);
var indexAfter7 = getIndexOfValueGreaterThan(7);
// (indexAfter100 == 6) == true
// (indexAfter7 == 2) == true
Putting this data into a DB in order to perform this search will only be a last-resort solution since I'm keen to see some sort of algorithm to tackle this quickly in memory.
I do have the ability to change the data structure, or to store an additional data structure as I'm building the array, since my program has already pushed each number one by one onto this stack, so I'd just modify the code that's adding them to the stack. Searching for the index as they're being added to the stack isn't possible since the search operation will be repeated frequently with different values after the fact.
Right now I'm thinking "B-Tree" but to be honest, I would have no idea how to implement one and before I go off and start figuring that out, I wonder if there's a nice algorithm that fits this single use-case better?
You should use binary search. Objective-C might even have a built-in method for that (many languages I know do). A B-tree probably won't help much, unless you want to store the data on disk.
I don't know about Objective-C, but C (plain ol' C) comes with a function called bsearch (besides, AFAIK, Obj-C can call C functions just fine):
http://www.cplusplus.com/reference/clibrary/cstdlib/bsearch/
That basically does a binary search which sounds like it's what you need.
A fast search algorithm should be able to handle an array of ints of that size without taking too long, I should think (and the array is sorted, so a binary search would probably be the way to go).
I think a btree is probably overkill...
Since they are sorted in a particular ASCending order and you only need the bigger ones, I would serialize that array, explode it by the INT and keep the part of the serialized string that holds the bigger INTs, then unserialize it and voilà.
Linear search, also referred to as sequential search, looks at each element in sequence from the start to see if the desired element is present in the data structure. When the amount of data is small, this search is fast. It's easy, but the work needed is proportional to the amount of data to be searched. Doubling the number of elements will double the time to search if the desired element is not present.
Binary search is efficient for larger arrays. Here we check the middle element. If its value is bigger than what we are looking for, then look in the first half; otherwise, look in the second half. Repeat this until the desired item is found. The table must be sorted for binary search. It eliminates half the data at each iteration, so it's logarithmic.
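For illustration, here is how the lookup could be sketched in Python with the standard bisect module (the question targets Objective-C, so treat this only as a description of the technique). The two helper names are invented here. Both variants are shown because the question's text asks for ">=" while its sample results (index 2 for input 7) correspond to strictly greater than.

```python
from bisect import bisect_left, bisect_right

numbers = [1, 7, 23, 23, 23, 89, 1002, 1003]


def index_of_first_at_least(sorted_values, target):
    """Index of the first element >= target, or -1 if there is none."""
    i = bisect_left(sorted_values, target)
    return i if i < len(sorted_values) else -1


def index_of_first_greater_than(sorted_values, target):
    """Index of the first element > target, or -1 if there is none."""
    i = bisect_right(sorted_values, target)
    return i if i < len(sorted_values) else -1


index_after_100 = index_of_first_greater_than(numbers, 100)
index_after_7 = index_of_first_greater_than(numbers, 7)
index_from_7 = index_of_first_at_least(numbers, 7)
```

Either way each query takes O(log n) comparisons on the existing sorted array, with no extra data structure to build.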

What is a quad linked list?

I'm currently working on implementing a list-type structure at work, and I need it to be crazy effective. In my search for effective data structures I stumbled across a patent for a quad linked list, and this sparked my interest enough to make me forget about my current task and start investigating the quad list instead. Unfortunately, the internet was very secretive about the whole thing, and Google didn't produce much in terms of usable results. The only explanation I got was the patent description, which stated:
A quad linked data structure that provides bidirectional search capability for multiple related fields within a single record. The data base is searched by providing sets of pointers at intervals of N data entries to accommodate a binary search of the pointers followed by a linear search of the resultant range to locate an item of interest and its related field.
This, unfortunately, just makes me more puzzled, as I cannot wrap my head around the non-layman explanation. So I turn to you all in the hope that you can explain to me what this quad linked list really is, as I know that not knowing will drive me up and over the walls pretty quickly.
Do you know what a quad linked list is?
I can't be sure, but it sounds a bit like a skip list.
Even if that's not what it is, you might find skip lists handy. (To the best of my knowledge they are unidirectional, however.)
I've not come across the term formally before, but from the patent description, I can make an educated guess.
A linked list is one where each node has a link to the next...
a -->-- b -->-- c -->-- d -->-- null
A doubly linked list means each node holds a link to its predecessor as well.
  --<--   --<--   --<--
 |     | |     | |     |
a -->-- b -->-- c -->-- d -->-- null
Let's assume the list is sorted. If I want to perform binary search, I'd normally go half way down the list to find the middle node, then go into the appropriate interval and repeat. However, linked list traversal is always O(n) - I have to follow all the links. From the description, I think they're just adding additional links from a node to "skip" a fixed number of nodes ahead in the list. Something like...
  --<--   --<--   --<--
 |     | |     | |     |
a -->-- b -->-- c -->-- d -->-- null
|                       |
|----------->-----------|
 -----------<-----------
Now I can traverse the list more rapidly, especially if I chose the extra link targets carefully (i.e. ensure they always go back/forward half of the offset of the item they point from in the list length). I then find the rough interval I want with these links, and use the normal links to find the item.
This is a good example of why I hate software patents. It's eminently obvious stuff, wrapped in florid prose to confuse people.
I don't know if this is exactly a "quad-linked list", but it sounds like something like this:
#include <string>

struct Customer {
    // Normal doubly-linked list.
    Customer *nextCustomer;
    Customer *prevCustomer;

    // Second ordering: links in firstName order.
    std::string firstName;
    Customer *nextByFirstName;
    Customer *prevByFirstName;

    // Third ordering: links in lastName order.
    std::string lastName;
    Customer *nextByLastName;
    Customer *prevByLastName;
};
That is: you maintain several orderings through your collection. You can easily navigate in firstName order, or in lastName order. It's expensive to keep the links up to date, but it makes navigation quite quick.
Of course, this could be something completely different.
My reading of it is that a quad linked list is one which can be traversed (backwards or forwards) in O(n) in two different ways, ie sorted according to FieldX or FieldY:
(a) generating first and second sets of link pointers, wherein the first set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the fixed ID field, and the second set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the fixed ID field;
(b) generating third and fourth sets of link pointers, wherein the third set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the variable ID field, and the fourth set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the variable ID field;
So if you had a quad linked list of employees you could store it sorted by name AND sorted by age, and enumerate either in O(n).
One source of the patent is this. There are, it appears, two claims, the second of which is more nearly relevant:
A computer implemented method for organizing and searching a set of related records, wherein each record includes:
i) a fixed ID field; and
ii) a variable ID field; the method comprising the steps of:
(a) generating first and second sets of link pointers, wherein the first set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the fixed ID field, and the second set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the fixed ID field;
(b) generating third and fourth sets of link pointers, wherein the third set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the variable ID field, and the fourth set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the variable ID field;
(c) generating first and second sets of field pointers, wherein the first set of field pointers includes an ordered set of pointers that point to every Nth fixed ID field when the records are ordered with respect to the fixed ID field, and the second set of pointers includes an ordered set of pointers that point to every Nth variable ID field when the records are ordered with respect to the variable ID field;
(d) when searching for a particular record by reference to its fixed ID field, conducting a binary search of the first set of field pointers to determine an initial pointer and a final pointer defining a range within which the particular record is located;
(e) examining, by linear search, the fixed ID fields within the range determined in step (d) to locate the particular record;
(f) when searching for a particular record by reference to its variable ID field, conducting a binary search of the second set of field pointers to determine an initial pointer and a final pointer defining a range within which the particular record is located;
(g) examining, by linear search, the variable ID fields within the range determined in step (f) to locate the particular record.
When you work through the patent gobbledegook, I think it means approximately the same as having two skip lists (one for forward search, one for backwards search) on each of two keys (hence 4 lists in total, and the name 'quad-list'). I don't think it is a very good patent - it looks to be an obvious application of skip lists to a data set where you have two keys to search on.
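Steps (c) to (e) of the claim can be sketched in Python as follows. This is only my reading of the claim, not the patent's implementation: it uses a sorted array of keys rather than linked records, it covers a single key field, and it assumes unique keys. SparseIndex, step and find are names made up for the example.

```python
from bisect import bisect_right


class SparseIndex:
    """Pointers to every step-th key of a sorted sequence."""

    def __init__(self, sorted_keys, step):
        self.keys = sorted_keys
        self.step = step
        # Step (c): an ordered set of pointers to every N-th entry.
        self.sample_positions = list(range(0, len(sorted_keys), step))
        self.sample_keys = [sorted_keys[p] for p in self.sample_positions]

    def find(self, target):
        """Return the position of target in the keys, or -1 if absent."""
        # Step (d): binary search over the sampled pointers gives a range.
        s = bisect_right(self.sample_keys, target) - 1
        if s < 0:
            return -1  # target is smaller than every key
        start = self.sample_positions[s]
        end = min(start + self.step, len(self.keys))
        # Step (e): linear search inside that range only.
        for pos in range(start, end):
            if self.keys[pos] == target:
                return pos
        return -1


keys = [3, 8, 15, 21, 34, 55, 89]
index = SparseIndex(keys, step=3)
found_at = index.find(34)
missing = index.find(10)
```

Per the claim, a second index of the same kind over the variable ID field would cover steps (f) and (g).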
The description isn't particularly good, but as best I can gather, it sounds like a less-efficient skip list.

Resources