How to determine differences in two lists of data - algorithm

This is an exercise for the CS guys to shine with the theory.
Imagine you have 2 containers with elements. Folders, URLs, Files, Strings, it really doesn't matter.
What is AN algorithm to calculate the added and the removed?
Notice: If there are many ways to solve this problem, please post one per answer so it can be analysed and voted up.
Edit: All the answers solve the matter with 4 containers. Is it possible to use only the initial 2?

Assuming you have two lists of unique items, and the ordering doesn't matter, you can think of them both as sets rather than lists.
If you think of a Venn diagram, with list A as one circle and list B as the other, then the intersection of the two is the constant pool.
Remove all the elements in this intersection from both A and B, and anything left in A has been deleted, whilst anything left in B has been added.
So, iterate through A looking for each item in B. If you find it, remove it from both A and B.
Then A is a list of things that were deleted, and B is a list of things that were added.
I think...
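A minimal sketch of that set-based approach in Python (assuming the items are unique and hashable):

    A = {"a", "b", "c"}   # old list
    B = {"b", "c", "d"}   # new list

    deleted = A - B   # left over from A -> {"a"}
    added   = B - A   # left over from B -> {"d"}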
[edit] Ok, with the new "only 2 container" restriction, the same still holds:
foreach( eleA in A ) {
    if( eleA NOT IN B ) {
        DELETED
    }
}
foreach( eleB in B ) {
    if( eleB NOT IN A ) {
        ADDED
    }
}
Then you aren't constructing a new list or destroying your old ones, but it will take longer: in the previous example you could just loop over the shorter list and remove the elements from the longer, whereas here you need to walk both lists.
And I'd argue my first solution didn't use 4 containers, it just destroyed two ;-)

I have not done this in a while but I believe the algorithm goes like this...
sort left-list and right-list
adds = {}
deletes = {}
get first right-item from right-list
get first left-item from left-list
while (either list has items)
    if right-list is empty, or left-item < right-item
        add left-item to deletes
        get next left-item from left-list
    else if left-list is empty, or left-item > right-item
        add right-item to adds
        get next right-item from right-list
    else
        get next left-item from left-list
        get next right-item from right-list
Treating left-list as the old version and right-list as the new one: deletes now contains the items that were removed, and adds contains the items that are new.
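A runnable version of that merge walk, as a sketch in Python (both lists must be sorted and duplicate-free):

    def diff_sorted(left, right):
        adds, deletes = [], []
        i = j = 0
        while i < len(left) or j < len(right):
            if j == len(right) or (i < len(left) and left[i] < right[j]):
                deletes.append(left[i]); i += 1    # only in left-list
            elif i == len(left) or left[i] > right[j]:
                adds.append(right[j]); j += 1      # only in right-list
            else:
                i += 1; j += 1                     # present in both
        return adds, deletes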

What Joe said. And, if the lists are too large to fit into memory, use an external file sorting utility or an external merge sort.

Missing information: How do you define added/removed? E.g. if the lists (A and B) show the same directory on Server A and Server B, that is in sync. If I now wait for 10 days, generate the lists again and compare them, how can I tell if something has been removed? I cannot. I can only tell there are files on Server A not found on Server B and/or the other way round. Whether that is because a file has been added to Server A (thus the file is not found on B) or a file has been deleted on Server B (thus the file is not found on B anymore) is something I cannot determine by just having a list of file names.
For the solution I suggest, I will just assume that you have one list named OLD and one list named NEW. Everything found on OLD but not on NEW has been removed. Everything found on NEW, but not on OLD has been added (e.g. the content of the same directory on the same server, however lists have been created at different dates).
Further I will assume there are no duplicates. That means every item on either list is unique in the sense of: If I compare this item to any other item on the list (no matter how this compare works), I can always say the item is either smaller or bigger than the one I'm comparing it to, but never equal. E.g. when dealing with strings, I can compare them lexicographically and the same string is never twice in the list.
In that case the simplest (though not necessarily the best) solution is:
Sort the OLD list. E.g. if the list consists of strings, sort them alphabetically. Sorting is necessary because it means I can use binary search to quickly find an object in the list, assuming it exists there (or to quickly determine that it does not exist in the list at all). If the list is unsorted, finding the object has a complexity of O(n) (I need to look at every single item in the list). If the list is sorted, the complexity is only O(log n), as after every attempt to match an item I can exclude 50% of the remaining items. Even if the list has 100 items, finding an item (or detecting that it is not in the list) takes at most 7 comparisons (2^7 = 128 > 100), far fewer than 100. The NEW list doesn't have to be sorted.
Now we perform list elimination. For every item in the NEW list, try to find this item in the OLD list (using binary search). If the item is found, remove it from the OLD list and also remove it from the NEW list. This also means the lists get smaller as the elimination progresses, and thus the lookups become faster and faster. Since removing an item from a list has no effect on the correct sort order of the list, there is no need to ever re-sort the OLD list during the elimination phase.
At the end of elimination, both lists might be empty, in which case they were equal. If they are not empty, all items still on the OLD list are items missing on the NEW list (otherwise we had removed them), hence these are the removed items. All items still on the NEW list are items that were not on the OLD list (again, we had removed them otherwise), hence these are the added items.
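A sketch of this elimination in Python (Python's bisect module supplies the binary search; items assumed unique and comparable):

    import bisect

    def diff(old, new):
        old = sorted(old)              # the OLD list must be sorted
        added = []
        for item in new:
            i = bisect.bisect_left(old, item)
            if i < len(old) and old[i] == item:
                del old[i]             # found in both lists: eliminate
            else:
                added.append(item)     # only in NEW
        removed = old                  # whatever survived elimination
        return added, removed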

Are the objects in the list "unique"? In this case I would first build two maps (hashmaps) and then scan the lists and look up every object in the maps.
require 'set'

map1 = Set.new
map2 = Set.new
removedElements = []
addedElements = []

list1.each { |item| map1.add(item) }
list2.each { |item| map2.add(item) }

list1.each { |item| removedElements << item unless map2.include?(item) }
list2.each { |item| addedElements << item unless map1.include?(item) }
Sorry for the quick-and-dirty Ruby :-P
In the end removedElements will contain the elements belonging to list1 but not to list2, and addedElements will contain the elements belonging to list2 but not to list1.
The cost of the whole operation is O(N) (four linear passes), since a lookup in the map/dictionary may be considered constant time. On the other hand, linearly searching for each element in the other list would make it O(N^2), and binary search on sorted lists O(N log N).
EDIT: on second thought, by moving the last check into the second loop you can remove one of the loops... but that's ugly... :)
list1.each { |item| map1.add(item) }

list2.each do |item|
  map2.add(item)
  addedElements << item unless map1.include?(item)
end

list1.each { |item| removedElements << item unless map2.include?(item) }

Related

How to implement a collection that supports real-time filtering?

I want to implement a mutable sequential collection FilteredList that wraps another collection List and filters it based on a predicate.
Both the wrapped List and the exposed FilteredList are mutable and observable, and should be synchronized (so for example, if someone adds an element to List that element should appear in the correct position in FilteredList, and vice versa).
Elements that don't satisfy the predicate can still be added to FilteredList, but they will not be visible (they will still appear in the inner list).
The collections should support:
Insert(index,value) which inserts an element value at position index, pushing elements forward.
Remove(index) which removes the element at position index, moving all subsequent elements back.
Update(index, value), which updates the element at position index to be value.
I'm having trouble coming up with a good synchronization mechanism.
I don't have any strict complexity bounds, but real world efficiency is important.
The best way to avoid synchronization difficulties is to create a data structure that doesn't need them: use a single data structure to present the filtered and unfiltered data.
You should be able to do that with a modified skip list (actually, an indexable skip list), which will give you O(log n) access by index.
What you do is maintain two separate sets of forward pointers for each node, rather than just one set. One set is for the unfiltered list, as in a normal skip list, and the other set is for the filtered list.
Adding to or removing from the list is the same for the filtered and unfiltered lists. That is, you find the node at index by following the appropriate filtered or unfiltered links, and then add or remove the node, updating both sets of link pointers.
This should be more efficient than a standard sequential list, because insertion and removal don't incur the cost of moving items up or down to make a hole or fill a gap; it's all done with references.
It takes a little more space per node, though. On average, a skip list requires two extra references per node. Since you're building what is in effect two skip lists in one, expect your nodes to require, on average, four extra references per node.
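A sketch of what a node in such a dual skip list might look like, in Python (names are hypothetical; the per-link widths are what make the skip list indexable):

    class Node:
        def __init__(self, value, level):
            self.value = value
            # One tower of forward links per view of the data:
            self.fwd_all      = [None] * level  # unfiltered order
            self.fwd_filtered = [None] * level  # filtered order
            # Link widths (number of elements skipped) give
            # O(log n) access by index in either view:
            self.width_all      = [1] * level
            self.width_filtered = [1] * level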
Edit after comment
If, as you say, you don't control List, then you still maintain this dual skip list that I described. But the data stored in the skip list is just the index into List. You said that List is observable, so you get notification of all insert and delete operations, so you should be able to maintain an index by reacting to all notifications.
When somebody wants to operate on FilteredList, you use the filtered index links to find the List index of the FilteredList record the user wanted to affect. Then you pass the request onto List, using the translated index. And then you react to the observable notification from List.
Basically, you're just maintaining a secondary index into List, so that you can translate FilteredList indexes into List indexes.

Sync Algorithm Pattern

Let's say I have two sources: A and B. For example, both are disparate data stores for storing TODO lists.
How do I build an algorithm for an operation which ensures that both sources are synced?
Do I just copy A to B and then copy B to A, eliminating duplicates (assuming there is a primary key ID to eliminate duplicates)?
For the items of both lists you should store the time of the last sync.
During the next sync you work only with the sublists of items that changed after the last sync time.
Yes, for these sublists a simple two-sided (or n-sided) join will be enough.
The n-sided sync is more interesting. A better way is to create a star system, where each sync is done between an end list and the core list. The core list could be the one on the server; the end lists are the ones kept and shown by the UI.
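A minimal sketch of the two-sided case in Python, assuming every item records when it last changed (all names here are hypothetical):

    def sync(a, b, last_sync):
        # a and b map item id -> (modified_time, payload)
        changed_a = {k: v for k, v in a.items() if v[0] > last_sync}
        changed_b = {k: v for k, v in b.items() if v[0] > last_sync}
        # Copy each side's recent changes across; newer timestamp wins.
        for k, v in changed_a.items():
            if k not in b or b[k][0] < v[0]:
                b[k] = v
        for k, v in changed_b.items():
            if k not in a or a[k][0] < v[0]:
                a[k] = v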

Sorting application difficulty

Currently I am reading a book on algorithms and found this usage of sorting.
Reconstructing the original order - How can we restore the original arrangement of a set of items after we permute them for some application? Add an extra field to the data record for the item, such that the i-th record sets this field to i. Carry this field along whenever you move the record, and later sort on it when you want the initial order back.
I've been trying hard to understand what it means, and I failed miserably. Please, can somebody help?
Suppose you have a list of items in random order:
itemC, itemB, itemA, itemD
you sorted them:
itemA, itemB, itemC, itemD
and you didn't have enough memory to store them in a separate location, so the original sequence is lost. Moreover, the original order is random, and it would be problematic/impossible to restore it.
This article gives a solution to this problem.
Add an extra field to the data record for the item, such that the i-th record sets this field to i
So, we add an extra field for each of the items:
(itemC,1), (itemB,2), (itemA,3), (itemD, 4)
And after sort we have:
(itemA,3), (itemB,2), (itemC,1), (itemD, 4)
So we can easily restore the initial order by sorting on the additional field.
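The same trick in a few lines of Python, as a sketch of the tag/permute/restore cycle:

    items = ["itemC", "itemB", "itemA", "itemD"]

    # Tag each record with its original position:
    tagged = [(item, i) for i, item in enumerate(items)]

    tagged.sort(key=lambda t: t[0])                # permute (here: sort by value)
    restored = sorted(tagged, key=lambda t: t[1])  # sort on the extra field

    assert [item for item, _ in restored] == items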
Let's say you have the data in an array, because it's the simplest structure that I can use to exemplify.
So, your node (i.e., element of the array) may look like this:
(some data type) data
The algorithm is suggesting that you add an integer field, so it looks like this:
(some data type) data,
int position
And then, you fill the positions with the actual index. Something like this pseudocode:
for current: 0 to lastElement
    array[current].position = current
(that's not written in any language I know of, but it should be readable)
After doing that, you shuffle it (re-sort it) for whatever you need to.
When you want to restore the original ordering, all you need to do is sort by the position field.
Well, basically it's saying that you need some sort of thingy to keep track of the original order (which is destroyed by the permutation). One option would be to simply reverse the permutation (check out Steve Jessop's informative answer here).
Another option to invert the permutation would require fewer processing steps, but more memory. More specifically, each node in your input set would have an extra ID field, and all the elements in this input set are sorted based on this field. Once you apply the permutation, it's obvious that the IDs are no longer in a sorted order. If you wish to invert the permutation, all you have to do is sort the list again based on this field.

What is a quad linked list?

I'm currently working on implementing a list-type structure at work, and I need it to be crazy effective. In my search for effective data structures I stumbled across a patent for a quad linked list, and this sparked my interest enough to make me forget about my current task and start investigating the quad list instead. Unfortunately, the internet was very secretive about the whole thing, and Google didn't produce much in terms of usable results. The only explanation I got was the patent description that stated:
A quad linked data structure that provides bidirectional search capability for multiple related fields within a single record. The data base is searched by providing sets of pointers at intervals of N data entries to accommodate a binary search of the pointers followed by a linear search of the resultant range to locate an item of interest and its related field.
This, unfortunately, just makes me more puzzled, as I cannot wrap my head around the non-layman explanation. So therefore I turn to you all in hope that you can explain to me what this quad linked list really is, as I know not knowing will drive me up and over the walls pretty quickly.
Do you know what a quad linked list is?
I can't be sure, but it sounds a bit like a skip list.
Even if that's not what it is, you might find skip lists handy. (To the best of my knowledge they are unidirectional, however.)
I've not come across the term formally before, but from the patent description, I can make an educated guess.
A linked list is one where each node has a link to the next...
a -->-- b -->-- c -->-- d -->-- null
A doubly linked list means each node holds a link to its predecessor as well.
  ---<--- ---<--- ---<---
 |       |       |       |
 a -->-- b -->-- c -->-- d -->-- null
Let's assume the list is sorted. If I want to perform binary search, I'd normally go half way down the list to find the middle node, then go into the appropriate interval and repeat. However, linked list traversal is always O(n) - I have to follow all the links. From the description, I think they're just adding additional links from a node to "skip" a fixed number of nodes ahead in the list. Something like...
  ---<--- ---<--- ---<---
 |       |       |       |
 a -->-- b -->-- c -->-- d -->-- null
 |               |
 |------->-------|
  -------<-------
Now I can traverse the list more rapidly, especially if I choose the extra link targets carefully (i.e. ensure each one jumps forward/back by about half the remaining list length). I then find the rough interval I want with these links, and use the normal links to find the item.
This is a good example of why I hate software patents. It's eminently obvious stuff, wrapped in florid prose to confuse people.
I don't know if this is exactly a "quad-linked list", but it sounds like something like this:
struct Customer {
    // Normal doubly-linked list, in insertion order.
    Customer *nextCustomer;
    Customer *prevCustomer;

    std::string firstName;
    // Second doubly-linked list, threaded in firstName order.
    Customer *nextByFirstName;
    Customer *prevByFirstName;

    std::string lastName;
    // Third doubly-linked list, threaded in lastName order.
    Customer *nextByLastName;
    Customer *prevByLastName;
};
That is: you maintain several orderings through your collection. You can easily navigate in firstName order, or in lastName order. It's expensive to keep the links up to date, but it makes navigation quite quick.
Of course, this could be something completely different.
My reading of it is that a quad linked list is one which can be traversed (backwards or forwards) in O(n) in two different ways, i.e. sorted according to FieldX or FieldY:
(a) generating first and second sets of link pointers, wherein the first set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the fixed ID field, and the second set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the fixed ID field;

(b) generating third and fourth sets of link pointers, wherein the third set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the variable ID field, and the fourth set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the variable ID field;
So if you had a quad linked list of employees you could store it sorted by name AND sorted by age, and enumerate either in O(n).
One source of the patent is this. There are, it appears, two claims, the second of which is more nearly relevant:
A computer implemented method for organizing and searching a set of related records, wherein each record includes:
i) a fixed ID field; and
ii) a variable ID field; the method comprising the steps of:
(a) generating first and second sets of link pointers, wherein the first set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the fixed ID field, and the second set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the fixed ID field;
(b) generating third and fourth sets of link pointers, wherein the third set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the variable ID field, and the fourth set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the variable ID field;
(c) generating first and second sets of field pointers, wherein the first set of field pointers includes an ordered set of pointers that point to every Nth fixed ID field when the records are ordered with respect to the fixed ID field, and the second set of pointers includes an ordered set of pointers that point to every Nth variable ID field when the records are ordered with respect to the variable ID field;
(d) when searching for a particular record by reference to its fixed ID field, conducting a binary search of the first set of field pointers to determine an initial pointer and a final pointer defining a range within which the particular record is located;
(e) examining, by linear search, the fixed ID fields within the range determined in step (d) to locate the particular record;
(f) when searching for a particular record by reference to its variable ID field, conducting a binary search of the second set of field pointers to determine an initial pointer and a final pointer defining a range within which the particular record is located;
(g) examining, by linear search, the variable ID fields within the range determined in step (f) to locate the particular record.
When you work through the patent gobbledegook, I think it means approximately the same as having two skip lists (one for forward search, one for backwards search) on each of two keys (hence 4 lists in total, and the name 'quad-list'). I don't think it is a very good patent - it looks to be an obvious application of skip lists to a data set where you have two keys to search on.
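Taken literally, claims (c) through (e) amount to a binary search over a sparse index of every Nth key, followed by a linear scan of a single block. A sketch of that lookup in Python (records are assumed sorted by a hypothetical key field):

    import bisect

    def build_index(records, n):
        # The patent's 'field pointers': every Nth key, in order.
        return [records[i].key for i in range(0, len(records), n)]

    def find(records, index, n, key):
        j = bisect.bisect_right(index, key) - 1   # binary search of the pointers
        if j < 0:
            return None                           # key precedes every record
        for rec in records[j*n : j*n + n]:        # linear search of the range
            if rec.key == key:
                return rec
        return None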
The description isn't particularly good, but as best I can gather, it sounds like a less-efficient skip list.

Implementing shuffle on the celestial jukebox

How would one implement shuffle for the "Celestial Jukebox"?
More precisely, at each time t, return a uniform random number between 0..n(t), such that there are no repeats in the entire sequence, with n() increasing over time.
For the concrete example, assume a flat-rate music service which allows playing any song in the catalog by a 0 based index number. Every so often, new songs are added which increase range of index numbers. The goal is to play a new song each time (assuming no duplicates in the catalog).
An ideal solution would be feasible on existing hardware - how would I shoehorn a list of six million songs into 8MB of DRAM? Similarly, the high song count exacerbates O(n) selection timings.
-- For an LCG generator, given a partially exhausted LCG on 0..N0, can that be translated to a different LCG on 0..N1 (where N1 > N0) that doesn't repeat the exhausted sequence?
-- Checking whether a particular song has already been played seems to rapidly grow out of hand, although this might be the only way? Is there an efficient data structure for this?
The way that I like to do that kind of non-repeating random selection is to have a list, and each time I select an item at random in [0, N), I remove it from that list. In your case, as new items get added to the catalog, they would also be added to the not-yet-selected list. Once you get to the end, simply reload all the songs back into the list.
EDIT:
If you take v3's suggestion into account, this can be done in basically O(1) time after the O(N) initialization step. It guarantees non-repeating random selection.
Here is the recap:
Add the initial items to a list
Pick index i at random (from set of [0,N))
Remove item at index i
Fill the hole at i with the Nth (last) item (or with nothing if i was the last) and decrement N
For new items, simply append to the end of the list and increment N as necessary
If you ever get to playing through all the songs (which I doubt if you have 6M songs), then add all the songs back to the list, lather, rinse, and repeat.
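A sketch of that recap in Python (random.randrange picks the index; the swap-with-last step is what keeps removal O(1)):

    import random

    class Shuffler:
        def __init__(self, songs):
            self.pool = list(songs)       # not-yet-played songs

        def add(self, song):
            self.pool.append(song)        # new catalog entries join the pool

        def next_song(self):
            if not self.pool:
                return None               # everything has been played
            i = random.randrange(len(self.pool))
            song = self.pool[i]
            self.pool[i] = self.pool[-1]  # fill the hole with the last item
            self.pool.pop()               # shrink the list: O(1)
            return song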
Since you are trying to deal with rather large sets, I would recommend the use of a DB. A simple table with basically two fields: id and "pointer" (where "pointer" is what tells you the song to play which could be a GUID, FileName, etc, depending on how you want to do it). Have an index on id and you should get very decent performance with persistence between application runs.
EDIT for 8MB limit:
Umm, this does make it a bit harder... In 8 MB, you can store a maximum of ~2M entries using 32-bit keys.
So what I would recommend is to pre-select the next 2M entries. If the user plays through 2M songs in a lifetime, damn! To pre-select them, do a pre-init step using the above algorithm. The one change I would make is that as you add new songs, roll the dice and see if you want to randomly add that song to the mix. If yes, then pick a random index and replace it with the new song's index.
With a limit of 8MB for 6 million songs, there's plainly not room to store even a single 32 bit integer for each song. Unless you're prepared to store the list on disk (in which case, see below).
If you're prepared to drop the requirement that new items be immediately added to the shuffle, you can generate an LCG over the current set of songs, then when that is exhausted, generate a new LCG over only the songs that were added since you began. Rinse and repeat until you no longer have any new songs. You can also use this rather cool algorithm that generates an unguessable permutation over an arbitrary range without storing it.
If you're prepared to relax the requirement of 8MB ram for 6 million songs, or to go to disk (for example, by memory mapping), you could generate the sequence from 1..n at the beginning, shuffle it with Fisher-Yates, and whenever a new song is added, pick a random element from the so-far-unplayed section, insert the new ID there, and append the displaced ID to the end of the list.
If you don't care much about computational efficiency, you could store a bitmap of all songs, and repeatedly pick IDs uniformly at random until you find one you haven't played yet. This would take 6 million tries to find the last song (on average), which is still damn fast on a modern CPU.
While Erich's solution is probably better for your specific use case, checking whether a song has already been played is very fast (amortized O(1)) with a hash-based structure, such as a set in Python or a std::unordered_set<int> in C++.
You could simply generate the sequence of numbers from 1 to n and then shuffle it using a Fisher-Yates shuffle. That way you can guarantee that the sequence won't repeat, regardless of n.
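For reference, a sketch of the shuffle itself (Python's random.shuffle already implements this; shown explicitly here):

    import random

    def fisher_yates(seq):
        # In-place shuffle: every permutation is equally likely.
        for i in range(len(seq) - 1, 0, -1):
            j = random.randint(0, i)      # 0 <= j <= i
            seq[i], seq[j] = seq[j], seq[i]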
You could use a linked list inside an array.
To build the initial playlist, use an array of nodes something like this:
    struct playlistNode {
        songLocator *song;
        playlistNode *next;
    };

    struct playlistNode arr[N];
Also keep a 'head' and a 'freelist' pointer.
Populate it in 2 passes:
1. fill in arr with all the songs in the catalog, in order 0..N-1;
2. randomly iterate through all the indexes, filling in the next pointers.
Deletion of songs played is O(1):
    head = cur->next;
    cur->song = NULL;
    cur->next = freelist;
    freelist = cur;
Insertion of new songs is also O(1) in expectation, as long as most slots are still unplayed: pick an array index at random, and patch in a new node.
    node = freelist;
    freelist = freelist->next;
    node->song = newSongLocator;   // hypothetical locator for the new song
    do {
        i = rand(N);
    } while (!arr[i].song);        // make sure you didn't hit a played node
    node->next = arr[i].next;
    arr[i].next = node;
