I recently came back to a problem I had put on the back burner for a while: writing a program to calculate the flow rates of some resource through a network of pipes from resource sources to resource sinks, where each pipe can only pass so much resource per unit of time. This is, of course, a classic problem called the "network flow problem", and the typical aim is to find a flow pattern that maximizes the flow going from the sources to the sinks. I made a program that uses a common algorithm, the Ford-Fulkerson max-flow method, to do this. While the algorithm certainly does a nice job of finding a flow solution, it doesn't necessarily produce one that is particularly "natural" in terms of the flow pattern.
That is to say, consider a graph like the one below.
                 ------------- SINK 1
       0 / 8     |    0 / 5
SOURCE ----------X
                 |    0 / 5
                 ------------- SINK 2
where the numbers represent the current flow rate on that particular edge or "pipe", here in "units" per second, versus the maximum flow the pipe can support, the "X" is a junction node, and the other labels should be self-explanatory.
When we solve this using F-F (which requires us to temporarily add an "aggregate sink" node that ties the two sinks on the right together), we find the max flow rate is indeed 8 U/s, which should be obvious just from simple inspection for such a simple graph. However, the flow pattern it gives may look something like either
                 ------------- SINK 1
       8 / 8     |    5 / 5
SOURCE ----------X
                 |    3 / 5
                 ------------- SINK 2
or
                 ------------- SINK 1
       8 / 8     |    3 / 5
SOURCE ----------X
                 |    5 / 5
                 ------------- SINK 2
depending on the order in which it encounters the edges during the depth-first walk used in the calculation. Trouble is, not only is that order dependence itself not ideal, the resulting flow doesn't "feel natural" in a certain sense. Intuitively, if we were imagining pushing a fluid, we'd expect 4 U/s of flow to go to sink 1 and another 4 to go to sink 2 by symmetry. Indeed, if we shrink the capacity of the edge leading out of the source to 5, the Ford-Fulkerson algorithm will starve one sink entirely, and that is also a behavior I'd like to avoid: if there's not enough flow to keep everybody as happy as they'd like to be, then at least try to distribute it as evenly as possible. In this case, that'd mean that if the max flow is, as here, 80% of the flow needed to fully satiate all the sinks, then each sink should receive 80% of its demand, unless there's a constriction somewhere in the graph that prevents sending even that much to that sink, in which case the excess flow should back up and go to the other sinks while that one still gets the maximum it can get.
So my question is, what sort of algorithms would have either this behavior or a behavior similar to it? Or, to put it another way, if F-F is a good tool to just find a maximum flow, what is a good tool for tailoring the pattern of that maximum flow to some "desirable" form like this?
One simple solution I thought of is to just repeatedly apply F-F, only instead of routing from the source to the fictitious aggregate sink, apply it from the source to each individual sink, thus giving the max flow that is capable of making it through the constrictions, then work out from that how much each sink can actually get fed based on its demand and the whole-graph max flow. Trouble is, that means running the algorithm as many times as there are sinks, so the Big-O goes up, perhaps too much. Is there a more efficient way to achieve this?
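To make the repeated-F-F idea concrete, here is a minimal Python sketch. It uses the breadth-first (Edmonds-Karp) variant of Ford-Fulkerson rather than the depth-first walk mentioned above, and the names `max_flow`, `fair_targets` and `AGGREGATE_SINK` are made up for illustration. It also assumes each sink's demand is given separately and that total demand is positive. The proportional targets it prints are only a first approximation: it does not redistribute the excess from a constricted sink to the others, and it does not check that all targets can be routed simultaneously.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: Ford-Fulkerson with breadth-first augmenting paths."""
    # Residual graph as nested dicts, with zero-capacity reverse edges added.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)
    total = 0
    while True:
        # Breadth-first search for a path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Collect the path, find its bottleneck, and push flow along it.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck

def fair_targets(capacity, source, demands, aggregate="AGGREGATE_SINK"):
    """Per-sink caps, whole-graph max flow, and a proportional target per sink."""
    # One max-flow run per sink: what the constrictions allow for that sink alone.
    caps = {s: max_flow(capacity, source, s) for s in demands}
    # Whole-graph max flow through a temporary aggregate sink.
    joined = {u: dict(edges) for u, edges in capacity.items()}
    for s, demand in demands.items():
        joined.setdefault(s, {})[aggregate] = demand
    total = max_flow(joined, source, aggregate)
    wanted = sum(demands.values())
    # Proportional share of the demand, capped by what can reach the sink.
    targets = {s: min(caps[s], demands[s] * total / wanted) for s in demands}
    return caps, total, targets

capacity = {"SOURCE": {"X": 8}, "X": {"SINK1": 5, "SINK2": 5}}
demands = {"SINK1": 5, "SINK2": 5}
caps, total, targets = fair_targets(capacity, "SOURCE", demands)
print(caps, total, targets)
```

On the example graph each sink alone can receive 5 U/s, the whole-graph maximum is 8 U/s, and the proportional target is 4 U/s per sink.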
I would like a simple way to represent the order of a list of objects. When an object changes position in that list I would like to update just one record. I don't know if this can be done but I'm interested to ask the SO hive...
Wish-list constraints
the algorithm (or data structure) should allow for items to be repositioned in the list by updating the properties of a single item
the algorithm (or data structure) should require no housekeeping to maintain the integrity of the list
the algorithm (or data structure) should allow for the insertion of new items or the removal of existing items
Why I care about only updating one item at a time...
[UPDATED to clarify question]
The use-case for this algorithm is a web application with a CRUDy, resourceful server setup and a clean (Angular) client.
It's good practice to keep to the pure CRUD actions where possible and makes for cleaner code all round. If I can do this operation in a single resource#update request then I don't need any additional serverside code to handle the re-ordering and it can all be done using CRUD with no alterations.
If more than one item in the list needs to be updated for each move then I need a new action on my controller to handle it. It's not a showstopper but it starts spilling over into Angular and everything becomes less clean than it ideally should be.
Example
Let's say we have a magazine and the magazine has a number of pages :
Original magazine
- double page advert for Ford (page=1)
- article about Jeremy Clarkson (page=2)
- double page advert for Audi (page=3)
- article by James May (page=4)
- article by Richard Hammond (page=5)
- advert for Volkswagen (page=6)
Option 1: Store integer page numbers
... in which we update up to N records per move
If I want to pull Richard Hammond's page up from page 5 to page 2 I can do so by altering its page number. However I also have to alter all the pages which it then displaces:
Updated magazine
- double page advert for Ford (page=1)
- article by Richard Hammond (page=2)(old_value=5)*
- article about Jeremy Clarkson (page=3)(old_value=2)*
- double page advert for Audi (page=4)(old_value=3)*
- article by James May (page=5)(old_value=4)*
- advert for Volkswagen (page=6)
* properties updated
However I don't want to update lots of records
- it doesn't fit my architecture
Let's say this is being done using javascript drag-n-drop re-ordering via Angular.js. I would ideally like to just update a value on the page which has been moved and leave the other pages alone. I want to send an http request to the CRUD resource for Richard Hammond's page saying that it's now been moved to the second page.
- and it doesn't scale
It's not a problem for me yet but at some point I may have 10,000 pages. I'd rather not update 9,999 of them when I move a new page to the front page.
Option 2: a linked list
... in which we update 3 records per move
If, instead of storing the page's position, I store the page that comes before it, then I reduce the number of updates from a maximum of N to 3.
Original magazine
- double page advert for Ford (id = ford, page_before = nil)
- article about Jeremy Clarkson (id = clarkson, page_before = ford)
- article by James May (id = captain_slow, page_before = clarkson)
- double page advert for Audi (id = audi, page_before = captain_slow)
- article by Richard Hammond (id = hamster, page_before = audi)
- advert for Volkswagen (id = vw, page_before = hamster)
again we move the cheeky hamster up...
Updated magazine
- double page advert for Ford (id = ford, page_before = nil)
- article by Richard Hammond (id = hamster, page_before = ford)*
- article about Jeremy Clarkson (id = clarkson, page_before = hamster)*
- article by James May (id = captain_slow, page_before = clarkson)
- double page advert for Audi (id = audi, page_before = captain_slow)
- advert for Volkswagen (id = vw, page_before = audi)*
* properties updated
This requires updating three rows in the database: the page we moved, the page just below its old position and the page just below its new position.
It's better but it still involves updating three records and doesn't give me the resourceful CRUD behaviour I'm looking for.
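As a rough illustration of where the three updates come from, here is a small Python sketch of the `page_before` scheme. The dictionary stands in for the database table, and `move_after` and `in_order` are hypothetical helper names, not part of any framework.

```python
def move_after(pages, item, new_prev):
    """Move `item` so it follows `new_prev`; return the ids whose record changed."""
    changed = {item}
    old_prev = pages[item]
    # 1. The page that used to follow `item` now follows item's old predecessor.
    for pid, prev in pages.items():
        if prev == item:
            pages[pid] = old_prev
            changed.add(pid)
    # 2. The page that used to follow `new_prev` now follows `item`.
    for pid, prev in pages.items():
        if prev == new_prev and pid != item:
            pages[pid] = item
            changed.add(pid)
    # 3. Finally, re-point the moved page itself.
    pages[item] = new_prev
    return changed

def in_order(pages):
    """Rebuild the display order by walking the links from the head."""
    next_of = {prev: pid for pid, prev in pages.items()}
    order, cur = [], next_of.get(None)
    while cur is not None:
        order.append(cur)
        cur = next_of.get(cur)
    return order

pages = {
    "ford": None,
    "clarkson": "ford",
    "captain_slow": "clarkson",
    "audi": "captain_slow",
    "hamster": "audi",
    "vw": "hamster",
}
changed = move_after(pages, "hamster", "ford")
print(sorted(changed))
print(in_order(pages))
```

The set of changed ids is the moved page plus its old and new followers, which is exactly why this cannot be expressed as a single resource#update.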
Option 3: Non-integer positioning
...in which we update only 1 record per move (but need to housekeep)
Remember though, I still want to update only one record for each repositioning. In my quest to do this I take a different approach. Instead of storing the page position as an integer I store it as a float. This allows me to move an item by slipping it between two others:
Original magazine
- double page advert for Ford (page=1.0)
- article about Jeremy Clarkson (page=2.0)
- double page advert for Audi (page=3.0)
- article by James May (page=4.0)
- article by Richard Hammond (page=5.0)
- advert for Volkswagen (page=6.0)
and then we move Hamster again:
Updated magazine
- double page advert for Ford (page=1.0)
- article by Richard Hammond (page=1.5)*
- article about Jeremy Clarkson (page=2.0)
- double page advert for Audi (page=3.0)
- article by James May (page=4.0)
- advert for Volkswagen (page=6.0)
* properties updated
Each time we move an item, we choose a value somewhere between the item above and below it (say by taking the average of the two items we're slipping between).
Eventually though you need to reset...
Whatever algorithm you use for inserting the pages into each other will eventually run out of decimal places since you have to keep using smaller numbers. As you move items more and more times you gradually move down the floating point chain and eventually need a new position which is smaller than anything available.
Every now and then you therefore have to do a reset to re-index the list and bring it all back within range. This is ok but I'm interested to see whether there is a way to encode the ordering which doesn't require this housekeeping.
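To put a number on "eventually", here is a small Python sketch that keeps slipping an item into the same gap by averaging, and counts how many moves succeed before the midpoint is no longer distinguishable from a neighbour. It assumes positions are stored as IEEE 754 double-precision floats (Python's `float`); a database column with less precision would give out sooner. The function name is invented for the example.

```python
def moves_until_exhausted(lo, hi):
    """Count how many times a new position fits strictly between lo and hi
    when every move lands in the gap just above `lo` (the worst case)."""
    moves = 0
    while True:
        mid = (lo + hi) / 2
        if mid <= lo or mid >= hi:
            # The average has collapsed onto a neighbour: time to re-index.
            return moves
        hi = mid  # the next move goes between `lo` and the item just placed
        moves += 1

exhausted_after = moves_until_exhausted(1.0, 2.0)
print(exhausted_after)
```

Starting from the gap between page 1.0 and page 2.0, this adversarial pattern runs out after 52 moves, one per bit of the double's fraction field.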
Is there an algorithm which requires only 1 update and no housekeeping?
Does an algorithm (or perhaps more accurately, a data encoding) exist for this problem which requires only one update and no housekeeping? If so, can you explain in plain English how it works (i.e. no reference to directed graphs or vertices...)? Muchas gracias.
UPDATE (post points-awarding)
I've awarded the bounty on this to the answer I found most interesting. Nobody was able to offer a solution (since from the looks of things there isn't one) so I've not marked any particular answer as correct.
Adjusting the no-housekeeping criterion
After having spent even more time thinking about this problem, it occurs to me that the housekeeping criterion should actually be adjusted. The real danger with housekeeping is not that it's a hassle to do but that it should ideally be robust to a client who has an outstanding copy of a pre-housekept set.
Let's say that Joe loads up a page containing a list (using Angular) and then goes off to make a cup of tea. Just after he downloads it the housekeeping happens and re-indexes all items (1000, 2000, 3000 etc.). After he comes back from his cup of tea, he moves an item from 1010 to 1011. There is a risk at this point that the re-indexing will place his item into a position it wasn't intended to go.
As a note for the future - any housekeeping algorithm should ideally be robust to items submitted across different housekept versions of the list too. Alternatively you should version the housekeeping and create an error if someone tries to update across versions.
Issues with the linked list
While the linked list requires only a few updates it's got some drawbacks too:
it's not trivial to deal with deletions from the list (and you may have to adjust your #destroy method accordingly)
it's not easy to order the list for retrieval
The method I would choose
Having seen all the discussion, I think I would choose the non-integer (or string) positioning:
it's robust to inserts and deletions
it works with a single update
It does however need housekeeping and as mentioned above, if you're going to be complete you will also need to version each housekeeping and raise an error if someone tries to update based on a previous list version.
You should add one more sensible constraint to your wish-list:
max O(log N) space for each item (N being total number of items)
For example, the linked-list solution satisfies this: you need at least N possible values for a pointer, so the pointer takes up log N space. If you don't have this limit, the trivial solution (growing strings) already mentioned by Lasse Karlsen and tmyklebu solves your problem, but the memory grows by one character (in the worst case) for each operation. You need some limit and this is a sensible one.
Then, hear the answer:
No, there is no such algorithm.
Well, this is a strong statement, and not easy to hear, so I guess proof is required :) I tried to figure out a general proof and posted a question on Computer Science Theory, but the general proof is really hard to do. To make it easier, let us explicitly assume there are two classes of solutions:
absolute addressing - address of each item is specified by some absolute reference (integer, float, string)
relative addressing - address of each item is specified relatively to other items (e.g. the linked list, tree, etc.)
Disproving the existence of an absolute addressing algorithm is easy. Just take 3 items, A, B, C, and keep moving the last one between the first two. You will soon run out of the possible combinations for the address of the moved element and will need more bits. You will break the constraint of the limited space.
Disproving the existence of a relative addressing algorithm is also easy. For any non-trivial arrangement, there are certainly two different positions that other items refer to. If you then move some item between these two positions, at least two items have to be changed: the one which referred to the old position and the one which will refer to the new position. This violates the constraint of only one item changed.
Q.E.D.
Don't be fascinated by complexity - it doesn't work
Now that we (and you) can admit your desired solution does not exist, why would you complicate your life with complex solutions that do not work? They can't work, as we proved above. I think we got lost here. Guys here spent immense effort just to end up with overly complicated solutions that are even worse than the simplest solution proposed:
Gene's rational numbers - they grow 4-6 bits in his example, instead of just 1 bit which is required by the most trivial algorithm (described below). 9/14 has 4 + 4 = 8 bits, 19/21 has 5 + 5 = 10 bits, and the resultant number 65/84 has 7 + 7 = 14 bits!! And if we just look at those numbers, we see that 10/14 or 2/3 are much better solutions. It can be easily proven that the growing string solution is unbeatable, see below.
mhelvens' solution - in the worst case he will add a new correcting item after each operation. This will for sure occupy much more than one bit more.
These guys are very clever but obviously cannot bring something sensible. Someone has to tell them - STOP, there's no solution, and what you do simply can't be better than the most trivial solution you are afraid to offer :-)
Go back to square one, go simple
Now, go back to the list of your restrictions. One of them must be broken, you know that. Go through the list and ask, which one of these is least painful?
1) Violate memory constraint
This is hard to violate infinitely, because you have limited space... so be prepared to also violate the housekeeping constraint from time to time.
The solution to this is the one already proposed by tmyklebu and mentioned by Lasse Karlsen: growing strings. Just consider binary strings of 0 and 1. You have items A, B and C and you are moving C between A and B. If there is no space between A and B, i.e. they look like
A xxx0
B xxx1
Then just add one more bit for C:
A xxx0
C xxx01
B xxx1
In worst case, you need 1 bit after every operation. You can also work on bytes, not bits. Then in the worst case, you will have to add one byte for every 8 operations. It's all the same. And, it can be easily seen that this solution cannot be beaten. You must add at least one bit, and you cannot add less. In other words, no matter how the solution is complex, it can't be better than this.
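Here is a minimal Python sketch of one way to generate such growing keys. It is a variant of the idea rather than the exact scheme above: each key is read as a binary fraction, and the new key is the exact midpoint of its neighbours, computed with `fractions.Fraction` so nothing is lost to rounding. The helper names are made up, and the sketch assumes keys never end in '0', so that plain string comparison agrees with numeric order.

```python
from fractions import Fraction

def to_fraction(key):
    """Read a string of '0'/'1' as the binary fraction 0.key."""
    return Fraction(int(key, 2), 2 ** len(key)) if key else Fraction(0)

def to_key(frac):
    """Inverse of to_fraction for a reduced fraction with a power-of-two denominator."""
    bits = frac.denominator.bit_length() - 1
    return format(frac.numerator, "b").zfill(bits)

def key_between(lo, hi):
    """Key strictly between lo and hi; None means 'no neighbour on that side'."""
    low = to_fraction(lo) if lo is not None else Fraction(0)
    high = to_fraction(hi) if hi is not None else Fraction(1)
    return to_key((low + high) / 2)

first = key_between(None, None)   # key for the first item
middle = key_between("01", "1")   # slip an item between two existing keys

# Worst case: keep inserting at the very front of the list.
front = "1"
for _ in range(10):
    front = key_between(None, front)
print(first, middle, front, len(front))
```

In the worst case shown at the end, every operation adds exactly one bit to the key, which matches the bound argued above.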
Pros:
only one item is updated per move
can compare any two elements, but slow
Cons:
comparing or sorting will get very very slow as the strings grow
there will be a housekeeping
2) Violate one item modified constraint
This leads to the original linked-list solution. Also, there are plenty of balanced tree data structures, which are even better if you need to look up or compare items (which you didn't mention).
These can go with 3 items modified. Balanced trees sometimes need more (when balance operations are needed), but as that is amortized O(1), in a long row of operations the number of modifications per operation is constant. In your case, I would use the tree solution only if you need to look up or compare items. Otherwise, the linked-list solution rocks. Throwing it out just because it needs 3 operations instead of 1? C'mon :)
Pros:
optimal memory use
fast generation of ordered list (one linear pass), no need to sort
fast operations
no housekeeping
Cons:
cannot easily compare two items. Can easily generate the order of all the items, but given two items randomly, comparing them will take O(N) for list and O(log N) for balanced trees.
3 modified items instead of 1 (... leaving it up to you how much of a "con" this is)
3) Violate "no housekeeping" constraint
These are the solutions with integers and floats, best described by Lasse Karlsen here. Also, the solutions from point 1) will fall here :). The key question was already mentioned by Lasse:
How often will housekeeping have to take place?
If you use k-bit integers, then from the optimal state, when items are spread evenly in the integer space, the housekeeping will have to take place every k - log N operations in the worst case. You might then use more or less sophisticated algorithms to restrict the number of items you "housekeep".
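The k - log N figure can be sketched as follows, assuming the worst case is a sequence of moves that keeps splitting the same gap in half. Here g_0 is the initial gap between neighbouring keys and g_m the gap left after m such moves; both symbols are introduced only for this derivation, and rounding is ignored.

```latex
g_0 = \frac{2^k}{N}, \qquad
g_m = \frac{g_0}{2^m}, \qquad
g_m \ge 1 \iff m \le \log_2 \frac{2^k}{N} = k - \log_2 N
```

For instance, with k = 32 and N = 1024 evenly spread items, roughly 32 - 10 = 22 adversarial moves fit before a renumbering is forced.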
Pros:
optimal memory use
fast operation
can compare any two elements
one item modified per operation
Cons:
housekeeping
Conclusion - hope never dies
I think the best way, and the answers here prove that, is to decide which one of those constraints is least painful to give up and just take one of those simple solutions formerly frowned upon.
But, hope never dies. When writing this, I realized that there would be your desired solution, if we just were able to ask the server!! Depends on the type of the server of course, but the classical SQL server already has the trees/linked-list implemented - for indices. The server is already doing the operations like "move this item before this one in the tree"!! But the server is doing based on the data, not based on our request. If we were able somehow to ask server to do this without the need to create perverse, endlessly growing data, that would be your desired solution! As I said, the server already does it - the solution is sooo close, but so far. If you can write your own server, you can do it :-)
@tmyklebu has the answer, but he never quite got to the punch line: The answer to your question is "no" unless you are willing to accept a worst case key length of n-1 bits to store n items.
This means that total key storage for n items is O(n^2).
There is an "adversary" information-theoretic argument that says no matter what scheme for assigning keys you choose for a database of n items, I can always come up with a series of n item re-positionings ("Move item k to position p.") that will force you to use a key with n-1 bits. Or by extension, if we start with an empty database, and you give me items to insert, I can choose a sequence of insertion positions that will require you to use at least zero bits for the first, one for the second, etc. indefinitely.
Edit
I earlier had an idea here about using rational numbers for keys. But it was more expensive than just adding one bit of length to split the gap between pairs of keys that differ by one. So I've removed it.
You can also interpret option 3 as storing positions as an unbounded-length string. That way you don't "run out of decimal places" or anything of that nature. Give the first item, say 'foo', position 1. Recursively partition your universe into "the stuff that's less than foo", which get a 0 prefix, and "the stuff that's bigger than foo", which get a 1 prefix.
This sucks in a lot of ways, notably that the position of an object can need as many bits to represent as you've done object moves.
I was fascinated by this question, so I started working on an idea. Unfortunately, it's complicated (you probably knew it would be) and I don't have time to work it all out. I just thought I'd share my progress.
It's based on a doubly-linked list, but with extra bookkeeping information in every moved item. With some clever tricks, I suspect that each of the n items in the set will require less than O(n) extra space, even in the worst case, but I have no proof of this. It will also take extra time to figure out the view order.
For example, take the following initial configuration:
A (-,B|0)
B (A,C|0)
C (B,D|0)
D (C,E|0)
E (D,-|0)
The top-to-bottom ordering is derived purely from the meta-data, which consists of a sequence of states (predecessor,successor|timestamp) for each item.
When moving D between A and B, you push a new state (A,B|1) to the front of its sequence with a fresh timestamp, which you get by incrementing a shared counter:
A (-,B|0)
D (A,B|1) (C,E|0)
B (A,C|0)
C (B,D|0)
E (D,-|0)
As you see, we keep the old information around in order to connect C to E.
Here is roughly how you derive the proper order from the meta-data:
You keep a pointer to A.
A agrees it has no predecessor. So insert A. It leads you to B.
B agrees it wants to be successor to A. So insert B after A. It leads you to C.
C agrees it wants to be successor to B. So insert C after B. It leads you to D.
D disagrees. It wants to be successor to A. Start recursion to insert it and find the real successor:
D wins from B because it has a more recent timestamp. Insert D after A. It leads you to B.
B is already D's successor. Look back in D's history, which leads you to E.
E agrees it wants to be successor to D with timestamp 0. So return E.
So the successor is E. Insert E after C. It tells you it has no successor. You are finished.
This is not exactly an algorithm yet, because it doesn't cover all cases. For example, when you move an item forwards instead of backwards. When moving B between D and E:
A (-,B|0)
C (B,D|0)
D (C,E|0)
B (D,E|1)(A,C|0)
E (D,-|0)
The 'move' operation is the same. But the algorithm to derive the proper order is a bit different. From A it will run into B, able to get the real successor C from it, but with no place to insert B itself yet. You can keep it in reserve as a candidate for insertion after D, where it will eventually match timestamps against E for the privilege of that position.
I wrote some Angular.js code on Plunker that can be used as a starting-point to implement and test this algorithm. The relevant function is called findNext. It doesn't do anything clever yet.
There are optimizations to reduce the amount of metadata. For example, when moving an item away from where it was recently placed, and its neighbors are still linked of their own accord, you won't have to preserve its newest state but can just replace it. And there are probably situations where you can discard all of an item's sufficiently old states (when you move it).
It's a shame I don't have time to fully work this out. It's an interesting problem.
Good luck!
Edit: I felt I needed to clarify the above-mentioned optimization ideas. First, there is no need to push a new history configuration if the original links still hold. For example, it is fine to go from here (moved D between A and B):
A (-,B|0)
D (A,B|1) (C,E|0)
B (A,C|0)
C (B,D|0)
E (D,-|0)
to here (then moved D between B and C):
A (-,B|0)
B (A,C|0)
D (B,C|2) (C,E|0)
C (B,D|0)
E (D,-|0)
We are able to discard the (A,B|1) configuration because A and B were still connected by themselves. Any number of 'unrelated' movements can come in between without changing that.
Secondly, imagine that eventually C and E are moved away from each other, so the (C,E|0) configuration can be dropped the next time D is moved. This is trickier to prove, though.
All of this considered, I believe there is a good chance that the list requires less than O(n+k) space (n being the number of items in the list, k being the number of operations) in the worst case; especially in the average case.
The way to prove any of this is to come up with a simpler model for this data-structure, most likely based on graph theory. Again, I regret that I don't have time to work on this.
Your best option is "Option 3", although "non-integer" doesn't necessarily have to be involved.
"Non-integer" can mean anything that has some kind of accuracy definition, which means:
Integers (you just don't use 1, 2, 3, etc.)
Strings (you just tuck on more characters to ensure the proper "sort order")
Floating point values (adding more decimal points, somewhat the same as strings)
In each case you're going to have accuracy problems. For floating point types, there might be a hard limit in the database engine, but for strings, the limit will be the amount of space you allow for this. Please note that your question can be understood to mean "with no limits", meaning that for such a solution to work, you really need infinite accuracy/space for the keys.
However, I think that you don't need that.
Let's assume that you initially allocate every 1000th index to each row, meaning you will have:
1000 A
2000 B
3000 C
4000 D
... and so on
Then you move as follows:
D up between A and B (gets index 1500)
C up between A and D (gets index 1250)
B up between A and C (gets index 1125)
D up between A and B (gets index 1062)
C up between A and D (gets index 1031)
B up between A and C (gets index 1015)
D up between A and B (gets index 1007)
C up between A and D (gets index 1004)
B up between A and C (gets index 1002)
D up between A and B (gets index 1001)
At this point, the list looks like this:
1000 A
1001 D
1002 B
1004 C
Now suppose you want to move C up between A and D.
This is currently not possible, so you're going to have to renumber some items.
Trying to update the minimum number of rows, you can get by with shifting D to 1002 and B to 1003 so that C can take 1001, and thus you get:
1000 A
1001 C
1002 D
1003 B
but now, if you want to move B up between A and C, you're going to renumber everything except A.
The question is this: How likely is it that you have this pathological sequence of events?
If the answer is very likely then you will have problems, regardless of what you do.
If the answer is seldom, then you might decide that the "problems" with the above approach are manageable. Note that renumbering more than one row will likely be the exception here, and you would get something like "amortized 1 row updated per move". Amortized means that you spread the cost of those occasions where you have to update more than one row out over all the other occasions where you don't.
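A small Python sketch of this "midpoint, renumber only when stuck" policy, replaying the pathological sequence from above. The dictionary stands in for the table, `move_between` is a hypothetical helper, and the fallback simply renumbers every row with the original spacing of 1000, which is the bluntest possible housekeeping. The sketch uses floor division, so its intermediate indexes differ slightly from the hand-worked list.

```python
GAP = 1000  # spacing used whenever rows are (re)numbered

def move_between(index, item, before, after):
    """Place `item` between the neighbours `before` and `after`.
    Returns how many rows had to be written."""
    lo, hi = index[before], index[after]
    mid = (lo + hi) // 2
    if lo < mid < hi:
        index[item] = mid  # the common case: one row updated
        return 1
    # No free integer left: put the item in place, then renumber everything.
    order = sorted((k for k in index if k != item), key=index.get)
    order.insert(order.index(after), item)
    for rank, key in enumerate(order, start=1):
        index[key] = rank * GAP
    return len(order)

index = {"A": 1000, "B": 2000, "C": 3000, "D": 4000}
moves = [("D", "A", "B"), ("C", "A", "D"), ("B", "A", "C")] * 3 + [("D", "A", "B")]
writes = [move_between(index, *move) for move in moves]
print(writes, sum(writes) / len(writes))
print(index)
```

Even against this adversarial sequence, nine of the ten moves touch a single row and the tenth touches four, so the average stays close to one row per move.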
What if you store the original order and don't change it after saving it once and then store the number of increments up the list or down the list?
Then by moving something up 3 levels you would store this action only.
In the database you can then order by a computed column.
First time insert:
ord1 | ord2 | value
-----+------+--------
1 | 0 | A
2 | 0 | B
3 | 0 | C
4 | 0 | D
5 | 0 | E
6 | 0 | F
Update order, move D up 2 levels
ord1 | ord2 | value | ord1 + ord2
-----+------+-------+-------------
1 | 0 | A | 1
2 | 0 | B | 2
3 | 0 | C | 3
4 | -2 | D | 2
5 | 0 | E | 5
6 | 0 | F | 6
Order by ord1 + ord2
ord1 | ord2 | value | ord1 + ord2
-----+------+-------+-------------
1 | 0 | A | 1
2 | 0 | B | 2
4 | -2 | D | 2
3 | 0 | C | 3
5 | 0 | E | 5
6 | 0 | F | 6
Order by ord1 + ord2 ASC, ord2 ASC
ord1 | ord2 | value | ord1 + ord2
-----+------+-------+-------------
1 | 0 | A | 1
4 | -2 | D | 2
2 | 0 | B | 2
3 | 0 | C | 3
5 | 0 | E | 5
6 | 0 | F | 6
Move E up 4 levels
ord1 | ord2 | value | ord1 + ord2
-----+------+-------+-------------
5 | -4 | E | 1
1 | 0 | A | 1
4 | -2 | D | 2
2 | 0 | B | 2
3 | 0 | C | 3
6 | 0 | F | 6
Something like relative ordering, where ord1 is the absolute order while ord2 is the relative order.
Along with the same idea of just storing the history of movements and sorting based on that.
Not tested, not tried, just wrote down what I thought at this moment, maybe it can point you in some direction :)
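For what it's worth, the sort from the tables above can be tried out with a few lines of Python. The tuples stand in for database rows and `display_order` is a made-up helper. This only reproduces the two moves shown and, like the idea itself, has not been checked against arbitrary sequences of moves.

```python
def display_order(rows):
    """Sort rows of (ord1, ord2, value) by ord1 + ord2, then by ord2."""
    ranked = sorted(rows, key=lambda row: (row[0] + row[1], row[1]))
    return [value for _, _, value in ranked]

# State after "move D up 2 levels".
rows = [(1, 0, "A"), (2, 0, "B"), (3, 0, "C"), (4, -2, "D"), (5, 0, "E"), (6, 0, "F")]
after_d = display_order(rows)

# Then "move E up 4 levels": only E's row changes.
rows[4] = (5, -4, "E")
after_e = display_order(rows)
print(after_d, after_e)
```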
I am unsure if you will call this cheating, but why not create a separate page list resource that references the page resources?
If you change the order of the pages you need not update any of the pages, just the list that stores the order of the IDs.
Original page list
[ford, clarkson, captain_slow, audi, hamster, vw]
Update to
[ford, hamster, clarkson, captain_slow, audi, vw]
Leave the page resources untouched.
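A minimal sketch of what the client would do before sending its single update of the list resource, in Python for brevity. The function name `move_page` is an assumption, not part of any framework.

```python
def move_page(page_ids, page_id, new_index):
    """Return a new ordering with `page_id` moved to position `new_index`."""
    reordered = [p for p in page_ids if p != page_id]
    reordered.insert(new_index, page_id)
    return reordered

page_list = ["ford", "clarkson", "captain_slow", "audi", "hamster", "vw"]
updated = move_page(page_list, "hamster", 1)
print(updated)
```

The whole list is rewritten on each move, but it is still one record and one request.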
You could always store the ordering permutation separately as a ln(num_records!)/ln(2) bit bitstring and figure out how to transform/CRUD that yourself so that you'd only need to update a single bit for simple operations, if updating 2/3 records is not good enough for you.
What about the following very simple algorithm:
(let's take the analogy with page numbers in a book)
If you move a page to become the "new" page 3, you now have "at least" one page 3, possibly two, or even more. So, which one is the "right" page 3?
Solution: the "newest". So, we make use of the fact that a record also has an "updated date/time", to determine who the real page 3 is.
If you need to represent the entire list in its right order, you have to sort with two keys, one for the page number, and one for the "updated date/time" field.
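A quick Python sketch of that two-key sort, using the magazine example from the question. The tuples and the integer "updated" counter are stand-ins for real records and timestamps. It only demonstrates a single move; I have not verified that the ordering stays as intended after many moves in both directions.

```python
# (title, page, updated): Hammond was just moved to page 2, so his record
# carries the newest "updated" value; nothing else was touched.
pages = [
    ("ford", 1, 0),
    ("clarkson", 2, 0),
    ("audi", 3, 0),
    ("may", 4, 0),
    ("hammond", 2, 1),
    ("vw", 6, 0),
]

# Page number ascending, then most recently updated first.
ordered = [title for title, page, updated in
           sorted(pages, key=lambda p: (p[1], -p[2]))]
print(ordered)
```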
Background:
I have read that many DBMSs use write-ahead logging to preserve atomicity and durability of transactions by storing updates as a group of write operations. What I'm trying to accomplish is to create a dbms model with improved concurrency by allowing reads to proceed on 'old' data while writes are pending.
Question:
Is there a data structure that allows me to efficiently (ideally O(1) amortized, at most O(log(n))) look up array elements (or memory locations, if you like), which may or may not have been overwritten by write actions, in reference to some point in time? This would be for about 1TB of data total.
Here is some ascii art to make this a little clearer. The dashes are data, with version 0 being the oldest version. The arrows indicate write operations.
^ ___________________________________Snapshot 2
| V | | V
| -- --- | | -------- Version 2
| | | __________________Snapshot 1
| V | | V
T| -------- | | --------- Version 1
I| | | ___________Snapshot 0
M| V V V V
E|------------------------------------- Version 0
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~>
SPACE/ADDRESS
Attempts at solution:
Let N be the data size, M be the number of versions, and P be the average number of updates per version.
The naive algorithm (searching each update) is O(M*P).
Dividing the data into buckets, updating only entire buckets, and searching a bitmask of buckets would be O(N/B*M), where B is bucket size, which isn't much better.
A Bloom filter seems like a good candidate at first glance, except that it requires more data than a simple bitmask of each memory location (which would be bad anyway, since it requires M*N/8 bytes to store.)
A standard hash table also comes to mind, but what would the key be?
Actually, now that I've gone to the trouble of writing this all up, I've thought of a solution that uses a binary search tree. I'll submit it as an answer in a bit, but it's still O(M*log2(P)) in space and time which is not ideal. See below.
The following is the best solution I could come up with, though it is still suboptimal.
The idea is to place each region into a binary search tree, one tree per version, where each inner node contains a memory location, and each leaf node is either Hit or Miss (and possibly lookup information), depending on if updated data exists there. This is O(P*log(P)) to construct for each version, and O(M*log(P)) to look up in.
This is suboptimal for two reasons:
The tree is balanced, but Misses are much more likely than Hits in practice, so it would make sense to put Miss nodes higher in the tree, or arrange nodes by their size. Some kind of Huffman coding comes to mind, but Huffman's algorithm does not preserve the search tree invariants.
It requires M trees (hence O(M*log(P)) lookup). Maybe there is some way to combine the trees.
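For illustration, here is a minimal sketch of the per-version lookup. It swaps the explicit binary search tree for a sorted list searched with `bisect`, which gives the same O(log(P)) search per version. The data model is an assumption on my part: each version is a list of non-overlapping half-open `(start, end)` address ranges, and the names `build_index` and `find_version` are made up for this example.

```python
from bisect import bisect_right

def build_index(regions):
    """Sort one version's updated regions (half-open [start, end)) by start."""
    regions = sorted(regions)
    starts = [start for start, _ in regions]
    return starts, regions

def find_version(indexes, addr, snapshot):
    """Return the newest version <= snapshot whose updates cover addr, or None."""
    for version in range(snapshot, -1, -1):      # newest first: up to M searches
        starts, regions = indexes[version]
        i = bisect_right(starts, addr) - 1       # last region starting at or before addr
        if i >= 0 and addr < regions[i][1]:      # Hit: addr lies inside that region
            return version
    return None                                  # Miss in every version

# Hypothetical data: version 0 holds everything, later versions patch parts of it.
indexes = [
    build_index([(0, 100)]),             # version 0
    build_index([(10, 20), (60, 70)]),   # version 1
    build_index([(15, 18), (40, 50)]),   # version 2
]
print(find_version(indexes, 16, 2))  # 2
print(find_version(indexes, 12, 2))  # 1
print(find_version(indexes, 30, 2))  # 0
```

The loop over versions is exactly the O(M) factor complained about above; the sketch does nothing to combine the per-version structures.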
I came across the following problem in Google Code Jam. The competition has ended now, so it's okay to talk about it: https://code.google.com/codejam/contest/2270488/dashboard#s=p3
Following an old map, you have stumbled upon the Dread Pirate Larry's secret treasure trove!
The treasure trove consists of N locked chests, each of which can only be opened by a key of a specific type. Furthermore, once a key is used to open a chest, it can never be used again. Inside every chest, you will of course find lots of treasure, and you might also find one or more keys that you can use to open other chests. A chest may contain multiple keys of the same type, and you may hold any number of keys.
You already have at least one key and your map says what other keys can be found inside the various chests. With all this information, can you figure out how to unlock all the chests?
If there are multiple ways of opening all the chests, choose the "lexicographically smallest" way.
For the competition there were two datasets, a small dataset with troves of at most 20 chests, and a large dataset with troves as big as 200 chests.
My backtracking branch-and-bound algorithm was only fast enough to solve the small dataset. What's a faster algorithm?
I'm not used to algorithm competitions, and one thing about this question bothered me: to cut branches in a general branch & bound algorithm, you need some idea of what the inputs will look like.
So I looked at some of the inputs provided in the small set. What happens in this set is that you end up on paths where you can no longer get any key of some type t: all the remaining keys of type t are in chests whose lock is also of type t, so you can never reach them.
So you could build the following cut criterion: if some chest with lock type t still has to be opened, all remaining keys of type t are inside such chests, and you no longer hold a key of type t, then you won't find a solution in this branch.
You can generalize the cut criterion. Consider a directed graph whose vertices are key types, with an edge from t1 to t2 if some still-closed chest with lock type t1 contains a key of type t2. If you hold a key of type t1, you can open one of the chests of that type and get at least one key to a chest reachable along an outgoing edge. If you follow a path, you know you can open at least one chest of each lock type on that path. But if there is no path to a vertex, you know you will never open a chest represented by that vertex.
That gives the cutting algorithm: compute all vertices reachable from the set of key types you currently hold. If any unreachable vertex still has closed chests, cut the branch (that is, backtrack).
This was enough to solve the large set, but I had to add the first condition you wrote:
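A rough sketch of that reachability cut, not the code I actually submitted. The function name `branch_is_dead` and the `(lock_type, keys_inside)` representation of a chest are assumptions made for this example.

```python
from collections import deque

def branch_is_dead(held_keys, closed_chests):
    """held_keys: iterable of key types currently in hand.
    closed_chests: list of (lock_type, keys_inside) for the unopened chests.
    Returns True if some closed chest's lock type can never be reached."""
    # Edge t1 -> t2 if some closed chest with lock t1 contains a key of type t2.
    edges = {}
    for lock, inside in closed_chests:
        edges.setdefault(lock, set()).update(inside)

    # Breadth-first search from every key type we hold.
    reachable = set(held_keys)
    queue = deque(reachable)
    while queue:
        t = queue.popleft()
        for nxt in edges.get(t, ()):
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)

    # Cut the branch if any closed chest needs an unreachable key type.
    return any(lock not in reachable for lock, _ in closed_chests)

# The only key of type 2 is locked behind a type-2 lock: dead branch.
print(branch_is_dead([1], [(1, []), (2, [2])]))  # True
print(branch_is_dead([1], [(1, [2]), (2, [])]))  # False
```

Note that this test ignores how many keys of each type exist, which is why the counting condition is still needed alongside it.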
    if any(required > keys_across_universe):
        return False
Otherwise, it wouldn't work. This means that my solution is weak when the number of keys is very close to the number of chests.
This cut condition is not cheap. It can actually cost O(N²). But it cuts so many branches that it is definitely worth it... provided the data sets are nice. (fair?)
Surprisingly this problem is solvable via a greedy algorithm. I, too, implemented it as a memoized depth-first search. Only afterwards did I notice that the search never backtracked, and there were no hits to the memoization cache. Only two checks on the state of the problem and the partial solution are necessary to know whether a particular solution branch should be pursued further. They are easily illustrated with a pair of examples.
First, consider this test case:
Chest Number  | Key Type To Open Chest   | Key Types Inside
--------------+--------------------------+------------------
1             | 2                        | 1
2             | 1                        | 1 1
3             | 1                        | 1
4             | 2                        | 1
5             | 2                        | 2
Initial keys: 1 1 2
Here there are a total of only two keys of type 2 in existence: one in chest #5, and one in your possession initially. However, three chests require a key of type 2 to be opened. We need more keys of this type than exist, so clearly it is impossible to open all of the chests. We know immediately that the problem is impossible. I call this key counting the "global constraint." We only need to check it once. I see this check is already in your program.
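As a sketch, the global constraint can be checked with two counters. The function name `enough_keys_exist` and the `(lock_type, keys_inside)` chest representation are my own choices for this example, not taken from either of our programs.

```python
from collections import Counter

def enough_keys_exist(initial_keys, chests):
    """chests: list of (lock_type, keys_inside).
    True if, for every key type, at least as many keys exist as chests need."""
    available = Counter(initial_keys)   # keys in hand at the start
    required = Counter()                # one key needed per chest
    for lock, inside in chests:
        required[lock] += 1
        available.update(inside)        # keys hidden in chests count too
    return all(available[t] >= n for t, n in required.items())

# The test case from the table above.
chests = [(2, [1]), (1, [1, 1]), (1, [1]), (2, [1]), (2, [2])]
print(enough_keys_exist([1, 1, 2], chests))  # False: three type-2 locks, two type-2 keys
```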
With just this check and a memoized depth-first search (like yours!), my program was able to solve the small problem, albeit slowly: it took about a minute. Knowing that the program wouldn't be able to solve the large input in sufficient time, I took a look at the test cases from the small set. Some test cases were solved very quickly while others took a long time. Here's one of the test cases where the program took a long time to find a solution:
Chest Number  | Key Type To Open Chest   | Key Types Inside
--------------+--------------------------+------------------
1             | 1                        | 1
2             | 6                        |
3             | 3                        |
4             | 1                        |
5             | 7                        | 7
6             | 5                        |
7             | 2                        |
8             | 10                       | 10
9             | 8                        |
10            | 3                        | 3
11            | 9                        |
12            | 7                        |
13            | 4                        | 4
14            | 6                        | 6
15            | 9                        | 9
16            | 5                        | 5
17            | 10                       |
18            | 2                        | 2
19            | 4                        |
20            | 8                        | 8
Initial keys: 1 2 3 4 5 6 7 8 9 10
After a brief inspection, the structure of this test case is obvious. We have 20 chests and 10 keys. Each of the ten key types will open exactly two chests. Of the two chests that are openable with a given key type, one contains another key of the same type, and the other contains no keys at all. The solution is obvious: for each key type, we have to first open the chest that will give us another key in order to be able to open the second chest that also requires a key of that type.
The solution is obvious to a human, but the program was taking a long time to solve it, since it didn't yet have any way to detect whether there were any key types that could no longer be acquired. The "global constraint" concerned the quantities of each type of key, but not the order in which they were to be obtained. This second constraint concerns instead the order in which keys can be obtained but not their quantity. The question is simply: for each key type I will need, is there some way I can still get it?
Here's the code I wrote to check this second constraint:
    # Verify that all needed key types may still be reachable
    def still_possible(chests, keys, key_req, keys_inside):
        keys = set(keys)        # set of key types currently in my possession
        chests = chests.copy()  # set of not-yet-opened chests
        # key_req is a dictionary mapping chests to the required key type
        # keys_inside is a dict of Counters giving the keys inside each chest

        def openable(chest):
            return key_req[chest] in keys

        # As long as there are chests that can be opened, keep opening
        # those chests and take the keys. Here the key is not consumed
        # when a chest is opened.
        # Lists are used instead of filter(): in Python 3 a filter object
        # is lazy and always truthy, and chests is modified inside the loop.
        openable_chests = [c for c in chests if openable(c)]
        while openable_chests:
            for chest in openable_chests:
                keys |= set(keys_inside[chest])
                chests.remove(chest)
            openable_chests = [c for c in chests if openable(c)]

        # If any chests remain unopened, then they are unreachable no
        # matter what order chests are opened, and thus we've gotten into
        # a situation where no solution exists.
        return not chests       # true iff chests is empty
If this check fails, we can immediately abort a branch of the search. After implementing this check, my program ran very fast, requiring something like 10 seconds instead of 1 minute. Moreover, I noticed that the number of cache hits dropped to zero, and, furthermore, the search never backtracked. I removed the memoization and converted the program from a recursive to an iterative form. The Python solution was then able to solve the "large" test input in about 1.5 seconds. A nearly identical C++ solution compiled with optimizations solves the large input in 0.25 seconds.
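To make the final iterative form concrete, here is a compact sketch of how the two checks fit together. It is a reconstruction for illustration, not my contest code: the names `all_reachable` and `solve`, and the dict-based input format, are assumptions. It relies on the greedy argument mentioned above rather than proving it.

```python
from collections import Counter

def all_reachable(keys, closed, lock, inside):
    """Second constraint: non-consuming flood fill over the closed chests."""
    have = {t for t, n in keys.items() if n > 0}
    remaining = set(closed)
    progress = True
    while progress:
        progress = False
        for c in list(remaining):
            if lock[c] in have:
                have.update(inside[c])
                remaining.discard(c)
                progress = True
    return not remaining

def solve(initial_keys, lock, inside):
    """lock[c]: key type that opens chest c; inside[c]: key types in chest c.
    Returns an opening order, or None if the chests cannot all be opened."""
    keys = Counter(initial_keys)
    closed = set(lock)
    # Global constraint: enough keys of every type must exist at all.
    total = Counter(initial_keys)
    for c in closed:
        total.update(inside[c])
    if any(total[t] < n for t, n in Counter(lock.values()).items()):
        return None
    order = []
    while closed:
        for c in sorted(closed):             # smallest chest number first
            if keys[lock[c]] == 0:
                continue
            trial = keys.copy()
            trial[lock[c]] -= 1              # the key is consumed for real here
            trial.update(inside[c])
            if all_reachable(trial, closed - {c}, lock, inside):
                keys = trial
                closed.remove(c)
                order.append(c)
                break
        else:
            return None                      # no chest keeps the rest reachable
    return order

# Hypothetical small instance: chest -> lock type, chest -> keys inside.
lock = {1: 1, 2: 1, 3: 2, 4: 3}
inside = {1: [], 2: [1, 3], 3: [], 4: [2]}
print(solve([1], lock, inside))  # [2, 1, 4, 3]
```

Trying chests in increasing order and committing to the first one that passes the check is what makes the result the "lexicographically smallest" order, assuming the greedy argument holds.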
A proof of the correctness of this iterative, greedy algorithm is given in the official Google analysis of the problem.
I was not able to solve this problem either. My algorithm was too slow at first; then I added some enhancements, but I guess I failed on something else:
As Valentin said, I counted the available keys to quickly discard tricky cases
Tried to ignore chests without keys inside on first hit, skipping to the next one
Skip solutions starting with higher chests
Check for "key-loops", if the available keys were not enough to open a chest (chest contained the key for itself inside)
Performance was good (<2 secs for the 25 small cases). I manually checked the cases and it worked properly, but I got an incorrect answer anyway :P