I think it's similar to offline data synchronization, but it doesn't have to be nearly as extreme as that.
So I'm looking for a way to merge two likely similar sets of data that are aware of their version (i.e., where they split) and of the set of CRUD actions that got them to their most recent versions. The difference is that the child set is probably off by one or many actions within a single version, while the authority set has multiple versions with around one delta per version.
Say you have two lists, List A and List B, and three versions: ab, abcde, and abk.
In version ab it is 1, 2, 3.
List A is version abcde.
In version abc it appended item 4.
In version abcd it moved item 1 to last place.
In version abcde it deleted item 3.
It looks like 2, 4, 1 in the latest version.
List B is version abk.
In version abk it appended item 4k.
It looks like 1, 2, 3, 4k in the latest version.
The goal is to synchronize List B with the authority, A, by sending what it did in version abk, and getting back a response with the deltas it needs to get from version abk to abcdef, where abcdef will be List A merged with List B.
Given the information above, how might the logic look for merging two lists using deltas based on their versions? Is there additional information needed to efficiently do a merge between deltas? Or what would be a good direction on this? I'm hoping to synchronize the two to a new version by having one send its deltas to the other and getting back deltas that bring the old list up to speed.
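To make it concrete, here is a rough sketch of the shape I'm imagining, assuming each delta is recorded as an (action, payload) tuple and the child rewinds to the common ancestor and replays the authority's deltas before its own (all names here are placeholders, not a real API):

```python
# Sketch only: assumes deltas are (action, payload) tuples and that conflicts
# (e.g. the child moving an item the authority deleted) don't happen.

def apply_delta(items, delta):
    """Apply a single CRUD delta and return the new list."""
    action, payload = delta
    items = list(items)
    if action == "append":
        items.append(payload)
    elif action == "delete":
        items.remove(payload)
    elif action == "move_to_end":
        items.remove(payload)
        items.append(payload)
    return items

def merge_on_authority(base, authority_deltas, child_deltas):
    """Replay the authority's deltas, then the child's, on the common base.

    The response to the child is the authority's deltas; the child is expected
    to rewind to the base version and replay them before re-applying its own
    deltas (a rebase), so both sides converge on the same merged list.
    """
    merged = base
    for delta in authority_deltas + child_deltas:
        merged = apply_delta(merged, delta)
    return merged, authority_deltas

# The example above: base version "ab" is [1, 2, 3].
base = [1, 2, 3]
authority_deltas = [("append", 4), ("move_to_end", 1), ("delete", 3)]  # ab -> abcde
child_deltas = [("append", "4k")]                                      # ab -> abk

merged, reply_for_child = merge_on_authority(base, authority_deltas, child_deltas)
print(merged)  # [2, 4, 1, '4k'] -- the merged "abcdef" version
```

The hard part this glosses over is transforming or rejecting child deltas that conflict with the authority's history, which is where operational-transform or CRDT techniques would come in.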
We recently stumbled upon this problem.
We have n different product types, where each product type might have an arbitrary number (>= 0) of products. We want to "sort" all the products (all products across all types) into one list; however, some product types are to be inserted into the list at a periodic interval.
I'd like to explain by using an example:
We have 3 different product types (ProdA, ProdB, ProdC), and 3 units of ProdA, 4 units of ProdB and 2 units of ProdC. ProdC is to be inserted in the list with a period of 2.
One ordering could be:
ProdA1
ProdA2
ProdC1
ProdA3
ProdB1
ProdC2
ProdB2
ProdB3
ProdC1
ProdB4
Note how, since ProdC is periodic and only has 2 units, its units repeat when we run out of ProdC units.
If one unit of ProdA and one unit of ProdB are deleted, we want the list to look like this:
ProdA1
ProdA2
ProdC1
ProdB1
ProdB2
ProdC2
ProdB3
We do not want to "recompute" the whole list.
My question is: Is there a general algorithm for doing this, which includes dynamic "resorting"?
Thanks!
No. Once the list is modified, the positions of the non-periodic ProdA and ProdB items change. So if the list is an array, its memory has changed; if the list is a linked list, the node pointers change.
I suggest using two lists: one with ProdA and ProdB, and the other with just ProdC. You can build the merged list when it's needed.
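For example, a rough sketch of building the merged view on demand might look like this (assuming "period of 2" means one ProdC item after every 2 non-ProdC items, cycling through ProdC when it runs out):

```python
from itertools import cycle

def merged_view(main_items, periodic_items, period):
    """Build the merged list on demand: after every `period` items from
    main_items, insert the next periodic item, cycling when exhausted."""
    result = []
    periodic = cycle(periodic_items) if periodic_items else None
    for i, item in enumerate(main_items, start=1):
        result.append(item)
        if periodic and i % period == 0:
            result.append(next(periodic))
    return result

main = ["ProdA1", "ProdA2", "ProdA3", "ProdB1", "ProdB2", "ProdB3", "ProdB4"]
print(merged_view(main, ["ProdC1", "ProdC2"], period=2))
# ['ProdA1', 'ProdA2', 'ProdC1', 'ProdA3', 'ProdB1', 'ProdC2',
#  'ProdB2', 'ProdB3', 'ProdC1', 'ProdB4']
```

Deleting one ProdA and one ProdB from the main list and rebuilding gives exactly the second list in the question, so only the small main list has to be maintained.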
Let's say I have a large sorted dataset (10+ MB, 650k+ rows) on node_a and a different dataset on node_b. There is no master version of the dataset, meaning that either node can have pieces which are not available to the other node. My goal is to have the content of node_a synchronized with the content of node_b. What is the most efficient way to do so?
Common sense solution would be:
node_a: Here's everything I have... (sends entire dataset)
node_b: Here's what you don't have... (sends missing parts)
But this solution is not efficient at all. It requires node_a to send the entire dataset (10+ MB) every time it attempts to synchronize.
Using a little more brainpower, I could partition the dataset, send only one part of the content at a time, and exchange the differences found between the first and last row of that part.
Can you think of any better solutions?
For a single synchronization:
Break the dataset up into arbitrary parts, hash each (with MD5, for example), and only send through the hash values instead of the whole data set. Then use a comparison of the hash values on the other side to determine what's not the same on each side, and send this through as appropriate.
If each part doesn't have a globally unique ID (i.e. a primary key that's guaranteed to be the same for the corresponding row on each side), you may need some metadata sent across as well, or you can send hashes of parts incrementally, determining the difference as you go and changing what you send if required (e.g. send the hash of 10 rows at a time; if a row is missing, the hashes will mismatch, so either cater for this on the receiver side or offset the sender by one row). How exactly this should be done will depend on what your data looks like.
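As a rough sketch of the single-synchronization idea, assuming the rows do have numeric primary keys so they can be bucketed by key range (the bucket width is an arbitrary choice):

```python
import hashlib
from collections import defaultdict

def bucket_hashes(rows, bucket_width=100):
    """Group (key, value) rows into buckets by key range and hash each bucket,
    so only the small hashes cross the wire instead of the rows themselves.
    Bucketing by key rather than by position keeps the buckets aligned on
    both nodes even when one side is missing a few rows."""
    buckets = defaultdict(list)
    for key, value in sorted(rows):
        buckets[key // bucket_width].append((key, value))
    return {b: hashlib.md5(repr(content).encode()).hexdigest()
            for b, content in buckets.items()}

def differing_buckets(local_hashes, remote_hashes):
    """Bucket ids whose hashes differ or that exist on only one side;
    only these buckets' rows need to be exchanged in full."""
    all_ids = set(local_hashes) | set(remote_hashes)
    return sorted(b for b in all_ids
                  if local_hashes.get(b) != remote_hashes.get(b))

# Each node computes its hashes locally, the nodes exchange only the hash
# dictionaries, and then they send full rows just for the differing buckets.
```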
For repeated synchronization:
A good idea might be to create a master version, and store this separately on one of the nodes, although this probably isn't necessary if you don't care about conflicts or being able to revert mistakes.
With or without a master version, you can use versioning here. Store the version of last synchronize, and store a version on each part. When synchronizing, just send the parts with a version higher than the last synchronize version.
As an alternative to a globally auto-incremented version, you could either use a timestamp as the version, or just have a modified flag on each part, setting it when modified, sending all parts with their flag set, and resetting the flags once synchronized.
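A minimal sketch of the repeated-synchronization idea, assuming each part carries a monotonically increasing version number (a timestamp or a modified flag would slot in the same way):

```python
def parts_to_send(parts, last_sync_version):
    """Select only the parts modified since the last synchronization."""
    return [p for p in parts if p["version"] > last_sync_version]

parts = [
    {"id": 1, "version": 3, "data": "..."},
    {"id": 2, "version": 7, "data": "..."},
    {"id": 3, "version": 5, "data": "..."},
]
print(parts_to_send(parts, last_sync_version=4))  # only parts 2 and 3 go across
```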
Let's say I have two sources: A and B. For example, both are disparate data stores for storing TODO lists.
How do I build an algorithm for an operation that ensures both sources are synced?
Do I just copy A to B and then copy B to A, eliminating duplicates (assuming there is a primary key ID to eliminate duplicates)?
For the items of both lists you should store the time of the last sync.
During the next sync you work only with the sublists of items which appeared after the last sync time.
Yes, for these sublists a simple two-sided (or n-sided) join will be enough.
The n-sided sync is more interesting. A better way is to create a star system, where each sync is done between an end list and the core list. The core list could be the one on the server; the end lists are the ones set and shown by the UI.
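A minimal sketch of the two-sided case, assuming each item has a primary key id and a modified timestamp, with the newer modification winning on a conflict (a deliberate simplification):

```python
def sync(list_a, list_b, last_sync):
    """Two-sided sync of item dicts keyed by id, using a last-sync timestamp.
    Only items modified after last_sync are considered; on a conflict the
    more recently modified copy wins."""
    changed_a = {k: v for k, v in list_a.items() if v["modified"] > last_sync}
    changed_b = {k: v for k, v in list_b.items() if v["modified"] > last_sync}
    for k, v in changed_a.items():
        if k not in changed_b or changed_b[k]["modified"] < v["modified"]:
            list_b[k] = v
    for k, v in changed_b.items():
        if k not in changed_a or changed_a[k]["modified"] < v["modified"]:
            list_a[k] = v
    return list_a, list_b
```

For the star topology, the same routine is simply run between the core list and each end list in turn.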
Since I am unsure how to phrase the question I will illustrate it with an example that is very similar to what I am trying to achieve.
I am looking for a way to optimize the amount of time it takes to perform the following task.
Suppose I have three sets of numbers labeled "A", "B", and "C", each set containing an arbitrary number of integers.
I receive a stack of orders that ask for a "package" of numbers, each order asking for a particular combination of integers, one from each set. So an order might look like "A3, B8, C1", which means I will need to grab a 3 from set A, an 8 from set B, and a 1 from set C.
The task is simple: grab an order, look at the numbers, then go collect them and put them together into a "package".
It takes a while for me to collect the numbers, and oftentimes an order comes in asking for the same numbers as a previous order, so I decide to store all of the packages for later retrieval; this way, the amount of time it takes for me to process a duplicate order is dramatically reduced, rather than having to go and collect the same numbers again.
The amount of time it takes to collect a number is quite long, but not as long as examining each package one by one, if I have a lot of orders that day.
So for example if I have the following sets of numbers and orders
set A: [1, 2, 3]
set B: [4, 5, 6, 12, 18]
set C: [7, 8]
Order 1: A1, B6, C7
Order 2: A3, B5, C8
Order 3: A1, B6, C7
I would put together packages for orders 1 and 2, but then I notice that order 3 is a duplicate order so I can choose to just take the package I put together for the first order and finish this last order quickly.
The goal is to optimize the amount of time taken to process a stack of orders. Currently I have come up with two methods, but perhaps there are more ways to do things:
1. Gather the numbers for each order, regardless of whether it's a duplicate or not. I end up with a lot of packages in the end, and in extreme cases where someone places a bulk order for 50 identical packages, it's clearly a waste of time.
2. Check whether the package already exists in a cache, perhaps using some sort of hashing method on the orders.
Any ideas?
There is not much detail given about how you fetch the data to compose packages, etc. This makes it hard to come up with different solutions to your problem. For example, maybe existing packages could lead you to the data you need to compose new packages, even though they differ in one way or another. For this there are actually dedicated hashing methods available, like Locality-Sensitive Hashing.
Given the two approaches you came up with, it sounds very natural to go for route 2. Hashing the indices sounds trivial (the first order is easily identified by the number 167, or the string "167", right?), so you would have no real drawback from using a hash. Maybe memory constraints, since you need to keep old packages around. There are also common methods out there to decide which packages to keep in the (hashed) cache and which ones to throw away.
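For instance, a sketch of route 2 in which the order tuple itself is used as the cache key (Python hashes the tuple for you; `collect` stands in for whatever slow lookup actually gathers a number):

```python
package_cache = {}

def fulfil_order(order, collect):
    """Return a package for `order`, reusing a cached one when the exact
    combination has been assembled before. `order` is a tuple such as
    ("A1", "B6", "C7"); `collect` is the slow function that gathers one item."""
    if order not in package_cache:
        package_cache[order] = [collect(item) for item in order]
    return package_cache[order]
```

A bulk order of 50 identical packages then costs one collection pass and 49 dictionary lookups; if memory becomes a concern, functools.lru_cache offers a ready-made keep-or-evict policy for the same pattern.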
Without knowing the exact timings it is not possible to be definitive, but it looks to me as if your idea 2, using some sort of hash table to store previous orders, is the way to go.
I'm looking for a scheme for assigning keys to rows in a table that would allow rows to be moved around and assigned new locations in the table without having to renumber the entire table.
Something like having keys 1, 2, 3, 4, then moving row "2" between 3 and 4 and renaming it "3.5" (so you end up with 1, 3, 3.5, 4). But the scheme needs to be "infinitely" extensible (permitting at least a few thousand "random" row moves before it would normally be necessary to "normalize" the keys, with the worst, most pathological, case allowing 25-50 such moves).
And the keys produced should be easily sorted; ideally I'd like them to be "naturally" ordered for a database (assume SQLite) query.
Any ideas?
This problem reminds me of the line-numbering problem when a person was writing code in BASIC. What most people did in this situation was take an educated guess at how many lines might be inserted between two lines. Then that guess would be the spacing between those lines. So if you think you might have 2000 inserts between two elements, you might give element1 a key of 2000 and element2 a key of 4000. Then, when you want to put an element between element1 and element2, you either naively split the difference (3000) or, if you have some intuition about how many elements would go on each side of element3, you might weight it somewhat (i.e. 3500 instead of 3000).
Another alternative (it's really just the same thing, but using a different numbering system) is to use floating-point numbers, which I believe you alluded to. Between 1 and 2 would be 1.5. Between 1.5 and 2 would be 1.75. Between 1.5 and 1.75 would be 1.625, etc.
I would recommend against a key that is a string. It is better to stick with numeric keys, and on top of that it is probably better to have integer type keys rather than floating point type keys if you can help it.
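A small sketch of the integer-gap scheme under those recommendations (the spacing and the renumbering fallback are arbitrary choices, not part of any standard API):

```python
SPACING = 1 << 32  # initial gap; with 64-bit keys this survives roughly 32
                   # back-to-back insertions at one spot before renumbering

def key_between(low, high):
    """Pick an integer key strictly between low and high, or None if the
    gap is exhausted and the keys need to be normalized."""
    if high - low < 2:
        return None
    return (low + high) // 2

def renumber(ordered_row_ids):
    """Reassign evenly spaced keys to the rows, in their current order."""
    return {row_id: SPACING * (i + 1) for i, row_id in enumerate(ordered_row_ids)}

# Rows start with keys SPACING, 2*SPACING, 3*SPACING, ...
# Moving a row between the first and second slot gives it
# key_between(SPACING, 2 * SPACING), and an ORDER BY on an INTEGER column
# in SQLite keeps everything naturally sorted.
```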
Conceptually, you could treat your table like a linked list. Create a table with a unique ID, the key, its next node, and whatever other data you want. Simply insert items sequentially; when you need to put a new item in between, simply swap the key values and the associated parent nodes. The key values won't remain consistent, but that is what the additional unique ID is for, and this works fine for ordering by the key as well.
Really, since you have order already specified by the key, you don't even need the 'next node'. Your scheme as described above should be fine as long as you rename the keys of the other nodes in addition to the one you moved - i.e., 2 and 3 get their key values swapped.