What's the standard algorithm for syncing two lists of related objects?

I'm pretty sure this must be in some kind of text book (or more likely in all of them) but I seem to be using the wrong keywords to search for it... :(
A recurring task I'm facing while programming is that I am dealing with lists of objects from different sources which I need to keep in sync somehow. Typically there's some sort of "master list" e.g. returned by some external API and then a list of objects I create myself each of which corresponds to an object in the master list (think "wrappers" or "adapters" - they typically contain extended information about the external objects specific to my application and/or they simplify access to the external objects).
Hard characteristics of all instances of the problem:
- the implementation of the master list is hidden from me; its interface is fixed
- the elements in the two lists are not assignment-compatible
- I have full control over the implementation of the slave list
- I cannot control the order of elements in the master list (i.e. it's not sortable)
- the master list either does not provide notification about added or removed elements at all, or the notification is unreliable, i.e. the sync can only happen on-demand, not live
- simply clearing and rebuilding the slave list from scratch whenever it's needed is not an option:
  - initializing the wrapper objects should be considered expensive
  - other objects will hold references to the wrappers
Additional characteristics in some instances of the problem:
- elements in the master list can only be identified by reading their properties rather than accessing them directly by index or memory address:
  - after a refresh, the master list might return a completely new set of instances even though they still represent the same information
  - the only interface for accessing elements in the master list might be a sequential enumerator
- most of the time, the order of elements in the master list is stable, i.e. new elements are always added either at the beginning or at the end, never in the middle; however, deletion can usually occur at any position
So how would I typically tackle this? What's the name of the algorithm I should google for?
In the past I have implemented this in various ways (see below for an example) but it always felt like there should be a cleaner and more efficient way, especially one that did not require two iterations (one over each list).
Here's an example approach:
1. Iterate over the master list
2. Look up each item in the "slave list"
3. Add items that do not yet exist
4. Somehow keep track of items that already exist in both lists (e.g. by tagging them or keeping yet another list)
5. When done, iterate over the slave list, remove all objects that have not been tagged (see 4.) and clear the tag again from all others
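For illustration, here is roughly what those steps might look like in Python (a sketch only: I'm assuming each master item exposes some stable identifying property, called key here, and keeping the slave side as a dict keyed by that identity; make_wrapper stands for whatever factory builds the wrapper objects):

```python
def sync(master, slave_by_key, make_wrapper):
    """Mark-and-sweep sync of a dict of wrappers against a master list.

    master        -- iterable of external objects exposing a stable .key
    slave_by_key  -- dict mapping key -> wrapper (modified in place)
    make_wrapper  -- factory creating a wrapper for a new master item
    """
    seen = set()
    for item in master:
        k = item.key
        seen.add(k)                    # step 4: "tag" items present in the master
        if k not in slave_by_key:      # step 3: add wrappers that don't exist yet
            slave_by_key[k] = make_wrapper(item)
    for k in list(slave_by_key):       # step 5: sweep wrappers that were not tagged
        if k not in seen:
            del slave_by_key[k]
```

Existing wrappers survive the sync untouched, so references held by other objects stay valid.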
Update 1
Thanks for all your responses so far! I will need some time to look at the links.
[...] (text moved to main body of question)
Update 2
Restructured the middle paragraph into (hopefully) more easily parseable bullet lists and incorporated details added later in the first update.

The 2 typical solutions are:
1. Copy the master list to the sync list.
2. Do an O(N*N) comparison between all element pairs.
You've excluded the smart options already: shared identity, sorting and change notifications.
Note that it's not relevant whether the lists can be sorted in a meaningful way, or even completely. For instance, when comparing two string lists, it would be ideal to sort alphabetically. But the list comparison would still be more efficient if you sorted both lists by string length! You'd still have to do a full pairwise comparison of strings of the same length, but that will probably be a much smaller number of pairs.
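To make the string-length idea concrete, here is a small Python sketch (my own illustration): bucketing one list by length means each string from the other list is only compared against same-length candidates.

```python
from collections import defaultdict

def common_strings(xs, ys):
    """Find strings of ys also present in xs, comparing only same-length pairs."""
    by_len = defaultdict(list)
    for s in xs:
        by_len[len(s)].append(s)       # bucket by the partial sort key: length
    result = []
    for t in ys:
        # only strings of equal length can possibly match
        if any(t == s for s in by_len.get(len(t), ())):
            result.append(t)
    return result
```

The pairwise work shrinks from len(xs) * len(ys) comparisons to the sum over the length buckets, which is usually much smaller.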

This looks like the set reconciliation problem i.e. the problem of synchronizing unordered data. A question on SO was asked on this: Implementation of set reconciliation algorithm.
Most of the references on Google are to technical paper abstracts.

Often the best solution to such problems is to not solve them directly.
If you really can't use a sorted, binary-searchable container in your part of the code (like a set or even a sorted vector) then...
Are you very memory bound? If not, then I'd just create a dictionary (an std::set for example) containing the contents of one of the lists and then just iterate over the second one, which I want to sync with the first.
This way you're doing O(n log n) work to create the dictionary (or O(n) for a hash dictionary, depending on which will be more efficient) plus O(m log n) operations to go over the second list and sync it (or just O(m) with the hash). That's hard to beat if you really have to use lists in the first place. It's also good that you do it only once, when and if you need it, and it's way better than keeping the lists sorted all the time, which would be an O(n^2) task for both of them.
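A Python sketch of this one-shot dictionary approach (names are my own; I'm assuming the master items are hashable identifiers and the wrappers expose them as an ident attribute):

```python
def sync_with_master(master, slave, wrap):
    """One-shot sync: build a lookup over the master, then fix up the slave.

    master -- list of hashable identifiers from the external source
    slave  -- list of wrapper objects exposing .ident (modified in place)
    wrap   -- factory producing a wrapper from a master identifier
    """
    master_set = set(master)                  # the "dictionary": O(n) to build
    known = {w.ident for w in slave}
    # drop wrappers whose master item disappeared
    slave[:] = [w for w in slave if w.ident in master_set]
    # add wrappers for new master items, in master order
    for ident in master:
        if ident not in known:
            slave.append(wrap(ident))
```

Each lookup against the set is O(1) on average, so the whole sync is linear in the combined list sizes.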

It looks like a fellow named Michael Heyeck has a good, O(n) solution to this problem. Check out that blog post for an explanation and some code.
Essentially, the solution walks the master and slave lists in a single pass, tracking an index into each. Two data structures are then produced: a list of insertions to be replayed on the slave list, and a list of deletions.
It looks straightforward and also has the benefit of a proof of minimalism, which Heyeck followed up with in a subsequent post. The code snippet in this post is more compact, as well:
def sync_ordered_list(a, b):
    x = 0; y = 0; i = []; d = []
    while (x < len(a)) or (y < len(b)):
        if y >= len(b): d.append(x); x += 1
        elif x >= len(a): i.append((y, b[y])); y += 1
        elif a[x] < b[y]: d.append(x); x += 1
        elif a[x] > b[y]: i.append((y, b[y])); y += 1
        else: x += 1; y += 1
    return (i, d)
Again, credit to Michael Heyeck.
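As a follow-up of my own (not from Heyeck's post): one way to replay the returned (inserts, deletes) pair and turn a into b. The deletes must be applied in reverse order so that earlier indices remain valid.

```python
def apply_edits(a, edits):
    """Replay (inserts, deletes) from sync_ordered_list onto a copy of a."""
    inserts, deletes = edits
    out = list(a)
    for x in reversed(deletes):      # reverse order keeps remaining indices valid
        del out[x]
    for y, value in inserts:         # target positions, in ascending order
        out.insert(y, value)
    return out
```

In the real sync you would, of course, delete and create wrapper objects at those positions instead of mutating a plain list.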

In the C++ STL the algorithm is called set_union. Also, implementing the algorithm is likely to be a lot simpler if you do the union into a 3rd list.

I had such a problem in one project in the past.
That project had one master data source and several clients that update the data independently; in the end, all of them have to have the latest and unified set of data that is the sum of them.
What I did was build something similar to the SVN protocol: every time I wanted to update the master database (which was accessible through a web service) I got the revision number, updated my local data store to that revision, and then committed the entities that weren't covered by any revision number to the database.
Every client has the ability to update its local data store to the latest revision.
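A toy Python model of the revision-number scheme, to make the flow concrete (all class and method names here are my own invention, not the actual project's API):

```python
class RevisionStore:
    """Master data source keeping a log of (revision, entity) pairs."""
    def __init__(self):
        self.log = []
        self.revision = 0

    def commit(self, entities):
        """Stamp each entity with a fresh revision; return the latest revision."""
        for e in entities:
            self.revision += 1
            self.log.append((self.revision, e))
        return self.revision

    def changes_since(self, revision):
        """Everything committed after the given revision."""
        return [e for rev, e in self.log if rev > revision]

class Client:
    def __init__(self):
        self.data = []
        self.known_revision = 0
        self.pending = []              # local entities not yet committed

    def sync(self, store):
        # pull what others committed, then push our own pending entities
        self.data.extend(store.changes_since(self.known_revision))
        self.known_revision = store.commit(self.pending)
        self.data.extend(self.pending)
        self.pending = []
```

After each sync, every client converges on the union of all committed entities, exactly the "sum of them" described above.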

Here is a JavaScript version of Michael Heyeck's Python code.
var b= [1,3,8,12,16,19,22,24,26]; // new situation
var a = [1,2,8,9,19,22,23,26]; // previous situation
var result = sync_ordered_lists(a,b);
console.log(result);
function sync_ordered_lists(a, b) {
    // by Michael Heyeck, see http://www.mlsite.net/blog/?p=2250
    // a is the subject list
    // b is the target list
    // x is the "current position" in the subject list
    // y is the "current position" in the target list
    // i is the list of inserts
    // d is the list of deletes
    var x = 0;
    var y = 0;
    var i = [];
    var d = [];
    var acc = {}; // object containing inserts and deletes arrays
    while (x < a.length || y < b.length) {
        if (y >= b.length) {
            d.push(x);
            x++;
        } else if (x >= a.length) {
            i.push([y, b[y]]);
            y++;
        } else if (a[x] < b[y]) {
            d.push(x);
            x++;
        } else if (a[x] > b[y]) {
            i.push([y, b[y]]);
            y++;
        } else {
            x++; y++;
        }
    }
    acc.inserts = i;
    acc.deletes = d;
    return acc;
}

A very brute-force, purely technical approach:
Inherit from your List class (sorry, I don't know what your language is). Override the add/remove methods in your child list class. Use your class instead of the base one. Now you can track changes with your own methods and synchronize the lists on-line.
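In Python, for example, the idea might look like this (a minimal sketch; a complete version would also have to override extend, insert, pop, __delitem__, etc.):

```python
class TrackedList(list):
    """A list that records additions and removals so a sync can replay them."""
    def __init__(self, *args):
        super().__init__(*args)
        self.added = []
        self.removed = []

    def append(self, item):
        super().append(item)
        self.added.append(item)

    def remove(self, item):
        super().remove(item)
        self.removed.append(item)

    def drain_changes(self):
        """Return the pending (added, removed) log and clear it."""
        changes = (self.added, self.removed)
        self.added, self.removed = [], []
        return changes
```

The sync step then consumes drain_changes() instead of diffing the two lists.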

Related

Using redis to store a structured event log

I'm a bit new to Redis, so please forgive if this is basic.
I'm working on an app that sends automatic replies to users for certain events. I would like to use Redis to store who has received what event.
Essentially, in ruby, the data structure could look like this where you have a map of users to events and the dates that each event was sent.
{
  "mary@example.com" => {
    "sent_comment_reply" => ["12/12/2014", "3/6/2015"],
    "added_post_reply" => ["1/4/2006", "7/1/2016"]
  }
}
What is the best way to represent this in a Redis data structure so you can ask, did Mary get a sent_comment_reply? and if so, when was the latest?
In short, the question is: how (if possible) can you have a Hash structure that holds an array in Redis?
The rationale, as opposed to using a set or list with a compound key, is that hashes have O(1) lookup time, whereas lookups on lists (LRANGE) are O(s+n) and on sets (SMEMBERS) O(n).
One way of structuring it in Redis, assuming you know the user's events and you want the latest ones to be fresh in memory:
A sorted set per user. The members of the sorted set will be event codes (sent_comment_reply, added_post_reply) with the timestamp of the latest event as the highest score. You can use ZRANK to answer the question:
Did Mary get a sent_comment_reply?
Also a hash per user. This time the field is the event (sent_comment_reply) and the value is its content, updated with the latest occurrence including the body, date, etc. This answers the question:
and if so, when was the latest?
Note: sorted sets are really fast, and in this example we are depending on the events as the data.
With sorted sets you can add, remove, or update elements in a very fast way (in a time proportional to the logarithm of the number of elements). Since elements are taken in order and not ordered afterwards, you can also get ranges by score or by rank (position) in a very fast way. Accessing the middle of a sorted set is also very fast, so you can use Sorted Sets as a smart list of non repeating elements where you can quickly access everything you need: elements in order, fast existence test, fast access to elements in the middle!
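With redis-py, the two structures might be written roughly like this (a sketch only; the key names "events:<user>" and "details:<user>" and the payload format are my own choices, and r is assumed to be a connected Redis client):

```python
import time

def record_event(r, user, event, payload):
    """Record an event: latest timestamp as the sorted-set score,
    latest payload in a companion hash (one of each per user)."""
    now = time.time()
    r.zadd("events:" + user, {event: now})       # latest occurrence wins
    r.hset("details:" + user, event, payload)

def latest_event_time(r, user, event):
    """Timestamp of the user's latest such event, or None if never sent."""
    return r.zscore("events:" + user, event)
```

ZSCORE gives "when was the latest" directly from the sorted set; the hash holds the full content of that latest occurrence.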
A possible approach to use a hash to map an array is as follows:
add_element(key, value):
    len := redis.hlen(key)
    redis.hset(key, len, value)
This maps element array[i] to field i of the hash at key.
This will work for some cases, but I would probably go with the answer suggested in https://stackoverflow.com/a/34886801/2868839

Condense nested for loop to improve processing time with text analysis python

I am working on an untrained classifier model. I am working in Python 2.7. I have a loop. It looks like this:
features = [0 for i in xrange(len(dictionary))]
for bgrm in new_scored:
    for i in xrange(len(dictionary)):
        if bgrm[0] == dictionary[i]:
            features[i] = int(bgrm[1])
            break
I have a "dictionary" of bigrams that I have collected from a data set containing customer reviews, and I would like to construct feature arrays for each review corresponding to the dictionary I have created. Each array would contain the frequencies of the dictionary bigrams found within the review (I hope that makes sense). new_scored is a list of tuples which contains the bigrams found within a particular review paired with their relative frequency of occurrence in that review. The final feature arrays will be the same length as the original dictionary, with few non-zero entries.
The above works fine, but I am looking at a data set of 13000 reviews; looping through this code for each review is going to take forever (if my computer doesn't run out of RAM first). I have been sitting with it for a while and cannot see how I can condense it.
I am very new to Python, so I was hoping someone more experienced could help with condensing it, or perhaps point me in the right direction towards a library that contains the function I need.
Thank you in advance!
Consider making dictionary an actual dict object (or some fancier subclass of dict if it better suits your needs), as opposed to an iterable (list or tuple seems like what it is now). dictionary could map bigrams as keys to an integer identifier that would identify a feature position.
If you refactor dictionary that way, then the loop can be rewritten as:
features = [0 for key in dictionary]
for bgram in new_scored:
    try:
        features[dictionary[bgram[0]]] = int(bgram[1])
    except KeyError:
        pass  # do something if the bigram is not in the dictionary for some reason
This should convert what was an O(n) traversal through dictionary into an O(1) hash lookup.
Hope this helps.
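For completeness, here is one way the refactor could look end-to-end (my own sketch; bigram_list stands for the original list called dictionary in the question):

```python
def build_index(bigram_list):
    """Map each bigram to its position in the original feature order."""
    return {bigram: pos for pos, bigram in enumerate(bigram_list)}

def make_features(index, new_scored):
    """Build the feature vector in one pass over the review's bigrams."""
    features = [0] * len(index)
    for bigram, freq in new_scored:
        pos = index.get(bigram)        # O(1) lookup instead of scanning the list
        if pos is not None:
            features[pos] = int(freq)
    return features
```

Building the index is done once; each of the 13000 reviews then costs only one pass over its own bigrams.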

Hashing table design in C

I have a design issue regarding a HASH function.
In my program I am using a hash table of size 2^13, where the slot is calculated based on the value of the node (the hash key) which I want to insert.
Now, say each of my nodes has two values |A|B|, but I am inserting values into the hash table using A.
Later on, I want to search for a particular node by B, not A.
Is it possible to do that? If yes, could you highlight some design approaches?
The constraint is that I have to use A as the hash key.
Sorry, I can't share the code. Small example:
Value[] = {Part1, Part2, Part3};
insert(value)
check_for_index(value.part1)
value.part1 to be used to calculate the index of the slot.
Once slot is found then insert the "value"
Later on,
search_in_hash(part2)
check_for_index("But here I need the value.part1 to check for slot index")
So, how can I relate part1, part2 & part3 such that later on I can find the slot by either part2 or part3?
If the problem statement is vague kindly let me know.
Unless you intend to do a search element-by-element (in which case you don't need a hash, just a plain list), what you are basically asking is: can I have a hash such that hash(X) == hash(Y) but X != Y, so that you could map to a location using part1 and then map to the same one using part2 or part3. That completely goes against what hashing stands for.
What you should do is (as viraptor also suggested), create 3 structures, each hashed using a different part of the value, and push the full value to all 3. Then when you need to search use the proper hash by the part you want to search by.
for e.g.:
value[] = {part1, part2, part3};
hash1.insert(part1, value)
hash2.insert(part2, value)
hash3.insert(part3, value)
then
hash2.search_in_hash(part2)
or
hash3.search_in_hash(part3)
The above 2 should produce the exact same values.
Also make sure that all data manipulations (removing values, changing them) are done on all 3 structures simultaneously. For example:
value = hash2.search_in_hash(part2)
hash1.remove(value.part1)
hash2.remove(part2) // you can assert that part2 == value.part2
hash3.remove(value.part3)
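A Python sketch of the same idea, with plain dicts standing in for the three hash tables (the class and method names are mine, since the original C code isn't shown):

```python
class MultiKeyTable:
    """Index the same record under each of its three parts."""
    def __init__(self):
        self.by_part = ({}, {}, {})    # one index per part

    def insert(self, value):
        # value is a tuple (part1, part2, part3); index it three times
        for index, part in zip(self.by_part, value):
            index[part] = value

    def search(self, which, part):
        """Look up a record by part number (0, 1 or 2)."""
        return self.by_part[which].get(part)

    def remove(self, value):
        # keep all three indexes consistent, as noted above
        for index, part in zip(self.by_part, value):
            del index[part]
```

The cost is three times the memory for the indexes, but each lookup stays O(1) regardless of which part you search by.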

Should I get a habit of removing unused variables in R?

Currently I'm working with relatively large data files, and my computer is not a supercomputer. I'm creating many subsets of these data sets temporarily and don't remove them from the workspace. Obviously those are making a clutter of many variables. But is there any effect of having many unused variables on the performance of R? (i.e. does the computer's memory fill up at some point?)
When writing code, should I start a habit of removing unused variables? Is it worth it?
x <- rnorm(1e8)
y <- mean(x)
# After this point I will not use x anymore, but I will use y
# Should I add following line to my code? or
# Maybe there will not be any performance lag if I skip the following line:
rm(x)
I don't want to add another line to my code. Instead of my code to seem cluttered I prefer my workspace to be cluttered (if there will be no performance improvement).
Yes, having unused objects will affect your performance, since R stores all its objects in memory. Obviously small objects will have negligible impact, and you mostly need to remove only the really big ones (data frames with millions of rows, etc.), but having an uncluttered workspace won't hurt anything.
The only risk is removing something that you need later. Even when using a repo, as suggested, breaking stuff accidentally is something you want to avoid.
One way to get around these issues is to make extensive use of local. When you do a computation that scatters around lots of temporary objects, you can wrap it inside a local call, which will effectively dispose of those objects for you afterward. No more having to clean up lots of i, j, x, temp.var, and whatnot.
local({
    x <- something
    for(i in seq_along(obj))
        temp <- some_unvectorised_function(obj[[i]], x)
    for(j in 1:temp)
        temp2 <- some_other_unvectorised_function(temp, j)
    # x, i, j, temp, temp2 only exist for the duration of local(...)
})
Adding to the above suggestions, for assisting beginners like me, I would like to list steps to check on R memory:
List the objects in the workspace using ls().
Check the objects of interest using object.size(Object_name) (pass the object itself, not its name in quotes, or you'll measure the size of the string).
Remove unused/unnecessary objects using rm("Object_name").
Use gc().
Check the memory freed using memory.size() (Windows only).
To start from a clean session, use rm(list = ls()) followed by gc().
If one feels that the habit of removing unused variables can be dangerous, it is always good practice to save the objects into R images occasionally.
I think it's a good programming practice to remove unused code, regardless of language.
It's also a good practice to use a version control system like Subversion or Git to track your change history. If you do that you can remove code without fear, because it's always possible to roll back to earlier versions if you need to.
That's fundamental to professional coding.
Show distribution of the largest objects and return their names, based on #Peter Raynham:
memory.biggest.objects <- function(n = 10) { # Show distribution of the largest objects and return their names
    Sizes.of.objects.in.mem <- sapply(ls(envir = .GlobalEnv), FUN = function(name) { object.size(get(name)) })
    topX <- sort(Sizes.of.objects.in.mem, decreasing = TRUE)[1:n]
    Memory.usage.stat <- c(topX, 'Other' = sum(sort(Sizes.of.objects.in.mem, decreasing = TRUE)[-(1:n)]))
    pie(Memory.usage.stat, cex = .5, sub = make.names(date()))
    # wpie(Memory.usage.stat, cex = .5)
    # Use wpie if you have MarkdownReports, from https://github.com/vertesy/MarkdownReports
    print(topX)
    print("rm(list=c( 'objectA', 'objectB'))")
    # inline_vec.char(names(topX))
    # Use inline_vec.char if you have DataInCode, from https://github.com/vertesy/DataInCode
}

renumbering ordered session variables when deleting one

I'm updating a classic ASP application, written in jScript, for a local pita restaurant. I've created a new mobile-specific version of their desktop site, which allows ordering for delivery and lots of customization of the final pita (imagine a website for Subway, which would allow you to add pickles, lettuce, etc.). Each pita is stored as a string of numbers in a session variable. The total number of pitas is also stored. The session might look like this:
PitaCount = 3
MyPita1 = "35,23,16,231,12"
MyPita2 = "24,23,111,52,12,23,93"
MyPita3 = "115,24"
I know there may be better ways to store the data, but for now, since the whole thing is written, working, and live (and the client is happy), I'd like to just solve the problem I have. Here's the problem...
I've got buttons on the order recap page which allow the customer to delete pitas from the cart. When I do this, I want to renumber the session variables. If the customer deletes MyPita1, I need to renumber MyPita2 to MyPita1, renumber MyPita3 to MyPita2, and then decrement the PitaCount.
The AJAX button sends an integer to an ASP file with the number of the pita to be deleted (DeleteID). My function looks at PitaCount and DeleteID. If they're both 1, it just abandons the session. If they're both the same, but greater than one, we're deleting the most recently added pita, so no renumbering is needed. However, if PitaCount is greater then DeleteID, we need to renumber the pitas. Here's the code I'm using to do that:
for (y = DeleteID; y < PitaCount; y++) {
    Session("MyPita" + y) = String(Session.Contents("MyPita" + (y + 1)));
}
Session.Contents.Remove("MyPita" + PitaCount);
PitaCount--;
Session.Contents("PitaCount") = PitaCount;
This works for every pita EXCEPT the one which replaces the deleted one, which returns 'undefined'. For example, if I have 6 pitas in my cart, and I delete MyPita2, I end up with 5 pitas in the cart. Number 1, 3, 4, and 5 are exactly what you'd expect, but MyPita2 returns undefined.
I also tried a WHILE loop instead:
while (DeleteID < PitaCount) {
    Session("MyPita" + DeleteID) = String(Session.Contents("MyPita" + (DeleteID + 1)));
    DeleteID++;
}
Session.Contents.Remove("MyPita" + PitaCount);
PitaCount--;
Session.Contents("PitaCount") = PitaCount;
This also returns 'undefined', just like the one above.
Until I can get this working I'm simply writing the most recent pita into the spot vacated by the deleted pita, but this reorders the cart, and I consider that a usability problem because people expect the items they added to the cart to remain in the same order. (Yes, I could add some kind of timestamp to the sessions and order using that, but it would be quicker to fix the problem I'm having, I think).
I'm baffled. Why (using the 6 pita example above) would it work perfectly on the second, third, and fourth iteration through the loop, but not on the first?
I can't be sure, but I think your issue may be that the value of DeleteID is a string. This could happen if you assign its value by doing something like:
var DeleteID = Session("DeleteID");
Assuming this is true, then in the first iteration of your loop (which writes to the deleted spot), y is a string, and the expression y+1 is interpreted as a string concatenation instead of a numeric addition. If, for example, you delete ID 1, you're actually copying the value from id 11 ("1" + 1) into the deleted spot, which probably doesn't exist in your tests. This can be tested by adding at least 11 items to your cart and then deleting the first one. On the next iteration, the increment operator ++ forces y to be a number, so the script works as expected from that point on.
The solution is to convert DeleteID to a number when initializing your loop:
for (y = +DeleteID; y < PitaCount; y++) {
There may be better ways to convert a string to a number, but the unary + is what I remember.
