Secret santa algorithm

Every Christmas we draw names for gift exchanges in my family. This usually involves multiple redraws until no one has pulled their spouse. So this year I coded up my own name-drawing app that takes in a bunch of names and a bunch of disallowed pairings, and sends off an email to everyone with their chosen giftee.
Right now, the algorithm works like this (in pseudocode):
function DrawNames(list allPeople, map disallowedPairs) returns map
    // Build each person's list of potential giftees
    foreach person in allPeople
        person.potentialGiftees = copyOf(allPeople)
        person.potentialGiftees.Remove(person)
        foreach pair in disallowedPairs
            if pair.first == person
                person.potentialGiftees.Remove(pair.second)

    // Loop through everyone and draw names, most-constrained person first
    while allPeople.count > 0
        currentPerson = allPeople.findPersonWithLeastPotentialGiftees()
        giftee = pickRandomPersonFrom(currentPerson.potentialGiftees)
        matches[currentPerson] = giftee
        allPeople.Remove(currentPerson)
        foreach person in allPeople
            person.potentialGiftees.RemoveIfExists(giftee)

    return matches
Does anyone who knows more about graph theory know some kind of algorithm that would work better here? For my purposes, this works, but I'm curious.
EDIT: Since the emails went out a while ago and I'm just hoping to learn something, I'll rephrase this as a graph theory question. I'm not so interested in the special case where the exclusions are all pairs (as in spouses not drawing each other). I'm more interested in cases where there are enough exclusions that finding any solution becomes the hard part. My algorithm above is just a simple greedy algorithm, and I'm not sure it would succeed in all cases.
Start with a complete directed graph and a list of vertex pairs. For each vertex pair, remove the edge from the first vertex to the second.
The goal is to pick a set of the remaining edges such that each vertex has exactly one edge coming in and one edge leaving.

Just make a graph with edges connecting two people if they are allowed to share gifts, and then use a perfect matching algorithm. (Look up "Paths, Trees, and Flowers" for the (clever) algorithm.)

I was just doing this myself. In the end, the algorithm I used doesn't exactly model drawing names out of a hat, but it's pretty damn close. Basically, shuffle the list and then pair each person with the next person in the list. The only difference from drawing names out of a hat is that you get one cycle instead of potentially getting mini subgroups of people who only exchange gifts with each other. If anything, that might be a feature.
Implementation in Python:
import random
from collections import deque

def pairup(people):
    """ Given a list of people, assign each one a secret santa partner
    from the list and return the pairings as a dict. Implemented to always
    create a perfect cycle"""
    random.shuffle(people)
    partners = deque(people)
    partners.rotate()
    return dict(zip(people, partners))

I wouldn't use disallowed pairings, since that greatly increases the complexity of the problem. Just enter everyone's name and address into a list. Create a copy of the list and keep shuffling it until the addresses in each position of the two lists don't match. This ensures that no one gets themselves or their spouse.
As a bonus, if you want to do this secret-ballot-style, print envelopes from the first list and names from the second list. Don't peek while stuffing the envelopes. (Or you could just automate emailing everyone their pick.)
There are even more solutions to this problem on this thread.

Hmmm. I took a course in graph theory, but simpler is to just randomly permute your list, pair up consecutive people, then swap any disallowed element with an element from another pair. Since there is at most one disallowed person in any given pair, the swap will always succeed if you don't allow swaps within the pair being fixed. Your algorithm is too complex.

Create a graph where each edge represents "giftability". Vertices that represent spouses will NOT be adjacent. Select an edge at random (that is a gift assignment). Delete all edges coming from the gifter and all edges going to the receiver, and repeat.

There is a concept in graph theory called a Hamiltonian circuit that describes the "goal" you describe. One tip for anybody who finds this: tell users which "seed" was used to generate the graph. This way, if you have to re-generate the graph, you can. The "seed" is also useful if you have to add or remove a person: in that case, simply choose a new "seed" and generate a new graph, making sure to tell participants which "seed" is the current/latest one.

I just created a web app that will do exactly this - http://www.secretsantaswap.com/
My algorithm allows for subgroups. It's not pretty, but it works.
Operates as follows:
1. assign a unique identifier to all participants, remember which subgroup they're in
2. duplicate and shuffle that list (the targets)
3. create an array of the number of participants in each subgroup
4. duplicate array from [3] for targets
5. create a new array to hold the final matches
6. iterate through participants assigning the first target that doesn't match any of the following criteria:
A. participant == target
B. participant.Subgroup == target.Subgroup
C. choosing the target would cause a subgroup to fail later (e.g. subgroup 1 must always have at least as many non-subgroup-1 targets remaining as subgroup-1 participants remaining)
D. participant(n+1) == target (n +1)
If we assign the target, we also decrement the counts in the arrays from steps 3 and 4.
So, not pretty (at all) but it works. Hope it helps,
Dan Carlson

Here is a simple implementation in Java for the secret santa problem.

public static void main(String[] args) {
    ArrayList<String> donor = new ArrayList<String>();
    donor.add("Micha");
    donor.add("Christoph");
    donor.add("Benj");
    donor.add("Andi");
    donor.add("Test");

    ArrayList<String> receiver = new ArrayList<String>(donor);
    // Reshuffle until no one draws themselves (a derangement),
    // so the draw cannot get stuck on the last name.
    do {
        Collections.shuffle(receiver);
    } while (hasSelfMatch(donor, receiver));

    for (int i = 0; i < donor.size(); i++) {
        System.out.println(donor.get(i) + " => " + receiver.get(i));
    }
}

private static boolean hasSelfMatch(ArrayList<String> donor, ArrayList<String> receiver) {
    for (int i = 0; i < donor.size(); i++) {
        if (donor.get(i).equals(receiver.get(i))) {
            return true;
        }
    }
    return false;
}

Python solution here.
Given a sequence of (person, tags), where tags is itself a (possibly empty) sequence of strings, my algorithm suggests a chain of persons where each person gives a present to the next in the chain (the last person, obviously, is paired with the first one).
The tags exist so that the persons can be grouped, and each time the next person is chosen from the group most disjoint from the last person chosen. The initial person is chosen with an empty set of tags, so it will be picked from the largest group.
So, given an input sequence of:
example_sequence= [
("person1", ("male", "company1")),
("person2", ("female", "company2")),
("person3", ("male", "company1")),
("husband1", ("male", "company2", "marriage1")),
("wife1", ("female", "company1", "marriage1")),
("husband2", ("male", "company3", "marriage2")),
("wife2", ("female", "company2", "marriage2")),
]
a suggestion is:
['person1 [male,company1]',
'person2 [female,company2]',
'person3 [male,company1]',
'wife2 [female,marriage2,company2]',
'husband1 [male,marriage1,company2]',
'husband2 [male,marriage2,company3]',
'wife1 [female,marriage1,company1]']
Of course, if all persons have no tags (e.g. an empty tuple) then there is only one group to choose from.
There isn't always an optimal solution (think of an input sequence of 10 women and 2 men, their gender being the only tag given), but it does as good a job as it can.
Py2/3 compatible.
import random, collections

class Statistics(object):
    def __init__(self):
        self.tags = collections.defaultdict(int)

    def account(self, tags):
        for tag in tags:
            self.tags[tag] += 1

    def tags_value(self, tags):
        return sum(1./self.tags[tag] for tag in tags)

    def most_disjoined(self, tags, groups):
        return max(
            groups.items(),
            key=lambda kv: (
                -self.tags_value(kv[0] & tags),
                len(kv[1]),
                self.tags_value(tags - kv[0]) - self.tags_value(kv[0] - tags),
            )
        )

def secret_santa(people_and_their_tags):
    """Secret santa algorithm.

    The lottery function expects a sequence of:
        (name, tags)
    For example:
        [
            ("person1", ("male", "company1")),
            ("person2", ("female", "company2")),
            ("person3", ("male", "company1")),
            ("husband1", ("male", "company2", "marriage1")),
            ("wife1", ("female", "company1", "marriage1")),
            ("husband2", ("male", "company3", "marriage2")),
            ("wife2", ("female", "company2", "marriage2")),
        ]
    husband1 is married to wife1 as seen by the common marriage1 tag;
    person1, person3 and wife1 work at the same company.
    …
    The algorithm will try to match people with the least common characteristics
    between them, to maximize entrop— ehm, mingling!

    Have fun."""
    # let's split the persons into groups
    groups = collections.defaultdict(list)
    stats = Statistics()
    for person, tags in people_and_their_tags:
        tags = frozenset(tag.lower() for tag in tags)
        stats.account(tags)
        person = "%s [%s]" % (person, ",".join(tags))
        groups[tags].append(person)

    # shuffle all lists
    for group in groups.values():
        random.shuffle(group)

    output_chain = []
    prev_tags = frozenset()
    while 1:
        next_tags, next_group = stats.most_disjoined(prev_tags, groups)
        output_chain.append(next_group.pop())
        if not next_group:  # it just got empty
            del groups[next_tags]
            if not groups:
                break
        prev_tags = next_tags
    return output_chain

if __name__ == "__main__":
    example_sequence = [
        ("person1", ("male", "company1")),
        ("person2", ("female", "company2")),
        ("person3", ("male", "company1")),
        ("husband1", ("male", "company2", "marriage1")),
        ("wife1", ("female", "company1", "marriage1")),
        ("husband2", ("male", "company3", "marriage2")),
        ("wife2", ("female", "company2", "marriage2")),
    ]
    print("suggested chain (each person gives present to next person)")
    import pprint
    pprint.pprint(secret_santa(example_sequence))


Hashmap (O(1)) supporting joker/match-all keys

The title is not so clear, because I cannot put my problem into a single sentence (if you have a better title for this question, please suggest one). I'll try to clarify my requirement with an example:
Suppose I have a table like this:
| Origin | Destination | Airline | Free Baggage |
===================================================
| NYC | London | American | 20KG |
---------------------------------------------------
| NYC | * | Southwest | 30KG |
---------------------------------------------------
| * | * | Southwest | 25KG |
---------------------------------------------------
| * | LA | * | 20KG |
---------------------------------------------------
| * | * | * | 15KG |
---------------------------------------------------
and so on ...
This table describes the free baggage amounts that the airlines provide on different routes. You can see that some rows have a * value, meaning that they match all possible values (those values are not necessarily known).
So we have a large list of baggage rules (like the table above) and a large list of flights (whose origin, destination and airline are known), and we intend to find the baggage amount for each of the flights in the most efficient way (iterating over the list is obviously not efficient, as it costs O(N) per lookup). More than one rule may match a flight, but we will assume that in this case either the first match or the most specific one is preferred (whichever is simpler for you to continue with).
If there were no * signs in the table, the problem would be easy, and we could use a Hashmap or Dictionary with a Tuple of values as a key. But with the presence of those * (let's say match-all) keys, it is not so straightforward to provide a general solution.
Please note that the above example was just an example, and I need a solution that can be used for any number of keys, not just three.
Do you have any idea or implementation for this problem, with a lookup method having time complexity equal or close to O(1), like a regular hashmap (memory will not be an issue)? What would be the best possible solution?
Regarding the comments: the more I think about it, the more it looks like a relational database with indexes rather than a hashmap...
A trivial, quite easy solution could be something like an in-memory SQLite database. But it would probably be something in O(log2(N)), not O(1). The main advantage is that it's easy to set up, and IF performance is good enough, it could be the final solution.
Here, the key is to use proper indexes, the LIKE operator, and of course well-defined JOIN clauses.
From scratch, I can't think of any solution that, having N rows and M columns, isn't at least O(M)... But usually you'll have way fewer columns than rows. Quickly - I may have skipped a detail, I'm writing this on the fly - I can propose this algorithm / container:
Data must be stored in a vector-like container VECDATA, accessed by a simple index in O(1). Think of this as a primary key in databases; we'll call it PK. Knowing PK gives you the required data instantly, in O(1). You'll have N rows grand total.
For each row NOT containing any *, you'll insert into a real hashmap called MAINHASH the pair (<tuple>, PK). This is your primary index, for exact results. It will be O(1), BUT what you requested may not be in it... Obviously, you must maintain consistency between MAINHASH and VECDATA, with whatever is needed (mutexes, locks; don't care as long as both are consistent).
This hash contains at most N entries. Without any joker, it will act nearly as a standard hashmap, but for the indirection to VECDATA. It's still O(1) in this case.
For each searchable column, you'll build a specific index, dedicated to this column.
The index has N entries. It will be a standard hashmap, but it MUST allow multiple values for a given key. That's quite a common container, so it shouldn't be an issue.
For each row, the index entry will be: (<VECDATA value>, PK). The containers are stored in a vector of indexes, INDEX[i] (with 0<=i<M).
As with MAINHASH, consistency must be enforced.
Obviously, all these indexes / subcontainers should be constructed when an entry is inserted into VECDATA, and saved on disk across sessions if needed - you don't want to reconstruct all this each time you start the application...
Searching a row
So, the user searches for a given tuple.
Search for it in MAINHASH. If found, return it; search done.
Upgrade (see below): search also in CACHE before going to step #2.
For each tuple element tuple[0<=i<M], search in INDEX[i] for both tuple[i] (returning a vector of PKs, EXACT[i]) AND for * (returning another vector of PKs, FUZZY[i]).
With these two vectors, build another (temporary) hash TMPHASH, associating (PK, integer COUNT). It's quite simple: COUNT is initialized to 1 if the entry comes from EXACT, and 0 if it comes from FUZZY.
For the next column, build EXACT and FUZZY (see #2). But instead of making a new TMPHASH, you'll MERGE the results into the existing one rather than creating a new temporary hash.
The method is: if TMPHASH doesn't have this PK entry, trash this entry: it can't match at all. Otherwise, read the COUNT value, add 1 or 0 to it according to where it comes from, and reinject it into TMPHASH.
Once all columns are done, you'll have to analyze TMPHASH.
Analyzing TMPHASH
First, if TMPHASH is empty, then you don't have any suitable answer. Return that to user. If it contains only one entry, same: return to user directly.
For more than one element in TMPHASH:
Parse the whole TMPHASH container, searching for the maximum COUNT. Maintain in memory the PK associated to the current maximum for COUNT.
Developer's choice: in case of multiple COUNTs at the same maximum value, you can either return them all, return the first one, or the last one.
COUNT is obviously always strictly lower than M - otherwise, you would have found the tuple in MAINHASH. This value, compared to M, can give a confidence mark to your result (=100*COUNT/M% confidence).
You can also now store the original tuple searched, and the corresponding PK, in another hashmap called CACHE.
Since it would be way too complicated to properly update CACHE when adding/modifying something in VECDATA, simply purge CACHE when that occurs. It's only a cache, after all...
This is quite complex to implement if the language doesn't help you, in particular by letting you redefine operators and having all the base containers available, but it should work.
Exact matches / cached matches are in O(1). Fuzzy search is in O(n.M), n being the number of matching rows (and 0<=n<N, of course).
Without further research, I can't see anything better than that. It will consume an obscene amount of memory, but you said that won't be an issue.
I would recommend doing this with tries decorated with a little extra data. For routes, you want to know the lowest route ID, so we can match the first available route. For flights, you want to track how many flights are left to match.
What this allows you to do, for instance, is realize ONLY ONCE, partway through the match, that flights from city1 to city2 might match routes that start with (city1, city2), (city1, *), (*, city2) or (*, *), without having to repeat that logic for each route or flight.
Here is a proof of concept in Python:
import heapq
import weakref

class Flight:
    def __init__(self, fields, flight_no):
        self.fields = fields
        self.flight_no = flight_no

class Route:
    def __init__(self, route_id, fields, baggage):
        self.route_id = route_id
        self.fields = fields
        self.baggage = baggage

class SearchTrie:
    def __init__(self, value=0, item=None, parent=None):
        # value = number of unmatched flights for the flight trie
        # value = lowest route id for the route trie
        self.value = value
        self.item = item
        self.trie = {}
        self.parent = None
        if parent:
            self.parent = weakref.ref(parent)

    def add_flight(self, flight, i=0):
        self.value += 1
        fields = flight.fields
        if i < len(fields):
            if fields[i] not in self.trie:
                self.trie[fields[i]] = SearchTrie(0, None, self)
            self.trie[fields[i]].add_flight(flight, i+1)
        else:
            self.item = flight

    def remove_flight(self):
        self.value -= 1
        if self.parent and self.parent():
            self.parent().remove_flight()

    def add_route(self, route, i=0):
        route_id = route.route_id
        fields = route.fields
        if i < len(fields):
            if fields[i] not in self.trie:
                self.trie[fields[i]] = SearchTrie(route_id)
            self.trie[fields[i]].add_route(route, i+1)
        else:
            self.item = route

def match_flight_baggage(route_search, flight_search):
    # Construct a heap of one search to do.
    tmp_id = 0
    todo = [((0, tmp_id), route_search, flight_search)]
    # This will hold, by flight number, the baggage.
    matched = {}
    while 0 < len(todo):
        priority, route_search, flight_search = heapq.heappop(todo)
        if 0 == flight_search.value:
            # There are no flights left to match below this node.
            pass
        elif flight_search.item is not None:
            # We found a match!
            matched[flight_search.item.flight_no] = route_search.item.baggage
            flight_search.remove_flight()
        else:
            for key, r_search in route_search.trie.items():
                if key == '*':  # Found wildcard.
                    for a_search in flight_search.trie.values():
                        if 0 < a_search.value:
                            heapq.heappush(todo, ((r_search.value, tmp_id), r_search, a_search))
                            tmp_id += 1
                elif key in flight_search.trie and 0 < flight_search.trie[key].value:
                    heapq.heappush(todo, ((r_search.value, tmp_id), r_search, flight_search.trie[key]))
                    tmp_id += 1
    return matched

# Sample data - the id is the position.
route_data = [
    ["NYC", "London", "American", "20KG"],
    ["NYC", "*", "Southwest", "30KG"],
    ["*", "*", "Southwest", "25KG"],
    ["*", "LA", "*", "20KG"],
    ["*", "*", "*", "15KG"],
]
routes = []
for i in range(len(route_data)):
    data = route_data[i]
    routes.append(Route(i, [data[0], data[1], data[2]], data[3]))

flight_data = [
    ["NYC", "London", "American"],
    ["NYC", "Dallas", "Southwest"],
    ["Dallas", "Houston", "Southwest"],
    ["Denver", "LA", "American"],
    ["Denver", "Houston", "American"],
]
flights = []
for i in range(len(flight_data)):
    data = flight_data[i]
    flights.append(Flight([data[0], data[1], data[2]], i))

# Convert to searches.
flight_search = SearchTrie()
for flight in flights:
    flight_search.add_flight(flight)
route_search = SearchTrie()
for route in routes:
    route_search.add_route(route)

print(match_flight_baggage(route_search, flight_search))
As Wisblade notes in his answer, for an array of N rows and M columns the best possible complexity is O(M). You can get O(1) only if you consider M to be a constant.
You can easily solve your problem in O(2^M), which is practical for a small M and is effectively O(1) if you consider M to be a constant.
Create a single hashmap which contains (as keys) strings of concatenated column values, possibly separated by some special character, e.g. a slash:
map.put("NYC/London/American", "20KG");
map.put("NYC/*/Southwest", "30KG");
map.put("*/*/Southwest", "25KG");
map.put("*/LA/*", "20KG");
map.put("*/*/*", "15KG");
Then, when you query, you try different combinations of actual data and wildcard characters. E.g. let's assume you want to query NYC/LA/Southwest; then you try the following combinations:
map.get("NYC/LA/Southwest"); // null
map.get("NYC/LA/*"); // null
map.get("NYC/*/Southwest"); // found: 30KG
If the answer in the third step was null, you would continue as follows:
map.get("NYC/*/*"); // null
map.get("*/LA/Southwest"); // null
map.get("*/LA/*"); // found: 20KG
And there still remain two options:
map.get("*/*/Southwest"); // found: 25KG
map.get("*/*/*"); // found: 15KG
Basically, for three data columns you have 8 possibilities to check in the hashmap - not bad! And possibly you find an answer much earlier.

What would you use for `n to n` relations in python?

After fiddling around with dictionaries, I came to the conclusion that I need a data structure that allows an n-to-n lookup. One example would be: a course can be visited by several students and each student can visit several courses.
What would be the most pythonic way to achieve this? It won't be more than 500 students and 100 courses, to stay with the example, so I would like to avoid using real database software.
Thanks!
Since your working set is small, I don't think it is a problem to just store the student IDs as lists in the Course class. Finding students in a class would be as simple as doing
course.studentIDs
To find courses a student is in, just iterate over the courses and find the ID:
studentIDToGet = "johnsmith001"
studentsCourses = list()
for course in courses:
    if studentIDToGet in course.studentIDs:
        studentsCourses.append(course.id)
There are other ways you could do it. You could have a dictionary of studentIDs mapped to courseIDs, or two dictionaries - one mapping studentIDs to courseIDs, another mapping courseIDs to studentIDs - that update each other when either is updated.
The implementation I wrote out the code for would probably be the slowest, which is why I mentioned that your working set is small enough that it would not be a problem. The other implementations I mentioned but did not show code for would require more code to make them work that just isn't worth the effort.
It depends completely on what operations you want the structure to be able to carry out quickly.
If you want to be able to quickly look up properties related to both a course and a student - for example, how many hours a student has spent on studies for a specific course, or what grade the student has in the course if they have finished it - a vector containing n*m elements is probably what you need, where n is the number of students and m is the number of courses.
If, on the other hand, the average number of courses a student has taken is much less than the total number of courses (which it probably is in a real-world scenario), and you want to be able to quickly look up all the courses a student has taken, you probably want an array of n lists - linked lists, resizable vectors or similar, depending on what you want to be able to do with the lists: maybe quickly remove elements in the middle, or quickly access an element at a random location. If you want both quick removal in the middle and quick random access, some kind of tree structure may be the most suitable.
Most tree data structures carry out all basic operations in time logarithmic in the number of elements in the tree. Beware that some tree data structures have an amortized time on these operations that is linear in the number of elements, even though the average time for a randomly constructed tree would be logarithmic. A typical example of when this happens is if you use a binary search tree and build it up with increasingly large elements. Don't do that; scramble the elements before you use them to build up the tree, or use a divide-and-conquer method: split the list into two parts around a pivot element, create the tree root from the pivot, then recursively create trees from both the left and right parts of the list the same way, and attach them to the root as the left child and the right child respectively.
I'm sorry, I don't know Python, so I don't know which data structures are part of the language and which you have to create yourself.
I assume you want to index both the students and courses. Otherwise you can easily make a list of tuples to store all (Student, Course) combinations: [ (St1, Crs1), (St1, Crs2) .. (St2, Crs1) ... (Sti, Crsi) ... ] and then do a linear lookup every time you need to. For up to 500 students this ain't bad either.
However, if you'd like a quick lookup either way, there is no builtin data structure. You can simply use two dictionaries:
courses = { crs1: [ st1, st2, st3 ], crs2: [ st_i, st_j, st_k] ... }
students = { st1: [ crs1, crs2, crs3 ], st2: [ crs_i, crs_j, crs_k] ... }
For a given student s, looking up courses is now students[s]; and for a given course c, looking up students is courses[c].
For something simple like what you want to do, you could create a simple class with data members and methods to maintain them and keep them consistent with each other. For this problem two dictionaries would be needed. One keyed by student name (or id) that keeps track of the courses each is taking, and another that keeps track of which students are in each class.
defaultdicts from the 'collections' module could be used instead of plain dicts to make things more convenient. Here's what I mean:
from collections import defaultdict

class Enrollment(object):
    def __init__(self):
        self.students = defaultdict(set)
        self.courses = defaultdict(set)

    def clear(self):
        self.students.clear()
        self.courses.clear()

    def enroll(self, student, course):
        if student not in self.courses[course]:
            self.students[student].add(course)
            self.courses[course].add(student)

    def drop(self, course, student):
        if student in self.courses[course]:
            self.students[student].remove(course)
            self.courses[course].remove(student)
            # remove student if they are not taking any other courses
            if len(self.students[student]) == 0:
                del self.students[student]

    def display_course_enrollments(self):
        print "Class Enrollments:"
        for course in self.courses:
            print '  course:', course,
            print ' ', [student for student in self.courses[course]]

    def display_student_enrollments(self):
        print "Student Enrollments:"
        for student in self.students:
            print '  student', student,
            print ' ', [course for course in self.students[student]]

if __name__ == '__main__':
    school = Enrollment()
    school.enroll('john smith', 'biology 101')
    school.enroll('mary brown', 'biology 101')
    school.enroll('bob jones', 'calculus 202')
    school.display_course_enrollments()
    print
    school.display_student_enrollments()

    school.drop('biology 101', 'mary brown')
    print
    print 'After mary brown drops biology 101:'
    print
    school.display_course_enrollments()
    print
    school.display_student_enrollments()
Which when run produces the following output:
Class Enrollments:
course: calculus 202 ['bob jones']
course: biology 101 ['mary brown', 'john smith']
Student Enrollments:
student bob jones ['calculus 202']
student mary brown ['biology 101']
student john smith ['biology 101']
After mary brown drops biology 101:
Class Enrollments:
course: calculus 202 ['bob jones']
course: biology 101 ['john smith']
Student Enrollments:
student bob jones ['calculus 202']
student john smith ['biology 101']

Algorithm for converting hierarchical flat data (w/ ParentID) into sorted flat list w/ indentation levels

I have the following structure:
MyClass {
    guid ID
    guid ParentID
    string Name
}
I'd like to create an array which contains the elements in the order they should be displayed in a hierarchy (e.g. according to their "left" values), as well as a hash which maps the guid to the indentation level.
For example:
ID Name ParentID
------------------------
1 Cats 2
2 Animal NULL
3 Tiger 1
4 Book NULL
5 Airplane NULL
This would essentially produce the following objects:
// Array is an array of all the elements sorted by the way you would see them in a fully expanded tree
Array[0] = "Airplane"
Array[1] = "Animal"
Array[2] = "Cats"
Array[3] = "Tiger"
Array[4] = "Book"
// IndentationLevel is a hash of GUIDs to IndentationLevels.
IndentationLevel["1"] = 1
IndentationLevel["2"] = 0
IndentationLevel["3"] = 2
IndentationLevel["4"] = 0
IndentationLevel["5"] = 0
For clarity, this is what the hierarchy looks like:
Airplane
Animal
    Cats
        Tiger
Book
I'd like to iterate through the items the least amount of times possible. I also don't want to create a hierarchical data structure. I'd prefer to use arrays, hashes, stacks, or queues.
The two objectives are:
Store a hash of the ID to the indentation level.
Sort the list that holds all the objects according to their left values.
When I get the list of elements, they are in no particular order. Siblings should be ordered by their Name property.
Update: This may seem like I haven't tried coming up with a solution myself and simply want others to do the work for me. However, I have tried coming up with three different solutions, and I've gotten stuck on each. One reason might be that I've tried to avoid recursion (maybe wrongly so). I'm not posting the partial solutions I have so far since they are incorrect and may badly influence the solutions of others.
I needed a similar algorithm to sort tasks with dependencies (each task could have a parent task that needed to be done first). I found topological sort. Here is an iterative implementation in Python with very detailed comments.
The indent level may be calculated while doing the topological sort. Simply set a node's indent level to its parent node's indent level + 1 as it is added to the topological ordering.
Note there can exist many valid topological orderings. To ensure the resulting topological order groups parent nodes with child nodes, select a topological sort algorithm based on depth-first traversal of the graph produced by the partial ordering information.
Wikipedia gives two more algorithms for topological sort. Note these algorithms aren't as good because the first one is breadth-first traversal, and the second one is recursive.
For hierarchical structures you will almost certainly need recursion (if you allow arbitrary depth). I quickly hacked up some Ruby code to illustrate how you might achieve this (though I haven't done the indentation):
# setup the data structure
class S < Struct.new(:id, :name, :parent_id); end

class HierarchySorter
  def initialize(il)
    @initial_list = il
    first_level = @initial_list.select{|a| a.parent_id == nil}.sort_by{|a| a.name }
    @final_array = subsort(first_level, 0)
  end

  # recursive function
  def subsort(list, indent_level)
    result = []
    list.each do |item|
      result << [item, indent_level]
      result += subsort(@initial_list.select{|a| a.parent_id == item.id}.sort_by{|a| a.name }, indent_level + 1)
    end
    result
  end

  def sorted_array
    @final_array.map &:first
  end

  def indent_hash
    # magick to transform array of structs into hash
    Hash[*@final_array.map{|a| [a.first.id, a.last]}.flatten]
  end
end

hs = HierarchySorter.new [S.new(1, "Cats", 2), S.new(2, "Animal", nil), S.new(3, "Tiger", 1), S.new(4, "Book", nil),
                          S.new(5, "Airplane", nil)]

puts "Array:"
puts hs.sorted_array.inspect
puts "\nIndentation hash:"
puts hs.indent_hash.inspect
If you don't speak ruby I can re-craft it in something else.
Edit: I updated the code above to output both data-structures.
Outputs:
Array:
[#<struct S id=5, name="Airplane", parent_id=nil>, #<struct S id=2, name="Animal", parent_id=nil>, #<struct S id=1, name="Cats", parent_id=2>, #<struct S id=3, name="Tiger", parent_id=1>, #<struct S id=4, name="Book", parent_id=nil>]
Indentation hash:
{5=>0, 1=>1, 2=>0, 3=>2, 4=>0}
Wonsungi's post helped a lot; however, that is for a generic graph rather than a tree, so I modified it quite a bit to create an algorithm designed specifically for a tree:
// Data structures:
nodeChildren: Dictionary['nodeID'] = List<Children>;
indentLevel: Dictionary['nodeID'] = Integer;
roots: Array of nodes;
sorted: Array of nodes;
nodes: all nodes

// Step #1: Prepare the data structures for building the tree
for each node in nodes
    if node.parentID == NULL
        roots.Append(node);
        indentLevel[node] = 0;
    else
        nodeChildren[node.parentID].Append(node);

// Step #2: Add elements to the sorted list
roots.SortByABC();
while roots.IsNotEmpty()
    root = roots.Remove(0);
    rootIndentLevel = indentLevel[root];
    sorted.Append(root);
    children = nodeChildren[root];
    children.SortByABC();
    for each child in children (loop backwards)
        indentLevel[child] = rootIndentLevel + 1
        roots.Prepend(child)

How can I detect common substrings in a list of strings

Given a set of strings, for example:
EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green
I want to be able to detect that these are three sets of files:
EntireS[1,2]
J27[Red,Green]P[1,2]
JournalP[1,2][Red,Green,Blue]
Are there any known ways of approaching this problem - any published papers I can read on this?
The approach I am considering is, for each string, to look at all the other strings, find the common characters and where the differing characters are, and try to find the sets of strings that have the most in common. But I fear that this is not very efficient and may give false positives.
Note that this is not the same as 'How do I detect groups of common strings in filenames' because that assumes that a string will always have a series of digits following it.
I would start here: http://en.wikipedia.org/wiki/Longest_common_substring_problem
There are links to supplemental information in the external links, including Perl implementations of the two algorithms explained in the article.
Edited to add:
Based on the discussion, I still think Longest Common Substring could be at the heart of this problem. Even in the Journal example you reference in your comment, the defining characteristic of that set is the substring 'Journal'.
I would first consider what defines a set as separate from the other sets. That gives you your partition to divide up the data, and then the problem is in measuring how much commonality exists within a set. If the defining characteristic is a common substring, then Longest Common Substring would be a logical starting point.
To automate the process of set detection, in general, you will need a pairwise measure of commonality which you can use to measure the 'difference' between all possible pairs. Then you need an algorithm to compute the partition that results in the overall lowest total difference. If the difference measure is not Longest Common Substring, that's fine, but then you need to determine what it will be. Obviously it needs to be something concrete that you can measure.
Bear in mind also that the properties of your difference measurement will bear on the algorithms that can be used to make the partition. For example, assume diff(X,Y) gives the measure of difference between X and Y. Then it would probably be useful if your measure of distance was such that diff(A,C) <= diff(A,B) + diff(B,C). And obviously diff(A,C) should be the same as diff(C,A).
In thinking about this, I also begin to wonder whether we could conceive of the 'difference' as a distance between any two strings, and, with a rigorous definition of the distance, could we then attempt some kind of cluster analysis on the input strings. Just a thought.
Great question! The steps to a solution are:
tokenizing input
using tokens to build an appropriate data structure. a DAWG is ideal, but a Trie is simpler and a decent starting point.
optional post-processing of the data structure for simplification or clustering of subtrees into separate outputs
serialization of the data structure to a regular expression or similar syntax
I've implemented this approach in regroup.py. Here's an example:
$ cat | ./regroup.py --cluster-prefix-len=2
EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green
^D
EFgre(en|y)
EntireS[12]
J27(Green|Red)P[12]
JournalP[12](Bl(ack|ue)|(Green|Red))
Something like that might work.
Build a trie that represents all your strings.
In the example you gave, there would be two edges from the root: "E" and "J". The "J" branch would then split into "Jo" and "J2".
A single strand that forks, e.g. E-n-t-i-r-e-S-(forks to 1, 2) indicates a choice, so that would be EntireS[1,2]
If the strand is "too short" in relation to the fork, e.g. B-A-(forks to N-A-N-A and H-A-M-A-S), we list two words ("banana, bahamas") rather than a choice ("ba[nana,hamas]"). "Too short" might be as simple as "if the part after the fork is longer than the part before", or maybe weighted by the number of words that have a given prefix.
If two subtrees are "sufficiently similar" then they can be merged so that instead of a tree, you now have a general graph. For example if you have ABRed,ABBlue,ABGreen,CDRed,CDBlue,CDGreen, you may find that the subtree rooted at "AB" is the same as the subtree rooted at "CD", so you'd merge them. In your output this will look like this: [left branch, right branch][subtree], so: [AB,CD][Red,Blue,Green]. How to deal with subtrees that are close but not exactly the same? There's probably no absolute answer but someone here may have a good idea.
I'm marking this answer community wiki. Please feel free to extend it so that, together, we may have a reasonable answer to the question.
try "frak" . It creates regex expression from set of strings. Maybe some modification of it will help you.
https://github.com/noprompt/frak
Hope it helps.
There are many, many approaches to string similarity. I would suggest taking a look at this open-source library, which implements a lot of metrics like Levenshtein distance:
http://sourceforge.net/projects/simmetrics/
You should be able to achieve this with generalized suffix trees: look for long paths in the suffix tree which come from multiple source strings.
There are many solutions proposed that solve the general case of finding common substrings. However, the problem here is more specialized. You're looking for common prefixes, not just substrings. This makes it a little simpler.
A nice explanation for finding longest common prefix can be found at
http://www.geeksforgeeks.org/longest-common-prefix-set-1-word-by-word-matching/
So my proposed "pythonese" pseudo-code is something like this (refer to the link for an implementation of find_lcp):

def count_groups(items):
    sorted_list = sorted(items)
    prefix = sorted_list[0]
    ctr = 0
    groups = {}
    saved_common_prefix = ""
    for i in range(1, len(sorted_list)):
        common_prefix = find_lcp(prefix, sorted_list[i])
        if len(common_prefix) > 0:  # we are still in the same group of items
            ctr += 1
            saved_common_prefix = common_prefix
            prefix = common_prefix
        else:  # we must have switched to a new group
            groups[saved_common_prefix] = ctr
            ctr = 0
            saved_common_prefix = ""
            prefix = sorted_list[i]
    if saved_common_prefix:  # don't forget to flush the final group
        groups[saved_common_prefix] = ctr
    return groups
For this particular example of strings, to keep it extremely simple, consider using simple word/digit separation.
A non-digit sequence apparently can begin with a capital letter (Entire). After breaking all strings into groups of sequences, you get something like:
[Entire][S][1]
[Entire][S][2]
[J][27][Red][P][1]
[J][27][Green][P][1]
[J][27][Red][P][2]
....
[Journal][P][1][Blue]
[Journal][P][1][Green]
Then start grouping group by group; you can fairly soon see that the prefix "Entire" is common to one group, that all of its subgroups have S as the head group, and that the only variable for those is 1, 2.
For the J27 case you can see that J27 is the only leaf, but that it then branches at Red and Green.
So some kind of List<Pair<list, string>> structure (composite pattern, if I recall correctly).
import java.util.*;

class StringProblem
{
    public List<String> subString(String name)
    {
        // Generate every substring of name.
        List<String> list = new ArrayList<String>();
        for(int i = 0; i < name.length(); i++)
        {
            for(int j = i + 1; j <= name.length(); j++)
            {
                list.add(name.substring(i, j));
            }
        }
        return list;
    }

    public String commonString(List<String> list1, List<String> list2, List<String> list3)
    {
        // Keep only the substrings present in all three lists.
        list2.retainAll(list1);
        list3.retainAll(list2);
        System.out.println(list3);

        // Return the longest of the common substrings.
        String longest = "";
        for(String s : list3)
        {
            if(s.length() > longest.length())
            {
                longest = s;
            }
        }
        return longest;
    }

    public static void main(String args[])
    {
        Scanner sc = new Scanner(System.in);
        System.out.println("Enter the String1:");
        String name1 = sc.nextLine();
        System.out.println("Enter the String2:");
        String name2 = sc.nextLine();
        System.out.println("Enter the String3:");
        String name3 = sc.nextLine();
        // String name1 = "salman";
        // String name2 = "manmohan";
        // String name3 = "rahman";

        StringProblem sp = new StringProblem();
        List<String> list1 = sp.subString(name1);
        List<String> list2 = sp.subString(name2);
        List<String> list3 = sp.subString(name3);
        System.out.println(" " + sp.commonString(list1, list2, list3));
    }
}

Speed dating algorithm

I work in a consulting organization and am most of the time at customer locations. Because of that I rarely meet my colleagues. To get to know each other better we are going to arrange a dinner party. There will be many small tables so people can have a chat. In order to talk to as many different people as possible during the party, everybody has to switch tables at some interval, say every hour.
How do I write a program that creates the table-switching schedule? Just to give you some numbers: in this case there will be around 40 people and there can be at most 8 people at each table. But the algorithm needs to be generic, of course.
Here's an idea:
First work from the perspective of the first person... let's call him X.
X has to meet all the other people in the room, so we should divide the remaining people into n groups (where n = #_of_people / capacity_per_table) and make him sit with one of these groups per iteration.
Now that X has been taken care of, we will consider the next person Y.
WLOG let Y be a person X had to sit with in the first iteration itself... so we already know Y's table group for that time-frame. We should then divide the remaining people into groups such that each group sits with Y for every consecutive iteration, and for each iteration X's group and Y's group have no person in common.
I guess if you keep doing something like this, you will get an optimal solution (if one exists).
Alternatively, you could crowdsource the problem by giving each person a card where they can write down the names of all the people they got to dine with, and at the end of the event present some kind of prize to the person with the most names on their card.
This sounds like an application for a genetic algorithm:
Select a random permutation of the 40 guests - this is one seating arrangement
Repeat the random permutation N times (N is how many times you are to switch seats in the night)
Combine the permutations together - this is the chromosome for one organism
Repeat for however many organisms you want to breed in one generation
The fitness score is the number of people each person got to see in one night (or alternatively - the inverse of the number of people they did not see); a sketch of it is below
Breed, mutate and introduce new organisms using the normal method and repeat until you get a satisfactory answer
You can add any other factors you like into the fitness, such as male/female ratio and so on, without greatly changing the underlying method.
Why not imitate the real world?
class Person {
    void doPeriodically() {
        do {
            newTable = random(numberOfTables);
        } while (tableBusy(newTable));
        switchTable(newTable);
    }
}
Oh, and note that there is a similar algorithm for finding a mating partner and it's rumored to be effective for those 99% of people who don't spend all of their free time answering programming questions...
Perfect Table Plan
You might want to have a look at combinatorial design theory.
Intuitively I don't think you can do better than a perfect shuffle, but it's beyond my pre-coffee cognition to prove it.
This one was very funny! :D
I tried different methods, but the logic suggested by adi92 (card + prize) is the one that works better than any other I tried.
It works like this:
a guy arrives and examines all the tables
for each table with free seats he counts how many people he has yet to meet, then chooses the one with the most unknown people
if two tables have an equal number of unknown people, the guy will choose the one with more free seats, so that there is a higher probability of meeting more new people
at each turn the order of the people taking seats is random (this avoids possible infinite loops). This is a "demo" of the working algorithm in Python:
import random

class Person(object):

    def __init__(self, name):
        self.name = name
        self.known_people = dict()

    def meets(self, a_guy, propagation=True):
        "self meets a_guy, and a_guy meets self"
        if a_guy not in self.known_people:
            self.known_people[a_guy] = 1
        else:
            self.known_people[a_guy] += 1
        if propagation: a_guy.meets(self, False)

    def points(self, table):
        "Calculates how many new guys self will meet at table"
        return len([p for p in table if p not in self.known_people])

    def chooses(self, tables, n_seats):
        "Calculate what is the best table to sit at, and return it"
        points = 0
        free_seats = 0
        ret = random.choice([t for t in tables if len(t) < n_seats])
        for table in tables:
            tmp_p = self.points(table)
            tmp_s = n_seats - len(table)
            if tmp_s == 0: continue
            if tmp_p > points or (tmp_p == points and tmp_s > free_seats):
                ret = table
                points = tmp_p
                free_seats = tmp_s
        return ret

    def __str__(self):
        return self.name

    def __repr__(self):
        return self.name

def Switcher(n_seats, people):
    """calculate how many tables and what switches you need
    assuming each table has n_seats seats"""
    n_people = len(people)
    n_tables = n_people/n_seats
    switches = []
    while not all(len(g.known_people) == n_people-1 for g in people):
        tables = [[] for t in xrange(n_tables)]
        random.shuffle(people)  # need to change "starter"
        for the_guy in people:
            table = the_guy.chooses(tables, n_seats)
            tables.remove(table)
            for guy in table:
                the_guy.meets(guy)
            table += [the_guy]
            tables += [table]
        switches += [tables]
    return switches

lst_people = [Person('Hallis'),
              Person('adi92'),
              Person('ilya n.'),
              Person('m_oLogin'),
              Person('Andrea'),
              Person('1800 INFORMATION'),
              Person('starblue'),
              Person('regularfry')]

s = Switcher(4, lst_people)

print "You need %d tables and %d turns" % (len(s[0]), len(s))
turn = 1
for tables in s:
    print 'Turn #%d' % turn
    turn += 1
    tbl = 1
    for table in tables:
        print '  Table #%d - ' % tbl, table
        tbl += 1
    print '\n'
This will output something like:
You need 2 tables and 3 turns
Turn #1
Table #1 - [1800 INFORMATION, Hallis, m_oLogin, Andrea]
Table #2 - [adi92, starblue, ilya n., regularfry]
Turn #2
Table #1 - [regularfry, starblue, Hallis, m_oLogin]
Table #2 - [adi92, 1800 INFORMATION, Andrea, ilya n.]
Turn #3
Table #1 - [m_oLogin, Hallis, adi92, ilya n.]
Table #2 - [Andrea, regularfry, starblue, 1800 INFORMATION]
Because of the randomness it won't always produce the minimum number of switches, especially with larger sets of people. You should then run it a couple of times and keep the result with the fewest turns (so you do not stress all the people at the party :P ), and it is an easy thing to code :P
PS:
Yes, you can save the prize money :P
You can also take a look at the stable matching problem; the classic solution is the Gale-Shapley algorithm. http://en.wikipedia.org/wiki/Stable_marriage_problem
I wouldn't bother with genetic algorithms. Instead, I would do the following, which is a slight refinement on repeated perfect shuffles.
While (there are two people who haven't met):
Consider the graph where each node is a guest and edge (A, B) exists if A and B have NOT sat at the same table. Find all the connected components of this graph. If there are any connected components of size < tablesize, schedule those connected components at tables. Note that even this is actually an instance of a hard problem known as Bin packing, but first fit decreasing will probably be fine, which can be accomplished by sorting the connected components in order of biggest to smallest, and then putting them each of them in turn at the first table where they fit.
Perform a random permutation of the remaining elements. (In other words, seat the remaining people randomly, which at first will be everyone.)
Increment counter indicating number of rounds.
Repeat the above for a while until the number of rounds seems to converge.
