Finding the most commonly occuring pairs - algorithm

Say that I have a list (or array) that links Suppliers with the materials they supply. For example, an array of the form
[[Supplier_1, Material_a], [Supplier_2, Material_a], [Supplier_3, Material_a], [Supplier_1, Material_b], [Supplier_2, Material_c], [Supplier_3, Material_b], ...]
I am interested in finding the the list of suppliers that supply at least k materials that a particular supplier say Supplier_1 supplies.
One way that I can think of is to pair all suppliers with Supplier_1 for each material Supplier_1 supplies
[[Supplier_1, Supplier_2, Material_a], [Supplier_1, Supplier_3, Material_a], [Supplier_1, Supplier_3, Material_b]...]
and then count the number of times each pair is present
[[Supplier_1, Supplier_2, 1], [Supplier_1, Supplier_3, 2]...]
The problem is that this approach can be very time consuming since the list provided can be quite long. I was wondering if there is a better way to do this.

You would put the materials of Supplier_1 in a hash set, so that you can verify for any material whether it is supplied by Supplier_1 in constant time.
Once you have that you can iterate the data again, and in a dictionary (hash map) keep a count per supplier which you increment each time the material is in the above mentioned set.
In Python it would look like this:
def getsuppliers(pairs, selected_supplier, k):
materialset = set()
countmap = {} # a dictionary with <key=supplier, value=count> pairs
for supplier, material in pairs:
if supplier == selected_supplier:
materialset.add(material)
countmap[supplier] = 0
# An optional quick exit: if the selected provider does not have k materials,
# there is no use in continuing...
if countmap[selected_supplier] < k:
return [] # no supplier meets the requirement
for supplier, material in pairs:
if material in materialset:
countmap[supplier] = countmap[supplier]+1
result = []
for supplier, count in countmap.items():
if count >= k:
result.append(supplier)
return result
NB: this would include the selected supplier also, provided it has at least k materials.
All operations within each individual loop body, have a constant time complexity, so the overall time complexity is O(n), where n is the size of the input list (pairs).

Related

Hashmap (O(1)) supporting joker/match-all keys

The title is not so clear, because I cannot put my problem in a sentence (If you have a better title for this question, please suggest). I'll try to clarify my requirement with an example:
Suppose I have a table like this:
| Origin | Destination | Airline | Free Baggage |
===================================================
| NYC | London | American | 20KG |
---------------------------------------------------
| NYC | * | Southwest | 30KG |
---------------------------------------------------
| * | * | Southwest | 25KG |
---------------------------------------------------
| * | LA | * | 20KG |
---------------------------------------------------
| * | * | * | 15KG |
---------------------------------------------------
and so on ...
This table describes free baggage amount that the airlines provide in different routes. You can see that some rows have * value, meaning that they match all possible values (those values are not known necessarily).
So we have a large list of baggage rules (like the table above) and a large list of flights (which their origin, destination and airline is known), and we intend to find the baggage amount for each one of flights in the most efficient way (iterating the list is not an efficient way, obviously, as it will cost an O(N) computation). It is possible to exist more than one result for each flight, but we will assume that in this case either the first matching or the most specific one will be preferred (whichever is simpler for you to continue with).
If there was not * signs in the table, the problem would be easy, and we could use a Hashmap or Dictionary with a Tuple of values as a key. But with presence of those * (lets say match-all) keys, it is not so straight forward to provide a general solution for that.
Please note that the above example was just an example, and I need a solution that can be used for any number of keys, not just three.
Do you have any idea or implementation for this problem, with a lookup method having time complexity equal or close to O(1) like a regular hashmap (memory will not be an issue)? What would be the best possible solution?
Regarding the comments, the more I think about it, and the more it looks like a relational database with indexes rather than an hashmap...
A trivial, quite easy solution could be something like an In-memory SQlite database. But it would probably be something in O(log2(n)), and not O(1). The main advantage is that it's easy to set up, and IF performances are good enough, it could be the final solution.
Here, key is to use proper indexes, the LIKE operator, and of course well-defined JOIN clauses.
From scratch, I can't think about any solution that, having N rows and M columns, isn't at least in O(M)... But usually, you'll have way less columns than rows. Quickly - I may have skipped a detail, I write that on-the-fly - I can propose you this algorithm / container:
Data must be stored in a vector-like container VECDATA, accessed by a simple index in O(1). Think about this as a primary key in databases, and we'll call it PK. Knowing PK gives you instantly, in O(1), the required data. You'll have N rows grand total.
For each row NOT containing any *, you'll insert in a real hashmap called MAINHASH the pair (<tuple>, PK). This is your primary index, for exact results. It will be in O(1), BUT what you requested may not be within... Obviously, you must maintain consistency between MAINHASH and VECDATA, with whatever is needed (mutexes, locks, don't care as long as both are consistents).
This hash contains at most N entries. Without any joker, it will act near as a standard hashmap, but for the indirection to VECDATA. It's still O(1) in this case.
For each searchable column, you'll build a specific index, dedicated to this column.
The index has N entries. It will be a standard hashmap, but it MUST allow multiple values for a given key. That's quite a common container, so it shouldn't be an issue.
For each row, the index entry will be: ( <VECDATA value>, PK ). The container is stored in a vector of indexes, INDEX[i] (with 0<=i<M).
Same as MAINHASH, consistency must be enforced.
Obviously, all these indexes / subcontainers should be constructed when an entry is inserted into VECDATA, and saved on disk across sessions if needed - you don't want to reconstruct all this each time you start the application...
Searching a row
So, user search for a given tuple.
Search it in MAINHASH. If found, return it, search done.
Upgrade (see below): search also in CACHE before going to step #2.
For each tuple element tuple[0<=i<M], search in INDEX[i] for both tuple[i] (returns a vector of PK, EXACT[i]) AND for * (returns another vector of PK, FUZZY[i]).
With these two vectors, build another (temporary) hash TMPHASH, associating ( PK, integer COUNT ). It quite simple: COUNT is initialized to 1 if entry comes from EXACT, and 0 if it comes from FUZZY.
For next column, build EXACT and FUZZY (see #2). But instead of making a new TMPHASH, you'll MERGE the results into rather than creating a new temporary hash.
Method is: if TMPHASH doesn't have this PK entry, trash this entry: it can't match at all. Otherwise, read the COUNT value, add 1 or 0 to it according to where it comes from, reinject it in TMPHASH.
Once all columns are done, you'll have to analyze TMPHASH.
Analyzing TMPHASH
First, if TMPHASH is empty, then you don't have any suitable answer. Return that to user. If it contains only one entry, same: return to user directly.
For more than one element in TMPHASH:
Parse the whole TMPHASH container, searching for the maximum COUNT. Maintain in memory the PK associated to the current maximum for COUNT.
Developper's choice: in case of multiple COUNT at the same maximum value, you can either return them all, return the first one, or the last one.
COUNT if obviously always stricly lower than M - otherwise, you would have found the tuple in MAINHASH. This value, compared to M, can give a confidence mark to your result (=100*COUNT/M% of confidence).
You can also now store the original tuple searched, and the corresponding PK, in another hashmap called CACHE.
Since it would be way too complicated to update properly CACHE when adding/modifying something in VECDATA, simply purge CACHE when it occurs. It's only a cache, after all...
This is quite complex to implement if the language doesn't help you, in particular by allowing to redefine operators and having all base containers available, but it should work.
Exact matches / cached matches are in O(1). Fuzzy search is in O(n.M), n being the number of matching rows (and 0<=n<N, of course).
Without further researchs, I can't see anything better than that. It will consume an obscene amount of memory, but you said that it won't be an issue.
I would recommend doing this with Tries that have a little data decorated. For routes, you want to know the lowest route ID so we can match to the first available route. For flights you want to track how many flights there are left to match.
What this will allow you to do, for instance, is partway through the match ONLY ONCE realize that flights from city1 to city2 might be matching routes that start off city1, city2, or city1, * or *, city2, or *, * without having to repeat that logic for each route or flight.
Here is a proof of concept in Python:
import heapq
import weakref
class Flight:
def __init__(self, fields, flight_no):
self.fields = fields
self.flight_no = flight_no
class Route:
def __init__(self, route_id, fields, baggage):
self.route_id = route_id
self.fields = fields
self.baggage = baggage
class SearchTrie:
def __init__(self, value=0, item=None, parent=None):
# value = # unmatched flights for flights
# value = lowest route id for routes.
self.value = value
self.item = item
self.trie = {}
self.parent = None
if parent:
self.parent = weakref.ref(parent)
def add_flight (self, flight, i=0):
self.value += 1
fields = flight.fields
if i < len(fields):
if fields[i] not in self.trie:
self.trie[fields[i]] = SearchTrie(0, None, self)
self.trie[fields[i]].add_flight(flight, i+1)
else:
self.item = flight
def remove_flight(self):
self.value -= 1
if self.parent and self.parent():
self.parent().remove_flight()
def add_route (self, route, i=0):
route_id = route.route_id
fields = route.fields
if i < len(fields):
if fields[i] not in self.trie:
self.trie[fields[i]] = SearchTrie(route_id)
self.trie[fields[i]].add_route(route, i+1)
else:
self.item = route
def match_flight_baggage(route_search, flight_search):
# Construct a heap of one search to do.
tmp_id = 0
todo = [((0, tmp_id), route_search, flight_search)]
# This will hold by flight number, baggage.
matched = {}
while 0 < len(todo):
priority, route_search, flight_search = heapq.heappop(todo)
if 0 == flight_search.value: # There are no flights left to match
# Already matched all flights.
pass
elif flight_search.item is not None:
# We found a match!
matched[flight_search.item.flight_no] = route_search.item.baggage
flight_search.remove_flight()
else:
for key, r_search in route_search.trie.items():
if key == '*': # Found wildcard.
for a_search in flight_search.trie.values():
if 0 < a_search.value:
heapq.heappush(todo, ((r_search.value, tmp_id), r_search, a_search))
tmp_id += 1
elif key in flight_search.trie and 0 < flight_search.trie[key].value:
heapq.heappush(todo, ((r_search.value, tmp_id), r_search, flight_search.trie[key]))
tmp_id += 1
return matched
# Sample data - the id is the position.
route_data = [
["NYC", "London", "American", "20KG"],
["NYC", "*", "Southwest", "30KG"],
["*", "*", "Southwest", "25KG"],
["*", "LA", "*", "20KG"],
["*", "*", "*", "15KG"],
]
routes = []
for i in range(len(route_data)):
data = route_data[i]
routes.append(Route(i, [data[0], data[1], data[2]], data[3]))
flight_data = [
["NYC", "London", "American"],
["NYC", "Dallas", "Southwest"],
["Dallas", "Houston", "Southwest"],
["Denver", "LA", "American"],
["Denver", "Houston", "American"],
]
flights = []
for i in range(len(flight_data)):
data = flight_data[i]
flights.append(Flight([data[0], data[1], data[2]], i))
# Convert to searches.
flight_search = SearchTrie()
for flight in flights:
flight_search.add_flight(flight)
route_search = SearchTrie()
for route in routes:
route_search.add_route(route)
print(route_search.match_flight_baggage(flight_search))
As Wisblade notices in his answer, for an array of N rows and M columns the best possible complexity is O(M). You can get O(1) only if you consider M to be a constant.
You can easily solve your problem in O(2^M) which is practical for a small M and is effectively O(1) if you consider M to be a constant.
Create a single hashmap which contains (as keys) strings of concatenated column values, possibly separated by some special character, e.g. a slash:
map.put("NYC/London/American", "20KG");
map.put("NYC/*/Southwest", "30KG");
map.put("*/*/Southwest", "25KG");
map.put("*/LA/*", "20KG");
map.put("*/*/*", "15KG");
Then, when you query, you try different combinations of actual data and wildcard characters. E.g. let's assume you want to query NYC/LA/Southwest; then you try the following combinations:
map.get("NYC/LA/Southwest"); // null
map.get("NYC/LA/*"); // null
map.get("NYC/*/Southwest"); // found: 30KG
If the answer in the third step was null, you would continue as follows:
map.get("NYC/*/*"); // null
map.get("*/LA/Southwest"); // null
map.get("*/LA/*"); // found: 20KG
And there still remain two options:
map.get("*/*/Southwest"); // found: 25KG
map.get("*/*/*"); // found: 15KG
Basically, for three data columns you have 8 possibilities to check in the hashmap -- not bad! and possibly you find an answer much earlier.

Extracting Topic distribution from gensim LDA model

I created an LDA model for some text files using gensim package in python. I want to get topic's distributions for the learned model. Is there any method in gensim ldamodel class or a solution to get topic's distributions from the model?
For example, I use the coherence model to find a model with the best cohrence value subject to the number of topics in range 1 to 5. After getting the best model I use get_document_topics method (thanks kenhbs) to get topic distribution in the document that used for creating the model.
id2word = corpora.Dictionary(doc_terms)
bow = id2word.doc2bow(doc_terms)
max_coherence = -1
best_lda_model = None
for num_topics in range(1, 6):
lda_model = gensim.models.ldamodel.LdaModel(corpus=bow, num_topics=num_topics)
coherence_model = gensim.models.CoherenceModel(model=lda_model, texts=doc_terms,dictionary=id2word)
coherence_value = coherence_model.get_coherence()
if coherence_value > max_coherence:
max_coherence = coherence_value
best_lda_model = lda_model
The best has 4 topics
print(best_lda_model.num_topics)
4
But when I use get_document_topics, I get less than 4 values for document distribution.
topic_ditrs = best_lda_model.get_document_topics(bow)
print(len(topic_ditrs))
3
My question is: For best lda model with 4 topics (using coherence model) for a document, why get_document_topics returns fewer topics for the same document? why some topics have very small distribution (less than 1-e8)?
From the documentation, you can use two methods for this.
If you are aiming to get the main terms in a specific topic, use get_topic_terms:
from gensim.model.ldamodel import LdaModel
K = 10
lda = LdaModel(some_corpus, num_topics=K)
lda.get_topic_terms(5, topn=10)
# Or for all topics
for i in range(K):
lda.get_topic_terms(i, topn=10)
You can also print the entire underlying np.ndarray (called either beta or phi in standard LDA papers, dimensions are (K, V) or (V, K)).
phi = lda.get_topics()
edit:
From the link i included in the original answer: if you are looking for a document's topic distribution, use
res = lda.get_document_topics(bow)
As can be read from the documentation, the resulting object contains the following three lists:
list of (int, float) – Topic distribution for the whole document. Each element in the list is a pair of a topic’s id, and the probability that was assigned to it.
list of (int, list of (int, float), optional – Most probable topics per word. Each element in the list is a pair of a word’s id, and a list of topics sorted by their relevance to this word. Only returned if per_word_topics was set to True.
list of (int, list of float), optional – Phi relevance values, multipled by the feature length, for each word-topic combination. Each element in the list is a pair of a word’s id and a list of the phi values between this word and each topic. Only returned if per_word_topics was set to True.
Now,
tops, probs = zip(*res[0])
probs will contains K (for you 4) probabilities. Some may be zero, but they should sum up to 1
You can play with the minimum_probability parameter and set it to a very small value like 0.000001.
topic_vector = [ x[1] for x in ldamodel.get_document_topics(new_doc_bow , minimum_probability= 0.0, per_word_topics=False)]
Just type,
pd.DataFrame(lda_model.get_document_topics(doc_term_matrix))

Interval tree with added dimension of subset matching?

This is an algorithmic question about a somewhat complex problem. The foundation is this:
A scheduling system based on available slots and reserved slots. Slots have certain criteria, let's call them tags. A reservation is matched to an available slot by those tags, if the available slot's tag set is a superset of the reserved slot.
As a concrete example, take this scenario:
11:00 12:00 13:00
+--------+
| A, B |
+--------+
+--------+
| C, D |
+--------+
Between the times of 11:00 to 12:30 reservations for the tags A and B can be made, from 12:00 to 13:30 C and D is available, and there's an overlap from about 12:00 to 12:30.
11:00 12:00 13:00
+--------+
| A, B |
+--------+
+--------+
| C, D |
+--------+
xxxxxx
x A x
xxxxxx
Here a reservation for A has been made, so no other reservations for A or B can be made between 11:15-ish and 12:00-ish.
That's the idea in a nutshell. There are no specific limitations for the available slots:
an available slot can contain any number of tags
any number of slots can overlap at any time
slots are of arbitrary length
reservations can contain any number of tags
The only rule that needs to be obeyed in the system is:
when adding a reservation, at least one remaining available slot must match all the tags in the reservation
To clarify: when there are two available slots at the same time with, say, tag A, then two reservations for A can be made at that time, but no more.
I have that working with a modified implementation of an interval tree; as a quick overview:
all available slots are added to the interval tree (duplicates/overlaps are preserved)
all reserved slots are iterated and:
all available slots matching the time of the reservation are queried from the tree
the first of those matching the reservation's tags is sliced and the slice removed from the tree
When that process is finished, what's left are the remaining slices of available slots, and I can query whether a new reservation can be made for a particular time and add it.
Data structures look something like this:
{
type: 'available',
begin: 1497857244,
end: 1497858244,
tags: [{ foo: 'bar' }, { baz: 42 }]
}
{
type: 'reserved',
begin: 1497857345,
end: 1497857210,
tags: [{ foo: 'bar' }]
}
Tags are themselves key-value objects, a list of them is a "tag set". Those could be serialised if it helps; so far I'm using a Python set type which makes comparison easy enough. Slot begin/end times are UNIX time stamps within the tree. I'm not particularly married to these specific data structures and can refactor them if it's useful.
The problem I'm facing is that this doesn't work bug-free; every once in a while a reservation sneaks its way into the system that conflicts with other reservations, and I couldn't yet figure out how that can happen exactly. It's also not very clever when tags overlap in a complex way where the optimal distribution needs to be calculated so all reservations can be fit into the available slots as best as possible; in fact currently it's non-deterministic how reservations are matched to available slots in overlapping scenarios.
What I want to know is: interval trees are mostly great for this purpose, but my current system to add tag set matching as an additional dimension to this is clunky and bolted-on; is there a data structure or algorithm that can handle this in an elegant way?
Actions that must be supported:
Querying the system for available slots that match certain tag sets (taking into account reservations that may reduce availability but are not themselves part of said tag set; e.g. in the example above querying for an availability for B).
Ensuring no reservations can be added to the system which don't have a matching available slot.
Your problem can be solved using constraint programming. In python this can be implemented using the python-constraint library.
First, we need a way to check if two slots are consistent with each other. this is a function that returns true if two slots share a tag and their rimeframes overlap. In python this can be implemented using the following function
def checkNoOverlap(slot1, slot2):
shareTags = False
for tag in slot1['tags']:
if tag in slot2['tags']:
shareTags = True
break
if not shareTags: return True
return not (slot2['begin'] <= slot1['begin'] <= slot2['end'] or
slot2['begin'] <= slot1['end'] <= slot2['end'])
I was not sure whether you wanted the tags to be completely the same (like {foo: bar} equals {foo: bar}) or only the keys (like {foo: bar} equals {foo: qux}), but you can change that in the function above.
Consistency check
We can use the python-constraint module for the two kinds of functionality you requested.
The second functionality is the easiest. To implement this, we can use the function isConsistent(set) which takes a list of slots in the provided data structure as input. The function will then feed all the slots to python-constraint and will check if the list of slots is consistent (no 2 slots that shouldn't overlap, overlap) and return the consistency.
def isConsistent(set):
#initialize python-constraint context
problem = Problem()
#add all slots the context as variables with a singleton domain
for i in range(len(set)):
problem.addVariable(i, [set[i]])
#add a constraint for each possible pair of slots
for i in range(len(set)):
for j in range(len(set)):
#we don't want slots to be checked against themselves
if i == j:
continue
#this constraint uses the checkNoOverlap function
problem.addConstraint(lambda a,b: checkNoOverlap(a, b), (i, j))
# getSolutions returns all the possible combinations of domain elements
# because all domains are singleton, this either returns a list with length 1 (consistent) or 0 (inconsistent)
return not len(problem.getSolutions()) == 0
This function can be called whenever a user wants to add a reservation slot. The input slot can be added to the list of already existing slots and the consistency can be checked. If it is consistent, the new slot an be reserverd. Else, the new slot overlaps and should be rejected.
Finding available slots
This problem is a bit trickier. We can use the same functionality as above with a few significant changes. Instead of adding the new slot together with the existing slot, we now want to add all possible slots to the already existing slots. We can then check the consistency of all those possible slots with the reserved slots and ask the constraint system for the combinations that are consistent.
Because the number of possible slots would be infinite if we didn't put any restrictions on it, we first need to declare some parameters for the program:
MIN = 149780000 #available time slots can never start earlier then this time
MAX = 149790000 #available time slots can never start later then this time
GRANULARITY = 1*60 #possible time slots are always at least one minut different from each other
We can now continue to the main function. It looks a lot like the consistency check, but instead of the new slot from the user, we now add a variable to discover all available slots.
def availableSlots(tags, set):
#same as above
problem = Problem()
for i in range(len(set)):
problem.addVariable(i, [set[i]])
#add an extra variable for the available slot is added, with a domain of all possible slots
problem.addVariable(len(set), generatePossibleSlots(MIN, MAX, GRANULARITY, tags))
for i in range(len(set) +1):
for j in range(len(set) +1):
if i == j:
continue
problem.addConstraint(lambda a, b: checkNoOverlap(a, b), (i, j))
#extract the available time slots from the solution for clean output
return filterAvailableSlots(problem.getSolutions())
I use some helper functions to keep the code cleaner. They are included here.
def filterAvailableSlots(possibleCombinations):
result = []
for slots in possibleCombinations:
for key, slot in slots.items():
if slot['type'] == 'available':
result.append(slot)
return result
def generatePossibleSlots(min, max, granularity, tags):
possibilities = []
for i in range(min, max - 1, granularity):
for j in range(i + 1, max, granularity):
possibleSlot = {
'type': 'available',
'begin': i,
'end': j,
'tags': tags
}
possibilities.append(possibleSlot)
return tuple(possibilities)
You can now use the function getAvailableSlots(tags, set) with the tags for which you want the available slots and a set of already reserved slots. Note that this function really return all the consistent possible slots, so no effort is done to find the one of maximum lenght or for other optimalizations.
Hope this helps! (I got it to work as you described in my pycharm)
Here's a solution, I'll include all the code below.
1. Create a table of slots, and a table of reservations
2. Create a matrix of reservations x slots
which is populated by true or false values based on whether that reservation-slot combination are possible
3. Figure out the best mapping that allows for the most Reservation-Slot Combinations
Note: my current solution scales poorly with very large arrays as it involves looping through all possible permutations of a list with size = number of slots. I've posted another question to see if anyone can find a better way of doing this. However, this solution is accurate and can be optimized
Python Code Source
Part 1
from IPython.display import display
import pandas as pd
import datetime
available_data = [
['SlotA', datetime.time(11, 0, 0), datetime.time(12, 30, 0), set(list('ABD'))],
['SlotB',datetime.time(12, 0, 0), datetime.time(13, 30, 0), set(list('C'))],
['SlotC',datetime.time(12, 0, 0), datetime.time(13, 30, 0), set(list('ABCD'))],
['SlotD',datetime.time(12, 0, 0), datetime.time(13, 30, 0), set(list('AD'))],
]
reservation_data = [
['ReservationA', datetime.time(11, 15, 0), datetime.time(12, 15, 0), set(list('AD'))],
['ReservationB', datetime.time(11, 15, 0), datetime.time(12, 15, 0), set(list('A'))],
['ReservationC', datetime.time(12, 0, 0), datetime.time(12, 15, 0), set(list('C'))],
['ReservationD', datetime.time(12, 0, 0), datetime.time(12, 15, 0), set(list('C'))],
['ReservationE', datetime.time(12, 0, 0), datetime.time(12, 15, 0), set(list('D'))]
]
reservations = pd.DataFrame(data=reservation_data, columns=['reservations', 'begin', 'end', 'tags']).set_index('reservations')
slots = pd.DataFrame(data=available_data, columns=['slots', 'begin', 'end', 'tags']).set_index('slots')
display(slots)
display(reservations)
Part 2
def is_possible_combination(r):
return (r['begin'] >= slots['begin']) & (r['end'] <= slots['end']) & (r['tags'] <= slots['tags'])
solution_matrix = reservations.apply(is_possible_combination, axis=1).astype(int)
display(solution_matrix)
Part 3
import numpy as np
from itertools import permutations
# add dummy columns to make the matrix square if it is not
sqr_matrix = solution_matrix
if sqr_matrix.shape[0] > sqr_matrix.shape[1]:
# uhoh, there are more reservations than slots... this can't be good
for i in range(sqr_matrix.shape[0] - sqr_matrix.shape[1]):
sqr_matrix.loc[:,'FakeSlot' + str(i)] = [1] * sqr_matrix.shape[0]
elif sqr_matrix.shape[0] < sqr_matrix.shape[1]:
# there are more slots than customers, why doesn't anyone like us?
for i in range(sqr_matrix.shape[0] - sqr_matrix.shape[1]):
sqr_matrix.loc['FakeCustomer' + str(i)] = [1] * sqr_matrix.shape[1]
# we only want the values now
A = solution_matrix.values.astype(int)
# make an identity matrix (the perfect map)
imatrix = np.diag([1]*A.shape[0])
# randomly swap columns on the identity matrix until they match.
n = A.shape[0]
# this will hold the map that works the best
best_map_so_far = np.zeros([1,1])
for column_order in permutations(range(n)):
# this is an identity matrix with the columns swapped according to the permutation
imatrix = np.zeros(A.shape)
for row, column in enumerate(column_order):
imatrix[row,column] = 1
# is this map better than the previous best?
if sum(sum(imatrix * A)) > sum(sum(best_map_so_far)):
best_map_so_far = imatrix
# could it be? a perfect map??
if sum(sum(imatrix * A)) == n:
break
if sum(sum(imatrix * A)) != n:
print('a perfect map was not found')
output = pd.DataFrame(A*imatrix, columns=solution_matrix.columns, index=solution_matrix.index, dtype=int)
display(output)
The suggested approaches by Arne and tinker were both helpful, but not ultimately sufficient. I came up with a hybrid approach that solves it well enough.
The main problem is that it's a three-dimensional issue, which is difficult to solve in all dimensions at once. It's not just about matching a time overlap or a tag overlap, it's about matching time slices with tag overlaps. It's simple enough to match slots to other slots based on time and even tags, but it's then pretty complicated to match an already matched availability slot to another reservation at another time. Meaning, this scenario in which one availability can cover two reservations at different times:
+---------+
| A, B |
+---------+
xxxxx xxxxx
x A x x A x
xxxxx xxxxx
Trying to fit this into constraint based programming requires an incredibly complex relationship of constraints which is hardly manageable. My solution to this was to simplify the problem…
Removing one dimension
Instead of solving all dimensions at once, it simplifies the problem enormously to largely remove the dimension of time. I did this by using my existing interval tree and slicing it as needed:
def __init__(self, slots):
self.tree = IntervalTree(slots)
def timeslot_is_available(self, start: datetime, end: datetime, attributes: set):
candidate = Slot(start.timestamp(), end.timestamp(), dict(type=SlotType.RESERVED, attributes=attributes))
slots = list(self.tree[start.timestamp():end.timestamp()])
return self.model_is_consistent(slots + [candidate])
To query whether a specific slot is available, I take only the slots relevant at that specific time (self.tree[..:..]), which reduces the complexity of the calculation to a localised subset:
| | +-+ = availability
+-|------|-+ xxx = reservation
| +---|------+
xx|x xxx|x
| xxxx|
| |
Then I confirm the consistency within that narrow slice:
#staticmethod
def model_is_consistent(slots):
def can_handle(r):
return lambda a: r.attributes <= a.attributes and a.contains_interval(r)
av = [s for s in slots if s.type == SlotType.AVAILABLE]
rs = [s for s in slots if s.type == SlotType.RESERVED]
p = Problem()
p.addConstraint(AllDifferentConstraint())
p.addVariables(range(len(rs)), av)
for i, r in enumerate(rs):
p.addConstraint(can_handle(r), (i,))
return p.getSolution() is not None
(I'm omitting some optimisations and other code here.)
This part is the hybrid approach of Arne's and tinker's suggestions. It uses constraint-based programming to find matching slots, using the matrix algorithm suggested by tinker. Basically: if there's any solution to this problem in which all reservations can be assigned to a different available slot, then this time slice is in a consistent state. Since I'm passing in the desired reservation slot, if the model is still consistent including that slot, this means it's safe to reserve that slot.
This is still problematic if there are two short reservations assignable to the same availability within this narrow window, but the chances of that are low and the result is merely a false negative for an availability query; false positives would be more problematic.
Finding available slots
Finding all available slots is a more complex problem, so again some simplification is necessary. First, it's only possible to query the model for availabilities for a particular set of tags (there's no "give me all globally available slots"), and secondly it can only be queried with a particular granularity (desired slot length). This suits me well for my particular use case, in which I just need to offer users a list of slots they can reserve, like 9:15-9:30, 9:30-9:45, etc.. This makes the algorithm very simple by reusing the above code:
def free_slots(self, start: datetime, end: datetime, attributes: set, granularity: timedelta):
slots = []
while start < end:
slot_end = start + granularity
if self.timeslot_is_available(start, slot_end, attributes):
slots.append((start, slot_end))
start += granularity
return slots
In other words, it just goes through all possible slots during the given time interval and literally checks whether that slot is available. It's a bit of a brute-force solution, but works perfectly fine.

How to implement semi-random groups with Ruby?

suppose I've got groups:
{1=>[1,1,1,1,1,1], 2=>[2,2,2], 3=>[3,3,3,3,3,3], 4=>[4,4,4,4,4,4]}
the keys represent teams, and the values within the arrays represent employees. Imagine I wish to match employees in a semi-random way. I want to make groups of 3's-5's like this:
[1,1,2,3,5], [1,2,3,4], [1,2,3,3], [1,3,4,1,4]
I have the wish to create groups and have a bias for matching team members of opposite teams, but not an absolute bias. Also you must match every member of each team with a group.
How would you solve this?
This is how I've done it:
group_by_team = records.group_by {|x| x.team_id}.values
mixed_groups = group_by_team.each{|x| x.shuffle!}
# take 1 element from each team and mix
# the number of teams is defined as a constant so we don't have to hit the db with a count
for index in (1..(TEAMS-1))
zipped_groups ||= mixed_groups[0]
zipped_groups = zipped_groups.zip(mixed_groups[index])
end
# flatten the arrays to produce one large Array
# remove nil values from ziped steps with compact!
zipped_groups = zipped_groups.flatten!.compact!
lunch_groups = zipped_groups.each_slice(3)
# we can no longer reduce table size, so lets join two small lunch groups
if lunch_groups.any?{|x| x.size<3}
lunch_groups = self.merge_last_two(lunch_groups)
end
But the problems with my implementation are vast. Groups size is fixed at 3. And its no exactly elegant, or efficient.
How would you make semi-random groups happen?

find endpoints for range given a value within the range

I am trying to solve a simple problem, but at the moment I cannot think of a better solution. I am testing an API that is not documented.
There is an ID used to fetch objects and it has a min and max value with random values missing in-between. I'm trying to test the responses I receive for random objects, but to find objects, I need to have valid IDs.
It would be very inefficient to test random numbers and hope that I get an object back. The best I can do is find a range, get a random number between that range and check if it exists before conducting tests.
A sample list of all of the IDs in the database might look like this:
[1005, 25984, 25986, 29587, 30000, ...]
Assuming the deviation from one value to another will never exceed C, e.g. from the first value to the next value, the difference will never be greater than a pre-defined constant, how would you calculate the min/max of the range given only one value in the range?
Starting from a given value and looping until the last value is found is horrible but that is how it was implemented by previous devs. Below is pseudocode that more or less covers what they do.
// this can be any valid object ID from the database
// assuming the ID's in the database are [1005, 25984, 25986, 29587, 30000]
// "i" could be any one of these values
var i = givenPredefinedObjectId;
var deviation = 100;
// objectWithIdExists() is going to lookup an object with the ID "i" in the database
// if there is no object with the ID "i" , it will return false
// otherwise the object will get tested and return true
while(objectWithIdExists(i)){
i++;
}
for(i; i < i+deviation; i++){
if(objectWithIdExists(i)){
goto while loop;
}
}
endPoint = i - deviation;
Assuming there is no knowledge about the possible values except you can check if they exist and you are given one valid value (there is no array with all possible IDs, that was just an example), how would you find the min/max values?
Unbounded binary search is feasible, with a factor of C slowdown. Given an algorithm for unbounded binary search that, given access to the oracle less_equal(n) for some natural number n, returns n in time O(log n), implement the oracle on input k by querying all of the IDs C*k, C*k+1, ..., C*k+C-1 and reporting that k is less than or equal to n if and only if one ID is found. The running time is O(C*log((max-min)/C)).

Resources