Say that I have a list (or array) that links Suppliers with the materials they supply. For example, an array of the form
[[Supplier_1, Material_a], [Supplier_2, Material_a], [Supplier_3, Material_a], [Supplier_1, Material_b], [Supplier_2, Material_c], [Supplier_3, Material_b], ...]
I am interested in finding the list of suppliers that supply at least k of the materials that a particular supplier, say Supplier_1, supplies.
One way that I can think of is to pair all suppliers with Supplier_1 for each material Supplier_1 supplies
[[Supplier_1, Supplier_2, Material_a], [Supplier_1, Supplier_3, Material_a], [Supplier_1, Supplier_3, Material_b]...]
and then count the number of times each pair is present
[[Supplier_1, Supplier_2, 1], [Supplier_1, Supplier_3, 2]...]
The problem is that this approach can be very time consuming since the list provided can be quite long. I was wondering if there is a better way to do this.
You would put the materials of Supplier_1 in a hash set, so that you can verify for any material whether it is supplied by Supplier_1 in constant time.
Once you have that you can iterate the data again, and in a dictionary (hash map) keep a count per supplier which you increment each time the material is in the above mentioned set.
In Python it would look like this:
def getsuppliers(pairs, selected_supplier, k):
    materialset = set()
    countmap = {}  # a dictionary with <key=supplier, value=count> pairs
    for supplier, material in pairs:
        if supplier == selected_supplier:
            materialset.add(material)
        countmap[supplier] = 0
    # An optional quick exit: if the selected supplier does not have k materials,
    # there is no use in continuing...
    if len(materialset) < k:
        return []  # no supplier meets the requirement
    for supplier, material in pairs:
        if material in materialset:
            countmap[supplier] = countmap[supplier] + 1
    result = []
    for supplier, count in countmap.items():
        if count >= k:
            result.append(supplier)
    return result
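For example, with the pairs from the question (a hypothetical call):

pairs = [
    ["Supplier_1", "Material_a"], ["Supplier_2", "Material_a"], ["Supplier_3", "Material_a"],
    ["Supplier_1", "Material_b"], ["Supplier_2", "Material_c"], ["Supplier_3", "Material_b"],
]
print(getsuppliers(pairs, "Supplier_1", 2))
# ['Supplier_1', 'Supplier_3'] -- both supply at least 2 of Supplier_1's materials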
NB: this would include the selected supplier also, provided it has at least k materials.
All operations within each individual loop body have constant time complexity, so the overall time complexity is O(n), where n is the size of the input list (pairs).
We're learning about hash tables in my data structures and algorithms class, and I'm having trouble understanding separate chaining.
I know the basic premise: each bucket has a pointer to a Node that contains a key-value pair, and each Node contains a pointer to the next (potential) Node in the current bucket's mini linked list. This is mainly used to handle collisions.
Now, suppose for simplicity that the hash table has 5 buckets. Suppose I wrote the following lines of code in my main after creating an appropriate hash table instance.
myHashTable["rick"] = "Rick Sanchez";
myHashTable["morty"] = "Morty Smith";
Let's imagine whatever hashing function we're using just so happens to produce the same bucket index for both string keys rick and morty. Let's say that bucket index is index 0, for simplicity.
So at index 0 in our hash table, we have two nodes with values of Rick Sanchez and Morty Smith, in whatever order we decide to put them in (the first pointing to the second).
When I want to display the corresponding value for rick, which is Rick Sanchez per our code here, the hashing function will produce the bucket index of 0.
How do I decide which node needs to be returned? Do I loop through the nodes until I find the one whose key matches rick?
To resolve hash table collisions, that is, to put or get an item whose hash value collides with that of another key, you fall back on the data structure that backs each bucket of the hash table; this is generally a linked list. A collision is the worst case for the hash table: you end up with an O(n) scan to reach the correct item in that bucket's linked list. So yes, it is a loop, as you said, that searches for the item with the matching key. But in implementations that use a balanced tree for the bucket instead, the lookup can be O(log n), as in the Java 8 implementation.
As JEP 180: Handle Frequent HashMap Collisions with Balanced Trees says:
The principal idea is that once the number of items in a hash bucket
grows beyond a certain threshold, that bucket will switch from using a
linked list of entries to a balanced tree. In the case of high hash
collisions, this will improve worst-case performance from O(n) to
O(log n).
This technique has already been implemented in the latest version of
the java.util.concurrent.ConcurrentHashMap class, which is also slated
for inclusion in JDK 8 as part of JEP 155. Portions of that code will
be re-used to implement the same idea in the HashMap and LinkedHashMap
classes.
I strongly suggest always looking at an existing implementation. To name one, you could look at the Java 7 implementation. That will improve your code-reading skills, which is almost more important than writing code, and something you do more often. I know it is more effort, but it will pay off.
For example, take a look at the Hashtable.get method from Java 7:
public synchronized V get(Object key) {
    Entry<?,?> tab[] = table;
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % tab.length;
    for (Entry<?,?> e = tab[index] ; e != null ; e = e.next) {
        if ((e.hash == hash) && e.key.equals(key)) {
            return (V)e.value;
        }
    }
    return null;
}
Here we see that if ((e.hash == hash) && e.key.equals(key)) is trying to find the correct item with the matching key.
And here is the full source code: Hashtable.java
This is an algorithmic question about a somewhat complex problem. The foundation is this:
A scheduling system based on available slots and reserved slots. Slots have certain criteria; let's call them tags. A reservation is matched to an available slot by those tags: the available slot's tag set must be a superset of the reserved slot's tag set.
As a concrete example, take this scenario:
11:00         12:00         13:00
+--------------------+
|        A, B        |
+--------------------+
              +--------------------+
              |        C, D        |
              +--------------------+
Between 11:00 and 12:30, reservations for the tags A and B can be made; from 12:00 to 13:30, C and D are available; and there's an overlap from about 12:00 to 12:30.
11:00         12:00         13:00
+--------------------+
|        A, B        |
+--------------------+
              +--------------------+
              |        C, D        |
              +--------------------+
    xxxxxxxxx
    x   A   x
    xxxxxxxxx
Here a reservation for A has been made, so no other reservations for A or B can be made between 11:15-ish and 12:00-ish.
That's the idea in a nutshell. There are no specific limitations for the available slots:
an available slot can contain any number of tags
any number of slots can overlap at any time
slots are of arbitrary length
reservations can contain any number of tags
The only rule that needs to be obeyed in the system is:
when adding a reservation, at least one remaining available slot must match all the tags in the reservation
To clarify: when there are two available slots at the same time with, say, tag A, then two reservations for A can be made at that time, but no more.
I have that working with a modified implementation of an interval tree; as a quick overview:
all available slots are added to the interval tree (duplicates/overlaps are preserved)
all reserved slots are iterated and:
all available slots matching the time of the reservation are queried from the tree
the first of those matching the reservation's tags is sliced and the slice removed from the tree
When that process is finished, what's left are the remaining slices of available slots, and I can query whether a new reservation can be made for a particular time and add it.
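For illustration, a rough sketch of that slicing step (simplified, written here against the Python intervaltree package; slot dictionaries are shaped like the ones shown below):

from intervaltree import IntervalTree

def carve_out(tree: IntervalTree, reservation):
    # Look at the availabilities overlapping the reservation's time window.
    for candidate in sorted(tree[reservation['begin']:reservation['end']]):
        covers_time = (candidate.begin <= reservation['begin']
                       and reservation['end'] <= candidate.end)
        # candidate.data is assumed to be the availability dict shown below
        if covers_time and all(t in candidate.data['tags'] for t in reservation['tags']):
            tree.remove(candidate)
            # re-add the unreserved remainders of the matched availability
            if candidate.begin < reservation['begin']:
                tree.addi(candidate.begin, reservation['begin'], candidate.data)
            if reservation['end'] < candidate.end:
                tree.addi(reservation['end'], candidate.end, candidate.data)
            return True
    return False  # no available slot covers this reservation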
Data structures look something like this:
{
    type: 'available',
    begin: 1497857244,
    end: 1497858244,
    tags: [{ foo: 'bar' }, { baz: 42 }]
}

{
    type: 'reserved',
    begin: 1497857345,
    end: 1497857210,
    tags: [{ foo: 'bar' }]
}
Tags are themselves key-value objects, a list of them is a "tag set". Those could be serialised if it helps; so far I'm using a Python set type which makes comparison easy enough. Slot begin/end times are UNIX time stamps within the tree. I'm not particularly married to these specific data structures and can refactor them if it's useful.
The problem I'm facing is that this doesn't work bug-free: every once in a while a reservation sneaks into the system that conflicts with other reservations, and I haven't yet figured out exactly how that can happen. It's also not very clever when tags overlap in a complex way, where the optimal distribution needs to be calculated so that all reservations can be fitted into the available slots as well as possible; in fact, it is currently non-deterministic how reservations are matched to available slots in overlapping scenarios.
What I want to know is: interval trees are mostly great for this purpose, but my current system to add tag set matching as an additional dimension to this is clunky and bolted-on; is there a data structure or algorithm that can handle this in an elegant way?
Actions that must be supported:
Querying the system for available slots that match certain tag sets (taking into account reservations that may reduce availability but are not themselves part of said tag set; e.g. in the example above querying for an availability for B).
Ensuring no reservations can be added to the system which don't have a matching available slot.
Your problem can be solved using constraint programming. In Python this can be implemented using the python-constraint library.
First, we need a way to check whether two slots are consistent with each other: a function that returns True unless the two slots share a tag and their timeframes overlap. In Python it can be implemented as follows:
def checkNoOverlap(slot1, slot2):
    shareTags = False
    for tag in slot1['tags']:
        if tag in slot2['tags']:
            shareTags = True
            break
    if not shareTags:
        return True
    # the slots conflict only if they also overlap in time; the third
    # condition covers the case where slot1 completely contains slot2
    return not (slot2['begin'] <= slot1['begin'] <= slot2['end'] or
                slot2['begin'] <= slot1['end'] <= slot2['end'] or
                slot1['begin'] <= slot2['begin'] <= slot1['end'])
I was not sure whether you wanted the tags to be completely the same (like {foo: bar} equals {foo: bar}) or only the keys (like {foo: bar} equals {foo: qux}), but you can change that in the function above.
Consistency check
We can use the python-constraint module for the two kinds of functionality you requested.
The second functionality is the easiest. To implement it, we can use the function isConsistent(set), which takes a list of slots in the provided data structure as input. The function feeds all the slots to python-constraint and checks whether the list of slots is consistent (no two slots that shouldn't overlap do overlap), returning the result.
from constraint import Problem

def isConsistent(set):
    # initialize the python-constraint context
    problem = Problem()
    # add all slots to the context as variables with a singleton domain
    for i in range(len(set)):
        problem.addVariable(i, [set[i]])
    # add a constraint for each possible pair of slots
    for i in range(len(set)):
        for j in range(len(set)):
            # we don't want slots to be checked against themselves
            if i == j:
                continue
            # this constraint uses the checkNoOverlap function
            problem.addConstraint(lambda a, b: checkNoOverlap(a, b), (i, j))
    # getSolutions returns all the possible combinations of domain elements;
    # because all domains are singletons, this returns either a list of length 1 (consistent) or 0 (inconsistent)
    return not len(problem.getSolutions()) == 0
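A quick hypothetical check, using slot dictionaries shaped like the ones in the question:

r1 = {'type': 'reserved', 'begin': 1497857300, 'end': 1497857400, 'tags': [{'foo': 'bar'}]}
r2 = {'type': 'reserved', 'begin': 1497857350, 'end': 1497857450, 'tags': [{'foo': 'bar'}]}
r3 = {'type': 'reserved', 'begin': 1497858000, 'end': 1497858100, 'tags': [{'foo': 'bar'}]}

print(isConsistent([r1, r2]))  # False: same tag, overlapping times
print(isConsistent([r1, r3]))  # True: same tag, but the times do not overlap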
This function can be called whenever a user wants to add a reservation slot. The input slot can be added to the list of already existing slots and the consistency can be checked. If it is consistent, the new slot can be reserved; otherwise, the new slot overlaps and should be rejected.
Finding available slots
This problem is a bit trickier. We can use the same functionality as above with a few significant changes. Instead of adding the new slot together with the existing slot, we now want to add all possible slots to the already existing slots. We can then check the consistency of all those possible slots with the reserved slots and ask the constraint system for the combinations that are consistent.
Because the number of possible slots would be infinite if we didn't put any restrictions on it, we first need to declare some parameters for the program:
MIN = 149780000     # available time slots can never start earlier than this time
MAX = 149790000     # available time slots can never start later than this time
GRANULARITY = 1*60  # possible time slots are always at least one minute apart from each other
We can now continue to the main function. It looks a lot like the consistency check, but instead of the new slot from the user, we now add a variable to discover all available slots.
def availableSlots(tags, set):
    # same as above
    problem = Problem()
    for i in range(len(set)):
        problem.addVariable(i, [set[i]])
    # add an extra variable for the available slot, with a domain of all possible slots
    problem.addVariable(len(set), generatePossibleSlots(MIN, MAX, GRANULARITY, tags))
    for i in range(len(set) + 1):
        for j in range(len(set) + 1):
            if i == j:
                continue
            problem.addConstraint(lambda a, b: checkNoOverlap(a, b), (i, j))
    # extract the available time slots from the solution for clean output
    return filterAvailableSlots(problem.getSolutions())
I use some helper functions to keep the code cleaner. They are included here.
def filterAvailableSlots(possibleCombinations):
    result = []
    for slots in possibleCombinations:
        for key, slot in slots.items():
            if slot['type'] == 'available':
                result.append(slot)
    return result

def generatePossibleSlots(min, max, granularity, tags):
    possibilities = []
    for i in range(min, max - 1, granularity):
        for j in range(i + 1, max, granularity):
            possibleSlot = {
                'type': 'available',
                'begin': i,
                'end': j,
                'tags': tags
            }
            possibilities.append(possibleSlot)
    return tuple(possibilities)
You can now use the function availableSlots(tags, set) with the tags for which you want the available slots and a set of already reserved slots. Note that this function really returns all the consistent possible slots, so no effort is made to find the one of maximum length or to do other optimizations.
Hope this helps! (I got it to work as you described, in PyCharm.)
Here's a solution, I'll include all the code below.
1. Create a table of slots, and a table of reservations
2. Create a matrix of reservations x slots, populated with true or false values based on whether that reservation-slot combination is possible
3. Figure out the best mapping that allows for the most Reservation-Slot Combinations
Note: my current solution scales poorly with very large arrays as it involves looping through all possible permutations of a list with size = number of slots. I've posted another question to see if anyone can find a better way of doing this. However, this solution is accurate and can be optimized
Python Code Source
Part 1
from IPython.display import display
import pandas as pd
import datetime
available_data = [
['SlotA', datetime.time(11, 0, 0), datetime.time(12, 30, 0), set(list('ABD'))],
['SlotB',datetime.time(12, 0, 0), datetime.time(13, 30, 0), set(list('C'))],
['SlotC',datetime.time(12, 0, 0), datetime.time(13, 30, 0), set(list('ABCD'))],
['SlotD',datetime.time(12, 0, 0), datetime.time(13, 30, 0), set(list('AD'))],
]
reservation_data = [
['ReservationA', datetime.time(11, 15, 0), datetime.time(12, 15, 0), set(list('AD'))],
['ReservationB', datetime.time(11, 15, 0), datetime.time(12, 15, 0), set(list('A'))],
['ReservationC', datetime.time(12, 0, 0), datetime.time(12, 15, 0), set(list('C'))],
['ReservationD', datetime.time(12, 0, 0), datetime.time(12, 15, 0), set(list('C'))],
['ReservationE', datetime.time(12, 0, 0), datetime.time(12, 15, 0), set(list('D'))]
]
reservations = pd.DataFrame(data=reservation_data, columns=['reservations', 'begin', 'end', 'tags']).set_index('reservations')
slots = pd.DataFrame(data=available_data, columns=['slots', 'begin', 'end', 'tags']).set_index('slots')
display(slots)
display(reservations)
Part 2
def is_possible_combination(r):
return (r['begin'] >= slots['begin']) & (r['end'] <= slots['end']) & (r['tags'] <= slots['tags'])
solution_matrix = reservations.apply(is_possible_combination, axis=1).astype(int)
display(solution_matrix)
Part 3
import numpy as np
from itertools import permutations
# add dummy columns to make the matrix square if it is not
sqr_matrix = solution_matrix
if sqr_matrix.shape[0] > sqr_matrix.shape[1]:
    # uhoh, there are more reservations than slots... this can't be good
    for i in range(sqr_matrix.shape[0] - sqr_matrix.shape[1]):
        sqr_matrix.loc[:, 'FakeSlot' + str(i)] = [1] * sqr_matrix.shape[0]
elif sqr_matrix.shape[0] < sqr_matrix.shape[1]:
    # there are more slots than customers, why doesn't anyone like us?
    for i in range(sqr_matrix.shape[1] - sqr_matrix.shape[0]):
        sqr_matrix.loc['FakeCustomer' + str(i)] = [1] * sqr_matrix.shape[1]

# we only want the values now
A = solution_matrix.values.astype(int)
# make an identity matrix (the perfect map)
imatrix = np.diag([1] * A.shape[0])
# swap columns of the identity matrix (trying every permutation) until the best map is found
n = A.shape[0]
# this will hold the map that works the best
best_map_so_far = np.zeros([1, 1])
for column_order in permutations(range(n)):
    # this is an identity matrix with the columns swapped according to the permutation
    imatrix = np.zeros(A.shape)
    for row, column in enumerate(column_order):
        imatrix[row, column] = 1
    # is this map better than the previous best?
    if sum(sum(imatrix * A)) > sum(sum(best_map_so_far)):
        best_map_so_far = imatrix
    # could it be? a perfect map??
    if sum(sum(imatrix * A)) == n:
        break

if sum(sum(best_map_so_far * A)) != n:
    print('a perfect map was not found')
output = pd.DataFrame(A * best_map_so_far, columns=solution_matrix.columns, index=solution_matrix.index, dtype=int)
display(output)
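As an aside (not part of the original approach): the "find the best mapping" step is an instance of the assignment problem, so for larger inputs the permutation loop could be swapped for SciPy's Hungarian-algorithm solver. A rough sketch, assuming the padded 0/1 matrix A from above:

from scipy.optimize import linear_sum_assignment

def best_mapping(A):
    # negate the matrix so the minimizing solver maximizes the number of matches
    rows, cols = linear_sum_assignment(-A)
    # keep only the pairs that correspond to feasible reservation-slot combinations
    return [(r, c) for r, c in zip(rows, cols) if A[r, c] == 1]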
The suggested approaches by Arne and tinker were both helpful, but not ultimately sufficient. I came up with a hybrid approach that solves it well enough.
The main problem is that it's a three-dimensional issue, which is difficult to solve in all dimensions at once. It's not just about matching a time overlap or a tag overlap, it's about matching time slices with tag overlaps. It's simple enough to match slots to other slots based on time and even tags, but it's then pretty complicated to match an already matched availability slot to another reservation at another time. Meaning, this scenario in which one availability can cover two reservations at different times:
+-------------------+
|       A, B        |
+-------------------+
  xxxxx     xxxxx
  x A x     x A x
  xxxxx     xxxxx
Trying to fit this into constraint based programming requires an incredibly complex relationship of constraints which is hardly manageable. My solution to this was to simplify the problem…
Removing one dimension
Instead of solving all dimensions at once, it simplifies the problem enormously to largely remove the dimension of time. I did this by using my existing interval tree and slicing it as needed:
def __init__(self, slots):
    self.tree = IntervalTree(slots)

def timeslot_is_available(self, start: datetime, end: datetime, attributes: set):
    candidate = Slot(start.timestamp(), end.timestamp(),
                     dict(type=SlotType.RESERVED, attributes=attributes))
    slots = list(self.tree[start.timestamp():end.timestamp()])
    return self.model_is_consistent(slots + [candidate])
To query whether a specific slot is available, I take only the slots relevant at that specific time (self.tree[..:..]), which reduces the complexity of the calculation to a localised subset:
| | +-+ = availability
+-|------|-+ xxx = reservation
| +---|------+
xx|x xxx|x
| xxxx|
| |
Then I confirm the consistency within that narrow slice:
# uses python-constraint: from constraint import AllDifferentConstraint, Problem

@staticmethod
def model_is_consistent(slots):
    def can_handle(r):
        return lambda a: r.attributes <= a.attributes and a.contains_interval(r)

    av = [s for s in slots if s.type == SlotType.AVAILABLE]
    rs = [s for s in slots if s.type == SlotType.RESERVED]
    p = Problem()
    p.addConstraint(AllDifferentConstraint())
    p.addVariables(range(len(rs)), av)
    for i, r in enumerate(rs):
        p.addConstraint(can_handle(r), (i,))
    return p.getSolution() is not None
(I'm omitting some optimisations and other code here.)
This part is the hybrid approach of Arne's and tinker's suggestions. It uses constraint-based programming to find matching slots, using the matrix algorithm suggested by tinker. Basically: if there's any solution to this problem in which all reservations can be assigned to a different available slot, then this time slice is in a consistent state. Since I'm passing in the desired reservation slot, if the model is still consistent including that slot, this means it's safe to reserve that slot.
This is still problematic if there are two short reservations assignable to the same availability within this narrow window, but the chances of that are low and the result is merely a false negative for an availability query; false positives would be more problematic.
Finding available slots
Finding all available slots is a more complex problem, so again some simplification is necessary. First, it's only possible to query the model for availabilities for a particular set of tags (there's no "give me all globally available slots"), and secondly it can only be queried with a particular granularity (desired slot length). This suits me well for my particular use case, in which I just need to offer users a list of slots they can reserve, like 9:15-9:30, 9:30-9:45, etc.. This makes the algorithm very simple by reusing the above code:
def free_slots(self, start: datetime, end: datetime, attributes: set, granularity: timedelta):
    slots = []
    while start < end:
        slot_end = start + granularity
        if self.timeslot_is_available(start, slot_end, attributes):
            slots.append((start, slot_end))
        start += granularity
    return slots
In other words, it just goes through all possible slots during the given time interval and literally checks whether that slot is available. It's a bit of a brute-force solution, but works perfectly fine.
Say I have a hashtable. For instance, I have two entities like
john = { 1stname: john, 2ndname: johnson },
eric = { 1stname: eric, 2ndname: ericson }
Then I put them in hashtable:
ht["john"] = john;
ht["eric"] = eric;
Let's imagine there is a collision and the hashtable uses chaining to fix it. As a result there should be a linked list containing these two entities, one pointing to the next.
How does the hashtable understand which entity should be returned for a key? The hash values are the same, and it knows nothing about the entities' structure. For instance, if I write var val = ht["john"];, how does the hashtable (having only the key value and its hash) find out that the value should be the john record and not eric?
I think what you are confused about is what is stored at each location in the hashtable's bucket list. It seems like you assume that only the value is being stored. In fact, the data in each list node is a tuple (key, value).
Once you ask for ht['john'], the hashtable finds the list associated with hash('john') and, if the list is not empty, searches for the key 'john' in it. If the key is found as the first element of a tuple, then the value (the second element of the tuple) is returned. If the key is not found, then the element is not in the hashtable.
To summarize, the key's hash is used to quickly identify the cell in which the element should be stored if present. Actual key equality is tested to decide whether the key exists or not.
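For illustration, a minimal sketch (hypothetical, not any particular library's implementation) of that lookup, with each bucket holding a list of (key, value) pairs:

def bucket_get(buckets, key):
    index = hash(key) % len(buckets)          # the hash picks the bucket
    for stored_key, stored_value in buckets[index]:
        if stored_key == key:                 # equality on the key, not the hash, decides the match
            return stored_value
    raise KeyError(key)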
Is this what you are asking for? I have already put this in the comments, but it seems to me you did not follow the link.
Collision Resolution in the Hashtable Class
Recall that when inserting an item into or retrieving an item from a hash table, a collision can occur. When inserting an item, an open slot must be found. When retrieving an item, the actual item must be found if it is not in the expected location. Earlier we briefly examined two collision resolution strategies:
Linear probing
Quadratic probing
The Hashtable class uses a different technique referred to as rehashing. (Some sources refer to rehashing as double hashing.)
Rehashing works as follows: there is a set of different hash functions, H1 ... Hn, and when inserting or retrieving an item from the hash table, initially the H1 hash function is used. If this leads to a collision, H2 is tried instead, and onwards up to Hn if needed. The previous section showed only one hash function, which is the initial hash function (H1). The other hash functions are very similar to this function, only differentiating by a multiplicative factor. In general, the hash function Hk is defined as:
Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1)))] % hashsize
Mathematical Note: With rehashing it is important that each slot in the hash table is visited exactly once when hashsize number of probes are made. That is, for a given key you don't want Hi and Hj to hash to the same slot in the hash table. With the rehashing formula used by the Hashtable class, this property is maintained if the result of (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1))) and hashsize are relatively prime. (Two numbers are relatively prime if they share no common factors.) These two numbers are guaranteed to be relatively prime if hashsize is a prime number.
Rehashing provides better collision avoidance than either linear or quadratic probing.
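For illustration only (a Python sketch, not the actual Hashtable source), the probe sequence defined by the formula above could be generated like this, given some GetHash-like function:

def probe_sequence(get_hash, key, hashsize):
    """Yield the sequence of slots H1(key), H2(key), ... used for rehashing."""
    h = get_hash(key)
    increment = 1 + (((h >> 5) + 1) % (hashsize - 1))
    for k in range(hashsize):
        # when hashsize is prime, this visits every slot exactly once
        yield (h + k * increment) % hashsize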
sources here
I have a problem, and I'm not too sure how to solve it without going down the route of inefficiency. Say I have a list of words:
Apple
Ape
Arc
Abraid
Bridge
Braide
Bray
Boolean
What I want to do is process this list and get what each word starts with up to a certain depth, e.g.
a - Apple, Ape, Arc, Abraid
ab - Abraid
ar - Arc
ap - Apple, Ape
b - Bridge, Braide, Bray, Boolean
br - Bridge, Braide, Bray
bo - Boolean
Any ideas?
You can use a Trie structure.
(root)
/
a - b - r - a - i - d
/ \ \
p r e
/ \ \
p e c
/
l
/
e
Just find the node that you want and get all its descendants, e.g., if I want ap-:
(root)
/
a - b - r - a - i - d
/ \ \
[p] r e
/ \ \
p e c
/
l
/
e
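For reference, a minimal sketch of that idea in Python (names are illustrative, not from any particular library): build a character trie, walk down to the node for the prefix, then collect every word beneath it.

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node['$'] = word  # mark the end of a word, remembering the original spelling
    return root

def words_with_prefix(root, prefix):
    node = root
    for ch in prefix.lower():
        if ch not in node:
            return []
        node = node[ch]
    # collect all words in the subtree under the prefix node
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == '$':
                found.append(child)
            else:
                stack.append(child)
    return found

# e.g. words_with_prefix(build_trie(['Apple', 'Ape', 'Arc', 'Abraid']), 'ap')
# -> ['Apple', 'Ape'] (order may vary)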
Perhaps you're looking for something like:
#!/usr/bin/env python
def match_prefix(pfx, seq):
    '''return subset of seq that starts with pfx'''
    results = list()
    for i in seq:
        if i.startswith(pfx):
            results.append(i)
    return results

def extract_prefixes(lngth, seq):
    '''return all prefixes in seq of the length specified'''
    results = dict()
    lngth += 1
    for i in seq:
        if i[0:lngth] not in results:
            results[i[0:lngth]] = True
    return sorted(results.keys())

def gen_prefix_indexed_list(depth, seq):
    '''return a dictionary of all words matching each prefix
       up to depth keyed on these prefixes'''
    results = dict()
    for each in range(depth):
        for prefix in extract_prefixes(each, seq):
            results[prefix] = match_prefix(prefix, seq)
    return results

if __name__ == '__main__':
    words = '''Apple Ape Arc Abraid Bridge Braide Bray Boolean'''.split()
    test = gen_prefix_indexed_list(2, words)
    for each in sorted(test.keys()):
        print "%s:\t\t" % each,
        print ' '.join(test[each])
That is, you want to generate all the prefixes that are present in a list of words, of lengths between one and some number you'll specify (2 in this example). Then you want to produce an index of all words matching each of these prefixes.
I'm sure there are more elegant ways to do this. For a quick and easily explained approach I've just built this from a simple bottom-up functional decomposition of the apparent spec. The end-result values are lists, each matching a given prefix, so we start with a function to filter out such matches from our inputs. The end-result keys are all prefixes between 1 and some N that appear in our input, so we need a function to extract those. Then our spec is an extremely straightforward nested loop around those.
Of course this nested loop might be a problem. Such things usually equate to O(n^2) efficiency. As shown, this will iterate over the original list C * N * N times (C is the constant number representing the prefixes of length 1, 2, etc., while N is the length of the list).
If this decomposition provides the desired semantics then we can look at improving the efficiency. The obvious approach would be to lazily generate the dictionary keys as we iterate once over the list ... for each word, for each prefix length, generate the key ... append this word to the list/value stored at that key ... and continue to the next word.
There's still a nested loop ... but it's the short loop for each key/prefix length. That alternative design has the advantage of allowing us to iterate over lists of words from any iterable, not just an in memory list. So we could iterate over lines of a file, results generated from a database query, etc --- without incurring the memory overhead of keeping the entire original word list in memory.
Of course we're still storing the dictionary in memory. However, we can also change that and decouple the logic from the input and storage. When we append each input to the various prefix/key values, we don't care whether they're lists in a dictionary, lines in a set of files, or values being pulled out of (and pushed back into) a DBM or other key/value store (for example CouchDB or some other "NoSQL" clustered database).
The implementation of that is left as an exercise to the reader.
I don't know what you are thinking about when you say "route of inefficiency", but a pretty obvious solution (possibly the one you are thinking about) comes to mind. A trie looks like a structure for this kind of problem, but it's costly in terms of memory (there is a lot of duplication) and I'm not sure it makes things faster in your case. Maybe the memory usage would pay off if the information was to be retrieved many times, but your answer suggests you want to generate the output file once and store it. So in your case the trie would be generated just to be traversed once. I don't think it makes sense.
My suggestion is to just sort the list of words in lexical order and then traverse the list in order, as many times as the maximum beginning length.
create a dictionary with keys being strings and values being lists of strings

for(i = 1 to maxBeginningLength)
{
    for(every word in your sorted list)
    {
        if(the word's length is no less than i)
        {
            add the word to the list in the dictionary at a key
            being the beginning of the word of length i
        }
    }
}

store contents of the dictionary to the file
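A rough Python rendering of that pseudocode (the output format written to the file is just an assumption):

from collections import defaultdict

def build_prefix_index(words, max_beginning_length, path):
    index = defaultdict(list)  # prefix -> list of words starting with it
    for i in range(1, max_beginning_length + 1):
        for word in sorted(words):            # lexical order, as suggested above
            if len(word) >= i:
                index[word[:i]].append(word)
    # store the contents of the dictionary to the file
    with open(path, 'w') as f:
        for prefix in sorted(index):
            f.write('%s - %s\n' % (prefix, ', '.join(index[prefix])))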
Using this PHP trie implementation will get you about 50% there. It's got some stuff you don't need and it doesn't have a "search by prefix" method, but you can write one yourself easily enough.
$trie = new Trie();
$trie->add('Apple', 'Apple');
$trie->add('Ape', 'Ape');
$trie->add('Arc', 'Arc');
$trie->add('Abraid', 'Abraid');
$trie->add('Bridge', 'Bridge');
$trie->add('Braide', 'Braide');
$trie->add('Bray', 'Bray');
$trie->add('Boolean', 'Boolean');
It builds up a structure like this:
Trie Object
(
    [A] => Trie Object
        (
            [p] => Trie Object
                (
                    [ple] => Trie Object
                    [e] => Trie Object
                )
            [rc] => Trie Object
            [braid] => Trie Object
        )
    [B] => Trie Object
        (
            [r] => Trie Object
                (
                    [idge] => Trie Object
                    [a] => Trie Object
                        (
                            [ide] => Trie Object
                            [y] => Trie Object
                        )
                )
            [oolean] => Trie Object
        )
)
If the words were in a Database (Access, SQL), and you wanted to retrieve all words starting with 'br', you could use:
Table Name: mytable
Field Name: mywords
"Select * from mytable where mywords like 'br*'" - For Access - or
"Select * from mytable where mywords like 'br%'" - For SQL