Effective sort-by-example algorithm

Assume we have an array of objects of length N (all objects have the same set of fields).
We also have a second array of length N containing values of one of those fields (e.g. an array of numbers representing IDs).
Now we want to sort the array of objects by that field, in the same order as the values appear in the second array.
For example, here are 2 arrays (as in description) and expected result:
A = [ {id: 1, color: "red"}, {id: 2, color: "green"}, {id: 3, color: "blue"} ]
B = [ "green", "blue", "red"]
sortByColorByExample(A, B) ==
[ {id: 2, color: "green"}, {id: 3, color: "blue"}, {id: 1, color: "red"} ]
How do I implement a 'sort-by-example' function efficiently? I can't come up with anything better than O(N^2).

This is assuming you have a bijection from elements in B to elements in A.
Build a map (say M) from B's elements to their positions: O(N).
For each element of A (O(N) of them), look it up in the map to find where it goes in the sorted array: O(log N) per lookup with an efficient tree map, or O(1) amortized with a hash map.
Total complexity: O(N log N) time and O(N) space.
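A minimal sketch of that approach in Python (field name and data taken from the example above):

def sort_by_example(A, B, field):
    # map each example value in B to its position: O(N)
    position = {value: i for i, value in enumerate(B)}
    # sort A by looking up each object's field in the map: O(N log N)
    return sorted(A, key=lambda obj: position[obj[field]])

A = [{'id': 1, 'color': 'red'}, {'id': 2, 'color': 'green'}, {'id': 3, 'color': 'blue'}]
B = ['green', 'blue', 'red']
print(sort_by_example(A, B, 'color'))
# [{'id': 2, 'color': 'green'}, {'id': 3, 'color': 'blue'}, {'id': 1, 'color': 'red'}]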

Suppose we are sorting on an item's colour. Then create a dictionary d that maps each colour to a list of the items in A that have that colour. Then iterate across the colours in the list B, and for each colour c output (and remove) a value from the list d[c]. This runs in O(n) time with O(n) extra space for the dictionary.
Note that you have to decide what to do if A cannot be sorted according to the examples in B: do you raise an error? Choose the order that maximizes the number of matches? Or what?
Anyway, here's a quick implementation in Python:
from collections import defaultdict

def sorted_by_example(A, B, key):
    """Return a list consisting of the elements from the sequence A in the
    order given by the sequence B. The function key takes an element
    of A and returns the value that is used to match elements from B.
    If A cannot be sorted by example, raise IndexError.
    """
    d = defaultdict(list)
    for a in A:
        d[key(a)].append(a)
    return [d[b].pop() for b in B]
>>> A = [{'id': 1, 'color': 'red'}, {'id': 2, 'color': 'green'}, {'id': 3, 'color': 'blue'}]
>>> B = ['green', 'blue', 'red']
>>> from operator import itemgetter
>>> sorted_by_example(A, B, itemgetter('color'))
[{'color': 'green', 'id': 2}, {'color': 'blue', 'id': 3}, {'color': 'red', 'id': 1}]
Note that this approach handles the case where there are multiple identical values in the sequence B, for example:
>>> A = 'proper copper coffee pot'.split()
>>> B = 'ccpp'
>>> ' '.join(sorted_by_example(A, B, itemgetter(0)))
'coffee copper pot proper'
Here when there are multiple identical values in B, we get the corresponding elements in A in reverse order, but this is just an artefact of the implementation: by using a collections.deque instead of a list (and popleft instead of pop), we could arrange to get the corresponding elements of A in the original order, if that were preferred.
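For instance, a sketch of that variant (same function as above, with a deque per key and popleft):

from collections import defaultdict, deque

def sorted_by_example_stable(A, B, key):
    d = defaultdict(deque)
    for a in A:
        d[key(a)].append(a)
    # popleft() hands back the matching elements in their original order in A
    return [d[b].popleft() for b in B]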

Make an array of arrays, call it C, of size B.length.
Loop through A. If an object has the color 'green', put it in C[0]; if 'blue', put it in C[1]; if 'red', put it in C[2] (in general, put each object in the bucket whose index is the position of its color in B).
When you're done, go through C and flatten it back out into your original structure.
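A sketch of this bucket idea in Python, using a position map instead of hard-coding the colors (the function name is illustrative):

def sort_by_example_buckets(A, B, field):
    position = {value: i for i, value in enumerate(B)}
    C = [[] for _ in B]                      # one bucket per example value
    for obj in A:
        C[position[obj[field]]].append(obj)  # drop each object into its bucket
    # flatten the buckets back into a single list
    return [obj for bucket in C for obj in bucket]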

Wouldn't something along the lines of a merge sort be better? Create B.length arrays, one for each element inside B, go through A and place each object into the appropriate smaller array, then when it's all done merge the arrays together. It should be around O(2n), which is just O(n).

Iterate through the first array and build a HashMap from field value to the list of objects with that value: O(n) (a list, because there may be duplicate values of the key field).
For example, the key 'green' will map to all objects whose field value is green.
Now iterate through the second array, fetch the list of objects for each value from the HashMap, and append them to the result array: O(k), where k is the number of distinct field values.
The total running time is O(n), but it requires some additional memory for the map and an auxiliary array.
In the end you get the array sorted as per your requirements.

Related

How to order a list according to an arbitrary order

I searched for a relevant question but couldn't find one. So my question is: how do I sort an array based on an arbitrary order? For example, let's say the ordering is:
order_of_elements = ['cc', 'zz', '4b', '13']
and my list to be sorted:
list_to_be_sorted = ['4b', '4b', 'zz', 'cc', '13', 'cc', 'zz']
so the result needs to be:
ordered_list = ['cc', 'cc', 'zz', 'zz', '4b', '4b', '13']
Please note that the reference list (order_of_elements) describes the ordering; I am not asking about sorting according to the alphabetically sorted indices of the reference list.
You can assume that the order_of_elements array includes all the possible elements.
Any pseudocode is welcome.
A simple and Pythonic way to accomplish this would be to compute an index lookup table for the order_of_elements array, and use the indices as the sorting key:
order_index_table = { item: idx for idx, item in enumerate(order_of_elements) }
ordered_list = sorted(list_to_be_sorted, key=lambda x: order_index_table[x])
The table reduces order lookup to O(1) (amortized) and thus does not change the time complexity of the sort.
(Of course it does assume that all elements in list_to_be_sorted are present in order_of_elements; if this is not necessarily the case then you would need a default return value in the key lambda.)
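For instance, unknown elements could be pushed to the end with a default index (the sentinel here is just one possible choice):

ordered_list = sorted(list_to_be_sorted,
                      key=lambda x: order_index_table.get(x, len(order_of_elements)))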
Since you have a limited number of possible elements, and if these elements are hashable, you can use a kind of counting sort.
Put all the elements of order_of_elements in a hashmap as keys, with counters as values. Traverse your list_to_be_sorted, incrementing the counter corresponding to the current element. To build ordered_list, go through order_of_elements and add each element the number of times indicated by its counter.
hashmap hm;
for e in order_of_elements {
    hm.add(e, 0);
}
for e in list_to_be_sorted {
    hm[e]++;
}
list ordered_list;
for e in order_of_elements {
    ordered_list.append(e, hm[e]); // append hm[e] copies of element e
}
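A direct Python translation of this pseudocode might look like the following sketch (it assumes, as stated in the question, that every element of list_to_be_sorted appears in order_of_elements):

def counting_sort_by_order(list_to_be_sorted, order_of_elements):
    counts = {e: 0 for e in order_of_elements}
    for e in list_to_be_sorted:
        counts[e] += 1
    ordered_list = []
    for e in order_of_elements:
        ordered_list.extend([e] * counts[e])  # append counts[e] copies of e
    return ordered_list

print(counting_sort_by_order(['4b', '4b', 'zz', 'cc', '13', 'cc', 'zz'],
                             ['cc', 'zz', '4b', '13']))
# ['cc', 'cc', 'zz', 'zz', '4b', '4b', '13']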
Approach:
1. Create an auxiliary array which will hold, for each element of the main array, its index in 'order_of_elements'.
2. Sort the auxiliary array.
2.1 Re-arrange the values in the main array while sorting the auxiliary array (a sketch follows below).
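A small sketch of that idea, sorting the auxiliary index array and the main list in tandem (the repeated .index() calls could be replaced by the lookup table shown earlier):

aux = [order_of_elements.index(x) for x in list_to_be_sorted]  # index of each element
pairs = sorted(zip(aux, list_to_be_sorted), key=lambda p: p[0])
ordered_list = [x for _, x in pairs]  # the main values, re-arranged along with aux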

Get kth group of unsorted result list with arbitrary number of results per group

Okay, so I have a huge array of unsorted elements of an unknown data type (all elements are of the same type, obviously; I just can't make assumptions, as they could be numbers, strings, or any type of object that overloads the < and > operators). The only assumption I can make about those objects is that no two of them are the same, and comparing them (A < B) tells me which one should show up first if the array were sorted. The "smallest" should be first.
I receive this unsorted array (type std::vector, but honestly it's more of an algorithm question so no language in particular is expected), a number of objects per "group" (groupSize), and the group number that the sender wants (groupNumber).
I'm supposed to return an array containing groupSize elements, or fewer if the requested group is the last one. (Example: 17 results with a groupSize of 5 would return only two of them if you ask for the fourth group. Also, the fourth group is group number 3 because groups are zero-indexed.)
Example:
Received Array: {1, 5, 8, 2, 19, -1, 6, 6.5, -14, 20}
Received pageSize: 3
Received pageNumber: 2
If the array was sorted, it would be: {-14, -1, 1, 2, 5, 6, 6.5, 8, 19, 20}
If it was split in groups of size 3: {{-14, -1, 1}, {2, 5, 6}, {6.5, 8, 19}, {20}}
I have to return the third group (pageNumber 2 in a 0-indexed array): {6.5, 8, 19}
The biggest problem is the fact that it needs to be lightning fast. I can't sort the array because it has to be faster than O(n log n).
I've tried several methods, but can never get under O(n log n).
I'm aware that I should be looking for a solution that doesn't fill up all the other groups, and skips a pretty big part of the steps shown in the example above, to create only the requested group before returning it, but I can't figure out a way to do that.
You can find the value of the smallest element s of the group in linear time using the standard C++ std::nth_element function (because you know its index in the sorted array). You can find the largest element S of the group in the same way. After that, you need one more linear pass to find all elements x such that s <= x <= S and return them. The total time complexity is O(n).
Note: this answer is not C++ specific. You just need an implementation of the k-th order statistic in linear time.
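In Python, numpy.partition can play a similar role to std::nth_element (introselect, linear on average); a sketch using the question's example, which places every position of the requested group at its sorted location without fully sorting the array:

import numpy as np

def get_group(values, group_size, group_number):
    start = group_number * group_size
    stop = min(start + group_size, len(values))
    if start >= len(values):
        return []
    # partition so that each index in [start, stop) holds its sorted value
    part = np.partition(np.asarray(values), list(range(start, stop)))
    return part[start:stop].tolist()

print(get_group([1, 5, 8, 2, 19, -1, 6, 6.5, -14, 20], 3, 2))
# [6.5, 8.0, 19.0]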

Algorithm to find similar strings in a list of many strings

I know about approximate string searching and things like the Levenshtein distance, but what I want to do is take a large list of strings and quickly pick out any matching pairs that are similar to each other (say, 1 Damerau-Levenshtein distance apart). So something like this
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
matching_strings(l)
# Output
# [["moose","mouse"],["rat", "cat"]]
I only really know how to use R and Python, so bonus points if your solution can be easily implemented in one of those languages.
UPDATE:
Thanks to Collapsar's help, here is a solution in Python
import numpy
import functools
alphabet = {'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3, 'g': 6, 'f': 5, 'i': 8, 'h': 7, 'k': 10, 'j': 9, 'm': 12, 'l': 11, 'o': 14, 'n': 13, 'q': 16, 'p': 15, 's': 18, 'r': 17, 'u': 20, 't': 19, 'w': 22, 'v': 21, 'y': 24, 'x': 23, 'z': 25}
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
fvlist = []
for string in l:
    fv = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    for letter in string:
        fv[alphabet[letter]] += 1
    fvlist.append(fv)

fvlist.sort(key=functools.cmp_to_key(lambda fv1, fv2: numpy.sign(numpy.sum(numpy.subtract(fv1, fv2)))))
However, the sorted vectors are returned in the following order:
"rat" "cat" "lion" "fish" "moose" "tiger" "mouse"
Which I would consider to be sub-optimal because I would want moose and mouse to end up next to each other. I understand that however I sort these words there's no way to get all of the words next to all of their closest pairs. However, I am still open to alternative solutions
One way to do that (with complexity O(n k^2), where n is the number of strings and k is the length of the longest string) is to convert every string into a set of masks like this:
rat => ?at, r?t, ra?, ?rat, r?at, ra?t, rat?
This way, if two words differ in one letter, like 'rat' and 'cat', they will both have the mask ?at among others, while if one word is the other with one letter inserted or deleted, like 'rat' and 'rats', they will both have the mask 'rat?'.
Then you just group strings based on their masks, and print groups that have more than two strings. You might want to dedup your array first, if it has duplicates.
Here's an example code, with an extra cats string in it.
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat", "cats"]
d = {}
def add(mask, s):
if mask not in d:
d[mask] = []
d[mask].append(s)
for s in l:
for pos in range(len(s)):
add(s[:pos] + '?' + s[pos + 1:], s)
add(s[:pos] + '?' + s[pos:], s)
add(s + '?', s)
for k, v in d.items():
if len(v) > 1:
print v
Outputs
['moose', 'mouse']
['rat', 'cat']
['cat', 'cats']
First, you must index your list with some kind of fuzzy-search indexing.
Second, you iterate over your list and search for neighbors with a quick lookup in the pre-built index.
About fuzzy indexing:
Approximately 15 years ago I wrote a fuzzy search that can find the N closest neighbors. It is my modification of Wilbur's trigram algorithm, and this modification is named the "Wilbur-Khovayko algorithm".
Basic idea: split strings into trigrams, and search for the maximal intersection scores.
For example, suppose we have the string "hello world". This string generates the trigrams: hel, ell, llo, "lo ", "o w", and so on; it also produces special prefix/suffix trigrams for each word, like $he, $wo, lo$, ld$.
Thereafter, an index is built for each trigram, recording the terms in which it is present.
So, there is a list of term IDs for each trigram.
When the user submits a query string, it is also split into trigrams, and the program searches for the maximal intersection score and generates an N-sized result list.
It works quickly: I remember that on an old Sun/Solaris machine (256 MB RAM, 200 MHz CPU) it found the 100 closest terms in a dictionary of 5,000,000 terms in 0.25 s.
You can get my old source from: http://olegh.ftp.sh/wilbur-khovayko.tgz
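A very small Python sketch of the trigram-intersection idea (this is not the original implementation; the word-boundary handling here is deliberately crude):

from collections import defaultdict

def trigrams(s):
    s = ' ' + s + ' '                          # crude word-boundary markers
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(terms):
    index = defaultdict(set)
    for term_id, term in enumerate(terms):
        for t in trigrams(term):
            index[t].add(term_id)
    return index

def closest(query, terms, index, n=5):
    scores = defaultdict(int)
    for t in trigrams(query):
        for term_id in index.get(t, ()):
            scores[term_id] += 1               # intersection score
    best = sorted(scores, key=scores.get, reverse=True)[:n]
    return [terms[i] for i in best]

terms = ['hello world', 'hello word', 'help wanted']
index = build_index(terms)
print(closest('helo world', terms, index, n=2))  # the two closest terms by trigram overlap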
The naive implementation amounts to setting up a boolean matrix indexed by the strings (i.e. their position in the sorted list) and comparing each pair of strings, setting the corresponding matrix element to true iff the strings are 'similar' wrt your criterion. This will run in O(n^2).
You might be better off by transforming your strings into tuples of character frequencies ( e.g. 'moose' -> (0,0,0,0,1,0,0,0,0,0,0,0,1,0,2,0,0,0,1,0,0,0,0,0,0,0) where the i-th vector component represents the i-th letter in the alphabet). Note that the frequency vectors will differ in 'few' components only ( e.g. for D-L distance 1 in at most 2 components, the respective differences being +1,-1 ).
Sort your transformed data. Candidates for the pairs you wish to generate will be adjacent or at least 'close' to each other in your sorted list of transformed values. You check the candidates by comparing each list entry with at most k of its successors in the list, k being a small integer (actually comparing the corresponding strings, of course). This algorithm will run in O(n log n).
You have to trade off between the added overhead of transformation / sorting (with complex comparison operations depending on the representation you choose for the frequency vectors ) and the reduced number of comparisons. The method does not consider the intra-word position of characters but only their occurrence. Depending on the actual set of strings there'll be many candidate pairs that do not turn into actually 'similar' pairs.
As your data appears to consist of English lexemes, a potential optimisation would be to define character classes ( e.g. vowels/consonants, 'e'/other vowels/syllabic consonants/non-syllabic consonants ) instead of individual characters.
Additional optimisation:
Note that precisely the pairs of strings in your data set that are permutations of each other (e.g. [art, tar]) will produce identical values under the given transformation. So if you limit yourself to a D-L distance of 1, and if you do not count the transposition of adjacent characters as a single edit step, you should never pick list items with identical transformation values as candidates.
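A sketch of this scheme in Python, assuming lowercase ASCII words; the neighbour check below uses plain Levenshtein distance <= 1 rather than full Damerau-Levenshtein, so adjacent transpositions are not picked up:

def within_one_edit(a, b):
    # at most one substitution, insertion or deletion apart
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a
    i = j = edits = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        else:
            edits += 1
            if edits > 1:
                return False
            if len(a) == len(b):
                i += 1                         # substitution: advance both
            j += 1                             # insertion/deletion: skip a char of b
    return True

def similar_pairs(words, k=3):
    def freq(w):
        v = [0] * 26
        for ch in w:
            v[ord(ch) - ord('a')] += 1
        return tuple(v)

    transformed = sorted((freq(w), w) for w in words)   # sort by frequency vector
    pairs = []
    for i, (_, w1) in enumerate(transformed):
        for _, w2 in transformed[i + 1:i + 1 + k]:      # compare with at most k successors
            if within_one_edit(w1, w2):
                pairs.append([w1, w2])
    return pairs

print(similar_pairs(["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]))
# [['mouse', 'moose'], ['rat', 'cat']]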

Retrieve closest element from a set of elements

I'm experimenting with an idea where I have the following subproblem:
I have a list of size m containing tuples of fixed length n.
[(e11, e12, .., e1n), (e21, e22, .., e2n), ..., (em1, em2, .., emn)]
Now, given some random tuple (t1, t2, .., tn), which does not belong to the list, I want to find the closest tuple(s), that belongs to the list.
I use the following distance function (it counts the number of positions at which the tuples agree, i.e. the complement of the Hamming distance):
def distance(A, B):
    total = 0
    for e1, e2 in zip(A, B):
        total += e1 == e2
    return total
One option is to use exhaustive search, but this is not sufficient for my problem as the lists are quite large. Another idea I have come up with is to first use k-medoids to cluster the list and retrieve the K medoids (cluster centers). For querying, I can determine the closest cluster with K calls to the distance function, and then search for the closest tuple within that particular cluster. I think it should work, but I am not completely sure it is fine in cases where the query tuple lies on the edge of a cluster.
I was wondering, however, whether you have a better idea for solving the problem, as my mind is completely blank at the moment; I have a strong feeling that there may be a clever way to do it.
Solutions that require precomputing something are fine as long as they bring down the complexity of the query.
You can store a hash table (dictionary/map) that maps from an element (in the tuple) to the tuples it appears in: hash: element -> list<tuple>.
Now, when you have a new "query", you iterate over hash(element) for each element of the new query and find the maximal number of hits.
Pseudo code:
findMax(tuple):
    histogram <- empty map
    for each element in tuple:
        # assuming hash_table is the DS described above
        for each x in hash_table[element]:
            histogram[x]++  # assuming lazy initialization to 0
    return key with highest value in histogram
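A sketch of this in Python; the table below is keyed on (position, element) so that the hit count matches the positional metric from the question:

from collections import defaultdict

def build_table(tuples):
    # (position, element) -> list of tuples having that element at that position
    table = defaultdict(list)
    for t in tuples:
        for i, e in enumerate(t):
            table[(i, e)].append(t)
    return table

def find_max(query, table):
    histogram = defaultdict(int)
    for i, e in enumerate(query):
        for t in table[(i, e)]:
            histogram[t] += 1                 # number of positional hits
    return max(histogram, key=histogram.get) if histogram else None

table = build_table([(1, 2, 3), (1, 5, 3), (4, 2, 6)])
print(find_max((1, 5, 6), table))  # (1, 5, 3): two positional matches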
An alternative, which does not exactly follow the metric you want, is a k-d tree. The difference is that a k-d tree also takes into consideration the "distance" between elements (not only equality/inequality).
k-d trees also require the elements to be comparable.
If your data is big enough, you may want to create some inverted indexes over it.
With data consisting of m vectors of n elements each.
Data:
0: 1, 2, 3, 4, 5, ...
1: 2, 3, 1, 5, 3, ...
2: 5, 3, 2, 1, 3, ...
3: 1, 2, 1, 5, 3, ...
...
m: m0, ... mn
Then you want to get n indexes like this:
Index0
1: 0, 3
2: 1
5: 2
Index1
2: 0, 3
3: 1, 2
Index2
3: 0
1: 1, 3
2: 2
...
Then you only search on your indexes to get the tuples that contain any of the query tuple values and find the closest tuple within those.
def search(query):
    candidates = []
    for i in range(len(query)):
        value = query[i]
        candidates.append(indexes[i][value])
    # find the candidates with minimum distance to the query
    for candidate in candidates:
        d = distance(candidate, query)
        ...
The heavy part of the process is creating the indexes; once you have built them, the search will be really fast.

Method for associating vector elements with items they represent

Imagine an "item" structure (represented as a JSON hash)
{
    id: 1,
    value: 5
}
Now imagine I have a set of 100,000 items, and I need to perform calculations on the value associated with each. At the end of the calculation, I update each item with the new value.
To do this quickly, I have been using GSL vector libraries, loading each value as an element of the vector.
For example, the items:
{ id: 1, value: 5 }
{ id: 2, value: 6 }
{ id: 3, value: 7 }
Becomes:
GSL::Vector[5, 6, 7]
Element 1 corresponds to item id 1, element 2 corresponds to item id 2, etc. I then proceed to perform element-wise calculations on each element in the vector, multiplying, dividing etc.
While this works, it bothers me that I have to depend on the list of items being sorted by ID.
Is there another structure that acts like a hash (allowing me to say with certainty a particular result value corresponds to a particular item), but allows me to do fast, memory efficient element-wise operations like a vector?
I'm using Ruby and the GSL bindings, but willing to re-write this in another language if necessary.
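The setup described above amounts to keeping two parallel arrays, one of ids and one of values, so the correspondence is explicit rather than implied by sort order. A sketch of that pattern in Python/NumPy, purely as an illustration (the original code uses Ruby with the GSL bindings):

import numpy as np

items = [{'id': 1, 'value': 5}, {'id': 2, 'value': 6}, {'id': 3, 'value': 7}]

ids = np.array([item['id'] for item in items])                      # parallel array of ids
values = np.array([item['value'] for item in items], dtype=float)   # parallel array of values

values = values * 2 + 1          # some fast element-wise calculation (illustrative)

# write the results back; ids[i] always corresponds to values[i], regardless of ordering
for item, new_value in zip(items, values):
    item['value'] = float(new_value)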

Resources