How to remap group indices of a sparse set to a compact set? - algorithm

Assume we have a list of data, for example
data [0] = /* employee 0 name, age, DOB, position */
data [1] = /* employee 1 name, age, DOB, position */
...
data [n-1] = /* employee n-1 name, age, DOB, position */
We also have a list of groups/teams, which is a list of lists:
group [0] = {0, 1, 72}
group [1] = {38, 1, 40}
...
group [k] = {0, 70, 72, 90}
Groups can have any non-zero number of indices, and an index can appear in any number of groups.
The input guarantees that every index in [0, n-1] is present in at least one group.
A random list of groups to be deleted is given to you; for example, remove {1, 6, 7, 8} means removing the groups at indices 1, 6, 7 and 8 of the group list.
Assume you do remove the groups.
You now potentially have indices in the data that belong to no group.
You want to remove any such datum, but you also want to keep indices contiguous. So for example if the input is
data has 4 elements
group[0] = {0, 1, 2}
group[1] = {0, 2, 3}
group[2] = {2, 3}
And you are to remove group 0, then datum 1 is to be removed, meaning you must shift the indices in groups 1 and 2.
The new data would look like:
data has 3 elements
group[1] = {0, 1, 2}
group[2] = {1, 2}
I want to implement this in an efficient way.
My current solution is to delete the given groups, iterate over the data checking for entries without an assigned group, and build a permutation map from each old index to its new index.
Then copy all surviving groups using the permutation map.
This is very, very slow for large data. Is there a way to do this without O(n) additional memory? Or, at the bare minimum, with a data structure that has better cache performance than a map?
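For what it's worth, the permutation-map step described above can be done with flat arrays instead of a hash map, which is typically far more cache-friendly (it still uses O(n) extra memory, though). A sketch in Python; remove_groups and its exact signature are illustrative, not a reference implementation:

```python
def remove_groups(data, groups, to_remove):
    removed = set(to_remove)
    survivors = [g for i, g in enumerate(groups) if i not in removed]
    # pass 1: mark which data indices are still referenced by a surviving group
    referenced = [False] * len(data)
    for g in survivors:
        for idx in g:
            referenced[idx] = True
    # pass 2: build a flat old-index -> new-index permutation (-1 = dropped)
    remap = [-1] * len(data)
    new_data = []
    for old, keep in enumerate(referenced):
        if keep:
            remap[old] = len(new_data)
            new_data.append(data[old])
    # pass 3: rewrite the surviving groups through the permutation
    new_groups = [[remap[idx] for idx in g] for g in survivors]
    return new_data, new_groups
```

Because remap is a plain contiguous int array, the rewrite pass is a sequential scan rather than pointer-chasing through a map. On the example above, removing group 0 leaves data with 3 elements and groups {0, 1, 2} and {1, 2}.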

Related

Get kth group of unsorted result list with arbitrary number of results per group

Okay, so I have a huge array of unsorted elements of an unknown data type (all elements are of the same type; I just can't make assumptions about it, as they could be numbers, strings, or any type of object that overloads the < and > operators). The only assumptions I can make about these objects are that no two of them are the same, and that comparing them (A < B) tells me which one should come first if the array were sorted. The "smallest" should be first.
I receive this unsorted array (type std::vector, but honestly it's more of an algorithm question so no language in particular is expected), a number of objects per "group" (groupSize), and the group number that the sender wants (groupNumber).
I'm supposed to return an array containing groupSize elements, or fewer if the group requested is the last one. (Example: 17 results with a groupSize of 5 would return only two elements if you ask for the fourth group. Also, the fourth group is group number 3, because it's a zero-indexed array.)
Example:
Received Array: {1, 5, 8, 2, 19, -1, 6, 6.5, -14, 20}
Received groupSize: 3
Received groupNumber: 2
If the array was sorted, it would be: {-14, -1, 1, 2, 5, 6, 6.5, 8, 19, 20}
If it was split in groups of size 3: {{-14, -1, 1}, {2, 5, 6}, {6.5, 8, 19}, {20}}
I have to return the third group (groupNumber 2 in a 0-indexed array): {6.5, 8, 19}
The biggest problem is the fact that it needs to be lightning fast. I can't sort the array because it has to be faster than O(n log n).
I've tried several methods, but can never get under O(n log n).
I'm aware that I should be looking for a solution that doesn't fill up all the other groups, and skips a pretty big part of the steps shown in the example above, to create only the requested group before returning it, but I can't figure out a way to do that.
You can find the value of the smallest element s in the group in linear time using the standard C++ std::nth_element function (because you know its index in the sorted array). You can find the largest element S in the group in the same way. After that, you need one linear pass to collect all elements x such that s <= x <= S and return them. The total time complexity is O(n).
Note: this answer is not C++ specific. You just need an implementation of the k-th order statistics in linear time.
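To make the idea concrete (the answer is language-agnostic), here is a Python sketch where quickselect plays the role of std::nth_element, giving the k-th order statistic in expected linear time. The function names quickselect and kth_group are illustrative:

```python
import random

def quickselect(items, k):
    """Return the k-th smallest element (0-indexed), expected O(n)."""
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]
    larger = [x for x in items if x > pivot]
    if k < len(smaller):
        return quickselect(smaller, k)
    if k < len(items) - len(larger):
        return pivot
    return quickselect(larger, k - (len(items) - len(larger)))

def kth_group(items, group_size, group_number):
    lo = group_size * group_number             # sorted index of the group's smallest element
    hi = min(lo + group_size, len(items)) - 1  # sorted index of the group's largest element
    s = quickselect(items, lo)
    S = quickselect(items, hi)
    # one linear pass to collect the group; sorting the group itself costs
    # only O(group_size log group_size), which is negligible
    return sorted(x for x in items if s <= x <= S)
```

On the example above, kth_group([1, 5, 8, 2, 19, -1, 6, 6.5, -14, 20], 3, 2) gives [6.5, 8, 19].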

Union of two sets given a certain ordering in O(n) time

[Note: I am hoping this problem can be solved in O(n) time. If not, I am looking for a proof that it cannot be solved in O(n) time. If I get the proof, I'll try to implement a new algorithm to reach to the union of these sorted sets in a different way.]
Consider the sets:
(1, 4, 0, 6, 3)
(0, 5, 2, 6, 3)
The resultant should be:
(1, 4, 0, 5, 2, 6, 3)
Please note that the problem of the union of sorted sets is easy. These are also sorted sets, but the ordering is defined by some other property from which these indices were resolved. Whatever that ordering is, it is consistent across both sets: if i comes before j in set X, then i comes before j in set Y as well (whenever both appear).
EDIT: I am sorry I have missed something very important that I have covered in one of the comments below — intersection of two sets is not a null set, i.e. the two sets have common elements.
Insert each item in the first set into a hash table.
Go through each item in the second set, looking up that value.
If not found, insert that item into the resulting set.
If found, insert all items from the first set between the last item we inserted up to this value.
At the end, insert all remaining items from the first set into the resulting set.
Running time
Expected O(n).
Side note
With the constraints given, the union is not necessarily unique.
E.g., for the input sets (1) and (2), the resulting set can be either (1, 2) or (2, 1).
This answer will pick (2, 1).
Implementation note
Obviously looping through the first set to find the last inserted item is not going to result in an O(n) algorithm. Instead we must keep an iterator into the first set (not the hash table), and then we can simply continue from the last position that iterator had.
Here's some pseudo-code, assuming both sets are arrays (for simplicity):
for i = 0 to input1.length - 1
    hashTable.insert(input1[i])

i = 0   // this will be our 'iterator' into the first set
for j = 0 to input2.length - 1
    if hashTable.contains(input2[j])
        do
            output.append(input1[i])
            i++
        while input1[i - 1] != input2[j]
    else
        output.append(input2[j])

while i < input1.length
    output.append(input1[i])
    i++

The do-while loop inside the for loop may look suspicious, but note that every iteration of that loop increments i, so across the whole algorithm it runs at most input1.length times in total.
Example
Input:
(1, 4, 0, 6, 8, 3)
(0, 5, 2, 6, 3)
Hash table: (1, 4, 0, 6, 8, 3)
Then, go through the second set.
Look up 0, found, so insert 1, 4, 0 into the resulting set
(no item from first set inserted yet, so insert all items from the start until we get 0).
Look up 5, not found, so insert 5 into the resulting set.
Look up 2, not found, so insert 2 into the resulting set.
Look up 6, found, so insert 6 into the resulting set
(last item inserted from first set is 0, so only 6 needs to be inserted).
Look up 3, found, so insert 8, 3 into the resulting set
(last item inserted from first set is 6, so insert all items from after 6 until we get 3).
Output: (1, 4, 0, 5, 2, 6, 8, 3)
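Translated into runnable Python (ordered_union is my name for it; it is the same algorithm, with the same expected O(n) running time):

```python
def ordered_union(a, b):
    in_a = set(a)          # the hash table of the first set
    out = []
    i = 0                  # the 'iterator' into the first set
    for x in b:
        if x in in_a:
            # flush items from a up to and including the match
            while True:
                out.append(a[i])
                i += 1
                if out[-1] == x:
                    break
        else:
            out.append(x)
    out.extend(a[i:])      # remaining items from the first set
    return out
```

ordered_union([1, 4, 0, 6, 8, 3], [0, 5, 2, 6, 3]) returns [1, 4, 0, 5, 2, 6, 8, 3], matching the walkthrough above.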
We have two ordered sets of indices A and B, which are ordered by some function f(). So we know that f(A[i]) < f(A[j]) iff i < j, and the same holds true for set B.
From here, we get a mapping to plain sorted sequences, and the problem reduces to the union of sorted sets.
This also doesn't have the best space complexity, but you can try:

a = [1, 2, 3, 4, 5]
b = [4, 2, 79, 8]
union = {}
for each in a:
    union[each] = 1
for each in b:
    union[each] = 1
print(' '.join(str(each) for each in union))

Output (dicts preserve insertion order in modern Python):

1 2 3 4 5 79 8

Effective sort-by-example algorithm

Assume we have an array of objects of length N (all objects have the same set of fields).
And we have an array of length N of the same type values, which represent certain object's field (e.g. array of numbers representing IDs).
Now we want to sort the array of objects by the field which is represented in the 2nd array and in the same order as in the 2nd array.
For example, here are 2 arrays (as in description) and expected result:
A = [ {id: 1, color: "red"}, {id: 2, color: "green"}, {id: 3, color: "blue"} ]
B = [ "green", "blue", "red"]
sortByColorByExample(A, B) ==
[ {id: 2, color: "green"}, {id: 3, color: "blue"}, {id: 1, color: "red"} ]
How can this 'sort-by-example' function be implemented efficiently? I can't come up with anything better than O(N^2).
This is assuming you have a bijection from elements in B to elements in A
Build a map (say M) from B's elements to their position (O(N))
For each element of A (O(N)), access the map to find where to put it in the sorted array (O(log(N)) with an efficient implementation of the map)
Total complexity: O(NlogN) time and O(N) space
Suppose we are sorting on an item's colour. Then create a dictionary d that maps each colour to a list of the items in A that have that colour. Then iterate across the colours in the list B, and for each colour c output (and remove) a value from the list d[c]. This runs in O(n) time with O(n) extra space for the dictionary.
Note that you have to decide what to do if A cannot be sorted according to the examples in B: do you raise an error? Choose the order that maximizes the number of matches? Or what?
Anyway, here's a quick implementation in Python:
from collections import defaultdict

def sorted_by_example(A, B, key):
    """Return a list consisting of the elements from the sequence A in the
    order given by the sequence B. The function key takes an element
    of A and returns the value that is used to match elements from B.
    If A cannot be sorted by example, raise IndexError.
    """
    d = defaultdict(list)
    for a in A:
        d[key(a)].append(a)
    return [d[b].pop() for b in B]
>>> A = [{'id': 1, 'color': 'red'}, {'id': 2, 'color': 'green'}, {'id': 3, 'color': 'blue'}]
>>> B = ['green', 'blue', 'red']
>>> from operator import itemgetter
>>> sorted_by_example(A, B, itemgetter('color'))
[{'color': 'green', 'id': 2}, {'color': 'blue', 'id': 3}, {'color': 'red', 'id': 1}]
Note that this approach handles the case where there are multiple identical values in the sequence B, for example:
>>> A = 'proper copper coffee pot'.split()
>>> B = 'ccpp'
>>> ' '.join(sorted_by_example(A, B, itemgetter(0)))
'coffee copper pot proper'
Here when there are multiple identical values in B, we get the corresponding elements in A in reverse order, but this is just an artefact of the implementation: by using a collections.deque instead of a list (and popleft instead of pop), we could arrange to get the corresponding elements of A in the original order, if that were preferred.
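For completeness, the deque variant just mentioned would look like this (sorted_by_example_stable is an illustrative name):

```python
from collections import defaultdict, deque
from operator import itemgetter

def sorted_by_example_stable(A, B, key):
    # same as sorted_by_example, but popleft preserves A's original order
    # among elements that share a key
    d = defaultdict(deque)
    for a in A:
        d[key(a)].append(a)
    return [d[b].popleft() for b in B]
```

With the example above, 'proper copper coffee pot' sorted by example 'ccpp' now keeps the original relative order: 'copper coffee proper pot'.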
Make an array of arrays, call it C, of size B.length.
Loop through A. If an item's color is 'green', put it in C[0]; if 'blue', put it in C[1]; if 'red', put it in C[2].
When you're done, go through C and flatten it back into your original structure.
Wouldn't something along the lines of a bucket approach be better? Create B.length arrays, one for each element of B, go through A placing each item in the appropriate smaller array, and when it's all done merge the arrays together. It should be around O(n).
Iterate through the first array and build a HashMap from each key field value to the list of objects with that value: O(n) (a list, since there may be duplicate values of the key field).
For example, the key 'green' will map to all objects whose field value is green.
Now iterate through the second array, fetch the list of objects from the HashMap for each value, and append it to an output array: O(k), where k is the number of distinct field values.
The total running time is O(n), but it requires additional memory for the map and the auxiliary array.
In the end you get the array sorted as per your requirements.

retrieve closest element from a set of elements

I'm experimenting with an idea, where I have following subproblem:
I have a list of size m containing tuples of fixed length n.
[(e11, e12, .., e1n), (e21, e22, .., e2n), ..., (em1, em2, .., emn)]
Now, given some random tuple (t1, t2, .., tn), which does not belong to the list, I want to find the closest tuple(s), that belongs to the list.
I use the following distance function (Hamming distance; note it must count the positions that differ, not the ones that match):

def distance(A, B):
    # number of positions at which A and B differ
    total = 0
    for e1, e2 in zip(A, B):
        total += e1 != e2
    return total
One option is exhaustive search, but that is not sufficient for my problem, as the lists are quite large. Another idea I have come up with is to first cluster the list with k-medoids and retrieve the K medoids (cluster centers). To answer a query, I can determine the closest cluster with K calls to the distance function, then search for the closest tuple within that particular cluster. I think it should work, but I am not completely sure it is correct when the query tuple lies on the edge between clusters.
However, I was wondering if you have a better idea to solve the problem, as my mind is completely blank at the moment. I have a strong feeling that there may be a clever way to do it.
Solutions that require precomputing something are fine as long as they bring down the complexity of the query.
You can store a hash table (dictionary/map) that maps each element (within a tuple) to the tuples it appears in: hash: element -> list<tuple>.
Now, when you have a new query, iterate over hash(element) for each element of the query, and find the tuple with the maximal number of hits.
pseudo code:

findMax(tuple):
    histogram <- empty map
    for each element in tuple:
        # assuming hash_table is the DS described above
        for each x in hash_table[element]:
            histogram[x]++   # assuming lazy initialization to 0
    return the key with the highest value in histogram
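A runnable Python version of that pseudo-code might look like this (build_index and find_max are my names; note it counts shared elements regardless of position, so it is a similarity count in the spirit of the scheme above rather than positional Hamming distance):

```python
from collections import defaultdict

def build_index(tuples):
    # element -> list of tuples it appears in
    index = defaultdict(list)
    for t in tuples:
        for e in set(t):       # set() so duplicates within one tuple count once
            index[e].append(t)
    return index

def find_max(index, query):
    # count, for every candidate tuple, how many query elements it shares
    histogram = defaultdict(int)
    for e in set(query):
        for t in index[e]:
            histogram[t] += 1  # lazy initialization to 0 via defaultdict
    return max(histogram, key=histogram.get)
```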
An alternative, which does not exactly follow the metric you want, is a k-d tree. The difference is that a k-d tree also takes into account the 'distance' between element values (not only equality/inequality), and it requires the elements to be comparable.
If your data is big enough, you may want to create some inverted indexes over it.
Given data of m vectors of n elements each:
Data:
0: 1, 2, 3, 4, 5, ...
1: 2, 3, 1, 5, 3, ...
2: 5, 3, 2, 1, 3, ...
3: 1, 2, 1, 5, 3, ...
...
m: m0, ... mn
Then you want to get n indexes like this:
Index0
1: 0, 3
2: 1
5: 2
Index1
2: 0, 3
3: 1, 2
Index2
3: 0
1: 1, 3
2: 2
...
Then you only search on your indexes to get the tuples that contain any of the query tuple values and find the closest tuple within those.
def search(query):
    candidates = []
    for i in range(len(query)):
        value = query[i]
        candidates.append(indexes[i][value])
    # find the candidate with the minimum distance
    for candidate in candidates:
        d = distance(candidate, query)
        ...
The heavy part is building the indexes; once you have built them, the search is really fast.
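A more complete, runnable sketch of this positional-index search (all names here are mine):

```python
from collections import defaultdict

def build_positional_index(tuples):
    # indexes[i][value] -> set of row numbers whose i-th element equals value
    n = len(tuples[0])
    indexes = [defaultdict(set) for _ in range(n)]
    for row, t in enumerate(tuples):
        for i, v in enumerate(t):
            indexes[i][v].add(row)
    return indexes

def closest(tuples, indexes, query):
    # count positional matches per candidate row;
    # Hamming distance = n - matches, so max matches = min distance
    matches = defaultdict(int)
    for i, v in enumerate(query):
        for row in indexes[i][v]:
            matches[row] += 1
    if not matches:
        return None   # no tuple shares any element in any position
    return tuples[max(matches, key=matches.get)]
```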

ideas for algorithm? sorting a list randomly with emphasis on variety

I have a table of items with [ID, ATTR1, ATTR2, ATTR3]. I'd like to select about half of the items, but try to get a random result set that is NOT clustered. In other words, there's a fairly even spread of ATTR1 values, ATTR2 values, and ATTR3 values. This does NOT necessarily represent the data as a whole, in other words, the total table may be generally concentrated on certain attribute values, but I'd like to select a subset with more variety. The attributes are not inter-related, so there's not really a correlation between ATTR1 and ATTR2.
As an example, imagine ATTR1 = "State". I'd like each line item in my subset to be from a different state, even if in the whole set, most of my data is concentrated on a few states. And for this to simultaneously be true of the other 2 attributes, too. (I realize that some tables might not make this possible, but there's enough data that it's unlikely to have no solution)
Any ideas for an efficient algorithm? Thanks! I don't really even know how to search for this :)
(by the way, it's OK if this requires pre-calculation or -indexing on the whole set, so long as I can draw out random varied subsets quickly)
Interesting problem. Since you want about half of the list, how about this:
Create a list of half the values chosen entirely at random. Compute histograms for the value of ATTR1, ATTR2, ATTR3 for each of the chosen items.
Then, in a loop: randomly pick an item that's in the current list and an item that isn't.
If swapping them in increases the 'entropy' (the number of unique attribute values in the histograms), keep the new item and update the histograms to reflect the change you just made.
Repeat N/2 times, or more, depending on how strongly you want to push towards covering every value rather than staying random. You could also use simulated annealing, gradually changing the probability of accepting a swap: starting with 'sometimes allow a swap even if it makes things worse' and ending with 'only swap if it increases variety'.
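A rough sketch of this swap loop, using the count of distinct attribute values as a cheap stand-in for the histogram 'entropy' (a real implementation would update the histograms incrementally instead of rescoring from scratch; varied_subset is an illustrative name):

```python
import random

def varied_subset(items, iterations=500):
    # items: list of equal-length attribute tuples, e.g. (attr1, attr2, attr3)
    n = len(items)
    chosen = set(random.sample(range(n), n // 2))

    def score(selection):
        # distinct values covered, summed over attribute positions
        return sum(len({items[i][a] for i in selection})
                   for a in range(len(items[0])))

    for _ in range(iterations):
        inside = random.choice(sorted(chosen))
        outside = random.choice([i for i in range(n) if i not in chosen])
        trial = (chosen - {inside}) | {outside}
        if score(trial) >= score(chosen):   # keep swaps that don't hurt variety
            chosen = trial
    return [items[i] for i in sorted(chosen)]
```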
I don't know the standard approach (and I hope someone who does will answer). Here's what comes to mind: define a distribution for MCMC that puts most of its weight on the subsets with high 'variety'.
Assuming the items in your table are indexed by some form of id, I would loop over half the number of items in the table, using a random number generator to pick each id.
IMHO, finding variety is difficult, but generating it is easy. So we can generate a variety of attribute combinations and then search the table for records with those combinations. If the table is sorted, the searching also becomes easy.
Sample Python code:

d = {}
d[('a',0,'A')] = 0
d[('a',1,'A')] = 1
d[('a',0,'A')] = 2
d[('b',1,'B')] = 3
d[('b',0,'C')] = 4
d[('c',1,'C')] = 5
d[('c',0,'D')] = 6
d[('a',0,'A')] = 7
print(d)

attr1 = ['a','b','c']
attr2 = [0,1]
attr3 = ['A','B','C','D']

# no of items in
# attr2 < attr1 < attr3
# ;) reason for strange nesting of loops
for z in attr3:
    for x in attr1:
        for y in attr2:
            k = (x, y, z)
            if k in d:
                print('%s->%s' % (k, d[k]))
            else:
                print(k)
Output:
('a', 0, 'A')->7
('a', 1, 'A')->1
('b', 0, 'A')
('b', 1, 'A')
('c', 0, 'A')
('c', 1, 'A')
('a', 0, 'B')
('a', 1, 'B')
('b', 0, 'B')
('b', 1, 'B')->3
('c', 0, 'B')
('c', 1, 'B')
('a', 0, 'C')
('a', 1, 'C')
('b', 0, 'C')->4
('b', 1, 'C')
('c', 0, 'C')
('c', 1, 'C')->5
('a', 0, 'D')
('a', 1, 'D')
('b', 0, 'D')
('b', 1, 'D')
('c', 0, 'D')->6
('c', 1, 'D')
But assuming your table is very big (otherwise, why would you need an algorithm? ;)) and the data is fairly uniformly distributed, there will be more hits in a real scenario. In this dummy case there are too many misses, which makes the algorithm look inefficient.
Let's assume that ATTR1, ATTR2, and ATTR3 are independent random variables (over a uniform random item). (If ATTR1, ATTR2, and ATTR3 are only approximately independent, then this sample should be approximately uniform in each attribute.) To sample one item (VAL1, VAL2, VAL3) whose attributes are uniformly distributed, choose VAL1 uniformly at random from the set of values for ATTR1, choose VAL2 uniformly at random from the set of values for ATTR2 over items with ATTR1 = VAL1, choose VAL3 uniformly at random from the set of values for ATTR3 over items with ATTR1 = VAL1 and ATTR2 = VAL2.
To get a sample of distinct items, apply the above procedure repeatedly, deleting each item after it is chosen. Probably the best way to implement this would be a tree. For example, if we have
ID ATTR1 ATTR2 ATTR3
1 a c e
2 a c f
3 a d e
4 a d f
5 b c e
6 b c f
7 b d e
8 b d f
9 a c e
then the tree is, in JavaScript object notation,
{"a": {"c": {"e": [1, 9], "f": [2]},
"d": {"e": [3], "f": [4]}},
"b": {"c": {"e": [5], "f": [6]},
"d": {"e": [7], "f": [8]}}}
Deletion is accomplished recursively. If we sample id 4, then we delete it from its list at the leaf level. This list empties, so we delete the entry "f": [] from tree["a"]["d"]. If we now delete 3, then we delete 3 from its list, which empties, so we delete the entry "e": [] from tree["a"]["d"], which empties tree["a"]["d"], so we delete it in turn. In a good implementation, each item should take time O(# of attributes).
EDIT: For repeated use, reinsert the items into the tree after the whole sample is collected. This doesn't affect the asymptotic running time.
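A minimal sketch of this tree sampling with deletion, for rows of (id, ATTR1, ATTR2, ATTR3); build_tree and sample_one are my names, and each sample_one call takes time O(# of attributes):

```python
import random

def build_tree(rows):
    # rows: (id, attr1, attr2, attr3) -- nest dicts by attribute, ids at the leaves
    tree = {}
    for id_, a1, a2, a3 in rows:
        tree.setdefault(a1, {}).setdefault(a2, {}).setdefault(a3, []).append(id_)
    return tree

def sample_one(tree):
    # pick a uniformly random value at each attribute level, then a random id,
    # deleting it (and any emptied branches) on the way out
    node, path = tree, []
    while isinstance(node, dict):
        key = random.choice(list(node))
        path.append((node, key))
        node = node[key]
    id_ = random.choice(node)
    node.remove(id_)
    if not node:
        # prune empty branches bottom-up
        for parent, key in reversed(path):
            del parent[key]
            if parent:
                break
    return id_
```

Sampling repeatedly until the tree is empty returns every id exactly once, with each sample uniform per attribute value at every level.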
Idea #2.
Compute histograms for each attribute on the original table.
For each item, compute its uniqueness score = p(ATTR1) x p(ATTR2) x p(ATTR3) (multiply the probabilities for each attribute value it has).
Sort by uniqueness.
Choose a probability distribution curve for your random numbers, ranging from picking only values in the first half of the set (a step function) to picking values evenly over the entire set (a flat line). Maybe a 1/x curve would work well for you here.
Pick values from the sorted list using your chosen probability curve.
This allows you to bias it towards more unique values or towards more evenness just by adjusting the probability curve you use to generate the random numbers.
Taking your example, assign every possible 'State' a numeric value (say, between 1 and 9). Do the same for the other attributes.
Now, assuming you don't have more than 10 possible values for each attribute, multiply the value for ATTR3 by 100, ATTR2 by 1,000, and ATTR1 by 10,000. Add up the results and you end up with something that resembles a vague hash of the item. Something like
10,000 * |ATTR1| + 1000 * |ATTR2| + 100 * |ATTR3|
The advantage here is that you know all values between 10,000 and 19,999 share the same 'State' value; in other words, the leading digit represents ATTR1. The same goes for ATTR2 and the other attributes.
You can sort all values and using something like bucket-sort pick one for each type, checking that the digit you're considering hasn't been picked already.
An example: if your end values are
A: 15,700 = 10,000 * 1 + 1,000 * 5 + 100 * 7
B: 13,400 = 10,000 * 1 + 1,000 * 3 + 100 * 4
C: 13,200 = ...
D: 12,300
E: 11,400
F: 10,900
you know that all your values have the same ATTR1; 2 have the same ATTR2 (that being B and C); and 2 have the same ATTR3 (B, E).
This, of course, assumes I understood correctly what you want to do. It's Saturday night, after all.
ps: yes, I could have used '10' as the first multiplier but the example would have been messier; and yes, it's clearly a naive example and there are lots of possible optimizations here, which are left as an exercise to the reader
It's a very interesting problem, for which I can see a number of applications. Notably for testing software: you get many 'main-flow' transactions, but only one is necessary to test that it works and you would prefer when selecting to get an extremely varied sample.
I don't think you really need a full histogram structure; a binary one (value absent/present) is enough:
{ ATTR1: [val1, val2], ATTR2: [i,j,k], ATTR3: [1,2,3] }
This is used in fact to generate a list of predicates:
Predicates = [ lambda x: x.attr1 == val1, lambda x: x.attr1 == val2,
lambda x: x.attr2 == i, ...]
This list will contain say N elements.
Now you wish to select K elements from this list. If K <= N, that's fine; otherwise we duplicate the list i times so that K <= N*i, with i minimal of course, so i = ceil(K/N) (note that this also works when K <= N, giving i == 1).
i = ceil(K/N)
Predz = Predicates * i # python's wonderful
And finally, pick a predicate from this list and look for an element that satisfies it... that's where the randomness actually comes in, and I am less than adequate here.
Two remarks:
if K > N, you may prefer to select each predicate i-1 times deterministically and then draw randomly from the list of predicates only to top off your selection, thus ensuring over-representation of even the least common values.
the attributes are completely uncorrelated this way; you could never get the tuple (1,2,3) by selecting solely on its third element being 3, so a refinement would be to group some related attributes together, though that would probably increase the number of predicates generated.
for efficiency, you should index the table by predicate category if you want the selection to be fast.
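Sketching the predicate idea in Python, with items as dicts and one equality predicate per (attribute, observed value) pair; select_varied is an illustrative name, and the final random pick over matching candidates is the naive part, as noted above:

```python
import math
import random

def select_varied(items, k, attrs):
    # one predicate per (attribute, observed value) pair
    preds = [(a, v) for a in attrs for v in {x[a] for x in items}]
    # if we need more picks than predicates, repeat the predicate list
    pool = preds * math.ceil(k / len(preds))
    random.shuffle(pool)
    chosen_ids = set()
    result = []
    for a, v in pool:
        if len(result) == k:
            break
        # elements satisfying the predicate that aren't already chosen
        candidates = [x for x in items
                      if x[a] == v and id(x) not in chosen_ids]
        if candidates:
            pick = random.choice(candidates)
            chosen_ids.add(id(pick))
            result.append(pick)
    return result
```

Each predicate pulls at most one item, so common and rare attribute values get roughly equal representation in the selection.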
