Algorithm to find correct index

I have an array and a map. The array contains a list of numbers; the map contains key (integer) / value (boolean) pairs telling us which items have been removed from the list.
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
And a map telling us which items have been removed:
{ 3: true, 4: true, 5: true, 6: true, 7: true, 8: true, 9: true }
The items remain in the array, but are not counted toward the index when looking up an item in the array by index.
For example, given the above, index 2 would return 10:
[1, 2, x, x, x, x, x, x, x, 10]
-0--1------------------------2-
We can loop through each item in the list to see if it is in the map, but the worst-case complexity would be O(n), and if there were one billion items in the list this would be a problem. Is there a better way to determine the correct index? I have thought of using a batch-type binary tree, where each node would hold a sequential range (if a sequential range exists) or a single number, but even then, if every other item was removed, the worst case would be O(n) since there would be no sequential ranges.

You can use a binary tree with an extra property: each node holds the number of removed items in its subtree. Update that value when you insert or remove an item.
To find the index of an item, find it in the tree, and for every right turn in the search add the number of removed items in the left subtree.
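As a rough sketch of the same counting idea, the following uses a Fenwick (binary indexed) tree in place of an explicit binary tree. It counts the items that are still present rather than the removed ones, which gives the same information. The names LiveIndex, remove and physical are made up for illustration, and the structure stores one counter per array slot, so it shows the O(log n) lookup rather than a memory layout suited to a billion items.

```python
class LiveIndex:
    """Fenwick tree over 0/1 'still present' flags (illustrative helper)."""

    def __init__(self, n):
        self.n = n
        # tree[i] covers the (i & -i) slots ending at 1-based slot i.
        # Every item starts present, so each block sum equals its length.
        self.tree = [i & -i for i in range(n + 1)]
        self.removed = [False] * n

    def _add(self, pos, delta):
        i = pos + 1  # Fenwick trees are 1-based internally
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def remove(self, pos):
        """Mark the item at physical position pos as removed."""
        if not self.removed[pos]:
            self.removed[pos] = True
            self._add(pos, -1)

    def physical(self, k):
        """Physical position of the k-th (0-based) present item, O(log n)."""
        if k < 0:
            raise IndexError(k)
        pos = 0
        remaining = k + 1
        step = 1 << self.n.bit_length()
        # Descend from the largest block size, skipping every block that
        # holds fewer present items than we still need.
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] < remaining:
                pos = nxt
                remaining -= self.tree[nxt]
            step >>= 1
        if pos >= self.n:
            raise IndexError(k)
        return pos


items = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
index = LiveIndex(len(items))
for value in range(3, 10):      # remove the values 3..9, as in the question
    index.remove(value - 1)
print(items[index.physical(2)])  # 10
```

Both remove and physical take O(log n) steps, whatever the pattern of removals, so the "every other item removed" case is no worse than any other.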

Related

Sorting a list that have n fixed segments already sorted in ascending order

The question goes as follows:
Lists which consist of a small fixed number, n, of segments connected
end-to-end, each segment is already in ascending order.
I thought about using mergesort with base case being if equal to n then go back and merge them since we already know that they are sorted, but if I have 3 segments it won't work since I'm dividing by two and you can't divide 3 segments equally into two parts.
The other approach is similar to merge sort: I use n stacks, one for each segment. We can identify segment boundaries wherever L[i] > L[i+1], since segments are in ascending order. But I need n comparisons to figure out which element comes first, and I don't know an efficient way of comparing n elements dynamically without using another data structure to compare the elements at the top of the stacks.
Also, you are supposed to use the feature of the problem (segments already ordered) to get better results than conventional algorithms, i.e. complexity less than O(n log n).
Pseudocode would be nice if you have an idea.
Edit:
An example would be [(14,20,22),(7,8,9),(1,2,3)] here we have 3 segments of 3 elements, even though the segments are sorted, the whole list isn't.
p.s. () is there to point out the segments only
I think maybe you've misunderstood mergesort. While usually you would split in half and sort each half before merging, it's really the merging part which makes the algorithm. You just need to merge on runs instead.
With your example of [(14,20,22),(7,8,9),(1,2,3)]
After first merge you have [(7, 8, 9, 14, 20, 22),(1, 2, 3)]
After second merge you have [(1, 2, 3, 7, 8, 9, 14, 20, 22)]
l = [14, 20, 22, 7, 8, 9, 1, 2, 3]
rl = [] # run list
sl = [l[0]] # temporary sublist
#split list into list of sorted sublists
for item in l[1:]:
    if item > sl[-1]:
        sl.append(item)
    else:
        rl.append(sl)
        sl = [item]
rl.append(sl)
print(rl)
#function for merging two sorted lists
def merge(l1, l2):
    l = [] #list we add into
    while True:
        if not l1:
            # first list is empty, add second list onto new list
            return l + l2
        if not l2:
            # second list is empty, add first list onto new list
            return l + l1
        if l1[0] < l2[0]:
            # rather than deleting, you could increment an index
            # which is likely to be faster, or reverse the list
            # and pop off the end, or use a data structure which
            # allows you to pop off the front
            l.append(l1[0])
            del l1[0]
        else:
            l.append(l2[0])
            del l2[0]
# keep merging sublists until only one remains
while len(rl) > 1:
    rl.append(merge(rl.pop(), rl.pop()))
print(rl)
It's worth noting that unless this is simply an exercise, you are probably better off using whatever built-in sorting function your language of choice provides.
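On the question's worry about comparing the tops of n stacks: one option, not taken from the answer above, is to let a heap do that comparison, which is what Python's heapq.merge does internally. The sketch below splits the list into ascending runs and merges them all at once; split_runs and sort_runs are names chosen here for illustration. With N items in n runs, this costs roughly O(N log n).

```python
import heapq


def split_runs(values):
    """Split values into maximal non-decreasing runs."""
    runs = []
    for item in values:
        if runs and item >= runs[-1][-1]:
            runs[-1].append(item)   # item extends the current run
        else:
            runs.append([item])     # item starts a new run
    return runs


def sort_runs(values):
    """Sort a list made of a few ascending segments via an n-way merge."""
    return list(heapq.merge(*split_runs(values)))


print(sort_runs([14, 20, 22, 7, 8, 9, 1, 2, 3]))
# [1, 2, 3, 7, 8, 9, 14, 20, 22]
```

When n is a small fixed constant, log n is constant too, so the whole sort is linear in the number of items.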

Algorithm to select 1 item from a list derived from a score

To explain what I mean, let's start with a list of items, each with its own score:
[2,3,4]
Adding these up gives 9, so I want to be able to select a random number in the total range, where this number would represent one of these items:
item 1, score of 2, range 0-1
item 2, score of 3, range 2-4
item 3, score of 4, range 5-8
The list would not be sorted and the "scores" could be any value
I could loop through the whole list and check whether the number is in, say, 2-4, and so on through the list, or do a binary search on the lower bounds of the ranges... but I think there should be some way I can just take the random number and calculate which item it represents...
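For reference, the binary-search variant mentioned above might look like the following sketch. It assumes the scores are positive integers and precomputes their running totals once; build_cumulative, pick and pick_random are names invented for this example.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def build_cumulative(scores):
    """Running totals, e.g. [2, 3, 4] -> [2, 5, 9]."""
    return list(accumulate(scores))


def pick(cumulative, r):
    """Index of the item whose range contains r, for 0 <= r < cumulative[-1]."""
    return bisect_right(cumulative, r)


def pick_random(cumulative):
    """Draw r uniformly from the total range and map it to an item index."""
    return pick(cumulative, random.randrange(cumulative[-1]))


cumulative = build_cumulative([2, 3, 4])
print([pick(cumulative, r) for r in range(9)])
# [0, 0, 1, 1, 1, 2, 2, 2, 2]
```

Building the totals is O(n) and each lookup is O(log n). If a ready-made routine is acceptable, the standard library's random.choices accepts a weights argument and performs a similar weighted selection.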

Get kth group of unsorted result list with arbitrary number of results per group

Okay so I have a huge array of unsorted elements of an unknown data type (all elements are of the same type, obviously; I just can't make assumptions, as they could be numbers, strings, or any type of object that overloads the < and > operators). The only assumption I can make about those objects is that no two of them are the same, and comparing them (A < B) should give me which one should show up first if it was sorted. The "smallest" should be first.
I receive this unsorted array (type std::vector, but honestly it's more of an algorithm question so no language in particular is expected), a number of objects per "group" (groupSize), and the group number that the sender wants (groupNumber).
I'm supposed to return an array containing groupSize elements, or less if the group requested is the last one. (Examples: 17 results with groupSize of 5 would only return two of them if you ask for the fourth group. Also, the fourth group is group number 3 because it's a zero-indexed array)
Example:
Received Array: {1, 5, 8, 2, 19, -1, 6, 6.5, -14, 20}
Received groupSize: 3
Received groupNumber: 2
If the array was sorted, it would be: {-14, -1, 1, 2, 5, 6, 6.5, 8, 19, 20}
If it was split in groups of size 3: {{-14, -1, 1}, {2, 5, 6}, {6.5, 8, 19}, {20}}
I have to return the third group (groupNumber 2 in a 0-indexed array): {6.5, 8, 19}
The biggest problem is the fact that it needs to be lightning fast. I can't sort the array because it has to be faster than O(n log n).
I've tried several methods, but can never get under O(n log n).
I'm aware that I should be looking for a solution that doesn't fill up all the other groups, and skips a pretty big part of the steps shown in the example above, to create only the requested group before returning it, but I can't figure out a way to do that.
You can find the value of the smallest element s in the group in linear time (on average) using the standard C++ std::nth_element function, because you know its index in the sorted array. You can find the largest element S in the group in the same way. After that, you need a linear pass to find all elements x such that s <= x <= S and return them. The total time complexity is O(n).
Note: this answer is not C++ specific. You just need an implementation of the k-th order statistics in linear time.
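To illustrate the language-independent version, here is a minimal Python sketch that uses a randomized quickselect in place of std::nth_element. It relies only on the < operator, as the question requires; kth_smallest and get_group are hypothetical names. The running time is linear in expectation rather than guaranteed, and the final sorted call only orders the handful of elements in the returned group.

```python
import random


def kth_smallest(items, k):
    """Return the k-th smallest (0-based) element in expected O(n) time."""
    items = list(items)
    while True:
        pivot = random.choice(items)
        lower = [x for x in items if x < pivot]
        higher = [x for x in items if pivot < x]
        not_higher = len(items) - len(higher)  # elements <= pivot
        if k < len(lower):
            items = lower
        elif k < not_higher:
            return pivot
        else:
            k -= not_higher
            items = higher


def get_group(data, group_size, group_number):
    """Elements of the requested group of the sorted order, without a full sort."""
    first = group_size * group_number
    if first >= len(data):
        return []
    last = min(first + group_size, len(data)) - 1
    s = kth_smallest(data, first)   # smallest element of the group
    S = kth_smallest(data, last)    # largest element of the group
    # One linear pass keeps everything with s <= x <= S, using only '<'.
    return sorted(x for x in data if not (x < s) and not (S < x))


data = [1, 5, 8, 2, 19, -1, 6, 6.5, -14, 20]
print(get_group(data, 3, 2))  # [6.5, 8, 19]
```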

Disperse Duplicates in an Array

Source : Google Interview Question
Write a routine to ensure that identical elements in the input are maximally spread in the output.
Basically, we need to place identical elements in such a way that the TOTAL spreading is as large as possible.
Example:
Input: {1,1,2,3,2,3}
Possible Output: {1,2,3,1,2,3}
Total dispersion = Difference between position of 1's + 2's + 3's = 4-1 + 5-2 + 6-3 = 9 .
I am NOT AT ALL sure whether an optimal polynomial-time algorithm exists for this. Also, no other detail is provided for the question.
What I thought of is to calculate the frequency of each element in the input, then arrange them in the output, one distinct element at a time, until all the frequencies are exhausted.
I am not sure of my approach.
Any approaches or ideas?
I believe this simple algorithm would work:
count the number of occurrences of each distinct element.
make a new list
add one instance of all elements that occur more than once to the list (order within each group does not matter)
add one instance of all unique elements to the list
add one instance of all elements that occur more than once to the list
add one instance of all elements that occur more than twice to the list
add one instance of all elements that occur more than thrice to the list
...
Now, this will intuitively not give a good spread:
for {1, 1, 1, 1, 2, 3, 4} ==> {1, 2, 3, 4, 1, 1, 1}
for {1, 1, 1, 2, 2, 2, 3, 4} ==> {1, 2, 3, 4, 1, 2, 1, 2}
However, I think this is the best spread you can get given the scoring function provided.
Since the dispersion score counts the sum of the distances instead of the squared sum of the distances, you can have several duplicates close together, as long as you have a large gap somewhere else to compensate.
For a sum-of-squared-distances score, the problem becomes harder.
Perhaps the interview question hinged on the candidate recognizing this weakness in the scoring function?
In Perl:
@a=(9,9,9,2,2,2,1,1,1);
Then make a hash table of the counts of the different numbers in the list, like a frequency table:
map { $x{$_}++ } @a;
Then repeatedly walk through all the keys found, with the keys in a known order, and add one copy of each number that still has copies left to an output list until all the keys are exhausted:
@r=();
$g=1;
while( $g == 1 ) {
    $g=0;
    for my $n (sort keys %x)
    {
        # emit one copy of every number that still has copies left
        if ($x{$n}>0) {
            push @r, $n;
            $x{$n}--;
            $g=1
        }
    }
}
I'm sure that this could be adapted to any programming language that supports hash tables.
Python code for the algorithm suggested by Vorsprung and HugoRune:
from collections import Counter, defaultdict

def max_spread(data):
    cnt = Counter(data)       # remaining copies of each value
    res, num = [], list(cnt)  # num fixes the order of the distinct values
    while cnt:
        # one round: emit one copy of every value that still has copies left
        for i in num:
            if cnt[i] > 0:
                res.append(i)
                cnt[i] -= 1
                if cnt[i] == 0:
                    del cnt[i]
    return res

def calc_spread(data):
    d = defaultdict(list)     # value -> positions where it occurs
    for i, v in enumerate(data):
        d[v].append(i)
    # distance between the first and last occurrence of each value
    return sum(max(x) - min(x) for x in d.values())
HugoRune's answer takes some advantage of the unusual scoring function but we can actually do even better: suppose there are d distinct non-unique values, then the only thing that is required for a solution to be optimal is that the first d values in the output must consist of these in any order, and likewise the last d values in the output must consist of these values in any (i.e. possibly a different) order. (This implies that all unique numbers appear between the first and last instance of every non-unique number.)
The relative order of the first copies of non-unique numbers doesn't matter, and likewise nor does the relative order of their last copies. Suppose the values 1 and 2 both appear multiple times in the input, and that we have built a candidate solution obeying the condition I gave in the first paragraph that has the first copy of 1 at position i and the first copy of 2 at position j > i. Now suppose we swap these two elements. Element 1 has been pushed j - i positions to the right, so its score contribution will drop by j - i. But element 2 has been pushed j - i positions to the left, so its score contribution will increase by j - i. These cancel out, leaving the total score unchanged.
Now, any permutation of elements can be achieved by swapping elements in the following way: swap the element in position 1 with the element that should be at position 1, then do the same for position 2, and so on. After the ith step, the first i elements of the permutation are correct. We know that every swap leaves the scoring function unchanged, and a permutation is just a sequence of swaps, so every permutation also leaves the scoring function unchanged! This is true for the d elements at both ends of the output array.
When 3 or more copies of a number exist, only the position of the first and last copy contribute to the distance for that number. It doesn't matter where the middle ones go. I'll call the elements between the 2 blocks of d elements at either end the "central" elements. They consist of the unique elements, as well as some number of copies of all those non-unique elements that appear at least 3 times. As before, it's easy to see that any permutation of these "central" elements corresponds to a sequence of swaps, and that any such swap will leave the overall score unchanged (in fact it's even simpler than before, since swapping two central elements does not even change the score contribution of either of these elements).
This leads to a simple O(n log n) algorithm (or O(n) if you use bucket sort for the first step) to generate a solution array Y from a length-n input array X:
Sort the input array X.
Use a single pass through X to count the number of distinct non-unique elements. Call this d.
Set i, j and k to 0.
While i < n:
    If i+1 < n and X[i+1] == X[i], we have a non-unique element:
        Set Y[j] = Y[n-j-1] = X[i].
        Increment i twice, and increment j once.
        While i < n and X[i] == X[i-1]:
            Set Y[d+k] = X[i].
            Increment i and k.
    Otherwise we have a unique element:
        Set Y[d+k] = X[i].
        Increment i and k.
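The steps above translate fairly directly into Python. This is only a sketch: disperse and spread are names picked for illustration, and the bounds checks on i are additions needed to keep the loops inside the array.

```python
from collections import Counter


def disperse(values):
    """Arrange values so first/last copies of repeated values sit at the ends."""
    x = sorted(values)
    n = len(x)
    # d = number of distinct values that occur more than once
    d = sum(1 for c in Counter(x).values() if c > 1)
    y = [None] * n
    i = j = k = 0
    while i < n:
        if i + 1 < n and x[i + 1] == x[i]:
            # Non-unique value: one copy goes in the front block, one in the back.
            y[j] = y[n - j - 1] = x[i]
            i += 2
            j += 1
            # Any further copies are "central" elements.
            while i < n and x[i] == x[i - 1]:
                y[d + k] = x[i]
                i += 1
                k += 1
        else:
            # Unique value: also a central element.
            y[d + k] = x[i]
            i += 1
            k += 1
    return y


def spread(values):
    """Sum over distinct values of (last position - first position)."""
    first, last = {}, {}
    for pos, v in enumerate(values):
        first.setdefault(v, pos)
        last[v] = pos
    return sum(last[v] - first[v] for v in first)


result = disperse([1, 1, 2, 3, 2, 3])
print(result, spread(result))  # [1, 2, 3, 3, 2, 1] 9
```

The output differs from the {1,2,3,1,2,3} arrangement in the question but reaches the same total of 9, which matches the claim that the order within the two end blocks does not affect the score.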

retrieve closest element from a set of elements

I'm experimenting with an idea, where I have following subproblem:
I have a list of size m containing tuples of fixed length n.
[(e11, e12, .., e1n), (e21, e22, .., e2n), ..., (em1, em2, .., emn)]
Now, given some random tuple (t1, t2, .., tn), which does not belong to the list, I want to find the closest tuple(s), that belongs to the list.
I use the following distance function (Hamming distance):
def distance(A, B):
    total = 0
    for e1, e2 in zip(A, B):
        total += e1 != e2  # count the positions where the tuples differ
    return total
One option is to use exhaustive search, but this is not sufficient for my problem as the lists are quite large. Another idea I have come up with is to first use k-medoids to cluster the list and retrieve K medoids (cluster centers). For querying, I can determine the closest cluster with K calls to the distance function. Then I can search for the closest tuple within that particular cluster. I think it should work, but I am not completely sure it is fine in cases where the query tuple is on the edge of a cluster.
However, I was wondering, if you have a better idea to solve the problem as my mind is completely blank at the moment. However, I have a strong feeling that there may be a clever way to do it.
Solutions that require precomputing something are fine as long as they bring down the complexity of the query.
You can store a hash table (dictionary/map) that maps from an element (in the tuple) to the tuples it appears in: hash:element->list<tuple>.
Now, when you have a new "query", you will need to iterate over hash(element) for each element of the new "query", and find the tuple with the maximal number of hits.
pseudo code:
findMax(tuple):
    histogram <- empty map
    for each element in tuple:
        #assuming hash_table is the described DS from above
        for each x in hash_table[element]:
            histogram[x]++ #assuming lazy initialization to 0
    return key with highest value in histogram
An alternative that does not exactly follow the metric you desired is a k-d tree. The difference is that a k-d tree also takes into consideration the "distance" between the elements (and not only equality/inequality).
k-d trees also require the elements to be comparable.
If your data is big enough, you may want to create some inverted indexes over it.
With a data of m vectors of n elements.
Data:
0: 1, 2, 3, 4, 5, ...
1: 2, 3, 1, 5, 3, ...
2: 5, 3, 2, 1, 3, ...
3: 1, 2, 1, 5, 3, ...
...
m: m0, ... mn
Then you want to get n indexes like this:
Index0
1: 0, 3
2: 1
5: 2
Index1
2: 0, 3
3: 1, 2
Index2
3: 0
1: 1, 3
2: 2
...
Then you only search on your indexes to get the tuples that contain any of the query tuple values and find the closest tuple within those.
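def search(query):
    # data is the list of tuples and indexes the per-position indexes above
    candidates = set()
    for i in range(len(query)):
        value = query[i]
        # ids of the tuples that hold this value at position i
        candidates.update(indexes[i].get(value, []))
    # find candidates with min distance
    for candidate in candidates:
        d = distance(data[candidate], query)
        ...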
The heavy process is creating the indexes; once you have built them, the search will be really fast.
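A small runnable sketch of the per-position indexes, combined with the hit-counting idea from the earlier answer, could look like this. The function names build_indexes and closest are made up for the example, and the sample rows are the first five columns of the data shown above.

```python
from collections import Counter, defaultdict


def build_indexes(data):
    """One index per position: value -> ids of the tuples holding it there."""
    n = len(data[0])
    indexes = [defaultdict(list) for _ in range(n)]
    for row_id, row in enumerate(data):
        for pos, value in enumerate(row):
            indexes[pos][value].append(row_id)
    return indexes


def closest(query, data, indexes):
    """Tuples that agree with query in the most positions (min Hamming distance)."""
    hits = Counter()
    for pos, value in enumerate(query):
        for row_id in indexes[pos].get(value, ()):
            hits[row_id] += 1
    if not hits:
        # No tuple shares any position with the query, so all are equally far.
        return []
    best = max(hits.values())
    return [data[row_id] for row_id, h in hits.items() if h == best]


data = [
    (1, 2, 3, 4, 5),
    (2, 3, 1, 5, 3),
    (5, 3, 2, 1, 3),
    (1, 2, 1, 5, 3),
]
indexes = build_indexes(data)
print(closest((1, 2, 1, 4, 3), data, indexes))  # [(1, 2, 1, 5, 3)]
```

Maximizing the number of matching positions is the same as minimizing the Hamming distance, so no separate distance call is needed. The query cost depends on how many tuples share a value with the query at each position, so it helps most when values are spread over many distinct keys.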
