Efficiently re-computing area under ROC when one label changes

Say that you have a list of scores with binary labels (for simplicity, assume no ties), and that we've used the labels to compute the area under the associated receiver operating characteristic (ROC) curve. For a set of n scores, this calculation is straightforward to do in O(n log n) time -- you simply sort the list, then traverse the list in sorted order, keeping a running total of the number of positively labeled examples you've seen so far. Every time you see a negative label, you add the number of positives, and at the end you divide the resulting sum by the product of the number of positives times the number of negatives.
Now, having done that calculation, say that someone comes along and flips exactly one label (from positive to negative or vice versa). The scores themselves do not change, so you don't need to re-sort. It's straightforward to calculate the new area under the curve (AUC) in O(n) time by re-traversing the sorted list. My question is, is it possible to compute the new AUC in something better than O(n)? I.e., do I have to re-traverse the entire sorted list to get the new AUC?
I think I can do the re-calculation in O(1) time by storing, at each position in the ranked list, a count of the positives and negatives above that position. But I am going to need to re-calculate the AUC repeatedly as more labels get flipped, and I think that if I rely on those stored values, updating them for the next flip will be O(n).

Yes, it is possible to update the AUC in O(log n) time per flip. You need two sets of scores, one for positives and one for negatives, that provide the following operations:
Querying the number of items with higher (or lower) score than a given value (score of the label being flipped).
Inserting and removing the elements.
Knowing the number of positives above/below a given position lets you update the AUC efficiently, as you already mentioned. After that you have to remove the item from the set of positives (or negatives) and insert it into the set of negatives (or positives), respectively.
Balanced search trees can do both operations in O(log(n)).
Furthermore, the actual values of the scores do not matter; only the position in the sorted order is relevant. This leads to a very simple and efficient implementation using a binary indexed tree. See http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=binaryIndexedTrees for an explanation.
Also, you don't really need to maintain two sets. Since you already know the total number of items above a given position, a single set (say, of positives) is enough; the count of negatives follows by subtraction.
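A minimal Python sketch of the single-set idea, assuming distinct scores as in the question. The class name IncrementalAUC and its method names are illustrative and not taken from any library. It stores a 1 in a binary indexed tree at the rank of every positive item and keeps a running count of (positive, negative) pairs in which the positive has the higher score; each flip adjusts that count in O(log n).

```python
class IncrementalAUC:
    """AUC over fixed, distinct scores with O(log n) label flips (illustrative sketch)."""

    def __init__(self, scores, labels):
        n = len(scores)
        order = sorted(range(n), key=lambda i: scores[i])  # ascending score
        self.rank = [0] * n
        for r, i in enumerate(order):
            self.rank[i] = r
        self.n = n
        self.labels = [bool(l) for l in labels]
        self.tree = [0] * (n + 1)  # binary indexed tree over ranks, 1 = positive
        self.pos = 0               # current number of positives
        self.pairs = 0             # (pos, neg) pairs with score(pos) > score(neg)
        negatives_seen = 0
        for i in order:
            if self.labels[i]:
                self.pairs += negatives_seen
                self._add(self.rank[i], 1)
                self.pos += 1
            else:
                negatives_seen += 1

    def _add(self, r, delta):
        r += 1
        while r <= self.n:
            self.tree[r] += delta
            r += r & -r

    def _prefix(self, r):
        """Number of positives with rank strictly below r."""
        total = 0
        while r > 0:
            total += self.tree[r]
            r -= r & -r
        return total

    def flip(self, i):
        r = self.rank[i]
        was_pos = self.labels[i]
        pos_below = self._prefix(r)
        pos_above = self.pos - pos_below - (1 if was_pos else 0)
        neg_below = r - pos_below
        if was_pos:
            # Lose pairs with lower negatives, gain pairs with higher positives.
            self.pairs += pos_above - neg_below
            self._add(r, -1)
            self.pos -= 1
        else:
            # Lose pairs with higher positives, gain pairs with lower negatives.
            self.pairs += neg_below - pos_above
            self._add(r, 1)
            self.pos += 1
        self.labels[i] = not was_pos

    def auc(self):
        # Undefined (division by zero) when all labels are equal.
        return self.pairs / (self.pos * (self.n - self.pos))
```

For example, with scores [0.1, 0.4, 0.35, 0.8] and labels [0, 0, 1, 1], auc() gives 0.75, and after flip(1) it gives 1.0.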

Related

Return N Optimal Choices for Multiple Choice Knapsack Variation

Problem
I'm trying to return the N best answers (to be exact, I want 250). As I understand it, dynamic programming returns only the single optimal answer, so I believe I have to use backtracking to generate the N best answers.
For this knapsack variant, I have a maximum weight that the combination of objects must not exceed. I have 4 sets of objects, and exactly one must be chosen from each set to give the highest value without exceeding the weight constraint. Each object in the sets has a value and a weight.
The sets have 164, 201, 90 and 104 objects which means there are 308,543,040 variations to try. I have a brute force algorithm implemented but it takes forever.
Attempts At Optimization
So far, my attempt at optimizing is to preprocess the input sets by sorting them by increasing weight (lowest first). When adding an object, if the weight of the combination so far exceeds the weight constraint, I can skip the rest of that set, since all remaining options are heavier and will not be valid either. This check can be run at any level of the recursive function.
I also have a minimum heap that stores the maximum values I've found. If the combination of four objects is less than the top of the heap, then it will not be added. Otherwise, pushpop to the heap. I'm not sure if I can use this to optimize the backtracking even further, since it requires all four objects to be selected. It's used more as validation rather than improving the speed.
Questions
Are there any other optimizations I can do with backtracking that will speed up the process of finding N optimal answers? Have I exhausted optimization and should just use multiple threads?
Is it possible to use dynamic programming with this? How can I modify dynamic programming to return N optimal choices?
Any other algorithms to look into?

Since exactly one item has to be picked from each set, you can try this optimization:
Let the sets be A,B,C,D.
Create all combinations of items from sets A,B together and sets C,D together. This will have O(n^2) complexity, assuming lists have length n. Let the combination lists be X and Y now.
Sort X and Y based on weight. You can use something like a cumulative array to track the combination with the max possible value under a given weight. (Other data structures might be used for the same task as well, this is just a suggestion to highlight the underlying idea).
Create a max heap to store the combinations with max values.
For each combination in X, pick the combination in Y with the highest value under the constraint that its weight is <= target weight - X_combination_weight. Based on this combination's value, insert it in the max heap.
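The steps above can be sketched in Python as follows. This is a sketch under assumptions: items are (value, weight) tuples, the function name best_choice is hypothetical, and it returns only the single best combination. Collecting the true N best would need more than one candidate from Y per X combination, which this sketch does not attempt.

```python
from bisect import bisect_right
from itertools import product


def best_choice(A, B, C, D, max_weight):
    """Each set is a list of (value, weight). Returns (value, (a, b, c, d)) or None."""
    # Pairwise combinations: (weight, value, items).
    X = [(a[1] + b[1], a[0] + b[0], (a, b)) for a, b in product(A, B)]
    Y = sorted(
        ((c[1] + d[1], c[0] + d[0], (c, d)) for c, d in product(C, D)),
        key=lambda t: t[0],
    )
    y_weights = [y[0] for y in Y]

    # best_prefix[i] = index of the highest-value entry among Y[0..i]
    # (the "cumulative array" from the answer).
    best_prefix = []
    best = 0
    for i, y in enumerate(Y):
        if y[1] > Y[best][1]:
            best = i
        best_prefix.append(best)

    answer = None
    for wx, vx, px in X:
        # Last Y entry whose weight still fits next to this X combination.
        j = bisect_right(y_weights, max_weight - wx) - 1
        if j < 0:
            continue
        _, vy, py = Y[best_prefix[j]]
        if answer is None or vx + vy > answer[0]:
            answer = (vx + vy, px + py)
    return answer
```

With set sizes like those in the question, X and Y each hold a few tens of thousands of entries, so the work is dominated by sorting and one binary search per X entry rather than by the full 308 million combinations.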

Algorithmic help needed (N bags and items distributed randomly)

I have encountered an algorithmic problem but have not been able to come up with anything better than brute force, or to reduce it to a better-known problem. Any hints?
There are N bags of variable sizes and N types of items. Each type of items belongs to one bag. There are lots of items of each type and each item may be of a different size. Initially, these items are distributed across all the bags randomly. We have to place the items in their respective bags. However, we can only operate with a pair of bags at one time by exchanging items (as much as possible) and proceeding to the next pair. The aim is to reduce the total number of pairs. Edit: The aim is to find a sequence of transfers that minimizes the total number of bag pairs involved
Clarification:
The bags are not arbitrarily large (you can assume the bag and item sizes to be integers between 0 and 1000 if it helps). You'll frequently encounter scenarios where all the items between 2 bags cannot be swapped due to the limited capacity of one of the bags. This is where the algorithm needs to make an optimisation. Perhaps, if another pair of bags were swapped first, the current swap could be done in one go. To illustrate this, let's consider bags A, B and C and their items 1, 2, 3 respectively. The number in the brackets is the size.
A(10) : 3(8)
B(10): 1(2), 1(3)
C(10): 1(4)
The swap orders can be AB, AC, AB or AC, AB. The latter is optimal as it uses fewer swaps.

Since I cannot come to an idea for an algorithm that will always find an optimal answer, and approximation of the fitness of the solution (amount of swaps) is also fine, I suggest a stochastic local search algorithm with pruning.
Given a random starting configuration, this algorithm considers all possible swaps and makes a weighted random choice: the better a swap is, the more likely it is to be chosen.
The value of a swap is the sum of the values of the item transfers it makes: a transfer is worth zero if the item does not end up in the bag it belongs to, and is positive if it does. The value increases with the item's size (the idea being that a large item is harder to move many times than a small one). This fitness function can be replaced by any other; its efficiency is unknown until shown empirically.
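The weighted choice could look like the following Python sketch. The function pick_swap and the value callback are hypothetical names; how swaps are represented and scored is left to the caller. It uses random.choices from the standard library and falls back to a uniform pick when no candidate has a positive value, since random.choices rejects weights that sum to zero.

```python
import random


def pick_swap(swaps, value, rng=random):
    """Pick one swap at random, with probability proportional to value(swap)."""
    weights = [value(s) for s in swaps]
    if not any(w > 0 for w in weights):
        # No swap puts anything in its correct bag: choose uniformly.
        return rng.choice(swaps)
    return rng.choices(swaps, weights=weights, k=1)[0]
```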
Since any configuration can be the consequence of many preceding swaps, we keep track of which configurations we have seen before, along with a fitness (based on how many items are in their correct bag; this fitness is not related to the value of a swap) and the list of preceding swaps. If the fitness of a configuration is the number of items that are in their correct bags, then the total number of items in the problem is the highest possible fitness (and therefore marks a configuration as a solution).
A swap is not possible if:
Either of the affected bags would hold more than its capacity after the potential swap.
The new swap brings you back to the last configuration you were in before the last swap you did (i.e. reversed swap).
When we identify potential swaps, we look up the resulting configuration in our list of previously seen configurations (use a hash table for O(1) lookup). Then we either set its list of preceding swaps to ours (if our list is shorter than its list), or set ours to its list (if its list is shorter than ours). We can do this because it does not matter which swaps we did, as long as the number of swaps is as small as possible.
If there are no more possible swaps left in a configuration, you're stuck. Local search then tells you to 'reset', which you can do in many ways, for instance:
Reset to a previously seen state (maybe the best one you've seen so far?)
Reset to a new valid random solution
Note
Since the algorithm only allows you to do valid swaps, all constraints will be met for each configuration.
The algorithm is not guaranteed to stop out of the box; you can impose a maximum number of iterations (swaps).
The algorithm is not guaranteed to find a correct solution, as it does its best to find a better configuration each iteration. However, since a perfect solution (set of swaps) should look very similar to an almost perfect one, a human might be able to finish what the local search algorithm could not if it ends in an invalid configuration (where not every item is in its correct bag).
The fitness functions and strategies used here are very likely not the most efficient out there. You could look around to find better ones. A more efficient fitness function or strategy should reach a good solution faster (in fewer iterations).

Constant time search

Suppose I have a rod which I cut to pieces. Given a point on the original rod, is there a way to find out which piece it belongs to, in constant time?
For example:
|------------------|---------|---------------|
0.0                4.5       7.8532          9.123
Given a position:
                               ^
                               |
                             8.005
I would like to get 3rd piece.
It is possible to easily get such answer in O(log n) time with binary search but is it possible to do it in O(1)? If I pre-process the "cut" positions somehow?

If you assume the query point is chosen uniformly at random along the rod, you can get an EXPECTED constant-time solution without a memory explosion, as follows. Let N be the number of original, irregularly spaced segments. Break the rod into N equally spaced pieces and record, for each equal-sized piece, which of the original irregular segment(s) it overlaps. To answer a query, round the query point down to find which equally spaced piece it lies in, use that index to look up the original segments that intersect that piece, and check each of them to see whether it contains your point (you can use binary search here if you want the worst-case performance to stay logarithmic). The expected running time is constant if the query point is chosen uniformly at random along the rod, and the memory used is O(N), so there are no crazy memory requirements.
PROOF OF EXPECTED O(1) RUNNING TIME:
When you count the total number of intersection pairs between your original N irregular segments and the N equally-spaced pieces I propose constructing, the total number is no more than 2*(N+1) (because if you sort all the end-points of all the regular and irregular segments, a new intersection pair can always be charged to one of the end-points defining either a regular or irregular segment). So you have a multi-set of at most 2(N+1) of your irregular segments, distributed out in some fashion among the N regular segments that they intersect. The actual distribution of intersections among the regular segments doesn't matter. When you have a uniform query point and compute the expected number of irregular segments that intersect the regular segment that contains the query point, each regular segment has probability 1/N of being chosen by the query point, so the expected number of intersected irregular segments that need to be checked is 2*(N+1)/N = O(1).
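A small Python sketch of this bucketing scheme, under a few assumptions: the cut positions are given as a sorted list including both ends of the rod, the function names build_index and find_piece are hypothetical, and pieces are numbered from 0. Instead of storing every overlapping segment per bucket, it stores only the first one and scans forward from there, which visits the same segments.

```python
from bisect import bisect_right


def build_index(cuts):
    """cuts: sorted boundaries, e.g. [0.0, 4.5, 7.8532, 9.123]."""
    n = len(cuts) - 1                    # number of irregular pieces
    width = (cuts[-1] - cuts[0]) / n     # width of each equal-sized bucket
    # first[b] = index of the piece that contains the left edge of bucket b
    first = [bisect_right(cuts, cuts[0] + b * width) - 1 for b in range(n)]
    return first, width


def find_piece(cuts, first, width, x):
    """0-based index of the piece containing x (cuts[0] <= x <= cuts[-1])."""
    n = len(cuts) - 1
    b = min(int((x - cuts[0]) / width), n - 1)
    i = min(first[b], n - 1)
    # Guard against floating-point rounding at bucket edges.
    while i > 0 and cuts[i] > x:
        i -= 1
    # Scan the few pieces that overlap this bucket.
    while i + 1 < n and cuts[i + 1] <= x:
        i += 1
    return i
```

With the cuts from the question, find_piece for position 8.005 returns index 2, i.e. the 3rd piece. A single bucket can still overlap many short pieces, so the worst case of the forward scan is linear; only the expectation over uniform queries is constant.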

For arbitrary cuts and precisions, not really, you have to compare the position with the various start or end points.
But, if you're only talking a small number of cuts, performance shouldn't really be an issue.
For example, even with ten segments, you only have nine comparisons, not a huge amount of computation.
Of course, you can always turn the situation into a polynomial formula (such as ax^4 + bx^3 + cx^2 + dx + e), generated using simultaneous equations, which will give you a segment, but the highest power tends to rise with the segment count so it's not necessarily as efficient as simple checks.

You're not going to do better than lg n with a comparison-based algorithm. Reinterpreting the 31 non-sign bits of a positive IEEE float as a 31-bit integer is an order-preserving transformation, so tries and van Emde Boas trees both are options. I would steer you first toward a three-level trie.

You could assign an integral number to every position and then use that as index into a lookup table, which would give you constant-time lookup. This is pretty easy if your stick is short and you don't cut it into pieces that are fractions of a millimeter long. If you can get by with such an approximation, that would be my way to go.
There is one enhanced way which generalizes this even further. In each element of a lookup table, you store the middle position and the segment ID to the left and right. This makes one lookup (O(1)) plus one comparison (O(1)). The downside is that the lookup table has to be so large that you never have more than two different segments in the same table element's range. Again, it depends on your requirements and input data whether this works or not.

Calculating the actual average value

I've got a relatively small set (~100 values) of integers: each of them represents how long (in milliseconds) a test I ran lasted.
The trivial algorithm to calculate the average is to sum up all the n values and divide the result by n, but this doesn't take into account that some ridiculously high/low value must be wrong and should get discarded.
What algorithms are available to estimate the actual average value?

As you said, you can discard all values that diverge by more than a given amount from the average and then recompute the average. Another value that can be interesting is the median, that is, the middle value of the sorted data (not to be confused with the mode, which is the most frequent value).

It depends on different conditions of your test. And it is a task from probability theory.
One of the simplest ways is to calculate the median, which copes well with ridiculously high/low values. Look at the link below:
Wiki about median

As you noted, the arithmetic mean isn't good if there are very high/low values.
You could compute the median, as someone suggested, which is, in a sorted list of your values, the "middle" value (if your set contains an odd number of items) or the arithmetic mean of the two "middle" values (otherwise).
Another method would be to drop, say, the lowest and highest five percent of the values and compute the arithmetic mean of the rest.

Some options:
First discard the N highest and N lowest values and compute the arithmetic mean of the rest. Set N to a suitable value so that, for example, 1% or 10% of the values are discarded.
Use the median, or middle value.
Use the geometric mean, which gives less weight to high outliers.
Wikipedia lists some ways to compute different "mean" values
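The first two options can be sketched with Python's standard statistics module. The helper name trimmed_mean and the sample timings are made up for illustration; fraction is the share of values dropped at each end and should stay below 0.5 so that something is left to average.

```python
import statistics


def trimmed_mean(values, fraction=0.1):
    """Mean after dropping the lowest and highest `fraction` of the values."""
    data = sorted(values)
    k = int(len(data) * fraction)       # how many to drop at each end
    return statistics.mean(data[k:len(data) - k])


# Hypothetical test durations in milliseconds, with two obvious outliers.
times = [10, 11, 12, 11, 10, 12, 11, 10, 500, 1]
print(statistics.mean(times))      # plain mean, pulled up by the 500 ms outlier
print(trimmed_mean(times, 0.1))    # drops the single lowest and highest value
print(statistics.median(times))    # middle value of the sorted data
```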

Finding median of large set of numbers too big to fit into memory

I was asked this question in an interview recently.
There are N numbers, too many to fit into memory. They are split across k database tables (unsorted), each of which can fit into memory. Find the median of all the numbers.
Wasn't quite sure about the answer to this one.

There are a few potential solutions:
External merge sort - O(n log n)
You basically sort the numbers on the first pass, then find the median on the second.
Order statistics distributed selection algorithm - O(n)
Simplify the problem to the original problem of finding the kth number in an unsorted array.
Counting sort histogram O(n)
You have to assume some properties about the range of the numbers - can the range fit in the memory?
If anything is known about the distribution of the numbers, other algorithms can be produced.
For more details and implementation see:
http://www.fusu.us/2013/07/median-in-large-set-across-1000-servers.html

This answer on quora explains the whole process clearly step by step http://qr.ae/dMkGc. Simply copying it down for non Quorans
Suppose you have a master node (or are able to use a consensus protocol to elect a master from among your servers). The master first queries the servers for the sizes of their data sets and calls the total n, so that it knows to look for the k-th largest element with k = n/2.
The master then selects a random server and queries it for a random element from the elements on that server. The master broadcasts this element to each server, and each server partitions its elements into those larger than or equal to the broadcasted element and those smaller than the broadcasted element.
Each server returns to the master the size of its larger-than partition; call the sum of these sizes m. If m is greater than k, the master tells each server to disregard its less-than set for the remainder of the algorithm. If m is less than k, the master tells each server to disregard its larger-than set and updates k = k - m. If m is exactly k, the algorithm terminates and the value returned is the pivot selected at the beginning of the iteration.
If the algorithm does not terminate, recurse beginning with selecting a new random pivot from the remaining elements.
Analysis:
Let n be the total number of elements and s be the number of servers. Assume that the elements are roughly randomly and evenly distributed among servers (each server has O(n/s) elements). In iteration i, we expect to do about O(n/(s*2^i)) work on each server, as the size of each server's element set will be approximately cut in half (remember, we assumed a roughly random distribution of elements), and O(s) work on the master (for broadcasting/receiving messages and adding the sizes together). We expect O(log(n/s)) iterations. Adding these up over all iterations gives an expected runtime of O(n/s + s log(n/s)), and assuming s << sqrt(n), which is normally the case, this becomes simply O(n/s), which is the best you could possibly hope for.
Note also that this works not just for finding the median but also for finding the kth largest value for any value of k.
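The procedure can be simulated in a single process, with each "server" represented by a plain Python list. This is only a sketch: the function name distributed_kth_largest is hypothetical, no real networking is involved, and unlike the description above it splits elements three ways (larger, equal, smaller) so that duplicate values cannot cause an endless loop.

```python
import random


def distributed_kth_largest(servers, k, rng=random):
    """servers: list of lists of numbers; k is 1-based and at most the total count."""
    parts = [list(s) for s in servers]
    while True:
        # Master: pick a random non-empty server, then a random element on it.
        pivot = rng.choice(rng.choice([p for p in parts if p]))
        # Each server partitions its own elements around the broadcast pivot.
        larger = [[x for x in p if x > pivot] for p in parts]
        smaller = [[x for x in p if x < pivot] for p in parts]
        equal = sum(1 for p in parts for x in p if x == pivot)
        # Master: add up the sizes reported by the servers.
        m = sum(len(p) for p in larger)
        if k <= m:
            parts = larger            # answer is among the larger elements
        elif k <= m + equal:
            return pivot              # pivot itself is the k-th largest
        else:
            k -= m + equal            # skip everything >= pivot
            parts = smaller
```

For the median of n elements one would call it with k = (n + 1) // 2, which picks the upper of the two middle elements when n is even.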

Have a look at the "Median of Medians" algorithm in this Wikipedia article.
Related question: Median-of-medians in Java.
Explanation: http://www.ics.uci.edu/~eppstein/161/960130.html

Another way to look at this is to go back to the definition of "median." Authors vary in their language, but basically the median is the value which splits a probability distribution into two equal parts.
So instead of spending a lot of effort sorting enormous data sets, estimate the distribution and find the middle. As noted above for some distributions the median equals the mean, which is quick and easy to compute. Also, if an exact answer isn't necessary you can use the empirical relationship: mean - mode = 3 * (mean - median).

Here is what I would do:
1. Sample the data to get a general idea about the distribution.
2. Using the information about the distribution, choose a "bucket" (a range) large enough to contain the median and small enough to fit into memory.
3. With one pass (O(N)), count the numbers before the bucket (L1_size) and after the bucket (L3_size), and put the numbers within the range into the bucket (L2). You will see whether the chosen bucket contains the median. If not, go to step 2.
4. Use quickselect or another method to find the k-th element in the bucket, where k = N/2 - L1_size and N = L1_size + L2_size + L3_size.
Requires O(N) + O(L2_size) steps.
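Steps 3 and 4 might look like this in Python. It is a sketch with assumptions: the function name median_via_bucket is hypothetical, stream_factory stands for anything that returns a fresh iterator over the full data (for example, a function that re-reads the tables), the bucket is the half-open range [lo, hi), and the bucket is simply sorted rather than run through quickselect. For an even count it returns the lower of the two middle values.

```python
def median_via_bucket(stream_factory, lo, hi):
    """One pass over the data; returns the median, or None if the bucket missed it."""
    below = above = 0
    bucket = []
    for x in stream_factory():
        if x < lo:
            below += 1          # L1_size
        elif x >= hi:
            above += 1          # L3_size
        else:
            bucket.append(x)    # L2
    n = below + len(bucket) + above
    k = (n - 1) // 2 - below    # 0-based rank of the median inside the bucket
    if not 0 <= k < len(bucket):
        return None             # bad guess: choose another bucket and retry
    return sorted(bucket)[k]
```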

I was also asked the same question and I couldn't give an exact answer, so after the interview I went through some books on interviews, and here is what I found.
Example: Numbers are randomly generated and stored into an (expanding) array. How would you keep track of the median?
Our data structure brainstorm might look like the following:
• Linked list? Probably not. Linked lists tend not to do very well with accessing and sorting numbers.
• Array? Maybe, but you already have an array. Could you somehow keep the elements sorted? That's probably expensive. Let's hold off on this and return to it if it's needed.
• Binary tree? This is possible, since binary trees do fairly well with ordering. In fact, if the binary search tree is perfectly balanced, the top might be the median. But be careful: if there's an even number of elements, the median is actually the average of the middle two elements. The middle two elements can't both be at the top. This is probably a workable algorithm, but let's come back to it.
• Heap? A heap is really good at basic ordering and keeping track of max and mins. This is actually interesting: if you had two heaps, you could keep track of the bigger half and the smaller half of the elements. The bigger half is kept in a min heap, such that the smallest element in the bigger half is at the root. The smaller half is kept in a max heap, such that the biggest element of the smaller half is at the root. Now, with these data structures, you have the potential median elements at the roots. If the heaps are no longer the same size, you can quickly "rebalance" the heaps by popping an element off the one heap and pushing it onto the other.
Note that the more problems you do, the more developed your instinct on which data structure to apply will be. You will also develop a more finely tuned instinct as to which of these approaches is the most useful.
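The two-heap idea from the excerpt can be sketched with Python's heapq module. The class name RunningMedian is illustrative. Since heapq only provides a min heap, the smaller half is stored with negated values so that its largest element sits at the root. Note that this tracks the median of a stream that fits in memory, which is a different setting from the original k-tables question.

```python
import heapq


class RunningMedian:
    def __init__(self):
        self.low = []    # max heap of the smaller half (values stored negated)
        self.high = []   # min heap of the bigger half

    def add(self, x):
        if self.low and x > -self.low[0]:
            heapq.heappush(self.high, x)
        else:
            heapq.heappush(self.low, -x)
        # Rebalance so that low has the same size as high, or one more.
        if len(self.low) > len(self.high) + 1:
            heapq.heappush(self.high, -heapq.heappop(self.low))
        elif len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def median(self):
        # Requires at least one element to have been added.
        if len(self.low) > len(self.high):
            return -self.low[0]
        return (-self.low[0] + self.high[0]) / 2
```

Each add costs O(log n) and median is O(1), because the candidates are always at the two roots.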

If an approximate answer is sufficient, a method similar to #piccolbo works well. I'll assume all the points are integers, but if not you can multiply by ten or a hundred or whatever to normalize the data to integers. Make one pass over the data calculating the average (arithmetic mean). Call that number the provisional median. Then make a second pass over the data. If the data point is less than the provisional median, reduce the provisional median by one. If the data point is greater than the provisional median, increase the provisional median by one. If the data point is the same as the provisional median, leave the provisional median unchanged. After the end of the data, return the provisional median. What will happen is that the provisional median will initially change from time to time, but eventually it will stabilize over a very small range, which will be very close to the actual median.
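A direct Python sketch of this two-pass estimate follows; the function name approximate_median is hypothetical, and the data is taken as an in-memory list only to keep the example short. One assumption is worth stating loudly: the second pass only settles near the median if the values arrive in roughly random order. On sorted input the estimate just chases the most recent values.

```python
def approximate_median(data):
    """Two-pass approximate median for integer data in roughly random order."""
    # Pass 1: the arithmetic mean is the provisional median.
    estimate = round(sum(data) / len(data))
    # Pass 2: nudge the estimate by one towards each data point.
    for x in data:
        if x < estimate:
            estimate -= 1
        elif x > estimate:
            estimate += 1
    return estimate
```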
