How do I calculate the Confusion Matrix? - matrix

This is the WEKA output that I was able to generate. Unfortunately, I do not know how to calculate the confusion matrix. Could someone help me calculate it?
=== Classifier model (full training set) ===
J48 pruned tree
-----------------
plas <= 127: negative (485.0/94.0)
plas > 127
| mass <= 29.9
| | plas <= 145: negative (41.0/6.0)
| | plas > 145
| | | age <= 25: negative (4.0)
| | | age > 25
| | | | age <= 61: positive (27.0/9.0)
| | | | age > 61: negative (4.0)
| mass > 29.9
| | plas <= 157
| | | age <= 30: negative (50.0/23.0)
| | | age > 30: positive (65.0/18.0)
| | plas > 157: positive (92.0/12.0)
Number of Leaves : 8
Size of the tree : 15
a. Use the WEKA output to construct a confusion matrix. (Hint: look at each leaf node to determine how many instances fall into each of the four quadrants; and aggregate results of all leaf nodes to obtain the final counts)
TP=?
FP=?
FN=?
TN=?
b. In medical diagnosis, three metrics are commonly used: sensitivity, specificity and diagnosis accuracy. Sensitivity is defined as TP/(TP+FN) ; Specificity is defined as TN/(FP+TN); Diagnosis Accuracy is defined as the average of Sensitivity and Specificity. Calculate the Diagnosis Accuracy based on the confusion matrix above.
If someone could help me with this, I would greatly appreciate it. Thank you!

In the "Classify" panel, click "More options...", tick "Output confusion matrix", and click OK.
I have added a screenshot of the respective GUI screens and dialog boxes. In the screenshot, the "More options..." button (1) is greyed out because I have already clicked it.

To fill in the required table, you have to understand the tree and the figures at each of its leaves.
The root node of the tree is 'plas', and it has two children. All input cases where 'plas' is less than or equal to 127 fall into the first child, whereas all cases where 'plas' is greater than 127 fall into the second. The label 'negative' at the first child's leaf means that every case reaching that leaf is predicted negative. The figure 485 in parentheses is the number of input cases with 'plas' less than or equal to 127, and 94 is the number of those 485 cases that are misclassified as negative (they are actually positive, so they count as false negatives). The same reading applies to the rest of the tree: in a 'positive' leaf, the first number counts the cases predicted positive and the second counts the false positives among them. So,
TP=145
FP=39
TN=461
FN=123
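The aggregation over the leaves can be checked with a short script. This is a minimal sketch: the `leaves` list is transcribed by hand from the J48 output above, and the variable names are only illustrative. It also evaluates the three metrics defined in part (b).

```python
# Each leaf: (predicted label, instances reaching the leaf, misclassified instances),
# transcribed from the J48 tree in the question.
leaves = [
    ("negative", 485, 94),
    ("negative", 41, 6),
    ("negative", 4, 0),
    ("positive", 27, 9),
    ("negative", 4, 0),
    ("negative", 50, 23),
    ("positive", 65, 18),
    ("positive", 92, 12),
]

tp = fp = tn = fn = 0
for label, total, wrong in leaves:
    correct = total - wrong
    if label == "positive":
        tp += correct   # predicted positive, actually positive
        fp += wrong     # predicted positive, actually negative
    else:
        tn += correct   # predicted negative, actually negative
        fn += wrong     # predicted negative, actually positive

sensitivity = tp / (tp + fn)
specificity = tn / (fp + tn)
diagnosis_accuracy = (sensitivity + specificity) / 2

print(tp, fp, tn, fn)
print(round(sensitivity, 4), round(specificity, 4), round(diagnosis_accuracy, 4))
```

With these counts the sensitivity is 145/268, the specificity is 461/500, and the diagnosis accuracy is their average, roughly 0.73.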
Hope this helps. Comment if anything seems doubtful.

Related

Data structure to achieve random delete and insert where elements are weighted in [a,b]

I would like to design a data structure and algorithm such that, given an array of elements, where each element has a weight in the interval [a,b], I can achieve constant time insertion and deletion. The deletion is performed randomly, where the probability of an element being deleted is proportional to its weight.
I do not believe there is a deterministic algorithm that can achieve both operations in constant time, but I think there are randomized algorithms that can accomplish this?
I don't know if O(1) worst-case time is impossible; I don't see any particular reason it should be. But it's definitely possible to have a simple data structure which achieves O(1) expected time.
The idea is to store a dynamic array of pairs (or two parallel arrays), where each item is paired with its weight. Insertion is done by appending in O(1) amortised time, and an element can be removed by index by swapping it with the last element, so that it can be removed from the end of the array in O(1) time. To sample a random element from the weighted distribution, choose a random index and generate a random number in the half-open interval [0, 2), which assumes the weights are at most 2 (in general, use the maximum weight as the upper bound); if the number is less than the element's weight, select the element at that index, otherwise repeat this process until an element is selected. Each index is equally likely to be chosen, and the probability that it is kept rather than rejected is proportional to its weight.
This is a Las Vegas algorithm, meaning it is expected to complete in a finite time, but with very low probability it can take arbitrarily long to complete. The number of iterations required to sample an element will be highest when every weight is exactly 1, in which case it follows a geometric distribution with parameter p = 1/2, so its expected value is 2, a constant which is independent of the number of elements in the data structure.
In general, if all weights are in an interval [a, b] for real numbers 0 < a <= b, then the expected number of iterations is at most b/a. This is always a constant, but it is potentially a large constant (i.e. it takes many iterations to select a single sample) if the lower bound a is small relative to b.
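The structure described above could be sketched in Python as follows. The class name `WeightedBag`, its method names, and the `max_weight` parameter are assumptions made for this illustration; they do not come from the answer itself.

```python
import random


class WeightedBag:
    """Dynamic array of (item, weight) pairs with rejection sampling (illustrative sketch)."""

    def __init__(self, max_weight, rng=None):
        self.max_weight = max_weight       # upper bound b on every weight
        self.items = []                    # list of (item, weight) pairs
        self.rng = rng or random.Random()

    def __len__(self):
        return len(self.items)

    def insert(self, item, weight):
        if not 0 < weight <= self.max_weight:
            raise ValueError("weight must be in (0, max_weight]")
        self.items.append((item, weight))  # O(1) amortised append

    def remove_random(self):
        """Remove and return an item with probability proportional to its weight."""
        if not self.items:
            raise IndexError("remove_random from an empty bag")
        while True:
            i = self.rng.randrange(len(self.items))    # uniform random index
            item, weight = self.items[i]
            # random() is in [0, 1), so the threshold is in [0, max_weight)
            if self.rng.random() * self.max_weight < weight:
                self.items[i] = self.items[-1]         # overwrite with the last pair
                self.items.pop()                       # O(1) removal from the end
                return item
            # otherwise the candidate is rejected and the loop tries again


bag = WeightedBag(max_weight=2.0)
for name, w in [("v1", 1.0), ("v2", 1.5), ("v3", 1.5), ("v4", 2.0), ("v5", 1.0)]:
    bag.insert(name, w)
print(bag.remove_random())
```

Each loop iteration accepts with probability at least a/b when all weights lie in [a, b], which is where the expected bound of b/a iterations comes from.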
This is not an answer per se, but just a tiny example to illustrate the algorithm devised by @kaya3.
| value | weight |
| v1 | 1.0 |
| v2 | 1.5 |
| v3 | 1.5 |
| v4 | 2.0 |
| v5 | 1.0 |
| total | 7.0 |
The total weight is 7.0. It is easy to maintain in O(1) by storing it and increasing or decreasing it at each insertion or removal.
The probability of each element is simply its weight divided by the total weight.
| value | proba | approx. |
| v1 | 1.0/7 | 0.1428... |
| v2 | 1.5/7 | 0.2142... |
| v3 | 1.5/7 | 0.2142... |
| v4 | 2.0/7 | 0.2857... |
| v5 | 1.0/7 | 0.1428... |
Using the algorithm of @kaya3, if we draw a random index, then the probability of each index being drawn is 1/size (1/5 here).
The chance of being rejected is 50% for v1, 25% for v2 and 0% for v4. So in the first round, the probabilities of being selected are:
| value | proba | approx. |
| v1 | 2/20 | 0.10 |
| v2 | 3/20 | 0.15 |
| v3 | 3/20 | 0.15 |
| v4 | 4/20 | 0.20 |
| v5 | 2/20 | 0.10 |
| total | 14/20 | 0.70 (70%) |
Then the probability of having a 2nd round is 30%, and the probability of each index being drawn in that round is 6/20/5 = 3/50.
| value | proba after 2 rounds | approx. |
| v1 | 2/20 + 6/200 | 0.130 |
| v2 | 3/20 + 9/200 | 0.195 |
| v3 | 3/20 + 9/200 | 0.195 |
| v4 | 4/20 + 12/200 | 0.260 |
| v5 | 2/20 + 6/200 | 0.130 |
| total | 14/20 + 42/200 | 0.91 (91%) |
The probability of having a 3rd round is 9%, that is 9/500 for each index.
| value | proba after 3 rounds | approx. |
| v1 | 2/20 + 6/200 + 18/2000 | 0.1390 |
| v2 | 3/20 + 9/200 + 27/2000 | 0.2085 |
| v3 | 3/20 + 9/200 + 27/2000 | 0.2085 |
| v4 | 4/20 + 12/200 + 36/2000 | 0.2780 |
| v5 | 2/20 + 6/200 + 18/2000 | 0.1390 |
| total | 14/20 + 42/200 + 126/2000 | 0.973 (97.3%) |
So we see that the series converges to the correct probabilities. The numerators are multiples of the weights, so it is clear that the relative weight of each element is respected.
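The round-by-round tables can be reproduced exactly with rational arithmetic. The sketch below uses the five example weights and a maximum weight of 2; the helper name `selected_within` is made up for this illustration.

```python
from fractions import Fraction

weights = {
    "v1": Fraction(1),
    "v2": Fraction(3, 2),
    "v3": Fraction(3, 2),
    "v4": Fraction(2),
    "v5": Fraction(1),
}
max_weight = Fraction(2)

# Probability of being drawn AND accepted in a single round: (1/size) * (weight/max_weight).
per_round = {name: (w / max_weight) / len(weights) for name, w in weights.items()}
# Probability that a round rejects its candidate, so another round is needed.
retry = 1 - sum(per_round.values())


def selected_within(rounds):
    """Probability that each value has been selected after at most `rounds` rounds."""
    factor = sum(retry ** r for r in range(rounds))   # 1 + retry + retry^2 + ...
    return {name: p * factor for name, p in per_round.items()}


# Limit of the geometric series: per_round / (1 - retry).
limit = {name: p / (1 - retry) for name, p in per_round.items()}

print(retry)
print(selected_within(3))
print(limit)
```

The limit equals weight divided by total weight for every value, for example 1/7 for v1 and 2/7 for v4.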
This is a sketch of an answer.
With weights only 1, we can maintain a random permutation of the inputs.
Each time an element is inserted, put it at the end of the array, then pick a random position i in the array, and swap the last element with the element at position i.
(It may well be a no-op if the random position turns out to be the last one.)
When deleting, just delete the last element.
Assuming we can use a dynamic array with O(1) (worst case or amortized) insertion and deletion, this does both insertion and deletion in O(1).
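For the unit-weight case, the random-permutation idea might look like the sketch below. The class and method names are hypothetical, and a Python list stands in for the dynamic array.

```python
import random


class RandomPermutationBag:
    """Keeps the stored items in a random order so the last one is a uniform pick (sketch)."""

    def __init__(self, rng=None):
        self.items = []
        self.rng = rng or random.Random()

    def __len__(self):
        return len(self.items)

    def insert(self, item):
        self.items.append(item)                     # put the new element at the end
        i = self.rng.randrange(len(self.items))     # random position, possibly the last one
        # swap the new element with the element at position i (a no-op when i is last)
        self.items[i], self.items[-1] = self.items[-1], self.items[i]

    def remove_random(self):
        return self.items.pop()                     # the last element of a random permutation


bag = RandomPermutationBag()
for x in ["a", "b", "c"]:
    bag.insert(x)
print(bag.remove_random())
```

The insertion step is one step of an inside-out Fisher-Yates shuffle, which is why the array stays a uniformly random permutation after every insert.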
With weights 1 and 2, the similar structure may be used.
Perhaps each element of weight 2 should be put twice instead of once.
Perhaps when an element of weight 2 is deleted, its other copy should also be deleted.
So we should in fact store indices instead of the elements, and another array, locations, which stores and tracks the two indices for each element. The swaps should keep this locations array up-to-date.
Deleting an arbitrary element can be done in O(1) similarly to inserting: swap with the last one, delete the last one.

How to decide the probability percentage in question

I have the below question:
In the first part of the question, it says the probability that the selected person will be a male is 0.44, which means the number of males is 25*0.44 = 11. That's OK.
In the second part, the probability that the selected person will be a male who was born before 1960 is 0.28. Does that mean 0.28 out of the total number, which is 25, or out of the number of males?
I mean, should the number of males who were born before 1960 equal 25*0.28 or 11*0.28?
I find it easiest to think of these sorts of problems as contingency tables.
You use a matrix layout to express the distributions in terms of two or more factors or characteristics, each having two or more categories. The table can be constructed either with probabilities (proportions) or with counts, and switching back and forth is easy based on the total count in the table. Entries in the table are the intersections of the categories, corresponding to "and" in a verbal description. The numbers to the right or at the bottom of the table are called marginals, because they're found in the margins of the table, and are always the sum of the table row or column entries in which they occur. The total probability (or count) in the table is found by summing across all the rows and columns. The marginal distribution of gender would be found by summing across rows, and the marginal distribution of birthdays would be found by summing across the columns.
Based on this, you can inferentially determine other values as indicated by the entries in parentheses below. With one more entry, either for gender or in the marginal row for birthdays, you'd be able to fill in the whole table inferentially. (This is related to the concept of degrees of freedom - how many pieces of info can you fill in independently before the others are determined by the known constraint that the totals are fixed or that probability adds to 1.)
Probabilities (rows: gender; columns: birthday)
| Gender | < 1960 | >= 1960 | Total |
| F | | | (0.56) |
| M | 0.28 | (0.16) | 0.44 |
| Total | ? | ? | 1.00 |
Counts (rows: gender; columns: birthday)
| Gender | < 1960 | >= 1960 | Total |
| F | | | (14) |
| M | 7 | (4) | 11 |
| Total | ? | ? | 25 |
Conditional probability corresponds to limiting yourself to the subset of rows or columns specified in the condition. If you had been asked what is the probability of a birthday < 1960 given the gender is male, i.e., P{birthday < 1960 | M} in relatively standard notation, you'd be restricting your focus to just the M row, so the answer would be 7/11 = 0.28/0.44. Computationally, you take the probabilities or counts in the qualifying table entries and express them as a proportion of the probabilities or counts of the specified (given) marginal entries. This is often written in prob & stats texts as P(A|B) = P(AB)/P(B), where AB is a set shorthand for A and B (intersection).
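The arithmetic behind the table can be written out as a short check. This sketch uses only the three numbers given in the question (25 people, 0.44, and 0.28); the variable names are illustrative.

```python
from fractions import Fraction

total_people = 25
p_male = Fraction(44, 100)                    # P(M), a marginal probability
p_male_and_before_1960 = Fraction(28, 100)    # P(M and birthday < 1960), a table entry

# Both probabilities are proportions of the whole group of 25.
males = p_male * total_people
males_before_1960 = p_male_and_before_1960 * total_people
males_1960_or_later = males - males_before_1960

# Conditional probability restricts attention to the M row: P(A|B) = P(AB) / P(B).
p_before_1960_given_male = p_male_and_before_1960 / p_male

print(males, males_before_1960, males_1960_or_later)
print(p_before_1960_given_male)
```

The joint probability 0.28 therefore corresponds to 7 of the 25 people, while the conditional probability given male is 7/11.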
0.44 = 11/25 of the people are male.
0.28 = 7/25 of the people are male and born before 1960, so the 0.28 is out of the total of 25, not out of the 11 males.

Determine max slope of slowly descending signal

I have an analog power signal from a motor. The signal ramps up quickly, but powers off slowly over the course of several seconds. The signal looks almost like a series of plateaus on the descent. The problem is that the signal doesn't settle back to zero. It settles back to an intermediate level unknown, and varying from motor to motor. See chart below.
I'm trying to find a way to determine when the motor is off and at that intermediate level.
My thought is to find and store the max point, and calculate the slopes thereafter until the slope becomes steeper than some large negative value like -160 (~ -60 degrees), and then declare that the motor must be powering off. The sample points below are with all duplicates removed. (There are typically about 5000 samples.)
My problem is determining the X values. In the formula (y2 - y1) / (x2 - x1), the x values could be far enough apart in time that the slope never appears steeper than -30 degrees. Picking an absolute number like 10 would fix this, but is there a more mathematically correct method?
The data shows me calculating the slope with the method described above and the max of 921, i.e. (y2 - y1) / ((10 + 1) - 10). In this scheme, at datapoint 9, I would say the motor is "Off". I'm looking for a more precise means to determine an X value rather than arbitrarily picking 10, for instance.
+---+-----+----------+
| X | Y | Slope |
+---+-----+----------+
| 1 | 65 | 856.000 |
| 2 | 58 | 863.000 |
| 3 | 57 | 864.000 |
| 4 | 638 | 283.000 |
| 5 | 921 | 0.000 |
| 6 | 839 | -82.000 |
| 7 | 838 | -83.000 |
| 8 | 811 | -110.000 |
| 9 | 724 | -197.000 |
+---+-----+----------+
EDIT: A much simpler answer:
Since your motor is either ON or OFF, and ON wattages are strictly higher than OFF wattages, you should be able to discriminate between ON and OFF wattages by maintaining an average wattage, reporting ON if the current measurement is higher than the average and OFF if it is lower.
Count = 0
Average = 500
Whenever a measurement comes in:
    Count = Count + 1
    Average = Average + (Measurement - Average) / Count
    Return Measurement > Average ? ON : OFF
This represents an average of all the values the wattage has ever been. If we want to eventually "forget" the earliest values (before the motor was ever turned on), we could either keep a buffer of recent values and use that for a moving average, or approximate a moving average with an IIR like
Average = (1-X) * Average + X * Measurement
for some X between 0 and 1 (closer to 0 to change more slowly).
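A minimal Python sketch of the IIR variant is shown below. The class name `OnOffDetector`, the starting average of 500, and the smoothing factor of 0.05 are assumptions chosen for illustration and would need tuning on real motor data.

```python
class OnOffDetector:
    """Exponentially weighted moving average used as an ON/OFF threshold (sketch)."""

    def __init__(self, initial_average=500.0, smoothing=0.05):
        if not 0.0 < smoothing < 1.0:
            raise ValueError("smoothing must be between 0 and 1")
        self.average = initial_average
        self.smoothing = smoothing       # the X in the formula above

    def update(self, measurement):
        """Fold in one measurement and return True for ON, False for OFF."""
        self.average = (1 - self.smoothing) * self.average + self.smoothing * measurement
        return measurement > self.average


detector = OnOffDetector()
for y in [65, 58, 57, 638, 921]:         # first samples from the question's table
    print(y, "ON" if detector.update(y) else "OFF")
```

A smaller smoothing factor makes the threshold react more slowly, so short spikes move it less but it also takes longer to forget old readings.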
Original answer:
You could treat this as an online clustering problem, where you expect three clusters (before the motor turns on, when the motor is on, and when the motor is turned off), or perhaps four (before the motor turns on, peak power, when the motor is running normally, and when the motor turns off). In effect, you're trying to learn what it looks like when a motor is on (or off).
If you don't have any other information about whether the motor is on or off (which could be used to train a model), here's a simple approach:
Define an "Estimate" to contain:
    float Value
    int Count
Define an "Estimator" to contain:
    float TotalError = 0.0
    Estimate COLD_OFF = {Value = 0, Count = 1}
    Estimate ON = {Value = 1000, Count = 1}
    Estimate WARM_OFF = {Value = 500, Count = 1}
    a function Update_Estimate(float Measurement):
        Find the Estimate E such that E.Value is closest to Measurement
        Update TotalError = TotalError + (E.Value - Measurement)*(E.Value - Measurement)
        Update E.Value = (E.Value * E.Count + Measurement) / (E.Count + 1)
        Update E.Count = E.Count + 1
        return E
This takes initial guesses for what the wattages of these stages should be and updates them with the measurements. However, this has some problems. What if our initial guesses are off?
You could initialize some number of Estimators with different possible (e.g. random) guesses for COLD_OFF, ON, and WARM_OFF; after receiving a measurement, let each Estimator update itself and aggregate their values somehow. This aggregation should reward the better estimates. Since you're storing TotalError for each Estimator, you could just pick the output of the Estimator that has the lowest TotalError so far, or you could let the Estimators vote (giving each Estimator's vote a weight proportional to 1/(TotalError + 1) or something like that).
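The single-Estimator update and the "lowest TotalError wins" aggregation could be sketched as follows. The initial guesses are the ones from the pseudocode; the dictionary layout and the helper `best_estimator` are assumptions for this example.

```python
class Estimator:
    """Three running-mean clusters (COLD_OFF, ON, WARM_OFF) updated online (sketch)."""

    def __init__(self, cold_off=0.0, on=1000.0, warm_off=500.0):
        self.total_error = 0.0
        # state name -> [running mean value, number of samples folded in]
        self.estimates = {
            "COLD_OFF": [cold_off, 1],
            "ON": [on, 1],
            "WARM_OFF": [warm_off, 1],
        }

    def update(self, measurement):
        """Assign the measurement to the closest estimate, update it, and return its name."""
        name = min(self.estimates, key=lambda n: abs(self.estimates[n][0] - measurement))
        value, count = self.estimates[name]
        self.total_error += (value - measurement) ** 2
        self.estimates[name] = [(value * count + measurement) / (count + 1), count + 1]
        return name


def best_estimator(estimators):
    """Pick the Estimator with the lowest accumulated squared error so far."""
    return min(estimators, key=lambda e: e.total_error)


estimators = [Estimator(), Estimator(cold_off=50.0, on=900.0, warm_off=700.0)]
for y in [65, 58, 57, 638, 921, 839, 838, 811, 724]:   # samples from the question
    labels = [e.update(y) for e in estimators]
print(best_estimator(estimators).estimates)
```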

Intersection ranges (algorithm)

As an example, I have the following arrays:
[100,192]
[235,280]
[129,267]
As the intersecting ranges we get:
[129,192]
[235,267]
It is a simple exercise for a person, but I have trouble creating an algorithm that finds the second multidimensional array.
Any language, any ideas.
If somebody does not understand me:
I'll assume you wish to output any range that has 2 or more overlapping intervals.
So the output for [1,5], [2,4], [3,3] will be (only) [2,4].
The basic idea here is to use a sweep-line algorithm.
Split the ranges into start- and end-points.
Sort the points.
Now iterate through the points with a counter variable initialized to 0.
If you get a start-point:
    Increase the counter.
    If the counter's value is now 2, record that point as the start-point for a range in the output.
If you get an end-point:
    Decrease the counter.
    If the counter's value is now 1, record that point as the end-point for a range in the output.
Note:
If a start-point and an end-point have the same value, you'll need to process the end-point first if the counter is 1 and the start-point first if the counter is 2 or greater, otherwise you'll end up with a 0-size range or a 0-size gap between two ranges in the output.
This should be fairly simple to do by having a set of the following structure:
Element
    int startCount
    int endCount
    int value
Then you combine all points with the same value into one such element, setting the counts appropriately.
Running time:
O(n log n)
Example:
Input:
[100, 192]
[235, 280]
[129, 267]
(S for start, E for end)
Points | | 100 | 129 | 192 | 235 | 267 | 280 |
Type | | Start | Start | End | Start | End | End |
Count | 0 | 1 | 2 | 1 | 2 | 1 | 0 |
Output | | | [129, | 192] | [235, | 267] | |
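The sweep-line steps above can be sketched in Python. The function name `overlapping_ranges` is made up for this example. Instead of ordering individual points, this sketch groups the start and end counts per value, as the note suggests, and compares the counter before and after each value, which should give the same tie-breaking behaviour.

```python
from collections import Counter


def overlapping_ranges(ranges):
    """Return the ranges covered by two or more of the closed input intervals (sketch)."""
    starts = Counter(lo for lo, _ in ranges)   # value -> number of intervals starting there
    ends = Counter(hi for _, hi in ranges)     # value -> number of intervals ending there

    result = []
    count = 0              # number of intervals currently open
    range_start = None
    for value in sorted(set(starts) | set(ends)):       # O(n log n) sort dominates
        was_overlapping = count >= 2
        count += starts[value] - ends[value]            # Counter returns 0 for missing keys
        is_overlapping = count >= 2
        if is_overlapping and not was_overlapping:
            range_start = value                         # counter rose to 2 or more
        elif was_overlapping and not is_overlapping:
            result.append([range_start, value])         # counter fell below 2
    return result


print(overlapping_ranges([[100, 192], [235, 280], [129, 267]]))
print(overlapping_ranges([[1, 5], [2, 4], [3, 3]]))
```

Because only the net change at each value is applied, intervals that merely touch at one point (such as [1, 3] and [3, 5]) produce no zero-size output range, and an overlap that continues through a shared endpoint is not split in two.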
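This is a Python implementation of the intersection algorithm. Its computational complexity is O(n^2).
a = [[100, 192], [235, 280], [129, 267]]

def get_intersections(diapasons):
    """Return the overlap of every pair of ranges by comparing each range with every other."""
    intersections = []
    for d in diapasons:
        for check in diapasons:
            if d == check:
                continue  # skip comparing a range with itself (this also skips equal duplicates)
            # d overlaps check when d starts inside check
            if check[0] <= d[0] <= check[1]:
                # the overlap ends where the first of the two ranges ends
                right = min(d[1], check[1])
                intersections.append([d[0], right])
    return intersections

print(get_intersections(a))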

An interview question from Google [duplicate]

The following was asked in a Google interview:
You are given a 2D array storing integers, sorted vertically and horizontally.
Write a method that takes as input an integer and outputs a bool saying whether or not the integer is in the array.
What is the best way to do this? And what is its time complexity?
Start at the Bottom-Left corner of the Matrix and follow the rules stated below to traverse the matrix:
The matrix traversal is based on these conditions:
If the input number is greater than current number: Move Right
If the input number is less than current number: Move Up.
If the input number is equal to current number: Return Success
If the input number is not equal to current number and no transition is possible: Return Fail
Time Complexity: (Thanks to Martinho Fernandes)
The time complexity is O(N+M). In the worst case, the element searched for is in the upper-right corner, meaning you'll go up about N times and right about M times.
Example
Input matrix:
--------------
| 1 | 4 | 6 |
--------------
| 2 | 5 | 9 |
--------------
| *3* | 8 | 10 |
--------------
Number to search: 4
Step 1:
Start at the cell where you have 3 (Bottom-Left).
3 < 4: Move Right
| 1 | 4 | 6 |
--------------
| 2 | 5 | 9 |
--------------
| 3 | *8* | 10 |
--------------
Step 2:
8 > 4: Move Up
| 1 | 4 | 6 |
--------------
| 2 | *5* | 9 |
--------------
| 3 | 8 | 10 |
--------------
Step 3:
5 > 4: Move Up
| 1 | *4* | 6 |
--------------
| 2 | 5 | 9 |
--------------
| 3 | 8 | 10 |
--------------
Step 4:
4=4: Return the index of the number
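The traversal rules above translate into a short Python function. This is a sketch that assumes the matrix is a non-ragged list of rows, each sorted left to right, with columns sorted top to bottom; the name `staircase_search` is made up for this example, and it returns a bool as the question asks.

```python
def staircase_search(matrix, target):
    """Search a row- and column-sorted matrix, starting at the bottom-left corner."""
    if not matrix or not matrix[0]:
        return False
    row, col = len(matrix) - 1, 0           # bottom-left corner
    while row >= 0 and col < len(matrix[0]):
        current = matrix[row][col]
        if current == target:
            return True                      # found
        if target > current:
            col += 1                         # everything above in this column is smaller: move right
        else:
            row -= 1                         # everything to the right in this row is larger: move up
    return False                             # walked off the matrix: no transition possible


example = [
    [1, 4, 6],
    [2, 5, 9],
    [3, 8, 10],
]
print(staircase_search(example, 4))
print(staircase_search(example, 7))
```

Each step discards one row or one column, so the loop runs at most N + M times.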
I would start by asking details about what it means to be "sorted vertically and horizontally"
If the matrix is sorted in a way that the last element of each row is less than the first element of the next row, you can run a binary search on the first column to find out which row the number is in, and then run another binary search on that row. This algorithm will take O(log C + log R) time, where C and R are, respectively, the number of columns and rows. Using a property of the logarithm, one can write that as O(log(C*R)), which is the same as O(log N), if N is the number of elements in the array. This is almost the same as treating the array as 1D and running a binary search on it.
But the matrix could be sorted in a way that the last element of each row is not less than the first element of the next row:
1 2 3 4 5 6 7 8 9
2 3 4 5 6 7 8 9 10
3 4 5 6 7 8 9 10 11
In this case, you could run some sort of horizontal and vertical binary search simultaneously:
Test the middle number of the first column. If it's less than the target, consider the lines above it. If it's greater, consider those below.
Test the middle number of the first considered line. If it's less, consider the columns left of it. If it's greater, consider those to the right.
Lather, rinse, repeat until you find one, or you're left with no more elements to consider.
This method is also logarithmic on the number of elements.
The first method that comes to mind is a vertical binary search, followed by a horizontal one when you find the row it should be in. Complexity will be O(log NM) where N and M are the dimensions of the array.
Further explanation:
Consider just the first number of every row. When you perform a binary search of these first numbers for the specified number, the result will be either the specified number if you're lucky, otherwise it will be the position before or after where the specified number would go depending on the binary search implementation. Once you find the two of the first numbers that the specified number should go between, you know that the number is in that row, and a second binary search will find the number if it is in the row.
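The two binary searches could be sketched as below. This only applies under the stronger assumption that the last element of each row is less than the first element of the next row; the function name is hypothetical.

```python
from bisect import bisect_left


def contains_fully_sorted(matrix, target):
    """Binary search the first column for the row, then binary search that row.

    Assumes every row is sorted and each row's last element is less than
    the next row's first element.
    """
    if not matrix or not matrix[0]:
        return False
    # Find the last row whose first element is <= target.
    lo, hi = 0, len(matrix)
    while lo < hi:
        mid = (lo + hi) // 2
        if matrix[mid][0] <= target:
            lo = mid + 1
        else:
            hi = mid
    row_index = lo - 1
    if row_index < 0:
        return False                          # target is smaller than every element
    row = matrix[row_index]
    col_index = bisect_left(row, target)      # second binary search, inside the row
    return col_index < len(row) and row[col_index] == target


grid = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
print(contains_fully_sorted(grid, 5))
print(contains_fully_sorted(grid, 10))
```

The first loop takes O(log R) steps and `bisect_left` takes O(log C), matching the O(log NM) bound quoted above.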
