Algorithm to find the average of each group of numbers

I have a quite small list of numbers (a few hundred max) like for example this one:
117 99 91 93 95 95 91 97 89 99 89 99
91 95 89 99 89 99 89 95 95 95 89 948
189 99 89 189 189 95 186 95 93 189 95
189 89 193 189 93 91 193 89 193 185 95
89 194 185 99 89 189 95 189 189 95 89
189 189 95 189 95 89 193 101 180 189
95 89 195 185 95 89 193 89 193 185 99
185 95 189 95 89 193 91 190 94 190 185
99 89 189 95 189 189 95 185 95 185 99
89 189 95 189 186 99 89 189 191 95 185
99 89 189 189 96 89 193 189 95 185 95
89 193 95 189 185 95 93 189 189 95 186
97 185 95 189 95 185 99 185 95 185 99
185 95 190 95 185 95 95 189 185 95 189
2451
If you create a graph with X = the number and Y = the number of times we see that number, the values cluster into a few distinct spikes.
What I want to know is the average of each group of numbers. In the example there are four groups, and the resulting numbers are 92, 187, 948 and 2451.
The number of groups is not known in advance.
Do you have any idea how to create a (simple, if possible) algorithm to extract these resulting numbers (in C, pseudocode or plain English :)?

What you want to do is called clustering. If the data you've shown is typical, a greedy approach, such as neighbor joining, should be sufficient. So the procedure is:
1) Apply neighbor joining
2) Apply an (empirically identified) threshold to define the clusters
3) Calculate average of each cluster
Using a package that already has clustering algorithms, such as R, would probably be the easiest course, though neighbor joining is not a particularly hard algorithm.
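For one-dimensional data like this, a much simpler greedy stand-in for neighbor joining is to sort the values and start a new cluster whenever the gap to the previous value exceeds a threshold. A minimal C++ sketch, assuming a gap threshold of 40 chosen by eye for this data set:
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    // Sorted 1-D clustering: a new cluster starts wherever the gap
    // between consecutive values exceeds the threshold.
    std::vector<int> v = {117, 99, 91, 93, 95, 89, 948, 189, 185, 2451};
    std::sort(v.begin(), v.end());

    const int threshold = 40;  // empirical; tune to your data
    long long sum = v[0];
    int count = 1;
    for (std::size_t i = 1; i <= v.size(); ++i) {
        if (i == v.size() || v[i] - v[i - 1] > threshold) {
            std::cout << "cluster average: " << sum / count << '\n';
            if (i == v.size()) break;
            sum = 0;
            count = 0;
        }
        sum += v[i];
        ++count;
    }
}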

I think std::map<int,int> can easily solve this problem. The key of the map would be the number, and the value would be the number of times the number occurs (its frequency).
So the average can be calculated as,
int average = (m[key] * key) / count;
Where count is the total count of numbers. This computes each group's contribution to the overall average, since you didn't clearly say what you mean by "average". I'm also assuming that each distinct number forms its own group!
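A minimal sketch of that frequency map, assuming the input arrives in a std::vector:
#include <iostream>
#include <map>
#include <vector>

int main() {
    std::vector<int> numbers = {89, 95, 95, 89, 99, 189, 948};
    std::map<int, int> m;  // number -> frequency
    for (int n : numbers) ++m[n];

    const int count = static_cast<int>(numbers.size());
    for (const auto& [key, freq] : m)
        std::cout << key << " occurs " << freq << " time(s); "
                  << "(m[key] * key) / count = " << (freq * key) / count << '\n';
}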

Here's a way:
Decide what width your bins will be. Let's say 10 (e.g. numbers > -5 and <= 5 go into bin 0, numbers > 5 and <= 15 go into bin 1, ...).
Create a container that holds a list of the numbers in each bin. I'd go with something like std::map<int, std::vector<int>> in C++.
Now iterate over the numbers and decide which bin each one belongs to. Indexing the map creates an empty vector for a new bin automatically, so just append the number to that bin's vector.
After iterating over all the numbers, simply calculate the average of each vector.
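A rough sketch of that in C++, with the bin width of 10 from above (the rounding at the bin edges is approximate):
#include <iostream>
#include <map>
#include <vector>

int main() {
    std::vector<int> numbers = {117, 99, 91, 93, 95, 89, 948, 189, 185, 2451};

    const int width = 10;                  // bin width; tune to the data
    std::map<int, std::vector<int>> bins;  // bin index -> numbers in that bin
    for (int n : numbers)
        bins[(n + width / 2) / width].push_back(n);  // bin k covers roughly 10k +/- 5

    for (const auto& [bin, values] : bins) {
        long long sum = 0;
        for (int v : values) sum += v;
        std::cout << "bin " << bin << " average: "
                  << static_cast<double>(sum) / values.size() << '\n';
    }
}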

So you are looking for "spikes" in the graph. I'm guessing you are interested in the size and position of each group?
You might use something like this (a rough sketch in code follows the steps):
1) Sort the numbers in descending order
2) Loop: take the highest number you have
3) Keep taking numbers until you find one that is too small to belong to the group (say, more than 5% smaller than the previous one)
4) Calculate the average of the selected numbers
5) Make the discarded number the start of the next group, and repeat
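A minimal sketch of those steps, using the 5% cutoff as the (empirical) group boundary:
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> v = {99, 95, 95, 91, 89, 948, 189, 185, 2451};
    std::sort(v.rbegin(), v.rend());  // descending

    // A new group starts when the next value is more than 5% below the
    // previous one; 5% is the answer's guess, not a universal constant.
    double sum = v[0];
    int count = 1;
    for (std::size_t i = 1; i <= v.size(); ++i) {
        if (i == v.size() || v[i] < 0.95 * v[i - 1]) {
            std::cout << "group average: " << sum / count << '\n';
            if (i == v.size()) break;
            sum = 0;
            count = 0;
        }
        sum += v[i];
        ++count;
    }
}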

In PHP you could do it like this:
$array = array(/* an array of numbers */);
$average = array_sum($array) / count($array);
With multiple groups of numbers you can do something like:
$array = array(
    array(/* numbers in group 1 */),
    array(/* numbers in group 2 */),
    // etc.
);
foreach($array as $numbers)
{
$average[] = array_sum($numbers) / count($numbers);
}
Unless you're looking for the median or mode.
Ah, I see what you're asking now: you're not asking how to find the average, you're asking how to group the numbers and find the average of each group.
Let's see: you'd have to find the mode; $counts = array_count_values($array); array_keys($counts, max($counts)); will do that. The keys in $counts will be the values of the original array, and the values in $counts will be the number of times each number shows up. Then you need to figure out where the bigger gaps in the keys of $counts are. You could also array_unique() the original array and find the gaps in the values.
Wish my statistics teacher had done a bit more than play poker with us, or I could probably figure out the exact statistical method to determine how big the range checked to determine the groups should be.

Related

Looking for a clever way to sort a set of data

I have a set of 80 students and I need to sort them into 20 groups of 4.
I have their previous exam scores from a prerequisite module and I want to ensure that the average of the sorted group members scores is as close as possible to the overall average of the previous exam scores.
Sorry if that isn't particularly clear.
Here's a snapshot of the problem:
Student Score
AA 50
AB 45
AC 80
AD 70
AE 45
AF 55
AG 65
AH 90
So the average of the scores here is 62.5. How would I best go about sorting these eight students into two groups of four such that, for both groups, the average of their combined exam scores is as close as possible to 62.5?
My problem is exactly this but with 80 data points (20 groups) rather than 8 (2 groups).
The more I think about this problem the harder it seems.
Does anyone have any ideas?
Thanks
One Possible Solution:
I would try going with a greedy algorithm that starts by pairing each student with another student that gets you closest to your target average. After the initial pairing you should then be able to make subsequent pairs out of the first pairs using the same approach.
After the first round of pairing, this approach leverages taking the average of two averages and comparing that to the target mean to create subsequent groups. You can read more about why that will work for this problem here.
However,
This will not necessarily give you the optimal solution, but is rather a heuristic technique to solve the problem. One noted example below is when one low value must be offset by three high values to reach the targeted mean. These types of groupings will not be accounted for by this technique. However, if you know you have a relatively normal distribution centered around your targeted mean then I think this approach should give a decent approximation.
First sort the group by score, so it becomes:
AH 90
AC 80
.....
AB 45
AE 45
Then start combining the first with the last:
(AE, AH, 67.5)
(AB, AC, 62.5)
(AD, AA, 60)
(AG, AF, 60)
And so on. For groups of four, you then combine the pairs two by two in the same way: the first pair with the last pair.
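A hedged C++ sketch of that fold-pairing idea, with the question's scores hard-coded (a heuristic, not an optimal solver):
#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

int main() {
    // Sort descending, pair highest with lowest, then fold the pairs
    // the same way to build groups of four near the overall mean.
    std::vector<int> scores = {50, 45, 80, 70, 45, 55, 65, 90};
    std::sort(scores.rbegin(), scores.rend());

    std::vector<std::pair<int, int>> pairs;
    for (std::size_t i = 0; i < scores.size() / 2; ++i)
        pairs.push_back({scores[i], scores[scores.size() - 1 - i]});

    for (std::size_t i = 0; i < pairs.size() / 2; ++i) {
        auto [a, b] = pairs[i];
        auto [c, d] = pairs[pairs.size() - 1 - i];
        std::cout << "group of 4 average: " << (a + b + c + d) / 4.0 << '\n';
    }
}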
Another way:
1. Find all possible ways to split the students into groups of 4.
2. For each such partition, compute every group's absolute deviation from the overall average score and sum these deviations.
3. Choose the partition with the lowest sum.
Initially, I did think about the top-bottom match option.
However, as John has highlighted, the results certainly aren't optimal:
Scores Students Avg.
40 94 40 94 'AE' 'DA' 'AI' 'AR' 67
40 90 40 88 'AK' 'CI' 'AM' 'BP' 64.5
40 85 40 80 'AQ' 'AW' 'AT' 'BD' 61.25
40 79 40 77 'AU' 'BC' 'AV' 'AB' 59
40 76 40 75 'AX' 'CG' 'AZ' 'CQ' 57.75
40 75 40 75 'BF' 'CB' 'BN' 'BQ' 57.5
40 75 40 74 'BR' 'BI' 'CF' 'CZ' 57.25
40 74 40 74 'CK' 'CO' 'CP' 'AL' 57
40 72 41 71 'DB' 'CN' 'AG' 'BO' 56
41 71 42 70 'CD' 'BM' 'AH' 'BS' 56
42 70 42 69 'BG' 'BL' 'CU' 'CX' 55.75
43 68 44 67 'BK' 'CY' 'AD' 'CE' 55.5
44 64 44 64 'BJ' 'CR' 'BZ' 'BY' 54
45 64 45 63 'BW' 'BV' 'CS' 'BE' 54.25
45 62 47 60 'CV' 'CH' 'AC' 'CM' 53.5
47 59 47 58 'BT' 'AY' 'CL' 'AP' 52.75
47 57 48 57 'CT' 'BA' 'BX' 'AS' 52.25
48 56 49 56 'CA' 'AJ' 'AN' 'AA' 52.25
50 55 50 54 'BB' 'AF' 'CJ' 'AO' 52.25
51 52 51 52 'CC' 'BU' 'CW' 'BH' 51.5

Joining two matrices, one with numbers and the other percentages

I have two matrices, cases and percentages. I want to combine both with the columns alternating between the two i.e. cases [c1] percent [c1] cases [c2] percent [c2]...
tab year region if sex==1, matcell(cases)
tab year region, matcell(total)
mata:st_matrix("percent", 100 * st_matrix("cases"):/st_matrix("total"))
matrix list cases
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 1313 1289 1121 1176 1176 1150 1190 1184 1042 940
r2 340 359 357 366 383 332 406 367 352 272
r3 260 246 266 265 270 259 309 306 266 283
r4 271 267 293 277 317 312 296 285 265 253
r5 218 249 246 213 264 255 247 221 229 220
r6 215 202 157 202 200 204 220 183 176 180
r7 178 193 218 199 194 195 201 187 172 159
r8 127 111 107 130 133 99 142 143 131 114
r9 64 68 85 74 70 60 59 70 76 61
matrix list percent, format(%2.1f)
percent[9,10]
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
r1 70.1 71.2 67.3 67.2 66.9 71.5 72.6 72.5 74.9 73.2
r2 65.3 65.2 69.1 64.4 68.0 70.5 72.0 64.8 66.4 64.9
r3 74.7 73.7 74.7 69.2 68.9 67.6 70.5 72.3 79.4 80.9
r4 66.3 72.6 72.9 74.9 72.7 73.8 72.2 73.3 74.9 71.7
r5 68.8 67.1 66.0 63.6 67.2 67.1 65.2 67.4 68.6 73.8
r6 73.1 72.9 69.2 63.7 67.6 68.0 72.4 68.8 74.9 78.9
r7 64.5 60.3 69.9 70.6 69.3 78.3 72.3 65.8 71.4 71.3
r8 66.1 64.2 63.3 74.7 69.3 56.9 70.6 70.1 63.9 57.9
r9 77.1 73.9 70.2 74.0 71.4 73.2 81.9 72.9 87.4 74.4
How do I combine both the matrices?
Currently I have tried matrix final = cases, percent, but that just puts them side by side; I want each column to alternate between cases and percent.
I will then use putexcel command to put them into an already formatted table with columns of cases and percentages.
Let me start by supporting Nick Cox's comments.
The problem is that there is no simple way to combine matrices as you desire. Nevertheless, it is simple to achieve the results you want by taking a very different path from the one you outlined. It's no fun to write an essay describing the technique in natural language; it's much simpler to demonstrate it with code, as I do below, and as I expect Nick might have been inclined to do.
By not providing a Minimal, Complete, and Verifiable example, as described in the link Nick provided to you, you've discouraged others from showing you where you've gone off the tracks.
// create a minimal amount of sample data hopefully similar to actual data
clear
input year region sex
2001 1 1
2001 1 2
2001 1 2
2002 1 1
2002 1 2
2001 2 1
2002 2 1
2002 2 2
end
list, clean noobs
// use collapse to generate summaries equivalent to two tabs
generate male = sex==1
collapse (count) total=male (sum) cases=male, by(year region)
list, clean noobs
generate percent = 100*cases/total
keep year region total percent
// flatten and interleave the columns
reshape wide total percent, i(year) j(region)
drop year
list, clean noobs
// now use export excel to output,
// or use mkmat to load into a matrix and use putexcel to output

Pyramidal algorithm

I'm trying to find an algorithm for traversing a numerical pyramid: starting at the top, move through adjacent numbers in the next row, adding each visited number to a running sum. The catch is that I have to find the route that produces the highest total.
I already tried always moving to the larger adjacent number in the next row, but that is not the answer, because it does not always give the best route.
For example:
34
43 42
67 89 68
05 51 32 78
72 25 32 49 40
If I always take the highest adjacent number, I get:
34 + 43 + 89 + 51 + 32 = 249
But if I go:
34 + 42 + 68 + 78 + 49 = 271
In the second case the result is higher, but I found that route by hand, and I can't think of an algorithm that finds the highest result in all cases.
Can anyone give me a hand?
(Please tell me if I did not express myself well)
Start with the bottom row. Going from left to right, consider each pair of adjacent numbers. Now go up one row: take the number sitting above that pair, form its sum with each of the two numbers below, and select the larger sum.
Basically you are looking at the triangles formed by the bottom row and the row above. So for your original triangle,
34
43 42
67 89 68
05 51 32 78
72 25 32 49 40
the bottom left triangle looks like,
05
72 25
So you would add 72 + 05 = 77, as that is the largest sum between 72 + 05 and 25 + 05.
Similarly,
51
25 32
will give you 51 + 32 = 83.
If you continue this approach for each two adjacent numbers and the number above, you can discard the bottom row and replace the row above with the computed sums.
So in this case, the second to last row becomes
77 83 81 127
and your new pyramid is
34
43 42
67 89 68
77 83 81 127
Keep doing this and your pyramid starts shrinking until you have one number which is the number you are after.
34
43 42
150 172 195

and then

34
215 237

Finally, you are left with one number: 34 + max(215, 237) = 271.
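A minimal C++ sketch of this bottom-up reduction, with the question's pyramid hard-coded:
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<std::vector<int>> rows = {
        {34},
        {43, 42},
        {67, 89, 68},
        {5, 51, 32, 78},
        {72, 25, 32, 49, 40},
    };

    // Fold each row into the one above it: every element absorbs the
    // larger of its two children, shrinking the pyramid one row at a time.
    for (std::size_t r = rows.size() - 1; r > 0; --r)
        for (std::size_t i = 0; i < rows[r - 1].size(); ++i)
            rows[r - 1][i] += std::max(rows[r][i], rows[r][i + 1]);

    std::cout << "best total: " << rows[0][0] << '\n';  // prints 271
}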
Starting at the bottom and working up row by row, add to each element the larger of the two values beneath it.
So, in your tree, 05 for example will be replaced by max(72, 25) + 05 = 77. Later, you'll add the maximum of that value and the new value of the 51 element to 67.
The top-most node will be the maximum sum.
Not to spoil all your fun, I'll leave the implementation to you, or the details of getting the actual path, if required.

How to calculate classification error rate

Alright. Now this question is pretty hard. I am going to give you an example.
Now the left numbers are my algorithm's classifications and the right numbers are the original class numbers:
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 89
177 89
177 89
177 89
177 89
177 89
177 89
So here my algorithm merged two different classes into one: as you can see, it merged classes 86 and 89. What would the error be in the above example?
Or here is another example:
203 7
203 7
203 7
203 7
16 7
203 7
17 7
16 7
203 7
In the above example, the left numbers are my algorithm's classifications and the right numbers are the original class ids. As can be seen, it misclassified 3 products (I am classifying commercial products). So what would the error rate be in this example? How would you calculate it?
This question is pretty hard and complex. We have finished the classification, but we are not able to find the correct algorithm for calculating the success rate :D
Here's a longish example, a real confusion matrix with 10 input classes "0" - "9"
(handwritten digits),
and 10 output clusters labelled A - J.
Confusion matrix for 5620 optdigits:
True 0 - 9 down, clusters A - J across
-----------------------------------------------------
A B C D E F G H I J
-----------------------------------------------------
0: 2 4 1 546 1
1: 71 249 11 1 6 228 5
2: 13 5 64 1 13 1 460
3: 29 2 507 20 5 9
4: 33 483 4 38 5 3 2
5: 1 1 2 58 3 480 13
6: 2 1 2 294 1 1 257
7: 1 5 1 546 6 7
8: 415 15 2 5 3 12 13 87 2
9: 46 72 2 357 35 1 47 2
----------------------------------------------------
580 383 496 1002 307 670 549 557 810 266 estimates in each cluster
y class sizes: [554 571 557 572 568 558 558 566 554 562]
kmeans cluster sizes: [ 580 383 496 1002 307 670 549 557 810 266]
For example, cluster A has 580 data points, 415 of which are "8"s;
cluster B has 383 data points, 249 of which are "1"s; and so on.
The problem is that the output classes are scrambled, permuted;
they correspond in this order, with counts:
A B C D E F G H I J
8 1 4 3 6 7 0 5 2 6
415 249 483 507 294 546 546 480 460 257
One could say that the "success rate" is
75 % = (415 + 249 + 483 + 507 + 294 + 546 + 546 + 480 + 460 + 257) / 5620
but this throws away useful information —
here, that E and J both say "6", and no cluster says "9".
So, add up the biggest numbers in each column of the confusion matrix
and divide by the total.
But, how to count overlapping / missing clusters,
like the 2 "6"s, no "9"s here ?
I don't know of a commonly agreed-upon way
(doubt that the Hungarian algorithm
is used in practice).
Bottom line: don't throw away information; look at the whole confusion matrix.
NB such a "success rate" will be optimistic for new data !
It's customary to split the data into say 2/3 "training set" and 1/3 "test set",
train e.g. k-means on the 2/3 alone,
then measure confusion / success rate on the test set — generally worse than on the training set alone.
Much more can be said; see e.g.
Cross-validation.
You have to define the error criteria if you want to evaluate the performance of an algorithm, so I'm not sure exactly what you're asking. In some clustering and machine learning algorithms you define the error metric and it minimizes it.
Take a look at this
https://en.wikipedia.org/wiki/Confusion_matrix
to get some ideas
You have to define an error metric yourself. In your case, a simple method would be to find the properties mapping of your product as
p = properties(id)
where id is the product id and p is likely a vector with one entry per property. Then you can define the error function e (or distance) between two products as
e = d(p1, p2)
Of course, each property must be mapped to a number in this function. This error function can then be used in the classification algorithm and in learning.
In your second example it seems that you treat the pair (203, 7) as a successful classification, so I think you already have a metric. Being more specific would get you a better answer.
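A tiny sketch of such a distance function, assuming the properties have already been encoded as numeric vectors (Euclidean distance is a placeholder choice, not the only option):
#include <cmath>
#include <iostream>
#include <vector>

// Euclidean distance between two property vectors; any metric that
// reflects how "different" two products are could be substituted.
double d(const std::vector<double>& p1, const std::vector<double>& p2) {
    double sum = 0;
    for (std::size_t i = 0; i < p1.size(); ++i)
        sum += (p1[i] - p2[i]) * (p1[i] - p2[i]);
    return std::sqrt(sum);
}

int main() {
    std::vector<double> a = {1.0, 0.5, 3.0};  // hypothetical product properties
    std::vector<double> b = {1.0, 0.7, 2.0};
    std::cout << "e = " << d(a, b) << '\n';
}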
Classification Error Rate (CER) is 1 - Purity (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html)
ClusterPurity <- function(clusters, classes) {
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}
Code from @john-colby
Or
CER <- function(clusters, classes) {
  1 - sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}
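For comparison, a small C++ sketch of the same purity computation, run on the asker's first example (parallel vectors of cluster and class labels):
#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

// Purity: for each cluster, count its most common true class, sum those
// counts over all clusters, and divide by the number of points. CER = 1 - purity.
double purity(const std::vector<int>& clusters, const std::vector<int>& classes) {
    std::map<int, std::map<int, int>> table;  // cluster -> (class -> count)
    for (std::size_t i = 0; i < clusters.size(); ++i)
        ++table[clusters[i]][classes[i]];

    int correct = 0;
    for (const auto& [cluster_id, counts] : table) {
        int best = 0;
        for (const auto& [class_id, n] : counts) best = std::max(best, n);
        correct += best;
    }
    return static_cast<double>(correct) / clusters.size();
}

int main() {
    // First example: everything went to cluster 177; true classes were 86 and 89.
    std::vector<int> clusters(16, 177);
    std::vector<int> classes = {86, 86, 86, 86, 86, 86, 86, 86, 86,
                                89, 89, 89, 89, 89, 89, 89};
    std::cout << "purity: " << purity(clusters, classes) << '\n';   // 9/16 = 0.5625
    std::cout << "CER: " << 1 - purity(clusters, classes) << '\n';  // 0.4375
}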

drawing an image with RGB data in matlab

I have a text file containing RGB data for an image; how can I draw the image using this data in MATLAB?
Data sample:
Red Green Blue
80 97 117
83 100 120
74 91 111
81 96 115
81 96 115
77 90 107
84 97 114
78 91 108
79 95 110
91 104 120
94 108 121
85 99 112
The IMAGE command takes an MxNx3 matrix and displays it as an RGB image. You can use LOAD and RESHAPE to get the data into the right format. Finally, IMAGE wants either integers between 0 and 255 or doubles between 0 and 1.0, so you need to cast or rescale your numbers. The following code snippet should show you how to put it all together.
x = load('rgbdata.txt'); % makes a 12x3 matrix
x = reshape(x, 2, 6, 3); % reshape fills column-wise; assumes a 2x6 image
x = x/255; % scale the data to be between 0 and 1
image(x);
