I have done clustering using evaluation in a churn value in my dataset. however clusters 1,2,3 have no class although more attributes are assigned in yes/no value. Why is that happening and what does it mean? I didn't understand from other examples.
Classes to Clusters:
0 1 2 3 4 <-- assigned to cluster
207 86 123 75 93 | No
85 45 92 93 158 | Yes
Cluster 0 <-- No
Cluster 1 <-- No class
Cluster 2 <-- No class
Cluster 3 <-- No class
Cluster 4 <-- Yes
Related
I have a column called Project_Id which lists the names of many different projects, say Project A, Project B and so on. A second column lists the sales for each project.
A third column shows time series information. For example:
Project_ID Sales Time Series Information
A 10 1
A 25 2
A 31 3
A 59 4
B 22 1
B 38 2
B 76 3
C 82 1
C 23 2
C 83 3
C 12 4
C 90 5
D 14 1
D 62 2
From this dataset, I need to choose (and thus create a new data set) only those projects which have at least 4 time series points, say, to get the new dataset (How do I get this by using an R code, is my question):
Project_ID Sales Time Series Information
A 10 1
A 25 2
A 31 3
A 59 4
C 82 1
C 23 2
C 83 3
C 12 4
C 90 5
Could someone please help?
Thanks a lot!
I tried to do some filtering with R but had little success.
This is snapshot of my dataset
A B
1 34
1 33
1 66
0 54
0 77
0 98
0 39
0 12
I am trying to create a random sample where there are 2 1s and 3 0s from column A in the sample along with their respective B values. Is there a way to do that? Basically trying to see how to get a sample with specific percentages of a particular column? Thanks.
Suppose we have a 2D (5x5) matrix:
test =
39 13 90 5 71
60 78 38 4 11
87 92 46 45 35
40 96 61 17 1
90 50 46 89 63
And a second 2D (5x2) matrix:
tidx =
1 3
2 4
2 3
2 4
4 5
And now we want to use tidx as an idex into test, so that we get the following output:
out =
39 90
78 4
92 46
96 17
89 63
One way to do this is with a for loop...
for i=1:size(test,1)
out(i,:) = test(i,tidx(i,:));
end
Question:
Is there a way to vectorize this so the same output is generated without a for loop?
Here is one way:
test(repmat([1:rows(test)]',1,columns(tidx)) + (tidx-1)*rows(test))
What you describe is an index problem. When you place a matrix all in one dimension, you get
test(:) =
39
60
87
40
90
13
78
92
96
50
90
38
46
61
46
5
4
45
17
89
71
11
35
1
63
This can be indexed using a single number. Here is how you figure out how to transform tidx into the correct format.
First, I use the above reference to figure out the index numbers which are:
outinx =
1 11
7 17
8 13
9 19
20 25
Then I start trying to figure out the pattern. This calculation gives a clue:
(tidx-1)*rows(test) =
0 10
5 15
5 10
5 15
15 20
This will move the index count to the correct column of test. Now I just need the correct row.
outinx-(tidx-1)*rows(test) =
1 1
2 2
3 3
4 4
5 5
This pattern is created by the for loop. I created that matrix with:
[1:rows(test)]' * ones(1,columns(tidx))
*EDIT: This does the same thing with a built in function.
repmat([1:rows(test)]',1,columns(tidx))
I then add the 2 together and use them as the index for test.
Alright. Now this question is pretty hard. I am going to give you an example.
Now the left numbers are my algorithm classification and the right numbers are the original class numbers
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 89
177 89
177 89
177 89
177 89
177 89
177 89
So here my algorithm merged 2 different classes into 1. As you can see it merged class 86 and 89 into one class. So what would be the error at the above example ?
Or here another example
203 7
203 7
203 7
203 7
16 7
203 7
17 7
16 7
203 7
At the above example left numbers are my algorithm classification and the right numbers are original class ids. As can be seen above it miss classified 3 products (i am classifying same commercial products). So at this example what would be the error rate? How would you calculate.
This question is pretty hard and complex. We have finished the classification but we are not able to find correct algorithm for calculating success rate :D
Here's a longish example, a real confuson matrix with 10 input classes "0" - "9"
(handwritten digits),
and 10 output clusters labelled A - J.
Confusion matrix for 5620 optdigits:
True 0 - 9 down, clusters A - J across
-----------------------------------------------------
A B C D E F G H I J
-----------------------------------------------------
0: 2 4 1 546 1
1: 71 249 11 1 6 228 5
2: 13 5 64 1 13 1 460
3: 29 2 507 20 5 9
4: 33 483 4 38 5 3 2
5: 1 1 2 58 3 480 13
6: 2 1 2 294 1 1 257
7: 1 5 1 546 6 7
8: 415 15 2 5 3 12 13 87 2
9: 46 72 2 357 35 1 47 2
----------------------------------------------------
580 383 496 1002 307 670 549 557 810 266 estimates in each cluster
y class sizes: [554 571 557 572 568 558 558 566 554 562]
kmeans cluster sizes: [ 580 383 496 1002 307 670 549 557 810 266]
For example, cluster A has 580 data points, 415 of which are "8"s;
cluster B has 383 data points, 249 of which are "1"s; and so on.
The problem is that the output classes are scrambled, permuted;
they correspond in this order, with counts:
A B C D E F G H I J
8 1 4 3 6 7 0 5 2 6
415 249 483 507 294 546 546 480 460 257
One could say that the "success rate" is
75 % = (415 + 249 + 483 + 507 + 294 + 546 + 546 + 480 + 460 + 257) / 5620
but this throws away useful information —
here, that E and J both say "6", and no cluster says "9".
So, add up the biggest numbers in each column of the confusion matrix
and divide by the total.
But, how to count overlapping / missing clusters,
like the 2 "6"s, no "9"s here ?
I don't know of a commonly agreed-upon way
(doubt that the Hungarian algorithm
is used in practice).
Bottom line: don't throw away information; look at the whole confusion matrix.
NB such a "success rate" will be optimistic for new data !
It's customary to split the data into say 2/3 "training set" and 1/3 "test set",
train e.g. k-means on the 2/3 alone,
then measure confusion / success rate on the test set — generally worse than on the training set alone.
Much more can be said; see e.g.
Cross-validation.
You have to define the error criteria if you want to evaluate the performance of an algorithm, so I'm not sure exactly what you're asking. In some clustering and machine learning algorithms you define the error metric and it minimizes it.
Take a look at this
https://en.wikipedia.org/wiki/Confusion_matrix
to get some ideas
You have to define a error metric to measure yourself. In your case, a simple method should be to find the properties mapping of your product as
p = properties(id)
where id is the product id, and p is likely be a vector with each entry of different properties. Then you can define the error function e (or distance) between two products as
e = d(p1, p2)
Sure, each properties must be evaluated to a number in this function. Then this error function can be used in the classification algorithm and learning.
In your second example, it seems that you treat the pair (203 7) as successful classification, so I think you have already a metric yourself. You may be more specific to get better answer.
Classification Error Rate(CER) is 1 - Purity (http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html)
ClusterPurity <- function(clusters, classes) {
sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}
Code of #john-colby
Or
CER <- function(clusters, classes) {
1- sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}
In most face recognition SDK, it only provides two major functions
detecting faces and extracting templates from photos, this is called detection.
comparing two templates and returning the similar score, this is called recognition.
However, beyond those two functions, what I am looking for is an algorithm or SDK for grouping photos with similar faces together, e.g. based on similar scores.
Thanks
First, perform step 1 to extract the templates, then compare each template with all the others by applying step two on all the possible pairs, obtaining their similarity scores.
Sort the matches based on this similarity score, decide on a threshold and group together those templates that exceed it.
Take, for instance, the following case:
Ten templates: A, B, C, D, E, F, G, H, I, J.
Scores between: 0 and 100.
Similarity threshold: 80.
Similarity table:
A B C D E F G H I J
A 100 85 8 0 1 50 55 88 90 10
B 85 100 5 30 99 60 15 23 8 2
C 8 5 100 60 16 80 29 33 5 8
D 0 30 60 100 50 50 34 18 2 66
E 1 99 16 50 100 8 3 2 19 6
F 50 60 80 50 8 100 20 55 13 90
G 55 15 29 34 3 20 100 51 57 16
H 88 23 33 18 2 55 51 100 8 0
I 90 8 5 2 19 13 57 8 100 3
J 10 2 8 66 6 90 16 0 3 100
Sorted matches list:
AI 90
FJ 90
BE 99
AH 88
AB 85
CF 80
------- <-- Threshold cutoff line
DJ 66
.......
Iterate through the list until the threshold cutoff point, where the values no longer exceed it, maintain a full templates set and association sets for each template, obtaining the final groups:
// Empty initial full templates set
fullSet = {};
// Iterate through the pairs list
foreach (templatePair : pairList)
{
// If the full set contains the first template from the pair
if (fullSet.contains(templatePair.first))
{
// Add the second template to its group
templatePair.first.addTemplateToGroup(templatePair.second);
// If the full set also contains the second template
if (fullSet.contains(templatePair.second))
{
// The second template is removed from the full set
fullSet.remove(templatePair.second);
// The second template's group is added to the first template's group
templatePair.first.addGroupToGroup(templatePair.second.group);
}
}
else
{
// If the full set contains only the second template from the pair
if (fullSet.contains(templatePair.second))
{
// Add the first template to its group
templatePair.second.addTemplateToGroup(templatePair.first);
}
}
else
{
// If none of the templates are present in the full set, add the first one
// to the full set and the second one to the first one's group
fullSet.add(templatePair.first);
templatePair.first.addTemplateToGroup(templatePair.second);
}
}
Execution details on the list:
AI: fullSet.add(A); A.addTemplateToGroup(I);
FJ: fullSet.add(F); F.addTemplateToGroup(J);
BE: fullSet.add(B); B.addTemplateToGroup(E);
AH: A.addTemplateToGroup(H);
AB: A.addTemplateToGroup(B); fullSet.remove(B); A.addGroupToGroup(B.group);
CF: C.addTemplateToGroup(F);
In the end, you end up with the following similarity groups:
A - I, H, B, E
C - F, J