I have 25 classes, and class 0 includes all the negative samples of the other 24 classes, so the number of samples in this class is much bigger than in the others (e.g. roughly 10 times bigger, because it contains the negative samples from all 24 other classes).
Now my question is: what should I do when I want to train on this data set?
Do I have to use the unbalanced training option that libsvm provides (-w0 1 -w1 ....)?
I mean, is it obligatory to use this option or not?
When I train without this option it reports 99.8% accuracy for separating the classes, but when I test this supposedly accurate model I get 100% accuracy for some classes and 0.0% for others.
In other words, for some classes it never misses a sample, while for other classes it always returns class 0, i.e. it labels every sample as negative.
I want to use this option, but I don't know the rules for it: how should I set the value for each class?
Suppose the number of samples in each class is:
class 0 -> 3433
class 1 -> 745
class 2 -> 232
class 3 -> 53
.
.
.
class 23 -> 975
How should I set wi for each class? Should I scale the weights to [0, 1], to [-1, 1], over (-inf, +inf), or something else?
Summary:
1) Is it obligatory to use the -wi option for my dataset?
2) How should I set its value?
Thanks
It is not obligatory; it depends on your data. If your classes are easy to separate, there is no need. Start without weights and have a look at the confusion matrix. If your errors are between a crowded class and a sparse class, some tweaking of the weights may help.
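If you do end up needing weights, one common heuristic (not something libsvm prescribes) is to make each class weight roughly inversely proportional to its sample count; -wi simply multiplies the C parameter for class i, so the weights are relative multipliers and do not need to be scaled to [0, 1]. A minimal Python sketch using the counts quoted in the question:

# Sketch: derive libsvm -wi weights inversely proportional to class size.
# The counts are the ones from the question (fill in the missing classes);
# the heuristic is just a common starting point, not a rule.
counts = {0: 3433, 1: 745, 2: 232, 3: 53, 23: 975}

largest = max(counts.values())
weights = {cls: largest / n for cls, n in counts.items()}

# -wi multiplies C for class i, so rare classes get weights >= 1 here.
flags = " ".join(f"-w{cls} {w:.2f}" for cls, w in sorted(weights.items()))
print("svm-train", flags, "training_file model_file")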
Not able to comment so I'll write it as an answer:
Two suggestions:
decrease the weight for class 0,
or adopt a 2-step approach (sketched below):
combine all examples from the other 24 classes, treat them as one class, and build a binary classifier;
build a 24-way classifier for only the positive examples, and use it if the classification result from the first step is positive.
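In rough Python, the two-step prediction flow would look something like this (binary_model and multiclass_model are hypothetical placeholders for whatever classifiers you actually train):

# Sketch of the two-step prediction flow described above.
# binary_model and multiclass_model are hypothetical, already-trained
# classifiers whose predict() returns a label.
def predict_two_step(x, binary_model, multiclass_model):
    # Step 1: class 0 vs. "any of the 24 positive classes".
    if binary_model.predict(x) == 0:
        return 0
    # Step 2: only positives reach the 24-way classifier.
    return multiclass_model.predict(x)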
Since you have +ve and -ve data for each class, you should train 24 binary classifiers.
Then when you put in a test case, if there is more than one SVM that has a positive prediction, pick the class for which the classifier has the highest output probability.
If you set up a multiclass SVM with LIBSVM, internally it just trains multiple binary SVMs anyway, so there is nothing odd about explicitly setting up 24 SVMs yourself.
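The prediction rule above could be sketched like this (models is a hypothetical dict of per-class binary classifiers; positive_probability() stands in for whatever probability output you use, e.g. libsvm trained with -b 1):

# Sketch of one-vs-rest prediction with 24 binary classifiers.
# models maps class label -> a hypothetical binary classifier whose
# positive_probability(x) returns the probability of the positive class.
def predict_one_vs_rest(x, models):
    scores = {cls: m.positive_probability(x) for cls, m in models.items()}
    positives = {cls: p for cls, p in scores.items() if p >= 0.5}
    if not positives:
        return 0  # no detector fired; fall back to the negative class
    return max(positives, key=positives.get)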
I've tried
vw --multilabel_oaa 68 -d vw_data.csv --loss_function=logistic --probabilities -p probabilities.txt
and ended up with only target labels in probabilities.txt. The -r option, which is supposed to produce raw output, unfortunately returned nothing as well.
Apart from that, I'm not sure whether there is a way to achieve similar behaviour (multilabel prediction with logistic loss) with the other available VW multiclass options such as --csoaa and --wap.
I don't remember exactly, but I think --probabilities does not support multilabel. I don't even know what the interpretation would be (modelling the probability of label co-occurrence, and providing the probabilities for all 2^68 subsets?).
You can use standard multi-class --oaa 68. With --probabilities it should predict the probability of each class, so you can then apply e.g. some kind of threshold for selecting multiple labels (classes) per example (e.g. such that the sum of their probabilities is at least 42%).
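A rough post-processing sketch of that thresholding idea in Python; the parsing assumes each line of probabilities.txt holds space-separated label:probability pairs, so adjust it to whatever your VW version actually writes:

# Sketch: pick multiple labels per example from --oaa --probabilities output.
# Assumes each line of probabilities.txt is space-separated
# "label:probability" pairs; adapt the parsing to your actual output.
def select_labels(line, mass=0.42):
    pairs = [p.split(":") for p in line.split()]
    probs = sorted(((float(pr), int(lbl)) for lbl, pr in pairs), reverse=True)
    chosen, total = [], 0.0
    for pr, lbl in probs:
        chosen.append(lbl)
        total += pr
        if total >= mass:  # keep labels until e.g. 42% of the mass is covered
            break
    return chosen

with open("probabilities.txt") as f:
    for line in f:
        if line.strip():
            print(select_labels(line))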
My actual vector has 110 elements that I extract as features from images in MATLAB; I took this one (tb) as a simplified example:
tb=[22.9 30.0 30.3 27.8 24.1 28.2 26.4 12.6 39.7 38.0];
normalized_V = tb/norm(tb);
I = mat2gray(tb);
For normalized_v I got 0.2503 0.3280 0.3312 0.3039 0.2635 0.3083 0.2886 0.1377 0.4340 0.4154.
For I I got 0.3801 0.6421 0.6531 0.5609 0.4244 0.5756 0.5092 0 1.0000 0.9373 which one should I use if any of those 2 methods and why, and should I transform the features vector to 1 element after extraction for better training or leave it as a 110 element vector.
Normalization can be performed in several ways, such as the following:
Normalizing the vector to the range [0, 1]. In that case, just use: (tb-min(tb))/(max(tb)-min(tb)) (this is essentially what mat2gray does).
Making the maximum value 1. In that case, just use: tb/max(tb) (similar in spirit to the tb/norm(tb) scaling you used, but dividing by the maximum instead of the Euclidean norm).
Making the mean 0 and the standard deviation 1. This is the most common method when the values are going to be used as features in a classification procedure, so I think it is the one you should use here: zscore(tb) (or, equivalently, (tb-mean(tb))/std(tb)).
So, your final values would be:
zscore(tb)
ans =
-0.6664
0.2613
0.3005
-0.0261
-0.5096
0.0261
-0.2091
-2.0121
1.5287
1.3066
Edit:
In regard to your second question, it depends on the number of observations. Every classifier takes an MxN matrix of data and an Mx1 vector of labels as inputs, where M is the number of observations and N is the number of features. Usually, in order to avoid over-fitting, it is recommended to keep the number of features below one tenth of the number of observations (i.e., M > 10N).
So, in your case, if you use the entire 110-set of features, you should have a minimum of 1100 observations, otherwise you can have problems with over-fitting.
Continuing with some experimentation here, I was interested in seeing how to continue training a VW model.
I first ran this and saved the model.
vw -d housing.vm --loss_function squared -f housing2.mod --invert_hash readable.housing2.mod
Examining the readable model:
Version 7.7.0
Min label:0.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
^AGE:104042:0.020412
^B:158346:0.007608
^CHAS:102153:1.014402
^CRIM:141890:0.016158
^DIS:182658:0.278865
^INDUS:125597:0.062041
^LSTAT:170288:0.028373
^NOX:165794:2.872270
^PTRATIO:223085:0.108966
^RAD:232476:0.074916
^RM:2580:0.330865
^TAX:108300:0.002732
^ZN:54950:0.020350
Constant:116060:2.728616
I then continue to train the model using two more examples (in housing_2.vm) which, note, have zero values for ZN and CHAS:
27.50 | CRIM:0.14866 ZN:0.00 INDUS:8.560 CHAS:0 NOX:0.5200 RM:6.7270 AGE:79.90 DIS:2.7778 RAD:5 TAX:384.0 PTRATIO:20.90 B:394.76 LSTAT:9.42
26.50 | CRIM:0.11432 ZN:0.00 INDUS:8.560 CHAS:0 NOX:0.5200 RM:6.7810 AGE:71.30 DIS:2.8561 RAD:5 TAX:384.0 PTRATIO:20.90 B:395.58 LSTAT:7.67
If the saved model is loaded and training continues, the coefficients for these zero-valued features appear to be lost. Am I doing something wrong, or is this a bug?
vw -d housing_2.vm --loss_function squared -i housing2.mod --invert_hash readable.housing3.mod
output from readable.housing3.mod:
Version 7.7.0
Min label:0.000000
Max label:50.000000
bits:18
0 pairs:
0 triples:
rank:0
lda:0
0 ngram:
0 skip:
options:
:0
^AGE:104042:0.023086
^B:158346:0.008148
^CRIM:141890:1.400201
^DIS:182658:0.348675
^INDUS:125597:0.087712
^LSTAT:170288:0.050539
^NOX:165794:3.294814
^PTRATIO:223085:0.119479
^RAD:232476:0.118868
^RM:2580:0.360698
^TAX:108300:0.003304
Constant:116060:2.948345
If you want to continue learning from saved state in a smooth fashion you must use the --save_resume option.
There are 3 fundamentally different types of "state" that can be saved into a vw "model" file:
The weight vector (regressor) obviously. That's the model itself.
invariant parameters like the version of vw (to ensure binary compatibility which is not always preserved between versions), number of bits in the vector (-b), and type of model
state which dynamically changes during learning. This subset includes parameters like learning and decay rates which gradually change during learning with each example, the example numbers themselves, etc.
Only --save_resume saves the last group.
--save_resume is not the default because it has an overhead and in most use-cases it isn't needed. E.g. if you save a model once in order to do many predictions and no learning (-t), there's no need to save the 3rd subset of state.
So, I believe in your particular case, you want to use --save_resume.
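Concretely, for your example that means adapting your own commands and adding the flag wherever a model is written that you intend to keep training later, roughly:
vw -d housing.vm --loss_function squared -f housing2.mod --save_resume --invert_hash readable.housing2.mod
vw -d housing_2.vm --loss_function squared -i housing2.mod --invert_hash readable.housing3.mod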
The possibility of a bug always exists, especially since vw supports so many options (about 100 at last count), which are often interdependent. Some option combinations make sense, others don't. Doing a sanity check of roughly 2^100 possible option combinations is a bit unrealistic. If you find a bug, please open an issue on github. In this case, please make sure to use a complete example (full data & command line) so your problem can be reproduced.
Update 2014-09-20 (after an issue was opened on github, thanks!):
The reason for zero-valued features "disappearing" (not really from the model, but only from the --invert_hash output) is that 1) --invert_hash was never designed for multiple passes, because keeping the original feature names in a hash table incurs a large performance overhead, and 2) the missing features are those with a zero value, which are discarded. The model itself should still contain any feature that had a non-zero weight in a prior pass. Fixing this inconsistency would be too complex and costly for implementation reasons, and would go against the overriding motivation of making vw fast, especially for the most useful/common use-cases. Anyway, thanks for the report; I learned something new from it too.
I'm using PICT (the Pairwise Independent Combinatorial Testing tool) and I'm trying to generate test cases using this model and constraint:
video_resolution: 352x240,352x288,640x480,704x480,704x576,720x240,720x480,720x576
video_rotate: 0,90,180,270
IF [video_resolution] IN { "640x480"} THEN [video_rotate]="90" OR "180";
but I'm having trouble doing so.
One more thing: what is the <> sign used for? I mean the <> operator.
Amit,
A couple of comments. The first is a solution; the second two concern where the benefits of the kind of test design approach you're asking about tend to be largest.
1) Here is a very short video showing how your problem could be solved using Hexawise, a test case generator similar to PICT. To mark the invalid pairs, simply click on the symbols to the right of the relevant parameter values.
http://www.screencast.com/users/Hexawise/folders/Camtasia/media/5c6aae22-ec78-4cae-9471-16d5c96cf175
2) Your question involves 8 screen resolutions and 4 video rotations. Pairwise coverage (AKA 2-way coverage) will require 32 test cases, or 30 test cases once you eliminate the 2 invalid combinations (see the short sketch after these points). This is a fine use of PICT or Hexawise (e.g., they'll make sure you don't forget any valid combination), but where you will really see dramatic benefits is when you have a lot of parameters. In such cases, you'll be able to identify a small subset of test condition combinations that will be surprisingly effective at triggering defects using only a tiny portion of the total possible test cases.
3) If you had 20 Parameters with 4 values each, for example, you would have more than 1 trillion possible tests. If you set your coverage strength to pairwise (e.g., 2-way coverage), you would be able to achieve 100% coverage of all pairs of values in at least one test in only 37 tests.
An example demonstrating this is shown here: http://www.screencast.com/t/YmYzOTZhNTU
Coverage strength is adjustable as well. You can use this to alter your coverage strength based on the time available for testing and/or risk-based testing considerations. If you wanted to achieve 100% coverage of all possible combinations of 3 parameter values in at least one test, you would need 213 tests to do so. Furthermore, if you were relatively more concerned about the potential interactions between 3 particular sets of parameters (think, e.g., "Income", "Credit Rating" and "Price of House" in a mortgage application example vs. other, less important test inputs), then you would be able to create 80 tests to match that objective. The flexibility of this test design approach (available in both PICT and Hexawise) is a powerful reason to use these kinds of test design tools.
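For reference, here is a tiny Python sketch that enumerates your specific 8x4 example and applies the 640x480 constraint, confirming the 30 valid combinations mentioned in point 2 (with only two parameters, pairwise coverage is simply the full cross product):

# Sketch: enumerate the full cross product for the question's two parameters
# and drop the combinations excluded by the 640x480 constraint.
from itertools import product

resolutions = ["352x240", "352x288", "640x480", "704x480",
               "704x576", "720x240", "720x480", "720x576"]
rotations = ["0", "90", "180", "270"]

def valid(resolution, rotate):
    # IF [video_resolution] = "640x480" THEN [video_rotate] IN {"90", "180"}
    return resolution != "640x480" or rotate in ("90", "180")

tests = [(r, v) for r, v in product(resolutions, rotations) if valid(r, v)]
print(len(tests))  # 32 total combinations minus 2 invalid ones = 30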
Hope these tips help.
Full disclosure: I'm the founder of Hexawise.
Late answer, but just for others experiencing similar problems: your constraint must be:
video_resolution: 352x240,352x288,640x480,704x480,704x576,720x240,720x480,720x576
video_rotate: 0,90,180,270
IF [video_resolution] = "640x480" THEN [video_rotate] in {"90", "180"};
<> means NOT. In your case you could do:
IF [video_resolution] <> "720x576" THEN [video_rotate] >= 180;
This means: "If video_resolution is not 720x576, then video_rotate must be
equal or larger than 180"
I've got a classification system, which I will unfortunately need to be vague about for work reasons. Say we have 5 features to consider, it is basically a set of rules:
A B C D E Result
1 2 b 5 3 X
1 2 c 5 4 X
1 2 e 5 2 X
We take a subject and get its values for A-E, then try matching the rules in sequence. If one matches we return the first result.
C is a discrete value, which could be any of a-e. The rest are just integers.
The ruleset has been automatically generated from our old system and has an extremely large number of rules (~25 million). The old rules were if statements, e.g.
result("X") if $A >= 1 && $A <= 10 && $C eq 'A';
As you can see, the old rules often do not even use some features, or accept ranges. Some are more annoying:
result("Y") if ($A == 1 && $B == 2) || ($A == 2 && $B == 4);
The ruleset needs to be much smaller, as it has to be maintained by humans, so I'd like to shrink the rules so that the first example would become:
A B C D E Result
1 2 bce 5 2-4 X
The upshot is that we can split the ruleset by the Result column and shrink each independently. However, I cannot think of an easy way to identify and shrink down the ruleset. I've tried clustering algorithms but they choke because some of the data is discrete, and treating it as continuous is imperfect. Another example:
A B C Result
1 2 a X
1 2 b X
(repeat a few hundred times)
2 4 a X
2 4 b X
(ditto)
In an ideal world, this would be two rules:
A B C Result
1 2 * X
2 4 * X
That is: not only would the algorithm identify the relationship between A and B, but would also deduce that C is noise (not important for the rule)
Does anyone have an idea of how to go about this problem? Any language or library is fair game, as I expect this to be a mostly one-off process. Thanks in advance.
Check out the Weka machine learning lib for Java. The API is a little bit crufty but it's very useful. Overall, what you seem to want is an off-the-shelf machine learning algorithm, which is exactly what Weka contains. You're apparently looking for something relatively easy to interpret (you mention that you want it to deduce the relationship between A and B and to tell you that C is just noise.) You could try a decision tree, such as J48, as these are usually easy to visualize/interpret.
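If Java/Weka isn't a hard requirement, the same idea can be sketched in a few lines with scikit-learn's decision tree (a stand-in for J48 here, not the same algorithm, and the toy data and feature encoding are made up for illustration), just to show how readable the learned rules can be:

# Sketch: learn human-readable rules from labelled examples with a decision
# tree. scikit-learn's CART tree stands in for Weka's J48 here.
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data loosely modelled on the question: A and B determine the result,
# C (crudely integer-encoded here) is pure noise.
X = [[1, 2, 0], [1, 2, 1], [2, 4, 0], [2, 4, 1], [7, 1, 0], [7, 1, 2]]
y = ["X", "X", "X", "X", "Y", "Y"]

clf = DecisionTreeClassifier(max_depth=5).fit(X, y)
print(export_text(clf, feature_names=["A", "B", "C"]))
# The printed tree splits on A (or B) and never mentions C, which is the
# kind of "C is noise" conclusion you are after.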
Twenty-five million rules? How many features? How many values per feature? Is it possible to iterate through all combinations in practical time? If you can, you could begin by separating the rules into groups by result.
Then, for each result, do the following. Considering each feature as a dimension, and the allowed values for a feature as the metric along that dimension, construct a huge Karnaugh map representing the entire rule set.
The map has two uses. One: research automated methods for the Quine-McCluskey algorithm. A lot of work has been done in this area. There are even a few programs available, although probably none of them will deal with a Karnaugh map of the size you're going to make.
Two: when you have created your final reduced rule set, iterate over all combinations of all values for all features again, and construct another Karnaugh map using the reduced rule set. If the maps match, your rule sets are equivalent.
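That equivalence check is easy to sketch: exhaustively enumerate the feature space (if that is tractable) and compare the first matching rule of the original and reduced sets. A rough Python illustration, with classify() and the value ranges as placeholders for however your rule matching actually works:

# Sketch: brute-force check that a reduced rule set classifies every
# possible feature combination the same way as the original one.
from itertools import product

def classify(rules, features):
    for condition, result in rules:          # rules are tried in sequence
        if condition(features):              # condition is a predicate
            return result
    return None                              # no rule matched

def equivalent(original_rules, reduced_rules, value_ranges):
    for combo in product(*value_ranges):     # every point in feature space
        if classify(original_rules, combo) != classify(reduced_rules, combo):
            return False, combo              # first disagreement found
    return True, None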
-Al.
You could try a neural network approach, trained via backpropagation, assuming you have or can randomly generate (based on the old ruleset) a large set of data that hit all your classes. Using a hidden layer of appropriate size will allow you to approximate arbitrary discriminant functions in your feature space. This is more or less the same idea as clustering, but due to the training paradigm should have no issue with your discrete inputs.
This may, however, be a little too "black box" for your case, particularly if you have zero tolerance for false positives and negatives (although, it being a one-off process, you get an arbitrary degree of confidence by checking a gargantuan validation set).