How to combine two features (two minimum distance classifiers) - image

Hello All with my first post here,
I work on tracking objects through images without prior training. I use two features, the color of the region (the ab channels of the Lab space) and the HOG. In my initial experiments, I found that using min. distance classifier with the HOG feature alone has the advantage of low false positives FP but with high FN. On the other hand, using the min. distance classifier with the color alone increases the TP and decreases the FN results but with the price of increasing FP.
My question, is how to combine the two classifiers? I like to know the standard algorithm to do that in an unsupervised way.
I tried to combine the two features into one feature (after normalization) but the HOG dominates the results. Even if I weighted the combined feature, results are worse than either of the two.
The good results I reach till now is to (cascade) the two classifiers, by running the color first to increase the possibilities then run the HOG (with a threshold a little bit higher than that used with HOG alone). I googled the topic but I don't have enough knowledge about classification to find the standard methods.
Thanks for help

Related

Bring Word2Vec models efficiently into Production Service

This is kind of a long shot, but I am hoping that someone has been in a similar situation as I am looking for some advice how to efficiently bring a set of large word2vec models into a production environment.
We have a range of trained w2v models with a dimensionality of 300. Due to the underlying data - huge corpus with POS tagged words; specialized vocabularies with up to 1 mio words - these models became quite large and we are currently looking into effective ways how to expose these to our users w/o paying a too high price in infrastructure.
Besides trying to better control the vocabulary size, obviously, dimensionality reduction on the feature vectors would be an option. Is anyone aware of publications around that, particularly on how this would affect model quality, and how to best measure this?
Another option is to pre-calculate the top X most similar words to each vocabulary word and to provide a lookup table. With the model size being that big, this is currently also very inefficient. Are there any heuristics known that could be used reduce the number of necessary distance calculations from n x n-1 to a lower number?
Thank you very much!
There are pre-indexing techniques for similarity-search in high-dimensional spaces which can speed nearest-neighbor discovery, but usually at a cost of absolute accuracy. (They also need more memory for the index.)
An example is the ANNOY library. The gensim project includes a demo notebook showing its use with Word2Vec.
I once did some experiments using just 16-bit (rather than 32-bit) floats in a Word2Vec model. It saved memory in the idle state, and nearest-neighbor top-N results were nearly unchanged. But, perhaps because some behind-the-scenes up-conversion to 32-bit floats was still occurring during the one-against-all distance-calculations, speed of operations was actually reduced. (And this suggests that each distance-calculation may have caused a temporary memory expansion offsetting any idle-state savings.) So it's not a quick fix, but further research here – perhaps involving finding/implementing the right routines for float16 array operations – could maybe mean 50% model-size savings and equivalent or even better speed.
For many applications, discarding the least-frequent words doesn't hurt much – or even, when done before training, can improve the quality of the remaining vectors. As many implementations, including gensim, sort the word-vector array in most-to-least-frequent order, you can discard the tail-end of the array to save memory, or limit most_similar() searches to the first-N entries to speed calculations.
Once you've minimized the vocabulary size, you want to be sure the full set is in RAM, and no swapping is triggered during the (typical) full-sweep distance-calculations. If you need multiple processes to serve answers from the same vector set, as in a web service on a multicore machine, gensim's memory-mapping operations can prevent each process from loading its own redundant copy of the vectors. You can see a discussion of this technique in this answer about speeding gensim Word2Vec loading time.
Finally, while precomputing top-N neighbors for a larger vocabulary is both time-consuming and memory-intensive, if your pattern of access is such that some tokens are checked far more than others, a cache of the N most-recently or M most-frequently requested top-N could improve perceived performance a lot – making only less-frequently-requested neighbor-lists require the full distance calculations to every other token.

Feature Scaling required or not

I am working with sample data set to learn clustering. This data set contains number of occurrences for the keywords.
Since all are number of occurrences for the different keywords, will it be OK not to scale the values and use them as it is?
I read couple of articles on internet where its emphasized that scaling is important as it will adjust the relativity of the frequency. Since most of frequencies are 0 (95%+), z score scaling will change the shape of distribution, which I am feeling could be problem as I am changing the nature of data.
I am thinking of not changing values at all to avoid this. Will that affect the quality of results I get from the clustering?
As it was already noted, the answer heavily depends on an algorithm being used.
If you're using distance-based algorithms with (usually default) Euclidean distance (for example, k-Means or k-NN), it'll rely more on features with bigger range just because a "typical difference" of values of that feature is bigger.
Non-distance based models can be affected, too. Though one might think that linear models do not get into this category since scaling (and translating, if needed) is a linear transformation, so if it makes results better, then the model should learn it, right? Turns out, the answer is no. The reason is that no one uses vanilla linear models, they're always used with with some sort of a regularization which penalizes too big weights. This can prevent your linear model from learning scaling from data.
There are models that are independent of the feature scale. For example, tree-based algorithms (decision trees and random forests) are not affected. A node of a tree partitions your data into 2 sets by comparing a feature (which splits dataset best) to a threshold value. There's no regularization for the threshold (because one should keep height of the tree small), so it's not affected by different scales.
That being said, it's usually advised to standardize (subtract mean and divide by standard deviation) your data.
Probably it depends on the classification algorithm. I'm only familiar with SVM. Please see Ch. 2.2 for the explanation of scaling
The type of feature (count of words) doesn't matter. The feature ranges should be more or less similar. If the count of e.g. "dignity" is 10 and the count of "have" is 100000000 in your texts, then (at least on SVM) the results of such features would be less accurate as when you scaled both counts to similar range.
The cases, where no scaling is needed are those, where the data is scaled implicitly e.g. features are pixel-values in an image. The data is scaled already to the range 0-255.
*Distance based algorithm need scaling
*There is no need of scaling in tree based algorithms
But it is good to scale your data and train model ,if possible compare the model accuracy and other evaluations before scaling and after scaling and use the best possibility
These is as per my knowledge

Remove noisy and redundant features

I have extracted features from a video sequence based on facial markers as means and standard deviations of those markers over a video sequence. They need to be classified into four different classes based on those markers.
In all I have a feature set of around 260 features. How should I determine which features are noisy and redundant in my set. I read about it in some research papers and some of them used the plus l take away r algorithm that I found to be quite appropriate but in such algorithms they always rate one feature against the other and say its good or bad compared to it.
How do I rate my features to be good or bad? What criterion are used for that generally?
I researched a lot for a couple of days but found nothing clear cut and useful. Would be grateful for the help, Thanks.
Think of your 260 features as a basis for a 260 dimensional room. However, your basis-vectors are not normal to each other so they contain a lot of redundant information. You'd like to transform these vectors into a vector-set where all vectors are normal to each other, thus minimizing the dimensions without losing (much) information.
This is what Principal component analysis does.
Linear discriminant analysis may also be of interest to you.
You can use pca or you can train some classifiers, and after this you loop all over yours features adding a big value to each feature, testing if this alteration changes the precision of the classifier, if not, you can remove this feature, after remove all the redundat features, and then retrain your classifiers!
Its a good ideia to train not one classifier but a lot of them, and them make your prediction based on votes, you can user MODE function in matlab to do this!
Use classification rate to determine a subset of feature how much good. You have 260 feature and then have 2^260 subset, this is too much! and search in this space is very difficult. Thus it's better to remove some feature by Filter method (for example FA, t-test, fisher and ...) and then use your search method to find best subset of feature.
Plus l take away r algorithm (or other search algorithm) find various subset and rate it (in this stage use classification rate) and at last specify which subset is better.

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for you data for the past year, given a new datapoint x, you can calculate the probability that it was generated by any one of the clusters (modeled by a Gaussian in the GMM). If your new data point has low probability of being generated by any one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
You could try line fitting prediction using linear regression and see how it goes, it would be fairly easy to implement in your language of choice.
After you fitted a line to your data, you could calculate the mean standard deviation along the line.
If the novel point is on the trend line +- the standard deviation, it should not be regarded as an abnormality.
PCA is an other technique that comes to mind, when dealing with this type of data.
You could also look in to unsuperviced learning. This is a machine learning technique that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in all the techniques you mention. I believe you should first try to narrow the typical abnormalities you may encounter, it helps keeping things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers changing abruptly direction" => compute u_{n+1} - u_n, and expect it to have constant sign, or fall in some range. You may want to keep this flexible, and allow your code design to be extensible (Strategy pattern may be worth looking at if you do OOP)
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly more complex), you put a priori laws on a, b and you ajust them based on successive information. Then, the posterior likelihood of the info provided by the last point added should give you some insight about it being normal or not. Relative entropy between posterior and prior law at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
I see little point in complex traditional machine learning stuff (perceptron layers or SVM to cite only them) if you want to detect outliers. These methods work great when classifying data which is known to be reasonably clean.

Modeling distribution of performance measurements

How would you mathematically model the distribution of repeated real life performance measurements - "Real life" meaning you are not just looping over the code in question, but it is just a short snippet within a large application running in a typical user scenario?
My experience shows that you usually have a peak around the average execution time that can be modeled adequately with a Gaussian distribution. In addition, there's a "long tail" containing outliers - often with a multiple of the average time. (The behavior is understandable considering the factors contributing to first execution penalty).
My goal is to model aggregate values that reasonably reflect this, and can be calculated from aggregate values (like for the Gaussian, calculate mu and sigma from N, sum of values and sum of squares). In other terms, number of repetitions is unlimited, but memory and calculation requirements should be minimized.
A normal Gaussian distribution can't model the long tail appropriately and will have the average biased strongly even by a very small percentage of outliers.
I am looking for ideas, especially if this has been attempted/analysed before. I've checked various distributions models, and I think I could work out something, but my statistics is rusty and I might end up with an overblown solution. Oh, a complete shrink-wrapped solution would be fine, too ;)
Other aspects / ideas: Sometimes you get "two humps" distributions, which would be acceptable in my scenario with a single mu/sigma covering both, but ideally would be identified separately.
Extrapolating this, another approach would be a "floating probability density calculation" that uses only a limited buffer and adjusts automatically to the range (due to the long tail, bins may not be spaced evenly) - haven't found anything, but with some assumptions about the distribution it should be possible in principle.
Why (since it was asked) -
For a complex process we need to make guarantees such as "only 0.1% of runs exceed a limit of 3 seconds, and the average processing time is 2.8 seconds". The performance of an isolated piece of code can be very different from a normal run-time environment involving varying levels of disk and network access, background services, scheduled events that occur within a day, etc.
This can be solved trivially by accumulating all data. However, to accumulate this data in production, the data produced needs to be limited. For analysis of isolated pieces of code, a gaussian deviation plus first run penalty is ok. That doesn't work anymore for the distributions found above.
[edit] I've already got very good answers (and finally - maybe - some time to work on this). I'm starting a bounty to look for more input / ideas.
Often when you have a random value that can only be positive, a log-normal distribution is a good way to model it. That is, you take the log of each measurement, and assume that is normally distributed.
If you want, you can consider that to have multiple humps, i.e. to be the sum of two normals having different mean. Those are a bit tricky to estimate the parameters of, because you may have to estimate, for each measurement, its probability of belonging to each hump. That may be more than you want to bother with.
Log-normal distributions are very convenient and well-behaved. For example, you don't deal with its average, you deal with it's geometric mean, which is the same as its median.
BTW, in pharmacometric modeling, log-normal distributions are ubiquitous, modeling such things as blood volume, absorption and elimination rates, body mass, etc.
ADDED: If you want what you call a floating distribution, that's called an empirical or non-parametric distribution. To model that, typically you save the measurements in a sorted array. Then it's easy to pick off the percentiles. For example the median is the "middle number". If you have too many measurements to save, you can go to some kind of binning after you have enough measurements to get the general shape.
ADDED: There's an easy way to tell if a distribution is normal (or log-normal). Take the logs of the measurements and put them in a sorted array. Then generate a QQ plot (quantile-quantile). To do that, generate as many normal random numbers as you have samples, and sort them. Then just plot the points, where X is the normal distribution point, and Y is the log-sample point. The results should be a straight line. (A really simple way to generate a normal random number is to just add together 12 uniform random numbers in the range +/- 0.5.)
The problem you describe is called "Distribution Fitting" and has nothing to do with performance measurements, i.e. this is generic problem of fitting suitable distribution to any gathered/measured data sample.
The standard process is something like that:
Guess the best distribution.
Run hypothesis tests to check how well it describes gathered data.
Repeat 1-3 if not well enough.
You can find interesting article describing how this can be done with open-source R software system here. I think especially useful to you may be function fitdistr.
In addition to already given answers consider Empirical Distributions. I have successful experience in using empirical distributions for performance analysis of several distributed systems. The idea is very straightforward. You need to build histogram of performance measurements. Measurements should be discretized with given accuracy. When you have histogram you could do several useful things:
calculate the probability of any given value (you are bound by accuracy only);
build PDF and CDF functions for the performance measurements;
generate sequence of response times according to a distribution. This one is very useful for performance modeling.
Try whit gamma distribution http://en.wikipedia.org/wiki/Gamma_distribution
From wikipedia
The gamma distribution is frequently a probability model for waiting times; for instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution.
The standard for randomized Arrival times for performance modelling is either Exponential distribution or Poisson distribution (which is just the distribution of multiple Exponential distributions added together).
Not exactly answering your question, but relevant still: Mor Harchol-Balter did a very nice analysis of the size of jobs submitted to a scheduler, The effect of heavy-tailed job size distributions on computer systems design (1999). She found that the size of jobs submitted to her distributed task assignment system took a power-law distribution, which meant that certain pieces of conventional wisdom she had assumed in the construction of her task assignment system, most importantly that the jobs should be well load balanced, had awful consequences for submitters of jobs. She's done good follor-up work on this issue.
The broader point is, you need to ask such questions as:
What happens if reasonable-seeming assumptions about the distribution of performance, such as that they take a normal distribution, break down?
Are the data sets I'm looking at really representative of the problem I'm trying to solve?

Resources