Difference between feature selection, feature extraction, feature weights

I am slightly confused about what "feature selection", "feature extraction" and "feature weights" mean and how they differ. Reading the literature, I sometimes feel lost because the terms are used quite loosely. My primary questions are:
When people talk of Feature Frequency or Feature Presence, is that feature selection?
When people talk of algorithms such as Information Gain or Maximum Entropy, is that still feature selection?
If I train the classifier with a feature set that asks it to note, for example, the position of a word within a document, would one still call this feature selection?
Thanks
Rahul Dighe

Rahul-
All of these are good answers. The one thing I would mention is that the fundamental difference between selection and extraction has to do with how you are treating the data.
Feature Extraction methods are transformative -- that is, you apply a transformation to your data to project it into a new feature space of lower dimension. PCA and SVD are examples of this.
Feature Selection methods choose features from the original set based on some criterion. Information Gain, Correlation and Mutual Information are just criteria used to filter out unimportant or redundant features. Embedded or wrapper methods, as they are called, can use specialized classifiers to achieve feature selection and classify the dataset at the same time.
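As a rough illustration of filter-style selection, here is a minimal sketch that scores binary features by information gain and keeps the best two. The toy words, labels and the cut-off of two are all made up for the example; it is not meant as a definitive implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the entropy left after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: each column is a binary "word present?" feature.
features = {
    "free":    [1, 1, 1, 0, 0, 0],
    "meeting": [0, 0, 1, 1, 1, 1],
    "the":     [1, 1, 1, 1, 1, 1],
}
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

scores = {name: information_gain(col, labels) for name, col in features.items()}
# Selection keeps a subset of the ORIGINAL columns; nothing is transformed.
selected = sorted(scores, key=scores.get, reverse=True)[:2]
```

The word that appears in every document gets a score of zero and is dropped, which is the "filter out unimportant features" idea in miniature.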
A really nice overview of the problem space is given here.
Good Luck!

Feature extraction: reduce dimensionality by (linear or non-linear) projection of a D-dimensional vector onto a d-dimensional vector (d < D).
Example: principal component analysis
Feature selection: reduce dimensionality by selecting a subset of the original variables.
Example: forward or backward feature selection

Feature Selection is the process of choosing "interesting" features from your set for further processing.
Feature Frequency is just that: how often a feature appears.
Information Gain, Maximum Entropy, etc. are weighting methods that use Feature Frequency and, in turn, allow you to perform Feature Selection.
Think of it like this:
You parse a corpus and create a term/document matrix. This matrix starts out as a count of the terms and the documents in which they appear (simple frequency).
To make that matrix more meaningful, you weight the terms using some function of the frequency (such as term frequency-inverse document frequency, Information Gain or Maximum Entropy). The matrix now contains the weight, or importance, of each term relative to the other terms in the matrix.
Once you have that, you can use feature selection to keep only the most important terms (if you are doing something like classification or categorization) and perform further analysis.
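The three steps above (count, weight, select) can be sketched roughly as follows. The corpus, the choice of tf-idf with idf = log(N / df), and the cut-off k are assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenised.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

# Step 1: term/document matrix of raw counts (feature frequency).
counts = [Counter(doc) for doc in docs]
vocab = sorted(set(term for doc in docs for term in doc))

# Step 2: weight the counts; here tf-idf with idf = log(N / df).
n_docs = len(docs)
doc_freq = {t: sum(1 for c in counts if t in c) for t in vocab}
idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
tfidf = [{t: c[t] * idf[t] for t in c} for c in counts]

# Step 3: feature selection; keep the k terms with the highest total weight.
k = 3
total_weight = {t: sum(row.get(t, 0.0) for row in tfidf) for t in vocab}
selected = sorted(vocab, key=lambda t: total_weight[t], reverse=True)[:k]
```

Note that "the" occurs in every document, so its weight is zero even though its raw frequency is the highest. That is the difference between frequency and weight.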

Related

Which is a better input for an autoencoder, one with correlated features or one with uncorrelated features?

I am trying to visualise my data in 2D in order to detect fraud (outliers). All my features are likely to take bigger values in cases of fraud, but I was careful not to include redundant features.
For example, the features Activity (a score that is higher for active users who use the service every day) and Money-earned both tend to take higher values in cases of fraud, but one can't be deduced from the other.
I figured that choosing features in this way would translate to bigger coordinates in the 2D representation and would make fraudulent points stand out from the rest of my data.
I also feel like having correlated features would make it easier for autoencoder to reconstruct the data. But I read many times that having correlated features isn’t efficient in machine learning.
Should I make an effort to make my features less correlated? For example, by replacing the Activity score (higher for active users) with the time between two uses (lower for active users)?
Or maybe this isn't important for the autoencoder?
You are right that "having correlated features would make it easier for autoencoder to reconstruct the data".
For example, if all your features were i.i.d. Gaussian, independent of one another, data compression would be very difficult for an autoencoder: with no structure to exploit, it would fail to learn a low-dimensional representation of the data.
Please refer to this Stanford UFLDL Tutorial link for details.

Feature Scaling required or not

I am working with a sample data set to learn clustering. This data set contains the number of occurrences of each keyword.
Since all values are occurrence counts for different keywords, is it OK not to scale them and to use them as they are?
I read a couple of articles on the internet which emphasize that scaling is important because it adjusts the relative weight of the frequencies. Since most of the frequencies are 0 (95%+), z-score scaling will change the shape of the distribution, which I feel could be a problem because I would be changing the nature of the data.
I am thinking of not changing values at all to avoid this. Will that affect the quality of results I get from the clustering?
As was already noted, the answer depends heavily on the algorithm being used.
If you're using a distance-based algorithm with the (usually default) Euclidean distance (for example, k-Means or k-NN), it will rely more on features with a bigger range, simply because a "typical difference" in the values of such a feature is bigger.
Non-distance-based models can be affected, too. One might think that linear models do not fall into this category: scaling (and translating, if needed) is a linear transformation, so if it makes results better, the model should learn it, right? It turns out the answer is no. The reason is that hardly anyone uses vanilla linear models; they are almost always used with some sort of regularization that penalizes large weights. This can prevent your linear model from learning the scaling from the data.
There are models that are independent of the feature scale. For example, tree-based algorithms (decision trees and random forests) are not affected. A node of a tree partitions your data into two sets by comparing a feature (the one that splits the dataset best) to a threshold value. There is no regularization on the threshold (regularization of a tree is about keeping its height small), so the split is not affected by different scales.
That being said, it's usually advised to standardize (subtract mean and divide by standard deviation) your data.
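To make the distance-based case concrete, here is a small sketch showing how standardization can change which point counts as "nearest". The age/income rows are invented for the example, and the population standard deviation is used for simplicity.

```python
import math
import statistics

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Subtract each column's mean and divide by its (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def nearest_to_first(data):
    """Index of the row closest to row 0."""
    return min(range(1, len(data)), key=lambda i: euclidean(data[0], data[i]))

# Hypothetical data: (age in years, income in dollars).
rows = [[25, 50_000], [60, 51_000], [26, 52_000], [45, 70_000]]

raw_nearest = nearest_to_first(rows)                  # income dominates the distance
scaled_nearest = nearest_to_first(standardize(rows))  # both features contribute
```

On the raw values the 60-year-old is the nearest neighbour of the 25-year-old, purely because the incomes differ by only 1,000. After standardization the 26-year-old becomes the nearest neighbour.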
It probably depends on the classification algorithm. I'm only familiar with SVM. Please see Ch. 2.2 for the explanation of scaling.
The type of feature (count of words) doesn't matter. The feature ranges should be more or less similar. If the count of e.g. "dignity" is 10 and the count of "have" is 100000000 in your texts, then (at least with SVM) the results with such features would be less accurate than if you scaled both counts to a similar range.
The cases where no scaling is needed are those where the data is scaled implicitly, e.g. when the features are pixel values in an image. The data is already scaled to the range 0-255.
*Distance-based algorithms need scaling
*There is no need for scaling in tree-based algorithms
But it is good to scale your data and train the model. If possible, compare the model accuracy and other evaluation metrics before and after scaling, and use whichever works best.
This is to the best of my knowledge.

Remove noisy and redundant features

I have extracted features from a video sequence based on facial markers as means and standard deviations of those markers over a video sequence. They need to be classified into four different classes based on those markers.
In all I have a feature set of around 260 features. How should I determine which features in my set are noisy and redundant? I read about this in some research papers, and some of them used the plus-l take-away-r algorithm, which seemed quite appropriate, but such algorithms always rate one feature against another and say it is good or bad by comparison.
How do I rate my features as good or bad? What criteria are generally used for that?
I researched a lot for a couple of days but found nothing clear cut and useful. Would be grateful for the help, Thanks.
Think of your 260 features as a basis for a 260-dimensional space. However, your basis vectors are not orthogonal to each other, so they contain a lot of redundant information. You'd like to transform these vectors into a set in which all vectors are orthogonal to each other, thus reducing the number of dimensions without losing (much) information.
This is what Principal component analysis does.
Linear discriminant analysis may also be of interest to you.
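For a feel of what PCA does with redundant features, here is a rough pure-Python sketch that finds only the first principal component, using power iteration on the covariance matrix. Power iteration is my choice for keeping the example dependency-free; a real project would normally use a library routine. The data is invented: the second feature is exactly twice the first, and the third is small noise.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the leading principal direction and the data projected onto it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centred data.
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly apply the covariance matrix and renormalise.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    projected = [sum(c * x for c, x in zip(row, v)) for row in centred]
    return v, projected

# Hypothetical data: feature 1 is redundant (2 x feature 0), feature 2 is noise.
rows = [[1, 2, 0.1], [2, 4, -0.1], [3, 6, 0.0], [4, 8, 0.1], [5, 10, -0.1]]
direction, projected = first_principal_component(rows)
```

The resulting direction points almost entirely along the shared (1, 2) axis of the two redundant features, so one coordinate per sample captures nearly all of the variance.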
You can use PCA, or you can train some classifiers and then loop over all your features, adding a large value to each feature in turn and testing whether this alteration changes the precision of the classifier. If it does not, you can remove that feature. After removing all the redundant features, retrain your classifiers.
It's a good idea to train not one classifier but many of them, and then make your prediction based on votes. You can use the mode function in MATLAB to do this.
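A minimal sketch of the perturbation test described above, in Python rather than MATLAB. The nearest-centroid classifier, the toy data and the offset of 100 are all assumptions chosen to keep the example short; any trained classifier could take its place.

```python
from statistics import mean

def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for cls in set(y):
        members = [x for x, label in zip(X, y) if label == cls]
        centroids[cls] = [mean(col) for col in zip(*members)]
    return centroids

def predict(centroids, x):
    def sq_dist(cls):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[cls]))
    return min(centroids, key=sq_dist)

def accuracy(centroids, X, y):
    return sum(predict(centroids, x) == label for x, label in zip(X, y)) / len(y)

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.1], [1.2, 4.9], [0.9, 5.0], [3.0, 5.0], [3.1, 5.2], [2.9, 4.8]]
y = ["a", "a", "a", "b", "b", "b"]

model = fit_centroids(X, y)
baseline = accuracy(model, X, y)

removable = []
for j in range(len(X[0])):
    # Add a large value to feature j only, leaving the trained model untouched.
    perturbed = [[v + 100.0 if i == j else v for i, v in enumerate(row)] for row in X]
    if accuracy(model, perturbed, y) == baseline:
        removable.append(j)
```

Here only feature 1 ends up in `removable`, because shifting it does not change any prediction. On real data it would be safer to measure accuracy on held-out samples rather than on the training set.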
Use the classification rate to judge how good a subset of features is. You have 260 features and therefore 2^260 subsets, which is far too many to search exhaustively. It is thus better to first remove some features with a filter method (for example FA, t-test, Fisher, ...) and then use your search method to find the best subset of features.
The plus-l take-away-r algorithm (or another search algorithm) finds various subsets and rates them (at this stage, use the classification rate), and finally identifies which subset is best.
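As an illustration of the filter step, here is a rough sketch of one common form of the Fisher score: between-class scatter of a feature divided by its within-class scatter. The toy data and the decision to keep a single feature are assumptions for the example.

```python
from statistics import mean, pvariance

def fisher_score(values, labels):
    """Between-class scatter of one feature divided by its within-class scatter."""
    overall = mean(values)
    between = within = 0.0
    for cls in set(labels):
        members = [v for v, y in zip(values, labels) if y == cls]
        between += len(members) * (mean(members) - overall) ** 2
        within += len(members) * pvariance(members)
    return between / within if within else float("inf")

# Hypothetical toy data: rows are samples, columns are features.
X = [[1.0, 5.1], [1.2, 4.9], [0.9, 5.0], [3.0, 5.0], [3.1, 5.2], [2.9, 4.8]]
y = ["a", "a", "a", "b", "b", "b"]

columns = list(zip(*X))
scores = [fisher_score(col, y) for col in columns]
# Keep the indices of the best-scoring features; the search method then
# only has to explore subsets of these.
keep = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)[:1]
```

With 260 features you would keep far more than one, but the principle is the same: shrink the candidate set cheaply before running the expensive subset search.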

Appropriate clustering method for 1 or 2 dimensional data

I have a set of data I have generated that consists of extracted mass (well, m/z, but that's not so important) values and a time. I extract the data from the file; however, it is possible to get repeat measurements, and this results in a large amount of redundancy within the dataset. I am looking for a method to cluster these in order to group those that are related based on either similarity in mass alone, or similarity in mass and time.
An example of data that should be grouped together is:
m/z time
337.65 1524.6
337.65 1524.6
337.65 1604.3
However, I have no way to determine how many clusters I will have. Does anyone know of an efficient way to accomplish this, possibly using a simple distance metric? I am not familiar with clustering algorithms sadly.
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/DBSCAN
Read the section about hierarchical clustering, and also look into DBSCAN if you really don't want to specify the number of clusters in advance. You will need to define a distance metric, and that step is where you decide which feature or combination of features you will be clustering on.
Why don't you just set a threshold?
If successive values (by time) do not differ by at least +-0.1 (by m/z), they are grouped together. Alternatively, use a relative threshold: values that differ by less than +-0.1%. Set these thresholds according to your domain knowledge.
That sounds like the straightforward way of preprocessing this data to me.
Using a "clustering" algorithm here seems total overkill to me. Clustering algorithms will try to discover much more complex structures than what you are trying to find here. The result will likely be surprising and hard to control. The straightforward change-threshold approach (which I would not call clustering!) is very simple to explain, understand and control.
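A minimal sketch of the change-threshold idea, assuming the measurements are (m/z, time) pairs. The function name and the fourth data point are made up for illustration; the first three rows are the ones from the question.

```python
def group_by_threshold(measurements, max_gap=0.1):
    """Group (m/z, time) pairs whose successive m/z values differ by less than max_gap.

    Measurements are visited in time order; a new group starts whenever the
    m/z value jumps by max_gap or more relative to the previous measurement.
    """
    groups = []
    for mz, time in sorted(measurements, key=lambda m: m[1]):
        if groups and abs(mz - groups[-1][-1][0]) < max_gap:
            groups[-1].append((mz, time))
        else:
            groups.append([(mz, time)])
    return groups

data = [
    (337.65, 1524.6),
    (337.65, 1524.6),
    (337.65, 1604.3),
    (412.20, 1610.0),  # hypothetical extra measurement with a different mass
]
groups = group_by_threshold(data)
```

If measurements of different masses are interleaved in time, sorting by m/z instead of time (change the `key`) would group on mass alone.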
For simple one-dimensional data, K-means clustering (http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) is appropriate and can be used directly. The only issue is selecting an appropriate K. One good way is to plot K against the residual variance and select the K at which the variance drops "dramatically". Another strategy is to use an information criterion (e.g. the Bayesian Information Criterion).
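Here is a rough sketch of picking K by looking at the residual (within-cluster) sum of squares for one-dimensional data. The m/z values are invented, and the deterministic initialisation (centres spread over the sorted values) is my own simplification to keep the example reproducible.

```python
def kmeans_1d(values, k, iterations=50):
    """Plain K-means on scalars; returns the centres and the residual sum of squares."""
    values = sorted(values)
    # Deterministic start: spread the initial centres across the sorted values.
    centres = [values[(2 * i + 1) * len(values) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[idx].append(v)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    sse = sum(min((v - c) ** 2 for c in centres) for v in values)
    return centres, sse

# Hypothetical m/z values with three obvious groups.
mz_values = [337.65, 337.65, 337.66, 412.20, 412.21, 500.10, 500.12]
sse_by_k = {k: kmeans_1d(mz_values, k)[1] for k in range(1, 5)}
```

The residual drops sharply up to K = 3 and barely changes afterwards, which is the "elbow" you would look for in the plot.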
You can extend K-means to multi-dimensional data easily, but you should be careful about the scaling of the individual dimensions. E.g. among the items (1KG, 1KM) and (2KG, 2KM), the nearest point to (1.7KG, 1.4KM) is (2KG, 2KM) with these scales. But once you express the second coordinate in meters, the opposite is true.
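The kilogram/kilometre example can be checked directly. This sketch uses plain Euclidean distance; the helper names are made up.

```python
import math

def nearest(query, points):
    """Return the point with the smallest Euclidean distance to query."""
    return min(points, key=lambda p: math.dist(query, p))

def km_to_m(point):
    """Express the second coordinate in metres instead of kilometres."""
    return (point[0], point[1] * 1000)

# (mass in kg, distance in km), taken from the example above.
points_km = [(1.0, 1.0), (2.0, 2.0)]
query_km = (1.7, 1.4)
nearest_km = nearest(query_km, points_km)

# The same items with the second coordinate expressed in metres.
points_m = [km_to_m(p) for p in points_km]
nearest_m = nearest(km_to_m(query_km), points_m)
```

In kilometres the nearest item is (2, 2); in metres the 400 m gap to the first item beats the 600 m gap to the second, and the mass difference no longer matters.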

N-gram text categorization category size difference compensation

Lately I've been mucking about with text categorization and language classification based on Cavnar and Trenkle's article "N-Gram-Based Text Categorization" as well as other related sources.
For doing language classification I've found this method to be very reliable and useful. The size of the documents used to generate the N-gram frequency profiles is fairly unimportant as long as they are "long enough" since I'm just using the most common n N-grams from the documents.
On the other hand, well-functioning text categorization eludes me. I've tried both my own implementations of various variations of the algorithms at hand, with and without tweaks such as idf weighting, and other people's implementations. It works quite well as long as I can generate somewhat similarly-sized frequency profiles for the category reference documents, but the moment they start to differ just a bit too much the whole thing falls apart and the category with the shortest profile ends up getting a disproportionate number of documents assigned to it.
Now, my question is: what is the preferred method of compensating for this effect? It's obviously happening because the algorithm assumes a maximum distance for any given N-gram that equals the length of the category frequency profile, but for some reason I just can't wrap my head around how to fix it. One reason I'm interested in this fix is that I'm trying to automate the generation of category profiles based on documents with a known category, which can vary in length (and even if they are the same length, the profiles may end up being different lengths). Is there a "best practice" solution to this?
If you are still interested, and assuming I understand your question correctly, the answer to your problem would be to normalise your n-gram frequencies.
The simplest way to do this, on a per document basis, is to count the total frequency of all n-grams in your document and divide each individual n-gram frequency by that number. The result is that every n-gram frequency weighting now relates to a percentage of the total document content, regardless of the overall length.
Using these percentages in your distance metrics will discount the size of the documents and instead focus on the actual make up of their content.
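A minimal sketch of that normalisation, assuming character trigrams and two made-up documents with the same make-up but very different lengths.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """All overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def relative_frequencies(text, n=3):
    """Map each n-gram to its share of all n-grams in the document."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

short_doc = "abc" * 10
long_doc = "abc" * 500  # same make-up, fifty times the length

short_profile = relative_frequencies(short_doc)
long_profile = relative_frequencies(long_doc)

# Largest per-n-gram difference between the two normalised profiles.
max_difference = max(abs(short_profile[g] - long_profile[g]) for g in short_profile)
```

The raw counts differ by hundreds, but the normalised shares differ only by a couple of percentage points (a small edge effect at the ends of the shorter text), so a distance computed on them no longer rewards or punishes document length.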
It might also be worth noting that the n-gram representation only makes up a very small part of an entire categorisation solution. You might also consider using dimensional reduction, different index weighting metrics and obviously different classification algorithms.
See here for an example of n-gram use in text classification
As far as I know, the task is to compute the probability that some text was generated by a language model M.
Recently I was working on measuring the readability of texts using semantic, syntactic and lexical properties. It can also be measured with a language-model approach.
To answer properly, you should consider these questions:
Are you using a log-likelihood approach?
What levels of n-grams are you using? Unigrams, bigrams or higher?
How big are the language corpora that you use?
Using only bigrams and unigrams, I managed to classify some documents with nice results. If your classification is weak, consider creating a bigger language corpus or using n-grams of lower levels.
Also remember that whether a text is assigned to the wrong category may depend on its length (by chance, a few of its words may appear in other language models).
Just consider making your language corpora bigger, and be aware that analysing short texts carries a higher probability of misclassification.
