Numeric to nominal filter

When is it compulsory to use the filter to change the data type to nominal? I am doing classification right now, and the results differ by a huge margin if I change the attributes to nominal compared to leaving them as they are. Thank you in advance.

I don't think your question is well formed, but I will try to answer it anyway.
Nominal and numeric attributes represent different types of attributes and therefore are treated differently by machine learning algorithms.
Nominal attributes are limited to a closed set of values, and there is no order or any other relation between those values. Usually nominal attributes should have a small number of possible values (a large set of possible values may cause over-fitting). The color of a car is an example of an attribute that would probably be represented as a nominal attribute.
Numeric attributes are usually more common. They represent values on some axis and are not limited to specific values. Usually a classification algorithm will try to find a point on that axis that differentiates well between the classes, or use the value to calculate the distance between instances. The salary of an employee is an example of an attribute I would probably use as a numeric attribute.
One more thing you need to take into account is how the classification algorithm treats nominal and numeric attributes. Some algorithms don't handle nominal attributes well. Others will not work well with several numeric attributes if their values are not normalized.
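To make the difference concrete, here is a rough sketch (Python with scikit-learn rather than Weka, and a made-up toy column) of the same classifier trained on an attribute treated once as numeric and once as nominal; one-hot encoding stands in for declaring the attribute nominal, since scikit-learn has no nominal type of its own:

    # Toy sketch: one column treated as a number vs. as an unordered category.
    # The data and the 1/2/3 codes are hypothetical.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import OneHotEncoder

    codes = np.array([[1], [2], [3], [1], [2], [3], [1], [3]])  # e.g. a color stored as 1/2/3
    y = np.array([0, 1, 0, 0, 1, 0, 0, 0])

    # Treated as numeric: the model sees an ordered axis where 1 < 2 < 3.
    clf_numeric = LogisticRegression().fit(codes, y)

    # Treated as nominal: each code becomes its own indicator column, with no order implied.
    enc = OneHotEncoder()
    clf_nominal = LogisticRegression().fit(enc.fit_transform(codes).toarray(), y)

    print(clf_numeric.predict_proba([[2]]))
    print(clf_nominal.predict_proba(enc.transform([[2]]).toarray()))

The two models can make quite different predictions for the same value, which is the kind of gap the question describes.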

Related

How to decide to convert to categorical variable or keep it numeric?

This might be a basic or trivial question with a straightforward answer, but I would still like to ask it to clear my doubt once and for all.
Take the example of Passenger Class in the famous Titanic data. Functionally it is indeed categorical data, so it would make perfect sense to convert it to a categorical variable. As per my understanding, algorithms then tend to see a pattern specific to each class. But at the same time, if you treat it as a numeric variable, it might also denote a range for a decision tree, say passengers between first class and second class.
It looks like both are correct, and both will affect the machine learning algorithm's output in different ways.
Which one is appropriate, and is there an extensive discussion about it anywhere? Should we use such an ambiguous variable as numeric and also keep a copy as a categorical variable, which might prove to be a technique to uncover more patterns?
I suppose it's up to you whether you'd rather interpret a continuous PassengerClass variable as "for every one-unit increase in PassengerClass, the passenger's likelihood of survival goes up/down X%," versus a categorical (factor) PassengerClass as "the likelihoods of survival for groups 2 and 3 (for example, leaving 1st-class passengers as the base group) are X% and Y% higher, respectively, than the base group, holding all else constant."
I think about variables like PassengerClass almost as "treatment groups." Yes, I suppose you could interpret it as continuous, but I think it makes more sense to consider the unique effects of each class like "people who were given the drug versus those who weren't" - you can very easily compare the impacts of being in a higher class (e.g. 2 or 3) to being in the most common class, 1, which again would be left out.
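As a rough sketch of those two readings (Python with statsmodels; the toy data frame below is a made-up stand-in for the Titanic columns):

    # Sketch: the two interpretations of passenger class in a logistic regression.
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical toy data standing in for the Titanic columns.
    df = pd.DataFrame({
        "Pclass":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
        "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
    })

    # Numeric: one coefficient, read as the change in log-odds per one-unit increase in Pclass.
    numeric_fit = smf.logit("Survived ~ Pclass", data=df).fit(disp=0)

    # Factor: one coefficient per class, each relative to the base group (Pclass == 1).
    factor_fit = smf.logit("Survived ~ C(Pclass)", data=df).fit(disp=0)

    print(numeric_fit.params)
    print(factor_fit.params)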
The problem with mapping categorical values to numerical ones is that some algorithms (e.g. neural networks) will interpret the value itself as having a meaning, i.e. you would get different results if you assign the values 1, 2, 3 to the passenger classes than if you assign, for example, 0, 1, 2 or 3, 2, 1. The correspondence between the passenger classes and the numbers is purely conventional and doesn't necessarily convey any additional meaning.
One could argue that the smaller the number, the "better" the class is, but it's still hard to interpret it as "the first class is twice as good as the second class", unless you define some measure of "goodness" that makes the relation between the numbers "1" and "2" sensible.
In this example, you have categorical data that is ordinal - meaning you can rank the categories (from best accommodations to worst, for example) but they're still categories. Regardless of how you label them, there's no actual information about the relative distances among your categories. You can put them in a table, but not (correctly) on a number line. In cases like this, it's generally best to treat your categorical data as independent categories.
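A small sketch of that point (Python, with made-up labellings): the distances implied by arbitrary numeric labels change when you renumber the classes, while a one-hot representation keeps every pair of classes equally far apart.

    # Sketch: distances implied by arbitrary numeric labels vs. a one-hot encoding.
    import numpy as np

    classes = ["1st", "2nd", "3rd"]

    labelling_a = {"1st": 1, "2nd": 2, "3rd": 3}
    labelling_b = {"1st": 3, "2nd": 1, "3rd": 2}   # just as "valid" a convention

    for name, lab in [("labelling_a", labelling_a), ("labelling_b", labelling_b)]:
        print(name, abs(lab["1st"] - lab["3rd"]), abs(lab["1st"] - lab["2nd"]))

    # One-hot: every pair of distinct classes sits at the same distance.
    one_hot = {c: np.eye(len(classes))[i] for i, c in enumerate(classes)}
    print(np.linalg.norm(one_hot["1st"] - one_hot["3rd"]),
          np.linalg.norm(one_hot["1st"] - one_hot["2nd"]))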

AMPL: what's a good way to specify equality constraints for large list of pairs of variable-size sets?

I'm working on a problem that involves reconciling data that represents estimates of the same system under two different classification hierarchies. I want to enforce the requirement that equivalent classes or groups of classes have the same sum.
For example, say Classification A divides industries into: Agriculture (sheep/cattle), Agriculture (non-sheep/cattle), Mining, Manufacturing (textiles), Manufacturing (non-textiles), ...
Meanwhile, Classification B has a different breakdown: Agriculture, Mining (iron ore), Mining (non-iron-ore), Manufacturing (chemical), Manufacturing (non-chemical), ...
In this case, any total for A_Agric_SheepCattle + A_Agric_NonSheepCattle should match the equivalent total for B_Agric; A_Mining should match B_MiningIronOre + B_Mining_NonIronOre; and A_MFG_Textiles+A_MFG_NonTextiles should match B_MFG_Chemical+B_MFG_NonChemical.
For bonus complication, one category may be involved in multiple equivalencies, e.g. B_Mining_IronOre might be involved in an equivalency with both A_Mining and A_Mining_Metallic.
I will be working with multi-dimensional tables, with this sort of concordance applied to more than one dimension - e.g. I might be compiling data on Industry x Product, so each equivalency will be used in multiple constraints; hence I need an efficient way to define them once and invoke repeatedly, instead of just setting a direct constraint "A_Agric_SheepCattle + A_Agric_NonSheepCattle = B_Agric".
The most natural way to represent this sort of concordance would seem to be as a list of pairs of sets. The catch is that the set sizes will vary - sometimes we have a 1:1 equivalence, sometimes it's "these 5 categories equate to those 7 categories", etc.
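Purely for illustration (in Python rather than AMPL, using the hypothetical category names above), this is the kind of structure and check I mean:

    # Illustration only (Python, not AMPL): the concordance as a list of pairs of
    # variable-size sets, with the equal-sum requirement each pair represents.
    concordance = [
        ({"A_Agric_SheepCattle", "A_Agric_NonSheepCattle"}, {"B_Agric"}),
        ({"A_Mining"}, {"B_Mining_IronOre", "B_Mining_NonIronOre"}),
        ({"A_MFG_Textiles", "A_MFG_NonTextiles"}, {"B_MFG_Chemical", "B_MFG_NonChemical"}),
    ]

    def totals_match(totals, pairs):
        # totals maps a category name to its estimated total for one table cell
        return all(sum(totals[a] for a in left) == sum(totals[b] for b in right)
                   for left, right in pairs)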
I found this related question, which offers two answers for dealing with variable-sized sets. One is to define all set members in a single ordered set with indices, then define the starting index for each set within that. However, this seems unwieldy for my problem; both classifications are likely to be long, so I'd need to hop between two very long lists of industries and two very long lists of indices to see a single equivalency. This seems like it would be a nuisance to check, and hard to modify (since any change to membership of one of the early sets changes the index numbers for all following sets).
The other is to define pairs of long fixed-length sets, and then pad each set to the required length with null members.
This would be a much better option for my purposes since it lets me eyeball a single line and see the equivalence that it represents. But it would require a lot of padding; most of the equivalence groups will be small, but a few might be quite large, and everything has to be padded to the length of the largest expected group.
Is there a better approach?

Which algorithm/implementation for weighted similarity between users by their selected, distanced attributes?

Data Structure:
User has many Profiles
(Limit - no more than one of each profile type per user, no duplicates)
Profile has many Attribute Values
(A user can have as many or few attribute values as they like)
Attributes belong to a category
(No overlap. This controls which attribute values a profile can have)
Example/Context:
I believe with stack exchange you can have many profiles for one user, as they differ per exchange site? In this problem:
Profile: Video, so a Video profile only contains Attributes of the Video category
Attribute: an Attribute in the Video category might be Genre
Attribute Values: e.g. Comedy, Action, Thriller are all Attribute Values of the Genre attribute
Profiles and Attributes are just ways of grouping Attribute Values on two levels.
Without grouping (which is needed for weighting in 2. onwards), the relationship is just User hasMany Attribute Values.
Problem:
Give each user a similarity rating against each other user.
Similarity based on All Attribute Values associated with the user.
1. Flat/one level
Unequal number of attribute values between two users
Attribute value can only be selected once per user, so no duplicates
Therefore, binary string/boolean array with Cosine Similarity?
2. 1 + Weight Profiles
Give each profile a weight (totaling 1?)
Work out profile similarity, then multiply by weight, and sum?
3. 1 + Weight Attribute Categories and Profiles
As an attribute belongs to a category, categories can be weighted
Similarity per category, weighted sum, then same by profile?
Or merge profile and category weights
4. 3 + Distance between every attribute value
Table of similarity distance for every possible value vs value
Rather than similarity by value === value
'Close' attributes contribute to overall similarity.
No idea how to do this one
Fancy code and useful functions are great, but I'm really looking to fully understand how to achieve these tasks, so I think generic pseudocode is best.
Thanks!
First of all, you should remember that everything should be made as simple as possible, but not simpler. This rule applies to many areas, but in things like semantics, similarity and machine learning it is essential. Using several layers of abstraction (attributes -> categories -> profiles -> users) makes your model harder to understand and reason about, so I would try to avoid it as much as possible. This means it's highly preferable to keep a direct relation between users and attributes. So, basically, your users should be represented as vectors, where each variable (vector element) represents a single attribute.
If you choose such a representation, make sure all attributes make sense and have an appropriate type in this context. For example, you can represent 5 video genres as 5 distinct variables, but not as numbers from 1 to 5, since cosine similarity (and most other algorithms) will treat them incorrectly (e.g. multiplying thriller, represented as 2, by comedy, represented as 5, which actually makes no sense).
It's ok to use distances between attributes when applicable, though I can hardly come up with an example in your setting.
At this point you should stop reading and try it out: simple representation of users as vector of attributes and cosine similarity. If it works well, leave it as is - overcomplicating a model is never good.
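For illustration, a minimal sketch of that baseline in Python (the attribute names and users are made up):

    # Baseline sketch: users as binary attribute vectors, compared with cosine similarity.
    import numpy as np

    attributes = ["comedy", "action", "thriller", "rock", "jazz"]

    def to_vector(user_attrs):
        return np.array([1.0 if a in user_attrs else 0.0 for a in attributes])

    def cosine(x, y):
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        return float(x @ y / denom) if denom else 0.0

    alice = to_vector({"comedy", "thriller", "jazz"})
    bob = to_vector({"comedy", "action", "rock"})
    print(cosine(alice, bob))   # similarity in [0, 1] for binary vectors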
And if the model performs badly, try to understand why. Do you have enough relevant attributes? Or are there too many noisy variables that only make things worse? Or should some attributes really have larger importance than others? Depending on the answers, you may want to:
Run feature selection to avoid noisy variables.
Transform your variables, representing them in some other "coordinate system". For example, instead of using N variables for N video genres, you may use M other variables to represent closeness to specific social group. Say, 1 for "comedy" variable becomes 0.8 for "children" variable, 0.6 for "housewife" and 0.9 for "old_people". Or anything else. Any kind of translation that seems more "correct" is ok.
Use weights. Not weights for categories or profiles, but weights for distinct attributes. But don't set these weights yourself; instead, run linear regression to find them.
Let me describe the last point in a bit more detail. Instead of simple cosine similarity, whose numerator (ignoring the normalization by vector lengths) looks like this:
cos(x, y) = x[0]*y[0] + x[1]*y[1] + ... + x[n]*y[n]
you may use a weighted version:
cos(x, y) = w[0]*x[0]*y[0] + w[1]*x[1]*y[1] + ... + w[n]*x[n]*y[n]
The standard way to find such weights is to use some kind of regression (linear regression is the most popular). Normally, you collect a dataset (X, y) where X is a matrix with your data vectors as rows (e.g. details of houses being sold) and y is some kind of "correct answer" (e.g. the actual price each house was sold for). However, in your case there's no correct answer for the user vectors themselves; you can only define a correct answer for their similarity. So why not use that? Just make each row of X a combination of 2 user vectors, and the corresponding element of y the similarity between them (which you assign yourself for a training dataset). E.g.:
X[k] = [ user_i[0]*user_j[0], user_i[1]*user_j[1], ..., user_i[n]*user_j[n] ]
y[k] = .75 // or whatever you assign to it
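A rough, runnable sketch of that recipe (Python with scikit-learn; the user vectors and the hand-assigned pair similarities below are made-up placeholders):

    # Sketch: learn per-attribute weights from hand-labelled pair similarities.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    users = np.array([
        [1, 0, 1, 0, 1],
        [1, 1, 0, 1, 0],
        [0, 1, 1, 0, 1],
    ], dtype=float)

    # Labelled training pairs: (index_i, index_j, similarity you assigned by hand).
    labelled_pairs = [(0, 1, 0.40), (0, 2, 0.75), (1, 2, 0.30)]

    X = np.array([users[i] * users[j] for i, j, _ in labelled_pairs])  # element-wise products
    y = np.array([s for _, _, s in labelled_pairs])

    reg = LinearRegression().fit(X, y)
    w = reg.coef_

    def weighted_similarity(a, b):
        return float(np.dot(w, a * b) + reg.intercept_)   # intercept is the learned bias

    print(weighted_similarity(users[0], users[1]))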
HTH

C4.5 algorithm with unbounded attributes

The current implementation of C4.5 in VFDT (http://www.cs.washington.edu/dm/vfml/vfdt.html), or for that matter any other implementation, uses the C4.5 file format for providing the input for constructing the decision tree. According to this, the attributes can have the following formats:
continuous - the attribute has a continuous value.
discrete - the word 'discrete' followed by an integer which indicates how many values the attribute can take.
list of identifiers - a discrete attribute with its values enumerated (this is the preferred method for discrete attributes); the identifiers should be separated by commas.
ignore - the attribute should be ignored and won't be used.
Does anybody know how we can specify discrete-valued attributes whose complete set of possible values is too large to list?
For example "IP-Address" attribute can have Math.Pow(255,4) possible discrete values;
"QueryString" attribute can have infinite number of possible values ... etc.
Can the C4.5 algorithm handle the case where the attribute has say 100,000 discrete distinct values, OR where the exact bound is not known, but only an approximation is known?
Thanks.
The usual choice is to enumerate all the values of a discrete feature that occur in your training set. Since the algorithm can never gather enough statistics for values that are not seen during training, those would be ignored no matter how you'd implement them.
Mind you, it's quite hard to gather statistics for such features anyway, so you might want to think about different representations. In particular, multi-word strings of text can be tokenized and treated as bags of words.
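For example, a sketch (Python, with a hypothetical query string) of turning a free-text attribute into bag-of-words counts instead of one discrete attribute with an unbounded value set:

    # Sketch: represent a free-text attribute as bag-of-words features.
    from collections import Counter
    import re

    def bag_of_words(text):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return Counter(tokens)

    query_string = "category=books&sort=price&q=machine+learning"  # hypothetical value
    print(bag_of_words(query_string))
    # Counter({'category': 1, 'books': 1, 'sort': 1, 'price': 1, 'q': 1, 'machine': 1, 'learning': 1})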

Multi-criteria sorting/distribution into sets

I'm trying to figure out an algorithm...
Input is a bunch of objects that have multiple values (e.g. 3 values per object, colour/taste/age, though there could be more).
The algorithm would then distribute the objects into a pre-defined number of sets. Each set should end up with almost the same number of objects (preferably the object count per set shouldn't differ by more than 1), while achieving as fair a distribution of values per set as possible (e.g. try to have close to the same number of red objects in each set, and the same for the other colours, as well as for tastes and ages, etc.).
Values are tied to objects and cannot be changed. If you move an object from one set to another it brings all its values.
I found this related question: Algorithm for fair distribution of numbers into two sets
and the "number partitioning problem" suggested seems to help with single value distributions, but I'm looking for information/algorithms with multiple values per object (as described above).
Also note that the values cannot be normalized, ie each object cannot be totalled up into a single value.
Thank you kindly for any assistance.
IMHO, you should approach this as a clustering problem: http://en.wikipedia.org/wiki/Cluster_analysis.
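One possible way to read that suggestion, sketched in Python with hypothetical objects: cluster similar objects first (here with scikit-learn's KMeans on one-hot encoded values), then deal each cluster's members round-robin across the target sets so the sets stay balanced in size and mix. The round-robin dealing step is just one simple option, not something the clustering literature prescribes.

    # Sketch: cluster similar objects, then deal each cluster round-robin across the
    # target sets so every set gets a similar mix and sizes differ by at most one.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import OneHotEncoder

    objects = [("red", "sweet", "old"), ("red", "sour", "new"),
               ("blue", "sweet", "old"), ("blue", "sour", "new"),
               ("red", "sweet", "new"), ("blue", "sweet", "new")]
    n_sets = 2

    features = OneHotEncoder().fit_transform(np.array(objects)).toarray()
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

    sets = [[] for _ in range(n_sets)]
    dealt = 0
    for cluster in range(labels.max() + 1):
        for obj_idx in np.where(labels == cluster)[0]:
            sets[dealt % n_sets].append(objects[obj_idx])
            dealt += 1

    for s in sets:
        print(s)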

Resources