C4.5 algorithm with unbounded attributes

The current implementation of C4.5 in VFDT (http://www.cs.washington.edu/dm/vfml/vfdt.html), like any other implementation I know of, uses the C4.5 file format for the inputs used to construct the decision tree. In that format, an attribute can be declared in one of the following ways:
continuous: the attribute has a continuous value.
discrete: the word 'discrete' followed by an integer that indicates how many values the attribute can take.
list of identifiers: a discrete attribute with its values enumerated (this is the preferred method for discrete attributes). The identifiers should be separated by commas.
ignore: the attribute should be ignored; it won't be used.
Does anybody know how we can specify discrete-valued attributes whose complete set of possible values is too large to list?
For example, an "IP-Address" attribute can have Math.Pow(256,4) possible discrete values;
a "QueryString" attribute can have an unbounded number of possible values, and so on.
Can the C4.5 algorithm handle the case where the attribute has, say, 100,000 distinct discrete values, or where the exact bound is not known and only an approximation is available?
Thanks.

The usual choice is to enumerate all the values of a discrete feature that occur in your training set. Since the algorithm can never gather enough statistics for values that are not seen during training, those would be ignored no matter how you'd implement them.
Mind you, it's quite hard to gather statistics for such features anyway, so you might want to think about different representations. In particular, multi-word strings of text can be tokenized and treated as bags of words.
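As a rough illustration of both suggestions, here is a minimal Python sketch. The helper names names_entry and bag_of_words are made up for this example, and the output line assumes the usual "attribute: v1, v2." layout of a C4.5 .names entry; check it against the format your tool actually expects.

```python
import re


def names_entry(attribute, training_values):
    """Build a .names-style line listing only the values seen in training."""
    seen = sorted(set(training_values))
    return f"{attribute}: {', '.join(seen)}."


def bag_of_words(text):
    """Split a free-form string into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical training data: only two distinct addresses ever occur.
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.1"]
entry = names_entry("IP-Address", ips)

# A query string becomes a handful of reusable tokens instead of one unique value.
tokens = bag_of_words("q=decision+tree&lang=en")
```

Each token can then become its own yes/no attribute, which gives the learner far more examples per value than the raw string would.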


How to rotate a word2vec onto another word2vec?

I am training multiple word2vec models with Gensim. Each model will have the same parameters and dimensionality, but will be trained on slightly different data. I then want to compare how the change in data affected the vector representation of some words.
But every time I train a model, the vector representation of the same word is wildly different. A word's similarities to other words remain about the same, but the whole vector space seems to be rotated.
Is there any way I can rotate both word2vec representations so that the same words occupy the same position in the vector space, or are at least as close as possible?
Thanks in advance.
That the locations of words vary between runs is to be expected. There's no one 'right' place for words, just mutual arrangements that are good at the training task (predicting words from other nearby words) – and the algorithm involves random initialization, random choices during training, and (usually) multithreaded operation which can change the effective ordering of training examples, and thus final results, even if you were to try to eliminate the randomness by reliance on a deterministically-seeded pseudorandom number generator.
There's a class called TranslationMatrix in gensim that implements the learn-a-projection-between-two-spaces method, as used for machine-translation between natural languages in one of the early word2vec papers. It requires you to have some words that you specify should have equivalent vectors – an anchor/reference set – then lets other words find their positions in relation to those. There's a demo of its use in gensim's documentation notebooks:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb
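To make the idea concrete, here is a minimal pure-Python sketch of learning a linear projection from anchor words by least squares. It is not gensim's TranslationMatrix code; the function names and the toy 2-d vectors are assumptions for illustration, and real embeddings would call for a numerical library.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ w = b for w by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_projection(source_anchors, target_anchors):
    """Least-squares W minimising ||source @ W - target|| over the anchor rows."""
    xt = transpose(source_anchors)
    return solve(matmul(xt, source_anchors), matmul(xt, target_anchors))


# Toy anchors: the target space is the source space rotated by 90 degrees.
source_anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_anchors = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]
W = fit_projection(source_anchors, target_anchors)

# Map a non-anchor word from the source space into the target space.
mapped = matmul([[2.0, 1.0]], W)[0]  # approximately [-1.0, 2.0]
```

The more anchor words you supply, the better the fit is constrained; with fewer anchors than dimensions, the normal equations used here become singular.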
But, there are some other techniques you could also consider:
transform & concatenate the training corpuses instead, to both retain some words that are the same across all corpuses (such as very frequent words), but make other words of interest different per segment. For example, you might leave words like "hot" and "cold" unchanged, but replace words like "tamale" or "skiing" with subcorpus-specific versions, like "tamale(A)", "tamale(B)", "skiing(A)", "skiing(B)". Shuffle all data together for training in a single session, then check the distances/directions between "tamale(A)" and "tamale(B)" - since they were each only trained by their respective subsets of the data. (It's still important to have many 'anchor' words, shared between different sets, to force a correlation on those words, and thus a shared influence/meaning for the varying-words.)
create a model for all the data, with a single vector per word. Save that model aside. Then, re-load it, and try re-training it with just subsets of the whole data. Check how much words move, when trained on just the segments. (It might again help comparability to hold certain prominent anchor words constant. There's an experimental property in the model.trainables, with a name ending _lockf, that lets you scale the updates to each word. If you set its values to 0.0, instead of the default 1.0, for certain word slots, those words can't be further updated. So after re-loading the model, you could 'freeze' your reference words, by setting their _lockf values to 0.0, so that only other words get updated by the secondary training, and they're still bound to have coordinates that make sense with regard to the unmoving anchor words. Read the source code to better understand how _lockf works.)
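The first of these techniques (tagging the words of interest per subcorpus) could be prepared roughly as follows. The function name tag_corpus and the tiny corpora are hypothetical; the output is just a shuffled list of token lists, ready to use as training sentences.

```python
import random


def tag_corpus(sentences, label, words_of_interest):
    """Suffix the words of interest with a subcorpus label; leave anchors as-is."""
    return [
        [f"{w}({label})" if w in words_of_interest else w for w in sentence]
        for sentence in sentences
    ]


# Toy tokenized corpora; "hot" and "cold" act as shared anchor words.
corpus_a = [["hot", "tamale"], ["cold", "skiing"]]
corpus_b = [["hot", "skiing"], ["cold", "tamale"]]
targets = {"tamale", "skiing"}

combined = tag_corpus(corpus_a, "A", targets) + tag_corpus(corpus_b, "B", targets)
random.Random(0).shuffle(combined)  # interleave the subcorpora before training
```

After training a single model on combined, "tamale(A)" and "tamale(B)" live in the same space and can be compared directly.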

Numeric to nominal filter

When is it compulsory to use the filter to change the data type to nominal? I am doing classification right now, and the results differ by a huge margin when I convert the attribute to nominal compared to leaving it numeric. Thank you in advance.
I don't think your question is well formed, but I will try to answer it anyway.
Nominal and numeric attributes represent different types of attributes and therefore are treated differently by machine learning algorithms.
Nominal attributes are limited to a closed set of values, and there is no order or any other relation between those values. Nominal attributes should usually have a small number of possible values (a large set of possible values may cause over-fitting). The color of a car is an example of an attribute that would probably be represented as a nominal attribute.
Numeric attributes are usually more common. They represent values on some axis and are not limited to specific values. Classification algorithms will usually either try to find a point on that axis that differentiates well between the classes, or use the value to calculate the distance between instances. The salary of an employee is an example of an attribute I would probably treat as numeric.
One more thing you need to take into account is how the classification algorithm treats nominal and numeric attributes. Some algorithms don't handle nominal attributes well. Other algorithms will not work well with several numeric attributes if the values of those attributes are not normalized.
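The normalization point is easiest to see with a distance-based method. The sketch below uses made-up age and salary figures and a hypothetical nearest helper; it only illustrates how min-max scaling can change which instance counts as the nearest neighbour.

```python
import math


def min_max_scale(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def nearest(points, i):
    """Index of the point closest to points[i] (Euclidean distance)."""
    others = [j for j in range(len(points)) if j != i]
    return min(others, key=lambda j: math.dist(points[i], points[j]))


# Invented data: salary differences dwarf age differences in raw units.
ages = [25, 27, 60, 40]
salaries = [30000, 34000, 31000, 90000]

raw = list(zip(ages, salaries))
scaled = list(zip(min_max_scale(ages), min_max_scale(salaries)))

raw_neighbour = nearest(raw, 0)        # salary dominates: the 60-year-old wins
scaled_neighbour = nearest(scaled, 0)  # after scaling: the 27-year-old wins
```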

How to decide to convert to categorical variable or keep it numeric?

This might be a basic or trivial question with a straightforward answer. Still, I would like to ask it to clear up my doubt once and for all.
Take the example of Passenger Class in the famous Titanic data. Functionally it is categorical data, so it makes perfect sense to convert it to a categorical variable. As I understand it, algorithms then tend to look for a pattern specific to each class. But if you treat it as a numeric variable, a decision tree could also split on a range, say passengers between first class and second class.
It looks like both are correct, and each will affect the machine learning algorithm's output in a different way.
Which one is appropriate, and is there an extensive discussion of this anywhere? Should we include such ambiguous variables both as numeric and, as a copy, as categorical, which might help uncover more patterns?
I suppose it's up to you whether you'd rather interpret a continuous PassengerClass variable as "for every one-unit increase in PassengerClass, the passenger's likelihood of survival goes up/down X%," versus a categorical (factor) PassengerClass as, "the likelihoods of survival for groups 2 and 3 (for example, leaving 1st-class passengers as the base group) are X% and Y% higher, respectively, than the base group, holding all else constant."
I think about variables like PassengerClass almost as "treatment groups." Yes, I suppose you could interpret it as continuous, but I think it makes more sense to consider the unique effects of each class, like "people who were given the drug versus those who weren't": you can very easily compare the impact of being in another class (e.g. 2 or 3) to being in the base class, 1, which again would be left out.
The problem with mapping categorical notions to numerical is that some algorithms (e.g. neural networks) will interpret the value itself as having a meaning, i.e. you would get different results if you assign values 1,2,3 to passenger classes than, for example 0,1,2 or 3,2,1. The correspondence between the passenger classes and numbers is purely conventional and doesn't necessarily convey any additional meaning.
One could argue that the lower the number, the "better" the class is. However, it's still hard to interpret it as "the first class is twice as good as the second class", unless you define some measure of "goodness" that makes the relation between the numbers "1" and "2" sensible.
In this example, you have categorical data that is ordinal - meaning you can rank the categories (from best accommodations to worst, for example) but they're still categories. Regardless of how you label them, there's no actual information about the relative distances among your categories. You can put them in a table, but not (correctly) on a number line. In cases like this, it's generally best to treat your categorical data as independent categories.
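A small sketch of the difference, using hypothetical helper names: with the integer labels, class 1 is "further" from class 3 than from class 2, while a one-hot encoding keeps every pair of classes equally far apart, so no spacing between the categories is implied.

```python
def one_hot(value, categories):
    """Encode a categorical value as one indicator per category."""
    return [1 if value == c else 0 for c in categories]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


classes = [1, 2, 3]
encoded = {c: one_hot(c, classes) for c in classes}

# Integer coding: the distance depends on the arbitrary labels.
numeric_gap_1_2 = abs(1 - 2)
numeric_gap_1_3 = abs(1 - 3)

# One-hot coding: every pair of distinct classes is equally far apart.
onehot_gap_1_2 = sq_dist(encoded[1], encoded[2])
onehot_gap_1_3 = sq_dist(encoded[1], encoded[3])
```

For a regression-style model, you would typically drop one of the indicator columns so that class serves as the base group.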

AMPL: what's a good way to specify equality constraints for large list of pairs of variable-size sets?

I'm working on a problem that involves reconciling data that represents estimates of the same system under two different classification hierarchies. I want to enforce the requirement that equivalent classes or groups of classes have the same sum.
For example, say Classification A divides industries into: Agriculture (sheep/cattle), Agriculture (non-sheep/cattle), Mining, Manufacturing (textiles), Manufacturing (non-textiles), ...
Meanwhile, Classification B has a different breakdown: Agriculture, Mining (iron ore), Mining (non-iron-ore), Manufacturing (chemical), Manufacturing (non-chemical), ...
In this case, any total for A_Agric_SheepCattle + A_Agric_NonSheepCattle should match the equivalent total for B_Agric; A_Mining should match B_MiningIronOre + B_Mining_NonIronOre; and A_MFG_Textiles+A_MFG_NonTextiles should match B_MFG_Chemical+B_MFG_NonChemical.
For bonus complication, one category may be involved in multiple equivalencies, e.g. B_Mining_IronOre might be involved in an equivalency with both A_Mining and A_Mining_Metallic.
I will be working with multi-dimensional tables, with this sort of concordance applied to more than one dimension - e.g. I might be compiling data on Industry x Product, so each equivalency will be used in multiple constraints; hence I need an efficient way to define them once and invoke repeatedly, instead of just setting a direct constraint "A_Agric_SheepCattle + A_Agric_NonSheepCattle = B_Agric".
The most natural way to represent this sort of concordance would seem to be as a list of pairs of sets. The catch is that the set sizes will vary - sometimes we have a 1:1 equivalence, sometimes it's "these 5 categories equate to those 7 categories", etc.
I found this related question which offers two answers for dealing with variable-sized sets. One is to define all set members in a single ordered set with indices, then define the starting index for each set within that. However, this seems unwieldy for my problem; both classifications are likely to be long, so I'd need to be hopping between two loooong lists of industries and two looong lists of indices to see a single equivalency. This seems like it would be a nuisance to check, and hard to modify (since any change to membership for one of the early sets changes the index numbers for all following sets).
The other is to define pairs of long fixed-length sets, and then pad each set to the required length with null members.
This would be a much better option for my purposes since it lets me eyeball a single line and see the equivalence that it represents. But it would require a LOT of padding; most of the equivalence groups will be small but a few might be quite large, and everything has to be padded to the size of the largest expected length.
Is there a better approach?

Multi-criteria sorting/distribution into sets

I'm trying to figure out an algorithm...
Input is a bunch of objects that have multiple values (eg 3 values per object, colour/taste/age, though it could be more).
The algorithm would then distribute the objects into a pre-defined number of sets. Each set should end up with almost the same number of objects (preferably the object count per set shouldn't differ more than 1), and achieve the objective of as fair a distribution of values per set as possible (eg try to have close to as many red in each set, and same for other colours, as well as tastes and ages, etc).
Values are tied to objects and cannot be changed. If you move an object from one set to another it brings all its values.
I found this related question: Algorithm for fair distribution of numbers into two sets
and the "number partitioning problem" suggested seems to help with single value distributions, but I'm looking for information/algorithms with multiple values per object (as described above).
Also note that the values cannot be normalized, ie each object cannot be totalled up into a single value.
Thank you kindly for any assistance.
IMHO, you should approach this as a clustering problem: http://en.wikipedia.org/wiki/Cluster_analysis
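For comparison, the requirement in the question can also be attacked directly with a simple greedy heuristic. This is not the clustering approach suggested above, and it is not guaranteed to be optimal. It is only a sketch; the function name distribute and the toy objects are made up for the example.

```python
from collections import Counter


def distribute(objects, n_sets):
    """Greedily place each object in one of the currently smallest sets,
    picking the one that holds the fewest objects sharing its values."""
    sets = [[] for _ in range(n_sets)]
    counts = [Counter() for _ in range(n_sets)]  # (attribute, value) tallies per set
    for obj in objects:
        keys = list(obj.items())
        smallest = min(len(s) for s in sets)
        # Only the smallest sets are candidates, so sizes never differ by more than 1.
        candidates = [i for i in range(n_sets) if len(sets[i]) == smallest]
        best = min(candidates, key=lambda i: sum(counts[i][k] for k in keys))
        sets[best].append(obj)
        counts[best].update(keys)
    return sets


objects = [
    {"colour": "red", "taste": "sweet"},
    {"colour": "red", "taste": "sour"},
    {"colour": "blue", "taste": "sweet"},
    {"colour": "blue", "taste": "sour"},
]
result = distribute(objects, 2)
```

The result depends on the input order, so shuffling the objects and keeping the best of several runs is a cheap way to improve on a single pass.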
