Algorithm to combine data for linear fit?

I'm not sure if this is the best place to ask this, but you guys have been helpful with plenty of my CS homework in the past so I figure I'll give it a shot.
I'm looking for an algorithm to blindly combine several dependent variables into an index that produces the best linear fit with an external variable. Basically, it would combine the dependent variables using different mathematical operators, include or exclude each one, and so on, until it arrives at an index that best correlates with my external variable.
Has anyone seen/heard of something like this before? Even if you could point me in the right direction or to the right place to ask, I would appreciate it. Thanks.

Sounds like you're trying to do Multivariate Linear Regression, or Multiple Regression. The simplest (read: least accurate) method is to compute the linear regression line for each component variable individually and then take a weighted average of those lines. Beyond that I am afraid I will be of little help.
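If it helps, here is a rough Python sketch of that idea (scikit-learn assumed; the function name and the R^2 weighting are just my own illustration, not a standard recipe):

import numpy as np
from sklearn.linear_model import LinearRegression

def weighted_average_of_simple_fits(X, y):
    """Fit y ~ x_i for each column separately, then average the per-variable
    predictions, weighting each line by its own R^2 (clipped at zero)."""
    fits, weights = [], []
    for i in range(X.shape[1]):
        xi = X[:, [i]]                                # keep a 2-D shape for sklearn
        model = LinearRegression().fit(xi, y)
        fits.append(model)
        weights.append(max(model.score(xi, y), 0.0))  # R^2 of the single-variable fit
    weights = np.asarray(weights)
    total = weights.sum()
    weights = weights / total if total > 0 else np.full(len(weights), 1.0 / len(weights))

    def predict(X_new):
        per_var = np.column_stack([m.predict(X_new[:, [i]]) for i, m in enumerate(fits)])
        return per_var @ weights                      # weighted average of the individual lines
    return predict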

This appears to be linear regression with multiple explanatory variables. Since the implication is that you are taking a computational approach, you could do something as simple as fitting a linear model to your data for every possible combination of your explanatory variables (whether to include interaction effects is your choice), choosing a goodness-of-fit measure (R^2 being just one example), and using that to rank the models. The quality of a model is also somewhat subjective in many fields - you might reject a model containing 15 variables if it only moderately improves the fit over a far simpler model containing just 3 variables. If you have not read it already, I don't doubt that you will find many useful suggestions in the following text:
Draper, N.R. and Smith, H. (1998). Applied Regression Analysis. Wiley Series in Probability and Statistics.
You might also try googling the LASSO method of model selection.
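For what it's worth, here is a minimal Python sketch of the brute-force "fit every combination of explanatory variables and rank by goodness of fit" idea described above (scikit-learn assumed; X is taken to be a NumPy array of predictors and y the response, and plain R^2 is used for simplicity, although adjusted R^2 or AIC would be fairer to the smaller models):

from itertools import combinations
from sklearn.linear_model import LinearRegression

def rank_all_subsets(X, y, names):
    """Fit a linear model on every non-empty subset of columns and rank by R^2."""
    results = []
    for k in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), k):
            model = LinearRegression().fit(X[:, cols], y)
            results.append((model.score(X[:, cols], y), [names[c] for c in cols]))
    return sorted(results, reverse=True)   # best-fitting variable subsets first

# e.g. rank_all_subsets(X, y, ["V1", "V2", "V3", "V4"])[:5]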

The thing you're asking for is essentially the entirety of regression analysis.
This is what linear regression does, and it is a good portion of what "machine learning" does (machine learning is basically just a name for more complicated regression and classification algorithms). There are hundreds or thousands of different approaches with various tradeoffs, but the basic ones frequently work quite well.
If you want to learn more, the Coursera course on machine learning is a great place to get a deeper understanding of this.

Related

Some confusions in machine learning

I have two points of confusion about using machine learning algorithms. First, I should say that I am just a user of these tools.
1. There are two categories, A and B. If I want to pick out as much of A as possible from their mixture, what kind of algorithm should I use (no need to consider the number of samples)? At first I thought it should be a classification algorithm, and I used, for example, a boosted decision tree (BDT) from the TMVA package, but someone told me that BDT is actually a regression algorithm.
2. I find that when I have raw data, if I analyze it (do some combinations, etc.) before feeding it to the BDT, the result is better than feeding the raw data into the BDT directly. Since the raw data contains all the information, why do I need to analyze it myself?
If anything is not clear, please just add a comment. I hope you can give me some advice.
For 2, you have to perform some manipulation on the data and feed that in to get better performance, because the algorithm is not built to analyze the data for you - it only looks at the data and classifies. The problem of "analysis", as you put it, is called feature selection or feature engineering, and it has to be done by hand (unless, of course, you are using some kind of technique that learns features, e.g. deep learning). In machine learning, it has been observed many times that manipulated/engineered features perform better than raw features.
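To make that concrete, here is a hedged little sketch (scikit-learn's GradientBoostingClassifier standing in for TMVA's BDT, and the data and the engineered "ratio" feature are entirely invented) showing how a hand-engineered feature can help a boosted-tree model:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(500, 2))                                 # two raw variables
y = (X_raw[:, 0] / (np.abs(X_raw[:, 1]) + 0.1) > 1).astype(int)   # class really depends on a ratio

# Engineered feature: the ratio itself, which the trees would otherwise
# have to approximate with many axis-aligned splits.
ratio = (X_raw[:, 0] / (np.abs(X_raw[:, 1]) + 0.1)).reshape(-1, 1)
X_eng = np.hstack([X_raw, ratio])

for label, X in [("raw only", X_raw), ("raw + engineered", X_eng)]:
    score = cross_val_score(GradientBoostingClassifier(), X, y, cv=5).mean()
    print(label, round(score, 3))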
For 1, I think BDT can be used for regression as well as classification. This looks like a classification problem (to choose or not to choose), so you should use a classification algorithm.
Are you sure ML is the approach for your problem? In case it is, some classification algorithms would be:
logistic regression, neural networks, support vector machines, and decision trees, just to name a few.

What is the 'predictive' element of machine learning

I'm hoping someone with a lot more knowledge of machine learning can help me out here. I've been reading examples of regression and classification and I always seem to come back to the question 'what is really the difference between what this algorithm is doing and what standard statistical analysis would do'.
Specifically, none of the examples I read seem to discuss the predictive element. For example, when looking at linear regression, the articles commonly explain the concept of trying to create a 'best fit' - choosing a linear equation and then iterating on a cost function until it reaches a minimum. Of course, throughout, a lot of emphasis is put on a 'training data set'. No problem... but this is usually where it ends. At this point I can't see the difference between the above and the standard way in which one would carry out statistical analysis on a data set that was assumed to have a linear relationship. Presumably, future values here are 'predicted' from the equation that was produced when the cost function converged on a minimum - again, there doesn't seem to be much 'learning' here, as this is exactly what would be done in the usual case.
After a long-winded intro... what I'm trying to ask is: how has the algorithm learned from the original training data, and how does this training set help with future data sets? (Again, this is where I get a bit lost - to me it seems that you would give it a new data set and carry out the same task of minimising the cost function; this time you have a better 'starting' point, but all of your knowledge really comes from what you already 'knew' about the data set, i.e. that one assumed a linear relationship.)
I hope this makes sense - it's clearly a lack of understanding, but I'm hoping someone can shove me in the right direction.
Thanks!
You are right, there is no difference. Linear regression is purely a statistical method, and "fitting" would probably be more accurate than "learning" in this case. But again, this is usually just the first lecture on the subject. There are many approaches where the differences are much clearer, for example SVMs. There are also approaches where the "learning" aspect is much clearer, e.g. using reinforcement learning in games, where you can actually see your system improve its performance with experience.
Anyway, the main subject of machine learning is learning from examples. You are given a list of 100 patients, along with blood pressure, age, cholesterol level etc., and for each of them you are told whether they have heart disease or not. Then you are given a patient you have not seen before. Does he have heart disease? Most people call this prediction. You might prefer to call it fitting, or anything else. But the fact is, it usually works quite well.
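To illustrate that fit-then-predict workflow, here is a toy sketch (the patient numbers are invented, and logistic regression is just one common choice among many):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Columns: blood pressure, age, cholesterol - 100 labelled "patients".
X_train = rng.normal(loc=[130, 55, 200], scale=[15, 10, 30], size=(100, 3))
y_train = (0.02 * X_train[:, 0] + 0.05 * X_train[:, 1]
           + 0.01 * X_train[:, 2] + rng.normal(size=100) > 7.5).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

new_patient = np.array([[145, 62, 240]])       # someone the model has never seen
print(model.predict(new_patient))              # predicted label: heart disease or not
print(model.predict_proba(new_patient))        # and how confident the model is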
Still, the subject remains closely tied to statistics, and indeed, you need to make some assumptions (to a larger or smaller extent, depending on the algorithm) about the underlying function. It is not perfect, but in many cases it's the best thing we have, so I would say it is worth studying. If you are starting now, there is a great online course, Stanford's "Statistical Learning", which deals with the subject from your point of view.

Human-interpretable supervised machine learning algorithm

I'm looking for a supervised machine learning algorithm that would produce transparent rules or definitions that can be easily interpreted by a human.
Most algorithms that I work with (SVMs, random forests, PLS-DA) are not very transparent. That is, you can hardly summarize the models in a table in a publication aimed at a non-computer scientist audience. What authors usually do is, for example, publish a list of variables that are important based on some criterion (for example, Gini index or mean decrease of accuracy in the case of RF), and sometimes improve this list by indicating how these variables differ between the classes in question.
What I am looking for is a relatively simple output of the style "if (any of the variables V1-V10 > median or any of the variables V11-V20 < 1st quartile) and variable V21-V30 > 3rd quartile, then class A".
Is there any such thing around?
Just to constrain my question a bit: I am working with highly multidimensional data sets (tens of thousands to hundreds of thousands of often collinear variables). So, for example, regression trees would not be a good idea (I think).
You sound like you are describing decision trees. Why would regression trees not be a good choice? Maybe not optimal, but they work, and those are the most directly interpretable models. Anything that works on continuous values works on ordinal values.
There's a tension between wanting an accurate classifier, and wanting a simple and explainable model. You could build a random decision forest model, and constrain it in several ways to make it more interpretable:
Small max depth
High minimum information gain
Prune the tree
Only train on "understandable" features
Quantize/round decision thresholds
The model won't be as good, necessarily.
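As a rough sketch of a tree constrained along those lines (toy data, placeholder variable names V1..V5), scikit-learn's export_text will even print the fitted model as nested if/then rules of the kind the question asks about:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))              # invented data; V1..V5 are placeholder names
y = (X[:, 2] > 0.5).astype(int)            # the class mostly depends on V3

tree = DecisionTreeClassifier(
    max_depth=3,                  # small max depth
    min_samples_leaf=20,          # discourages tiny, noisy branches
    min_impurity_decrease=0.01,   # high minimum gain required to split
).fit(X, y)

# Prints nested "Vi <= threshold" rules that a non-specialist can read.
print(export_text(tree, feature_names=[f"V{i+1}" for i in range(X.shape[1])]))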
You can find interesting research on understanding AI methods in the work done by Been Kim at Google Brain.

Separation and pattern matching techniques

I am new to Artificial Neural Networks.
I am interested in an application like this:
I have a significantly large set of objects. Each object has six properties, denoted by P1–P6. Each property has a value which is a symbolic value. In other words, in my example P1–P6 can have a value from the set {A, B, C, D, E, F}. They are not numeric. (Suppose A,B,C,D,E,F are colours; then you will understand my idea.)
Now, there is another property R that I am interested in. Suppose
R = {G1, G2, G3, G4, G5}
I need to train a system on a large set of P1–P6 values and the corresponding R. Now I want to do the following:
1. I have an object and I know the values of P1 to P6. I need to find R (the group that the object belongs to).
2. To get a desired R, what is the pattern I need to have in P1–P6? As an example, given that R = G2, I need to figure out any pattern in P1–P6.
My questions are:
1. What are the theories/technologies/techniques I should read and learn in order to implement 1 and 2, respectively?
2. What are the tools/libraries you can recommend to get this simulated/implemented/tested?
The way you described your problem, you need to look up various machine learning techniques. If it were me, I would try and read about k-NN (k Nearest Neighbours) for the classification. When I say classification, I mean getting the R if you know P1-P6. It is a really simple technique and should be helpful here.
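For that first direction (P1..P6 -> R), a minimal sketch might look like this (scikit-learn assumed; because the properties are symbolic, they are one-hot encoded first, and the tiny dataset and group labels are invented purely for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

SYMBOLS = ["A", "B", "C", "D", "E", "F"]

def one_hot(props):
    """Turn e.g. ["A", "C", "F", "B", "A", "D"] into a 36-element 0/1 vector."""
    return [1 if p == s else 0 for p in props for s in SYMBOLS]

train_objects = [["A", "C", "F", "B", "A", "D"],
                 ["B", "B", "E", "B", "A", "D"],
                 ["F", "C", "A", "E", "E", "C"],
                 ["F", "D", "A", "E", "E", "B"]]
train_groups = ["G1", "G1", "G3", "G3"]     # the known R for each training object

knn = KNeighborsClassifier(n_neighbors=1).fit(
    np.array([one_hot(o) for o in train_objects]), train_groups)

print(knn.predict([one_hot(["A", "B", "F", "B", "A", "D"])]))   # likely "G1" here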
As for the other way around, what you basically need is a representative sample of your population. This is, I think, not so usual, but you could try something like k-means clustering. Clustering methods usually determine the class of an object (property R) by themselves, but k-means is convenient in this situation because you give it the number of object classes (e.g. the number of different possible values of R), and in the end you get one representative sample per class.
You definitely shouldn't go for any really complex techniques (like neural networks) in my opinion since your data doesn't have a precise numerical interpretation and the values can't be interpreted gradually.
The recommended tools really depend on your base programming language. There's a great tool called Orange which is Python-based, and it's my tool of choice for these kinds of things (especially since it is really easy to connect your Python modules with C/C++). If you prefer Java, there's a quite similar tool called Weka that you could use. I think Weka is a little better documented, but I don't like Java so I've never tried it out.
Both of these tools have a graphical clickable interface where you could just load your data and get the classification done, play with the parameters and check what kind of output you get using different techniques and different set-ups. Once you decide that you got the results you need (or if you just don't like graphical interfaces) you can also use both of them as libraries of a kind when programming (Python for Orange and Java for Weka) and make the classification a part of a bigger project.
If you look through the documentation of Orange or Weka, I think it will give you a few ideas about what you could actually do with the data you have. Once you know a few techniques that seem interesting to you and applicable to the data, you will probably get better comments and info on those specific methods here than by just searching for general advice.
You should check out classification algorithms (a subfield of artificial intelligence), especially nearest-neighbour algorithms. Your problem may be solved by several different techniques, which all have different advantages and disadvantages.
However, I do not know of any method in artificial intelligence that allows a two-way classification (in other words, that implements both your requirements 1 and 2 simultaneously). As all you want so far is a bidirectional mapping P1..P6 <=> R, I would suggest just using a mapping table instead of an artificial intelligence algorithm. An AI would work well if you did not know exactly which of your samples is categorized under A..F in P1..P6.
If you insist on using an AI for it, I'd suggest first looking at a perceptron. A perceptron consists of input, intermediate and output neurons. For your example, you'd have the input neurons P1a..P1f, P2a..P2f, ... and five output neurons R1..R5. After training, you should be able to input P1..P6 and get the appropriate R1..R5 as output.
As for frameworks and technologies, I only know of the Business Intelligence suite for Visual Studio, although there are a lot of other AI frameworks out there. Since I have not used any of them (I always coded them myself in C/C++), I can't recommend one.
It seems like a typical classification problem. In case you really have a lot of data, have a look at Apache Mahout, which provides distributed implementations of machine learning algorithms. If you need something less complex for prototyping, TimBL is a nice alternative.

Does fuzzy logic really improve simple machine learning algorithms?

I'm reading about fuzzy logic and I just don't see how it would possibly improve machine learning algorithms in most instances (which it seems to be applied to relatively often).
Take, for example, k nearest neighbors. If you have a bunch of attributes like color: [red, blue, green, orange], temperature: [real number], shape: [round, square, triangle], you can't really fuzzify any of these except for the real-numbered attribute (please correct me if I'm wrong), and I don't see how this can improve anything more than bucketing things together.
How can fuzzy logic be used to improve machine learning? The toy examples you'll find on most websites don't seem to be all that applicable most of the time.
Fuzzy logic is advisable when the variables have a natural shape interpretation. For example, [very few, few, many, very many] have a nice overlapping trapezoid interpretation of values.
Variables like color might not. Fuzzy variables denote degree of membership; that's when they become useful.
Regarding machine learning, it depends on what stage of the algorithm you want to apply fuzzy logic. In my opinion it would be better applied after the clusters are found (using traditional learning techniques), to determine the degree of membership of a certain point in the search space in each cluster. But that doesn't improve the learning per se - it improves classification after learning.
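As a tiny Python illustration of the "overlapping trapezoid" idea mentioned above (the linguistic sets and breakpoints here are invented, not a standard):

def trapezoid(x, a, b, c, d):
    """Membership rises from 0 at a to 1 at b, stays 1 until c, falls back to 0 at d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

SETS = {"very few": (0, 0, 5, 15), "few": (5, 15, 25, 40),
        "many": (25, 40, 60, 80), "very many": (60, 80, 1e9, 1e9)}

x = 30
print({name: round(trapezoid(x, *abcd), 2) for name, abcd in SETS.items()})
# x = 30 is partly "few" and partly "many" at the same time - that overlap
# is the degree-of-membership idea.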
[round, square, triangle] are mostly ideal categories, which exist primarily in geometry (i.e. in theory). In the real world, some shapes might be almost square or more or less round (circular). There are many nuances of red, and some colors are closer to some others (ask someone to explain turquoise, for example). Hence, although abstract categories and specific reference values are useful, in the real world objects or values are not necessarily equal to those references.
Fuzzy membership allows you to measure how far some specific object is from an ideal. Using this measure lets one avoid a flat "no, it's not circular" (which might lead to information loss) and instead make use of a measure of how (non-)circular the given object is.
In my view, fuzzy logic is not a practically viable approach to anything unless you are building a purpose-built fuzzified controller or some rule-based structure, for example for compliance/policies. Fuzzy logic does mean dealing with everything between and including 0 and 1, but I find it a bit flawed when you approach more complicated problems where you need to apply fuzzy-logic aspects in higher-dimensional spaces. You can still approach multivariate problems without having to look at fuzzy logic. Unfortunately, having studied fuzzy logic, I found myself disagreeing with the principles behind fuzzy sets: in large-dimensional spaces they seem infeasible, impractical, and not very logically sound. The natural-language base that you would be applying in your fuzzy-set solution is also very ad hoc - what exactly is [very, few, many]? That is entirely something you define in your application.
For a lot of machine learning problems, you will find that you don't even have to go so far as to build natural-language underpinnings into your model. In fact, you will often achieve even better results without applying fuzzy logic to any aspect of your model.
Just to irritate you a bit by forcibly adding fuzziness to this: if, instead of the "shape" attribute, you had a "number of sides" attribute, it could be further divided into "less", "medium", "many" and "uncountable", and a square could be part of both "less" and "medium" given appropriate membership functions. In place of the "color" attribute, if you had a "red" attribute, then a membership function could be built using the RGB code. So, as my experience in data mining says: every method can be applied to every dataset, and what works, works.
Couldn't one just convert discrete sets into continuous ones and get the same effects as fuzziness, while being able to use all the techniques of probability theory?
For instance size ['small', 'medium', 'big'] ==> [0,1]
It's not clear to me what you're trying to accomplish in the example you give (shapes, colors, etc.). Fuzzy logic has been used successfully with machine learning, but personally I think it is probably more often useful in constructing policies. Rather than go on about it, I refer you to an article I published in the Mar/Apr-2002 issue of "PC AI" magazine, which hopefully makes the idea clear:
Putting Fuzzy Logic to Work: An Introduction to Fuzzy Rules
