I've been set the assignment of producing a solution for the capacitated vehicle routing problem using any algorithm that learns. From my brief search of the literature, tabu search variants seem to be the most successful. Can they be classed as learning algorithms though or are they just variants on local search?
Search methods are not "learning". In the context of computer science, "learning" refers to learning machines, which improve their quality with training (experience). Metaheuristics that simply search through some space do not "learn"; they browse possible solutions (in a heuristically guided manner) in order to optimize some function. In other words, optimization techniques are used to train models, but these optimizers themselves don't "learn". This is largely a linguistic matter, but I would distinguish between methods that learn, in the sense of trying to generalize knowledge from some set of examples, and algorithms that simply search for the best parameters of an arbitrary given function. The core idea of machine learning (which distinguishes it from optimization itself) is that the aim is to maximize the quality of the model on unknown data, while in optimization (and in particular tabu search) we are simply looking for the best quality on exactly known, well-defined data (a function).
The Rete Algorithm is an efficient pattern matching algorithm that compares a large collection of patterns to a large collection of objects. It is also used in one of the expert system shells I am exploring right now: Drools.
What is the time complexity of the algorithm, based on the number of rules I have?
Here is a link for Rete Algorithm: http://www.balasubramanyamlanka.com/rete-algorithm/
Also for Drools: https://drools.org/
Estimating the complexity of RETE is a non-trivial problem.
Firstly, you cannot use the number of rules as a dimension. What you should look at are the single constraints or matches the rules contain. You can see a rule as a collection of constraints grouped together. This is all RETE reasons about.
Once you have a rough estimate of the number of constraints your rule base has, you will need to look at those which are inter-dependent. Inter-dependent constraints are the most complex matches and are pretty similar in concept to JOINs in SQL queries. Their complexity varies based on their nature as well as the state of your working memory.
Then you will need to look at the size of your working memory. The number of facts you assert within a RETE-based expert system strongly influences its performance.
Lastly, you need to consider the engine conflict resolution strategy. If you have several conflicting rules, it might take a lot of time to figure out in which order to execute them.
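To make the JOIN analogy concrete, here is a toy sketch in plain Python (not Drools code; the fact types and helper names are invented for illustration) showing why an inter-dependent constraint is fundamentally more expensive than per-fact filtering: its cost grows with the product of the matching fact counts.

```python
from itertools import product

# Toy working memory: facts are (type, attributes) tuples.
facts = [
    ("Person", {"name": "ann", "city": "rome"}),
    ("Person", {"name": "bob", "city": "oslo"}),
    ("Order",  {"owner": "ann", "total": 10}),
    ("Order",  {"owner": "bob", "total": 99}),
    ("Order",  {"owner": "ann", "total": 7}),
]

def alpha(facts, ftype):
    """Cheap per-fact filtering on a single constraint (alpha network)."""
    return [attrs for t, attrs in facts if t == ftype]

def beta_join(people, orders):
    """An inter-dependent constraint (Person.name == Order.owner),
    conceptually a JOIN: cost grows with |people| * |orders|."""
    comparisons = 0
    matches = []
    for p, o in product(people, orders):
        comparisons += 1
        if p["name"] == o["owner"]:
            matches.append((p["name"], o["total"]))
    return matches, comparisons

people = alpha(facts, "Person")
orders = alpha(facts, "Order")
matches, comparisons = beta_join(people, orders)
print(matches)      # [('ann', 10), ('ann', 7), ('bob', 99)]
print(comparisons)  # 6 = 2 people * 3 orders
```

RETE's contribution is caching these partial join results between working-memory changes, but the join structure above is still what dominates the cost.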
Regarding RETE performance, there is a very good PhD dissertation I'd suggest you look at. The author is Robert B. Doorenbos and the title is "Production Matching for Large Learning Systems".
I have two points of confusion when using machine learning algorithms. First, I should say that I am just a user of them.
There are two categories, A and B. If I want to pick as many A as possible out of their mixture, what kind of algorithm should I use (no need to consider the number of samples)? At first I thought it should be a classification algorithm, and I used, for example, a boosted decision tree (BDT) in the package TMVA, but someone told me that BDT is actually a regression algorithm.
I find that when I have raw data, if I analyze it (do some combinations, etc.) before feeding it to the BDT, the result is better than feeding the raw data in directly. Since the raw data contains all the information, why do I need to analyze it myself?
If anything is not clear, please add a comment. I hope you can give me some advice.
For 2, you have to perform some manipulation on the data before feeding it in, because the algorithm has no built-in way to analyze it; it only looks at the data and classifies. The problem of "analysis", as you put it, is called feature selection or feature engineering, and it usually has to be done by hand (unless you are using some technique that learns features, e.g. deep learning). In machine learning it has been observed many times that manipulated/engineered features perform better than raw features.
For 1, I think BDT can be used for regression as well as classification. This looks like a classification problem (to choose or not to choose), hence you should use a classification algorithm.
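To illustrate why the "BDT is regression" comment is not a contradiction, here is a rough from-scratch sketch (not TMVA's actual implementation; all names and numbers are toy choices) of boosting decision stumps: the ensemble produces a real-valued score, and thresholding the sign of that score is what turns the regression-like output into a class decision.

```python
import math

def stump_train(X, y, w):
    """Find the best 1-D threshold stump under sample weights w.
    Returns (feature, threshold, polarity, weighted_error)."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (pol if xi[f] >= thr else -pol) != yi)
                if best is None or err < best[3]:
                    best = (f, thr, pol, err)
    return best

def stump_predict(stump, x):
    f, thr, pol, _ = stump
    return pol if x[f] >= thr else -pol

def adaboost(X, y, rounds=5):
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = stump_train(X, y, w)
        err = max(stump[3], 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Re-weight: boost the examples this stump got wrong.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def classify(ensemble, x):
    # The ensemble outputs a real-valued score; taking its sign
    # turns the "regression" output into a class decision.
    score = sum(a * stump_predict(s, x) for a, s in ensemble)
    return 1 if score >= 0 else -1

# Class A = +1, class B = -1, one feature.
X = [[0.1], [0.4], [0.35], [0.8], [0.9], [0.7]]
y = [-1, -1, -1, 1, 1, 1]
model = adaboost(X, y)
print([classify(model, x) for x in X])  # [-1, -1, -1, 1, 1, 1]
```

Gradient-boosted trees used in practice work the same way at this level: trees fit real-valued scores, and a final thresholding step gives the classification.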
Are you sure ML is the approach for your problem? In case it is, some classification algorithms would be:
logistic regression, neural networks, support vector machines, decision trees, just to name a few.
As stated in the title, I'm simply looking for algorithms or solutions one might use to take in the twitter firehose (or a portion of it) and
a) identify questions in general
b) for a question, identify questions that could be the same, with some degree of confidence
Thanks!
(A)
I would try to identify questions using machine learning and the Bag of Words model.
Create a labeled set of tweets, and label each of them with a binary flag: question or not question.
Extract the features from the training set. The features are traditionally words, but at least every time I tried it, using bi-grams significantly improved the results (3-grams were not helpful for my cases).
Build a classifier from the data. I usually found that SVM gives better performance than other classifiers, but you can use others as well, such as Naive Bayes or KNN (though you will probably need a feature selection algorithm for these).
Now you can use your classifier to classify a tweet. (1)
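As a rough illustration of the steps above, here is a self-contained sketch. It uses Naive Bayes with unigram plus bi-gram features, purely to keep it dependency-free; in practice you would use SVM as suggested, and a far larger labeled set (the tweets here are invented).

```python
import math
from collections import Counter

def features(text):
    """Bag of words plus bi-grams, as suggested above."""
    words = text.lower().split()
    return words + [" ".join(p) for p in zip(words, words[1:])]

def train(labeled):
    """Multinomial Naive Bayes with add-one smoothing.
    labeled: list of (text, label), label in {'question', 'not'}."""
    counts = {"question": Counter(), "not": Counter()}
    docs = Counter()
    for text, label in labeled:
        docs[label] += 1
        counts[label].update(features(text))
    vocab = set(counts["question"]) | set(counts["not"])
    return counts, docs, vocab

def classify(model, text):
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best_label, best_lp = None, None
    for label in counts:
        lp = math.log(docs[label] / total_docs)
        denom = sum(counts[label].values()) + len(vocab)
        for f in features(text):
            lp += math.log((counts[label][f] + 1) / denom)
        if best_lp is None or lp > best_lp:
            best_label, best_lp = label, lp
    return best_label

tweets = [
    ("how do i reset my password", "question"),
    ("what time is the game tonight", "question"),
    ("does anyone know a good pizza place", "question"),
    ("just had a great lunch", "not"),
    ("loving the weather today", "not"),
    ("my new phone arrived", "not"),
]
model = train(tweets)
print(classify(model, "how do i fix my phone"))  # question
```

The structure is the same whichever classifier you plug in: extract features, train on the labeled set, then classify new tweets.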
(B)
This issue is referred to in the world of Information Retrieval as "duplicate detection" or "near-duplicate detection".
You can at least find questions which are very similar to each other using Semantic Interpretation, as described by Markovitch and Gabrilovich in their wonderful article Wikipedia-based Semantic Interpretation for Natural Language Processing. At the very least, it will help you identify if two questions are discussing the same issues (even though not identical).
The idea goes like this:
Use Wikipedia to build a vector that represents the semantics of a term t: the entry vector_t[i] is the tf-idf score of term i as it co-appears with the term t. The idea is described in detail in the article; reading the first 3-4 pages is enough to understand it. No need to read it all. (2)
For each tweet, construct a vector which is a function of the vectors of its terms. Compare between two vectors - and you can identify if two questions are discussing the same issues.
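A simplified sketch of the comparison step: as a stand-in for the Wikipedia-based concept space described in the article, the dimensions here are just the corpus terms themselves with tf-idf weights, and tweets are compared by cosine similarity (the tweets are invented examples).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors. In the ESA approach described above the
    dimensions would be Wikipedia concepts; here, as a simplified
    stand-in, they are just the terms of the corpus itself."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

tweets = [
    "how do i install python on windows",
    "what is the best way to install python on windows",
    "who won the football match last night",
]
vecs = tfidf_vectors(tweets)
# The two installation questions are more similar to each other
# than either is to the football tweet.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

With the real semantic vectors, the same cosine comparison can flag two questions as discussing the same issue even when they share few surface words.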
EDIT:
On second thought, the BoW model is not a good fit here, since it ignores the position of terms. However, I believe that if you add NLP processing for extracting features (for example, for each term, also denote whether it is pre-subject or post-subject, as determined using NLP processing), combining it with machine learning will yield pretty good results.
(1) For evaluation of your classifier, you can use cross-validation, and check the expected accuracy.
(2) I know Evgeny Gabrilovich published the algorithm they implemented as an open-source project; you just need to look for it.
Tabu Search may be used within Genetic Algorithms.
Genetic Algorithms may need many generations to succeed, so running at high performance is important for them. Tabu Search helps avoid local maxima, and its memory mechanism gives better results over the iterations. However, Tabu Search usually makes the algorithm slower, despite its benefits.
My question is:
Is there any research about when to use Tabu Search with Genetic Algorithms and when not?
Generally speaking, GAs spend a lot of time sampling points that are trivially suboptimal. Suppose you're optimizing a function that looks like a couple of camel humps. GAs will dump points all over the place initially, and slowly converge to the points being at the top of the humps. However, even a very simple local search algorithm can take a point that the GA generates on the slope of a hump and push it straight to the top of the hump essentially immediately. If you let every point the GA generates go through this simple local optimization, then you end up with a GA searching only the space of local optima, which generally will greatly improve your chances of finding the best solutions. The problem is that when you start on real problems instead of camel humps, simple local search algorithms often aren't powerful enough to find the really good local optima, but something like tabu search can be used in their place.
There are two drawbacks. One, each generation of the GA goes much more slowly (but you need many fewer generations usually). Two, you lose some diversity, which can cause you to converge to a suboptimal solution more often.
In practice, I would always include some form of local search inside a GA whenever possible. No Free Lunch tells us that sometimes you'll make things worse, but after ten years or so of doing GA and local search research professionally, I'd pretty much always put up a crisp new $100 bill that says that local search will improve things for the majority of cases you really care about. It doesn't have to be tabu search; you could use Simulated Annealing, VDS, or just a simple next-ascent hill climber.
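The camel-humps argument above can be sketched in a few lines. This is a toy one-dimensional version (all parameters and operators are illustrative choices, and a plain next-ascent hill climber stands in for tabu search): every offspring is pushed to the top of its hump before selection, so the GA effectively searches only the space of local optima.

```python
import random

random.seed(0)

def f(x):
    """Two 'camel humps': local optimum near x=-1 (f=1), global near x=1 (f=2)."""
    return -(x - 1) ** 2 + 2 if x > 0 else -(x + 1) ** 2 + 1

def hill_climb(x, step=0.05, iters=200):
    """Simple next-ascent local search: push a point to the top of its hump."""
    for _ in range(iters):
        for cand in (x - step, x + step):
            if f(cand) > f(x):
                x = cand
                break
    return x

def ga(pop_size=10, generations=15, memetic=True):
    pop = [random.uniform(-3, 3) for _ in range(pop_size)]
    for _ in range(generations):
        if memetic:
            pop = [hill_climb(x) for x in pop]  # local search on every point
        pop.sort(key=f, reverse=True)
        parents = pop[: pop_size // 2]          # keep the best half
        children = [(random.choice(parents) + random.choice(parents)) / 2
                    + random.gauss(0, 0.3)
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=f)

best = ga()
print(f(best) > 1.9)  # True: essentially at the global optimum f(1) = 2
```

Swapping `hill_climb` for tabu search (or simulated annealing, VDS, etc.) changes only the local-search step; the hybrid structure stays the same.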
When you combine multiple heuristics together, you have what's referred to as a hybrid-heuristic.
It has been a trend in the last decade or so to explore the advantages and disadvantages of hybrid-heuristics in the optimisation "crowd".
There are literally hundreds of papers available on the topic and a lot of them are quite good. I have seen papers which employ a local-search (hill-climbing, not Tabu) for each offspring at each generation of GA to direct each offspring to the local optimum. The authors report good results. I have also seen papers which use a GA to optimise the cooling schedule of a simulated annealing algorithm for both a particular problem instance and also for a general case and have good results. I've also read a paper which adds a tabu list to a simulated annealing algorithm so that it prevents revisiting solutions it has seen in the past n iterations, unless some aspiration function is satisfied.
If you're working on timetabling (as your other comment suggests), I suggest you read some papers from PATAT (practice and theory in automated timetabling), particularly from E.K.Burke and P.Brucker who are very active and well-known in the field. A lot of the PATAT proceedings are freely available.
Try a Scholar search like this:
http://scholar.google.com/scholar?q=%22hybrid+heuristics%22+%22combinatorial+optimization%22+OR+timetabling+OR+scheduling&btnG=&hl=en&as_sdt=0%2C5&as_ylo=2006
It is very difficult to prove the convergence of these sorts of heuristics mathematically. I have seen a Markov chain representation of simulated annealing which shows upper and lower bounds on convergence, and something similar exists for GAs. Often you can use many different heuristics on a single problem, and only experimental results will show which is better. You may need to do computational experiments to see if your GA can be improved with a TS or a more generic local search, but in general, hybrid heuristics seem to be the way to go these days.
I haven't combined tabu search with genetic algorithms yet, but I have combined it with simulated annealing. It's not really tabu search then; it's more like enhancing the other algorithm with a tabu list.
From my experience, checking if something is tabu doesn't have a high performance cost.
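That matches the usual implementation: a tabu list is typically a bounded FIFO plus a hash set, so the tabu check itself is O(1). A minimal sketch (assuming moves are hashable, and ignoring duplicate moves within the tenure):

```python
from collections import deque

class TabuList:
    """Fixed-size tabu memory: O(1) membership test via a set,
    FIFO expiry via a deque (tenure = number of moves remembered)."""

    def __init__(self, tenure):
        self.tenure = tenure
        self.queue = deque()
        self.members = set()

    def add(self, move):
        self.queue.append(move)
        self.members.add(move)
        if len(self.queue) > self.tenure:
            # The oldest move expires and becomes allowed again.
            self.members.discard(self.queue.popleft())

    def __contains__(self, move):
        return move in self.members

tabu = TabuList(tenure=3)
for move in ["a", "b", "c", "d"]:
    tabu.add(move)
print("a" in tabu, "d" in tabu)  # False True
```

The cost per iteration is one hash lookup and, on insertion, one append and at most one eviction, which is why the overhead is barely noticeable in practice.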
Is Latent Semantic Indexing (LSI) a Statistical Classification algorithm? Why or why not?
Basically, I'm trying to figure out why the Wikipedia page for Statistical Classification does not mention LSI. I'm just getting into this stuff and I'm trying to see how all the different approaches for classifying something relate to one another.
No, they're not quite the same. Statistical classification is intended to separate items into categories as cleanly as possible -- to make a clean decision about whether item X is more like the items in group A or group B, for example.
LSI is intended to show the degree to which items are similar or different and, primarily, to find items that show a degree of similarity to a specified item. While this is related, it's not quite the same.
LSI/LSA is essentially a technique for dimensionality reduction, and is usually coupled with a nearest-neighbour algorithm to make it into a classification system. In itself, it is only a way of "indexing" the data in a lower dimension using SVD.
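A small sketch of that combination (the tiny term-document matrix and labels are invented, and NumPy's SVD stands in for whatever LSI implementation you use): the SVD step alone is pure dimensionality reduction; only the nearest-neighbour step on top makes it a classifier.

```python
import numpy as np

# Tiny term-document matrix: rows = documents, columns = terms.
terms = ["cat", "dog", "pet", "stock", "market", "trade"]
docs = np.array([
    [2, 1, 1, 0, 0, 0],   # about pets
    [1, 2, 1, 0, 0, 0],   # about pets
    [0, 0, 0, 2, 1, 1],   # about finance
    [0, 0, 0, 1, 2, 1],   # about finance
], dtype=float)
labels = ["pets", "pets", "finance", "finance"]

# LSI/LSA step: SVD, keep k latent dimensions.
# No class labels are involved here -- this is only "indexing".
U, s, Vt = np.linalg.svd(docs, full_matrices=False)
k = 2
doc_vecs = U[:, :k] * s[:k]          # documents in the latent space

def project(query_counts):
    """Fold a new document into the latent space."""
    return np.asarray(query_counts, dtype=float) @ Vt[:k].T

def classify(query_counts):
    """Coupling LSI with 1-nearest-neighbour is what turns it into
    a classification system."""
    q = project(query_counts)
    dists = np.linalg.norm(doc_vecs - q, axis=1)
    return labels[int(np.argmin(dists))]

print(classify([1, 1, 0, 0, 0, 0]))  # pets
print(classify([0, 0, 0, 1, 1, 0]))  # finance
```

Replace the nearest-neighbour step with clustering or a different classifier and the LSI part is unchanged, which is exactly why LSI itself doesn't appear on the statistical classification page.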
Have you read about LSI on Wikipedia? It says it uses matrix factorization (SVD), which in turn is sometimes used in classification.
The primary distinction in machine learning is between "supervised" and "unsupervised" modeling.
Usually the words "statistical classification" refer to supervised models, but not always.
With supervised methods the training set contains a "ground-truth" label that you build a model to predict. When you evaluate the model, the goal is to predict the best guess at (or probability distribution of) the true label, which you will not have at time of evaluation. Often there's a performance metric and it's quite clear what the right vs wrong answer is.
Unsupervised classification methods attempt to cluster a large number of data points which may appear to vary in complicated ways into a smaller number of "similar" categories. Data in each category ought to be similar in some kind of 'interesting' or 'deep' way. Since there is no "ground truth" you can't evaluate 'right or wrong', but 'more' vs 'less' interesting or useful.
Similarly, at evaluation time you can place new examples into one of the clusters (crisp classification) or give some kind of weighting quantifying how similar or different it looks from the "archetype" of each cluster.
So in some ways both supervised and unsupervised models can yield a "prediction" (of a class or cluster label), but they are intrinsically different.
Often the goal of an unsupervised model is to provide more intelligent and powerfully compact inputs for a subsequent supervised model.
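As a toy sketch of that last point (all data and names invented): a k-means step compresses raw 2-D points into a single cluster-id feature, after which a "supervised" step can assign labels from just two labeled examples, far fewer than learning from the raw coordinates would need.

```python
import math
import random

random.seed(1)

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def mean(pts):
    return (sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts))

def cluster_id(p, centroids):
    return min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))

def kmeans(points, init, iters=20):
    """Unsupervised step: compress raw 2-D points into cluster ids."""
    centroids = list(init)
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[cluster_id(p, centroids)].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Unlabeled data: two well-separated blobs.
points = ([(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(30)] +
          [(random.gauss(5, 0.3), random.gauss(5, 0.3)) for _ in range(30)])
centroids = kmeans(points, init=[points[0], points[-1]])

# Supervised step: with only two labeled examples we learn one label
# per cluster id -- a far more compact input than raw coordinates.
labeled = [((0.1, -0.2), "low"), ((4.9, 5.1), "high")]
cluster_label = {cluster_id(p, centroids): y for p, y in labeled}

print(cluster_label[cluster_id((0.2, 0.1), centroids)])  # low
print(cluster_label[cluster_id((5.2, 4.8), centroids)])  # high
```

In a real pipeline the supervised model would typically take the cluster id (or distances to centroids) as extra features alongside others, rather than as the only input.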