information retrieval probabilistic model - algorithm

Do you know where I can find source code(any language) to program an information retrieval system based on the probabilistic model?
I tried to search it on the web and found an algorithm named bm25 or bmf25, but I don't know if it is useful.
Basically I´m trying to compare the performance of 3 IR algorithms: Vector space model, boolean model and the probabilistic model. Right now I have found the vector space and the boolean models. Depending on the results we need to use the best of them to develop a question-answering system
Thanks in advance

If you are looking for an IR engine that have BM25 implemented, you can try Terrier IR Platform
The language is Java. You can either use the engine itself or look into the source code for implementations of BM25 or other term weighting models.

The confusion here is that there are several probabilistic IR models (e.g. 2-Poisson, Binary Independence Model, language modeling variants), so the question is ambiguous. But in my experience, when people say "the probabilistic model" they usually mean some variant of the Binary Independence model due to Robertson and Sparch-Jones. BM25 (quite roughly) approximates this model, and that's what I'd use in this case. A canonical implementation of BM25 is included in the Lemur Toolkit. See:
http://www.lemurproject.org/doxygen/lemur/html/OkapiRetMethod_8hpp-source.html

Related

What is the algorithm used by the "Universal Recommender" on Prediction.IO?

good Afternoon
What is the name algorithm used by the "Universal Recommender (UR)" on Prediction.IO?
during which i know Algorithm for
system recommendation are "collaborative filtering" and "content based filtering".
thanks!
It uses Correlated Cross-Occurrence(CCO) algorithm from Apache-mahout.
check out these
https://actionml.com/blog/cco
https://mahout.apache.org/users/algorithms/recommender-overview.html
Prediction.io uses Apache Spark MLLib's Alternating Least Squares matrix factorization method (ALS). It is one of basic methods of Collaborative Filtering, which are User-based, Item-based and Matrix factorization. Documentation can be found at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
Universal Recommender Template use this algorithm for computing "events" that are appearing "often" with "buying" some "item". Use of factorization is not what authors of Universal Recommender principle describe in their original idea, instead they used LLR similarity to find statistically significant "events". I personally doubt about suitability of use of matrix factorization and use of HBase (use Redis cluster instead). You can read about Universal Recommender general idea at https://www.mapr.com/practical-machine-learning and http://mahout.apache.org/users/algorithms/recommender-overview.html

Recommender approach and algorithms for cold-start

We are looking at building recommender system for our brand-new Learning Management System. There are a bunch of users and items (learning modules) onboarded, but no ratings yet - typical cold start problem.
To begin with, we are thinking of using a simple item-based similarity using item attributes (tags, category, etc.) The idea is to switch to more robust collaborative filtering as the ratings start coming in.
Questions:
Is this a good approach? Is there a recommended ML pattern to handle such cold-start conditions?
To realise item-based similarity, which is the right algorithm? Say, cosine similarity. However, please note there is no "matrix". Should we try to use a standard ML algorithm or maybe roll our own?
Your approach is good. I would start with an unsupervised learning algorithm such as 'k-Nearest Neighbors classifier'. If your team doesn't know the first thing about ML, I recommend you to read this tutorial http://www.astroml.org/sklearn_tutorial/general_concepts.html . It uses python and a great library called scikit-learn. From there you could do Andrew's NG course (https://www.coursera.org/learn/machine-learning/) although it does not cover any recommendation systems.
I usually go with a Pearson Correlation algorithm (https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) and that suffices me for my problems. The problem with this approach is that it is linear. I have read that the Orange data mining tool provides many correlation measures. Using it you could find which one is best for your data. I would advice against using your own algorithm.
There is an older question which provides further information on the matter: How can I implement a recommendation engine?

Is tabu search a learning algorithm? (CVRP)

I've been set the assignment of producing a solution for the capacitated vehicle routing problem using any algorithm that learns. From my brief search of the literature, tabu search variants seem to be the most successful. Can they be classed as learning algorithms though or are they just variants on local search?
Search methods are not "learning". Learning, in cotenxt of computer science is a term for learning machines - which improve their quality over training (experience). Metaheuristics, which simply search through some space do not "learn", they simply browse all possible solutions (in heuristically guided manner) in order to optimize some function. In other words - optimization techniques are used to train some models, but these optimizers themselves don't "learn". Although this is purely linguistic manner, but I would distinguish between methods that learns - in the sense - are trying to generalize knowledge from some set of examples, from algorithms which simply are searching for best parameters for arbitrary given function. The core idea of machine learning (which distinguishes it from optimization itself) is that the aim is to actually maximize the quality of our model on unknown data, while in optimization (and in particular tabu search) we are simply looking for the best quality on exactly known, and well defined data (function).

Which classification algorithm can be used for document categorization?

Hey, Here is my problem,
Given a set of documents I need to assign each document to a predefined category.
I was going to use the n-gram approach to represent the text-content of each document and then train an SVM classifier on the training data that I have.
Correct me if I miss understood something please.
The problem now is that the categories should be dynamic. Meaning, my classifier should handle new training data with new category.
So for example, if I trained a classifier to classify a given document as category A, category B or category C, and then I was given new training data with category D. I should be able to incrementally train my classifier by providing it with the new training data for "category D".
To summarize, I do NOT want to combine the old training data (with 3 categories) and the new training data (with the new/unseen category) and train my classifier again. I want to train my classifier on the fly
Is this possible to implement with SVM ? if not, could u recommend me several classification algorithms ? or any book/paper that can help me.
Thanks in Advance.
Naive-Bayes is relatively fast incremental calssification algorithm.
KNN is also incremental by nature, and even simpler to implement and understand.
Both algorithms are implemented in the open source project Weka as NaiveBayes and IBk for KNN.
However, from personal experience - they are both vulnerable to large number of non-informative features (which is usually the case with text classification), and thus some kind of feature selection is usually used to squeeze better performance from these algorithms, which could be problematic to implement as incremental.
This blog post by Edwin Chen describes infinite mixture models to do clustering. I think this method supports automatically determining the number of clusters, but I am still trying to wrap my head all the way around it.
The class of algorithms that matches your criteria are called "Incremental Algorithms". There are incremental versions of almost any methods. The easiest to implement is naive bayes.

Is Latent Semantic Indexing (LSI) a Statistical Classification algorithm?

Is Latent Semantic Indexing (LSI) a Statistical Classification algorithm? Why or why not?
Basically, I'm trying to figure out why the Wikipedia page for Statistical Classification does not mention LSI. I'm just getting into this stuff and I'm trying to see how all the different approaches for classifying something relate to one another.
No, they're not quite the same. Statistical classification is intended to separate items into categories as cleanly as possible -- to make a clean decision about whether item X is more like the items in group A or group B, for example.
LSI is intended to show the degree to which items are similar or different and, primarily, find items that show a degree of similarity to an specified item. While this is similar, it's not quite the same.
LSI/LSA is eventually a technique for dimensionality reduction, and usually is coupled with a nearest neighbor algorithm to make it a into classification system. Hence in itself, its only a way of "indexing" the data in lower dimension using SVD.
Have you read about LSI on Wikipedia ? It says it uses matrix factorization (SVD), which in turn is sometimes used in classification.
The primary distinction in machine learning is between "supervised" and "unsupervised" modeling.
Usually the words "statistical classification" refer to supervised models, but not always.
With supervised methods the training set contains a "ground-truth" label that you build a model to predict. When you evaluate the model, the goal is to predict the best guess at (or probability distribution of) the true label, which you will not have at time of evaluation. Often there's a performance metric and it's quite clear what the right vs wrong answer is.
Unsupervised classification methods attempt to cluster a large number of data points which may appear to vary in complicated ways into a smaller number of "similar" categories. Data in each category ought to be similar in some kind of 'interesting' or 'deep' way. Since there is no "ground truth" you can't evaluate 'right or wrong', but 'more' vs 'less' interesting or useful.
Similarly evaluation time you can place new examples into potentially one of the clusters (crisp classification) or give some kind of weighting quantifying how similar or different looks like the "archetype" of the cluster.
So in some ways supervised and unsupervised models can yield something which is a "prediction", prediction of class/cluster label, but they are intrinsically different.
Often the goal of an unsupervised model is to provide more intelligent and powerfully compact inputs for a subsequent supervised model.

Resources