Ranking evaluation approach in two stage document retrieval - ranking

I have built a two-stage ranking system based on textual similarity (cosine similarity) between query-document pairs. Now I need to validate whether the retrieved, ranked items are actually correct from the user's point of view. Which approach should I choose? I have read about pointwise, pairwise, and listwise approaches to validating a ranking, but which would be most helpful for manual evaluation of a ranking system? If somebody can suggest a better strategy for ranking evaluation, it would be very helpful. Thanks.

If I understand the question correctly, you are looking for an evaluation methodology to find out whether your two-stage retrieval system works well. If so, you can use one of the following evaluation methodologies:
Relevance judgements: You can use TREC-like collections with a few hundred queries and explicit relevance judgements, and use IR evaluation metrics (such as MAP, P@10, NDCG, etc.) to evaluate your model.
A/B testing: You can show users the initial results and the results re-ranked by the second stage of your retrieval system, and ask them to judge whether the re-ranked list is better.
Click data: If you have access to search engine logs, you can use users' click information to evaluate your model. If you do, be aware of several bias problems, e.g., position bias (results shown higher get more clicks regardless of relevance).
Among the aforementioned strategies, the first one should be the easiest and cheapest. You just need access to TREC data, which is not private (but you need to pay a few hundred dollars to get access to most of it).
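To make the metrics named above concrete, here is a minimal Python sketch of P@k, average precision (MAP is its mean over all queries) and NDCG@k for a single query. The function names, document ids and gain values are made up for illustration; for real experiments an established tool such as trec_eval is the safer choice.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank holding a relevant document."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with graded judgements; gains maps doc id -> gain, unjudged docs count as 0."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked_ids[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical output of the second stage for one query, plus its judgements.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 2, "d2": 1}

print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
print(ndcg_at_k(ranked, gains, 4))
```

Running the same functions on the first-stage ranking and on the re-ranked list gives a direct before/after comparison of the second stage.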

Related

evaluating the performance of item-based collaborative filtering for binary (yes/no) product recommendations

I'm attempting to write some code for item-based collaborative filtering for product recommendations. The input has buyers as rows and products as columns, with a simple 0/1 flag to indicate whether or not a buyer has bought an item. The output is a list of similar items for a given purchased item, ranked by cosine similarity.
I am attempting to measure the accuracy of a few different implementations, but I am not sure of the best approach. Most of the literature I find mentions using some form of mean square error, but this really seems more applicable when your collaborative filtering algorithm predicts a rating (e.g. 4 out of 5 stars) instead of recommending which items a user will purchase.
One approach I was considering was as follows...
split data into training/holdout sets, train on training data
For each item (A) in the set, select data from the holdout set where users bought A
Determine which percentage of A-buyers bought one of the top 3 recommendations for A-buyers
The above seems kind of arbitrary, but I think it could be useful for comparing two different algorithms when trained on the same data.
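The three steps above can be sketched in a few lines of Python. This is only an illustration under assumed data structures: `recommendations` (item -> ranked list of similar items, as produced from the training data) and `holdout_baskets` (one set of purchased items per holdout buyer) are hypothetical names, not part of any library.

```python
def hit_rate_at_k(recommendations, holdout_baskets, k=3):
    """For each item A: share of holdout buyers of A who also bought
    at least one of the top-k items recommended for A."""
    rates = {}
    for item, ranked_similar in recommendations.items():
        top_k = set(ranked_similar[:k])
        buyers = [basket for basket in holdout_baskets if item in basket]
        if not buyers:
            continue  # item never bought in the holdout set, nothing to measure
        hits = sum(1 for basket in buyers if top_k & (basket - {item}))
        rates[item] = hits / len(buyers)
    return rates

# Toy data, made up for illustration.
recommendations = {"A": ["B", "C", "D", "E"]}
holdout_baskets = [{"A", "B"}, {"A", "E"}, {"C"}, {"A", "D", "F"}]

print(hit_rate_at_k(recommendations, holdout_baskets))
```

Averaging the per-item rates gives a single number per algorithm, which is enough to compare two implementations trained on the same split.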
Actually, your approach is quite similar to what the literature does, but I think you should consider using recall and precision, as most papers do.
http://en.wikipedia.org/wiki/Precision_and_recall
Moreover, if you use Apache Mahout, there is an implementation of recall and precision in this class: GenericRecommenderIRStatsEvaluator
The best way to test a recommender is always to verify the results manually. However, some kind of automatic verification is also good.
In the spirit of a recommendation system, you should split your data by time and see whether your algorithm can predict which items the user buys in the future. This should be done for all users.
Don't expect that it can predict everything, a 100% correctness is usually a sign of over-fitting.

How do I get a quick and dirty recognition of possible typos in .net?

I have to manually go through a long list of terms (~3500) which have been entered by users through the years. Beside other things, I want to reduce the list by looking for synonyms, typos and alternate spellings.
My work will be much easier if I can group the list into clusters of possible typos before starting. I was thinking of using some metric which can calculate the similarity between two terms, e.g. in percent, and then clustering everything which has a similarity higher than some threshold. As I am going through it manually anyway, I don't mind a high failure rate, if it can keep the whole thing simple.
Ideally, there exists some easily available library to do this for me, implemented by people who know what they are doing. If there is no such library, then at least one calculating a similarity metric for a pair of strings would be great; I can manage the clustering myself.
If this is not available either, do you know of a good algorithm which is simple to implement? I was first thinking a Hamming distance divided by word length will be a good metric, but noticed that while it will catch swapped letters, it won't handle deletions and insertions well (ptgs-1 will be caught as very similar to ptgs/1, but hematopoiesis won't be caught as very similar to haematopoiesis).
As for the requirements on the library/algorithm: it has to rely completely on spelling. I know that the usual NLP libraries don't work this way, but
there is no full text available for it to consider context.
it can't use a dictionary corpus of words, because the terms are far outside of any everyday language, frequently abbreviations of highly specialized terms.
Finally, I am most familiar with C# as a programming language, and I already have a C# pseudoscript which does some preliminary cleanup. If there is no one-step solution (feed list in, get grouped list out), I will prefer a library I can call from within a .NET program.
The whole thing should be relatively quick to learn for somebody with almost no previous knowledge in information retrieval. This will save me maybe 5-6 hours of manual work, and I don't want to spend more time than that in setting up an automated solution. OK, maybe up to 50% longer if I get the chance to learn something awesome :)
The question: What should I use, a library, or an algorithm? Which ones should I consider? If what I need is a library, how do I recognize one which is capable of delivering results based on spelling alone, as opposed to relying on context or dictionary use?
edit To clarify, I am not looking for actual semantic relatedness the way search or recommendation engines need it. I need to catch typos. So, I am looking for a metric by which mouse and rodent have zero similarity, but mouse and house have a very high similarity. And I am afraid that tools like Lucene use a metric which gets these two examples wrong (for my purposes).
Basically you are looking to cluster terms according to Semantic Relatedness.
One (hard) way to do it is to follow the approach of Gabrilovich and Markovitch.
A quicker way consists of the following steps:
Download a Wikipedia dump and an open source Information Retrieval library such as Lucene (or Lucene.NET).
Index the files.
Search each term in the index - and get a vector - denoting how relevant the term (the query) is for each document. Note that this will be a vector of size |D|, where |D| is the total number of documents in the collection.
Cluster your vectors in any clustering algorithm. Each vector represents one term from your initial list.
If you are interested only in "visual" similarity (words that are spelled similarly), then you can settle for Levenshtein distance, but it won't give you the semantic relatedness of terms. For example, you won't be able to relate "fall" and "autumn".
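For the spelling-only case, a rough sketch of Levenshtein distance plus threshold grouping might look like the following. It is written in Python for brevity, but uses nothing beyond loops and lists, so it should port to C# almost line by line. The normalisation (distance divided by the longer length), the 0.75 threshold and the single-link grouping are assumptions chosen for illustration, not a recommendation from the answer above.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """1.0 for identical strings, falling towards 0.0 as the edit distance grows."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def group_terms(terms, threshold=0.75):
    """Single-link grouping: a term joins, and thereby merges, every group
    that already contains a sufficiently similar term."""
    groups = []
    for term in terms:
        matching, rest = [], []
        for group in groups:
            similar = any(similarity(term, t) >= threshold for t in group)
            (matching if similar else rest).append(group)
        merged = [term] + [t for group in matching for t in group]
        groups = rest + [merged]
    return groups

terms = ["hematopoiesis", "ptgs-1", "haematopoiesis", "ptgs/1", "mouse"]
for group in group_terms(terms):
    print(group)
```

For ~3500 terms this compares each new term against all earlier ones, which is a few million distance computations; that is slow-ish but workable for a one-off cleanup.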

Finding an experiment to evaluate how good an algorithm for keywords extraction is

I have a few algorithms that extract and rank keywords [both terms and bigrams] from a paragraph [most are based on the tf-idf model].
I am looking for an experiment to evaluate these algorithms. This experiment should give a grade to each algorithm, indicating "how good was it" [on the evaluation set, of course].
I am looking for an automatic / semi-automatic method to evaluate each algorithm's results, and an automatic / semi-automatic method to create the evaluation set.
Note: These experiments will be run off-line, so efficiency is not an issue.
The classic way to do this would be to define a set of key words you want the algorithms to find per paragraph, then check how well the algorithms do with respect to this set, e.g. (generated_correct - generated_not_correct)/total_generated (see update, this is nonsense). This is automatic once you have defined this ground truth. I guess constructing that is what you want to automate as well when you talk about constructing the evaluation set? That's a bit more tricky.
Generally, if there were a way to generate keywords automatically that was good enough to serve as ground truth, you should use that as your algorithm ;). Sounds cheeky, but it's a common problem. When you evaluate one algorithm using the output of another algorithm, something's probably going wrong (unless you specifically want to benchmark against that algorithm).
So you might start harvesting key words from common sources. For example:
Download scientific papers that have a keyword section. Check whether those keywords actually appear in the text; if they do, take the section of text including the keywords and use the keyword section as ground truth.
Get blog posts, check whether the terms in the heading appear in the text, then use the words in the title (always minus stop words of course) as ground truth.
...
You get the idea. Unless you want to employ people to manually generate keywords, I guess you'll have to make do with something like the above.
Update
The evaluation function mentioned above is stupid. It does not take into account how many of the available keywords have been found. Instead, the way to judge a ranked list of relevant and irrelevant results is to use precision and recall. Precision rewards the absence of irrelevant results; recall rewards the presence of relevant results. This again gives you two measures. To combine them into a single measure, use the F-measure, which allows an optional weighting between the two. Alternatively, use Precision@X, where X is the number of results you want to consider. Interestingly, Precision@X equals Recall@X when there are exactly X ground-truth keywords. However, you need a sensible X here, i.e. if you have fewer than X keywords in some cases, those results will be punished for never providing an Xth keyword. In the literature on tag recommendation, for example, which is very similar to your case, F-measure and P@5 are often used.
http://en.wikipedia.org/wiki/F1_score
http://en.wikipedia.org/wiki/Precision_and_recall
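As a small illustration of the measures described in the update, the following Python sketch computes precision, recall and the weighted F-measure for one paragraph. The keyword lists are invented; `beta` is the usual F-measure weight (beta = 1 gives F1, beta > 1 favours recall).

```python
def precision_recall_f(extracted, gold, beta=1.0):
    """Set-based precision, recall and F-beta of extracted keywords vs. ground truth."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure

# Invented example: algorithm output vs. author-supplied keywords.
extracted = ["neural network", "training", "data", "loss"]
gold = ["neural network", "loss", "gradient descent"]

p, r, f1 = precision_recall_f(extracted, gold)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

To grade an algorithm over the whole evaluation set, average these scores over all paragraphs; for P@5, pass only the top five ranked keywords as `extracted` and read off the precision.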

Algorithm to handle data aggregation from multiple error-prone sources

I'm aggregating concert listings from several different sources, none of which are both complete and accurate. Some of the data comes from users (such as on last.fm), and may be incorrect. Other data sources are highly accurate, but may not contain every event. I can use attributes such as the event date, and the city/state to try to match listings from disparate sources. I'd like to be reasonably certain that the events are valid. It seems like it would be a good strategy to consume as many different sources as possible to validate listings on error-prone sources.
I'm not sure what the technical term for this is, as I'd like to research it further. Is it data mining? Are there any existing algorithms? I understand a solution will never be completely accurate.
Here is an approach that locates it within statistics - specifically, it uses a Hidden Markov Model (http://en.wikipedia.org/wiki/Hidden_Markov_model):
1) Use your matching process to produce a cleaned list of possible events. Consider each event to be marked "true" or "bogus", even though the markings are hidden from you. You might imagine that some source of events produces them, generating them as either "true" or "bogus" according to a probability which is an unknown parameter.
2) Associate unknown parameters with each source of listings. These give the probability that this source will report a true event produced by the source of events, and the probability that it will report a bogus event produced by the source.
3) Notice that if you could see the markings of "true" or "bogus" you could easily work out the probabilities for each source. Unfortunately, of course, you can't see these hidden markings.
4) Let's call these hidden markings "Latent Variables" because then you can use the EM algorithm (http://en.wikipedia.org/wiki/Em_algorithm) to hill-climb to promising solutions for this problem, from random starts.
5) You can obviously make the problem more complicated by dividing events up into classes, and giving sources of listing parameters which make them more likely to report some classes of events than others. This might be useful if you have sources that are extremely reliable for some sorts of events.
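A bare-bones version of steps 1-4 could look like the Python sketch below. It is only an illustration under strong assumptions: sources report independently of each other, `reports[e][s]` is 1 if source `s` lists candidate event `e`, and the starting values are fixed rather than random. All names are hypothetical, and in practice you would try several starts and compare the results.

```python
def clamp(x, eps=1e-6):
    """Keep probabilities away from exactly 0 and 1."""
    return min(max(x, eps), 1 - eps)

def em_source_reliability(reports, n_iter=50):
    """EM for a two-class latent variable model: each event is true or bogus."""
    n_events, n_sources = len(reports), len(reports[0])
    prior = 0.5                     # P(event is true)
    p_true = [0.8] * n_sources      # P(source lists event | event is true)
    p_bogus = [0.2] * n_sources     # P(source lists event | event is bogus)
    posteriors = [prior] * n_events
    for _ in range(n_iter):
        # E-step: posterior probability that each event is true.
        posteriors = []
        for row in reports:
            like_true, like_bogus = prior, 1 - prior
            for s, listed in enumerate(row):
                like_true *= p_true[s] if listed else 1 - p_true[s]
                like_bogus *= p_bogus[s] if listed else 1 - p_bogus[s]
            posteriors.append(like_true / (like_true + like_bogus))
        # M-step: re-estimate parameters from the soft markings.
        weight_true = sum(posteriors)
        weight_bogus = n_events - weight_true
        prior = clamp(weight_true / n_events)
        for s in range(n_sources):
            listed_true = sum(p * row[s] for p, row in zip(posteriors, reports))
            listed_bogus = sum((1 - p) * row[s] for p, row in zip(posteriors, reports))
            p_true[s] = clamp(listed_true / max(weight_true, 1e-6))
            p_bogus[s] = clamp(listed_bogus / max(weight_bogus, 1e-6))
    return posteriors, p_true, p_bogus

# Made-up listings: rows are candidate events, columns are three sources.
reports = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
posteriors, p_true, p_bogus = em_source_reliability(reports)
print([round(p, 2) for p in posteriors])
```

Note that the model cannot tell by itself which of the two classes is the "true" one; the asymmetric starting values nudge it, but the result should be sanity-checked against a few events you know to be real.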
I believe the term you are looking for is Record Linkage -
the process of bringing together two or more records relating to the same entity (e.g., person, family, event, community, business, hospital, or geographical area)
This presentation (PDF) looks like a nice introduction to the field. One algorithm you might use is Fellegi-Holt - a statistical method for editing records.
One potential search term is "fuzzy logic".
I'd use a float or double to store a probability (0.0 = disproved ... 1.0 = proven) of some event details being correct. As you encounter sources, adjust the probabilities accordingly. There's a lot for you to consider though:
attempting to recognise when multiple sources have copied from each other and reduce their impact
giving more weight to more recent data or data that explicitly acknowledges the old data (e.g. given a 100% reliable site saying "concert X to be held on 4th August", and an unknown blog alleging "concert X moved from 4th August to 9th", you might keep the probability of there being such a concert at 100% but have a list with both dates and whatever probabilities you think appropriate...)
beware assuming things are discrete; contradictory information may reflect multiple similar events, dual billing, same-surnamed performers etc. - the more confident you are that the same things are referenced, the more the data can be combined to reinforce or negate each other
you should be able to "backtest" your evolving logic by using data related to a set of concerts where you now have full knowledge of their actual staging or lack thereof; process data posted before various cut-off dates prior to the events to see how the predictions you derive reflect the actual outcomes, tweak and repeat (perhaps automatically)
It may be most practical to start scraping from the sites you have, then consider the logical implications of the types of information you're seeing. Which aspects of the problem need to be handled using fuzzy logic can then be decided. An evolutionary approach may mean reworking things, but may end up faster than getting bogged down in a nebulous design phase.
Data mining is about finding information from structured sources like a database, or a post where the fields are separated for you. There's some text mining in here when you have to parse the information out of free text. In either case, you could keep track of how many data sources agree on a show as a confidence measure. Either display the confidence measure or use it to decide if your data is good enough. There's lots to play with. Having a list of legitimate cities, venues and acts can help you decide if a string represents a legitimate entity. Your lists might even be in a database that lets you compare city and venue for consistency.

Fuzzy record matching with multiple columns of information

I have a question that is somewhat high level, so I'll try to be as specific as possible.
I'm doing a lot of research that involves combining disparate data sets with header information that refers to the same entity, usually a company or a financial security. This record linking usually involves header information in which the name is the only common primary identifier, but where some secondary information is often available (such as city and state, dates of operation, relative size, etc). These matches are usually one-to-many, but may be one-to-one or even many-to-many. I have usually done this matching by hand or with very basic text comparison of cleaned substrings. I have occasionally used a simple matching algorithm like a Levenshtein distance measure, but I never got much out of it, in part because I didn't have a good formal way of applying it.
My guess is that this is a fairly common question and that there must be some formalized processes that have been developed to do this type of thing. I've read a few academic papers on the subject that deal with theoretical appropriateness of given approaches, but I haven't found any good source that walks through a recipe or at least a practical framework.
My question is the following:
Does anyone know of a good source for implementing multi-dimensional fuzzy record matching, like a book or a website or a published article or working paper?
I'd prefer something that had practical examples and a well defined approach.
The approach could be iterative, with human checks for improvement at intermediate stages.
(edit) The linked data is used for statistical analysis. As such, a little bit of noise is OK, but there is a strong preference for fewer "incorrect matches" over fewer "incorrect non-matches".
If they were in Python that would be fantastic, but not necessary.
One last thing, if it matters, is that I don't care much about computational efficiency. I'm not implementing this dynamically and I'm usually dealing with a few thousand records.
One common method that shouldn't be terribly expensive for "a few thousand records" would be cosine similarity. Although most often used for comparing text documents, you can easily modify it to work with any kind of data.
The linked Wikipedia article is pretty sparse on details, but following links and doing a few searches will get you some good info, and potentially an implementation that you can modify to fit your purposes. In fact, take a look at "Simple implementation of N-Gram, tf-idf and Cosine similarity in Python".
A simpler calculation, and one that might be "good enough" for your purposes would be a Jaccard index. The primary difference is that typically cosine similarity takes into account the number of times a word is used in a document and in the entire set of documents, whereas the Jaccard index only cares that a particular word is in the document. There are other differences, but that one strikes me as the most important.
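To show the difference between the two measures, here is a small Python sketch that compares two name strings with a Jaccard index and with cosine similarity. Note the simplification: the cosine version uses raw token counts only, without the tf-idf weighting over the whole record set mentioned above, and the whitespace tokeniser and company names are made up for illustration.

```python
import math
from collections import Counter

def tokens(text):
    """Very naive tokeniser: lower-case and split on whitespace."""
    return text.lower().split()

def jaccard(a, b):
    """Shared distinct tokens divided by all distinct tokens."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def cosine(a, b):
    """Cosine of the angle between the two token-count vectors."""
    counts_a, counts_b = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

name_1 = "Acme Holdings Inc"
name_2 = "Acme Holdings Incorporated"
print(jaccard(name_1, name_2))
print(cosine(name_1, name_2))
```

For short names it is often worth swapping the word tokeniser for character n-grams, so that "Inc" and "Incorporated" share some features instead of none.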
The problem is that you have an array of distances, at least one for each column, and you want to combine those distances in an optimal way to indicate whether a pair of records are the same thing or not.
This is a classification problem. There are many ways to solve it, but logistic regression is one of the simpler methods. To train a classifier, you will need to label some pairs of records as either matches or non-matches.
The dedupe Python library helps you do this and other parts of the difficult task of record linkage. The documentation has a pretty good overview of how to approach the problem of record linkage comprehensively.
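To illustrate the idea of combining per-column distances with logistic regression (this is a hand-rolled sketch, not how dedupe is implemented or used), the Python below trains on a handful of labelled record pairs. The two features, name similarity and a same-city flag, and all the numbers are invented for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(features, labels, learning_rate=0.5, epochs=5000):
    """Plain batch gradient descent on the mean logistic loss."""
    n, dim = len(features), len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += error
            for j, v in enumerate(x):
                grad_w[j] += error * v
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

def match_probability(x, weights, bias):
    """Estimated probability that a record pair refers to the same entity."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Hand-labelled pairs: [name_similarity, same_city], label 1 = match.
pairs = [[0.95, 1], [0.90, 1], [0.85, 0], [0.30, 0], [0.20, 0], [0.40, 1]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(pairs, labels)
for pair in pairs:
    print(pair, round(match_probability(pair, weights, bias), 3))
```

Given the stated preference for fewer incorrect matches over fewer incorrect non-matches, one option is to accept a pair only above a high cut-off (say 0.9 rather than 0.5) and send the pairs in between to manual review.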

Resources