A new project with some interesting requirements has arrived on my desk. I need to develop a searchable directory of businesses, with a focus on delivering relevant results based on arbitrary search queries. The businesses can be of any niche; there's no one area that is more represented than another.
When googling for things like "search algorithm" or "content relevance algorithm," all I get are references to Google's "Mystical Algorithm of the Old Gods" and SEO firms.
Does the relevance value of MySQL's full text Match() function have what it takes for the task? I've never used it, but I'm definitely going to do some testing. Also, since this will largely be a human edited directory, I can assume that we can add weighted factors like tagging and categories. What would be a good way to combine these factors with MySQL's Match() relevancy?
I'm also open to ideas that I've not discussed here.
For an example of information retrieval based techniques lookup TF-IDF or BM25.
For machine learning based techniques, lookup RankNet and its variants from MSR.
If you have hand edited data, have a look at Oracle text search. In one of my previous projects we had some good results.
I was not directly involved in the database setups, but I know that the results were very welcome. (Before this they had just keyword based search).
Use a search engine like Solr to index the data. You can still use MySql to hold the data, but for searches use a search engine.
Related
I have a problem with making search output more practically usefull for the end users. The problem is rather related to the algorithm and approach then to exact technology or framework to use.
At the moment we have a database of products, that can be described with following schema:
From the search perspective we've done pretty standard things, 3-rd party text search with token analyzer, handling mistypes and synonyms (it is not the full list, but as I said, it is rather out of scope). But stil we need to perform extra work to make the search result closer to real life user needs, probably, in somewhat similar way how Google ranks indexed pages by relevancy. Ideas, that we`ve already considered as potentially applicable in solving the problem:
Analyze most popular search requests in widespread search engines (it is still a question how to get them) and increase rank for those entries in the index, which correspond (could be found with) to the popular requests;
Increase rank for newest (hot) entries;
Increase rank for the biggest group of entries, which correspond to the popular request and have something in common (that`s why it is a group);
Appreciate for any help or advising a direction, where to dig.
You may try pLSA; there are many references on the web, and there should be libraries and source code.
EDIT:
well, I took a closer look at Lucene recently, and it seems to give a much better answer to what the question actually asked (it does not use pLSA). As for the integration with db, you may use Hibernate Search (although it does not seem to be as powerful as using Lucene directy is).
So, I'll be storing millions of sentences in a database each with an author. I need to be able to efficiently search for a sentence and return the author. Now, I'd like to be able to mispell a word or forget a word or two in this sentence, and have the application still be able to match (fuzzy-esque). Can anyone point me in the right direction? How does google do this? Because I can search for lyrics on google for instance and it will return the song with the lyrics? I'm looking to do the same thing?
Thanks all.
If fuzzy makes things too complicated, then I can deal with just an efficient sentence search.
If you're writing in Java, you can try Lucene.
Shouldn't it really be "document" and author instead of individual sentences?
For full text search check inverted index data structure.
This is how search engines do it
samples of code
UPDATE:
also if you're working on a distributed system check Hadoop - open source alternative for Goolge's MapReduce
Full Text Indexing on SQL Server or Oracle will most likey be what you're after right out of the box. They can go fuzzy, use word roots and other clever stuff.
I can't comment on other DB engines though a quick google shows most will have something similar. For some reason I expect them to be more limited in the fuzziness.
Indeed fuzzy matching is not a simple thing to do, although some databases implement some kind of fuzzy search, depending on the method used and your data, your results may vary. Here's a link that explains fuzzy searches in SQL sever
http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
As for the sentence search, most db engines implement full text search/indexing that you may want to look at... It comes with trade offs in terms of performance and storage, but you may want to look at it
How does google do this?
Using inverted indexes. The details are proprietary, but you can bet your last dollars that there is a lot of replication and storing of the indexes, etc in memory so that they can handle the vast number of search requests they get per second.
I want to incrementally cluster text documents reading them as data streams but there seems to be a problem. Most of the term weighting options are based on vector space model using TF-IDF as the weight of a feature. However, in our case IDF of an existing attribute changes with every new data point and hence previous clustering does not remain valid anymore and hence any popular algorithms like CluStream, CURE, BIRCH cannot be applied which assumes fixed dimensional static data.
Can anyone redirect me to any existing research related to this or give suggestions? Thanks !
Have you looked at
TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams
Here's an idea off the top of my head:
What's your input data like? I'm guessing it's at least similarly themed, so you could start with a base phrase dictionary and use that for idf. Apache Lucene is a great indexing engine. Since you have a base dictionary, you can run kmeans or whatever you'd like. As documents come in, you'll have to rebuild the dictionary at some frequency (which can be off-loaded to another thread/machine/etc) and then re-cluster.
With the data indexed in a high-performance, flexible engine like Lucene, you could run queries even as new documents are being indexed. I bet if you do some research on different clustering algorithms you'll find some good ideas.
Some interesting paper/links:
http://en.wikipedia.org/wiki/Document_classification
http://www.scholarpedia.org/article/Text_categorization
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Without more information, I can't see why you couldn't re-cluster every once in a while. You might wanna take a look at some of the recommender systems already out there.
Given a few words of input, I want to have a utility that will return a diverse set of relevant terms, phrases, or concepts. A caveat is that it would need to have a large graph of terms to begin with, or else the feature would not be very useful.
For example, submitting "baseball" would return
["shortstop", "Babe Ruth", "foul ball", "steroids", ... ]
Google Sets is the best example I can find of this kind of feature, but I can't use it since they have no public API (and I wont go against their TOS). Also, single-word input doesn't garner a very diverse set of results. I'm looking for a solution that goes off on tangents.
The closest I've experimented with is using WikiPedia's API to search Categories and Backlinks, but there's no way to directly sort those results by "relevance" or "popularity". Without that, the suggestion list is massive and all over the place, which is not immediately useful and very hard to whittle down.
Using A Thesaurus could also work minimally, but that would leave out any proper nouns or tangentially relevant terms (like any of the results listed above).
I would happily reuse an open service, if one exists, but I haven't found anything sufficient.
I'm looking for either a way to implement this either in-house with a decently-populated starting set, or reuse a free service that offers this.
Have a solution? Thanks ahead of time!
UPDATE: Thank you for the incredibly dense & informative answers. I'll choose a winning answer in 6 to 12 months, when I'll hopefully understand what you've all suggested =)
You might be interested in WordNet. It takes a bit of linguistic knowledge to understand the API, but basically the system is a database of meaning-based links between English words, which is more or less what you're searching for. I'm sure I can dig up more information if you want it.
Peter Norvig (director of research at Google) spoke about how they do this at Google (specifically mentioning Google Sets) in a Facebook Tech Talk. The idea is that a relatively simple algorithm on a huge dataset (e.g. the entire web) is much better than a complicated algorithm on a small data set.
You could look at Google's n-gram collection as a starting point. You'd start to see what concepts are grouped together. Norvig hinted that internally Google has up to 7-grams for use in things like Google Translate.
If you're more ambitious, you could download all of Wikipedia's articles in the language you desire and create your own n-gram database.
The problem is even more complicated if you just have a single word; check out this recent thesis for more details on word sense disambiguation.
It's not an easy problem, but it is useful as you mentioned. In the end, I think you'll find that a really successful implementation will have a relatively simple algorithm and a whole lot of data.
Take a look at the following two papers:
Clustering User Queries of a Search Engine [pdf]
Topic Detection by Clustering Keywords [pdf]
Here is my attempt at a very simplified explanation:
If we have a database of past user queries, we can define a similarity function between two queries. For example: number of words in common. Now for each query in our database, we compute its similarity with each other query, and remember the k most similar queries. The non-overlapping words from these can be returned as "related terms".
We can also take this approach with a database of documents containing information users might be searching for. We can define the similarity between two search terms as the number of documents containing both divided by the number of documents containing either. To decide which terms to test, we can scan the documents and throw out words that are either too common ('and', 'the', etc.) or that are too obscure.
If our data permits, then we could see which queries led users to choosing which results, instead of comparing documents by content. For example if we had data that showed us that users searching for "Celtics" and "Lakers" both ended up clicking on espn.com, then we could call these related terms.
If you're starting from scratch with no data about past user queries, then you can try Wikipedia, or the Bag of Words dataset as a database of documents. If you are looking for a database of user search terms and results, and if you are feeling adventurous, then you can take a look at the AOL Search Data.
When developing a database of articles in a Knowledge Base (for example) - what are the best ways to sort and display the most relevant answers to a users' question?
Would you use additional data such as keyword weighting based on whether previous users found the article of help, or do you find a simple keyword matching algorithm to be sufficient?
Perhaps the easiest and most naive approach that will give immediately useful results would be to implement *tf-idf:
Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
In a recent related question of mine here I learned of an excellent free book on this topic which you can download or read online:
An Introduction to Information Retrieval
That's a hard question, and companies like Google are pushing a lot of efforts to address this question. Have a look at Google Enterprise Search Appliance or Exalead Enterprise Search.
Then, as a personal opinion, I don't think that any "naive" approach is going to improve much the result compared to naive keyword search and ordering by the number of views on the documents.
If you have the possibility to expose your knowledge base to the web, then, just do it, and let your favorite search engine handles the search for you.
I think the angle here is not the retrieval itself... its about scoring the relevence of the information retrieved (A more reactive and passive approach) which can be later used to improve the search engine.
I guess you can try -
knn on tfidf for retrieving information
Hand tagging these retrieved info a relevency score
Then regress that score to predict the score for an unknwon search result and sort it.
Just a thought...
The third point is actually based on Rocchio algorithm. You can see it here
A little more specificity of your exact problem would be good. There are a lot of different techniques that you can use. Many of these are driven by other pieces of data. You can of course use Lucene and build your own indexes. There are bindings for many languages to lucene. Moving up there is also the Solr project which is Lucene with a lot of tools and extra functionality around it. That may be more along the lines of what you are looking for.
Intent is tricky and most modern search engines rely on statistical intent to aid in the ordering of results. You can always have an is this article useful button and store the query text that leads to useful documents. You could then add a layer of information to the index to boost specific words or phrases and help them point to certain documents.
Some things to think about...How many documents? What is the average length? Are they updated frequently? What do users do with the documents? What does the spread of unique words to documents look like? (More simply is it easy to match a query with a specific document(s) based on common unique features.)
If it is on the web you can always make a google custom search engine that just searches your site although you may find this to be sub-optimal for a variety of reasons.
You can always start with a simple index and gradually make it more sophisticated by talking with users and capturing data.
keyword matching is not enough when dealing with questions, you need to understand intent, as joannes say a very hot topic in search