Search for a term on amazon.com, for example "stack overflow", and the search results come back very quickly.
On the left-hand side of the window, there is a faceted search that shows, for certain categories, the count of products that match that term.
You can then drill into those categories. For example, there are 1094 books that match the term, broken down into Computers & Internet (1003), Science, and so on.
Given that the search for books covers the contents of some of those books, it strikes me that this is a very impressive feat.
How does Amazon do this? Massive parallelization? E.g., does each node know about only a few products?
Incidentally, I saw that "stack overflow" appears in the text of "The Soul of a New Machine", a book I remember from 1981.
The short answer is, a lot of indexing.
The longer answer is, a lot of indexing, a lot of redundancy, a lot of caching, and smart partitioning.
The real answer is -- read this book:
http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
(It's free, and it's very good).
Well, there is parallelization, but one of the things everyone does on the back end of these systems is run the slow processes (like semantic parsing of book contents) ahead of time and put a fast lookup on top of them. They are literally caching the search results in some large databases, so that all they have to do at query time is a database lookup. Perhaps I misunderstood the question, but it's similar to what Google does. You don't think their spiders scour the web for your sites at the moment you enter a search term, do you?
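To make the "slow process offline, fast lookup online" idea concrete, here is a minimal sketch in Python. The in-process lru_cache stands in for the large shared results databases mentioned above, and slow_search is a made-up stand-in for the expensive backend work:

```python
import functools
import time

def slow_search(query: str) -> list:
    """Stand-in for the expensive backend work (semantic parsing, ranking, ...)."""
    time.sleep(0.5)  # pretend this is costly
    return ["result %d for '%s'" % (i, query) for i in range(3)]

@functools.lru_cache(maxsize=100_000)  # in production: a shared cache or results table
def search(query: str) -> tuple:
    return tuple(slow_search(query.strip().lower()))

search("stack overflow")   # slow the first time
search("stack overflow")   # served straight from the cache afterwards
```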
Related
I'm learning about Elasticsearch and enjoying every minute of it. However, there are some practical issues that are confusing me and, of course, a lack of experience that I think some good real-life examples might clear up.
I am currently working on a website with accounts and a product catalog, and I want to return the best product matches when an end user searches for products, based on distance, relevance, and many other criteria.
Particularly interested in:
Relevance Scoring and ranking strategies
Analyzing product catalog data
Filtering
I would appreciate any references.
P.S.
I am using NEST for .NET to communicate with the Elasticsearch cluster.
Well, those three subjects are quite broad, and a lot of work has been done on each of them. You should take some time to look at the Elasticsearch documentation. For your problem, I would recommend having a look at the following pages first:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html (For the scoring of your document based on the distance)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html (For the filtering)
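To make those two pages concrete, here is a rough sketch of what such a query could look like, written with the Python Elasticsearch client (the query shape carries over to NEST, since it is the same DSL). The index name "products" and the fields "name", "location", and "in_stock" are invented for illustration, and depending on your client version you may need to pass the query via body= instead:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query = {
    "function_score": {
        # scored part: full-text relevance
        "query": {
            "bool": {
                "must": [{"match": {"name": "laptop"}}],
                # filter context: yes/no match, does not affect the score
                "filter": [{"term": {"in_stock": True}}],
            }
        },
        # boost products near the user; the score decays with distance
        "functions": [
            {"gauss": {"location": {"origin": "48.86,2.35", "scale": "10km"}}}
        ],
        "score_mode": "multiply",
        "boost_mode": "multiply",
    }
}

response = es.search(index="products", query=query, size=10)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("name"))
```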
Concerning your last point, the analysing part, I would recommend that you have a look at Kibana:
https://www.elastic.co/products/kibana
First, I'd recommend this article by Alexander Reelsen — Implementing A Modern E-Commerce Search. Great content on e-commerce product catalogues, filtering, and relevance in general (hint — there's no single optimal approach to achieve "relevance").
Secondly, I recently published a handbook for people just like you who need some good real-life examples; you can purchase it at https://elasticsearchbook.com. It contains concise guides on topics like faceted search, filtering, deduplication, and autocomplete.
Given a few words of input, I want to have a utility that will return a diverse set of relevant terms, phrases, or concepts. A caveat is that it would need to have a large graph of terms to begin with, or else the feature would not be very useful.
For example, submitting "baseball" would return
["shortstop", "Babe Ruth", "foul ball", "steroids", ... ]
Google Sets is the best example I can find of this kind of feature, but I can't use it since they have no public API (and I won't go against their TOS). Also, single-word input doesn't garner a very diverse set of results. I'm looking for a solution that goes off on tangents.
The closest I've experimented with is using Wikipedia's API to search Categories and Backlinks, but there's no way to directly sort those results by "relevance" or "popularity". Without that, the suggestion list is massive and all over the place, which is not immediately useful and very hard to whittle down.
Using a thesaurus could also work, minimally, but that would leave out proper nouns or tangentially relevant terms (like most of the results listed above).
I would happily reuse an open service, if one exists, but I haven't found anything sufficient.
I'm looking for either a way to implement this in-house with a decently populated starting set, or a free service I can reuse that offers this.
Have a solution? Thanks ahead of time!
UPDATE: Thank you for the incredibly dense & informative answers. I'll choose a winning answer in 6 to 12 months, when I'll hopefully understand what you've all suggested =)
You might be interested in WordNet. It takes a bit of linguistic knowledge to understand the API, but basically the system is a database of meaning-based links between English words, which is more or less what you're searching for. I'm sure I can dig up more information if you want it.
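For example, a quick way to pull meaning-based neighbours out of WordNet is via NLTK. This is just a sketch of the kind of lookup involved; note that WordNet won't give you proper nouns like "Babe Ruth":

```python
from nltk.corpus import wordnet as wn  # requires: pip install nltk, then nltk.download("wordnet")

def related_terms(word):
    terms = set()
    for synset in wn.synsets(word):
        terms.update(synset.lemma_names())              # synonyms
        for neighbour in synset.hypernyms() + synset.hyponyms():
            terms.update(neighbour.lemma_names())       # broader / narrower concepts
    terms.discard(word)
    return terms

print(sorted(related_terms("baseball")))
```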
Peter Norvig (director of research at Google) spoke about how they do this at Google (specifically mentioning Google Sets) in a Facebook Tech Talk. The idea is that a relatively simple algorithm on a huge dataset (e.g. the entire web) is much better than a complicated algorithm on a small data set.
You could look at Google's n-gram collection as a starting point; it would show you which terms tend to be grouped together. Norvig hinted that internally Google has up to 7-grams for use in things like Google Translate.
If you're more ambitious, you could download all of Wikipedia's articles in the language you desire and create your own n-gram database.
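As a sketch of what building such a database involves, here is a toy n-gram counter. In practice you would stream article text from the Wikipedia dump rather than feed it a single string:

```python
from collections import Counter
import re

def ngram_counts(text, n=2):
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

corpus = "the shortstop fielded the ball and the shortstop threw to first base"
print(ngram_counts(corpus, n=2).most_common(3))  # frequent n-grams hint at terms that belong together
```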
The problem is even more complicated if you just have a single word; check out this recent thesis for more details on word sense disambiguation.
It's not an easy problem, but it is useful as you mentioned. In the end, I think you'll find that a really successful implementation will have a relatively simple algorithm and a whole lot of data.
Take a look at the following two papers:
Clustering User Queries of a Search Engine [pdf]
Topic Detection by Clustering Keywords [pdf]
Here is my attempt at a very simplified explanation:
If we have a database of past user queries, we can define a similarity function between two queries. For example: number of words in common. Now for each query in our database, we compute its similarity with each other query, and remember the k most similar queries. The non-overlapping words from these can be returned as "related terms".
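A toy version of that idea, with similarity defined as the number of shared words and a made-up query log:

```python
def related_from_queries(query, past_queries, k=3):
    q = set(query.lower().split())
    # score every past query by the number of words it shares with ours
    scored = [(len(q & set(p.lower().split())), p) for p in past_queries]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(reverse=True)
    related = set()
    for _, p in scored[:k]:                         # k most similar queries
        related.update(set(p.lower().split()) - q)  # their non-overlapping words
    return related

log = ["baseball scores", "baseball babe ruth", "baseball steroids scandal", "soccer scores"]
print(related_from_queries("baseball", log))
```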
We can also take this approach with a database of documents containing information users might be searching for. We can define the similarity between two search terms as the number of documents containing both divided by the number of documents containing either. To decide which terms to test, we can scan the documents and throw out words that are either too common ('and', 'the', etc.) or that are too obscure.
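And a toy version of the document-based similarity (shared documents divided by documents containing either term), on a made-up corpus:

```python
def term_similarity(term_a, term_b, documents):
    docs_a = {i for i, d in enumerate(documents) if term_a in d.lower().split()}
    docs_b = {i for i, d in enumerate(documents) if term_b in d.lower().split()}
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0

docs = ["babe ruth played baseball", "baseball steroids report", "soccer world cup"]
print(term_similarity("baseball", "steroids", docs))  # 0.5 -- they co-occur
print(term_similarity("baseball", "soccer", docs))    # 0.0 -- they never do
```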
If our data permits, then we could look at which queries led users to choose which results, instead of comparing documents by content. For example, if we had data showing that users searching for "Celtics" and "Lakers" both ended up clicking on espn.com, then we could call these related terms.
If you're starting from scratch with no data about past user queries, then you can try Wikipedia, or the Bag of Words dataset as a database of documents. If you are looking for a database of user search terms and results, and if you are feeling adventurous, then you can take a look at the AOL Search Data.
Just wondering if there are any tips on improving search times (full-text).
How do large sites like stackoverflow, reddit, etc, implement their search functions?
(Sorry for the vagueness - I am a newbie)
Oh wow, there are entire courses and papers written on this...
Firstly, if you're storing in a database, there are indexes and different joins and views and all sorts of fun for speeding up your queries.
However, you've specified full-text search, so I'll direct you to this page, which has a comparison of the most common techniques. It is about arrays, but it will give you an understanding of how splitting or searching can be improved or varied.
Next, take a read of this Wikipedia article on string searching. There is the naive search, where you just scan the text, and approaches where you create an index first so that future searches can jump straight to the right place - like chapters or page numbers in a book.
The index or pattern-storage techniques are also very useful in compression, and that's yet another way to help speed up searching: if you build the compressed string, you can be clever and jump to the compressed section, extract, and compare, depending on whether you have a limited number of patterns you are searching for or whether anything goes.
Then there's fuzzy searching as well, where you don't require an exact match; you may score candidates on some 'closeness' measure, like the percentage of characters that match.
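As a tiny illustration of that 'closeness score' idea, Python's standard library difflib gives you a character-overlap ratio you can threshold:

```python
from difflib import SequenceMatcher

def fuzzy_find(needle, candidates, threshold=0.8):
    return [c for c in candidates
            if SequenceMatcher(None, needle.lower(), c.lower()).ratio() >= threshold]

print(fuzzy_find("recieve", ["receive", "recipe", "relieve"]))  # -> ['receive', 'relieve']
```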
Hopefully that gives you a good starting point at least!
Have a read of the MySQL Guide to Fine-Tuning Full-Text Search. It describes many techniques the engine can use to make searches faster or more exhaustive.
Apache Lucene is the canonical open source full text indexing engine. I'd start there if I needed to build a search feature for a web site.
When developing a database of articles in a Knowledge Base (for example) - what are the best ways to sort and display the most relevant answers to a user's question?
Would you use additional data such as keyword weighting based on whether previous users found the article of help, or do you find a simple keyword matching algorithm to be sufficient?
Perhaps the easiest and most naive approach that will give immediately useful results would be to implement tf-idf:
Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
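Here is a bare-bones sketch of tf-idf scoring applied to a made-up knowledge base, just to make the idea concrete (a real system would use a library or search engine rather than this):

```python
import math
from collections import Counter

def score(query, doc, corpus):
    """Sum of tf-idf weights of the query terms in one document."""
    total = 0.0
    for term in query.lower().split():
        tf = Counter(doc)[term] / len(doc)
        df = sum(1 for d in corpus if term in d)
        idf = math.log(len(corpus) / df) if df else 0.0
        total += tf * idf
    return total

kb = [
    "how to reset a forgotten password".split(),
    "password complexity requirements explained".split(),
    "configuring email forwarding rules".split(),
]
query = "forgotten password"
for doc in sorted(kb, key=lambda d: score(query, d, kb), reverse=True):
    print(round(score(query, doc, kb), 3), " ".join(doc))
```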
In a recent related question of mine here I learned of an excellent free book on this topic which you can download or read online:
An Introduction to Information Retrieval
That's a hard question, and companies like Google are putting a lot of effort into addressing it. Have a look at Google Enterprise Search Appliance or Exalead Enterprise Search.
Then, as a personal opinion, I don't think any "naive" approach is going to improve the results much compared to plain keyword search with ordering by the number of views on the documents.
If you have the possibility to expose your knowledge base to the web, then just do it, and let your favorite search engine handle the search for you.
I think the angle here is not the retrieval itself... it's about scoring the relevance of the information retrieved (a more reactive, passive approach), which can later be used to improve the search engine.
I guess you can try:
kNN on tf-idf for retrieving information
Hand-tagging the retrieved results with a relevancy score
Then regressing on those scores to predict the score for an unknown search result, and sorting by it.
Just a thought...
The third point is actually based on the Rocchio algorithm. You can see it here.
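A rough sketch of that pipeline using scikit-learn, with a tiny made-up set of hand-tagged relevance scores (an illustration of the idea, not a production setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor

docs = [
    "how to reset your password",
    "password policy and complexity rules",
    "setting up email forwarding",
    "troubleshooting login failures",
]
relevance = [1.0, 0.8, 0.0, 0.4]   # hand-tagged relevance for some query, invented here

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # step 1: tf-idf vectors

model = KNeighborsRegressor(n_neighbors=2, metric="cosine")
model.fit(X, relevance)                      # steps 2-3: learn from the hand tags

new_doc = ["recovering a lost password"]
print(model.predict(vectorizer.transform(new_doc)))   # predicted relevance, used for sorting
```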
A little more specificity about your exact problem would be good. There are a lot of different techniques you can use, many of them driven by other pieces of data. You can of course use Lucene and build your own indexes; there are bindings to Lucene for many languages. Moving up, there is also the Solr project, which is Lucene with a lot of tools and extra functionality around it. That may be more along the lines of what you are looking for.
Intent is tricky, and most modern search engines rely on statistical intent to aid in the ordering of results. You can always have a "Was this article useful?" button and store the query text that leads to useful documents. You could then add a layer of information to the index to boost specific words or phrases and help them point to certain documents.
Some things to think about: How many documents are there? What is the average length? Are they updated frequently? What do users do with the documents? What does the spread of unique words to documents look like? (More simply: is it easy to match a query to specific documents based on distinctive features?)
If it is on the web, you can always make a Google Custom Search engine that just searches your site, although you may find this to be sub-optimal for a variety of reasons.
You can always start with a simple index and gradually make it more sophisticated by talking with users and capturing data.
Keyword matching is not enough when dealing with questions; you need to understand intent, which is, as joannes says, a very hot topic in search.
What searching algorithm/concept is used in Google?
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Indexing
If you want to get down to basics:
Google uses an inverted index of the Internet. What this means is that Google has an index of all the pages it has crawled, based on the terms in each page. For instance, the term Google maps to this page, the Google home page, and the Wikipedia article for Google, amongst others.
Thus, when you go to Google and type "Google" into the search box, Google checks its index of all terms available on the Internet and finds the entry for the term "Google" and with it the list of all pages that have that term referenced in it.
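A minimal sketch of the inverted-index lookup being described, with a few made-up pages:

```python
from collections import defaultdict

pages = {
    "google.com": "google search engine home page",
    "en.wikipedia.org/wiki/Google": "google is a search company",
    "example.com": "an example page about maps",
}

index = defaultdict(set)                    # term -> pages containing it
for url, text in pages.items():
    for term in text.lower().split():
        index[term].add(url)

def lookup(query):
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(lookup("google"))          # every page mentioning the term
print(lookup("google search"))   # intersection of the postings lists
```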
For veteran users:
Google's index goes beyond your simple inverted index, however. This is why Google is the best. Google's crawlers (spiders) are smart. Very smart. Beyond just keeping track of the terms that are on any given web page, they also keep track of words that are on related pages and link those to the given document.
In other words, if a page has the term Google in it and the page has a link to or is linked from another web page, the other page may be referenced in the index under the term Google as well. All this and more go into why a given page is returned for a given query.
If you want to go into why pages are ordered the way they are in your search results, that gets into even more interesting stuff.
Ranking
To get down to basics:
Perhaps one of the most basic algorithms a search engine can use to sort your results is known as term frequency-inverse document frequency (tf-idf). Simply put, this means that your results will be ordered by the relative importance of your search terms in the document. In other words, a document that has 10 pages and lists the word Google once is not nearly as important as a document that has 1 page and lists the word Google ten times.
For veteran users:
Again, Google does quite a bit more than your basic search engine when it comes to ranking results. Google has implemented the aforementioned patented PageRank algorithm. In short, PageRank enhances the tf-idf algorithm by taking into account the popularity/importance of a given page. At this point, popularity/importance may be judged by any number of factors that Google just won't tell us. However, at the most basic of levels, Google can tell that one page is more important than another because loads and loads of other pages link to it.
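To make the core idea concrete, here is a very small power-iteration sketch of PageRank (real PageRank adds many refinements, and the exact details are, as noted, not public):

```python
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages              # dangling pages spread their rank evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))  # "c" ends up on top
```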
Google's patented PigeonRank™
Wow, they initially posted this 7 years ago from Wednesday ...
PageRank is a link analysis algorithm used by Google for its search engine, but the patent was assigned to Stanford University.
I think "The Anatomy of a Large-Scale Hypertextual Web Search Engine" is a little outdated.
Here is a recent talk about scalability: Challenges in Building Large-Scale Information Retrieval Systems
An inverted index and MapReduce are the basics of most search engines (I believe). You create an index on the content and run queries against that index to determine relevance. Google, however, does much more than keep a simple index of where each word occurs; it also tracks how many times a word appears, where it appears in relation to other words, the ordering, and so on. Another simple concept that's used is "stop words", which may include things like "and", "the", and so on (basically common words that occur often and are generally not the focus of a query). In addition, they employ things like PageRank (mentioned by TStamper) to order pages by relevance and importance.
MapReduce is basically taking one job, dividing it into smaller jobs, and letting those smaller jobs run on many systems (partly for scalability and partly for speed). If I recall correctly, Google was able to distribute jobs to "average" computers instead of server-grade machines. Since the processing capability of a single computer is reaching a peak, many technologies are heading toward cloud computing, where a job is done by many physical machines.
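A toy MapReduce-style word count, just to illustrate the map and reduce phases (real MapReduce distributes the chunks across machines rather than local processes):

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_chunk(text):
    """Map phase: count words in one chunk (this is the part that spreads across machines)."""
    return Counter(text.lower().split())

def merge(a, b):
    """Reduce phase: combine partial counts."""
    return a + b

if __name__ == "__main__":
    chunks = ["the quick brown fox", "the lazy dog and the fox", "quick quick slow"]
    with Pool(processes=2) as pool:
        partials = pool.map(map_chunk, chunks)       # run the maps in parallel
    totals = reduce(merge, partials, Counter())      # fold the partial results together
    print(totals.most_common(3))
```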
I'm not sure how much searching Google does; it's more accurately described as crawling. The difference is that they start at specific points, crawl to anything reachable, and repeat until they hit some sort of dead end.
While looking into the PageRank algorithm and similar topics, I was disturbed to discover that the introduction of personalized search at the turn of the year (not widely commented on) seems to change quite a lot - see Failure of the Google Gold Standard and
Google’s Personalized Results
This question cannot be answered canonically. The algorithms used by Google (and other search engines) are among their most closely guarded secrets and change constantly. Any correct answer could be invalid a month or a year later.
(I know this doesn't really answer the question, but that's the point, there is no possible answer.)