I'm building a small vertical search engine using Elasticsearch as the indexer and Nutch as the crawler. I was using the HTML title field to build search suggestions in ES with an edge n-gram strategy, thinking that the title field would be a good source because it should contain terms relevant to the page's subject, and that it would keep the suggestion index small, whether the suggestions are single words or phrases. However, in testing so far, it's not working out as I'd hoped: there just aren't that many suggestions appearing.
At present I'm only testing with about 10 sites, but will eventually reach about 500 or so. I suspect that because of the small data set (10 sites, HTML title field only) there aren't enough terms or phrases available to make good suggestions, at least not phrase suggestions.
Would it be advisable to just crawl more sites to create more suggestions (terms and phrases) with the edge n-gram strategy on the title field, or should I use the content field (which is obviously much larger than the title field)?
I'm trying to fine-tune this to get more search suggestions, especially phrase suggestions, while being mindful of the index size so that performance doesn't suffer. Any ideas?
These days one could say that suggestions are even more important than the search results themselves, which is slightly nonsensical, I know. But users tend to assume that if there is no suggestion, there is no search result. So make sure every searchable field is properly reflected in your suggestions, in particular your content. And optimize later: don't look at performance too early. 500 sites does not sound like a lot of documents to index anyway. What kind of hardware are you using?
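To get a feel for how many suggestion entries a field can produce, here is a minimal plain-Python sketch of what an edge n-gram strategy generates. It is not Elasticsearch code; the function names and the min_gram/max_gram defaults are assumptions chosen for illustration.

```python
def edge_ngrams(text, min_gram=2, max_gram=10):
    """Return the prefixes (edge n-grams) of each whitespace-separated token."""
    grams = []
    for token in text.lower().split():
        upper = min(len(token), max_gram)
        for n in range(min_gram, upper + 1):
            grams.append(token[:n])
    return grams


def phrase_edge_ngrams(text, min_gram=2, max_gram=20):
    """Return prefixes of the whole normalized phrase, for phrase suggestions."""
    phrase = " ".join(text.lower().split())
    upper = min(len(phrase), max_gram)
    return [phrase[:n] for n in range(min_gram, upper + 1)]


if __name__ == "__main__":
    title = "Search Engine"
    print(edge_ngrams(title, 2, 4))
    print(phrase_edge_ngrams(title, 2, 8))
```

Running this over a handful of titles and then over a sample of page content should give a rough idea of how much larger the suggestion vocabulary gets when the content field is included.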
I am picking one of the 2 search engines above for a project, and so far both of them appear to be similar in functionality.
At least for the requirements that I have:
Proximity Search
Boolean queries
Query over all fields
Retrieval of original indexed document
Real-time search: as soon as I index a document, it should be available
Besides that, I will have a single document type, with about 40 million documents, roughly 2 TB of data.
That's basically what I need. My questions are:
Does one search engine perform better than the other considering my dataset, such as offering better indexing rates or search rates?
Am I losing anything by going with Solr (considering my requirements)?
Solr is my choice at the moment.
Some thoughts:
Nobody can tell you which one would perform best for you unless you benchmark under your own realistic conditions.
For 99% of users, either of the two would work perfectly.
If you want to go with one of them (for any reason: you like it, your devs want to try it, you like the logo, whatever), then don't sweat it; both are very capable.
If I have an elastic index of news articles, with the news body text in a newsBody field, can you do a search to see if another newsBody 'matches' one in the index? The other newsBody text may have slight variations however.
So not exact matching, but being able to test for similarity between large bodies of text. This is important as often news articles will be nearly identical but differ in ~30 out of 400 words.
So I'd like to be able to pass in a newsBody, and query it against the whole index, looking for similarity to any 'matches'.
I think the similarity module may help, but haven't got anywhere yet: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html
Thanks,
Daniel
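To make the kind of comparison being asked about concrete, here is a minimal sketch of near-duplicate scoring with word shingles and Jaccard similarity. This is a plain-Python illustration of one common technique, not an Elasticsearch API; the function names and the shingle size k=3 are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(text_a, text_b, k=3):
    """Score two bodies of text between 0.0 (disjoint) and 1.0 (identical)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))


if __name__ == "__main__":
    original = "the council voted on the new budget late on tuesday night"
    variant = "the council voted on the new budget late on monday night"
    print(similarity(original, variant))
```

Two articles that differ in a few dozen words out of several hundred would still share most of their shingles, so their score stays high, while unrelated articles score close to zero. Where to set the cut-off for a "match" is a judgment call that depends on the data.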
I am hosting a MongoDB database for a service that supports full text searching on a collection with 6.8 million records.
Its text index includes ten fields with varying weights.
Most searches take less than a second. Some searches take two to three seconds. However, some searches take 15 - 60 seconds! The 15-60 second search cases are unacceptable for my application. I need to find a way to speed those up.
Searching takes 15-60 seconds when words that are very common in the index are used in the search query.
It seems that the text search feature does not support lazy parameters. My first thought was to cache a list of the 50 most common words in my text index and then ask MongoDB to evaluate those last (lazily), on top of the filtered results returned by the less common terms. Hopefully people are still with me. For example, say I have a query "products chocolate", where "products" is common and "chocolate" is uncommon. I would like to be able to ask MongoDB to evaluate "chocolate" first, and then filter those results with the "products" term. Does anyone know of a way to achieve this?
I can achieve the above scenario by omitting the most common words (e.g. "products") from the db query and then reapplying the common-term filter on the application side after it has received the records found by the db. It is preferable for all query logic to happen on the database, but I am open to application-side processing for a speed payoff.
There are still some holes in this design. If a user only searches common terms, I have no choice but to hit the database with all the terms. From preliminary reading, I gather that it is not recommended (or not supported) to have multiple text indexes (with different names) on the same collection. My plan is to create two identical tables, each with my 6.8M records, with different indexes: one for common words and one for uncommon words. This feels kludgy and clunky, but I am willing to do this for a speed increase.
Does anyone have any insight and/or advice on how to speed up this system? I'd like as much processing to happen on the database as possible to keep it fast. I'm sure my little 6.8M record table is not the largest that MongoDB has seen. Thanks!
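The application-side approach described above can be sketched roughly as follows. This is only an illustration under assumptions: the in-memory list stands in for the database query, and COMMON_WORDS, split_terms, contains_all and search_rare_first are hypothetical names, not MongoDB APIs.

```python
# Hypothetical cached list of the most common words in the text index.
COMMON_WORDS = {"products"}


def split_terms(query, common_words):
    """Split query terms into (uncommon, common) lists, preserving order."""
    terms = query.lower().split()
    rare = [t for t in terms if t not in common_words]
    common = [t for t in terms if t in common_words]
    return rare, common


def contains_all(text, terms):
    """True if every term appears as a word in text."""
    words = set(text.lower().split())
    return all(t in words for t in terms)


def search_rare_first(query, records, common_words):
    """Match on uncommon terms first, then filter by the common terms."""
    rare, common = split_terms(query, common_words)
    if not rare:
        # Only common terms were given: no choice but to search with all of them.
        return [r for r in records if contains_all(r, common)]
    # Stand-in for the database query on the uncommon terms only.
    candidates = [r for r in records if contains_all(r, rare)]
    # Application-side filter using the common terms.
    return [r for r in candidates if contains_all(r, common)]


if __name__ == "__main__":
    records = ["chocolate products", "chocolate cake", "garden products"]
    print(search_rare_first("products chocolate", records, COMMON_WORDS))
```

The second pass only ever touches the candidates returned for the uncommon terms, which is where the hoped-for speed-up would come from.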
Well, I worked around these performance issues by letting MongoDB full text search run in OR-based mode. I'm prioritizing my results by fine-tuning the weights on my indexed fields and just ordering by rank. I do get more results than desired, but that's not a huge problem because the weighted results that appear at the top will most likely be consumed before my user gets to the less relevant results at the bottom.
If anyone is struggling with MongoDB text search performance using AND searching only, just switch back to OR and control your results using weights. It performs far better.
hth
This is the exact same issue as $all versus $in. $all only uses the index for the first keyword in the array. I believe you're seeing the same issue here, which is why OR (a.k.a. IN) works for you.
So, I'll be storing millions of sentences in a database, each with an author. I need to be able to efficiently search for a sentence and return the author. Now, I'd like to be able to misspell a word or forget a word or two in this sentence, and have the application still be able to match (fuzzy-esque). Can anyone point me in the right direction? How does Google do this? I can search for lyrics on Google, for instance, and it will return the song with those lyrics. I'm looking to do the same thing.
Thanks all.
If fuzzy makes things too complicated, then I can deal with just an efficient sentence search.
If you're writing in Java, you can try Lucene.
Shouldn't it really be "document" and author instead of individual sentences?
For full text search, look at the inverted index data structure.
This is how search engines do it
samples of code
UPDATE:
Also, if you're working on a distributed system, check out Hadoop, an open source alternative to Google's MapReduce.
Full Text Indexing on SQL Server or Oracle will most likely be what you're after right out of the box. They can go fuzzy, use word roots and other clever stuff.
I can't comment on other DB engines though a quick google shows most will have something similar. For some reason I expect them to be more limited in the fuzziness.
Indeed, fuzzy matching is not a simple thing to do. Some databases implement some kind of fuzzy search, but depending on the method used and your data, your results may vary. Here's a link that explains fuzzy searches in SQL Server:
http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
As for the sentence search, most db engines implement full text search/indexing that you may want to look at. It comes with trade-offs in terms of performance and storage.
How does google do this?
Using inverted indexes. The details are proprietary, but you can bet your last dollar that there is a lot of replication, and that the indexes are kept in memory, so that they can handle the vast number of search requests they get per second.
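For anyone who has not met the data structure, a toy inverted index can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular search engine implements it; build_index and search_index are hypothetical names.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_index(index, query):
    """Return ids of documents that contain every word in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    # Intersect the smallest posting list first to keep the work small.
    postings.sort(key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result


if __name__ == "__main__":
    docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
    index = build_index(docs)
    print(search_index(index, "quick"))
    print(search_index(index, "the dog"))
```

A lookup touches only the posting lists of the query words rather than scanning every sentence, which is why this structure scales to millions of documents.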
Just wondering if there are any tips on improving search times (full-text).
How do large sites like Stack Overflow, Reddit, etc. implement their search functions?
(Sorry for the vagueness, I am a newbie.)
Oh wow, there are entire courses and papers written on this...
Firstly, if you're storing in a database, there are indexes and different joins and views and all sorts of fun for speeding up your queries.
However you've specified full text search, so I'll direct you to this page which has a comparison of the most common techniques. Now this is for arrays, but will give you an understanding of how splitting or searching can be improved or varied.
Next, take a read of this Wikipedia article on string searching. There is the naive search, where you just scan the text, and there are approaches where you create an index first, so that future searches let you jump straight to the right place, like chapters or page numbers in a book.
The index or pattern storage techniques are also very useful in compression, and that's yet another way to help speed up searching - if you build the compressed string, you can be clever and jump to the compressed section, extract and compare, depending on whether you have a limited number of patterns that you are searching for, or whether you have anything-goes.
Then there's fuzzy searching as well, where you don't get an exact match - you may do this on some 'closeness' score - like a percentage of character matches.
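A 'closeness' score of this kind can be sketched with Python's standard difflib module. This is a minimal illustration, assuming a similarity ratio is good enough for your data; the closeness and best_match helpers and the 0.6 threshold are hypothetical choices.

```python
from difflib import SequenceMatcher


def closeness(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query, candidates, threshold=0.6):
    """Return the closest candidate, or None if nothing reaches the threshold."""
    scored = [(closeness(query, c), c) for c in candidates]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None


if __name__ == "__main__":
    sentences = ["hello world", "goodbye"]
    print(best_match("helo wrld", sentences))
```

Comparing the query against every stored string like this is slow on large collections, so in practice it is usually applied only to a short list of candidates that an index has already narrowed down.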
Hopefully that gives you a good starting point at least!
Have a read of the MySQL Guide to Fine-Tuning Full-Text Search. It describes many techniques the engine can use to make searches faster or more exhaustive.
Apache Lucene is the canonical open source full text indexing engine. I'd start there if I needed to build a search feature for a web site.