Text Search Algorithm

I have a table with approximately a million rows, each holding a text of 500-600 words, and I'm searching for a word within these texts. But iterating over the rows and searching within each text is not efficient time-wise. Any ideas?

I would suggest Lucene
http://lucene.apache.org/java/docs/index.html

With such scarce information, I suggest you have a look at inverted indexes. They are easy to build and offer fast retrieval for your case, as far as I can tell. They are also very easy to implement in any kind of database environment in case you cannot switch to a database that already supports them.
If you give more information, maybe another solution would work as well.
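To make that concrete, here is a minimal in-memory sketch of an inverted index in Python. It assumes the rows are available as (row_id, text) pairs; in a real setup you would persist the word-to-rows mapping in its own table and update it when rows change.

```python
from collections import defaultdict
import re

def build_inverted_index(rows):
    """Map each lowercased word to the set of row ids that contain it.

    rows: iterable of (row_id, text) pairs.
    """
    index = defaultdict(set)
    for row_id, text in rows:
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(row_id)
    return index

def search(index, word):
    """Return the ids of all rows containing the word (exact, case-insensitive)."""
    return index.get(word.lower(), set())

# Usage: one linear pass to build the index, then each lookup is a dictionary hit
# instead of a scan over a million 500-word texts.
rows = [(1, "The quick brown fox"), (2, "A quick test"), (3, "Nothing here")]
index = build_inverted_index(rows)
print(search(index, "quick"))   # {1, 2}
```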

Related

Benefits and trade-offs for improving text search on small data in PostgreSQL

I have 4 text columns of interest.
Each column is up to about 100 characters.
The text in 3 of the columns is mostly Latin words. (The data is a biological catalog, and these are names of things.)
The data is currently about 500 rows. I don't expect this to grow beyond 1000.
A small number of users (under 10) will have editing privileges to add, update, and delete data. I do not expect these users to put a heavy load on the database.
So all this suggests a pretty small data set to consider.
I need to perform a search on all 4 columns for rows where at least 1 column contains the search text (case insensitive). The query will be issued (and the results served) via a web application. I'm a bit lost about how to approach it.
PostgreSQL offers a few options for improving text searching speed. The possible options built into PostgreSQL I've been considering are:
1. Don't try to index this at all. Just use ILIKE, LIKE on lower, or similar. (Without an index?)
2. Index with pg_trgm to improve search speed. I would assume that I would need to index the concatenation somehow.
3. Full text searching. I assume this would involve concatenating for the index also.
Unfortunately, I'm not really familiar with the expected performance of any of these or their benefits and trade-offs, so it's hard to know which things I should try first and which I shouldn't even consider. Some things I have read suggest that the indexing for 2 and 3 is pretty slow, which conflicts with the fact that I'll have occasional modifications going on. And the mixed language makes full text search seem unattractive, since it appears to be language based, unless it can handle multiple languages simultaneously. Should I expect that for data this small a simple ILIKE, or maybe a LIKE on lower, is fast enough? Or is the indexing fast enough for the low load of modifications on data this small? Would I be better off looking for something outside the database?
Granted, I would have to actually benchmark all of these to really know for sure what's fastest, but unfortunately I don't have much time for this project. So what are the benefits and trade-offs of these methods? Which of these options are not appropriate for solving this type of problem? What are some other types of solutions (including potentially outside the database) worth considering?
(I suppose I might find some kind of beginner's tutorial on text searching in PG useful, but my searches mostly turn up Full Text Search, and I don't even know whether that's useful for me.)
I'm on PG 9.2.4, so any goodies pre-9.3 are an option.
Update: I've expanded this answer into a detailed blog post.
Rather than focusing purely on speed, please consider search semantics first. Define your requirements.
For example, do users need to be able to differentiate based on the order of terms? Should radiata pinus find pinus radiata? Does the same rule apply to words within a column as between columns?
Are spaces always word separators, or are spaces within a column part of the search term?
Do you need wildcards? If so, do you need only left-anchored wildcards (think staph%) or do you need right-anchored or infix wildcards too (%ccus, p%s)? Only pg_trgm will help you with infix wildcards. Suffix wildcards can be handled by an index on the reverse() of a word, but that gets clumsy quickly, so in practice pg_trgm is the best option there.
If you're mostly searching for discrete words and word-order isn't important, Pg's full-text search with to_tsvector and to_tsquery will be desirable. It supports left-anchored wildcard searches, weighting, categories, etc.
If you're mostly doing prefix searches of discrete columns then simple LIKE queries on a regular b-tree index per column will be the way to go.
So. Figure out what you need, then how to do it. Your current uncertainty probably stems partly from not really knowing quite what you want.
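For illustration, here is a rough sketch of the full-text option in Python with psycopg2. The "specimens" table, columns c1..c4, and dbname are hypothetical stand-ins for the real schema; the 'simple' configuration skips language-specific stemming, which sidesteps the mixed Latin/English concern.

```python
# Rough sketch of the full text search option, using psycopg2. The
# "specimens" table and columns c1..c4 are hypothetical stand-ins.
import psycopg2

conn = psycopg2.connect("dbname=catalog")
cur = conn.cursor()

# One-time setup: GIN index over all four columns folded into one tsvector.
# The 'simple' configuration skips language-specific stemming, so the Latin
# names are indexed as-is.
cur.execute("""
    CREATE INDEX specimens_fts_idx ON specimens USING gin (
        to_tsvector('simple',
            coalesce(c1,'') || ' ' || coalesce(c2,'') || ' ' ||
            coalesce(c3,'') || ' ' || coalesce(c4,''))
    )
""")
conn.commit()

# Word search across all four columns; plainto_tsquery ANDs the words, and
# the WHERE expression matches the indexed expression so the index is usable.
cur.execute("""
    SELECT *
    FROM specimens
    WHERE to_tsvector('simple',
            coalesce(c1,'') || ' ' || coalesce(c2,'') || ' ' ||
            coalesce(c3,'') || ' ' || coalesce(c4,''))
          @@ plainto_tsquery('simple', %s)
""", ("pinus radiata",))
print(cur.fetchall())
```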
For 1000 rows, I would guess that LIKE together with lower() should be fast enough. After a couple of queries the table will most probably be completely cached.
Regarding the indexing using pg_trgm: you are talking about "occasional" updates/inserts to the table. I would think that the additional costs of using a trigram index would only show up when you update/insert that table a lot - like several times a second.
If "occasional" only means several times an hour (or even less), then I doubt you'd see the difference in real life. I think somewhere on Depesz's blog there was also an article that compared the insert speed with and without a trigram index, but I can't find it anymore.
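To make the two simpler options concrete, here is a rough sketch against the same hypothetical specimens table (again via psycopg2); the trigram index is only worth adding if the unindexed queries turn out to be too slow at this size.

```python
# Illustration of option 1 (no index) and option 2 (pg_trgm), using psycopg2
# and the hypothetical "specimens" table with columns c1..c4.
import psycopg2

conn = psycopg2.connect("dbname=catalog")
cur = conn.cursor()

# Option 1: no index, just ILIKE on each column. At ~1000 rows a sequential
# scan is usually fast and the whole table stays cached.
pattern = "%staph%"
cur.execute("""
    SELECT * FROM specimens
    WHERE c1 ILIKE %s OR c2 ILIKE %s OR c3 ILIKE %s OR c4 ILIKE %s
""", (pattern,) * 4)
rows = cur.fetchall()

# Option 2: a pg_trgm GIN index on the concatenation (needs
# CREATE EXTENSION pg_trgm first). It also supports infix wildcards (%ccus),
# but queries only use it when they filter on the same concatenated
# expression, e.g. WHERE (coalesce(c1,'') || ' ' || ...) ILIKE '%staph%'.
cur.execute("""
    CREATE INDEX specimens_trgm_idx ON specimens USING gin (
        (coalesce(c1,'') || ' ' || coalesce(c2,'') || ' ' ||
         coalesce(c3,'') || ' ' || coalesce(c4,'')) gin_trgm_ops
    )
""")
conn.commit()
```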

MongoDB text index search slow for common words in large table

I am hosting a MongoDB database for a service that supports full text searching on a collection with 6.8 million records.
Its text index includes ten fields with varying weights.
Most searches take less than a second. Some searches take two to three seconds. However, some searches take 15 - 60 seconds! The 15-60 second search cases are unacceptable for my application. I need to find a way to speed those up.
Searching takes 15-60 seconds when words that are very common in the index are used in the search query.
It seems that the text search feature does not support lazy parameters. My first thought was to cache a list of the 50 most common words in my text index and then ask MongoDB to evaluate those last (lazily), on top of the filtered results returned by the less common parameters. Hopefully people are still with me. For example, say I have a query "products chocolate", where products is common and chocolate is uncommon. I would like to be able to ask MongoDB to evaluate "chocolate" first, and then filter those results with the "products" term. Does anyone know of a way to achieve this?
I can achieve the above scenario by omitting the most common words (i.e. "products") from the db query and then reapplying the common-term filter on the application side after it has received the records found by the db. It is preferable for all query logic to happen on the database, but I am open to application-side processing for a speed payoff.
There are still some holes in this design. If a user only searches common terms, I have no choice but to hit the database with all the terms. From preliminary reading, I gather that it is not recommended (or not supported) to have multiple text indexes (with different names) on the same collection. My plan is to create two identical tables, each with my 6.8M records, with different indexes - one for common words and one for uncommon words. This feels kludgy and clunky, but I am willing to do it for a speed increase.
Does anyone have any insight and/or advice on how to speed up this system? I'd like as much processing to happen on the database as possible to keep it fast. I'm sure my little 6.8M record table is not the largest that MongoDB has seen. Thanks!
Well, I worked around these performance issues by allowing MongoDB full text search to search in an OR-based format. I'm prioritizing my results by fine-tuning the weights on my indexed fields and just ordering by rank. I do get more results than desired, but that's not a huge problem, because the weighted results that appear at the top will most likely be consumed before my user gets to the less relevant results at the bottom.
If anyone is struggling with MongoDB text search performance using AND-only searching, just switch back to OR and control your results using weights. It performs leaps and bounds better.
hth
This is the exact same issue as $all versus $in. $all only uses the index for the first keyword in the array. I believe you're seeing the same issue here, which is why the OR (a.k.a. IN) approach works for you.
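For illustration, here is a pymongo sketch of the OR-based, weight-driven approach described above; the database, collection, field names, and weights are made up.

```python
# Weighted text index plus OR-style $text search, using pymongo. The
# database/collection/field names and weights here are hypothetical.
from pymongo import MongoClient, TEXT

coll = MongoClient()["shop"]["products"]

# One-time setup: a single text index with per-field weights.
coll.create_index(
    [("title", TEXT), ("description", TEXT), ("tags", TEXT)],
    weights={"title": 10, "description": 5, "tags": 2},
    name="weighted_text_idx",
)

# Space-separated terms are OR-ed by $text; sorting by the relevance score
# floats documents that match the rarer, heavier terms to the top.
cursor = (
    coll.find(
        {"$text": {"$search": "products chocolate"}},
        {"score": {"$meta": "textScore"}},
    )
    .sort([("score", {"$meta": "textScore"})])
    .limit(20)
)

for doc in cursor:
    print(doc.get("title"), doc["score"])
```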

Problem: Need to look up a sentence in a database of millions of sentences?

So, I'll be storing millions of sentences in a database, each with an author. I need to be able to efficiently search for a sentence and return the author. Now, I'd like to be able to misspell a word or forget a word or two in the sentence and have the application still be able to match it (fuzzy-esque). Can anyone point me in the right direction? How does Google do this? I can search for lyrics on Google, for instance, and it will return the song with those lyrics. I'm looking to do the same thing.
Thanks all.
If fuzzy makes things too complicated, then I can deal with just an efficient sentence search.
If you're writing in Java, you can try Lucene.
Shouldn't it really be "document" and author instead of individual sentences?
For full text search, check the inverted index data structure.
This is how search engines do it - there are samples of code around.
UPDATE: also, if you're working on a distributed system, check Hadoop - an open source alternative to Google's MapReduce.
Full Text Indexing on SQL Server or Oracle will most likely be what you're after, right out of the box. They can go fuzzy, use word roots and other clever stuff.
I can't comment on other DB engines though a quick google shows most will have something similar. For some reason I expect them to be more limited in the fuzziness.
Indeed, fuzzy matching is not a simple thing to do. Although some databases implement some kind of fuzzy search, depending on the method used and your data, your results may vary. Here's a link that explains fuzzy searches in SQL Server:
http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
As for the sentence search, most db engines implement full text search/indexing that you may want to look at. It comes with trade-offs in terms of performance and storage, but it is worth investigating.
How does Google do this?
Using inverted indexes. The details are proprietary, but you can bet your last dollar that there is a lot of replication and storing of the indexes, etc., in memory so that they can handle the vast number of search requests they get per second.
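A toy sketch of how the pieces suggested in this thread can fit together: an inverted index narrows millions of sentences down to a small candidate set, and a character-level similarity score then tolerates a misspelled or missing word. The in-memory storage and the use of difflib as the scorer are illustrative assumptions, not a production design.

```python
# Toy sketch: inverted index to find candidate sentences, then difflib's
# similarity ratio as a crude fuzziness score. All data is in memory here.
from collections import defaultdict
from difflib import SequenceMatcher
import re

def tokenize(sentence):
    return re.findall(r"\w+", sentence.lower())

def build_index(sentences):
    """sentences: dict of sentence_id -> (text, author)."""
    index = defaultdict(set)
    for sid, (text, _author) in sentences.items():
        for word in tokenize(text):
            index[word].add(sid)
    return index

def fuzzy_lookup(query, sentences, index, min_ratio=0.75):
    # Any sentence sharing at least one word with the query is a candidate,
    # which keeps the expensive comparison off the millions of non-matches.
    candidates = set()
    for word in tokenize(query):
        candidates |= index.get(word, set())
    # Rank candidates by a character-level closeness score.
    scored = []
    for sid in candidates:
        text, author = sentences[sid]
        ratio = SequenceMatcher(None, query.lower(), text.lower()).ratio()
        if ratio >= min_ratio:
            scored.append((ratio, text, author))
    return sorted(scored, reverse=True)

sentences = {
    1: ("The quick brown fox jumps over the lazy dog", "Anon"),
    2: ("To be or not to be, that is the question", "Shakespeare"),
}
index = build_index(sentences)
print(fuzzy_lookup("to be or not to be that is the questin", sentences, index))
```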

Can anyone point me toward a content relevance algorithm?

A new project with some interesting requirements has arrived on my desk. I need to develop a searchable directory of businesses, with a focus on delivering relevant results based on arbitrary search queries. The businesses can be of any niche; there's no one area that is more represented than another.
When googling for things like "search algorithm" or "content relevance algorithm," all I get are references to Google's "Mystical Algorithm of the Old Gods" and SEO firms.
Does the relevance value of MySQL's full text Match() function have what it takes for the task? I've never used it, but I'm definitely going to do some testing. Also, since this will largely be a human edited directory, I can assume that we can add weighted factors like tagging and categories. What would be a good way to combine these factors with MySQL's Match() relevancy?
I'm also open to ideas that I've not discussed here.
For an example of information-retrieval-based techniques, look up TF-IDF or BM25.
For machine-learning-based techniques, look up RankNet and its variants from MSR.
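A bare-bones TF-IDF sketch to show what that kind of scoring looks like; real engines (Lucene, Solr, BM25 variants) refine this with length normalization and term saturation, and editorial signals like tags or categories can be folded in as extra weight on top of the score.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def tf_idf_scores(query, documents):
    """Score each document against the query with a plain TF-IDF sum.

    documents: dict of doc_id -> text.
    """
    n_docs = len(documents)
    tokenized = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in df:
                tf = counts[term] / len(tokens)      # term frequency in this doc
                idf = math.log(n_docs / df[term])    # rarer terms weigh more
                score += tf * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "a": "organic chocolate shop and bakery",
    "b": "chocolate products wholesale distributor",
    "c": "plumbing and heating services",
}
print(tf_idf_scores("chocolate products", docs))
```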
If you have hand-edited data, have a look at Oracle text search. In one of my previous projects we had some good results with it.
I was not directly involved in the database setup, but I know that the results were very welcome. (Before this they had just keyword-based search.)
Use a search engine like Solr to index the data. You can still use MySQL to hold the data, but for searches use a search engine.

improving search times

Just wondering if there are any tips on improving search times (full-text).
How do large sites like Stack Overflow, Reddit, etc., implement their search functions?
(Sorry for the vagueness - I am a newbie.)
Oh wow, there are entire courses and papers written on this...
Firstly, if you're storing in a database, there are indexes and different joins and views and all sorts of fun for speeding up your queries.
However, you've specified full-text search, so I'll direct you to this page, which has a comparison of the most common techniques. It's written for arrays, but it will give you an understanding of how splitting or searching can be improved or varied.
Next, take a read of this Wikipedia article on string searching. There is the naive search, where you just scan the text, and there are approaches where you build an index first so that future searches can jump around - like chapters or page numbers in a book.
Index and pattern-storage techniques are also very useful in compression, and that's yet another way to help speed up searching: if you build the compressed string, you can be clever and jump to the relevant compressed section, extract and compare, depending on whether you have a limited number of patterns that you are searching for or whether anything goes.
Then there's fuzzy searching as well, where you don't get an exact match - you may do this on some 'closeness' score - like a percentage of character matches.
Hopefully that gives you a good starting point at least!
Have a read of the MySQL Guide to Fine-Tuning Full-Text Search. It describes many techniques the engine can use to make searches faster or more exhaustive.
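For concreteness, here is a minimal sketch of MySQL full-text search driven from Python; the "articles" table, its columns, and the pymysql driver are assumptions, not part of the guide above. FULLTEXT indexes require MyISAM or, from MySQL 5.6 on, InnoDB.

```python
# Minimal MySQL full-text search sketch, assuming a hypothetical "articles"
# table with title/body columns and the pymysql driver.
import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="site")
cur = conn.cursor()

# One-time setup: a FULLTEXT index over the searched columns.
cur.execute("CREATE FULLTEXT INDEX articles_ft ON articles (title, body)")

# Natural-language search, ordered by the engine's relevance score.
cur.execute("""
    SELECT id, title,
           MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) AS relevance
    FROM articles
    WHERE MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE)
    ORDER BY relevance DESC
    LIMIT 20
""", ("full text tuning", "full text tuning"))
print(cur.fetchall())
```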
Apache Lucene is the canonical open source full text indexing engine. I'd start there if I needed to build a search feature for a web site.
