Sphinx - How to index only a limited number of words?

I have a limited number of industries (around 300), and I would like to create an index that gives the frequency of these keywords in the indexed documents. Is there any way to do this in Sphinx?

Not really.
But the --buildstops function of indexer will produce a list of the most common keywords in an index.
So you can just look at that output and compare it with your industry list. In theory your industries should be near the top of the list, so you don't have to make it too long.
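For example (a sketch; the index and output file names are placeholders), combining --buildstops with --buildfreqs makes indexer write each keyword together with its occurrence count:

    indexer myindex --buildstops word_freq.txt 1000 --buildfreqs

The 1000 is how many top keywords to collect; make it large enough that all 300 industries have a chance to appear.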

There is a trick in Sphinx to get keyword statistics from the index. The BuildKeywords API call (http://sphinxsearch.com/docs/current.html#api-func-buildkeywords) with the hits flag set will return per-keyword frequencies from the given index.
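For illustration, a minimal sketch using the Java client that ships with Sphinx (sphinxapi.jar, package org.sphx.api); the index name "industries" and the keyword list are hypothetical, and exact signatures may differ between client versions:

    import java.util.Map;
    import org.sphx.api.SphinxClient;

    public class KeywordFreq {
        public static void main(String[] args) throws Exception {
            SphinxClient cl = new SphinxClient("localhost", 9312);
            // hits=true asks searchd for per-keyword statistics from the index
            Map[] words = cl.BuildKeywords("steel banking retail", "industries", true);
            for (Map w : words) {
                // "docs" = documents containing the keyword; "hits" = total occurrences
                System.out.println(w.get("normalized")
                        + " docs=" + w.get("docs") + " hits=" + w.get("hits"));
            }
        }
    }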
Hope this helps

Related

Solr Boosting Logic Concepts

I'm trying to understand boosting, and whether boosting is the answer to my problem.
I have an index that contains different types of data.
E.g. an index Animals. One of the fields is animaltype. This value can be Carnivorous, Herbivorous, etc.
Now when we run a search query, I want to show results of type Carnivorous at the top, and then the Herbivorous type.
Also, would it be possible to show only, say, the top 3 results from one type and then the remaining ones from other types?
Let's assume that for a Herbivorous type we have a field named vegetables. This will have values only for a Herbivorous animaltype.
Now, would it be possible to have boosting rules specified as follows:
Boost Levels:
animaltype:Carnivorous
then animaltype:Herbivorous and vegetablesfield: spinach
then animaltype:Herbivorous and vegetablesfield: carrot
etc. Basically, boosting on various fields at various levels. I'm new to this concept. It would be really helpful to get some inputs/guidance.
Thanks,
Kasturi Chavan
Your example is closer to sorting than boosting, as you have a priority list for how important each document is, while boosting (in Solr) is usually applied a bit more fluidly, meaning that there is no hard line between documents of type X and type Y.
However, boosting with appropriately large values will in effect give you the same result, putting the documents into different score "areas" which will then give you the sort order you're looking for. You can see the score contributed by each term by appending debugQuery=true to your query. Boosting says 'a document with this value is z times more important than those with a different value', but if the document only contains low-scoring tokens from the search (usually words that are very common), while other documents contain high-scoring tokens (words that are infrequent), the latter document might still be considered more important.
Example: searching for "city paris", where most documents contain the word 'city', but only a few contain the word 'paris' (and do not contain 'city'). Even if you boost all documents assigned to country 'germany', the score contributed by 'city' might still be lower, even with the boost factor, than what 'paris' contributes alone. This might not occur in real life, but you should know what the boost actually changes.
Using the edismax handler, you can apply the boost in two different ways: one is to use boost=, which is multiplicative; the other is to use either bq= or bf=, which are additive. The difference is how the boost contributes to the end score.
For your example, the easiest way to get something similar to what you're asking is to use bq (boost query):
bq=animaltype:Carnivorous^1000&
bq=animaltype:Herbivorous^10
These boosts will probably be large enough to move all documents matching these queries into their own buckets, without moving between groups. To create "different levels" as your example shows, you'll need to tweak these values (and remember, multiple boosts can be applied to the same document if something is both herbivorous and eats spinach).
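As a sketch of how those parameters could be attached from SolrJ (the collection URL, query string and field values here are just taken from the question, not a working setup):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class BoostedSearch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/animals").build();
            SolrQuery q = new SolrQuery("lion");
            q.set("defType", "edismax");                  // bq= is an edismax parameter
            q.add("bq", "animaltype:Carnivorous^1000");   // big gap between the buckets
            q.add("bq", "animaltype:Herbivorous^10");
            q.set("debugQuery", "true");                  // shows each term's score contribution
            QueryResponse rsp = solr.query(q);
            rsp.getResults().forEach(doc -> System.out.println(doc.getFieldValue("id")));
            solr.close();
        }
    }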
A different approach would be to create a function query, using query, if and similar functions, to produce a single integer value that you can use as a sorting value. You can also calculate this value when indexing the document if it's static (which your example is), and then sort by that field instead. This will require you to reindex your documents if the sorting values change, but it might be an easy and effective solution.
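For instance, something along these lines (a sketch using Solr's termfreq and if functions; the field values are from your example):

    sort=if(termfreq(animaltype,'Carnivorous'),2,if(termfreq(animaltype,'Herbivorous'),1,0)) desc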
To achieve the "top 3 results from a type", you're probably going to want to look at Result Grouping support, which makes it possible to get "x documents" for each value in a single field. There is, as far as I know, no way to say "I want three of these at the top, then the rest from other values", except by doing multiple queries (and excluding the three you've already retrieved from the second query). Usually issuing multiple queries works just as well (or better) performance-wise.

How does Lucene (Solr/ElasticSearch) do filtered term counts so quickly?

From a data structure point of view, how does Lucene (Solr/ElasticSearch) do filtered term counts so quickly? For example, given all documents that contain the word "bacon", find the counts for all words in those documents.
First, for background, I understand that Lucene relies upon a compressed bit array data structure akin to CONCISE. Conceptually this bit array holds a 0 for every document that doesn't match a term and a 1 for every document that does match a term. But the cool/awesome part is that this array can be highly compressed and is very fast at boolean operations. For example if you want to know which documents contain the terms "red" and "blue" then you grab the bit array corresponding to "red" and the bit array corresponding to "blue" and AND them together to get a bit array corresponding to matching documents.
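A toy Java illustration of that AND step, using a plain java.util.BitSet in place of Lucene's compressed structures:

    import java.util.BitSet;

    public class BitsetAnd {
        public static void main(String[] args) {
            BitSet red = new BitSet();     // bit i is set if doc i contains "red"
            red.set(0); red.set(2); red.set(5);
            BitSet blue = new BitSet();    // bit i is set if doc i contains "blue"
            blue.set(2); blue.set(3); blue.set(5);
            BitSet both = (BitSet) red.clone();
            both.and(blue);                // documents matching both terms
            System.out.println(both);      // prints {2, 5}
        }
    }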
But how then does Lucene quickly determine counts for all words in documents that match "bacon"? In my naive understanding, Lucene would have to take the bit array associated with bacon and AND it with the bit arrays for every single other word. Am I missing something? I don't understand how this can be efficient. Also, do these bit arrays have to be pulled off of disk? That sounds all the worse!
How's the magic work?
You may already know this, but it won't hurt to say that Lucene uses inverted indexing. In this indexing technique, a dictionary of every word occurring in all documents is built, and against each word, information about that word's occurrences is stored. Something like the inverted-index diagram credited below.
To achieve this, Lucene stores documents, indexes and their metadata in different file formats. Follow this link for file details: http://lucene.apache.org/core/3_0_3/fileformats.html#Overview
If you read the document numbers section: each document is given an internal ID, so when documents are found containing the word 'consign', the Lucene engine has a reference to their metadata.
Refer to the overview section to see what data gets saved in the different Lucene indexes.
Now that we have a pointer to the stored documents, Lucene may get the count in one of the following ways:
Really count the number of words, if the document is stored.
Use the term dictionary, frequency, and proximity data to get the count.
Finally, which API are you using to "quickly determine counts for all words"?
Image credit to http://leanjavaengineering.wordpress.com/
Check the index file format here: http://lucene.apache.org/core/8_2_0/core/org/apache/lucene/codecs/lucene80/package-summary.html#package.description
There are no bitsets involved: it's an inverted index. Each term maps to a list of documents. In Lucene, the algorithms work on iterators over these "lists", so items from the iterator are read on demand, not all at once.
This diagram shows a very simple conjunction algorithm that just uses a next() operation: http://nlp.stanford.edu/IR-book/html/htmledition/processing-boolean-queries-1.html
Behind the scenes, it is much like this diagram in Lucene. Our lists are delta-encoded and bitpacked, and augmented with a skip list, which allows us to intersect more efficiently (via the additional advance() operation) than the above algorithm, though.
DocIdSetIterator is this "enumerator" in Lucene. It has the two main methods, nextDoc() and advance(). And yes, it is true that you can decide to read the entire list and convert it into a bitset in memory, and implement this iterator over that in-memory bitset. This is what happens if you use CachingWrapperFilter.
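A simplified Java sketch of that conjunction: two sorted postings lists walked in lockstep, where moving the lagging pointer plays the role of next()/advance() (the names here are illustrative, not Lucene's actual classes):

    public class Conjunction {
        // a and b are postings lists: ascending doc ids for two terms
        static void intersect(int[] a, int[] b) {
            int i = 0, j = 0;
            while (i < a.length && j < b.length) {
                if (a[i] == b[j]) {
                    System.out.println("match doc " + a[i]); // doc contains both terms
                    i++; j++;
                } else if (a[i] < b[j]) {
                    i++;  // advance the iterator that is behind
                } else {
                    j++;
                }
            }
        }

        public static void main(String[] args) {
            intersect(new int[]{2, 31, 54, 101}, new int[]{1, 2, 54, 99, 101});
        }
    }

With a skip list, the lagging iterator can leap over whole blocks instead of stepping one document at a time, which is where advance(target) beats repeated next().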

Lucene performance drop with single character

I am currently using Lucene to search a large number of documents.
Most commonly it is being searched on the name of the object in the document.
I am using the StandardAnalyzer with a null list of stop words. This means words like 'and' will be searchable.
The search term looks like this: (+keys:bunker +keys:s*)(keys:0x000bunkers*)
The 0x000 is a prefix to make sure that it comes higher up the list of results.
The 'keys' field also contains other information like postcode.
So it must match at least one of those.
Now, with the background done, on to the main problem.
For some reason, when I search a term with a single character, whether it is just 's' or bunker 's', it takes around 1.7 seconds, compared to say 'bunk', which takes less than 0.5 seconds.
I have sorting; I have tried with and without it, and there is no difference. I have tried with and without the prefix.
Just wondering if anyone else has come across anything like this, or has any inkling of why it would do this.
Thank you.
The most commonly used terms in your index will be the slowest terms to search on.
You're using StandardAnalyzer, which does not remove any stop words. Further, it splits words on punctuation, so John's is indexed as two terms, John and s. These splits are likely creating a lot of occurrences of s in your index.
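One way to verify this is to print what the analyzer actually emits for your input. A diagnostic sketch (Lucene 5+ no-arg constructor; older versions take a Version argument, and exact tokenization varies between releases):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ShowTokens {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();
            try (TokenStream ts = analyzer.tokenStream("keys", "John's bunker")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString()); // each token as it will be indexed
                }
                ts.end();
            }
        }
    }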
The more occurrences of a term in your index, the more work Lucene has to do at search time. A term like bunk likely occurs in your index orders of magnitude less often, and thus requires a lot less work to process at search time.

Fuzzy match of an English sentence with a set of English sentences stored in a database

There are about 1000 records in a database table. There is a column named title which is used to store the titles of articles. Before inserting a record, I need to check whether an article with a similar title already exists in that table. If so, I will skip the insert.
What's the fastest way to perform this kind of fuzzy matching? Assume all words in the sentences can be found in an English dictionary. If 70% of the words in sentence #1 can be found in sentence #2, we consider them a match. Ideally, the algorithm should pre-compute a value for each sentence so that the value can be stored in the database.
For 1000 records, doing the dumb thing and just iterating over all the records could work (assuming that the strings aren't too long and you aren't getting hit with too many queries). Just pull all of the titles out of your database, and then sort them by their distance to your given string (for example, you could use Levenshtein distance for this metric).
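A sketch of that brute-force pass in Java, using the standard dynamic-programming Levenshtein distance (the titles here are made up):

    public class TitleMatch {
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        public static void main(String[] args) {
            String candidate = "Fuzy matching of an English sentence";
            String[] titles = {"Fuzzy match of an English sentence", "Something unrelated"};
            for (String t : titles)  // scan every stored title, keep the closest
                System.out.println(levenshtein(candidate, t) + "  " + t);
        }
    }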
A fancier way to do approximate string matching would be to precompute n-grams of all your strings and store them in your database (some systems support this feature natively). This will definitely scale better performance-wise, but it could mean more work:
http://en.wikipedia.org/wiki/N-gram
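For example, a word-level sketch (character n-grams work the same way); two titles sharing many n-grams are likely near-duplicates:

    import java.util.*;

    public class Ngrams {
        static Set<String> ngrams(String title, int n) {
            String[] w = title.toLowerCase().split("\\s+");
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + n <= w.length; i++)
                grams.add(String.join(" ", Arrays.copyOfRange(w, i, i + n)));
            return grams;
        }

        public static void main(String[] args) {
            System.out.println(ngrams("fuzzy match of an english sentence", 2));
        }
    }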
You can read up on forward/reverse indexing of token-value storage to get faster search results. I personally prefer reverse indexing, which stores a hash map from token (key) to value (here, the title).
Whenever you write a new article, like a new Stack Overflow question, the tokens in the title would be searched to map against all the titles available.
To optimize the result, i.e. to get the fuzzy logic for results, you can sort the titles by the maximum number of matches among the tokens being searched for. E.g., if t1, t2 and t3 refer to the tokens 'what', 'is' and 'love', and the title 'what this love is for?' exists in all three tokens' mappings, it would be placed topmost.
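A minimal sketch of that reverse index in Java, ranking stored titles by how many of the query's tokens they contain:

    import java.util.*;

    public class ReverseIndex {
        static Map<String, Set<String>> index = new HashMap<>();

        static void add(String title) {
            for (String tok : title.toLowerCase().split("\\W+"))
                index.computeIfAbsent(tok, k -> new HashSet<>()).add(title);
        }

        public static void main(String[] args) {
            add("what is love");
            add("what this love is for?");
            add("unrelated title");

            Map<String, Integer> hits = new HashMap<>();
            for (String tok : "what is love".split("\\W+"))        // the incoming title
                for (String title : index.getOrDefault(tok, Collections.emptySet()))
                    hits.merge(title, 1, Integer::sum);             // count shared tokens

            hits.entrySet().stream()
                .sorted((x, y) -> y.getValue() - x.getValue())      // most overlap first
                .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
        }
    }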
You can play around with this more. I hope this approach is simpler and more appealing.

PHP MYSQL search engine using keywords

I am trying to implement a search engine based on keyword search.
Can anyone tell me which is the best (fastest) algorithm to implement a search for keywords?
What I need is:
My keywords:
search, faster, profitable
Their synonyms:
search: grope, google, identify, search
faster: smart, quick, faster
profitable: gain, profit
Now I should search all possible permutations of the above synonyms in a database to identify the best-matching words.
The best solution would be to use an existing search engine, like Lucene or one of its alternatives (see Which are the best alternatives to Lucene?).
Now, if you want to implement that yourself (it's really a great and interesting problem), you should have a look at the concept of an inverted index. That's what Google and other search engines use. Of course, they have a LOT of additional systems on top of it, but that's the basics.
The idea of an inverted index is that for each keyword (and synonym), you store the ids of the documents that contain the keyword. It's then very easy to look up the matching documents for a set of keywords, because you just calculate an intersection (or a union, depending on what you want to do) of their lists in the inverted index. Example:
Let's assume that this is your inverted index:
smart: [42,35]
gain: [42]
profit: [55]
Now if you have a query "smart, gain", your matching documents are the intersection (or the union) of [42, 35] and [42].
To handle synonyms, you just need to extend your query to include all synonyms of the words in the initial query. Based on your example, your query would become "faster, quick, gain, profit, profitable".
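A small sketch of this in Java, with the postings and synonym lists taken from the examples above (swap the union for retainAll if you want the intersection):

    import java.util.*;

    public class SynonymSearch {
        public static void main(String[] args) {
            Map<String, Set<Integer>> index = new HashMap<>();
            index.put("smart", new TreeSet<>(Arrays.asList(42, 35)));
            index.put("gain", new TreeSet<>(Arrays.asList(42)));
            index.put("profit", new TreeSet<>(Arrays.asList(55)));

            Map<String, List<String>> synonyms = new HashMap<>();
            synonyms.put("faster", Arrays.asList("smart", "quick", "faster"));
            synonyms.put("profitable", Arrays.asList("gain", "profit"));

            // expand the query with synonyms, then union the posting lists
            Set<Integer> result = new TreeSet<>();
            for (String word : Arrays.asList("faster", "profitable"))
                for (String syn : synonyms.getOrDefault(word, Arrays.asList(word)))
                    result.addAll(index.getOrDefault(syn, Collections.emptySet()));

            System.out.println(result); // [35, 42, 55]
        }
    }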
Once you've implemented that, a nice improvement is to add TF-IDF weighting to your keywords. That's basically a way to weight rare words (like 'programming') more than common ones (like 'the'): a term's weight grows with its frequency in a document and shrinks with the number of documents that contain it.
The other approach is to just go through all your documents and find the ones that contain your words (or their synonyms). The inverted index will be MUCH faster though, because you don't have to go through all your documents every time. The time-consuming operation is building the index, which only has to be done once.
