Lucene.NET MatchAllDocsQuery doesn't honor document boost? - sorting

I have a Lucene index of documents, all nearly identical (test 1, test 2, etc.) except that some have a higher boost than others. When using a default query (MatchAllDocsQuery, or .Parse("*:*") on the query parser) the documents come back in the order they went in every time. By adding a search term ("test" in this case), the document boost becomes apparent and the documents are sorted according to the boost. I can change the boost levels around and the new order is reflected in the results. All my code is pretty standard fare; I'm using a default Sort() in both cases.
I found that this same bug was reported and fixed in Lucene back in 2005-2006, and I checked my MatchAllDocsQuery.cs file (Lucene.NET 2.9.2). It seems to have that change present, but the behavior is still as described in the ticket.
Any ideas what I might be doing wrong? Perhaps someone running the Java version has experienced this (or not)? Thanks.

Uh, don't I feel silly now. This is as-designed behavior, I guess. According to Lucene in Action, MatchAllDocsQuery uses a constant score, so document boosts are ignored.
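If you need boost-aware ordering anyway, one workaround is to store the boost in its own field at index time and sort on that field explicitly. A minimal, untested sketch against the Lucene 2.9-era Java API (the "boost" field name is made up here, and the same idea maps onto Lucene.NET):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;

public class MatchAllByBoost {
    // Match every document, but order results by a stored per-document
    // "boost" value instead of the constant relevance score.
    public static TopDocs searchAll(Directory dir) throws Exception {
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        Sort byBoost = new Sort(new SortField("boost", SortField.FLOAT, true));
        return searcher.search(new MatchAllDocsQuery(), null, 10, byBoost);
    }
}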

Related

How to highlight matches in Elasticsearch completion suggester

Is it possible to highlight the matching parts in the results of a completion suggester query?
(Caveat: I am new to Elasticsearch, and the answer below relies solely on the two links I posted; however, these are official Elasticsearch / Lucene sources and they are pretty clear about it.)
As of today (2016-06-28), it does not seem to be possible.
The feature would have to be implemented not in Elasticsearch itself but in the underlying Lucene, and it appears to be a complex issue that is not going to be resolved anytime soon.
See:
https://github.com/elastic/elasticsearch/issues/8089
https://issues.apache.org/jira/browse/LUCENE-4518
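For context, this is what a completion suggester call looked like at the time (Elasticsearch 2.x syntax; the index and field names here are invented). Note that the response options only carry the full suggestion text, so any highlighting of the matched prefix has to be done client-side:

POST /products/_suggest
{
  "name-suggest": {
    "text": "lu",
    "completion": {
      "field": "name_suggest"
    }
  }
}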

Cannot get wildcard search working with Sphinx Real Time indexes

There are several questions about this already, here on SO as well as on Google. However, repeated attempts and plenty of Googling have not netted me an answer so far. This doesn't seem difficult, but clearly I'm missing something.
I've added combinations of the following:
enable_star = 1
dict = keywords
min_infix_len = 3
min_prefix_len = 3
Note: I did not do prefix and infix at the same time.
I have blown away and re-created my indexes, re-started searchd and still no luck.
If I insert a value such as "wildcardtest", I get a hit with the following:
select * from rtindex where match('wildcardtest');
but anything else such as
select * from rtindex where match('wildcardt*');
returns 0 results.
I was using 2.1.4 but upgraded to 2.1.9 with no change.
I upgraded to 2.2.7 and tweaked the config a bit, and now it is working.
The essential config option needed is dict=keywords
min_prefix_len/min_infix_len also work, but they do seem to change the behavior compared to dict=keywords on its own: searching for the same pattern under the various config options yields slightly different results.
I did have to re-build my disk-based indexes and then attach (after truncating) to the RT indexes to get the historical content searchable the way I wanted.
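For anyone hitting the same wall, here is a rough sketch of the relevant RT index section in sphinx.conf under 2.2.x (the path and field names are made up; note that in 2.2 wildcard support is on by default, so enable_star no longer needs to be set):

index rtindex
{
    type          = rt
    path          = /var/lib/sphinx/rtindex
    rt_field      = content
    rt_attr_uint  = gid
    dict          = keywords
    min_infix_len = 3
}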
I haven't used the RT indexes, but on a regular index I would pass it in like so: '"wildcard*"'. I found that wrapping it like this got me the results I was looking for. In your conf file you should also have enable_star = 1.

What score should i put in boost field of elasticsearch

In my doc, I have a field called Tag and a field called SuperTag. Whenever Tag matches, it should boost the score somewhat, but a match on SuperTag should boost it significantly, making it the first choice. In your opinion, what value should I put in the boost field for Tag and SuperTag? Thanks.
That's quite difficult to answer; it depends heavily on the data both fields contain and the analyzers they use.
Obviously, if the data is going to be pretty much the same for both, I would set the boost on the supertag field to 2.0.
In case they don't hold the same data we can imagine scenarios like this:
{tag: 'tagnice tagnice tagnice'}
{supertag: 'tagnice'}
even with the boosted supertag, tag could be more relevant just because tf-idf gives it a bigger score.
To counter that, for example, an analyzer with a unique token filter applied to both fields would help.
So, as said, it depends heavily on the data and how you store it in Lucene. At first sight, without knowing more, doubling the boost should work.
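As a starting point, here is a minimal sketch of that 2.0 boost applied at query time via multi_match field boosting (field names taken from the question; the ^2 factor is just the doubling suggested above, to be tuned against real data):

{
  "query": {
    "multi_match": {
      "query": "tagnice",
      "fields": ["supertag^2", "tag"]
    }
  }
}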

How do I see/debug the way Solr finds its results?

Let's say I search for "ABLS" and Solr returns a result that makes no sense to me.
How can I debug why Solr picked this record to be returned?
debugQuery=true will get you the detailed score calculation and the explanation for each score.
An overview of the scoring is available at this link.
For a detailed explanation of the debug information, you can refer to this link.
You could add debugQuery=true&indent=true to the url and examine the results. You could also use the analysis tool in solr. Go to the admin and click analysis. You would need to read the wiki to understand either of these more in depth.
debugQuery will give you insight into why your scoring looks the way it does (and how every field is relevant).
Take the results you don't understand and play with them in Solr's analysis tool.
You should find it under:
/admin/analysis.jsp?highlight=on
Alternatively, turn on highlighting to see what is actually matching in your results.
Solr queries are full of short parameters that are hard to read and modify, especially when there are many of them. After that, it is even harder to debug and understand why one document is more or less relevant than another; the debug explain output is usually a tree too big to fit on one page.
I found this Google Chrome extension useful for viewing Solr query explain and debug output in a clear manner.
For those who still use the very old Solr 3.x, "debugQuery=true" will not output the debug information; you should specify "debugQuery=on" instead.
There are two ways of doing that. The first is at the query level, which means adding debugQuery=on to your query (see the example request after this list). That will include a few things:
parsed query
debug timing information
detailed scoring information, which helps you analyze why a given document was given a score.
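For example, a full debug request might look like this (the host and core here are made up, reusing the "ABLS" term from the question):

http://localhost:8983/solr/mycore/select?q=ABLS&debugQuery=on&indent=true

The response then carries a "debug" section with the parsed query and an "explain" entry per matching document.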
In addition to that, you can use the [explain] transformer and add it to your fl parameter. For example ...&fl=*,[explain], which will result in your documents having the scoring information as another field.
The scoring information can be quite extensive and will include calculations done by the similarity algorithm. If you would like to learn more about similarities and the scoring algorithm in Solr, have a look at this talk by me and my colleague Radu from Sematext at the Activate conference: https://www.youtube.com/watch?v=kKocQdYGVJM

Lightweight fuzzy search library

Can you suggest a lightweight fuzzy text search library?
What I want to do is to allow users to find correct data for search terms with typos.
I could use full-text search engines like Lucene, but I think it's an overkill.
Edit:
To make question more clear here is a main scenario for that library:
I have a large list of strings. I want to be able to search in this list (something like MSVS' intellisense) but it should be possible to filter this list by string which is not present in it but close enough to some string which is in the list.
Example:
Red
Green
Blue
When I type 'Gren' or 'Geen' in a text box, I want to see 'Green' in the result set.
Main language for indexed data will be English.
I think that Lucene is too heavy for that task.
Update:
I found one product matching my requirements. It's ShuffleText.
Do you know any alternatives?
Lucene is very scalable, which means it's good for little applications too. You can create an index in memory very quickly if that's all you need.
For fuzzy searching, you really need to decide what algorithm you'd like to use. For information retrieval, I use an n-gram technique with Lucene successfully. But that's a special indexing technique, not a "library" in itself.
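To make the n-gram idea concrete, here is a toy sketch in plain Java (no Lucene wiring; the trigram size and Dice scoring are arbitrary choices of mine, not anything Lucene prescribes), using the Red/Green/Blue example from the question:

import java.util.HashSet;
import java.util.Set;

public class NGramMatch {
    // Split a string into padded character trigrams.
    static Set<String> trigrams(String s) {
        Set<String> grams = new HashSet<>();
        String padded = "  " + s.toLowerCase() + "  ";
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }

    // Dice coefficient: 2 * |common| / (|A| + |B|); 1.0 means identical sets.
    static double similarity(String a, String b) {
        Set<String> ga = trigrams(a), gb = trigrams(b);
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    public static void main(String[] args) {
        // "Gren" scores highest against "Green" among the candidates.
        for (String candidate : new String[] {"Red", "Green", "Blue"}) {
            System.out.println(candidate + " -> " + similarity("Gren", candidate));
        }
    }
}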
Without knowing more about your application, it won't be easy to recommend a suitable library. How much data are you searching? What format is the data? How often is the data updated?
I'm not sure how well Lucene is suited to fuzzy searching; a custom library could be a better choice. For example, this search is done in Java and works pretty fast, but it is custom-made for such a task:
http://www.softcorporation.com/products/people/
Soundex is very 'English' in its encoding; Daitch-Mokotoff works better for many names, especially European (Germanic) and Jewish names. In my UK-centric world, it's what I use.
Wiki here.
You didn't specify your development platform, but if it's PHP, then I suggest you look at the Zend Lucene library:
http://ifacethoughts.net/2008/02/07/zend-brings-lucene-to-php/
http://framework.zend.com/manual/en/zend.search.lucene.html
As it is LAMP-based, it's far lighter than Lucene on Java, and it can easily be extended for other file types, provided you can find a conversion library or command-line converter; there are lots of OSS solutions around to do this.
Try Walnutil, which is based on the Lucene API and integrates with SQL Server and Oracle databases. You can create any type of index and then use it. For simple searches you can use methods from walnutilsoft; for more complicated search cases you can use the Lucene API. See the web-based example that uses indexes created with the Walnutil tools. You can also find code examples written in Java and C# that you can use for creating different types of search.
This tool is free.
http://www.walnutilsoft.com/
If you can choose to use a database, I recommend using PostgreSQL and its fuzzy string matching functions.
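A minimal sketch of what that looks like with PostgreSQL's pg_trgm extension (the colors table and its name column are invented here for the Red/Green/Blue example):

-- requires: CREATE EXTENSION pg_trgm;
SELECT name, similarity(name, 'Gren') AS score
FROM colors
WHERE name % 'Gren'  -- pg_trgm's "similar enough" operator
ORDER BY score DESC;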
If you can use Ruby, I suggest looking into the amatch library.
#aku - links to working soundex libraries are right there at the bottom of the page.
As for Levenshtein distance, the Wikipedia article on that also has implementations listed at the bottom.
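For completeness, a compact Java version of the Levenshtein distance mentioned above (the standard two-row dynamic-programming formulation, not taken from any particular library):

public class Levenshtein {
    // Minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("Gren", "Green")); // 1: one insertion
    }
}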
A powerful, lightweight solution is Sphinx.
It's smaller than Lucene and it supports disambiguation.
It's written in C++; it's fast, battle-tested, has libraries for every environment, and is used by large companies like craigslist.org.
