Fuzzy Problems in Solr Filter Query - performance

I would be grateful if somebody could help me with my problem. I have this query:
select?q=city:Frankfurt am Main~&fq=street:Gerhart-Hauptmann-Str.~
This is not working for me. I want to use fuzzy search to catch some user input mistakes.
Here is what I want:
Frankfurt am Main should be searched completely in the field city with fuzzy search
Gerhart-Hauptmann-Str. should be converted into three terms with fuzzy search.
Debug output of what I actually get:
"debug": {
"rawquerystring": "city:Frankfurt am Main~",
"querystring": "city:Frankfurt am Main~",
"parsedquery": "city:frankfurt text:am text:Main~2",
"parsedquery_toString": "city:frankfurt text:am text:Main~2",
"explain": {...},
"QParser": "LuceneQParser",
"filter_queries": [
"street:Gerhart-Hauptmann-Str.~"
],
"parsed_filter_queries": [
"street:gerhart-hauptmann-str.~2"
],
I think I want this output:
"debug": {
"rawquerystring": "city:Frankfurt am Main~",
"querystring": "city:Frankfurt am Main~",
"parsedquery": "city:frankfurt~2 city:am~2 text:Main~2",
"parsedquery_toString": "city:frankfurt~2 city:am~2 text:Main~2",
"explain": {...},
"QParser": "LuceneQParser",
"filter_queries": [
"street:Gerhart-Hauptmann-Str.~"
],
"parsed_filter_queries": [
# My analyser converts Str. to strasse
"street:gerhart~2 street:hauptmann~2 strasse~2"
],
The definition of the fields in the schema.xml
<field name="city" type="admin_name" indexed="true" stored="true" />
<field name="street" type="street_name" indexed="true" stored="true" multiValued="false"/>
<fieldType name="admin_name" class="solr.TextField" >
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de_admin.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="street_name" class="solr.TextField" >
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<!-- The StartEndSynonymFilter replaces synonyms which
are at the start or the end of a term. The types
START_SYNONYM or END_SYNONYM will be set. -->
<filter class="my.StartEndSynonymFilterFactory" synonyms="lang/synonyms_de_street.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Is this somehow possible?
If you need additional information to answer, please leave a hint in a comment.

Tokenizing on Hyphens
Have a look at the WordDelimiterFilterFactory:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
Applying Fuzzy to every single term
DISCLAIMER: I have not yet used fuzzy search in my SOLR setups.
You might have to be careful with tokenizing the city names and applying the fuzzy search to every single token. Your example "Frankfurt am Main" would in this case apply fuzzy search to "am" as well. Please try it with parentheses, (Frankfurt am Main)~, and check whether this gets you the intended result.
However, in the case of names (cities or streets) I'm not sure you should be tokenizing them at all. Maybe storing them as one case-insensitive token and applying the fuzzy search to the whole name, like "Frankfurt am Main"~ (with quotes in the query), is actually what you need. Be aware, though, that the standard query parser treats a tilde after a quoted phrase as a proximity search rather than a fuzzy search, so for a single-token field you may have to escape the spaces instead (Frankfurt\ am\ Main~).
Nevertheless, you should try to get it to work in the way you have described. Then look at the query results. In parallel, you could set up an index where you store the city and street names as single tokens (for example, KeywordTokenizer with lowercasing and ASCII folding) and apply fuzzy search to them as single terms. I would guess that those results will be sharper, but it is best to try it out and compare.
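If you go the per-term route, one option is to build the query string on the client before sending it to Solr. The following is only a minimal sketch under my own assumptions: the helper names escape_term and fuzzy_clause are made up for illustration, and splitting on whitespace and hyphens is simply what the question asks for, not what your analyzer does. Whether the synonym step (Str. to strasse) is still applied to fuzzy terms is something to check in the debug output.

```python
import re

# Characters that have special meaning in the Lucene/Solr standard query syntax.
_SPECIAL_CHARS = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_term(term):
    """Backslash-escape query syntax characters inside a single term."""
    return _SPECIAL_CHARS.sub(r"\\\1", term)


def fuzzy_clause(field, text, distance=None):
    """Return one fuzzy term per word, each prefixed with the field name."""
    # Split on whitespace and hyphens (hypothetical client-side tokenization).
    words = [w for w in re.split(r"[\s\-]+", text.strip()) if w]
    suffix = "~" if distance is None else "~{}".format(distance)
    return " ".join("{}:{}{}".format(field, escape_term(w), suffix) for w in words)


city_query = fuzzy_clause("city", "Frankfurt am Main")
street_filter = fuzzy_clause("street", "Gerhart-Hauptmann-Str.")
print(city_query)     # city:Frankfurt~ city:am~ city:Main~
print(street_filter)  # street:Gerhart~ street:Hauptmann~ street:Str.~
```

The two strings would go into q and fq respectively. Note that this still applies fuzzy matching to short words such as "am", which is exactly the caveat mentioned above.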
In addition, I would suggest trying out the DisMax handler (extended or not) for input, without even caring to differentiate between cities and streets on the input side: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
With the dismax handler processing the input, you can allow the user to input search terms very freely (like having a single search field where cities and streets can be input in random order and format).

Related

How to prefix match a doc value -> search term in lucene search engines like Solr, ElasticSearch

We need to prefix match in the direction <document value> -> <search term>, that is, the document value should be a prefix of the search term. The reverse, <search term> -> <document value>, is already possible in Solr and ElasticSearch.
Example:
Search term -> "traveling the world"
Document field value -> "travel"
Not sure how to prefix match or fuzzy this query so we can get this document result.
Prefix match works like this "travel*"
Search term -> "travel"
Document field value -> "traveling the world"
Try using the PorterStemFilterFactory in your field definition.
<filter class="solr.PorterStemFilterFactory"/>
Your definition may look like this:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
The input and output would be:
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
There is another alternative to it known as solr.KStemFilterFactory which is less aggressive.
In short, you can define a field type for your field as below.
<fieldType name="StemmerFieldTypeDef" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- Lowercase before stemming: the Porter stemmer expects lowercase input. -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- Keep the same filter order as at index time so both sides produce the same terms. -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
This definition determines how the text is stored at index time and what text is searched for at query time.
The tokenizers and filters transform your original text according to the field definition.
For example, if you index the word "Travelling", it is indexed as "travel", so a search for "travel" matches and you get the record in the result.
It works the other way round as well. If you index the text "Travel", it is indexed as "travel" according to the field definition. If you then search for "Travelling", the query is also reduced to "travel", so a match is found.
To get a good understanding of this analysis, please check the Solr analysis page.
In the example above, In and Out show the input to the field and the output produced by the field type applied to it.
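To make the idea concrete outside of Solr, here is a rough sketch of why analyzing both sides the same way lets the document value "travel" match the search "traveling the world". The toy_stem function is a crude suffix stripper I made up for illustration; it is not the Porter algorithm, and the matching assumes the query terms are combined with OR.

```python
def toy_stem(token):
    """Strip a few common suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split on whitespace, then stem every token."""
    return [toy_stem(token) for token in text.lower().split()]


def matches(doc_value, query):
    """True if at least one analyzed query term equals an analyzed document term."""
    return bool(set(analyze(doc_value)) & set(analyze(query)))


print(analyze("jump jumping jumped"))             # ['jump', 'jump', 'jump']
print(matches("travel", "traveling the world"))   # True
print(matches("travel", "the world"))             # False
```

The same chain runs on the document value and on the query, which is what the index and query analyzers in the field type do.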

Apache solr fuzzy search with distance parameter as 2

Environment: java version "11.0.12" 2021-07-20 LTS, solr-8.9.0
I have loaded a CSV file into Solr. The CSV file has a field 'Name'. The type of the 'Name' column in Solr is defined as 'text_general'.
I understand that a fuzzy search is performed by appending the tilde symbol (~) to a single-word term, and that the default value of the distance parameter is 2.
I have used the following fuzzy search query:
http://localhost:8983/solr/startsolr/select?indent=on&wt=json&q=(Name:'Ellyse~') AND (Name:'Perry~')&sort=field(Name) asc
The above fuzzy search query returns the names 'Ellysea Perry' and 'Ellys Perry'.
But why does it not return the document with the name 'Elly Perry'? The default distance parameter is 2, and only 2 characters (se) are missing.
Strings with an edit distance of 2 should appear in the output (e.g. 'Elly Perry').
I understand that "with max edit distance 2 I can have up to 2 insertions, deletions or substitutions."
Name available in loaded data - 'Elly Perry'
Input query parameter - (Name:'Ellyse~') AND (Name:'Perry~')
Since deleting 2 characters from the name 'Ellyse' gives 'Elly', that document should be in the output. Could someone help me find the missing piece?
https://en.wikipedia.org/wiki/Levenshtein_distance
I expect the following rows to match:
'Ellysea Perry',
'Ellys Perry',
'Elly Perry'
But I only get the following two:
'Ellysea Perry',
'Ellys Perry'
The 'Name' field is configured in managed-schema as follows:
<field name="Name" type="text_general" multiValued="false" indexed="true" stored="true" required="true"/>
The fieldType 'text_general' is defined as follows:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
As per the documentation: "To perform a fuzzy search, use the tilde ~ symbol at the end of a single-word term."
Reference: https://solr.apache.org/guide/8_5/the-standard-query-parser.html#TheStandardQueryParser-FuzzySearches
Try to be clearer; I'm out of ideas on this problem, even though it sounds like a classic. I have spent hours playing around with this but have got nowhere.
I have resolved the error; it now works with edit distance 2.
The modified query is as follows:
http://localhost:8983/solr/startsolr/select?indent=on&wt=json&q=(Name:Ellyse~) AND (Name:Perry~)&sort=field(Name) asc
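A plausible explanation for why dropping the single quotes helps, assuming the parser kept the leading quote as part of the term (I have not verified this against the parser source): with the quote included, 'Elly' ends up at distance 3 while the other two names stay at distance 2. The sketch below uses plain Levenshtein distance; Lucene's fuzzy matching also counts transpositions by default, which makes no difference for these strings.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


indexed_terms = ["ellysea", "ellys", "elly"]
without_quote = {t: levenshtein("ellyse", t) for t in indexed_terms}
with_quote = {t: levenshtein("'ellyse", t) for t in indexed_terms}
print(without_quote)  # {'ellysea': 1, 'ellys': 1, 'elly': 2}
print(with_quote)     # {'ellysea': 2, 'ellys': 2, 'elly': 3}
```

With the quote, only 'ellysea' and 'ellys' are within distance 2, which matches the two rows that were returned originally.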
However, I have since indexed 16 million records in Solr, and fuzzy search is not working for one specific token that occurs in 40K records.
In all other cases it works.
Do I have to configure some parameters in solrconfig.xml?

Sort strings alphabetically with Solr

Context
I have a string field for 'title' that I want to sort alphabetically. I use Solr 4.10.2 for search and sort. Since strFields are case-sensitive by default, I am noticing that Solr is sorting my titles via ASCII sort (capital letters have priority over lowercase letters) and not alphabetically.
Current behavior (asc sort)
Mathematics: Introduction to Algebra
Mathematics: an introduction
Desired behavior (asc sort)
Mathematics: an introduction
Mathematics: Introduction to Algebra
Code in schema.xml
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<field name="string" stored="false" type="string_ci" multiValued="false" indexed="true"/>
Even after restarting Solr and reindexing, the sort is still an ASCII sort.
The field must be lowercased at index time.
Remove the type attribute from the analyzer in your definition so that it applies to both indexing and queries:
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If you want distinct analyzers for each phase, include two <analyzer> definitions distinguished with the type attribute "index" and "query".
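To illustrate the difference outside of Solr, here is a small sketch using the two titles from the question. Sorting on the raw strings corresponds to the ASCII order you are seeing; sorting on a lowercased key corresponds to what the field would contain if the lowercase filter also ran at index time.

```python
titles = [
    "Mathematics: an introduction",
    "Mathematics: Introduction to Algebra",
]

# Raw code point order: 'I' (73) sorts before 'a' (97).
ascii_sorted = sorted(titles)

# Lowercased sort key, analogous to lowercasing the indexed value.
case_insensitive_sorted = sorted(titles, key=str.lower)

print(ascii_sorted[0])             # Mathematics: Introduction to Algebra
print(case_insensitive_sorted[0])  # Mathematics: an introduction
```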

Ignoring diacritics when sorting facet values in Solr 4

Tl;dr: How can I get Solr 4 to ignore diacritics when sorting facet values?
I've added the following four documents to the "collection1" Solr core in the default Solr example:
<doc>
<field name="id">1</field>
<field name="cat">manuka</field>
<field name="cat">mystery</field>
</doc>
<doc>
<field name="id">2</field>
<field name="cat">mānuka</field>
<field name="cat">stuff</field>
</doc>
<doc>
<field name="id">3</field>
<field name="cat">management</field>
<field name="cat">stuff</field>
</doc>
<doc>
<field name="id">4</field>
<field name="cat">abc</field>
<field name="cat">stuff</field>
</doc>
The "cat" field is defined as:
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
and the "string" type is defined as:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
When I do a facet query on the "cat" field, sorted by value (http://localhost:8983/solr/collection1/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=cat&facet.sort=index), I get:
....
"facet_fields":{
"cat":[
"abc",1,
"management",1,
"manuka",1,
"mystery",1,
"mānuka",1,
"stuff",3]},
....
Note that mānuka comes after mystery. I'd like to have mānuka come after manuka and before stuff, that is, I'd like the sort to ignore diacritics including the macron.
If this was a non-facet search, it looks like I could achieve what I want by setting up Collation for a separate copy field and sort by that (I can't set up collation for the field itself because the stored data will be a binary representation of the collation key). However, it looks like this approach isn't possible for facet queries since they can only be sorted by index or count.
Am I overlooking something? Is there some trick to get this working in an environment where I do need to display the value of the "cat" field?
The question is about customizing the index-order of a facet.
You suggest using collation. You can do this, and the order of your facets will be correct. The problem is that neither CollationField nor ICUCollationField overrides the indexedToReadable method.
The two classes cannot override indexedToReadable because, in general, the mapping from word to term is not invertible. For your case, however, you could possibly implement a subclass of ICUCollationField that overrides indexedToReadable in a sensible way.
Your starting point could be TestICUCollationField with
<fieldType name="sort_fr_t" class="solr.ICUCollationField" locale="fr" strength="primary"/>
...
<field name="sort_fr" type="sort_fr_t" indexed="true" stored="true" docValues="true" multiValued="true"/>
As you will see, the names of the facet values are very unreadable in this case.
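A different workaround, if re-sorting on the client is acceptable and the facet list is small enough to fetch completely, is to leave the field as a plain string and reorder the returned values yourself. This is not what the answer above proposes; it is only a sketch under that assumption, and fold_diacritics is a helper name made up for illustration. The list mirrors the facet_fields response from the question.

```python
import unicodedata


def fold_diacritics(value):
    """Decompose characters and drop combining marks (the macron in 'ā' included)."""
    decomposed = unicodedata.normalize("NFD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Flat value/count list as returned by Solr for the "cat" facet.
facet_cat = ["abc", 1, "management", 1, "manuka", 1,
             "mystery", 1, "mānuka", 1, "stuff", 3]

pairs = list(zip(facet_cat[0::2], facet_cat[1::2]))
# Sort by the folded value first; the original value breaks ties.
resorted = sorted(pairs, key=lambda pair: (fold_diacritics(pair[0]), pair[0]))

print([value for value, count in resorted])
```

The original values are kept for display; only the sort key is folded.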

Apache Solr only returning results if wildcard character (*) is present (with Magento)

I have Solr set up with Magento Enterprise Edition 1.9 and for the most part it works well. However, there are certain terms (e.g. "banana") which return no results even though product names in my catalog contain the word "banana".
However, as soon as I search for "banana*", with a wildcard, it returns results as expected.
I have used Magento's default schema for Solr so I don't have experience in tweaking Solr's schema file, so any advice would be appreciated.
Edit: here is a link to both my schema and config files: https://gist.github.com/anonymous/8d7a7106eb4e594d5adc
Edit 2: exploring my index using Luke I noticed that when I changed my default field from "fulltext" to "fulltext1_en" or "name_en", my normal query "banana" worked as expected. When I made this change in my schema, the search is working as expected. This leads me to more questions, however: I'm not sure how "fulltext" relates to "fulltext1_en". Why does "fulltext" not work but "fulltext1_en" does? Doesn't "fulltext" exist since it's in the Magento schema? And how was I getting any search results at all if the "fulltext" field simply didn't exist in my schema?
Without seeing your schema: standard keyword searches in Solr only match the exact word, not parts of it. As you said, adding a wildcard works, much like a regular expression, but some of the results will just be rubbish.
One workaround is to add a spellcheck component:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="classname">solr.IndexBasedSpellChecker</str>
<str name="spellcheckIndexDir">./spellchecker</str>
<str name="field">textsuggest</str>
<str name="comparatorClass">freq</str>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
Add the component to your request handler:
<lst name="defaults">
<str name="spellcheck.dictionary">default</str>
<str name="spellcheck">true</str>
<str name="spellcheck.count">3</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.collate">true</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
Or you can create new fields similar to an autocomplete like:
<!-- autocomplete_edge : Will match from the left of the field, e.g. if the document field
is "A brown fox" and the query is "A bro", it will match, but not "brown"
-->
<fieldType name="autocomplete_edge" class="solr.TextField">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([\.,;:\-_])" replacement=" " replace="all"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([\.,;:\-_])" replacement=" " replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement="" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
</analyzer>
</fieldType>
<field name="textnge" type="autocomplete_edge" indexed="true" stored="false" />
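To show what the edge n-gram field does, here is a simplified sketch. It only models the keyword tokenizer, the lowercasing and the edge n-grams (1 to 30 characters); the accent mapping and the pattern replacements are left out, and the function names are made up for illustration.

```python
def edge_ngrams(value, min_gram=1, max_gram=30):
    """All prefixes of the lowercased value, from min_gram to max_gram characters."""
    text = value.lower()
    longest = min(len(text), max_gram)
    return {text[:n] for n in range(min_gram, longest + 1)}


def prefix_matches(doc_value, query, max_gram=30):
    """The query matches if its lowercased form is one of the indexed prefixes."""
    return query.lower()[:max_gram] in edge_ngrams(doc_value, max_gram=max_gram)


print(prefix_matches("A brown fox", "A bro"))  # True
print(prefix_matches("A brown fox", "brown"))  # False
```

This reproduces the behavior described in the comment of the field type: the whole field value is treated as one token, so only queries that start at the left edge match.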

Resources