SolrNet: How can I perform Fuzzy search in SolrNet?

I am searching a "text" field in Solr and I am looking for a way to match, for example, "anamal" with "animal". My schema for the "text" field looks like the following:
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
Using SolrNet how can I perform a Fuzzy search to match "anamal" with "animal"?

It seems that you can use the Lucene "~" fuzzy search operator, but there is a catch: by default SolrNet escapes the tilde character for you. That behavior can be turned off, so this should do the trick:
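// Quoted = false keeps SolrNet from escaping the value, so the trailing "~" reaches Solr as the fuzzy operator
var query = new SolrQueryByField("myField", "anamal~");
query.Quoted = false;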
The caveat seems to be that (for now at least) you have to use SolrQueryByField: SolrQuery has no equivalent Quoted property, which makes sense, I guess.
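To make the point about the tilde concrete, here is a minimal Python sketch that builds the same kind of fuzzy clause by hand. It escapes the query parser's special characters inside the user's term but appends the tilde unescaped, since the tilde must reach Solr as an operator. The helper name `fuzzy_clause`, the field name `text`, and the `max_edits` parameter are illustrative assumptions, not part of SolrNet or Solr.

```python
import re
from urllib.parse import urlencode

# Characters with special meaning in the Lucene/Solr standard query parser.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def fuzzy_clause(field, term, max_edits=None):
    """Return a clause such as 'text:anamal~' for a single-word term.

    Hypothetical helper: special characters inside the term are escaped,
    but the trailing '~' is left bare so Solr treats it as the fuzzy operator.
    """
    if not term or any(ch.isspace() for ch in term):
        raise ValueError("fuzzy matching applies to a single term")
    escaped = _SPECIAL.sub(r"\\\1", term)
    # An explicit edit count ("~1") is Solr 4+ syntax; a bare "~" uses the default.
    suffix = "~" if max_edits is None else "~%d" % max_edits
    return "%s:%s%s" % (field, escaped, suffix)


# Query-string parameters for a select request (the field name is assumed).
params = urlencode({"q": fuzzy_clause("text", "anamal"), "wt": "json"})
print(params)
```

If the tilde were escaped as well, Solr would look for a literal "~" in the term instead of running a fuzzy match, which is the behavior the answer above works around.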

You can also do this with index-time synonyms. See the Solr Wiki page on SynonymFilterFactory for details on how to set this up.

Related

Solr search numbers in word form

I am indexing a couple of documents that contain numerals along with text, and they are indexed as expected. For example: "6 flags is a popular amusement park". When I search for "6 flags", I retrieve the document without issues. However, when I search for "Six flags", Solr doesn't return any document. I have defined a custom type called "text_field", and its definition is as follows:
<fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-whitespace.rbbi"/>
<filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords.txt" />
<filter class="solr.LengthFilterFactory" min="1" max="100"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
<!-- Remove external punctuation; spell text does exact match -->
<filter class="solr.PatternReplaceFilterFactory" pattern="^\p{Punct}*(.*?)\p{Punct}*$"
replacement="$1"/>
<!-- acronyms should be case-sensitive at index time, otherwise common words get expanded (like 'it') -->
<filter class="solr.SynonymFilterFactory" synonyms="acronym_synonyms_inline.txt" ignoreCase="false" />
<!-- DIRE: be aggressive about folding diacritics; also normalizes -->
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="new_name_synonyms.txt" ignoreCase="true" />
<!-- DIRE: Only do more aggressive synonyms at index time. -->
<filter class="solr.SynonymFilterFactory" synonyms="text_synonyms.txt" ignoreCase="true" expand="true"/>
<!-- DIRE: Multi-term synonyms break highlighting when searching on
words after the first. Here we treat each member term as a
synonym in order to get proper highlighting. Then we boost via a
hidden field to get the phrase matching -->
<filter class="solr.SynonymFilterFactory" synonyms="expand_synonyms_inline.txt" />
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-whitespace.rbbi"/>
<filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords.txt" />
<filter class="solr.PatternReplaceFilterFactory" pattern="^\p{Punct}*(.*?)\p{Punct}*$"
replacement="$1"/>
<filter class="solr.LengthFilterFactory" min="1" max="100"/>
<filter class="solr.WordDelimiterGraphFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<!-- DIRE: always do compound splitting on queries; we currently can't
do multiword synonyms at index time unless we protect the terms from tokenization. -->
<filter class="solr.SynonymGraphFilterFactory" synonyms="expand_synonyms.txt" ignoreCase="true" expand="true"/>
<!-- acronyms should be case-insensitive at query time, since users often don't bother -->
<filter class="solr.SynonymGraphFilterFactory" synonyms="acronym_synonyms_query.txt" ignoreCase="true" expand="true"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Which filter or tokenizer should I add at index time so that I can query numbers in their word form in Solr? I am using Solr 6.5.

Solr performance when indexing multivalued indexed fields

I'm using Solr 7.2, and I'm trying to index 133k documents with the DataImportHandler.
The problem is that indexing takes a long time (4 hours), and it gets especially slow after the first 50k documents.
After a lengthy analysis of this problem, I found that the indexed multivalued fields are responsible for the slow indexing: when I set the multivalued fields to indexed="false", indexing is very fast (a couple of minutes).
Is there a way to speed up indexing by changing the configuration or anything else?
<fieldType name="text_fr_lemmatized" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-select.txt" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-apostrophe.txt" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ponctuation.txt" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.HunspellStemFilterFactory" dictionary="fr_FR.dic" affix="fr_FR.aff" ignoreCase="true" strictAffixParsing="true" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>

Sorting not working on text_general type in solr

text_general is defined as
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" >
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I have another field defined as
<field name="model" type="text_general" indexed="true" stored="true" />
As a sample, my model name has hyphens, like "model":"ATP_JP_ATPK-000152-Y"
Sorting on this model field is not working properly: I am not getting model names in proper ascending or descending order.
I have searched a lot but still have not found a proper answer. Every time I get the same answer, and I am not able to apply it. Please help.
Sorting doesn't work well on tokenized fields. The model field is defined with the text_general field type, so it is tokenized, and the sort will not work correctly.
The sort field should either not be tokenized or use an analyzer that produces only a single term, for example one based on KeywordTokenizer.
Define a new field of type string, copy the model field into it (for example with copyField), and sort on that new field.
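The effect can be illustrated outside Solr with a short Python sketch. It is only an illustration under assumptions: `rough_tokens` is a crude stand-in for an analyzer (not Solr's StandardTokenizer), two of the three model names are made up, and picking the smallest token as the sort key is an arbitrary choice meant to show that a multi-term field has no single obvious sort value.

```python
import re

models = [
    "ATP_JP_ATPK-000152-Y",  # from the question
    "ATP_JP_ATPB-000300-A",  # made-up value for illustration
    "ATP_JP_ATPZ-000001-K",  # made-up value for illustration
]


def rough_tokens(value):
    """Crude stand-in for an analyzer: lowercase and split on non-alphanumerics.

    This is NOT Solr's StandardTokenizer; it only shows that one value
    turns into several terms.
    """
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


# Sorting on the whole value (what an untokenized string field gives you).
by_whole_value = sorted(models)

# Sorting on one arbitrary term per document (here: the smallest token).
by_single_term = sorted(models, key=lambda m: min(rough_tokens(m)))

print(by_whole_value)
print(by_single_term)
```

The two orders differ, which is why the sort should run against a field that keeps the whole model name as one term.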

Magento SOLR fuzzy search

I am using Solr search in Magento and trying to use Solr's fuzzy search, but so far without luck.
I have tried adding a tilde (~) at the end of the search query and also tried using "PorterStemFilterFactory", which is the best stemming filter that I know of so far, but neither gives me any results. For example, I have products named "Shiraz", so a fuzzy search should return the same results when searching for "shirag" or "shrag".
This is my schema section (I am giving only the English section because that is the only part that I use):
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="10" />
<!--
In this example, we will only use synonyms at query time.
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true"
expand="false"/>
-->
<!--
Case insensitive stop word removal. Add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"
enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"
enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldType>
Also these are the links I have tried:
http://johntwang.com/blog/2011/09/05/Fuzzy-and-Document-Searching-with-WebSolr-and-Heroku/
http://docs.lucidworks.com/display/solr/The+Standard+Query+Parser
http://www.rqna.net/qna/mnuhwh-solr-fuzzy-search-for-similar-words.html
See "Solr Fuzzy Search for similar words" and "Solr/Lucene fuzzy search too slow".
If you are looking for auto-suggest, then using the EdgeNGramFilter is definitely another option to consider.
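As background on what the "~" operator measures: fuzzy matching is based on edit distance, and in Solr 4 and later a bare "~" allows up to two edits. The following Python sketch computes plain Levenshtein distance for the examples from the question. It is a simplified illustration, not Lucene's implementation, and the function name `edit_distance` is an assumption. Note also that Solr compares the query term with terms as they exist in the index, that is, after filters such as EdgeNGram or stemming have run, so the indexed form may differ from the product name.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions.

    Lucene's fuzzy queries additionally count a swap of two adjacent
    characters as one edit; that case is left out to keep the sketch short.
    """
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


for typed in ("shirag", "shrag"):
    print(typed, edit_distance(typed, "shiraz"))
```

Both misspellings are within two edits of "shiraz", so they are candidates for a fuzzy match only if "shiraz" itself is present as an indexed term.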

SOLR search is not showing valid result

I am using Solr as the search engine for my application, but it is no longer returning the proper results.
In my schema file there is a field SubscriptionIds which holds multiple values with a separator. They are stored as ,,4588,,4585,,6966,,4855,
Similarly, there is another field ABCId which holds a single value, SKJJ54855
When I run the query:
ABCId:(SKJJ54855)
it returns records whose SubscriptionIds value is ,,4588,,4585,,6966,,4855,
But when I run the query:
SubscriptionIds: (,4855,) && ABCId:(SKJJ54855)
it doesn't return any results.
One more case: when I run the query SubscriptionIds: (,6966,) && ABCId:(SKJJ54855)
it does return results. For reference, (,6966,) is second to last in the SubscriptionIds list.
Why is it behaving so strangely?
Here is a portion of my schema.xml file:
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<field name="SubscriptionIds" type="textgen" indexed="true" stored="true" />
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<field name="ABCId" type="string" indexed="true" stored="true"/>
My suggestion would be to make the SubscriptionIds field multi-valued and store each ID as a separate value. This is more representative of the actual data than a comma-separated list. Change it to:
<field name="SubscriptionIds" type="int" indexed="true" stored="true" multiValued="true" />
and change your indexing code to add multiple IDs to the SubscriptionIds field.
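On the indexing side, the change could look roughly like the Python sketch below. It splits the existing comma-separated value into separate integers and builds one document with a multi-valued field. The helper name `parse_subscription_ids` and the `id` value are hypothetical, and the sketch assumes an `int` field type is defined in the schema.

```python
import json


def parse_subscription_ids(raw):
    """Split a value like ',,4588,,4585,' into a list of ints, skipping blanks."""
    return [int(part) for part in raw.split(",") if part.strip()]


ids = parse_subscription_ids(",,4588,,4585,,6966,,4855,")

# One document with a multi-valued field; "id" is a hypothetical unique key.
doc = {"id": "example-1", "ABCId": "SKJJ54855", "SubscriptionIds": ids}

# Solr's JSON update handler accepts a list of documents like this one.
print(json.dumps([doc]))
```

With the IDs stored separately, a query such as SubscriptionIds:4855 AND ABCId:SKJJ54855 matches on the exact ID and no longer depends on how the separators are tokenized.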
