Solr performance when indexing multivalued indexed fields

I'm using Solr 7.2 and trying to index 133k documents with the DataImportHandler.
The problem is that indexing takes a long time (4 hours), and it slows down especially after the first 50k documents.
After analyzing the problem at length, I found that the indexed multivalued fields are responsible for the slow indexing: when I set the multivalued fields to indexed="false", indexing is very fast (a couple of minutes).
Is there a way to speed up indexing by changing the configuration or anything else?
<fieldType name="text_fr_lemmatized" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-select.txt" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-apostrophe.txt" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ponctuation.txt" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.HunspellStemFilterFactory" dictionary="fr_FR.dic" affix="fr_FR.aff" ignoreCase="true" strictAffixParsing="true" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>

Related

Solr search numbers in word form

I am indexing a couple of documents that contain numerals mixed with text, and they are indexed as expected. For example: "6 flags is a popular amusement park". When I search for "6 flags", I can retrieve the document without issues. However, when I search for "Six flags", Solr doesn't return any documents. I have defined a custom type called "text_field", and its definition is as follows:
<fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-whitespace.rbbi"/>
<filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords.txt" />
<filter class="solr.LengthFilterFactory" min="1" max="100"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
<!-- Remove external punctuation; spell text does exact match -->
<filter class="solr.PatternReplaceFilterFactory" pattern="^\p{Punct}*(.*?)\p{Punct}*$"
replacement="$1"/>
<!-- acronyms should be case-sensitive at index time, otherwise common words get expanded (like 'it') -->
<filter class="solr.SynonymFilterFactory" synonyms="acronym_synonyms_inline.txt" ignoreCase="false" />
<!-- DIRE: be aggressive about folding diacritics; also normalizes -->
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="new_name_synonyms.txt" ignoreCase="true" />
<!-- DIRE: Only do more aggressive synonyms at index time. -->
<filter class="solr.SynonymFilterFactory" synonyms="text_synonyms.txt" ignoreCase="true" expand="true"/>
<!-- DIRE: Multi-term synonyms break highlighting when searching on
words after the first. Here we treat each member term as a
synonym in order to get proper highlighting. Then we boost via a
hidden field to get the phrase matching -->
<filter class="solr.SynonymFilterFactory" synonyms="expand_synonyms_inline.txt" />
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ICUTokenizerFactory" rulefiles="Latn:Latin-whitespace.rbbi"/>
<filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords.txt" />
<filter class="solr.PatternReplaceFilterFactory" pattern="^\p{Punct}*(.*?)\p{Punct}*$"
replacement="$1"/>
<filter class="solr.LengthFilterFactory" min="1" max="100"/>
<filter class="solr.WordDelimiterGraphFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<!-- DIRE: always do compound splitting on queries; we currently can't
do multiword synonyms at index time unless we protect the terms from tokenization. -->
<filter class="solr.SynonymGraphFilterFactory" synonyms="expand_synonyms.txt" ignoreCase="true" expand="true"/>
<!-- acronyms should be case-insensitive at query time, since users often don't bother -->
<filter class="solr.SynonymGraphFilterFactory" synonyms="acronym_synonyms_query.txt" ignoreCase="true" expand="true"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" />
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Which filter or tokenizer should I add at index time so that I can query numbers in their word form? I am using Solr 6.5.
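One possible approach, offered as an assumption rather than a confirmed fix: the index analyzer above already loads text_synonyms.txt with expand="true", so numeral/word equivalences could be added there in the usual comma-separated synonyms format. The sketch below only generates such lines; the word list and the helper name are hypothetical.

```python
# Hypothetical helper: generate "digit, word" lines in the comma-separated
# synonyms format, suitable for appending to a file such as text_synonyms.txt.
# The range covered (0 to 10) is an arbitrary choice for illustration.
NUMBER_WORDS = [
    "zero", "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine", "ten",
]


def synonym_lines(words=NUMBER_WORDS):
    """Return one equivalence line per number, e.g. '6, six'."""
    return [f"{digit}, {word}" for digit, word in enumerate(words)]


if __name__ == "__main__":
    for line in synonym_lines():
        print(line)
```

With index-time synonyms, documents have to be reindexed after the file changes before "Six flags" can match text that was indexed as "6 flags".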

How do I correctly index Spanish language documents using Solr?

How do I correctly index Spanish language documents using Solr?
More specifically, I have tried two different "character folding" techniques to index non-ASCII characters, and neither one seems to work 100% of the time. Both techniques let me find some accented characters but not others.
For example, I use the ASCIIFoldingFilterFactory like this:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwordsspanish.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwordsspanish.txt" ignoreCase="true"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Or I use the MappingCharFilterFactory like this:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwordsspanish.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwordsspanish.txt" ignoreCase="true"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
In both cases I can find some words with non-ASCII characters but not others. For example, I can find documents containing the word "presentará", but not necessarily all of them. I know my corpus includes the word "señor", but I can never find it.
What might I be doing wrong?
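As a reference point for what both configurations are meant to achieve, here is a minimal sketch of accent folding outside Solr. It only approximates ASCIIFoldingFilterFactory or mapping-FoldToASCII.txt (it uses Unicode decomposition, not Solr's mapping tables), and the function name is hypothetical. Running source text and query text through the same function is one way to check whether both end up as the same token.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Approximate accent folding: decompose, drop combining marks, lowercase.

    This is an illustration only; Solr's folding filters use their own
    mapping tables and cover more characters than this does.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


if __name__ == "__main__":
    for word in ["presentará", "señor", "Señor"]:
        print(word, "->", fold_to_ascii(word))
```

If a word from the corpus does not fold to the expected form here, the text may have been altered before it ever reached the analyzer (for example by an encoding mismatch), although that is only one possible explanation.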

Magento SOLR fuzzy search

I am using Solr search in Magento and trying to take advantage of Solr's fuzzy search, but so far without luck.
I have tried appending a tilde (~) to the search query, and I have also tried "PorterStemFilterFactory", which is the best stemming filter I know of, but neither gives me any results. For example, I have products named "Shiraz", so a fuzzy search should return the same results when searching for "shirag" or "shrag".
This is my schema section (I am including only the English section because that is the only part I use):
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="10" />
<!--
In this example, we will only use synonyms at query time.
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true"
expand="false"/>
-->
<!--
Case insensitive stop word removal. Add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"
enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"
enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords_en.txt"/>
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldType>
Also these are the links I have tried:
http://johntwang.com/blog/2011/09/05/Fuzzy-and-Document-Searching-with-WebSolr-and-Heroku/
http://docs.lucidworks.com/display/solr/The+Standard+Query+Parser
http://www.rqna.net/qna/mnuhwh-solr-fuzzy-search-for-similar-words.html
See "Solr Fuzzy Search for similar words" and "Solr/Lucene fuzzy search too slow".
If you are looking for auto-suggest, then using the EdgeNGramFilter is definitely another option to consider.
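Fuzzy matching with the tilde operator is based on edit distance between the query term and the terms in the index. The sketch below computes plain Levenshtein distance for the examples in the question. It is an approximation for illustration, not Lucene's implementation, and the function name is hypothetical. In recent Lucene/Solr versions the number after the tilde is a maximum edit distance of at most 2, while older versions interpret it as a similarity between 0 and 1.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for candidate in ["shirag", "shrag"]:
        print(candidate, edit_distance("shiraz", candidate))
```

Here "shirag" is one edit away from "shiraz" and "shrag" is two edits away, so both fall within a maximum edit distance of 2. Note that the comparison happens against the terms actually stored in the index, which in the schema above have already been lowercased, n-grammed, and stemmed.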

SOLR search is not showing valid result

I am using Solr as the search engine for my application, but it is no longer returning correct results.
In my schema there is a field, SubscriptionIds, which holds multiple values joined by a separator. They are stored as ,,4588,,4585,,6966,,4855,
There is also another field, ABCId, which holds a single value such as SKJJ54855
When I run the query:
ABCId:(SKJJ54855)
it returns records whose SubscriptionIds value is ,,4588,,4585,,6966,,4855,
But when I run the query:
SubscriptionIds: (,4855,) && ABCId:(SKJJ54855)
it returns no results.
In one more case, when I run the query: SubscriptionIds: (,6966,) && ABCId:(SKJJ54855)
it does return results. For reference, (,6966,) is second to last in the SubscriptionIds list.
Why is it behaving so strangely?
Here is the relevant portion of my schema.xml file:
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<field name="SubscriptionIds" type="textgen" indexed="true" stored="true" />
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<field name="ABCId" type="string" indexed="true" stored="true"/>
My suggestion would be to make the field SubscriptionIds multi-valued and store each ID as a separate value. This represents the actual data better than a comma-separated list. Change it to:
<field name="SubscriptionIds" type="int" indexed="true" stored="true" multiValued="true" />
and change your indexing code to add multiple IDs to the SubscriptionIds field.
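The indexing-side change can be sketched as follows, assuming the IDs currently arrive as the delimiter-padded string shown in the question. The helper names are hypothetical, and how the resulting document is sent to Solr depends on the client library in use.

```python
from typing import Dict, List


def parse_subscription_ids(raw: str) -> List[int]:
    """Split a delimiter-padded string such as ',,4588,,4585,' into integers.

    Empty pieces produced by the doubled commas are skipped.
    """
    return [int(piece) for piece in raw.split(",") if piece.strip()]


def build_document(abc_id: str, raw_subscription_ids: str) -> Dict[str, object]:
    """Build a document whose SubscriptionIds field has one value per ID."""
    return {
        "ABCId": abc_id,
        "SubscriptionIds": parse_subscription_ids(raw_subscription_ids),
    }


if __name__ == "__main__":
    print(build_document("SKJJ54855", ",,4588,,4585,,6966,,4855,"))
```

With the field stored this way, a query such as SubscriptionIds:4855 matches an individual ID directly, without relying on how the analyzer splits the commas.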

SolrNet: How can I perform Fuzzy search in SolrNet?

I am searching a "text" field in Solr and I am looking for a way to match, for example, "anamal" with "animal". My schema for the "text" field looks like the following:
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
Using SolrNet how can I perform a Fuzzy search to match "anamal" with "animal"?
It seems that you can use the Lucene "~" fuzzy search, but there is a trick: left to its defaults, SolrNet escapes the tilde character for you. That can be turned off, so this should work:
var query = new SolrQueryByField("myField", "some value~");
query.Quoted = false;
The caveat seems to be that, for now at least, you have to use a query by field; the same property doesn't exist on SolrQuery. That makes sense, I guess.
You can do this with index-time synonyms; see the SynonymFilterFactory page on the Solr Wiki for details on how to set this up.
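The escaping issue described in the first answer is not specific to SolrNet: any client that escapes query syntax must leave the trailing tilde alone for the fuzzy operator to work. A minimal sketch of the idea, with hypothetical helper names and assuming a single-word term:

```python
import re
from typing import Optional

# Characters with special meaning in the Lucene query syntax.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_term(term: str) -> str:
    """Backslash-escape query syntax characters inside the term itself."""
    return _SPECIAL.sub(r"\\\1", term)


def fuzzy_query(field: str, term: str, max_edits: Optional[int] = None) -> str:
    """Build 'field:term~' with the term escaped but the operator left intact.

    Assumes term is a single word; max_edits is appended only when given.
    """
    suffix = "~" if max_edits is None else f"~{max_edits}"
    return f"{field}:{escape_term(term)}{suffix}"


if __name__ == "__main__":
    print(fuzzy_query("text", "anamal"))
```

The resulting string can be passed to Solr as the q parameter. Whether "anamal" then matches "animal" also depends on how the indexed terms were stemmed.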
