Sort strings alphabetically with Solr - sorting

Context
I have a string field for 'title' that I want to sort alphabetically. I use Solr 4.10.2 for search and sort. Since strFields are case-sensitive by default, I am noticing that Solr is sorting my titles via ASCII sort (capital letters have priority over lowercase letters) and not alphabetically.
Current behavior (asc sort)
Mathematics: Introduction to Algebra
Mathematics: an introduction
Desired behavior (asc sort)
Mathematics: an introduction
Mathematics: Introduction to Algebra
Code in schema.xml
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<field name="string" stored="false" type="string_ci" multiValued="false" indexed="true"/>
Even after restarting Solr and reindexing, the sort is still an ASCII sort.

The field must be lowercased at index time.
Remove the type attribute from the <analyzer> element so that the analyzer applies to both indexing and querying:
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If you want distinct analyzers for each phase, include two <analyzer> definitions distinguished with the type attribute "index" and "query".
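As a rough sketch of the two-analyzer form described above, the same string_ci type could be written with explicit index and query analyzers. Keeping both chains identical is an assumption made here for sorting purposes; only the type values "index" and "query" come from the answer.

```xml
<!-- Sketch only: string_ci with separate index-time and query-time analyzers -->
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <!-- Applied when documents are indexed; this is what sorting relies on -->
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Applied to query text -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After changing the index-time analyzer, the documents need to be reindexed so that the lowercased terms are actually written to the index.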

Related

How to prefix match a doc value -> search term in lucene search engines like Solr, ElasticSearch

We need to prefix match in the direction <document value> -> <search term>. The reverse direction, <search term> -> <document value>, is already possible in Solr and ElasticSearch.
Example:
Search term -> "traveling the world"
Document field value -> "travel"
We are not sure how to write a prefix or fuzzy query so that this document is returned.
A normal prefix match with "travel*" works like this:
Search term -> "travel"
Document field value -> "traveling the world"
Try using the PorterStemFilterFactory in your field definition.
<filter class="solr.PorterStemFilterFactory"/>
Your definition may look like this:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
Here is what the input and output would be:
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"
There is another alternative to it known as solr.KStemFilterFactory which is less aggressive.
In short, you can define a field type for your field as below.
<fieldType name="StemmerFieldTypeDef" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- Lowercase and remove stop words before stemming; PorterStemFilter expects lowercase input -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- Same chain as at index time so that query terms and indexed terms line up -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
This definition determines how the text is stored at index time and how the query text is processed at query time.
The tokenizers and filters transform your original text according to the field definition.
For example, if you index the word "Travelling", it is indexed as "travel", so a search for "travel" matches and returns the record.
It works the other way round as well. If you index the text "Travel", it is indexed as "travel". If you then search for "Travelling", the query term is also reduced to "travel", so a match is found.
To get a good understanding of this analysis, check the Analysis screen in the Solr admin UI.
In the example above, In and Out show what goes into the field and what comes out after the analysis chain of the field type has been applied.
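For comparison, a field type using the less aggressive KStemFilterFactory mentioned above might look like the following. The name KStemFieldTypeDef is made up for illustration, and placing the lowercase filter before the stemmer is an assumption based on stemmers generally expecting lowercased input.

```xml
<!-- Sketch only: KStemFieldTypeDef is a hypothetical name -->
<fieldType name="KStemFieldTypeDef" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute, so the same chain is used for indexing and querying -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the two stemmers can reduce the same word to different terms, it is worth comparing their output for your own data on the Analysis screen before choosing one.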

Apache solr fuzzy search with distance parameter as 2

Environment: java version "11.0.12" 2021-07-20 LTS, solr-8.9.0
I have loaded a CSV file into Solr. The CSV file has a field 'name'. The type of the 'Name' column in Solr is defined as 'text_general'.
I understand that to perform a fuzzy search, the tilde ~ symbol is used at the end of a single-word term. The default value of the distance parameter is 2.
I have used the following fuzzy search query:
http://localhost:8983/solr/startsolr/select?indent=on&wt=json&q=(Name:'Ellyse~') AND (Name:'Perry~')&sort=field(Name) asc
The above fuzzy search query returns the names 'Ellysea Perry' and 'Ellys Perry'.
But why does the query not return the document with the name 'Elly Perry'? The default distance is 2, and only 2 characters ('se') are missing.
Strings with an edit distance of 2 (e.g. 'Elly Perry') should appear in the output.
I understand that "with max edit distance 2 i can have up to 2 insertions, deletions or substitutions."
Name available in the loaded data: 'Elly Perry'
Input query parameter: (Name:'Ellyse~') AND (Name:'Perry~')
After deleting 2 characters from the name 'Ellyse', it becomes 'Elly', so it should appear in the output. Could someone help me find the missing piece?
https://en.wikipedia.org/wiki/Levenshtein_distance
I expect the following rows to match:
'Ellysea Perry',
'Ellys Perry',
'Elly Perry'
But I only get the following two:
'Ellysea Perry',
'Ellys Perry'
The 'Name' field is configured in managed-schema as follows:
<field name="Name" type="text_general" multiValued="false" indexed="true" stored="true" required="true"/>
The fieldType 'text_general' is defined as follows:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
As per the documentation: "To perform a fuzzy search, use the tilde ~ symbol at the end of a single-word term."
References - https://solr.apache.org/guide/8_5/the-standard-query-parser.html#TheStandardQueryParser-FuzzySearches
I am out of ideas on this problem, even though it sounds like a classic one. I have spent hours playing around with it but have gotten nowhere.
I have resolved the error. Fuzzy search now works with edit distance 2.
The modified query, without the single quotes around the terms, is as follows:
http://localhost:8983/solr/startsolr/select?indent=on&wt=json&q=(Name:Ellyse~) AND (Name:Perry~)&sort=field(Name) asc
However, I have indexed 16 million records in Solr, and fuzzy search does not work for one specific token that occurs in 40K records.
In all other cases it works.
Do I have to configure some parameters in the solrconfig.xml file?

Solr sort in correct alphabetical order

I have an issue: I need to sort Solr results in correct alphabetical order when the response contains both uppercase and lowercase values.
Now, using
<field name="somefield" type="text_general" indexed="true" stored="true"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I get something like this:
aaa
AAA
BBB
bbb
BBB
DDD
ddd
as if there were no priority between uppercase and lowercase letters.
But I need to get this:
aaa
AAA
bbb
BBB
BBB
ddd
DDD
How can I get this sort order?
In 99% of cases, you don't want to sort on a tokenized field, because when you end up with 5 tokens, which of them are you sorting by?
However, if you want to lowercase, it can be a tokenized field, just with KeywordTokenizer and LowerCaseFilter. That way you always get exactly one token, and it is lowercase. Use copyField from the original field if you still want to search the original field with synonyms, etc. You don't have to store the copy; sorting uses the indexed representation only.
Also, docValues are good for sorting.
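The copyField approach described above could be sketched roughly as follows. The names lowercase_sort and somefield_sort are invented for illustration; the analyzer chain is the KeywordTokenizer plus lowercase filter combination from the answer.

```xml
<!-- Sketch only: lowercase_sort and somefield_sort are hypothetical names -->
<fieldType name="lowercase_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Keeps the whole value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercases that token so "AAA" and "aaa" sort together -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Original field, still searched with the text_general analysis -->
<field name="somefield" type="text_general" indexed="true" stored="true"/>
<!-- Sort-only copy: indexed but not stored -->
<field name="somefield_sort" type="lowercase_sort" indexed="true" stored="false" multiValued="false"/>
<copyField source="somefield" dest="somefield_sort"/>
```

After reindexing, the query would sort with sort=somefield_sort asc while still searching on somefield. One caveat: values that differ only in case, such as aaa and AAA, become identical in the sort field, so this sketch alone does not decide which of the two comes first.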

Fuzzy Problems in Solr Filter Query

I would be grateful if somebody could help me with my problem. I have this query:
select?q=city:Frankfurt am Main~&fq=street:Gerhart-Hauptmann-Str.~
This is not working for me. I want to use fuzzy search to catch some user input mistakes.
Here is what I want:
Frankfurt am Main should be searched completely in the field city with fuzzy search
Gerhart-Hauptmann-Str. should be converted into three terms with fuzzy search.
Debug output of what I actually get:
"debug": {
"rawquerystring": "city:Frankfurt am Main~",
"querystring": "city:Frankfurt am Main~",
"parsedquery": "city:frankfurt text:am text:Main~2",
"parsedquery_toString": "city:frankfurt text:am text:Main~2",
"explain": {...},
"QParser": "LuceneQParser",
"filter_queries": [
"street:Gerhart-Hauptmann-Str.~"
],
"parsed_filter_queries": [
"street:gerhart-hauptmann-str.~2"
],
I think I want this output:
"debug": {
"rawquerystring": "city:Frankfurt am Main~",
"querystring": "city:Frankfurt am Main~",
"parsedquery": "city:frankfurt~2 city:am~2 text:Main~2",
"parsedquery_toString": "city:frankfurt~2 city:am~2 text:Main~2",
"explain": {...},
"QParser": "LuceneQParser",
"filter_queries": [
"street:Gerhart-Hauptmann-Str.~"
],
"parsed_filter_queries": [
# My analyser converts Str. to strasse
"street:gerhart~2 street:hauptmann~2 strasse~2"
],
The definition of the fields in the schema.xml
<field name="city" type="admin_name" indexed="true" stored="true" />
<field name="street" type="street_name" indexed="true" stored="true" multiValued="false"/>
<fieldType name="admin_name" class="solr.TextField" >
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de_admin.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="street_name" class="solr.TextField" >
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<!-- The StartEndSynonymFilter replaces synonyms which
are at the start or the end of a term. The types
START_SYNONYM or END_SYNONYM will be set. -->
<filter class="my.StartEndSynonymFilterFactory" synonyms="lang/synonyms_de_street.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Is this somehow possible?
If you need additional information to answer, please leave a hint in a comment.
Tokenizing on Hyphens
Have a look at the WordDelimiterFilterFactory:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
Applying Fuzzy to every single term
DISCLAIMER: I have not yet used fuzzy search in my SOLR setups.
You might have to be careful with tokenizing the city names and applying the fuzzy search to every single token. Your example "Frankfurt am Main" would in this case apply fuzzy search to "am" as well. Please try it with parentheses, (Frankfurt am Main)~, and check whether this gets you the intended result.
However, in the case of names (cities or streets), I'm not sure you should even be tokenizing them. Maybe storing them as one case-insensitive token and applying the fuzzy search like this, "Frankfurt am Main"~ (with quotes in the query), is actually what you need.
Nevertheless, you should try to get it to work in the way you have described. Then look at the query results. And (maybe in parallel) set up an index where you store the city and street names as single tokens (e.g. KeywordTokenizer with lowercasing and ASCII folding) and apply fuzzy search to them as single terms. I would guess that the results will be sharper. But best of all, try it out and compare.
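The single-token variant suggested above might be set up along these lines. The names admin_name_exact and city_exact are hypothetical, and the chain simply combines the KeywordTokenizer, lowercasing and ASCII folding named in the suggestion.

```xml
<!-- Sketch only: admin_name_exact and city_exact are hypothetical names -->
<fieldType name="admin_name_exact" class="solr.TextField">
  <analyzer>
    <!-- The whole city name stays one token, e.g. "Frankfurt am Main" -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Searchable copy of the existing city field -->
<field name="city_exact" type="admin_name_exact" indexed="true" stored="false"/>
<copyField source="city" dest="city_exact"/>
```

With such a field, a query like city_exact:Frankfurt\ am\ Main~ (spaces escaped so that the parser sees one term) should apply the fuzzy match to the whole name, though it is worth confirming with debugQuery=true. Note that in the standard query parser a tilde after a quoted phrase is a proximity search rather than a fuzzy search, and that an edit distance of at most 2 then applies to the entire string.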
In addition, I would suggest trying out the (extended or not) DisMax handler, so that the input side does not even have to differentiate between cities and streets: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
With the dismax handler processing the input, you can allow the user to input search terms very freely (like having a single search field where cities and streets can be input in random order and format).

Solr search/faceting results have strange behaviour: i only get "stemmed" strings (hope it's correct definition)

Sorry for such a bad title, but I didn't know how to describe my problem.
I'm using sunburnt (a Python interface) to query Solr from my Django app.
When I'm searching, everything is OK and I get the full string.
On the other hand, if I'm faceting (let's say on the "job_title" field), I only get the stemmed words.
Like this:
<lst name="job_title">
<int name="manag">17095</int>
<int name="sale">7689</int>
<int name="engin">6995</int>
<int name="consult">4907</int>
<int name="account">4710</int>
<int name="develop">4509</int>
<int name="senior">4366</int>
and so on...
This is my text fieldType definition:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
I think the PorterStemFilter is the one screwing things up, but I need it to activate suggestions. Any help?
This is why you usually facet on unanalyzed fields. Add another field with StrField type, use a copyField directive to get the data there, and facet on this new string field.
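A minimal sketch of that setup, assuming the schema already has (or gets) a plain string type based on solr.StrField, could look like this. The field name job_title_facet is made up for illustration, and the attributes shown on job_title are assumptions, since the question does not include its field definition.

```xml
<!-- Unanalyzed type: values are indexed verbatim, without tokenizing or stemming -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- Existing analyzed field, still used for searching (attributes assumed) -->
<field name="job_title" type="text" indexed="true" stored="true"/>
<!-- Hypothetical facet-only copy holding the original, unstemmed value -->
<field name="job_title_facet" type="string" indexed="true" stored="false"/>
<copyField source="job_title" dest="job_title_facet"/>
```

After reindexing, faceting with facet=true&facet.field=job_title_facet should return whole job titles instead of stems like "manag". If job_title is multiValued, the copy needs multiValued="true" as well.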

Resources