I'm using Searchkick and see that a search for "animals" returns results for "anime", because both words share the stem "anim". Does anyone have suggestions on how to improve these results?
I see in the docs that you can do something like
exclude_queries = {
"animals" => ["anime"],
}
Product.search query, exclude: exclude_queries[query]
But it seems like a lot of work to keep a running list of all the bad matches like this.
I'm wondering if I need to change the stemmer?
It looks like the field is analyzed with the english analyzer, which applies a stemmer, rather than the standard analyzer, which does not stem tokens. The english analyzer produces the stemmed token shown below:
POST http://{{hostname}}:{{port}}/{{index-name}}/_analyze
{
"text" : "animals",
"analyzer" : "english"
}
{
"tokens": [
{
"token": "anim",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
}
]
}
The standard analyzer (the default for text fields) generates non-stemmed tokens:
{
"text" : "animals",
"analyzer" : "standard"
}
{
"tokens": [
{
"token": "animals",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
}
]
}
If you use the standard analyzer, you will not get the stemmed form, but then "running" will not be reduced to the stem "run", and searching for "running" will not return results for "run", "runs", etc. It's a trade-off: choose and adjust the analyzers according to your business requirements.
I might try something like this. https://www.elastic.co/guide/en/elasticsearch/reference/master/mixing-exact-search-with-stemming.html
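As a rough sketch of what that page describes: the field is indexed twice, once with the stemming english analyzer and once through a non-stemming sub-field. The index body is written here as a Python dict. The field name name and the names exact and english_exact are illustrative assumptions, and Searchkick generates its own mappings, so this applies only if you manage the Elasticsearch index directly.

```python
import json

# Hypothetical index body: "name" is stemmed, "name.exact" is not.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Custom analyzer without a stemmer: tokenizes and lowercases only.
                "english_exact": {"tokenizer": "standard", "filter": ["lowercase"]}
            }
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "english",
                "fields": {"exact": {"type": "text", "analyzer": "english_exact"}},
            }
        }
    },
}

# Query the non-stemmed sub-field when "animals" must not match "anime".
exact_query = {"query": {"match": {"name.exact": "animals"}}}

print(json.dumps(index_body, indent=2))
print(json.dumps(exact_query))
```

Querying name keeps the stemmed behaviour, while name.exact only matches the literal (lowercased) word.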
Update
Ankane of the searchkick gem was kind enough to add a feature to help with this. As of 4.4.1 you can do this:
class Product < ApplicationRecord
searchkick stemmer_override: ["anime => anime"]
end
This will prevent "anime" from being stemmed to "anim", so it won't show up in the "animals" search results.
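To see why an override of a word to itself helps, here is a toy model in Python. It is not the real english stemmer: TOY_STEMS is a hand-written lookup table that only mirrors the "anim" behaviour reported in the question, and the function names are made up for illustration.

```python
# Toy stand-in for the english analyzer: a lookup table, not a real stemmer.
# The "anim" stems mirror the behaviour reported in the question.
TOY_STEMS = {"animals": "anim", "animal": "anim", "anime": "anim"}


def stem(token, overrides=None):
    """Return the override for a token if one exists, else the toy stem."""
    overrides = overrides or {}
    if token in overrides:
        return overrides[token]
    return TOY_STEMS.get(token, token)


def matches(query, document_word, overrides=None):
    """Two words match when they are reduced to the same indexed term."""
    return stem(query, overrides) == stem(document_word, overrides)


print(matches("animals", "anime"))                       # no override
print(matches("animals", "anime", {"anime": "anime"}))   # with "anime => anime"
print(matches("animals", "animal", {"anime": "anime"}))  # other words still stem
```

Only the overridden word is exempt from stemming, so "animals" should still match "animal".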
Related
I have an index where some entries are like
{
"name" : " Stefan Drumm"
}
...
{
"name" : "Dr. med. Elisabeth Bauer"
}
The mapping of the name field is
{
"name": {
"type": "text",
"analyzer": "index_name_analyzer",
"search_analyzer": "search_cross_fields_analyzer"
}
}
When I use the below query
GET my_index/_search
{"size":10,"query":
{"bool":
{"must":
[{"match":{"name":{"query":"Stefan Drumm","operator":"AND"}}}]
,"boost":1.0}},
"min_score":0.0}
It returns the first document.
But when I try to get the second document using the query below
GET my_index/_search
{"size":10,"query":
{"bool":
{"must":
[{"match":{"name":{"query":"Dr. med. Elisabeth Bauer","operator":"AND"}}}]
,"boost":1.0}},
"min_score":0.0}
it is not returning anything.
Things I can't do:
I can't change the index.
I can't use the term query.
I can't change the operator to 'OR', because in that case it will return multiple entries, which I don't want.
What am I doing wrong, and how can I achieve this by modifying the query?
You have configured different analyzers for indexing and searching (index_name_analyzer and search_cross_fields_analyzer). If these analyzers process the input Dr. med. Elisabeth Bauer in an incompatible way, the search isn't going to match. This is described in more detail in Index and search analysis, as well as in Controlling Analysis.
You don't provide the definition of these two analyzers, so it's hard to guess from your question what they are doing. Depending on the analyzers, it may be possible to preprocess your query string (e.g. by removing .) before executing the search so that the search will match.
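A minimal sketch of such preprocessing in Python, assuming that the periods are what breaks the match (this cannot be confirmed without the analyzer definitions). The helper names are hypothetical.

```python
import json


def strip_periods(query_text):
    """Remove '.' characters and collapse any leftover whitespace."""
    return " ".join(query_text.replace(".", "").split())


def build_match_query(field, query_text):
    """Build a match query body with operator AND, as in the question."""
    return {
        "query": {
            "match": {field: {"query": strip_periods(query_text), "operator": "AND"}}
        }
    }


body = build_match_query("name", "Dr. med. Elisabeth Bauer")
print(json.dumps(body))
```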
You can investigate how analysis affects your search by using the _analyze API, as described in Testing analyzers. For your example, the commands
GET my_index/_analyze
{
"analyzer": "index_name_analyzer",
"text": "Dr. med. Elisabeth Bauer"
}
and
GET my_index/_analyze
{
"analyzer": "search_cross_fields_analyzer",
"text": "Dr. med. Elisabeth Bauer"
}
should show you how the two analyzers configured for your index treat the target string, which might give you a clue about what's wrong. The response will be something like
{
"tokens": [
{
"token": "dr",
"start_offset": 0,
"end_offset": 2,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "med",
"start_offset": 4,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "elisabeth",
"start_offset": 9,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "bauer",
"start_offset": 19,
"end_offset": 24,
"type": "<ALPHANUM>",
"position": 3
}
]
}
For the example output above, the analyzer has split the input into one token per word, lowercased each word, and discarded all punctuation.
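Once you have both responses, comparing them can be automated. The sketch below is a small Python helper (the names are hypothetical) that lists the search-time tokens with no identical index-time token. With the AND operator, any such token is enough to make the query return nothing. The two token lists are made up for illustration and are not the output of your analyzers.

```python
def tokens_of(analyze_response):
    """Extract the token strings from an _analyze response body."""
    return [entry["token"] for entry in analyze_response["tokens"]]


def unmatched_search_tokens(index_response, search_response):
    """Return search-time tokens that have no identical index-time token."""
    indexed = set(tokens_of(index_response))
    return [tok for tok in tokens_of(search_response) if tok not in indexed]


# Hypothetical responses: the index side keeps punctuation, the search side drops it.
index_side = {"tokens": [{"token": t} for t in ["dr.", "med.", "elisabeth", "bauer"]]}
search_side = {"tokens": [{"token": t} for t in ["dr", "med", "elisabeth", "bauer"]]}

print(unmatched_search_tokens(index_side, search_side))
```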
My guess would be that index_name_analyzer preserves punctuation, while search_cross_fields_analyzer discards it, so that the tokens won't match. If this is the case, and you can't change the index configuration (as you state in your question), one other option would be to specify a different analyzer when running the query:
GET my_index/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": {
"query": "Dr. med. Elisabeth Bauer",
"operator": "AND",
"analyzer": "index_name_analyzer"
}
}
}
],
"boost": 1
}
},
"min_score": 0
}
In the query above, the analyzer parameter has been set to override the search analysis to use the same analyzer (index_name_analyzer) as the one used when indexing. What analyzer might make sense to use depends on your setup. Ideally, you should configure the analyzers to align so that you don't have to override at search time, but it sounds like you are not living in an ideal world.
I am struggling to query for an exact match. The same value appears in two places in the document: in _id and in one field in the body.
So I can search either of these fields. Is there any way to configure the term query to support this? I've tried specifying the whitespace analyzer, but it doesn't seem to be a supported option for term queries.
I've tried a few variations, but none of them has worked so far.
data: {
query: {
term: {
"_id":"4123-0000"
}
}
}
This doesn't return anything.
The issue is that you are using the default mapping, and your _id value seems to be populated by you. With the default mapping the value is indexed as a text field, which uses the standard analyzer and splits tokens on -, so the value is tokenized as below:
POST /_analyze
{
"text" : "4123-0000",
"analyzer" : "standard"
}
And the generated tokens:
{
"tokens": [
{
"token": "4123",
"start_offset": 0,
"end_offset": 4,
"type": "<NUM>",
"position": 0
},
{
"token": "0000",
"start_offset": 5,
"end_offset": 9,
"type": "<NUM>",
"position": 1
}
]
}
As you may know, a term query is not analyzed, i.e. it takes 4123-0000 as is and looks it up in the inverted index. That exact term is not in the index, so you don't get any results.
Solution: simply replace _id with _id.keyword to get the search result.
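The difference between the analyzed field and the keyword sub-field can be imitated with a toy inverted index in Python. The tokenizer here is only a rough approximation that reproduces the tokens shown above for this input; it is not the real standard analyzer, and all names are made up.

```python
import re


def toy_standard_tokens(text):
    """Rough approximation: split on non-alphanumerics and lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs, analyze):
    """Map each indexed term to the set of document ids containing it."""
    index = {}
    for doc_id, value in docs.items():
        for term in analyze(value):
            index.setdefault(term, set()).add(doc_id)
    return index


def term_lookup(index, term):
    """A term query looks up the literal term without analyzing it."""
    return index.get(term, set())


docs = {"doc1": "4123-0000"}
analyzed = build_index(docs, toy_standard_tokens)    # like a text field
verbatim = build_index(docs, lambda value: [value])  # like a keyword field

print(term_lookup(analyzed, "4123-0000"))  # empty: only "4123" and "0000" exist
print(term_lookup(verbatim, "4123-0000"))  # finds doc1
```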
I can extract the text from my PDF files, but I don't quite know if I am going about storing the content in Elasticsearch the right way.
My PDF-Texts are mostly German - with letters like "ö", "ä", etc.
In order to store EVERY character of the content, I "escape" necessary characters and encode them properly to JSON so I can store them.
For example:
I want to store the following (PDF) text:
Öffentliche Verkehrsmittel. TestPath: C:\Windows\explorer.exe
I convert and upload it to Elasticsearch like this:
{"text":"\\u00D6ffentliche Verkehrsmittel. TestPath: C:\\\\Windows\\\\explorer.exe"}
My question is: Is this the right way to store documents like this?
Elasticsearch comes with a wide range of built-in language-specific analyzers. If you create a text field and store your data, the standard analyzer is used by default, which you can change like below:
{
"mappings": {
"properties": {
"title.german" :{
"type" :"text",
"analyzer" : "german"
}
}
}
}
You can also check the tokens generated by a language analyzer (german in your case) using the analyze API:
{
"text" : "Öffentliche",
"analyzer" : "german"
}
And generated token
{
"tokens": [
{
"token": "offentlich",
"start_offset": 0,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 0
}
]
}
Tokens for Ö
{
"text" : "Ö",
"analyzer" : "german"
}
{
"tokens": [
{
"token": "o",
"start_offset": 0,
"end_offset": 1,
"type": "<ALPHANUM>",
"position": 0
}
]
}
Note: the analyzer converted Ö to a plain o, so whether you search for Ö or ö, the document will appear in the search results, because the same analyzer is applied at query time if you use the match query.
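On the encoding part of the question: a standard JSON serializer already handles the escaping, so there should be no need to escape characters by hand first. Below is a minimal sketch with Python's json module (the variable names are illustrative). The last lines show that a doubly escaped \\u00D6, as in the payload from the question, decodes to the literal characters \u00D6 rather than to Ö.

```python
import json

# The raw text as extracted from the PDF (single backslashes in the path).
text = "Öffentliche Verkehrsmittel. TestPath: C:\\Windows\\explorer.exe"

# Let the serializer do the escaping. ensure_ascii=False keeps "Ö" as is.
payload_utf8 = json.dumps({"text": text}, ensure_ascii=False)
# The default (ensure_ascii=True) writes a unicode escape for "Ö" instead.
payload_ascii = json.dumps({"text": text})

print(payload_utf8)
print(payload_ascii)

# Round trip: both payloads decode to exactly the original text.
assert json.loads(payload_utf8)["text"] == text
assert json.loads(payload_ascii)["text"] == text

# A doubly escaped payload decodes to something different.
double_escaped = '{"text":"\\\\u00D6ffentliche"}'
print(json.loads(double_escaped)["text"])
```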
The ASCII Folding Token Filter folds the "Ə"/"ə" (U+018F / U+0259) characters to "A"/"a". I need to change this, or add a fold, to "E"/"e". A char_filter doesn't help because it doesn't preserve the original.
Add analyzer:
curl -XPUT 'localhost:9200/myix/_settings?pretty' -H 'Content-Type: application/json' -d'
{
"analysis" : {
"analyzer" : {
"default" : {
"tokenizer" : "standard",
"filter" : ["standard", "my_ascii_folding"]
}
},
"filter" : {
"my_ascii_folding" : {
"type" : "asciifolding",
"preserve_original" : true
}
}
}
}
'
Test result:
http://localhost:9200/myix/_analyze?text=üöğıəçşi_ÜÖĞIƏÇŞİ&filter=my_ascii_folding
{
"tokens": [
{
"token": "uogiacsi_UOGIACSI",
"start_offset": 0,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "üöğıəçşi_ÜÖĞIƏÇŞİ",
"start_offset": 0,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 0
}
]
}
When looking at Lucene's ASCIIFoldingFilter.java source file, it does indeed seem like Ə gets folded into an A and not an E. Even the ICU folding filter, which is asciifolding on steroids, does the same folding.
However, there's an interesting discussion on the subject, and it seems that, given the pronunciation, it should be folded into an a and not an e:
A quick search on English or French Wikipedia, where it currently gets folded, shows that it gets folded to an a! I would have expected an e based on orthography, but a makes sense in terms of pronunciation (in English, at least).
Someone else even thinks that neither a nor e makes sense:
That seems like a really bad decision. I don't think ə should fold to either of a or e.
Anyway, I don't think there is a way except using a char_filter or extending the ASCIIFoldingFilter and bundling it into an ES analysis plugin yourself.
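For reference, the char_filter route could look roughly like the sketch below. The settings fragment uses Elasticsearch's mapping character filter; the filter name schwa_to_e is a made-up label. The small Python function only imitates what that filter does to the text before tokenization, and, as noted in the question, the original character is not preserved this way.

```python
# Settings fragment for a "mapping" char filter ("schwa_to_e" is a hypothetical name).
char_filter_settings = {
    "analysis": {
        "char_filter": {
            "schwa_to_e": {"type": "mapping", "mappings": ["Ə => E", "ə => e"]}
        }
    }
}

# Local imitation of the same replacement, for experimenting without a cluster.
SCHWA_TABLE = str.maketrans({"Ə": "E", "ə": "e"})


def map_schwa(text):
    """Replace Ə/ə with E/e, leaving every other character untouched."""
    return text.translate(SCHWA_TABLE)


print(map_schwa("üöğıəçşi_ÜÖĞIƏÇŞİ"))
```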
I defined a custom analyzer that I was surprised isn't built in.
"analyzer": {
"keyword_lowercase": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
Then my mapping for this field is:
"email": {
"type": "string",
"analyzer": "keyword_lowercase"
}
This works great. (http://.../_analyze?field=email&text=me#example.com) ->
"tokens": [
{
"token": "me#example.com",
"start_offset": 0,
"end_offset": 16,
"type": "word",
"position": 1
}
]
Finding by that keyword works great. http://.../_search?q=me#example.com yields results.
The problem is trying to incorporate wildcards anywhere in the Query String Query. http://.../_search?q=*me#example.com yields no results. I would expect results containing emails such as "me#example.com" and "some#example.com".
It looks like elasticsearch performs the search with the default analyzer, which doesn't make sense. Shouldn't it perform the search with each field's own default analyzer?
I.E. http://.../_search?q=email:*me#example.com returns results because I am telling it which analyzer to use based upon the field.
Can elasticsearch not do this?
See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
Set analyze_wildcard to true, as it is false by default.
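In request-body form, that setting could be passed as sketched below. The query_string query and its analyze_wildcard parameter are part of Elasticsearch; the helper name wildcard_query is made up for this example.

```python
import json


def wildcard_query(query_text, analyze_wildcard=True):
    """Build a query_string search body with analyze_wildcard set explicitly."""
    return {
        "query": {
            "query_string": {
                "query": query_text,
                "analyze_wildcard": analyze_wildcard,
            }
        }
    }


body = wildcard_query("*me#example.com")
print(json.dumps(body))
```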