I performed the below Elasticsearch query.
GET amasyn/_search
{
"query": {
"bool" : {
"filter" : {
"term": {"ordernumber": "112-9550919-9141020"}
}
}
}
}
But it does not return any hits:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
But I have a document having this ordernumber in the index.
ordernumber is a text field.
When I change the above query by replacing term with match, I do get hits for the given order number.
Please explain what's happening here and how to solve this.
This is because you defined the ordernumber field with type text, so it is getting analyzed. Please refer to the difference between text and keyword in this answer: Difference between keyword and text in ElasticSearch.
You can define both text and keyword (as a sub-field) for your ordernumber field like this:
Mapping
{
"mappings": {
"properties": {
"ordernumber": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
and then you can use a term query as below:
{
"query": {
"bool" : {
"filter" : {
"term": {"ordernumber.keyword": "112-9550919-9141020"}
}
}
}
}
Please see how the text and keyword fields tokenize your text.
Standard analyzer
This analyzer is used when you define your field as text.
GET _analyze
{
"analyzer": "standard",
"text" : "112-9550919-9141020"
}
Result:
{
"tokens": [
{
"token": "112",
"start_offset": 0,
"end_offset": 3,
"type": "<NUM>",
"position": 0
},
{
"token": "9550919",
"start_offset": 4,
"end_offset": 11,
"type": "<NUM>",
"position": 1
},
{
"token": "9141020",
"start_offset": 12,
"end_offset": 19,
"type": "<NUM>",
"position": 2
}
]
}
Keyword Analyzer
This analyzer is used when you define your field as keyword.
GET _analyze
{
"analyzer": "keyword",
"text" : "112-9550919-9141020"
}
Result
{
"tokens": [
{
"token": "112-9550919-9141020",
"start_offset": 0,
"end_offset": 19,
"type": "word",
"position": 0
}
]
}
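Note that if the index already exists with ordernumber mapped only as text, you can add the keyword sub-field to the existing mapping and then back-fill it for documents that are already indexed. A minimal sketch, assuming Elasticsearch 7.x and the amasyn index from the question:
PUT amasyn/_mapping
{
  "properties": {
    "ordernumber": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}

POST amasyn/_update_by_query?conflicts=proceed
The _update_by_query call re-indexes the existing documents in place so that the new ordernumber.keyword sub-field gets populated; newly indexed documents pick it up automatically.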
I used a match_phrase query for full-text matching, but it did not work as I expected.
Query:
POST /_search
{
"query": {
"bool": {
"should": [
{
"match_phrase": {
"browsing_url": "/critical-illness"
}
}
],
"minimum_should_match": 1
}
}
}
Results:
"hits" : [
{
"_source" : {
"browsing_url" : "https://www.google.com/url?q=https://industrytoday.co.uk/market-research-industry-today/global-critical-illness-commercial-insurance-market-to-witness-a-pronounce-growth-during-2020-2025&usg=afqjcneelu0qvjfusnfjjte1wx0gorqv5q"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=critical+illness"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=critical+illness&tbm=nws"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness+-insurance%3f"
}
},
{
"_source" : {
"browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness%3f"
}
}
]
Expectation:
To only get results where the given string is an exact sub-string in the field. For example:
https://www.example.com/critical-illness OR
https://www.example.com/critical-illness-insurance
Mapping:
"browsing_url": {
"type": "text",
"norms": false,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
The results are not what I expected. I expected to get only the results where /critical-illness appears as a substring of the stored text.
The reason you're seeing unexpected results is that both your search query and the field itself are being run through an analyzer. Analyzers break text down into a list of individual terms that can be searched on. Here's an example using the _analyze endpoint:
GET _analyze
{
"analyzer": "standard",
"text": "example.com/critical-illness"
}
{
"tokens" : [
{
"token" : "example.com",
"start_offset" : 0,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "critical",
"start_offset" : 12,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "illness",
"start_offset" : 21,
"end_offset" : 28,
"type" : "<ALPHANUM>",
"position" : 2
}
]
}
So while your document's true value is example.com/critical-illness, behind the scenes Elasticsearch will only use this list of tokens for matches. The same thing goes for your search query, since you're using match_phrase, which tokenizes the phrase passed in. The end result is Elasticsearch trying to match the token list ["critical", "illness"] against your documents' token lists.
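You can see the same thing happening on the query side by running the search phrase itself through _analyze (a quick check, not part of the original answer):
GET _analyze
{
  "analyzer": "standard",
  "text": "/critical-illness"
}
This returns only the two tokens critical and illness; the leading slash is dropped entirely.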
Most of the time the standard analyzer does a good job of removing unnecessary tokens, however in your case you care about characters like / since you want to match against them. One way to solve this is to use a different analyzer like a reversed path hierarchy analyzer. Below is an example of how to configure this analyzer and use it for your browsing_url field:
PUT /browse_history
{
"settings": {
"analysis": {
"analyzer": {
"url_analyzer": {
"tokenizer": "url_tokenizer"
}
},
"tokenizer": {
"url_tokenizer": {
"type": "path_hierarchy",
"delimiter": "/",
"reverse": true
}
}
}
},
"mappings": {
"properties": {
"browsing_url": {
"type": "text",
"norms": false,
"analyzer": "url_analyzer",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
Now if you analyze a URL, you'll see the URL paths kept whole:
GET browse_history/_analyze
{
"analyzer": "url_analyzer",
"text": "example.com/critical-illness?src=blah"
}
{
"tokens" : [
{
"token" : "example.com/critical-illness?src=blah",
"start_offset" : 0,
"end_offset" : 37,
"type" : "word",
"position" : 0
},
{
"token" : "critical-illness?src=blah",
"start_offset" : 12,
"end_offset" : 37,
"type" : "word",
"position" : 0
}
]
}
This lets you do a match_phrase_prefix to find all documents with URLs that contain a critical-illness path:
POST /browse_history/_search
{
"query": {
"match_phrase_prefix": {
"browsing_url": "critical-illness"
}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.7896894,
"hits" : [
{
"_index" : "browse_history",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.7896894,
"_source" : {
"browsing_url" : "https://www.example.com/critical-illness"
}
}
]
}
}
EDIT:
The previous answer, before revision, was to use the keyword field and a regexp query; however, this is a pretty costly query to run.
POST /browse_history/_search
{
"query": {
"regexp": {
"browsing_url.keyword": ".*/critical-illness"
}
}
}
I'm trying to query all possible logs on 3 environments (dev, test, prod) with the below query using terms. I have tried both must and should.
curl -vs -o -X POST http://localhost:9200/*/_search?pretty=true -d '
{
"query": {
"bool": {
"minimum_should_match": 1,
"should": {
"terms": {
"can.deployment": ["can-prod", "can-test", "can-dev"]
}
},
"filter": [{
"range": {
"#timestamp": {
"gte": "2020-05-02T17:22:29.069Z",
"lt": "2020-05-23T17:23:29.069Z"
}
}
}, {
"terms": {
"can.level": ["WARN", "ERROR"]
}
}, {
"terms": {
"can.class": ["MTMessage", "ParserService", "JsonParser"]
}
}]
}
}
}'
gives:
{
"took" : 871,
"timed_out" : false,
"_shards" : {
"total" : 391,
"successful" : 389,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
However, if I replace terms with match it works, but then I can't query with the other inputs, like WARN messages or logs related to the ParserService class, etc.:
curl -vs -o -X POST http://localhost:9200/*/_search?pretty=true -d '
{
"query": {
"bool": {
"should":
[{"match": {"can.deployment": "can-prod"}}],
"filter": [{
"range": {
"#timestamp": {
"gte": "2020-03-20T17:22:29.069Z",
"lt": "2020-05-01T17:23:29.069Z"
}
}
},{
"match": {
"can.level": "ERROR"
}
},{
"match": {
"can.class": "MTMessage"
}
}
]
}
}
}'
How do I accomplish this, with or without terms/match?
I tried this, with no luck. I get 0 search results:
"match": {
"can.level": "ERROR"
}
},{
"match": {
"can.level": "WARN"
}
},{
"match": {
"can.class": "MTMessage"
}
}
Any hints will certainly help. TIA!
[EDIT]
Adding mappings (/_mapping?pretty=true):
"can" : {
"properties" : {
"class" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"deployment" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"level" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
Adding sample docs:
{
"took" : 50,
"timed_out" : false,
"_shards" : {
"total" : 391,
"successful" : 387,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 5.44714,
"hits" : [
{
"_index" : "filebeat-6.1.2-2020.05.21",
"_type" : "doc",
"_id" : "AXI9K_cggA4T9jvjZc03",
"_score" : 5.44714,
"_source" : {
"#timestamp" : "2020-05-21T02:59:25.373Z",
"offset" : 34395681,
"beat" : {
"hostname" : "4c80d1588455-661e-7054-a4e5-73c821d7",
"name" : "4c80d1588455-661e-7054-a4e5-73c821d7",
"version" : "6.1.2"
},
"prospector" : {
"type" : "log"
},
"source" : "/var/logs/packages/gateway_mt/1a27957180c2b57a53e76dd686a06f4983bf233f/logs/gateway_mt.log",
"message" : "[2020-05-21 02:59:25.373] ERROR can_gateway_mt [ActiveMT SNAP Worker 18253] --- ClientIdAuthenticationFilter: Cannot authorize publishing from client ThingPayload_4
325334a89c9 : not authorized",
"fileset" : {
"module" : "can",
"name" : "services"
},
"fields" : { },
"can" : {
"component" : "can_gateway_mt",
"instancename" : "canservices/0",
"level" : "ERROR",
"thread" : "ActiveMT SNAP Worker 18253",
"message" : "Cannot authorize publishing from client ThingPayload_4325334a89c9 : not authorized",
"class" : "ClientIdAuthenticationFilter",
"timestamp" : "2020-05-21 02:59:25.373",
"deployment" : "can-prod"
}
}
}
]
}
}
Expected output:
I'm trying to get a dump of the whole document that matches the criteria, something like the above sample doc.
"query": {
"bool": {
"minimum_should_match": 1,
"should": {
"terms": {
"can.deployment": ["can-prod", "can-test", "can-dev"]
}
"filter": [{
"range": {
"#timestamp": {
"gte": "2020-05-02T17:22:29.069Z",
"lt": "2020-05-23T17:23:29.069Z"
}
}
}, {
"terms": {
"can.level": ["WARN", "ERROR"]
}
}, {
"terms": {
"can.class": ["MTMessage", "ParserService", "JsonParser"]
}
}]
}
}
I suppose the above search query didn't work because your fields can.deployment, can.level and can.class are text fields. Elasticsearch analyzes these kinds of fields with the standard analyzer by default, which splits the text on word boundaries (such as the hyphens here) and converts it to lowercase. You can read more about it here.
In your case, for example, the can.deployment field value can-prod would be analyzed as:
{
"tokens": [
{
"token": "can",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "prod",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
}
]
}
A terms query matches exact terms (a case-sensitive search), but since Elasticsearch analyzes your text, splitting it and converting it to lowercase, you are not able to find the exact search text.
To solve this, while creating the mapping of the index for these 3 fields (can.deployment, can.level and can.class), you can add a keyword type of sub-field, which basically tells Elasticsearch not to analyze the field and to store it as it is.
You can create the mapping for these 3 fields like this:
Mapping:
"mappings": {
"properties": {
"can.class": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"can.deployment": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"can.level": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
and now you can perform a terms search using these keyword fields:
Search Query:
{ "query": {
"bool": {
"minimum_should_match": 1,
"should": {
"terms": {
"can.deployment.keyword": ["can-prod", "can-test", "can-dev"]
}
},
"filter": [ {
"terms": {
"can.level.keyword": ["WARN", "ERROR"]
}
}, {
"terms": {
"can.class.keyword": ["MTMessage", "ParserService", "JsonParser"]
}
}]
}
}
}
This terms query will only work for case-sensitive searches. You can read more about it here.
If you want to do a case-insensitive search, you can use a match query instead:
Search Query:
{
"query": {
"bool": {
"must": [
{
"match": {
"level": "warn error"
}
},
{
"match": {
"class": "MTMessage ParserService JsonParser"
}
},
{
"match": {
"deployment": "can-test can-prod can-dev"
}
}
]
}
}
}
This works because Elasticsearch by default analyzes your match query text with the same analyzer that was used at index time. Since in your case that is the standard analyzer, it will lowercase the match query text and split it into tokens. You can read more about it here.
For example, the search value MTMessage ParserService JsonParser will get analyzed internally as:
{
"tokens": [
{
"token": "mtmessage",
"start_offset": 0,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "parserservice",
"start_offset": 10,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "jsonparser",
"start_offset": 24,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 2
}
]
}
and since the values of this field in your documents were analyzed in the same way, they will match.
There is one issue here, though: the value can-test can-prod can-dev will get analyzed as:
{
"tokens": [
{
"token": "can",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "test",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "can",
"start_offset": 9,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "prod",
"start_offset": 13,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "can",
"start_offset": 18,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "dev",
"start_offset": 22,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 5
}
]
}
Now, if your index contains a document like this:
{
"can.deployment": "can",
"can.level": "WARN",
"can.class": "JsonParser"
}
Then this document will also show up in your search results.
So, based on what kind of search you want to perform and what kind of data you have, you can decide whether to use a terms query or a match query.
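A further option, beyond the terms and match queries above, is exact but case-insensitive matching via a lowercase normalizer on the keyword sub-fields. This is a rough sketch, not from the original answer; the index name my_index and normalizer name lowercase_normalizer are placeholders, and the same sub-field definition would be repeated for can.deployment and can.class:
PUT my_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "can.level": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}
With this mapping, a terms query such as {"terms": {"can.level.keyword": ["warn", "error"]}} should match WARN and ERROR documents as well, since the normalizer is applied both at index time and to term-level query input.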
I want to configure Elasticsearch so that searching for "JaFNam" will produce a good score for "JavaFileName".
I tried to build an analyzer that combines a CamelCase pattern analyzer with an edge_ngram filter. I thought this would create terms like these:
J F N Ja Fi Na Jav Fil Nam Java File Name
But the edge_ngram part seems not to have any effect: I keep getting only these terms:
Java File Name
What would the correct Elasticsearch configuration look like?
Example code:
curl -XPUT 'http://127.0.0.1:9010/hello?pretty=1' -d'
{
"settings":{
"analysis":{
"analyzer":{
"camel":{
"type":"pattern",
"pattern":"([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])",
"filters": ["edge_ngram"]
}
}
}
}
}
'
curl -XGET 'http://127.0.0.1:9010/hello/_analyze?pretty=1' -d'
{
"analyzer":"camel",
"text":"JavaFileName"
}'
results in:
{
"tokens" : [ {
"token" : "java",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
}, {
"token" : "file",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 1
}, {
"token" : "name",
"start_offset" : 8,
"end_offset" : 12,
"type" : "word",
"position" : 2
} ]
}
Your analyzer definition is not correct: a custom analyzer needs a tokenizer and an array of filters. As it is, your analyzer doesn't work. Try it like this instead:
{
"settings": {
"analysis": {
"analyzer": {
"camel": {
"tokenizer": "my_pattern",
"filter": [
"my_gram"
]
}
},
"filter": {
"my_gram": {
"type": "edge_ngram",
"max_gram": 10
}
},
"tokenizer": {
"my_pattern": {
"type": "pattern",
"pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
}
}
}
}
}
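Assuming the index is recreated with these settings (for example via the same curl -XPUT call as in the question), re-running the _analyze request from the question should now show the edge-ngram prefixes:
curl -XGET 'http://127.0.0.1:9010/hello/_analyze?pretty=1' -d'
{
  "analyzer":"camel",
  "text":"JavaFileName"
}'
This should return prefix tokens for each camel-case part (e.g. J, Ja, Jav, Java, F, Fi, Fil, File, N, Na, Nam, Name) rather than just the three full words; min_gram defaults to 1 for the edge_ngram filter, which matches the single-letter terms in the expected list.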
I am storing a 'Payment Reference Number' in elasticsearch.
Its layout is e.g. 2-4-3-635844569819109531 or 2-4-2-635844533758635433, etc.
I want to be able to search for documents by their payment ref number either by
Searching using the 'whole' reference number, e.g. putting in 2-4-2-635844533758635433
Any 'part' of the reference number from the 'start'. E.g. 2-4-2-63 (.. so only return the second one in the example)
Note: I do not want to search 'in the middle' or 'at the end', etc.; from the beginning only.
Anyways, the hyphens are confusing me.
Questions
1) I am not sure if I should remove them in the mapping like
"char_filter" : {
"removeHyphen" : {
"type" : "mapping",
"mappings" : ["-=>"]
}
},
or not. I have never used mappings in that way, so I am not sure if this is necessary.
2) I think I need an ngram filter because I want to be able to search for a part of the reference number from the beginning. I think something like
"partial_word":{
"filter":[
"standard",
"lowercase",
"name_ngrams"
],
"type":"custom",
"tokenizer":"whitespace"
},
and the filter
"name_ngrams":{
"side":"front",
"max_gram":50,
"min_gram":2,
"type":"edgeNGram"
},
I am not sure how to put it all together, but:
"paymentReference":{
"type":"string",
"analyzer": "??",
"fields":{
"partial":{
"search_analyzer":"???",
"index_analyzer":"partial_word",
"type":"string"
}
}
}
Everything that I have tried seems to always 'break' in the second search case.
If I do 'localhost:9200/orders/_analyze?field=paymentReference&pretty=1' -d "2-4-2-635844533758635433", it always breaks on the hyphen as its own token and returns e.g. all documents with 2-, which is a lot, and not what I want when searching for 2-4-2-6.
Can someone tell me how to map this field for the two types of searches I am trying to achieve?
Update - Answer
Effectively what Val said below. I just changed the mapping slightly to be more specific regarding the analyzers, and I also don't need the main string indexed because I only query the partial field.
Mapping
"paymentReference":{
"type": "string",
"index":"not_analyzed",
"fields": {
"partial": {
"search_analyzer":"payment_ref",
"index_analyzer":"payment_ref",
"type":"string"
}
}
}
Analyzer
"payment_ref": {
"type": "custom",
"filter": [
"lowercase",
"name_ngrams"
],
"tokenizer": "keyword"
}
Filter
"name_ngrams":{
"side":"front",
"max_gram":50,
"min_gram":2,
"type":"edgeNGram"
},
You don't need to use the mapping char filter for this.
You're on the right track using the Edge NGram token filter since you need to be able to search for prefixes only. I would use a keyword tokenizer instead to make sure the term is taken as a whole. So the way to set this up is like this:
curl -XPUT localhost:9200/orders -d '{
"settings": {
"analysis": {
"analyzer": {
"partial_word": {
"type": "custom",
"filter": [
"lowercase",
"ngram_filter"
],
"tokenizer": "keyword"
}
},
"filter": {
"ngram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 50
}
}
}
},
"mappings": {
"order": {
"properties": {
"paymentReference": {
"type": "string",
"fields": {
"partial": {
"analyzer": "partial_word",
"type": "string"
}
}
}
}
}
}
}'
Then you can analyze what is going to be indexed into your paymentReference.partial field:
curl -XGET 'localhost:9200/orders/_analyze?field=paymentReference.partial&pretty=1' -d "2-4-2-635844533758635433"
And you get exactly what you want, i.e. all the prefixes:
{
"tokens" : [ {
"token" : "2-",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-6",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-63",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-635",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-6358",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
"token" : "2-4-2-63584",
"start_offset" : 0,
"end_offset" : 24,
"type" : "word",
"position" : 1
}, {
...
Finally you can search for any prefix:
curl -XGET localhost:9200/orders/order/_search?q=paymentReference.partial:2-4-3
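If you prefer a JSON body over the URI search, a term query gives an equivalent prefix lookup, since the term query does not re-analyze its input and therefore matches the stored edge-ngram token directly (a sketch based on the index created above):
curl -XGET 'localhost:9200/orders/order/_search?pretty=1' -d '{
  "query": {
    "term": {
      "paymentReference.partial": "2-4-3"
    }
  }
}'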
Not sure whether a wildcard search matches your needs. I defined a custom filter and set preserve_original to true and generate_number_parts to false. Here is the sample code:
PUT test1
{
"settings" : {
"analysis" : {
"analyzer" : {
"myAnalyzer" : {
"type" : "custom",
"tokenizer" : "whitespace",
"filter" : [ "dont_split_on_numerics" ]
}
},
"filter" : {
"dont_split_on_numerics" : {
"type" : "word_delimiter",
"preserve_original": true,
"generate_number_parts" : false
}
}
}
},
"mappings": {
"type_one": {
"properties": {
"title": {
"type": "text",
"analyzer": "standard"
}
}
},
"type_two": {
"properties": {
"raw": {
"type": "text",
"analyzer": "myAnalyzer"
}
}
}
}
}
POST test1/type_two/1
{
"raw": "2-345-6789"
}
GET test1/type_two/_search
{
"query": {
"wildcard": {
"raw": "2-345-67*"
}
}
}
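The reason this works: with generate_number_parts disabled and preserve_original enabled, the word_delimiter filter keeps the whole reference 2-345-6789 as a single token instead of splitting it on the hyphens, so the wildcard pattern can match against the full value. You can verify this with _analyze (a quick check, not part of the original answer):
GET test1/_analyze
{
  "analyzer": "myAnalyzer",
  "text": "2-345-6789"
}
This should return the single token 2-345-6789.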
I am trying to understand the fieldnorm calculation in Elasticsearch (1.6) for documents indexed with a shingle analyzer; it does not seem to include the shingled terms. If so, is it possible to configure the calculation to include them? Specifically, this is the analyzer I used:
{
"index" : {
"analysis" : {
"filter" : {
"shingle_filter" : {
"type" : "shingle",
"max_shingle_size" : 3
}
},
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["word_delimiter", "lowercase", "shingle_filter"]
}
}
}
}
}
This is the mapping used:
{
"docs": {
"properties": {
"text" : {"type": "string", "analyzer" : "my_analyzer"}
}
}
}
And I posted a few documents:
{"text" : "the"}
{"text" : "the quick"}
{"text" : "the quick brown"}
{"text" : "the quick brown fox jumps"}
...
When using the following query with the explain API,
{
"query": {
"match": {
"text" : "the"
}
}
}
I get the following fieldnorms (other details omitted for brevity):
"_source": {
"text": "the quick"
},
"_explanation": {
"value": 0.625,
"description": "fieldNorm(doc=0)"
}
"_source": {
"text": "the quick brown fox jumps over the"
},
"_explanation": {
"value": 0.375,
"description": "fieldNorm(doc=0)"
}
The values seem to suggest that ES sees 2 terms for the 1st document ("the quick") and 7 terms for the 2nd document ("the quick brown fox jumps over the"), excluding the shingles. Is it possible to configure ES to calculate field norm with the shingled terms too (ie. all terms returned by the analyzer)?
You would need to customize the default similarity by disabling the discount overlap flag.
Example:
{
"index" : {
"similarity" : {
"no_overlap" : {
"type" : "default",
"discount_overlaps" : false
}
},
"analysis" : {
"filter" : {
"shingle_filter" : {
"type" : "shingle",
"max_shingle_size" : 3
}
},
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["word_delimiter", "lowercase", "shingle_filter"]
}
}
}
}
}
Mapping:
{
"docs": {
"properties": {
"text" : {"type": "string", "analyzer" : "my_analyzer", "similarity
" : "no_overlap"}
}
}
}
To expand further:
By default, overlaps, i.e. tokens with a position increment of 0, are ignored when computing the norm.
The example below shows the positions of the tokens generated by the "my_analyzer" described in the OP:
get <index_name>/_analyze?field=text&text=the quick
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "the quick",
"start_offset": 0,
"end_offset": 9,
"type": "shingle",
"position": 1
},
{
"token": "quick",
"start_offset": 4,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 2
}
]
}
According to the Lucene documentation, the length norm calculation for the default similarity is implemented as follows:
state.getBoost() * lengthNorm(numTerms)
where numTerms is
FieldInvertState.getLength() if setDiscountOverlaps(boolean) is false,
or FieldInvertState.getLength() - FieldInvertState.getNumOverlap() otherwise.
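Applied to the example above: for the document "the quick", my_analyzer emits three tokens (the, quick, and the shingle the quick), and the shingle shares its position with the, i.e. it is an overlap. With the default discount_overlaps = true the overlap is subtracted, giving numTerms = 3 - 1 = 2; with discount_overlaps = false all three tokens count and numTerms = 3. Since lengthNorm for the default similarity is roughly 1/sqrt(numTerms) (stored with lossy single-byte precision), disabling the discount produces the smaller field norm that reflects the shingled terms as well.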