Ignore term frequency but use positions - elasticsearch

I have an index with a text field, and I want to ignore term frequencies in scoring but keep positions so that I can still run match_phrase searches.
My index is defined like this:
curl --location --request PUT 'localhost:9200/my-index-001' \
--header 'Content-Type: application/json' \
--data-raw '{
"mappings": {
"autocomplete": {
"properties": {
"title": {
"type": "text",
"analyzer": "row_autocomplete"
},
"name": {
"type": "text",
"analyzer": "row_autocomplete"
}
}
}
},
"settings": {
"analysis": {
"analyzer": {
"row_autocomplete": {
"tokenizer": "icu_tokenizer",
"filter": ["icu_folding", "autocomplete_filter", "lowercase"]
}
},
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
}
}
}
}'
index data:
[
{
"title": "university",
"name": "london and EC london English"
},
{
"title": "city",
"name": "london"
}
]
I want city to get the higher score when I run a query like this:
POST _search
{
"query": {
"bool": {
"should": [
{
"match": {
"name": {
"query": "london"
}
}
},
{
"match_phrase": {
"name": {
"query": "london"
}
}
}
]
}
}
}
The two documents get different scores (university actually scores higher than city) because of term frequency. What I want is to count a term only once per document. Since city's fieldLength is smaller than university's, ignoring repeated occurrences would make city score higher than university under Elasticsearch's scoring formula, as the _explain output shows:
GET _explain
# city's _explain
{
"value": 2.0785222,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 6.0,
"description": "termFreq=6.0",
"details": []
},
{
"value": 2.0,
"description": "fieldLength",
"details": []
},
...
]
}
# university's explain
{
"value": 2.1087635,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 24.0,
"description": "termFreq=24.0",
"details": []
},
{
"value": 29.0,
"description": "fieldLength",
"details": []
},
...
]
}
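To see how much the repeated term frequency matters, the tfNorm formula from the explain output can be evaluated by hand. This is only an illustrative sketch: k1 = 1.2 and b = 0.75 are the BM25 defaults, and the average field length of 35.4 is an assumption, because the explain output above elides it.

```python
def tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """BM25 tfNorm, written exactly as in the explain description."""
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))


# Assumption: avgFieldLength is not shown in the explain output above.
AVG_FIELD_LENGTH = 35.4

# termFreq and fieldLength as reported by _explain.
city_actual = tf_norm(6.0, 2.0, AVG_FIELD_LENGTH)
university_actual = tf_norm(24.0, 29.0, AVG_FIELD_LENGTH)

# Same documents if each term were counted only once.
city_once = tf_norm(1.0, 2.0, AVG_FIELD_LENGTH)
university_once = tf_norm(1.0, 29.0, AVG_FIELD_LENGTH)

print("reported freq:", city_actual, university_actual)
print("freq = 1     :", city_once, university_once)
```

With the reported frequencies the university document comes out slightly ahead; with freq fixed at 1 only the field-length term differs, so the shorter city document comes out ahead.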
One thing I tried: setting index_options to docs in the index mapping ignores term frequencies, but it also disables term positions, so I can't use match_phrase queries anymore.
Does anyone have an idea?
Thanks in advance.

You can use a constant_score query, which wraps a filter query and returns every matching document with a relevance score equal to the boost parameter value.
Inside a constant_score query, the match clause no longer contributes a relevance score: every matching document gets the same score (the boost, 1 by default). The clause runs in filter context, so it only decides whether a document matches; it does not rank documents by full-text relevance.
A constant_score query takes a boost argument that is set as the score
for every returned document when combined with other queries. By
default boost is set to 1.
Refer to this for a detailed explanation of the bool filter, and to this SO answer to understand the difference between a constant_score query and a bool filter.
Adding a working example with index data, search query and search result.
Index data
{ "name": "london only" }
{ "name": "london and London" }
Search Query:
{
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"match": {
"name": "london"
}
}
]
}
}
}
}
}
Search Result:
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"name": "london and london"
}
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"name": "london only"
}
}
]
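For the original question (a match clause plus a match_phrase clause), one way to apply the same idea is to wrap each clause in its own constant_score so that term frequency plays no part but phrase matches still rank higher. The following is a minimal sketch, not a tested recipe: the helper names and the boost values 1.0 and 2.0 are hypothetical, and it only builds the request body.

```python
import json


def constant(query, boost):
    """Wrap a query so every match scores exactly `boost`."""
    return {"constant_score": {"filter": query, "boost": boost}}


def build_query(field, text, match_boost=1.0, phrase_boost=2.0):
    # Hypothetical boosts: tune them for your own ranking needs.
    return {
        "query": {
            "bool": {
                "should": [
                    constant({"match": {field: text}}, match_boost),
                    constant({"match_phrase": {field: text}}, phrase_boost),
                ]
            }
        }
    }


body = build_query("name", "london")
print(json.dumps(body, indent=2))
```

A document that matches both clauses scores the sum of the two boosts. Note that this also removes field-length normalization, so by itself it gives "london" and "london and EC london English" the same score.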

I indexed both sample documents you provided using the default index mapping, so both the title and name fields are text fields. I then ran your query unchanged, and it returns the higher score for the doc that contains just london, as shown below:
"hits": [
{
"_index": "matchrel",
"_type": "_doc",
"_id": "1",
"_score": 0.51518387,
"_source": {
"title": "city",
"name": "london"
}
},
{
"_index": "matchrel",
"_type": "_doc",
"_id": "2",
"_score": 0.41750965,
"_source": {
"title": "university",
"name": "london university and EC London English"
}
}
]
Also, as you have not explained your use case in detail, and given the limited info, it seems this can easily be achieved with the query below, which also returns a higher score for the london doc:
{
"query": {
"match_phrase": {
"name": "london"
}
}
}
And its search result
"hits": [
{
"_index": "matchrel",
"_type": "_doc",
"_id": "1",
"_score": 0.25759193, // note score
"_source": {
"title": "city",
"name": "london"
}
},
{
"_index": "matchrel",
"_type": "_doc",
"_id": "2",
"_score": 0.20875482,
"_source": {
"title": "university",
"name": "london university and EC London English"
}
}
]

Related

Elastic Search 1.4 phrase query with OR operator with hyphen (-) in search string

I have an issue with phrase queries in Elasticsearch 1.4. I am creating the index below with this data:
curl -XPUT localhost:9200/test
curl -XPOST localhost:9200/test/doc/1 -d '{"field1" : "abc-xyz"}'
curl -XPOST localhost:9200/test/doc/2 -d '{"field1" : "bcd-gyz"}'
So by default field1 is analyzed by Elasticsearch with the default analyzer.
I am running the phrase query below, but it is not returning any results.
{
"query": {
"filtered": {
"filter": {
"bool": {
"should": [
{
"query": {
"multi_match": {
"query": "abc\\-xyz OR bcd\\-gyz",
"type": "phrase",
"fields": [
"field1"
]
}
}
}
]
}
}
}
}
}
So the Elasticsearch phrase query is not working with the OR operator. Any idea why it is not working? Is it a limitation of Elasticsearch caused by the special character hyphen (-) in the text?
Based on the comment, I am adding an answer that uses query_string, which works with OR between multiple phrases. It didn't work with multiple multi_match clauses, hence query_string has to be used.
This uses the same indexed docs as in the previous answer, but with the search query below.
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "\"abc-xyz\" OR \"bcd-gyz\"",
"fields": [
"title"
]
}
}
]
}
}
}
Search results
"hits": [
{
"_index": "phrasemulti",
"_type": "doc",
"_id": "1",
"_score": 0.05626005,
"_source": {
"title": "bcd-gyz"
}
},
{
"_index": "phrasemulti",
"_type": "doc",
"_id": "2",
"_score": 0.05626005,
"_source": {
"title": "abc-xyz"
}
}
]
When you remove a few characters from a phrase, that phrase no longer matches, and when you change the operator to AND, the sample data returns no search results, which is expected.
{
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "\"abc-xyz\" OR \"bcd-gz\"",
"fields": [
"title"
]
}
}
]
}
}
}
This returns only one search result, as the phrase bcd-gz does not exist in the sample data.
"hits": [
{
"_index": "phrasemulti",
"_type": "doc",
"_id": "2",
"_score": 0.05626005,
"_source": {
"title": "abc-xyz"
}
}
]
The query below works fine for me:
{
"query": {
"filtered": {
"filter": {
"bool": {
"should": [
{
"query": {
"multi_match": {
"query": "abc-xyz", // note passing only one query without escaping hyphen
"type": "phrase",
"fields": [
"title"
]
}
}
}
]
}
}
}
}
}
Search results with explain param
"hits": [
{
"_shard": 3,
"_node": "1h3iipehS2abfclj51Vtsg",
"_index": "phrasemulti",
"_type": "doc",
"_id": "2",
"_score": 1.0,
"_source": {
"title": "abc-xyz"
},
"_explanation": {
"value": 1.0,
"description": "ConstantScore(BooleanFilter(QueryWrapperFilter(title:\"abc xyz\"))), product of:",
"details": [
{
"value": 1.0,
"description": "boost"
},
{
"value": 1.0,
"description": "queryNorm"
}
]
}
}
]
I verified that it returns results according to the phrase, as the query abc-xy doesn't return any result.

Scoring higher for shorter fields

I'm trying to get a higher score (or at least the same score) for the shortest values in Elasticsearch.
Let's say I have these documents: "Abc", "Abca", "Abcb", "Abcc". The field label.ngram uses an EdgeNgram analyser.
With a really simple query like this:
{
"query": {
"match": {
"label.ngram": {
"query": "Ab"
}
}
}
}
I always get first the documents "Abca", "Abcb", "Abcc" instead of "Abc".
How can I get "Abc" first?
(should I use this: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html?)
Thanks!
This happens because of field-length normalization. To get the same score, you have to disable norms on the field.
Norms store various normalization factors that are later used at query
time in order to compute the score of a document relatively to a
query.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"norms": false,
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"title": "Abca"
}
{
"title": "Abcb"
}
{
"title": "Abcc"
}
{
"title": "Abc"
}
Search Query:
{
"query": {
"match": {
"title": {
"query": "Ab"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65953349",
"_type": "_doc",
"_id": "1",
"_score": 0.1424427,
"_source": {
"title": "Abca"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "2",
"_score": 0.1424427,
"_source": {
"title": "Abcb"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "3",
"_score": 0.1424427,
"_source": {
"title": "Abcc"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "4",
"_score": 0.1424427,
"_source": {
"title": "Abc"
}
}
]
As mentioned by @ESCoder, disabling norms fixes the scoring, but this is not very useful if you want to rank your search results: all the documents in your search results get the same score, which hurts relevance badly.
Maybe you should tweak the document length norm parameter of the default similarity algorithm (BM25) if you are on ES 5.x or higher. I tried doing this with your dataset and my settings but couldn't get it to work.
A second option, which will mostly work and which you suggested, is to store the size of the field in a separate field. You would have to populate it from your application, because the analysis process generates several tokens for the same field value. This is extra overhead, though, and I would prefer tweaking the similarity algorithm's parameters.

ElasticSearch: why it is not possible to get suggest by criteria?

I want to get suggestions from some text for a specific user.
As I understand it, Elasticsearch provides suggestions based on the whole dictionary (inverted index), which contains all the terms in the index.
So if user1 posts some text, then this text can be suggested to user2. Am I right?
Is it possible to add a filter by some criterion (by user, for example) to reduce the set of terms to be suggested?
Yes, that's very much possible. Let me show you with an example that uses a query with filter context:
Index def
{
"mappings": {
"properties": {
"title": {
"type": "text" --> inverted index for storing suggestions on title field
},
"userId" : {
"type" : "keyword" --> like in your example
}
}
}
}
Index sample doc
{
"title" : "foo baz",
"userId" : "katrin"
}
{
"title" : "foo bar",
"userId" : "opster"
}
Search query without userId filter
{
"query": {
"bool": {
"must": {
"match": {
"title": "foo"
}
}
}
}
}
Search results(bring both results)
"hits": [
{
"_index": "so_suggest",
"_type": "_doc",
"_id": "1",
"_score": 0.18232156,
"_source": {
"title": "foo bar",
"userId": "opster" --> note another user
}
},
{
"_index": "so_suggest",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"_source": {
"title": "foo baz",
"userId": "katrin" -> note user
}
}
]
Now let's narrow the suggestions by filtering on the docs created by user katrin.
Search query
{
"query": {
"bool": {
"must": {
"match": {
"title": "foo"
}
},
"filter": { --> note filter on userId field
"term": {
"userId": "katrin"
}
}
}
}
}
Search result
"hits": [
{
"_index": "so_suggest",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"_source": {
"title": "foo baz",
"userId": "katrin"
}
}
]

Name searching in ElasticSearch

I have an index in Elasticsearch with a name field where I store the full name of a person: name and surname. I want to perform full-text search over that field, so I have indexed it using the analyzer.
My issue now is that if I search:
"John Rham Rham"
and the index contains "John Rham Rham Luck", that value gets a higher score than "John Rham Rham".
Is there any way to give a better score to the exact match than to a field with more values in the string?
Thanks in advance!
I worked out a small example (assuming you're running ES 5.x, because of the difference in scoring):
DELETE test
PUT test
{
"settings": {
"similarity": {
"my_bm25": {
"type": "BM25",
"b": 0
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "text",
"similarity": "my_bm25",
"fields": {
"length": {
"type": "token_count",
"analyzer": "standard"
}
}
}
}
}
}
}
POST test/test/1
{
"name": "John Rham Rham"
}
POST test/test/2
{
"name": "John Rham Rham Luck"
}
GET test/_search
{
"query": {
"function_score": {
"query": {
"match": {
"name": {
"query": "John Rham Rham",
"operator": "and"
}
}
},
"functions": [
{
"script_score": {
"script": "_score / doc['name.length'].getValue()"
}
}
]
}
}
}
This code does the following:
Replace the default BM25 implementation with a custom one, setting the b parameter (field length normalisation) to 0
-- You could also change the similarity to 'classic' to go back to TF/IDF, which normalises field length differently
Create an inner field for your name field, which counts the number of tokens inside your name field.
Update the score according to the number of tokens in the field
This will result in:
"hits": {
"total": 2,
"max_score": 0.3596026,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 0.3596026,
"_source": {
"name": "John Rham Rham"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.26970196,
"_source": {
"name": "John Rham Rham Luck"
}
}
]
}
}
Not sure if this is the best way of doing it, but it may point you in the right direction :)
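The two scores in the result above can be sanity-checked against the script. With b set to 0, both documents should receive the same base score for this query, so the only remaining difference should be the division by name.length. The sketch below compares the ratio of the reported scores with the ratio of the token counts; the whitespace split is an assumption that stands in for the standard analyzer and is only adequate for plain names like these.

```python
# Scores copied from the search result above.
reported = {
    "John Rham Rham": 0.3596026,
    "John Rham Rham Luck": 0.26970196,
}


def token_count(text):
    # Assumption: whitespace splitting approximates the standard analyzer here.
    return len(text.split())


short_name, long_name = "John Rham Rham", "John Rham Rham Luck"

score_ratio = reported[short_name] / reported[long_name]
count_ratio = token_count(long_name) / token_count(short_name)

print("score ratio:", score_ratio)
print("token ratio:", count_ratio)
```

Both ratios come out at about 4/3, which is consistent with the length division being the only thing that separates the two documents.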

Elasticsearch query prefer exact match over partial match on multiple fields

I am doing a free-text search on documents with multiple fields. When I perform a search, I want the documents that have a perfect match on any of the labels to get a higher score. Is there any way I can do this from the query?
For example the documents have two fields called label-a and label-b and when I perform the following multi-match query:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "apple",
"type": "most_fields",
"fields": [
"label-a",
"label-b"
]
}
}
]
}
}
}
I get the following results (only the relevant part):
"hits": [
{
"_index": "salad",
"_type": "fruit",
"_id": "4",
"_score": 0.581694,
"_source": {
"label-a": "apple pie and pizza",
"label-b": "pineapple with apple juice"
}
},
{
"_index": "salad",
"_type": "fruit",
"_id": "2",
"_score": 0.1519148,
"_source": {
"label-a": "grape",
"label-b": "apple"
}
},
{
"_index": "salad",
"_type": "fruit",
"_id": "1",
"_score": 0.038978107,
"_source": {
"label-a": "apple apple apple apple apple apple apple apple apple apple apple apple",
"label-b": "raspberry"
}
},
{
"_index": "salad",
"_type": "fruit",
"_id": "3",
"_score": 0.02250402,
"_source": {
"label-a": "apple pie and pizza",
"label-b": "raspberry"
}
}
]
I want the second document, the one with the value grape for label-a and value apple for label-b, to have the highest score, as I am searching for the value apple and one of the labels has that exact value. This should work regardless of which label the exact term appears in.
You are getting these results because Elasticsearch uses a TF/IDF model for scoring. Try additionally defining the fields "label-a" and "label-b" in your index as not-analyzed (raw) fields. Then rewrite your query to something like this:
{
"query": {
"bool": {
"should": {
"match": {
"label-a.raw": {
"query": "apple",
"boost": 2
}
}
},
"must": [
{
"multi_match": {
"query": "apple",
"type": "most_fields",
"fields": [
"label-a",
"label-b"
]
}
}
]
}
}
}
The should clause will boost documents with an exact match, and you will probably get them in first place. Try playing with the boost number, and please check the query before running it. This is just an idea of what you can do.
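The answer does not show the mapping for the raw fields, and its query only boosts label-a. As a rough sketch of how both could look, the code below builds a properties fragment with a raw sub-field per label and a query that boosts an exact match on either label. It assumes Elasticsearch 5.x or later, where a not-analyzed string is a keyword field; the helper names are hypothetical, and the boost of 2 is taken from the answer.

```python
import json

LABEL_FIELDS = ["label-a", "label-b"]


def raw_properties(fields):
    """Mapping properties: analyzed text plus a not-analyzed 'raw' sub-field."""
    return {
        field: {"type": "text", "fields": {"raw": {"type": "keyword"}}}
        for field in fields
    }


def exact_first_query(text, fields, boost=2):
    """Full-text multi_match, with a boost when any raw field equals `text`."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": text, "type": "most_fields", "fields": fields}}
                ],
                "should": [
                    {"match": {field + ".raw": {"query": text, "boost": boost}}}
                    for field in fields
                ],
            }
        }
    }


properties = raw_properties(LABEL_FIELDS)
body = exact_first_query("apple", LABEL_FIELDS)
print(json.dumps(properties, indent=2))
print(json.dumps(body, indent=2))
```

Because keyword fields are not analyzed, the raw match is case-sensitive and only fires when the whole field value is exactly the search text.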
