How to order search results by slop in Elasticsearch?

I have an index in ES:
curl -XGET 'http://127.0.0.1:9200/so/_settings?pretty=true'
{
"so" : {
"settings" : {
"index" : {
"number_of_shards" : "1",
"provided_name" : "so",
"creation_date" : "1594912442805",
"analysis" : {
"analyzer" : {
"my_simple_analyzer" : {
"type" : "simple",
"tokenizer" : "lowercase"
}
}
},
"number_of_replicas" : "1",
"uuid" : "8YVu4zU_Sdylr3KhOIwu9Q",
"version" : {
"created" : "7080099"
}
}
}
}
}
It has around 15M documents.
curl -XGET 'http://127.0.0.1:9200/so/_count?pretty=true'
{
"count" : 15426942,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
I want to run a full-text search in which exact phrase matches come first, followed by matches with a slop of 1, then a slop of 2, and so on.
I came up with the following query:
curl -XGET 'http://127.0.0.1:9200/so/_search?pretty=true' -H 'Content-Type: application/json' -d '{
"query": {
"bool": {
"should": [
{
"match_phrase": {
"posts": {
"query": "get the scanner on a specific family like this",
"_name": "exact_match"
}
}
},
{
"match": {
"posts": {
"query": "get the scanner on a specific family like this",
"_name": "partial_match"
}
}
}
]
}
}
}'
Is this the correct query? The partial_match results are not ordered by slop distance (1, then 2, and so on). How can I achieve that?
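One possible way to get strict slop tiers is sketched below in Python. It only builds the request body and does not call Elasticsearch. The idea is to wrap one match_phrase clause per slop value in constant_score and combine them with dis_max, so that a document is scored by the tightest tier it matches. The helper name tiered_phrase_query and the max_slop value are assumptions made for illustration; the field name and query text come from the question. This is not an answer confirmed in the thread, and documents within the same tier end up with equal scores.

```python
import json


def tiered_phrase_query(field, text, max_slop):
    """Build a dis_max query with one constant-score tier per slop value.

    Hypothetical helper: slop 0 gets the highest boost, max_slop the lowest.
    """
    tiers = []
    for slop in range(max_slop + 1):
        tiers.append({
            "constant_score": {
                "filter": {
                    "match_phrase": {field: {"query": text, "slop": slop}}
                },
                # Tighter matches get larger boosts: max_slop + 1 down to 1.
                "boost": max_slop + 1 - slop,
            }
        })
    # dis_max scores a document by its best-scoring matching clause.
    return {"query": {"dis_max": {"queries": tiers}}}


body = tiered_phrase_query(
    "posts", "get the scanner on a specific family like this", max_slop=3
)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the -d payload of the _search request shown in the question. A document that matches the exact phrase also matches every looser tier, and dis_max keeps only the highest boost, which is what puts exact matches first.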

Related

Search by exact match in all fields in Elasticsearch

Let's say I have 3 documents, each of them only contains one field (but let's imagine that there are more, and we need to search through all fields).
Field value is "first second"
Field value is "second first"
Field value is "first second third"
Here is a script that can be used to create these 3 documents:
# drop the index completely, use with care!
curl -iX DELETE 'http://localhost:9200/test'
curl -H 'content-type: application/json' -iX PUT 'http://localhost:9200/test/_doc/one' -d '{"name":"first second"}'
curl -H 'content-type: application/json' -iX PUT 'http://localhost:9200/test/_doc/two' -d '{"name":"second first"}'
curl -H 'content-type: application/json' -iX PUT 'http://localhost:9200/test/_doc/three' -d '{"name":"first second third"}'
I need to find the only document (document 1) that has exactly "first second" text in one of its fields.
Here is what I tried.
A. Plain search:
curl -H 'Content-Type: application/json' -iX POST 'http://localhost:9200/test/_search' -d '{
"query": {
"query_string": {
"query": "first second"
}
}
}'
returns all 3 documents
B. Quoting
curl -H 'Content-Type: application/json' -iX POST 'http://localhost:9200/test/_search' -d '{
"query": {
"query_string": {
"query": "\"first second\""
}
}
}'
gives 2 documents: 1 and 3, because both contain 'first second'.
The answer at https://stackoverflow.com/a/28024714/7637120 suggests using the 'keyword' analyzer on the fields at index time, but I would like to avoid any customizations to the mapping.
Is it possible to avoid them and still only find document 1?
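The difference between attempts A and B and the desired result can be sketched in plain Python, without Elasticsearch. This is a simplified model that assumes lowercasing and whitespace tokenization; the function names are made up for illustration and are not part of any Elasticsearch API.

```python
DOCS = {
    "one": "first second",
    "two": "second first",
    "three": "first second third",
}


def tokens(text):
    # Simplified stand-in for an analyzer: lowercase and split on whitespace.
    return text.lower().split()


def any_term_match(query, value):
    # Roughly what attempt A does: any single query term may match.
    return any(term in tokens(value) for term in tokens(query))


def phrase_match(query, value):
    # Roughly what attempt B does: terms must be adjacent and in order.
    q, v = tokens(query), tokens(value)
    return any(v[i:i + len(q)] == q for i in range(len(v) - len(q) + 1))


def whole_value_match(query, value):
    # The desired behaviour: the entire field value equals the query.
    return query == value


for name, matcher in [("any term", any_term_match),
                      ("phrase", phrase_match),
                      ("whole value", whole_value_match)]:
    hits = [doc_id for doc_id, value in DOCS.items()
            if matcher("first second", value)]
    print(name, hits)
```

Only the whole-value comparison singles out document 1, which is why the suggestions for this question revolve around treating the field value as a single term.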
Yes, you can do that by mapping the name field as type keyword. That is the whole fix.
To demonstrate, I did the following:
1) created a mapping with type `keyword` for the `name` field
2) indexed the three documents
3) searched with a `match` query
mappings
PUT so_test16
{
"mappings": {
"_doc":{
"properties":{
"name": {
"type": "keyword"
}
}
}
}
}
Indexing the documents
POST /so_test16/_doc
{
"id": 1,
"name": "first second"
}
POST /so_test16/_doc
{
"id": 2,
"name": "second first"
}
POST /so_test16/_doc
{
"id": 3,
"name": "first second third"
}
The query
GET /so_test16/_search
{
"query": {
"match": {"name": "first second"}
}
}
and the result
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "so_test16",
"_type" : "_doc",
"_id" : "m1KXx2sB4TH56W1hdTF9",
"_score" : 0.2876821,
"_source" : {
"id" : 1,
"name" : "first second"
}
}
]
}
}
Adding a second solution
(for when name is a text field rather than a keyword field; the only extra requirement is that fielddata: true must be set on the name field)
Mappings
PUT so_test18
{
"mappings" : {
"_doc" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"fielddata": true
}
}
}
}
}
and the search query
GET /so_test18/_search
{
"query": {
"bool": {
"must": [
{"match_phrase": {"name": "first second"}}
],
"filter": {
"script": {
"script": {
"lang": "painless",
"source": "doc['name'].values.length == 2"
}
}
}
}
}
}
and the response
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.3971361,
"hits" : [
{
"_index" : "so_test18",
"_type" : "_doc",
"_id" : "o1JryGsB4TH56W1hhzGT",
"_score" : 0.3971361,
"_source" : {
"id" : 1,
"name" : "first second"
}
}
]
}
}
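A note on why the script filter in this second solution works, and one caveat. With fielddata enabled on a text field, doc['name'].values exposes the field's indexed terms, which as far as I understand are the distinct terms rather than every occurrence. The sketch below models that in plain Python, assuming whitespace tokenization; the function names are illustrative only.

```python
def tokens(text):
    # Simplified stand-in for an analyzer: lowercase and split on whitespace.
    return text.lower().split()


def phrase_match(query, value):
    # Terms must be adjacent and in order, like match_phrase with slop 0.
    q, v = tokens(query), tokens(value)
    return any(v[i:i + len(q)] == q for i in range(len(v) - len(q) + 1))


def distinct_term_count(value):
    # Models doc['name'].values.length, assuming values holds distinct terms.
    return len(set(tokens(value)))


def passes(query, value, expected_terms=2):
    # Combination of the must clause and the script filter.
    return phrase_match(query, value) and distinct_term_count(value) == expected_terms


for value in ["first second", "second first",
              "first second third", "first second first"]:
    print(repr(value), passes("first second", value))
```

Under that assumption, a value such as "first second first" would also pass the filter, because it still contains only two distinct terms. The three documents in the question are not affected by this.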
In Elasticsearch 7.1.0, it seems that you can use the keyword analyzer at query time without creating a special mapping. At least I didn't create one, and the following query does what I need:
curl -H 'Content-Type: application/json' -iX POST 'http://localhost:9200/test/_search' -d '{
"query": {
"query_string": {
"query": "first second",
"analyzer": "keyword"
}
}
}'

access query value from function_score to compute new score

I need to customize the ES score. The score function I need to implement is:
score = len(document_term) - len(query_term)
For instance, one of the documents in my ES index is:
{
"name": "foobar"
}
And the search query
{
"query": {
"function_score": {
"query": {
"match": {
"name": {
"query": "foo"
}
}
},
"functions": [
{
"script_score": {
"script": {
"source": "doc['name'].value.length() - ?LEN(query_tem)?"
}
}
}
],
"boost_mode": "replace"
}
}
}
The above search should give a score of 6 - 3 = 3, but I couldn't find a way to access the value of the query term.
Is it possible to access the value of the query term in a function_score context?
There is no direct way to do this. However, you can achieve it as shown below, by supplying the query string in two different parts of the query.
One important note first: you cannot use doc['myfield'].value if the field is of type text. Instead, you need a keyword sub-field and must refer to that in the script, as in the mapping below:
Mapping:
PUT myindex
{
"mappings" : {
"properties" : {
"myfield" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
Sample Document:
POST myindex/_doc/1
{
"myfield": "I've become comfortably numb"
}
Query (the query string appears twice: once in the match clause and once in params.myquery):
POST <your_index_name>/_search
{
"query": {
"function_score": {
"query": {
"match": {
"myfield": "numb"
}
},
"functions": [
{
"script_score": {
"script": {
"source": "return doc['myfield.keyword'].value.length() - params.myquery.length()",
"params": {
"myquery": "numb"
}
}
}
}
],
"boost_mode": "replace"
}
}
}
Response:
{
"took" : 558,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 24.0,
"hits" : [
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "1",
"_score" : 24.0,
"_source" : {
"myfield" : "I've become comfortably numb"
}
}
]
}
}
Hope this helps!
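Because the query string has to appear in two places, it is easy for the two copies to drift apart. A minimal Python sketch, assuming the request body is built client-side, keeps them in sync by generating both from one argument. The helper names length_diff_query and expected_score are hypothetical; the field name and the script source are taken from the answer above.

```python
import json


def length_diff_query(field, query_text):
    """Build the function_score body, reusing one query string in both places."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {field: query_text}},
                "functions": [{
                    "script_score": {
                        "script": {
                            "source": (
                                f"return doc['{field}.keyword'].value.length()"
                                " - params.myquery.length()"
                            ),
                            # Same string as in the match clause above.
                            "params": {"myquery": query_text},
                        }
                    }
                }],
                "boost_mode": "replace",
            }
        }
    }


def expected_score(field_value, query_text):
    # Same arithmetic as the script, done client-side as a sanity check.
    return len(field_value) - len(query_text)


body = length_diff_query("myfield", "numb")
print(json.dumps(body, indent=2))
print(expected_score("I've become comfortably numb", "numb"))
```

For the sample document the client-side check gives 28 - 4 = 24, which agrees with the _score of 24.0 in the response above.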

How to see which of the queries in boolean is matched?

I am combining multiple queries in a bool query. Some of them may match documents in the index and some may not. How can I know which of the queries matched?
For example, here I have a bool query with two should conditions against the field landMark.
{
"query": {
"bool": {
"should": [
{
"match": {
"landMark": "wendys"
}
},
{
"match": {
"landMark": "starbucks"
}
}
]
}
}
}
How can I know which one of them matched in the above query if only one of them matches the documents?
You can use named queries for this purpose. Try this
{
"query": {
"bool": {
"should": [
{
"match": {
"landMark": {
"query": "wendys",
"_name": "wendy match"
}
}
},
{
"match": {
"landMark": {
"query": "starbucks",
"_name": "starbucks match"
}
}
}
]
}
}
}
You can use any value for _name. In the response you will get something like this:
"matched_queries": ["wendy match"]
so you will be able to tell which query matched that specific document.
Named queries are certainly the way to go.
LINK - https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-named-queries-and-filters.html
The idea of named queries is simple: you tag each of your queries with a name, and the result shows which tags matched for each document.
curl -XPOST 'http://localhost:9200/data/data' -d ' { "landMark" : "wendys near starbucks" }'
curl -XPOST 'http://localhost:9200/data/data' -d ' { "landMark" : "wendys" }'
curl -XPOST 'http://localhost:9200/data/data' -d ' { "landMark" : "starbucks" }'
So create your query in this fashion:
curl -XPOST 'http://localhost:9200/data/_search?pretty' -d '{
"query": {
"bool": {
"should": [
{
"match": {
"landMark": {
"query": "wendys",
"_name": "wendy_is_a_match"
}
}
},
{
"match": {
"landMark": {
"query": "starbucks",
"_name": "starbuck_is_a_match"
}
}
}
]
}
}
}'
{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 0.581694,
"hits" : [ {
"_index" : "data",
"_type" : "data",
"_id" : "AVMCNNCY3OZJfBZCJ_tO",
"_score" : 0.581694,
"_source": { "landMark" : "wendys near starbucks" },
"matched_queries" : [ "starbuck_is_a_match", "wendy_is_a_match" ] ---> "Matched tags
}, {
"_index" : "data",
"_type" : "data",
"_id" : "AVMCNS0z3OZJfBZCJ_tQ",
"_score" : 0.1519148,
"_source": { "landMark" : "starbucks" },
"matched_queries" : [ "starbuck_is_a_match" ]
}, {
"_index" : "data",
"_type" : "data",
"_id" : "AVMCNRsF3OZJfBZCJ_tP",
"_score" : 0.04500804,
"_source": { "landMark" : "wendys" },
"matched_queries" : [ "wendy_is_a_match" ]
} ]
}
}
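On the client side, the matched_queries arrays can be turned into a lookup from query name to document ids. The following Python sketch works on a trimmed copy of the response above (only _id and matched_queries are kept); the function name ids_by_matched_query is an assumption made for illustration.

```python
from collections import defaultdict

# Trimmed copy of the response above, keeping only the fields used here.
response = {
    "hits": {
        "hits": [
            {"_id": "AVMCNNCY3OZJfBZCJ_tO",
             "matched_queries": ["starbuck_is_a_match", "wendy_is_a_match"]},
            {"_id": "AVMCNS0z3OZJfBZCJ_tQ",
             "matched_queries": ["starbuck_is_a_match"]},
            {"_id": "AVMCNRsF3OZJfBZCJ_tP",
             "matched_queries": ["wendy_is_a_match"]},
        ]
    }
}


def ids_by_matched_query(resp):
    """Group document ids by the name of each query clause that matched."""
    grouped = defaultdict(list)
    for hit in resp["hits"]["hits"]:
        # Use .get in case a hit carries no matched_queries key.
        for name in hit.get("matched_queries", []):
            grouped[name].append(hit["_id"])
    return dict(grouped)


print(ids_by_matched_query(response))
```

Each name maps to the ids of the documents it matched, so a name that is missing from the result matched nothing among the returned hits.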

ElasticSearch - searching different doc_types with the same field name but different analyzers

Let's say I make a simple ElasticSearch index:
curl -XPUT 'http://localhost:9200/test/' -d '{
"settings": {
"analysis": {
"char_filter": {
"de_acronym": {
"type": "mapping",
"mappings": [".=>"]
}
},
"analyzer": {
"analyzer1": {
"type": "custom",
"tokenizer": "keyword",
"char_filter": ["de_acronym"]
}
}
}
}
}'
And I make two doc_types that have the same property name but they are analyzed slightly differently from one another:
curl -XPUT 'http://localhost:9200/test/_mapping/docA' -d '{
"docA": {
"properties": {
"name": {
"type": "string",
"analyzer": "simple"
}
}
}
}'
curl -XPUT 'http://localhost:9200/test/_mapping/docB' -d '{
"docB": {
"properties": {
"name": {
"type": "string",
"analyzer": "analyzer1"
}
}
}
}'
Next, let's say I put a document in each doc_type with the same name:
curl -XPUT 'http://localhost:9200/test/docA/1' -d '{ "name" : "U.S. Army" }'
curl -XPUT 'http://localhost:9200/test/docB/1' -d '{ "name" : "U.S. Army" }'
Let's try to search for "U.S. Army" in both doc types at the same time:
curl -XGET 'http://localhost:9200/test/_search?pretty' -d '{
"query": {
"match_phrase": {
"name": {
"query": "U.S. Army"
}
}
}
}'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.5,
"hits" : [ {
"_index" : "test",
"_type" : "docA",
"_id" : "1",
"_score" : 1.5,
"_source":{ "name" : "U.S. Army" }
} ]
}
}
I only get one result! I get the other result when I specify docB's analyzer:
curl -XGET 'http://localhost:9200/test/_search?pretty' -d '
{
"query": {
"match_phrase": {
"name": {
"query": "U.S. Army",
"analyzer": "analyzer1"
}
}
}
}'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "test",
"_type" : "docB",
"_id" : "1",
"_score" : 1.0,
"_source":{ "name" : "U.S. Army" }
} ]
}
}
I was under the impression that ES would search each doc_type with the appropriate analyzer. Is there a way to do this?
The ElasticSearch docs say that precedence for search analyzer goes:
1) The analyzer defined in the query itself, else
2) The analyzer defined in the field mapping, else
...
In this case, is ElasticSearch arbitrarily choosing which field mapping to use?
Take a look at this issue on GitHub, which seems to have started from this post in the ES Google Groups. I believe it answers your question:
if its in a filtered query, we can't infer it, so we simply pick one of those and use its analysis settings
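Why the choice of analyzer decides which document is found can be modelled in a few lines of Python. This is a rough imitation of the two analyzers from the question, not the real Elasticsearch analysis chain: simple_analyzer splits on non-letters and lowercases, and analyzer1 removes "." and keeps the whole string as one token. The comparison by token equality is a simplification that is enough for this example.

```python
import re


def simple_analyzer(text):
    # Models the built-in simple analyzer: split on non-letters, lowercase.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def analyzer1(text):
    # Models the custom analyzer: the de_acronym char filter removes ".",
    # then the keyword tokenizer emits the whole string as a single token.
    return [text.replace(".", "")]


text = "U.S. Army"
# Tokens stored at index time for each doc type.
indexed = {"docA": simple_analyzer(text), "docB": analyzer1(text)}

for name, analyze in [("simple", simple_analyzer), ("analyzer1", analyzer1)]:
    query_tokens = analyze(text)
    # Simplification: the phrase matches only if the token lists are equal.
    matches = [doc for doc, toks in indexed.items() if toks == query_tokens]
    print(name, query_tokens, matches)
```

Whichever analyzer is applied to the query produces tokens that line up with only one of the two doc types, which mirrors the two search results shown in the question.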

Boolean query does not return expected data in Elasticsearch

I have the following document in Elasticsearch as reported by Kibana:
{"deviceId":"C1976429369BFE063ED8B3409DB7C7E7D87196D9","appId":"DisneyDigitalBooks.PlanesAdventureAlbum","ostype":"iOS"}
Why does the following query return no hits?
[root@myvm elasticsearch-1.0.0]# curl -XGET 'http://localhost:9200/unique_app_install/_search?pretty=1' -d '
{
"query" : {
"bool" : {
"must" : [ {
"term" : {
"deviceId" : "C1976429369BFE063ED8B3409DB7C7E7D87196D9"
}
}, {
"term" : {
"appId" : "DisneyDigitalBooks.PlanesAdventureAlbum"
}
}, {
"term" : {
"ostype" : "iOS"
}
} ]
}
}
}'
Here is the response from Elasticsearch:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
As a side question, is this the fastest way to query the data in my case?
Thx in advance.
UPDATE:
Could it be related to the fact that I used the following mapping for this index?
curl -XPOST localhost:9200/unique_app_install -d '{
"settings" : {
"number_of_shards" : 5
},
"mappings" : {
"sdk_sync" : {
"properties" : {
"deviceId" : { "type" : "string" , "index": "not_analyzed"},
"appId" : { "type" : "string" , "index": "not_analyzed"},
"ostype" : { "type" : "string" , "index": "not_analyzed"}
}
}
}
}'
Check that the document was indexed with the right type: sdk_sync.
I used your mapping and document, and it works for me. The following curl requests give the right response:
curl -XPOST localhost:9200/unique_app_install -d '{
"settings" : {
"number_of_shards" : 5
},
"mappings" : {
"sdk_sync" : {
"properties" : {
"deviceId" : { "type" : "string" , "index": "not_analyzed"},
"appId" : { "type" : "string" , "index": "not_analyzed"},
"ostype" : { "type" : "string" , "index": "not_analyzed"}
}
}
}
}'
curl -XPOST localhost:9200/unique_app_install/sdk_sync/1 -d '{
"deviceId":"C1976429369BFE063ED8B3409DB7C7E7D87196D9",
"appId":"DisneyDigitalBooks.PlanesAdventureAlbum",
"ostype":"iOS"
}'
curl -XGET 'http://localhost:9200/unique_app_install/_search?pretty=1' -d '
{
"query" : {
"bool" : {
"must" : [ {
"term" : {
"deviceId" : "C1976429369BFE063ED8B3409DB7C7E7D87196D9"
}
}, {
"term" : {
"appId" : "DisneyDigitalBooks.PlanesAdventureAlbum"
}
}, {
"term" : {
"ostype" : "iOS"
}
} ]
}
}
}'
Unless you specify that a field should NOT be analyzed, every string field is analyzed by default.
This means that the deviceId "C1976429369BFE063ED8B3409DB7C7E7D87196D9" will be indexed as "c1976429369bfe063ed8b3409db7c7e7d87196d9" (lower case).
In that case you have to use the term query or term filter with the string in LOWER CASE.
That is why you should specify {"index": "not_analyzed"} in the mapping.
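The interaction between analysis and term queries can be illustrated with a small Python model. It is a deliberately simplified sketch: an analyzed string is only lowercased here, whereas a real analyzer may also split the value into several tokens. The function names index_terms and term_query are hypothetical.

```python
def index_terms(value, analyzed):
    # Simplified model: an analyzed string is lowercased; a not_analyzed
    # string is stored as one unchanged term.
    return {value.lower()} if analyzed else {value}


def term_query(indexed_terms, term):
    # A term query does not analyze its input; it looks up the exact term.
    return term in indexed_terms


device_id = "C1976429369BFE063ED8B3409DB7C7E7D87196D9"

analyzed = index_terms(device_id, analyzed=True)
not_analyzed = index_terms(device_id, analyzed=False)

print(term_query(analyzed, device_id))          # analyzed field, original case
print(term_query(analyzed, device_id.lower()))  # analyzed field, lower case
print(term_query(not_analyzed, device_id))      # not_analyzed field
```

In this model only the first lookup fails: the original-case term is not found in the analyzed field, while the lowercased term and the not_analyzed field both match.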
