Elasticsearch query_string: return results based on a custom order

The search query below returns results in the order listed after it when the search contains more than one keyword.
{
"query": {
"query_string" : {
"query" : "(Sony Music) OR (Sony Music*) OR (*Sony Music) OR (*Sony Music*)",
"fields" : ["MDMGlobalData.Name1"]
}
}
}
Exact Matches first.
Then, show those that start with search term.
Then, show those that end with search term.
Then, show the remainder.
But if the query contains just one word, say sony, the order is messed up.
Could someone please explain why this happens, and what is the best approach to get the results ordered as above using a query_string search?
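For reference, the four tiers described above can be written down as a plain sort key. This is only a client-side illustration of the desired order in Python, not something Elasticsearch does; the function name and the sample name "The Sony Music" are made up for the example.

```python
def tier(name: str, term: str) -> int:
    """Rank a name against a search term: lower numbers sort first."""
    n, t = name.lower(), term.lower()
    if n == t:
        return 0  # exact match
    if n.startswith(t):
        return 1  # starts with the search term
    if n.endswith(t):
        return 2  # ends with the search term
    if t in n:
        return 3  # contains the search term somewhere else
    return 4      # no match at all


names = ["All Sony Music Corp", "Sony Music Corp", "The Sony Music", "Sony Music"]
# sorted() is stable, so names within the same tier keep their input order.
ranked = sorted(names, key=lambda n: tier(n, "Sony Music"))
print(ranked)
```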

When you query only sony, it should have the lowest score. Is that not what you expect? By default, the query string does seem to take the order of the OR clauses into consideration, so I'd say yours is already pretty well optimized.
Have you tried tinkering with the default_operator option?
Also, what do you mean by sony "being in the query data"? The query string itself, or a document whose MDMGlobalData.Name1 field is sony?

"But if its just one word, say sony in query data. The order is messed up."
Based on this statement and the comment you made on the answer above, here is a working example with sample documents and a search query.
Index Sample Data:
{
"MDMGlobalData":{
"name":"Sony Music"
}
}
{
"MDMGlobalData":{
"name":"Sony Music Corp"
}
}
{
"MDMGlobalData":{
"name":"All Sony Music Corp"
}
}
{
"MDMGlobalData":{
"name":"Sony"
}
}
Search Query:
{
"query": {
"query_string": {
"query": "(Sony) OR (Sony*) OR (*Sony) OR (*Sony*)",
"fields": [
"MDMGlobalData.name"
]
}
}
}
Search Result:
"hits": [
{
"_index": "foo1",
"_type": "_doc",
"_id": "4",
"_score": 3.1396344,
"_source": {
"MDMGlobalData": {
"name": "Sony"
}
}
},
{
"_index": "foo1",
"_type": "_doc",
"_id": "1",
"_score": 3.114749,
"_source": {
"MDMGlobalData": {
"name": "Sony Music"
}
}
},
{
"_index": "foo1",
"_type": "_doc",
"_id": "2",
"_score": 3.097392,
"_source": {
"MDMGlobalData": {
"name": "Sony Music Corp"
}
}
},
{
"_index": "foo1",
"_type": "_doc",
"_id": "3",
"_score": 3.084596,
"_source": {
"MDMGlobalData": {
"name": "All Sony Music Corp"
}
}
}
]
As you can see, the order is still maintained: Sony has the highest score (as it should, given the query), and the remaining documents are then scored on the basis of the order of the OR clauses.
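A possible reading of these scores, offered as an assumption rather than something the answer confirms: each of the three wildcard clauses contributes a constant 1.0, and the remainder comes from the plain term clause scored with BM25, which favors shorter fields. The sketch below assumes a single shard, the default BM25 parameters (k1 = 1.2, b = 0.75), field length counted in whitespace-separated tokens, and the Lucene 8 form of the formula. The function name is made up for illustration.

```python
import math

K1, B = 1.2, 0.75  # assumed BM25 defaults


def bm25_term_score(doc_len, avg_len, n_docs, doc_freq, tf=1):
    """BM25 score of one term in one field (assumed Lucene 8 style formula)."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = K1 * (1 - B + B * doc_len / avg_len)
    return (K1 + 1) * idf * tf / (tf + length_norm)


names = ["Sony", "Sony Music", "Sony Music Corp", "All Sony Music Corp"]
lengths = [len(n.split()) for n in names]  # tokens per field: 1, 2, 3, 4
avg_len = sum(lengths) / len(lengths)

# Every document contains the token "sony", so doc_freq equals n_docs.
# The 3.0 stands for the three wildcard clauses, assumed to score 1.0 each.
scores = {
    name: 3.0 + bm25_term_score(dl, avg_len, n_docs=4, doc_freq=4)
    for name, dl in zip(names, lengths)
}
for name, score in scores.items():
    print(f"{name}: {score:.6f}")
```

Under those assumptions the computed values agree with the four _score values in the result to about five decimal places, which suggests that the ordering here follows field length rather than the position of the OR clauses.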

Related

Difference between match vs wild card query

What is the difference between the match and wildcard queries? If the requirement is to search for a combination of words in a paragraph or log, which approach is better?
A match query analyzes the search text and returns the documents whose field contains the resulting term as a whole token (case is ignored with the default analyzer), whereas a wildcard query returns the documents that have a term matching the pattern, so the search text may be only part of a term.
Adding a working example
Index Data:
{
"name":"breadsticks with soup"
}
{
"name":"multi grain bread"
}
Search Query using Match query:
{
"query": {
"match": {
"name": "bread"
}
}
}
Search Result will be
"hits": [
{
"_index": "67706115",
"_type": "_doc",
"_id": "1",
"_score": 0.9808291,
"_source": {
"name": "multi grain bread"
}
}
]
Search Query using wildcard query:
{
"query": {
"wildcard": {
"name": "*bread*"
}
}
}
Search Result will be
"hits": [
{
"_index": "67706115",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"name": "multi grain bread"
}
},
{
"_index": "67706115",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"name": "breadsticks with soup"
}
}
]
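The difference between the two results can be imitated with a toy model in plain Python. It assumes the field is analyzed by lowercasing and splitting on whitespace, which is only a rough stand-in for the standard analyzer, and the helper names are made up.

```python
from fnmatch import fnmatchcase


def analyze(text):
    """Very rough stand-in for the standard analyzer: lowercase, split on spaces."""
    return text.lower().split()


def match(docs, query_text):
    """Keep documents that share at least one whole token with the query."""
    wanted = set(analyze(query_text))
    return [d for d in docs if wanted & set(analyze(d))]


def wildcard(docs, pattern):
    """Keep documents in which any single token matches the pattern."""
    return [d for d in docs if any(fnmatchcase(tok, pattern) for tok in analyze(d))]


docs = ["multi grain bread", "breadsticks with soup"]
print(match(docs, "bread"))       # only the document with the whole token "bread"
print(wildcard(docs, "*bread*"))  # both: "breadsticks" also matches the pattern
```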

Scoring higher for shorter fields

I'm trying to get a higher score (or at least the same score) for the shortest values in Elasticsearch.
Let's say I have these documents: "Abc", "Abca", "Abcb", "Abcc". The field label.ngram uses an edge n-gram analyzer.
With a really simple query like this:
{
"query": {
"match": {
"label.ngram": {
"query": "Ab"
}
}
}
}
I always get the documents "Abca", "Abcb", "Abcc" first, instead of "Abc".
How can I get "Abc" first?
(should I use this: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html?)
Thanks!
This happens because of field-length normalization. To get the same score for all of these documents, you have to disable norms on the field. From the documentation:
"Norms store various normalization factors that are later used at query time in order to compute the score of a document relatively to a query."
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"norms": false,
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"title": "Abca"
}
{
"title": "Abcb"
}
{
"title": "Abcc"
}
{
"title": "Abc"
}
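To see why field length differs between these titles, here is a rough Python imitation of the edge_ngram tokenizer configured above (min_gram 2, max_gram 10, letters and digits only). It is a simplification that handles ASCII input only, and the function name is made up.

```python
import re


def edge_ngrams(text, min_gram=2, max_gram=10):
    """Emit the leading n-grams of every run of ASCII letters and digits."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        longest = min(max_gram, len(word))
        for n in range(min_gram, longest + 1):
            tokens.append(word[:n])
    return tokens


for title in ["Abca", "Abcb", "Abcc", "Abc"]:
    print(title, edge_ngrams(title))
```

Abc yields two tokens while Abca yields three, so the indexed field lengths differ even though every title contains the query token Ab exactly once.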
Search Query:
{
"query": {
"match": {
"title": {
"query": "Ab"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65953349",
"_type": "_doc",
"_id": "1",
"_score": 0.1424427,
"_source": {
"title": "Abca"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "2",
"_score": 0.1424427,
"_source": {
"title": "Abcb"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "3",
"_score": 0.1424427,
"_source": {
"title": "Abcc"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "4",
"_score": 0.1424427,
"_source": {
"title": "Abc"
}
}
]
As @ESCoder mentioned, disabling norms fixes this particular ordering, but it is not very useful if you want meaningful scores: it causes all the documents in your search results to have the same score, which hurts the relevance of your results considerably.
Maybe you should tweak the document-length normalization parameter of the default similarity algorithm (BM25) if you are on ES 5.x or higher. I tried this with your dataset and my settings but could not get it to work.
A second option, which will mostly work and which you suggested yourself, is to store the size of the field in a separate field. You should populate it from your application, because the analysis process generates several tokens for the same field value. This is extra overhead, though, and I would prefer tweaking the similarity parameter.

Elastic search negate phrase and words in simple query string

I'm trying to negate some words and phrases in an Elasticsearch request using the simple query string.
This is what I do:
&q=-"the witcher 3"-game-novel
So basically, trying to negate a phrase AND the words after it. But that doesn't seem to work.
If I try to negate the words alone it works.
How can I negate phrases and sentences in a simple query string?
Adding a working example with index data, search query, and search result.
Index Data:
{
"name":"test"
}
{
"name":"game"
}
{
"name":"the witcher"
}
{
"name":"the witcher 3"
}
{
"name":"the"
}
Search Query:
{
"query": {
"simple_query_string" : {
"query": "-(game | novel) -(the witcher 3)",
"fields": ["name"],
"default_operator": "and"
}
}
}
Search Result:
"hits": [
{
"_index": "stof_64133051",
"_type": "_doc",
"_id": "4",
"_score": 2.0,
"_source": {
"name": "the"
}
},
{
"_index": "stof_64133051",
"_type": "_doc",
"_id": "3",
"_score": 2.0,
"_source": {
"name": "the witcher"
}
},
{
"_index": "stof_64133051",
"_type": "_doc",
"_id": "1",
"_score": 2.0,
"_source": {
"name": "test"
}
}
]
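If the excluded words and phrases come from user input, the query string used above can be assembled programmatically. This is a minimal sketch that follows the same syntax as the example (one negated group for the words, one negated group per phrase, separated by spaces); the helper name is hypothetical and input escaping is left out.

```python
def negate_query(words, phrases):
    """Build a simple_query_string that excludes the given words and phrases."""
    parts = []
    if words:
        # One negated group: exclude documents containing any of the words.
        parts.append("-(" + " | ".join(words) + ")")
    # One negated group per phrase; with default_operator "and",
    # all terms of the phrase must be present for a document to be excluded.
    parts.extend("-(" + phrase + ")" for phrase in phrases)
    return " ".join(parts)


body = {
    "query": {
        "simple_query_string": {
            "query": negate_query(["game", "novel"], ["the witcher 3"]),
            "fields": ["name"],
            "default_operator": "and",
        }
    }
}
print(body["query"]["simple_query_string"]["query"])
```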

ElasticSearch : How can I boost score depending on field value?

I am trying to get rid of sorting in Elasticsearch by boosting the _score based on a field value. Here is my scenario:
I have a field in my document: applicationDate. This is the time elapsed since the epoch. I want records with a greater applicationDate (more recent) to have a higher score.
If the scores of two documents are the same, I want to order them on another field of type String. Say "status" is another field that can have the values Available, In Progress, or Closed. So documents with the same applicationDate should have a _score based on status.
Available should score highest, In Progress lower, and Closed lowest. That way, I won't have to sort the documents after getting the results.
Please give me some pointers.
You should be able to achieve this using Function Score.
Depending on your requirements, it could be as simple as the following.
Example:
put test/test/1
{
"applicationDate" : "2015-12-02",
"status" : "available"
}
put test/test/2
{
"applicationDate" : "2015-12-02",
"status" : "progress"
}
put test/test/3
{
"applicationDate" : "2016-03-02",
"status" : "progress"
}
post test/_search
{
"query": {
"function_score": {
"functions": [
{
"field_value_factor" : {
"field" : "applicationDate",
"factor" : 0.001
}
},
{
"filter": {
"term": {
"status": "available"
}
},
"weight": 360
},
{
"filter": {
"term": {
"status": "progress"
}
},
"weight": 180
}
],
"boost_mode": "multiply",
"score_mode": "sum"
}
}
}
Results:
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "3",
"_score": 1456877060,
"_source": {
"applicationDate": "2016-03-02",
"status": "progress"
}
},
{
"_index": "test",
"_type": "test",
"_id": "1",
"_score": 1449014780,
"_source": {
"applicationDate": "2015-12-02",
"status": "available"
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 1449014660,
"_source": {
"applicationDate": "2015-12-02",
"status": "progress"
}
}
]
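The arithmetic behind these scores can be sketched in Python. The sketch assumes the date is read as epoch milliseconds in UTC, that the query part scores 1.0 because no query is given, and that score_mode sum adds the date factor to the weight of whichever status filter matches. The helper names are made up.

```python
from datetime import datetime, timezone

STATUS_WEIGHT = {"available": 360, "progress": 180}  # weights from the query above


def epoch_millis(date_str):
    """Parse a yyyy-MM-dd date as midnight UTC and return epoch milliseconds."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def combined_score(doc, query_score=1.0, factor=0.001):
    """score_mode=sum over the functions, then boost_mode=multiply with the query score."""
    date_part = epoch_millis(doc["applicationDate"]) * factor
    weight_part = STATUS_WEIGHT.get(doc["status"], 0)
    return query_score * (date_part + weight_part)


docs = {
    "1": {"applicationDate": "2015-12-02", "status": "available"},
    "2": {"applicationDate": "2015-12-02", "status": "progress"},
    "3": {"applicationDate": "2016-03-02", "status": "progress"},
}
order = sorted(docs, key=lambda doc_id: combined_score(docs[doc_id]), reverse=True)
print(order)
```

The order matches the result above. The printed _score values differ slightly from this exact arithmetic because Elasticsearch represents scores as 32-bit floats, which cannot distinguish small differences at this magnitude.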
Have you looked at function scores?
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
Specifically look at decay functions in the above documentation.
There is also a newer field type, rank_feature, that can be useful for this use case:
https://www.elastic.co/guide/en/elasticsearch/reference/current/rank-feature.html

Is it possible to perform user count / cardinality with logical relationship in ElasticSearch?

I have documents of Users with the following format:
{
userId: "<userId>",
userAttributes: [
"<Attribute1>",
"<Attribute2>",
...
"<AttributeN>"
]
}
I want to get the number of unique users that satisfy a logical expression, for example: how many users have attribute1 AND attribute2 OR attribute3?
I've read about the cardinality aggregation, but it seems to work on a single value and lacks the logic of "AND" and "OR".
Note that I have around 1,000,000,000 documents and I need the results as fast as possible, which is why I was looking at cardinality estimation.
What about this attempt? It treats userAttributes as a simple array of strings (analyzed in my case, but single lowercase terms):
POST /users/user/_bulk
{"index":{"_id":1}}
{"userId":123,"userAttributes":["xxx","yyy","zzz"]}
{"index":{"_id":2}}
{"userId":234,"userAttributes":["xxx","yyy","aaa"]}
{"index":{"_id":3}}
{"userId":345,"userAttributes":["xxx","yyy","bbb"]}
{"index":{"_id":4}}
{"userId":456,"userAttributes":["xxx","ccc","zzz"]}
{"index":{"_id":5}}
{"userId":567,"userAttributes":["xxx","ddd","ooo"]}
GET /users/user/_search
{
"query": {
"query_string": {
"query": "userAttributes:(((xxx AND yyy) NOT zzz) OR ooo)"
}
},
"aggs": {
"unique_ids": {
"cardinality": {
"field": "userId"
}
}
}
}
which gives the following:
"hits": [
{
"_index": "users",
"_type": "user",
"_id": "2",
"_score": 0.16471066,
"_source": {
"userAttributes": [
"xxx",
"yyy",
"aaa"
]
}
},
{
"_index": "users",
"_type": "user",
"_id": "3",
"_score": 0.04318809,
"_source": {
"userAttributes": [
"xxx",
"yyy",
"bbb"
]
}
},
{
"_index": "users",
"_type": "user",
"_id": "5",
"_score": 0.021594046,
"_source": {
"userAttributes": [
"xxx",
"ddd",
"ooo"
]
}
}
]
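As a sanity check on the sample data, the same logical expression can be evaluated in plain Python and the unique user IDs counted exactly. This only mirrors the five bulk-indexed documents above; the function name is made up, and the real cardinality aggregation returns an approximate count on large data sets.

```python
users = [
    {"userId": 123, "userAttributes": {"xxx", "yyy", "zzz"}},
    {"userId": 234, "userAttributes": {"xxx", "yyy", "aaa"}},
    {"userId": 345, "userAttributes": {"xxx", "yyy", "bbb"}},
    {"userId": 456, "userAttributes": {"xxx", "ccc", "zzz"}},
    {"userId": 567, "userAttributes": {"xxx", "ddd", "ooo"}},
]


def matches(attrs):
    """((xxx AND yyy) NOT zzz) OR ooo, evaluated on a set of attributes."""
    return ("xxx" in attrs and "yyy" in attrs and "zzz" not in attrs) or "ooo" in attrs


# The query part filters the documents; the aggregation counts distinct userIds.
unique_ids = {u["userId"] for u in users if matches(u["userAttributes"])}
print(len(unique_ids), sorted(unique_ids))
```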

Resources