Elasticsearch query returns strangely sorted (score-based) results

I'm using Elasticsearch v5.3.2
I have the following mapping:
{
"mappings":{
"info":{
"_all":{
"enabled": false
},
"properties":{
"info":{
"properties":{
"email":{
"doc_values":"false",
"fields":{
"ngram":{
"analyzer":"custom_nGram_analyzer",
"type":"text"
}
},
"type":"keyword"
}
}
}
}
}
},
"settings":{
"analysis":{
"analyzer":{
"custom_nGram_analyzer":{
"filter":[
"lowercase",
"asciifolding",
"custom_nGram_filter"
],
"tokenizer":"whitespace",
"type":"custom"
}
},
"filter":{
"custom_nGram_filter":{
"max_gram":16,
"min_gram":3,
"type":"ngram"
}
}
}
}
}
I see very strange results in terms of document scores when I execute the following query:
GET /info_2017_08/info/_search
{
"query": {
"multi_match": {
"query": "hotmail",
"fields": [
"info.email.ngram"
]
}
}
}
It returns the following results:
"hits": {
"total": 3,
"max_score": 1.3834574,
"hits": [
{
"_index": "info_2017_08",
"_type": "info",
"_id": "AV4uQnCjzNcTF2GMY730",
"_score": 1.3834574,
"_source": {
"info": {
"email": "pv53p8vg#gmail.com"
}
}
},
{
"_index": "info_2017_08",
"_type": "info",
"_id": "AV4uQm93zNcTF2GMY73x",
"_score": 0.3967861,
"_source": {
"info": {
"email": "-vb6sbw54#hotmail.com"
}
}
},
{
"_index": "info_2017_08",
"_type": "info",
"_id": "AV4uQmYbzNcTF2GMY73P",
"_score": 0.36409757,
"_source": {
"info": {
"email": "985pu4c.r02a#gmail.com"
}
}
}
]
}
Now pay attention to the scores. How come the first result has a higher score than the second one, when the first is ...#gmail.com, the second is ...#hotmail.com, and I searched for the term "hotmail"?
The second one should match the query with the ngrams "mail" and "hotmail", while the first one matches only via the ngram "mail", so what is the reason for this outcome?
Thanks in advance.

Elasticsearch calculates document scores on each shard independently, using that shard's TF/IDF statistics. Because of that, if you have two shards with the following content:
"info.email": "985pu4c.r02a#gmail.com"
"info.email": "1085pu4c.r02a#gmail.com", "info.email": "-vb6sbw54#hotmail.com"
Then for your specific query single document from the first shard will have a higher score than any document from the second shard.
You can examine the content of each shard using the following API call: GET index/_search?preference=_shards:0
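If redistributing your documents isn't an option, a common workaround (at the cost of an extra round-trip per search) is the dfs_query_then_fetch search type, which makes Elasticsearch gather term statistics globally across all shards before scoring. Applied to the query above:

```json
GET /info_2017_08/info/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "multi_match": {
      "query": "hotmail",
      "fields": [
        "info.email.ngram"
      ]
    }
  }
}
```

With IDF computed over the whole index rather than per shard, the ...#hotmail.com document should then rank first.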

Related

Username search in Elasticsearch

I want to implement a simple username search within Elasticsearch. I don't need weighted username searches yet, so I expected it wouldn't be too hard to find resources on how to do this. But in the end, I came across NGrams and a lot of outdated Elasticsearch tutorials, and I completely lost track of the best practice for this.
This is my current setup, but it is really bad because it matches so many unrelated usernames:
{
"settings": {
"index" : {
"max_ngram_diff": "11"
},
"analysis": {
"analyzer": {
"username_analyzer": {
"tokenizer": "username_tokenizer",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"username_tokenizer": {
"type": "ngram",
"min_gram": "1",
"max_gram": "12"
}
}
}
},
"mappings": {
"properties": {
"_all" : { "enabled" : false },
"username": {
"type": "text",
"analyzer": "username_analyzer"
}
}
}
}
I am using the newest Elasticsearch and I just want to query similar/exact usernames. I have a user db and users should be able to search for each other, nothing too fancy.
If you want to search for exact usernames, you can use the term query.
The term query returns documents that contain an exact term in a provided field. If you have not defined an explicit index mapping, you need to add .keyword to the field name; this sub-field uses the keyword analyzer instead of the standard analyzer.
There is no need for an n-gram tokenizer if you only want to search for the exact term.
Adding a working example with index data, index mapping, search query, and search result
Index Mapping:
{
"mappings": {
"properties": {
"username": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Index Data:
{
"username": "Jack"
}
{
"username": "John"
}
Search Query:
{
"query": {
"term": {
"username.keyword": "Jack"
}
}
}
Search Result:
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"username": "Jack"
}
}
]
Edit 1:
To match similar terms, you can use the fuzziness parameter along with the match query:
{
"query": {
"match": {
"username": {
"query": "someting",
"fuzziness":"auto"
}
}
}
}
Search Result:
"hits": [
{
"_index": "68844541",
"_type": "_doc",
"_id": "3",
"_score": 0.6065038,
"_source": {
"username": "something"
}
}
]

Scoring higher for shorter fields

I'm trying to get a higher score (or at least the same score) for the shortest values in Elasticsearch.
Let's say I have these documents: "Abc", "Abca", "Abcb", "Abcc". The field label.ngram uses an EdgeNgram analyser.
With a really simple query like that:
{
"query": {
"match": {
"label.ngram": {
"query": "Ab"
}
}
}
}
I always get the documents "Abca", "Abcb", "Abcc" first, instead of "Abc".
How can I get "Abc" first?
(should I use this: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html?)
Thanks!
This happens due to field-length normalization; to get the same score, you have to disable norms on the field.
Norms store various normalization factors that are later used at query time in order to compute the score of a document relatively to a query.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"norms": false,
"analyzer": "my_analyzer"
}
}
}
}
Index Data:
{
"title": "Abca"
}
{
"title": "Abcb"
}
{
"title": "Abcc"
}
{
"title": "Abc"
}
Search Query:
{
"query": {
"match": {
"title": {
"query": "Ab"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65953349",
"_type": "_doc",
"_id": "1",
"_score": 0.1424427,
"_source": {
"title": "Abca"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "2",
"_score": 0.1424427,
"_source": {
"title": "Abcb"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "3",
"_score": 0.1424427,
"_source": {
"title": "Abcc"
}
},
{
"_index": "65953349",
"_type": "_doc",
"_id": "4",
"_score": 0.1424427,
"_source": {
"title": "Abc"
}
}
]
As mentioned by #ESCoder, disabling norms fixes the scoring, but it is not very useful if you want to rank your search results: it causes all documents in the result set to have the same score, which hurts the relevance of your search results big time.
Maybe you should tweak the document-length norm parameter of the default similarity algorithm (BM25) if you are on ES 5.x or higher. I tried this with your dataset and my settings but couldn't make it work.
The second option, which will mostly work, is the one you suggested: store the length of the field in a separate field, populated from your application (since the analysis process generates several tokens from the same field). But that is extra overhead, and I would prefer tweaking the similarity parameters instead.
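As a sketch of that second option (title_length is a hypothetical field that your application would fill in at index time with the character length of title), you could keep the analyzer as above and break score ties with the stored length:

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "title":        { "type": "text", "norms": false, "analyzer": "my_analyzer" },
      "title_length": { "type": "integer" }
    }
  }
}

GET my-index/_search
{
  "query": {
    "match": { "title": "Ab" }
  },
  "sort": [
    "_score",
    { "title_length": "asc" }
  ]
}
```

Since all four documents tie on _score with norms disabled, the ascending secondary sort on title_length should put "Abc" first.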

Elasticsearch - pass fuzziness parameter in query_string

I have a fuzzy query with a customized AUTO:10,20 fuzziness value.
{
"query": {
"match": {
"name": {
"query": "nike",
"fuzziness": "AUTO:10,20"
}
}
}
}
How to convert it to a query_string query? I tried nike~AUTO:10,20 but it is not working.
It's possible with query_string as well; let me show it using the same example the OP provided. Both the OP's match query and the query_string query fetch the same document with the same score.
According to the ES docs, Elasticsearch supports the AUTO:10,20 format, which is shown in my example as well.
Index mapping
{
"mappings": {
"properties": {
"name": {
"type": "text"
}
}
}
}
Index a document
{
"name" : "nike"
}
Search query using match with fuzziness
{
"query": {
"match": {
"name": {
"query": "nike",
"fuzziness": "AUTO:10,20"
}
}
}
}
And result
"hits": [
{
"_index": "so-query",
"_type": "_doc",
"_id": "1",
"_score": 0.9808292,
"_source": {
"name": "nike"
}
}
]
Query_string with fuzziness
{
"query": {
"query_string": {
"fields": ["name"],
"query": "nike",
"fuzziness": "AUTO:10,20"
}
}
}
And result
"hits": [
{
"_index": "so-query",
"_type": "_doc",
"_id": "1",
"_score": 0.9808292,
"_source": {
"name": "nike"
}
}
]
Lucene syntax only allows you to specify fuzziness with the tilde symbol "~", optionally followed by 0, 1, or 2 to indicate the edit distance.
Elasticsearch Query DSL supports a configurable special value for AUTO which then is used to build the proper Lucene query.
You would need to implement that logic on the application side, evaluating the desired edit distance based on the length of your search term, and then use <searchTerm>~<editDistance> in your query_string query.
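As a sketch of the translation: with AUTO:10,20, terms shorter than 10 characters must match exactly, terms of 10 to 19 characters allow one edit, and longer terms allow two. Once your application has resolved the edit distance for a term (say, 1), the resulting query_string query would be:

```json
GET so-query/_search
{
  "query": {
    "query_string": {
      "fields": ["name"],
      "query": "nike~1"
    }
  }
}
```

Note that for the 4-character term nike, AUTO:10,20 would actually resolve to an edit distance of 0, i.e. an exact match.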

Elastic/Kibana: support for plurals in query searches

I'll simplify my issue. Let's say I have an index with 3 documents I've created with Kibana:
PUT /test/vendors/1
{
"type": "doctor",
"name": "Phil",
"works_in": [
{
"place": "Chicago"
},
{
"place": "New York"
}
]
}
PUT /test/vendors/2
{
"type": "lawyer",
"name": "John",
"works_in": [
{
"place": "Chicago"
},
{
"place": "New Jersey"
}
]
}
PUT /test/vendors/3
{
"type": "doctor",
"name": "Jill",
"works_in": [
{
"place": "Chicago"
}
]
}
Now I'm running a search:
GET /test/_search
{
"query": {
"multi_match" : {
"query": "doctor in chicago",
"fields": [ "type", "place" ]
}
}
}
And I'm getting a good response:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.2876821,
"hits": [
{
"_index": "test",
"_type": "vendors",
"_id": "1",
"_score": 0.2876821,
"_source": {
"type": "doctor",
"name": "Phil",
"works_in": [
{
"place": "Chicago"
},
{
"place": "New York"
}
]
}
},
{
"_index": "test",
"_type": "vendors",
"_id": "3",
"_score": 0.2876821,
"_source": {
"type": "doctor",
"name": "Jill",
"works_in": [
{
"place": "Chicago"
}
]
}
}
]
}
}
Now things begin to get problematic...
Change doctor to doctors:
GET /test/_search
{
"query": {
"multi_match" : {
"query": "doctors in chicago",
"fields": [ "type", "place" ]
}
}
}
Zero results, since the term doctors is not found anywhere: Elasticsearch doesn't know about plural vs. singular.
Change the query to New York
GET /test/_search
{
"query": {
"multi_match" : {
"query": "doctor in new york",
"fields": [ "type", "place" ]
}
}
}
But the result set gives me the doctor in Chicago in addition to the doctor in New York; the fields are matched with OR...
Another interesting question: what happens if someone uses docs or physicians or health professionals but means doctor? Is there a provision where I can teach Elasticsearch to funnel those into "doctor"?
Is there any pattern or way around these issues using Elasticsearch alone, where I won't have to analyze the string for meaning in my own application and then construct a complex, exact Elasticsearch query to match it?
I would appreciate any pointer in the right direction.
I'm assuming that the fields type and place are of type text with the standard analyzer.
To manage singulars/plurals, what you are looking for is the Snowball Token Filter, which you need to add to your mapping.
For the other requirement you mentioned, that e.g. physicians should be treated as doctor, you need to make use of the Synonym Token Filter.
Below is how your mapping should look. Note that I've only added the analyzer to type; you can make similar changes for the other fields.
Mapping
PUT <your_index_name>
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"tokenizer":"standard",
"filter":[
"lowercase",
"my_snow",
"my_synonym"
]
}
},
"filter":{
"my_snow":{
"type":"snowball",
"language":"English"
},
"my_synonym":{
"type":"synonym",
"synonyms":[
"docs, physicians, health professionals, doctor"
]
}
}
}
},
"mappings":{
"mydocs":{
"properties":{
"type":{
"type":"text",
"analyzer":"my_analyzer"
},
"place":{
"type":"text",
"analyzer":"my_analyzer"
}
}
}
}
}
Notice how I've added the synonyms in the mapping itself; instead of that, I'd suggest you keep the synonyms in a text file, like below:
{
"type":"synonym",
"synonyms_path" : "analysis/synonym.txt"
}
According to the link I've shared, the above configures a synonym filter with a path of analysis/synonym.txt (relative to the config location).
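You can sanity-check the analyzer with the _analyze API; with the mapping above, both doctor and doctors should be stemmed to the same token by the snowball filter:

```json
POST <your_index_name>/_analyze
{
  "analyzer": "my_analyzer",
  "text": "doctors in Chicago"
}
```

The doctors token should come back as doctor (and Chicago lowercased to chicago), which is why the plural query then matches.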
Hope it helps!

Can I extract the actual value of not_analyzed field when _source is disabled?

I have the following mapping:
{
"articles":{
"mappings":{
"article":{
"_all":{
"enabled":false
},
"_source":{
"enabled":false
},
"properties":{
"content":{
"type":"string",
"norms":{
"enabled":false
}
},
"url":{
"type":"string",
"index":"not_analyzed"
}
}
}
},
"settings":{
"index":{
"refresh_interval":"30s",
"number_of_shards":"20",
"analysis":{
"analyzer":{
"default":{
"filter":[
"icu_folding",
"icu_normalizer"
],
"type":"custom",
"tokenizer":"icu_tokenizer"
}
}
},
"number_of_replicas":"1"
}
}
}
}
The question is: is it possible to somehow extract the actual values of the url field, given that it is not_analyzed and _source is disabled? I need to do this only once for this index, so even a hacky way is acceptable.
I know that not_analyzed means the string won't be tokenized, so it makes sense to me that it should be stored somewhere, but I don't know whether it is hashed or stored 1:1, and I couldn't find information about this in the documentation.
My servers are running ES version 1.4.4 with JVM 1.8.0_31.
You can read the field data to retrieve the url from each document. We will be reading straight from the ES index, so we get exactly what we are "matching" on; in this case, the exact URL you indexed, since the field is not analyzed.
Using the example index you provided, I indexed two URLs (into a smaller subset of your provided index):
POST /articles/article/1
{
"url":"https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fielddata-fields.html"
}
POST /articles/article/2
{
"url":"http://stackoverflow.com/questions/37488389/can-i-extract-the-actual-value-of-not-analyzed-field-when-source-is-disabled"
}
And then this query will provide me a new "fields" object for each hit:
GET /articles/article/_search
{
"fielddata_fields" : ["url"]
}
Giving us these results:
"hits": [
{
"_index": "articles",
"_type": "article",
"_id": "2",
"_score": 1,
"fields": {
"url": [
"http://stackoverflow.com/questions/37488389/can-i-extract-the-actual-value-of-not-analyzed-field-when-source-is-disabled"
]
}
},
{
"_index": "articles",
"_type": "article",
"_id": "1",
"_score": 1,
"fields": {
"url": [
"https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fielddata-fields.html"
]
}
}
]
Hope that helps!
