Elasticsearch gives duplicate result - elasticsearch

I have following search/cities index where element will have name and bunch of other properties. I perform following aggregate search:
{
"size": 0,
"query": {
"multi_match" : {
"query": "ana",
"fields": [ "cityName" ],
"type" : "phrase_prefix"
}
},
"aggs": {
"res": {
"terms": {
"field": "cityName"
},
"aggs":{
"dedup_docs":{
"top_hits":{
"size":1
}
}
}
}
}
}
As result I get 3 buckets with keys "Anahiem", "ana" and "santa". Below is result:
"buckets": [
{
"key": "anaheim",
"doc_count": 11,
"dedup_docs": {
"hits": {
"total": 11,
"max_score": 5.8941016,
"hits": [
{
"_index": "search",
"_type": "City",
"_id": "310",
"_score": 5.8941016,
"_source": {
"id": 310,
"country": "USA",
"stateCode": "CA",
"stateName": "California",
"cityName": "Anaheim",
"postalCode": "92806",
"latitude": 33.822738,
"longitude": -117.881633
}
}
]
}
}
},
{
"key": "ana",
"doc_count": 4,
"dedup_docs": {
"hits": {
"total": 4,
"max_score": 2.933612,
"hits": [
{
"_index": "search",
"_type": "City",
"_id": "154",
"_score": 2.933612,
"_source": {
"id": 154,
"country": "USA",
"stateCode": "CA",
"stateName": "California",
"cityName": "Santa Ana",
"postalCode": "92706",
"latitude": 33.767371,
"longitude": -117.868255
}
}
]
}
}
},
{
"key": "santa",
"doc_count": 4,
"dedup_docs": {
"hits": {
"total": 4,
"max_score": 2.933612,
"hits": [
{
"_index": "search",
"_type": "City",
"_id": "154",
"_score": 2.933612,
"_source": {
"id": 154,
"country": "USA",
"stateCode": "CA",
"stateName": "California",
"cityName": "Santa Ana",
"postalCode": "92706",
"latitude": 33.767371,
"longitude": -117.868255
}
}
]
}
}
}
]
Question is why last bucket has key "santa" even tho I search for "ana" and why same city "Santa Ana" (with id=154) shows up in 2 different buckets (key "ana" and key "santa")?

It's mainly because your cityName field is analyzed, and thus, when Santa Ana is indexed, the two tokens santa and ana are getting generated and used for bucketing.
If you want to prevent that you need to define your cityName field like this:
PUT search
{
"mappings": {
"City": {
"properties": {
"cityName": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
You first need to wipe your index, recreate it with the above mapping and then re-index your data. Only then you'll get your bucket names as Anaheim and Santa Ana.
UPDATE
If you want cityName to be analyzed but also only get a single bucket in your aggregation, there is a way by defining a multi-field, where one part is analyzed and the other one is not, like this
PUT search
{
"mappings": {
"City": {
"properties": {
"cityName": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
So you let cityName be analyzed but now you also have cityName.raw which is not analyzed and that you can use in your aggregation like this:
"terms": {
"field": "cityName.raw"
},

UPDATE
The repeation is a behaviour of top_hits aggregation.
Check that nice tutorial:
https://www.elastic.co/blog/top-hits-aggregation
When solely using the top_hits aggregation, it just repeats what is
already in the regular hits in the response.
Actually analyzing is nothing to do with it. So the below explenation is not true.
In default settings Elasticsearch will split input to so called terms. Default analyzer will transform Santa Ana as 2 terms like [santa, ana]. End when searching for ana Santa Ana will also match.
You can read about how Elastichsearch work from here:
https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up

Related

Querying array with nested objects in Elasticsearch to get multiple objects

I have data in Elasticsearch in the below format -
"segments": [
{"id": "ABC", "value":123},
{"id": "PQR", "value":345},
{"id": "DEF", "value":567},
{"id": "XYZ", "value":789},
]
I want to retrieve all segments where id is "ABC" or "DEF".
I looked up the docs (https://www.elastic.co/guide/en/elasticsearch/reference/7.9/query-dsl-nested-query.html) and few examples on YouTube but the all look to retrieve only a single object while I want to retrieve more than 1.
Is there a way to do this?
You can use nested query with inner hits as shown here.
I hope your index mapping is looks like below and segments field is define as nested
"mappings": {
"properties": {
"segments": {
"type": "nested",
"properties": {
"id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"value": {
"type": "long"
}
}
}
}
}
You can use below Query:
{
"_source" : false,
"query": {
"nested": {
"path": "segments",
"query": {
"terms": {
"segments.id.keyword": [
"ABC",
"DEF"
]
}
},
"inner_hits": {}
}
}
}
Response:
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "73895503",
"_id": "TmM8iYMBrWOLJcwdvQGG",
"_score": 1,
"inner_hits": {
"segments": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "73895503",
"_id": "TmM8iYMBrWOLJcwdvQGG",
"_nested": {
"field": "segments",
"offset": 0
},
"_score": 1,
"_source": {
"id": "ABC",
"value": 123
}
},
{
"_index": "73895503",
"_id": "TmM8iYMBrWOLJcwdvQGG",
"_nested": {
"field": "segments",
"offset": 2
},
"_score": 1,
"_source": {
"id": "DEF",
"value": 567
}
}
]
}
}
}
}
]
}

Elasticsearch template to support case insensitive searches

I've setup a normalizer on an index field to support case insensitive searches, cant seem to get it to work.
GET users/
Returns the following mapping:
{
"users": {
"aliases": {},
"mappings": {
"user": {
"properties": {
"active": {
"type": "boolean"
},
"first_name": {
"type": "keyword",
"fields": {
"normalize": {
"type": "keyword",
"normalizer": "search_normalizer"
}
}
},
},
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "users",
"creation_date": "1567936315432",
"analysis": {
"normalizer": {
"search_normalizer": {
"filter": [
"lowercase"
],
"type": "custom"
}
}
},
"number_of_replicas": "1",
"uuid": "5SknFdwJTpmF",
"version": {
"created": "6040299"
}
}
}
}
}
Although first_name is normalized to lowercase, queries on the first_name field are case sensitive.
Using the following query for a user with first name Dave
GET users/_search
{
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name": {
"value": ".*dave.*"
}
}
}
]
}
}
}
GET users/_analyze
{
"analyzer" : "standard",
"text": "Dave"
}
returns
{
"tokens": [
{
"token": "dave",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
}
]
}
Although "Dave" is tokenized to "dave" the following query
GET users/_search
{
"query": {
"match": {
"first_name": "dave"
}
}
}
Returns no hits.
Is there an issue with my current mapping? or the query?
I think you have missed first_name.normalize in query
Indexing Records
{"first_name": "Daveraj"}
{"index": {}}
{"first_name": "RajdaveN"}
{"index": {}}
{"first_name": "Dave"}
Query
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name.normalize": {
"value": ".*dave.*"
}
}
}
]
}
}
}
Result
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.0,
"hits": [
{
"_index": "test3",
"_type": "test3_type",
"_id": "M8-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Dave"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Mc-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Daveraj"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Ms-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "RajdaveN"
}
}
]
}
}```
You have created a normalized multi-field: first_name.normalize , but you are searching on the original field first_name which doesn't have any analyzer specified (will default to index-default analyzer or standard).
The examples given here might help:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
You need to explicitly specify the multi-field you want to search on, note even though a multi-field cant have its own content, it indexes different terms as opposed to its parent (although not always) as a result of possibly being analyzed using diff analyzers/char/token filters.

Get only the matching values and corresponding fields from ElasticSearch

In elasticsearch, let's say I have documents like
{
"name": "John",
"department": "Biology",
"address": "445 Mount Eden Road"
},
{
"name": "Jane",
"department": "Chemistry",
"address": "32 Wilson Street"
},
{
"name": "Laura",
"department": "BioTechnology",
"address": "21 Greens Road"
},
{
"name": "Mark",
"department": "Physics",
"address": "Random UNESCO Bio-reserve"
}
There is a use-case where, if I type "bio" in a search bar, I should get the matching field-value(s) from elasticsearch along with the field name.
For this example,
Input: "bio"
Expected Output:
{
"field": "department",
"value": "Biology"
},
{
"field": "department",
"value": "BioTechnology"
},
{
"field": "address",
"value": "Random UNESCO Bio-reserve"
}
What type of query should I use? I can think of using NGram Tokenizer and then use match query. But, I am not sure how shall I get only the matching field value (not the entire document) and the corresponding field name as the output.
After reading further about Completion Suggesters and Context Suggesters, I could solve this problem in the following way:
1) Keep a separate "suggest" field for each record with type "completion" with context-mapping of type "category". The mapping I created looks like as follows:
{
"properties": {
"suggest": {
"type": "completion",
"contexts": [
{
"name": "field_type",
"type": "category",
"path": "cat"
}
]
},
"name": {
"type": "text"
},
"department": {
"type": "text"
},
"address": {
"type": "text"
}
}
}
2) Then I insert the records as shown below (adding search metadata to the "suggest" field with proper "context").
For example, to insert the first record, I execute the following:
POST: localhost:9200/test_index/test_type/1
{
"suggest": [
{
"input": ["john"],
"contexts": {
"field_type": ["name"]
}
},
{
"input": ["biology"],
"contexts": {
"field_type": ["department"]
}
},
{
"input": ["445 mount eden road"],
"contexts": {
"field_type": ["address"]
}
}
],
"name": "john",
"department": "biology",
"address": "445 mount eden road"
}
3) If we want to search terms occurring in the middle of a sentence (as the search-term "bio" occurs in middle of the address field in the 4th record, we can index the entry as follows:
POST: localhost:9200/test_index/test_type/4
{
"suggest": [
{
"input": ["mark"],
"contexts": {
"field_type": ["name"]
}
},
{
"input": ["physics"],
"contexts": {
"field_type": ["department"]
}
},
{
"input": ["random unesco bio-reserve", "bio-reserve"],
"contexts": {
"field_type": ["address"]
}
}
],
"name": "mark",
"department": "physics",
"address": "random unesco bio-reserve"
}
4) Then search for the keyword "bio" like this:
localhost:9200/test_index/test_type/_search
{
"_source": false,
"suggest": {
"suggestion" : {
"text" : "bio",
"completion" : {
"field" : "suggest",
"size": 10,
"contexts": {
"field_type": [ "name", "department", "address" ]
}
}
}
}
}
The response:
{
"hits": {
"total": 0,
"max_score": 0,
"hits": []
},
"suggest": {
"suggestion": [
{
"text": "bio",
"offset": 0,
"length": 3,
"options": [
{
"text": "bio-reserve",
"_index": "test_index",
"_type": "test_type",
"_id": "4",
"_score": 1,
"contexts": {
"field_type": [
"address"
]
}
},
{
"text": "biology",
"_index": "test_index",
"_type": "test_type",
"_id": "1",
"_score": 1,
"contexts": {
"field_type": [
"department"
]
}
},
{
"text": "biotechnology",
"_index": "test_index",
"_type": "test_type",
"_id": "3",
"_score": 1,
"contexts": {
"field_type": [
"department"
]
}
}
]
}
]
}
}
Can anyone please suggest any better approach?

Word-oriented completion suggester (ElasticSearch 5.x)

ElasticSearch 5.x introduced some (breaking) changes to the Suggester API (Documentation). Most notable change is the following:
Completion suggester is document-oriented
Suggestions are aware of the
document they belong to. Now, associated documents (_source) are
returned as part of completion suggestions.
In short, all completion queries return all matching documents instead of just matched words. And herein lies the problem - duplication of autocompleted words if they occur in more than one document.
Let's say we have this simple mapping:
{
"my-index": {
"mappings": {
"users": {
"properties": {
"firstName": {
"type": "text"
},
"lastName": {
"type": "text"
},
"suggest": {
"type": "completion",
"analyzer": "simple"
}
}
}
}
}
}
With a few test documents:
{
"_index": "my-index",
"_type": "users",
"_id": "1",
"_source": {
"firstName": "John",
"lastName": "Doe",
"suggest": [
{
"input": [
"John",
"Doe"
]
}
]
}
},
{
"_index": "my-index",
"_type": "users",
"_id": "2",
"_source": {
"firstName": "John",
"lastName": "Smith",
"suggest": [
{
"input": [
"John",
"Smith"
]
}
]
}
}
And a by-the-book query:
POST /my-index/_suggest?pretty
{
"my-suggest" : {
"text" : "joh",
"completion" : {
"field" : "suggest"
}
}
}
Which yields the following results:
{
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"my-suggest": [
{
"text": "joh",
"offset": 0,
"length": 3,
"options": [
{
"text": "John",
"_index": "my-index",
"_type": "users",
"_id": "1",
"_score": 1,
"_source": {
"firstName": "John",
"lastName": "Doe",
"suggest": [
{
"input": [
"John",
"Doe"
]
}
]
}
},
{
"text": "John",
"_index": "my-index",
"_type": "users",
"_id": "2",
"_score": 1,
"_source": {
"firstName": "John",
"lastName": "Smith",
"suggest": [
{
"input": [
"John",
"Smith"
]
}
]
}
}
]
}
]
}
In short, for a completion suggest for text "joh", two (2) documents were returned - both John's and both had the same value of the text property.
However, I would like to receive one (1) word. Something simple like this:
{
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"my-suggest": [
{
"text": "joh",
"offset": 0,
"length": 3,
"options": [
"John"
]
}
]
}
Question: how to implement a word-based completion suggester. There is no need to return any document related data, since I don't need it at this point.
Is the "Completion Suggester" even appropriate for my scenario? Or should I use a completely different approach?
EDIT:
As many of you pointed out, an additional completion-only index would be a viable solution. However, I can see multiple issues with this approach:
Keeping the new index in sync.
Auto-completing subsequent words would probably be global, instead of narrowed down. For example, say you have the following words in the additional index: "John", "Doe", "David", "Smith". When querying for "John D", the result for the incomplete word should be "Doe" and not "Doe", "David".
To overcome the second point, only indexing single words wouldn't be enough, since you would also need to map all words to documents in order to properly narrow down auto-completing subsequent words. And with this, you actually have the same problem as querying the original index. Therefore, the additional index doesn't make sense anymore.
As hinted at in the comment, another way of achieving this without getting the duplicate documents is to create a sub-field for the firstname field containing ngrams of the field. First you define your mapping like this:
PUT my-index
{
"settings": {
"analysis": {
"analyzer": {
"completion_analyzer": {
"type": "custom",
"filter": [
"lowercase",
"completion_filter"
],
"tokenizer": "keyword"
}
},
"filter": {
"completion_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 24
}
}
}
},
"mappings": {
"users": {
"properties": {
"autocomplete": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
},
"completion": {
"type": "text",
"analyzer": "completion_analyzer",
"search_analyzer": "standard"
}
}
},
"firstName": {
"type": "text"
},
"lastName": {
"type": "text"
}
}
}
}
}
Then you index a few documents:
POST my-index/users/_bulk
{"index":{}}
{ "firstName": "John", "lastName": "Doe", "autocomplete": "John Doe"}
{"index":{}}
{ "firstName": "John", "lastName": "Deere", "autocomplete": "John Deere" }
{"index":{}}
{ "firstName": "Johnny", "lastName": "Cash", "autocomplete": "Johnny Cash" }
Then you can query for joh and get one result for John and another one for Johnny
{
"size": 0,
"query": {
"term": {
"autocomplete.completion": "john d"
}
},
"aggs": {
"suggestions": {
"terms": {
"field": "autocomplete.raw"
}
}
}
}
Results:
{
"aggregations": {
"suggestions": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "John Doe",
"doc_count": 1
},
{
"key": "John Deere",
"doc_count": 1
}
]
}
}
}
UPDATE (June 25th, 2019):
ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
An additional field skip_duplicates will be added in the next release 6.x.
From the docs at https://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters-completion.html#skip_duplicates:
POST music/_search?pretty
{
"suggest": {
"song-suggest" : {
"prefix" : "nor",
"completion" : {
"field" : "suggest",
"skip_duplicates": true
}
}
}
}
We face exactly the same problem. In Elasticsearch 2.4 the approach like you describe used to work fine for us but now as you say the suggester has become document-based while like you we are only interested in unique words, not in the documents.
The only 'solution' we could think of so far is to create a separate index just for the words on which we want to perform the suggestion queries and in this separate index make sure somehow that identical words are only indexed once. Then you could perform the suggestion queries on this separate index. This is far from ideal, if only because we will then need to make sure that this index remains in sync with the other index that we need for our other queries.

facing problems with terms filter

My mapping looks like the below.
"BID": {
"type": "string"
},
"REGION": {
"type": "string"
},
Now I am trying to search for the records whose BID values are B100, B302. I've written below query. Though I've records with those ID values, I am not getting any results. Any clue where I am doing wrong?
{"query": {"filtered": {"filter": {"terms": {"BID": ["B100","B302"]}}}}}
Try using lower-case values, like:
{"query": {"filtered": {"filter": {"terms": {"BID": ["b100","b302"]}}}}}
You need to do this because, since you did not specify an analyzer in the definition of "BID" in your mapping, the default standard analyzer is used, which will convert letters to lower-case.
Alternatively, if you want to maintain the case in your index terms, you can add "index": "not_analyzed" to your mapping definition for "BID".
To test I set up an index like this:
PUT /test_index
{
"mappings": {
"doc": {
"properties": {
"BID": {
"type": "string",
"index": "not_analyzed"
},
"REGION": {
"type": "string"
}
}
}
}
}
added a few docs:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"REGION":"NA","BID":"B100"}
{"index":{"_id":2}}
{"REGION":"NA","BID":"B200"}
{"index":{"_id":3}}
{"REGION":"NA","BID":"B302"}
and now your query works as written:
POST /test_index/_search
{
"query": {
"filtered": {
"filter": {
"terms": {
"BID": [
"B100",
"B302"
]
}
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"REGION": "NA",
"BID": "B100"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"REGION": "NA",
"BID": "B302"
}
}
]
}
}
Here is some code I used for testing:
http://sense.qbox.io/gist/b4b4767501df7ad8b6459c4d96809d737a8811ec

Resources