Word-oriented completion suggester (ElasticSearch 5.x) - elasticsearch

Elasticsearch 5.x introduced some breaking changes to the Suggester API (see the documentation). The most notable change is the following:
Completion suggester is document-oriented
Suggestions are aware of the
document they belong to. Now, associated documents (_source) are
returned as part of completion suggestions.
In short, all completion queries now return matching documents instead of just matched words. And herein lies the problem: duplication of autocompleted words when they occur in more than one document.
Let's say we have this simple mapping:
{
"my-index": {
"mappings": {
"users": {
"properties": {
"firstName": {
"type": "text"
},
"lastName": {
"type": "text"
},
"suggest": {
"type": "completion",
"analyzer": "simple"
}
}
}
}
}
}
With a few test documents:
{
"_index": "my-index",
"_type": "users",
"_id": "1",
"_source": {
"firstName": "John",
"lastName": "Doe",
"suggest": [
{
"input": [
"John",
"Doe"
]
}
]
}
},
{
"_index": "my-index",
"_type": "users",
"_id": "2",
"_source": {
"firstName": "John",
"lastName": "Smith",
"suggest": [
{
"input": [
"John",
"Smith"
]
}
]
}
}
And a by-the-book query:
POST /my-index/_suggest?pretty
{
"my-suggest" : {
"text" : "joh",
"completion" : {
"field" : "suggest"
}
}
}
Which yields the following results:
{
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"my-suggest": [
{
"text": "joh",
"offset": 0,
"length": 3,
"options": [
{
"text": "John",
"_index": "my-index",
"_type": "users",
"_id": "1",
"_score": 1,
"_source": {
"firstName": "John",
"lastName": "Doe",
"suggest": [
{
"input": [
"John",
"Doe"
]
}
]
}
},
{
"text": "John",
"_index": "my-index",
"_type": "users",
"_id": "2",
"_score": 1,
"_source": {
"firstName": "John",
"lastName": "Smith",
"suggest": [
{
"input": [
"John",
"Smith"
]
}
]
}
}
]
}
]
}
In short, for a completion suggestion on the text "joh", two (2) documents were returned - both Johns - and both had the same value for the text property.
However, I would like to receive one (1) word. Something simple like this:
{
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"my-suggest": [
{
"text": "joh",
"offset": 0,
"length": 3,
"options": [
"John"
]
}
]
}
Question: how do I implement a word-based completion suggester? There is no need to return any document-related data, since I don't need it at this point.
Is the "Completion Suggester" even appropriate for my scenario? Or should I use a completely different approach?
EDIT:
As many of you pointed out, an additional completion-only index would be a viable solution. However, I can see multiple issues with this approach:
Keeping the new index in sync.
Auto-completing subsequent words would probably be global, instead of narrowed down. For example, say you have the following words in the additional index: "John", "Doe", "David", "Smith". When querying for "John D", the result for the incomplete word should be "Doe" and not "Doe", "David".
To overcome the second point, only indexing single words wouldn't be enough, since you would also need to map all words to documents in order to properly narrow down auto-completing subsequent words. And with this, you actually have the same problem as querying the original index. Therefore, the additional index doesn't make sense anymore.

As hinted at in the comment, another way of achieving this without getting duplicate documents is to create a dedicated autocomplete field with a sub-field containing edge n-grams of its content. First you define your mapping like this:
PUT my-index
{
"settings": {
"analysis": {
"analyzer": {
"completion_analyzer": {
"type": "custom",
"filter": [
"lowercase",
"completion_filter"
],
"tokenizer": "keyword"
}
},
"filter": {
"completion_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 24
}
}
}
},
"mappings": {
"users": {
"properties": {
"autocomplete": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
},
"completion": {
"type": "text",
"analyzer": "completion_analyzer",
"search_analyzer": "standard"
}
}
},
"firstName": {
"type": "text"
},
"lastName": {
"type": "text"
}
}
}
}
}
Then you index a few documents:
POST my-index/users/_bulk
{"index":{}}
{ "firstName": "John", "lastName": "Doe", "autocomplete": "John Doe"}
{"index":{}}
{ "firstName": "John", "lastName": "Deere", "autocomplete": "John Deere" }
{"index":{}}
{ "firstName": "Johnny", "lastName": "Cash", "autocomplete": "Johnny Cash" }
Then you can query the autocomplete.completion sub-field and aggregate on autocomplete.raw, so each distinct name shows up only once. For example, searching for john d:
{
"size": 0,
"query": {
"term": {
"autocomplete.completion": "john d"
}
},
"aggs": {
"suggestions": {
"terms": {
"field": "autocomplete.raw"
}
}
}
}
Results:
{
"aggregations": {
"suggestions": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "John Doe",
"doc_count": 1
},
{
"key": "John Deere",
"doc_count": 1
}
]
}
}
}
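To see why the term query for john d matches both documents, here is a rough Python sketch of what the completion_analyzer above produces at index time. This is an approximation, not the actual Lucene analysis chain: the keyword tokenizer keeps the whole value as a single token, which is lowercased and expanded into edge n-grams (prefixes).

```python
def edge_ngrams(text, min_gram=1, max_gram=24):
    """Approximate the completion_analyzer above: keyword tokenizer
    (whole value as one token) + lowercase + edge_ngram filter,
    which emits every prefix from min_gram to max_gram characters."""
    token = text.lower()
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# A term query is not analyzed, so "john d" is looked up verbatim
# among the indexed prefixes:
assert "john d" in edge_ngrams("John Doe")         # matches
assert "john d" in edge_ngrams("John Deere")       # matches
assert "john d" not in edge_ngrams("Johnny Cash")  # no match
```

The terms aggregation on autocomplete.raw then collapses the matching documents into one bucket per distinct name.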
UPDATE (June 25th, 2019):
ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html

An additional option, skip_duplicates, will be added in the upcoming 6.x release.
From the docs at https://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters-completion.html#skip_duplicates:
POST music/_search?pretty
{
"suggest": {
"song-suggest" : {
"prefix" : "nor",
"completion" : {
"field" : "suggest",
"skip_duplicates": true
}
}
}
}
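On clusters that don't have skip_duplicates yet, the same effect can be approximated client-side by collapsing suggester options that share the same text. A minimal sketch (the dedupe_options helper is hypothetical, not part of any client library):

```python
def dedupe_options(options):
    """Collapse completion-suggester options with identical text,
    keeping the first (typically highest-scoring) occurrence.
    'options' is the list found under suggest -> <name> -> [0] -> 'options'."""
    seen = set()
    unique = []
    for opt in options:
        if opt["text"] not in seen:
            seen.add(opt["text"])
            unique.append(opt["text"])
    return unique

options = [
    {"text": "John", "_id": "1", "_score": 1},
    {"text": "John", "_id": "2", "_score": 1},
]
assert dedupe_options(options) == ["John"]
```

Note that this only deduplicates within the page of options the cluster returns, so you may want to request a larger size than you display.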

We face exactly the same problem. In Elasticsearch 2.4 an approach like the one you describe used to work fine for us, but now, as you say, the suggester has become document-oriented, while, like you, we are only interested in unique words, not in the documents.
The only 'solution' we could think of so far is to create a separate index just for the words on which we want to perform the suggestion queries and in this separate index make sure somehow that identical words are only indexed once. Then you could perform the suggestion queries on this separate index. This is far from ideal, if only because we will then need to make sure that this index remains in sync with the other index that we need for our other queries.

Related

matching multiple terms using match_phrase - Elasticsearch

Trying to fetch two documents that fit on the params searched, searching by each document separately works fine.
The query:
{
"query":{
"bool":{
"should":[
{
"match_phrase":{
"email":"elpaso"
}
},
{
"match_phrase":{
"email":"walker"
}
}
]
}
}
}
I'm expecting to retrieve both documents that have these words in their email address field, but the query only returns the first one (elpaso).
Is this an issue related to index mapping? I'm using type text for this field.
Any concept I am missing?
Index mapping:
{
"mappings": {
"properties": {
"id": {
"type": "keyword"
},
"name":{
"type": "text"
},
"email":{
"type" : "text"
}
}
}
}
Sample data:
{
"id":"4a43f351-7b62-42f2-9b32-9832465d271f",
"name":"Walker, Gary (Mr.) .",
"email":"walkergrym#mail.com"
}
{
"id":"1fc18c05-da40-4607-a901-3d78c523cea6",
"name":"Texas Chiropractic Association P.A.C.",
"email":"txchiro#mail.com"
}
{
"id":"9a2323f4-e008-45f0-9f7f-11a1f4439042",
"name":"El Paso Energy Corp. PAC",
"email":"elpaso#mail.com"
}
I also noticed that if I use elpaso and txchiro instead of walker, the query works as expected!
The issue happens when I use only part of the field; if I search by the exact, entire email address, everything works fine.
Is this expected from match_phrase?
You are not getting any result for walker because Elasticsearch uses the standard analyzer when no analyzer is specified, which will tokenize walkergrym#mail.com as
GET /_analyze
{
"analyzer" : "standard",
"text" : "walkergrym#mail.com"
}
The following tokens will be generated:
{
"tokens": [
{
"token": "walkergrym",
"start_offset": 0,
"end_offset": 10,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "mail.com",
"start_offset": 11,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Since there is no token for walker you are not getting "walkergrym#mail.com" in your search result.
Whereas for "txchiro#mail.com", token generated are txchiro and mail.com and for "elpaso#mail.com" tokens are elpaso and mail.com
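The effect can be sketched in Python. This is a deliberately crude stand-in for the standard analyzer (which implements Unicode text segmentation), but it reproduces the behaviour relevant here: # splits tokens, while the dot inside mail.com does not.

```python
import re

def standard_like_tokens(text):
    """Crude stand-in for the standard analyzer: lowercase, then split
    on '#' and whitespace. The real analyzer does Unicode word
    segmentation, but it likewise keeps 'mail.com' as one token."""
    return [t for t in re.split(r"[#\s]+", text.lower()) if t]

# No 'walker' token is produced, so match_phrase "walker" finds nothing:
assert standard_like_tokens("walkergrym#mail.com") == ["walkergrym", "mail.com"]
# 'elpaso' and 'txchiro' are whole tokens, which is why they matched:
assert standard_like_tokens("elpaso#mail.com")[0] == "elpaso"
```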
You can use the edge_ngram tokenizer to achieve the result you need.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "edge_ngram",
"min_gram": 3,
"max_gram": 6,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"properties": {
"email": {
"type": "text",
"analyzer": "my_analyzer"
},
"id": {
"type": "keyword"
},
"name": {
"type": "text"
}
}
}
}
Search Query:
{
"query": {
"bool": {
"should": [
{
"match": {
"email": "elpaso"
}
},
{
"match": {
"email": "walker"
}
}
]
}
}
}
Search Result:
"hits": [
{
"_index": "66907434",
"_type": "_doc",
"_id": "1",
"_score": 3.9233165,
"_source": {
"id": "4a43f351-7b62-42f2-9b32-9832465d271f",
"name": "Walker, Gary (Mr.) .",
"email": "walkergrym#mail.com"
}
},
{
"_index": "66907434",
"_type": "_doc",
"_id": "3",
"_score": 3.9233165,
"_source": {
"id": "9a2323f4-e008-45f0-9f7f-11a1f4439042",
"name": "El Paso Energy Corp. PAC",
"email": "elpaso#mail.com"
}
}
]
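Why this works: my_tokenizer splits on anything that is not a letter (token_chars: ["letter"]) and emits 3- to 6-character prefixes of each word, so walkergrym now produces a walker token. A rough Python sketch of that tokenizer (an approximation, not the real implementation):

```python
import re

def edge_ngram_tokens(text, min_gram=3, max_gram=6):
    """Approximate my_tokenizer: split on non-letters (token_chars:
    ["letter"]), then emit front edge n-grams of each word between
    min_gram and max_gram characters."""
    tokens = []
    for word in re.split(r"[^a-z]+", text.lower()):
        tokens.extend(word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1))
    return tokens

# "walkergrym" now yields the prefix "walker", so the match query finds it:
assert "walker" in edge_ngram_tokens("walkergrym#mail.com")
assert "elpaso" in edge_ngram_tokens("elpaso#mail.com")
```

Since no search_analyzer is set, the query string is expanded into n-grams too, and the match query succeeds as soon as any of them overlaps the indexed grams.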

Elasticsearch multi fields multi words match

I'm looking to implement an auto-complete-like feature in my app with Elasticsearch.
Let's say my input is "ronan f"; I want Elasticsearch to return all elements where "ronan" or "f" is contained in the last name or first name. I expect Elasticsearch to sort the results by score, so the element closest to what I searched should be on top.
I tried multiple requests but none of them results as expected.
For example :
{
"query": {
"bool": {
"must_not": [
{
"match": {
"email": "*#guest.booking.com"
}
}
],
"should": [
{
"match": {
"lastname": "ronan"
}
},
{
"match": {
"firstname": "ronan"
}
},
{
"match": {
"lastname": "f"
}
},
{
"match": {
"firstname": "f"
}
}
],
"minimum_should_match" : 1
}
},
"sort": [
"_score"
],
"from": 0,
"size": 30
}
With this request the ranking seems a bit odd. For example:
"_index": "clients",
"_type": "client",
"_id": "4369",
"_score": 20.680058,
"_source": {
"firstname": "F",
"lastname": "F"
}
is on top of :
"_index": "clients",
"_type": "client",
"_id": "212360",
_score": 9.230003,
"_source": {
"firstname": "Ronan",
"lastname": "Fily"
}
To me, the second result should rank higher than the first.
Can someone show me how I can achieve the result I want?
For info, I can't use the Completion Suggester functionality of Elasticsearch because I can't access the configuration of the database (so no custom indexes).
OK, since you can reindex your data, here is a "starts with" analyzer. It is case-insensitive and works on text fields (first name and last name can contain multiple words).
Delete and recreate the index with the following mappings.
First, define your analyzers (PUT my_index):
{
"settings": {
"analysis": {
"filter": {
"name_ngrams": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "20"
}
},
"analyzer": {
"partial_name": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"name_ngrams",
"asciifolding"
]
},
"full_name": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
}
Then update the _mappings, using this for your name fields:
"lastname": {
"type": "text",
"analyzer": "partial_name",
"search_analyzer": "full_name"
},
"firstname": {
"type": "text",
"analyzer": "partial_name",
"search_analyzer": "full_name"
}
If this is not clear and the Elasticsearch documentation doesn't help, don't hesitate to ask.
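The idea behind the partial_name / full_name pair can be sketched as follows: names are expanded into front n-grams at index time but analyzed into plain lowercase tokens at search time, so each query token is effectively a prefix match. A simplified Python model (assuming whitespace tokenization; the real standard tokenizer does more):

```python
def front_ngrams(text, min_gram=1, max_gram=20):
    """Index-time (partial_name): tokenize, lowercase, then emit front
    edge n-grams of every token."""
    grams = []
    for token in text.lower().split():
        grams.extend(token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1))
    return grams

def full_tokens(text):
    """Search-time (full_name): tokenize and lowercase only."""
    return text.lower().split()

# Every token of "ronan f" is an indexed prefix of "Ronan Fily":
assert all(t in front_ngrams("Ronan Fily") for t in full_tokens("ronan f"))
```

Because the search side uses whole tokens rather than n-grams, a short query token like "f" only matches as a prefix instead of flooding the results with one-letter gram matches.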

Elasticsearch template to support case insensitive searches

I've set up a normalizer on an index field to support case-insensitive searches, but I can't seem to get it to work.
GET users/
Returns the following mapping:
{
"users": {
"aliases": {},
"mappings": {
"user": {
"properties": {
"active": {
"type": "boolean"
},
"first_name": {
"type": "keyword",
"fields": {
"normalize": {
"type": "keyword",
"normalizer": "search_normalizer"
}
}
}
}
}
},
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "users",
"creation_date": "1567936315432",
"analysis": {
"normalizer": {
"search_normalizer": {
"filter": [
"lowercase"
],
"type": "custom"
}
}
},
"number_of_replicas": "1",
"uuid": "5SknFdwJTpmF",
"version": {
"created": "6040299"
}
}
}
}
}
Although first_name is normalized to lowercase, queries on the first_name field are case sensitive.
Using the following query for a user with first name Dave
GET users/_search
{
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name": {
"value": ".*dave.*"
}
}
}
]
}
}
}
GET users/_analyze
{
"analyzer" : "standard",
"text": "Dave"
}
returns
{
"tokens": [
{
"token": "dave",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
}
]
}
Although "Dave" is tokenized to "dave", the following query
GET users/_search
{
"query": {
"match": {
"first_name": "dave"
}
}
}
Returns no hits.
Is there an issue with my current mapping? or the query?
I think you have missed first_name.normalize in the query.
Indexing records:
{"index": {}}
{"first_name": "Daveraj"}
{"index": {}}
{"first_name": "RajdaveN"}
{"index": {}}
{"first_name": "Dave"}
Query:
{
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name.normalize": {
"value": ".*dave.*"
}
}
}
]
}
}
}
Result:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.0,
"hits": [
{
"_index": "test3",
"_type": "test3_type",
"_id": "M8-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Dave"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Mc-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Daveraj"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Ms-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "RajdaveN"
}
}
]
}
}
You have created a normalized multi-field, first_name.normalize, but you are searching on the original field first_name, which has no normalizer specified (a keyword field is indexed verbatim by default).
The examples given here might help:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
You need to explicitly specify the multi-field you want to search on. Note that even though a multi-field can't have its own content, it often indexes different terms than its parent, as a result of being analyzed with different analyzers/char/token filters.
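The difference between the two fields can be modelled in a couple of lines: first_name (type keyword, no normalizer) is indexed verbatim, while first_name.normalize goes through the search_normalizer, i.e. the lowercase filter, before being indexed. A sketch:

```python
import re

def search_normalizer(value):
    """Model of the custom normalizer: a keyword field keeps the whole
    value as one term; the lowercase filter then downcases it."""
    return value.lower()

# Indexed terms for the document {"first_name": "Dave"}:
verbatim = "Dave"                       # first_name
normalized = search_normalizer("Dave")  # first_name.normalize -> "dave"

# The case-sensitive regexp ".*dave.*" only matches the normalized term:
assert re.fullmatch(".*dave.*", verbatim) is None
assert re.fullmatch(".*dave.*", normalized) is not None
```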

Get only the matching values and corresponding fields from ElasticSearch

In elasticsearch, let's say I have documents like
{
"name": "John",
"department": "Biology",
"address": "445 Mount Eden Road"
},
{
"name": "Jane",
"department": "Chemistry",
"address": "32 Wilson Street"
},
{
"name": "Laura",
"department": "BioTechnology",
"address": "21 Greens Road"
},
{
"name": "Mark",
"department": "Physics",
"address": "Random UNESCO Bio-reserve"
}
There is a use-case where, if I type "bio" in a search bar, I should get the matching field-value(s) from elasticsearch along with the field name.
For this example,
Input: "bio"
Expected Output:
{
"field": "department",
"value": "Biology"
},
{
"field": "department",
"value": "BioTechnology"
},
{
"field": "address",
"value": "Random UNESCO Bio-reserve"
}
What type of query should I use? I can think of using an NGram tokenizer and then a match query, but I am not sure how to get only the matching field value (not the entire document) and the corresponding field name as output.
After reading further about Completion Suggesters and Context Suggesters, I could solve this problem in the following way:
1) Keep a separate "suggest" field for each record with type "completion" with context-mapping of type "category". The mapping I created looks like as follows:
{
"properties": {
"suggest": {
"type": "completion",
"contexts": [
{
"name": "field_type",
"type": "category",
"path": "cat"
}
]
},
"name": {
"type": "text"
},
"department": {
"type": "text"
},
"address": {
"type": "text"
}
}
}
2) Then I insert the records as shown below (adding search metadata to the "suggest" field with proper "context").
For example, to insert the first record, I execute the following:
POST: localhost:9200/test_index/test_type/1
{
"suggest": [
{
"input": ["john"],
"contexts": {
"field_type": ["name"]
}
},
{
"input": ["biology"],
"contexts": {
"field_type": ["department"]
}
},
{
"input": ["445 mount eden road"],
"contexts": {
"field_type": ["address"]
}
}
],
"name": "john",
"department": "biology",
"address": "445 mount eden road"
}
3) If we want to search for terms occurring in the middle of a sentence (as the search term "bio" occurs in the middle of the address field in the 4th record), we can index the entry as follows:
POST: localhost:9200/test_index/test_type/4
{
"suggest": [
{
"input": ["mark"],
"contexts": {
"field_type": ["name"]
}
},
{
"input": ["physics"],
"contexts": {
"field_type": ["department"]
}
},
{
"input": ["random unesco bio-reserve", "bio-reserve"],
"contexts": {
"field_type": ["address"]
}
}
],
"name": "mark",
"department": "physics",
"address": "random unesco bio-reserve"
}
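Building those suggest arrays by hand for every document gets tedious; a small indexing-side helper (hypothetical, not an Elasticsearch API) can derive one completion input per field, tagged with the field name as its field_type context:

```python
def build_suggest(doc, fields=("name", "department", "address")):
    """Derive the 'suggest' array for a document: one completion input
    per field, with the field name as the 'field_type' context."""
    return [
        {"input": [doc[field].lower()], "contexts": {"field_type": [field]}}
        for field in fields
        if field in doc
    ]

doc = {"name": "John", "department": "Biology", "address": "445 Mount Eden Road"}
assert build_suggest(doc)[1] == {
    "input": ["biology"],
    "contexts": {"field_type": ["department"]},
}
```

Extra mid-sentence inputs (like "bio-reserve" above) still need to be appended manually, since the completion suggester only matches on prefixes of its inputs.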
4) Then search for the keyword "bio" like this:
localhost:9200/test_index/test_type/_search
{
"_source": false,
"suggest": {
"suggestion" : {
"text" : "bio",
"completion" : {
"field" : "suggest",
"size": 10,
"contexts": {
"field_type": [ "name", "department", "address" ]
}
}
}
}
}
The response:
{
"hits": {
"total": 0,
"max_score": 0,
"hits": []
},
"suggest": {
"suggestion": [
{
"text": "bio",
"offset": 0,
"length": 3,
"options": [
{
"text": "bio-reserve",
"_index": "test_index",
"_type": "test_type",
"_id": "4",
"_score": 1,
"contexts": {
"field_type": [
"address"
]
}
},
{
"text": "biology",
"_index": "test_index",
"_type": "test_type",
"_id": "1",
"_score": 1,
"contexts": {
"field_type": [
"department"
]
}
},
{
"text": "biotechnology",
"_index": "test_index",
"_type": "test_type",
"_id": "3",
"_score": 1,
"contexts": {
"field_type": [
"department"
]
}
}
]
}
]
}
}
Can anyone please suggest any better approach?

Elasticsearch gives duplicate result

I have the following search/cities index, where each element has a name and a bunch of other properties. I perform the following aggregate search:
{
"size": 0,
"query": {
"multi_match" : {
"query": "ana",
"fields": [ "cityName" ],
"type" : "phrase_prefix"
}
},
"aggs": {
"res": {
"terms": {
"field": "cityName"
},
"aggs":{
"dedup_docs":{
"top_hits":{
"size":1
}
}
}
}
}
}
As a result I get 3 buckets with the keys "anaheim", "ana" and "santa". Below is the result:
"buckets": [
{
"key": "anaheim",
"doc_count": 11,
"dedup_docs": {
"hits": {
"total": 11,
"max_score": 5.8941016,
"hits": [
{
"_index": "search",
"_type": "City",
"_id": "310",
"_score": 5.8941016,
"_source": {
"id": 310,
"country": "USA",
"stateCode": "CA",
"stateName": "California",
"cityName": "Anaheim",
"postalCode": "92806",
"latitude": 33.822738,
"longitude": -117.881633
}
}
]
}
}
},
{
"key": "ana",
"doc_count": 4,
"dedup_docs": {
"hits": {
"total": 4,
"max_score": 2.933612,
"hits": [
{
"_index": "search",
"_type": "City",
"_id": "154",
"_score": 2.933612,
"_source": {
"id": 154,
"country": "USA",
"stateCode": "CA",
"stateName": "California",
"cityName": "Santa Ana",
"postalCode": "92706",
"latitude": 33.767371,
"longitude": -117.868255
}
}
]
}
}
},
{
"key": "santa",
"doc_count": 4,
"dedup_docs": {
"hits": {
"total": 4,
"max_score": 2.933612,
"hits": [
{
"_index": "search",
"_type": "City",
"_id": "154",
"_score": 2.933612,
"_source": {
"id": 154,
"country": "USA",
"stateCode": "CA",
"stateName": "California",
"cityName": "Santa Ana",
"postalCode": "92706",
"latitude": 33.767371,
"longitude": -117.868255
}
}
]
}
}
}
]
The question is: why does the last bucket have the key "santa" even though I searched for "ana", and why does the same city, "Santa Ana" (with id=154), show up in 2 different buckets (keys "ana" and "santa")?
It's mainly because your cityName field is analyzed, and thus, when Santa Ana is indexed, the two tokens santa and ana are getting generated and used for bucketing.
If you want to prevent that you need to define your cityName field like this:
PUT search
{
"mappings": {
"City": {
"properties": {
"cityName": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
You first need to wipe your index, recreate it with the above mapping, and then re-index your data. Only then will you get your bucket names as Anaheim and Santa Ana.
UPDATE
If you want cityName to be analyzed but also only get a single bucket in your aggregation, there is a way by defining a multi-field, where one part is analyzed and the other one is not, like this
PUT search
{
"mappings": {
"City": {
"properties": {
"cityName": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
So cityName is still analyzed, but now you also have cityName.raw, which is not analyzed and which you can use in your aggregation like this:
"terms": {
"field": "cityName.raw"
},
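The difference between aggregating on the analyzed field and on the raw sub-field can be sketched like this (a toy model of the terms aggregation, not real Elasticsearch code):

```python
from collections import Counter

def term_buckets(values, analyzed=True):
    """Toy model of a terms aggregation: an analyzed field buckets by
    token, a not_analyzed (raw) field buckets by the whole value."""
    counts = Counter()
    for value in values:
        if analyzed:
            counts.update(value.lower().split())  # "Santa Ana" -> santa, ana
        else:
            counts[value] += 1
    return dict(counts)

cities = ["Santa Ana", "Anaheim"]
assert term_buckets(cities, analyzed=True) == {"santa": 1, "ana": 1, "anaheim": 1}
assert term_buckets(cities, analyzed=False) == {"Santa Ana": 1, "Anaheim": 1}
```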
UPDATE
The repetition is a behaviour of the top_hits aggregation.
Check that nice tutorial:
https://www.elastic.co/blog/top-hits-aggregation
When solely using the top_hits aggregation, it just repeats what is
already in the regular hits in the response.
Actually, analysis has nothing to do with it, so the explanation below is not accurate.
With default settings, Elasticsearch splits the input into so-called terms. The default analyzer transforms "Santa Ana" into 2 terms, [santa, ana], and when searching for ana, Santa Ana will also match.
You can read about how Elasticsearch works here:
https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
