Elasticsearch autocomplete searching middle word - elasticsearch

I've been stuck on this for a while.
How can I get Elasticsearch suggestions to complete my phrase even when I type a word from the middle of it?
For example, my data contains "Alan Turing is great", and when I start typing "turi" I would like to see the suggestion "Alan Turing is great".
I am using Elasticsearch v6.3.2 and I tried queries similar to these:
curl -X GET "http://127.0.0.1:9200/my_index/_search" -H 'Content-Type: application/json' -d '{"_source":false,"suggest":{"show-suggest":{"prefix":"turi","completion":{"field":"auto_suggest"}}}}'
or
curl -X GET "http://127.0.0.1:9200/my_index/_search" -H 'Content-Type: application/json' -d '{"_source":false,"suggest":{"show-suggest":{"text":"turi","completion":{"field":"auto_suggest"}}}}'
but it only works if I search for "alan", and then it shows all the terms.
index:
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
"mappings": {
"poielement": {
"numeric_detection": false,
"date_detection": false,
"dynamic_templates": [
{
"suggestions": {
"match": "suggest_*",
"mapping": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "my_analyzer",
"copy_to": "auto_suggest",
"store": true
}
}
},
{
"property": {
"match": "*",
"mapping": {
"analyzer": "my_analyzer",
"search_analyzer": "my_analyzer"
}
}
}
],
"properties": {
"auto_suggest": {
"type": "completion"
},
"name_suggest": {
"type": "completion"
}
}
}
}

We had an almost identical use case, and this is how we solved it. What you are looking for is substring search.
Please create a custom substring analyzer for your field; the Java code for it is below:
// Split on whitespace, lowercase, then emit all substrings of each token.
// SubstringFilter is our custom TokenFilter; minSize is the minimum substring length.
TokenStream result = new WhitespaceTokenizer(SearchManager.LUCENE_VERSION_301, reader);
result = new LowerCaseFilter(SearchManager.LUCENE_VERSION_301, result);
result = new SubstringFilter(result, minSize);
return result;
In the above code I first use the WhitespaceTokenizer, then pass the stream through a LowerCaseFilter, and finally through my custom SubstringFilter, which is configurable via the minimum number of characters you want in your tokens.
The above code generates a lot of tokens for a string like helloworld if you set the minimum substring length to 3.
Here is a public URL with the tokens it generates for the string helloworld with a minimum substring length of 3:
https://justpaste.it/4i6gh
You can also test the tokens that your custom analyzer generates using the _analyze API, https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
http://localhost:9200/jaipur/_analyze?text=helloworld&analyzer=substring
Here jaipur is my index name and helloworld is the string for which I want to generate tokens using the substring analyzer.
EDIT
As suggested by Nishant in the comments, you can use the ngram token filter, which Elasticsearch provides out of the box, instead of the custom substring filter.
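For reference, here is a minimal sketch of what such a setup could look like with the built-in ngram token filter. It is only an illustration, not the original poster's mapping: auto_suggest is redefined as a plain text field instead of a completion field, the gram sizes are arbitrary, and index.max_ngram_diff is raised so min_gram and max_gram may differ by more than 1 in 6.x:
curl -X PUT "http://127.0.0.1:9200/my_index" -H 'Content-Type: application/json' -d '{
  "settings": {
    "index": {
      "max_ngram_diff": 10
    },
    "analysis": {
      "filter": {
        "substring_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "substring": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "substring_filter"]
        }
      }
    }
  },
  "mappings": {
    "poielement": {
      "properties": {
        "auto_suggest": {
          "type": "text",
          "analyzer": "substring",
          "search_analyzer": "standard"
        }
      }
    }
  }
}'
A regular match query for "turi" on auto_suggest should then return "Alan Turing is great", because the indexed grams of "turing" include "turi".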

Related

Elasticsearch - searching for punctuation terms over both text and keyword fields

Using Elasticsearch 7, I'm trying to use a simple query string query to search over different fields, both text and keyword. Here's a minimal, reproducible example to show the initial setup and the problem:
mapping.json:
{
  "dynamic": false,
  "properties": {
    "publicId": {
      "type": "keyword"
    },
    "eventDate": {
      "type": "date",
      "format": "yyyy-MM-dd",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "name": {
      "type": "text"
    }
  }
}
test-data1.json:
{
  "publicId": "a1b2c3",
  "eventDate": "2022-06-10",
  "name": "Research & Development"
}
test-data2.json:
{
  "publicId": "d4e5f6",
  "eventDate": "2021-05-11",
  "name": "F.inance"
}
Create index on ES running on localhost:19200:
#!/bin/bash -e
host=${1-localhost:19200}
dir=$( dirname `readlink -f $0` )
mapping=$(<${dir}/mapping.json);
param="{ \"mappings\": $mapping}"
curl -XPUT "http://${host}/test/" -H 'Content-Type: application/json' -d "$param"
curl -XPOST "http://${host}/test/_doc/a1b2c3" -H 'Content-Type: application/json' -d #${dir}/test-data1.json
curl -XPOST "http://${host}/test/_doc/d4e5f6" -H 'Content-Type: application/json' -d #${dir}/test-data2.json
Now the task is to support searches like "Research & Development", "Research & Development 2022-06-10", "Finance" (note the removed dot) or simply "a1b2c3". For example using a query like this:
{
  "from": 0,
  "size": 20,
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "query": "Research & Development 2022-06-10",
            "fields": [
              "publicId^1.0",
              "eventDate.keyword^1.0",
              "name^1.0"
            ],
            "flags": -1,
            "default_operator": "and",
            "analyze_wildcard": false,
            "auto_generate_synonyms_phrase_query": true,
            "fuzzy_prefix_length": 0,
            "fuzzy_max_expansions": 50,
            "fuzzy_transpositions": true,
            "boost": 1.0
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  },
  "version": true
}
The problem with this setup is that the standard analyzer for the text field that removes most punctuation of course also removes the ampersand character. The simple query string query splits the query into three tokens [research, &, development] and searches over all fields using the and operator. There are two matches ("Research" and "Development") for the name text field, but no matches for the ampersand in any field. Thus the result is empty.
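You can see this directly by comparing the two analyzers with the _analyze API (a quick illustration, not part of the original setup):
curl -XPOST "http://localhost:19200/_analyze" -H 'Content-Type: application/json' -d '{
  "analyzer": "standard",
  "text": "Research & Development"
}'
# tokens: [research, development] - the ampersand is gone
curl -XPOST "http://localhost:19200/_analyze" -H 'Content-Type: application/json' -d '{
  "analyzer": "whitespace",
  "text": "Research & Development"
}'
# tokens: [Research, &, Development]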
Now I came up with a solution to add a second field for name with a different analyzer, the whitespace analyzer, that doesn't remove punctuation:
{
  "dynamic": false,
  "properties": {
    "publicId": {
      "type": "keyword"
    },
    "eventDate": {
      "type": "date",
      "format": "yyyy-MM-dd",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "name": {
      "type": "text",
      "fields": {
        "whitespace": {
          "type": "text",
          "analyzer": "whitespace"
        }
      }
    }
  }
}
This way all searches work, including "Finance" that matches for "F.inance" for the name field. Also, "Research & Development" matches for the name field and for name.whitespace, but most crucially & matches for name.whitespace and therefore returns a result.
My question now is: given the fact that the real setup includes many more fields and a lot of data, adding an additional field and therefore indexing most terms in the same way twice seems quite heavy. Is there a way to only index analyzed terms to name.whitespace that differ from the standard analyzer's terms of name, i.e. that are not in the "parent" field? E.g. "Research & Development" results in the terms [research, development] for name and [research, development, &] for name.whitespace - ideally it would only index [&] for name.whitespace.
Or is there a more elegant/performant solution for this particular problem altogether?
I guess you can define a dynamic property mapping for all string fields and use the whitespace analyzer, since your use case requires searching on non-standard tokens. In addition, you can explicitly specify those fields in the mapping where you don't need the whitespace analyzer.
This would ensure that already-mapped fields are analyzed using the standard analyzer while the others (dynamic or unmapped fields) are analyzed using whitespace, reducing the complexity, field duplication, etc.
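A rough sketch of what such a mapping could look like in Elasticsearch 7 (note that dynamic mapping has to be enabled, unlike the "dynamic": false in the original mapping; the explicit properties are whichever fields should keep the standard analyzer; the index name test2 is a placeholder):
curl -XPUT "http://localhost:19200/test2" -H 'Content-Type: application/json' -d '{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_whitespace": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "analyzer": "whitespace"
          }
        }
      }
    ],
    "properties": {
      "publicId": { "type": "keyword" },
      "name": { "type": "text" }
    }
  }
}'
With this, dynamically added string fields are analyzed with the whitespace analyzer, while the explicitly mapped name field keeps the standard analyzer.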

Elasticsearch - Do searches for alternative country codes

I have a document with a field called 'countryCode'. I have a term query that searches on its keyword value. But I'm having some issues with:
Some records saying UK and some other saying GB
Some records saying US and some other USA
And the list goes on..
Can I instruct my index to handle all those variations somehow, instead of me having to expand the terms on my query filter?
What you are looking for is a way to have your tokens match similar tokens that may or may not share the same characters. This is only possible using synonyms.
Elasticsearch lets you configure synonyms and have your queries use them and return results accordingly.
I have configured a field with a custom analyzer that uses the synonym token filter. I have created a sample mapping and query so that you can play with it and see if it fits your needs.
Mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "usa, us",
            "uk, gb"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
Sample Document
POST my_index/mydocs/1
{
  "name": "uk is pretty cool country"
}
And when you use the query below, it returns the above document as well.
Query
GET my_index/mydocs/_search
{
  "query": {
    "match": {
      "name": "gb"
    }
  }
}
Refer to the official documentation to understand more about synonyms. Hope this helps!
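If you want to check what the analyzer produces, you could run it through the _analyze API (a quick sanity check, not part of the original answer):
GET my_index/_analyze
{
  "analyzer": "my_synonyms",
  "text": "gb"
}
It should return both gb and uk at the same position, which is why the match query above finds the document.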
To handle this within ES itself without using Logstash, I'd suggest a simple ingest pipeline with a gsub processor to update the field in place:
{
  "gsub": {
    "field": "countryCode",
    "pattern": "GB",
    "replacement": "UK"
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/master/gsub-processor.html
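For context, here is a minimal sketch of how that processor could be wired into an ingest pipeline and applied at index time (the pipeline name, the second processor, and the index/type names are only illustrative):
PUT _ingest/pipeline/normalize_country_code
{
  "description": "Normalize countryCode values to a single variant",
  "processors": [
    {
      "gsub": {
        "field": "countryCode",
        "pattern": "GB",
        "replacement": "UK"
      }
    },
    {
      "gsub": {
        "field": "countryCode",
        "pattern": "USA",
        "replacement": "US"
      }
    }
  ]
}

PUT my_index/mydocs/1?pipeline=normalize_country_code
{
  "countryCode": "GB"
}
The stored document would then contain "countryCode": "UK".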

Elasticsearch highlighter false positives

I am using an nGram tokenizer in ES 6.1.1 and getting some weird highlights:
multiple adjacent character ngram highlights are not merged into one
tra is incorrectly highlighted in doc 9
The query auftrag matches documents 7 and 9 as expected, but in doc 9 betrag is highlighted incorrectly. That's a problem with the highlighter; if the problem were with the query, doc 8 would also have been returned.
Example code
#!/usr/bin/env bash
# Example based on
# https://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html
# with suggestions from
# https://github.com/elastic/elasticsearch/issues/21000
# DELETE INDEX IF EXISTS
curl -sS -XDELETE 'localhost:9200/my_index'
printf '\n-------------\n'
# CREATE NEW INDEX
curl -sS -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "trigrams": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "3",
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "trigrams",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}
'
printf '\n-------------\n'
# POPULATE INDEX
curl -sS -XPOST 'localhost:9200/my_index/my_type/_bulk?pretty' -H 'Content-Type: application/json' -d'
{ "index": { "_id": 7 }}
{ "text": "auftragen" }
{ "index": { "_id": 8 }}
{ "text": "betrag" }
{ "index": { "_id": 9 }}
{ "text": "betrag auftragen" }
'
printf '\n-------------\n'
sleep 1 # Give ES time to index
# QUERY
curl -sS -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "text": {
        "query": "auftrag",
        "minimum_should_match": "100%"
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {
        "fragment_size": 120,
        "type": "fvh"
      }
    }
  }
}
'
The hits I get are (abbreviated):
"hits" : [
{
"_id" : "9",
"_source" : {
"text" : "betrag auftragen"
},
"highlight" : {
"text" : [
"be<em>tra</em>g <em>auf</em><em>tra</em>gen"
]
}
},
{
"_id" : "7",
"_source" : {
"text" : "auftragen"
},
"highlight" : {
"text" : [
"<em>auf</em><em>tra</em>gen"
]
}
}
]
I have tried various workarounds, such as using the unified/fvh highlighter and setting all options that seemed relevant, but no luck. Any hints are greatly appreciated.
The problem here is not with highlighting but with how you are using the nGram analyzer.
First of all, when you configure the mapping this way:
"mappings": {
"my_type": {
"properties": {
"text": {
"type" : "text",
"analyzer" : "trigrams",
"term_vector": "with_positions_offsets"
}
}
}
}
you are telling Elasticsearch that you want to use this analyzer both for the indexed text and for the search term. In your case, this simply means that:
your text from document 9 = "betrag auftragen" is split into trigrams, so in the index you have something like: [bet, etr, tra, rag, auf, uft, ftr, tra, rag, age, gen]
your text from document 7 = "auftragen" is split into trigrams, so in the index you have something like: [auf, uft, ftr, tra, rag, age, gen]
your search term = "auftrag" is also split into trigrams and Elasticsearch sees it as: [auf, uft, ftr, tra, rag]
In the end Elasticsearch matches all the trigrams from the search term against those in your index, and because of this you have 'auf' and 'tra' highlighted separately. 'uft', 'ftr', and 'rag' also match, but they overlap 'auf' and 'tra' and are not highlighted.
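You can verify this with the _analyze API against the index created above:
curl -sS -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "trigrams",
  "text": "auftrag"
}
'
The response lists exactly the trigrams [auf, uft, ftr, tra, rag].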
The first thing you need to do is tell Elasticsearch that you do not want to split the search term into grams. All you need to do is add the search_analyzer property to your mapping:
"mappings": {
"my_type": {
"properties": {
"text": {
"type" : "text",
"analyzer" : "trigrams",
"search_analyzer": "standard",
"term_vector" : "with_positions_offsets"
}
}
}
}
Now the words from a search term are treated by the standard analyzer as separate words, so in your case it will be just "auftrag".
But this single change will not help you. It will even break the search, because "auftrag" does not match any trigram in your index.
Now you need to improve your nGram tokenizer by increasing max_gram:
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "3",
"max_gram": "10",
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation"
]
}
}
This way the texts in your index will be split into 3-grams, 4-grams, 5-grams, 6-grams, 7-grams, 8-grams, 9-grams, and 10-grams. Among the 7-grams you will find "auftrag", which is your search term.
After these two improvements, the highlighting in your search result should look like this:
"betrag <em>auftrag</em>en"
for document 9 and:
"<em>auftrag</em>en"
for document 7.
This is how ngrams and highlighting work together. I know the ES documentation says:
It usually makes sense to set min_gram and max_gram to the same value. The smaller the length, the more documents will match but the lower the quality of the matches. The longer the length, the more specific the matches. A tri-gram (length 3) is a good place to start.
This is true. For performance reasons you need to experiment with this configuration, but I hope I have explained how it works.
I had the same problem with an ngram (trigram) tokenizer and got an incomplete highlight like:
query with `match`: samp
field data: sample
result highlight: <em>sam</em>ple
expected highlight: <em>samp</em>le
Use match_phrase, and use the fvh highlight type with the field's term_vector set to with_positions_offsets; this should produce the correct highlight:
<em>samp</em>le
I hope this helps, as you do not need to change the tokenizer or increase max_gram.
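For reference, a minimal sketch of such a query against the reproducible example above (keeping the original trigram analyzer for both indexing and search, and relying on the term_vector already set to with_positions_offsets):
curl -sS -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_phrase": {
      "text": "auftrag"
    }
  },
  "highlight": {
    "fields": {
      "text": {
        "type": "fvh"
      }
    }
  }
}
'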
But my problem is that I want to use simple_query_string, which does not support phrase matching on the default field; the only way is to wrap the string in quotes like "samp", but since the query string already contains logic I can't do that for users, and I can't require users to do it either.
The solution from #piotr-pradzynski may not help me because I have a lot of data, and increasing max_gram would lead to a lot of storage usage.

How to denormalize hierarchy in ElasticSearch?

I am new to ElasticSearch and I have a tree which describes a path to a certain document (not real filesystem paths, just simple text fields categorizing articles, images, and documents alike). Each path entry has a type, like: Group Name, Assembly Name, or even Unknown. The types could be used in queries to skip certain entries in the path, for example.
My source data is stored in SQL Server; the schema looks something like this:
The tree is built up by connecting Tree.Id to Tree.ParentId, but each node must have a type. The Documents are connected to a leaf in the Tree.
I am not worried about querying the structure in SQL Server, but I need to find a good approach to denormalize and search it in Elastic. If I flatten the paths and make a list of "descriptors" for a document, I can store each of the Document entries as an Elastic document:
{
  "path": "NodeNameRoot/NodeNameLevel_1/NodeNameLevel_2/NodeNameLevel_3/NodeNameLevel_4",
  "descriptors": [
    {
      "name": "NodeNameRoot",
      "type": "type1"
    },
    {
      "name": "NodeNameLevel_1",
      "type": "type1"
    },
    {
      "name": "NodeNameLevel_2",
      "type": "type2"
    },
    {
      "name": "NodeNameLevel_3",
      "type": "type2"
    },
    {
      "name": "NodeNameLevel_4",
      "type": "type3"
    }
  ],
  "document": {
    ...
  }
}
Can I query such a structure in ElasticSearch? Or should I denormalize the paths in a different way?
My main questions:
Can I query them based on type or text value (regex matching, for example)? For example: give me all the type2->type3 paths (practically leaving type1 out) where the path contains X.
Is it possible to query based on levels? For example, I would like the paths where there are 4 descriptors.
Can I do the searching with the built-in functionality or do I need to write an extension?
Edit
Based on G Quintana's answer, I made an index like this:
curl -X PUT \
  http://localhost:9200/test \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
  "mappings": {
    "path": {
      "properties": {
        "names": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "tokens": {
              "type": "text",
              "analyzer": "pathname_analyzer"
            },
            "depth": {
              "type": "token_count",
              "analyzer": "pathname_analyzer"
            }
          }
        },
        "types": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "tokens": {
              "type": "text",
              "analyzer": "pathname_analyzer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "pathname_analyzer": {
          "type": "pattern",
          "pattern": "#->>",
          "lowercase": true
        }
      }
    }
  }
}'
And I could query the depth like this:
curl -X POST \
  http://localhost:9200/test/path/_search \
  -H 'content-type: application/json' \
  -d '{
  "query": {
    "bool": {
      "should": [
        { "match": { "names.depth": 5 } }
      ]
    }
  }
}'
This returns correct results. I will test it a little more.
First of all you should identify all your query patterns to design how you will index your data.
From the example you gave, I would index documents of the form:
{
  "path": "NodeNameRoot/NodeNameLevel_1/NodeNameLevel_2/NodeNameLevel_3/NodeNameLevel_4",
  "types": "type1/type1/type2/type2/type3",
  "document": {
    ...
  }
}
Before indexing, you must configure mapping and analysis:
Field path:
use type text + an analyzer based on the pattern analyzer to split at / characters
use type token_count + the same analyzer to compute the path depth; create a multi-field (path.depth)
Field types:
use type text + an analyzer based on the pattern analyzer to split at / characters
Once the index mappings and analysis split the path and types fields this way, your queries map onto them directly (see the sketch after this answer):
"Give me all the type2->type3 paths": use a match_phrase query on the types field
"where the path contains X": use a match query on the path field
"where there are 4 descriptors": use a term query on the path.depth sub-field
Your descriptors field is not needed.
The path_hierarchy tokenizer might be interesting for some use cases.
You can apply multiple analyzers to the same field using multi-fields and then query the sub-fields.
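To make that concrete, here is a minimal sketch of such a combined query. It assumes a mapping along the lines described above: a path field analyzed with a pattern analyzer that splits on /, a path.depth token_count sub-field, and a types field analyzed the same way; the index name and example values are placeholders:
curl -X POST \
  http://localhost:9200/my_index/_search \
  -H 'content-type: application/json' \
  -d '{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "types": "type2/type3" } },
        { "match": { "path": "NodeNameLevel_3" } },
        { "term": { "path.depth": 5 } }
      ]
    }
  }
}'
The phrase "type2/type3" is analyzed with the same pattern analyzer as the indexed types value, so it matches consecutive type2, type3 entries in the path.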

How to handle wildcards in elastic search structured queries

My use case requires querying our Elasticsearch domain with trailing wildcards. I wanted to get your opinion on the best practices for handling such wildcards in queries.
Do you think adding the following clauses is a good practice for the queries:
"query" : {
"query_string" : {
"query" : "attribute:postfix*",
"analyze_wildcard" : true,
"allow_leading_wildcard" : false,
"use_dis_max" : false
}
}
I've disallowed leading wildcards since they are a heavy operation. However, I wanted to know how costly analyzing wildcards on every query request is in the long run. My understanding is that analyze_wildcard has no impact if the query doesn't actually contain any wildcards. Is that correct?
If you have the possibility of changing your mapping type and index settings, the right way to go is to create a custom analyzer with an edge-n-gram token filter that would index all prefixes of the attribute field.
curl -XPUT http://localhost:9200/your_index -d '{
  "settings": {
    "analysis": {
      "filter": {
        "edge_filter": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 15
        }
      },
      "analyzer": {
        "attr_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_filter"]
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "attribute": {
          "type": "string",
          "analyzer": "attr_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}'
Then, when you index a document, the attribute field value, e.g. postfixing, will be indexed as the following tokens: p, po, pos, post, postf, postfi, postfix, postfixi, postfixin, postfixing.
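If you want to double-check those tokens, you can run the analyzer through the _analyze API in the same style as shown earlier in this page (your_index and attr_analyzer come from the setup above):
curl -XGET 'http://localhost:9200/your_index/_analyze?analyzer=attr_analyzer&text=postfixing&pretty'
It should list exactly the prefixes above.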
Finally, you can then easily query the attribute field for the postfix value using a simple match query like this. No need to use an under-performing wildcard in a query string query.
{
  "query": {
    "match" : {
      "attribute" : "postfix"
    }
  }
}
