How do I configure Elasticsearch to find substrings at the beginning OR at the end of a word (but not in middle)? - elasticsearch

I'm starting to learn Elasticsearch and now I am trying to write my first analyser configuration. What I want to achieve is that substrings are found if they are at the beginning or ending of a word. If I have the word "stackoverflow" and I search for "stack" I want to find it and when I search for "flow" I want to find it, but I do not want to find it when searching for "ackov" (in my use case this would not make sense).
I know there is the "Edge n gram tokenizer", but one analyser can only have one tokenizer and the edge n-gram can either be front or back (but not both at the same time).
And if I understood correctly, applying both version of the "Edge ngram filter" (front and back) to the analyzer, then I would not find either, because both filters need to return true, isn't it? Because "stack" wouldn't be in the ending of the word, so the back edge n gram filter would return false and the word "stackoverflow" would not be found.
So, how do I configure my analyzer to find substrings either in the end or in the beginning of a word, but not in the middle?

What can be done is to define two analyzers, one for matching at the start of a string and another to match at the end of a string. In the index settings below, I named the former one prefix_edge_ngram_analyzer and the latter one suffix_edge_ngram_analyzer. Those two analyzers can be applied to a multi-field string field to the text.prefix sub-field, respectively to the text.suffix string field.
{
"settings": {
"analysis": {
"analyzer": {
"prefix_edge_ngram_analyzer": {
"tokenizer": "prefix_edge_ngram_tokenizer",
"filter": ["lowercase"]
},
"suffix_edge_ngram_analyzer": {
"tokenizer": "keyword",
"filter" : ["lowercase","reverse","suffix_edge_ngram_filter","reverse"]
}
},
"tokenizer": {
"prefix_edge_ngram_tokenizer": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "25"
}
},
"filter": {
"suffix_edge_ngram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 25
}
}
}
},
"mappings": {
"test_type": {
"properties": {
"text": {
"type": "string",
"fields": {
"prefix": {
"type": "string",
"analyzer": "prefix_edge_ngram_analyzer"
},
"suffix": {
"type": "string",
"analyzer": "suffix_edge_ngram_analyzer"
}
}
}
}
}
}
}
Then let's say we index the following test document:
PUT test_index/test_type/1
{ "text": "stackoverflow" }
We can then search either by prefix or suffix using the following queries:
# input is "stack" => 1 result
GET test_index/test_type/_search?q=text.prefix:stack OR text.suffix:stack
# input is "flow" => 1 result
GET test_index/test_type/_search?q=text.prefix:flow OR text.suffix:flow
# input is "ackov" => 0 result
GET test_index/test_type/_search?q=text.prefix:ackov OR text.suffix:ackov
Another way to query with the query DSL:
POST test_index/test_type/_search
{
"query": {
"multi_match": {
"query": "stack",
"fields": [ "text.*" ]
}
}
}
UPDATE
If you already have a string field, you can "upgrade" it to a multi-field and create the two required sub-fields with their analyzers. The way to do this would be to do this in order:
Close your index in order to create the analyzers
POST test_index/_close
Update the index settings
PUT test_index/_settings
{
"analysis": {
"analyzer": {
"prefix_edge_ngram_analyzer": {
"tokenizer": "prefix_edge_ngram_tokenizer",
"filter": ["lowercase"]
},
"suffix_edge_ngram_analyzer": {
"tokenizer": "keyword",
"filter" : ["lowercase","reverse","suffix_edge_ngram_filter","reverse"]
}
},
"tokenizer": {
"prefix_edge_ngram_tokenizer": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "25"
}
},
"filter": {
"suffix_edge_ngram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 25
}
}
}
}
Re-open your index
POST test_index/_open
Finally, update the mapping of your text field
PUT test_index/_mapping/test_type
{
"properties": {
"text": {
"type": "string",
"fields": {
"prefix": {
"type": "string",
"analyzer": "prefix_edge_ngram_analyzer"
},
"suffix": {
"type": "string",
"analyzer": "suffix_edge_ngram_analyzer"
}
}
}
}
}
You still need to re-index all your documents in order for the new sub-fields text.prefix and text.suffix to be populated and analyzed.

Related

Achieving literal text search combined with subword matching in Elasticsearch

I have populated an Elasticsearch database using the following settings:
mapping = {
"properties": {
"location": {
"type": "text",
"analyzer": "ngram_analyzer"
},
"description": {
"type": "text",
"analyzer": "ngram_analyzer"
},
"commentaar": {
"type": "text",
"analyzer": "ngram_analyzer"
},
}
}
settings = {
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"ngram_filter"
]
}
}
}
},
"mappings": {"custom_tool": mapping}
}
I used the ngram analyser because I wanted the be able to have subword matching. So a search for "ackoverfl" would return the entries containing "stackoverflow".
My search queries are made as follows:
q = {
"simple_query_string": {
"query": needle,
"default_operator": "and",
"analyzer": "whitespace"
}
}
Where needle is the text from my search bar.
Sometimes I would also like to do literal phrase searching. For example:
If my search term is:
"the ap hangs in the tree"
(Notice that I use quotation marks here with the intention the search for a literal piece of text).
Then in my results I get a document containing:
the apple hangs in the tree
This results is unwanted.
How could I implement having a subword matching search capability while also having the option to search for literal phrases (by using for example quotation marks) ?

Configure highlighted part in the elasticsearch

Main question
The user is looking for a name and enters the part of the it, let's say au, and the document with the text paul is found.
I would like to have the doc highlighted like p<em>au</em>l.
How can I achieve it if I have a complex search query (combination of match, prefix, wildcard to rule relevance)?
Sub question
When do highlight settings from documentation for type, boundary_scanner and boundary_chars come into play? As per my tests described below, these settings don't change highlighted part.
Try 1: Wildcard query with default analyzer
PUT myindex
{
"mappings": {
"properties": {
"name": {
"type": "text",
"term_vector": "with_positions_offsets"
}
}
}
}
POST myindex/_doc/1
{
"name": "paul"
}
GET myindex/_search
{
"query": {
"wildcard": {"name": "*au*"}
},
"highlight": {
"fields": {
"name": {}
},
"type": "fvh",
"boundary_scanner": "chars",
"boundary_chars": "abcdefghijklmnopqrstuvwxyz.,!? \t\n"
}
}
This kind of search returns highlight <em>paul</em> but I need to get p<em>au</em>l.
Try 2: Match query with NGRAM analyzer
This one works as described in SO question: Highlighting part of word in elasticsearch
PUT myindexngram
{
"settings": {
"analysis": {
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "3",
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"index_ngram_analyzer": {
"type": "custom",
"tokenizer": "ngram_tokenizer",
"filter": [
"lowercase"
]
},
"search_term_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "index_ngram_analyzer",
"term_vector": "with_positions_offsets"
}
}
}
}
POST myindexngram/_doc/1
{
"name": "paul"
}
GET myindexngram/_search
{
"query": {
"match": {"name": "au"}
},
"highlight": {
"fields": {
"name": {}
}
}
}
This highlights p<em>au</em>l as desired but:
Highlighting depends on the query type, so combining match and wildcard will again result in <em>paul</em>.
Highlighting is not affected at all on type, boundary_scanner and boundary_chars settings.
Elastic version 7.13.4
Response from Elasticsearch team:
A highlighter works on terms, so only full terms can be highlighted - whatever are the terms in your index. In your second example, au could be highlighted, because it it a term in the index, which is not the case for your first example.
There is also an option to define your own highlight_query that could be different from the main query, but this could lead to unpredictable highlights.
https://discuss.elastic.co/t/configure-highlighted-part/295164

Why is my elastic search prefix query case-sensitive despite using lowercase filters on both index and search?

The Problem
I am working on an autocompleter using ElasticSearch 6.2.3. I would like my query results (a list of pages with a Name field) to be ordered using the following priority:
Prefix match at start of "Name" (Prefix query)
Any other exact (whole word) match within "Name" (Term query)
Fuzzy match (this is currently done on a different field to Name using a ngram tokenizer ... so I assume cannot be relevant to my problem but I would like to apply this on the Name field as well)
My Attempted Solution
I will be using a Bool/Should query consisting of three queries (corresponding to the three priorities above), using boost to define relative importance.
The issue I am having is with the Prefix query - it appears to not be lowercasing the search query despite my search analyzer having the lowercase filter. For example, the below query returns "Harry Potter" for 'harry' but returns zero results for 'Harry':
{ "query": { "prefix": { "Name.raw" : "Harry" } } }
I have verified using the _analyze API that both my analyzers do indeed lowercase the text "Harry" to "harry". Where am I going wrong?
From the ES documentation I understand I need to analyze the Name field in two different ways to enable use of both Prefix and Term queries:
using the "keyword" tokenizer to enable the Prefix query (I have applied this on a .raw field)
using a standard analyzer to enable the Term (I have applied this on the Name field)
I have checked duplicate questions such as this one but the answers have not helped
My mapping and settings are below
ES Index Mapping
{
"myIndex": {
"mappings": {
"pages": {
"properties": {
"Id": {},
"Name": {
"type": "text",
"fields": {
"raw": {
"type": "text",
"analyzer": "keywordAnalyzer",
"search_analyzer": "pageSearchAnalyzer"
}
},
"analyzer": "pageSearchAnalyzer"
},
"Tokens": {}, // Other fields not important for this question
}
}
}
}
}
ES Index Settings
{
"myIndex": {
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"keywordAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "keyword"
},
"pageSearchAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "standard"
},
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "l2AXoENGRqafm42OSWWTAg",
"version": {}
}
}
}
}
Prefix queries don't analyze the search terms, so the text you pass into it bypasses whatever would be used as the search analyzer (in your case, the configured search_analyzer: pageSearchAnalyzer) and evaluates Harry as-is directly against the keyword-tokenized, custom-filtered harry potter that was the result of the keywordAnalyzer applied at index time.
In your case here, you'll need to do one of a few different things:
Since you're using a lowercase filter on the field, you could just always use lowercase terms in your prefix query (using application-side lowercasing if necessary)
Run a match query against an edge_ngram-analyzed field instead of a prefix query like described in the ES search_analyzer docs
Here's an example of the latter:
1) Create the index w/ ngram analyzer and (recommended) standard search analyzer
PUT my_index
{
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"pages": {
"properties": {
"name": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "pageIndexAnalyzer",
"search_analyzer": "standard"
}
}
}
}
}
}
}
2) Index some sample docs
POST my_index/pages/_bulk
{"index":{}}
{"name":"Harry Potter"}
{"index":{}}
{"name":"Hermione Granger"}
3) Run the a match query against the ngram field
POST my_index/pages/_search
{
"query": {
"match": {
"query": "Har",
"operator": "and"
}
}
}
I think it is better to use match_phrase_prefix query without using .keyword suffix. Check the docs at here https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html

Elasticsearch query returning false results when term exceeds ngram length

The requirement is to search partial phrases in a block of text. Most of the words will be standard length. I want to keep the max_gram value down to 10. But there may be the occasional id/code with more characters than that, and these show up if I type in a query where the first 10 characters match, but then the rest don't.
For example, here is the mapping:
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete"
}
}
}
}
}
and document:
POST my_index/doc/1
{
"title": "Quick fox with id of ABCDEFGHIJKLMNOP"
}
If I run the query:
POST my_index/doc/_search
{
"query": {
"match_phrase": {
"title": {
"query": "fox wi"
}
}
}
}
It returns the document as expected. However, if I run this:
POST my_index/doc/_search
{
"query": {
"match_phrase": {
"title": {
"query": "ABCDEFGHIJxxx"
}
}
}
}
It also returns the document, when it shouldn't. It will do this if the x's are after the 10th character, but not before it. How can I avoid this?
I am using version 5.
By default, the analyzer that is used at index time is the same analyzer that is used at search time, meaning the edge_ngram analyzer is used on your search term. This is not what you want. You will end up with 10 tokens as the search terms, none of which contain those last 3 characters.
You will want to take a look at the Search Analyzer for your mapping. This documentation points out this specific use case:
Sometimes, though, it can make sense to use a different analyzer at search time, such as when using the edge_ngram tokenizer for autocomplete.
The standard analyzer may suit your needs:
{
...
"mappings": {
"doc": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}

elasticsearch: giving score to ngrams

I am trying to use elasticsearch to do a name search matching using ngrams,
The technique I am trying to implement is as follow:
input: a name that needs to be match to the db.
output: all potential name matching from my db of names.
The way I try to do that is as follow, I split the name to ngrams with length of 3-5.
I then collect all the names that match those ngrams from the db.
Then I go over the ngrams and sort them by there reverse frequency,
meaning that common ngrams will get the lowest score.
for example, if I use it on a company name like "my company inc" I will give the "inc" ngram the lowest score because inc appears in a lot of company names.
The way I calculate the score is by doing: 1/(count appearences of the ngram in all my db), that way I will have the "strongest" ngrams as the ones that appear the least.
I implemented this in a python script, but I want to use the power of elastic to do the same for me,
I know about the ngram tokenizer, but is there a way to tell him to do the score I do?
As far as I know, when I do a matching now, it will score the result by how much of the ngrams in the query match the ngrams in the word he has in the db
this is the mapping I use:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 5,
"token_chars": ["letter", "digit"]
}
}
}
},
"mappings": {
"names": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256,
}
},
"analyzer": "my_analyzer"
},
"id": {
"type": "long"
}
}
}
}
}
this is the query I do:
GET /names/_search
{
"query": {
"match" : { "name" : "my company inc"}
}
}
The query that you would want to use is this:
{
"query": {
"common": {
"name": {
"query": "my company inc",
"cutoff_frequency": 0.001
}
}
}
}
Common terms query returns the relevance score based only on important terms (important nGrams) i.e. terms with less frequency. Here, the words that have a document frequency greater than 0.1% will be considered as common words and will not affect the relevance score.
Alternatively, if you already have a predefined list of stopwords (inc, pvt, ltd), then you can always use a custom stop words filter in your analyzer to filter them out for generating hits.
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"filter": [
"custom_stop_token_filter"
]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 5,
"token_chars": ["letter", "digit"]
}
},
"filter": {
"custom_stop_token_filter": {
"type": "stop",
"stopwords": [
"inc",
"pvt",
"ltd"
]
}
}
}
}
}
For more info:
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-common-terms-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html

Resources