Search for parts of a string in the _id field of an existing Elasticsearch index

Hi,
I am working with an existing Elasticsearch index, trying to search for a string in the _id field.
The _id in this index consists of two concatenated strings, and I need to be able to search for the second part of that string.
After reading the documentation I found that I should probably use an ngram filter to search for a substring, but I can't make this work properly.
I found an example online from someone who was trying to do the same, so I updated my index with the following:
PUT /"myIndex"
{"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"partial_filter": {
"type": "ngram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"partial": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"partial_filter"
]
}
}
}
}}
Then I tried to add this mapping:
PUT /myIndex/_mapping/type2
{
  "type2": {
    "properties": {
      "_id": {
        "type": "string",
        "analyzer": "partial"
      }
    }
  }
}
That gives me an exception: "Rejecting mapping update to [bci_report_provider_s_dev-"myIndex"] as the final mapping would have more than 1 type: [type2, bci-report]"
How can I resolve this, and is there another way to do a partial search on the _id field?
Thanks a lot in advance!
Bjørn Olav Berg

Related

Why is my Elasticsearch prefix query case-sensitive despite using lowercase filters on both index and search?

The Problem
I am working on an autocompleter using Elasticsearch 6.2.3. I would like my query results (a list of pages with a Name field) to be ordered using the following priority:
Prefix match at start of "Name" (Prefix query)
Any other exact (whole word) match within "Name" (Term query)
Fuzzy match (this is currently done on a different field from Name using an ngram tokenizer ... so I assume it cannot be relevant to my problem, but I would like to apply this on the Name field as well)
My Attempted Solution
I will be using a Bool/Should query consisting of three queries (corresponding to the three priorities above), using boost to define relative importance.
The issue I am having is with the Prefix query: it appears not to lowercase the search term, despite my search analyzer having the lowercase filter. For example, the query below returns "Harry Potter" for 'harry' but returns zero results for 'Harry':
{ "query": { "prefix": { "Name.raw" : "Harry" } } }
I have verified using the _analyze API that both my analyzers do indeed lowercase the text "Harry" to "harry". Where am I going wrong?
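For illustration, this is the shape of the _analyze request in question (analyzer names are taken from the settings below); both analyzers produce the single token "harry":
GET myIndex/_analyze
{
  "analyzer": "pageSearchAnalyzer", <---- same check done with keywordAnalyzer
  "text": "Harry"
}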
From the ES documentation I understand I need to analyze the Name field in two different ways to enable use of both Prefix and Term queries:
using the "keyword" tokenizer to enable the Prefix query (I have applied this on a .raw field)
using a standard analyzer to enable the Term query (I have applied this on the Name field)
I have checked duplicate questions such as this one but the answers have not helped
My mapping and settings are below
ES Index Mapping
{
  "myIndex": {
    "mappings": {
      "pages": {
        "properties": {
          "Id": {},
          "Name": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "text",
                "analyzer": "keywordAnalyzer",
                "search_analyzer": "pageSearchAnalyzer"
              }
            },
            "analyzer": "pageSearchAnalyzer"
          },
          "Tokens": {} // Other fields not important for this question
        }
      }
    }
  }
}
ES Index Settings
{
  "myIndex": {
    "settings": {
      "index": {
        "analysis": {
          "filter": {
            "ngram": {
              "type": "edgeNGram",
              "min_gram": "2",
              "max_gram": "15"
            }
          },
          "analyzer": {
            "keywordAnalyzer": {
              "filter": [
                "trim",
                "lowercase",
                "asciifolding"
              ],
              "type": "custom",
              "tokenizer": "keyword"
            },
            "pageSearchAnalyzer": {
              "filter": [
                "trim",
                "lowercase",
                "asciifolding"
              ],
              "type": "custom",
              "tokenizer": "standard"
            },
            "pageIndexAnalyzer": {
              "filter": [
                "trim",
                "lowercase",
                "asciifolding",
                "ngram"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "l2AXoENGRqafm42OSWWTAg",
        "version": {}
      }
    }
  }
}
Prefix queries don't analyze the search term, so the text you pass in bypasses whatever is configured as the search analyzer (in your case, search_analyzer: pageSearchAnalyzer). Harry is evaluated as-is, directly against the keyword-tokenized, custom-filtered harry potter that keywordAnalyzer produced at index time.
In your case, you'll need to do one of a few different things:
Since you're using a lowercase filter on the field, you could always use lowercase terms in your prefix query (lowercasing application-side if necessary) - see the sketch right after this list
Run a match query against an edge_ngram-analyzed field instead of a prefix query, as described in the ES search_analyzer docs
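A minimal sketch of the former, reusing the prefix query from the question with the term pre-lowercased:
{ "query": { "prefix": { "Name.raw": "harry" } } }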
Here's an example of the latter:
1) Create the index w/ ngram analyzer and (recommended) standard search analyzer
PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "ngram": {
            "type": "edgeNGram",
            "min_gram": "2",
            "max_gram": "15"
          }
        },
        "analyzer": {
          "pageIndexAnalyzer": {
            "filter": [
              "trim",
              "lowercase",
              "asciifolding",
              "ngram"
            ],
            "type": "custom",
            "tokenizer": "keyword"
          }
        }
      }
    }
  },
  "mappings": {
    "pages": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "pageIndexAnalyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
2) Index some sample docs
POST my_index/pages/_bulk
{"index":{}}
{"name":"Harry Potter"}
{"index":{}}
{"name":"Hermione Granger"}
3) Run a match query against the ngram field
POST my_index/pages/_search
{
  "query": {
    "match": {
      "name.ngram": {
        "query": "Har",
        "operator": "and"
      }
    }
  }
}
I think it is better to use the match_phrase_prefix query, without the .keyword suffix. Check the docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html
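A quick sketch of that approach (reusing the sample index from the answer above; the field name is assumed):
POST my_index/pages/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": "Har" <---- assumes the my_index/pages sample from above
    }
  }
}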

Empty value generates mapper_parsing_exception for Elasticsearch completion suggester field

I have a name field mapped as a completion suggester, and indexing generates a mapper_parsing_exception error stating that the value must have a length > 0.
There are indeed some empty values in this field. How do I accommodate them?
ignore_malformed had no effect, either at the properties or index level.
I tried filtering out empty strings in the analyzer by setting a minimum length:
PUT /genes
{
  "settings": {
    "analysis": {
      "filter": {
        "remove_empty": {
          "type": "length",
          "min": 1
        }
      },
      "analyzer": {
        "keyword_lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "remove_empty"
          ]
        }
      }
    }
  },
  "mappings": {
    "gene": {
      "properties": {
        "name": {
          "type": "completion",
          "analyzer": "keyword_lowercase"
        }
      }
    }
  }
}
Or filter empty strings as a stopword:
"remove_empty": {
"type": "stop",
"stopwords": [""]
}
Attempting to apply a filter to the name mapping generates an unsupported parameter error:
"mappings": {
"gene": {
"name": {
"type": "completion",
"analyzer": "keyword_lowercase",
"filter": "remove_empty"
}
}
}
}
This sure feels like it ought to be simple. Is there a way to do this?
Thanks!
I have faced the same issue. After some research, it seems to me that currently the only option is to change the data (e.g. replace empty values with some dummy non-empty value) before indexing.
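A minimal sketch of that workaround (the placeholder value is purely illustrative):
PUT genes/gene/1
{
  "name": "unknown" <---- hypothetical dummy value substituted for an empty string before indexing
}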
But there is also good news: this issue was reported on GitHub and resolved about a month ago. The fix is planned for release in version 6.4.0.

Multi-field search for synonym in the query string

It looks like Elasticsearch does not take field analyzers into account for a multi-field search with a query string query when no field is specified.
Can this be configured on the index, or specified in the query?
Here is a hands-on example.
Given files from commit (spring-data-elasticsearch).
There is a test SynonymRepositoryTests, which will pass with QueryBuilders.queryStringQuery("text:british") and QueryBuilders.queryStringQuery("british").analyzer("synonym_analyzer") queries.
Is it possible to make it pass with the QueryBuilders.queryStringQuery("british") query, without specifying a field or an analyzer?
You can query without specifying fields or analyzers. By default, the query string query runs against the _all field, which is a combination of all fields and uses the standard analyzer, so QueryBuilders.queryStringQuery("british") will work.
You can exclude some fields from _all when creating the index, and you can also build a custom all-field with the copy_to functionality, as sketched below.
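A rough sketch of the copy_to route (the combined field name is hypothetical, and synonym_analyzer is assumed to be defined in your index settings):
PUT text_index
{
  "mappings": {
    "test_type": {
      "properties": {
        "name": { "type": "string", "copy_to": "combined" },
        "tag": { "type": "string", "copy_to": "combined" },
        "combined": { "type": "string", "analyzer": "synonym_analyzer" } <---- hypothetical custom all-field
      }
    }
  }
}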
UPDATE
You would have to set your custom analyzer on the _all field when creating the index.
PUT text_index
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "prefix_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "trim",
            "edge_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "_all": {
        "enabled": true,
        "analyzer": "prefix_analyzer" <---- your synonym analyzer
      },
      "properties": {
        "name": {
          "type": "string"
        },
        "tag": {
          "type": "string",
          "analyzer": "simple"
        }
      }
    }
  }
}
You can replace prefix_analyzer with your synonym_analyzer and then it should work.
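With that in place, a field-less query string query should go through your _all analyzer; a quick sketch (assuming the index above):
POST text_index/_search
{
  "query": {
    "query_string": {
      "query": "british" <---- no field specified, so it targets _all
    }
  }
}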

Elasticsearch match query with partial text match

Newbie question on Elasticsearch. I have set up an Elasticsearch index and am searching for names that contain some term, such as
search_response = es.search(index = 'sample', body = {'query':{'match':{'first_name':"JUST"}}})
This does not return me the name "JUSTIN" but the following query does
search_response = es.search(index = 'sample', body = {'query':{'match':{'first_name':"JUSTIN"}}})
What am I doing wrong? Shouldn't the "match" query return the records that contain the term?
Thanks.
The best way to handle that need is to create a custom analyzer which uses the edgeNGram token filter. Forget about wildcards and * in query strings; those all underperform the edgeNGram approach.
So you'd have to create your index like this first and then reindex your data into it.
curl -XPUT http://localhost:9200/sample -d '{
  "settings": {
    "analysis": {
      "filter": {
        "prefixes": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 15
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "prefixes"]
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "first_name": {
          "type": "string",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}'
Then when indexing first_name: JUSTIN, you'll get the following indexed tokens: j, ju, jus, just, justi, justin, basically all prefixes of JUSTIN.
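You can double-check the produced tokens with the _analyze API; something along these lines (console syntax, request shape assumed for this ES version):
GET sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "JUSTIN" <---- expect: j, ju, jus, just, justi, justin
}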
You'll then be able to search with your first query and actually find what you expect.
search_response = es.search(index = 'sample', body = {'query':{'match':{'first_name':'JUST'}}})

How do I configure Elasticsearch to find substrings at the beginning OR at the end of a word (but not in middle)?

I'm starting to learn Elasticsearch and am now trying to write my first analyzer configuration. What I want to achieve is that substrings are found if they are at the beginning or end of a word. If I have the word "stackoverflow" and I search for "stack" I want to find it, and when I search for "flow" I want to find it, but I do not want to find it when searching for "ackov" (in my use case this would not make sense).
I know there is the edge n-gram tokenizer, but one analyzer can only have one tokenizer, and the edge n-gram can be anchored either at the front or at the back (but not both at the same time).
And if I understood correctly, applying both versions of the edge n-gram filter (front and back) to the analyzer means I would find neither, because both filters would need to match, right? "stack" is not at the end of the word, so the back edge n-gram filter would reject it and the word "stackoverflow" would not be found.
So, how do I configure my analyzer to find substrings either at the end or at the beginning of a word, but not in the middle?
What can be done is to define two analyzers, one for matching at the start of a string and another for matching at the end. In the index settings below, I named the former prefix_edge_ngram_analyzer and the latter suffix_edge_ngram_analyzer. The two analyzers are applied to a multi-field string field: the former to the text.prefix sub-field and the latter to the text.suffix sub-field.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "prefix_edge_ngram_analyzer": {
          "tokenizer": "prefix_edge_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "suffix_edge_ngram_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "reverse", "suffix_edge_ngram_filter", "reverse"]
        }
      },
      "tokenizer": {
        "prefix_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25"
        }
      },
      "filter": {
        "suffix_edge_ngram_filter": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "text": {
          "type": "string",
          "fields": {
            "prefix": {
              "type": "string",
              "analyzer": "prefix_edge_ngram_analyzer"
            },
            "suffix": {
              "type": "string",
              "analyzer": "suffix_edge_ngram_analyzer"
            }
          }
        }
      }
    }
  }
}
Then let's say we index the following test document:
PUT test_index/test_type/1
{ "text": "stackoverflow" }
We can then search either by prefix or suffix using the following queries:
# input is "stack" => 1 result
GET test_index/test_type/_search?q=text.prefix:stack OR text.suffix:stack
# input is "flow" => 1 result
GET test_index/test_type/_search?q=text.prefix:flow OR text.suffix:flow
# input is "ackov" => 0 result
GET test_index/test_type/_search?q=text.prefix:ackov OR text.suffix:ackov
Another way to query with the query DSL:
POST test_index/test_type/_search
{
"query": {
"multi_match": {
"query": "stack",
"fields": [ "text.*" ]
}
}
}
UPDATE
If you already have a string field, you can "upgrade" it to a multi-field and create the two required sub-fields with their analyzers. The steps, in order, are:
Close your index in order to create the analyzers
POST test_index/_close
Update the index settings
PUT test_index/_settings
{
  "analysis": {
    "analyzer": {
      "prefix_edge_ngram_analyzer": {
        "tokenizer": "prefix_edge_ngram_tokenizer",
        "filter": ["lowercase"]
      },
      "suffix_edge_ngram_analyzer": {
        "tokenizer": "keyword",
        "filter": ["lowercase", "reverse", "suffix_edge_ngram_filter", "reverse"]
      }
    },
    "tokenizer": {
      "prefix_edge_ngram_tokenizer": {
        "type": "edgeNGram",
        "min_gram": "2",
        "max_gram": "25"
      }
    },
    "filter": {
      "suffix_edge_ngram_filter": {
        "type": "edgeNGram",
        "min_gram": 2,
        "max_gram": 25
      }
    }
  }
}
Re-open your index
POST test_index/_open
Finally, update the mapping of your text field
PUT test_index/_mapping/test_type
{
  "properties": {
    "text": {
      "type": "string",
      "fields": {
        "prefix": {
          "type": "string",
          "analyzer": "prefix_edge_ngram_analyzer"
        },
        "suffix": {
          "type": "string",
          "analyzer": "suffix_edge_ngram_analyzer"
        }
      }
    }
  }
}
You still need to re-index all your documents in order for the new sub-fields text.prefix and text.suffix to be populated and analyzed.
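If your ES version has the update-by-query API (2.3+), one way to do that in place, without re-sending documents from the source, is a no-op update-by-query, sketched here:
POST test_index/_update_by_query?conflicts=proceed
This re-indexes each existing document onto itself, which populates and analyzes the new sub-fields.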
