Field norm calculation on Elasticsearch array fields

Here's the mapping for one of the fields in my index:
"resourceId": {
"type": "string",
"index_analyzer": "partial_match",
"search_analyzer": "lowercase",
"include_in_all": true
}
Here are the custom analyzers used in the index:
"analysis": {
"filter": {
"partial_match_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 50
}
},
"analyzer": {
"partial_match": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"partial_match_filter"
]
},
"lowercase": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
This field will contain an array of strings, which are the multiple IDs that a resource can have (it can have multiple IDs due to different systems calling each resource by a different id).
Now let's suppose that resource #1 has three IDs:
resourceId: ["ID:MATCH", "MATCH", "ID:ALT"]
And that resource #2 has only one ID:
resourceId: ["ID:MATCHFIVE"]
And let's suppose that we run this query against my index:
{
    "from": 0,
    "size": 30,
    "query": {
        "query_string": {
            "query": "resourceId:ID\\:MATCH"
        }
    }
}
What I'd like is for resource #1 to show up first, since its array contains an exact match. However, resource #2 is the one coming out on top.
When I used the explain parameter on the query request, I saw that the tf and idf scores were the same for both resources. However, the norm score was lower for resource #1.
My theory is that since resource #1 has three items in the array (which I assume are concatenated together during indexing), the field is considered larger, and thus the norm value is decreased. When it comes to resource #2, it has only one item (and it's shorter than the concatenation of the other array), so the norm is higher, bumping the resource to the top.
My question, therefore, is: when calculating the score, is it possible for the norm calculation to only consider the size of the item that matched in the array?
For example: the search for "ID:MATCH" would find the exact match on resource #1 on resourceId[0]. At this point, all other items in the array would be put aside and the norm would be calculated based on that single item (resourceId[0]), showing a perfect match. As for resource #2, the norm would be lower, since the resourceId field would be larger.
If this isn't possible, would there be workarounds to get the exact match to the top? Or maybe I'm completely off on my theory?
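In Lucene the norm is stored once per field per document, so it is computed over all of the array's values together rather than per item. A commonly suggested workaround (a sketch only; the subfield name, analyzer choice, and boost value are assumptions, not from the post) is to also index the IDs into a subfield that is not n-grammed and boost exact matches on it:
"resourceId": {
    "type": "string",
    "index_analyzer": "partial_match",
    "search_analyzer": "lowercase",
    "include_in_all": true,
    "fields": {
        "exact": {
            "type": "string",
            "analyzer": "lowercase"
        }
    }
}

{
    "from": 0,
    "size": 30,
    "query": {
        "bool": {
            "should": [
                { "query_string": { "query": "resourceId:ID\\:MATCH" } },
                { "term": { "resourceId.exact": { "value": "id:match", "boost": 10 } } }
            ]
        }
    }
}
Because a term query is not analyzed, the value is lowercased by hand here to match what the whitespace-plus-lowercase analyzer produces at index time; a match query on resourceId.exact would do that automatically.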

Related

Elasticsearch combined_fields query with synonym_graph token filter

I'm trying to use a combined_fields query with a synonym_graph search-time token filter in Elasticsearch. When I query for a multi-term phrase in my synonym file, Elasticsearch seems to switch from "or" logic to "and" logic between my original terms, with no apparent way to configure it. Here's an example Elasticsearch query that has been exaggerated for demonstration purposes:
GET /products/_search
{
    "query": {
        "bool": {
            "should": [
                {
                    "combined_fields": {
                        "query": "boxes other rectangle hinged lid hook cutout",
                        "operator": "or",
                        "minimum_should_match": 1,
                        "fields": [
                            "productTitle^9",
                            "fullDescription^5"
                        ],
                        "auto_generate_synonyms_phrase_query": false
                    }
                }
            ]
        }
    }
}
When I submit the query on my index with an empty synonyms.txt file, it returns >1000 hits. As expected, the top hits contain all or many of the terms in the query, and the result set is composed of all documents that contain any of the terms. However, when I add this line to the synonyms.txt file:
black spigot, boxes other rectangle hinged lid hook cutout
the query only returns 4 hits. These hits either contain all of the terms in my query across the queried fields, or both the terms "black" and "spigot".
My conclusion is that the presence of the phrase in the synonyms file is influencing how the "non-synonym-replaced" phrase is being searched for. This seems counterintuitive - adding a phrase to the synonyms file should only possibly increase the number of results that a search for that exact phrase produces, right?
Does anyone know what I'm doing incorrectly, or if my expectations are reliant upon some fundamental misunderstanding of how Elasticsearch works? I observe the same behavior when I use a multi-match query or an array of match queries, and I've tried every combination of query options that I reasonably think might resolve the problem.
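One way to diagnose this (a sketch, not from the original post) is to run the query text through the search analyzer with the _analyze API and inspect the token graph that the synonym_graph filter produces:
GET /products/_analyze
{
    "analyzer": "searchAnalyzer",
    "text": "boxes other rectangle hinged lid hook cutout"
}
With the multi-term synonym rule in place, the filter emits tokens whose positions span the whole phrase, and the query parser builds a graph query from those paths; that is the likely reason the original terms end up being treated as a unit rather than as independent "or" clauses.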
For reference, here is my analyzer configuration:
"analysis": {
"analyzer": {
"indexAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"porter_stem",
"stop"
]
},
"searchAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"porter_stem",
"stop",
"productSynonym"
]
}
},
"filter": {
"productSynonym": {
"type": "synonym_graph",
"synonyms_path": "analysis/synonyms.txt"
}
}
}

Elastic search with Java: exclude matches with random leading characters in a letter

I am new to using Elasticsearch. I managed to get things working somewhat close to what I intended. I am using the following configuration.
{
    "analysis": {
        "filter": {
            "shingle_filter": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 3,
                "output_unigrams": true,
                "token_separator": ""
            },
            "autocomplete_filter": {
                "type": "edge_ngram",
                "min_gram": 1,
                "max_gram": 20
            }
        },
        "analyzer": {
            "shingle_search": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase"
                ]
            },
            "shingle_index": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "shingle_filter",
                    "autocomplete_filter"
                ]
            }
        }
    }
}
I have this applied over multiple fields and I am running a multi-match query.
Following is the Java code:
NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
    .withQuery(QueryBuilders.multiMatchQuery(i)
        .field("title")
        .field("alias")
        .fuzziness(Fuzziness.ONE)
        .type(MultiMatchQueryBuilder.Type.BEST_FIELDS))
    .build();
The problem is that it also matches fields where my search term appears with leading characters.
For example, if my search input is "ron" I want it to match "ron mathews", but I don't want it to match "iron". How can I make sure that I only match terms with no leading characters?
Update-1
Turning off fuzzy transposition seems to improve search results. But I think we can make it better.
You probably want to score "ron" higher than "ronaldo", and an exact match on the complete field ("ron") even higher, so the best option here would be to use a few subfields with standard and keyword analyzers and boost those fields in your multi_match query.
Also, as you figured out yourself, be careful with fuzziness. It might make sense to run two queries in a should clause, one fuzzy and one boosted, so that exact matches are ranked higher, as sketched below.
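A rough sketch of that suggestion (the subfield names, index name, and boost values are illustrative assumptions, not from the original answer): add standard and keyword subfields in the mapping, then combine a fuzzy query with a boosted exact-ish query in a should clause.
"title": {
    "type": "text",
    "analyzer": "shingle_index",
    "search_analyzer": "shingle_search",
    "fields": {
        "std": { "type": "text", "analyzer": "standard" },
        "raw": { "type": "keyword" }
    }
}

GET /my_index/_search
{
    "query": {
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "query": "ron",
                        "fields": ["title", "alias"],
                        "fuzziness": 1
                    }
                },
                {
                    "multi_match": {
                        "query": "ron",
                        "fields": ["title.std^3", "alias.std^3", "title.raw^10", "alias.raw^10"]
                    }
                }
            ]
        }
    }
}
The same subfields would be added to alias; the boosts just need to rank whole-word and whole-field matches above edge_ngram matches.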

Elasticsearch terms aggregation performance

I have a basic aggregation on an index with about 40 million documents.
{
    aggs: {
        countries: {
            filter: {
                bool: {
                    must: my_filters,
                }
            },
            aggs: {
                filteredCountries: {
                    terms: {
                        field: 'countryId',
                        min_doc_count: 1,
                        size: 15,
                    }
                }
            }
        }
    }
}
The index:
{
    "settings": {
        "number_of_shards": 5,
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter",
                        "unique"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "id": {
                "type": "integer"
            },
            "name": {
                "type": "text",
                "analyzer": "autocomplete",
                "search_analyzer": "standard"
            },
            "countryId": {
                "type": "short"
            }
        }
    }
}
The search response time is 100 ms, but the aggregation response time is about 1.5 s, and it is increasing as we add more documents (it was about 200 ms with 5 million documents). There are about 20 distinct countryId values right now.
What I have tried so far:
Allocating more RAM (from 4 GB to 32 GB): same results.
Changing the countryId field data type to keyword and adding the eager_global_ordinals option: it made things worse.
The Elasticsearch version is 7.8.0; Elasticsearch has 8 GB of RAM, the server has 64 GB of RAM and 16 CPUs, 5 shards, 1 node.
I use this aggregation to build filters for search results, so I need it to respond as fast as possible. For a large number of results I don't need precision, so an approximate count, or even one capped at a number (e.g. 100+), is fine.
Any ideas how to speed up this aggregation?
Reason for the slowness:
Bucket explosion is the reason, and the breadth_first collect mode would speed things up.
As per the docs, you can optimize further with the breadth_first collect mode. The docs' example: even though the number of actors may be comparatively small and we want only 50 result buckets, there is a combinatorial explosion of buckets during calculation - a single actor can produce n² buckets, where n is the number of actors, when finding the 10 most popular actors and their 5 top co-actors.
I would also suggest setting an execution hint. Since you have very few unique values, set execution_hint to map.
Another optimization: if, say, some documents have not been accessed in the last few weeks, you can use a field from your filter to partition the aggregation over a particular set of documents.
Yet another optimization: use include/exclude to filter the aggregation down to only the countries you need, if your use case allows it.
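Applied to the terms aggregation above (the surrounding filter aggregation stays as it is and is omitted here), the two hints mentioned would look roughly like this:
"filteredCountries": {
    "terms": {
        "field": "countryId",
        "min_doc_count": 1,
        "size": 15,
        "collect_mode": "breadth_first",
        "execution_hint": "map"
    }
}
Note that execution_hint is only a hint; Elasticsearch ignores it when it is not applicable, and with only about 20 distinct values any gain has to be measured.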

Can I have multiple filters in an Elasticsearch index's settings?

I want an Elasticsearch index that simply stores "names" of features. I want to be able to issue phonetic queries and also type-ahead style queries separately. I would think I would be able to create one index with two analyzers and two filters; each analyzer could use one of the filters. But I do not seem to be able to do this.
Here is the index settings json I'm trying to use:
{
    "settings": {
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "autocomplete_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["standard", "lowercase", "ngram"]
                }
            },
            "analyzer": {
                "phonetic_analyzer": {
                    "tokenizer": "standard",
                    "filter": "double_metaphone_filter"
                }
            },
            "filter": {
                "double_metaphone_filter": {
                    "type": "phonetic",
                    "encoder": "double_metaphone"
                }
            },
            "filter": {
                "ngram": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 15
                }
            }
        }
    }
}
When I attempt to create an index with these settings:
http://hostname:9200/index/type
I get an HTTP 400, saying
Custom Analyzer [phonetic_analyzer] failed to find filter under name [double_metaphone_filter]
Don't get me wrong, I fully realize what that sentence means. I looked and looked for an erroneous comma or quote but I don't see any. Otherwise, everything is there and formatted correctly.
If I delete the phonetic analyzer, the index is created but ONLY with the autocomplete analyzer and ngram filter.
If I delete the ngram filter, the index is created but ONLY with the phonetic analyzer and phonetic filter.
I have a feeling I'm missing a fundamental concept of ES, like only one analyzer per index, or one filter per index, or I must have some other logical dependencies set up correctly, etc. It sure would be nice to have a logical diagram or complete API spec of the Elasticsearch infrastructure, i.e. any index can have 1..n analyzers, only 1 filter, query must need any one of bool, match, etc. But that unicorn does not seem to exist.
I see tons of documentation, blog posts, etc on how to do each of these functionalities, but with only one analyzer and one filter on the index. I'd really like to do this dual functionality on one index (for reasons out of scope).
Can someone offer some help, advice here?
You are just missing the proper formatting for your settings object. You cannot have two analyzer or filter keys, as there can only be one value per key in this settings map object. Providing a list of your filters seems to work just fine. When you were creating your index object, the second key was overriding the first.
Look here:
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"double_metaphone_filter": {
"type": "phonetic",
"encoder": "double_metaphone"
},
"ngram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
},
"analyzer": {
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["standard", "lowercase", "ngram"]
},
"phonetic_analyzer": {
"tokenizer": "standard",
"filter": "double_metaphone_filter"
}
}
}
}
I downloaded the plugin to confirm this works.
You can now test this out at the _analyze endpoint with a payload:
{
    "analyzer": "autocomplete_analyzer",
    "text": "Jonnie Smythe"
}

Multi-word Term Vectors with Word nGrams?

I'm aiming to build an index that, for each document, will break it down by word ngrams (uni, bi, and tri), then capture term vector analysis on all of those word ngrams. Is that possible with Elasticsearch?
For instance, for a document field containing "The red car drives." I would be able to get the information:
red - 1 instance
car - 1 instance
drives - 1 instance
red car - 1 instance
car drives - 1 instance
red car drives - 1 instance
Thanks in advance!
Assuming you already know about the Term Vectors API, you could apply the shingle token filter at index time to add those word n-grams as independent terms in the token stream.
Set min_shingle_size to 1 (instead of the default of 2) and max_shingle_size to at least 3 (instead of the default of 2).
And since you left "the" out of the expected terms, you should apply a stop-words filter before the shingle filter.
The analyzer settings would be something like this:
{
"settings": {
"analysis": {
"analyzer": {
"evolutionAnalyzer": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"custom_stop",
"custom_shingle"
]
}
},
"filter": {
"custom_stop": {
"type": "stop",
"stopwords": "_english_",
"enable_position_increments":"false"
},
"custom_shingle": {
"type": "shingle",
"min_shingle_size": "1",
"max_shingle_size": "3"
}
}
}
}
}
You can test the analyzer using the _analyze api endpoint.
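For example, something like this (the index name is a placeholder) should list the unigrams, bigrams, and trigrams produced for the sample sentence:
GET /my_index/_analyze
{
    "analyzer": "evolutionAnalyzer",
    "text": "The red car drives."
}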
