Atlas Search Index partial match - mongodb-atlas-search

I have a test collection with these two documents:
{ _id: ObjectId("636ce11889a00c51cac27779"), sku: 'kw-lids-0009' }
{ _id: ObjectId("636ce14b89a00c51cac2777a"), sku: 'kw-fs66-gre' }
I've created a search index with this definition:
{
  "analyzer": "lucene.standard",
  "searchAnalyzer": "lucene.standard",
  "mappings": {
    "dynamic": false,
    "fields": {
      "sku": {
        "type": "string"
      }
    }
  }
}
If I run this aggregation:
[{
  $search: {
    index: 'test',
    text: {
      query: 'kw-fs',
      path: 'sku'
    }
  }
}]
Why do I get 2 results? I only expected the one with sku: 'kw-fs66-gre' 😬

During indexing, the standard analyzer breaks the string "kw-lids-0009" into three tokens, [kw][lids][0009], and similarly tokenizes "kw-fs66-gre" as [kw][fs66][gre]. When you query for "kw-fs", the same analyzer tokenizes the query as [kw][fs], so Lucene matches both documents, since both have the [kw] token in the index.
To get the behavior you're looking for, you should index the sku field as type autocomplete and use the autocomplete operator in your $search stage instead of text.
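A minimal sketch of that approach (the edgeGram tokenization and gram sizes below are illustrative assumptions, not values from the original answer). Index definition:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "sku": {
        "type": "autocomplete",
        "tokenization": "edgeGram",
        "minGrams": 2,
        "maxGrams": 15
      }
    }
  }
}
And the matching query:
[{
  $search: {
    index: 'test',
    autocomplete: {
      query: 'kw-fs',
      path: 'sku'
    }
  }
}]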

You're still getting 2 results because of the tokenization, i.e., you're still matching on [kw] in two documents. If you search for "fs66", you'll get a single match only. Results are scored based on relevance; they are not filtered. You can add {$project: {score: { $meta: "searchScore" }}} to your pipeline and see the difference in score between the matching documents.
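For example, a sketch extending the pipeline from the question (the sku: 1 projection is only there for readability):
[
  { $search: { index: 'test', text: { query: 'kw-fs', path: 'sku' } } },
  { $project: { sku: 1, score: { $meta: "searchScore" } } }
]
Comparing the returned scores shows which document Lucene considers the better match.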
If you are looking to get exact matches only, you can use the keyword analyzer, or a custom analyzer that strips the dashes, so you deal with a single token per field instead of three.
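A sketch of the keyword-analyzer variant, reusing the index definition from the question. With this, each sku value is indexed as a single token, so a text query matches only the full string, e.g. 'kw-fs66-gre':
{
  "analyzer": "lucene.keyword",
  "searchAnalyzer": "lucene.keyword",
  "mappings": {
    "dynamic": false,
    "fields": {
      "sku": {
        "type": "string"
      }
    }
  }
}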

Related

Elasticsearch search result relevance issue

Why does match query return less relevant results first? I have an index field named normalized. Its mapping is:
normalized: {
type: "text"
analyzer: "autocomplete"
}
The settings for this field are:
analysis: {
  filter: {
    autocomplete_filter: {
      type: "edge_ngram",
      min_gram: "1",
      max_gram: "20"
    }
  },
  analyzer: {
    autocomplete: {
      filter: [
        "lowercase",
        "asciifolding",
        "autocomplete_filter"
      ],
      type: "custom",
      tokenizer: "standard"
    }
  }
}
As far as I know, this produces lowercase, ASCII-folded edge-ngram tokens, e.g. MOUSE = m, mo, mou, mous, mouse.
The problem is that a request like:
{
'query': {
'bool': {
'must': {
'match': {
'normalized': 'simag'
}
}
}
}
}
returns results like
"siman siman service"
"mgr simona simunkova simiki"
"Siman - SIMANS"
"simunek simunek a simunek"
.....
But there is no SIMAG, which contains all the letters of the search phrase.
How can I make the most relevant results be the words that contain all the letters of the query, ranked ahead of the tokens that do not contain all of them?
I hope somebody understands what I need.
Thanks.
PS: I am not sure but what about this query:
{
'query': {
'bool': {
'should': [
{'term': {'normalized': 'simag'}},
{'match': {'normalized': 'simag'}}
]
}
}
}
Does it make sense in comparison to the previous code?
Please note that the match query is analyzed: the same analyzer that was used at index time for the field you mention in your query is also applied at query time.
In your case, you applied the autocomplete analyzer to your normalized field and, as you mentioned, it generates the tokens below for MOUSE:
MOUSE = m, mo, mou, mous, mouse
Similarly, if you search for mouse using the match query on the same field, it searches for the query strings m, mo, mou, mous, mouse. Hence results containing words like mousee or mouser also come back, because during indexing those words produced tokens that match the tokens generated from the search term.
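You can check this yourself with the _analyze API (a sketch; my-index stands in for your actual index name):
GET /my-index/_analyze
{
  "analyzer": "autocomplete",
  "text": "mouse"
}
The response lists every token the field produces, which should match the m, mo, mou ... sequence above.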
Read more about the match query on the Elastic site: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html. The first line there already explains your search results:
match queries accept text/numerics/dates, analyzes them, and constructs a query
If you want to go deeper and understand how your search query matched the documents and how the score was computed, use the explain API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html
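A sketch of such an explain call (my-index and the document id 1 are placeholders; the exact URL format differs between Elasticsearch versions):
GET /my-index/_explain/1
{
  "query": {
    "match": {
      "normalized": "simag"
    }
  }
}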

Elasticsearch 6.2: terms query requires lowercase input when searching on keyword

I've created an example index, with the following mapping:
{
  "_doc": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "status": { "type": "keyword" }
    }
  }
}
And indexed a document:
{"status": "CMP"}
When searching the documents with this status with a terms query, I find no results:
{
"query" : {
"terms": { "status": ["CMP"]}
}
}
However, if I make the same query by putting the input in lowercase, I will find my document:
{
"query" : {
"terms": { "status": ["cmp"]}
}
}
Why is that? Since I'm searching on a keyword field, the indexed content should not be analyzed and should match an uppercase value...
@Oliver Charlesworth: not any more. In Elastic 6.x you can keep using a keyword datatype and lowercase your text with a normalizer (see the docs). In any case, you would have to change your index mapping and reindex your docs.
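A sketch of that mapping (the normalizer name lowercase_normalizer is an illustrative choice):
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "status": {
          "type": "keyword",
          "normalizer": "lowercase_normalizer"
        }
      }
    }
  }
}
Since the normalizer is applied both at index time and to term-level queries, both "CMP" and "cmp" should then match.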
The index and mapping creation and the search were part of a test suite. It turned out that the setup part of the test suite was not executed, so the mapping was never applied to the index.
The index was therefore using the default dynamic types instead of the mapped types, resulting in the use of string fields instead of keywords.
After fixing the setup method of the automated tests, the mappings are correctly applied to the index, and the uppercase value "CMP" for the status now matches documents.
The symptoms you're seeing shouldn't occur unless something else is wrong.
A keyword field is not analysed, so your index should contain only CMP. A terms query is also not analysed, so your index is searched only for CMP. Hence there should be a match.
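A sketch reproducing the expected behaviour (the index name statuses is a placeholder; ES 6.x syntax):
PUT /statuses
{
  "mappings": {
    "_doc": {
      "properties": {
        "status": { "type": "keyword" }
      }
    }
  }
}

PUT /statuses/_doc/1
{ "status": "CMP" }

GET /statuses/_search
{
  "query": {
    "terms": { "status": ["CMP"] }
  }
}
With the keyword mapping actually applied, the uppercase terms query returns the document.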

Elasticsearch find missing word in phrase

How can I use Elasticsearch to find a missing word in a phrase? For example, I want to find all documents which contain the pattern make * great again. I tried using a wildcard query, but it returned no results:
{
"fields": [
"file_name",
"mime_type",
"id",
"sha1",
"added_at",
"content.title",
"content.keywords",
"content.author"
],
"highlight": {
"encoder": "html",
"fields": {
"content.content": {
"number_of_fragments": 5
}
},
"order": "score",
"tags_schema": "styled"
},
"query": {
"wildcard": {
"content.content": "make * great again"
}
}
}
If I put in a word and use a match_phrase query, I get results, so I know I have data which matches the pattern.
Which type of query should I use? Or do I need to add some type of custom analyzer to the field?
Wildcard queries operate on terms, so if you use it on an analyzed field, it will actually try to match every term in that field separately. In your case, you can create a not_analyzed sub-field (such as content.content.raw) and run the wildcard query on that. Or just map the actual field to not be analyzed, if you don't need to query it in other ways.
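A sketch of the sub-field approach, in the pre-5.x string/not_analyzed style this answer implies (on current versions you would use a keyword sub-field instead):
"content": {
  "properties": {
    "content": {
      "type": "string",
      "fields": {
        "raw": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
And the query, with leading and trailing wildcards because the raw sub-field now holds the whole text as a single term:
{
  "query": {
    "wildcard": {
      "content.content.raw": "*make * great again*"
    }
  }
}
Note that leading wildcards are expensive, so this can be slow on large indices.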

How to match "prefix" and not whole string in elasticsearch?

I have indexed documents, each with a field: "CodeName" that has values like the following:
document 1 has CodeName: "AAA01"
document 2 has CodeName: "AAA02"
document 3 has CodeName: "AAA03"
document 4 has CodeName: "BBB02"
When I try to use a match query on field:
query: {
  "match": {
    "CodeName": "AAA"
  }
}
I expect to get results for "AAA01" and "AAA02", but instead, I am getting an empty array. When I pass in "AAA01" (I type in the whole thing), I get a result. How do I make it such that it matches more generically? I tried using "prefix" instead of "match" and am getting the same problem.
The mapping for "CodeName" is a "type": "string".
I expect to get results for "AAA01" and "AAA02"
This is not how Elasticsearch works. ES breaks your string into tokens using a tokenizer that you specify. If you didn't specify any tokenizer/analyzer, the default standard tokenizer splits words on spaces, hyphens, etc. In your case, the tokens are stored as "AAA01", "AAA02" and so on. There is no term "AAA", and hence you don't get any results back.
To fix this, you can use the match_phrase_prefix query or set the type of the match query to phrase_prefix. Try this code:
"query": {
"match_phrase_prefix": {
"CodeName": "AAA"
}
}
OR
"query": {
"match": {
"CodeName": {
"query": "AAA",
"type": "phrase_prefix"
}
}
}
Here is the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html. Also pay attention to the max_expansions parameter, as this query can sometimes be slow depending on your data.
Note that for this technique you should go with the default mapping; you don't need to use nGram.
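For instance, to cap the number of prefix expansions (50 is the Elasticsearch default, written out here only as an illustration):
"query": {
  "match_phrase_prefix": {
    "CodeName": {
      "query": "AAA",
      "max_expansions": 50
    }
  }
}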
As far as I know, you should first of all index your data using a tokenizer of type nGram.
You can check the details in the documentation.
From the comments:
I'm familiar with the Symfony way of using Elasticsearch, and we are using it like this:
indexes:
    search:
        client: default
        settings:
            index:
                analysis:
                    analyzer:
                        custom_index_analyzer:
                            type: custom
                            tokenizer: nGram
                            filter: [lowercase, kstem]
                    tokenizer:
                        nGram:
                            type: nGram
                            min_gram: 2
                            max_gram: 20
        types:
            skill:
                mappings:
                    skill.name:
                        search_analyzer: custom_index_analyzer
                        index_analyzer: custom_index_analyzer
                        type: string
                        boost: 1

Search for multiple incomplete words with Elasticsearch

I have a database of records, each of which has a right and a left field, and both these fields contain text. The database is indexed with Elasticsearch.
I want to search through both fields of these records and find the records that contain, in either field, two or more words with certain prefixes. The search should be strict enough to return only the records that contain all the words in the query, not just some of them.
For example, a query qui bro should return the record containing the sentence The quick brown fox jumped over the lazy dog, but not the one containing the sentence The quick fox jumped over the lazy dog
I've seen a description of how to perform prefix queries with Elasticsearch (and can reproduce it when searching for one word in one field).
I've also seen a description of how to perform multi-match queries to search through several fields at once.
But what I need is a combination of these techniques that allows me to search through several fields at once, to look only for parts of words, and to get only those records that have all the words whose parts are contained in the query.
How can I do that? Any method will do (prefixes, ngrams, whatever).
(P.S.: My question may, to a certain extent, be a duplicate of this one, but since it was never answered, I hope I'm not breaking any rules by asking mine.)
======================================
UPDATED:
Oh, I might have the first part of the question. Here is the syntax that seems to work in my Rails app (using elasticsearch-rails gem):
response = Paragraph.search query: {bool: { must: [ { prefix: {right: "qui"}}, {prefix: {right: "bro"}} ] } }
Or, to re-write it in pure Elasticsearch syntax:
{
"bool": {
"must": [
{ "prefix": { "right": "qui" }},
{ "prefix": { "right": "bro" }}
]
}
}
So my updated question now is how to combine this prefix search with a multi_match search (to search through both the right and the left fields).
OK, here is a possible answer that seems to work. The code has to search through multiple fields for several incomplete words and return only the records that contain all these words.
Here is the request written in elasticsearch-rails syntax:
response = Paragraph.search query: {bool: { must: [ { multi_match: { query: "qui", type: "phrase_prefix", fields: ["right", "left"]}}, { multi_match: { query: "brow", type: "phrase_prefix", fields: ["right", "left"]}}]}}
Or, re-written in the syntax that is used on Elasticsearch site:
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "qui",
            "type": "phrase_prefix",
            "fields": ["right", "left"]
          }
        },
        {
          "multi_match": {
            "query": "brow",
            "type": "phrase_prefix",
            "fields": ["right", "left"]
          }
        }
      ]
    }
  }
}
This seems to work. But if somebody has other solutions (particularly if these solutions will make the search case-insensitive), I will be happy to hear them.
