Spring Data MongoDB with text index: difference between matchingAny and matchingPhrase

I am using MongoDB and Spring in an application, with a text index on my collection.
I found two methods:
matchingAny
matchingPhrase
But I am unable to understand the difference between them. Please help me understand.

If you want to match multiple words forming a phrase, use matchingPhrase; if you want to match at least one word from a list of words, use matchingAny.
For example, given these documents (and assuming the title attribute is text-indexed):
{ "id": 1, "title": "The days of the week"}
{ "id": 2, "title": "Once a week"}
{ "id": 3, "title": "Once a month"}
matchingAny("Once") will match the documents with id=2 and id=3
matchingAny("month", "foo' , "bar") will match the document with id=3
matchingPhrase("The days of the week") will match the document with id=1
More details in the docs.

Related

Search score of identical documents changes when nested integer attribute is modified

We stumbled upon this issue today and cannot really understand what is happening.
Suppose we have a really simple index with just two documents inside that have the same contents.
// document 1
{
  "question": "text of the question",
  // nested part
  "answers": [
    {
      "text": "text of first answer",
      "clickscore": 0
    }
  ]
}
// document 2
{
  "question": "text of the question",
  // nested part
  "answers": [
    {
      "text": "text of first answer",
      "clickscore": 0
    }
  ]
}
question and answers.text are Text fields with the same analyzer defined on them. answers is a list with either 1 or many answers inside. clickscore is an Integer field that we will use in the future to boost the relevance of some documents. When we do a search we always look for matches in question and answers.text.
Now the weird part.
document 1 and document 2 have EXACTLY the same content, thus a search on the cluster with text contained in both question and answers.text (for example "text") returns hits with exactly the same score: makes sense.
However, if we update the clickscore of one of the two documents, e.g. by setting document 2's clickscore to 1, and we repeat EXACTLY the same search, then the scores of the documents are NOT the same.
How is this possible? clickscore is just an integer attribute and it should not affect the score of the search, especially since we're only looking for matches in the Text fields...
Apparently the problem is related to the fact that the shard statistics are not updated immediately, and this causes the discrepancy.
If anyone arrives at this question: the only way we found to fix this is to manually perform a flush, e.g. Index('...').flush(), after which the scores are the same again.
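For reference, the flush mentioned above can also be triggered through the REST API; a minimal sketch, assuming a hypothetical index name my-index:

POST /my-index/_flush

After the flush, the per-shard statistics that feed into the score should line up again and both documents return the same score, as described above.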

Elasticsearch: search word forms only

I have a collection of docs, and they have a field tags which is an array of strings. Each string is a word.
Example:
[{
"id": 1,
"tags": [ "man", "boy", "people" ]
}, {
"id": 2,
"tags":[ "health", "boys", "people" ]
}, {
"id": 3,
"tags":[ "people", "box", "boxer" ]
}]
Now I need to query only docs which contain the word "boy" and its forms ("boys" in my example). I do not need Elasticsearch to return doc number 3, because it does not contain a form of "boy".
If I use a fuzzy query I will get all three docs, including doc number 3 which I do not need. As far as I understand, Elasticsearch uses Levenshtein distance to determine whether a doc is relevant or not.
If I use a match query I will get doc number 1 only, not both (1, 2).
I wonder whether there is any way to query docs by word-form matching. Is there a way to make Elasticsearch match "duke", "duchess", "dukes" but not "dikes", "buke", "bike" and so on? This is a more complicated case with "duke", but I need to support such cases as well.
Could this be solved with some specific analyzer settings?
With "word-form matching" I guess you are referring to matching morphological variations of the same word. This could be about addressing plural, singular, case, tense, conjugation etc. Bear in mind that the rules for word variations are language specific
Elasticsearch's implementation of fuzziness is based on the Damerau–Levenshtein distance. It handles mutations (changes, transformations, transpositions) independent of a specific language, solely based on the number if edits.
You would need to change the processing of your strings at indexing and at search time to get the language-specific variations addressed via stemming. This can be achieved by configuring a suitable an analyzer for your field that does the language-specific stemming.
Assuming that your tags are all in English, your mapping for tags could look like:
"tags": {
"type": "text",
"analyzer": "english"
}
As you cannot change the type or analyzer of an existing index, you would need to fix your mapping and then re-index everything.
I'm not sure whether "duke" and "duchess" are considered to be the same word (and therefore addressed by the stemmer). If not, you would need to use a customised analyzer that allows you to configure synonyms.
See also Elasticsearch Reference: Language Analyzers
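Putting the above together, a minimal sketch of the mapping and a query, assuming a hypothetical index name docs and Elasticsearch 7+ mapping syntax:

PUT /docs
{
  "mappings": {
    "properties": {
      "tags": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

GET /docs/_search
{
  "query": {
    "match": { "tags": "boy" }
  }
}

With the english analyzer, "boy" and "boys" are reduced to the same stem at both index and search time, so documents 1 and 2 match while document 3 ("box", "boxer") does not.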

Elasticsearch: Multiple partial words not scored high enough

So I'm trying to get good search results out of an Elasticsearch installation, but I run into problems when I'm trying to make a fuzzy search on some very simple data.
Somehow queries with multiple (some of them partial) words are scored too low, and only get scored higher when more letters of the word are present in the search query.
Let me explain:
I have a simple index built with two simple documents.
{
"name": "Product with good qualities and awesome sound system"
},
{
"name": "Another Product that has better acustics than the other one"
}
Now I query the index with this parameters:
{
  "query": {
    "multi_match": {
      "fields": ["name"],
      "query": "product acust",
      "fuzziness": "auto"
    }
  }
}
And the results look like this:
"hits": [
{
"_index": "test_products",
"_type": "_doc",
"_id": "1",
"_score": 0.19100355,
"_source": {
"name": "Product with good qualities and awesome sound system"
}
},
{
"_index": "test_products",
"_type": "_doc",
"_id": "2",
"_score": 0.17439455,
"_source": {
"name": "Another Product that has better acustics than the other one"
}
}
]
As you can see, the product with ID 2 is scored lower than the other product, even though it is arguably more similar to the given query string: it has 1 full word match and 1 partial word match.
Only when the query looks like "product acusti" do the results start to behave correctly.
I've already fiddled around with bool queries, but the results are identical.
Any ideas how I can get the desired results without the user having to type almost the whole second word?
As far as I know, Elasticsearch does not do partial word matching by default, so the term acust is not matched in either of your documents.
The reason you are getting a higher score for the first document is that your matched term, product, appears in a shorter sentence:
Product with good qualities and awesome sound system
But as for the second document, product appears in a longer sentence:
Another Product that has better acoustics than the other one
So your second document is getting a lower score because the ratio of your matched term (product) to the number of terms in the sentence is lower.
In other words, it has a lower field-length normalization:
norm = 1/sqrt(numFieldTerms)
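Plugging in the two example names (roughly counting whitespace-separated terms, which is what the standard analyzer produces here): document 1 has 8 terms, so norm = 1/sqrt(8) ≈ 0.35, while document 2 has 10 terms, so norm = 1/sqrt(10) ≈ 0.32, which is why document 1 edges ahead on the shared term product.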
Now if you want to be able to do partial prefix matching, you need to tokenize your terms into edge n-grams. For example, you can create the following edge n-grams for the term "acoustics":
"ac", "aco", "acou", "acous", "acoust", "acousti", "acoustic", "acoustics"
You have 2 options to achieve this; see the answer by Russ Cam on this question:
1. Use the Analyze API with an analyzer that will tokenize the field into tokens/terms from which you would want to partial prefix match, and index this collection as the input to the completion field. The Standard analyzer may be a good one to start with...
2. Don't use the Completion Suggester here and instead set up your field (name) as a text datatype with multi-fields that include the different ways that name should be analyzed (or not analyzed, with a keyword sub field for example). Spend some time with the Analyze API to build an analyzer that will allow for partial prefix of terms anywhere in the name. As a start, something like the Standard tokenizer, Lowercase token filter, Edgengram token filter and possibly Stop token filter would get you running...
You may also find this guide helpful.
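As a rough illustration of the second option, here is a minimal sketch using the test_products index from your output, assuming Elasticsearch 7+ syntax (the analyzer and filter names are made up for the example):

PUT /test_products
{
  "settings": {
    "analysis": {
      "filter": {
        "name_edge_ngrams": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      },
      "analyzer": {
        "name_prefix": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "name_edge_ngrams"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "name_prefix",
        "search_analyzer": "standard"
      }
    }
  }
}

With this mapping, "acustics" is indexed as "ac", "acu", ..., "acust", and so on, so the original multi_match query for "product acust" matches both words of document 2 without needing fuzziness, and the partial term no longer drags its score down.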

How to break down a search result with Elasticsearch?

I have documents in my Elasticsearch that represent suppliers; each document is a supplier, and each supplier has branches as well. It looks like this:
{
  "id": 1,
  "supplierName": "John Flower Shop",
  "supplierAddress": "107 main st, Los Angeles",
  "branches": [
    {
      "branchId": 11,
      "branchName": "John Flower Shop New York",
      "branchAddress": "34 5th Ave, New York"
    },
    {
      "branchId": 12,
      "branchName": "John Flower Shop Miami",
      "branchAddress": "56 ragnar st, Miami"
    }
  ]
}
Currently I expose an API that allows searching in the fields supplierName, supplierAddress, branchName and branchAddress.
The use case is a search box on my website that performs a call to the backend and puts the results in a dropdown for the user to choose the supplier.
My issue is that, given the example document above, if you search for "John Flower Shop Miami", the answer will be the whole document, and what will be presented is the top-level supplier name.
What I want is to present "John Flower Shop Miami", and I'm not sure how to tell which part of the result is what actually hit the search....
Has someone had to do something like this before?
Handling relationships in Elasticsearch is a bit of work, but you can do it. I recommend reading the ES guide's chapter on handling relationships to get the big picture.
My advice is then to index your branches as nested documents, so they will be stored as distinct documents in your index.
It will require you to change your query syntax to use nested queries, which can be a pain in the a... but in exchange you get the inner_hits functionality.
It allows you to know which subdocument (nested document) matched your query.
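A minimal sketch of that setup, assuming a hypothetical index name suppliers and the field names from the example document:

PUT /suppliers
{
  "mappings": {
    "properties": {
      "supplierName":    { "type": "text" },
      "supplierAddress": { "type": "text" },
      "branches": {
        "type": "nested",
        "properties": {
          "branchName":    { "type": "text" },
          "branchAddress": { "type": "text" }
        }
      }
    }
  }
}

GET /suppliers/_search
{
  "query": {
    "bool": {
      "should": [
        { "multi_match": { "query": "John Flower Shop Miami", "fields": ["supplierName", "supplierAddress"] } },
        {
          "nested": {
            "path": "branches",
            "query": { "multi_match": { "query": "John Flower Shop Miami", "fields": ["branches.branchName", "branches.branchAddress"] } },
            "inner_hits": {}
          }
        }
      ]
    }
  }
}

Each hit then carries an inner_hits section listing the matching branches, so for this query you could display "John Flower Shop Miami" instead of the top-level supplier name.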

How can I query/filter an elasticsearch index by an array of values?

I have an elasticsearch index with numeric category ids like this:
{
  "id": "50958",
  "name": "product name",
  "description": "product description",
  "upc": "00302590602108",
  "categories": [
    "26",
    "39"
  ],
  "price": "15.95"
}
I want to be able to pass an array of category ids (a parent id with all of its children, for example) and return only results that match one of those categories. I have been trying to get it to work with a term query, but no luck yet.
Also, as a new user of elasticsearch, I am wondering if I should use a filter/facet for this...
ANSWERED!
I ended up using a terms query (as opposed to term). I'm still interested in knowing if there would be a benefit to using a filter or facet.
As you already discovered, a terms query would work. I would suggest a terms filter though, since filters are faster and cacheable.
Facets won't limit the results, but they are excellent tools: they count hits for specific terms within your total results and can be used for faceted navigation.
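On current Elasticsearch versions, the filtered form is expressed as a terms query inside a bool filter clause; a minimal sketch, assuming a hypothetical index name products:

GET /products/_search
{
  "query": {
    "bool": {
      "filter": [
        { "terms": { "categories": ["26", "39"] } }
      ]
    }
  }
}

Because it runs in the filter context, it does not contribute to scoring and its results can be cached, which is the benefit the answer refers to.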
