Why does elastic search analyze a document 2 times? - elasticsearch

From what I've understood, When I index a document say:
PUT <index>/_doc/1
{
"title":"black white fox cat"
}
Elastic search analyzes this via a standard analyzer and turns the title into an array of tokens.
But then when I search for this document let's say
POST <index>/_search
{
"query":
{
"match":
{
"title":"black"
}
}
}
It analyzez again via the same analyzer, isn't that inefficient?

It's not efficient, its necessary step to provide the search results.
let me explain under the hood, how search and index process works.
Index tokenize the text based on data type, and configured analyzer and index the tokens into the inverted index.
Search terms again is tokenised based on the query type(no tokens in case of term family of queries), and search generated tokens into the inverted index created at index time(step-1).
Tokens match process(matching index time tokens in the inverted index to the tokens generated at the query time), is what finds the matches documents and provides the search results, normally this tokens match is a exact string match process, with the exception in some cases like (prefix query, wildcard query etc). and as its a exact string match, its very fast and optimized process.
There are various use-cases, like when you use the keywords data type, text is not analyzed and when you use term level queries search time analysis doesn't happen.
Now, important thing to not is that during search time also same analyzer used at index time, otherwise it would end up generating different token which not produce match in step-3 Described earlier.

Related

How to search exact word in a test in Elastic Search

Let's say I have two texts:
Text 1 - "The fox has been living in the wood cabin for days."
Text 2 - "The wooden hammer is a dangerous weapon."
And I would like to search for the word "wood", without it matching me "wooden hammer". How would I do that in Elastic Search or nest?
Term query is used for exact matches search. However it's not recommended to use it against text fields, the following quote from term query documentation:
To better search text fields, the match query also analyzes your
provided search term before performing a search. This means the match
query can search text fields for analyzed tokens rather than an exact
term.
The term query does not analyze the search term. The term query only
searches for the exact term you provide. This means the term query may
return poor or no results when searching text fields.
The problem with text exact matches, as described in the Term query documentation:
By default, Elasticsearch changes the values of text fields as part of
analysis. This can make finding exact matches for text field values
difficult.
So, the documents data is modified (i.e., analyzed) before indexing. This depends on the index mapping definition for each field, defaults to the default index analyzer, or the standard analyzer.
But the default standard analyzer will not change the token "Wooden" to "Wood", this might happen if you used stemming for this field.
This means, if you don't use a different analyzer or stemming, querying with "Wood" shouldn't match "Wooden" token.
To summarize: Indexed data is modified/analyzed before indexing (based on the field mapping definition). Match query analyze the search query, while Term query doesn't analyze the search query. So you have to properly chose the field mapping and the search query to better suit your use case
For some use cases, like storing email addressed, phone numbers or keyword fields that always have the same value, consider using the Keyword type, which is suitable for exact matches in these use cases. However, ES recommends:
Avoid using keyword fields for full-text search. Use the text field
type instead.
So for better visibility and practical solution for your use case, it's better to elaborate more the field mapping you use and what you want to achieve.

ElasticSearch: term vs match query decision

Being new to ElasticSearch, need help in my understanding.
What I read about term vs match query is that term query is used for exact match and match query is used when we are searching for a term and want result based on a relevancy score.
But if we already defined a mapping for a field as a keyword, why anyone has to decide upon between term vs match, wouldn't it be always a term query in case mapping is defined as a keyword?
What are the use cases where someone will make a match query on the keyword mapping field?
The same confusion is vice versa.
A text field will be analyzed (transformed, split) to generate N tokens, and the keyword itself will become a token with no transformations. At the end, you have N tokens referencing a document.
Then.
By doing a match query, you will treat your query as a text as well, by analyzing it before performing the matching (transforming it), and the term will not.
You can create a field with a term mapping, but then perform a match query on top of it (for example if you want to be case insensitive), and you can create a text mapping for a n-gram and perform a term query to match exactly what you're asking for.

How to query for alternative spellings and representations of words in elasticsearch?

I'm using elasticsearch to query on the theme field in documents. For example:
[
{ theme: 'landcover' },
{ theme: 'land cover' },
{ theme: 'land-cover' },
etc
]
I would like to specify a search of the term landcover that matches all these documents. How do I do this?
So far I've tried using the fuzziness operator in a match search, and also a fuzzy query. However neither of these approaches seems to work, which surprised me because my understanding of fuzzy searches is that they would provide a means of inexact matching.
What am I missing? From the docs I see that fuzziness definitely looks for close approximations to a search term:
When querying text or keyword fields, fuzziness is interpreted as a Levenshtein Edit Distance — the number of one character changes that need to be made to one string to make it the same as another string.
I would consider 'landcover' and 'land cover' to be close. Is this not the case? (this is the first I have heard of Levenshtein Edit Distance so I don't know what extra/less characters mean in terms of this measurement).
An example of a match query that this doesn't seem to work:
{
query: {
match: {
'theme': {
query: 'landcover'
fuzziness: 'AUTO' // I've tried 2, '2', 6, '6', etc.
},
},
},
}
// When the term is 'land-cover' and fuzziness is auto, then 'land cover' is matched. But 'landcover' is not
And an example of a 'fuzzy' query that doesn't seem to work:
{
query: {
fuzzy: {
'theme': {
value: query,
fuzziness: 'AUTO', // Tried other values
},
},
},
}
// When the term is 'land-cover' and fuzziness is auto, then 'landcover' is matched. But 'land cover' is not. So works almost opposite to the match query in this regard
(NOTE - these queries are converted to JSON and do run and return sensible results, just the fuzziness doesn't seem to work as I would have expected)
Looking around StackOverflow, I see some questions that seem to indicate that querying an index is in some way related to how the index is created - i.e. that i cannot just run adhoc queries on any index that already exists and expect results. Is this correct? (sorry - I'm new to elasticsearch and I'm querying an index that already exists).
This answer seems related (how to find near matches for a search term): https://stackoverflow.com/a/55772800/3114742 - mentions that I should do something referred to as 'field mapping' prior to indexing data. but then the example query doesn't include the fuzziness operator. So in this case I'm confused as to what the point of the fuzziness operator is actually for.
Looking more into the documentation I've found the following:
Elasticsearch uses the concept of an 'index' rather than a database. But from the perspective of someone familiar with CouchDB and MongoDB, which are both JSON stores, there is definitely some similarity between a CouchDB database and an Elasticsearch index. Although the elasticsearch index is not an authoritative data storage in itself (it's 'built' from a source of data).
For a given index called, for example, my-index. you can insert JSON strings (documents) into my-index by PUTting to Elasticsearch:
PUT /... '{... json string ...}'
The JSON string can come directly from a JSON store (Mongo, Couch, etc.) or be cobbled together from a variety of sources. I guess.
Elasticsearch will process the document on insert and append to the inverted tree. For text fields this means K:V pairs will be created from JSON document text, with the keys being fragments of the text, and the values being references to where that text fragment is found in the source (the JSON document).
In other words, when inserting documents into an Elasticsearch index, the content is 'analyzed' to create K:V pairs that are added to the index.
I guess, then, that searching Elasticsearch means looking up search terms that are keys in the index, and comparing the values (the source of the key) to the source defined in the search (I think), and returning the source document where a search term is present for a particular field.
So:
Text is analyzed on insertion to an index
Queries are analyzed (using the same analyzer that was used to create the index)
So in my case (as mentioned above) the default analyzer is good enough to create indices that allow for basic fuzzy matching (i.e. in the match query, "land-cover" is matched to "land cover", and in the fuzzy query, "land-cover" is matched to "landcover" - I have no idea why these match differently!)
But to improve on the search results, I think I need to adjust the analyzer / tokenizer both when inserting documents into an index, and for when parsing queries to apply to an index.
My understanding of the analysis/tokenization is that this is the configuration by which inverted indexes are built from source documents. i.e. defining what the keys of the inverted index will be. As far as I can tell there is no magic in searching the index. search terms have to match keys in the inverted index otherwise there will be no results.
I'm still not sure what fuzziness is actually doing in this context.
So in short, querying elasticsearch seems to require a 'holistic perspective' over both how source data is indexed, and how queries are designed.
As a disclaimer,though, I'm not exactly an authoritative answer on this subject with less than one day of elasticsearch experience, so a better answer would still be appreciated!

Elastic search giving strange results

I am following this tutorial on elastic search.
Two employees have 'about' value as:
"about": "I love to go rock climbing"
"about": "I like to collect rock albums"
I run following query:
GET /megacorp/employee/_search {"query":{"match":{"about":"rock coll"}}}
Both above entries are returned, but surprisingly wit same score:
"_score": 0.2876821
Shouldn't the second one must have higher score as it has 'about' value containing both 'rock' and 'coll' while first one only contains 'rock'?
That totally depends on what analyzer you are using. if you are using standard or english analyzer this result is correct. I recommend you to spend some time working with elasticsearch's Analyze API to get familiar how each analyzer affect your text.
By the way, if you want second document to have higher score, take a look at Partial matching.
When we search on a full-text field, we need to pass the query string through the same analysis process as we have when we index a document, to ensure that we are searching for terms in the same form as those that exist in the index.
Analysis process usually consists of normalization and tokenization (the string is tokenized into individual terms by a tokenizer).
As for match Query:
If you run a match query against a full-text field, it will analyze the query string by using the correct analyzer for that field before executing the search. It just looks for the words that are specified.
So, in your match query Elasticsearch will look for occurrences of the whole separate words: rock or/and coll.
Your 2nd document doesn't contain a separate word coll but was matched by the word rock.
Conclusion: the 2 documents are equivalent in their _score value (they were matched by the same word rock)
Elasticsearch analyzes each text field before storing it. The default analyzer (standard analyzer) splits the text based on whitespaces and lowercases it. The output of analysis process is a list of tokens which are used to match your query tokens. If any of the tokens match exactly the relevant document is returned. That's being said, your second document doesn't contain the token col and that's why you are having the same score for both documents.
Even if you build your custom analyzer and use stemming, the word collect won't be stemmed as coll.
You can build custom analyzers in which you can specify that tokens should be of length 1 character, then Elasticsearch will consider each single character as a token and you can search for the existence of any character in your documents.

Difference between Query Context and Filter Context while Querying

What is the difference between the Query Context and the Filter Context in the Elastic Search in Query DSL.
My Understanding is Query Context- How well the document matches the query parameters.
Ex:
{ "match": { "title": "Search" }}
If I am searching for the documents with title 'Search' then if I contains two documents
i)title:"Search"
ii)title:"Search 123"
Then first document is a perfect match and document two is a semi-match. Then the first document is given in the first place and the second document given the second place. Is my understanding correct?
Filter Context:
Ex:
{ "term": { "status": "published" }}
If I am searching for the documents with status 'published' then if I contains two documents
i)status:"published"
ii)status:"published 123"
Then the first document is perfect so it is returned and the second match is not a perfect match so it is not returned. Is my understanding correct?
Basically in Query context, the elastic search scans all the documents and tries to find out how well the documents match the query, means the score will will be calculated for each documents. Where as in filter context,it will just checks whether the documents matches the query or not i.e, only yes or no will be returned. The filter queries does not contribute to the score of the document.
Next coming to the difference between the match and term queries , if you mapped a field to keyword then that field will be not analysed and its inverted index contains the whole term as it is, i.e is if status is mapped to keyword then if you insert "published 123" in status field , then its inverted index contains ["published 123"] and if status is mapped to text then while inserting data to status filed it is analysed for ex: if you insert "published 123" then its inverted index will be ["published","123"].
So whenever you use term query for keyword fields the query string will not be analysed and it tries to find exact term in the inverted index and if you use match query it analyses the query string and it returns all the doc's that contain the one of the analysed string of query in it's inverted index
Your understanding about the difference between term and match queries is correct at the most basic level but like Jettro commented in the filter query you mentioned both the documents will be selected. When doing a term query it really depends what kind of analyzer you are using and how that affects the terms that are stored in inverted index that lucene uses.
To quote an example from the Elasticsearch: Th Definitive Guide "if you were to index ["Foo","Bar"] into an exact value not_analyzed field, or Foo Bar into an analyzed field with the whitespace analyzer, both would result in having the two terms Foo and Bar in the inverted index."
Now under the hood the term query will search all the terms in the inverted index for your query term and even if one of them matches it will be returned as a result.
So in the first case there is only "published" in the inverted index but in the second case too there are both terms "published" and "123", so both documents will be returned as matches.
It also is important to remember that the term query looks in the inverted index for the exact term only; it won’t match any variants like "Published" or "publisheD" with "published".

Resources