How to query for alternative spellings and representations of words in elasticsearch? - elasticsearch

I'm using elasticsearch to query on the theme field in documents. For example:
[
{ theme: 'landcover' },
{ theme: 'land cover' },
{ theme: 'land-cover' },
etc
]
I would like to specify a search of the term landcover that matches all these documents. How do I do this?
So far I've tried using the fuzziness operator in a match search, and also a fuzzy query. However neither of these approaches seems to work, which surprised me because my understanding of fuzzy searches is that they would provide a means of inexact matching.
What am I missing? From the docs I see that fuzziness definitely looks for close approximations to a search term:
When querying text or keyword fields, fuzziness is interpreted as a Levenshtein Edit Distance — the number of one character changes that need to be made to one string to make it the same as another string.
I would consider 'landcover' and 'land cover' to be close. Is this not the case? (this is the first I have heard of Levenshtein Edit Distance so I don't know what extra/less characters mean in terms of this measurement).
An example of a match query that this doesn't seem to work:
{
query: {
match: {
'theme': {
query: 'landcover',
fuzziness: 'AUTO' // I've tried 2, '2', 6, '6', etc.
},
},
},
}
// When the term is 'land-cover' and fuzziness is auto, then 'land cover' is matched. But 'landcover' is not
And an example of a 'fuzzy' query that doesn't seem to work:
{
query: {
fuzzy: {
'theme': {
value: query,
fuzziness: 'AUTO', // Tried other values
},
},
},
}
// When the term is 'land-cover' and fuzziness is auto, then 'landcover' is matched. But 'land cover' is not. So works almost opposite to the match query in this regard
(NOTE - these queries are converted to JSON and do run and return sensible results, just the fuzziness doesn't seem to work as I would have expected)
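To make the quoted definition concrete, here is a minimal Python sketch (not Elasticsearch code) of how per-term fuzzy matching could produce the two results described above. It assumes theme is a text field analyzed with the default standard analyzer, approximated here by lowercasing and splitting on anything that is not a letter or digit. The helper names analyze, match_query and fuzzy_query are hypothetical, and details such as transpositions and prefix_length are ignored. An inserted or deleted character counts as one edit, and the distance is measured against each indexed term rather than against the original string.

```python
import re

def levenshtein(a, b):
    """Single-character insertions, deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]

def analyze(text):
    """Rough stand-in for the standard analyzer (an approximation, not the real one)."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def auto_fuzziness(term):
    """Edits allowed by fuzziness AUTO: 0 for up to 2 chars, 1 for 3-5, 2 for longer terms."""
    return 0 if len(term) <= 2 else 1 if len(term) <= 5 else 2

def term_matches(term, doc_tokens):
    return any(levenshtein(term, tok) <= auto_fuzziness(term) for tok in doc_tokens)

def match_query(query, doc):
    # match analyzes the query text first; with the default OR operator one term hit is enough
    return any(term_matches(t, analyze(doc)) for t in analyze(query))

def fuzzy_query(value, doc):
    # fuzzy is a term-level query: the value is compared as-is, without analysis
    return term_matches(value, analyze(doc))

docs = ["landcover", "land cover", "land-cover"]
for q in ("landcover", "land-cover"):
    print(q, "match:", [d for d in docs if match_query(q, d)])
    print(q, "fuzzy:", [d for d in docs if fuzzy_query(q, d)])
```

Under these assumptions, 'landcover' is one edit away from the string 'land cover', but the index holds the terms land and cover, which are 5 and 4 edits away, more than AUTO allows. The match query analyzes 'land-cover' into land and cover before matching, while the fuzzy query compares the unanalyzed 'land-cover' with each indexed term, which would explain the opposite results.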
Looking around StackOverflow, I see some questions that seem to indicate that querying an index is in some way related to how the index is created, i.e. that I cannot just run ad hoc queries on any index that already exists and expect results. Is this correct? (Sorry, I'm new to Elasticsearch and I'm querying an index that already exists.)
This answer seems related (how to find near matches for a search term): https://stackoverflow.com/a/55772800/3114742. It mentions that I should do something referred to as 'field mapping' prior to indexing data, but then the example query doesn't include the fuzziness operator. So in this case I'm confused as to what the fuzziness operator is actually for.

Looking more into the documentation I've found the following:
Elasticsearch uses the concept of an 'index' rather than a database. From the perspective of someone familiar with CouchDB and MongoDB, which are both JSON stores, there is definitely some similarity between a CouchDB database and an Elasticsearch index, although the Elasticsearch index is not an authoritative data store in itself (it's 'built' from a source of data).
For a given index called, for example, my-index, you can insert JSON strings (documents) into my-index by PUTting to Elasticsearch:
PUT /... '{... json string ...}'
The JSON string can come directly from a JSON store (Mongo, Couch, etc.) or, I guess, be cobbled together from a variety of sources.
Elasticsearch will process the document on insert and add it to the inverted index. For text fields this means K:V pairs will be created from the JSON document text, with the keys being fragments of the text, and the values being references to where that text fragment is found in the source (the JSON document).
In other words, when inserting documents into an Elasticsearch index, the content is 'analyzed' to create K:V pairs that are added to the index.
I guess, then, that searching Elasticsearch means looking up search terms that are keys in the index, and comparing the values (the source of the key) to the source defined in the search (I think), and returning the source document where a search term is present for a particular field.
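As a rough illustration of that mental model, the following Python sketch builds a toy inverted index over the three example documents and looks query terms up in it. It is only an approximation: the analyze function is a simplified stand-in for the default analyzer, and the names build_inverted_index and search are hypothetical, not part of any Elasticsearch API.

```python
import re
from collections import defaultdict

def analyze(text):
    """Simplified analyzer: lowercase, split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def build_inverted_index(docs):
    """Map each token (key) to the set of document ids (value) whose theme contains it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in analyze(doc["theme"]):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Analyze the query the same way, then look each token up as a key in the index."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return sorted(hits)

docs = {
    1: {"theme": "landcover"},
    2: {"theme": "land cover"},
    3: {"theme": "land-cover"},
}
index = build_inverted_index(docs)

print(dict(index))
print(search(index, "landcover"))
print(search(index, "land cover"))
```

In this toy version the key landcover only points at document 1, while land and cover point at documents 2 and 3, so an exact lookup of one spelling never reaches the other documents.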
So:
Text is analyzed on insertion to an index
Queries are analyzed (using the same analyzer that was used to create the index)
So in my case (as mentioned above) the default analyzer is good enough to create indices that allow for basic fuzzy matching (i.e. in the match query, "land-cover" is matched to "land cover", and in the fuzzy query, "land-cover" is matched to "landcover" - I have no idea why these match differently!)
But to improve on the search results, I think I need to adjust the analyzer / tokenizer both when inserting documents into an index, and for when parsing queries to apply to an index.
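One possible adjustment, sketched here as an assumption and not as a tested solution, is a custom analyzer that removes spaces, hyphens and underscores before the value is indexed as a single lowercase token, so that all three spellings produce the same key. The sketch uses the pattern_replace character filter, the keyword tokenizer and the lowercase token filter. The names theme_joined and strip_separators are made up, the typeless mappings shape is the one used by recent versions, and an existing index would need to be recreated and reindexed for the analyzer to apply.

```python
import json
import re

# Hypothetical index-creation body; analyzer and filter names are illustrative only.
index_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                "strip_separators": {
                    "type": "pattern_replace",
                    "pattern": "[\\s\\-_]+",  # whitespace, hyphens, underscores
                    "replacement": "",
                }
            },
            "analyzer": {
                "theme_joined": {
                    "type": "custom",
                    "char_filter": ["strip_separators"],
                    "tokenizer": "keyword",   # keep the whole value as one token
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {"theme": {"type": "text", "analyzer": "theme_joined"}}
    },
}

def simulate_theme_joined(text):
    """Python approximation of what the analyzer above is meant to emit."""
    return re.sub(r"[\s\-_]+", "", text).lower()

print(json.dumps(index_body, indent=2))
for value in ("landcover", "land cover", "Land-Cover"):
    print(repr(value), "->", simulate_theme_joined(value))
```

Because the same analyzer would also be applied to the query text of a match query, a search for any of the spellings would look up the single key landcover. Fuzziness could still be layered on top to tolerate genuine typos.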
My understanding of analysis/tokenization is that it is the configuration by which inverted indexes are built from source documents, i.e. it defines what the keys of the inverted index will be. As far as I can tell there is no magic in searching the index: search terms have to match keys in the inverted index, otherwise there will be no results.
I'm still not sure what fuzziness is actually doing in this context.
So in short, querying elasticsearch seems to require a 'holistic perspective' over both how source data is indexed, and how queries are designed.
As a disclaimer, though, with less than one day of Elasticsearch experience I'm not exactly an authority on this subject, so a better answer would still be appreciated!

Related

Why does elastic search analyze a document 2 times?

From what I've understood, When I index a document say:
PUT <index>/_doc/1
{
"title":"black white fox cat"
}
Elastic search analyzes this via a standard analyzer and turns the title into an array of tokens.
But then when I search for this document let's say
POST <index>/_search
{
"query":
{
"match":
{
"title":"black"
}
}
}
It analyzes the query again with the same analyzer. Isn't that inefficient?
It's not a question of efficiency; it's a necessary step to provide the search results.
Let me explain how the index and search processes work under the hood.
1. At index time, the text is tokenized based on the field's data type and configured analyzer, and the tokens are stored in the inverted index.
2. At search time, the search terms are tokenized again based on the query type (no tokenization happens for the term family of queries), and the generated tokens are looked up in the inverted index created at index time (step 1).
3. The token-matching process (matching the index-time tokens in the inverted index against the tokens generated at query time) is what finds the matching documents and produces the search results. Normally this is an exact string match, with exceptions such as the prefix and wildcard queries. Because it is an exact string match, it is a very fast and optimized process.
There are variations: when you use the keyword data type, the text is not analyzed, and when you use term-level queries, search-time analysis doesn't happen.
Now, the important thing to note is that the same analyzer used at index time is also used at search time; otherwise it would generate different tokens, which would not produce a match in step 3 described above.
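The three steps can be sketched in a few lines of Python. This is a toy model and not how Elasticsearch is implemented: standard_like and keyword_like are hypothetical stand-ins for an analyzer that lowercases and splits text and for one that leaves the input untouched.

```python
def standard_like(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def keyword_like(text):
    """Toy 'no analysis': the whole input is a single token, unchanged."""
    return [text]

def index_document(text, analyzer):
    """Step 1: tokenize at index time and store the tokens."""
    return set(analyzer(text))

def matches(query, indexed_tokens, analyzer):
    """Steps 2 and 3: tokenize the query, then do exact token lookups."""
    return any(token in indexed_tokens for token in analyzer(query))

indexed = index_document("black white fox cat", standard_like)

# Same analyzer on both sides: "Black" becomes "black" and is found.
print(matches("Black", indexed, standard_like))
# Different analysis at search time: "Black" stays as-is and is not found.
print(matches("Black", indexed, keyword_like))
```

The second lookup fails only because the query-time token differs from the index-time token, which is the mismatch the answer warns about.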

ElasticSearch and Searching in Arrays

We have an ES index with a field that stores its data as an array. In this field, we include the original text, plus the text without any punctuation, special characters, etc. The problem is that, when searching on the field, the multiple values appear to be skewing the score.
For example, if we search on the term 'up', the document which has the array ['up, up and away', 'up up and away'] scores higher with a multi_match (which we are using because we may search more than one field) than the document whose array is simply ['up'].
In the end, I guess what I am looking for is a score that emulates calculating a score for each item in the array and returning me the highest. I believe in this case, comparing 'up' to 'Up' and 'Up, Up and Away' will give me a higher score for 'Up'.
With my research, I believe I may need to do custom scoring on this field...? If that is true, am I looking at "score_mode": "max" as what I want?
I think you have slightly over-engineered your index. You don't need to create duplicate fields for the same information, or remove punctuation and lowercase the text yourself.
I'd recommend reading about Elasticsearch token filters and about how to create multiple analyzers for the same field.
A sample document would certainly help with your exact use case. In any case, from what you describe: index your array of strings with the default analyzer and also with a custom one that you build yourself. Then you can use the same field with different analyzers (differently processed text) to control your score.

difference between a field and the field.keyword

If I add a document with several fields to an Elasticsearch index, when I view it in Kibana, I get each time the same field twice. One of them will be called
some_field
and the other one will be called
some_field.keyword
Where does this behaviour come from and what is the difference between both of them?
PS: one of them is aggregatable (not sure what that means) and the other (without keyword) is not.
Update: A short answer would be that type: text is analyzed, meaning it is broken up into distinct words when stored, which allows free-text searches on one or more words in the field. The .keyword field takes the same input and keeps it as one large string, meaning it can be aggregated on, and you can use wildcard searches on it. Aggregatable means you can use it in aggregations in Elasticsearch, which resemble a SQL GROUP BY if you are familiar with that. In Kibana you would probably use the .keyword field with aggregations to count distinct values etc.
Please take a look at this article about text vs. keyword.
Briefly: since Elasticsearch 5.0 the string type has been replaced by the text and keyword types. Since then, when you do not specify an explicit mapping, for a simple document with a string:
{
"some_field": "string value"
}
the following dynamic mapping will be created:
{
"some_field": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
As a consequence, it will both be possible to perform full-text search on some_field, and keyword search and aggregations using the some_field.keyword field.
I hope this answers your question.
Look at this issue. There is some explanation of your question in it. Roughly speaking some_field is analyzed and can be used for fulltext search. On the other hand some_field.keyword is not analyzed and can be used in term queries or in aggregation.
I will try to answer your questions one by one.
Where does this behavior come from?
It was introduced in Elasticsearch 5.0.
What is the difference between the two?
some_field is used for full text search and some_field.keyword is used for keyword searching.
Full text searching is used when we want individual tokens of a field's value to be included in the search. For instance, if you are searching for all the hotel names that have "farm" in them, such as hay farm house, Windy harbour farm house etc.
Keyword searching is used when we want to match the whole value of the field rather than individual tokens from it. For example, suppose you are indexing documents with a city field. Aggregating on the analyzed field would give separate counts for "new" and "york", whereas aggregating on the keyword field gives a count for "new york", which is usually the expected behavior.
From Elastic 5.0 onwards, strings now will be mapped both as keyword and text by default.
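The "new york" example can be imitated with a small Python sketch. It only models which bucket keys would come out of grouping on analyzed terms versus whole values. The city values and the helpers text_terms and keyword_terms are hypothetical, and real aggregations on a text field additionally depend on index settings.

```python
from collections import Counter

# Hypothetical values of a city field in three documents.
cities = ["New York", "New York", "New Delhi"]

def text_terms(value):
    """Toy analysis for a text field: lowercase and split into words."""
    return value.lower().split()

def keyword_terms(value):
    """A keyword field keeps the value as one unmodified term."""
    return [value]

# Grouping on analyzed terms splits every city into its words.
text_counts = Counter(term for city in cities for term in text_terms(city))
# Grouping on the keyword value keeps each city intact.
keyword_counts = Counter(term for city in cities for term in keyword_terms(city))

print(text_counts)
print(keyword_counts)
```

Grouping on the analyzed terms yields buckets such as new and york, while grouping on the keyword values yields New York and New Delhi, which is why some_field.keyword is the one normally used for aggregations.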

Elasticsearch multi term search

I am using Elasticsearch to allow a user to type in a term to search. I have the following property 'name' I'd like to search, for instance:
'name': 'The car is black'
I'd like to have this document returned if either black car or car black is used as the search.
I've tried doing a bool must and doing multiple terms ['black', 'car'] but it seems like it only works if the entire string is a match.
So what I'd really like is more of a "does the field contain both words, in any order" check.
Can someone please get me on the right track? I've been banging my head on this one for a while.
If it seems like it only works when the entire string is a match, first make sure that your string property name is analysed in the index mapping, i.e. that the mapping for this property doesn't contain "index": "not_analyzed". If it does, you'll need to change the mapping and reindex in order to be able to search for tokens rather than for the whole phrase only.
Once you're sure your strings are analysed you can use one of:
A terms query with the "minimum_should_match" parameter equal to the number of words entered.
A bool query with a must clause containing a term query for each word.
A common terms query, which has a nice clean syntax for this purpose (you don't need to break down the input string and construct a more complex query structure in your app, as with the previous two), in addition to taking a smarter approach to stopword analysis.
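As an illustration of the second option, here is a minimal Python sketch that turns the user's input into a bool query with one term query per word. The function name all_words_query is hypothetical. Lowercasing the words is an assumption that the field is analyzed with an analyzer that lowercases tokens, since term queries are not analyzed themselves.

```python
import json

def all_words_query(field, user_input):
    """Build a bool query whose must clause holds one term query per input word."""
    # Assumption: the indexed tokens are lowercase, so lowercase the input to match them.
    words = user_input.lower().split()
    return {
        "query": {
            "bool": {
                "must": [{"term": {field: word}} for word in words]
            }
        }
    }

# Word order in the input does not matter: every word must simply be present.
print(json.dumps(all_words_query("name", "black car"), indent=2))
print(json.dumps(all_words_query("name", "car black"), indent=2))
```

Both inputs produce the same set of required terms, so a document whose name is 'The car is black' would satisfy either query once the field is analyzed.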

Elasticsearch - search on index1/type1 document scores change when adding documents to /index1/type2

I have an elasticsearch index (index1) in which I have one type (type1). I added documents to type1 and ran a search on it:
POST /index1/type1/_search
{
"query": {
"match": {
"keyword": "quick brown fox"
}
}
}
I get a result set back with scores that generally range between .03 and 1.
Then I add another type (type2) to index1 and add some documents to it. When I run the exact same search again, I get the same documents back, but they all have different scores, now ranging from 2 to 5. Ideally, the scores of these documents would not change even after adding documents to type2.
Any ideas as to why this is happening? I am running a search on type1, yet adding documents to type2 seems to influence the scoring of the results. Is there any way to stop this from happening?
I am using v1.1.2 of elasticsearch. I should also mention, I'm working with a pretty small dataset (less than 1000 docs).
Elasticsearch scoring is detailed here, but basically what you are running into is that the inverse document frequency of some of your terms is changing based on what you are indexing into type2 (which is still in the same INDEX as type1). The change in IDF changes the relevancy of your search terms.
The only way you could avoid it is to have separate indexes for type1 and type2 (and then if you need to search across both, your search would need to pass in both indexes).
The scores really have no deep meaning though and really should only be used as a relative indication that some results are better than others.
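The effect of the changing inverse document frequency can be sketched numerically. The following Python snippet assumes the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), and assumes that document counts are taken over the whole index regardless of type. The document counts are invented for illustration only.

```python
import math

def classic_idf(num_docs, doc_freq):
    """Classic Lucene-style inverse document frequency (assumed formula)."""
    return 1 + math.log(num_docs / (doc_freq + 1))

# Hypothetical numbers: 20 documents in type1 contain the term "fox".
docs_with_term = 20

# Before: the index only holds 500 type1 documents.
idf_before = classic_idf(num_docs=500, doc_freq=docs_with_term)

# After: 400 type2 documents are added, none containing "fox".
idf_after = classic_idf(num_docs=900, doc_freq=docs_with_term)

print(round(idf_before, 3), round(idf_after, 3))
```

With these made-up numbers the term becomes rarer relative to the whole index, so its IDF and therefore the scores rise, even though no type1 document changed.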
