How to boost "starts with" search results above "contains" search results in elastic - elasticsearch

I am trying to find a way that I can boost the search results for a particular query such that the search results that have the query at the beginning of the field (i.e. starts with) are above the results that do not.
e.g. Suppose my query is for 'bat'
I want my results to look like
bat
bath
bathe
abate
debate
etc.

You could try adding a prefix query with a boost value to make the score for prefix matches higher than the rest of the items.

Related

How to search exact word in a test in Elastic Search

Let's say I have two texts:
Text 1 - "The fox has been living in the wood cabin for days."
Text 2 - "The wooden hammer is a dangerous weapon."
And I would like to search for the word "wood", without it matching me "wooden hammer". How would I do that in Elastic Search or nest?
Term query is used for exact matches search. However it's not recommended to use it against text fields, the following quote from term query documentation:
To better search text fields, the match query also analyzes your
provided search term before performing a search. This means the match
query can search text fields for analyzed tokens rather than an exact
term.
The term query does not analyze the search term. The term query only
searches for the exact term you provide. This means the term query may
return poor or no results when searching text fields.
The problem with text exact matches, as described in the Term query documentation:
By default, Elasticsearch changes the values of text fields as part of
analysis. This can make finding exact matches for text field values
difficult.
So, the documents data is modified (i.e., analyzed) before indexing. This depends on the index mapping definition for each field, defaults to the default index analyzer, or the standard analyzer.
But the default standard analyzer will not change the token "Wooden" to "Wood", this might happen if you used stemming for this field.
This means, if you don't use a different analyzer or stemming, querying with "Wood" shouldn't match "Wooden" token.
To summarize: Indexed data is modified/analyzed before indexing (based on the field mapping definition). Match query analyze the search query, while Term query doesn't analyze the search query. So you have to properly chose the field mapping and the search query to better suit your use case
For some use cases, like storing email addressed, phone numbers or keyword fields that always have the same value, consider using the Keyword type, which is suitable for exact matches in these use cases. However, ES recommends:
Avoid using keyword fields for full-text search. Use the text field
type instead.
So for better visibility and practical solution for your use case, it's better to elaborate more the field mapping you use and what you want to achieve.

Elasticsearch: get a list of the terms that were matched in each result

How can I get the list of terms that elasticsearch matched in each result? I know the highlight contains this but I want to get a list of the terms that were found without manually performing postprocessing on the highlight for each result.
You could use named queries with unique query for each term.
Search result will contain matched queries for each document in result.

Elastic search giving strange results

I am following this tutorial on elastic search.
Two employees have 'about' value as:
"about": "I love to go rock climbing"
"about": "I like to collect rock albums"
I run following query:
GET /megacorp/employee/_search {"query":{"match":{"about":"rock coll"}}}
Both above entries are returned, but surprisingly wit same score:
"_score": 0.2876821
Shouldn't the second one must have higher score as it has 'about' value containing both 'rock' and 'coll' while first one only contains 'rock'?
That totally depends on what analyzer you are using. if you are using standard or english analyzer this result is correct. I recommend you to spend some time working with elasticsearch's Analyze API to get familiar how each analyzer affect your text.
By the way, if you want second document to have higher score, take a look at Partial matching.
When we search on a full-text field, we need to pass the query string through the same analysis process as we have when we index a document, to ensure that we are searching for terms in the same form as those that exist in the index.
Analysis process usually consists of normalization and tokenization (the string is tokenized into individual terms by a tokenizer).
As for match Query:
If you run a match query against a full-text field, it will analyze the query string by using the correct analyzer for that field before executing the search. It just looks for the words that are specified.
So, in your match query Elasticsearch will look for occurrences of the whole separate words: rock or/and coll.
Your 2nd document doesn't contain a separate word coll but was matched by the word rock.
Conclusion: the 2 documents are equivalent in their _score value (they were matched by the same word rock)
Elasticsearch analyzes each text field before storing it. The default analyzer (standard analyzer) splits the text based on whitespaces and lowercases it. The output of analysis process is a list of tokens which are used to match your query tokens. If any of the tokens match exactly the relevant document is returned. That's being said, your second document doesn't contain the token col and that's why you are having the same score for both documents.
Even if you build your custom analyzer and use stemming, the word collect won't be stemmed as coll.
You can build custom analyzers in which you can specify that tokens should be of length 1 character, then Elasticsearch will consider each single character as a token and you can search for the existence of any character in your documents.

ElasticSearch: is it possible to highlight words in the query rather than the results

We use ElasticSearch in a reverse manner from what I usually see. We store lots of small documents, usually 1 or 2 words, for example, Job Titles like "software engineering", "car mechanics", "architect", etc.
Then we query with a longer string, for example a 1000 word Job Spec. This way we get all Job Titles present in the text of the Job Spec.
It works well. But I was wondering whether I could get ElasticSearch to highlight the matching Job Titles in the Job Spec, i.e. highlight the results in the query. I have tried the highlight keyword, but it doesn't highlight the query text, it highlights the results. I'm not sure how to get the query to be returned in the ElasticSearch response, let alone whether it can be highlighted.
You might wonder why I need ElasticSearch to highlight the query, can't I just pick out all the results from the text and highlight them myself? Yes I can, but there's various things to think about that makes it hard such as stemming and stopword removal. for example "jquery" is stemmed to "jqueri" when doing the tokenising in ElasticSearch, so it's found as a result, but if I want to highlight it myself, I have to unstem it so it matches the original text. Elasticsearch also removes symbols, so terms & conditions would become terms conditions which is problematic if I want to highlight it manually as I have to add back the "&" symbol. There's a hundred other problem cases, hence the question about whether ElasticSearch can do it for me.
I'm quite sure highlighting the query string isn't possible - only highlighting parts of documents in an index.
What you might try is indexing the query string itself in it's own index and then using the results of the first query as the query terms for a second query against the query string (in the second index). You could then have highlighting on the query string. You'll have to make an extra request to ES each time, but I think it'll get what you want.

Elasticsearch/Lucene: Query to Find a Word Near the Start or End of a Document

Is there a way to build an elastic search query that matches documents if a given term appears within x spaces of the start of a document?
I understand that I can use span_near and slop to find all instances of a term that is within x spaces of another term, but I don't know how to find instances where the term is near the start or end of a document.
The motivation here is because I have a set of documents that always start with a particular set of words, and I want to be able to query based only on that first word. I understand I can re-index my documents, pull out the first word into another field, and then query that separately, but I would prefer to not have to re-index everything just for this.
Have you tried span_first? Looks like what you need.

Resources