Elasticsearch - nGram on documents but not the search terms - elasticsearch

I apparently misunderstood how nGram works with Elasticsearch. I wanted to be able to efficiently search for a substring. That way I could type 'loud' and still find words like 'clouds'. I have my nGram tokenizer set up to have min=2 and max=10.
Apparently, nGram splits up the search term ('loud') into 'lo', 'ou', 'ud', 'lou', 'oud' and 'loud'. In some cases this is nice because it will find 'louder' if I search for 'cloud'. However, I think generally it just confuses my users.
Is there a way to prevent Elasticsearch from splitting up the search term? I tried using quotes in the querystring but that doesn't seem to work.

You should specify two separate analyzers in your mapping, one for indexing and one for searching, called index_analyzer and search_analyzer. The index analyzer is the same as the search analyzer, but with the nGram filter added.
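As a rough sketch of that idea, the Python below builds an index body whose index-time analyzer adds an nGram filter while the search-time analyzer does not. The names substring_filter, substring_index, plain_search and the field title are made up for illustration. It uses the newer mapping syntax, where (as far as I know) the index-time analyzer is set with analyzer rather than index_analyzer. The small ngrams helper is only a pure-Python stand-in for the filter, used to show the effect.

```python
def ngrams(token, min_gram=2, max_gram=10):
    """Return every substring of `token` whose length is in [min_gram, max_gram]."""
    grams = set()
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams


# Hypothetical index body; analyzer, filter and field names are assumptions.
index_body = {
    "settings": {
        # Recent versions limit max_gram - min_gram to 1 unless this is raised.
        "index": {"max_ngram_diff": 8},
        "analysis": {
            "filter": {
                "substring_filter": {"type": "ngram", "min_gram": 2, "max_gram": 10}
            },
            "analyzer": {
                "substring_index": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "substring_filter"],
                },
                "plain_search": {"tokenizer": "standard", "filter": ["lowercase"]},
            },
        },
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "substring_index",      # applied to documents at index time
                "search_analyzer": "plain_search",  # applied to the query text
            }
        }
    },
}

# Search term kept whole: 'loud' finds 'clouds', and 'cloud' no longer finds 'louder'.
print("loud" in ngrams("clouds"))
print("cloud" in ngrams("louder"))
# Search term split into n-grams: 'cloud' and 'louder' share grams, so they match.
print(bool(ngrams("cloud") & ngrams("louder")))
```

With the search term left whole, only documents containing the full term as one of their n-grams match, which is the behavior the question asks for.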

Related

How to search exact word in a text in Elastic Search

Let's say I have two texts:
Text 1 - "The fox has been living in the wood cabin for days."
Text 2 - "The wooden hammer is a dangerous weapon."
And I would like to search for the word "wood" without it matching "wooden hammer". How would I do that in Elastic Search or NEST?
The term query is used for exact-match searches. However, it's not recommended to use it against text fields, as the following quote from the term query documentation explains:
To better search text fields, the match query also analyzes your
provided search term before performing a search. This means the match
query can search text fields for analyzed tokens rather than an exact
term.
The term query does not analyze the search term. The term query only
searches for the exact term you provide. This means the term query may
return poor or no results when searching text fields.
The problem with text exact matches, as described in the Term query documentation:
By default, Elasticsearch changes the values of text fields as part of
analysis. This can make finding exact matches for text field values
difficult.
So the document data is modified (i.e., analyzed) before indexing. How it is analyzed depends on the index mapping definition for each field; if no analyzer is specified there, the default index analyzer is used, or else the standard analyzer.
But the default standard analyzer will not change the token "Wooden" to "Wood"; that would only happen if you used stemming for this field.
This means that if you don't use a different analyzer or stemming, querying with "Wood" shouldn't match the "Wooden" token.
To summarize: indexed data is analyzed before indexing (based on the field mapping definition). The match query analyzes the search query, while the term query doesn't. So you have to choose the field mapping and the search query carefully to suit your use case.
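To make the match/term difference concrete, here is a small pure-Python simulation using the two texts from the question. simple_analyze is only a rough approximation of the standard analyzer (split into words, lowercase), and the function names are made up for illustration. It is not how Elasticsearch is implemented.

```python
import re


def simple_analyze(text):
    """Rough approximation of the standard analyzer: split into words and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


docs = {
    1: "The fox has been living in the wood cabin for days.",
    2: "The wooden hammer is a dangerous weapon.",
}

# Inverted index: analyzed token -> ids of the documents that contain it.
index = {}
for doc_id, text in docs.items():
    for token in simple_analyze(text):
        index.setdefault(token, set()).add(doc_id)


def match_query(text):
    """Analyze the query text first, then look up each resulting token."""
    hits = set()
    for token in simple_analyze(text):
        hits |= index.get(token, set())
    return hits


def term_query(term):
    """Look up the term exactly as given, with no analysis."""
    return index.get(term, set())


print(match_query("Wood"))  # {1}
print(term_query("Wood"))   # set()
print(term_query("wood"))   # {1}
```

Document 2 is never returned because "wooden" is indexed as its own token, and without stemming nothing reduces it to "wood".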
For some use cases, like storing email addresses, phone numbers or keyword fields that always have the same value, consider using the keyword type, which is suitable for exact matches. However, ES recommends:
Avoid using keyword fields for full-text search. Use the text field
type instead.
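A minimal sketch of that split, written as Python dicts that mirror the JSON request bodies. The field names email and bio and the example values are made up for illustration.

```python
# Hypothetical mapping: an identifier-like field as keyword, free text as text.
mapping = {
    "properties": {
        "email": {"type": "keyword"},  # stored as a single unanalyzed term
        "bio": {"type": "text"},       # analyzed for full-text search
    }
}

# Exact match on the keyword field: the term query does no analysis.
exact_email_query = {"query": {"term": {"email": "jane@example.com"}}}

# Full-text search on the text field: the match query analyzes the input.
full_text_query = {"query": {"match": {"bio": "wood"}}}
```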
So, to get a more practical solution for your use case, please share the field mapping you use and describe what you want to achieve.

What's the difference between Search-as-you-type datatype and Edge NGram Tokenizer?

Can't understand the difference between setting a Search-as-you-type datatype to a field, setting an Edge NGram Tokenizer in analyzer, and adding an index_prefixes parameter. It seems to me that they do the same job after all.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-prefixes.html
edge_ngram is a tokenizer, which means it kicks in at indexing time to tokenize your input data. There is also an edge_ngram token filter. Both are similar but work at different levels.
search_as_you_type is a field type which contains a few sub-fields, one of which is called _index_prefix and which leverages the edge_ngram tokenizer.
So basically, what you see in the edge_ngram tokenizer documentation has actually been leveraged when they decided to add the new search_as_you_type field type.
Rafiqul is correct that search_as_you_type is built using edge_ngram, but it also incorporates the concept of shingles. Shingles are sequences of adjacent words, which allow search_as_you_type to better handle multi-word queries.
Note that search_as_you_type requires the words to be in the order entered, which makes it better suited to known entities like movie titles than to free-form documents.
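The two building blocks mentioned here can be illustrated with a few lines of plain Python. This is only a simplified model of what edge n-grams and shingles produce, not Elasticsearch code, and the function names are made up.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Prefixes of `token` with lengths from min_gram up to max_gram."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def shingles(tokens, size=2):
    """Runs of `size` adjacent tokens, joined with a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


print(edge_ngrams("star", 1, 4))                  # ['s', 'st', 'sta', 'star']
print(shingles(["star", "wars", "episode"], 2))   # ['star wars', 'wars episode']
```

Edge n-grams let a partly typed word match by prefix, while shingles keep adjacent words together so that multi-word input can match in order.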

Favor exact matches over ngram matches in ElasticSearch when mapping

I have partial matching of words working with ngrams. How can I modify the mapping to always favor exact matches over ngram tokens? I do not want to modify the query. One search box will search multiple types, each with their own fields.
For example, let's say I'm searching job titles: one person has a title of "field engineer", the other a title of "engine technician". If a user searches for "engine", I'd want ES to return the latter as more relevant.
I'm using this mapping almost verbatim: https://stackoverflow.com/a/19874785/978622
-Exception: I'm using an ngram with min of 3 and max of 11 instead of edge ngram
Is it possible to apply a boost/function score to an analyzer? If so I'll apply both the "full_name" and "partial_name" analyzers to my index as well and boost the first.
Edit: I'm using ElasticSearch 1.1.1 and Nest 1.0.0 beta
I don't believe there is any way to apply boosting to an analyzer as you're suggesting.
One thing you can try is to use the multi field type in your mapping. You could then apply your partial_name analyzer to one version of the field, and your full_name analyzer to the other version.
With this mapping, you could query both fields differently but in combination (perhaps in a bool query), and apply a boost to the query run against the full_name analyzed field.
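A hedged sketch of that suggestion, written as Python dicts that mirror the JSON bodies. It uses the newer fields syntax for multi-fields rather than the multi_field type from 1.x. The field name title, the sub-field name partial, and the boost value are assumptions; full_name and partial_name are the analyzers named in the question and must be defined in the index settings.

```python
# Hypothetical mapping: the main field is analyzed as whole words,
# and the "partial" sub-field is analyzed with n-grams.
mapping = {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "full_name",
            "fields": {
                "partial": {
                    "type": "text",
                    "analyzer": "partial_name",
                    "search_analyzer": "full_name",
                }
            },
        }
    }
}


def build_query(text, exact_boost=5.0):
    """Query both versions of the field and weight whole-word matches higher."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"title": {"query": text, "boost": exact_boost}}},
                    {"match": {"title.partial": {"query": text}}},
                ]
            }
        }
    }


query = build_query("engine")
```

With a search for "engine", "engine technician" matches both clauses while "field engineer" matches only the n-gram clause, so the former should score higher.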

Using Nest, how to mimic an _all field that includes ngram tokens?

I believe it is impossible for the _all field to contain ngram tokens. How can I mimic this behavior?
I have 7 types of entities, each with about 10 fields. Of those 70 total fields, about 15 must support partial search (using an ngram index analyzer). All fields will use the same search analyzer.
Is copy_to supported in Nest? I don't see it. If so, can different fields have different analyzers?
My thinking so far: If copy_to is supported, all fields I want to search would be copied to a single field, one per type, called "aggregate". The search query would specify a multifield search which included each of these aggregate fields.
The _all field can in fact contain nGram tokens. You have the ability to define both the search and index analyzers for the _all field. Please see my previous question, "Set analyzers for _all field with NEST". However, you will need to pull the source for NEST and compile it to get this functionality, as it is not in the NEST 1.0.0-beta1 release on NuGet.
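For the copy_to idea raised in the question, the raw mapping (shown here as a Python dict rather than NEST code) might look roughly like this. The field names first_name and job_title are made up, and partial_name and full_name are assumed to be analyzers defined in the index settings. Because copy_to copies the original field value rather than its tokens, the target field is analyzed with its own analyzers, independently of the source fields.

```python
# Hypothetical mapping for one type: selected fields are copied into "aggregate".
mapping = {
    "properties": {
        "first_name": {"type": "text", "copy_to": "aggregate"},
        "job_title": {"type": "text", "copy_to": "aggregate"},
        "aggregate": {
            "type": "text",
            "analyzer": "partial_name",      # n-gram analysis at index time
            "search_analyzer": "full_name",  # whole-word analysis at search time
        },
    }
}


def fields_copied_to(mapping, target):
    """List the fields whose values are copied into `target`."""
    return sorted(
        name
        for name, spec in mapping["properties"].items()
        if spec.get("copy_to") == target
    )


print(fields_copied_to(mapping, "aggregate"))  # ['first_name', 'job_title']
```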

Is it possible to set a custom analyzer to not tokenize in elasticsearch?

I want to treat the field of one of the indexed items as one big string even though it might have whitespace. I know how to do this by setting a non-custom field to be 'not-analyzed', but what tokenizer can you use via a custom analyzer?
The only tokenizer items I see on elasticsearch.org are:
Edge NGram
Keyword
Letter
Lowercase
NGram
Standard
Whitespace
Pattern
UAX URL Email
Path Hierarchy
None of these do what I want.
The Keyword tokenizer is what you are looking for.
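A minimal sketch of a custom analyzer built on it, as a Python dict mirroring the JSON settings. The analyzer name whole_string and the added lowercase filter are assumptions for illustration. The helper only emulates what this analyzer would emit.

```python
# Hypothetical index settings: the keyword tokenizer emits the whole input
# as a single token, and token filters can still be applied to that token.
settings = {
    "analysis": {
        "analyzer": {
            "whole_string": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"],
            }
        }
    }
}


def whole_string_analyze(text):
    """Emulate the analyzer above: one token, lowercased, whitespace kept."""
    return [text.lower()]


print(whole_string_analyze("New York City"))  # ['new york city']
```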
The Keyword tokenizer doesn't really do that:
When searching, it'll tokenize the entire query string into a single token, making text queries behave like a term query.
The issue I run into is that I want to add filters and then search indexed keywords in a long text (Keyword assignment). I would say there's no tokenizer that could do this, and that the normalizer can't accept necessary filters. The workaround for me is to prepare the text before feeding it to elasticsearch.
