Using ngrams instead of wildcards without a predefined schema - elasticsearch

I recently discovered that I shouldn't be using wildcards for Elasticsearch queries. Instead, I've been told I should use ngrams. In my experimentation this has worked really well. What I'd like to do is tell Elasticsearch to use ngrams for all mapped fields (or for mapped properties that fit a specific pattern).
For example:
CURL -XPUT 'http://localhost:9200/test-ngram-7/' -d '{
"mappings": {
"person": {
"properties": {
"name": {
"type": "string",
"analyzer": "partial"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"lb_ngram": {
"max_gram": 10,
"min_gram": 1,
"type": "nGram"
}
},
"analyzer": {
"partial": {
"filter": [
"standard",
"lowercase",
"asciifolding",
"lb_ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
}
}'
Now, when I add this mapping:
CURL -XPUT 'http://localhost:9200/test-ngram-7/person/1' -d '{
"name" : "Cobb",
"age" : 31
}'
I can easily query for "obb" and get a partial result. In my app, I don't know in advance what fields people will be mapping. I could obviously short-circuit this on the client side and declare my mapping before posting the document, but it would be really cool if I could do something like this:
CURL -XPUT 'http://localhost:9200/test-ngram-7/' -d '{
"mappings": {
"person": {
"properties": {
"_default_": {
"type": "string",
"analyzer": "partial"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"lb_ngram": {
"max_gram": 10,
"min_gram": 1,
"type": "nGram"
}
},
"analyzer": {
"partial": {
"filter": [
"standard",
"lowercase",
"asciifolding",
"lb_ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
}
}'
Note that I'm using "_default_". It would also be cool if I could use something like "name.*" so that all properties starting with "name" would get filtered this way. I know Elasticsearch supports _default_ and wildcard matching, so I'm hoping that I'm just doing it wrong.
In short, I'd like new properties to get run through the ngram filter when mappings are created automatically, rather than via the mapping API.

You could set up a dynamic_template; see http://www.elasticsearch.org/guide/reference/mapping/root-object-type.html for info.
Using this, you can create mapping templates for your not-yet-known fields based on a name match, pattern matching, etc., and apply analyzers to those templates. This gives you more fine-grained control over the behavior than setting the default analyzer. The default analyzer should typically be reserved for basic things like "lowercase" and "asciifolding", but if you are certain that you wish to apply the nGram filter to ALL fields, it is certainly a valid way to go.
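For example, here is a rough, untested sketch of such a template that routes every newly mapped string field through the partial analyzer from the question (the index name test-ngram-dynamic and the template name partial_strings are just placeholders):
curl -XPUT 'http://localhost:9200/test-ngram-dynamic/' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "lb_ngram": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "partial": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "asciifolding", "lb_ngram"]
        }
      }
    }
  },
  "mappings": {
    "person": {
      "dynamic_templates": [
        {
          "partial_strings": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "analyzer": "partial"
            }
          }
        }
      ]
    }
  }
}'
With this in place, any field that arrives without an explicit mapping and is detected as a string would be indexed with the ngram analyzer.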

One solution I've found is to set up a "default" analyzer. The docs say:
Default Analyzers: An analyzer is registered under a logical name. It can then be referenced from mapping definitions or certain APIs. When none are defined, defaults are used. There is an option to define which analyzers will be used by default when none can be derived. The default logical name allows one to configure an analyzer that will be used both for indexing and for searching APIs. The default_index logical name can be used to configure a default analyzer that will be used just when indexing, and the default_search can be used to configure a default analyzer that will be used just when searching.
Here is an example:
CURL -XPUT 'http://localhost:9200/test-ngram-7/' -d '{
"settings": {
"analysis": {
"filter": {
"lb_ngram": {
"max_gram": 10,
"min_gram": 1,
"type": "nGram"
}
},
"analyzer": {
"default": {
"filter": [
"standard",
"lowercase",
"asciifolding",
"lb_ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
}
}
}'
And then this query will work:
CURL -XGET 'http://localhost:9200/test-ngram-7/person/_search' -d '{
"query":
{
"match" : {
"name" : "obb"
}
}
}'
Answering my own question because I am still interested if this is the "correct" way to do this.

Related

How to search on Elasticsearch for words with or without apostrophes, and deal with spelling mistakes?

I'm trying to move my full-text search logic from MySQL to Elasticsearch. In MySQL, to find all rows containing the word "woman", I would just write:
SELECT b.code
FROM BIBLE b
WHERE ((b.DISPLAY_NAME LIKE '%woman%')
OR (b.BRAND LIKE '%woman%')
OR (b.DESCRIPTION LIKE '%woman%'));
In Elasticsearch I tried something similar:
curl -X GET "localhost:9200/bible/_search" -H 'Content-Type: application/json' -d'
{
"query": { "multi_match": { "query": "WOMAN","fields": ["description","display_name","brand"] } }, "sort": { "code": {"order": "asc" } },"_source":["code"]
}
'
but it didn't return the same count. On checking further I found that words like "woman's" weren't found by Elasticsearch but were found by MySQL. How do I solve this?
And
How do I handle things like searching for words even with spelling mistakes, or for words which are phonetically the same?
Firstly, what does your mapping look like? Are you using any tokenizer? If not, I would suggest that if you want wildcard-style partial matching, you should use the ngram tokenizer. It is mostly used for partial matches.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
In Elasticsearch, you have to define the mapping for the fields before indexing the data. The mapping is how you tell Elasticsearch to index the data in a particular way so that you can retrieve it the way you want.
Try the DSL query below (JSON format) for creating a custom analyzer and mapping:
PUT {YOUR_INDEX_NAME}
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
}
},
"max_ngram_diff": 20 //For Elasticsearch v6 and above
},
"mappings": {
"properties": {
"code": {"type": "long"},
"description": {
"type": "text",
"analyzer": "my_analyzer"
},
"display_name": {
"type": "text",
"analyzer": "my_analyzer"
},
"brand": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
Sample Query example:
GET {YOUR_INDEX_NAME}/_search
{
"query": {
"multi_match" : {
"query" : "women",
"fields" : [ "description^3", "display_name", "brand" ]
}
}
}
I suggest you take a look at the fuzzy query for spelling mistakes.
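For instance, a hedged sketch of a fuzzy multi_match over the same fields ("AUTO" fuzziness is the usual starting point). For words that are merely phonetically similar you would additionally look at the analysis-phonetic plugin:
GET {YOUR_INDEX_NAME}/_search
{
  "query": {
    "multi_match": {
      "query": "womn",
      "fields": ["description^3", "display_name", "brand"],
      "fuzziness": "AUTO"
    }
  }
}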
Try using the Kibana UI for testing the index with DSL queries instead of cURL; it will save you time.
Hope it helps you.

Why is my elastic search prefix query case-sensitive despite using lowercase filters on both index and search?

The Problem
I am working on an autocompleter using ElasticSearch 6.2.3. I would like my query results (a list of pages with a Name field) to be ordered using the following priority:
Prefix match at start of "Name" (Prefix query)
Any other exact (whole word) match within "Name" (Term query)
Fuzzy match (this is currently done on a different field to Name using an ngram tokenizer ... so I assume it cannot be relevant to my problem, but I would like to apply this on the Name field as well)
My Attempted Solution
I will be using a Bool/Should query consisting of three queries (corresponding to the three priorities above), using boost to define relative importance.
The issue I am having is with the Prefix query - it appears to not be lowercasing the search query despite my search analyzer having the lowercase filter. For example, the below query returns "Harry Potter" for 'harry' but returns zero results for 'Harry':
{ "query": { "prefix": { "Name.raw" : "Harry" } } }
I have verified using the _analyze API that both my analyzers do indeed lowercase the text "Harry" to "harry". Where am I going wrong?
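(Roughly, the check looked like this; the analyzer name comes from the settings below:)
GET myIndex/_analyze
{
  "analyzer": "pageSearchAnalyzer",
  "text": "Harry"
}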
From the ES documentation I understand I need to analyze the Name field in two different ways to enable use of both Prefix and Term queries:
using the "keyword" tokenizer to enable the Prefix query (I have applied this on a .raw field)
using a standard analyzer to enable the Term query (I have applied this on the Name field)
I have checked duplicate questions such as this one, but the answers have not helped.
My mapping and settings are below
ES Index Mapping
{
"myIndex": {
"mappings": {
"pages": {
"properties": {
"Id": {},
"Name": {
"type": "text",
"fields": {
"raw": {
"type": "text",
"analyzer": "keywordAnalyzer",
"search_analyzer": "pageSearchAnalyzer"
}
},
"analyzer": "pageSearchAnalyzer"
},
"Tokens": {}, // Other fields not important for this question
}
}
}
}
}
ES Index Settings
{
"myIndex": {
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"keywordAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "keyword"
},
"pageSearchAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding"
],
"type": "custom",
"tokenizer": "standard"
},
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "standard"
}
}
},
"number_of_replicas": "1",
"uuid": "l2AXoENGRqafm42OSWWTAg",
"version": {}
}
}
}
}
Prefix queries don't analyze the search terms, so the text you pass in bypasses whatever would be used as the search analyzer (in your case, the configured search_analyzer: pageSearchAnalyzer). Harry is evaluated as-is directly against the keyword-tokenized, custom-filtered harry potter that the keywordAnalyzer produced at index time.
In your case, you'll need to do one of the following:
Since you're using a lowercase filter on the field, you could just always use lowercase terms in your prefix query (using application-side lowercasing if necessary)
Run a match query against an edge_ngram-analyzed field instead of a prefix query, as described in the ES search_analyzer docs
Here's an example of the latter:
1) Create the index w/ ngram analyzer and (recommended) standard search analyzer
PUT my_index
{
"settings": {
"index": {
"analysis": {
"filter": {
"ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "15"
}
},
"analyzer": {
"pageIndexAnalyzer": {
"filter": [
"trim",
"lowercase",
"asciifolding",
"ngram"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
},
"mappings": {
"pages": {
"properties": {
"name": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "pageIndexAnalyzer",
"search_analyzer": "standard"
}
}
}
}
}
}
}
2) Index some sample docs
POST my_index/pages/_bulk
{"index":{}}
{"name":"Harry Potter"}
{"index":{}}
{"name":"Hermione Granger"}
3) Run a match query against the ngram field
POST my_index/pages/_search
{
"query": {
"match": {
"name.ngram": {
"query": "Har",
"operator": "and"
}
}
}
}
I think it is better to use the match_phrase_prefix query without the .keyword suffix. Check the docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html
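A minimal sketch of that suggestion against the Name field from the question (the index name is the placeholder used in the mapping above):
POST myIndex/_search
{
  "query": {
    "match_phrase_prefix": {
      "Name": {
        "query": "harry pot"
      }
    }
  }
}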

Elasticsearch: Does edgeNGram token filter work on non english tokens?

I am trying to set up a new mapping for an index which is going to support partial keyword search and auto-complete requests powered by ES.
The edgeNGram token filter with the whitespace tokenizer seems the way to go. So far my settings look something like this:
curl -XPUT 'localhost:9200/test_ngram_2?pretty' -H 'Content-Type: application/json' -d'{
"settings": {
"index": {
"analysis": {
"analyzer": {
"customNgram": {
"type": "custom",
"tokenizer": "whitespace",
"filter": ["lowercase", "customNgram"]
}
},
"filter": {
"customNgram": {
"type": "edgeNGram",
"min_gram": "3",
"max_gram": "18",
"side": "front"
}
}
}
}
}
}'
The problem is with Japanese words! Do nGrams work on Japanese characters?
For example:
【11月13日13時まで、フォロー&RTで応募!】
There is no whitespace in this text, so the document is not searchable with partial keywords. Is that expected?
You might want to look at the icu_tokenizer, which adds better support for Asian languages: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html
Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.
PUT icu_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_icu_analyzer": {
"tokenizer": "icu_tokenizer"
}
}
}
}
}
}
Note that to use it in your index you need to install the appropriate plugin:
bin/elasticsearch-plugin install analysis-icu
Adding this to your code:
curl -XPUT 'localhost:9200/test_ngram_2?pretty' -H 'Content-Type: application/json' -d'{
"settings": {
"index": {
"analysis": {
"analyzer": {
"customNgram": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"filter": ["lowercase", "customNgram"]
}
},
"filter": {
"customNgram": {
"type": "edgeNGram",
"min_gram": "3",
"max_gram": "18",
"side": "front"
}
}
}
}
}
}'
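If you want to sanity-check the tokenizer before reindexing, something like the following should show the Japanese text being split into words rather than kept as a single token (a sketch; the exact tokens depend on the ICU dictionaries):
curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'{
  "tokenizer": "icu_tokenizer",
  "text": "【11月13日13時まで、フォロー&RTで応募!】"
}'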
Normally you would search an autocomplete field like this with the standard analyzer. Instead, add another analyzer that also uses the icu_tokenizer (but not the edgeNGram filter) and apply it to your query at search time, or explicitly set it as the search_analyzer for the field you apply customNgram to.
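A sketch of what that could look like, assuming your field is mapped as text (the index, type, and field names here are placeholders, the search-side analyzer customIcuSearch is made up for illustration, and I've dropped the side parameter since front is the default):
curl -XPUT 'localhost:9200/test_ngram_icu?pretty' -H 'Content-Type: application/json' -d'{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "customNgram": {
            "type": "edgeNGram",
            "min_gram": "3",
            "max_gram": "18"
          }
        },
        "analyzer": {
          "customNgram": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["lowercase", "customNgram"]
          },
          "customIcuSearch": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["lowercase"]
          }
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": {
          "type": "text",
          "analyzer": "customNgram",
          "search_analyzer": "customIcuSearch"
        }
      }
    }
  }
}'
This way the edge ngrams are produced only at index time, and queries are tokenized the same way as the indexed words but left whole.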

Elasticsearch Automatic Synonyms

I am looking at Elasticsearch to handle search queries made by users on my website.
Say that I have a document person with field vehicles_owned which is a list of strings. For example:
{
"name":"james",
"surname":"smith",
"vehicles_owned":["car","bike","ship"]
}
I would like to query which people own a certain vehicle. I understand it's possible to configure ES so that boat is treated as a synonym of ship and if I query with boat I am returned the user james who owns a ship.
What I don't understand is whether this is done automatically, or if I have to import lists of synonyms.
The idea is to create a custom analyzer for the vehicles_owned field which leverages the synonym token filter.
So you first need to define your index like this:
curl -XPUT localhost:9200/your_index -d '{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt" <-- your synonym file
}
}
}
}
},
"mappings": {
"syn": {
"properties": {
"name": {
"type": "string"
},
"surname": {
"type": "string"
},
"vehicles_owned": {
"type": "string",
"index_analyzer": "synonym" <-- use the synonym analyzer here
}
}
}
}
}'
Then you can add all the synonyms you want to handle in the $ES_HOME/config/synonyms.txt file using the supported formats, for instance:
boat, ship
Next, you can index your documents:
curl -XPUT localhost:9200/your_index/your_type/1 -d '{
"name":"james",
"surname":"smith",
"vehicles_owned":["car","bike","ship"]
}'
And finally searching for either ship or boat will get you the above document we just indexed:
curl -XGET localhost:9200/your_index/your_type/_search?q=vehicles_owned:boat
curl -XGET localhost:9200/your_index/your_type/_search?q=vehicles_owned:ship

How to implement case sensitive search in elasticsearch?

I have a field in my indexed documents where I need the search to be case-sensitive. I am using the match query to fetch the results.
An example of my data document is :
{
"name" : "binoy",
"age" : 26,
"country": "India"
}
Now when I give the following query:
{
"query" : {
"match" : {
"name" : "Binoy"
}
}
}
It gives me a match for "binoy" against "Binoy". I want the search to be case-sensitive. It seems that by default Elasticsearch is case-insensitive. How do I make the search case-sensitive in Elasticsearch?
In the mapping you can define the field as not_analyzed.
curl -X PUT "http://localhost:9200/sample" -d '{
"index": {
"number_of_shards": 1,
"number_of_replicas": 1
}
}'
echo
curl -X PUT "http://localhost:9200/sample/data/_mapping" -d '{
"data": {
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
}
}
}
}'
Now if you index and search as normal, the field won't be analyzed, so matching is exact and the search is case-sensitive.
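With that mapping a query term has to match the stored value exactly, including case. For example, against the sample document above (a sketch):
curl -X GET "http://localhost:9200/sample/data/_search" -d '{
  "query": {
    "term": {
      "name": "binoy"
    }
  }
}'
This returns the document indexed with "binoy", while the same query with "Binoy" returns nothing.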
It depends on the mapping you have defined for your field name. If you haven't defined any mapping, then Elasticsearch will treat it as a string and use the standard analyzer (which lower-cases the tokens) to generate tokens. Your query will also use the same analyzer for search, hence matching is done on lower-cased input. That's why "Binoy" matches "binoy".
To solve it you can define a custom analyzer without the lowercase filter and use it for your field name. You can define the analyzer as below:
"analyzer": {
"casesensitive_text": {
"type": "custom",
"tokenizer": "standard",
"filter": ["stop", "porter_stem" ]
}
}
You can define the mapping for name as below
"name": {
"type": "string",
"analyzer": "casesensitive_text"
}
Now you can do the search on name.
Note: the analyzer above is for example purposes. You may need to change it to suit your needs.
Have your mapping like this:
PUT /whatever
{
"settings": {
"analysis": {
"analyzer": {
"mine": {
"type": "custom",
"tokenizer": "standard"
}
}
}
},
"mappings": {
"type": {
"properties": {
"name": {
"type": "string",
"analyzer": "mine"
}
}
}
}
}
meaning, no lowercase filter for that custom analyzer.
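With that analyzer, tokens keep their original case at both index and search time, so a plain match query becomes case-sensitive; for example (a sketch against the index above):
POST /whatever/type/_search
{
  "query": {
    "match": {
      "name": "Binoy"
    }
  }
}
A search for "binoy" would no longer match a document containing "Binoy", and vice versa.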
Here is the full index template which worked for my ElasticSearch 5.6:
{
"template": "logstash-*",
"settings": {
"analysis" : {
"analyzer" : {
"case_sensitive" : {
"type" : "custom",
"tokenizer": "standard",
"filter": ["stop", "porter_stem" ]
}
}
},
"number_of_shards": 5,
"number_of_replicas": 1
},
"mappings": {
"fluentd": {
"properties": {
"message": {
"type": "text",
"fields": {
"case_sensitive": {
"type": "text",
"analyzer": "case_sensitive"
}
}
}
}
}
}
}
As you can see, the logs come from Fluentd and are saved into a time-based index logstash-*. To make sure I can still execute wildcard queries on the message field, I put a multi-field mapping on that field. Wildcard/analyzed queries can be run against the message field, and case-sensitive ones against the message.case_sensitive field.
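For example (a sketch using the index pattern from the template), a case-sensitive query targets the sub-field:
GET logstash-*/_search
{
  "query": {
    "match": {
      "message.case_sensitive": "Error"
    }
  }
}
The same query against plain message would also match lower-case occurrences of "error", since the default analyzer lowercases tokens.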
