Elasticsearch: custom analyzer while querying

I am trying to supply an analyzer at query time, but it is not working.
Create the index:
PUT customer
Close the index, update the index settings with the analyzer configuration, then open the index again:
PUT customer/_settings
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "digit"
          ]
        }
      }
    }
  }
}
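To verify that the analyzer is actually registered on the index, _analyze can be used (the phone number below is a made-up sample):
GET customer/_analyze
{
  "analyzer": "my_analyzer",
  "text": "5556789"
}
This returns the trigrams 555, 556, 567, 678 and 789, so the analyzer itself works.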
Query for data:
GET customer/_search
{
  "query": {
    "match": {
      "phonenumber": {
        "query": "678",
        "analyzer": "my_analyzer"
      }
    }
  }
}
But this does not return any results.
Explaining the query:
POST customer/_validate/query?explain
{
  "query": {
    "match": {
      "phonenumber": {
        "query": "678",
        "analyzer": "my_analyzer"
      }
    }
  }
}
gives:
{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "valid": true,
  "explanations": [
    {
      "index": "customer",
      "valid": true,
      "explanation": "phonenumber:678"
    }
  ]
}
The reason I am updating the index is that it is already in place. I can receive different ways to search on a field, so I want to add analyzers on the fly and then use them while querying.
I think that if I reindex and configure this analyzer on the phonenumber field by updating the mapping, this will work. But as mentioned above, I don't want to reindex: there are millions of records, and frequent reindexing is not an option.
Is there a way to solve this?

Short answer: you will have to reindex your documents.
When you specify an analyzer in the query, that analyzer is applied to the query text only, not to the field in the document.
For example, if you index "Hello" using the default analyzer and search for "Hello" using an analyzer without lowercasing, you will not get a result, because you will try to match "Hello" against "hello" (the lowercased indexed token).
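This mismatch is easy to see with _analyze (a minimal illustration using built-in analyzers, with whitespace standing in for "an analyzer without lowercase"):
GET _analyze
{
  "analyzer": "standard",
  "text": "Hello"
}
returns the token hello, while
GET _analyze
{
  "analyzer": "whitespace",
  "text": "Hello"
}
returns Hello, so the two sides never meet.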
The only solution to apply a new mapping is to reindex the documents. You cannot reindex only the field whose mapping changed.
It might not be the solution you are looking for, but here are a few hints for handling this problem:
If you use an ngram analyzer to search within a term, you can instead use a wildcard query with *<SEARCH_TERM>*: 234 will match 12345. You will not have to create a new analyzer, because you only change the query. Note, however, that this comes with a significant query-time overhead.
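For example, a sketch against the question's index (no custom analyzer required):
GET customer/_search
{
  "query": {
    "wildcard": {
      "phonenumber": {
        "value": "*678*"
      }
    }
  }
}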
Instead of reindexing the whole index, first create a subset of documents. This is easily done with the _reindex endpoint. Test and improve your mapping using only this subset, and once you are happy with the result, reindex all the documents.
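A sketch of copying a small sample into a scratch index (customer_test is a made-up name; older ES versions use a top-level "size" parameter instead of "max_docs"):
POST _reindex
{
  "max_docs": 1000,
  "source": {
    "index": "customer"
  },
  "dest": {
    "index": "customer_test"
  }
}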
If you do not use them already, use aliases to make reindexing transparent to the application.
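With an alias, the application keeps querying one name while the backing index is swapped atomically (customer_v1 and customer_v2 are hypothetical index names):
POST _aliases
{
  "actions": [
    { "remove": { "index": "customer_v1", "alias": "customer_alias" } },
    { "add": { "index": "customer_v2", "alias": "customer_alias" } }
  ]
}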

Related

Elasticsearch: search with wildcard and custom analyzer

Requirement: search with special characters in a text field.
My solution so far: use a wildcard query with a custom analyzer. I want to use wildcards because it seems the easiest way to do partial searches in a long string with multiple search keys. See the ES query below.
I have an index called "invoices", and its documents contain a field like
"searchString" : "I000010-1 000010 3901 North Saginaw Road add 2 Midland MI 48640 US MS Dhoni MSD-Company MSD (777) 777-7777 (333) 333-3333 sandeep#xyz.io msd-company msdhoni Dhoni, MS (3241480)"
Note: this field acts like the deprecated _all field in ES.
Index Mapping for this field:
"searchString": {"type": "text","analyzer": "multi_level_analyzer"},
Analyzer settings:
PUT invoices
{
  "settings": {
    "analysis": {
      "analyzer": {
        "multi_level_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
My query looks something like this:
GET invoices/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "searchString": {
              "value": "msd-company*",
              "boost": 1.0
            }
          }
        },
        {
          "wildcard": {
            "searchString": {
              "value": "Saginaw*",
              "boost": 1.0
            }
          }
        }
      ]
    }
  }
}
My questions:
Earlier, when I was not using a custom analyzer, the above query worked, but I was not able to search for words with special characters like "msd-company".
After attaching the custom analyzer (multi_level_analyzer), the above query fails to return any result. I changed the wildcard query and prepended an asterisk to the search key, and for some reason it works now (referring to this answer).
I want to know the impact of using "*msd-company*" instead of "msd-company*" in the wildcard query for the text field.
How can I still use the wildcard query "msd-company*" with a custom analyzer?
I am open to suggestions for any other approach to my problem statement.
I have solved my problem by changing the mapping of the field to this:
"searchString": {"type": "text","analyzer": "multi_level_analyzer", "search_analyzer": "standard"},
But since wildcard queries are expensive, I would still like to know if there is a better solution for my search use case.
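When debugging cases like this, it helps to remember that wildcard patterns are matched against the indexed terms and are not analyzed themselves, so inspecting those terms with _analyze shows exactly what a pattern has to match (a quick check, using a fragment of the sample document):
GET invoices/_analyze
{
  "analyzer": "multi_level_analyzer",
  "text": "MSD-Company (777) 777-7777"
}
Here the whitespace tokenizer plus lowercase emits terms such as msd-company, (777) and 777-7777, which is why an all-lowercase pattern can match while a mixed-case one cannot.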

How to remove a stop word from the default _english_ stop-words list in Elasticsearch?

I am filtering text using the default English stop words. I found that 'and' is an English stop word, but I need to search for results containing 'and'. I just want to remove the word 'and' from the default English stop-words filter and keep using the other stop words as usual. My Elasticsearch schema looks similar to the one below.
"settings": {
"analysis": {
"analyzer": {
"default": {
"tokenizer": "whitespace" ,
"filter": ["stop_english"]
}
}....,
"filter":{
"stop_english": {
"type": "stop",
"stopwords": "_english_"
}
}
I expect to see the docs containing the word AND with the _search API.
You can set the stop words for a given index manually like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["and", "is", "the"]
        }
      }
    }
  }
}
I also found the list of English stop words used by Elasticsearch here. If you manually set that same list minus "and" in an index and reindex your data into the newly configured index, you should be good to go!
Regarding reindexing your data, check out the Reindex API. I believe it is required because tokenization happens at ingestion time, so you need to redo the ingestion by reindexing; this is the case for most changes to index settings and some mapping changes.
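For reference, a filter reproducing the default English list minus "and" might look like this (a sketch; stop_english_minus_and is just an illustrative name, and the word list assumes Lucene's documented _english_ set, so double-check it against your ES version's documentation):
"filter": {
  "stop_english_minus_and": {
    "type": "stop",
    "stopwords": [
      "a", "an", "are", "as", "at", "be", "but", "by",
      "for", "if", "in", "into", "is", "it",
      "no", "not", "of", "on", "or", "such",
      "that", "the", "their", "then", "there", "these",
      "they", "this", "to", "was", "will", "with"
    ]
  }
}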

Query Elasticsearch to make all analyzed nGram tokens match

I indexed some data using an nGram analyzer (which emits only trigrams) to solve the compound-word problem, exactly as described in the ES guide.
However, this doesn't work as expected: the corresponding match query returns all documents in which at least one nGram token (per word) matches.
Example:
Let's take these two indexed documents with a single field, using that nGram analyzer:
POST /compound_test/doc/_bulk
{ "index": { "_id": 1 }}
{ "content": "elasticsearch is awesome" }
{ "index": { "_id": 2 }}
{ "content": "some search queries don't perform good" }
Now if I run the following query, I get both results:
"match": {
"content": {
"query": "awesome search",
"minimum_should_match": "100%"
}
}
The query constructed from this could be expressed like this:
(awe OR wes OR eso OR som OR ome) AND (sea OR ear OR arc OR rch)
That's why the second document matches (it contains "some" and "search"). It would even match a document whose words merely contain the tokens "som" and "rch".
What I actually want is a query in which every analyzed token must match (ideally subject to minimum_should_match), so something like this:
"match": {
  "content": {
    "query": "awe wes eso som ome sea ear arc rch",
    "analyzer": "whitespace",
    "minimum_should_match": "100%"
  }
}
...without actually building that query by hand / pre-analyzing it on the client side.
All settings and data to reproduce that behavior can be found at https://pastebin.com/97QxfaSb
Is there such a possibility?
While writing the question, I accidentally found the answer:
If the ngram analyzer uses an ngram token filter to generate the trigrams (as described in the guide), it behaves in the way described above. (I guess because the actual tokens are not the single ngrams but the combination of all the created ngrams.)
To achieve the desired behavior, the analyzer must instead use the ngram tokenizer:
"tokenizer": {
"trigram_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"trigrams_with_tokenizer": {
"type": "custom",
"tokenizer": "trigram_tokenizer"
}
}
Producing the tokens this way gives the desired result when querying that field.
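The difference is easy to see with _analyze (assuming the trigrams_with_tokenizer analyzer above is configured on the compound_test index from the question):
GET compound_test/_analyze
{
  "analyzer": "trigrams_with_tokenizer",
  "text": "awesome"
}
This emits awe, wes, eso, som and ome as separate tokens at consecutive positions, which is what lets minimum_should_match require all of them.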

Elasticsearch and Drupal: how to add lowercase and asciifolding filters

I created an index in Drupal, and my queries work.
Now I am trying to add the lowercase and asciifolding filters in the elasticsearch.yml file, but without success.
I added these lines:
index:
  analysis:
    analyzer:
      default:
        filter: [standard, lowercase, asciifolding]
I get the error: IndexCreationException: [myindex] failed to create index.
But 'myindex' already exists; I am just trying to add filters to this existing index.
How can I add these filters so that indexing works the way I want?
Thank you very much for your help.
The reason you get this exception is that it is not possible to update the settings of an index through the general create-index endpoint. In order to update analyzers, you have to call the _settings endpoint.
I've made a small example of how to do this:
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "new_analyzer": {
          "tokenizer": "standard"
        }
      }
    }
  }
}

GET test/_analyze
{
  "analyzer": "new_analyzer",
  "text": "NoLowercase"
}

POST test/_close

PUT test/_settings
{
  "analysis": {
    "analyzer": {
      "new_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "asciifolding",
          "lowercase"
        ]
      }
    }
  }
}

POST test/_open

GET test/_analyze
{
  "analyzer": "new_analyzer",
  "text": "LowerCaseAdded"
}
Response:
{
  "tokens": [
    {
      "token": "lowercaseadded",
      "start_offset": 0,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
You can see that after the second _analyze call, the lowercase filter is applied. The reason you have to close your index is that it needs to rebuild the analyzer. Note that the new analyzer won't work as expected on previously added documents, since they were indexed with the old analyzer, the one without asciifolding and lowercase.
To fix this, you'll have to rebuild your index (with the Reindex API, for example).
Hope this helps!
Edit: I may have been a bit too quick in responding, as this is not a Drupal-Elastic solution, but it might point you in the right direction. To be honest, I'm not familiar with running ES in combination with Drupal.

How to apply synonyms at query time instead of index time in Elasticsearch

According to the Elasticsearch reference documentation:
Expansion can be applied either at index time or at query time. Each has advantages (⬆)︎ and disadvantages (⬇)︎. When to use which comes down to performance versus flexibility.
The advantages and disadvantages all make sense, and for my specific case I want to use synonyms at query time. My use case is that I want to allow admin users in my system to curate these synonyms without having to reindex everything on every update. Also, I'd like to do it without closing and reopening the index.
The main reason I believe this is possible is this advantage:
(⬆)︎ Synonym rules can be updated without reindexing documents.
However, I can't find any documentation describing how to apply synonyms at query time instead of index time.
To use a concrete example, if I do the following (example stolen and slightly modified from the reference), it seems like this would apply the synonyms at index time:
/* NOTE: This was all run against elasticsearch 1.5 (if that matters; documentation is identical in 2.x) */
// Create our synonyms filter and analyzer on the index
PUT my_synonyms_test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "queen,monarch"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}
// Create a mapping that uses this analyzer
PUT my_synonyms_test/rulers/_mapping
{
  "properties": {
    "name": {
      "type": "string"
    },
    "title": {
      "type": "string",
      "analyzer": "my_synonyms"
    }
  }
}
// Some data
PUT my_synonyms_test/rulers/1
{
  "name": "Elizabeth II",
  "title": "Queen"
}
// A query which utilises the synonyms
GET my_synonyms_test/rulers/_search
{
  "query": {
    "match": {
      "title": "monarch"
    }
  }
}
// And we get our expected result back:
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.4142135,
    "hits": [
      {
        "_index": "my_synonyms_test",
        "_type": "rulers",
        "_id": "1",
        "_score": 1.4142135,
        "_source": {
          "name": "Elizabeth II",
          "title": "Queen"
        }
      }
    ]
  }
}
So my question is: how could I amend the above example so that the synonyms are applied at query time?
Or am I barking up the wrong tree entirely; can you point me somewhere else, please? I've looked at the plugins mentioned in answers to similar questions, like https://stackoverflow.com/a/34210587/2240218 and https://stackoverflow.com/a/18481495/2240218, but they all seem to be a couple of years old and unmaintained, so I'd prefer to avoid them.
Simply use search_analyzer instead of analyzer in your mapping, and your synonym analyzer will only be used at search time:
PUT my_synonyms_test/rulers/_mapping
{
  "properties": {
    "name": {
      "type": "string"
    },
    "title": {
      "type": "string",
      "search_analyzer": "my_synonyms" <--- change this
    }
  }
}
To use the custom synonym filter at QUERY TIME instead of INDEX TIME, you first need to remove the analyzer from your mapping:
PUT my_synonyms_test/rulers/_mapping
{
  "properties": {
    "name": {
      "type": "string"
    },
    "title": {
      "type": "string"
    }
  }
}
You can then use the analyzer that makes use of the custom synonym filter as part of a query_string query:
GET my_synonyms_test/rulers/_search
{
  "query": {
    "query_string": {
      "default_field": "title",
      "query": "monarch",
      "analyzer": "my_synonyms"
    }
  }
}
I believe the query_string query is the only one that allows specifying an analyzer, since it uses a query parser to parse its content.
As you said, when the analyzer is used only at query time, you won't need to reindex on every change to your synonyms collection.
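A sketch of what updating that synonyms collection could then look like (note that with inline synonyms, changing the filter definition still requires a close/open cycle on these ES versions; the added synonym is just an example):
POST my_synonyms_test/_close

PUT my_synonyms_test/_settings
{
  "analysis": {
    "filter": {
      "my_synonym_filter": {
        "type": "synonym",
        "synonyms": [
          "queen,monarch,sovereign"
        ]
      }
    }
  }
}

POST my_synonyms_test/_open
No documents need to be reindexed afterwards, because the synonyms are only applied to the query text.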
Apart from using the search_analyzer, you can refresh the synonyms list by restarting the index after making changes in the synonym file.
Below are the commands to restart your index:
curl -XPOST 'localhost:9200/index_name/_close'
curl -XPOST 'localhost:9200/index_name/_open'
After this, your synonym list will be refreshed automatically, without the need to reingest the data.
I followed this reference, Elasticsearch — Setting up a synonyms search, to configure the synonyms in ES.
