How to apply synonyms at query time instead of index time in Elasticsearch

According to the Elasticsearch reference documentation:
Expansion can be applied either at index time or at query time. Each has advantages (⬆)︎ and disadvantages (⬇)︎. When to use which comes down to performance versus flexibility.
The advantages and disadvantages all make sense and for my specific use I want to make use of synonyms at query time. My use case is that I want to allow admin users in my system to curate these synonyms without having to reindex everything on an update. Also, I'd like to do it without closing and reopening the index.
The main reason I believe this is possible is this advantage:
(⬆)︎ Synonym rules can be updated without reindexing documents.
However, I can't find any documentation describing how to apply synonyms at query time instead of index time.
To use a concrete example, if I do the following (example stolen and slightly modified from the reference), it seems like this would apply the synonyms at index time:
/* NOTE: This was all run against elasticsearch 1.5 (if that matters; documentation is identical in 2.x) */
// Create our synonyms filter and analyzer on the index
PUT my_synonyms_test
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"queen,monarch"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
}
}
// Create a mapping that uses this analyzer
PUT my_synonyms_test/rulers/_mapping
{
"properties": {
"name": {
"type": "string"
},
"title": {
"type": "string",
"analyzer": "my_synonyms"
}
}
}
// Some data
PUT my_synonyms_test/rulers/1
{
"name": "Elizabeth II",
"title": "Queen"
}
// A query which utilises the synonyms
GET my_synonyms_test/rulers/_search
{
"query": {
"match": {
"title": "monarch"
}
}
}
// And we get our expected result back:
{
"took": 42,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.4142135,
"hits": [
{
"_index": "my_synonyms_test",
"_type": "rulers",
"_id": "1",
"_score": 1.4142135,
"_source": {
"name": "Elizabeth II",
"title": "Queen"
}
}
]
}
}
So my question is: how could I amend the above example so that I would be using the synonyms at query time?
Or am I barking up completely the wrong tree and can you point me somewhere else please? I've looked at plugins mentioned in answers to similar questions like https://stackoverflow.com/a/34210587/2240218 and https://stackoverflow.com/a/18481495/2240218 but they all seem to be a couple of years old and unmaintained, so I'd prefer to avoid these.

Simply use search_analyzer instead of analyzer in your mapping, and your synonym analyzer will only be used at search time:
PUT my_synonyms_test/rulers/_mapping
{
"properties": {
"name": {
"type": "string"
},
"title": {
"type": "string",
"search_analyzer": "my_synonyms"
}
}
}
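To illustrate why a search-time synonym analyzer avoids reindexing, here is a minimal Python simulation. It is a toy model, not Elasticsearch code: the functions analyze, build_index and search are hypothetical names, and the analysis is reduced to lowercasing and whitespace splitting.

```python
# Toy model of query-time synonym expansion; not Elasticsearch code.
# All names here (analyze, build_index, search) are hypothetical.

def analyze(text, synonyms=None):
    """Lowercase, split on whitespace, and optionally expand synonyms."""
    tokens = set()
    for word in text.lower().split():
        tokens.add(word)
        if synonyms:
            tokens.update(synonyms.get(word, ()))
    return tokens

def build_index(docs):
    """Index-time analysis WITHOUT synonyms: doc id -> set of tokens."""
    return {doc_id: analyze(text) for doc_id, text in docs.items()}

def search(index, query, synonyms):
    """Query-time analysis WITH synonyms; a doc matches if any token overlaps."""
    query_tokens = analyze(query, synonyms)
    return sorted(doc_id for doc_id, tokens in index.items() if tokens & query_tokens)

index = build_index({1: "Queen", 2: "King"})

synonyms = {"queen": {"monarch"}, "monarch": {"queen"}}
print(search(index, "monarch", synonyms))  # [1]

# An admin adds a rule; the stored tokens in `index` are not touched.
synonyms["monarch"].add("king")
synonyms["king"] = {"monarch"}
print(search(index, "monarch", synonyms))  # [1, 2]
```

In a real cluster the updated rules still have to reach the analyzer, for example by closing and reopening the index as described in another answer in this thread.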

To use the custom synonym filter at QUERY TIME instead of INDEX TIME, you first need to remove the analyzer from your mapping:
PUT my_synonyms_test/rulers/_mapping
{
"properties": {
"name": {
"type": "string"
},
"title": {
"type": "string"
}
}
}
You can then use the analyzer that makes use of the custom synonym filter as part of a query_string query:
GET my_synonyms_test/rulers/_search
{
"query": {
"query_string": {
"default_field": "title",
"query": "monarch",
"analyzer": "my_synonyms"
}
}
}
The query_string query lets you specify an analyzer because it uses a query parser to parse its content. (Other full-text queries, such as match, accept an analyzer parameter as well.)
As you said, when using the analyzer only at query time, you won't need to re-index on every change to your synonyms collection.

Apart from using the search_analyzer, you can refresh the synonyms list by closing and reopening the index after making changes in the synonym file.
Below are the commands to close and reopen your index:
curl -XPOST 'localhost:9200/index_name/_close'
curl -XPOST 'localhost:9200/index_name/_open'
After this, your synonym list is refreshed automatically, without the need to reingest the data.

I followed this reference Elasticsearch — Setting up a synonyms search to configure the synonyms in ES

Related

Elasticsearch became case-sensitive after adding a synonym analyzer

After I added a synonym analyzer to my_index, the index became case-sensitive.
I have one property called nationality that uses the synonym analyzer, and it seems that this property became case-sensitive because of it.
Here is my /my_index/_mappings
{
"my_index": {
"mappings": {
"items": {
"properties": {
.
.
.
"nationality": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "synonym"
},
.
.
.
}
}
}
}
}
Inside the index, I have the value India COUNTRY. When I search for India nation using the command below, I get the result.
POST /my_index/_search
{
"query": {
"match": {
"nationality": "India nation"
}
}
}
But when I search for india (notice the lowercase i), I get nothing.
My assumption is that this happened because I put the uppercase filter before the synonym filter. I did this because the synonyms are uppercased. So the query India becomes INDIA after passing through this filter.
Here is my /my_index/_settings
{
"my_index": {
"settings": {
"index": {
"number_of_shards": "1",
"provided_name": "my_index",
"similarity": {
"default": {
"type": "BM25",
"b": "0.9",
"k1": "1.8"
}
},
"creation_date": "1647924292297",
"analysis": {
"filter": {
"synonym": {
"type": "synonym",
"lenient": "true",
"synonyms": [
"NATION, COUNTRY, FLAG"
]
}
},
"analyzer": {
"synonym": {
"filter": [
"uppercase",
"synonym"
],
"tokenizer": "whitespace"
}
}
},
"number_of_replicas": "1",
"version": {
"created": "6080099"
}
}
}
}
}
Is there a way to keep this property case-insensitive? All the solutions I've found say I should store all the text inside nationality either in lowercase or in uppercase. But what if the index contains both uppercase and lowercase letters?
Did you apply the synonym filter after adding your data to the index?
If so, the phrase "India COUNTRY" was probably indexed exactly as "India COUNTRY". When you sent the match query, "India nation" was analyzed to "INDIA NATION" because you now have the uppercase filter, and the synonym filter expanded "NATION" to include "COUNTRY". It matched because a match query only needs one of the words to match, and "COUNTRY" provides that.
But when you send the one-word query "india", it is analyzed to "INDIA" by the uppercase filter, and there is no matching term in your index: the document contains "India COUNTRY".
My answer rests on some assumptions. I hope it helps you understand the problem.
I have found the solution!
I didn't realize that the filters configured in the settings apply both when indexing and when searching the data. At first, I did the following:
1. Create the index with the synonym filter
2. Insert data
3. Add the uppercase filter before the synonym filter
Because of that order, the uppercase filter was never applied to my data. What I should have done is:
1. Create the index with the uppercase and synonym filters (pay attention to the order)
2. Insert data
Then the filters are applied to my data.
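The effect of adding the uppercase filter after the data was indexed can be reproduced with a small Python simulation. This is a sketch under simplifying assumptions, not Elasticsearch code: the whitespace tokenizer is modelled with str.split, and the function names are hypothetical.

```python
# Toy model of the analyzer chain from the question; not Elasticsearch code.
SYNONYMS = [{"NATION", "COUNTRY", "FLAG"}]

def synonym_filter(tokens):
    """Add every member of a synonym group when one member is present."""
    out = set()
    for tok in tokens:
        out.add(tok)
        for group in SYNONYMS:
            if tok in group:
                out |= group
    return out

def analyze(text, uppercase):
    """Whitespace tokenizer, optional uppercase filter, then synonym filter."""
    tokens = text.split()
    if uppercase:
        tokens = [t.upper() for t in tokens]
    return synonym_filter(tokens)

def matches(doc_tokens, query):
    """Queries always run through the current chain (uppercase + synonym)."""
    return bool(doc_tokens & analyze(query, uppercase=True))

# Document indexed BEFORE the uppercase filter was added to the analyzer.
old_doc = analyze("India COUNTRY", uppercase=False)
# Same document indexed AFTER the uppercase filter was added.
new_doc = analyze("India COUNTRY", uppercase=True)

print(matches(old_doc, "India nation"))  # True, via the COUNTRY/NATION synonym
print(matches(old_doc, "india"))         # False, stored token is "India"
print(matches(new_doc, "india"))         # True, stored token is "INDIA"
```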

Elasticsearch: custom analyzer while querying

I am trying to supply an analyzer at query time, but it is not working.
Create Index
PUT customer
Close the index, update the index settings with the analyzer configuration, then reopen the index
PUT customer/_settings
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"digit"
]
}
}
}
}
}
Query for data
GET customer/_search
{
"query": {
"match": {
"phonenumber": { "query":"678",
"analyzer": "my_analyzer"
}
}
}
}
But this does not return any results.
Explaining the query gives:
POST customer/_validate/query?explain
{
"query": {
"match": {
"phonenumber": { "query":"678",
"analyzer": "my_analyzer"
}
}
}
}
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"valid": true,
"explanations": [
{
"index": "customer",
"valid": true,
"explanation": "phonenumber:678"
}
]
}
The reason I am updating the index is that I already have an index in place. I may need to search a field in different ways, so I want to add analyzers on the fly and then use them while querying.
I think this would work if I reindexed with this analyzer configured on the phonenumber field by updating the mapping. But as mentioned above, I don't want to reindex: there are millions of records, and frequent reindexing is not an option.
Is there a way to solve this?
Short answer: You will have to reindex your documents.
When you specify an analyzer in the query, it is applied only to the query text, not to the tokens already stored in the index for the field.
For example, if you index "Hello" using the default analyzer and search for "Hello" using an analyzer without lowercasing, you will not get a result because you will try to match "Hello" with "hello" (i.e., lowercase).
The only solution to apply a new mapping is to reindex the documents. You cannot reindex only the field whose mapping changed.
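The mismatch between query-time and index-time tokens can be sketched in Python. This is an approximation and not Elasticsearch code: standard_tokens only lowercases and splits on whitespace, ngram_tokens imitates an ngram tokenizer restricted to digits, and the phone number is a made-up value.

```python
import re

def standard_tokens(text):
    """Rough stand-in for the default analyzer: lowercase, split on whitespace."""
    return set(text.lower().split())

def ngram_tokens(text, n=3):
    """Rough stand-in for an ngram tokenizer with token_chars=[digit]."""
    tokens = set()
    for run in re.findall(r"\d+", text):  # grams never span non-digit characters
        tokens.update(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens

phone = "12345678"  # hypothetical stored value

indexed_with_default = standard_tokens(phone)  # {"12345678"}
indexed_with_ngrams = ngram_tokens(phone)      # {"123", "234", ..., "678"}
query = ngram_tokens("678")                    # {"678"}, as in the explain output

print(bool(query & indexed_with_default))  # False: no stored term equals "678"
print(bool(query & indexed_with_ngrams))   # True: only after reindexing with ngrams
```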
It might not be the solution that you are looking for, but here are a few hints for handling this problem:
- If you use the ngram analyzer to search within a term, you can use a wildcard query with *<SEARCH_TERM>* instead: *234* will match 12345. You will not have to create a new analyzer because you only change the query. Please note that this comes with significant query overhead.
- Instead of reindexing the whole index, create a subset of documents. This can easily be done with the _reindex endpoint. Test and improve your mapping using only this subset, and once you are happy with the result, reindex all the documents.
- If you do not use them already, use aliases to make reindexing transparent to the application.
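As a sketch of the first hint, the following Python helper builds the body of a wildcard query. The helper name wildcard_query is hypothetical, it does not escape * or ? inside the search term, and fnmatchcase is used only as a local approximation of wildcard matching against a single stored term.

```python
from fnmatch import fnmatchcase

def wildcard_query(field, term):
    """Build a search body that matches `term` anywhere inside a term of `field`."""
    return {"query": {"wildcard": {field: f"*{term}*"}}}

body = wildcard_query("phonenumber", "678")
print(body)  # {'query': {'wildcard': {'phonenumber': '*678*'}}}

# Local approximation of how the pattern behaves against stored terms.
pattern = body["query"]["wildcard"]["phonenumber"]
print(fnmatchcase("12345678", pattern))  # True
print(fnmatchcase("12345", pattern))     # False
```

The body would be sent to the _search endpoint of the existing index, so no mapping change or reindex is involved.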

Elasticsearch - Do searches for alternative country codes

I have a document with a field called 'countryCode'. I have a term query that searches for its keyword value, but I am having some issues:
Some records say UK and others say GB
Some records say US and others say USA
And the list goes on.
Can I instruct my index to handle all those variations somehow, instead of having to expand the terms in my query filter?
What you are looking for is a way for one token to match other tokens with the same meaning, whether or not they share characters. This can be done using synonyms.
Elasticsearch lets you configure synonyms so that your queries use them and return results accordingly.
I have configured a field with a custom analyzer that uses the synonym token filter. Below are a sample mapping, document, and query so that you can play with it and see if it fits your needs.
Mapping
PUT my_index
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"usa, us",
"uk, gb"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
},
"mappings": {
"mydocs": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_synonyms"
}
}
}
}
}
Sample Document
POST my_index/mydocs/1
{
"name": "uk is pretty cool country"
}
And when you make use of the below query, it does return the above document as well.
Query
GET my_index/mydocs/_search
{
"query": {
"match": {
"name": "gb"
}
}
}
Refer to their official documentation to understand more on this. Hope this helps!
To handle this within ES itself, without using Logstash, I'd suggest a simple ingest pipeline with a gsub processor to update the field in place:
{
"gsub": {
"field": "countryCode",
"pattern": "GB",
"replacement": "UK"
}
}
https://www.elastic.co/guide/en/elasticsearch/reference/master/gsub-processor.html
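What a chain of such replacements does to incoming values can be sketched in plain Python. This is a local simulation, not an ingest pipeline: RULES and normalize_country_code are hypothetical names, the USA rule is an assumption added for illustration (the answer only shows GB to UK), and the ^...$ anchors are also an assumption, added so that only whole values are rewritten.

```python
import re

# Hypothetical rule table: (regex pattern, replacement), applied in order.
RULES = [
    (r"^GB$", "UK"),
    (r"^USA$", "US"),
]

def normalize_country_code(code):
    """Rewrite known aliases to one canonical code; leave other values unchanged."""
    for pattern, replacement in RULES:
        code = re.sub(pattern, replacement, code)
    return code

for raw in ["GB", "UK", "USA", "US", "DE"]:
    print(raw, "->", normalize_country_code(raw))
```

With every document normalized at ingest, the existing term query on countryCode only ever needs the canonical value.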

Elasticsearch: Is there a way to exclude synonyms from highlighting?

I'm trying to exclude synonyms from highlighting. I created a copy of my current analyzer with a synonym filter. So for each field I now have an analyzer and a search_analyzer. The search analyzer is the new analyzer with all the same filters plus the synonym filter.
Any ideas? I am using elasticsearch 5.2
Mapping:
"mappings": {
"doc": {
"properties": {
"body": {
"type": "text",
"analyzer": "custom_analyzer",
"search_analyzer": "custom_analyzer_with_synonyms",
"fields": {
"plain": {
"type": "text",
"analyzer": "standard"
}
}
}
}
}
}
Search Query:
{
"query": {
"match": {
"body": "something"
}
},
"highlight": {
"pre_tags": "<strong>",
"post_tags": "</strong>",
"fields" : {
"body.plain" : {
"number_of_fragments": 1,
"require_field_match": false
}
}
}
}
I am not sure about the reason behind the problem. I'd have thought that simply highlighting on a non-synonym-analyzed field would have done it. But according to the comments, it is still highlighting the synonyms. There are two possible reasons I can think of (I haven't looked into the highlighter source code):
1. It could be because of the multi-word synonym problem mentioned in this link: https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-word-synonyms.html It could be fixed by now, since the link is old. If not, it could be causing the highlighter to look at the wrong position offsets.
2. And/or it could be because the highlighted field is not used in the query. The highlighter might simply be using the tokens emitted by the searched field's analyzer (which would contain synonyms) and looking for those tokens in the highlighted field.
If it's the first problem, you could try to change your synonyms to use simple contraction. See: https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html#synonyms-contraction But it has its own problems with the frequencies of uncommon words and could be a lot of work.
Fixing the second case would mean using the "body.plain" field in the query, but you cannot do that since it affects your scores. In that case, specifying a different query for the highlighter (so that scores are not affected) on the non-synonym field does the trick. It works even if the first case is the problem too, since we are not using synonyms in the highlighted field.
So your query should look something like this:
{
"query": {
"match": {
"body": "something"
}
},
"highlight": {
"pre_tags": "<strong>",
"post_tags": "</strong>",
"fields" : {
"body.plain" : {
"number_of_fragments": 1,
"highlight_query": {
"match": {"body.plain": "something"}
}
}
}
}
}
See: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/search-request-highlighting.html#_highlight_query
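Because the search text has to appear twice (once in the scoring query, once in highlight_query), it can help to generate the body from a single value. A minimal Python sketch, assuming the field names from the question; build_search is a hypothetical helper and the closing tag is written as </strong>:

```python
import json

def build_search(text, scored_field="body", highlight_field="body.plain"):
    """Score on the synonym-analyzed field, highlight on the plain sub-field."""
    return {
        "query": {"match": {scored_field: text}},
        "highlight": {
            "pre_tags": ["<strong>"],
            "post_tags": ["</strong>"],
            "fields": {
                highlight_field: {
                    "number_of_fragments": 1,
                    # Separate query for highlighting, so scores are not affected.
                    "highlight_query": {"match": {highlight_field: text}},
                }
            },
        },
    }

print(json.dumps(build_search("something"), indent=2))
```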

Elasticsearch - aggregating multi level hierarchy

I am facing a problem with providing aggregated search result of documents with multi level hierarchy. Simplified documents structure looks like this:
Magazine title (Hunting) -> Magazine year (1999) -> Magazine issue (II.) -> Pages (Text of pages ...)
Every level of document is mapped to its parent by the attribute "parentDocumentId".
I have prepared simple query, which works just fine for hierarchy with just 2 levels:
POST http://localhost:9200/my_index/document/_search?search_type=count&q=hunter
{
"query": {
"multi_match" : {
"query": "hunter",
"fields": [ "title", "text", "labels" ]
}
},
"aggregations": {
"my_agg": {
"terms": {
"field": "parentDocumentId"
}
}
}
}
This query searches through the text of pages and, instead of giving me thousands of pages containing the word "hunter", returns buckets of documents (aggregated by parentDocumentId). However, these buckets represent only the "Magazine issues" that contain these pages.
Response:
{
"took": 54,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 44,
"max_score": 0,
"hits": []
},
"aggregations": {
"my_agg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 5,
"doc_count": 43
},
{
"key": 0,
"doc_count": 1
}
]
}
}
}
What I need is to aggregate search results at the highest possible level; in this particular case, at the "Magazine title" level. This could be done outside the Elasticsearch query (on our application side), but as I see it, it should definitely be done in Elasticsearch (for performance and other reasons).
Does anybody have experience with a similar aggregation? Are Elasticsearch aggregations the right approach to use?
Every idea is welcome.
Thanks
Peter
Update:
Our mapping looks like this:
{
"my_index": {
"mappings": {
"document": {
"properties": {
"dateIssued": {
"type": "date",
"format": "dateOptionalTime"
},
"documentId": {
"type": "long"
},
"filter": {
"properties": {
"geo_bounding_box": {
"properties": {
"issuedLocation": {
"properties": {
"bottom_right": {
"properties": {
"lat": {
"type": "double"
},
"lon": {
"type": "double"
}
}
},
"top_left": {
"properties": {
"lat": {
"type": "double"
},
"lon": {
"type": "double"
}
}
}
}
}
}
}
}
},
"issuedLocation": {
"type": "geo_point"
},
"labels": {
"type": "string"
},
"locationLinks": {
"type": "geo_point"
},
"parentDocumentId": {
"type": "long"
},
"query": {
"properties": {
"match_all": {
"type": "object"
}
}
},
"storedLocation": {
"type": "geo_point"
},
"text": {
"type": "string"
},
"title": {
"type": "string"
},
"type": {
"type": "string"
}
}
}
}
}
}
That means we use one mapping for all types of documents. We are indexing a set of books, newspapers and other press. So sometimes there is only one parent for a set of pages, and sometimes there are multiple levels of parents above the page level.
To distinguish the type of document there is an attribute "type".
When indexing the top levels (these mostly contain book metadata) we leave the "text" attribute empty and always specify the parent of the document using parentDocumentId. The top-level documents have their parentDocumentId set to 0. When indexing the lowest level (pages), we provide only the text attribute and parentDocumentId for the indexed document.
The link used is very similar to a classic one-to-many mapping (a magazine has many years, a year has many issues, an issue has many pages).
You could also say that we have flattened the nested documents in Elasticsearch, but the reason for this is that there are multiple document types, which can have hierarchies of different depths.
You need to rethink your data modelling. In essence, you need a join over your data and moreover the join needs to be over an arbitrarily deep hierarchy. That is a problem even in relational databases let alone in a fulltext search engine like Elasticsearch.
Elasticsearch does support a couple of joins. You could use nested documents - a single document with all the subdocs nested. That's clearly not ideal in your case.
You could use the parent-child relationship feature which lets you index your (sub-)docs separately always referring to their parent. Underneath, that feature uses Lucene's blockjoin. However, to aggregate over a hierarchy, you would have to explicitly specify the join - listing all the intermediate steps. You want to always aggregate by the top-most available doc but that could be a different level each time (once a magazine, another time a magazine collection or perhaps a publisher).
I would consider indexing each doc with a field pointing to the top-most document. Then you can easily aggregate by that field. It would mean precomputing a part of the complex aggregation you want to do but it would result in fast aggregations and updates also wouldn't be very painful. It all depends on the source of your data, how you imagine that it will change, what updates and other queries you'll need to do.
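The precomputation suggested here can be sketched in Python. It is an illustration under assumptions: the field name rootDocumentId, the helper names and the sample ids are all hypothetical; only parentDocumentId, documentId and the convention that top-level documents have parent 0 come from the question.

```python
from collections import Counter

def top_level_id(doc_id, parent_of):
    """Follow parentDocumentId links until reaching a document whose parent is 0."""
    seen = set()
    current = doc_id
    while parent_of.get(current, 0) != 0:
        if current in seen:
            raise ValueError(f"cycle in hierarchy at document {current}")
        seen.add(current)
        current = parent_of[current]
    return current

# Hypothetical ids: title 1 -> year 2 -> issue 3 -> pages 4 and 5; book 6 -> page 7.
parent_of = {1: 0, 2: 1, 3: 2, 4: 3, 5: 3, 6: 0, 7: 6}

def with_root(doc):
    """Return a copy of the document with the precomputed top-level id added."""
    return {**doc, "rootDocumentId": top_level_id(doc["documentId"], parent_of)}

page = with_root({"documentId": 4, "parentDocumentId": 3, "text": "hunter"})
print(page["rootDocumentId"])  # 1

# What a terms aggregation on rootDocumentId would count for three matching pages.
counts = Counter(top_level_id(hit, parent_of) for hit in [4, 5, 7])
print(sorted(counts.items()))  # [(1, 2), (6, 1)]
```

With the extra field stored at index time, the terms aggregation from the question would only need its "field" changed from parentDocumentId to the new field, regardless of how deep each hierarchy is.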
This blog post could help to guide you a bit too: https://www.elastic.co/blog/managing-relations-inside-elasticsearch

Resources