Elasticsearch - can I define index time analyzer on document level?

I want to index pages in multiple languages into a single index, but for each language I need to define a custom language analyzer. So an English page would use the english analyzer, and a Czech page the czech analyzer.
At search time I would set the correct analyzer based on the current locale, as I do not need to search across languages.
It appears this was possible in early versions of Elasticsearch, but I cannot find a way to do it in 7.6.
Is there a way to achieve this, or do I really need to create an index for each type in each language? That would lead to many indices with only a small number of indexed documents.
Or is there a better way to handle this scenario? We are considering about 20 languages and several document types (as far as I understand, types are now deprecated, so each needs its own index).

You can use the fields feature (multi-fields), which is available in Elasticsearch 7.6 and allows you to store the different languages in a single index; at query time you simply target the sub-field of the language you want to query.
In fact, there is a nice official blog post from Elastic covering different approaches to multilingual search, and the approach given here is inspired by the one it calls per-field language search.
Example
A sample index mapping would look like this:
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "fr": {
            "type": "text",
            "analyzer": "french"
          },
          "es": {
            "type": "text",
            "analyzer": "spanish"
          },
          "estonian": {
            "type": "text",
            "analyzer": "estonian"
          }
        }
      }
    }
  }
}
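For example (a minimal sketch, assuming the mapping above lives in an index named pages, which is not part of the original answer), indexing and searching in French just targets the title.fr sub-field, so the French analyzer is applied both at index and at query time:
PUT pages/_doc/1
{
  "title": "Les pages sont indexées"
}

POST pages/_search
{
  "query": {
    "match": {
      "title.fr": "indexées"
    }
  }
}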

Related

How to conditionally apply an analyzer at index time to a field that could be one of many languages?

I have documents with a field (e.g. input_text) that contains a string that could be in one of 20-odd languages. I have another field that holds the short form of the language (e.g. lang).
I want to conditionally apply an analyzer at index time to the text field, depending on the language detected in the language field.
I eventually want a Kibana dashboard with a single word cloud of the most common words in the text field (i.e. in multiple languages), but only words that have been stemmed and tokenized with stop words removed.
Is there a way to do this?
The Elasticsearch docs suggest using multiple fields, one per language, and then specifying an analyzer for the appropriate field, but I can't do this as there are 20-some languages and this would overload my nodes.
There is no way to achieve what you want in Elasticsearch (applying an analyzer to field A based on the value of field B).
I would recommend creating one index per language, then creating an index alias that groups all those indices, and querying against it:
PUT lang_de
{
  "mappings": {
    "properties": {
      "input_text": {
        "type": "text",
        "analyzer": "german"
      }
    }
  }
}

PUT lang_en
{
  "mappings": {
    "properties": {
      "input_text": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "lang_*",
        "alias": "lang"
      }
    }
  ]
}
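With the alias in place (a minimal sketch, not part of the original answer), all languages can be searched through lang, and each backing index analyzes the query against input_text with its own configured analyzer:
POST lang/_search
{
  "query": {
    "match": {
      "input_text": "Häuser"
    }
  }
}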

Partial update into large document

I'm facing a performance problem. My application is a chat application.
I designed the index mapping with nested objects, like below:
{
  "conversation_id-v1": {
    "mappings": {
      "stream": {
        "properties": {
          "id": {
            "type": "keyword"
          },
          "message": {
            "type": "text",
            "fields": {
              "analyzerName": {
                "type": "text",
                "term_vector": "with_positions_offsets",
                "analyzer": "analyzerName"
              },
              "language": {
                "type": "langdetect",
                "analyzer": "_keyword",
                "languages": ["en", "ko", "ja"]
              }
            }
          },
          "comments": {
            "type": "nested",
            "properties": {
              "id": {
                "type": "keyword"
              },
              "message": {
                "type": "text",
                "fields": {
                  "analyzerName": {
                    "type": "text",
                    "term_vector": "with_positions_offsets",
                    "analyzer": "analyzerName"
                  },
                  "language": {
                    "type": "langdetect",
                    "analyzer": "_keyword",
                    "languages": ["en", "ko", "ja"]
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
(The mapping actually has a lot more fields.)
A document has around 4,000 nested objects. When I upsert data into a document, the CPU peaks at 100% and disk I/O spikes on writes. The input rate is around 1,000 documents per second.
How can I tune this to improve performance?
Hardware
3 × 2 vCPUs, 13 GB RAM on GCP
4,000 nested objects sounds like a lot. If I were you, I would look long and hard at your mapping design to be very certain you actually need that many nested objects.
Quoting from the docs:
Internally, nested objects index each object in the array as a separate hidden document.
Since a document has to be fully reindexed on update, you're indexing 4000 documents with a single update.
Why so many fields?
The reason you gave in the comments for needing so many fields
I'd like to search comments in nested and come with their parent stream for display.
makes me think that you may be mixing two concerns here.
Elasticsearch is meant for search, and your mapping should be optimized for search. If your mapping shape is dictated by the way you want to display information, then something is wrong.
Design your index around search
Note that by "search" I mean both indexing and querying.
For the use case you have, it seems like you could:
Index only the comments, with a reference (some id) to the parent stream in the indexed comment document.
After you get the search results (a list of comments) back from the search index, you can retrieve each comment along with its parent stream from some other data source (e.g. a relational database).
The point is, it may be much more efficient to re-retrieve the comment along with whatever else you want from some other source that is better than Elasticsearch at joining data.
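A minimal sketch of what such a comment-only index might look like (index and field names are illustrative, not taken from the original mapping):
PUT comments
{
  "mappings": {
    "properties": {
      "comment_id": { "type": "keyword" },
      "stream_id":  { "type": "keyword" },
      "message":    { "type": "text" }
    }
  }
}
Each comment is then a small standalone document, so an upsert touches one document instead of reindexing a stream containing 4,000 nested objects.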

Elasticsearch - mappings VS mapping?

I am new to Elasticsearch. I am looking at some index file definitions and ran across the words "mappings" and "mapping", as seen below. I searched all around Elasticsearch's documentation site and found both words referred to a bit, but never an explicit explanation of the difference. Is "mappings" just the plural of "mapping", and do they accept the same parameters? Is the singular "mapping" different since it's nested in the "dynamic_templates" scope? This seems to be the case, but I can't find anything in the documentation to confirm it. Thanks.
{ <--- top level
  ...some JSON...
  "mappings": { //<--- plural
    "_doc": {
      "dynamic_templates": [
        {
          "space": {
            "match_mapping_type": "string",
            "match": "space",
            "mapping": { <--- singular!
              "type": "keyword",
              "ignore_above": 64,
              "fields": {
                "analyzed": {
                  "type": "text",
                  "analyzer": "english"
                }
              }
            }
            ...more JSON...
The first mappings occurrence is the structure where you define your mapping types. Historically, one was allowed to define several mapping types in a single index, but since the great mapping refactoring only one mapping type is allowed. That's why mappings is in the plural form; it will soon disappear.
The second mapping is simply a keyword used when defining dynamic field templates. The match* part identifies the dynamic field and the mapping part defines the mapping for that field. It's kind of an advanced feature, so don't worry if you don't grasp it immediately.
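For reference, a sketch of the same template in current (7.x, typeless) syntax, where the _doc mapping-type level is gone and dynamic_templates sits directly under mappings (the index name my_index is illustrative):
PUT my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "space": {
          "match_mapping_type": "string",
          "match": "space",
          "mapping": {
            "type": "keyword",
            "ignore_above": 64,
            "fields": {
              "analyzed": {
                "type": "text",
                "analyzer": "english"
              }
            }
          }
        }
      }
    ]
  }
}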

How to handle auto completion on multi word text?

My input is multi-word English text, and I need to implement an autocompletion feature for it.
I initially looked at search completion suggesters, only to figure out that those can only match the first characters of the input. This is fine for autocompletion of product names or addresses, but not very useful when autocompletion is required on any word in the input text.
After that I set up an edge_ngram analyzer and a query to locate the documents which contain the input string. That works just fine, but I don't know how to use this information to provide options for my autocompletion.
I could use a highlighter to show the words which match the query, and that data could in turn be used to build a list of options. This solution seems rather hacky and not very elegant, and I wonder how this problem is usually solved.
I'm unfortunately not able to maintain another field which could hold the autocompletion options for the documents.
I'm currently using the highlight information from the query to construct the autocomplete options.
My Query:
{
  "query": {
    "match": {
      "fields.content.auto": {
        "query": "content co",
        "analyzer": "standard"
      }
    }
  },
  "highlight": {
    "fields": {
      "fields.content.auto": {
        "fragment_size": 0,
        "number_of_fragments": 10,
        "pre_tags": [ "%ha%" ],
        "post_tags": [ "%he%" ]
      }
    }
  },
  "_source": ["uuid", "language"]
}
My auto field uses the autocomplete analyzer:
"auto": {
"type": "string",
"analyzer": "autocomplete"
}
And this is the index configuration that I'm using:
{
  "analysis": {
    "filter": {
      "my_stop": {
        "type": "stop",
        "stopwords": "_english_"
      },
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "autocomplete": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "my_stop",
          "autocomplete_filter"
        ]
      }
    }
  }
}
The solution was mainly inspired by the Search-as-you-type post.
I process the response JSON in order to get the autocomplete options.
The highlight information is used to extract all matched tokens. These tokens are then used to construct the potential autocomplete phrases, comparing them to the phrase the user has already entered. The neat thing is that a stop-word filter can be applied, so stop words will never be highlighted and in turn never be used for autocomplete suggestions.
A PoC Java implementation of this processor can be found here.
I'm not yet sure whether I'll run with this solution but I want to share it anyway.
I think your best option is to create a dedicated index for storing just the suggestions, using the edge_ngram analyzer. If you use the completion suggesters you need to explicitly define your actual suggestions anyway. The completion suggester is also document-centric in ES 5.x, so if you index multiple documents with the same suggestions you will get duplicate suggestions returned on a match. There is a de-duplication option in ES 6, but it has only just been released.
If you have a dedicated suggestion index you can use a hash of the suggestion as a document ID to avoid duplicates. You can start indexing document titles and other useful meta data as suggestions. Later on you could include historical searches entered by users that are seen as successful due to the user ultimately clicking on or purchasing the returned results.
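A minimal sketch of such a dedicated suggestion index (index and field names are illustrative, shown in 7.x typeless syntax rather than the ES 5/6 syntax discussed above). The document _id is a hash of the suggestion text, so re-indexing the same phrase overwrites the existing entry rather than creating a duplicate:
PUT suggestions
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "autocomplete_filter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}

PUT suggestions/_doc/3a7bd3e2360a3d29eea436fcfb7e44c7
{
  "text": "multi word autocomplete example"
}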

How to search for # or . in Elasticsearch

I have a field, under a type company, in my Elasticsearch index which captures the technologies that a company uses. So people coming to our site might enter java, Java, C#, c#, .Net, .net, etc. in the search box to find the companies.
Initially I indexed this in the default way, and then I couldn't search for .Net or C# as there were wildcard characters in the search query. When I searched with Net or C it returned companies that use C or C#, which again is not correct.
I did some research and changed the mapping for the field to "index": "not_analyzed" and re-indexed the companies. Now it returned the correct companies for C# and .Net, but failed when the search term was not an exact match. So it didn't return companies with Java technologies when the search term was java, but it returned them correctly when the search term was Java. I understand that not_analyzed requires an exact match.
How do I index and query on the same field to get both these cases sorted out?
The way to achieve what you need is to create a custom analyzer that does a little bit more than not_analyzed does, i.e. it also lowercases the terms:
curl -XPUT localhost:9200/test_index -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "lowercase_keyword"
        }
      }
    }
  }
}'
Then when you index a document that contains Java, it will be indexed as java, C# as c#, etc.
This will bring the benefits of case-insensitive exact matches.
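For example (a minimal sketch against the test_index above), a query for C# or JAVA is run through the same lowercase_keyword analyzer at search time, so it matches the stored c# and java terms exactly:
curl -XPOST localhost:9200/test_index/_search -d '{
  "query": {
    "match": {
      "name": "C#"
    }
  }
}'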
