Is reindexing to a new index necessary after updating settings and mappings to support a multi-field in Elasticsearch?

Please consider the scenario.
Existing System
I have an index named contacts_index with 100 documents.
Each document has a property named city with some text value in it.
The index has the following settings:
{
  "analyzer": {
    "city_analyzer": {
      "filter": [
        "lowercase"
      ],
      "tokenizer": "city_tokenizer"
    },
    "search_analyzer": {
      "filter": [
        "lowercase"
      ],
      "tokenizer": "keyword"
    }
  },
  "tokenizer": {
    "city_tokenizer": {
      "token_chars": [
        "letter"
      ],
      "min_gram": "2",
      "type": "ngram",
      "max_gram": "30"
    }
  }
}
The index has the following mapping for the city field to support sub-text matching and keyword search.
{
  "city" : {
    "type" : "text",
    "analyzer" : "city_analyzer",
    "search_analyzer" : "search_analyzer"
  }
}
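As a quick sanity check (not part of the original setup), the _analyze API can confirm what each analyzer produces; the index name is the one described above:

POST /contacts_index/_analyze
{
  "analyzer": "city_analyzer",
  "text": "Seattle"
}

This should return every lowercased n-gram of "seattle" between 2 and 30 characters, while the same request with "analyzer": "search_analyzer" should return the single token "seattle".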
Proposed System
Now we want to perform autocomplete on the city field. For example, for a city with the value Seattle, we want to get the document when the user types s, se, sea, seat, seatt, seattl, or seattle, but only when they query with a prefix of the text, not when they type something like eattle.
We plan to attain this with the help of an additional multi-field on the city property, also of type text but with a different analyzer.
To do this we have done the following.
Updated the settings to support autocomplete:
PUT /staging-contacts-index-v4.0/_settings?preserve_existing=true
{
  "analysis": {
    "analyzer": {
      "autocomplete_analyzer": {
        "filter": [
          "lowercase"
        ],
        "tokenizer": "autocomplete_tokenizer"
      }
    },
    "tokenizer": {
      "autocomplete_tokenizer": {
        "token_chars": [
          "letter"
        ],
        "min_gram": "1",
        "type": "edge_ngram",
        "max_gram": "100"
      }
    }
  }
}
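Note that analysis settings are static: Elasticsearch only accepts new analyzers and tokenizers on a closed index, so in practice the update above has to be wrapped in a close/open sequence, roughly:

POST /staging-contacts-index-v4.0/_close

PUT /staging-contacts-index-v4.0/_settings?preserve_existing=true
{ ...analysis settings as above... }

POST /staging-contacts-index-v4.0/_open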
Updated the mapping of the city field with an autocomplete multi-field to support autocomplete:
{
  "city" : {
    "type" : "text",
    "fields" : {
      "autocomplete" : {
        "type" : "text",
        "analyzer" : "autocomplete_analyzer",
        "search_analyzer" : "search_analyzer"
      }
    },
    "analyzer" : "city_analyzer",
    "search_analyzer" : "search_analyzer"
  }
}
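For completeness, applying this mapping change would look roughly like the following (assuming Elasticsearch 7.x and its typeless mapping API; adding a new multi-field to an existing field is allowed, while the parent field's analyzers must stay the same):

PUT /staging-contacts-index-v4.0/_mapping
{
  "properties": {
    "city": {
      "type": "text",
      "fields": {
        "autocomplete": {
          "type": "text",
          "analyzer": "autocomplete_analyzer",
          "search_analyzer": "search_analyzer"
        }
      },
      "analyzer": "city_analyzer",
      "search_analyzer": "search_analyzer"
    }
  }
}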
Findings
For any new document created after updating the autocomplete multi-field settings and mapping, autocomplete search works as expected.
For existing documents, if the value of the city field changes (for example, from seattle to chicago), the document is fetched by the autocomplete search.
We planned to use the Update API to fetch and update the existing 100 documents so that autocomplete works for them as well. However, when we try the Update API, we get
{"result" : "noop"}
and the autocomplete search does not work.
I infer that, since the values are not changing, Elasticsearch does not re-create tokens for the autocomplete field.
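For reference, an autocomplete search against the new multi-field is just a match query on city.autocomplete (a sketch using the field names above):

GET /contacts_index/_search
{
  "query": {
    "match": {
      "city.autocomplete": "sea"
    }
  }
}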
Question
From the research we have done, there are two options to make sure the existing 100 documents can be found by the autocomplete search (rough sketches of both follow below).
1. Use the Reindex API to reindex the existing 100 documents into a new index.
2. Fetch all 100 documents and use the Index API to write them back, which will create the autocomplete tokens in the process.
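Rough sketches of the two options (index names other than contacts_index are placeholders, and these are sketches rather than a recommendation of either approach):

Option 1, the Reindex API into a new index that already carries the new settings and mapping:

POST /_reindex
{
  "source": { "index": "contacts_index" },
  "dest": { "index": "contacts_index_v2" }
}

Option 2 is a GET of each document followed by writing its unchanged _source back with the Index API (PUT /contacts_index/_doc/<id>), which re-runs analysis and therefore produces the autocomplete tokens.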
Which option is preferable and why?
Thanks for taking the time to read through.

Related

Liferay portal 7.3.7 case insensitive, diacritics free with ElasticSearch

I am having a dilemma on Liferay Portal 7.3.7 with case-insensitive and diacritics-free search through Elasticsearch in JournalArticles with custom DDM fields. Liferay generated field mappings in Configuration->Search like this:
...
},
"localized_name_sk_SK_sortable" : {
"store" : true,
"type" : "keyword"
},
...
I would like these *_sortable fields to be usable for case-insensitive and diacritics-free searching, so I tried to add an analyzer and a normalizer to the Liferay advanced search configuration in System Settings->Search->Elasticsearch 7, like this:
{
  "analysis": {
    "analyzer": {
      "ascii_analyzer": {
        "tokenizer": "standard",
        "filter": ["asciifolding", "lowercase"]
      }
    },
    "normalizer": {
      "ascii_normalizer": {
        "type": "custom",
        "char_filter": [],
        "filter": ["lowercase", "asciifolding"]
      }
    }
  }
}
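Assuming these settings end up on the Liferay company index (the index name below is a placeholder), the analyzer can be sanity-checked with the _analyze API:

GET /liferay-company-index/_analyze
{
  "analyzer": "ascii_analyzer",
  "text": "časť 1 ľščťžýáíéôň"
}

which should return lowercased, ASCII-folded tokens such as cast, 1 and lsctzyaieon.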
After that, I overrode the mapping for template_string_sortable:
{
  "template_string_sortable" : {
    "mapping" : {
      "analyzer": "ascii_analyzer",
      "normalizer": "ascii_normalizer",
      "store" : true,
      "type" : "keyword"
    },
    "match_mapping_type" : "string",
    "match" : "*_sortable"
  }
}
After reindexing, my sortable fields look like this:
...
},
"localized_name_sk_SK_sortable" : {
"normalizer" : "ascii_normalizer",
"store" : true,
"type" : "keyword"
},
...
Next, I try to create new content for my DDM structure, but all my sortable fields look the same, like this:
"localized_title_sk_SK": "test diakrity časť 1 ľščťžýáíéôň title",
"localized_title_sk_SK_sortable": "test diakrity časť 1 ľščťžýáíéôň title",
but I need the sortable field to be stored without national characters so that, for example, I can find it by "cast 1" through a wildcardQuery on localized_title_sk_SK_sortable, and so on. Thanks for any advice (maybe my whole approach to the problem is wrong? I am really new to ES).
First of all, it would be better to apply the asciifolding filter first and then lowercase. But keep in mind these filters only affect how the field is analyzed for search: your _source data will not be changed by applying an analyzer to the field.
If you need to manipulate the data before ingesting it, you can use the ingest pipeline feature in Elasticsearch; see the documentation for more information.
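A minimal sketch of that ingest-pipeline idea (the pipeline name and target field are placeholders; note there is no built-in ASCII-folding processor, so stripping diacritics at ingest time would additionally need something like a script processor):

PUT _ingest/pipeline/sortable_cleanup
{
  "description": "placeholder pipeline that lowercases the sortable copy before it is indexed",
  "processors": [
    { "lowercase": { "field": "localized_title_sk_SK_sortable" } }
  ]
}

The pipeline can then be attached per request with ?pipeline=sortable_cleanup or via the index's index.default_pipeline setting.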

Auto Suggestions in Elastic Search after 3 letters

I have a search query that does a basic search after a complete word is typed in. I'm looking for auto-suggestions after 3 letters.
For example:
Title: samsung galaxy s4
I want to see auto-suggestions after "sam" instead of the complete word "samsung".
While the ngram filter works, there is a dedicated suggester for this use case, called the completion suggester. It uses a different data structure internally, which allows it to serve suggestions in the millisecond range, making it much faster than a regular query over edge ngrams. Check out the documentation here:
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/search-suggesters-completion.html
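A minimal sketch of the completion-suggester approach (index and field names are illustrative; the syntax below assumes a recent, typeless-mapping version of Elasticsearch):

PUT /products
{
  "mappings": {
    "properties": {
      "title_suggest": { "type": "completion" }
    }
  }
}

POST /products/_search
{
  "suggest": {
    "title-suggestions": {
      "prefix": "sam",
      "completion": { "field": "title_suggest" }
    }
  }
}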
You need to use an edge n-gram tokenizer for this.
{
  "analysis": {
    "tokenizer": {
      "autocomplete_tokenizer": {
        "type": "edgeNGram",
        "min_gram": "3",
        "max_gram": "20"
      }
    },
    "analyzer": {
      "autocomplete_edge_ngram": {
        "filter": ["lowercase"],
        "type": "custom",
        "tokenizer": "autocomplete_tokenizer"
      }
    }
  }
}
and the mapping will be:
{
  "title_edge_ngram": {
    "type": "text",
    "analyzer": "autocomplete_edge_ngram",
    "search_analyzer": "standard"
  }
}
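With that mapping in place, a plain match query against the field (index name is illustrative) returns matches once at least three characters have been typed, because min_gram is 3:

GET /products/_search
{
  "query": {
    "match": {
      "title_edge_ngram": "sam"
    }
  }
}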
Or you can use the completion suggester in Elasticsearch.
For the three-character minimum, you have to enforce it on the client side.

How can I index a field using two different analyzers in Elastic search

Say that I have a field "productTitle" which I want to use for my users to search for products.
I also want to apply autocomplete functionality, so I'm using an autocomplete_analyzer with the following filter:
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10
}
However, when users actually run a search I don't want the edge_ngram to be applied, since it produces a lot of irrelevant results.
For example, when users want to search for "mi" and start typing "m", "mi", ... they should get results starting with m, mi as auto-complete options. However, when they actually run the query, they should only get results containing the word "mi". Currently they also see results with "mini", etc.
Therefore, is it possible to have "productTitle" indexed using two different analyzers? Is multi-field type an option for me?
EDIT: Mapping for productTitle
"productTitle" : {
"type" : "string",
"index_analyzer" : "second",
"search_analyzer" : "standard",
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
,
"second" analyzer
"analyzer": {
"second": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trim",
"autocomplete_filter"
]
}
So when I'm querying with:
"filtered" : {
  "query" : {
    "match" : {
      "productTitle" : {
        "query" : "mi",
        "type" : "boolean",
        "minimum_should_match" : "2<75%"
      }
    }
  }
}
I also get results like "mini", but I need to get only results containing just "mi".
Thank you.
Hmm, as far as I know there is no way to apply multiple analyzers to the same field; what you can do is use multi-fields.
Here is an example of how to apply different analyzers to sub-fields:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html#_multi_fields_with_multiple_analyzers
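A sketch of what that could look like here, reusing your existing "second" analyzer (the sub-field name is illustrative): keep productTitle on the standard analyzer for the final search and add an autocomplete sub-field analyzed with "second" for the as-you-type suggestions.

"productTitle": {
  "type": "string",
  "analyzer": "standard",
  "fields": {
    "autocomplete": {
      "type": "string",
      "analyzer": "second",
      "search_analyzer": "standard"
    }
  }
}

Suggestion queries then go against productTitle.autocomplete, while the final query goes against productTitle, so searching "mi" there no longer matches "mini".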
The correct way of preventing what you describe in your question is to specify both analyzer and search_analyzer in your field mapping, like this:
"productTitle": {
"type": "string",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "standard"
}
The autocomplete analyzer will kick in at indexing time and tokenize your title according to your edge_ngram configuration and the standard analyzer will kick in at search time without applying the edge_ngram stuff.
In this context, there is no need for multi-fields unless you need to tokenize the productTitle field in different ways.

How to index both a string and its reverse?

I'm looking for a way to analyze the string "abc123" as ["abc123", "321cba"]. I've looked at the reverse token filter, but that only gets me ["321cba"]. Documentation on this filter is pretty sparse, only stating that
"A token filter of type reverse ... simply reverses each token."
(see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html).
I've also tinkered with using the keyword_repeat filter, which gets me two instances. I don't know if that's useful, but for now all it does is reverse both instances.
How can I use the reverse token filter but keep the original token as well?
My analyzer:
{ "settings" : { "analysis" : {
"analyzer" : {
"phone" : {
"type" : "custom"
,"char_filter" : ["strip_non_numeric"]
,"tokenizer" : "keyword"
,"filter" : ["standard", "keyword_repeat", "reverse"]
}
}
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
}
}}}
Create and PUT an analyzer that reverses a string (say, reverse_analyzer).
PUT index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "type": "custom",
          "char_filter": [
            "strip_non_numeric"
          ],
          "tokenizer": "keyword",
          "filter": [
            "standard",
            "keyword_repeat",
            "reverse"
          ]
        }
      },
      "char_filter": {
        "strip_non_numeric": {
          "type": "pattern_replace",
          "pattern": "[^0-9]",
          "replacement": ""
        }
      }
    }
  }
}
Then, for a field (say phone_no), create a type and add a mapping like this:
PUT index_name/type_name/_mapping
{
  "type_name": {
    "properties": {
      "phone_no": {
        "type": "string",
        "fields": {
          "reverse": {
            "type": "string",
            "analyzer": "reverse_analyzer"
          }
        }
      }
    }
  }
}
So phone_no is a multi-field, which stores both the string and its reverse:
if you index
phone_no: 911220
then in Elasticsearch there will be two fields,
phone_no: 911220 and phone_no.reverse: 022119, so you can search and filter on either the reversed or the non-reversed field.
Hope this helps.
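A sketch of searching across both the original and the reversed sub-field at once (the index name and query value are illustrative):

GET /index_name/_search
{
  "query": {
    "multi_match": {
      "query": "911220",
      "fields": ["phone_no", "phone_no.reverse"]
    }
  }
}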
I don't believe you can do this directly, as I am unaware of any way to get the reverse token filter to also output the original.
However, you could use the fields parameter to index both the original and the reversed at the same time with no additional coding. You would then search both fields.
So let's say your field was called phone_number:
"phone_number": {
"type": "string",
"fields": {
"reverse": { "type": "string", "index": "phone" }
}
}
In this case we're indexing using the default analyzer (assume standard), plus also indexing into reverse with your custom analyzer phone, which reverses. You then issue your queries against both fields.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html
I'm not sure it's possible to do this using the built-in set of token filters. I would recommend creating your own plugin. There is the ICU Analysis plugin supported by the Elasticsearch team that you can use as an example.
I wound up using the following two char_filters in my analyzer. It's an ugly abuse of regex, but it seems to work. It is limited to the first 20 numeric characters, but in my use case that is acceptable.
First it groups all numeric characters, then explicitly rebuilds the string with its own (numeric-only!) reverse. The space in the center of the replacement pattern then causes the tokenizer to split it into two tokens - the original and the reverse.
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
,"dupe_and_reverse" : {
"type" : "pattern_replace"
,"pattern" : "([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)"
,"replacement" : "$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15$16$17$18$19$20 $20$19$18$17$16$15$14$13$12$11$10$9$8$7$6$5$4$3$2$1"
}
}

ElasticSearch nGram filters out punctuation

In my ElasticSearch dataset we have unique IDs that are separated with a period. A sample number might look like c.123.5432
Using an nGram I'd like to be able to search for: c.123.54
This doesn't return any results. I believe the tokenizer is splitting on the period. To account for this I added "punctuation" to the token_chars, but there's no change in results. My analyzer/tokenizer is below.
I've also tried "token_chars": [], which, per the documentation, should keep all characters.
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"my_ngram_analyzer" : {
"tokenizer" : "my_ngram_tokenizer"
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "1",
"max_gram" : "10",
"token_chars": [ "letter", "digit", "whitespace", "punctuation", "symbol" ]
}
}
}
}
},
Edit (more info):
This is the mapping of the relevant field:
"ProjectID":{"type":"string","store":"yes", "copy_to" : "meta_data"},
And this is the field I'm copying it into (which also has the ngram analyzer):
"meta_data" : { "type" : "string", "store":"yes", "index_analyzer": "my_ngram_analyzer"}
This is the command I'm using in Sense to see if my search works (note that it's searching the meta_data field):
GET /_search?pretty=true
{
  "query": {
    "match": {
      "meta_data": "c.123.54"
    }
  }
}
Solution from s1monw at https://github.com/elasticsearch/elasticsearch/issues/5120
When only index_analyzer is set, search falls back to the standard analyzer. To fix it, I changed index_analyzer to analyzer. Keep in mind the number of results will increase greatly, so raising min_gram to a higher number may be necessary.
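In other words, the field definition from the edit above becomes roughly:

"meta_data" : { "type" : "string", "store" : "yes", "analyzer" : "my_ngram_analyzer" }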
