Liferay Portal 7.3.7 case-insensitive, diacritics-free search with Elasticsearch

I am facing a dilemma on Liferay Portal 7.3.7 with case-insensitive and diacritics-free search through Elasticsearch in JournalArticles with custom DDM fields. Liferay generated field mappings in Configuration->Search like this:
...
},
"localized_name_sk_SK_sortable" : {
  "store" : true,
  "type" : "keyword"
},
...
I would like to make these *_sortable fields usable for case-insensitive and diacritics-free searching, so I tried to add an analyzer and a normalizer to the Liferay search advanced configuration in System Settings->Search->Elasticsearch 7, like this:
{
  "analysis": {
    "analyzer": {
      "ascii_analyzer": {
        "tokenizer": "standard",
        "filter": ["asciifolding", "lowercase"]
      }
    },
    "normalizer": {
      "ascii_normalizer": {
        "type": "custom",
        "char_filter": [],
        "filter": ["lowercase", "asciifolding"]
      }
    }
  }
}
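To sanity-check the analyzer, it can be run through the _analyze API once the settings are applied (a quick sketch; the index name is a placeholder for whatever index Liferay creates):

POST my_liferay_index/_analyze
{
  "analyzer": "ascii_analyzer",
  "text": "časť 1 ľščťžýáíéôň"
}

With the standard tokenizer plus asciifolding and lowercase, this should come back as the tokens cast, 1 and lsctzyaieon.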
After that, I overrode the mapping for template_string_sortable:
{
  "template_string_sortable" : {
    "mapping" : {
      "analyzer": "ascii_analyzer",
      "normalizer": "ascii_normalizer",
      "store" : true,
      "type" : "keyword"
    },
    "match_mapping_type" : "string",
    "match" : "*_sortable"
  }
}
After reindexing, my sortable fields look like this:
...
},
"localized_name_sk_SK_sortable" : {
  "normalizer" : "ascii_normalizer",
  "store" : true,
  "type" : "keyword"
},
...
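The attached normalizer can be verified the same way against the mapped field (again a sketch with a placeholder index name):

POST my_liferay_index/_analyze
{
  "field": "localized_name_sk_SK_sortable",
  "text": "test diakrity časť 1"
}

For a keyword field with ascii_normalizer, this should return the single folded term test diakrity cast 1, even though _source keeps the original text.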
Next, I tried to create new content for my DDM structure, but all my sortable fields look the same, like this:
"localized_title_sk_SK": "test diakrity časť 1 ľščťžýáíéôň title",
"localized_title_sk_SK_sortable": "test diakrity časť 1 ľščťžýáíéôň title",
but I need that sortable field without national characters, so that, for example, I can find "cast 1" through a wildcardQuery on localized_title_sk_SK_sortable, and so on... Thanks for any advice (maybe I just have the wrong approach to the whole problem? And I am really new to ES).

First of all, it would be better to apply the asciifolding filter (with preserve_original if you want to keep the unfolded tokens as well) and then the lowercase filter, but keep in mind these filters only affect the indexed tokens: your _source data won't change just because you applied an analyzer or normalizer to the field.
If you need to manipulate the data before ingesting it, you can use the ingest pipeline feature in Elasticsearch; see the ingest pipelines documentation for more information.
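As a rough sketch of that idea, a pipeline using the built-in lowercase processor could rewrite the stored field before indexing (the pipeline name and target field here are made up; note there is no dedicated ascii-folding processor, so folding diacritics inside _source would need something like a script processor):

PUT _ingest/pipeline/fold_sortable
{
  "description": "lowercase the sortable title in _source before indexing",
  "processors": [
    {
      "lowercase": {
        "field": "localized_title_sk_SK_sortable"
      }
    }
  ]
}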

Related

Is reindexing to a new index necessary after updating settings and mappings to support a multi-field in Elasticsearch?

Please consider the scenario.
Existing System
I have an index named contacts_index with 100 documents.
Each document has a property named city with some text value in it.
The index has the following settings:
{
  "analyzer": {
    "city_analyzer": {
      "filter": [
        "lowercase"
      ],
      "tokenizer": "city_tokenizer"
    },
    "search_analyzer": {
      "filter": [
        "lowercase"
      ],
      "tokenizer": "keyword"
    }
  },
  "tokenizer": {
    "city_tokenizer": {
      "token_chars": [
        "letter"
      ],
      "min_gram": "2",
      "type": "ngram",
      "max_gram": "30"
    }
  }
}
The index has the following mapping for the city field to support both sub-text matching and keyword search.
{
  "city" : {
    "type" : "text",
    "analyzer" : "city_analyzer",
    "search_analyzer" : "search_analyzer"
  }
}
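With this setup, a plain match query on city should already hit on any in-word fragment of two or more characters, since the ngram tokens are generated at index time while the keyword search analyzer leaves the query text whole (a sketch; "eatt" is just an arbitrary fragment of Seattle):

GET contacts_index/_search
{
  "query": {
    "match": {
      "city": "eatt"
    }
  }
}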
Proposed System
Now we want to perform autocomplete on the city field. For example, for a city with value Seattle, we want to get the document when the user types s, se, sea, seat, seatt, seattl, or seattle, but only when they query with a prefix of the text; for example, not when they type eattle, etc.
We have planned to attain this with the help of one more multi-field for the city property, of type text with a different analyzer.
To attain this we have done the following.
Updated the settings to support autocomplete
PUT /staging-contacts-index-v4.0/_settings?preserve_existing=true
{
  "analysis": {
    "analyzer": {
      "autocomplete_analyzer": {
        "filter": [
          "lowercase"
        ],
        "tokenizer": "autocomplete_tokenizer"
      }
    },
    "tokenizer": {
      "autocomplete_tokenizer": {
        "token_chars": [
          "letter"
        ],
        "min_gram": "1",
        "type": "edge_ngram",
        "max_gram": "100"
      }
    }
  }
}
Updated the mapping of the city field with an autocomplete multi-field to support autocomplete:
{
  "city" : {
    "type" : "text",
    "fields" : {
      "autocomplete" : {
        "type" : "text",
        "analyzer" : "autocomplete_analyzer",
        "search_analyzer" : "search_analyzer"
      }
    },
    "analyzer" : "city_analyzer",
    "search_analyzer" : "search_analyzer"
  }
}
Findings
For any new document created after updating the autocomplete multi-field settings, autocomplete search works as expected.
For existing documents, if the value of the city field changes, for example from seattle to chicago, the document is fetched when making an autocomplete search.
We are planning to make use of the update API to fetch and update the existing 100 documents so that autocomplete works for existing documents as well. However, while trying to use the update API, we are getting
{"result" : "noop"}
and the autocomplete search is not working.
I can infer that since the values are not changing, Elasticsearch is not creating tokens for the autocomplete field.
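For context, this is exactly what a doc-merge update with an unchanged value does: the update API detects that nothing changed (detect_noop is true by default) and skips reindexing, so the new multi-field never gets tokenized. A sketch with a made-up document id and body:

POST contacts_index/_update/1
{
  "doc": {
    "city": "seattle"
  }
}

This returns {"result" : "noop"} and leaves the document's indexed tokens untouched.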
Question
From the research we have done, there are two options to make sure the existing 100 documents can support autocomplete search:
Use the Reindex API for the existing 100 documents (see the sketch below).
Fetch all 100 documents and use the document Index API to re-save them, which will create all the tokens in the process.
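For concreteness, option 1 would look roughly like this (the destination index name is assumed; reindexing into a fresh index and switching over is the usual pattern):

POST _reindex
{
  "source": {
    "index": "contacts_index"
  },
  "dest": {
    "index": "contacts_index_v2"
  }
}

Alternatively, _update_by_query reindexes documents in place and would also pick up the new multi-field without a second index.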
Which option is preferable and why?
Thanks for taking time to read through.

How can I index a field using two different analyzers in Elasticsearch

Say that I have a field "productTitle" which I want my users to use to search for products.
I also want to apply autocomplete functionality, so I'm using an autocomplete_analyzer with the following filter:
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10
}
However, at the same time, when users run an actual search I don't want the edge_ngram to be applied, since it produces a lot of irrelevant results.
For example, when users want to search for "mi" and start typing "m", "mi", they should get the results starting with m, mi as autocomplete options. However, when they actually submit the query, they should only get results containing the word "mi"; currently they also see results with "mini" etc.
Therefore, is it possible to have "productTitle" indexed using two different analyzers? Is the multi-field type an option for me?
EDIT: Mapping for productTitle
"productTitle" : {
"type" : "string",
"index_analyzer" : "second",
"search_analyzer" : "standard",
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
,
"second" analyzer
"analyzer": {
"second": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trim",
"autocomplete_filter"
]
}
So when I'm querying for:
"filtered" : {
"query" : {
"match" : {
"productTitle" : {
"query" : "mi",
"type" : "boolean",
"minimum_should_match" : "2<75%"
}
}
}
}
I also get results like "mini", but I need to get only results including just "mi".
Thank you
Hmm... as far as I know, there is no way to apply multiple analyzers to the same field directly; what you can do is use multi-fields.
Here is an example of how to apply different analyzers to subfields:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html#_multi_fields_with_multiple_analyzers
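Along the lines of that page, the mapping could look something like this (a sketch in the same old string-type syntax used in this question; the analyzer names are taken from the question):

"productTitle": {
  "type": "string",
  "analyzer": "standard",
  "fields": {
    "autocomplete": {
      "type": "string",
      "analyzer": "autocomplete_analyzer"
    }
  }
}

Autocomplete queries would then target productTitle.autocomplete, while normal searches target productTitle.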
The correct way of preventing what you describe in your answer is to specify both analyzer and search_analyzer in your field mapping, like this:
"productTitle": {
"type": "string",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "standard"
}
The autocomplete analyzer will kick in at indexing time and tokenize your title according to your edge_ngram configuration and the standard analyzer will kick in at search time without applying the edge_ngram stuff.
In this context, there is no need for multi-fields unless you need to tokenize the productTitle field in different ways.

Elastic Search Standard Analyzer + Synonyms

I am attempting to get synonyms added to our current way of searching for products.
Part of the mappings currently looks like this:
{
  "keywords": {
    "properties": {
      "modifiers": {
        "type": "string",
        "analyzer": "standard"
      },
      "nouns": {
        "type": "string",
        "analyzer": "standard"
      }
    }
  }
}
I am interested in using the synonym filter along with the standard analyzer, and as per the Elasticsearch documentation the right way to do that is to add
{
  'analysis': {
    'analyzer': {
      "synonym" : {
        "tokenizer" : 'standard',
        "filter" : ['standard', 'lowercase', 'stop', 'external_synonym']
      }
    },
    'filter': {
      'external_synonym': {
        'type': 'synonym',
        'synonyms': synonyms
      }
    }
  }
}
into the index settings and use it in the analyzer field in the mapping snippet above. But this does not work, in the sense that the behaviour is quite different (even without adding any synonyms).
I am interested in preserving the relevancy behaviour (as provided by the standard analyzer) and just adding the synonyms list.
Could someone please provide more information on how to exactly replicate the standard analyzer behaviour?
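For what it's worth, the built-in standard analyzer is essentially the standard tokenizer plus the lowercase filter, with stopword removal disabled by default, so the stop filter in the snippet above already changes behaviour on its own. A closer replication that only bolts synonyms on might look like this (a sketch; the analyzer name and inline synonym list are placeholders):

{
  "analysis": {
    "analyzer": {
      "standard_with_synonyms": {
        "tokenizer": "standard",
        "filter": ["lowercase", "external_synonym"]
      }
    },
    "filter": {
      "external_synonym": {
        "type": "synonym",
        "synonyms": ["sneaker, trainer"]
      }
    }
  }
}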

How to search with keyword analyzer?

I have keyword analyzer as default analyzer, like so:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
But now I can't search anything, e.g.:
{
  "query": {
    "query_string": {
      "query": "cast"
    }
  }
}
gives me 0 results although "cast" is a common value in the indexed documents (http://gist.github.com/baelter/b0720a52ee5a27e27d3a).
Search for "*" works fine btw.
I only have explicit defaults in my mapping:
{
  "oceanography_point": {
    "_all" : {
      "enabled" : true
    },
    "properties" : {}
  }
}
The index behaves as if no fields are included in _all, because field:value queries work fine.
Am I misusing the keyword analyzer?
Using the keyword analyzer, you can only do an exact string match.
Let's assume that you have used the keyword analyzer and no filters.
In that case, for a string indexed as "Cast away in forest", neither a search for "cast" nor "away" will work. You need to search for the exact "Cast away in forest" string to match it. (Assuming no lowercase filter is used, you need to give the right case too.)
A better approach would be to use multi-fields to declare one copy as keyword analyzed and the other one normal.
You can search on one of these fields and aggregate on the other.
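That could look something like this (a sketch in the same pre-2.x string syntax as the rest of this thread; the field name is just an example):

"city": {
  "type": "string",
  "analyzer": "standard",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}

Full-text queries then go against city, while exact matches and aggregations use city.raw.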
Okay, after some 15 hours of trial and error I can conclude that this works for search:
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "default": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
However, this breaks faceting, so I ended up using a dynamic template instead:
"dynamic_templates" : [
{
"strings_not_analyzed" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
],

How to index both a string and its reverse?

I'm looking for a way to analyze the string "abc123" as ["abc123", "321cba"]. I've looked at the reverse token filter, but that only gets me ["321cba"]. Documentation on this filter is pretty sparse, only stating that
"A token filter of type reverse ... simply reverses each token."
(see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html).
I've also tinkered with using the keyword_repeat filter, which gets me two instances. I don't know if that's useful, but for now all it does is reverse both instances.
How can I use the reverse token filter but keep the original token as well?
My analyzer:
{ "settings" : { "analysis" : {
"analyzer" : {
"phone" : {
"type" : "custom"
,"char_filter" : ["strip_non_numeric"]
,"tokenizer" : "keyword"
,"filter" : ["standard", "keyword_repeat", "reverse"]
}
}
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
}
}}}
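Running this analyzer through the _analyze API shows the problem: keyword_repeat duplicates the single keyword token, and reverse then flips both copies, so the original is lost (a sketch; the index name is a placeholder):

POST phones/_analyze
{
  "analyzer": "phone",
  "text": "+1 (555) 0123"
}

Both emitted tokens come back as 32105551, the reverse of the stripped input 15550123; neither copy keeps the original.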
Make and put an analyzer to reverse a string (say reverse_analyzer):
PUT index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "type": "custom",
          "char_filter": [
            "strip_non_numeric"
          ],
          "tokenizer": "keyword",
          "filter": [
            "standard",
            "keyword_repeat",
            "reverse"
          ]
        }
      },
      "char_filter": {
        "strip_non_numeric": {
          "type": "pattern_replace",
          "pattern": "[^0-9]",
          "replacement": ""
        }
      }
    }
  }
}
Then, for a field (say phone_no), use a mapping like this (create a type and append the mapping for the phone field):
PUT index_name/type_name/_mapping
{
  "type_name": {
    "properties": {
      "phone_no": {
        "type": "string",
        "fields": {
          "reverse": {
            "type": "string",
            "analyzer": "reverse_analyzer"
          }
        }
      }
    }
  }
}
So phone_no is a multi-field, which will store a string and its reverse. If you index
phone_no: 911220
then in Elasticsearch there will be fields
phone_no: 911220 and phone_no.reverse: 022119, so you can search and filter on either the reversed or the non-reversed field.
Hope this helps.
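Note that a match query on phone_no.reverse analyzes the query text with the same reverse_analyzer, so you can search with the original (unreversed) number and still hit the reversed terms (a sketch reusing the names above):

GET index_name/_search
{
  "query": {
    "match": {
      "phone_no.reverse": "911220"
    }
  }
}

The query string is stripped, repeated, and reversed to 022119 before it is matched against the indexed term.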
I don't believe you can do this directly, as I am unaware of any way to get the reverse token filter to also output the original.
However, you could use the fields parameter to index both the original and the reversed at the same time with no additional coding. You would then search both fields.
So let's say your field was called phone_number:
"phone_number": {
"type": "string",
"fields": {
"reverse": { "type": "string", "index": "phone" }
}
}
In this case we're indexing using the default analyzer (assume standard) plus also indexing into reverse with your custom analyzer phone, which reverses. You then issue your queries against both fields.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html
I'm not sure it's possible to do this using the built-in set of token filters. I would recommend you create your own plugin. There is the ICU Analysis plugin supported by the Elasticsearch team that you can use as an example.
I wound up using the following two char_filters in my analyzer. It's an ugly abuse of regex, but it seems to work. It is limited to the first 20 numeric characters, but in my use case that is acceptable.
First it groups all numeric characters, then explicitly rebuilds the string with its own (numeric-only!) reverse. The space in the center of the replacement pattern then causes the tokenizer to split it into two tokens, the original and the reverse.
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
,"dupe_and_reverse" : {
"type" : "pattern_replace"
,"pattern" : "([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)"
,"replacement" : "$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15$16$17$18$19$20 $20$19$18$17$16$15$14$13$12$11$10$9$8$7$6$5$4$3$2$1"
}
}
