How to exclude asterisks while searching with analyzer - elasticsearch

I need to search by an array of values, and each value can be either simple text or text with asterisks (*).
For example:
["MYULTRATEXT"]
And I have the following index (the real index is much bigger, so I will simplify it):
................
{
"settings": {
"analysis": {
"char_filter": {
"asterisk_remove": {
"type": "pattern_replace",
"pattern": "(\\d+)*(?=\\d)",
"replacement": "1$"
}
},
"analyzer": {
"custom_search_analyzer": {
"char_filter": [
"asterisk_remove"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text",
"analyzer":"keyword",
"search_analyzer": "custom_search_analyzer"
},
......................
And all data in the index is stored with asterisks * e.g.:
curl -X PUT "localhost:9200/locations/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
"name" : "MY*ULTRA*TEXT"
}'
I need to get back exactly the same name value when I search for the string MYULTRATEXT:
curl -XPOST 'localhost:9200/locations/_search?pretty' -H 'Content-Type: application/json' -d '
{
"query": { "terms": { "name": ["MYULTRATEXT"] } }
}'
It should return MY*ULTRA*TEXT, but it does not work, and I can't find a workaround. Any thoughts?
I tried pattern_replace, but it seems I am doing something wrong or missing something.
So I need to replace every * with an empty string while searching.

There appears to be a problem with the regex you provided and the replacement pattern.
I think what you want is:
"char_filter": {
"asterisk_remove": {
"type": "pattern_replace",
"pattern": "(\\w+)\\*(?=\\w)",
"replacement": "$1"
}
}
Note the following changes:
\d => \w (match word characters instead of only digits)
escape * since asterisks have a special meaning for regexes
1$ => $1 ($<GROUPNUM> is how you reference captured groups)
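As a quick sanity check outside Elasticsearch, both patterns can be tried with Python's re module. This is only a sketch: Python's regex syntax is close to, but not identical to, the Java regex engine behind pattern_replace, and Python writes the group reference as \1 rather than $1. The name strip_asterisks is made up for this example.

```python
import re

# Corrected pattern from the answer: a run of word characters, a literal
# asterisk, and a lookahead that requires another word character.
FIXED_PATTERN = re.compile(r"(\w+)\*(?=\w)")

# Pattern from the question: it can only match directly in front of a digit.
ORIGINAL_PATTERN = re.compile(r"(\d+)*(?=\d)")


def strip_asterisks(text: str) -> str:
    """Remove every asterisk that sits between word characters."""
    # \1 is Python's spelling of the $1 group reference.
    return FIXED_PATTERN.sub(r"\1", text)


if __name__ == "__main__":
    print(strip_asterisks("MY*ULTRA*TEXT"))
    # The original pattern finds nothing to replace in a digit-free string,
    # so the text comes back unchanged.
    print(ORIGINAL_PATTERN.sub("1$", "MY*ULTRA*TEXT"))
```

The _analyze API remains the authoritative check, since it runs the real char filter inside the analyzer.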
To see how Elasticsearch will analyze the text against an analyzer, or to check that you defined an analyzer correctly, Elasticsearch has the ANALYZE API endpoint that you can use: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
If you try this API with your current definition of custom_search_analyzer, you will find that "MY*ULTRA*TEXT" is analyzed to "MY*ULTRA*TEXT" and not "MYULTRATEXT" as you intend.
I have a personal app that I use to more easily interact with and visualize the results of the ANALYZE API. I tried your example and you can find it here: Elasticsearch Analysis Inspector.

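This might help you: your regex pattern is the issue.
To replace every * that sits between word characters with an empty string, the pattern below will do the trick. Note that my_analyzer is applied at both index time and search time, and that the search uses a match query, which is analyzed (a terms query is not).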
PUT my_index
{
"mappings": {
"doc": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer":"my_analyzer"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"asterisk_remove": {
"type": "pattern_replace",
"pattern": "(?<=\\w)(\\*)(?=\\w)",
"replacement": ""
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase",
"asterisk_remove"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
}
}
Analyze query
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": ["MY*ULTRA*TEXT"]
}
Results of analyze query
{
"tokens": [
{
"token": "myultratext",
"start_offset": 0,
"end_offset": 13,
"type": "word",
"position": 0
}
]
}
Post a document
POST my_index/doc/1
{
"name" : "MY*ULTRA*TEXT"
}
Search query
GET my_index/_search
{
"query": {
"match": {
"name": "MYULTRATEXT"
}
}
}
Or
GET my_index/_search
{
"query": {
"match": {
"name": "myultratext"
}
}
}
Results search query
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "my_index",
"_type": "doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "MY*ULTRA*TEXT"
}
}
]
}
}
Hope it helps

Related

How do I search documents with their synonyms in Elasticsearch?

I have an index with some documents. These documents have the field name. But now, my documents are able to have several names. And the number of names a document can have is uncertain. A document can have only one name, or there can be 10 names of one document.
The question is, how to organize my index, document and query and then search for 1 document by different names?
For example, there's a document with names: "automobile", "automobil", "自動車". Whenever I query one of these names, I should get this document. Can I create a kind of array of these names and build a query to search for each one? Or is there a more appropriate way to do this?
Tldr;
It feels like you are looking for something like synonyms.
Solution
In the following example I create an index with a specific text analyzer.
This analyzer handles automobile, automobil and 自動車 as the same token.
PUT /74472994
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "standard",
"filter": ["synonym" ]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms": [ "automobile, automobil, 自動車" ]
}
}
}
}
},
"mappings": {
"properties": {
"name":{
"type": "text",
"analyzer": "synonym"
}
}
}
}
POST /74472994/_doc
{
"name": "automobile"
}
This allows me to perform the following requests:
GET /74472994/_search
{
"query": {
"match": {
"name": "automobil"
}
}
}
GET /74472994/_search
{
"query": {
"match": {
"name": "自動車"
}
}
}
And always get:
{
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.7198386,
"hits": [
{
"_index": "74472994",
"_id": "ROfyhoQBcn6Q8d0DlI_z",
"_score": 1.7198386,
"_source": {
"name": "automobile"
}
}
]
}
}
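The effect of the equivalent-synonym rule above can be sketched in plain Python. This is a simplified model, not how Elasticsearch implements the synonym filter: it splits on whitespace instead of using the standard tokenizer, and the helper names (build_synonym_map, expand, matches) are invented for illustration.

```python
# Equivalent synonyms, written the same way as in the "synonyms" setting above.
SYNONYM_RULES = ["automobile, automobil, 自動車"]


def build_synonym_map(rules):
    """Map every term of a rule to the full set of terms in that rule."""
    mapping = {}
    for rule in rules:
        group = frozenset(term.strip() for term in rule.split(","))
        for term in group:
            mapping[term] = group
    return mapping


def expand(tokens, mapping):
    """Return the set of terms a token list stands for after expansion."""
    expanded = set()
    for token in tokens:
        # Tokens without a rule only stand for themselves.
        expanded |= mapping.get(token, frozenset([token]))
    return expanded


def matches(query, document, mapping):
    """True if query and document share at least one expanded term."""
    query_terms = expand(query.split(), mapping)
    document_terms = expand(document.split(), mapping)
    return bool(query_terms & document_terms)


if __name__ == "__main__":
    synonym_map = build_synonym_map(SYNONYM_RULES)
    print(matches("automobil", "automobile", synonym_map))
    print(matches("自動車", "automobile", synonym_map))
```

Because the same analyzer runs at index and search time, any of the three names in a query ends up matching a document indexed under any other.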

Elasticsearch completion suggester issue

Issue: a completion suggester with a custom keyword/lowercase analyzer is not working as expected. The issue can be reproduced with the following steps.
I can't understand what is causing it. However, if we search for "PRAXIS CONSULTING AND INFORMATION SERVICES PRIVATE", it does return a result.
Create index
curl -X PUT "localhost:9200/com.tmp.index?pretty" -H 'Content-Type: application/json' -d'{
"mappings": {
"dynamic": "false",
"properties": {
"namesuggest": {
"type": "completion",
"analyzer": "keyword_lowercase_analyzer",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50,
"contexts": [
{
"name": "searchable",
"type": "CATEGORY"
}
]
}
}
},
"settings": {
"index": {
"mapping": {
"ignore_malformed": "true"
},
"refresh_interval": "5s",
"analysis": {
"analyzer": {
"keyword_lowercase_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
}
}
},
"number_of_replicas": "0",
"number_of_shards": "1"
}
}
}'
Index document
curl -X PUT "localhost:9200/com.tmp.index/_doc/123?pretty" -H 'Content-Type: application/json' -d'{
"namesuggest": {
"input": [
"PRAXIS CONSULTING AND INFORMATION SERVICES PRIVATE LIMITED."
],
"contexts": {
"searchable": [
"*"
]
}
}
}
'
Issue: the completion suggester returns no result
curl -X GET "localhost:9200/com.tmp.index/_search?pretty" -H 'Content-Type: application/json' -d'{
"suggest": {
"legalEntity": {
"prefix": "PRAXIS CONSULTING AND INFORMATION SERVICES PRIVATE LIMITED.",
"completion": {
"field": "namesuggest",
"size": 10,
"contexts": {
"searchable": [
{
"context": "*",
"boost": 1,
"prefix": false
}
]
}
}
}
}
}'
You are facing this issue because the max_input_length parameter is set to 50, which is also its default value.
Below is the description of this parameter from the documentation:
Limits the length of a single input, defaults to 50 UTF-16 code
points. This limit is only used at index time to reduce the total
number of characters per input string in order to prevent massive
inputs from bloating the underlying datastructure. Most use cases
won’t be influenced by the default value since prefix completions
seldom grow beyond prefixes longer than a handful of characters.
If you enter the string below, which is exactly 50 characters long, you will get a response:
PRAXIS CONSULTING AND INFORMATION SERVICES PRIVATE
If you add one or more characters to that string, it will not return the result:
PRAXIS CONSULTING AND INFORMATION SERVICES PRIVATE L
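The behaviour can be modelled with a few lines of Python. This is a rough sketch under stated assumptions, not the suggester's real data structure: the indexed input is cut at max_input_length and lowercased, while the search prefix is lowercased but never truncated. The function names are hypothetical, and Python's len counts code points, which is equivalent for this ASCII example.

```python
FULL_NAME = "PRAXIS CONSULTING AND INFORMATION SERVICES PRIVATE LIMITED."


def index_input(text: str, max_input_length: int = 50) -> str:
    """Cut the input at the limit, then lowercase it like the analyzer does."""
    return text[:max_input_length].lower()


def prefix_matches(prefix: str, stored: str) -> bool:
    """The search prefix is lowercased but not truncated."""
    return stored.startswith(prefix.lower())


if __name__ == "__main__":
    stored = index_input(FULL_NAME)  # only the first 50 characters survive
    print(len(stored))
    print(prefix_matches("PRAXIS CONSULTING AND INFORMATION SERVICES PRIVATE", stored))
    print(prefix_matches("PRAXIS CONSULTING AND INFORMATION SERVICES PRIVATE L", stored))
    # With a larger limit the whole name is stored and the full prefix matches.
    print(prefix_matches(FULL_NAME, index_input(FULL_NAME, max_input_length=100)))
```

Any prefix longer than the stored, truncated input cannot match, which is why the 50-character query works and the longer ones do not.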
You can either accept this default behaviour, or update your index mapping with a larger max_input_length value and reindex your data:
{
"mappings": {
"dynamic": "false",
"properties": {
"namesuggest": {
"type": "completion",
"analyzer": "keyword_lowercase_analyzer",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 100,
"contexts": [
{
"name": "searchable",
"type": "CATEGORY"
}
]
}
}
},
"settings": {
"index": {
"mapping": {
"ignore_malformed": "true"
},
"refresh_interval": "5s",
"analysis": {
"analyzer": {
"keyword_lowercase_analyzer": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
}
}
},
"number_of_replicas": "0",
"number_of_shards": "1"
}
}
}
You will get a response like the one below after updating the index:
"suggest": {
"legalEntity": [
{
"text": "PRAXIS CONSULTING AND INFORMATION SERVICES PRIVATE LIMITED",
"offset": 0,
"length": 58,
"options": [
{
"text": "PRAXIS CONSULTING AND INFORMATION SERVICES PRIVATE LIMITED.",
"_index": "74071871",
"_id": "123",
"_score": 1,
"_source": {
"namesuggest": {
"input": [
"PRAXIS CONSULTING AND INFORMATION SERVICES PRIVATE LIMITED."
],
"contexts": {
"searchable": [
"*"
]
}
}
},
"contexts": {
"searchable": [
"*"
]
}
}
]
}
]
}

how to switch on the elasticsearch stemming

I don't know how to turn on English word stemming in Elasticsearch. I couldn't find a clear example of how to do that.
Here is what I did
Creating the index
PUT /staff/list/ -d
{
"settings" : {
"analysis": {
"analyzer": {
"standard": {
"type": "standard"
}
}
}
}
}
Adding document
PUT /staff/list/jason
{
"Title" : "searches"
}
When I search for search
GET /staff/list/_search?q=search
the result doesn't appear.
What index settings do I need to make stemming work?
Many thanks in advance.
Please note that the default Elasticsearch analyzer does not support stemming.
To support stemming you need to create a custom analyzer.
Here is how you do it:
Create the index and define an analyzer called my_analyzer
PUT /staff
{
"settings" : {
"analysis": {
"filter": {
"filter_snowball_en": {
"type": "snowball",
"language": "English"
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"lowercase",
"filter_snowball_en"
],
"type": "custom",
"tokenizer": "whitespace"
}
}
}
}
}
Configure a mapping for the list type that assigns my_analyzer to the title field
PUT /staff/_mapping/list
{
"list": {
"properties": {
"title": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
Index documents
PUT /staff/list/jason
{
"title": "searches"
}
PUT /staff/list/debby
{
"title": "searched open"
}
Search and stemmed results
GET staff/list/_search
{
"query": {
"query_string": {
"query": "title:opened"
}
}
}
Result
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "staff",
"_type": "list",
"_id": "debby",
"_score": 1,
"_source": {
"title": "searched open"
}
}]
}
}
As you can see in the search results, the debby document, which contains the term
open, was returned although we were searching for opened.
Hope that helps.
When you create the index, your settings effectively do nothing (they just re-declare the standard analyzer).
The standard analyzer is the default that Elasticsearch uses, and it doesn't stem any words.
You need to map the fields to their respective analyzers when you create the index (mapping documentation):
PUT /staff
{
"mappings": {
"list": {
"properties": {
"Title": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
I guess the english analyzer (which uses the standard tokenizer) fits your case.
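Why stemming makes search find searches can be shown with a toy example. The suffix stripper below is deliberately naive and is not the Snowball or english analyzer algorithm; toy_stem, analyze and matches are made-up names. The point is only that the index-time text and the query are reduced to the same stems.

```python
def toy_stem(token: str) -> str:
    """Naive suffix stripping; real stemmers apply far more careful rules."""
    token = token.lower()
    for suffix in ("es", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list:
    """Whitespace tokenizer followed by lowercase + stemming."""
    return [toy_stem(token) for token in text.split()]


def matches(query: str, document: str) -> bool:
    """True if any stemmed query token equals a stemmed document token."""
    return bool(set(analyze(query)) & set(analyze(document)))


if __name__ == "__main__":
    print(analyze("searches"))
    print(matches("search", "searches"))
    print(matches("opened", "searched open"))
```

Without the stemming step, the tokens search and searches are simply different strings, which is why the original query returned nothing.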

Find concatenate words in Elasticsearch

Say I have indexed this data
song:{
title:"laser game"
}
but the user is searching for
lasergame
How would you go about mapping/indexing/querying for this?
This is kind of a tricky problem.
1) I guess the most effective way might be to use a compound word token filter, with a word list made up of words you think users might concatenate.
"settings": {
"analysis": {
"analyzer": {
"concatenate_split": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"myFilter"
]
}
},
"filter": {
"myFilter": {
"type": "dictionary_decompounder",
"word_list": [
"laser",
"game",
"lean",
"on",
"die",
"hard"
]
}
}
}
}
After this analyzer is applied, lasergame is split into laser and game, and the original token lasergame is kept as well, so the search returns results that contain any of those words.
2) Another approach could be to concatenate the whole title with a pattern_replace char filter that removes all the spaces.
{
"index" : {
"analysis" : {
"char_filter" : {
"my_pattern":{
"type":"pattern_replace",
"pattern":"\\s+",
"replacement":""
}
},
"analyzer" : {
"custom_with_char_filter" : {
"tokenizer" : "standard",
"char_filter" : ["my_pattern"]
}
}
}
}
}
You need to use multi-fields with this approach. With this pattern, laser game is indexed as lasergame and your query will work.
The problem here is that laser game play is indexed as lasergameplay, so a search for lasergame won't return anything; you might want to consider a prefix query or wildcard query for this.
3) This might not make sense, but you could also use a synonym filter if you think users often concatenate certain words.
Hope this helps!
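Approach 2 above can be checked quickly in Python. This is just a sketch of what the \s+ replacement does to the text before tokenizing, with a hypothetical helper name; it does not model the tokenizer or the multi-field setup.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Mimic the char filter: delete every run of whitespace."""
    return re.sub(r"\s+", "", text)


if __name__ == "__main__":
    indexed_short = collapse_whitespace("laser game")
    indexed_long = collapse_whitespace("laser game play")
    query = "lasergame"

    # An exact match works for the two-word title.
    print(indexed_short == query)
    # The three-word title no longer equals the query.
    print(indexed_long == query)
    # It does still start with the query, which is what a prefix query would use.
    print(indexed_long.startswith(query))
```

This is why the answer suggests a prefix or wildcard query once titles can contain more words than the user typed.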
The easiest solution would be to use nGrams. That gives you a base to start from, which you can tweak to meet your needs. Here you go:
Mappings
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"myAnalyzer": {
"type": "custom",
"tokenizer": "nGram",
"filter": [
"asciifolding",
"lowercase"
]
}
}
}
},
"mappings": {
"sample": {
"properties": {
"myField": {
"type": "string",
"analyzer": "myAnalyzer"
}
}
}
}
}
Test document
PUT /test/sample/1
{
"myField": "laser game"
}
Query
GET /test/_search
{
"query": {
"match": {
"myField": "lasergame"
}
}
}
Results
{
"took": 47,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2161999,
"hits": [
{
"_index": "test",
"_type": "sample",
"_id": "1",
"_score": 0.2161999,
"_source": {
"myField": "laser game"
}
}
]
}
}
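This analyzer will create lots of ngrams in your index. With the tokenizer's default settings (min_gram 1, max_gram 2) these are one- and two-character grams such as l, la, a, as ... g, ga, m, me; raise min_gram and max_gram if you want longer grams such as las or game. Both lasergame and laser game produce many of the same tokens, so the query will find your document as you'd expect.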
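The token overlap can be illustrated with a small Python sketch. It assumes gram sizes of 1 to 2 and lowercasing, and it ignores asciifolding; char_ngrams is a hypothetical helper, not an Elasticsearch API.

```python
def char_ngrams(text: str, min_gram: int = 1, max_gram: int = 2) -> set:
    """Return every character n-gram of the lowercased text."""
    text = text.lower()
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.add(text[start:start + size])
    return grams


if __name__ == "__main__":
    indexed = char_ngrams("laser game")
    queried = char_ngrams("lasergame")
    shared = indexed & queried

    # Most grams are shared, so a match query finds the document.
    print(sorted(shared))
    # Only the grams that span the missing space differ.
    print(sorted(queried - indexed))
```

The flip side is that short grams are shared by many unrelated words, so this approach tends to match loosely and usually needs tuning of the gram sizes.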

elasticsearch context suggester stopwords

Is there a way to analyze a field that is passed to the context suggester?
If, say, I have this in my mapping:
mappings: {
myitem: {
title: {type: 'string'},
content: {type: 'string'},
user: {type: 'string', index: 'not_analyzed'},
suggest_field: {
type: 'completion',
payloads: false,
context: {
user: {
type: 'category',
path: 'user'
},
}
}
}
}
and I index this doc:
POST /myindex/myitem/1
{
title: "The Post Title",
content: ...,
user: 123,
suggest_field: {
input: "The Post Title",
context: {
user: 123
}
}
}
I would like to analyze the input first, split it into separate words, run it through lowercase and stop words filters so that the context suggester actually gets
suggest_field: {
input: ["post", "title"],
context: {
user: 123
}
}
I know I can pass an array into the suggest field but I would like to avoid lowercasing the text, splitting it, running the stop words filter in my application, before passing to ES. If possible, I would rather ES do this for me. I did try adding an index_analyzer to the field mapping but that didn't seem to achieve anything.
OR, is there another way to get autocomplete suggestions for words?
Okay, so this is pretty involved, but I think it does what you want, more or less. I'm not going to explain the whole thing, because that would take quite a bit of time. However, I will say that I started with this blog post and added a stop token filter. The "title" field has sub-fields (what used to be called a multi_field) that use different analyzers, or none. The query contains a couple of terms aggregations. Also notice that the aggregations results are filtered by the match query to only return results relevant to the text query.
Here is the index setup (spend some time looking through this; if you have specific questions I will try to answer them but I encourage you to go through the blog post first):
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
},
"stop_filter": {
"type": "stop"
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"stop_filter",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"stop_filter"
]
},
"stopword_only_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"asciifolding",
"stop_filter"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"stopword_only": {
"type": "string",
"analyzer": "stopword_only_analyzer"
}
}
}
}
}
}
}
Then I added a few docs:
PUT /test_index/_bulk
{"index": {"_index":"test_index", "_type":"doc", "_id":1}}
{"title": "The Lion King"}
{"index": {"_index":"test_index", "_type":"doc", "_id":2}}
{"title": "Beauty and the Beast"}
{"index": {"_index":"test_index", "_type":"doc", "_id":3}}
{"title": "Alladin"}
{"index": {"_index":"test_index", "_type":"doc", "_id":4}}
{"title": "The Little Mermaid"}
{"index": {"_index":"test_index", "_type":"doc", "_id":5}}
{"title": "Lady and the Tramp"}
Now I can search the documents with word prefixes if I want (or the full words, capitalized or not), and use aggregations to return both the intact titles of the matching documents, as well as intact (non-lowercased) words, minus the stopwords:
POST /test_index/_search?search_type=count
{
"query": {
"match": {
"title": {
"query": "mer king",
"operator": "or"
}
}
},
"aggs": {
"word_tokens": {
"terms": { "field": "title.stopword_only" }
},
"intact_titles": {
"terms": { "field": "title.raw" }
}
}
}
...
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"intact_titles": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "The Lion King",
"doc_count": 1
},
{
"key": "The Little Mermaid",
"doc_count": 1
}
]
},
"word_tokens": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "The",
"doc_count": 2
},
{
"key": "King",
"doc_count": 1
},
{
"key": "Lion",
"doc_count": 1
},
{
"key": "Little",
"doc_count": 1
},
{
"key": "Mermaid",
"doc_count": 1
}
]
}
}
}
Notice that "The" gets returned. This seems to be because the default _english_ stopwords only contain the lowercase "the", and stopword_only_analyzer does not lowercase tokens before the stop filter runs. I didn't immediately find a way around this.
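The case-sensitivity effect can be reproduced with a short Python model. It is a sketch with a tiny illustrative stopword set, not the real _english_ list, and stop_filter is a made-up function. The ignore_case flag mirrors the ignore_case option of the stop token filter, which may be worth trying here; check the behaviour against your Elasticsearch version.

```python
# Small illustrative subset, not the full _english_ stopword list.
STOPWORDS = {"the", "and"}


def stop_filter(tokens, ignore_case=False):
    """Drop stopwords; compare case-insensitively only when asked to."""
    kept = []
    for token in tokens:
        probe = token.lower() if ignore_case else token
        if probe not in STOPWORDS:
            kept.append(token)  # the original casing is preserved
    return kept


if __name__ == "__main__":
    title = "The Lion King".split()
    # Case-sensitive comparison lets the capitalized "The" through.
    print(stop_filter(title))
    # Case-insensitive comparison removes it but keeps "Lion" and "King" intact.
    print(stop_filter(title, ignore_case=True))
```

Comparing case-insensitively while emitting the original token keeps the intact, non-lowercased words that the aggregation is meant to return.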
Here is the code I used:
http://sense.qbox.io/gist/2fbb8a16b2cd35370f5d5944aa9ea7381544be79
Let me know if that helps you solve your problem.
You can set up an analyzer which does this for you.
If you follow the tutorial called "You Complete Me", there is a section about stopwords.
Elasticsearch has changed since that article was written: the standard analyzer no longer removes stopwords, so you need to use the stop analyzer instead. That is the difference from the article in the mapping below (see index_analyzer and search_analyzer).
The mapping
curl -X DELETE localhost:9200/hotels
curl -X PUT localhost:9200/hotels -d '
{
"mappings": {
"hotel" : {
"properties" : {
"name" : { "type" : "string" },
"city" : { "type" : "string" },
"name_suggest" : {
"type" : "completion",
"index_analyzer" : "stop",
"search_analyzer" : "stop",
"preserve_position_increments": false,
"preserve_separators": false
}
}
}
}
}'
Getting suggestion
curl -X POST localhost:9200/hotels/_suggest -d '
{
"hotels" : {
"text" : "m",
"completion" : {
"field" : "name_suggest"
}
}
}'
Hope this helps. I have spent a long time looking for this answer myself.

Resources