Elasticsearch: Search with special characters, open & close parentheses

Hi, I am trying to search for a word that contains the characters '(' and ')' in Elasticsearch, but I am not getting the expected result.
This is the query I am using:
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "\\(Pas\\)ta\""
    }
  }
}
In the results I am getting records like "PASTORS", "PAST", "PASCAL", and "PASSION" first. I want the name 'Pizza & (Pas)ta' to be the first record in the search results, as it is the best match.
Here is the analyzer for the name field in the schema:
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "20"
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
"name": {
"analyzer": "autocomplete",
"search_analyzer": "standard",
"type": "string"
},
Please help me fix this. Thanks!

You have used the standard tokenizer, which strips ( and ) when it splits the text into tokens. No (pas)ta token is ever generated, and hence you are not getting a match for (pas)ta.
Instead of the standard tokenizer you can use the whitespace tokenizer, which retains all the special characters in the name. Change the analyzer definition to the one below:
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
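You can compare the two tokenizers with the _analyze API (a quick sketch; the exact request format can vary slightly between ES versions):
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase"],
  "text": "Pizza & (Pas)ta"
}
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Pizza & (Pas)ta"
}
The whitespace tokenizer keeps (pas)ta as a single token, which the edge_ngram filter can then expand into prefixes, while the standard tokenizer drops the parentheses.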

Related

Achieving literal text search combined with subword matching in Elasticsearch

I have populated an Elasticsearch database using the following settings:
mapping = {
    "properties": {
        "location": {
            "type": "text",
            "analyzer": "ngram_analyzer"
        },
        "description": {
            "type": "text",
            "analyzer": "ngram_analyzer"
        },
        "commentaar": {
            "type": "text",
            "analyzer": "ngram_analyzer"
        },
    }
}
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "ngram_filter"
                    ]
                }
            }
        }
    },
    "mappings": {"custom_tool": mapping}
}
I used the ngram analyzer because I wanted to be able to have subword matching, so a search for "ackoverfl" would return the entries containing "stackoverflow".
My search queries are made as follows:
q = {
    "simple_query_string": {
        "query": needle,
        "default_operator": "and",
        "analyzer": "whitespace"
    }
}
Where needle is the text from my search bar.
Sometimes I would also like to do literal phrase searching. For example:
If my search term is:
"the ap hangs in the tree"
(Notice that I use quotation marks here with the intention of searching for a literal piece of text.)
Then in my results I get a document containing:
the apple hangs in the tree
This result is unwanted.
How could I implement a subword matching search capability while also having the option to search for literal phrases (by using, for example, quotation marks)?
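One possible direction (a sketch, not from the original thread): keep the ngram-analyzed fields for subword matching, add a standard-analyzed sub-field for literal matching, and have the application send quoted search-bar input to a match_phrase query on that sub-field instead of the simple_query_string. The sub-field name literal and index name my_index below are assumptions:
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": { "type": "ngram", "min_gram": 1, "max_gram": 20 }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "ngram_filter"]
        }
      }
    }
  },
  "mappings": {
    "custom_tool": {
      "properties": {
        "description": {
          "type": "text",
          "analyzer": "ngram_analyzer",
          "fields": {
            "literal": { "type": "text", "analyzer": "standard" }
          }
        }
      }
    }
  }
}
POST my_index/_search
{
  "query": {
    "match_phrase": { "description.literal": "the ap hangs in the tree" }
  }
}
Because description.literal only contains whole-word tokens, the phrase query above would not match "the apple hangs in the tree", while unquoted input can still go through the ngram-analyzed field for subword matching.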

Can we apply a char_filter to a custom tokenizer in elasticsearch?

I have set up a custom analyser in Elasticsearch that uses an edge-ngram tokeniser and I'm experimenting with filters and char_filters to refine the search experience.
I've been pointed to the excellent tool elyzer, which enables you to test the effect your custom analyser has on a specific term, but it throws errors when I combine a custom analyser with a char_filter, specifically html_strip.
The error I get from elyzer is:
illegal_argument_exception', 'reason': 'Custom normalizer may not use
char filter [html_strip]'
I would like to know whether this is a legitimate error message or whether it represents a bug in the tool.
I've referred to the main documentation, and even their custom analyser example throws an error in elyzer:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
Command in elyzer:
elyzer --es "http://localhost:9200" --index my_index --analyzer my_custom_analyzer "Trinity Chapel <h1>[in fact King's Chapel]</h1>"
If it turns out that elyzer is at fault, could anyone point me to an alternative method of examining the tokens produced by my custom analysers so that I can test the impact of each filter?
My custom analysers look a little bit like I've thrown the kitchen sink at them and I'd like a way to test and refactor:
PUT /objects
{
  "settings": {
    "index": {
      "number_of_shards": "5",
      "analysis": {
        "analyzer": {
          "search_autocomplete": {
            "type": "custom",
            "tokenizer": "standard",
            "char_filter": [
              "html_strip"
            ],
            "filter": [
              "standard",
              "apostrophe",
              "lowercase",
              "asciifolding",
              "english_stop",
              "english_stemmer"
            ]
          },
          "autocomplete": {
            "type": "custom",
            "tokenizer": "autocomplete",
            "filter": [
              "standard",
              "lowercase",
              "asciifolding",
              "english_stop",
              "english_stemmer"
            ]
          },
          "title_html_strip": {
            "filter": [
              "standard",
              "lowercase"
            ],
            "char_filter": [
              "html_strip"
            ],
            "type": "custom",
            "tokenizer": "standard"
          }
        },
        "tokenizer": {
          "autocomplete": {
            "type": "edge_ngram",
            "min_gram": 3,
            "max_gram": 20,
            "token_chars": [
              "letter",
              "digit"
            ]
          }
        },
        "filter": {
          "english_stop": {
            "type": "stop",
            "stopwords": "_english_"
          },
          "english_stemmer": {
            "type": "stemmer",
            "name": "english"
          }
        }
      }
    }
  }
}
This is a bug in elyzer. In order to show the state of the tokens at each step of the analysis process, elyzer performs an analyze request for each stage: first char filters, then the tokenizer, and finally the token filters.
The problem is that on the ES side, the analysis process changed when they introduced normalizers (in a non-backward-compatible way). They now assume that if the request contains no normalizer, no analyzer, and no tokenizer, but does contain a token filter or a char_filter, then the analyze request should behave like a normalizer.
In your case, elyzer first sends a request with only the html_strip character filter, ES treats it as a normalizer, and hence you get the error, since html_strip is not a valid char_filter for normalizers.
Since I know elyzer's developer (Doug Turnbull) pretty well, I've filed a bug already. We'll see what unfolds.
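For illustration, the char-filter-only stage elyzer runs corresponds roughly to a request like the one below (index name from the question), which newer ES versions interpret as a custom normalizer and therefore reject:
GET my_index/_analyze
{
  "char_filter": ["html_strip"],
  "text": "Trinity Chapel <h1>[in fact King's Chapel]</h1>"
}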
Alternative method of examining the tokens produced from my custom analysers:
The official documentation includes a section on the _analyze API which, together with the explain: true flag, provides the information I need to scrutinise my custom analysers.
The following outputs the tokens at each filter stage:
GET objects/_analyze
{
  "analyzer": "search_autocomplete",
  "explain": true,
  "text": "Trinity Chapel [in fact <h1>King's Chapel</h1>]"
}

Stop word analyzer with stopwords_path not working as expected

I'm on ES 2.3 and I have a stop words file containing a mix of uppercase and lowercase entries.
I'm trying to create an analyzer that ignores the case of the stop words:
"stopword_analyzer": {
"type": "standard",
"ignore_case": "true"
"stopwords_path": "stopwords_english.txt"
}
I've tried using a single stop word in uppercase to check whether there was an issue with the stopwords_path argument
"stopword_analyzer6": {
"type": "stop",
"stopwords": "[UPPERCASE]",
"ignore_case": "true"
}
but this failed as well.
I've also tried to apply a lowercase filter, but that didn't work either
"stopword_analyzer5": {
"type": "stop",
"stopwords_path": "stopwords_english.txt",
"filter": [
"lowercase"
]
What ended up doing the trick was using a stop word filter together with a lowercase filter in a custom analyzer:
"analysis": {
"filter": {
"my_stop":{
"type": "stop",
"ignore_case": "true",
"stopwords_path": "stopwords_english.txt"
}
},
"analyzer": {
"stopword_analyzer7": {
"type": "custom",
"tokenizer": "whitespace",
"stopwords_path": "stopwords_english.txt",
"filter": [
"lowercase",
"my_stop"
]
}
}
}
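You can check the result with the _analyze API (a sketch; the index name my_index and the presence of the stop word "the" in stopwords_english.txt are assumptions). Because lowercase runs before my_stop, uppercase variants of the stop words are removed as well:
GET my_index/_analyze
{
  "analyzer": "stopword_analyzer7",
  "text": "THE Quick Brown Fox"
}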

Why is my elastic search prefix query case-sensitive despite using lowercase filters on both index and search?

The Problem
I am working on an autocompleter using ElasticSearch 6.2.3. I would like my query results (a list of pages with a Name field) to be ordered using the following priority:
Prefix match at start of "Name" (Prefix query)
Any other exact (whole word) match within "Name" (Term query)
Fuzzy match (this is currently done on a different field from Name using an ngram tokenizer ... so I assume it cannot be relevant to my problem, but I would like to apply this to the Name field as well)
My Attempted Solution
I will be using a Bool/Should query consisting of three queries (corresponding to the three priorities above), using boost to define relative importance.
The issue I am having is with the Prefix query - it appears not to be lowercasing the search term despite my search analyzer having the lowercase filter. For example, the query below returns "Harry Potter" for 'harry' but returns zero results for 'Harry':
{ "query": { "prefix": { "Name.raw" : "Harry" } } }
I have verified using the _analyze API that both my analyzers do indeed lowercase the text "Harry" to "harry". Where am I going wrong?
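For reference, that verification looks roughly like this (using the analyzers from the settings below):
GET myIndex/_analyze
{
  "analyzer": "pageSearchAnalyzer",
  "text": "Harry"
}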
From the ES documentation I understand I need to analyze the Name field in two different ways to enable use of both Prefix and Term queries:
using the "keyword" tokenizer to enable the Prefix query (I have applied this on a .raw field)
using a standard analyzer to enable the Term query (I have applied this on the Name field)
I have checked duplicate questions such as this one, but the answers have not helped.
My mapping and settings are below.
ES Index Mapping
{
  "myIndex": {
    "mappings": {
      "pages": {
        "properties": {
          "Id": {},
          "Name": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "text",
                "analyzer": "keywordAnalyzer",
                "search_analyzer": "pageSearchAnalyzer"
              }
            },
            "analyzer": "pageSearchAnalyzer"
          },
          "Tokens": {} // Other fields not important for this question
        }
      }
    }
  }
}
ES Index Settings
{
  "myIndex": {
    "settings": {
      "index": {
        "analysis": {
          "filter": {
            "ngram": {
              "type": "edgeNGram",
              "min_gram": "2",
              "max_gram": "15"
            }
          },
          "analyzer": {
            "keywordAnalyzer": {
              "filter": [
                "trim",
                "lowercase",
                "asciifolding"
              ],
              "type": "custom",
              "tokenizer": "keyword"
            },
            "pageSearchAnalyzer": {
              "filter": [
                "trim",
                "lowercase",
                "asciifolding"
              ],
              "type": "custom",
              "tokenizer": "standard"
            },
            "pageIndexAnalyzer": {
              "filter": [
                "trim",
                "lowercase",
                "asciifolding",
                "ngram"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "l2AXoENGRqafm42OSWWTAg",
        "version": {}
      }
    }
  }
}
Prefix queries don't analyze the search term, so the text you pass in bypasses whatever would be used as the search analyzer (in your case, the configured search_analyzer: pageSearchAnalyzer), and Harry is evaluated as-is directly against the keyword-tokenized, custom-filtered harry potter produced by the keywordAnalyzer at index time.
In your case here, you'll need to do one of a few different things:
Since you're using a lowercase filter on the field, you could just always use lowercase terms in your prefix query (using application-side lowercasing if necessary)
Run a match query against an edge_ngram-analyzed field instead of a prefix query, as described in the ES search_analyzer docs
Here's an example of the latter:
1) Create the index w/ ngram analyzer and (recommended) standard search analyzer
PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "ngram": {
            "type": "edgeNGram",
            "min_gram": "2",
            "max_gram": "15"
          }
        },
        "analyzer": {
          "pageIndexAnalyzer": {
            "filter": [
              "trim",
              "lowercase",
              "asciifolding",
              "ngram"
            ],
            "type": "custom",
            "tokenizer": "keyword"
          }
        }
      }
    }
  },
  "mappings": {
    "pages": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "pageIndexAnalyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
2) Index some sample docs
POST my_index/pages/_bulk
{"index":{}}
{"name":"Harry Potter"}
{"index":{}}
{"name":"Hermione Granger"}
3) Run a match query against the ngram field
POST my_index/pages/_search
{
  "query": {
    "match": {
      "name.ngram": {
        "query": "Har",
        "operator": "and"
      }
    }
  }
}
I think it is better to use a match_phrase_prefix query without the .keyword suffix. Check the docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html
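A minimal sketch of that suggestion, assuming the Name field and index from the question:
POST myIndex/pages/_search
{
  "query": {
    "match_phrase_prefix": {
      "Name": "Harry Pot"
    }
  }
}
Because match_phrase_prefix analyzes its input, the lowercase filter configured on the field applies to the search text, so 'Harry' and 'harry' behave the same.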

ngrams in elasticsearch are not working

I use an Elasticsearch ngram filter:
"analysis": {
"filter": {
"desc_ngram": {
"type": "ngram",
"min_gram": 3,
"max_gram": 8
}
},
"analyzer": {
"index_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "desc_ngram", "lowercase" ]
},
"search_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
And I have these 2 documents:
{
  "name": "Shana Calandra",
  "username": "shacalandra"
},
{
  "name": "Shana Launer",
  "username": "shalauner"
}
And I am using this query:
{
  "query": {
    "match": {
      "_all": "Shana"
    }
  }
}
When I search with this query, it returns both documents, but I can't search by part of a word here; for example, I can't use "Shan" instead of "Shana" in the query because it doesn't return anything.
Maybe my mapping is wrong; I can't tell whether the problem is in the mapping or the query.
If you specify
"mappings": {
"test": {
"_all": {
"index_analyzer": "index_ngram",
"search_analyzer": "search_ngram"
},
for your mapping of the _all field, then it will work. _all has its own analyzers, and I suspect you applied the analyzers only to name and username and not to _all.
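Put together, a minimal sketch might look like this (the index name test is an assumption; it uses the old index_analyzer/search_analyzer settings on _all, matching the ES version implied by the question):
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "desc_ngram": { "type": "ngram", "min_gram": 3, "max_gram": 8 }
      },
      "analyzer": {
        "index_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "desc_ngram", "lowercase" ]
        },
        "search_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "_all": {
        "index_analyzer": "index_ngram",
        "search_analyzer": "search_ngram"
      }
    }
  }
}
POST test/_search
{
  "query": {
    "match": { "_all": "Shan" }
  }
}
With this mapping, a partial term such as "Shan" should match both documents, since _all is now indexed with the ngram analyzer.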
