Undesired Stopwords in Elastic Search - elasticsearch

I am using Elasticsearch 6. This is the query:
PUT /semtesttest
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_stop": {
            "type": "stop",
            "stopwords_path": "analysis1/stopwords.csv"
          },
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis1/synonym.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": ["synonym", "my_stop"]
          }
        }
      }
    }
  },
  "mappings": {
    "all_questions": {
      "dynamic": "strict",
      "properties": {
        "kbaid": {
          "type": "integer"
        },
        "answer": {
          "type": "text"
        },
        "question": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
PUT /semtesttest/all_questions/1
{
  "question": "this is hippie"
}
GET /semtesttest/all_questions/_search
{
  "query": {
    "fuzzy": { "question": { "value": "hippie", "fuzziness": 2 } }
  }
}
GET /semtesttest/all_questions/_search
{
  "query": {
    "fuzzy": { "question": { "value": "this is", "fuzziness": 2 } }
  }
}
In synonym.txt I have:
this, that, money => sainai
In stopwords.csv I have:
hello
how
are
you
The first GET ('hippie') returns empty;
only the second GET ('this is') returns results.
What is the problem? It looks as if stop words like "this is" are being filtered out in the first query, yet I have specified my stop words explicitly?

fuzzy is a term-level query. It is not going to analyze the input, so your query was looking for the exact term this is (with some fuzzy fun applied).
So you either want to build a query off those two terms, or use a full text query instead. If fuzziness is important, I think the only full text query that supports it is match:
GET /semtesttest/all_questions/_search?pretty
{
  "query": {
    "match": { "question": { "query": "this is", "fuzziness": 2 } }
  }
}
If matching phrases is important, you may want to look at this answer and work with span queries.
This might also help you so you can see how your analyzer is being used:
GET /semtesttest/_analyze?analyzer=my_analyzer&field=question&text=this is
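Note: on Elasticsearch 6 the _analyze API may reject the URL-parameter form above, in which case the same check can be done with a request body (a small sketch reusing the analyzer from this question):
GET /semtesttest/_analyze
{
  "analyzer": "my_analyzer",
  "text": "this is"
}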

Related

Completion suggester and exact matches in Elasticsearch

I'm a bit surprised by the behavior Elasticsearch completion sometimes has. I've set up a mapping that has a suggest field. As the input of the suggest field I put three elements: the name, the ISIN and the issuer of one security.
Here is the mapping that I use:
"suggest": {
  "type": "completion",
  "analyzer": "simple"
}
When I want to query my index with this query:
{
  "suggest": {
    "my_suggestion": {
      "prefix": "FR0011597335",
      "completion": {
        "field": "suggest"
      }
    }
  }
}
I get a list of results, but not necessarily with my exact prefix, and most of the time the exact match is not at the top.
So I'd like to know if there is a way to boost exact matches in a suggestion and make such exact term matches appear in first position when possible.
I think my problem is solved by using a custom analyzer: the simple one was not suitable for the entries I had.
"settings": {
"analysis": {
"char_filter": {
"punctuation": {
"type": "mapping",
"mappings": [".=>"]
}
},
"filter": {},
"analyzer": {
"analyzer_text": {
"tokenizer": "standard",
"char_filter": ["punctuation"],
"filter": ["lowercase", "asciifolding"]
}
}
}
},
and
"suggest": {
"type" : "completion",
"analyzer" : "analyzer_text"
}
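As a quick sanity check on the custom analyzer (my_index below is just a placeholder for the real index name), the _analyze API shows that the standard tokenizer keeps an ISIN such as FR0011597335 together as a single lowercased token, whereas the simple analyzer would have split it at the digits:
GET /my_index/_analyze
{
  "analyzer": "analyzer_text",
  "text": "FR0011597335"
}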

Elasticsearch: lowercase search doesn't work

I am trying to search against content using a prefix query, and if I search for diode I get results that differ from Diode. How do I get ES to return the same results for both diode and Diode? These are the mappings and settings I am using in ES.
"settings":{
"analysis": {
"analyzer": {
"lowercasespaceanalyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"articles": {
"properties": {
"title": {
"type": "text"
},
"url": {
"type": "keyword",
"index": "true"
},
"imageurl": {
"type": "keyword",
"index": "true"
},
"content": {
"type": "text",
"analyzer" : "lowercasespaceanalyzer",
"search_analyzer":"whitespace"
},
"description": {
"type": "text"
},
"relatedcontentwords": {
"type": "text"
},
"cmskeywords": {
"type": "text"
},
"partnumbers": {
"type": "keyword",
"index": "true"
},
"pubdate": {
"type": "date"
}
}
}
}
Here is an example of the query I use:
POST _search
{
  "query": {
    "bool": {
      "must": {
        "prefix": { "content": "capacitance" }
      }
    }
  }
}
It happens because you use two different analyzers at search time and at indexing time.
So when you input the query "Diod" at search time, because you use the "whitespace" analyzer, your query is interpreted as "Diod".
However, because you use "lowercasespaceanalyzer" at index time, "Diod" will be indexed as "diod". Just use the same analyzer at both search and index time, or an analyzer that lowercases your strings, because the default "whitespace" analyzer doesn't: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-analyzer.html
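For example, a minimal sketch of just the affected field, with search_analyzer dropped so the index-time analyzer is reused at query time:
"content": {
  "type": "text",
  "analyzer": "lowercasespaceanalyzer"
}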
There will be no term Diode in your index. So if you want to get the same results, you should have your query text analyzed by the same analyzer.
You can use a Query string query like:
"query_string": {
  "default_field": "content",
  "query": "Diode",
  "analyzer": "lowercasespaceanalyzer"
}
UPDATE
You can analyze your query text before querying:
AnalyzeResponse resp = client.admin().indices()
    .prepareAnalyze(index, text)
    .setAnalyzer("lowercasespaceanalyzer")
    .get();
// getTokens() returns AnalyzeToken objects; getTerm() gives the analyzed string
String analyzedContext = resp.getTokens().get(0).getTerm();
...
Then use analyzedContext as the new query text.
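For instance, if the analyzed token comes back as diode, the follow-up search could be a plain term query built from it (just a sketch):
"term": { "content": "diode" }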

How do I search for partial accented keyword in elasticsearch?

I have the following elasticsearch settings:
"settings": {
"index":{
"analysis":{
"analyzer":{
"analyzer_keyword":{
"tokenizer":"keyword",
"filter":["lowercase", "asciifolding"]
}
}
}
}
}
The above works fine for the following keywords:
Beyoncé
Céline Dion
The above data is stored in elasticsearch as beyonce and celine dion respectively.
I can search for Celine or Celine Dion without the accent and I get the same results. However, the moment I search for Céline, I don't get any results. How can I configure elasticsearch to search for partial keywords with the accent?
The query body looks like:
{
  "track_scores": true,
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "fields": ["name"],
            "type": "phrase",
            "query": "Céline"
          }
        }
      ]
    }
  }
}
and the mapping is:
"mappings": {
  "artist": {
    "properties": {
      "name": {
        "type": "string",
        "fields": {
          "orig": {
            "type": "string",
            "index": "not_analyzed"
          },
          "simple": {
            "type": "string",
            "analyzer": "analyzer_keyword"
          }
        }
      }
    }
  }
}
I would suggest this mapping and then go from there:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_keyword": {
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "asciifolding"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "analyzer_keyword"
        }
      }
    }
  }
}
Confirm that the same analyzer is getting used at query time. Here are some possible reasons why that might not be happening:
you specify a separate analyzer at query time on purpose that does not perform similar analysis
you are using a term or terms query, for which no analyzer is applied (see Term Query and the section titled "Why doesn't the term query match my document?")
you are using a query_string query (e.g. see Simple Query String Query) - I have found that if you specify multiple fields with different analyzers, I have needed to separate the fields into separate queries and specify the analyzer parameter (working with version 2.0)
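Once the field is analyzed the same way at index and query time, as in the mapping suggested above, the original phrase query should find the accented partial term (a sketch; my_index is a placeholder for the real index name):
GET /my_index/_search
{
  "query": {
    "multi_match": {
      "fields": ["name"],
      "type": "phrase",
      "query": "Céline"
    }
  }
}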

Elastic search - no hit though there should be result

I've encountered the following problem with Elasticsearch; does anyone know where I should troubleshoot?
I'm happily retrieving results with the following query:
{
  "query": {
    "match": { "name": "A1212001" }
  }
}
But when I refine the value of the search field "name" to a substring, I get no hit:
{
  "query": {
    "match": { "name": "A12120" }
  }
}
"A12120" is a substring of "A1212001", which was already a hit.
If you don't have too many documents, you can go with a regexp query:
POST /index/_search
{
  "query": {
    "regexp": {
      "name": "A12120.*"
    }
  }
}
or even a wildcard one:
POST /index/_search
{
  "query": {
    "wildcard": { "name": "A12120*" }
  }
}
However, as @Waldemar suggested, if you have many documents in your index, the best approach for this is to use an EdgeNGram tokenizer, since the above queries are not ultra-performant.
First, you define your index settings like this:
PUT index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "edge_tokens",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "edge_tokens": {
          "type": "edgeNGram",
          "min_gram": "1",
          "max_gram": "10",
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
Then, when indexing a document whose name field contains A1212001, the following tokens will be indexed: A, A1, A12, A121, A1212, A12120, A121200, A1212001. So when you search for A12120 you'll find a match.
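After reindexing with that mapping, the original substring search should work unchanged, because the standard search analyzer lowercases A12120 and the same lowercased token was produced by the edge n-gram analyzer at index time (sketch below; index is the literal index name used in the example above):
POST /index/_search
{
  "query": {
    "match": { "name": "A12120" }
  }
}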
Are you using a Match Query? This query will check for terms inside Lucene, and your term is A1212001. If you need to find a part of your term, you can use a Regexp Query, but you need to know that there will be some internal impact from using regex, because the shard will check all of your terms.
If you need a more robust way to search for a part of a term, you can use NGrams.

bidirectional match on elasticsearch

I've indexed a list of terms and now I want to query for some of them.
Say that I've indexed 'dog food', 'red dog', 'dog', 'food', 'cats'.
How do I create an exact bidirectional match query, i.e. when I search for 'dog' I want to get only the term 'dog' and not the other terms (because they don't match back)?
One primitive solution I thought of is indexing the terms with their length (word-wise) and then, when searching with a query of length X, restricting it to terms of length X, but that seems over-complicated.
Create a custom analyzer to lowercase and normalize your search terms. So that would be your index:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_analyzer_keyword"
        }
      }
    }
  }
}
So if you have indexed 'dog' and users type in Dog or dog or DOG, it will match only 'dog'; 'dog food' won't be brought back.
Just set your field's index property to not_analyzed and your query should use a term filter to search for text.
As per Evaldas' suggestion, find below a more complete solution that also keeps the original value indexed with the standard analyzer but uses a sub-field with a lowercased version of the terms:
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword_lowercase_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "asset": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "case_ignore": {
              "type": "string",
              "analyzer": "my_keyword_lowercase_analyzer"
            }
          }
        }
      }
    }
  }
}
POST /test/asset/1
{
  "name": "dog"
}
POST /test/asset/2
{
  "name": "dog food"
}
POST /test/asset/3
{
  "name": "red dog"
}
GET /test/asset/_search
{
  "query": {
    "match": {
      "name.case_ignore": "Dog"
    }
  }
}
