Elasticsearch: lowercase search doesn't work

I am trying to search against content using a prefix query, and if I search for diode I get results that differ from Diode. How do I get ES to return the same results for both diode and Diode? These are the mappings and settings I am using in ES.
"settings":{
"analysis": {
"analyzer": {
"lowercasespaceanalyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"articles": {
"properties": {
"title": {
"type": "text"
},
"url": {
"type": "keyword",
"index": "true"
},
"imageurl": {
"type": "keyword",
"index": "true"
},
"content": {
"type": "text",
"analyzer" : "lowercasespaceanalyzer",
"search_analyzer":"whitespace"
},
"description": {
"type": "text"
},
"relatedcontentwords": {
"type": "text"
},
"cmskeywords": {
"type": "text"
},
"partnumbers": {
"type": "keyword",
"index": "true"
},
"pubdate": {
"type": "date"
}
}
}
}
Here is an example of the query I use:
POST _search
{
"query": {
"bool" : {
"must" : {
"prefix" : { "content" : "capacitance" }
}
}
}
}

It happens because you use two different analyzers at search time and at indexing time.
When you run the query "Diode" at search time, the "whitespace" analyzer leaves it as "Diode". However, because you use "lowercasespaceanalyzer" at index time, "Diode" is indexed as "diode", so the two never match. Use the same analyzer at both search and index time, or at least a search analyzer that lowercases your strings, because the default "whitespace" analyzer doesn't: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-analyzer.html
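For example, a minimal sketch of the corrected content field, reusing the question's lowercasespaceanalyzer for both indexing and search (when search_analyzer is omitted, the index analyzer is also used at search time):
"content": {
  "type": "text",
  "analyzer": "lowercasespaceanalyzer"
}
Bear in mind that term-level queries such as prefix are not analyzed at all, so the query text itself still has to be lowercased before it is sent; the next answer shows one way to do that.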

There will be no term Diode in your index, so if you want the same results you should have your query text analyzed by the same analyzer.
You can use a query_string query, for example:
"query_string" : {
"default_field" : "content",
"query" : "Diode",
"analyzer" : "lowercasespaceanalyzer"
}
UPDATE
You can also analyze your query text before building the query:
// Run the query text through the index's analyzer (transport client API)
AnalyzeResponse resp = client.admin().indices()
    .prepareAnalyze(index, text)
    .setAnalyzer("lowercasespaceanalyzer")
    .get();
// getTokens() returns AnalyzeToken objects; getTerm() extracts the analyzed string
String analyzedContext = resp.getTokens().get(0).getTerm();
...
Then use analyzedContext as the new query text.
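For example, with the prefix query from the question, the analyzed (lowercased) term would then be used as the query text; a minimal sketch:
POST _search
{
  "query": {
    "bool": {
      "must": {
        "prefix": { "content": "diode" }
      }
    }
  }
}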

Related

Undesired stopwords in Elasticsearch

I am using Elasticsearch 6. This is the query:
PUT /semtesttest
{
"settings": {
"index" : {
"analysis" : {
"filter": {
"my_stop": {
"type": "stop",
"stopwords_path": "analysis1/stopwords.csv"
},
"synonym" : {
"type" : "synonym",
"synonyms_path" : "analysis1/synonym.txt"
}
},
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "standard",
"filter" : ["synonym","my_stop"]
}
}
}
}
},
"mappings": {
"all_questions": {
"dynamic": "strict",
"properties": {
"kbaid":{
"type": "integer"
},
"answer":{
"type": "text"
},
"question": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
PUT /semtesttest/all_questions/1
{
"question":"this is hippie"
}
GET /semtesttest/all_questions/_search
{
"query":{
"fuzzy":{"question":{"value":"hippie","fuzziness":2}}
}
}
GET /semtesttest/all_questions/_search
{
"query":{
"fuzzy":{"question":{"value":"this is","fuzziness":2}}
}
}
In synonym.txt I have:
this, that, money => sainai
In stopwords.csv I have:
hello
how
are
you
The first GET ('hippie') returns nothing; only the second GET ('this is') returns results.
What is the problem? It looks like the stop word "this is" is being filtered out in the first query, but I have specified my stop words explicitly.
fuzzy is a term-level query. It is not going to analyze the input, so your query was looking for the exact term this is (with some fuzziness applied).
So you either want to build a query from those two terms, or use a full-text query instead. If fuzziness is important, I think the only full-text query that supports it is match:
GET /semtesttest/all_questions/_search?pretty
{
"query":{
"match":{"question":{"query":"this is","fuzziness":2}}
}
}
If matching phrases is important, you may want to look at this answer and work with span queries.
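If you go down the span route, a rough sketch of a fuzzy, phrase-like query could look like the following (span_multi lets a fuzzy query act as a span clause; the terms here are just illustrative):
GET /semtesttest/all_questions/_search
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_multi": { "match": { "fuzzy": { "question": { "value": "this", "fuzziness": 2 } } } } },
        { "span_multi": { "match": { "fuzzy": { "question": { "value": "is", "fuzziness": 2 } } } } }
      ],
      "slop": 0,
      "in_order": true
    }
  }
}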
This might also help you see how your analyzer is being used:
GET /semtesttest/_analyze?analyzer=my_analyzer&field=question&text=this is
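Note that on Elasticsearch 6 the URL-parameter form of _analyze may be rejected; the equivalent JSON-body form is roughly:
GET /semtesttest/_analyze
{
  "analyzer": "my_analyzer",
  "text": "this is"
}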

Get an exact match after mapping a field as not_analyzed

I have an Elasticsearch type that I mapped as below:
"mappings": {
"jardata": {
"properties": {
"groupID": {
"index": "not_analyzed",
"type": "string"
},
"artifactID": {
"index": "not_analyzed",
"type": "string"
},
"directory": {
"type": "string"
},
"jarFileName": {
"index": "not_analyzed",
"type": "string"
},
"version": {
"index": "not_analyzed",
"type": "string"
}
}
}
}
I am indexing directory as analyzed since I want to be able to give only the last folder and still get results. But when I want to search for a specific directory I need to give the whole path, since the same folder name can exist under two different paths. The problem is that, because the field is analyzed, the search returns all the matching documents instead of the specific one I want.
In short, I want the field to act as both analyzed and not_analyzed. Is there a way to do that?
Let's say you have the following document indexed:
{
"directory": "/home/docs/public"
}
The standard analyzer is not enough in your case, as it will create the following terms while indexing:
[home, docs, public]
Note that it misses the [/home/docs/public] token; characters like "/" act as separators here.
One solution could be to use the ngram tokenizer with the punctuation character class included in the token_chars list, so that Elasticsearch treats "/" as if it were a letter or digit. This would allow searching with the following tokens:
[/hom, /home, ..., /home/docs/publi, /home/docs/public, ..., /docs/public, etc...]
Index mapping:
{
"settings": {
"analysis": {
"analyzer": {
"ngram_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 4,
"max_gram": 18,
"token_chars": [
"letter",
"digit",
"punctuation"
]
}
}
}
},
"mappings": {
"jardata": {
"properties": {
"directory": {
"type": "string",
"analyzer": "ngram_analyzer"
}
}
}
}
}
Now both search queries:
{
"query": {
"bool" : {
"must" : {
"term" : {
"directory": "/docs/private"
}
}
}
}
}
and
{
"query": {
"bool" : {
"must" : {
"term" : {
"directory": "/home/docs/private"
}
}
}
}
}
will return the indexed document.
One thing you have to consider is the maximum token length specified in the "max_gram" setting. For directory paths it may need to be longer.
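To sanity-check which tokens the ngram_analyzer actually produces for a given path (and whether max_gram is long enough), you can run the text through the _analyze API; the index name my_index is just a placeholder here, and on older versions the analyzer and text may have to be passed as URL parameters instead of a JSON body:
GET my_index/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "/home/docs/public"
}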
An alternative solution is to use the whitespace tokenizer, which breaks the phrase into terms only on whitespace, combined with an ngram filter, using the following mapping:
{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 4,
"max_gram": 20
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"ngram_filter"
]
}
}
}
},
"mappings": {
"jardata": {
"properties": {
"directory": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
}
Update the mapping of the directory field to contain a raw sub-field like this:
"directory": {
"type": "string",
"fields": {
"raw": {
"index": "not_analyzed",
"type": "string"
}
}
}
And modify your query to use directory.raw, which is treated as not_analyzed.
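A minimal sketch of such a query against the raw sub-field (the path shown is illustrative):
{
  "query": {
    "term": {
      "directory.raw": "/home/docs/public"
    }
  }
}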

How do I search for partial accented keyword in elasticsearch?

I have the following elasticsearch settings:
"settings": {
"index":{
"analysis":{
"analyzer":{
"analyzer_keyword":{
"tokenizer":"keyword",
"filter":["lowercase", "asciifolding"]
}
}
}
}
}
The above works fine for the following keywords:
Beyoncé
Céline Dion
The above data is stored in elasticsearch as beyonce and celine dion respectively.
I can search for Celine or Celine Dion without the accent and I get the same results. However, the moment I search for Céline, I don't get any results. How can I configure elasticsearch to search for partial keywords with the accent?
The query body looks like:
{
"track_scores": true,
"query": {
"bool": {
"must": [
{
"multi_match": {
"fields": ["name"],
"type": "phrase",
"query": "Céline"
}
}
]
}
}
}
and the mapping is
"mappings" : {
"artist" : {
"properties" : {
"name" : {
"type" : "string",
"fields" : {
"orig" : {
"type" : "string",
"index" : "not_analyzed"
},
"simple" : {
"type" : "string",
"analyzer" : "analyzer_keyword"
}
}
}
}
}
}
I would suggest this mapping and then go from there:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"analyzer_keyword": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "string",
"analyzer": "analyzer_keyword"
}
}
}
}
}
Confirm that the same analyzer is getting used at query time. Here are some possible reasons why that might not be happening:
you specify a separate analyzer at query time on purpose, and it does not perform similar analysis
you are using a term or terms query, for which no analyzer is applied (see Term Query and the section titled "Why doesn't the term query match my document?")
you are using a query_string query (e.g. see Simple Query String Query) - I have found that if you specify multiple fields with different analyzers, the behavior can be surprising, so I have needed to separate the fields into separate queries and specify the analyzer parameter (working with version 2.0)
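One way to verify what the query string is reduced to is to run it through the analyzer directly; with lowercase plus asciifolding, "Céline" should come back as "celine". The index name test_index below is a placeholder, and the exact _analyze syntax depends on your version (this is the JSON-body form):
GET test_index/_analyze
{
  "analyzer": "analyzer_keyword",
  "text": "Céline"
}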

Elasticsearch multi-word, multi-field search with analyzers

I want to use elasticsearch for multi-word searches, where all the fields are checked in a document with the assigned analyzers.
So if I have a mapping:
{
"settings": {
"analysis": {
"analyzer": {
"folding": {
"tokenizer": "standard",
"filter": [ "lowercase", "asciifolding" ]
}
}
}
},
"mappings" : {
"typeName" :{
"date_detection": false,
"properties" : {
"stringfield" : {
"type" : "string",
"index" : "folding"
},
"numberfield" : {
"type" : "multi_field",
"fields" : {
"numberfield" : {"type" : "double"},
"untouched" : {"type" : "string", "index" : "not_analyzed"}
}
},
"datefield" : {
"type" : "multi_field",
"fields" : {
"datefield" : {"type" : "date", "format": "dd/MM/yyyy||yyyy-MM-dd"},
"untouched" : {"type" : "string", "index" : "not_analyzed"}
}
}
}
}
}
}
As you can see, I have different types of fields, but I do know the structure.
What I want to do is run a search with a single string and have all fields checked, using the analyzers too.
For example if the query string is:
John Smith 2014-10-02 300.00
I want to search for "John", "Smith", "2014-10-02" and "300.00" in all the fields, calculating the relevance score as well. The best hit is the document with the most field matches.
So far I have been able to search all the fields by using multi_field, but in that case I was not able to match 300.00, since 300 was stored in the string part of the multi_field.
If I searched the "_all" field, then no analyzer was used.
How should I modify my mapping or my queries to be able to do a multi-word search, where dates and numbers are recognized in the multi-word query string?
Right now when I do a search, an error occurs, since the whole string cannot be parsed as a number or a date. And if I search the string representation of the multi_field, then 300.00 will not be a hit, since the string representation is 300.
(What I would like is similar to Google search, where dates, numbers and strings are recognized in a multi-word query.)
Any ideas?
Thanks!
Using whitespace as the tokenizer in an analyzer, and then applying this analyzer as the search_analyzer of the fields in the mapping, will split the query into parts, and each part is then matched against the index to find the best matches. Using an ngram filter for the index_analyzer improves the results considerably.
I am using the following setup for the query:
"query": {
"multi_match": {
"query": "sample query",
"fuzziness": "AUTO",
"fields": [
"title",
"subtitle",
]
}
}
And for mappings and settings:
{
"settings" : {
"analysis": {
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"standard",
"lowercase",
"ngram"
]
}
},
"filter": {
"ngram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
}
}
},
"mappings": {
"title": {
"type": "string",
"search_analyzer": "whitespace",
"index_analyzer": "autocomplete"
},
"subtitle": {
"type": "string"
}
}
}
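Note that on Elasticsearch 2.x and later the index_analyzer parameter no longer exists; the equivalent field mapping uses analyzer (applied at index time) together with search_analyzer, roughly like this for 5.x+ text fields:
"title": {
  "type": "text",
  "analyzer": "autocomplete",
  "search_analyzer": "whitespace"
}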
See following answer and article for more details.

Elasticsearch: multiple languages in two fields when the query's language is unknown or mixed

I am new to Elasticsearch, and I am not sure how to proceed in my situation.
I have the following mapping:
{
"mappings": {
"book": {
"properties": {
"title": {
"properties": {
"en": {
"type": "string",
"analyzer": "english"
},
"ar": {
"type": "string",
"analyzer": "arabic"
}
}
},
"keyword": {
"properties": {
"en": {
"type": "string",
"analyzer": "english"
},
"ar": {
"type": "string",
"analyzer": "arabic"
}
}
}
}
}
}
}
A sample document may have two languages for the same field of the same book. Here are two example documents:
{
"title" : {
"en": "hello",
"ar": "مرحبا"
},
"keyword" : {
"en": "world",
"ar": "عالم"
}
}
{
"title" : {
"en": "Elasticsearch"
},
"keyword" : {
"en": "full-text index"
}
}
When I know which language is used in the query, I can build the query as follows (here, when English is used):
"query": {
"multi_match" : {
"query" : "keywords",
"fields" : [ "title.en", "keyword.en" ]
}
}
Based on my current document mapping, how can I build a query if
the query language is unknown or
is mixed with English and Arabic?
Thanks for any input!
Regards.
p.s. I am also open to any improvement to the above mapping.
the query language is unknown
You can use the same multi_match query, but on all the fields. For example, assuming you are using the keyword analyzer:
"query": {
"multi_match" : {
"query" : "keywords",
"fields" : [ "title.en", "keyword.en", "title.ar", "keyword.ar" ]
}
}
is mixed with English and Arabic
You need to change the analyzer to standard and then you can perform the same query.
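For example, one way to do that (the std sub-field name below is just an example) is to add a standard-analyzed sub-field next to each language-specific field and query those sub-fields when the language is mixed:
"en": {
  "type": "string",
  "analyzer": "english",
  "fields": {
    "std": { "type": "string", "analyzer": "standard" }
  }
},
"ar": {
  "type": "string",
  "analyzer": "arabic",
  "fields": {
    "std": { "type": "string", "analyzer": "standard" }
  }
}
The multi_match query would then list fields such as "title.en.std", "title.ar.std", "keyword.en.std" and "keyword.ar.std".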
Thanks
