Why doesn't Elasticsearch match any number in the keyphrase? - elasticsearch

I am searching for the keyphrase "xiaomi redmi note 3" in my Elasticsearch database. I'm issuing the following filtered match query:
"filtered" : {
"query" : {
"match" : {
"name" : {
"query" : "xiaomi redmi note 3",
"type" : "boolean",
"operator" : "AND"
}
}
}
}
However, no matches are found, even though the following document exists in Elasticsearch:
xiaomi redmi note 3 16GB 4G Phablet
Why doesn't Elasticsearch match this document?
What I have noticed in general is that Elasticsearch doesn't match any numbers in the keyphrase. Does it have to do with the analyzer I'm using?
EDIT
"analyzer": {
"second": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trim",
"autocomplete_filter"
]
},
And the mapping for my field is:
"name" : {
"type" : "string",
"index_analyzer" : "second",
"search_analyzer" : "standard",
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed"
}
},
"search_quote_analyzer" : "second"
},
The autocomplete_filter is defined as:
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10
}
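Not part of the original question, but one way to narrow this down is to run the same phrase through both the index-time and the search-time analyzer with the _analyze API and compare the tokens. With min_gram set to 2, the edge_ngram filter emits no grams for the single-character token 3, so 3 never reaches the index, while the standard search analyzer still emits 3 as a query term; with the AND operator the match then fails. A minimal sketch, assuming the index is called products (a placeholder name) and uses the settings above:
# "products" is a placeholder index name; "second" is the custom analyzer defined above
curl -XGET 'localhost:9200/products/_analyze?analyzer=second&text=xiaomi+redmi+note+3&pretty'
# compare with the search-time analyzer
curl -XGET 'localhost:9200/products/_analyze?analyzer=standard&text=xiaomi+redmi+note+3&pretty'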

Related

Autocompletion with whitespace tokenizer in Elasticsearch: tokenize whitespace correctly

I have an Elasticsearch index I want to use for autocompletion.
Therefore I have a suggestField of type completion into which I put the text that should be autocompleted.
"suggestField" : {
"type" : "completion",
"analyzer" : "IndexAnalyzer",
"search_analyzer" : "SearchAnalyzer",
"preserve_separators" : true,
"preserve_position_increments" : true,
"max_input_length" : 50
},
With the following analyzers:
"IndexAnalyzer" : {
"filter" : [
"lowercase",
"stop",
"stopGerman",
"EdgeNGramFilter"
],
"type" : "custom",
"tokenizer" : "MyTokenizer"
},
"SearchAnalyzer" : {
"filter" : [
"lowercase"
],
"type" : "custom",
"tokenizer" : "MyTokenizer"
},
Filters and Tokenizer:
"filter" : {
"EdgeNGramFilter" : {
"type" : "edge_ngram",
"min_gram" : "1",
"max_gram" : "50"
},
"stopGerman" : {
"type" : "stop",
"stopwords" : "_german_"
}
},
"tokenizer" : {
"MyTokenizer" : {
"type" : "whitespace"
}
}
My problem now is that if I query that field, the autocompletion only works when I start at the beginning of the text, not for every word.
E.g. I have one value in my suggest field that looks like: "123-456-789 thisisatest"
If I search my suggest field for 123- I get that value as a result.
But if I search for thisis I do not get a result.
This is my query:
POST myindex/_search?typed_keys=true
{
    "suggest": {
        "completion-term": {
            "completion" : {
                "field" : "suggestField"
            },
            "prefix" : "thisis"
        }
    }
}
The question: how do I have to change the above setup to get the given value as a result when I search for thisis?
FYI: if I run the IndexAnalyzer in Kibana with an _analyze query for 123-456-789 thisisatest, I get the (from my point of view correct) tokens:
1
12
123
123-
123-4
123-45
123-456
123-456-7
123-456-78
123-456-789
t
th
thi
this
thisi
thisis
thisisa
thisisat
thisisate
thisisates
thisisatest
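One commonly used workaround, not from the original post: the completion suggester only matches prefixes of each indexed input as a whole, so the edge-gram tokens above do not create an entry point in the middle of the string. Supplying one input per word when indexing the document makes every word a valid starting prefix. A rough sketch, where myindex and suggestField come from the question, while the type name, document id and the word splitting are assumptions about the setup:
# "mytype" and the document id "1" are placeholders
curl -XPUT 'localhost:9200/myindex/mytype/1' -H 'Content-Type: application/json' -d '
{
    "suggestField": {
        "input": [
            "123-456-789 thisisatest",
            "thisisatest"
        ]
    }
}'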

Elasticsearch multi_match query not working

I'm using Elasticsearch 5.3 and have this very simple multi_match query. None of the documents contains the nonsense query string, neither in the title nor in the content. Still, Elasticsearch gives me plenty of matches.
{
    "query" : {
        "multi_match" : {
            "query" : "42a6o8435a6o4385q023bf50",
            "fields" : [
                "title", "content"
            ],
            "type" : "best_fields",
            "operator" : "AND",
            "minimum_should_match" : "100%"
        }
    }
}
The analyzer for content is "english" and for title it is this one:
"analyzer": {
"english_autocomplete": {
"tokenizer": "standard",
"filter": [
"lowercase",
"english_stop",
"english_stemmer",
"autocomplete_filter"
]
}
}
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
Am I missing something, or how can I tell Elasticsearch not to do that?
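A likely explanation, not stated in the original post: because no separate search_analyzer is defined for title, the english_autocomplete analyzer with its min_gram 1 edge_ngram filter also runs on the query string, so the nonsense query is itself broken into edge-grams as short as one character, and grams that do occur in documents produce matches. A sketch of one common remedy is to keep the edge-gram analyzer for indexing only and search with a plain analyzer, for example in the title mapping:
"title": {
    "type": "text",
    "analyzer": "english_autocomplete",
    "search_analyzer": "english"
}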

How do I get the most frequent uni-, bi-, tri-grams using shingles in Elasticsearch across all documents

I am using the following field definition in my elasticsearch index:
"my_text" :{
"type" : "string",
"index" : "analyzed",
"analyzer" : "my_ngram_analyzer",
"term_vector": "with_positions",
"term_statistics" : true
}
where my_ngram_analyzer is used to tokenize text into n-grams using shingles and is defined as:
"settings" : {
"analysis" : {
"filter" : {
"nGram_filter": {
"type": "shingle",
"max_shingle_size": 5,
"min_shingle_size": 2,
"output_unigrams":"true"
}
},
"analyzer" : {
"my_ngram_analyzer" :{
"tokenizer" : "standard",
"filter" : [
"lowercase",
"nGram_filter"
]
}
}
}
}
I have two questions:
How can I find the most frequent n-gram (n = 1 to 5) and its frequency across all documents?
Is there a way to get the total term frequency of an n-gram without querying for a document using the termvector API with term_statistics?
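Not part of the original question, but one approach worth sketching is a terms aggregation over the shingled field, which returns the most frequent shingles ranked by the number of documents that contain them (document counts, not total term frequency). This assumes the index is called my_index (a placeholder) and that fielddata is available on my_text, since terms aggregations on an analyzed string field need fielddata:
# "my_index" is a placeholder index name; "my_text" is the field mapped above
curl -XPOST 'localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d '
{
    "size": 0,
    "aggs": {
        "frequent_ngrams": {
            "terms": {
                "field": "my_text",
                "size": 20
            }
        }
    }
}'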

How to get the definition of a search analyzer of an index in Elasticsearch

The mapping of the Elasticsearch index has a custom analyzer attached to it. How can I read the definition of the custom analyzer?
http://localhost:9200/test_namespace/test_namespace/_mapping
"matchingCriteria": {
"type": "string",
"analyzer": "custom_analyzer",
"include_in_all": false
}
My search is not working with the analyzer, which is why I need to know exactly what this analyzer is doing.
The docs explain how to modify an analyzer or attach a new analyzer to an existing index, but I didn't find a way to see what an analyzer does.
Use the _settings API:
curl -XGET 'http://localhost:9200/test_namespace/_settings?pretty=true'
It should generate a response similar to:
{
    "test_namespace" : {
        "settings" : {
            "index" : {
                "creation_date" : "1418990814430",
                "routing" : {
                    "allocation" : {
                        "disable_allocation" : "false"
                    }
                },
                "uuid" : "FmX9NrSNSTO2bQM5pd-iQQ",
                "number_of_replicas" : "2",
                "analysis" : {
                    "analyzer" : {
                        "edi_analyzer" : {
                            "type" : "custom",
                            "char_filter" : [ "my_pattern" ],
                            "filter" : [ "lowercase", "length" ],
                            "tokenizer" : "whitespace"
                        },
                        "xml_analyzer" : {
                            "type" : "custom",
                            "char_filter" : [ "html_strip" ],
                            "filter" : [ "lowercase", "length" ],
                            "tokenizer" : "whitespace"
                        },
                        ...
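Once the analyzer definition is visible in the settings, its behavior on a concrete string can also be checked with the _analyze API. A small sketch; custom_analyzer is the analyzer name from the mapping above and the sample text is a placeholder:
# the text parameter is placeholder input
curl -XGET 'http://localhost:9200/test_namespace/_analyze?analyzer=custom_analyzer&text=some+sample+text&pretty=true'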

Elasticsearch multi-word, multi-field search with analyzers

I want to use Elasticsearch for multi-word searches, where all the fields of a document are checked with their assigned analyzers.
So if I have a mapping:
{
    "settings": {
        "analysis": {
            "analyzer": {
                "folding": {
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "asciifolding" ]
                }
            }
        }
    },
    "mappings" : {
        "typeName" : {
            "date_detection": false,
            "properties" : {
                "stringfield" : {
                    "type" : "string",
                    "analyzer" : "folding"
                },
                "numberfield" : {
                    "type" : "multi_field",
                    "fields" : {
                        "numberfield" : { "type" : "double" },
                        "untouched" : { "type" : "string", "index" : "not_analyzed" }
                    }
                },
                "datefield" : {
                    "type" : "multi_field",
                    "fields" : {
                        "datefield" : { "type" : "date", "format" : "dd/MM/yyyy||yyyy-MM-dd" },
                        "untouched" : { "type" : "string", "index" : "not_analyzed" }
                    }
                }
            }
        }
    }
}
As you can see, I have different types of fields, but I do know the structure.
What I want to do is start a search with a string that checks all fields, using the analyzers as well.
For example, if the query string is:
John Smith 2014-10-02 300.00
I want to search for "John", "Smith", "2014-10-02" and "300.00" in all the fields, calculating the relevance score as well. The better solution is the one that has more field matches in a single document.
So far I have been able to search all the fields by using multi_field, but in that case I was not able to parse 300.00, since 300 was stored in the string part of the multi_field.
If I was searching in the "_all" field, then no analyzer was used.
How should I modify my mapping or my queries to be able to do a multi-word search, where dates and numbers are recognized in the multi-word query string?
Right now, when I do a search, an error occurs, since the whole string cannot be parsed as a number or a date. And if I use the string representation of the multi_field, then 300.00 will not be a result, since the string representation is 300.
(What I would like is similar to Google search, where dates, numbers and strings are recognized in a multi-word query.)
Any ideas?
Thanks!
Using a whitespace tokenizer in the analyzer and applying that analyzer as the search_analyzer on the fields in the mapping will split the query into parts, and each part is matched against the index to find the best match. Using an ngram filter in the index_analyzer improves the results considerably.
I am using the following setup for the query:
"query": {
"multi_match": {
"query": "sample query",
"fuzziness": "AUTO",
"fields": [
"title",
"subtitle",
]
}
}
And for mappings and settings:
{
    "settings" : {
        "analysis": {
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "standard",
                        "lowercase",
                        "ngram"
                    ]
                }
            },
            "filter": {
                "ngram": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 15
                }
            }
        }
    },
    "mappings": {
        "title": {
            "type": "string",
            "search_analyzer": "whitespace",
            "index_analyzer": "autocomplete"
        },
        "subtitle": {
            "type": "string"
        }
    }
}
See the following answer and article for more details.
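Not part of the original answer, but relevant to the parse errors mentioned in the question: multi_match also accepts a lenient flag, which ignores format-based exceptions when a text term cannot be parsed into a numeric or date field, so the whole multi-word string can be sent against mixed-type fields without the query failing. A sketch using the field names from the question's mapping:
"query": {
    "multi_match": {
        "query": "John Smith 2014-10-02 300.00",
        "lenient": true,
        "fields": [
            "stringfield",
            "numberfield",
            "datefield"
        ]
    }
}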
