Does Elasticsearch support nested or object fields in MultiMatch?

I have some object field named "FullTitleFts". It has field "text" inside. This query works fine (and returns some entries):
GET index/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"fullTitleFts.text": "Ivan"
}
}
]
}
}
}
But this query returns nothing:
GET index/_search
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "Ivan",
"fields": [
"fullTitleFts.text"
]
}
}
]
}
}
}
Mapping of the field:
"fullTitleFts": {
"copy_to": [
"text"
],
"type": "keyword",
"fields": {
"text": {
"analyzer": "analyzer",
"term_vector": "with_positions_offsets_payloads",
"type": "text"
}
}
}
"analyzer": {
"filter": [
"lowercase",
"hypocorisms",
"protect_kw"
],
"char_filter": [
"replace_char_filter",
"e_char_filter"
],
"expand": "true",
"type": "custom",
"tokenizer": "standard"
}
e_char_filter replaces the Cyrillic char "ё" with "е", and replace_char_filter removes "�" from the text. protect_kw is a keyword_marker for some Russian conjunctions. hypocorisms is a synonym_graph that generates alternative forms of names.
Example of analyzer output:
GET index/_analyze
{
"analyzer": "analyzer",
"text": "Алёна�"
}
{
"tokens" : [
{
"token" : "аленка",
"start_offset" : 0,
"end_offset" : 5,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "аленушка",
"start_offset" : 0,
"end_offset" : 5,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "алена",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
}
]
}
I've also found this question, and it seems the answer didn't really work - the author had to add the "include_in_root" option to the mapping. So I'm wondering whether multi_match supports nested or object fields at all. I also can't find anything about it in the docs.

As your index mapping shows, the field is defined as a multi-field, not as a nested or object field. So both match and multi_match should work without providing a path: just use the field name fullTitleFts.text when you need to search the text type and fullTitleFts when you need to search the keyword type.
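For completeness, a multi_match across both parts of the multi-field should then behave just like the match query above. A sketch using the field names from your mapping (untested):

```
GET index/_search
{
  "query": {
    "multi_match": {
      "query": "Ivan",
      "fields": [
        "fullTitleFts",
        "fullTitleFts.text"
      ]
    }
  }
}
```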

Related

elasticsearch fuzzy query seems to ignore brazilian stopwords

I have stopwords for Brazilian Portuguese configured in my index. But if I search for the term "ios" (it's an iOS course), a bunch of other documents are returned, because the term "nos" (a Brazilian stopword) seems to be treated as a valid term by the fuzzy query.
But if I search just for the term "nos", nothing is returned. Shouldn't I expect only the ios course to be returned by the fuzzy query? I'm confused.
Is there any alternative? The main goal is that when a user searches for ios, documents that only match via a stopword like "nos" won't be returned, while I can maintain the fuzziness for other, more complex searches made by users.
An example of query:
GET /index/_search
{
"explain": true,
"query": {
"bool" : {
"must" : [
{
"terms" : {
"document_type" : [
"COURSE"
],
"boost" : 1.0
}
},
{
"multi_match" : {
"query" : "ios",
"type" : "best_fields",
"operator" : "OR",
"slop" : 0,
"fuzziness" : "AUTO",
"prefix_length" : 0,
"max_expansions" : 50,
"zero_terms_query" : "NONE",
"auto_generate_synonyms_phrase_query" : true,
"fuzzy_transpositions" : true,
"boost" : 1.0
}
}
],
"adjust_pure_negative" : true,
"boost" : 1.0
}
}
}
part of explain query:
"description": "weight(corpo:nos in 52) [PerFieldSimilarity], result of:",
image with the config of stopwords
thanks
I tried adding a prefix length, but what I really want is for stopwords to be ignored.
I believe the correct way to handle stopwords per language is shown below:
PUT idx_teste
{
"settings": {
"analysis": {
"filter": {
"brazilian_stop_filter": {
"type": "stop",
"stopwords": "_brazilian_"
}
},
"analyzer": {
"teste_analyzer": {
"tokenizer": "standard",
"filter": ["brazilian_stop_filter"]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "teste_analyzer"
}
}
}
}
POST idx_teste/_analyze
{
"analyzer": "teste_analyzer",
"text":"course nos advanced"
}
Note that the term "nos" was removed:
{
"tokens": [
{
"token": "course",
"start_offset": 0,
"end_offset": 6,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "advanced",
"start_offset": 11,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
}
]
}
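With the stop filter applied by the field's analyzer, a fuzzy match query should no longer expand "ios" against documents that only contained the stopword. A sketch against the test index above (untested):

```
GET idx_teste/_search
{
  "query": {
    "match": {
      "name": {
        "query": "ios",
        "fuzziness": "AUTO"
      }
    }
  }
}
```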

How to get index item that has : "name" - "McLaren" by searching with "mclaren" in Elasticsearch 1.7?

Here is the tokenizer -
"tokenizer": {
"filename" : {
"pattern" : "[^\\p{L}\\d]+",
"type" : "pattern"
}
},
Mapping -
"name": {
"type": "string",
"analyzer": "filename_index",
"include_in_all": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"lower_case_sort": {
"type": "string",
"analyzer": "naturalsort"
}
}
},
Analyzer -
"filename_index" : {
"tokenizer" : "filename",
"filter" : [
"word_delimiter",
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer",
"czech_stop",
"czech_keywords",
"czech_stemmer"
]
},
I would like to find the indexed item by searching for mclaren, but the name was indexed as McLaren.
I would like to stick to query_string because a lot of other functionality is based on it. Here is the query that doesn't return the expected result -
{
"query": {
"filtered": {
"query": {
"query_string" : {
"query" : "mclaren",
"default_operator" : "AND",
"analyze_wildcard" : true
}
}
}
},
"size" :50,
"from" : 0,
"sort": {}
}
How I could accomplish this? Thank you!
I got it ! The problem is certainly around the word_delimiter token filter.
By default it :
Split tokens at letter case transitions. For example: PowerShot →
Power, Shot
Cf documentation
So macLaren generates two tokens -> [mac, Laren], while maclaren generates only one token -> [maclaren].
analyze example :
POST _analyze
{
"tokenizer": {
"pattern": """[^\p{L}\d]+""",
"type": "pattern"
},
"filter": [
"word_delimiter"
],
"text": ["macLaren", "maclaren"]
}
Response:
{
"tokens" : [
{
"token" : "mac",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "Laren",
"start_offset" : 3,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "maclaren",
"start_offset" : 9,
"end_offset" : 17,
"type" : "word",
"position" : 102
}
]
}
So I think one option is to configure your word_delimiter with the option split_on_case_change to false (see parameters doc)
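A sketch of what that could look like in the analysis settings; my_word_delimiter is a name I made up here, and the filter list is copied from your filename_index analyzer (untested):

```
"analysis": {
  "filter": {
    "my_word_delimiter": {
      "type": "word_delimiter",
      "split_on_case_change": false
    }
  },
  "analyzer": {
    "filename_index": {
      "tokenizer": "filename",
      "filter": [
        "my_word_delimiter",
        "lowercase",
        "russian_stop",
        "russian_keywords",
        "russian_stemmer",
        "czech_stop",
        "czech_keywords",
        "czech_stemmer"
      ]
    }
  }
}
```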
PS: remember to remove the settings you previously added (cf. comments), since with that setting your query_string query will only target the name field, which does not exist.

Elastic Search - Apply appropriate analyser to accurate result

I am new to Elasticsearch. I would like to apply an analyzer that satisfies the searches below.
Lets take an example.
Suppose I have entered below text in a document
I am walking now
I walked to Ahmedabad
Everyday I walk in the morning
Anil walks in the evening.
I am hiring candidates
I hired candidates
Everyday I hire candidates
He hires candidates
Now when I search with
text "walking"
result should be [walking, walked, walk, walks]
text "walked"
result should be [walking, walked, walk, walks]
text "walk"
result should be [walking, walked, walk, walks]
text "walks"
result should be [walking, walked, walk, walks]
The same results should also hold for hire.
text "hiring"
result should be [hiring, hired, hire, hires]
text "hired"
result should be [hiring, hired, hire, hires]
text "hire"
result should be [hiring, hired, hire, hires]
text "hires"
result should be [hiring, hired, hire, hires]
Thank You,
You need to use the stemmer token filter.
Stemming is the process of reducing a word to its root form. This ensures variants of a word match during a search.
For example, walking and walked can be stemmed to the same root word:
walk. Once stemmed, an occurrence of either word would match the other
in a search.
Mapping
PUT index36
{
"mappings": {
"properties": {
"title":{
"type": "text",
"analyzer": "my_analyzer"
}
}
},
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "whitespace",
"filter": [ "stemmer" ,"lowercase"]
}
}
}
}
}
Analyze
GET index36/_analyze
{
"text": ["walking", "walked", "walk", "walks"],
"analyzer": "my_analyzer"
}
Result
{
"tokens" : [
{
"token" : "walk",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
},
{
"token" : "walk",
"start_offset" : 8,
"end_offset" : 14,
"type" : "word",
"position" : 101
},
{
"token" : "walk",
"start_offset" : 15,
"end_offset" : 19,
"type" : "word",
"position" : 202
},
{
"token" : "walk",
"start_offset" : 20,
"end_offset" : 25,
"type" : "word",
"position" : 303
}
]
}
All the four words produce same token "walk". So any of these words would match the other in a search.
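To see it end to end, a match query with any of the four variants should return the same documents, since both the indexed text and the query string go through my_analyzer. A sketch (untested):

```
GET index36/_search
{
  "query": {
    "match": {
      "title": "walks"
    }
  }
}
```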
What you are searching for is a language analyzer; see the documentation here.
An analyzer consists of a tokenizer and token filters, as the example below shows.
PUT /english_example
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"rebuilt_english": {
"tokenizer": "standard",
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
}
}
}
You can now use the analyzer in your index-mapping like this:
{
"mappings": {
"myindex": {
"properties": {
"myField": {
"type": "text",
"analyzer": "rebuilt_english"
}
}
}
}
}
Remember to use a match query in order to query full-text.
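For example, a match query analyzes the query string with the field's analyzer, so searching for any inflected form should hit documents containing the others. A sketch assuming the mapping above (untested):

```
GET /english_example/_search
{
  "query": {
    "match": {
      "myField": "walking"
    }
  }
}
```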

How to Query elasticsearch index with nested and non nested fields

I have an elastic search index with the following mapping:
PUT /student_detail
{
"mappings" : {
"properties" : {
"id" : { "type" : "long" },
"name" : { "type" : "text" },
"email" : { "type" : "text" },
"age" : { "type" : "text" },
"status" : { "type" : "text" },
"tests":{ "type" : "nested" }
}
}
}
Data stored is in form below:
{
"id": 123,
"name": "Schwarb",
"email": "abc@gmail.com",
"status": "current",
"age": 14,
"tests": [
{
"test_id": 587,
"test_score": 10
},
{
"test_id": 588,
"test_score": 6
}
]
}
I want to be able to query students where name is like '%warb%' AND email is like '%gmail.com%' AND a test with id 587 has a score > 5, etc. A high-level sketch of what is needed is below; I don't know what the actual query would be, so apologies for the messy pseudo-query:
GET developer_search/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": IN [587]
}
},
{
"term": {
"tests.test_score": >= some value
}
}
]
}
}
}
}
]
}
}
}
The query must be flexible so that we can enter dynamic test Ids and their respective score filters along with the fields out of nested fields like age, name, status
Something like this?
GET student_detail/_search
{
"query": {
"bool": {
"must": [
{
"wildcard": {
"name": {
"value": "*warb*"
}
}
},
{
"wildcard": {
"email": {
"value": "*gmail.com*"
}
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
},
"inner_hits": {}
}
}
]
}
}
}
Inner hits is what you are looking for.
You should make use of the ngram tokenizer instead, as wildcard search should be avoided for performance reasons; I wouldn't recommend using it.
Change your mapping to the one below, in which I've created a custom analyzer.
The way Elasticsearch (via Lucene) indexes a statement is: first it breaks the statement or paragraph into words or tokens, then it indexes these words in the inverted index for that particular field. This process is called analysis, and it only applies to the text datatype.
You then only get documents back if their tokens are available in the inverted index.
By default, the standard analyzer would be applied. What I've done instead is create my own analyzer using the ngram tokenizer, which creates many more tokens than just plain words.
The default analyzer on Life is beautiful would produce life, is, beautiful.
Using ngrams, however, the tokens for Life would be lif, ife & life.
Mapping:
PUT student_detail
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"email" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"age" : {
"type" : "text" <--- I am not sure why this is text. Change it to long or int. Would leave this to you
},
"status" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"tests":{
"type" : "nested"
}
}
}
}
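Before indexing anything, you can sanity-check what the ngram tokenizer produces with the _analyze API; with min_gram 3 and max_gram 4, a value like Schwarb should yield grams such as war and warb:

```
POST student_detail/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Schwarb"
}
```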
Note that in the above mapping I've created a sibling field in the form of keyword for name, email and status as below:
"name":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
Now your query could be as simple as below.
Query:
POST student_detail/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "war" <---- Note this. This would even return documents having "Schwarb"
}
},
{
"match": {
"email": "gmail" <---- Note this
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
}
}
}
]
}
}
}
Note that for exact matches I would make use of term queries on the keyword fields, while for normal searches (LIKE in SQL) I would make use of simple match queries on the text fields, provided they use the ngram tokenizer.
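For example, an exact lookup would target the keyword sub-field with a term query. A sketch (untested):

```
GET student_detail/_search
{
  "query": {
    "term": {
      "name.keyword": "Schwarb"
    }
  }
}
```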
Also note that for >= and <= you would need to make use of Range Query.
Response:
{
"took" : 233,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.7260926,
"hits" : [
{
"_index" : "student_detail",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.7260926,
"_source" : {
"id" : 123,
"name" : "Schwarb",
"email" : "abc@gmail.com",
"status" : "current",
"age" : 14,
"tests" : [
{
"test_id" : 587,
"test_score" : 10
},
{
"test_id" : 588,
"test_score" : 6
}
]
}
}
]
}
}
Note that the document you mentioned in your question shows up in my response when I run the query.
Please do read the links I've shared. It is vital that you understand the concepts. Hope this helps!

ElasticSearch analyzer that allows for query both with and without hypens

How do you construct an analyzer that allows you to query fields both with and without hyphens?
The following two queries must return the same person:
{
"query": {
"term": {
"name": {
"value": "Jay-Z"
}
}
}
}
{
"query": {
"term": {
"name": {
"value": "jay z"
}
}
}
}
What you could do is use a mapping character filter in order to replace the hyphen with a space. Basically, like this:
curl -XPUT localhost:9200/tests -d '{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase"
],
"char_filter": [
"hyphens"
]
}
},
"char_filter": {
"hyphens": {
"type": "mapping",
"mappings": [
"-=>\\u0020"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"name": {
"type": "string",
"analyzer": "my_analyzer"
}
}
}
}
}'
Then we can check what the analysis pipeline would yield using the _analyze endpoint:
For Jay-Z:
curl -XGET 'localhost:9200/tests/_analyze?pretty&analyzer=my_analyzer' -d 'Jay-Z'
{
"tokens" : [ {
"token" : "jay z",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
} ]
}
For jay z:
curl -XGET 'localhost:9200/tests/_analyze?pretty&analyzer=my_analyzer' -d 'jay z'
{
"tokens" : [ {
"token" : "jay z",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
} ]
}
As you can see, the same token is going to be indexed for both forms, so analyzed queries such as match will match both forms as well.
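One caveat: term queries are not analyzed at search time, so a query for the literal Jay-Z would not go through my_analyzer and may fail to match the indexed token jay z. A match query, which does analyze the query string, is the safer check here. A sketch (untested):

```
GET tests/_search
{
  "query": {
    "match": {
      "name": "Jay-Z"
    }
  }
}
```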
