I'm trying to search with a string that contains multiple comma-separated terms. (A term might not match the whole field value; partial matches are fine, as long as the passed term appears somewhere in the text.)
Note: I have tried n-grams as well, but they do not give me the right data.
(Example: the search term "Data Science" matches all of "Data", "Science", and "data science".)
Doc In ES:
{
"_index": "questions_dev",
"_type": "_doc",
"_id": "188",
"_score": 6.6311107,
"_source": {
"questionId": 188,
"questionText": "What other social media platforms do you use on your own time?",
"domainId": 2,
"subdomainId": 25,
"type": "TEXT",
"difficulty": 1,
"time": 600,
"domain": "Domain Specific",
"subdomain": "Social Media Specialist",
"skill": ["social media"]
}
}
What I have done so far:
Index:
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"default": {
"tokenizer": "custom_tokenizer",
"filter": ["lowercase"]
}
},
"tokenizer": {
"custom_tokenizer": {
"type": "pattern",
"pattern": ",",
},
}
}
},
"mappings": {
"properties": {
"questionId": {
"type": "long"
},
"questionText": {
"type": "text",
},
"domain": {
"type": "text"
},
"subdomain": {
"type": "text"
},
"type":{
"type": "keyword"
},
"difficulty":{
"type": "keyword"
},
"totaltime":{
"type": "keyword"
},
"domainId":{
"type": "keyword"
},
"subdomainId":{
"type": "keyword"
}
}
}
}
Query:
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"fields": ["questionText","skill"],
"query": "social media"
}
}
]
}
}
}
Output:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
Expected Output:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 6.6311107,
"hits": [
{
"_index": "questions_development",
"_type": "_doc",
"_id": "188",
"_score": 6.6311107,
"_source": {
"questionId": 188,
"questionText": "What other social media platforms do you use on your own time?",
"domainId": 2,
"subdomainId": 25,
"type": "TEXT",
"difficulty": 1,
"time": 600,
"domain": "Domain Specific",
"subdomain": "Social Media Specialist",
"skill": []
}
}
]
}
}
Goal:
Search with a string and return all the docs that contain it.
Example:
If I search with "social media", it should return the doc above.
(In my case it is not being returned.)
This search should also support a comma-separated mechanism:
that is, I can pass "social media, own time" and I expect the results to be docs whose questionText contains any of these strings.
The data you are indexing, "social media, own time", contains whitespace between the comma and "own time". So the tokens generated with your previous mapping are:
{
"tokens": [
{
"token": " social media", <-- note the preceding whitespace here
"start_offset": 0,
"end_offset": 12,
"type": "word",
"position": 0
},
{
"token": " own time", <-- note the preceding whitespace here
"start_offset": 13,
"end_offset": 22,
"type": "word",
"position": 1
}
]
}
Therefore, when you search with "query": "social media" (no whitespace at the beginning), no results are shown. However, if you query with " social media" (whitespace included at the beginning), you do get a result.
To remove leading and trailing whitespace from each token in a stream, you can use the Trim token filter.
Adding a working example with index data, mapping and search query.
Index Mapping:
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"default": {
"tokenizer": "custom_tokenizer",
"filter": [
"lowercase",
"trim" <-- note this
]
}
},
"tokenizer": {
"custom_tokenizer": {
"type": "pattern",
"pattern": ",",
"filter": [
"trim" <-- note this
]
}
}
}
},
"mappings": {
"properties": {
"questionText": {
"type": "text"
}
}
}
}
Index Data:
{ "questionText": "social media" }
{ "questionText": "social media, own time" }
Search Query:
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"fields": [
"questionText"
],
"query": "own time" <-- no whitespace included in the
beginning
}
}
]
}
}
}
Search Result:
"hits": [
{
"_index": "my-index",
"_type": "_doc",
"_id": "2",
"_score": 0.60996956,
"_source": {
"questionText": "social media, own time"
}
}
]
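To verify the analysis, you can call the _analyze API directly. This is just a sketch, assuming the index is named my-index as in the result above; "default" here refers to the custom default analyzer defined in the settings:
POST /my-index/_analyze
{
"analyzer": "default",
"text": "social media, own time"
}
This should return the two tokens "social media" and "own time", with the leading whitespace trimmed away.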
Update 1:
Index Settings
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": ","
}
}
}
}
}
Index Data:
{
"questionText": "What other platforms do you use on your ?"
}
{
"questionText": "What other social time platforms do you use on your?"
}
{
"questionText": "What other social media platforms do you use on your?"
}
{
"questionText": "What other platforms do you use on your own time?"
}
Search Query:
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"fields": "questionText",
"query": "social media, own time"
}
}
]
}
}
}
Search Result
"hits": [
{
"_index": "my-index3",
"_type": "_doc",
"_id": "1",
"_score": 2.5628972,
"_source": {
"questionText": "What other social media platforms do you use on your own time?"
}
},
{
"_index": "my-index3",
"_type": "_doc",
"_id": "2",
"_score": 1.3862944,
"_source": {
"questionText": "What other social media platforms do you use on your?"
}
},
{
"_index": "my-index3",
"_type": "_doc",
"_id": "3",
"_score": 1.3862944,
"_source": {
"questionText": "What other platforms do you use on your own time?"
}
}
]
Related
I've set up a normalizer on an index field to support case-insensitive searches, but I can't seem to get it to work.
GET users/
Returns the following mapping:
{
"users": {
"aliases": {},
"mappings": {
"user": {
"properties": {
"active": {
"type": "boolean"
},
"first_name": {
"type": "keyword",
"fields": {
"normalize": {
"type": "keyword",
"normalizer": "search_normalizer"
}
}
}
}
}
},
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "users",
"creation_date": "1567936315432",
"analysis": {
"normalizer": {
"search_normalizer": {
"filter": [
"lowercase"
],
"type": "custom"
}
}
},
"number_of_replicas": "1",
"uuid": "5SknFdwJTpmF",
"version": {
"created": "6040299"
}
}
}
}
}
Although first_name is normalized to lowercase, queries on the first_name field are case sensitive.
Using the following query for a user with the first name "Dave":
GET users/_search
{
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name": {
"value": ".*dave.*"
}
}
}
]
}
}
}
GET users/_analyze
{
"analyzer" : "standard",
"text": "Dave"
}
returns
{
"tokens": [
{
"token": "dave",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
}
]
}
Although "Dave" is tokenized to "dave" the following query
GET users/_search
{
"query": {
"match": {
"first_name": "dave"
}
}
}
Returns no hits.
Is there an issue with my current mapping, or with the query?
I think you have missed first_name.normalize in the query.
Indexing Records
{"index": {}}
{"first_name": "Daveraj"}
{"index": {}}
{"first_name": "RajdaveN"}
{"index": {}}
{"first_name": "Dave"}
Query
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name.normalize": {
"value": ".*dave.*"
}
}
}
]
}
}
}
Result
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.0,
"hits": [
{
"_index": "test3",
"_type": "test3_type",
"_id": "M8-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Dave"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Mc-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Daveraj"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Ms-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "RajdaveN"
}
}
]
}
}
You have created a normalized multi-field, first_name.normalize, but you are searching on the original field first_name, which has no normalizer: as a plain keyword field it indexes the value verbatim, so it only contains the exact term "Dave".
The examples given here might help:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
You need to explicitly specify the multi-field you want to search on. Note that even though a multi-field can't have its own content, it indexes different terms than its parent (although not always), because it may be analyzed with different analyzers, char filters, or token filters.
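For example, a minimal sketch of the failing match query from the question, redirected at the multi-field (index and field names as in the mapping above):
GET users/_search
{
"query": {
"match": {
"first_name.normalize": "dave"
}
}
}
Because first_name.normalize applies the lowercase filter at both index and search time, this matches the document with "Dave".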
I am basically trying to write a query that should return the document where
school is "holy international" AND grade is "second".
But the issue with the current query is that it is not considering the must match part: even though I don't specify the school, it still gives me this document, which is not a match.
The query is giving me all the documents where the grade is "second".
I want only the document where school is "holy international" AND grade is "second".
Also, I have not specified anything in the match query for "schools.school", but it's still giving me results.
mapping
{
"settings": {
"analysis": {
"analyzer": {
"my_keyword_lowercase1": {
"tokenizer": "keyword",
"filter": ["lowercase", "my_pattern_replace1", "trim"]
},
"my_keyword_lowercase2": {
"tokenizer": "standard",
"filter": ["lowercase", "trim"]
}
},
"filter": {
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": ".",
"replacement": ""
}
}
}
},
"mappings": {
"test_data": {
"properties": {
"schools": {
"type": "nested",
"properties": {
"school": {
"type": "string",
"analyzer": "my_keyword_lowercase1"
},
"grade": {
"type": "string",
"analyzer": "my_keyword_lowercase2"
}
}
}
}
}
}
}
data
{
"_index": "data_index",
"_type": "test_data",
"_id": "57a33ebc1d41",
"_version": 1,
"found": true,
"_source": {
"summary": null,
"schools": [{
"school": "little flower",
"grade": "first",
"date": "2007-06-01",
},
{
"school": "holy international",
"grade": "second",
"date": "2007-06-01",
},
],
"first_name": "Adam",
"location": "Kansas City",
"last_name": "Roger",
"country": "US",
"name": "Adam Roger",
}
}
query
{
"_source": ["first_name"],
"query": {
"nested": {
"path": "schools",
"inner_hits": {
"_source": {
"includes": [
"schools.school",
"schools.grade"
]
}
},
"query": {
"bool": {
"must": {
"match": {
"schools.school": {
"query": "" <-----X didnt specify anything
}
}
},
"filter": {
"match": {
"schools.grade": {
"query": "second",
"operator": "and",
"minimum_should_match": "100%"
}
}
}
}
}
}
}
}
result
{
"took": 26,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "data_test",
"_type": "test_data",
"_id": "57a33ebc1d41",
"_score": 0.2876821,
"_source": {
"first_name": "Adam"
},
"inner_hits": {
"schools": {
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_nested": {
"field": "schools",
"offset": 0
},
"_score": 0.2876821,
"_source": {
"schools": {
"school": "holy international",
"grade": "second"
}
}
}
]
}
}
}
}
]
}
}
So, basically, your problem is the analysis step; when I loaded everything and checked, it became very clear.
This filter completely wipes every string indexed into the schools.school field:
"filter": {
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": ".",
"replacement": ""
}
}
That's happening because . is a regex wildcard that matches any character, so every character in the value gets replaced with "". When I checked it:
POST /_analyze
{
"field": "schools.school",
"text": "holy international"
}
{
"tokens": [
{
"token": "",
"start_offset": 0,
"end_offset": 18,
"type": "word",
"position": 0
}
]
}
That's why you always get a match: every string you passed during indexing time and during search time becomes "". Some additional info from the Elastic docs: https://www.elastic.co/guide/en/elasticsearch/reference/5.1/analysis-pattern_replace-tokenfilter.html
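As an aside, if the intent of that filter was to strip literal dots, the dot has to be escaped, since pattern_replace interprets the pattern as a regular expression. A sketch of that assumed intent:
"filter": {
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": "\\.",
"replacement": ""
}
}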
After I removed the pattern_replace filter, this query returns everything as expected:
{
"_source": ["first_name"],
"query": {
"nested": {
"path": "schools",
"inner_hits": {
"_source": {
"includes": [
"schools.school",
"schools.grade"
]
}
},
"query": {
"bool": {
"must": {
"match": {
"schools.school": {
"query": "holy international"
}
}
},
"filter": {
"match": {
"schools.grade": {
"query": "second"
}
}
}
}
}
}
}
}
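With the filter removed, the keyword tokenizer keeps the whole value as a single lowercased token, which you can verify the same way as before:
POST /_analyze
{
"field": "schools.school",
"text": "holy international"
}
{
"tokens": [
{
"token": "holy international",
"start_offset": 0,
"end_offset": 18,
"type": "word",
"position": 0
}
]
}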
We're using the Elasticsearch completion suggester with the standard analyzer, but it seems like the text is not tokenized.
e.g.
Texts: "First Example", "Second Example"
Search: "Fi" returns "First Example"
While
Search: "Ex" doesn't return any result returns "First Example"
As the Elastic doc about the completion suggester explains (Completion Suggester):
The completion suggester is a so-called prefix suggester.
So when you send a keyword, it looks for it as a prefix of your texts.
E.g:
Search: "Fi" => "First Example"
Search: "Sec" => "Second Example"
But if you give Elastic "Ex", it returns nothing, because it cannot find a text that begins with "Ex".
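For example, a minimal suggest request (index and completion field names borrowed from the workaround example below) only matches from the start of the input:
POST /example/_search
{
"suggest": {
"full-suggestion": {
"prefix": "Fi",
"completion": {
"field": "full"
}
}
}
}
With "prefix": "Fi" this returns "First Example"; with "prefix": "Ex" the options list stays empty.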
You can try some other suggesters, like the Term Suggester.
A great workaround is to tokenize the string yourself and put the tokens in a separate tokens field.
You can then use two suggestions in your suggest query to search both fields.
Example:
PUT /example
{
"mappings": {
"doc": {
"properties": {
"full": {
"type": "completion"
},
"tokens": {
"type": "completion"
}
}
}
}
}
POST /example/doc/_bulk
{ "index":{} }
{"full": {"input": "First Example"}, "tokens": {"input": ["First", "Example"]}}
{ "index":{} }
{"full": {"input": "Second Example"}, "tokens": {"input": ["Second", "Example"]}}
POST /example/_search
{
"suggest": {
"full-suggestion": {
"prefix" : "Ex",
"completion" : {
"field" : "full",
"fuzzy": true
}
},
"token-suggestion": {
"prefix": "Ex",
"completion" : {
"field" : "tokens",
"fuzzy": true
}
}
}
}
Search result:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": 0,
"hits": []
},
"suggest": {
"full-suggestion": [
{
"text": "Ex",
"offset": 0,
"length": 2,
"options": []
}
],
"token-suggestion": [
{
"text": "Ex",
"offset": 0,
"length": 2,
"options": [
{
"text": "Example",
"_index": "example",
"_type": "doc",
"_id": "Ikvk62ABd4o_n4U8G5yF",
"_score": 2,
"_source": {
"full": {
"input": "First Example"
},
"tokens": {
"input": [
"First",
"Example"
]
}
}
},
{
"text": "Example",
"_index": "example",
"_type": "doc",
"_id": "I0vk62ABd4o_n4U8G5yF",
"_score": 2,
"_source": {
"full": {
"input": "Second Example"
},
"tokens": {
"input": [
"Second",
"Example"
]
}
}
}
]
}
]
}
}
One approach to hack in suggestions from every position of the string is to shingle the string, keep only the shingles at position 0, and from every shingle take the last token.
PUT example
{
"settings": {
"index.max_shingle_diff": 10,
"analysis": {
"filter": {
"after_last_space": {
"type": "pattern_replace",
"pattern": "(.* )",
"replacement": ""
},
"preserve_only_first": {
"type": "predicate_token_filter",
"script": {
"source": "token.position == 0"
}
},
"big_shingling": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 10,
"output_unigrams": true
}
},
"analyzer": {
"dark_magic": {
"tokenizer": "standard",
"filter": [
"lowercase",
"big_shingling",
"preserve_only_first",
"after_last_space"
]
}
}
}
},
"mappings": {
"properties": {
"suggest": {
"type": "completion",
"analyzer": "dark_magic",
"search_analyzer": "standard"
}
}
}
}
This hack works for short strings (up to 10 tokens in the example).
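To sanity-check what dark_magic emits, you can run it through _analyze. A sketch against the index above; if the chain behaves as described, "First Example" reduces to one suggestion entry per word position:
POST /example/_analyze
{
"analyzer": "dark_magic",
"text": "First Example"
}
The expected tokens are "first" (the position-0 unigram) and "example" (the last word of the position-0 shingle "first example", kept by the after_last_space filter).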
I'm trying to do a simple query against my elasticsearch _type and match multiple fields with wildcards; my first attempt was like this:
POST my_index/my_type/_search
{
"sort" : { "date_field" : {"order" : "desc"}},
"query" : {
"filtered" : {
"filter" : {
"or" : [
{
"term" : { "field1" : "4848" }
},
{
"term" : { "field2" : "6867" }
}
]
}
}
}
}
This example will successfully match every record where field1 OR field2 is exactly equal to 4848 or 6867, respectively.
What I'm trying to do is to match any text in field1 that contains 4848 and any text in field2 that contains 6867, but I'm not really sure how to do it.
I appreciate any help I can get :)
It sounds like your problem has mostly to do with analysis. The appropriate solution depends on the structure of your data and what you want to match. I'll provide a couple of examples.
First, let's assume that your data is such that we can get what we want just using the standard analyzer. This analyzer will tokenize text fields on whitespace, punctuation and symbols. So the text "1234-5678-90" will be broken into the terms "1234", "5678", and "90", so a "term" query or filter for any of those terms will match that document. More concretely:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"properties": {
"field1":{
"type": "string",
"analyzer": "standard"
},
"field2":{
"type": "string",
"analyzer": "standard"
}
}
}
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"field1": "1212-2323-4848","field2": "1234-5678-90"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"field1": "0000-0000-0000","field2": "0987-6543-21"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"field1": "1111-2222-3333","field2": "6867-4545-90"}
POST test_index/_search
{
"query": {
"filtered": {
"filter": {
"or": [
{
"term": { "field1": "4848" }
},
{
"term": { "field2": "6867" }
}
]
}
}
}
}
...
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"field1": "1212-2323-4848",
"field2": "1234-5678-90"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"field1": "1111-2222-3333",
"field2": "6867-4545-90"
}
}
]
}
}
(Explicitly writing "analyzer": "standard" is redundant since that is the default analyzer used if you do not specify one; I just wanted to make it obvious.)
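You can confirm the tokenization with the _analyze API. A sketch; note that on the older 1.x releases this thread targets, _analyze takes query-string parameters rather than a JSON body:
POST /test_index/_analyze
{
"analyzer": "standard",
"text": "1234-5678-90"
}
This should come back with the three tokens "1234", "5678", and "90".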
On the other hand, if the text is embedded in such a way that the standard analysis doesn't provide what you want, say something like "121223234848" where you want to match on "4848", you will have to do something a little more sophisticated, using ngrams. Here is an example of that (notice the difference in the data):
DELETE /test_index
PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"field1":{
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"field2":{
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
}
}
}
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"field1": "121223234848","field2": "1234567890"}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"field1": "000000000000","field2": "0987654321"}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"field1": "111122223333","field2": "6867454590"}
POST test_index/_search
{
"query": {
"filtered": {
"filter": {
"or": [
{
"term": { "field1": "4848" }
},
{
"term": { "field2": "6867" }
}
]
}
}
}
}
...
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"field1": "121223234848",
"field2": "1234567890"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"field1": "111122223333",
"field2": "6867454590"
}
}
]
}
}
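As before, you can inspect what nGram_analyzer emits for an indexed value (same _analyze version caveat as above):
POST /test_index/_analyze
{
"analyzer": "nGram_analyzer",
"text": "6867454590"
}
Among the 2-to-20-character grams it emits is the term "6867", which is why the term filter on field2 now matches document 3.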
There is a lot going on here, so I won't attempt to explain it in this post. If you want more explanation I would encourage you to read this blog post: http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams. Hope you'll forgive the shameless plug. ;)
Hope that helps.
Despite Lucene's internal document structure, I'm trying to get my nested fields highlighted when a search term is present in their content.
Here is the explanation from the Elasticsearch documentation (mapping nested type):
Internal Implementation
Internally, nested objects are indexed as additional documents, but, since they can be guaranteed to be indexed within the same "block", it allows for extremely fast joining with parent docs.
Those internal nested documents are automatically masked away when doing operations against the index (like searching with a match_all query), and they bubble out when using the nested query.
Because nested docs are always masked to the parent doc, the nested docs can never be accessed outside the scope of the nested query. For example stored fields can be enabled on fields inside nested objects, but there is no way of retrieving them, since stored fields are fetched outside of the nested query scope.
0. In my case
I have an Elasticsearch index containing a mapping like the following:
{
"my_documents": {
"dynamic_date_formats": [
"dd.MM.yyyy",
"yyyy-MM-dd",
"yyyy-MM-dd HH:mm:ss"
],
"index_analyzer": "Analyzer2_index",
"search_analyzer": "Analyzer2_search_decompound",
"_timestamp": {
"enabled": true
},
"properties": {
"identifier": {
"type": "string"
},
"description": {
"type": "multi_field",
"fields": {
"sort": {
"type": "string",
"index": "not_analyzed"
},
"description": {
"type": "string"
}
}
},
"files": {
"type": "nested",
"include_in_root": true,
"properties": {
"content": {
"type": "string",
"include_in_root": true
}
}
},
"and then some other": "normal string fields"
}
}
}
I'm trying to execute a query like this:
{
"size": 100,
"query": {
"bool": {
"should": [
{
"nested": {
"path": "files",
"query": {
"bool": {
"should": {
"match": {
"content": {
"query": "burpcontrol",
"minimum_should_match": "85%"
}
}
}
}
}
}
},
{
"match": {
"description": {
"query": "burpcontrol",
"minimum_should_match": "85%"
}
}
},
{
"match": {
"identifier": {
"query": "burpcontrol",
"minimum_should_match": "85%"
}
}
} ]
}
},
"highlight": {
"pre_tags": [
"<span style=\"background-color: yellow\">"
],
"post_tags": [
"</span>"
],
"order": "score",
"no_match_size": 100,
"fragment_size": 50,
"number_of_fragments": 3,
"require_field_match": true,
"fields": {
"files.content": {},
"description": {},
"identifier": {}
}
}
}
The problems I have are:
1. require_field_match
If I use "require_field_match": false I obtain that, even if highlighting doesn't work on nested fields, the search term is highlighted anyway in ALL the fields.
This is the solution I'm actually using, but the performances are horrible. For 50 documents my query needs 25secs. 100 documents about 50secs. 10 documents 5secs.
And if I remove the nested field from the highlighting everything works fast as light!
2. include_in_root
I would like to have a flattened version of my nested fields (that is, to store them as normal objects/fields).
To do this I should specify
"files": { "type": "nested", "include_in_root": true, ...
but I don't know why, after reindexing, I cannot see any additional flattened fields in the document root (I was expecting something like "files.content": ["content1", "content2", "..."]).
If it worked, it would be possible to access the content of the nested field through the flattened field and perform the highlighting on it.
Do you know if it is possible to achieve good (and performant) highlighting on nested fields or, at least, can you tell me why my query is so slow? (I have already optimised the fragments.)
There are a number of things you can do here with a parent/child relationship. I'll go over a few, and hopefully that will lead you in the right direction; it will still take lots of testing to figure out whether this solution is more performant for you. Also, I left out a few details of your setup for clarity. Please forgive the long post.
I set up a parent/child mapping as follows:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"parent_doc": {
"properties": {
"identifier": {
"type": "string"
},
"description": {
"type": "string"
}
}
},
"child_doc": {
"_parent": {
"type": "parent_doc"
},
"properties": {
"content": {
"type": "string"
}
}
}
}
}
Then added some test docs:
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"parent_doc","_id":1}}
{"identifier": "first", "description":"some special text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is special"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is not"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":2}}
{"identifier": "second", "description":"some different text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":2}}
{"content":"different child text, but special"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":3}}
{"identifier": "third", "description":"we don't want this parent"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":3}}
{"content":"or this child"}
If I'm understanding your specs correctly, we would want a query for "special" to return every one of these documents except the last two (correct me if I'm wrong). We want docs that match the text, have a child that matches the text, or have a parent that matches the text.
We can get back parents that match the query like this:
POST /test_index/parent_doc/_search
{
"query": {
"match": {
"description": "special"
}
},
"highlight": {
"fields": {
"description": {},
"identifier": {}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.1263815,
"hits": [
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "1",
"_score": 1.1263815,
"_source": {
"identifier": "first",
"description": "some special text"
},
"highlight": {
"description": [
"some <em>special</em> text"
]
}
}
]
}
}
And we can get back children that match the query like this:
POST /test_index/child_doc/_search
{
"query": {
"match": {
"content": "special"
}
},
"highlight": {
"fields": {
"content": {}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.92364895,
"hits": [
{
"_index": "test_index",
"_type": "child_doc",
"_id": "geUFenxITZSL7epvB568uA",
"_score": 0.92364895,
"_source": {
"content": "text that is special"
},
"highlight": {
"content": [
"text that is <em>special</em>"
]
}
},
{
"_index": "test_index",
"_type": "child_doc",
"_id": "IMHXhM3VRsCLGkshx52uAQ",
"_score": 0.80819285,
"_source": {
"content": "different child text, but special"
},
"highlight": {
"content": [
"different child text, but <em>special</em>"
]
}
}
]
}
}
We can get back parents that match the text and children that match the text like this:
POST /test_index/parent_doc,child_doc/_search
{
"query": {
"multi_match": {
"query": "special",
"fields": ["description", "content"]
}
},
"highlight": {
"fields": {
"description": {},
"identifier": {},
"content": {}
}
}
}
...
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.1263815,
"hits": [
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "1",
"_score": 1.1263815,
"_source": {
"identifier": "first",
"description": "some special text"
},
"highlight": {
"description": [
"some <em>special</em> text"
]
}
},
{
"_index": "test_index",
"_type": "child_doc",
"_id": "geUFenxITZSL7epvB568uA",
"_score": 0.75740534,
"_source": {
"content": "text that is special"
},
"highlight": {
"content": [
"text that is <em>special</em>"
]
}
},
{
"_index": "test_index",
"_type": "child_doc",
"_id": "IMHXhM3VRsCLGkshx52uAQ",
"_score": 0.6627297,
"_source": {
"content": "different child text, but special"
},
"highlight": {
"content": [
"different child text, but <em>special</em>"
]
}
}
]
}
}
However, to get back all the docs related to this query, we need to use a bool query:
POST /test_index/parent_doc,child_doc/_search
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "special",
"fields": [
"description",
"content"
]
}
},
{
"has_child": {
"type": "child_doc",
"query": {
"match": {
"content": "special"
}
}
}
},
{
"has_parent": {
"type": "parent_doc",
"query": {
"match": {
"description": "special"
}
}
}
}
]
}
},
"highlight": {
"fields": {
"description": {},
"identifier": {},
"content": {}
}
},
"fields": ["_parent", "_source"]
}
...
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 0.8866254,
"hits": [
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "1",
"_score": 0.8866254,
"_source": {
"identifier": "first",
"description": "some special text"
},
"highlight": {
"description": [
"some <em>special</em> text"
]
}
},
{
"_index": "test_index",
"_type": "child_doc",
"_id": "geUFenxITZSL7epvB568uA",
"_score": 0.67829096,
"_source": {
"content": "text that is special"
},
"fields": {
"_parent": "1"
},
"highlight": {
"content": [
"text that is <em>special</em>"
]
}
},
{
"_index": "test_index",
"_type": "child_doc",
"_id": "IMHXhM3VRsCLGkshx52uAQ",
"_score": 0.18709806,
"_source": {
"content": "different child text, but special"
},
"fields": {
"_parent": "2"
},
"highlight": {
"content": [
"different child text, but <em>special</em>"
]
}
},
{
"_index": "test_index",
"_type": "child_doc",
"_id": "NiwsP2VEQBKjqu1M4AdjCg",
"_score": 0.12531912,
"_source": {
"content": "text that is not"
},
"fields": {
"_parent": "1"
}
},
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "2",
"_score": 0.12531912,
"_source": {
"identifier": "second",
"description": "some different text"
}
}
]
}
}
(I included the "_parent" field to make it easier to see why docs were included in the results, as shown here).
Let me know if this helps.
Here is the code I used:
http://sense.qbox.io/gist/d69a4d6531dc063faa4b4e094cff2a472a73c5a6