Search the entire text of an `object` field in ElasticSearch - elasticsearch

I have an ElasticSearch index which has an object property which stores some very dynamic JSON. I'd like to do full-text search on that JSON field. How do I index this field so that I can see if a certain word appears anywhere in the JSON, without knowing the exact key it will appear in, in advance? Like, is there a way to just index all leaf nodes of the JSON property? I'm on ElasticSearch 6.8 by the way, so I don't have the flattened field, which I think does this.
Index definition
PUT /test?include_type_name=true
{
"settings": {"number_of_shards": 1, "number_of_replicas": 1},
"mappings": {
"_doc": {
"_source": {"enabled": "true"},
"properties": {
"content": {
"type": "object",
"enabled": "true"
}
}
}
}
}
Document insertion
PUT /test/_doc/1
{
"content": {
"a": {
"b": {
"text": "42"
}
}
}
}
Query
GET /test/_search
{
"query": {
"match": {
"content": "42"
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}

You're right, the flattened field type is what you need. But until your upgrade, you can do it using dynamic templates. In the mapping below, we match any string field inside the content object field to text and we also copy its value into another field called content_text that we're going to be able to search on:
PUT /test
{
"mappings": {
"dynamic_templates": [
{
"full_name": {
"match_mapping_type": "string",
"path_match": "content.*",
"mapping": {
"type": "text",
"copy_to": "content_text"
}
}
}
],
"properties": {
"content_text": {
"type": "text"
},
"content": {
"type": "object",
"enabled": "true"
}
}
}
}
Your sample document:
PUT /test/_doc/1
{
"content": {
"a": {
"b": {
"text": "42"
}
}
}
}
And now you can search on that new field as if you were searching on any field inside the content field:
GET /test/_search
{
"query": {
"match": {
"content_text": "42"
}
}
}

Related

ElasticSearch Query fields based on conditions on another field

Mapping
PUT /employee
{
"mappings": {
"post": {
"properties": {
"name": {
"type": "keyword"
},
"email_ids": {
"properties":{
"id" : { "type" : "integer"},
"value" : { "type" : "keyword"}
}
},
"primary_email_id":{
"type": "integer"
}
}
}
}
}
Data
POST employee/post/1
{
"name": "John",
"email_ids": [
{
"id" : 1,
"value" : "1#email.com"
},
{
"id" : 2,
"value" : "2#email.com"
}
],
"primary_email_id": 2 // Here 2 refers to the id field of email_ids.id (2#email.com).
}
I need help to form a query to check if an email id is already taken as a primary email?
eg: If I query for 1#email.com I should get result as No as 1#email.com is not a primary email id.
If I query for 2#email.com I should get result as Yes as 2#email.com is a primary email id for John.
As far as i know with this mapping you can not achive what you are expecting.
But, You can create email_ids field as nested type and add one more field like isPrimary and set value of it to true whenever email is primary email.
Index Mapping
PUT employee
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"email_ids": {
"type": "nested",
"properties": {
"id": {
"type": "integer"
},
"value": {
"type": "keyword"
},
"isPrimary":{
"type": "boolean"
}
}
},
"primary_email_id": {
"type": "integer"
}
}
}
}
Sample Document
POST employee/_doc/1
{
"name": "John",
"email_ids": [
{
"id": 1,
"value": "1#email.com"
},
{
"id": 2,
"value": "2#email.com",
"isPrimary": true
}
],
"primary_email_id": 2
}
Query
You need to keep below query as it is and only need to change email address when you want to see if email is primary or not.
POST employee/_search
{
"_source": false,
"query": {
"nested": {
"path": "email_ids",
"query": {
"bool": {
"must": [
{
"term": {
"email_ids.value": {
"value": "2#email.com"
}
}
},
{
"term": {
"email_ids.isPrimary": {
"value": "true"
}
}
}
]
}
}
}
}
}
Result
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.98082924,
"hits" : [
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.98082924
}
]
}
}
Interpret Result:
Elasticsearch will not return result in boolean like true or false but you can implement it at application level. You can consider value of hits.total.value from result, if it is 0 then you can consider false otherwise true.
PS: Answer is based on ES version 7.10.

Can't Highlight Dynamic Template Field Value in Elasticsearch

Follow up to this question.
I have a dynamic template which copies the text of a JSON blob to a single text field, and I'd like to search on that field and highlight matches. Here is my full code for ES 6.5
DELETE /test
PUT /test?include_type_name=true
{
"settings": {"number_of_shards": 1,"number_of_replicas": 1},
"mappings": {
"_doc": {
"dynamic_templates": [
{
"full_name": {
"match_mapping_type": "string",
"path_match": "content.*",
"mapping": {
"type": "text",
"copy_to": "content_text"
}
}
}
],
"properties": {
"content_text": {
"type": "text"
},
"content": {
"type": "object",
"enabled": "true"
}
}
}
}
}
PUT /test/_doc/1?refresh=true
{
"content": {
"a": {
"b": {
"text": "42"
}
}
}
}
GET /test/_search
{
"query": {
"match": {
"content_text": "42"
}
},
"highlight": {
"fields": {
"content_text": {}
}
}
}
The response does not show the highlighted content_text
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"content" : {
"a" : {
"b" : {
"text" : "42"
}
}
}
}
}
]
}
}
As you can see, the content_text field is not highlight. It's also not in the response at all. How do I get highlights for this field to show up?
This is a tricky one, but will make sense once you read what follows.
As per the official documentation on highlighting, the actual content of a field is required to exist somewhere. So if the field is not stored (i.e. the mapping does not set store to true), the actual _source is loaded and the relevant field is extracted from _source.
In your case, the content_text field doesn't exist in the _source document (i.e. it is just indexed from other text fields present in content.*) and in the mapping, the store parameter is not set to true (it is false by default).
So you simply need to change your mapping to this:
"content_text": {
"store": true,
"type": "text"
},
And then your query will yield this:
"highlight" : {
"content_text" : [
"<em>42</em>"
]
}

Elasticsearch - Is it possible to create histograms without having the field indexed

I come across the following phrase
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html
For instance if you have a numeric field called foo that you need to run histograms on but that you never need to filter on, you can safely disable indexing on this field in your mappings:
PUT index
{
"mappings": {
"properties": {
"foo": {
"type": "integer",
"index": false
}
}
}
}
Does it mean aggregations like histograms can be created though the field is NOT indexed ?
Yes, that's correct and that's easy to test:
Create the index:
PUT index
{
"mappings": {
"properties": {
"foo": {
"type": "integer",
"index": false
}
}
}
}
Index a sample document:
PUT index/_doc/1
{
"foo": 23
}
Run an histogram aggregation:
POST index/_search
{
"aggs": {
"histo": {
"histogram": {
"field": "foo",
"interval": 10
}
}
}
}
Results:
"aggregations" : {
"histo" : {
"buckets" : [
{
"key" : 20.0,
"doc_count" : 1
}
]
}
}

How to Query elasticsearch index with nested and non nested fields

I have an elastic search index with the following mapping:
PUT /student_detail
{
"mappings" : {
"properties" : {
"id" : { "type" : "long" },
"name" : { "type" : "text" },
"email" : { "type" : "text" },
"age" : { "type" : "text" },
"status" : { "type" : "text" },
"tests":{ "type" : "nested" }
}
}
}
Data stored is in form below:
{
"id": 123,
"name": "Schwarb",
"email": "abc#gmail.com",
"status": "current",
"age": 14,
"tests": [
{
"test_id": 587,
"test_score": 10
},
{
"test_id": 588,
"test_score": 6
}
]
}
I want to be able to query the students where name like '%warb%' AND email like '%gmail.com%' AND test with id 587 have score > 5 etc. The high level of what is needed can be put something like below, dont know what would be the actual query, apologize for this messy query below
GET developer_search/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "abc"
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": IN [587]
}
},
{
"term": {
"tests.test_score": >= some value
}
}
]
}
}
}
}
]
}
}
}
The query must be flexible so that we can enter dynamic test Ids and their respective score filters along with the fields out of nested fields like age, name, status
Something like that?
GET student_detail/_search
{
"query": {
"bool": {
"must": [
{
"wildcard": {
"name": {
"value": "*warb*"
}
}
},
{
"wildcard": {
"email": {
"value": "*gmail.com*"
}
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
},
"inner_hits": {}
}
}
]
}
}
}
Inner hits is what you are looking for.
You must make use of Ngram Tokenizer as wildcard search must not be used for performance reasons and I wouldn't recommend using it.
Change your mapping to the below where you can create your own Analyzer which I've done in the below mapping.
How elasticsearch (albiet lucene) indexes a statement is, first it breaks the statement or paragraph into words or tokens, then indexes these words in the inverted index for that particular field. This process is called Analysis and that this would only be applicable on text datatype.
So now you only get the documents if these tokens are available in inverted index.
By default, standard analyzer would be applied. What I've done is I've created my own analyzer and used Ngram Tokenizer which would be creating many more tokens than just simply words.
Default Analyzer on Life is beautiful would be life, is, beautiful.
However using Ngrams, the tokens for Life would be lif, ife & life
Mapping:
PUT student_detail
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
},
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"email" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"age" : {
"type" : "text" <--- I am not sure why this is text. Change it to long or int. Would leave this to you
},
"status" : {
"type" : "text",
"analyzer": "my_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"tests":{
"type" : "nested"
}
}
}
}
Note that in the above mapping I've created a sibling field in the form of keyword for name, email and status as below:
"name":{
"type":"text",
"analyzer":"my_analyzer",
"fields":{
"keyword":{
"type":"keyword"
}
}
}
Now your query could be as simple as below.
Query:
POST student_detail/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "war" <---- Note this. This would even return documents having "Schwarb"
}
},
{
"match": {
"email": "gmail" <---- Note this
}
},
{
"nested": {
"path": "tests",
"query": {
"bool": {
"must": [
{
"term": {
"tests.test_id": 587
}
},
{
"range": {
"tests.test_score": {
"gte": 5
}
}
}
]
}
}
}
}
]
}
}
}
Note that for exact matches I would make use of Term Queries on keyword fields while for normal searches or LIKE in SQL I would make use of simple Match Queries on text Fields provided they make use of Ngram Tokenizer.
Also note that for >= and <= you would need to make use of Range Query.
Response:
{
"took" : 233,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.7260926,
"hits" : [
{
"_index" : "student_detail",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.7260926,
"_source" : {
"id" : 123,
"name" : "Schwarb",
"email" : "abc#gmail.com",
"status" : "current",
"age" : 14,
"tests" : [
{
"test_id" : 587,
"test_score" : 10
},
{
"test_id" : 588,
"test_score" : 6
}
]
}
}
]
}
}
Note that I observe the document you've mentioned in your question, in my response when I run the query.
Please do read the links I've shared. It is vital that you understand the concepts. Hope this helps!

Elasticsearch metric aggregation: number of elements in array

I want to do a quite involved query/aggregation. I can't see how because I've just started working with ES. The documents I have look something like this:
{
"keyword": "some keyword",
"items": [
{
"name":"my first item",
"item_property_1":"A",
( other properties here )
},
{
"name":"my second item",
"item_property_1":"B",
( other properties here )
},
{
"name":"my third item",
"item_property_1":"A",
( other properties here )
}
]
( other properties... )
},
{
"keyword": "different keyword",
"items": [
{
"name":"cool item",
"item_property_1":"A",
( other properties here )
},
{
"name":"awesome item",
"item_property_1":"C",
( other properties here )
},
]
( other properties... )
},
( other documents... )
Now, what I would like to do is to, for each keyword, count how many items there are for which of the several possible values that property_1 can have. That is, I want a bucket aggregation that would have the following response:
{
"keyword": "some keyword",
"item_property_1_aggretation": [
{
"key":"A",
"count": 2,
},
{
"key":"B",
"count": 1,
}
]
},
{
"keyword": "different keyword",
"item_property_1_aggretation": [
{
"key":"A",
"count": 1,
},
{
"key":"C",
"count": 1,
}
]
},
( other keywords... )
If mappings are necessary, could you also specificy which? I don't have any non-default mappings, I just dumped everything in there.
EDIT:
Saving you the trouble by posting here the bulk PUT for the previous example
PUT /test/test/_bulk
{ "index": {}}
{ "keyword": "some keyword", "items": [ { "name":"my first item", "item_property_1":"A" }, { "name":"my second item", "item_property_1":"B" }, { "name":"my third item", "item_property_1":"A" } ]}
{ "index": {}}
{ "keyword": "different keyword", "items": [ { "name":"cool item", "item_property_1":"A" }, { "name":"awesome item", "item_property_1":"C" } ]}
EDIT2:
I just tried this:
POST /test/test/_search
{
"size":2,
"aggregations": {
"property_1_count": {
"terms":{
"field":"item_property_1"
}
}
}
}
and got this:
"aggregations": {
"property_1_count": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "a",
"doc_count": 2
},
{
"key": "b",
"doc_count": 1
},
{
"key": "c",
"doc_count": 1
}
]
}
}
close but no cigar. You can see what's happening, it's bucketing over each item_property_1 irrespectively of the keyword it belongs to. I'm sure the solution involves adding some mapping correctly, but I can't put my finger on it. Suggestions?
EDIT3:
Based on this:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-nested-type.html
I want to try adding a nested type to property items. To do that, I tried:
PUT /test/_mapping/test
{
"test":{
"properties": {
"items": {
"type": "nested",
"properties": {
"item_property_1":{"type":"string"}
}
}
}
}
}
However, this returns an error:
{
"error": "MergeMappingException[Merge failed with failures {[object mapping [items] can't be changed from non-nested to nested]}]",
"status": 400
}
This might have to do with the warning on that url: "changing an object type to nested type requires reindexing."
So, how do I do that?
Nice tries, you were almost there! Here is what I came up with. Based on your mapping proposal, the mapping I'm using is the following:
curl -XPUT localhost:9200/test/_mapping/test -d '{
"test": {
"properties": {
"keyword": {
"type": "string",
"index": "not_analyzed"
},
"items": {
"type": "nested",
"properties": {
"name": {
"type": "string"
},
"item_property_1": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}'
Note: you need to wipe and reindex your data, since you cannot change a field type from being not nested to nested.
Then I created some data with the bulk query you shared:
curl -XPOST localhost:9200/test/test/_bulk -d '
{ "index": {}}
{ "keyword": "some keyword", "items": [ { "name":"my first item", "item_property_1":"A" }, { "name":"my second item", "item_property_1":"B" }, { "name":"my third item", "item_property_1":"A" } ]}
{ "index": {}}
{ "keyword": "different keyword", "items": [ { "name":"cool item", "item_property_1":"A" }, { "name":"awesome item", "item_property_1":"C" } ]}
'
Finally, here is the aggregation query you can use to get the results you expect. We first bucket by keyword using a terms aggregation and then for each keyword, we bucket by the nested item_property_1 field. Since items is now a nested type, the key is to use a nested aggregation for items and then a terms sub-aggregation for the item_property_1 field.
{
"size": 0,
"aggregations": {
"by_keyword": {
"terms": {
"field": "keyword"
},
"aggs": {
"prop_1_count": {
"nested": {
"path": "items"
},
"aggs": {
"prop_1": {
"terms": {
"field": "items.item_property_1"
}
}
}
}
}
}
}
}
Running that query on your data set will yield this:
{
...
"aggregations" : {
"by_keyword" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "different keyword", <---- keyword 1
"doc_count" : 1,
"prop_1_count" : {
"doc_count" : 2,
"prop_1" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ { <---- buckets for item_property_1
"key" : "A",
"doc_count" : 1
}, {
"key" : "C",
"doc_count" : 1
} ]
}
}
}, {
"key" : "some keyword", <---- keyword 2
"doc_count" : 1,
"prop_1_count" : {
"doc_count" : 3,
"prop_1" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ { <---- buckets for item_property_1
"key" : "A",
"doc_count" : 2
}, {
"key" : "B",
"doc_count" : 1
} ]
}
}
} ]
}
}
}

Resources