Trying to extract a leaf field from Elasticsearch - elasticsearch

I have an object in elasticsearch which resembles something like this:
{
"text": "something something something",
"entities": { "hashtags":["test","test123"]}
}
The problem is that not every document has the entities attribute set. So I want to write a query which:
must contain a keyword in the text field
must have the entities field
extracts the entities.hashtags field
I'm trying to extract a leaf field using the following query. The problem is that I still get documents which don't have an entities field.
For the second part of the question, I was wondering: How do I only extract the entities.hashtags field? I tried something like "fields": ["entities.hashtags"] but it didn't work.
{
"size": 2000,
"query": {
"filtered": {
"query": {
"match_all": {
}
},
"filter": {
"bool": {
"must": [{
"term": {
"text": "something"
}
},
{
"missing": {
"field": "entities",
"existence": true
}
}]
}
}
}
}
}

This seems to do what you want, if I'm understanding you correctly. A "term" filter on the "text" field and an "exists" filter on the "entities" field filters the docs, and a "terms" aggregation on "entities.hashtags" extracts the values. I'll just post the full example I used:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1
}
}
PUT /test_index/doc/1
{
"text": "something something something",
"entities": { "hashtags": ["test","test123"] }
}
PUT /test_index/doc/2
{
"text": "another doc",
"entities": { "hashtags": ["testagain","testagain123"] }
}
PUT /test_index/doc/3
{
"text": "doc with no entities"
}
POST /test_index/_search
{
"size": 0,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{ "term": { "text": "something" } },
{ "exists": { "field": "entities" } }
]
}
}
}
},
"aggs": {
"hashtags": {
"terms": {
"field": "entities.hashtags"
}
}
}
}
...
{
"took": 35,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"hashtags": {
"buckets": [
{
"key": "test",
"doc_count": 1
},
{
"key": "test123",
"doc_count": 1
}
]
}
}
}
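For reference, here is a minimal sketch of the same request for newer Elasticsearch versions, where the filtered query is no longer available (it was removed in 5.0) and its filters move into the filter clause of a bool query. The "_source" entry is an assumption about what the second part of the question is after (returning only entities.hashtags from each hit instead of aggregating). The function name build_hashtag_search is hypothetical, and the body has not been run against a cluster.

```python
import json


def build_hashtag_search(keyword, size=2000):
    """Build a search body: keyword in `text`, `entities` present, only hashtags returned."""
    return {
        "size": size,
        # Source filtering (assumed goal): return only this leaf field from each hit.
        "_source": ["entities.hashtags"],
        "query": {
            "bool": {
                # Filter context: no scoring, same two conditions as the answer above.
                "filter": [
                    {"term": {"text": keyword}},
                    {"exists": {"field": "entities"}},
                ]
            }
        },
    }


body = build_hashtag_search("something")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of POST /test_index/_search.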

Related

elastic search : Aggregating the specific nested documents only

I want to aggregate the specific nested documents which satisfies the given query.
Let me explain it through an example. I have inserted two records in my index:
First document is,
{
"project": [
{
"subject": "maths",
"marks": 47
},
{
"subject": "computers",
"marks": 22
}
]
}
second document is,
{
"project": [
{
"subject": "maths",
"marks": 65
},
{
"subject": "networks",
"marks": 72
}
]
}
Each record contains subjects along with their marks. From these documents, I need the average marks for the maths subject alone.
The query I tried is:
{
"size": 0,
"aggs": {
"avg_marks": {
"avg": {
"field": "project.marks"
}
}
},
"query": {
"bool": {
"must": [
{
"query_string": {
"query": "project.subject:maths",
"analyze_wildcard": true,
"default_field": "*"
}
}
]
}
}
}
This returns the average of all the marks, which is not what I need:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"avg_marks": {
"value": 51.5
}
}
}
I just need the average for the maths subject from the given documents, for which the expected result is 56.00.
Any help with the query, or any ideas, would be appreciated.
Thanks in advance.
First, you need to specify in your mapping that the index has a nested field, like the following:
PUT /nested-index
{
"mappings": {
"document": {
"properties": {
"project": {
"type": "nested",
"properties": {
"subject": {
"type": "keyword"
},
"marks": {
"type": "long"
}
}
}
}
}
}
}
then you insert your docs:
PUT nested-index/document/1
{
"project": [
{
"subject": "maths",
"marks": 47
},
{
"subject": "computers",
"marks": 22
}
]
}
then insert second doc:
PUT nested-index/document/2
{
"project": [
{
"subject": "maths",
"marks": 65
},
{
"subject": "networks",
"marks": 72
}
]
}
and then you do aggregation but specify that you have nested structure like this:
GET nested-index/_search
{
"size": 0,
"aggs": {
"subjects": {
"nested": {
"path": "project"
},
"aggs": {
"subjects": {
"terms": {
"field": "project.subject",
"size": 10
},
"aggs": {
"average": {
"avg": {
"field": "project.marks"
}
}
}
}
}
}
}
}
The reason your query gives that result is that, without a nested mapping, the avg aggregation averages every number in the array of each matching document. It does not matter that you only want to aggregate by one subject: if the subject appears anywhere in a document's array, all of that document's marks are included.
So with those two docs, because both contain the maths subject, the average is calculated like this:
(47 + 22 + 65 + 72) / 4 = 51.5
If you want the average for networks, it will return the following (only one document contains networks, but the average is taken over all values in that document's array):
(65 + 72) / 2 = 68.5
So you need to use a nested structure in this case.
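The arithmetic above can be reproduced with a small pure-Python simulation. This is only a sketch of the matching behaviour described here, not of Elasticsearch internals; the function names flattened_avg and nested_avg are made up for illustration.

```python
docs = [
    {"project": [{"subject": "maths", "marks": 47},
                 {"subject": "computers", "marks": 22}]},
    {"project": [{"subject": "maths", "marks": 65},
                 {"subject": "networks", "marks": 72}]},
]


def flattened_avg(docs, subject):
    """Non-nested behaviour: a document matches if any entry has the subject,
    and then every mark in that document is averaged."""
    marks = [p["marks"]
             for d in docs
             if any(p["subject"] == subject for p in d["project"])
             for p in d["project"]]
    return sum(marks) / len(marks)


def nested_avg(docs, subject):
    """Nested behaviour: each array entry is filtered on its own."""
    marks = [p["marks"]
             for d in docs
             for p in d["project"]
             if p["subject"] == subject]
    return sum(marks) / len(marks)


print(flattened_avg(docs, "maths"))     # 51.5
print(flattened_avg(docs, "networks"))  # 68.5
print(nested_avg(docs, "maths"))        # 56.0
```

The last value is the 56.00 the question expects, which is what the nested mapping makes possible.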
If you are interested in just one subject, you can then restrict the aggregation to that subject, like this (subject equal to "maths"):
GET nested-index/_search
{
"size": 0,
"aggs": {
"project": {
"nested": {
"path": "project"
},
"aggs": {
"subjects": {
"filter": {
"term": {
"project.subject": "maths"
}
},
"aggs": {
"average": {
"avg": {
"field": "project.marks"
}
}
}
}
}
}
}
}

Empty inner_hits in compound Elasticsearch filter

I'm seeing what appears to be aberrant behavior in inner_hits results within nested boolean queries.
Test data (abbreviated for brevity):
# MAPPING
PUT unit_testing
{
"mappings": {
"document": {
"properties": {
"display_name": {"type": "text"},
"metadata": {
"properties": {
"NAME": {"type": "text"}
}
}
}
},
"paragraph": {
"_parent": {"type": "document"},
"_routing": {"required": true},
"properties": {
"checksum": {"type": "text"},
"sentences": {
"type": "nested",
"properties": {
"text": {"type": "text"}
}
}
}
}
}
}
# DOCUMENT X 2 (d0, d1)
PUT unit_testing/document/doc_id_d0
{
"display_name": "Test Document d0",
"paragraphs": [
"para_id_d0p0",
"para_id_d0p1"
],
"metadata": {"NAME": "Test Document d0 Metadata"}
}
# PARAGRAPH X 2 (d0p0, d1p0)
PUT unit_testing/paragraph/para_id_d0p0?parent=doc_id_d0
{
"checksum": "para_checksum_d0p0",
"sentences": [
{"text": "Test sentence d0p0s0"},
{"text": "Test sentence d0p0s1 ODD"},
{"text": "Test sentence d0p0s2 EVEN"},
{"text": "Test sentence d0p0s3 ODD"},
{"text": "Test sentence d0p0s4 EVEN"}
]
}
This initial query behaves as I would expect (I'm aware that the metadata filter isn't actually necessary in this example case):
GET unit_testing/paragraph/_search
{
"_source": "false",
"query": {
"bool": {
"must": [
{
"has_parent": {
"query": {
"match_phrase": {
"metadata.NAME": "Test Document d0 Metadata"
}
},
"type": "document"
}
},
{
"nested": {
"inner_hits": {},
"path": "sentences",
"query": {
"match": {
"sentences.text": "d0p0s0"
}
}
}
}
]
}
}
}
It yields an inner_hits object containing the one sentence that matched the predicate (some fields removed for clarity):
{
"hits": {
"hits": [
{
"_source": {},
"inner_hits": {
"sentences": {
"hits": {
"hits": [
{
"_source": {
"text": "Test sentence d0p0s0"
}
}
]
}
}
}
}
]
}
}
The following query is an attempt to embed the query above within a parent "should" clause, to create a logical OR between the initial query, and an additional query that matches a single sentence:
GET unit_testing/paragraph/_search
{
"_source": "false",
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"has_parent": {
"query": {
"match_phrase": {
"metadata.NAME": "Test Document d0 Metadata"
}
},
"type": "document"
}
},
{
"nested": {
"inner_hits": {},
"path": "sentences",
"query": {
"match": {
"sentences.text": "d0p0s0"
}
}
}
}
]
}
},
{
"nested": {
"inner_hits": {},
"path": "sentences",
"query": {
"match": {
"sentences.text": "d1p0s0"
}
}
}
}
]
}
}
}
While the "d1" query outputs the result one would expect, with an inner_hits object containing the matching sentence, the original "d0" query now yields an empty inner_hits object:
{
"hits": {
"hits": [
{
"_source": {},
"inner_hits": {
"sentences": {
"hits": {
"total": 0,
"hits": []
}
}
}
},
{
"_source": {},
"inner_hits": {
"sentences": {
"hits": {
"hits": [
{
"_source": {
"text": "Test sentence d1p0s0"
}
}
]
}
}
}
}
]
}
}
Although I'm using the elasticsearch_dsl Python library to build and combine these queries, and I'm something of a novice with respect to the Query DSL, the query format looks solid to me.
What am I missing?
I think what is missing is the name parameter for inner_hits: you have two inner_hits clauses in two different queries that would end up with the same name. Try giving each inner_hits its own name parameter (0).
0 - https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-inner-hits.html#_options
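A minimal sketch of that suggestion, built as plain Python dicts (the same structure elasticsearch_dsl would serialize to). The helper nested_sentence_match and the names "d0_sentences" and "d1_sentences" are hypothetical; any two distinct names should do.

```python
def nested_sentence_match(text, hits_name):
    """Nested query on `sentences` whose inner_hits block carries its own name."""
    return {
        "nested": {
            "path": "sentences",
            "query": {"match": {"sentences.text": text}},
            # A distinct name keeps the two inner_hits definitions from colliding.
            "inner_hits": {"name": hits_name},
        }
    }


query = {
    "bool": {
        "should": [
            {
                "bool": {
                    "must": [
                        {
                            "has_parent": {
                                "type": "document",
                                "query": {
                                    "match_phrase": {
                                        "metadata.NAME": "Test Document d0 Metadata"
                                    }
                                },
                            }
                        },
                        nested_sentence_match("d0p0s0", "d0_sentences"),
                    ]
                }
            },
            nested_sentence_match("d1p0s0", "d1_sentences"),
        ]
    }
}
```

With this in place, each hit's inner_hits object should be keyed by these names rather than by the shared default "sentences".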

Elastic Search Querying/filtering nested arrays

I have stored the following type of nested data in my index test_agg in ES.
{
"Date": "2015-10-21",
"Domain": "abc.com",
"Processed_at": "10/23/2015 9:47",
"Events": [
{
"Name": "visit",
"Count": "188",
"Value_Aggregations": [
{
"Value": "red",
"Count": "100"
}
]
},
{
"Name": "order_created",
"Count": "159",
"Value_Aggregations": [
{
"Value": "$125",
"Count": "50"
}
]
}
]
}
The mapping of the nested item is:
curl -XPOST localhost:9200/test_agg/nested_evt/_mapping -d '{
"nested_evt":{
"properties":{
"Events": {
"type": "nested"
}
}
}
}'
I am trying to get "Events.Count" and "Events.Value_Aggregations.Count" where Events.Name='Visit' using the query below:
{
"fields" : ["Events.Count","Events.Value_Aggregations.Count"],
"query": {
"filtered": {
"query": {
"match": { "Domain": "abc.com" }
},
"filter": {
"nested": {
"path": "Events",
"query": {
"match": { "Events.Name": "visit" }
}
}
}
}
}
}
Instead of returning the single values
Events.Count=[188] Events.Value_Aggregations.Count=[100]
it gives
Events.Count=[188,159] Events.Value_Aggregations.Count=[100,50]
What is the exact query structure to get my desired output?
So the problem here is that the nested filter you are applying selects parent documents based on attributes of the nested child documents. So ES finds the parent document that matches your query (based on the document's nested children). Then, instead of returning the entire document, since you have specified "fields" it picks out only those fields that you have asked for. Those fields happen to be nested fields, and since the parent document has two nested children, it finds two values each for the fields you specified and returns them. To my knowledge there is no way to return the child documents instead, at least with a nested architecture.
One solution to this problem would be to use the parent/child relationship instead, then you could use a has_parent query in combination with the other filters, against the child type to get what you want. That would probably be a cleaner way to do this, as long as the schema architecture doesn't conflict with your other needs.
However, there is a way to do sort of what you are asking, with your current schema, with a nested aggregation combined with a filter aggregation. It's kind of involved (and slightly ambiguous in this case; see explanation below), but here's the query:
POST /test_index/_search
{
"size": 0,
"query": {
"filtered": {
"query": {
"match": {
"Domain": "abc.com"
}
},
"filter": {
"nested": {
"path": "Events",
"query": {
"match": {
"Events.Name": "visit"
}
}
}
}
}
},
"aggs": {
"nested_events": {
"nested": {
"path": "Events"
},
"aggs": {
"filtered_events": {
"filter": {
"term": {
"Events.Name": "visit"
}
},
"aggs": {
"events_count_terms": {
"terms": {
"field": "Events.Count"
}
},
"value_aggregations_count_terms": {
"terms": {
"field": "Events.Value_Aggregations.Count"
}
}
}
}
}
}
}
}
which returns:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"nested_events": {
"doc_count": 2,
"filtered_events": {
"doc_count": 1,
"value_aggregations_count_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "100",
"doc_count": 1
}
]
},
"events_count_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "188",
"doc_count": 1
}
]
}
}
}
}
}
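To get from this response back to the flat values the question asked for, the bucket keys can be read out on the client side. A small sketch follows, using a trimmed copy of the response above; the helper name bucket_keys is hypothetical.

```python
response = {
    "aggregations": {
        "nested_events": {
            "doc_count": 2,
            "filtered_events": {
                "doc_count": 1,
                "value_aggregations_count_terms": {
                    "buckets": [{"key": "100", "doc_count": 1}]
                },
                "events_count_terms": {
                    "buckets": [{"key": "188", "doc_count": 1}]
                },
            },
        }
    }
}


def bucket_keys(response, agg_name):
    """Return the bucket keys of one terms aggregation under the filter aggregation."""
    filtered = response["aggregations"]["nested_events"]["filtered_events"]
    return [bucket["key"] for bucket in filtered[agg_name]["buckets"]]


print(bucket_keys(response, "events_count_terms"))              # ['188']
print(bucket_keys(response, "value_aggregations_count_terms"))  # ['100']
```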
Caveat: it's not clear to me whether you actually need the "filter": { "nested": { ... } } clause of the "query" in what I've shown here. If this part filters out parent documents in a useful way, then you need it. If your only intention was to select which nested child documents from which to return fields, then it's redundant here since the filter aggregation is taking care of that part.
Here is the code I used to test it:
http://sense.qbox.io/gist/dcc46e50117031de300b6f91c647fe9b729a5283
Here is the parent/child relationship query that produced my desired output:
{
"query": {
"filtered": {
"query": {
"bool": {"must": [
{"term": {"Name": "visit"}}
]}
},
"filter":{
"has_parent": {
"type": "domain_info",
"query" : {
"filtered": {
"query": { "match_all": {}},
"filter" : {
"and": [
{"term": {"Domain": "abc.com"}}
]
}
}
}
}
}
}
}
}

Aggregate only matched nested object values in ElasticSearch

I need to sum only the values on the nested objects that match the query. It looks like ElasticSearch determines the documents matching the query and then sums across all of the nested objects. From the outline below, I want to search on nestedobjects.objtype="A" and get back the sum of objvalue only for the matching nestedobjects; that is, I want to get the value 4. Is this possible? If so, how?
Here is the mapping
{
"myindex": {
"mappings": {
"mytype": {
"properties": {
"nestedobjects": {
"type": "nested",
"include_in_parent": true,
"properties": {
"objtype": {
"type": "string"
},
"objvalue": {
"type": "integer"
}
}
}
}
}
}
}
}
Here are my documents
PUT /myindex/mytype/1
{
"nestedobjects": [
{ "objtype": "A", "objvalue": 1 },
{ "objtype": "B", "objvalue": 2 }
]
}
PUT /myindex/mytype/2
{
"nestedobjects": [
{ "objtype": "A", "objvalue": 3 },
{ "objtype": "B", "objvalue": 3 }
]
}
Here is my query code.
POST allscriptshl7/_search?search_type=count
{
"query": {
"filtered": {
"query": {
"query_string": {
"query": "nestedobjects.objtype:A"
}
}
}
},
"aggregations": {
"my_agg": {
"sum": {
"field": "nestedobjects.objvalue"
}
}
}
}
Since both (outer) documents match the condition that one of their inner documents match the query, both outer documents are returned, and the aggregation is calculated against all of the inner documents belonging to those outer documents. Whew.
Anyway, this seems to do what you're wanting, I think, using filter aggregation:
POST /myindex/_search?search_type=count
{
"aggs": {
"nested_nestedobjects": {
"nested": {
"path": "nestedobjects"
},
"aggs": {
"filtered_nestedobjects": {
"filter": {
"term": {
"nestedobjects.objtype": "a"
}
},
"aggs": {
"my_agg": {
"sum": {
"field": "nestedobjects.objvalue"
}
}
}
}
}
}
}
}
...
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"nested_nestedobjects": {
"doc_count": 4,
"filtered_nestedobjects": {
"doc_count": 2,
"my_agg": {
"value": 4,
"value_as_string": "4.0"
}
}
}
}
}
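The doc_count values and the sum in this response can be reproduced step by step in plain Python. This is only a sketch that mirrors the three aggregation levels, not how Elasticsearch computes them internally. It compares against the stored value "A"; the lowercase "a" in the term filter above presumably reflects the string field being analyzed.

```python
docs = [
    {"nestedobjects": [{"objtype": "A", "objvalue": 1},
                       {"objtype": "B", "objvalue": 2}]},
    {"nestedobjects": [{"objtype": "A", "objvalue": 3},
                       {"objtype": "B", "objvalue": 3}]},
]

# "nested" aggregation step: every inner object is counted on its own.
inner = [obj for doc in docs for obj in doc["nestedobjects"]]
# "filter" aggregation step: keep only the inner objects of type A.
matching = [obj for obj in inner if obj["objtype"] == "A"]
# "sum" aggregation step over the filtered inner objects.
total = sum(obj["objvalue"] for obj in matching)
# What the original query summed: all inner objects of the matching documents.
unfiltered_total = sum(obj["objvalue"] for obj in inner)

print(len(inner), len(matching), total, unfiltered_total)  # 4 2 4 9
```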
Here is some code I used to test it:
http://sense.qbox.io/gist/c1494619ff1bd0394d61f3d5a16cb9dfc229113a
Very well-structured question, by the way.

Elasticsearch facets filters

I've created a facet using elasticsearch but I want to filter it just for specific words.
{
...
"facets": {
"my_facets": {
"terms": {
"field": "description",
"size": 1000
}
}
}
}
And the result contains all the words from description.
{
"my_facet": {
"_type": "terms",
"missing": 0,
"total": 180,
"other": 0,
"terms": [
{
"term": "și",
"count": 1
},
{
"term": "światłowska",
"count": 1
},
{
"term": "łódź",
"count": 1
}
]
}
}
I want my facets to cover only specific words, not every word found in description.
I've already tried to use a match query inside my facet, but it only produces an overall result,
as follows:
{
"query_Facet_test": {
"query": {
"match": {
"description": "word1 word2"
}
}
}
}
and the result I get:
{
"query_Facet_test": {
"_type": "query",
"count": 1
}
}
You can use a bool query like this to get query facets:
{
"query": {
"bool": {
"must": [
{
"match": {
"description": "word1"
}
},
{
"match": {
"description": "word2"
}
}
]
}
},
"facets": {
"my_facets": {
"terms": {
"field": "description",
"size": 1000
}
}
}
}
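Facets were removed in Elasticsearch 2.0 in favour of aggregations. As a different technique from the bool query above, a terms aggregation can be limited to specific words through its include parameter. Here is a minimal sketch of such a request body; the function name terms_for_words and the aggregation name "my_terms" are hypothetical, and depending on the version and mapping, aggregating on an analyzed text field may need extra mapping settings.

```python
import json


def terms_for_words(field, words, size=1000):
    """Build a body whose terms aggregation only counts the listed words."""
    return {
        # No hits needed, only the aggregation result.
        "size": 0,
        "aggs": {
            "my_terms": {
                "terms": {
                    "field": field,
                    "size": size,
                    # Exact terms to keep; all other terms are left out of the buckets.
                    "include": list(words),
                }
            }
        },
    }


body = terms_for_words("description", ["word1", "word2"])
print(json.dumps(body, indent=2))
```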
