Elasticsearch composite aggregate query on nested fields

I have a question about an aggregation on nested objects.
The document looks like this:
{
"features": [{
"key": "key1",
"values": ["A", "B"]
},
{
"key": "key2",
"values": ["C", "D"]
},
{
"key": "key2",
"values": ["E"]
}
]
}
where 'features' is a nested object.
I can aggregate and get the distinct values of key and of values separately, but I need a combined bucket aggregation that returns:
key1 -> A,B
key2 -> C,D,E
Is a composite aggregation required for this, or is another aggregation more appropriate?
Java samples are also welcome!
Thanks!

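You don't really need a composite aggregation for this. A terms aggregation on the key with a terms sub-aggregation on the values, both wrapped in a nested aggregation, should be fine: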
{
"size": 0,
"aggs": {
"nested_aggs": {
"nested": {
"path": "features"
},
"aggs": {
"by_key": {
"terms": {
"field": "features.key.keyword"
},
"aggs": {
"by_values": {
"terms": {
"field": "features.values.keyword"
}
}
}
}
}
}
}
}
This assumes your mapping looks like this:
{
"mappings":{
"properties":{
"features":{
"type":"nested",
"properties":{
"key":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"values":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
}
}
}
}
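On the client side, the response of the request above still has to be collapsed into the key -> values form from the question. Here is a minimal sketch in Python (not Java), assuming the response follows the usual terms-aggregation layout under the aggregation names used above (nested_aggs, by_key, by_values). The sample response is hand-written from the question's document, and collapse_buckets is a hypothetical helper name:

```python
def collapse_buckets(response):
    """Turn the nested terms/terms aggregation response into {key: [values]}."""
    by_key = response["aggregations"]["nested_aggs"]["by_key"]["buckets"]
    return {
        bucket["key"]: sorted(sub["key"] for sub in bucket["by_values"]["buckets"])
        for bucket in by_key
    }


# Hand-written response in the shape returned for the request above
sample_response = {
    "aggregations": {
        "nested_aggs": {
            "doc_count": 3,
            "by_key": {
                "buckets": [
                    {"key": "key2", "doc_count": 2, "by_values": {"buckets": [
                        {"key": "C", "doc_count": 1},
                        {"key": "D", "doc_count": 1},
                        {"key": "E", "doc_count": 1},
                    ]}},
                    {"key": "key1", "doc_count": 1, "by_values": {"buckets": [
                        {"key": "A", "doc_count": 1},
                        {"key": "B", "doc_count": 1},
                    ]}},
                ]
            },
        }
    }
}

grouped = collapse_buckets(sample_response)
print(grouped)
```

Keep in mind that a terms aggregation only returns the top 10 buckets by default, so you may need to set "size" on by_key and by_values if you have more distinct keys or values.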

Related

Nested query on ElasticSearch for Long type (ES 5.0.4)

This is my first question on Stack Overflow, so please excuse any mistakes. I will improve on them in the future.
I am new to Elasticsearch too. I am trying to do an exact match in Elasticsearch (5.0.4), but instead of returning only the exact match, the request returns all the documents present.
I am not sure why it behaves this way.
Here is the mapping:
{
"properties":{
"debug_urls":{
"properties":{
"characteristics":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"url_id":{
"type":"long"
}
},
"type":"nested"
},
"scanId":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
}
This is my request.
{
"query": {
"nested": {
"path": "debug_urls",
"query": {
"match": {
"debug_urls.url_id": 1
}
}
}
}
}
This is the response I received:
{
"took":1,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"hits":{
"total":1,
"max_score":1,
"hits":[
{
"_index":"cust_cca39c0c6c8141008e9411032bbf4d21",
"_type":"debug-urls",
"_id":"AW70h0l72s9qXitMsWgC",
"_score":1,
"_source":{
"scan_id":"n_a0a523fb5c81435fb79c34c624c7fbd6",
"debug_urls":[
{
"url_id":1,
"characteristics":[
"FORM",
"EXTERNAL_SCRIPT",
"INLINE_SCRIPT"
]
},
{
"url_id":2,
"characteristics":[
"EXTERNAL_SCRIPT",
"INLINE_SCRIPT"
]
},
{
"url_id":3,
"characteristics":[
"EXTERNAL_SCRIPT",
"INLINE_SCRIPT"
]
},
{
"url_id":4,
"characteristics":[
"EXTERNAL_SCRIPT",
"INLINE_SCRIPT"
]
},
{
"url_id":5,
"characteristics":[
"EXTERNAL_SCRIPT",
"INLINE_SCRIPT"
]
},
{
"url_id":6,
"characteristics":[
"EXTERNAL_SCRIPT",
"INLINE_SCRIPT"
]
},
{
"url_id":7,
"characteristics":[
"EXTERNAL_SCRIPT",
"INLINE_SCRIPT"
]
}
]
}
}
]
}
}
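A nested query returns the whole parent document, including all of its nested objects, as soon as at least one nested object matches. If you only want to see the nested documents that match the criteria, you can leverage nested inner_hits: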
{
"_source":["scan_id"], <--- add this line
"query": {
"nested": {
"path": "debug_urls",
"query": {
"match": {
"debug_urls.url_id": 1
}
},
"inner_hits": {} <--- add this line
}
}
}
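To read the matching nested objects back out, you look under inner_hits in each hit rather than under _source. A rough sketch in Python, assuming the inner hits are keyed by the nested path (debug_urls), which is the default when no name is given. The sample response is hand-written from the data in the question, and matching_nested_docs is a hypothetical helper name:

```python
def matching_nested_docs(response, path="debug_urls"):
    """Collect (parent _id, nested _source) pairs from the inner_hits of each hit."""
    pairs = []
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(path, {}).get("hits", {}).get("hits", [])
        for nested_hit in inner:
            pairs.append((hit["_id"], nested_hit["_source"]))
    return pairs


# Hand-written response: one parent hit with a single matching nested object
sample_response = {"hits": {"total": 1, "hits": [{
    "_id": "AW70h0l72s9qXitMsWgC",
    "_source": {"scan_id": "n_a0a523fb5c81435fb79c34c624c7fbd6"},
    "inner_hits": {"debug_urls": {"hits": {"total": 1, "hits": [
        {"_nested": {"field": "debug_urls", "offset": 0},
         "_source": {"url_id": 1,
                     "characteristics": ["FORM", "EXTERNAL_SCRIPT", "INLINE_SCRIPT"]}}
    ]}}},
}]}}

matches = matching_nested_docs(sample_response)
for parent_id, nested in matches:
    print(parent_id, nested["url_id"], nested["characteristics"])
```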

Elastic search query to return documents matching all elements in an array

I have a structure similar to this:
Document 1:
nestedobject: {
uniqueid: "12345",
field: [ {id: 1,
color: blue,
fruit:banana},
{id: 2,
color: red,
fruit:apple},
]
}
Document 2: (in same index)
nestedobject: {
uniqueid:23456,
field: [ {id: 3,
color: blue,
fruit:banana},
{id: 4,
color: blue,
fruit:banana},
]
}
The field mappings are:
{"mappings":
"nestedobject":{
"properties":{
"uniqueid":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
"field":{
"type":"nested",
"id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"color":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"fruit":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
}
}
}
Now, I query this index of 2 documents, and I want only the document in which all the elements of the field array have color blue and fruit banana, not just at least one element.
Right now, the query below returns both documents, because the first element of the first document matches.
How can I make this work?
{
"query": {
"nested" : {
"path" : "nestedobject.field",
"query" : {
"bool" : {
"must" : [
{ "match" : {"nestedobject.field.color" : "blue"} },
{ "match" : {"nestedobject.field.fruit" : "banana"}}
]
}
}
}
}
}
Change your query to the below:
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "field",
"query": {
"match":{
"field.color": "blue"
}
}
}
},
{
"nested": {
"path": "field",
"query": {
"match":{
"field.fruit": "banana"
}
}
}
}
]
}
}
}
Note that there are two nested queries inside a must clause.
Also note that for an exact match you should use term queries on the keyword sub-fields, as shown below:
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "field",
"query": {
"term": {
"field.color.keyword": "blue"
}
}
}
},
{
"nested": {
"path": "field",
"query": {
"term": {
"field.fruit.keyword": "banana"
}
}
}
}
]
}
}
}
Hope that helps.
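If you need the stricter reading from the question (every element of field must be blue and banana), one commonly used pattern is a double negation: require at least one fully matching element, and exclude any parent that has an element which does not fully match. The following is a sketch in Python under that assumption, not the query given above. all_elements_query and parent_matches are hypothetical helper names, and parent_matches only mirrors the intended logic in plain Python so it can be checked against the two documents from the question:

```python
import json


def all_elements_query(path, conditions):
    """Build a query matching parents whose nested elements ALL satisfy `conditions`."""
    terms = [{"term": {f"{path}.{name}.keyword": value}}
             for name, value in conditions.items()]
    element_matches = {"bool": {"must": terms}}
    return {
        "query": {
            "bool": {
                # at least one nested element satisfies every condition
                "must": [{"nested": {"path": path, "query": element_matches}}],
                # no nested element fails the combined condition
                "must_not": [{"nested": {
                    "path": path,
                    "query": {"bool": {"must_not": [element_matches]}},
                }}],
            }
        }
    }


def parent_matches(elements, conditions):
    """Plain-Python mirror of the query logic, for checking the idea locally."""
    ok = [all(e.get(k) == v for k, v in conditions.items()) for e in elements]
    return any(ok) and all(ok)


conditions = {"color": "blue", "fruit": "banana"}
doc1 = [{"id": 1, "color": "blue", "fruit": "banana"},
        {"id": 2, "color": "red", "fruit": "apple"}]
doc2 = [{"id": 3, "color": "blue", "fruit": "banana"},
        {"id": 4, "color": "blue", "fruit": "banana"}]

query = all_elements_query("field", conditions)
print(json.dumps(query, indent=2))
print(parent_matches(doc1, conditions), parent_matches(doc2, conditions))
```

The inner bool with only a must_not selects the nested elements that are not (blue and banana); the outer must_not then drops every parent that has such an element.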

Elasticsearch [match] unknown token [START_OBJECT] after [created_utc]

I am learning how to use Elasticsearch with the 2006 dataset of Reddit comments from pushshift.io.
created_utc is the field holding the time a comment was created.
I am trying to get all the posts within a certain time range. I googled a bit and found out that I need to use the "range" keyword.
This is my query right now:
{
"query": {
"match" : {
"range": {
"created_utc": {
"gte": "1/1/2006",
"lte": "31/1/2006",
"format": "dd/MM/yyyy"
}
}
}
}
}
I then tried using a bool query so that I can combine the time range with a must_not on edited = False (edited being the boolean field that tells me whether a post has been edited or not):
{
"query": {
"bool" : {
"must" : {
"range" : {
"created_utc": {
"gte" : "01/12/2006", "lte": "31/12/2006", "format": "dd/MM/yyyy"
}
}
},
"must_not": {
"edited": False
}
}
}
}
However, this gave me another error that I can't figure out:
[edited] query malformed, no start_object after query name
I'd appreciate it if anyone could help me out with this, thanks!
Here is my mapping for the comment if it helps:
{
"comment":{
"properties":{
"author":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"body":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"controversiality":{
"type":"long"
},
"created_utc":{
"type":"date"
},
"edited":{
"type":"boolean"
},
"gilded":{
"type":"long"
},
"id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"link_id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"parent_id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"score":{
"type":"long"
},
"subreddit":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
}
}
If you want to get all the posts within a time range, you need a range query. The problem with your query is that you put range inside a match query, which Elasticsearch does not allow; range is a query type of its own. Your query should look like this:
{
"query": {
"range": {
"created_utc": {
"gte": 1136074029,
"lte": 1136076410
}
}
}
}
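Provided that the created_utc field is stored as an epoch timestamp, you must also query it with epoch values.
In your second query, must_not contains the field directly, without a query type such as match or term around it, which is what the error message is complaining about. Also, the JSON boolean literal is lowercase false. The query for posts within a range, excluding those where edited is false, becomes: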
{
"query": {
"bool": {
"must": [
{
"range": {
"created_utc": {
"gte": 1136074029,
"lte": 1136076410
}
}
}
],
"must_not": [
{
"match": {
"edited": false
}
}
]
}
}
}
Note: If your created_utc is stored in dd/MM/yyyy format, then the values in your query should follow that format strictly, i.e. use 01/01/2006 instead of 1/1/2006.
Hope this helps!
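If you would rather compute the epoch bounds than type them by hand, here is a small Python sketch. It assumes created_utc holds epoch seconds in UTC (as in the numbers above); whether Elasticsearch reads plain numbers as seconds or milliseconds depends on the date format in your mapping, so check that first. month_bounds_epoch_seconds and edited_in_month_query are hypothetical helper names:

```python
import json
from datetime import datetime, timezone


def month_bounds_epoch_seconds(year, month):
    """Return (start, end) epoch seconds for a calendar month in UTC; end is exclusive."""
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    if month == 12:
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    return int(start.timestamp()), int(end.timestamp())


def edited_in_month_query(year, month):
    """Range on created_utc for the month, excluding documents where edited is false."""
    gte, lt = month_bounds_epoch_seconds(year, month)
    return {
        "query": {
            "bool": {
                "must": [{"range": {"created_utc": {"gte": gte, "lt": lt}}}],
                "must_not": [{"term": {"edited": False}}],
            }
        }
    }


body = json.dumps(edited_in_month_query(2006, 1), indent=2)
print(body)  # json.dumps writes Python's False as the JSON literal false
```

Using gte with an exclusive lt on the first second of the next month avoids having to work out the last second of the month.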

Elasticsearch - Cardinality over Full Field Value

I have a document that looks like this:
{
"_id":"some_id_value",
"_source":{
"client":{
"name":"x"
},
"project":{
"name":"x November 2016"
}
}
}
I am attempting to perform a query that will fetch me the count of unique project names for each client. For this, I am using a query with cardinality over the project.name. I am sure that there are only 4 unique project names for this particular client. However, when I run my query, I get a count of 5, which I know is wrong.
The project names all contain the name of the client. For instance, if a client is "X", project names will be "X Testing November 2016", or "X Jan 2016", etc. I don't know if that is a consideration.
This is the mapping for the document type
{
"mappings":{
"vma_docs":{
"properties":{
"client":{
"properties":{
"contact":{
"type":"string"
},
"name":{
"type":"string"
}
}
},
"project":{
"properties":{
"end_date":{
"format":"yyyy-MM-dd",
"type":"date"
},
"project_type":{
"type":"string"
},
"name":{
"type":"string"
},
"project_manager":{
"index":"not_analyzed",
"type":"string"
},
"start_date":{
"format":"yyyy-MM-dd",
"type":"date"
}
}
}
}
}
}
}
This is my search query
{
"fields":[
"client.name",
"project.name"
],
"query":{
"bool":{
"must":{
"match":{
"client.name":{
"operator":"and",
"query":"ABC systems"
}
}
}
}
},
"aggs":{
"num_projects":{
"cardinality":{
"field":"project.name"
}
}
},
"size":5
}
These are the results I get (I have only posted 2 hits for the sake of brevity). Note that the num_projects aggregation returns 5, but it should return 4, which is the total number of projects.
{
"hits":{
"hits":[
{
"_score":5.8553367,
"_type":"vma_docs",
"_id":"AVTMIM9IBwwoAW3mzgKz",
"fields":{
"project.name":[
"ABC"
],
"client.name":[
"ABC systems Pvt Ltd"
]
},
"_index":"vma"
},
{
"_score":5.8553367,
"_type":"vma_docs",
"_id":"AVTMIM9YBwwoAW3mzgK2",
"fields":{
"project.name":[
"ABC"
],
"client.name":[
"ABC systems Pvt Ltd"
]
},
"_index":"vma"
}
],
"total":18,
"max_score":5.8553367
},
"_shards":{
"successful":5,
"failed":0,
"total":5
},
"took":4,
"aggregations":{
"num_projects":{
"value":5
}
},
"timed_out":false
}
FYI: The project names are ABC, ABC Nov 2016, ABC retest November, ABC Mobile App
You need the following mapping for your project.name field:
{
"mappings": {
"vma_docs": {
"properties": {
"client": {
"properties": {
"contact": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"project": {
"properties": {
"end_date": {
"format": "yyyy-MM-dd",
"type": "date"
},
"project_type": {
"type": "string"
},
"name": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"project_manager": {
"index": "not_analyzed",
"type": "string"
},
"start_date": {
"format": "yyyy-MM-dd",
"type": "date"
}
}
}
}
}
}
}
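This adds a sub-field called raw: the value put in project.name is also stored in project.name.raw, but untouched (not tokenized or analyzed). On the analyzed project.name field, the cardinality aggregation counts distinct tokens rather than distinct full names, which is why the count was off. Existing documents have to be reindexed for the new sub-field to be populated. The query you then need to use is: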
{
"fields": [
"client.name",
"project.name"
],
"query": {
"bool": {
"must": {
"match": {
"client.name": {
"operator": "and",
"query": "ABC systems"
}
}
}
}
},
"aggs": {
"num_projects": {
"cardinality": {
"field": "project.name.raw"
}
}
},
"size": 5
}
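To see why the analyzed field gives a different count, here is a rough Python illustration using the four project names from the question. rough_tokens is a made-up stand-in for the analyzer (lowercase, split on non-word characters), not the real Elasticsearch analysis chain, so the token count it produces will not necessarily equal the 5 reported above; the point is only that tokens and whole names are different things to count:

```python
import re

project_names = ["ABC", "ABC Nov 2016", "ABC retest November", "ABC Mobile App"]


def rough_tokens(text):
    """Very rough approximation of an analyzer: lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


# What cardinality on a not_analyzed field counts: distinct whole values
unique_whole_values = set(project_names)

# What cardinality on an analyzed field counts: distinct tokens
unique_tokens = {tok for name in project_names for tok in rough_tokens(name)}

print(len(unique_whole_values))
print(sorted(unique_tokens))
```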

Multi-level nesting in elastic search

I have the structure below (a small part of a very large Elasticsearch document):
sample: {
{
"md5sum":"4002cbda13066720513d1c9d55dba809",
"id":1,
"sha256sum":"1c6e77ec49413bf7043af2058f147fb147c4ee741fb478872f072d063f2338c5",
"sha1sum":"ba1e6e9a849fb4e13e92b33d023d40a0f105f908",
"created_at":"2016-02-02T14:25:19+00:00",
"updated_at":"2016-02-11T20:43:22+00:00",
"file_size":188416,
"type":{
"name":"EXE"
},
"tags":[
],
"sampleSources":[
{
"filename":"4002cbda13066720513d1c9d55dba809",
"source":{
"name":"default"
}
},
{
"filename":"4002cbda13066720332513d1c9d55dba809",
"source":{
"name":"default"
}
}
]
}
}
I would like to filter by the 'name' contained within sample.sampleSources.source using Elasticsearch.
I tried the query below:
curl -XGET "http://localhost:9200/app/sample/_search?pretty" -d {query}
where, {query} is
{
"query":{
"nested":{
"path":"sample.sampleSources",
"query":{
"nested":{
"path":"sample.sampleSources.source",
"query":{
"match":{
"sample.sampleSources.source.name":"default"
}
}
}
}
}
}
}
However, it does not return any results. In some cases the nesting in my document is even deeper than this. Can someone please guide me on how to formulate this query so that it works for all cases?
EDIT 1
Mappings:
{
"app":{
"mappings":{
"sample":{
"sampleSources":{
"type":"nested",
"properties":{
"filename":{
"type":"string"
},
"source":{
"type":"nested",
"properties":{
"name":{
"type":"string"
}
}
}
}
}
}
EDIT 2
The solution posted by Waldemar Neto below works well for a match query, but not for a wildcard or a regexp query.
Can you please guide me? I need the wildcard and regexp queries to work as well.
I tried this here using your examples and it works fine.
Take a look at my data.
mapping:
PUT /app
{
"mappings": {
"sample": {
"properties": {
"sampleSources": {
"type": "nested",
"properties": {
"source": {
"type": "nested"
}
}
}
}
}
}
}
indexed data
POST /app/sample
{
"md5sum": "4002cbda13066720513d1c9d55dba809",
"id": 1,
"sha256sum": "1c6e77ec49413bf7043af2058f147fb147c4ee741fb478872f072d063f2338c5",
"sha1sum": "ba1e6e9a849fb4e13e92b33d023d40a0f105f908",
"created_at": "2016-02-02T14:25:19+00:00",
"updated_at": "2016-02-11T20:43:22+00:00",
"file_size": 188416,
"type": {
"name": "EXE"
},
"tags": [],
"sampleSources": [
{
"filename": "4002cbda13066720513d1c9d55dba809",
"source": {
"name": "default"
}
},
{
"filename": "4002cbda13066720332513d1c9d55dba809",
"source": {
"name": "default"
}
}
]
}
Search query
GET /app/sample/_search
{
"query": {
"nested": {
"path": "sampleSources.source",
"query": {
"match": {
"sampleSources.source.name": "default"
}
}
}
}
}
Example using wildcard
GET /app/sample/_search
{
"query": {
"nested": {
"path": "sampleSources.source",
"query": {
"wildcard": {
"sampleSources.source.name": {
"value": "*aul*"
}
}
}
}
}
}
The only difference I saw was in the path: you don't need to include sample (the type) in the nested path, only the inner objects.
Test it and give me feedback.
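For documents where the nesting goes deeper, the same idea can be generated instead of written by hand. Below is a sketch in Python that wraps a leaf query in one nested clause per nested level, outermost first. The queries above use a single nested clause on the innermost path, which worked in my test; the explicit chain is an alternative form that you should verify against your own version and mapping. nested_chain is a hypothetical helper name:

```python
import json


def nested_chain(nested_paths, leaf_query):
    """Wrap `leaf_query` in nested clauses; `nested_paths` is ordered outermost first."""
    query = leaf_query
    # Build from the inside out so the first path ends up as the outermost clause
    for path in reversed(nested_paths):
        query = {"nested": {"path": path, "query": query}}
    return {"query": query}


leaf = {"wildcard": {"sampleSources.source.name": {"value": "*aul*"}}}
body = nested_chain(["sampleSources", "sampleSources.source"], leaf)
print(json.dumps(body, indent=2))
```

Note that the paths do not include the type name (sample). Also, wildcard and regexp patterns are not analyzed, so against an analyzed string field they have to be written in the form of the indexed terms, which usually means lowercase.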
