Elasticsearch get the latest documents, grouped by multiple fields

Similarly to "Query the latest document of each type on Elasticsearch", I have a set of records in ES. For the sake of the example, let's say they are news articles as well, each with this mapping:
"news": {
"properties": {
"source": { "type": "string", "index": "not_analyzed" },
"headline": { "type": "object" },
"timestamp": { "type": "date", "format": "date_hour_minute_second_millis" },
"user": { "type": "string", "index": "not_analyzed" },
"newspaper": { "type": "string", "index": "not_analyzed"}
}
}
I am able to get the latest 'news article' per user with:
"size": 0,
"aggs": {
"sources" : {
"terms" : {
"field" : "user"
},
"aggs": {
"latest": {
"top_hits": {
"size": 1,
"sort": {
"timestamp": "desc"
}
}
}
}
}
}
However, what I am trying to achieve is to get the latest article per user, per newspaper, and I cannot get it quite right.
e.g.
John, NY Times, Title1
John, BBC, Title2
Jane, NY Times, Title3
etc.

You can add another terms sub-aggregation for the newspaper field, like this:
"size": 0,
"aggs": {
"sources" : {
"terms" : {
"field" : "user"
},
"aggs": {
"newspaper": {
"terms": {
"field": "newspaper"
},
"aggs": {
"latest": {
"top_hits": {
"size": 1,
"sort": {
"timestamp": "desc"
}
}
}
}
}
}
}
}
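
As a rough illustration of how a client might use this, here is a minimal Python sketch that builds the same request body as a dict and flattens the response into (user, newspaper, document id) rows. The function names, the `bucket_size` parameter, and `sample_response` are hypothetical; the sample only mimics the shape of a terms/top_hits response and is not real output. Note that a terms aggregation returns only the top 10 buckets by default, so `size` may need raising to cover every user and newspaper.

```python
def latest_per_user_and_newspaper_body(bucket_size=10):
    """Build the body: terms(user) > terms(newspaper) > top_hits(size=1)."""
    return {
        "size": 0,
        "aggs": {
            "sources": {
                "terms": {"field": "user", "size": bucket_size},
                "aggs": {
                    "newspaper": {
                        "terms": {"field": "newspaper", "size": bucket_size},
                        "aggs": {
                            "latest": {
                                "top_hits": {
                                    "size": 1,
                                    "sort": [{"timestamp": {"order": "desc"}}],
                                }
                            }
                        },
                    }
                },
            }
        },
    }


def flatten_latest(response):
    """Return one (user, newspaper, document id) row per innermost bucket."""
    rows = []
    for user_bucket in response["aggregations"]["sources"]["buckets"]:
        for paper_bucket in user_bucket["newspaper"]["buckets"]:
            hits = paper_bucket["latest"]["hits"]["hits"]
            if hits:
                rows.append((user_bucket["key"], paper_bucket["key"], hits[0]["_id"]))
    return rows


def _bucket(key, doc_id):
    # Helper for the hand-written sample below (hypothetical ids).
    return {"key": key, "latest": {"hits": {"hits": [{"_id": doc_id}]}}}


# Hand-written sample with the response shape only; not real output.
sample_response = {
    "aggregations": {
        "sources": {
            "buckets": [
                {"key": "John", "newspaper": {"buckets": [_bucket("NY Times", "1"), _bucket("BBC", "2")]}},
                {"key": "Jane", "newspaper": {"buckets": [_bucket("NY Times", "3")]}},
            ]
        }
    }
}

for row in flatten_latest(sample_response):
    print(row)
```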

Related

Elastic Search calculate the difference of two set

I am new to Elasticsearch and need to calculate the difference of two sets.
Here is the mapping of the index:
{
"mappings": {
"properties": {
"Date": { "type": "date", "format": "yyyyMMdd"},
"areaID": { "type": "keyword" },
"deviceID": { "type": "keyword" }
}
}
}
The date range is from October to November.
I want a count of all distinct 'deviceID' values that are new in November, grouped by 'areaID'.
I have no idea how to express this in ES syntax. Could anyone give me some hints?
Thanks so much!

You can use Elasticsearch aggregations to group by areaID.
Here is an example for the Kibana console:
GET your_index/_search
{
"size": 0,
"query": {
"range": {
"Date": {
"gte": "20201001",
"lte": "20201130"
}
}
},
"aggs": {
"area_id": {
"terms": {
"field": "areaID"
},
"aggs": {
"Date": {
"date_range": {
"field": "Date",
"ranges": [
{
"from": "20201101",
"to": "20201201"
}
]
},
"aggs": {
"device_id": {
"terms": {
"field": "deviceID"
}
}
}
}
}
}
}
}
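
The query above lists the devices seen in November per area; it does not by itself subtract the devices already seen in October. One way to finish the job is to take the difference on the client. The following is only a sketch under assumptions: the function names and the sample records are hypothetical, dates are strings in the mapping's yyyyMMdd format, and a device counts as "new" for an area if it appears there in November but not in October.

```python
from collections import defaultdict


def devices_by_area(records, month_prefix):
    """Collect distinct deviceIDs per areaID for dates starting with month_prefix (yyyyMM)."""
    grouped = defaultdict(set)
    for rec in records:
        if rec["Date"].startswith(month_prefix):
            grouped[rec["areaID"]].add(rec["deviceID"])
    return grouped


def new_device_counts(records, previous="202010", current="202011"):
    """Count devices per area that appear in `current` but not in `previous`."""
    before = devices_by_area(records, previous)
    after = devices_by_area(records, current)
    return {
        area: len(devices - before.get(area, set()))
        for area, devices in after.items()
    }


# Hypothetical sample records, for illustration only.
records = [
    {"Date": "20201005", "areaID": "A1", "deviceID": "d1"},
    {"Date": "20201015", "areaID": "A1", "deviceID": "d4"},
    {"Date": "20201103", "areaID": "A1", "deviceID": "d1"},
    {"Date": "20201110", "areaID": "A1", "deviceID": "d2"},
    {"Date": "20201102", "areaID": "A2", "deviceID": "d5"},
    {"Date": "20201120", "areaID": "A2", "deviceID": "d3"},
    {"Date": "20201121", "areaID": "A2", "deviceID": "d3"},
]

print(new_device_counts(records))
```

In practice the two per-month sets would come from the aggregation buckets rather than from raw records; the set subtraction stays the same.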

Nested aggregation in nested field?

I am new to elasticsearch and don't know a lot about aggregations but I have this ES6 mapping:
{
"mappings": {
"test": {
"properties": {
"id": {
"type": "integer"
},
"countries": {
"type": "nested",
"properties": {
"global_id": {
"type": "keyword"
},
"name": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
},
"areas": {
"type": "nested",
"properties": {
"global_id": {
"type": "keyword"
},
"name": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"parent_global_id": {
"type": "keyword"
}
}
}
}
}
}
}
How can I get all documents grouped by areas and then by countries? The documents have to be returned in full, not just the nested documents. Is this even possible?

1) Aggregation _search query:
First aggregate by area, using a nested aggregation with the path because the field is nested. Then use reverse_nested to return to the root document, and add a second nested aggregation for the countries.
{
"size": 0,
"aggs": {
"agg_areas": {
"nested": {
"path": "areas"
},
"aggs": {
"areas_name": {
"terms": {
"field": "areas.name.raw"
},
"aggs": {
"agg_reverse": {
"reverse_nested": {},
"aggs": {
"agg_countries": {
"nested": {
"path": "countries"
},
"aggs": {
"countries_name": {
"terms": {
"field": "countries.name.raw"
}
}
}
}
}
}
}
}
}
}
}
}
2) Retrieve the documents:
Add a top_hits aggregation inside your aggregation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
top_hits is slow, so read the documentation and adjust size and sort to your context.
...
"terms": {
"field": "areas.name.raw"
},
"aggregations": {
"hits": {
"top_hits": { "size": 100}
}
},
...
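
To make the intended grouping concrete, here is a small pure-Python emulation of what the areas > reverse_nested > countries buckets contain. It is only a sketch of the result, not of how Elasticsearch executes the aggregation; the function name and the sample documents are hypothetical.

```python
from collections import defaultdict


def group_by_area_then_country(docs):
    """Map area name -> country name -> list of root document ids."""
    grouped = defaultdict(lambda: defaultdict(list))
    for doc in docs:
        # Sets make sure a document is counted once per (area, country) pair.
        area_names = {area["name"] for area in doc.get("areas", [])}
        country_names = {country["name"] for country in doc.get("countries", [])}
        for area in area_names:
            for country in country_names:
                grouped[area][country].append(doc["id"])
    return {area: dict(countries) for area, countries in grouped.items()}


# Hypothetical documents shaped like the mapping above.
docs = [
    {
        "id": 1,
        "countries": [{"global_id": "c1", "name": "France"}],
        "areas": [
            {"global_id": "a1", "name": "Paris", "parent_global_id": "c1"},
            {"global_id": "a2", "name": "Lyon", "parent_global_id": "c1"},
        ],
    },
    {
        "id": 2,
        "countries": [
            {"global_id": "c1", "name": "France"},
            {"global_id": "c2", "name": "Spain"},
        ],
        "areas": [{"global_id": "a1", "name": "Paris", "parent_global_id": "c1"}],
    },
]

print(group_by_area_then_country(docs))
```

Because every document carries all of its countries, a document with several countries shows up under each of them for every one of its areas, which is also what the nested/reverse_nested combination produces.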

Group by a part of string from a field rather than the full field in Elasticsearch

Here is the structure of my index:
[
{
"Id":"1",
"Path":"/Series/Current/SerieA/foo/foo",
"PlayCount":100
},
{
"Id":"2",
"Path":"/Series/Current/SerieA/bar/foo",
"PlayCount":1000
},
{
"Id":"3",
"Path":"/Series/Current/SerieA/bar/bar",
"PlayCount":50
},
{
"Id":"4",
"Path":"/Series/Current/SerieB/bla/bla",
"PlayCount":300
},
{
"Id":"5",
"Path":"/Series/Current/SerieB/goo/boo",
"PlayCount":200
},
{
"Id":"6",
"Path":"/Series/Current/SerieC/foo/zoo",
"PlayCount":100
}
]
I'd like to run an aggregation that gives me the sum of "PlayCount" for each series, like:
[
{
"key":"serieA",
"TotalPlayCount":1150
},
{
"key":"serieB",
"TotalPlayCount":500
},
{
"key":"serieC",
"TotalPlayCount":100
}
]
This is how I tried to do it, but the query fails because this is not the proper way:
{
"size": 0,
"query":{
"filtered":{
"query":{
"regexp":{
"Path":"/Series/Current/.*"
}
}
}
},
"aggs":{
"play_count_for_current_series":{
"terms": {
"field": "Path",
"regexp": "/Series/Current/([^/]+)"
},
"aggs":{
"Total_play": { "sum": { "field": "PlayCount" } }
}
}
}
}
Is there a way to do it?

My suggestion is as follows:
DELETE test
PUT /test
{
"settings": {
"analysis": {
"filter": {
"my_special_filter": {
"type": "pattern_capture",
"preserve_original": false,
"patterns": [
"/Series/Current/([^/]+)"
]
}
},
"analyzer": {
"my_special_analyzer": {
"tokenizer": "whitespace",
"filter": [
"my_special_filter"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"Path": {
"type": "string",
"fields": {
"for_aggregations": {
"type": "string",
"analyzer": "my_special_analyzer"
},
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
Create a special analyzer that uses a pattern_capture filter to keep only the terms you are interested in. Because I didn't want to change your current mapping for that field, I added a fields section with a sub-field that uses this special analyzer. I also added a raw sub-field, which is not_analyzed and helps with the query itself.
POST test/test/_bulk
{"index":{}}
{"Id":"1","Path":"/Series/Current/SerieA/foo/foo","PlayCount":100}
{"index":{}}
{"Id":"2","Path":"/Series/Current/SerieA/bar/foo","PlayCount":1000}
{"index":{}}
{"Id":"3","Path":"/Series/Current/SerieA/bar/bar","PlayCount":50}
{"index":{}}
{"Id":"4","Path":"/Series/Current/SerieB/bla/bla","PlayCount":300}
{"index":{}}
{"Id":"5","Path":"/Series/Current/SerieB/goo/boo","PlayCount":200}
{"index":{}}
{"Id":"6","Path":"/Series/Current/SerieC/foo/zoo","PlayCount":100}
{"index":{}}
{"Id":"7","Path":"/Sersdasdies/Curradent/SerieC/foo/zoo","PlayCount":100}
For the search itself, the aggregation does not need a regular expression, because it uses the sub-field that contains only the SerieX terms you need. The regexp query on Path.raw restricts the search to paths under /Series/Current/.
GET /test/test/_search
{
"size": 0,
"query": {
"filtered": {
"query": {
"regexp": {
"Path.raw": "/Series/Current/.*"
}
}
}
},
"aggs": {
"play_count_for_current_series": {
"terms": {
"field": "Path.for_aggregations"
},
"aggs": {
"Total_play": {
"sum": {
"field": "PlayCount"
}
}
}
}
}
}
And the result is:
"play_count_for_current_series": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "SerieA",
"doc_count": 3,
"Total_play": {
"value": 1150
}
},
{
"key": "SerieB",
"doc_count": 2,
"Total_play": {
"value": 500
}
},
{
"key": "SerieC",
"doc_count": 1,
"Total_play": {
"value": 100
}
}
]
}
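
The expected totals can be cross-checked offline. The sketch below applies the same capture pattern with Python's `re` module to the seven bulk documents above and sums PlayCount per captured series name. It imitates the effect of the pattern_capture sub-field plus the regexp query, not the analyzer itself; the function and constant names are hypothetical.

```python
import re
from collections import defaultdict

# Same pattern as in the pattern_capture filter above.
SERIES_PATTERN = re.compile(r"/Series/Current/([^/]+)")


def total_play_count_by_series(docs):
    """Sum PlayCount per series name captured from the start of Path."""
    totals = defaultdict(int)
    for doc in docs:
        match = SERIES_PATTERN.match(doc["Path"])
        if match:  # paths outside /Series/Current/ are skipped
            totals[match.group(1)] += doc["PlayCount"]
    return dict(totals)


# The documents from the bulk request above.
DOCS = [
    {"Id": "1", "Path": "/Series/Current/SerieA/foo/foo", "PlayCount": 100},
    {"Id": "2", "Path": "/Series/Current/SerieA/bar/foo", "PlayCount": 1000},
    {"Id": "3", "Path": "/Series/Current/SerieA/bar/bar", "PlayCount": 50},
    {"Id": "4", "Path": "/Series/Current/SerieB/bla/bla", "PlayCount": 300},
    {"Id": "5", "Path": "/Series/Current/SerieB/goo/boo", "PlayCount": 200},
    {"Id": "6", "Path": "/Series/Current/SerieC/foo/zoo", "PlayCount": 100},
    {"Id": "7", "Path": "/Sersdasdies/Curradent/SerieC/foo/zoo", "PlayCount": 100},
]

print(total_play_count_by_series(DOCS))
```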

Elasticsearch terms aggregation on a not analyzed field with filters

I have a not analyzed field on my index:
"city": { "type": "string", "index": "not_analyzed" }
I have an aggregation like the following:
"aggs": {
"city": {
"terms": {
"field": "city"
}
}
}
that gives me an output like this:
"aggregations": {
"city": {
"doc_count_error_upper_bound": 51,
"sum_other_doc_count": 12478,
"buckets": [
{
"key": "New York",
"doc_count": 28420
},
{
"key": "London",
"doc_count": 23456
},
{
"key": "São Paulo",
"doc_count": 12727
}
]
}
}
I need to add a match_phrase_prefix query before processing the aggregation to filter my results based on a user text, like this:
{
"size": 0,
"query": {
"match_phrase_prefix": {
"city": "sao"
}
},
"aggs": {
"city": {
"terms": {
"field": "city"
}
}
}
}
and the result is... nothing!
"aggregations": {
"city": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": []
}
}
I was expecting an aggregation result on São Paulo city. Obviously the problem is that my field should have lowercase and asciifolding filters to have a match (São/sao), but I can't make my field analyzed because I don't want to have aggregation results like São, Paulo, New, York (that's what happens on analyzed fields).
What can I do? I tried a lot of combinations with mapping/query/aggs but I can't get it to work.
Any help will be appreciated.

Since the field is not_analyzed, the query terms are case-sensitive.
You could use a multi-field mapping on city with analyzed and not_analyzed fields.
Example:
PUT <index>/<type>/_mapping
{
"properties": {
"city": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
POST <index>/<type>/_search
{
"size": 0,
"query": {
"match_phrase_prefix": {
"city": "Sao"
}
},
"aggs": {
"city": {
"terms": {
"field": "city.raw"
}
}
}
}
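
As the question points out, matching "sao" against "São" also needs accent folding, not just lowercasing. The Python sketch below illustrates the idea behind the multi-field approach: match on a folded, lowercased form of the value, but count the untouched original. It is an approximation under assumptions: Unicode decomposition stands in for an asciifolding filter, only a single-word prefix is handled, and the function names and sample list are hypothetical.

```python
import unicodedata
from collections import Counter


def fold(text):
    """Lowercase and strip combining accents (rough stand-in for lowercase + asciifolding)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def city_counts(cities, user_text):
    """Count original city values that have a word starting with the folded user text."""
    prefix = fold(user_text)
    matching = [
        city
        for city in cities
        if any(token.startswith(prefix) for token in fold(city).split())
    ]
    return dict(Counter(matching))


# Hypothetical sample values.
cities = ["New York", "São Paulo", "London", "São Paulo"]
print(city_counts(cities, "sao"))
```

The folded form plays the role of the analyzed "city" field and the original string plays the role of "city.raw", so the buckets keep whole names such as "São Paulo" instead of individual tokens.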

"city": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
With this mapping, "city" is analyzed and "city.keyword" is not analyzed.

Using inner_hits inside an aggregation

I have a collection of documents which all contain an array of nested objects with important data. I want to do an aggregation on these which returns the first document, the last document, and all of the nested objects in that group. I can achieve everything in that list except for the nested objects.
Mapping:
"instances": {
"properties": {
"aggField": {
"type": "string",
"index": "not_analyzed"
},
"id": {
"type": "integer"
},
"nestedObjs": {
"type": "nested",
"properties": {
"key": {
"type": "string",
"index": "not_analyzed"
},
"value": {
"type": "integer"
}
}
},
"timestamp": {
"type": "date",
"format": "dateOptionalTime"
}
}
}
Query:
{
"size" : 0,
"aggs" : {
"agg-buckets" : {
"terms" : {
"field" : "aggField",
"size" : 10
},
"aggs": {
"last-report": {
"top_hits": {
"sort": [
{
"timestamp": {
"order": "desc"
}
}
],
"size": 1
}
},
"first-report": {
"top_hits": {
"sort": [
{
"timestamp": {
"order": "asc"
}
}
],
"size": 1
}
},
"nested-objs": {
"nested": {
"path": "nestedObjs",
"inner_hits": {}
}
}
}
}
}
}
But this fails with:
Parse Failure [Unexpected token START_OBJECT in [nested-objs].]
If I remove the "inner_hits" field it works ok. But it just gives me the document count and not the documents themselves.
What am I doing wrong?
Edit: I'm using ES version 1.7.1.

Are you sure that inner_hits is allowed in a nested aggregation (as opposed to a nested query)? I suspect that's what's causing the error.
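
If that suspicion is right, one possible alternative is a top_hits sub-aggregation under the nested aggregation instead of inner_hits. The Python sketch below only builds such a request body. It rests on an assumption that should be checked against the documentation for the version in use (1.7.1 here): that top_hits is accepted inside a nested aggregation. The function names and the "objs" aggregation name are hypothetical.

```python
def nested_top_hits_agg(path="nestedObjs", size=10):
    """Step into the nested objects, then return them with top_hits (no inner_hits)."""
    return {
        "nested": {"path": path},
        "aggs": {"objs": {"top_hits": {"size": size}}},
    }


def bucket_aggs():
    """Sub-aggregations for each aggField bucket: last, first, and nested objects."""

    def report(order):
        return {"top_hits": {"sort": [{"timestamp": {"order": order}}], "size": 1}}

    return {
        "last-report": report("desc"),
        "first-report": report("asc"),
        "nested-objs": nested_top_hits_agg(),
    }


def search_body():
    """Full request body mirroring the query in the question."""
    return {
        "size": 0,
        "aggs": {
            "agg-buckets": {
                "terms": {"field": "aggField", "size": 10},
                "aggs": bucket_aggs(),
            }
        },
    }


print(search_body())
```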
