Cannot aggregate in Elasticsearch - elasticsearch

I have a service whose logs are stored in Elasticsearch, and I want to list the users who have used it.
My request returns detailed log lines, but I want only the unique "kubernetes.pod_name" values:
{
"size": 10000,
"_source": ["kubernetes.pod_name"],
"query": {"bool": {"filter": [
{"match": {"kubernetes.labels.app" : "jupyterhub"}},
{"match_phrase": {"log": "200 GET"}}
]}},
"aggs": {"pods": {"terms": {"field": "kubernetes.pod_name"}}}
}
Why aren't the log lines grouped in the "aggs" section? What should I do to get unique users?
Update:
My query returns:
{'took': 614,
'timed_out': False,
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
'hits': {'total': 17703,
'max_score': 0.0,
'hits': [{'_index': 'dwh-dev-2020-10-14',
'_type': 'container_log',
'_id': 'vQ6vJHUBU_u817onY-cZ',
'_score': 0.0,
'_source': {'kubernetes': {'pod_name': 'jupyter-lyisova-2evg'}}},
{'_index': 'dwh-dev-2020-10-14',
'_type': 'container_log',
'_id': 'xA6vJHUBU_u817onY-cZ',
'_score': 0.0,
'_source': {'kubernetes': {'pod_name': 'jupyter-lyisova-2evg'}}},
{'_index': 'dwh-dev-2020-10-14',
'_type': 'container_log',
'_id': '6g6vJHUBU_u817onY-cZ',
'_score': 0.0,
'_source': {'kubernetes': {'pod_name': 'jupyter-bogdanov'}}},
...
I want to get 20 lines instead of 17703, where each line corresponds to a unique "kubernetes.pod_name".

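You can combine a filter aggregation with a terms sub-aggregation. The filter aggregation takes a single query, so the two conditions are wrapped in a bool query. Setting "size": 0 suppresses the individual hits, and the unique pod names come back as buckets under "aggregations" in the response. Note that terms returns only the top 10 buckets unless you set its own "size".
{
"size": 0,
"aggs": {
"labels_filter": {
"filter": {
"bool": {
"filter": [
{
"match": {
"kubernetes.labels.app": "jupyterhub"
}
},
{
"match_phrase": {
"log": "200 GET"
}
}
]
}
},
"aggs": {
"pods": {
"terms": {
"field": "kubernetes.pod_name"
}
}
}
}
}
}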
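The buckets are returned under the response's aggregations key rather than in hits, so the unique pod names have to be read from there. A minimal Python sketch, assuming the response is a plain dict shaped like a search response to the query above. The sample response and its doc_count values are made up for illustration, and unique_pod_names is a hypothetical helper:

```python
# Hypothetical response fragment; the doc_count values are illustrative only.
sample_response = {
    "hits": {"total": 17703, "hits": []},
    "aggregations": {
        "labels_filter": {
            "doc_count": 17703,
            "pods": {
                "buckets": [
                    {"key": "jupyter-lyisova-2evg", "doc_count": 12000},
                    {"key": "jupyter-bogdanov", "doc_count": 5703},
                ]
            },
        }
    },
}


def unique_pod_names(response):
    """Return one pod name per terms bucket under labels_filter -> pods."""
    buckets = response["aggregations"]["labels_filter"]["pods"]["buckets"]
    return [bucket["key"] for bucket in buckets]


print(unique_pod_names(sample_response))
```

Each bucket's doc_count is the number of matching log lines for that pod, so the same loop can report usage per user if needed.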

Related

Elasticsearch Aggregation of large list

I'm trying to count how many times ingredients show up in different documents. My index body is similar to this
index_body = {
"settings":{
"index":{
"number_of_replicas":0,
"number_of_shards":4,
"refresh_interval":"-1",
"knn":"true"
}
},
"mappings":{
"properties":{
"recipe_id":{
"type":"keyword"
},
"recipe_title":{
"type":"text",
"analyzer":"standard",
"similarity":"BM25"
},
"description":{
"type":"text",
"analyzer":"standard",
"similarity":"BM25"
},
"ingredient":{
"type":"keyword"
},
"image":{
"type":"keyword"
},
....
}
}
In the ingredient field, I've stored an array of strings of each ingredient [ingredient1,ingredient2,....]
I have around 900 documents. Each with their own ingredients list.
I've tried using Elasticsearch's aggregations but it seems to not return what I expected.
Here is the query I've been using:
{
"size":0,
"aggs":{
"ingredients":{
"terms": {"field":"ingredient"}
}
}
}
But it returns this:
{'took': 4, 'timed_out': False, '_shards': {'total': 4, 'successful': 4, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 994, 'relation': 'eq'}, 'max_score': None, 'hits': []}, 'aggregations': {'ingredients': {'doc_count_error_upper_bound': 56, 'sum_other_doc_count': 4709, 'buckets': [{'key': 'salt', 'doc_count': 631}, {'key': 'oil', 'doc_count': 320}, {'key': 'sugar', 'doc_count': 314}, {'key': 'egg', 'doc_count': 302}, {'key': 'butter', 'doc_count': 291}, {'key': 'flour', 'doc_count': 264}, {'key': 'garlic', 'doc_count': 220}, {'key': 'ground pepper', 'doc_count': 185}, {'key': 'vanilla extract', 'doc_count': 146}, {'key': 'lemon', 'doc_count': 131}]}}}
This is clearly wrong, as I have many ingredients. What am I doing wrong? Why is it returning only these ones? Is there a way to force Elasticsearch to return all counts?
By default the terms aggregation returns only the top 10 buckets. To get more, specify size inside the aggregation:
{
"size":0,
"aggs":{
"ingredients":{
"terms": {"field":"ingredient", "size": 10000}
}
}
}
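The sum_other_doc_count value in the response reports the document counts that fell outside the returned buckets, so it can be used to check whether the chosen size was large enough. A small Python sketch, assuming the response is a dict like the one printed in the question (trimmed here to two buckets); has_more_buckets is a hypothetical helper name:

```python
# Trimmed copy of the aggregation part of the response shown in the question.
response = {
    "aggregations": {
        "ingredients": {
            "doc_count_error_upper_bound": 56,
            "sum_other_doc_count": 4709,
            "buckets": [
                {"key": "salt", "doc_count": 631},
                {"key": "oil", "doc_count": 320},
            ],
        }
    }
}


def has_more_buckets(agg):
    """True when some terms fell outside the returned buckets."""
    return agg["sum_other_doc_count"] > 0


agg = response["aggregations"]["ingredients"]
print(has_more_buckets(agg))
```

If this still reports leftover terms after raising size, the field has more distinct values than the requested number of buckets.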

Generating data tables in elastic search

I'm trying to make a data table which consists of some calculations
******************************************************
** Bidder * Request * CPM * Revenue * Response Time **
******************************************************
I've created an index which holds all the data, so my data is stored in following format:
{
"data": {
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 78,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "nits_media_bid_won",
"_type": "nits_media_data_collection",
"_id": "MIyt6m8BWa2IbVphmPUh",
"_score": 1,
"_source": {
"bidderCode": "appnexus",
"width": 300,
"height": 600,
"statusMessage": "Bid available",
"adId": "43d59b34fd61b5",
"requestId": "2c6d19dcc536c3",
"mediaType": "banner",
"source": "client",
"cpm": 0.5,
"creativeId": 98493581,
"currency": "USD",
"netRevenue": true,
"ttl": 300,
"adUnitCode": "/19968336/header-bid-tag-0",
"appnexus": {
"buyerMemberId": 9325
},
"meta": {
"advertiserId": 2529885
},
"originalCpm": 0.5,
"originalCurrency": "USD",
"auctionId": "a628c0c0-bd4d-4f2a-9011-82fab780910e",
"responseTimestamp": 1580190231422,
"requestTimestamp": 1580190231022,
"bidder": "appnexus",
"timeToRespond": 400,
"pbLg": "0.50",
"pbMg": "0.50",
"pbHg": "0.50",
"pbAg": "0.50",
"pbDg": "0.50",
"pbCg": null,
"size": "300x600",
"adserverTargeting": {
"hb_bidder": "appnexus",
"hb_adid": "43d59b34fd61b5",
"hb_pb": "0.50",
"hb_size": "300x600",
"hb_source": "client",
"hb_format": "banner"
},
"status": "rendered",
"params": [
{
"placementId": 13144370
}
],
"nits_account": "asjdfagsd2384vasgd19",
"nits_url": "http://nitsmedia.local/run-ad",
"session_id": "YTGpETKSk2nHwLRB6GbP",
"timestamp": "2020-01-28T05:43:51.702Z",
"geo_data": {
"continent": "North America",
"address_format": "{{recipient}}\n{{street}}\n{{city}} {{region_short}} {{postalcode}}\n{{country}}",
"alpha2": "US",
"alpha3": "USA",
"country_code": "1",
"international_prefix": "011",
"ioc": "USA",
"gec": "US",
"name": "United States of America",
"national_destination_code_lengths": [
3
],
"national_number_lengths": [
10
],
"national_prefix": "1",
"number": "840",
"region": "Americas",
"subregion": "Northern America",
"world_region": "AMER",
"un_locode": "US",
"nationality": "American",
"postal_code": true,
"unofficial_names": [
"United States",
"Vereinigte Staaten von Amerika",
"États-Unis",
"Estados Unidos",
"アメリカ合衆国",
"Verenigde Staten"
],
"languages_official": [
"en"
],
"languages_spoken": [
"en"
],
"geo": {
"latitude": 37.09024000000000143018041853792965412139892578125,
"latitude_dec": "39.44325637817383",
"longitude": -95.7128909999999990532160154543817043304443359375,
"longitude_dec": "-98.95733642578125",
"max_latitude": 71.5388001000000031126546673476696014404296875,
"max_longitude": -66.8854170000000038953658076934516429901123046875,
"min_latitude": 18.77629999999999910187398199923336505889892578125,
"min_longitude": 170.595699999999993679011822678148746490478515625,
"bounds": {
"northeast": {
"lat": 71.5388001000000031126546673476696014404296875,
"lng": -66.8854170000000038953658076934516429901123046875
},
"southwest": {
"lat": 18.77629999999999910187398199923336505889892578125,
"lng": 170.595699999999993679011822678148746490478515625
}
}
},
"currency_code": "USD",
"start_of_week": "sunday"
}
}
},
//Remaining data set....
]
},
}
}
Based on this data set, I want to fetch all unique bidderCode values (shown as Bidder in the table) and compute the following for each one:
Request - the total number of docs in the aggregation bucket
CPM - the sum of all cpm values divided by 1000
Revenue - the total CPM multiplied by 1000
Response time - the average of (responseTimestamp - requestTimestamp)
How can I achieve this? I'm a bit confused by it. I tried building the blocks with:
return $this->elasticsearch->search([
'index' => 'nits_media_bid_won',
'body' => [
'query' => $query,
'aggs' => [
'unique_bidders' => [
'terms' => ['field' => 'bidderCode.keyword']
],
'aggs' => [
'sum' => [
'cpm' => [
'field' => 'cpm',
'script' => '_value / 1000'
]
]
],
]
]
]);
But it returns this error:
{
"error":{
"root_cause":[
{
"type":"x_content_parse_exception",
"reason":"[1:112] [sum] unknown field [cpm], parser not found"
}
],
"type":"x_content_parse_exception",
"reason":"[1:112] [sum] unknown field [cpm], parser not found"
},
"status":400
}
I'm new to this, so please help me out. Thanks.
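Elasticsearch isn't wrong: you've swapped the aggregation name with its type. Inside "aggs", the outer key is a name you choose (here cpm) and the inner key is the aggregation type (here sum). In your request the second 'aggs' block also sits next to unique_bidders instead of inside it, so it is read as an aggregation of type sum whose body contains an unknown field cpm.
Here's the corrected query, with the cpm sub-aggregation nested inside unique_bidders: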
GET nits_media_bid_won/_search
{
"size": 0,
"aggs": {
"unique_bidders": {
"terms": {
"field": "bidderCode.keyword",
"size": 10
},
"aggs": {
"cpm": {
"sum": {
"field": "cpm",
"script": "_value / 1000"
}
}
}
}
}
}
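The remaining table columns can be filled the same way, with one named sub-aggregation per column. A hedged Python sketch follows: it assumes timeToRespond already holds responseTimestamp - requestTimestamp (it does in the sample document, 400), applies the question's own CPM and Revenue formulas client-side, and uses made-up names (cpm_sum, avg_response, buckets_to_rows) and a made-up sample response:

```python
def build_bidder_table_query():
    """Search body: one bucket per bidder with sum(cpm) and avg(timeToRespond)."""
    return {
        "size": 0,
        "aggs": {
            "unique_bidders": {
                "terms": {"field": "bidderCode.keyword", "size": 10},
                "aggs": {
                    "cpm_sum": {"sum": {"field": "cpm"}},
                    "avg_response": {"avg": {"field": "timeToRespond"}},
                },
            }
        },
    }


def buckets_to_rows(response):
    """Flatten the unique_bidders buckets into one table row per bidder."""
    rows = []
    for bucket in response["aggregations"]["unique_bidders"]["buckets"]:
        total_cpm = bucket["cpm_sum"]["value"]
        rows.append({
            "bidder": bucket["key"],
            "requests": bucket["doc_count"],
            "cpm": total_cpm / 1000,       # formula as stated in the question
            "revenue": total_cpm * 1000,   # formula as stated in the question
            "response_time": bucket["avg_response"]["value"],
        })
    return rows


# Hypothetical response fragment for illustration only.
sample_response = {
    "aggregations": {
        "unique_bidders": {
            "buckets": [
                {
                    "key": "appnexus",
                    "doc_count": 2,
                    "cpm_sum": {"value": 1.0},
                    "avg_response": {"value": 400.0},
                }
            ]
        }
    }
}

rows = buckets_to_rows(sample_response)
print(rows)
```

The dict returned by build_bidder_table_query can be passed as the search body in place of the hand-written JSON.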

_update_by_query fails to update all documents in ElasticSearch

I have over 30 million documents in Elasticsearch (version 6.3.3). I am trying to add a new field to all existing documents and set its value to 0.
For example, I want to add a start field, which did not previously exist in the Twitter document, and set its initial value to 0 in all 30 million documents.
In my case only about 4 million documents were updated. When I check the submitted task with the Tasks API (http://localhost:9200/_tasks/{taskId}), the result looks something like this:
{
"completed": false,
"task": {
"node": "Jsecb8kBSdKLC47Q28O6Pg",
"id": 5968304,
"type": "transport",
"action": "indices:data/write/update/byquery",
"status": {
"total": 34002005,
"updated": 3618000,
"created": 0,
"deleted": 0,
"batches": 3619,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 0,
"requests_per_second": -1.0,
"throttled_until_millis": 0
},
"description": "update-by-query [Twitter][tweet] updated with Script{type=inline, lang='painless', idOrCode='ctx._source.Twitter.start = 0;', options={}, params={}}",
"start_time_in_millis": 1574677050104,
"running_time_in_nanos": 466805438290,
"cancellable": true,
"headers": {}
}
}
The query I am executing against ES is something like:
curl -XPOST "http://localhost:9200/_update_by_query?wait_for_completion=false&conflicts=proceed" -H 'Content-Type: application/json' -d'
{
"script": {
"source": "ctx._source.Twitter.start = 0;"
},
"query": {
"exists": {
"field": "Twitter"
}
}
}'
Any suggestions would be great, thanks
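In the task output above, "completed" is false and "updated" is well below "total", which is what a task that is still running looks like. A rough Python sketch for turning such a status block into a progress figure; the status values are copied from the question, processed_fraction is a hypothetical helper, and treating the sum of these five counters as "processed" is an assumption:

```python
# Status values copied from the task output in the question.
status = {
    "total": 34002005,
    "updated": 3618000,
    "created": 0,
    "deleted": 0,
    "version_conflicts": 0,
    "noops": 0,
}


def processed_fraction(status):
    """Share of documents handled so far (assumed: sum of the counters below)."""
    counters = ("updated", "created", "deleted", "noops", "version_conflicts")
    done = sum(status[name] for name in counters)
    return done / status["total"] if status["total"] else 0.0


print(f"{processed_fraction(status):.1%}")
```

Polling the task and printing this figure shows whether the update is still advancing or has actually stopped.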

elasticsearch response hits is not showing up

I am utilizing elasticsearch and after running a search, this is the response I get
{'took': 7, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 1, 'max_score': 0.2876821, 'hits': []}}
My question is why is hits.total = 1 but hits.hits is empty?
Here is the query I used:
"query": {
"bool": {
"must": [
{"match_phrase": { "theName": "bill" } }
]
}
}
I know data exists in my node because when I did the search below (with the same url + index + type in the post request), I got hits.hits to be filled with the result.
"query" : {
"match_all" : {}
}
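My request had from = 40, which was causing the issue: with only one matching document, starting at offset 40 leaves hits.hits empty while hits.total still reports 1.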
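The effect can be modelled in plain Python. This is only an illustration of how an offset and a page size select a window over the matches, not Elasticsearch code; page is a hypothetical helper:

```python
def page(matches, from_=0, size=10):
    """Mimic how from and size pick a window over the ranked matches."""
    return matches[from_:from_ + size]


matches = [{"theName": "bill"}]  # the single matching document

# total counts every match; hits only holds the requested window.
result = {"total": len(matches), "hits": page(matches, from_=40)}
print(result)
```

With from_ left at 0 the same call returns the one matching document.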

How to get latest values for each group with an Elasticsearch query?

I have some documents indexed on Elasticsearch, looking like these samples:
{'country': 'France', 'collected': '2015-03-12', 'value': 20}
{'country': 'Canada', 'collected': '2015-03-12', 'value': 21}
{'country': 'Brazil', 'collected': '2015-03-12', 'value': 33}
{'country': 'France', 'collected': '2015-02-01', 'value': 10}
{'country': 'Canada', 'collected': '2015-02-01', 'value': 11}
{'country': 'Mexico', 'collected': '2015-02-01', 'value': 9}
...
I want to build a query that gets one result per country, getting only the ones with max(collected).
So, for the examples shown above, the results would be something like:
{'country': 'France', 'collected': '2015-03-12', 'value': 20}
{'country': 'Canada', 'collected': '2015-03-12', 'value': 21}
{'country': 'Brazil', 'collected': '2015-03-12', 'value': 33}
{'country': 'Mexico', 'collected': '2015-02-01', 'value': 9}
I realized I need to do aggregation on country, but I'm failing to understand how to limit the results on max(collected).
Any ideas?
You can use a terms aggregation to group on the country field, with a top_hits sub-aggregation that returns 1 doc per group, ordered by the collected date descending:
POST /test/_search
{
"size": 0,
"aggs": {
"group": {
"terms": {
"field": "country"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"sort": [
{
"collected": {
"order": "desc"
}
}
]
}
}
}
}
}
}
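To make the intent of the terms plus top_hits combination concrete, here is the same "latest document per group" selection written in plain Python over the sample documents from the question. It is a client-side illustration only, and latest_per_group is a hypothetical helper:

```python
docs = [
    {"country": "France", "collected": "2015-03-12", "value": 20},
    {"country": "Canada", "collected": "2015-03-12", "value": 21},
    {"country": "Brazil", "collected": "2015-03-12", "value": 33},
    {"country": "France", "collected": "2015-02-01", "value": 10},
    {"country": "Canada", "collected": "2015-02-01", "value": 11},
    {"country": "Mexico", "collected": "2015-02-01", "value": 9},
]


def latest_per_group(docs, group_field, date_field):
    """Keep, for each group value, the document with the greatest date."""
    latest = {}
    for doc in docs:
        key = doc[group_field]
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if key not in latest or doc[date_field] > latest[key][date_field]:
            latest[key] = doc
    return list(latest.values())


result = latest_per_group(docs, "country", "collected")
for doc in result:
    print(doc)
```

In the aggregation, each terms bucket plays the role of a dictionary key and top_hits with size 1 and a descending sort plays the role of the date comparison.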
For those like user1892775 who run into "Fielddata is disabled on text fields by default...", you can create a multi field (https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html). So you might have mapping like:
"mappings": {
"properties": {
"country": {"type": "text", "fields": {"raw": {"type": "keyword"}}}
}
}
Then your query would look like:
POST /test/_search
{
"size": 0,
"aggs": {
"group": {
"terms": {
"field": "country.raw"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"sort": [
{
"collected": {
"order": "desc"
}
}
]
}
}
}
}
}
}
(Note the use of country.raw)
The answer marked correct worked great for me. Here is how I added some extra filters. This is version 7.4 on AWS.
The field I'm grouping by is a keyword field named tags.
For each group (tag), get top 3 documents sorted by date_uploaded descending.
Also show the total amount of documents within each group (tag).
Only consider non-deleted documents belonging to user 22.
Only return 10 groups (tags), sorted alphabetically.
For each document, return its ID (book_id) and date_uploaded. (Default is that all info is returned.)
"size": 0 keeps the query from returning the matching documents themselves in addition to the aggregation results.
{'query': {'bool': {'filter': [{'terms': {'user_id': [22]}}, {'terms': {'deleted': ['false']}}]}},
'size': 0,
"aggs": {
"group": {
"terms": {
"field": "tags.keyword",
"size":10,
"order":{ "_key": "asc" }
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 3,
"_source":["book_id","date_uploaded"],
"sort": [ {"date_uploaded": { "order": "desc" }}]
}
}
}
}
}
}
Here is how to get each group (tag in my case) and the document matches for each group.
query_results = ...  # result of query (the response dict returned by the search)
buckets = query_results["aggregations"]["group"]["buckets"]
for bucket in buckets:
    tag = bucket["key"]
    tag_doc_count = bucket["doc_count"]  # total number of documents in this group
    print(tag, tag_doc_count)
    tag_hits = bucket["group_docs"]["hits"]["hits"]
    for hit in tag_hits:
        source = hit["_source"]
        print(source["book_id"], source["date_uploaded"])
FYI, the "group" term can be named anything. Just make sure to use the same name when getting buckets from your query results.

Resources