Elasticsearch spends all its time in build_scorer

When we upgraded our ES from 1.4 to 5.2 we got a performance problem with this type of query:
{
"_source": false,
"from": 0,
"size": 50,
"profile": true,
"query": {
"bool": {
"filter": [
{
"ids": {
"values": [<list of 400 ids>],
"boost": 1
}
}
],
"should": [
{
"terms": {
"views": [ <list od 20 ints> ]
}
]
"minimum_should_match": "0",
"boost": 1
}
}
}
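For reference, a runnable version of the request above looks like this (the host, the index name my_index and the shortened ids/views lists are placeholders, not values from the original post):
curl -XPOST "http://localhost:9200/my_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "_source": false,
  "from": 0,
  "size": 50,
  "profile": true,
  "query": {
    "bool": {
      "filter": [ { "ids": { "values": ["1", "2", "3"] } } ],
      "should": [ { "terms": { "views": [42, 43] } } ],
      "minimum_should_match": "0"
    }
  }
}'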
When profiling we found that the problem is in build_scorer, which is called for each segment:
1 shard;
20 segments;
took: 55
{
"type": "BooleanQuery",
"description": "views:[9875227 TO 9875227] views:[6991599 TO 6991599] views:[6682953 TO 6682953] views:[6568587 TO 6568587] views:[10080097 TO 10080097] views:[9200174 TO 9200174] views:[9200174 TO 9200174] views:[10080097 TO 10080097] views:[9966870 TO 9966870] views:[6568587 TO 6568587] views:[6568587 TO 6568587] views:[8538669 TO 8538669] views:[8835038 TO 8835038] views:[9200174 TO 9200174] views:[7539089 TO 7539089] views:[6991599 TO 6991599] views:[8222303 TO 8222303] views:[9342166 TO 9342166] views:[7828288 TO 7828288] views:[9699294 TO 9699294] views:[9108691 TO 9108691] views:[9431297 TO 9431297] views:[7539089 TO 7539089] views:[6032694 TO 6032694] views:[9491741 TO 9491741] views:[9498225 TO 9498225] views:[8051047 TO 8051047] views:[9866955 TO 9866955] views:[8222303 TO 8222303] views:[9622214 TO 9622214]",
"time": "39.70427700ms",
"breakdown": {
"score": 99757,
"build_scorer_count": 20,
"match_count": 0,
"create_weight": 37150,
"next_doc": 0,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 0,
"score_count": 110,
"build_scorer": 38648674,
"advance": 918274,
"advance_count": 291
},
So 38 ms of the total 55 ms was taken by build_scorer, which seems weird.
On ES 1.5 we have about the same number of segments, but the query runs 10x faster.
Unfortunately ES 1.x doesn't have a profiler, so we can't check how many times build_scorer executes there.
So the question is: why is build_scorer_count equal to the number of segments, and how can we tackle this performance issue?


Can't get severity info via API

Java 11
SonarQube 8.9.2 LTS
For my Java project, SonarQube shows the following issue counts:
Severity
Blocker 1.3k
Minor 1.1k
Critical 5.8k
Info 233
Major 1.3k
I need to get this information via the SonarQube Web API.
I found only this API method:
GET http://some_url_sonar_qube/api/issues/search
It returns all issues on page 1 with detailed info:
{
"total": 10049,
"p": 1,
"ps": 100,
"paging": {
"pageIndex": 1,
"pageSize": 100,
"total": 10049
},
"effortTotal": 50995,
"issues": [
{
"key": "dddd",
"rule": "css:S4670",
"severity": "CRITICAL",
...
This:
GET http://some_url_sonar_qube/api/issues/search?p=2
returns all issues on page 2, and so on.
As you can see from the response example above, there are 10049 issues, which is about 100 pages.
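Paging through all of them client-side would look roughly like this (a sketch; the componentKeys filter, anonymous access and the jq/sort/uniq post-processing are my assumptions):
# ~100 requests; pages past 100 are not fetched here
for p in $(seq 1 100); do
  curl -s "http://some_url_sonar_qube/api/issues/search?componentKeys=my_project_key&p=$p&ps=100" \
    | jq -r '.issues[].severity'
done | sort | uniq -c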
But I need summary info, something like this in JSON format:
{
"Severity": {
"Blocker": 1300,
"Minor": 1100,
"Critical": 5800,
"Info": 233,
"Major": 1300
}
}
I haven't found an API method for this.
I found a solution (thanks to #gawkface).
Use this method:
GET http://some_url_sonar_qube/api/issues/search?componentKeys=my_project_key&facets=severities
And here is the result (see the facets section):
{
"total": 10049,
"p": 1,
"ps": 100,
"paging": {
"pageIndex": 1,
"pageSize": 100,
"total": 10049
},
"effortTotal": 50995,
"issues": [...],
"components": [...],
"facets": [
{
"property": "severities",
"values": [
{
"val": "CRITICAL",
"count": 5817
},
{
"val": "MAJOR",
"count": 1454
},
{
"val": "BLOCKER",
"count": 1286
},
{
"val": "MINOR",
"count": 1161
},
{
"val": "INFO",
"count": 331
}
]
}
]
}
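To reshape that facet into the summary format from the question, something like this could work (a sketch; jq availability and anonymous access are assumptions, and the severity names stay uppercase as the API returns them):
curl -s "http://some_url_sonar_qube/api/issues/search?componentKeys=my_project_key&facets=severities&ps=1" \
  | jq '{Severity: (.facets[] | select(.property == "severities") | .values | map({(.val): .count}) | add)}'
The ps=1 parameter only keeps the issues payload small; the facet counts are computed over the whole matching set either way.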

Is it possible to change the priority of the create_snapshot task from NORMAL to HIGH or URGENT?

I have an Elasticsearch cluster with 6 data nodes and 3 master nodes.
When I execute a snapshot I receive the error "process_cluster_event_timeout_exception".
Looking at "/_cat/pending_tasks" in my cluster, there are 69 tasks with priority HIGH and source put-mapping.
My cluster is for centralized logging and has these processes putting data into the cluster:
logstash - collect from Redis and put to Elasticsearch
apm-server
filebeat
metricbeat
I am currently removing beats and some applications from apm-server.
Is it possible to change the priority of the create_snapshot task from NORMAL to HIGH or URGENT?
If that is not a solution, how do I check the correct size for my cluster?
Normally I keep indices for 7 days in my cluster because of the backup.
But because of the error, I have removed the process that deletes the old data.
GET _cat/nodes?v&s=node.role:desc
ip       heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.2.8 47           50          0   0.00    0.00    0.00     mi        -      prd-elasticsearch-i-020
10.0.0.7 14           50          0   0.00    0.00    0.00     mi        -      prd-elasticsearch-i-0ab
10.0.1.1 47           77          29  1.47    1.72    1.66     mi        *      prd-elasticsearch-i-0e2
10.0.2.7 58           95          19  8.04    8.62    8.79     d         -      prd-elasticsearch-i-0b4
10.0.2.4 59           97          20  8.22    8.71    8.76     d         -      prd-elasticsearch-i-00d
10.0.1.6 62           94          38  11.42   8.87    8.89     d         -      prd-elasticsearch-i-0ff
10.0.0.6 67           97          25  8.97    10.45   10.47    d         -      prd-elasticsearch-i-01a
10.0.0.9 57           98          32  11.63   9.64    9.17     d         -      prd-elasticsearch-i-005
10.0.1.0 62           96          19  10.45   9.53    9.31     d         -      prd-elasticsearch-i-088
My cluster definitions:
{
"_nodes": {
"total": 9,
"successful": 9,
"failed": 0
},
"cluster_name": "prd-elasticsearch",
"cluster_uuid": "xxxx",
"timestamp": 1607609607018,
"status": "green",
"indices": {
"count": 895,
"shards": {
"total": 14006,
"primaries": 4700,
"replication": 1.98,
"index": {
"shards": {
"min": 2,
"max": 18,
"avg": 15.649162011173184
},
"primaries": {
"min": 1,
"max": 6,
"avg": 5.251396648044692
},
"replication": {
"min": 1,
"max": 2,
"avg": 1.9787709497206705
}
}
},
"docs": {
"count": 14896803950,
"deleted": 843126
},
"store": {
"size_in_bytes": 16778620001453
},
"fielddata": {
"memory_size_in_bytes": 4790672272,
"evictions": 0
},
"query_cache": {
"memory_size_in_bytes": 7689832903,
"total_count": 2033762560,
"hit_count": 53751516,
"miss_count": 1980011044,
"cache_size": 4087727,
"cache_count": 11319866,
"evictions": 7232139
},
"completion": {
"size_in_bytes": 0
},
"segments": {
"count": 155344,
"memory_in_bytes": 39094918196,
"terms_memory_in_bytes": 31533157295,
"stored_fields_memory_in_bytes": 5574613712,
"term_vectors_memory_in_bytes": 0,
"norms_memory_in_bytes": 449973760,
"points_memory_in_bytes": 886771949,
"doc_values_memory_in_bytes": 650401480,
"index_writer_memory_in_bytes": 905283962,
"version_map_memory_in_bytes": 1173400,
"fixed_bit_set_memory_in_bytes": 12580800,
"max_unsafe_auto_id_timestamp": 1607606224903,
"file_sizes": {}
}
},
"nodes": {
"count": {
"total": 9,
"data": 6,
"coordinating_only": 0,
"master": 3,
"ingest": 3
},
"versions": [
"6.8.1"
],
"os": {
"available_processors": 108,
"allocated_processors": 108,
"names": [
{
"name": "Linux",
"count": 9
}
],
"pretty_names": [
{
"pretty_name": "CentOS Linux 7 (Core)",
"count": 9
}
],
"mem": {
"total_in_bytes": 821975162880,
"free_in_bytes": 50684043264,
"used_in_bytes": 771291119616,
"free_percent": 6,
"used_percent": 94
}
},
"process": {
"cpu": {
"percent": 349
},
"open_file_descriptors": {
"min": 429,
"max": 9996,
"avg": 6607
}
},
"jvm": {
"max_uptime_in_millis": 43603531934,
"versions": [
{
"version": "1.8.0_222",
"vm_name": "OpenJDK 64-Bit Server VM",
"vm_version": "25.222-b10",
"vm_vendor": "Oracle Corporation",
"count": 9
}
],
"mem": {
"heap_used_in_bytes": 137629451248,
"heap_max_in_bytes": 205373571072
},
"threads": 1941
},
"fs": {
"total_in_bytes": 45245361229824,
"free_in_bytes": 28231010959360,
"available_in_bytes": 28231011147776
},
"plugins": [
{
"name": "repository-s3",
"version": "6.8.1",
"elasticsearch_version": "6.8.1",
"java_version": "1.8",
"description": "The S3 repository plugin adds S3 repositories",
"classname": "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
"extended_plugins": [],
"has_native_controller": false
}
],
"network_types": {
"transport_types": {
"security4": 9
},
"http_types": {
"security4": 9
}
}
}
}
Data Nodes: 6 instances r4.4xlarge
Master Nodes: 3 instances m5.large
No, it is not possible to change the priority of the create_snapshot task.
As you have 69 pending tasks, it seems you are doing too many mapping updates.
Regarding the correct size of the cluster, I would recommend going through the following posts:
https://www.elastic.co/blog/found-sizing-elasticsearch
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/sizing-domains.html
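Since the pending tasks are put-mapping operations, one way to cut them down (my own suggestion, only a sketch; the template name, index pattern and field list below are placeholders and must match your actual log fields) is to declare the fields up front in an index template so that dynamic mapping updates are no longer needed:
# hypothetical template name and index pattern; extend "properties" with your real fields
curl -XPUT "http://localhost:9200/_template/logs_predefined" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["logstash-*"],
  "mappings": {
    "doc": {
      "dynamic": false,
      "properties": {
        "@timestamp": { "type": "date" },
        "message":    { "type": "text" }
      }
    }
  }
}'
With "dynamic": false, fields that are not declared are still kept in _source but are not indexed, so they no longer trigger put-mapping cluster tasks.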

_update_by_query fails to update all documents in ElasticSearch

I have over 30 million documents in Elasticsearch (version 6.3.3). I am trying to add a new field to all existing documents and set its value to 0.
For example: I want to add a start field, which did not previously exist in the Twitter document, and set its initial value to 0 in all 30 million documents.
In my case only about 4 million were updated. If I check the submitted task with the Task API http://localhost:9200/_task/{taskId}, the result says something like:
{
"completed": false,
"task": {
"node": "Jsecb8kBSdKLC47Q28O6Pg",
"id": 5968304,
"type": "transport",
"action": "indices:data/write/update/byquery",
"status": {
"total": 34002005,
"updated": 3618000,
"created": 0,
"deleted": 0,
"batches": 3619,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 0,
"requests_per_second": -1.0,
"throttled_until_millis": 0
},
"description": "update-by-query [Twitter][tweet] updated with Script{type=inline, lang='painless', idOrCode='ctx._source.Twitter.start = 0;', options={}, params={}}",
"start_time_in_millis": 1574677050104,
"running_time_in_nanos": 466805438290,
"cancellable": true,
"headers": {}
}
}
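For reference, the status check above can be issued with a plain GET against the Tasks API, using the node and id fields from the response (a sketch; the plural _tasks spelling is what the current ES documentation uses):
# task id is <node>:<id>, taken from the response above
curl -XGET "http://localhost:9200/_tasks/Jsecb8kBSdKLC47Q28O6Pg:5968304?pretty"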
The query I am executing against ES is something like:
curl -XPOST "http://localhost:9200/_update_by_query?wait_for_completion=false&conflicts=proceed" -H 'Content-Type: application/json' -d'
{
"script": {
"source": "ctx._source.Twitter.start = 0;"
},
"query": {
"exists": {
"field": "Twitter"
}
}
}'
Any suggestions would be great, thanks

How can I resolve the increase in index size when using nested objects in elasticsearch?

The total number of documents is 1 billion.
When I configure the index so that some fields are nested objects, the number of documents increases and the index size increases.
There are about 20 nested objects in a document.
When I index 1 billion documents, the document count becomes 20 billion, and the index size is about 20 TB.
However, when I remove the nested objects, the document count is 1 billion and the index size is about 5 TB.
Simply removing the nested objects is not an option; I cannot provide my service with that index structure.
I know why nested objects lead to a higher document count than a simple object configuration.
But I am asking why the index is four times larger and how to fix it.
Elasticsearch version: 5.1.1
The sample data is as follows:
Nested object mapping: idds, ishs, resources, versions (a mapping sketch for these follows the sample document below)
{
"fileType": {
"asdFormat": 1
},
"dac": {
"pe": {
"cal": {
"d1": -4634692645508395000,
"d2": -5805223225419042000,
"d3": -1705264433
},
"bytes": "6a7068e0",
"entry": 0,
"count": 7,
"css": {
"idh": 0,
"ish": 0,
"ifh": 0,
"ioh": 0,
"ish": 0,
"ied": 0,
"exp": 0,
"imp": 0,
"sec": 0
},
"ff": {
"field1": 23117,
"field2": 144,
"field3": 3,
"field4": 0,
"field5": 4,
"field6": 0,
"field7": 65535,
"field8": 0,
"field9": 184,
"field10": 0,
"field11": 0,
"field12": 0,
"field13": 64,
"field14": 0,
"field15": 40104,
"field16": 64563,
"field17": 0,
"field18": 0,
"field19": 0,
"field20": 0,
"field21": 0,
"field22": 0,
"field23": 0,
"field24": 0,
"field25": 0,
"field26": 0,
"field27": 0,
"field28": 0,
"field29": 0,
"field30": 0,
"field31": 224
},
"ifh": {
"mc": 332,
"nos": 3,
"time": 1091599505,
"ps": 0,
"ns": 0,
"soh": 224,
"chart": 271
},
"ioh": {
"magic": 267,
"mlv": 7,
"nlv": 10,
"soc": 80384,
"soid": 137216,
"soud": 0,
"aep": 70290,
"boc": 4096,
"bod": 86016,
"aib": "16777216",
"si": 4096,
"fa": 512,
"mosv": 5,
"nosv": 1,
"miv": 5,
"niv": 1,
"msv": 4,
"nsv": 0,
"wv": 0,
"si": 262144,
"sh": 1024,
"cs": 0,
"ss": 2,
"dllchart": 32768,
"ssr": "262144",
"ssc": "262144",
"ssh": "1048576",
"shc": "4096",
"lf": 0,
"nor": 16
},
"idds": [
{
"id": 1,
"address": 77504,
"size": 300
},
{
"id": 2,
"address": 106496,
"size": 134960
},
{
"id": 6,
"address": 5264,
"size": 28
},
{
"id": 11,
"address": 592,
"size": 300
},
{
"id": 12,
"address": 4096,
"size": 1156
}
],
"ishs": [
{
"id": 0,
"name": ".text",
"size": 79920,
"address": 4096,
"srd": 80384,
"ptr": 1024,
"ptrl": 0,
"ptl": 0,
"nor": 0,
"nol": 0,
"chart": 3758096480,
"ex1": 60404022,
"ex2": 61903965,
"ex": 61153993.5
},
{
"id": 1,
"name": ".data",
"size": 17884,
"address": 86016,
"srd": 2048,
"ptr": 81408,
"ptrl": 0,
"ptl": 0,
"nor": 0,
"nol": 0,
"chart": 3221225536,
"ex1": 27817394,
"ex2": -1,
"ex": 27817394
},
{
"id": 2,
"name": ".rsrc",
"size": 155648,
"address": 106496,
"srd": 135680,
"ptr": 83456,
"ptrl": 0,
"ptl": 0,
"nor": 0,
"nol": 0,
"chart": 3758096448,
"ex1": 38215005,
"ex2": 46960547,
"ex": 42587776
}
],
"resources": [
{
"id": 2,
"count": 3,
"hash": 658696779440676200
},
{
"id": 3,
"count": 14,
"hash": 4671329014159995000
},
{
"id": 5,
"count": 30,
"hash": -6413921454731808000
},
{
"id": 6,
"count": 17,
"hash": 8148183923057157000
},
{
"id": 14,
"count": 4,
"hash": 8004262029246967000
},
{
"id": 16,
"count": 1,
"hash": 7310592488525726000
},
{
"id": 2147487240,
"count": 2,
"hash": -7466967570237519000
}
],
"upx": {
"path": "xps",
"d64": 3570326159822345700
},
"versions": [
{
"language": 1042,
"codePage": 1200,
"companyName": "Microsoft Corporation",
"fileDescription": "Files and Settings Transfer Wizard",
"fileVersion": "5.1.2600.2180 (xpsp_sp2_rtm.040803-2158)",
"internalName": "Microsoft",
"legalCopyright": "Copyright (C) Microsoft Corp. 1999-2000",
"originalFileName": "calc.exe",
"productName": "Microsoft(R) Windows (R) 2000 Operating System",
"productVersion": "5.1.2600.2180"
}
],
"import": {
"dll": [
"GDI32.dll",
"KERNEL32.dll",
"USER32.dll",
"ole32.dll",
"ADVAPI32.dll",
"COMCTL32.dll",
"SHELL32.dll",
"msvcrt.dll",
"comdlg32.dll",
"SHLWAPI.dll",
"SETUPAPI.dll",
"Cabinet.dll",
"LOG.dll",
"MIGISM.dll"
],
"count": 14,
"d1": -149422985349905340,
"d2": -5344971616648705000,
"d3": 947564411044974800
},
"ddSec0": {
"d1": -3007779250746558000,
"d4": -2515772085422514700
},
"ddSec2": {
"d2": -4422408392580008000,
"d4": -8199520081862749000
},
"ddSec3": {
"d1": -8199520081862749000
},
"cdp": {
"d1": 787971,
"d2": 39,
"d3": 101980696,
"d4": 3,
"d5": 285349133
},
"cde": {
"d1": 67242500,
"d2": 33687042,
"d3": 218303490,
"d4": 1663632132,
"d5": 0
},
"cdm": {
"d1": 319293444,
"d2": 2819,
"d3": 168364553,
"d4": 50467081,
"d5": 198664
},
"cdb": {
"d1": 0,
"d2": 0,
"d3": 0,
"d4": 0,
"d5": 0
},
"mm": {
"d0": -3545367393134139000,
"d1": 1008464166428372900,
"d2": -6313842304565328000,
"d3": -5015640502060250000
},
"ser": 17744,
"ideal": 0,
"map": 130,
"ol": 0
}
},
"fileSize": 219136
}
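For reference, the nested declarations for the fields listed above (idds, ishs, resources, versions under dac.pe) would sit in the mapping roughly like this (a sketch only; the index name, the type name and leaving the inner fields to dynamic mapping are my assumptions, not the poster's actual mapping):
# hypothetical index name "my_index" and type name "doc"
curl -XPUT "http://localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "doc": {
      "properties": {
        "dac": {
          "properties": {
            "pe": {
              "properties": {
                "idds":      { "type": "nested" },
                "ishs":      { "type": "nested" },
                "resources": { "type": "nested" },
                "versions":  { "type": "nested" }
              }
            }
          }
        }
      }
    }
  }
}'
Each element of those four arrays becomes a separate hidden Lucene document, which is where the extra document count comes from.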

Compute difference between field and aggregated field

I have to run a complex aggregation, and one of its steps is computing the sum of the sold_qty field; then I need to subtract this sum from the non-aggregated field all_qty. My data looks like:
{item_id: XXX, sold_qty: 1, all_qty: 20, price: 100 }
{item_id: XXX, sold_qty: 3, all_qty: 20, price: 100 }
{item_id: YYY, sold_qty: 1, all_qty: 20, price: 80 }
These are transactions from an offer. The all_qty and price fields are redundant: they hold single values from another structure (offers) and are just duplicated in all transactions from a single offer (identified by item_id).
In terms of SQL, what I need is:
SELECT (all_qty - sum(sold_qty)) * price GROUP BY item_id
What I've done so far is this aggregation:
'{
"query": {"term": {"seller": 9059247}},
"size": 0,
"aggs": {
"group_by_offer": {
"terms": { "field": "item_id", size: 0},
"aggs": { "sold_sum": {"sum": {"field": "sold_qty"}}}
}
}
}'
But I don't know what to do next to achieve my goal.
Since you are already storing redundant fields, if I were you, I would also store the result of all_price = all_qty * price and sold_price = sold_qty * price. It's not mandatory, but it will be faster at execution time than running scripts to do the same computation.
{item_id: XXX, sold_qty: 1, sold_price: 100, all_qty: 20, price: 100, all_price: 2000 }
{item_id: XXX, sold_qty: 3, sold_price: 300, all_qty: 20, price: 100, all_price: 2000 }
{item_id: YYY, sold_qty: 1, sold_price: 80, all_qty: 20, price: 80, all_price: 1600 }
All you'd have to do next is to sum sold_price and average all_price and simply get the difference between both using a bucket_script pipeline aggregation:
{
"query": {
"term": {
"seller": 9059247
}
},
"size": 0,
"aggs": {
"group_by_offer": {
"terms": {
"field": "item_id",
"size": 0
},
"aggs": {
"sold_sum": {
"sum": {
"field": "sold_price"
}
},
"all_sum": {
"avg": {
"field": "all_price"
}
},
"diff": {
"bucket_script": {
"buckets_path": {
"sold": "sold_sum",
"all": "all_sum"
},
"script": "params.all - params.sold"
}
}
}
}
}
}
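If storing the precomputed fields is not an option, a variant of the same idea (my own sketch, not part of the original answer) keeps the raw fields and does the arithmetic in the bucket_script instead:
{
  "query": { "term": { "seller": 9059247 } },
  "size": 0,
  "aggs": {
    "group_by_offer": {
      "terms": { "field": "item_id", "size": 0 },
      "aggs": {
        "sold_sum":  { "sum": { "field": "sold_qty" } },
        "all_avg":   { "avg": { "field": "all_qty" } },
        "price_avg": { "avg": { "field": "price" } },
        "diff": {
          "bucket_script": {
            "buckets_path": {
              "sold": "sold_sum",
              "all": "all_avg",
              "price": "price_avg"
            },
            "script": "(params.all - params.sold) * params.price"
          }
        }
      }
    }
  }
}
Here the avg over all_qty and price just recovers the per-offer constant values, mirroring the SQL above.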
