When I start nodejs script, it deletes old index (if it exist) and according to the config file creates new, after creates Websocket-server and starts to listen incoming connections.
initES() {
this.elasticsearchClient = new elasticsearch.Client({
host: `${Config.elasticSearchHost}:${Config.elasticSearchPort}`,
log: 'trace'
});
let deletePromise = this.elasticsearchClient.indices.delete({index: `${Config.elasticSearchIndex}`});
deletePromise.then(() => {
console.log(`Index ${Config.elasticSearchIndex} deleted`);
}, function(e) {
console.log(e.toJSON())
}).then(() => {
let createPromise = this.elasticsearchClient.indices.create({
index: `${Config.elasticSearchIndex}`,
body: {
settings: {
index: {
number_of_shards: 1,
number_of_replicas: 0
},
analysis: {
analyzer: {
whitespace_analyzer: {
tokenizer: 'whitespace',
filter: ['lowercase']
}
}
}
}
}
});
createPromise.then(() => {
console.log(`Index ${Config.elasticSearchIndex} created`);
}, (e) => {
console.log(e.toJSON());
})
});
}
Script is intended to start just once, at the boot time (through cron), it was written by me, and uses standart ES library (
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html
).
In front, user chooses to calculate orders (~700 items, they calculate by system automatically, with gearman and phantomjs)
At first (first 8 hours or first test) everything is working fine, ES responding good, websocket clients frequently update data, and data is updated in ES index.
If user cancels process, or process is finished and user decides to recalculate (all data is deleted before anything is put on), process of IO in ES becomes slower.
And so on, and after awhile index is filled up to ~340.. ~350 items, not to 700. In some cases ES stops to respond.
Tailing log files of ES shows me tons of lines
Entering safepoint region: GenCollectForAllocation
[2019-05-21T13:46:45.611+0000][9630][gc,start ] GC(271) Pause Young (Allocation Failure)
[2019-05-21T13:46:45.611+0000][9630][gc,task ] GC(271) Using 8 workers of 8 for evacuation
[2019-05-21T13:46:45.616+0000][9630][gc,age ] GC(271) Desired survivor size 17891328 bytes, new threshold 6 (max threshold 6)
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) Age table with threshold 6 (max threshold 6)
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) - age 1: 987344 bytes, 987344 total
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) - age 2: 5440 bytes, 992784 total
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) - age 3: 172640 bytes, 1165424 total
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) - age 4: 535104 bytes, 1700528 total
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) - age 5: 333224 bytes, 2033752 total
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) - age 6: 128 bytes, 2033880 total
[2019-05-21T13:46:45.617+0000][9630][gc,heap ] GC(271) ParNew: 282158K->2653K(314560K)
[2019-05-21T13:46:45.617+0000][9630][gc,heap ] GC(271) CMS: 88354K->88355K(699072K)
[2019-05-21T13:46:45.617+0000][9630][gc,metaspace ] GC(271) Metaspace: 85648K->85648K(1128448K)
[2019-05-21T13:46:45.617+0000][9630][gc ] GC(271) Pause Young (Allocation Failure) 361M->88M(989M) 5.387ms
[2019-05-21T13:46:45.617+0000][9630][gc,cpu ] GC(271) User=0.01s Sys=0.00s Real=0.00s
[2019-05-21T13:46:45.617+0000][9630][safepoint ] Leaving safepoint region
[2019-05-21T13:46:45.617+0000][9630][safepoint ] Total time for which application threads were stopped: 0.0057277 seconds, Stopping threads took: 0.0000429 seconds
[2019-05-21T13:46:46.617+0000][9630][safepoint ] Application time: 1.0004453 seconds
[2019-05-21T13:46:46.617+0000][9630][safepoint ] Entering safepoint region: Cleanup
[2019-05-21T13:46:46.617+0000][9630][safepoint ] Leaving safepoint region
But to be precise, I dont see anyting critical (except memory failure allocation).
And even if everything go well these lines also appear in log.
If I restart my script (which deletes old and creates new index), ES updates these items fast, as it does only for first time
So my question is:
Why ES looses it's performance if I
insert/update/read/delete data ... insert/update/read/delete data ...
and its working ok, if I
insert/update/read restart script insert/update/read/
?
There is nothing to do with Elasticsearch.
It was my fault in not closing websocket connections, which led to server slow down, loosing it's resources.
Sorry guys for taking your time
Related
Today there was an indexing slow log,
[2020-02-12T15:52:37,418][WARN ][i.i.s.index ] [node-1] [company/KTngnM6ASD-_KdU0FFAWRA] took[22.7s], took_millis[22703], type[_doc], id[20080943028], routing[], source[{...}]
then to check gc.log to find out why is so,
[2020-02-12T07:52:37.417+0000][22539][safepoint ] Total time for which application threads were stopped: 0.0004935 seconds, Stopping threads took: 0.0001389 seconds
[2020-02-12T07:52:37.586+0000][22539][safepoint ] Application time: 0.1682439 seconds
[2020-02-12T07:52:37.586+0000][22539][safepoint ] Entering safepoint region: GenCollectForAllocation
[2020-02-12T07:52:37.586+0000][22539][gc,start ] GC(315124) Pause Young (Allocation Failure)
[2020-02-12T07:52:37.586+0000][22539][gc,task ] GC(315124) Using 8 workers of 8 for evacuation
[2020-02-12T07:52:37.641+0000][22539][gc,age ] GC(315124) Desired survivor size 34865152 bytes, new threshold 3 (max threshold 6)
[2020-02-12T07:52:37.641+0000][22539][gc,age ] GC(315124) Age table with threshold 3 (max threshold 6)
[2020-02-12T07:52:37.641+0000][22539][gc,age ] GC(315124) - age 1: 22998672 bytes, 22998672 total
[2020-02-12T07:52:37.641+0000][22539][gc,age ] GC(315124) - age 2: 4966112 bytes, 27964784 total
[2020-02-12T07:52:37.641+0000][22539][gc,age ] GC(315124) - age 3: 10219520 bytes, 38184304 total
[2020-02-12T07:52:37.641+0000][22539][gc,age ] GC(315124) - age 4: 4875304 bytes, 43059608 total
[2020-02-12T07:52:37.641+0000][22539][gc,heap ] GC(315124) ParNew: 597611K->52614K(613440K)
[2020-02-12T07:52:37.641+0000][22539][gc,heap ] GC(315124) CMS: 4992477K->4998973K(16095680K)
[2020-02-12T07:52:37.641+0000][22539][gc,metaspace ] GC(315124) Metaspace: 103488K->103488K(1144832K)
[2020-02-12T07:52:37.641+0000][22539][gc ] GC(315124) Pause Young (Allocation Failure) 5459M->4933M(16317M) 54.724ms
[2020-02-12T07:52:37.641+0000][22539][gc,cpu ] GC(315124) User=0.35s Sys=0.00s Real=0.06s
[2020-02-12T07:52:37.641+0000][22539][safepoint ] Leaving safepoint region
it seems gc is ok, but some log do not understand, e.g.
Entering safepoint region: Cleanup
Entering safepoint region: RevokeBias
Entering safepoint region: GenCollectForAllocation
what are Cleanup, RevokeBias, GenCollectForAllocation meaning? and What Application time meaning? why are so different?
Application time: 0.1382641 seconds
Application time: 13.2106552 seconds
Application time: 106.3031188 seconds
It's funny that you mention some things, but do not provide logs for them. Anyway:
RevokeBias is when revoking of biased locking happens or when "fat" locks from biased locking are deflated.
GenCollectForAllocation - this is the reason why the safepoint was triggered. You can read it as : "generational collector for allocation failure". There are many more other reasons, FYI.
Cleanup is whatever safepoint cleanup tasks are.
AFAIK, if you want more details, you need to enable tracing log level.
can someone explain what does the MAX statistic refers to in the below response. I don't see it documented anywhere.
localhost:8081/actuator/metrics/http.server.requests?tag=uri:/myControllerMethod
Response:
{
"name":"http.server.requests",
"description":null,
"baseUnit":"milliseconds",
"measurements":[
{
"statistic":"COUNT",
"value":13
},
{
"statistic":"TOTAL_TIME",
"value":57.430899
},
{
"statistic":"MAX",
"value":0
}
],
"availableTags":[
{
"tag":"exception",
"values":[
"None"
]
},
{
"tag":"method",
"values":[
"GET"
]
},
{
"tag":"outcome",
"values":[
"SUCCESS"
]
},
{
"tag":"status",
"values":[
"200"
]
},
{
"tag":"commonTag",
"values":[
"somePrefix"
]
}
]
}
You can see the individual metrics by using ?tag=url:{endpoint_tag} as defined in the response of the root /actuator/metrics/http.server.requests call. The details of the measurements values are;
COUNT: Rate per second for calls.
TOTAL_TIME: The sum of the times recorded. Reported in the monitoring system's base unit of time
MAX: The maximum amount recorded. When this represents a time, it is reported in the monitoring system's base unit of time.
As given here, also here.
The discrepancies you are seeing is due to the presence of a timer. Meaning after some time currently defined MAX value for any tagged metric can be reset back to 0. Can you add some new calls to /myControllerMethod then immediately do a call to /actuator/metrics/http.server.requests to see a non-zero MAX value for given tag?
This is due to the idea behind getting MAX metric for each smaller period. When you are seeing these metrics, you will be able to get an array of MAX values rather than a single value for a long period of time.
You can get to see this in action within Micrometer source code. There is a rotate() method focused on resetting the MAX value to create above described behaviour.
You can see this is called for every poll() call, which is triggered every some period for metric gathering.
What does MAX represent
MAX represents the maximum time taken to execute endpoint.
Analysis for /user/asset/getAllAssets
COUNT TOTAL_TIME MAX
5 115 17
6 122 17 (Execution Time = 122 - 115 = 17)
7 131 17 (Execution Time = 131 - 122 = 17)
8 187 56 (Execution Time = 187 - 131 = 56)
9 204 56 From Now MAX will be 56 (Execution Time = 204 - 187 = 17)
Will MAX be 0 if we have less number of request (or 1 request) to the particular endpoint?
No number of request for particular endPoint does not affect the MAX (see Image from Spring Boot Admin)
When MAX will be 0
There is Timer which set the value 0. When the endpoint is not being called or executed for sometime Timer sets MAX to 0. Here approximate timer value is 2 minutes (120 seconds)
DistributionStatisticConfig has .expiry(Duration.ofMinutes(2)).
which sets some measurements to 0 if there is no request has been made in between expiry time or rotate time.
How I have determined the timer value?
For that, I have taken 6 samples (executed the same endpoint for 6 times). For that, I have determined the time difference between the time of calling the endpoint - time for when MAX set back to zero
More Details
UPDATE
Document has been updated.
NOTE:
Max for basic DistributionSummary implementations such as CumulativeDistributionSummary, StepDistributionSummary is a time
window max (TimeWindowMax).
It means that its value is the maximum value during a time window.
If the time window ends, it'll be reset to 0 and a new time window starts again.
Time window size will be the step size of the meter registry unless expiry in DistributionStatisticConfig is set to other value
explicitly.
I have a Spring Boot application and I am using Spring Boot Actuator and Micrometer in order to track metrics about my application. I am specifically concerned about the 'http.server.requests' metric and the MAX statistic:
{
"name": "http.server.requests",
"measurements": [
{
"statistic": "COUNT",
"value": 2
},
{
"statistic": "TOTAL_TIME",
"value": 0.079653001
},
{
"statistic": "MAX",
"value": 0.032696019
}
],
"availableTags": [
{
"tag": "exception",
"values": [
"None"
]
},
{
"tag": "method",
"values": [
"GET"
]
},
{
"tag": "status",
"values": [
"200",
"400"
]
}
]
}
I suppose the MAX statistic is the maximum time of execution of a request (since I have made two requests, it's the the time of the longer processing of one of them).
Whenever I filter the metric by any tag, like localhost:9090/actuator/metrics?tag=status:200
{
"name": "http.server.requests",
"measurements": [
{
"statistic": "COUNT",
"value": 1
},
{
"statistic": "TOTAL_TIME",
"value": 0.029653001
},
{
"statistic": "MAX",
"value": 0.0
}
],
"availableTags": [
{
"tag": "exception",
"values": [
"None"
]
},
{
"tag": "method",
"values": [
"GET"
]
}
]
}
I am always getting 0.0 as a max time. What is the reason of this?
What does MAX represent (MAX Discussion)
MAX represents the maximum time taken to execute endpoint.
Analysis for /user/asset/getAllAssets
COUNT TOTAL_TIME MAX
5 115 17
6 122 17 (Execution Time = 122 - 115 = 17)
7 131 17 (Execution Time = 131 - 122 = 17)
8 187 56 (Execution Time = 187 - 131 = 56)
9 204 56 From Now MAX will be 56 (Execution Time = 204 - 187 = 17)
Will MAX be 0 if we have less number of request (or 1 request) to the particular endpoint?
No number of request for particular endPoint does not affect the MAX (see an image from Spring Boot Admin)
When MAX will be 0
There is Timer which set the value 0. When the endpoint is not being called or executed for sometime Timer sets MAX to 0. Here approximate timer value is 2 to 2.30 minutes (120 to 150 seconds)
DistributionStatisticConfig has .expiry(Duration.ofMinutes(2)) which sets the some measutement to 0 if there is no request has been made for last 2 minutes (120 seconds)
Methods such as public TimeWindowMax(Clock clock,...), private void rotate() Clock interface has been written for the same. You may see the implementation here
How I have determined the timer value?
For that, I have taken 6 samples (executed the same endpoint for 6 times). For that, I have determined the time difference between the time of calling the endpoint - time for when MAX set back to zero
MAX property belongs to enum Statistic which is used by Measurement
(In Measurement we get COUNT, TOTAL_TIME, MAX)
public static final Statistic MAX
The maximum amount recorded. When this represents a time, it is
reported in the monitoring system's base unit of time.
Notes:
This is the cases from metric for a particular endpoint (here /actuator/metrics/http.server.requests?tag=uri:/user/asset/getAllAssets).
For generalize metric of actuator/metrics/http.server.requests
MAX for some endPoint will be set backed to 0 due to a timer. In my view for MAX for /http.server.requests will be same as a particular endpoint.
UPDATE
The document has been updated for the MAX.
NOTE: Max for basic DistributionSummary implementations such as
CumulativeDistributionSummary, StepDistributionSummary is a time
window max (TimeWindowMax). It means that its value is the maximum
value during a time window. If the time window ends, it'll be reset to
0 and a new time window starts again. Time window size will be the
step size of the meter registry unless expiry in
DistributionStatisticConfig is set to other value explicitly.
You can see the individual metrics by using ?tag=url:{endpoint_tag} as defined in the response of the root /actuator/metrics/http.server.requests call. The details of the measurements values are;
COUNT: Rate per second for calls.
TOTAL_TIME: The sum of the times recorded. Reported in the monitoring system's base unit of time
MAX: The maximum amount recorded. When this represents a time, it is reported in the monitoring system's base unit of time.
As given here, also here.
The discrepancies you are seeing is due to the presence of a timer. Meaning after some time currently defined MAX value for any tagged metric can be reset back to 0. Can you add some new calls to your endpoint then immediately do a call to /actuator/metrics/http.server.requests to see a non-zero MAX value for given tag?
This is due to the idea behind getting MAX metric for each smaller period. When you are seeing these metrics, you will be able to get an array of MAX values rather than a single value for a long period of time.
You can get to see this in action within Micrometer source code. There is a rotate() method focused on resetting the MAX value to create above described behaviour.
You can see this is called for every poll() call, which is triggered every some period for metric gathering.
I intend to use a Prometheus Histogram vector to monitor the execution time of request handlers in Go.
I register it so:
var RequestTimeHistogramVec = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "request_duration_seconds",
Help: "Request duration distribution",
Buckets: []float64{0.125, 0.25, 0.5, 1, 1.5, 2, 3, 4, 5, 7.5, 10, 20},
},
[]string{"endpoint"},
)
func init() {
prometheus.MustRegister(RequestTimeHistogramVec)
}
I use it so:
startTime := time.Now()
// handle request here
metrics.RequestTimeHistogramVec.WithLabelValues("get:" + endpointName).Observe(time.Since(startTime).Seconds())
When I do a HTTP GET to the /metrics endpoint after using my endpoint a couple of times, I get - amongst other things - the following:
# HELP request_duration_seconds Request duration distribution
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{endpoint="get:/position",le="0.125"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="0.25"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="0.5"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="1"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="1.5"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="2"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="3"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="4"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="5"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="7.5"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="10"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="20"} 6
request_duration_seconds_bucket{endpoint="get:/position",le="+Inf"} 6
request_duration_seconds_sum{endpoint="get:/position"} 0.022002387
request_duration_seconds_count{endpoint="get:/position"} 6
From the looks of it, all buckets are filled by the same amount, equal to the total amount of times I used my endpoint (6 times).
Why does this happen and how may I fix it?
Prometheus histogram buckets are cumulative, so in this case all the requests took less than or equal to 125ms.
In this case your choice of buckets may not be the best, you might want to make some of the buckets smaller.
This is not an error. Notice the rule for filling the bucket is le=..., meaning less or equal. Since all 6 requests succeeded quickly, all buckets were filled.
I created this device which sends a point to my webserver. My web server stores a Point instance which has the attributes created_at to reflect the point's creation time. My device consistently sends a request to my server at a 180s interval.
Now I want to see the periods of time my device has experienced outages in the last 7 days.
As an example, let's pretend it's August 3rt (08/03). I can query my Points table for points up to the last 3 days sorted by created_at
Points = [ point(name=p1, created_at="08/01 00:00:00"),
point(name=p2, created_at="08/01 00:03:00"),
point(name=p3, created_at="08/01 00:06:00"),
point(name=p4, created_at="08/01 00:20:00"),
point(name=p5, created_at="08/03 00:01:00"),
... ]
I would like to write an algorithm that can list out the following outages:
outages = {
"08/01": [ "00:06:00-00:20:00", "00:20:00-23:59:59" ],
"08/02": [ "00:00:00-23:59:59" ],
"08/03": [ "00:00:00-00:01:00" ],
}
Is there an elegant way to do this?