Today I found an entry in the indexing slow log:
[2020-02-12T15:52:37,418][WARN ][i.i.s.index ] [node-1] [company/KTngnM6ASD-_KdU0FFAWRA] took[22.7s], took_millis[22703], type[_doc], id[20080943028], routing[], source[{...}]
so I checked gc.log to find out why:
[2020-02-12T07:52:37.417+0000][22539][safepoint ] Total time for which application threads were stopped: 0.0004935 seconds, Stopping threads took: 0.0001389 seconds
[2020-02-12T07:52:37.586+0000][22539][safepoint ] Application time: 0.1682439 seconds
[2020-02-12T07:52:37.586+0000][22539][safepoint ] Entering safepoint region: GenCollectForAllocation
[2020-02-12T07:52:37.586+0000][22539][gc,start ] GC(315124) Pause Young (Allocation Failure)
[2020-02-12T07:52:37.586+0000][22539][gc,task ] GC(315124) Using 8 workers of 8 for evacuation
[2020-02-12T07:52:37.641+0000][22539][gc,age ] GC(315124) Desired survivor size 34865152 bytes, new threshold 3 (max threshold 6)
[2020-02-12T07:52:37.641+0000][22539][gc,age ] GC(315124) Age table with threshold 3 (max threshold 6)
[2020-02-12T07:52:37.641+0000][22539][gc,age ] GC(315124) - age 1: 22998672 bytes, 22998672 total
[2020-02-12T07:52:37.641+0000][22539][gc,age ] GC(315124) - age 2: 4966112 bytes, 27964784 total
[2020-02-12T07:52:37.641+0000][22539][gc,age ] GC(315124) - age 3: 10219520 bytes, 38184304 total
[2020-02-12T07:52:37.641+0000][22539][gc,age ] GC(315124) - age 4: 4875304 bytes, 43059608 total
[2020-02-12T07:52:37.641+0000][22539][gc,heap ] GC(315124) ParNew: 597611K->52614K(613440K)
[2020-02-12T07:52:37.641+0000][22539][gc,heap ] GC(315124) CMS: 4992477K->4998973K(16095680K)
[2020-02-12T07:52:37.641+0000][22539][gc,metaspace ] GC(315124) Metaspace: 103488K->103488K(1144832K)
[2020-02-12T07:52:37.641+0000][22539][gc ] GC(315124) Pause Young (Allocation Failure) 5459M->4933M(16317M) 54.724ms
[2020-02-12T07:52:37.641+0000][22539][gc,cpu ] GC(315124) User=0.35s Sys=0.00s Real=0.06s
[2020-02-12T07:52:37.641+0000][22539][safepoint ] Leaving safepoint region
It seems GC is OK, but there are some log lines I do not understand, e.g.
Entering safepoint region: Cleanup
Entering safepoint region: RevokeBias
Entering safepoint region: GenCollectForAllocation
What do Cleanup, RevokeBias, and GenCollectForAllocation mean? And what does Application time mean? Why are the values so different?
Application time: 0.1382641 seconds
Application time: 13.2106552 seconds
Application time: 106.3031188 seconds
It's funny that you mention some things but do not provide logs for them. Anyway:
RevokeBias is the safepoint triggered when biased locking is revoked, or when "fat" locks inflated from biased locking are deflated.
GenCollectForAllocation is the reason why the safepoint was triggered. You can read it as: "generational collection because of an allocation failure". There are many other reasons, FYI.
Cleanup is the safepoint for the JVM's periodic internal cleanup tasks.
Application time is simply how long the application threads were running between two consecutive safepoints, so the values vary with how frequently safepoints are triggered.
AFAIK, if you want more details, you need to enable the trace log level.
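For example, with JDK 9+ unified logging, something like the following (a sketch; the exact tags and levels vary between JDK versions):

-Xlog:safepoint*=trace:file=safepoint.log:time,uptime

In recent JDKs the old -XX:+PrintSafepointStatistics flag has also been replaced by the safepoint+stats log tags (-Xlog:safepoint+stats).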
In order to measure the number of context switches for a multi-threaded application, I followed two methods: 1) with perf sched and 2) with the information in /proc/<pid>/status. The difference between them is quite large, though. The steps I took are:
1- Using the perf command, the number of switches is 7848.
$ sudo perf stat -e sched:sched_switch,task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions ./mm_double_omp 4
Using 4 threads
PID = 395944
Performance counter stats for './mm_double_omp 4':
7,601 sched:sched_switch # 0.044 K/sec
173,377.19 msec task-clock # 3.973 CPUs utilized
7,601 context-switches # 0.044 K/sec
2 cpu-migrations # 0.000 K/sec
24,780 page-faults # 0.143 K/sec
164,393,781,352 cycles # 0.948 GHz
69,723,515,498 instructions # 0.42 insn per cycle
43.636463582 seconds time elapsed
173.244505000 seconds user
0.123880000 seconds sys
Please note that sched:sched_switch and context-switches are the same. If I only use sched:sched_switch the number is still in the order of 7000.
2- I modified the code to copy the /proc/<pid>/status file twice: at the beginning and at the end of the program.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    char cmdbuf[256];
    int pid_num = getpid();
    printf("PID = %d\n", pid_num);
    /* snapshot the counters before the workload */
    snprintf(cmdbuf, sizeof(cmdbuf), "sudo cp /proc/%d/status %s", pid_num, "start.txt");
    system(cmdbuf);
    // DO the actual work here
    /* snapshot the counters after the workload */
    snprintf(cmdbuf, sizeof(cmdbuf), "sudo cp /proc/%d/status %s", pid_num, "finish.txt");
    system(cmdbuf);
    return 0;
}
After the execution I see:
$ tail -n2 start.txt
voluntary_ctxt_switches: 2
nonvoluntary_ctxt_switches: 0
$ tail -n2 finish.txt
voluntary_ctxt_switches: 5
nonvoluntary_ctxt_switches: 573
So there are fewer than 600 context switches, which is far less than the perf result. My questions are:
Does the perf code affect the measurement? If yes, then it has a large overhead.
Is the meaning of "context switch" the same in both methods?
Which one is more reliable, then?
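One thing worth checking when comparing the two: as far as I know, /proc/<pid>/status shows the counters of that single task (the main thread), while perf by default follows every thread the program spawns. A way to cross-check (a sketch; <PID> is the id printed by the program, and the per-task files only exist while it runs):

$ awk '/ctxt_switches/ { sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' /proc/<PID>/task/*/status

If the summed per-thread counters land near the perf number, the gap is just per-thread vs whole-process accounting.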
Can someone explain what the MAX statistic refers to in the response below? I don't see it documented anywhere.
localhost:8081/actuator/metrics/http.server.requests?tag=uri:/myControllerMethod
Response:
{
  "name": "http.server.requests",
  "description": null,
  "baseUnit": "milliseconds",
  "measurements": [
    { "statistic": "COUNT", "value": 13 },
    { "statistic": "TOTAL_TIME", "value": 57.430899 },
    { "statistic": "MAX", "value": 0 }
  ],
  "availableTags": [
    { "tag": "exception", "values": ["None"] },
    { "tag": "method", "values": ["GET"] },
    { "tag": "outcome", "values": ["SUCCESS"] },
    { "tag": "status", "values": ["200"] },
    { "tag": "commonTag", "values": ["somePrefix"] }
  ]
}
You can see the individual metrics by using ?tag=uri:{endpoint_tag}, as defined in the response of the root /actuator/metrics/http.server.requests call. The details of the measurement values are:
COUNT: Rate per second for calls.
TOTAL_TIME: The sum of the times recorded. Reported in the monitoring system's base unit of time.
MAX: The maximum amount recorded. When this represents a time, it is reported in the monitoring system's base unit of time.
As given here, also here.
The discrepancy you are seeing is due to the presence of a timer: after some time, the currently recorded MAX value for any tagged metric is reset back to 0. Can you add some new calls to /myControllerMethod and then immediately call /actuator/metrics/http.server.requests? You should see a non-zero MAX value for the given tag.
The idea is to report a MAX per (smaller) time window: when you chart these metrics, you get a series of MAX values, one per window, rather than a single value for a long period of time.
You can see this in action in the Micrometer source code. There is a rotate() method responsible for resetting the MAX value, which creates the behaviour described above. It is called on every poll(), which is triggered periodically for metric gathering.
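To make the window behaviour concrete, here is a minimal, self-contained sketch (assuming only Micrometer core on the classpath; the meter name and the shortened expiry/buffer settings are illustrative, not what Spring Boot configures):

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class MaxWindowDemo {
    public static void main(String[] args) throws InterruptedException {
        SimpleMeterRegistry registry = new SimpleMeterRegistry();
        // Shrink the time window (default expiry is 2 minutes) so the reset is observable in seconds.
        Timer timer = Timer.builder("http.server.requests.demo")
                .distributionStatisticExpiry(Duration.ofSeconds(2))
                .distributionStatisticBufferLength(1)
                .register(registry);

        timer.record(57, TimeUnit.MILLISECONDS);
        System.out.println("MAX right away: " + timer.max(TimeUnit.MILLISECONDS)); // ~57.0

        Thread.sleep(5_000); // wait until the time window has rotated
        System.out.println("MAX after expiry: " + timer.max(TimeUnit.MILLISECONDS)); // 0.0
    }
}

With the defaults (2-minute expiry, a small ring buffer of windows), the same reset simply happens on a scale of minutes, which is why MAX can already be back at 0 by the time the actuator endpoint is queried.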
What does MAX represent?
MAX represents the maximum time taken to execute the endpoint.
Analysis for /user/asset/getAllAssets
COUNT  TOTAL_TIME  MAX
5      115         17
6      122         17    (execution time = 122 - 115 = 7, MAX stays 17)
7      131         17    (execution time = 131 - 122 = 9, MAX stays 17)
8      187         56    (execution time = 187 - 131 = 56, a new MAX)
9      204         56    (execution time = 204 - 187 = 17; MAX remains 56)
Will MAX be 0 if we make only a few requests (or one request) to the particular endpoint?
No, the number of requests to a particular endpoint does not affect MAX (see the image from Spring Boot Admin).
When will MAX be 0?
There is a timer that sets the value back to 0: when the endpoint has not been called or executed for some time, the timer resets MAX to 0. Here the approximate timer value is 2 minutes (120 seconds); DistributionStatisticConfig has .expiry(Duration.ofMinutes(2)), which resets some measurements to 0 if no request has been made within the expiry (rotation) time.
How did I determine the timer value?
I took 6 samples (executed the same endpoint 6 times) and, for each, measured the difference between the time the endpoint was called and the time MAX was set back to zero.
More Details
UPDATE
Document has been updated.
NOTE:
Max for basic DistributionSummary implementations such as CumulativeDistributionSummary, StepDistributionSummary is a time window max (TimeWindowMax). It means that its value is the maximum value during a time window. If the time window ends, it'll be reset to 0 and a new time window starts again. Time window size will be the step size of the meter registry unless expiry in DistributionStatisticConfig is set to other value explicitly.
I'm struggling to join a CockroachDB node to a cluster.
I created the first cluster, then tried to join a 2nd node to it, but the 2nd node created a new cluster instead, as shown below.
Does anyone know which of my steps below is wrong? Any suggestions are welcome.
I started the first node as follows:
cockroach start --insecure --advertise-host=163.172.156.111
* Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v19.1/secure-a-cluster.html
*
CockroachDB node starting at 2019-05-11 01:11:15.45522036 +0000 UTC (took 2.5s)
build: CCL v19.1.0 @ 2019/04/29 18:36:40 (go1.11.6)
webui: http://163.172.156.111:8080
sql: postgresql://root@163.172.156.111:26257?sslmode=disable
client flags: cockroach <client cmd> --host=163.172.156.111:26257 --insecure
logs: /home/ueda/cockroach-data/logs
temp dir: /home/ueda/cockroach-data/cockroach-temp449555924
external I/O path: /home/ueda/cockroach-data/extern
store[0]: path=/home/ueda/cockroach-data
status: initialized new cluster
clusterID: 3e797faa-59a1-4b0d-83b5-36143ddbdd69
nodeID: 1
Then I started the secondary node to join 163.172.156.111, but it couldn't join:
cockroach start --insecure --advertise-addr=128.199.127.164 --join=163.172.156.111:26257
CockroachDB node starting at 2019-05-11 01:21:14.533097432 +0000 UTC (took 0.8s)
build: CCL v19.1.0 @ 2019/04/29 18:36:40 (go1.11.6)
webui: http://128.199.127.164:8080
sql: postgresql://root@128.199.127.164:26257?sslmode=disable
client flags: cockroach <client cmd> --host=128.199.127.164:26257 --insecure
logs: /home/ueda/cockroach-data/logs
temp dir: /home/ueda/cockroach-data/cockroach-temp067740997
external I/O path: /home/ueda/cockroach-data/extern
store[0]: path=/home/ueda/cockroach-data
status: restarted pre-existing node
clusterID: a14e89a7-792d-44d3-89af-7037442eacbc
nodeID: 1
The cockroach.log of the joining node shows a gossip error:
cat cockroach-data/logs/cockroach.log
I190511 01:21:13.762309 1 util/log/clog.go:1199 [config] file created at: 2019/05/11 01:21:13
I190511 01:21:13.762309 1 util/log/clog.go:1199 [config] running on machine: amfortas
I190511 01:21:13.762309 1 util/log/clog.go:1199 [config] binary: CockroachDB CCL v19.1.0 (x86_64-unknown-linux-gnu, built 2019/04/29 18:36:40, go1.11.6)
I190511 01:21:13.762309 1 util/log/clog.go:1199 [config] arguments: [cockroach start --insecure --advertise-addr=128.199.127.164 --join=163.172.156.111:26257]
I190511 01:21:13.762309 1 util/log/clog.go:1199 line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid file:line msg utf8=✓
I190511 01:21:13.762307 1 cli/start.go:1033 logging to directory /home/ueda/cockroach-data/logs
W190511 01:21:13.763373 1 cli/start.go:1068 RUNNING IN INSECURE MODE!
- Your cluster is open for any client that can access <all your IP addresses>.
- Any user, even root, can log in without providing a password.
- Any user, connecting as root, can read or write any data in your cluster.
- There is no network encryption nor authentication, and thus no confidentiality.
Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v19.1/secure-a-cluster.html
I190511 01:21:13.763675 1 server/status/recorder.go:610 available memory from cgroups (8.0 EiB) exceeds system memory 992 MiB, using system memory
W190511 01:21:13.763752 1 cli/start.go:944 Using the default setting for --cache (128 MiB).
A significantly larger value is usually needed for good performance.
If you have a dedicated server a reasonable setting is --cache=.25 (248 MiB).
I190511 01:21:13.764011 1 server/status/recorder.go:610 available memory from cgroups (8.0 EiB) exceeds system memory 992 MiB, using system memory
W190511 01:21:13.764047 1 cli/start.go:957 Using the default setting for --max-sql-memory (128 MiB).
A significantly larger value is usually needed in production.
If you have a dedicated server a reasonable setting is --max-sql-memory=.25 (248 MiB).
I190511 01:21:13.764239 1 server/status/recorder.go:610 available memory from cgroups (8.0 EiB) exceeds system memory 992 MiB, using system memory
I190511 01:21:13.764272 1 cli/start.go:1082 CockroachDB CCL v19.1.0 (x86_64-unknown-linux-gnu, built 2019/04/29 18:36:40, go1.11.6)
I190511 01:21:13.866977 1 server/status/recorder.go:610 available memory from cgroups (8.0 EiB) exceeds system memory 992 MiB, using system memory
I190511 01:21:13.867002 1 server/config.go:386 system total memory: 992 MiB
I190511 01:21:13.867063 1 server/config.go:388 server configuration:
max offset 500000000
cache size 128 MiB
SQL memory pool size 128 MiB
scan interval 10m0s
scan min idle time 10ms
scan max idle time 1s
event log enabled true
I190511 01:21:13.867098 1 cli/start.go:929 process identity: uid 1000 euid 1000 gid 1000 egid 1000
I190511 01:21:13.867115 1 cli/start.go:554 starting cockroach node
I190511 01:21:13.868242 21 storage/engine/rocksdb.go:613 opening rocksdb instance at "/home/ueda/cockroach-data/cockroach-temp067740997"
I190511 01:21:13.894320 21 server/server.go:876 [n?] monitoring forward clock jumps based on server.clock.forward_jump_check_enabled
I190511 01:21:13.894813 21 storage/engine/rocksdb.go:613 opening rocksdb instance at "/home/ueda/cockroach-data"
W190511 01:21:13.896301 21 storage/engine/rocksdb.go:127 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/version_set.cc:2566] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
W190511 01:21:13.905666 21 storage/engine/rocksdb.go:127 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/version_set.cc:2566] More existing levels in DB than needed. max_bytes_for_level_multiplier may not be guaranteed.
I190511 01:21:13.911380 21 server/config.go:494 [n?] 1 storage engine initialized
I190511 01:21:13.911417 21 server/config.go:497 [n?] RocksDB cache size: 128 MiB
I190511 01:21:13.911427 21 server/config.go:497 [n?] store 0: RocksDB, max size 0 B, max open file limit 10000
W190511 01:21:13.912459 21 gossip/gossip.go:1496 [n?] no incoming or outgoing connections
I190511 01:21:13.913206 21 server/server.go:926 [n?] Sleeping till wall time 1557537673913178595 to catches up to 1557537674394265598 to ensure monotonicity. Delta: 481.087003ms
I190511 01:21:14.251655 65 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n?] circuitbreaker: gossip [::]:26257->163.172.156.111:26257 tripped: initial connection heartbeat failed: rpc error: code = Unknown desc = client cluster ID "a14e89a7-792d-44d3-89af-7037442eacbc" doesn't match server cluster ID "3e797faa-59a1-4b0d-83b5-36143ddbdd69"
I190511 01:21:14.251695 65 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 [n?] circuitbreaker: gossip [::]:26257->163.172.156.111:26257 event: BreakerTripped
W190511 01:21:14.251763 65 gossip/client.go:122 [n?] failed to start gossip client to 163.172.156.111:26257: initial connection heartbeat failed: rpc error: code = Unknown desc = client cluster ID "a14e89a7-792d-44d3-89af-7037442eacbc" doesn't match server cluster ID "3e797faa-59a1-4b0d-83b5-36143ddbdd69"
I190511 01:21:14.395848 21 gossip/gossip.go:392 [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"128.199.127.164:26257" > attrs:<> locality:<> ServerVersion:<major_val:19 minor_val:1 patch:0 unstable:0 > build_tag:"v19.1.0" started_at:1557537674395557548
W190511 01:21:14.458176 21 storage/replica_range_lease.go:506 can't determine lease status due to node liveness error: node not in the liveness table
I190511 01:21:14.458465 21 server/node.go:461 [n1] initialized store [n1,s1]: disk (capacity=24 GiB, available=18 GiB, used=2.2 MiB, logicalBytes=41 MiB), ranges=20, leases=0, queries=0.00, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=6467.00 p90=26940.00 pMax=43017435.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00 pMax=0.00}
I190511 01:21:14.458775 21 storage/stores.go:244 [n1] read 0 node addresses from persistent storage
I190511 01:21:14.459095 21 server/node.go:699 [n1] connecting to gossip network to verify cluster ID...
W190511 01:21:14.469842 96 storage/store.go:1525 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown
I190511 01:21:14.474785 21 server/node.go:719 [n1] node connected via gossip and verified as part of cluster "a14e89a7-792d-44d3-89af-7037442eacbc"
I190511 01:21:14.475033 21 server/node.go:542 [n1] node=1: started with [<no-attributes>=/home/ueda/cockroach-data] engine(s) and attributes []
I190511 01:21:14.475393 21 server/status/recorder.go:610 [n1] available memory from cgroups (8.0 EiB) exceeds system memory 992 MiB, using system memory
I190511 01:21:14.475514 21 server/server.go:1582 [n1] starting http server at [::]:8080 (use: 128.199.127.164:8080)
I190511 01:21:14.475572 21 server/server.go:1584 [n1] starting grpc/postgres server at [::]:26257
I190511 01:21:14.475605 21 server/server.go:1585 [n1] advertising CockroachDB node at 128.199.127.164:26257
W190511 01:21:14.475655 21 jobs/registry.go:341 [n1] unable to get node liveness: node not in the liveness table
I190511 01:21:14.532949 21 server/server.go:1650 [n1] done ensuring all necessary migrations have run
I190511 01:21:14.533020 21 server/server.go:1653 [n1] serving sql connections
I190511 01:21:14.533209 21 cli/start.go:689 [config] clusterID: a14e89a7-792d-44d3-89af-7037442eacbc
I190511 01:21:14.533257 21 cli/start.go:697 node startup completed:
CockroachDB node starting at 2019-05-11 01:21:14.533097432 +0000 UTC (took 0.8s)
build: CCL v19.1.0 @ 2019/04/29 18:36:40 (go1.11.6)
webui: http://128.199.127.164:8080
sql: postgresql://root@128.199.127.164:26257?sslmode=disable
client flags: cockroach <client cmd> --host=128.199.127.164:26257 --insecure
logs: /home/ueda/cockroach-data/logs
temp dir: /home/ueda/cockroach-data/cockroach-temp067740997
external I/O path: /home/ueda/cockroach-data/extern
store[0]: path=/home/ueda/cockroach-data
status: restarted pre-existing node
clusterID: a14e89a7-792d-44d3-89af-7037442eacbc
nodeID: 1
I190511 01:21:14.541205 146 server/server_update.go:67 [n1] no need to upgrade, cluster already at the newest version
I190511 01:21:14.555557 149 sql/event_log.go:135 [n1] Event: "node_restart", target: 1, info: {Descriptor:{NodeID:1 Address:128.199.127.164:26257 Attrs: Locality: ServerVersion:19.1 BuildTag:v19.1.0 StartedAt:1557537674395557548 LocalityAddress:[] XXX_NoUnkeyedLiteral:{} XXX_sizecache:0} ClusterID:a14e89a7-792d-44d3-89af-7037442eacbc StartedAt:1557537674395557548 LastUp:1557537671113461486}
I190511 01:21:14.916458 59 gossip/gossip.go:1510 [n1] node has connected to cluster via gossip
I190511 01:21:14.916660 59 storage/stores.go:263 [n1] wrote 0 node addresses to persistent storage
I190511 01:21:24.480247 116 storage/store.go:4220 [n1,s1] sstables (read amplification = 2):
0 [ 51K 1 ]: 51K
6 [ 1M 1 ]: 1M
I190511 01:21:24.480380 116 storage/store.go:4221 [n1,s1]
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
L0 1/0 50.73 KB 0.5 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 8.0 0 1 0.006 0 0
L6 1/0 1.26 MB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0
Sum 2/0 1.31 MB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 8.0 0 1 0.006 0 0
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 8.0 0 1 0.006 0 0
Uptime(secs): 10.6 total, 10.6 interval
Flush(GB): cumulative 0.000, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
estimated_pending_compaction_bytes: 0 B
I190511 01:21:24.481565 121 server/status/runtime.go:500 [n1] runtime stats: 170 MiB RSS, 114 goroutines, 0 B/0 B/0 B GO alloc/idle/total, 14 MiB/16 MiB CGO alloc/total, 0.0 CGO/sec, 0.0/0.0 %(u/s)time, 0.0 %gc (7x), 50 KiB/1.5 MiB (r/w)net
What could be blocking the join? Thank you for your suggestions!
It seems you had previously started the second node (the one running on 128.199.127.164) by itself, creating its own cluster.
This can be seen in the error message:
W190511 01:21:14.251763 65 gossip/client.go:122 [n?] failed to start gossip client to 163.172.156.111:26257: initial connection heartbeat failed: rpc error: code = Unknown desc = client cluster ID "a14e89a7-792d-44d3-89af-7037442eacbc" doesn't match server cluster ID "3e797faa-59a1-4b0d-83b5-36143ddbdd69"
To be able to join the cluster, the data directory of the joining node must be empty. You can either delete cockroach-data or specify an alternate directory with --store=/path/to/data-dir.
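For example, on the second node (a sketch; note this permanently discards the data of the accidental single-node cluster, so only do it if nothing on that node is needed):

cockroach quit --insecure --host=128.199.127.164
rm -rf cockroach-data
cockroach start --insecure --advertise-addr=128.199.127.164 --join=163.172.156.111:26257

After that, the startup banner on the second node should report the same clusterID as the first node (3e797faa-59a1-4b0d-83b5-36143ddbdd69).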
When I start my Node.js script, it deletes the old index (if it exists) and creates a new one according to the config file; after that it creates a WebSocket server and starts listening for incoming connections.
initES() {
  this.elasticsearchClient = new elasticsearch.Client({
    host: `${Config.elasticSearchHost}:${Config.elasticSearchPort}`,
    log: 'trace'
  });
  let deletePromise = this.elasticsearchClient.indices.delete({
    index: `${Config.elasticSearchIndex}`
  });
  deletePromise.then(() => {
    console.log(`Index ${Config.elasticSearchIndex} deleted`);
  }, function (e) {
    console.log(e.toJSON());
  }).then(() => {
    let createPromise = this.elasticsearchClient.indices.create({
      index: `${Config.elasticSearchIndex}`,
      body: {
        settings: {
          index: {
            number_of_shards: 1,
            number_of_replicas: 0
          },
          analysis: {
            analyzer: {
              whitespace_analyzer: {
                tokenizer: 'whitespace',
                filter: ['lowercase']
              }
            }
          }
        }
      }
    });
    createPromise.then(() => {
      console.log(`Index ${Config.elasticSearchIndex} created`);
    }, (e) => {
      console.log(e.toJSON());
    });
  });
}
The script is intended to start just once, at boot time (through cron); it was written by me and uses the standard ES library (
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html
).
On the front end, the user chooses to calculate orders (~700 items; the system calculates them automatically, with Gearman and PhantomJS).
At first (the first 8 hours, or the first test run) everything works fine: ES responds well, WebSocket clients update their data frequently, and the data is updated in the ES index.
If the user cancels the process, or the process finishes and the user decides to recalculate (all data is deleted before anything new is put in), ES I/O becomes slower.
And so on; after a while the index fills up to only ~340–350 items instead of 700. In some cases ES stops responding.
Tailing the ES log files shows me tons of lines like:
Entering safepoint region: GenCollectForAllocation
[2019-05-21T13:46:45.611+0000][9630][gc,start ] GC(271) Pause Young (Allocation Failure)
[2019-05-21T13:46:45.611+0000][9630][gc,task ] GC(271) Using 8 workers of 8 for evacuation
[2019-05-21T13:46:45.616+0000][9630][gc,age ] GC(271) Desired survivor size 17891328 bytes, new threshold 6 (max threshold 6)
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) Age table with threshold 6 (max threshold 6)
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) - age 1: 987344 bytes, 987344 total
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) - age 2: 5440 bytes, 992784 total
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) - age 3: 172640 bytes, 1165424 total
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) - age 4: 535104 bytes, 1700528 total
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) - age 5: 333224 bytes, 2033752 total
[2019-05-21T13:46:45.617+0000][9630][gc,age ] GC(271) - age 6: 128 bytes, 2033880 total
[2019-05-21T13:46:45.617+0000][9630][gc,heap ] GC(271) ParNew: 282158K->2653K(314560K)
[2019-05-21T13:46:45.617+0000][9630][gc,heap ] GC(271) CMS: 88354K->88355K(699072K)
[2019-05-21T13:46:45.617+0000][9630][gc,metaspace ] GC(271) Metaspace: 85648K->85648K(1128448K)
[2019-05-21T13:46:45.617+0000][9630][gc ] GC(271) Pause Young (Allocation Failure) 361M->88M(989M) 5.387ms
[2019-05-21T13:46:45.617+0000][9630][gc,cpu ] GC(271) User=0.01s Sys=0.00s Real=0.00s
[2019-05-21T13:46:45.617+0000][9630][safepoint ] Leaving safepoint region
[2019-05-21T13:46:45.617+0000][9630][safepoint ] Total time for which application threads were stopped: 0.0057277 seconds, Stopping threads took: 0.0000429 seconds
[2019-05-21T13:46:46.617+0000][9630][safepoint ] Application time: 1.0004453 seconds
[2019-05-21T13:46:46.617+0000][9630][safepoint ] Entering safepoint region: Cleanup
[2019-05-21T13:46:46.617+0000][9630][safepoint ] Leaving safepoint region
But to be precise, I don't see anything critical there (except the allocation failure).
These lines also appear in the log even when everything goes well.
If I restart my script (which deletes the old index and creates a new one), ES updates the items fast again, just like the first time.
So my question is:
Why does ES lose performance if I
insert/update/read/delete data ... insert/update/read/delete data ...
but works fine if I
insert/update/read, restart the script, insert/update/read
?
It has nothing to do with Elasticsearch.
It was my fault: I was not closing WebSocket connections, which slowed the server down as it leaked resources.
Sorry, guys, for taking your time.
How can I keep the JVM heap in Elasticsearch from reaching 100%, and make the garbage collector clean up the heap?
In my case, when the JVM heap gets to 99% the site doesn't respond anymore. Any tips?
Here is a screenshot from Marvel.
elasticsearch.yml config
cluster.name: xxx
node.name: xxxx
node.data: true
node.master: true
bootstrap.mlockall: true
transport.tcp.compress: true
transport.tcp.port: 9300
http.port: 9200
http.max_content_length: 500mb
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms
script.engine.groovy.inline.aggs: on
script.inline: on
script.indexed: on
index.max_result_window: 10000
And a heap size of 24g on a 9G index.
EDIT -- GC LOG
: 553545K->8330K(613440K), 0.0557428 secs] 732889K->188315K(25097728K), 0.0558526 secs] [Times: user=0.40 sys=0.00, real=0.06 secs]
2017-06-05T10:55:25.416+0200: 88.445: Total time for which application threads were stopped: 0.0561829 seconds, Stopping threads took: 0.0000684 seconds
2017-06-05T10:55:26.416+0200: 89.446: Total time for which application threads were stopped: 0.0002122 seconds, Stopping threads took: 0.0000832 seconds
2017-06-05T10:55:26.833+0200: 89.863: [GC (Allocation Failure) 2017-06-05T10:55:26.834+0200: 89.863: [ParNew
Desired survivor size 34865152 bytes, new threshold 6 (max 6)
- age 1: 530096 bytes, 530096 total
- age 2: 920856 bytes, 1450952 total
- age 3: 1311064 bytes, 2762016 total
- age 4: 258584 bytes, 3020600 total
- age 5: 600504 bytes, 3621104 total
- age 6: 717384 bytes, 4338488 total
: 553674K->5918K(613440K), 0.0450758 secs] 733659K->186089K(25097728K), 0.0451704 secs] [Times: user=0.35 sys=0.00, real=0.04 secs]
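For what it's worth, one way to watch the heap while reproducing this, assuming the http.port: 9200 from the config above, is the nodes stats API:

$ curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | grep heap_used_percent

A heap_used_percent stuck near 99 matches the symptom described above.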