MarkLogic - Slow Fsync Notice/warning in errorlog file continuously - performance

We are running a MarkLogic instance on AWS, using magnetic disks to store the data.
We are seeing a lot of slow fsync messages in our log files:
2019-07-10 00:00:01.756 Info: Memory 46% phys=31816 virt=51033(160%) rss=14950(46%) anon=13748(43%) file=2903(9%) forest=7442(23%) cache=10240(32%) registry=1(0%)
2019-07-10 00:00:02.036 Notice: Slow fsync /data/failover/Forests/test-003-1-1/Journals/Journal-20190709-235839-8921048-15627167181139510-10676271677428093868-9000702, 562.1 KB in 1.637 sec
2019-07-10 00:00:02.042 Notice: Slow fsync /data/Forests/test-modules/Label, 1.316 sec
2019-07-10 00:00:02.043 Notice: Slow fsync /data/Forests/Schemas/Label, 1.305 sec
2019-07-10 00:00:02.043 Notice: Slow fsync /data/Forests/Security/Label, 1.312 sec
2019-07-10 00:00:02.195 Notice: Slow fsync /data/Logs, 1.22 sec
2019-07-10 00:00:13.836 Warning: Slow fsync /data/failover/Forests/test-003-1-1/Label, 2.445 sec
2019-07-10 00:00:13.886 Warning: Slow msync /data/Forests/test-001-1/0000844d/Ordinals, 1 MB in 2.007 sec
2019-07-10 00:00:13.888 Notice: Slow fsync /data/failover/Forests/test-002-1-1/Label, 1.995 sec
2019-07-10 00:00:14.139 Info: Merged 444 MB in 94 sec at 5 MB/sec to /data/Forests/test-001-1/0000844b
2019-07-10 00:00:14.995 Info: Merging 690 MB from /data/Forests/test-001-1/0000844b, /data/Forests/test-001-1/00008449, /data/Forests/test-001-1/0000844a, and /data/Forests/test-001-1/0000844c to /data/Forests/test-001-1/0000844e, timestamp=15627162115706539
2019-07-10 00:00:42.740 Info: Saved 84 MB in 24 sec at 4 MB/sec to /data/failover/Forests/test-002-1-1/000041b5
2019-07-10 00:00:45.861 Info: Merged 193 MB in 58 sec at 3 MB/sec to /data/failover/Forests/test-002-1-1/000041b6
What is the reason for the "slow fsync" messages above? Do they mean that the disks are slow, or that there is network congestion? How can we find out the cause of these messages?
Also, do they imply that query execution will be slow as well, or is there any other impact on MarkLogic performance?

This knowledge base article has a lot of great detail about these error messages.
In particular, an fsync should complete in milliseconds, so seeing one take about 2.5 seconds to complete is very concerning:
2019-07-10 00:00:13.836 Warning: Slow fsync /data/failover/Forests/test-003-1-1/Label, 2.445 sec
The purpose of fsync is to "synchronize a file's in-core state with storage device". A slow fsync essentially means your disk is running slowly. The impact of this is that reading or writing data directly on the disk can take longer. There can be a number of reasons why this may happen. Some things to check:
Do you have a lot of master forests on that host due to failover? Correctly balancing the master forests across all hosts may help.
Is there a correlation between slow fsync and running queries? Optimizing your queries to pull fewer documents off disk may help.
Do you have software besides MarkLogic running on that host? (NodeJS app, Splunk, etc.) Letting MarkLogic run exclusively on that host may help.
It's generally a good idea to work this through with MarkLogic Support or your friendly neighborhood consultant if you can't quickly identify the cause.
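If you want to confirm whether the storage itself is the culprit, one quick check (outside MarkLogic) is to time raw fsync calls on the same volume. The sketch below is a minimal, hypothetical probe, not a MarkLogic utility; the /data path and ~512 KB payload are assumptions chosen to roughly match the journal write in the log above, so adjust them for your host. On healthy storage each fsync should come back in a few milliseconds.
```python
import os
import time

TEST_FILE = "/data/fsync_probe.tmp"   # assumed: a path on the volume that shows slow fsync
PAYLOAD = os.urandom(512 * 1024)      # ~512 KB, roughly the size of the journal write above

fd = os.open(TEST_FILE, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
try:
    for i in range(10):
        os.write(fd, PAYLOAD)
        start = time.monotonic()
        os.fsync(fd)                  # the same system call MarkLogic is timing
        print(f"fsync {i}: {time.monotonic() - start:.3f} sec")
finally:
    os.close(fd)
    os.remove(TEST_FILE)
```
If this probe also reports multi-second fsync times while MarkLogic is under load, the bottleneck is at the disk/EBS layer (magnetic volumes running out of IOPS, for example) rather than anything inside MarkLogic.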

Related

HBase write performance degrades 4-5 days after a restart

We are facing this issue in our cluster, where we use Phoenix to write the data. Our jobs work fine initially, but after a few days (4-5 days) we see a drastic increase in job time (from 4 minutes to 30 minutes) even though the input data size is almost the same. Restarting HBase solves the issue for the next 4-5 days.
We have 70 region servers with 128 GB each, and each job issues about 50k puts per region server × 70 region servers.
From the region server logs I can see the frequency of responseTooSlow warnings increase from 40k/day to 280k/day, but the reported processing time is less than 1000 ms in those entries.
2018-04-18 00:00:07,831 WARN [RW.default.writeRpcServer.handler=10,queue=4,port=16020] ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","starttimems":1524009607697,"responsesize":106,"method":"Multi","processingtimems":134,"client":"192.168.25.70:54718","queuetimems":0,"class":"HRegionServer"}
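A small log-scraping sketch along these lines (the region server log path below is an assumption; point it at your actual RS log) can reproduce the per-day responseTooSlow counts and check how many of those entries actually exceed 1000 ms of processing time:
```python
import json
from collections import Counter

LOG_PATH = "/var/log/hbase/regionserver.log"   # assumed path to the RS log

per_day = Counter()
over_1s = 0
with open(LOG_PATH) as log:
    for line in log:
        if "(responseTooSlow)" not in line:
            continue
        per_day[line[:10]] += 1                       # date prefix, e.g. "2018-04-18"
        payload = json.loads(line[line.index("{"):])  # the JSON blob at the end of the WARN line
        if payload.get("processingtimems", 0) >= 1000:
            over_1s += 1

for day, count in sorted(per_day.items()):
    print(day, count)
print("entries with processingtimems >= 1000:", over_1s)
```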

Observing frequent log file switches despite increasing the redo log size

We had redo logs sized 256 MB, then bumped them up to 512 MB and eventually 1024 MB, and we currently have 8 log groups. Despite that, we are observing a log switch happening every minute, and it is eating into our performance.
A snapshot from the AWR report:
Load Profile
                         Per Second   Per Transaction   Per Exec   Per Call
DB Time(s):                     1.0               0.1       0.00       0.01
DB CPU(s):                      0.6               0.1       0.00       0.01
Redo size:                 34,893.0           4,609.0
Instance Activity Stats - Thread Activity
Statistics identified by '(derived)' come from sources other than SYSSTAT
Statistic                   Total   per Hour
log switches (derived)         82      59.88
Any suggestions on how to reduce the number of log file switches? I have read that ideally there should be about one switch every 15-20 minutes.
34,893 bytes of redo per second is about 125,614,800 bytes per hour, roughly 120 MB, nowhere near the size of one redo log group.
Based on this and the size of the redo logs, I would say something is forcing log switches periodically. The built-in parameter archive_lag_target forces a log switch after the specified number of seconds elapses; that is the first thing I would check. Other than that, it could be anything else logging in to the database and forcing a log switch manually, e.g. a cron job (60 log switches in 60 minutes is quite suspicious).
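A quick way to check both ideas is to read archive_lag_target and count the switches per hour from V$LOG_HISTORY. The sketch below assumes the python-oracledb driver and a connection that can read the V$ views; the credentials/DSN are placeholders, and the same queries can of course be run directly in SQL*Plus:
```python
import oracledb

# Placeholder connection details; replace with your own.
conn = oracledb.connect(user="system", password="change_me", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# Is a timed switch configured?
cur.execute("SELECT value FROM v$parameter WHERE name = 'archive_lag_target'")
print("archive_lag_target:", cur.fetchone()[0])

# How many log switches per hour over the last day?
cur.execute("""
    SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour, COUNT(*) AS switches
    FROM   v$log_history
    WHERE  first_time > SYSDATE - 1
    GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
    ORDER  BY hour
""")
for hour, switches in cur:
    print(hour, switches)
```
A cluster of exactly one switch per minute in the V$LOG_HISTORY output, independent of redo volume, would confirm that something is forcing the switches on a timer rather than the logs actually filling up.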

Elasticsearch and Logstash Performance Tuning

On a single-node Elasticsearch setup together with Logstash, we tested parsing 20 MB and 200 MB log files into Elasticsearch on different AWS instance types, i.e. medium, large, and xlarge.
Environment details: medium instance, 3.75 GB RAM, 1 core, storage: 4 GB SSD, 64-bit, network performance: moderate
Instance running with: Logstash, Elasticsearch
Scenario 1
**With default settings**
Result:
20 MB logfile: 23 min, 175 events/sec
200 MB logfile: 3 hr 3 min, 175 events/sec
Added the following to settings:
Java heap size: 2 GB
bootstrap.mlockall: true
indices.fielddata.cache.size: "30%"
indices.cache.filter.size: "30%"
index.translog.flush_threshold_ops: 50000
indices.memory.index_buffer_size: 50%
# Search thread pool
threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 100
**With added settings**
Result:
20 MB logfile: 22 min, 180 events/sec
200 MB logfile: 3 hr 7 min, 180 events/sec
Scenario 2
Environment details: r3.large instance, 15.25 GB RAM, 2 cores, storage: 32 GB SSD, 64-bit, network performance: moderate
Instance running with: Logstash, Elasticsearch
**With default settings**
Result:
20 MB logfile: 7 min, 750 events/sec
200 MB logfile: 65 min, 800 events/sec
Added the following to settings:
Java heap size: 7 GB
Other parameters same as above
**With added settings**
Result:
20 MB logfile: 7 min, 800 events/sec
200 MB logfile: 55 min, 800 events/sec
Scenario 3
Environment details:
R3 high-memory extra large (r3.xlarge) instance, 30.5 GB RAM, 4 cores, storage: 32 GB SSD, 64-bit, network performance: moderate
Instance running with: Logstash, Elasticsearch
**With default settings**
Result:
20 MB logfile: 7 min, 1200 events/sec
200 MB logfile: 34 min, 1200 events/sec
Added the following to settings:
Java heap size: 15 GB
Other parameters same as above
**With added settings**
Result:
20 MB logfile: 7 min, 1200 events/sec
200 MB logfile: 34 min, 1200 events/sec
I wanted to know:
What is the benchmark for this performance?
Does the performance meet the benchmark, or is it below it?
Why can't I see any difference even after increasing the Elasticsearch JVM heap?
How do I monitor Logstash and improve its performance?
I appreciate any help on this, as I am new to Logstash and Elasticsearch.
I think this situation is related to the fact that Logstash uses fixed-size queues (The Logstash event processing pipeline).
Logstash sets the size of each queue to 20. This means a maximum of 20 events can be pending for the next stage. The small queue sizes mean that Logstash simply blocks and stalls safely when there’s a heavy load or temporary pipeline problems. The alternatives would be to either have an unlimited queue or drop messages when there’s a problem. An unlimited queue can grow unbounded and eventually exceed memory, causing a crash that loses all of the queued messages.
I think what you should try is to increase the worker count with the '-w' flag.
On the other hand, many people say that Logstash should be scaled horizontally rather than by adding more cores and GB of RAM (How to improve Logstash performance).
You have set the Java heap size correctly with respect to your total memory, but I think you are not utilizing it properly. I hope you understand what the fielddata size is: the default is 60% of the heap size, and you are reducing it to 30%.
I don't know why you are doing this, and my perception might be wrong for your use case, but it is a good habit to allocate indices.fielddata.cache.size: "70%" or even 75%. With this setting, however, you must also set something like indices.breaker.total.limit: "80%" to avoid an Out Of Memory (OOM) exception. See Limiting Memory Usage for further details.
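Before changing those percentages, it is worth checking whether fielddata is actually the pressure point. A minimal sketch, assuming the node is reachable on localhost:9200 (exact response field names vary a little between Elasticsearch versions), reads the fielddata and circuit-breaker node stats over the HTTP API:
```python
import requests

BASE = "http://localhost:9200"   # assumed single-node address

# Current fielddata memory usage and evictions per node
stats = requests.get(f"{BASE}/_nodes/stats/indices/fielddata").json()
for node_id, node in stats["nodes"].items():
    fd = node["indices"]["fielddata"]
    print(node.get("name", node_id),
          "fielddata bytes:", fd["memory_size_in_bytes"],
          "evictions:", fd["evictions"])

# Circuit breaker limits vs. current estimates
breakers = requests.get(f"{BASE}/_nodes/stats/breaker").json()
for node_id, node in breakers["nodes"].items():
    for name, b in node.get("breakers", {}).items():
        print(node.get("name", node_id), name,
              "limit:", b.get("limit_size"),
              "estimated:", b.get("estimated_size"))
```
Frequent evictions or an estimated size close to the breaker limit would support raising indices.fielddata.cache.size; if both stay low, the heap is probably not what limits your indexing throughput here.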

What is Elasticsearch bounded by? Is it CPU, memory, etc.?

I am running Elasticsearch on my personal box.
Memory: 6GB
Processor: Intel® Core™ i3-3120M CPU @ 2.50GHz × 4
OS: Ubuntu 12.04 - 64-bit
ElasticSearch Settings: Only running locally
Version : 1.2.2
ES_MIN_MEM=3g
ES_MAX_MEM=3g
threadpool.bulk.queue_size: 3000
indices.fielddata.cache.size: 25%
http.compression: true
bootstrap.mlockall: true
script.disable_dynamic: true
cluster.name: elasticsearch
index size: 252MB
Scenario:
I am trying to test the performance of my bulk queries/aggregations. The test case runs asynchronous HTTP requests to Node.js, which in turn calls Elasticsearch. The tests are run from a Java method, starting with 50 requests at a time. Each request is divided and parallelized into two asynchronous (async.parallel) bulk queries in Node.js. I am using the node-elasticsearch API (which uses the Elasticsearch 1.3 API). The two bulk queries contain 13 and 10 queries respectively, and both are sent asynchronously to Elasticsearch from Node.js. When Elasticsearch returns, the query results are combined and sent back to the test case.
Observations:
I see that all the CPU cores are utilized at 100%, and memory utilization is around 90%. The response time for all 50 requests combined is 30 seconds. If I run the individual queries from the bulk requests on their own, each returns in less than 100 milliseconds. Node.js takes negligible time to forward requests to Elasticsearch and combine the responses.
Even if I run the test case synchronously from Java, the response time does not change. I may say that Elasticsearch is not doing parallel processing. Is this because I am CPU or memory bound? One more observation: if I change the Elasticsearch heap size from 1 to 3 GB, the response time does not change.
Also I am pasting top command output:
top - 18:04:12 up 4:29, 5 users, load average: 5.93, 5.16, 4.15
Tasks: 224 total, 3 running, 221 sleeping, 0 stopped, 0 zombie
Cpu(s): 98.2%us, 1.0%sy, 0.0%ni, 0.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 5955796k total, 5801920k used, 153876k free, 1548k buffers
Swap: 6133756k total, 708336k used, 5425420k free, 460436k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17410 root 20 0 7495m 3.3g 27m S 366 58.6 5:09.57 java
15356 rmadd 20 0 1015m 125m 3636 S 19 2.2 1:14.03 node
Questions:
Is this expected because I am running Elasticsearch on my local machine and not in a cluster? Can I improve performance on my local machine? I will definitely start a cluster, but I want to know how to improve performance scalably. What is it that Elasticsearch is bound by?
I was not able to find this in the forums, and I am sure it would help others. Thanks for your help.

CodeIgniter backup database (how much can it back up?)

When reading the user guide for CodeIgniter, I noticed under the database utilities that the backup feature says "backing up very large databases may not be possible."
What exactly does this mean? How much would it be able to back up?
From CodeIgniter's user guide:
Due to the limited execution time and memory available to PHP, backing up very large databases may not be possible.
The configuration of your server (hosting your website/CodeIgniter) will likely limit the maximum execution time of a PHP script and the total memory available to PHP. Determining what size database you can backup will depend entirely on your specific server configuration. Running this backup utility with your database on your server and benchmarking the results - CodeIgniter's Benchmarking Class may help here - will help you determine what size database you can backup. You can potentially change your server's configuration to allocate more resources to PHP as required.
I decided to benchmark this backup function with a few different databases. This was just out of curiosity, so I wouldn't rely on these results, but they may be of interest.
Database 1
306.4 KB
78 Tables
279 rows
Results:
Execution time: 0.0603s
Peak memory usage: 3 MB
Database 2
1 MB
11 Tables
165 rows
Results:
Execution time: 0.0350s
Peak memory usage: 3.25 MB
Database 3
16.6 MB
4 Tables
403 rows
Results:
Execution time: 0.6335s
Peak memory usage: 30.5 MB
Database 4
6.5 MB
9 Tables
93,289 rows
Results:
Execution time: 7.1702s
Peak memory usage: 91.25 MB
