I am seeing odd performance issues with the fetch.max.message.bytes parameter in the librdkafka consumer implementation (version 0.11). I ran some tests using kafkacat over a slow network link (4 Mbps) and got the following results:
1024 bytes = 1.740s
65536 bytes = 2.670s
131072 bytes = 7.070s
When I started debugging the protocol messages, I noticed RTT values that are way too high:
|SEND|rdkafka| Sent FetchRequest (v4, 68 bytes @ 0, CorrId 8)
|RECV|rdkafka| Received FetchResponse (v4, 131120 bytes, CorrId 8, rtt 607.68ms)
It seems that increasing fetch.max.message.bytes causes very high network saturation, yet each response carries only a single message.
On the other hand, when I try kafka-console-consumer everything runs as expected (I get a throughput of 500 messages per second over the same network link).
Any ideas or suggestions on where to look?
You are most likely hitting issue #1384 which is a bug with the new v0.11.0 consumer. The bug is particularly evident on slow links or with MessageSets/batches with few messages.
A fix is on the way.
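As a rough sanity check on the numbers in the question, the time a response spends on the wire can be estimated from its size and the link speed. This is only a sketch: `wire_time_ms` is a hypothetical helper, and it ignores TCP/IP header overhead, slow start and propagation delay, so it gives a lower bound rather than a prediction.

```python
def wire_time_ms(payload_bytes, link_mbps):
    """Lower bound on transfer time: payload bits divided by the link rate."""
    bits = payload_bytes * 8
    return bits / (link_mbps * 1_000_000) * 1000.0


# Sizes from the question; 131120 is the FetchResponse size in the log line.
for size in (1024, 65536, 131120):
    print(f"{size:>7} bytes -> {wire_time_ms(size, 4):7.2f} ms")
```

Under these assumptions the 131120-byte response needs about 262 ms on a 4 Mbps link, so a sizeable share of the reported 607 ms RTT is just the cost of moving a response of that size, even though it delivers a single message.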
We have set up an EFK stack for our project, and since yesterday Kibana appears to be down. When we first troubleshot it, we found the following errors:
Readiness probe failed: Error: Got HTTP code 503 but expected a 200 & Readiness probe failed: Error: Got HTTP code 000 but expected a 200
Later we found the same issue with the Elasticsearch pod as well. Along with this, we found the following data request limit error:
FATAL
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent]
Data too large, data for [indices:admin/template/get] would be
[1036909172/988.8mb], which is larger than the limit of
[1020054732/972.7mb], real usage: [1036909056/988.8mb], new bytes
reserved: [116/116b], usages [request=0/0b, fielddata=420/420b,
in_flight_requests=67310/65.7kb, model_inference=0/0b,
eql_sequence=0/0b,
accounting=110294544/105.1mb]","bytes_wanted":1036909172,"bytes_limit":1020054732,"durability":"PERMANENT"}],"type":"circuit_breaking_exception","reason":"[parent]
Data too large, data for [indices:admin/template/get] would be
[1036909172/988.8mb], which is larger than the limit of
[1020054732/972.7mb], real usage: [1036909056/988.8mb], new bytes
reserved: [116/116b], usages [request=0/0b, fielddata=420/420b,
in_flight_requests=67310/65.7kb, model_inference=0/0b,
eql_sequence=0/0b,
accounting=110294544/105.1mb]","bytes_wanted":1036909172,"bytes_limit":1020054732,"durability":"PERMANENT"},"status":429}
We have tried changing REDINESS_PROBE_TIMEOUT, initial delay, timeout, probe period, success threshold, and failure threshold. We also tried increasing the indices breaker limit, but the change is not taking effect: the error still shows the old limit. We also tried to fix the circuit_breaking_exception by adding ES_JAVA_OPTS values.
Nothing seems to work; any help would be appreciated.
The same phenomenon occurred while our service was in operation. We identified it as a memory shortage, so there are several ways to approach it:
Physical memory expansion (scale out)
Add equipment, because there is not enough memory available.
Lower the load through monitoring
If circuit_breaking_exception keeps appearing in the log, build monitoring that reduces the load.
Setting java_opts
You can set the memory usage, but it is meaningless if you don't have enough physical memory.
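The limit printed in the error message can be used to estimate the JVM heap that Elasticsearch is actually running with. This is a sketch that assumes the parent breaker is at its default of 95% of the heap (the default when `indices.breaker.total.use_real_memory` is enabled); if the limit has been overridden, the estimate does not apply. The helper name is hypothetical.

```python
PARENT_BREAKER_FRACTION = 0.95  # assumed default, not confirmed by the post


def estimated_heap_bytes(bytes_limit, fraction=PARENT_BREAKER_FRACTION):
    """Invert limit = fraction * heap to estimate the configured heap size."""
    return round(bytes_limit / fraction)


bytes_limit = 1020054732   # "bytes_limit" from the error above
bytes_wanted = 1036909172  # "bytes_wanted" from the error above

heap = estimated_heap_bytes(bytes_limit)
print(f"estimated heap: {heap / 2**20:.0f} MiB")
print(f"over the limit by: {(bytes_wanted - bytes_limit) / 2**20:.1f} MiB")
```

With these assumptions the heap comes out at about 1 GiB, which suggests that a larger heap set through ES_JAVA_OPTS may not be reaching the Elasticsearch process. It is worth checking the heap size the node reports before tuning the breaker itself.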
I created a test in JMETER
Add > Sampler > HTTP Request = Get
Server Name = dainikbhaskar.com
No. of threads(users) = 1
Ramp-up period (seconds) = 1,
Loop Count = 1
(My internet connection is a broadband one with a speed of 50 Mbps.)
I ran the test and it succeeded; latency came out as 127, and sometimes less than 100 in subsequent executions.
I switched off my Wi-Fi, connected my laptop to a mobile hotspot, and executed the same test.
Now the latency is 607, 932, 373, 542, 915.
I believe this is due to the INTERNET CONNECTION SPEED, as the rest of the inputs are the same.
Please confirm whether my understanding is correct. :)
It is correct.
You can also get network latency from https://www.speedtest.net/ or https://fast.com/
Latency is the time from sending the request until the first byte of the response arrives, the so-called "time to first byte".
In JMeter's world:
Latency is Connect Time + Time to send the request + time to get the first byte of response
Elapsed time is Latency + time to get the last byte of the response.
More information:
JMeter Glossary
Understanding Your Reports: Part 1 - What are KPIs?
If you get 5 times higher latency on the other connection, it means that the majority of the time is spent on packets travelling back and forth. You can get a more precise picture by looking at the Network tab of your browser's developer tools or by using a dedicated tool like Lighthouse.
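The JMeter definitions of connect time, latency and elapsed time given above can be reproduced by hand with a few timestamps. This is a minimal sketch, not how JMeter itself is implemented: `measure` is a hypothetical helper, it uses plain HTTP on port 80 by default, and it treats the moment `getresponse()` returns (status line and headers parsed) as an approximation of the first byte.

```python
import http.client
import time


def measure(host, port=80, path="/"):
    """Return (connect_ms, latency_ms, elapsed_ms) for a single GET request."""
    start = time.perf_counter()
    conn = http.client.HTTPConnection(host, port, timeout=10)
    conn.connect()                    # TCP handshake
    connected = time.perf_counter()
    conn.request("GET", path)         # send the request
    response = conn.getresponse()     # returns once the headers have arrived
    first_byte = time.perf_counter()
    response.read()                   # drain the body up to the last byte
    done = time.perf_counter()
    conn.close()

    def to_ms(t):
        return (t - start) * 1000.0

    return to_ms(connected), to_ms(first_byte), to_ms(done)
```

Calling `measure("dainikbhaskar.com")` once over Wi-Fi and once over the mobile hotspot should show whether the extra time already appears in the connect phase, which would point at the network path rather than the server.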
I have been testing MongoDB 2.6.7 for the last couple of months using YCSB 0.1.4. I have captured good data comparing SSD to HDD and am producing engineering reports.
After my testing was completed, I wanted to explore the allanbank async driver. When I got it up and running (I am not a developer, so it was a challenge for me), I first wanted to try the rebuilt sync driver. I found performance improvements of 30-100%, depending on the workload, and was very happy with it.
Next, I tried the async driver. I was not able to see much difference between it and my results with the native driver.
The command I'm running is:
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://192.168.0.13:27017/ycsb -p mongodb.writeConcern=strict -threads 96
Over the course of my testing (mostly with the native driver), I have experimented with more and fewer threads than 96; turned on "noatime"; tried both xfs and ext4; disabled hyperthreading; disabled half my 12 cores; put the journal on a different drive; changed sync from 60 seconds to 1 second; and checked the network bandwidth between the client and server to ensure it's not oversubscribed (10GbE).
Any feedback or suggestions welcome.
The async move exceeded my expectations. My experience is with the Python sync driver (pymongo) and async driver (motor), and the async driver achieved more than 10x the throughput. Furthermore, motor still uses pymongo under the hood but adds the async ability. That could easily be the case with your allanbank driver.
Often the dramatic changes come from threading policies and OS configurations.
Async needn't and shouldn't use more threads than there are cores on the VM or machine. For example, if your server code spawns a new thread per incoming connection, then all bets are off. Start by looking at the way the driver is being used. A 4-core machine should use <= 4 incoming threads.
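The point about threads can be illustrated with a small sketch in which one thread overlaps many waits. It uses `asyncio.sleep` as a stand-in for a database round trip, so it is not a MongoDB benchmark; the function names and the 50 ms delay are assumptions made for the example.

```python
import asyncio
import threading
import time


async def fake_query(i, delay=0.05):
    """Simulate one request; the sleep stands in for network wait time."""
    await asyncio.sleep(delay)
    return i, threading.get_ident()


async def run_all(n):
    # All n "queries" are in flight at once on a single event loop.
    return await asyncio.gather(*(fake_query(i) for i in range(n)))


def demo(n=96):
    start = time.perf_counter()
    results = asyncio.run(run_all(n))
    elapsed = time.perf_counter() - start
    thread_ids = {tid for _, tid in results}
    return len(results), len(thread_ids), elapsed


if __name__ == "__main__":
    count, threads, elapsed = demo()
    print(f"{count} requests on {threads} thread(s) in {elapsed:.2f}s")
```

All 96 simulated requests complete on a single thread in far less than 96 times the delay, because the waits overlap. Adding threads beyond the core count buys nothing for this kind of I/O-bound work.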
On the OS level, you may have to fine-tune parameters like net.core.somaxconn, net.core.netdev_max_backlog, fs.file-max, and nofile in /etc/security/limits.conf. The best place to start is nginx-related performance guides, including this one. nginx is the server that spearheaded this kind of tuning, or at least caught the attention of many Linux sysadmin enthusiasts. Contrary to popular lore, you should reduce your keep-alive timeout rather than lengthen it. The default keep-alive timeout is some absurd number of seconds (4 hours). You might want to cut the cord after 1 minute. Basically, think of a short, sweet relationship with your clients' connections.
Bear in mind that Mongo is not async, so you can use a Mongo driver pool. Nevertheless, don't let the driver get stalled on slow queries. Cut them off after 5 to 10 seconds using the Java equivalents of the following settings. I'm just cutting and pasting here, with no recommendations.
# Specifies a time limit for a query operation. If the specified time is exceeded, the operation will be aborted and ExecutionTimeout is raised. If max_time_ms is None no limit is applied.
# Raises TypeError if max_time_ms is not an integer or None. Raises InvalidOperation if this Cursor has already been used.
CONN_MAX_TIME_MS = None
# socketTimeoutMS: (integer) How long (in milliseconds) a send or receive on a socket can take before timing out. Defaults to None (no timeout).
CLIENT_SOCKET_TIMEOUT_MS=None
# connectTimeoutMS: (integer) How long (in milliseconds) a connection can take to be opened before timing out. Defaults to 20000.
CLIENT_CONNECT_TIMEOUT_MS=20000
# waitQueueTimeoutMS: (integer) How long (in milliseconds) a thread will wait for a socket from the pool if the pool has no free sockets. Defaults to None (no timeout).
CLIENT_WAIT_QUEUE_TIMEOUT_MS=None
# waitQueueMultiple: (integer) Multiplied by max_pool_size to give the number of threads allowed to wait for a socket at one time. Defaults to None (no waiters).
CLIENT_WAIT_QUEUE_MULTIPLY=None
Hopefully you will have the same success. I was ready to bail on Python prior to async.
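One driver-independent way to apply timeouts like the ones listed above is to put them in the MongoDB connection string. This is a sketch under assumptions: `mongo_uri` is a hypothetical helper, the 10-second socket timeout and 5-second wait-queue timeout are illustrative values taken from the "5 to 10 seconds" advice, and whether a given driver honours every option should be checked in that driver's documentation.

```python
from urllib.parse import urlencode


def mongo_uri(host, db, **timeouts_ms):
    """Build a mongodb:// URI, skipping options that are left as None."""
    options = {k: v for k, v in timeouts_ms.items() if v is not None}
    query = urlencode(options)
    return f"mongodb://{host}/{db}" + (f"?{query}" if query else "")


# Host and database taken from the YCSB command in the question.
uri = mongo_uri(
    "192.168.0.13:27017",
    "ycsb",
    socketTimeoutMS=10000,     # illustrative value
    connectTimeoutMS=20000,    # the default quoted above
    waitQueueTimeoutMS=5000,   # illustrative value
)
print(uri)
```

The resulting string could then be passed as the `mongodb.url` property used in the YCSB command.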
I am building an autocomplete feature and realized that the round-trip time between the client and server is too high (in the range of 450-700 ms).
My first stop was to check whether this is the result of server delay.
But as you can see, the request times in these Nginx logs are almost always 0.001 (request time is the last column). Nginx reports request time in seconds, so that is about 1 millisecond, hardly a cause for concern.
So it became evident that I am losing time between the server and the client. My benchmark is Google Instant's response times, which are almost always in the range of 30-40 milliseconds, an order of magnitude lower.
Although it’s easy to say that Google has massive infrastructure to deliver at this speed, I wanted to push myself to learn whether this is possible for someone who is not at that level. Even if I can't reach 60 milliseconds, I want to shave off 100-150 milliseconds.
Here are some of the strategies I’ve managed to learn.
Enable httpd slowstart and initcwnd
Ensure SPDY if you are on https
Ensure results are http compressed
Etc.
What are the other things I can do here?
e.g.
Does having a persistent connection help?
Should I reduce the response size dramatically?
Edit:
Here are the ping and traceroute numbers. The site is served via Cloudflare from a Fremont Linode machine.
mymachine-Mac:c name$ ping site.com
PING site.com (160.158.244.92): 56 data bytes
64 bytes from 160.158.244.92: icmp_seq=0 ttl=58 time=95.557 ms
64 bytes from 160.158.244.92: icmp_seq=1 ttl=58 time=103.569 ms
64 bytes from 160.158.244.92: icmp_seq=2 ttl=58 time=95.679 ms
^C
--- site.com ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 95.557/98.268/103.569/3.748 ms
mymachine-Mac:c name$ traceroute site.com
traceroute: Warning: site.com has multiple addresses; using 160.158.244.92
traceroute to site.com (160.158.244.92), 64 hops max, 52 byte packets
1 192.168.1.1 (192.168.1.1) 2.393 ms 1.159 ms 1.042 ms
2 172.16.70.1 (172.16.70.1) 22.796 ms 64.531 ms 26.093 ms
3 abts-kk-static-ilp-241.11.181.122.airtel.in (122.181.11.241) 28.483 ms 21.450 ms 25.255 ms
4 aes-static-005.99.22.125.airtel.in (125.22.99.5) 30.558 ms 30.448 ms 40.344 ms
5 182.79.245.62 (182.79.245.62) 75.568 ms 101.446 ms 68.659 ms
6 13335.sgw.equinix.com (202.79.197.132) 84.201 ms 65.092 ms 56.111 ms
7 160.158.244.92 (160.158.244.92) 66.352 ms 69.912 ms 81.458 ms
I may well be wrong, but personally I smell a rat. Your times aren't justified by your setup; I believe that your requests ought to run much faster.
If at all possible, generate a short query using curl and intercept it with tcpdump on both the client and the server.
It could be a bandwidth/concurrency problem on the hosting. Check out its diagnostic panel, or try estimating the traffic.
You can try saving a query response into a static file and then requesting that file (taking care not to trigger the local browser cache...), to see whether the problem might be in processing the data (either server or client side).
Does this slowness affect every request, or only the autocomplete ones? If the latter, and no matter what nginx says, it might be some inefficiency/delay in recovering or formatting the autocompletion data for output.
Also, you can try serving a static response that bypasses nginx altogether, in case this is an issue with nginx (and for that matter: have you checked nginx's error log?).
One approach I didn't see you mention is to use SSL sessions: you can add the following into your nginx conf to make sure that an SSL handshake (very expensive process) does not happen with every connection request:
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 10m;
See "HTTPS server optimizations" here:
http://nginx.org/en/docs/http/configuring_https_servers.html
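To see why the handshake matters at the round-trip time shown in the ping output, the network cost of one request can be counted in round trips. This is a back-of-the-envelope sketch: it assumes the site is served over HTTPS, that a full TLS handshake costs 2 round trips and a resumed session 1 (typical for TLS 1.2), and it ignores DNS lookups, server processing and transfer time. The function name is hypothetical.

```python
def request_cost_ms(rtt_ms, new_connection=True, tls_handshake_rtts=2):
    """Approximate network time for one HTTPS request, counted in round trips."""
    round_trips = 1                        # the HTTP request/response itself
    if new_connection:
        round_trips += 1                   # TCP three-way handshake
        round_trips += tls_handshake_rtts  # TLS negotiation
    return round_trips * rtt_ms


AVG_RTT_MS = 98.268  # average from the ping output in the question

print(round(request_cost_ms(AVG_RTT_MS)))                        # new connection, full handshake
print(round(request_cost_ms(AVG_RTT_MS, tls_handshake_rtts=1)))  # new connection, resumed session
print(round(request_cost_ms(AVG_RTT_MS, new_connection=False)))  # reused keep-alive connection
```

Under these assumptions a fresh connection costs about four round trips, close to 400 ms before any data arrives, while a reused connection costs about one. That is in the same range as the 450-700 ms reported, so persistent connections and session reuse are likely to matter more than response size here.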
I would recommend using New Relic if you aren't already. It is possible that the server-side code you have could be the issue. If you think that might be the issue, there are quite a few free code profiling tools.
You may want to consider an option to preload autocomplete options in the background while the page is rendered and then save a trie or whatever structure you use on the client in the local storage. When the user starts typing in the autocomplete field you would not need to send any requests to the server but instead query local storage.
Web SQL Database and IndexedDB introduce databases to the clientside.
Instead of the common pattern of posting data to the server via
XMLHttpRequest or form submission, you can leverage these clientside
databases. Decreasing HTTP requests is a primary target of all
performance engineers, so using these as a datastore can save many
trips via XHR or form posts back to the server. localStorage and
sessionStorage could be used in some cases, like capturing form
submission progress, and have seen to be noticeably faster than the
client-side database APIs.
For example, if you have a data grid component or an inbox with
hundreds of messages, storing the data locally in a database will save
you HTTP roundtrips when the user wishes to search, filter, or sort. A
list of friends or a text input autocomplete could be filtered on each
keystroke, making for a much more responsive user experience.
http://www.html5rocks.com/en/tutorials/speed/quick/#toc-databases
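The client-side lookup described above does not need a full trie to be fast; a sorted list with binary search already answers prefix queries without any network round trip. The sketch below shows the idea in Python for brevity. In a browser the same structure would be written in JavaScript and persisted in local storage, and the `PrefixIndex` class and its sample words are made up for the example.

```python
import bisect


class PrefixIndex:
    """Sorted-list prefix lookup, a simple alternative to a trie."""

    def __init__(self, words):
        self._words = sorted(w.lower() for w in words)

    def complete(self, prefix, limit=10):
        prefix = prefix.lower()
        # Binary search for the first word that is >= the prefix.
        i = bisect.bisect_left(self._words, prefix)
        matches = []
        while i < len(self._words) and len(matches) < limit:
            if not self._words[i].startswith(prefix):
                break  # sorted order: no later word can match
            matches.append(self._words[i])
            i += 1
        return matches


index = PrefixIndex(["kafka", "kibana", "kotlin", "jmeter", "nginx"])
print(index.complete("k"))
print(index.complete("ki"))
```

Each keystroke then costs a binary search over data already on the client, so the 450-700 ms round trip is paid once when the list is preloaded rather than on every query.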
I'm noticing some weird SNMP communication behavior, in terms of timeouts and retries, when using the MS SNMP Mgmt API. I was wondering if the Mgmt API is supported on Win Server 2008 R1 x64. My program is a 64-bit C++ SNMP extension agent that uses the Mgmt API to communicate with other agents as well.
This is my pseudo code:
SnmpMgrOpen(ip address, 150ms timeout, 3 retries)
start = getTickCount()
result = SnmpMgrRequest(get request with 3 or 4 OIDs)
finish = getTickCount()
if (result == some error)
{
    log Error including total time (i.e. finish - start ticks)
}
SnmpMgrClose()
When the SnmpMgrRequest call times out, the total time is anywhere from 1014 ms to 5000 ms. If I set retries to 0, the total time is still 1014 ms to 5000 ms.
I would expect that with retries set to 0, SnmpMgrRequest would time out within 150 ms. The documentation seems to imply this. Am I missing something? Is there a minimum timeout period of at least a second? What could be causing this behavior?
Any help would be greatly appreciated. I'm at a loss here.
From my experience with SNMP on Windows platforms, the minimum timeout value is 1 second. So even if you set it to a lower value, it will default to 1 second.
Also, the timeout value used is doubled for every retry. So with a 150 ms, 3-retry configuration, in the worst case a failed request will take 1+2+2+2 = 7 seconds.
I hope this helps.