Why is lpop increasing Redis CPU usage? - go

I have an application which keeps looping while calling lpop. Using the top command, I can see that redis is using 64% of CPU, while my application uses 101%.
I'm using redis to create a queue and worker. My worker is in an infinite loop, calling lpop and waiting for the next job to come in.
For this, I'm using the machinery package. There is an issue for this here, where the problem is said to be from lpop. However, since the comments are confusing, I'm at a loss as to what the difference is between LPOP and BLPOP, apart from the fact that one doesn't block and the other does.

Using timed BLPOP instead of LPOP to avoid massive CPU usage
committed 7 days ago
commit 54315dd9fe56a13b8aba2d2a8868fc48dfbb5795
machinery/v1/brokers/redis.go
- itemBytes, err := conn.Do("LPOP", redisBroker.config.DefaultQueue)
+ itemBytes, err := conn.Do("BLPOP", redisBroker.config.DefaultQueue, "1")
Use the latest version of machinery/v1/brokers/redis.go, which changes LPOP to BLPOP.
Reference: Redis commands: BLPOP
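The practical difference: LPOP returns immediately (nil when the list is empty), so a worker polling it in a tight loop sends commands nonstop and burns CPU on both the application and Redis. BLPOP blocks on the server until an element arrives or the timeout expires, so an idle worker costs roughly one round trip per timeout interval. Below is a minimal sketch of such a worker loop with a 1-second BLPOP timeout, written against the redigo client; the queue name is illustrative, not machinery's actual configuration.

package main

import (
    "fmt"
    "log"

    "github.com/gomodule/redigo/redis"
)

func main() {
    conn, err := redis.Dial("tcp", "localhost:6379")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    for {
        // BLPOP waits on the server for up to 1 second instead of returning
        // immediately, so an empty queue costs one round trip per second
        // rather than a tight polling loop.
        reply, err := redis.Values(conn.Do("BLPOP", "machinery_tasks", 1))
        if err == redis.ErrNil {
            continue // timeout expired, queue still empty
        }
        if err != nil {
            log.Fatal(err)
        }
        // BLPOP returns [key, value]; the job payload is the second element.
        job, _ := redis.String(reply[1], nil)
        fmt.Println("got job:", job)
    }
}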

Related

Sqlite concurrent writing performance

I'm writing a website with Golang and Sqlite3, and I expect around 1000 concurrent writes per second for a few minutes each day, so I ran the following test (error checking omitted for brevity):
t1 := time.Now()
tx, _ := db.Begin()
stmt, _ := tx.Prepare("insert into foo(stuff) values(?)")
defer stmt.Close()
for i := 0; i < 1000; i++ {
    _, _ = stmt.Exec(strconv.Itoa(i) + " - ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,./;'[]-=<>?:()*&^%$##!~`")
}
tx.Commit()
t2 := time.Now()
log.Println("Writing time: ", t2.Sub(t1))
And the writing time is about 0.1 second. Then I modified the loop to:
for i := 0; i < 1000; i++ {
    go func(stmt *sql.Stmt, i int) {
        _, err := stmt.Exec(strconv.Itoa(i) + " - ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,./;'[]-=<>?:()*&^%$##!~`")
        if err != nil {
            log.Fatal(err)
        }
    }(stmt, i)
}
This gives me a whopping 46.2 seconds! I ran it many times, and every run took over 40 seconds, sometimes even over a minute! Since Golang handles each user concurrently, does that mean I have to switch databases to make the webpage work? Thanks!
I recently evaluated SQLite3 performance in Go myself for a network application and learned that it needs a bit of setup before it is even remotely usable.
Turn on Write-Ahead Logging
You need to enable WAL: PRAGMA journal_mode=WAL. That's the main reason you're getting such bad performance. With WAL I can do 10,000 concurrent writes without transactions in a matter of seconds. Inside a transaction it will be lightning fast.
Disable the connection pool
I use mattn/go-sqlite3, and it opens the database with the SQLITE_OPEN_FULLMUTEX flag. That means every SQLite call is guarded by a lock, so everything is serialized, which is actually what you want with SQLite. The problem with Go in this situation is that you will get random errors telling you the database is locked, because of how database/sql works internally: it manages a pool of connections for you, so it will open multiple SQLite connections, and you don't want that. To solve this I had to essentially disable the pool. Call db.SetMaxOpenConns(1) and it will work. Even under very high load, with tens of thousands of concurrent reads and writes, it works without a problem.
Another option is to use SQLITE_OPEN_NOMUTEX to run SQLite in multi-threaded mode and let it manage concurrency for you. But SQLite doesn't really work well in multi-threaded apps: reads can happen in parallel, but only one write at a time. You will get occasional busy errors, which are completely normal for SQLite but require you to handle them - you probably don't want to drop a write operation entirely when that happens. That's why most of the time people use SQLite either synchronously or by sending all calls to a single dedicated thread.
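A minimal sketch of both suggestions together, assuming the mattn/go-sqlite3 driver and the foo table from the question:

package main

import (
    "database/sql"
    "log"

    _ "github.com/mattn/go-sqlite3"
)

func main() {
    db, err := sql.Open("sqlite3", "./foo.db")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Funnel all access through a single connection so the database/sql
    // pool cannot open competing SQLite connections ("database is locked").
    db.SetMaxOpenConns(1)

    // Switch to write-ahead logging for much better write performance.
    if _, err := db.Exec("PRAGMA journal_mode=WAL"); err != nil {
        log.Fatal(err)
    }

    if _, err := db.Exec("CREATE TABLE IF NOT EXISTS foo(stuff TEXT)"); err != nil {
        log.Fatal(err)
    }
    if _, err := db.Exec("INSERT INTO foo(stuff) VALUES(?)", "hello"); err != nil {
        log.Fatal(err)
    }
}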
I tested the write performance on go1.18 to see if parallelism works
 
Out of the Box
I used 3 goroutines incrementing different integer columns of the same record.
Parallelism conclusions:
Read code 5 percentage: 2.5%
Write code 5 percentage: 518% (waiting 5x in between attempts)
Write throughput: 2,514 writes per second
Code 5 is "database is locked (5) (SQLITE_BUSY)"
A few years ago on Node.js, the driver would crash with mere concurrency (not even parallelism) unless I serialized the writes, i.e. write concurrency = 1.
 
Serialized Writes
With golang I used github.com/haraldrudell/parl.NewModerator(1, context.Background()), i.e. serialized writes:
Serialized results:
read code 5: 0.005%
write code 5: 0.02%
3,032 writes per second (+20%)
Reads are not serialized, but they are held up by writes in the same thread. Writes seem to be 208x more expensive than reads.
Serializing writes in golang increases write performance by 20%.
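For illustration, the same serialized-writes effect can be had without any dependency by funnelling writes through a one-slot semaphore; this is a generic sketch of the idea, not the parl package's implementation:

package main

import (
    "database/sql"
    "fmt"
    "sync"

    _ "github.com/mattn/go-sqlite3"
)

func main() {
    db, err := sql.Open("sqlite3", "./foo.db")
    if err != nil {
        panic(err)
    }
    defer db.Close()
    if _, err := db.Exec("CREATE TABLE IF NOT EXISTS foo(stuff TEXT)"); err != nil {
        panic(err)
    }

    // One-slot semaphore: only one goroutine may write at a time,
    // while readers are unaffected.
    writeSlot := make(chan struct{}, 1)

    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            writeSlot <- struct{}{}        // acquire the single write slot
            defer func() { <-writeSlot }() // release it when done
            if _, err := db.Exec("INSERT INTO foo(stuff) VALUES(?)", fmt.Sprint(i)); err != nil {
                fmt.Println("write failed:", err)
            }
        }(i)
    }
    wg.Wait()
}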
 
PRAGMA journal_mode
Enabling sqlDB.Exec("PRAGMA journal_mode = WAL")
(from the default journal_mode: delete)
increases write performance to 18,329 writes per second, i.e. another 6x.
Code 5 goes to 0.
 
Multiple Processes
Using 3 processes x 3 goroutines with writes serialized per process lowers write throughput by about 5% and raises code 5 up to 200%. The good news is that file locking works without errors on macOS 12.3.1 with APFS.

How to configure Redis connections with Rails 4, Puma and Sidekiq?

I am using Sidekiq (on Heroku with Puma) to send emails asynchronously and would like to use Redis to keep counters and cache models.
RedisCloud's free plan includes 30 connections to Redis. It is not clear to me how to manage:
redis connections used by Sidekiq
redis connections used in models (caching and counters)
Sidekiq Client size is configured like this:
Sidekiq.configure_client do |config|
  config.redis = {url: ENV["REDISCLOUD_URL"], size: 3}
end
If I understood this correctly, Puma forks multiple processes, 2 in my case, which will result in:
2 (Puma Workers) * 3 (size) * 1 (Web Dyno) = 6 connections to redis used to push jobs.
Sidekiq Server
With Sidekiq taking 2 connections (or 5 in version 4), setting a concurrency of 10 would result in a default server size of 12 or 15.
If I wanted to use all the remaining available connections (30 - 6 = 24), I could set :
Sidekiq.configure_client do |config|
  config.redis = { size: 19 }
end
Total redis connections would be 19 + 5 (Sidekiq 4) = 24, and using the default concurrency of 25 would be OK.
As Mike Perham stated, the concurrency generally must not be more than (server pool size - 2) * 2.
Now, where it starts to get confusing for me is the use of Redis out of Sidekiq.
# initializers/redis.rb
$redis = Redis.new(:url => uri)
Whenever I use Redis in a model or controller I call like so:
$redis.hincrby("mycounter", "key", 1)
As I understand it, all the puma threads wait on each other on a single Redis connection when $redis.whateverFunction is called.
In the answer to What is the best way to use Redis in a Multi-threaded Rails environment? (Puma / Sidekiq), the recommended approach is to use the connection_pool gem, as described in the Sidekiq wiki: https://github.com/mperham/sidekiq/wiki/Advanced-Options#connection-pooling
require 'connection_pool'
$redis = ConnectionPool.new(size: 10) { Redis.new }
If I understand it right, in that case $redis.whateverFunction would have its own connection pool of 10, and Sidekiq its own pool, which would now have to fit in a new total of 20 Redis connections (30 available in total - 10 for the model connections), so the Sidekiq client and server sizes would need to be changed.
How do you determine the size of the connection pool (here 10) needed for model/controller Redis connections? And since Redis is single-threaded, how does increasing the connection pool actually increase Redis performance?
Any thoughts on this would be of great help.
Thx!
Redis is single-threaded, but it is written in pure C, uses an event loop internally, and handles connections asynchronously, so for the same number of requests the connection count does not affect it much. It can handle requests faster than your application can generate them (because of network latency, Ruby being slower than compiled, optimized C, and so on), so you don't need to worry about it being single-threaded.
Increasing the number of connections is beneficial for concurrent requests from different threads, because a thread doesn't have to wait for another request's response to come back over the network before its connection frees up, and Ruby can run IO in parallel.
You can tell the pool is too small when connection checkout times become worse than you can tolerate and the corresponding thread/worker sits idle waiting for a connection, so benchmark your code and take a good look at your actual usage and behavior patterns.
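Purely as an illustration of why a pool helps concurrent callers (sketched in Go with the redigo client, since that is what the other questions here use; the Ruby connection_pool gem behaves the same way conceptually): each goroutine checks out its own connection, issues its command, and returns the connection, instead of everyone queueing behind one socket.

package main

import (
    "log"
    "sync"

    "github.com/gomodule/redigo/redis"
)

func main() {
    // A pool of 10 connections: up to 10 callers can have a command in
    // flight at once instead of waiting behind a single connection.
    pool := &redis.Pool{
        MaxIdle:   10,
        MaxActive: 10,
        Wait:      true, // block when all 10 connections are checked out
        Dial: func() (redis.Conn, error) {
            return redis.Dial("tcp", "localhost:6379")
        },
    }

    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            conn := pool.Get() // check a connection out of the pool
            defer conn.Close() // return it to the pool when done
            if _, err := conn.Do("HINCRBY", "mycounter", "key", 1); err != nil {
                log.Println("increment failed:", err)
            }
        }()
    }
    wg.Wait()
}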
On the other hand, I'd advise against using the entire connection limit; there are times when you might need the extra connections. For example:
for graceful/"zero downtime" dyno restarts ("preboot") you need twice the connections, since the old processes keep running for some time
keep at least one free connection for emergency debugging, since you may want to connect directly from a console and see what data is inside when some unexpected high load hits

MongoDB-Java performance with rebuilt Sync driver vs Async

I have been testing MongoDB 2.6.7 for the last couple of months using YCSB 0.1.4. I have captured good data comparing SSD to HDD and am producing engineering reports.
After my testing was completed, I wanted to explore the allanbank async driver. When I got it up and running (I am not a developer, so it was a challenge for me), I first wanted to try the rebuilt sync driver. I found performance improvements of 30-100%, depending on the workload, and was very happy with it.
Next, I tried the async driver. I was not able to see much difference between it and my results with the native driver.
The command I'm running is:
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://192.168.0.13:27017/ycsb -p mongodb.writeConcern=strict -threads 96
Over the course of my testing (mostly with the native driver), I have experimented with more and fewer threads than 96; turned on "noatime"; tried both xfs and ext4; disabled hyperthreading; disabled half of my 12 cores; put the journal on a different drive; changed sync from 60 seconds to 1 second; and checked the network bandwidth between the client and server to ensure it's not oversubscribed (10GbE).
Any feedback or suggestions welcome.
The async move exceeded my expectations. My experience is with the Python sync driver (pymongo) and async driver (motor), and the async driver achieved greater than 10x the throughput. Further, motor still uses pymongo under the hood but adds the async capability; that could easily be the case with your allanbank driver as well.
Often the dramatic changes come from threading policies and OS configuration.
Async code needn't and shouldn't use more threads than there are cores on the VM or machine. For example, if your server code spawns a new thread per incoming connection, then all bets are off; start by looking at the way the driver is being used. A 4-core machine should use <= 4 incoming threads.
At the OS level, you may have to fine-tune parameters like net.core.somaxconn, net.core.netdev_max_backlog, sys.fs.file_max, and nofile in /etc/security/limits.conf. The best place to start is nginx-related performance guides, including this one; nginx is the server that spearheaded, or at least caught the attention of, many Linux sysadmin enthusiasts. Contrary to popular lore, you should shorten your keepalive timeout rather than lengthen it. The default keep-alive timeout is some absurd number of seconds (4 hours); you might want to cut the cord at 1 minute. Basically, think of it as a short, sweet relationship with your client connections.
Bear in mind that Mongo itself is not async, so you can use a Mongo driver pool. Nevertheless, don't let the driver get stalled on slow queries; cut them off after 5 to 10 seconds using the Java equivalents of the following settings. I'm just cutting and pasting here, with no specific recommendations.
# Specifies a time limit for a query operation. If the specified time is exceeded, the operation will be aborted and ExecutionTimeout is raised. If max_time_ms is None no limit is applied.
# Raises TypeError if max_time_ms is not an integer or None. Raises InvalidOperation if this Cursor has already been used.
CONN_MAX_TIME_MS = None
# socketTimeoutMS: (integer) How long (in milliseconds) a send or receive on a socket can take before timing out. Defaults to None (no timeout).
CLIENT_SOCKET_TIMEOUT_MS=None
# connectTimeoutMS: (integer) How long (in milliseconds) a connection can take to be opened before timing out. Defaults to 20000.
CLIENT_CONNECT_TIMEOUT_MS=20000
# waitQueueTimeoutMS: (integer) How long (in milliseconds) a thread will wait for a socket from the pool if the pool has no free sockets. Defaults to None (no timeout).
CLIENT_WAIT_QUEUE_TIMEOUT_MS=None
# waitQueueMultiple: (integer) Multiplied by max_pool_size to give the number of threads allowed to wait for a socket at one time. Defaults to None (no waiters).
CLIENT_WAIT_QUEUE_MULTIPLY=None
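For what it's worth, here is a hedged sketch of the same cutoff idea using the official Go MongoDB driver (go.mongodb.org/mongo-driver) rather than the drivers discussed above; the timeout values are illustrative, and the ycsb/usertable names mirror the YCSB defaults:

package main

import (
    "context"
    "log"
    "time"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
    // Client-level limits, roughly matching connectTimeoutMS / socketTimeoutMS.
    opts := options.Client().
        ApplyURI("mongodb://192.168.0.13:27017").
        SetConnectTimeout(20 * time.Second).
        SetSocketTimeout(10 * time.Second)

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    client, err := mongo.Connect(ctx, opts)
    if err != nil {
        log.Fatal(err)
    }
    defer client.Disconnect(context.Background())

    // Per-operation limit: abandon a query that takes longer than 5 seconds.
    queryCtx, cancelQuery := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancelQuery()

    var doc bson.M
    err = client.Database("ycsb").Collection("usertable").FindOne(queryCtx, bson.M{}).Decode(&doc)
    if err != nil {
        log.Println("query failed or timed out:", err)
    }
}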
Hopefully you will have the same success. I was ready to bail on Python before async came along.

FastRWeb performance on Ubuntu with built-in web server

I have installed FastRWeb 1.1-0 on an installation of R 2.15.2 (Trick or Treat) running on an Ubuntu 10.04 box. I hope to use the resulting system to run a web service.
I've configured the system by setting http.port to 8181 in rserve.conf and unsetting the socket destination. I've assigned .http.request to FastRWeb::.http.request. I exchange JSON blobs between the client and the server using HTTP POST (the second blob can exceed 150KB in size, and will not fit in an HTTP GET query string.)
Everything works end to end -- I have a little client-side R script which generates JSON RPC calls across the channel. I see the run function invoked, and see it returned.
I've run into a significant performance problem, however: the return path takes in excess of 12 seconds from the time run() returns (including the call to done()) and the time that the R client gets the return value. RCurl doesn't seem to be the culprit; it appears that something is taking twelve seconds to do a return.
Does anybody have any suggestions of where to look? I can easily shift over to using Apache 2.0 and CGI, but, honestly, I'd rather keep everything R centric.
Answering my own question.
I wrapped .http.request with an Rprof()/Rprof(NULL) pair and looked at the time spent in each routine. It turns out that the system spends ~11 seconds inside URLDecode in the standard implementation of .run. This looks like a scaling problem in URLDecode in the core.

Django1.3 multiple gunicorn workers caching problems

I have weird caching problems with the 1.3 version of Django. I probably have something configured wrong, but I'm not sure what.
A good example is django-avatar, which uses caching and is widely used. Even if I don't have a cache backend defined, the avatar seems to be cached, which by itself would be OK, but it keeps switching back and forth between the last values cached. Example: I upload a new avatar; now on approximately 50% of the requests it will show me the new one, and 50% the old one. If I delete the old one, I still get it on the site 50% of the time. The only way to fix it is to disable avatar caching by setting the timeout to one second.
First I thought it was because I used django.core.cache.backends.locmem.LocMemCache, which I had never used before, but it happens even when I don't configure a cache backend at all.
I found one similar bug:
Django caching bug .. even if caching is disabled
but my pages render just fine; it's the template tags (for now) that cause the problems in my setup.
I use django 1.3, postgres, nginx, gunicorn 0.12.0, greenlet==0.3.1, eventlet==0.9.16
I just did some more testing and realized that it only happens when I start gunicorn using the config file. If I start it with ./manage.py run_gunicorn everything is fine. Running "gunicorn_django -c deploy/gunicorn.conf.py" causes the problems.
The only explanation I can think of is that each worker gets its own cache (I wonder why, since I did not define a cache).
Update: running ./manage.py run_gunicorn -w 4 also causes the same problems. Therefore I am almost certain that the multiple workers are causing the problems and that each worker caches the values separately.
My configuration:
import os
import socket
import sys
PORT = 8000
PROC_NAME = 'myapp_gunicorn'
LOGFILE_NAME = 'gunicorn.log'
TIMEOUT = 3600
IP = '127.0.0.1'
DEPLOYMENT_ROOT = os.path.dirname(os.path.abspath(__file__))
SITE_ROOT = os.path.abspath(os.path.sep.join([DEPLOYMENT_ROOT, '..']))
CPU_CORES = os.sysconf("SC_NPROCESSORS_ONLN")
sys.path.insert(0, os.path.join(SITE_ROOT, "apps"))
bind = '%s:%s' % (IP, PORT)
logfile = os.path.sep.join([DEPLOYMENT_ROOT, 'logs', LOGFILE_NAME])
proc_name = PROC_NAME
timeout = TIMEOUT
worker_class = 'eventlet'
workers = 2 * CPU_CORES + 1
I also tried it without using 'eventlet', but got the same errors.
Thanks for any help.
It is most likely defaulting to the in-memory cache, which means each worker has its own version of the cache in its own memory space. If you hit worker 1 you get a different cache than if you hit worker 3. Nginx is most likely spreading the load between the workers via round-robin distribution, so you change workers on each hit, which explains your wacky results.
When you run manage.py run_gunicorn it is most likely running a single worker, and therefore a single cache, which is why you don't see the same results.
Using memcached or something similar is the way to go.
