Sqlite concurrent writing performance - go

I'm writing a website with Golang and Sqlite3, and I expect around 1000 concurrent writes per second for a few minutes each day, so I ran the following test (error checking omitted for brevity):
t1 := time.Now()
tx, _ := db.Begin()
stmt, _ := tx.Prepare("insert into foo(stuff) values(?)")
defer stmt.Close()
for i := 0; i < 1000; i++ {
_, _ = stmt.Exec(strconv.Itoa(i) + " - ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,./;'[]-=<>?:()*&^%$##!~`")
}
tx.Commit()
t2 := time.Now()
log.Println("Writing time: ", t2.Sub(t1))
And the writing time is about 0.1 second. Then I modified the loop to:
for i := 0; i < 1000; i++ {
go func(stmt *sql.Stmt, i int) {
_, err = stmt.Exec(strconv.Itoa(i) + " - ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789,./;'[]-=<>?:()*&^%$##!~`")
if err != nil {
log.Fatal(err)
}
}(stmt, i)
}
This takes a whopping 46.2 seconds! I ran it many times and every run took more than 40 seconds, sometimes over a minute. Since Golang handles each user concurrently, does this mean I have to switch databases to make the website work? Thanks!

I recently evaluated SQLite3 performance in Go myself for a network application and learned that it needs a bit of setup before it is even remotely usable.
Turn on the Write-Ahead Logging
You need to enable WAL with PRAGMA journal_mode=WAL. Its absence is the main reason you get such bad performance. With WAL I can do 10000 concurrent writes without transactions in a matter of seconds. Within a transaction it will be lightning fast.
Disable connections pool
I use mattn/go-sqlite3, and it opens the database with the SQLITE_OPEN_FULLMUTEX flag. This means that every SQLite call is guarded by a lock, so everything is serialized, and that's actually what you want with SQLite. The problem with Go in this situation is that you will get random errors telling you that the database is locked. The reason is the way sql.DB works internally: it manages a pool of connections for you, so it will open multiple SQLite connections, and you don't want that. To solve this I basically had to disable the pool. Call db.SetMaxOpenConns(1) and it will work. Even under very high load, with tens of thousands of concurrent reads and writes, it works without a problem.
Another solution might be to use SQLITE_OPEN_NOMUTEX to run SQLite in multi-threaded mode and let it manage that for you. But SQLite doesn't really work well in multi-threaded apps: reads can happen in parallel, but only one write can happen at a time. You will get occasional busy errors, which are completely normal for SQLite but require handling, since you probably don't want to abandon a write operation completely when that happens. That's why most of the time people work with SQLite either synchronously or by sending calls to a separate thread dedicated to SQLite.

I tested the write performance on go1.18 to see if parallelism works
 
Out of Box
I used 3 Go threads incrementing different integer columns of the same record.
Parallelism conclusions:
Read code 5 rate: 2.5%
Write code 5 rate: 518% (waiting 5x in between attempts)
Write throughput: 2,514 writes per second
Code 5 is "database is locked (5) (SQLITE_BUSY)".
A few years ago on Node.js, the driver crashed under mere concurrency, not parallelism, unless I serialized the writes, i.e. write concurrency = 1.
 
Serialized Writes
With Go I used github.com/haraldrudell/parl.NewModerator(1, context.Background()), i.e. serialized writes:
Serialized results:
read code 5: 0.005%
write code 5: 0.02%
3,032 writes per second (+20%)
Reads are not serialized, but they are held up by writes in the same thread. Writes seem to be 208x more expensive than reads.
Serializing writes in Go increases write performance by 20%.
 
PRAGMA journal_mode
Enabling sqlDB.Exec("PRAGMA journal_mode = WAL")
(from default: journalMode: delete)
increases write performance to 18,329/s, ie. another 6x
code 5 goes to 0
 
Multiple Processes
Using 3 processes x 3 threads with writes serialized per process lowers write throughput by about 5% and raises code 5 up to 200%. The good news is that file locking works without errors on macOS 12.3.1 with APFS.


Dataflow job has high data freshness and events are dropped due to lateness

I deployed an apache beam pipeline to GCP dataflow in a DEV environment and everything worked well. Then I deployed it to production in Europe environment (to be specific - job region:europe-west1, worker location:europe-west1-d) where we get high data velocity and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger to emit events every 30 seconds to release events sooner than the end of session (writing them to BigQuery).
The problem appears to happen in the EventToSession/GroupPairsByKey step. In this step there are thousands of events under the droppedDueToLateness counter, and dataFreshness keeps increasing (it has been increasing since I deployed the job). All steps before this one operate well, and all steps after it are affected by it but don't seem to have any other problems.
I looked into some metrics and see that the EventToSession/GroupPairsByKey step is processing between 100K keys to 200K keys per second (depends on time of day), which seems quite a lot to me. The cpu utilization doesn't go over the 70% and I am using streaming engine. Number of workers most of the time is 2. Max worker memory capacity is 32GB while the max worker memory usage currently stands on 23GB. I am using e2-standard-8 machine type.
I don't have any hot keys since each session contains at most a few dozen events.
My biggest suspicion is the huge number of keys being processed in the EventToSession/GroupPairsByKey step. On the other hand, a session usually relates to a single customer, so Google should expect to handle this many keys per second, no?
Would like to get suggestions how to solve the dataFreshness and events droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input.apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event)))
        .withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
    .apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event))).setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
    .apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
        .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.standardDays(30)))
    .apply("GroupPairsByKey", GroupByKey.create())
    .apply("CreateCollectionOfValuesOnly", Values.create())
    .apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
Regarding the constantly increasing data freshness: as long as late data is allowed to arrive for a session window, that window persists in memory. This means that allowing 30 days of late data keeps every session in memory for at least 30 days, which obviously can overload the system. Moreover, I found we had some everlasting sessions created by bots visiting and taking actions on the websites we monitor. These bots can hold sessions open forever, which can also overload the system. The solution was to decrease the allowed lateness to 2 days and to use bounded sessions (look for "bounded sessions").
Regarding events dropped due to lateness: these are events that, at their time of arrival, belong to an expired window, i.e. a window whose end the watermark has already passed (see the documentation for droppedDueToLateness here). These events are dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data, so the solution was to check each event's timestamp before it reaches the sessions part and to stream to the sessions part only events that won't be dropped, i.e. events that meet this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest are written to BigQuery without the session data. (Apparently Apache Beam drops an event if its timestamp is before event_arrival_time - (gap_duration + allowed_lateness), even if there is a live session the event belongs to...)
P.S. In the bounded sessions part, where he demonstrates how to implement a time-bounded session, I believe he has a bug that allows a session to grow beyond the provided max size. Once a session has exceeded the max size, one can send late data that intersects the session and precedes it, moving the session's start time earlier and thereby expanding the session. Furthermore, once a session has exceeded the max size, events that belong to it but don't extend it can no longer be added.
To fix that, I switched the order of the current window span and the if-statement, and edited the if-statement (the one checking for session max size) in the mergeWindows function in the window spanning part, so that a session can't exceed the max size and can only receive data that doesn't extend it beyond the max size. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
    List<IntervalWindow> sortedWindows = new ArrayList<>();
    for (IntervalWindow window : c.windows()) {
        sortedWindows.add(window);
    }
    Collections.sort(sortedWindows);
    List<MergeCandidate> merges = new ArrayList<>();
    MergeCandidate current = new MergeCandidate();
    for (IntervalWindow window : sortedWindows) {
        MergeCandidate next = new MergeCandidate(window);
        if (current.intersects(window)) {
            if ((current.union == null || new Duration(current.union.start(), window.end()).getMillis() <= maxSize.plus(gapDuration).getMillis())) {
                current.add(window);
                continue;
            }
        }
        merges.add(current);
        current = next;
    }
    merges.add(current);
    for (MergeCandidate merge : merges) {
        merge.apply(c);
    }
}
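The merge rule can also be seen in isolation on plain intervals. The following is a simplified sketch of the same idea, not the Beam code: window and mergeBounded are hypothetical names, windows are half-open [start, end) intervals in arbitrary time units, and maxSpan plays the role of maxSize plus gapDuration.

```go
package main

import (
	"fmt"
	"sort"
)

// window is a half-open interval [start, end), e.g. in minutes.
type window struct{ start, end int64 }

// mergeBounded merges intersecting windows, but refuses a merge that would
// stretch the merged span beyond maxSpan.
func mergeBounded(ws []window, maxSpan int64) []window {
	sorted := append([]window(nil), ws...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].start < sorted[j].start })

	var out []window
	for _, w := range sorted {
		if n := len(out); n > 0 {
			cur := &out[n-1]
			intersects := w.start < cur.end
			fits := w.end-cur.start <= maxSpan
			if intersects && fits {
				if w.end > cur.end {
					cur.end = w.end
				}
				continue
			}
		}
		out = append(out, w) // start a new session
	}
	return out
}

func main() {
	ws := []window{{45, 75}, {0, 30}, {20, 50}}
	fmt.Println(mergeBounded(ws, 60)) // [{0 50} {45 75}]
}
```

The third window intersects the merged session [0, 50) but would stretch it to 75 units, so it starts a session of its own instead.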

What is the read/write cost for Firestore docRef.Collections(ctx)?

I am tracking the read/write cost of my HTTP service layer functions.
Am I correct that Collection/Doc/Collection/Doc chains incur no reads?
reads := 0
bucketDocRef := s.fsClient.Collection("accounts").Doc(accountID).Collection("widgets").Doc(widgetID)
// no cost so far?
Also, what is the cost for a call to .Collections(ctx) ... is it 1 read for each collectionRef returned from iter.GetAll()?
iter := docRef.Collections(ctx)
colRefs, _ := iter.GetAll()
reads += len(colRefs)
Also, what is the cost if the call to iter.GetAll() results in an error?
Collection and Document are just builder functions. They don't do anything other than build references to collections and documents. They are not actually performing any queries or reading any data, which means they are effectively "free" from the perspective of Firestore billing.
In your example, you won't be billed anything until you call GetAll, which costs 1 read per document returned, plus whatever egress is required.

Why does ZeroMQ not receive a string when it becomes too large on a PUSH/PULL MT4 - Python setup?

I have an EA set in place that loops history trades and builds one large string with trade information. I then send this string every second from MT4 to the python backend using a plain PUSH/PULL pattern.
For whatever reason, the data isn't received on the pull side when the string transferred becomes too long. The backend PULL-socket slices each string and further processes it.
Any chance that the PULL-side is too slow to grab and process all the data which then causes an overflow (so that a delay arises due to the processing part)?
Talking about file sizes we are well below 5kb per second.
This is the PULL-socket, which manipulates the data after receiving it:
while True:
    # check 24/7 for available data in the pull socket
    try:
        msg = zmq_socket.recv_string()
        data = msg.split("|")
        print(data)
        # if data is available and msg is account info, handle as follows
        if data[0] == "account_info":
            [...]
    except zmq.error.Again:
        print("\nResource timeout.. please try again.")
        sleep(0.000001)
I am a bit curious now since the pull socket seems to not even be able to process a string containing 40 trades with their according information on a single MT4 client - Python connection. I actually planned to set it up to handle more than 5.000 MT4 clients - python backend connections at once.
Q : Any chance that the pull side is too slow to grab and process all the data which then causes an overflow (so that a delay arises due to the processing part)?
Zero chance.
Sending 640 B each second is definitely no showstopper ( 5kb per second - is nowhere near a performance ceiling... )
The posted problem formulation is otherwise undecidable.
Step 1) POSACK/NACK prove whether a PUSH side accepts the payload for sending error-free.
Step 2) prove the PULL side is not to be blamed - [PUSH.send(640*chr(64+i)) for i in range( 10 )] via a python-2-python tcp://-transport-class solo-channel crossing host-to-host hop, over at least your local physical network ( no VMCI/emulated vLAN, no other localhost colocation )
Step 3) if either steps above got POSACK-ed, your next chances are the ZeroMQ configuration space and/or the MT4-based PUSH-side incompatibility, most probably "hidden" inside a (not mentioned) third party ZeroMQ wrapper used / first-party issues with string handling / processing ( which you must have already read about, as it has been so many times observed and mentioned in the past posts about this trouble with well "hidden" MQL4 internal eco-system changes ).
Anyway, stay tuned. ZeroMQ is a sure bet and a true workhorse for professional, low-latency designs in the distributed-systems domain.

Why is lpop increasing Redis CPU usage?

I have an application which keeps looping while calling lpop. Using the top command, I can see that redis is using 64% of CPU, while my application uses 101%.
I'm using redis to create a queue and worker. My worker is in an infinite loop, calling lpop and waiting for the next job to come in.
For this, I'm using the machinery package. There is an issue for this here, where the problem is said to be from lpop. However, since the comments are confusing, I'm at a loss as to what the difference is between LPOP and BLPOP, apart from the fact that one doesn't block and the other does.
Using timed BLPOP instead of LPOP to avoid massive cpu usage
committed 7 days ago
commit 54315dd9fe56a13b8aba2d2a8868fc48dfbb5795
machinery/v1/brokers/redis.go
- itemBytes, err := conn.Do("LPOP", redisBroker.config.DefaultQueue)
+ itemBytes, err := conn.Do("BLPOP", redisBroker.config.DefaultQueue, "1")
Use the latest version of machinery/v1/brokers/redis.go, which changes LPOP to BLPOP.
Reference: Redis commands: BLPOP

Same code runs slower as a Windows service than a GUI application

I have some Delphi 2007 code which runs in two different applications: one is a GUI application and the other is a Windows service. The weird part is that while the GUI application technically seems to have more "to do" (drawing the GUI, calculating some stats and so on), the Windows service consistently uses more of the CPU when it runs. Where the GUI application uses around 3-4% CPU, the service uses in the region of 6-8%.
When running them together CPU loads of both applications approximately double.
The basic code is the same in both applications, except for the addition of the GUI code in the Windows Forms application.
Is there any reason for this behavior? Do Windows service applications have some kind of inherent overhead or do I need to look through the code to find the source of this, in my book, unexpected behavior?
EDIT:
Having had time to look more closely at the code, I think the suggestion below that the GUI application spends some time waiting for repaints, causing the CPU load to drop is likely incorrect. The applications are both threaded, meaning the GUI repaints should not influence the CPU load.
Just to be sure I first tried to remove all GUI components from the application, leaving only a blank form. That did not increase the CPU load of the program. I then went through and stripped out all calls to Synchronize in the working threads which were used to update the UI. This had the same result: The CPU load did not change.
The code in the service looks like this:
procedure TLsOpcServer.ServiceExecute(Sender: TService);
begin
  // Initialize OPC server as NT Service
  dmEngine.AddToLog( sevInfo, 'Service', 'Name', Sender.Name );
  AddLocalServiceKeysToRegistry( Sender.Name );
  dmEngine.AddToLog( sevInfo, 'Service', 'Execute', 'Started' );
  dmEngine.Start( True );
  //
  while not Terminated do
  begin
    ServiceThread.ProcessRequests( True );
  end;
  dmEngine.Stop;
  dmEngine.AddToLog( sevInfo, 'Service', 'Execute', 'Stopped' );
end;
dmEngine.Start will start and register the OPC server and initialize a socket. It then starts a thread which does... something to incoming OPC signals. The exact same call is made in FormCreate on the main form of the GUI application.
I'm going to look into how the GUI application starts next, I didn't write this code so trying to puzzle out how it works is a bit of an adventure :)
EDIT2
This is a little bit interesting. I ran both applications for exactly 1 minute each, running AQTime to benchmark them. This is the most interesting part of the results:
In the service:
Procedure name: TSignalList::HandleChild
Execution time: 20.105963821084
Hitcount: 5961231
In the GUI Application:
Procedure name: TSignalList::HandleChild
Execution time: 7.62424101324976
Hit count: 6383010
EDIT 3:
I'm finally back in a position where I can keep looking at this problem. I have found two procedures which both have about the same hitcount during a five minute run, yet in the service the execution time is much higher. For HandleValue the hitcount is 4 300 258 and the execution time is 21.77s in the service and in the GUI application the hitcount is 4 254 018 with an execution time of 9.75s.
The code looks like this:
function TSignalList.HandleValue(const Signal: string; var Tag: TTag; const CreateIfNotExist: Boolean): HandleStatus;
var
  Index: integer;
begin
  result := statusNoSignal;
  Tag := nil;
  if not Assigned( Values ) then
  begin
    Values := TValueStrings.Create;
    Values.CaseSensitive := defDefaultCase;
    Values.Sorted := True;
    Values.Duplicates := dupIgnore;
    Index := -1; // Guaranteed no items in list
  end else
  begin
    Index := Values.IndexOf( Signal );
  end;
  if Index = -1 then
  begin
    if CreateIfNotExist then
    begin
      // Value signal does not exist, create it
      Tag := TTag.Create;
      if Values.AddObject( Signal, Tag ) > -1 then
      begin
        result := statusAdded;
      end;
    end;
  end else
  begin
    Tag := TTag( Values.Objects[ Index ] );
    result := statusExist;
  end;
end;
Both applications enter the "CreateIfNotExist" case exactly the same number of times. TValueStrings is a direct descendant of TStringList without any overloads.
Have you timed the execution of core functionality? If so, did you measure a difference? I think, if you do, you won't find much difference between them, unless you add other functionality, like updating the GUI, to the code of that core functionality.
Consuming less CPU doesn't mean it's running slower. The GUI app could be waiting more often on repaints, which depend on the GPU as well (and maybe other parts of the system). Therefore, the GUI app may consume less CPU power, because the CPU is waiting for other parts of your system before it can continue with the next instruction.
