I have a database filled with rows and multiple threads that are accessing these rows, inputting some of the data from them in a function, producing an output, and then filling the row's missing columns with the output.
Here's the issue: Each row has an unprocessed flag which is, by default, true. So each thread is looking for rows with this flag. But each thread is getting the SAME row, it turns out...because the row is being marked as processed after the thread's job is complete, which may happen after a few seconds.
One way I avoided this was to add a currently_processed flag to each row, set to false by default, and change it to true once a thread accesses the row. Then, when the thread is done, just change it back to false. The problem with this is that I have to use some sort of locking and not allow any other thread to do anything until this occurs. I was wondering if there's an alternative approach where I wouldn't have to do thread locking (via a mutex or something) and thus slow down the whole process.
If it helps, the code is in Ruby, though the problem is language agnostic. Here's some code to demonstrate the type of threading I'm using. Nothing special, just threading at the lowest level, like almost all languages have:
# Start three worker threads and wait for all of them to finish.
3.times.map do
  Thread.new do
    row = get_database_row
    result = do_some_processing(row)
    insert_results_into_row(result)
  end
end.each(&:join)
The "real" answer here is that you need a database transaction. When one thread gets that row, then the database needs to know that this row is currently up for processing.
You can't resolve that within your application! You see, when two threads look at the same row at the same time, they could both try to write that flag ... and yep, it for sure changes to "currently processed"; and then both threads will update row data and write that back. Maybe that is not the problem if any processing results in the same final result; but if not, then all kinds of data integrity problems will arise.
So the real answer is that you step back and look at how your specific database is designed to deal with such things.
I was wondering if there's an alternative approach where I wouldn't have to do thread locking (via a mutex or something) and thus slow down the whole process.
There are some ways to do this:
1) One common dispatcher for all threads. It should read all rows and put them into a shared queue from which the processing threads will get rows.
2) Go deeper into the DB: find out whether it supports something like Oracle's "SELECT ... FOR UPDATE SKIP LOCKED" syntax and use it. For Oracle you need to use this syntax in a cursor, which makes for somewhat cumbersome interaction, but at least it can work this way.
3) Partition input by, say, index of worker thread. So 1st worker out of 3 will only process rows 1,4,7 etc. 2nd worker will only process rows 2, 5, 8 etc.
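Options 1 and 3 can be sketched in a few lines of Ruby. This is only an illustration under assumptions: ROWS and process_row are hypothetical in-memory stand-ins for the real table and the real work, and the result is collected in memory instead of being written back to the database.

```ruby
# Hypothetical stand-ins for the real table and the real computation.
ROWS = (1..9).map { |id| { id: id, input: id * 10 } }

def process_row(row)
  row[:input] + 1 # placeholder for the real processing
end

# Option 1: a single dispatcher fills a thread-safe queue; workers only pop.
def run_with_dispatcher(rows, worker_count: 3)
  queue = Queue.new    # Queue#pop hands each item to exactly one thread
  results = Queue.new

  rows.each { |row| queue << row } # the dispatcher is the only reader of the table
  queue.close                      # once closed and empty, pop returns nil

  workers = Array.new(worker_count) do
    Thread.new do
      while (row = queue.pop)
        results << [row[:id], process_row(row)]
      end
    end
  end
  workers.each(&:join)

  # Drain the results into a plain array of [id, output] pairs.
  Array.new(results.size) { results.pop }
end

# Option 3: worker number worker_index (0-based) only sees every
# worker_count-th row, so no two workers can pick the same row.
def partition_for(rows, worker_index, worker_count)
  rows.select { |row| (row[:id] - 1) % worker_count == worker_index }
end

pairs = run_with_dispatcher(ROWS)
first_worker_ids = partition_for(ROWS, 0, 3).map { |row| row[:id] }
```

In both variants no row is handed to two workers, and the only synchronization is the queue itself. Partitioning by id assumes the ids are spread evenly enough that no worker is starved.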
Related
I am performing an import of data wrapped in a CMSTransactionScope.
What would be the most efficient and practical way to import data in parallel and roll back if there are any errors? The problem I see is that, with it being parallel, I don't know if the inserted objects can be part of the transaction if they are a part of a new thread.
Is there any way to do this or should it be handled differently?
If you're running the code in parallel in order to achieve better performance and you are basically inserting rows one by one then it's unlikely that it'll perform any better than it would while running in a single thread.
In this case I'd recommend using one thread in combination with CMSTransactionScope, and potentially ConnectionHelper.BulkInsert.
Anyway, if you still want to run your queries in parallel then you need to implement some kind of synchronization (locking, for instance) to ensure that all statements are executed before the code hits CMSTransactionScope.Commit() (this basically means a performance loss). Otherwise, queries would get executed in separate transactions. Moreover, you have to make sure that the CMSTransactionScope object always gets instantiated with the same IDataConnection (this should happen by default when you don't pass a connection to the constructor).
The second approach seems error prone to me and I'd rather take a look at different ways of optimizing the code (using async, etc.)
I'm currently implementing EventSourcing for my Go Actor lib.
The problem that I have right now is that when an actor restarts and needs to replay all its state from the event journal, the query might return inconsistent data.
I know that I can solve this using MutationToken
But, if I do that, I would be forced to write all events in sequential order, that is, write the last event last.
That way the mutation token for the last event would be enough to get all the data consistently for the specific actor.
This is, however, very slow: writing about 10 000 events in order takes about 5 sec on my setup.
If I instead write those 10 000 events asynchronously, using goroutines, I can write all of the data in less than one sec.
But then the writes are in nondeterministic order and I can't know which mutation token I can trust.
e.g. Event 999 might be written before Event 843 due to goroutine scheduling, AFAIK.
What are my options here?
Technically speaking, MutationToken and asynchronous operations are not mutually exclusive. It may be possible without a change to the client (I'm not sure), but the key here is to collect all the MutationToken responses and then issue the query with the highest sequence number per vbucket across all of them.
Given a single MutationToken, you can add the others to it. I don't directly see a way to do this, but since internally it's just a map it should be relatively straightforward, and I'm sure we (Couchbase) would take a contribution that does this. At the lowest level, it's just a map of vbucket sequences that is provided to query at the time the query is issued.
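The merge rule described above (keep the highest sequence number per vbucket) can be illustrated independently of any SDK. The sketch below is written in Ruby purely for illustration; the token shape (a hash with :vbucket and :seqno) and the merge_tokens helper are assumptions, not the Couchbase client API, and real tokens carry more fields than these two.

```ruby
# Hypothetical token shape: { vbucket: Integer, seqno: Integer }.
# Returns a map of vbucket => highest sequence number seen for it.
def merge_tokens(tokens)
  tokens.each_with_object({}) do |token, highest|
    vb = token[:vbucket]
    # Keep a token's seqno only if it is newer than what we already hold.
    highest[vb] = token[:seqno] if highest[vb].nil? || token[:seqno] > highest[vb]
  end
end

# Tokens as they might come back from out-of-order asynchronous writes.
tokens = [
  { vbucket: 12, seqno: 843 },
  { vbucket: 12, seqno: 999 },
  { vbucket: 7,  seqno: 15 },
  { vbucket: 12, seqno: 901 }
]
merged = merge_tokens(tokens)
```

Because the merge only keeps a maximum per vbucket, the order in which the writes complete does not matter; the merged map is what would be handed to the query.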
I found 4 "proper" ways to do this:
In the cheat sheet for ActiveRecord users, the substitutes for ActiveRecord's increment and increment_counter are supposed to be album.values[:column] -= 1 # or += 1 for increment and album.update(:counter_name=>Sequel.+(:counter_name, 1))
In an SO solution, update_sql is suggested for the same effect: s[:query_volume].update_sql(:queries => Sequel.expr(3) + :queries)
In a random thread I found this one: dataset.update_sql(:exp => 'exp + 10'.lit)
In the Sequel API docs for update I found this solution: http://sequel.jeremyevans.net/rdoc/classes/Sequel/Dataset.html#method-i-update
yet none of the solutions actually update the value and return the result in a safe, atomic way.
Solutions based on "adding a value and then saving" should, afaik, fail nondeterministically in multiprocessing environments, resulting in errors such as:
album's counter is 0
thread A and thread B both fetch album
thread A and thread B both increment the value in the hash/model/etc
thread A and thread B both update the counter to same value
as a result: A and B both set the counter to 1 and work with counter value 1
Sequel.expr and Sequel.+ on the other hand don't actually return a value, but a Sequel::SQL::NumericExpression and (afaik) you have no way of getting it out short of doing another DB roundtrip, which means this can happen:
album's counter is 0
thread A and B both increment the value, value is incremented by 2
thread A and B both fetch the row from the DB
as a result: A and B both set the counter to 2 and work with counter value 2
So, short of writing custom locking code, what's the solution? If there's none, short of writing custom locking code :) what's the best way to do it?
Update 1
I'm generally not happy with answers saying that I want too much of life, as 1 answer suggests :)
The albums are just an example from the docs.
Imagine, for example, that you have a transaction counter on an e-commerce POS which can accept 2 transactions at the same time on different hosts, and each one has to be sent to the bank with an integer counter that is unique within 24h (called systan). Send 2 trx with the same systan and 1 will be declined, or worse, gaps in the counts are alerted (because they hint at "missing transactions"), so it's not possible to use the DB's ID value.
A less severe example, but more related to my use case, several file exports get triggered simultaneously in a background worker, every file destination has its own counter. Gaps in the counters are alerted, workers are on different hosts (so mutexes are not useful). And I have a feeling I'll soon be solving the more severe problem anyway.
The DB sequences are no good either, because it would mean doing DDL on the addition of every terminal, and we're talking 1000s here. Even in my less severe use case, doing DDL on web portal actions is still a PITA, and might not even work depending on the caching scheme below (due to the implementation of ActiveRecord and Sequel, and in my case I use both, it might require a server restart just to register a merchant).
Redis can do this, but it seems insane to add another infrastructure component just for counters when you're sitting on an ACID-compliant database.
If you are using PostgreSQL, you can use UPDATE RETURNING: DB[:table].returning(:counter).update(:counter => Sequel.expr(1) + :counter)
However, without support for UPDATE RETURNING or something similar, there is no way to atomically increment at the same time as return the incremented value.
The answer is: in a multithreaded environment, don't use DB counters. When faced with this dilemma:
If I need a unique integer counter, use a threadsafe counter generator that parcels out counters as threads require them. This can be a simple integer or something more complex like a Twitter Snowflake-like generator.
If I need a unique identifier, I use something like a uuid
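A minimal sketch of the first suggestion, assuming all consumers live in one process: CounterGenerator is a hypothetical class name, and the Mutex only protects threads inside that process, so it does not by itself cover workers on different hosts.

```ruby
# Hypothetical in-process generator that hands out unique, gap-free integers.
class CounterGenerator
  def initialize(start = 0)
    @value = start
    @lock = Mutex.new
  end

  # Increment and return the new value as one step, so no two callers
  # can ever receive the same number.
  def next_value
    @lock.synchronize { @value += 1 }
  end
end

generator = CounterGenerator.new
issued = Queue.new

# Four threads each request 250 numbers concurrently.
threads = Array.new(4) do
  Thread.new { 250.times { issued << generator.next_value } }
end
threads.each(&:join)

values = Array.new(issued.size) { issued.pop }
```

The increment and the read happen under the same lock, which is the property the "add a value, then fetch it" approaches from the question lack.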
In your particular situation, where you need a count of albums - is there a reason you need this on the database rather than as a derived field on the model?
Update 1:
Given that you're dealing with something approximating file exports with workers on multiple hosts, you either need to parcel out the ids in advance (i.e. seed a worker with a job and the next available id from a single canonical source) or have the workers call in to a central service which allocates transaction ids on a first come first served basis.
I can't think of another way to do it. I've never worked with a POS system, but the telecoms network provisioning systems I've worked on have generally used a single transaction generator service which namespaced ids as appropriate.
We have an issue, more often than I would like, where either worker or client sessions crash while they were in the process of using a number sequence to create a new record. They end up literally blocking that number sequence, and anyone else trying to create a record using the same sequence will have their client frozen.
When this happens, I usually go into the NUMBERSEQUENCELIST table, spot the correct DataAreaId and the user, and delete the row whose Status = 1.
But this is kind of annoying, really. Is there any way I can configure the AOS server to release the number sequence when clients/workers crash?
For the worker sessions, I guess we can fine tweak the code which runs in them, but for the client sessions crashing, not much we can do...
Any ideas ?
Thanks!
EDIT: Turns out that in this situation, after restarting the AOS server, you can go in List in the number sequence menu, and clean it up. Prior to the restart, my client would freeze trying to do that. So no need to do it directly through SQL.
Continuous numbers in NumberSequenceList are automatically cleaned up every 24 hours (or as set up on the number sequence). The cleanup process is quite slow if there are many "dead" numbers (hundreds or thousands). This may be considered as a hang, but is not.
Things to consider:
Is a continuous number sequence needed?
Do the cleanup more frequently (say every half hour instead of the default 24 hours)
Set up the cleanup process as a batch process
Fix the bug in the client code using the number sequence
Also avoid reserving the number, just use it. Instead of the anti-pattern:
NumberSeq idSequence = NumberSeq::newGetNum(IntrastatParameters::numRefIntrastatArchiveID(), true);
this.IntrastatArchiveID = idSequence.num();
idSequence.used();
Just use the number:
this.IntrastatArchiveID = NumberSeq::newGetNum(IntrastatParameters::numRefIntrastatArchiveID()).num();
The makeDecisionLater parameter should only be used in forms, where user may decide not to use the number (by delete or by escape). And in that case the NumberSeqFormHandler class should be used anyway.
I'm trying to figure out whether or not SubSonic's AddMany() method is faster than a simple foreach loop. I poked around a bit on the SubSonic site but didn't see much on performance stats.
What I currently have (.ForEach() just has some validation in it; other than that it works just like foreach(.....){ do stuff }):
records.ForEach(record =>
{
    newRepository.Add(record);
    recordsProcessed++;
    if (cleanUp) oldRepository.Delete<T>(record);
});
Which would change to
newRepository.AddMany(records);
if (cleanUp) oldRepository.DeleteMany<T>(records);
If you notice with this method I lose the count of how many records I've processed which isn't critical... But it would be nice to be able to display to the user how many records were moved with this tool.
So my questions boil down to: Would AddMany() be noticeably faster to use? And is there any way to get a count of the number of records actually copied over? If it succeeds can I assume all the records were processed? If one record fails, does the whole process fail?
Thanks in advance.
Just to clarify, AddMany() generates individual queries per row and submits them via a batch; DeleteMany() generates a single query. Please consult the source code and the generated SQL when you want to know what happens to your queries.
Your first approach is slow: 2*N queries. However, if you submit the queries using a batch it would be faster.
Your second approach is faster: N+1 queries. You can find how many will be added simply by enumerating 'records'.
If there is a risk of exceeding capacity limits on the size of a batch, then submit 50 or 100 at a time with little penalty.
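The batching advice is independent of SubSonic, so here is a rough sketch of it in Ruby rather than C#. submit_in_batches is a hypothetical helper, and the block passed to it stands in for whatever actually sends one batch to the database; it also keeps the processed count the question asks about.

```ruby
# Hypothetical helper: hands the records to the caller's block in slices of
# batch_size and returns how many records were handed over in total.
def submit_in_batches(records, batch_size: 100)
  processed = 0
  records.each_slice(batch_size) do |batch|
    yield batch              # the caller submits this batch to the database
    processed += batch.size  # count only after the batch was submitted
  end
  processed
end

sent = []
count = submit_in_batches((1..230).to_a, batch_size: 100) do |batch|
  sent << batch.size # stand-in for the real batch submission
end
```

If a batch raises, the count is not advanced for that batch, so what happens to earlier batches depends on the transaction choice discussed in the next paragraph.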
Your final question depends on transactions. If the whole operation is one transaction, it will commit or abort as one. Otherwise, each query will stand alone. Your choice.