Should Parallel.ForEach be used in DB calls?

I've got a list of Foo IDs. I need to call a stored procedure for each ID.
For example:
Guid[] siteIds = ...; // typically contains 100 to 300 elements
foreach (var id in siteIds)
{
    db.MySproc(id); // Executes some stored procedure.
}
Each call is largely independent of the others, so this shouldn't cause contention in the database.
My question: would it be beneficial to parallelize this using Parallel.ForEach? Or is database IO going to be the bottleneck, so that more threads would just result in more contention?
I would measure it myself, but that is difficult in my test environment, where the data and load are much smaller than on our real web server.

Out of curiosity, why do you want to optimize it with Parallel.ForEach and spawn threads / open connections / pass data / get a response for every item, instead of writing a simple "sproc" that works with a list of IDs rather than a single ID?
At first glance, that should give you a much more noticeable improvement.

I would think that Parallel.ForEach would work, assuming that your DB server can handle the concurrent operations (at most the 100 to 300 calls in your list, if they all ran at once).
The only way to know for sure is to measure both.
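To make "measure both" concrete, here is a minimal sketch in Python (not the C# from the question) that runs the per-ID calls with a capped number of workers and times different caps. The function call_sproc is a hypothetical stand-in for db.MySproc(id), and the sleep only simulates a round trip, so the timings say nothing about a real database.

```python
from concurrent.futures import ThreadPoolExecutor
import time
import uuid


def call_sproc(site_id):
    # Hypothetical stand-in for db.MySproc(id); the sleep simulates the round trip.
    time.sleep(0.001)
    return site_id


def run_bounded(site_ids, max_workers=8):
    # max_workers caps how many calls are in flight at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the results.
        return list(pool.map(call_sproc, site_ids))


def timed(site_ids, max_workers):
    # Wall-clock time for one full pass at the given degree of parallelism.
    start = time.perf_counter()
    run_bounded(site_ids, max_workers)
    return time.perf_counter() - start


if __name__ == "__main__":
    ids = [uuid.uuid4() for _ in range(200)]
    for workers in (1, 4, 16):
        print(workers, "workers:", round(timed(ids, workers), 3), "s")
```

Capping the worker count keeps the load on the server predictable; the right cap is something only a measurement against a production-like database can tell you.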

Related

Auto save performance for rdbms

In my app, the user types in some content that I would like to auto-save as they type. The save call is not made on every keystroke; I auto-save only when the user pauses for more than 200 ms. So in a typical paragraph there are 15-20 server calls. The content will not be read very often, so I need to optimize the writes.
I have to save the data on MSSQL Server for legacy code reasons. I'm getting a 10-second average response time in my load test. How do I improve the performance?
One approach I'm considering: instead of saving directly to MSSQL, I'd save to Cassandra or Redis and then eventually (maybe at regular time intervals) write it to MSSQL.
Another approach: instead of doing frequent updates, I'd insert a new record for each auto-save. A background process would then clean up all records except the latest, every few minutes.
Update:
I replaced the existing logic with simple update calls to 2 tables and am now seeing improvements. There was a long stored procedure that was taking up to 10 seconds under load, so for now I have a hold on the problem. I would still like to know whether there is something I can do at the application server layer to reduce frequent DB calls.
It is quite hard to answer your question directly, but here are some hints based on what we do in a situation with multiple active users.
If you are writing/triggering on every keystroke, pass the keystroke to a background thread and do not perform the database write, or any network call, while blocking the user's typing. A fast typist can hit 20 keystrokes/second, and you cannot afford to introduce latency.
If recording on a web page, you might be able to use localStorage. Do not issue an AJAX-style call on every keystroke, as there is a limit on outstanding requests. You need to implement some kind of buffered send. Remember that network calls in the real world can take on the order of 300 ms just to traverse the network.
Do you really need to save every keystroke, or is every N seconds acceptable? Every save operation will eventually turn into a disk operation, so you really want to coalesce as many saves as possible. The quickest way to do something is not to do it at all.
If you are recording to a database, it is often quicker to update an existing row, provided you can fetch it by direct key first. Unfortunately, it can sometimes be quicker to insert a new row and clean up the excess later. This tends to be true if the table has few indexes. Which is quicker depends on the database engine in use and how it is being used. We use both methods.
When using a database keep in mind that they often keep journals of some kind, so if you are updating frequently you might create a large load on the journal files.
If you are writing to files directly (in C terms, fopen and fwrite), this can perform very well, but if you are worried about recovery after a system failure you may need to call fsync, which then limits your maximum write rate. If you need fsync, a database might be better.
You might like to consider writing to a transaction log table very frequently, and then posting to the real storage every N seconds. For example, if I am typing a customer's name, I might record every keystroke into a keylog table, and then have a background job read the keylog table and transfer the data to the customers table. This reduces the operations on the customers table while also allowing the keylog table to be optimised for recording keystrokes, at the cost of more code on the server side.
Overall, you want logic like this:
On keyup handler
    Add keystroke to background queue
    Wake background thread
Background thread
    Read/remove ALL data from background queue
    If no data, wait for wakeup and repeat
    Write to database/network/file etc. as one operation (these can now be synchronous calls)
    Optionally some velocity control; a simple one is sleep(50 ms) or sleep(2 s)
    Repeat
Keep in mind with the above the user can type and immediately hit close, so your final buffer write might not have flushed yet. You need to handle this.
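The logic above could be sketched roughly as follows in Python. This is only an illustration under assumptions: CoalescingWriter and write_batch are hypothetical names, write_batch stands for whatever does the actual database/network/file write, and close() is one possible way to handle the "user hits close immediately" case by flushing whatever is still buffered.

```python
import queue
import threading


class CoalescingWriter:
    """Buffers keystrokes and writes them in batches from a background thread."""

    def __init__(self, write_batch, pause_seconds=0.05):
        self._write_batch = write_batch  # hypothetical sink: DB, network, file
        self._pause = pause_seconds      # simple velocity control
        self._queue = queue.Queue()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def add(self, keystroke):
        # Called from the key handler; only enqueues, never waits on I/O.
        self._queue.put(keystroke)

    def _drain(self):
        # Remove everything currently in the queue without blocking.
        items = []
        while True:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                return items

    def _run(self):
        while not self._stop.is_set():
            try:
                first = self._queue.get(timeout=self._pause)
            except queue.Empty:
                continue  # no data yet; check the stop flag and wait again
            # One write for everything that accumulated since the last pass.
            self._write_batch([first] + self._drain())
            self._stop.wait(self._pause)  # pause, but wake early on close()

    def close(self):
        # Stop the thread, then flush whatever is still buffered.
        self._stop.set()
        self._thread.join()
        remaining = self._drain()
        if remaining:
            self._write_batch(remaining)
```

A caller would create one writer per editing session, call add() from the key handler, and call close() when the editor is closed or the page is unloaded.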
If you get this correct, the user will not notice any delay. In our usage, we are recording around 1000 keystrokes/sec on average, all of which are routed over private networks to central points. This load is barely a blip; even network monitoring does not see such a small amount of traffic.
Good luck.

How to 'lock' database rows being processed

I have a database filled with rows and multiple threads that are accessing these rows, inputting some of the data from them in a function, producing an output, and then filling the row's missing columns with the output.
Here's the issue: Each row has an unprocessed flag which is, by default, true. So each thread is looking for rows with this flag. But each thread is getting the SAME row, it turns out...because the row is being marked as processed after the thread's job is complete, which may happen after a few seconds.
One way I avoided this was to add a currently_processed flag to each row, set it to false by default, and change it to true once a thread accesses the row. When the thread is done, it just changes it back to false. The problem with this is that I have to use some sort of locking and not allow any other thread to do anything until this occurs. I was wondering if there's an alternative approach where I wouldn't have to do thread locking (via a mutex or something) and thus slow down the whole process.
If it helps, the code is in Ruby, although the problem is language agnostic. Here's the code to demonstrate the type of threading I'm using. It's nothing special, just threading at the lowest level, as almost all languages have:
threads = 3.times.map do
  Thread.new do
    row = get_database_row
    result = do_some_processing(row)
    insert_results_into_row(result)
  end
end
threads.each(&:join)
The "real" answer here is that you need a database transaction. When one thread gets that row, then the database needs to know that this row is currently up for processing.
You can't resolve that within your application! You see, when two threads look at the same row at the same time, they could both try to write that flag ... and yep, it for sure changes to "currently processed"; and then both threads will update row data and write that back. Maybe that is not the problem if any processing results in the same final result; but if not, then all kinds of data integrity problems will arise.
So the real answer is that you step back and look how your specific database is designed in order to deal with such things.
"I was wondering if there's an alternative approach where I wouldn't have to do thread locking (via a mutex or something) and thus slow down the whole process."
There are some ways to do this:
1) Use one common dispatcher for all threads. It reads all the rows and puts them into a shared queue, from which the processing threads take rows.
2) Go deeper into the DB: find out whether it supports something like Oracle's "SELECT ... FOR UPDATE SKIP LOCKED" syntax and use it. In Oracle you need to use this syntax in a cursor, which makes the interaction somewhat cumbersome, but at least it can work this way.
3) Partition the input by, say, the index of the worker thread. So the 1st worker out of 3 will only process rows 1, 4, 7, etc., and the 2nd worker will only process rows 2, 5, 8, etc.
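Options 1 and 3 can be sketched without any database at all. The Python below is only an illustration: process_with_dispatcher and partition are hypothetical helper names, and do_some_processing is passed in as a stand-in for the real per-row work.

```python
import queue
import threading


def process_with_dispatcher(rows, worker_count, do_some_processing):
    # Option 1: a dispatcher fills one shared queue before the workers start.
    work = queue.Queue()
    for row in rows:
        work.put(row)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                row = work.get_nowait()  # each row is handed to exactly one worker
            except queue.Empty:
                return                   # queue was pre-filled, so empty means done
            out = do_some_processing(row)
            with lock:
                results[row] = out

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def partition(rows, worker_count):
    # Option 3: worker i gets rows i, i + n, i + 2n, ... and nothing else.
    return [rows[i::worker_count] for i in range(worker_count)]
```

With the queue, the only synchronization is the hand-off itself; with partitioning there is no shared state at all, at the cost of uneven load if some rows take much longer than others.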

How to await a ParallelQuery with LINQ?

I have an async method that should look up database entries. It filters by name, and thus is a candidate for parallel execution.
However, I cannot find a simple way to support both parallel execution and asynchronous tasks.
Here's what I have:
private async Task<List<Item>> GetMatchingItems(string name) {
    using (var entities = new Entities()) {
        var items = from item in entities.Item.AsParallel()
                    where item.Name.Contains(name)
                    select item;
        return await items.ToListAsync(); //complains "ParallelQuery<Item> does not contain a definition for ToListAsync..."
    }
}
When I remove AsParallel(), it compiles. Am I not supposed to use both features at the same time? Or am I misunderstanding something?
IMHO, both make sense:
AsParallel() would indicate that the database query may get split up into several sub-queries running at the same time, because the individual matching of any item is not dependent on any other item. UPDATE: Bad idea in this example, see comments and answer!
ToListAsync() would support the asynchronous execution of this method, to allow other match methods (on other data) to start executing immediately.
How can I use both parallel execution (with LINQ) and asynchronous tasks at the same time?
You can't, and shouldn't. PLINQ isn't for database access. The database knows best how to parallelize the query and does that just fine on its own using normal LINQ. PLINQ is for querying in-memory objects where you are doing computationally expensive calculations within the LINQ query, so that it can parallelize the work on the client side (as opposed to parallelizing it on the database server side).
A better answer might be:
PLINQ is for distributing a query that is compute intensive across multiple threads.
Async is for returning the thread back so that others can use it because you are going to be waiting on an external resource (database, Disk I/O, network I/O, etc).
Async PLINQ doesn't have a very strong use case where you want to return the thread while you wait AND you have a lot of calculations to do... If you are busy calculating, you NEED the thread (or multiple threads). They are almost completely on different ends of optimization. If you need something like this, there are much better mechanisms like Tasks, etc.
Well, you can't. These are 2 different options that don't go together.
You can use Task.Run with async-await to parallelize the synchronous part of the asynchronous operation (i.e. what comes before the first await) on multiple ThreadPool threads, but without all the partition logic built into PLINQ, for example:
var tasks = enumerable.Select(item => Task.Run(async () =>
{
    LongSynchronousPart(item);
    await AsynchronousPartAsync(item);
}));
await Task.WhenAll(tasks);
In your case however (assuming it's Entity Framework) there's no value in using PLINQ as there's no actual work to parallelize. The query itself is executed in the DB.
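For readers more at home outside .NET, the same shape (push the synchronous part to a worker thread, await the asynchronous part, then wait for all items) might look like this in Python with asyncio. It is an analogue, not a translation of the C# API: long_synchronous_part and asynchronous_part are hypothetical placeholders, asyncio.to_thread needs Python 3.9+, and in standard CPython threads help with blocking calls rather than with CPU-bound Python code.

```python
import asyncio


def long_synchronous_part(item):
    # Hypothetical blocking step that runs before the first await.
    return item * item


async def asynchronous_part(value):
    # Hypothetical I/O-bound step; the sleep stands in for a real await.
    await asyncio.sleep(0.001)
    return value + 1


async def process(item):
    # Run the synchronous part on a worker thread so the event loop stays free.
    value = await asyncio.to_thread(long_synchronous_part, item)
    return await asynchronous_part(value)


async def process_all(items):
    # Rough counterpart of Task.WhenAll: wait for every item, keep input order.
    return await asyncio.gather(*(process(item) for item in items))
```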

SubSonic AddMany() vs foreach loop Add()

I'm trying to figure out whether or not SubSonic's AddMany() method is faster than a simple foreach loop. I poked around a bit on the SubSonic site but didn't see much on performance stats.
This is what I currently have. (.ForEach() just has some validation in it; other than that it works just like foreach(.....){ do stuff }.)
records.ForEach(record =>
{
    newRepository.Add(record);
    recordsProcessed++;
    if (cleanUp) oldRepository.Delete<T>(record);
});
Which would change to:
newRepository.AddMany(records);
if (cleanUp) oldRepository.DeleteMany<T>(records);
If you notice with this method I lose the count of how many records I've processed which isn't critical... But it would be nice to be able to display to the user how many records were moved with this tool.
So my questions boil down to: Would AddMany() be noticeably faster to use? And is there any way to get a count of the number of records actually copied over? If it succeeds can I assume all the records were processed? If one record fails, does the whole process fail?
Thanks in advance.
Just to clarify, AddMany() generates individual queries per row and submits them via a batch; DeleteMany() generates a single query. Please consult the source code and the generated SQL when you want to know what happens to your queries.
Your first approach is slow: 2*N queries. However, if you submit the queries using a batch it would be faster.
Your second approach is faster: N+1 queries. You can find how many will be added simply by enumerating 'records'.
If there is a risk of exceeding capacity limits on the size of a batch, then submit 50 or 100 at a time with little penalty.
Your final question depends on transactions. If the whole operation is one transaction, it will commit or abort as one. Otherwise, each query will stand alone. Your choice.
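To illustrate the batching and transaction points (this is not SubSonic), here is a small Python sketch against SQLite. The old_items and new_items tables and the move_records helper are hypothetical; the idea is simply to submit the rows in chunks, keep a running count, and wrap everything in one transaction so that it commits or rolls back as a whole.

```python
import sqlite3


def chunks(records, size):
    # Yield consecutive slices of at most `size` records.
    for start in range(0, len(records), size):
        yield records[start:start + size]


def move_records(conn, records, batch_size=100):
    """Copy records to new_items and delete them from old_items; return the count."""
    moved = 0
    with conn:  # one transaction: commits on success, rolls back on an exception
        for batch in chunks(records, batch_size):
            conn.executemany("INSERT INTO new_items (id, name) VALUES (?, ?)", batch)
            conn.executemany("DELETE FROM old_items WHERE id = ?", [(r[0],) for r in batch])
            moved += len(batch)
    return moved
```

Because the count is only returned after the transaction has committed, it can be shown to the user as the number of records actually moved.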

Synchronous calls in akka / actor model

I've been looking into Akka lately, and it looks like a great framework for building scalable servers on the JVM. However, most of the libraries on the JVM are blocking (e.g. JDBC), so don't you lose out on the performance benefits of using an event-based model because your threads will always be blocked? Does Akka do something to get around this? Or is it just something you have to live with until we get more non-blocking libraries on the JVM?
Have a look at CQRS, it greatly improves scalability by separating reads from writes. This means that you can scale your reads separately from your writes.
With the types of IO blocking issues you mentioned Scala provides a language embedded solution that matches perfectly: Futures. For example:
def expensiveDBQuery(key : Key) = Future {
  //...query the database
}
val dbResult : Future[Result] =
  expensiveDBQuery(...) //non-blocking call
The function call returns dbResult immediately. The Result will be available in the "Future". The cool part about a Future is that you can think about it like any old collection, except that you can never call .size on the Future. Other than that, all collection-ish functions (e.g. map, filter, foreach, ...) are fair game. Simply think of dbResult as a list of Results. What would you do with such a list?
dbResult.map(_.getValues)
.filter(values => someTestOnValues(values))
...
That sequence of calls sets up a computation pipeline that is invoked whenever the Result is actually returned from the database. You can give a sequence of computing steps before the data has arrived. All asynchronously.
