How to await a ParallelQuery with LINQ?

I have an async method that should look up database entries. It filters by name, and thus is a candidate for parallel execution.
However, I cannot find a simple way to support both parallel execution and asynchronous tasks.
Here's what I have:
    private async Task<List<Item>> GetMatchingItems(string name) {
        using (var entities = new Entities()) {
            var items = from item in entities.Item.AsParallel()
                        where item.Name.Contains(name)
                        select item;
            return await items.ToListAsync(); //complains "ParallelQuery<Item> does not contain a definition for ToListAsync..."
        }
    }
When I remove AsParallel() it compiles. Am I not supposed to use both features at the same time? Or am I misunderstanding something?
IMHO, both make sense:
AsParallel() would indicate that the database query may get split up into several sub-queries running at the same time, because the individual matching of any item is not dependent on any other item. UPDATE: Bad idea in this example, see comments and answer!
ToListAsync() would support the asynchronous execution of this method, to allow other match methods (on other data) to start executing immediately.
How can I use both parallel execution (with LINQ) and asynchronous tasks at the same time?

You can't, and shouldn't. PLINQ isn't for database access. The database knows best how to parallelize the query and does that just fine on its own using normal LINQ. PLINQ is for working with in-memory objects where you are doing computationally expensive calculations within the LINQ query, so that the work can be parallelized on the client side (as opposed to on the database server).
A better answer might be:
PLINQ is for distributing a query that is compute intensive across multiple threads.
Async is for returning the thread back so that others can use it because you are going to be waiting on an external resource (database, Disk I/O, network I/O, etc).
Async PLINQ doesn't have a very strong use case where you want to return the thread while you wait AND you have a lot of calculations to do... If you are busy calculating, you NEED the thread (or multiple threads). They are almost completely on different ends of optimization. If you need something like this, there are much better mechanisms like Tasks, etc.
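To make the distinction concrete, here is a minimal sketch rather than anything from the original question. The class and method names are made up for illustration, the primality test is just a stand-in for "expensive calculation", and Task.Delay stands in for a real awaitable database call such as ToListAsync().

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class PlinqVersusAsync
{
    // CPU-bound work: a deliberately naive primality test, so every item costs real compute.
    public static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int d = 2; (long)d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    // PLINQ spreads the CPU-bound predicate across cores.
    // The calling thread blocks until the count is ready; the threads are busy, not waiting.
    public static int CountPrimesParallel(int upTo) =>
        Enumerable.Range(0, upTo + 1).AsParallel().Count(IsPrime);

    // I/O-bound stand-in (assumption): Task.Delay simulates waiting on a database.
    // No thread is held while the delay is pending.
    public static async Task<int> SimulatedQueryAsync(int rowCount)
    {
        await Task.Delay(50);
        return rowCount;
    }
}
```

The first method needs threads for the whole duration, the second needs none while it waits, which is why combining the two in a single query rarely makes sense.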

Well, you can't. These are 2 different options that don't go together.
You can use Task.Run with async-await to parallelize the synchronous part of the asynchronous operation (i.e. what comes before the first await) on multiple ThreadPool threads, but without all the partition logic built into PLINQ, for example:
    var tasks = enumerable.Select(item => Task.Run(async () =>
    {
        LongSynchronousPart(item);
        await AsynchronousPartAsync(item);
    }));
    await Task.WhenAll(tasks);
In your case however (assuming it's Entity Framework) there's no value in using PLINQ as there's no actual work to parallelize. The query itself is executed in the DB.

Related

Task.WhenAll with Select is a footgun - but why?

Consider: you have a collection of user ids and want to load the details of each user represented by their id from an API. You want to bag up all of those users into some kind of collection and send it back to the calling code. And you want to use LINQ.
Something like this:
    var userTasks = userIds.Select(userId => GetUserDetailsAsync(userId));
    var users = await Task.WhenAll(userTasks); // users is User[]
This was fine for my app when I had relatively few users. But there came a point where it didn't scale. When it got to the point of thousands of users, this resulted in thousands of HTTP requests being fired concurrently and bad things started to happen. Not only did we realise we were launching a denial of service attack on the API we were consuming as we did this, we were also bringing our own application to the point of collapse through thread starvation.
Not a proud day.
Once we realised that the cause of our woes was a Task.WhenAll / Select combo, we were able to move away from that pattern. But my question is this:
What is going wrong here?
As I read around on the topic, this scenario seems well described by #6 on Mark Heath's list of Async antipatterns: "Excessive parallelization":
Now, this does "work", but what if there were 10,000 orders? We've flooded the thread pool with thousands of tasks, potentially preventing other useful work from completing. If ProcessOrderAsync makes downstream calls to another service like a database or a microservice, we'll potentially overload that with too high a volume of calls.
Is this actually the reason? I ask as my understanding of async / await becomes less clear the more I read about the topic. It's very clear from many pieces that "threads are not tasks". Which is cool, but my code appears to be exhausting the number of threads that ASP.NET Core can handle.
So is that what it is? Is my Task.WhenAll and Select combo exhausting the thread pool or similar? Or is there another explanation for this that I'm not aware of?
Update:
I turned this question into a blog post with a little more detail / waffle. You can find it here: https://blog.johnnyreilly.com/2020/06/taskwhenall-select-is-footgun.html
N+1 Problem
Putting threads, tasks, async, parallelism to one side, what you describe is an N+1 problem, which is something to avoid for exactly what happened to you. It's all well and good when N (your user count) is small, but it grinds to a halt as the users grow.
You may want to find a different solution. Do you have to do this operation for all users? If so, then maybe switch to a background process and fan-out for each user.
Back to the footgun (I had to look that up BTW 🙂).
Tasks are a promise, similar to JavaScript. In .NET they may complete on a separate thread - usually a thread from the thread pool.
In .NET Core, they usually do complete on a separate thread if they are not already complete at the point of awaiting; for an HTTP request that is almost certain to be the case.
You may have exhausted the thread pool, but since you're making HTTP requests, I suspect you've exhausted the number of concurrent outbound HTTP requests instead. "The default connection limit is 10 for ASP.NET hosted applications and 2 for all others." See the documentation here.
Is there a way to achieve some parallelism and not exhaust a resource (threads or HTTP connections)? - Yes.
Here's a pattern I often implement for just this reason, using Batch() from morelinq.
    IEnumerable<User> users = Enumerable.Empty<User>();
    IEnumerable<IEnumerable<string>> batches = userIds.Batch(10);
    foreach (IEnumerable<string> batch in batches)
    {
        // Start at most one batch of requests, then wait for all of them before continuing.
        IEnumerable<Task<User>> batchTasks = batch.Select(userId => GetUserDetailsAsync(userId));
        User[] batchUsers = await Task.WhenAll(batchTasks);
        users = users.Concat(batchUsers);
    }
You still get ten asynchronous HTTP requests to GetUserDetailsAsync(), and you don't exhaust threads or concurrent HTTP requests (or at least max out with the 10).
Now if this is a heavily used operation or the server with GetUserDetailsAsync() is heavily used elsewhere in the app, you may hit the same limits when your system is under load, so this batching is not always a good idea. YMMV.
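If you would rather not pull in morelinq, the same batching idea can be sketched with Enumerable.Chunk. This assumes .NET 6 or later, and BatchDemo and InBatchesAsync are hypothetical names used only for this illustration; the work delegate stands in for GetUserDetailsAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class BatchDemo
{
    // Runs `work` over `inputs`, starting at most `batchSize` operations at a time
    // and waiting for each batch to finish before starting the next one.
    public static async Task<List<TResult>> InBatchesAsync<TInput, TResult>(
        IEnumerable<TInput> inputs, int batchSize, Func<TInput, Task<TResult>> work)
    {
        var results = new List<TResult>();
        foreach (TInput[] batch in inputs.Chunk(batchSize)) // Chunk requires .NET 6+
        {
            // Task.WhenAll returns results in the same order as the tasks it was given.
            TResult[] batchResults = await Task.WhenAll(batch.Select(work));
            results.AddRange(batchResults);
        }
        return results;
    }
}
```

One trade-off of batching is that each batch only finishes when its slowest request does, so a single slow call holds up the next batch.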
You already have an excellent answer here, but just to chime in:
There's no problem with creating thousands of tasks. They're not threads.
The core problem is that you're hitting the API way too much. So the best solutions are going to change how you call that API:
Do you really need user details for thousands of users, all at once? If this is for a dashboard display, then change your API to enforce paging; if this is for a batch process, then see if you can access the data directly from the batch process.
Use a batch route for that API if it supports one.
Use caching if possible.
Finally, if none of the above are possible, look into throttling the API calls.
The standard pattern for asynchronous throttling is to use SemaphoreSlim, which looks like this:
    using var throttler = new SemaphoreSlim(10);
    var userTasks = userIds.Select(async userId =>
    {
        await throttler.WaitAsync();
        try { return await GetUserDetailsAsync(userId); }
        finally { throttler.Release(); }
    });
    var users = await Task.WhenAll(userTasks); // users is User[]
Again, this kind of throttling is best only if you can't make the design changes to avoid thousands of API calls in the first place.
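For anyone who wants to see the throttle working without a real API, here is a self-contained sketch. ThrottleDemo, ThrottledSelectAsync and MeasurePeakConcurrencyAsync are hypothetical names, and Task.Delay is only a stand-in for the HTTP call; the point is just to check that no more than maxConcurrency operations are in flight at once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottleDemo
{
    // Runs `work` for every input with at most `maxConcurrency` calls in flight.
    public static async Task<TResult[]> ThrottledSelectAsync<TInput, TResult>(
        IEnumerable<TInput> inputs, int maxConcurrency, Func<TInput, Task<TResult>> work)
    {
        using var throttler = new SemaphoreSlim(maxConcurrency);
        var tasks = inputs.Select(async input =>
        {
            await throttler.WaitAsync();
            try { return await work(input); }
            finally { throttler.Release(); }
        }).ToList(); // materialize once so each task is created exactly once
        return await Task.WhenAll(tasks);
    }

    // Simulated workload (assumption): records the highest number of overlapping calls.
    public static async Task<int> MeasurePeakConcurrencyAsync(int itemCount, int maxConcurrency)
    {
        var gate = new object();
        int inFlight = 0, peak = 0;
        await ThrottledSelectAsync(Enumerable.Range(0, itemCount), maxConcurrency, async id =>
        {
            lock (gate) { inFlight++; peak = Math.Max(peak, inFlight); }
            await Task.Delay(10); // stand-in for the HTTP request
            lock (gate) { inFlight--; }
            return id;
        });
        return peak;
    }
}
```

Unlike fixed batches, the semaphore starts a new call as soon as any earlier one finishes, so one slow request does not stall the rest.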
While a pure async I/O operation has no thread waiting on it, its continuation still needs a thread. Assuming your GetUserDetailsAsync awaits some IO-bound operation, the code after that await (parsing the output, returning the result, ...) has to run on some thread before the Task returned by GetUserDetailsAsync can complete. So every one of those tasks will wait for a thread from the thread pool in order to finish.

Long running async method vs firing an event upon completion

I have to create a library that communicates with a device via a COM port.
In one of the functions, I need to issue a command, then wait while the device performs a test (it varies from 10 to 1000 seconds) and return the result of the test.
One approach is to use async-await pattern:
    public async Task<decimal> TaskMeasurementAsync(CancellationToken ctx = default)
    {
        PerformTheTest();
        // Wait till the test is finished
        await Task.Delay(_duration, ctx);
        return ReadTheResult();
    }
The other that comes to mind is to just fire an event upon completion.
The device performs a test and the duration is specified prior to performing it. So in either case I would either have to use Task.Delay() or Thread.Sleep() in order to wait for the completion of the task on the device.
I lean towards async-await as it is easy to build in cancellation and, for lack of a better term, it is self-contained, i.e. I don't have to declare an event, create an EventArgs class etc.
Would appreciate any feedback on which approach is better if someone has come across a similar dilemma.
Thank you.
There are several tools available for how to structure your code.
Events are a push model (so is System.Reactive, a.k.a. "LINQ over events"). The idea is that you subscribe to the event, and then your handler is invoked zero or more times.
Tasks are a pull model. The idea is that you start some operation, and the Task will let you know when it completes. One drawback to tasks is that they only represent a single result.
The coming-soon async streams are also a pull model - one that works for multiple results.
In your case, you are starting an operation (the test), waiting for it to complete, and then reading the result. This sounds very much like a pull model would be appropriate here, so I recommend Task<T> over events/Rx.
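If part of the library does end up event-based (say a device wrapper that raises a completion event), the two models can still be bridged with TaskCompletionSource. This is only a sketch: EventBasedDevice, StartTest, MeasurementCompleted and MeasureAsync are all hypothetical names, and the device is simulated with Task.Delay.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical device wrapper that only exposes a completion event (push model).
public sealed class EventBasedDevice
{
    public event EventHandler<decimal> MeasurementCompleted;

    // Simulates the device finishing its test after `duration`.
    public void StartTest(TimeSpan duration, decimal simulatedResult)
    {
        Task.Delay(duration).ContinueWith(t => MeasurementCompleted?.Invoke(this, simulatedResult));
    }
}

public static class DeviceExtensions
{
    // Wraps the event in a Task so callers get the pull model and cancellation.
    public static async Task<decimal> MeasureAsync(
        this EventBasedDevice device, TimeSpan duration, decimal simulatedResult,
        CancellationToken ct = default)
    {
        var tcs = new TaskCompletionSource<decimal>(TaskCreationOptions.RunContinuationsAsynchronously);
        EventHandler<decimal> handler = (sender, result) => tcs.TrySetResult(result);
        device.MeasurementCompleted += handler;
        // Cancelling the token completes the task as canceled; it does not stop the device itself.
        using var registration = ct.Register(() => tcs.TrySetCanceled(ct));
        try
        {
            device.StartTest(duration, simulatedResult);
            return await tcs.Task;
        }
        finally
        {
            device.MeasurementCompleted -= handler; // always unsubscribe
        }
    }
}
```

The caller then writes `decimal value = await device.MeasureAsync(...)` and never touches the event directly.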

Run async function in specific thread

I would like to run specific long-running functions (which execute database queries) on a separate thread. However, let's assume that the underlying database engine only allows one connection at a time and the connection struct isn't Sync (I think at least the latter is true for diesel).
My solution would be to have a single separate thread (as opposed to a thread pool) where all the database-work happens and which runs as long as the main thread is alive.
I think I know how I would have to do this with passing messages over channels, but that requires quite some boilerplate code (e.g. explicitly sending the function arguments over the channel etc.).
Is there a more direct way of achieving something like this with Rust (and possibly tokio and the new async/await notation that is in nightly)?
I'm hoping to do something along the lines of:
let handle = spawn_thread_with_runtime(...);
let future = run_on_thread!(handle, query_function, argument1, argument2);
where query_function would be a function that immediately returns a future and does the work on the other thread.
Rust nightly and external crates / macros would be ok.
If external crates are an option, I'd consider taking a look at actix, an Actor Framework for Rust.
This will let you spawn an Actor in a separate thread that effectively owns the connection to the DB. It can then listen for messages, execute work/queries based on those messages, and return either sync results or futures.
It takes care of most of the boilerplate for message passing, spawning, etc. at a higher level.
There's also a Diesel example in the actix documentation, which sounds quite close to the use case you had in mind.

Synchronous calls in akka / actor model

I've been looking into Akka lately and it looks like a great framework for building scalable servers on the JVM. However, most of the libraries on the JVM are blocking (e.g. JDBC), so don't you lose out on the performance benefits of using an event-based model because your threads will always be blocked? Does Akka do something to get around this? Or is it just something you have to live with until we get more non-blocking libraries on the JVM?
Have a look at CQRS, it greatly improves scalability by separating reads from writes. This means that you can scale your reads separately from your writes.
With the types of IO blocking issues you mentioned Scala provides a language embedded solution that matches perfectly: Futures. For example:
def expensiveDBQuery(key : Key) = Future {
//...query the database
}
val dbResult : Future[Result] =
expensiveDBQuery(...) //non-blocking call
The function call returns dbResult immediately. The Result will be available in the "Future". The cool part about a Future is that you can think about it like any old collection, except you can never call .size on the Future. Other than that all collection-ish functions (e.g. map, filter, foreach, ...) are fair game. Simply think of the dbResult as a list of Results. What would you do with such a list:
dbResult.map(_.getValues)
.filter(values => someTestOnValues(values))
...
That sequence of calls sets up a computation pipeline that is invoked whenever the Result is actually returned from the database. You can give a sequence of computing steps before the data has arrived. All asynchronously.

Should Parallel.ForEach be used in DB calls?

I've got a list of Foo IDs. I need to call a stored procedure for each ID.
e.g.
Guid[] siteIds = ...; // typically contains 100 to 300 elements
foreach (var id in siteIds)
{
db.MySproc(id); // Executes some stored procedure.
}
Each call is pretty independent of the other rows, this shouldn't be contentious in the database.
My question: would it be beneficial to parallelize this using Parallel.ForEach? Or is database IO going to be a bottleneck, and more threads would just result in more contention?
I would measure it myself, however, it's difficult to measure this on my test environment where the data and load is much smaller than our real web server.
Out of curiosity, why do you want to optimize it with Parallel.ForEach and spawn threads / open connections / pass data / get a response for every item, instead of writing a simple "sproc" that works with a list of IDs instead of a single ID?
At first glance, that should get you a much more noticeable improvement.
I would think that the Parallel.ForEach would work, assuming that your DB server can handle the ~150-300 concurrent operations.
The only way to know for sure is to measure both.
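If you do measure, capping the degree of parallelism makes it easy to compare a few settings. A minimal sketch, assuming the stored procedure call can be passed in as a delegate; SprocRunner and callSproc are hypothetical names, not part of any data access library.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SprocRunner
{
    // Baseline: one call after another, as in the question.
    public static void RunSequential(IEnumerable<Guid> siteIds, Action<Guid> callSproc)
    {
        foreach (var id in siteIds) callSproc(id);
    }

    // Parallel variant with an explicit cap, so the database sees at most
    // `maxDegreeOfParallelism` concurrent calls instead of one per available thread.
    public static void RunParallel(
        IEnumerable<Guid> siteIds, Action<Guid> callSproc, int maxDegreeOfParallelism)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };
        Parallel.ForEach(siteIds, options, callSproc);
    }
}
```

Wrapping each variant in a System.Diagnostics.Stopwatch and trying caps such as 2, 4 and 8 would show where the database stops benefiting. Note that callSproc has to be safe to call from several threads at once; with Entity Framework that likely means a separate context or connection per call, since a shared DbContext is not thread-safe.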

Resources