How to process a CPU-bound task in async code - python-asyncio

I am doing some heavy processing that needs async methods. One of my methods returns a list of dictionaries that needs to go through heavy processing prior to adding it to another awaitable object, i.e.:
def cpu_bound_task_here(record):
    # some complicated preprocessing of record
    return record
After applying the answer kindly given below, my code is now just stuck.
async def fun():
    print("Socket open")
    record_count = 0
    symbol = obj.symbol.replace("-", "").replace("/", "")
    loop = asyncio.get_running_loop()
    await obj.send()
    while True:
        try:
            records = await obj.receive()
            if not records:
                continue
            record_count += len(records)
            # ... heavy processing and push to redis (snippet truncated here)
        except Exception:
            break  # error handling omitted in the original snippet
So what the above function does is stream values asynchronously and do some heavy processing before pushing to Redis, indefinitely. I made the necessary changes and now I'm stuck.

As that output tells you, run_in_executor returns a Future. You need to await it to get its result.
record = await loop.run_in_executor(
    None, something_cpu_bound_task_here, record
)
Note that any arguments to something_cpu_bound_task_here need to be passed to run_in_executor.
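Since run_in_executor only forwards positional arguments, keyword arguments are typically bound with functools.partial first. A minimal runnable sketch (the preprocess function and its scale parameter are illustrative stand-ins, not from the question):

```python
import asyncio
from functools import partial

def preprocess(record, scale=1):
    # stand-in for the CPU-bound work on one record
    return {**record, "value": record["value"] * scale}

async def main():
    loop = asyncio.get_running_loop()
    # positional arguments go straight into run_in_executor...
    r1 = await loop.run_in_executor(None, preprocess, {"value": 2})
    # ...keyword arguments must be bound with functools.partial
    r2 = await loop.run_in_executor(None, partial(preprocess, {"value": 2}, scale=3))
    return r1, r2

r1, r2 = asyncio.run(main())
```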
Additionally, as you've mentioned that this is a CPU-bound task, you'll want to make sure you're using a concurrent.futures.ProcessPoolExecutor. Unless you've called loop.set_default_executor somewhere, the default is an instance of ThreadPoolExecutor.
with ProcessPoolExecutor() as executor:
    for record in records:
        record = await loop.run_in_executor(
            executor, something_cpu_bound_task_here, record
        )
Finally, your while loop is effectively running synchronously. You need to wait for the future and then for obj.add before moving on to process the next item in records. You might want to restructure your code a bit and use something like gather to allow for some concurrency.
async def process_record(record, obj, loop, executor):
    record = await loop.run_in_executor(
        executor, something_cpu_bound_task_here, record
    )
    await obj.add(record)

async def fun():
    loop = asyncio.get_running_loop()
    records = await receive()
    with ProcessPoolExecutor() as executor:
        await asyncio.gather(
            *[process_record(record, obj, loop, executor) for record in records]
        )
I'm not sure how to handle obj since that isn't defined in your example, but I'm sure you can figure that out.
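One caveat worth adding to the above: asyncio.gather starts every task at once, so a large batch of records can flood the executor queue. An asyncio.Semaphore bounds the concurrency. A self-contained sketch where asyncio.sleep stands in for the executor call and obj.add (not the original code):

```python
import asyncio

async def process_record(record, sem):
    async with sem:              # at most 4 records in flight at a time
        await asyncio.sleep(0)   # stands in for run_in_executor + obj.add
        return record * 2

async def main():
    sem = asyncio.Semaphore(4)
    records = list(range(10))
    return await asyncio.gather(*(process_record(r, sem) for r in records))

results = asyncio.run(main())
```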

Check out the library Pypeln; it is perfect for streaming tasks between process, thread, and asyncio pools:
import pypeln as pl
data = get_iterable()
data = pl.task.map(f1, data, workers=100) # asyncio
data = pl.thread.flat_map(f2, data, workers=10)
data = filter(f3, data)
data = pl.process.map(f4, data, workers=5, maxsize=200)

Related

Slow fetching of records from a DynamoDB table

I have a DynamoDB table with 151 records, table size is 666 kilobytes and average item size is 4,410.83 bytes.
This is my table schema:
uid - partition key
version - sort key
status - published
archived - boolean
... // other attributes
This is the scan operation I'm doing from AWS Lambda:
module.exports.searchPois = async (event) => {
  const klass = await getKlass.byEnv(process.env.STAGE, 'custom_pois')
  const status = (event.queryStringParameters && event.queryStringParameters.status) || 'published'
  const customPois = await klass.scan()
    .filter('archived').eq(false)
    .and()
    .filter('status').eq(status)
    .exec()
  return customPois
}
This request takes 7+ seconds. I'm thinking of adding a GSI so I can perform a query operation instead. But before I add it: is scan really this slow, and if I add a GSI, will it fetch faster, like 1-3 seconds?
Using Scan should be avoided if at all possible. For your use-case a GSI would be much more efficient, and a sparse index would be even better: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-general-sparse-indexes.html
That said, for the small number of items you have, it should not take 7 seconds. This is likely caused by making infrequent requests to DynamoDB: DynamoDB relies on caching metadata to improve request latency, and if your requests are infrequent the metadata will not exist in the cache, which increases response times.
I suggest making sure you re-use your connections: create your client outside of the Lambda event handler, and keep active traffic on the table.

Why is the array empty after the observer?

I am trying to get the values from the Room database:
var cars = mutableListOf<String>()
carsDb.getAll.observe(viewLifecycleOwner) {
    cars = it.toMutableList()
    // cars.size == 5 here
}
// cars.size == 0 here
Why can't I get the values outside the observer? I am facing this issue every time.
Looks like a synchronisation problem: the db call behind carsDb.getAll.observe() is asynchronous, so the cars.size == 5 you see inside the callback is resolved later than the cars.size == 0 you get right after registering the observer (before the request has resolved).
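The same callback ordering can be shown in a few lines of Python asyncio (a language-neutral sketch of the timing issue, not Room code): the size read right after registering the callback is taken before the data arrives.

```python
import asyncio

cars = []

def on_result(items):
    # runs only when the async source delivers its result
    cars.extend(items)

async def fake_observe(callback):
    await asyncio.sleep(0.01)   # the "query" resolves some time later
    callback(["a", "b", "c"])

async def main():
    task = asyncio.create_task(fake_observe(on_result))
    size_before = len(cars)     # 0: the callback has not run yet
    await task                  # let the "query" finish
    size_after = len(cars)      # 3: the data has arrived
    return size_before, size_after

before, after = asyncio.run(main())
```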

Dexie, object not found when nesting collection

I thought I had the hang of Dexie, but now I'm flabbergasted:
two tables, each with a handful of records. Komps & Bretts
output all Bretts
rdb.Bretts.each(brett => {
  console.log(brett);
})
output all Komps
rdb.Komps.each(komp => {
  console.log(komp);
})
BUT: this only outputs the Bretts; for some weird reason, Komps is empty:
rdb.Bretts.each(brett => {
  console.log(brett);
  rdb.Komps.each(komp => {
    console.log(komp);
  })
})
I've tried all kinds of combinations with async/await, then(), etc. The inner loop cannot find any data in the inner table, whichever table I try to do something with.
2nd example. This Works:
await rdb.Komps.get(163);
This produces an error ("Failed to execute 'objectStore' on 'IDBTransaction…ction': The specified object store was not found.")
rdb.Bretts.each(async brett => {
  await rdb.Komps.get(163);
})
Is there some kind of locking going on? something that can be disabled?
Thank you!
Calling rdb.Bretts.each() will implicitly launch a read-only transaction limited to 'Bretts' only. This means that within the callback you can only reach that table, which is why it doesn't find the Komps table at that point. To access the Komps table from within the each() callback, you need to include it in an explicit transaction block:
rdb.transaction('r', 'Komps', 'Bretts', () => {
  rdb.Bretts.each(brett => {
    console.log(brett);
    rdb.Komps.each(komp => {
      console.log(komp);
    });
  });
});
However, each() does not respect promises returned by the callback, so even this fix is not something I would recommend, even though it solves your problem. You could easily get race conditions, because you lose control of the flow when launching a new each() from inside an each() callback.
I would recommend using toArray(), get(), bulkGet() and other methods rather than each() where possible. toArray() is also faster than each(), as it can use the faster IndexedDB APIs IDBObjectStore.getAll() and IDBIndex.getAll() when possible. And you don't necessarily need to encapsulate the code in a transaction block (unless you really need that atomicity).
const komps = await rdb.Komps.toArray();
await Promise.all(
  komps.map(async komp => {
    // Do some async call per komp:
    const brett = await rdb.Bretts.get(163);
    console.log("brett with id 163", brett);
  })
);
Now, this example is a bit silly, as it does the exact same rdb.Bretts.get(163) for each komp it finds, but you could replace 163 with some dynamic value there.
Conclusion: there are two issues.
Dexie's operations use implicit transactions, and the callback to each() lives within that limited transaction (tied to one single table only) unless you surround the call with a bigger explicit transaction block.
Avoid starting new async operations within the callback of Dexie's db.Table.each(), as it does not expect promises to be returned from its callback. You can do it, but it is better to stick with methods that let you keep control of the async flow.

Parse cloud job set() function weirdness

I'm trying to run this cloud job weekly on Parse, where I assign a rank to players based on their high scores. This piece of code mostly seems to work, except it only sets ranks from 1 to 9. Anything with more than one digit does not get set! The job returns a success after setting ranks 1-9.
Parse.Cloud.job("TestJob", function(request, status) {
    Parse.Cloud.useMasterKey();
    var rank = 0;
    var usersQuery = new Parse.Query("ECJUser").descending("HighScore");
    usersQuery.find(function(results) {
        for (var i = 0; i < results.length; ++i) {
            rank += 1;
            console.log("Setting " + results[i].get('Name') + " rank to " + rank);
            results[i].save({"Rank": rank});
        }
    }).then(function() {
        status.success("Weekly Ranks Assigned");
    }, function(error) {
        status.error("Uh oh. Weekly ranking failed");
    });
});
In the console log, it clearly says "Setting playerName rank to 11", but it doesn't actually set anything in the Parse table; the value stays undefined (or whatever it was previously).
Does the code look right? Is there something JavaScript-related that I'm missing?
Updated based on answers:
Apparently I'm not waiting for the jobs to complete. But I'm not sure how to write code for handling promises. Here's what I have:
var usersQuery = new Parse.Query("ECJUser").descending("HighScore");
usersQuery.find().then(function(results) {
    var promises = [];
    for (var i = 0; i < results.length; i++) {
        promises.push(results[i].save({"Rank": rank}));
    }
    return promises;
})
What do I do with the list of promises? where do I wait for them to complete?
Your code does not wait for the saves to complete, so it's going to have unpredictable results. It also isn't going to run through all users, just the first 'page' returned by the query.
So instead of find you should consider using each. You also need to consider whether the job will have time to process all users; it may need to run multiple times.
For the saves, you should add each promise that is returned to an array, then wait for all of those promises to complete before calling status.success.
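The "collect the promises, then wait" pattern can be sketched like this (assignRanks and the injected save function are illustrative, not Parse's API; in the real job, save would be results[i].save({"Rank": rank}) and the caller would invoke status.success afterwards):

```javascript
// Collect one promise per save and resolve only after every save finished.
async function assignRanks(results, save) {
  const promises = results.map((user, i) => save(user, i + 1));
  await Promise.all(promises);   // wait for ALL saves, ranks 1..n
  return promises.length;        // only now is it safe to report success
}
```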

How to make script execution slow?

I have this task: select data from "TABLE_FROM", modify it, and insert it into "TABLE_TO". The main problem is that the script must run in production and shouldn't hurt live site performance, but "TABLE_FROM" contains hundreds of millions of rows. I'm going to run the script using Node.js. What techniques are used to resolve this kind of problem, i.e. how do I make this script run "slowly", or in other words "softly", to prevent DB and CPU overload?
Time of script execution is irrelevant. I use Cassandra DB.
Sample code:
var OFFSET = 0;
var BATCHSIZE = 100;
var TIMEOUT = 1000;

function fetchPush() {
    // fetch from TABLE_FROM, possibly in batches
    var rows = fetch(OFFSET, BATCHSIZE);
    // push to TABLE_TO
    push(rows);
    OFFSET += BATCHSIZE;
    // do the next batch after a timeout
    setTimeout(fetchPush, TIMEOUT);
}
Here I'm assuming that fetch and push are blocking calls; for async processing you could (obviously) adapt this with promises.
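For the async case, the same batch-plus-delay idea looks roughly like this (fetchAsync and pushAsync are hypothetical promise-returning versions of fetch and push, injected here so the sketch is self-contained):

```javascript
// Process the table in batches, pausing between batches to throttle load.
async function fetchPushAsync(fetchAsync, pushAsync, batchSize, delayMs) {
  let offset = 0;
  for (;;) {
    const rows = await fetchAsync(offset, batchSize);
    if (rows.length === 0) break;                    // table exhausted
    await pushAsync(rows);
    offset += batchSize;
    await new Promise(r => setTimeout(r, delayMs));  // soften DB/CPU load
  }
  return offset;
}
```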
