I have a processing pipeline that:
1. Accepts an image
2. Preprocesses the image
3. Runs the image through five models
4. Combines the predictions from the five models
5. Returns the resulting mask
Step 3 can and should be parallelized, because the analysis takes a long time and the models predict their masks independently of one another.
Attempts to solve the problem
Threads: this would not work, because they run in the same process and the analysis is CPU/GPU intensive.
Run five separate helper Flask apps that each contain one model, accept the call for analysis, and return the answer. Then launch the main app and, at step 3, use five threads to call each model's API (see the sketch below). This works, but each helper app adds about 4 GB of overhead to load a separate instance of PyTorch, so the approach wastes roughly 20 GB of RAM and is hard to understand and maintain.
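For reference, the fan-out in step 3 currently looks roughly like this (the ports and the /predict endpoint are just placeholders for how the helper apps happen to be wired up):

```python
import concurrent.futures

import requests

MODEL_PORTS = [5001, 5002, 5003, 5004, 5005]  # one helper Flask app per port (assumed)


def call_model(port, image_bytes):
    """Send the preprocessed image to one helper app and return its predicted mask."""
    resp = requests.post(f"http://localhost:{port}/predict", data=image_bytes, timeout=60)
    resp.raise_for_status()
    return resp.json()  # assumed to contain the mask for this model


def predict_all(image_bytes):
    # Threads are fine for this part: they only wait on network I/O while the
    # heavy model work happens inside the five helper processes.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(MODEL_PORTS)) as pool:
        futures = [pool.submit(call_model, port, image_bytes) for port in MODEL_PORTS]
        return [f.result() for f in futures]
```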
What is an appropriate wrapper framework that accepts models and only requires one instance of PyTorch to handle all models in parallel?
What is your question?
I am trying to implement a metric which needs access to the whole data. So instead of updating the metric in the *_step() methods, I am trying to collect the outputs in the *_epoch_end() methods. However, the outputs contain only the output for the partition of the data that each device gets. Basically, if there are n devices, then each device is getting 1/n of the total outputs.
What's your environment?
OS: Ubuntu
Packaging: conda
Version: 1.0.4
PyTorch: 1.6.0
See the pytorch-lightning manual. I think you are looking for training_step_end/validation_step_end (assuming you are using DP/DDP2).
...So, when Lightning calls any of the training_step, validation_step, test_step you will only be operating on one of those pieces. (...) For most metrics, this doesn’t really matter. However, if you want to add something to your computational graph (like softmax) using all batch parts you can use the training_step_end step.
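For illustration, a minimal sketch of that split (the model, loss and hyper-parameters here are made up; only the division of work between training_step and training_step_end matters):

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def __init__(self, n_features=32, n_classes=10):
        super().__init__()
        self.layer = torch.nn.Linear(n_features, n_classes)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        # Under DP/DDP2 each device sees only its slice of the batch here.
        x, y = batch
        return {"logits": self(x), "targets": y}

    def training_step_end(self, step_outputs):
        # Lightning merges the per-device outputs before calling this hook,
        # so anything that needs the full batch (softmax, batch-level metrics,
        # the loss) can be computed here.
        return F.cross_entropy(step_outputs["logits"], step_outputs["targets"])

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```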
When using the DDP backend, there's a separate process running for every GPU. They don't have access to each other's data, but there are a few special operations (reduce, all_reduce, gather, all_gather) that make the processes synchronize. When you use such operations on a tensor, the processes will wait for each other to reach the same point and combine their values in some way, for example take the sum from every process.
In theory it's possible to gather all data from all processes and then calculate the metric in one process, but this is slow and prone to problems, so you want to minimize the data that you transfer. The easiest approach is to calculate the metric in pieces and then for example take the average. self.log() calls will do this automatically when you use sync_dist=True.
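For example, a validation_step along these lines (dropped into any LightningModule, such as the sketch above) logs a per-process loss and lets Lightning average it across processes:

```python
    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self(x), y)
        # sync_dist=True makes Lightning reduce the logged value across all
        # DDP processes (mean by default) instead of logging only this
        # process's share of the data.
        self.log("val_loss", loss, sync_dist=True)
        return loss
```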
If you don't want to take the average over the GPU processes, it's also possible to update some state variables at each step, and after the epoch synchronize the state variables and calculate your metric from those values. The recommended way is to create a class that uses the Metrics API, which recently moved from PyTorch Lightning to the TorchMetrics project.
If it's not enough to store a set of state variables, you can try to make your metric gather all data from all the processes. Derive your own metric from the Metric base class, overriding the update() and compute() methods. Use add_state("data", default=[], dist_reduce_fx="cat") to create a list where you collect the data that you need for calculating the metric. dist_reduce_fx="cat" will cause the data from different processes to be combined with torch.cat(). Internally it uses torch.distributed.all_gather. The tricky part here is that it assumes that all processes create identically-sized tensors. If the sizes don't match, syncing will hang indefinitely.
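A rough sketch of such a metric (the class name and the final computation are placeholders; the add_state/update/compute wiring is the part described above):

```python
import torch
from torchmetrics import Metric  # formerly pytorch_lightning.metrics


class WholeDataMetric(Metric):
    """Hypothetical metric that needs every prediction/target pair."""

    def __init__(self):
        super().__init__()
        # dist_reduce_fx="cat" concatenates the per-process lists with
        # torch.cat() (via all_gather) when the states are synced.
        self.add_state("preds", default=[], dist_reduce_fx="cat")
        self.add_state("targets", default=[], dist_reduce_fx="cat")

    def update(self, preds: torch.Tensor, targets: torch.Tensor):
        self.preds.append(preds)
        self.targets.append(targets)

    def compute(self):
        # After syncing, the lists have been flattened into single tensors;
        # without syncing they are still lists of per-step tensors.
        preds = torch.cat(self.preds) if isinstance(self.preds, list) else self.preds
        targets = torch.cat(self.targets) if isinstance(self.targets, list) else self.targets
        # Replace this with whatever needs the whole data set at once.
        return (preds == targets).float().mean()
```

Remember the caveat above: every process must contribute identically-sized tensors, or the all_gather will hang.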
I have been attempting to create a Dash app as a companion to a report, which I have deployed to Heroku:
https://ftacv-simulation.herokuapp.com/
This works reasonably well for the simplest form of the simulation. However, upon introducing more complex features, the Heroku server often times out (i.e. a single callback goes over the 30-second limit and the process is terminated). The two main features are a more complex simulation, which requires 15-20 runs of the simple simulation, and saving older plots for comparison.
I think I have two potential solutions to this. The first is restructuring the code so that the single large task is broken up into multiple callbacks, none of which go over the 30s limit, and potentially storing the data for the older plots in the user's browser. The second is moving to a different provider that can handle more intense computation (such as AWS).
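For reference, the browser-storage part of the first approach would look something like this with dcc.Store (the component IDs and the placeholder simulation are made up; imports follow the Dash 2 style):

```python
import dash
from dash import dcc, html, Input, Output, State

app = dash.Dash(__name__)
app.layout = html.Div([
    html.Button("Run simulation", id="run"),
    dcc.Store(id="previous-runs", data=[]),  # kept client-side in the browser
    dcc.Graph(id="plot"),
])


@app.callback(
    Output("previous-runs", "data"),
    Input("run", "n_clicks"),
    State("previous-runs", "data"),
    prevent_initial_call=True,
)
def run_and_store(n_clicks, previous):
    result = {"x": [0, 1, 2], "y": [0, 1, 4]}  # placeholder for one simulation run
    return previous + [result]


@app.callback(Output("plot", "figure"), Input("previous-runs", "data"))
def draw(runs):
    # Older runs stay available for comparison without any server-side state.
    return {"data": [{"x": r["x"], "y": r["y"]} for r in runs]}


if __name__ == "__main__":
    app.run_server(debug=True)
```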
Which of these approaches would you recommend? Or would you propose a different solution?
I trained a doc2vec model using Python gensim on a corpus of 40,000,000 documents. This model is used to infer doc-vectors for millions of documents every day. To ensure stability, I set alpha to a small value and use a large number of steps instead of setting a constant random seed:
```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load('doc2vec_dm.model')
doc_demo = ['a', 'b']  # one document, already tokenized into words
# model.random.seed(0)  # a constant seed is deliberately not used
model.infer_vector(doc_demo, alpha=0.1, min_alpha=0.0001, steps=100)
```
Doc2Vec.infer_vector() accepts only one document at a time, and it takes almost 0.1 seconds to infer each doc-vector. Is there any API that can handle a batch of documents in each inference step?
Currently there is no gensim API that performs inference on large batches of documents at once (which could help by using multiple threads internally). It is a wishlist item, among other improvements: https://github.com/RaRe-Technologies/gensim/issues/515
You might get some speedup, up to the number of cores in your CPU, by spreading your own inference jobs over multiple threads.
To eliminate all multithreaded contention due to the Python GIL, you could spread your inference over separate Python processes. If each process loads the model using some of the tricks described in another answer (see below), the OS will help them share the large model backing arrays (only paying the cost in RAM once), while each process can run one full, unblocked thread of inference completely independently.
(Specifically, Doc2Vec.load() can also use the mmap='r' mode to load an existing on-disk model with memory-mapping of the backing files. Inference alone, with no most_similar()-like operations, will only read the shared raw backing arrays, so no fussing with the _norm variants should be necessary if you're launching single-purpose processes that just do inference then save their results and exit.)
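A bare-bones sketch of that process-per-chunk idea (the path and inference parameters are copied from the question; the chunking and pool size are arbitrary):

```python
from multiprocessing import Pool

from gensim.models.doc2vec import Doc2Vec

MODEL_PATH = 'doc2vec_dm.model'


def infer_chunk(docs):
    # mmap='r' memory-maps the model's backing arrays read-only, so every
    # worker process shares the same pages instead of loading its own copy.
    model = Doc2Vec.load(MODEL_PATH, mmap='r')
    return [model.infer_vector(doc, alpha=0.1, min_alpha=0.0001, steps=100)
            for doc in docs]


def infer_all(tokenized_docs, n_procs=4):
    # Split the documents into contiguous chunks, one per process.
    size = (len(tokenized_docs) + n_procs - 1) // n_procs
    chunks = [tokenized_docs[i:i + size] for i in range(0, len(tokenized_docs), size)]
    with Pool(len(chunks)) as pool:
        results = pool.map(infer_chunk, chunks)
    return [vec for chunk in results for vec in chunk]  # original order preserved
```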
On a project with around 350 entities in the EDMX entity model, my team is experiencing lengthy waits when the first query happens and when the first save happens.
When profiling a simple case that runs a few queries and then saves, just firing the first query and saving takes minutes.
The first query takes 47% of the overall time, just in the call to the framework method that executes the query.
The first save takes 50% of the overall time, just in System.Data.Objects.ObjectContext.SaveChanges.
Are there any good options to improve performance? This can be a drain on development time.
(Once the system hits production, it's annoying at startup but not a problem during the ongoing execution)
When you use the context for the first time, it generates the mapping model defined in the metadata. One option is to pre-generate this model and include the pre-generated files in your application (but you must regenerate them each time you modify the EDMX).
Such a big model should probably be divided into multiple smaller models. I find it hard to believe that 350 entities form a single domain which cannot be divided.
A single large EDMX will result in a large ObjectContext. Every time you do using(var ctx = new YourObjectContext()), it constructs a large object, initializes a lot of collections (probably 350 of them), and makes your database operations CPU intensive. You will certainly hit performance challenges when you get high-volume traffic.
I would suggest breaking the large EDMX into smaller EDMXs and producing different ObjectContexts. You should put a small number of logically grouped entities into each ObjectContext.
My company currently services their clients using a Windows-based fat-client app that has workflow processing embedded in it. Basically, the customer inserts a set of documents to the beginning of the workflow, the documents are processed through a number of workflow steps and then after a period of time the output is presented to the customer. We currently scale up for larger customers by installing the application on other machines and let the cluster of machines work on different document subsets. Not ideal but with minimal changes to the application it did allow us to easily scale up to our current level.
The problem we now face is that as our customers have provided us with larger document sets, we find ourselves spending more than expected on machines, IT support, etc. So we have started to think about re-architecting the platform to make it scalable. A feature of our solution is that each document can be processed independently of the others. Also, we have 10 workflow steps, two of which take up about 90% of the processing time.
One idea we are mulling over is to add a workflow-step field to the document schema to track which workflow step has been completed for each document. We could then throw the entire cluster of machines at a single document set. A single machine would not be responsible for sequentially processing a document through all workflow steps; instead it would query the DB for the next document/workflow-step pair and perform that processing. Does this sound like a reasonable approach? Any suggestions?
Thanks in advance.
While I'm not sure what specific development environment you are working with, I have had to deal with some similar workflows where we have a varied number of source documents, various steps, etc. all with different performance characteristics.
Assuming you have a series of discrete steps - i.e. step A's work product is the input for step B, step B's product is the input for step C, etc. - I would look at message queueing as a potential solution.
For example, all new documents are placed on a queue. One or more listener apps watch the queue and grab the next available document to perform step A. As step A completes, a link to the output product and/or the relevant data is placed on another queue. A separate listener app pulls from this second queue to perform step B, and so on until the final output product is created.
In this way, you use one queue as the holding area between each discrete step, and you can scale any individual process between the queues up or down.
For example, we use this to go from some data transformations, through a rendering process, and out to a spooler. The data is fast, the renderings are CPU bound, and the printing is I/O bound, but each individual step can be scaled out based on need.
You could (technically) use a DB for this - but a message queue and/or a service bus would likely serve you better.
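To make the shape of that concrete, here is a toy single-machine sketch of the queue-per-step pattern using only Python's standard library; in a real deployment the queues would live in a broker or service bus so the listeners can run on separate machines, and the step bodies are placeholders:

```python
import queue
import threading

step_a_in = queue.Queue()
step_b_in = queue.Queue()
finished = queue.Queue()


def step_a_worker():
    while True:
        doc = step_a_in.get()
        step_b_in.put(f"{doc}:A-done")   # placeholder for the real step-A work
        step_a_in.task_done()


def step_b_worker():
    while True:
        doc = step_b_in.get()
        finished.put(f"{doc}:B-done")    # placeholder for the real step-B work
        step_b_in.task_done()


# Each step scales independently: the slow steps simply get more listeners
# (or more machines, once the queues live in a broker).
for _ in range(2):
    threading.Thread(target=step_a_worker, daemon=True).start()
for _ in range(8):
    threading.Thread(target=step_b_worker, daemon=True).start()

for i in range(10):
    step_a_in.put(f"doc-{i}")

step_a_in.join()
step_b_in.join()
while not finished.empty():
    print(finished.get())
```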
Hopefully that points you in the right direction!