Is there any way to optimize PyTorch inference on CPU? - Heroku

I am going to serve a PyTorch model (resnet18) on a website.
However, inference on the CPU (AMD 3600) uses about 70% of the CPU resources.
I don't think the server (Heroku) can handle this computation.
Is there any way to optimize inference on the CPU?
Many thanks.

Admittedly, I'm not an expert on Heroku, but you can probably use OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes inference performance by, for example, graph pruning and fusing some operations together. Here are the performance benchmarks for ResNet-18 converted from PyTorch.
You can find a full tutorial on how to convert the PyTorch model here. Some snippets below.
Install OpenVINO
The easiest way to do it is with pip. Alternatively, you can use this tool to find the best installation option for your case.
pip install openvino-dev[pytorch,onnx]
Save your model to ONNX
OpenVINO cannot convert a PyTorch model directly for now, but it can do it with an ONNX model. This sample code assumes the model is for computer vision.
import torch  # assumes `model` already holds your trained resnet18 in eval() mode
dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
Use Model Optimizer to convert ONNX model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package, so be sure you have installed it. It converts the ONNX model to OpenVINO Intermediate Representation (IR), the default format for OpenVINO. It also changes the precision to FP16 (to further increase performance). Run in the command line:
mo --input_model "model.onnx" --input_shape "[1,3,224,224]" --mean_values="[123.675,116.28,103.53]" --scale_values="[58.395,57.12,57.375]" --data_type FP16 --output_dir "model_ir"
Run the inference on the CPU
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (the graphics integrated into your CPU, such as Intel HD Graphics). If you don't know what the best choice for you is, just use AUTO.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")

# Get the output layer
output_layer_ir = compiled_model_ir.output(0)

# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
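For completeness, here is a rough sketch of how input_image above might be prepared (this assumes OpenCV is available, a 224x224 input, and a file name of your choice; the mean/scale normalization has already been folded into the IR by the Model Optimizer):

import cv2
import numpy as np

# Read the image, convert BGR -> RGB, resize to the network's input resolution,
# and reorder to NCHW with a batch dimension of 1.
image = cv2.cvtColor(cv2.imread("test.jpg"), cv2.COLOR_BGR2RGB)
resized = cv2.resize(image, (224, 224))
input_image = np.expand_dims(resized.transpose(2, 0, 1), 0).astype(np.float32)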
Disclaimer: I work on OpenVINO.

You can try pruning and quantizing your model (techniques that compress the model for deployment, speeding up inference and saving energy without significant accuracy loss). There are examples of model pruning and quantization on the PyTorch website; see the links and the short sketch after them below.
https://pytorch.org/tutorials/intermediate/pruning_tutorial.html
https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html
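As a rough sketch of the quantization route: dynamic quantization only covers Linear/LSTM layers, so for a convnet like resnet18 the static post-training quantization workflow in the tutorials usually gives larger gains, but it is the simplest thing to try first.

import torch
import torchvision

# Dynamic quantization: converts the weights of Linear layers to int8.
model = torchvision.models.resnet18().eval()
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    output = quantized_model(torch.randn(1, 3, 224, 224))
print(output.shape)  # torch.Size([1, 1000])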

Related

spaCy - making the library size smaller

Is it possible to reduce the size of this library?
Currently, the lang folder in python3.6/site-packages/spacy is 380 MiB and contains a lot of languages I don't use. I only use en; is it possible to install spaCy with pip in a way that puts only the language I need into this folder?
This is improved in v2.2+, where the language-specific data has been moved out of the main library into a separate package. The size depends a bit on your OS, but should be closer to ~50 MB.
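For illustration (assuming spaCy v2.2+ and the small English model en_core_web_sm), installing just the English package looks roughly like this:

# pip install -U spacy
# python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # only the English model package is needed on disk
doc = nlp("Only the English data ships with this package.")
print([(token.text, token.pos_) for token in doc])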
There's some nice work being done to get the size of spaCy down for serverless deployments like AWS Lambda:
https://github.com/ryfeus/lambda-packs
In addition to trimming down the repository, they also move the language models to S3 to limit the size requirements of the package.
Also see here for a python script to prune the languages:
https://gist.github.com/jshhrrsn/5377b9dd282ef51f5564f1347a7d5aef

How to get word2vec training loss in Gensim from pretrained models?

I have some pre-trained word2vec models and I'd like to evaluate them using the same corpus. Is there a way I could get the raw training loss given a model dump file and the corpus in memory?
The training-loss reporting of gensim's Word2Vec (& related models) is a newish feature that doesn't quite yet work the way most people expect.
For example, at least through gensim 3.7.1 (January 2019), you can only retrieve the total loss since the last call to train() (across multiple epochs). Some pending changes may eventually change that.
The loss-tallying is only done if requested when the model is created, via the compute_loss parameter. So if the model wasn't initially configured with this setting, there will be no loss data inside it about prior training.
You could presumably tamper with the loaded model, setting w2v_model.compute_loss = True, so that further calls to train() (with the same or new data) would collect loss data. However, note that such training will also be updating the model with respect to the current data.
You could also look at the score() method, available for some model modes, which reports a loss-related number for batches of new texts, without changing the model. It may essentially work as a way to assess whether new texts "seem like" the original training data. See the method docs, including links to the motivating academic paper and an example notebook, for more info:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score
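A minimal sketch of both approaches, assuming gensim 4.x parameter names (use size instead of vector_size on 3.x) and a tiny made-up corpus:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cats", "and", "dogs"]]

# Loss is only tallied when compute_loss=True is set at creation time.
model = Word2Vec(sentences, vector_size=50, min_count=1, compute_loss=True)
print(model.get_latest_training_loss())  # running tally since the last train() call

# score() requires hierarchical-softmax skip-gram; it rates new texts without updating the model.
hs_model = Word2Vec(sentences, vector_size=50, min_count=1, sg=1, hs=1, negative=0)
print(hs_model.score([["the", "cat", "ran"]]))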

h2o subset to "bestofFamily"

AutoML builds two stacked ensembles: one that includes "all" models and another that is a subset, the "best of family".
Is there any way to programmatically save the components and the stacked-ensemble aggregator to disk so that the "best of family" ensemble, treated as a standalone black box, can be stored, reloaded, and used without requiring the 1000 less valuable learners to exist in the same space?
If so, how do I do that?
While AutoML is running, everything stays in memory (nothing is saved to disk unless you explicitly save one of the models or objects to disk).
If you just want the "Best of Family" stacked ensemble, all you have to do is save that binary model. When you save a stacked ensemble, it saves all the required pieces (base models and metamodel) for you. Then you can re-load it later for use with another H2O cluster when you're ready to make predictions (just make sure, if you are saving a binary model, that you use the same version of H2O later on).
Python Example:
bestoffamily = h2o.get_model('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
h2o.save_model(bestoffamily, path = "/home/users/me/mymodel")
R Example:
bestoffamily <- h2o.getModel('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
h2o.saveModel(bestoffamily, path = "/home/users/me/mymodel")
Later on, you re-load the stacked ensemble into memory using h2o.load_model() in Python or h2o.loadModel() in R.
Alternatively, instead of using an H2O binary model, which requires an H2O cluster to be running at prediction time, you can use a MOJO model (different model format). It's a bit more work to use MOJOs, though they are faster and designed for production use. If you want to save a MOJO model instead, then you can use h2o.save_mojo() in Python or h2o.saveMojo() in R.
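A rough Python sketch of the reload-and-MOJO part (the paths and model id are illustrative, carried over from the example above):

import h2o
h2o.init()

# Re-load the binary model saved earlier with h2o.save_model();
# the full path includes the model id.
ensemble = h2o.load_model(
    "/home/users/me/mymodel/StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135"
)

# Or export a MOJO instead, which doesn't need a live H2O cluster at prediction time.
h2o.save_mojo(ensemble, path="/home/users/me/mymodel_mojo")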

Octave taking too long to save image files

I'm running Octave 3.8.1 on a i3 2GB DDR3 RAM powered notebook within Ubuntu 14.04 on dualboot with Windows 7.
I'm having a really hard time saving the plots I use in my seismology research. They are quite simple, and still I wait almost 5 minutes to save a single plot; the plot itself is built within seconds, but the saving is not.
Is it purely a problem with my notebook performance?
When I run a program for the first time, I get the following warnings about shadowed functions; does one of them have anything to do with it?
warning: function /home/michel/octave/specfun-1.1.0/x86_64-pc-linux-gnu-api-v49+/ellipj.oct shadows a built-in function
warning: function /home/michel/octave/specfun-1.1.0/erfcinv.m shadows a built-in function
warning: function /home/michel/octave/specfun-1.1.0/ellipke.m shadows a core library function
warning: function /home/michel/octave/specfun-1.1.0/expint.m shadows a core library function
Also, this started to happen when I upgraded from a very old version of Octave (2.8, if I'm not mistaken). It seems that the old one used the default Linux plotting backend, while the new one (3.8.1) uses its own; is that correct? Plots used to take a little longer on this notebook than on the lab PC, but not even close to 5+ minutes each.
Is there anything I can do, like upgrading something within Octave or "unshadowing" the functions mentioned above?
Thanks a lot for the help.
Shadowed functions are just a name clash, which is explained, for example, here: Warnings after octave installation
As for the low performance, Octave's renderer doesn't seem to be well optimized for writing plots with a huge number of points. For example, the following:
tic; x=1:10000; plot(sin(x)); saveas(gcf,'2123.png'); toc;
will put Octave in a coma for quite a while, even though the plot itself is made in an instant. If your data is of comparable magnitude, consider making it sparser before putting it on the graph.
There's no default Linux plot maker; there's gnuplot. You may try your luck with it by invoking
graphics_toolkit gnuplot
before plotting. (It didn't do much good for me, though. graphics_toolkit fltk will switch back to Octave's usual plotter.)
If the slowness you refer to is in saving three-dimensional plots (like mesh), the only workaround I've found on a system similar to yours is to use Alt+PrtScr.
Alternatively, you could try Octave 4.0, which has been released by now. Its changelog mentions yet another graphics toolkit.

High speed UDP receiver in MATLAB

I'd like to implement the receiving end of my system in MATLAB - this requires Gigabit Ethernet with sustained speeds of over 200 Mb/s.
Using MATLAB's built-in UDP from the Instrument Control Toolbox does not appear to be sufficient. Are there any good alternatives?
If you know Java, you can write the networking part of your code in Java classes, load those into your Matlab session with javaclasspath(), and invoke them from M-code. This could transform the problem from getting the data through Matlab's udp() function to getting the data across the Java/Matlab boundary.
If the data can be put into batches:
Use an external program to download the data to your computer, and save it to a file. Then, Matlab can read from that file whenever it needs more data. That way you partition the problem into two manageable pieces - and if you're using a decent OS, the file will never leave RAM so you won't have to worry about speed.
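A minimal sketch of such an external recorder, written here in Python with an assumed port and file name; it just appends raw UDP datagrams to a file that MATLAB can read later:

import socket

PORT = 5005               # assumed port
OUT_FILE = "capture.bin"  # assumed output file

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", PORT))
sock.settimeout(5.0)  # give up after 5 seconds of silence

with open(OUT_FILE, "wb") as f:
    try:
        while True:
            data, _addr = sock.recvfrom(65535)
            f.write(data)
    except socket.timeout:
        pass
sock.close()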
There's a very good example of a Java UDP implementation on the MathWorks site (link below):
http://www.mathworks.com/matlabcentral/fileexchange/24525-a-simple-udp-communications-application/content/judp.m
