spaCy - reducing the size of the library - pip

Is it possible to reduce the size of this library?
Currently, python3.6/site-packages/spacy/lang is a 380 MiB folder, where I see a lot of languages I never use. I only use en. Is it possible to somehow pip install spaCy and have only the language I need in this folder?

This is improved in v2.2+, where the language-specific data has been moved out of the main library into a separate package. The size depends a bit on your OS, but should be closer to ~50 MB.
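As a quick sketch of the v2.2+ workflow (the model name en_core_web_sm is the standard small English pipeline; it is not mentioned in the original answer), only the English data gets installed:
# pip install spacy
# python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")          # loads only the English pipeline
doc = nlp("Only the English data is installed.")
print([token.pos_ for token in doc])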

There's some nice work being done to get the size of spaCy down for serverless deployments like AWS Lambda:
https://github.com/ryfeus/lambda-packs
In addition to trimming down the repository, they also move the language models to S3 to limit the size requirements of the package.
Also see here for a Python script to prune the languages:
https://gist.github.com/jshhrrsn/5377b9dd282ef51f5564f1347a7d5aef
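A hedged sketch of what such a pruning script might look like (the spacy/lang/<code> on-disk layout is an assumption here, so double-check paths before deleting anything):
import shutil
from pathlib import Path
import spacy

KEEP = {"en"}                                # language codes you actually use
lang_dir = Path(spacy.__file__).parent / "lang"
for sub in lang_dir.iterdir():
    if sub.is_dir() and not sub.name.startswith("_") and sub.name not in KEEP:
        shutil.rmtree(sub)                   # e.g. removes site-packages/spacy/lang/de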

Related

How to compile SCSS faster

Simple topic, simple question: is there a way to compile SCSS faster when you have a MASSIVE folder of partial files like this?
I know that the more partial files you have, the slower the compile is, but I'd like to know if there is a way to make it faster.
In general, Sass is compiled by compilers written in different programming languages. If the speed of any of these compilers is too slow for you, you can use Sass directly via https://sass-lang.com/dart-sass, or use compilers written in faster programming languages such as Java.
This is a good answer (--link--) 👇
There are three things to think about:
Sass becomes slow as more SASS files are included in the process. Big SASS frameworks tend to use a lot of files, and once you pull in a lot of big modules, compilation can slow down heavily. Sometimes more modules are included than are actually needed.
Often the standard project settings try to do a lot of work at the same time. For example, writing minified files in the same process simply doubles the time; if that is the case, just prepare the minified files at the end of your work. On top of that, additional post-processors such as autoprefixers, linters and maybe PostCSS need extra time ... which counts double when minified files are written at the same time.
JS Sass compilers are slower in general, so you can save time by using native SASS directly. This may not be as convenient, but in big projects it has helped me a lot. If you want to try it, here is the link with installation instructions: https://sass-lang.com/install
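As a rough illustration only (the paths are placeholders, not from the answer), a one-off build and a watch-mode build with the Dart Sass command-line tool look like this:
sass --no-source-map scss/main.scss css/main.css
sass --watch scss:css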

Is there any way to optimize PyTorch inference on CPU?

I am going to serve a PyTorch model (ResNet-18) on a website.
However, inference on the CPU (AMD 3600) requires 70% of the CPU resources.
I don't think the server (Heroku) can handle this computation.
Is there any way to optimize inference on the CPU?
Many thanks.
Admittedly, I'm not an expert on Heroku, but you can probably use OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes inference performance by, for example, graph pruning and fusing some operations together. Here are the performance benchmarks for ResNet-18 converted from PyTorch.
You can find a full tutorial on how to convert the PyTorch model here. Some snippets below.
Install OpenVINO
The easiest way to do it is using pip. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[pytorch,onnx]
Save your model to ONNX
OpenVINO cannot convert a PyTorch model directly for now, but it can do it via an ONNX model. This sample code assumes the model is for computer vision.
# Assumes `model`, IMAGE_HEIGHT and IMAGE_WIDTH are already defined
import torch
dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
Use Model Optimizer to convert ONNX model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package, so be sure you have installed it. It converts the ONNX model to OpenVINO format (aka IR), which is the default format for OpenVINO. It also changes the precision to FP16 (to further increase performance). Run in the command line:
mo --input_model "model.onnx" --input_shape "[1,3, 224, 224]" --mean_values="[123.675, 116.28 , 103.53]" --scale_values="[58.395, 57.12 , 57.375]" --data_type FP16 --output_dir "model_ir"
Run the inference on the CPU
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice for you is, just use AUTO.
# Load the network (OpenVINO runtime API)
from openvino.runtime import Core
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image (a preprocessed NCHW float array)
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
You can try pruning and quantizing your model (techniques to compress model size for deployment, allowing inference speed-ups and energy savings without significant accuracy loss). There are examples of model pruning and quantization on the PyTorch website that you can check:
https://pytorch.org/tutorials/intermediate/pruning_tutorial.html
https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html
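As a hedged sketch of the quantization route (not from the answer above): dynamic quantization only converts nn.Linear/nn.LSTM modules, so on a conv-heavy ResNet-18 it mainly affects the final fully connected layer, and the static quantization tutorial is usually the better fit for convnets.
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()
# Quantize the Linear layers to int8 weights; activations stay float
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)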

Does Doxygen cache parsed data? If not, can I make it do so?

I'm running Doxygen on a huge project for the third time, but it seems to take as long as the first time. Is it possible to enable caching of parsed files that haven't changed?
The size of the symbol lookup cache can be set.
The config item is LOOKUP_CACHE_SIZE.
I've not used doxygen for a long time so I refer you to the docs here:
Doxygen Config
EDIT: You say you need the dependency graphs. Can you reduce the graph depth?
You could try turning off the SEARCH_INCLUDES option. Then it will only reference your files and not the supporting SDK etc.
If increasing the cache size to a high value makes no difference, perhaps you can run the documentation command overnight, or defer it to another machine?
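As a sketch of where those knobs live in the Doxyfile (the values are illustrative only; check the defaults for your Doxygen version):
# The symbol lookup cache holds 2^(16+N) symbols; raise it if Doxygen warns that the cache is too small
LOOKUP_CACHE_SIZE      = 4
# Do not parse headers found via the include search path (external SDKs etc.)
SEARCH_INCLUDES        = NO
# Limit the depth of the generated dependency graphs
MAX_DOT_GRAPH_DEPTH    = 2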
you could try tag files, which are explained in this question: doxygen is slow
I hope this helps. Your project sounds larger than the ones I've used doxygen with in the past.

Store Images to display in SOLR search results

I have built a Solr index which has the image thumbnail URLs, and I want to render an image along with the search results. The problem is that those images can run into the millions, and I think storing the images in the index as binary data would make the index humongous.
I am seeking guidance on how to efficiently store those images after fetching them from the URLs. Should I use the plain file system and have them served by Tomcat, or should I use a JCR repository like Apache Jackrabbit?
Any guidance would be greatly appreciated.
Thank You.
I would evaluate the actual requirements before finally deciding how to persist the images.
Do you require versioning?
Are you planning to store only the images, or additional metadata as well?
Do you have any requirements in horizontal scaling?
Do you require any image processing or scaling?
Do you need access to the image metadata?
Do you require additional tooling for managing the images?
Are you willing to invest time in learning an additional technology?
Storing the images on the file system and making them available through an image spooler implementation is the simplest way to persist them.
But if you identify some of the above-mentioned requirements (which are typical for a content repository or a DAM system), then you would end up reinventing the wheel with the file-system approach.
The other option is using a kind of content repository. A JCR repo, for example Jackrabbit or its commercial implementation CRX, is one option; Alfresco (which supports CMIS) would be another valid one.
Features like versioning, post-processing (scaling ...), metadata extraction and management are supported by both of these repository solutions. But this requires you to learn a new technology, which can be time-consuming, and both repository technologies can get complex.
If horizontal scaling is a requirement, I would consider a commercially supported repository implementation (CRX or Alfresco Enterprise), because the community releases lack this functionality.
Personally, I would base any decision on the above-mentioned requirements.
I have worked extensively with Jackrabbit, CRX and Alfresco CE and EE, and personally I would go for Alfresco, as in my experience it scales better with larger amounts of data.
I'm not aware of an image spooling solution that fits your needs exactly, but it shouldn't be too difficult to implement one, apart from the fact that recurring scaling operations may be very resource-intensive.
I would go for the following approach if the file system is enough for you:
Separate images and thumbnails into two locations. The images root folder remains permanent; the thumbnails folder is temporary.
Create a temporary thumbnail folder for each indexing run. All thumbnails for that run are stored under that location, and scaling can be achieved with e.g. ImageMagick (see the sketch below).
The temporary thumbnail folder can then easily be dropped as soon as the next run has been completed.
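A rough illustration of such a per-run thumbnail pass (paths and sizes are placeholders, and it assumes the ImageMagick convert command is installed):
import subprocess
import tempfile
from pathlib import Path

IMAGES_ROOT = Path("/data/images")                    # permanent originals (hypothetical path)
run_dir = Path(tempfile.mkdtemp(prefix="thumbs-"))    # temporary folder for this indexing run

for src in IMAGES_ROOT.glob("*.jpg"):
    dst = run_dir / src.name
    # Let ImageMagick do the scaling
    subprocess.run(["convert", str(src), "-thumbnail", "200x200", str(dst)], check=True)

# Once the next run has completed, the whole run_dir can simply be deleted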
If you are planning to store millions of images, then avoid putting all files in the same directory; browsing flat hierarchies with too many entries will be a nightmare.
Better to create a tree structure, e.g. derived from the current datetime (year/month/day/hour/minute ... 2013/06/01/08/45).
This makes sure that the number of files inside the last folder does not get too big (Alfresco uses the same pattern for storing binary objects on the file system, and it has proven to work nicely).
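A minimal sketch of that datetime-based layout (names are illustrative, not from the answer):
from datetime import datetime, timezone
from pathlib import Path

def storage_path(root: str, filename: str) -> Path:
    # year/month/day/hour/minute keeps any single directory small
    now = datetime.now(timezone.utc)
    return Path(root) / now.strftime("%Y/%m/%d/%H/%M") / filename

print(storage_path("/data/images", "thumb_0001.png"))
# e.g. /data/images/2013/06/01/08/45/thumb_0001.png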

How to deal with memory leaks in RMagick in Ruby?

I'm developing a web application with Merb and I'm looking for a safe and stable image-processing library. I used to work with Imagick in PHP, then moved to Ruby and started using RMagick. But there is a problem: long-running scripts cause memory leaks. A couple of solutions exist, but I don't know which one is the most stable. So, what do you think?
Right now, my app uses an internal API that I wrote in PHP to process images. It's running on a separate server along with other applications, so it's not a big problem. But I think it's not a good architecture.
Anyway, I'll consider any practical tips.
I too have encountered this issue - the solution is to force garbage collection.
When you have reassigned the image variable to a new image, simply use GC.start to ensure the old reference is released from memory.
On later versions of RMagick, I believe you can also call destroy! on the image when you have finished processing it.
A combination of the two would probably ensure you are covered, but I'm not sure of the real-life impact on performance (I would assume it is negligible in most cases).
Alternatively, you could use mini_magick, which is a wrapper for the ImageMagick command-line client.
When using RMagick, it's important to remember to destroy the image once you are done; otherwise you will fill up the /tmp dir when working with large sets of images. For example, you must call destroy!:
require 'RMagick'
Dir.foreach('/home/tiffs/') do |file|
  next if file == '.' or file == '..'
  # Read via the full path; Dir.foreach only yields bare file names
  image = Magick::Image.read("/home/tiffs/#{file}").first
  image.format = "PNG"
  image.write("/home/png/#{File.basename(file, '.*')}.png")
  # Free the pixel data held outside the Ruby heap
  image.destroy!
end
Actually, it isn't really a Ruby-specific problem; other interpreters share it as well. The concrete problem is that Ruby's GC only sees memory that was allocated by Ruby itself, and not by external libraries (with the notable exception of libraries that use Ruby's memory-management facilities). So an ImageMagick object in Ruby memory space is really small, but the image in the space managed by ImageMagick is large. This is not a leak per se, but it behaves like one.
Ruby's garbage collector never kicks in if your process stays under a certain limit (8 MB is the standard). As ImageMagick never creates large objects in Ruby space, the GC probably never kicks in. So either you use the proposed method of spawning a new process or using exec, or, rather nifty, you run an image-processing service in the backend that forks for every task. Another option would be to have some kind of monitoring in place that kick-starts the GC every once in a while.
There is another library called MagickWand by Timothy Paul Hunter (the author of RMagick) that tries to address these issues and create a nicer API. It's in alpha and requires a rather new release of ImageMagick, though.
Now you can tell ImageMagick which memory space should be used.
I think RMAGICK_ENABLE_MANAGED_MEMORY = true and GC.start is what you need.
MANAGED_MEMORY
If true, RMagick is using Ruby managed memory for all allocations. If false, RMagick allocates memory for objects directly from the operating system. You can enable RMagick to use Ruby managed memory (when built with ImageMagick 6.4.0-11 and later) by setting
RMAGICK_ENABLE_MANAGED_MEMORY = true
before requiring RMagick.
https://rmagick.github.io/constants.html
However, image.destroy! itself is enough to stabilize the memory consumption.
This is not due to ImageMagick; it's due to Ruby itself, and it's a well known problem. My suggestion is to split your program into two parts: a long-running part that allocates little memory and just deals with the control of the system, and a separate program that actually does the processing work. The long-running control process should do just enough to find some work for a child process that it spawns, and the child should do all of the processing for that particular work item.
Another option would be to leave the two combined, but after a work unit is complete, use exec to replace your process with a freshly started version of the same program, which would search for another work item, process it, and exec itself again.
This is assuming that the work items are fairly large, which they almost certainly are if you're using ImageMagick. If they're not, you'll find that the overhead of spawning a new process and having the Ruby interpreter re-parse your entire program starts to get a little too large. You can deal with this by having your program do more work units (say, ten or a hundred) before re-executing itself.

Resources