I need to convert raw image data to jpeg.
But I don't need anything special in terms of best quality, minimum size, etc.
The only thing I need is minimum CPU usage.
I am new to JPEG compression.
Can you please advise which parameters will give the lowest CPU usage while converting to JPEG?
I would like to use IPP (Intel Integrated Performance Primitives).
An example from the IPP JPEG library would be great,
but a sample from any other JPEG library will also be appreciated.
And if you know of any JPEG library that is more performant than IPP's JPEG library, please let me know.
Thanks in advance.
Regards.
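Since you say a sample from any library is welcome: below is a minimal sketch using plain libjpeg (not IPP), with the settings that usually cost the least CPU - the fast integer DCT and no extra Huffman-optimization pass. The names raw, width, height and the output path are placeholders for your own RGB buffer and dimensions.

#include <stdio.h>
#include <jpeglib.h>

/* Encode an interleaved 8-bit RGB buffer with CPU-cheap settings. */
void encode_fast(const unsigned char *raw, int width, int height, const char *path)
{
    struct jpeg_compress_struct cinfo;
    struct jpeg_error_mgr jerr;
    FILE *out = fopen(path, "wb");

    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_compress(&cinfo);
    jpeg_stdio_dest(&cinfo, out);

    cinfo.image_width      = width;
    cinfo.image_height     = height;
    cinfo.input_components = 3;
    cinfo.in_color_space   = JCS_RGB;
    jpeg_set_defaults(&cinfo);

    jpeg_set_quality(&cinfo, 75, TRUE);    /* quality mostly affects size, not speed */
    cinfo.dct_method      = JDCT_FASTEST;  /* fast integer DCT */
    cinfo.optimize_coding = FALSE;         /* skip the extra Huffman-table pass */

    jpeg_start_compress(&cinfo, TRUE);
    while (cinfo.next_scanline < cinfo.image_height) {
        JSAMPROW row = (JSAMPROW)&raw[cinfo.next_scanline * (size_t)width * 3];
        jpeg_write_scanlines(&cinfo, &row, 1);
    }
    jpeg_finish_compress(&cinfo);
    jpeg_destroy_compress(&cinfo);
    fclose(out);
}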
Do you mean 'fastest'? As in 'uses the cpu for the least amount of time'?
If you just mean CPU load, then the best way to lower it (if you want to do something else at the same time) is to ask the operating system to deprioritize the program.
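For completeness, a minimal POSIX sketch of that approach (on Windows the rough equivalent would be SetPriorityClass). Note this only lowers scheduling priority; the encode still consumes the same total CPU time.

#include <sys/resource.h>

/* Lower the calling process's scheduling priority; a higher nice value means lower priority. */
int deprioritize_self(void)
{
    return setpriority(PRIO_PROCESS, 0, 10);
}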
Related
PNG can be optimized in different ways.
Reducing PNG size for websites seems like a good choice for cutting bandwidth usage and data transfer, and for reducing environmental impact. It's recommended by Google and other services.
But does an optimized PNG have an impact on CPU load when the computer reads it to render it? In other words, is reducing PNG size a good way to limit environmental impact?
I'm not sure that one question follows from the other, but, regarding the first one:
No: the effort spent optimizing PNG compression to make the image smaller only affects the compressor side. On the client (decompressor) side, the difference in speed or CPU usage is practically zero.
To be more precise: PNG compression is mainly influenced by two factors, the "pixel filtering" algorithm/strategy (PNG-specific) and the zlib compression level. Optimizing the first has no effect on decompression
(the "unfiltering" logic is one and the same). The second factor also has little or no influence on decompression speed (it might even be slightly beneficial).
As @leonbloy stated in his answer, the zlib compression level has little or no effect on decompression speed. The PNG filter selection does, however, have a noticeable effect: the AVG and PAETH filters both require more memory and CPU time to defilter than the simpler ones do.
Libpng allows you to control this via the "png_set_filter()" function. For maximum compression with adaptive filtering, use
png_set_filter(write_ptr, 0, PNG_ALL_FILTERS);
while to avoid the slower AVG and PAETH filtering, use
png_set_filter(write_ptr, 0, PNG_FILTER_NONE|PNG_FILTER_SUB|PNG_FILTER_UP);
or with libpng-1.6.22 and later, you can more conveniently use
png_set_filter(write_ptr, 0, PNG_FAST_FILTERS);
With pngcrush-1.8.1 or later, you can use
pngcrush -speed file.png file_pc.png
to select the NONE, SUB, and UP filters but avoid AVG and PAETH
(note that pngcrush also has a "-fast" option, but that is for a different purpose, namely to select fast compression methods).
When the intended use of the output is transmission over the net to an application, file size is the dominant factor and you'd want maximum compression. But when the file will be read from your local disc
or from memory, decoding speed dominates and you'd want the "speed" optimization, which trades some increase in file size for faster decoding.
We are using libjpeg for JPEG decoding on our small embedded platform. We have problems with speed when decoding large images. For example, an image that is 20 MB and 5000x3000 pixels takes 10 seconds to load.
I need some tips on how to improve decoding speed. On another platform with similar performance, the same image loads in two seconds.
The best improvement, from 14 seconds down to 10, came from using a larger read buffer (64 kB instead of the default 4 kB). Nothing else helped.
We do not need to display the image at full resolution, so we use scale_num and scale_denom to display it at a smaller size. But I would like more performance. Is it possible to use some kind of multithreading, different decoding settings, anything? I've run out of ideas.
First - profile the code. You're left with little more than speculation if you cannot definitively identify the bottlenecks.
Next, scour the documentation for libjpeg speedup opportunities. You mentioned scale_num and scale_denom. What about the decompressor's dct_method? I've found the JDCT_FASTEST option to be good. There are other options to check: do_fancy_upsampling, do_block_smoothing, dither_mode, two_pass_quantize, etc. Some or all of these may be useful to you, depending on your system, libjpeg version, etc.
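As a hedged sketch of those decoder knobs - they are set between jpeg_read_header() and jpeg_start_decompress():

/* cinfo is a struct jpeg_decompress_struct assumed to be already created
   with jpeg_create_decompress() and given a data source. */
jpeg_read_header(&cinfo, TRUE);
cinfo.dct_method          = JDCT_FASTEST;   /* faster, slightly less accurate IDCT */
cinfo.do_fancy_upsampling = FALSE;          /* cheaper chroma upsampling */
cinfo.do_block_smoothing  = FALSE;
cinfo.two_pass_quantize   = FALSE;          /* only matters if quantize_colors is on */
cinfo.dither_mode         = JDITHER_NONE;
cinfo.scale_num           = 1;              /* decode at 1/4 size: skips most IDCT work */
cinfo.scale_denom         = 4;
jpeg_start_decompress(&cinfo);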
If profiling tools are unavailable, there are still some things to try. First, I suspect your bottlenecks are non-CPU related. To confirm, load the compressed JPEG into a RAM buffer, then decompress it from there as you have been. Did that significantly improve the decompression time? If so, the culprit would appear to be the read operation from your image storage media. Depending on your system, reading from USB (or SD, etc.) can be slow. (Note that I'm assuming a read from external media, although hardware details are scant.) Be sure to optimize the relevant bus parameters as well (SPI clocks, configurations, etc.).
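A rough way to run that test, assuming the whole .jpg has already been read into jpg_data/jpg_size (placeholder names) and your libjpeg is v8+ or libjpeg-turbo, which provide jpeg_mem_src():

struct jpeg_decompress_struct cinfo;
struct jpeg_error_mgr jerr;
cinfo.err = jpeg_std_error(&jerr);
jpeg_create_decompress(&cinfo);
jpeg_mem_src(&cinfo, jpg_data, jpg_size);   /* decode from RAM: no filesystem I/O in the timed path */
jpeg_read_header(&cinfo, TRUE);
/* ...apply the fast settings above, jpeg_start_decompress(), read scanlines, and time it... */
jpeg_destroy_decompress(&cinfo);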
If you are reading from something like internal flash (i.e. NAND), there are some other things to inspect. How is your NAND controller configured? Have you ensured that the controller is configured for the fastest operation? Check wait states, timings, etc. Note that bus and/or memory contention can be an issue, too - so inspect their respective configurations, as well.
Finally, if you believe your system is actually CPU bound, this stackoverflow question may be of interest:
Can a high-performance jpeglib-turbo implementation decompress/compress in <100ms?
Multi-threading can only help the decode process if the target has multiple execution units for true concurrent execution; otherwise it will just time-slice the existing CPU resources. And it won't help in any case unless the library was designed to make use of it.
If you built the library from source, you should first ensure you built it with optimisation switched on, and carefully select the compiler options to match the build to your target and its instruction set, so that the compiler can use SIMD or an FPU, for example.
Also, you need to consider other possible bottlenecks. Is the 10 seconds just the time to decode, or does it include the time to read from a filesystem or network, for example? Given the improvement observed when you increased the read buffer size, it seems highly likely that it is the data read, rather than the decode, that is the limiting factor here.
If filesystem access is in fact the limiting factor rather than the decode, then there may be some benefit in moving the file read into a separate thread and passing the data to the decoder via a pipe, a queue, or multiple shared memory buffers. The decoder can then stream the decode without having to wait on filesystem blocking.
Take a look at libjpeg-turbo. If you have supported hardware then it is generally 2-4 times faster than libjpeg on the same CPU. A typical 12 MB JPEG is decoded in less than 2 seconds on a Pandaboard. You can also take a look at a speed analysis of various JPEG decoders here.
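If you do try libjpeg-turbo, its TurboJPEG C API is a compact alternative to the classic libjpeg interface. A minimal decode sketch, assuming jpegBuf/jpegSize hold the compressed file in memory:

#include <stdlib.h>
#include <turbojpeg.h>

/* Decode a JPEG held in memory to a tightly packed RGB buffer (caller frees). */
unsigned char *decode_rgb(const unsigned char *jpegBuf, unsigned long jpegSize, int *w, int *h)
{
    int subsamp, colorspace;
    tjhandle tj = tjInitDecompress();
    unsigned char *rgb = NULL;

    if (tjDecompressHeader3(tj, jpegBuf, jpegSize, w, h, &subsamp, &colorspace) == 0) {
        rgb = malloc((size_t)(*w) * (*h) * 3);
        if (tjDecompress2(tj, jpegBuf, jpegSize, rgb, *w, 0 /* pitch */, *h,
                          TJPF_RGB, TJFLAG_FASTDCT | TJFLAG_FASTUPSAMPLE) != 0) {
            free(rgb);
            rgb = NULL;
        }
    }
    tjDestroy(tj);
    return rgb;
}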
What is the best way to reduce size fast and decompress fast?
I will be using it for real-time video (desktop sharing).
And I don't think video compression will do much good there.
Speed and latency are everything here. I am currently looking at LZ4 for compressing a BMP, as it's pretty fast, but it starts to become quite the bottleneck at higher resolutions (1280x720) - and I mean in compression and decoding speed, not in CPU usage.
I prefer lossless, but if there is a lossy (visually transparent?) option I am willing to try it, of course.
EDIT:
I am looking at libjpeg-turbo, but I have no idea how to use it in C#; I can't seem to find any information on it.
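Since the question mentions LZ4: its core C API is just two calls, sketched below (from C# you would need a wrapper or P/Invoke, which is not shown here; src/srcSize are placeholders for one captured frame):

#include <stdlib.h>
#include <lz4.h>

/* Compress one frame and decompress it back; returns 1 if the round trip succeeded. */
int lz4_roundtrip(const char *src, int srcSize)
{
    int maxDst = LZ4_compressBound(srcSize);          /* worst-case compressed size */
    char *dst  = malloc(maxDst);
    char *back = malloc(srcSize);

    int csize = LZ4_compress_default(src, dst, srcSize, maxDst);   /* 0 means failure */
    int dsize = LZ4_decompress_safe(dst, back, csize, srcSize);    /* < 0 means failure */

    free(dst);
    free(back);
    return csize > 0 && dsize == srcSize;
}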
Have you tried Red5 Media Server? It's open source software for streaming video.
We have to compress a ton o' (monochrome) image data and move it quickly. If one were to use just the parallelizable stages of JPEG compression (DCT and run-length encoding of the quantized results) and run it on a GPU so each block is compressed in parallel, I am hoping that would be very fast and still yield a very significant compression factor, like full JPEG does.
Does anyone with more GPU / image compression experience have any idea how this would compare, both compression- and performance-wise, with using libjpeg on a CPU? (If it is a stupid idea, feel free to say so - I am an extreme novice in my knowledge of CUDA and the various stages of JPEG compression.) Certainly it will be less compression and hopefully(?) faster, but I have no idea how significant those factors may be.
You could hardly get more compression on a GPU - there are simply no algorithms complex enough to make use of that much compute.
When working with an algorithm as simple as JPEG, you'll spend most of the time transferring data over the PCI-E bus (which has significant latency, especially when the card does not support DMA transfers).
The positive side is that if the card has DMA, you can free up the CPU for more important work and get the image compression "for free".
In the best case, you can get about a 10x improvement on a top-end GPU compared to a top-end CPU, provided both the CPU and GPU code are well optimized.
I want to store a large amount of data onto my Arduino with a ATmega168/ATmega328 microcontroller, but unfortunately there's only 256 KB / 512 KB of EEPROM storage.
My idea is to use a compression algorithm to strip down the size. But my knowledge of compression algorithms is quite limited, and my search for ready-to-use libraries came up empty.
So, is there a good way to optimize the storage size?
You might have a look at the LZO algorithm, which is designed to be lightweight. I don't know whether there are any implementations for the AVR system, but it might be something you could implement yourself.
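For reference, the single-file miniLZO distribution exposes the one-pass LZO1X compressor as sketched below. Note that the compressor's work memory (LZO1X_1_MEM_COMPRESS) is far larger than the AVR's SRAM, so realistically you could only compress off-chip and decompress on the AVR - decompression needs no work memory at all.

#include "minilzo.h"

/* Work memory for the LZO1X-1 compressor, aligned as the library requires. */
static lzo_align_t wrkmem[(LZO1X_1_MEM_COMPRESS + sizeof(lzo_align_t) - 1) / sizeof(lzo_align_t)];

int compress_block(const unsigned char *in, lzo_uint in_len,
                   unsigned char *out, lzo_uint *out_len)
{
    if (lzo_init() != LZO_E_OK)      /* must be called once before any other LZO function */
        return -1;
    return lzo1x_1_compress(in, in_len, out, out_len, wrkmem);
}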
You may be somewhat misinformed about the amount of storage available in EEPROM on your chip though; according to the datasheet I have the EEPROM sizes are:
ATmega48P: 256
ATmega88P: 512
ATmega168P: 512
ATmega328P: 1024
Note that those values are in bytes, not the KB you mention in your question. This is not, by any measure, a large amount of storage.
AVRs only have a few kilobytes of EEPROM at the most, and very few have many more than 64K Flash (no standard Arduinos do).
If you need to store something that you seldom modify, for instance an image, you could try using the flash, as there is much more space there to work with. For simple images, some crude RLE encoding would go a long way.
Compressing anything more random - logged data, audio, etc. - will take a tremendous amount of overhead for the AVR; you will have better luck using a serial EEPROM chip to hold this data. Arduino's site has a page on interfacing with a 64K chip. If you want more than that, look at interfacing with an SD card over SPI, for instance as in this audio shield.
A NASA study is here (PostScript).
A repost of a 1989 article on LZW is here.
Keep it simple, and analyze the cost/payoff of adding compression. This includes time and effort, complexity, resource usage, data compressibility, etc.
An algorithm something like LZSS would probably be a good choice for an embedded platform. They are simple algorithms, and don't need much memory.
LZS is one I'm familiar with. It uses a 2 kB dictionary for compression and decompression (the dictionary is the most recent 2 kB of the uncompressed data stream). (LZS was patented by Hifn; however, as far as I can tell, all the patents have expired.)
But I see that the ATmega328 used on recent Arduinos has only 2 kB of SRAM (and the ATmega168 only 1 kB), so maybe even LZS is too big for it. I'm sure you could use a variant with a smaller dictionary, but I'm not sure what compression ratios you'd achieve.
The method described in the paper “Data Compression Algorithms for Energy-Constrained Devices in Delay Tolerant Networks” might run on an ATmega328.
Reference: C. Sadler and M. Martonosi, "Data Compression Algorithms for Energy-Constrained Devices in Delay Tolerant Networks," Proceedings of the ACM Conference on Embedded Networked Sensor Systems (SenSys 2006), November 2006.
S-LZW Source for MSPGCC: slzw.tar.gz. Updated 10 March 2007.
You might also want to take a look at LZJB, being very short, simple, and lightweight.
Also, FastLZ might be worth a look. It gets better compression ratios than LZJB and has pretty minimal memory requirements for decompression.
If you just want to remove some repeated zeros or the like, use run-length encoding (RLE).
Repeating byte sequences will be stored as:
<mark><byte><count>
It's a super-simple algorithm, which you can probably code yourself in a few lines of code, as sketched below.
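A sketch of that scheme in C, assuming a MARK byte value that is rare in your data (a literal MARK byte is escaped as a run of length 1, and only runs of 4 or more bytes are worth encoding):

#include <stddef.h>
#include <stdint.h>

#define MARK 0xAA   /* pick a value that rarely occurs in your data */

size_t rle_encode(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i] && run < 255)
            run++;
        if (run >= 4 || in[i] == MARK) {   /* store as <mark><byte><count> */
            out[o++] = MARK;
            out[o++] = in[i];
            out[o++] = (uint8_t)run;
        } else {                           /* short runs are cheaper as literals */
            for (size_t k = 0; k < run; k++)
                out[o++] = in[i];
        }
        i += run;
    }
    return o;   /* compressed size; worst case is 3 bytes per literal MARK byte */
}

The matching decoder just copies bytes until it sees MARK, then expands the following <byte><count> pair.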
Is an external EEPROM (for example via I2C) not an option? Even if you use a compression algorithm, the downside is that the amount of data you can store in the internal EEPROM can no longer be determined in a simple way.
And of course, if you really mean kilobytes, then consider an SD card connected to the SPI bus. There are some lightweight open source FAT-compatible file systems on the net.
heatshrink is a data compression/decompression library for embedded/real-time systems, based on LZSS. It claims to be able to run in under 100 bytes of memory.