We are using libjpeg for JPEG decoding on our small embedded platform. We have problems with speed when decoding large images. For example, an image that is 20 MB and 5000x3000 pixels takes 10 seconds to load.
I need some tips on how to improve decoding speed. On another platform with similar performance, the same image loads in two seconds.
The best improvement, from 14 seconds down to 10 seconds, came from using a larger read buffer (64 kB instead of the default 4 kB). Nothing else has helped.
We do not need to display the image at full resolution, so we use scale_num and scale_denom to display it at a smaller size. But I would like more performance. Is it possible to use some kind of multithreading? Different decoding settings? Anything; I have run out of ideas.
First - profile the code. You're left with little more than speculation if you cannot definitively identify the bottlenecks.
Next, scour the documentation for libjpeg speedup opportunities. You mentioned scale_num and scale_denom. What about the decompressor's dct_method? I've found that the JDCT_FASTEST option is good. There are other options to check: do_fancy_upsampling, do_block_smoothing, dither_mode, two_pass_quantize, etc. Some or all of these may be useful to you, depending on your system, libjpeg version, etc.
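For reference, here is roughly where those knobs live in the libjpeg decompression API. This is a minimal sketch, assuming libjpeg 6b+ or libjpeg-turbo; the file path, scale factor, and the "copy the row somewhere" step are placeholders, and error handling is omitted:

    // Sketch: configure libjpeg for speed over quality (assumes libjpeg 6b+ / libjpeg-turbo).
    #include <cstdio>
    #include <jpeglib.h>

    bool decode_fast(const char *path, unsigned scale_denom /* 1, 2, 4 or 8 */)
    {
        std::FILE *fp = std::fopen(path, "rb");
        if (!fp) return false;

        jpeg_decompress_struct cinfo;
        jpeg_error_mgr jerr;
        cinfo.err = jpeg_std_error(&jerr);
        jpeg_create_decompress(&cinfo);
        jpeg_stdio_src(&cinfo, fp);
        jpeg_read_header(&cinfo, TRUE);

        // Speed-oriented settings: trade image quality for decode time.
        cinfo.dct_method          = JDCT_FASTEST;   // fast integer IDCT
        cinfo.do_fancy_upsampling = FALSE;          // cheap chroma upsampling
        cinfo.do_block_smoothing  = FALSE;
        cinfo.two_pass_quantize   = FALSE;          // only matters if quantize_colors is set
        cinfo.dither_mode         = JDITHER_NONE;
        cinfo.scale_num           = 1;              // decode at 1/scale_denom of full size
        cinfo.scale_denom         = scale_denom;

        jpeg_start_decompress(&cinfo);

        JSAMPARRAY row = (*cinfo.mem->alloc_sarray)(
            (j_common_ptr)&cinfo, JPOOL_IMAGE,
            cinfo.output_width * cinfo.output_components, 1);

        while (cinfo.output_scanline < cinfo.output_height) {
            jpeg_read_scanlines(&cinfo, row, 1);
            // ... copy row[0] into your frame buffer here ...
        }

        jpeg_finish_decompress(&cinfo);
        jpeg_destroy_decompress(&cinfo);
        std::fclose(fp);
        return true;
    }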
If profiling tools are unavailable, there are still some things to try. First, I suspect your bottlenecks are non-CPU related. To confirm, load the compressed image file into a RAM buffer, then decompress it from there as you have been. Did that significantly improve the decompression time? If so, the culprit would appear to be the read operation from your image storage media. Depending on your system, reading from USB (or SD, etc.) can be slow. (Note that I'm assuming a read from external media - although hardware details are scant.) Be sure to optimize relevant bus parameters as well (SPI clocks, configurations, etc.).
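To run that experiment, you can slurp the whole file into RAM first and decode from the buffer. A rough sketch follows, assuming jpeg_mem_src is available (it is in libjpeg 8+ and libjpeg-turbo; on older versions you would need a small custom source manager). Time step 1 and step 2 separately:

    // Sketch: decode with the whole file pre-loaded in RAM, to separate storage I/O from CPU cost.
    #include <cstdio>
    #include <vector>
    #include <jpeglib.h>

    void decode_from_ram(const char *path)
    {
        // 1. Read the entire file into a buffer (this part is pure I/O).
        std::FILE *fp = std::fopen(path, "rb");
        std::fseek(fp, 0, SEEK_END);
        long size = std::ftell(fp);
        std::fseek(fp, 0, SEEK_SET);
        std::vector<unsigned char> buf(size);
        std::fread(buf.data(), 1, buf.size(), fp);
        std::fclose(fp);

        // 2. Decode from the buffer (this part is mostly CPU) - time it separately.
        jpeg_decompress_struct cinfo;
        jpeg_error_mgr jerr;
        cinfo.err = jpeg_std_error(&jerr);
        jpeg_create_decompress(&cinfo);
        jpeg_mem_src(&cinfo, buf.data(), (unsigned long)buf.size());
        jpeg_read_header(&cinfo, TRUE);
        jpeg_start_decompress(&cinfo);

        std::vector<unsigned char> row(cinfo.output_width * cinfo.output_components);
        unsigned char *rows[1] = { row.data() };
        while (cinfo.output_scanline < cinfo.output_height)
            jpeg_read_scanlines(&cinfo, rows, 1);

        jpeg_finish_decompress(&cinfo);
        jpeg_destroy_decompress(&cinfo);
    }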
If you are reading from something like internal flash (i.e. NAND), there are some other things to inspect. How is your NAND controller configured? Have you ensured that the controller is configured for the fastest operation? Check wait states, timings, etc. Note that bus and/or memory contention can be an issue, too - so inspect their respective configurations, as well.
Finally, if you believe your system is actually CPU bound, this stackoverflow question may be of interest:
Can a high-performance jpeglib-turbo implementation decompress/compress in <100ms?
Multi-threading could only help the decode process if the target has multiple execution units for true concurrent execution; otherwise it will just time-slice existing CPU resources. It won't help in any case unless the library were designed to make use of it.
If you built the library from source, you should first ensure you built it with optimisation switched on, and carefully select the compiler options to match the build to your target and its instruction set, to enable the compiler to use SIMD or an FPU, for example.
You also need to consider other possible bottlenecks. Is the 10 seconds just the time to decode, or does it include the time to read from a filesystem or network, for example? Given the improvement observed when you increased the read buffer size, it seems highly likely that it is the data read rather than the decode that is limiting in this case.
If the filesystem access is in fact the limiting factor rather than the decode, then there may be some benefit in separating the file read from the decode into a separate thread and passing the data via a pipe, queue, or multiple shared memory buffers to the decoder. The decoder can then stream the decode without having to wait on filesystem blocking.
Take a look at libjpeg-turbo. If you have supported hardware then it is generally 2-4 times faster than libjpeg on the same CPU. A typical 12 MB JPEG is decoded in less than 2 seconds on a Pandaboard. You can also take a look at a speed analysis of various JPEG decoders here.
Related
I'm studying blocked sort-based indexing, and the algorithm talks about loading files in blocks of 32 or 64 kB, because disk reads happen in blocks, so this is efficient.
My first question is: how am I supposed to load a file by block? A buffered reader of 64 kB? And if I use a Java input stream, has this optimization already been done so that I can just consume the stream?
I actually use Apache Spark, so does sparkContext.textFile() do this optimization? What about Spark Streaming?
I don't think on the JVM you have any direct view onto the file system that would make it meaningful to align reads and block-sizes. Also there are various kinds of drives and many different file systems now, and block sizes would most likely vary or even have little effect on the total I/O time.
The best performance would probably come from using java.nio.FileChannel; then you can experiment with reading ByteBuffers of given block sizes to see if it makes any performance difference. I would guess the only effect you would see is that the JVM overhead for very small buffers matters more (extreme case: reading byte by byte).
You may also use the file-channel's map method to get hold of a MappedByteBuffer.
I am currently involved in developing a large scientific computing project, and I am exploring the possibility of hardware acceleration with GPUs as an alternative to the MPI/cluster approach. We are in a mainly memory-bound situation, with more data than can fit in GPU memory. To this end, I have two questions:
1) The books I have read say that it is illegal to access memory on the host with a pointer on the device (for obvious reasons). Instead, one must copy the memory from the host's memory to the device memory, then do the computations, and then copy back. My question is whether there is a work-around for this -- is there any way to read a value in system RAM from the GPU?
2) More generally, what algorithms/solutions exist for optimizing the data transfer between the CPU and the GPU during memory-bound computations such as these?
Thanks for your help in this! I am enthusiastic about making the switch to CUDA, simply because the parallelization is much more intuitive!
1) Yes, you can do this with most GPGPU packages.
The one I'm most familiar with - the AMD Stream SDK - lets you allocate a buffer in "system" memory and use that as a texture that is read or written by your kernel. CUDA and OpenCL have the same ability; the key is to set the correct flags on the buffer allocation.
BUT...
You might not want to do that because the data is being read/written across the PCIe bus, which has a lot of overhead.
The implementation is free to interpret your requests liberally. You can tell it to locate the buffer in system memory, but the software stack is free to do things like relocate it into GPU memory on the fly - as long as the computed results are the same.
2) All of the major GPGPU software environments (CUDA, OpenCL, the Stream SDK) support DMA transfers, which is what you probably want.
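As a rough illustration of that DMA path in CUDA, here is a sketch using pinned host buffers and cudaMemcpyAsync on two streams, so transfers for one chunk can overlap kernel work on another. The process_chunk kernel and the chunk size are placeholders, and error checking is omitted; this is a sketch, not a tuned implementation:

    // Sketch (CUDA runtime API): stream a host-resident array through the GPU in chunks.
    #include <cstring>
    #include <cuda_runtime.h>

    __global__ void process_chunk(float *data, size_t n)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;                // trivial placeholder computation
    }

    void stream_through_gpu(const float *src, float *dst, size_t total, size_t chunk)
    {
        float *h_pin[2], *d_buf[2];
        cudaStream_t stream[2];
        size_t pend_off[2] = {0, 0}, pend_n[2] = {0, 0};   // result waiting in h_pin[s]

        for (int i = 0; i < 2; ++i) {
            cudaHostAlloc(&h_pin[i], chunk * sizeof(float), cudaHostAllocDefault);
            cudaMalloc(&d_buf[i], chunk * sizeof(float));
            cudaStreamCreate(&stream[i]);
        }

        int s = 0;
        for (size_t off = 0; off < total; off += chunk, s ^= 1) {
            size_t n = (total - off < chunk) ? total - off : chunk;

            cudaStreamSynchronize(stream[s]);          // this buffer is free again
            if (pend_n[s])                             // drain its previous result
                std::memcpy(dst + pend_off[s], h_pin[s], pend_n[s] * sizeof(float));

            std::memcpy(h_pin[s], src + off, n * sizeof(float));    // stage input in pinned RAM
            cudaMemcpyAsync(d_buf[s], h_pin[s], n * sizeof(float),
                            cudaMemcpyHostToDevice, stream[s]);
            process_chunk<<<(unsigned)((n + 255) / 256), 256, 0, stream[s]>>>(d_buf[s], n);
            cudaMemcpyAsync(h_pin[s], d_buf[s], n * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[s]);
            pend_off[s] = off;
            pend_n[s]   = n;
        }

        for (int i = 0; i < 2; ++i) {                  // flush whatever is still in flight
            cudaStreamSynchronize(stream[i]);
            if (pend_n[i])
                std::memcpy(dst + pend_off[i], h_pin[i], pend_n[i] * sizeof(float));
            cudaStreamDestroy(stream[i]);
            cudaFree(d_buf[i]);
            cudaFreeHost(h_pin[i]);
        }
    }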
Even if you could do this, you probably wouldn't want to, since transfers over PCI-whatever will tend to be a bottleneck, whereas bandwidth between the GPU and its own memory is typically very high.
Having said that, if you have relatively little computation to perform per element on a large data set then GPGPU is probably not going to work well for you anyway.
I suggest the CUDA Programming Guide.
You will find many answers there.
Check for streams, unified addressing, cudaHostRegister.
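For example, a minimal zero-copy sketch with cudaHostRegister (not necessarily the right choice for your workload, since every access crosses PCIe). The kernel and sizes are placeholders; it assumes a device that can map host memory and compute capability 2.0+ for the float atomicAdd:

    // Sketch: let a kernel read host RAM directly (zero-copy) via cudaHostRegister.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void sum_kernel(const float *in, float *out, size_t n)
    {
        float s = 0.0f;                            // trivial placeholder computation
        for (size_t i = threadIdx.x; i < n; i += blockDim.x) s += in[i];
        atomicAdd(out, s);                         // needs compute capability 2.0+
    }

    int main()
    {
        cudaSetDeviceFlags(cudaDeviceMapHost);     // allow mapped host memory

        const size_t n = 1 << 20;
        float *host = static_cast<float*>(std::malloc(n * sizeof(float)));
        for (size_t i = 0; i < n; ++i) host[i] = 1.0f;

        // Pin the existing allocation and obtain a device-visible alias for it.
        cudaHostRegister(host, n * sizeof(float), cudaHostRegisterMapped);
        float *dev_alias = nullptr;
        cudaHostGetDevicePointer(reinterpret_cast<void**>(&dev_alias), host, 0);

        float *d_out;
        cudaMalloc(&d_out, sizeof(float));
        cudaMemset(d_out, 0, sizeof(float));

        sum_kernel<<<1, 256>>>(dev_alias, d_out, n);   // kernel reads host RAM over PCIe
        cudaDeviceSynchronize();

        float result = 0.0f;
        cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
        std::printf("sum = %f (expected %zu)\n", result, n);

        cudaFree(d_out);
        cudaHostUnregister(host);
        std::free(host);
        return 0;
    }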
I want to store a large amount of data onto my Arduino with an ATmega168/ATmega328 microcontroller, but unfortunately there's only 256 KB / 512 KB of EEPROM storage.
My idea is to make use of a compression algorithm to strip down the size. But well, my knowledge of compression algorithms is quite low and my search for ready-to-use libraries failed.
So, is there a good way to optimize the storage size?
You might have a look at the LZO algorithm, which is designed to be lightweight. I don't know whether there are any implementations for the AVR system, but it might be something you could implement yourself.
You may be somewhat misinformed about the amount of storage available in EEPROM on your chip, though; according to the datasheet I have, the EEPROM sizes are:
ATmega48P: 256
ATmega88P: 512
ATmega168P: 512
ATmega328P: 1024
Note that those values are in bytes, not KB as you mention in your question. This is not, by any measure, a huge amount of space.
AVRs only have a few kilobytes of EEPROM at the most, and very few have much more than 64 KB of Flash (no standard Arduinos do).
If you need to store something that you seldom modify, for instance an image, you could try using the Flash, as there is much more space there to work with. For simple images, some crude RLE encoding would go a long way.
Compressing anything more random, for instance logged data, audio, etc., will take a tremendous amount of overhead for the AVR; you will have better luck getting a serial EEPROM chip to hold this data. Arduino's site has a page on interfacing with a 64K chip. If you want more than that, look at interfacing with an SD card over SPI, for instance as in this audio shield.
A NASA study here (PostScript)
A repost of a 1989 article on LZW here
Keep it simple and perform analysis of the cost/payout of adding compression. This includes time and effort, complexity, resource usage, data compressibility, etc.
An algorithm something like LZSS would probably be a good choice for an embedded platform. They are simple algorithms, and don't need much memory.
LZS is one I'm familiar with. It uses a 2 kB dictionary for compression and decompression (the dictionary is the most recent 2 kB of the uncompressed data stream). (LZS was patented by HiFn, however as far as I can tell, all patents have expired.)
But I see that an ATmega328, used on recent Arduinos, only has 512 bytes to 2 kB SRAM, so maybe even LZS is too big for it. I'm sure you could use a variant with a smaller dictionary, but I'm not sure what compression ratios you'd achieve.
The method described in the paper “Data Compression Algorithms for Energy-Constrained Devices in Delay Tolerant Networks” might run on an ATmega328.
Reference: C. Sadler and M. Martonosi, "Data Compression Algorithms for Energy-Constrained Devices in Delay Tolerant Networks," Proceedings of the ACM Conference on Embedded Networked Sensor Systems (SenSys 2006), November 2006. (PDF)
S-LZW Source for MSPGCC: slzw.tar.gz. Updated 10 March 2007.
You might also want to take a look at LZJB, being very short, simple, and lightweight.
Also, FastLZ might be worth a look. It gets better compression ratios than LZJB and has pretty minimal memory requirements for decompression:
If you just want to remove some repeating zeros or such, use Run-length encoding.
Repeating byte sequences will be stored as:
<mark><byte><count>
It's a super-simple algorithm, which you can probably code yourself in a few lines of code.
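A minimal sketch of that <mark><byte><count> scheme (the marker value is an arbitrary choice here, and the caller is responsible for sizing the output buffer - a lone marker byte in the input grows to three bytes):

    // Sketch: byte-oriented RLE with an escape marker, as described above.
    // Runs of 4+ identical bytes (and any literal marker byte) become MARK <byte> <count>;
    // everything else is copied through unchanged.
    #include <stddef.h>
    #include <stdint.h>

    #define RLE_MARK 0xA5   /* arbitrary escape byte */

    size_t rle_encode(const uint8_t *in, size_t len, uint8_t *out)
    {
        size_t o = 0;
        for (size_t i = 0; i < len; ) {
            size_t run = 1;
            while (i + run < len && in[i + run] == in[i] && run < 255) run++;
            if (run >= 4 || in[i] == RLE_MARK) {       // worth escaping (or must escape)
                out[o++] = RLE_MARK;
                out[o++] = in[i];
                out[o++] = (uint8_t)run;
            } else {                                    // short run of ordinary bytes: copy as-is
                for (size_t k = 0; k < run; ++k) out[o++] = in[i];
            }
            i += run;
        }
        return o;                                       // compressed size
    }

    size_t rle_decode(const uint8_t *in, size_t len, uint8_t *out)
    {
        size_t o = 0;
        for (size_t i = 0; i < len; ) {
            if (in[i] == RLE_MARK) {
                uint8_t value = in[i + 1], count = in[i + 2];
                for (uint8_t k = 0; k < count; ++k) out[o++] = value;
                i += 3;
            } else {
                out[o++] = in[i++];
            }
        }
        return o;                                       // decompressed size
    }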
Is an external EEPROM (for example via I2C) not an option? Even if you use a compression algorithm, the downside is that the amount of data you can store in the internal EEPROM can no longer be determined in a simple way.
And of course, if you really mean kilobytes, then consider an SD card connected to the SPI... There are some lightweight open-source FAT-compatible file systems on the net.
heatshrink is a data compression/decompression library for embedded/real-time systems based on LZSS. It says it can run in under 100 bytes of memory.
I have a Direct3D 9 application and I would like to monitor the memory usage.
Is there a tool to know how much system and video memory is used by Direct3D?
Ideally, it would also report how much is allocated for textures, vertex buffers, index buffers...
You can use the old DirectDraw interface to query the total and available memory.
The numbers you get that way are not reliable though.
The free memory may change at any instant, and the available memory often takes the AGP memory into account (which is strictly not video memory). You can use the numbers to make a good guess about the default texture resolutions and detail level of your application/game, but that's it.
You may wonder why there is no way to get better numbers; after all, it can't be too hard to track the resource usage.
From an application point of view this is correct. You may think that the video memory just contains surfaces, textures, index- and vertex buffers and some shader-programs, but that's not true on the low-level side.
There are lots of other resources as well. All of these are created and managed by the Direct3D driver to make the rendering as fast as possible. Among others there are hierarchical z-buffer acceleration structures and pre-compiled command lists (e.g. the data required to render something in the format understood by the GPU). The driver may also queue rendering commands for multiple frames in advance to even out the frame rate and increase parallelism between the GPU and CPU.
The driver also does a lot of work under the hood for you. Heuristics are used to detect draw calls with static geometry and constant rendering settings. A driver may decide to optimize the geometry in these cases for better cache usage. This all happens in parallel and under the control of the driver. All this stuff needs space as well, so the free memory may change at any time.
However, the driver also does caching for your resources, so you don't really need to know the resource usage in the first place.
If you need more space than is available, that's no problem. The driver will move the data between system RAM, AGP memory and video RAM for you. In practice you never have to worry about running out of video memory. Sure - once you need more video memory than is available, the performance will suffer, but that's life :-)
Two suggestions:
You can call GetAvailableTextureMem at various times to obtain a (rough) estimate of overall memory usage progression.
Assuming you develop on NVIDIA hardware, PerfHUD includes a graphical representation of consumed AGP/video memory (separated).
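For the first suggestion, a minimal sketch (device creation omitted; keep in mind the returned figure is rounded to megabytes and includes AGP/shared memory, so it is mainly useful for watching the trend across allocations):

    // Sketch: sample IDirect3DDevice9::GetAvailableTextureMem around large allocations.
    #include <d3d9.h>
    #include <cstdio>

    void log_texture_mem(IDirect3DDevice9 *device, const char *tag)
    {
        // Rough figure (rounded to MB, includes AGP/shared memory), useful for trends,
        // not for exact accounting.
        UINT bytes = device->GetAvailableTextureMem();
        std::printf("[%s] available texture mem: %u MB\n", tag, bytes / (1024 * 1024));
    }

    // Usage idea:
    //   log_texture_mem(dev, "before texture load");
    //   ... create textures / vertex buffers ...
    //   log_texture_mem(dev, "after texture load");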
You probably won't be able to obtain a nice clean matrix of memory consumers (vertex buffers etc.) vs. memory location (AGP, VID, system), as -
(1) the driver has a lot of freedom in transferring resources between memory types, and
(2) the actual variety of memory consumers is far greater than the exposed D3D interfaces.
On a modern system can local hard disk write speeds be improved by compressing the output stream?
This question derives from a case I'm working on where a program serially generates and dumps around 1-2 GB of text logging data to a raw text file on the hard disk, and I think it is IO bound. Would I expect to be able to decrease runtimes by compressing the data before it goes to disk, or would the overhead of compression eat up any gain I could get? Would having an idle second core affect this?
I know this would be affected by how much CPU is being used to generate the data so rules of thumb on how much idle CPU time would be needed would be good.
I recall a video talk where someone used compression to improve read speeds for a database but IIRC compressing is a lot more CPU intensive than decompressing.
Yes, yes, yes, absolutely.
Look at it this way: take your maximum contiguous disk write speed in megabytes per second. (Go ahead and measure it; time a huge fwrite or something.) Let's say 100 MB/s. Now take your CPU speed in megahertz; let's say 3 GHz = 3000 MHz. Divide the CPU speed by the disk write speed. That's the number of cycles that the CPU is spending idle, which you can spend per byte on compression. In this case 3000/100 = 30 cycles per byte.
If you had an algorithm that could compress your data by 20%, for an effective 125 MB/s write speed, you would have 24 cycles per byte to run it in, and it would basically be free because the CPU wouldn't be doing anything else anyway while waiting for the disk to churn. 24 cycles per byte = 3072 cycles per 128-byte cache line, easily achieved.
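If you want to measure that maximum contiguous write speed, a quick timed fwrite is enough for a ballpark figure (file name and sizes are arbitrary; write much more than your RAM, or flush/fsync, otherwise the OS write cache will flatter the numbers):

    // Sketch: rough measurement of sequential write throughput with a timed fwrite.
    #include <cstdio>
    #include <vector>
    #include <chrono>

    int main()
    {
        const size_t chunk = 1 << 20;          // 1 MB per write
        const size_t count = 1024;             // 1 GB total - adjust to exceed the RAM cache
        std::vector<char> buf(chunk, 'x');

        std::FILE *fp = std::fopen("write_test.bin", "wb");
        if (!fp) return 1;

        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < count; ++i)
            std::fwrite(buf.data(), 1, chunk, fp);
        std::fflush(fp);
        std::fclose(fp);                       // note: the OS cache may still hold some data
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%.1f MB/s\n", (count * chunk) / (1024.0 * 1024.0) / secs);
        return 0;
    }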
We do this all the time when reading optical media.
If you have an idle second core it's even easier. Just hand off the log buffer to that core's thread and it can take as long as it likes to compress the data since it's not doing anything else! The only tricky bit is you want to actually have a ring of buffers so that you don't have the producer thread (the one making the log) waiting on a mutex for a buffer that the consumer thread (the one writing it to disk) is holding.
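A minimal sketch of that hand-off (an unbounded queue of blocks stands in for the ring of buffers here, and compress_and_write is a placeholder for your compress-then-write step; a real version would bound the number of outstanding blocks so memory can't grow without limit):

    // Sketch: producer fills log blocks; a second thread compresses/writes them.
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    struct BufferQueue {
        std::queue<std::string> q;
        std::mutex m;
        std::condition_variable cv;
        bool done = false;

        void push(std::string buf) {
            { std::lock_guard<std::mutex> lk(m); q.push(std::move(buf)); }
            cv.notify_one();
        }
        bool pop(std::string &out) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&]{ return !q.empty() || done; });
            if (q.empty()) return false;            // finished and drained
            out = std::move(q.front()); q.pop();
            return true;
        }
        void finish() {
            { std::lock_guard<std::mutex> lk(m); done = true; }
            cv.notify_all();
        }
    };

    void compress_and_write(const std::string &block)
    {
        // Placeholder for "compress, then write": here we just write uncompressed.
        static std::FILE *fp = std::fopen("log.out", "wb");
        std::fwrite(block.data(), 1, block.size(), fp);
    }

    int main()
    {
        BufferQueue queue;
        std::thread consumer([&]{
            std::string block;
            while (queue.pop(block)) compress_and_write(block);
        });

        // Producer: generate log text and hand off full blocks without waiting
        // on whatever the consumer is currently compressing.
        std::string block;
        for (int i = 0; i < 100000; ++i) {
            block += "log line " + std::to_string(i) + "\n";
            if (block.size() >= 64 * 1024) { queue.push(std::move(block)); block.clear(); }
        }
        if (!block.empty()) queue.push(std::move(block));
        queue.finish();
        consumer.join();
        return 0;
    }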
Yes, this has been true for at least 10 years. There are operating-systems papers about it. I think Chris Small may have worked on some of them.
For speed, gzip/zlib compression on lower quality levels is pretty fast; if that's not fast enough you can try FastLZ. A quick way to use an extra core is just to use popen(3) to send output through gzip.
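The popen route can be as simple as this (POSIX; the file name and compression level are arbitrary):

    /* Sketch: push log output through gzip on another process/core via popen (POSIX). */
    #include <stdio.h>

    int main(void)
    {
        /* -1 = fastest compression level; the shell redirection names the output file. */
        FILE *gz = popen("gzip -1 > app.log.gz", "w");
        if (!gz) return 1;

        for (int i = 0; i < 1000000; ++i)
            fprintf(gz, "event %d: something happened\n", i);

        pclose(gz);   /* waits for gzip to finish flushing */
        return 0;
    }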
For what it is worth, Sun's ZFS filesystem has the ability to enable on-the-fly compression to decrease the amount of disk IO without a significant increase in overhead, as an example of this in practice.
The Filesystems and storage lab from Stony Brook published a rather extensive performance (and energy) evaluation on file data compression on server systems at IBM's SYSTOR systems research conference this year: paper at ACM Digital Library, presentation.
The results depend on the
used compression algorithm and settings,
the file workload and
the characteristics of your machine.
For example, in the measurements from the paper, using a textual workload in a server environment, lzop with low compression effort is faster than a plain write, but bzip and gz aren't.
In your specific setting, you should try it out and measure. It really might improve performance, but it is not always the case.
CPUs have grown faster at a faster rate than hard drive access. Even back in the '80s, many compressed files could be read off the disk and decompressed in less time than it took to read the original (uncompressed) file. That will not have changed.
Generally though, these days the compression/de-compression is handled at a lower level than you would be writing, for example in a database I/O layer.
As for the usefulness of a second core: it only counts if the system will also be doing a significant number of other things - and your program would have to be multi-threaded to take advantage of the additional CPU.
Logging the data in binary form may be a quick improvement. You'll write less to the disk and the CPU will spend less time converting numbers to text. It may not be useful if people are going to be reading the logs, but they won't be able to read compressed logs either.
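A tiny illustration of the difference (the record layout is made up):

    // Sketch: fixed-size binary records vs. formatted text for the same log event.
    #include <cstdio>
    #include <cstdint>

    struct LogRecord {            // hypothetical packed log record, 16 bytes
        uint32_t timestamp;
        uint32_t event_id;
        double   value;
    };

    int main()
    {
        LogRecord r = {1234567u, 42u, 3.14159};

        // Text: tens of bytes per event, plus number-to-text formatting on the CPU.
        std::FILE *txt = std::fopen("log.txt", "a");
        std::fprintf(txt, "%u %u %f\n", r.timestamp, r.event_id, r.value);
        std::fclose(txt);

        // Binary: sizeof(LogRecord) bytes per event, no formatting cost.
        std::FILE *bin = std::fopen("log.bin", "ab");
        std::fwrite(&r, sizeof(r), 1, bin);
        std::fclose(bin);
        return 0;
    }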
Windows already supports File Compression in NTFS, so all you have to do is to set the "Compressed" flag in the file attributes.
You can then measure if it was worth it or not.
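If you would rather set that flag from code than from Explorer, FSCTL_SET_COMPRESSION is, as far as I know, the usual route. A sketch with minimal error handling:

    // Sketch (Win32): mark a file as NTFS-compressed before writing the log to it.
    #include <windows.h>
    #include <winioctl.h>

    bool enable_ntfs_compression(const wchar_t *path)
    {
        HANDLE h = CreateFileW(path, GENERIC_READ | GENERIC_WRITE,
                               FILE_SHARE_READ, nullptr, OPEN_ALWAYS,
                               FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h == INVALID_HANDLE_VALUE) return false;

        USHORT format = COMPRESSION_FORMAT_DEFAULT;   // let NTFS pick its default scheme
        DWORD returned = 0;
        BOOL ok = DeviceIoControl(h, FSCTL_SET_COMPRESSION,
                                  &format, sizeof(format),
                                  nullptr, 0, &returned, nullptr);
        CloseHandle(h);
        return ok != FALSE;
    }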
This depends on lots of factors and I don't think there is one correct answer. It comes down to this:
Can you compress the raw data faster than the raw write performance of your disk times the compression ratio you are achieving (or the multiple in speed you are trying to get) given the CPU bandwidth you have available to dedicate to this purpose?
Given today's relatively high data write rates in the tens of MB/second, this is a pretty high hurdle to get over. To the point of some of the other answers, you would likely need easily compressible data, and would just have to benchmark it with some reasonable test experiments to find out.
As for the point about additional cores, a specific opinion (guess!?): if you thread the compression of the data and keep the core(s) fed, then with the high compression ratio of text it is likely such a technique would bear some fruit. But this is just a guess. In a single-threaded application alternating between disk writes and compression operations, it seems much less likely to me.
If it's just text, then compression could definitely help. Just choose a compression algorithm and settings that make the compression cheap. "gzip" is cheaper than "bzip2" and both have parameters that you can tweak to favor speed or compression ratio.
If you are I/O bound saving human-readable text to the hard drive, I expect compression to reduce your total runtime.
If you have an idle 2 GHz core, and a relatively fast 100 MB/s streaming hard drive,
halving the net logging time requires at least 2:1 compression and no more than roughly 10 CPU cycles per uncompressed byte for the compressor to ponder the data.
With a dual-pipe processor, that's (very roughly) 20 instructions per byte.
I see that LZRW1-A (one of the fastest compression algorithms) uses 10 to 20 instructions per byte, and compresses typical English text about 2:1.
At the upper end (20 instructions per byte), you're right on the edge between IO bound and CPU bound. At the middle and lower end, you're still IO bound, so there are a few cycles available (not many) for a slightly more sophisticated compressor to ponder the data a little longer.
If you have a more typical non-top-of-the-line hard drive, or the hard drive is slower for some other reason (fragmentation, other multitasking processes using the disk, etc.)
then you have even more time for a more sophisticated compressor to ponder the data.
You might consider setting up a compressed partition, saving the data to that partition (letting the device driver compress it), and comparing the speed to your original speed.
That may take less time and be less likely to introduce new bugs than changing your program and linking in a compression algorithm.
I see a list of compressed file systems based on FUSE, and I hear that NTFS also supports compressed partitions.
If this particular machine is often IO bound,
another way to speed it up is to install a RAID array.
That would give a speedup to every program and every kind of data (even incompressible data).
For example, the popular RAID 1+0 configuration with 4 total disks gives a speedup of nearly 2x.
The nearly as popular RAID 5 configuration, with the same 4 total disks, gives a speedup of nearly 3x.
It is relatively straightforward to set up a RAID array with a speed 8x the speed of a single drive.
High compression ratios, on the other hand, are apparently not so straightforward. Compression of "merely" 6.30 to one would give you a cash prize for breaking the current world record for compression (Hutter Prize).
This used to be something that could improve performance in quite a few applications way back when. I'd guess that today it's less likely to pay off, but it might in your specific circumstance, particularly if the data you're logging is easily compressible.
However, as Shog9 commented:
Rules of thumb aren't going to help you here. It's your disk, your CPU, and your data. Set up a test case and measure throughput and CPU load with and without compression - see if it's worth the tradeoff.