I want to store a large amount of data on my Arduino with an ATmega168/ATmega328 microcontroller, but unfortunately there's only 256 KB / 512 KB of EEPROM storage.
My idea is to use a compression algorithm to shrink the data. But my knowledge of compression algorithms is quite limited, and my search for ready-to-use libraries came up empty.
So, is there a good way to optimize the storage size?
You might have a look at the LZO algorithm, which is designed to be lightweight. I don't know whether there are any implementations for the AVR system, but it might be something you could implement yourself.
You may be somewhat misinformed about the amount of storage available in EEPROM on your chip, though; according to the datasheet I have, the EEPROM sizes are:
ATmega48P: 256
ATmega88P: 512
ATmega168P: 512
ATmega328P: 1024
Note that those values are in bytes, not KB as you mention in your question. This is not, by any measure, a large amount.
AVRs have only a few kilobytes of EEPROM at most, and very few have more than 64 KB of Flash (no standard Arduinos do).
If you need to store something that is seldom modified, for instance an image, you could try using the Flash, as there is much more space there to work with. For simple images, some crude RLE encoding would go a long way (see the sketch below).
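As a concrete illustration of the Flash approach on AVR, a lookup table can be placed in program memory with PROGMEM and read back with pgm_read_byte. This is only a minimal sketch; the array contents and the RLE format are placeholders.

```cpp
// Minimal sketch, assuming an RLE-encoded monochrome image whose bytes are
// placeholders. PROGMEM keeps the table in flash instead of SRAM/EEPROM.
#include <stdint.h>
#include <avr/pgmspace.h>

const uint8_t image_rle[] PROGMEM = {
    0x00, 0x00  // placeholder bytes: replace with your encoded image data
};

uint8_t read_image_byte(uint16_t i)
{
    // pgm_read_byte fetches from program memory (flash) rather than SRAM
    return pgm_read_byte(&image_rle[i]);
}
```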
Compressing anything more random, for instance logged data, audio, etc., will take a tremendous amount of overhead for the AVR; you will have better luck getting a serial EEPROM chip to hold this data. Arduino's site has a page on interfacing with a 64K chip, which sounds like a good fit. If you want more than that, look at interfacing with an SD card over SPI, for instance as in this audio shield.
A NASA study is available here (PostScript).
A repost of a 1989 article on LZW is available here.
Keep it simple and perform a cost/benefit analysis before adding compression. Consider the time and effort, the added complexity, resource usage, data compressibility, and so on.
An algorithm like LZSS would probably be a good choice for an embedded platform. Such algorithms are simple and don't need much memory.
LZS is one I'm familiar with. It uses a 2 kB dictionary for compression and decompression (the dictionary is the most recent 2 kB of the uncompressed data stream). (LZS was patented by HiFn; however, as far as I can tell, all the patents have expired.)
But I see that the ATmega328 used on recent Arduinos only has 2 kB of SRAM (smaller parts in the family have as little as 512 bytes), so maybe even LZS is too big for it. I'm sure you could use a variant with a smaller dictionary, but I'm not sure what compression ratios you'd achieve.
The method described in the paper “Data Compression Algorithms for Energy-Constrained Devices in Delay Tolerant Networks” might run on an ATmega328.
Reference: C. Sadler and M. Martonosi, “Data Compression Algorithms for Energy-Constrained Devices in Delay Tolerant Networks,” Proceedings of the ACM Conference on Embedded Networked Sensor Systems (SenSys) 2006, November 2006. .pdf.
S-LZW Source for MSPGCC: slzw.tar.gz. Updated 10 March 2007.
You might also want to take a look at LZJB, which is very short, simple, and lightweight.
Also, FastLZ might be worth a look. It gets better compression ratios than LZJB and has pretty minimal memory requirements for decompression.
If you just want to remove some repeating zeros or the like, use run-length encoding.
Repeating byte sequences will be stored as:
<mark><byte><count>
It's a super-simple algorithm, which you can probably code yourself in a few lines (see the sketch below).
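A minimal sketch of that <mark><byte><count> scheme follows. The MARK value and the run-length threshold are assumptions; a decoder would simply mirror the same rules.

```cpp
// Runs of 4+ identical bytes (and any literal MARK byte) are escaped as
// <MARK><byte><count>; everything else is copied through unchanged.
#include <stddef.h>
#include <stdint.h>

const uint8_t MARK = 0xAA;  // assumption: pick a value that is rare in your data

size_t rle_encode(const uint8_t *in, size_t len, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < len; ) {
        size_t run = 1;
        while (i + run < len && in[i + run] == in[i] && run < 255) run++;

        if (run >= 4 || in[i] == MARK) {
            out[o++] = MARK;               // escape marker
            out[o++] = in[i];              // the repeated byte
            out[o++] = (uint8_t)run;       // count, 1..255
        } else {
            for (size_t k = 0; k < run; k++) out[o++] = in[i];  // short literal run
        }
        i += run;
    }
    return o;  // compressed size in bytes
}
```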
Is an external EEPROM (for example via I2C) not an option (see the sketch below)? Even if you use a compression algorithm, the downside is that the amount of data you can store in the internal EEPROM can no longer be determined in a simple way.
And of course, if you really mean kilobytes, then consider an SD card connected over SPI. There are some lightweight open-source FAT-compatible file systems available.
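For the external-EEPROM route, here is a rough sketch of what the I2C traffic looks like from an Arduino sketch with the Wire library, assuming a common 24LC256-style part at address 0x50 with two-byte addressing (adjust for your actual chip).

```cpp
// Hedged sketch: single-byte read/write to an external I2C EEPROM.
// Device address, addressing scheme and write-cycle delay are assumptions.
#include <Wire.h>

const uint8_t EEPROM_ADDR = 0x50;

void extWrite(uint16_t addr, uint8_t value)
{
    Wire.beginTransmission(EEPROM_ADDR);
    Wire.write((uint8_t)(addr >> 8));    // address high byte
    Wire.write((uint8_t)(addr & 0xFF));  // address low byte
    Wire.write(value);
    Wire.endTransmission();
    delay(5);                            // wait for the internal write cycle
}

uint8_t extRead(uint16_t addr)
{
    Wire.beginTransmission(EEPROM_ADDR);
    Wire.write((uint8_t)(addr >> 8));
    Wire.write((uint8_t)(addr & 0xFF));
    Wire.endTransmission();
    Wire.requestFrom(EEPROM_ADDR, (uint8_t)1);
    return Wire.available() ? Wire.read() : 0xFF;
}

void setup() { Wire.begin(); }
void loop()  { }
```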
heatshrink is a data compression/decompression library for embedded/real-time systems based on LZSS. It says it can run in under 100 bytes of memory.
Assuming you are doing it on the CPU, not using dedicated hardware, how much computing power, in megaflops, does it take to decompress a moderate quality MP3 stream?
Here is a link to a paper explaining some optimization details for an MP3 decoder application on a non-PC device:
http://www.analog.com/media/en/technical-documentation/application-notes/EE-255.pdf
Still, the paper sheds some light on the issue by giving numeric measurements of instruction counts per second: the original implementation of the algorithm takes 330 MIPS. After applying some optimizations, the number drops to 110 MIPS. So it is a very modest number, a fraction of what CPUs are capable of today.
Your mileage may vary depending on the instruction set you use or the requirements of your application. Also keep in mind that the implementation makes a big difference; as shown in the paper, missing a couple of optimizations could slow it down by a factor of three.
We are using libjpeg for JPEG decoding on our small embedded platform. We have problems with speed when decoding large images. For example, an image that is 20 MB and 5000x3000 pixels takes 10 seconds to load.
I need some tips on how to improve decoding speed. On another platform with similar performance, the same image loads in two seconds.
The best improvement, from 14 seconds down to 10 seconds, came from using a larger read buffer (64 kB instead of the default 4 kB). Nothing else helped.
We do not need to display the image at full resolution, so we use scale_num and scale_denom to display it at a smaller size. But I would like more performance. Is it possible to use some kind of multithreading, different decoding settings, anything? I've run out of ideas.
First - profile the code. You're left with little more than speculation if you cannot definitively identify the bottlenecks.
Next, scour the documentation for libjpeg speedup opportunities. You mentioned scale_num and scale_denom. What about the decompressor's dct_method? I've found the JDCT_FASTEST option to be good. There are other options to check: do_fancy_upsampling, do_block_smoothing, dither_mode, two_pass_quantize, etc. Some or all of these may be useful to you, depending on your system, libjpeg version, etc.
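To make that concrete, here is a minimal sketch of a decode loop with those speed-oriented settings applied; they go after jpeg_read_header() and before jpeg_start_decompress(). The 1/4 scaling and the output handling are placeholders, and the exact effect of each field may vary with your libjpeg version.

```cpp
// Hedged sketch of a libjpeg decode loop tuned for speed rather than quality.
#include <stdio.h>
#include <jpeglib.h>

void decode_fast(FILE *infile)
{
    struct jpeg_decompress_struct cinfo;
    struct jpeg_error_mgr jerr;

    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&cinfo);
    jpeg_stdio_src(&cinfo, infile);
    jpeg_read_header(&cinfo, TRUE);

    // Speed-oriented settings (set before jpeg_start_decompress)
    cinfo.dct_method          = JDCT_FASTEST;  // usually maps to JDCT_IFAST
    cinfo.do_fancy_upsampling = FALSE;
    cinfo.do_block_smoothing  = FALSE;
    cinfo.two_pass_quantize   = FALSE;
    cinfo.dither_mode         = JDITHER_NONE;
    cinfo.scale_num           = 1;             // decode at 1/4 size, as the
    cinfo.scale_denom         = 4;             // question already does

    jpeg_start_decompress(&cinfo);

    JSAMPARRAY row = (*cinfo.mem->alloc_sarray)
        ((j_common_ptr)&cinfo, JPOOL_IMAGE,
         cinfo.output_width * cinfo.output_components, 1);

    while (cinfo.output_scanline < cinfo.output_height) {
        jpeg_read_scanlines(&cinfo, row, 1);
        // consume row[0] here (copy to framebuffer, etc.)
    }

    jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
}
```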
If profiling tools are unavailable, there are still some things to try. First, I suspect your bottleneck is not CPU related. To confirm, load the (still compressed) image file into a RAM buffer, then decompress it from there as you have been. Did that significantly improve the decompression time? If so, the culprit would appear to be the read operation from your image storage media. Depending on your system, reading from USB (or SD, etc.) can be slow. (Note that I'm assuming a read from external media - although hardware details are scant.) Be sure to optimize relevant bus parameters as well (SPI clocks, configurations, etc.).
If you are reading from something like internal flash (i.e. NAND), there are some other things to inspect. How is your NAND controller configured? Have you ensured that the controller is configured for the fastest operation? Check wait states, timings, etc. Note that bus and/or memory contention can be an issue, too - so inspect their respective configurations, as well.
Finally, if you believe your system is actually CPU bound, this stackoverflow question may be of interest:
Can a high-performance jpeglib-turbo implementation decompress/compress in <100ms?
Multi-threading can only help the decode process if the target has multiple execution units for true concurrent execution; otherwise it will just time-slice existing CPU resources. And it won't help in any case unless the library is designed to make use of it.
If you built the library from source, you should first ensure you built it with optimisation switched on, and carefully select the compiler options to match the build to your target and its instruction set, to enable the compiler to use SIMD or an FPU for example.
Also you need to consider other possible bottlenecks. Is the 10 seconds just the time to decode, or does it include the time to read from a filesystem or network, for example? Given the improvement observed when you increased the read buffer size, it seems highly likely that it is the data read rather than the decode that is the limiting factor here.
If in fact the filesystem access is the limiting factor rather than the decode, then there may be some benefit in separating the file read from the decode into a separate thread and passing the data via a pipe, queue, or multiple shared memory buffers to the decoder. You can then ensure that the decoder streams the decode without having to block on the filesystem.
Take a look at libjpeg-turbo. If you have supported hardware then it is generally 2-4 times faster than libjpeg on the same CPU. A typical 12 MB JPEG is decoded in less than 2 seconds on a Pandaboard. You can also take a look at a speed analysis of various JPEG decoders here.
I am looking for a robust, efficient data compression algorithm that I could use to provide a real-time transmission of medical data (primarily waveforms - heart rate, etc.).
I would appreciate any recommendations/links to scientific papers.
EDIT: The system will be based on a server (most probably installed within point-of-care infrastructure) and mobile devices (iOS & Android smartphones and tablets with native apps), to which the waveforms are going to be transferred. The server will gather all the data from the hospital (primarily waveform data). In my case, stability and speed are more important than latency.
That's the most detailed specification I can provide at the moment. I am going to investigate your recommendations and then test several algorithms. But I am looking for something that has been successfully implemented in a similar architecture. I am also open to any suggestions regarding server computation power or server software.
Don't think of it as real-time or as medical data - think of it as packets of data needing to be compressed for transmission (most likely in TCP packets). The details of the content only matter in choice of compression algorithm, and even there it's not whether it's medical it's how the data is formatted/stored and what the actual data looks like. The important things are the data itself and the constraints due to the overall system (e.g. is it data gathering such as a Holter monitor, or is it real-time status reporting such as a cardiac monitor in an ICU? What kind of system is receiving the data?).
Looking at the data, is it being presented for transmission as raw binary data, or is it being received from another component or device as (for example) structured XML or HL7 with numeric values represented as text? Will compressing the original data be the most efficient option, or should it be converted down to a proprietary binary format that only covers the actual data range (are 2, 3 or 4 bytes enough to cover the range of values?)? What kind of savings could be achieved by converting and what are the compatibility concerns (e.g. loss of HL7 compatibility).
Choosing the absolutely best-compressing algorithm may also not be worth much additional work unless you're going to be in an extremely low-bandwidth scenario; if the data is coming from an embedded device you should be balancing compression efficiency with the capabilities and limitations of the embedded processor, toolset and surrounding system for working with it. If a custom-built compression routine saves you 5% over something already built into the tools, is it worth the extra coding and debugging time and storage space in embedded flash? Existing validated software libraries that produce "good enough" output may be preferred, particularly for medical devices.
Finally, depending on the environment you may want to sacrifice a big chunk of compression in favor of some level of redundancy, such as transmitting a sliding window of the data such that loss of any X packets doesn't result in loss of data. This may let you change protocols as well and may change how the device is configured - the difference between streaming UDP (with no retransmission of lost packets) and TCP (where the sender may need to be able to retransmit) may be significant.
And, now that I've blathered about the systems side, there's a lot of information out there on packetizing and streaming analog data, ranging from development of streaming protocols such as RTP to details of voice packetization for GSM/CDMA and VOIP. Still, the most important drivers for your decisions may end up being the toolsets available to you on the device and server sides. Using existing toolsets even if they're not the most efficient option may allow you to cut your development (and time-to-market) times significantly, and may also simplify the certification of your device/product for medical use. On the business side, spending an extra 3-6 months of software development, finding truly qualified developers, and dealing with regulatory approvals are likely to be the overriding factors.
UPDATE 2012/02/01: I just spent a few minutes looking at the XML export of a 12-lead cardiac stress EKG with a total observation time of 12+ minutes and an XML file size of ~6 MB. I'm estimating that more than 25% of that file was repetitive and EXTREMELY compressible XML in the study headers, and the waveform data was comma-separated numbers in the range of -200 to 200, concentrated in the center of the range and changing slowly, with the numbers crossing the y-axis and staying on that side for a while. Assuming that most of what you want is the waveform values, for this example you'd be looking at an uncompressed data rate of 4500 KB / 763 seconds, or roughly 5.9 KB/s (about 47 kbps). Completely uncompressed and using text formatting you could run that over a "2.5G" GPRS connection with ease. On any modern wireless infrastructure the bandwidth used will be almost unnoticeable.
I still think that the stock compression libraries would eat this kind of data for lunch (subject to issues with compression headers and possibly packet headers). If you insist on doing custom compression, I'd look at sending difference values rather than raw numbers (unless your raw data is already offsets). If your data looks anything like what I'm reviewing, you could probably convert each item into a 1-byte value of -127 to +127, possibly reserving the extreme ends as "special" values used for overflow (handle those as you see fit - special representation, error, etc.); see the sketch below. If you'd rather be slightly less efficient on transmission and insignificantly faster in processing, you could instead just send each value as a signed 2-byte value, which would still use less bandwidth than the text representation, because currently every value is 2+ bytes anyway (1-4 characters plus separators that would no longer be needed).
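To make the difference-value idea concrete, here is a rough sketch. The choice of -128 as an escape marker and the 16-bit raw fallback are my own assumptions for illustration, not part of any standard; a decoder would simply reverse the steps.

```cpp
// Hedged sketch: encode each sample as the signed 1-byte difference from the
// previous sample; out-of-range jumps are escaped and sent as a raw 16-bit value.
#include <cstdint>
#include <vector>

std::vector<uint8_t> delta_encode(const std::vector<int16_t> &samples)
{
    std::vector<uint8_t> out;
    int16_t prev = 0;
    for (int16_t s : samples) {
        int diff = s - prev;
        if (diff >= -127 && diff <= 127) {
            out.push_back(static_cast<uint8_t>(static_cast<int8_t>(diff)));
        } else {
            out.push_back(static_cast<uint8_t>(-128));             // escape marker (assumed)
            out.push_back(static_cast<uint8_t>(s & 0xFF));          // raw low byte
            out.push_back(static_cast<uint8_t>((s >> 8) & 0xFF));   // raw high byte
        }
        prev = s;
    }
    return out;
}
```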
Basically, don't worry about the size of the data unless this is going to be running 24/7 over a heavily metered wireless connection with low caps.
There is a category of compression algorithms which are so fast that I see no scenario in which they couldn't be called "real time": they are necessarily fast enough. Such algorithms include LZ4, Snappy, LZO, and QuickLZ, and they reach hundreds of MB/s per CPU.
A comparison of them is available here :
http://code.google.com/p/lz4/
"Real Time compression for transmission" can also be seen as a trade-off between speed and compression ratio. More compression, even if slower, can effectively save transmission time.
A study of the "optimal trade-off" between compression and speed has been carried out on this page, for example: http://fastcompression.blogspot.com/p/compression-benchmark.html
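To give a feel for how little code these libraries need, here is a minimal round-trip sketch with LZ4; the buffer sizes and sample data are placeholders.

```cpp
// Hedged sketch: compress and decompress one small buffer with LZ4.
#include <lz4.h>
#include <cstring>

int main()
{
    const char *src = "waveform sample data ... waveform sample data ...";
    const int srcSize = static_cast<int>(strlen(src)) + 1;

    char compressed[256];
    int csize = LZ4_compress_default(src, compressed, srcSize, sizeof(compressed));

    char restored[256];
    int dsize = LZ4_decompress_safe(compressed, restored, csize, sizeof(restored));

    return (dsize == srcSize) ? 0 : 1;  // round-trip sanity check
}
```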
I tested many compression libraries, and this is my conclusion:
LZO (http://www.oberhumer.com/opensource/lzo/) is very fast for compressing large amounts of data (more than 1 MB).
Snappy (http://code.google.com/p/snappy/) is good but requires more processing resources at decompression (better for data under 1 MB).
http://objectegypt.com offers a library called IHCA, which is faster than LZO on big data, offers good decompression speed, and requires no license.
Finally, you'd better write your own compression functions, because no one knows your data better than you do.
We have to compress a ton of (monochrome) image data and move it quickly. If one were to use just the parallelizable stages of JPEG compression (DCT and run-length encoding of the quantized results) and run them on a GPU so each block is compressed in parallel, I am hoping that would be very fast and still yield a very significant compression factor, like full JPEG does.
Does anyone with more GPU / image compression experience have any idea how this would compare, both compression- and performance-wise, with using libjpeg on a CPU? (If it is a stupid idea, feel free to say so - I am extremely novice in my knowledge of CUDA and the various stages of JPEG compression.) Certainly it will compress less and hopefully(?) be faster, but I have no idea how significant those factors may be.
You could hardly get more compression out of a GPU - there are just no algorithms complex enough to use that MUCH power.
When working with simple algorithms like JPEG, the work is so simple that you'll spend most of the time transferring data over the PCI-E bus (which has significant latency, especially when the card does not support DMA transfers).
The positive side is that if the card has DMA, you can free up the CPU for more important stuff and get image compression "for free".
In the best case, you can get about a 10x improvement on a top-end GPU compared to a top-end CPU, provided that both the CPU and GPU code are well optimized.
On a modern system can local hard disk write speeds be improved by compressing the output stream?
This question derives from a case I'm working on, where a program serially generates and dumps around 1-2 GB of text logging data to a raw text file on the hard disk, and I think it is IO bound. Would I expect to be able to decrease runtimes by compressing the data before it goes to disk, or would the overhead of compression eat up any gain I could get? Would having an idle second core affect this?
I know this would be affected by how much CPU is being used to generate the data so rules of thumb on how much idle CPU time would be needed would be good.
I recall a video talk where someone used compression to improve read speeds for a database but IIRC compressing is a lot more CPU intensive than decompressing.
Yes, yes, yes, absolutely.
Look at it this way: take your maximum contiguous disk write speed in megabytes per second. (Go ahead and measure it; time a huge fwrite or something.) Let's say 100 MB/s. Now take your CPU speed in megahertz; let's say 3 GHz = 3000 MHz. Divide the CPU speed by the disk write speed. That's the number of otherwise idle cycles that the CPU can spend per byte on compression. In this case 3000/100 = 30 cycles per byte.
If you had an algorithm that could shrink your data by 20%, for an effective 125 MB/s write speed, you would have 24 cycles per byte to run it in, and it would basically be free because the CPU wouldn't be doing anything else anyway while waiting for the disk to churn. 24 cycles per byte = 3072 cycles per 128-byte cache line, easily achieved.
We do this all the time when reading optical media.
If you have an idle second core it's even easier: just hand the log buffer off to that core's thread, and it can take as long as it likes to compress the data since it isn't doing anything else. The only tricky bit is that you really want a ring of buffers, so that the producer thread (the one making the log) isn't waiting on a mutex for a buffer that the consumer thread (the one writing it to disk) is holding.
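A rough sketch of that ring-of-buffers idea follows: the logging (producer) thread hands filled buffers to a worker (consumer) thread, which compresses and writes them off the producer's critical path. NBUF, BUFSZ and compress_and_write() are placeholders, not part of any existing API.

```cpp
// Hedged sketch: fixed ring of buffers shared between a producer and a consumer.
#include <pthread.h>
#include <cstring>
#include <cstddef>

static const int    NBUF  = 4;
static const size_t BUFSZ = 64 * 1024;

static char   bufs[NBUF][BUFSZ];
static size_t used[NBUF];
static int    head = 0, tail = 0, filled = 0;

static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

void compress_and_write(const char *data, size_t len);  // hypothetical: zlib + fwrite, say

// Producer side: called by the logging thread once it has a block ready.
void submit_log_block(const char *src, size_t len)
{
    pthread_mutex_lock(&lock);
    while (filled == NBUF)                      // every buffer busy: wait
        pthread_cond_wait(&not_full, &lock);
    memcpy(bufs[head], src, len);
    used[head] = len;
    head = (head + 1) % NBUF;
    filled++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

// Consumer side: runs on the otherwise idle core.
void *writer_thread(void *)
{
    for (;;) {
        pthread_mutex_lock(&lock);
        while (filled == 0)
            pthread_cond_wait(&not_empty, &lock);
        int slot = tail;
        pthread_mutex_unlock(&lock);

        // Slow work happens without the lock held; the producer keeps
        // filling the other buffers meanwhile.
        compress_and_write(bufs[slot], used[slot]);

        pthread_mutex_lock(&lock);
        tail = (tail + 1) % NBUF;
        filled--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
    }
    return nullptr;
}
```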
Yes, this has been true for at least 10 years. There are operating-systems papers about it. I think Chris Small may have worked on some of them.
For speed, gzip/zlib compression at the lower quality levels is pretty fast; if that's not fast enough you can try FastLZ. A quick way to use an extra core is just to use popen(3) to send the output through gzip.
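A minimal sketch of the popen(3) approach; the output filename, gzip level, and log content are placeholders.

```cpp
// Hedged sketch: pipe the log stream through an external gzip process,
// which runs on whatever core is free.
#include <stdio.h>

int main()
{
    FILE *out = popen("gzip -1 > logfile.gz", "w");  // -1 = fastest compression level
    if (!out) return 1;

    for (int i = 0; i < 1000000; i++)
        fprintf(out, "log line %d: some text that compresses well\n", i);

    pclose(out);  // waits for gzip to finish flushing
    return 0;
}
```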
For what it is worth, Sun's ZFS filesystem has the ability to enable on-the-fly compression to decrease the amount of disk IO without a significant increase in overhead, as an example of this in practice.
The Filesystems and storage lab from Stony Brook published a rather extensive performance (and energy) evaluation on file data compression on server systems at IBM's SYSTOR systems research conference this year: paper at ACM Digital Library, presentation.
The results depend on
the compression algorithm and settings used,
the file workload, and
the characteristics of your machine.
For example, in the measurements from the paper, on a textual workload in a server environment, using lzop with low compression effort is faster than a plain write, but bzip and gz are not.
In your specific setting, you should try it out and measure. It really might improve performance, but it is not always the case.
CPUs have grown faster at a faster rate than hard drive access. Even back in the '80s, many compressed files could be read off the disk and uncompressed in less time than it took to read the original (uncompressed) file. That will not have changed.
Generally, though, these days the compression/decompression is handled at a lower level than you would be writing at, for example in a database I/O layer.
As for the usefulness of a second core: it only counts if the system will also be doing a significant number of other things, and your program would have to be multi-threaded to take advantage of the additional CPU.
Logging the data in binary form may be a quick improvement. You'll write less to the disk and the CPU will spend less time converting numbers to text. It may not be useful if people are going to be reading the logs, but they won't be able to read compressed logs either.
Windows already supports file compression in NTFS, so all you have to do is set the "Compressed" flag in the file attributes.
You can then measure if it was worth it or not.
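If you want to set that flag from code rather than from Explorer, here is a sketch using the Win32 FSCTL_SET_COMPRESSION control code; the file path is a placeholder.

```cpp
// Hedged sketch: enable NTFS compression on an existing file, equivalent to
// ticking the "Compress contents" checkbox in the file's properties dialog.
#include <windows.h>
#include <winioctl.h>

bool enable_ntfs_compression(const wchar_t *path)
{
    HANDLE h = CreateFileW(path, GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ, NULL, OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return false;

    USHORT format = COMPRESSION_FORMAT_DEFAULT;   // let NTFS pick its default (LZNT1)
    DWORD returned = 0;
    BOOL ok = DeviceIoControl(h, FSCTL_SET_COMPRESSION,
                              &format, sizeof(format),
                              NULL, 0, &returned, NULL);
    CloseHandle(h);
    return ok != 0;
}
```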
This depends on lots of factors and I don't think there is one correct answer. It comes down to this:
Can you compress the raw data faster than the raw write performance of your disk times the compression ratio you are achieving (or the multiple in speed you are trying to get) given the CPU bandwidth you have available to dedicate to this purpose?
Given today's relatively high data write rates in the tens of MB/s, this is a pretty high hurdle to get over. To the point of some of the other answers, you would likely have to have easily compressible data, and you would just have to benchmark it with some reasonable experiments to find out.
As a specific opinion (a guess!) on the point about additional cores: if you thread the compression of the data and keep the core(s) fed, then with the high compression ratio of text it is likely such a technique would bear some fruit. But this is just a guess. In a single-threaded application alternating between disk writes and compression operations, it seems much less likely to me.
If it's just text, then compression could definitely help. Just choose a compression algorithm and settings that make the compression cheap. "gzip" is cheaper than "bzip2", and both have parameters that you can tweak to favor speed or compression ratio.
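A minimal sketch of the "cheap gzip" idea using zlib's gzip file API, with level 1 to favor speed; the filename and log line are placeholders.

```cpp
// Hedged sketch: write gzip-compressed log output with zlib at the fastest level.
#include <zlib.h>
#include <cstring>

int main()
{
    gzFile f = gzopen("app.log.gz", "wb1");   // "1" = level 1, favor speed over ratio
    if (!f) return 1;

    const char *line = "2016-01-01 00:00:00 INFO something happened\n";
    for (int i = 0; i < 100000; i++)
        gzwrite(f, line, static_cast<unsigned>(strlen(line)));

    gzclose(f);
    return 0;
}
```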
If you are I/O bound saving human-readable text to the hard drive, I expect compression to reduce your total runtime.
If you have an idle 2 GHz core and a relatively fast 100 MB/s streaming hard drive, then halving the net logging time requires at least 2:1 compression and no more than roughly 10 CPU cycles per uncompressed byte for the compressor to ponder the data. With a dual-pipe processor, that's (very roughly) 20 instructions per byte.
I see that LZRW1-A (one of the fastest compression algorithms) uses 10 to 20 instructions per byte, and compresses typical English text about 2:1.
At the upper end (20 instructions per byte), you're right on the edge between IO bound and CPU bound. At the middle and lower end, you're still IO bound, so there are a few cycles available (not many) for a slightly more sophisticated compressor to ponder the data a little longer.
If you have a more typical non-top-of-the-line hard drive, or the hard drive is slower for some other reason (fragmentation, other multitasking processes using the disk, etc.), then you have even more time for a more sophisticated compressor to ponder the data.
You might consider setting up a compressed partition, saving the data to that partition (letting the device driver compress it), and comparing the speed to your original speed.
That may take less time and be less likely to introduce new bugs than changing your program and linking in a compression algorithm.
I see a list of compressed file systems based on FUSE, and I hear that NTFS also supports compressed partitions.
If this particular machine is often IO bound, another way to speed it up is to install a RAID array. That would give a speedup to every program and every kind of data (even incompressible data).
For example, the popular RAID 1+0 configuration with 4 total disks gives a speedup of nearly 2x.
The nearly as popular RAID 5 configuration, with the same 4 total disks, gives a speedup of nearly 3x.
It is relatively straightforward to set up a RAID array with a speed 8x the speed of a single drive.
High compression ratios, on the other hand, are apparently not so straightforward. A compression ratio of "merely" 6.30 to one would win you a cash prize for breaking the current world record for compression (the Hutter Prize).
This used to be something that could improve performance in quite a few applications way back when. I'd guess that today it's less likely to pay off, but it might in your specific circumstance, particularly if the data you're logging is easily compressible.
However, as Shog9 commented:
Rules of thumb aren't going to help you here. It's your disk, your CPU, and your data. Set up a test case and measure throughput and CPU load with and without compression - see if it's worth the tradeoff.