OpenCL read_imagef weird behaviour - image

I have an image in CL_FLOAT format that stores all RGBA channels. Every 4th pixel of the image has integers stored in it; I store them classically as:
image[i * 4 + 3].x = *(float*)(&someInt);
image[i * 4 + 3].y = *(float*)(&someInt2);
etc.
As I need these values to be integers (and not floats), while the rest of the pixels have to store floats, I don't have many options here.
When I read the image back from OpenCL I get the values correctly; the problem arises in the OpenCL kernel.
Whenever I read the image like this (the sampler is set to nearest filtering):
float4 fourthPixel = read_imagef(img, sampler, coords);
And I try to convert it to integer as
int id = as_int(fourthPixel.x);
I don't read back the correct number (it always returns 0 unless the number is quite high in integer form).
I have found a few things so far: if I store a number like 1505353234, it WORKS, giving me back 6539629947781120.000000, which is correct. If I store smaller numbers, it seems that read_imagef just clamps them down to 0.
So it's quite obvious that ALL denormalized numbers are clamped down to zero. Is there any good way to force read_imagef not to clamp denormalized numbers to zero, without adding further instructions (yes, I could add 0x7f000000 or similar, but I need performance in this code, so that solution is unacceptable)?

So apparently reading the image through read_imagei works fine. I also looked at the specs and found out that the device is allowed to flush denormalized floats to zero.
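For illustration, a kernel along these lines matches that conclusion (a rough, untested sketch; img, smp and out are placeholder names, and note that the spec only defines read_imagei for images created with an integer channel data type such as CL_SIGNED_INT32, in which case the ordinary float pixels can be recovered with as_float):
__kernel void read_packed(__read_only image2d_t img,
                          sampler_t smp,          // nearest filtering, as in the question
                          __global int *out)
{
    int2 coord = (int2)(get_global_id(0), get_global_id(1));

    // Integer read: the channel bits come back untouched, so small
    // integers are not flushed to zero the way denormal floats are.
    int4 packed = read_imagei(img, smp, coord);
    int id = packed.x;                              // no as_int() needed

    // A float stored in another channel could be recovered with as_float:
    // float f = as_float(packed.y);

    out[coord.y * get_image_width(img) + coord.x] = id;
}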

Related

Lightweight (de)compression algorithm for embedded use

I have a low-resource embedded system with a graphical user interface. The interface requires font data. To conserve read-only memory (flash), the font data needs to be compressed. I am looking for an algorithm for this purpose.
Properties of the data to be compressed
transparency data for a rectangular pixel map with 8 bits per pixel
there are typically around 200..300 glyphs in a font (typeface sampled at a certain size)
each glyph is typically from 6x9 to 15x20 pixels in size
there are a lot of zeros ("no ink") and somewhat less 255's ("completely inked"), otherwise the distribution of octets is quite even due to the nature of anti-aliasing
Requirements for the compression algorithm
The important metric for the decompression algorithm is the size of the data plus the size of the algorithm (as they will reside in the same limited memory).
There is very little RAM available for the decompression; it is possible to decompress the data for a single glyph into RAM but not much more.
To make things more difficult, the algorithm has to be very fast on a 32-bit microcontroller (ARM Cortex-M core), as the glyphs need to be decompressed while they are being drawn onto the display. Ten or twenty machine cycles per octet is ok, a hundred is certainly too much.
To make things easier, the complete corpus of data is known a priori, and there is a lot of processing power and memory available during the compression phase.
Conclusions and thoughts
The naïve approach of just packing each octet by some variable-length encoding does not give good results due to the relatively high entropy.
Any algorithm taking advantage of data decompressed earlier seems to be out of the question, as it is not possible to store the decompressed data of other glyphs. This makes LZ algorithms less efficient as they can only reference a small amount of data.
Constraints on the processing power seem to rule out most bitwise operations, i.e. decompression should handle the data octet-by-octet. This makes Huffman coding difficult and arithmetic coding impossible.
The problem seems to be a good candidate for static dictionary coding, as all data is known beforehand, and the data is somewhat repetitive in nature (different glyphs share same shapes).
Questions
How can a good dictionary be constructed? I know finding the optimal dictionary for certain data is an NP-complete problem, but are there any reasonably good approximations? I have tried Zstandard's dictionary builder, but the results were not very good.
Is there something in my conclusions that I've gotten wrong? (Am I on the wrong track and omitting something obvious?)
Best algorithm this far
Just to give some background information, the best useful algorithm I have been able to figure out is as follows:
All samples in the font data for a single glyph are concatenated (flattened) into a one-dimensional array (vector, table).
Each sample has three possible states: 0, 255, and "something else".
This information is packed five consecutive samples at a time into a 5-digit base-three number (0..3^5-1).
As there are some extra values available in an octet (2^8 = 256, 3^5 = 243), they are used to signify longer strings of 0's and 255's.
For each "something else" value the actual value (1..254) is stored in a separate vector.
This data is fast to decompress, as the base-3 values can be decoded into base-4 values by a smallish (243 x 3 = 729 octets) lookup table. The compression ratios are highly dependent on the font size, but with my typical data I can get around 1:2. As this is significantly worse than LZ variants (which get around 1:3), I would like to try the static dictionary approach.
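For concreteness, a rough sketch of that packing (my own illustration with hypothetical names; the extra codes above 242 used for longer runs of 0s and 255s, and the separate vector of "something else" values, are left out):
#include <stdint.h>

/* Pack five samples, each already reduced to a trit
   (0 = "no ink", 1 = "fully inked", 2 = "something else"),
   into one octet as a 5-digit base-3 number (0..242). */
static uint8_t encode_five(const uint8_t trits[5])
{
    uint8_t code = 0;
    for (int i = 0; i < 5; ++i)
        code = (uint8_t)(code * 3 + trits[i]);
    return code;
}

/* Decode one octet back into five trits; in the decompressor this
   whole function is replaced by the 243-entry lookup table. */
static void decode_five(uint8_t code, uint8_t trits[5])
{
    for (int i = 4; i >= 0; --i) {
        trits[i] = code % 3;
        code /= 3;
    }
}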
Of course, the usual LZ variants use Huffman or arithmetic coding, which naturally makes the compressed data smaller. On the other hand, I have all the data available, and the compression speed is not an issue. This should make it possible to find much better dictionaries.
Due to the nature of the data I could be able to use a lossy algorithm, but in that case the most likely lossy algorithm would be reducing the number of quantization levels in the pixel data. That won't change the underlying compression problem much, and I would like to avoid the resulting bit-alignment hassle.
I do admit that this is a borderline case of being a good answer to my question, but as I have researched the problem somewhat, this answer both describes the approach I chose and gives some more information on the nature of the problem should someone bump into it.
"The right answer" a.k.a. final algorithm
What I ended up with is a variant of what I describe in the question. First, each glyph is split into trits 0, 1, and intermediate. This ternary information is then compressed with a 256-slot static dictionary. Each item in the dictionary (or look-up table) is a binary encoded string (0=0, 10=1, 11=intermediate) with a single 1 added to the most significant end.
The grayscale data (for the intermediate trits) is interspersed between the references to the look-up table. So, the data essentially looks like this:
<LUT reference><gray value><gray value><LUT reference>...
The number of gray scale values naturally depends on the number of intermediate trits in the ternary data looked up from the static dictionary.
Decompression code is very short and can easily be written as a state machine with only one pointer and one 32-bit variable giving the state. Something like this:
static uint32_t trits_to_decode;
static uint8_t *next_octet;

/* This should be called when starting to decode a glyph
   data : pointer to the compressed glyph data */
void start_glyph(uint8_t *data)
{
    next_octet = data;       // set the pointer to the beginning of the glyph
    trits_to_decode = 1;     // this triggers reloading a new dictionary item
}

/* This function returns the next 8-bit pixel value */
uint8_t next_pixel(void)
{
    uint8_t return_value;

    // end sentinel only? if so, we are out of ternary data
    if (trits_to_decode == 1)
        // get the next ternary dictionary item
        trits_to_decode = dictionary[*next_octet++];

    // get the next pixel from the ternary word
    // check the LSB bit(s)
    if (trits_to_decode & 1)
    {
        trits_to_decode >>= 1;
        // either full value or gray value, check the next bit
        if (trits_to_decode & 1)
        {
            trits_to_decode >>= 1;
            // grayscale value; get next from the buffer
            return *next_octet++;
        }
        // if we are here, it is a full value
        trits_to_decode >>= 1;
        return 255;
    }
    // we have a zero, return it
    trits_to_decode >>= 1;
    return 0;
}
(The code has not been tested in exactly this form, so there may be typos or other stupid little errors.)
There is a lot of repetition with the shift operations. I am not too worried, as the compiler should be able to clean it up. (Actually, left shift could be even better, because then the carry bit could be used after shifting. But as there is no direct way to do that in C, I don't bother.)
One more optimization relates to the size of the dictionary (look-up table). There may be short and long items, and hence it can be built to support 32-bit, 16-bit, or 8-bit items. In that case the dictionary has to be ordered so that small numerical values refer to 32-bit items, middle values to 16-bit items and large values to 8-bit items to avoid alignment problems. Then the look-up code looks like this:
static uint32_t dictionary_lookup(uint8_t octet)   // must return uint32_t: items can be up to 32 bits wide
{
    // the dictionary is ordered so that the lowest indices refer to the widest items
    if (octet < NUMBER_OF_32_BIT_ITEMS)
        return dictionary32[octet];
    if (octet < NUMBER_OF_32_BIT_ITEMS + NUMBER_OF_16_BIT_ITEMS)
        return dictionary16[octet - NUMBER_OF_32_BIT_ITEMS];
    return dictionary8[octet - NUMBER_OF_16_BIT_ITEMS - NUMBER_OF_32_BIT_ITEMS];
}
Of course, if every font has its own dictionary, the constants will become variables looked up from the font information. Any half-decent compiler will inline that function, as it is called only once.
If the number of quantization levels is reduced, it can be handled, as well. The easiest case is with 4-bit gray levels (1..14). This requires one 8-bit state variable to hold the gray levels. Then the gray level branch will become:
// new state value
static uint8_t gray_value;
...
// new variable within the next_pixel() function
uint8_t return_value;
...
// there is no old gray value available?
if (gray_value == 0)
gray_value = *next_octet++;
// extract the low nibble
return_value = gray_value & 0x0f;
// shift the high nibble into low nibble
gray_value >>= 4;
return return_value;
This actually allows using 15 intermediate gray levels (a total of 17 levels), which maps very nicely into linear 255-value system.
Three- or five-bit data is easiest to pack into a 16-bit halfword with the MSB always set to one. Then the same trick as with the ternary data can be used (shift until you get 1).
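As a sketch of that idea (my own illustration, not the exact format used): pack the 3-bit values LSB-first under a sentinel 1 bit and decode with the same "shift until you get 1" loop:
#include <stdint.h>

/* Pack up to five 3-bit values under a sentinel 1 bit
   (5 * 3 bits + sentinel = 16 bits). */
static uint16_t pack3(const uint8_t *values, int count)   /* count <= 5 */
{
    uint16_t word = 1;                      /* sentinel marks the end        */
    for (int i = count - 1; i >= 0; --i)    /* values[0] ends up lowest,     */
        word = (uint16_t)((word << 3) | (values[i] & 7)); /* extracted first */
    return word;
}

/* Pull values back out; returns 1 while data remains, 0 when only the
   sentinel is left (the same trick as with the ternary words). */
static int unpack3(uint16_t *word, uint8_t *value)
{
    if (*word == 1)
        return 0;
    *value = (uint8_t)(*word & 7);
    *word >>= 3;
    return 1;
}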
It should be noted that the compression ratio starts to deteriorate at some point. The amount of compression with the ternary data does not depend on the number of gray levels. The gray level data is uncompressed, and the number of octets scales (almost) linearly with the number of bits. For a typical font the gray level data at 8 bits is 1/2 .. 2/3 of the total, but this is highly dependent on the typeface and size.
So, reduction from 8 to 4 bits (which is visually quite imperceptible in most cases) reduces the compressed size typically by 1/4..1/3, whereas the further reduction offered by going down to three bits is significantly less. Two-bit data does not make sense with this compression algorithm.
How to build the dictionary?
If the decompression algorithm is very straightforward and fast, the real challenges are in the dictionary building. It is easy to prove that there is such a thing as an optimal dictionary (the dictionary giving the smallest number of compressed octets for a given font), but wiser people than me seem to have proven that the problem of finding such a dictionary is NP-complete.
With my arguably rather lacking theoretical knowledge of the field, I thought there would be great tools offering reasonably good approximations. There might be such tools, but I could not find any, so I rolled my own Mickey Mouse version. EDIT: the earlier algorithm was rather goofy; a simpler and more effective one was found:
start with a static dictionary of '0', 'g', '1' (where 'g' signifies an intermediate value)
split the ternary data for each glyph into a list of trits
find the most common consecutive combination of items (at the first iteration it will most probably be '0', '0')
replace all occurrences of that combination with a reference to the new dictionary item and add the combination to the dictionary (e.g., the data '0', '1', '0', '0', 'g' becomes '0', '1', '00', 'g' if '0', '0' is replaced by '00')
remove any unused items from the dictionary (they may occur, at least in theory)
repeat steps 3-5 until the dictionary is full (i.e. at least 253 rounds)
This is still a very simplistic approach and it probably gives a very sub-optimal result. Its only merit is that it works.
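To make steps 3-5 concrete, one merge iteration could look roughly like the sketch below (my own illustration with hypothetical names, operating on a flat array of dictionary indices; the real tool also records the new dictionary entry as the concatenation of the trit strings of the merged pair, and runs the counting over all glyphs together):
#include <stdint.h>
#include <string.h>

#define MAX_SYMS 256                 /* final dictionary size */

/* seq holds dictionary indices; returns the new length after merging the
   most frequent adjacent pair into the new symbol new_sym. The pair that
   was merged is reported through out_a / out_b so the caller can record
   the new dictionary entry. */
static size_t merge_best_pair(uint16_t *seq, size_t len, uint16_t new_sym,
                              uint16_t *out_a, uint16_t *out_b)
{
    static uint32_t count[MAX_SYMS][MAX_SYMS];   /* 256 KB, fine offline */
    memset(count, 0, sizeof count);

    /* step 3: count all adjacent combinations */
    for (size_t i = 0; i + 1 < len; ++i)
        count[seq[i]][seq[i + 1]]++;

    uint32_t best = 0;
    uint16_t a = 0, b = 0;
    for (int x = 0; x < MAX_SYMS; ++x)
        for (int y = 0; y < MAX_SYMS; ++y)
            if (count[x][y] > best) { best = count[x][y]; a = (uint16_t)x; b = (uint16_t)y; }

    if (best < 2)                    /* nothing worth merging */
        return len;
    *out_a = a;
    *out_b = b;

    /* step 4: replace every occurrence of the pair with new_sym */
    size_t w = 0;
    for (size_t r = 0; r < len; ) {
        if (r + 1 < len && seq[r] == a && seq[r + 1] == b) {
            seq[w++] = new_sym;
            r += 2;
        } else {
            seq[w++] = seq[r++];
        }
    }
    return w;
}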
How well does it work?
One answer is: well enough, but to elaborate on that a bit, here are some numbers. This is a font with 864 glyphs, a typical glyph size of 14x11 pixels, and 8 bits per pixel.
raw uncompressed size: 127101
number of intermediate values: 46697
Shannon entropies (octet-by-octet):
total: 528914 bits = 66115 octets
ternary data: 176405 bits = 22051 octets
intermediate values: 352509 bits = 44064 octets
simply compressed ternary data (0=0, 10=1, 11=intermediate) (127101 trits): 207505 bits = 25939 octets
dictionary compressed ternary data: 18492 octets
entropy: 136778 bits = 17097 octets
dictionary size: 647 octets
full compressed data: 647 + 18492 + 46697 = 65836 octets
compression: 48.2 %
The comparison with octet-by-octet entropy is quite revealing. The intermediate value data has high entropy, whereas the ternary data can be compressed. This also reflects the high number of values 0 and 255 in the raw data (compared to any of the intermediate values).
We do not do anything to compress the intermediate values, as there do not seem to be any meaningful patterns. However, we beat entropy by a clear margin with the ternary data, and even the total amount of data is below the entropy limit. So, we could do worse.
Reducing the number of quantization levels to 17 would reduce the data size to approximately 42920 octets (compression over 66 %). The entropy is then 41717 octets, so the algorithm gets slightly worse as is expected.
In practice, smaller font sizes are difficult to compress. This should be no surprise, as a larger fraction of the information is in the grayscale data. Very big font sizes compress efficiently with this algorithm, but there run-length compression is a much better candidate.
What would be better?
If I knew, I would use it! But I can still speculate.
Jubatian suggests there would be a lot of repetition in a font. This must be true with the diacritics, as aàäáâå have a lot in common in almost all fonts. However, it does not seem to be true with letters such as p and b in most fonts. While the basic shape is close, it is not enough. (Careful pixel-by-pixel typeface design is then another story.)
Unfortunately, this inevitable repetition is not very easy to exploit in smaller font sizes. I tried creating a dictionary of all possible scan lines and then only referencing those. Unfortunately, the number of different scan lines is high, so the overhead added by the references outweighs the benefits. The situation changes somewhat if the scan lines themselves can be compressed, but there the small number of octets per scan line makes efficient compression difficult. This problem is, of course, dependent on the font size.
My intuition tells me that this would still be the right way to go, if both longer and shorter runs than full scan lines are used. Combined with 4-bit pixels this would probably give very good results, if only there were a way to create that optimal dictionary.
One hint in this direction is that an LZMA2-compressed file (xz at the highest compression) of the complete font data (127101 octets) is only 36720 octets. Of course, this format fulfils none of the other requirements (fast to decompress, can be decompressed glyph by glyph, low RAM requirements), but it still shows there is more redundancy in the data than my cheap algorithm has been able to exploit.
Dictionary coding is typically combined with Huffman or arithmetic coding after the dictionary step. We cannot do it here, but if we could, it would save another 4000 octets.
You can consider using something already developed for a scenario similar to yours:
https://github.com/atomicobject/heatshrink
https://spin.atomicobject.com/2013/03/14/heatshrink-embedded-data-compression/
You could try lossy compression using a sparse representation with a custom dictionary.
The output of each glyph is a superposition of 1..N blocks from the dictionary;
most CPU time is spent in preprocessing;
predetermined decoding time (a maximum, average, or constant number of additions per pixel);
controllable compressed size (dictionary size + the x, y, n codes per glyph).
It seems that the simplest lossy method would be to reduce the number of bits per pixel. With glyphs of that size, 16 levels are likely to be sufficient. That would halve the data immediately; then you might apply your existing algorithm on the values 0, 16 or "something else" to perhaps halve it again.
I would go for Clifford's answer, that is, converting the font to 4 bits per pixel first, which is sufficient for this task.
Then, since this is a font, you have lots of row repetition, that is, rows defining one character that match rows of another character. Take for example the letters 'p' and 'b': the middle parts of these letters should be the same (and you will have even more matches if the target language uses lots of diacritics). Your encoder could then first collect all distinct rows of the font and store those, and each character image is then formed from a list of pointers into the rows.
The efficiency depends on the font, of course; depending on the source, you might need some preprocessing to get it to compress better with this method.
If you want more, you might rather choose to go for 3 bits per pixel or even 2 bits per pixel, depending on your goals (and some willingness to hand-tune the font images); these might still be satisfactory.
Overall, this method works very well for real-time display (you only need to follow a pointer to get the row data).

Integer Time Series compression

Is there a well known documented algorithm for (positive) integer streams / time series compression, that would:
have variable bit length
work on deltas
My input data is a stream of temperature measurements from a sensor (more specifically, a TMP36 read out by an Arduino). It is physically impossible for big jumps to occur between measurements (the time constant of the sensor). I therefore think my compression algorithm should work on deltas (set a base at stream start and then only store the difference to the next value). Because the jumps are limited, I want variable bit lengths: differences lower than 4 fit in 2 bits, lower than 8 in 3 bits, and so on. But there is a dilemma between signalling the bit size of the next delta in the stream, and just working on, say, 3-bit deltas and signalling the size only when the delta is bigger.
Any idea what algorithm solves that one?
Use variable-length integers to code the deltas between values, and feed that to zlib to do the compression.
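A minimal sketch of that pipeline (the sign mapping, the 7-bits-per-byte varint, and the fixed 4 KB scratch buffer are my own choices for illustration; only zlib's one-shot compress() is assumed from the library):
#include <stddef.h>
#include <stdint.h>
#include <zlib.h>

/* Delta-code the samples, map each delta to an unsigned value
   (v -> 2v, -v -> 2v-1, the scheme described in the next answer),
   emit it as a varint, then hand the byte stream to zlib. */
static size_t encode_varint_deltas(const int32_t *samples, size_t n,
                                   uint8_t *out)
{
    size_t  pos  = 0;
    int32_t prev = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t  d = samples[i] - prev;
        prev = samples[i];
        uint32_t u = (d < 0) ? 2u * (uint32_t)(-(int64_t)d) - 1u
                             : 2u * (uint32_t)d;
        while (u >= 0x80) {          /* continuation bit set: more bytes follow */
            out[pos++] = (uint8_t)(u | 0x80);
            u >>= 7;
        }
        out[pos++] = (uint8_t)u;     /* top bit clear: last byte of this value */
    }
    return pos;
}

int compress_samples(const int32_t *samples, size_t n,
                     uint8_t *dst, uLongf *dst_len)
{
    uint8_t buf[4096];               /* enough for a few hundred samples here */
    size_t  len = encode_varint_deltas(samples, n, buf);
    return compress(dst, dst_len, buf, (uLong)len);   /* zlib one-shot API */
}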
First of all, there are different formats in existence. One thing I would do first is get rid of the sign. A sign is usually a distraction when thinking about compression. I usually use the scheme where every positive value v becomes 2*v and every negative value becomes 2*(-v)-1. So 0 = 0, -1 = 1, 1 = 2, -2 = 3, 2 = 4, ...
Since with that scheme you have nothing like 0b11111111 = -1, the leading bits are gone. Now you can think about how to compress those symbols / numbers. One thing you can do is create a representative sample and use it to train a static Huffman code. This should be possible within your on-chip constraints. Another, simpler approach is to use Huffman codes for the bit lengths and write the value bits to the stream. So 0 = bit length 0, -1 = bit length 1, 2,3 = bit length 2, ... . By using Huffman codes to describe the bit length you get quite compact literals.
I usually use a mixture: I encode the most frequent symbols / values as raw values, and the less frequent numbers as a bit length plus the bit pattern of the actual value. This way you stay compact and do not have to deal with excessive tables (there are only 64 possible symbols for the 64 bit lengths).
There are also other schemes, like a continuation bit, where for example in every byte the first (or highest) bit marks whether another byte of the value follows: as long as the bit is set there will be another byte for the integer, and if it is zero it's the last byte of the value.
I usually train a static Huffman code for such purposes. It's easy, and you can even turn the encoding and decoding tables into generated source code (simply create if/switch statements and write your tables as arrays in your code).
You can use integer compression methods with delta or delta-of-delta encoding, as used in TurboPFor Integer Compression. Gamma coding can also be used if the deltas have very small values.
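For reference, Elias gamma coding of a value v >= 1 writes floor(log2 v) zero bits followed by the binary form of v, so small values get very short codes. A rough sketch (my own illustration, with a hypothetical MSB-first bit writer; signed deltas would first be mapped to values >= 1, e.g. with the sign mapping above plus one):
#include <stddef.h>
#include <stdint.h>

/* Minimal MSB-first bit writer; buf must be zero-initialised. */
typedef struct { uint8_t *buf; size_t bitpos; } bitwriter;

static void put_bit(bitwriter *w, unsigned bit)
{
    if (bit)
        w->buf[w->bitpos >> 3] |= (uint8_t)(0x80u >> (w->bitpos & 7));
    w->bitpos++;
}

/* Elias gamma: 1 -> "1", 2 -> "010", 3 -> "011", 4 -> "00100", ... */
static void put_gamma(bitwriter *w, uint32_t v)
{
    int n = 0;
    while ((v >> (n + 1)) != 0)      /* n = floor(log2 v) */
        n++;
    for (int i = 0; i < n; ++i)
        put_bit(w, 0);
    for (int i = n; i >= 0; --i)
        put_bit(w, (v >> i) & 1u);
}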
The current state of the art for this problem is Quantile Compression. It compresses numerical sequences such as integers and typically achieves 35% higher compression ratio than other approaches. It has delta encoding as a built-in feature.
CLI example:
cargo run --release compress \
--csv my.csv \
--col-name my_col \
--level 6 \
--delta-order 1 \
out.qco
Rust API example:
let my_nums: Vec<i64> = ...
let compressor = Compressor::<i64>::from_config(CompressorConfig {
compression_level: 6,
delta_encoding_order: 1,
});
let bytes: Vec<u8> = compressor.simple_compress(&my_nums);
println!("compressed down to {} bytes", bytes.len());
It does this by describing each number with a Huffman code for a range (a [lower, upper] bound) followed by an exact offset into that range.
By strategically choosing the ranges based on your data, it comes close to the Shannon entropy of the data distribution.
Since your data comes from a temperature sensor, your data should be very smooth, and you may even consider delta orders higher than 1 (e.g. delta order 2 is "delta-of-deltas").
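For clarity, "delta order 2" just means differencing twice; a tiny sketch of the transform (my own illustration, not the library's internals):
#include <stddef.h>
#include <stdint.h>

/* Delta order 2 ("delta-of-deltas"): difference the stream twice.
   Smooth sensor data tends to become near-zero after this, which is
   what makes the subsequent coding effective. */
static void delta_of_deltas(const int32_t *in, int32_t *out, size_t n)
{
    int32_t prev = 0, prev_delta = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t delta = in[i] - prev;   /* first-order difference  */
        out[i] = delta - prev_delta;    /* second-order difference */
        prev = in[i];
        prev_delta = delta;
    }
}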

Testing TIFF data gives ?conflicting? bit depths in MATLAB

I'm trying to write a function in Matlab that reads in TIFF images from various cameras and restores them to their correct data values for analysis. These cameras are from a variety of brands and, so far, store either 12 or 14 bit data in a 16 bit output. I've been reading them in using imread, and I was told that dividing by either 16 or 4 would convert the data back to its original form. Unfortunately, that was when the function was only intended for one specific brand of camera, which nicely scales its data to 16 bit at the time of capture so that such a transformation would work.
Since I'd like to keep the whole image property detection thing as automated as possible, I've done some digging in the data for a couple different cameras, and I'm running into an issue that I must be completely clueless about. I've determined (so far) that the pictures will always be stored in one of two ways: such that the previous method will work (they multiply the original data out to fill the 16 bits), or they just stuff the data in directly and append zeroes to the front or back for any vacant bits. I decided to see if I could detect which was which and have been using the following two methods. The images I test should easily have values that fill up the full range from zero to saturation (though sometimes not quite), and are fairly large resolution, so in theory these methods should work:
I start by reading in the image data:
Mframe = imread('signal.tif');
This method attempts to detect the number of bits that ever get used:
bits = 0;
for i = 1:16
Bframe = bitget(Mframe,i);
bits = bits + max(max(Bframe));
end
And this method attempts to find if there has been a scaling operation done:
Mframe = imread('signal.tif');
Dframe = diff(Mframe);
mindiff = min(min(nonzeros(Dframe)));
As a 3rd check I always look at the maximum value of my input image:
maxval = max(max(Mframe));
Please check my understanding here:
The value of maxval should be at 65532 in the case of a 16 bit image containing any saturation.
If the 12 or 14 bit data has been scaled to 16 bit, it should return maxval of 65532, a mindiff of 16 or 4 respectively, and bits as 16.
If the 12 or 14 bit data was stored directly with leading/trailing zeros, it can't return a maxval of 65532, mindiff should not return 16 or 4 (though it IS remotely possible), and bits should show as 12 or 14 respectively.
If an image is actually not reaching saturation, it can't return a maxval of 65532, mindiff should still act as described for the two cases above, and bits could possibly return as one lower than it otherwise would.
Am I correct in the above? If not please show me what I'm not understanding (I'm definitely not a computer scientist), because I seem to be getting data that conflicts with this.
Only one case appears to work just like I expect. I know the data to be 12 bit, and my testing shows maxval near 65532, mindiff of 16, and bits as 15. I can conclude that this image is not saturated and is a 12 bit scaled to 16 bit.
Another case for a different brand I know to have 12 bit output, and testing an image that I know isn't quite saturated gives me maxval of 61056, mindiff of 16, and bits as 12. ???
Yet another case, for yet again another brand, is known to have 14 bit output, and when I test an image I know to be saturated it gives me maxval of 65532, mindiff of 4, and bits as 15. ???
So very confused.
Well, after a lot of digging I finally figured it all out. I wrote some code to help me understand the differences between the different files and discovered that a couple of the cameras had "signatures" of sorts in them. I'm contacting the manufacturers for more information, but one in particular appears to be a timestamp that always occurs in the first 2 pixels.
Anyhow, I wrote the following code to fix the two issues I found and now everything is working peachy:
Mframe = imread('signal.tiff');
minval = min(min(Mframe));
mindiff = min(min(nonzeros(diff(Mframe))));
fixbit = log2(double(mindiff));
if rem(fixbit,2) % Correct Brand A Issues
fixbit = fixbit + 1;
Bframe = bitget(Mframe,fixbit);
[x,y] = find(Bframe==1);
for i=1:length(x)
Mframe(x(i),y(i)) = Mframe(x(i),y(i)) + mindiff;
end
end
for i=1:4 % Correct Brand B Timestamp
Bframe = bitget(Mframe,i);
if any(any(Bframe))
Mframe(1,1) = minval; Mframe(1,2) = minval;
end
end
bits = 0; % initialise the bit-depth counter
for i = 1:16 % Get actual bit depth
Bframe = bitget(Mframe,i);
bits = bits + max(max(Bframe));
end
As for the Brand A issues, that camera appears to have bad data in just a few pixels of every frame (not the same ones every time), where a pixel takes a value one bit finer than should be possible relative to the pixel below it. For example, in a 12 bit picture the minimum difference should be 16 and a 14 bit picture should have a minimum difference of 4, but these pixels have values that are 8 and 2 lower, respectively, than the pixel below them. I don't know why that's happening, but it was fairly simple to gloss over.

How to efficiently store and manipulate sparse binary matrices in Octave?

I'm trying to manipulate sparse binary matrices in GNU Octave, and it's using way more memory than I expect, and relevant sparse-matrix functions don't behave the way I want them to. I see this question about higher-than-expected sparse-matrix storage in MATLAB, which suggests that this matrix should consume even more memory, but helped explain (only) part of this situation.
For a sparse, binary matrix, I can't figure out any way to get Octave to NOT STORE the array of values (they're always implicitly 1, so need not be stored). Can this be done? Octave always seems to consume memory for a values array.
A trimmed-down example demonstrating the situation: create random sparse matrix, turn it into "binary":
mys=spones(sprandn(1024,1024,.03)); nnz(mys), whos mys
Shows the situation. The consumed size is consistent with the storage mechanism outlined in the aforementioned SO answer and expanded below, if spones() creates an array of storage class double and if all indices are 32-bit (i.e., TotalStorageSize - rowIndices - columnIndices == NumNonZero*sizeof(double)). Unnecessarily storing these values (all 1s, as doubles) accounts for over half of the total memory consumed by this 3%-sparse object.
After messing with this (for too long) while composing this question, I discovered some partial workarounds, so I'm going to "self-answer" (only) part of the question for continuity (hopefully), but I didn't figure out an adequate answer to main question:
How do I create an efficiently-stored ("no-/implicit-values") binary matrix in Octave?
Additional background on storage format follows...
The Octave docs say the storage format for sparse matrices is Compressed Sparse Column (CSC). This seems to imply storing the following arrays (expanding on the aforementioned SO answer, with canonical Yale format labels and tweaks for column-major order):
values (A), number-of-nonzeros (NNZ) entries of storage-class size;
row numbers (IA), NNZ entries of index size (hopefully int64 but maybe int32);
start of each column (JA), number-of-columns-plus-1 entries of index size.
In this case, for binary-only storage, I hope there's a way to completely avoid storing array (A), but I can't figure it out.
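Putting numbers on that layout for the example matrix (the whos output further down reports 32250 nonzeros and 391100 bytes), and assuming 8-byte doubles and 4-byte indices: 32250*8 (values) + 32250*4 (row indices) + 1025*4 (column starts) = 258000 + 129000 + 4100 = 391100 bytes, which matches exactly; the values array alone is roughly two thirds of the total.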
Full disclosure: As noted above, as I was composing this question, I discovered a workaround to reduce memory usage, so I'm "self-answering" part of this here, but it still isn't fully satisfying, so I'm still listening for a better actual answer to storage of a sparse binary matrix without a trivial, bloated, unnecessary values array...
To get a binary-like value out of a number-like value and reduce the memory usage in this case, use "logical" storage, created by logical(X). For example, building from above,
logicalmys = logical(mys);
creates a sparse bool matrix that takes up less memory (a 1-byte logical rather than an 8-byte double for each entry of the values array).
Adding more information to the whos information using whos_line_format helps illuminate the situation: The default string includes 5 of the 7 properties (see docs for more). I'm using the format string
whos_line_format(" %a:4; %ln:6; %cs:16:6:1; %rb:12; %lc:8; %e:10; %t:20;\n")
to add display of "elements", and "type" (which is distinct from "class").
With that, whos mys logicalmys shows something like
Attr  Name         Size         Bytes    Class    Elements  Type
====  ====         ====         =====    =====    ========  ====
      mys          1024x1024    391100   double   32250     sparse matrix
      logicalmys   1024x1024    165350   logical  32250     sparse bool matrix
So this shows a distinction between sparse matrix and sparse bool matrix. However, the total memory consumed by logicalmys is consistent with actually storing an array of NNZ booleans (1-byte) -- That is:
totalMemory minus rowIndices minus columnOffsets leaves NNZ bytes left;
in numbers,
165350 - 32250*4 - 1025*4 == 32250.
So we're still storing 32250 elements, all of which are 1. Further, if you set one of the 1-elements to zero, it reduces the reported storage! For a good time, try: pick a nonzero element, e.g., (42,1), then zero it: logicalmys(42,1) = 0; then whos it!
My hope is that this is correct, and that this clarifies some things for those who might be interested. Comments, corrections, or actual answers welcome!

Compression performance on certain types of data

I am testing my new image file format, which, without going into unnecessary detail, consists of the PPM RGB 24-bit-per-pixel format sent through zlib's compression stream, with an 8-byte header prepended.
While I was writing up tests to evaluate the performance of the code that implements this, I had one test case that produced pretty terrible results.
unsigned char *image = new unsigned char[3000*3000*3];
for(int i=0;i<3000*3000;++i) {
    image[i*3] = i%255;
    image[i*3+1] = (i/2)%255;
    image[i*3+2] = (i*i*i)%255;
}
Now what I'm doing here is creating a 3000x3000 fully packed 3 byte per pixel image, which has red and green stripes increasing steadily, but the blue component is going to be varying quite a bit.
When I compressed this using the zlib stream for my .ppmz format, it was able to reduce the size from 27,000,049 bytes (the reason it is not an even 27 million is that 49 bytes are in the headers) to 25,545,520 bytes. This compressed file is 94.6% of the original size.
This got me rather flustered at first because I figured that even if the blue component was so chaotic it couldn't be helped much, at least the red and green components repeated themselves quite a bit. A smart enough compressor ought to be able to shrink to about 1/3 the size...
To test that, I took the original 27MB uncompressed file and RAR'd it, and it came out to 8,535,878 bytes. This is quite good, at 31.6%, even better than one-third!
Then I realized I had made a mistake defining my test image. I was using mod 255 when I should have been letting the values wrap over the full 0..255 range, which is mod 256:
unsigned char *image = new unsigned char[3000*3000*3];
for(int i=0;i<3000*3000;++i) {
    image[i*3] = i%256;
    image[i*3+1] = (i/2)%256;
    image[i*3+2] = (i*i*i)%256;
}
The thing is, there is now just one more value that my pixels can take, which I was skipping previously. But when I ran my code again, the ppmz became a measly 145797 byte file. WinRAR squeezed it into 62K.
Why would this tiny change account for such a massive difference? Even mighty WinRAR couldn't get the original file under 8 MB. What is it about values repeating every 256 steps, rather than every 255 steps, that changes things so completely? I get that with the %255 the first two color components' patterns end up slightly out of phase, but the behavior is hardly random. And then there's just crazy modular arithmetic being dumped into the last channel. But I don't see how it could account for such a huge gap in performance.
I wonder if this is more of a math question than a programming question, but I really don't see how the original data could contain any more entropy than my newly modified data. I think the power of 2 dependence indicates something related to the algorithms.
Update: I've done another test: I switched the third line back to (i*i*i)%255 but left the others at %256. ppmz compression ratio rose a tiny bit to 94.65% and RAR yielded a 30.9% ratio. So it appears as though they can handle the linearly increasing sequences just fine, even when they are out of sync, but there is something quite strange going on where arithmetic mod 2^8 is a hell of a lot more friendly to our compression algorithms than other values.
Well, first of all, computers like powers of two. :)
Most such compression algorithms use compression blocks that are typically aligned to large powers of two. When your cycle aligns perfectly with these blocks, there is only one "unique sequence" to compress. If your data is not aligned, your sequence will shift a little across each block and the algorithm may not be able to recognize it as one "sequence".
EDIT: (updated from comments)
The second reason is that there's an integer overflow on i*i*i. The result is a double modulus: one over 2^32 and then one over 255. This double modulus greatly increases the length of the cycle, making it close to random and difficult for the compression algorithm to find the "pattern". (With %256 the wraparound is harmless, because 256 divides 2^32, so reducing mod 2^32 first does not change the result mod 256.)
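To see the effect concretely, here is a small check of my own (it emulates the 32-bit wraparound with uint32_t rather than relying on signed overflow, which is undefined behaviour): the exact and wrapped cubes start disagreeing mod 255 as soon as i*i*i exceeds 2^32, but they always agree mod 256.
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 1625^3 still fits in 32 bits, 1626^3 no longer does. */
    for (unsigned long long i = 1620; i <= 1630; ++i) {
        unsigned long long exact   = i * i * i;             /* true cube      */
        uint32_t           wrapped = (uint32_t)(i * i * i); /* cube mod 2^32  */
        printf("i=%llu  mod255: exact=%llu wrapped=%u  mod256: exact=%llu wrapped=%u\n",
               i, exact % 255, wrapped % 255u, exact % 256, wrapped % 256u);
    }
    return 0;
}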
Mystical has a big part of the answer, but it also pays to look at the mathematical properties of the data itself, especially the blue channel.
(i * i * i) % 255 repeats with a period of 255, taking on 255 distinct values all equally often. A naïve coder (ignoring the pattern between different pixels, or between the R and B pixels) would need 7.99 bits/pixel to code the blue channel.
(i * i * i) % 256 is 0 whenever i is a multiple of 8 (8 cubed is 512, which is of course 0 mod 256);
It's 64 whenever i is 4 more than a multiple of 16;
It's 192 whenever i is 4 less than a multiple of 16 (together these cover all multiples of 4);
It's one of 16 different values whenever i is an even non-multiple of 4, depending on i's residue mod 64.
It takes on one of 128 distinct values whenever i is odd.
This makes for only 147 different possibilities for the blue pixel, with some occurring much more often than others, and a naïve entropy of 6.375 bits/pixel for the blue channel.
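Those counts are easy to verify by brute force over one period of i (my own check; compile with -lm):
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Tally (i*i*i) % 256 over one full period of i to check the claim:
       147 distinct values, about 6.375 bits/pixel of entropy. */
    unsigned counts[256] = {0};
    for (uint32_t i = 0; i < 256; ++i)
        counts[(i * i * i) & 0xFF]++;

    int distinct = 0;
    double entropy = 0.0;
    for (int v = 0; v < 256; ++v) {
        if (counts[v] == 0) continue;
        distinct++;
        double p = counts[v] / 256.0;
        entropy -= p * log2(p);
    }
    printf("distinct values: %d, entropy: %.3f bits/pixel\n", distinct, entropy);
    return 0;
}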
