Define uint8_t variable in Protocol Buffers message file - protocol-buffers

I want to define a Point message in Protocol Buffers which represents an RGB Colored Point in 3-dimensional space.
message Point {
float x = 1;
float y = 2;
float z = 3;
uint8_t r = 4;
uint8_t g = 5;
uint8_t b = 6;
}
Here, x, y, z variables defines the position of Point and r, g, b defines the color in RGB space.
Since uint8_t is not defined in Protocol Buffers, I am looking for a workaround to define it. At present, I am using uint32 in place of uint8_t.

There isn't anything in protobuf that represents a single byte - it simply isn't a thing that the wire-format worries about. The options are:
varint (up to 64 bits input, up to 10 bytes on the wire depending on the highest set bit)
fixed 32 bit
fixed 64 bit
length-prefixed (strings, sub-objects, packed arrays)
(group tokens; a rare implementation detail)
A single byte isn't a good fit for any of those. Frankly, I'd use a single fixed32 for all 3, and combine/decompose the 3 bytes manually (via shifting etc). The advantage here is that it would only have one field header for the 3 bytes, and wouldn't be artificially stretched via having high bits (I'm not sure that a composed RGB value is a good candidate for varint). You'd also have a spare byte if you want to add something else at a later date (alpha, maybe).
So:
message Point {
float x = 1;
float y = 2;
float z = 3;
fixed32 rgb = 4;
}

IMHO this is the correct approach. You should use the nearest data type capable of holding all values to be sent between the system. The source & destination systems should validate the data if it is in the correct range. For uint8_t this is int32 indeed.

Some protocol buffers implementations actually allow this. In particular, nanopb allows to either have .options file alongside the .proto file or use its extension directly in .proto file to fine tune interpretation of individual fields.
Specifying int_size = IS_8 will convert uint32 from message to uint8_t in generated structure.
import "nanopb.proto";
message Point {
float x = 1;
float y = 2;
float z = 3;
uint32 r = 4 [(nanopb).int_size = IS_8];
uint32 g = 5 [(nanopb).int_size = IS_8];
uint32 b = 6 [(nanopb).int_size = IS_8];
}

Related

Fast random/mutation algorithms (vector to vector) [duplicate]

I've been trying to create a generalized Gradient Noise generator (which doesn't use the hash method to get gradients). The code is below:
class GradientNoise {
std::uint64_t m_seed;
std::uniform_int_distribution<std::uint8_t> distribution;
const std::array<glm::vec2, 4> vector_choice = {glm::vec2(1.0, 1.0), glm::vec2(-1.0, 1.0), glm::vec2(1.0, -1.0),
glm::vec2(-1.0, -1.0)};
public:
GradientNoise(uint64_t seed) {
m_seed = seed;
distribution = std::uniform_int_distribution<std::uint8_t>(0, 3);
}
// 0 -> 1
// just passes the value through, origionally was perlin noise activation
double nonLinearActivationFunction(double value) {
//return value * value * value * (value * (value * 6.0 - 15.0) + 10.0);
return value;
}
// 0 -> 1
//cosine interpolation
double interpolate(double a, double b, double t) {
double mu2 = (1 - cos(t * M_PI)) / 2;
return (a * (1 - mu2) + b * mu2);
}
double noise(double x, double y) {
std::mt19937_64 rng;
//first get the bottom left corner associated
// with these coordinates
int corner_x = std::floor(x);
int corner_y = std::floor(y);
// then get the respective distance from that corner
double dist_x = x - corner_x;
double dist_y = y - corner_y;
double corner_0_contrib; // bottom left
double corner_1_contrib; // top left
double corner_2_contrib; // top right
double corner_3_contrib; // bottom right
std::uint64_t s1 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y) + m_seed);
std::uint64_t s2 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y + 1) + m_seed);
std::uint64_t s3 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y + 1) + m_seed);
std::uint64_t s4 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y) + m_seed);
// each xy pair turns into distance vector from respective corner, corner zero is our starting corner (bottom
// left)
rng.seed(s1);
corner_0_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y});
rng.seed(s2);
corner_1_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y - 1});
rng.seed(s3);
corner_2_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y - 1});
rng.seed(s4);
corner_3_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y});
double u = nonLinearActivationFunction(dist_x);
double v = nonLinearActivationFunction(dist_y);
double x_bottom = interpolate(corner_0_contrib, corner_3_contrib, u);
double x_top = interpolate(corner_1_contrib, corner_2_contrib, u);
double total_xy = interpolate(x_bottom, x_top, v);
return total_xy;
}
};
I then generate an OpenGL texture to display with like this:
int width = 1024;
int height = 1024;
unsigned char *temp_texture = new unsigned char[width*height * 4];
double octaves[5] = {2,4,8,16,32};
for( int i = 0; i < height; i++){
for(int j = 0; j < width; j++){
double d_noise = 0;
d_noise += temp_1.noise(j/octaves[0], i/octaves[0]);
d_noise += temp_1.noise(j/octaves[1], i/octaves[1]);
d_noise += temp_1.noise(j/octaves[2], i/octaves[2]);
d_noise += temp_1.noise(j/octaves[3], i/octaves[3]);
d_noise += temp_1.noise(j/octaves[4], i/octaves[4]);
d_noise/=5;
uint8_t noise = static_cast<uint8_t>(((d_noise * 128.0) + 128.0));
temp_texture[j*4 + (i * width * 4) + 0] = (noise);
temp_texture[j*4 + (i * width * 4) + 1] = (noise);
temp_texture[j*4 + (i * width * 4) + 2] = (noise);
temp_texture[j*4 + (i * width * 4) + 3] = (255);
}
}
Which give good results:
But gprof is telling me that the Mersenne twister is taking up 62.4% of my time and growing with larger textures. Nothing else individual takes any where near as much time. While the Mersenne twister is fast after initialization, the fact that I initialize it every time I use it seems to make it pretty slow.
This initialization is 100% required for this to make sure that the same x and y generates the same gradient at each integer point (so you need either a hash function or seed the RNG each time).
I attempted to change the PRNG to both the linear congruential generator and Xorshiftplus, and while both ran orders of magnitude faster, they gave odd results:
LCG (one time, then running 5 times before using)
Xorshiftplus
After one iteration
After 10,000 iterations.
I've tried:
Running the generator several times before utilizing output, this results in slow execution or simply different artifacts.
Using the output of two consecutive runs after initial seed to seed the PRNG again and use the value after wards. No difference in result.
What is happening? What can i do to get faster results that are of the same quality as the mersenne twister?
OK BIG UPDATE:
I don't know why this works, I know it has something to do with the prime number utilized, but after messing around a bit, it appears that the following works:
Step 1, incorporate the x and y values as seeds separately (and incorporate some other offset value or additional seed value with them, this number should be a prime/non trivial factor)
Step 2, Use those two seed results into seeding the generator again back into the function (so like geza said, the seeds made were bad)
Step 3, when getting the result, instead of using modulo number of items (4) trying to get, or & 3, modulo the result by a prime number first then apply & 3. I'm not sure if the prime being a mersenne prime matters or not.
Here is the result with prime = 257 and xorshiftplus being used! (note I used 2048 by 2048 for this one, the others were 256 by 256)
LCG is known to be inadequate for your purpose.
Xorshift128+'s results are bad, because it needs good seeding. And providing good seeding defeats the whole purpose of using it. I don't recommend this.
However, I recommend using an integer hash. For example, one from Bob's page.
Here's a result of the first hash of that page, it looks OK to me, and it is fast (I think it is much faster than Mersenne Twister):
Here's the code I've written to generate this:
#include <cmath>
#include <stdio.h>
unsigned int hash(unsigned int a) {
a = (a ^ 61) ^ (a >> 16);
a = a + (a << 3);
a = a ^ (a >> 4);
a = a * 0x27d4eb2d;
a = a ^ (a >> 15);
return a;
}
unsigned int ivalue(int x, int y) {
return hash(y<<16|x)&0xff;
}
float smooth(float x) {
return 6*x*x*x*x*x - 15*x*x*x*x + 10*x*x*x;
}
float value(float x, float y) {
int ix = floor(x);
int iy = floor(y);
float fx = smooth(x-ix);
float fy = smooth(y-iy);
int v00 = ivalue(iy+0, ix+0);
int v01 = ivalue(iy+0, ix+1);
int v10 = ivalue(iy+1, ix+0);
int v11 = ivalue(iy+1, ix+1);
float v0 = v00*(1-fx) + v01*fx;
float v1 = v10*(1-fx) + v11*fx;
return v0*(1-fy) + v1*fy;
}
unsigned char pic[1024*1024];
int main() {
for (int y=0; y<1024; y++) {
for (int x=0; x<1024; x++) {
float v = 0;
for (int o=0; o<=9; o++) {
v += value(x/64.0f*(1<<o), y/64.0f*(1<<o))/(1<<o);
}
int r = rint(v*0.5f);
pic[y*1024+x] = r;
}
}
FILE *f = fopen("x.pnm", "wb");
fprintf(f, "P5\n1024 1024\n255\n");
fwrite(pic, 1, 1024*1024, f);
fclose(f);
}
If you want to understand, how a hash function work (or better yet, which properties a good hash have), check out Bob's page, for example this.
You (unknowingly?) implemented a visualization of PRNG non-random patterns. That looks very cool!
Except Mersenne Twister, all your tested PRNGs do not seem fit for your purpose. As I have not done further tests myself, I can only suggest to try out and measure further PRNGs.
The randomness of LCGs are known to be sensitive to the choice of their parameters. In particular, the period of a LCG is relative to the m parameter - at most it will be m (your prime factor) & for many values it can be less.
Similarly, the careful parameters selection is required to get a long period from Xorshift PRNGs.
You've noted that some PRNGs give good procedural generation results while other do not. In order to isolate the cause, I would factor out the proc gen stuff & examine the PRNG output directly. An easy way to visualize the data is to build a grey scale image where each pixel value is a (possibly scaled) random value. For image based stuff, I find this to be an easy way to find stuff that may lead to visual artifacts. Any artifacts you see with this are likely to cause issues with your proc gen output.
Another option is to try something like the Diehard tests. If the aforementioned image test failed to reveal any problems, I might use this just to be sure my PRNG techniques were trustworthy.
Note that your code seeds the PRNG, then generates one pseudorandom number from the PRNG. The reason for the nonrandomness in xorshift128+ that you discovered is that xorshift128+ simply adds the two halves of the seed (and uses the result mod 264 as the generated number) before changing its state (review its source code). This makes that PRNG considerably different from a hash function.
What you see is the practical demonstration of quality of PRNG. Mersenne Twister is one of the best PRNGs with good performance, it passes DIEHARD tests. One should know that generating a random numbers is not an easy computational task, so looking for a better performance will inevitably result in poor quality. LCG is known to be simplest and worst PRNG ever designed and it clearly shows two-dimensional correlation as in your picture. The quality of Xorshift generators largely depend on bitness and parameters. They are definitely worse than Mersenne Twister, but some (xorshift128+) may work good enough to pass BigCrush battery of TestU01 tests.
In other words, if you are making an important physical modelling numerical experiment, you better continue to use Mersenne Twister as known to be a good trade-off between speed and quality and it comes in many standard libraries. On a less important case you may try to use xorshift128+ generator. For an ultimate results you need to use cryptographical-quality PRNG (none of mentioned here may be used for cryptographical purposes).

Improving an OpenCL kernel for a Perceptron neural network

I've been doing a lot of OpenGL and shaders before, and now, I decided to give a try to OpenCL. I watched some online tutorials, and started reading books on the subject. In order to better understand, and because I believe that the best way to learn is by intelligently trying and learning from the issues that arose while doing so, I decided to start implementing a kernel for a fully-connected perceptron.
For those who don't know what that is, I'll explain the basic idea. It is a neural network in which each neuron of a layer is connected to every neurons of the next layer. Each neuron has but one action to perform: performing the sum of all the neurons from the previous layer, weighted by a different value for each neuron.
This seemed simple enough to implement, and after reading the paper "Parallel Neural Network Training with OpenCL" I implemented it in the following way
Each layer being dependent on the previous one, they're being run sequentially by the host
For computing a layer, I run my kernel with a global work size of the number of neurons within the layer (which can be quite huge, tens of thousand for instance). That makes it so that all the neurons are performing its sum independently to one another.
Each neuron (identified by its global_work_id) performs the weighted sum with all the neurons from the previous layer.
Here is my fully functional opencl kernel:
/**
* #brief Computes one layer of the perceptron given the previous one and the
* weights
* The kernel is run once for each layer.
* The work items are each tasked with computing the output of a single neuron
* of the out layer.
*
* #param out_layer_size
* Size of the output layer (number of elements in the output array that will
* contain the result for each neuron).
* #param in_layer_size
* Number of elements of the input layer
* #param in_value
* Values of the neuron in the previous layer
* #param in_weights
* Array containing the weights for each input neuron. It is organised as a
* two dimensional matrix, written by concatenating each line in the array
* [ w11, w12, w13, ...
* w21, w22, w23, ...
* ..., ..., ..., ...
* ]
* Where wij is the weight linking the neuron i of the input layer to the
* neuron j of the output layer
* #param out_values
* Computed values for the current layer
*/
void kernel perceptron(global const int* in_layer_size, global const int* out_layer_size, global const float *in_value, global const float* in_weights, global float* out_values)
{
private const int global_id = get_global_id(0);
private const int out_layer_s = *out_layer_size;
private const int in_layer_s = *in_layer_size;
private const int offset = out_layer_s * global_id;
private float sum = 0.;
for(int i=0; i < in_layer_s; i++) {
sum += in_weights[i*out_layer_s+global_id] * in_value[i];
}
//out_values[global_id] = sigma(sum);
out_values[global_id] = sum;
}
And here is how I invoke it:
queue.enqueueNDRangeKernel(kernel, cl::NullRange,cl::NDRange(number of neurons within layer),cl::NullRange);
I realize that the bottleneck of this kernel is the implementation of the weighted sum. It would be really helpful if someone could explain how I could improve upon this to make it faster.
I probably don't make proper use of the different memory regions, I'm thinking essentially of the local memory that I don't even use.
Just to give you an idea of performance (that is on an Nvidia GTX 660M), I'll show you some of the times I achieved. Each value is the number of neurons per layer:
2500, 10 000, 2500 : 0.018s ~ 60FPS. It's about 4 to 5 times faster than on my processor (Intel Core i7 running at 2.40GHz)
100 000, 100 000, 500: 140s -> which I guess isn't surpsising since each neuron in the second layer has to perform the weighted sum of 100 000 elements. Running this on my processor yields about the same results.
As you told, bottleneck is the weighted summ. That's not hard to be, as at each layer every WI (Work Item) is doing a lot of IO operations in comparison to number of arithmetic operations. I have no experience in neural networks, but for me problem looks like poor memory access pattern on GPU.
Potentially, that can be solved by organizing your WI into local WGs (Work Groups). As every WI needs to process all data from prev. layer, I guess that all WI in WG can load some amount of data into local memory, process them and than to next bunch of data. This will make your algorithm much more cache friendly. Pseudo-code of kernel looks like:
void kernel Kernel(
__global const int in_layer_size,
__global const int out_layer_size,
__global const float *in_value,
__global const float *in_weights,
__global float *out_values){
__local float buffer[SOME_SIZE];
__global const float* p_in = in_value;
__global float* p_out = out_values;
const int
global_id = get_global_id(0),
local_id = get_local_id(0),
num_buffers = in_layer_size / SOME_SIZE,
offset = out_layer_size * global_id;
float sum = 0.0f;
for(int i=0; i < num_buffers; i++){
buffer[local_id] = p_in[local_id];
barrier(CLK_LOCAL_MEM_FENCE);
//Process all data inside buffer by every WI in WG
//...
p_in += SOME_SIZE;
out_values += SOME_SIZE;
}
//...
return;
}
So, you're sliding with the window of fixed size & calculating data within & then going to next window. Al data operations are done independently, Work Items are only using same data at same time. Optimal size of local group is Device- and Kernel- dependent.
You can do it in many ways.
But the most generic way, without changing how your kernel behaves is to do it is reusing your workgroup size (whatever you selected, or default) and reuse the memory accesses from the group.
I would suggest something like this:
NOTE: I removed thouse ugly pointers for single values. OpenCL supports this, and it is much easier. There is no need to create a memory zone, just do clSetKernelArg(kernel, arg_index, sizeof(cl_float), &size); Where cl_float size = the_size;.
#define IN_LOCAL_SIZE 4096 //Because 16KB/4B (for each float)
void kernel perceptron(global const int in_layer_size, global const int out_layer_size, global const float *in_value, global const float* in_weights, global float* out_values)
{
const int global_id = get_global_id(0);
__local float in_buffer[IN_LOCAL_SIZE];
float sum = 0.0f;
event_t ev;
int j;
//For each full buffer
for(j=0; j < (in_layer_size/IN_LOCAL_SIZE)-1; i++) {
ev = async_work_group_copy(in_buffer, in_value+j*IN_LOCAL_SIZE, IN_LOCAL_SIZE, ev);
wait_group_events(1,&ev);
barrier(CLK_LOCAL_MEM_FENCE);
for(int i=0; i < IN_LOCAL_SIZE; i++) {
sum += in_weights[(i+j*IN_LOCAL_SIZE)*out_layer_size+global_id] * in_buffer[i];
}
}
//Last one
ev = async_work_group_copy(in_buffer, in_value+j*IN_LOCAL_SIZE, in_layer_size%IN_LOCAL_SIZE, ev);
wait_group_events(1,&ev);
barrier(CLK_LOCAL_MEM_FENCE);
for(int i=0; i < in_layer_size%IN_LOCAL_SIZE; i++) {
sum += in_weights[(i+j*IN_LOCAL_SIZE)*out_layer_size+global_id] * in_buffer[i];
}
out_values[global_id] = sum;
}
However, if the output size is small (100k, 250k, 500), then you will have just 500 work items, which is not optimal. In that case you should reshape the algorithm.
One possible way to do it is that each workitem works in the inner layer, performing sums, and the whole work group creates one output out of all the work items. That would be easy, since you can control the sums inside the workgroup easily.
But maybe other approaches fit better your problem.
You can make large improvements by caching in_values in local memory. The fewer times you have to read each element of in_values from global memory, the better.
I have come up with a solution that caches the maximum number of input values, and reads each element from global memory only once per work group. This is done by copying a block of in_values at a time, processing it against all out_values, and moving on to the next block. There is also a local array of floats used to reduce the work items' sums of each block.
pseudocode:
output elements assumed to be set to 0 already
for each block of input values:
cache the input block
for each target output value:
reset local sum to 0
for each element this work item is responsible for:
read the weight, multiply, and add to sum
reduce sums to a single value, ADD value to output element
I haven't had a chance to run this through a profiler or debugger yet, but I will give it a try when I am back at my home PC. (no opencl tools at my office workstation). Make sure to queue kernel with group size equal to the GROUP_SIZE constant. Also, only create a single group per compute unit on your device.
real code:
//experiment with GROUP_SIZE to discover the optimal value for your device
//this needs to be equal to local_work_size passed into clEnqueueNDRangeKernel
//use a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
//max. for most devices is 256
#define GROUP_SIZE = 64;
// IN_VALUE_CACHE_SIZE is the number of floats from in_value to copy to local memory at a time
//assuming GROUP_SIZE can be up to 256, sizeof(float)=4, and local memory size is 32kb, full saturation can be achieved with the following:
//(32768 - (256 * 4)) /4 = 7936
//try another multiple of 1024 (6144, 4096... )if there is trouble with this value
#define IN_VALUE_CACHE_SIZE = 7936;
void kernel perceptron(global const int* in_layer_size, global const int* out_layer_size, global const float *in_value, global const float* in_weights, global float* out_values)
{
private const int global_id = get_global_id(0);
private const int out_layer_s = *out_layer_size;
private const int in_layer_s = *in_layer_size;
private const int offset = out_layer_s * global_id;
private const int item_id = get_local_id(0);
private const int group_id = get_group_id(0);
private const int group_count = get_num_groups(0);
local float result_buffer[GROUP_SIZE];
local float in_value_cache[IN_VALUE_CACHE_SIZE];
int i,j,k;
//init the block to 0, in case there are fewer than IN_VALUE_CACHE_SIZE values in total
for(i=item_id; i<IN_VALUE_CACHE_SIZE; i+= GROUP_SIZE){
in_value_cache[i] = 0.0;
}
barrier(CL_LOCAL_MEM_FENCE);
private float sum = 0.0;
event_t e;
int copy_total = 0;
int copy_offset;
for(i=0; i<in_layer_s; i+=IN_VALUE_CACHE_SIZE){
//cap the number of values to copy to local memory if loop is near the end of the input data
copy_total = IN_VALUE_CACHE_SIZE;
if((copy_total + i*IN_VALUE_CACHE_SIZE) > in_layer_s){
copy_total = in_layer_s - i*IN_VALUE_CACHE_SIZE;
}
//copy the next block of values
e = async_work_group_copy(in_value_cache, in_value + i * 4, copy_total, 0);
wait_group_events(1, &e);
for(j=group_id; j<out_layer_s; j+=group_count){
sum = 0.0;
//need to reset result_buffer[item_id] as well
//this is in case there are fewer than GROUP_SIZE input values remaining ie copy_total < GROUP_SIZE
result_buffer[item_id] = 0.0;
for(k=item_id; k<copy_total; k+=GROUP_SIZE){
sum += in_value_cache[k] * in_weights[(k+i) + j * out_layer_s];
}
result_buffer[item_id] = sum;
//simple O(n) reduction can be optimized further
if(item_id == 0){
for(k=1;k<GROUP_SIZE;k++){
sum += result_buffer[k];
}
out_values[j] += sum;
}
barrier(CL_LOCAL_MEM_FENCE);
}
}
}
This will handle input of any size, so you can try it with as many elements as you have global memory for.

Processing-2 PImage get() returns what, exactly?

So, I know that PImage get(x,y) function returns the color of a pixel at a given coordinate, but can anyone tell me what those numbers mean? They are not 9 digits corresponding to RGB, unless I am missing something.
How can I work with the int numbers returned from a get(x,y) function? I have an image which is a map of data values in raster format. Data are colored into different classes. I want to convert the number I receive from get(x,y) to the actual value within the class.
It would help to know what in the world the return value is...
That number is the colour representation as an integer in RBGA, which is explained all over google. Here's a sample page. If the hexadecimal representation of your colour makes more sense for you, in Processing you just go
hex(img.get(x,y))
and you get the same number in a more comprehensible form i.e FF618790
You can also extract the values of the colours themselves doing it either the easy way:
int a = alpha(img.get(x,y));
int r = red(img.get(x,y));
int g = green(img.get(x,y));
int b = blue(img.get(x,y));
or the hard way:
int a = (argb >> 24) & 0xFF;
int r = (argb >> 16) & 0xFF;
int g = (argb >> 8) & 0xFF;
int b = argb & 0xFF;

How to reduce the number of colors in an image with OpenCV?

I have a set of image files, and I want to reduce the number of colors of them to 64. How can I do this with OpenCV?
I need this so I can work with a 64-sized image histogram.
I'm implementing CBIR techniques
What I want is color quantization to a 4-bit palette.
This subject was well covered on OpenCV 2 Computer Vision Application Programming Cookbook:
Chapter 2 shows a few reduction operations, one of them demonstrated here in C++ and later in Python:
#include <iostream>
#include <vector>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
void colorReduce(cv::Mat& image, int div=64)
{
int nl = image.rows; // number of lines
int nc = image.cols * image.channels(); // number of elements per line
for (int j = 0; j < nl; j++)
{
// get the address of row j
uchar* data = image.ptr<uchar>(j);
for (int i = 0; i < nc; i++)
{
// process each pixel
data[i] = data[i] / div * div + div / 2;
}
}
}
int main(int argc, char* argv[])
{
// Load input image (colored, 3-channel, BGR)
cv::Mat input = cv::imread(argv[1]);
if (input.empty())
{
std::cout << "!!! Failed imread()" << std::endl;
return -1;
}
colorReduce(input);
cv::imshow("Color Reduction", input);
cv::imwrite("output.jpg", input);
cv::waitKey(0);
return 0;
}
Below you can find the input image (left) and the output of this operation (right):
The equivalent code in Python would be the following:
(credits to #eliezer-bernart)
import cv2
import numpy as np
input = cv2.imread('castle.jpg')
# colorReduce()
div = 64
quantized = input // div * div + div // 2
cv2.imwrite('output.jpg', quantized)
You might consider K-means, yet in this case it will most likely be extremely slow. A better approach might be doing this "manually" on your own. Let's say you have image of type CV_8UC3, i.e. an image where each pixel is represented by 3 RGB values from 0 to 255 (Vec3b). You might "map" these 256 values to only 4 specific values, which would yield 4 x 4 x 4 = 64 possible colors.
I've had a dataset, where I needed to make sure that dark = black, light = white and reduce the amount of colors of everything between. This is what I did (C++):
inline uchar reduceVal(const uchar val)
{
if (val < 64) return 0;
if (val < 128) return 64;
return 255;
}
void processColors(Mat& img)
{
uchar* pixelPtr = img.data;
for (int i = 0; i < img.rows; i++)
{
for (int j = 0; j < img.cols; j++)
{
const int pi = i*img.cols*3 + j*3;
pixelPtr[pi + 0] = reduceVal(pixelPtr[pi + 0]); // B
pixelPtr[pi + 1] = reduceVal(pixelPtr[pi + 1]); // G
pixelPtr[pi + 2] = reduceVal(pixelPtr[pi + 2]); // R
}
}
}
causing [0,64) to become 0, [64,128) -> 64 and [128,255) -> 255, yielding 27 colors:
To me this seems to be neat, perfectly clear and faster than anything else mentioned in other answers.
You might also consider reducing these values to one of the multiples of some number, let's say:
inline uchar reduceVal(const uchar val)
{
if (val < 192) return uchar(val / 64.0 + 0.5) * 64;
return 255;
}
which would yield a set of 5 possible values: {0, 64, 128, 192, 255}, i.e. 125 colors.
There are many ways to do it. The methods suggested by jeff7 are OK, but some drawbacks are:
method 1 have parameters N and M, that you must choose, and you must also convert it to another colorspace.
method 2 answered can be very slow, since you should compute a 16.7 Milion bins histogram and sort it by frequency (to obtain the 64 higher frequency values)
I like to use an algorithm based on the Most Significant Bits to use in a RGB color and convert it to a 64 color image. If you're using C/OpenCV, you can use something like the function below.
If you're working with gray-level images I recommed to use the LUT() function of the OpenCV 2.3, since it is faster. There is a tutorial on how to use LUT to reduce the number of colors. See: Tutorial: How to scan images, lookup tables... However I find it more complicated if you're working with RGB images.
void reduceTo64Colors(IplImage *img, IplImage *img_quant) {
int i,j;
int height = img->height;
int width = img->width;
int step = img->widthStep;
uchar *data = (uchar *)img->imageData;
int step2 = img_quant->widthStep;
uchar *data2 = (uchar *)img_quant->imageData;
for (i = 0; i < height ; i++) {
for (j = 0; j < width; j++) {
// operator XXXXXXXX & 11000000 equivalent to XXXXXXXX AND 11000000 (=192)
// operator 01000000 >> 2 is a 2-bit shift to the right = 00010000
uchar C1 = (data[i*step+j*3+0] & 192)>>2;
uchar C2 = (data[i*step+j*3+1] & 192)>>4;
uchar C3 = (data[i*step+j*3+2] & 192)>>6;
data2[i*step2+j] = C1 | C2 | C3; // merges the 2 MSB of each channel
}
}
}
Here's a Python implementation of color quantization using K-Means Clustering with cv2.kmeans. The idea is to reduce the number of distinct colors in an image while preserving the color appearance of the image as much as possible. Here's the result:
Input -> Output
Code
import cv2
import numpy as np
def kmeans_color_quantization(image, clusters=8, rounds=1):
h, w = image.shape[:2]
samples = np.zeros([h*w,3], dtype=np.float32)
count = 0
for x in range(h):
for y in range(w):
samples[count] = image[x][y]
count += 1
compactness, labels, centers = cv2.kmeans(samples,
clusters,
None,
(cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10000, 0.0001),
rounds,
cv2.KMEANS_RANDOM_CENTERS)
centers = np.uint8(centers)
res = centers[labels.flatten()]
return res.reshape((image.shape))
image = cv2.imread('1.jpg')
result = kmeans_color_quantization(image, clusters=8)
cv2.imshow('result', result)
cv2.waitKey()
The answers suggested here are really good. I thought I would add my idea as well. I follow the formulation of many comments here, in which it is said that 64 colors can be represented by 2 bits of each channel in an RGB image.
The function in code below takes as input an image and the number of bits required for quantization. It uses bit manipulation to 'drop' the LSB bits and keep only the required number of bits. The result is a flexible method that can quantize the image to any number of bits.
#include "include\opencv\cv.h"
#include "include\opencv\highgui.h"
// quantize the image to numBits
cv::Mat quantizeImage(const cv::Mat& inImage, int numBits)
{
cv::Mat retImage = inImage.clone();
uchar maskBit = 0xFF;
// keep numBits as 1 and (8 - numBits) would be all 0 towards the right
maskBit = maskBit << (8 - numBits);
for(int j = 0; j < retImage.rows; j++)
for(int i = 0; i < retImage.cols; i++)
{
cv::Vec3b valVec = retImage.at<cv::Vec3b>(j, i);
valVec[0] = valVec[0] & maskBit;
valVec[1] = valVec[1] & maskBit;
valVec[2] = valVec[2] & maskBit;
retImage.at<cv::Vec3b>(j, i) = valVec;
}
return retImage;
}
int main ()
{
cv::Mat inImage;
inImage = cv::imread("testImage.jpg");
char buffer[30];
for(int i = 1; i <= 8; i++)
{
cv::Mat quantizedImage = quantizeImage(inImage, i);
sprintf(buffer, "%d Bit Image", i);
cv::imshow(buffer, quantizedImage);
sprintf(buffer, "%d Bit Image.png", i);
cv::imwrite(buffer, quantizedImage);
}
cv::waitKey(0);
return 0;
}
Here is an image that is used in the above function call:
Image quantized to 2 bits for each RGB channel (Total 64 Colors):
3 bits for each channel:
4 bits ...
There is the K-means clustering algorithm which is already available in the OpenCV library. In short it determines which are the best centroids around which to cluster your data for a user-defined value of k ( = no of clusters). So in your case you could find the centroids around which to cluster your pixel values for a given value of k=64. The details are there if you google around. Here's a short intro to k-means.
Something similar to what you are probably trying was asked here on SO using k-means, hope it helps.
Another approach would be to use the pyramid mean shift filter function in OpenCV. It yields somewhat "flattened" images, i.e. the number of colors are less so it might be able to help you.
If you want a quick and dirty method in C++, in 1 line:
capImage &= cv::Scalar(0b11000000, 0b11000000, 0b11000000);
So, what it does is keep the upper 2 bits of each R, G, B component, and discards the lower 6 bits, hence the 0b11000000.
Because of the 3 channels in RGB, you get maximum 4 R x 4 B x 4 B = max 64 colors. The advantage of doing this is that you can run this on any number of images and the same colors will be mapped.
Note that this can make your image a bit darker since it discards some bits.
For a greyscale image, you can do:
capImage &= 0b11111100;
This will keep the upper 6 bits, which means you get 64 grays out of 256, and again the image can become a bit darker.
Here's an example, original image = 251424 unique colors.
And the resulting image has 46 colors:
Assuming that you want to use the same 64 colors for all images (ie palette not optimized per image), there are a at least a couple choices I can think of:
1) Convert to Lab or YCrCb colorspace and quantize using N bits for luminance and M bits for each color channel, N should be greater than M.
2) Compute a 3D histogram of color values over all your training images, then choose the 64 colors with the largest bin values. Quantize your images by assigning each pixel the color of the closest bin from the training set.
Method 1 is the most generic and easiest to implement, while method 2 can be better tailored to your specific dataset.
Update:
For example, 32 colors is 5 bits so assign 3 bits to the luminance channel and 1 bits to each color channel. To do this quantization, do integer division of the luminance channel by 2^8/2^3 = 32 and each color channel by 2^8/2^1 = 128. Now there are only 8 different luminance values and 2 different color channels each. Recombine these values into a single integer doing bit shifting or math (quantized color value = luminance*4+color1*2+color2);
A simple bitwise and with a proper bitmask would do the trick.
python, for 64 colors,
img = img & int("11000000", 2)
The number of colors for an RGB image should be a perfect cube (same across 3 channels).
For this method, the number of possible values for a channel should be a power of 2. (This check is ignored by the code and the next lower power of 2 is taken by it)
import numpy as np
import cv2 as cv
def is_cube(n):
cbrt = np.cbrt(n)
return cbrt ** 3 == n, int(cbrt)
def reduce_color_space(img, n_colors=64):
n_valid, cbrt = is_cube(n_colors)
if not n_valid:
print("n_colors should be a perfect cube")
return
n_bits = int(np.log2(cbrt))
if n_bits > 8:
print("Can't generate more colors")
return
bitmask = int(f"{'1' * n_bits}{'0' * (8 - n_bits)}", 2)
return img & bitmask
img = cv.imread("image.png")
cv.imshow("orig", img)
cv.imshow("reduced", reduce_color_space(img))
cv.waitKey(0)
img = numpy.multiply(img//32, 32)
Why don't you just do Matrix multiplication/division? Values will be automatically rounded.
Pseudocode:
convert your channels to unsigned characters (CV_8UC3),
Divide by
total colors / desired colors. Mat = Mat / (256/64). Decimal points
will be truncated.
Multiply by the same number. Mat = mat * 4
Done. Each channel now only contains 64 colors.

Fast block placement algorithm, advice needed?

I need to emulate the window placement strategy of the Fluxbox window manager.
As a rough guide, visualize randomly sized windows filling up the screen one at a time, where the rough size of each results in an average of 80 windows on screen without any window overlapping another.
If you have Fluxbox and Xterm installed on your system, you can try the xwinmidiarptoy BASH script to see a rough prototype of what I want happening. See the xwinmidiarptoy.txt notes I've written about it explaining what it does and how it should be used.
It is important to note that windows will close and the space that closed windows previously occupied becomes available once more for the placement of new windows.
The algorithm needs to be an Online Algorithm processing data "piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start."
The Fluxbox window placement strategy has three binary options which I want to emulate:
Windows build horizontal rows or vertical columns (potentially)
Windows are placed from left to right or right to left
Windows are placed from top to bottom or bottom to top
Differences between the target algorithm and a window-placement algorithm
The coordinate units are not pixels. The grid within which blocks will be placed will be 128 x 128 units. Furthermore, the placement area may be further shrunk by a boundary area placed within the grid.
Why is the algorithm a problem?
It needs to operate to the deadlines of a real time thread in an audio application.
At this moment I am only concerned with getting a fast algorithm, don't concern yourself over the implications of real time threads and all the hurdles in programming that that brings.
And although the algorithm will never ever place a window which overlaps another, the user will be able to place and move certain types of blocks, overlapping windows will exist. The data structure used for storing the windows and/or free space, needs to be able to handle this overlap.
So far I have two choices which I have built loose prototypes for:
1) A port of the Fluxbox placement algorithm into my code.
The problem with this is, the client (my program) gets kicked out of the audio server (JACK) when I try placing the worst case scenario of 256 blocks using the algorithm. This algorithm performs over 14000 full (linear) scans of the list of blocks already placed when placing the 256th window.
For a demonstration of this I created a program called text_boxer-0.0.2.tar.bz2 which takes a text file as input and arranges it within ASCII boxes. Issue make to build it. A little unfriendly, use --help (or any other invalid option) for a list of command line options. You must specify the text file by using the option.
2) My alternative approach.
Only partially implemented, this approach uses a data structure for each area of rectangular free unused space (the list of windows can be entirely separate, and is not required for testing of this algorithm). The data structure acts as a node in a doubly linked list (with sorted insertion), as well as containing the coordinates of the top-left corner, and the width and height.
Furthermore, each block data structure also contains four links which connect to each immediately adjacent (touching) block on each of the four sides.
IMPORTANT RULE: Each block may only touch with one block per side. This is a rule specific to the algorithm's way of storing free unused space and bears no factor in how many actual windows may touch each other.
The problem with this approach is, it's very complex. I have implemented the straightforward cases where 1) space is removed from one corner of a block, 2) splitting neighbouring blocks so that the IMPORTANT RULE is adhered to.
The less straightforward case, where the space to be removed can only be found within a column or row of boxes, is only partially implemented - if one of the blocks to be removed is an exact fit for width (ie column) or height (ie row) then problems occur. And don't even mention the fact this only checks columns one box wide, and rows one box tall.
I've implemented this algorithm in C - the language I am using for this project (I've not used C++ for a few years and am uncomfortable using it after having focused all my attention to C development, it's a hobby). The implementation is 700+ lines of code (including plenty of blank lines, brace lines, comments etc). The implementation only works for the horizontal-rows + left-right + top-bottom placement strategy.
So I've either got to add some way of making this +700 lines of code work for the other 7 placement strategy options, or I'm going to have to duplicate those +700 lines of code for the other seven options. Neither of these is attractive, the first, because the existing code is complex enough, the second, because of bloat.
The algorithm is not even at a stage where I can use it in the real time worst case scenario, because of missing functionality, so I still don't know if it actually performs better or worse than the first approach.
The current state of C implementation of this algorithm is freespace.c. I use gcc -O0 -ggdb freespace.c to build, and run it in an xterm sized to atleast 124 x 60 chars.
What else is there?
I've skimmed over and discounted:
Bin Packing algorithms: their
emphasis on optimal fit does not
match the requirements of this
algorithm.
Recursive Bisection Placement algorithms: sounds promising, but
these are for circuit design. Their
emphasis is optimal wire length.
Both of these, especially the latter, all elements to be placed/packs are known before the algorithm begins.
What are your thoughts on this? How would you approach it? What other algorithms should I look at? Or even what concepts should I research seeing as I've not studied computer science/software engineering?
Please ask questions in comments if further information is needed.
Further ideas developed since asking this question
Some combination of my "alternative algorithm" with a spatial hashmap for identifying if a large window to be placed would cover several blocks of free space.
I would consider some kind of spatial hashing structure. Imagine your entire free space is gridded coarsely, call them blocks. As windows come and go, they occupy certain sets of contiguous rectangular blocks. For each block, keep track of the largest unused rectangle incident to each corner, so you need to store 2*4 real numbers per block. For an empty block, the rectangles at each corner have size equal to the block. Thus, a block can only be "used up" at its corners, and so at most 4 windows can sit in any block.
Now each time you add a window, you have to search for a rectangular set of blocks for which the window will fit, and when you do, update the free corner sizes. You should size your blocks so that a handful (~4x4) of them fit into a typical window. For each window, keep track of which blocks it touches (you only need to keep track of extents), as well as which windows touch a given block (at most 4, in this algorithm). There is an obvious tradeoff between the granularity of the blocks and the amount of work per window insertion/removal.
When removing a window, loop over all blocks it touches, and for each block, recompute the free corner sizes (you know which windows touch it). This is fast since the inner loop is at most length 4.
I imagine a data structure like
struct block{
int free_x[4]; // 0 = top left, 1 = top right,
int free_y[4]; // 2 = bottom left, 3 = bottom right
int n_windows; // number of windows that occupy this block
int window_id[4]; // IDs of windows that occupy this block
};
block blocks[NX][NY];
struct window{
int id;
int used_block_x[2]; // 0 = first index of used block,
int used_block_y[2]; // 1 = last index of used block
};
Edit
Here is a picture:
It shows two example blocks. The colored dots indicate the corners of the block, and the arrows emanating from them indicate the extents of the largest-free-rectangle from that corner.
You mentioned in the edit that the grid on which your windows will be placed is already quite coarse (127x127), so the block sizes would probably be something like 4 grid cells on a side, which probably wouldn't gain you much. This method is suitable if your window corner coordinates can take on a lot of values (I was thinking they would be pixels), but not so much in your case. You can still try it, since it's simple. You would probably want to also keep a list of completely empty blocks so that if a window comes in that is larger than 2 block widths, then you look first in that list before looking for some suitable free space in the block grid.
After some false starts, I have eventually arrived here. Here is where the use of data structures for storing rectangular areas of free space have been abandoned. Instead, there is a 2d array with 128 x 128 elements to achieve the same result but with much less complexity.
The following function scans the array for an area width * height in size. The first position it finds it writes the top left coordinates of, to where resultx and resulty point to.
_Bool freespace_remove( freespace* fs,
int width, int height,
int* resultx, int* resulty)
{
int x = 0;
int y = 0;
const int rx = FSWIDTH - width;
const int by = FSHEIGHT - height;
*resultx = -1;
*resulty = -1;
char* buf[height];
for (y = 0; y < by; ++y)
{
x = 0;
char* scanx = fs->buf[y];
while (x < rx)
{
while(x < rx && *(scanx + x))
++x;
int w, h;
for (h = 0; h < height; ++h)
buf[h] = fs->buf[y + h] + x;
_Bool usable = true;
w = 0;
while (usable && w < width)
{
h = 0;
while (usable && h < height)
if (*(buf[h++] + w))
usable = false;
++w;
}
if (usable)
{
for (w = 0; w < width; ++w)
for (h = 0; h < height; ++h)
*(buf[h] + w) = 1;
*resultx = x;
*resulty = y;
return true;
}
x += w;
}
}
return false;
}
The 2d array is zero initialized. Any areas in the array where the space is used are set to 1. This structure and function will work independently from the actual list of windows that are occupying the areas marked with 1's.
The advantages of this method are its simplicity. It only uses one data structure - an array. The function is short, and should not be too difficult to adapt to handle the remaining placement options (here it only handles Row Smart + Left to Right + Top to Bottom).
My initial tests also look promising on the speed front. Though I don't think this would be suitable for a window manager placing windows on, for example, a 1600 x 1200 desktop with pixel accuracy, for my purposes I believe it is going to be much better than any of the previous methods I have tried.
Compilable test code here:
http://jwm-art.net/art/text/freespace_grid.c
(in Linux I use gcc -ggdb -O0 freespace_grid.c to compile)
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#define FSWIDTH 128
#define FSHEIGHT 128
#ifdef USE_64BIT_ARRAY
#define FSBUFBITS 64
#define FSBUFWIDTH 2
typedef uint64_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctzl(( v ))
#define LEADING_ONES( v ) __builtin_clzl(~( v ))
#else
#ifdef USE_32BIT_ARRAY
#define FSBUFBITS 32
#define FSBUFWIDTH 4
typedef uint32_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctz(( v ))
#define LEADING_ONES( v ) __builtin_clz(~( v ))
#else
#ifdef USE_16BIT_ARRAY
#define FSBUFBITS 16
#define FSBUFWIDTH 8
typedef uint16_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctz( 0xffff0000 | ( v ))
#define LEADING_ONES( v ) __builtin_clz(~( v ) << 16)
#else
#ifdef USE_8BIT_ARRAY
#define FSBUFBITS 8
#define FSBUFWIDTH 16
typedef uint8_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctz( 0xffffff00 | ( v ))
#define LEADING_ONES( v ) __builtin_clz(~( v ) << 24)
#else
#define FSBUFBITS 1
#define FSBUFWIDTH 128
typedef unsigned char fsbuf_type;
#define TRAILING_ZEROS( v ) (( v ) ? 0 : 1)
#define LEADING_ONES( v ) (( v ) ? 1 : 0)
#endif
#endif
#endif
#endif
static const fsbuf_type fsbuf_max = ~(fsbuf_type)0;
static const fsbuf_type fsbuf_high = (fsbuf_type)1 << (FSBUFBITS - 1);
typedef struct freespacegrid
{
fsbuf_type buf[FSHEIGHT][FSBUFWIDTH];
_Bool left_to_right;
_Bool top_to_bottom;
} freespace;
void freespace_dump(freespace* fs)
{
int x, y;
for (y = 0; y < FSHEIGHT; ++y)
{
for (x = 0; x < FSBUFWIDTH; ++x)
{
fsbuf_type i = FSBUFBITS;
fsbuf_type b = fs->buf[y][x];
for(; i != 0; --i, b <<= 1)
putchar(b & fsbuf_high ? '#' : '/');
/*
if (x + 1 < FSBUFWIDTH)
putchar('|');
*/
}
putchar('\n');
}
}
freespace* freespace_new(void)
{
freespace* fs = malloc(sizeof(*fs));
if (!fs)
return 0;
int y;
for (y = 0; y < FSHEIGHT; ++y)
{
memset(&fs->buf[y][0], 0, sizeof(fsbuf_type) * FSBUFWIDTH);
}
fs->left_to_right = true;
fs->top_to_bottom = true;
return fs;
}
void freespace_delete(freespace* fs)
{
if (!fs)
return;
free(fs);
}
/* would be private function: */
void fs_set_buffer( fsbuf_type buf[FSHEIGHT][FSBUFWIDTH],
unsigned x,
unsigned y1,
unsigned xoffset,
unsigned width,
unsigned height)
{
fsbuf_type v;
unsigned y;
for (; width > 0 && x < FSBUFWIDTH; ++x)
{
if (width < xoffset)
v = (((fsbuf_type)1 << width) - 1) << (xoffset - width);
else if (xoffset < FSBUFBITS)
v = ((fsbuf_type)1 << xoffset) - 1;
else
v = fsbuf_max;
for (y = y1; y < y1 + height; ++y)
{
#ifdef FREESPACE_DEBUG
if (buf[y][x] & v)
printf("**** over-writing area ****\n");
#endif
buf[y][x] |= v;
}
if (width < xoffset)
return;
width -= xoffset;
xoffset = FSBUFBITS;
}
}
_Bool freespace_remove( freespace* fs,
unsigned width, unsigned height,
int* resultx, int* resulty)
{
unsigned x, x1, y;
unsigned w, h;
unsigned xoffset, x1offset;
unsigned tz; /* trailing zeros */
fsbuf_type* xptr;
fsbuf_type mask = 0;
fsbuf_type v;
_Bool scanning = false;
_Bool offset = false;
*resultx = -1;
*resulty = -1;
for (y = 0; y < (unsigned) FSHEIGHT - height; ++y)
{
scanning = false;
xptr = &fs->buf[y][0];
for (x = 0; x < FSBUFWIDTH; ++x, ++xptr)
{
if(*xptr == fsbuf_max)
{
scanning = false;
continue;
}
if (!scanning)
{
scanning = true;
x1 = x;
x1offset = xoffset = FSBUFBITS;
w = width;
}
retry:
if (w < xoffset)
mask = (((fsbuf_type)1 << w) - 1) << (xoffset - w);
else if (xoffset < FSBUFBITS)
mask = ((fsbuf_type)1 << xoffset) - 1;
else
mask = fsbuf_max;
offset = false;
for (h = 0; h < height; ++h)
{
v = fs->buf[y + h][x] & mask;
if (v)
{
tz = TRAILING_ZEROS(v);
offset = true;
break;
}
}
if (offset)
{
if (tz)
{
x1 = x;
w = width;
x1offset = xoffset = tz;
goto retry;
}
scanning = false;
}
else
{
if (w <= xoffset) /***** RESULT! *****/
{
fs_set_buffer(fs->buf, x1, y, x1offset, width, height);
*resultx = x1 * FSBUFBITS + (FSBUFBITS - x1offset);
*resulty = y;
return true;
}
w -= xoffset;
xoffset = FSBUFBITS;
}
}
}
return false;
}
int main(int argc, char** argv)
{
int x[1999];
int y[1999];
int w[1999];
int h[1999];
int i;
freespace* fs = freespace_new();
for (i = 0; i < 1999; ++i, ++u)
{
w[i] = rand() % 18 + 4;
h[i] = rand() % 18 + 4;
freespace_remove(fs, w[i], h[i], &x[i], &y[i]);
/*
freespace_dump(fs);
printf("w:%d h:%d x:%d y:%d\n", w[i], h[i], x[i], y[i]);
if (x[i] == -1)
printf("not removed space %d\n", i);
getchar();
*/
}
freespace_dump(fs);
freespace_delete(fs);
return 0;
}
The above code requires one of USE_64BIT_ARRAY, USE_32BIT_ARRAY, USE_16BIT_ARRAY, USE_8BIT_ARRAY to be defined otherwise it will fall back to using only the high bit of an unsigned char for storing the state of grid cells.
The function fs_set_buffer will not be declared in the header, and will become static within the implementation when this code gets split between .h and .c files. A more user friendly function hiding the implementation details will be provided for removing used space from the grid.
Overall, this implementation is faster without optimization than my previous answer with maximum optimization (using GCC on 64bit Gentoo, optimization options -O0 and -O3 respectively).
Regarding USE_NNBIT_ARRAY and the different bit sizes, I used two different methods of timing the code which make 1999 calls to freespace_remove.
Timing main() using the Unix time command (and disabling any output in the code) seemed to prove my expectations correct - that higher bit sizes are faster.
On the other hand, timing individual calls to freespace_remove (using gettimeofday) and comparing the maximum time taken over the 1999 calls seemed to indicate lower bit sizes were faster.
This has only been tested on a 64bit system (Intel Dual Core II).

Resources