Including standalone C code into assembler - gcc

I haven't looked in ages how the computer actually starts up, so I started playing around with writing my own loader which would boot into IA-32e mode and initialize all the CPUs with some dummy code to run. I'm fairly far, but I'm getting tired of writing trivial things in assembler.
Here's a toy case of what I would like to achieve. Say I want to write a simple piece of code that prints a C-style string and keeps track of the cursor in some fixed location in memory. A C implementation would be something along the following lines (this code is untested, I wrote it on the fly, so don't comment on bugs, since they're not relevant):
#define VIDEORAM_ADDRESS 0xa0000
#define VIDEORAM_LINE_LENGTH 160
#define VGA_GREY_ON_BLACK 0x07
#define CURSOR_X 0x100 /* dummy address */
#define CURSOR_Y 0x101
void printk(const char *s)
{
volatile char *p;
int x, y;
x = *(volatile char*)CURSOR_X;
y = *(volatile char*)CURSOR_Y;
while(*s != 0) {
if(*s == '\n') {
y++;
y = y >= 25 ? 0 : y;
x = 0;
} else {
x++;
if(x >= 80) {
y++;
y = y >= 25 ? 0 : y;
x = 0;
}
p = (volatile char*)VIDEORAM_ADDRESS + x + y * VIDEORAM_LINE_LENGTH;
*p++ = *s++;
*p = VGA_GREY_ON_BLACK;
}
}
*(volatile char*)CURSOR_X = x;
*(volatile char*)CURSOR_Y = y;
}
I can compile this with gcc -m32 -O2 -S printk.c, which generates printk.s. My question is essentially how to combine this together with a handwritten assembly file? The end result should of course be nothing else except a single binary blob of machine code and data that is loaded by the BIOS onto 0000:7C00 if, say, I want to include the code into the stage 1 loader loaded from a disk and call it after switching over to protected mode.
Is an alternative putting an .include directive somewhere in the handwritten assembly file to get the code included? Unfortunately, gcc emits all kinds of directives for the GNU Assembler in the .s file and I really only want the code for the printk function.
Is there some canonical way of doing this?

Related

/*undefined sequence*/ in sliced code from Frama-C

I am trying to slice code using Frama-C.
The source code is
static uint8_T ALARM_checkOverInfusionFlowRate(void)
{
uint8_T ov;
ov = 0U;
if (ALARM_Functional_B.In_Therapy) {
if (ALARM_Functional_B.Flow_Rate > ALARM_Functional_B.Flow_Rate_High) {
ov = 1U;
} else if (ALARM_Functional_B.Flow_Rate >
ALARM_Functional_B.Commanded_Flow_Rate * div_s32
(ALARM_Functional_B.Tolerance_Max, 100) +
ALARM_Functional_B.Commanded_Flow_Rate) {
ov = 1U;
} else {
if (ALARM_Functional_B.Flow_Rate > ALARM_Functional_B.Commanded_Flow_Rate * div_s32(ALARM_Functional_B.Tolerance_Min, 100) + ALARM_Functional_B.Commanded_Flow_Rate) {
ov = 2U;
}
}
}
return ov;
}
When I sliced the code usig Frama-C, I get the following. I don't know what this “undefined sequence” means.
static uint8_T ALARM_checkOverInfusionFlowRate(void)
{
uint8_T ov;
ov = 0U;
if (ALARM_Functional_B.In_Therapy)
if ((int)ALARM_Functional_B.Flow_Rate > (int)ALARM_Functional_B.Flow_Rate_High)
ov = 1U;
else {
int32_T tmp_0;
{
/*undefined sequence*/
tmp_0 = div_s32((int)ALARM_Functional_B.Tolerance_Max,100);
}
if ((int)ALARM_Functional_B.Flow_Rate > (int)ALARM_Functional_B.Commanded_Flow_Rate * tmp_0 + (int)ALARM_Functional_B.Commanded_Flow_Rate)
ov = 1U;
else {
int32_T tmp;
{
/*undefined sequence*/
tmp = div_s32((int)ALARM_Functional_B.Tolerance_Min,100);
}
if ((int)ALARM_Functional_B.Flow_Rate > (int)ALARM_Functional_B.Commanded_Flow_Rate * tmp + (int)ALARM_Functional_B.Commanded_Flow_Rate)
ov = 2U;
}
}
return ov;
}
Appreciate any help in explaining why this happens.
/* undefined sequence */ in a block simply means that the block has been generated during the code normalization at parsing time but that with respect to C semantics there is no sequence point between the statements composing it. For instance x++ + x++ will be normalized as
{
/*undefined sequence*/
tmp = x;
x ++;
tmp_0 = x;
x ++;
;
}
Internally, each statement in such a sequence is decorated with lists of locations that are accessed for writing or reading (use -kernel-debug 1 with -print to see them in the output). Option -unspecified-access used together with -val will check that such accesses are correct, i.e. that there is at most one statement inside the sequence that write to a given location and if this is the case, that there is no read access to it (except for building the value it is assigned to). In addition, this option does not take care of side-effects occurring in a function call inside the sequence. There is a special plug-in for that, but it has not been released yet.
Finally note that since Frama-C Neon, the comment reads only /*sequence*/, which seems to be less daunting for the user. Indeed, the original code may be correct or may show undefined behavior, but syntactic analysis is too weak to decide in the general case. For instance, (*p)++ + (*q)++ is correct as long as p and q do not overlap. This is why the normalization phase only points out the sequences and leaves it up to more powerful analysis plug-ins to check whether there might be an issue.

Which Frama-C version is best suited to develop a slicing plugin?

I want to explore Frama-C to apply Assertion-based Slicing (using ACSL notation).
I have found that there are several different versions of Frama-C with some different features.
My question is which version is best suited to develop a a slicing plugin to Frama-C and to manipulate the AST created by Frama-C.
There already is a slicing plug-in in Frama-C (in all versions).
This plug-in uses the results of the value analysis plug-in, which assumes the properties written inside ACSL assertions (after having attempted to verify them).
So, depending on what you call “assertion-based slicing” (and be aware that the article that comes up first in Google is behind a paywall), what you propose to do may already exists as a Frama-C plug-in (and one that works pretty well as of the last two or three Frama-C versions).
To answer your question anyway, the best version to use is the latest one, which is Fluorine 20130601 as of this writing.
Example of existing slicing features in Frama-C:
$ cat t.c
int f(unsigned int x)
{
int y;
/*# assert x == 0 ; */
if (x)
y = 9;
else
y = 10;
return y;
}
$ frama-c -sparecode t.c -main f
...
t.c:4:[value] Assertion got status unknown.
...
/* Generated by Frama-C */
int f(unsigned int x)
{
int y;
/*# assert x ≡ 0; */
;
y = 10;
return (y);
}
Is the above what you have in mind when you speak of “assertion-based slicing”?
Note: Frama-C's option -sparecode is a slicing option for the criterion “preserve all results of the program”. It still removes any statement that is without consequences, such as y=3; in y=3; y=4;, and being based on Frama-C's value analysis, it removes anything that is considered unreachable or without consequences because of the value analysis' results.
Another example to illustrate:
$ cat t.c
int f(unsigned int x)
{
int y;
int a, b;
int *p[2] = {&a, &b};
/*# assert x == 0 ; */
a = 100;
b = 200;
if (x)
y = 9;
else
y = 10;
return y + *(p[x]);
}
$ frama-c -sparecode t.c -main f
...
t.c:6:[value] Assertion got status unknown.
...
/* Generated by Frama-C */
int f(unsigned int x)
{
int __retres;
int y;
int a;
int *p[2];
p[0] = & a;
/*# assert x ≡ 0; */
;
a = 100;
y = 10;
__retres = y + *(p[x]);
return (__retres);
}

Performance: greater / smaller than vs not equal to

I wonder if there is a difference in performance between
checking if a value is greater / smaller than another
for(int x = 0; x < y; x++); // for y > x
and
checking if a value is not equal to another
for(int x = 0; x != y; x++); // for y > x
and why?
In addition: What if I compare to zero, is there a further difference?
It would be nice if the answers also consider an assebled view on the code.
EDIT:
As most of you pointed out the difference in performance of course is negligible but I'm interested in the difference on the cpu level. Which operation is more complex?
To me it's more a question to learn / understand the technique.
I removed the Java tag, which I added accidentally because the question was meant generally not just based on Java, sorry.
You should still do what is clearer, safer and easier to understand. These micro-tuning discussions are usually a waste of your time because
they rarely make a measurable difference
when they make a difference this can change if you use a different JVM, or processor. i.e. without warning.
Note: the machine generated can also change with processor or JVM, so looking this is not very helpful in most cases, even if you are very familiar with assembly code.
What is much, much more important is the maintainability of the software.
The performance is absolutely negligible. Here's some code to prove it:
public class OpporatorPerformance {
static long y = 300000000L;
public static void main(String[] args) {
System.out.println("Test One: " + testOne());
System.out.println("Test Two: " + testTwo());
System.out.println("Test One: " + testOne());
System.out.println("Test Two: " + testTwo());
System.out.println("Test One: " + testOne());
System.out.println("Test Two: " + testTwo());
System.out.println("Test One: " + testOne());
System.out.println("Test Two: " + testTwo());
}
public static long testOne() {
Date newDate = new Date();
int z = 0;
for(int x = 0; x < y; x++){ // for y > x
z = x;
}
return new Date().getTime() - newDate.getTime();
}
public static long testTwo() {
Date newDate = new Date();
int z = 0;
for(int x = 0; x != y; x++){ // for y > x
z = x;
}
return new Date().getTime() - newDate.getTime();
}
}
The results:
Test One: 342
Test Two: 332
Test One: 340
Test Two: 340
Test One: 415
Test Two: 325
Test One: 393
Test Two: 329
Now 6 years later and after still receiving occasional notifications from this question I'd like to add some insights that I've gained during my computer science study.
Putting the above statements into a small program and compiling it...
public class Comp {
public static void main(String[] args) {
int y = 42;
for(int x = 0; x < y; x++) {
// stop if x >= y
}
for(int x = 0; x != y; x++) {
// stop if x == y
}
}
}
... we get the following bytecode:
public static void main(java.lang.String[]);
Code:
// y = 42
0: bipush 42
2: istore_1
// first for-loop
3: iconst_0
4: istore_2
5: iload_2
6: iload_1
7: if_icmpge 16 // jump out of loop if x => y
10: iinc 2, 1
13: goto 5
// second for-loop
16: iconst_0
17: istore_2
18: iload_2
19: iload_1
20: if_icmpeq 29 // jump out of loop if x == y
23: iinc 2, 1
26: goto 18
29: return
As we can see, on bytecode level both are handled in the same way and use a single bytecode instruction for the comparison.
As already stated, how the bytecode is translated into assembler/machine code depends on the JVM.
But generally this conditional jumps can be translated to some assembly code like this:
; condition of first loop
CMP eax, ebx
JGE label ; jump if eax > ebx
; condition of second loop
CMP eax, ebx
JE label ; jump if eax == ebx
On hardware level JGE and JE have the same complexity.
So all in all: Regarding performance, both x < y and x != y are theoretically the same on hardware level and one isn't per se faster or slower than the other.
There is rarely a performance hit but the first is much more reliable as it will handle both of the extraordinary cases where
y < 0 to start
x or y are messed with inside the block.
The performance in theory is the same. When you do a less than or not equal to operation, in the processor level you actually perform a subtract operation and check if the negative flag or zero flag is enabled in the result. In theory the performance will be the same. Since the difference is only checking the flag set.
Other people seem to have answered from a measured perspective, but from the machine level you'd be interested in the Arithmetic Logic Unit (ALU), which handles the mathy bits on a "normal computer". There seems to be a pretty good detail in How does less than and greater than work on a logical level in binary? for complete details.
From purely a logical level, the short answer is that it's easier to tell if something is not something than it is to tell if something is relative to something, however this has likely been optimized in your standard personal computer or server, so you'll only see actual gains likely in small personal builds like on-board computers for drones or other micro-technologies.
Wonder if nested for each test, same results ?
for(int x = 0; x < y; x++)
{
for(int x2 = 0; x2 < y; x2++) {}
}
for(int x = 0; x != y; x++)
{
for(int x2 = 0; x2 != y; x2++) {}
}

glext visual studio cuda

I am currently in a parallel computing class using a book called Cuda by Example. In Chapter 4 of this book I am using some .h files that contain includes for "GL/glut.h" and "GL/glext.h", I have steps for installing GLUT online, and followed those. I think that this worked but I am not sure. I then tried to find directions for glext, but I cannot seem to find as much on this. I did find one .h file and tried to use that by including it in the GL folder as well. This does not seem to work because I received errors when compiling of things similar to this:
Error 1 error : calling a host function("cuComplex::cuComplex") from a device/_global_ function("julia") is not allowed C:\Users\Laptop\Documents\Visual Studio 2010\Projects\Lab1\Lab1\lab1.cu 29 1 Lab1
I think this is because I need more for glext.h, like .dll and things similar to the glut, but I am not sure. Any help with this would be appreciated. Thank You.
EDIT:- this is the code that I am using, and I have not changed it from what I see in the book, except for the top two include statements and the .h files are from google code: thank you for any help
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "book.h"
#include "cpu_bitmap.h"
#define DIM 1000
struct cuComplex {
float r;
float i;
cuComplex( float a, float b) : r(a), i(b) {}
__device__ float magnitude2(void) {
return r*r + i*i;
}
__device__ cuComplex operator* (const cuComplex& a) {
return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
}
__device__ cuComplex operator+ (const cuComplex& a) {
return cuComplex(r+a.r, i+a.i);
}
};
__device__ int julia( int x, int y) {
const float scale = 1.5;
float jx = scale * (float)(DIM/2 -x)/(DIM/2);
float jy = scale * (float)(DIM/2 - y)/(DIM/2);
cuComplex c(-0.8, .156);
cuComplex a(jx, jy);
int i = 0;
for(i=0;i<200;i++) {
a = a * a + c;
if(a.magnitude2() > 1000)
return 0;
}
return 1;
}
__global__ void kernel(unsigned char *ptr ) {
//map from threadIdx/BlockIdx to pixel position
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + y * gridDim.x;
//now claculate the value at that position
int juliaValue = julia(x,y);
ptr[offset*4 + 0] = 255 * juliaValue;
ptr[offset*4 + 1] = 0;
ptr[offset*4 + 2] = 0;
ptr[offset*4 + 3] = 255;
}
int main( void ) {
CPUBitmap bitmap(DIM, DIM);
unsigned char *dev_bitmap;
HANDLE_ERROR(cudaMalloc((void**)&dev_bitmap, bitmap.image_size()));
dim3 grid(DIM,DIM);
kernel<<<grid,1>>>( dev_bitmap );
HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost));
bitmap.display_and_exit();
HANDLE_ERROR( cudaFree( dev_bitmap ));
}
try adding the following.
Original code:
cuComplex( float a, float b) : r(a), i(b) {}
Modified:
__host__ __device__ cuComplex( float a, float b ) : r(a), i(b) {}
It fixed the issue for me. I also didn't need the two include files you added, but you may depending on your build process.
A CUDA program consists of 2 types of code: host code and device code. Host code runs on the host CPU and cannot run on the GPU, and device code runs on the GPU and cannot run on the CPU. If you don't decorate your program in any way, then it will be all host code. But once you start adding CUDA sections delineated by keywords like __ global__ or __ device__ then your program will contain some device code.
The compiler error you received indicated that a function that was running on the device was attempting to use code compiled for the CPU. This is a no-no and the compiler will not allow this. This example is unusual since at some point in time (when the book was written) it presumably did not generate this error, and furthermore the code in cuComplex struct appears to be decorated with __ device__ keyword. However at the outermost level of the struct at the line of code I modified, there is no keyword identifying __ device__ . When I add the __ device__ __ host__ keywords, this tells the compiler "for this logical section, create both a device-compiled version and a host-compiled version of the code". This explicitly tells the compiler you want to be able to use this section of code in the device. And with that addition, we have steered the compiler correctly and it no longer gives the complaint.
Apparently something has changed about the level of decoration that the compiler needs to generate device code in this case. Presumably, with older compilers, the __ device__ keywords inside the struct were enough to let the compiler know that it had to generate device versions of the operators callable by cuComplex type.

Fast block placement algorithm, advice needed?

I need to emulate the window placement strategy of the Fluxbox window manager.
As a rough guide, visualize randomly sized windows filling up the screen one at a time, where the rough size of each results in an average of 80 windows on screen without any window overlapping another.
If you have Fluxbox and Xterm installed on your system, you can try the xwinmidiarptoy BASH script to see a rough prototype of what I want happening. See the xwinmidiarptoy.txt notes I've written about it explaining what it does and how it should be used.
It is important to note that windows will close and the space that closed windows previously occupied becomes available once more for the placement of new windows.
The algorithm needs to be an Online Algorithm processing data "piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start."
The Fluxbox window placement strategy has three binary options which I want to emulate:
Windows build horizontal rows or vertical columns (potentially)
Windows are placed from left to right or right to left
Windows are placed from top to bottom or bottom to top
Differences between the target algorithm and a window-placement algorithm
The coordinate units are not pixels. The grid within which blocks will be placed will be 128 x 128 units. Furthermore, the placement area may be further shrunk by a boundary area placed within the grid.
Why is the algorithm a problem?
It needs to operate to the deadlines of a real time thread in an audio application.
At this moment I am only concerned with getting a fast algorithm, don't concern yourself over the implications of real time threads and all the hurdles in programming that that brings.
And although the algorithm will never ever place a window which overlaps another, the user will be able to place and move certain types of blocks, overlapping windows will exist. The data structure used for storing the windows and/or free space, needs to be able to handle this overlap.
So far I have two choices which I have built loose prototypes for:
1) A port of the Fluxbox placement algorithm into my code.
The problem with this is, the client (my program) gets kicked out of the audio server (JACK) when I try placing the worst case scenario of 256 blocks using the algorithm. This algorithm performs over 14000 full (linear) scans of the list of blocks already placed when placing the 256th window.
For a demonstration of this I created a program called text_boxer-0.0.2.tar.bz2 which takes a text file as input and arranges it within ASCII boxes. Issue make to build it. A little unfriendly, use --help (or any other invalid option) for a list of command line options. You must specify the text file by using the option.
2) My alternative approach.
Only partially implemented, this approach uses a data structure for each area of rectangular free unused space (the list of windows can be entirely separate, and is not required for testing of this algorithm). The data structure acts as a node in a doubly linked list (with sorted insertion), as well as containing the coordinates of the top-left corner, and the width and height.
Furthermore, each block data structure also contains four links which connect to each immediately adjacent (touching) block on each of the four sides.
IMPORTANT RULE: Each block may only touch with one block per side. This is a rule specific to the algorithm's way of storing free unused space and bears no factor in how many actual windows may touch each other.
The problem with this approach is, it's very complex. I have implemented the straightforward cases where 1) space is removed from one corner of a block, 2) splitting neighbouring blocks so that the IMPORTANT RULE is adhered to.
The less straightforward case, where the space to be removed can only be found within a column or row of boxes, is only partially implemented - if one of the blocks to be removed is an exact fit for width (ie column) or height (ie row) then problems occur. And don't even mention the fact this only checks columns one box wide, and rows one box tall.
I've implemented this algorithm in C - the language I am using for this project (I've not used C++ for a few years and am uncomfortable using it after having focused all my attention to C development, it's a hobby). The implementation is 700+ lines of code (including plenty of blank lines, brace lines, comments etc). The implementation only works for the horizontal-rows + left-right + top-bottom placement strategy.
So I've either got to add some way of making this +700 lines of code work for the other 7 placement strategy options, or I'm going to have to duplicate those +700 lines of code for the other seven options. Neither of these is attractive, the first, because the existing code is complex enough, the second, because of bloat.
The algorithm is not even at a stage where I can use it in the real time worst case scenario, because of missing functionality, so I still don't know if it actually performs better or worse than the first approach.
The current state of C implementation of this algorithm is freespace.c. I use gcc -O0 -ggdb freespace.c to build, and run it in an xterm sized to atleast 124 x 60 chars.
What else is there?
I've skimmed over and discounted:
Bin Packing algorithms: their
emphasis on optimal fit does not
match the requirements of this
algorithm.
Recursive Bisection Placement algorithms: sounds promising, but
these are for circuit design. Their
emphasis is optimal wire length.
Both of these, especially the latter, all elements to be placed/packs are known before the algorithm begins.
What are your thoughts on this? How would you approach it? What other algorithms should I look at? Or even what concepts should I research seeing as I've not studied computer science/software engineering?
Please ask questions in comments if further information is needed.
Further ideas developed since asking this question
Some combination of my "alternative algorithm" with a spatial hashmap for identifying if a large window to be placed would cover several blocks of free space.
I would consider some kind of spatial hashing structure. Imagine your entire free space is gridded coarsely, call them blocks. As windows come and go, they occupy certain sets of contiguous rectangular blocks. For each block, keep track of the largest unused rectangle incident to each corner, so you need to store 2*4 real numbers per block. For an empty block, the rectangles at each corner have size equal to the block. Thus, a block can only be "used up" at its corners, and so at most 4 windows can sit in any block.
Now each time you add a window, you have to search for a rectangular set of blocks for which the window will fit, and when you do, update the free corner sizes. You should size your blocks so that a handful (~4x4) of them fit into a typical window. For each window, keep track of which blocks it touches (you only need to keep track of extents), as well as which windows touch a given block (at most 4, in this algorithm). There is an obvious tradeoff between the granularity of the blocks and the amount of work per window insertion/removal.
When removing a window, loop over all blocks it touches, and for each block, recompute the free corner sizes (you know which windows touch it). This is fast since the inner loop is at most length 4.
I imagine a data structure like
struct block{
int free_x[4]; // 0 = top left, 1 = top right,
int free_y[4]; // 2 = bottom left, 3 = bottom right
int n_windows; // number of windows that occupy this block
int window_id[4]; // IDs of windows that occupy this block
};
block blocks[NX][NY];
struct window{
int id;
int used_block_x[2]; // 0 = first index of used block,
int used_block_y[2]; // 1 = last index of used block
};
Edit
Here is a picture:
It shows two example blocks. The colored dots indicate the corners of the block, and the arrows emanating from them indicate the extents of the largest-free-rectangle from that corner.
You mentioned in the edit that the grid on which your windows will be placed is already quite coarse (127x127), so the block sizes would probably be something like 4 grid cells on a side, which probably wouldn't gain you much. This method is suitable if your window corner coordinates can take on a lot of values (I was thinking they would be pixels), but not so much in your case. You can still try it, since it's simple. You would probably want to also keep a list of completely empty blocks so that if a window comes in that is larger than 2 block widths, then you look first in that list before looking for some suitable free space in the block grid.
After some false starts, I have eventually arrived here. Here is where the use of data structures for storing rectangular areas of free space have been abandoned. Instead, there is a 2d array with 128 x 128 elements to achieve the same result but with much less complexity.
The following function scans the array for an area width * height in size. The first position it finds it writes the top left coordinates of, to where resultx and resulty point to.
_Bool freespace_remove( freespace* fs,
int width, int height,
int* resultx, int* resulty)
{
int x = 0;
int y = 0;
const int rx = FSWIDTH - width;
const int by = FSHEIGHT - height;
*resultx = -1;
*resulty = -1;
char* buf[height];
for (y = 0; y < by; ++y)
{
x = 0;
char* scanx = fs->buf[y];
while (x < rx)
{
while(x < rx && *(scanx + x))
++x;
int w, h;
for (h = 0; h < height; ++h)
buf[h] = fs->buf[y + h] + x;
_Bool usable = true;
w = 0;
while (usable && w < width)
{
h = 0;
while (usable && h < height)
if (*(buf[h++] + w))
usable = false;
++w;
}
if (usable)
{
for (w = 0; w < width; ++w)
for (h = 0; h < height; ++h)
*(buf[h] + w) = 1;
*resultx = x;
*resulty = y;
return true;
}
x += w;
}
}
return false;
}
The 2d array is zero initialized. Any areas in the array where the space is used are set to 1. This structure and function will work independently from the actual list of windows that are occupying the areas marked with 1's.
The advantages of this method are its simplicity. It only uses one data structure - an array. The function is short, and should not be too difficult to adapt to handle the remaining placement options (here it only handles Row Smart + Left to Right + Top to Bottom).
My initial tests also look promising on the speed front. Though I don't think this would be suitable for a window manager placing windows on, for example, a 1600 x 1200 desktop with pixel accuracy, for my purposes I believe it is going to be much better than any of the previous methods I have tried.
Compilable test code here:
http://jwm-art.net/art/text/freespace_grid.c
(in Linux I use gcc -ggdb -O0 freespace_grid.c to compile)
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#define FSWIDTH 128
#define FSHEIGHT 128
#ifdef USE_64BIT_ARRAY
#define FSBUFBITS 64
#define FSBUFWIDTH 2
typedef uint64_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctzl(( v ))
#define LEADING_ONES( v ) __builtin_clzl(~( v ))
#else
#ifdef USE_32BIT_ARRAY
#define FSBUFBITS 32
#define FSBUFWIDTH 4
typedef uint32_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctz(( v ))
#define LEADING_ONES( v ) __builtin_clz(~( v ))
#else
#ifdef USE_16BIT_ARRAY
#define FSBUFBITS 16
#define FSBUFWIDTH 8
typedef uint16_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctz( 0xffff0000 | ( v ))
#define LEADING_ONES( v ) __builtin_clz(~( v ) << 16)
#else
#ifdef USE_8BIT_ARRAY
#define FSBUFBITS 8
#define FSBUFWIDTH 16
typedef uint8_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctz( 0xffffff00 | ( v ))
#define LEADING_ONES( v ) __builtin_clz(~( v ) << 24)
#else
#define FSBUFBITS 1
#define FSBUFWIDTH 128
typedef unsigned char fsbuf_type;
#define TRAILING_ZEROS( v ) (( v ) ? 0 : 1)
#define LEADING_ONES( v ) (( v ) ? 1 : 0)
#endif
#endif
#endif
#endif
static const fsbuf_type fsbuf_max = ~(fsbuf_type)0;
static const fsbuf_type fsbuf_high = (fsbuf_type)1 << (FSBUFBITS - 1);
typedef struct freespacegrid
{
fsbuf_type buf[FSHEIGHT][FSBUFWIDTH];
_Bool left_to_right;
_Bool top_to_bottom;
} freespace;
void freespace_dump(freespace* fs)
{
int x, y;
for (y = 0; y < FSHEIGHT; ++y)
{
for (x = 0; x < FSBUFWIDTH; ++x)
{
fsbuf_type i = FSBUFBITS;
fsbuf_type b = fs->buf[y][x];
for(; i != 0; --i, b <<= 1)
putchar(b & fsbuf_high ? '#' : '/');
/*
if (x + 1 < FSBUFWIDTH)
putchar('|');
*/
}
putchar('\n');
}
}
freespace* freespace_new(void)
{
freespace* fs = malloc(sizeof(*fs));
if (!fs)
return 0;
int y;
for (y = 0; y < FSHEIGHT; ++y)
{
memset(&fs->buf[y][0], 0, sizeof(fsbuf_type) * FSBUFWIDTH);
}
fs->left_to_right = true;
fs->top_to_bottom = true;
return fs;
}
void freespace_delete(freespace* fs)
{
if (!fs)
return;
free(fs);
}
/* would be private function: */
void fs_set_buffer( fsbuf_type buf[FSHEIGHT][FSBUFWIDTH],
unsigned x,
unsigned y1,
unsigned xoffset,
unsigned width,
unsigned height)
{
fsbuf_type v;
unsigned y;
for (; width > 0 && x < FSBUFWIDTH; ++x)
{
if (width < xoffset)
v = (((fsbuf_type)1 << width) - 1) << (xoffset - width);
else if (xoffset < FSBUFBITS)
v = ((fsbuf_type)1 << xoffset) - 1;
else
v = fsbuf_max;
for (y = y1; y < y1 + height; ++y)
{
#ifdef FREESPACE_DEBUG
if (buf[y][x] & v)
printf("**** over-writing area ****\n");
#endif
buf[y][x] |= v;
}
if (width < xoffset)
return;
width -= xoffset;
xoffset = FSBUFBITS;
}
}
_Bool freespace_remove( freespace* fs,
unsigned width, unsigned height,
int* resultx, int* resulty)
{
unsigned x, x1, y;
unsigned w, h;
unsigned xoffset, x1offset;
unsigned tz; /* trailing zeros */
fsbuf_type* xptr;
fsbuf_type mask = 0;
fsbuf_type v;
_Bool scanning = false;
_Bool offset = false;
*resultx = -1;
*resulty = -1;
for (y = 0; y < (unsigned) FSHEIGHT - height; ++y)
{
scanning = false;
xptr = &fs->buf[y][0];
for (x = 0; x < FSBUFWIDTH; ++x, ++xptr)
{
if(*xptr == fsbuf_max)
{
scanning = false;
continue;
}
if (!scanning)
{
scanning = true;
x1 = x;
x1offset = xoffset = FSBUFBITS;
w = width;
}
retry:
if (w < xoffset)
mask = (((fsbuf_type)1 << w) - 1) << (xoffset - w);
else if (xoffset < FSBUFBITS)
mask = ((fsbuf_type)1 << xoffset) - 1;
else
mask = fsbuf_max;
offset = false;
for (h = 0; h < height; ++h)
{
v = fs->buf[y + h][x] & mask;
if (v)
{
tz = TRAILING_ZEROS(v);
offset = true;
break;
}
}
if (offset)
{
if (tz)
{
x1 = x;
w = width;
x1offset = xoffset = tz;
goto retry;
}
scanning = false;
}
else
{
if (w <= xoffset) /***** RESULT! *****/
{
fs_set_buffer(fs->buf, x1, y, x1offset, width, height);
*resultx = x1 * FSBUFBITS + (FSBUFBITS - x1offset);
*resulty = y;
return true;
}
w -= xoffset;
xoffset = FSBUFBITS;
}
}
}
return false;
}
int main(int argc, char** argv)
{
int x[1999];
int y[1999];
int w[1999];
int h[1999];
int i;
freespace* fs = freespace_new();
for (i = 0; i < 1999; ++i, ++u)
{
w[i] = rand() % 18 + 4;
h[i] = rand() % 18 + 4;
freespace_remove(fs, w[i], h[i], &x[i], &y[i]);
/*
freespace_dump(fs);
printf("w:%d h:%d x:%d y:%d\n", w[i], h[i], x[i], y[i]);
if (x[i] == -1)
printf("not removed space %d\n", i);
getchar();
*/
}
freespace_dump(fs);
freespace_delete(fs);
return 0;
}
The above code requires one of USE_64BIT_ARRAY, USE_32BIT_ARRAY, USE_16BIT_ARRAY, USE_8BIT_ARRAY to be defined otherwise it will fall back to using only the high bit of an unsigned char for storing the state of grid cells.
The function fs_set_buffer will not be declared in the header, and will become static within the implementation when this code gets split between .h and .c files. A more user friendly function hiding the implementation details will be provided for removing used space from the grid.
Overall, this implementation is faster without optimization than my previous answer with maximum optimization (using GCC on 64bit Gentoo, optimization options -O0 and -O3 respectively).
Regarding USE_NNBIT_ARRAY and the different bit sizes, I used two different methods of timing the code which make 1999 calls to freespace_remove.
Timing main() using the Unix time command (and disabling any output in the code) seemed to prove my expectations correct - that higher bit sizes are faster.
On the other hand, timing individual calls to freespace_remove (using gettimeofday) and comparing the maximum time taken over the 1999 calls seemed to indicate lower bit sizes were faster.
This has only been tested on a 64bit system (Intel Dual Core II).

Resources