DirectX11 Radix Sort - sorting

I'm struggling to implement an efficient Radix Sort in DirectX11.
I need to quickly sort no more than 256K items.
It doesn't need to be in-place.
Using CUDA / OpenCL is not an option.
I have it almost working, but it has a few problems:
The histogram generation is quite fast, but still takes much longer than quoted figures online
On subsequent sort passes, the order of keys whose lower bits are identical is not preserved, due to the InterlockedAdd on the histogram buffer in cp_sort (see below)
cp_sort is really slow, due to global memory access on that same InterlockedAdd
I've been trying to understand how I can fix this, based on algorithms online, but I can't seem to understand them.
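For reference, here is a minimal CPU-side sketch in C of what I understand a single stable radix pass should do (my own illustration, not the shader code; radix_pass is just a placeholder name): the exclusive prefix sum over the per-digit counts plus an in-order scatter is what preserves the order of keys with equal digits.
#include <stddef.h>
#include <stdint.h>
/* One stable pass of an LSD radix sort over the 8-bit digit at 'shift'.
   The exclusive prefix sum over the per-digit counts, followed by an
   in-order scatter, keeps keys with equal digits in their original
   relative order. */
static void radix_pass( const uint32_t *keys_in, uint32_t *keys_out,
                        size_t n, unsigned shift )
{
    size_t count[ 256 ] = { 0 };
    size_t offset[ 256 ];
    size_t i, sum = 0;
    for ( i = 0; i < n; i++ )                       /* histogram */
        count[ ( keys_in[ i ] >> shift ) & 0xff ]++;
    for ( i = 0; i < 256; i++ )                     /* exclusive prefix sum */
    {
        offset[ i ] = sum;
        sum += count[ i ];
    }
    for ( i = 0; i < n; i++ )                       /* stable scatter */
    {
        unsigned digit = ( keys_in[ i ] >> shift ) & 0xff;
        keys_out[ offset[ digit ]++ ] = keys_in[ i ];
    }
}
As far as I understand, the fast GPU versions get the same effect by computing per-block digit counts, prefix-summing them across blocks, and then having each block scatter its own elements in their original order, instead of one global InterlockedAdd per element.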
Here are my 3 kernels. (it's for a billboarded particle system, so the term 'quad' just refers to an item to be sorted)
// 8 bits per radix
#define NUM_RADIXES 4
#define RADIX_SIZE 256
// buffers
StructuredBuffer<Quad> g_particleQuadBuffer : register( t20 );
RWStructuredBuffer<uint> g_indexBuffer : register( u0 );
RWStructuredBuffer<uint> g_histogram : register( u1 );
RWStructuredBuffer<uint> g_indexOutBuffer : register( u2 );
// quad buffer counter
cbuffer BufferCounter : register( b12 )
{
uint numQuads;
uint pad[ 3 ];
}
// on-chip memory for fast histogram calculation
#define SHARED_MEM_PADDING 8 // to try and reduce bank conflicts
groupshared uint g_localHistogram[ NUM_RADIXES * RADIX_SIZE * SHARED_MEM_PADDING ];
// convert a float to a sortable int, assuming all inputs are negative
uint floatToUInt( float input )
{
return 0xffffffff - asuint( input );
}
// initialise the indices, and build the histograms (dispatched with numQuads / ( NUM_RADIXES * RADIX_SIZE ))
[numthreads( NUM_RADIXES * RADIX_SIZE, 1, 1 )]
void cp_histogram( uint3 groupID : SV_GroupID, uint groupIndex : SV_GroupIndex )
{
// initialise local histogram
g_localHistogram[ groupIndex * SHARED_MEM_PADDING ] = 0;
GroupMemoryBarrierWithGroupSync();
// check within range
uint quadIndex = ( groupID.x * NUM_RADIXES * RADIX_SIZE ) + groupIndex;
if ( quadIndex < numQuads )
{
// initialise index
g_indexBuffer[ quadIndex ] = quadIndex;
// floating point to sortable uint
uint value = floatToUInt( g_particleQuadBuffer[ quadIndex ].v[ 0 ].projected.z );
// build 8-bit histograms
uint value0 = ( value ) & 0xff;
uint value1 = ( value >> 8 ) & 0xff;
uint value2 = ( value >> 16 ) & 0xff;
uint value3 = ( value >> 24 );
InterlockedAdd( g_localHistogram[ ( value0 ) * SHARED_MEM_PADDING ], 1 );
InterlockedAdd( g_localHistogram[ ( value1 + 256 ) * SHARED_MEM_PADDING ], 1 );
InterlockedAdd( g_localHistogram[ ( value2 + 512 ) * SHARED_MEM_PADDING ], 1 );
InterlockedAdd( g_localHistogram[ ( value3 + 768 ) * SHARED_MEM_PADDING ], 1 );
}
// write back to histogram
GroupMemoryBarrierWithGroupSync();
InterlockedAdd( g_histogram[ groupIndex ], g_localHistogram[ groupIndex * SHARED_MEM_PADDING ] );
}
// build the offsets based on histograms (dispatched with 1)
// NOTE: I know this could be more efficient, but from my profiling, its time is negligible compared to the other 2 stages, and I can optimise this separately using a parallel prefix sum if I need to
[numthreads( NUM_RADIXES, 1, 1 )]
void cp_offsets( uint groupIndex : SV_GroupIndex )
{
uint sum = 0;
uint base = ( groupIndex * RADIX_SIZE );
for ( uint i = 0; i < RADIX_SIZE; i++ )
{
uint tempSum = g_histogram[ base + i ] + sum;
g_histogram[ base + i ] = sum;
sum = tempSum;
}
}
// move the data (dispatched with numQuads / ( NUM_RADIXES * RADIX_SIZE ))
uint currentRadix;
[numthreads( NUM_RADIXES * RADIX_SIZE, 1, 1 )]
void cp_sort( uint3 groupID : SV_GroupID, uint groupIndex : SV_GroupIndex )
{
// check within range
uint quadIndex = ( groupID.x * NUM_RADIXES * RADIX_SIZE ) + groupIndex;
if ( quadIndex < numQuads )
{
uint fi = g_indexBuffer[ quadIndex ];
uint depth = floatToUInt( g_particleQuadBuffer[ fi ].v[ 0 ].projected.z );
uint radix = currentRadix;
uint pos = ( depth >> ( radix * 8 ) ) & 0xff;
uint writePosition;
InterlockedAdd( g_histogram[ radix * RADIX_SIZE + pos ], 1, writePosition );
g_indexOutBuffer[ writePosition ] = fi;
}
}
Can anyone offer any help on how to fix/optimise this?
I'd love to understand what some of the more complex algorithms for GPU Radix sorting are actually doing!
Thanks!

Related

easy genetic algorithm fitness function

I'm trying to write a very simple genetic algorithm in C (for a school research project). I am kind of stuck on calculating the fitness percentage.
I'm trying to match a random string from user input with a dictionary word (one could imagine a Scrabble game algorithm or anything else).
For instance, when the user input is "hello" and the dictionary word is "hello",
both strings match and the fitness should be 100%. With "hellp" and "hello" the fitness should be almost 100%, and with "uryyb" the fitness should be (far) below 100%.
Does anybody know how to write such a fitness function, or know a (general) reference for this sort of fitness function?
Here I allocate memory for an array of dictionary words
int row;
// first allocate amount_words pointers
woorden = (char **) malloc( amount_words * sizeof( char * ) );
for( row = 0; row < amount_words; row++ )
woorden[row] = (char *) malloc( len + 1 );
return;
these could also be freed:
int row;
for( row = 0; row < amount_words; row++ )
free( woorden[row] );
free( woorden );
return;
Here I crudely check that a string contains only lowercase characters:
char is_valid_str( char *str )
{
int i;
for( i=0; i <= zoek_str_len - 1; i++ )
if( str[i] < 'a' || str[i] > 'z' )
return FALSE;
return TRUE;
}
I calculate the amount of words of certain length
int amount_len_words( int len )
{
FILE *f;
int amount_words = 0;
char woord[40];
f = fopen("words.txt", "r");
while(!feof(f)) {
fscanf( f, "%s\n", woord );
if( strlen(woord) == len ) {
amount_words++;
if( !is_valid_str( woord ) )
amount_words--;
}
}
fclose(f);
return amount_words;
}
Here I read an array of words of a certain length:
FILE *f;
int i=0;
int lenwords;
char woord[40];
lenwords = amount_len_words( len );
alloc_woorden( lenwords, len );
f = fopen("words.txt", "r");
while( !feof( f ) ) {
fscanf(f,"%s\n", woord );
if( strlen(woord) == len ) {
if( is_valid_str( woord ) ) {
strncpy( woorden[i++], woord, len + 1 ); // copy len characters plus the terminating '\0'
//printf("->%s\n", woorden[i]);
}
}
}
for( i=0;i < lenwords;i++) {
printf("%s\n", woorden[i] );
}
Here is the main routine
int main( int argc, char *argv[] )
{
int i;
char zoek_str[40];
if( argc <= 1 ) {
printf( "gebruik: %s zoek_string\n", argv[0] );
return 0;
}
if( strlen( argv[1] ) > 39 ) {
printf( "Zoek string maximaal 39 lowercase karakters.\n" );
return 0;
}
strcpy( zoek_str, argv[1] );
zoek_str_len = strlen ( zoek_str );
if( !is_valid_str( zoek_str ) ) {
printf( "Ongeldige zoek string. Neemt alleen lowercase karakters!\n" );
return 0;
}
printf("%s\n",zoek_str);
init_words( zoek_str_len );
return 0;
}
These two are the functions I'm currently puzzling about:
double calculate_fitness( char *zoek )
{
}
And
void mutate( char *arg )
{
}
Thereafter I would compute generation after generation.
Note that I only search fixed-length strings, e.g. strlen(argv[1]).
Example output of all of this could be:
generation   string   word    percentage
1            hfllr    hello   89.4%
2            hellq    hello   90.3%
3            hellp    hello   95.3%
4            hello    hello   100%
or something like that.
By comparing the two strings letter by letter a metric could be correct/max_length where 'correct' is the number of letters that match and 'max_length' is the length of the longest string.
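As a rough sketch in C (the function name is just a placeholder), that letter-by-letter metric could look like this:
#include <string.h>
/* Fraction of positions where the two strings agree, divided by the
   length of the longer string. Returns a value between 0.0 and 1.0. */
double letter_match_fitness( const char *candidate, const char *target )
{
    size_t len_c = strlen( candidate );
    size_t len_t = strlen( target );
    size_t max_len = len_c > len_t ? len_c : len_t;
    size_t min_len = len_c < len_t ? len_c : len_t;
    size_t i, correct = 0;
    if ( max_len == 0 )
        return 1.0;                /* two empty strings match trivially */
    for ( i = 0; i < min_len; i++ )
        if ( candidate[i] == target[i] )
            correct++;
    return (double) correct / (double) max_len;
}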
For something more involved you could look up the concept of edit distance (see in particular the Levenshtein distance).
Basically what you are trying to measure is the minimum number of operations required to transform one string into the other.
First of all, you need a metric on "distance between strings". A commonly used one is the Levenshtein distance, which measures the distance between two strings as the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one string into the other.
Googling this you can find multiple code examples of how to compute such a distance. Once you have the distance, your fitness should be inversely proportional to it.
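For instance, here is a minimal sketch in C of the classic dynamic-programming Levenshtein distance (my own illustration; allocation error handling omitted):
#include <stdlib.h>
#include <string.h>
/* Levenshtein distance: minimum number of single-character insertions,
   deletions and substitutions needed to turn s into t. */
size_t levenshtein( const char *s, const char *t )
{
    size_t n = strlen( s ), m = strlen( t );
    size_t *prev = malloc( ( m + 1 ) * sizeof *prev );
    size_t *curr = malloc( ( m + 1 ) * sizeof *curr );
    size_t i, j, result;
    for ( j = 0; j <= m; j++ )
        prev[j] = j;                      /* distance from the empty prefix */
    for ( i = 1; i <= n; i++ ) {
        curr[0] = i;
        for ( j = 1; j <= m; j++ ) {
            size_t del = prev[j] + 1;
            size_t ins = curr[j - 1] + 1;
            size_t sub = prev[j - 1] + ( s[i - 1] != t[j - 1] );
            size_t best = del < ins ? del : ins;
            curr[j] = best < sub ? best : sub;
        }
        memcpy( prev, curr, ( m + 1 ) * sizeof *prev );
    }
    result = prev[m];
    free( prev );
    free( curr );
    return result;
}
A fitness such as 1.0 / (1.0 + distance) would then decrease as the strings diverge.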

improper mandelbrot set output plotting

I am trying to write code to display the Mandelbrot set for the numbers between
(-3,-3) and (2,2) on my terminal.
The main function generates complex numbers and feeds them to the analyze function.
The analyze function returns the character "*" for a complex number Z within the set and "." for numbers which lie outside the set.
The code:
#define MAX_A 2 // upperbound on real
#define MAX_B 2 // upper bound on imaginary
#define MIN_A -3 // lowerbnd on real
#define MIN_B -3 // lower bound on imaginary
#define NX 300 // no. of points along x
#define NY 200 // no. of points along y
#define max_its 50
int analyze(double real,double imag);
void main()
{
double a,b;
int x,x_arr,y,y_arr;
int array[NX][NY];
int res;
for(y=NY-1,x_arr=0;y>=0;y--,x_arr++)
{
for(x=0,y_arr++;x<=NX-1;x++,y_arr++)
{
a= MIN_A+ ( x/( (double)NX-1)*(MAX_A-MIN_A) );
b= MIN_B+ ( y/( (double)NY-1 )*(MAX_B-MIN_B) );
//printf("%f+i%f ",a,b);
res=analyze(a,b);
if(res>49)
array[x][y]=42;
else
array[x][y]=46;
}
// printf("\n");
}
for(y=0;y<NY;y++)
{
for(x=0;x<NX;x++)
printf("%2c",array[x][y]);
printf("\n");
}
}
The analyze function accepts a coordinate on the complex plane
and computes (Z^2)+Z up to 50 times; if the complex number explodes during the computation, the function returns immediately, otherwise it returns after finishing 50 iterations:
int analyze(double real,double imag)
{
int iter=0;
double r=4.0;
while(iter<50)
{
if ( r < ( (real*real) + (imag*imag) ) )
{
return iter;
}
real= ( (real*real) - (imag*imag) + real);
imag= ( (2*real*imag)+ imag);
iter++;
}
return iter;
}
So, I am analyzing 60000 (NX * NY) numbers and displaying them on the terminal,
considering a 3:2 ratio (300,200). I even tried 4:3 (NX:NY), but the output remains the same and the generated shape is not even close to the Mandelbrot set;
hence, the output appears inverted.
I browsed around and came across lines like:
(x - 400) / ZOOM;
(y - 300) / ZOOM;
in many Mandelbrot programs, but I am unable to understand how these lines might rectify my output.
I guess I am having trouble mapping the output to the terminal!
(LB_Real,UB_Imag) --- (UB_Real,UB_Imag)
        |                      |
(LB_Real,LB_Imag) --- (UB_Real,LB_Imag)
Any hint/help would be very useful.
The Mandelbrot recurrence is z[n+1] = z[n]^2 + c.
Here's your implementation:
real= ( (real*real) - (imag*imag) + real);
imag= ( (2*real*imag)+ imag);
Problem 1. You're updating real to its next value before you've used the old value to compute the new imag.
Problem 2. Assuming you fix problem 1, you're computing z[n+1] = z[n]^2 + z[n].
Here's how I'd do it using double:
int analyze(double cr, double ci) {
double zr = 0, zi = 0;
int r;
for (r = 0; (r < 50) && (zr*zr + zi*zi < 4.0); ++r) {
double zr1 = zr*zr - zi*zi + cr;
double zi1 = 2 * zr * zi + ci;
zr = zr1;
zi = zi1;
}
return r;
}
But it's easier to understand if you use the standard C99 support for complex numbers:
#include <complex.h>
int analyze(double cr, double ci) {
double complex c = cr + ci * I;
double complex z = 0;
int r;
for (r = 0; (r < 50) && (cabs(z) < 2); ++r) {
z = z * z + c;
}
return r;
}

How to set order few pips above order initiation bar in MQL4

I would like to create a stoploss order that will be placed above the high of the previous order's initiation bar in case this is a Sell order OR below the low of the previous order's initiation bar in case this is a Buy order.
Here is a picture to illustrate the issue ( the example depicts a sell order case ):
Any idea how to do that? The code below works fine if I use a fixed stoploss. If I replace the stoploss with variables based on High or Low, no orders are fired.
Here is my code:
//| Expert initialization function |
//+------------------------------------------------------------------+
extern int StartHour = 14;
extern int TakeProfit = 70;
extern int StopLoss = 40;
extern double Lots = 0.01;
extern int MA_period = 20;
extern int MA_period_1 = 45;
extern int RSI_period14 = 14;
extern int RSI_period12 = 12;
void OnTick() {
static bool IsFirstTick = true;
static int ticket = 0;
double R_MA = iMA( Symbol(), Period(), MA_period, 0, 0, 0, 1 );
double R_MA_Fast = iMA( Symbol(), Period(), MA_period_1, 0, 0, 0, 1 );
double R_RSI14 = iRSI( Symbol(), Period(), RSI_period14, 0, 0 );
double R_RSI12 = iRSI( Symbol(), Period(), RSI_period12, 0, 0 );
double HH = High[1];
double LL = Low[ 1];
if ( Hour() == StartHour ) {
if ( IsFirstTick == true ) {
IsFirstTick = false;
bool res1 = OrderSelect( ticket, SELECT_BY_TICKET );
if ( res1 == true ) {
if ( OrderCloseTime() == 0 ) {
bool res2 = OrderClose( ticket, Lots, OrderClosePrice(), 10 );
if ( res2 == false ) {
Alert( "Error closing order # ", ticket );
}
}
}
if ( High[1] < R_MA
&& R_RSI12 > R_RSI14
&& R_MA_Fast >= R_MA
){
ticket = OrderSend( Symbol(),
OP_BUY,
Lots,
Ask,
10,
Bid - LL * Point * 10,
Bid + TakeProfit * Point * 10,
"Set by SimpleSystem"
);
}
if ( ticket < 0 ) {
Alert( "Error Sending Order!" );
}
else {
if ( High[1] > R_MA
&& R_RSI12 > R_RSI14
&& R_MA_Fast <= R_MA
){
ticket = OrderSend( Symbol(),
OP_SELL,
Lots,
Bid,
10,
Ask + HH * Point * 10,
Ask - TakeProfit * Point * 10,
"Set by SimpleSystem"
);
}
if ( ticket < 0 ) {
Alert( "Error Sending Order!" );
}
}
}
}
else {
IsFirstTick = true;
}
}
Major issue
Once you have assigned (on each market event / quote arrival)
double HH = High[1],
LL = Low[ 1];
your OP_SELL instruction should be repaired:
ticket = OrderSend( Symbol(),
OP_SELL,
Lots,
Bid,
10,
// ----------------------v--------------------------------------
// Ask + HH * 10 * Point,
// intention was High[1] + 10 [PT]s ( if Broker allows ), right?
NormalizeDouble( HH + 10 * Point,
Digits // ALWAYS NORMALIZE FOR .XTO-s
),
// vvv----------------------------------------------------------
// Ask - TakeProfit * Point * 10, // SAFER TO BASE ON BreakEvenPT
NormalizeDouble( Ask
- TakeProfit * Point * 10,
Digits // ALWAYS NORMALIZE FOR .XTO-s
),
"Set by SimpleSystem"
);
Symmetrically review and modify the OP_BUY case.
For broker T&C collisions (these need not be reflected in a backtest), review:
MarketInfo( _Symbol, MODE_STOPLEVEL )
MarketInfo( _Symbol, MODE_FREEZELEVEL )
or inspect the STOPLEVEL distance in the MT4 Terminal: in Market Watch, right-click and choose Symbols -> Properties.
Minor Issue
Review also your code for OrderClose() -- this will fail due to having wrong Price:
// ---------------------------------------------vvvvv----------------------------
bool res2 = OrderClose( ticket, Lots, OrderClosePrice(), 10 ); // <-- wrong Price used here

efficiently find the first element matching a bit mask

I have a list of N 64-bit integers whose bits represent small sets. Each integer has at most k bits set to 1. Given a bit mask, I would like to find the first element in the list that matches the mask, i.e. element & mask == element.
Example:
If my list is:
index  abcdef
0      001100
1      001010
2      001000
3      000100
4      000010
5      000001
6      010000
7      100000
8      000000
and my mask is 111000, the first element matching the mask is at index 2.
Method 1:
Linear search through the entire list. This takes O(N) time and O(1) space.
Method 2:
Precompute a tree of all possible masks, and at each node keep the answer for that mask. This takes O(1) time for the query, but takes O(2^64) space.
Question:
How can I find the first element matching the mask faster than O(N), while still using a reasonable amount of space? I can afford to spend polynomial time in precomputation, because there will be a lot of queries. The key is that k is small. In my application, k <= 5 and N is in the thousands. The mask has many 1s; you can assume that it is drawn uniformly from the space of 64-bit integers.
Update:
Here is an example data set and a simple benchmark program that runs on Linux: http://up.thirld.com/binmask.tar.gz. For large.in, N=3779 and k=3. The first line is N, followed by N unsigned 64-bit ints representing the elements. Compile with make. Run with ./benchmark.e >large.out to create the true output, which you can then diff against. (Masks are generated randomly, but the random seed is fixed.) Then replace the find_first() function with your implementation.
The simple linear search is much faster than I expected. This is because k is small, and so for a random mask, a match is found very quickly on average.
A suffix tree (on bits) will do the trick, with the original priority at the leaf nodes:
000000 -> 8
1 -> 5
10 -> 4
100 -> 3
1000 -> 2
10 -> 1
100 -> 0
10000 -> 6
100000 -> 7
where if the bit is set in the mask, you search both arms, and if not, you search only the 0 arm; your answer is the minimum number you encounter at a leaf node.
You can improve this (marginally) by traversing the bits not in order but by maximum discriminability; in your example, note that 3 elements have bit 2 set, so you would create
2:0 0:0 1:0 3:0 4:0 5:0 -> 8
5:1 -> 5
4:1 5:0 -> 4
3:1 4:0 5:0 -> 3
1:1 3:0 4:0 5:0 -> 6
0:1 1:0 3:0 4:0 5:0 -> 7
2:1 0:0 1:0 3:0 4:0 5:0 -> 2
4:1 5:0 -> 1
3:1 4:0 5:0 -> 0
In your example mask this doesn't help (since you have to traverse both the bit2==0 and bit2==1 sides since your mask is set in bit 2), but on average it will improve the results (but at a cost of setup and more complex data structure). If some bits are much more likely to be set than others, this could be a huge win. If they're pretty close to random within the element list, then this doesn't help at all.
If you're stuck with essentially random bits set, you should get about (1-5/64)^32 benefit from the suffix tree approach on average (13x speedup), which might be better than the difference in efficiency due to using more complex operations (but don't count on it--bit masks are fast). If you have a nonrandom distribution of bits in your list, then you could do almost arbitrarily well.
This is the bitwise Kd-tree. It typically needs less than 64 visits per lookup operation. Currently, the selection of the bit (dimension) to pivot on is random.
#include <limits.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
typedef unsigned long long Thing;
typedef unsigned long Number;
unsigned thing_ffs(Thing mask);
Thing rand_mask(unsigned bitcnt);
#define WANT_RANDOM 31
#define WANT_BITS 3
#define BITSPERTHING (CHAR_BIT*sizeof(Thing))
#define NONUMBER ((Number)-1)
struct node {
Thing value;
Number num;
Number nul;
Number one;
char pivot;
} *nodes = NULL;
unsigned nodecount=0;
unsigned itercount=0;
struct node * nodes_read( unsigned *sizp, char *filename);
Number *find_ptr_to_insert(Number *ptr, Thing value, Thing mask);
unsigned grab_matches(Number *result, Number num, Thing mask);
void initialise_stuff(void);
int main (int argc, char **argv)
{
Thing mask;
Number num;
unsigned idx;
srand (time(NULL));
nodes = nodes_read( &nodecount, argv[1]);
fprintf( stdout, "Nodecount=%u\n", nodecount );
initialise_stuff();
#if WANT_RANDOM
mask = nodes[nodecount/2].value | nodes[nodecount/3].value ;
#else
mask = 0x38;
#endif
fprintf( stdout, "\n#### Search mask=%llx\n", (unsigned long long) mask );
itercount = 0;
num = NONUMBER;
idx = grab_matches(&num,0, mask);
fprintf( stdout, "Itercount=%u\n", itercount );
fprintf(stdout, "KdTree search %16llx\n", (unsigned long long) mask );
fprintf(stdout, "Count=%u Result:\n", idx);
idx = num;
if (idx >= nodecount) idx = nodecount-1;
fprintf( stdout, "num=%4u Value=%16llx\n"
,(unsigned) nodes[idx].num
,(unsigned long long) nodes[idx].value
);
fprintf( stdout, "\nLinear search %16llx\n", (unsigned long long) mask );
for (idx = 0; idx < nodecount; idx++) {
if ((nodes[idx].value & mask) == nodes[idx].value) break;
}
fprintf(stdout, "Cnt=%u\n", idx);
if (idx >= nodecount) idx = nodecount-1;
fprintf(stdout, "Num=%4u Value=%16llx\n"
, (unsigned) nodes[idx].num
, (unsigned long long) nodes[idx].value );
return 0;
}
void initialise_stuff(void)
{
unsigned num;
Number root, *ptr;
root = 0;
for (num=0; num < nodecount; num++) {
nodes[num].num = num;
nodes[num].one = NONUMBER;
nodes[num].nul = NONUMBER;
nodes[num].pivot = -1;
}
nodes[num-1].value = 0; /* last node is guaranteed to match anything */
root = 0;
for (num=1; num < nodecount; num++) {
ptr = find_ptr_to_insert (&root, nodes[num].value, 0ull );
if (*ptr == NONUMBER) *ptr = num;
else fprintf(stderr, "Found %u for %u\n"
, (unsigned)*ptr, (unsigned) num );
}
}
Thing rand_mask(unsigned bitcnt)
{
Thing value = 0;
unsigned bit, cnt;
for (cnt=0; cnt < bitcnt; cnt++) {
bit = 54321*rand();
bit %= BITSPERTHING;
value |= 1ull << bit;
}
return value;
}
Number *find_ptr_to_insert(Number *ptr, Thing value, Thing done)
{
Number num=NONUMBER;
while ( *ptr != NONUMBER) {
Thing wrong;
num = *ptr;
wrong = (nodes[num].value ^ value) & ~done;
if (nodes[num].pivot < 0) { /* This node is terminal */
/* choose one of the wrong bits for a pivot .
** For this bit (nodevalue==1 && searchmask==0 )
*/
if (!wrong) wrong = ~done ;
nodes[num].pivot = thing_ffs( wrong );
}
ptr = (wrong & 1ull << nodes[num].pivot) ? &nodes[num].nul : &nodes[num].one;
/* Once this bit has been tested, it can be masked off. */
done |= 1ull << nodes[num].pivot ;
}
return ptr;
}
unsigned grab_matches(Number *result, Number num, Thing mask)
{
Thing wrong;
unsigned count;
for (count=0; num < *result; ) {
itercount++;
wrong = nodes[num].value & ~mask;
if (!wrong) { /* we have a match */
if (num < *result) { *result = num; count++; }
/* This is cheap pruning: the break will omit both subtrees from the results.
** But because we already have a result, and the subtrees have higher numbers
** than our current num, we can ignore them. */
break;
}
if (nodes[num].pivot < 0) { /* This node is terminal */
break;
}
if (mask & 1ull << nodes[num].pivot) {
/* avoid recursion if there is only one non-empty subtree */
if (nodes[num].nul >= *result) { num = nodes[num].one; continue; }
if (nodes[num].one >= *result) { num = nodes[num].nul; continue; }
count += grab_matches(result, nodes[num].nul, mask);
count += grab_matches(result, nodes[num].one, mask);
break;
}
mask |= 1ull << nodes[num].pivot;
num = (wrong & 1ull << nodes[num].pivot) ? nodes[num].nul : nodes[num].one;
}
return count;
}
unsigned thing_ffs(Thing mask)
{
unsigned bit;
#if 1
if (!mask) return (unsigned)-1;
for ( bit=random() % BITSPERTHING; 1 ; bit += 5, bit %= BITSPERTHING) {
if (mask & 1ull << bit ) return bit;
}
#elif 0
for (bit =0; bit < BITSPERTHING; bit++ ) {
if (mask & 1ull <<bit) return bit;
}
#else
mask &= (mask-1); // Kernighan-trick
for (bit =0; bit < BITSPERTHING; bit++ ) {
mask >>=1;
if (!mask) return bit;
}
#endif
return 0xffffffff;
}
struct node * nodes_read( unsigned *sizp, char *filename)
{
struct node *ptr;
unsigned size,used;
FILE *fp;
if (!filename) {
size = (WANT_RANDOM+0) ? WANT_RANDOM : 9;
ptr = malloc (size * sizeof *ptr);
#if (!WANT_RANDOM)
ptr[0].value = 0x0c;
ptr[1].value = 0x0a;
ptr[2].value = 0x08;
ptr[3].value = 0x04;
ptr[4].value = 0x02;
ptr[5].value = 0x01;
ptr[6].value = 0x10;
ptr[7].value = 0x20;
ptr[8].value = 0x00;
#else
for (used=0; used < size; used++) {
ptr[used].value = rand_mask(WANT_BITS);
}
#endif /* WANT_RANDOM */
*sizp = size;
return ptr;
}
fp = fopen( filename, "r" );
if (!fp) return NULL;
fscanf(fp,"%u\n", &size );
fprintf(stderr, "Size=%u\n", size);
ptr = malloc (size * sizeof *ptr);
for (used = 0; used < size; used++) {
fscanf(fp,"%llu\n", &ptr[used].value );
}
fclose( fp );
*sizp = used;
return ptr;
}
UPDATE:
I experimented a bit with the pivot-selection, favouring bits with the highest discriminatory value ("information content"). This involves:
making a histogram of the usage of bits (can be done while initialising)
while building the tree: choosing the one with frequency closest to 1/2 in the remaining subtrees.
The result: the random pivot selection performed better.
Construct a binary tree as follows:
Every level corresponds to a bit.
If the corresponding bit is on, go right; otherwise go left.
Insert every number in the database this way.
Now, for searching: if the corresponding bit in the mask is 1, traverse both children. If it is 0, traverse only the left node. Essentially keep traversing the tree until you hit a leaf node (BTW, 0 is a hit for every mask!). A small sketch of this idea in C follows the example tree below.
This tree will have O(N) space requirements.
Eg of a tree for 1 (001), 2 (010) and 5 (101):
          root
         /    \
        0      1
       / \     |
      0   1    0
      |   |    |
      1   0    1
     (1) (2)  (5)
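A rough sketch of that trie in C (the types and function names are mine, assuming 64-bit elements and walking the bits from most significant to least significant):
#include <stdint.h>
#include <stdlib.h>
/* One trie node per bit position; every node carries the smallest list
   index that reaches it. UINT32_MAX means "no element below this node". */
struct trie {
    struct trie *child[2];
    uint32_t first_index;
};
static struct trie *trie_new(void)
{
    struct trie *t = calloc(1, sizeof *t);
    t->first_index = UINT32_MAX;
    return t;
}
/* Insert one element (64 bits, MSB first), keeping the smallest index per node. */
static void trie_insert(struct trie *root, uint64_t element, uint32_t index)
{
    struct trie *t = root;
    for (int bit = 63; bit >= 0; bit--) {
        if (index < t->first_index)
            t->first_index = index;
        int b = (element >> bit) & 1;
        if (!t->child[b])
            t->child[b] = trie_new();
        t = t->child[b];
    }
    if (index < t->first_index)
        t->first_index = index;
}
/* Smallest index whose element & mask == element. If the mask bit is 1 we may
   take either branch; if it is 0 we may only take the 0 branch.
   Call as trie_query(root, mask, 63). */
static uint32_t trie_query(const struct trie *t, uint64_t mask, int bit)
{
    if (!t)
        return UINT32_MAX;
    if (bit < 0)
        return t->first_index;
    uint32_t best = trie_query(t->child[0], mask, bit - 1);
    if ((mask >> bit) & 1) {
        uint32_t r = trie_query(t->child[1], mask, bit - 1);
        if (r < best)
            best = r;
    }
    return best;
}
The first_index stored in internal nodes could additionally be used to prune subtrees that cannot beat the best answer found so far; the sketch omits that for brevity.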
Another approach uses precomputed bitmasks. Formally it is still O(N), since the and-mask operations are O(N). The final pass is also O(N), because it needs to find the lowest bit set, but that could be sped up, too.
#include <limits.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
/* For demonstration purposes.
** In reality, this should be an unsigned long long */
typedef unsigned char Thing;
#define BITSPERTHING (CHAR_BIT*sizeof (Thing))
#define COUNTOF(a) (sizeof a / sizeof a[0])
Thing data[] =
/****** index abcdef */
{ 0x0c /* 0 001100 */
, 0x0a /* 1 001010 */
, 0x08 /* 2 001000 */
, 0x04 /* 3 000100 */
, 0x02 /* 4 000010 */
, 0x01 /* 5 000001 */
, 0x10 /* 6 010000 */
, 0x20 /* 7 100000 */
, 0x00 /* 8 000000 */
};
/* Note: this is for demonstration purposes.
** Normally, one should choose a machine wide unsigned int
** for bitmask arrays.
*/
struct bitmap {
char data[ 1+COUNTOF (data)/ CHAR_BIT ];
} nulmaps [ BITSPERTHING ];
#define BITSET(a,i) (a)[(i) / CHAR_BIT ] |= (1u << ((i)%CHAR_BIT) )
#define BITTEST(a,i) ((a)[(i) / CHAR_BIT ] & (1u << ((i)%CHAR_BIT) ))
void init_tabs(void);
void map_empty(struct bitmap *dst);
void map_full(struct bitmap *dst);
void map_and2(struct bitmap *dst, struct bitmap *src);
int main (void)
{
Thing mask;
struct bitmap result;
unsigned ibit;
mask = 0x38;
init_tabs();
map_full(&result);
for (ibit = 0; ibit < BITSPERTHING; ibit++) {
/* bit in mask is 1, so bit at this position is in fact a don't care */
if (mask & (1u <<ibit)) continue;
/* bit in mask is 0, so we can only select items with a 0 at this bitpos */
map_and2(&result, &nulmaps[ibit] );
}
/* This is not the fastest way to find the lowest 1 bit */
for (ibit = 0; ibit < COUNTOF (data); ibit++) {
if (!BITTEST(result.data, ibit) ) continue;
fprintf(stdout, " %u", ibit);
}
fprintf( stdout, "\n" );
return 0;
}
void init_tabs(void)
{
unsigned ibit, ithing;
/* 1 bits in data that don't overlap with 1 bits in the searchmask are showstoppers.
** So, for each bitpos, we precompute a bitmask of all *entrynumbers* from data[], that contain 0 in bitpos.
*/
memset(nulmaps, 0 , sizeof nulmaps);
for (ithing=0; ithing < COUNTOF(data); ithing++) {
for (ibit=0; ibit < BITSPERTHING; ibit++) {
if ( data[ithing] & (1u << ibit) ) continue;
BITSET(nulmaps[ibit].data, ithing);
}
}
}
/* Logical AND of two bitmask arrays; similar to dst &= src */
void map_and2(struct bitmap *dst, struct bitmap *src)
{
unsigned idx;
for (idx = 0; idx < COUNTOF(dst->data); idx++) {
dst->data[idx] &= src->data[idx] ;
}
}
void map_empty(struct bitmap *dst)
{
memset(dst->data, 0 , sizeof dst->data);
}
void map_full(struct bitmap *dst)
{
unsigned idx;
/* NOTE this loop sets too many bits to the left of COUNTOF(data) */
for (idx = 0; idx < COUNTOF(dst->data); idx++) {
dst->data[idx] = ~0;
}
}

Speed comparison between recursive and nonrecursive implementation

I have a complex algorithm which uses really deep recursion. Because it overflows the stack on some specific data, I have tried to rewrite it without recursion (using an external stack on the heap). So I have two versions of the same algorithm. I then performed some tests and found out that the recursive implementation is much faster than the other one.
Can someone explain this to me, please? It is part of my final university project to discuss these results (why one implementation is so much faster than the other). I think it is because of different caching behaviour of the stack and the heap, but I am not sure.
Thanks a lot!
EDIT
OK, here is the code. The algorithm is written in C++ and solves the tree isomorphism problem. Both implementations are the same except for one method, which compares two nodes. The comparison is defined recursively: one node is less than another if one of its children is less than the corresponding child of the other node.
Recursive version
char compareTo( const IMisraNode * nodeA, const IMisraNode * nodeB ) const {
// comparison of same number of children
int min = std::min( nodeA->getDegree( ), nodeB->getDegree( ) );
for ( int i = 0; i < min; ++i ) {
char res = compareTo( nodeA->getChild( i ), nodeB->getChild( i ) );
if ( res < 0 ) return -1;
if ( res > 0 ) return 1;
}
if ( nodeA->getDegree( ) == nodeB->getDegree( ) ) return 0; // same number of children
else if ( nodeA->getDegree( ) == min ) return -1;
else return 1;
}
Nonrecursive implementation
struct Comparison {
const IMisraNode * nodeA;
const IMisraNode * nodeB;
int i;
int min; // minimum of count of children
Comparison( const IMisraNode * nodeA, const IMisraNode * nodeB ) :
nodeA( nodeA ), nodeB( nodeB ),
i( 0 ), min( std::min( nodeA->getDegree( ), nodeB->getDegree( ) ) ) { }
} ;
char compareTo( const IMisraNode * nodeA, const IMisraNode * nodeB ) const {
Comparison * cmp = new Comparison( nodeA, nodeB );
// stack on the heap
std::stack<Comparison * > stack;
stack.push( cmp );
char result = 0; // result, the equality is assumed
while ( !result && !stack.empty( ) ) { // while they are not same and there are nodes left
cmp = stack.top( );
// comparison of same children
if ( cmp->i < cmp->min ) {
// compare these children
stack.push( new Comparison( cmp->nodeA->getChild( cmp->i ), cmp->nodeB->getChild( cmp->i ) ) );
++cmp->i; // next node
continue; // continue in comparing on next level
}
if ( cmp->nodeA->getDegree( ) != cmp->nodeB->getDegree( ) ) { // count of children is not same
if ( cmp->nodeA->getDegree( ) == cmp->min ) result = -1; // node A has lesser count of children
else result = 1;
}
delete cmp;
stack.pop( );
}
while ( !stack.empty( ) ) { // clean stack
delete stack.top( );
stack.pop( );
}
return result;
}
Your non-recursive code does dynamic memory allocation (explicitly with new, and implicitly by your use of std::stack), while the recursive one does not. Dynamic memory allocation is an extremely expensive operation.
To speed things up, try storing values, not pointers:
stack <Comparison> astack;
then code like:
astack.push( Comparison( cmp->nodeA->getChild( cmp->i ), cmp->nodeB->getChild( cmp->i ) ) );
Comparison cp = astack.top();
This doesn't answer your speed-comparison question, but rather suggests ways to increase stack size for your recursive solution.
You can increase the stack size (default: 1MB) under VC++: search "stacksize" in Visual Studio help.
And you can do the same under gcc. There's an SO discussion on that very question.
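Under Linux/gcc, for example, one way is to raise the RLIMIT_STACK soft limit before the deep recursion starts, either with ulimit -s in the shell or programmatically with setrlimit(); the snippet below is only a sketch of the latter (the function name is mine):
#include <sys/resource.h>
/* Raise the soft stack limit to 64 MiB (capped by the hard limit).
   On Linux the main thread's stack can then grow up to this size on demand. */
int raise_stack_limit(void)
{
    struct rlimit rl;
    rlim_t wanted = (rlim_t)64 * 1024 * 1024;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    if (rl.rlim_cur < wanted && wanted <= rl.rlim_max) {
        rl.rlim_cur = wanted;
        if (setrlimit(RLIMIT_STACK, &rl) != 0)
            return -1;
    }
    return 0;
}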
