Benchmark problems for testing concurency - performance

For one of the projects I'm doing right now, I need to look at the performance (amongst other things) of different concurrent enabled programming languages.
At the moment I'm looking into comparing stackless python and C++ PThreads, so the focus is on these two languages, but other languages will probably be tested later. Ofcourse the comparison must be as representative and accurate as possible, so my first thought was to start looking for some standard concurrent/multi-threaded benchmark problems, alas I couldn't find any decent or standard, tests/problems/benchmarks.
So my question is as follows: Do you have a suggestion for a good, easy or quick problem to test the performance of the programming language (and to expose it's strong and weak points in the process)?

Surely you should be testing hardware and compilers rather than a language for concurrency performance?
I would be looking at a language from the point of view of how easy and productive it is in terms of concurrency and how much it 'insulates' the programmer from making locking mistakes.
EDIT: from past experience as a researcher designing parallel algorithms, I think you will find in most cases the concurrent performance will depend largely on how an algorithm is parallelised, and how it targets the underlying hardware.
Also, benchmarks are notoriously unequal; this is even more so in a parallel environment. For instance, a benchmark that 'crunches' very large matrices would be suited to a vector pipeline processor, whereas a parallel sort might be better suited to more general purpose multi core CPUs.
These might be useful:
Parallel Benchmarks
NAS Parallel Benchmarks

Well, there are a few classics, but different tests emphasize different features. Some distributed systems may be more robust, have more efficient message-passing, etc. Higher message overhead can hurt scalability, since it the normal way to scale up to more machines is to send a larger number of small messages. Some classic problems you can try are a distributed Sieve of Eratosthenes or a poorly implemented fibonacci sequence calculator (i.e. to calculate the 8th number in the series, spin of a machine for the 7th, and another for the 6th). Pretty much any divide-and-conquer algorithm can be done concurrently. You could also do a concurrent implementation of Conway's game of life or heat transfer. Note that all of these algorithms have different focuses and thus you probably will not get one distributed system doing the best in all of them.
I'd say the easiest one to implement quickly is the poorly implemented fibonacci calculator, though it places too much emphasis on creating threads and too little on communication between those threads.

Surely you should be testing hardware
and compilers rather than a language
for concurrency performance?
No, hardware and compilers are irrelevant for my testing purposes. I'm just looking for some good problems that can test how well code, written in one language, can compete against code from another language. I'm really testing the constructs available in the specific languages to do concurrent programming. And one of the criteria is performance (measured in time).
Some of the other test criteria I'm looking for are:
how easy is it to write correct code; because as we all know concurrent programming is harder then writing single threaded programs
what is the technique used to to concurrent programming: event-driven, actor based, message parsing, ...
how much code must be written by the programmer himself and how much is done automatically for him: this can also be tested with the given benchmark problems
what's the level of abstraction and how much overhead is involved when translated back to machine code
So actually, I'm not looking for performance as the only and best parameter (which would indeed send me to the hardware and the compilers instead of the language itself), I'm actually looking from a programmers point of view to check what language is best suited for what kind of problems, what it's weaknesses and strengths are and so on...
Bare in mind that this is just a small project and the tests are therefore to be kept small as well. (rigorous testing of everything is therefore not feasible)

I have decided to use the Mandelbrot set (the escape time algorithm to be more precise) to benchmark the different languages.
It fits me quite well as the original algorithm can easily be implemented and creating the multi threaded variant from it is not that much work.
below is the code I currently have. It is still a single threaded variant, but I'll update it as soon as I'm satisfied with the result.
#include <cstdlib> //for atoi
#include <iostream>
#include <iomanip> //for setw and setfill
#include <vector>
int DoThread(const double x, const double y, int maxiter) {
double curX,curY,xSquare,ySquare;
int i;
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++) {
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
return i;
}
void SingleThreaded(int horizPixels, int vertPixels, int maxiter, std::vector<std::vector<int> >& result) {
for(int x = horizPixels; x > 0; x--) {
for(int y = vertPixels; y > 0; y--) {
//3.0 -> so we always have -1.5 -> 1.5 as the window; (x - (horizPixels / 2) will go from -horizPixels/2 to +horizPixels/2
result[x-1][y-1] = DoThread((3.0 / horizPixels) * (x - (horizPixels / 2)),(3.0 / vertPixels) * (y - (vertPixels / 2)),maxiter);
}
}
}
int main(int argc, char* argv[]) {
//first arg = length along horizontal axis
int horizPixels = atoi(argv[1]);
//second arg = length along vertical axis
int vertPixels = atoi(argv[2]);
//third arg = iterations
int maxiter = atoi(argv[3]);
//fourth arg = threads
int threadCount = atoi(argv[4]);
std::vector<std::vector<int> > result(horizPixels, std::vector<int>(vertPixels,0)); //create and init 2-dimensional vector
SingleThreaded(horizPixels, vertPixels, maxiter, result);
//TODO: remove these lines
for(int y = 0; y < vertPixels; y++) {
for(int x = 0; x < horizPixels; x++) {
std::cout << std::setw(2) << std::setfill('0') << std::hex << result[x][y] << " ";
}
std::cout << std::endl;
}
}
I've tested it with gcc under Linux, but I'm sure it works under other compilers/Operating Systems as well. To get it to work you have to enter some command line arguments like so:
mandelbrot 106 500 255 1
the first argument is the width (x-axis)
the second argument is the height (y-axis)
the third argument is the number of maximum iterations (the number of colors)
the last ons is the number of threads (but that one is currently not used)
on my resolution, the above example gives me a nice ASCII-art representation of a Mandelbrot set. But try it for yourself with different arguments (the first one will be the most important one, as that will be the width)

Below you can find the code I hacked together to test the multi threaded performance of pthreads. I haven't cleaned it up and no optimizations have been made; so the code is a bit raw.
the code to save the calculated mandelbrot set as a bitmap is not mine, you can find it here
#include <cstdlib> //for atoi
#include <iostream>
#include <iomanip> //for setw and setfill
#include <vector>
#include "bitmap_Image.h" //for saving the mandelbrot as a bmp
#include <pthread.h>
pthread_mutex_t mutexCounter;
int sharedCounter(0);
int percent(0);
int horizPixels(0);
int vertPixels(0);
int maxiter(0);
//doesn't need to be locked
std::vector<std::vector<int> > result; //create 2 dimensional vector
void *DoThread(void *null) {
double curX,curY,xSquare,ySquare,x,y;
int i, intx, inty, counter;
counter = 0;
do {
counter++;
pthread_mutex_lock (&mutexCounter); //lock
intx = int((sharedCounter / vertPixels) + 0.5);
inty = sharedCounter % vertPixels;
sharedCounter++;
pthread_mutex_unlock (&mutexCounter); //unlock
//exit thread when finished
if (intx >= horizPixels) {
std::cout << "exited thread - I did " << counter << " calculations" << std::endl;
pthread_exit((void*) 0);
}
//set x and y to the correct value now -> in the range like singlethread
x = (3.0 / horizPixels) * (intx - (horizPixels / 1.5));
y = (3.0 / vertPixels) * (inty - (vertPixels / 2));
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++){
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
result[intx][inty] = i;
} while (true);
}
int DoSingleThread(const double x, const double y) {
double curX,curY,xSquare,ySquare;
int i;
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++){
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
return i;
}
void SingleThreaded(std::vector<std::vector<int> >& result) {
for(int x = horizPixels - 1; x != -1; x--) {
for(int y = vertPixels - 1; y != -1; y--) {
//3.0 -> so we always have -1.5 -> 1.5 as the window; (x - (horizPixels / 2) will go from -horizPixels/2 to +horizPixels/2
result[x][y] = DoSingleThread((3.0 / horizPixels) * (x - (horizPixels / 1.5)),(3.0 / vertPixels) * (y - (vertPixels / 2)));
}
}
}
void MultiThreaded(int threadCount, std::vector<std::vector<int> >& result) {
/* Initialize and set thread detached attribute */
pthread_t thread[threadCount];
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for (int i = 0; i < threadCount - 1; i++) {
pthread_create(&thread[i], &attr, DoThread, NULL);
}
std::cout << "all threads created" << std::endl;
for(int i = 0; i < threadCount - 1; i++) {
pthread_join(thread[i], NULL);
}
std::cout << "all threads joined" << std::endl;
}
int main(int argc, char* argv[]) {
//first arg = length along horizontal axis
horizPixels = atoi(argv[1]);
//second arg = length along vertical axis
vertPixels = atoi(argv[2]);
//third arg = iterations
maxiter = atoi(argv[3]);
//fourth arg = threads
int threadCount = atoi(argv[4]);
result = std::vector<std::vector<int> >(horizPixels, std::vector<int>(vertPixels,21)); // init 2-dimensional vector
if (threadCount <= 1) {
SingleThreaded(result);
} else {
MultiThreaded(threadCount, result);
}
//TODO: remove these lines
bitmapImage image(horizPixels, vertPixels);
for(int y = 0; y < vertPixels; y++) {
for(int x = 0; x < horizPixels; x++) {
image.setPixelRGB(x,y,16777216*result[x][y]/maxiter % 256, 65536*result[x][y]/maxiter % 256, 256*result[x][y]/maxiter % 256);
//std::cout << std::setw(2) << std::setfill('0') << std::hex << result[x][y] << " ";
}
std::cout << std::endl;
}
image.saveToBitmapFile("~/Desktop/test.bmp",32);
}
good results can be obtained using the program with the following arguments:
mandelbrot 5120 3840 256 3
that way you will get an image that is 5 * 1024 wide; 5 * 768 high with 256 colors (alas you will only get 1 or 2) and 3 threads (1 main thread that doesn't do any work except creating the worker threads, and 2 worker threads)

Since the benchmarks game moved to a quad-core machine September 2008, many programs in different programming languages have been re-written to exploit quad-core - for example, the first 10 mandelbrot programs.

Related

Improving the Efficiency of Compact/Scatter in CUDA

Summary:
Any ideas about how to further improve upon the basic scatter operation in CUDA? Especially if one knows it will only be used to compact a larger array into a smaller one? or why the below methods of vectorizing memory ops and shared memory didn't work? I feel like there may be something fundamental I am missing and any help would be appreciated.
EDIT 03/09/15: So I found this Parallel For All Blog post "Optimized Filtering with Warp-Aggregated Atomics". I had assumed atomics would be intrinsically slower for this purpose, however I was wrong - especially since I don't think I care about maintaining element order in the array during my simulation. I'll have to think about it some more and then implement it to see what happens!
EDIT 01/04/16: I realized I never wrote about my results. Unfortunately in that Parallel for All Blog post they compared the global atomic method for compact to the Thrust prefix-sum compact method, which is actually quite slow. CUB's Device::IF is much faster than Thrust's - as is the prefix-sum version I wrote using CUB's Device::Scan + custom code. The warp-aggregrate global atomic method is still faster by about 5-10%, but nowhere near the 3-4x faster I had been hoping for based on the results in the blog. I'm still using the prefix-sum method as while maintaining element order is not necessary, I prefer the consistency of the prefix-sum results and the advantage from the atomics is not very big. I still try various methods to improve compact, but so far only marginal improvements (2%) at best for dramatically increased code complexity.
Details:
I am writing a simulation in CUDA where I compact out elements I am no longer interested in simulating every 40-60 time steps. From profiling it seems that the scatter op takes up the most amount of time when compacting - more so than the filter kernel or the prefix sum. Right now I use a pretty basic scatter function:
__global__ void scatter_arrays(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
int myID = blockIdx.x*blockDim.x + threadIdx.x;
for(int id = myID; id < freq_Index; id+= blockDim.x*gridDim.x){
if(flag[id]){
new_freq[scan_Index[id]] = freq[id];
}
}
}
freq_Index is the number of elements in the old array. The flag array is the result from the filter. Scan_ID is the result from the prefix sum on the flag array.
Attempts I've made to improve it are to read the flagged frequencies into shared memory first and then write from shared memory to global memory - the idea being that the writes to global memory would be more coalesced amongst the warps (e.g. instead of thread 0 writing to position 0 and thread 128 writing to position 1, thread 0 would write to 0 and thread 1 would write to 1). I also tried vectorizing the reads and the writes - instead of reading and writing floats/ints I read/wrote float4/int4 from the global arrays when possible, so four numbers at a time. This I thought might speed up the scatter by having fewer memory ops transferring larger amounts of memory. The "kitchen sink" code with both vectorized memory loads/stores and shared memory is below:
const int compact_threads = 256;
__global__ void scatter_arrays2(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
int gID = blockIdx.x*blockDim.x + threadIdx.x; //global ID
int tID = threadIdx.x; //thread ID within block
__shared__ float row[4*compact_threads];
__shared__ int start_index[1];
__shared__ int end_index[1];
float4 myResult;
int st_index;
int4 myFlag;
int4 index;
for(int id = gID; id < freq_Index/4; id+= blockDim.x*gridDim.x){
if(tID == 0){
index = reinterpret_cast<const int4*>(scan_Index)[id];
myFlag = reinterpret_cast<const int4*>(flag)[id];
start_index[0] = index.x;
st_index = index.x;
myResult = reinterpret_cast<const float4*>(freq)[id];
if(myFlag.x){ row[0] = myResult.x; }
if(myFlag.y){ row[index.y-st_index] = myResult.y; }
if(myFlag.z){ row[index.z-st_index] = myResult.z; }
if(myFlag.w){ row[index.w-st_index] = myResult.w; }
}
__syncthreads();
if(tID > 0){
myFlag = reinterpret_cast<const int4*>(flag)[id];
st_index = start_index[0];
index = reinterpret_cast<const int4*>(scan_Index)[id];
myResult = reinterpret_cast<const float4*>(freq)[id];
if(myFlag.x){ row[index.x-st_index] = myResult.x; }
if(myFlag.y){ row[index.y-st_index] = myResult.y; }
if(myFlag.z){ row[index.z-st_index] = myResult.z; }
if(myFlag.w){ row[index.w-st_index] = myResult.w; }
if(tID == blockDim.x -1 || gID == mutations_Index/4 - 1){ end_index[0] = index.w + myFlag.w; }
}
__syncthreads();
int count = end_index[0] - st_index;
int rem = st_index & 0x3; //equivalent to modulo 4
int offset = 0;
if(rem){ offset = 4 - rem; }
if(tID < offset && tID < count){
new_mutations_freq[population*new_array_Length+st_index+tID] = row[tID];
}
int tempID = 4*tID+offset;
if((tempID+3) < count){
reinterpret_cast<float4*>(new_freq)[tID] = make_float4(row[tempID],row[tempID+1],row[tempID+2],row[tempID+3]);
}
tempID = tID + offset + (count-offset)/4*4;
if(tempID < count){ new_freq[st_index+tempID] = row[tempID]; }
}
int id = gID + freq_Index/4 * 4;
if(id < freq_Index){
if(flag[id]){
new_freq[scan_Index[id]] = freq[id];
}
}
}
Obviously it gets a bit more complicated. :) While the above kernel seems stable when there are hundreds of thousands of elements in the array, I've noticed a race condition when the array numbers in the tens of millions. I'm still trying to track the bug down.
But regardless, neither method (shared memory or vectorization) together or alone improved performance. I was especially surprised by the lack of benefit from vectorizing the memory ops. It had helped in other functions I had written, though now I am wondering if maybe it helped because it increased Instruction-Level-Parallelism in the calculation steps of those other functions rather than the fewer memory ops.
I found the algorithm mentioned in this poster (similar algorithm also discussed in this paper) works pretty well, especially for compacting large arrays. It uses less memory to do it and is slightly faster than my previous method (5-10%). I put in a few tweaks to the poster's algorithm: 1) eliminating the final warp shuffle reduction in phase 1, can simply sum the elements as they are calculated, 2) giving the function the ability to work over more than just arrays sized as a multiple of 1024 + adding grid-strided loops, and 3) allowing each thread to load their registers simultaneously in phase 3 instead of one at a time. I also use CUB instead of Thrust for Inclusive sum for faster scans. There may be more tweaks I can make, but for now this is good.
//kernel phase 1
int myID = blockIdx.x*blockDim.x + threadIdx.x;
//padded_length is nearest multiple of 1024 > true_length
for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
int lnID = threadIdx.x % warp_size;
int warpID = id >> 5;
unsigned int mask;
unsigned int cnt=0;//;//
for(int j = 0; j < 32; j++){
int index = (warpID<<10)+(j<<5)+lnID;
bool pred;
if(index > true_length) pred = false;
else pred = predicate(input[index]);
mask = __ballot(pred);
if(lnID == 0) {
flag[(warpID<<5)+j] = mask;
cnt += __popc(mask);
}
}
if(lnID == 0) counter[warpID] = cnt; //store sum
}
//kernel phase 2 -> CUB Inclusive sum transforms counter array to scan_Index array
//kernel phase 3
int myID = blockIdx.x*blockDim.x + threadIdx.x;
for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
int lnID = threadIdx.x % warp_size;
int warpID = id >> 5;
unsigned int predmask;
unsigned int cnt;
predmask = flag[(warpID<<5)+lnID];
cnt = __popc(predmask);
//parallel prefix sum
#pragma unroll
for(int offset = 1; offset < 32; offset<<=1){
unsigned int n = __shfl_up(cnt, offset);
if(lnID >= offset) cnt += n;
}
unsigned int global_index = 0;
if(warpID > 0) global_index = scan_Index[warpID - 1];
for(int i = 0; i < 32; i++){
unsigned int mask = __shfl(predmask, i); //broadcast from thread i
unsigned int sub_group_index = 0;
if(i > 0) sub_group_index = __shfl(cnt, i-1);
if(mask & (1 << lnID)){
compacted_array[global_index + sub_group_index + __popc(mask & ((1 << lnID) - 1))] = input[(warpID<<10)+(i<<5)+lnID];
}
}
}
}
EDIT: There is a newer article by a subset of the poster authors where they examine a faster variation of compact than what is written above. However, their new version is not order preserving, so not useful for myself and I haven't implemented it to test it out. That said, if your project doesn't rely on object order, their newer compact version can probably speed up your algorithm.

Parallel multiplication of many small matrices by fixed vector

Situation is the following: I have a number (1000s) of elements which are given by small matrices of dimensions 4x2, 9x3 ... you get the idea. All matrices have the same dimension.
I want to multiply each of these matrices with a fixed vector of precalculated values. In short:
for(i = 1...n)
X[i] = M[i] . N;
What is the best approach to do this in parallel using Thrust? How do I lay out my data in memory?
NB: There might be specialized, more suitable libraries to do this on GPUs. I'm interested in Thrust because it allows me to deploy to different backends, not just CUDA.
One possible approach:
flatten the arrays (matrices) into a single data vector. This is an advantageous step for enabling general thrust processing anyway.
use a strided range mechanism to take your scaling vector and extend it to the overall length of your flattened data vector
use thrust::transform with thrust::multiplies to multiply the two vectors together.
If you need to access the matrices later out of your flattened data vector (or result vector), you can do so with pointer arithmetic, or a combination of fancy iterators.
If you need to re-use the extended scaling vector, you may want to use the method outlined in step 2 exactly (i.e. create an actual vector using that method, length = N matrices, repeated). If you are only doing this once, you can achieve the same effect with a counting iterator, followed by a transform iterator (modulo the length of your matrix in elements), followed by a permutation iterator, to index into your original scaling vector (length = 1 matrix).
The following example implements the above, without using the strided range iterator method:
#include <iostream>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/transform.h>
#define N_MAT 1000
#define H_MAT 4
#define W_MAT 3
#define RANGE 1024
struct my_modulo_functor : public thrust::unary_function<int, int>
{
__host__ __device__
int operator() (int idx) {
return idx%(H_MAT*W_MAT);}
};
int main(){
thrust::host_vector<int> data(N_MAT*H_MAT*W_MAT);
thrust::host_vector<int> scale(H_MAT*W_MAT);
// synthetic; instead flatten/copy matrices into data vector
for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++) data[i] = rand()%RANGE;
for (int i = 0; i < H_MAT*W_MAT; i++) scale[i] = rand()%RANGE;
thrust::device_vector<int> d_data = data;
thrust::device_vector<int> d_scale = scale;
thrust::device_vector<int> d_result(N_MAT*H_MAT*W_MAT);
thrust::transform(d_data.begin(), d_data.end(), thrust::make_permutation_iterator(d_scale.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), my_modulo_functor())) ,d_result.begin(), thrust::multiplies<int>());
thrust::host_vector<int> result = d_result;
for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++)
if (result[i] != data[i] * scale[i%(H_MAT*W_MAT)]) {std::cout << "Mismatch at: " << i << " cpu result: " << (data[i] * scale[i%(H_MAT*W_MAT)]) << " gpu result: " << result[i] << std::endl; return 1;}
std::cout << "Success!" << std::endl;
return 0;
}
EDIT: Responding to a question below:
The benefit of fancy iterators (i.e. transform(numbers, iterator)) is that they often allow for eliminaion of extra data copies/data movement, as compared to assembling other number (which requires extra steps and data movement) and then passing it to transform(numbers, other numbers). If you're only going to use other numbers once, then the fancy iterators will generally be better. If you're going to use other numbers again, then you may want to assemble it explicitly. This preso is instructive, in particular "Fusion".
For a one-time use of other numbers the overhead of assembling it on the fly using fancy iterators and the functor is generally lower than explicitly creating a new vector, and then passing that new vector to the transform routine.
When looking for a software library which is concisely made for multiplying small matrices, then one may have a look at https://github.com/hfp/libxsmm. Below, the code requests a specialized matrix kernel according to the typical GEMM parameters (please note that some limitations apply).
double alpha = 1, beta = 1;
const char transa = 'N', transb = 'N';
int flags = LIBXSMM_GEMM_FLAGS(transa, transb);
int prefetch = LIBXSMM_PREFETCH_AUTO;
libxsmm_blasint m = 23, n = 23, k = 23;
libxsmm_dmmfunction xmm = NULL;
xmm = libxsmm_dmmdispatch(m, n, k,
&m/*lda*/, &k/*ldb*/, &m/*ldc*/,
&alpha, &beta, &flags, &prefetch);
Given the above code, one can proceed and run "xmm" for an entire series of (small) matrices without a particular data structure (below code also uses "prefetch locations").
if (0 < n) { /* check that n is at least 1 */
# pragma parallel omp private(i)
for (i = 0; i < (n - 1); ++i) {
const double *const ai = a + i * asize;
const double *const bi = b + i * bsize;
double *const ci = c + i * csize;
xmm(ai, bi, ci, ai + asize, bi + bsize, ci + csize);
}
xmm(a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize,
/* pseudo prefetch for last element of batch (avoids page fault) */
a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize);
}
In addition to the manual loop control as shown above, libxsmm_gemm_batch (or libxsmm_gemm_batch_omp) can be used (see ReadTheDocs). The latter is useful if data structures exist that describe the series of operands (A, B, and C matrices).
There are two reasons why this library gives superior performance: (1) on-the-fly code specialization using an in-memory code generation technique, and (2) loading the next matrix operands while calculating the current product.
( Given one is looking for something that blends well with C/C++, this library supports it. However, it does not aim for CUDA/Thrust. )

Hot content algorithm / score with time decay

I have been reading + researching on algorithms and formulas to work out a score for my user submitted content to display currently hot / trending items higher up the list, however i'll admit i'm a little over my head here.
I'll give some background on what i'm after... users upload audio to my site, audios have several actions:
Played
Downloaded
Liked
Favorited
Ideally i want an algorithm where I can update an audios score each time a new activity is logged (played, download etc...), also a download action is worth more than a play, like more than a download and a favourite more than a like.
If possible i would like for audios older than 1 week to drop off quite sharply from the list to give newer content more of a chance of trending.
I have read about reddits algorithm which looked good, but i'm in over my head on how to tweak it to make use of my multiple variables, and to drop off older articles after around 7 days.
Some articles that we're interesting:
https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9 (reddits algo)
http://www.evanmiller.org/rank-hotness-with-newtons-law-of-cooling.html
Any help is appreciated!
Paul
Reddits old formula and a little drop off
Basically you can use Reddit's formula. Since your system only supports upvotes you could weight them, resulting in something like this:
def hotness(track)
s = track.playedCount
s = s + 2*track.downloadCount
s = s + 3*track.likeCount
s = s + 4*track.favCount
baseScore = log(max(s,1))
timeDiff = (now - track.uploaded).toWeeks
if(timeDiff > 1)
x = timeDiff - 1
baseScore = baseScore * exp(-8*x*x)
return baseScore
The factor exp(-8*x*x) will give you your desired drop off:
The basics behind
You can use any function that goes to zero faster than your score goes up. Since we use log on our score, even a linear function can get multiplied (as long as your score doesn't grow exponentially).
So all you need is a function that returns 1 as long as you don't want to modify the score, and drops afterwards. Our example above forms that function:
multiplier(x) = x > 1 ? exp(-8*x*x) : 1
You can vary the multiplier if you want less steep curves.
Example in C++
Lets say that the probability for a given track to be played in a given hour is 50%, download 10%, like 1% and favorite 0.1%. Then the following C++ program will give you an estimate for your scores behavior:
#include <iostream>
#include <fstream>
#include <random>
#include <ctime>
#include <cmath>
struct track{
track() : uploadTime(0),playCount(0),downCount(0),likeCount(0),faveCount(0){}
std::time_t uploadTime;
unsigned int playCount;
unsigned int downCount;
unsigned int likeCount;
unsigned int faveCount;
void addPlay(unsigned int n = 1){ playCount += n;}
void addDown(unsigned int n = 1){ downCount += n;}
void addLike(unsigned int n = 1){ likeCount += n;}
void addFave(unsigned int n = 1){ faveCount += n;}
unsigned int baseScore(){
return playCount +
2 * downCount +
3 * likeCount +
4 * faveCount;
}
};
int main(){
track test;
const unsigned int dayLength = 24 * 3600;
const unsigned int weekLength = dayLength * 7;
std::mt19937 gen(std::time(0));
std::bernoulli_distribution playProb(0.5);
std::bernoulli_distribution downProb(0.1);
std::bernoulli_distribution likeProb(0.01);
std::bernoulli_distribution faveProb(0.001);
std::ofstream fakeRecord("fakeRecord.dat");
std::ofstream fakeRecordDecay("fakeRecordDecay.dat");
for(unsigned int i = 0; i < weekLength * 3; i += 3600){
test.addPlay(playProb(gen));
test.addDown(downProb(gen));
test.addLike(likeProb(gen));
test.addFave(faveProb(gen));
double baseScore = std::log(std::max<unsigned int>(1,test.baseScore()));
double timePoint = static_cast<double>(i)/weekLength;
fakeRecord << timePoint << " " << baseScore << std::endl;
if(timePoint > 1){
double x = timePoint - 1;
fakeRecordDecay << timePoint << " " << (baseScore * std::exp(-8*x*x)) << std::endl;
}
else
fakeRecordDecay << timePoint << " " << baseScore << std::endl;
}
return 0;
}
Result:
This should be sufficient for you.

Fast Modulo 511 and 127

Is there a way, how to make modulo by 511 (and 127) faster than using "%" operator ?
int c = 758 % 511;
int d = 423 % 127;
Here is a way to do fast modulo by 511 assuming that x is at most 32767. It's about twice as fast as x%511. It does the modulo in five steps: two multiply, two addition, one shift.
inline int fast_mod_511(int x) {
int y = (513*x+64)>>18;
return x - 511*y;
}
Here is the theory at how I arrive at this. I posted the code I tested this at the end
Let's consider
y = x/511 = x/(512-1) = x/1000 * 1/(1-1/512).
Let's define z = 512, then
y = x/z*1/(1-1/z).
Using Taylor expansion
y = x/z(1 + 1/z + 1/z^2 + 1/z^3 + ...).
Now if we know that x has a limited range we can cut the expansion. Let's assume x is always less than 2^15=32768. Then we can write
512*512*y = (1+512)*x = 513*x.
After looking at the digits which are significant we arrive at
y = (513*x+64)>>18 //512^2 = 2^18.
We can divide x/511 (assuming x is less than 32768) in three steps:
multiply,
add,
shift.
Here is the code I just to profile this in MSVC2013 64-bit release mode on an Ivy Bridge core.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
inline int fast_mod_511(int x) {
int y = (513*x+64)>>18;
return x - 511*y;
}
int main() {
unsigned int i, x;
volatile unsigned int r;
double dtime;
dtime = omp_get_wtime();
for(i=0; i<100000; i++) {
for(int j=0; j<32768; j++) {
r = j%511;
}
}
dtime =omp_get_wtime() - dtime;
printf("time %f\n", dtime);
dtime = omp_get_wtime();
for(i=0; i<100000; i++) {
for(int j=0; j<32768; j++) {
r = fast_mod_511(j);
}
}
dtime =omp_get_wtime() - dtime;
printf("time %f\n", dtime);
}
You can use a lookup table with the solutions pre-stored. If you create an array of a million integers looking up is about twice as fast as actually doing modulo in my C# app.
// fill an array
var mod511 = new int[1000000];
for (int x = 0; x < 1000000; x++) mod511[x] = x % 511;
and instead of using
c = 758 % 511;
you use
c = mod511[758];
This will cost you (possibly a lot of) memory, and will obviously not work if you want to use it for very large numbers also. But it is faster.
If you have to repeat those two modulus operations on a large number of data and your CPU supports SIMD (for example Intel's SSE/AVX/AVX2) then you can vectorize the operations, i.e., do the operations on many data in parallel. You can do this by using intrinsics or inline assembly. Yes the solution will be platform specific but maybe that is fine...

Fast block placement algorithm, advice needed?

I need to emulate the window placement strategy of the Fluxbox window manager.
As a rough guide, visualize randomly sized windows filling up the screen one at a time, where the rough size of each results in an average of 80 windows on screen without any window overlapping another.
If you have Fluxbox and Xterm installed on your system, you can try the xwinmidiarptoy BASH script to see a rough prototype of what I want happening. See the xwinmidiarptoy.txt notes I've written about it explaining what it does and how it should be used.
It is important to note that windows will close and the space that closed windows previously occupied becomes available once more for the placement of new windows.
The algorithm needs to be an Online Algorithm processing data "piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start."
The Fluxbox window placement strategy has three binary options which I want to emulate:
Windows build horizontal rows or vertical columns (potentially)
Windows are placed from left to right or right to left
Windows are placed from top to bottom or bottom to top
Differences between the target algorithm and a window-placement algorithm
The coordinate units are not pixels. The grid within which blocks will be placed will be 128 x 128 units. Furthermore, the placement area may be further shrunk by a boundary area placed within the grid.
Why is the algorithm a problem?
It needs to operate to the deadlines of a real time thread in an audio application.
At this moment I am only concerned with getting a fast algorithm, don't concern yourself over the implications of real time threads and all the hurdles in programming that that brings.
And although the algorithm will never ever place a window which overlaps another, the user will be able to place and move certain types of blocks, overlapping windows will exist. The data structure used for storing the windows and/or free space, needs to be able to handle this overlap.
So far I have two choices which I have built loose prototypes for:
1) A port of the Fluxbox placement algorithm into my code.
The problem with this is, the client (my program) gets kicked out of the audio server (JACK) when I try placing the worst case scenario of 256 blocks using the algorithm. This algorithm performs over 14000 full (linear) scans of the list of blocks already placed when placing the 256th window.
For a demonstration of this I created a program called text_boxer-0.0.2.tar.bz2 which takes a text file as input and arranges it within ASCII boxes. Issue make to build it. A little unfriendly, use --help (or any other invalid option) for a list of command line options. You must specify the text file by using the option.
2) My alternative approach.
Only partially implemented, this approach uses a data structure for each area of rectangular free unused space (the list of windows can be entirely separate, and is not required for testing of this algorithm). The data structure acts as a node in a doubly linked list (with sorted insertion), as well as containing the coordinates of the top-left corner, and the width and height.
Furthermore, each block data structure also contains four links which connect to each immediately adjacent (touching) block on each of the four sides.
IMPORTANT RULE: Each block may only touch with one block per side. This is a rule specific to the algorithm's way of storing free unused space and bears no factor in how many actual windows may touch each other.
The problem with this approach is, it's very complex. I have implemented the straightforward cases where 1) space is removed from one corner of a block, 2) splitting neighbouring blocks so that the IMPORTANT RULE is adhered to.
The less straightforward case, where the space to be removed can only be found within a column or row of boxes, is only partially implemented - if one of the blocks to be removed is an exact fit for width (ie column) or height (ie row) then problems occur. And don't even mention the fact this only checks columns one box wide, and rows one box tall.
I've implemented this algorithm in C - the language I am using for this project (I've not used C++ for a few years and am uncomfortable using it after having focused all my attention to C development, it's a hobby). The implementation is 700+ lines of code (including plenty of blank lines, brace lines, comments etc). The implementation only works for the horizontal-rows + left-right + top-bottom placement strategy.
So I've either got to add some way of making this +700 lines of code work for the other 7 placement strategy options, or I'm going to have to duplicate those +700 lines of code for the other seven options. Neither of these is attractive, the first, because the existing code is complex enough, the second, because of bloat.
The algorithm is not even at a stage where I can use it in the real time worst case scenario, because of missing functionality, so I still don't know if it actually performs better or worse than the first approach.
The current state of C implementation of this algorithm is freespace.c. I use gcc -O0 -ggdb freespace.c to build, and run it in an xterm sized to atleast 124 x 60 chars.
What else is there?
I've skimmed over and discounted:
Bin Packing algorithms: their
emphasis on optimal fit does not
match the requirements of this
algorithm.
Recursive Bisection Placement algorithms: sounds promising, but
these are for circuit design. Their
emphasis is optimal wire length.
Both of these, especially the latter, all elements to be placed/packs are known before the algorithm begins.
What are your thoughts on this? How would you approach it? What other algorithms should I look at? Or even what concepts should I research seeing as I've not studied computer science/software engineering?
Please ask questions in comments if further information is needed.
Further ideas developed since asking this question
Some combination of my "alternative algorithm" with a spatial hashmap for identifying if a large window to be placed would cover several blocks of free space.
I would consider some kind of spatial hashing structure. Imagine your entire free space is gridded coarsely, call them blocks. As windows come and go, they occupy certain sets of contiguous rectangular blocks. For each block, keep track of the largest unused rectangle incident to each corner, so you need to store 2*4 real numbers per block. For an empty block, the rectangles at each corner have size equal to the block. Thus, a block can only be "used up" at its corners, and so at most 4 windows can sit in any block.
Now each time you add a window, you have to search for a rectangular set of blocks for which the window will fit, and when you do, update the free corner sizes. You should size your blocks so that a handful (~4x4) of them fit into a typical window. For each window, keep track of which blocks it touches (you only need to keep track of extents), as well as which windows touch a given block (at most 4, in this algorithm). There is an obvious tradeoff between the granularity of the blocks and the amount of work per window insertion/removal.
When removing a window, loop over all blocks it touches, and for each block, recompute the free corner sizes (you know which windows touch it). This is fast since the inner loop is at most length 4.
I imagine a data structure like
struct block{
int free_x[4]; // 0 = top left, 1 = top right,
int free_y[4]; // 2 = bottom left, 3 = bottom right
int n_windows; // number of windows that occupy this block
int window_id[4]; // IDs of windows that occupy this block
};
block blocks[NX][NY];
struct window{
int id;
int used_block_x[2]; // 0 = first index of used block,
int used_block_y[2]; // 1 = last index of used block
};
Edit
Here is a picture:
It shows two example blocks. The colored dots indicate the corners of the block, and the arrows emanating from them indicate the extents of the largest-free-rectangle from that corner.
You mentioned in the edit that the grid on which your windows will be placed is already quite coarse (127x127), so the block sizes would probably be something like 4 grid cells on a side, which probably wouldn't gain you much. This method is suitable if your window corner coordinates can take on a lot of values (I was thinking they would be pixels), but not so much in your case. You can still try it, since it's simple. You would probably want to also keep a list of completely empty blocks so that if a window comes in that is larger than 2 block widths, then you look first in that list before looking for some suitable free space in the block grid.
After some false starts, I have eventually arrived here. Here is where the use of data structures for storing rectangular areas of free space have been abandoned. Instead, there is a 2d array with 128 x 128 elements to achieve the same result but with much less complexity.
The following function scans the array for an area width * height in size. The first position it finds it writes the top left coordinates of, to where resultx and resulty point to.
_Bool freespace_remove( freespace* fs,
int width, int height,
int* resultx, int* resulty)
{
int x = 0;
int y = 0;
const int rx = FSWIDTH - width;
const int by = FSHEIGHT - height;
*resultx = -1;
*resulty = -1;
char* buf[height];
for (y = 0; y < by; ++y)
{
x = 0;
char* scanx = fs->buf[y];
while (x < rx)
{
while(x < rx && *(scanx + x))
++x;
int w, h;
for (h = 0; h < height; ++h)
buf[h] = fs->buf[y + h] + x;
_Bool usable = true;
w = 0;
while (usable && w < width)
{
h = 0;
while (usable && h < height)
if (*(buf[h++] + w))
usable = false;
++w;
}
if (usable)
{
for (w = 0; w < width; ++w)
for (h = 0; h < height; ++h)
*(buf[h] + w) = 1;
*resultx = x;
*resulty = y;
return true;
}
x += w;
}
}
return false;
}
The 2d array is zero initialized. Any areas in the array where the space is used are set to 1. This structure and function will work independently from the actual list of windows that are occupying the areas marked with 1's.
The advantages of this method are its simplicity. It only uses one data structure - an array. The function is short, and should not be too difficult to adapt to handle the remaining placement options (here it only handles Row Smart + Left to Right + Top to Bottom).
My initial tests also look promising on the speed front. Though I don't think this would be suitable for a window manager placing windows on, for example, a 1600 x 1200 desktop with pixel accuracy, for my purposes I believe it is going to be much better than any of the previous methods I have tried.
Compilable test code here:
http://jwm-art.net/art/text/freespace_grid.c
(in Linux I use gcc -ggdb -O0 freespace_grid.c to compile)
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#define FSWIDTH 128
#define FSHEIGHT 128
#ifdef USE_64BIT_ARRAY
#define FSBUFBITS 64
#define FSBUFWIDTH 2
typedef uint64_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctzl(( v ))
#define LEADING_ONES( v ) __builtin_clzl(~( v ))
#else
#ifdef USE_32BIT_ARRAY
#define FSBUFBITS 32
#define FSBUFWIDTH 4
typedef uint32_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctz(( v ))
#define LEADING_ONES( v ) __builtin_clz(~( v ))
#else
#ifdef USE_16BIT_ARRAY
#define FSBUFBITS 16
#define FSBUFWIDTH 8
typedef uint16_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctz( 0xffff0000 | ( v ))
#define LEADING_ONES( v ) __builtin_clz(~( v ) << 16)
#else
#ifdef USE_8BIT_ARRAY
#define FSBUFBITS 8
#define FSBUFWIDTH 16
typedef uint8_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctz( 0xffffff00 | ( v ))
#define LEADING_ONES( v ) __builtin_clz(~( v ) << 24)
#else
#define FSBUFBITS 1
#define FSBUFWIDTH 128
typedef unsigned char fsbuf_type;
#define TRAILING_ZEROS( v ) (( v ) ? 0 : 1)
#define LEADING_ONES( v ) (( v ) ? 1 : 0)
#endif
#endif
#endif
#endif
static const fsbuf_type fsbuf_max = ~(fsbuf_type)0;
static const fsbuf_type fsbuf_high = (fsbuf_type)1 << (FSBUFBITS - 1);
typedef struct freespacegrid
{
fsbuf_type buf[FSHEIGHT][FSBUFWIDTH];
_Bool left_to_right;
_Bool top_to_bottom;
} freespace;
void freespace_dump(freespace* fs)
{
int x, y;
for (y = 0; y < FSHEIGHT; ++y)
{
for (x = 0; x < FSBUFWIDTH; ++x)
{
fsbuf_type i = FSBUFBITS;
fsbuf_type b = fs->buf[y][x];
for(; i != 0; --i, b <<= 1)
putchar(b & fsbuf_high ? '#' : '/');
/*
if (x + 1 < FSBUFWIDTH)
putchar('|');
*/
}
putchar('\n');
}
}
freespace* freespace_new(void)
{
freespace* fs = malloc(sizeof(*fs));
if (!fs)
return 0;
int y;
for (y = 0; y < FSHEIGHT; ++y)
{
memset(&fs->buf[y][0], 0, sizeof(fsbuf_type) * FSBUFWIDTH);
}
fs->left_to_right = true;
fs->top_to_bottom = true;
return fs;
}
void freespace_delete(freespace* fs)
{
if (!fs)
return;
free(fs);
}
/* would be private function: */
void fs_set_buffer( fsbuf_type buf[FSHEIGHT][FSBUFWIDTH],
unsigned x,
unsigned y1,
unsigned xoffset,
unsigned width,
unsigned height)
{
fsbuf_type v;
unsigned y;
for (; width > 0 && x < FSBUFWIDTH; ++x)
{
if (width < xoffset)
v = (((fsbuf_type)1 << width) - 1) << (xoffset - width);
else if (xoffset < FSBUFBITS)
v = ((fsbuf_type)1 << xoffset) - 1;
else
v = fsbuf_max;
for (y = y1; y < y1 + height; ++y)
{
#ifdef FREESPACE_DEBUG
if (buf[y][x] & v)
printf("**** over-writing area ****\n");
#endif
buf[y][x] |= v;
}
if (width < xoffset)
return;
width -= xoffset;
xoffset = FSBUFBITS;
}
}
_Bool freespace_remove( freespace* fs,
unsigned width, unsigned height,
int* resultx, int* resulty)
{
unsigned x, x1, y;
unsigned w, h;
unsigned xoffset, x1offset;
unsigned tz; /* trailing zeros */
fsbuf_type* xptr;
fsbuf_type mask = 0;
fsbuf_type v;
_Bool scanning = false;
_Bool offset = false;
*resultx = -1;
*resulty = -1;
for (y = 0; y < (unsigned) FSHEIGHT - height; ++y)
{
scanning = false;
xptr = &fs->buf[y][0];
for (x = 0; x < FSBUFWIDTH; ++x, ++xptr)
{
if(*xptr == fsbuf_max)
{
scanning = false;
continue;
}
if (!scanning)
{
scanning = true;
x1 = x;
x1offset = xoffset = FSBUFBITS;
w = width;
}
retry:
if (w < xoffset)
mask = (((fsbuf_type)1 << w) - 1) << (xoffset - w);
else if (xoffset < FSBUFBITS)
mask = ((fsbuf_type)1 << xoffset) - 1;
else
mask = fsbuf_max;
offset = false;
for (h = 0; h < height; ++h)
{
v = fs->buf[y + h][x] & mask;
if (v)
{
tz = TRAILING_ZEROS(v);
offset = true;
break;
}
}
if (offset)
{
if (tz)
{
x1 = x;
w = width;
x1offset = xoffset = tz;
goto retry;
}
scanning = false;
}
else
{
if (w <= xoffset) /***** RESULT! *****/
{
fs_set_buffer(fs->buf, x1, y, x1offset, width, height);
*resultx = x1 * FSBUFBITS + (FSBUFBITS - x1offset);
*resulty = y;
return true;
}
w -= xoffset;
xoffset = FSBUFBITS;
}
}
}
return false;
}
int main(int argc, char** argv)
{
int x[1999];
int y[1999];
int w[1999];
int h[1999];
int i;
freespace* fs = freespace_new();
for (i = 0; i < 1999; ++i, ++u)
{
w[i] = rand() % 18 + 4;
h[i] = rand() % 18 + 4;
freespace_remove(fs, w[i], h[i], &x[i], &y[i]);
/*
freespace_dump(fs);
printf("w:%d h:%d x:%d y:%d\n", w[i], h[i], x[i], y[i]);
if (x[i] == -1)
printf("not removed space %d\n", i);
getchar();
*/
}
freespace_dump(fs);
freespace_delete(fs);
return 0;
}
The above code requires one of USE_64BIT_ARRAY, USE_32BIT_ARRAY, USE_16BIT_ARRAY, USE_8BIT_ARRAY to be defined otherwise it will fall back to using only the high bit of an unsigned char for storing the state of grid cells.
The function fs_set_buffer will not be declared in the header, and will become static within the implementation when this code gets split between .h and .c files. A more user friendly function hiding the implementation details will be provided for removing used space from the grid.
Overall, this implementation is faster without optimization than my previous answer with maximum optimization (using GCC on 64bit Gentoo, optimization options -O0 and -O3 respectively).
Regarding USE_NNBIT_ARRAY and the different bit sizes, I used two different methods of timing the code which make 1999 calls to freespace_remove.
Timing main() using the Unix time command (and disabling any output in the code) seemed to prove my expectations correct - that higher bit sizes are faster.
On the other hand, timing individual calls to freespace_remove (using gettimeofday) and comparing the maximum time taken over the 1999 calls seemed to indicate lower bit sizes were faster.
This has only been tested on a 64bit system (Intel Dual Core II).

Resources