How can I minimize two integers so that their product is less than a certain value? - pixel

Given two integers, how can I minimize them so that their product is less than some other value, while maintaining their relative ratio?
That's the formal problem. The practical problem is this: I have width/height pixel resolutions containing random values (anywhere from 1 to 8192 for either dimension). I want to adjust the pairs of values, such that their product doesn't exceed some total number of pixels (ex: 1000000), and I need to ensure that the aspect ratio of the adjusted resolution remains the same (ex: 1.7777).
The naive approach is to just run a loop where I subtract 1 from the width each time, adjusting the height to match the aspect ratio, until their product is under the threshold. Ex:
int wid = 1920;
int hei = 1080;
float aspect = wid / (float)hei;
int maxPixels = 1000000;
while (wid * hei > maxPixels)
{
wid -= 1;
hei = wid / aspect;
}
Surely there must be a more analytical approach to this problem though?

Edit: Misread the original question.
Another way to word your question is with a W and H what is the largest a and b such that a/b = W/H and a*b < C where C is your limit on .
To do so, find the D = gcd(W,H) or greatest common divisor of W and H. Greatest common denominator is commonly found using the Euclidean Algorithm.
Set x = W/D and y = H/D, this is minimal solution with the same ratio.
To produce the maxima under C, start with the inequality of F*x*F*y <= C where F will be our scale factor for x and y
Algebra:
F^2 <= C/(x*y)
F <= sqrt(C/(x*y))
Since we want F to be a whole number and strictly less than the above,
F = floor(sqrt(C/(x*y)))
This will give you a new solution A = x*F and B = y*F where A*B < C and A/B = W/H.

mascoj came up with the answer, but here is an interpretation in code form:
#include <utility>
#include <numeric>
#include <cmath>
#include <iostream>
// Mathsy stuff
std::pair<uint64_t, uint64_t> ReduceRatio(const uint64_t W, const uint64_t H)
{
const double D = std::gcd(W, H);
return {W/D, H/D};
}
std::pair<uint64_t, uint64_t> Maximise(const uint64_t C, const uint64_t W, const uint64_t H)
{
const auto [x, y] = ReduceRatio(W, H);
const uint64_t F = std::floor(std::sqrt(C/double(x*y)));
const uint64_t A = x*F;
const uint64_t B = y*F;
return {A, B};
}
// Test harness
void Test(const uint64_t MaxProduct, const uint64_t W, const uint64_t H)
{
const auto [NewW, NewH] = Maximise(MaxProduct, W, H);
std::cout << W << "\u00D7" << H << " (" << (W*H) << " pixels)";
if (NewW > W)
std::cout << '\n';
else
std::cout << " => " << NewW << "\u00D7" << NewH << " (" << (NewW * NewH) << " pixels)\n";
}
int main()
{
Test(100000, 1024, 768);
Test(100000, 1920, 1080);
Test(500000, 1920, 1080);
Test(1000000, 1920, 1080);
Test(2000000, 1920, 1080);
Test(4000000, 1920, 1080);
}
// g++ -std=c++17 -O2 -Wall -pedantic -pthread main.cpp && ./a.out
// 1024×768 (786432 pixels) => 364×273 (99372 pixels)
// 1920×1080 (2073600 pixels) => 416×234 (97344 pixels)
// 1920×1080 (2073600 pixels) => 928×522 (484416 pixels)
// 1920×1080 (2073600 pixels) => 1328×747 (992016 pixels)
// 1920×1080 (2073600 pixels) => 1872×1053 (1971216 pixels)
// 1920×1080 (2073600 pixels)

Related

How to fill a fixed rectangle with square pieces entirely?

Is this knapsack algorithm or bin packing? I couldn't find an exact solution but basically I have a fixed rectangle area that I want to fill with perfect squares that represents my items where each have a different weight which will influence their size relative to others.
They will be sorted from large to smaller from top left to bottom right.
Also even though I need perfect squares, in the end some non-uniform scaling is allowed to fill the entire space as long as they still retain their relative area, and the non-uniform scaling is done with the least possible amount.
What algorithm I can use to achieve this?
There's a fast approximation algorithm due to Hiroshi Nagamochi and Yuusuke Abe. I implemented it in C++, taking care to obtain a worst-case O(n log n)-time implementation with worst-case recursive depth O(log n). If n ≤ 100, these precautions are probably unnecessary.
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>
namespace {
struct Rectangle {
double x;
double y;
double width;
double height;
};
Rectangle Slice(Rectangle &r, const double beta) {
const double alpha = 1 - beta;
if (r.width > r.height) {
const double alpha_width = alpha * r.width;
const double beta_width = beta * r.width;
r.width = alpha_width;
return {r.x + alpha_width, r.y, beta_width, r.height};
}
const double alpha_height = alpha * r.height;
const double beta_height = beta * r.height;
r.height = alpha_height;
return {r.x, r.y + alpha_height, r.width, beta_height};
}
void LayoutRecursive(const std::vector<double> &reals, const std::size_t begin,
std::size_t end, double sum, Rectangle rectangle,
std::vector<Rectangle> &dissection) {
while (end - begin > 1) {
double suffix_sum = reals[end - 1];
std::size_t mid = end - 1;
while (mid > begin + 1 && suffix_sum + reals[mid - 1] <= 2 * sum / 3) {
suffix_sum += reals[mid - 1];
mid -= 1;
}
LayoutRecursive(reals, mid, end, suffix_sum,
Slice(rectangle, suffix_sum / sum), dissection);
end = mid;
sum -= suffix_sum;
}
dissection.push_back(rectangle);
}
std::vector<Rectangle> Layout(std::vector<double> reals,
const Rectangle rectangle) {
std::sort(reals.begin(), reals.end());
std::vector<Rectangle> dissection;
dissection.reserve(reals.size());
LayoutRecursive(reals, 0, reals.size(),
std::accumulate(reals.begin(), reals.end(), double{0}),
rectangle, dissection);
return dissection;
}
std::vector<double> RandomReals(const std::size_t n) {
std::vector<double> reals(n);
std::exponential_distribution<> dist;
std::default_random_engine gen;
for (double &real : reals) {
real = dist(gen);
}
return reals;
}
} // namespace
int main() {
const std::vector<Rectangle> dissection =
Layout(RandomReals(100), {72, 72, 6.5 * 72, 9 * 72});
std::cout << "%!PS\n";
for (const Rectangle &r : dissection) {
std::cout << r.x << " " << r.y << " " << r.width << " " << r.height
<< " rectstroke\n";
}
std::cout << "showpage\n";
}
Ok so lets assume integer positions and sizes (no float operations). To evenly divide rectangle into regular square grid (as big squares as possible) the size of the cells will be greatest common divisor GCD of the rectangle sizes.
However you want to have much less squares than that so I would try something like this:
try all square sizes a from 1 to smaller size of rectangle
for each a compute the naive square grid size of the rest of rectangle once a*a square is cut of it
so its simply GCD again on the 2 rectangles that will be created once a*a square is cut of. If the min of all 3 sizes a and GCD for the 2 rectangles is bigger than 1 (ignoring zero area rectangles) then consider a as valid solution so remember it.
after the for loop use last found valida
so simply add a*a square to your output and recursively do this whole again for the 2 rectangles that will remain from your original rectangle after a*a square was cut off.
Here simple C++/VCL/OpenGL example for this:
//---------------------------------------------------------------------------
#include <vcl.h>
#pragma hdrstop
#include "Unit1.h"
#include "gl_simple.h"
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
//---------------------------------------------------------------------------
class square // simple square
{
public:
int x,y,a; // corner 2D position and size
square(){ x=y=a=0.0; }
square(int _x,int _y,int _a){ x=_x; y=_y; a=_a; }
~square(){}
void draw()
{
glBegin(GL_LINE_LOOP);
glVertex2i(x ,y);
glVertex2i(x+a,y);
glVertex2i(x+a,y+a);
glVertex2i(x ,y+a);
glEnd();
}
};
int rec[4]={20,20,760,560}; // x,y,a,b
const int N=1024; // max square number
int n=0; // number of squares
square s[N]; // squares
//---------------------------------------------------------------------------
int gcd(int a,int b) // slow euclid GCD
{
if(!b) return a;
return gcd(b, a % b);
}
//---------------------------------------------------------------------------
void compute(int x0,int y0,int xs,int ys)
{
if ((xs==0)||(ys==0)) return;
const int x1=x0+xs;
const int y1=y0+ys;
int a,b,i,x,y;
square t;
// try to find biggest square first
for (a=1,b=0;(a<=xs)&&(a<=ys);a++)
{
// sizes for the rest of the rectangle once a*a square is cut of
if (xs==a) x=0; else x=gcd(a,xs-a);
if (ys==a) y=0; else y=gcd(a,ys-a);
// min of all sizes
i=a;
if ((x>0)&&(i>x)) i=x;
if ((y>0)&&(i>y)) i=y;
// if divisible better than by 1 remember it as better solution
if (i>1) b=a;
} a=b;
// bigger square not found so use naive square grid division
if (a<=1)
{
t.a=gcd(xs,ys);
for (t.y=y0;t.y<y1;t.y+=t.a)
for (t.x=x0;t.x<x1;t.x+=t.a)
if (n<N){ s[n]=t; n++; }
}
// bigest square found so add it to result and recursively process the rest
else{
t=square(x0,y0,a);
if (n<N){ s[n]=t; n++; }
compute(x0+a,y0,xs-a,a);
compute(x0,y0+a,xs,ys-a);
}
}
//---------------------------------------------------------------------------
void gl_draw()
{
glClear(GL_COLOR_BUFFER_BIT);
glDisable(GL_DEPTH_TEST);
glDisable(GL_TEXTURE_2D);
// set view to 2D [pixel] units
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
glTranslatef(-1.0,-1.0,0.0);
glScalef(2.0/float(xs),2.0/float(ys),1.0);
// render input rectangle
glColor3f(0.2,0.2,0.2);
glBegin(GL_QUADS);
glVertex2i(rec[0] ,rec[1]);
glVertex2i(rec[0]+rec[2],rec[1]);
glVertex2i(rec[0]+rec[2],rec[1]+rec[3]);
glVertex2i(rec[0] ,rec[1]+rec[3]);
glEnd();
// render output squares
glColor3f(0.2,0.5,0.9);
for (int i=0;i<n;i++) s[i].draw();
glFinish();
SwapBuffers(hdc);
}
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner):TForm(Owner)
{
// Init of program
gl_init(Handle); // init OpenGL
n=0; compute(rec[0],rec[1],rec[2],rec[3]);
Caption=n;
}
//---------------------------------------------------------------------------
void __fastcall TForm1::FormDestroy(TObject *Sender)
{
// Exit of program
gl_exit();
}
//---------------------------------------------------------------------------
void __fastcall TForm1::FormPaint(TObject *Sender)
{
// repaint
gl_draw();
}
//---------------------------------------------------------------------------
void __fastcall TForm1::FormResize(TObject *Sender)
{
// resize
gl_resize(ClientWidth,ClientHeight);
}
//---------------------------------------------------------------------------
And preview for the actually hardcoded rectangle:
The number 8 in Caption of the window is the number of squares produced.
Beware this is just very simple startup example for this problem. I Have not tested it extensively so there is possibility once prime sizes or just unlucky aspect ratios are involved then this might result in really high number of squares... for example if GCD of the rectangle size is 1 (primes) ... In such case you should tweak the initial rectangle size (+/-1 or whatever)
The important thing in the code is just the compute() function and the global variables holding the output squares s[n]... beware the compute() does not clear the list (in order to allow recursion) so you need to set n=0 before its non recursive call.
To keep this simple I avoided to use any libs or dynamic allocations or containers for the compute itself...

Efficient and accurate computation of the reciprocal of hypot(a,b)

Givens rotations provide a robust and easily parallelizable way to implement QR decomposition. A Givens rotation requires the computation of sine and cosine components of a rotation angle. In the case of real computation, this typically involves the computation of the reciprocal of the hypot() function to normalize a two-vector, as shown for example in Wikipedia.
While this avoids most cases of overflow and underflow in intermediate computation, for very large values a, b, hypot(a,b) may overflow to infinity, while 1/√(a2+b2) is actually representable as a subnormal floating-point number. Also, the use of a division adds further computational cost that can be significant on platforms with slow floating-point division.
A function rhypot(a,b) that directly computes 1/√(a2+b2) at a cost similar to the standard hypot() function would therefore be desirable. The accuracy should be same or better than the naive approach of computing 1.0/hypot(a,b). With a correctly-rounded hypot function, this expression has a maximum error of 1.5 ulps.
How can such a function be implemented efficiently and accurately? The use of IEEE-754 binary floating-point arithmetic and the availability of native hardware support for fused multiply-add (FMA) operations can be assumed. For ease of exposition and testing, we can restrict to single-precision computation, i.e. the IEEE-754 binary32 format.
In the following, I am showing ISO-C99 code that implements rhypot with good accuracy and good performance. The general algorithm is directly derived from the example implementations I showed for hypot in this answer. For hypot, one determines the value of largest magnitude among the arguments, then find a scale factor (a power of two for reasons of accuracy) that maps this value into the vicinity of unity. The scale factor is applied to both arguments, and the length of this transformed 2-vector is then computed with the sqrt function, finally the result scaled back with the "inverse' of the scale factor. The scaling relies on actual multiplication as the arguments may be subnormals that cannot be scaled correctly by simple exponent manipulation alone.
For rhypot, only two changes are needed: the reciprocal square root function rsqrt must be used instead of sqrt, and input scaling and result scaling use the same scale factor.
Some computing environments provide an rsqrt() function, and this function is scheduled for inclusion in a future version of the ISO C standard (ISO/IEC TS 18661-4:2015). For environments that do not provide a reciprocal square root function, I am showing some portable (within the platform requirements stated in the question) and machine-specific implementations.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
uint32_t __float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
float __uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
float my_rsqrtf (float);
/* Compute the reciprocal of sqrt (a**2 + b**2), avoiding premature overflow
and underflow in intermediate computation. The accuracy of this function
depends on the accuracy of the reciprocal square root implementation used.
With the rsqrtf() implementations shown below, the following maximum ulp
error was observed for 2**36 random test cases:
CORRECTLY_ROUNDED 1.20736973
SSE_HALLEY 1.33120522
SSE_2NR 1.42086841
SQRT_OOX 1.42906701
BIT_TWIDDLE_3NR 1.43062950
ITO_TAKAGI_YAJIMA_1NR 1.43681737
BIT_TWIDDLE_NR_HALLEY 1.47485797
*/
float my_rhypotf (float a, float b)
{
float fa, fb, mn, mx, scale, s, w, res;
uint32_t expo;
/* sort arguments by magnitude */
fa = fabsf (a);
fb = fabsf (b);
mx = fmaxf (fa, fb);
mn = fminf (fa, fb);
/* compute scale factor */
expo = __float_as_uint32 (mx) & 0xfc000000;
scale = __uint32_as_float (0x7e000000 - expo);
/* scale operand of maximum magnitude towards unity */
mn = mn * scale;
mx = mx * scale;
/* mx in [2**-23, 2**6) */
s = fmaf (mx, mx, mn * mn); // 0.75 ulp
w = my_rsqrtf (s);
/* reverse previous scaling */
res = w * scale;
/* handle special cases */
float t = a + b;
if (!(fabsf (t) <= INFINITY)) res = t; // isnan(t)
if (mx == INFINITY) res = 0.0f; // isinf(mx)
return res;
}
#define CORRECTLY_ROUNDED (1)
#define SSE_HALLEY (2)
#define SSE_2NR (3)
#define ITO_TAKAGI_YAJIMA_1NR (4)
#define SQRT_OOX (5)
#define BIT_TWIDDLE_3NR (6)
#define BIT_TWIDDLE_NR_HALLEY (7)
#define RSQRT_VARIANT (SSE_HALLEY)
#if (RSQRT_VARIANT == SSE_2NR) || (RSQRT_VARIANT == SSE_HALLEY)
#include "immintrin.h"
#endif // (RSQRT_VARIANT == SSE_2NR) || (RSQRT_VARIANT == SSE_HALLEY)
float my_rsqrtf (float a)
{
#if RSQRT_VARIANT == CORRECTLY_ROUNDED
float r = (float) sqrt (1.0/(double)a);
#elif RSQRT_VARIANT == SQRT_OOX
float r = sqrtf (1.0f / a);
#elif RSQRT_VARIANT == SSE_2NR
float r;
/* compute initial approximation */
_mm_store_ss (&r, _mm_rsqrt_ss (_mm_set_ss (a)));
/* refine approximation using two Newton-Raphson iterations */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
#elif RSQRT_VARIANT == SSE_HALLEY
float e, r;
/* compute initial approximation */
_mm_store_ss (&r, _mm_rsqrt_ss (_mm_set_ss (a)));
/* refine approximation using Halley iteration with cubic convergence */
e = fmaf (r * r, -a, 1.0f);
r = fmaf (fmaf (0.375f, e, 0.5f), e * r, r);
#elif RSQRT_VARIANT == BIT_TWIDDLE_3NR
float r;
/* compute initial approximation */
r = __uint32_as_float (0x5f375b0d - (__float_as_uint32(a) >> 1));
/* refine approximation using three Newton-Raphson iterations */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
#elif RSQRT_VARIANT == BIT_TWIDDLE_NR_HALLEY
float e, r;
/* compute initial approximation */
r = __uint32_as_float (0x5f375b0d - (__float_as_uint32(a) >> 1));
/* refine approximation using Newton-Raphson iteration */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
/* refine approximation using Halley iteration with cubic convergence */
e = fmaf (r * r, -a, 1.0f);
r = fmaf (fmaf (0.375f, e, 0.5f), e * r, r);
#elif RSQRT_VARIANT == ITO_TAKAGI_YAJIMA_1NR
/* Masayuki Ito, Naofumi Takagi, Shuzo Yajima, "Efficient Initial
Approximation for Multiplicative Division and Square Root by a
Multiplication with Operand Modification". IEEE Transactions on
Computers, Vol. 46, No. 4, April 1997, pp. 495-498.
*/
#define TAB_INDEX_BITS (7)
#define TAB_ENTRY_BITS (16)
#define TAB_ENTRIES (1 << TAB_INDEX_BITS)
#define FP32_EXPO_BIAS (127)
#define FP32_MANT_BITS (23)
#define FP32_SIGN_MASK (0x80000000)
#define FP32_EXPO_MASK (0x7f800000)
#define FP32_EXPO_LSB_MASK (1u << FP32_MANT_BITS)
#define FP32_INDEX_MASK (((1u << TAB_INDEX_BITS) - 1) << (FP32_MANT_BITS - TAB_INDEX_BITS))
#define FP32_XHAT_MASK (~(FP32_INDEX_MASK | FP32_SIGN_MASK) | FP32_EXPO_MASK)
#define FP32_FLIP_BIT_MASK (3u << (FP32_MANT_BITS - TAB_INDEX_BITS - 1))
#define FP32_ONE_HALF (0x3f000000)
const uint16_t d1tab [TAB_ENTRIES] = {
0xb2ec, 0xaed7, 0xaae9, 0xa720, 0xa37b, 0x9ff7, 0x9c93, 0x994d,
0x9623, 0x9316, 0x9022, 0x8d47, 0x8a85, 0x87d8, 0x8542, 0x82c0,
0x8053, 0x7bf0, 0x775f, 0x72f1, 0x6ea4, 0x6a77, 0x666a, 0x6279,
0x5ea5, 0x5aed, 0x574e, 0x53c9, 0x505d, 0x4d07, 0x49c8, 0x469e,
0x438a, 0x408a, 0x3d9e, 0x3ac4, 0x37fc, 0x3546, 0x32a0, 0x300b,
0x2d86, 0x2b10, 0x28a8, 0x264f, 0x2404, 0x21c6, 0x1f95, 0x1d70,
0x1b58, 0x194c, 0x174b, 0x1555, 0x136a, 0x1189, 0x0fb2, 0x0de6,
0x0c22, 0x0a68, 0x08b7, 0x070f, 0x056f, 0x03d8, 0x0249, 0x00c1,
0xfd08, 0xf742, 0xf1b4, 0xec5a, 0xe732, 0xe239, 0xdd6d, 0xd8cc,
0xd454, 0xd002, 0xcbd6, 0xc7cd, 0xc3e5, 0xc01d, 0xbc75, 0xb8e9,
0xb57a, 0xb225, 0xaeeb, 0xabc9, 0xa8be, 0xa5cb, 0xa2ed, 0xa024,
0x9d6f, 0x9ace, 0x983e, 0x95c1, 0x9355, 0x90fa, 0x8eae, 0x8c72,
0x8a45, 0x8825, 0x8614, 0x8410, 0x8219, 0x802e, 0x7c9c, 0x78f5,
0x7565, 0x71eb, 0x6e85, 0x6b31, 0x67f3, 0x64c7, 0x61ae, 0x5ea7,
0x5bb0, 0x58cb, 0x55f6, 0x5330, 0x5079, 0x4dd1, 0x4b38, 0x48ad,
0x462f, 0x43be, 0x4159, 0x3f01, 0x3cb5, 0x3a75, 0x3840, 0x3616
};
uint32_t arg, idx, d1, xhat;
float r;
arg = __float_as_uint32 (a);
idx = (arg >> ((FP32_MANT_BITS + 1) - TAB_INDEX_BITS)) & ((1u << TAB_INDEX_BITS) - 1);
d1 = FP32_ONE_HALF | (d1tab[idx] << ((FP32_MANT_BITS + 1) - TAB_ENTRY_BITS));
xhat = ((arg & FP32_INDEX_MASK) | (((((3 * FP32_EXPO_BIAS) << FP32_MANT_BITS) + ~arg) >> 1) & FP32_XHAT_MASK)) ^ FP32_FLIP_BIT_MASK;
/* compute initial approximation, accurate to about 14 bits */
r = __uint32_as_float (d1) * __uint32_as_float (xhat);
/* refine approximation with one Newton-Raphson iteration */
r = fmaf (fmaf (-a, r * r, 1.0f), 0.5f * r, r);
#else
#error unsupported RSQRT_VARIANT
#endif // RSQRT_VARIANT
return r;
}
uint64_t __double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
double floatUlpErr (float res, double ref)
{
uint64_t i, j, err, refi;
int expoRef;
/* ulp error cannot be computed if either operand is NaN, infinity, zero */
if (isnan (res) || isnan (ref) || isinf (res) || isinf (ref) ||
(res == 0.0f) || (ref == 0.0f)) {
return 0.0;
}
/* Convert the float result to an "extended float". This is like a float
with 56 instead of 24 effective mantissa bits.
*/
i = ((uint64_t)__float_as_uint32(res)) << 32;
/* Convert the double reference to an "extended float". If the reference is
>= 2^129, we need to clamp to the maximum "extended float". If reference
is < 2^-126, we need to denormalize because of the float types's limited
exponent range.
*/
refi = __double_as_uint64(ref);
expoRef = (int)(((refi >> 52) & 0x7ff) - 1023);
if (expoRef >= 129) {
j = 0x7fffffffffffffffULL;
} else if (expoRef < -126) {
j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
j = j >> (-(expoRef + 126));
} else {
j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
j = j | ((uint64_t)(expoRef + 127) << 55);
}
j = j | (refi & 0x8000000000000000ULL);
err = (i < j) ? (j - i) : (i - j);
return err / 4294967296.0;
}
double rhypot (double a, double b)
{
return 1.0 / hypot (a, b);
}
// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579) /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)
#define FP32_QNAN_BIT (0x00400000)
int main (void)
{
float af, bf, resf, reff;
uint32_t ai, bi, resi, refi;
double ref, err, maxerr = 0;
uint64_t diff, diffsum = 0, count = 1ULL << 36;
do {
ai = KISS;
bi = KISS;
af = __uint32_as_float (ai);
bf = __uint32_as_float (bi);
resf = my_rhypotf (af, bf);
ref = rhypot ((double)af, (double)bf);
reff = (float)ref;
refi = __float_as_uint32 (reff);
resi = __float_as_uint32 (resf);
diff = llabs ((long long int)resi - (long long int)refi);
/* If both inputs are a NaN, result can be either argument, converted
to QNaN if necessary. If one input is NaN and the other not infinity
the NaN input must be returned, converted to QNaN if necessary. If
one input is infinity, zero must be returned even if the other input
is a NaN. In all other cases allow up to 1 ulp of difference.
*/
if ((isnan (af) && isnan (bf) && (resi != (ai | FP32_QNAN_BIT)) && (resi != (bi | FP32_QNAN_BIT))) ||
(isnan (af) && !isinf (bf) && !isnan (bf) && (resi != (ai | FP32_QNAN_BIT))) ||
(isnan (bf) && !isinf (af) && !isnan (af) && (resi != (bi | FP32_QNAN_BIT))) ||
(isinf (af) && (resi != 0)) ||
(isinf (bf) && (resi != 0)) ||
(diff > 1)) {
printf ("err # (%08x,%08x): res= %08x (%15.8e) ref=%08x (%15.8e)\n",
ai, bi, resi, resf, refi, reff);
return EXIT_FAILURE;
}
diffsum += diff;
err = floatUlpErr (resf, ref);
if (err > maxerr) {
printf ("ulp=%.8f # (% 15.8e, % 15.8e): res=%15.6a ref=%22.13a\n",
err, af, bf, resf, ref);
maxerr = err;
}
count--;
} while (count);
printf ("diffsum = %llu\n", diffsum);
return EXIT_SUCCESS;
}

Comparing arrays for "similarity"?

I am trying to compare same size arrays.
Given the following array:
I am looking for an algorithm to tell me the most "similar" array to the input. I understand that the word "similar" is not very specific but I don't know how to be more specific.
For example the following is very similar to the input.
The following is somewhat similar.
The following is very different.
You could apply a smoothing kernel to the array, and then compute the L2 norm (Euclidean distance) on it.
This is often used to compare e.g. neural spike trains or other continuous signals.
http://www.cs.utah.edu/~suresh/papers/kerneld/kerneld.pdf
You didn't specify a language...I happen to have code in C++ (may not be the most efficient).
First, you do a smoothing of the vector based on your desired kernel width and parameterize it depending on the scale/desired amount of "blur", etc.. For example:
Output of code below (behaves as expected):
riveale#rv-mba:~/tmpdir$ g++ -std=c++11 test.cpp -o test.exe
riveale#rv-mba:~/tmpdir$ ./test.exe
Distance [1] to [2]: [31.488026] (should be far)
Distance [2] to [3]: [26.591297] (should be far)
Distance [1] to [3]: [12.468342] (should be closer)
And code (test.cpp):
#include <vector>
#include <cstdlib>
#include <cstdio>
#include <cmath>
double gauss_kernel_funct(const size_t& sourcetime, const size_t& thistime)
{
const double tauval = 5.0; //width of kernel
double dist = ((sourcetime-thistime)/tauval); //distance between the points in the vector
double retval = exp(-1 * dist*dist); //exponential decay away from center of that point, squared....
return retval;
}
std::vector<double> convolvegauss( const std::vector<double>& v1)
{
std::vector<double> convolved( v1.size(), 0.0 );
for(size_t t=0; t<v1.size(); ++t)
{
for(size_t u=0; u<v1.size(); ++u)
{
double coeff = gauss_kernel_funct(u, t);
convolved[t]+=v1[u] * coeff;
}
}
return (convolved);
}
double eucliddist( const std::vector<double>& v1, const std::vector<double>& v2 )
{
if(v1.size() != v2.size()) { fprintf(stderr, "ERROR v1!=v2 sizes\n"); exit(1); }
double sum=0.0;
for(size_t x=0; x<v1.size(); ++x)
{
double tmp = (v1[x] - v2[x]);
sum += tmp*tmp; //sum += distance of this dimension squared
}
return (sqrt( sum ));
}
double vectdist( const std::vector<double>& v1, const std::vector<double>& v2 )
{
std::vector<double> convolved1 = convolvegauss( v1 );
std::vector<double> convolved2 = convolvegauss( v2 );
return (eucliddist( convolved1, convolved2 ));
}
int main()
{
//Original 3 vectors. (1 and 3) are closer than (1 and 2) or (2 and 3)...like your example.
std::vector<double> myvector1 = {1.0, 32.0, 10.0, 5.0, 2.0};
std::vector<double> myvector2 = {2.0, 3.0, 10.0, 22.0, 2.0};
std::vector<double> myvector3 = {2.0, 20.0, 17.0, 1.0, 2.0};
//Now run the vectdist on each, which convolves each vector with the gaussian kernel, and takes the euclid distance between the convovled vectors)
fprintf(stdout, "Distance [%d] to [%d]: [%lf] (should be far)\n", 1, 2, vectdist(myvector1, myvector2) );
fprintf(stdout, "Distance [%d] to [%d]: [%lf] (should be far)\n", 2, 3, vectdist(myvector2, myvector3) );
fprintf(stdout, "Distance [%d] to [%d]: [%lf] (should be closer)\n", 1, 3, vectdist(myvector1, myvector3) );
return 0;
}

Estimating an Affine Transform between Two Images

I have a sample image:
I apply the affine transform with the following warp matrix:
[[ 1.25 0. -128 ]
[ 0. 2. -192 ]]
and crop a 128x128 part from the result to get an output image:
Now, I want to estimate the warp matrix and crop size/location from just comparing the sample and output image. I detect feature points using SURF, and match them by brute force:
There are many matches, of which I'm keeping the best three (by distance), since that is the number required to estimate the affine transform. I then use those 3 keypoints to estimate the affine transform using getAffineTransform. However, the transform it returns is completely wrong:
-0.00 1.87 -6959230028596648489132997794229911552.00
0.00 -1.76 -0.00
What am I doing wrong? Source code is below.
Perform affine transform (Python):
"""Apply an affine transform to an image."""
import cv
import sys
import numpy as np
if len(sys.argv) != 10:
print "usage: %s in.png out.png x1 y1 width height sx sy flip" % __file__
sys.exit(-1)
source = cv.LoadImage(sys.argv[1])
x1, y1, width, height, sx, sy, flip = map(float, sys.argv[3:])
X, Y = cv.GetSize(source)
Xn, Yn = int(sx*(X-1)), int(sy*(Y-1))
if flip:
arr = np.array([[-sx, 0, sx*(X-1)-x1], [0, sy, -y1]])
else:
arr = np.array([[sx, 0, -x1], [0, sy, -y1]])
print arr
warp = cv.fromarray(arr)
cv.ShowImage("source", source)
dest = cv.CreateImage((Xn, Yn), source.depth, source.nChannels)
cv.WarpAffine(source, dest, warp)
cv.SetImageROI(dest, (0, 0, int(width), int(height)))
cv.ShowImage("dest", dest)
cv.SaveImage(sys.argv[2], dest)
cv.WaitKey(0)
Estimate affine transform from two images (C++):
#include <stdio.h>
#include <iostream>
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/calib3d/calib3d.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/nonfree/nonfree.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <algorithm>
using namespace cv;
void readme();
bool cmpfun(DMatch a, DMatch b) { return a.distance < b.distance; }
/** #function main */
int main( int argc, char** argv )
{
if( argc != 3 )
{
return -1;
}
Mat img_1 = imread( argv[1], CV_LOAD_IMAGE_GRAYSCALE );
Mat img_2 = imread( argv[2], CV_LOAD_IMAGE_GRAYSCALE );
if( !img_1.data || !img_2.data )
{
return -1;
}
//-- Step 1: Detect the keypoints using SURF Detector
int minHessian = 400;
SurfFeatureDetector detector( minHessian );
std::vector<KeyPoint> keypoints_1, keypoints_2;
detector.detect( img_1, keypoints_1 );
detector.detect( img_2, keypoints_2 );
//-- Step 2: Calculate descriptors (feature vectors)
SurfDescriptorExtractor extractor;
Mat descriptors_1, descriptors_2;
extractor.compute( img_1, keypoints_1, descriptors_1 );
extractor.compute( img_2, keypoints_2, descriptors_2 );
//-- Step 3: Matching descriptor vectors with a brute force matcher
BFMatcher matcher(NORM_L2, false);
std::vector< DMatch > matches;
matcher.match( descriptors_1, descriptors_2, matches );
double max_dist = 0;
double min_dist = 100;
//-- Quick calculation of max and min distances between keypoints
for( int i = 0; i < descriptors_1.rows; i++ )
{ double dist = matches[i].distance;
if( dist < min_dist ) min_dist = dist;
if( dist > max_dist ) max_dist = dist;
}
printf("-- Max dist : %f \n", max_dist );
printf("-- Min dist : %f \n", min_dist );
//-- Draw only "good" matches (i.e. whose distance is less than 2*min_dist )
//-- PS.- radiusMatch can also be used here.
sort(matches.begin(), matches.end(), cmpfun);
std::vector< DMatch > good_matches;
vector<Point2f> match1, match2;
for (int i = 0; i < 3; ++i)
{
good_matches.push_back( matches[i]);
Point2f pt1 = keypoints_1[matches[i].queryIdx].pt;
Point2f pt2 = keypoints_2[matches[i].trainIdx].pt;
match1.push_back(pt1);
match2.push_back(pt2);
printf("%3d pt1: (%.2f, %.2f) pt2: (%.2f, %.2f)\n", i, pt1.x, pt1.y, pt2.x, pt2.y);
}
//-- Draw matches
Mat img_matches;
drawMatches( img_1, keypoints_1, img_2, keypoints_2, good_matches, img_matches,
Scalar::all(-1), Scalar::all(-1), vector<char>(), DrawMatchesFlags::NOT_DRAW_SINGLE_POINTS);
//-- Show detected matches
imshow("Matches", img_matches );
imwrite("matches.png", img_matches);
waitKey(0);
Mat fun = getAffineTransform(match1, match2);
for (int i = 0; i < fun.rows; ++i)
{
for (int j = 0; j < fun.cols; j++)
{
printf("%.2f ", fun.at<float>(i,j));
}
printf("\n");
}
return 0;
}
/** #function readme */
void readme()
{
std::cout << " Usage: ./SURF_descriptor <img1> <img2>" << std::endl;
}
The cv::Mat getAffineTransform returns is made of doubles, not of floats. The matrix you get probably is fine, you just have to change the printf command in your loops to
printf("%.2f ", fun.at<double>(i,j));
or even easier: Replace this manual output with
std::cout << fun << std::endl;
It's shorter and you don't have to care about data types yourself.

Cuda Bayer/CFA demosaicing example

I've written a CUDA4 Bayer demosaicing routine, but it's slower than single threaded CPU code, running on a16core GTS250.
Blocksize is (16,16) and the image dims are a multiple of 16 - but changing this doesn't improve it.
Am I doing anything obviously stupid?
--------------- calling routine ------------------
uchar4 *d_output;
size_t num_bytes;
cudaGraphicsMapResources(1, &cuda_pbo_resource, 0);
cudaGraphicsResourceGetMappedPointer((void **)&d_output, &num_bytes, cuda_pbo_resource);
// Do the conversion, leave the result in the PBO fordisplay
kernel_wrapper( imageWidth, imageHeight, blockSize, gridSize, d_output );
cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0);
--------------- cuda -------------------------------
texture<uchar, 2, cudaReadModeElementType> tex;
cudaArray *d_imageArray = 0;
__global__ void convertGRBG(uchar4 *d_output, uint width, uint height)
{
uint x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
uint y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
uint i = __umul24(y, width) + x;
// input is GR/BG output is BGRA
if ((x < width) && (y < height)) {
if ( y & 0x01 ) {
if ( x & 0x01 ) {
d_output[i].x = (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2; // B
d_output[i].y = (tex2D(tex,x,y)); // G in B
d_output[i].z = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2; // R
} else {
d_output[i].x = (tex2D(tex,x,y)); //B
d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4; // G
d_output[i].z = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4; // R
}
} else {
if ( x & 0x01 ) {
// odd col = R
d_output[i].y = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4; // B
d_output[i].z = (tex2D(tex,x,y)); //R
d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4; // G
} else {
d_output[i].x = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2; // B
d_output[i].y = (tex2D(tex,x,y)); // G in R
d_output[i].z = (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2; // R
}
}
}
}
void initTexture(int imageWidth, int imageHeight, uchar *imagedata)
{
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
cutilSafeCall( cudaMallocArray(&d_imageArray, &channelDesc, imageWidth, imageHeight) );
uint size = imageWidth * imageHeight * sizeof(uchar);
cutilSafeCall( cudaMemcpyToArray(d_imageArray, 0, 0, imagedata, size, cudaMemcpyHostToDevice) );
cutFree(imagedata);
// bind array to texture reference with point sampling
tex.addressMode[0] = cudaAddressModeClamp;
tex.addressMode[1] = cudaAddressModeClamp;
tex.filterMode = cudaFilterModePoint;
tex.normalized = false;
cutilSafeCall( cudaBindTextureToArray(tex, d_imageArray) );
}
There aren't any obvious bugs in your code, but there are several obvious performance opportunities:
1) for best performance, you should use texture to stage into shared memory - see the 'SobelFilter' SDK sample.
2) As written, the code is writing bytes to global memory, which is guaranteed to incur a large performance hit. You can use shared memory to stage results before committing them to global memory.
3) There is a surprisingly big performance advantage to sizing blocks in a way that match the hardware's texture cache attributes. On Tesla-class hardware, the optimal block size for kernels using the same addressing scheme as your kernel is 16x4. (64 threads per block)
For workloads like this, it may be hard to compete with optimized CPU code. SSE2 can do 16 byte-sized operations in a single instruction, and CPUs are clocked about 5 times as fast.
Based on answer on Nvidia forums, here (for the search engines) is a slightly more optomised version which writes a 2x2 block of pixels in each thread. Although the difference in speed isn't measurable on my setup.
Note it should be called with a gridsize half the size of the image;
dim3 blockSize(16, 16); // for example
dim3 gridSize((width/2) / blockSize.x, (height/2) / blockSize.y);
__global__ void d_convertGRBG(uchar4 *d_output, uint width, uint height)
{
uint x = 2 * (__umul24(blockIdx.x, blockDim.x) + threadIdx.x);
uint y = 2 * (__umul24(blockIdx.y, blockDim.y) + threadIdx.y);
uint i = __umul24(y, width) + x;
// input is GR/BG output is BGRA
if ((x < width-1) && (y < height-1)) {
// x+1, y+1:
d_output[i+width+1] = make_uchar4( (tex2D(tex,x+2,y+1)+tex2D(tex,x,y+1))/2, // B
(tex2D(tex,x+1,y+1)), // G in B
(tex2D(tex,x+1,y+2)+tex2D(tex,x+1,y))/2, // R
0xff);
// x, y+1:
d_output[i+width] = make_uchar4( (tex2D(tex,x,y+1)), //B
(tex2D(tex,x+1,y+1) + tex2D(tex,x-1,y+1)+tex2D(tex,x,y+2)+tex2D(tex,x,y))/4, // G
(tex2D(tex,x+1,y+2) + tex2D(tex,x+1,y)+tex2D(tex,x-1,y+2)+tex2D(tex,x-1,y))/4, // R
0xff);
// x+1, y:
d_output[i+1] = make_uchar4( (tex2D(tex,x,y-1) + tex2D(tex,x+2,y-1)+tex2D(tex,x,y+1)+tex2D(tex,x+2,y-1))/4, // B
(tex2D(tex,x+2,y) + tex2D(tex,x,y)+tex2D(tex,x+1,y+1)+tex2D(tex,x+1,y-1))/4, // G
(tex2D(tex,x+1,y)), //R
0xff);
// x, y:
d_output[i] = make_uchar4( (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2, // B
(tex2D(tex,x,y)), // G in R
(tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2, // R
0xff);
}
}
There are many if's and else's in the code. If you structure the code to eliminate all the conditional statements then you will get a huge performance boost as branching is a performance killer. It is indeed possible to remove the branches. There are exactly 30 cases which you will have to code explicitly. I have implemented it on CPU and it does not contain any conditional statements. I am thinking of making a blog explaining it. Will post it once its done.

Resources