output shifted in template matching - halide

I'm trying to create a template matching program that uses the following formula to determine the fit between the template and the image:
My code is as follows:
Halide::Var x, y, xt, yt;
Halide::RDom r(0, t.width(), 0, t.height());
Halide::Func limit, compare;
limit = Halide::BoundaryConditions::constant_exterior(input,255);
compare(x, y) = limit(x,y);
compare(x, y) = Halide::cast<uint8_t>(Halide::pow(t(0 + r.x, 0 + r.y) - limit(x + r.x, y + r.y),2));
Halide::Image<uint8_t> output(input.width(),input.height());
output = compare.realize(input.width(),input.height());
After executing the code above, the result image is shifted, as in this example:
Original image:
Template:
Result:
How can I prevent the image from shifting?

Not sure what the types of t and input are, so the following might overflow, but I think you want something like:
Halide::Var x, y, xt, yt;
Halide::RDom r(0, t.width(), 0, t.height());
Halide::Func limit, compare;
limit = Halide::BoundaryConditions::constant_exterior(input,255);
compare(x, y) = limit(x,y);
compare(x, y) = Halide::cast<uint8_t>(sum(Halide::pow(t(r.x, r.y) - limit(x + r.x - t.width()/2, y + r.y - t.height()/2),2))/(t.width()*t.height()));
Halide::Image<uint8_t> output(input.width(),input.height());
output = compare.realize(input.width(),input.height());

There is no sum. You are only storing the squared difference of the lower right pixel of the template image. There are some other things too, which I commented on:
Halide::Var x, y, xt, yt;
Halide::RDom r(0, t.width(), 0, t.height());
Halide::Func limit, compare;
// There's no point comparing the template to pixels not in the input.
// If you really wanted to do that, you should start at
// -t.width(), -t.height() and wd, ht will be plus the template size
// instead of minus.
int wd = input.width () - t.width ();
int ht = input.height() - t.height();
// constant_exterior returns a Func.
// We can copy all dimensions with an underscore.
limit(_) = Halide::BoundaryConditions::constant_exterior(input,255)(_) / 255.f;
Func tf;
tf(_) = t(_) / 255.f;
// Not necessary now and even so, should have been set to undef< uint8_t >().
// compare(x, y) = limit(x,y);
// Expr are basically cut and pasted to where they are used.
Expr sq_dif = Halide::pow(tf(r.x, r.y) - limit(x + r.x, y + r.y), 2);
Expr t_count = t.width() * t.height();
Expr val = Halide::sum(sq_dif) / t_count;
compare(x, y) = Halide::cast<uint8_t>(Halide::clamp(255 * val, 0, 255));
// The size of output is set by realize().
Halide::Image<uint8_t> output;
output = compare.realize(wd, ht);
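For readers who want to check the arithmetic outside Halide, here is a minimal pure-Python sketch of the same idea: for every placement of the template, sum the squared differences over the whole template and divide by the template's pixel count. The formula image from the question is not available, so the mean squared difference used here is inferred from the answer code above; the function name match_template and the tiny test images are illustrative assumptions.

```python
def match_template(image, template):
    """Return a 2D list of mean squared differences (lower is a better fit).

    Only placements where the template lies fully inside the image are scored,
    so the output is smaller than the input and no boundary value is needed.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    scores = []
    for y in range(ih - th + 1):
        row = []
        for x in range(iw - tw + 1):
            total = 0.0
            # Sum over every template pixel, not just the last one.
            for ty in range(th):
                for tx in range(tw):
                    diff = template[ty][tx] - image[y + ty][x + tx]
                    total += diff * diff
            row.append(total / (tw * th))
        scores.append(row)
    return scores


image = [
    [0, 0, 0, 0],
    [0, 9, 8, 0],
    [0, 7, 6, 0],
    [0, 0, 0, 0],
]
template = [[9, 8], [7, 6]]

scores = match_template(image, template)
# Pick the placement (x, y) with the smallest score.
best = min((s, (x, y)) for y, row in enumerate(scores) for x, s in enumerate(row))
print(best)  # (0.0, (1, 1))
```

The score at output position (x, y) refers to the template's top-left corner being placed at (x, y), so the best match is reported at the corner of the matched region rather than its centre. Subtracting half the template size from the sampling coordinates, as the first answer does, recentres the result instead.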

Minimum rectangles required to cover a given rectangular area

I have a rectangular area of dimension: n*m. I also have a smaller rectangle of dimension: x*y. What will be the minimum number of smaller rectangles required to cover all the area of the bigger rectangle?
It's not necessary to pack the smaller rectangles. They are allowed to overlap with each other, cross the borders of the bigger rectangle if required. The only requirement is that we have to use the fewest number of x*y rectangles.
Another thing is that we can rotate the smaller rectangles if required (90 degrees rotation I mean), to minimise the number.
n,m,x and y: all are natural numbers. x, y need not be factors of n,m.
I couldn't solve it in the time given, nor could I figure out an approach. I started by considering the different cases of n, m being divisible by x, y or not.
update
sample test cases:
n*m = 3*3, x*y = 2*2. Result should be 4
n*m = 5*6, x*y = 3*2. Result should be 5
n*m = 68*68, x*y = 9*8. Result should be 65

(UPDATE: See newer version below.)
I think (but I have no proof at this time) that irregular tilings can be discarded, and finding the optimal solution means finding the point at which to switch the direction of the tiles.
You start off with a basic grid like this:
and the optimal solution will take one of these two forms:
So for each of these points, you calculate the number of required tiles for both options:
This is a very basic implementation. The "horizontal" and "vertical" values in the results are the number of tiles in the non-rotated zone (indicated in pink in the images).
The algorithm probably checks some things twice, and could use some memoization to speed it up.
(Testing has shown that you need to run the algorithm a second time with the x and y parameter switched, and that checking both types of solution is indeed necessary.)
function rectangleCover(n, m, x, y, rotated) {
var width = Math.ceil(n / x), height = Math.ceil(m / y);
var cover = {num: width * height, rot: !!rotated, h: width, v: height, type: 1};
for (var i = 0; i <= width; i++) {
for (var j = 0; j <= height; j++) {
var rect = i * j;
var top = simpleCover(n, m - y * j, y, x);
var side = simpleCover(n - x * i, y * j, y, x);
var total = rect + side + top;
if (total < cover.num) {
cover = {num: total, rot: !!rotated, h: i, v: j, type: 1};
}
var top = simpleCover(x * i, m - y * j, y, x);
var side = simpleCover(n - x * i, m, y, x);
var total = rect + side + top;
if (total < cover.num) {
cover = {num: total, rot: !!rotated, h: i, v: j, type: 2};
}
}
}
if (!rotated && n != m && x != y) {
var c = rectangleCover(n, m, y, x, true);
if (c.num < cover.num) cover = c;
}
return cover;
function simpleCover(n, m, x, y) {
return (n > 0 && m > 0) ? Math.ceil(n / x) * Math.ceil(m / y) : 0;
}
}
document.write(JSON.stringify(rectangleCover(3, 3, 2, 2)) + "<br>");
document.write(JSON.stringify(rectangleCover(5, 6, 3, 2)) + "<br>");
document.write(JSON.stringify(rectangleCover(22, 18, 5, 3)) + "<br>");
document.write(JSON.stringify(rectangleCover(1000, 1000, 11, 17)));
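As a cross-check of the logic, here is a compact Python sketch of the same two-zone search. It returns only the tile count (not the layout details), and it always tries both tile orientations rather than skipping the second pass for squares. The names two_zone_cover and rectangle_cover are illustrative, not part of the original answer.

```python
from math import ceil


def simple_cover(n, m, x, y):
    """Tiles needed to cover an n*m area with a plain grid of x*y tiles."""
    return ceil(n / x) * ceil(m / y) if n > 0 and m > 0 else 0


def two_zone_cover(n, m, x, y):
    """Try every i*j block of unrotated tiles; fill the rest with rotated tiles."""
    best = simple_cover(n, m, x, y)
    for i in range(ceil(n / x) + 1):
        for j in range(ceil(m / y) + 1):
            rect = i * j
            # Type 1: rotated tiles span the full width above the block,
            # plus a rotated strip beside the block.
            type1 = (rect
                     + simple_cover(n, m - y * j, y, x)
                     + simple_cover(n - x * i, y * j, y, x))
            # Type 2: rotated tiles span the full height beside the block,
            # plus a rotated strip above the block.
            type2 = (rect
                     + simple_cover(x * i, m - y * j, y, x)
                     + simple_cover(n - x * i, m, y, x))
            best = min(best, type1, type2)
    return best


def rectangle_cover(n, m, x, y):
    # Run the search for both orientations of the "unrotated" tile.
    return min(two_zone_cover(n, m, x, y), two_zone_cover(n, m, y, x))


print(rectangle_cover(3, 3, 2, 2))  # 4
print(rectangle_cover(5, 6, 3, 2))  # 5
print(rectangle_cover(68, 68, 9, 8))
```

The first two calls reproduce the sample results from the question. Like the JavaScript version, this search is limited to two-zone layouts, so it cannot find the 65-tile solution for the 68×68 case.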
This is the counter-example Evgeny Kluev provided: (68, 68, 9, 8) which returns 68 while there is a solution using just 65 rectangles, as demonstrated in this image:
Update: improved algorithm
The counter-example shows the way for a generalisation of the algorithm: work from the 4 corners, try all unique combinations of orientations, and every position of the borders a, b, c and d between the regions; if a rectangle is left uncovered in the middle, try both orientations to cover it:
Below is a simple, unoptimised implementation of this idea; it probably checks some configurations multiple times, and it takes 6.5 seconds for the 11×17/1000×1000 test, but it finds the correct solution for the counter-example and the other tests from the previous version, so the logic seems sound.
These are the five rotations and the numbering of the regions used in the code. If the large rectangle is a square, only the first 3 rotations are checked; if the small rectangles are squares, only the first rotation is checked. X[i] and Y[i] are the size of the rectangles in region i, and w[i] and h[i] are the width and height of region i expressed in number of rectangles.
function rectangleCover(n, m, x, y) {
var X = [[x,x,x,y],[x,x,y,y],[x,y,x,y],[x,y,y,x],[x,y,y,y]];
var Y = [[y,y,y,x],[y,y,x,x],[y,x,y,x],[y,x,x,y],[y,x,x,x]];
var rotations = x == y ? 1 : n == m ? 3 : 5;
var minimum = Math.ceil((n * m) / (x * y));
var cover = simpleCover(n, m, x, y);
for (var r = 0; r < rotations; r++) {
for (var w0 = 0; w0 <= Math.ceil(n / X[r][0]); w0++) {
var w1 = Math.ceil((n - w0 * X[r][0]) / X[r][1]);
if (w1 < 0) w1 = 0;
for (var h0 = 0; h0 <= Math.ceil(m / Y[r][0]); h0++) {
var h3 = Math.ceil((m - h0 * Y[r][0]) / Y[r][3]);
if (h3 < 0) h3 = 0;
for (var w2 = 0; w2 <= Math.ceil(n / X[r][2]); w2++) {
var w3 = Math.ceil((n - w2 * X[r][2]) / X[r][3]);
if (w3 < 0) w3 = 0;
for (var h2 = 0; h2 <= Math.ceil(m / Y[r][2]); h2++) {
var h1 = Math.ceil((m - h2 * Y[r][2]) / Y[r][1]);
if (h1 < 0) h1 = 0;
var total = w0 * h0 + w1 * h1 + w2 * h2 + w3 * h3;
var X4 = w3 * X[r][3] - w0 * X[r][0];
var Y4 = h0 * Y[r][0] - h1 * Y[r][1];
if (X4 * Y4 > 0) {
total += simpleCover(Math.abs(X4), Math.abs(Y4), x, y);
}
if (total == minimum) return minimum;
if (total < cover) cover = total;
}
}
}
}
}
return cover;
function simpleCover(n, m, x, y) {
return Math.min(Math.ceil(n / x) * Math.ceil(m / y),
Math.ceil(n / y) * Math.ceil(m / x));
}
}
document.write("(3, 3, 2, 2) → " + rectangleCover(3, 3, 2, 2) + "<br>");
document.write("(5, 6, 3, 2) → " + rectangleCover(5, 6, 3, 2) + "<br>");
document.write("(22, 18, 5, 3) → " + rectangleCover(22, 18, 5, 3) + "<br>");
document.write("(68, 68, 8, 9) → " + rectangleCover(68, 68, 8, 9) + "<br>");
Update: fixed calculation of center region
As @josch pointed out in the comments, the calculation of the width and height of the center region 4 is not done correctly in the above code; sometimes its size is overestimated, which results in the total number of rectangles being overestimated. An example where this happens is (1109, 783, 170, 257), which returns 23 while there exists a solution of 22. Below is a new code version with the correct calculation of the size of region 4.
function rectangleCover(n, m, x, y) {
var X = [[x,x,x,y],[x,x,y,y],[x,y,x,y],[x,y,y,x],[x,y,y,y]];
var Y = [[y,y,y,x],[y,y,x,x],[y,x,y,x],[y,x,x,y],[y,x,x,x]];
var rotations = x == y ? 1 : n == m ? 3 : 5;
var minimum = Math.ceil((n * m) / (x * y));
var cover = simpleCover(n, m, x, y);
for (var r = 0; r < rotations; r++) {
for (var w0 = 0; w0 <= Math.ceil(n / X[r][0]); w0++) {
var w1 = Math.ceil((n - w0 * X[r][0]) / X[r][1]);
if (w1 < 0) w1 = 0;
for (var h0 = 0; h0 <= Math.ceil(m / Y[r][0]); h0++) {
var h3 = Math.ceil((m - h0 * Y[r][0]) / Y[r][3]);
if (h3 < 0) h3 = 0;
for (var w2 = 0; w2 <= Math.ceil(n / X[r][2]); w2++) {
var w3 = Math.ceil((n - w2 * X[r][2]) / X[r][3]);
if (w3 < 0) w3 = 0;
for (var h2 = 0; h2 <= Math.ceil(m / Y[r][2]); h2++) {
var h1 = Math.ceil((m - h2 * Y[r][2]) / Y[r][1]);
if (h1 < 0) h1 = 0;
var total = w0 * h0 + w1 * h1 + w2 * h2 + w3 * h3;
var X4 = n - w0 * X[r][0] - w2 * X[r][2];
var Y4 = m - h1 * Y[r][1] - h3 * Y[r][3];
if (X4 > 0 && Y4 > 0) {
total += simpleCover(X4, Y4, x, y);
} else {
X4 = n - w1 * X[r][1] - w3 * X[r][3];
Y4 = m - h0 * Y[r][0] - h2 * Y[r][2];
if (X4 > 0 && Y4 > 0) {
total += simpleCover(X4, Y4, x, y);
}
}
if (total == minimum) return minimum;
if (total < cover) cover = total;
}
}
}
}
}
return cover;
function simpleCover(n, m, x, y) {
return Math.min(Math.ceil(n / x) * Math.ceil(m / y),
Math.ceil(n / y) * Math.ceil(m / x));
}
}
document.write("(3, 3, 2, 2) → " + rectangleCover(3, 3, 2, 2) + "<br>");
document.write("(5, 6, 3, 2) → " + rectangleCover(5, 6, 3, 2) + "<br>");
document.write("(22, 18, 5, 3) → " + rectangleCover(22, 18, 5, 3) + "<br>");
document.write("(68, 68, 9, 8) → " + rectangleCover(68, 68, 9, 8) + "<br>");
document.write("(1109, 783, 170, 257) → " + rectangleCover(1109, 783, 170, 257) + "<br>");
Update: non-optimality and recursion
It is indeed possible to create input for which the algorithm does not find the optimal solution. For the example (218, 196, 7, 15) it returns 408, but there is a solution with 407 rectangles. This solution has a center region sized 22×14, which can be covered by three 7×15 rectangles; however, the simpleCover function only checks options where all rectangles have the same orientation, so it only finds a solution with 4 rectangles for the center region.
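The 22×14 center region can be checked directly. The sketch below is a Python port of simpleCover plus a small brute-force coverage test. The specific three-rectangle layout (two rectangles lying flat, one standing upright) is one arrangement I constructed for illustration, and the helper name covers is an assumption.

```python
from math import ceil


def simple_cover(n, m, x, y):
    """Best single-orientation grid cover of an n*m area with x*y tiles."""
    return min(ceil(n / x) * ceil(m / y), ceil(n / y) * ceil(m / x))


def covers(n, m, placed):
    """True if the rectangles (left, top, width, height) cover every cell of n*m."""
    grid = [[False] * n for _ in range(m)]
    for left, top, w, h in placed:
        for row in range(max(0, top), min(m, top + h)):
            for col in range(max(0, left), min(n, left + w)):
                grid[row][col] = True
    return all(all(row) for row in grid)


# Two 15x7 rectangles stacked on the left, one 7x15 rectangle on the right.
mixed = [(0, 0, 15, 7), (0, 7, 15, 7), (15, 0, 7, 15)]

print(simple_cover(22, 14, 7, 15))  # 4
print(covers(22, 14, mixed))        # True
```

Mixing orientations saves one rectangle here, which is exactly the case a single-orientation grid cannot express.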
This can of course be countered by using the algorithm recursively, and calling rectangleCover again for the center region. To avoid endless recursion, you should limit the recursion depth, and use simpleCover once you've reached a certain recursion level. To avoid the code becoming unusably slow, add memoization of intermediate results (but don't use results that were calculated in a deeper recursion level for a higher recursion level).
When adding one level of recursion and memoization of intermediate results, the algorithm finds the optimal solution of 407 for the example mentioned above, but of course takes a lot more time. Again, I have no proof that adding a certain recursion depth (or even unlimited recursion) will result in the algorithm being optimal.

Halide internal error issue

Here's the code. I'm using Halide on VS2013, Win64 trunk of Aug 5, 2015. When I execute diag.compile_to_lowered_stmt("diag.html", {}, HTML) (with a 16MB stack), I get the following error message:
"Internal error at E:\Code\Halide\src\IR.cpp:160
Condition failed: a.type() == b.type()
LT of mismatched types"
I have confirmed that the error occurs because of the line:
diag(x, y, c) = select(m135(x, y) > m45(x, y), f45(x, y, c), select(m45(x, y) > m135(x, y), f135(x, y, c), f4x4(x, y, c)));
The only way I've been able to remove the error is to remove both selects (the function is unusable in that case, of course.) I've tried converting the condition to an Expr, and I've also checked the types of m45 and m135 (by assigning them to an Expr t1, and then looking at t1.type().) I note that changing the ">" to an "<" or even ">=" does NOT change the error message from "LT".
Any ideas?
Code is still the same as my previous post:
Image<uint8_t> orig_uint = Tools::load_image("../foo.ppm");
Var x, y, c;
Func orig("orig"), orig_lum("orig_lum"), m45("m45"), m135("m135"), f45("f45"), f135("f135"), f4x4_horiz("f4x4_horiz"), f4x4("f4x4"), diag("diag");
Func orig_clamped = BoundaryConditions::repeat_edge(orig_uint);
const float wta = 1.0f, wtb = 3.0f, wt0 = wta * wta, wt1 = wta * wtb, wt2 = wtb * wtb;
orig(x, y, c) = cast<float_t>(orig_clamped(x, y, c));
orig_lum(x, y) = 0.299f * orig(x, y, 0) + 0.587f * orig(x, y, 1) + 0.114f * orig(x, y, 2);
m45(x, y) = abs(orig_lum(x - 1, y - 1) - orig_lum(x, y)) + abs(orig_lum(x, y) - orig_lum(x + 1, y + 1)) + abs(orig_lum(x + 1, y + 1) - orig_lum(x + 2, y + 2));
m135(x, y) = abs(orig_lum(x + 2, y - 1) - orig_lum(x + 1, y)) + abs(orig_lum(x + 1, y) - orig_lum(x, y + 1)) + abs(orig_lum(x, y + 1) - orig_lum(x - 1, y + 2));
f45(x, y, c) = wta * (orig(x - 1, y - 1, c) + orig(x + 2, y + 2, c)) + wtb * (orig(x, y, c) + orig(x + 1, y + 1, c));
f135(x, y, c) = wta * (orig(x - 1, y + 2, c) + orig(x + 2, y - 1, c)) + wtb * (orig(x, y + 1, c) + orig(x + 1, y, c));
f4x4_horiz(x, y, c) = wta * (orig(x - 1, y, c) + orig(x + 2, y, c)) + wtb * (orig(x, y, c) + orig(x + 1, y, c));
f4x4(x, y, c) = wta * (f4x4_horiz(x, y - 1, c) + f4x4_horiz(x, y + 2, c)) + wtb * (f4x4_horiz(x, y, c) + f4x4_horiz(x, y + 1, c));
diag(x, y, c) = select(m135(x, y) > m45(x, y), f45(x, y, c), select(m45(x, y) > m135(x, y), f135(x, y, c), f4x4(x, y, c)));
// schedule
orig_lum.compute_root();
m45.compute_root().bound(x, 0, orig_uint.width()).bound(y, 0, orig_uint.height());
m135.compute_root().bound(x, 0, orig_uint.width()).bound(y, 0, orig_uint.height());
f45.compute_at(diag, x);
f135.compute_at(diag, x);
f4x4.compute_at(diag, x);
diag.compute_root();
// compile so we can take a look at the code
diag.compile_to_lowered_stmt("diag.html", {}, HTML); // stack oflo here

It's a bug in Halide. Fix pushed by Andrew Adams an hour ago (thanks!)

Ported code having issues, can't figure out how this should work

Okay, so I ported this bit of code from a JavaScript file to Lua. It's a solution to the bin packing problem. Basically, it is given a target rectangle size, "init(x,y)", and then a table of blocks to fill that rectangle, "fit(blocks)". However, when I run this I get the error "attempt to index local 'root' (a number value)". What is going wrong here?
I also don't fully understand how this code is working, someone helped me along with the porting process. When I pass a table "blocks" into the fit function, is it adding the attributes of block.fit.x and block.fit.y?
Any help is appreciated.
Edit: Fixed the error by changing "." to ":" when calling a method.
--ported from https://github.com/jakesgordon/bin-packing
local _M = {}
mt = {
init = function(t, x, y) --takes in dimensions of target rect.
t.root = { x = 0, y = 0, x = x, y = y }
end,
fit = function(t, blocks) --passes table "blocks"
local n, node, block
for k, block in pairs(blocks) do
if node == t.findNode(t.root, block.x, block.y) then
block.fit = t.splitNode(node, block.x, block.y)
end
end
end,
findNode = function(t, root, x, y)
if root.used then --if root.used then
return t.findNode(root.right, x, y) or t.findNode(root.down, x, y)
elseif (x <= root.x) and (y <= root.y) then
return root
else
return nil
end
end,
splitNode = function(t, node, x, y)
node.used = true
node.down = { x = node.x, y = node.y + y, x = node.x, y = node.y - y }
node.right = { x = node.x + x, y = node.y, x = node.x - x, y = y }
return node;
end,
}
setmetatable(_M, mt)
-- Let's do the object-like magic
mt.__index = function(t, k)
if nil ~= mt[k] then
return mt[k]
else
return t[k]
end
end
mt.__call = function(t, ...)
local new_instance = {}
setmetatable(new_instance, mt)
new_instance:init(...)
return new_instance
end
return _M

I don't know if this will help, but here is how I would have ported the code over to Lua.
local Packer = {}
Packer.__index = Packer
function Packer:findNode (root, w, h)
if root.used then
return self:findNode(root.right, w, h) or self:findNode(root.down, w, h)
elseif w <= root.w and h <= root.h then
return root
else
return nil
end
end
function Packer:fit (blocks)
local node
for _, block in pairs(blocks) do
node = self:findNode(self.root, block.w, block.h)
if node then
block.fit = self:splitNode(node, block.w, block.h)
end
end
end
function Packer.init (w, h)
local packer = {}
packer.root = {x = 0, y = 0, w = w, h = h}
return setmetatable(packer, Packer)
end
function Packer:splitNode (node, w, h)
node.used = true
node.down = {x = node.x, y = node.y + h, w = node.w, h = node.h - h}
node.right = {x = node.x + w, y = node.y, w = node.w - w, h = h }
return node
end
return Packer
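Just place it in a file such as packer.lua and import it into your main.lua with local Packer = require "packer". Create a packer with Packer.init(w, h) and call its methods with a colon, for example packer:fit(blocks).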
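To make the data flow easier to follow, and to answer the question of whether fit adds block.fit.x and block.fit.y, here is the same find/split logic as a small Python sketch. It mirrors the Lua answer above. Using dictionaries for nodes and blocks is my own choice, and the snake_case method names are illustrative.

```python
class Packer:
    def __init__(self, w, h):
        # The root node is the whole target rectangle.
        self.root = {"x": 0, "y": 0, "w": w, "h": h}

    def find_node(self, node, w, h):
        """Return the first free node that can hold a w*h block, or None."""
        if node.get("used"):
            return (self.find_node(node["right"], w, h)
                    or self.find_node(node["down"], w, h))
        if w <= node["w"] and h <= node["h"]:
            return node
        return None

    def split_node(self, node, w, h):
        """Mark node as used and split the leftover space into two free nodes."""
        node["used"] = True
        node["down"] = {"x": node["x"], "y": node["y"] + h,
                        "w": node["w"], "h": node["h"] - h}
        node["right"] = {"x": node["x"] + w, "y": node["y"],
                         "w": node["w"] - w, "h": h}
        return node

    def fit(self, blocks):
        for block in blocks:
            node = self.find_node(self.root, block["w"], block["h"])
            if node is not None:
                # The block gets a reference to its node; the position is
                # then available as block["fit"]["x"] and block["fit"]["y"].
                block["fit"] = self.split_node(node, block["w"], block["h"])


blocks = [{"w": 5, "h": 5} for _ in range(5)]
Packer(10, 10).fit(blocks)
positions = [(b["fit"]["x"], b["fit"]["y"]) if "fit" in b else None for b in blocks]
print(positions)  # [(0, 0), (5, 0), (0, 5), (5, 5), None]
```

The fifth block does not fit, so it simply never receives a fit entry. Callers need to check for that before reading the position.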

Lua, table converted to a number?

I am simply adding numbers together, but it keeps raising an error. I used type() to check whether vector is a table, and it always said it was, yet the error says it is a number.
Can anyone tell me why this is happening and how to fix it? (The variable vector is a Vector3 object.) Any help is greatly appreciated.
Vector3:
function new(x, y, z)
return setmetatable({x = x, y = y, z = z}, meta) --{} has public variables
end
All of the Vector3 file here: http://pastebin.com/csBmJG36
ERROR:
attempt to index local 'vector' (a number value)
SCRIPT:
function translate(object, x, y, z)
for i, v in pairs(object) do
if (i == "Vertices") then
for _, q in pairs(v) do
for l, vector in pairs(q) do
vector.x = vector.x + x;
vector.y = vector.y + y;
vector.z = vector.z + z;
end
end
end
end
end

Let's refactor your code by removing the loop-switch anti-pattern:
function translate(object, x, y, z)
for _, q in pairs(object.Vertices) do
for l, vector in pairs(q) do
-- Test the type of vector here...
vector.x = vector.x + x;
vector.y = vector.y + y;
vector.z = vector.z + z;
end
end
end
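So, the error occurs on an access to object.Vertices[_][l].x.
For that access to work, object.Vertices would have to be a list of lists of vectors, which would be a curious structure. If it is instead a plain list of vectors, the innermost loop iterates over the fields of each vector, so vector is a number (the x, y or z value), which matches the error message.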
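The effect of looping one level too deep can be reproduced in a few lines of Python. This sketch assumes that Vertices is a flat list of vectors, which the question does not confirm. The dictionary layout and the name mesh are illustrative only.

```python
def translate(obj, dx, dy, dz):
    # Iterate over the vectors themselves, not over their fields.
    for vector in obj["Vertices"]:
        vector["x"] += dx
        vector["y"] += dy
        vector["z"] += dz


mesh = {"Vertices": [{"x": 1, "y": 2, "z": 3}, {"x": 4, "y": 5, "z": 6}]}

# One loop too deep: the innermost loop visits each vector's fields,
# so every value it sees is a plain number rather than a vector.
inner_types = {type(value).__name__
               for q in mesh["Vertices"]
               for value in q.values()}
print(inner_types)  # {'int'}

translate(mesh, 10, 0, -1)
print(mesh["Vertices"][0])  # {'x': 11, 'y': 2, 'z': 2}
```

If the data really is shaped like this, dropping the innermost loop in the Lua code and using q directly as the vector would be the equivalent fix.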

Python performance: iteration and operations on nested lists

Problem
Hey folks. I'm looking for some advice on Python performance. Some background on my problem:
Given:
A (x,y) mesh of nodes each with a value (0...255) starting at 0
A list of N input coordinates each at a specified location within the range (0...x, 0...y)
A value Z that defines the "neighborhood" in count of nodes
Increment the value of the node at the input coordinate and the node's neighbors. Neighbors beyond the mesh edge are ignored. (No wrapping)
BASE CASE: A mesh of size 1024x1024 nodes, with 400 input coordinates and a range Z of 75 nodes.
Processing should be O(x*y*Z*N). I expect x, y and Z to remain roughly around the values in the base case, but the number of input coordinates N could increase up to 100,000. My goal is to minimize processing time.
Current results
Between my start and the comments below, we've got several implementations.
Running speed on my 2.26 GHz Intel Core 2 Duo with Python 2.6.1:
f1: 2.819s
f2: 1.567s
f3: 1.593s
f: 1.579s
f3b: 1.526s
f4: 0.978s
f1 is the initial naive implementation: three nested for loops.
f2 replaces the inner for loop with a list comprehension.
f3 is based on Andrei's suggestion in the comments and replaces the outer for with map()
f is Chris's suggestion in the answers below
f3b is kriss's take on f3
f4 is Alex's contribution.
Code is included below for your perusal.
Question
How can I further reduce the processing time? I'd prefer sub-1.0s for the test parameters.
Please, keep the recommendations to native Python. I know I can move to a third-party package such as numpy, but I'm trying to avoid any third party packages. Also, I've generated random input coordinates, and simplified the definition of the node value updates to keep our discussion simple. The specifics have to change slightly and are outside the scope of my question.
thanks much!
f1 is the initial naive implementation: three nested for loops.
def f1(x,y,n,z):
    rows = [[0]*x for i in xrange(y)]
    for i in range(n):
        inputX, inputY = (int(x*random.random()), int(y*random.random()))
        topleft = (inputX - z, inputY - z)
        for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            for j in xrange(max(0, topleft[1]), min(topleft[1]+(z*2), y)):
                # stop at 255 so values stay in the 0...255 range
                if rows[i][j] < 255: rows[i][j] += 1
f2 replaces the inner for loop with a list comprehension.
def f2(x,y,n,z):
    rows = [[0]*x for i in xrange(y)]
    for i in range(n):
        inputX, inputY = (int(x*random.random()), int(y*random.random()))
        topleft = (inputX - z, inputY - z)
        for i in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            l = max(0, topleft[1])
            r = min(topleft[1]+(z*2), y)
            rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE: f3 is based on Andrei's suggestion in the comments and replaces the outer for with map(). My first hack at this requires several out-of-local-scope lookups, which Guido specifically recommends against: local variable lookups are much faster than global or built-in variable lookups. I hardcoded all but the reference to the main data structure itself to minimize that overhead.
rows = [[0]*x for i in xrange(y)]
def f3(x,y,n,z):
    inputs = [(int(x*random.random()), int(y*random.random())) for i in range(n)]
    rows = map(g, inputs)
def g(input):
    inputX, inputY = input
    topleft = (inputX - 75, inputY - 75)
    for i in xrange(max(0, topleft[0]), min(topleft[0]+(75*2), 1024)):
        l = max(0, topleft[1])
        r = min(topleft[1]+(75*2), 1024)
        rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE3: ChristopheD also pointed out a couple of improvements.
def f(x,y,n,z):
    rows = [[0] * y for i in xrange(x)]
    rn = random.random
    for i in xrange(n):
        topleft = (int(x*rn()) - z, int(y*rn()) - z)
        l = max(0, topleft[1])
        r = min(topleft[1]+(z*2), y)
        for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
UPDATE4: kriss added a few improvements to f3, replacing min/max with the new ternary operator syntax.
def f3b(x,y,n,z):
    rn = random.random
    rows = [g1(x, y, z) for x, y in [(int(x*rn()), int(y*rn())) for i in xrange(n)]]
def g1(x, y, z):
    l = y - z if y - z > 0 else 0
    r = y + z if y + z < 1024 else 1024
    for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
        rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]
UPDATE5: Alex weighed in with his substantive revision, adding a separate map() operation to cap the values at 255 and removing all non-local-scope lookups. The perf differences are non-trivial.
def f4(x,y,n,z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange
    inc = (1).__add__
    sat = (0xff).__and__
    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        b = max(0, inputX - z)
        t = min(inputX + z, x)
        l = max(0, inputY - z)
        r = min(inputY + z, y)
        for i in range(b, t):
            rows[i][l:r] = map(inc, rows[i][l:r])
    for i in range(x):
        rows[i] = map(sat, rows[i])
Also, since we all seem to be hacking around with variations, here's my test harness to compare speeds (improved by ChristopheD):
import timeit
# Arguments are passed through in the order the functions expect: (x, y, n, z).
def timing(f,x,y,n,z):
    fn = "%s(%d,%d,%d,%d)" % (f.__name__, x, y, n, z)
    ctx = "from __main__ import %s" % f.__name__
    results = timeit.Timer(fn, ctx).timeit(10)
    return "%4.4s: %.3f" % (f.__name__, results / 10.0)
if __name__ == "__main__":
    print timing(f, 1024, 1024, 400, 75)
    #add more here.

On my (slow-ish;-) first-day Macbook Air, 1.6GHz Core 2 Duo, system Python 2.5 on MacOSX 10.5, after saving your code in op.py I see the following timings:
$ python -mtimeit -s'import op' 'op.f1()'
10 loops, best of 3: 5.58 sec per loop
$ python -mtimeit -s'import op' 'op.f2()'
10 loops, best of 3: 3.15 sec per loop
So, my machine is slower than yours by a factor of a bit more than 1.9.
The fastest code I have for this task is:
def f3(x=x,y=y,n=n,z=z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange
    inc = (1).__add__
    sat = (0xff).__and__
    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        b = max(0, inputX - z)
        t = min(inputX + z, x)
        l = max(0, inputY - z)
        r = min(inputY + z, y)
        for i in range(b, t):
            rows[i][l:r] = map(inc, rows[i][l:r])
    for i in range(x):
        rows[i] = map(sat, rows[i])
which times as:
$ python -mtimeit -s'import op' 'op.f3()'
10 loops, best of 3: 3 sec per loop
so, a very modest speedup, projecting to more than 1.5 seconds on your machine - well above the 1.0 you're aiming for:-(.
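One caveat about sat = (0xff).__and__: a bitwise AND with 0xff keeps only the low 8 bits, so a cell that has been incremented 256 times becomes 0 instead of staying at 255. With 400 points this is unlikely to matter, but it could at the 100,000 points mentioned in the question. Below is a minimal Python 3 sketch of the same approach with a true clamp, assuming saturation at 255 is the intended behaviour. The function name increment_neighborhoods and the rng parameter are illustrative, and no claim is made about its speed relative to the versions above.

```python
import random


def increment_neighborhoods(x, y, n, z, rng=random):
    """Increment a (2z x 2z) neighbourhood for n random points, capped at 255."""
    rows = [[0] * y for _ in range(x)]
    for _ in range(n):
        px, py = rng.randrange(x), rng.randrange(y)
        # Clip the neighbourhood to the mesh; l and r do not depend on the row.
        b, t = max(0, px - z), min(px + z, x)
        l, r = max(0, py - z), min(py + z, y)
        for i in range(b, t):
            row = rows[i]
            row[l:r] = [v + 1 for v in row[l:r]]
    # Clamp once at the end instead of testing on every increment.
    return [[min(v, 255) for v in row] for row in rows]


print(256 & 0xff)     # 0   (masking wraps around)
print(min(256, 255))  # 255 (clamping saturates)

# With z >= mesh size every point touches every cell, so 300 points
# must leave every cell at the cap.
grid = increment_neighborhoods(8, 8, 300, 8, random.Random(0))
print(all(v == 255 for row in grid for v in row))  # True
```

In Python 3, map() returns an iterator, so the list comprehensions here also replace the rows[i] = map(...) assignments, which would otherwise store map objects instead of lists.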
With a simple C-coded extension, exte.c...:
#include "Python.h"
static PyObject*
dopoint(PyObject* self, PyObject* args)
{
int x, y, z, px, py;
int b, t, l, r;
int i, j;
PyObject* rows;
if(!PyArg_ParseTuple(args, "iiiiiO",
&x, &y, &z, &px, &py, &rows
))
return 0;
b = px - z;
if (b < 0) b = 0;
t = px + z;
if (t > x) t = x;
l = py - z;
if (l < 0) l = 0;
r = py + z;
if (r > y) r = y;
for(i = b; i < t; ++i) {
PyObject* row = PyList_GetItem(rows, i);
for(j = l; j < r; ++j) {
PyObject* pyitem = PyList_GetItem(row, j);
long item = PyInt_AsLong(pyitem);
if (item < 255) {
PyObject* newitem = PyInt_FromLong(item + 1);
PyList_SetItem(row, j, newitem);
}
}
}
Py_RETURN_NONE;
}
static PyMethodDef exteMethods[] = {
{"dopoint", dopoint, METH_VARARGS, "process a point"},
{0}
};
void
initexte()
{
Py_InitModule("exte", exteMethods);
}
(note: I haven't checked it carefully -- I think it doesn't leak memory due to the correct interplay of reference stealing and borrowing, but it should be code inspected very carefully before being put in production;-), we could do
import exte
def f4(x=x,y=y,n=n,z=z):
    rows = [[0]*y for i in range(x)]
    rr = random.randrange
    for i in range(n):
        inputX, inputY = rr(x), rr(y)
        exte.dopoint(x, y, z, inputX, inputY, rows)
and the timing
$ python -mtimeit -s'import op' 'op.f4()'
10 loops, best of 3: 345 msec per loop
shows an acceleration of 8-9 times, which should put you in the ballpark you desire. I've seen a comment saying you don't want any third-party extension, but, well, this tiny extension you could make entirely your own;-). ((Not sure what licensing conditions apply to code on Stack Overflow, but I'll be glad to re-release this under the Apache 2 license or the like, if you need that;-)).

1. A (smaller) speedup could definitely be the initialization of your rows...
Replace
rows = []
for i in range(x):
    rows.append([0 for i in xrange(y)])
with
rows = [[0] * y for i in xrange(x)]
2. You can also avoid some lookups by moving random.random out of the loops (saves a little).
3. EDIT: after corrections -- you could arrive at something like this:
def f(x,y,n,z):
    rows = [[0] * y for i in xrange(x)]
    rn = random.random
    for i in xrange(n):
        topleft = (int(x*rn()) - z, int(y*rn()) - z)
        l = max(0, topleft[1])
        r = min(topleft[1]+(z*2), y)
        for u in xrange(max(0, topleft[0]), min(topleft[0]+(z*2), x)):
            rows[u][l:r] = [j+(j<255) for j in rows[u][l:r]]
EDIT: some new timings with timeit (10 runs) -- seems this provides only minor speedups:
import timeit
print timeit.Timer("f1(1024,1024,400,75)", "from __main__ import f1").timeit(10)
print timeit.Timer("f2(1024,1024,400,75)", "from __main__ import f2").timeit(10)
print timeit.Timer("f(1024,1024,400,75)", "from __main__ import f").timeit(10)
f1 21.1669280529
f2 12.9376120567
f 11.1249599457

In your f3 rewrite, g can be simplified. (This can also be applied to f4.)
You have the following code inside a for loop:
l = max(0, topleft[1])
r = min(topleft[1]+(75*2), 1024)
However, it appears that those values never change inside the for loop. So calculate them once, outside the loop instead.

Based on your f3 version, I played with the code. As l and r are constant within the loop, you can avoid recomputing them in g1's loop. Using the conditional expression (ternary if) instead of min and max also seems to be consistently faster. I also simplified the expression with topleft. On my system, the code below appears to be about 20% faster.
def f3b(x,y,n,z):
    rows = [g1(x, y, z) for x, y in [(int(x*random.random()), int(y*random.random())) for i in range(n)]]
def g1(x, y, z):
    l = y - z if y - z > 0 else 0
    r = y + z if y + z < 1024 else 1024
    for i in xrange(x - z if x - z > 0 else 0, x + z if x + z < 1024 else 1024 ):
        rows[i][l:r] = [j+(j<255) for j in rows[i][l:r]]

You can create your own Python module in C, and control the performance as you want:
http://docs.python.org/extending/
