Normally, and very inefficiently, a min/max filter is implemented using four nested for loops:
for( index1 < dy ) { // y loop
    for( index2 < dx ) { // x loop
        val = -infinity;
        for( index3 < StructuringElement.dy() ) { // kernel y
            for( index4 < StructuringElement.dx() ) { // kernel x
                pixel = src(index2 + index4, index1 + index3);
                val = (pixel > val) ? pixel : val; // max
            }
        }
        dst(index2, index1) = val;
    }
}
However, this approach is damn inefficient since it re-checks values that were already checked. So I am wondering: what methods are there to implement this so that previously checked values are reused on the next iteration?
Any assumptions regarding structuring element size/point of origin can be made.
Update: I am especially keen on any insights into this or a similar kind of implementation: http://dl.acm.org/citation.cfm?id=2114689
I have been following this question for some time, hoping someone would write a fleshed-out answer, since I am pondering the same problem.
Here is my own attempt so far; I have not tested this, but I think you can do repeated dilation and erosion with any structuring element, by only accessing each pixel twice:
Assumptions: Assume the structuring element/kernel is a KxL rectangle and the image is an NxM rectangle. Assume that K and L are odd.
The basic approach you outlined has four for loops and takes O(K*L*N*M) time to complete.
Often you want to dilate repeatedly with the same kernel, so the time is again multiplied by the desired number of dilations.
I have three basic ideas for speeding up the dilation:
dilation by a KxL kernel is equal to dilation by a Kx1 kernel followed by dilation by a 1xL kernel. You can do both of these dilations with only three for loops, in O(KNM) and O(LNM)
However you can do a dilation with a Kx1 kernel much faster: You only need to access each pixel once. For this you need a particular data structure, explained below. This allows you to do a single dilation in O(N*M), regardless of the kernel size
repeated dilation by a Kx1 kernel is equal to a single dilation by a larger kernel. If you dilate P times with a Kx1 kernel, this is equal to a single dilation with a ((K-1)*P + 1) x 1 kernel.
So you can do repeated dilation with any kernel size in a single pass, in O(N*M) time.
Now for a detailed description of step 2.
You need a queue with the following properties:
push an element to the back of the queue in constant time.
pop an element from the front of the queue in constant time.
query the current smallest or largest element in the queue in constant time.
How to build such a queue is described in this stackoverflow answer: Implement a queue in which push_rear(), pop_front() and get_min() are all constant time operations.
Unfortunately not much pseudocode, but the basic idea seems sound.
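For concreteness, here is a minimal C++ sketch of such a queue using the classic two-stack construction (my own illustration, not the code from the linked answer); Push, Pop and GetMaximum are all amortized O(1):

#include <cassert>
#include <stack>
#include <utility>
#include <algorithm>

// Queue with Push (to the back), Pop (from the front) and GetMaximum,
// all amortized O(1). Each stack entry stores (value, running max).
class MaxQueue {
public:
    void Push(int v) {
        int m = in_.empty() ? v : std::max(v, in_.top().second);
        in_.push({v, m});
    }
    void Pop() {
        if (out_.empty()) {                  // refill the output stack
            while (!in_.empty()) {
                int v = in_.top().first;
                in_.pop();
                int m = out_.empty() ? v : std::max(v, out_.top().second);
                out_.push({v, m});
            }
        }
        assert(!out_.empty());
        out_.pop();
    }
    int GetMaximum() const {                 // assumes the queue is not empty
        if (in_.empty())  return out_.top().second;
        if (out_.empty()) return in_.top().second;
        return std::max(in_.top().second, out_.top().second);
    }
private:
    std::stack<std::pair<int, int>> in_, out_;   // (value, max of everything at or below)
};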
Using such a queue, you can calculate a Kx1 dilation in a single pass:
Assert(StructuringElement.dy()==1);
int kernel_half = (StructuringElement.dx()-1) /2;
for( y < dy ) { // y loop
    queue = new empty queue; // start each row with a fresh queue
    for( x <= kernel_half ) { // initialize the queue with the pixels at positions 0 .. kernel_half
        queue.Push(src(x, y));
    }
    for( x < dx ) { // x loop
        // the queue now holds the pixels at positions max(0, x-kernel_half) .. min(dx-1, x+kernel_half);
        // get the current maximum of all values in the queue
        dst(x, y) = queue.GetMaximum();
        // remove the leftmost pixel (position x - kernel_half) from the queue
        if (x >= kernel_half)
            queue.Pop();
        // add the next pixel (position x + kernel_half + 1) to the queue
        if (x + kernel_half + 1 < dx)
            queue.Push(src(x + kernel_half + 1, y));
    }
}
The only approach I can think of is to buffer the maximum pixel values and the rows in which they are found so that you only have to do the full iteration over a kernel sized row/column when the maximum is no longer under it.
In the following C-like pseudo code, I have assumed signed integers, 2d row-major arrays for the source and destination and a rectangular kernel over [±dx, ±dy].
//initialise the maxima and their row positions
for(x=0; x < nx; ++x)
{
row[x] = -1;
buf[x] = 0;
}
for(sy=0; sy < ny; ++sy)
{
//update the maxima and their row positions
for(x=0; x < nx; ++x)
{
if(row[x] < max(sy-dy, 0))
{
//maximum out of scope, search column
row[x] = max(sy-dy, 0);
buf[x] = src[row[x]][x];
for(y=row[x]+1; y <= min(sy+dy, ny-1); ++y)
{
if(src[y][x]>=buf[x])
{
row[x] = y;
buf[x] = src[y][x];
}
}
}
else
{
//maximum in scope, check latest value
y = min(sy+dy, ny-1);
if(src[y][x] >= buf[x])
{
row[x] = y;
buf[x] = src[y][x];
}
}
}
//initialise maximum column position
col = -1;
for(sx=0; sx < nx; ++sx)
{
//update maximum column position
if(col<max(sx-dx, 0))
{
//maximum out of scope, search buffer
col = max(sx-dx, 0);
for(x=col+1; x <= min(sx+dx, nx-1); ++x)
{
if(buf[x] >= buf[col]) col = x;
}
}
else
{
//maximum in scope, check latest value
x = min(sx+dx, nx-1);
if(buf[x] >= buf[col]) col = x;
}
//assign maximum to destination
dest[sy][sx] = buf[col];
}
}
The worst case performance occurs when the source goes smoothly from a maximum at the top left to a minimum at the bottom right, forcing a full row or column scan at each step (although it's still more efficient than the original nested loops).
I would expect average case performance to be much better though, since regions containing increasing values (both row and column wise) will update the maximum before a scan is required.
That said, not having actually tested it I'd recommend that you run a few benchmarks rather than trust my gut feeling!
A theoretical way of improving the complexity would be to maintain a BST over the KxK window of pixels: delete the previous Kx1 column of pixels and add the next Kx1 column as the window slides. The cost of this operation would be about 2K log K and it would be repeated NxN times, so overall the computation time becomes NxNxKxlog K instead of NxNxKxK.
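For illustration, here is a rough, untested C++ sketch of this idea with std::multiset standing in for the balanced BST (the function name, layout and interior-only border handling are my own assumptions):

#include <set>
#include <vector>

// src and dst are row-major, ny rows by nx columns; K is the (odd) window size.
// Only fully interior pixels are computed, to keep the sketch short.
void max_filter_multiset(const std::vector<int>& src, std::vector<int>& dst,
                         int nx, int ny, int K)
{
    int r = K / 2;                               // window radius
    for (int y = r; y < ny - r; ++y) {
        std::multiset<int> window;
        // fill the window for the first interior pixel (y, r)
        for (int wy = y - r; wy <= y + r; ++wy)
            for (int wx = 0; wx < K; ++wx)
                window.insert(src[wy * nx + wx]);
        for (int x = r; x < nx - r; ++x) {
            dst[y * nx + x] = *window.rbegin();  // current maximum of the KxK window
            if (x + r + 1 < nx) {
                // slide right: drop the leftmost column, add the next column
                for (int wy = y - r; wy <= y + r; ++wy) {
                    window.erase(window.find(src[wy * nx + (x - r)]));
                    window.insert(src[wy * nx + (x + r + 1)]);
                }
            }
        }
    }
}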
The same kind of optimizations used in "non-maximum suppression" algorithms can be applied here:
http://www.vision.ee.ethz.ch/publications/papers/proceedings/eth_biwi_00446.pdf
In 1D, using a morphological wavelet transform, this can be done in O(N):
https://gist.github.com/matovitch/11206318
You could get O(N * M) in 2D. HugoRune's solution is way simpler and probably faster (though this one could probably be improved).
Related
I am solving this problem on CSES.
Given n planets, with exactly 1 teleporter on each planet which teleports us to some other planet (possibly the same), we have to solve q queries. Each query is associated with a start planet, x and a number of teleporters to traverse, k. For each query, we need to tell where we would reach after going through k teleporters.
I have attempted this problem using the binary lifting concept.
For each planet, I first saved the planets we would reach by going through 2^0, 2^1, 2^2, ... teleporters.
Now, as per the constraints (esp. for k) provided in the question, we need only store the values up to 2^31.
Then, for each query, starting from the start planet, I traverse through the teleporters using the data in the above created array (in 1) to mimic the binary expansion of k, the number of teleporters to traverse.
For example, if k = 5, i.e. (101)_2, and the initial planet is x, I first go (001)_2 = 1 planet ahead, using the array, let's say to planet y, and then (100)_2 = 4 planets ahead. The planet now reached is the required result of the query.
Unfortunately, I am receiving TLE (time limit exceeded) error in the last test case (test 12).
Here's my code for reference:
#include <cstdio>
#include <iostream>
#include <vector>
using namespace std;
typedef long long ll;

#define inp(x) ll x; scanf("%lld", &x)
void solve()
{
// Inputting the values of n, number of planets and q, number of queries.
inp(n);
inp(q);
// Inputting the location of next planet the teleporter on each planet points to, with correction for 0 - based indexing
vector<int> adj(n);
for(int i = 0; i < n; i++)
{
scanf("%d", &(adj[i]));
adj[i]--;
}
// maxN stores the maximum value till which we need to locate the next reachable plane, based on constraints.
// A value of 32 means that we'll only ever need to go at max 2^31 places away from the planet in query.
int maxN = 32;
// This array consists of the next planet we can reach from any planet.
// Specifically, par[i][j] is the planet we get to, on passing through 2^j teleporters starting from planet i.
vector<vector<int>> par(n, vector<int>(maxN, -1));
for(int i = 0; i < n; i++)
{
par[i][0] = adj[i];
}
for(int i = 1; i < maxN; i++)
{
for(int j = 0; j < n; j++)
{
ll p1 = par[j][i-1];
par[j][i] = par[p1][i-1];
}
}
// This task is done for each query.
for(int i = 0; i < q; i++)
{
// x is the initial planet, corrected for 0 - based indexing.
inp(x);
x--;
// k is the number of teleporters to traverse.
inp(k);
// cur is the planet we currently are at.
int cur = x;
// For every i'th bit in k that is 1, the current planet is moved to the planet we reach to by moving through 2^i teleporters from cur.
for(int i = 0; (1 << i) <= k ; i++)
{
if(k & (1 << i))
{
cur = par[cur][i];
}
}
// Once the full binary expansion of k is used up, we are at cur, so (cur + 1) is the result because of the judge's 1 - based indexing.
cout<<(cur + 1)<<endl;
}
}
The code gives the correct output in every test case, but undergoes TLE in the final one (the result in the final one is correct too, just a TLE occurs). According to my observation the complexity of the code is O(32 * q + n), which doesn't seem to exceed the 10^6 bound for linear-time code in 1 second.
Are there any hidden costs in the algorithm I may have missed, or some possible optimization?
Any help appreciated!
It looks to me like your code works (after fixing the scanf), but your par map could have 6.4M entries in it, and precalculating all of those might just get you over the 1s time limit.
Here are a few things to try, in order of complexity:
replace par with a single vector<int> and index it like par[i*32+j]. This will remove a lot of double indirections.
Buffer the output in a std::string and write it in one step at the end, in case there's some buffer flushing going on that you don't know about. I don't think so, but it's easy to try.
Starting at each planet, you enter a cycle in <= n steps. In O(n) time, you can precalculate the distance to the terminal cycle and the size of the terminal cycle for all planets. Using this information you can reduce each k to at most 20000, and that means you only need j <= 16.
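To illustrate the third suggestion, here is a rough sketch of that precomputation (my own illustration, not tested against the judge; names are made up): peel off the nodes that cannot be on a cycle, measure each terminal cycle once, then propagate the distance to the cycle backwards through the peeled nodes.

#include <vector>
using namespace std;

// adj[i] = planet reached from i through one teleporter (0-based).
// After the call: tail[i] = steps from i until its terminal cycle is entered,
//                 cyclen[i] = length of that terminal cycle.
void precompute_cycles(const vector<int>& adj, vector<int>& tail, vector<int>& cyclen)
{
    int n = (int)adj.size();
    vector<int> indeg(n, 0);
    for (int v : adj) indeg[v]++;

    // Peel nodes that cannot lie on a cycle (Kahn-style).
    vector<int> order;                       // peeled nodes, in removal order
    vector<char> on_cycle(n, 1);
    vector<int> stk;
    for (int i = 0; i < n; i++)
        if (indeg[i] == 0) stk.push_back(i);
    while (!stk.empty()) {
        int u = stk.back(); stk.pop_back();
        on_cycle[u] = 0;
        order.push_back(u);
        if (--indeg[adj[u]] == 0) stk.push_back(adj[u]);
    }

    // Measure each terminal cycle once.
    cyclen.assign(n, 0);
    vector<char> seen(n, 0);
    for (int i = 0; i < n; i++) {
        if (!on_cycle[i] || seen[i]) continue;
        int len = 0;
        for (int v = i; !seen[v]; v = adj[v]) { seen[v] = 1; len++; }
        for (int v = i, c = 0; c < len; v = adj[v], c++) cyclen[v] = len;
    }

    // Distance to the cycle: process peeled nodes in reverse removal order,
    // so adj[u] is always resolved before u.
    tail.assign(n, 0);
    for (int idx = (int)order.size() - 1; idx >= 0; idx--) {
        int u = order[idx];
        tail[u] = tail[adj[u]] + 1;
        cyclen[u] = cyclen[adj[u]];
    }
}

With these two arrays, a query (x, k) with k > tail[x] can first be reduced to tail[x] + (k - tail[x]) % cyclen[x] steps before walking the binary-lifting table.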
We use HMM (Hidden Markov Model) to localize a robot in a windy maze with damaged sensors. If he attempts to move in a direction, he will do so with a high probability, and a low chance to accidentally go to either side. If his movement would make him go over an obstacle, he will bounce back to the original tile.
From any given position, he can sense in all four directions. With high probability he will correctly notice an obstacle that is there, and with low probability he will see an obstacle where there is none.
We have a probability map for all possible places the robot might be in the maze, since he knows what the maze looks like. Initially it all starts evenly distributed.
I have completed the motion and sensing aspect of this and am getting the proper answers, but I am stuck on smoothing (backward algorithm).
Assume that the robot performs the following sequence of actions: senses, moves, senses, moves, senses. This gives us 3 states in our HMM model. Assume that the results I have at each step of the way so far are correct.
I am having a lot of trouble performing smoothing (backward algorithm), given that there are four conditional probabilities (one for each direction).
Assume SP is for smoothing probability, BP is for backward probability
Assume Sk is for a state, and Zk is for an observation at that state. The problem for me is figuring out how to construct my backwards equation given that each Zk is only for a single direction.
I know the algorithm for smoothing is: SP(k) is proportional to BP(k+1) * P(Sk | Z1:k)
Where BP(k+1), i.e. P(Z(k+1):n | Sk), is defined as:
if (k == n) return 1, else return the sum over states s of P(Z(k+1) | S(k+1)=s) * P(S(k+1)=s | Sk) * BP(k+2), evaluated at S(k+1)=s
This is where I am having my trouble. Mainly in the Conditional Probability portion of this equation. Because each spot has four different directions that it observed! In other words, each state has four different evidence variables as opposed to just one! Do I average these values? Do I do a separate summation for them? How do I account for multiple observations at a given state and properly condense it into this equation which only has room for one conditional probability?
Here is the code I have performing the smoothing:
public static void Smoothing(List<int[]> observations) {
int n = observations.Count; //n is Total length of evidence sequence
int k = n - 1; //k is the state we are trying to smooth. start with n-1
for (; k >= 1; k--) { //Smooth all the way back to the first state
for (int dir = 0; dir < 4; dir++) {
//We must smooth each direction separately
SmoothDirection(dir, observations, k, n);
}
Console.WriteLine($"Smoothing for k = {k}\n");
UpdateMapMotion(mapHistory[k]);
PrintMap();
}
}
public static void SmoothDirection(int dir, List<int[]> observations, int k, int n) {
var alphas = new double[ROWS, COLS];
var normalizer = 0.0;
int row, col;
foreach (var t in map) {
if (t.isObstacle) continue;
row = t.pos.y;
col = t.pos.x;
alphas[row, col] = mapHistory[k][row, col]
* Backwards(k, n, t, dir, observations, moves[^(n - k)]);
normalizer += alphas[row, col];
}
UpdateHistory(k, alphas, normalizer);
}
public static void UpdateHistory(int index, double[,] alphas, double normalizer) {
for (int r = 0; r < ROWS; r++) {
for (int c = 0; c < COLS; c++) {
mapHistory[index][r, c] = alphas[r, c] / normalizer;
}
}
}
public static double Backwards(int k, int n, Tile t, int dir, List<int[]> observations, int moveDir) {
if (k == n) return 1;
double p = 0;
var nextStates = GetPossibleNextStates(t, moveDir);
foreach (var s in nextStates) {
p += Cond_Prob(s.hasObstacle[dir], observations[^(n - k)][dir] == 1) * Trans_Prob(t, s, moveDir)
* Backwards(k+1, n, s, dir, observations, moves[^(n - k)]);
}
return p;
}
public static List<Tile> GetPossibleNextStates(Tile t, int direction) {
var tiles = new List<Tile>(); //Next States
var perpDirs = GetPerpendicularDir(direction); //Perpendicular Directions
//If obstacle in front of Tile t or on the sides, Tile t is a possible next state.
if (t.hasObstacle[direction] || t.hasObstacle[perpDirs[0]] || t.hasObstacle[perpDirs[1]])
tiles.Add(t);
//If there is no obstacle in front of Tile t, then that tile is a possible next state.
if (!t.hasObstacle[direction])
tiles.Add(GetTileAtPos(t.pos + directions[direction]));
//If there are no obstacles on the sides of Tile t, then those are possible next states.
foreach (var dir in perpDirs) {
if (!t.hasObstacle[dir])
tiles.Add(GetTileAtPos(t.pos + directions[dir]));
}
return tiles;
}
TL;DR : How do I perform smoothing (backward algorithm) in a Hidden Markov Model when there are 4 evidences at each state as opposed to just 1?
SOLVED!
It was actually much simpler than I imagined.
I don't actually need to do each iteration separately for each direction.
I just need to replace the Cond_Prob() function with Joint_Cond_Prob(), which finds the joint probability of all four directional observations at a given state.
So P(Zk|Sk) is actually P(Zk1:Zk4|Sk), which (assuming the four directional readings are conditionally independent given the state) is just P(Zk1|Sk) * P(Zk2|Sk) * P(Zk3|Sk) * P(Zk4|Sk)
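A minimal sketch of such a joint likelihood (written in C++ for brevity; the 0.9/0.1 sensor probabilities and the names are placeholders, not values from the question):

// Joint observation likelihood: product of the four directional sensor models,
// assuming the four readings are conditionally independent given the state.
double JointCondProb(const bool hasObstacle[4], const int observed[4],
                     double pHit = 0.9, double pFalse = 0.1)
{
    double p = 1.0;
    for (int dir = 0; dir < 4; dir++) {
        bool sensed = (observed[dir] == 1);
        double pDir = hasObstacle[dir]
                    ? (sensed ? pHit   : 1.0 - pHit)     // obstacle really there
                    : (sensed ? pFalse : 1.0 - pFalse);  // no obstacle there
        p *= pDir;
    }
    return p;
}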
So, I'm trying to implement selection sort in Cuda, but so far I haven't been as successful.
__device__ void selection_sort( int *data, int left, int right ){
for( int i = left ; i <= right ; ++i ){
int min_val = data[i];
int min_idx = i;
// Find the smallest value in the range [left, right].
for( int j = i+1 ; j <= right ; ++j ){
int val_j = data[j];
if( val_j < min_val ){
min_idx = j;
min_val = val_j;
}
}
// Swap the values.
if( i != min_idx ){
data[min_idx] = data[i];
data[i] = min_val;
}
}
}
My main attempt here is to find the minimum and parallelize the solution. Now, I realize the code looks very C++-ish, but I'm nowhere near skilled in CUDA.
Is there a way to parallelize the solution? Are there any more additions to be made?
Selection sort algorithm for N numbers can be roughly described as:
for i from N-1 down to 0
find the maximum element among data[0] ~ data[i]
swap that maximum element with data[i] within the data array
The first part (finding the maximum element) falls into a widely known and well documented class of problems called reduction. However, to perform the second part (swapping), you must track the index of the maximum element while comparing the values, and it is not so natural to do that while performing reduction. This is one of the reasons why selection sort does not port well to parallel architectures.
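For illustration, here is a sketch of what a warp-level reduction that carries the index along with the value could look like, using the CUDA 9+ shuffle intrinsics (my own illustration; the answer's code below uses the older __shfl_xor and values only):

// Warp-level argmax: after the loop, lane 0 holds the maximum value and its index.
__inline__ __device__ void warpArgMax(int& val, int& idx)
{
    for (int offset = 16; offset > 0; offset /= 2) {
        int otherVal = __shfl_down_sync(0xffffffff, val, offset);
        int otherIdx = __shfl_down_sync(0xffffffff, idx, offset);
        if (otherVal > val) {   // carry the index along with the value
            val = otherVal;
            idx = otherIdx;
        }
    }
}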
Also, you can see that the problem size diminishes by one for each loop, and this is another aspect of the selection sort algorithm that does not map well to parallel architectures. In the case of CUDA, 32 threads form a warp, and they execute at the same time. Although you can have an arbitrary number of threads active within a warp, it is generally not recommended because it is a loss of computing power.
I've tried to build a CUDA version of selection sort myself, but I stopped doing it because it seems there are better algorithms well suited for CUDA. But I'll just show you what I've done so far to illustrate why selection sort is not good for CUDA.
Firstly, start from a small and simple problem: sorting 32 elements. Since 32 threads form a warp, you can use shuffle instructions to find maximum value. (Full code)
// Finds the maximum element within a warp and gives the maximum element to
// thread with lane id 0. Note that other elements do not get lost but their
// positions are shuffled.
__inline__ __device__ int warpMax(int data, unsigned int threadId)
{
for (int mask = 16; mask > 0; mask /= 2) {
int dual_data = __shfl_xor(data, mask, 32);
if (threadId & mask)
data = min(data, dual_data);
else
data = max(data, dual_data);
}
return data;
}
__global__ void selection32(int* d_data, int* d_data_sorted)
{
unsigned int threadId = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int laneId = threadIdx.x % 32;
int n = N;
while(n-- > 0) {
// get the maximum element among d_data and put it in d_data_sorted[n]
int data = d_data[threadId];
data = warpMax(data, threadId);
d_data[threadId] = data;
// now maximum element is in d_data[0]
if (laneId == 0) {
d_data_sorted[n] = d_data[0];
d_data[0] = INT_MIN; // this element is ignored from now on
}
}
}
int main()
{
// ... build data and trasfer to d_data ...
selection32<<<1, 32>>>(d_data, d_data_sorted);
// ... get the sorted array stored at d_data_sorted ...
}
(Some may argue that this is not exactly a selection sort since 1) the array elements of the unsorted area keep shuffling, and 2) it is not an in-place sort. Please note that I'm just trying to show that selection sort does not fit in for CUDA. Also, note that warpMax has highly divergent branches, making it less optimal for CUDA.)
The case with only 1 warp of elements may look parallel-ish, but the thing gets worse when the problem size increases to multiple warps. Let's see the case for 1024 elements. (I've chosen the number 1024 because it is the maximum number of threads in a block.) Now there are 32 warps, and after calling warpMax for each warp, we must compare the maximum elements of each warp to get the maximum element among the 1024 elements. This problem of comparing 32 warp-maximum-values cannot be done with warpMax because we need to track in which warp the maximum value came from, to swap the maximum value with the last element in the data array. One way I can think of for doing this is using one single thread to compare the warp-maximum-values. This is not a good implementation for CUDA because the other 1023 threads in the block become idle.
Furthermore, if the problem size grows larger than a block can cover, we need to compare the maximum values of each block, implying that we will have to launch separate kernels since we need to synchronize between blocks. And needless to say, we need to keep track of which block the maximum value came from. All of this just tells us that implementing selection sort for CUDA is not a good idea.
Overall goal
I have several reductions to make on a bipartite graph, represented by two dense arrays for the vertices and a dense array specifying whether an edge is present between the two. Say the two arrays are a0[] and a1[], and all edges go like e[i0][i1] (that is, from elements in a0 to elements in a1).
There are ~100+100 vertices, and ~100*100 edges, so each thread is responsible for one edge.
Task 1 : max reduction
For each vertex in a0 I want to find the maximum of all vertices (in a1) connected to it, and then the same in reverse: having assigned the result to an array b0, for each vertex in a1, I want to find the maximum b0[i0] of the connected vertices.
To do this, I:
1) load into shared memory
#define DC_NUM_FROM_SHARED 16
#define DC_NUM_TO_SHARED 16
__global__ void max_reduce_down(
Value* value1
, Value* max_value_in_connected
, int r0_size, int r1_size
, bool** connected
)
{
int id_from, id_to;
id_from = blockIdx.x * blockDim.x + threadIdx.x;
id_to = blockIdx.y * blockDim.y + threadIdx.y;
bool within_bounds = (id_from < r0_size) && (id_to < r1_size);
//load into shared memory
__shared__ Value value[DC_NUM_TO_SHARED][DC_NUM_FROM_SHARED]; //FROM is the inner (consecutive) dimension
if(within_bounds)
value[threadIdx.y][threadIdx.x] = connected[id_to][id_from]? value1[id_to] : 0;
else
value[threadIdx.y][threadIdx.x] = 0;
__syncthreads();
if(!within_bounds)
return;
2) reduce
for(int stride = DC_NUM_TO_SHARED/2; threadIdx.y < stride; stride >>= 1)
{
    value[threadIdx.y][threadIdx.x] = max(value[threadIdx.y][threadIdx.x], value[threadIdx.y + stride][threadIdx.x]);
    __syncthreads();
}
3) write back
max_value_in_connected[id_from] = value[0][threadIdx.x];
Task 2 : best k
Similar problem, but the reduction is only over vertices in a0: I need to find the k best candidates among the connected vertices in a1 (k is ~5).
1) I initialize the shared array with zero elements except for the 1st place
int id_from, id_to;
id_from = blockIdx.x * blockDim.x + threadIdx.x;
id_to = blockIdx.y * blockDim.y + threadIdx.y;
__shared__ Value values[MAX_CHAMPS * CHAMPS_NUM_FROM_SHARED * CHAMPS_NUM_TO_SHARED]; //champion overlaps
__shared__ int champs[MAX_CHAMPS * CHAMPS_NUM_FROM_SHARED * CHAMPS_NUM_TO_SHARED]; // overlap champions
bool within_bounds = (id_from < r0_size) && (id_to < r1_size);
int i = threadIdx.y * CHAMPS_NUM_FROM_SHARED + threadIdx.x;
if(within_bounds)
{
values[i] = connected[id_to][id_from] * values1[id_to];
champs[i] = connected[id_to][id_from] ? id_to : -1;
}
else
{
values[i] = 0;
champs[i] = -1;
}
for(int place = 1; place < CHAMP_COUNT; place++)
{
i = (place * CHAMPS_NUM_TO_SHARED + threadIdx.y) * CHAMPS_NUM_FROM_SHARED + threadIdx.x;
values[i] = 0;
champs[i] = -1;
}
if(! within_bounds)
return;
__syncthreads();
2) reduce it
for(int stride = CHAMPS_NUM_TO_SHARED/2; threadIdx.y < stride; stride >>= 1)
{
merge_2_champs(values, champs, CHAMP_COUNT, id_from, id_to, id_to + stride);
__syncthreads();
}
3) write the results back
for(int place = 0; place < LOCAL_DESIRED_ACTIVITY; place++)
champs0[place][id_from] = champs[place * CHAMPS_NUM_TO_SHARED * CHAMPS_NUM_FROM_SHARED + threadIdx.x];
Issue
How do I order (transpose) the elements in the shared array, so that memory access uses the cache better?
Does it matter at this point, or is there much more I can gain from other optimizations?
Would it be better to transpose the edge matrix if I needed to optimize for Task 2? (as far as I understood, there is a symmetry in Task 1, so it doesn't matter).
P.S.
I have delayed unrolling loops and doing the first reduction iteration while loading, since I thought it is too complicated to do before I have explored simpler ways.
For Task 2, it would be nice to not load zero elements, since the array would never need to grow, and only start shrinking once log k steps have been made. This would make it k times more compact in shared memory! But I dread the resulting index math.
Syntax and Correctness
The unusual types are just typedef'ed ints/chars/etc - AFAIK, in GPUs, it makes sense to compactify those as much as possible. I have not run the code yet, no need to check for indexing errors.
Also, I am using CUDA, but I am interested in an OpenCL perspective as well, since I think the best solution should be the same, and I will be using OpenCL in the future anyway.
OK, I think I figured this out.
The two alternatives that I am considering are to have reductions work along the y dimension and be independent along the x dimension, or vice versa (the x dimension being the contiguous one). In any case, the scheduler is able to assemble threads into warps along the x dimension, so some coherence is guaranteed. However, having coherence extend beyond a warp would be great. Also, due to the 2D/3D nature of the shared arrays, one would have to limit the dimensions to 16 or even 8.
To ensure coalescence within a warp, the scheduler has to assemble warps along the x dimension.
If reducing over x dimension, after each iteration, the number of active threads in a warp will halve. However, if reducing over y dimension, it is the number of active warps that will halve.
So, I need to reduce over y.
Unless the transpose (load) is the slowest, which is an abnormal case.
Coalesced buffer reads really matter; kernels can be 32x slower if you don't do them. It can be worth doing a re-arrangement pass if it means being able to do them (of course, the re-arrangement pass needs to be coalesced as well, but you can often leverage shared local memory to do this).
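As an illustration of such a re-arrangement pass, here is the classic tiled transpose through shared memory (a generic sketch, not code from the question; the +1 padding avoids shared-memory bank conflicts, and both the global reads and the global writes stay coalesced):

#define TILE 16

__global__ void transpose_tiled(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read
    __syncthreads();

    // transposed block coordinates
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}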
How do you print numbers of form 2^i * 5^j in increasing order.
For eg:
1, 2, 4, 5, 8, 10, 16, 20
This is actually a very interesting question, especially if you don't want this to be N^2 or NlogN complexity.
What I would do is the following:
Define a data structure containing 2 values (i and j) and the result of the formula.
Define a collection (e.g. std::vector) containing these data structures
Initialize the collection with the value (0,0) (the result is 1 in this case)
Now in a loop do the following:
Look in the collection and take the instance with the smallest value
Remove it from the collection
Print this out
Create 2 new instances based on the instance you just processed
In the first instance increment i
In the second instance increment j
Add both instances to the collection (if they aren't in the collection yet)
Loop until you had enough of it
The performance can be easily tweaked by choosing the right data structure and collection.
E.g. in C++, you could use an std::map, where the key is the result of the formula, and the value is the pair (i,j). Taking the smallest value is then just taking the first instance in the map (*map.begin()).
I quickly wrote the following application to illustrate it (it works!, but contains no further comments, sorry):
#include <math.h>
#include <map>
#include <iostream>
typedef __int64 Integer;
typedef std::pair<Integer,Integer> MyPair;
typedef std::map<Integer,MyPair> MyMap;
Integer result(const MyPair &myPair)
{
return pow((double)2,(double)myPair.first) * pow((double)5,(double)myPair.second);
}
int main()
{
MyMap myMap;
MyPair firstValue(0,0);
myMap[result(firstValue)] = firstValue;
while (true)
{
auto it=myMap.begin();
if (it->first < 0) break; // overflow
MyPair myPair = it->second;
std::cout << it->first << "= 2^" << myPair.first << "*5^" << myPair.second << std::endl;
myMap.erase(it);
MyPair pair1 = myPair;
++pair1.first;
myMap[result(pair1)] = pair1;
MyPair pair2 = myPair;
++pair2.second;
myMap[result(pair2)] = pair2;
}
}
This is well suited to a functional programming style. In F#:
let min (a,b) = if (a<b) then a else b;;

type stream (current, next) =
    member this.current = current
    member this.next():stream = next();;

let rec merge(a:stream, b:stream) =
    if (a.current < b.current) then new stream(a.current, fun() -> merge(a.next(), b))
    else new stream(b.current, fun() -> merge(a, b.next()));;

let rec Squares(start) = new stream(start, fun() -> Squares(start*2));;

let rec AllPowers(start) = new stream(start, fun() -> merge(Squares(start*2), AllPowers(start*5)));;

let Results = AllPowers(1);;
Works well with Results then being a stream type with current value and a next method.
Walking through it:
I define min for completeness.
I define a stream type to have a current value and a method to return a new stream, essentially the head and tail of a stream of numbers.
I define the function merge, which takes the smaller of the current values of two streams and then increments that stream. It then recurses to provide the rest of the stream. Essentially, given two streams which are in order, it will produce a new stream which is in order.
I define squares to be a stream increasing in powers of 2.
AllPowers takes the start value and merges the stream of repeated doublings of that value with the stream resulting from multiplying it by 5, since these are your only two options. You are effectively left with a tree of results.
The result is merging more and more streams, so you merge the following streams
1, 2, 4, 8, 16, 32...
5, 10, 20, 40, 80, 160...
25, 50, 100, 200, 400...
.
.
.
Merging all of these turns out to be fairly efficient with tail recursion and compiler optimisations etc.
These could be printed to the console like this:
let rec PrintAll(s:stream) =
    if (s.current > 0) then
        do System.Console.WriteLine(s.current)
        PrintAll(s.next());;

PrintAll(Results);
let v = System.Console.ReadLine();
Similar things could be done in any language which allows for recursion and passing functions as values (it's only a little more complex if you can't pass functions as variables).
For an O(N) solution, you can use a list of numbers found so far and two indexes: one representing the next number to be multiplied by 2, and the other the next number to be multiplied by 5. Then in each iteration you have two candidate values to choose the smaller one from.
In Python:
numbers = [1]
next_2 = 0
next_5 = 0
for i in range(100):
    mult_2 = numbers[next_2] * 2
    mult_5 = numbers[next_5] * 5
    if mult_2 < mult_5:
        next = mult_2
        next_2 += 1
    else:
        next = mult_5
        next_5 += 1
    # The comparison here is to avoid appending duplicates
    if next > numbers[-1]:
        numbers.append(next)
print(numbers)
So we have two loops, one incrementing i and the second incrementing j, both starting from zero, right? (The multiplication symbol is confusing in the title of the question.)
You can do something very straightforward:
Add all items to an array
Sort the array
Or do you need another solution with more mathematical analysis?
EDIT: A smarter solution, leveraging the similarity with the Merge Sort problem
If we imagine the infinite sets of numbers 2^i and 5^j as two independent streams/lists, this problem looks very similar to the well-known Merge Sort problem.
So the solution steps are:
Get two numbers, one from each of the streams (of 2 and of 5)
Compare
Return the smallest
Get the next number from the stream that produced the previously returned smallest
and that's it! ;)
PS: The complexity of Merge Sort is always O(n*log(n))
I visualize this problem as a matrix M where M(i,j) = 2^i * 5^j. This means that both the rows and columns are increasing.
Think about drawing a line through the entries in increasing order, clearly beginning at entry (1,1). As you visit entries, the row and column increasing conditions ensure that the shape formed by those cells will always be an integer partition (in English notation). Keep track of this partition (mu = (m1, m2, m3, ...) where mi is the number of smaller entries in row i -- hence m1 >= m2 >= ...). Then the only entries that you need to compare are those entries which can be added to the partition.
Here's a crude example. Suppose you've visited all the xs (mu = (5,3,3,1)), then you need only check the #s:
x x x x x #
x x x #
x x x
x #
#
Therefore the number of checks is the number of addable cells (equivalently the number of ways to go up in Bruhat order if you're of a mind to think in terms of posets).
Given a partition mu, it's easy to determine what the addable states are. Imagine an infinite string of 0s following the last positive entry. Then you can increase mi by 1 if and only if m(i-1) > mi.
Back to the example, for mu = (5,3,3,1) we can increase m1 (6,3,3,1) or m2 (5,4,3,1) or m4 (5,3,3,2) or m5 (5,3,3,1,1).
The solution to the problem then finds the correct sequence of partitions (saturated chain). In pseudocode:
mu = [0,0,...,0];
while (/* some terminate condition or go on forever */) {
    minNext = infinity;
    nextCell = -1;
    // look through all addable cells: row i is addable if i==0 or mu[i-1] > mu[i]
    for (int i=0; i<mu.length; ++i) {
        if (i==0 or mu[i-1]>mu[i]) {
            // value of the cell that would be added at the end of row i;
            // with rows indexed by the power of 5 (as in the matrix below), this is 5^i * 2^mu[i]
            candidate = 5^i * 2^mu[i];
            // check for new minimum value
            if (candidate < minNext) {
                nextCell = i;
                minNext = candidate;
            }
        }
    }
    // print the next entry and update mu
    print(minNext);
    mu[nextCell]++;
}
I wrote this in Maple stopping after 12 iterations:
1, 2, 4, 5, 8, 10, 16, 20, 25, 32, 40, 50
and the sequence in which the cells were added came out as:
1 2 3 5 7 10
4 6 8 11
9 12
corresponding to this matrix representation:
1, 2, 4, 8, 16, 32...
5, 10, 20, 40, 80, 160...
25, 50, 100, 200, 400...
First of all, (as others mentioned already) this question is very vague!!!
Nevertheless, I am going to give it a shot based on your vague equation and the pattern of your expected result. I am not sure the following will be true for what you are trying to do; however, it may give you some idea about Java collections!
import java.util.List;
import java.util.ArrayList;
import java.util.SortedSet;
import java.util.TreeSet;
public class IncreasingNumbers {
private static List<Integer> findIncreasingNumbers(int maxIteration) {
SortedSet<Integer> numbers = new TreeSet<Integer>();
SortedSet<Integer> numbers2 = new TreeSet<Integer>();
for (int i=0;i < maxIteration;i++) {
int n1 = (int)Math.pow(2, i);
numbers.add(n1);
for (int j=0;j < maxIteration;j++) {
int n2 = (int)Math.pow(5, i);
numbers.add(n2);
for (Integer n: numbers) {
int n3 = n*n1;
numbers2.add(n3);
}
}
}
numbers.addAll(numbers2);
return new ArrayList<Integer>(numbers);
}
/**
* Based on the following fuzzy question # StackOverflow
* http://stackoverflow.com/questions/7571934/printing-numbers-of-the-form-2i-5j-in-increasing-order
*
*
* Result:
* 1 2 4 5 8 10 16 20 25 32 40 64 80 100 125 128 200 256 400 625 1000 2000 10000
*/
public static void main(String[] args) {
List<Integer> numbers = findIncreasingNumbers(5);
for (Integer i: numbers) {
System.out.print(i + " ");
}
}
}
If you can afford O(nlogn), here's a simple solution:
Get an empty min-heap
Put 1 in the heap
while (you want to continue)
Get num from heap
print num
put num*2 and num*5 in the heap
There you have it. By min-heap, I mean min-heap
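Here is a small, self-contained C++ sketch of that heap approach (my own illustration; note that each value can be produced twice, e.g. 10 = 2*5 = 5*2, so duplicates popped from the heap are skipped):

#include <cstdint>
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

int main() {
    // min-heap of candidates
    std::priority_queue<uint64_t, std::vector<uint64_t>, std::greater<uint64_t>> heap;
    heap.push(1);
    uint64_t last = 0;
    for (int printed = 0; printed < 20; ) {
        uint64_t num = heap.top();
        heap.pop();
        if (num == last) continue;          // skip duplicates
        last = num;
        std::cout << num << " ";
        heap.push(num * 2);
        heap.push(num * 5);
        ++printed;
    }
    std::cout << "\n";
}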
As a mathematician the first thing I always think about when looking at something like this is "will logarithms help?".
In this case it might.
If our series A is increasing then the series log(A) is also increasing. Since all terms of A are of the form 2^i.5^j then all members of the series log(A) are of the form i.log(2) + j.log(5)
We can then look at the series log(A)/log(2) which is also increasing and its elements are of the form i+j.(log(5)/log(2))
If we work out the i and j that generates the full ordered list for this last series (call it B) then that i and j will also generate the series A correctly.
This is just changing the nature of the problem but hopefully to one where it becomes easier to solve. At each step you can either increase i and decrease j or vice versa.
Looking at a few of the early changes you can make (which I will possibly refer to as transforms of i,j or just transforms) gives us some clues of where we are going.
Clearly increasing i by 1 will increase B by 1. However, given that log(5)/log(2) is approx 2.3, increasing j by 1 while decreasing i by 2 will give an increase of just 0.3. The problem then is at each stage finding the minimum possible increase in B for changes of i and j.
To do this I just kept a record, as the series increased, of the most efficient transforms of i and j (i.e. what to add to and subtract from each) to get the smallest possible increase in the series. Then I applied whichever one was valid (i.e. making sure i and j don't go negative).
Since at each stage you can either decrease i or decrease j there are effectively two classes of transforms that can be checked individually. A new transform doesn't have to have the best overall score to be included in our future checks, just better than any other in its class.
To test my thoughts I wrote a sort of program in LINQPad. Key things to note are that the Dump() method just outputs the object to screen and that the syntax/structure isn't valid for a real C# file. Converting it if you want to run it should be easy though.
Hopefully anything not explicitly explained will be understandable from the code.
void Main()
{
double C = Math.Log(5)/Math.Log(2);
int i = 0;
int j = 0;
int maxi = i;
int maxj = j;
List<int> outputList = new List<int>();
List<Transform> transforms = new List<Transform>();
outputList.Add(1);
while (outputList.Count<500)
{
Transform tr;
if (i==maxi)
{
//We haven't considered i this big before. Lets see if we can find an efficient transform by getting this many i and taking away some j.
maxi++;
tr = new Transform(maxi, (int)(-(maxi-maxi%C)/C), maxi%C);
AddIfWorthwhile(transforms, tr);
}
if (j==maxj)
{
//We haven't considered j this big before. Lets see if we can find an efficient transform by getting this many j and taking away some i.
maxj++;
tr = new Transform((int)(-(maxj*C)), maxj, (maxj*C)%1);
AddIfWorthwhile(transforms, tr);
}
//We have a set of transforms. We first find ones that are valid then order them by score and take the first (smallest) one.
Transform bestTransform = transforms.Where(x=>x.I>=-i && x.J >=-j).OrderBy(x=>x.Score).First();
//Apply transform
i+=bestTransform.I;
j+=bestTransform.J;
//output the next number in out list.
int value = GetValue(i,j);
//This line just gets it to stop when it overflows. I would have expected an exception but maybe LinqPad does magic with them?
if (value<0) break;
outputList.Add(value);
}
outputList.Dump();
}
public int GetValue(int i, int j)
{
return (int)(Math.Pow(2,i)*Math.Pow(5,j));
}
public void AddIfWorthwhile(List<Transform> list, Transform tr)
{
if (list.Where(x=>(x.Score<tr.Score && x.IncreaseI == tr.IncreaseI)).Count()==0)
{
list.Add(tr);
}
}
// Define other methods and classes here
public class Transform
{
public int I;
public int J;
public double Score;
public bool IncreaseI
{
get {return I>0;}
}
public Transform(int i, int j, double score)
{
I=i;
J=j;
Score=score;
}
}
I've not bothered looking at the efficiency of this but I strongly suspect its better than some other solutions because at each stage all I need to do is check my set of transforms - working out how many of these there are compared to "n" is non-trivial. It is clearly related since the further you go the more transforms there are but the number of new transforms becomes vanishingly small at higher numbers so maybe its just O(1). This O stuff always confused me though. ;-)
One advantage over other solutions is that it calculates i and j directly, without needing to compute the product, allowing me to work out what the sequence would be without needing to calculate the actual number itself.
For what it's worth, after the first 230 numbers (when int runs out of space) I had 9 transforms to check each time. And given it's only my total that overflowed, I ran it for the first million results and got to i=5191 and j=354. The number of transforms was 23. The size of this number in the list is approximately 10^1810. Runtime to get to this level was approx 5 seconds.
P.S. If you like this answer please feel free to tell your friends since I spent ages on this and a few +1s would be nice compensation. Or in fact just comment to tell me what you think. :)
I'm sure everyone might have got the answer by now, but I just wanted to give a direction to this solution.
It's a Ctrl C + Ctrl V from
http://www.careercup.com/question?id=16378662
void print(int N)
{
int arr[N];
arr[0] = 1;
int i = 0, j = 0, k = 1;
int numJ, numI;
int num;
for(int count = 1; count < N; )
{
numI = arr[i] * 2;
numJ = arr[j] * 5;
if(numI < numJ)
{
num = numI;
i++;
}
else
{
num = numJ;
j++;
}
if(num > arr[k-1])
{
arr[k] = num;
k++;
count++;
}
}
for(int counter = 0; counter < N; counter++)
{
printf("%d ", arr[counter]);
}
}
The question as put to me was to return an infinite set of solutions. I pondered the use of trees, but felt there was a problem with figuring out when to harvest and prune the tree, given an infinite number of values for i & j. I realized that a sieve algorithm could be used. Starting from zero, determine whether each positive integer had values for i and j. This was facilitated by turning answer = (2^i)*(5^j) around and solving for i instead. That gave me i = log2 (answer / (5^j)). Here is the code:
class Program
{
static void Main(string[] args)
{
var startTime = DateTime.Now;
int potential = 0;
do
{
if (ExistsIandJ(potential))
Console.WriteLine("{0}", potential);
potential++;
} while (potential < 100000);
Console.WriteLine("Took {0} seconds", DateTime.Now.Subtract(startTime).TotalSeconds);
}
private static bool ExistsIandJ(int potential)
{
// potential = (2^i)*(5^j)
// 1 = (2^i)*(5^j)/potential
// 1/(2^i) = (5^j)/potential or (2^i) = potential / (5^j)
// i = log2 (potential / (5^j))
for (var j = 0; Math.Pow(5,j) <= potential; j++)
{
var i = Math.Log(potential / Math.Pow(5, j), 2);
if (i == Math.Truncate(i))
return true;
}
return false;
}
}