parallel SVD decomposition with openMP deos not perform as expected

parallel SVD decomposition with openMP deos not perform as expected - openmp

I have recently coded a parallel SVD decomposition routine, based on a "one sided Jacobi rotations" algorithm. The code works correctly but is tremendously slow.
In fact it should exploit the parallelism in the inner for loop for(int g=0;g<n;g++), but on commenting out the #pragma omp paralell for directive I can appreciate just a very slight decrease in performances. In other words there is no appreciable speed up on going parallel (the code does run parallel with 4 threads).
Note 1: almost all the work is concentrated in the three following loops involving the matrices A and V, which are relatively large.
for(h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
and
double Ahi,Vhi;
for(h=0;h<N;h++)//...rotate Ai & Aj (only columns i & j are changend)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j are updated)
for(h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
All the parallelism should be exploited there but is not. And I can't understand why.
Note 2: The same happens both on Windows (cygWin compiler) and Linux (GCC) platforms.
Note 3: matrices are represented by column major arrays
So I'm looking for some help in finding out why the parallelism is not exploited. Did I miss something? There is some hidden overhead in the parallel for I cannot see?
Thank you very much for any suggestion
int sweep(double* A,double*V,int N,double tol)
{
static int*I=new int[(int)ceil(0.5*(N-1))];
static int*J=new int[(int)ceil(0.5*(N-1))];
int ntol=0;
for(int r=0;r<N;r++) //fill in i,j indexes of parallel rotations in vectors I & J
{
int k=r+1;
if (k==N)
{
for(int i=2;i<=(int)ceil(0.5*N);i++){
I[i-2]=i-1;
J[i-2]=N+2-i-1;
}
}
else
{
for(int i=1;i<=(int)ceil(0.5*(N-k));i++)I[i-1]=i-1;
for(int i=1;i<=(int)ceil(0.5*(N-k));i++)J[i-1]=N-k+2-i-1;
if(k>2)
{
int j=(int)ceil(0.5*(N-k));
for(int i=N-k+2;i<=N-(int)floor(0.5*k);i++){
I[j]=i-1;
J[j]=2*N-k+2-i-1;
j++;
}
}
}
int n=(k%2==0)?(int)floor(0.5*(N-1)):(int)floor(0.5*N);
#pragma omp parallel for schedule(dynamic,5) reduction(+:ntol) default(none) shared(std::cout,I,J,A,V,N,n,tol)
for(int g=0;g<n;g++)
{
int i=I[g];
int j=J[g];
double p=0;
double qi=0;
double qj=0;
double cs,sn,q,c;
int h;
for(h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
q=qi-qj;
if(p*p/(qi*qj)<tol) ntol++; //if Ai & Aj are orthogonal enough...
else //if Ai & Aj are not orthogonal enough then... rotate them
{
c=sqrt(4*p*p+q*q);
if(q>=0){
cs=sqrt((c+q)/(2*c));
sn=p/(c*cs);
}
else{
sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
cs=p/(c*sn);
}
//...rotate Ai & Aj (only columns i & j are changend)
double Ahi,Vhi;
for(h=0;h<N;h++)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j are updated)
for(h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
}
}
}
if(2*ntol==(N*(N-1)))return(1);//if each columns of A is orthogonal enough to each other stop sweep
return(0);
}

Thanks to Z boson remarks I managed to write a far better performing paralell SVD decomposition. It runs many times faster than the original one, and I guess it could be still improved by the use of SIMD instructions.
I post the code, in the case anyone should find it of use. All tests I performed gave me correct results, in any case there is no warranty for its use.
I'm really sorry not to have had the time to properly comment the code for a better comprehensibility.
int sweep(double* A,double*V,int N,double tol,int M, int n)
{
/********************************************************************************************
This routine performs a parallel "sweep" of the SVD factorization algorithm for the matrix A.
It implements a single sided Jacobi rotation algorithm, described by Michael W. Berry,
Dani Mezher, Bernard Philippe, and Ahmed Sameh in "Parallel Algorithms for the Singular Value
Decomposition".
At each sweep the A matrix becomes a little more orthogonal, until each column of A is orthogonal
to each other within a give tolerance. At this point the sweep() routine returns 1 and convergence
is attained.
Arguments:
A : on input the square matrix to be orthogonalized, on exit a more "orthogonal" matrix
V : on input the accumulated rotation matrix, on exit it is updated with the current rotations
N : dimension of the matrices.
tol :tolerance for convergence of orthogonalization. See the explainations for SVD() routine
M : number of blocks (it is calculated from the given block size n)
n : block size (number of columns on each block). It should be an optimal value according to the hardware available
returns : 1 if in the last sweep convergence is attained, 0 if not and one more sweep is necessary
Author : Renato Talucci 2015
*************************************************************************************************/
#include <math.h>
int ntol=0;
//STEP 1 : INTERNAL BLOCK ORTHOGONALISATION
#pragma omp paralell for reduction(+:ntol) shared(A,V,n,tol,N) default(none)
for(int a=0;a<M;a++)
{
for(int i=a*n;i<a*n+imin(n,N-a*n)-1;i++)
{
for(int j=i+1;j<a*n+imin(n,N-a*n);j++)
{
double p=0;
double qi=0;
double qj=0;
double cs,sn,q,c;
for(int h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
q=qi-qj;
if((p*p/(qi*qj)<tol)||(qi<tol)||(qj<tol))ntol++; //if Ai & Aj are orthogonal enough...
else //if Ai & Aj are not orthogonal enough then... rotate them
{
c=sqrt(4*p*p+q*q);
if(q>=0){
cs=sqrt((c+q)/(2*c));
sn=p/(c*cs);
}
else{
sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
cs=p/(c*sn);
}
//...rotate Ai & Aj
double Ahi,Vhi;
for(int h=0;h<N;h++)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j atre updated)
for(int h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
}
}
}
}
//STEP 2 : PARALLEL BLOCK MUTUAL ORTHOGONALISATION
static int*I=new int[(int)ceil(0.5*(M-1))];
static int*J=new int[(int)ceil(0.5*(M-1))];
for(int h=0;h<M;h++)
{
//fill in i,j indexes of blocks to be mutually orthogonalized at each turn
int k=h+1;
if (k==M)
{
for(int i=2;i<=(int)ceil(0.5*M);i++){
I[i-2]=i-1;
J[i-2]=M+2-i-1;
}
}
else
{
for(int i=1;i<=(int)ceil(0.5*(M-k));i++)I[i-1]=i-1;
for(int i=1;i<=(int)ceil(0.5*(M-k));i++)J[i-1]=M-k+2-i-1;
if(k>2)
{
int j=(int)ceil(0.5*(M-k));
for(int i=M-k+2;i<=M-(int)floor(0.5*k);i++){
I[j]=i-1;
J[j]=2*M-k+2-i-1;
j++;
}
}
}
int ng=(k%2==0)?(int)floor(0.5*(M-1)):(int)floor(0.5*M);
#pragma omp parallel for schedule(static,5) shared(A,V,I,J,n,tol,N,ng) reduction(+:ntol) default(none)
for(int g=0;g<ng;g++)
{
int block_i=I[g];
int block_j=J[g];
for(int i=block_i*n;i<block_i*n+imin(n,N-block_i*n);i++)
{
for(int j=block_j*n;j<block_j*n+imin(n,N-block_j*n);j++)
{
double p=0;
double qi=0;
double qj=0;
double cs,sn,q,c;
int h;
for(h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
q=qi-qj;
if((p*p/(qi*qj)<tol)||(qi<tol)||(qj<tol))ntol++; //if Ai & Aj are orthogonal enough...
else //if Ai & Aj are not orthogonal enough then... rotate them
{
c=sqrt(4*p*p+q*q);
if(q>=0){
cs=sqrt((c+q)/(2*c));
sn=p/(c*cs);
}
else{
sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
cs=p/(c*sn);
}
//...rotate Ai & Aj
double Ahi,Vhi;
for(h=0;h<N;h++)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j atre updated)
for(h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
}
}
}
}
}
if(2*ntol==(N*(N-1)))return(1);//if each columns of A is orthogonal enough to each other stop sweep
return(0);
}
int SVD(double* A,double* U,double*V,int N,double tol,double* sigma)
{
/********************************************************************************************
This routine calculates the SVD decomposition of the square matrix A [NxN]
A = U * S * V'
Arguments :
A : on input NxN square matrix to be factorized, on exit contains the
rotated matrix A*V=U*S.
V : on input an identity NxN matrix, on exit is the right orthogonal matrix
of the decomposition A = U*S*V'
U : NxN matrix, on exit is the left orthogonal matrix of the decomposition A = U*S*V'.
sigma : array of dimension N. On exit contains the singular values, i.e. the diagonal
elements of the matrix S.
N : Dimension of the A matrix.
tol : Tolerance for the convergence of the orthogonalisation of A. Said Ai and Aj any two
columns of A, the convergence is attained when Ai*Aj / ( |Ai|*|Aj| ) < tol for each i,j=0,..,N-1 (i!=j)
The software returns the number of sweeps needed for convergence.
NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING I.E. M(i,j)=M[i+N*j]
Author: Renato Talucci 2015
*************************************************************************************************/
int n=24;//this is the dimension of block submatrices, you shall enter an optimal value for your hardware
int M=N/n+int(((N%n)!=0)?1:0);
int swp=0;//sweeps counter
int converged=0;
while(converged==0) {
converged=sweep(A,V,N,tol,M,n);
swp++;
}
#pragma omp parallel for default(none) shared(sigma,A,U,N)
for(int i=0;i<N;i++)
{
double si=0;
for(int j=0;j<N;j++) si+=A[j+N*i]*A[j+N*i];
si=sqrt(si);
for(int k=0;k<N;k++) U[k+N*i]=A[k+N*i]/si;
sigma[i]=si;
}
return(swp);
}
Note : if some user prefers to left A unchanged upon exit, it is sufficient to calculate U=A*V before entering the while loop, and instead of passing the A matrix to the sweep() routine, passing the obtained U matrix. The U matrix shall be orthonormalized instead of A, after convergence of sweep() and the #pragma omp directive must include U in the shared variables instead of A.
Note 2:if you have (as I have) to factorize a sequence of A(k) matrices each of which A(k) can be considered a perturbation of the previous A(k-1), as a jacobian matrix can be considered in a multistep Newton solver, the driver can be easily modified to update from A(k-1) to A(k) instead of calculating A(k) from begin. Here is the code:
int updateSVD(double* DA,double* U,double*V,int N,double tol,double* sigma)
{
/********************************************************************************************
Given a previously factorization
A(k-1) = U(k-1) * S(k-1) * V(k-1)'
and given a perturbation DA(k) of A(k-1), i.e.
A(k) = A(k-1) + DA(k)
this routine calculates the SVD factorization of A(k), starting from the factorization of A(k-1)
Arguments:
DA : on input NxN perturbation matrix, unchanged on exit
U : on input NxN orthonormal left matrix of the previous (k-1) factorization, on exit
orthonormal right matrix of the current factorization
V : on input NxN orthonormal right matrix of the previous (k-1) factorization, on exit
orthonormal right matrix of the current factorization
N : dimension of the matrices
tol : Tolerance for the convergence of the orthogonalisation of A. Said Ai and Aj any two
columns of A, the convergence is attained when Ai*Aj / ( |Ai|*|Aj| ) < tol for each i,j=0,..,N-1 (i!=j)
sigma : on input, array with the N singular values of the previuos factorization, on exit
array with the N singular values of the current factorization
NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING I.E. M(i,j)=M[i+N*j]
Author: Renato Talucci 2015
*************************************************************************************************/
for(int i=0;i<N;i++) for(int j=0;j<N;j++) U[i+N*j]*=sigma[j];
int n=24; //this is the dimension of block submatrices, you shall enter an optimal value for your hardware
matmat_col_col(DA,V,U,N,n); //U =U(k-1)*S(k-1) + DA(k)*V(k-1) = A(k)*V(k-1)
int M=N/n+int(((N%n)!=0)?1:0); //number of blocks
int swp=0;//sweeps counter
int converged=0;
while(converged==0) {
converged=sweep(U,V,N,tol,M,n);
swp++;
}
#pragma omp parallel for default(none) shared(sigma,U,N)
for(int i=0;i<N;i++)
{
double si=0;
for(int j=0;j<N;j++) si+=U[j+N*i]*U[j+N*i];
si=sqrt(si);
for(int k=0;k<N;k++) U[k+N*i]=U[k+N*i]/si;
sigma[i]=si;
}
return(swp);
}
Finally, the routine matmat_col_col(DA,V,U,N,n) is a paralell block matrix product. Here is the code:
inline int imin(int a,int b) {return((a<=b)?a:b);}
void matmat_col_col(double* A,double* B,double*C,int N,int n)
/********************************************************************************************
square matrix block product NxN :
C = C + A * B
n is the optimal block dimension
N is the dimension of the matrices
NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING M(i,j) = M[i+N*j]
Author: Renato Talucci 2015
*************************************************************************************************/
{
int M=N/n+int(((N%n)!=0)?1:0);
#pragma omp parallel for shared(M,A,B,C)
for(int a=0;a<M;a++)
{
for(int b=0;b<M;b++)
{
for(int c=0;c<M;c++)
{
for(int i=a*n;i<imin((a+1)*n,N);i++)
{
for(int j=b*n;j<imin((b+1)*n,N);j++)
{
for(int k=c*n;k<imin((c+1)*n,N);k++)
{
C[i+N*j]+=A[i+N*k]*B[k+N*j];
}
}
}
}
}
}
return;
}
I hope no typos have been created from so much copy&paste.
It would be nice if anyone could improve the code.

The FLOPS for Singular value decomposition (SVD) should go as O(N^3) whereas the reads as O(N^2). This means it may be possible to parallelize the algorithm and have it scale well with the number of cores.
However, your implementation of SVD is memory bandwidth bound which means it can't scale well with the number of cores for a single socket system. Your three nested loops currently each go over the entire range of N which causes much of the data to be evicted from the cache the next time it need to be reused.
In order to be compute bound you're going to have to change your algorithm so that, instead of operating on long strips/dot products of size N, it uses loop tiling to maximize the n^3 (where n is the size of the block) calculations in the cache of size n^2.
Here is paper for parallel SVD computation from a google search which says in the abstract
the key point of our proposed block JRS algorithm is reusing the loaded data into cache memory by performing computations on matrix blocks (b rows) instead of on strips of vectors as in JRS iteration algorithms.
I used loop tiling/blocks to achieve good scaling with the number of cores for cholesky decomposition.
Note that using tiles/blocks will improve the performance of your single threaded code as well.

Related

HMM Localization in 2D maze, trouble applying smoothing (backward algorithm)

We use HMM (Hidden Markov Model) to localize a robot in a windy maze with damaged sensors. If he attempts to move in a direction, he will do so with a high probability, and a low chance to accidentally go to either side. If his movement would make him go over an obstacle, he will bounce back to the original tile.
From any given position, he can sense in all four directions. He will notice an obstacle if it is there with high certainty, and see an obstacle when there is none with low certainty.
We have a probability map for all possible places the robot might be in the maze, since he knows what the maze looks like. Initially it all starts evenly distributed.
I have completed the motion and sensing aspect of this and am getting the proper answers, but I am stuck on smoothing (backward algorithm).
Assume that the robot performs the following sequence of actions: senses, moves, senses, moves, senses. This gives us 3 states in our HMM model. Assume that the results I have at each step of the way so far are correct.
I am having a lot of trouble performing smoothing (backward algorithm), given that there are four conditional probabilities (one for each direction).
Assume SP is for smoothing probability, BP is for backward probability
Assume Sk is for a state, and Zk is for an observation at that state. The problem for me is figuring out how to construct my backwards equation given that each Zk is only for a single direction.
I know the algorithm for smoothing is: SP(k) is proportional to BP(k+1) * P(Sk | Z1:k)
Where BP(k+1) is defined as :
if (k == n) return 1 else return Sum(s) of BP(k+1) * P(Zk+1|Sk+1) * P(Sk+1=s | Sk)
This is where I am having my trouble. Mainly in the Conditional Probability portion of this equation. Because each spot has four different directions that it observed! In other words, each state has four different evidence variables as opposed to just one! Do I average these values? Do I do a separate summation for them? How do I account for multiple observations at a given state and properly condense it into this equation which only has room for one conditional probability?
Here is the code I have performing the smoothing:
public static void Smoothing(List<int[]> observations) {
int n = observations.Count; //n is Total length of evidence sequence
int k = n - 1; //k is the state we are trying to smooth. start with n-1
for (; k >= 1; k--) { //Smooth all the way back to the first state
for (int dir = 0; dir < 4; dir++) {
//We must smooth each direction separately
SmoothDirection(dir, observations, k, n);
}
Console.WriteLine($"Smoothing for k = {k}\n");
UpdateMapMotion(mapHistory[k]);
PrintMap();
}
}
public static void SmoothDirection(int dir, List<int[]> observations, int k, int n) {
var alphas = new double[ROWS, COLS];
var normalizer = 0.0;
int row, col;
foreach (var t in map) {
if (t.isObstacle) continue;
row = t.pos.y;
col = t.pos.x;
alphas[row, col] = mapHistory[k][row, col]
* Backwards(k, n, t, dir, observations, moves[^(n - k)]);
normalizer += alphas[row, col];
}
UpdateHistory(k, alphas, normalizer);
}
public static void UpdateHistory(int index, double[,] alphas, double normalizer) {
for (int r = 0; r < ROWS; r++) {
for (int c = 0; c < COLS; c++) {
mapHistory[index][r, c] = alphas[r, c] / normalizer;
}
}
}
public static double Backwards(int k, int n, Tile t, int dir, List<int[]> observations, int moveDir) {
if (k == n) return 1;
double p = 0;
var nextStates = GetPossibleNextStates(t, moveDir);
foreach (var s in nextStates) {
p += Cond_Prob(s.hasObstacle[dir], observations[^(n - k)][dir] == 1) * Trans_Prob(t, s, moveDir)
* Backwards(k+1, n, s, dir, observations, moves[^(n - k)]);
}
return p;
}
public static List<Tile> GetPossibleNextStates(Tile t, int direction) {
var tiles = new List<Tile>(); //Next States
var perpDirs = GetPerpendicularDir(direction); //Perpendicular Directions
//If obstacle in front of Tile t or on the sides, Tile t is a possible next state.
if (t.hasObstacle[direction] || t.hasObstacle[perpDirs[0]] || t.hasObstacle[perpDirs[1]])
tiles.Add(t);
//If there is no obstacle in front of Tile t, then that tile is a possible next state.
if (!t.hasObstacle[direction])
tiles.Add(GetTileAtPos(t.pos + directions[direction]));
//If there are no obstacles on the sides of Tile t, then those are possible next states.
foreach (var dir in perpDirs) {
if (!t.hasObstacle[dir])
tiles.Add(GetTileAtPos(t.pos + directions[dir]));
}
return tiles;
}
TL;DR : How do I perform smoothing (backward algorithm) in a Hidden Markov Model when there are 4 evidences at each state as opposed to just 1?

SOLVED!
It was actually rather much more simple than I imagined.
I don't actually need to each iteration separately in each direction.
I just need to replace the Cond_Prob() function with Joint_Cond_Prob() which finds the joint probability of all directional observations at a given state.
So P(Zk|Sk) is actually P(Zk1:Zk4|Sk) which is just P(Zk1|Sk)P(Zk2|Sk)P(Zk3|Sk)P(Zk4|Sk)

Find the missing coordinate of rectangle

Chef has N axis-parallel rectangles in a 2D Cartesian coordinate system. These rectangles may intersect, but it is guaranteed that all their 4N vertices are pairwise distinct.
Unfortunately, Chef lost one vertex, and up until now, none of his fixes have worked (although putting an image of a point on a milk carton might not have been the greatest idea after all…). Therefore, he gave you the task of finding it! You are given the remaining 4N−1 points and you should find the missing one.
Input
The first line of the input contains a single integer T denoting the number of test cases. The description of T test cases follows.
The first line of each test case contains a single integer N.
Then, 4N−1 lines follow. Each of these lines contains two space-separated integers x and y denoting a vertex (x,y) of some rectangle.
Output
For each test case, print a single line containing two space-separated integers X and Y ― the coordinates of the missing point. It can be proved that the missing point can be determined uniquely.
Constraints
T≤100
1≤N≤2⋅105
|x|,|y|≤109
the sum of N over all test cases does not exceed 2⋅105
Example Input
1
2
1 1
1 2
4 6
2 1
9 6
9 3
4 3
Example Output
2 2
Problem link: https://www.codechef.com/problems/PTMSSNG
my approach: I have created a frequency array for x and y coordinates and then calculated the point which is coming odd no. of times.
#include <iostream>
using namespace std;
int main() {
// your code goes here
int t;
cin>>t;
while(t--)
{
long int n;
cin>>n;
long long int a[4*n-1][2];
long long int xm,ym,x,y;
for(int i=0;i<4*n-1;i++)
{
cin>>a[i][0]>>a[i][1];
if(i==0)
{
xm=abs(a[i][0]);
ym=abs(a[i][1]);
}
if(i>0)
{
if(abs(a[i][0])>xm)
{
xm=abs(a[i][0]);
}
if(abs(a[i][1])>ym)
{
ym=abs(a[i][1]);
}
}
}
long long int frqx[xm+1],frqy[ym+1];
for(long long int i=0;i<xm+1;i++)
{
frqx[i]=0;
}
for(long long int j=0;j<ym+1;j++)
{
frqy[j]=0;
}
for(long long int i=0;i<4*n-1;i++)
{
frqx[a[i][0]]+=1;
frqy[a[i][1]]+=1;
}
for(long long int i=0;i<xm+1;i++)
{
if(frqx[i]>0 && frqx[i]%2>0)
{
x=i;
break;
}
}
for(long long int j=0;j<ym+1;j++)
{
if(frqy[j]>0 && frqy[j]%2>0)
{
y=j;
break;
}
}
cout<<x<<" "<<y<<"\n";
}
return 0;
}
My code is showing TLE for inputs <10^6

First of all, your solution is not handling negative x/y correctly. long long int frqx[xm+1],frqy[ym+1] allocated barely enough memory to hold positive values, but not enough to hold negative ones.
It doesn't even matter though, as with the guarantee that abs(x) <= 109, you can just statically allocate a vector of 219 elements, and map both positive and negative coordinates in there.
Second, you are not supposed to buffer the input in a. Not only is this going to overflow the stack, is also entirely unnecessary. Write to the frequency buckets right away, don't buffer.
Same goes for most of these challenges. Don't buffer, always try to process the input directly.
About your buckets, you don't need a long long int. A bool per bucket is enough. You do not care even the least how many coordinates were sorted into the bucket, only whether the number so far was even or not. What you implemented as a separate loop can be substituted by simply toggling a flag while processing the input.

I find the answer of #Ext3h with respect to the errors adequate.
The solution, giving that you came on the odd/even quality of the problem,
can be done more straight-forward.
You need to find the x and y that appear an odd number of times.
In java
int[] missingPoint(int[][] a) {
//int n = (a.length + 1) / 4;
int[] pt = new int[2]; // In C initialize with 0.
for (int i = 0; i < a.length; ++i) {
for (int j = 0; j < 2; ++j) {
pt[j] ^= a[i][j];
}
}
return pt;
}
This uses exclusive-or ^ which is associative and reflexive 0^x=x, x^x=0. (5^7^4^7^5=4.)
For these "search the odd one" one can use this xor-ing.
In effect you do not need to keep the input in an array.

Algorithm to match sets with overlapping members

Looking for an efficient algorithm to match sets among a group of sets, ordered by the most overlapping members. 2 identical sets for example are the best match, while no overlapping members are the worst.
So, the algorithm takes input a list of sets and returns matching set pairs ordered by the sets with the most overlapping members.
Would be interested in ideas to do this efficiently. Brute force approach is to try all combinations and sort which obviously is not very performant when the number of sets is very large.
Edit: Use case - Assume a large number of sets already exist. When a new set arrives, the algorithm is run and the output includes matching sets (with at least one element overlap) sorted by the most matching to least (doesn't matter how many items are in the new/incoming set). Hope that clarifies my question.

If you can afford an approximation algorithm with a chance of error, then you should probably consider MinHash.
This algorithm allows estimating the similarity between 2 sets in constant time. For any constructed set, a fixed size signature is computed, and then only the signatures are compared when estimating the similarities. The similarity measure being used is Jaccard distance, which ranges from 0 (disjoint sets) to 1 (identical sets). It is defined as the intersection to union ratio of two given sets.
With this approach, any new set has to be compared against all existing ones (in linear time), and then the results can be merged into the top list (you can use a bounded search tree/heap for this purpose).

Since the number of possible different values is not very large, you get a fairly efficient hashing if you simply set the nth bit in a "large integer" when the nth number is present in your set. You can then look for overlap between sets with a simple bitwise AND followed by a "count set bits" operation. On 64 bit architecture, that means that you can look for the similarity between two numbers (out of 1000 possible values) in about 16 cycles, regardless of the number of values in each cluster. As the cluster gets more sparse, this becomes a less efficient algorithm.
Still - I implemented some of the basic functions you might need in some code that I attach here - not documented but reasonably understandable, I think. In this example I made the numbers small so I can check the result by hand - you might want to change some of the #defines to get larger ranges of values, and obviously you will want some dynamic lists etc to keep up with the growing catalog.
#include <stdio.h>
// biggest number you will come across: want this to be much bigger
#define MAXINT 25
// use the biggest type you have - not int
#define BITSPER (8*sizeof(int))
#define NWORDS (MAXINT/BITSPER + 1)
// max number in a cluster
#define CSIZE 5
typedef struct{
unsigned int num[NWORDS]; // want to use longest type but not for demo
int newmatch;
int rank;
} hmap;
// convert number to binary sequence:
void hashIt(int* t, int n, hmap* h) {
int ii;
for(ii=0;ii<n;ii++) {
int a, b;
a = t[ii]%BITSPER;
b = t[ii]/BITSPER;
h->num[b]|=1<<a;
}
}
// print binary number:
void printBinary(int n) {
unsigned int jj;
jj = 1<<31;
while(jj!=0) {
printf("%c",((n&jj)!=0)?'1':'0');
jj>>=1;
}
printf(" ");
}
// print the array of binary numbers:
void printHash(hmap* h) {
unsigned int ii, jj;
for(ii=0; ii<NWORDS; ii++) {
jj = 1<<31;
printf("0x%08x: ", h->num[ii]);
printBinary(h->num[ii]);
}
//printf("\n");
}
// find the maximum overlap for set m of n
int maxOverlap(hmap* h, int m, int n) {
int ii, jj;
int overlap, maxOverlap = -1;
for(ii = 0; ii<n; ii++) {
if(ii == m) continue; // don't compare with yourself
else {
overlap = 0;
for(jj = 0; jj< NWORDS; jj++) {
// just to see what's going on: take these print statements out
printBinary(h->num[ii]);
printBinary(h->num[m]);
int bc = countBits(h->num[ii] & h->num[m]);
printBinary(h->num[ii] & h->num[m]);
printf("%d bits overlap\n", bc);
overlap += bc;
}
if(overlap > maxOverlap) maxOverlap = overlap;
}
}
return maxOverlap;
}
int countBits (unsigned int b) {
int count;
for (count = 0; b != 0; count++) {
b &= b - 1; // this clears the LSB-most set bit
}
return count;
}
int main(void) {
int cluster[20][CSIZE];
int temp[CSIZE];
int ii,jj;
static hmap H[20]; // make them all 0 initially
for(jj=0; jj<20; jj++){
for(ii=0; ii<CSIZE; ii++) {
temp[ii] = rand()%MAXINT;
}
hashIt(temp, CSIZE, &H[jj]);
}
for(ii=0;ii<20;ii++) {
printHash(&H[ii]);
printf("max overlap: %d\n", maxOverlap(H, ii, 20));
}
}
See if this helps at all...

Calculate the volume from a sequence of points in N^2

Given a sequence of (integer, integer) points, say (p1, ..., pn), which define the lines (p_i, p_i+1) for 1 <= i < n, plus the line (p_n, p_1). The resulting lines have the additional property that they don't intersect pairwise. What would be the best way to calculate the resulting volume?

Here is a nice code blurb with explanations as to why it works: http://alienryderflex.com/polygon_area/
// Public-domain function by Darel Rex Finley, 2006.
double polygonArea(double *X, double *Y, int points) {
double area=0. ;
int i, j=points-1 ;
for (i=0; i<points; i++) {
area+=(X[j]+X[i])*(Y[j]-Y[i]); j=i; }
return area*.5; }
You should also read the previous incarnation of this question: How do I calculate the area of a 2d polygon?

Travelling Salesman Problem

I am trying to develop a program in C++ from Travelling Salesman Problem Algorithm. I need a distance matrix and a cost matrix. After using all the formulas, i get a new resultant matrix. But I dont understand what that matrix shows.
Suppose the resultant matrix is:
1 2 3
4 5 6
7 8 9
Now I want to know what this matrix shows? Assume I have 3 cities to traverse.
Please tell me the flow. A sample program of this algorithm will be more favorable..
Thank you.
My Program is:
#include<iostream.h>
#include<conio.h>
#include <stdlib.h>
void main()
{
clrscr();
int a,b,c,d,ctr,j,Q=1,K=1 ;
float q0=0.7, p = 0.5 ;
int phe[3][3];
double dist[3][3] , mem[3][3],exp[3][3],eplt[3][3], rnd;
cout<<"enter the iterations, cities , ants ";
cin>>a>>b>>c;
for (int i=0;i<3;i++)
{
for (j=0;j<3;j++)
{
dist[i][j]=(double)rand()/(double)RAND_MAX;
if (i==j)
dist[i][j]=0;
}
}
for (i=0;i<3;i++)
{
for (j=0;j<3;j++)
{
cout<< dist[i][j]<<"\t";
}
cout<<"\n";
}
cout<<"pheromone matrix "<<endl;
for (i=0;i<3;i++)
{
for (j=0;j<3;j++)
{
if (i==j)
phe[i][j]=0;
else
phe[i][j]=1;
}
}
for ( i=0;i<3;i++)
{
for ( j=0;j<3;j++)
{
cout<< phe[i][j]<<"\t";
}
cout<<"\n";
}
cout<< "after iteration "<<endl;
for (i=0;i<3;i++)
{
ctr=0;
for (int k=0;k<3;k++)
{
// mem[i][k]=(rand()%b)+1;
// cout<<"memory"<<mem[i][k]<<"\n";
rnd= (double)rand()/(double)RAND_MAX;
cout<<"hhhhhhh"<<rnd;
if (rnd<=q0)
{
cout<<"Exploitation\n";
eplt[i][ctr] =(p*phe[i][k])+(Q/K);
}
else
{
cout<<"EXPLORATION\n";
eplt[i][ctr]= phe[i][k]/dist[i][k];
}
ctr++;
}
}
for (i=0;i<3;i++)
{
for (int k=0;k<3;k++)
{
cout <<eplt[i][k]<<"\t";
}
cout<<"\n";
}
getch();
}
OUTPUT:
enter the iterations, cities , ants 3
4
4
0 0.003967 0.335154
0.033265 0 0.2172
0.536973 0.195776 0
pheromone matrix
0 1 1
1 0 1
1 1 0
after iteration
hhhhhhh0.949919EXPLORATION
hhhhhhh0.356777EXPLOITATION
hhhhhhh0.356777EXPLOITATION
hhhhhhh0.356777EXPLOITATION
hhhhhhh0.356777EXPLOITATION
hhhhhhh0.356777EXPLOITATION
hhhhhhh0.949919EXPLORATION

First up, I'm guessing when you say My Program you mean The program in the paper since it is basically out of date C++. Standard library headers don't have .h appended, and conio.h is an MS-DOS header - most code that I've seen that uses that comes from Borland Turbo C++. Worth bearing in mind if you're going to try to compile that demo on a modern system.
Next up, what you're looking at is an adjacancy matrix. I don't believe that matrix is part of the output at all; I believe it is part of the model being used, for demonstration purposes. I believe, given you have a pheromone matrix, that what you're looking at here is Ant Colony Optimisation, a probabilistic method of solving the TSP and other problems that can be reduced to it.
From your output, it isn't clear where or how the result is being stored, and since this is homework, I am lazy and you're just asking for an outright answer, I'm not going to read that code. The premise of Ant Colony optimisation is that pheromone trails laid by ants, which walk the graph at random, decay over time (number of iterations). The longer it takes an ant to move along a particular vertex (distance), the more the laid pheromone decays. At this point, ants start to make decisions based on the strength of the laid pheromone along a path. So what happens is ants start to prefer certain routes over others, and continually re-inforce the pheromone along that path.
So, somewhere in there, there must be a matrix like the adjacancy matrix, storing the pheromone levels for each route. Combined with the length of the route, each iteration should detect a rate of decay.

Your input variables a, b, c are never used.
Your variable ctr is used in the exact same incremental way as the variable k of the same loop.
Your phenomone matrix indicates use of an ant colony optimization algorithm, why just not say it in your question ?
Such "iteration" should be, well, iterated, so probably the output you give us (which is not a normal output) is not the definitive solution, rather a provisory result of the algorithm.

In this post, implementation of simple solution is discussed.
Consider city 1 or 0 as the starting and ending point. Since route is
cyclic, we can consider any point as starting point.
Generate all (n-1)! permutations of cities.
Calculate cost of every permutation and keep track of minimum cost
permutation.
Return the permutation with minimum cost.
#include <bits/stdc++.h>
using namespace std;
int main(void){
int t;
scanf("%d",&t);
while(t--){
int n;
scanf("%d",&n);
int graph[n][n];
for(int i =0;i<n;i++){
for(int j =0;j<n;j++){
scanf("%d",&graph[i][j]);
}
}
vector<int> v;
int s = 0;
for(int i =0;i<n;i++){
if(i!=s){
v.push_back(i);
}
}
int ans = INT_MAX;
do{
int current_pathsum = 0;
int k = s;
for(int i = 0;i<v.size();i++){
current_pathsum += graph[k][v[i]];
k = v[i];
}
current_pathsum += graph[k][s];
ans = min(ans,current_pathsum);
}while(next_permutation(v.begin(),v.end()));
cout<<ans<<endl;
}
}

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio