Travelling Salesman Problem

Travelling Salesman Problem - algorithm

I am trying to develop a C++ program based on a Travelling Salesman Problem algorithm. I need a distance matrix and a cost matrix. After applying all the formulas, I get a new resultant matrix, but I don't understand what that matrix shows.
Suppose the resultant matrix is:
1 2 3
4 5 6
7 8 9
What does this matrix show, assuming I have 3 cities to traverse?
Please explain the flow. A sample program of this algorithm would be even better.
Thank you.
My Program is:
#include<iostream.h>
#include<conio.h>
#include <stdlib.h>
void main()
{
clrscr();
int a,b,c,d,ctr,j,Q=1,K=1 ;
float q0=0.7, p = 0.5 ;
int phe[3][3];
double dist[3][3] , mem[3][3],exp[3][3],eplt[3][3], rnd;
cout<<"enter the iterations, cities , ants ";
cin>>a>>b>>c;
for (int i=0;i<3;i++)
{
for (j=0;j<3;j++)
{
dist[i][j]=(double)rand()/(double)RAND_MAX;
if (i==j)
dist[i][j]=0;
}
}
for (i=0;i<3;i++)
{
for (j=0;j<3;j++)
{
cout<< dist[i][j]<<"\t";
}
cout<<"\n";
}
cout<<"pheromone matrix "<<endl;
for (i=0;i<3;i++)
{
for (j=0;j<3;j++)
{
if (i==j)
phe[i][j]=0;
else
phe[i][j]=1;
}
}
for ( i=0;i<3;i++)
{
for ( j=0;j<3;j++)
{
cout<< phe[i][j]<<"\t";
}
cout<<"\n";
}
cout<< "after iteration "<<endl;
for (i=0;i<3;i++)
{
ctr=0;
for (int k=0;k<3;k++)
{
// mem[i][k]=(rand()%b)+1;
// cout<<"memory"<<mem[i][k]<<"\n";
rnd= (double)rand()/(double)RAND_MAX;
cout<<"hhhhhhh"<<rnd;
if (rnd<=q0)
{
cout<<"Exploitation\n";
eplt[i][ctr] =(p*phe[i][k])+(Q/K);
}
else
{
cout<<"EXPLORATION\n";
eplt[i][ctr]= phe[i][k]/dist[i][k];
}
ctr++;
}
}
for (i=0;i<3;i++)
{
for (int k=0;k<3;k++)
{
cout <<eplt[i][k]<<"\t";
}
cout<<"\n";
}
getch();
}
OUTPUT:
enter the iterations, cities , ants 3
4
4
0 0.003967 0.335154
0.033265 0 0.2172
0.536973 0.195776 0
pheromone matrix
0 1 1
1 0 1
1 1 0
after iteration
hhhhhhh0.949919EXPLORATION
hhhhhhh0.356777EXPLOITATION
hhhhhhh0.356777EXPLOITATION
hhhhhhh0.356777EXPLOITATION
hhhhhhh0.356777EXPLOITATION
hhhhhhh0.356777EXPLOITATION
hhhhhhh0.949919EXPLORATION

First up, I'm guessing that when you say "My Program" you mean the program from the paper, since it is basically out-of-date C++. Standard library headers don't have .h appended, and conio.h is an MS-DOS header; most code I've seen that uses it comes from Borland Turbo C++. Worth bearing in mind if you're going to try to compile that demo on a modern system.
Next up, what you're looking at is an adjacency matrix. I don't believe that matrix is part of the output at all; I believe it is part of the model being used, for demonstration purposes. Given that you have a pheromone matrix, I believe what you're looking at here is Ant Colony Optimisation, a probabilistic method of solving the TSP and other problems that can be reduced to it.
From your output, it isn't clear where or how the result is being stored, and since this is homework, I am lazy and you're just asking for an outright answer, I'm not going to read that code. The premise of Ant Colony Optimisation is that pheromone trails laid by ants, which walk the graph at random, decay over time (number of iterations). The longer it takes an ant to move along a particular edge (distance), the more the laid pheromone decays. At this point, ants start to make decisions based on the strength of the laid pheromone along a path. So what happens is that ants start to prefer certain routes over others, and continually reinforce the pheromone along that path.
So, somewhere in there, there must be a matrix like the adjacency matrix, storing the pheromone levels for each route. Combined with the length of the route, each iteration should detect a rate of decay.
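For what it's worth, the decay-and-reinforce idea described above is often written as an evaporation step followed by a deposit step. Here is a minimal sketch, assuming a symmetric TSP and the common Ant System style of update. The names updatePheromone, rho and Q are placeholders of mine and are not taken from the question's program.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// One pheromone update for a single ant (illustrative sketch).
// rho is the evaporation rate in [0,1]; Q is a constant deposit amount.
void updatePheromone(Matrix& pheromone, const std::vector<int>& tour,
                     double tourLength, double rho, double Q) {
    // Evaporation: every trail loses a fraction rho of its strength.
    for (auto& row : pheromone)
        for (double& tau : row)
            tau *= (1.0 - rho);

    // Deposit: shorter tours leave more pheromone on each edge they use.
    const double deposit = Q / tourLength;
    for (std::size_t s = 0; s < tour.size(); ++s) {
        const int from = tour[s];
        const int to = tour[(s + 1) % tour.size()];  // wrap around to close the tour
        pheromone[from][to] += deposit;
        pheromone[to][from] += deposit;  // symmetric distances assumed
    }
}
```

Calling this once per ant per iteration makes edges on short tours accumulate pheromone while rarely used edges fade, which is the preference effect described above.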

Your input variables a, b, c are never used.
Your variable ctr is used in the exact same incremental way as the variable k of the same loop.
Your pheromone matrix indicates use of an ant colony optimization algorithm; why not just say so in your question?
Such an "iteration" should be, well, iterated, so the output you give us (which is not a normal output) is probably not the definitive solution but rather a provisional result of the algorithm.

In this post, an implementation of a simple brute-force solution is discussed.
1. Consider city 1 or 0 as the starting and ending point. Since the route is cyclic, we can consider any point as the starting point.
2. Generate all (n-1)! permutations of the cities.
3. Calculate the cost of every permutation and keep track of the minimum-cost permutation.
4. Return the permutation with minimum cost.
#include <bits/stdc++.h>
using namespace std;
int main(void){
int t;
scanf("%d",&t);
while(t--){
int n;
scanf("%d",&n);
int graph[n][n];
for(int i =0;i<n;i++){
for(int j =0;j<n;j++){
scanf("%d",&graph[i][j]);
}
}
vector<int> v;
int s = 0;
for(int i =0;i<n;i++){
if(i!=s){
v.push_back(i);
}
}
int ans = INT_MAX;
do{
int current_pathsum = 0;
int k = s;
for(int i = 0;i<v.size();i++){
current_pathsum += graph[k][v[i]];
k = v[i];
}
current_pathsum += graph[k][s];
ans = min(ans,current_pathsum);
}while(next_permutation(v.begin(),v.end()));
cout<<ans<<endl;
}
}

Find the missing coordinate of rectangle

Chef has N axis-parallel rectangles in a 2D Cartesian coordinate system. These rectangles may intersect, but it is guaranteed that all their 4N vertices are pairwise distinct.
Unfortunately, Chef lost one vertex, and up until now, none of his fixes have worked (although putting an image of a point on a milk carton might not have been the greatest idea after all…). Therefore, he gave you the task of finding it! You are given the remaining 4N−1 points and you should find the missing one.
Input
The first line of the input contains a single integer T denoting the number of test cases. The description of T test cases follows.
The first line of each test case contains a single integer N.
Then, 4N−1 lines follow. Each of these lines contains two space-separated integers x and y denoting a vertex (x,y) of some rectangle.
Output
For each test case, print a single line containing two space-separated integers X and Y ― the coordinates of the missing point. It can be proved that the missing point can be determined uniquely.
Constraints
T≤100
1≤N≤2⋅10^5
|x|,|y|≤10^9
the sum of N over all test cases does not exceed 2⋅10^5
Example Input
1
2
1 1
1 2
4 6
2 1
9 6
9 3
4 3
Example Output
2 2
Problem link: https://www.codechef.com/problems/PTMSSNG
My approach: I created frequency arrays for the x and y coordinates and then found the coordinate values that occur an odd number of times.
#include <iostream>
using namespace std;
int main() {
// your code goes here
int t;
cin>>t;
while(t--)
{
long int n;
cin>>n;
long long int a[4*n-1][2];
long long int xm,ym,x,y;
for(int i=0;i<4*n-1;i++)
{
cin>>a[i][0]>>a[i][1];
if(i==0)
{
xm=abs(a[i][0]);
ym=abs(a[i][1]);
}
if(i>0)
{
if(abs(a[i][0])>xm)
{
xm=abs(a[i][0]);
}
if(abs(a[i][1])>ym)
{
ym=abs(a[i][1]);
}
}
}
long long int frqx[xm+1],frqy[ym+1];
for(long long int i=0;i<xm+1;i++)
{
frqx[i]=0;
}
for(long long int j=0;j<ym+1;j++)
{
frqy[j]=0;
}
for(long long int i=0;i<4*n-1;i++)
{
frqx[a[i][0]]+=1;
frqy[a[i][1]]+=1;
}
for(long long int i=0;i<xm+1;i++)
{
if(frqx[i]>0 && frqx[i]%2>0)
{
x=i;
break;
}
}
for(long long int j=0;j<ym+1;j++)
{
if(frqy[j]>0 && frqy[j]%2>0)
{
y=j;
break;
}
}
cout<<x<<" "<<y<<"\n";
}
return 0;
}
My code is showing TLE for inputs <10^6

First of all, your solution is not handling negative x/y correctly. long long int frqx[xm+1],frqy[ym+1] allocates barely enough memory to hold positive values, but not enough to hold negative ones.
It doesn't even matter though, as with the guarantee that abs(x) <= 10^9, you can just statically allocate a vector of 219 elements, and map both positive and negative coordinates in there.
Second, you are not supposed to buffer the input in a. Not only is this going to overflow the stack, it is also entirely unnecessary. Write to the frequency buckets right away, don't buffer.
Same goes for most of these challenges. Don't buffer, always try to process the input directly.
About your buckets, you don't need a long long int. A bool per bucket is enough. You do not care even the least how many coordinates were sorted into the bucket, only whether the number so far was even or not. What you implemented as a separate loop can be substituted by simply toggling a flag while processing the input.
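To illustrate the "toggle a flag while reading" idea, here is a small sketch. Note that it swaps the fixed bucket array for a std::unordered_set, which is my own substitution so that negative and very large coordinates need no index mapping. The helper name toggle is hypothetical.

```cpp
#include <unordered_set>

// Flip the parity of `value`: insert it if absent, erase it if present.
// After every coordinate has been fed in, the set holds exactly the values
// that occurred an odd number of times.
void toggle(std::unordered_set<long long>& odd, long long value) {
    auto it = odd.find(value);
    if (it == odd.end())
        odd.insert(value);
    else
        odd.erase(it);
}
```

With one set for x and one for y, you call toggle on each coordinate as it is read; at the end each set should contain a single value, and together they form the missing vertex.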

I find the answer of @Ext3h adequate with respect to the errors.
Given that you already spotted the odd/even property of the problem, the solution
can be made more straightforward.
You need to find the x and y that appear an odd number of times.
In Java:
int[] missingPoint(int[][] a) {
//int n = (a.length + 1) / 4;
int[] pt = new int[2]; // In C initialize with 0.
for (int i = 0; i < a.length; ++i) {
for (int j = 0; j < 2; ++j) {
pt[j] ^= a[i][j];
}
}
return pt;
}
This uses exclusive-or ^, which is associative and commutative, with 0^x=x and x^x=0. (5^7^4^7^5=4.)
For these "find the odd one out" problems one can use this xor-ing.
In effect you do not need to keep the input in an array.
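Since the question is in C++, the same xor idea can be sketched there without storing any input. The struct name XorAccumulator is just an illustration of mine, not part of either answer.

```cpp
// Accumulates the xor of all x values and of all y values seen so far.
// Coordinates that occur an even number of times cancel out, so after
// feeding in the 4N-1 given points, (x, y) is the missing vertex.
struct XorAccumulator {
    long long x = 0;
    long long y = 0;

    void add(long long px, long long py) {
        x ^= px;
        y ^= py;
    }
};
```

In the reading loop you would call add(px, py) right after each cin >> px >> py and print x and y once the test case is finished. This needs constant extra memory regardless of the coordinate range.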

Covering segments by points

I searched and looked at the links below, but they didn't help.
Point covering problem
Segments poked (covered) with points - any tricky test cases?
Need effective greedy for covering a line segment
Problem Description:
You are given a set of segments on a line and your goal is to mark as
few points on a line as possible so that each segment contains at least
one marked point
Task.
Given a set of n segments {[a_0,b_0],[a_1,b_1],...,[a_{n-1},b_{n-1}]} with integer
coordinates on a line, find the minimum number m of points such that
each segment contains at least one point. That is, find a set of
integers X of the minimum size such that for any segment [a_i,b_i] there
is a point x in X such that a_i <= x <= b_i
Output Description:
Output the minimum number m of points on the first line and the integer
coordinates of m points (separated by spaces) on the second line
Sample Input - I
3
1 3
2 5
3 6
Output - I
1
3
Sample Input - II
4
4 7
1 3
2 5
5 6
Output - II
2
3 6
I didn't understand the question itself. I need an explanation of how to solve the problem above, but I don't want the code. Examples would be greatly helpful.

Maybe this formulation of the problem will be easier to understand. You have n people who can each tolerate a different range of temperatures [ai, bi]. You want to find the minimum number of rooms to make them all happy, i.e. you can set each room to a certain temperature so that each person can find a room within his/her temperature range.
As for how to solve the problem, you said you didn't want code, so I'll just roughly describe an approach. Think about the coldest room you have. If making it one degree warmer won't cause anyone to no longer be able to tolerate that room, you might as well make the increase, since that can only allow more people to use that room. So the first temperature you should set is the warmest one that the most cold-loving person can still tolerate. In other words, it should be the smallest of the bi. Now this room will satisfy some subset of your people, so you can remove them from consideration. Then repeat the process on the remaining people.
Now, to implement this efficiently, you might not want to literally do what I said above. I suggest sorting the people according to bi first, and for the ith person, try to use an existing room to satisfy them. If you can't, try to create a new one with the highest temperature possible to satisfy them, which is bi.

Yes, the description is pretty vague, and the only meaning that makes sense to me is this:
You have some line
A segment on the line is defined by l,r
where one parameter is the distance from the start of the line and the other is the segment's length. Which one is which is hard to tell, as the letters are not very usual for such a description. My bet is:
l length of segment
r distance of (start?) of segment from start of line
You want to find a minimal set of points
such that each segment has at least one point in it. That means that for 2 overlapping segments you need just one point ...
Surely there are more options for solving this; the obvious one is generate & test with some heuristics, like generating combinations only for segments that are overlapped more than once. So I would attack this task in this manner (using the assumed terminology from above):
sort segments by r
add number of overlaps to your segment set data
so the segment will be { r,l,n } and set the n=0 for all segments for now.
scan segments for overlaps
something like
for (i=0;i<segments;i++) // loop all segments
for (j=i+1;j<segments;j++) // loop all latter segments until they are still overlapped
if ( segment[i] and segment [j] are overlapped )
{
segment[i].n++; // update overlap counters
segment[j].n++;
}
else break;
Now if the r-sorted segments are overlapped then
segment[i].r <=segment[j].r
segment[i].r+segment[i].l>=segment[j].r
scan segments handling non overlapped segments
for each segment such that segment[i].n==0 add to the solution point list its point (middle) defined by distance from start of line.
points.add(segment[i].r+0.5*segment[i].l);
And after that remove segment from the list (or tag it as used or what ever you do for speed boost...).
scan segments that are overlapped just once
So if segment[i].n==1 then you need to determine whether it overlaps with i-1 or i+1. Add the midpoint of the overlap to the solution points and remove segment i from the list. Then decrement the n of the overlapping segment (i+1 or i-1), and if it reaches zero remove it too.
points.add(0.5*( segment[j].r + min(segment[i].r+segment[i].l , segment[j].r+segment[j].l )));
Loop this whole scanning until there is no new point added to the solution.
now you got only multiple overlaps left
From this point I will be a bit vague for 2 reasons:
I do not have this tested and I do not have any test data to validate it, not to mention I am lazy.
This smells like an assignment, so there is some work/fun left for you.
From the start I would scan all segments and remove all of those that contain any point from the solution. You should perform this step after any change in the solution.
Now you can experiment with generating combinations of points for each overlapped group of segments and remember the minimal number of points covering all segments in the group (simply by brute force).
There are more heuristics possible like handling all twice overlapped segments (in similar manner as the single overlaps) but in the end you will have to do brute force on the rest of data ...
[edit1] as you added new info
The l,r mean the distances of the left and right ends from the start of the line. So if you want to convert between the other formulation { r',l' } and this one (l<=r) then
l=r'
r=r'+l'
and back
r'=l
l'=r-l
Sorry, too lazy to rewrite the whole thing ...

Here is the working solution in C, please refer to it partially and try to fix your code before reading the whole. Happy coding :) Spoiler alert
#include <stdio.h>
#include <stdlib.h>
int cmp_func(const void *ptr_a, const void *ptr_b)
{
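/* Each array element is a pointer to a {left, right} pair of longs. */
const long *a = *(long * const *)ptr_a;
const long *b = *(long * const *)ptr_b;
/* Sort by right endpoint, then by left endpoint. Compare instead of
   subtracting so that a long difference is never truncated to int. */
if (a[1] == b[1])
return (a[0] > b[0]) - (a[0] < b[0]);
return (a[1] > b[1]) - (a[1] < b[1]);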
}
int main()
{
int i, j, n, num_val;
long **arr;
scanf("%d", &n);
long values[n];
arr = malloc(n * sizeof(long *));
for (i = 0; i < n; ++i) {
*(arr + i) = malloc(2 * sizeof(long));
scanf("%ld %ld", &arr[i][0], &arr[i][1]);
}
qsort(arr, n, sizeof(long *), cmp_func);
i = j = 0;
num_val = 0;
while (i < n) {
int skip = 0;
values[num_val] = arr[i][1];
for (j = i + 1; j < n; ++j) {
int condition;
condition = arr[i][1] <= arr[j][1] ? arr[j][0] <= arr[i][1] : 0;
if (condition) {
skip++;
} else {
break;
}
}
num_val++;
i += skip + 1;
}
printf("%d\n", num_val);
for (int k = 0; k < num_val; ++k) {
printf("%ld ", values[k]);
}
for (i = 0; i < n; ++i)
free(arr[i]); /* release each segment before the pointer array */
free(arr);
return 0;
}

Here's the working code in C++ for anyone searching :)
#include <bits/stdc++.h>
#define ll long long
#define double long double
#define vi vector<int>
#define endl "\n"
#define ff first
#define ss second
#define pb push_back
#define all(x) (x).begin(),(x).end()
#define mp make_pair
using namespace std;
bool cmp(const pair<ll,ll> &a, const pair<ll,ll> &b)
{
return (a.second < b.second);
}
vector<ll> MinSig(vector<pair<ll,ll>>&vec)
{
vector<ll> points;
for(int x=0;x<(int)vec.size();) // include the last segment, otherwise its point can be missed
{
bool found=false;
points.pb(vec[x].ss);
for(int y=x+1;y<vec.size();y++)
{
if(vec[y].ff>vec[x].ss)
{
x=y;
found=true;
break;
}
}
if(!found)
break;
}
return points;
}
int main()
{
ios_base::sync_with_stdio(false);
cin.tie(NULL);
int n;
cin>>n;
vector<pair<ll,ll>>v;
for(int x=0;x<n;x++)
{
ll temp1,temp2;
cin>>temp1>>temp2;
v.pb(mp(temp1,temp2));
}
sort(v.begin(),v.end(),cmp);
vector<ll>res=MinSig(v);
cout<<res.size()<<endl;
for(auto it:res)
cout<<it<<" ";
}

Resolve 16-Queens Problem in 1 second only

I need to solve the 16-Queens problem in 1 second.
I used a backtracking algorithm like the one below.
This code is fast enough to solve the N-Queens problem in 1 second when N is smaller than 13.
But it takes a long time if N is bigger than 13.
How can I improve it?
#include <stdio.h>
#include <stdlib.h>
int n;
int arr[100]={0,};
int solution_count = 0;
int check(int i)
{
int k=1, ret=1;
while (k < i && ret == 1) {
if (arr[i] == arr[k] ||
abs(arr[i]-arr[k]) == abs(i-k))
ret = 0;
k++;
}
return ret;
}
void backtrack(int i)
{
if(check(i)) {
if(i == n) {
solution_count++;
} else {
for(int j=1; j<=n; j++) {
arr[i+1] = j;
backtrack(i+1);
}
}
}
}
int main()
{
scanf("%d", &n);
backtrack(0);
printf("%d", solution_count);
}

Your algorithm is almost fine. A small change will probably give you enough time improvement to produce a solution much faster. In addition, there is a data structure change that should let you reduce the time even further.
First, tweak the algorithm a little: rather than waiting for the check all the way till you place all N queens, check early: every time you are about to place a new queen, check if another queen is occupying the same column or the same diagonal before making the arr[i+1] = j; assignment. This will save you a lot of CPU cycles.
Now you need to speed up checking of the next queen. In order to do that you have to change your data structure so that you could do all your checks without any loops. Here is how to do it:
You have N rows
You have N columns
You have 2N-1 ascending diagonals
You have 2N-1 descending diagonals
Since no two queens can take the same spot in any of the four "dimensions" above, you need an array of boolean values for the last three things; the rows are guaranteed to be different, because the i parameter of backtrack, which represents the row, is guaranteed to be different.
With N up to 16, 2N-1 goes up to 31, so you can use uint32_t for your bit arrays. Now you can check if a column c is taken by applying bitwise and & to the columns bit mask and 1 << c. Same goes for the diagonal bit masks.
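The bit-mask bookkeeping described above could look roughly like the sketch below. It is only an illustration under my own assumptions: rows and columns are 0-based, the two diagonal families are indexed by row+col and row-col+n-1, and the names placeRow and countQueens are hypothetical. It makes no claim about reaching the 1 second target.

```cpp
#include <cstdint>

// Places queens row by row. The three masks record which columns and which
// diagonals are already occupied. Both diagonal indices lie in 0..2n-2, so
// they fit in 32 bits for n <= 16.
static std::uint64_t placeRow(int n, int row, std::uint32_t cols,
                              std::uint32_t diagSum, std::uint32_t diagDiff) {
    if (row == n) return 1;  // all n queens placed: one complete solution
    std::uint64_t total = 0;
    for (int c = 0; c < n; ++c) {
        const std::uint32_t colBit = 1u << c;
        const std::uint32_t sumBit = 1u << (row + c);
        const std::uint32_t diffBit = 1u << (row - c + n - 1);
        // Loop-free conflict test: three bitwise ANDs.
        if ((cols & colBit) || (diagSum & sumBit) || (diagDiff & diffBit))
            continue;
        total += placeRow(n, row + 1, cols | colBit,
                          diagSum | sumBit, diagDiff | diffBit);
    }
    return total;
}

// Number of solutions for an n x n board, intended for 1 <= n <= 16.
std::uint64_t countQueens(int n) {
    return placeRow(n, 0, 0, 0, 0);
}
```

Because the masks are passed by value, there is nothing to undo when the recursion backtracks, which keeps the inner loop short.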
Note: Doing a 16 Queen problem in under a second would be rather tricky. A very highly optimized program does it in 23 seconds on an 800 MHz PC. A 3.2 GHz should give you a speed-up of about 4 times, but it would be about 8 seconds to get a solution.

I would change while (k < i && ret == 1) { to while (k < i) {
and instead of ret = 0; do return 0;.
(this will save a check every iteration. It might be that your compiler does this anyway, or some other performance trick, but this might help a bit).
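Spelled out, the suggested change might look like this. To keep the sketch self-contained, the board is passed in as a parameter instead of using the global arr from the question; that is my own adjustment.

```cpp
#include <cstdlib>

// Returns 1 if the queen in row i conflicts with no queen in rows 1..i-1,
// otherwise 0. arr[r] holds the column of the queen in row r (1-based, as
// in the question).
int check(const int* arr, int i) {
    for (int k = 1; k < i; ++k) {
        // Same column, or same diagonal (column distance equals row distance).
        if (arr[i] == arr[k] || std::abs(arr[i] - arr[k]) == i - k)
            return 0;  // leave at the first conflict, no ret flag needed
    }
    return 1;
}
```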

Minimizing the distance of pairing points

My problem is as follows:
Given a number of 2n points, I can calculate the distance between all points
and get a symmetrical matrix.
Can you create n pairs of points, so that the sum of the distance of all pairs is
minimal?
EDIT: Every point has to be in one of the pairs. Which means that
every point is only allowed to be in one pair.
I have naively tried to use the Hungarian algorithm and hoped that it may give me an assignment, so that the assignments are symmetrical. But that obviously did not work, as I do not have a bipartite graph.
After a search, I found the Stable roommates problem, which seems to be similar to my problem, but the difference is, that it just tries to find a matching, but not to try to minimize some kind of distance.
Does anyone know a similar problem or even a solution? Did I miss something? The problem does actually not seem that difficult, but I just could not come up with an optimal solution.

There's a primal-dual algorithm due to Edmonds (the Blossom algorithm), which you really don't want to implement yourself if possible. Vladimir Kolmogorov has an implementation that may be suitable for your purposes.

Try network-flow. The max flow is the number of the pairs you want to create. And calculate the min cost of it.

Now, this isn't a guarantee, just a hunch.
You can find the shortest pair, match them, and remove them from the set,
and recurse until you have no pairs left.
It is clearly sub-optimal, but I have a hunch that the ratio of just how sub-optimal this is to the absolutely optimal solution can be bounded. The hope is to use some sub-modularity argument and bound it to something like a (1 - 1 / e) fraction of the global optimum, but I wasn't able to do it. Maybe someone could take a stab at it.
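A rough sketch of that greedy heuristic, in case it helps someone experiment with it. Sorting all pairs once and scanning them is used here as a shortcut for "repeatedly take the shortest remaining pair"; the function name greedyPairs is made up for the example.

```cpp
#include <algorithm>
#include <tuple>
#include <utility>
#include <vector>

// Greedy (not optimal) pairing: scan candidate pairs from shortest to
// longest and accept a pair whenever both of its points are still unmatched.
// dist is assumed to be a symmetric matrix over an even number of points.
std::vector<std::pair<int, int>> greedyPairs(
        const std::vector<std::vector<double>>& dist) {
    const int m = static_cast<int>(dist.size());
    std::vector<std::tuple<double, int, int>> edges;
    for (int i = 0; i < m; ++i)
        for (int j = i + 1; j < m; ++j)
            edges.emplace_back(dist[i][j], i, j);
    std::sort(edges.begin(), edges.end());  // by distance, ties by indices

    std::vector<bool> used(m, false);
    std::vector<std::pair<int, int>> pairs;
    for (const auto& e : edges) {
        const int i = std::get<1>(e);
        const int j = std::get<2>(e);
        if (!used[i] && !used[j]) {
            used[i] = used[j] = true;
            pairs.emplace_back(i, j);
        }
    }
    return pairs;
}
```

For four points on a line at 0, 3, 4 and 7, this pairs (3,4) first and is then forced to pair (0,7), for a total of 8, whereas pairing (0,3) and (4,7) costs 6. So it should be seen only as a quick baseline.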

There is a C++ memoization implementation in Competitive Programming 3 as follows (note maximum of N was 8):
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cstring>
using namespace std;
int N, target;
double dist[20][20], memo[1<<16];
double matching(int bitmask)
{
if (memo[bitmask] > -0.5) // Already computed? Then return the result if yes
return memo[bitmask];
if (bitmask == target) // If all students are already matched then cost is zero
return memo[bitmask] = 0;
double ans = 2000000000.0; // Infinity could also work
int p1, p2;
for (p1 = 0; p1 < 2*N; ++p1) // Find first non-matched point
if (!(bitmask & (1 << p1)))
break;
for (p2 = p1 + 1; p2 < 2*N; ++p2) // and pair it with another non-matched point
if (!(bitmask & (1 << p2)))
ans = min(ans, dist[p1][p2]+matching(bitmask| (1 << p1) | (1 << p2)));
return memo[bitmask] = ans;
}
and then the main method (driving code)
int main()
{
int i,j, caseNo = 1, x[20], y[20];
while(scanf("%d", &N), N){
for (i = 0; i < 2 * N; ++i)
scanf("%d %d", &x[i], &y[i]);
for (i = 0; i < 2*N - 1; ++i)
for (j = i + 1; j < 2*N; ++j)
dist[i][j] = dist[j][i] = hypot(x[i]-x[j], y[i]-y[j]);
// use DP to solve min weighted perfect matching on small general graph
for (i = 0; i < (1 << 16); ++i) memo[i] = -1;
target = (1 << (2 * N)) - 1;
printf("Case %d: %.2lf\n", caseNo++, matching(0));
}
return 0;
}

Parallel SVD decomposition with OpenMP does not perform as expected

I have recently coded a parallel SVD decomposition routine, based on a "one sided Jacobi rotations" algorithm. The code works correctly but is tremendously slow.
In fact it should exploit the parallelism in the inner for loop for(int g=0;g<n;g++), but when I comment out the #pragma omp parallel for directive I see only a very slight decrease in performance. In other words, there is no appreciable speed-up from going parallel (the code does run in parallel with 4 threads).
Note 1: almost all the work is concentrated in the three following loops involving the matrices A and V, which are relatively large.
for(h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
and
double Ahi,Vhi;
for(h=0;h<N;h++)//...rotate Ai & Aj (only columns i & j are changed)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j are updated)
for(h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
All the parallelism should be exploited there but is not. And I can't understand why.
Note 2: The same happens on both Windows (Cygwin compiler) and Linux (GCC) platforms.
Note 3: matrices are represented by column-major arrays.
So I'm looking for some help in finding out why the parallelism is not exploited. Did I miss something? Is there some hidden overhead in the parallel for that I cannot see?
Thank you very much for any suggestions.
int sweep(double* A,double*V,int N,double tol)
{
static int*I=new int[(int)ceil(0.5*(N-1))];
static int*J=new int[(int)ceil(0.5*(N-1))];
int ntol=0;
for(int r=0;r<N;r++) //fill in i,j indexes of parallel rotations in vectors I & J
{
int k=r+1;
if (k==N)
{
for(int i=2;i<=(int)ceil(0.5*N);i++){
I[i-2]=i-1;
J[i-2]=N+2-i-1;
}
}
else
{
for(int i=1;i<=(int)ceil(0.5*(N-k));i++)I[i-1]=i-1;
for(int i=1;i<=(int)ceil(0.5*(N-k));i++)J[i-1]=N-k+2-i-1;
if(k>2)
{
int j=(int)ceil(0.5*(N-k));
for(int i=N-k+2;i<=N-(int)floor(0.5*k);i++){
I[j]=i-1;
J[j]=2*N-k+2-i-1;
j++;
}
}
}
int n=(k%2==0)?(int)floor(0.5*(N-1)):(int)floor(0.5*N);
#pragma omp parallel for schedule(dynamic,5) reduction(+:ntol) default(none) shared(std::cout,I,J,A,V,N,n,tol)
for(int g=0;g<n;g++)
{
int i=I[g];
int j=J[g];
double p=0;
double qi=0;
double qj=0;
double cs,sn,q,c;
int h;
for(h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
q=qi-qj;
if(p*p/(qi*qj)<tol) ntol++; //if Ai & Aj are orthogonal enough...
else //if Ai & Aj are not orthogonal enough then... rotate them
{
c=sqrt(4*p*p+q*q);
if(q>=0){
cs=sqrt((c+q)/(2*c));
sn=p/(c*cs);
}
else{
sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
cs=p/(c*sn);
}
//...rotate Ai & Aj (only columns i & j are changed)
double Ahi,Vhi;
for(h=0;h<N;h++)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j are updated)
for(h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
}
}
}
if(2*ntol==(N*(N-1)))return(1);//if each columns of A is orthogonal enough to each other stop sweep
return(0);
}

Thanks to Z boson's remarks I managed to write a far better performing parallel SVD decomposition. It runs many times faster than the original one, and I guess it could be improved further by using SIMD instructions.
I am posting the code in case anyone finds it useful. All the tests I performed gave correct results, but there is no warranty for its use.
I'm really sorry not to have had the time to comment the code properly to make it easier to understand.
int sweep(double* A,double*V,int N,double tol,int M, int n)
{
/********************************************************************************************
This routine performs a parallel "sweep" of the SVD factorization algorithm for the matrix A.
It implements a single sided Jacobi rotation algorithm, described by Michael W. Berry,
Dani Mezher, Bernard Philippe, and Ahmed Sameh in "Parallel Algorithms for the Singular Value
Decomposition".
At each sweep the A matrix becomes a little more orthogonal, until all columns of A are mutually
orthogonal within a given tolerance. At this point the sweep() routine returns 1 and convergence
is attained.
Arguments:
A : on input the square matrix to be orthogonalized, on exit a more "orthogonal" matrix
V : on input the accumulated rotation matrix, on exit it is updated with the current rotations
N : dimension of the matrices.
tol : tolerance for convergence of orthogonalization. See the explanation in the SVD() routine
M : number of blocks (it is calculated from the given block size n)
n : block size (number of columns on each block). It should be an optimal value according to the hardware available
returns : 1 if in the last sweep convergence is attained, 0 if not and one more sweep is necessary
Author : Renato Talucci 2015
*************************************************************************************************/
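// Requires <math.h> (sqrt, ceil, floor) to be included at file scope, not inside the function.
// imin(a,b) is assumed to be a helper returning the smaller of two ints.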
int ntol=0;
//STEP 1 : INTERNAL BLOCK ORTHOGONALISATION
#pragma omp parallel for reduction(+:ntol) shared(A,V,n,tol,N,M) default(none)
for(int a=0;a<M;a++)
{
for(int i=a*n;i<a*n+imin(n,N-a*n)-1;i++)
{
for(int j=i+1;j<a*n+imin(n,N-a*n);j++)
{
double p=0;
double qi=0;
double qj=0;
double cs,sn,q,c;
for(int h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
q=qi-qj;
if((p*p/(qi*qj)<tol)||(qi<tol)||(qj<tol))ntol++; //if Ai & Aj are orthogonal enough...
else //if Ai & Aj are not orthogonal enough then... rotate them
{
c=sqrt(4*p*p+q*q);
if(q>=0){
cs=sqrt((c+q)/(2*c));
sn=p/(c*cs);
}
else{
sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
cs=p/(c*sn);
}
//...rotate Ai & Aj
double Ahi,Vhi;
for(int h=0;h<N;h++)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j are updated)
for(int h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
}
}
}
}
//STEP 2 : PARALLEL BLOCK MUTUAL ORTHOGONALISATION
static int*I=new int[(int)ceil(0.5*(M-1))];
static int*J=new int[(int)ceil(0.5*(M-1))];
for(int h=0;h<M;h++)
{
//fill in i,j indexes of blocks to be mutually orthogonalized at each turn
int k=h+1;
if (k==M)
{
for(int i=2;i<=(int)ceil(0.5*M);i++){
I[i-2]=i-1;
J[i-2]=M+2-i-1;
}
}
else
{
for(int i=1;i<=(int)ceil(0.5*(M-k));i++)I[i-1]=i-1;
for(int i=1;i<=(int)ceil(0.5*(M-k));i++)J[i-1]=M-k+2-i-1;
if(k>2)
{
int j=(int)ceil(0.5*(M-k));
for(int i=M-k+2;i<=M-(int)floor(0.5*k);i++){
I[j]=i-1;
J[j]=2*M-k+2-i-1;
j++;
}
}
}
int ng=(k%2==0)?(int)floor(0.5*(M-1)):(int)floor(0.5*M);
#pragma omp parallel for schedule(static,5) shared(A,V,I,J,n,tol,N,ng) reduction(+:ntol) default(none)
for(int g=0;g<ng;g++)
{
int block_i=I[g];
int block_j=J[g];
for(int i=block_i*n;i<block_i*n+imin(n,N-block_i*n);i++)
{
for(int j=block_j*n;j<block_j*n+imin(n,N-block_j*n);j++)
{
double p=0;
double qi=0;
double qj=0;
double cs,sn,q,c;
int h;
for(h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
q=qi-qj;
if((p*p/(qi*qj)<tol)||(qi<tol)||(qj<tol))ntol++; //if Ai & Aj are orthogonal enough...
else //if Ai & Aj are not orthogonal enough then... rotate them
{
c=sqrt(4*p*p+q*q);
if(q>=0){
cs=sqrt((c+q)/(2*c));
sn=p/(c*cs);
}
else{
sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
cs=p/(c*sn);
}
//...rotate Ai & Aj
double Ahi,Vhi;
for(h=0;h<N;h++)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j are updated)
for(h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
}
}
}
}
}
if(2*ntol==(N*(N-1)))return(1);//if each columns of A is orthogonal enough to each other stop sweep
return(0);
}
int SVD(double* A,double* U,double*V,int N,double tol,double* sigma)
{
/********************************************************************************************
This routine calculates the SVD decomposition of the square matrix A [NxN]
A = U * S * V'
Arguments :
A : on input NxN square matrix to be factorized, on exit contains the
rotated matrix A*V=U*S.
V : on input an identity NxN matrix, on exit is the right orthogonal matrix
of the decomposition A = U*S*V'
U : NxN matrix, on exit is the left orthogonal matrix of the decomposition A = U*S*V'.
sigma : array of dimension N. On exit contains the singular values, i.e. the diagonal
elements of the matrix S.
N : Dimension of the A matrix.
tol : Tolerance for the convergence of the orthogonalisation of A. For any two columns Ai and Aj
of A, convergence is attained when Ai*Aj / ( |Ai|*|Aj| ) < tol for each i,j=0,..,N-1 (i!=j)
The software returns the number of sweeps needed for convergence.
NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING I.E. M(i,j)=M[i+N*j]
Author: Renato Talucci 2015
*************************************************************************************************/
int n=24;//this is the dimension of block submatrices, you shall enter an optimal value for your hardware
int M=N/n+int(((N%n)!=0)?1:0);
int swp=0;//sweeps counter
int converged=0;
while(converged==0) {
converged=sweep(A,V,N,tol,M,n);
swp++;
}
#pragma omp parallel for default(none) shared(sigma,A,U,N)
for(int i=0;i<N;i++)
{
double si=0;
for(int j=0;j<N;j++) si+=A[j+N*i]*A[j+N*i];
si=sqrt(si);
for(int k=0;k<N;k++) U[k+N*i]=A[k+N*i]/si;
sigma[i]=si;
}
return(swp);
}
Note: if you prefer to leave A unchanged on exit, it is sufficient to calculate U=A*V before entering the while loop (since V is the identity on input, this is just a copy of A into U) and to pass U to the sweep() routine instead of A. After sweep() converges, the columns of U are normalized instead of those of A, and the #pragma omp directive must list U in the shared variables instead of A.
Note 2: if you have (as I have) to factorize a sequence of matrices A(k), each of which can be considered a perturbation of the previous A(k-1), as with the Jacobian matrix in a multistep Newton solver, the driver can easily be modified to update the factorization of A(k-1) to that of A(k) instead of computing the factorization of A(k) from scratch. Here is the code:
int updateSVD(double* DA,double* U,double*V,int N,double tol,double* sigma)
{
/********************************************************************************************
Given a previous factorization
A(k-1) = U(k-1) * S(k-1) * V(k-1)'
and given a perturbation DA(k) of A(k-1), i.e.
A(k) = A(k-1) + DA(k)
this routine calculates the SVD factorization of A(k), starting from the factorization of A(k-1)
Arguments:
DA : on input NxN perturbation matrix, unchanged on exit
U : on input NxN orthonormal left matrix of the previous (k-1) factorization, on exit
orthonormal left matrix of the current factorization
V : on input NxN orthonormal right matrix of the previous (k-1) factorization, on exit
orthonormal right matrix of the current factorization
N : dimension of the matrices
tol : Tolerance for the convergence of the orthogonalisation of A. For any two columns Ai and Aj
of A, convergence is attained when Ai*Aj / ( |Ai|*|Aj| ) < tol for each i,j=0,..,N-1 (i!=j)
sigma : on input, array with the N singular values of the previous factorization, on exit
array with the N singular values of the current factorization
NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING I.E. M(i,j)=M[i+N*j]
Author: Renato Talucci 2015
*************************************************************************************************/
for(int i=0;i<N;i++) for(int j=0;j<N;j++) U[i+N*j]*=sigma[j];
int n=24; //dimension of the block submatrices; tune this value for your hardware
matmat_col_col(DA,V,U,N,n); //U =U(k-1)*S(k-1) + DA(k)*V(k-1) = A(k)*V(k-1)
int M=N/n+int(((N%n)!=0)?1:0); //number of blocks
int swp=0;//sweeps counter
int converged=0;
while(converged==0) {
converged=sweep(U,V,N,tol,M,n);
swp++;
}
#pragma omp parallel for default(none) shared(sigma,U,N)
for(int i=0;i<N;i++)
{
double si=0;
for(int j=0;j<N;j++) si+=U[j+N*i]*U[j+N*i];
si=sqrt(si);
for(int k=0;k<N;k++) U[k+N*i]=U[k+N*i]/si;//note: si==0 (rank-deficient A) would divide by zero here
sigma[i]=si;
}
return(swp);
}
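The comment on the matmat_col_col call above rests on a short identity, written out here as a sketch. It assumes only what the routine's header states: V(k-1) is orthogonal and A(k) = A(k-1) + DA(k).

```latex
\begin{aligned}
A_{k-1} &= U_{k-1} S_{k-1} V_{k-1}^{T}
  \;\Longrightarrow\; A_{k-1} V_{k-1} = U_{k-1} S_{k-1}
  \quad (V_{k-1}^{T} V_{k-1} = I),\\
A_{k} V_{k-1} &= \left(A_{k-1} + \Delta A_{k}\right) V_{k-1}
  = U_{k-1} S_{k-1} + \Delta A_{k}\, V_{k-1}.
\end{aligned}
```

So scaling the columns of U by sigma and then accumulating DA*V into it gives A(k)*V(k-1), which is the matrix the sweeps go on to orthogonalize. If DA is small, its columns are presumably already close to orthogonal, which would be why few sweeps are needed.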
Finally, the routine matmat_col_col(DA,V,U,N,n) is a parallel block matrix product. Here is the code:
inline int imin(int a,int b) {return((a<=b)?a:b);}
void matmat_col_col(double* A,double* B,double*C,int N,int n)
/********************************************************************************************
square matrix block product NxN :
C = C + A * B
n is the optimal block dimension
N is the dimension of the matrices
NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING M(i,j) = M[i+N*j]
Author: Renato Talucci 2015
*************************************************************************************************/
{
int M=N/n+int(((N%n)!=0)?1:0);
#pragma omp parallel for shared(M,A,B,C)
for(int a=0;a<M;a++)
{
for(int b=0;b<M;b++)
{
for(int c=0;c<M;c++)
{
for(int i=a*n;i<imin((a+1)*n,N);i++)
{
for(int j=b*n;j<imin((b+1)*n,N);j++)
{
for(int k=c*n;k<imin((c+1)*n,N);k++)
{
C[i+N*j]+=A[i+N*k]*B[k+N*j];
}
}
}
}
}
}
return;
}
I hope all the copying and pasting has not introduced any typos.
It would be nice if anyone could improve the code.
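For checking the routines above on small matrices, a plain serial, unblocked one-sided Jacobi sketch can serve as a reference. This is not the blocked sweep() from the post. The names jacobi_svd_reference and reconstruction_error, the max_sweeps guard and the sign convention of the rotation are assumptions made for this illustration. It uses the same column-major layout M(i,j)=M[i+N*j].

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical reference routine: unblocked one-sided Jacobi on a column-major NxN matrix.
// On exit A holds A_in*V (mutually orthogonal columns, i.e. U*S), V the accumulated
// rotations and sigma the column norms. Returns the number of sweeps performed.
int jacobi_svd_reference(std::vector<double>& A, std::vector<double>& V,
                         std::vector<double>& sigma, int N, double tol,
                         int max_sweeps = 60)
{
    assert(N > 0 && static_cast<int>(A.size()) == N * N);
    V.assign(N * N, 0.0);
    for (int i = 0; i < N; i++) V[i + N * i] = 1.0; // V starts as the identity
    int swp = 0;
    bool rotated = true;
    while (rotated && swp < max_sweeps) {
        rotated = false;
        for (int i = 0; i < N - 1; i++) {
            for (int j = i + 1; j < N; j++) {
                double alpha = 0.0, beta = 0.0, gamma = 0.0;
                for (int h = 0; h < N; h++) {
                    alpha += A[h + N * i] * A[h + N * i]; // |Ai|^2
                    beta  += A[h + N * j] * A[h + N * j]; // |Aj|^2
                    gamma += A[h + N * i] * A[h + N * j]; // Ai*Aj
                }
                // columns already orthogonal enough: no rotation needed
                if (std::fabs(gamma) <= tol * std::sqrt(alpha * beta)) continue;
                rotated = true;
                double zeta = (beta - alpha) / (2.0 * gamma);
                double t = (zeta >= 0.0 ? 1.0 : -1.0) /
                           (std::fabs(zeta) + std::sqrt(1.0 + zeta * zeta));
                double cs = 1.0 / std::sqrt(1.0 + t * t);
                double sn = cs * t;
                for (int h = 0; h < N; h++) { // rotate columns i and j of A and V
                    double Ahi = A[h + N * i], Vhi = V[h + N * i];
                    A[h + N * i] = cs * Ahi - sn * A[h + N * j];
                    A[h + N * j] = sn * Ahi + cs * A[h + N * j];
                    V[h + N * i] = cs * Vhi - sn * V[h + N * j];
                    V[h + N * j] = sn * Vhi + cs * V[h + N * j];
                }
            }
        }
        swp++;
    }
    sigma.assign(N, 0.0);
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int h = 0; h < N; h++) s += A[h + N * i] * A[h + N * i];
        sigma[i] = std::sqrt(s); // singular value = norm of the rotated column
    }
    return swp;
}

// Largest entry of |AV*V' - A0|, i.e. how well the original matrix is recovered.
double reconstruction_error(const std::vector<double>& A0, const std::vector<double>& AV,
                            const std::vector<double>& V, int N)
{
    double err = 0.0;
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            double s = 0.0;
            for (int k = 0; k < N; k++) s += AV[r + N * k] * V[c + N * k];
            err = std::max(err, std::fabs(s - A0[r + N * c]));
        }
    return err;
}
```

Dividing column i of the rotated A by sigma[i] (when it is nonzero) gives the corresponding column of U, as in the SVD() driver. Comparing sigma and the reconstruction error from this sketch with the output of the blocked code on the same small input is one way to catch copy and paste errors.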

The FLOPs for a singular value decomposition (SVD) should scale as O(N^3), whereas the memory reads scale as O(N^2). This means it may be possible to parallelize the algorithm and have it scale well with the number of cores.
However, your implementation of SVD is memory-bandwidth bound, which means it cannot scale well with the number of cores on a single-socket system. Your three nested loops currently each run over the entire range of N, so much of the data has been evicted from the cache by the time it needs to be reused.
In order to be compute bound, you will have to change your algorithm so that, instead of operating on long strips/dot products of size N, it uses loop tiling to perform on the order of n^3 calculations (where n is the block size) on the n^2 block elements held in cache.
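As a rough operation count (constants omitted, and assuming a strip is streamed from memory once per rotation while a block stays in cache once loaded), the ratio of arithmetic to data traffic in the two cases is:

```latex
I_{\text{strip}} \sim \frac{O(N)\ \text{flops}}{O(N)\ \text{words loaded}} = O(1),
\qquad
I_{\text{block}} \sim \frac{O(n^{3})\ \text{flops}}{O(n^{2})\ \text{words loaded}} = O(n).
```

A larger block size n therefore means more work per word fetched, up to the point where the blocks no longer fit in cache.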
Here is a paper on parallel SVD computation, found through a Google search, whose abstract says:
the key point of our proposed block JRS algorithm is reusing the loaded data into cache memory by performing computations on matrix blocks (b rows) instead of on strips of vectors as in JRS iteration algorithms.
I used loop tiling/blocks to achieve good scaling with the number of cores for Cholesky decomposition.
Note that using tiles/blocks will improve the performance of your single-threaded code as well.
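As an illustration of the loop tiling described above, here is a minimal serial sketch that computes all column dot products of A (the Gram matrix G = A'*A, which a Jacobi sweep needs pair by pair), first over full-length strips and then over n-sized tiles. The function names gram_naive, gram_tiled, make_test_matrix and max_abs_diff are hypothetical and are not taken from the question or the paper. The layout is column major, M(i,j)=M[i+N*j], as in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Strip version: every dot product streams two full columns of length N.
std::vector<double> gram_naive(const std::vector<double>& A, int N)
{
    std::vector<double> G(N * N, 0.0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int h = 0; h < N; h++)
                G[i + N * j] += A[h + N * i] * A[h + N * j];
    return G;
}

// Tiled version: the three outer loops walk over n-sized blocks, the three inner
// loops do the work on one pair of n x n tiles, which is reused while it is in cache.
std::vector<double> gram_tiled(const std::vector<double>& A, int N, int n)
{
    assert(n > 0);
    std::vector<double> G(N * N, 0.0);
    for (int hb = 0; hb < N; hb += n)          // block of rows
        for (int ib = 0; ib < N; ib += n)      // block of "left" columns
            for (int jb = 0; jb < N; jb += n)  // block of "right" columns
                for (int i = ib; i < std::min(ib + n, N); i++)
                    for (int j = jb; j < std::min(jb + n, N); j++) {
                        double s = 0.0;
                        for (int h = hb; h < std::min(hb + n, N); h++)
                            s += A[h + N * i] * A[h + N * j];
                        G[i + N * j] += s; // partial dot product of this row block
                    }
    return G;
}

// Deterministic test data: small multiples of 0.25, so all sums are exact.
std::vector<double> make_test_matrix(int N)
{
    std::vector<double> A(N * N);
    for (int k = 0; k < N * N; k++) A[k] = ((k * 37) % 11 - 5) * 0.25;
    return A;
}

// Largest absolute difference between two equally sized arrays.
double max_abs_diff(const std::vector<double>& X, const std::vector<double>& Y)
{
    assert(X.size() == Y.size());
    double d = 0.0;
    for (std::size_t k = 0; k < X.size(); k++) d = std::max(d, std::fabs(X[k] - Y[k]));
    return d;
}
```

Both versions do the same arithmetic; only the order of the memory accesses changes. Whether the tiled form is actually faster depends on N, n and the cache sizes, so it is worth timing both on the target machine.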
