I am solving this problem on CSES.
Given n planets, each with exactly 1 teleporter which teleports us to some other planet (possibly the same one), we have to solve q queries. Each query consists of a start planet x and a number k of teleporters to traverse. For each query, we need to tell where we would end up after going through k teleporters.
I have attempted this problem using the binary lifting concept.
For each planet, I first saved the planets we would reach by going through 2^0, 2^1, 2^2, ... teleporters.
Now, as per the constraints (esp. for k) provided in the question, we only need to store the values up to 2^31.
Then, for each query, starting from the start planet, I traverse through the teleporters using the data in the above created array (in 1) to mimic the binary expansion of k, the number of teleporters to traverse.
For example, if k = 5, i.e. (101)_2, and the initial planet is x, I first go (001)_2 = 1 planet ahead, using the array, let's say to planet y, and then (100)_2 = 4 planets ahead. The planet now reached is the required answer to the query.
Unfortunately, I am receiving TLE (time limit exceeded) error in the last test case (test 12).
Here's my code for reference:
#define inp(x) ll x; scanf("%lld", &x)
void solve()
{
// Inputting the values of n, number of planets and q, number of queries.
inp(n);
inp(q);
// Inputting the next planet each teleporter points to, corrected for 0-based indexing
vector<int> adj(n);
for(int i = 0; i < n; i++)
{
scanf("%d", &(adj[i]));
adj[i]--;
}
// maxN stores the maximum power of two for which we need to locate the next reachable planet, based on the constraints.
// A value of 32 means that we'll only ever need to go at most 2^31 planets away from the planet in query.
int maxN = 32;
// This array consists of the next planet we can reach from any planet.
// Specifically, par[i][j] is the planet we get to, on passing through 2^j teleporters starting from planet i.
vector<vector<int>> par(n, vector<int>(maxN, -1));
for(int i = 0; i < n; i++)
{
par[i][0] = adj[i];
}
for(int i = 1; i < maxN; i++)
{
for(int j = 0; j < n; j++)
{
ll p1 = par[j][i-1];
par[j][i] = par[p1][i-1];
}
}
// This task is done for each query.
for(int i = 0; i < q; i++)
{
// x is the initial planet, corrected for 0 - based indexing.
inp(x);
x--;
// k is the number of teleporters to traverse.
inp(k);
// cur is the planet we currently are at.
int cur = x;
// For every i'th bit in k that is 1, the current planet is moved to the planet we reach to by moving through 2^i teleporters from cur.
for(int i = 0; (1 << i) <= k ; i++)
{
if(k & (1 << i))
{
cur = par[cur][i];
}
}
// Once the full binary expansion of k is used up, we are at cur, so (cur + 1) is the result because of the judge's 1 - based indexing.
cout<<(cur + 1)<<endl;
}
}
The code gives the correct output in every test case, but undergoes TLE in the final one (the result there is correct too, just a TLE occurs). By my observation, the complexity of the code is O(32 * q + n), which doesn't seem to exceed the 10^6 bound for linear-time code in 1 second.
Are there any hidden costs in the algorithm I may have missed, or some possible optimization?
Any help appreciated!
It looks to me like your code works (after fixing the scanf), but your par map could have 6.4M entries in it, and precalculating all of those might just get you over the 1s time limit.
Here are a few things to try, in order of complexity:
replace par with a single vector<int> and index it like par[i*32+j]. This will remove a lot of double indirections.
Buffer the output in a std::string and write it in one step at the end, in case there's some buffer flushing going on that you don't know about. I don't think so, but it's easy to try.
Starting at each planet, you enter a cycle in <= n steps. In O(n) time, you can precalculate the distance to the terminal cycle and the size of the terminal cycle for all planets. Using this information you can reduce each k to at most 200,000, and that means you only need j <= 17.
Chef has N axis-parallel rectangles in a 2D Cartesian coordinate system. These rectangles may intersect, but it is guaranteed that all their 4N vertices are pairwise distinct.
Unfortunately, Chef lost one vertex, and up until now, none of his fixes have worked (although putting an image of a point on a milk carton might not have been the greatest idea after all…). Therefore, he gave you the task of finding it! You are given the remaining 4N−1 points and you should find the missing one.
Input
The first line of the input contains a single integer T denoting the number of test cases. The description of T test cases follows.
The first line of each test case contains a single integer N.
Then, 4N−1 lines follow. Each of these lines contains two space-separated integers x and y denoting a vertex (x,y) of some rectangle.
Output
For each test case, print a single line containing two space-separated integers X and Y ― the coordinates of the missing point. It can be proved that the missing point can be determined uniquely.
Constraints
T≤100
1≤N≤2⋅105
|x|,|y|≤109
the sum of N over all test cases does not exceed 2⋅105
Example Input
1
2
1 1
1 2
4 6
2 1
9 6
9 3
4 3
Example Output
2 2
Problem link: https://www.codechef.com/problems/PTMSSNG
my approach: I have created frequency arrays for the x and y coordinates and then looked for the coordinate values that occur an odd number of times.
#include <iostream>
using namespace std;
int main() {
// your code goes here
int t;
cin>>t;
while(t--)
{
long int n;
cin>>n;
long long int a[4*n-1][2];
long long int xm,ym,x,y;
for(int i=0;i<4*n-1;i++)
{
cin>>a[i][0]>>a[i][1];
if(i==0)
{
xm=abs(a[i][0]);
ym=abs(a[i][1]);
}
if(i>0)
{
if(abs(a[i][0])>xm)
{
xm=abs(a[i][0]);
}
if(abs(a[i][1])>ym)
{
ym=abs(a[i][1]);
}
}
}
long long int frqx[xm+1],frqy[ym+1];
for(long long int i=0;i<xm+1;i++)
{
frqx[i]=0;
}
for(long long int j=0;j<ym+1;j++)
{
frqy[j]=0;
}
for(long long int i=0;i<4*n-1;i++)
{
frqx[a[i][0]]+=1;
frqy[a[i][1]]+=1;
}
for(long long int i=0;i<xm+1;i++)
{
if(frqx[i]>0 && frqx[i]%2>0)
{
x=i;
break;
}
}
for(long long int j=0;j<ym+1;j++)
{
if(frqy[j]>0 && frqy[j]%2>0)
{
y=j;
break;
}
}
cout<<x<<" "<<y<<"\n";
}
return 0;
}
My code is showing TLE for inputs <10^6
First of all, your solution is not handling negative x/y correctly. long long int frqx[xm+1],frqy[ym+1] allocates barely enough memory to hold the positive values, but none at all for the negative ones.
It doesn't even matter though, as with the guarantee that abs(x) <= 10^9, you can just statically allocate a vector of 2*10^9 elements, and map both positive and negative coordinates in there.
Second, you are not supposed to buffer the input in a. Not only is this going to overflow the stack, it is also entirely unnecessary. Write to the frequency buckets right away, don't buffer.
Same goes for most of these challenges. Don't buffer, always try to process the input directly.
About your buckets: you don't need a long long int. A bool per bucket is enough. You don't care in the least how many coordinates were sorted into a bucket, only whether that number so far was even or odd. What you implemented as a separate loop can be replaced by simply toggling a flag while processing the input.
I find the answer of #Ext3h adequate with respect to the errors.
Given that you already hit upon the odd/even property of the problem, the solution can be done in a more straightforward way.
You need to find the x and y that appear an odd number of times.
In java
int[] missingPoint(int[][] a) {
//int n = (a.length + 1) / 4;
int[] pt = new int[2]; // In C initialize with 0.
for (int i = 0; i < a.length; ++i) {
for (int j = 0; j < 2; ++j) {
pt[j] ^= a[i][j];
}
}
return pt;
}
This uses exclusive-or ^, which is associative and commutative, with 0^x = x and x^x = 0. (5^7^4^7^5 = 4.)
For these "find the odd one out" problems one can use this xor-ing.
In effect you do not need to keep the input in an array.
I searched and looked at the links below, but they didn't help.
Point covering problem
Segments poked (covered) with points - any tricky test cases?
Need effective greedy for covering a line segment
Problem Description:
You are given a set of segments on a line and your goal is to mark as
few points on a line as possible so that each segment contains at least
one marked point
Task.
Given a set of n segments {[a0,b0],[a1,b1],...,[an-1,bn-1]} with integer
coordinates on a line, find the minimum number m of points such that
each segment contains at least one point. That is, find a set of
integers X of minimum size such that for any segment [ai,bi] there
is a point x belonging to X with ai <= x <= bi.
Output Description:
Output the minimum number m of points on the first line and the integer
coordinates of m points (separated by spaces) on the second line
Sample Input - I
3
1 3
2 5
3 6
Output - I
1
3
Sample Input - II
4
4 7
1 3
2 5
5 6
Output - II
2
3 6
I didn't understand the question itself. I need an explanation of how to solve the above problem, but I don't want the code. Examples would be greatly helpful.
Maybe this formulation of the problem will be easier to understand. You have n people who can each tolerate a different range of temperatures [ai, bi]. You want to find the minimum number of rooms to make them all happy, i.e. you can set each room to a certain temperature so that each person can find a room within his/her temperature range.
As for how to solve the problem, you said you didn't want code, so I'll just roughly describe an approach. Think about the coldest room you have. If making it one degree warmer won't cause anyone to no longer be able to tolerate that room, you might as well make the increase, since that can only allow more people to use that room. So the first temperature you should set is the warmest one that the most cold-loving person can still tolerate. In other words, it should be the smallest of the bi. Now this room will satisfy some subset of your people, so you can remove them from consideration. Then repeat the process on the remaining people.
Now, to implement this efficiently, you might not want to literally do what I said above. I suggest sorting the people according to bi first, and for the ith person, try to use an existing room to satisfy them. If you can't, try to create a new one with the highest temperature possible to satisfy them, which is bi.
Yes the description is pretty vague and the only meaning that makes sense to me is this:
You got some line
Segment on a line is defined by l,r
Where one parameter is the distance from the start of the line and the second is the segment's length. Which is which is hard to tell, as the letters are not very usual for such a description. My bet is:
l length of segment
r distance of (start?) of segment from start of line
You want to find min set of points
So that each segment has at least one point in it. That means for 2 overlapping segments you need just one point ...
Surely there are more options for how to solve this; the obvious one is generate & test with some heuristics, like generating combinations only for segments that are overlapped more than once. So I would attack this task in this manner (using the assumed terminology from #2):
sort segments by r
add number of overlaps to your segment set data
so the segment will be { r,l,n } and set the n=0 for all segments for now.
scan segments for overlaps
something like
for (i=0;i<segments;i++) // loop all segments
for (j=i+1;j<segments;j++) // loop all latter segments until they are still overlapped
if ( segment[i] and segment [j] are overlapped )
{
segment[i].n++; // update overlap counters
segment[j].n++;
}
else break;
Now if the r-sorted segments are overlapped then
segment[i].r <=segment[j].r
segment[i].r+segment[i].l>=segment[j].r
scan segments handling non overlapped segments
for each segment such that segment[i].n==0, add its middle point (defined by its distance from the start of the line) to the solution point list.
points.add(segment[i].r+0.5*segment[i].l);
And after that remove segment from the list (or tag it as used or what ever you do for speed boost...).
scan segments that are overlapped just once
So if segment[i].n==1 then you need to determine whether it overlaps with i-1 or i+1. Add the midpoint of the overlap to the solution points and remove segment i from the list. Then decrement the n of the overlapped segment (i+1 or i-1) and if it reaches zero remove it too.
points.add(0.5*( segment[j].r + min(segment[i].r+segment[i].l , segment[j].r+segment[j].l )));
Loop this whole scanning until there is no new point added to the solution.
now you got only multiple overlaps left
From this point I will be a bit vague for 2 reasons:
I do not have this tested and I do not have any test data to validate it, not to mention I am lazy.
This smells like assignment so there is some work/fun left for you.
From the start I would scan all segments and remove all of those which contain any point from the solution. This step you should perform after any change to the solution.
Now you can experiment with generating combination of points for each overlapped group of segments and remember the minimal number of points covering all segments in group. (simply by brute force).
There are more heuristics possible, like handling all twice-overlapped segments (in a similar manner to the single overlaps), but in the end you will have to brute force the rest of the data ...
[edit1] as you added new info
The r,l means the distance of the left and right ends from the start of the line. So if you want to convert between the other formulation { r',l' } and (l<=r) then
l=r'
r=r'+l'
and back
r'=l
l'=r-l
Sorry too lazy to rewrite the whole thing ...
Here is a working solution in C. Please refer to it only partially and try to fix your own code before reading the whole thing. Happy coding :) Spoiler alert
#include <stdio.h>
#include <stdlib.h>
int cmp_func(const void *ptr_a, const void *ptr_b)
{
const long *a = *(const long **)ptr_a;
const long *b = *(const long **)ptr_b;
if (a[1] == b[1])
return a[0] - b[0];
return a[1] - b[1];
}
int main()
{
int i, j, n, num_val;
long **arr;
scanf("%d", &n);
long values[n];
arr = malloc(n * sizeof(long *));
for (i = 0; i < n; ++i) {
*(arr + i) = malloc(2 * sizeof(long));
scanf("%ld %ld", &arr[i][0], &arr[i][1]);
}
qsort(arr, n, sizeof(long *), cmp_func);
i = j = 0;
num_val = 0;
while (i < n) {
int skip = 0;
values[num_val] = arr[i][1];
for (j = i + 1; j < n; ++j) {
int condition;
condition = arr[i][1] <= arr[j][1] ? arr[j][0] <= arr[i][1] : 0;
if (condition) {
skip++;
} else {
break;
}
}
num_val++;
i += skip + 1;
}
printf("%d\n", num_val);
for (int k = 0; k < num_val; ++k) {
printf("%ld ", values[k]);
}
for (i = 0; i < n; ++i)
free(arr[i]);
free(arr);
return 0;
}
Here's the working code in C++ for anyone searching :)
#include <bits/stdc++.h>
#define ll long long
#define double long double
#define vi vector<int>
#define endl "\n"
#define ff first
#define ss second
#define pb push_back
#define all(x) (x).begin(),(x).end()
#define mp make_pair
using namespace std;
bool cmp(const pair<ll,ll> &a, const pair<ll,ll> &b)
{
return (a.second < b.second);
}
vector<ll> MinSig(vector<pair<ll,ll>>&vec)
{
vector<ll> points;
for(int x=0;x<(int)vec.size();)
{
bool found=false;
points.pb(vec[x].ss);
for(int y=x+1;y<vec.size();y++)
{
if(vec[y].ff>vec[x].ss)
{
x=y;
found=true;
break;
}
}
if(!found)
break;
}
return points;
}
int main()
{
ios_base::sync_with_stdio(false);
cin.tie(NULL);
int n;
cin>>n;
vector<pair<ll,ll>>v;
for(int x=0;x<n;x++)
{
ll temp1,temp2;
cin>>temp1>>temp2;
v.pb(mp(temp1,temp2));
}
sort(v.begin(),v.end(),cmp);
vector<ll>res=MinSig(v);
cout<<res.size()<<endl;
for(auto it:res)
cout<<it<<" ";
}
I have recently coded a parallel SVD decomposition routine, based on a "one sided Jacobi rotations" algorithm. The code works correctly but is tremendously slow.
In fact it should exploit the parallelism in the inner for loop for(int g=0;g<n;g++), but on commenting out the #pragma omp parallel for directive I can appreciate just a very slight decrease in performance. In other words, there is no appreciable speed-up from going parallel (the code does run in parallel with 4 threads).
Note 1: almost all the work is concentrated in the three following loops involving the matrices A and V, which are relatively large.
for(h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
and
double Ahi,Vhi;
for(h=0;h<N;h++)//...rotate Ai & Aj (only columns i & j are changed)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j are updated)
for(h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
All the parallelism should be exploited there but is not. And I can't understand why.
Note 2: The same happens both on Windows (cygWin compiler) and Linux (GCC) platforms.
Note 3: matrices are represented by column major arrays
So I'm looking for some help in finding out why the parallelism is not exploited. Did I miss something? Is there some hidden overhead in the parallel for that I cannot see?
Thank you very much for any suggestion
int sweep(double* A,double*V,int N,double tol)
{
static int*I=new int[(int)ceil(0.5*(N-1))];
static int*J=new int[(int)ceil(0.5*(N-1))];
int ntol=0;
for(int r=0;r<N;r++) //fill in i,j indexes of parallel rotations in vectors I & J
{
int k=r+1;
if (k==N)
{
for(int i=2;i<=(int)ceil(0.5*N);i++){
I[i-2]=i-1;
J[i-2]=N+2-i-1;
}
}
else
{
for(int i=1;i<=(int)ceil(0.5*(N-k));i++)I[i-1]=i-1;
for(int i=1;i<=(int)ceil(0.5*(N-k));i++)J[i-1]=N-k+2-i-1;
if(k>2)
{
int j=(int)ceil(0.5*(N-k));
for(int i=N-k+2;i<=N-(int)floor(0.5*k);i++){
I[j]=i-1;
J[j]=2*N-k+2-i-1;
j++;
}
}
}
int n=(k%2==0)?(int)floor(0.5*(N-1)):(int)floor(0.5*N);
#pragma omp parallel for schedule(dynamic,5) reduction(+:ntol) default(none) shared(std::cout,I,J,A,V,N,n,tol)
for(int g=0;g<n;g++)
{
int i=I[g];
int j=J[g];
double p=0;
double qi=0;
double qj=0;
double cs,sn,q,c;
int h;
for(h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
q=qi-qj;
if(p*p/(qi*qj)<tol) ntol++; //if Ai & Aj are orthogonal enough...
else //if Ai & Aj are not orthogonal enough then... rotate them
{
c=sqrt(4*p*p+q*q);
if(q>=0){
cs=sqrt((c+q)/(2*c));
sn=p/(c*cs);
}
else{
sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
cs=p/(c*sn);
}
//...rotate Ai & Aj (only columns i & j are changed)
double Ahi,Vhi;
for(h=0;h<N;h++)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j are updated)
for(h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
}
}
}
if(2*ntol==(N*(N-1)))return(1);//if each columns of A is orthogonal enough to each other stop sweep
return(0);
}
Thanks to Z boson's remarks I managed to write a far better performing parallel SVD decomposition. It runs many times faster than the original one, and I guess it could still be improved by the use of SIMD instructions.
I post the code in case anyone should find it of use. All the tests I performed gave correct results, but in any case there is no warranty for its use.
I'm really sorry not to have had the time to properly comment the code for better comprehensibility.
int sweep(double* A,double*V,int N,double tol,int M, int n)
{
/********************************************************************************************
This routine performs a parallel "sweep" of the SVD factorization algorithm for the matrix A.
It implements a single sided Jacobi rotation algorithm, described by Michael W. Berry,
Dani Mezher, Bernard Philippe, and Ahmed Sameh in "Parallel Algorithms for the Singular Value
Decomposition".
At each sweep the A matrix becomes a little more orthogonal, until each column of A is orthogonal
to each other within a give tolerance. At this point the sweep() routine returns 1 and convergence
is attained.
Arguments:
A : on input the square matrix to be orthogonalized, on exit a more "orthogonal" matrix
V : on input the accumulated rotation matrix, on exit it is updated with the current rotations
N : dimension of the matrices.
tol : tolerance for convergence of orthogonalization. See the explanations for the SVD() routine
M : number of blocks (it is calculated from the given block size n)
n : block size (number of columns on each block). It should be an optimal value according to the hardware available
returns : 1 if in the last sweep convergence is attained, 0 if not and one more sweep is necessary
Author : Renato Talucci 2015
*************************************************************************************************/
int ntol=0;
//STEP 1 : INTERNAL BLOCK ORTHOGONALISATION
#pragma omp parallel for reduction(+:ntol) shared(A,V,M,n,tol,N) default(none)
for(int a=0;a<M;a++)
{
for(int i=a*n;i<a*n+imin(n,N-a*n)-1;i++)
{
for(int j=i+1;j<a*n+imin(n,N-a*n);j++)
{
double p=0;
double qi=0;
double qj=0;
double cs,sn,q,c;
for(int h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
q=qi-qj;
if((p*p/(qi*qj)<tol)||(qi<tol)||(qj<tol))ntol++; //if Ai & Aj are orthogonal enough...
else //if Ai & Aj are not orthogonal enough then... rotate them
{
c=sqrt(4*p*p+q*q);
if(q>=0){
cs=sqrt((c+q)/(2*c));
sn=p/(c*cs);
}
else{
sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
cs=p/(c*sn);
}
//...rotate Ai & Aj
double Ahi,Vhi;
for(int h=0;h<N;h++)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j are updated)
for(int h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
}
}
}
}
//STEP 2 : PARALLEL BLOCK MUTUAL ORTHOGONALISATION
static int*I=new int[(int)ceil(0.5*(M-1))];
static int*J=new int[(int)ceil(0.5*(M-1))];
for(int h=0;h<M;h++)
{
//fill in i,j indexes of blocks to be mutually orthogonalized at each turn
int k=h+1;
if (k==M)
{
for(int i=2;i<=(int)ceil(0.5*M);i++){
I[i-2]=i-1;
J[i-2]=M+2-i-1;
}
}
else
{
for(int i=1;i<=(int)ceil(0.5*(M-k));i++)I[i-1]=i-1;
for(int i=1;i<=(int)ceil(0.5*(M-k));i++)J[i-1]=M-k+2-i-1;
if(k>2)
{
int j=(int)ceil(0.5*(M-k));
for(int i=M-k+2;i<=M-(int)floor(0.5*k);i++){
I[j]=i-1;
J[j]=2*M-k+2-i-1;
j++;
}
}
}
int ng=(k%2==0)?(int)floor(0.5*(M-1)):(int)floor(0.5*M);
#pragma omp parallel for schedule(static,5) shared(A,V,I,J,n,tol,N,ng) reduction(+:ntol) default(none)
for(int g=0;g<ng;g++)
{
int block_i=I[g];
int block_j=J[g];
for(int i=block_i*n;i<block_i*n+imin(n,N-block_i*n);i++)
{
for(int j=block_j*n;j<block_j*n+imin(n,N-block_j*n);j++)
{
double p=0;
double qi=0;
double qj=0;
double cs,sn,q,c;
int h;
for(h=0;h<N;h++)
{
p+=A[h+N*i]*A[h+N*j];//columns dot product:Ai * Aj
qi+=A[h+N*i]*A[h+N*i];// ||Ai||^2
qj+=A[h+N*j]*A[h+N*j];// ||Aj||^2
}
q=qi-qj;
if((p*p/(qi*qj)<tol)||(qi<tol)||(qj<tol))ntol++; //if Ai & Aj are orthogonal enough...
else //if Ai & Aj are not orthogonal enough then... rotate them
{
c=sqrt(4*p*p+q*q);
if(q>=0){
cs=sqrt((c+q)/(2*c));
sn=p/(c*cs);
}
else{
sn=(p>=0)?sqrt((c-q)/2/c):-sqrt((c-q)/2/c);
cs=p/(c*sn);
}
//...rotate Ai & Aj
double Ahi,Vhi;
for(h=0;h<N;h++)
{
Ahi=A[h+N*i];
A[h+N*i]=cs*A[h+N*i]+sn*A[h+N*j];
A[h+N*j]=-sn*Ahi+cs*A[h+N*j];
}
//store & update rotation matrix V (only columns i & j are updated)
for(h=0;h<N;h++)
{
Vhi=V[h+N*i];
V[h+N*i]=cs*V[h+N*i]+sn*V[h+N*j];
V[h+N*j]=-sn*Vhi+cs*V[h+N*j];
}
}
}
}
}
}
if(2*ntol==(N*(N-1)))return(1);//if each columns of A is orthogonal enough to each other stop sweep
return(0);
}
int SVD(double* A,double* U,double*V,int N,double tol,double* sigma)
{
/********************************************************************************************
This routine calculates the SVD decomposition of the square matrix A [NxN]
A = U * S * V'
Arguments :
A : on input NxN square matrix to be factorized, on exit contains the
rotated matrix A*V=U*S.
V : on input an identity NxN matrix, on exit is the right orthogonal matrix
of the decomposition A = U*S*V'
U : NxN matrix, on exit is the left orthogonal matrix of the decomposition A = U*S*V'.
sigma : array of dimension N. On exit contains the singular values, i.e. the diagonal
elements of the matrix S.
N : Dimension of the A matrix.
tol : Tolerance for the convergence of the orthogonalisation of A. Said Ai and Aj any two
columns of A, the convergence is attained when Ai*Aj / ( |Ai|*|Aj| ) < tol for each i,j=0,..,N-1 (i!=j)
The software returns the number of sweeps needed for convergence.
NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING I.E. M(i,j)=M[i+N*j]
Author: Renato Talucci 2015
*************************************************************************************************/
int n=24;//this is the dimension of block submatrices, you shall enter an optimal value for your hardware
int M=N/n+int(((N%n)!=0)?1:0);
int swp=0;//sweeps counter
int converged=0;
while(converged==0) {
converged=sweep(A,V,N,tol,M,n);
swp++;
}
#pragma omp parallel for default(none) shared(sigma,A,U,N)
for(int i=0;i<N;i++)
{
double si=0;
for(int j=0;j<N;j++) si+=A[j+N*i]*A[j+N*i];
si=sqrt(si);
for(int k=0;k<N;k++) U[k+N*i]=A[k+N*i]/si;
sigma[i]=si;
}
return(swp);
}
Note : if some user prefers to leave A unchanged on exit, it is sufficient to calculate U=A*V before entering the while loop, and instead of passing the A matrix to the sweep() routine, pass the obtained U matrix. The U matrix shall be orthonormalized instead of A after convergence of sweep(), and the #pragma omp directive must include U in the shared variables instead of A.
Note 2:if you have (as I have) to factorize a sequence of A(k) matrices each of which A(k) can be considered a perturbation of the previous A(k-1), as a jacobian matrix can be considered in a multistep Newton solver, the driver can be easily modified to update from A(k-1) to A(k) instead of calculating A(k) from begin. Here is the code:
int updateSVD(double* DA,double* U,double*V,int N,double tol,double* sigma)
{
/********************************************************************************************
Given a previously factorization
A(k-1) = U(k-1) * S(k-1) * V(k-1)'
and given a perturbation DA(k) of A(k-1), i.e.
A(k) = A(k-1) + DA(k)
this routine calculates the SVD factorization of A(k), starting from the factorization of A(k-1)
Arguments:
DA : on input NxN perturbation matrix, unchanged on exit
U : on input NxN orthonormal left matrix of the previous (k-1) factorization, on exit
orthonormal right matrix of the current factorization
V : on input NxN orthonormal right matrix of the previous (k-1) factorization, on exit
orthonormal right matrix of the current factorization
N : dimension of the matrices
tol : Tolerance for the convergence of the orthogonalisation of A. Said Ai and Aj any two
columns of A, the convergence is attained when Ai*Aj / ( |Ai|*|Aj| ) < tol for each i,j=0,..,N-1 (i!=j)
sigma : on input, array with the N singular values of the previous factorization, on exit
array with the N singular values of the current factorization
NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING I.E. M(i,j)=M[i+N*j]
Author: Renato Talucci 2015
*************************************************************************************************/
for(int i=0;i<N;i++) for(int j=0;j<N;j++) U[i+N*j]*=sigma[j];
int n=24; //this is the dimension of block submatrices, you shall enter an optimal value for your hardware
matmat_col_col(DA,V,U,N,n); //U =U(k-1)*S(k-1) + DA(k)*V(k-1) = A(k)*V(k-1)
int M=N/n+int(((N%n)!=0)?1:0); //number of blocks
int swp=0;//sweeps counter
int converged=0;
while(converged==0) {
converged=sweep(U,V,N,tol,M,n);
swp++;
}
#pragma omp parallel for default(none) shared(sigma,U,N)
for(int i=0;i<N;i++)
{
double si=0;
for(int j=0;j<N;j++) si+=U[j+N*i]*U[j+N*i];
si=sqrt(si);
for(int k=0;k<N;k++) U[k+N*i]=U[k+N*i]/si;
sigma[i]=si;
}
return(swp);
}
Finally, the routine matmat_col_col(DA,V,U,N,n) is a parallel block matrix product. Here is the code:
inline int imin(int a,int b) {return((a<=b)?a:b);}
void matmat_col_col(double* A,double* B,double*C,int N,int n)
/********************************************************************************************
square matrix block product NxN :
C = C + A * B
n is the optimal block dimension
N is the dimension of the matrices
NOTE : ALL MATRICES ARE ASSUMED TO HAVE COLUMN MAJOR ORDERING M(i,j) = M[i+N*j]
Author: Renato Talucci 2015
*************************************************************************************************/
{
int M=N/n+int(((N%n)!=0)?1:0);
#pragma omp parallel for shared(M,A,B,C)
for(int a=0;a<M;a++)
{
for(int b=0;b<M;b++)
{
for(int c=0;c<M;c++)
{
for(int i=a*n;i<imin((a+1)*n,N);i++)
{
for(int j=b*n;j<imin((b+1)*n,N);j++)
{
for(int k=c*n;k<imin((c+1)*n,N);k++)
{
C[i+N*j]+=A[i+N*k]*B[k+N*j];
}
}
}
}
}
}
return;
}
I hope no typos have been created from so much copy&paste.
It would be nice if anyone could improve the code.
The FLOPS for Singular value decomposition (SVD) should go as O(N^3) whereas the reads as O(N^2). This means it may be possible to parallelize the algorithm and have it scale well with the number of cores.
However, your implementation of SVD is memory bandwidth bound, which means it can't scale well with the number of cores on a single socket system. Your three nested loops currently each run over the entire range of N, which causes much of the data to be evicted from the cache before the next time it needs to be reused.
In order to be compute bound you're going to have to change your algorithm so that, instead of operating on long strips/dot products of size N, it uses loop tiling to maximize the n^3 (where n is the size of the block) calculations in the cache of size n^2.
Here is a paper on parallel SVD computation, from a google search, which says in the abstract:
the key point of our proposed block JRS algorithm is reusing the loaded data into cache memory by performing computations on matrix blocks (b rows) instead of on strips of vectors as in JRS iteration algorithms.
I used loop tiling/blocks to achieve good scaling with the number of cores for cholesky decomposition.
Note that using tiles/blocks will improve the performance of your single threaded code as well.
Below is the solution I am trying to implement
/**
* Definition for a point.
* class Point {
* int x;
* int y;
* Point() { x = 0; y = 0; }
* Point(int a, int b) { x = a; y = b; }
* }
*/
public class Solution {
public int maxPoints(Point[] points) {
int max=0;
if(points.length==1)
return 1;
for(int i=0;i<points.length;i++){
for(int j=0;j<points.length;j++){
if((points[i].x!=points[j].x)||(points[i].y!=points[j].y)){
int coll=get_collinear(points[i].x,points[i].y,points[j].x,points[j].y,points);
if(coll>max)
max=coll;
}
else{
**Case where I am suffering**
}
}
}
return max;
}
public int get_collinear(int x1,int y1,int x2, int y2,Point[] points)
{
int c=0;
for(int i=0;i<points.length;i++){
int k1=x1-points[i].x;
int l1=y1-points[i].y;
int k2=x2-points[i].x;
int l2=y2-points[i].y;
if((k1*l2-k2*l1)==0)
c++;
}
return c;
}
}
It runs in O(n^3). What I am basically doing is running two loops comparing the various points in the 2D plane. Taking 2 points, I send them to the get_collinear method, which tests the line formed by those 2 points against all the elements of the array to check whether the 3 points are collinear. I know this is a brute force method. However, for an input like [(0,0),(0,0)] my result fails; the else branch is where I have to add a condition to handle such cases. Can someone help me with that? And does there exist a better solution to this problem with a better run time? I can't think of any.
BTW the complexity is indeed O(n^3); to lower it you need to:
sort the points somehow
by x and/or y in ascending or descending order. Also the use of polar coordinates can sometimes help
use divide et impera (divide and conquer) algorithms
for planar geometry algorithms it is usually a good idea to divide the area into quadrants and sub-quadrants, but such algorithms are hard to code on vector graphics
Also there is one other speedup possibility:
check against all possible directions (limited in number, for example to 360 angles only), which leads to O(n^2). Then compute the results, which is still O(m^3), where m is the subset of points per tested direction.
Ok here is something basic I coded in C++ for example:
void points_on_line()
{
const int dirs =360; // num of directions (accuracy)
double mdir=double(dirs)/M_PI; // conversion from angle to code
double pacc=0.01; // position acc <0,1>
double lmin=0.05; // min line size acc <0,1>
double lmax=0.25; // max line size acc <0,1>
double pacc2,lmin2,lmax2;
int n,ia,ib;
double x0,x1,y0,y1;
struct _lin
{
int dir; // dir code <0,dirs>
double ang; // dir [rad] <0,M_PI>
double dx,dy; // dir unit vector
int i0,i1; // index of points
} *lin;
glview2D::_pnt *a,*b;
glview2D::_lin q;
_lin l;
// prepare buffers
n=view.pnt.num; // n=number of points
n=((n*n)-n)>>1; // n=max number of lines
lin=new _lin[n]; n=0;
if (lin==NULL) return;
// precompute size of area and update accuracy constants ~O(N)
for (a=view.pnt.dat,ia=0;ia<view.pnt.num;ia++,a++)
{
if (!ia)
{
x0=a->p[0]; y0=a->p[1];
x1=a->p[0]; y1=a->p[1];
}
if (x0>a->p[0]) x0=a->p[0];
if (x1<a->p[0]) x1=a->p[0];
if (y0>a->p[1]) y0=a->p[1];
if (y1<a->p[1]) y1=a->p[1];
}
x1-=x0; y1-=y0; if (x1>y1) x1=y1;
pacc*=x1; pacc2=pacc*pacc;
lmin*=x1; lmin2=lmin*lmin;
lmax*=x1; lmax2=lmax*lmax;
// precompute lines ~O((N^2)/2)
for (a=view.pnt.dat,ia=0;ia<view.pnt.num;ia++,a++)
for (b=a+1,ib=ia+1;ib<view.pnt.num;ib++,b++)
{
l.i0=ia;
l.i1=ib;
x0=b->p[0]-a->p[0];
y0=b->p[1]-a->p[1];
x1=(x0*x0)+(y0*y0);
if (x1<=lmin2) continue; // ignore too small lines
if (x1>=lmax2) continue; // ignore too big lines
l.ang=atanxy(x0,y0);
if (l.ang>M_PI) l.ang-=M_PI; // 180 deg is enough lines goes both ways ...
l.dx=cos(l.ang);
l.dy=sin(l.ang);
l.dir=int(l.ang*mdir);
lin[n]=l; n++;
// q.p0=*a; q.p1=*b; view.lin.add(q); // just visualise used lines for testing
}
// test directions
int cnt,cntmax=0;
double t;
for (ia=0;ia<n;ia++)
{
cnt=1;
for (ib=ia+1;ib<n;ib++)
if (lin[ia].dir==lin[ib].dir)
{
a=&view.pnt[lin[ia].i0];
if (lin[ia].i0!=lin[ib].i0)
b=&view.pnt[lin[ib].i0];
else b=&view.pnt[lin[ib].i1];
x0=b->p[0]-a->p[0]; x0*=x0;
y0=b->p[1]-a->p[1]; y0*=y0;
t=sqrt(x0+y0);
x0=a->p[0]+(t*lin[ia].dx)-b->p[0]; x0*=x0;
y0=a->p[1]+(t*lin[ia].dy)-b->p[1]; y0*=y0;
t=x0+y0;
if (fabs(t)<=pacc2) cnt++;
}
if (cntmax<cnt) // if more points on single line found
{
cntmax=cnt; // update point count
q.p0=view.pnt[lin[ia].i0]; // copy start/end point
q.p1=q.p0;
q.p0.p[0]-=500.0*lin[ia].dx; // and set result line as very big (infinite) line
q.p0.p[1]-=500.0*lin[ia].dy;
q.p1.p[0]+=500.0*lin[ia].dx;
q.p1.p[1]+=500.0*lin[ia].dy;
}
}
if (cntmax) view.lin.add(q);
view.redraw=true;
delete[] lin;
// Caption=n; // just to see how many lines actually survive the filtering
}
It is from my geometry engine so here is some stuff explained:
glview2D::_pnt
view.pnt[] are input 2D points (I feed random points around random line + random noise points)
view.pnt.num is number of points
glview2D::_lin
view.lin[] are output lines (just one line is used)
accuracy
Play with pacc,lmin,lmax constants to change the behavior and computation speed. Change dirs to change direction accuracy and computation speed
A complexity estimate is not possible due to the big dependency on the input data.
But for my random test points the runtimes are like this:
[ 0.056 ms]Genere 100 random 2D points
[ 151.839 ms]Compute 100 points on line1 (unoptimized brute force O(N^3))
[ 4.385 ms]Compute 100 points on line2 (optimized direction check)
[ 0.096 ms] Genere 200 random 2D points
[1041.676 ms] Compute 200 points on line1
[ 39.561 ms] Compute 200 points on line2
[ 0.440 ms] Genere 1000 random 2D points
[29061.54 ms] Compute 1000 points on line2