Merge Sort for 10 million inputs [closed] - performance

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 8 years ago.
This is my code in C++ (using C++11); <chrono> is used to measure time in microseconds. My merge sort takes about 24 seconds to sort a randomly generated array of 10 million numbers, but my friends' results are around 3 seconds. My code seems correct, and the only difference between mine and theirs is that they used the clock in <ctime> to measure time instead of <chrono>. Could that cause the deviation in my result? Please answer!
This is my code:
#include <iostream>
#include <climits>
#include <cstdlib>
#include <ctime>
#include <chrono>
using namespace std;

void merge_sort(long* inputArray, long low, long high);
void merge(long* inputArray, long low, long mid, long high);

int main() {
    srand(time(NULL));
    int n = 1000; // the timings quoted in the question were for n = 10 million
    long* inputArray = new long[n];
    for (long i = 0; i < n; i++) { // initialize the array with n random numbers
        inputArray[i] = rand();    // generate a random number
    }
    auto Start = std::chrono::high_resolution_clock::now(); // start time for merge_sort
    merge_sort(inputArray, 0, n - 1); // high is inclusive, so pass n - 1, not n
    auto End = std::chrono::high_resolution_clock::now();   // end time for merge_sort
    cout << endl << endl << "Time taken for Merge Sort = "
         << std::chrono::duration_cast<std::chrono::microseconds>(End - Start).count()
         << " microseconds"; // display the time taken for merge sort
    delete[] inputArray;
    return 0;
}

void merge_sort(long* inputArray, long low, long high) {
    if (low < high) {
        long mid = (low + high) / 2;
        merge_sort(inputArray, low, mid);
        merge_sort(inputArray, mid + 1, high);
        merge(inputArray, low, mid, high);
    }
}

void merge(long* inputArray, long low, long mid, long high) {
    long n1 = mid - low + 1;
    long n2 = high - mid;
    long* L = new long[n1 + 1];
    long* R = new long[n2 + 1];
    for (long i = 0; i < n1; i++) { // copy exactly n1 elements; i <= n1 read out of bounds
        L[i] = inputArray[low + i];
    }
    for (long j = 0; j < n2; j++) { // likewise, j <= n2 read past inputArray[high]
        R[j] = inputArray[mid + j + 1];
    }
    L[n1] = LONG_MAX; // sentinels
    R[n2] = LONG_MAX;
    long i = 0;
    long j = 0;
    for (long k = low; k <= high; k++) {
        if (L[i] <= R[j]) {
            inputArray[k] = L[i];
            i = i + 1;
        } else {
            inputArray[k] = R[j];
            j = j + 1;
        }
    }
    delete[] L;
    delete[] R;
}

There is no way the two time measurements themselves took 20 seconds.
As others pointed out, the results depend heavily on the platform, the compiler optimization level (a debug build can be much slower than a release build), and so on.
If you have the same settings as your friends and still see a performance gap, you may want to use a profiler to see where your code is spending its time. On Linux you can use a profiling tool; on Windows, Visual Studio's profiler is a good candidate.

Related

What is wrong with my selection sort?

My implementation of selection sort does not work whether the inner loop runs to j < n-2, n-1, or n. What am I doing wrong?
Is there an online IDE that lets us set a watch on the loop variables?
#include <stdio.h>
#define n 4
int main(void) {
    int a[n] = {4, 3, 2, 1};
    int j, min;
    for (int i = 0; i < n; i++) {
        min = i;
        for (j = i + 1; j < n - 3; j++)
            if (a[j] > a[j + 1])
                min = j + 1;
        if (min != i) {
            int t = a[min];
            a[min] = a[i];
            a[i] = a[t];
        }
    }
    for (int i = 0; i < n; i++)
        printf("%d", a[i]);
    return 0;
}
I tried it here
Your code indeed has a strange limit of n-3, but it also has some other flaws:
To find the minimum, you should compare with the current minimum (a[min]), not with the next/previous element in the array.
The swap code is not correct: the last assignment should not read from a[t], but from t itself.
Here is the corrected code:
#include <stdio.h>
#define n 4
int main(void) {
    int a[n] = {4, 3, 2, 1};
    int j, min;
    for (int i = 0; i < n; i++) {
        min = i;
        for (j = i + 1; j < n; j++)
            if (a[min] > a[j])
                min = j;
        if (min != i) {
            int t = a[min];
            a[min] = a[i];
            a[i] = t;
        }
    }
    for (int i = 0; i < n; i++)
        printf("%d", a[i]);
    return 0;
}
https://ideone.com/AGJDPS
NB: To see intermediate results in an online IDE, why not add printf calls inside the loop? Of course, for larger code projects you are better off with a locally installed IDE with full debugging features, stepping through the code.

Resolve 16-Queens Problem in 1 second only

I need to solve the 16-queens problem in 1 second.
I used a backtracking algorithm, shown below.
This code can solve the N-queens problem within 1 second when N is smaller than 13,
but it takes a long time when N is 13 or bigger.
How can I improve it?
#include <stdio.h>
#include <stdlib.h>

int n;
int arr[100] = {0,};
int solution_count = 0;

int check(int i)
{
    int k = 1, ret = 1;
    while (k < i && ret == 1) {
        if (arr[i] == arr[k] ||
            abs(arr[i] - arr[k]) == abs(i - k))
            ret = 0;
        k++;
    }
    return ret;
}

void backtrack(int i)
{
    if (check(i)) {
        if (i == n) {
            solution_count++;
        } else {
            for (int j = 1; j <= n; j++) {
                arr[i + 1] = j;
                backtrack(i + 1);
            }
        }
    }
}

int main()
{
    scanf("%d", &n);
    backtrack(0);
    printf("%d", solution_count);
}
Your algorithm is almost fine. A small change will probably give you enough time improvement to produce a solution much faster. In addition, there is a data structure change that should let you reduce the time even further.
First, tweak the algorithm a little: rather than waiting for the check all the way till you place all N queens, check early: every time you are about to place a new queen, check if another queen is occupying the same column or the same diagonal before making the arr[i+1] = j; assignment. This will save you a lot of CPU cycles.
Now you need to speed up checking of the next queen. In order to do that you have to change your data structure so that you could do all your checks without any loops. Here is how to do it:
You have N rows
You have N columns
You have 2N-1 ascending diagonals
You have 2N-1 descending diagonals
Since no two queens can take the same spot in any of the four "dimensions" above, you need an array of boolean values for the last three things; the rows are guaranteed to be different, because the i parameter of backtrack, which represents the row, is guaranteed to be different.
With N up to 16, 2N-1 goes up to 31, so you can use uint32_t for your bit arrays. Now you can check if a column c is taken by applying bitwise and & to the columns bit mask and 1 << c. Same goes for the diagonal bit masks.
Note: Solving the 16-queens problem in under a second would be rather tricky. A very highly optimized program does it in 23 seconds on an 800 MHz PC. A 3.2 GHz machine should give you a speed-up of about 4 times, but that would still be about 8 seconds to get a solution.
I would change while (k < i && ret == 1) { to while (k < i) {, and instead of ret = 0; do return 0;.
(This saves a check every iteration. Your compiler may already do this, or apply some other performance trick, but it might help a bit.)

Am I using Dynamic Programming? Matrix chain multiplication in c

Hello, I just wrote code to perform matrix chain multiplication, which can be solved by dynamic programming:
http://en.wikipedia.org/wiki/Matrix_chain_multiplication#A_Dynamic_Programming_Algorithm
Here is the code I wrote, which I think is simpler than the one provided by Wikipedia, so I doubt whether I am really doing dynamic programming.
Also, I can't figure out the time complexity of my program. Can someone help me work it out?
Here's my guess:
the for loop runs n times for each call if mem is not used;
each iteration then expands into two recursive calls;
if mem is used, it prevents recalculation...
Ahh, I can't figure it out. Hope someone can help me :-)
#include <iostream>
#include <cstdlib>
#include <cstring>
#include <climits>
using namespace std;

int mem[10][10];
int row[10];
int col[10];
int m[10];
#define NUM 4

int DP(int c, int r){
    if (mem[c][r] != INT_MAX) return mem[c][r];
    if (c == r) return mem[c][r] = 0; // memoize the base case too
    int min_cost;
    for (int j = c; j < r; j++){
        min_cost = DP(c, j) + DP(j + 1, r) + m[c - 1] * m[j] * m[r];
        if (min_cost < mem[c][r])
            mem[c][r] = min_cost;
    }
    return mem[c][r];
}

int main(){
    for (int i = 0; i < 10; i++){
        for (int j = 0; j < 10; j++){
            mem[i][j] = INT_MAX;
        }
    }
    int n = NUM; // max 4 matrices
    int a;
    for (int i = 0; i < NUM + 1; i++){
        cin >> a;
        m[i] = a;
    }
    cout << "Lowest cost for matrix multiplication: " << DP(1, NUM);
}
The technique you have used is called memoization. Most of the time, you can solve DP problems using memoization with little (or no) overhead.
The complexity of your implementation is the same as the original DP solution: O(n^3). (Note: every cell of the mem array is computed only once, and each cell takes O(n) time to compute; any further request for a cell involves no loop, since it is a simple lookup.)
See also http://en.wikipedia.org/wiki/Memoization

rewriting a simple C++ Code snippet into CUDA Code

I have written the following simple C++ code.
#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    int myNumber = 0;
    int numOfHits = 0;

    cout << "Enter my Number Value" << endl;
    cin >> myNumber;

    #pragma omp parallel for reduction(+:numOfHits)
    for (int i = 0; i <= 100000; ++i)
    {
        for (int j = 0; j <= 100000; ++j)
        {
            for (int k = 0; k <= 100000; ++k)
            {
                if (i + j + k == myNumber)
                    numOfHits++;
            }
        }
    }

    cout << "Number of Hits: " << numOfHits << endl;
    return 0;
}
As you can see I use OpenMP to parallelize the outermost loop. What I would like to do is to rewrite this small code in CUDA. Any help will be much appreciated.
Well, I can give you a quick tutorial, but I won't necessarily write it all for you.
So first of all, you will want to get MS Visual Studio set up with CUDA, which is easy following this guide: http://www.ademiller.com/blogs/tech/2011/05/visual-studio-2010-and-cuda-easier-with-rc2/
Now you will want to read The NVIDIA CUDA Programming Guide (free pdf), documentation, and CUDA by Example (A book I highly recommend for learning CUDA).
But let's say you haven't done that yet, and definitely will later.
This is an extremely arithmetic heavy and data-light computation - actually it can be computed without this brute force method fairly simply, but that isn't the answer you are looking for. I suggest something like this for the kernel:
__global__ void kernel(int* myNumber, int* numOfHits){
    // a shared value lives on-chip, which is beneficial since it is written
    // many times; it is shared by all threads of the block
    __shared__ int s_hits;
    if (threadIdx.x == 0 && threadIdx.y == 0)
        s_hits = 0; // __shared__ variables cannot have initializers; one thread zeroes it
    __syncthreads();
    // this identifies the current thread uniquely
    int i0 = threadIdx.x + blockIdx.x * blockDim.x;
    int j0 = threadIdx.y + blockIdx.y * blockDim.y;
    // we stride i and j by the number of threads in one dimension of the block,
    // 16 usually, times the number of blocks in that dimension, which can be
    // quite large (but not 100,000); note j must restart for every i
    for (int i = i0; i < 100000; i += blockDim.x * gridDim.x){
        for (int j = j0; j < 100000; j += blockDim.y * gridDim.y){
            // thanks to talonmies for this simplification
            if (0 <= (*myNumber - i - j) && (*myNumber - i - j) < 100000){
                // you should actually use atomics (atomicAdd) here; otherwise the
                // value may change between the read, modify, and write steps
                s_hits++;
            }
        }
    }
    // synchronize threads, so we know s_hits is completely updated
    __syncthreads();
    // again, atomics: make sure only one thread per block adds s_hits to the total
    if (threadIdx.x == 0 && threadIdx.y == 0)
        *numOfHits += s_hits;
}
To launch the kernel, you will want something like this:
dim3 blocks(some_number, some_number, 1); // some_number should be hand-optimized
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(/*args*/);
I know you probably want a quick way to do this, but getting into CUDA isn't really a 'quick' thing. As in, you will need to do some reading and some setup to get it working; past that, the learning curve isn't too high. I haven't told you anything about memory allocation yet, so you will need to do that (although that is simple). If you followed my code, my goal is that you had to read up a bit on shared memory and CUDA, and so you are already kick-started. Good luck!
Disclaimer: I haven't tested my code, and I am not an expert - it could be idiotic.

Use boost::object_pool can not quit clearly. Am I misusing?

I use boost::object_pool in my program, but I found a problem: the program cannot exit.
Below is the code. Please don't suggest I use boost::pool; boost::pool has no problem, this question is specifically about boost::object_pool. Could anybody help me?
#include <iostream>
#include <boost/pool/object_pool.hpp>

int main(void) {
    boost::object_pool<int> p;
    int count = 1000 * 1000;
    int** pv = new int*[count];
    for (int i = 0; i < count; i++)
        pv[i] = p.construct();
    for (int i = 0; i < count; i++)
        p.destroy(pv[i]);
    delete [] pv;
    return 0;
}
This program cannot exit normally. Why?
On my machine, this program works correctly, if very slowly.
The loop calling destroy is very slow; it appears to have an O(N^2) component in it: for each 10x increase in the size of the loops, the run time increases by a factor of about 90.
Here are some timings:
1000 elements: 0.021 sec
10000 elements: 1.219 sec
100000 elements: 103.29 sec (1m43.29s)
1000000 elements: 13437 sec (223m57s)
Someone beat me to it; I just saw this question via the boost mailing list.
According to the docs, destroy is O(N), so calling it N times certainly isn't ideal for large N. However, I imagine that using the object_pool destructor itself (which calls the destructor for each allocated object), and which is O(N) in total, would help a lot with bulk deletions.
I did have a graph showing timings on my machine, but since I haven't used Stack Overflow much I can't post it. Ah well, it doesn't show that much...
I've published a fix, derived from a merge sort, in the boost sandbox:
https://github.com/graehl/boost/tree/object_pool-constant-time-free
or, standalone:
https://github.com/graehl/Pool-object_pool
