benchmarking bubble sort vs merge sort using JMH

I am benchmarking implementations of merge sort and bubble sort for a school project using JMH.
The results aren't what I expected: it seems to me that bubble sort is performing better.
So is there something wrong with my benchmarking or with my implementation of the algorithms?
benchmarking code:
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 1)
@Warmup(iterations = 2)
@Measurement(iterations = 3)
public void init() {
    int arr[] = {3,2,2,2,1,3,2,3,2,3,0,1,22,9,12,44,55,1000,99,33,333,4,33339,32,3,444,332345,32939,39493,3,2,2,2,1,3,2,3,2,3,0,1,22,9,12,44,55,1000,99,33,333,4,33339,32,3,444,332345,32939,39493,3,2,2,2,1,3,2,3,2,3,0,1,22,9,12,44,55,1000,99,33,333,4,33339,32,3,444,332345,32939,39493};
    mergeSort(arr, 0, arr.length - 1);
    //bubbleSort(arr);
}
The results:
merge sort:
"
# JMH version: 1.35
# VM version: JDK 16.0.2, Java HotSpot(TM) 64-Bit Server VM, 16.0.2+7-67
# VM invoker: C:\Program Files\Java\jdk-16.0.2\bin\java.exe
# VM options: -javaagent:C:\Program Files\JetBrains\IntelliJ IDEA Community Edition 2021.2.2\lib\idea_rt.jar=8379:C:\Program Files\JetBrains\IntelliJ IDEA Community Edition 2021.2.2\bin -Dfile.encoding=UTF-8
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 2 iterations, 10 s each
# Measurement: 3 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: benchmark.testBM.init
# Run progress: 0.00% complete, ETA 00:00:50
# Fork: 1 of 1
# Warmup Iteration 1: 2411.561 ns/op
# Warmup Iteration 2: 2403.000 ns/op
Iteration 1: 2434.613 ns/op
Iteration 2: 2505.492 ns/op
Iteration 3: 2578.021 ns/op
Result "benchmark.testBM.init":
2506.042 ±(99.9%) 1308.179 ns/op [Average]
(min, avg, max) = (2434.613, 2506.042, 2578.021), stdev = 71.706
CI (99.9%): [1197.864, 3814.221] (assumes normal distribution)
# Run complete. Total time: 00:00:50
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
testBM.init avgt 3 2506.042 ± 1308.179 ns/op
Process finished with exit code 0
So its average is 2506.042 ns/op.
bubble sort:
"
# Run progress: 0.00% complete, ETA 00:00:50
# Fork: 1 of 1
# Warmup Iteration 1: 1963.402 ns/op
# Warmup Iteration 2: 1815.476 ns/op
Iteration 1: 1831.914 ns/op
Iteration 2: 1863.120 ns/op
Iteration 3: 1874.605 ns/op
Result "benchmark.testBM.init":
1856.546 ±(99.9%) 403.037 ns/op [Average]
(min, avg, max) = (1831.914, 1856.546, 1874.605), stdev = 22.092
CI (99.9%): [1453.509, 2259.584] (assumes normal distribution)
# Run complete. Total time: 00:00:50
Benchmark Mode Cnt Score Error Units
testBM.init avgt 3 1856.546 ± 403.037 ns/op
Process finished with exit code 0
The average is 1856.546 ns/op.
So does anyone know what the problem may be?
implementation of bubble sort:
public int[] bubbleSort(int[] arr)
{
    for (int i = 0; i < arr.length; i++)
    {
        for (int j = arr.length - 1; j >= i + 1; j--)
        {
            if (arr[j] < arr[j - 1])
            {
                int temp = arr[j - 1];
                arr[j - 1] = arr[j];
                arr[j] = temp;
            }
        }
    }
    return arr;
}
implementation of merge sort:
public void merge(int[] arr, int p, int q, int r) {
    int left = q - p + 1;
    int right = r - q;
    int[] L = new int[left];
    int[] R = new int[right];
    for (int i = 0; i < left; ++i)
        L[i] = arr[p + i];
    for (int j = 0; j < right; ++j)
        R[j] = arr[q + 1 + j];
    int i = 0, j = 0;
    int z = p;
    while (i < left && j < right) {
        if (L[i] <= R[j]) {
            arr[z] = L[i];
            i++;
        } else {
            arr[z] = R[j];
            j++;
        }
        z++;
    }
    while (i < left) {
        arr[z] = L[i];
        i++;
        z++;
    }
    while (j < right) {
        arr[z] = R[j];
        j++;
        z++;
    }
}
public void mergeSort(int[] arr, int p, int r) {
    if (p < r) {
        int q = p + (r - p) / 2;
        mergeSort(arr, p, q);
        mergeSort(arr, q + 1, r);
        merge(arr, p, q, r);
    }
}

Related

Understanding the speed-up of an OpenMP program across NUMA nodes

I came across this speed-up behavior and I am finding it hard to explain. Following is the background:
Program
Gaussian elimination is invoked within a loop to solve linear equations, and the loop is parallelized to spread the workload across compute units. We use an augmented matrix of dimension M by (M+1), where the additional column holds the RHS.
HPC setup: Cray XC50 node with Intel Xeon Gold 6148, with the following configuration
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 0 size: 95325 MB
node 0 free: 93811 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 1 size: 96760 MB
node 1 free: 96374 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Although this is not the actual HPC system, the block diagram and the related explanation seem to fully apply (https://www.nas.nasa.gov/hecc/support/kb/skylake-processors_550.html). Specifically, sub-NUMA clustering seems to be disabled.
Jobs are submitted through ALPS (aprun) as follows:
time aprun -n 1 -d 20 -j 1 -ss -cc 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 0,1,2,3,4,5,6,7,8,9,20,21,22,23,24,25,26,27,28,29 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 0,1,2,3,4,5,6,7,8,9,30,31,32,33,34,35,36,37,38,39 -e N=4000 -e M=200 -e MODE=2 ./gem
time aprun -n 1 -d 20 -j 1 -ss -cc 40,41,42,43,44,45,46,47,48,49,60,61,62,63,64,65,66,67,68,69 -e N=4000 -e M=200 -e MODE=2 ./gem
In the above, N indicates the number of matrices and M the dimension of each matrix. These are passed as environment variables to the program and used internally. MODE can be ignored for this discussion.
The -cc list explicitly lists the CPUs to bind to. OMP_NUM_THREADS is set to 20. The intent is to use 20 threads across 20 compute units.
The sequential and parallel run times are recorded within the program using omp_get_wtime(), and the results are the following:
CPU Binding: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
Objective: Load work across 20 physical cores on socket 0
Speed up: 13.081944

CPU Binding: 0,1,2,3,4,5,6,7,8,9,20,21,22,23,24,25,26,27,28,29
Objective: Spread across the first 10 physical cores of socket 0 and socket 1
Speed up: 18.332559

CPU Binding: 10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29
Objective: Spread across the 2nd set of 10 physical cores on socket 0 and the first 10 of socket 1
Speed up: 18.636265

CPU Binding: 40,41,42,43,44,45,46,47,48,49,60,61,62,63,64,65,66,67,68,69
Objective: Spread across virtual cores across sockets (40-0, 60-21)
Speed up: 15.922209
Why is the speed-up lower in the first case, when all the physical cores on socket 0 are being used? The understanding here is that when tasks are spread across sockets, UPI comes into effect and it should be slower, whereas it seems to be exactly the opposite. Also, what can possibly explain the last scenario, where virtual cores are being used?
Note: We have tried multiple iterations and the results for the above combinations are pretty consistent.
Edit1:
Edit2: Source code
#define _GNU_SOURCE
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include "sched.h"
#include "omp.h"
double drand(double low, double high, unsigned int *seed)
{
return ((double)rand_r(seed) * (high - low)) / (double)RAND_MAX + low;
}
void init_vars(int *N, int *M, int *mode)
{
const char *number_of_instances = getenv("N");
if (number_of_instances) {
*N = atoi(number_of_instances);
}
const char *matrix_dim = getenv("M");
if (matrix_dim) {
*M = atoi(matrix_dim);
}
const char *running_mode = getenv("MODE");
if (running_mode) {
*mode = atoi(running_mode);
}
}
void print_matrix(double *instance, int M)
{
for (int row = 0; row < M; row++) {
for (int column = 0; column <= M; column++) {
printf("%lf ", instance[row * (M + 1) + column]);
}
printf("\n");
}
printf("\n");
}
void swap(double *a, double *b)
{
double temp = *a;
*a = *b;
*b = temp;
}
void init_matrix(double *instance, unsigned int M)
{
unsigned int seed = 45613 + 19 * omp_get_thread_num();
for (int row = 0; row < M; row++) {
for (int column = 0; column <= M; column++) {
instance[row * (M + 1) + column] = drand(-1.0, 1.0, &seed);
}
}
}
void initialize_and_solve(int M)
{
double *instance;
instance = malloc(M * (M + 1) * sizeof(double));
// Initialise the matrix
init_matrix(instance, M);
// Performing elementary operations
int i, j, k = 0, c, flag = 0, m = 0;
for (i = 0; i < M; i++) {
if (instance[i * (M + 2)] == 0) {
c = 1;
while ((i + c) < M && instance[(i + c) * (M + 1) + i] == 0)
c++;
if ((i + c) == M) {
flag = 1;
break;
}
for (j = i, k = 0; k <= M; k++) {
swap(&instance[j * (M + 1) + k], &instance[(j + c) * (M + 1) + k]);
}
}
for (j = 0; j < M; j++) {
// Excluding all i == j
if (i != j) {
// Converting Matrix to reduced row
// echelon form(diagonal matrix)
double pro = instance[j * (M + 1) + i] / instance[i * (M + 2)];
for (k = 0; k <= M; k++)
instance[j * (M + 1) + k] -= (instance[i * (M + 1) + k]) * pro;
}
}
}
// Get the solution in the last column
for (int i = 0; i < M; i++) {
instance[i * (M + 1) + M] /= instance[i * (M + 2)];
}
free(instance);
instance = NULL;
}
double solve_serial(int N, int M)
{
double now = omp_get_wtime();
for (int i = 0; i < N; i++) {
initialize_and_solve(M);
}
return omp_get_wtime() - now;
}
double solve_parallel(int N, int M)
{
double now = omp_get_wtime();
#pragma omp parallel for
for (int i = 0; i < N; i++) {
initialize_and_solve(M);
}
return omp_get_wtime() - now;
}
int main(int argc, char **argv)
{
// Default parameters
int N = 200, M = 200, mode = 2;
if (argc == 4) {
N = atoi(argv[1]);
M = atoi(argv[2]);
mode = atoi(argv[3]);
}
init_vars(&N, &M, &mode);
if (mode == 0) {
// Serial only
double l2_norm_serial = 0.0;
double serial = solve_serial(N, M);
printf("Time, %d, %d, %lf\n", N, M, serial);
} else if (mode == 1) {
// Parallel only
double l2_norm_parallel = 0.0;
double parallel = solve_parallel(N, M);
printf("Time, %d, %d, %lf\n", N, M, parallel);
} else {
// Both serial and parallel
// Solve using GEM (serial)
double serial = solve_serial(N, M);
// Solve using GEM (parallel)
double parallel = solve_parallel(N, M);
printf("Time, %d, %d, %lf, %lf, %lf\n", N, M, serial, parallel, serial / parallel);
}
return 0;
}
Edit3: Rephrased the first point to clarify what is actually being done (based on feedback in the comments)
You say you implement a "Simple implementation of Gaussian Elimination". Sorry, there is no such thing. There are multiple different algorithms and they all come with their own analysis. But let's assume you use the textbook one. Even then, Gaussian Elimination is not simple.
First of all, you haven't stated that you initialized your data in parallel. If you don't do that, all the data will wind up on socket 0 and you will get bad performance, never mind the speedup. But let's assume you did the right thing here. (If not, google "first touch".)
In the GE algorithm, each of the sequential k iterations works on a smaller and smaller subset of the data. This means that no simple mapping of data to cores is possible. If you place your data in such a way that initially each core works on local data, this will quickly no longer be the case.
In fact, after half the number of iterations, half your cores will be pulling data from the other socket, leading to NUMA coherence delays. Maybe a spread binding is better here than your compact binding.
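For illustration, a minimal first-touch sketch (my example, not the poster's code): Linux physically places a page on the NUMA node of the thread that first writes it, so a shared array should be initialized with the same parallel pattern that the compute loop will later use.

#include <omp.h>

/* First-touch sketch: the thread that first writes a page determines
   which NUMA node backs it, so initialize in parallel with the same
   static schedule the compute loop will use later. */
void first_touch_init(double *data, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        data[i] = 0.0; /* first write places the page on this thread's node */
}

Note that in the posted program each task mallocs and initializes its own matrix inside the parallel loop, so the first touch already happens on the executing thread; the issue matters when one large array is initialized serially and then processed in parallel.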
Why is the speed-up lower in the first case, when all the physical cores on socket 0 are being used?
Results often depend on the application, but some patterns happen regularly. My guess is that your application makes heavy use of main RAM, and two sockets bring more DDR4 RAM blocks into play than one. Indeed, with local NUMA-node allocations, one socket can access RAM at 128 GB/s, while two sockets can access RAM at 256 GB/s in aggregate. If the DDR4 RAM blocks were used in a balanced way but accessed remotely, performance would be far worse and bounded by UPI (I do not expect the two-socket runs to be much slower from this, because the UPI data transfer is full duplex).
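For reference, that per-socket figure matches the six DDR4-2666 memory channels of the Xeon Gold 6148 (assuming the standard memory configuration):

6 channels × 2666 MT/s × 8 B ≈ 128 GB/s per socket,

so two sockets with purely local allocations can sustain roughly twice the aggregate bandwidth of one.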
The understanding here is that when tasks are spread across sockets, UPI comes into effect and it should be slower whereas it seems to be exactly the opposite.
UPI is only a bottleneck if data are massively transferred between the two sockets, but good NUMA applications should not do that because they should operate on their own NUMA-node memory.
You can check the use of the UPI and RAM throughput using hardware counters.
Also, what can possibly explain the last scenario, where virtual cores are being used?
I do not have an explanation for this. Note that the higher IDs are the second hardware threads of each core, so it is likely related to low-level hyper-threading behaviour (maybe some processes are bound to certain PUs, pre-empting the target PUs, or the second PU of each core simply has a lower priority somehow). Note also that physical core IDs and logical PU IDs are often not mapped the same way, so if you use the wrong ones you can end up binding two threads to the same core. I advise you to use hwloc to check that.
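As a quick sanity check of the placement (a sketch assuming Linux/glibc, which provides sched_getcpu()), each OpenMP thread can report the logical CPU it actually runs on, for comparison against the -cc list:

#define _GNU_SOURCE
#include <sched.h>  /* sched_getcpu() */
#include <stdio.h>
#include <omp.h>

/* Print which logical CPU each OpenMP thread lands on, to verify
   that the intended binding is really in effect. */
int main(void)
{
    #pragma omp parallel
    {
        printf("thread %d runs on CPU %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}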

Intel VTune Results Understanding - Naive Questions

The application I want to speed up performs element-wise processing of a large array (about 1e8 elements).
The processing procedure for each element is very simple, and I suspect that the bottleneck could be not the CPU but DRAM bandwidth.
So I decided to study the single-threaded version first.
The system is: Windows 10 64bit, 32 GB RAM, Intel Core i7-3770S Ivybridge 1.10 GHz 4 cores, Hyperthreading enabled
Concurrency analysis
Elapsed Time: 34.425s
CPU Time: 14.908s
Effective Time: 14.908s
Idle: 0.005s
Poor: 14.902s
Ok: 0s
Ideal: 0s
Over: 0s
Spin Time: 0s
Overhead Time: 0s
Wait Time: 0.000s
Idle: 0.000s
Poor: 0s
Ok: 0s
Ideal: 0s
Over: 0s
Total Thread Count: 2
Paused Time: 18.767s
Memory Access Analysis
Memory Access Analysis provides different CPU times for three consecutive runs on the same amount of data.
The actual execution time was about 23 seconds, as the Concurrency Analysis says.
Elapsed Time: 33.526s
CPU Time: 5.740s
Memory Bound: 38.3%
L1 Bound: 10.4%
L2 Bound: 0.0%
L3 Bound: 0.1%
DRAM Bound: 0.8%
Memory Bandwidth: 36.1%
Memory Latency: 60.4%
Loads: 12,912,960,000
Stores: 7,720,800,000
LLC Miss Count: 420,000
Average Latency (cycles): 15
Total Thread Count: 4
Paused Time: 18.081s
Elapsed Time: 33.011s
CPU Time: 4.501s
Memory Bound: 36.9%
L1 Bound: 10.6%
L2 Bound: 0.0%
L3 Bound: 0.2%
DRAM Bound: 0.6%
Memory Bandwidth: 36.5%
Memory Latency: 62.7%
Loads: 9,836,100,000
Stores: 5,876,400,000
LLC Miss Count: 180,000
Average Latency (cycles): 15
Total Thread Count: 4
Paused Time: 17.913s
Elapsed Time: 33.738s
CPU Time: 5.999s
Memory Bound: 38.5%
L1 Bound: 10.8%
L2 Bound: 0.0%
L3 Bound: 0.1%
DRAM Bound: 0.9%
Memory Bandwidth: 57.8%
Memory Latency: 37.3%
Loads: 13,592,760,000
Stores: 8,125,200,000
LLC Miss Count: 660,000
Average Latency (cycles): 15
Total Thread Count: 4
Paused Time: 18.228s
As far as I understand the Summary page, the situation is not very good.
The paper "Finding your Memory Access performance bottlenecks" says that the reason is so-called false sharing. But I do not use multithreading; all processing is performed by just one thread.
On the other hand, according to the Memory Access Analysis Platform page, DRAM bandwidth is not the bottleneck.
So the questions are:
Why are the CPU time metric values different for Concurrency Analysis and Memory Access Analysis?
What is the reason for the bad memory metric values, especially for L1 Bound?
The main loop is a lambda function, where:
tasklets: a std::vector of simple structures that contain coefficients for data processing
points: the data itself, an Eigen::Matrix
projections: an Eigen::Matrix, the array to put the results of the processing into
The code is:
#include <iostream>
#include <future>
#include <random>
#include <Eigen/Dense>
#include <ittnotify.h>
using namespace std;
using Vector3 = Eigen::Matrix<float, 3, 1>;
using Matrix3X = Eigen::Matrix<float, 3, Eigen::Dynamic>;
uniform_real_distribution<float> rnd(0.1f, 100.f);
default_random_engine gen;
class Tasklet {
public:
Tasklet(int p1, int p2)
:
p1Id(p1), p2Id(p2), Loc0(p1)
{
RestDistance = rnd(gen);
Weight_2 = rnd(gen);
}
__forceinline void solve(const Matrix3X& q, Matrix3X& p)
{
Vector3 q1 = q.col(p1Id);
Vector3 q2 = q.col(p2Id);
for (int i = 0; i < 0; ++i) {
Vector3 delta = q2 - q1;
float norm = delta.blueNorm() * delta.hypotNorm();
}
Vector3 deltaQ = q2 - q1;
float dist = deltaQ.norm();
Vector3 deltaUnitVector = deltaQ / dist;
p.col(Loc0) = deltaUnitVector * RestDistance * Weight_2;
}
int p1Id;
int p2Id;
int Loc0;
float RestDistance;
float Weight_2;
};
typedef vector<Tasklet*> TaskList;
void
runTest(const Matrix3X& points, Matrix3X& projections, TaskList& tasklets)
{
size_t num = tasklets.size();
for (size_t i = 0; i < num; ++i) {
Tasklet* t = tasklets[i];
t->solve(points, projections);
}
}
void
prepareData(Matrix3X& points, Matrix3X& projections, int numPoints, TaskList& tasklets)
{
points.resize(3, numPoints);
projections.resize(3, numPoints);
points.setRandom();
/*
for (int i = 0; i < numPoints; ++i) {
points.col(i) = Vector3(1, 0, 0);
}
*/
tasklets.reserve(numPoints - 1);
for (int i = 1; i < numPoints; ++i) {
tasklets.push_back(new Tasklet(i - 1, i));
}
}
int
main(int argc, const char** argv)
{
// Pause VTune data collection
__itt_pause();
cout << "Usage: <exefile> <number of points (in thousands)> <#runs for averaging>" << endl;
int numPoints = 150 * 1000;
int numRuns = 1;
int argNo = 1;
if (argc > argNo) {
istringstream in(argv[argNo]);
int i;
in >> i;
if (in) {
numPoints = i * 1000;
}
}
++argNo;
if (argc > argNo) {
istringstream in(argv[argNo]);
int i;
in >> i;
if (in) {
numRuns = i;
}
}
cout
<< "Running test" << endl
<< "\t NumPoints (thousands): " << numPoints / 1000. << endl
<< "\t # of runs for averaging: " << numRuns << endl;
Matrix3X q, projections;
TaskList tasklets;
cout << "Preparing test data" << endl;
prepareData(q, projections, numPoints, tasklets);
cout << "Running test" << endl;
// Resume VTune data collection
__itt_resume();
for (int r = 0; r < numRuns; ++r) {
runTest(q, projections, tasklets);
}
// Pause VTune data collection
__itt_pause();
for (auto* t : tasklets) {
delete t;
}
return 0;
}
Thank you.

Find kth largest element in an array, two different priority_queue solutions time complexity

I'm interested in two solutions using priority_queue specifically. Although they both use priority_queue, I think they have different time complexities.
Solution 1:
int findKthLargest(vector<int>& nums, int k) {
    priority_queue<int> pq(nums.begin(), nums.end()); //O(N)
    for (int i = 0; i < k - 1; i++) //O(k*log(k))
        pq.pop();
    return pq.top();
}
Time Complexity: O(N) + O(k*log(k))
EDIT: sorry, it should be O(N) + O(k*log(N)); thanks for pointing that out!
Solution 2:
int findKthLargest(vector<int>& nums, int k) {
    priority_queue<int, vector<int>, greater<int>> p;
    int i = 0;
    while (p.size() < k) {
        p.push(nums[i++]);
    }
    for (; i < nums.size(); i++) {
        if (p.top() < nums[i]) {
            p.pop();
            p.push(nums[i]);
        }
    }
    return p.top();
}
Time Complexity: O(N*log(k))
So in most cases the 1st solution is much better than the 2nd?
In the first case the complexity is O(n) + k·log(n), not O(n) + k·log(k), as there are n elements in the heap. In the worst case k can be as large as n, so for unbounded data O(n·log(n)) is the correct worst-case complexity.
In the second case there are never more than k items in the priority queue, so the complexity is O(n·log(k)); again, k can be as large as n, so this is also O(n·log(n)).
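To summarize the two bounds (assuming the heap range constructor heapifies in O(n), as std::make_heap is required to):

T1(n, k) = O(n) + O(k · log n)   (build a max-heap of all n elements, then pop k − 1 times)
T2(n, k) = O(n · log k)          (maintain a min-heap of k elements while scanning all n)

For k much smaller than n the second bound is smaller; as k grows the two converge and constant factors decide the winner.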
For smaller k, the second code will run faster, but as k becomes larger the first code becomes faster. I ran some experiments; here are the results:
k=1000
Code 1 time:0.123662
998906057
Code 2 time:0.03287
998906057
========
k=11000
Code 1 time:0.137448
988159929
Code 2 time:0.0872
988159929
========
k=21000
Code 1 time:0.152471
977547704
Code 2 time:0.131074
977547704
========
k=31000
Code 1 time:0.168929
966815132
Code 2 time:0.168899
966815132
========
k=41000
Code 1 time:0.185737
956136410
Code 2 time:0.205008
956136410
========
k=51000
Code 1 time:0.202973
945313516
Code 2 time:0.236578
945313516
========
k=61000
Code 1 time:0.216686
934315450
Code 2 time:0.27039
934315450
========
k=71000
Code 1 time:0.231253
923596252
Code 2 time:0.293189
923596252
========
k=81000
Code 1 time:0.246896
912964978
Code 2 time:0.321346
912964978
========
k=91000
Code 1 time:0.263312
902191629
Code 2 time:0.343613
902191629
========
I modified the second code a little to make it similar to code 1:
#include <cstdio>
#include <ctime>
#include <iostream>
#include <queue>
#include <vector>
using namespace std;

int findKthLargest2(vector<int>& nums, int k) {
    double st = clock();
    priority_queue<int, vector<int>, greater<int>> p(nums.begin(), nums.begin() + k);
    int i = k;
    for (; i < nums.size(); i++) {
        if (p.top() < nums[i]) {
            p.pop();
            p.push(nums[i]);
        }
    }
    cerr << "Code 2 time:" << (clock() - st) / CLOCKS_PER_SEC << endl;
    return p.top();
}
int findKthLargest1(vector<int>& nums, int k) {
    double st = clock();
    priority_queue<int> pq(nums.begin(), nums.end()); //O(N)
    for (int i = 0; i < k - 1; i++) //O(k*log(N))
        pq.pop();
    cerr << "Code 1 time:" << (clock() - st) / CLOCKS_PER_SEC << endl;
    return pq.top();
}
int main() {
    freopen("in", "r", stdin); // READ("in") macro replaced by its standard equivalent
    vector<int> v;
    int n;
    cin >> n;
    for (int i = 0; i < n; i++) { // repl(i,n) macro expanded
        int x;
        scanf("%d", &x);
        v.push_back(x); // pb expanded
    }
    for (int k = 1000; k <= 100000; k += 10000) {
        cout << "k=" << k << endl;
        cout << findKthLargest1(v, k) << endl;
        cout << findKthLargest2(v, k) << endl;
        puts("========");
    }
}
I used 1,000,000 random integers between 0 and 10^9 as the dataset, generated by the C++ rand() function.
Well, no: the first is O(N) + O(k*log(N)), because the pop is O(log(N)):
int findKthLargest(vector<int>& nums, int k) {
    priority_queue<int> pq(nums.begin(), nums.end()); //O(N)
    for (int i = 0; i < k - 1; i++) //O(k*log(N))
        pq.pop(); // this line is O(log(N))
    return pq.top();
}
It's still better than the second in most cases.

OpenMP Code Not Scaling due to overheads and cache issues

#include <iostream>
#include <vector>
#include <omp.h>
// Note: GEMV is assumed to be a macro/wrapper for a BLAS sgemv routine
// (the poster mentions Intel MKL); the argument list matches sgemv's signature.
struct xnode
{
float *mat;
};
void testScaling( )
{
int N = 1000000; ///total num matrices
int dim = 10;
//memory for matrices
std::vector<xnode> nodeArray(N);
for( int k = 0; k < N; ++k )
nodeArray[k].mat = new float [dim*dim];
//memory for Y
std::vector<float*> Y(N,0);
for( int k = 0; k < N; ++k )
Y[k] = new float [dim];
//shared X
float* X = new float [dim];
for(int i = 0; i < dim; ++i ) X[i] = 1.0;
//init mats
for( int k = 0; k < N; ++k )
{
for( int i=0; i<dim*dim; ++i )
nodeArray[k].mat[i] = 0.25+((float)i)/3;
}
int NTIMES = 500;
//gemv args
char trans = 'N';
int lda = dim;
int incx = 1;
float alpha =1 , beta = 0;
//threads
int thr[4];
thr[0] =1 ; thr[1] = 2; thr[2] = 4; thr[3] = 8;
for( int t = 0; t<4; ++t )//test for nthreads
{
int nthreads = thr[t];
double t_1 = omp_get_wtime();
for( int ii = 0; ii < NTIMES; ++ii )//do matvec NTIMES
{
#pragma omp parallel for num_threads(nthreads)
for( int k=0; k<N; ++k )
{
//compute Y[k] = mat[k] * X;
GEMV(&trans, &dim, &dim, &alpha, nodeArray[k].mat, &lda, X, &incx, &beta, Y[k], &incx);
//GEMV(&trans, &dim, &dim, &alpha, nodeArray[0].mat, &lda, X, &incx, &beta, Y[k], &incx);
}
}
double t_2 = omp_get_wtime();
std::cout << "Threads " << nthreads << " time " << (t_2-t_1)/NTIMES << std::endl;
}
//clear memory
for( int k = 0; k < N; ++k )
{
delete [] nodeArray[k].mat;
delete [] Y[k];
}
delete [] X;
}
The above code parallelizes the matrix-vector product of N matrices of size dim and stores the results in N output vectors. The average over 500 products is taken as the time per matrix-vector product. The matrix-vector products in the above example are all of equal size, and thus the threads should be perfectly balanced: we should achieve performance scaling close to the ideal 8x. The following are the observations (machine: Intel Xeon 3.1 GHz, 2 processors, 8 cores each, Hyper-Threading enabled; Windows, VS2012, Intel MKL, Intel OpenMP library).
OBSERVATION 1:
dim=10 N=1000000
Threads 1 - time 0.138068s
Threads 2 - time 0.0729147s
Threads 4 - time 0.0360527s
Threads 8 - time 0.0224268s (6.1x on 8threads)
OBSERVATION 2 :
dim=20 N=1000000
Threads 1 time 0.326617
Threads 2 time 0.185706
Threads 4 time 0.0886508
Threads 8 time 0.0733666 (4.5x on 8 threads).
Note: I ran VTune on this case. It showed CPU time 267.8 s, overhead time 43 s, and spin time 8 s. The overhead time is all spent in a libiomp function (Intel library). The 8-thread/1-thread scaling is poor in such cases.
Next, in the gemv for loop, we change nodeArray[k].mat to nodeArray[0].mat (see the commented statement), so that only the first matrix is used for all the matrix-vector products.
OBSERVATION 3
dim=20 N=1000000
Threads 1 time 0.152298 (The serial time is halved)
Threads 2 time 0.0769173
Threads 4 time 0.0384086
Threads 8 time 0.019336 (7.87x on 8 threads)
Thus I get almost ideal scaling. Why does this happen?
VTune says that a significant portion of CPU time is spent in synchronization and thread overhead, yet here there seems to be no relation between load balancing and thread synchronization. As the matrix size increases, the granularity should increase and the thread overhead should become proportionately smaller, but as we go from size 10 to 20 the scaling weakens. When we use nodeArray[0].mat (only the first matrix) for all the matrix-vector products, the cache is updated only once (since the compiler knows this during optimization) and we get near-ideal scaling. Thus the synchronization overhead seems to be related to some cache issue. I have tried a number of other things, like setting KMP_AFFINITY and varying the load distribution, but that did not buy me anything.
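A rough working-set estimate (my arithmetic, not from VTune) points at memory bandwidth rather than synchronization. For dim = 20, each matrix occupies

20 × 20 × 4 B = 1.6 KB, so one pass over N = 10^6 matrices streams about 1.6 GB from RAM,

far beyond any cache. At the measured 0.0733 s per pass on 8 threads that is roughly 22 GB/s, and since the matrices are initialized serially (and therefore likely reside on one socket's memory), this is of the order of the practical DRAM bandwidth available, so extra threads cannot help. With nodeArray[0].mat the single 1.6 KB matrix stays resident in L1, the loop becomes compute-bound, and near-ideal scaling returns.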
My questions are:
1. I don't have a clear idea of how cache performance affects OpenMP thread synchronization. Can someone explain this?
2. Can anything be done to improve the scaling and reduce the overhead?
Thanks

How much time will a typical home computer take to perform calculations of millions of digits?

I know that machines find it difficult to do calculations involving very large numbers.
Let's say I want to find the square of a million-digit number. Will a typical computer give an answer almost instantly? How much time does it take to handle million-digit calculations?
Also, what is the reason for them being slow in such calculations?
I found some calculator websites which claim that they can do the task instantly. Would a computer become faster if it used the method those websites use?
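For scale (a complexity note, not taken from the answer below): squaring a d-digit number with schoolbook multiplication costs on the order of d² digit operations (10^12 for d = 10^6), while FFT-based big-integer libraries multiply in roughly O(d · log d), which is how specialized calculators can square a million-digit number almost instantly.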
On my PC it takes more than 21 minutes to take the square root of a number with 1 million digits. See the details below. It should be possible to achieve faster times, but "almost instantly" is probably not feasible without making use of special hardware (like graphics boards with CUDA support).
I have written a test program in C# to find the runtimes for calculating the square root with Newton's method. It uses the System.Numerics library, which features the BigInteger class for arbitrary-precision arithmetic.
The runtime depends on the initial value assumed for the iterative calculation. Looking for the highest non-zero bit of the number turned out to be faster than simply always using 1 as the initial value.
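For reference, the iteration in the sqrt() method below is the standard Newton step for f(y) = y² − x:

y_{n+1} = (y_n + x / y_n) / 2

Convergence is quadratic: the number of correct digits roughly doubles per step, which is why the iteration counts in the results grow only slowly with the digit count, while each step pays for a full-precision BigInteger division.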
using System;
using System.Diagnostics;
using System.Numerics;
namespace akBigSquareRoot
{
class Program
{
static void Main(string[] args)
{
Stopwatch stopWatch = new Stopwatch();
Console.WriteLine(" nDigits error iterations elapsed ");
Console.WriteLine("-----------------------------------------");
for (int nDigits = 10; nDigits <= 1e6; nDigits *= 10)
{
// create a base number with nDigits/2 digits
BigInteger x = 1;
for (int i = 0; i < nDigits / 2; i++)
{
x *= 10;
}
BigInteger square = x * x;
stopWatch.Restart();
int iterations;
BigInteger root = sqrt(square, out iterations);
stopWatch.Stop();
BigInteger error = x - root;
TimeSpan ts = stopWatch.Elapsed;
string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}",
ts.Hours, ts.Minutes, ts.Seconds,
ts.Milliseconds / 10);
Console.WriteLine("{0,8} {1,6} {2,6} {3}", nDigits, error, iterations, elapsedTime);
}
Console.WriteLine("\n<end reached>");
Console.ReadKey();
}
public static BigInteger sqrt(BigInteger x, out int iterations)
{
BigInteger div = BigInteger.One << (bitLength(x) / 2);
// BigInteger div = 1;
BigInteger div2 = div;
BigInteger y;
// Loop until we hit the same value twice in a row, or wind
// up alternating.
iterations = 0;
while (true)
{
iterations++;
y = (div + (x / div)) >> 1;
if ((y == div) || (y == div2))
return y;
div2 = div;
div = y;
}
}
private static int bitLength(BigInteger x) {
int len = 0;
do
{
len++;
} while ((x >>= 1) != 0);
return len;
}
}
}
The results on a DELL XPS 8300 with Intel Core i7-2600 CPU 3.40 GHz
nDigits error iterations elapsed
----------------------------------------
10 0 4 00:00:00.00
100 0 7 00:00:00.00
1000 0 10 00:00:00.00
10000 0 14 00:00:00.09
100000 0 17 00:00:09.81
1000000 0 20 00:21:18.38
Increasing the number of digits by a factor of 10 results in three additional iterations in the search procedure, consistent with Newton's quadratic convergence (each tenfold increase in length adds about log2(10) ≈ 3.3 steps). But due to the increased bit length, the individual iterations are slowed down substantially, since each one performs a full-precision BigInteger division.
The computational complexity of calculating square (and higher degree) roots is discussed in a related post.
