I have the following piece of code, which I want to parallelize in a particular way. I am making a mistake somewhere, because not all threads seem to run the loop as I expected. It would be great if somebody could help me identify that mistake.
This code calculates histograms.
#pragma omp parallel default(shared) private(iIndex2, iIndex1, fDist) shared(iSize, dense) reduction(+:iCount)
{
    chunk = (unsigned int)(iSize / omp_get_num_threads());
    threadID = omp_get_thread_num();
    svtout << "Number of threads available " << omp_get_num_threads() << endl;
    svtout << "The threadID is " << threadID << endl;
    //want each of the thread to execute the loop
    for (iIndex1=0; iIndex1 < chunk; iIndex1++)
    {
        for (iIndex2=iIndex1+1; iIndex2 < chunk; iIndex2++)
        {
            fDist = (*this)[iIndex1 + threadID*chunk].distance( (*this)[iIndex2 + threadID*chunk] );
            idx = (int)(fDist/fWidth);
            if ((int)fDist % (int)fWidth >= 0)
            {
                #pragma omp atomic
                dense[idx] += 1;
            }
        }
    }
}
The iCount variable keeps track of the number of iterations, and I noticed that there is a marked difference between the serial and the parallel version. I guess not all threads are running, and hence the histogram values that I'm obtaining from the parallel program are much less than the actual readings (the dense array stores the histogram values).

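You are looping over chunk rather than iSize when more than one thread is running, so each thread only pairs up elements inside its own chunk, and pairs that span two chunks are never counted.
Try replacing the loop bounds with iSize.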
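For what it's worth, here is a minimal sketch of that idea, not the asker's actual class. Point, pairHistogram and nBins are made-up names for illustration, and schedule(dynamic) is just my choice because the inner loop gets shorter as i grows. OpenMP splits the outer loop across threads while the inner loop still runs up to iSize, so cross-chunk pairs are kept.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's point type and its distance() method.
struct Point {
    double x, y;
    double distance(const Point& o) const { return std::hypot(x - o.x, y - o.y); }
};

// Counts every unordered pair (i, j) with i < j into a bin of width fWidth.
// Distances that fall beyond the last bin are ignored.
std::vector<long> pairHistogram(const std::vector<Point>& pts, double fWidth, std::size_t nBins) {
    assert(fWidth > 0.0);
    std::vector<long> dense(nBins, 0);
    const long iSize = static_cast<long>(pts.size());
    long* bins = dense.data();

    // The outer loop is split across threads; the inner loop still runs to iSize,
    // so pairs spanning two threads' ranges are not lost.
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < iSize; ++i) {
        for (long j = i + 1; j < iSize; ++j) {
            const double fDist = pts[i].distance(pts[j]);
            const std::size_t idx = static_cast<std::size_t>(fDist / fWidth);
            if (idx < nBins) {
                // Several threads may hit the same bin, so the update is atomic.
                #pragma omp atomic
                bins[idx] += 1;
            }
        }
    }
    return dense;
}
```

Built with -fopenmp the loop runs in parallel; without it the pragmas are ignored and the same code runs serially, which makes it easy to compare the two results.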


SHA256 Find Partial Collision

I have two messages:
messageA: "Frank is one of the "best" students topicId{} "
messageB: "Frank is one of the "top" students topicId{} "
I need to find a partial SHA256 collision between these two messages (8 hex digits).
That is, the first 8 hex digits of SHA256(messageA) must equal the first 8 hex digits of SHA256(messageB).
We can put any letters and numbers in {}, but both {} must contain the same string.
I have tried brute force and a birthday attack with a hash table to solve this problem, but it takes too much time. I know about cycle detection algorithms like Floyd's and Brent's, but I have no idea how to construct the cycle for this problem. Are there any other methods to solve this problem? Thank you so much!
This is pretty trivial to solve with a birthday attack. Here's how I did it in Python (v2):
def find_collision(ntries):
    from hashlib import sha256
    str1 = 'Frank is one of the "best" students topicId{%d} '
    str2 = 'Frank is one of the "top" students topicId{%d} '
    seen = {}
    for n in xrange(ntries):
        h = sha256(str1 % n).digest()[:4].encode('hex')
        seen[h] = n
    for n in xrange(ntries):
        h = sha256(str2 % n).digest()[:4].encode('hex')
        if h in seen:
            print str1 % seen[h]
            print str2 % n
If your attempt took too long to find a solution, then either you simply made a mistake in your coding somewhere, or you were using the wrong data type.
Python's dictionary data type is implemented using hash tables. That means you can search for dictionary elements in constant time. If you implemented seen using a list instead of a dict in the above code, then the search at line 11 would take an awful lot longer.
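To make the hash-table point concrete, here is a rough C++ sketch of the same two-pass lookup. It is only an illustration: toyHash16 is a made-up stand-in (32-bit FNV-1a cut down to 16 bits) rather than SHA-256, so that the example needs no external library, and findCollision and Collision are hypothetical names. The part that matters is that std::unordered_map gives average constant-time lookups, like Python's dict.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Toy stand-in for SHA-256 (assumption): 32-bit FNV-1a, keeping the top 16 bits.
std::uint16_t toyHash16(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return static_cast<std::uint16_t>(h >> 16);
}

struct Collision {
    bool found;
    std::string a, b;
};

// Pass 1 stores the truncated hash of every "best" message in a hash table.
// Pass 2 looks up each "top" message, which is an average O(1) operation.
Collision findCollision(int ntries) {
    assert(ntries > 0);
    const std::string prefixA = "Frank is one of the \"best\" students topicId{";
    const std::string prefixB = "Frank is one of the \"top\" students topicId{";
    std::unordered_map<std::uint16_t, std::string> seen;
    for (int n = 0; n < ntries; ++n) {
        std::string msg = prefixA + std::to_string(n) + "} ";
        seen[toyHash16(msg)] = msg;
    }
    for (int n = 0; n < ntries; ++n) {
        std::string msg = prefixB + std::to_string(n) + "} ";
        auto it = seen.find(toyHash16(msg));
        if (it != seen.end()) return {true, it->second, msg};
    }
    return {false, "", ""};
}
```

With a 16-bit toy hash a few thousand tries per message should be plenty; swapping in a real SHA-256 truncated to 4 bytes changes the numbers, not the structure.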
If the two topicId tokens have to be identical, then, as pointed out in the comments, there is little option but to grind through somewhere in the order of 2^31 values. You will find a collision eventually, but it could take a long time.
Just leave this running overnight and with a bit of luck you'll have an answer in the morning:
def find_collision():
    from hashlib import sha256
    str1 = 'Frank is one of the "best" students topicId{%x} '
    str2 = 'Frank is one of the "top" students topicId{%x} '
    seen = {}
    n = 0
    while True:
        if sha256(str1 % n).digest()[:4] == sha256(str2 % n).digest()[:4]:
            print str1 % n
            print str2 % n
        n += 1
If you're in a hurry, you could maybe look into using a GPU to speed up the hash calculations.
I'm assuming the space at the end of the strings in the question was intentional so I left it in.
"Frank is one of the "top" students topicId{59220691223} "
"Frank is one of the "best" students topicId{59220691223} "
It actually took about 7 billion tries to find one using brute force, a lot more than I expected.
I figure 2^32 is roughly 4.3 billion, so the chance of not finding any match after 4.3 billion tries is about 36.78%.
I actually found a match after about 7 billion tries; there was less than a 20% chance of no match in 7 billion tries.
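Those percentages can be checked with a few lines of C++. This is just a sketch under the assumption that each try matches independently with probability 2^-32 (i.e. the truncated hashes behave like uniform random values); probNoMatch is a name I made up.

```cpp
#include <cassert>
#include <cmath>

// Probability that `tries` independent attempts all miss, when each attempt
// matches with probability 2^-bits.
double probNoMatch(double tries, int bits) {
    assert(bits > 0 && tries >= 0.0);
    const double p = std::ldexp(1.0, -bits);   // 2^-bits
    return std::exp(tries * std::log1p(-p));   // (1 - p)^tries, computed stably
}
```

probNoMatch(4294967296.0, 32) comes out very close to 1/e, and probNoMatch(7e9, 32) lands a little under 0.20, which is consistent with the figures above.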
This is the C++ code I used running on 7 threads, each thread gets a different starting point and it quits once a match is found on any thread. Each thread also updates its progress to cout every 1 million attempts.
I've fast forwarded to where the match was found on threadId=5, so it takes less than a minute to run. But if you change the starting point you can look for other matches.
And I'm not sure either how one would use Floyd and Brent since the strings have to use the same topicId so you are locked in on both the prefix and suffix.
To compile go get picosha2 header file from
Copy this code into same directory as picosha2.h file, save it as hash.cpp for example.
On Linux go to command line and cd to directory where these files are.
To compile it:
g++ -O2 -o hash hash.cpp -l pthread
And run it:
#include <atomic>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>
// I used picoSHA2 header only file for the hashing
#include "picosha2.h"

// return 1st 4 bytes (8 chars) of SHA256 hash
std::string hash8(const std::string& src_str) {
    std::vector<unsigned char> hash(picosha2::k_digest_size);
    picosha2::hash256(src_str.begin(), src_str.end(), hash.begin(), hash.end());
    return picosha2::bytes_to_hex_string(hash.begin(), hash.begin() + 4);
}

// Set by whichever thread finds a match; atomic so the other threads can read it safely.
std::atomic<bool> done(false);
std::mutex mtxCout;

void work(unsigned long long threadId) {
    std::string a = "Frank is one of the \"best\" students topicId{",
                b = "Frank is one of the \"top\" students topicId{";
    // Each thread gets a different starting point, I've fast forwarded to the part
    // where I found the match so this won't take long to run if you try it, < 1 minute.
    // If you want to run a while drop the last "+ 150000000ULL" term and it will run
    // for about 1 billion total (150 million each thread, assuming 7 threads) take
    // about 30 minutes on Linux.
    // Collision occurred on threadId = 5, so if you change it to use less than 6 threads
    // then your mileage may vary.
    unsigned long long start = threadId * (11666666667ULL + 147000000ULL) + 150000000ULL;
    unsigned long long x = start;
    for (;;) {
        // Stop as soon as any thread has reported a match; writing to cout
        // is guarded by the mutex.
        if (done) return;
        std::string xs = std::to_string(x++);
        std::string hashA = hash8(a + xs + "} "), hashB = hash8(b + xs + "} ");
        if (hashA == hashB) {
            std::lock_guard<std::mutex> lock(mtxCout);
            std::cout << "*** SOLVED ***" << std::endl;
            std::cout << (x-1) << std::endl;
            std::cout << "\"" << a << (x - 1) << "} \" = " << hashA << std::endl;
            std::cout << "\"" << b << (x - 1) << "} \" = " << hashB << std::endl;
            done = true;
        }
        if (((x - start) % 1000000ULL) == 0) {
            std::lock_guard<std::mutex> lock(mtxCout);
            std::cout << "thread: " << threadId << " = " << (x-start)
                      << " tries so far" << std::endl;
        }
    }
}

void runBruteForce() {
    const int NUM_THREADS = 7;
    std::thread threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) threads[i] = std::thread(work, i);
    for (int i = 0; i < NUM_THREADS; i++) threads[i].join();
}

int main(int argc, char** argv) {
    runBruteForce();
    return 0;
}

Unexpected and large runtime variations in Eigen for matrix multiplies

I am comparing ways to perform equivalent matrix operations within Eigen, and am getting extraordinarily different runtimes, including some non-intuitive results.
I am comparing three mathematically equivalent forms of the matrix multiplication:
wx * transpose(data)
The three forms I'm comparing are:
result = wx * data.transpose() (straight multiply version)
result.noalias() = wx * data.transpose() (noalias version)
result = (data * wx.transpose()).transpose() (transposed version)
I am also testing using both Column Major and Row Major storage.
With column major storage, the transposed version is significantly faster (an order of magnitude) than both the straight multiply and the no alias version, which are both approximately equal in runtime.
With row major storage, the noalias and the transposed version are both significantly faster than the straight multiply in runtime.
I understand that Eigen uses lazy evaluation, and that the immediate results returned from an operation are often expression templates, and are not the intermediate values. I also understand that matrix * matrix operations will always produce a temporary when they are the last operation on the right hand side, to avoid aliasing issues, hence why I am attempting to speed things up through noalias().
My main questions:
Why is the transposed version always significantly faster, even (in the case of column major storage) when I explicitly state noalias so no temporaries are created?
Why does the (significant) difference in runtime only occur between the straight multiply and the noalias version when using column major storage?
The code I am using for this is below. It is being compiled using gcc 4.9.2, on a Centos 6 install, using the following command line.
g++ eigen_test.cpp -O3 -std=c++11 -o eigen_test -pthread -fopenmp -finline-functions
#include <chrono>
#include <iostream>
#include <Eigen/Dense>

using Matrix = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;
// using Matrix = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

int main()
{
    int wx_rows = 8000;
    int wx_cols = 1000;
    int samples = 1;
    // Eigen::MatrixXf matrix = Eigen::MatrixXf::Random(matrix_rows, matrix_cols);
    Matrix wx = Eigen::MatrixXf::Random(wx_rows, wx_cols);
    Matrix data = Eigen::MatrixXf::Random(samples, wx_cols);
    Matrix result;
    unsigned int iterations = 10000;
    float sum = 0;

    // Straight multiply version
    auto before = std::chrono::high_resolution_clock::now();
    for (unsigned int ii = 0; ii < iterations; ++ii)
    {
        result = wx * data.transpose();
        sum += result(result.rows() - 1, result.cols() - 1);
    }
    auto after = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(after - before).count();
    std::cout << "original sum: " << sum << std::endl;
    std::cout << "original time (ms): " << duration << std::endl;
    std::cout << std::endl;

    // noalias version
    sum = 0;
    before = std::chrono::high_resolution_clock::now();
    for (unsigned int ii = 0; ii < iterations; ++ii)
    {
        result.noalias() = wx * data.transpose();
        sum += result(wx_rows - 1, samples - 1);
    }
    after = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(after - before).count();
    std::cout << "alias sum: " << sum << std::endl;
    std::cout << "alias time (ms) : " << duration << std::endl;
    std::cout << std::endl;

    // Transposed version
    sum = 0;
    before = std::chrono::high_resolution_clock::now();
    for (unsigned int ii = 0; ii < iterations; ++ii)
    {
        result = (data * wx.transpose()).transpose();
        sum += result(wx_rows - 1, samples - 1);
    }
    after = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(after - before).count();
    std::cout << "new sum: " << sum << std::endl;
    std::cout << "new time (ms) : " << duration << std::endl;
    return 0;
}
One half of the explanation is that, in the current version of Eigen, multi-threading is achieved by splitting the work over blocks of columns of the result (and of the right-hand side). With only 1 column, multi-threading does not take place. In the column-major case, this explains why cases 1 and 2 underperform. On the other hand, case 3 is evaluated as:
column_major_tmp.noalias() = data * wx.transpose();
result = column_major_tmp.transpose();
and since wx.transpose().cols() is huge, multi-threading is effective.
To understand the row-major case, you also need to know that internally matrix products are implemented for a column-major destination. If the destination is row-major, as in case 2, then the product is transposed, so what really happens is:
row_major_result.transpose().noalias() = data * wx.transpose();
and so we're back to case 3 but without temporary.
This is clearly a limitation of current Eigen's multi-threading implementation for highly unbalanced matrix sizes. Ideally threads should be spread on row-block and/or column-block depending on the size of the matrices at hand.
BTW, you should also compile with -march=native to let Eigen fully exploit your CPU (AVX, FMA, AVX512...).

openMP number of threads is higher than asked for

I'm implementing an OpenMP version of a sequential program, and for a function that distributes a list among the threads, I need the function to know the number of threads.
Boiled down, the code looks like this:
int numberOfThreads = 0;
#pragma omp parallel
{
    //split nodeQueue
    #pragma omp master
    {
        cout << "Asked for " << NUM_THREADS << endl;
        numberOfThreads = omp_get_num_threads();
        cout << "Got " << numberOfThreads << " threads" << endl;
    }
}
No matter what I set NUM_THREADS to, it seems to get 4 threads, and outputs:
Asked for 1
Got 4 threads
Shouldn't it get a maximum of NUM_THREADS when I use omp_set_num_threads(NUM_THREADS)?
It doesn't matter what number of threads I ask for - it always gets 4 (which is the number of threads available on the CPU)...
Can't I force it to use the specified number of threads as maximum?
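I think setting the number of threads from within a parallel region does not change the number of threads for the fork at the start of that region, because the team has already been created by then. It only changes the number of threads for nested parallel regions, which default to 1 thread by the OMP specs. Call omp_set_num_threads(NUM_THREADS) before the #pragma omp parallel line instead.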
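A small sketch of requesting the team size before the fork, so you can see what you actually get. I'm using the num_threads clause here rather than omp_set_num_threads (both are standard OpenMP; the clause just keeps the example free of omp.h), and teamSize is a made-up helper name. Each thread in the team adds one to a counter, so the return value is the real team size.

```cpp
#include <cassert>

// Runs one parallel region with the requested team size and returns how many
// threads actually took part. The runtime may give fewer threads than requested.
int teamSize(int requested) {
    assert(requested > 0);
    int count = 0;
    // The request is made on the directive itself, i.e. before the team is forked.
    #pragma omp parallel num_threads(requested)
    {
        // Every thread in the team increments the shared counter once.
        #pragma omp atomic
        count += 1;
    }
    return count;
}
```

Compiled with -fopenmp, teamSize(1) should report a single thread; compiled without it, the pragmas are ignored and the function always reports 1.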

rewriting a simple C++ Code snippet into CUDA Code

I have written the following simple C++ code.
#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    int myNumber = 0;
    int numOfHits = 0;
    cout << "Enter my Number Value" << endl;
    cin >> myNumber;
    #pragma omp parallel for reduction(+:numOfHits)
    for(int i = 0; i <= 100000; ++i)
    {
        for(int j = 0; j <= 100000; ++j)
        {
            for(int k = 0; k <= 100000; ++k)
            {
                if(i + j + k == myNumber)
                    ++numOfHits;
            }
        }
    }
    cout << "Number of Hits" << numOfHits << endl;
    return 0;
}
As you can see I use OpenMP to parallelize the outermost loop. What I would like to do is to rewrite this small code in CUDA. Any help will be much appreciated.
Well, I can give you a quick tutorial, but I won't necessarily write it all for you.
So first of all, you will want to get MS Visual Studio set up with CUDA, which is easy following this guide:
Now you will want to read The NVIDIA CUDA Programming Guide (free pdf), documentation, and CUDA by Example (A book I highly recommend for learning CUDA).
But let's say you haven't done that yet, and definitely will later.
This is an extremely arithmetic heavy and data-light computation - actually it can be computed without this brute force method fairly simply, but that isn't the answer you are looking for. I suggest something like this for the kernel:
__global__ void kernel(int* myNumber, int* numOfHits){
    //a shared value will be stored on-chip, which is beneficial since this is written to multiple times
    //it is shared by all threads in the block
    //shared variables cannot be initialized in their declaration, so one thread zeroes it
    __shared__ int s_hits;
    if(threadIdx.x == 0 && threadIdx.y == 0)
        s_hits = 0;
    __syncthreads();
    //the starting values of i and j identify the current thread uniquely
    //we increment i and j by an amount equal to the number of threads in one dimension of the block, 16 usually, times the number of blocks in one dimension, which can be quite large (but not 100,000)
    for(int i = threadIdx.x + blockIdx.x*blockDim.x; i < 100000; i += blockDim.x*gridDim.x){
        //j has to restart for every i
        for(int j = threadIdx.y + blockIdx.y*blockDim.y; j < 100000; j += blockDim.y*gridDim.y){
            //Thanks to talonmies for this simplification
            if(0 <= (*myNumber-i-j) && (*myNumber-i-j) < 100000){
                //use an atomic for this
                //otherwise, the value may change during the 'read, modify, write' process
                atomicAdd(&s_hits, 1);
            }
        }
    }
    //synchronize threads, so we know s_hits is completely updated
    __syncthreads();
    //we make sure only one thread per threadblock actually adds in s_hits
    //again, atomics, since several blocks update numOfHits
    if(threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(numOfHits, s_hits);
}
To launch the kernel, you will want something like this:
dim3 blocks(some_number, some_number, 1); //some_number should be hand-optimized
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(/*args*/);
I know you probably want a quick way to do this, but getting into CUDA isn't really a 'quick' thing. As in, you will need to do some reading and some setup to get it working; past that, the learning curve isn't too high. I haven't told you anything about memory allocation yet, so you will need to do that (although that is simple). If you followed my code, my goal is that you had to read up a bit on shared memory and CUDA, and so you are already kick-started. Good luck!
Disclaimer: I haven't tested my code, and I am not an expert - it could be idiotic.
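On the remark above that this can be computed without brute force: one possible way (my own sketch, not necessarily what was meant) is to loop over i only and count the valid j values for each i directly. countHits and countHitsBruteForce are hypothetical names, and the bounds follow the question's loops, which run from 0 to 100000 inclusive.

```cpp
#include <algorithm>
#include <cassert>

// Number of (i, j, k) with 0 <= i, j, k <= limit and i + j + k == target.
// For a fixed i, k = target - i - j must lie in [0, limit], so the valid j are
// max(0, target - i - limit) <= j <= min(limit, target - i).
long long countHits(long long target, long long limit) {
    assert(limit >= 0);
    long long hits = 0;
    for (long long i = 0; i <= limit; ++i) {
        const long long lo = std::max(0LL, target - i - limit);
        const long long hi = std::min(limit, target - i);
        if (hi >= lo) hits += hi - lo + 1;
    }
    return hits;
}

// Reference triple loop, only sensible for small limits.
long long countHitsBruteForce(long long target, long long limit) {
    long long hits = 0;
    for (long long i = 0; i <= limit; ++i)
        for (long long j = 0; j <= limit; ++j)
            for (long long k = 0; k <= limit; ++k)
                if (i + j + k == target) ++hits;
    return hits;
}
```

The single loop does about 100,001 iterations instead of roughly 10^15, so for this particular problem a GPU is probably not needed at all. Note the count is kept in a long long, since it can exceed the range of a 32-bit int.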

Can compiler reorder code over calls to std::chrono::system_clock::now()?

While playing with VS11 beta I noticed something weird:
this code prints
f() took 0 milliseconds
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>
int main()
{
    std::vector<int> v;
    size_t length =64*1024*1024;
    for (int i = 0; i < length; i++)
        v.push_back(i);
    uint64_t sum=0;
    auto t1 = std::chrono::system_clock::now();
    for (size_t i=0;i<v.size();++i)
        sum+=v[i];
    //std::cout << sum << std::endl;
    auto t2 = std::chrono::system_clock::now();
    std::cout << "f() took "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count()
        << " milliseconds\n";
}
But when I uncomment the line that prints the sum, it prints out a reasonable number.
This is the behaviour I get with optimizations enabled; with them disabled I get the "normal" output
f() took 471 milliseconds
So is this standard compliant behaviour?
Important: it is not that dead code gets optimized away, I can see the lag when running from console, and I can see CPU spike in Task Manager.
My guess is that this is dead code optimization, and that your load spike is there because the work of initializing the vector isn't being optimized away, but the computation of your unused sum variable is.
But when I uncomment the line that prints the sum, it prints out a reasonable number.
That goes along with my theory, yes - when you're forced to use the result of the computation, the computation itself can't be optimized away.
If you want to confirm that further, make your program say when it's ready and pause for you to press return - that will allow you to wait for any CPU spike to be obviously "gone" before you press return, which will give you more confidence about what's causing it.
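As a rough illustration of "forcing the result to be used", here is a sketch where the timed loop's sum is returned to the caller together with the elapsed time. timedSum is a made-up name, and I've used steady_clock instead of system_clock since it is the usual choice for measuring intervals. In practice this is normally enough to keep the loop alive, although an optimizer is still free to precompute or reshuffle things it can prove are unobservable.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Sums 0..n-1 and returns both the sum and the elapsed milliseconds.
// Because the sum is handed back to the caller, it is no longer an unused value.
std::pair<std::uint64_t, long long> timedSum(std::size_t n) {
    assert(n > 0);
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);   // fill with 0, 1, ..., n-1 outside the timed part

    std::uint64_t sum = 0;
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < v.size(); ++i)
        sum += v[i];
    auto t2 = std::chrono::steady_clock::now();

    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    return {sum, ms};
}
```

Printing both members of the returned pair from main gives you the same effect as uncommenting the cout line in the question.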
