SHA256 Find Partial Collision - algorithm

I have two messages:
messageA: "Frank is one of the "best" students topicId{} "
messageB: "Frank is one of the "top" students topicId{} "
I need to find a partial SHA256 collision between these two messages (8 hex digits).
That is, the first 8 hex digits of SHA256(messageA) must equal the first 8 hex digits of SHA256(messageB).
We can put any letters and numbers in {}, but both {} must contain the same string.
I have tried brute force and a birthday attack with a hash table, but both take too much time. I know about cycle detection algorithms such as Floyd's and Brent's, but I have no idea how to construct the cycle for this problem. Are there any other methods to solve it? Thank you so much!

This is pretty trivial to solve with a birthday attack. Here's how I did it in Python (v2):
def find_collision(ntries):
    from hashlib import sha256
    str1 = 'Frank is one of the "best" students topicId{%d} '
    str2 = 'Frank is one of the "top" students topicId{%d} '
    seen = {}
    for n in xrange(ntries):
        h = sha256(str1 % n).digest()[:4].encode('hex')
        seen[h] = n
    for n in xrange(ntries):
        h = sha256(str2 % n).digest()[:4].encode('hex')
        if h in seen:
            print str1 % seen[h]
            print str2 % n
find_collision(100000)
If your attempt took too long to find a solution, then either you simply made a mistake in your coding somewhere, or you were using the wrong data type.
Python's dictionary data type is implemented using hash tables. That means you can search for dictionary elements in constant time. If you implemented seen using a list instead of a dict in the above code, then the search at line 11 would take an awful lot longer.
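For readers working in C++, the same two-pass search can be sketched with std::unordered_map, which gives the same average constant-time lookup. This is only an illustration under stated assumptions: `toyHash16` (32-bit FNV-1a folded to 16 bits) is a made-up stand-in for the truncated SHA-256 so that the sketch needs no external library, and `findCrossCollision` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for "first bytes of SHA-256": 32-bit FNV-1a,
// xor-folded down to 16 bits so that collisions show up quickly.
std::uint32_t toyHash16(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return (h >> 16) ^ (h & 0xFFFFu);
}

// Looks for n1, n2 in [0, ntries) such that
// toyHash16(prefixA + n1) == toyHash16(prefixB + n2).
// Returns {n1, n2}, or {-1, -1} if no such pair was found.
std::pair<int, int> findCrossCollision(const std::string& prefixA,
                                       const std::string& prefixB,
                                       int ntries) {
    assert(ntries >= 0);
    std::unordered_map<std::uint32_t, int> seen;  // hash -> n, O(1) average lookup
    for (int n = 0; n < ntries; ++n)
        seen[toyHash16(prefixA + std::to_string(n))] = n;
    for (int n = 0; n < ntries; ++n) {
        auto it = seen.find(toyHash16(prefixB + std::to_string(n)));
        if (it != seen.end())
            return {it->second, n};
    }
    return {-1, -1};
}
```

Swapping the map for a std::vector searched linearly would turn the second loop into quadratic work, which is the slowdown described above.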
Edit:
If the two topicId tokens have to be identical, then, as pointed out in the comments, there is little option but to grind through somewhere on the order of 2^31 values. You will find a collision eventually, but it could take a long time.
Just leave this running overnight and with a bit of luck you'll have an answer in the morning:
def find_collision():
    from hashlib import sha256
    str1 = 'Frank is one of the "best" students topicId{%x} '
    str2 = 'Frank is one of the "top" students topicId{%x} '
    seen = {}
    n = 0
    while True:
        if sha256(str1 % n).digest()[:4] == sha256(str2 % n).digest()[:4]:
            print str1 % n
            print str2 % n
            break
        n += 1
find_collision()
If you're in a hurry, you could maybe look into using a GPU to speed up the hash calculations.

I'm assuming the space at the end of the strings in the question was intentional so I left it in.
"Frank is one of the "top" students topicId{59220691223} "
6026d9b323898bcd7ecdbcbcd575b0a1d9dc22fd9e60074aefcbaade494a50ae
"Frank is one of the "best" students topicId{59220691223} "
6026d9b31ba780bb9973e7cfc8c9f74a35b54448d441a61cc9bf8db0fcae5280
It actually took about 7 billion tries to find one using brute force, a lot more than I expected.
Each try matches with probability 1/2^32, and 2^32 is roughly 4.3 billion, so the chance of not finding any match after 4.3 billion tries is about 36.8%.
The chance of finding no match in 7 billion tries was a little under 20%.
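The percentages above follow from treating every try as an independent event with match probability 2^-32. A small sketch of that calculation (the function name `probNoMatch` is made up for this example):

```cpp
#include <cassert>
#include <cmath>

// Probability that `tries` independent attempts all miss, when each attempt
// matches with probability 2^-bits (bits = 32 for 8 hex digits).
double probNoMatch(double tries, int bits) {
    assert(tries >= 0 && bits > 0);
    const double p = std::ldexp(1.0, -bits);  // 2^-bits
    // (1 - p)^tries, written with log1p to keep precision for tiny p.
    return std::exp(tries * std::log1p(-p));
}
```

Calling `probNoMatch(4294967296.0, 32)` gives roughly 0.368, and `probNoMatch(7e9, 32)` gives roughly 0.196.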
This is the C++ code I used running on 7 threads, each thread gets a different starting point and it quits once a match is found on any thread. Each thread also updates its progress to cout every 1 million attempts.
I've fast forwarded to where the match was found on threadId=5, so it takes less than a minute to run. But if you change the starting point you can look for other matches.
And I'm not sure either how one would use Floyd and Brent since the strings have to use the same topicId so you are locked in on both the prefix and suffix.
/*
To compile, get the picosha2 header file from https://github.com/okdshin/PicoSHA2
Copy this code into the same directory as the picosha2.h file and save it as hash.cpp, for example.
On Linux, go to the command line and cd to the directory where these files are.
To compile it:
g++ -std=c++11 -O2 -pthread -o hash hash.cpp
And run it:
./hash
*/
#include <atomic>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>
// I used the header-only picoSHA2 file for the hashing
// https://github.com/okdshin/PicoSHA2
#include "picosha2.h"
// Return the first 4 bytes (8 hex chars) of the SHA256 hash
std::string hash8(const std::string& src_str) {
    std::vector<unsigned char> hash(picosha2::k_digest_size);
    picosha2::hash256(src_str.begin(), src_str.end(), hash.begin(), hash.end());
    return picosha2::bytes_to_hex_string(hash.begin(), hash.begin() + 4);
}
std::atomic<bool> done(false);
std::mutex mtxCout;
void work(unsigned long long threadId) {
    std::string a = "Frank is one of the \"best\" students topicId{",
                b = "Frank is one of the \"top\" students topicId{";
    // Each thread gets a different starting point. I've fast forwarded to the part
    // where I found the match, so this won't take long to run if you try it (< 1 minute).
    // If you want it to run for a while, drop the last "+ 150000000ULL" term and it will
    // run for about 1 billion tries in total (150 million per thread, assuming 7 threads),
    // which takes about 30 minutes on Linux.
    // The collision occurred on threadId = 5, so if you change it to use fewer than
    // 6 threads then your mileage may vary.
    unsigned long long start = threadId * (11666666667ULL + 147000000ULL) + 150000000ULL;
    unsigned long long x = start;
    for (;;) {
        // "done" is atomic, so a match found on one thread reliably stops the others.
        // Writing to cout is guarded by the mutex.
        if (done) return;
        std::string xs = std::to_string(x++);
        std::string hashA = hash8(a + xs + "} "), hashB = hash8(b + xs + "} ");
        if (hashA == hashB) {
            std::lock_guard<std::mutex> lock(mtxCout);
            std::cout << "*** SOLVED ***" << std::endl;
            std::cout << (x - 1) << std::endl;
            std::cout << "\"" << a << (x - 1) << "} \" = " << hashA << std::endl;
            std::cout << "\"" << b << (x - 1) << "} \" = " << hashB << std::endl;
            done = true;
            return;
        }
        if (((x - start) % 1000000ULL) == 0) {
            std::lock_guard<std::mutex> lock(mtxCout);
            std::cout << "thread: " << threadId << " = " << (x - start)
                      << " tries so far" << std::endl;
        }
    }
}
void runBruteForce() {
    const int NUM_THREADS = 7;
    std::thread threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) threads[i] = std::thread(work, i);
    for (int i = 0; i < NUM_THREADS; i++) threads[i].join();
}
int main(int argc, char** argv) {
    runBruteForce();
    return 0;
}


Boost R tree node remove

I want to remove the nearest point node, but only if it is within a distance limit.
I think my code is not efficient.
How can I improve it?
for (int j = 0; j < 3; j++) {
    bgi::rtree< value, bgi::quadratic<16> > nextRT;
    // search for nearest neighbours
    std::vector<value> matchPoints;
    vector<pair<float, float>> pointList;
    for (unsigned i = 0; i < keypoints[j + 1].size(); ++i) {
        point p = point(keypoints[j + 1][i].pt.x, keypoints[j + 1][i].pt.y);
        nextRT.insert(std::make_pair(p, i));
        RT.query(bgi::nearest(p, 1), std::back_inserter(matchPoints));
        if (bg::distance(p, matchPoints.back().first) > 3) matchPoints.pop_back();
        else {
            pointList.push_back(make_pair(keypoints[j + 1][i].pt.x, keypoints[j + 1][i].pt.y));
            RT.remove(matchPoints.back());
        }
    }
}
I am also curious about the contents of matchPoints.
After the query function runs, there are values in matchPoints.
The first member is a point, and the second looks like some index number.
I don't know what the second one means.
Q. I am also curious about the contents of matchPoints. After the query function runs, there are values in matchPoints. The first member is a point, and the second looks like some index number. I don't know what the second one means.
Well, that's got to be a data member in your value type. What is in it depends solely on what you inserted into the rtree. It wouldn't surprise me if it was an ID that describes the geometry.
Since you do not even show the type of RT, we can only assume it is the same as nextRT. If so, we can assume that value is likely a pair like pair<point, unsigned> (because of what you insert). So, look at what got inserted for the unsigned value of the pair in RT...
Q.
if (bg::distance(p, matchPoints.back().first) > 3) matchPoints.pop_back();
else {
pointList.push_back(make_pair(keypoints[j + 1][i].pt.x, keypoints[j + 1][i].pt.y));
rtree.remove(matchPoints.back());
}
Simplify your code! Distilling the requirements:
It looks to me that for 4 sets of "key points", you want to create 4 rtrees containing all those key points with sequentially increasing ids.
Also for those 4 sets of "key points", you want to create a list of the key points for which a geometry can be found within a radius of 3.
As a side-effect, remove those closely-matching geometries from the original rtree RT.
DECISION: Because these tasks are independent, let's do them separately:
// making up types that match the usage in your code:
struct keypoint_t { point pt; };
std::array<std::vector<keypoint_t>, 4> keypoints;
Now, let's do the tasks:
Note how RT is not used here:
for (auto const& current_key_set : keypoints) {
bgi::rtree< value, bgi::quadratic<16> > nextRT; // use a better name...
int i = 0;
for (auto const& kpd : current_key_set)
nextRT.insert(std::make_pair(kpd.pt, i++));
}
Creating the vector containing matched key-points (those with near geometries in RT):
for (auto const& current_key_set : keypoints) {
std::vector<point> matched_key_points;
for (auto const& kpd : current_key_set) {
point p = kpd.pt;
value match;
if (!RT.query(bgi::nearest(p, 1), &match))
continue;
if (bg::distance(p, match.first) <= 3) {
matched_key_points.push_back(p);
RT.remove(match);
}
}
}
Ironically, removing the matching geometries from RT became a bit of a minor issue in this: you can either delete by iterator or by a value. In this case, we use the overload that takes a value.
Summary
It was hard to understand the code enough to see what it did. I have shown how to clean up the code, and make it work. Maybe these aren't the things you need, but hopefully using the better separated code, you should be able to get further.
Note that the algorithms have side effects. This makes it hard to understand what really will happen. E.g.:
removing points from the original RT affects what the subsequent key points (even from subsequent sets (next j)) can match with
if you have the same key point multiple times, they may match more than 1 source RT point (because after removal of the first match, there might be a second match within radius 3)
key points are checked strictly sequentially. This means that if the first keypoint roughly matches a point X, this might cause a later keypoint to fail to match, even though the point X might be closer to that keypoint...
I'd suggest you THINK about the requirements really hard before implementing things with these side-effects. Study the sample cases in the live demo below. If all these side-effects are exactly what you wanted, be sure to use much better naming and proper comments to describe what the code is doing.
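The order dependence is easy to reproduce without Boost. The following is a minimal sketch under stated assumptions: `Pt`, `dist` and `greedyMatch` are made-up names, and a brute-force nearest-neighbour scan stands in for the rtree query.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { float x, y; };

float dist(const Pt& a, const Pt& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Brute-force stand-in for the rtree loop: for each key (in order), find the
// nearest remaining point; if it lies within `radius`, record the key and
// remove that point. `points` is taken by value, so the caller's copy survives.
std::vector<Pt> greedyMatch(const std::vector<Pt>& keys, std::vector<Pt> points, float radius) {
    assert(radius >= 0);
    std::vector<Pt> matched;
    for (const Pt& k : keys) {
        if (points.empty()) break;
        std::size_t best = 0;
        for (std::size_t i = 1; i < points.size(); ++i)
            if (dist(k, points[i]) < dist(k, points[best])) best = i;
        if (dist(k, points[best]) <= radius) {
            matched.push_back(k);
            points.erase(points.begin() + static_cast<std::ptrdiff_t>(best));
        }
    }
    return matched;
}
```

With a single source point at (0,0), radius 3, and the keys (2,0) and (0.5,0), whichever key comes first takes the point, even when the later key is closer to it.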
Live Demo
Live On Coliru
#include <boost/geometry.hpp>
#include <boost/geometry/io/io.hpp>
#include <boost/geometry/index/rtree.hpp>
#include <iostream>
namespace bg = boost::geometry;
namespace bgi = bg::index;
typedef bg::model::point<float, 2, bg::cs::cartesian> point;
typedef std::pair<point, unsigned> pvalue;
typedef pvalue value;
int main() {
bgi::rtree< value, bgi::quadratic<16> > RT;
{
int i = 0;
for (auto p : { point(2.0f, 2.0f), point(2.5f, 2.5f) })
RT.insert(std::make_pair(p, i++));
}
struct keypoint_t { point pt; };
using keypoints_t = std::vector<keypoint_t>;
keypoints_t const keypoints[] = {
keypoints_t{ keypoint_t { point(-2, 2) } }, // should not match anything
keypoints_t{ keypoint_t { point(-1, 2) } }, // should match (2,2)
keypoints_t{ keypoint_t { point(2.0, 2.0) }, // matches (2.5,2.5)
{ point(2.5, 2.5) }, // nothing anymore...
},
};
for (auto const& current_key_set : keypoints) {
bgi::rtree< pvalue, bgi::quadratic<16> > nextRT; // use a better name...
int i = 0;
for (auto const& kpd : current_key_set)
nextRT.insert(std::make_pair(kpd.pt, i++));
}
for (auto const& current_key_set : keypoints) {
std::cout << "-----------\n";
std::vector<point> matched_key_points;
for (auto const& kpd : current_key_set) {
point p = kpd.pt;
std::cout << "Key: " << bg::wkt(p) << "\n";
value match;
if (!RT.query(bgi::nearest(p, 1), &match))
continue;
if (bg::distance(p, match.first) <= 3) {
matched_key_points.push_back(p);
std::cout << "\tRemoving close point: " << bg::wkt(match.first) << "\n";
RT.remove(match);
}
}
std::cout << "\nMatched keys: ";
for (auto& p : matched_key_points)
std::cout << bg::wkt(p) << " ";
std::cout << "\n\tElements remaining: " << RT.size() << "\n";
}
}
Prints
-----------
Key: POINT(-2 2)
Matched keys:
Elements remaining: 2
-----------
Key: POINT(-1 2)
Removing close point: POINT(2 2)
Matched keys: POINT(-1 2)
Elements remaining: 1
-----------
Key: POINT(2 2)
Removing close point: POINT(2.5 2.5)
Key: POINT(2.5 2.5)
Matched keys: POINT(2 2)
Elements remaining: 0

std::default_random_engine gives the same result for different seeds [duplicate]

The code below is meant to generate a list of five pseudo-random numbers in the interval [1,100]. I seed the default_random_engine with time(0), which returns the system time in unix time. When I compile and run this program on Windows 7 using Microsoft Visual Studio 2013, it works as expected (see below). When I do so in Arch Linux with the g++ compiler, however, it behaves strangely.
In Linux, 5 numbers will be generated each time. The last 4 numbers will be different on each execution (as will often be the case), but the first number will stay the same.
Example output from 5 executions on Windows and Linux:
      | Windows:       | Linux:
---------------------------------------
Run 1 | 54,01,91,73,68 | 25,38,40,42,21
Run 2 | 46,24,16,93,82 | 25,78,66,80,81
Run 3 | 86,36,33,63,05 | 25,17,93,17,40
Run 4 | 75,79,66,23,84 | 25,70,95,01,54
Run 5 | 64,36,32,44,85 | 25,09,22,38,13
Adding to the mystery, that first number periodically increments by one on Linux. After obtaining the above outputs, I waited about 30 minutes and tried again to find that the 1st number had changed and now was always being generated as a 26. It has continued to increment by 1 periodically and is now at 32. It seems to correspond with the changing value of time(0).
Why does the first number rarely change across runs, and then when it does, increment by 1?
The code. It neatly prints out the 5 numbers and the system time:
#include <iostream>
#include <random>
#include <time.h>
using namespace std;
int main()
{
const int upper_bound = 100;
const int lower_bound = 1;
time_t system_time = time(0);
default_random_engine e(system_time);
uniform_int_distribution<int> u(lower_bound, upper_bound);
cout << '#' << '\t' << "system time" << endl
<< "-------------------" << endl;
for (int counter = 1; counter <= 5; counter++)
{
int secret = u(e);
cout << secret << '\t' << system_time << endl;
}
system("pause");
return 0;
}
Here's what's going on:
default_random_engine in libstdc++ (GCC's standard library) is minstd_rand0, which is a simple linear congruential engine:
typedef linear_congruential_engine<uint_fast32_t, 16807, 0, 2147483647> minstd_rand0;
The way this engine generates random numbers is x_{i+1} = (16807 * x_i + 0) mod 2147483647.
Therefore, if the seeds are different by 1, then most of the time the first generated number will differ by 16807.
The range of this generator is [1, 2147483646]. The way libstdc++'s uniform_int_distribution maps it to an integer in the range [1, 100] is essentially this: generate a number n. If the number is not greater than 2147483600, then return (n - 1) / 21474836 + 1; otherwise, try again with a new number. It should be easy to see that in the vast majority of cases, two ns that differ by only 16807 will yield the same number in [1, 100] under this procedure. In fact, one would expect the generated number to increase by one about every 21474836 / 16807 = 1278 seconds or 21.3 minutes, which agrees pretty well with your observations.
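This is easy to check with std::minstd_rand0 directly, which is the same engine on every implementation. A small sketch (the helper names `firstRaw` and `firstBucket` are made up here; `firstBucket` applies the mapping described above and leaves out the retry branch):

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// First raw value produced by minstd_rand0 for a given seed.
std::uint_fast32_t firstRaw(std::uint_fast32_t seed) {
    std::minstd_rand0 engine(seed);
    return engine();
}

// The [1, 100] mapping described above, applied to the first raw value.
// The retry branch for n > 2147483600 is omitted in this sketch.
int firstBucket(std::uint_fast32_t seed) {
    const std::uint_fast32_t n = firstRaw(seed);
    assert(n <= 2147483600u);
    return static_cast<int>((n - 1) / 21474836u + 1);
}
```

For example, seeds 1000 and 1001 give raw values 16807000 and 16823807, which differ by exactly 16807 and land in the same bucket.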
MSVC's default_random_engine is mt19937, which doesn't have this problem.
The std::default_random_engine is implementation defined. Use std::mt19937 or std::mt19937_64 instead.
In addition, std::time and the ctime functions typically have only one-second resolution, so use the clocks defined in the <chrono> header instead:
#include <iostream>
#include <random>
#include <chrono>
int main()
{
const int upper_bound = 100;
const int lower_bound = 1;
auto t = std::chrono::high_resolution_clock::now().time_since_epoch().count();
std::mt19937 e;
e.seed(static_cast<unsigned int>(t)); //Seed engine with timed value.
std::uniform_int_distribution<int> u(lower_bound, upper_bound);
std::cout << '#' << '\t' << "system time" << std::endl
<< "-------------------" << std::endl;
for (int counter = 1; counter <= 5; counter++)
{
int secret = u(e);
std::cout << secret << '\t' << t << std::endl;
}
system("pause");
return 0;
}
In Linux, the random function is not a random function in the probabilistic sense of the word, but a pseudo-random number generator.
It is salted with a seed, and based on that seed, the numbers that are produced are pseudo random and uniformly distributed.
The Linux way has the advantage that in the design of certain experiments using information from populations, that the repeat of the experiment with known tweaking of input information can be measured. When the final program is ready for real-life testing, the salt (seed), can be created by asking for the user to move the mouse, mix the mouse movement with some keystrokes and add in a dash of microsecond counts since the beginning of the last power on.
Windows random number seed is obtained from the collection of mouse, keyboard, network and time of day numbers. It is not repeatable. But this salt value may be reset to a known seed, if as mentioned above, one is involved in the design of an experiment.
Oh yes, Linux has two random number generators. One, the default is modulo 32bits, and the other is modulo 64bits. Your choice depends on the accuracy needs and amount of compute time you wish to consume for your testing or actual use.

Is fftw output depending on size of input?

Over the last week I have been programming some 2-dimensional convolutions with FFTW, by transforming both signals to the frequency domain, multiplying, and then transforming back.
Surprisingly, I am getting the correct result only when the input size is less than a fixed number!
I am posting some working code, in which I take simple constant matrices as input: value 2 for the input and 1 for the filter in the spatial domain. This way, the result of convolving them should be a matrix of the average of the first matrix's values, i.e., 2, since it is constant. I get the expected output for any width and height up to h=215, w=215. If I set h=216, w=216, or greater, the output gets corrupted! I would really appreciate some clues about where I could be making a mistake. Thank you very much!
#include <fftw3.h>
#include <iostream>
using std::endl;
int main(int argc, char* argv[]) {
int h=215, w=215;
//Input and 1 filter are declared and initialized here
float *in = (float*) fftwf_malloc(sizeof(float)*w*h);
float *identity = (float*) fftwf_malloc(sizeof(float)*w*h);
for(int i=0;i<w*h;i++){
in[i]=5;
identity[i]=1;
}
//Declare two forward plans and one backward
fftwf_plan plan1, plan2, plan3;
//Allocate for complex output of both transforms
fftwf_complex *inTrans = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex)*h*(w/2+1));
fftwf_complex *identityTrans = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex)*h*(w/2+1));
//Initialize forward plans
plan1 = fftwf_plan_dft_r2c_2d(h, w, in, inTrans, FFTW_ESTIMATE);
plan2 = fftwf_plan_dft_r2c_2d(h, w, identity, identityTrans, FFTW_ESTIMATE);
//Execute them
fftwf_execute(plan1);
fftwf_execute(plan2);
//Multiply in frequency domain. Theoretically, no need to multiply imaginary parts; since signals are real and symmetric
//their transforms are also real, identityTrans[i][1] = 0, but I leave this here for a more generic implementation.
for(int i=0; i<(w/2+1)*h; i++){
//Keep the old values so the imaginary part is not computed from the already-updated real part
float re = inTrans[i][0], im = inTrans[i][1];
inTrans[i][0] = re*identityTrans[i][0] - im*identityTrans[i][1];
inTrans[i][1] = re*identityTrans[i][1] + im*identityTrans[i][0];
}
//Execute inverse transform, store result in identity, where identity filter lied.
plan3 = fftwf_plan_dft_c2r_2d(h, w, inTrans, identity, FFTW_ESTIMATE);
fftwf_execute(plan3);
//Output first results of convolution(in, identity) to see if they are the average of in.
for(int i=0;i<h/h+4;i++){
for(int j=0;j<w/w+4;j++){
std::cout<<"After convolution, component (" << i <<","<< j << ") is " << identity[j+i*w]/(w*h*w*h) << endl;
}
}std::cout<<endl;
//Compute average of data
float sum=0.0;
for(int i=0; i<w*h;i++)
sum+=in[i];
std::cout<<"Mean of input was " << (float)sum/(w*h) << endl;
std::cout<< endl;
fftwf_destroy_plan(plan1);
fftwf_destroy_plan(plan2);
fftwf_destroy_plan(plan3);
return 0;
}
Your problem has nothing to do with fftw! It comes from this line:
std::cout<<"After convolution, component (" << i <<","<< j << ") is " << identity[j+i*w]/(w*h*w*h) << endl;
If w=216 and h=216, then w*h*w*h = 2 176 782 336. The upper limit for a signed 32-bit integer is 2 147 483 647. You are facing an overflow...
The solution is to cast the denominator to float.
std::cout<<"After convolution, component (" << i <<","<< j << ") is " << identity[j+i*w]/(((float)w)*h*w*h) << endl;
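To see why 215 still works and 216 does not, the product can be computed in 64 bits and compared against the 32-bit limit. A minimal sketch (the function names `denominator64` and `fitsInInt32` are made up for this example):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// w*h*w*h computed in 64 bits, where it cannot overflow for image-sized inputs.
std::int64_t denominator64(int w, int h) {
    assert(w > 0 && h > 0);
    return static_cast<std::int64_t>(w) * h * w * h;
}

// True if the same product would still fit in a 32-bit signed int.
bool fitsInInt32(int w, int h) {
    return denominator64(w, h) <= std::numeric_limits<std::int32_t>::max();
}
```

For w=h=215 the product is 2 136 750 625, which still fits; for w=h=216 it is 2 176 782 336, which does not.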
The next trouble that you are going to face is this one :
float sum=0.0;
for(int i=0; i<w*h;i++)
sum+=in[i];
Remember that a float has only about 7 significant decimal digits. If w=h=4000, the computed average will be lower than the real one. Use a double, or write two loops and sum over the inner loop (localsum) before adding it to the outer sum (sum+=localsum)!
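The effect of the accumulator type can be seen in isolation with a tiny experiment. This is only a sketch; `sumRepeated` is a made-up helper that mimics the summation loop with a selectable accumulator type:

```cpp
#include <cassert>
#include <cstddef>

// Adds `value` to an accumulator of type Acc `count` times, mimicking the
// summation loop above with a choice of accumulator precision.
template <typename Acc>
Acc sumRepeated(float value, std::size_t count) {
    assert(count > 0);
    Acc sum = 0;
    for (std::size_t i = 0; i < count; ++i)
        sum += value;
    return sum;
}
```

On a typical IEEE-754 platform, `sumRepeated<float>(1.0f, 20000000)` stalls at 16777216 (2^24), because adding 1 no longer changes the float, while `sumRepeated<double>(1.0f, 20000000)` returns the exact 20000000.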
Bye,
Francis

How to partly sort arrays on CUDA?

Problem
Provided I have two arrays:
const int N = 1000000;
float A[N];
myStruct *B[N];
The numbers in A can be positive or negative (e.g. A[N]={3,2,-1,0,5,-2}). How can I partly sort the array A on the GPU, so that all positive values come first (in any order), followed by the negative values (e.g. A[N]={3,2,5,0,-1,-2} or A[N]={5,2,3,0,-2,-1})? The array B should be reordered along with A (A holds the keys, B the values).
Since A and B can be very large, I think the sort should run on the GPU (specifically with CUDA, because that is the platform I use). I know thrust::sort_by_key can do this, but it does much extra work, since I do not need A and B to be sorted entirely.
Has anyone come across this kind of problem?
Thrust example
thrust::sort_by_key(thrust::device_ptr<float> (A),
thrust::device_ptr<float> ( A + N ),
thrust::device_ptr<myStruct> ( B ),
thrust::greater<float>() );
Thrust's documentation on Github is not up-to-date. As @JaredHoberock said, thrust::partition is the way to go since it now supports stencils. You may need to get a copy from the Github repository:
git clone git://github.com/thrust/thrust.git
Then run scons doc in the Thrust folder to build the updated documentation, and use these updated Thrust sources when compiling your code (nvcc -I/path/to/thrust ...). With the new stencil partition, you can do:
#include <thrust/partition.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
struct is_positive
{
__host__ __device__
bool operator()(const int &x)
{
return x >= 0;
}
};
thrust::partition(thrust::host, // if you want to test on the host
thrust::make_zip_iterator(thrust::make_tuple(keyVec.begin(), valVec.begin())),
thrust::make_zip_iterator(thrust::make_tuple(keyVec.end(), valVec.end())),
keyVec.begin(),
is_positive());
This returns:
Before:
keyVec = 0 -1 2 -3 4 -5 6 -7 8 -9
valVec = 0 1 2 3 4 5 6 7 8 9
After:
keyVec = 0 2 4 6 8 -5 -3 -7 -1 -9
valVec = 0 2 4 6 8 5 3 7 1 9
Note that the 2 partitions are not necessarily sorted. Also, the order may differ between the original vectors and the partitions. If this is important to you, you can use thrust::stable_partition:
stable_partition differs from partition in that stable_partition is
guaranteed to preserve relative order. That is, if x and y are
elements in [first, last), such that pred(x) == pred(y), and if x
precedes y, then it will still be true after stable_partition that x
precedes y.
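The same guarantee holds for the standard library's std::stable_partition, so the behaviour can be tried on the host without Thrust. A minimal sketch, assuming key/value pairs stored together (`KeyVal` and `stablePartitionByKey` are names made up for this example):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using KeyVal = std::pair<int, int>;  // {key, value}

// Moves all pairs with a non-negative key to the front, keeping the relative
// order inside each group. Returns the number of non-negative keys.
std::size_t stablePartitionByKey(std::vector<KeyVal>& items) {
    auto nonNegativeKey = [](const KeyVal& kv) { return kv.first >= 0; };
    auto mid = std::stable_partition(items.begin(), items.end(), nonNegativeKey);
    assert(std::is_partitioned(items.begin(), items.end(), nonNegativeKey));
    return static_cast<std::size_t>(mid - items.begin());
}
```

With the keys 0 -1 2 -3 4 -5 6 -7 8 -9 and values 0..9 from the example above, the values come out as 0 2 4 6 8 1 3 5 7 9, i.e. both groups keep their original order.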
If you want a complete example, here it is:
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/partition.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
struct is_positive
{
__host__ __device__
bool operator()(const int &x)
{
return x >= 0;
}
};
void print_vec(const thrust::host_vector<int>& v)
{
for(size_t i = 0; i < v.size(); i++)
std::cout << " " << v[i];
std::cout << "\n";
}
int main ()
{
const int N = 10;
thrust::host_vector<int> keyVec(N);
thrust::host_vector<int> valVec(N);
int sign = 1;
for(int i = 0; i < N; ++i)
{
keyVec[i] = sign * i;
valVec[i] = i;
sign *= -1;
}
// Copy host to device
thrust::device_vector<int> d_keyVec = keyVec;
thrust::device_vector<int> d_valVec = valVec;
std::cout << "Before:\n keyVec = ";
print_vec(keyVec);
std::cout << " valVec = ";
print_vec(valVec);
// Partition key-val on device
thrust::partition(thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.begin(), d_valVec.begin())),
thrust::make_zip_iterator(thrust::make_tuple(d_keyVec.end(), d_valVec.end())),
d_keyVec.begin(),
is_positive());
// Copy result back to host
keyVec = d_keyVec;
valVec = d_valVec;
std::cout << "After:\n keyVec = ";
print_vec(keyVec);
std::cout << " valVec = ";
print_vec(valVec);
}
UPDATE
I made a quick comparison with the thrust::sort_by_key version, and the thrust::partition implementation does seem to be faster (which is what we could naturally expect). Here is what I obtain on NVIDIA Visual Profiler, with N = 1024 * 1024, with the sort version on the left, and the partition version on the right. You may want to do the same kind of tests on your own.
How about this?:
Count how many positive numbers to determine the inflexion point
Evenly divide each side of the inflexion point into groups (negative-groups are all same length but different length to positive-groups. these groups are the memory chunks for the results)
Use one kernel call (one thread) per chunk pair
Each kernel swaps any out-of-place elements in the input groups into the desired output groups. You will need to flag any chunks that have more swaps than the maximum so that you can fix them during subsequent iterations.
Repeat until done
Memory traffic is swaps only (from original element position, to sorted position). I don't know if this algorithm sounds like anything already defined...
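A sequential sketch of the count-then-swap idea, written for the host so that the logic can be checked before it is split into chunks and kernels. This is a simplification under stated assumptions: it runs in a single thread, `partitionBySwaps` is a made-up name, and the values are plain ints instead of myStruct pointers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Counts the non-negative keys to find the split point, then swaps misplaced
// elements across it. Values are swapped together with their keys.
// Returns the split point (number of non-negative keys).
std::size_t partitionBySwaps(std::vector<float>& keys, std::vector<int>& vals) {
    assert(keys.size() == vals.size());
    const std::size_t split = static_cast<std::size_t>(
        std::count_if(keys.begin(), keys.end(), [](float k) { return k >= 0.0f; }));
    std::size_t left = 0;       // scans [0, split) for negative keys
    std::size_t right = split;  // scans [split, n) for non-negative keys
    while (left < split && right < keys.size()) {
        if (keys[left] >= 0.0f) { ++left; continue; }   // already in place
        if (keys[right] < 0.0f) { ++right; continue; }  // already in place
        std::swap(keys[left], keys[right]);
        std::swap(vals[left], vals[right]);
        ++left;
        ++right;
    }
    return split;
}
```

For the keys {3,2,-1,0,5,-2} from the question this performs a single swap and yields {3,2,5,0,-1,-2}.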
You should be able to achieve this in thrust simply with a modification of your comparison operator:
struct my_compare
{
    // Strict weak ordering: x goes before y only when x is non-negative and y is negative
    __device__ __host__ bool operator()(const float x, const float y) const
    {
        return (x >= 0.0f) && (y < 0.0f);
    }
};
thrust::sort_by_key(thrust::device_ptr<float> (A),
                    thrust::device_ptr<float> ( A + N ),
                    thrust::device_ptr<myStruct> ( B ),
                    my_compare() );

Range-based for loop with boost::adaptors::indexed

The C++11 range-based for loop dereferences the iterator. Does that mean that it makes no sense to use it with boost::adaptors::indexed? Example:
auto numbers = boost::counting_range(10, 20);
for(auto i : numbers | indexed(0)) {
    cout << "number = " << i
    /* << " | index = " << i.index() */ // i is an integer!
    << "\n";
}
I can always use a counter but I like indexed iterators.
Is it possible to use them somehow with range-based for loops?
What is the idiom for using range-based loops with an index? (just a plain counter?)
This was fixed in Boost 1.56 (released August 2014); the element is indirected behind a value_type with index() and value() member functions.
Example: http://coliru.stacked-crooked.com/a/e95bdff0a9d371ea
auto numbers = boost::counting_range(10, 20);
for (auto i : numbers | boost::adaptors::indexed())
std::cout << "number = " << i.value()
<< " | index = " << i.index() << "\n";
It seems more useful when iterating over a collection, where you may need the index position (to print the item number, if nothing else):
#include <boost/range/adaptors.hpp>
std::vector<std::string> list = {"boost", "adaptors", "are", "great"};
for (auto v: list | boost::adaptors::indexed(0)) {
printf("%ld: %s\n", v.index(), v.value().c_str());
}
Prints:
0: boost
1: adaptors
2: are
3: great
Any innovation for simply iterating over an integer range is strongly challenged by the classic for loop, which is still a very strong competitor:
for (int a = 10; a < 20; a++)
While this can be twisted up in a number of ways, it is not so easy to propose something that is obviously much more readable.
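For completeness, the plain-counter idiom asked about in the question looks roughly like this with a range-based for loop and no Boost at all. A minimal sketch; the function name `numberItems` is made up for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Range-based for with a hand-maintained counter: builds "index: item" lines.
std::vector<std::string> numberItems(const std::vector<std::string>& items) {
    std::vector<std::string> lines;
    std::size_t index = 0;
    for (const std::string& item : items) {
        lines.push_back(std::to_string(index) + ": " + item);
        ++index;
    }
    assert(lines.size() == items.size());
    return lines;
}
```

The counter lives outside the loop, which is the main drawback compared with an indexed adaptor: its scope is wider than the loop that uses it.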
The short answer (as everyone in the comments mentioned) is "right, it makes no sense." I have also found this annoying. Depending on your programming style, you might like the "zipfor" package I wrote (just a header): from github
It allows syntax like
std::vector v;
zipfor(x,i eachin v, icounter) {
// use x as dereferenced element of v
// and i as index
}
Unfortunately, I cannot figure a way to use the ranged-based for syntax and have to resort to the "zipfor" macro :(
The header was originally designed for things like
std::vector v,w;
zipfor(x,y eachin v,w) {
// x is element of v
// y is element of w (both iterated in parallel)
}
and
std::map m;
mapfor(k,v eachin m)
// k is key and v is value of pair in m
My tests on g++ 4.8 with full optimizations show that the resulting code is no slower than writing it by hand.
