STXXL: limited parallelism during sorting?

STXXL: limited parallelism during sorting? - sorting

I populate a very large array using a stxxl::VECTOR_GENERATOR<MyData>::result::bufwriter_type (something like 100M entries) which I need to sort in parallel.
I use the stxxl::sort(vector->begin(), vector->end(), cmp(), memoryAmount) method, which in theory should do what I need: sort the elements very efficiently.
However, during the execution of this method I noticed that only one processor is fully utilised, and all the other cores are quite idle (I suspect there is little activity to fetch the input, but in practice they don't do anything).
This is my question: is it possible to exploit more cores during the sorting phase, or is the parallelism used only to fetch the input asynchronously? If so, are there documents that explain how to enable it? (I looked extensively the documentation on the website, but I couldn't find anything).
Thanks very much!
EDIT
Thanks for the suggestion. I provide below some more information.
First of all I use MacOs for my experiments. What I do is that I launch the following program and I study its behaviour.
typedef struct Triple {
long t1, t2, t3;
Triple(long s, long p, long o) {
this->t1 = s;
this->t2 = p;
this->t3 = o;
}
Triple() {
t1 = t2 = t3 = 0;
}
} Triple;
const Triple minv(std::numeric_limits<long>::min(),
std::numeric_limits<long>::min(), std::numeric_limits<long>::min());
const Triple maxv(std::numeric_limits<long>::max(),
std::numeric_limits<long>::max(), std::numeric_limits<long>::max());
struct cmp: std::less<Triple> {
bool operator ()(const Triple& a, const Triple& b) const {
if (a.t1 < b.t1) {
return true;
} else if (a.t1 == b.t1) {
if (a.t2 < b.t2) {
return true;
} else if (a.t2 == b.t2) {
return a.t3 < b.t3;
}
}
return false;
}
Triple min_value() const {
return minv;
}
Triple max_value() const {
return maxv;
}
};
typedef stxxl::VECTOR_GENERATOR<Triple>::result vector_type;
int main(int argc, const char** argv) {
vector_type vector;
vector_type::bufwriter_type writer(vector);
for (int i = 0; i < 1000000000; ++i) {
if (i % 10000000 == 0)
std::cout << "Inserting element " << i << std::endl;
Triple t;
t.t1 = rand();
t.t2 = rand();
t.t3 = rand();
writer << t;
}
writer.finish();
//Sort the vector
stxxl::sort(vector.begin(), vector.end(), cmp(), 1024*1024*1024);
std::cout << vector.size() << std::endl;
}
Indeed there seems to be only one or maximum two threads working during the execution of this program. Notice that the machine has only a single disk.
Can you please confirm me whether the parallelism work on macos? If not, then I will try to use linux to see what happens. Or is perhaps because there is only one disk?

In principle what you are doing should work out-of-the-box. With everything working you should see all cores doing processing.
Since it doesnt work, we'll have to find the error, and debugging why we see no parallel speedups is still tricky business these days.
The main idea is to go from small to large examples:
what platform is this? There is no parallelism on MSVC, only on Linux/gcc.
By default STXXL builds on Linux/gcc with USE_GNU_PARALLEL. you can turn it off to see if it has an effect.
Try reproducing the example values shown in http://stxxl.sourceforge.net/tags/master/stxxl_tool.html - with and without USE_GNU_PARALLEL
See if just in memory parallel sorting scales on your processor/system.

Related

How to get gtest results?

How to get the result of EXPECT_EXIT() in gtest?
I want to do something according to the results of EXPECT_EXIT. For example:
for(int i = 0; i < 1000; i++) {
for(int j = 0; j < 10; j++) {
auto &result = EXPECT_EXIT(Function(i, inputs[j]));
if (result) {
do something;
break; //jump out of the loop
}
}
}

From your comment:
But once it failed, I need record something and skip rest of the input
check for this function
Short answer:
To stop execution at first mismatch you need to use ASSERT version instead of EXPECT, in your case ASSERT_EXIT.
To output some extra data for failed check you can just use stream operator of check:
ASSERT_EXIT(Function(i, j), testing::ExitedWithCode(0), "") << "Some custom data: " << i << "; " << j;
If you want to do something more on fail - you can use ::testing::Test::HasFailure() function.
Long suggestion:
One of the properties of good unittest is simplicity. That implies absence of any loops or branches. Setup logic (if any) should be moved to fixtures. Also unittests must be executed in isolation.
By switching your test to parametric one you would cover that. Check this:
void Function(int i, int j)
{
if (j != 3)
exit(0);
}
class FunTest : public testing::TestWithParam<std::tuple<int, int>> { };
TEST_P(FunTest, Should_Exit)
{
const auto i = std::get<0>(GetParam());
const auto j = std::get<1>(GetParam());
EXPECT_EXIT(Function(i, j), testing::ExitedWithCode(0), "");
}
INSTANTIATE_TEST_SUITE_P(,
FunTest,
testing::Combine(
testing::Range(1, 20),
testing::Range(1, 10)),
[](const testing::TestParamInfo<std::tuple<int, int>>& info) -> std::string {
const auto i = std::get<0>(info.param);
const auto j = std::get<1>(info.param);
return std::to_string(i) + '_' + std::to_string(j);
});
This approach would help with your problem: you can stop execution by using --gtest_break_on_failure and you can control naming of test cases by implementing custom printer for specific TestParamInfo (lambda in example).
It would also guarantee isolation and provide certain level of simplicity (all complex stuff provided by framework, imho test code is minimalistic and simple).
Also this would allow execution of specific test using gtest_filter
(e.g. --gtest_filter=FunTest.Should_Exit/4_3).

Passing a temporary stream object to a lambda function as part of an extraction expression

I have a function which needs to parse some arguments and several if clauses inside it need to perform similar actions. In order to reduce typing and help keep the code readable, I thought I'd use a lambda to encapsulate the recurring actions, but I'm having trouble finding sufficient info to determine whether I'm mistakenly invoking undefined behavior or what I need to do to actualize my approach.
Below is a simplified code snippet of what I have currently:
int foo(int argc, char* argv[])
{
Using ss = std::istringstream;
auto sf = [&](ss&& stream) -> ss& {
stream.exceptions(ss::failbit);
return stream;
};
int retVal = 0;
bool valA = false;
bool valB = false;
try
{
for(int i=1; i < argc; i++)
{
std::string arg( argv[i] );
if( !valA )
{
valA = true;
sf( ss(arg) ) >> myInt;
}
else
if( !valB )
{
valB = true;
sf( ss(arg) ) >> std::hex >> myOtherInt;
}
}
}
catch( std::exception& err )
{
retVal = -1;
std::cerr << err.what() << std::endl;
}
return retVal;
}
First, based on what I've read, I don't think that specifying the lambda argument as an rvalue reference (ss&&) is doing quite what I want it to do, however, trying to compile with it declared as a normal reference (ss&) failed with the error cannot bind non-const lvalue reference of type 'ss&'. Changing ss& to ss&& got rid of the error and did not produce any warnings, but I'm not convinced that I'm using that construct correctly.
I've tried reading up on the various definitions for each, but the wording is a bit confusing.
I guess ultimately my questions are:
Can I expect the lifetime of my temporary ss(arg) object to extend through the entire extraction expression?
What is the correct way to define a lambda such that I can use the lambda in the way I demonstrate above, assuming that such a thing is actually possible?

node.js c++ addon - afraid of memory leak

first of all I admit I'm a newbie in C++ addons for node.js.
I'm writing my first addon and I reached a good result: the addon does what I want. I copied from various examples I found in internet to exchange complex data between the two languages, but I understood almost nothing of what I wrote.
The first thing scaring me is that I wrote nothing that seems to free some memory; another thing which is seriously worrying me is that I don't know if what I wrote may helps or creating confusion for the V8 garbage collector; by the way I don't know if there are better ways to do what I did (iterating over js Object keys in C++, creating js Objects in C++, creating Strings in C++ to be used as properties of js Objects and what else wrong you can find in my code).
So, before going on with my job writing the real math of my addon, I would like to share with the community the nan and V8 part of it to ask if you see something wrong or that can be done in a better way.
Thank you everybody for your help,
iCC
#include <map>
#include <nan.h>
using v8::Array;
using v8::Function;
using v8::FunctionTemplate;
using v8::Local;
using v8::Number;
using v8::Object;
using v8::Value;
using v8::String;
using Nan::AsyncQueueWorker;
using Nan::AsyncWorker;
using Nan::Callback;
using Nan::GetFunction;
using Nan::HandleScope;
using Nan::New;
using Nan::Null;
using Nan::Set;
using Nan::To;
using namespace std;
class Data {
public:
int dt1;
int dt2;
int dt3;
int dt4;
};
class Result {
public:
int x1;
int x2;
};
class Stats {
public:
int stat1;
int stat2;
};
typedef map<int, Data> DataSet;
typedef map<int, DataSet> DataMap;
typedef map<float, Result> ResultSet;
typedef map<int, ResultSet> ResultMap;
class MyAddOn: public AsyncWorker {
private:
DataMap *datas;
ResultMap results;
Stats stats;
public:
MyAddOn(Callback *callback, DataMap *set): AsyncWorker(callback), datas(set) {}
~MyAddOn() { delete datas; }
void Execute () {
for(DataMap::iterator i = datas->begin(); i != datas->end(); ++i) {
int res = i->first;
DataSet *datas = &i->second;
for(DataSet::iterator l = datas->begin(); l != datas->end(); ++l) {
int dt4 = l->first;
Data *data = &l->second;
// TODO: real population of stats and result
}
// test result population
results[res][res].x1 = res;
results[res][res].x2 = res;
}
// test stats population
stats.stat1 = 23;
stats.stat2 = 42;
}
void HandleOKCallback () {
Local<Object> obj;
Local<Object> res = New<Object>();
Local<Array> rslt = New<Array>();
Local<Object> sts = New<Object>();
Local<String> x1K = New<String>("x1").ToLocalChecked();
Local<String> x2K = New<String>("x2").ToLocalChecked();
uint32_t idx = 0;
for(ResultMap::iterator i = results.begin(); i != results.end(); ++i) {
ResultSet *set = &i->second;
for(ResultSet::iterator l = set->begin(); l != set->end(); ++l) {
Result *result = &l->second;
// is it ok to declare obj just once outside the cycles?
obj = New<Object>();
// is it ok to use same x1K and x2K instances for all objects?
Set(obj, x1K, New<Number>(result->x1));
Set(obj, x2K, New<Number>(result->x2));
Set(rslt, idx++, obj);
}
}
Set(sts, New<String>("stat1").ToLocalChecked(), New<Number>(stats.stat1));
Set(sts, New<String>("stat2").ToLocalChecked(), New<Number>(stats.stat2));
Set(res, New<String>("result").ToLocalChecked(), rslt);
Set(res, New<String>("stats" ).ToLocalChecked(), sts);
Local<Value> argv[] = { Null(), res };
callback->Call(2, argv);
}
};
NAN_METHOD(AddOn) {
Local<Object> datas = info[0].As<Object>();
Callback *callback = new Callback(info[1].As<Function>());
Local<Array> props = datas->GetOwnPropertyNames();
Local<String> dt1K = Nan::New("dt1").ToLocalChecked();
Local<String> dt2K = Nan::New("dt2").ToLocalChecked();
Local<String> dt3K = Nan::New("dt3").ToLocalChecked();
Local<Array> props2;
Local<Value> key;
Local<Object> value;
Local<Object> data;
DataMap *set = new DataMap();
int res;
int dt4;
DataSet *dts;
Data *dt;
for(uint32_t i = 0; i < props->Length(); i++) {
// is it ok to declare key, value, props2 and res just once outside the cycle?
key = props->Get(i);
value = datas->Get(key)->ToObject();
props2 = value->GetOwnPropertyNames();
res = To<int>(key).FromJust();
dts = &((*set)[res]);
for(uint32_t l = 0; l < props2->Length(); l++) {
// is it ok to declare key, data and dt4 just once outside the cycles?
key = props2->Get(l);
data = value->Get(key)->ToObject();
dt4 = To<int>(key).FromJust();
dt = &((*dts)[dt4]);
int dt1 = To<int>(data->Get(dt1K)).FromJust();
int dt2 = To<int>(data->Get(dt2K)).FromJust();
int dt3 = To<int>(data->Get(dt3K)).FromJust();
dt->dt1 = dt1;
dt->dt2 = dt2;
dt->dt3 = dt3;
dt->dt4 = dt4;
}
}
AsyncQueueWorker(new MyAddOn(callback, set));
}
NAN_MODULE_INIT(Init) {
Set(target, New<String>("myaddon").ToLocalChecked(), GetFunction(New<FunctionTemplate>(AddOn)).ToLocalChecked());
}
NODE_MODULE(myaddon, Init)
One year and half later...
If somebody is interested, my server is up and running since my question and the amount of memory it requires is stable.
I can't say if the code I wrote really does not has some memory leak or if lost memory is freed at each thread execution end, but if you are afraid as I was, I can say that using same structure and calls does not cause any real problem.

You do actually free up some of the memory you use, with the line of code:
~MyAddOn() { delete datas; }
In essence, C++ memory management boils down to always calling delete for every object created by new. There are also many additional architecture-specific and legacy 'C' memory management functions, but it is not strictly necessary to use these when you do not require the performance benefits.
As an example of what could potentially be a memory leak: You're passing the object held in the *callback pointer to the function AsyncQueueWorker. Yet nowhere in your code is this pointer freed, so unless the Queue worker frees it for you, there is a memory leak here.
You can use a memory tool such as valgrind to test your program further. It will spot most memory problems for you and comes highly recommended.
One thing I've observed is that you often ask (paraphrased):
Is it okay to declare X outside my loop?
To which the answer actually is that declaring variables inside of your loops is better, whenever you can do it. Declare variables as deep inside as you can, unless you have to re-use them. Variables are restricted in scope to the outermost set of {} brackets. You can read more about this in this question.
is it ok to use same x1K and x2K instances for all objects?
In essence, when you do this, if one of these objects modifies its 'x1K' string, then it will change for all of them. The advantage is that you free up memory. If the string is the same for all these objects anyway, instead of having to store say 1,000,000 copies of it, your computer will only keep a single one in memory and have 1,000,000 pointers to it instead. If the string is 9 ASCII characters long or longer under amd64, then that amounts to significant memory savings.
By the way, if you don't intend to modify a variable after its declaration, you can declare it as const, a keyword short for constant which forces the compiler to check that your variable is not modified after declaration. You may have to deal with quite a few compiler errors about functions accepting only non-const versions of things they don't modify, some of which may not be your own code, in which case you're out of luck. Being as conservative as possible with non-const variables can help spot problems.

C++11 Lambda Generator - How to signal last element or end?

I am looking at the lambda generators from https://stackoverflow.com/a/12735970 and https://stackoverflow.com/a/12639820.
I would like to adapt these to a template version and I'm wondering what I should return to signal that the generator has reached its end.
Consider
template<typename T>
std::function<T()> my_template_vector_generator(std::vector<T> &v) {
int idx = 0;
return [=,&v]() mutable {
return v[idx++];
};
}
The [=,&v] addition is mine and I hope it's correct. For now, it lets me change the vector outside as expected. Please comment on this if you have a bad feeling about it...
For reference, this works (REQUIRE is from catch):
std::vector<double> v({1.0, 2.0, 3.0});
auto vec_gen = my_template_vector_generator(v);
REQUIRE( vec_gen() == 1 );
REQUIRE( vec_gen() == 2 );
v[2] = 5.0;
REQUIRE( vec_gen() == 5.0 );
That vector version obviously requires knowledge of v.size() at the call site. I'd like to go without that by returning something that indicates the generator is empty.
Brainstorming, I can think of the following:
return pair<T, bool> and indicate false once there are no more values.
return iterators to the container values and iterate for (auto it=gen(); it!=gen.end(); it=gen()) {cout << *it << endl;}
wrapping the generator in a class and implementing an empty() method. Not entirely sure how that would work, however.
Does either of these versions feel good to you? Do you have another idea?
I'd particularly be interested in implications when mapping such a generator or combining them recursively (see this C++14 blog post for inspiration).
Update:
After hearing the optional suggestions, I implemented something based on unique_ptr.
template<typename T>
std::function<std::unique_ptr<T>()> my_pointer_template_vector_generator(std::vector<T> &v) {
int idx = 0;
return [=,&v]() mutable {
if (idx < v.size()) {
return std::unique_ptr<T>(new T(v[idx++]));
} else {
return std::unique_ptr<T>();
}
};
}
This seems to work, consider the following passing test.
TEST_CASE("Pointer generator terminates as expected") {
std::vector<double> v({1.0, 2.0, 3.0});
int k = 0;
auto gen = my_pointer_template_vector_generator(v);
while(auto val = gen()) {
REQUIRE( *val == v[k++] );
}
REQUIRE( v.size() == k );
}
I have some concerns about creating lots of new T. I suppose I could also return pointers into the vector and nullptr otherwise. That should work in the same way. What do you think about this?

As soon as it looks like just study example i could suggest to add exception as end of collection. But do not use it in real code.

boost multi_index_container and erase performance

I have a boost multi_index_container declared as below which is indexed by hash_unique id(unsigned long) and hash_non_unique transaction id(long). Insertion and retrieval of elements is fast but when I delete elements, it is much slower. I was expecting it to be constant time as key is hashed.
e.g To erase elements from container
for 10,000 elements, it takes around 2.53927016 seconds
for 15,000 elements, it takes around 7.137100068 seconds
for 20,000 elements, it takes around 21.391720757 seconds
Is it something I am missing or is it expected behavior?
class Session
{
public:
Session() {
//increment unique id
static unsigned long counter = 0;
boost::mutex::scoped_lock guard(mx);
counter++;
m_nId = counter;
}
unsigned long GetId() {
return m_nId;
}
long GetTransactionHandle(){
return m_nTransactionHandle;
}
....
private:
unsigned long m_nId;
long m_nTransactionHandle;
boost::mutext mx;
....
};
typedef multi_index_container<
Session*,
indexed_by<
hashed_unique< mem_fun<Session,unsigned long,&Session::GetId> >,
hashed_non_unique< mem_fun<Session,unsigned long,&Session::GetTransactionHandle> >
> //end indexed_by
> SessionContainer;
typedef SessionContainer::nth_index<0>::type SessionById;
int main() {
...
SessionContainer container;
SessionById *pSessionIdView = &get<0>(container);
unsigned counter = atoi(argv[1]);
vector<Session*> vSes(counter);
//insert
for(unsigned i = 0; i < counter; i++) {
Session *pSes = new Session();
container.insert(pSes);
vSes.push_back(pSes);
}
timespec ts;
lock_settime(CLOCK_PROCESS_CPUTIME_ID, &ts);
//erase
for(unsigned i = 0; i < counter; i++) {
pSessionIdView->erase(vSes[i]->getId());
delete vSes[i];
}
lock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
std::cout << "Total time taken for erase:" << ts.tv_sec << "." << ts.tv_nsec << "\n";
return (EXIST_SUCCESS);
}

In your test code, what value for m_nTransactionHandledo the Session objects receive? Could it be it's the same value for all the objects? If so, erasing will take long, as performance of hashed containers is poor when there are many equal elements. Try assigning different m_nTransactionHandle values on creation to see if this speeds your test up.

When erasing an element, performance is a function of all the indices making up the container (basically, the element must be erased from every index, not only the index you're currently working with). Hashed indices are badly hurt when there are many equivalent elements, this is not the pattern they're designed to work against.

I just found that the performance for hashed_non_unique versus hashed_unique for 2nd index is the almost same except slight overhead of checking duplicate.
The bottleneck was with boost::object_pool. I don't know internal implementation but it seem it is a list where it iterate through the list to find objects. See the link for performance results and source code.
http://joshitech.blogspot.com/2010/05/boost-object-pool-destroy-performance.html

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

STXXL: limited parallelism during sorting? - sorting

Related

How to get gtest results?

Passing a temporary stream object to a lambda function as part of an extraction expression

node.js c++ addon - afraid of memory leak

C++11 Lambda Generator - How to signal last element or end?

boost multi_index_container and erase performance

Categories

Resources