boost multi_index_container and erase performance

I have a Boost multi_index_container, declared as below, which is indexed by a hashed_unique id (unsigned long) and a hashed_non_unique transaction id (long). Insertion and retrieval of elements is fast, but deleting elements is much slower. I was expecting deletion to be constant time, since the key is hashed.
E.g., to erase all elements from the container:
for 10,000 elements, it takes around 2.53927016 seconds
for 15,000 elements, it takes around 7.137100068 seconds
for 20,000 elements, it takes around 21.391720757 seconds
Is there something I am missing, or is this expected behavior?
class Session
{
public:
    Session() {
        // increment unique id; the mutex is static because it guards a static counter
        // (a per-instance mutex would provide no mutual exclusion between instances)
        static boost::mutex mx;
        static unsigned long counter = 0;
        boost::mutex::scoped_lock guard(mx);
        counter++;
        m_nId = counter;
    }
    unsigned long GetId() {
        return m_nId;
    }
    long GetTransactionHandle() {
        return m_nTransactionHandle;
    }
    ....
private:
    unsigned long m_nId;
    long m_nTransactionHandle;
    ....
};
typedef multi_index_container<
    Session*,
    indexed_by<
        hashed_unique< mem_fun<Session,unsigned long,&Session::GetId> >,
        hashed_non_unique< mem_fun<Session,long,&Session::GetTransactionHandle> >
    > //end indexed_by
> SessionContainer;
typedef SessionContainer::nth_index<0>::type SessionById;
int main() {
    ...
    SessionContainer container;
    SessionById *pSessionIdView = &get<0>(container);
    unsigned counter = atoi(argv[1]);
    vector<Session*> vSes;
    vSes.reserve(counter); // reserve, don't pre-size: push_back would otherwise append after `counter` null entries
    //insert
    for(unsigned i = 0; i < counter; i++) {
        Session *pSes = new Session();
        container.insert(pSes);
        vSes.push_back(pSes);
    }
    timespec tsStart, tsEnd;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tsStart);
    //erase
    for(unsigned i = 0; i < counter; i++) {
        pSessionIdView->erase(vSes[i]->GetId());
        delete vSes[i];
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tsEnd);
    long long elapsed_ns = (tsEnd.tv_sec - tsStart.tv_sec) * 1000000000LL
                         + (tsEnd.tv_nsec - tsStart.tv_nsec);
    std::cout << "Total time taken for erase: " << elapsed_ns / 1e9 << " s\n";
    return (EXIT_SUCCESS);
}

In your test code, what value of m_nTransactionHandle do the Session objects receive? Could it be the same value for all the objects? If so, erasing will take long, as the performance of hashed containers is poor when there are many equal elements. Try assigning different m_nTransactionHandle values on creation to see if this speeds your test up.

When erasing an element, performance is a function of all the indices making up the container (basically, the element must be erased from every index, not only the index you're currently working with). Hashed indices are badly hurt when there are many equivalent elements; this is not the pattern they're designed to work against. A quick way to test this follows.
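A minimal sketch of the experiment (a hypothetical modification of the constructor above, not part of the original code): give every session a distinct m_nTransactionHandle so the hashed_non_unique index holds no equal keys. If erase times drop, the equal-key buckets were the bottleneck.

Session() {
    // increment unique id
    static boost::mutex mx;               // static: one lock guarding the static counter
    static unsigned long counter = 0;
    boost::mutex::scoped_lock guard(mx);
    counter++;
    m_nId = counter;
    m_nTransactionHandle = (long)counter; // hypothetical: distinct handle per session
}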

I just found that the performance of hashed_non_unique versus hashed_unique for the 2nd index is almost the same, except for the slight overhead of checking for duplicates.
The bottleneck was boost::object_pool. I don't know the internal implementation, but it seems to be a list that it iterates through to find objects. See the link for performance results and source code.
http://joshitech.blogspot.com/2010/05/boost-object-pool-destroy-performance.html
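For reference, a minimal sketch of the pattern that reportedly triggers this (assuming Boost.Pool; the struct and count are illustrative): object_pool::destroy has to walk the pool's ordered free list, so destroying N objects one by one degrades toward quadratic overall, while plain new/delete stays constant per object.

#include <boost/pool/object_pool.hpp>
#include <vector>

struct Obj { long payload[4]; };

int main() {
    const int N = 20000;
    boost::object_pool<Obj> pool;
    std::vector<Obj*> v;
    v.reserve(N);
    for (int i = 0; i < N; ++i)
        v.push_back(pool.construct());   // fast: cheap per-allocation cost
    for (int i = 0; i < N; ++i)
        pool.destroy(v[i]);              // slow: walks the ordered free list each call
    // Compare against plain new/delete, or simply let the pool reclaim
    // everything at once in its destructor, which avoids the per-object walk.
    return 0;
}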

Related

node.js c++ addon - afraid of memory leak

First of all, I admit I'm a newbie in C++ addons for Node.js.
I'm writing my first addon and I reached a good result: the addon does what I want. I copied from various examples I found on the internet to exchange complex data between the two languages, but I understood almost nothing of what I wrote.
The first thing scaring me is that I wrote nothing that seems to free any memory; another thing seriously worrying me is that I don't know whether what I wrote helps or confuses the V8 garbage collector. I also don't know if there are better ways to do what I did (iterating over JS Object keys in C++, creating JS Objects in C++, creating Strings in C++ to be used as properties of JS Objects, and whatever else you can find wrong in my code).
So, before going on with my job writing the real math of my addon, I would like to share the nan and V8 parts of it with the community, to ask if you see anything wrong or anything that can be done in a better way.
Thank you everybody for your help,
iCC
#include <map>
#include <nan.h>
using v8::Array;
using v8::Function;
using v8::FunctionTemplate;
using v8::Local;
using v8::Number;
using v8::Object;
using v8::Value;
using v8::String;
using Nan::AsyncQueueWorker;
using Nan::AsyncWorker;
using Nan::Callback;
using Nan::GetFunction;
using Nan::HandleScope;
using Nan::New;
using Nan::Null;
using Nan::Set;
using Nan::To;
using namespace std;
class Data {
public:
    int dt1;
    int dt2;
    int dt3;
    int dt4;
};
class Result {
public:
    int x1;
    int x2;
};
class Stats {
public:
    int stat1;
    int stat2;
};
typedef map<int, Data> DataSet;
typedef map<int, DataSet> DataMap;
typedef map<float, Result> ResultSet;
typedef map<int, ResultSet> ResultMap;
class MyAddOn: public AsyncWorker {
private:
    DataMap *datas;
    ResultMap results;
    Stats stats;
public:
    MyAddOn(Callback *callback, DataMap *set): AsyncWorker(callback), datas(set) {}
    ~MyAddOn() { delete datas; }
    void Execute () {
        for(DataMap::iterator i = datas->begin(); i != datas->end(); ++i) {
            int res = i->first;
            DataSet *datas = &i->second;
            for(DataSet::iterator l = datas->begin(); l != datas->end(); ++l) {
                int dt4 = l->first;
                Data *data = &l->second;
                // TODO: real population of stats and result
            }
            // test result population
            results[res][res].x1 = res;
            results[res][res].x2 = res;
        }
        // test stats population
        stats.stat1 = 23;
        stats.stat2 = 42;
    }
    void HandleOKCallback () {
        Local<Object> obj;
        Local<Object> res = New<Object>();
        Local<Array> rslt = New<Array>();
        Local<Object> sts = New<Object>();
        Local<String> x1K = New<String>("x1").ToLocalChecked();
        Local<String> x2K = New<String>("x2").ToLocalChecked();
        uint32_t idx = 0;
        for(ResultMap::iterator i = results.begin(); i != results.end(); ++i) {
            ResultSet *set = &i->second;
            for(ResultSet::iterator l = set->begin(); l != set->end(); ++l) {
                Result *result = &l->second;
                // is it ok to declare obj just once outside the cycles?
                obj = New<Object>();
                // is it ok to use same x1K and x2K instances for all objects?
                Set(obj, x1K, New<Number>(result->x1));
                Set(obj, x2K, New<Number>(result->x2));
                Set(rslt, idx++, obj);
            }
        }
        Set(sts, New<String>("stat1").ToLocalChecked(), New<Number>(stats.stat1));
        Set(sts, New<String>("stat2").ToLocalChecked(), New<Number>(stats.stat2));
        Set(res, New<String>("result").ToLocalChecked(), rslt);
        Set(res, New<String>("stats" ).ToLocalChecked(), sts);
        Local<Value> argv[] = { Null(), res };
        callback->Call(2, argv);
    }
};
NAN_METHOD(AddOn) {
    Local<Object> datas = info[0].As<Object>();
    Callback *callback = new Callback(info[1].As<Function>());
    Local<Array> props = datas->GetOwnPropertyNames();
    Local<String> dt1K = Nan::New("dt1").ToLocalChecked();
    Local<String> dt2K = Nan::New("dt2").ToLocalChecked();
    Local<String> dt3K = Nan::New("dt3").ToLocalChecked();
    Local<Array> props2;
    Local<Value> key;
    Local<Object> value;
    Local<Object> data;
    DataMap *set = new DataMap();
    int res;
    int dt4;
    DataSet *dts;
    Data *dt;
    for(uint32_t i = 0; i < props->Length(); i++) {
        // is it ok to declare key, value, props2 and res just once outside the cycle?
        key = props->Get(i);
        value = datas->Get(key)->ToObject();
        props2 = value->GetOwnPropertyNames();
        res = To<int>(key).FromJust();
        dts = &((*set)[res]);
        for(uint32_t l = 0; l < props2->Length(); l++) {
            // is it ok to declare key, data and dt4 just once outside the cycles?
            key = props2->Get(l);
            data = value->Get(key)->ToObject();
            dt4 = To<int>(key).FromJust();
            dt = &((*dts)[dt4]);
            int dt1 = To<int>(data->Get(dt1K)).FromJust();
            int dt2 = To<int>(data->Get(dt2K)).FromJust();
            int dt3 = To<int>(data->Get(dt3K)).FromJust();
            dt->dt1 = dt1;
            dt->dt2 = dt2;
            dt->dt3 = dt3;
            dt->dt4 = dt4;
        }
    }
    AsyncQueueWorker(new MyAddOn(callback, set));
}
NAN_MODULE_INIT(Init) {
    Set(target, New<String>("myaddon").ToLocalChecked(),
        GetFunction(New<FunctionTemplate>(AddOn)).ToLocalChecked());
}
NODE_MODULE(myaddon, Init)
One and a half years later...
If somebody is interested: my server has been up and running since my question, and the amount of memory it requires is stable.
I can't say whether the code I wrote really has no memory leak, or whether lost memory is freed at the end of each thread execution, but if you are afraid as I was, I can say that using the same structure and calls does not cause any real problem.
You do actually free up some of the memory you use, with the line of code:
~MyAddOn() { delete datas; }
In essence, C++ memory management boils down to always calling delete for every object created by new. There are also many additional architecture-specific and legacy 'C' memory management functions, but it is not strictly necessary to use these when you do not require the performance benefits.
As an example of what could potentially be a memory leak: You're passing the object held in the *callback pointer to the function AsyncQueueWorker. Yet nowhere in your code is this pointer freed, so unless the Queue worker frees it for you, there is a memory leak here.
You can use a memory tool such as valgrind to test your program further. It will spot most memory problems for you and comes highly recommended.
One thing I've observed is that you often ask (paraphrased):
Is it okay to declare X outside my loop?
To which the answer actually is that declaring variables inside your loops is better, whenever you can do it. Declare variables as deeply nested as you can, unless you have to re-use them: a variable's scope is restricted to the innermost enclosing set of {} brackets. You can read more about this in this question; a small illustration follows.
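A minimal, self-contained illustration of the scoping point (plain C++, not the addon code):

#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> props;
    props.push_back("a"); props.push_back("b"); props.push_back("c");
    for (size_t i = 0; i < props.size(); i++) {
        // `key` lives only for this iteration: no stale state can leak
        // between iterations, and intent is obvious at a glance.
        const std::string& key = props[i];
        std::cout << key << "\n";
    }
    // `key` is not visible here; the compiler enforces the narrow lifetime.
    return 0;
}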
is it ok to use same x1K and x2K instances for all objects?
In essence, when you do this, if one of these objects modifies its 'x1K' string, then it will change for all of them. The advantage is that you free up memory. If the string is the same for all these objects anyway, instead of having to store say 1,000,000 copies of it, your computer will only keep a single one in memory and have 1,000,000 pointers to it instead. If the string is 9 ASCII characters long or longer under amd64, then that amounts to significant memory savings.
By the way, if you don't intend to modify a variable after its declaration, you can declare it const, a keyword (short for "constant") that makes the compiler check that the variable is not modified after declaration. You may have to deal with quite a few compiler errors about functions accepting only non-const versions of things they don't actually modify; some of that may not be your own code, in which case you're out of luck. Being as conservative as possible with non-const variables can help spot problems, as in the sketch below.
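A small sketch of the idea (hypothetical names):

#include <iostream>

// Taking `const int&` promises not to modify the argument.
void report(const int& value) {
    std::cout << value << "\n";
}

int main() {
    const int expected = 42;
    // expected = 13;   // would not compile: assignment of read-only variable
    report(expected);
    return 0;
}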

"Sigabrt Error" - Codechef

The following code ran perfectly in my Dev-C++ compiler, but when I submitted it on CodeChef, after running for 3-4 seconds it shows "SIGABRT ERROR". I have researched this error and have done everything I could to debug it, but even after a week I am not able to. Please help! Thanks in advance.
For reference question is http://www.codechef.com/problems/LOWSUM
#include <iostream>
using namespace std;

void selsort(long long *ssum, long long len)
{
    long long low;
    for(long long i = 0; i < len; i++)
    {
        low = ssum[i];
        long long pos = i;
        for(long long j = i + 1; j < len; j++)
        {
            if(ssum[j] < low)
            {
                low = ssum[j];
                pos = j;
            }
        }
        ssum[pos] = ssum[i];
        ssum[i] = low;
    }
}

int main()
{
    int t, k, q;
    cin >> t;
    for(int i = 0; i < t; ++i)
    {
        cin >> k;
        cin >> q;
        long long sq = k * k;
        long long *mot = NULL, *sat = NULL;
        mot = new long long[k];
        sat = new long long[k];
        long long *sum = new long long[sq];
        long long qth;
        long long b = 0;
        for(int j = 0; j < k; ++j)
        {
            cin >> mot[j];
        }
        for(int j = 0; j < k; ++j)
        {
            cin >> sat[j];
        }
        for(int j = 0; j < k; ++j)
        {
            for(int a = 0; a < k; ++a)
            {
                sum[b] = mot[a] + sat[j];
                ++b;
            }
        }
        selsort(sum, sq);
        for(int j = 0; j < q; ++j)
        {
            cout << "\n";
            cin >> qth;
            cout << "\n" << sum[qth - 1];
        }
        delete [] sum;
        delete [] mot;
        delete [] sat;
    }
    return 0;
}
The SIGABRT signal is sent for many reasons. Quoting CodeChef:
SIGABRT errors are caused by your program aborting due to a fatal error. In C++, this is normally due to an assert statement not returning true, but some STL elements can generate this if they try to allocate too much memory.
In your case it seems to be excessive memory use:
mot = new long long [k];
sat = new long long [k];
long long *sum = new long long[sq];
Note that the value of k can be as large as 20000, so declaring an array of size k is fine, but your sq = k*k is of the order of 4*10^8, which causes an out-of-memory problem. Your algorithm is also not good enough to get AC within the time limit; see the sketch below for the usual direction.
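A common fix (a sketch of the standard technique, not the official solution): since each query asks only for the qth smallest sum, you never need the full k*k table. With the second array sorted, a min-heap produces the smallest sums one at a time in O(m log k) time using O(k) memory:

#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>
using namespace std;

// Returns the m smallest values of mot[i] + sat[j]. `sat` must be sorted
// ascending (each row is consumed left to right); sorting `mot` is not
// required, because every row is seeded into the heap.
vector<long long> smallestSums(const vector<long long>& mot,
                               const vector<long long>& sat, int m) {
    typedef pair<long long, pair<int, int> > Entry;           // (sum, (i, j))
    priority_queue<Entry, vector<Entry>, greater<Entry> > heap;
    for (int i = 0; i < (int)mot.size(); ++i)                 // seed column 0 of every row
        heap.push(make_pair(mot[i] + sat[0], make_pair(i, 0)));
    vector<long long> out;
    while ((int)out.size() < m && !heap.empty()) {
        Entry e = heap.top(); heap.pop();
        out.push_back(e.first);
        int i = e.second.first, j = e.second.second;
        if (j + 1 < (int)sat.size())                          // advance within row i
            heap.push(make_pair(mot[i] + sat[j + 1], make_pair(i, j + 1)));
    }
    return out;
}

int main() {
    long long m[] = {1, 4, 7}, s[] = {2, 3, 9};
    vector<long long> mot(m, m + 3), sat(s, s + 3);
    vector<long long> best = smallestSums(mot, sat, 4);       // expect 3, 4, 6, 7
    for (size_t i = 0; i < best.size(); ++i) printf("%lld ", best[i]);
    printf("\n");
    return 0;
}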
CodeChef has its own forum for asking such questions, and preferable ways to solve this problem have already been discussed there:
http://discuss.codechef.com/problems/LOWSUM

STXXL: limited parallelism during sorting?

I populate a very large array using a stxxl::VECTOR_GENERATOR<MyData>::result::bufwriter_type (something like 100M entries), which I need to sort in parallel.
I use the stxxl::sort(vector->begin(), vector->end(), cmp(), memoryAmount) method, which in theory should do what I need: sort the elements very efficiently.
However, during the execution of this method I noticed that only one processor is fully utilised, while all the other cores are quite idle (I suspect there is a little activity to fetch the input, but in practice they don't do anything).
This is my question: is it possible to exploit more cores during the sorting phase, or is the parallelism used only to fetch the input asynchronously? If so, are there documents that explain how to enable it? (I looked through the documentation on the website extensively, but couldn't find anything.)
Thanks very much!
EDIT
Thanks for the suggestion. I provide below some more information.
First of all, I use macOS for my experiments. I launch the following program and study its behaviour.
#include <iostream>
#include <limits>
#include <cstdlib>
#include <stxxl/vector>
#include <stxxl/sort>

typedef struct Triple {
    long t1, t2, t3;
    Triple(long s, long p, long o) {
        this->t1 = s;
        this->t2 = p;
        this->t3 = o;
    }
    Triple() {
        t1 = t2 = t3 = 0;
    }
} Triple;

const Triple minv(std::numeric_limits<long>::min(),
                  std::numeric_limits<long>::min(), std::numeric_limits<long>::min());
const Triple maxv(std::numeric_limits<long>::max(),
                  std::numeric_limits<long>::max(), std::numeric_limits<long>::max());

struct cmp: std::less<Triple> {
    bool operator ()(const Triple& a, const Triple& b) const {
        if (a.t1 < b.t1) {
            return true;
        } else if (a.t1 == b.t1) {
            if (a.t2 < b.t2) {
                return true;
            } else if (a.t2 == b.t2) {
                return a.t3 < b.t3;
            }
        }
        return false;
    }
    Triple min_value() const {
        return minv;
    }
    Triple max_value() const {
        return maxv;
    }
};
typedef stxxl::VECTOR_GENERATOR<Triple>::result vector_type;

int main(int argc, const char** argv) {
    vector_type vector;
    vector_type::bufwriter_type writer(vector);
    for (int i = 0; i < 1000000000; ++i) {
        if (i % 10000000 == 0)
            std::cout << "Inserting element " << i << std::endl;
        Triple t;
        t.t1 = rand();
        t.t2 = rand();
        t.t3 = rand();
        writer << t;
    }
    writer.finish();
    //Sort the vector
    stxxl::sort(vector.begin(), vector.end(), cmp(), 1024*1024*1024);
    std::cout << vector.size() << std::endl;
}
Indeed, there seem to be only one or at most two threads working during the execution of this program. Notice that the machine has only a single disk.
Can you please confirm whether the parallelism works on macOS? If not, I will try Linux to see what happens. Or is it perhaps because there is only one disk?
In principle what you are doing should work out of the box. With everything working, you should see all cores doing processing.
Since it doesn't work, we'll have to find the error, and debugging why we see no parallel speedup is still tricky business these days.
The main idea is to go from small to large examples:
What platform is this? There is no parallelism on MSVC, only on Linux/gcc.
By default STXXL builds on Linux/gcc with USE_GNU_PARALLEL. You can turn it off to see if it has any effect.
Try reproducing the example values shown in http://stxxl.sourceforge.net/tags/master/stxxl_tool.html - with and without USE_GNU_PARALLEL.
See if plain in-memory parallel sorting scales on your processor/system, as in the sketch below.
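A minimal in-memory scaling test (a sketch, independent of STXXL; assumes GCC, since USE_GNU_PARALLEL is built on the libstdc++ parallel mode). Compile with g++ -fopenmp and watch whether all cores light up:

#include <parallel/algorithm>  // libstdc++ parallel mode (GCC only)
#include <cstdlib>
#include <iostream>
#include <vector>

int main() {
    std::vector<long> v(100 * 1000 * 1000);  // ~800 MB of longs; adjust to your RAM
    for (size_t i = 0; i < v.size(); ++i)
        v[i] = rand();
    // Explicitly parallel sort: if this saturates all cores, the problem is in
    // the STXXL build/configuration; if it doesn't, the toolchain/platform is
    // the problem (e.g. no OpenMP-capable compiler on macOS by default).
    __gnu_parallel::sort(v.begin(), v.end());
    std::cout << "sorted, first element: " << v.front() << std::endl;
    return 0;
}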

Build trie faster

I'm making a mobile app which needs thousands of fast string lookups and prefix checks. To speed this up, I made a trie out of my word list, which has about 180,000 words.
Everything's great, but the only problem is that building this huge trie (it has about 400,000 nodes) currently takes about 10 seconds on my phone, which is really slow.
Here's the code that builds the trie.
public SimpleTrie makeTrie(String file) throws Exception {
    String line;
    SimpleTrie trie = new SimpleTrie();
    BufferedReader br = new BufferedReader(new FileReader(file));
    while( (line = br.readLine()) != null) {
        trie.insert(line);
    }
    br.close();
    return trie;
}
The insert method, which runs in O(length of key):
public void insert(String key) {
    TrieNode crawler = root;
    for(int level = 0; level < key.length(); level++) {
        int index = key.charAt(level) - 'A';
        if(crawler.children[index] == null) {
            crawler.children[index] = getNode();
        }
        crawler = crawler.children[index];
    }
    crawler.valid = true;
}
I'm looking for intuitive methods to build the trie faster. Maybe I could build the trie just once on my laptop, store it to disk somehow, and load it from a file on the phone? But I don't know how to implement this.
Or are there any other prefix data structures which will take less time to build, but have similar lookup time complexity?
Any suggestions are appreciated. Thanks in advance.
EDIT
Someone suggested using Java Serialization. I tried it, but it was very slow with this code:
public void serializeTrie(SimpleTrie trie, String file) {
    try {
        ObjectOutput out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)));
        out.writeObject(trie);
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

public SimpleTrie deserializeTrie(String file) {
    try {
        ObjectInput in = new ObjectInputStream(new BufferedInputStream(new FileInputStream(file)));
        SimpleTrie trie = (SimpleTrie)in.readObject();
        in.close();
        return trie;
    } catch (IOException | ClassNotFoundException e) {
        e.printStackTrace();
        return null;
    }
}
Can the above code be made faster?
My trie: http://pastebin.com/QkFisi09
Word list: http://www.isc.ro/lists/twl06.zip
Android IDE used to run code: http://play.google.com/store/apps/details?id=com.jimmychen.app.sand
Double-array tries are very fast to save/load because all data is stored in linear arrays. They are also very fast to look up, but insertions can be costly. I bet there is a Java implementation somewhere.
Also, if your data is static (i.e. you don't update it on the phone), consider a DAFSA for your task. It is one of the most efficient data structures for storing words (better than "standard" tries and radix tries both for size and for speed, better than succinct tries for speed, and often better than succinct tries for size). There is a good C++ implementation: dawgdic - you can use it to build a DAFSA from the command line and then use a Java reader for the resulting data structure (an example implementation is here).
You could store your trie as an array of nodes, with references to child nodes replaced with array indices. Your root node would be the first element. That way, you could easily store/load your trie from simple binary or text format.
public class SimpleTrie {
    public class TrieNode {
        boolean valid;
        int[] children;
    }
    private TrieNode[] nodes;
    private int numberOfNodes;
    private TrieNode getNode() {
        TrieNode t = nodes[++numberOfNodes];
        return t;
    }
}
Just build a large String[] and sort it. Then you can use binary search to find the location of a String. You can also do a query based on prefixes without too much work.
Prefix look-up example:
Compare method:
private static int compare(String string, String prefix) {
    if (prefix.length() > string.length()) return Integer.MIN_VALUE;
    for (int i = 0; i < prefix.length(); i++) {
        char s = string.charAt(i);
        char p = prefix.charAt(i);
        if (s != p) {
            if (p < s) {
                // prefix is before string
                return -1;
            }
            // prefix is after string
            return 1;
        }
    }
    return 0;
}
Finds an occurrence of the prefix in the array and returns its location (MIN or MAX mean not found):
private static int recursiveFind(String[] strings, String prefix, int start, int end) {
    if (start == end) {
        String lastValue = strings[start]; // start==end
        if (compare(lastValue, prefix) == 0)
            return start; // start==end
        return Integer.MAX_VALUE;
    }
    int low = start;
    int high = end + 1; // zero indexed, so add one.
    int middle = low + ((high - low) / 2);
    String middleValue = strings[middle];
    int comp = compare(middleValue, prefix);
    if (comp == Integer.MIN_VALUE) return comp;
    if (comp == 0)
        return middle;
    if (comp > 0)
        return recursiveFind(strings, prefix, middle + 1, end);
    return recursiveFind(strings, prefix, start, middle - 1);
}
Takes a String array and a prefix, and prints out the occurrences of the prefix in the array:
private static boolean testPrefix(String[] strings, String prefix) {
    int i = recursiveFind(strings, prefix, 0, strings.length - 1);
    if (i == Integer.MAX_VALUE || i == Integer.MIN_VALUE) {
        // not found
        return false;
    }
    // Found an occurrence, now search up and down for other occurrences
    int up = i + 1;
    int down = i;
    while (down >= 0) {
        String string = strings[down];
        if (compare(string, prefix) == 0) {
            System.out.println(string);
        } else {
            break;
        }
        down--;
    }
    while (up < strings.length) {
        String string = strings[up];
        if (compare(string, prefix) == 0) {
            System.out.println(string);
        } else {
            break;
        }
        up++;
    }
    return true;
}
Here's a reasonably compact format for storing a trie on disk. I'll specify it by its (efficient) deserialization algorithm. Initialize a stack whose initial contents are the root node of the trie. Read characters one by one and interpret them as follows. The meaning of a letter A-Z is "allocate a new node, make it a child of the current top of stack, and push the newly allocated node onto the stack". The letter indicates which position the child is in. The meaning of a space is "set the valid flag of the node on top of the stack to true". The meaning of a backspace (\b) is "pop the stack".
For example, the input
TREE \b\bIE \b\b\bOO \b\b\b
gives the word list
TREE
TRIE
TOO
On your desktop, construct the trie using whichever method you like, and then serialize it with the following recursive algorithm (pseudocode); a C++ sketch of the matching loader follows.
serialize(node):
    if node is valid: put(' ')
    for letter in A-Z:
        if node has a child under letter:
            put(letter)
            serialize(child)
            put('\b')
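A sketch of the loader in C++ (the node layout is an assumption, mirroring the 26-way array trie from the question):

#include <cstdio>
#include <deque>
#include <vector>

// Minimal 26-way trie node, mirroring the question's layout.
struct Node {
    bool valid = false;
    Node* children[26] = {};
};

// Deserialize the stack-based format: a letter pushes a new child, ' ' marks
// the top-of-stack node valid, '\b' pops. Returns the root; `pool` owns the
// nodes (std::deque never moves elements, so the pointers stay valid).
Node* load(std::FILE* in, std::deque<Node>& pool) {
    pool.emplace_back();                  // root
    std::vector<Node*> stack(1, &pool.back());
    int c;
    while ((c = std::fgetc(in)) != EOF) {
        if (c == ' ') {
            stack.back()->valid = true;
        } else if (c == '\b') {
            stack.pop_back();
        } else {                          // a letter 'A'..'Z'
            pool.emplace_back();
            stack.back()->children[c - 'A'] = &pool.back();
            stack.push_back(&pool.back());
        }
    }
    return &pool.front();
}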
This isn't a magic bullet, but you can probably reduce your runtime slightly by doing one big memory allocation instead of a bunch of little ones.
I saw a ~10% speedup in the test code below (C++, not Java, sorry) when I used a "node pool" instead of relying on individual allocations:
#include <string>
#include <fstream>

#define USE_NODE_POOL

#ifdef USE_NODE_POOL
struct Node;
Node *node_pool;
int node_pool_idx = 0;
#endif

struct Node {
    void insert(const std::string &s) { insert_helper(s, 0); }
    void insert_helper(const std::string &s, int idx) {
        if (idx >= s.length()) return;
        int char_idx = s[idx] - 'A';
        if (children[char_idx] == nullptr) {
#ifdef USE_NODE_POOL
            children[char_idx] = &node_pool[node_pool_idx++];
#else
            children[char_idx] = new Node();
#endif
        }
        children[char_idx]->insert_helper(s, idx + 1);
    }
    Node *children[26] = {};
};

int main() {
#ifdef USE_NODE_POOL
    node_pool = new Node[400000];
#endif
    Node n;
    std::ifstream fin("TWL06.txt");
    std::string word;
    while (fin >> word) n.insert(word);
}
Tries that preallocate space for all possible children (256) have a huge amount of wasted space. You are making your cache cry. Store the pointers to children in a resizable data structure instead; see the sketch below.
Some tries optimize by having one node represent a long string, and break that string up only when needed.
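A sketch of a compact node (a hypothetical layout, not the asker's code): children are kept in a small sorted vector of (letter, child) pairs, so a node costs memory proportional to its actual fan-out rather than 26 (or 256) pointers.

#include <algorithm>
#include <utility>
#include <vector>

struct CompactNode {
    bool valid = false;
    // (letter, child) pairs kept sorted by letter; most trie nodes have few
    // children, so this is far smaller and more cache-friendly than a fixed array.
    std::vector<std::pair<char, CompactNode*> > children;

    CompactNode* child(char c) const {
        for (size_t i = 0; i < children.size(); ++i)
            if (children[i].first == c) return children[i].second;
        return nullptr;               // a linear scan is fine for tiny fan-outs
    }
    CompactNode* getOrAdd(char c) {
        if (CompactNode* n = child(c)) return n;
        CompactNode* n = new CompactNode();
        children.push_back(std::make_pair(c, n));
        std::sort(children.begin(), children.end()); // keep sorted by letter
        return n;
    }
};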
Instead of a simple file, you can use a database like SQLite with a nested-set or Celko tree to store the trie. You can also build a faster and shorter (fewer nodes) trie with a ternary search trie.
I don't like the idea of addressing nodes by index into an array, but only because it requires one more addition (index to pointer). On the other hand, with an array of preallocated nodes you will maybe save some time on allocation and initialization, and you can also save a lot of space by reserving the first 26 indices for leaf nodes. Thus you won't need to allocate and initialize 180,000 leaf nodes.
Also, with indices you will be able to read the prepared node array from disk in binary format. This should be several times faster. But I'm not sure how to do this in your language. Is this Java?
If your source vocabulary is sorted, you may also save some time by comparing some prefix of the current string with the previous one, e.g. the first 4 characters. If they are equal, you can start your
for(int level=0 ; level < key.length() ; level++) {
loop from the 5th level.
Is it space inefficient or time inefficient? If you are rolling a plain trie, then space may be part of the problem when dealing with a mobile device. Check out Patricia/radix tries, especially if you are using it as a prefix look-up tool.
Trie:
http://en.wikipedia.org/wiki/Trie
Patricia/Radix trie:
http://en.wikipedia.org/wiki/Radix_tree
You didn't mention a language but here are two implementations of prefix tries in Java.
Regular trie:
http://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/Trie.java
Patricia/Radix (space-effecient) trie:
http://github.com/phishman3579/java-algorithms-implementation/blob/master/src/com/jwetherell/algorithms/data_structures/PatriciaTrie.java
Generally speaking, avoid a lot of object creation from scratch in Java; it is both slow and carries massive overhead. Better to implement your own pooling class for memory management that allocates, e.g., half a million entries at a time in one go.
Also, serialization is too slow for large lexicons. Use a binary read to populate the array-based representations proposed above quickly; a sketch follows.
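For illustration, a C++ sketch of the idea (the on-disk layout is an assumption: a node count followed by fixed-size records of 26 child indices plus a valid flag; the same pattern applies in Java with DataInputStream):

#include <cstdio>
#include <cstdint>
#include <vector>

// Fixed-size on-disk record: 26 child indices (0 = no child) plus a valid flag.
struct PackedNode {
    int32_t children[26];
    int32_t valid;
};

// One bulk fread pulls the whole trie into memory: no per-node allocation,
// no per-node parsing. Index 0 is the root.
std::vector<PackedNode> loadTrie(const char* path) {
    std::vector<PackedNode> nodes;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return nodes;
    int32_t count = 0;
    if (std::fread(&count, sizeof(count), 1, f) == 1 && count > 0) {
        nodes.resize(count);
        std::fread(&nodes[0], sizeof(PackedNode), count, f);
    }
    std::fclose(f);
    return nodes;
}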

How to put my structure variable into CPU caches to eliminate main memory page access time?

It's clear that there is no explicit way, or certain system call, that helps programmers put a variable into the CPU cache.
But I think that a certain programming style, or a well designed algorithm, can increase the likelihood that the variable gets cached.
Here is my example:
I want to append an 8-byte structure at the end of an array consisting of the same type of structures, declared in the global main memory region.
This process is repeated for 4 million operations and takes 6 seconds, i.e. 1.5 us per operation. I think this result tells me that the two memory areas have not been cached.
I got some clues from cache-oblivious algorithms, so I tried several ways to enhance this. So far, no improvement.
I think some clever code could reduce the elapsed time by a factor of 10 to 100. Please show me the way.
-------------------------------------------------------------------------
Appended (2011-04-01)
Damon~ thank you for your comment!
After reading your comment, I analyzed my code again and found several things that I had missed. The following code is an abbreviated version of my original code.
To accurately measure each operation's execution time (in the original code there are several different types of operations), I inserted time-measuring code using the clock_gettime() function. I thought that if I measured each operation's execution time and accumulated them, the additional cost of the main loop could be avoided.
In the original code, the time-measuring code was hidden by a macro function, so I totally forgot about it.
The running time of this code is almost 6 seconds. But if I remove the time-measuring function from the main loop, it becomes 0.1 seconds.
Since the clock_gettime() function supports very high precision (up to 1 nanosecond), is executed on the basis of an independent thread, and also requires a very big structure, I think the function caused the cache-out of the main memory area where the consecutive insertions are performed.
Thank you again for your comment. For further enhancement, any suggestion will be very helpful for me to optimize my code.
I think the hierarchically defined structure variable might cause unnecessary time cost, but first I want to know how much it would be, before changing it to more C-style code.
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

typedef unsigned int uint32;  // assumed typedefs/constants: not shown in the original post
typedef unsigned int uint;
#define NIL 0
#define OP_INS 0

typedef struct t_ptr {
    uint32 isleaf :1, isNextLeaf :1, ptr :30;
    t_ptr(void) {
        isleaf = false;
        isNextLeaf = false;
        ptr = NIL;
    }
} PTR;

typedef struct t_key {
    uint32 op :1, key :31;
    t_key(void) {
        op = OP_INS;
        key = 0;
    }
} KEY;

typedef struct t_key_pair {
    KEY key;
    PTR ptr;
    t_key_pair() {
    }
    t_key_pair(KEY k, PTR p) {
        key = k;
        ptr = p;
    }
} KeyPair;

typedef struct t_op {
    KeyPair keyPair;
    uint seq;
    t_op() {
        seq = 0;
    }
} OP;

#define MAX_OP_LEN 4000000
typedef struct t_opq {
    OP ops[MAX_OP_LEN];
    int freeOffset;
    int globalSeq;
    bool queueOp(register KeyPair keyPair);
} OpQueue;

bool OpQueue::queueOp(register KeyPair keyPair) {
    bool isFull = false;
    if (freeOffset == (int) (MAX_OP_LEN - 1)) {
        isFull = true;
    }
    ops[freeOffset].keyPair = keyPair;
    ops[freeOffset].seq = globalSeq++;
    freeOffset++;
    return isFull;  // the original fell off the end without returning a value
}

OpQueue opQueue;

int main() {
    struct timespec startTime, endTime, totalTime = { 0, 0 };
    for(int i = 0; i < 4000000; i++) {
        clock_gettime(CLOCK_REALTIME, &startTime);
        opQueue.queueOp(KeyPair());
        clock_gettime(CLOCK_REALTIME, &endTime);
        totalTime.tv_sec += (endTime.tv_sec - startTime.tv_sec);
        totalTime.tv_nsec += (endTime.tv_nsec - startTime.tv_nsec);
    }
    // accumulated time in microseconds
    printf("\n elapsed time: %lld", totalTime.tv_sec * 1000000LL + totalTime.tv_nsec / 1000L);
}
YOU don't put the structure into any cache. The CPU does that automatically for you. The CPU is even more clever than that: if you access sequential memory, it will start pulling things from memory into the cache before you read them.
And really, it should be common sense that for a simple bit of code like this, the time you spend measuring is ten times more than the time to perform the code (apparently 60 times in your case).
Since you put so much confidence in clock_gettime(): I suggest you call it five times in a row and store the results, then print the differences. There's resolution, there's precision, and there's how long it takes to return the current time, which is pretty damned long. A sketch of that experiment follows.
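A minimal sketch of that experiment (assuming POSIX clock_gettime; link with -lrt on older glibc):

#include <stdio.h>
#include <time.h>

// Call clock_gettime() back to back and print the deltas: the gaps expose the
// combined cost of the call itself plus the clock's effective resolution.
int main() {
    struct timespec t[5];
    for (int i = 0; i < 5; i++)
        clock_gettime(CLOCK_REALTIME, &t[i]);
    for (int i = 1; i < 5; i++) {
        long ns = (t[i].tv_sec - t[i-1].tv_sec) * 1000000000L
                + (t[i].tv_nsec - t[i-1].tv_nsec);
        printf("call %d -> %d: %ld ns\n", i - 1, i, ns);
    }
    return 0;
}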
I have been unable to force caching, but you can force memory to be uncacheable. If you have other large data structures, you might exclude them so that they do not pollute your caches. This can be done by specifying PAGE_NOCACHE for the Windows VirtualAlloc family of functions.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366786(v=vs.85).aspx
