OOP much slower than Structural programming. why and how can be fixed? - performance

as i mentioned on subject of this post i found out OOP is slower than Structural Programming(spaghetti code) in the hard way.
i writed a simulated annealing program with OOP then remove one class and write it structural in main form. suddenly it got much faster . i was calling my removed class in every iteration in OOP program.
also checked it with Tabu Search. Same result .
can anyone tell me why this is happening and how can i fix it on other OOP programs?
are there any tricks ? for example cache my classes or something like that?
(Programs has been written in C#)

If you have a high-frequency loop, and inside that loop you create new objects and don't call other functions very much, then, yes, you will see that if you can avoid those news, say by re-using one copy of the object, you can save a large fraction of total time.
Between new, constructors, destructors, and garbage collection, a very little code can waste a whole lot of time.
Use them sparingly.

Memory access is often overlooked. The way o.o. tends to lay out data in memory is not conducive to efficient memory access in practice in loops. Consider the following pseudocode:
adult_clients = 0
for client in list_of_all_clients:
if client.age >= AGE_OF_MAJORITY:
adult_clients++
It so happens that the way this is accessed from memory is quite inefficient on modern architectures because they like accessing large contiguous rows of memory, but we only care for client.age, and of all clients we have; those will not be laid out in contiguous memory.
Focusing on objects that have fields results into data being laid out in memory in such a way that fields that hold the same type of information will not be laid out in consecutive memory. Performance-heavy code tends to involve loops that often look at data with the same conceptual meaning. It is conducive to performance that such data be laid out in contiguous memory.
Consider these two examples in Rust:
// struct that contains an id, and an optiona value of whether the id is divisible by three
struct Foo {
id : u32,
divbythree : Option<bool>,
}
fn main () {
// create a pretty big vector of these structs with increasing ids, and divbythree initialized as None
let mut vec_of_foos : Vec<Foo> = (0..100000000).map(|i| Foo{ id : i, divbythree : None }).collect();
// loop over all hese vectors, determine if the id is divisible by three
// and set divbythree accordingly
let mut divbythrees = 0;
for foo in vec_of_foos.iter_mut() {
if foo.id % 3 == 0 {
foo.divbythree = Some(true);
divbythrees += 1;
} else {
foo.divbythree = Some(false);
}
}
// print the number of times it was divisible by three
println!("{}", divbythrees);
}
On my system, the real time with rustc -O is 0m0.436s; now let us consider this example:
fn main () {
// this time we create two vectors rather than a vector of structs
let vec_of_ids : Vec<u32> = (0..100000000).collect();
let mut vec_of_divbythrees : Vec<Option<bool>> = vec![None; vec_of_ids.len()];
// but we basically do the same thing
let mut divbythrees = 0;
for i in 0..vec_of_ids.len(){
if vec_of_ids[i] % 3 == 0 {
vec_of_divbythrees[i] = Some(true);
divbythrees += 1;
} else {
vec_of_divbythrees[i] = Some(false);
}
}
println!("{}", divbythrees);
}
This runs in 0m0.254s on the same optimization level, — close to half the time needed.
Despite having to allocate two vectors instead of of one, storing similar values in contiguous memory has almost halved the execution time. Though obviously the o.o. approach provides for much nicer and more maintainable code.
P.s.: it occurs to me that I should probably explain why this matters so much given that the code itself in both cases still indexes memory one field at a time, rather than, say, putting a large swath on the stack. The reason is c.p.u. caches: when the program asks for the memory at a certain address, it actually obtains, and caches, a significant chunk of memory around that address, and if memory next to it be asked quickly again, then it can serve it from the cache, rather than from actual physical working memory. Of course, compilers will also vectorize the bottom code more efficiently as a consequence.

Related

Am I using Rust's bitvec_simd in the most efficient way?

I'm using bitvec_simd = "0.20" for bitvector operations in rust.
I have two instances of a struct, call them clique_into and clique_from. The relevant fields of the struct are two bitvectors members_bv and neighbors_bv, as well as a vector of integers which is called members. The members_bv and members vector represent the same data.
After profiling my code, I find that this is my bottleneck (in here 41% of the time): checking whether the members (typically 1) of clique_from are all neighbors of clique_into.
My current approach is to loop through the members of clique_from (typically 1) and check each one in turn to see if it's a neighbor of clique_into.
Here's my code:
use bitvec_simd::BitVec;
use smallvec::{smallvec, SmallVec};
struct Clique {
members_bv: BitVec,
members: SmallVec<[usize; 256]>,
neighbors_bv: BitVec,
}
fn are_cliques_mergable(clique_into: &Clique, clique_from: &Clique) -> bool {
for i in 0..clique_from.members.len() {
if !clique_into.neighbors_bv.get_unchecked(clique_from.members[i]) {
return false;
}
}
return true;
}
That code works fine and it's fast, but is there a way to make it faster? We can assume that clique_from almost always has a single member so the inner for loop is almost always executed once.
It likely comes down to this:
if !clique_into.neighbors_bv.get_unchecked(clique_from.members[i])
Is get_unchecked() the fastest way to do this? While I have written this so it will never panic, the compiler doesn't know that. Does this force Rust to waste time checking if it should panic?

Removing mutability without losing speed

I have a function like this:
fun randomWalk(numSteps: Int): Int {
var n = 0
repeat(numSteps) { n += (-1 + 2 * Random.nextInt(2)) }
return n.absoluteValue
}
This works fine, except that it uses a mutable variable, and I would like to make everything immutable when possible, for better safety and readability. So I came up with an equivalent version that doesn't use any mutable variables:
fun randomWalk_seq(numSteps: Int): Int =
generateSequence(0) { it + (-1 + 2 * Random.nextInt(2)) }
.elementAt(numSteps)
.absoluteValue
This also works fine and produces the same results, but it takes 3 times longer.
I used the following way to measure it:
#OptIn(ExperimentalTime::class)
fun main() {
val numSamples = 100000
val numSteps = 15708
repeat(5) {
val randomWalkSamples: IntArray
val duration = measureTime {
randomWalkSamples = IntArray(numSamples) { randomWalk(numSteps) }
}
println(duration)
}
}
I know it's a bit hacky (I could have used JMH but this is just a quick test - at least I know that measureTime uses a monotonic clock). The results for the iterative (mutable) version:
2.965358406s
2.560777033s
2.554363661s
2.564279403s
2.608323586s
As expected, the first line shows it took a bit longer on the first run due to the warming up of the JIT, but the next 4 lines have fairly small variation.
After replacing randomWalk with randomWalk_seq:
6.636866719s
6.980840906s
6.993998111s
6.994038706s
7.018054467s
Somewhat surprisingly, I don't see any warmup time - the first line is always lesser duration than the following 4 lines, every time I run this. And also, every time I run it, the duration keeps increasing, with line 5 always being the greatest duration.
Can someone explain the findings, and also is there any way of making this function not use any mutable variables but still have performance that is close to the mutable version?
Your solution is slower for two main reasons: boxing and the complexity of the iterator used by generateSequence()'s Sequence implementation.
Boxing happens because a Sequence uses its types generically, so it cannot use primitive 32-bit Ints directly, but must wrap them in classes and unwrap them when retrieving the items.
You can see the complexity of the iterator by Ctrl+clicking the generateSequence function to view the source code.
#Михаил Нафталь's suggestion is faster because it avoids the complex iterator of the sequence, but it still has boxing.
I tried writing an overload of sumOf that uses IntProgression directly instead of Iterable<T>, so it won't use boxing, and that resulted in equivalent performance to your imperative code with the var. As you can see, it's inline and when put together with the { -1 + 2 * Random.nextInt(2) } lambda suggested by #Михаил Нафталь, then the resulting compiled code will be equivalent to your imperative code.
inline fun IntProgression.sumOf(selector: (Int) -> Int): Int {
var sum: Int = 0.toInt()
for (element in this) {
sum += selector(element)
}
return sum
}
Ultimately, I don't think you're buying yourself much in the way of code clarity by removing a single var in such a small function. I would say the sequence code is arguably harder to read. vars may add to code complexity in complex algorithms, but I don't think they do in such simple algorithms, especially when there's only one of them and it's local to the function.
Equivalent immutable one-liner is:
fun randomWalk2(numSteps: Int) =
(1..numSteps).sumOf { -1 + 2 * Random.nextInt(2) }.absoluteValue
Probably, even more performant would be to replace
with
so that you'll have one multiplication and n additions instead of n multiplications and (2*n-1) additions:
fun randomWalk3(numSteps: Int) =
(-numSteps + 2 * (1..numSteps).sumOf { Random.nextInt(2) }).absoluteValue
Update
As #Tenfour04 noted, there is no specific stdlib implementation for IntProgression.sumOf, so it's resolved to Iterable<T>.sumOf, which will add unnecessary overhead for int boxing.
So, it's better to use IntArray here instead of IntProgression:
fun randomWalk4(numSteps: Int) =
(-numSteps + 2 * IntArray(numSteps).sumOf { Random.nextInt(2) }).absoluteValue
Still encourage you to check this all with JMH
I think:"Removing mutability without losing speed" is wrong title .because
mutability thing comes to deal with the flow that program want to achieve .
you are using var inside function.... and 100% this var will not ever change from outside this function and that is mutability concept.
if we git rid off from var everywhere why we need it in programming ?

What is the time complexity performance of Scala's Vector data structure?

I know that most of the Vector methods are effectively O(1) (constant time) because of the tree they use, but I cannot find any information on the contains method. My first thought is that it would have to be O(n) to check all the elements but I am not sure.
Answering the question in the title, performance characteristics (2.13 docs version) of basic operations head, tail, apply, update, prepend, append, insert are all listed as eC for Vector:
eC The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.
You are correct contains is O(N), as there is no hashing or nothing else that would avoid the need to compare with all items. Still, if you want to be sure, it is best to check the implementation.
As finding the correct implementation in the library sources can be difficult because of many traits and overrides used to implement the containers, the best way to check this is the debugger. Use a code like:
val v = Vector(0, 1, 2)
v.contains(1)
Use the debugger to step into v.contains and the source you will see is:
def contains[A1 >: A](elem: A1): Boolean = exists (_ == elem)
If you are still not convinced at this point, some more "step into" will lead you to:
def exists(p: A => Boolean): Boolean = {
var res = false
while (!res && hasNext) res = p(next())
res
}

Randomly selecting elements from slices produced by a map in restricted key range in golang. Is there an O(1) shortcut?

In my program to simulate many-particle evolution, I have a map that takes a key value pop (the population size) and returns a slice containing the sites that have this population: myMap[pop][]int. These slices are generically quite large.
At each evolution step I choose a random population size RandomPop. I would then like to randomly choose a site that has a population of at least RandomPop. The sitechosen is used to update my population structures and I utilize a second map to efficiently update myMap keys. My current (slow) implementation looks like
func Evolve( ..., myMap map[int][]int ,...){
RandomPop = rand.Intn(rangeofpopulation)+1
for i:=RandPop,; i<rangeofpopulation;i++{
preallocatedslice=append(preallocatedslice,myMap[i]...)
}
randomindex:= rand.Intn(len(preallocatedslice))
sitechosen= preallocatedslice[randomindex]
UpdateFunction(site)
//reset preallocated slice
preallocatedslice=preallocatedslice[0:0]
}
This code (obviously) hits a huge bottle-neck when copying values from the map to preallocatedslice, with runtime.memmove eating 87% of my CPU usage. I'm wondering if there is an O(1) way to randomly choose an entry contained in the union of slices indicated by myMap with key values between 0 and RandomPop ? I am open to packages that allow you to manipulate custom hashtables if anyone is aware of them. Suggestions don't need to be safe for concurrency
Other things tried: I previously had my maps record all sites with values of at least pop but that took up >10GB of memory and was stupid. I tried stashing pointers to the relevant slices to make a look-up slice, but go forbids this. I could sum up the lengths of each slice and generate a random number based on this and then iterate through the slices in myMap by length, but this is going to be much slower than just keeping an updated cdf of my population and doing a binary search on it. The binary search is fast, but updating the cdf, even if done manually, is O(n). I was really hoping to abuse hashtables to speed up random selection and update if possible
A vague thought I have is concocting some sort of nested structure of maps pointing to their contents and also to the map with a key one less than theirs or something.
I was looking at your code and I have a question.
Why do you have to copy values from the map to the slice? I mean, I think that I am following the logic behind... but I wonder if there is a way to skip that step.
So we have:
func Evolve( ..., myMap map[int][]int ,...){
RandomPop = rand.Intn(rangeofpopulation)+1
for i:=RandPop,; i<rangeofpopulation;i++{
// slice of preselected `sites`. one of this will be 'siteChosen'
// we expect to have `n sites` on `preAllocatedSlice`
// where `n` is the amount of iterations,
// ie; n = rangeofpopulation - RandPop
preallocatedslice=append(preallocatedslice,myMap[i]...)
}
// Once we have a list of sites, we select `one`
// under a normal distribution every site ha a chance of 1/n to be selected.
randomindex:= rand.Intn(len(preallocatedslice))
sitechosen= preallocatedslice[randomindex]
UpdateFunction(site)
...
}
But what if we change that to:
func Evolve( ..., myMap map[int][]int ,...){
if len(myMap) == 0 {
// Nothing to do, print a log!
return
}
// This variable will hold our site chosen!
var siteChosen []int
// Our random population size is a value from 1 to rangeOfPopulation
randPopSize := rand.Intn(rangeOfPopulation) + 1
for i := randPopSize; i < rangeOfPopulation; i++ {
// We are going to pretend that the current candidate is the siteChosen
siteChosen = myMap[i]
// Now, instead of copying `myMap[i]` to preAllocatedSlice
// We will test if the current candidate is actually the 'siteChosen` here:
// We know that the chances for an specific site to be the chosen is 1/n,
// where n = rangeOfPopulation - randPopSize
n := float64(rangeOfPopulation - randPopSize)
// we roll the dice...
isTheChosenOne := rand.Float64() > 1/n
if isTheChosenOne {
// If the candidate is the Chosen site,
// then we don't need to iterate over all the other elements.
break
}
}
// here we know that `siteChosen` is a.- a selected candidate, or
// b.- the last element assigned in the loop
// (in the case that `isTheChosenOne` was always false [which is a probable scenario])
UpdateFunction(siteChosen)
...
}
Also if you want to can calculate n, or 1/n outside the loop.
So the idea is testing inside the loop if the candidate is the siteChosen, and avoid copying the candidates to this preselection pool.

Scala: Mutable vs. Immutable Object Performance - OutOfMemoryError

I wanted to compare the performance characteristics of immutable.Map and mutable.Map in Scala for a similar operation (namely, merging many maps into a single one. See this question). I have what appear to be similar implementations for both mutable and immutable maps (see below).
As a test, I generated a List containing 1,000,000 single-item Map[Int, Int] and passed this list into the functions I was testing. With sufficient memory, the results were unsurprising: ~1200ms for mutable.Map, ~1800ms for immutable.Map, and ~750ms for an imperative implementation using mutable.Map -- not sure what accounts for the huge difference there, but feel free to comment on that, too.
What did surprise me a bit, perhaps because I'm being a bit thick, is that with the default run configuration in IntelliJ 8.1, both mutable implementations hit an OutOfMemoryError, but the immutable collection did not. The immutable test did run to completion, but it did so very slowly -- it takes about 28 seconds. When I increased the max JVM memory (to about 200MB, not sure where the threshold is), I got the results above.
Anyway, here's what I really want to know:
Why do the mutable implementations run out of memory, but the immutable implementation does not? I suspect that the immutable version allows the garbage collector to run and free up memory before the mutable implementations do -- and all of those garbage collections explain the slowness of the immutable low-memory run -- but I'd like a more detailed explanation than that.
Implementations below. (Note: I don't claim that these are the best implementations possible. Feel free to suggest improvements.)
def mergeMaps[A,B](func: (B,B) => B)(listOfMaps: List[Map[A,B]]): Map[A,B] =
(Map[A,B]() /: (for (m <- listOfMaps; kv <-m) yield kv)) { (acc, kv) =>
acc + (if (acc.contains(kv._1)) kv._1 -> func(acc(kv._1), kv._2) else kv)
}
def mergeMutableMaps[A,B](func: (B,B) => B)(listOfMaps: List[mutable.Map[A,B]]): mutable.Map[A,B] =
(mutable.Map[A,B]() /: (for (m <- listOfMaps; kv <- m) yield kv)) { (acc, kv) =>
acc + (if (acc.contains(kv._1)) kv._1 -> func(acc(kv._1), kv._2) else kv)
}
def mergeMutableImperative[A,B](func: (B,B) => B)(listOfMaps: List[mutable.Map[A,B]]): mutable.Map[A,B] = {
val toReturn = mutable.Map[A,B]()
for (m <- listOfMaps; kv <- m) {
if (toReturn contains kv._1) {
toReturn(kv._1) = func(toReturn(kv._1), kv._2)
} else {
toReturn(kv._1) = kv._2
}
}
toReturn
}
Well, it really depends on what the actual type of Map you are using. Probably HashMap. Now, mutable structures like that gain performance by pre-allocating memory it expects to use. You are joining one million maps, so the final map is bound to be somewhat big. Let's see how these key/values get added:
protected def addEntry(e: Entry) {
val h = index(elemHashCode(e.key))
e.next = table(h).asInstanceOf[Entry]
table(h) = e
tableSize = tableSize + 1
if (tableSize > threshold)
resize(2 * table.length)
}
See the 2 * in the resize line? The mutable HashMap grows by doubling each time it runs out of space, while the immutable one is pretty conservative in memory usage (though existing keys will usually occupy twice the space when updated).
Now, as for other performance problems, you are creating a list of keys and values in the first two versions. That means that, before you join any maps, you already have each Tuple2 (the key/value pairs) in memory twice! Plus the overhead of List, which is small, but we are talking about more than one million elements times the overhead.
You may want to use a projection, which avoids that. Unfortunately, projection is based on Stream, which isn't very reliable for our purposes on Scala 2.7.x. Still, try this instead:
for (m <- listOfMaps.projection; kv <- m) yield kv
A Stream doesn't compute a value until it is needed. The garbage collector ought to collect the unused elements as well, as long as you don't keep a reference to the Stream's head, which seems to be the case in your algorithm.
EDIT
Complementing, a for/yield comprehension takes one or more collections and return a new collection. As often as it makes sense, the returning collection is of the same type as the original collection. So, for example, in the following code, the for-comprehension creates a new list, which is then stored inside l2. It is not val l2 = which creates the new list, but the for-comprehension.
val l = List(1,2,3)
val l2 = for (e <- l) yield e*2
Now, let's look at the code being used in the first two algorithms (minus the mutable keyword):
(Map[A,B]() /: (for (m <- listOfMaps; kv <-m) yield kv))
The foldLeft operator, here written with its /: synonymous, will be invoked on the object returned by the for-comprehension. Remember that a : at the end of an operator inverts the order of the object and the parameters.
Now, let's consider what object is this, on which foldLeft is being called. The first generator in this for-comprehension is m <- listOfMaps. We know that listOfMaps is a collection of type List[X], where X isn't really relevant here. The result of a for-comprehension on a List is always another List. The other generators aren't relevant.
So, you take this List, get all the key/values inside each Map which is a component of this List, and make a new List with all of that. That's why you are duplicating everything you have.
(in fact, it's even worse than that, because each generator creates a new collection; the collections created by the second generator are just the size of each element of listOfMaps though, and are immediately discarded after use)
The next question -- actually, the first one, but it was easier to invert the answer -- is how the use of projection helps.
When you call projection on a List, it returns new object, of type Stream (on Scala 2.7.x). At first you may think this will only make things worse, because you'll now have three copies of the List, instead of a single one. But a Stream is not pre-computed. It is lazily computed.
What that means is that the resulting object, the Stream, isn't a copy of the List, but, rather, a function that can be used to compute the Stream when required. Once computed, the result will be kept so that it doesn't need to be computed again.
Also, map, flatMap and filter of a Stream all return a new Stream, which means you can chain them all together without making a single copy of the List which created them. Since for-comprehensions with yield use these very functions, the use of Stream inside the prevent unnecessary copies of data.
Now, suppose you wrote something like this:
val kvs = for (m <- listOfMaps.projection; kv <-m) yield kv
(Map[A,B]() /: kvs) { ... }
In this case you aren't gaining anything. After assigning the Stream to kvs, the data hasn't been copied yet. Once the second line is executed, though, kvs will have computed each of its elements, and, therefore, will hold a complete copy of the data.
Now consider the original form::
(Map[A,B]() /: (for (m <- listOfMaps.projection; kv <-m) yield kv))
In this case, the Stream is used at the same time it is computed. Let's briefly look at how foldLeft for a Stream is defined:
override final def foldLeft[B](z: B)(f: (B, A) => B): B = {
if (isEmpty) z
else tail.foldLeft(f(z, head))(f)
}
If the Stream is empty, just return the accumulator. Otherwise, compute a new accumulator (f(z, head)) and then pass it and the function to the tail of the Stream.
Once f(z, head) has executed, though, there will be no remaining reference to the head. Or, in other words, nothing anywhere in the program will be pointing to the head of the Stream, and that means the garbage collector can collect it, thus freeing memory.
The end result is that each element produced by the for-comprehension will exist just briefly, while you use it to compute the accumulator. And this is how you save keeping a copy of your whole data.
Finally, there is the question of why the third algorithm does not benefit from it. Well, the third algorithm does not use yield, so no copy of any data whatsoever is being made. In this case, using projection only adds an indirection layer.

Resources