Speedup problems with Go - performance

I wrote a very simple program in Go to test the performance of a parallel program: it factorizes a big semiprime number by trial division. Since no communication is involved, I expected an almost perfect speedup. However, the program seems to scale very badly.
I timed the program with 1, 2, 4, and 8 processes, running on an 8-core (real, not HT) computer, using the system time command. The number I factorized is "28808539627864609". Here are my results:
cores  time (sec)  speedup
1      60.0153     1
2      47.358      1.27
4      34.459      1.75
8      28.686      2.10
How can such bad speedups be explained? Is it a bug in my program, or is it a problem with the Go runtime? How could I get better performance? I'm not talking about the algorithm itself (I know there are better algorithms for factorizing semiprime numbers), but about the way I parallelized it.
Here is the source code of my program:
package main

import (
    "big"
    "flag"
    "fmt"
    "runtime"
)

func factorize(n *big.Int, start int, step int, c chan *big.Int) {
    var m big.Int
    i := big.NewInt(int64(start))
    s := big.NewInt(int64(step))
    z := big.NewInt(0)
    for {
        m.Mod(n, i)
        if m.Cmp(z) == 0 {
            c <- i
        }
        i.Add(i, s)
    }
}

func main() {
    var np *int = flag.Int("n", 1, "Number of processes")
    flag.Parse()
    runtime.GOMAXPROCS(*np)
    var n big.Int
    n.SetString(flag.Arg(0), 10) // Uses number given on command line
    c := make(chan *big.Int)
    for i := 0; i < *np; i++ {
        go factorize(&n, 2+i, *np, c)
    }
    fmt.Println(<-c)
}
EDIT
The problem really seems to be related to the Mod function. Replacing it with Rem gives better, but still imperfect, performance and speedup. Replacing it with QuoRem gives performance 3 times faster and a perfect speedup. Conclusion: it seems that memory allocation kills parallel performance in Go. Why? Do you have any references about this?

Big.Int methods generally have to allocate memory, usually to hold the result of the computation. The problem is that there is just one heap and all memory operations are serialized. In this program, the numbers are fairly small and the (parallelizable) computation time needed for things like Mod and Add is small compared to the non-parallelizable operations of repeatedly allocating all the tiny little bits of memory.
As far as speeding it up, there is the obvious answer of don't use big.Ints if you don't have to. Your example number happens to fit in 64 bits. If you plan on working with really big big numbers though, the problem will kind of go away on its own. You will spend much more time doing computations, and the time spent in the heap will be relatively much less.
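For example, here is a minimal sketch of the same parallel trial division using plain uint64 arithmetic instead of big.Int (my illustration; the factorize64 helper is not part of the original program). With no per-iteration allocations, the workers no longer contend on the heap:

// Sketch: trial division on a uint64, avoiding big.Int allocations entirely.
// Assumes the number fits in 64 bits and is composite, as in the question
// (if it were prime, main would block forever waiting for a factor).
package main

import (
    "fmt"
    "runtime"
)

func factorize64(n, start, step uint64, c chan uint64) {
    for i := start; i*i <= n; i += step {
        if n%i == 0 {
            c <- i
            return
        }
    }
}

func main() {
    const n = 28808539627864609
    np := runtime.NumCPU()
    runtime.GOMAXPROCS(np)
    c := make(chan uint64)
    for i := 0; i < np; i++ {
        go factorize64(n, uint64(2+i), uint64(np), c)
    }
    fmt.Println(<-c) // first factor found
}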
There is a bug in your program, by the way, although it's not related to performance. When you find a factor and return the result on the channel, you send a pointer to the local variable i. This is fine, except that you don't break out of the loop then. The loop in the goroutine continues incrementing i and by the time the main goroutine gets around to fishing the pointer out of the channel and following it, the value is almost certain to be wrong.

After sending i through the channel, i should be replaced with a newly allocated big.Int:
if m.Cmp(z) == 0 {
    c <- i
    i = new(big.Int).Set(i)
}
This is necessary because there is no guarantee when fmt.Println will process the integer received on the line fmt.Println(<-c). It isn't unusual for fmt.Println to cause goroutine switching, so if i isn't replaced with a newly allocated big.Int and the run-time switches back to executing the for-loop in function factorize, then the for-loop will overwrite i before it is printed - in which case the program won't print out the 1st integer sent through the channel.
The fact that fmt.Println can cause goroutine switching means that the for-loop in function factorize may potentially consume a lot of CPU time between the moment when the main goroutine receives from channel c and the moment when the main goroutine terminates. Something like this:
run factorize()
<-c in main()
call fmt.Println()
continue running factorize() // Unnecessary CPU time consumed
return from fmt.Println()
return from main() and terminate program
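One way to avoid that wasted work (a sketch of my own, not part of the original answer) is to give each worker a done channel that it polls on every iteration, and to close that channel from main once a factor has been received:

// Sketch: the done channel is an addition; closing it tells every worker to stop.
func factorize(n *big.Int, start int, step int, c chan *big.Int, done chan bool) {
    var m big.Int
    i := big.NewInt(int64(start))
    s := big.NewInt(int64(step))
    z := big.NewInt(0)
    for {
        select {
        case <-done:
            return // a factor was already reported; stop burning CPU
        default:
        }
        m.Mod(n, i)
        if m.Cmp(z) == 0 {
            select {
            case c <- i:
            case <-done:
            }
            return
        }
        i.Add(i, s)
    }
}

// In main, after the workers are started:
//     fmt.Println(<-c)
//     close(done) // stop the remaining workers before exiting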
Another reason for the small multi-core speedup is memory allocation. The function (*Int).Mod is internally using (*Int).QuoRem and will create a new big.Int each time it is called. To avoid the memory allocation, use QuoRem directly:
func factorize(n *big.Int, start int, step int, c chan *big.Int) {
    var q, r big.Int
    i := big.NewInt(int64(start))
    s := big.NewInt(int64(step))
    z := big.NewInt(0)
    for {
        q.QuoRem(n, i, &r)
        if r.Cmp(z) == 0 {
            c <- i
            i = new(big.Int).Set(i)
        }
        i.Add(i, s)
    }
}
Unfortunately, the goroutine scheduler in Go release r60.3 contains a bug which prevents this code from using all CPU cores. When the program is started with -n=2 (GOMAXPROCS=2), the run-time will utilize only 1 thread.
The Go weekly release has a better run-time and can utilize 2 threads if -n=2 is passed to the program. This gives a speedup of approximately 1.9 on my machine.
Another potential contributing factor to multi-core slowdown has been mentioned in the answer by user "High Performance Mark". If the program is splitting the work into multiple sub-tasks and the result comes only from 1 sub-task, it means that the other sub-tasks may do some "extra work". Running the program with n>=2 may in total consume more CPU time than running the program with n=1.
To learn how much extra work is being done, you may want to (somehow) print out the values of all the i's in all goroutines at the moment the program is exiting the function main().

I don't read go so this is probably the answer to a question which is not what you asked. If so, downvote or delete as you wish.
If you were to make a plot of 'time to factorise integer n' against 'n' you would get a plot that goes up and down somewhat randomly. For any n you choose there will be an integer in the range 1..n that takes longest to factorise on one processor. If your parallelisation strategy is to distribute the n integers across p processors one of those processors will take at least the time to factorise the hardest integer, then the time to factorise the rest of its load.
Perhaps you've done something similar?

Related

How can one determine the order of receiving from channels?

Consider the following example from the Tour of Go.
How can one determine the order of reception from the channels?
Why does x always get the first output from the goroutine?
It sounds reasonable, but I didn't find any documentation about it.
I tried to add some sleep and x still gets the input from the first executed goroutine.
c := make(chan int)
go sumSleep(s[:len(s)/2], c)
go sum(s[len(s)/2:], c)
x, y := <-c, <-c // receive from c
fmt.Println(x, y, x+y)
The sleep is before the sending to the channel.
Messages are always received in the order they are sent. That is deterministic.
However, the order of execution of any given operation across concurrent Goroutines is not deterministic. So if you have two goroutines concurrently sending on a channel, you can't know which will send first and which will send second. Same if you have two goroutines receiving on the same channel.
I tried to add some sleep and x still gets the input from the first executed goroutine
In addition to what @Adrian wrote, in your code x will always get the result from the first receive on c, because of the language rules for tuple assignment:
The assignment proceeds in two phases. First, the operands of index expressions and pointer indirections (including implicit pointer indirections in selectors) on the left and the expressions on the right are all evaluated in the usual order. Second, the assignments are carried out in left-to-right order.
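A tiny illustration of that rule (my addition, not from the original answer): with a buffered channel, the values come back in the order they were sent, and the first receive expression is assigned to x:

// Sketch: receives are evaluated left to right, so x gets the first value sent.
package main

import "fmt"

func main() {
    c := make(chan int, 2)
    c <- 1 // sent first
    c <- 2 // sent second
    x, y := <-c, <-c
    fmt.Println(x, y) // prints: 1 2
}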
To add a bit to Adrian's answer: we don't know in what order the two goroutines might run. If your sleep comes before your channel-send, and sleeps "long enough" [1], that will guarantee that the other goroutine can run to the point of doing its send. If both goroutines run "at the same time" and neither waits (as in the original Tour example), we cannot be sure which one will actually reach its c <- sum line first.
Running the Tour's example on the Go Playground (either directly, or through the Tour website), I actually get:
-5 17 12
in the output window, which (because we know that the -9 is in the 2nd half of the slice) tells us that the second goroutine "got there" (to the channel-send) first. In some sense, that's just luck—but when using the Go Playground, all jobs are run in a fairly deterministic environment, with a single CPU and with cooperative scheduling, so that the results are more predictable. In other words, if the second goroutine got there first on one run, it probably will on the next. If the playground used multiple CPUs and/or a less-deterministic environment, the results might change from one run to the next, but there is no guarantee of that.
In any case, assuming your code does what you say (and I believe it does), this:
go sumSleep(s[:len(s)/2], c)
go sum(s[len(s)/2:], c)
has the first sender wait and the second sender run first. But that's what we already observed actually happened when we let the two routines race. To see a change, we'd need to make the second sender delay.
I made a modified version of the example in the Go Playground here that prints more annotations. With the delay inserted in the 2nd half sum, we see the first-half sum as x:
2nd half: sleeping for 1s
1st half: sleeping for 0s
1st half: sending 17
2nd half: sending -5
17 -5 12
as we can expect since one second is "long enough".
[1] How long is "long enough"? Well, that depends: how fast are our computers? How much other stuff are they doing? If the computer is fast enough, a delay of a few milliseconds, or even a few nanoseconds, may be enough. If our computer is a really old one or very busy with other higher priority tasks, a few milliseconds might not be enough time. If the problem is sufficiently big, one second might not be enough time. It's often unwise to choose some particular amount of time, if you can control this better by some sort of synchronization operation, and usually you can. For instance, using a sync.WaitGroup variable allows you to wait for n goroutines (for some runtime value of n) to call a Done function before your own goroutine proceeds.
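For instance, here is a minimal sketch of that sync.WaitGroup approach applied to the sum example (my illustration; it drops the channel and collects results in a slice just to keep the sketch short):

// Sketch: wait for both workers with a WaitGroup instead of a timed sleep.
package main

import (
    "fmt"
    "sync"
)

func main() {
    s := []int{7, 2, 8, -9, 4, 0}
    results := make([]int, 2)
    var wg sync.WaitGroup
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(i int, part []int) {
            defer wg.Done()
            sum := 0
            for _, v := range part {
                sum += v
            }
            results[i] = sum // each goroutine writes its own slot
        }(i, s[i*len(s)/2:(i+1)*len(s)/2])
    }
    wg.Wait() // blocks until both goroutines have called Done
    fmt.Println(results[0], results[1], results[0]+results[1]) // prints: 17 -5 12
}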
Playground code, copied to StackOverflow for convenience
package main

import (
    "fmt"
    "time"
)

func sum(s []int, c chan int, printme string, delay time.Duration) {
    sum := 0
    for _, v := range s {
        sum += v
    }
    fmt.Printf("%s: sleeping for %v\n", printme, delay)
    time.Sleep(delay)
    fmt.Printf("%s: sending %d\n", printme, sum)
    c <- sum
}

func main() {
    s := []int{7, 2, 8, -9, 4, 0}
    c := make(chan int)
    go sum(s[:len(s)/2], c, "1st half", 0*time.Second)
    go sum(s[len(s)/2:], c, "2nd half", 1*time.Second)
    x, y := <-c, <-c
    fmt.Println(x, y, x+y)
}

Is there a difference in Go between a counter using atomic operations and one using a mutex?

I have seen some discussion lately about whether there is a difference between a counter implemented using atomic increment/load, and one using a mutex to synchronise increment/load.
Are the following counter implementations functionally equivalent?
type Counter interface {
    Inc()
    Load() int64
}

// Atomic Implementation
type AtomicCounter struct {
    counter int64
}

func (c *AtomicCounter) Inc() {
    atomic.AddInt64(&c.counter, 1)
}

func (c *AtomicCounter) Load() int64 {
    return atomic.LoadInt64(&c.counter)
}

// Mutex Implementation
type MutexCounter struct {
    counter int64
    lock    sync.Mutex
}

func (c *MutexCounter) Inc() {
    c.lock.Lock()
    defer c.lock.Unlock()
    c.counter++
}

func (c *MutexCounter) Load() int64 {
    c.lock.Lock()
    defer c.lock.Unlock()
    return c.counter
}
I have run a bunch of test cases (Playground Link) and haven't been able to see any different behaviour. Running the tests on my machine the numbers get printed out of order for all the PrintAll test functions.
Can someone confirm whether they are equivalent or if there are any edge cases where these are different? Is there a preference to use one technique over the other? The atomic documentation does say it should only be used in special cases.
Update:
The original question that caused me to ask this was this one; however, it is now on hold, and I feel this aspect deserves its own discussion. In the answers it seemed that using a mutex would guarantee correct results, whereas atomics might not, specifically if the program is running on multiple threads. My questions are:
Is it correct that they can produce different results? (See the update below; the answer is yes.)
What causes this behaviour?
What are the tradeoffs between the two approaches?
Another Update:
I've found some code where the two counters behave differently. When run on my machine this function will finish with MutexCounter, but not with AtomicCounter. Don't ask me why you would ever run this code:
func TestCounter(counter Counter) {
    end := make(chan interface{})
    for i := 0; i < 1000; i++ {
        go func() {
            r := rand.New(rand.NewSource(time.Now().UnixNano()))
            for j := 0; j < 10000; j++ {
                k := int64(r.Uint32())
                if k >= 0 {
                    counter.Inc()
                }
            }
        }()
    }
    go func() {
        prevValue := int64(0)
        for counter.Load() != 10000000 { // Sometimes this condition is never met with AtomicCounter.
            val := counter.Load()
            if val%1000000 == 0 && val != prevValue {
                prevValue = val
            }
        }
        end <- true
        fmt.Println("Count:", counter.Load())
    }()
    <-end
}
There is no difference in behavior. There is a difference in performance.
Mutexes are slow, due to the setup and teardown, and due to the fact that they block other goroutines for the duration of the lock.
Atomic operations are fast because they use an atomic CPU instruction (when possible), rather than relying on external locks.
Therefore, whenever it is feasible, atomic operations should be preferred.
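For a rough comparison, here is a minimal benchmark sketch (my addition; it assumes the Counter, AtomicCounter and MutexCounter types from the question and lives in a _test.go file in the same package):

// Sketch: run both counters under parallel load with "go test -bench=.".
package main

import "testing"

func benchmarkCounter(b *testing.B, c Counter) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            c.Inc() // hammer the counter from GOMAXPROCS goroutines
        }
    })
}

func BenchmarkAtomicCounter(b *testing.B) { benchmarkCounter(b, &AtomicCounter{}) }
func BenchmarkMutexCounter(b *testing.B)  { benchmarkCounter(b, &MutexCounter{}) }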
Alright, I'm going to attempt to self-answer for some closure. Edits are welcome.
There is some discussion about the atomic package here. But to quote the most telling comments:
The very short summary is that if you have to ask, you should probably
avoid the package. Or, read the atomic operations chapter of the
C++11 standard; if you understand how to use those operations safely
in C++, then you are more than capable of using Go's sync/atomic
package.
That said, sticking to atomic.AddInt32 and atomic.LoadInt32 is safe as
long as you are just reporting statistical information, and not
actually relying on the values carrying any meaning about the state of
the different goroutines.
And:
What atomicity does not guarantee, is any ordering of
observability of values. I mean, atomic.AddInt32() does only guarantee
that what this operation stores at &cnt will be exactly *cnt + 1 (with
the value of *cnt being what the CPU executing the active goroutine
fetched from memory when the operation started); it does not provide any
guarantee that if another goroutine will attempt to read this value at
the same time it will fetch that same value *cnt + 1.
On the other hand, mutexes and channels guarantee strict ordering
of accesses to values being shared/passed around (subject to the rules
of Go memory model).
In regards to why the code sample in the question never finishes: this is due to the fact that the func reading the counter is in a very tight loop. When using the atomic counter, there are no synchronisation events (e.g. mutex lock/unlock, syscalls), which means that the goroutine never yields control. The result is that this goroutine starves the thread it is running on and prevents the scheduler from allocating time to any of the other goroutines allocated to that thread, including the ones that increment the counter, so the counter never reaches 10000000.
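A sketch of one way to break that starvation (my illustration, not from the original discussion) is to yield explicitly inside the busy-wait goroutine from TestCounter:

// Sketch: runtime.Gosched() yields the thread so the incrementing goroutines
// can run. Requires importing "runtime"; counter and end come from TestCounter.
go func() {
    for counter.Load() != 10000000 {
        runtime.Gosched() // let other goroutines on this thread make progress
    }
    end <- true
    fmt.Println("Count:", counter.Load())
}()

With this change the atomic version also terminates; the mutex version never hit the problem because Lock/Unlock already give the scheduler a chance to run other goroutines.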
Atomics are faster in the common case: the compiler translates each call to a function from the sync/atomic package into a special set of machine instructions which basically operate at the CPU level. For instance, on x86 architectures, atomic.AddInt64 is translated to a plain ADD-class instruction carrying a LOCK prefix, with the latter ensuring a coherent view of the updated memory location across all the CPUs in the system.
A mutex is a much more complicated thing, as it ultimately wraps some bit of the native OS-specific thread synchronization API (for instance, on Linux, that's futex).
On the other hand, the Go runtime is heavily optimized when it comes to synchronization (which is to be expected, given that this is one of the main selling points of Go), and the mutex implementation tries to avoid hitting the kernel to synchronize between goroutines, carrying it out completely in the Go runtime itself whenever possible.
This might explain why there is no noticeable difference in the timings in your benchmarks, provided the contention over the mutexes was reasonably low.
Still, I feel obliged to note, just in case, that atomics and higher-level synchronization facilities are designed to solve different tasks. Say, you can't use atomics to protect some memory state for the duration of a whole function, or, in the general case, even of a single statement.
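As a quick sketch of that distinction (my addition): keeping two related fields consistent requires a mutex held across both updates, something individual atomic operations cannot give you:

// Sketch: sum and count must change together; a mutex protects the pair,
// while two separate atomic adds could be observed in an inconsistent state.
// Assumes "sync" is imported.
type Stats struct {
    mu    sync.Mutex
    sum   int64
    count int64
}

func (s *Stats) Record(v int64) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.sum += v
    s.count++
}

func (s *Stats) Mean() float64 {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.count == 0 {
        return 0
    }
    return float64(s.sum) / float64(s.count)
}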

What is the order that things are happening?

I started playing around with Go by going through the Tour of Go on the official site.
I only have basic experience with programming, but on reaching the channels page I started to play around to try to get my head around it, and I've ended up quite confused.
This is what I have the code as:
package main

import "fmt"

func sum(s []int, c chan int) {
    sum := 0
    s[0] = 8
    s = append(s, 20)
    fmt.Println(s)
    for _, v := range s {
        sum += v
    }
    c <- sum // send sum to c
}

func main() {
    s := []int{7, 2, 8, -9, 4, 0}
    c := make(chan int)
    go sum(s[:len(s)/2], c)
    fmt.Println(s[0])
    go sum(s[len(s)/2:], c)
    fmt.Println(s)
    x, y := <-c, <-c // receive from c
    fmt.Println(x, y, x+y)
    fmt.Println(s)
}
And this is the result I get:
7
[8 2 8 20 4 0]
[8 2 8 20]
[8 4 0 20]
26 32 58
[8 2 8 8 4 0]
I get that creating a slice gives you an underlying array of the needed size, that passing a slice to a function lets it modify that underlying array, and that modifying an element modifies the underlying array, but I'm not sure in what order the goroutines are running.
It prints out the first s[0] as 7 even though the function call before it should have modified it to 8, so I'm assuming that the goroutine hasn't run yet?
Then printing the entire array after the second goroutine function call prints it with all the modifications made during the first function call (modifying the first item to 8 and appending 20).
Yet the next two printed lines are the printouts of the slice segments from within the functions, which, given that I just said I saw the modifications done by the function call, logically should have been printed BEFORE that line.
Then I'm not sure how it got the calculations or the final printout of the array.
Can someone familiar with how Go operates explain what the logical progression of this code was?
It prints out the first s[0] as 7 even though the function call before it should have modified it to 8, so I'm assuming that the goroutine hasn't run yet?
Yes, this interpretation is correct. (But should not be made, see next section).
Then printing the entire array after the second goroutine function call prints it with all the modifications made during the first function call (modifying the first item to 8 and appending 20).
NO. Never ever think along these lines.
The state of your application at this point is not well defined. 1) You started goroutines which modify the state and you have absolutely no synchronisation between the three goroutines (the main one and the two you started manually). 2) Your code is racy which means it is a) wrong, racy programs are never correct and b) results in undefined behaviour.
Yet the next two printed lines are the printouts of the slice segments from within the functions, which, given that I just said I saw the modifications done by the function call, logically should have been printed BEFORE that line.
More or less, if your program would not be racy (see above).
Where is your code racy: main.s is a slice with a backing array, and you subslice it twice when starting the goroutines with go sum(...). Both subslices share the same backing array. Now, inside the goroutines, you write (s[0] = 8) and append (append(s, 20)) to this subslice. From both goroutines, without synchronisation. The first goroutine will fulfill this append using the large-enough backing array which is concurrently used by the second goroutine.
This results in concurrent write without synchronisation to the fourth element of main.s which is racy and thus undefined.
With reading twice from the channel (<-c, <-c) you introduce synchronisation between the two manually started goroutines and the main goroutine, and your program returns to serial execution (but its state is still not well defined, as it is racy).
The Println statements you use to track what is going on are problematic too: These statements print stuff which is in a not well defined state because it is accessed concurrently from different goroutines; both may print arbitrary data.
Summing up: your program is racy, its output is undefined, and there is no "reason" or "explanation" why it prints what it prints; it's pure chance.
Always try your code under the race detector (-race) and do not try to make up pseudo-explanations for racy code.
How to fix:
The divide-and-conquer approach is nice and okay. What is dangerous, and thus easy to get wrong (as happened to you), is modifying the subslice inside the goroutine. While s[0] = 8 is harmless in your case, the append interferes badly with the shared backing array.
Do not modify s inside the goroutine. If you must: Make sure you have a new backing array.
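A minimal sketch of that advice (my addition): copy the subslice before touching it, so the goroutine owns its backing array:

// Sketch: give each goroutine its own backing array before modifying anything.
func sum(s []int, c chan int) {
    local := make([]int, len(s))
    copy(local, s) // local now has its own backing array
    local[0] = 8
    local = append(local, 20)
    sum := 0
    for _, v := range local {
        sum += v
    }
    c <- sum
}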
(Do not try to mix slices and concurrency. Both have some subtleties, slices the shared backing array and concurrency the problem of undefined racy behaviour. Especially if you are new to all this. Learn one, then the other, then combine both.)

How does goroutines behave on a multi-core processor

I am a newbie in Go language, so please excuse me if my question is very basic. I have written a very simple code:
func main() {
    var count int // Default 0
    cptr := &count
    go incr(cptr)
    time.Sleep(100)
    fmt.Println(*cptr)
}

// Increments the value of count through pointer var
func incr(cptr *int) {
    for i := 0; i < 1000; i++ {
        go func() {
            fmt.Println(*cptr)
            *cptr = *cptr + 1
        }()
    }
}
The value of count should increment by one the number of times the loop runs. Consider the cases:
Loop runs for 100 times--> value of count is 100 (Which is correct as the loop runs 100 times).
Loop runs for >510 times --> Value of count is either 508 OR 510. This happens even if it is 100000.
I am running this on an 8 core processor machine.
First of all: prior to Go 1.5, the runtime runs goroutines on a single processor, only using multiple threads for blocking system calls, unless you tell it to use more processors via GOMAXPROCS.
As of Go 1.5, GOMAXPROCS defaults to the number of CPUs (see the Go 1.5 release notes).
Also, the operation *cptr = *cptr + 1 is not guaranteed to be atomic. If you look carefully, it can be split up into 3 operations: fetch old value by dereferencing pointer, increment value, save value into pointer address.
The fact that you're getting 508/510 is due to some magic in the runtime and not defined to stay that way. More information on the behaviour of operations with concurrency can be found in the Go memory model.
You're probably getting the correct values for <510 started goroutines because any number below these are not (yet) getting interrupted.
Generally, what you're trying to do is neither recommendable in any language, nor the "Go" way to do concurrency. A very good example of using channels to synchronize is this code walk: Share Memory By Communicating (rather than communicating by sharing memory)
Here is a little example to show you what I mean: use a channel with a buffer of 1 to store the current number, fetch it from the channel when you need it, change it at will, then put it back for others to use.
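Here is a rough sketch of that idea (my own illustration, not the linked code walk): the buffered channel holds the current value, and whoever takes it out has exclusive access until it is put back:

// Sketch: a channel with a buffer of 1 acts as a lock around the counter value.
package main

import "fmt"

func main() {
    counter := make(chan int, 1)
    done := make(chan bool)
    counter <- 0 // store the initial value
    for i := 0; i < 1000; i++ {
        go func() {
            v := <-counter   // take the value; other goroutines must wait
            counter <- v + 1 // put the incremented value back
            done <- true
        }()
    }
    for i := 0; i < 1000; i++ {
        <-done // wait for every goroutine to finish
    }
    fmt.Println(<-counter) // prints 1000
}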
Your code is racy: you write to the same memory location from different, unsynchronized goroutines without any locking. The result is basically undefined. You must either a) make sure that all the goroutines write one after another in a nice, ordered way, or b) protect each write with e.g. a mutex, or c) use atomic operations.
If you write such code: Always try it under the race detector like $ go run -race main.go and fix all races.
A nice alternative to using channels in this case might be the sync/atomic package, which contains specifically functions for atomically incrementing/decrementing numbers.
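For instance, a minimal sketch of the sync/atomic approach applied to the code in the question (my adaptation, using a WaitGroup instead of Sleep to wait for the goroutines):

// Sketch: atomically increment the counter so concurrent adds are not lost.
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var count int64
    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            atomic.AddInt64(&count, 1)
        }()
    }
    wg.Wait()
    fmt.Println(atomic.LoadInt64(&count)) // prints 1000
}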
You are spawning 500 or 1000 routines without synchronization among routines. This is creating a race condition, which makes the result un-predictable.
Imagine You are working in an office to account for expense balance for your boss.
Your boss was in a meeting with 1000 of his subordinates, and in this meeting he said, "I will pay you all a bonus, so we will have to increment the record of our expenses". He issued the following command:
i) Go to Nerve and ask him/her what the current expense balance is
ii) Call my secretary to ask how much bonus you will receive
iii) Add your bonus as an expense to the existing expense balance. Do the math yourself.
iv) Ask Nerve to write the new expense balance on my authority.
All of the 1000 eager participants rushed to record their bonus and created a race condition.
Say 50 of the eager gophers hit Nerve at (almost) the same time. They ask:
i) "What is the current expense balance?"
-- Nerve says $1000 to all of those 50 gophers, as they asked the same question at (almost) the same time, when the balance was $1000.
ii) The gophers then call the secretary: "How much bonus should be paid to me?"
The secretary answers, "just $1, to be fair".
iii) The boss said to do the math, so they all calculate $1000 + $1 = $1001 as the new expense balance for the company.
iv) They all ask Nerve to put $1001 back into the balance.
Do you see the problem with the method of the eager gophers?
There were 50 units of computation done, each one adding $1 to the existing balance, but the balance didn't increase by $50; it increased by only $1.
Hope that clarifies the problem. Now for the solutions, other contributors gave very good solutions that would be sufficient I believe.
All of those approaches failed for me (noobie here), but I have found a better way: http://play.golang.org/p/OcMsuUpv2g
I'm using the sync package to solve that problem and to wait for all goroutines to finish, without Sleep or a channel.
And don't forget to take a look at that awesome post http://devs.cloudimmunity.com/gotchas-and-common-mistakes-in-go-golang/

Replicate() versus a for loop?

Does anyone know how the replicate() function works in R and how efficient it is relative to using a for loop?
For example, is there any efficiency difference between...
means <- replicate(100000, mean(rnorm(50)))
And...
means <- c()
for(i in 1:100000) {
  means <- c(means, mean(rnorm(50)))
}
(I may have typed something slightly off above, but you get the idea.)
You can just benchmark the code and get your answer empirically. Note that I also added a second for loop flavor which circumvents the growing vector problem by preallocating the vector.
library(rbenchmark)  # provides benchmark()

repl_function = function(no_rep) means <- replicate(no_rep, mean(rnorm(50)))

for_loop = function(no_rep) {
  means <- c()
  for(i in 1:no_rep) {
    means <- c(means, mean(rnorm(50)))
  }
  means
}

for_loop_prealloc = function(no_rep) {
  means <- vector(mode = "numeric", length = no_rep)
  for(i in 1:no_rep) {
    means[i] <- mean(rnorm(50))
  }
  means
}

no_loops = 50e3

benchmark(repl_function(no_loops),
          for_loop(no_loops),
          for_loop_prealloc(no_loops),
          replications = 3)
                         test replications elapsed relative user.self sys.self user.child sys.child
2          for_loop(no_loops)            3  18.886    6.274    17.803    0.894          0         0
3 for_loop_prealloc(no_loops)            3   3.209    1.066     3.189    0.000          0         0
1     repl_function(no_loops)            3   3.010    1.000     2.997    0.000          0         0
Looking at the relative column, the un-preallocated for loop is 6.2 times slower. However, the preallocated for loop is just as fast as replicate.
replicate is a wrapper for sapply, which itself is a wrapper for lapply. lapply is ultimately an .Internal function that is written in C and performs the looping in an optimised way, rather than through the interpreter. Its main advantages are efficient memory management, especially compared to the highly inefficient vector-growing method you present above.
I have a very different experience with replicate, which also confuses me. It often happens that my R session crashes and my laptop hangs when I use replicate compared to for, and this surprises me: for the reasons mentioned above, I also expected a C-written function to outperform the for loop. For example, if you execute the functions below, you'll see that the for loop is faster than replicate:
system.time(for (i in 1:10) runif(1e7))
# user system elapsed
# 3.340 0.218 3.558
system.time(replicate(10, runif(1e7)))
# user system elapsed
# 4.622 0.484 5.109
So with 10 replicates, the for loop is clearly faster. If you repeat it for 100 replicates you get similar results. So I wonder if anyone can come up with an example that shows its practical advantages compared to for.
PS: I also created a function for the runif(1e7), and that made no difference in the comparison. Basically, I failed to come up with any example that shows the advantage of replicate.
Vectorization is the key difference between them. I will try to explain this point. R is a high-level, interpreted computer language. It takes care of many basic computer tasks for you. When you write
x <- 2.0
you don’t have to tell your computer that
“2.0” is a floating-point number;
“x” should store numeric-type data;
it has to find a place in memory to put “2.0”;
it has to register “x” as a pointer to a certain place in memory.
R figures these things by itself.
But this comfort comes at a price: it is slower than low-level languages.
In C or FORTRAN, much of this checking would be accomplished during the compilation step, not during the program's execution. They are translated into binary computer language (0/1) after they are written, BUT before they are run. This allows the compiler to organize the binary machine code in an optimal way for the computer to interpret.
What does this have to do with vectorization in R? Well, many R functions are actually written in a compiled language, such as C, C++, or FORTRAN, and have a small R "wrapper". This is the difference between your approaches: for loops add further checks that the machine has to perform on the data, making them slower.
