Simple pipelining and superscalar architecture - instructions

Consider this instruction flow diagram:
instruction fetch -> instruction decode -> operand fetch -> instruction execute -> write back
Suppose a processor that supports both CISC and RISC, like the Intel 486.
Now if we issue a RISC instruction it takes one clock cycle to execute, so there is no problem. But if a CISC instruction is issued, its execution will take longer.
So say it takes three clock cycles to execute the CISC instruction, and one clock cycle each for the stages preceding execution.
Now in a superscalar structure, the two instructions issued while the first is being processed are diverted to the other functional units available. But no such diversion is possible in simple pipelining, since only one functional unit is available for executing instructions.
So what avoids the congestion of instructions in the simple pipelining case?

Technically speaking, the x86 is not a RISC processor; it's a CISC processor. There are instructions that take less time than others, but that doesn't make them RISC instructions. I believe Intel internally translates instructions into RISC-like micro-operations, but that's not really relevant here.
If instructions take different amounts of time to execute, that is characteristic of a CISC processor, and it is hard to pipeline the execute stage of such a processor directly; in practice it is done by breaking each instruction into simpler internal operations. There are also many other things you can do inside the CPU to speed up execution, such as out-of-order execution. As for your question: there is no congestion to avoid in a simple pipeline, because the earlier stages simply stall until the execute stage is free, and instructions still complete in order.
Now if we issue a RISC instruction it takes one clock cycle to execute, so there is no problem. But if a CISC instruction is issued, its execution will take longer.
A RISC instruction does not necessarily take one clock cycle. On the classic MIPS pipeline it takes 5 (one per stage). However, the point of pipelining is that after one instruction finishes, the next instruction completes only one clock cycle later.
Now in a superscalar structure, the two instructions issued while the first is being processed are diverted to the other functional units available...
In a superscalar architecture, two instructions are issued and finish at the same time. In a pure two-wide superscalar architecture, the cycle looks like this (F = Fetch, D = Decode, X = eXecute, M = Memory, W = Writeback):
(inst. 1) F D X M W
(inst. 2) F D X M W
(inst. 3)   F D X M W
(inst. 4)   F D X M W
But no such diversion is possible in simple pipelining, since only one functional unit is available for executing instructions...
Right, so the cycle looks like this:
(inst. 1) F D X M W
(inst. 2)   F D X M W
(inst. 3)     F D X M W
(inst. 4)       F D X M W
Now, if we have instructions that take a varying amount of time (a CISC machine), it's harder to pipeline, because there's only one execution unit and we may have to wait for a previous instruction to finish executing. In this example, instruction 1 takes 2 execution cycles, instruction 2 takes 5, instruction 3 takes 2, and instruction 4 takes only 1:
(inst. 1) F D X X M W
(inst. 2)   F D - X X X X X M W
(inst. 3)     F D - - - - - X X M W
(inst. 4)       F D - - - - - - X M W
(a - marks a cycle spent stalled, waiting for the single execute unit)
Thus, we can't cleanly pipeline a CISC processor's execution - we must wait for the execute stage to finish before the next instruction can enter it. We don't have to do this in MIPS because every instruction spends exactly one cycle in execute, and MIPS can even determine whether an instruction is a branch, and its destination, in the decode stage.
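To make the stall behaviour concrete, here is a minimal C sketch (my own illustration, not from the original discussion) that models the single-execute-unit, in-order pipeline described above and prints which cycles each instruction spends in each stage; the exec_cycles values are the hypothetical CISC timings from the example.

#include <stdio.h>

int main(void)
{
    /* Execute latencies of the four example instructions (hypothetical CISC timings). */
    int exec_cycles[] = {2, 5, 2, 1};
    int n = sizeof exec_cycles / sizeof exec_cycles[0];
    int exec_busy_until = 0;   /* last cycle in which the single execute unit is occupied */

    for (int i = 0; i < n; i++) {
        int fetch   = i + 1;          /* one instruction fetched per cycle   */
        int decode  = fetch + 1;
        int x_start = decode + 1;     /* earliest cycle execute could begin  */
        if (x_start <= exec_busy_until)
            x_start = exec_busy_until + 1;   /* stall: execute unit still busy */
        int x_end = x_start + exec_cycles[i] - 1;
        exec_busy_until = x_end;

        printf("inst %d: F=%d D=%d X=%d..%d M=%d W=%d\n",
               i + 1, fetch, decode, x_start, x_end, x_end + 1, x_end + 2);
    }
    return 0;
}

Running it reproduces the staggered timing in the diagram above: later instructions simply sit stalled until the execute unit frees up, which is exactly what prevents any "congestion" in a simple pipeline.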

Related

Why does the single cycle processor not incur register latency on both read and write?

I wonder why the last register-write latency (200 ps) is not added.
To be more precise, the critical path is determined by the load instruction's latency, so why is the critical path not
I-Mem + Regs + Mux + ALU + D-Mem + MUX + Regs
but actually
I-Mem + Regs + Mux + ALU + D-Mem + MUX?
Background
[Figure 4.2: the single-cycle datapath referred to in the problem, not reproduced here]
In the following three problems, assume that we are starting with a
datapath from Figure 4.2, where I-Mem, Add, Mux, ALU, Regs, D-Mem,
and Control blocks have latencies of 400 ps, 100 ps, 30 ps, 120 ps,
200 ps, 350 ps, and 100 ps, respectively, and costs of 1000, 30, 10,
100, 200, 2000, and 500, respectively.
And I found a solution like the one below:
Cycle Time Without improvement = I-Mem + Regs + Mux + ALU + D-Mem + Mux
                               = 400 + 200 + 30 + 120 + 350 + 30 = 1130 ps
Cycle Time With improvement = 1130 + 300 = 1430 ps
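As a quick check, here is a small C sketch (mine, simply redoing the arithmetic with the latencies quoted in the problem) that sums the load path with and without charging the final register write:

#include <stdio.h>

int main(void)
{
    /* Component latencies from the problem statement, in picoseconds. */
    int i_mem = 400, regs = 200, mux = 30, alu = 120, d_mem = 350;

    int load_path   = i_mem + regs + mux + alu + d_mem + mux;  /* I-Mem + Regs + Mux + ALU + D-Mem + Mux */
    int with_regs_w = load_path + regs;                        /* ... + Regs, if the write were charged too */

    printf("load path (actual critical path) = %d ps\n", load_path);    /* 1130 */
    printf("load path + final register write = %d ps\n", with_regs_w);  /* 1330 */
    return 0;
}

The answer below explains why the second number is not the one that determines the cycle time.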
It is a good question as to whether it requires two Regs latencies.
The register write is a capture of the output of one cycle.  It happens at the end of one clock cycle, and the start of the next — it is the clock cycle edge/transition to the next cycle that causes the capture.
In one sense, the written output of one instruction effectively happens in parallel with the early operations of the next instruction, including the register reads, with the only requirement for this overlap being that the next instruction must be able to read the output of the prior instruction instead of a stale register value.  And this is possible because the written data was already available at the very top/beginning of the current cycle (before the transition, in fact).
The PC works the same: at the clock transition from one cycle's end to another cycle's start, the value for the new PC is captured and then released to the I Mem.  So, the read and write effectively happen in parallel, with the only requirement then being that the read value sent to I Mem is the new one.
This is the fundamental way that cycles work: enregistered values start a cycle, then combinational logic computes new values that are captured at the end of the cycle (aka the start of the next cycle) and form the program state available for the start of the new cycle.  So one cycle does state -> processing -> (new) state and then the cycle repeats.
In the case of the PC, you might ask why we need a register at all?
(For the 32 CPU registers it is obvious that they are needed to provide the machine-code program with lasting values: one instruction writes a register, say $a0, and that register may be used many instructions later, or even used many times before being changed.)
But one could speculate what might happen without a PC register (and the clocking that dictates its capture), and the answer is that we don't want the PC to change until the instruction is completed, which is dictated by the clock. Without the clock and the register, the PC could run ahead of the rest of the design, since much of the PC computation is not on the critical path (this would cause instability of the design). But as we want the PC to hold stable for the whole clock cycle, and change only when the clock says the instruction is over, a register is used (along with its clocked update).
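To illustrate the state -> processing -> (new) state idea in software terms, here is a tiny C sketch (the names and the made-up datapath are purely illustrative) in which the registered values are only updated at the "clock edge" at the end of each loop iteration, so a value written in one cycle is first readable in the next:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t pc = 0, reg = 1;                 /* registered state at the start of a cycle */

    for (int cycle = 0; cycle < 4; cycle++) {
        /* "Combinational logic": compute next-state values from the current state only. */
        uint32_t next_pc  = pc + 4;
        uint32_t next_reg = reg + pc;         /* some made-up datapath result */

        /* "Clock edge": capture the new state; it becomes visible to reads next cycle. */
        pc  = next_pc;
        reg = next_reg;
        printf("end of cycle %d: pc=%u reg=%u\n", cycle, pc, reg);
    }
    return 0;
}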

Some weird behavior with Intel LLC

I have a single-threaded void function whose performance I care about; let's call it f. f takes as input a pointer to a float buffer of size around 1.5 MB; let's call it x. f writes to another buffer, say y, which is also around 1.5 MB. So to use f, we call f(x,y).
Now I run f 1000 times. In scenario one, I have ONE x and ONE y, so I do f(x,y) a thousand times. Reads of x by f are serviced from local caches and are fast.
In scenario two, I have ONE x and 1000 different y buffers, think y0, y1, ..., y999, each around 1.5 MB (contiguous in memory or not, it apparently doesn't matter). When I do f(x,y0), f(x,y1), f(x,y2), ..., reads of x by f are no longer serviced from local caches! I observe LLC misses and get bottlenecked by DRAM latency.
What is going on here? I am running on an Intel Kaby Lake quad-core laptop (i5-8250, L3 cache size 6144K).
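For concreteness, here is a hedged C sketch of the two scenarios as I read them (f, x and y are the question's names; the buffer size and call count come from the description, and the body of f is just a stand-in):

#include <stdlib.h>

#define BUF_BYTES (3 * 512 * 1024)   /* ~1.5 MB per buffer, as in the question */

/* Stand-in for the real f: reads all of x, writes all of y. */
void f(const float *x, float *y)
{
    for (size_t i = 0; i < BUF_BYTES / sizeof(float); i++)
        y[i] = 2.0f * x[i];
}

void scenario_one(void)
{
    float *x = calloc(1, BUF_BYTES), *y = malloc(BUF_BYTES);
    for (int i = 0; i < 1000; i++)
        f(x, y);                     /* same x, same y every call: x stays hot in the caches */
    free(x);
    free(y);
}

void scenario_two(void)
{
    float *x = calloc(1, BUF_BYTES);
    static float *ys[1000];
    for (int i = 0; i < 1000; i++)
        ys[i] = malloc(BUF_BYTES);
    for (int i = 0; i < 1000; i++)
        f(x, ys[i]);                 /* same x, a fresh ~1.5 MB y each call: this is where the
                                        question observes reads of x missing in the LLC */
    for (int i = 0; i < 1000; i++)
        free(ys[i]);
    free(x);
}

int main(void)
{
    scenario_one();
    scenario_two();
    return 0;
}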

Latency bounds and throughput bounds for processors for operations that must occur in sequence

My textbook (Computer Systems: A Programmer's Perspective) states that a latency bound is encountered when a series of operations must be performed in strict sequence, while a throughput bound characterizes the raw computing capacity of the processor's functional units.
Questions 5.5 and 5.6 of the textbook introduce these two possible loop structures for polynomial computation:
double result = a[0];
double xpwr = x;
for (int i = 1; i <= degree; i++) {
    result += a[i] * xpwr;
    xpwr = x * xpwr;
}
and
double result = a[degree];
double xpwr = x;
for (int i = degree - 1; i >= 0; i--) {
    result = a[i] + x * result;
}
The loops are assumed to be executed on a microarchitecture with the following execution units:
One floating-point adder. It has a latency of 3 cycles and is fully pipelined.
Two floating-point multipliers. The latency of each is 5 cycles and both are fully pipelined.
Four integer ALUs, each with a latency of one cycle.
The latency bounds for floating point multiplication and addition given for this problem are 5.0 and 3.0 respectively. According to the answer key, the overall loop latency for the first loop is 5.0 cycles per element and the second is 8.0 cycles per element. I don't understand why the first loop isn't also 8.0.
It seems as though a[i] must be multiplied by xpwr before this product can be added to result to produce the next value of result. Could somebody please explain this to me?
Terminology: you can say a loop is "bound on latency", but when analyzing that bottleneck I wouldn't say "the latency bound" or "bounds". That sounds wrong to me. The thing you're measuring (or calculating via static performance analysis) is the latency or length of the critical path, or the length of the loop-carried dependency chain. (The critical path is the latency chain that's longest, and is the one responsible for the CPU stalling if it's longer than out-of-order exec can hide.)
The key point is that out-of-order execution only cares about true dependencies, and allows operations to execute in parallel otherwise. The CPU can start a new multiply and a new add every cycle. (Assuming from the latency numbers that it's Intel Sandybridge or Haswell, or similar; see footnote 1. i.e. assume the FPU is fully pipelined.)
The two loop-carried dependency chains in the first loop are xpwr *= x and result += stuff, each reading from their own previous iteration, but not coupled to each other in a way that would add their latencies. They can overlap.
result += a[i] * xpwr has 3 inputs:
result from the previous iteration.
a[i] is assumed to be ready as early as you want it.
xpwr is from the previous iteration. And more importantly, that previous iteration could start computing xpwr right away, not waiting for the previous result.
So you have 2 dependency chains, one reading from the other. The addition dep chain has lower latency per step so it just ends up waiting for the multiplication dep chain.
The a[i] * xpwr "forks off" from the xpwr dep chain, independent of it after reading its input. Each computation of that expression is independent of the previous computation of it. It depends on a later xpwr, so the start of each a[i] * xpwr is limited only by the xpwr dependency chain having that value ready.
The loads and integer loop overhead (getting load addresses ready) can be executed far ahead by out-of-order execution.
Graph of the dependency pattern across iterations
mulsd (x86-64 multiply Scalar Double) is for the xpwr updates, addsd for the result updates. The a[i] * xpwr; multiplication is not shown because it's independent work each iteration. It skews the adds later by a fixed amount, but we're assuming there's enough FP throughput to get that done without resource conflicts for the critical path.
mulsd   addsd        # first iteration: result += stuff
  |   \    |         # first iteration: xpwr *= x can start at the same time
  v    \   v
mulsd   addsd
  |   \    |
  v    \   v
mulsd   addsd
  |   \    |
  v    \   v
mulsd   addsd
(Last xpwr mulsd result is unused, compiler could peel the final iteration and optimize it away.)
Second loop, Horner's method - fast with FMA, else 8 cycles
result = a[i] + x * result; is fewer math operations, but there we do have a loop-carried critical path of 8 cycles. The next mulsd can't start until the addsd is also done. This is bad for long chains (high-degree polynomials), although out-of-order exec can often hide the latency for small degrees, like 5 or 6 coefficients.
This really shines when you have FMA available: each iteration becomes one Fused Multiply-Add. On real Haswell CPUs, FMA has the same 5-cycle latency as an FP multiply, so the loop runs at one iteration per 5 cycles, with less extra latency at the tail to get the final result.
Real world high-performance code often uses this strategy for short polynomials when optimizing for machines with FMA, for high throughput evaluating the same polynomial for many different inputs. (So the instruction-level parallelism is across separate evaluations of the polynomial, rather than trying to create any within one by spending more operations.)
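As a concrete illustration (my sketch, not code from the book), here is Horner's rule written with the C standard library's fma() so that, on an FMA-capable target, each iteration can compile to a single fused multiply-add:

#include <math.h>

/* Evaluate a[0] + a[1]*x + ... + a[degree]*x^degree with Horner's rule.
 * Built with FMA enabled (e.g. -march=haswell), the fma() call becomes one
 * vfmadd instruction, so the loop-carried chain is one FMA (5 cycles on
 * Haswell) per coefficient. */
double poly_horner(const double *a, int degree, double x)
{
    double result = a[degree];
    for (int i = degree - 1; i >= 0; i--)
        result = fma(result, x, a[i]);    /* result = result * x + a[i] */
    return result;
}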
Footnote 1: similarity to real hardware
Having two FP multipliers with those latencies matches Haswell with SSE2/AVX math, although in actual Haswell the FP adder is on the same port as a multiplier, so you can't start all 3 operations in one cycle. FP execution units share ports with the 4 integer ALUs, too, but Sandybridge/Haswell's front-end is only 4 uops wide so it's typically not much of a bottleneck.
See David Kanter's deep dive on Haswell with nice diagrams, https://agner.org/optimize/, and other performance resources in the x86 tag wiki.
On Broadwell, the next generation after Haswell, FP mul latency improved to 3 cycles. Still 2/clock throughput, with FP add/sub still being 3c and FMA 5c. So the loop with more operations would be faster there, even compared to an FMA implementation of Horner's method.
On Skylake, all FP operations are the same 4-cycle latency, all running on the two FMA units with 2/clock throughput for FP add/sub as well. More recently, Alder Lake re-introduced lower latency FP addition (3 cycles vs. 4 for mul/fma but still keeping the 2/clock throughput), since real-world code often does stuff like naively summing an array, and strict FP semantics don't let compilers optimize it to multiple accumulators. So on recent x86, there's nothing to gain by avoiding FMA if you would still have a multiply dep chain, not just add.
Also related:
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? - more general analysis needs to consider other possible bottlenecks: front-end uop throughput, and back-end contention for execution units. Dep chains, especially loop-carried ones, are the 3rd major possible bottleneck (other than stalls like cache misses and branch misses).
How many CPU cycles are needed for each assembly instruction? - another basic intro to these concepts
Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths - ability of out-of-order exec to overlap dep chains is limited when they're too long.
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) - FP dep chains in parallel with unrolling when possible, e.g. a dot product.
In this case, for large-degree polynomials you could be doing stuff like x2 = x*x and x4 = x2*x2, and maybe generating x^(2n) and x^(2n+1) in parallel, as in Estrin's scheme used in Agner Fog's vectorclass library for short polynomials (see the sketch below). I found that when short polynomials are part of independent work across loop iterations (e.g. as part of log(arr[i])), straight Horner's rule was better for throughput, as out-of-order exec was able to hide the latency of a chain of 5 or 6 FMAs interleaved with other work.
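Here is a hedged sketch of that Estrin-style pairing for a degree-5 polynomial (illustrative only; Agner Fog's vectorclass library generalizes and vectorizes this idea). The three pair terms are mutually independent, so the serial dependency chain is shorter than with plain Horner's rule:

/* Degree-5 polynomial c[0] + c[1]*x + ... + c[5]*x^5, Estrin-style. */
double poly5_estrin(const double c[6], double x)
{
    double x2 = x * x;
    double x4 = x2 * x2;
    double p01 = c[0] + c[1] * x;   /* these three pair terms are   */
    double p23 = c[2] + c[3] * x;   /* independent of each other    */
    double p45 = c[4] + c[5] * x;   /* and can execute in parallel  */
    return p01 + p23 * x2 + p45 * x4;
}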
For 5.5, there are 3 parallel chains:
1. xpwr = x * xpwr;, which has 5-cycle latency. Occurs in iteration #i.
2. a[i] * xpwr, which has 5-cycle latency but isn't on the critical path of a loop-carried dependency. Occurs in iteration #i.
3. result += (the product from 2), which has 3-cycle latency. Occurs in iteration #i+1, but for the iteration-#i result.
Update
Based on clarifications by @peter:
To understand a 'loop-carried' dependency: it means the current iteration (i) depends on another iteration (say, i-1). So we can see xpwr = x * xpwr; as xpwr(i) = x * xpwr(i-1);, which consequently forms a chain (though not yet known to be the critical path).
a[i] * xpwr can be seen as a byproduct of step 1 ("forked off" from step 1), which also takes 5 cycles.
Once step 2 finishes, result += ... starts for iteration i, and takes 3 cycles. It depends (through step 2) on step 1 and on its own previous result; consequently, step 3 is also a 'loop-carried' dependency, so it could be a candidate for the critical path.
Since step 3 takes 3 cycles < 5 cycles, step 1 becomes the critical path.
What if step 3 (hypothetically) took 10 cycles? Then, to my understanding, step 3 would become the critical path.
(The diagram attached to the original post is not reproduced here.)

Faster CPU wastes more time as compared to slower CPU

Suppose I have a program that has an instruction to add two numbers, and that operation takes 10 nanoseconds (constant, as enforced by the gate manufacturers).
Now I have 3 different processors A, B and C (where A < B < C in terms of clock speed). A's one clock cycle takes 15 ns, B's takes 10 ns and C's takes 7 ns.
Firstly, am I correct in the following assumptions?
1. The add operation takes 1 complete cycle of processor A (the slow processor) and wastes the remaining 5 ns of the cycle.
2. The add operation takes 1 complete cycle of processor B, wasting no time.
3. The add operation takes 2 complete cycles (20 ns) of processor C (the fast processor), wasting the remaining 20-14=7 ns.
If the above assumptions are correct, then isn't this a contradiction of the usual assumption that processors with higher clock rates are faster? Here processor C, which is the fastest, actually takes 2 cycles and wastes 7 ns, whereas the slower processor A takes just 1 cycle.
Processor C is fastest, no matter what. It takes 7 ns per cycle and therefore performs more cycles per second than A and B. It's not C's fault that the addition circuit is not fast enough. If you implemented the addition circuit in a way that gives a result in 1 ns, all processors would give the answer in 1 clock cycle (i.e. C would give you the answer in 7 ns, B in 10 ns and A in 15 ns).
Firstly, am I correct in the following assumptions?
1. The add operation takes 1 complete cycle of processor A (the slow processor) and wastes the remaining 5 ns of the cycle.
2. The add operation takes 1 complete cycle of processor B, wasting no time.
3. The add operation takes 2 complete cycles (20 ns) of processor C (the fast processor), wasting the remaining 20-7=13 ns.
No. It is because you are using incomplete data to express the time for an operation. Measure the time taken to finish an operation on a particular processor in clock cycles instead of nanoseconds as you are doing here. When you say ADD op takes 10 ns and you do not mention the processor on which you measured the time for the ADD op, the time measurement in ns is meaningless.
So when you say that ADD op takes 2 clock cycles on all three processors, then you have standardized the measurement. A standardized measurement can then be translated as:
Time taken by A for addition = 2 clock cycles * 15 ns per cycle = 30 ns
Time taken by B for addition = 2 clock cycles * 10 ns per cycle = 20 ns
Time taken by C for addition = 2 clock cycles * 07 ns per cycle = 14 ns
In case you haven't noticed, when you say:
A's one clock cycle takes 15 ns, B's takes 10 ns and C's takes 7 ns.
which of the three processors is fastest?
Answer: C is fastest. One of its cycles finishes in 7 ns. This implies that it finishes 10^9/7 (~= 1.4 * 10^8) cycles in one second, compared to B, which finishes 10^9/10 (= 10^8) cycles in one second, and A, which finishes only 10^9/15 (~= 0.67 * 10^8) cycles in one second.
What does an ADD instruction mean? Does it purely mean only the ADD (with the operands already available in registers), or does it mean getting the operands, decoding the instruction and then actually adding the numbers?
Getting the operands is done by the MOV op. If you are trying to compare how fast the ADD op is, it should be compared by the time to perform the ADD op only. If, on the other hand, you want to find out how fast the addition of two numbers is done, then it will involve more operations than the simple ADD. However, if it's helpful, the list of all original 8086/8088 instructions is available on Wikipedia too.
Based on the above context of what ADD actually means, how many cycles does ADD take, one or more than one?
It will depend on the processor, because each processor may implement the adder differently. There are many ways to implement the addition of two numbers. Quoting Wikipedia again: "A full adder can be implemented in many different ways such as with a custom transistor-level circuit or composed of other gates."
Also, the instructions may be pipelined, which can overlap the additions and result in large time savings.
Why is the clock cycle a standard, since it can vary from processor to processor? Shouldn't the nanosecond be the standard? At least it's fixed.
The clock cycle, along with the processor speed, can be the standard if you want to state the time taken by a processor to execute an instruction. Pick any two from:
Time to execute an instruction,
Processor Speed, and
Clock cycles needed for an instruction.
The third can be derived from them.
When you say the number of clock cycles taken by ADD is x and you know the processor speed is y MHz, you can calculate the time to ADD as x / y. Likewise, if you state the time to perform ADD as z ns and you know the processor speed is the same y MHz as earlier, you can calculate the cycles needed to execute ADD as y * z (keeping the units consistent).
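A tiny C sketch of that pick-any-two relation, using Processor B from the question (10 ns per cycle, i.e. 100 MHz) and the 2-cycle ADD as example values:

#include <stdio.h>

int main(void)
{
    double cycles = 2.0;      /* x: clock cycles taken by ADD (example value)        */
    double mhz    = 100.0;    /* y: processor speed; 10 ns/cycle = 100 MHz (Proc. B) */

    double time_ns = cycles / mhz * 1000.0;     /* time = x / y, converted to ns          */
    double cycles2 = mhz * time_ns / 1000.0;    /* cycles = y * z, with consistent units  */

    printf("time for ADD   = %.1f ns\n", time_ns);   /* 20.0 ns, matching the table above */
    printf("cycles for ADD = %.1f\n", cycles2);      /* 2.0 */
    return 0;
}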
I'm no expert BUT I'd say ...
the regular assumption that processors with high clock cycles are faster FOR THE VAST MAJORITY OF OPERATIONS
For example, a more intelligent processor might perform an "overhead task" that takes X ns. The "overhead task" might make it faster for repetitive operations but might actually cause it to take longer for a one-off operation such as adding 2 numbers.
Now, if the same processor performed that same operation 1 million times, it should be massively faster than the slower less intelligent processor.
Hope my thinking helps. Your feedback on my thoughts welcome.
Why would a faster processor take more cycles to do the same operation than a slower one?
Even more important: modern processors use instruction pipelining, overlapping the execution of multiple instructions so that, on average, an operation can complete every clock cycle.
Also, I don't understand what you mean by 'wasting 5 ns': the frequency determines the clock speed, and thus the time it takes to execute one clock cycle. Of course, CPUs can have to wait on I/O, for example, but that holds for all CPUs.
Another important aspect of modern CPUs is the L1, L2 and L3 caches and the architecture of those caches in multicore systems. For example: if a register access takes 1 time unit, an L1 cache access will take around 2, while a normal memory access will take between 50 and 100 (and a hard disk access would take thousands...).
This is actually almost correct, except that on processor C taking 2 cycles means 14 ns, so with 10 ns being enough, the next cycle starts 4 ns after the result was already "stable" (though it is likely that you need some extra time if you chop the operation up, to latch the partial result). It's not that much of a contradiction: setting your frequency "too high" can require trade-offs like that. Another thing you might do is use a different circuit, or domino logic, to get the actual latency of addition back down to one cycle. More likely, you wouldn't set addition at 2 cycles to begin with; it doesn't work out so well in this case, at least not for addition. You could do it, and yes, basically you will have to "round up" the time a circuit takes to an integer number of cycles. You can also see this in bitwise operations, which take less time than addition but nevertheless take a whole cycle. On machine C you could probably still fit bitwise operations in a single cycle, and for some workloads it might even be worth splitting addition like that.
FWIW, Netburst (Pentium 4) had staggered adders, which computed the lower half in one "half-cycle" and the upper half in the next (and the flags in the third half-cycle), in some sense giving the whole addition a latency of 1.5 cycles. It's not completely out of this world, though Netburst was, overall, fairly mad - it had to do a lot of weird things to get the frequency up that high. But those half-cycles weren't really half-cycles (AFAIK the logic didn't advance on every clock edge, it just used a clock multiplier); you could also see them as real cycles that are just very fast, with most of the rest of the logic (except that crazy ALU) running at half speed.
Your broad point that 'a CPU will occasionally waste clock cycles' is valid. But overall in the real world, part of what makes a good CPU a good CPU is how it alleviates this problem.
Modern CPUs consist of a number of different components, none of whose operations will end up taking a constant time in practice. For example, an ADD instruction might 'burst' at 1 instruction per clock cycle if the data is immediately available to it... which in turn means something like 'if the CPU subcomponents required to fetch that data were immediately available prior to the instruction'. So depending on if e.g. another subcomponent had to wait for a cache fetch, the ADD may in practice take 2 or 3 cycles, say. A good CPU will attempt to re-order the incoming stream of instructions to maximise the availability of subcomponents at the right time.
So you could well have the situation where a particular series of instructions is 'suboptimal' on one processor compared to another. And the overall performance of a processor is certainly not just about raw clock speed: it is as much about the clever logic that goes around taking a stream of incoming instructions and working out which parts of which instructions to fire off to which subcomponents of the chip when.
But... I would posit that any modern chip contains such logic. Both a 2GHz and a 3GHz processor will regularly "waste" clock cycles because (to put it simply) a "fast" instruction executed on one subcomponent of the CPU has to wait for the result of the output from another "slower" subcomponent. But overall, you will still expect the 3GHz processor to "execute real code faster".
First, if the 10ns time to perform the addition does not include the pipeline overhead (clock skew and latch delay), then Processor B cannot complete an addition (with these overheads) in one 10ns clock cycle, but Processor A can and Processor C can still probably do it in two cycles.
Second, if the addition itself is pipelined (or other functional units are available), then a subsequent non-dependent operation can begin executing in the next cycle. (If the addition was width-pipelined/staggered (as mentioned in harold's answer) then even dependent additions, logical operations and left shifts could be started after only one cycle. However, if the exercise is constraining addition timing, it presumably also prohibits other optimizations to simplify the exercise.) If dependent operations are not especially common, then the faster clock of Processor C will result in higher performance. (E.g., if a dependence stall occurred every fourth cycle, then, ignoring other effects, Processor C can complete four instructions every five 7 ns cycles (35 ns; the first three instructions overlap in execution) compared to 40 ns for Processor B (assuming the add timing included pipelining overhead).) (Note: your assumption 3 is incorrect; two cycles for Processor C would be 14 ns.)
Third, the extra time in a clock cycle can be used to support more complex operations (e.g., preshifting one operand by a small immediate value and even adding three numbers — a carry-save adder has relatively little delay), to steal work from other pipeline stages (potentially reducing the number of pipeline stages, which generally reduces branch misprediction penalties), or to reduce area or power by using simpler logic. In addition, the extra time might be used to support a larger (or more associative) cache with fixed latency in cycles, reducing miss rates. Such factors can compensate for the "waste" of 5ns in Processor A.
Even for scalar (single issue per cycle) pipelines clock speed is not the single determinant of performance. Design choices become even more complex when power, manufacturing cost (related to yield, adjusted according to sellable bins, and area), time-to-market (and its variability/predictability), workload diversity, and more advanced architectural and microarchitectural techniques are considered.
The incorrect assumption that clock frequency determines performance even has a name: the Megahertz myth.

Processor performance complex and simple instructions

I have been stuck on a problem in my class for a week now.
I was hoping someone could help steer me in the right direction.
[Image from the original post: the problem data translated into an Excel sheet]
Processor R is a 64-bit RISC processor with a 2 GHz clock rate. The average instruction requires one cycle to complete, assuming zero-wait-state memory accesses. Processor C is a CISC processor with a 1.8 GHz clock rate. The average simple instruction requires one cycle to complete, assuming zero-wait-state memory accesses. The average complex instruction requires two cycles to complete, assuming zero-wait-state memory accesses. Processor R can't directly implement the complex processing instructions of Processor C; executing an equivalent set of simple instructions requires an average of three cycles to complete, assuming zero-wait-state memory accesses.
Program S contains nothing but simple instructions. Program C executes 70% simple instructions and 30% complex instructions. Which processor will execute program S more quickly? At what percentage of complex instructions will the performance of the two processors be equal?
I attached an image above translating the data into Excel as best I could.
I am not asking you to answer this for me, but I am completely stuck and would appreciate some help on where to start and what my answer should look like.
For the second part:
Processor R Total cycles = 1 x #simpleInstructions + 3 x #complexInstructions
Processor C Total cycles = 1 x #simpleInstructions + 2 x #complexInstructions
So, how much time for R, and how much time for C?
When expressing complex/simple instructions as a percentage,
RCycles = 1 x 0.7 x totalInstructions + 3 x 0.3 x totalInstructions
CCycles = 1 x 0.7 x totalInstructions + 2 x 0.3 x totalInstructions
Which is faster?
Now replace the percentages with a variable, equate the R time and C time, and solve for the percentage.
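If it helps to sanity-check your algebra once you have set it up, here is a small C sketch (mine, not the assignment's expected answer) that evaluates the time per instruction for both processors as the complex-instruction fraction varies:

#include <stdio.h>

int main(void)
{
    const double f_r = 2.0e9;   /* Processor R clock rate, Hz */
    const double f_c = 1.8e9;   /* Processor C clock rate, Hz */

    for (double p = 0.00; p <= 0.301; p += 0.025) {   /* p = fraction of complex instructions */
        double r_cycles = (1.0 - p) * 1.0 + p * 3.0;  /* R: 3 cycles for the equivalent simple set */
        double c_cycles = (1.0 - p) * 1.0 + p * 2.0;  /* C: 2 cycles per complex instruction       */
        double r_time   = r_cycles / f_r;             /* seconds per (original) instruction */
        double c_time   = c_cycles / f_c;
        printf("p = %.3f   R = %.4g s   C = %.4g s   -> %s\n",
               p, r_time, c_time, r_time <= c_time ? "R is at least as fast" : "C is faster");
    }
    return 0;
}

Program S corresponds to p = 0; equating the two time expressions and solving for p gives the crossover percentage the question asks for.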

Resources