Grafcet - synchronous machine behaviour

Grafcet - synchronous machine behaviour - parallel-processing

My goal is to implement a control algorithm written in Grafcet on a PLC. I am struggling with the difference of Grafcet as multi-process synchronous language and the single-core sequential PLC. Below is an example. What is the outcome of the Grafcet in the first cycle after the upper transition has fired? (a=1,x=1) or (a=1, x=0)?
I know that in SFC, it depends on the implementation of the engineering tool (e.g. Codesys, Multiprog) how actions are evaluated, typically from left to right. So for an SFC, (a=1,x=1) would be the answer. But since everything happens at the same time in Grafcet, I do not know how to handle this case.
Bonus points if someone can point out how I can learn more about the challenges of implementing languages like Grafcet on sequential machines.

Conditional actions are considered in not all Grafcet variants, but when they are, the behavior goes like this: as long as the step is active, turn on x while a is on.
If that's what you meant, though we may never find a conditional action formatted the way you did, x will be turned on within an infinitely short time after the two simultaneous steps are activated (at least that's my understanding based on the Grafcet evolution rules). So, the fact that the initial value of x is unpredictable - assuming that the two concurrent steps are activated at the very same time - should be actually no problem.
Moreover, as soon as the Grafcet is "implemented" in the real world (i.e. your single-core PLC), whether it's directly compiled by the engineering tool or converted into ladder diagram, an order of evaluation is necessarily chosen, as you said, and everything becomes deterministic, so your question is not a real problem when it comes to "implementing languages like Grafcet on sequential machines". You may find which are the real "challanges" by studying the canonical procedure for converting SFCs to ladder logic (detailed documentation is easily found on the web).

PLC's are single core, as you said, so... never 2 steps in the same moment of time.
There you have simultaneous branch, so both steps WILL execute. But clearly there will be one after another. By default, always the one from left. Please note that some PLC's allow you to change the order (never tried for simultaneous, but for divergent surely allow... such as RSLogix5000).
Simultaneous it's like having an AND. So you are telling processor execute first step AND second step. If you are familiar with Ladder Logic, I am sure this will be clear to you.
In the end, it should be a=1;x=1.
Also note that for other steps that are not simultaneous, there is one scan delay before evaluate next transition, which is a great thing. This is the most omitted thing when implementing a SFC in Ladder (and can lead to problems impossible to troubleshoot if you are not aware of it). I've seen this "bug" in about 50% of projects with ladder implementation and hundreds of projects so far. Example: If you have 10 consecutive transitions true, you are going from step 1 to step 10 in a single scan. Troubleshoot why motor didn't start :)
Tip: You can always use dummy steps in simultaneous branches to delay with 1 scan. So, if you want the other outcome (a=1,x=0), you can put a dummy step before left step.

Related

Does the guarantee of non-divergence when dispatching single work item exist?

As we know, work items running on GPUs could diverge when there are conditional branches. One of those mentions exist in Apple's OpenCL Programming Guide for Mac.
As such, some portions of an algorithm may run "single-threaded", having only 1 work item running. And when it's especially serial and long-running, some applications take those work back to CPU.
However, this question concerns only GPU and assume those portions are short-lived. Do these "single-threaded" portions also diverge (as in execute both true and false code paths) when they have conditional branches? Or will the compute units (or processing elements, whichever your terminology prefers) skip those false branches?
Update
In reply to comment, I'd remove the OpenCL tag and leave the Vulkan tag there.
I included OpenCL as I wanted to know if there's any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The document says they're equivalent but I was skeptical.
I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.

Do these "single-threaded" portions also diverge (as in execute both true and false code paths) when they have conditional branches?
From an API point of view it has to appear to the program that only the active branch paths were taken. As to what actually happens, I suspect you'll never know for sure. GPU hardware architectures are nearly all confidential so it's impossible to be certain.
There are really two cases here:
Cases where a branch in the program turns into a real branch instruction.
Cases where a branch in the program turns into a conditional select between two computed values.
In the case of a real branch I would expect most cases to only execute the active path because it's a horrible waste of power to do both, and GPUs are all about energy efficiency. That said, YMMV and this isn't guaranteed at all.
For simple branches the compiler might choose to use a conditional select (compute both results, and then select the right answer). In this case you will compute both results. The compiler heuristics will generally aim to choose this where computing both results is less expensive than actually having a full branch.
I included OpenCL as I wanted to know if there's any difference at all between clEnqueueTask and clEnqueueNDRangeKernel with dim=1:x=1. The document says they're equivalent but I was skeptical.
Why would they be different? They are doing the same thing conceptually ...
I believe Vulkan removed the special function to enqueue a single-threaded task for good reasons, and if I'm wrong, please correct me.
Vulkan compute dispatch is in general a whole load simpler than OpenCL (and also perfectly adequate for most use cases), so many of the host-side functions from OpenCL have no equivalent in Vulkan. The GPU side behavior is pretty much the same. It's also worth noting that most of the holes where Vulkan shaders are missing features compared to OpenCL are being patched up with extensions - e.g. VK_KHR_shader_float16_int8 and VK_KHR_variable_pointers.

Q : Or will the compute units skip those false branches?
The ecosystem of CPU / GPU code-execution is rather complex.
The layer of hardware is where the code-paths (translated into "machine"-code) operate. On this laye, the SIMD-Computing-Units cannot and will not skip anything they are ordered to SIMD-process by the hardware-scheduler (next layer).
The layer of hardware-specific scheduler (GPUs have typically right two-modes: a WARP-mode scheduling for coherent, non-diverging code-paths efficiently scheduled in SIMD-blocks and greedy-mode scheduling). From this layer, the SIMD-Computing-Units are loaded to work on SIMD-operated blocks-of-work, so any first divergence detected on the lower layer (above) breaks the execution, flags the SIMD-hardware scheduler about blocks, deferred to be executed later and all known SIMD-specific block-device-optimised scheduling is well-known to start to grow less-efficient and less-efficient, due to each such run-time divergence.
The layer of { OpenCL | Vulkan API }-mediated device-specific programming decides a lot about the ease or comfort of human-side programming of the wide range of the target-devices, all without knowing about its respective internal constraints, about (compiler decided) preferred "machine"-code computing problem re-formulation and device-specific tricks and scheduling. A bit oversimplified battlefield picture has made for years human-users just stay "in front" of the mediated asynchronous work-units ( kernel's ) HOST-to-DEVICE scheduling queues and wait until we receive back the DEVICE-to-HOST delivered results back, doing some prior-H2D/posterior-D2H memory transfers, if allowed and needed.
The HOST-side DEVICE-kernel-code "scheduling" directives are rather imperative and help the mediated-device-specific programming reflect user-side preferences, yet leave user blind from seeing all internal decisions ( assembly-level reviews are indeed only for hard-core, DEVICE-specific, GPU-engineering Aces and hard to modify, if willing to )
All that said, "adaptive" run-time values' based decisions to move a particular "work-unit" back-to-the-HOST-CPU, rather than finalising it all in DEVICE-GPU, are not, to the best of my knowledge, taking place on the bottom of this complex computing ecosystem hierarchy ( afaik, it would be exhaustively expensive to try to do so ).

How do programmers test their algorithm in TopCoder or other competitions?

Good programmers who write programs of moderate to higher difficulty in competitions of TopCoder or ACM ICPC, have to ensure the correctness of their algorithm before submission.
Although they are provided with some sample test cases to ensure the correct output, but how does it guarantees that program will behave correctly? They can write some test cases of their own but it won't be possible in all cases to know the correct answer through manual calculation. How do they do it?
Update: As it seems, it is not quite possible to analyze and guarantee the outcome of an algorithm given tight constraints of a competitive environment. However, if there are any manual, more common traits which are adopted while solving such problems - should be enough to answer the question. Something like best practices..

In competitions, the top programmers have enough experience to read the question, and think of some test cases that should catch most of the possibilities for input.
It catches most of the bugs usually - but it is NOT 100% safe.
However, in real life critical applications (critical systems on air planes or nuclear reactors for example) there are methods to PROVE some piece of code does what it is supposed to do.
This is the field of formal verification - which is way too complex and time consuming to be done during a contest, but for some systems it is used because mistakes could not be tolerated.
Some additional information:
Formal verification basically consists of 2 parts:
Manual verification - in here we use proving systems such as Hoare logic and manually prove the program does what we wants it to do.
Automatic model checking - modeling the problem as state machine, and use Model Checking tools to verify that the module does what it is supposed to do (or not doing something "bad").
Specifying "what it should do" is usually done with temporal logic.
This is often used to verify correctness of hardware models as well. For example Intel uses it to ensure they won't get the floating point bug again.

Picture this, imagine you are a top programmer.Meaning you know a bunch of algorithms and wouldn't think think twice while implementing them.You know how to modify an already known algorithm to suit the problem's needs.You are strong with estimating time and complexity and you expect that in the worst case your tailored algorithm would run within time and memory constraints.
At this level you simply think and use a scratchpad for about five to ten minutes and have a super clear algorithm before you start to code.Once you finish coding, you hit compile and there is usually no compilation error.Because the code is so intuitive to you.
Then based on the algorithm used and data structures used, you expect that there might be
one of the following issues.
a corner case
an overflow problem
A corner case is basically like you have coded for the general case, however when say N=1, the answer is different from others.So you generally write it as a special case.
An overflow is when intermediate values or results overflow a data type's limits.
You make note of any problems which arise at this point, and use this data during Challenge phase(as in TopCoder).
Once you have checked against these two, you hit Submit.

There's a time element to Top Coder, so it's not possible to test every combination within that constraint. They probably do the best they can and rely on experience for the rest, just as one does in real life. I don't know that it's ever possible to guarantee that a significant piece of code is error free forever.

Ant colony behavior using genetic programming

I'm looking at evolving ants capable of food foraging behaviour using genetic programming, as described by Koza here. Each time step, I loop through each ant, executing its computer program (the same program is used by all ants in the colony). Currently, I have defined simple instructions like MOVE-ONE-STEP, TURN-LEFT, TURN-RIGHT, etc. But I also have a function PROGN that executes arguments in sequence. The problem I am having is that because PROGN can execute instructions in sequence, it means an ant can do multiple actions in a single time step. Unlike nature, I cannot run the ants in parallel, meaning one ant might go and perform several actions, manipulating the environment whilst all of the other ants are waiting to have their turn.
I'm just wondering, is this how it is normally done, or is there a better way? Koza does not seem to mention anything about it. Thing is, I want to expand the scenario to have other agents (e.g. enemies), which might rely on things occurring only once in a single time step.

I am not familiar with Koza's work, but I think a reasonable approach is to give each ant its own instruction queue that persists across time steps. By doing this, you can get the ants to execute PROGN functions one instruction per time step. For instance, the high-level logic for the time step of an ant can be:
Do-Time-Step(ant):
1. if ant.Q is empty: // then put the next instruction(s) into the queue
2. instructions <- ant.Get-Next-Instructions()
3. for instruction in instructions:
4. ant.Q.enqueue(instruction)
5. end for
6. end if
7. instruction <- ant.Q.dequeue() // get the next instruction in the queue
8. ant.execute(instruction) // have that ant do its job

Another similar approach to queuing instructions would be to preprocess the set of instructions an expand instances of PROGN to the set of component instructions. This would have to be done recursively if you allow PROGNs to invoke other PROGNs. The downside to this is that the candidate programs get a bit bloated, but this is only at runtime. On the other hand, it is easy, quick, and pretty easy to debug.
Example:
Say PROGN1 = {inst-p1 inst-p2}
Then the candidate program would start off as {inst1 PROGN1 inst2} and would be expanded to {inst1 inst-p1 inst-p2 inst2} when it was ready to be evaluated in simulation.

It all depends on your particular GP implementation.
In my GP kernel programs are either evaluated repeatedly or in parallel - as a whole, i.e. the 'atomic' operation in this scenario is a single program evaluation.
So all individuals in the population are repeated n times sequentially before evaluating the next program or all individuals are executed just once, then again for n times.
I've had pretty nice results with virtual agents using this level of concurrency.
It is definitely possible to break it down even more, however at that point you'll reduce the scalability of your algorithm:
While it is easy to distribute the evaluation of programs amongst several CPUs or cores it'll be next to worthless doing the same with per-node evaluation just due to the amount of synchronization required between all programs.
Given the rapidly increasing number of CPUs/cores in modern systems (even smartphones) and the 'CPU-hunger' of GP you might want to rethink your approach - do you really want to include move/turn instructions in your programs?
Why not redesign it to use primitives that store away direction and speed parameters in some registers/variables during program evaluation?
The simulation step then takes these parameters to actually move/turn your agents based on the instructions stored away by the programs.
evaluate programs (in parallel)
execute simulation
repeat for n times
evaluate fitness, selection, ...
Cheers,
Jay

Code generation by genetic algorithms

Evolutionary programming seems to be a great way to solve many optimization problems. The idea is very easy and the implementation does not make problems.
I was wondering if there is any way to evolutionarily create a program in ruby/python script (or any other language)?
The idea is simple:
Create a population of programs
Perform genetic operations (roulette-wheel selection or any other selection), create new programs with inheritance from best programs, etc.
Loop point 2 until program that will satisfy our condition is found
But there are still few problems:
How will chromosomes be represented? For example, should one cell of chromosome be one line of code?
How will chromosomes be generated? If they will be lines of code, how do we generate them to ensure that they are syntactically correct, etc.?
Example of a program that could be generated:
Create script that takes N numbers as input and returns their mean as output.
If there were any attempts to create such algorithms I'll be glad to see any links/sources.

If you are sure you want to do this, you want genetic programming, rather than a genetic algorithm. GP allows you to evolve tree-structured programs. What you would do would be to give it a bunch of primitive operations (while($register), read($register), increment($register), decrement($register), divide($result $numerator $denominator), print, progn2 (this is GP speak for "execute two commands sequentially")).
You could produce something like this:
progn2(
progn2(
read($1)
while($1
progn2(
while($1
progn2( #add the input to the total
increment($2)
decrement($1)
)
)
progn2( #increment number of values entered, read again
increment($3)
read($1)
)
)
)
)
progn2( #calculate result
divide($1 $2 $3)
print($1)
)
)
You would use, as your fitness function, how close it is to the real solution. And therein lies the catch, that you have to calculate that traditionally anyway*. And then have something that translates that into code in (your language of choice). Note that, as you've got a potential infinite loop in there, you'll have to cut off execution after a while (there's no way around the halting problem), and it probably won't work. Shucks. Note also, that my provided code will attempt to divide by zero.
*There are ways around this, but generally not terribly far around it.

It can be done, but works very badly for most kinds of applications.
Genetic algorithms only work when the fitness function is continuous, i.e. you can determine which candidates in your current population are closer to the solution than others, because only then you'll get improvements from one generation to the next. I learned this the hard way when I had a genetic algorithm with one strongly-weighted non-continuous component in my fitness function. It dominated all others and because it was non-continuous, there was no gradual advancement towards greater fitness because candidates that were almost correct in that aspect were not considered more fit than ones that were completely incorrect.
Unfortunately, program correctness is utterly non-continuous. Is a program that stops with error X on line A better than one that stops with error Y on line B? Your program could be one character away from being correct, and still abort with an error, while one that returns a constant hardcoded result can at least pass one test.
And that's not even touching on the matter of the code itself being non-continuous under modifications...

Well this is very possible and #Jivlain correctly points out in his (nice) answer that genetic Programming is what you are looking for (and not simple Genetic Algorithms).
Genetic Programming is a field that has not reached a broad audience yet, partially because of some of the complications #MichaelBorgwardt indicates in his answer. But those are mere complications, it is far from true that this is impossible to do. Research on the topic has been going on for more than 20 years.
Andre Koza is one of the leading researchers on this (have a look at his 1992 work) and he demonstrated as early as 1996 how genetic programming can in some cases outperform naive GAs on some classic computational problems (such as evolving programs for Cellular Automata synchronization).
Here's a good Genetic Programming tutorial from Koza and Poli dated 2003.
For a recent reference you might wanna have a look at A field guide to genetic programming (2008).

Since this question was asked, the field of genetic programming has advanced a bit, and there have been some additional attempts to evolve code in configurations other than the tree structures of traditional genetic programming. Here are just a few of them:
PushGP - designed with the goal of evolving modular functions like human coders use, programs in this system store all variables and code on different stacks (one for each variable type). Programs are written by pushing and popping commands and data off of the stacks.
FINCH - a system that evolves Java byte-code. This has been used to great effect to evolve game-playing agents.
Various algorithms have started evolving C++ code, often with a step in which compiler errors are corrected. This has had mixed, but not altogether unpromising results. Here's an example.
Avida - a system in which agents evolve programs (mostly boolean logic tasks) using a very simple assembly code. Based off of the older (and less versatile) Tierra.

The language isn't an issue. Regardless of the language, you have to define some higher-level of mutation, otherwise it will take forever to learn.
For example, since any Ruby language can be defined in terms of a text string, you could just randomly generate text strings and optimize that. Better would be to generate only legal Ruby programs. However, it would also take forever.
If you were trying to build a sorting program and you had high level operations like "swap", "move", etc. then you would have a much higher chance of success.
In theory, a bunch of monkeys banging on a typewriter for an infinite amount of time will output all the works of Shakespeare. In practice, it isn't a practical way to write literature. Just because genetic algorithms can solve optimization problems doesn't mean that it's easy or even necessarily a good way to do it.

The biggest selling point of genetic algorithms, as you say, is that they are dirt simple. They don't have the best performance or mathematical background, but even if you have no idea how to solve your problem, as long as you can define it as an optimization problem you will be able to turn it into a GA.
Programs aren't really suited for GA's precisely because code isn't good chromossome material. I have seen someone who did something similar with (simpler) machine code instead of Python (although it was more of an ecossystem simulation then a GA per se) and you might have better luck if you codify your programs using automata / LISP or something like that.
On the other hand, given how alluring GA's are and how basically everyone who looks at them asks this same question, I'm pretty sure there are already people who tried this somewhere - I just have no idea if any of them succeeded.

Good luck with that.
Sure, you could write a "mutation" program that reads a program and randomly adds, deletes, or changes some number of characters. Then you could compile the result and see if the output is better than the original program. (However we define and measure "better".) Of course 99.9% of the time the result would be compile errors: syntax errors, undefined variables, etc. And surely most of the rest would be wildly incorrect.
Try some very simple problem. Say, start with a program that reads in two numbers, adds them together, and outputs the sum. Let's say that the goal is a program that reads in three numbers and calculates the sum. Just how long and complex such a program would be of course depends on the language. Let's say we have some very high level language that lets us read or write a number with just one line of code. Then the starting program is just 4 lines:
read x
read y
total=x+y
write total
The simplest program to meet the desired goal would be something like
read x
read y
read z
total=x+y+z
write total
So through a random mutation, we have to add "read z" and "+z", a total of 9 characters including the space and the new-line. Let's make it easy on our mutation program and say it always inserts exactly 9 random characters, that they're guaranteed to be in the right places, and that it chooses from a character set of just 26 letters plus 10 digits plus 14 special characters = 50 characters. What are the odds that it will pick the correct 9 characters? 1 in 50^9 = 1 in 2.0e15. (Okay, the program would work if instead of "read z" and "+z" it inserted "read w" and "+w", but then I'm making it easy by assuming it magically inserts exactly the right number of characters and always inserts them in the right places. So I think this estimate is still generous.)
1 in 2.0e15 is a pretty small probability. Even if the program runs a thousand times a second, and you can test the output that quickly, the chance is still just 1 in 2.0e12 per second, or 1 in 5.4e8 per hour, 1 in 2.3e7 per day. Keep it running for a year and the chance of success is still only 1 in 62,000.
Even a moderately competent programmer should be able to make such a change in, what, ten minutes?
Note that changes must come in at least "packets" that are correct. That is, if a mutation generates "reax z", that's only one character away from "read z", but it would still produce compile errors, and so would fail.
Likewise adding "read z" but changing the calculation to "total=x+y+w" is not going to work. Depending on the language, you'll either get errors for the undefined variable or at best it will have some default value, like zero, and give incorrect results.
You could, I suppose, theorize incremental solutions. Maybe one mutation adds the new read statement, then a future mutation updates the calculation. But without the calculation, the additional read is worthless. How will the program be evaluated to determine that the additional read is "a step in the right direction"? The only way I see to do that is to have an intelligent being read the code after each mutation and see if the change is making progress toward the desired goal. And if you have an intelligent designer who can do that, that must mean that he knows what the desired goal is and how to achieve it. At which point, it would be far more efficient to just make the desired change rather than waiting for it to happen randomly.
And this is an exceedingly trivial program in a very easy language. Most programs are, what, hundreds or thousands of lines, all of which must work together. The odds against any random process writing a working program are astronomical.
There might be ways to do something that resembles this in some very specialized application, where you are not really making random mutations, but rather making incremental modifications to the parameters of a solution. Like, we have a formula with some constants whose values we don't know. We know what the correct results are for some small set of inputs. So we make random changes to the constants, and if the result is closer to the right answer, change from there, if not, go back to the previous value. But even at that, I think it would rarely be productive to make random changes. It would likely be more helpful to try changing the constants according to a strict formula, like start with changing by 1000's, then 100's then 10's, etc.

I want to just give you a suggestion. I don't know how successful you'd be, but perhaps you could try to evolve a core war bot with genetic programming. Your fitness function is easy: just let the bots compete in a game. You could start with well known bots and perhaps a few random ones then wait and see what happens.

Performance optimization strategies of last resort [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
There are plenty of performance questions on this site already, but it occurs to me that almost all are very problem-specific and fairly narrow. And almost all repeat the advice to avoid premature optimization.
Let's assume:
the code already is working correctly
the algorithms chosen are already optimal for the circumstances of the problem
the code has been measured, and the offending routines have been isolated
all attempts to optimize will also be measured to ensure they do not make matters worse
What I am looking for here is strategies and tricks to squeeze out up to the last few percent in a critical algorithm when there is nothing else left to do but whatever it takes.
Ideally, try to make answers language agnostic, and indicate any down-sides to the suggested strategies where applicable.
I'll add a reply with my own initial suggestions, and look forward to whatever else the Stack Overflow community can think of.

OK, you're defining the problem to where it would seem there is not much room for improvement. That is fairly rare, in my experience. I tried to explain this in a Dr. Dobbs article in November 1993, by starting from a conventionally well-designed non-trivial program with no obvious waste and taking it through a series of optimizations until its wall-clock time was reduced from 48 seconds to 1.1 seconds, and the source code size was reduced by a factor of 4. My diagnostic tool was this. The sequence of changes was this:
The first problem found was use of list clusters (now called "iterators" and "container classes") accounting for over half the time. Those were replaced with fairly simple code, bringing the time down to 20 seconds.
Now the largest time-taker is more list-building. As a percentage, it was not so big before, but now it is because the bigger problem was removed. I find a way to speed it up, and the time drops to 17 seconds.
Now it is harder to find obvious culprits, but there are a few smaller ones that I can do something about, and the time drops to 13 sec.
Now I seem to have hit a wall. The samples are telling me exactly what it is doing, but I can't seem to find anything that I can improve. Then I reflect on the basic design of the program, on its transaction-driven structure, and ask if all the list-searching that it is doing is actually mandated by the requirements of the problem.
Then I hit upon a re-design, where the program code is actually generated (via preprocessor macros) from a smaller set of source, and in which the program is not constantly figuring out things that the programmer knows are fairly predictable. In other words, don't "interpret" the sequence of things to do, "compile" it.
That redesign is done, shrinking the source code by a factor of 4, and the time is reduced to 10 seconds.
Now, because it's getting so quick, it's hard to sample, so I give it 10 times as much work to do, but the following times are based on the original workload.
More diagnosis reveals that it is spending time in queue-management. In-lining these reduces the time to 7 seconds.
Now a big time-taker is the diagnostic printing I had been doing. Flush that - 4 seconds.
Now the biggest time-takers are calls to malloc and free. Recycle objects - 2.6 seconds.
Continuing to sample, I still find operations that are not strictly necessary - 1.1 seconds.
Total speedup factor: 43.6
Now no two programs are alike, but in non-toy software I've always seen a progression like this. First you get the easy stuff, and then the more difficult, until you get to a point of diminishing returns. Then the insight you gain may well lead to a redesign, starting a new round of speedups, until you again hit diminishing returns. Now this is the point at which it might make sense to wonder whether ++i or i++ or for(;;) or while(1) are faster: the kinds of questions I see so often on Stack Overflow.
P.S. It may be wondered why I didn't use a profiler. The answer is that almost every one of these "problems" was a function call site, which stack samples pinpoint. Profilers, even today, are just barely coming around to the idea that statements and call instructions are more important to locate, and easier to fix, than whole functions.
I actually built a profiler to do this, but for a real down-and-dirty intimacy with what the code is doing, there's no substitute for getting your fingers right in it. It is not an issue that the number of samples is small, because none of the problems being found are so tiny that they are easily missed.
ADDED: jerryjvl requested some examples. Here is the first problem. It consists of a small number of separate lines of code, together taking over half the time:
/* IF ALL TASKS DONE, SEND ITC_ACKOP, AND DELETE OP */
if (ptop->current_task >= ILST_LENGTH(ptop->tasklist){
. . .
/* FOR EACH OPERATION REQUEST */
for ( ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)){
. . .
/* GET CURRENT TASK */
ptask = ILST_NTH(ptop->tasklist, ptop->current_task)
These were using the list cluster ILST (similar to a list class). They are implemented in the usual way, with "information hiding" meaning that the users of the class were not supposed to have to care how they were implemented. When these lines were written (out of roughly 800 lines of code) thought was not given to the idea that these could be a "bottleneck" (I hate that word). They are simply the recommended way to do things. It is easy to say in hindsight that these should have been avoided, but in my experience all performance problems are like that. In general, it is good to try to avoid creating performance problems. It is even better to find and fix the ones that are created, even though they "should have been avoided" (in hindsight). I hope that gives a bit of the flavor.
Here is the second problem, in two separate lines:
/* ADD TASK TO TASK LIST */
ILST_APPEND(ptop->tasklist, ptask)
. . .
/* ADD TRANSACTION TO TRANSACTION QUEUE */
ILST_APPEND(trnque, ptrn)
These are building lists by appending items to their ends. (The fix was to collect the items in arrays, and build the lists all at once.) The interesting thing is that these statements only cost (i.e. were on the call stack) 3/48 of the original time, so they were not in fact a big problem at the beginning. However, after removing the first problem, they cost 3/20 of the time and so were now a "bigger fish". In general, that's how it goes.
I might add that this project was distilled from a real project I helped on. In that project, the performance problems were far more dramatic (as were the speedups), such as calling a database-access routine within an inner loop to see if a task was finished.
REFERENCE ADDED:
The source code, both original and redesigned, can be found in www.ddj.com, for 1993, in file 9311.zip, files slug.asc and slug.zip.
EDIT 2011/11/26:
There is now a SourceForge project containing source code in Visual C++ and a blow-by-blow description of how it was tuned. It only goes through the first half of the scenario described above, and it doesn't follow exactly the same sequence, but still gets a 2-3 order of magnitude speedup.

Suggestions:
Pre-compute rather than re-calculate: any loops or repeated calls that contain calculations that have a relatively limited range of inputs, consider making a lookup (array or dictionary) that contains the result of that calculation for all values in the valid range of inputs. Then use a simple lookup inside the algorithm instead.
Down-sides: if few of the pre-computed values are actually used this may make matters worse, also the lookup may take significant memory.
Don't use library methods: most libraries need to be written to operate correctly under a broad range of scenarios, and perform null checks on parameters, etc. By re-implementing a method you may be able to strip out a lot of logic that does not apply in the exact circumstance you are using it.
Down-sides: writing additional code means more surface area for bugs.
Do use library methods: to contradict myself, language libraries get written by people that are a lot smarter than you or me; odds are they did it better and faster. Do not implement it yourself unless you can actually make it faster (i.e.: always measure!)
Cheat: in some cases although an exact calculation may exist for your problem, you may not need 'exact', sometimes an approximation may be 'good enough' and a lot faster in the deal. Ask yourself, does it really matter if the answer is out by 1%? 5%? even 10%?
Down-sides: Well... the answer won't be exact.

When you can't improve the performance any more - see if you can improve the perceived performance instead.
You may not be able to make your fooCalc algorithm faster, but often there are ways to make your application seem more responsive to the user.
A few examples:
anticipating what the user is going
to request and start working on that
before then
displaying results as
they come in, instead of all at once
at the end
Accurate progress meter
These won't make your program faster, but it might make your users happier with the speed you have.

I spend most of my life in just this place. The broad strokes are to run your profiler and get it to record:
Cache misses. Data cache is the #1 source of stalls in most programs. Improve cache hit rate by reorganizing offending data structures to have better locality; pack structures and numerical types down to eliminate wasted bytes (and therefore wasted cache fetches); prefetch data wherever possible to reduce stalls.
Load-hit-stores. Compiler assumptions about pointer aliasing, and cases where data is moved between disconnected register sets via memory, can cause a certain pathological behavior that causes the entire CPU pipeline to clear on a load op. Find places where floats, vectors, and ints are being cast to one another and eliminate them. Use __restrict liberally to promise the compiler about aliasing.
Microcoded operations. Most processors have some operations that cannot be pipelined, but instead run a tiny subroutine stored in ROM. Examples on the PowerPC are integer multiply, divide, and shift-by-variable-amount. The problem is that the entire pipeline stops dead while this operation is executing. Try to eliminate use of these operations or at least break them down into their constituent pipelined ops so you can get the benefit of superscalar dispatch on whatever the rest of your program is doing.
Branch mispredicts. These too empty the pipeline. Find cases where the CPU is spending a lot of time refilling the pipe after a branch, and use branch hinting if available to get it to predict correctly more often. Or better yet, replace branches with conditional-moves wherever possible, especially after floating point operations because their pipe is usually deeper and reading the condition flags after fcmp can cause a stall.
Sequential floating-point ops. Make these SIMD.
And one more thing I like to do:
Set your compiler to output assembly listings and look at what it emits for the hotspot functions in your code. All those clever optimizations that "a good compiler should be able to do for you automatically"? Chances are your actual compiler doesn't do them. I've seen GCC emit truly WTF code.

Throw more hardware at it!

More suggestions:
Avoid I/O: Any I/O (disk, network, ports, etc.) is
always going to be far slower than any code that is
performing calculations, so get rid of any I/O that you do
not strictly need.
Move I/O up-front: Load up all the data you are going
to need for a calculation up-front, so that you do not
have repeated I/O waits within the core of a critical
algorithm (and maybe as a result repeated disk seeks, when
loading all the data in one hit may avoid seeking).
Delay I/O: Do not write out your results until the
calculation is over, store them in a data structure and
then dump that out in one go at the end when the hard work
is done.
Threaded I/O: For those daring enough, combine 'I/O
up-front' or 'Delay I/O' with the actual calculation by
moving the loading into a parallel thread, so that while
you are loading more data you can work on a calculation on
the data you already have, or while you calculate the next
batch of data you can simultaneously write out the results
from the last batch.

Since many of the performance problems involve database issues, I'll give you some specific things to look at when tuning queries and stored procedures.
Avoid cursors in most databases. Avoid looping as well. Most of the time, data access should be set-based, not record by record processing. This includes not reusing a single record stored procedure when you want to insert 1,000,000 records at once.
Never use select *, only return the fields you actually need. This is especially true if there are any joins as the join fields will be repeated and thus cause unnecesary load on both the server and the network.
Avoid the use of correlated subqueries. Use joins (including joins to derived tables where possible) (I know this is true for Microsoft SQL Server, but test the advice when using a differnt backend).
Index, index, index. And get those stats updated if applicable to your database.
Make the query sargable. Meaning avoid things which make it impossible to use the indexes such as using a wildcard in the first character of a like clause or a function in the join or as the left part of a where statement.
Use correct data types. It is faster to do date math on a date field than to have to try to convert a string datatype to a date datatype, then do the calculation.
Never put a loop of any kind into a trigger!
Most databases have a way to check how the query execution will be done. In Microsoft SQL Server this is called an execution plan. Check those first to see where problem areas lie.
Consider how often the query runs as well as how long it takes to run when determining what needs to be optimized. Sometimes you can gain more perfomance from a slight tweak to a query that runs millions of times a day than you can from wiping time off a long_running query that only runs once a month.
Use some sort of profiler tool to find out what is really being sent to and from the database. I can remember one time in the past where we couldn't figure out why the page was so slow to load when the stored procedure was fast and found out through profiling that the webpage was asking for the query many many times instead of once.
The profiler will also help you to find who are blocking who. Some queries that execute quickly while running alone may become really slow due to locks from other queries.

The single most important limiting factor today is the limited memory bandwitdh. Multicores are just making this worse, as the bandwidth is shared betwen cores. Also, the limited chip area devoted to implementing caches is also divided among the cores and threads, worsening this problem even more. Finally, the inter-chip signalling needed to keep the different caches coherent also increase with an increased number of cores. This also adds a penalty.
These are the effects that you need to manage. Sometimes through micro managing your code, but sometimes through careful consideration and refactoring.
A lot of comments already mention cache friendly code. There are at least two distinct flavors of this:
Avoid memory fetch latencies.
Lower memory bus pressure (bandwidth).
The first problem specifically has to do with making your data access patterns more regular, allowing the hardware prefetcher to work efficiently. Avoid dynamic memory allocation which spreads your data objects around in memory. Use linear containers instead of linked lists, hashes and trees.
The second problem has to do with improving data reuse. Alter your algorithms to work on subsets of your data that do fit in available cache, and reuse that data as much as possible while it is still in the cache.
Packing data tighter and making sure you use all data in cache lines in the hot loops, will help avoid these other effects, and allow fitting more useful data in the cache.

What hardware are you running on? Can you use platform-specific optimizations (like vectorization)?
Can you get a better compiler? E.g. switch from GCC to Intel?
Can you make your algorithm run in parallel?
Can you reduce cache misses by reorganizing data?
Can you disable asserts?
Micro-optimize for your compiler and platform. In the style of, "at an if/else, put the most common statement first"

Although I like Mike Dunlavey's answer, in fact it is a great answer indeed with supporting example, I think it could be expressed very simply thus:
Find out what takes the largest amounts of time first, and understand why.
It is the identification process of the time hogs that helps you understand where you must refine your algorithm. This is the only all-encompassing language agnostic answer I can find to a problem that's already supposed to be fully optimised. Also presuming you want to be architecture independent in your quest for speed.
So while the algorithm may be optimised, the implementation of it may not be. The identification allows you to know which part is which: algorithm or implementation. So whichever hogs the time the most is your prime candidate for review. But since you say you want to squeeze the last few % out, you might want to also examine the lesser parts, the parts that you have not examined that closely at first.
Lastly a bit of trial and error with performance figures on different ways to implement the same solution, or potentially different algorithms, can bring insights that help identify time wasters and time savers.
HPH,
asoudmove.

You should probably consider the "Google perspective", i.e. determine how your application can become largely parallelized and concurrent, which will inevitably also mean at some point to look into distributing your application across different machines and networks, so that it can ideally scale almost linearly with the hardware that you throw at it.
On the other hand, the Google folks are also known for throwing lots of manpower and resources at solving some of the issues in projects, tools and infrastructure they are using, such as for example whole program optimization for gcc by having a dedicated team of engineers hacking gcc internals in order to prepare it for Google-typical use case scenarios.
Similarly, profiling an application no longer means to simply profile the program code, but also all its surrounding systems and infrastructure (think networks, switches, server, RAID arrays) in order to identify redundancies and optimization potential from a system's point of view.

Inline routines (eliminate call/return and parameter pushing)
Try eliminating tests/switches with table look ups (if they're faster)
Unroll loops (Duff's device) to the point where they just fit in the CPU cache
Localize memory access so as not to blow your cache
Localize related calculations if the optimizer isn't already doing that
Eliminate loop invariants if the optimizer isn't already doing that

When you get to the point that you're using efficient algorithms its a question of what you need more speed or memory. Use caching to "pay" in memory for more speed or use calculations to reduce the memory footprint.
If possible (and more cost effective) throw hardware at the problem - faster CPU, more memory or HD could solve the problem faster then trying to code it.
Use parallelization if possible - run part of the code on multiple threads.
Use the right tool for the job. some programing languages create more efficient code, using managed code (i.e. Java/.NET) speed up development but native programing languages creates faster running code.
Micro optimize. Only were applicable you can use optimized assembly to speed small pieces of code, using SSE/vector optimizations in the right places can greatly increase performance.

Divide and conquer
If the dataset being processed is too large, loop over chunks of it. If you've done your code right, implementation should be easy. If you have a monolithic program, now you know better.

First of all, as mentioned in several prior answers, learn what bites your performance - is it memory or processor or network or database or something else. Depending on that...
...if it's memory - find one of the books written long time ago by Knuth, one of "The Art of Computer Programming" series. Most likely it's one about sorting and search - if my memory is wrong then you'll have to find out in which he talks about how to deal with slow tape data storage. Mentally transform his memory/tape pair into your pair of cache/main memory (or in pair of L1/L2 cache) respectively. Study all the tricks he describes - if you don's find something that solves your problem, then hire professional computer scientist to conduct a professional research. If your memory issue is by chance with FFT (cache misses at bit-reversed indexes when doing radix-2 butterflies) then don't hire a scientist - instead, manually optimize passes one-by-one until you're either win or get to dead end. You mentioned squeeze out up to the last few percent right? If it's few indeed you'll most likely win.
...if it's processor - switch to assembly language. Study processor specification - what takes ticks, VLIW, SIMD. Function calls are most likely replaceable tick-eaters. Learn loop transformations - pipeline, unroll. Multiplies and divisions might be replaceable / interpolated with bit shifts (multiplies by small integers might be replaceable with additions). Try tricks with shorter data - if you're lucky one instruction with 64 bits might turn out replaceable with two on 32 or even 4 on 16 or 8 on 8 bits go figure. Try also longer data - eg your float calculations might turn out slower than double ones at particular processor. If you have trigonometric stuff, fight it with pre-calculated tables; also keep in mind that sine of small value might be replaced with that value if loss of precision is within allowed limits.
...if it's network - think of compressing data you pass over it. Replace XML transfer with binary. Study protocols. Try UDP instead of TCP if you can somehow handle data loss.
...if it's database, well, go to any database forum and ask for advice. In-memory data-grid, optimizing query plan etc etc etc.
HTH :)

Caching! A cheap way (in programmer effort) to make almost anything faster is to add a caching abstraction layer to any data movement area of your program. Be it I/O or just passing/creation of objects or structures. Often it's easy to add caches to factory classes and reader/writers.
Sometimes the cache will not gain you much, but it's an easy method to just add caching all over and then disable it where it doesn't help. I've often found this to gain huge performance without having to micro-analyse the code.

I think this has already been said in a different way. But when you're dealing with a processor intensive algorithm, you should simplify everything inside the most inner loop at the expense of everything else.
That may seem obvious to some, but it's something I try to focus on regardless of the language I'm working with. If you're dealing with nested loops, for example, and you find an opportunity to take some code down a level, you can in some cases drastically speed up your code. As another example, there are the little things to think about like working with integers instead of floating point variables whenever you can, and using multiplication instead of division whenever you can. Again, these are things that should be considered for your most inner loop.
Sometimes you may find benefit of performing your math operations on an integer inside the inner loop, and then scaling it down to a floating point variable you can work with afterwards. That's an example of sacrificing speed in one section to improve the speed in another, but in some cases the pay off can be well worth it.

I've spent some time working on optimising client/server business systems operating over low-bandwidth and long-latency networks (e.g. satellite, remote, offshore), and been able to achieve some dramatic performance improvements with a fairly repeatable process.
Measure: Start by understanding the network's underlying capacity and topology. Talking to the relevant networking people in the business, and make use of basic tools such as ping and traceroute to establish (at a minimum) the network latency from each client location, during typical operational periods. Next, take accurate time measurements of specific end user functions that display the problematic symptoms. Record all of these measurements, along with their locations, dates and times. Consider building end-user "network performance testing" functionality into your client application, allowing your power users to participate in the process of improvement; empowering them like this can have a huge psychological impact when you're dealing with users frustrated by a poorly performing system.
Analyze: Using any and all logging methods available to establish exactly what data is being transmitted and received during the execution of the affected operations. Ideally, your application can capture data transmitted and received by both the client and the server. If these include timestamps as well, even better. If sufficient logging isn't available (e.g. closed system, or inability to deploy modifications into a production environment), use a network sniffer and make sure you really understand what's going on at the network level.
Cache: Look for cases where static or infrequently changed data is being transmitted repetitively and consider an appropriate caching strategy. Typical examples include "pick list" values or other "reference entities", which can be surprisingly large in some business applications. In many cases, users can accept that they must restart or refresh the application to update infrequently updated data, especially if it can shave significant time from the display of commonly used user interface elements. Make sure you understand the real behaviour of the caching elements already deployed - many common caching methods (e.g. HTTP ETag) still require a network round-trip to ensure consistency, and where network latency is expensive, you may be able to avoid it altogether with a different caching approach.
Parallelise: Look for sequential transactions that don't logically need to be issued strictly sequentially, and rework the system to issue them in parallel. I dealt with one case where an end-to-end request had an inherent network delay of ~2s, which was not a problem for a single transaction, but when 6 sequential 2s round trips were required before the user regained control of the client application, it became a huge source of frustration. Discovering that these transactions were in fact independent allowed them to be executed in parallel, reducing the end-user delay to very close to the cost of a single round trip.
Combine: Where sequential requests must be executed sequentially, look for opportunities to combine them into a single more comprehensive request. Typical examples include creation of new entities, followed by requests to relate those entities to other existing entities.
Compress: Look for opportunities to leverage compression of the payload, either by replacing a textual form with a binary one, or using actual compression technology. Many modern (i.e. within a decade) technology stacks support this almost transparently, so make sure it's configured. I have often been surprised by the significant impact of compression where it seemed clear that the problem was fundamentally latency rather than bandwidth, discovering after the fact that it allowed the transaction to fit within a single packet or otherwise avoid packet loss and therefore have an outsize impact on performance.
Repeat: Go back to the beginning and re-measure your operations (at the same locations and times) with the improvements in place, record and report your results. As with all optimisation, some problems may have been solved exposing others that now dominate.
In the steps above, I focus on the application related optimisation process, but of course you must ensure the underlying network itself is configured in the most efficient manner to support your application too. Engage the networking specialists in the business and determine if they're able to apply capacity improvements, QoS, network compression, or other techniques to address the problem. Usually, they will not understand your application's needs, so it's important that you're equipped (after the Analyse step) to discuss it with them, and also to make the business case for any costs you're going to be asking them to incur. I've encountered cases where erroneous network configuration caused the applications data to be transmitted over a slow satellite link rather than an overland link, simply because it was using a TCP port that was not "well known" by the networking specialists; obviously rectifying a problem like this can have a dramatic impact on performance, with no software code or configuration changes necessary at all.

Very difficult to give a generic answer to this question. It really depends on your problem domain and technical implementation. A general technique that is fairly language neutral: Identify code hotspots that cannot be eliminated, and hand-optimize assembler code.

Last few % is a very CPU and application dependent thing....
cache architectures differ, some chips have on-chip RAM
you can map directly, ARM's (sometimes) have a vector
unit, SH4's a useful matrix opcode. Is there a GPU -
maybe a shader is the way to go. TMS320's are very
sensitive to branches within loops (so separate loops and
move conditions outside if possible).
The list goes on.... But these sorts of things really are
the last resort...
Build for x86, and run Valgrind/Cachegrind against the code
for proper performance profiling. Or Texas Instruments'
CCStudio has a sweet profiler. Then you'll really know where
to focus...

Not nearly as in depth or complex as previous answers, but here goes:
(these are more beginner/intermediate level)
obvious: dry
run loops backwards so you're always comparing to 0 rather than a variable
use bitwise operators whenever you can
break repetitive code into modules/functions
cache objects
local variables have slight performance advantage
limit string manipulation as much as possible

Did you know that a CAT6 cable is capable of 10x better shielding off external inteferences than a default Cat5e UTP cable?
For any non-offline projects, while having best software and best hardware, if your throughoutput is weak, then that thin line is going to squeeze data and give you delays, albeit in milliseconds...
Also the maximum throughput is higher on CAT6 cables because there is a higher chance that you will actually receive a cable whose strands exist of cupper cores, instead of CCA, Cupper Cladded Aluminium, which is often fount in all your standard CAT5e cables.
I if you are facing lost packets, packet drops, then an increase in throughput reliability for 24/7 operation can make the difference that you may be looking for.
For those who seek the ultimate in home/office connection reliability, (and are willing to say NO to this years fastfood restaurants, at the end of the year you can there you can) gift yourself the pinnacle of LAN connectivity in the form of CAT7 cable from a reputable brand.

Impossible to say. It depends on what the code looks like. If we can assume that the code already exists, then we can simply look at it and figure out from that, how to optimize it.
Better cache locality, loop unrolling, Try to eliminate long dependency chains, to get better instruction-level parallelism. Prefer conditional moves over branches when possible. Exploit SIMD instructions when possible.
Understand what your code is doing, and understand the hardware it's running on. Then it becomes fairly simple to determine what you need to do to improve performance of your code. That's really the only truly general piece of advice I can think of.
Well, that, and "Show the code on SO and ask for optimization advice for that specific piece of code".

If better hardware is an option then definitely go for that. Otherwise
Check you are using the best compiler and linker options.
If hotspot routine in different library to frequent caller, consider moving or cloning it to the callers module. Eliminates some of the call overhead and may improve cache hits (cf how AIX links strcpy() statically into separately linked shared objects). This could of course decrease cache hits also, which is why one measure.
See if there is any possibility of using a specialized version of the hotspot routine. Downside is more than one version to maintain.
Look at the assembler. If you think it could be better, consider why the compiler did not figure this out, and how you could help the compiler.
Consider: are you really using the best algorithm? Is it the best algorithm for your input size?

The google way is one option "Cache it.. Whenever possible don't touch the disk"

Here are some quick and dirty optimization techniques I use. I consider this to be a 'first pass' optimization.
Learn where the time is spent Find out exactly what is taking the time. Is it file IO? Is it CPU time? Is it the network? Is it the Database? It's useless to optimize for IO if that's not the bottleneck.
Know Your Environment Knowing where to optimize typically depends on the development environment. In VB6, for example, passing by reference is slower than passing by value, but in C and C++, by reference is vastly faster. In C, it is reasonable to try something and do something different if a return code indicates a failure, while in Dot Net, catching exceptions are much slower than checking for a valid condition before attempting.
Indexes Build indexes on frequently queried database fields. You can almost always trade space for speed.
Avoid lookups Inside of the loop to be optimized, I avoid having to do any lookups. Find the offset and/or index outside of the loop and reuse the data inside.
Minimize IO try to design in a manner that reduces the number of times you have to read or write especially over a networked connection
Reduce Abstractions The more layers of abstraction the code has to work through, the slower it is. Inside the critical loop, reduce abstractions (e.g. reveal lower-level methods that avoid extra code)
Spawn Threads for projects with a user interface, spawning a new thread to preform slower tasks makes the application feel more responsive, although isn't.
Pre-process You can generally trade space for speed. If there are calculations or other intense operations, see if you can precompute some of the information before you're in the critical loop.

If you have a lot of highly parallel floating point math-especially single-precision-try offloading it to a graphics processor (if one is present) using OpenCL or (for NVidia chips) CUDA. GPUs have immense floating point computing power in their shaders, which is much greater than that of a CPU.

Adding this answer since I didnt see it included in all the others.
Minimize implicit conversion between types and sign:
This applies to C/C++ at least, Even if you already think you're free of conversions - sometimes its good to test adding compiler warnings around functions that require performance, especially watch-out for conversions within loops.
GCC spesific: You can test this by adding some verbose pragmas around your code,
#ifdef __GNUC__
# pragma GCC diagnostic push
# pragma GCC diagnostic error "-Wsign-conversion"
# pragma GCC diagnostic error "-Wdouble-promotion"
# pragma GCC diagnostic error "-Wsign-compare"
# pragma GCC diagnostic error "-Wconversion"
#endif
/* your code */
#ifdef __GNUC__
# pragma GCC diagnostic pop
#endif
I've seen cases where you can get a few percent speedup by reducing conversions raised by warnings like this.
In some cases I have a header with strict warnings that I keep included to prevent accidental conversions, however this is a trade-off since you may end up adding a lot of casts to quiet intentional conversions which may just make the code more cluttered for minimal gains.

Sometimes changing the layout of your data can help. In C, you might switch from an array or structures to a structure of arrays, or vice versa.

Tweak the OS and framework.
It may sound an overkill but think about it like this: Operating Systems and Frameworks are designed to do many things. Your application only does very specific things. If you could get the OS do to exactly what your application needs and have your application understand how the the framework (php,.net,java) works, you could get much better out of your hardware.
Facebook, for example, changed some kernel level thingys in Linux, changed how memcached works (for example they wrote a memcached proxy, and used udp instead of tcp).
Another example for this is Window2008. Win2K8 has a version were you can install just the basic OS needed to run X applicaions (e.g. Web-Apps, Server Apps). This reduces much of the overhead that the OS have on running processes and gives you better performance.
Of course, you should always throw in more hardware as the first step...

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio