How to allocate video-card memory directly in Windows?

How to allocate video-card memory directly in Windows? - windows

I am required to allocate bitmap in video-card memory in a project in Windows. Because the project use other 2d library other than GDI, so CreateCompatibleBitmap is useless.
Then i figure out a method using DX, here is my code:
if(FAILED(g_D3DDevice->CreateVertexBuffer(10240 * 1024, 0,
D3DFVF_VERTEX, D3DPOOL_DEFAULT, &g_VertexBuffer,
NULL))) return false;
// Fill the vertex buffer.
void *ptr;
if(FAILED(g_VertexBuffer->Lock(0, 1024 * 10240,
(void**)&ptr, 0))) return false;
//do something...
//printf("ptf = %x\n", ptr);
//memcpy(ptr, objData, sizeof(objData));
g_VertexBuffer->Unlock();
It works well so far. But are there any sideeffect?

Sideeffect:
You will have to unlock the VertexBuffer every single time you want to write or read to it
Slower then normal memory
Aside from this sideeffect make sure to check this process went sucessfull as video card out of memory does not supply the most "user obvious"-kind of error.
As for why it would be slow: by putting it into your video card memory you will have to request access to the vertex each time you want to write/read, which is slower then actually accessing it right away in regular memory.

I can't imagine any legitimate reason to want to do this but in general it is not possible anyway. Windows PCs have a wide range of hardware and many do not even have a concept of separate video memory. For those that do the performance characteristics are not guaranteed which is why Direct3D has lots of ways of indicating to the driver your intended usage so that the driver can decide where best to locate resources (amongst other things, like how best to lay them out in memory) rather than providing ways for you to specify 'in video memory' and other concepts that are not consistently defined across different hardware types.
If as you say your boss asked you to do this it seems like a case study in bad project management - specifying the implementation details of what to do rather than specifying the project goals and leaving the engineer tasked with implementation to figure out how best to achieve them. Presumably the real goal here is to improve performance of some task or operation. Almost certainly what you're doing is not the best way to achieve that.

Related

Are there DirectX guidelines for binding and unbinding resources between draw calls?

All DirectX books and tutorials strongly recommend reducing resource allocations between draw calls to a minimum – yet I can’t find any guidelines that get more into details. Reviewing a lot of sample code found in the web, I have concluded that programmers have completely different coding principles regarding this subject. Some even set and unset
VS/PS
VS/PS ResourceViews
RasterizerStage
DepthStencilState
PrimitiveTopology
...
before and after every draw call (although the setup remains unchanged), and others don’t.
I guess that's a bit overdone...
From my own experiments I have found that the only resources I have to bind on every draw call are the ShaderResourceViews (to VS and PS in my case). This requirement may be caused by the use of compute shaders since I bind/unbind UAVs to buffers that are bound to VS / PS later on.
I have lost many hours of work before I detected that this rebinding was necessary. And I guess that many coders aren’t sure either and prefer to unbind and rebind a “little too much” instead of running into a similar trap.
Question 1: Are there at least some rules of thumb regarding this problem?
Question 2: Is it possible that my ShaderResourceViews bound to VS/PS are unbound by the driver/DirectX core because I bind UAVs to the same buffers before the CS dispatch call (I don’t unbind the SRVs myself)?
Question 3: I don't even set VS/PS to null before I use the compute shaders. Works w/o problems yet I feel constantly unsure whether or not I'm digging my next trap using such a "lazy" approach.

You want to have as less overhead, but also while avoiding invalid pipeline state. That's why some people unbind everything (try to prevent as much), it depends on uses cases, and of course you can balance this a bit.
To balance this you can pre allocate a specific resource to a slot, depending on resource type, since you have a different number of slots, different rules can apply
1/Samplers and States
You have 16 slots, and generally 4-5 samplers you use 90% of the time (linear/point/anisotropic/shadow).
So on application startup create those states and bind them to each shader stage you need (try not to start at zero slot, since they would easily be overriden by mistake).
Create a shader header file with mapping SamplerState -> slot, and use it in your shaders, so any slot update is reflected automatically.
Reuse this as much as possible, and only bind custom samplers.
For standard states (Blend/Depth/Rasterizer), building a small collection of common states on application startup and bind as needed is common practice.
An easy way to minimize Render State binding at low cost, you can build a stack, so you set a default state, and if a shader needs a more specific state, it can push new state to the stack, once it's done, pop last state and apply it again to the pipeline.
2/Constant Buffers
You have 14 slots, which is quite a lot, it's pretty rare (at least in my use cases) to use all of them, specially now you can also use buffers/structuredbuffers as well.
One simple common case is setting a reserved slots for camera (with all data that you need, view/projection/viewprojection, plus their inverses since you might need that too.
Bind it to (all if needed) shader stage slots, and only thing you have to do is update your cbuffer every frame, it's ready to use anywhere.
3/Shader stages
You pretty much never need to unbind Compute Shader, since it's fully separated from the pipeline.
On the other side, for pipeline stage, instead of unbinding, a reasonably good practice is to set all the ones you need and set to null the ones you don't.
If you don't follow this as example and render a shadow map (depth buffer only), a pixel shader might still be bound.
If you forget to unset a Geometry Shader that you previously used, you might end up with invalid layout combination and your object will not render (error will only show up in runtime debug mode).
So setting the full shader stage adds little overhead, but the safety trade off is very far from negligible.
In your use case (using only VS/PS and CS to build), you can safely ignore that.
4/Uavs-RenderTargets-DepthStencil
For write resources, always unset when you done with unit of work. Within the same routine you can optimize inside, but at the end of your render/compute shader function, set your output back to null, since pipeline will not allow anything to be rebound as ShaderResource while it's on output.
Not unsetting a write resource at the end of your function is recipe for disaster.
5/ShaderResourceView
This is very situational, but idea is to minimize while also avoiding runtime warnings (which can be harmless, but then hide important messages).
One eventual thing is to reset to null all shader resource inputs at the beginning of the frame, to avoid a buffer still bound in VS to be set as UAV in CS for example, this costs you 6 pipeline calls per frame, but it's generally worth it.
If you have enough spare registers and some constant resources you can also of course set those in some reserved slots and bind those once and for all.
6/IA related resources
For this one, you need to set the right data to draw your geometry, so everytime you bind it it's pretty reasonable to set InputLayout/Topology . You can of course organize your draw calls to minimize switches.
I find Topology to be rather critical to be set properly, since invalid topology (for example, using Triangle List with a pipeline including tesselation), will draw nothing and give you a runtime warning, but it's very common that on AMD card it will just crash your driver, so better to avoid that as it becomes rather hard to debug.
Generally never really unbinding vertex/index buffers (since just overwriting them and Input layout tells how to fetch anyway).
Only exception to this rule if in the case those buffers are generated in compute/stream out, to avoid the above mentioned runtime warning.

Answer 1 : less is better.
Answer 2 : it is the opposite, you have to unbind a view before you bind the resource with a different kind of view. You should enable the debug layer to catch errors like this.
Answer 3 : that's fine.

Why not use GDI to repeatedly fill a window with RGB data from an array?

This is a follow-up to this question. I'm currently writing a simple game and am looking for the fastest way to (repeatedly) display an array of RGB data in a Win32 window, without flickering or other artifacts.
Several different approaches were recommended in the answers to the previous question, but there was no consensus on which would be the fastest. So, I threw together a test program. The code simply displays a framebuffer on the screen repeatedly, as fast as possible.
These are the results I obtained, for 32-bit data running in a 32-bit video mode - they may surprise some people:
- Direct3D (1): 500 fps
- Direct3D (2): 650 fps
- DirectDraw (3): 1100 fps
- DirectDraw (4): 800 fps
- GDI (SetDIBitsToDevice): 2000 fps
Given these figures:
Why are many people adamant that GDI is simply too slow for this operation?
Is there any reason to prefer DirectDraw or Direct3D over SetDIBitsToDevice?
Here is a brief summary of the calls made by each of the Direct* codepaths. If anyone knows a more efficient way to use DirectDraw/Direct3D, please comment.
1. CreateTexture(D3DUSAGE_DYNAMIC, D3DPOOL_DEFAULT);
LockRect(); memcpy(); UnlockRect(); DrawPrimitive()
2. CreateTexture(0, D3DPOOL_SYSTEMMEM); CreateTexture(0, D3DPOOL_DEFAULT);
LockRect(); memcpy(); UnlockRect(); UpdateTexture(); DrawPrimitive()
3. CreateSurface(); SetSurfaceDesc(lpSurface = &frameBuffer[0]);
memcpy(); primarySurface->Blt();
4. CreateSurface();
Lock(); memcpy(); Unlock(); primarySurface->Blt();

There are a couple of things to keep in mind here. First of all, a lot of "common knowledge" is based on some facts that no longer really apply.
In the days of AGP, when the CPU talked directly to the GPU, it always used the base PCI protocol, which happened at the "1x" rate (always and inevitably). AGX 2x/4x/8x only applied when the GPU was taking to the memory controller directly. In other words, depending on when you looked, it was up to 8 times as fast to have the GPU load a texture from memory as it was for the CPU to send the same data directly to the GPU. Of course, the CPU also had a great deal more bandwidth to memory than the PCI bus supported.
When things switched to PCI-E, however, that changed completely. While there can be differences in bandwidth depending on path, there's no general rule that memory->GPU will be faster than CPU->GPU. The one generalization that's (mostly) safe is that if you have a dedicated graphics card, then the GPU will almost always have more bandwidth to the memory on the graphics card than it does to main memory on the motherboard.
In your case, that doesn't matter much though -- you're talking about moving data from CPU space to GPU space regardless. The main speed difference with using DirectX (or OpenGL) happens when you keep all (or most) of the computation on the GPU, and avoid using the CPU (or main memory) at all. They don't (now that AGP is history) provide any substantial improvement in memory->display bandwidth.

Jerry Coffin makes some good points. The thing to bear in mind is what the DI stands for in SetDIBitsToDevice. It stands for Device Independent. Which means you were ALWAYS at the mercy of drivers. Some drivers used to be complete rubbish and it affected the performance massively. DirectDraw suffered from similar issues as well ... but you also had access to the hardware blitters so it was generally more useful. IHVs also tended to put more time in to writing proper drivers for DirectDraw because of its gaming association. Who wants to be the bottom of the performance pile when the hardware is quite capable of doing better?
These days many graphics cards can accept the bit data directly so no conversion happens. If it does need to be swizzled this is also INCREDIBLY quick in this day and age.
The reason your Direct3D performance is so terrible, by comparison, is that Direct3D, by nature of the fact it is meant to be used totally internally to the GPU, uses odd and complex formats to improve cache performance and so forth.
Couple that with the fact that you aren't testing like for like (with DDraw and D3D) by creating a texture/surface, locking it, copying, unlocking and then drawing over the back buffer (via various methods). To get best performance you'd be best off directly locking the backbuffer using a DISCARD lock then memcpy'ing directly into the returned buffer before unlocking. This will bring your performance much closer to the SetDIBitsToDevice. I still would expect D3D to be slower than DDraw, however, for the reasons outlined above.

The reason you will hear people trounce on GDI is that it used to just be old windows API calls. The newer versions of it (that were called GDI+ when I last looked at em) are actually just an API placed on top of DirectX calls. So using GDI may seem fairly simple programming wise at times, but adding a layer between things always slows things down. As mentioned in the response from Jerry Coffin, your examples are about moving the data, and that is the slow time. I am a bit surprised that DirectX is that much slower though but I can not be much more help with out digging through the DirectX documentation (which has been pretty awesome for quite some time really.. Might want to check out www.codesampler.com. I have always found good starting places from him and actually, while I may be insane for saying this, I would swear the improvements to the DirectX SDK in doc and examples were done based on this guys work!)
As for the DirectDraw vs Direct3D (and not the GDI calls) discussion. I would say go to Direct3D. I believe DirectDraw has been deprecated since 8.0 or so, and 9.0 has been around for quite a long while. And at the end of the day all of DirectX is 3D, it just varies on the levels of helpful 2D apis that are around, but you may find you can do some very interesting things in a 2D environment when you are actually using 3D space. (I had a pretty neat randomly generated lightning weapon for a space invaders clone at one time :))
Anywho, hope this helped!
PS: It should be noted that DirectX is not always the fastest. For keyboard input (unless this has changed in 10 or 11) it has pretty much always been recommended to use the windows events.. as DirectInput was actually just a wrapper for that system!.. XInput however is -awesome-!!

Performance optimization strategies of last resort [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
There are plenty of performance questions on this site already, but it occurs to me that almost all are very problem-specific and fairly narrow. And almost all repeat the advice to avoid premature optimization.
Let's assume:
the code already is working correctly
the algorithms chosen are already optimal for the circumstances of the problem
the code has been measured, and the offending routines have been isolated
all attempts to optimize will also be measured to ensure they do not make matters worse
What I am looking for here is strategies and tricks to squeeze out up to the last few percent in a critical algorithm when there is nothing else left to do but whatever it takes.
Ideally, try to make answers language agnostic, and indicate any down-sides to the suggested strategies where applicable.
I'll add a reply with my own initial suggestions, and look forward to whatever else the Stack Overflow community can think of.

OK, you're defining the problem to where it would seem there is not much room for improvement. That is fairly rare, in my experience. I tried to explain this in a Dr. Dobbs article in November 1993, by starting from a conventionally well-designed non-trivial program with no obvious waste and taking it through a series of optimizations until its wall-clock time was reduced from 48 seconds to 1.1 seconds, and the source code size was reduced by a factor of 4. My diagnostic tool was this. The sequence of changes was this:
The first problem found was use of list clusters (now called "iterators" and "container classes") accounting for over half the time. Those were replaced with fairly simple code, bringing the time down to 20 seconds.
Now the largest time-taker is more list-building. As a percentage, it was not so big before, but now it is because the bigger problem was removed. I find a way to speed it up, and the time drops to 17 seconds.
Now it is harder to find obvious culprits, but there are a few smaller ones that I can do something about, and the time drops to 13 sec.
Now I seem to have hit a wall. The samples are telling me exactly what it is doing, but I can't seem to find anything that I can improve. Then I reflect on the basic design of the program, on its transaction-driven structure, and ask if all the list-searching that it is doing is actually mandated by the requirements of the problem.
Then I hit upon a re-design, where the program code is actually generated (via preprocessor macros) from a smaller set of source, and in which the program is not constantly figuring out things that the programmer knows are fairly predictable. In other words, don't "interpret" the sequence of things to do, "compile" it.
That redesign is done, shrinking the source code by a factor of 4, and the time is reduced to 10 seconds.
Now, because it's getting so quick, it's hard to sample, so I give it 10 times as much work to do, but the following times are based on the original workload.
More diagnosis reveals that it is spending time in queue-management. In-lining these reduces the time to 7 seconds.
Now a big time-taker is the diagnostic printing I had been doing. Flush that - 4 seconds.
Now the biggest time-takers are calls to malloc and free. Recycle objects - 2.6 seconds.
Continuing to sample, I still find operations that are not strictly necessary - 1.1 seconds.
Total speedup factor: 43.6
Now no two programs are alike, but in non-toy software I've always seen a progression like this. First you get the easy stuff, and then the more difficult, until you get to a point of diminishing returns. Then the insight you gain may well lead to a redesign, starting a new round of speedups, until you again hit diminishing returns. Now this is the point at which it might make sense to wonder whether ++i or i++ or for(;;) or while(1) are faster: the kinds of questions I see so often on Stack Overflow.
P.S. It may be wondered why I didn't use a profiler. The answer is that almost every one of these "problems" was a function call site, which stack samples pinpoint. Profilers, even today, are just barely coming around to the idea that statements and call instructions are more important to locate, and easier to fix, than whole functions.
I actually built a profiler to do this, but for a real down-and-dirty intimacy with what the code is doing, there's no substitute for getting your fingers right in it. It is not an issue that the number of samples is small, because none of the problems being found are so tiny that they are easily missed.
ADDED: jerryjvl requested some examples. Here is the first problem. It consists of a small number of separate lines of code, together taking over half the time:
/* IF ALL TASKS DONE, SEND ITC_ACKOP, AND DELETE OP */
if (ptop->current_task >= ILST_LENGTH(ptop->tasklist){
. . .
/* FOR EACH OPERATION REQUEST */
for ( ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)){
. . .
/* GET CURRENT TASK */
ptask = ILST_NTH(ptop->tasklist, ptop->current_task)
These were using the list cluster ILST (similar to a list class). They are implemented in the usual way, with "information hiding" meaning that the users of the class were not supposed to have to care how they were implemented. When these lines were written (out of roughly 800 lines of code) thought was not given to the idea that these could be a "bottleneck" (I hate that word). They are simply the recommended way to do things. It is easy to say in hindsight that these should have been avoided, but in my experience all performance problems are like that. In general, it is good to try to avoid creating performance problems. It is even better to find and fix the ones that are created, even though they "should have been avoided" (in hindsight). I hope that gives a bit of the flavor.
Here is the second problem, in two separate lines:
/* ADD TASK TO TASK LIST */
ILST_APPEND(ptop->tasklist, ptask)
. . .
/* ADD TRANSACTION TO TRANSACTION QUEUE */
ILST_APPEND(trnque, ptrn)
These are building lists by appending items to their ends. (The fix was to collect the items in arrays, and build the lists all at once.) The interesting thing is that these statements only cost (i.e. were on the call stack) 3/48 of the original time, so they were not in fact a big problem at the beginning. However, after removing the first problem, they cost 3/20 of the time and so were now a "bigger fish". In general, that's how it goes.
I might add that this project was distilled from a real project I helped on. In that project, the performance problems were far more dramatic (as were the speedups), such as calling a database-access routine within an inner loop to see if a task was finished.
REFERENCE ADDED:
The source code, both original and redesigned, can be found in www.ddj.com, for 1993, in file 9311.zip, files slug.asc and slug.zip.
EDIT 2011/11/26:
There is now a SourceForge project containing source code in Visual C++ and a blow-by-blow description of how it was tuned. It only goes through the first half of the scenario described above, and it doesn't follow exactly the same sequence, but still gets a 2-3 order of magnitude speedup.

Suggestions:
Pre-compute rather than re-calculate: any loops or repeated calls that contain calculations that have a relatively limited range of inputs, consider making a lookup (array or dictionary) that contains the result of that calculation for all values in the valid range of inputs. Then use a simple lookup inside the algorithm instead.
Down-sides: if few of the pre-computed values are actually used this may make matters worse, also the lookup may take significant memory.
Don't use library methods: most libraries need to be written to operate correctly under a broad range of scenarios, and perform null checks on parameters, etc. By re-implementing a method you may be able to strip out a lot of logic that does not apply in the exact circumstance you are using it.
Down-sides: writing additional code means more surface area for bugs.
Do use library methods: to contradict myself, language libraries get written by people that are a lot smarter than you or me; odds are they did it better and faster. Do not implement it yourself unless you can actually make it faster (i.e.: always measure!)
Cheat: in some cases although an exact calculation may exist for your problem, you may not need 'exact', sometimes an approximation may be 'good enough' and a lot faster in the deal. Ask yourself, does it really matter if the answer is out by 1%? 5%? even 10%?
Down-sides: Well... the answer won't be exact.

When you can't improve the performance any more - see if you can improve the perceived performance instead.
You may not be able to make your fooCalc algorithm faster, but often there are ways to make your application seem more responsive to the user.
A few examples:
anticipating what the user is going
to request and start working on that
before then
displaying results as
they come in, instead of all at once
at the end
Accurate progress meter
These won't make your program faster, but it might make your users happier with the speed you have.

I spend most of my life in just this place. The broad strokes are to run your profiler and get it to record:
Cache misses. Data cache is the #1 source of stalls in most programs. Improve cache hit rate by reorganizing offending data structures to have better locality; pack structures and numerical types down to eliminate wasted bytes (and therefore wasted cache fetches); prefetch data wherever possible to reduce stalls.
Load-hit-stores. Compiler assumptions about pointer aliasing, and cases where data is moved between disconnected register sets via memory, can cause a certain pathological behavior that causes the entire CPU pipeline to clear on a load op. Find places where floats, vectors, and ints are being cast to one another and eliminate them. Use __restrict liberally to promise the compiler about aliasing.
Microcoded operations. Most processors have some operations that cannot be pipelined, but instead run a tiny subroutine stored in ROM. Examples on the PowerPC are integer multiply, divide, and shift-by-variable-amount. The problem is that the entire pipeline stops dead while this operation is executing. Try to eliminate use of these operations or at least break them down into their constituent pipelined ops so you can get the benefit of superscalar dispatch on whatever the rest of your program is doing.
Branch mispredicts. These too empty the pipeline. Find cases where the CPU is spending a lot of time refilling the pipe after a branch, and use branch hinting if available to get it to predict correctly more often. Or better yet, replace branches with conditional-moves wherever possible, especially after floating point operations because their pipe is usually deeper and reading the condition flags after fcmp can cause a stall.
Sequential floating-point ops. Make these SIMD.
And one more thing I like to do:
Set your compiler to output assembly listings and look at what it emits for the hotspot functions in your code. All those clever optimizations that "a good compiler should be able to do for you automatically"? Chances are your actual compiler doesn't do them. I've seen GCC emit truly WTF code.

Throw more hardware at it!

More suggestions:
Avoid I/O: Any I/O (disk, network, ports, etc.) is
always going to be far slower than any code that is
performing calculations, so get rid of any I/O that you do
not strictly need.
Move I/O up-front: Load up all the data you are going
to need for a calculation up-front, so that you do not
have repeated I/O waits within the core of a critical
algorithm (and maybe as a result repeated disk seeks, when
loading all the data in one hit may avoid seeking).
Delay I/O: Do not write out your results until the
calculation is over, store them in a data structure and
then dump that out in one go at the end when the hard work
is done.
Threaded I/O: For those daring enough, combine 'I/O
up-front' or 'Delay I/O' with the actual calculation by
moving the loading into a parallel thread, so that while
you are loading more data you can work on a calculation on
the data you already have, or while you calculate the next
batch of data you can simultaneously write out the results
from the last batch.

Since many of the performance problems involve database issues, I'll give you some specific things to look at when tuning queries and stored procedures.
Avoid cursors in most databases. Avoid looping as well. Most of the time, data access should be set-based, not record by record processing. This includes not reusing a single record stored procedure when you want to insert 1,000,000 records at once.
Never use select *, only return the fields you actually need. This is especially true if there are any joins as the join fields will be repeated and thus cause unnecesary load on both the server and the network.
Avoid the use of correlated subqueries. Use joins (including joins to derived tables where possible) (I know this is true for Microsoft SQL Server, but test the advice when using a differnt backend).
Index, index, index. And get those stats updated if applicable to your database.
Make the query sargable. Meaning avoid things which make it impossible to use the indexes such as using a wildcard in the first character of a like clause or a function in the join or as the left part of a where statement.
Use correct data types. It is faster to do date math on a date field than to have to try to convert a string datatype to a date datatype, then do the calculation.
Never put a loop of any kind into a trigger!
Most databases have a way to check how the query execution will be done. In Microsoft SQL Server this is called an execution plan. Check those first to see where problem areas lie.
Consider how often the query runs as well as how long it takes to run when determining what needs to be optimized. Sometimes you can gain more perfomance from a slight tweak to a query that runs millions of times a day than you can from wiping time off a long_running query that only runs once a month.
Use some sort of profiler tool to find out what is really being sent to and from the database. I can remember one time in the past where we couldn't figure out why the page was so slow to load when the stored procedure was fast and found out through profiling that the webpage was asking for the query many many times instead of once.
The profiler will also help you to find who are blocking who. Some queries that execute quickly while running alone may become really slow due to locks from other queries.

The single most important limiting factor today is the limited memory bandwitdh. Multicores are just making this worse, as the bandwidth is shared betwen cores. Also, the limited chip area devoted to implementing caches is also divided among the cores and threads, worsening this problem even more. Finally, the inter-chip signalling needed to keep the different caches coherent also increase with an increased number of cores. This also adds a penalty.
These are the effects that you need to manage. Sometimes through micro managing your code, but sometimes through careful consideration and refactoring.
A lot of comments already mention cache friendly code. There are at least two distinct flavors of this:
Avoid memory fetch latencies.
Lower memory bus pressure (bandwidth).
The first problem specifically has to do with making your data access patterns more regular, allowing the hardware prefetcher to work efficiently. Avoid dynamic memory allocation which spreads your data objects around in memory. Use linear containers instead of linked lists, hashes and trees.
The second problem has to do with improving data reuse. Alter your algorithms to work on subsets of your data that do fit in available cache, and reuse that data as much as possible while it is still in the cache.
Packing data tighter and making sure you use all data in cache lines in the hot loops, will help avoid these other effects, and allow fitting more useful data in the cache.

What hardware are you running on? Can you use platform-specific optimizations (like vectorization)?
Can you get a better compiler? E.g. switch from GCC to Intel?
Can you make your algorithm run in parallel?
Can you reduce cache misses by reorganizing data?
Can you disable asserts?
Micro-optimize for your compiler and platform. In the style of, "at an if/else, put the most common statement first"

Although I like Mike Dunlavey's answer, in fact it is a great answer indeed with supporting example, I think it could be expressed very simply thus:
Find out what takes the largest amounts of time first, and understand why.
It is the identification process of the time hogs that helps you understand where you must refine your algorithm. This is the only all-encompassing language agnostic answer I can find to a problem that's already supposed to be fully optimised. Also presuming you want to be architecture independent in your quest for speed.
So while the algorithm may be optimised, the implementation of it may not be. The identification allows you to know which part is which: algorithm or implementation. So whichever hogs the time the most is your prime candidate for review. But since you say you want to squeeze the last few % out, you might want to also examine the lesser parts, the parts that you have not examined that closely at first.
Lastly a bit of trial and error with performance figures on different ways to implement the same solution, or potentially different algorithms, can bring insights that help identify time wasters and time savers.
HPH,
asoudmove.

You should probably consider the "Google perspective", i.e. determine how your application can become largely parallelized and concurrent, which will inevitably also mean at some point to look into distributing your application across different machines and networks, so that it can ideally scale almost linearly with the hardware that you throw at it.
On the other hand, the Google folks are also known for throwing lots of manpower and resources at solving some of the issues in projects, tools and infrastructure they are using, such as for example whole program optimization for gcc by having a dedicated team of engineers hacking gcc internals in order to prepare it for Google-typical use case scenarios.
Similarly, profiling an application no longer means to simply profile the program code, but also all its surrounding systems and infrastructure (think networks, switches, server, RAID arrays) in order to identify redundancies and optimization potential from a system's point of view.

Inline routines (eliminate call/return and parameter pushing)
Try eliminating tests/switches with table look ups (if they're faster)
Unroll loops (Duff's device) to the point where they just fit in the CPU cache
Localize memory access so as not to blow your cache
Localize related calculations if the optimizer isn't already doing that
Eliminate loop invariants if the optimizer isn't already doing that

When you get to the point that you're using efficient algorithms its a question of what you need more speed or memory. Use caching to "pay" in memory for more speed or use calculations to reduce the memory footprint.
If possible (and more cost effective) throw hardware at the problem - faster CPU, more memory or HD could solve the problem faster then trying to code it.
Use parallelization if possible - run part of the code on multiple threads.
Use the right tool for the job. some programing languages create more efficient code, using managed code (i.e. Java/.NET) speed up development but native programing languages creates faster running code.
Micro optimize. Only were applicable you can use optimized assembly to speed small pieces of code, using SSE/vector optimizations in the right places can greatly increase performance.

Divide and conquer
If the dataset being processed is too large, loop over chunks of it. If you've done your code right, implementation should be easy. If you have a monolithic program, now you know better.

First of all, as mentioned in several prior answers, learn what bites your performance - is it memory or processor or network or database or something else. Depending on that...
...if it's memory - find one of the books written long time ago by Knuth, one of "The Art of Computer Programming" series. Most likely it's one about sorting and search - if my memory is wrong then you'll have to find out in which he talks about how to deal with slow tape data storage. Mentally transform his memory/tape pair into your pair of cache/main memory (or in pair of L1/L2 cache) respectively. Study all the tricks he describes - if you don's find something that solves your problem, then hire professional computer scientist to conduct a professional research. If your memory issue is by chance with FFT (cache misses at bit-reversed indexes when doing radix-2 butterflies) then don't hire a scientist - instead, manually optimize passes one-by-one until you're either win or get to dead end. You mentioned squeeze out up to the last few percent right? If it's few indeed you'll most likely win.
...if it's processor - switch to assembly language. Study processor specification - what takes ticks, VLIW, SIMD. Function calls are most likely replaceable tick-eaters. Learn loop transformations - pipeline, unroll. Multiplies and divisions might be replaceable / interpolated with bit shifts (multiplies by small integers might be replaceable with additions). Try tricks with shorter data - if you're lucky one instruction with 64 bits might turn out replaceable with two on 32 or even 4 on 16 or 8 on 8 bits go figure. Try also longer data - eg your float calculations might turn out slower than double ones at particular processor. If you have trigonometric stuff, fight it with pre-calculated tables; also keep in mind that sine of small value might be replaced with that value if loss of precision is within allowed limits.
...if it's network - think of compressing data you pass over it. Replace XML transfer with binary. Study protocols. Try UDP instead of TCP if you can somehow handle data loss.
...if it's database, well, go to any database forum and ask for advice. In-memory data-grid, optimizing query plan etc etc etc.
HTH :)

Caching! A cheap way (in programmer effort) to make almost anything faster is to add a caching abstraction layer to any data movement area of your program. Be it I/O or just passing/creation of objects or structures. Often it's easy to add caches to factory classes and reader/writers.
Sometimes the cache will not gain you much, but it's an easy method to just add caching all over and then disable it where it doesn't help. I've often found this to gain huge performance without having to micro-analyse the code.

I think this has already been said in a different way. But when you're dealing with a processor intensive algorithm, you should simplify everything inside the most inner loop at the expense of everything else.
That may seem obvious to some, but it's something I try to focus on regardless of the language I'm working with. If you're dealing with nested loops, for example, and you find an opportunity to take some code down a level, you can in some cases drastically speed up your code. As another example, there are the little things to think about like working with integers instead of floating point variables whenever you can, and using multiplication instead of division whenever you can. Again, these are things that should be considered for your most inner loop.
Sometimes you may find benefit of performing your math operations on an integer inside the inner loop, and then scaling it down to a floating point variable you can work with afterwards. That's an example of sacrificing speed in one section to improve the speed in another, but in some cases the pay off can be well worth it.

I've spent some time working on optimising client/server business systems operating over low-bandwidth and long-latency networks (e.g. satellite, remote, offshore), and been able to achieve some dramatic performance improvements with a fairly repeatable process.
Measure: Start by understanding the network's underlying capacity and topology. Talking to the relevant networking people in the business, and make use of basic tools such as ping and traceroute to establish (at a minimum) the network latency from each client location, during typical operational periods. Next, take accurate time measurements of specific end user functions that display the problematic symptoms. Record all of these measurements, along with their locations, dates and times. Consider building end-user "network performance testing" functionality into your client application, allowing your power users to participate in the process of improvement; empowering them like this can have a huge psychological impact when you're dealing with users frustrated by a poorly performing system.
Analyze: Using any and all logging methods available to establish exactly what data is being transmitted and received during the execution of the affected operations. Ideally, your application can capture data transmitted and received by both the client and the server. If these include timestamps as well, even better. If sufficient logging isn't available (e.g. closed system, or inability to deploy modifications into a production environment), use a network sniffer and make sure you really understand what's going on at the network level.
Cache: Look for cases where static or infrequently changed data is being transmitted repetitively and consider an appropriate caching strategy. Typical examples include "pick list" values or other "reference entities", which can be surprisingly large in some business applications. In many cases, users can accept that they must restart or refresh the application to update infrequently updated data, especially if it can shave significant time from the display of commonly used user interface elements. Make sure you understand the real behaviour of the caching elements already deployed - many common caching methods (e.g. HTTP ETag) still require a network round-trip to ensure consistency, and where network latency is expensive, you may be able to avoid it altogether with a different caching approach.
Parallelise: Look for sequential transactions that don't logically need to be issued strictly sequentially, and rework the system to issue them in parallel. I dealt with one case where an end-to-end request had an inherent network delay of ~2s, which was not a problem for a single transaction, but when 6 sequential 2s round trips were required before the user regained control of the client application, it became a huge source of frustration. Discovering that these transactions were in fact independent allowed them to be executed in parallel, reducing the end-user delay to very close to the cost of a single round trip.
Combine: Where sequential requests must be executed sequentially, look for opportunities to combine them into a single more comprehensive request. Typical examples include creation of new entities, followed by requests to relate those entities to other existing entities.
Compress: Look for opportunities to leverage compression of the payload, either by replacing a textual form with a binary one, or using actual compression technology. Many modern (i.e. within a decade) technology stacks support this almost transparently, so make sure it's configured. I have often been surprised by the significant impact of compression where it seemed clear that the problem was fundamentally latency rather than bandwidth, discovering after the fact that it allowed the transaction to fit within a single packet or otherwise avoid packet loss and therefore have an outsize impact on performance.
Repeat: Go back to the beginning and re-measure your operations (at the same locations and times) with the improvements in place, record and report your results. As with all optimisation, some problems may have been solved exposing others that now dominate.
In the steps above, I focus on the application related optimisation process, but of course you must ensure the underlying network itself is configured in the most efficient manner to support your application too. Engage the networking specialists in the business and determine if they're able to apply capacity improvements, QoS, network compression, or other techniques to address the problem. Usually, they will not understand your application's needs, so it's important that you're equipped (after the Analyse step) to discuss it with them, and also to make the business case for any costs you're going to be asking them to incur. I've encountered cases where erroneous network configuration caused the applications data to be transmitted over a slow satellite link rather than an overland link, simply because it was using a TCP port that was not "well known" by the networking specialists; obviously rectifying a problem like this can have a dramatic impact on performance, with no software code or configuration changes necessary at all.

Very difficult to give a generic answer to this question. It really depends on your problem domain and technical implementation. A general technique that is fairly language neutral: Identify code hotspots that cannot be eliminated, and hand-optimize assembler code.

Last few % is a very CPU and application dependent thing....
cache architectures differ, some chips have on-chip RAM
you can map directly, ARM's (sometimes) have a vector
unit, SH4's a useful matrix opcode. Is there a GPU -
maybe a shader is the way to go. TMS320's are very
sensitive to branches within loops (so separate loops and
move conditions outside if possible).
The list goes on.... But these sorts of things really are
the last resort...
Build for x86, and run Valgrind/Cachegrind against the code
for proper performance profiling. Or Texas Instruments'
CCStudio has a sweet profiler. Then you'll really know where
to focus...

Not nearly as in depth or complex as previous answers, but here goes:
(these are more beginner/intermediate level)
obvious: dry
run loops backwards so you're always comparing to 0 rather than a variable
use bitwise operators whenever you can
break repetitive code into modules/functions
cache objects
local variables have slight performance advantage
limit string manipulation as much as possible

Did you know that a CAT6 cable is capable of 10x better shielding off external inteferences than a default Cat5e UTP cable?
For any non-offline projects, while having best software and best hardware, if your throughoutput is weak, then that thin line is going to squeeze data and give you delays, albeit in milliseconds...
Also the maximum throughput is higher on CAT6 cables because there is a higher chance that you will actually receive a cable whose strands exist of cupper cores, instead of CCA, Cupper Cladded Aluminium, which is often fount in all your standard CAT5e cables.
I if you are facing lost packets, packet drops, then an increase in throughput reliability for 24/7 operation can make the difference that you may be looking for.
For those who seek the ultimate in home/office connection reliability, (and are willing to say NO to this years fastfood restaurants, at the end of the year you can there you can) gift yourself the pinnacle of LAN connectivity in the form of CAT7 cable from a reputable brand.

Impossible to say. It depends on what the code looks like. If we can assume that the code already exists, then we can simply look at it and figure out from that, how to optimize it.
Better cache locality, loop unrolling, Try to eliminate long dependency chains, to get better instruction-level parallelism. Prefer conditional moves over branches when possible. Exploit SIMD instructions when possible.
Understand what your code is doing, and understand the hardware it's running on. Then it becomes fairly simple to determine what you need to do to improve performance of your code. That's really the only truly general piece of advice I can think of.
Well, that, and "Show the code on SO and ask for optimization advice for that specific piece of code".

If better hardware is an option then definitely go for that. Otherwise
Check you are using the best compiler and linker options.
If hotspot routine in different library to frequent caller, consider moving or cloning it to the callers module. Eliminates some of the call overhead and may improve cache hits (cf how AIX links strcpy() statically into separately linked shared objects). This could of course decrease cache hits also, which is why one measure.
See if there is any possibility of using a specialized version of the hotspot routine. Downside is more than one version to maintain.
Look at the assembler. If you think it could be better, consider why the compiler did not figure this out, and how you could help the compiler.
Consider: are you really using the best algorithm? Is it the best algorithm for your input size?

The google way is one option "Cache it.. Whenever possible don't touch the disk"

Here are some quick and dirty optimization techniques I use. I consider this to be a 'first pass' optimization.
Learn where the time is spent Find out exactly what is taking the time. Is it file IO? Is it CPU time? Is it the network? Is it the Database? It's useless to optimize for IO if that's not the bottleneck.
Know Your Environment Knowing where to optimize typically depends on the development environment. In VB6, for example, passing by reference is slower than passing by value, but in C and C++, by reference is vastly faster. In C, it is reasonable to try something and do something different if a return code indicates a failure, while in Dot Net, catching exceptions are much slower than checking for a valid condition before attempting.
Indexes Build indexes on frequently queried database fields. You can almost always trade space for speed.
Avoid lookups Inside of the loop to be optimized, I avoid having to do any lookups. Find the offset and/or index outside of the loop and reuse the data inside.
Minimize IO try to design in a manner that reduces the number of times you have to read or write especially over a networked connection
Reduce Abstractions The more layers of abstraction the code has to work through, the slower it is. Inside the critical loop, reduce abstractions (e.g. reveal lower-level methods that avoid extra code)
Spawn Threads for projects with a user interface, spawning a new thread to preform slower tasks makes the application feel more responsive, although isn't.
Pre-process You can generally trade space for speed. If there are calculations or other intense operations, see if you can precompute some of the information before you're in the critical loop.

If you have a lot of highly parallel floating point math-especially single-precision-try offloading it to a graphics processor (if one is present) using OpenCL or (for NVidia chips) CUDA. GPUs have immense floating point computing power in their shaders, which is much greater than that of a CPU.

Adding this answer since I didnt see it included in all the others.
Minimize implicit conversion between types and sign:
This applies to C/C++ at least, Even if you already think you're free of conversions - sometimes its good to test adding compiler warnings around functions that require performance, especially watch-out for conversions within loops.
GCC spesific: You can test this by adding some verbose pragmas around your code,
#ifdef __GNUC__
# pragma GCC diagnostic push
# pragma GCC diagnostic error "-Wsign-conversion"
# pragma GCC diagnostic error "-Wdouble-promotion"
# pragma GCC diagnostic error "-Wsign-compare"
# pragma GCC diagnostic error "-Wconversion"
#endif
/* your code */
#ifdef __GNUC__
# pragma GCC diagnostic pop
#endif
I've seen cases where you can get a few percent speedup by reducing conversions raised by warnings like this.
In some cases I have a header with strict warnings that I keep included to prevent accidental conversions, however this is a trade-off since you may end up adding a lot of casts to quiet intentional conversions which may just make the code more cluttered for minimal gains.

Sometimes changing the layout of your data can help. In C, you might switch from an array or structures to a structure of arrays, or vice versa.

Tweak the OS and framework.
It may sound an overkill but think about it like this: Operating Systems and Frameworks are designed to do many things. Your application only does very specific things. If you could get the OS do to exactly what your application needs and have your application understand how the the framework (php,.net,java) works, you could get much better out of your hardware.
Facebook, for example, changed some kernel level thingys in Linux, changed how memcached works (for example they wrote a memcached proxy, and used udp instead of tcp).
Another example for this is Window2008. Win2K8 has a version were you can install just the basic OS needed to run X applicaions (e.g. Web-Apps, Server Apps). This reduces much of the overhead that the OS have on running processes and gives you better performance.
Of course, you should always throw in more hardware as the first step...

Does coding towards an interface rather then an implementation imply a performance hit?

In day to day programs I wouldn't even bother thinking about the possible performance hit for coding against interfaces rather than implementations. The advantages largely outweigh the cost. So please no generic advice on good OOP.
Nevertheless in this post, the designer of the XNA (game) platform gives as his main argument to not have designed his framework's core classes against an interface that it would imply a performance hit. Seeing it is in the context of a game development where every fps possibly counts, I think it is a valid question to ask yourself.
Does anybody have any stats on that? I don't see a good way to test/measure this as don't know what implications I should bear in mind with such a game (graphics) object.

Coding to an interface is always going to be easier, simply because interfaces, if done right, are much simpler. Its palpably easier to write a correct program using an interface.
And as the old maxim goes, its easier to make a correct program run fast than to make a fast program run correctly.
So program to the interface, get everything working and then do some profiling to help you meet whatever performance requirements you may have.

What Things Cost in Managed Code
"There does not appear to be a significant difference in the raw cost of a static call, instance call, virtual call, or interface call."
It depends on how much of your code gets inlined or not at compile time, which can increase performance ~5x.
It also takes longer to code to interfaces, because you have to code the contract(interface) and then the concrete implementation.
But doing things the right way always takes longer.

First I'd say that the common conception is that programmers time is usually more important, and working against implementation will probably force much more work when the implementation changes.
Second with proper compiler/Jit I would assume that working with interface takes a ridiculously small amount of extra time compared to working against the implementation itself.
Moreover, techniques like templates can remove the interface code from running.
Third to quote Knuth : "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
So I'd suggest coding well first, and only if you are sure that there is a problem with the Interface, only then I would consider changing.
Also I would assume that if this performance hit was true, most games wouldn't have used an OOP approach with C++, but this is not the case, this Article elaborates a bit about it.
It's hard to talk about tests in a general form, naturally a bad program may spend a lot of time on bad interfaces, but I doubt if this is true for all programs, so you really should look at each particular program.

Interfaces generally imply a few hits to performance (this however may change depending on the language/runtime used):
Interface methods are usually implemented via a virtual call by the compiler. As another user points out, these can not be inlined by the compiler so you lose that potential gain. Additionally, they add a few instructions (jumps and memory access) at a minimum to get the proper PC in the code segment.
Interfaces, in a number of languages, also imply a graph and require a DAG (directed acyclic graph) to properly manage memory. In various languages/runtimes you can actually get a memory 'leak' in the managed environment by having a cyclic graph. This imposes great stress (obviously) on the garbage collector/memory in the system. Watch out for cyclic graphs!
Some languages use a COM style interface as their underlying interface, automatically calling AddRef/Release whenever the interface is assigned to a local, or passed by value to a function (used for life cycle management). These AddRef/Release calls can add up and be quite costly. Some languages have accounted for this and may allow you to pass an interface as 'const' which will not generate the AddRef/Release pair automatically cutting down on these calls.
Here is a small example of a cyclic graph where 2 interfaces reference each other and neither will automatically be collected as their refcounts will always be greater than 1.
interface Parent {
Child c;
}
interface Child {
Parent p;
}
function createGraph() {
...
Parent p = ParentFactory::CreateParent();
Child c = ChildFactory::CreateChild();
p.c = c;
c.p = p;
... // do stuff here
// p has a reference to c and c has a reference to p.
// When the function goes out of scope and attempts to clean up the locals
// it will note that p has a refcount of 1 and c has a refcount of 1 so neither
// can be cleaned up (of course, this is depending on the language/runtime and
// if DAGS are allowed for interfaces). If you were to set c.p = null or
// p.c = null then the 2 interfaces will be released when the scope is cleaned up.
}

I think object lifetime and the number of instances you're creating will provide a coarse-grain answer.
If you're talking about something which will have thousands of instances, with short lifetimes, I would guess that's probably better done with a struct rather than a class, let alone a class implementing an interface.
For something more component-like, with low numbers of instances and moderate-to-long lifetime, I can't imagine it's going to make much difference.

IMO yes, but for a fundamental design reason far more subtle and complex than virtual dispatch or COM-like interface queries or object metadata required for runtime type information or anything like that. There is overhead associated with all of that but it depends a lot on the language and compiler(s) used, and also depends on whether the optimizer can eliminate such overhead at compile-time or link-time. Yet in my opinion there's a broader conceptual reason why coding to an interface implies (not guarantees) a performance hit:
Coding to an interface implies that there is a barrier between you and
the concrete data/memory you want to access and transform.
This is the primary reason I see. As a very simple example, let's say you have an abstract image interface. It fully abstracts away its concrete details like its pixel format. The problem here is that often the most efficient image operations need those concrete details. We can't implement our custom image filter with efficient SIMD instructions, for example, if we had to getPixel one at a time and setPixel one at a time and while oblivious to the underlying pixel format.
Of course the abstract image could try to provide all these operations, and those operations could be implemented very efficiently since they have access to the private, internal details of the concrete image which implements that interface, but that only holds up as long as the image interface provides everything the client would ever want to do with an image.
Often at some point an interface cannot hope to provide every function imaginable to the entire world, and so such interfaces, when faced with performance-critical concerns while simultaneously needing to fulfill a wide range of needs, will often leak their concrete details. The abstract image might still provide, say, a pointer to its underlying pixels through a pixels() method which largely defeats a lot of the purpose of coding to an interface, but often becomes a necessity in the most performance-critical areas.
Just in general a lot of the most efficient code often has to be written against very concrete details at some level, like code written specifically for single-precision floating-point, code written specifically for 32-bit RGBA images, code written specifically for GPU, specifically for AVX-512, specifically for mobile hardware, etc. So there's a fundamental barrier, at least with the tools we have so far, where we cannot abstract that all away and just code to an interface without an implied penalty.
Of course our lives would become so much easier if we could just write code, oblivious to all such concrete details like whether we're dealing with 32-bit SPFP or 64-bit DPFP, whether we're writing shaders on a limited mobile device or a high-end desktop, and have all of it be the most competitively efficient code out there. But we're far from that stage. Our current tools still often require us to write our performance-critical code against concrete details.
And lastly this is kind of an issue of granularity. Naturally if we have to work with things on a pixel-by-pixel basis, then any attempts to abstract away concrete details of a pixel could lead to a major performance penalty. But if we're expressing things at the image level like, "alpha blend these two images together", that could be a very negligible cost even if there's virtual dispatch overhead and so forth. So as we work towards higher-level code, often any implied performance penalty of coding to an interface diminishes to a point of becoming completely trivial. But there's always that need for the low-level code which does do things like process things on a pixel-by-pixel basis, looping through millions of them many times per frame, and there the cost of coding to an interface can carry a pretty substantial penalty, if only because it's hiding the concrete details necessary to write the most efficient implementation.

In my personal opinion, all the really heavy lifting when it comes to graphics is passed on to the GPU anwyay. These frees up your CPU to do other things like program flow and logic. I am not sure if there is a performance hit when programming to an interface but thinking about the nature of games, they are not something that needs to be extendable. Maybe certain classes but on the whole I wouldn't think that a game needs to programmed with extensibility in mind. So go ahead, code the implementation.

it would imply a performance hit
The designer should be able to prove his opinion.

What Simple Changes Made the Biggest Improvements to Your Delphi Programs [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I have a Delphi 2009 program that handles a lot of data and needs to be as fast as possible and not use too much memory.
What small simple changes have you made to your Delphi code that had the biggest impact on the performance of your program by noticeably reducing execution time or memory use?
Thanks everyone for all your answers. Many great tips.
For completeness, I'll post a few important articles on Delphi optimization that I found.
Before you start optimizing Delphi code at About.com
Speed and Size: Top 10 Tricks also at About.com
Code Optimization Fundamentals and Delphi Optimization Guidelines at High Performance Delphi, relating to Delphi 7 but still very pertinent.

.BeginUpdate;
.EndUpdate;
;)

Use a Delphi Profiling tool (Some here or here) and discover your own bottle necks. Optimizing the wrong bottlenecks is a waste of time. In other words, if you apply all of these suggestions here, but ignore the fact someone put a sleep(1000) (or similar) in some very important code is a waste of your time. Fix your actual bottlenecks first.

Stop using TStringList for everything.
TStringList is not a general purpose datastructure for effective storage and handling of everything from simple to complex types. Look for alternatives. I use Delphi Container and Algorithm Library (DeCAL, formerly known as SDL). Julians EZDSL should also be a good alternative.

Pre-allocating lists and arrays, rather than growing them with each iteration.
This has probably had the biggest impact for me in terms of speed.

If you need to use Application.processmesssages (or similar) in a loop, try calling it only every Nth iteration.
Similarly, if updating a progressbar, don't update it every iteration. Instead, increment it by x units every x iterations, or scale the updates according to time or as a percentage of overall task length.

FastMM
FastCode (lib)
Use high performance data structures, like hash table (etc). Many places it is faster to make one loop which makes lookup hash table for your data. Uses quite lot of memory but it surely is fast. (this maybe is most important one, but 2 first are dead simple and need very little of effort to do)

Reduce disk operations. If there's enough memory, load the file entirely to RAM and do all operations in memory.

Consider the careful use of threads. If you are not using threads now, then consider adding a couple. If you are, make sure you are not using too many. If you are running on a Dual or Quad core computer (which most are any more) then proper thread tuning is very important.
You could look at OmniThread Library by Gabr, but there are a number of thread libraries in development for Delphi. You could easily implement your own parallel for using anonymous types.

Before you do anything, identify slow parts. Do not touch working code which performs fast enough.

The biggest improvement came when I started using AsyncCalls to convert single-threaded applications that used to freeze up the UI, into (sort of) multi-threaded apps.
Although AsyncCalls can do a lot more, I've found it useful for this very simple purpose. Let's say you have a subroutine blocked like this: Disable Button, Do Work, Enable Button.
You move the 'Do Work' part to a local function (call it AsyncDoWork), and add four lines of code:
var a: IAsyncCall;
a := LocalAsyncCall(#AsyncDoWork);
while (NOT a.Finished) do
application.ProcessMessages;
a.Sync;
What this does for you is run AsyncDoWork in a separate thread, while your main thread remains available to respond to the UI (like dragging the window or clicking Abort.) When AsyncDoWork is finished the code continues. Because I moved it to a local function, all local vars are available, an the code does not need to be changed.
This is a very limited type of 'multi-threading'. Specifically, it's dual threading. You must ensure that your Async function and the UI do not both access the same VCL components or data structures. (I disable all controls except the stop button.)
I don't use this to write new programs. It's just a really quick & easy way to make old programs more responsive.

When working with a tstringlist (or similar), set "sorted := false" until needed (if at all). Seems like a no-brainer...

Create unit tests
Verify tests all pass
Profile your application
Refactor looking for bottlenecks and memory
Repeat from Step 2 (comparing to previous pass)

Make intelligent use of SetLength() for strings and arrays. Optimise initialisation with FillChar or ZeroMemory.
Local variables created on stack (e.g. record types) are faster than heap allocated (objects and New()) variables.
Reuse objects rather than Destroy then create. But make sure management code for this is faster than memory manager!

Check heavily-used loops for calculations that could be (at least partially) pre-calculated or handled with a lookup table. Trig functions are a classic for this, but it applies to many others.

If you have a list, use a dynamic array of anything, even a record as follows:
This needs no classes, no freeing and access to it is very fast. Even if it needs to grow you can do this - see below. Only use TList or TStringList if you need lots of size changing flexibility.
type
TMyRec = record
SomeString : string;
SomeValue : double;
end;
var
Data : array of TMyRec;
I : integer;
..begin
SetLength( Data, 100 ); // defines the length and CLEARS ALL DATA
Data[32].SomeString := 'Hello';
ShowMessage( Data[32] );
// Grow the list by 1 item.
I := Length( Data );
SetLength( Data, I+1 );
..end;

Separating the program logic from user interface, refactoring, then optimizing the most-used, most resource-intensive elements independently.

Turn debugging OFF
Turn optimizations ON
Remove all references to units that
you don't actually use
Look for memory leaks

Use a lot of assertions to debug, then turn them off in shipping code.

Turn off range and overflow checking after you have tested extensively.

If you really, really, really need to be light weight then you can shed the VCL. Take a look at the KOL & MCK. Granted if you do that then you are trading features for reduced footprint.

Use the full FastMM and study the documentation and source and see if you can tweak it to your specifications.

For an old BDE development when I first started Delphi, I was using lots of TQuery components. Someone told me to use TTable master-detail after I explained him what I was doing, and that made the program run much faster.
Calling DisableControls can omit unnecessary UI updates.

When identifying records, use integers if at all possible for record comparison. While a primary key of "company name" might seem logical, the time spent generating and storing a hash of this will greatly improve overall search times.

You might consider using runtime packages. This could reduce your memory foot print if there are more then one program running that is written using the same packages.

If you use threads, set their processor affinity. If you don't use threads yet, consider using them, or look into asynchronous I/O (completion ports) if your application does lots of I/O.

Consider if a DBMS database is really the perfect choice. If you are only reading data and never changing it, then a flat fixed record file could work faster, especially if the path to the data can be easily mapped (ie, one index). A trivial binary search on a fixed record file is still extremely fast.

BeginUpdate ... EndUpdate
ShortString vs. String
Use arrays instead of TStrings and TList
But the sad answer is that tuning and optimization will give you maybe 10% improvement (and it's dangerous); re-design can give you 90%. Once you really understand the goal, you often can restate the problem (and therefore the solution) in much better terms.
Cheers

Examine all loops, and look for ways to short circuit. If your looking for something specific and find it in a loop, then use the BREAK command to immediately bail...no sense looping thru the rest. If you know that you don't have a match, then use a CONTINUE as quickly as possible.

Take advantage of some of the FastCode project code. Parts of it were incorporated into VCL/RTL proper (like FastMM was), but there is more out there you can use!
Note, they have a new site they are moving too, but it seems to be a bit inactive.

Consider hardware issues. If you really need performance then consider the type of hard drive(s) your program and your databases are running on. There are a lot of variables especially if you are running a database. RAID is not always the best answer either.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio