I/O performance - async vs TPL vs Dataflow vs RX

I/O performance - async vs TPL vs Dataflow vs RX - task-parallel-library

I have a piece of C# 5.0 code that generates a ton of network and disk I/O. I need to run multiple copies of this code in parallel. Which of the following technologies is likely to give me the best performance:
async methods with await
directly use Task from TPL
the TPL Dataflow nuget
Reactive Extensions
I'm not very good at this parallel stuff, but if using a lower lever, like say Thread, can give me a lot better performance I'd consider that too.

This is like trying to optimize the length of your transatlantic flight by asking the quickest method to remove your seatbelt.
Ok, some real advice, since I was kind of a jerk
Let's give a helpful answer. Think of performance as in "Classes" of activities - each one is an order of magnitude slower (at least!):
Only accessing the CPU, very little memory usage (i.e. rendering very simple graphics to a very fast GPU, or calculating digits of Pi)
Only accessing CPU and in-memory things, nothing on disk (i.e. a well-written game)
Accessing the disk
Accessing the network.
If you do even one of activity #3, there's no point in doing optimizations typical to activities #1 and #2 like optimizing threading libraries - they're completely overshadowed by the disk hit. Same for CPU tricks - if you're constantly incurring L2/L3 cache misses, sparing a few CPU cycles by hand-writing assembly isn't worth it (which is why things like loop unrolling are usually a bad idea these days).
So, what can we derive from this? There are two ways to make your program faster, either move up from #3 to #2 (which isn't often possible, depending on what you're doing), or by doing less I/O. I/O and network speed is the rate-limiting factor in most modern applications, and that's what you should be trying to optimize.

Any performance difference between these options would be inconsequential in the face of "a ton of network and disk I/O".
A better question to ask is "which option is easiest to learn and develop with?" Or "which option would be best to maintain this code with five years from now?" And for that I would suggest async first, or Dataflow or Rx if your logic is better represented as a stream.

It's an older question, but for anyone reading this...
It depends. If you try to saturate 1Gbps link with 50B messages, you will be CPU bound even with simple non-blocking send over raw sockets. If, on the other hand, you are happy with 1Mbps throughput or your messages are larger than 10KB, any of these frameworks will do the job.
For low-bandwidth situations, I would recommend to prioritize by ease of use, i.e. async/await, Dataflow, Rx, TPL in this order. Note that high-bandwidth application should be prototyped as if it is low-bandwidth and optimized later.
For true high-bandwidth application, I can recommend Dataflow over Rx, because Rx is not designed for high concurrency. Raw TPL is the bottom layer, which guarantees the lowest overhead if you can handle the complexity. If you can make efficient use of dedicated threads, then that would be even faster. Async/await vs. Dataflow IMO doesn't make any performance difference. The overhead seems comparable, so choose one that's a better fit.

Related

Elixir immutability in a game context

I'm aware that, in order to ensure that all threads reading a memory access read the exact same value, Elixir never overwrites an address in use. Instead, if a var is changed, it's written in a new address.
What I want to know is how that would affect real time games. For instance, moving in a 3D game would generate a huge number of different values needing to be newly allocated and the old values to be released in a timely manner. How better or worse is this, for a game, compared to simply rewriting the values in memory as needed?

This is a very generic question. Like very generic.
The first and foremost, BEAM (Elixir and Erlang VM) is prioritizing throughput and predictable responsiveness over latency, but in the games the latency is a king, more latency you have, less FPS there will be.
Second, BEAM was designed primarily for fault-tolerance and concurrency not performance, so performance-wise C/C++ will be faster on doing direct memory access and computations.
In general, there are advantages in immutable data structures (safe concurrency, simpler reasoning about programs, less headache during debugging and simpler algorithms in general: i.e. it is much easier construct new RB tree than implement concurrent deletion from it) at cost of raw performance.

What to avoid for performance reasons in multithreaded code?

I'm currently reviewing/refactoring a multithreaded application which is supposed to be multithreaded in order to be able to use all the available cores and theoretically deliver a better / superior performance (superior is the commercial term for better :P)
What are the things I should be aware when programming multithreaded applications?
I mean things that will greatly impact performance, maybe even to the point where you don't gain anything with multithreading at all but lose a lot by design complexity. What are the big red flags for multithreading applications?
Should I start questioning the locks and looking to a lock-free strategy or are there other points more important that should light a warning light?
Edit: The kind of answers I'd like are similar to the answer by Janusz, I want red warnings to look up in code, I know the application doesn't perform as well as it should, I need to know where to start looking, what should worry me and where should I put my efforts. I know it's kind of a general question but I can't post the entire program and if I could choose one section of code then I wouldn't be needing to ask in the first place.
I'm using Delphi 7, although the application will be ported / remake in .NET (c#) for the next year so I'd rather hear comments that are applicable as a general practice, and if they must be specific to either one of those languages

One thing to definitely avoid is lots of write access to the same cache lines from threads.
For example: If you use a counter variable to count the number of items processed by all threads, this will really hurt performance because the CPU cache lines have to synchronize whenever the other CPU writes to the variable.

One thing that decreases performance is having two threads with much hard drive access. The hard drive would jump from providing data for one thread to the other and both threads would wait for the disk all the time.

Something to keep in mind when locking: lock for as short a time as possible. For example, instead of this:
lock(syncObject)
{
bool value = askSomeSharedResourceForSomeValue();
if (value)
DoSomethingIfTrue();
else
DoSomtehingIfFalse();
}
Do this (if possible):
bool value = false;
lock(syncObject)
{
value = askSomeSharedResourceForSomeValue();
}
if (value)
DoSomethingIfTrue();
else
DoSomtehingIfFalse();
Of course, this example only works if DoSomethingIfTrue() and DoSomethingIfFalse() don't require synchronization, but it illustrates this point: locking for as short a time as possible, while maybe not always improving your performance, will improve the safety of your code in that it reduces surface area for synchronization problems.
And in certain cases, it will improve performance. Staying locked for long lengths of time means that other threads waiting for access to some resource are going to be waiting longer.

More threads then there are cores, typically means that the program is not performing optimally.
So a program which spawns loads of threads usually is not designed in the best fashion. A good example of this practice are the classic Socket examples where every incoming connection got it's own thread to handle of the connection. It is a very non scalable way to do things. The more threads there are, the more time the OS will have to use for context switching between threads.

You should first be familiar with Amdahl's law.
If you are using Java, I recommend the book Java Concurrency in Practice; however, most of its help is specific to the Java language (Java 5 or later).
In general, reducing the amount of shared memory increases the amount of parallelism possible, and for performance that should be a major consideration.
Threading with GUI's is another thing to be aware of, but it looks like it is not relevant for this particular problem.

What kills performance is when two or more threads share the same resources. This could be an object that both use, or a file that both use, a network both use or a processor that both use. You cannot avoid these dependencies on shared resources but if possible, try to avoid sharing resources.

Run-time profilers may not work well with a multi-threaded application. Still, anything that makes a single-threaded application slow will also make a multi-threaded application slow. It may be an idea to run your application as a single-threaded application, and use a profiler, to find out where its performance hotspots (bottlenecks) are.
When it's running as a multi-threaded aplication, you can use the system's performance-monitoring tool to see whether locks are a problem. Assuming that your threads would lock instead of busy-wait, then having 100% CPU for several threads is a sign that locking isn't a problem. Conversely, something that looks like 50% total CPU utilitization on a dual-processor machine is a sign that only one thread is running, and so maybe your locking is a problem that's preventing more than one concurrent thread (when counting the number of CPUs in your machine, beware multi-core and hyperthreading).
Locks aren't only in your code but also in the APIs you use: e.g. the heap manager (whenever you allocate and delete memory), maybe in your logger implementation, maybe in some of the O/S APIs, etc.
Should I start questioning the locks and looking to a lock-free strategy
I always question the locks, but have never used a lock-free strategy; instead my ambition is to use locks where necessary, so that it's always threadsafe but will never deadlock, and to ensure that locks are acquired for a tiny amount of time (e.g. for no more than the amount of time it takes to push or pop a pointer on a thread-safe queue), so that the maximum amount of time that a thread may be blocked is insignificant compared to the time it spends doing useful work.

You don't mention the language you're using, so I'll make a general statement on locking. Locking is fairly expensive, especially the naive locking that is native to many languages. In many cases you are reading a shared variable (as opposed to writing). Reading is threadsafe as long as it is not taking place simultaneously with a write. However, you still have to lock it down. The most naive form of this locking is to treat the read and the write as the same type of operation, restricting access to the shared variable from other reads as well as writes. A read/writer lock can dramatically improve performance. One writer, infinite readers. On an app I've worked on, I saw a 35% performance improvement when switching to this construct. If you are working in .NET, the correct lock is the ReaderWriterLockSlim.

I recommend looking into running multiple processes rather than multiple threads within the same process, if it is a server application.
The benefit of dividing the work between several processes on one machine is that it is easy to increase the number of servers when more performance is needed than a single server can deliver.
You also reduce the risks involved with complex multithreaded applications where deadlocks, bottlenecks etc reduce the total performance.
There are commercial frameworks that simplifies server software development when it comes to load balancing and distributed queue processing, but developing your own load sharing infrastructure is not that complicated compared with what you will encounter in general in a multi-threaded application.

I'm using Delphi 7
You might be using COM objects, then, explicitly or implicitly; if you are, COM objects have their own complications and restrictions on threading: Processes, Threads, and Apartments.

You should first get a tool to monitor threads specific to your language, framework and IDE. Your own logger might do fine too (Resume Time, Sleep Time + Duration). From there you can check for bad performing threads that don't execute much or are waiting too long for something to happen, you might want to make the event they are waiting for to occur as early as possible.
As you want to use both cores you should check the usage of the cores with a tool that can graph the processor usage on both cores for your application only, or just make sure your computer is as idle as possible.
Besides that you should profile your application just to make sure that the things performed within the threads are efficient, but watch out for premature optimization. No sense to optimize your multiprocessing if the threads themselves are performing bad.
Looking for a lock-free strategy can help a lot, but it is not always possible to get your application to perform in a lock-free way.

Threads don't equal performance, always.
Things are a lot better in certain operating systems as opposed to others, but if you can have something sleep or relinquish its time until it's signaled...or not start a new process for virtually everything, you're saving yourself from bogging the application down in context switching.

When does Erlang's parallelism overcome its weaknesses in numeric computing?

With all the hype around parallel computing lately, I've been thinking a lot about parallelism, number crunching, clusters, etc...
I started reading Learn You Some Erlang. As more people are learning (myself included), Erlang handles concurrency in a very impressive, elegant way.
Then the author asserts that Erlang is not ideal for number crunching. I can understand that a language like Erlang would be slower than C, but the model for concurrency seems ideally suited to things like image handling or matrix multiplication, even though the author specifically says its not.
Is it really that bad? Is there a tipping point where Erlang's strength overcomes its local speed weakness? Are/what measures are being taken to deal with speed?
To be clear: I'm not trying to start a debate; I just want to know.

It's a mistake to think of parallelism as only about raw number crunching power. Erlang is closer to the way a cluster computer works than, say, a GPU or classic supercomputer.
In modern GPUs and old-style supercomputers, performance is all about vectorized arithmetic, special-purpose calculation hardware, and low-latency communication between processing units. Because communication latency is low and each individual computing unit is very fast, the ideal usage pattern is to load the machine's RAM up with data and have it crunch it all at once. This processing might involve lots of data passing among the nodes, as happens in image processing or 3D, where there are lots of CPU-bound tasks to do to transform the data from input form to output form. This type of machine is a poor choice when you frequently have to go to a disk, network, or some other slow I/O channel for data. This idles at least one expensive, specialized processor, and probably also chokes the data processing pipeline so nothing else gets done, either.
If your program requires heavy use of slow I/O channels, a better type of machine is one with many cheap independent processors, like a cluster. You can run Erlang on a single machine, in which case you get something like a cluster within that machine, or you can easily run it on an actual hardware cluster, in which case you have a cluster of clusters. Here, communication overhead still idles processing units, but because you have many processing units running on each bit of computing hardware, Erlang can switch to one of the other processes instantaneously. If it happens that an entire machine is sitting there waiting on I/O, you still have the other nodes in the hardware cluster that can operate independently. This model only breaks down when the communication overhead is so high that every node is waiting on some other node, or for general I/O, in which case you either need faster I/O or more nodes, both of which Erlang naturally takes advantage of.
Communication and control systems are ideal applications of Erlang because each individual processing task takes little CPU and only occasionally needs to communicate with other processing nodes. Most of the time, each process is operating independently, each taking a tiny fraction of the CPU power. The most important thing here is the ability to handle many thousands of these efficiently.
The classic case where you absolutely need a classic supercomputer is weather prediction. Here, you divide the atmosphere up into cubes and do physics simulations to find out what happens in each cube, but you can't use a cluster because air moves between each cube, so each cube is constantly communicating with its 6 adjacent neighbors. (Air doesn't go through the edges or corners of a cube, being infinitely fine, so it doesn't talk to the other 20 neighboring cubes.) Run this on a cluster, whether running Erlang on it or some other system, and it instantly becomes I/O bound.

Is there a tipping point where Erlang's strength overcomes its local speed weakness?
Well, of course there is. For example, when trying to find the median of a trillion numbers :) :
http://matpalm.com/median/question.html
Just before you posted, I happened to notice this was the number 1 post on erlang.reddit.com.

Almost any language can be parallelized. In some languages it's simple, in others it's a pain in the butt, but it can be done. If you want to run a C++ program across 8000 CPU's in a grid, go ahead! You can do that. It's been done before.
Erlang doesn't do anything that's impossible in other languages. If a single CPU running an Erlang program is less efficient than the same CPU running a C++ program, then two hundred CPU's running Erlang will also be slower than two hundred CPU's running C++.
What Erlang does do is making this kind of parallelism easy to work with. It saves developer time and reduces the chance of bugs.
So I'm going to say no, there is no tipping point at which Erlang's parallelism allows it to outperform another language's numerical number-crunching strength.
Where Erlang scores is in making it easier to scale out and do so correctly. But it can still be done in other languages which are better at number-crunching, if you're willing to spend the extra development time.
And of course, let's not forget the good old point that languages don't have a speed.
A sufficiently good Erlang compiler would yield perfectly optimal code. A sufficiently bad C compiler would yield code that runs slower than anything else.

There is pressure to make Erlang execute numeric code faster. The HiPe compiler compiles to native code instead of the BEAM bytecode for example, and it probably has its most effective optimization on code on floating points where it can avoid boxing. This is very beneficial for floating point code, since it can store values directly in FPU registers.
For the majority of Erlang usage, Erlang is plenty fast as it is. They use Erlang to write always-up control systems where the most important speed measurement that matters is low latency responses. Performance under load tends to be IO-bound. These users tend to stay away from HiPe since it is not as flexible/malleable in debugging live systems.
Now that servers with 128Gb of RAM are not that uncommon, and there's no reason they'll get even more memory, some IO-bound problems might shift over to be somewhat CPU bound. That could be a driver.
You should follow HiPe for the development.
Your examples of image manipulations and matrix multiplications seem to me as very bad matches for Erlang though. Those are examples that benefit from vector/SIMD operations. Erlang is not good at parallellism (where one does the same thing to multiple values at once).
Erlang processes are MIMD, multiple instructions multiple data. Erlang does lots of branching behind pattern matching and recursive loops. That kills CPU instruction pipelining.
The best architecture for heavily parallellised problems are the GPUs. For programming GPUs in a functional language I see the best potential in using Haskell for creating programs targeting them. A GPU is basically a pure function from input data to output data. See the Lava project in Haskell for creating FPGA circuits, if it is possible to create circuits so cleanly in Haskell, it can't be harder to create program data for GPUs.
The Cell architecture is very nice for vectorizable problems as well.

I think the broader need is to point out that parallelism is not necessarily or even typically about speed.
It is about how to express algorithms or programs in which the sequence of activities is partial-ordered.

MPI for multicore?

With the recent buzz on multicore programming is anyone exploring the possibilities of using MPI ?

I've used MPI extensively on large clusters with multi-core nodes. I'm not sure if it's the right thing for a single multi-core box, but if you anticipate that your code may one day scale larger than a single chip, you might consider implementing it in MPI. Right now, nothing scales larger than MPI. I'm not sure where the posters who mention unacceptable overheads are coming from, but I've tried to give an overview of the relevant tradeoffs below. Read on for more.
MPI is the de-facto standard for large-scale scientific computation and it's in wide use on multicore machines already. It is very fast. Take a look at the most recent Top 500 list. The top machines on that list have, in some cases, hundreds of thousands of processors, with multi-socket dual- and quad-core nodes. Many of these machines have very fast custom networks (Torus, Mesh, Tree, etc) and optimized MPI implementations that are aware of the hardware.
If you want to use MPI with a single-chip multi-core machine, it will work fine. In fact, recent versions of Mac OS X come with OpenMPI pre-installed, and you can download an install OpenMPI pretty painlessly on an ordinary multi-core Linux machine. OpenMPI is in use at Los Alamos on most of their systems. Livermore uses mvapich on their Linux clusters. What you should keep in mind before diving in is that MPI was designed for solving large-scale scientific problems on distributed-memory systems. The multi-core boxes you are dealing with probably have shared memory.
OpenMPI and other implementations use shared memory for local message passing by default, so you don't have to worry about network overhead when you're passing messages to local processes. It's pretty transparent, and I'm not sure where other posters are getting their concerns about high overhead. The caveat is that MPI is not the easiest thing you could use to get parallelism on a single multi-core box. In MPI, all the message passing is explicit. It has been called the "assembly language" of parallel programming for this reason. Explicit communication between processes isn't easy if you're not an experienced HPC person, and there are other paradigms more suited for shared memory (UPC, OpenMP, and nice languages like Erlang to name a few) that you might try first.
My advice is to go with MPI if you anticipate writing a parallel application that may need more than a single machine to solve. You'll be able to test and run fine with a regular multi-core box, and migrating to a cluster will be pretty painless once you get it working there. If you are writing an application that will only ever need a single machine, try something else. There are easier ways to exploit that kind of parallelism.
Finally, if you are feeling really adventurous, try MPI in conjunction with threads, OpenMP, or some other local shared-memory paradigm. You can use MPI for the distributed message passing and something else for on-node parallelism. This is where big machines are going; future machines with hundreds of thousands of processors or more are expected to have MPI implementations that scale to all nodes but not all cores, and HPC people will be forced to build hybrid applications. This isn't for the faint of heart, and there's a lot of work to be done before there's an accepted paradigm in this space.

I would have to agree with tgamblin. You'll probably have to roll your sleeves up and really dig into the code to use MPI, explicitly handling the organization of the message-passing yourself. If this is the sort of thing you like or don't mind doing, I would expect that MPI would work just as well on multicore machines as it would on a distributed cluster.
Speaking from personal experience... I coded up some C code in graduate school to do some large scale modeling of electrophysiologic models on a cluster where each node was itself a multicore machine. Therefore, there were a couple of different parallel methods I thought of to tackle the problem.
1) I could use MPI alone, treating every processor as it's own "node" even though some of them are grouped together on the same machine.
2) I could use MPI to handle data moving between multicore nodes, and then use threading (POSIX threads) within each multicore machine, where processors share memory.
For the specific mathematical problem I was working on, I tested two formulations first on a single multicore machine: one using MPI and one using POSIX threads. As it turned out, the MPI implementation was much more efficient, giving a speed-up of close to 2 for a dual-core machine as opposed to 1.3-1.4 for the threaded implementation. For the MPI code, I was able to organize operations so that processors were rarely idle, staying busy while messages were passed between them and masking much of the delay from transferring data. With the threaded code, I ended up with a lot of mutex bottlenecks that forced threads to often sit and wait while other threads completed their computations. Keeping the computational load balanced between threads didn't seem to help this fact.
This may have been specific to just the models I was working on, and the effectiveness of threading vs. MPI would likely vary greatly for other types of parallel problems. Nevertheless, I would disagree that MPI has an unwieldy overhead.

No, in my opinion it is unsuitable for most processing you would do on a multicore system. The overhead is too high, the objects you pass around must be deeply cloned, and passing large objects graphs around to then run a very small computation is very inefficient. It is really meant for sharing data between separate processes, most often running in separate memory spaces, and most often running long computations.
A multicore processor is a shared memory machine, so there are much more efficient ways to do parallel processing, that do not involve copying objects and where most of the threads run for a very small time. For example, think of a multithreaded Quicksort. The overhead of allocating memory and copying the data to a thread before it can be partioned will be much slower with MPI and an unlimited number of processors than Quicksort running on a single processor.
As an example, in Java, I would use a BlockingQueue (a shared memory construct), to pass object references between threads, with very little overhead.
Not that it does not have its place, see for example the Google search cluster that uses message passing. But it's probably not the problem you are trying to solve.

MPI is not inefficient. You need to break the problem down into chunks and pass the chunks around and reorganize when the result is finished per chunk. No one in the right mind would pass around the whole object via MPI when only a portion of the problem is being worked on per thread. Its not the inefficiency of the interface or design pattern thats the inefficiency of the programmers knowledge of how to break up a problem.
When you use a locking mechanism the overhead on the mutex does not scale well. this is due to the fact that the underlining runqueue does not know when you are going to lock the thread next. You will perform more kernel level thrashing using mutex's than a message passing design pattern.

MPI has a very large amount of overhead, primarily to handle inter-process communication and heterogeneous systems. I've used it in cases where a small amount of data is being passed around, and where the ratio of computation to data is large.
This is not the typical usage scenario for most consumer or business tasks, and in any case, as a previous reply mentioned, on a shared memory architecture like a multicore machine, there are vastly faster ways to handle it, such as memory pointers.
If you had some sort of problem with the properties describe above, and you want to be able to spread the job around to other machines, which must be on the same highspeed network as yourself, then maybe MPI could make sense. I have a hard time imagining such a scenario though.

I personally have taken up Erlang( and i like to so far). The messages based approach seem to fit most of the problem and i think that is going to be one of the key item for multi core programming. I never knew about the overhead of MPI and thanks for pointing it out

You have to decide if you want low level threading or high level threading. If you want low level then use pThread. You have to be careful that you don't introduce race conditions and make threading performance work against you.
I have used some OSS packages for (C and C++) that are scalable and optimize the task scheduling. TBB (threading building blocks) and Cilk Plus are good and easy to code and get applications of the ground. I also believe they are flexible enough integrate other thread technologies into it at a later point if needed (OpenMP etc.)
www.threadingbuildingblocks.org
www.cilkplus.org

Power Efficient Software Coding

In a typical handheld/portable embedded system device Battery life is a major concern in design of H/W, S/W and the features the device can support. From the Software programming perspective, one is aware of MIPS, Memory(Data and Program) optimized code.
I am aware of the H/W Deep sleep mode, Standby mode that are used to clock the hardware at lower Cycles or turn of the clock entirel to some unused circutis to save power, but i am looking for some ideas from that point of view:
Wherein my code is running and it needs to keep executing, given this how can I write the code "power" efficiently so as to consume minimum watts?
Are there any special programming constructs, data structures, control structures which i should look at to achieve minimum power consumption for a given functionality.
Are there any s/w high level design considerations which one should keep in mind at time of code structure design, or during low level design to make the code as power efficient(Least power consuming) as possible?

Like 1800 INFORMATION said, avoid polling; subscribe to events and wait for them to happen
Update window content only when necessary - let the system decide when to redraw it
When updating window content, ensure your code recreates as little of the invalid region as possible
With quick code the CPU goes back to deep sleep mode faster and there's a better chance that such code stays in L1 cache
Operate on small data at one time so data stays in caches as well
Ensure that your application doesn't do any unnecessary action when in background
Make your software not only power efficient, but also power aware - update graphics less often when on battery, disable animations, less hard drive thrashing
And read some other guidelines. ;)
Recently a series of posts called "Optimizing Software Applications for Power", started appearing on Intel Software Blogs. May be of some use for x86 developers.

Zeroith, use a fully static machine that can stop when idle. You can't beat zero Hz.
First up, switch to a tickless operating system scheduler. Waking up every millisecend or so wastes power. If you can't, consider slowing the scheduler interrupt instead.
Secondly, ensure your idle thread is a power save, wait for next interrupt instruction.
You can do this in the sort of under-regulated "userland" most small devices have.
Thirdly, if you have to poll or perform user confidence activities like updating the UI,
sleep, do it, and get back to sleep.
Don't trust GUI frameworks that you haven't checked for "sleep and spin" kind of code.
Especially the event timer you may be tempted to use for #2.
Block a thread on read instead of polling with select()/epoll()/ WaitForMultipleObjects().
Puts stress on the thread scheuler ( and your brain) but the devices generally do okay.
This ends up changing your high-level design a bit; it gets tidier!.
A main loop that polls all the things you Might do ends up slow and wasteful on CPU, but does guarantee performance. ( Guaranteed to be slow)
Cache results, lazily create things. Users expect the device to be slow so don't disappoint them. Less running is better. Run as little as you can get away with.
Separate threads can be killed off when you stop needing them.
Try to get more memory than you need, then you can insert into more than one hashtable and save ever searching. This is a direct tradeoff if the memory is DRAM.
Look at a realtime-ier system than you think you might need. It saves time (sic) later.
They cope better with threading too.

Do not poll. Use events and other OS primitives to wait for notifiable occurrences. Polling ensures that the CPU will stay active and use more battery life.

From my work using smart phones, the best way I have found of preserving battery life is to ensure that everything you do not need for your program to function at that specific point is disabled.
For example, only switch Bluetooth on when you need it, similarly the phone capabilities, turn the screen brightness down when it isn't needed, turn the volume down, etc.
The power used by these functions will generally far outweigh the power used by your code.

To avoid polling is a good suggestion.
A microprocessor's power consumption is roughly proportional to its clock frequency, and to the square of its supply voltage. If you have the possibility to adjust these from software, that could save some power. Also, turning off the parts of the processor that you don't need (e.g. floating-point unit) may help, but this very much depends on your platform. In any case, you need a way to measure the actual power consumption of your processor, so that you can find out what works and what not. Just like speed optimizations, power optimizations need to be carefully profiled.

Consider using the network interfaces the least you can. You might want to gather information and send it out in bursts instead of constantly send it.

Look at what your compiler generates, particularly for hot areas of code.

If you have low priority intermittent operations, don't use specific timers to wake up to deal with them, but deal with when processing other events.
Use logic to avoid stupid scenarios where your app might go to sleep for 10 ms and then have to wake up again for the next event. For the kind of platform mentioned it shouldn't matter if both events are processed at the same time.
Having your own timer & callback mechanism might be appropriate for this kind of decision making. The trade off is in code complexity and maintenance vs. likely power savings.

Simply put, do as little as possible.

Well, to the extent that your code can execute entirely in the processor cache, you'll have less bus activity and save power. To the extent that your program is small enough to fit code+data entirely in the cache, you get that benefit "for free". OTOH, if your program is too big, and you can divide your programs into modules that are more or less independent of the other, you might get some power saving by dividing it into separate programs. (I suppose it's also possible to make a toolchain that spreas out related bundles of code and data into cache-sized chunks...)
I suppose that, theoretically, you can save some amount of unnecessary work by reducing the number of pointer dereferencing, and by refactoring your jumps so that the most likely jumps are taken first -- but that's not realistic to do as a programmer.
Transmeta had the idea of letting the machine do some instruction optimization on-the-fly to save power... But that didn't seem to help enough... And look where that got them.

Set unused memory or flash to 0xFF not 0x00. This is certainly true for flash and eeprom, not sure about s or d ram. For the proms there is an inversion so a 0 is stored as a 1 and takes more energy, a 1 is stored as a zero and takes less. This is why you read 0xFFs after erasing a block.

Rather timely this, article on Hackaday today about measuring power consumption of various commands:
Hackaday: the-effect-of-code-on-power-consumption
Aside from that:
- Interrupts are your friends
- Polling / wait() aren't your friends
- Do as little as possible
- make your code as small/efficient as possible
- Turn off as many modules, pins, peripherals as possible in the micro
- Run as slowly as possible
- If the micro has settings for pin drive strengh, slew rate, etc. check them & configure them, the defaults are often full power / max speed.
- returning to the article above, go back and measure the power & see if you can drop it by altering things.

also something that is not trivial to do is reduce precision of the mathematical operations, go for the smallest dataset available and if available by your development environment pack data and aggregate operations.
knuth books could give you all the variant of specific algorithms you need to save memory or cpu, or going with reduced precision minimizing the rounding errors
also, spent some time checking for all the embedded device api - for example most symbian phones could do audio encoding via a specialized hardware

Do your work as quickly as possible, and then go to some idle state waiting for interrupts (or events) to happen. Try to make the code run out of cache with as little external memory traffic as possible.

On Linux, install powertop to see how often which piece of software wakes up the CPU. And follow the various tips that the powertop site links to, some of which are probably applicable to non-Linux, too.
http://www.lesswatts.org/projects/powertop/

Choose efficient algorithms that are quick and have small basic blocks and minimal memory accesses.
Understand the cache size and functional units of your processor.
Don't access memory. Don't use objects or garbage collection or any other high level constructs if they expands your working code or data set outside the available cache. If you know the cache size and associativity, lay out the entire working data set you will need in low power mode and fit it all into the dcache (forget some of the "proper" coding practices that scatter the data around in separate objects or data structures if that causes cache trashing). Same with all the subroutines. Put your working code set all in one module if necessary to stripe it all in the icache. If the processor has multiple levels of cache, try to fit in the lowest level of instruction or data cache possible. Don't use floating point unit or any other instructions that may power up any other optional functional units unless you can make a good case that use of these instructions significantly shortens the time that the CPU is out of sleep mode.
etc.

Don't poll, sleep
Avoid using power hungry areas of the chip when possible. For example multipliers are power hungry, if you can shift and add you can save some Joules (as long as you don't do so much shifting and adding that actually the multiplier is a win!)
If you are really serious,l get a power-aware debugger, which can correlate power usage with your source code. Like this

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio