Computing vs printing - performance

In my program, I made a few modifications to improve performance.
First, I removed some 3D point computations that were being repeated unnecessarily.
Second, I removed some print statements.
What I observe is that the second change improved performance substantially, while the first one did not help nearly as much.
Does that mean computations involving floating-point numbers are much less expensive than printing data to the console? Isn't floating-point mathematics considered to be computationally expensive?

Floating-point arithmetic is often more expensive than integer arithmetic in terms of processor cycles, the silicon area it occupies in the processor, and/or the energy it consumes. However, printing is generally much more expensive.
Typical performance for floating-point additions or multiplications might be a latency of four processor cycles, compared to one for integer additions or multiplications.
Formatting output requires many instructions. Converting numbers to decimal requires dividing, performing table lookups, or executing other algorithms. The characters generated to represent a number must be placed in a buffer. Checks must be performed to ensure that internal buffers are not overflowed. When a buffer is full, or a printing operation is complete and the data must be sent to the output device (rather than merely held in a buffer for future operations), then an operating system call must be performed to transfer the data from user memory to some input-output driver. Even simple in-buffer formatting operations may take hundreds of cycles, and printing that requires interaction with the file system or other devices may take thousands of cycles. (The actual upper limit is unbounded, since printing may have to wait for some physical device to become ready. But even if all the activity of a particular operation stays inside the computer itself, a print operation may take thousands of cycles.)
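Not a rigorous benchmark, but a quick sketch like the one below (C++, with an arbitrary iteration count and a volatile sink to keep the compiler from deleting the arithmetic) typically shows the print loop costing orders of magnitude more wall-clock time than the floating-point loop, especially when the output actually reaches a terminal:
#include <chrono>
#include <cstdio>
int main() {
    using clock = std::chrono::steady_clock;
    const int N = 100000;                   // arbitrary iteration count
    volatile double sink = 0.0;             // keeps the arithmetic from being optimised away
    auto t0 = clock::now();
    for (int i = 0; i < N; ++i)
        sink = sink + i * 1.000001;         // one floating-point multiply and add per iteration
    auto t1 = clock::now();
    for (int i = 0; i < N; ++i)
        std::printf("%d\n", i);             // formatted output to the console
    auto t2 = clock::now();
    auto us = [](auto d) { return std::chrono::duration_cast<std::chrono::microseconds>(d).count(); };
    std::fprintf(stderr, "math:  %lld us\nprint: %lld us\n",
                 (long long)us(t1 - t0), (long long)us(t2 - t1));
}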


Is generating random numbers from hardware performance cryptographically secure?

Suppose I have a program that needs an RNG.
If I were to run arbitrary operations and check the ∆t it takes to do said operations, I could generate random numbers from that.
For example:
double start = device.time();
for(int i=0;i<100;i++);//assume compiler doesn't optimize this away
double end = device.time();
double dt = end-start;
dt will be more or less random based on many variables on the device such as battery level, transistor age, room temperature, other processes running, etc.
Now, suppose I keep generating dts and multiplying them together as I go, hundreds, thousands, millions of times. Eventually I am left with a very arbitrary number derived from values that were more or less randomly produced by hardware performance benchmarking.
Every time I multiply these dts together, the number of possible outputs increases exponentially, so determining what the possible outputs may be becomes perhaps an impossible task after millions of iterations, even if each individual dt value falls within a similar range.
A thought then occurs: if you have a very consistent device, dt may always fall in a small set of values, say 0.000000011, 0.000000012, 0.000000013, 0.000000014. Then the final output, no matter how many times I iterate and multiply, will be a number of the form 0.000000011^a * 0.000000012^b * 0.000000013^c * 0.000000014^d, which is probably easy to crack.
But then I turn to hashing: suppose that rather than multiplying each dt, I concatenate it in string form onto the previous values and hash the result. So every time I generate a new dt based on the hardware's random environmental variation, I hash; at the end I digest the hash into whatever form I need. Now the final output number can't be written in a general algebraic form.
Will numbers generated in this form be cryptographically secure?
Using a clock potentially leaks information to an adversary. Using a microphone does too -- the adversary may have planted a bug and be hearing the same input. Best not to rely on any single source but to combine entropy inputs from multiple sources, both external to your computer and internal. By all means use internal OS entropy sources, such as /dev/urandom, but use other sources as well.
It might be worth reading the description of the Fortuna CSPRNG for ideas.
If you take enough samples, and use few enough low bits of the time difference, then maybe timing of async interrupts could eventually add up to a useful amount of entropy.
Most OS kernels will collect entropy from timing in their own interrupt handlers, as part of the source for /dev/urandom, but if you really want to roll your own instead of asking the OS for randomness, it's plausible if you're very careful with your mixing function. e.g. have a look at what the Linux kernel uses for mixing in new data into its entropy pool. It has to avoid being "hurt" by sources that aren't actually random on a given system.
Other than interrupts, performance over short times is nearly deterministic, and CPU frequency variations are quantized into not that many different frequencies.
based on many variables on the device such as ...
battery level: maybe a 2-state effect like limiting max turbo when not on AC power, and/or when the battery is low.
transistor age: no. At most an indirect effect if aged transistors use more power, leading to the CPU running hotter and dropping out of max turbo sooner. I'm not sure there's any significant effect.
room temperature: again, only possibly reducing max clock speed sooner. Unless you're wasting multiple seconds of CPU time on this, it won't have an effect even on lightweight laptops. Desktops typically have enough cooling to sustain max turbo indefinitely on a single core, especially for simple scalar code. (SIMD FMA would make a lot more heat.)
other processes running: yes, that and async interrupts that happen to come inside your timed intervals will be the main source of randomness.
Most of the factors that affect clock speed will just uniformly scale up all times, correlated between samples, without adding more entropy. Clock frequency doesn't change that often; after ramping up to full speed for your benchmark loops, expect it to stay constant for multiple seconds.
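To make the "sample timing jitter, keep only a few low bits, mix into a pool" structure concrete, here is a minimal C++ sketch. The mix() function is a simple non-cryptographic mixer standing in for a real cryptographic hash (SHA-256, or the kernel's own input-mixing function), so treat this as an illustration of the plumbing only, not as a secure generator:
#include <chrono>
#include <cstdint>
#include <cstdio>
// Non-cryptographic stand-in for a real hash; a secure design would feed the
// samples into SHA-256 or hand them to the OS entropy pool instead.
static std::uint64_t mix(std::uint64_t state, std::uint64_t sample) {
    state ^= sample + 0x9e3779b97f4a7c15ULL + (state << 6) + (state >> 2);
    return state;
}
int main() {
    using clock = std::chrono::steady_clock;
    std::uint64_t pool = 0;
    for (int i = 0; i < 100000; ++i) {
        auto t0 = clock::now();
        volatile int spin = 0;
        for (int j = 0; j < 100; ++j) spin = spin + j;   // the timed busy work
        auto t1 = clock::now();
        std::uint64_t dt = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        pool = mix(pool, dt & 0xF);   // keep only a few low bits of each delta
    }
    std::printf("pool = %016llx\n", (unsigned long long)pool);
}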

Suitability of parallel computation for comparisons over a large dataset

Suppose the following hypothetical task:
I am given a single integer A (say, 32-bit) and a large array of integers B (same type). The array's size is fixed at runtime (it doesn't grow mid-run) but is otherwise arbitrary, except that it can always fit inside either RAM or VRAM (whichever is smaller). For the sake of this scenario, the integer array can sit in either RAM or VRAM; ignore any time cost in transferring this initial data set at start-up.
The task is to compare A against each B and to return true only if the test is true against ALL B's, returning false otherwise. For the sake of this scenario, let the test be the greater-than comparison (although I'd be interested if your answer differs for slightly more complex comparisons).
A naïve parallel implementation could involve slicing up the set B and distributing the comparison workload across multiple cores. Each core's workload would then be entirely independent, save for when a failed comparison interrupts all the others, since the result is then immediately false. Interrupts play a role in this implementation, although I'd imagine an ever-decreasing one probabilistically as the array of integers gets larger.
My question is three-fold:
Would such a scenario be suitable for parallel processing on a GPU? If so, under what circumstances? Or is this a misleading case where the direct CPU implementation is actually the fastest?
Can you suggest an improved parallel algorithm over the naïve one?
Can you suggest any reading to gain intuition on deciding such problems?
If I understand your question correctly, what you are trying to perform is a reduction. The operation in question is equivalent to a MATLAB/NumPy all(A > B). To answer the three parts:
Yes. Reductions on GPUs/multicore CPUs can be faster than their sequential counterparts. See the presentation on GPU reductions here.
The presentation should provide a hierarchical approach for reduction. A more modern approach would be to use atomic operations on shared memory and global memory, as well as warp-aggregation. However, if you do not wish to deal with the intricate details of GPU implementations, you can use a highly-optimized library such as CUB.
See 1 and 2.
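If you just want to see the shape of the computation without writing GPU code, a CPU-side sketch using C++17 parallel algorithms (assuming a standard library that actually parallelises the execution policies, e.g. libstdc++ backed by TBB) is essentially a map of the comparison followed by a logical-AND reduction; a CUDA/CUB version follows the same pattern:
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>
int main() {
    const int A = 42;                          // the single value to compare against
    std::vector<int> B(1 << 24, 7);            // placeholder data set
    // "A greater than every element of B" expressed as a parallel map + AND-reduction:
    bool all_greater = std::all_of(std::execution::par_unseq,
                                   B.begin(), B.end(),
                                   [A](int b) { return A > b; });
    std::printf("%s\n", all_greater ? "true" : "false");
}
Whether the early exit ("interrupt the others") the question mentions actually happens is up to the implementation; hand-written GPU kernels often skip it, since polling a shared flag can cost more than simply finishing the scan.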
Good luck! Hope this helps.
I think this is a situation where you'll derive minimal benefit from the use of a GPU. I also think this is a situation where it'll be difficult to get good returns on any form of parallelism.
Comments on the speed of memory versus CPUs
Why do I believe this? Behold: the performance gap (in terrifyingly unclear units).
The point here is that CPUs have gotten very fast. And, with SIMD becoming a thing, they are poised to become even faster.
In the meantime, memory speeds are improving much more slowly. Not shown on the chart are the memory buses, which ferry data to/from the CPU. Those are also getting faster, but only slowly.
Since RAM and hard drives are slow, CPUs try to store data in "little RAMs" known as the L1, L2, and L3 caches. These caches are super-fast, but super-small. However, if you can design an algorithm to repeatedly use the same memory, these caches can speed things up by an order of magnitude. For instance, this site discusses optimizing matrix multiplication for cache reuse. The speed-ups are dramatic:
The speed of the naive implementation (3Loop) drops precipitously for everything above a 350x350 matrix. Why is this? Because double-precision numbers (8 bytes each) are being used, this is the point at which the 1 MB L2 cache on the test machine gets filled. All the speed gains you see in the other implementations come from strategically reusing memory so this cache doesn't empty as quickly.
Caching in your algorithm
Your algorithm, by definition, does not reuse memory. In fact, it has the lowest possible rate of memory reuse. That means you get no benefit from the L1, L2, and L3 caches. It's as though you've plugged your CPU directly into the RAM.
How do you get data from RAM?
Here's a simplified diagram of a CPU:
Note that each core has its own, dedicated L1 cache. Core pairs share L2 caches. RAM is shared between everyone and accessed via a bus.
This means that if two cores want to get something from RAM at the same time, only one of them is going to be successful. The other is going to be sitting there doing nothing. The more cores you have trying to get stuff from RAM, the worse this is.
For most code, the problem's not too bad since RAM is accessed infrequently. However, for your code, the performance gap I talked about earlier, coupled with your algorithm's un-cacheable design, means that most of your code's time is spent getting stuff from RAM. That means that cores are almost always in conflict with each other for limited memory bandwidth.
What about using a GPU?
A GPU doesn't really fix things: most of your time will still be spent pulling stuff from RAM. Except rather than having one slow bus (from the CPU to RAM), you have two (the other being the bus from the CPU to the GPU).
Whether you get a speed up depends on the relative speed of the CPU, the GPU-CPU bus, and the GPU. I suspect you won't get much of a speed up, though. GPUs are good for SIMD-type operations, or maps. The operation you describe is a reduction or fold, which parallelises far less naturally than a map. Since your mapped function (a single comparison) is extremely simple, the GPU will spend most of its time on the reduction operation.
tl;dr
This is a memory-bound operation: more cores and GPUs are not going to fix that.
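One way to convince yourself the scan is memory-bound rather than compute-bound is to time a single sequential pass and convert it to bytes per second; on typical hardware the figure lands near the machine's memory bandwidth, not anywhere near its arithmetic throughput. A rough sketch (sizes and values are arbitrary):
#include <chrono>
#include <cstdio>
#include <vector>
int main() {
    const int A = 42;
    std::vector<int> B(1 << 26, 7);            // about 256 MiB of ints, larger than any cache
    auto t0 = std::chrono::steady_clock::now();
    bool all_greater = true;
    for (int b : B)
        all_greater = all_greater && (A > b);  // one compare per 4 bytes read
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    double gib = (B.size() * sizeof(int)) / (1024.0 * 1024.0 * 1024.0);
    std::printf("result=%d, %.2f GiB/s\n", (int)all_greater, gib / secs);
}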
ignore any time cost in transferring this initial data set at start-up
If there are only a few false conditions in millions or billions of elements, you can try an OpenCL example:
// A=5 and B=arr; result is a separate one-element int buffer initialised to 0
int id=get_global_id(0);
if(arr[id]!=5)
{
atomic_add(result,1); // count this failed comparison
}
This is about as fast as it gets. result[0] stays zero only if all conditions are "true".
If you are not sure whether there are only a few falses or millions (which makes atomic functions slow), you can do a single preprocessing pass to decrease the number of falses:
int id=get_global_id(0);
int base=id*128; int falses=0; // each work-item handles a 128-element chunk
// (a real kernel would first stage arr[base]..arr[base+127] into local/private memory)
for(int i=0;i<128;i++)
{
arr2[base+i]=5; // default every cell to the "true" value (== A)
if(arr[base+i]!=5) falses=1; // remember any false in this chunk
}
if(falses) arr2[base]=0; // keep at most one false per chunk
This copies the whole array into another one, but since you can ignore the cost of transferring the initial data from the host, this copy should be negligible by comparison too. On top of that, the launch overhead of just two kernels shouldn't take more than about 1 ms (not counting the memory reads/writes).
If the data fits in cache, the second kernel (the one with the atomic function) will access it there instead of in global memory.
If transfer times start to become a concern, you can hide their latency with pipelined upload/compute/download operations, provided the work can be split into chunks that are processed independently of the whole array.

How to test algorithm performance on devices with low resources?

I am interested in using Atmel AVR controllers to read data from a LIN bus. Unfortunately, messages on such a bus have no beginning or end indicator, and the only reasonable solution seems to be brute-force parsing. Data arriving from the bus is loaded into a circular buffer, and the brute-force method finds valid messages in the buffer.
Working with a 64-byte buffer and a 20 MHz ATtiny, how can I test the performance of my code in order to see if a buffer overflow is likely to occur? Added: my concern is that the algorithm will run too slowly, causing even more data to be buffered.
A bit about the brute-force algorithm: the second element in the buffer is assumed to be the message size. For example, if the assumed length is 22, the first 21 bytes are XORed and tested against the 22nd byte in the buffer. If the checksum passes, the code checks whether the first (SRC) and third (DST) bytes are what they are supposed to be.
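For concreteness, a minimal sketch of the per-candidate check described above (candidate_ok, expected_src/expected_dst and the sanity bounds are placeholder names, and circular-buffer wrap-around is ignored for brevity):
#include <cstdint>
// buf points at the assumed start of a message inside an already-linearised buffer.
bool candidate_ok(const std::uint8_t *buf, std::uint8_t expected_src, std::uint8_t expected_dst) {
    std::uint8_t len = buf[1];                 // second element = assumed message size
    if (len < 4 || len > 64) return false;     // hypothetical sanity bounds
    std::uint8_t sum = 0;
    for (std::uint8_t i = 0; i + 1 < len; i++)
        sum ^= buf[i];                         // XOR of the first len-1 bytes
    if (sum != buf[len - 1]) return false;     // checksum byte mismatch
    return buf[0] == expected_src && buf[2] == expected_dst;
}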
AVR is one of the easiest microcontrollers for performance analysis, because it is a RISC machine with a simple instruction set and well-known execution time for each instruction.
So the basic procedure is that you take the assembly code and start calculating different scenarios. Basic register operations take one clock cycle, branches usually two cycles, and memory accesses three cycles. An XOR loop would take maybe 5-10 cycles per byte, so it is relatively cheap. How you get your hands on the assembly code depends on the compiler, but all compilers tend to give you the end result in a reasonably legible form.
Usually, without seeing the algorithm and knowing anything about the timing requirements, it is quite impossible to give a definite answer to this kind of question. However, as the LIN bus speed is limited to 20 kbit/s, you will have around 10,000 clock cycles for each byte. That is enough for almost anything.
A more difficult question is what to do with the LIN framing, which depends on timing. It is not a very nice design, as it really requires some extra effort from the microcontroller. (What on earth is wrong with using the 9th bit?)
The LIN frame consists of a
break (at least 13 bit times)
synch delimiter (0x55)
message id (8 bits)
message (0..8 x 8 bits)
checksum (8 bits)
There are at least four possible approaches with their ups and downs:
(Your approach.) Start at all possible starting positions and try to figure out where the checksummed message is. Once you are in sync, this is not needed. (Easy, but returns ghost messages with a probability of 1/256. Remember to discard the synch field.)
Use the internal UART and look for the synch field; try to figure out whether the data after the delimiter makes any sense. (This has lower probability of errors than the above, but requires the synch delimiter to come through without glitches and may thus miss messages.)
Look for the break. Easiest way to do this to timestamp all arriving bytes. It is quite probably not required to buffer the incoming data in any way, as the data rate is very low (max. 2000 bytes/s). Nominally, the distance between the end of the last character of a frame and the start of the first character of the next frame is at least 13 bits. As receiving a character takes 10 bits, the delay between receiving the end of the last character in the previous message and end of the first character of the next message is nominally at least 23 bits. In order to allow some tolerance for the bit timing, the limit could be set to, e.g. 17 bits. If the distance in time between "character received" interrupts exceeds this limit, the characters belong to different frame. Once you have detected the break, you may start collecting a new message. (This works almost according to the official spec.)
Do-it-yourself bit-by-bit. If you do not have a good synchronization between the slave and the master, you will have to determine the master clock using this method. The implementation is not very straightforward, but one example is: http://www.atmel.com/images/doc1637.pdf (I do not claim that one to be foolproof, it is rather simplistic.)
I would go with #3. Create an interrupt for incoming data and whenever data comes you compare the current timestamp (for which you need a counter) to the timestamp of the previous interrupt. If the inter-character time is too long, you start a new message, otherwise append to the old message. Then you may need double buffering for the messages (one you are collecting, another you are analyzing) to avoid very long interrupt routines.
The actual implementation depends on the overall structure of your code. This shouldn't take much time.
And if you cannot make sure your clock is synchronized well enough (+- 4%) to the master clock, then you'll have to look at #4, which is probably much more instructive but quite tedious.
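A rough sketch of approach #3, with every hardware-specific piece (timer_ticks(), uart_read_byte(), TICKS_PER_BIT, GAP_LIMIT, and the double-buffered hand-off) left as a hypothetical placeholder you would map onto your particular AVR's timer and UART; only the "compare the inter-character gap against roughly 17 bit times" logic comes from the description above:
#include <stdint.h>
// Hypothetical hardware hooks -- map these onto your part's timer and UART.
extern uint16_t timer_ticks(void);          // free-running timer value
extern uint8_t  uart_read_byte(void);       // byte that raised the receive interrupt
#define TICKS_PER_BIT 50                    // placeholder: timer ticks per LIN bit time
#define GAP_LIMIT     (17 * TICKS_PER_BIT)  // inter-character gap that signals a new frame
static uint8_t  frame[11];                  // sync + id + 0..8 data bytes + checksum
static uint8_t  frame_len;
static uint16_t last_rx;
// Body of the UART receive-complete interrupt handler.
void on_uart_rx(void)
{
    uint16_t now  = timer_ticks();
    uint8_t  byte = uart_read_byte();
    if ((uint16_t)(now - last_rx) > GAP_LIMIT)
        frame_len = 0;                      // long silence: previous frame ended, start a new one
    if (frame_len < sizeof frame)
        frame[frame_len++] = byte;
    last_rx = now;
}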
Your fundamental question is this (as I see it):
how can I test the performance of my code in order to see if buffer overflow is likely to occur?
Set a pin high at the start of the algorithm, set it low at the end. Look at it on an oscilloscope (I assume you have one of these - embedded development is very difficult without it.) You'll be able to measure the max time the algorithm takes, and also get some idea of the variability.
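Something along these lines, assuming avr-libc and a part where the chosen pin lives on PORTB (PB0 and parse_buffer() are just example names; adjust the port/bit for your device):
#include <avr/io.h>
extern void parse_buffer(void);    // your brute-force parsing routine (placeholder name)
int main(void)
{
    DDRB |= (1 << PB0);            // PB0 as output; pick any spare pin
    for (;;) {
        PORTB |= (1 << PB0);       // pin high: parsing starts
        parse_buffer();
        PORTB &= ~(1 << PB0);      // pin low: parsing done -- measure the high time on the scope
    }
}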

Algorithmic Complexity Analysis: practically using Knuth's Ordinary Operations (oops) and Memory Operations (mems) method

In implementing most algorithms (sort, search, graph traversal, etc.), there is frequently a trade-off that can be made in reducing memory accesses at the cost of additional ordinary operations.
Knuth has a useful method for comparing the complexity of various algorithm implementations by abstracting it from particular processors and only distinguishing between ordinary operations (oops) and memory operations (mems).
In compiled programs, one typically lets the compiler organise the low level operations, and hopes that the operating system will handle the question of whether data is held in cache memory (faster) or in virtual memory (slower). Furthermore, the exact number / cost of instructions is encapsulated by the compiler.
With Forth, there is no longer such encapsulation, and one is much closer to the machine, albeit perhaps to a stack machine running on top of a register processor.
Ignoring the effect of an operating system (so no memory stalls, etc.), and assuming for the moment a simple processor,
(1) Can anyone advise on how the ordinary stack operations in Forth (e.g. dup, rot, over, swap, etc.) compare with the cost of Forth's memory access words fetch (@) and store (!)?
(2) Is there a rule of thumb I can use to decide how many ordinary operations to trade-off against saving a memory access?
What I'm looking for is something like 'memory access costs as much as 50 ordinary ops, or 500 ordinary ops, or 5 ordinary ops' Ballpark is absolutely fine.
I'm trying to get a sense of the relative expense of fetch and store vs. rot, swap, dup, drop, over, correct to an order of magnitude.
This article How much time does it take to fetch one word from memory? talks about main memory stall times, with some rule of thumb type numbers, but basically you can do lots of instructions while stalling for main memory. As others have said, the numbers vary a lot between systems.
Main memory stalls are a big area of interest, especially as CPUs gain more cores but typically not much more memory bandwidth. There is also some research going on around compressing data in main memory, so that the CPU can take advantage of 'spare' cycles and tightly packed cache lines: http://oai.cwi.nl/oai/asset/15564/15564B.pdf
For those who are really interested in the details, most CPU manufacturers publish in-depth guides on memory optimisation etc., mostly aimed at high-end developers and compiler writers, but readable by all 2GL and 3GL programmers.
Ps. Go Forth.
A comparison between memory fetches and register operations is okay for assembler programs, as it is for the output of C compilers, which is in fact an assembler program.
In Forth this question hardly makes sense. In the first place, Forth is an interpreter, and in using Forth one forgoes the ultimate in speed. Of course one could add an optimiser on top of Forth, but then the question makes even less sense, because the output of a C optimiser and a Forth optimiser converge to -- you guessed it -- an optimal solution.
Let's look at an elementary operation in Forth like AND.
This is implemented as
> CODE AND
> POP AX
> POP BX
> AND AX, BX
> PUSH AX
> NEXT
So we already see three memory operations for something that looks like an elementary calculation operation. It appears the Knuth metric is not applicable. Also, Forth seems to be losing big time. That is, however, not true. Those memory operations all hit the L1 cache of a typical processor. That is about as efficient as local variables in small C functions.
We can compare stack operations with memory operations using VARIABLEs and the stack. The answer is simple. A VARIABLE risks a memory stall. A stack operation will almost certainly be an L1 cache hit. This is the single most important point of consideration. However, the question explicitly asks not to consider it!
So there.

What is "overhead"?

I am a student in Computer Science and I am hearing the word "overhead" a lot when it comes to programs and sorts. What does this mean exactly?
It's the resources required to set up an operation. It might seem unrelated to the operation itself, but it is necessary.
It's like when you need to go somewhere, you might need a car. But, it would be a lot of overhead to get a car to drive down the street, so you might want to walk. However, the overhead would be worth it if you were going across the country.
In computer science, sometimes we use cars to go down the street because we don't have a better way, or it's not worth our time to "learn how to walk".
The meaning of the word can differ a lot with context. In general, it's resources (most often memory and CPU time) that are used, which do not contribute directly to the intended result, but are required by the technology or method that is being used. Examples:
Protocol overhead: Ethernet frames, IP packets and TCP segments all have headers, and TCP connections require handshake packets. Thus, you cannot use the entire bandwidth the hardware is capable of for your actual data. You can reduce the overhead by using larger packet sizes, or by switching to UDP, which has a smaller header and no handshake. (A small back-of-the-envelope calculation follows these examples.)
Data structure memory overhead: A linked list requires at least one pointer for each element it contains. If the elements are the same size as a pointer, this means a 50% memory overhead, whereas an array can potentially have 0% overhead.
Method call overhead: A well-designed program is broken down into lots of short methods. But each method call requires setting up a stack frame, copying parameters and a return address. This represents CPU overhead compared to a program that does everything in a single monolithic function. Of course, the added maintainability makes it very much worth it, but in some cases, excessive method calls can have a significant performance impact.
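To put a number on the protocol example: for a full-size TCP segment over Ethernet (IPv4, no header options, ignoring ACK traffic), the fixed framing costs come to roughly 5% of the bytes on the wire; small packets fare far worse. A quick check:
#include <cstdio>
int main()
{
    // Back-of-the-envelope protocol overhead for one full-size TCP segment over Ethernet.
    const double payload = 1460;                        // TCP payload bytes (typical MSS)
    const double headers = 14 + 20 + 20 + 4;            // Ethernet + IPv4 + TCP headers + FCS
    const double on_wire = 8 + headers + payload + 12;  // + preamble/SFD and inter-frame gap
    std::printf("useful payload: %.1f%% of the bytes on the wire\n",
                100.0 * payload / on_wire);             // roughly 95%
}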
You're tired and can't do any more work. You eat food. The energy spent looking for food, getting it and actually eating it is overhead!
Overhead is something wasted in order to accomplish a task. The goal is to make the overhead very, very small.
In computer science, let's say you want to print a number; that's your task. But storing the number, setting up the display to print it, calling the routines that print it, and then fetching the number from its variable are all overhead.
Wikipedia has us covered:
In computer science, overhead is generally considered any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to attain a particular goal. It is a special case of engineering overhead.
Overhead typically refers to the amount of extra resources (memory, processor, time, etc.) that different programming algorithms take.
For example, the overhead of inserting into a balanced binary tree could be much larger than that of the same insert into a simple linked list (the insert will take longer and use more processing power to balance the tree, which results in a longer perceived operation time for the user).
For a programmer, overhead refers to those system resources which are consumed by your code when it's running on a given platform on a given set of input data. Usually the term is used when comparing different implementations or possible implementations.
For example, we might say that a particular approach might incur considerable CPU overhead while another might incur more memory overhead, and yet another might be weighted toward network overhead (and entail an external dependency, for example).
Let's give a specific example: Compute the average (arithmetic mean) of a set of numbers.
The obvious approach is to loop over the inputs, keeping a running total and a count. When the last number is encountered (signaled by end-of-file (EOF), some sentinel value, some GUI button, whatever), then we simply divide the total by the number of inputs and we're done.
This approach incurs almost no overhead in terms of CPU, memory or other resources. (It's a trivial task).
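A minimal sketch of that first approach, which keeps only a running total and a count and therefore has essentially no memory overhead beyond a couple of variables:
#include <iostream>
int main()
{
    double total = 0.0;
    long   count = 0;
    double x;
    while (std::cin >> x) {    // read numbers until EOF; nothing is buffered or stored
        total += x;
        ++count;
    }
    if (count > 0)
        std::cout << "mean = " << (total / count) << "\n";
}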
Another possible approach is to "slurp" the input into a list, iterate over the list to calculate the sum, and then divide that by the number of valid items in the list.
By comparison this approach might incur arbitrary amounts of memory overhead.
In a particularly bad implementation, we might perform the sum operation using recursion but without tail-call elimination. Now, in addition to the memory overhead for our list, we're also introducing stack overhead (which is a different sort of memory and is often a more limited resource than other forms of memory).
Yet another (arguably more absurd) approach would be to post all of the inputs to some SQL table in an RDBMS and then simply call the SQL SUM function on that column of that table. This shifts our local memory overhead to some other server, and incurs network overhead and external dependencies on our execution. (Note that the remote server may or may not have any particular memory overhead associated with this task -- it might shove all the values immediately out to storage, for example.)
Hypothetically we might consider an implementation over some sort of cluster (possibly to make the averaging of trillions of values feasible). In this case any necessary encoding and distribution of the values (mapping them out to the nodes) and the collection/collation of the results (reduction) would count as overhead.
We can also talk about the overhead incurred by factors beyond the programmer's own code. For example, compiling some code for 32- or 64-bit processors might entail greater overhead than one would see for an old 8-bit or 16-bit architecture. This might involve larger memory overhead (alignment issues) or CPU overhead (where the CPU is forced to adjust bit ordering or use unaligned accesses, etc.) or both.
Note that the disk space taken up by your code and its libraries, etc. is not usually referred to as "overhead" but rather is called "footprint." Also, the base memory your program consumes (without regard to any data set that it's processing) is called its "footprint" as well.
Overhead is simply extra time consumed during program execution. Example: when we call a function, control is passed to where the function is defined and its body is executed there, which means the CPU runs through a longer process (first passing control to another place in memory, then executing there, then passing control back to the former position). Consequently this takes extra time, hence overhead. One way to reduce it is to define the function as inline: the compiler then copies the function's body to the call site, so control never passes to some other location and the program continues in a straight line.
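A tiny illustration of the idea (with the caveat that the inline keyword is only a hint; modern compilers usually make this decision themselves):
// Marked inline so the compiler may paste the body directly at the call site,
// avoiding the call/return and stack-frame overhead described above.
inline int square(int x) { return x * x; }
int sum_of_squares(int n)
{
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += square(i);    // with inlining, there is no control transfer here
    return total;
}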
You could use a dictionary; the definition is the same. But to save you time: overhead is the work required to do the productive work. For instance, an algorithm runs and does useful work, but requires memory to do that work. This memory allocation takes time and is not directly related to the work being done, and is therefore overhead.
You can check Wikipedia, but mainly it's when more actions or resources are used. For example, if you are familiar with .NET, you have value types and reference types there. Reference types have memory overhead, as they require more memory than value types.
A concrete example of overhead is the difference between a "local" procedure call and a "remote" procedure call.
For example, with classic RPC (and many other remote frameworks, like EJB), a function or method call looks the same to a coder whether it's a local, in-memory call or a distributed, network call.
For example:
service.function(param1, param2);
Is that a normal method, or a remote method? From what you see here you can't tell.
But you can imagine that the difference in execution times between the two calls are dramatic.
So, while the core implementation will "cost the same", the "overhead" involved is quite different.
Think about the overhead as the time required to manage the threads and coordinate among them. It is a burden if the threads do not have enough work to do. In such a case the overhead cost exceeds the time saved by using threading, and the code takes more time than the sequential version.
To answer you, I would give you the analogy of cooking rice, for example.
Ideally, when we want to cook, we want everything to be available: the pots already clean, rice available in sufficient quantity. If this is true, then we take less time to cook our rice (less overhead).
On the other hand, suppose you don't have clean water available immediately and you don't have rice, so you need to go buy it from the shops first, and you also need to fetch clean water from the tap outside your house. These extra tasks are not part of the standard work; to cook rice you shouldn't have to spend so much time gathering your ingredients. Ideally, your ingredients should be present at the time you want to cook your rice.
So the cost of the time spent going to buy your rice from the shops and fetching water from the tap is overhead to cooking rice. These are costs we can avoid or minimize, compared to the standard way of cooking rice (everything is around you, and you don't have to waste time gathering your ingredients).
The time wasted collecting ingredients is what we call the overhead.
In computer science, for example in multithreading, communication overhead among threads happens when threads have to take turns giving each other access to a certain resource, or when they are passing information or data to each other. Overhead also arises from context switching. Even though this communication is crucial, it is still time (CPU cycles) spent, compared to the traditional single-threaded way of programming, where no time is spent on communication. A single-threaded program does the work straight away.
It's anything other than the data itself, i.e. TCP flags, headers, CRC, FCS, etc.
