How to test algorithm performance on devices with low resources?

How to test algorithm performance on devices with low resources? - performance

I am interested in using atmel avr controllers to read data from LIN bus. Unfortunately, messages on such bus have no beginning or end indicator and only reasonable solution seems to be brute force parsing. Available data from bus is loaded into circular buffer, and brute force method finds valid messages in buffer.
Working with 64 byte buffer and 20MHZ attiny, how can I test the performance of my code in order to see if buffer overflow is likely to occur? Added: My concern is that algorith will be running slow, thus buffering even more data.
A bit about brute force algorithm. Second element in a buffer is assumed to be message size. For example, if assumed length is 22, first 21 bytes are XORed and tested against 22nd byte in buffer. If checksum passes, code checks if first (SRC) and third (DST) byte are what they are supposed to be.

AVR is one of the easiest microcontrollers for performance analysis, because it is a RISC machine with a simple intruction set and well-known instruction execution time for each instruction.
So, the beasic procedure is that you take the assembly coude and start calculating different scenarios. Basic register operations take one clock cycle, branches usually two cycles, and memory accesses three cycles. A XORing cycle would take maybe 5-10 cycles per byte, so it is relatively cheap. How you get your hands on the assembly code depends on the compiler, but all compilers tend to give you the end result in a reasonable legible form.
Usually without seeing the algorithm and knowing anything about the timing requirements it is quite impossible to give a definite answer to this kind of questions. However, as the LIN bus speed is limited to 20 kbit/s, you will have around 10 000 clock cycles for each byte. That is enough for almost anything.
A more difficult question is what to do with the LIN framing which is dependent on timing. It is not a very nice habit, as it really requires some time extra effort from the microcontroller. (What on earth is wrong with using the 9th bit?)
The LIN frame consists of a
break (at least 13 bit times)
synch delimiter (0x55)
message id (8 bits)
message (0..8 x 8 bits)
checksum (8 bits)
There are at least four possible approaches with their ups and downs:
(Your apporach.) Start at all possible starting positions and try to figure out where the checksummed message is. Once you are in sync, this is not needed. (Easy but returns ghost messages with a probability 1/256. Remember to discard the synch field.)
Use the internal UART and look for the synch field; try to figure out whether the data after the delimiter makes any sense. (This has lower probability of errors than the above, but requires the synch delimiter to come through without glitches and may thus miss messages.)
Look for the break. Easiest way to do this to timestamp all arriving bytes. It is quite probably not required to buffer the incoming data in any way, as the data rate is very low (max. 2000 bytes/s). Nominally, the distance between the end of the last character of a frame and the start of the first character of the next frame is at least 13 bits. As receiving a character takes 10 bits, the delay between receiving the end of the last character in the previous message and end of the first character of the next message is nominally at least 23 bits. In order to allow some tolerance for the bit timing, the limit could be set to, e.g. 17 bits. If the distance in time between "character received" interrupts exceeds this limit, the characters belong to different frame. Once you have detected the break, you may start collecting a new message. (This works almost according to the official spec.)
Do-it-yourself bit-by-bit. If you do not have a good synchronization between the slave and the master, you will have to determine the master clock using this method. The implementation is not very straightforward, but one example is: http://www.atmel.com/images/doc1637.pdf (I do not claim that one to be foolproof, it is rather simplistic.)
I would go with #3. Create an interrupt for incoming data and whenever data comes you compare the current timestamp (for which you need a counter) to the timestamp of the previous interrupt. If the inter-character time is too long, you start a new message, otherwise append to the old message. Then you may need double buffering for the messages (one you are collecting, another you are analyzing) to avoid very long interrupt routines.
The actual implementation depends on the other structure of your code. This shouldn't take much time.
And if you cannot make sure your clock is well enough synchronized (+- 4%) to the moster clock, then you'll have to look at #4, which is probably much more instructive but quite tedious.

Your fundamental question is this (as I see it):
how can I test the performance of my code in order to see if buffer overflow is likely to occur?
Set a pin high at the start of the algorithm, set it low at the end. Look at it on an oscilloscope (I assume you have one of these - embedded development is very difficult without it.) You'll be able to measure the max time the algorithm takes, and also get some idea of the variability.

Related

Are bytes real?

I know that this question may sound stupid, but let me just explain. So...
Everyone knows that byte is 8 bits. Simple, right? But where exactly is it specified? I mean, phisically you don't really use bytes, but bits. For example drives. As I understand, it's just a reaaaaly long string of ones and zeros and NOT bytes. Sure, there are sectors, but, as far as I know, there are programmed at software level (at least in SSDs, I think). Also RAM, which is again - a long stream of ones and zeros. Another example is CPU. It doesn't process 8 bits at a time, but only one.
So where exactly is it specified? Or is it just general rule, which everyone follows? If so, could I make system (either operating system or even something at lower level) that would use, let's say, 9 bits in a byte? Or I wouldn't have to? Also - why can't you use less than a byte of memory? Or maybe you can? For example: is it possible for two applications to use the same byte (e.g. first one uses 4 bits and second one uses other 4)? And last, but not least - does computer drives really use bytes? Or is it that, for example, bits 1-8 belong to something, next to them there are some 3 random bits and bits 12-20 belong to something different?
I know that there are a lot of question and knowing answers to these questions doesn't change anything, but I was just wondering.
EDIT: Ok, I might've expressed myself not clear enough. I know that byte is just a concept (well, even bit is just a concept that we make real). I'm NOT asking why there are 8 bits in byte and why bytes exist as a term. What I'm asking is where in a computer is byte defined or if it even is defined. If bytes really are defined somewhere, at what level (a hardware level, OS level, programming language level or just at application level)? I'm also asking if computers even care about bytes (in that concept that we've made real), if they use bytes constantly (like in between two bytes, can there be some 3 random bits?).

Yes, they’re real insofaras they have a definition and a standardised use/understanding. The Wikipedia article for byte says:
The modern de-facto standard of eight bits, as documented in ISO/IEC 2382-1:1993, is a convenient power of two permitting the values 0 through 255 for one byte (2 in power of 8 = 256, where zero signifies a number as well).[7] The international standard IEC 80000-13 codified this common meaning. Many types of applications use information representable in eight or fewer bits and processor designers optimize for this common usage. The popularity of major commercial computing architectures has aided in the ubiquitous acceptance of the eight-bit size.[8] Modern architectures typically use 32- or 64-bit words, built of four or eight bytes
The full article is probably worth reading. No one set out stall 50+ years ago, banged a fist on the desk and said ‘a byte shallt be 8 bits’ but it became that way over time, with popular microprocessors being able to carry out operations on 8 bits at a time. Subsequent processor architectures carry out ops on multiples of this. While I’m sure intel could make their next chip a 100bit capable one, I think the next bitness revolution we’ll encounter will be 128
Everyone knows that byte is 8 bits?
These days, yes
But where exactly is it specified?
See above for the ISO code
I mean, phisically you don't really use bytes, but bits.
Physically we don’t use bits either, but a threshold of detectable magnetic field strength on a rust coated sheet of aluminium, or an amount of electrical charge storage
As I understand, it's just a reaaaaly long string of ones and zeros and NOT bytes.
True, everything to a computer is a really long stream of 0 and 1. What is important in defining anything else is where to stop counting this group of 0 or 1, and start counting the next group, and what you call the group. A byte is a group of 8 bits. We group things for convenience. It’s a lot more inconvenient to carry 24 tins of beer home than a single box containing 24 tins
Sure, there are sectors, but, as far as I know, there are programmed at software level (at least in SSDs, I think)
Sectors and bytes are analogous in that they represent a grouping of something, but they aren’t necessarily directly related in the way that bits and bytes are because sectors are a level of grouping on top of bytes. Over time the meaning of a sector as a segment of a track (a reference to a platter number and a distance from the centre of the platter) has changed as the march of progress has done away with positional addressing and later even rotational storage. In computing you’ll typically find that there is a base level that is hard to use, so someone builds a level of abstraction on top of it, and that becomes the new “hard to use”, so it’s abstracted again, and again.
Also RAM, which is again - a long stream of ones and zeros
Yes, and is consequently hard to use, so it’s abstracted, and abstracted again. Your program doesn’t concern itself with raising the charge level of some capacitive area of a memory chip, it uses the abstractions it has access to, and that attraction fiddles the next level down, and so on until the magic happens at the bottom of the hierarchy. Where you stop on this downward journey is largely a question of definition and arbitrary choice. I don’t usually consider my ram chips as something like ice cube trays full of electrons, or the subatomic quanta, but I could I suppose. We normally stop when it ceases to useful to solving the
Problem
Another example is CPU. It doesn't process 8 bits at a time, but only one.
That largely depends on your definition of ‘at a time’ - most of this question is about the definitions of various things. If we arbitrarily decide that ‘at a time’ is the unit block of the multiple picoseconds it takes the cpu to complete a single cycle then yes, a CPU can operate on multiple bits of information at once - it’s the whole idea of having a multiple bit cpu that can add two 32 bit numbers together and not forget bits. If you want to slice the time up so precisely that we can determine that enough charge has flowed to here but not there then you could say which bit the cpu is operating on right at this pico (or smaller) second, but it’s not useful to go so fine grained because nothing will happen until the end of the time slice the cpu is waiting for.
Suffice to say, when we divide time just enough to observe a single cpu cycle from start to finish, we can say the cpu is operating on more than one bit.
If you write at one letter per second, and I close my eyes for 2 out of every 3 seconds, I’ll see you write a whole 3 letter word “at the same time” - you write “the cat sat onn the mat” and to the observer, you generated each word simultaneously.
CPUs run cycles for similar reasons, they operate on the flow and buildup of electrical charge and you have to wait a certain amount of time for the charge to build up so that it triggers the next set of logical gates to open/close and direct the charge elsewhere. Faster CPUs are basically more sensitive circuitry; the rate of flow of charge is relatively constant, it’s the time you’re prepared to wait for input to flow from here to there, for that bucket to fill with just enough charge, that shortens with increasing MHz. Once enough charge has accumulated, bump! Something happens, and multiple things are processed “at the same time”
So where exactly is it specified? Or is it just general rule, which everyone follows?
it was the general rule, then it was specified to make sure it carried on being the general rule
If so, could I make system (either operating system or even something at lower level) that would use, let's say, 9 bits in a byte? Or I wouldn't have to?
You could, but you’d essentially have to write an adaptation(abstraction) of an existing processor architecture and you’d use nine 8bit bytes to achieve your presentation of eight 9bit bytes. You’re creating an abstraction on top of an abstraction and boundaries of basic building blocks don’t align. You’d have a lot of work to do to see the system out to completion, and you wouldn’t bother.
In the real world, if ice cube trays made 8 cubes at a time but you thought the optimal number for a person to have in the freezer was 9, you’d buy 9 trays, freeze them and make 72 cubes, then divvy them up into 8 bags, and sell them that way. If someone turned up with 9 cubes worth of water (it melted), you’d have to split it over 2 trays, freeze it, give it back.. this constant adaptation between your industry provided 8 slot trays and your desire to process 9 cubes is the adaptive abstraction
If you do do it, maybe call it a nyte? :)
Also - why can't you use less than a byte of memory? Or maybe you can?
You can, you just have to work with the limitations of the existing abstraction being 8 bits. If you have 8 Boolean values to store you can code things up so you flip bits of the byte on and off, so even though you’re stuck with your 8 cube ice tray you can selectively fill and empty each cube. If your program only ever needs 7 Booleans, you might have to accept the wastage of the other bit. Or maybe you’ll use it in combination with a regular 32 bit int to keep track of a 33bit integer value. Lot of work though, writing an adaptation that knows to progress onto the 33rd bit rather than just throw an overflow error when you try to add 1 to 4,294,967,295. Memory is plentiful enough that you’d waste the bit, and waste another 31bits using a 64bit integer to hold your 4,294,967,296 value.
Generally, resource is so plentiful these days that we don’t care to waste a few bits.. It isn’t always so, of course: take credit card terminals sending data over slow lines. Every bit counts for speed, so the ancient protocols for info interchange with the bank might well use different bits of the same byte to code up multiple things
For example: is it possible for two applications to use the same byte (e.g. first one uses 4 bits and second one uses other 4)?
No, because hardware and OS memory management these days keeps programs separate for security and stability. In the olden days though, one program could write to another program’s memory (it’s how we cheated at games, see the lives counter go down, just overwrite a new value), so in those days if two programs could behave, and one would only write to the 4 high bits and the other the 4 low bits then yes, they could have shared a byte. Access would probably be whole byte though, so each program would have to read the whole byte, only change its own bits of it, then write the entire result back
And last, but not least - does computer drives really use bytes? Or is it that, for example, bits 1-8 belong to something, next to them there are some 3 random bits and bits 12-20 belong to something different?
Probably not, but you’ll never know because you don’t get to peek to that level of abstraction enough to see the disk laid out as a sequence of bits and know where the byte boundaries are, or sector boundaries, and whether this logical sector follows that logical sector, or whether a defect in the disk surface means the sectors don’t follow on from each other. You don’t typically care though, because you treat the drive as a contiguous array of bytes (etc) and let its controller worry about where the bits are

Computing vs printing

In my program, I made few modifications for performance improvement.
First, I undid some 3D point computations as it was a repetitive computation.
Second, I undid some print statements.
What I observe is that second change substantially improved the performance, while first one not so much.
Does it mean computations involving floating numbers are much less expensive than printing out some data to console? Is not floating point mathematics considered to be highly computation extensive?

Floating-point arithmetic is often more expensive than integer arithmetic, in terms of processor cycles and/or the space required for it in the silicon of processors and/or the energy required for it. However, printing is generally much more expensive.
Typical performance for floating-point additions or multiplications might be a latency of four processor cycles, compared to one for integer additions or multiplications.
Formatting output requires many instructions. Converting numbers to decimal requires dividing or performing table-lookups or executing other algorithms. The characters generated to represent a number must be placed in a buffer. Checks must be performed to ensure that internal buffers are not overflowed. When a buffer is full, or a printing operation is complete and must be sent to the output device (rather than just merely held in a buffer for future operations), then an operating system call must be performed to transfer the data from user memory to some input-output driver. Even simple in-buffer formatting operations may take hundreds of cycles, and printing that requires interaction with the file system or other devices may take thousands of cycles. (The actual upper limit is infinite, since printing may have to wait for some physical device to become ready. But even if all the activity of a particular operation is inside the computer itself, a print operation may take thousands of cycles.)

How is more than one bit read at a time?

This seems like it should be incredibly easy for me to grasp, but I am going insane trying to understand it. I understand that the computer only understands on or off, so what I am trying to understand is a situation where the computer reads more than one on or off value at a time - as in byte-addressable memory - where it would need to read 8 values of on and off to get the byte value. If the computer reads these 8 values sequentially - one after the other - then that would make sense to me. But from what I gather it reads all of these on and off states at the same time. I assume this is accomplished through the circuit design? Would someone be able to either explain this in simple terms to my simple mind or direct me to a resource that could do so? I would love to not have to be committed to the mental hospital over this - thanks.

If you mean reading the data off of a CD, it does happen sequentially. Each bit is gathered individually by a laser, but this operation is still treated as occurring all at once conceptually because it doesn't expose sufficiently fine-grained control to other parts of the computer to for it to be instructed to read just one bit. Simplified, the way this works is that a laser bounces off a mirror which depending on whether it is scratched or not in that particular place either causes the laser to bounce back and hit a photo-transistor (unscratched) setting a single bit to off, or to be deflected and not hit the photo-transistor (scratched), setting the bit to on. Even though it is loading these bits sequentially, it will always load complete bytes into its buffer before sending them to the computer's memory. Multiple bits travel from the hard drive buffer to memory at once. This is possible because the bus connecting them has multiple wires. Each wire can only hold one value at a given time, either high voltage or low (on or off). Eight adjacent wires can hold a byte at once.
A hard drive works by the same principal except rather than a laser and a scratched disk, you have a disk that is magnetized in specific locations, and a very sensitive inductor that will have current flow one way or another depending on the polarity of the accelerating magnetic field below it (i.e. the spinning disk).
Other parts of computer hardware actually do their operations in parallel. For sending information between different parts of the memory hierarchy, you can still think of this as just having multiple wires adjacent to each other.
Processing actually is built around the fact that multiple bits are being accessed at once, i.e. that there are different wires with on and off right next to each other. For example, in modern computers, a register can be thought of as 64 adjacent wires. (Calling them wires is a pretty big simplification at this point, but it is still gets across the appropriate idea.) A processor is built up of logic gates, and logic gates are made up of transistors. Transistors are a physical objects that have four places to attach wires. (1),(2),(3), and (4). A transistor allows current to flow from (1) to (2) if and only if the voltage at (3) is higher than the voltage at (4). Basically, a logic gate sets the voltage at (3) and (4) on a bunch of wires, and has a constant ON supply at (1), and (4) will only be powered if the amount of voltage at (3) and (4) matches the configuration that the logic gate is allowed to power.
From there it is pretty easy to see how input of some pre-specified size can be operated on at once.
If each of the gates in the drawing below is an OR gate, the drawing would show a new composite gate that calculated the OR of 8 bits at once instead of just two.
(calculated end -- registers get written to here)
^
^ ^ (everything that happens in here is "invisible" to the computer)
^ ^ ^ ^
a b c d e f g h (registers get read here)
Processors are built with a bunch of operations of some specified depth in number of transistors so that every 2 ^ - 32 second or so, the processor can flip a switch that allows the voltage stored at the calculated end of the gates to flow back to the registers. So that it can be used in new computations or pushed over wires back to memory.
That, in a nut shell, is what it means for a computer to deal with multiple bits at once.

Does cache line flush write the whole line to the memory?

When a dirty cache line is flushed (because of any reason), is the whole cache line written to the memory or CPU tracks down which words got written to and reduces the number of memory writes?
If this differs among architectures, I'm primarily interested in knowing this for Blackfin, but it would be nice to hear practices in x86, ARM, etc...

generally if you have a write buffer it flushes through the write buffer (entire cache line). Then the write buffer at some point completes the writes to ram. I have not heard of a cache that keeps track per item within a line which parts are dirty or not, that is why you have a cache line. So for the cases I have heard of the whole line goes out. Another point is that it is not uncommon for the slow memory on the back side of a cache DDR for example, is accessed through some fixed width, 32 bits at a time 64 bits at a time 128 bits at a time, or each part is at that width and there are multiple parts. That kind of thing, so to avoid a read-modify-write you want to write in complete ram width sizes. Cache lines are multiples of that, sure, and the opportunity to not do writes is there. Also if there is ecc on that ram then you need to write a whole ecc line at once to avoid a read-modify write.
You would need a dirty bit per writeable item in the cache line so that would multiply the dirty bit storage bu some amount, that may or may not have a real impact on size or cost, etc. There may or may not be an overhead on the ram side per transaction and it may be cheaper to do a multi word transaction rather than even two separate transactions, so this scheme might create a performance hit rather than boost (same problem inside the write buffer, instead of one transaction with a start address and length, now multiple transactions).
It just seems like a lot of work for something that may or may not result in a gain. If you find one that does please post it here.

I'm dusting off my cobwebby computer architecture knowledge from classes taken 15 years ago -- please be kind if I'm totally wrong.
I seem to remember that x86, MIPS and Motorola, the whole line gets written. This is because the cache line is the same as the bus width (except in very odd circumstances, such as the moldy old 386-SX line which was a 32-bit architecture with a 16-bit bus), so there's no point in trying to do word-wise optimization, the whole line is going to be written anyway.
I can't imagine any scenario in which a hardware architecture of any kind would do anything different, but I've been known to be wrong in the past.

Sending (serial) break using windows (XP+) api

Is there a better way to send a serial break then the setcommbreak - delay - clearcommbreak sequence?
I have to communicate with a microcontroller that uses serial break as the start of a packet on 115k2, and the setcommbreak has two problems:
with 115k2, the break is well below 1ms, and it will get timing critical.
Since the break must be embedded in the packet stream at the correct position, I expect trouble with the fifo.
Is there a better way of doing this, without moving the serial communication to a thread without fifo ? The UART is typically a 16550+
I have a choice in the sense that the microcontroller setup can be switched(other firmware) to a more convention packet format, but the manual warns that the "break" way features hardware integrity checking of the serial.
Compiler is Delphi (2009/XE), but any code or even just a reference is welcome.

The short answer is that serial programming with Windows is fairly limited :-(
You're right that the normal way of sending a break is with SetCommBreak(), and yes, you have to handle the delay yourself - which tends to mean the break ends up substantially longer than it needs to be. The good news is that this doesn't usually matter - most devices expecting a break will treat a much longer break in exactly the same way as a short one.
In the event that your microcontroller is fussy about the precise duration of the break, one way of achieving a shorter, precisely-defined break is to change the baud rate on the port to a slower rate, send a zero byte, then change it back again.
The reason that this works is that a byte sent to the serial port is sent as (usually) one start bit (a zero), followed by the bits in the byte, followed by one or more stop bits (high bits). A 'break' is a sequence of zero bits that is too long to be a byte - i.e. the stop bits don't come in time. By choosing a slower baud rate and sending a zero, you end up holding the line at zero for longer than the receiver expects a byte to be, so it interprets it as a break. (It's up to you whether to determine the baud rate to use by precise calculation or trial-and-error of what the microcontroller seems to like :-)
Of course, either method (SetCommBreak() or baud changing) requires you to know when all data has been sent out of the serial port (i.e. there's nothing left in the transmit FIFO). This nice article about Windows Serial programming describes how to use SetCommMask(), WaitCommEvent() etc. to determine this.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio