Parallel processing on ATmega328 - parallel-processing

I want to implement the idea of parallel processing with 3 ATmega328 microcontrollers. The problem will be a trivial one: computing a number's factorial and writing the result on an LCD display. One of the 3 microcontrollers will be the dispatcher; it will receive the problem and split it into smaller tasks for the other 2 microcontrollers. Now, my question is, how can I implement the communication between the microcontrollers, and what should the dispatcher send? Thank you!
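The hardware side of the communication (UART, SPI or I2C between the chips) depends on how they are wired, but the splitting step itself is language-agnostic. Below is a minimal Python sketch of what the dispatcher could send (a start/end range per worker) and how it could combine the replies; all names here are hypothetical, not an existing protocol:

```python
# Hypothetical sketch of the dispatcher's splitting step: n! is split into
# contiguous ranges, one per worker; each worker sends back its partial
# product, and the dispatcher multiplies the partials together.

def split_factorial_tasks(n, workers=2):
    """Return one (start, end) range per worker covering 1..n."""
    step = n // workers
    tasks, start = [], 1
    for i in range(workers):
        end = n if i == workers - 1 else start + step - 1
        tasks.append((start, end))
        start = end + 1
    return tasks

def partial_product(start, end):
    """What a worker computes for its range."""
    result = 1
    for k in range(start, end + 1):
        result *= k
    return result

tasks = split_factorial_tasks(10)                    # [(1, 5), (6, 10)]
partials = [partial_product(a, b) for a, b in tasks]
total = partials[0] * partials[1]                    # 3628800 == 10!
```

In practice the dispatcher would transmit just the two range endpoints over the serial link, and each worker would send back its partial product.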

Related

What is the performance of 100 processors capable of 2 GFLOPs running 2% sequential and 98% parallelizable code?

This is how I solved the following question, I want to be sure if my solution is correct?
A multiprocessor consists of 100 processors, each capable of a peak execution rate of 2 Gflops. What is the performance of the system as measured in Gflops when 2% of the code is sequential and 98% is parallelizable?
Solution :
I think I'm right in thinking that 2% of the program is run at 2 GFLOPs, and 98% is run at 200 GFLOPs, and that I can average these speeds to find the performance of the multiprocessor in GFLOPs
(2/100)*2 + (98/100)*200 = 196.04 Gflops
I want to be sure if my solution is correct?
From my understanding, it is 2% of the program that is sequential, not 2% of the execution time. This means the sequential code takes a significant portion of the time: since there are a lot of processors, the parallel part is drastically accelerated.
With your method, a program with 50% sequential code and 1000 processors would run at (50/100)*2 + (50/100)*2000 = 1001 Gflops. This would mean that, on average, all processors are used at ~50% of their maximum capacity during the whole execution of the program, which is too good to be possible. Indeed, the parallel part of the program should be so fast that it takes only a tiny fraction of the execution time (<5%), while the sequential part takes almost all of it (>95%). Since the largest part of the execution time runs at 2 Gflops, the processors cannot be used at ~50% of their capacity!
Based on Amdahl's law, you can compute the actual speed-up of this code:
Slat = 1 / ((1-p) + p/s), where Slat is the speed-up of the whole program, p is the portion of parallel code (0.98) and s is the number of processors (100). This gives Slat = 33.6. Since one processor runs at 2 Gflops and the program is 33.6 times faster overall using many processors, the overall program runs at 33.6 * 2 = 67.2 Gflops.
What Amdahl's law shows is that even a tiny sequential fraction of the execution time strongly impacts the scalability, and thus the performance, of parallel programs.
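The arithmetic above can be checked in a couple of lines:

```python
# Amdahl's law: speed-up over one processor, then overall Gflops.
p, s = 0.98, 100                    # parallel fraction, processor count
speedup = 1 / ((1 - p) + p / s)     # ~33.6
gflops = speedup * 2                # one processor peaks at 2 Gflops;
                                    # ~67 (the 67.2 above rounds the speed-up first)
```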
Forgive me for starting light & anecdotally, citing a meme from my beloved math professor; later we will see why & how well it helps us here
2 + 7 = 15 . . . , particularly so for higher values of 7
It is ideal if I start off by stating some definitions:
a) GFLOPS is a unit, that measures how many operations in FLO-ating point arithmetics, with no particular kind thereof specified (see remark 1), were performed P-er S-econd ~ FLOPS, here expressed for convenience in multiples of billion ( G-iga ), i.e. the said GFLOPS
b) processor, multi-processor is a device ( or some kind of composition of multiple such same devices, expressed as multi-p. ), used to perform some kind of a useful work - a processing
This pair of definitions was necessary to further judge the asked question to solve.
The term (a) is a property of (b), irrespective of all other factors, if we assume such a "device" not to be a kind of some polymorphic, self-modifying FPGA or evolutionary reflective self-evolving amoeboid, which both processors & multi-processors prefer not to be, at least in our part of the Universe as we know it in 2022-Q2.
Once manufactured, each kind of processor(b) (be it a monolithic or a multi-processor device) has certain, observable, repetitively measurable, qualities of processing (doing a work).
"A multiprocessor consists of 100 processors, each capable of a peak execution rate of 2Gflops. What is the performance of the system as measured in Gflops when 2% of the code is sequential and 98% is parallelizable"
A multiprocessor . . . (device)
consists . . . has a property of being composed of
100 . . . (quantitative factor) ~ 100
processors,. . . (device)
each . . . declaration of equality
capable . . . having a property of
of a peak . . . peak (not having any higher)
execution . . . execution of work (process/code)
rate . . . being measured in time [1/s]
of 2Gflops . . . (quantitative factor) ~ 2E+9 FLOPS
What is . . . Questioning
the PERFORMANCE . . . (property) a term (not defined yet)
of the SYSTEM . . . (system) a term (not defined yet)
as measured in . . . using some measure to evaluate a property of (system) in
Gflops . . . (units of measure) to express such property in
when . . . (proposition)
2% . . . (quantitative factor) ~ 0.02 fraction of
of the code . . . (subject-being-processed)
is . . . has a property of being
sequential . . . sequential, i.e. steps follow one-after-another
and
98% . . . (quantitative factor) ~ 0.98 fraction of (code)
( the same code)
is . . . has a property of being
parallelizable . . . possible to re-factor
into some other form,
from a (sequential)
original form
( emphasis added )
Fact #1 )
the processor(b) ( a (device) ), from which an introduced multiprocessor ( a macro-(device) ) is internally composed, has a declared (granted) property of not being able to process more FLOPS than the said 2 GFLOPS.
This property does not say, how many actual { INTOPS | FLOPS } it will perform in any particular moment in time.
This property does say, any device, that was measured and got labeled to have indeed X {M|G|P|E}FLOPS has the very same "glass-ceiling" of not being able to perform a single more instruction per second, even when it is doing nothing at all (chopping NOP-s) or even when it is switched off and powered down.
This property is a static supreme, an artificial (in relation to real-world work-loads' instruction mixes), temperature-dependent-constant (and often degrades in-vivo not only due to thermal throttling but due to many other reasons in real-world { processor + !processor }-composed SYSTEM ecosystems )
Fact #2 )
the problem, as visible to us here, has no particular definition of what is or is not a part of the said "SYSTEM". Is it just the (multi)processor? If so, then why introduce a new, not yet defined, term SYSTEM, if it is a pure identity with the already defined & used term (multi)processor per se? Is it both the (multi)processor and memory or other peripherals? If so, then why do we know literally nothing about such an important neighbourhood (a complement) of the said (multi)processor, without which a SYSTEM would not be The SYSTEM, but a mere part of it, the (multi)processor, which is NOT a SYSTEM without its (such a SYSTEM-defining and completing) neighbourhood?
Fact #3 )
the original Amdahl's Law, often dubbed as The Law of Diminishing Returns (of extending the System with more and more resources) speaks about SYSTEM and its re-organised forms, when comparing the same amount and composition of work, as performed in original SYSTEM (with a pure-[SERIAL] flow of operations, one-step-after-another-after-another), with another, improved SYSTEM' (created by re-organising and extending the original SYSTEM by adding more resources of some kinds and turning such a new SYSTEM' into operating more parts of the original work-to-be-done in an improved organisation of work, where more resources can & do perform parts of work-to-be-done independently one on any other one ~ in a concurrent, some parts even in a true parallel fashion, using all degrees of parallelism the SYSTEM' resources can provide & sustain to serve).
Given that no particular piece of information was present about a SYSTEM, the less about a SYSTEM', we have no right to use The Law of Diminishing Returns to address the problem as defined above. Having no facts does not give us a right to guesstimate, the less to turn to feelings-based evidencing, if we strive to remain serious with ourselves, don't we?
Given (a) and (b) above, the only fair to achieve claim, that indeed holds true, can be to say :
"From what has been defined so far, we know that such a multiprocessor will never work on more than 100 x 2 GFLOP per second of time."
There is zero other knowledge to claim a single bit more (and yet we still have to silently assume that such above claimed peak FLOP-s have no side-effects and remain sustainable for at least one whole second (see remark 2)) -- otherwise even this claim would become skewed
An extended, stronger version :
"No matter what kind of code is actually being run, for this above specified multiprocessor, we cannot say more than that such a multiprocessor will never work on more than 100 x 2 GFLOPS in any moment of time."
Remarks :
1) see how this is so often misused by the promotion of "Exaflops performance" by marketing people, when FMUL f8,f8 is being claimed and "sold" to the public as if it "looks" equal to FMUL f512,f512, which it by far is not, using the same yardstick to measure, is it?
2) a similar skewed argument (if not straight misinformation) has been countless times repeated in the (false) claim that the world's "largest" femtosecond LASER was capable of emitting a light pulse carrying more power than XY Suns (a-WOW-moment!), without adding how long it took to pump up the energy for a single such femtosecond-long ( 1 [fs] ~ 1E-15 [s] ) "packet-of-a-few-photons" ... careful readers have already rejected the a-"WOW"-moment artificial stupidity, for it not being possible to carry such an astronomic amount of energy, as the energy of XY Suns, on a tiny, energy-poor planet, the less to carry it "over" a wire towards that "superpower" LASER )
If 2% is the run-time percentage for the serial part, then you cannot surpass a 50x speedup. This means you cannot surpass 50x the gflops of the serial version.
If the unoptimized program ran at 2 gflops, fully serial, then the optimized version with perfect scaling compresses 98% of the runtime down to 0.98%.
2% plus 0.98% is equivalent to ~3% as the new total run time. This means the program is spending 2/3 of the time in the serial part and only 1/3 in the parallelized part. If the parallel part runs at 200 gflops, then you have to average it over the whole 3/3 of the time: 200 gflops for 1 microsecond and 2 gflops for 2 microseconds.
This is roughly equal to 67 gflops. If there is a single-core turbo to boost the serial part, then a 20% turbo boost over 2/3 of the time means shaving ~13% off the total run time, hence a 15%-20% higher gflops average. Turbo core frequency is important even if it boosts only a single core.
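The time-based averaging above can be sketched as:

```python
# 2 time units of serial code at 2 gflops, plus the original 98 units
# compressed to 0.98 by perfect scaling at 200 gflops on 100 cores.
serial_t = 2.0
parallel_t = 0.98
total_t = serial_t + parallel_t                       # ~2.98, the "~3%" above
avg = (serial_t * 2 + parallel_t * 200) / total_t     # time-weighted gflops, ~67
```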

Find Serial and parallel percentage of code

If I know the job completion time of a 2-processor system and of a 4-processor system, how do I calculate the time taken (Ts) by a 1-processor system? I want to know this so I can find the serial percentage of any given code using the equation
Ts/Tp = 1/(S + [(1-S)/N])
where S is the serial percentage of the code and N the number of processors.
I think the time taken would be the product of the job completion time and the number of processors:
jobCompletionTime * n (number of processors)
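One way to recover both unknowns, as a sketch, assuming the equation Tp = Ts*(S + (1-S)/N) holds exactly for both measurements (the function name is mine):

```python
def serial_fraction(t2, t4):
    """Recover Ts and S from T(N=2) and T(N=4).

    Tp = Ts*(S + (1-S)/N) gives two linear equations in
    a = Ts*S and b = Ts*(1-S):  t2 = a + b/2,  t4 = a + b/4.
    """
    a = 2 * t4 - t2          # Ts*S
    b = 4 * (t2 - t4)        # Ts*(1-S)
    ts = a + b               # = 3*t2 - 2*t4
    return ts, a / ts        # (Ts, S)

# Check: Ts=100, S=0.2 gives T2=60 and T4=40; both are recovered.
serial_fraction(60, 40)      # -> (100, 0.2)
```

Note this also shows that Ts = jobCompletionTime * N only holds in the fully parallel case S = 0.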

Optimal number of filters in a Convolutional network

I'm building a convolutional network for image classification purposes; my network is inspired by the VGG conv network, but I changed the number of layers and filters per layer because my image dataset is quite simple.
Nevertheless, I'm wondering why the number of filters in VGG is always a power of 2: 64 -> 128 -> 256 -> 512 -> 4096.
I guessed that's because each pooling divides the output size by 2 x 2, and therefore one would want to multiply the number of filters by 2.
But I'm still wondering what the real reason behind this choice is; is it for optimization? Is it easier to distribute calculation? And should I keep this logic in my network?
Yes, it is mainly for optimization. If the network is going to run on a GPU, threads in GPUs come in groups and blocks; normally a group is 32 threads.
Roughly speaking, if you have a layer with 40 filters, you will need 2 groups = 64 threads. So why not make use of the remaining threads and make the layer 64 filters that can be computed in parallel?
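The rounding argument can be sketched as follows (32 threads per group is typical for current NVIDIA GPUs, but is hardware-dependent):

```python
WARP = 32                          # threads per group on a typical GPU

def threads_needed(filters):
    """Round a filter count up to a whole number of thread groups."""
    groups = -(-filters // WARP)   # ceiling division
    return groups * WARP

threads_needed(40)   # -> 64: 24 of the launched threads would sit idle
threads_needed(64)   # -> 64: a power-of-two count wastes none
```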

What does a multiplexer do in CPU?

I had designed a simple ALU, and I generated "operation codes" using a decoder. Now I'm studying multiplexers, but I can't understand what they do in a CPU or ALU.
A really simple example: If you want to fetch a data bit from memory, a multiplexer allows you to specify an address (the input code), and the memory bit will be connected to another "pin".
So say you have 256 bits of memory and you want to connect this to an output pin; the multiplexer has 8 bits for input codes. You provide a code, say N, and bit N is connected through the logic gates to the output of the multiplexer. This multiplexer would have a total of 256 + 8 input lines.
I'm not sure how this would be implemented in more modern CPUs, but you can probably see how several one-bit multiplexers could be stacked together and used to fetch a byte from memory in parallel as well, and connected to, say, an arithmetic register to perform computations.
Fun right?!
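The 256-bit example above can be sketched as a behavioural model (not gate-level):

```python
def mux(data_bits, select):
    """256-to-1 multiplexer: 256 data inputs plus 8 select lines
    (264 input lines in total), one output."""
    assert len(data_bits) == 256 and 0 <= select < 256
    return data_bits[select]

bits = [0] * 256
bits[42] = 1
mux(bits, 42)    # -> 1: select code N routes bit N to the output
```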

How do bits become a byte? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
Possibly the most basic computer question of all time, but I have not been able to find a straightforward answer and it is driving me crazy. When a computer 'reads' a byte, does it read it as a sequential series of ones and zeros one after the other, or does it somehow read all 8 ones and zeros at once?
A computer system reads data in both ways, depending upon the type of operation and how the digital system is designed. I'll explain this with a very simple example of a full adder circuit.
A full adder adds binary numbers and accounts for values carried in as well as out (Wikipedia)
Example of Parallel operation
Suppose in some task we need to add two 8-bit (1-byte) numbers such that all bits are available at the time of addition.
Then in that case we can design a digital system with 8 full adders (1 for each bit).
Example of Serial Operation
In some other task you observe that all 8 bits will not be simultaneously available.
Or you think having 8 separate adders is costly, as you need to implement other mathematical operations as well (like subtraction, multiplication and division). So instead of having 8 separate units, you have 1 unit which processes bits individually. In this scenario we will need three storage units (shift registers), such that two storage units store the two 8-bit numbers and one stores the result. At a given clock pulse, a single bit is transmitted from each of the two registers to the full adder, which performs the addition and transfers the 1-bit result to the result shift register in a single clock pulse.
This figure contains some additional stuff which is not useful for this thread, but you can study digital logic design and computer architecture if you want to go deeper into this stuff.
Shift register
Shift register operations demo
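The serial scheme above can be sketched as a behavioural model, one bit per clock pulse, LSB first, using the standard full-adder equations:

```python
def serial_add(a_bits, b_bits):
    """Bit-serial addition: one full adder plus a carry flip-flop."""
    carry, result = 0, []
    for a, b in zip(a_bits, b_bits):            # one bit pair per clock pulse
        result.append(a ^ b ^ carry)            # full-adder sum bit
        carry = (a & b) | (carry & (a ^ b))     # full-adder carry-out
    return result

# 8-bit example, LSB first: 3 + 6 = 9
a = [1, 1, 0, 0, 0, 0, 0, 0]                    # 3
b = [0, 1, 1, 0, 0, 0, 0, 0]                    # 6
serial_add(a, b)                                # -> [1, 0, 0, 1, 0, 0, 0, 0] == 9
```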
This is really kind of outside the scope of Stackoverflow, but it brings back such fond memories from college.
It depends. Sometimes a computer reads bits one at a time. For example, over older Ethernet, Manchester code is used. However, over old parallel printer cables, there were 8 pins, each one signaling a bit, and an entire octet (byte) was sent at once.
In serial (one-bit-at-a-time) encodings, you're typically measuring transitions in the line or transitions against some well-defined clock source.
In parallel encodings, you're typically reading all the bits into a register at a time and latching the register.
Look up flipflops, registers, and logic gates for information on the low-level parts of this.
Bits are transmitted one at a time in serial transmission, and multiple numbers of bits in parallel transmission. A bitwise operation optionally processes bits one at a time. Data transfer rates are usually measured in decimal SI multiples of the unit bit per second (bit/s), such as kbit/s.
Wikipedia's article on Bit
the processor works with a defined register length: 8, 16, 32, 64 ... think of a register as a set of connections, one for each bit ... that's the number of bits that will be processed at once in one processor core, one register at a time ... the processor has different kinds of registers; examples are the private instruction register or the public data or address registers
Think of it this way, at least at a physical level: In a transmission cable from point A to B (A and B can be anything, hard drive, CPU, RAM, USB, etc.) each wire in that cable can transmit one bit at a time. Both A and B have a clock pulsing at the same rate. On each pulse, the sender changes the amount of power going down each wire to signify the value of the new bit(s). So, the # of wires in the cable = the # of bits that can be transmitted each "pulse". (Note: This is a very simplified and theoretical explanation).
At a software level, in the CPU, you can never address anything smaller than a byte. You can "access" and manipulate specific bits by using the bitwise operators (& (AND), | (OR), << (left shift), >> (right shift), ^ (XOR)).
In hardware, the number of bits sent each pulse is completely dependent on the hardware itself.
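A quick sketch of those bitwise operators picking out and changing single bits inside a byte:

```python
value = 0b10110010                 # one byte; addressed as a whole...
bit5 = (value >> 5) & 1            # ...but shift-and-mask reads bit 5 -> 1
cleared = value & ~(1 << 5)        # clear bit 5 -> 0b10010010
set_bit0 = value | (1 << 0)        # set bit 0 -> 0b10110011
```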
