understanding parallelism of FPGAs

I am having a bit of a problem understanding the benefits of FPGAs for parallel processing. Everybody says they are parallel, but it looks to me like they are not truly parallel. Let's look at this example:
I have a data signal coming in on some pin, at 1 bit per clock cycle. The FPGA will receive this data, and since it already has the data inside the integrated circuit, it can start processing right away. But this is called serial processing, not parallel. If the FPGA waits for the data to accumulate, to later process it in parallel, then we can say the FPGA's processing is truly parallel, but what is the benefit of waiting for the data to arrive in large quantities? We just lose time; for example, if we wait for 8 bits of data, we lose 7 cycles. So where is the benefit of the parallelism of FPGAs? I can't get it.
It would be parallel if the data were coming in parallel, like when you use the old DB-25 parallel port connector. But that technology became obsolete since the parallel port cannot support high speeds. Today's USB standard is serial, Ethernet is serial, so... where is the parallelism?

The parallelism comes in when you have data that arrives in chunks, the chunks arrive faster than they can be processed, and the chunks can be processed individually. Rather than having to slow down the data sender, an FPGA allows you to add more processing "blocks" so that the processing goes faster.
Example:
You receive data (serially or in parallel, it doesn't matter) at 1 MB/s in 50 kB chunks, but your algorithm only allows 1 chunk to be processed per second. In an FPGA, you can wire up the "receiver" to distribute the chunks across 20 "processors", so your sender can still send at full speed, and your receiver sees less overall lag. The arithmetic works out exactly: chunks arrive at 20 per second (1 MB/s ÷ 50 kB), so 20 processors that each handle one chunk per second keep up with the sender.
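A minimal software analogy of that fan-out (process_chunk is a hypothetical stand-in for the slow per-chunk algorithm; on a real FPGA the 20 "processors" would be parallel hardware blocks, not threads) might look like this:

    #include <future>
    #include <utility>
    #include <vector>

    using Chunk = std::vector<char>;

    // Stand-in for the algorithm that needs a full second per 50 kB chunk.
    void process_chunk(Chunk chunk) { (void)chunk; }

    int main() {
        std::vector<std::future<void>> in_flight;
        for (int i = 0; i < 200; ++i) {          // chunks arriving at 20 per second
            Chunk chunk(50 * 1024);              // one 50 kB chunk
            in_flight.push_back(
                std::async(std::launch::async, process_chunk, std::move(chunk)));
            if (in_flight.size() >= 20) {        // at most 20 "processing blocks"
                in_flight.front().get();         // wait for the oldest to finish
                in_flight.erase(in_flight.begin());
            }
        }
        for (auto& f : in_flight) f.get();       // drain the remaining chunks
    }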

Parallelism has several levels, which need to be understood if you want to understand computer architectures. FPGAs are just a tool to build a "computer".
The levels are:
bit level: multiple bits or data words are processed in parallel.
For example, you can build adders of 8 bits, 32 bits, or 4096 bits, which add two integer numbers in just one cycle (see the sketch after this list)
instruction level: multiple instructions of one control flow are executed in parallel
=> pipelining, superscalar architecture
thread level: multiple control flows are executed in parallel
=> multi-threading, multi-core, n-socket systems
application level: execute multiple applications in parallel
=> multi-processing
dataflow processing: everything in parallel :)
FPGAs can use each level to do everything in parallel.
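As a rough illustration of the bit-level point above (plain software, not FPGA code; add_bit_serial is a made-up name), here is a sketch contrasting a bit-serial ripple add, one bit position per step, with a word-wide add that handles all positions at once, which is what a hardwired adder gives you in one clock:

    #include <cstdint>
    #include <cstdio>

    // Bit-serial ripple add: one bit position per loop iteration, the way a
    // 1-bit-per-cycle serial design would do it.
    uint32_t add_bit_serial(uint32_t a, uint32_t b) {
        uint32_t sum = 0, carry = 0;
        for (int i = 0; i < 32; ++i) {
            uint32_t ai = (a >> i) & 1u, bi = (b >> i) & 1u;
            sum |= (ai ^ bi ^ carry) << i;
            carry = (ai & bi) | (ai & carry) | (bi & carry);
        }
        return sum;
    }

    int main() {
        uint32_t a = 123456, b = 654321;
        // a + b handles all 32 bit positions at once: bit-level parallelism.
        printf("%u %u\n", a + b, add_bit_serial(a, b));  // same value twice
    }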

Related

Why does reading from multiple SSDs result in lower throughput than reading from a single SSD

I'm writing an application that replicates data on three SSDs. The application then handles read requests by randomly assigning each request to one of the three SSDs, so in theory all SSDs should be used equally. Note that I'm using a thread pool, so the read requests are processed concurrently.
However, when comparing the read throughput of this setup against the read throughput of just one single SSD, I get the surprising result that the 3-SSD setup actually has lower read throughput. What could be the cause of this?
You may have multiple CPUs handling multiple processes and threads at the same time, but at the end of the day, your SSDs are all using the same bus on the board. That's the chokepoint you have there.
To make a very cheap analogy: you are trying to feed three different babies from different plates, but you have only one spoon.
Maybe using a cluster or the cloud might do the trick for you, if parallelization is important.

Can parallel processing be achieved?

Can an MCU really do parallel processing?
Let's just say that I want to count down, send data through another interface, and do one more piece of work such as lighting up an LED, all at the same time.
Is that even possible?
A processor with multiple execution units or cores can perform parallel processing. Most microcontrollers do not have multiple execution units.
Some architectures support SIMD (Single Instruction, Multiple Data) instructions that can generate multiple results from a single instruction; this is a low-level form of parallel processing. Similarly, DSPs (Digital Signal Processors) and microcontrollers with DSP instructions support dual or multiple MAC (multiply/accumulate) units, which are also a form of parallel processing. Both SIMD and MAC are used primarily for number crunching and signal-processing applications. High-end DSPs often support other instruction-level parallel execution capabilities.
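As a concrete illustration of SIMD, here is a minimal x86 SSE sketch in C++ (assuming an SSE-capable CPU); a single _mm_add_ps instruction produces four sums at once:

    #include <immintrin.h>
    #include <cstdio>

    int main() {
        alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        alignas(16) float r[4];
        __m128 va = _mm_load_ps(a);
        __m128 vb = _mm_load_ps(b);
        __m128 vr = _mm_add_ps(va, vb);   // four float additions in one instruction
        _mm_store_ps(r, vr);
        printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);   // 11 22 33 44
    }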
Another low-level architectural feature that allows parallel execution is pipelined execution. This allows instructions that may take multiple cycles to run to generate one result per cycle, by running different stages of successive instructions simultaneously.
Most microcontrollers can support a multi-tasking or multi-threading scheduler that gives the impression of concurrent execution by allotting execution time to each task according to the scheduling algorithm used. While this is not parallel processing, and in fact adds overhead rather than accelerating processing, it is useful in other ways, such as functional partitioning of the code and, in the case of a real-time priority-based preemptive scheduler, achieving real-time response to events. For the example use case you give in your question, this form of scheduling is entirely appropriate and adequate. See Real-Time Operating System (RTOS).
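For the question's concrete use case, here is a minimal desktop sketch using std::thread (on a single-core MCU you would create RTOS tasks instead; the printf calls are stand-ins for real hardware access):

    #include <chrono>
    #include <cstdio>
    #include <thread>

    using namespace std::chrono_literals;

    void countdown() {
        for (int i = 10; i > 0; --i) { printf("t-minus %d\n", i); std::this_thread::sleep_for(100ms); }
    }
    void send_data() {                      // stand-in for another interface
        for (int i = 0; i < 5; ++i) { printf("tx packet %d\n", i); std::this_thread::sleep_for(200ms); }
    }
    void blink_led() {                      // stand-in for toggling a GPIO pin
        for (int i = 0; i < 10; ++i) { printf("LED %s\n", i % 2 ? "off" : "on"); std::this_thread::sleep_for(100ms); }
    }

    int main() {
        std::thread t1(countdown), t2(send_data), t3(blink_led);
        t1.join(); t2.join(); t3.join();    // all three activities interleave
    }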
Microcontroller architectures that do support true parallel processing include XMOS, PicoChip, and the Cell processor. Historically, the Transputer pioneered parallel processing in microprocessors.
A way of achieving a high level of parallelism at a low level, where individual operations of the same process can occur simultaneously (when one does not depend on the result of another, or a pipeline is used), is to implement the process on an FPGA - essentially to implement the processing in hardware rather than software, although the languages used to program FPGAs share similarities with software languages.
A company named Parallax makes an 8-core MCU called the Propeller that does parallel processing. Their programming language "Spin" is interesting: object-oriented and script-ish, but it also has inline assembly.

My program uses only 25% of CPU power

My single-threaded program uses only 25% of the CPU on a 2-core Intel i5-3210M. Why not 50% (one core)? The program is being tested on a MacBook Pro running 64-bit Windows 7. I think the problem is Hyper-Threading: because of it, the program uses only one logical core (25% of CPU power). How can I give more CPU power to my program?
This is important to me because the program works with a big data set and takes about 30 hours to finish its calculations.
That is expected, as you said, with your CPU (which has 4 logical processors). You can search for ways of transforming your program to use more than one thread. I recommend searching for "parallel programming", "concurrent programming", and "multi-threading". If you are using MS VC++, the PPL library is very easy to use. OpenMP is a more powerful tool, which is also available on Linux. There are many more approaches and libraries for this, but you need to choose according to your OS, compiler, environment, programming language, and problem.
However, the easiest solution is to run it on a desktop machine with a better CPU and cross your fingers to get the results as quickly as possible.
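Since PPL and OpenMP were mentioned, here is a minimal OpenMP sketch (the loop body is a placeholder for the real calculation; compile with -fopenmp on GCC/Clang or /openmp on MSVC):

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 100000000;
        std::vector<double> data(n, 1.0);
        double sum = 0.0;
        // Iterations are split across all logical cores; the reduction clause
        // gives each thread a private partial sum and combines them at the end.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += data[i] * data[i];
        printf("%f\n", sum);
    }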
This program uses only one logical core (25% of cpu power). How can I give more CPU power to my program? ...this program works with big set of data ... it takes about 30 hours to finish calculations.
Divide your data set into (at least) 4 separate pieces. With that much data, you want to think in terms of indexes into the data rather than copying data elements into 4 separate structures. Create a separate thread for each segment of your data, and have each thread process only its own segment. You may need to set processor affinity for your threads.
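A minimal sketch of that segmenting approach (process() is a placeholder for the real per-element work; the thread count and data size are arbitrary):

    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // Placeholder for the real per-element calculation.
    void process(std::vector<double>& data, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            data[i] *= 2.0;
    }

    int main() {
        std::vector<double> data(1000000, 1.0);
        const std::size_t nthreads = 4;           // one thread per logical core
        const std::size_t step = data.size() / nthreads;
        std::vector<std::thread> threads;
        for (std::size_t t = 0; t < nthreads; ++t) {
            std::size_t begin = t * step;         // each thread gets an index range
            std::size_t end = (t + 1 == nthreads) ? data.size() : begin + step;
            threads.emplace_back(process, std::ref(data), begin, end);
        }
        for (auto& th : threads) th.join();
    }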
If the data streams in, or must be processed in order, think in terms of queueing elements for processing, where individual threads then dequeue and process each item. This works well when the enqueue operation is relatively fast compared to processing an item: enqueueing can be done by a single master thread, while each dequeue/process operation is more expensive.
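A minimal sketch of that queueing pattern, with placeholder names (Item and handle() stand in for the real work items and processing):

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Item { int id; };
    std::queue<Item> q;
    std::mutex m;
    std::condition_variable cv;
    bool finished = false;

    void handle(const Item&) { /* expensive per-item processing */ }

    void worker() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !q.empty() || finished; });
            if (q.empty()) return;     // finished and nothing left to do
            Item it = q.front(); q.pop();
            lock.unlock();
            handle(it);                // expensive part runs outside the lock
        }
    }

    int main() {
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i) workers.emplace_back(worker);
        for (int i = 0; i < 1000; ++i) {           // fast master enqueue loop
            { std::lock_guard<std::mutex> lock(m); q.push({i}); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> lock(m); finished = true; }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }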
Choosing the correct number of threads is tricky. Modern CPUs and operating systems are designed to switch tasks from time to time. This will always be an expensive operation, but the scheduler will want to do something else every so often, even if your process seems like the best candidate. Therefore, you can often get the best throughput by overloading your CPUs to a small extent, so you may want two or three threads per logical CPU. One way to manage this is through the use of a ThreadPool object.

multi core and parallel processing

What is the difference between parallel processing and multi-core processing?
Parallel and multi-core processing both refer to the same thing: the ability to execute code at the same time (on more than one core/CPU/machine). So in this sense, multi-core is just one means of doing parallel processing.
On the other hand, concurrency (which is probably what you mean by parallel processing) refers to having multiple units of execution (threads or processes) that are interleaved. This can happen on a single-core CPU, on many cores/CPUs, or even on many machines (clusters).
Summing up: multi-core is a subset of parallel, and concurrency can occur with or without parallelism. The field that studies execution spread across many machines is distributed systems or distributed computing.
Parallel processing just refers to a program running more than one part simultaneously, usually with the different parts communicating in some way. This might be on multiple cores, multiple threads on one core (which is really simulated parallel processing), multiple CPUs, or even multiple machines.
Multicore processing is usually a subset of parallel processing.
Multicore processing means code working on more than one "core" of a single CPU chip. A core is like a little processor within a processor. So making code work for multicore processing will nearly always be about the parallelization aspect (though it also includes removing any core-specific assumptions, which you shouldn't normally have anyway).
As far as algorithm design goes, if it is correct from a parallel-processing point of view, it will be correct on multicore.
However, if you need to optimise your code to get it to run as fast as possible "in parallel", then the differences between multicore, multi-CPU, multi-machine, and vectorised will make a big difference.
Parallel processing can be done inside a single core with multiple threads.
Multi-Core processing means distributing those threads to make use of the multiple cores in a CPU.

Configurable processor implemented on FPGA board

For a university mid-term project I have to design a configurable processor, write the code in VHDL, and then synthesize it on a Spartan-3E FPGA board from Digilent. I'm a beginner, so could you point me to some information about configurable processors and to some ideas related to the concept?
You can check out my answer to a related question. We did nearly the same thing, building a CPU in VHDL for an FPGA board.
This is just a mockup, so please be aware that I will clean it up:
fetch instruct1
fetch instruct2, fetch data1
fetch instruct3, fetch data2, process data1
fetch instruct4, fetch data3, process data2, store data1
fetch instruct5, fetch data4, process data3, store data2
fetch instruct6, fetch data5, process data4, store data3
fetch instruct7, fetch data6, process data5, store data4
fetch instruct8, fetch data7, process data6, store data5
Basically, these are the main components of a processor:
Part 1
ALU: Arithmetic Logic Unit (this is where drawing would come in handy).
An ALU has two input ports and an output port. The two input ports get operated on and the result is output. To tell the ALU which instruction to perform, there is a control port.
Basically, this carries the name of the command: if the control port has 4 bits, there are 16 possible instructions.
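A software model of such an ALU might look like the following sketch (the opcode assignments here are invented for illustration):

    #include <cstdint>
    #include <cstdio>

    // Two input ports, a 4-bit control port selecting one of up to 16
    // operations, and one output port.
    uint16_t alu(uint16_t a, uint16_t b, uint8_t control /* 4 bits */) {
        switch (control & 0xF) {
            case 0x0: return a + b;       // ADD
            case 0x1: return a - b;       // SUB
            case 0x2: return a & b;       // AND
            case 0x3: return a | b;       // OR
            case 0x4: return a ^ b;       // XOR
            case 0x5: return ~a;          // NOT
            case 0x6: return a << 1;      // shift left
            case 0x7: return a >> 1;      // shift right
            default:  return 0;           // 8 codes left for more instructions
        }
    }

    int main() {
        printf("%u\n", alu(6, 3, 0x0));   // 9: ADD selected by the control port
    }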
Part 2
REGISTER UNIT: This is a set of memory cells (cache memory). The contents of this memory are often transferred to the ALU's input ports.
Part 3
Control Unit: This is sort of like the orchestra conductor of the CPU. Its job is to:
1. send the data to the ALU inputs
2. read which instruction needs to happen from the Instruction Register, and send those codes to the ALU control port
Interface: This is how the RAM and other peripherals communicate with the CPU.
Every time an instruction outputs a result, it has to be stored. It can be stored in RAM, so a RAM write must be ready once the result is ready. At the same time, a RAM read of the next instruction's inputs can occur. And also at the same time, the instruction after that can be fetched from RAM.
Executing one instruction usually requires more than one clock cycle. Processing an instruction is analogous to industrial production, so the work is done as on an assembly line.
VLIW: The programs we write are linear, meaning instructions happen one after the other. But CPUs today (not all ARMs, though) have multiple ALUs, so multiple instructions are processed at the same time.
So you have processing units working on multiple instructions at the same time, assembly-line style (pipelining),
and you have a lot of those units (superscalar).
It then becomes a question of what you can or need to do to tailor your CPU architecture.
I did a similar project, implementing a processor with a 5-stage pipeline in VHDL.
First things first, you have to understand the architecture of how processors work. Without understanding what each stage is doing and what kind of control signals you need, you've got no hope of actually writing one in VHDL.
Secondly, start drawing diagrams of how instructions and data will flow through your processor (i.e. through each stage). How is each stage hooked up to the others? Where do the control signals go? Where do my inputs come from and where do my outputs go?
Once you have a solid diagram, the actual implementation in VHDL should be relatively straightforward. You can use VHDL's behavioural modelling to essentially describe exactly what you see in the diagram.
