Can FPGAs be automatically programmed to accelerate arbitrary software or is manual work required? I imagine there's nothing inherently stopping this from being possible - I'm just curious if it's currently possible as that could be a nice way to do hardware acceleration assuming the cost made sense.
One relevant technique (available for Xilinx FPGAs, for example) is Partial Reconfiguration (PR).
Partial Reconfiguration is the ability to dynamically modify blocks of logic by downloading partial bit files while the remaining logic continues to operate without interruption. Xilinx Partial Reconfiguration technology allows designers to change functionality on the fly, eliminating the need to fully reconfigure and re-establish links, dramatically enhancing the flexibility that FPGAs offer. The use of Partial Reconfiguration can allow designers to move to fewer or smaller devices, reduce power, and improve system upgradability. Make more efficient use of the silicon by only loading in functionality that is needed at any point in time.
In the literature you can also find many other examples, strategies, and techniques for changing the FPGA configuration automatically at run time. This makes it possible for an autonomous system to evolve and adapt itself to different contexts. You can find a tool for the design of dynamically Reconfigurable Embedded and Modular Systems here, and here you can find an example.
Key Technology Benefits:
Increase solution flexibility by time-multiplexing design functionality
Reduce FPGA size or count (and therefore cost) by time-sharing functionality
Reduce dynamic power consumption by loading functions on-demand
This is more of a general question about FPGA design than a specific question about code. I studied computer science but have been trying to learn more about hardware recently. I’ve been using a Xilinx FPGA to teach myself VHDL and some of the basics about hardware design, but I have a lot of gaps in my knowledge that have led to me hitting some pretty big walls in my projects. This is the most recent one.
I have a design with a couple dozen “workers”. Part of the design’s functionality depends on these workers executing compute-heavy tasks. In order to save FPGA resources, I have the workers sharing the computing circuitry and have another module to schedule access to that circuitry between the workers. The logic itself works fine and I’ve tested it in the simulator, however when I try to implement the design on the FPGA itself it never meets the timing requirements. A look at the diagram in Vivado showed me that the placer puts all of the shared computing circuitry on one side of the FPGA and all of the workers on the other side. Additionally, the routes that carry data from the workers to the computing circuitry meet timing but the routes that carry the results back to the workers are almost all failing.
So, my question is what solutions are typically used to fix data transfer problems like this in hardware design? I know that I could lower the clock rate to give the signals more time to move around, but I’m hesitant to do that since it would decrease the overall throughput of my design. On the other hand, I could place a few buffers between the shared computing circuitry and the workers (acting like a shift register), at the cost of increasing the compute time for the individual workers. What other techniques or design patterns are there for moving data around between points in an FPGA that are far apart?
The solutions you propose for reducing timing violations are correct, and they are the most common ones.
You can also:
Change the synthesis and implementation directives in Vivado to favor timing optimization over resource utilization or run time (of synthesis and implementation).
Rework your compute unit so that there is a buffer (register) after all of your logic. There are different ways to split the compute unit into sequential and combinational parts.
Place and route the critical parts of your design yourself. I have never done it, but I know it's possible (at least by setting location constraints in the .xdc file).
About adding buffers on the critical paths: if you can use a pipelined architecture, you will only add one clock cycle of latency (not a high cost for ensuring that your design works correctly).
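To illustrate why pipelining costs latency but not throughput, here is a small cycle-level sketch in Python (a software model, not HDL). The function name `simulate_pipeline` and the stage counts are assumptions made for illustration; each inserted buffer is modeled as one register that data moves through per clock cycle.

```python
def simulate_pipeline(inputs, extra_stages):
    """Cycle-level model of a path with `extra_stages` registers inserted.

    Returns (cycle, value) pairs recording when each input emerges at the
    far end. One new input is accepted every cycle. Inputs must not be
    None, because None marks an empty register in this model.
    """
    regs = [None] * extra_stages          # pipeline registers, initially empty
    pending = list(inputs)
    outputs = []
    cycle = 0
    # Run until every input has been fed in and the pipeline has drained.
    while pending or any(r is not None for r in regs):
        incoming = pending.pop(0) if pending else None
        if extra_stages == 0:
            out = incoming                # no register: arrives in the same cycle
        else:
            out = regs[-1]                # value leaving the last register
            regs = [incoming] + regs[:-1] # shift everything one stage forward
        if out is not None:
            outputs.append((cycle, out))
        cycle += 1
    return outputs
```

With one extra stage, every result arrives one cycle later than without it, but results still come out once per cycle, so the throughput is unchanged.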
In my thesis, I plan on writing a section of real-time capability comparison of single board computers:
the factors (whether they really have a real-time clock and, if they don't, whether real-time frameworks or an RTOS can still be used to give them real-time properties, and how)
what scheduling is used in their out-of-the-box kernel (for example, if round-robin is used, then AFAIK real-time scheduling cannot be achieved)
a comparison between the PandaBoard, BeagleBoard, BeagleBone, and especially the Raspberry Pi
If you have a resource or idea regarding this, I would really appreciate it. If I have left out any information, please say so and I'd be happy to provide it.
Thanks in advance.
EDIT:
I found a good answer here, but I can always appreciate any better guidance.
What makes a kernel/OS real-time?
First, an observation. Scheduling is an OS concept. Why would it matter which scheduler is used in the out-of-the-box kernel, if indeed there is such a thing as an out-of-the-box kernel? Having said that, real-time behavior is affected by both the scheduler and the hardware. But when comparing boards, I would keep the scheduler constant (or maybe pick a few) and then compare the boards. Choosing the scheduler(s) is a separate topic of its own. A couple of things to take into account are that it should be pre-emptive and able to deal with issues like priority inversion.
Note that all these boards have an MMU, which introduces latency. That shouldn't really matter, though, as long as that latency is bounded. I'd also compare the accuracy of the crystals on which the clocks are based. Note also that SoCs have low-power modes, in which they tend to switch clocks. Whenever they come out of a low-power mode, they switch from some internal oscillator to a more accurate clock source such as an external crystal. The crystal needs time to stabilise before normal operation can continue. A comparison of the latency involved in switching between power modes would also be a useful determinant.
I am trying to optimize critical parts of a C code for image processing in ARM devices and recently discovered NEON.
Having read tips here and there, I am getting pretty nice results, but there is something that escapes me. I see that overall performance is very much dependent on memory accesses and how they are done.
What is the simplest way to get an idea of how memory accesses are "bottlenecking" the subroutine? By simple I mean, if possible, not having to run the whole compiled code in an emulator or simulator, but something that can be fed small pieces of assembly and analyze them.
I know this cannot be done exactly without running the code on specific hardware under specific conditions, but the purpose is to have a trial-and-error "comparison" tool to experiment with, even if the results are only approximations.
(something similar to this great tool for cycle counting)
I think you've probably answered your own question. Memory is a system level effect and many ARM implementers (Apple, Samsung, Qualcomm, etc) implement the system differently with different results.
However, of course you can optimize things for a certain system and it will probably work well on others, so really it comes down to figuring out a way to quickly iterate and test/simulate system-level effects. This does get complicated, so you might pay some money for a system-level simulator such as the one included in ARM's RealView. Or I might recommend getting some open source hardware like a PandaBoard and using Valgrind's Cachegrind. With Linux on the PandaBoard you can write some scripts to automate your testing.
It can be a hassle to get this going but if optimizing for ARM will be part of your professional life, then it's worth the (relatively low compared to your salary) software/hardware investment and time.
Note 1: I recommend against using PLD. This is very system tuning dependent, and if you get it working well on one ARM implementation it may hurt you for the next generation of chip or a different implementation. This may be a hint that trying to optimize at the system level, other than some basic data localization and ordering stuff may not be worth your efforts? (See Stephen's comment below).
Memory access is one thing that simply cannot be modeled from "small pieces of assembly" to generate meaningful guidance. Cache hierarchies, store buffers, load miss queues, cache policy, etc.: even relatively simple processors have an enormous amount of "state" hiding underneath the LSU, and any small-scale analysis cannot accurately capture that state. That said, there are a few basic guidelines for getting the best performance:
maximize the ratio of "useful computation" instructions to LSU operations.
align your memory accesses (ideally to 16B).
if you need to pick between aligning loads or aligning stores, align your stores.
try to write out complete cachelines when possible.
PLD is mainly useful for non-uniform-but-somehow-still-predictable memory access patterns (these are rare).
For NEON specifically, you should prefer to use the vld1 and vst1 instructions (with an alignment hint). On most micro-architectures, in most cases, they are the fastest way to move between NEON and memory. Eschew v[ld|st][3|4] in particular; these are an attractive nuisance, slower than doing separate permutes on most micro-architectures in most cases.
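The alignment guideline above comes down to simple address arithmetic. As a minimal sketch (in Python purely for illustration, with hypothetical helper names `is_aligned` and `align_up`), this is how one might check whether a buffer address is 16-byte aligned and round it up to the next boundary:

```python
def is_aligned(address, alignment=16):
    """True if `address` is a multiple of `alignment` (must be a power of two)."""
    return (address & (alignment - 1)) == 0


def align_up(address, alignment=16):
    """Round `address` up to the next multiple of `alignment` (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)
```

For example, `align_up(0x1001)` gives `0x1010`, while an address that is already aligned, such as `0x1000`, is returned unchanged.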
I'm super excited about my program powering a little seven-segment display, but when I show it off to people not in the field, they always say "well what can you do with it?" I'm never able to give them a concise answer. Can anyone help me out?
First: They don't need to have volatile memory.
Indeed the big players (Xilinx, Altera) usually have their configuration on-chip in SRAM, so you need additional EEPROM/Flash/WhatEver(TM) to store it outside.
But there are others (Actel is one big player that comes to mind) that have non-volatile configuration storage on their FPGAs. This also has other advantages, as SRAM is usually not very radiation tolerant, and you have to take special measures when you go into orbit.
There are two big things that justify FPGAs:
Price - They are not cheap. But sometimes you can't do something in software and you need hardware for it. And when your required volume is below a certain point (e.g. because it's just a small series or a prototype), an FPGA is MUCH cheaper than an ASIC. Also, while developing ASICs, this allows much faster turnaround before a final state is reached.
Reconfiguration - You can reconfigure your FPGA. That is something a processor or an ASIC can't do. There are some applications where you can use this, e.g. when you need the ability to fix something in the design but you can't physically get to the device. An example: the Mars orbiters/rovers used Xilinx FPGAs. When someone finds a mistake there (or wants to switch to a different coding for transmitting data, or whatever), you can't replace the craft, as it is just not reachable. But with an FPGA you can simply reconfigure it and apply your changes. Another scenario is a single chip that can perform different kinds of acceleration depending on the situation. Imagine a smartphone: during a phone call the FPGA can be configured to do audio encoding/decoding, when surfing it can work as a compression engine, and when playing videos it can be configured as an H.264 decoder/accelerator. Another thing you can do is match your hardware to your problem instance. E.g. Cisco uses many FPGAs in their hardware. You need the hardware to perform switching/routing/packet inspection at the required speed, and you can generate matching engines directly in hardware from the actual settings.
Another thing which might come up soon (I know some car manufacturers have thought about it) is for devices which include a lot of different electronics and have a big supply chain. It's more or less a combination of price and reconfiguration. It's more expensive to have 10 ASICs than 10 FPGAs - where both perform the same task, but it's cheaper to have 10 FPGAs with just one supplier and the need to hold just 1 type of chip at service and supply than to have 10 suppliers with the necessity to hold and manage 10 different chips in supply and service.
True story.
They allow you to fix design flaws in the custom data-acquisition boards for a multi-million dollar particle physics experiment that become obvious only after you have everything installed and are doing integration work and detector characterization.
You can evolve circuits. This is somewhat old-school evolutionary algorithms, but starting from a set of random individuals you can select the circuits that score higher on a fitness function than the rest and breed them to create a new population, ad infinitum. Read up on evolvable hardware; I think this book covers FPGAs: http://www.amazon.co.uk/Introduction-Evolvable-Hardware-Self-Adaptive-Computational/dp/0471719773/ref=sr_1_1?ie=UTF8&qid=1316308403&sr=8-1
Say, for example, you wanted a DSP circuit. You have an input signal and a desired output signal. Starting with a random population, you select perhaps only the fittest (bad) or perhaps a mixture of fit and odd individuals to create the next generation. After a number of generations you can open the lid and discover that, lo and behold, evolution has taken place and you have a circuit that may even outperform your initial expectations!
Also read A Field Guide to Genetic Programming; it's free on the web somewhere.
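The select-and-breed loop described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not real evolvable hardware: an "individual" is just a list of bits standing in for a circuit configuration, the fitness function counts matches against a desired bit pattern, and selection keeps the fitter half plus the single best individual (elitism), which is a simplification of the mixed selection mentioned above. The names `fitness` and `evolve` and all parameter values are hypothetical.

```python
import random


def fitness(individual, target):
    """Number of bits that match the desired pattern."""
    return sum(1 for a, b in zip(individual, target) if a == b)


def evolve(target, pop_size=20, generations=50, mutation_rate=0.05, seed=0):
    """Return the best fitness seen at the start of each generation."""
    rng = random.Random(seed)             # seeded so runs are repeatable
    n = len(target)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda ind: fitness(ind, target), reverse=True)
        history.append(fitness(population[0], target))
        parents = population[: pop_size // 2]   # keep the fitter half
        children = [population[0][:]]           # elitism: best survives unchanged
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n)           # single-point crossover
            child = mum[:cut] + dad[cut:]
            # Mutation: flip each bit with a small probability.
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    return history
```

Because the best individual is always carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next.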
Software has limitations. Software runs as a stream of instructions at the CPU's clock rate, so a core can only complete a small, fixed number of instructions per clock cycle. Software is also high level: you do not control the details of what happens at the low level. You'll always be limited by the operating system or development board you are programming. This is true for popular development boards such as the Arduino and Raspberry Pi.
In FPGA hardware, you can precisely control what happens in each clock cycle, and many operations can run in parallel in dedicated logic. The speed is then limited by how fast electrical signals propagate through the circuit.
So an FPGA implies hardware, with parallel logic running at the speed of signal propagation, which for suitable tasks is much faster than
a CPU, which implies software executed as a sequential instruction stream.
So why use an FPGA when we could design our own hardware at the printed circuit board or transistor level?
This is because FPGAs are programmable hardware! An FPGA is built so that you can program its connections instead of wiring them up for a specific application. This also explains why FPGAs are expensive: they are a sort of 'general' or programmable hardware.
To argue why you should pick FPGAs despite their cost, the programmable hardware allows:
Longer product life cycle (you can update the programmable hardware in customers' products that contain your FPGA simply by letting them program your updated HDL design into the FPGA)
Recovery from hardware bugs. You simply let customers download the corrected design onto their FPGA. (Note: you cannot do this with fixed hardware designs, as you would have to spend millions to recall your products, create new ones, and ship them back to customers.)
For examples of the cool things FPGAs can do, refer to the final projects from Cornell's ECE5760 course.
http://people.ece.cornell.edu/land/courses/ece5760/FinalProjects/
Hope this helps!
Soon Chee Loong,
University of Toronto
FPGAs are also used to test and research circuit designs before mass production starts. This happens in several sectors: image processing, signal processing, etc.
Edit - after a few years we can now see more practical applications, including finance and machine learning:
aerospace
emulation
automotive
broadcast
high performance computers
medical
machine learning
finance (including cryptocoins)
I like this article: http://www.hpcwire.com/hpcwire/2011-07-13/jp_morgan_buys_into_fpga_supercomputing.html
My feeling is that FPGAs can sit directly in your streaming data at the point where it enters the systems under your control. You can then crunch that data without going through the steps a GPGPU would require (bringing the data in off the network, passing it across the PCI Express bus, and crunching it a Gb at a time).
There are good reasons for both, but I think the notion of whether you mind buffering the data is a good bellwether.
Here's another cool FPGA application:
https://ehsm.eu/m-labs.hk/m1.html
Automotive image processing is one interesting domain:
Providing lane-keeping support to the driver (disclosure: I wrote this page!):
http://www.conekt.co.uk/capabilities/50-fpga-for-ldw
Providing an aerial view of a car from 4 fisheye-lens cameras (with video):
http://www.logicbricks.com/Solutions/Surround-View-DA-System/Xylon-Test-Vehicle.aspx
I found this concept in a paper on dynamic instrumentation, but I couldn't find an explanation of it. Please explain, if possible...
EDIT: or is there any tutorial on how to achieve lightweight dynamic instrumentation (in user space, for syscalls and normal function calls)?
EDIT(Added paper details):
A code generation approach to optimizing high-performance distributed data stream processing
Abstract:
We present a code-generation-based optimization approach to bringing performance and scalability to distributed stream processing applications. We express stream processing applications using an operator-based, stream-centric language called SPADE, which supports composing distributed data flow graphs out of toolkits of type-generic operators. A major challenge in building such applications is to find an effective and flexible way of mapping the logical graph of operators into a physical one that can be deployed on a set of distributed nodes. This involves finding how best operators map to processes and how best processes map to computing nodes. In this paper, we take a two-stage optimization approach, where an instrumented version of the application is first generated by the SPADE compiler to profile and collect statistics about the processing and communication characteristics of the operators within the application. In the second stage, the profiling information is fed to an optimizer to come up with a physical data flow graph that is deployable across nodes in a computing cluster. This approach not only creates highly optimized applications that are tailored to the underlying computing and networking infrastructure, but also makes it possible to re-target the application to a different hardware setup by simply repeating the optimization step and re-compiling the application to match the physical flow graph produced by the optimizer. Using real-world applications, from diverse domains such as finance and radio-astronomy, we demonstrate the effectiveness of our approach on System S -- a large-scale, distributed stream processing platform.
Instrumentation means inserting code into a stream of instructions whose purpose is to measure something -- execution time, function calls, data access, all sorts of things relating to profiling. That's one of two ways to do profiling, and it's the more accurate but slower one. The other one is sampling, where you periodically interrupt the program and look at its current state. This has less performance impact but isn't as accurate, especially for short runs.
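As a minimal sketch of the instrumentation approach (not the sampling one), the Python decorator below inserts measurement code around a function so that every call is counted and timed. The names `instrumented`, `call_stats`, and `square` are hypothetical and chosen only for this example.

```python
import functools
import time

call_stats = {}  # function name -> [call count, total seconds]


def instrumented(func):
    """Wrap `func` so that every call is counted and timed."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failed calls are measured too.
            entry = call_stats.setdefault(func.__name__, [0, 0.0])
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper


@instrumented
def square(x):
    return x * x
```

The extra bookkeeping on every call is exactly the overhead that makes instrumentation slower than sampling.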
Without knowing what paper you are referencing it is difficult to be sure, but in general it would be a place in the code that has a "hook" for instrumentation.
That is, it is coded so it can be dynamically instrumented, so some measurements can be recorded about how the code runs.
Whether this would be for time spent in a method, power consumption or something else depends on what and how it is being instrumented.
It would be useful to see a link to the paper for the context.
In a tool such as SystemTap or gdb, an instrumentation point is any place in the code whose execution can yield an event. For "dynamic" instrumentation, there is usually no need to compile a hook into the code; the tool just needs to determine a PC address where a breakpoint can be inserted.
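A rough user-space analogy in Python, offered as a sketch rather than as how SystemTap or gdb work internally: `sys.settrace` lets a tool attach a hook at run time to code that was never written with instrumentation in mind. Here the "instrumentation point" is the entry of any function with a given name. The helper `count_calls` and the example function `fib` are hypothetical.

```python
import sys


def count_calls(target_name, func, *args):
    """Run func(*args) and count calls to any function named `target_name`.

    The traced code is not modified; the hook is attached and detached
    at run time.
    """
    counts = {"calls": 0}

    def tracer(frame, event, arg):
        # The global trace function is invoked with "call" for each new frame.
        if event == "call" and frame.f_code.co_name == target_name:
            counts["calls"] += 1
        return None  # no per-line tracing inside the frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # detach the hook again
    return result, counts["calls"]


def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

Calling `count_calls("fib", fib, 5)` returns both the normal result and the number of times the instrumentation point fired, without `fib` containing any hook of its own.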