I am designing a cell SPE processor. I have to check for data hazards before I take data from register file. What is the best way to identify the number of stall cycles necessary and what design decisions should I take about where to stall the pipeline? I am looking for a general view on the subject.
Related
This is more of a general question about FPGA design than a specific question about code. I studied computer science but have been trying to learn more about hardware recently. I’ve been using a Xilinx FPGA to teach myself VHDL and some of the basics about hardware design, but I have a lot of gaps in my knowledge that have led to me hitting some pretty big walls in my projects. This is the most recent one.
I have a design with a couple dozen “workers”. Part of the design’s functionality depends on these workers executing compute-heavy tasks. In order to save FPGA resources, I have the workers sharing the computing circuitry and have another module to schedule access to that circuitry between the workers. The logic itself works fine and I’ve tested it in the simulator, however when I try to implement the design on the FPGA itself it never meets the timing requirements. A look at the diagram in Vivado showed me that the placer puts all of the shared computing circuitry on one side of the FPGA and all of the workers on the other side. Additionally, the routes that carry data from the workers to the computing circuitry meet timing but the routes that carry the results back to the workers are almost all failing.
So, my question is what solutions are typically used to fix data transfer problems like this in hardware design? I know that I could lower the clock rate to give the signals more time to move around, but I’m hesitant to do that since it would decrease the overall throughout of my design. On the other hand, I could place a few buffers between the shared computing circuitry and the workers (acting like a shift register), at the cost of increasing the compute time for the individual workers. What other techniques or design patterns are there for moving data around between points in an FPGA that are far apart?
Indeed the solutions you propose to reduce timing violations are rights and the most common.
You can also :
Modify synthesis and implementation directives in Vivado to prefer timing optimization than ressources utilization or compute time (of the synthesis and implementation).
Rework your compute unit to ensure that there is a buffer after all of your logic. Indeed you have different ways to segment your compute unit between sequential part and combinationnal part.
Place and route critical parts of your design by yourself. I never did it but I know it's possible (at least set location constraints in .xdc).
About adding buffers on the critcial paths : if you can do a piplined architecture, you will only add one latency clock cycle (It's not a high cost to ensure your design will work correctly).
Can FPGAs be automatically programmed to accelerate arbitrary software or is manual work required? I imagine there's nothing inherently stopping this from being possible - I'm just curious if it's currently possible as that could be a nice way to do hardware acceleration assuming the cost made sense.
One of the techniques (available for Xilinx FPGAs for example) is the PR (Partial Reconfiguration).
Partial Reconfiguration is the ability to dynamically modify blocks of logic by downloading partial bit files while the remaining logic continues to operate without interruption. Xilinx Partial Reconfiguration technology allows designers to change functionality on the fly, eliminating the need to fully reconfigure and re-establish links, dramatically enhancing the flexibility that FPGAs offer. The use of Partial Reconfiguration can allow designers to move to fewer or smaller devices, reduce power, and improve system upgradability. Make more efficient use of the silicon by only loading in functionality that is needed at any point in time.
Anyway, in the literature, you can find a lot of other differents example and differents strategy and techniques to change, runtime and automatically, the FPGA configuration. This gives the possibility (for an autonomous system) to evolve and to adapt themselves to several contexts. You can find a a tool for the design of dynamically Reconfigurable Embedded and Modular Systems Here and here you can find an example.
*
Key Technology Benefits
*
Increase solution flexibility by time-multiplexing design functionality
Reduce FPGA size or count (and therefore cost) by time-sharing functionality
Reduce dynamic power consumption by loading functions on-demand
I'm trying to make a post gate level simulation for a pipelined processor.
I have the net list in vhdl format and I need now to simulate it again to be sure the functionality is right after the synthesis.
The problem is I have a 2 rams one for instructions and the other for data, in post gate level simulation I don't have the ability to view the memory list view and load my instructions and data into my 2 rams.
How can I insert my data into the rams as they have been translated into flip flops and muxs?
Thanks in advance.
From your description I would assume that the two RAMs are for instruction cache and data cache. Since these are usually of a significant size, even on smaller processors, I would doubt that these RAMs are implemented in flip flops and muxes. My first suggestion would therefore be that you check the netlist to see if the RAMs are actually separate RAM primitive modules.
The reason is that RAM primitive modules may sometimes (depending on model) be initialized with the contents of a file. In that case you just need to make a file with the right format.
An alternative if RAM primitive modules are actually in the netlist, but don't allow initialization, is to substitute the RAM primitive modules with your own version that can be initialized.
If the RAMs are actually converted to flip flops and muxes, then the processor may support some cache manipulation instructions, usually available from protected (kernel) mode. These instructions can be used to load the instruction cache and data cache with contents provided by the executed program. Loading the cache RAMs in this way may take numerous instructions, thus some simulation time.
Last, you may consider not to spend that much time on gate level simulation. It may be OK to run a little, just to be sure that the netlist is OK, but commercial well known synthesis tools are generally of a high quality, so these are unlikely to be the reason for bugs in your design. The risk of bugs is much larger in the dedicated design for the project, so you may want to spend more time on functions verification and code review ;-)
I found the concept as in a paper on dynamic instrumentation. But I couldnt find the explanation of this concept. Please explain, if possible...
EDIT: or is there any tutorial on how to achieve lightweight dynamic instrumentation (in user space, for syscalls and normal function calls)?
EDIT(Added paper details):
A code generation approach to optimizing high-performance distributed data stream processing
Abstract:
We present a code-generation-based
optimization approach to bringing
performance and scalability to
distributed stream processing
applications. We express stream
processing applications using an
operator-based, stream-centric
language called SPADE, which supports
composing distributed data flow graphs
out of toolkits of type-generic
operators. A major challenge in
building such applications is to find
an effective and flexible way of
mapping the logical graph of operators
into a physical one that can be
deployed on a set of distributed
nodes. This involves finding how best
operators map to processes and how
best processes map to computing nodes.
In this paper, we take a two-stage
optimization approach, where an
instrumented version of the
application is first generated by the
SPADE compiler to profile and collect
statistics about the processing and
communication characteristics of the
operators within the application. In
the second stage, the profiling
information is fed to an optimizer to
come up with a physical data flow
graph that is deployable across nodes
in a computing cluster. This approach
not only creates highly optimized
applications that are tailored to the
underlying computing and networking
infrastructure, but also makes it
possible to re-target the application
to a different hardware setup by
simply repeating the optimization step
and re-compiling the application to
match the physical flow graph produced
by the optimizer. Using real-world
applications, from diverse domains
such as finance and radio-astronomy,
we demonstrate the effectiveness of
our approach on System S -- a
large-scale, distributed stream
processing platform.
Instrumentation means inserting code into a stream of instructions whose purpose is to measure something -- execution time, function calls, data access, all sorts of things relating to profiling. That's one of two ways to do profiling, and it's the more accurate but slower one. The other one is sampling, where you periodically interrupt the program and look at its current state. This has less performance impact but isn't as accurate, especially for short runs.
Without knowing what paper you are referencing it is difficult to be sure, but in general it would be a place in the code that has a "hook" for instrumentation.
That is, it is coded so it can be dynamically instrumented, so some measurements can be recorded about how the code runs.
Whether this would be for time spent in a method, power consumption or something else depends on what and how it is being instrumented.
It would be useful to see a link to the paper for the context.
In a tool such as systemtap/gdb, an instrumentation point would be any place in the code, whose execution can yield an event. For "dynamic" instrumentation, there is usually no need to compile a hook into the code; the tool just needs to determine a PC address where a breakpoint can be inserted.
Assume an embedded environment which has either a DSP core(any other processor core).
If i have a code for some application/functionality which is optimized to be one of the best from point of view of Cycles consumed(MCPS) , will it also be a code, best from the point of view of Power consumed by that code in a real hardware system?
Can a code optimized for least MCPS be guaranteed to have least power consumption as well?
I know there are many aspects to be considered here like the architecture of the underlying processor and the hardware system(memory, bus, etc..).
Very difficult to tell without putting a sensitive ammeter between your board and power supply and logging the current drawn. My approach is to test assumptions for various real world scenarios rather than go with the supporting documentation.
No, lowest cycle count will not guarantee lowest power consumption.
It's a good indication, but you didn't take into account that memory bus activity consumes quite a lot of power as well.
Your code may for example have a higher cycle count but lower power consumption if you move often needed data into internal memory (on chip ram). That won't increase the cycle-count of your algorithms but moving the data in- and out the internal memory increases cycle-count.
If your system has a cache as well as internal memory, optimize for best cache utilization as well.
This isn't a direct answer, but I thought this paper (from this answer) was interesting: Real-Time Task Scheduling for Energy-Aware Embedded Systems.
As I understand it, it trying to run each task under the processor's low power state, unless it can't meet the deadline without high power. So in a scheme like that, more time efficient code (less cycles) should allow the processor to spend more time throttled back.