I came across this concept in a paper on dynamic instrumentation, but I couldn't find an explanation of it there. Please explain, if possible...
EDIT: or is there any tutorial on how to achieve lightweight dynamic instrumentation (in user space, for syscalls and normal function calls)?
EDIT (added paper details):
A code generation approach to optimizing high-performance distributed data stream processing
Abstract:
We present a code-generation-based optimization approach to bringing performance and scalability to distributed stream processing applications. We express stream processing applications using an operator-based, stream-centric language called SPADE, which supports composing distributed data flow graphs out of toolkits of type-generic operators. A major challenge in building such applications is to find an effective and flexible way of mapping the logical graph of operators into a physical one that can be deployed on a set of distributed nodes. This involves finding how best operators map to processes and how best processes map to computing nodes. In this paper, we take a two-stage optimization approach, where an instrumented version of the application is first generated by the SPADE compiler to profile and collect statistics about the processing and communication characteristics of the operators within the application. In the second stage, the profiling information is fed to an optimizer to come up with a physical data flow graph that is deployable across nodes in a computing cluster. This approach not only creates highly optimized applications that are tailored to the underlying computing and networking infrastructure, but also makes it possible to re-target the application to a different hardware setup by simply repeating the optimization step and re-compiling the application to match the physical flow graph produced by the optimizer. Using real-world applications, from diverse domains such as finance and radio-astronomy, we demonstrate the effectiveness of our approach on System S -- a large-scale, distributed stream processing platform.
Instrumentation means inserting code into a stream of instructions whose purpose is to measure something -- execution time, function calls, data access, all sorts of things relating to profiling. That's one of two ways to do profiling, and it's the more accurate but slower one. The other one is sampling, where you periodically interrupt the program and look at its current state. This has less performance impact but isn't as accurate, especially for short runs.
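For a concrete picture of the instrumentation style, here is a minimal C sketch (the `work()` function and the timing wrapper are made up for illustration): the two `clock_gettime()` calls are code inserted purely to measure the call between them, which is exactly what a profiler-generated instrumentation point does, just by hand.

```c
#include <stdio.h>
#include <time.h>

/* The function we want to measure; a stand-in for real application code. */
static void work(void)
{
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 10000000UL; i++)
        sum += i;
}

int main(void)
{
    struct timespec start, end;

    /* Instrumentation: code inserted around the call purely to measure it. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    work();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ms = (end.tv_sec - start.tv_sec) * 1e3 +
                        (end.tv_nsec - start.tv_nsec) / 1e6;
    printf("work() took %.3f ms\n", elapsed_ms);
    return 0;
}
```

A sampling profiler would get a similar number without touching the code, by interrupting the process periodically and recording where it happens to be.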
Without knowing what paper you are referencing it is difficult to be sure, but in general it would be a place in the code that has a "hook" for instrumentation.
That is, it is coded so it can be dynamically instrumented, so some measurements can be recorded about how the code runs.
Whether this would be for time spent in a method, power consumption or something else depends on what and how it is being instrumented.
It would be useful to see a link to the paper for the context.
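To make the idea of a "hook" concrete, here is a minimal sketch in C (the names `enter_hook`, `set_enter_hook`, and `process_request` are invented for illustration): the application calls through a function pointer at the instrumentation point, and a profiler can install its own callback there at run time without touching the surrounding code.

```c
#include <stdio.h>

/* Hook type: called at the instrumentation point with the name of the site. */
typedef void (*hook_fn)(const char *site);

/* By default no hook is installed, so the overhead is one pointer check. */
static hook_fn enter_hook = NULL;

/* A profiler (or test harness) can install its callback at runtime. */
void set_enter_hook(hook_fn fn) { enter_hook = fn; }

void process_request(int id)
{
    if (enter_hook)                 /* the instrumentation point */
        enter_hook("process_request");
    /* ... real work would go here ... */
    (void)id;
}

/* Example callback: just count how often the site is hit. */
static unsigned long hits;
static void count_hook(const char *site) { (void)site; hits++; }

int main(void)
{
    set_enter_hook(count_hook);     /* "instrument" the code dynamically */
    for (int i = 0; i < 5; i++)
        process_request(i);
    printf("process_request was entered %lu times\n", hits);
    return 0;
}
```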
In a tool such as SystemTap or gdb, an instrumentation point would be any place in the code whose execution can yield an event. For "dynamic" instrumentation, there is usually no need to compile a hook into the code; the tool just needs to determine a PC address where a breakpoint can be inserted.
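For the breakpoint-based flavour described above, the sketch below shows, in heavily simplified form, how a tool on x86-64 Linux could plant a software breakpoint at a chosen address in another process using ptrace. This is only an illustration of the mechanism: a real tool like SystemTap or gdb also restores the original byte, rewinds the instruction pointer, single-steps, and re-arms the breakpoint so the target keeps running correctly.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Simplified sketch: attach to PID, overwrite one byte at ADDR with the
 * x86 breakpoint opcode (0xCC), and wait for the resulting SIGTRAP. */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <hex-address>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);
    unsigned long addr = strtoul(argv[2], NULL, 16);

    ptrace(PTRACE_ATTACH, pid, NULL, NULL);
    waitpid(pid, NULL, 0);                       /* target is now stopped */

    long orig = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, NULL);
    long patched = (orig & ~0xffL) | 0xcc;       /* int3 in the low byte */
    ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)patched);

    ptrace(PTRACE_CONT, pid, NULL, NULL);        /* let the target run */
    int status;
    waitpid(pid, &status, 0);                    /* stops again on SIGTRAP */
    printf("instrumentation point at 0x%lx was hit\n", addr);

    ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)orig);
    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return 0;
}
```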
I am working on implementing a prototype performance monitoring system. I went through multiple documents and resources to understand the concepts, but I am still confused about the difference between profiling and diagnostics. Can somebody explain these two terms, how they relate, and when/where each is used?
"Profiling" usually means mapping things happening in the system (e.g., performance monitoring events) to processes, or to functions (or instructions) within processes. Examples of profiling tools in the Unix/Linux world include "gprof" and "oprofile". Intel's "VTune Amplifier" is another commonly used profiler. Some profilers are limited to looking at the performance of a single process, while others (usually requiring elevated privileges) monitor all processes (including the kernel) operating on the system during the measurement period.
"Diagnostics" is not a term I see very often in performance monitoring, but from the context I would assume that this means looking for evidence of "trouble" in the overall operation of the system. As an example, the performance monitoring system at https://github.com/TACC/tacc_stats collects hardware and software performance monitoring data on each server. In TACC's operation, the data is reviewed automatically to look for matches to a variety of heuristics related to known patterns of poor performance (e.g., all memory accesses being made to one socket in a 2-socket system). The data is also used by human performance analysts in response to user queries and is aggregated to provide an overview of performance-related characteristics by application area.
This is more of a general question about FPGA design than a specific question about code. I studied computer science but have been trying to learn more about hardware recently. I’ve been using a Xilinx FPGA to teach myself VHDL and some of the basics about hardware design, but I have a lot of gaps in my knowledge that have led to me hitting some pretty big walls in my projects. This is the most recent one.
I have a design with a couple dozen "workers". Part of the design's functionality depends on these workers executing compute-heavy tasks. In order to save FPGA resources, I have the workers sharing the computing circuitry, with another module scheduling access to that circuitry between the workers. The logic itself works fine and I've tested it in the simulator; however, when I try to implement the design on the FPGA itself, it never meets the timing requirements. A look at the diagram in Vivado showed me that the placer puts all of the shared computing circuitry on one side of the FPGA and all of the workers on the other side. Additionally, the routes that carry data from the workers to the computing circuitry meet timing, but the routes that carry the results back to the workers are almost all failing.
So, my question is: what solutions are typically used to fix data-transfer problems like this in hardware design? I know that I could lower the clock rate to give the signals more time to move around, but I'm hesitant to do that since it would decrease the overall throughput of my design. On the other hand, I could place a few buffers between the shared computing circuitry and the workers (acting like a shift register), at the cost of increasing the compute time for the individual workers. What other techniques or design patterns are there for moving data around between points in an FPGA that are far apart?
Indeed, the solutions you propose to reduce timing violations are right, and they are the most common ones.
You can also:
Modify the synthesis and implementation directives in Vivado to favor timing optimization over resource utilization or over the run time of synthesis and implementation.
Rework your compute unit to ensure there is a register after all of your logic; there are different ways to split a compute unit into its sequential and combinational parts.
Place and route critical parts of the design yourself. I have never done it, but I know it is possible (at least by setting location constraints in the .xdc file).
About adding buffers on the critical paths: if you can pipeline the architecture, you will only add one clock cycle of latency (not a high cost to ensure your design works correctly).
Can FPGAs be automatically programmed to accelerate arbitrary software or is manual work required? I imagine there's nothing inherently stopping this from being possible - I'm just curious if it's currently possible as that could be a nice way to do hardware acceleration assuming the cost made sense.
One of the techniques (available for Xilinx FPGAs, for example) is Partial Reconfiguration (PR).
Partial Reconfiguration is the ability to dynamically modify blocks of logic by downloading partial bit files while the remaining logic continues to operate without interruption. Xilinx Partial Reconfiguration technology allows designers to change functionality on the fly, eliminating the need to fully reconfigure and re-establish links, dramatically enhancing the flexibility that FPGAs offer. The use of Partial Reconfiguration can allow designers to move to fewer or smaller devices, reduce power, and improve system upgradability. Make more efficient use of the silicon by only loading in functionality that is needed at any point in time.
Anyway, in the literature you can find many other examples, strategies, and techniques for changing the FPGA configuration automatically at runtime. This gives an autonomous system the possibility to evolve and adapt itself to several contexts. You can find a tool for the design of dynamically Reconfigurable Embedded and Modular Systems here, and here you can find an example.
Key Technology Benefits
Increase solution flexibility by time-multiplexing design functionality
Reduce FPGA size or count (and therefore cost) by time-sharing functionality
Reduce dynamic power consumption by loading functions on-demand
I need to write an application that hashes words from a dictionary to make WPA pre-shared-keys. This is my thesis for a "Networking Security" course. The application needs to be parallel for increased performance. I have some experience with MPI from my IT studies but I would like to tie it up with CUDA. The idea is to use MPI to distribute the load evenly to the nodes of the cluster and then utilize CUDA to run the individual chunks in parallel inside the GPUs of the nodes.
Distributing the load with MPI is something I can easily do and have done in the past. Also computing with CUDA is something I can learn. There is also a project (pyrit) that does more or less what I need to do (actually a lot more) and I can get ideas from there.
I would like some advice on how to make the connection between MPI and CUDA. If somebody has built anything like this, I would greatly appreciate their advice and suggestions. Also, if you happen to know of any resources on the topic, please point me to them.
Sorry for the lengthy intro but I thought it was necessary to give some background.
This question is largely open-ended, so it's hard to give a definitive answer. This one is just a summary of the comments made by High Performance Mark, Jonathan Dursi, and me. I do not claim authorship and thus made this answer a community wiki.
MPI and CUDA are orthogonal. The former is an IPC middleware and is used to communicate between processes (possibly residing on separate nodes) while the latter provides highly data-parallel shared-memory computing to each process that uses it. You can break the task into many small subtasks and use MPI to distribute them to worker processes running on the network. The master/worker approach is suitable for this kind of application, especially if words in the dictionary vary greatly in their length and variance in processing time is to be expected. Provided with all the necessary input values, worker processes can then use CUDA to perform the necessary computations in parallel and then return results back using MPI. MPI also provides the mechanisms necessary to launch and control multinode jobs.
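A minimal master/worker sketch along those lines is shown below. It is only a skeleton under assumptions: `gpu_hash_batch()` is a placeholder for the CUDA side (copying words to the device, running the WPA-PSK kernel, copying the derived keys back), and the chunk size and word format are invented for illustration.

```c
/* Master/worker sketch: the master hands out chunks of the dictionary and
 * collects results; each worker would call into CUDA (here represented by the
 * placeholder gpu_hash_batch()) to hash its chunk on the GPU.
 * Compile with mpicc and run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define CHUNK 4          /* words per work unit (tiny, for illustration) */
#define WORDLEN 64

/* Placeholder for the CUDA side: would copy words to the device, run the
 * WPA-PSK derivation kernel, and copy the derived keys back. */
static void gpu_hash_batch(char words[][WORDLEN], int n, char keys[][WORDLEN])
{
    for (int i = 0; i < n; i++)
        snprintf(keys[i], WORDLEN, "psk(%s)", words[i]);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char words[CHUNK][WORDLEN], keys[CHUNK][WORDLEN];

    if (rank == 0) {
        /* Master: send one chunk to each worker, then gather results. */
        for (int w = 1; w < size; w++) {
            for (int i = 0; i < CHUNK; i++)
                snprintf(words[i], WORDLEN, "word_%d_%d", w, i);
            MPI_Send(words, CHUNK * WORDLEN, MPI_CHAR, w, 0, MPI_COMM_WORLD);
        }
        for (int w = 1; w < size; w++) {
            MPI_Recv(keys, CHUNK * WORDLEN, MPI_CHAR, w, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("got %d keys from worker %d (first: %s)\n", CHUNK, w, keys[0]);
        }
    } else {
        /* Worker: receive a chunk, hash it "on the GPU", send the keys back. */
        MPI_Recv(words, CHUNK * WORDLEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        gpu_hash_batch(words, CHUNK, keys);
        MPI_Send(keys, CHUNK * WORDLEN, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

A real implementation would loop, handing out new chunks as workers finish, which is what makes the master/worker pattern robust against words of very different lengths.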
Although MPI and CUDA could be used separately, modern MPI implementations provide some mechanisms that blur the boundaries between the two. It could be direct support for device pointers in MPI communication operations, which transparently calls CUDA functions to copy memory when necessary, or it could even be support for RDMA to/from device memory without an intermediate copy to main memory. The former simplifies your code, while the latter can save varying amounts of time, depending on how your algorithm is structured. The latter also requires fairly new CUDA hardware and drivers as well as newer networking equipment (e.g., a newer InfiniBand HCA).
MPI libraries that support direct GPU memory operations include MVAPICH2 and the trunk SVN version of Open MPI.
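For the CUDA-aware case, a small sketch is below; it assumes one of those builds, so the device pointer goes straight into MPI_Send/MPI_Recv and the library handles the staging or RDMA. Without CUDA support in the MPI library you would have to cudaMemcpy into a host buffer first and send that instead.

```c
/* Assumes an MPI library built with CUDA support (e.g., MVAPICH2 or a
 * CUDA-aware Open MPI build): device pointers are passed directly to
 * MPI_Send/MPI_Recv and the library stages or RDMAs the data itself.
 * Run with at least 2 ranks, e.g.:  mpirun -np 2 ./gpu_pingpong */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define NBYTES (1 << 20)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    void *d_buf;
    cudaMalloc(&d_buf, NBYTES);          /* buffer lives in GPU memory */

    if (rank == 0) {
        cudaMemset(d_buf, 0xAB, NBYTES); /* stand-in for kernel output */
        MPI_Send(d_buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d bytes straight into device memory\n", NBYTES);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```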
Assume an embedded environment with a DSP core (or any other processor core).
If I have code for some application/functionality that is optimized to be among the best in terms of cycles consumed (MCPS), will it also be the best in terms of power consumed by that code in a real hardware system?
Can code optimized for the lowest MCPS be guaranteed to have the lowest power consumption as well?
I know there are many aspects to be considered here, like the architecture of the underlying processor and the hardware system (memory, bus, etc.).
Very difficult to tell without putting a sensitive ammeter between your board and power supply and logging the current drawn. My approach is to test assumptions for various real world scenarios rather than go with the supporting documentation.
No, lowest cycle count will not guarantee lowest power consumption.
It's a good indication, but you didn't take into account that memory bus activity consumes quite a lot of power as well.
Your code may, for example, have a higher cycle count but lower power consumption if you move frequently needed data into internal memory (on-chip RAM). That doesn't change the cycle count of the algorithm itself, but moving the data in and out of internal memory adds cycles overall.
If your system has a cache as well as internal memory, optimize for best cache utilization as well.
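As a sketch of that trade-off in C: the linker-section attribute below is toolchain specific and the section name ".onchip_ram" is made up, so treat this as an illustration of "pin the hot data on-chip, pay a one-time copy, keep the external bus quiet" rather than code for any particular part.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define TAPS 1024

/* Hypothetical toolchain-specific placement: many embedded toolchains let you
 * pin a buffer into on-chip RAM through a linker-section attribute; the
 * section name here is invented and would have to match your linker script. */
static int16_t coeffs_onchip[TAPS] __attribute__((section(".onchip_ram")));

/* Copying the coefficients in costs extra cycles once, but every access in
 * the inner loop then stays on-chip, so the external memory bus (often a
 * large part of the power budget) stays mostly idle. */
void prepare(const int16_t *coeffs_in_external_ram)
{
    memcpy(coeffs_onchip, coeffs_in_external_ram, sizeof coeffs_onchip);
}

int32_t filter_sample(const int16_t *samples)
{
    int32_t acc = 0;
    for (int i = 0; i < TAPS; i++)
        acc += (int32_t)samples[i] * coeffs_onchip[i];   /* on-chip accesses */
    return acc;
}

int main(void)
{
    static int16_t coeffs_in_external_ram[TAPS];
    static int16_t samples[TAPS];
    for (int i = 0; i < TAPS; i++) {
        coeffs_in_external_ram[i] = (int16_t)(i & 0x7);
        samples[i] = (int16_t)(i & 0xF);
    }
    prepare(coeffs_in_external_ram);
    printf("acc = %d\n", (int)filter_sample(samples));
    return 0;
}
```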
This isn't a direct answer, but I thought this paper (from this answer) was interesting: Real-Time Task Scheduling for Energy-Aware Embedded Systems.
As I understand it, it tries to run each task in the processor's low-power state, unless the deadline can't be met without the high-power state. So in a scheme like that, more time-efficient code (fewer cycles) should allow the processor to spend more time throttled back.