for example:
Slice Logic Utilization:
Number of Slice Registers: 6 out of 18224 0%
Number of Slice LUTs: 8 out of 9112 0%
Number used as Logic: 8 out of 9112 0%
Slice Logic Distribution:
Number of LUT Flip Flop pairs used: 14
Number with an unused Flip Flop: 8 out of 14 57%
Number with an unused LUT: 6 out of 14 42%
Number of fully used LUT-FF pairs: 0 out of 14 0%
Number of unique control sets: 2
IO Utilization:
Number of IOs: 8
Number of bonded IOBs: 8 out of 232 3%
Specific Feature Utilization:
Number of BUFG/BUFGCTRLs: 1 out of 16 6%
What do these numbers mean?
This is a report of how many of the FPGA's resources are being used to implement your design. Read the documentation for this FPGA to understand exactly what each line means. In short, it is telling you that you are using a very small portion of the chip right now, so you should have no trouble implementing this design.
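To make the percentages concrete: each line is simply used / available, with the percentage truncated to an integer. A quick sketch reproducing a few lines of the report above:

```python
# Each report line is "used out of available N%", with the percentage
# truncated toward zero (so 6/18224 = 0.03% prints as 0%).
resources = {
    "Slice Registers": (6, 18224),
    "Slice LUTs": (8, 9112),
    "Bonded IOBs": (8, 232),
    "BUFG/BUFGCTRLs": (1, 16),
}
for name, (used, available) in resources.items():
    pct = 100 * used // available  # integer (truncated) percentage
    print(f"{name}: {used} out of {available} {pct}%")
```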
Related
My University has computational nodes with 128 total cores but comprised of two individual AMD processors (i.e., sockets), each with 64 cores. This leads to anomalous simulation runtimes in ABAQUS using crystal plasticity models implemented in a User MATerial Subroutine (UMAT). For instance, if I run a simulation using 1 node and 128 cores, this takes around 14 hours. If I submit the same job to run across two nodes with 128 cores (i.e., using 64 cores/1 processor on two separate nodes), the job finishes in only 9 hours. One would expect the simulation running on a single host node to be faster than on two separate nodes for the same total number of cores, but this is not the case. The problem is that in the latter configuration, each host node contains two processors each with 64 cores and the abaqus_v6.env file therefore contains:
mp_host_list=[['node_1', 64],['node_2', 64]]
for the 2 node/128 core simulation. The ABAQUS .msg file then accordingly splits the job into two processes each with 64 threads:
PROCESS 1 THREAD NUMBER OF ELEMENTS
1 3840
2 3840
...
63
64 3840
PROCESS 2 THREAD NUMBER OF ELEMENTS
1 3584
2 4096
...
63
64 3840
The problem arises when I specify a single host node with 128 cores because ABAQUS has no way of recognizing that the host node consists of two separate processors. I can modify the abaqus_v6.env file accordingly as:
mp_host_list=[['node_1', 64],['node_1', 64]]
but ABAQUS just clumps this into one process with 128 threads, and I believe this is why my simulations actually run quicker on two nodes instead of one with the same number of cores, because ABAQUS does not recognize that it should treat the single node as two processors/processes.
Is there a way to specify two processes on the same host node in ABAQUS?
As a note, the amount of memory/RAM reserved per core does not change (~2 GB per core).
Final update: able to reduce runtimes using multiple nodes
I found that running these types of simulations across multiple nodes reduces run times. A table of simulation speeds for the two models across various numbers of cores, nodes, and cores/processor is given below.
The smaller model finished in 9.7 hours on two nodes with 64 cores/node = 128 cores total. The runtime dropped by 25% when simulated over four nodes with 32 cores/node for the same total of 128 cores. Interestingly, the simulation took longer using three nodes with 64 cores/node (192 cores total), and there could be many reasons for this. One surprising result was that the simulation ran quicker using 64 cores split over two nodes (32 cores/socket) vs. 64 cores on a single socket, which suggests the extra memory bandwidth of using multiple nodes helps (the details of which I do not fully understand).
The larger model finished in ~32.5 hours using 192 cores, and there was little difference between using three nodes (64 cores/processor) or six nodes (32 cores/processor), which suggests that at some point using more nodes does not help. However, this larger model finished in 36.7 hours using 128 cores with 32 cores/processor (four nodes). Thus, the most efficient use of nodes for both the larger and smaller models is 128 CPUs split over four nodes.
Simulation details for a model with 477,956 tetrahedral elements and 86,153 nodes. Model is cyclically strained to a strain of 1.3% for 10 cycles with a strain ratio R = 0.
# CPUs | # Nodes | ABAQUS processes (actual) | ABAQUS processes (ideal) | Cores per processor | Wall time (hr) | Notes
64 | 1 | 1 | 2 | 32 cores/processor | 13.8 | Using cores on two processors but unable to specify two separate processes
64 | 1 | 1 | 1 | 64 cores/processor | 11.5 | No need to specify two processes
64 | 2 | 2 | 2 | 32 cores/processor | 10.5 | Correctly specifies two processes. Surprisingly faster than the scenario directly above!
128 | 1 | 1 | 2 | 64 cores/processor | 14.5 | Unable to specify two separate processes
128 | 2 | 2 | 2 | 64 cores/processor | 8.9 | Correctly specifies two processes
128 | 2 | 2 | 4 | ~32 cores/processor; 4 processors total | 9.9 | Specifies two processes but should be four
128 | 2 | 2 | 3 | 64 cores/processor | 9.7 | Specifies two processes over three processors
128 | 4 | 4 | 4 | 32 cores/processor | 7.2 | 32 cores per node. Most efficient!
192 | 3 | 3 | 3 | 64 cores/processor | 7.6 | Three nodes with three processors
192 | 2 | 2 | 4 | 64 and 32 cores/processor on each node | 10.5 | Four processors over two nodes
Simulation details for a model with 4,104,272 tetrahedral elements and 702,859 nodes. The model is strained to 1.3% strain and then back to 0% strain (one cycle).
# CPUs | # Nodes | ABAQUS processes (actual) | ABAQUS processes (ideal) | Cores per processor | Wall time (hr) | Notes
64 | 1 | 1 | 1 | 64 cores/processor | 53.0 | Using a single processor
128 | 1 | 1 | 2 | 64 cores/processor | 57.3 | Using two processors on one node
128 | 2 | 2 | 2 | 64 cores/processor | 40.9 |
128 | 4 | 4 | 4 | 32 cores/processor | 36.7 | Most efficient!
192 | 2 | 2 | 4 | 64 and 32 cores/processor on each node | 42.7 |
192 | 3 | 3 | 3 | 64 cores/processor | 32.4 |
192 | 6 | 6 | 6 | 32 cores/processor | 32.6 |
I have a CUDA program with multiple kernels run in series (in the same stream, the default one). I want to do a performance analysis of the program as a whole, specifically the GPU portion. I'm doing the analysis using metrics such as achieved_occupancy, inst_per_warp, gld_efficiency, and so on, using the nvprof tool.
But the profiler gives metric values separately for each kernel, while I want to compute them across all kernels to see the total usage of the GPU by the program.
Should I take the average, largest value, or total of all kernels for each metric?
One possible approach would be to use a weighted average method.
Suppose we had 3 non-overlapping kernels in our timeline. Let's say kernel 1 runs for 10 milliseconds, kernel 2 runs for 20 milliseconds, and kernel 3 runs for 30 milliseconds. Collectively, all 3 kernels occupy 60 milliseconds in our overall application timeline.
Let's also suppose that the profiler reports the gld_efficiency metric as follows:
kernel duration gld_efficiency
1 10ms 88%
2 20ms 76%
3 30ms 50%
You could compute the weighted average as follows:
"overall" global load efficiency = (88*10 + 76*20 + 50*30) / 60 = 65%
I'm sure there may be other approaches that make sense also. For example, a better approach might be to have the profiler report the total number of global load transactions for each kernel, and do your weighting based on that rather than on kernel duration:
kernel gld_transactions gld_efficiency
1 1000 88%
2 2000 76%
3 3000 50%
"overall" global load efficiency = (88*1000 + 76*2000 + 50*3000) / 6000 = 65%
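Either weighting scheme is a one-liner. A small sketch, using the values from the tables above:

```python
# Weighted average of a per-kernel metric, using either kernel duration
# or per-kernel global load transactions as the weight.
def weighted_metric(values, weights):
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

gld_efficiency = [88, 76, 50]        # percent, per kernel
durations = [10, 20, 30]             # ms, per kernel
transactions = [1000, 2000, 3000]    # global load transactions, per kernel

print(weighted_metric(gld_efficiency, durations))     # 65.0
print(weighted_metric(gld_efficiency, transactions))  # 65.0
```

Here both weightings happen to agree because transactions are proportional to durations in this example; in general they will differ.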
Update (Jan 24, 2019):
This question was asked 4 years ago about Go 1.4 (and is still getting views). Profiling with pprof has changed dramatically since then.
Original Question:
I'm trying to profile a Go Martini-based server I wrote. I want to profile a single request and get a complete breakdown of the functions with their runtime durations.
I tried playing around with both runtime/pprof and net/http/pprof but the output looks like this:
Total: 3 samples
1 33.3% 33.3% 1 33.3% ExternalCode
1 33.3% 66.7% 1 33.3% runtime.futex
1 33.3% 100.0% 2 66.7% syscall.Syscall
The web view is not very helpful either.
We regularly profile another program, and the output seems to be what I need:
20ms of 20ms total ( 100%)
flat flat% sum% cum cum%
10ms 50.00% 50.00% 10ms 50.00% runtime.duffcopy
10ms 50.00% 100% 10ms 50.00% runtime.fastrand1
0 0% 100% 20ms 100% main.func·004
0 0% 100% 20ms 100% main.pruneAlerts
0 0% 100% 20ms 100% runtime.memclr
I can't tell where the difference is coming from.
pprof is a timer-based sampling profiler, originally from the gperftools suite. Russ Cox later ported the pprof tools to Go: http://research.swtch.com/pprof.
This timer-based profiler works by using the system profiling timer, and recording statistics whenever it receives SIGPROF. In Go, this is currently set to a constant 100Hz. From pprof.go:
// The runtime routines allow a variable profiling rate,
// but in practice operating systems cannot trigger signals
// at more than about 500 Hz, and our processing of the
// signal is not cheap (mostly getting the stack trace).
// 100 Hz is a reasonable choice: it is frequent enough to
// produce useful data, rare enough not to bog down the
// system, and a nice round number to make it easy to
// convert sample counts to seconds. Instead of requiring
// each client to specify the frequency, we hard code it.
const hz = 100
You can set this frequency by calling runtime.SetCPUProfileRate and writing the profile output yourself, and Gperftools allows you to set this frequency with CPUPROFILE_FREQUENCY, but in practice it's not that useful.
In order to sample a program, it needs to be doing what you're trying to measure at all times. Sampling the idle runtime isn't showing anything useful. What you usually do is run the code you want in a benchmark, or in a hot loop, using as much CPU time as possible. After accumulating enough samples, there should be a sufficient number across all functions to show you proportionally how much time is spent in each function.
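This also explains why the profile in the question is so sparse: at the fixed 100 Hz rate, each sample represents 10 ms of CPU time, so "Total: 3 samples" covers only about 30 ms. A quick conversion:

```python
# At pprof's fixed 100 Hz sampling rate, each sample represents 10 ms of
# CPU time, so sample counts convert directly to milliseconds.
HZ = 100

def samples_to_ms(samples):
    return samples * 1000 / HZ

# The profile in the question ("Total: 3 samples") covers only ~30 ms of
# CPU time -- far too little to say anything meaningful about the program.
print(samples_to_ms(3))  # 30.0
```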
See also:
http://golang.org/pkg/runtime/pprof/
http://golang.org/pkg/net/http/pprof/
http://blog.golang.org/profiling-go-programs
https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs
I was taking an exam earlier and memorized the questions that I didn't know how to answer but somehow got correct (the online exam, through an electronic classroom (eclass), was multiple choice; it was coded so that each of us was given random questions at random numbers with random answers at random choices).
Anyway, back to my questions:
1.)
There is a CPU with a clock frequency of 1 GHz. When the instructions consist of two
types as shown in the table below, what is the performance in MIPS of the CPU?
              Execution time (clocks)   Frequency of appearance (%)
Instruction 1           10                         60
Instruction 2           15                         40
Answer: 125
2.)
There is a hard disk drive with specifications shown below. When a record of 15
Kbytes is processed, which of the following is the average access time in milliseconds?
Here, the record is stored in one track.
[Specifications]
Capacity: 25 Kbytes/track
Rotation speed: 2,400 revolutions/minute
Average seek time: 10 milliseconds
Answer: 37.5
3.)
Assume a magnetic disk has a rotational speed of 5,000 rpm, and an average seek time of 20 ms. The recording capacity of one track on this disk is 15,000 bytes. What is the average access time (in milliseconds) required in order to transfer one 4,000-byte block of data?
Answer: 29.2
4.)
When a color image is stored in video memory at a tonal resolution of 24 bits per pixel, approximately how many megabytes (MB) are required to display the image on the screen with a resolution of 1024 x 768 pixels? Here, 1 MB is 10^6 bytes.
Answer:18.9
5.)
When a microprocessor works at a clock speed of 200 MHz and the average CPI
(“cycles per instruction” or “clocks per instruction”) is 4, how long does it take to
execute one instruction on average?
Answer: 20 nanoseconds
I don't expect someone to answer everything; the answers are given, but I want to know how they were arrived at. Knowing the answer isn't enough for me. I've tried solving these myself by trial and error, but it takes minutes to hours, so I need some professional help.
1.)
n = 1/f = 1 / 1 GHz = 1 ns per clock.
Average instruction time = n*10 * 0.6 + n*15 * 0.4 = 12 ns, so the rate is 1 / 12 ns ≈ 83.3 MIPS.
2.)3.)
I don't get these, honestly.
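For what it's worth, the usual textbook decomposition (average access time = average seek time + average rotational latency of half a revolution + transfer time for the record as a fraction of one track per revolution) does reproduce both stated answers. A sketch, assuming that decomposition:

```python
# Average access time = avg seek + avg rotational latency (half a
# revolution) + transfer time (record size as a fraction of one track,
# times the time for one revolution).
def access_time_ms(rpm, seek_ms, track_bytes, record_bytes):
    rev_ms = 60_000 / rpm                  # one full revolution, in ms
    latency = rev_ms / 2                   # on average, half a revolution
    transfer = rev_ms * record_bytes / track_bytes
    return seek_ms + latency + transfer

# Question 2: 2,400 rpm, 10 ms seek, 25 KB/track, 15 KB record
print(access_time_ms(2400, 10, 25_000, 15_000))         # 37.5
# Question 3: 5,000 rpm, 20 ms seek, 15,000 B/track, 4,000 B block
print(round(access_time_ms(5000, 20, 15_000, 4_000), 1))  # 29.2
```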
4.)
Here, 1 MB is 10^6 bytes.
3 Bytes * 1024 * 768 = 2359296 Bytes = 2.36 MB
But often these 24 bits are packed into 32 bits b/c of the memory layout (word width), so often it will be 4 Bytes*1024*768 = 3145728 Bytes = 3.15 MB.
5)
CPI / f = 4 / 200 MHz = 20 ns.
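The arithmetic for 1.), 4.), and 5.) is short enough to check directly (note it reproduces the 83.3 MIPS worked out above, not the stated answer of 125):

```python
# Question 1: at 1 GHz each clock is 1 ns; MIPS = f (MHz) / avg clocks
avg_clocks = 10 * 0.6 + 15 * 0.4        # = 12 clocks -> 12 ns/instruction
print(round(1000 / avg_clocks, 1))      # 83.3 MIPS

# Question 4: 24 bits = 3 bytes per pixel; here 1 MB = 10^6 bytes
print(round(3 * 1024 * 768 / 1e6, 2))   # 2.36 MB

# Question 5: time per instruction = CPI / clock frequency, in ns
print(4 * 1e9 / 200e6)                  # 20.0 ns
```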
A program run on a parallel machine is measured to have the following efficiency values for increasing numbers of processors, P.
P 1 2 3 4 5 6 7
E 100 90 85 80 70 60 50
Using the above results, plot the speedup graph.
Use the graph to explain whether or not the program has been successfully parallelized.
P E Speedup
1 100% 1
2 90% 1.8
3 85% 2.55
4 80% 3.2
5 70% 3.5
6 60% 3.6
7 50% 3.5
This is a past year exam question, and I know how to calculate the speedup & plot the graph. However I don't know how to tell a program is successfully parallelized.
Amdahl's law
I think the idea here is that not every portion of a program can be parallelized.
For example, if a program needs 20 hours using a single processor core, and a particular portion of 1 hour cannot be parallelized while the remaining 19 hours (95%) can be, then regardless of how many processors we devote to a parallelized execution of this program, the minimum execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20×.
In this example, the speedup reached a maximum of 3.6 with 6 processors, so the parallel portion is about 1 - 1/3.6 ≈ 72.2%.
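The speedup column and the parallel-fraction estimate can be reproduced in a few lines. Note that using the best observed speedup in 1 - 1/S is only a rough estimate, since Amdahl's law strictly relates the parallel fraction to the asymptotic speedup limit, not a measurement at P = 6:

```python
# Speedup follows from efficiency: S = E * P. The best observed speedup
# gives a rough Amdahl-style estimate of the parallel fraction p via
# S_max ~= 1 / (1 - p)  =>  p ~= 1 - 1/S_max.
procs      = [1, 2, 3, 4, 5, 6, 7]
efficiency = [1.00, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50]

speedups = [round(e * p, 2) for e, p in zip(efficiency, procs)]
print(speedups)                          # [1.0, 1.8, 2.55, 3.2, 3.5, 3.6, 3.5]

best = max(speedups)                     # 3.6, reached at P = 6
print(round(100 * (1 - 1 / best), 1))    # 72.2 (rough % parallel portion)
```

The fact that speedup peaks at P = 6 and then falls at P = 7 is the graphical signal that the parallelization has stopped paying off.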