How to precisely measure the GPU memory usage of an application (OpenACC + Managed Memory) - gpgpu

What is the most precise method to measure the GPU memory usage of an application that uses OpenACC with Managed Memory?
I used two methods to do so. One is
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla v100 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 51C P5 11W / N/A | 10322MiB / 16160MiB | 65% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2670 G ./myapp 398MiB |
+-----------------------------------------------------------------------------+
Regarding this output, what is the difference between the memory usage reported at the top (10322MiB / 16160MiB) and the per-process usage below (./myapp 398MiB)?
The other method I used is:
#include <openacc.h>

// Peak device memory usage observed so far (total minus free), in bytes.
static size_t max_mem_usage = 0;

void measure_acc_mem_usage() {
    auto dev_ty = acc_get_device_type();
    auto dev_mem = acc_get_property(0, dev_ty, acc_property_memory);
    auto dev_free_mem = acc_get_property(0, dev_ty, acc_property_free_memory);
    auto mem = dev_mem - dev_free_mem;
    if (mem > max_mem_usage)
        max_mem_usage = mem;
}
This is a function I call many times during program execution.
Neither of these methods seems to reflect the actual behaviour of the device (I base this on when saturation appears to occur, i.e. when the application begins to run really slowly as the problem size increases), and they report very different values: for example, while the second method indicates 2GB of memory usage, nvidia-smi reports 16GB.

I'm not sure you'll be able to get a precise value of memory usage when using CUDA Unified Memory (aka managed memory). The nvidia-smi utility will only show cudaMalloc-allocated memory, and the OpenACC property function uses cudaMemGetInfo, which isn't accurate for UM.
Bob gives a good explanation as to why here: CUDA unified memory pages accessed in CPU but not evicted from GPU
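If you want the number that nvidia-smi reports, but from inside your own code, one option is to query the driver through NVML, which is what nvidia-smi itself uses. This is only a sketch, with the same caveat as above (for managed memory it reflects what the driver currently has resident on the device, not your true working set); it assumes the NVML header and library that ship with the CUDA toolkit and driver:
#include <nvml.h>
#include <cstdio>

// Sketch: report device memory the same way nvidia-smi does (via NVML).
// Build with something like: g++ nvml_mem.cpp -lnvidia-ml
int main() {
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        nvmlMemory_t mem;
        if (nvmlDeviceGetMemoryInfo(dev, &mem) == NVML_SUCCESS) {
            // Values are in bytes; "used" covers allocations from all processes.
            std::printf("total %llu MiB, used %llu MiB, free %llu MiB\n",
                        (unsigned long long)(mem.total >> 20),
                        (unsigned long long)(mem.used >> 20),
                        (unsigned long long)(mem.free >> 20));
        }
    }
    nvmlShutdown();
    return 0;
}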

Related

In Cache Coherency (specifically write-through and write-back), can cache be updated even though it hasn't read from the memory first?

There are 3 serial steps. In serial 1, the event is blank, and in both write-through and write-back only the memory column has X. In serial 2, the event is "P reads X", and both the memory and cache columns in write-through and write-back have X. Lastly, in serial 3, the event is "P updates X"; in write-through, both memory and cache have X', whereas in write-back only the cache column has X' while the memory column still has X.
This image shows how write-through and write-back work. But what if, in serial 2, instead of "P reads X" the event were "P updates X"? What would happen to the memory and the cache?
From what I understand, this is what will happen if serial 2 is "P updates X":
| Serial | Event       | Write-Through Memory | Write-Through Cache | Write-Back Memory | Write-Back Cache |
| ------ | ----------- | -------------------- | ------------------- | ----------------- | ---------------- |
| 1      |             | X                    |                     | X                 |                  |
| 2      | P updates X | X'                   | X'                  | X                 | X'               |
But I'm not really sure whether this is correct. I need clarification on this.
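No answer was posted in this thread, but as a rough cross-check of the table above, here is a minimal C++ sketch of a single cache line under the two policies. It assumes a write-allocate cache on the write miss (an assumption on my part, since the question does not state the allocation policy):
#include <iostream>
#include <string>

struct Line {
    bool valid = false;
    bool dirty = false;      // only meaningful for write-back
    std::string value;       // cached copy of X
};

struct Memory {
    std::string value = "X"; // backing store
};

// "P updates X" under write-through: the write goes to the cache and
// straight through to memory, so both hold X'.
void write_through(Line& line, Memory& mem, const std::string& v) {
    line.valid = true;
    line.value = v;
    mem.value = v;
}

// "P updates X" under write-back: only the cache gets X' and the line is
// marked dirty; memory keeps the stale X until the line is evicted.
void write_back(Line& line, Memory& mem, const std::string& v) {
    (void)mem;               // memory untouched until eviction
    line.valid = true;
    line.value = v;
    line.dirty = true;
}

int main() {
    Memory m1, m2;
    Line c1, c2;
    write_through(c1, m1, "X'");
    write_back(c2, m2, "X'");
    std::cout << "write-through: memory=" << m1.value << " cache=" << c1.value << "\n";
    std::cout << "write-back:    memory=" << m2.value << " cache=" << c2.value << "\n";
}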

Is there any calibration tool between two languages performance?

I'm measuring the performance of two programs, A and B. A is written in Golang, B is written in Python. The important point here is that I'm interested in how the performance value increases over time, not in the absolute performance values of the two programs.
For example,
+------+-----+-----+
| time | A | B |
+------+-----+-----+
| 1 | 3 | 500 |
+------+-----+-----+
| 2 | 5 | 800 |
+------+-----+-----+
| 3 | 9 | 1300|
+------+-----+-----+
| 4 | 13 | 1800|
+------+-----+-----+
The values in columns A and B (A: 3, 5, 9, 13 / B: 500, 800, 1300, 1800) are the execution times of the programs. This execution time can be seen as performance, and the difference between the absolute values for A and B is very large, so directly comparing the slopes of the two performance graphs would be meaningless (Python is very slow compared to Golang).
I want to compare the performance of Program A written in Golang with Program B written in Python, and I'm looking for a calibration tool or formula based on benchmarks that calculates the execution time when Program A is written in Python.
Is there any way to solve this problem?
If you are interested in the relative change, you should normalize the data for each programming language. In other words, divide the Golang values by 3 and the Python values by 500 (the first measurement of each series).
+------+-----+-----+
| time | A | B |
+------+-----+-----+
| 1 | 1 | 1 |
+------+-----+-----+
| 2 | 1.66| 1.6 |
+------+-----+-----+
| 3 | 3 | 2.6 |
+------+-----+-----+
| 4 |4.33 | 3.6 |
+------+-----+-----+
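If it helps, here is a minimal C++ sketch of that normalization (just the arithmetic, not an existing calibration tool):
#include <iostream>
#include <vector>

// Divide every measurement by the first one, so each series starts at 1
// and only the relative growth remains.
std::vector<double> normalize(const std::vector<double>& times) {
    std::vector<double> out;
    out.reserve(times.size());
    for (double t : times)
        out.push_back(t / times.front());
    return out;
}

int main() {
    std::vector<double> a = {3, 5, 9, 13};          // Golang runtimes
    std::vector<double> b = {500, 800, 1300, 1800}; // Python runtimes
    auto na = normalize(a);
    auto nb = normalize(b);
    for (std::size_t i = 0; i < na.size(); ++i)
        std::cout << "time " << i + 1 << ": A=" << na[i] << " B=" << nb[i] << "\n";
}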

Can the LUT cascade be used simultaneously with the carry chain in iCE40 FPGAs by any tools?

I am trying to construct the following:
CO
|
/carry\ ____
s2 ---(((---|I0 |------------ O
+------+((---|I1 |
| +-(+---|I2 |
| | +----|I3__|
| +-(-----------+
| | |
| /carry\ ____ |B ___ BQ
D -----+------(((---|I0 |-+-----| |-+
s0 --+((---|I1 | > | |
s1 ---(+---|I2 | s3 -|S | |
| +-|I3__| s4 -|CE_| |
| +--------------------+
|
/carry\
|||
I write in Verilog and instantiate the SB_LUT4, SB_CARRY, and SB_DFFESS primitives. To try to get a LUT cascade, I edit a .pcf constraints file (set_cascading...). However, synthesis (Lattice IceCube 2017.01.27914) disregards the constraint:
W2401: Ignoring cascade constraint for LUT instance 'filt.blk_0__a.cmbA.l.l', as it is packed with DFF/CARRY in a LogicCell
In the admirable Project IceStorm I can't see any reason why a combination of cascaded LUTs and the carry chain can't be used.
I am aware that a (slightly) newer IceCube2 is available. I know of the Yosys/arachne-pnr/icepack/iceprog toolchain. But before changing toolchains, it seems prudent to ask whether anyone has solved this problem already, or whether it is indeed not possible to combine the carry chain with LUT cascades.
Update: a quick install of Yosys/arachne-pnr/icetools synthesizes my design without warnings, but visualisation in ice40_viewer (and the log output) indicates that the cascaded LUT is not used.

What is the use of Vertica "metadata" resource pool?

In Vertica 8, the "metadata" resource pool was introduced. The documentation describes it as :
The pool that tracks memory allocated for catalog data and storage data structures.
It doesn't seem essential, since the documentation indicates how to disable it using the EnableMetadataMemoryTracking parameter.
What is this pool used for? Since it consumes quite a lot of RAM (4GB on our servers), can I disable it safely?
The metadata pool reflects the Vertica catalog size: it is RAM, reserved dynamically, that the Vertica process has allocated for the catalog.
For example, say you have 32GB of RAM in total; Vertica will use 95% of it (~30.5GB). If you also have a large catalog (~3GB, tons of objects), the Vertica process consumes a couple of extra GB of RAM that, according to the general pool, should be free for queries, and that can cause starvation.
If you use the metadata pool, which dynamically borrows the RAM needed for the catalog from the general pool, your resource management will be better.
BTW, why do you have a 4GB catalog? That's quite large. How much RAM does the Vertica process consume when idle? Does it consume less after a restart and grow over time?
I created a simple script that creates 1000 tables with 100 int columns each, inserts 1 row, and analyzes statistics. You can see how the catalog size grows with the number of objects and how it affects the metadata pool and the Vertica process RAM:
dbadmin=> select (select count(1) from tables),node_name,memory_size_kb,memory_size_actual_kb from resource_pool_status where pool_name ilike 'metadata';
?column? | node_name | memory_size_kb | memory_size_actual_kb
----------+--------------------+----------------+-----------------------
218 | v_vertica_node0001 | 108622 | 108622
218 | v_vertica_node0002 | 119596 | 119596
218 | v_vertica_node0003 | 122374 | 122374
(3 rows)
dbadmin=> select (select count(1) from tables),node_name,memory_size_kb,memory_size_actual_kb from resource_pool_status where pool_name ilike 'metadata'; \! top -n 1 | grep vertica
?column? | node_name | memory_size_kb | memory_size_actual_kb
----------+--------------------+----------------+-----------------------
513 | v_vertica_node0001 | 229210 | 229210
513 | v_vertica_node0002 | 281601 | 281601
513 | v_vertica_node0003 | 289407 | 289407
(3 rows)
476260 dbadmin 20 0 5391m 407m 39m S 109.2 2.6 21:25.64 vertica
dbadmin=> select (select count(1) from tables),node_name,memory_size_kb,memory_size_actual_kb from resource_pool_status where pool_name ilike 'metadata'; \! top -n 1 | grep vertica
?column? | node_name | memory_size_kb | memory_size_actual_kb
----------+--------------------+----------------+-----------------------
825 | v_vertica_node0001 | 352359 | 352359
825 | v_vertica_node0002 | 448032 | 448032
825 | v_vertica_node0003 | 456439 | 456439
(3 rows)
476260 dbadmin 20 0 5564m 554m 39m S 79.2 3.5 38:16.91 vertica
dbadmin=> select (select count(1) from tables),node_name,memory_size_kb,memory_size_actual_kb from resource_pool_status where pool_name ilike 'metadata'; \! top -n 1 | grep vertica
?column? | node_name | memory_size_kb | memory_size_actual_kb
----------+--------------------+----------------+-----------------------
1143 | v_vertica_node0001 | 489867 | 489867
1143 | v_vertica_node0002 | 627409 | 627409
1143 | v_vertica_node0003 | 635616 | 635616
(3 rows)
476260 dbadmin 20 0 5692m 711m 39m S 0.7 4.5 58:13.61 vertica

What is the difference between using NOPs and stalls in MIPS

What difference does it make to use a NOP instead of a stall?
Both seem to do the same job in a pipeline. I can't understand the distinction.
I think you've got your terminology confused.
A stall is injected into the pipeline by the processor to resolve data hazards (situations where the data required to process an instruction is not yet available). A NOP is just an instruction with no side effect.
Stalls
Recall the 5 pipeline stage classic RISC pipeline:
IF - Instruction Fetch (Fetch the next instruction from memory)
ID - Instruction Decode (Figure out which instruction this is and what the operands are)
EX - Execute (Perform the action)
MEM - Memory Access (Store or read from memory)
WB - Write back (Write a result back to a register)
Consider the code snippet:
add $t0, $t1, $t1
sub $t2, $t0, $t0
From here it is obvious that the second instruction relies on the result of the first. This is a data hazard: Read After Write (RAW); a true dependency.
The sub requires the value of the add during its EX phase, but the add will only be in its MEM phase - the value will not be available until the WB phase:
+------------------------------+----+----+----+-----+----+---+---+---+---+
| | CPU Cycles |
+------------------------------+----+----+----+-----+----+---+---+---+---+
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
+------------------------------------------------------------------------+
| 0 | add $t0, $t1, $t1 | IF | ID | EX | MEM | WB | | | | |
| 1 | sub $t2, $t0, $t0 | | IF | ID | EX | | | | | |
+---------+--------------------+----+----+----+-----+----+---+---+---+---+
One solution to this problem is for the processor to insert stalls or bubble the pipeline until the data is available.
+------------------------------+----+----+----+-----+----+----+-----+---+----+
| | CPU Cycles |
+------------------------------+----+----+----+-----+----+----+-----+----+---+
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
+----------------------------------------------------------------------------+
| 0 | add $t0, $t1, $t1 | IF | ID | EX | MEM | WB | | | | |
| 1 | sub $t2, $t0, $t0 | | IF | ID | S | S | EX | MEM | WB | |
+----------+-------------------+----+----+----+-----+----+---+---+---+-------+
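As a toy illustration (hypothetical C++, not MIPS hardware or any particular simulator), the check that triggers those two stall cycles could be sketched like this, assuming no forwarding so a dependent instruction has to wait for the producer's write-back:
#include <iostream>
#include <string>
#include <vector>

struct Instr {
    std::string text;
    std::string dest;                // register written
    std::vector<std::string> srcs;   // registers read
};

int main() {
    std::vector<Instr> prog = {
        {"add $t0, $t1, $t1", "$t0", {"$t1", "$t1"}},
        {"sub $t2, $t0, $t0", "$t2", {"$t0", "$t0"}},
    };

    // RAW hazard against the immediately preceding instruction: with the
    // classic 5-stage pipeline and no forwarding, the consumer sits through
    // two bubble cycles until the producer's result has been written back.
    for (std::size_t i = 1; i < prog.size(); ++i) {
        int bubbles = 0;
        for (const auto& src : prog[i].srcs)
            if (src == prog[i - 1].dest) { bubbles = 2; break; }
        std::cout << prog[i].text << " -> " << bubbles << " stall cycle(s)\n";
    }
}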
NOPs
A NOP is an instruction that does nothing (has no side effect). MIPS assemblers often provide a nop instruction, which in MIPS is equivalent to sll $zero, $zero, 0.
This instruction still takes up all 5 stages of the pipeline. It is most commonly used to fill the branch delay slot of jumps or branches when there is nothing else useful that can be done in that slot.
j label
nop # nothing useful to put here
If you are using a MIPS simulator you may need to enable branch delay slot simulation to see this. (For example, in spim use the -delayed_branches argument)
We should not use a NOP in place of a stall, and vice versa.
We use a stall when a dependency causes a hazard, making a particular pipeline stage wait until the data it requires is available. If a NOP were used instead, the instruction would simply pass through that stage without doing anything; once the stage has completed, the data it needed becomes available and the instruction has to be restarted from the beginning, which increases the processor's average CPI and reduces performance. Also, in some cases the data required by that instruction might be modified by another instruction before the restart, resulting in faulty execution.
The same kind of problem arises if we use a stall in place of a NOP.
Whenever a non-maskable exception such as divide-by-zero occurs in the execute stage, we need to pass through the stages after the exception without changing the state of the processor. Here we use NOPs so the remaining pipeline stages complete without any change to processor state (such as writing into a register or memory a false value generated because of the exception).
We cannot use a stall here, because the next instruction would wait for the stall to complete, and since this is a non-maskable exception (the user cannot control these situations) the stall would never complete and the pipeline would deadlock.
