Does an Amazon GPU Instance Get Exclusive Access to the GPU?

I am running Ubuntu 16.04 on an EC2 p2.xlarge instance with shared tenancy.
The p2.xlarge instance type gives my instance access to a single GPU (one half of an NVIDIA K80 card?).
How is this GPU shared with other Amazon EC2 instances on the same physical machine?
I was under the impression that 100% of the GPU was allocated to my instance, but this is clearly not the case. Even when my instance is running nothing on the GPU:
$ nvidia-smi
Tue Feb 21 00:11:16 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 39C P0 55W / 149W | 0MiB / 11439MiB | 63% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
A moment later:
| 0 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 40C P0 57W / 149W | 0MiB / 11439MiB | 97% Default |
And the next moment:
| 0 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 40C P0 56W / 149W | 0MiB / 11439MiB | 100% Default |
And pretty much stays there...
| 0 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 41C P0 56W / 149W | 0MiB / 11439MiB | 100% Default |
| 0 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 41C P0 56W / 149W | 0MiB / 11439MiB | 99% Default |
What are the rules for GPU allocation for GPU instances?

Related

USRP - daughterboard installation issue

I'm trying to receive a signal with a center frequency of 2e8 Hz and a bandwidth slightly smaller than 2e5 Hz with my USRP X310.
I use the rx_sampletofile.cpp example from UHD 3.10.1, as shown:
./rx_sampletofile --file test.bin --duration 1 --rate 4e8 --nsamps 4e8 --freq 2e8 --type float --bw 8e7 --skip-lo
Looking at the terminal, everything executes without errors or warnings, but I see that the actual RX frequency isn't changed after all...
Setting RX Rate: 20.000000 Msps...
Actual RX Rate: 20.000000 Msps...
Setting RX Freq: 200.000000 MHz...
Setting RX LO Offset: 0.000000 MHz...
Actual RX Freq: 0.000000 MHz...
Setting RX Bandwidth: 0.250000 MHz...
Actual RX Bandwidth: 0.250000 MHz...
I tried changing RX Freq to a lower frequency and also tried --lo-offset, but it always stays at RX Freq = 0 Hz, so if you have any idea I'll take it.
Thanks
It turns out that my issue is related to the daughterboards: the drivers don't detect them properly.
Daughterboard issue:
_____________________________________________________
| | /
| | | RX Dboard: A
| | | ID: Unknown (0x0095)
| | | Serial: 31F94F3
| | | _____________________________________________________
| | | /
| | | | RX Frontend: 0
| | | | Name: Unknown (0x0095) - 0
| | | | Antennas:
| | | | Sensors:
| | | | Freq range: 0.000 to 0.000 MHz
| | | | Gain Elements: None
| | | | Bandwidth range: 0.0 to 0.0 step 0.0 Hz
| | | | Connection Type: IQ
| | | | Uses LO offset: No
| | | _____________________________________________________
| | | /
| | | | RX Codec: A
| | | | Name: ads62p48
| | | | Gain range digital: 0.0 to 6.0 step 0.5 dB
| | _____________________________________________________
| | /
| | | RX Dboard: B
| | | _____________________________________________________
| | | /
| | | | RX Frontend: 0
| | | | Name: Unknown (0xffff) - 0
| | | | Antennas:
| | | | Sensors:
| | | | Freq range: 0.000 to 0.000 MHz
| | | | Gain Elements: None
| | | | Bandwidth range: 0.0 to 0.0 step 0.0 Hz
| | | | Connection Type: IQ
| | | | Uses LO offset: No
| | | _____________________________________________________
| | | /
| | | | RX Codec: B
| | | | Name: ads62p48
| | | | Gain range digital: 0.0 to 6.0 step 0.5 dB
| | _____________________________________________________
Your UHD is too old for your hardware revision of the TwinRX daughterboard.
The only solution is to use a more modern version of UHD. This will also require you to load a more modern version of the FPGA image.
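As a rough sketch of that upgrade path (the device address below is an assumption; adjust it to your network setup), UHD provides two helper utilities for fetching and flashing matching images:

# Fetch the FPGA/firmware images that match the installed UHD version.
uhd_images_downloader

# Flash the matching FPGA image onto the X310 over Ethernet, then power-cycle the device.
uhd_image_loader --args="type=x300,addr=192.168.10.2"

After reflashing, uhd_usrp_probe should identify the TwinRX daughterboards correctly.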

How to compare two circuits based on their utilization

I have some hardware IPs that I need to synthesize, and each IP contains several generic parameters I can play with. Each combination of parameters gives me a different utilization report after synthesis and implementation.
For example, for two different configurations, Design_1 and Design_2, I get the following in Vivado 2018.1. The last row is the ratio of the values of Design_2 divided by the values of Design_1.
As you can see in this simple example, Design_2 has fewer Slice LUTs but slightly more F7 Muxes.
My question is: how do I reason about the cost of each one? Should I favor Slice LUTs, or Registers, etc.?
+----------+------------+-----------------+----------+--------+--------------+---------------------+
| Name     | Slice LUTs | Slice Registers | F7 Muxes | Slice  | LUT as Logic | LUT Flip Flop Pairs |
+----------+------------+-----------------+----------+--------+--------------+---------------------+
| Design_1 | 34124      | 16913           | 1453     | 10272  | 31538        | 9020                |
| Design_2 | 34097      | 16913           | 1550     | 10189  | 31511        | 9021                |
| (2)/(1)  | 0.9992     | 1.0000          | 1.0668   | 0.9919 | 0.9991       | 1.0001              |
+----------+------------+-----------------+----------+--------+--------------+---------------------+
(All other reported resources, i.e. F8 Muxes, LUT as Memory, Block RAM Tiles, DSPs, bonded IOBs, and the clocking and I/O primitives, are identical between the two designs, so those columns are omitted; their ratios are 1, or undefined where both values are 0.)
It depends on your needs. LUTs and F7 muxes are different physical cells in your FPGA, so even if you don't use them, they are still there.
If one resource is more critical than the other, you should try to minimize the utilization of the critical resource to simplify place and route.
If nothing is critical, I think it is better to use F7 muxes first, because Slice LUTs are more flexible for the rest of your design.
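As a practical aside, if you want one report per configuration to diff, a minimal sketch from the Vivado Tcl console (the run and file names are placeholders):

# Open the implemented run and dump its utilization report to a file.
open_run impl_1
report_utilization -file design_1_util.rpt

Generating one such file per parameter combination makes the per-resource deltas easy to compare mechanically.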

Extract values from output/file and pipe to email if !=

I have a server with hwraid (megaclisas) installed: https://hwraid.le-vert.net/wiki/DebianPackages
The sample output looks like:
-- Controller information --
-- ID | H/W Model | RAM | Temp | BBU | Firmware
c0 | PERC H310 Mini | 0MB | 59C | Absent | FW: 20.13.3-0001
-- Array information --
-- ID | Type | Size | Strpsz | Flags | DskCache | Status | OS Path | CacheCade |InProgress
c0u0 | RAID-10 | 3272G | 64 KB | RA,WT | Default | Optimal | /dev/sda | None |None
-- Disk information --
-- ID | Type | Drive Model | Size | Status | Speed | Temp | Slot ID | LSI ID
c0u0s0p0 | HDD | SEAGATE ST900MM0006 LS0AS0N3Bxxx | 837. Gb | Online, Spun Up | 6.0Gb/s | 31C | [32:0] | 0
c0u0s0p1 | HDD | SEAGATE ST900MM0006 LS0AS0N3Bxxx | 837. Gb | Online, Spun Up | 6.0Gb/s | 31C | [32:1] | 1
c0u0s1p0 | HDD | SEAGATE ST900MM0006 LS0AS0N3Bxxx | 837. Gb | Online, Spun Up | 6.0Gb/s | 28C | [32:2] | 2
c0u0s1p1 | HDD | SEAGATE ST900MM0006 LS0AS0N3Bxxx | 837. Gb | Online, Spun Up | 6.0Gb/s | 30C | [32:3] | 3
c0u0s2p0 | HDD | SEAGATE ST900MM0006 LS0AS0N3Bxxx | 837. Gb | Online, Spun Up | 6.0Gb/s | 29C | [32:4] | 4
c0u0s2p1 | HDD | SEAGATE ST900MM0006 LS0AS0N3Bxxx | 837. Gb | Online, Spun Up | 6.0Gb/s | 31C | [32:5] | 5
c0u0s3p0 | HDD | SEAGATE ST900MM0006 LS0AS0N3Bxxx | 837. Gb | Online, Spun Up | 6.0Gb/s | 30C | [32:7] | 7
c0u0s3p1 | HDD | SEAGATE ST900MM0006 LS0AS0N3Bxxx | 837. Gb | Online, Spun Up | 6.0Gb/s | 28C | [32:6] | 6
What I want to achieve is to grep the Status value if it is not equal to Optimal or Online, and then pipe the result to email. The problem I have is how to do that using sed or awk.
Here is one way to proceed:
data=$(mktemp)
externalprogram > "$data"
RESULT=$(
    grep '| RAID' "$data" | sed -n '/Optimal/!p'
    grep '| HDD' "$data" | sed -n '/Online,/!p'
)
rm "$data"
echo "$RESULT"
Explanation: grep the lines of interest (arrays and disks), and print each line only if the expected healthy status (Optimal or Online) was not found.
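To finish the "pipe to email" part, a minimal sketch assuming a configured local MTA and the mail command from mailx/mailutils (the recipient address is a placeholder):

# Send a mail only when something is NOT Optimal / Online.
if [ -n "$RESULT" ]; then
    echo "$RESULT" | mail -s "RAID status alert on $(hostname)" admin@example.com
fi

Running the whole script from cron turns this into a simple automated health check.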

Tensorflow: Multi-GPU training cannot make all GPUs run at the same time

I have a machine with 3x GTX 1080 GPUs. Below is the training code:
dynamic_learning_rate = tf.placeholder(tf.float32, shape=[])
model_version = tf.constant(1, tf.int32)

with tf.device('/cpu:0'):
    with tf.name_scope('Input'):
        # Input images and labels.
        batch_images, \
        batch_input_vectors, \
        batch_one_hot_labels, \
        batch_file_paths, \
        batch_labels = self.get_batch()

    grads = []
    pred = []
    cost = []

    # Define optimizer
    optimizer = tf.train.MomentumOptimizer(learning_rate=dynamic_learning_rate / self.batch_size,
                                           momentum=0.9,
                                           use_nesterov=True)

    # Split the input batch across the available GPUs (one shard per tower).
    split_input_image = tf.split(batch_images, self.num_gpus)
    split_input_vector = tf.split(batch_input_vectors, self.num_gpus)
    split_input_one_hot_label = tf.split(batch_one_hot_labels, self.num_gpus)

    for i in range(self.num_gpus):
        with tf.device(tf.DeviceSpec(device_type="GPU", device_index=i)):
            with tf.variable_scope(tf.get_variable_scope(), reuse=i > 0):
                with tf.name_scope('Model'):
                    # Construct model
                    with tf.variable_scope("inference"):
                        tower_pred = self.model(split_input_image[i], split_input_vector[i], is_training=True)
                    pred.append(tower_pred)

                with tf.name_scope('Loss'):
                    # Define loss and optimizer
                    softmax_cross_entropy_cost = tf.reduce_mean(
                        tf.nn.softmax_cross_entropy_with_logits(logits=tower_pred,
                                                                labels=split_input_one_hot_label[i]))
                    cost.append(softmax_cross_entropy_cost)

    # Concatenate the per-tower predictions and average the per-tower losses.
    pred = tf.concat(pred, 0)
    cost = tf.reduce_mean(cost)

    # L2 regularization
    trainable_vars = tf.trainable_variables()
    l2_regularization = tf.add_n(
        [tf.nn.l2_loss(v) for v in trainable_vars if any(x in v.name for x in ['weights', 'biases'])])
    for v in trainable_vars:
        if any(x in v.name for x in ['weights', 'biases']):
            print(v.name + ' - included for L2 regularization!')
        else:
            print(v.name)
    cost = cost + self.l2_regularization_strength * l2_regularization

    with tf.name_scope('Accuracy'):
        # Evaluate model
        correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(batch_one_hot_labels, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
        prediction = tf.nn.softmax(pred, name='softmax')

    # Create a variable to hold the global_step.
    global_step = tf.Variable(0, trainable=False, name='global_step')

    # Minimization
    update = optimizer.minimize(cost, global_step=global_step, colocate_gradients_with_ops=True)
After I run the training:
Fri Nov 10 12:28:00 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:03:00.0 Off | N/A |
| 42% 65C P2 62W / 198W | 7993MiB / 8114MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:04:00.0 Off | N/A |
| 33% 53C P2 150W / 198W | 7886MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 1080 Off | 00000000:05:00.0 On | N/A |
| 26% 54C P2 170W / 198W | 7883MiB / 8108MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 23228 C python 7982MiB |
| 1 23228 C python 7875MiB |
| 2 4793 G /usr/lib/xorg/Xorg 40MiB |
| 2 23228 C python 7831MiB |
+-----------------------------------------------------------------------------+
Fri Nov 10 12:28:36 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:03:00.0 Off | N/A |
| 42% 59C P2 54W / 198W | 7993MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:04:00.0 Off | N/A |
| 33% 57C P2 154W / 198W | 7886MiB / 8114MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 1080 Off | 00000000:05:00.0 On | N/A |
| 27% 55C P2 155W / 198W | 7883MiB / 8108MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 23228 C python 7982MiB |
| 1 23228 C python 7875MiB |
| 2 4793 G /usr/lib/xorg/Xorg 40MiB |
| 2 23228 C python 7831MiB |
+-----------------------------------------------------------------------------+
You can see that whenever the first GPU is busy, the other two GPUs are idle, and vice versa. The alternation period is about 0.5 seconds.
With a single GPU the training speed is around 650 images/second; with all 3 GPUs I get only 1050 images/second.
Any idea what the problem is?
You need to make sure that all the trainable variables are on the controller device (usually the CPU) and that all the other worker devices (usually GPUs) use the variables from the CPU in parallel. Otherwise, variables end up scattered across the GPUs and each tower stalls waiting on variables that live on another GPU, which can produce exactly the alternating utilization pattern you observed.
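A minimal sketch of one common TF 1.x pattern for this: a device function passed to tf.device that pins variable ops to the CPU while each tower's compute ops stay on its GPU. The helper name assign_to_device and the PS_OPS list are conventions from multi-GPU examples, not part of the original code:

import tensorflow as tf

# Op types that create/hold variable state; these should live on the parameter device.
PS_OPS = ['Variable', 'VariableV2', 'AutoReloadVariable']

def assign_to_device(worker_device, ps_device='/cpu:0'):
    # Return a device function: variable ops go to ps_device, everything else to worker_device.
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        return ps_device if node_def.op in PS_OPS else worker_device
    return _assign

# In the tower loop, replace tf.device(tf.DeviceSpec(device_type="GPU", device_index=i)) with:
#     with tf.device(assign_to_device('/gpu:{}'.format(i))):
#         ...build the tower as before...

With the variables on the CPU, every tower reads the same copy in parallel instead of waiting on copies pinned to other GPUs.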

How can I get the list of GPU cards to which monitors are connected?

How can I get the list of GPU cards to which monitors are connected?
Can I get a list with these parameters: pciBusID, pciDeviceID, pciDomainID?
OS: Windows 7
GPUs: nVidia GeForce/Quadro
We can use the nvidia-smi utility, which is included with the NVIDIA video drivers, to see which GPU card a display is connected to (reported reliably only for professional GPU cards: Quadro/Tesla):
Windows: C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe
Linux: /usr/local/cuda/bin/nvidia-smi
Example nvidia-smi output:
+------------------------------------------------------+
| NVIDIA-SMI 332.88 Driver Version: 332.88 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro K4000 WDDM | 0000:01:00.0 Off | N/A |
| 30% 30C P8 9W / 87W | 3027MiB / 3071MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GT 640 WDDM | 0000:02:00.0 N/A | N/A |
| 40% 27C N/A N/A / N/A | 2005MiB / 2047MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 2 Quadro K4000 WDDM | 0000:03:00.0 On | N/A |
| 30% 34C P8 11W / 87W | 3028MiB / 3071MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Here Disp.A shows on which GPU card a display is active:
Off - display is not connected
On - display is connected
N/A - unknown (for non-professional cards: GeForce)
So we can say that the display is connected to GPU 2: Quadro K4000, 0000:03:00.0.
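To also get the PCI identifiers the question asks for (domain, bus, device) in one machine-readable list, a sketch using nvidia-smi's query interface (assuming a driver recent enough to support it):

nvidia-smi --query-gpu=index,name,pci.domain,pci.bus,pci.device,display_active --format=csv

GPUs whose display_active field reads Enabled are the ones driving a monitor.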
