Amazon EC2: how to install ixgbevf on a CentOS 7 instance? - amazon-ec2

I'm trying to install ixgbevf on an Amazon EC2 CentOS 7 instance. The steps look good, but every time I run the instance on an Enhanced Networking-enabled type, such as m4.xlarge, the network seems to fail (I cannot connect to the instance after startup).
Here's what I did:
wget http://elrepo.org/linux/elrepo/el7/x86_64/RPMS/kmod-ixgbevf-2.16.1-1.el7.elrepo.x86_64.rpm
rpm -ivh kmod-ixgbevf-2.16.1-1.el7.elrepo.x86_64.rpm
cp -p /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut -f
Then I shut down the instance and set the SR-IOV attribute:
ec2-modify-instance-attribute instance_id --sriov simple
That's all. Whenever the type (such as t2.micro) doesn't support Enhanced Networking, the instance works fine. But if I change the type to an Enhanced Networking-enabled one (such as m4.xlarge), the instance simply can't be accessed. Does anyone have any idea about this? Did I miss something?

The answer lies buried in this section of the official documentation:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html#enhanced-networking-linux
In a nutshell, CentOS 7 already ships with the ixgbevf module, although not the latest version, but that is hardly a problem. What was causing my instance to be unreachable after a reboot were the "predictable network interface names", which are enabled by default.
To disable them, simply visit that link, jump straight to step 6 and type:
$ rpm -qa | grep -e '^systemd-[0-9]\+\|^udev-[0-9]\+'
$ sudo sed -i '/^GRUB\_CMDLINE\_LINUX/s/\"$/\ net\.ifnames\=0\"/' /etc/default/grub
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg
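Before rebooting, you can sanity-check that the module is actually available and that the kernel argument landed in the grub config (a quick verification of my own, not part of the documented steps):
$ modinfo ixgbevf | grep -i version
$ grep GRUB_CMDLINE_LINUX /etc/default/grub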
After that:
Stop the instance
Enable Enhanced Networking via the AWS CLI (example below)
Restart it
You should now be able to log in!
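For reference, with the current unified AWS CLI the enable step (run while the instance is stopped) looks roughly like this; the instance id is a placeholder:
$ aws ec2 modify-instance-attribute --instance-id <instance_id> --sriov-net-support simple
$ aws ec2 describe-instance-attribute --instance-id <instance_id> --attribute sriovNetSupport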

On Oracle Linux 6.9 (same as RHEL 6 / CentOS 6), in an AWS placement group, running iperf3 between two r2.xlarge instances I got just shy of 2.5 Gbit/s. ethtool reports vif, but the ixgbevf driver is installed. Without SRIOV set to simple, most instances seem to max out at about 1 Gbit/s.
[ 4] local 10.11.5.61 port 52754 connected to 10.11.5.222 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 268 MBytes 2.25 Gbits/sec 56 559 KBytes
[ 4] 1.00-2.00 sec 296 MBytes 2.48 Gbits/sec 54 629 KBytes
[ 4] 2.00-3.00 sec 296 MBytes 2.48 Gbits/sec 61 551 KBytes
[ 4] 3.00-4.00 sec 296 MBytes 2.48 Gbits/sec 62 454 KBytes
[ 4] 4.00-5.00 sec 296 MBytes 2.48 Gbits/sec 55 551 KBytes
[ 4] 5.00-6.00 sec 288 MBytes 2.42 Gbits/sec 50 454 KBytes
[ 4] 6.00-7.00 sec 291 MBytes 2.44 Gbits/sec 55 559 KBytes
[ 4] 7.00-8.00 sec 296 MBytes 2.48 Gbits/sec 55 507 KBytes
[ 4] 8.00-9.00 sec 296 MBytes 2.48 Gbits/sec 60 472 KBytes
[ 4] 9.00-10.00 sec 296 MBytes 2.48 Gbits/sec 59 559 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 2.85 GBytes 2.45 Gbits/sec 567 sender
[ 4] 0.00-10.00 sec 2.85 GBytes 2.45 Gbits/sec receiver
Note that speeds are slower between different instance families, even in a placement group, as they may have to be placed on different physical machines.
You can also look at adding ENA interfaces to see if you get better speeds on m4 and c4 instances. ENA is also the only supported networking on the newer m5 and c5 instance types.
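If you want to reproduce a test like the one above, the basic iperf3 invocation is just a server on one instance and a client on the other (the address below is the receiver IP from the output above):
# on the receiving instance
iperf3 -s
# on the sending instance
iperf3 -c 10.11.5.222 -t 10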

Related

Why does Ceph turn its status to Err when there is still available storage space?

I built a 3-node Ceph cluster recently. Each node has seven 1TB HDDs for OSDs, so in total I have 21 TB of storage space for Ceph.
However, when I ran a workload that kept writing data to Ceph, it turned to Err status and no data could be written to it any more.
The output of ceph -s is:
  cluster:
    id:     06ed9d57-c68e-4899-91a6-d72125614a94
    health: HEALTH_ERR
            1 full osd(s)
            4 nearfull osd(s)
            7 pool(s) full

  services:
    mon: 1 daemons, quorum host3
    mgr: admin(active), standbys: 06ed9d57-c68e-4899-91a6-d72125614a94
    osd: 21 osds: 21 up, 21 in
    rgw: 4 daemons active

  data:
    pools:   7 pools, 1748 pgs
    objects: 2.03M objects, 7.34TiB
    usage:   14.7TiB used, 4.37TiB / 19.1TiB avail
    pgs:     1748 active+clean
Based on my understanding, since there is still 4.37 TiB of space left, Ceph itself should take care of balancing the data and keep each OSD from reaching full or nearfull status. But it doesn't work as I expected: 1 full osd and 4 nearfull osds show up, and the health is HEALTH_ERR.
I can't access Ceph with hdfs or s3cmd anymore, so here come the questions:
1. Is there any explanation for the current issue?
2. How can I recover from it? Should I delete data on the Ceph node directly with ceph-admin and relaunch Ceph?
I didn't get an answer for 3 days, but I made some progress, so let me share my findings here.
1. It's normal for different OSDs to have a size gap. If you list OSDs with ceph osd df, you will find that different OSDs have different usage ratios.
2. To recover from this issue (by which I mean the cluster being stuck because an OSD is full), follow the steps below; they are mostly from Red Hat's documentation.
Get the cluster health info with ceph health detail. It's not strictly necessary, but you can get the ID of the failed OSD.
Use ceph osd dump | grep full_ratio to get the current full ratios. Do not use the statement listed at the above link, it is obsolete. The output looks like:
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
Set the OSD full ratio a little higher with ceph osd set-full-ratio <ratio>. Generally, we set the ratio to 0.97.
Now the cluster status will change from HEALTH_ERR to HEALTH_WARN or HEALTH_OK. Remove some data that can be released.
Change the OSD full ratio back to the previous value. It can't stay at 0.97 forever, since that is a little risky.
Hope this thread is helpful to someone who runs into the same issue. For details about OSD configuration, please refer to the Ceph documentation.
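Putting the recovery steps above together as commands (a sketch using the 0.95/0.97 ratios mentioned above; adjust them to your own cluster):
ceph health detail
ceph osd dump | grep full_ratio
ceph osd set-full-ratio 0.97
# free up or delete data here, then restore the original ratio
ceph osd set-full-ratio 0.95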
Ceph requires free disk space to move storage chunks, called placement groups (pgs), between different disks. As this free space is so critical to the underlying functionality, Ceph will go into HEALTH_WARN once any OSD reaches the nearfull ratio (generally 85% full), and will stop write operations on the cluster by entering the HEALTH_ERR state once an OSD reaches the full ratio.
However, unless your cluster is perfectly balanced across all OSDs there is likely much more capacity available, as OSDs are typically unevenly utilized. To check overall utilization and available capacity you can run ceph osd df.
Example output:
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 72 MiB 3.6 GiB 742 GiB 73.44 1.06 406 up
5 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 119 MiB 3.3 GiB 726 GiB 74.00 1.06 414 up
12 hdd 2.72849 1.00000 2.7 TiB 2.2 TiB 2.2 TiB 72 MiB 3.7 GiB 579 GiB 79.26 1.14 407 up
14 hdd 2.72849 1.00000 2.7 TiB 2.3 TiB 2.3 TiB 80 MiB 3.6 GiB 477 GiB 82.92 1.19 367 up
8 ssd 0.10840 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 up
1 hdd 2.72849 1.00000 2.7 TiB 1.7 TiB 1.7 TiB 27 MiB 2.9 GiB 1006 GiB 64.01 0.92 253 up
4 hdd 2.72849 1.00000 2.7 TiB 1.7 TiB 1.7 TiB 79 MiB 2.9 GiB 1018 GiB 63.55 0.91 259 up
10 hdd 2.72849 1.00000 2.7 TiB 1.9 TiB 1.9 TiB 70 MiB 3.0 GiB 887 GiB 68.24 0.98 256 up
13 hdd 2.72849 1.00000 2.7 TiB 1.8 TiB 1.8 TiB 80 MiB 3.0 GiB 971 GiB 65.24 0.94 277 up
15 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 58 MiB 3.1 GiB 793 GiB 71.63 1.03 283 up
17 hdd 2.72849 1.00000 2.7 TiB 1.6 TiB 1.6 TiB 113 MiB 2.8 GiB 1.1 TiB 59.78 0.86 259 up
19 hdd 2.72849 1.00000 2.7 TiB 1.6 TiB 1.6 TiB 100 MiB 2.7 GiB 1.2 TiB 56.98 0.82 265 up
7 ssd 0.10840 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 up
0 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 105 MiB 3.0 GiB 734 GiB 73.72 1.06 337 up
3 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 98 MiB 3.0 GiB 781 GiB 72.04 1.04 354 up
9 hdd 2.72849 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 up
11 hdd 2.72849 1.00000 2.7 TiB 1.9 TiB 1.9 TiB 76 MiB 3.0 GiB 817 GiB 70.74 1.02 342 up
16 hdd 2.72849 1.00000 2.7 TiB 1.8 TiB 1.8 TiB 98 MiB 2.7 GiB 984 GiB 64.80 0.93 317 up
18 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 79 MiB 3.0 GiB 792 GiB 71.65 1.03 324 up
6 ssd 0.10840 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 up
TOTAL 47 TiB 30 TiB 30 TiB 1.3 GiB 53 GiB 16 TiB 69.50
MIN/MAX VAR: 0.82/1.19 STDDEV: 6.64
As you can see in the above output, the used OSDs vary from 56.98% (OSD 19) to 82.92% (OSD 14) utilized, which is a significant variance.
As only a single OSD is full, and only 4 of your 21 OSDs are nearfull, you likely have a significant amount of storage still available in your cluster, which means that it is time to perform a rebalance operation. This can be done manually by reweighting OSDs, or you can have Ceph do a best-effort rebalance by running the command ceph osd reweight-by-utilization. Once the rebalance is complete (i.e. you have no misplaced objects in ceph status) you can check the variation again (using ceph osd df) and trigger another rebalance if required.
If you are on Luminous or newer you can enable the balancer plugin to handle OSD reweighting automatically.
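Enabling the balancer on Luminous or newer looks roughly like this (a sketch; upmap mode additionally requires all clients to be Luminous or newer):
ceph mgr module enable balancer
ceph balancer mode upmap
ceph balancer on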

building software using bazel is slower than make?

My team has a project which is not too big; building it with make -js takes 40 seconds, but with bazel the time increased to 70 seconds. Here is the profile of the bazel build process. I noticed that SKYFUNCTION takes 47% of the time; is that reasonable?
PROFILES
the last section of it:
Type Total Count Average
ACTION 0.03% 77 0.70 ms
ACTION_CHECK 0.00% 4 0.90 ms
ACTION_EXECUTE 40.40% 77 912 ms
ACTION_UPDATE 0.00% 74 0.02 ms
ACTION_COMPLETE 0.19% 77 4.28 ms
INFO 0.00% 1 0.05 ms
VFS_STAT 1.07% 117519 0.02 ms
VFS_DIR 0.27% 4613 0.10 ms
VFS_MD5 0.22% 151 2.56 ms
VFS_DELETE 4.43% 53830 0.14 ms
VFS_OPEN 0.01% 232 0.11 ms
VFS_READ 0.06% 3523 0.03 ms
VFS_WRITE 0.00% 4 0.97 ms
WAIT 0.05% 156 0.56 ms
SKYFRAME_EVAL 6.23% 1 10.830 s
SKYFUNCTION 47.01% 687 119 ms
@ittai, @Jin, @Ondrej K: I have tried switching off sandboxing in bazel, and it is much faster than with it switched on. Here is the comparison:
SWITCHED ON: 70s
SWITCHED OFF: 33s±2
SKYFUNCTION still takes 47% of the total execution time, but the average time per call dropped from 119 ms to 21 ms.
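For reference, sandboxing can be turned off per invocation with the spawn strategy flag (the target label below is a placeholder):
bazel build --spawn_strategy=local //your:target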

AB (Apache Benchmark) with authentication

I want to test the performance of my Apache server with ab (Apache Benchmark) with authentication.
I followed the steps in this tutorial:
Using Apache Benchmark (ab) on sites with authentication
and when I execute the command
ab -c 1 -n 1 -C PHPSESSID=65pnirttcbn0l6seutjkf28452 http://my-web-site
the authentication does not pass:
testeur#ERP:~$ ab -c 1 -n 1 -C PHPSESSID=65pnirttcbn0l6seutjkf28452 http://my-web-site.com/mapviewimproved
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, `http://www.zeustech.net/`
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking my-web-site.com (be patient).....done
Server Software: Apache
Server Hostname: algeotrack.com
Server Port: 80
Document Path: /my-page
Document Length: 0 bytes
Concurrency Level: 1
Time taken for tests: 0.627 seconds
Complete requests: 1
Failed requests: 0
Write errors: 0
Non-2xx responses: 1
Total transferred: 335 bytes
HTML transferred: 0 bytes
Requests per second: 1.59 [#/sec] (mean)
Time per request: 627.320 [ms] (mean)
Time per request: 627.320 [ms] (mean, across all concurrent requests)
Transfer rate: 0.52 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 36 36 0.0 36 36
Processing: 591 591 0.0 591 591
Waiting: 591 591 0.0 591 591
Total: 627 627 0.0 627 627
Note that the application is developed with Zend Framework 1.
Can you help me, please?
You have to quote the cookie value:
ab -c 1 -n 1 -C 'PHPSESSID=65pnirttcbn0l6seutjkf28452' http://my-web-site.com/mapviewimproved
see this other question as reference: How do I pass a complex cookie to ab for testing?
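If quoting the -C value still doesn't authenticate, the same session cookie can also be sent as an explicit request header; both -C and -H are standard ab options:
ab -c 1 -n 1 -H 'Cookie: PHPSESSID=65pnirttcbn0l6seutjkf28452' http://my-web-site.com/mapviewimproved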

PHP processing speed apache 2.4 mpm-prefork mod_php 5.4 vs nginx 1.2.x PHP-FPM 5.4

I've been looking for days to see if someone has done a good, documented, PHP processing speed comparison between
apache-mpm-prefork 2.4 with mod_php 5.4
and
nginx 1.2.x + PHP-FPM 5.4
Why I'm looking: the only tests I've seen are benchmarks serving full pages or "Hello, World"-style tests, without proper documentation of what exactly was tested. I don't care about the requests/second or the hardware, but I do need to see what PHP script was tested and with what exact configuration.
Why these two: mod_php used to be known as the fastest at processing PHP (no static files, no request/response measuring, just processing the PHP itself), but a lot has changed since then, including the Apache version. Nginx and PHP-FPM use a lot less memory, so that would be a good reason to change the architecture, but if they're not fast enough in this case, the change would be irrelevant.
I know that if I'm unable to find such a test I'll have to do it myself, but I can't believe no one has done it so far :)
I have completed this test on CentOS 6.3 using nginx 1.2.7, apache 2.4.3 and php 5.4.12, all compiled with no changes to the defaults:
./configure
make && make install
with the exception of php, where I enabled php-fpm:
./configure --enable-fpm
All servers have a 100% default config except as noted below. All testing was done on a test server, with no load and a reboot between tests. The server has an Intel(R) Xeon(R) CPU E3-1230, 1GB RAM and 2 x 60GB SSDs in RAID 1. Tests were run using ab -n 50000 -c 500 http://127.0.0.1/test.php
Test PHP script:
<?php
$testing = 0;
for ($i = 0; $i < 1000; $i++) {
    $testing++;
}
echo $testing;
I also had to enable php in nginx.conf as it's not enabled by default.
location ~ \.php$ {
    fastcgi_pass   127.0.0.1:9000;
    fastcgi_index  index.php;
    fastcgi_param  SCRIPT_FILENAME  /var/www/html$fastcgi_script_name;
    include        fastcgi_params;
}
Nginx - PHP-FPM on 127.0.0.1:9000
Concurrency Level: 500
Time taken for tests: 10.932 seconds
Complete requests: 50000
Failed requests: 336
(Connect: 0, Receive: 0, Length: 336, Exceptions: 0)
Write errors: 0
Non-2xx responses: 336
Total transferred: 7837824 bytes
HTML transferred: 379088 bytes
Requests per second: 4573.83 [#/sec] (mean)
Time per request: 109.317 [ms] (mean)
Time per request: 0.219 [ms] (mean, across all concurrent requests)
Transfer rate: 700.17 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 34 338.5 0 7000
Processing: 0 34 166.5 23 8120
Waiting: 0 34 166.5 23 8120
Total: 13 68 409.2 23 9846
Percentage of the requests served within a certain time (ms)
50% 23
66% 28
75% 32
80% 33
90% 34
95% 46
98% 61
99% 1030
100% 9846 (longest request)
Nginx - PHP-FPM via socket (config change to fastcgi_pass)
fastcgi_pass unix:/var/lib/php/php.sock;
Concurrency Level: 500
Time taken for tests: 7.054 seconds
Complete requests: 50000
Failed requests: 351
(Connect: 0, Receive: 0, Length: 351, Exceptions: 0)
Write errors: 0
Non-2xx responses: 351
Total transferred: 7846209 bytes
HTML transferred: 387083 bytes
Requests per second: 7087.70 [#/sec] (mean)
Time per request: 70.545 [ms] (mean)
Time per request: 0.141 [ms] (mean, across all concurrent requests)
Transfer rate: 1086.16 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 26 252.5 0 7001
Processing: 0 24 112.9 17 3683
Waiting: 0 24 112.9 17 3683
Total: 7 50 306.4 17 7001
Percentage of the requests served within a certain time (ms)
50% 17
66% 19
75% 20
80% 21
90% 23
95% 31
98% 55
99% 1019
100% 7001 (longest request)
Apache - mod_php
Concurrency Level: 500
Time taken for tests: 10.979 seconds
Complete requests: 50000
Failed requests: 0
Write errors: 0
Total transferred: 9800000 bytes
HTML transferred: 200000 bytes
Requests per second: 4554.02 [#/sec] (mean)
Time per request: 109.793 [ms] (mean)
Time per request: 0.220 [ms] (mean, across all concurrent requests)
Transfer rate: 871.67 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 22 230.2 1 7006
Processing: 0 58 426.0 18 9612
Waiting: 0 58 425.9 18 9611
Total: 5 80 523.8 19 10613
Percentage of the requests served within a certain time (ms)
50% 19
66% 23
75% 25
80% 26
90% 31
95% 36
98% 1012
99% 1889
100% 10613 (longest request)
I'll be more than happy to tune apache further, but it seems apache just can't keep up. The clear winner is nginx with php-fpm via socket.
It seems you are comparing apples with oranges, or, to put it more accurately, you are confounding the results by adjusting two variables. Surely it would be more sensible to compare Apache + fastcgi + php-fpm against nginx + php-fpm? You'd expect the php-fpm part to be the same, so then you would be measuring the better of Apache+fastcgi vs nginx.
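For the Apache side of that comparison, a minimal configuration handing PHP to the same PHP-FPM listener would look roughly like this (a sketch using mod_proxy_fcgi, which requires Apache 2.4.10 or later, i.e. newer than the 2.4.3 build tested above):
# requires mod_proxy and mod_proxy_fcgi to be loaded
<FilesMatch "\.php$">
    SetHandler "proxy:fcgi://127.0.0.1:9000"
</FilesMatch>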

MySQL stopped working after Mountain Lion upgrade and Server.app installation

I did an upgrade this morning from Snow Leopard Server to Mountain Lion and installed the Server app as well. Now I can't connect to MySQL and I fear all my databases are lost. Has anyone had this problem who can provide a solution?
The first issue seems to be with the mysql.sock file: it's not present, so I can't connect from anything or dump to files and start over. Here is what happens when I run mysqld:
/usr/libexec/mysqld
130102 17:07:48 [Warning] Setting lower_case_table_names=2 because file system for /var/mysql/ is case insensitive
InnoDB: The InnoDB memory heap has been disabled.
InnoDB: Mutex and rw_lock use GCC atomic builtins.
InnoDB: Log scan progressed past the checkpoint lsn 0 36808
130102 17:07:48 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
InnoDB: Doing recovery: scanned up to log sequence number 0 43655
130102 17:07:48 InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
130102 17:07:49 InnoDB: Started; log sequence number 0 43655
130102 17:07:49 [Note] Recovering after a crash using mysql-bin
130102 17:07:49 [Note] Starting crash recovery...
130102 17:07:49 [Note] Crash recovery finished.
130102 17:07:49 [ERROR] Fatal error: Can't open and lock privilege tables: Table 'mysql.host' doesn't exist
I've tried mysqld_update and a bunch of other stuff. My main goal now is to get the data out somehow and do a clean install, but I can't seem to find the data.
If I do a locate in the terminal on a database name, I find it stuffed away in /Library/Server/Previous/private/var/mysql/DBNAME, but I can't access that location in either Terminal or Finder (even as root); trying to cd into it gives me a "not exists" in return.
MySQL is no longer installed by default as of 10.7 Lion. Apple has a technical bulletin that covers installing MySQL after upgrading to 10.7, or in your case 10.8.
Default location for mysql data files is:
/usr/local/mysql/data
Maybe the data files are still there. If they are, copy them, re-install MySQL, and then copy the data files back to the original location.
I am not 100% sure it works, but I have done a similar trick on Snow Leopard some time ago...
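If the old data directory really is the one under /Library/Server/Previous, a rough recovery sketch could look like this (the paths follow the locations mentioned above; the _mysql owner is an assumption about a stock MySQL install):
# copy the old databases somewhere readable
sudo cp -R /Library/Server/Previous/private/var/mysql ~/mysql-data-backup
# after re-installing MySQL and stopping it, put the files in place
sudo cp -R ~/mysql-data-backup/ /usr/local/mysql/data/
sudo chown -R _mysql:_mysql /usr/local/mysql/data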
