Odd NUMA behavior / system topology?

I have a two socket system. I have disabled hyperthreading in BIOS.
numactl --hardware shows this:
ucs48:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 12222 MB
node 0 free: 11192 MB
node 1 cpus: 1 3 5 7
node 1 size: 12288 MB
node 1 free: 11366 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
Why are the CPU numbers for node 0 not 0 1 2 3, and for node 1 not 4 5 6 7?
On some other systems I get contiguous CPU numbers within a NUMA node. Is there any configuration I can change to fix this, and what is the root cause?
My kernel command line is:
BOOT_IMAGE=/vmlinuz-3.2.0-23-generic root=/dev/mapper/fe--ucs48-root ro intel_iommu=on
Some additional info:
ucs48:/proc# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Stepping: 2
CPU MHz: 2395.000
BogoMIPS: 4800.19
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0,2,4,6
NUMA node1 CPU(s): 1,3,5,7
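For reference (this is my addition, not part of the original post), the same CPU-to-node mapping can be read straight from sysfs, independent of numactl or lscpu; a minimal sketch:

# Print each logical CPU with its physical socket ID and NUMA node, straight from sysfs
for c in /sys/devices/system/cpu/cpu[0-9]*; do
  cpu=$(basename "$c")
  pkg=$(cat "$c/topology/physical_package_id")
  node=$(ls -d "$c"/node* 2>/dev/null | sed 's/.*\///')
  echo "$cpu: socket=$pkg ${node:-node=?}"
done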

Related

Why Ceph turns status to Err when there is still available storage space

I built a 3-node Ceph cluster recently. Each node has seven 1 TB HDDs for OSDs, so in total I have 21 TB of storage space for Ceph.
However, when I ran a workload that kept writing data to Ceph, the cluster went into an Err status and no more data could be written to it.
The output of ceph -s is:
cluster:
id: 06ed9d57-c68e-4899-91a6-d72125614a94
health: HEALTH_ERR
1 full osd(s)
4 nearfull osd(s)
7 pool(s) full
services:
mon: 1 daemons, quorum host3
mgr: admin(active), standbys: 06ed9d57-c68e-4899-91a6-d72125614a94
osd: 21 osds: 21 up, 21 in
rgw: 4 daemons active
data:
pools: 7 pools, 1748 pgs
objects: 2.03M objects, 7.34TiB
usage: 14.7TiB used, 4.37TiB / 19.1TiB avail
pgs: 1748 active+clean
Based on my understanding, since there is still 4.37 TiB of space left, Ceph itself should take care of balancing the workload so that no OSD ends up full or nearfull. But the result doesn't match my expectation: 1 full OSD and 4 nearfull OSDs show up, and the health is HEALTH_ERR.
I can't access Ceph with hdfs or s3cmd anymore, so here are my questions:
1. Is there any explanation for the current issue?
2. How can I recover from it? Should I delete the data on the Ceph nodes directly with ceph-admin and relaunch Ceph?
I didn't get an answer for 3 days, but I made some progress; let me share my findings here.
1. It's normal for different OSDs to have a size gap. If you list the OSDs with ceph osd df, you will find that each OSD has a different usage ratio.
2. To recover from this issue (meaning the cluster being stuck because an OSD is full), follow the steps below; they are mostly taken from the Red Hat documentation.
Get the cluster health info with ceph health detail. It's not strictly necessary, but it gives you the ID of the full OSD.
Use ceph osd dump | grep full_ratio to get the current full_ratio. Do not use the statement listed at the link above; it is obsolete. The output looks like:
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
Set the OSD full ratio a little higher with ceph osd set-full-ratio <ratio>. Generally, we set the ratio to 0.97.
Now the cluster status will change from HEALTH_ERR to HEALTH_WARN or HEALTH_OK. Remove whatever data can be released.
Change the OSD full ratio back to the previous value; it shouldn't stay at 0.97 permanently, because that is a little risky. (A consolidated command sketch follows below.)
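The same steps, collected into a single sketch (the 0.97 and 0.95 values are the ones used in this thread; substitute your own previous ratio):

# Optional: identify the full OSD(s)
ceph health detail
# Check the current ratios
ceph osd dump | grep full_ratio
# Temporarily raise the full ratio so the cluster accepts operations again
ceph osd set-full-ratio 0.97
# ... delete or migrate whatever data can be released ...
# Restore the previous full ratio
ceph osd set-full-ratio 0.95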
I hope this thread is helpful to someone who runs into the same issue. For details about OSD configuration, please refer to the Ceph documentation.
Ceph requires free disk space to move storage chunks, called placement groups (PGs), between different disks. Because this free space is critical to the underlying functionality, Ceph will go into HEALTH_WARN once any OSD reaches the near_full ratio (generally 85% full), and will stop write operations on the cluster by entering the HEALTH_ERR state once an OSD reaches the full_ratio.
However, unless your cluster is perfectly balanced across all OSDs there is likely much more capacity available, as OSDs are typically unevenly utilized. To check overall utilization and available capacity you can run ceph osd df.
Example output:
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 72 MiB 3.6 GiB 742 GiB 73.44 1.06 406 up
5 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 119 MiB 3.3 GiB 726 GiB 74.00 1.06 414 up
12 hdd 2.72849 1.00000 2.7 TiB 2.2 TiB 2.2 TiB 72 MiB 3.7 GiB 579 GiB 79.26 1.14 407 up
14 hdd 2.72849 1.00000 2.7 TiB 2.3 TiB 2.3 TiB 80 MiB 3.6 GiB 477 GiB 82.92 1.19 367 up
8 ssd 0.10840 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 up
1 hdd 2.72849 1.00000 2.7 TiB 1.7 TiB 1.7 TiB 27 MiB 2.9 GiB 1006 GiB 64.01 0.92 253 up
4 hdd 2.72849 1.00000 2.7 TiB 1.7 TiB 1.7 TiB 79 MiB 2.9 GiB 1018 GiB 63.55 0.91 259 up
10 hdd 2.72849 1.00000 2.7 TiB 1.9 TiB 1.9 TiB 70 MiB 3.0 GiB 887 GiB 68.24 0.98 256 up
13 hdd 2.72849 1.00000 2.7 TiB 1.8 TiB 1.8 TiB 80 MiB 3.0 GiB 971 GiB 65.24 0.94 277 up
15 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 58 MiB 3.1 GiB 793 GiB 71.63 1.03 283 up
17 hdd 2.72849 1.00000 2.7 TiB 1.6 TiB 1.6 TiB 113 MiB 2.8 GiB 1.1 TiB 59.78 0.86 259 up
19 hdd 2.72849 1.00000 2.7 TiB 1.6 TiB 1.6 TiB 100 MiB 2.7 GiB 1.2 TiB 56.98 0.82 265 up
7 ssd 0.10840 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 up
0 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 105 MiB 3.0 GiB 734 GiB 73.72 1.06 337 up
3 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 98 MiB 3.0 GiB 781 GiB 72.04 1.04 354 up
9 hdd 2.72849 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 up
11 hdd 2.72849 1.00000 2.7 TiB 1.9 TiB 1.9 TiB 76 MiB 3.0 GiB 817 GiB 70.74 1.02 342 up
16 hdd 2.72849 1.00000 2.7 TiB 1.8 TiB 1.8 TiB 98 MiB 2.7 GiB 984 GiB 64.80 0.93 317 up
18 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 79 MiB 3.0 GiB 792 GiB 71.65 1.03 324 up
6 ssd 0.10840 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 up
TOTAL 47 TiB 30 TiB 30 TiB 1.3 GiB 53 GiB 16 TiB 69.50
MIN/MAX VAR: 0.82/1.19 STDDEV: 6.64
As you can see in the above output, OSD utilization varies from 56.98% (OSD 19) to 82.92% (OSD 14), which is a significant variance.
As only a single OSD is full, and only 4 of your 21 OSDs are nearfull, you likely have a significant amount of storage still available in your cluster, which means it is time to perform a rebalance operation. This can be done manually by reweighting OSDs, or you can have Ceph do a best-effort rebalance by running ceph osd reweight-by-utilization. Once the rebalance is complete (i.e. ceph status shows no misplaced objects), you can check the variation again (using ceph osd df) and trigger another rebalance if required.
If you are on Luminous or newer, you can enable the balancer plugin to handle OSD reweighting automatically; see the sketch below.
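A sketch of the two rebalance options mentioned above (these are the standard Ceph commands, not taken from the original post; try them on a test cluster first):

# Option 1: one-off, best-effort rebalance based on current utilization
ceph osd reweight-by-utilization
ceph osd df          # re-check the spread once ceph status shows no misplaced objects

# Option 2 (Luminous or newer): let the balancer module keep utilization even automatically
ceph mgr module enable balancer
ceph balancer mode upmap   # upmap mode requires all clients to be Luminous or newer
ceph balancer on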

Getting VT-X is not available inside CENTOS7 guest

I currently have a CentOS 7 guest inside a Windows 10 host, and I am trying to install Vagrant and VirtualBox inside the CentOS guest to create a Kubernetes cluster for local development.
I have installed Vagrant 2.1.1 and VirtualBox 5.2.12.
When I try to run
vagrant up
it says VT-x is not available.
I have turned on virtualization in the BIOS, and I have checked Windows Features: Hyper-V is disabled/unchecked.
Also, when I cat /proc/cpuinfo, vmx is not in the flags:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
stepping : 7
cpu MHz : 3092.984
cache size : 6144 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni ssse3 sse4_1 sse4_2 x2apic hypervisor lahf_lm
bogomips : 6185.96
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
Here is the output of the lscpu command:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Model name: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
Stepping: 7
CPU MHz: 3092.984
BogoMIPS: 6185.96
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0,1
I would really appreciate any insight on how I can get this resolved.
You disabled Hyper-V, which means you are using an alternative hypervisor (VirtualBox?). I don't think anything except Hyper-V on Windows supports nested virtualization.
You can try a Hyper-V guest with CentOS (with nested virtualization enabled via the command Set-VMProcessor -VMName <VMName> -ExposeVirtualizationExtensions $true) and run VirtualBox with Vagrant inside it.
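As a quick sanity check (my addition, not part of the original answer), after enabling nested virtualization and rebooting the guest you can verify that the extensions are actually exposed inside it:

# Count hardware-virtualization flags visible to the guest; non-zero means VT-x/AMD-V is exposed
grep -c -E 'vmx|svm' /proc/cpuinfo
# lscpu should also report a "Virtualization:" line (e.g. VT-x) when nested virt works
lscpu | grep -i virtualization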

Why does Spark only use one executor on my 2 worker node cluster if I increase the executor memory past 5 GB?

I am using a 3 node cluster: 1 master node and 2 worker nodes, using T2.large EC2 instances.
The "free -m" command gives me the following info:
Master:
total used free shared buffers cached
Mem: 7733 6324 1409 0 221 4555
-/+ buffers/cache: 1547 6186
Swap: 1023 0 1023
Worker Node 1:
total used free shared buffers cached
Mem: 7733 3203 4530 0 185 2166
-/+ buffers/cache: 851 6881
Swap: 1023 0 1023
Worker Node 2:
total used free shared buffers cached
Mem: 7733 3402 4331 0 185 2399
-/+ buffers/cache: 817 6915
Swap: 1023 0 1023
In the yarn-site.xml file, I have the following properties set:
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>7733</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>7733</value>
</property>
In $SPARK_HOME/conf/spark-defaults.conf I am setting spark.executor.cores to 2 and spark.executor.instances to 2, roughly as sketched below.
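(A minimal spark-defaults.conf sketch of the settings described here; the values are the ones from this question, with 5g being the working case and 6g the failing one.)

spark.executor.instances  2
spark.executor.cores      2
spark.executor.memory     5g   # works on both workers; 6g or more leaves one worker without an executor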
When looking at the Spark history UI after running my Spark application, both executors (1 and 2) show up in the "Executors" tab along with the driver. In the cores column on that same page, it says 2 for both executors.
When I set the executor memory to 5G or lower, my Spark application runs fine with executors on both worker nodes. When I set the executor memory to 6G or more, only one worker node runs an executor. Why does this happen? Note: I have tried increasing yarn.nodemanager.resource.memory-mb and it doesn't change this behavior.

Data 100% unknown after Ceph Update

I updated my dev Ceph cluster yesterday from Jewel to Luminous. Everything was seemingly okay until I ran the command "ceph osd require-osd-release luminous". After that, the data in my cluster is completely unknown. If I do a detailed view of any given PG, it shows "active+clean", yet the cluster thinks the PGs are degraded and unclean. Here's what I am seeing:
CRUSH MAP
-1 10.05318 root default
-2 3.71764 host cephfs01
0 0.09044 osd.0 up 1.00000 1.00000
1 1.81360 osd.1 up 1.00000 1.00000
2 1.81360 osd.2 up 1.00000 1.00000
-3 3.62238 host cephfs02
3 hdd 1.81360 osd.3 up 1.00000 1.00000
4 hdd 0.90439 osd.4 up 1.00000 1.00000
5 hdd 0.90439 osd.5 up 1.00000 1.00000
-4 2.71317 host cephfs03
6 hdd 0.90439 osd.6 up 1.00000 1.00000
7 hdd 0.90439 osd.7 up 1.00000 1.00000
8 hdd 0.90439 osd.8 up 1.00000 1.00000
HEALTH
cluster:
id: 279e0565-1ab4-46f2-bb27-adcb1461e618
health: HEALTH_WARN
Reduced data availability: 1024 pgs inactive
Degraded data redundancy: 1024 pgs unclean
services:
mon: 2 daemons, quorum cephfsmon02,cephfsmon01
mgr: cephfsmon02(active)
mds: ceph_library-1/1/1 up {0=cephfsmds01=up:active}
osd: 9 osds: 9 up, 9 in; 306 remapped pgs
data:
pools: 2 pools, 1024 pgs
objects: 0 objects, 0 bytes
usage: 0 kB used, 0 kB / 0 kB avail
pgs: 100.000% pgs unknown
1024 unknown
HEALTH_WARN
Reduced data availability: 1024 pgs inactive; Degraded data redundancy: 1024 pgs unclean
PG_AVAILABILITY Reduced data availability: 1024 pgs inactive
pg 1.e6 is stuck inactive for 2239.530584, current state unknown, last acting []
pg 1.e8 is stuck inactive for 2239.530584, current state unknown, last acting []
pg 1.e9 is stuck inactive for 2239.530584, current state unknown, last acting []
It looks like this for every PG in the cluster.
PG DETAIL
"stats": {
"version": "57'5211",
"reported_seq": "4527",
"reported_epoch": "57",
"state": "active+clean",
I can't run a scrub or repair on the pgs or osds because of this:
ceph osd repair osd.0
failed to instruct osd(s) 0 to repair (not connected)
Any ideas?
The problem was the firewall. I bounced the firewall on each host and immediately the pgs were found.
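For anyone hitting the same symptom, here is a sketch of what "fixing the firewall" typically means on CentOS/RHEL with firewalld (the ports are the Ceph defaults: 6789/tcp for monitors, 6800-7300/tcp for OSDs/MGR/MDS; this is not taken from the original post):

# Open the default Ceph ports on every cluster host, then reload firewalld
sudo firewall-cmd --permanent --add-port=6789/tcp
sudo firewall-cmd --permanent --add-port=6800-7300/tcp
sudo firewall-cmd --reload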

What must I change to boot Linux 4.9 with Nios2?

I read the manual and used Buildroot to create the images:
$ ls buildroot/output/images/
rootfs.cpio rootfs.jffs2 rootfs.tar
Then I built u-boot:
~/nios2/u-boot-socfpga$ ls -al u-boot
-rwxrwxr-x 1 developer developer 852924 apr 8 14:00 u-boot
Then I run make menuconfig and choose the path to the Buildroot files.
Now I build the vmImage:
OBJCOPY arch/nios2/boot/vmlinux.bin
GZIP arch/nios2/boot/vmlinux.gz
UIMAGE arch/nios2/boot/vmImage
Image Name: Linux-4.9.0-00104-g84d4f8a-dirty
Created: Sat Apr 8 14:02:22 2017
Image Type: NIOS II Linux Kernel Image (gzip compressed)
Data Size: 7237729 Bytes = 7068.09 kB = 6.90 MB
Load Address: c0000000
Entry Point: c0000000
Kernel: arch/nios2/boot/vmImage is ready
I wonder why I can't run any of the generated images now. Where did I go wrong? Everything compiles, but my file system is not in place when I download the files to the FPGA:
$ nios2-download -g /home/developer/nios2/linux4/vmlinux
Using cable "USB-Blaster [2-1]", device 1, instance 0x00
Pausing target processor: OK
Initializing CPU cache (if present)
OK
Downloaded 8504KB in 43.7s (194.5KB/s)
Verified OK
Starting processor at address 0xC0000000
It stops with a kernel panic:
$ nios2-terminal
nios2-terminal: connected to hardware target using JTAG UART on cable
nios2-terminal: "USB-Blaster [2-1]", device 1, instance 0
nios2-terminal: (Use the IDE stop button or Ctrl-C to terminate)
Linux version 4.9.0-00104-g84d4f8a-dirty (developer@1604) (gcc version 6.2.0 (Sourcery CodeBench Lite 2016.11-32) ) #49 Sat Apr 8 14:02:21 CEST 2017
bootconsole [early0] enabled
early_console initialized at 0xe8001440
On node 0 totalpages: 32768
free_area_init_node: node 0, pgdat c084e52c, node_mem_map c0883b80
Normal zone: 256 pages used for memmap
Normal zone: 0 pages reserved
Normal zone: 32768 pages, LIFO batch:7
pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
pcpu-alloc: [0] 0
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 32512
Kernel command line: debug console=ttyAL0,115200
PID hash table entries: 512 (order: -1, 2048 bytes)
Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
Sorting __ex_table...
Memory: 121196K/131072K available (2233K kernel code, 66K rwdata, 352K rodata, 5852K init, 197K bss, 9876K reserved, 0K cma-reserved)
SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
NR_IRQS:64 nr_irqs:64 0
clocksource: nios2-clksrc: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 38225208935 ns
Console: colour dummy device 80x25
Calibrating delay loop (skipped), value calculated using timer frequency.. 100.00 BogoMIPS (lpj=50000)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
cpu cpu0: Error -2 creating of_node link
clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
random: fast init done
clocksource: Switched to clocksource nios2-clksrc
random: crng init done
futex hash table entries: 256 (order: -1, 3072 bytes)
workingset: timestamp_bits=30 max_order=15 bucket_order=0
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 254)
io scheduler noop registered
io scheduler deadline registered
io scheduler cfq registered (default)
8001440.serial: ttyJ0 at MMIO 0x8001440 (irq = 2, base_baud = 0) is a Altera JTAG UART
mousedev: PS/2 mouse device common for all mice
Warning: unable to open an initial console.
Failed to create /dev/root: -2
VFS: Cannot open root device "(null)" or unknown-block(0,0): error -2
Please append a correct "root=" boot option; here are the available partitions:
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
What can I do about it?
You must point the kernel's initramfs source file option at the cpio archive, i.e. /home/developer/nios2/buildroot/output/images/rootfs.cpio.
Even easier is to let Buildroot build your kernel and choose the initramfs filesystem type there. Buildroot will then take care of linking the initramfs into the kernel. A sketch of the manual kernel-side setting follows below.
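A minimal sketch of the manual option, using the kernel's scripts/config helper; the kernel tree and rootfs paths are taken from the question:

cd /home/developer/nios2/linux4    # kernel source tree (path from the nios2-download command above)
./scripts/config --enable BLK_DEV_INITRD
./scripts/config --set-str INITRAMFS_SOURCE /home/developer/nios2/buildroot/output/images/rootfs.cpio
make olddefconfig
make vmImage                       # rebuild so rootfs.cpio is linked into the kernel image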
