Why does Ceph turn its status to Err when there is still available storage space?

I built a 3-node Ceph cluster recently. Each node has seven 1 TB HDDs for OSDs, so in total I have 21 TB of storage space for Ceph.
However, when I ran a workload that kept writing data to Ceph, the cluster turned to Err status and no more data could be written to it.
The output of ceph -s is:
cluster:
id: 06ed9d57-c68e-4899-91a6-d72125614a94
health: HEALTH_ERR
1 full osd(s)
4 nearfull osd(s)
7 pool(s) full
services:
mon: 1 daemons, quorum host3
mgr: admin(active), standbys: 06ed9d57-c68e-4899-91a6-d72125614a94
osd: 21 osds: 21 up, 21 in
rgw: 4 daemons active
data:
pools: 7 pools, 1748 pgs
objects: 2.03M objects, 7.34TiB
usage: 14.7TiB used, 4.37TiB / 19.1TiB avail
pgs: 1748 active+clean
Based on my understanding, since there is still 4.37 TB of space left, Ceph itself should take care of balancing the load so that no OSD ends up full or nearfull. But the result doesn't match my expectation: 1 full OSD and 4 nearfull OSDs show up, and the health is HEALTH_ERR.
I can't access Ceph with hdfs or s3cmd anymore, so here come my questions:
1. Is there any explanation for the current issue?
2. How can I recover from it? Should I delete the data on the Ceph nodes directly with ceph-admin and relaunch Ceph?

I didn't get an answer for 3 days, but I made some progress, so let me share my findings here.
1. It's normal for different OSDs to have a usage gap. If you list the OSDs with ceph osd df, you will find that different OSDs have different usage ratios.
2. To recover from this issue (the issue here being that the cluster is stuck because an OSD is full), follow the steps below; they are mostly from Red Hat's documentation.
Get cluster health info with ceph health detail. This isn't strictly necessary, but it lets you get the ID of the full OSD.
Use ceph osd dump | grep full_ratio to get the current full_ratio. Do not use the command listed at the link above; it is obsolete. The output looks like:
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
Set the OSD full ratio a little higher with ceph osd set-full-ratio <ratio>. Generally, we set the ratio to 0.97.
Now the cluster status will change from HEALTH_ERR to HEALTH_WARN or HEALTH_OK. Remove whatever data can be released.
Change the OSD full ratio back to its previous value; it shouldn't stay at 0.97 forever, because that is a little risky. The whole sequence is sketched below.
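A minimal sketch of that recovery sequence, assuming the default ratios shown above (0.97 and 0.95 are example values; restore whatever your cluster reported):
# identify the full OSD
ceph health detail
# check the current ratios
ceph osd dump | grep full_ratio
# temporarily raise the full ratio so deletes/writes are possible again
ceph osd set-full-ratio 0.97
# ... remove or migrate data until usage drops ...
# restore the previous ratio
ceph osd set-full-ratio 0.95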
I hope this thread is helpful to someone who runs into the same issue. For details about OSD configuration, please refer to the Ceph documentation.

Ceph requires free disk space to move storage chunks, called placement groups (PGs), between different disks. Because this free space is critical to the underlying functionality, Ceph will go into HEALTH_WARN once any OSD reaches the nearfull ratio (generally 85% full), and will stop write operations on the cluster by entering the HEALTH_ERR state once an OSD reaches the full_ratio.
However, unless your cluster is perfectly balanced across all OSDs, there is likely much more capacity available, as OSDs are typically unevenly utilized. To check overall utilization and available capacity, you can run ceph osd df.
Example output:
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
2 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 72 MiB 3.6 GiB 742 GiB 73.44 1.06 406 up
5 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 119 MiB 3.3 GiB 726 GiB 74.00 1.06 414 up
12 hdd 2.72849 1.00000 2.7 TiB 2.2 TiB 2.2 TiB 72 MiB 3.7 GiB 579 GiB 79.26 1.14 407 up
14 hdd 2.72849 1.00000 2.7 TiB 2.3 TiB 2.3 TiB 80 MiB 3.6 GiB 477 GiB 82.92 1.19 367 up
8 ssd 0.10840 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 up
1 hdd 2.72849 1.00000 2.7 TiB 1.7 TiB 1.7 TiB 27 MiB 2.9 GiB 1006 GiB 64.01 0.92 253 up
4 hdd 2.72849 1.00000 2.7 TiB 1.7 TiB 1.7 TiB 79 MiB 2.9 GiB 1018 GiB 63.55 0.91 259 up
10 hdd 2.72849 1.00000 2.7 TiB 1.9 TiB 1.9 TiB 70 MiB 3.0 GiB 887 GiB 68.24 0.98 256 up
13 hdd 2.72849 1.00000 2.7 TiB 1.8 TiB 1.8 TiB 80 MiB 3.0 GiB 971 GiB 65.24 0.94 277 up
15 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 58 MiB 3.1 GiB 793 GiB 71.63 1.03 283 up
17 hdd 2.72849 1.00000 2.7 TiB 1.6 TiB 1.6 TiB 113 MiB 2.8 GiB 1.1 TiB 59.78 0.86 259 up
19 hdd 2.72849 1.00000 2.7 TiB 1.6 TiB 1.6 TiB 100 MiB 2.7 GiB 1.2 TiB 56.98 0.82 265 up
7 ssd 0.10840 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 up
0 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 105 MiB 3.0 GiB 734 GiB 73.72 1.06 337 up
3 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 98 MiB 3.0 GiB 781 GiB 72.04 1.04 354 up
9 hdd 2.72849 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 up
11 hdd 2.72849 1.00000 2.7 TiB 1.9 TiB 1.9 TiB 76 MiB 3.0 GiB 817 GiB 70.74 1.02 342 up
16 hdd 2.72849 1.00000 2.7 TiB 1.8 TiB 1.8 TiB 98 MiB 2.7 GiB 984 GiB 64.80 0.93 317 up
18 hdd 2.72849 1.00000 2.7 TiB 2.0 TiB 2.0 TiB 79 MiB 3.0 GiB 792 GiB 71.65 1.03 324 up
6 ssd 0.10840 0 0 B 0 B 0 B 0 B 0 B 0 B 0 0 0 up
TOTAL 47 TiB 30 TiB 30 TiB 1.3 GiB 53 GiB 16 TiB 69.50
MIN/MAX VAR: 0.82/1.19 STDDEV: 6.64
As you can see in the above output, OSD utilization varies from 56.98% (OSD 19) to 82.92% (OSD 14), which is a significant variance.
As only a single OSD is full, and only 4 of your 21 OSDs are nearfull, you likely have a significant amount of storage still available in your cluster, which means that it is time to perform a rebalance operation. This can be done manually by reweighting OSDs, or you can have Ceph do a best-effort rebalance by running ceph osd reweight-by-utilization (sketched below). Once the rebalance is complete (i.e. you have no misplaced objects in ceph status), you can check the variation again (using ceph osd df) and trigger another rebalance if required.
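A rough sketch of that iterative rebalance loop (the threshold argument is optional; 120 is just an illustrative value meaning "reweight OSDs above 120% of mean utilization"):
# best-effort reweight of the most-utilized OSDs
ceph osd reweight-by-utilization 120
# wait until ceph status shows no misplaced objects
ceph -s
# re-check the utilization spread, repeat if needed
ceph osd df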
If you are on Luminous or newer, you can enable the balancer plugin to handle OSD reweighting automatically.
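A sketch of enabling it, assuming a Luminous-or-newer cluster (the mode is a choice; upmap requires all clients to be Luminous or newer as well):
ceph mgr module enable balancer
ceph balancer mode crush-compat
ceph balancer on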

Related

Why does Spark only use one executor on my 2 worker node cluster if I increase the executor memory past 5 GB?

I am using a 3 node cluster: 1 master node and 2 worker nodes, using T2.large EC2 instances.
The "free -m" command gives me the following info:
Master:
total used free shared buffers cached
Mem: 7733 6324 1409 0 221 4555
-/+ buffers/cache: 1547 6186
Swap: 1023 0 1023
Worker Node 1:
total used free shared buffers cached
Mem: 7733 3203 4530 0 185 2166
-/+ buffers/cache: 851 6881
Swap: 1023 0 1023
Worker Node 2:
total used free shared buffers cached
Mem: 7733 3402 4331 0 185 2399
-/+ buffers/cache: 817 6915
Swap: 1023 0 1023
In the yarn-site.xml file, I have the following properties set:
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>7733</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>7733</value>
</property>
In $SPARK_HOME/conf/spark-defaults.conf I am setting spark.executor.cores to 2 and spark.executor.instances to 2.
When looking at the Spark history UI after running my Spark application, both executors (1 and 2) show up in the "Executors" tab along with the driver. In the cores column on that same page, it says 2 for both executors.
When I set the executor memory to 5G or lower, my Spark application runs fine with executors on both worker nodes. When I set the executor memory to 6G or more, only one worker node runs an executor. Why does this happen? Note: I have tried increasing yarn.nodemanager.resource.memory-mb and it doesn't change this behavior.
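One thing worth checking here, assuming Spark's default executor memory overhead (max(384 MB, 10% of spark.executor.memory)) and YARN's default 1024 MB allocation increment, is how large the container request becomes once overhead and rounding are applied. Hypothetical back-of-envelope arithmetic:
# all values below are assumptions based on those defaults, not from the question
exec_mb=6144                                                      # 6 GB executor
overhead_mb=$(( exec_mb / 10 > 384 ? exec_mb / 10 : 384 ))        # 614 MB overhead
container_mb=$(( (exec_mb + overhead_mb + 1023) / 1024 * 1024 ))  # rounded to 7168 MB
echo "YARN container request: ${container_mb} MB of 7733 MB per node"
A 7168 MB request still fits on a node by itself, but the node that also hosts the ApplicationMaster container may no longer have room for it, which would be consistent with only one executor starting.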

Data 100% unknown after Ceph Update

I updated my dev Ceph cluster yesterday from Jewel to Luminous. Everything was seemingly okay until I ran the command "ceph osd require-osd-release luminous". After that, the data in my cluster became completely unknown. A detailed view of any given PG shows "active+clean", yet the cluster thinks they are degraded and unclean. Here's what I am seeing:
CRUSH MAP
-1 10.05318 root default
-2 3.71764 host cephfs01
0 0.09044 osd.0 up 1.00000 1.00000
1 1.81360 osd.1 up 1.00000 1.00000
2 1.81360 osd.2 up 1.00000 1.00000
-3 3.62238 host cephfs02
3 hdd 1.81360 osd.3 up 1.00000 1.00000
4 hdd 0.90439 osd.4 up 1.00000 1.00000
5 hdd 0.90439 osd.5 up 1.00000 1.00000
-4 2.71317 host cephfs03
6 hdd 0.90439 osd.6 up 1.00000 1.00000
7 hdd 0.90439 osd.7 up 1.00000 1.00000
8 hdd 0.90439 osd.8 up 1.00000 1.00000
HEALTH
cluster:
id: 279e0565-1ab4-46f2-bb27-adcb1461e618
health: HEALTH_WARN
Reduced data availability: 1024 pgs inactive
Degraded data redundancy: 1024 pgs unclean
services:
mon: 2 daemons, quorum cephfsmon02,cephfsmon01
mgr: cephfsmon02(active)
mds: ceph_library-1/1/1 up {0=cephfsmds01=up:active}
osd: 9 osds: 9 up, 9 in; 306 remapped pgs
data:
pools: 2 pools, 1024 pgs
objects: 0 objects, 0 bytes
usage: 0 kB used, 0 kB / 0 kB avail
pgs: 100.000% pgs unknown
1024 unknown
HEALTH_WARN
Reduced data availability: 1024 pgs inactive; Degraded data redundancy: 1024 pgs unclean
PG_AVAILABILITY Reduced data availability: 1024 pgs inactive
pg 1.e6 is stuck inactive for 2239.530584, current state unknown, last acting []
pg 1.e8 is stuck inactive for 2239.530584, current state unknown, last acting []
pg 1.e9 is stuck inactive for 2239.530584, current state unknown, last acting []
It looks like this for every PG in the cluster.
PG DETAIL
"stats": {
"version": "57'5211",
"reported_seq": "4527",
"reported_epoch": "57",
"state": "active+clean",
I can't run a scrub or repair on the pgs or osds because of this:
ceph osd repair osd.0
failed to instruct osd(s) 0 to repair (not connected)
Any ideas?
The problem was the firewall. I bounced the firewall on each host and immediately the pgs were found.
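For anyone hitting the same symptom, a sketch of the kind of firewall fix implied above, assuming firewalld on each Ceph host; the ports are Ceph's documented defaults (6789/tcp for monitors, 6800-7300/tcp for OSDs and managers):
firewall-cmd --zone=public --add-port=6789/tcp --permanent
firewall-cmd --zone=public --add-port=6800-7300/tcp --permanent
firewall-cmd --reload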

Xcode Scene Editor Memory Leak Crashing Computer

I am having the same issue as seen here: https://forums.developer.apple.com/thread/38432 Basically, if the scene editor is open for a few minutes, my computer reboots with that memory-leak error. Nobody seems to have come up with a solution there, so I am curious whether anyone here has one. I have also reported this bug to Apple. Here is the crash log:
*** Panic Report ***
panic(cpu 0 caller 0xffffff8026b1f634): "zalloc: zone map exhausted while allocating from zone kalloc.8192, likely due to memory leak in zone kalloc.8192 (1123926016 total bytes, 137198 elements allocated)"#/Library/Caches/com.apple.xbs/Sources/xnu/xnu-3248.60.10/osfmk/kern/zalloc.c:2628
Backtrace (CPU 0), Frame : Return Address
0xffffff90c53e38a0 : 0xffffff8026adab52
0xffffff90c53e3920 : 0xffffff8026b1f634
0xffffff90c53e3a50 : 0xffffff8027039a9b
0xffffff90c53e3a70 : 0xffffff7fa929f8d3
0xffffff90c53e3a90 : 0xffffff7fa92233fc
0xffffff90c53e3ac0 : 0xffffff7fa920ef71
0xffffff90c53e3b40 : 0xffffff7fa9214228
0xffffff90c53e3ba0 : 0xffffff7fa9215268
0xffffff90c53e3be0 : 0xffffff80270dfed7
0xffffff90c53e3d20 : 0xffffff8026b97f90
0xffffff90c53e3e30 : 0xffffff8026adf2c3
0xffffff90c53e3e60 : 0xffffff8026ac28f8
0xffffff90c53e3ea0 : 0xffffff8026ad26a5
0xffffff90c53e3f10 : 0xffffff8026bb8eca
0xffffff90c53e3fb0 : 0xffffff8026becd86
Kernel Extensions in backtrace:
com.apple.iokit.IOAcceleratorFamily2(205.11)[AE224148-CFD1-3A17-943F-E42B98EB06DC]#0xffffff7fa91ff000->0xffffff7fa926afff
dependency: com.apple.iokit.IOPCIFamily(2.9)[F51AA3D6-EC2F-3AD3-A043-06DB79027AA2]#0xffffff7fa7401000
dependency: com.apple.iokit.IOGraphicsFamily(2.4.1)[A360453D-2050-3C49-A549-AC0DD5E87917]#0xffffff7fa7f9c000
com.apple.driver.AppleIntelHD4000Graphics(10.1.4)[6EBBF243-AAF9-3765-8E08-50CA36DA5F27]#0xffffff7fa9279000->0xffffff7fa92e1fff
dependency: com.apple.iokit.IOSurface(108.2.3)[354FC780-7EA4-3C3F-A9E1-2658F63663A9]#0xffffff7fa7e20000
dependency: com.apple.iokit.IOPCIFamily(2.9)[F51AA3D6-EC2F-3AD3-A043-06DB79027AA2]#0xffffff7fa7401000
dependency: com.apple.iokit.IOGraphicsFamily(2.4.1)[A360453D-2050-3C49-A549-AC0DD5E87917]#0xffffff7fa7f9c000
dependency: com.apple.iokit.IOAcceleratorFamily2(205.11)[AE224148-CFD1-3A17-943F-E42B98EB06DC]#0xffffff7fa91ff000
BSD process name corresponding to current thread: Xcode
Mac OS version:
15G31
Kernel version:
Darwin Kernel Version 15.6.0: Thu Jun 23 18:25:34 PDT 2016; root:xnu-3248.60.10~1/RELEASE_X86_64
Kernel UUID: B5AA8E3E-65B6-3D0E-867B-8DCCF81E536C
Kernel slide: 0x0000000026800000
Kernel text base: 0xffffff8026a00000
__HIB text base: 0xffffff8026900000
System model name: MacBookPro9,2 (Mac-6F01561E16C75D06)
System uptime in nanoseconds: 2566458908068
Zone Name Cur Size Free Size
vm objects 49098720 1920
vm object hash entri 7436000 2640
VM map entries 13414400 5200
pv_list 26222592 4992
vm pages 64320704 13568
kalloc.16 9301824 1120
kalloc.32 1290240 400192
kalloc.48 21938112 912
kalloc.64 21740544 3776
kalloc.96 1871328 649824
kalloc.128 71966720 896
kalloc.160 3003200 1760
kalloc.256 108310528 768
kalloc.512 12038144 465408
kalloc.1024 3780608 4096
kalloc.2048 3350528 2048
kalloc.4096 3428352 0
kalloc.8192 1123926016 0
mem_obj_control 2975616 2272
ipc ports 24267840 4480
threads 1345536 381936
vnodes 15977280 2880
namecache 6813600 4128
HFS node 21840864 231896
HFS fork 6062080 31360
ubc_info zone 4165656 49544
vnode pager structur 1892000 21040
compressor_pager 8896512 64
compressor_segment 7856128 1024
Kernel Stacks 2752512
PageTables 108298240
Kalloc.Large 47181331
Backtrace suspected of leaking: (outstanding bytes: 696320)
0xffffff8026b1fa6f
0xffffff8027039a9b
0xffffff7fa929f8d3
0xffffff7fa92233fc
0xffffff7fa920ef71
0xffffff7fa9214228
0xffffff7fa9215268
0xffffff80270dfed7
0xffffff8026b97f90
0xffffff8026adf2c3
0xffffff8026ac28f8
0xffffff8026ad26a5
0xffffff8026bb8eca
Kernel Extensions in backtrace:
com.apple.iokit.IOAcceleratorFamily2(205.11)[AE224148-CFD1-3A17-943F-E42B98EB06DC]#0xffffff7fa91ff000->0xffffff7fa926afff
dependency: com.apple.iokit.IOPCIFamily(2.9)[F51AA3D6-EC2F-3AD3-A043-06DB79027AA2]#0xffffff7fa7401000
dependency: com.apple.iokit.IOGraphicsFamily(2.4.1)[A360453D-2050-3C49-A549-AC0DD5E87917]#0xffffff7fa7f9c000
com.apple.driver.AppleIntelHD4000Graphics(10.1.4)[6EBBF243-AAF9-3765-8E08-50CA36DA5F27]#0xffffff7fa9279000->0xffffff7fa92e1fff
dependency: com.apple.iokit.IOSurface(108.2.3)[354FC780-7EA4-3C3F-A9E1-2658F63663A9]#0xffffff7fa7e20000
dependency: com.apple.iokit.IOPCIFamily(2.9)[F51AA3D6-EC2F-3AD3-A043-06DB79027AA2]#0xffffff7fa7401000
dependency: com.apple.iokit.IOGraphicsFamily(2.4.1)[A360453D-2050-3C49-A549-AC0DD5E87917]#0xffffff7fa7f9c000
dependency: com.apple.iokit.IOAcceleratorFamily2(205.11)[AE224148-CFD1-3A17-943F-E42B98EB06DC]#0xffffff7fa91ff000
last loaded kext at 1251941963265: com.apple.filesystems.smbfs 3.0.1 (addr 0xffffff7fa97c9000, size 409600)
last unloaded kext at 653540897476: com.apple.driver.AppleUSBMergeNub 900.4.1 (addr 0xffffff7fa7ced000, size 12288)
loaded kexts:
com.techsmith.TACC 1.0.2
com.Cycling74.driver.Soundflower 2
com.apple.filesystems.smbfs 3.0.1
com.apple.driver.AGPM 110.22.0
com.apple.driver.X86PlatformShim 1.0.0
com.apple.driver.AudioAUUC 1.70
com.apple.filesystems.autofs 3.0
com.apple.driver.AppleOSXWatchdog 1
com.apple.driver.AppleMikeyHIDDriver 124
com.apple.driver.AppleUpstreamUserClient 3.6.1
com.apple.driver.AppleHDA 274.12
com.apple.driver.pmtelemetry 1
com.apple.iokit.IOUserEthernet 1.0.1
com.apple.iokit.IOBluetoothSerialManager 4.4.6f1
com.apple.Dont_Steal_Mac_OS_X 7.0.0
com.apple.driver.AppleMikeyDriver 274.12
com.apple.driver.AppleIntelHD4000Graphics 10.1.4
com.apple.driver.AppleHV 1
com.apple.driver.AppleBacklight 170.8.9
com.apple.driver.AppleThunderboltIP 3.0.8
com.apple.driver.AppleSMCPDRC 1.0.0
com.apple.iokit.BroadcomBluetoothHostControllerUSBTransport 4.4.6f1
com.apple.driver.AppleIntelSlowAdaptiveClocking 4.0.0
com.apple.driver.AppleMCCSControl 1.2.13
com.apple.driver.AppleSMCLMU 208
com.apple.driver.AppleIntelFramebufferCapri 10.1.4
com.apple.driver.AppleLPC 3.1
com.apple.driver.SMCMotionSensor 3.0.4d1
com.apple.driver.AppleUSBTCButtons 245.4
com.apple.driver.AppleUSBTCKeyboard 245.4
com.apple.driver.AppleIRController 327.6
com.apple.AppleFSCompression.AppleFSCompressionTypeDataless 1.0.0d1
com.apple.AppleFSCompression.AppleFSCompressionTypeZlib 1.0.0
com.apple.BootCache 38
com.apple.iokit.SCSITaskUserClient 3.7.7
com.apple.iokit.IOAHCIBlockStorage 2.8.5
com.apple.driver.AirPort.Brcm4360 1040.1.1a6
com.apple.driver.AppleFWOHCI 5.5.4
com.apple.iokit.AppleBCM5701Ethernet 10.2.0
com.apple.driver.AppleSDXC 1.7.0
com.apple.driver.AppleAHCIPort 3.1.8
com.apple.driver.usb.AppleUSBEHCIPCI 1.0.1
com.apple.driver.AppleSmartBatteryManager 161.0.0
com.apple.driver.AppleACPIButtons 4.0
com.apple.driver.AppleRTC 2.0
com.apple.driver.AppleHPET 1.8
com.apple.driver.AppleSMBIOS 2.1
com.apple.driver.AppleACPIEC 4.0
com.apple.driver.AppleAPIC 1.7
com.apple.driver.AppleIntelCPUPowerManagementClient 218.0.0
com.apple.nke.applicationfirewall 163
com.apple.security.quarantine 3
com.apple.security.TMSafetyNet 8
com.apple.driver.AppleIntelCPUPowerManagement 218.0.0
com.apple.kext.triggers 1.0
com.apple.driver.DspFuncLib 274.12
com.apple.kext.OSvKernDSPLib 525
com.apple.iokit.IOSerialFamily 11
com.apple.iokit.IOSurface 108.2.3
com.apple.driver.CoreCaptureResponder 1
com.apple.driver.AppleHDAController 274.12
com.apple.iokit.IOHDAFamily 274.12
com.apple.iokit.IOAudioFamily 204.4
com.apple.vecLib.kext 1.2.0
com.apple.iokit.IOBluetoothHostControllerUSBTransport 4.4.6f1
com.apple.iokit.IOBluetoothFamily 4.4.6f1
com.apple.iokit.IOSlowAdaptiveClockingFamily 1.0.0
com.apple.driver.AppleSMBusController 1.0.14d1
com.apple.iokit.IOFireWireIP 2.2.6
com.apple.driver.AppleBacklightExpert 1.1.0
com.apple.iokit.IONDRVSupport 2.4.1
com.apple.driver.AppleSMBusPCI 1.0.14d1
com.apple.driver.X86PlatformPlugin 1.0.0
com.apple.iokit.IOAcceleratorFamily2 205.11
com.apple.AppleGraphicsDeviceControl 3.12.8
com.apple.iokit.IOGraphicsFamily 2.4.1
com.apple.driver.IOPlatformPluginFamily 6.0.0d7
com.apple.driver.AppleSMC 3.1.9
com.apple.driver.AppleUSBMultitouch 250.5
com.apple.iokit.IOUSBHIDDriver 900.4.1
com.apple.driver.usb.cdc 5.0.0
com.apple.driver.usb.networking 5.0.0
com.apple.driver.usb.AppleUSBHostCompositeDevice 1.0.1
com.apple.driver.usb.AppleUSBHub 1.0.1
com.apple.driver.CoreStorage 517.50.1
com.apple.iokit.IOSCSIMultimediaCommandsDevice 3.7.7
com.apple.iokit.IOBDStorageFamily 1.8
com.apple.iokit.IODVDStorageFamily 1.8
com.apple.iokit.IOCDStorageFamily 1.8
com.apple.driver.AppleThunderboltDPInAdapter 4.1.3
com.apple.driver.AppleThunderboltDPAdapterFamily 4.1.3
com.apple.driver.AppleThunderboltPCIDownAdapter 2.0.2
com.apple.iokit.IOAHCISerialATAPI 2.6.2
com.apple.iokit.IOSCSIArchitectureModelFamily 3.7.7
com.apple.driver.AppleThunderboltNHI 4.0.4
com.apple.iokit.IOThunderboltFamily 6.0.2
com.apple.iokit.IO80211Family 1110.26
com.apple.driver.corecapture 1.0.4
com.apple.iokit.IOFireWireFamily 4.6.1
com.apple.iokit.IOEthernetAVBController 1.0.3b3
com.apple.driver.mDNSOffloadUserClient 1.0.1b8
com.apple.iokit.IONetworkingFamily 3.2
com.apple.iokit.IOAHCIFamily 2.8.1
com.apple.driver.usb.AppleUSBXHCIPCI 1.0.1
com.apple.driver.usb.AppleUSBXHCI 1.0.1
com.apple.driver.usb.AppleUSBEHCI 1.0.1
com.apple.iokit.IOUSBFamily 900.4.1
com.apple.iokit.IOUSBHostFamily 1.0.1
com.apple.driver.AppleUSBHostMergeProperties 1.0.1
com.apple.driver.AppleEFINVRAM 2.0
com.apple.driver.AppleEFIRuntime 2.0
com.apple.iokit.IOHIDFamily 2.0.0
com.apple.iokit.IOSMBusFamily 1.1
com.apple.security.sandbox 300.0
com.apple.kext.AppleMatch 1.0.0d1
com.apple.driver.AppleKeyStore 2
com.apple.driver.AppleMobileFileIntegrity 1.0.5
com.apple.driver.AppleCredentialManager 1.0
com.apple.driver.DiskImages 417.4
com.apple.iokit.IOStorageFamily 2.1
com.apple.iokit.IOReportFamily 31
com.apple.driver.AppleFDEKeyStore 28.30
com.apple.driver.AppleACPIPlatform 4.0
com.apple.iokit.IOPCIFamily 2.9
com.apple.iokit.IOACPIFamily 1.4
com.apple.kec.pthread 1
com.apple.kec.corecrypto 1.0
com.apple.kec.Libm 1
Model: MacBookPro9,2, BootROM MBP91.00D3.B0D, 2 processors, Intel Core i5, 2.5 GHz, 4 GB, SMC 2.2f44
Graphics: Intel HD Graphics 4000, Intel HD Graphics 4000, Built-In
Memory Module: BANK 0/DIMM0, 2 GB, DDR3, 1600 MHz, 0x02FE, 0x45424A3230554638424455302D474E2D4620
Memory Module: BANK 1/DIMM0, 2 GB, DDR3, 1600 MHz, 0x02FE, 0x45424A3230554638424455302D474E2D4620
AirPort: spairport_wireless_card_type_airport_extreme (0x14E4, 0xF5), Broadcom BCM43xx 1.0 (7.21.95.175.1a6)
Bluetooth: Version 4.4.6f1 17910, 3 services, 18 devices, 1 incoming serial ports
Network Service: Wi-Fi, AirPort, en1
Serial ATA Device: APPLE HDD HTS545050A7E362, 500.11 GB
Serial ATA Device: HL-DT-ST DVDRW GS41N
USB Device: USB 2.0 Bus
USB Device: Hub
USB Device: Hub
USB Device: Apple Internal Keyboard / Trackpad
USB Device: IR Receiver
USB Device: BRCM20702 Hub
USB Device: Bluetooth USB Host Controller
USB Device: USB 2.0 Bus
USB Device: Hub
USB Device: FaceTime HD Camera (Built-in)
USB Device: USB 3.0 Bus
Thunderbolt Bus: MacBook Pro, Apple Inc., 25.1
I reported the bug to Apple, and they seem to have fixed it.

Not enough ram to run whole docker-compose stack

Our microservice stack has now crept up to 15 small services for business logic like auth, messaging, billing, etc. It's now getting to the point where a docker-compose up uses more RAM than our devs have on their laptops.
It's not a crazy amount, about 4GB, but I regularly feel the pinch on my 8GB machine (thanks, Chrome).
There are app-level optimisations that we can do, and are doing, sure, but eventually we are going to need an alternative strategy.
I see two obvious options:
Use a big cloudy dev machine, perhaps provisioned with docker-machine and AWS.
Spin up some shared services, like postgres and redis, in a shared dev cloud.
Neither is very satisfactory: in (1), local files aren't synced, making local dev a nightmare, and in (2) we can break each other's envs.
Help!
Appendix I: output from docker stats
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
0ea1779dbb66 32.53% 137.9 MB / 8.186 GB 1.68% 46 kB / 29.4 kB 42 MB / 0 B
12e93d81027c 0.70% 376.1 MB / 8.186 GB 4.59% 297.7 kB / 243 kB 0 B / 1.921 MB
25f7be321716 34.40% 131.1 MB / 8.186 GB 1.60% 38.42 kB / 23.91 kB 39.64 MB / 0 B
26220cab1ded 0.00% 7.274 MB / 8.186 GB 0.09% 19.82 kB / 648 B 6.645 MB / 0 B
2db7ba96dc16 1.22% 51.29 MB / 8.186 GB 0.63% 10.41 kB / 578 B 28.79 MB / 0 B
3296e274be54 0.00% 4.854 MB / 8.186 GB 0.06% 20.07 kB / 1.862 kB 4.069 MB / 0 B
35911ee375fa 0.27% 12.87 MB / 8.186 GB 0.16% 29.16 kB / 6.861 kB 7.137 MB / 0 B
49eccc517040 37.31% 65.76 MB / 8.186 GB 0.80% 31.53 kB / 18.49 kB 36.27 MB / 0 B
6f23f114c44e 31.08% 86.5 MB / 8.186 GB 1.06% 37.25 kB / 29.28 kB 34.66 MB / 0 B
7a0731639e31 30.64% 66.21 MB / 8.186 GB 0.81% 31.1 kB / 19.39 kB 35.6 MB / 0 B
7ec2d73d3d97 0.00% 10.63 MB / 8.186 GB 0.13% 8.685 kB / 834 B 10.4 MB / 12.29 kB
855fd2c80bea 1.10% 46.88 MB / 8.186 GB 0.57% 23.39 kB / 2.423 kB 29.64 MB / 0 B
9993de237b9c 40.37% 170 MB / 8.186 GB 2.08% 19.75 kB / 1.461 kB 52.71 MB / 12.29 kB
a162fbf77c29 24.84% 128.6 MB / 8.186 GB 1.57% 59.82 kB / 54.46 kB 37.81 MB / 0 B
a7bf8b64d516 43.91% 106.1 MB / 8.186 GB 1.30% 46.33 kB / 31.36 kB 35 MB / 0 B
aae18e01b8bb 0.99% 44.16 MB / 8.186 GB 0.54% 7.066 kB / 578 B 28.12 MB / 0 B
bff9c9ee646d 35.43% 71.65 MB / 8.186 GB 0.88% 63.3 kB / 68.06 kB 45.53 MB / 0 B
ca86faedbd59 38.09% 104.9 MB / 8.186 GB 1.28% 31.84 kB / 18.71 kB 36.66 MB / 0 B
d666a1f3be5c 0.00% 9.286 MB / 8.186 GB 0.11% 19.51 kB / 648 B 6.621 MB / 0 B
ef2fa1bc6452 0.00% 7.254 MB / 8.186 GB 0.09% 19.88 kB / 648 B 6.645 MB / 0 B
f20529b47684 0.88% 41.66 MB / 8.186 GB 0.51% 12.45 kB / 648 B 23.96 MB / 0 B
We have been struggling with this issue as well, and still don't really have an ideal solution. However, we have two ideas that we are currently debating.
Run a "Dev" environment in the cloud, which is constantly updated with the master/latest version of every image as it is built. Then each individual project can proxy to that environment in their docker-compose.yml file... so they are running THEIR service locally, but all the dependencies are remote. An important part of this (from your question) is that you have shared dependencies like databases. This should never be the case... never integrate across the database. Each service should store its own data.
Each service is responsible for building a "mock" version of their app that can be used for local dev and medium level integration tests. The mock versions shouldn't have dependencies, and should enable someone to only need a single layer from their service (the 3 or 4 mocks, instead of the 3 or 4 real services each with 3 or 4 of their own and so on).
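A minimal sketch of idea 1, assuming the shared "dev" environment is reachable at hostnames like auth.dev.example.com (all hostnames and variable names here are illustrative, not from the question):
# point your locally-run service at the shared remote dependencies
export AUTH_SERVICE_URL=https://auth.dev.example.com
export BILLING_SERVICE_URL=https://billing.dev.example.com
# start only your own service; --no-deps skips its linked local containers
docker-compose up --no-deps my-service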

Odd NUMA behavior/system topology?

I have a two-socket system and have disabled hyperthreading in the BIOS.
numactl --hardware shows this:
ucs48:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 12222 MB
node 0 free: 11192 MB
node 1 cpus: 1 3 5 7
node 1 size: 12288 MB
node 1 free: 11366 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Why are the CPU numbers for node 0 not 0 1 2 3, and for node 1 not 4 5 6 7?
On some other systems, I have contiguous CPU numbers within a NUMA node. Is there any config (and if so, which) with which I can fix this? What is the root cause of this?
My kernel command line is:
BOOT_IMAGE=/vmlinuz-3.2.0-23-generic root=/dev/mapper/fe--ucs48-root ro intel_iommu=on
Some additional info:
ucs48:/proc# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Stepping: 2
CPU MHz: 2395.000
BogoMIPS: 4800.19
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0,2,4,6
NUMA node1 CPU(s): 1,3,5,7
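For digging further, a short sketch of commands for inspecting how logical CPU IDs map onto cores, sockets, and NUMA nodes (this only inspects the mapping; the numbering itself typically follows the enumeration order the firmware presents to the kernel):
# one row per logical CPU, with CPU, CORE, SOCKET and NODE columns
lscpu --extended
# socket ID for each logical CPU
cat /sys/devices/system/cpu/cpu*/topology/physical_package_id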
