Oracle NUM_CPUS, NUM_CPU_SOCKETS and Connection Pooling - oracle

I read about Oracle connection pooling size. Official documents say:
For example, suppose a server has 2 CPUs and each CPU has 18 cores.
Each CPU core has 2 threads. Based on the Oracle Real-Wold Performance
group guidelines, the application can have between 36 and 360
connections to the database instance.
My Oracle server's NUM_CPUS are 16, NUM_CPU_CORES are 8 but NUM_CPU_SOCKETS are 2. It means we have 2 CPU in fact but with multithreading it works like 16 CPU.
I am not sure which one to use in connection formula. Probably 16 but I'm asking here to be sure.
16cpu * 8cores = min 128 connection?
or
2cpu * 8cores = min 16 connection?
Which one applies to me. :/
Thanks in advance.

You should use 2 cpu * 8 cores = 16 minimum connections.
This sentence in the documentation implies that the rule is meant for physical CPUs and cores: "The number of connections should be based on the number of CPU cores and not the number of CPU core threads." Your NUM_CPUS value of 16 must be a "logical" CPU since the system only has 2 sockets.

Related

How to get better performace in ProxmoxVE + CEPH cluster

We have been running ProxmoxVE since 5.0 (now in 6.4-15) and we noticed a decay in performance whenever there is some heavy reading/writing.
We have 9 nodes, 7 with CEPH and 56 OSDs (8 on each node). OSDs are hard drives (HDD) WD Gold or better (4~12 Tb). Nodes with 64/128 Gbytes RAM, dual Xeon CPU mainboards (various models).
We already tried simple tests like "ceph tell osd.* bench" getting stable 110 Mb/sec data transfer to each of them with +- 10 Mb/sec spread during normal operations. Apply/Commit Latency is normally below 55 ms with a couple of OSDs reaching 100 ms and one-third below 20 ms.
The front network and back network are both 1 Gbps (separated in VLANs), we are trying to move to 10 Gbps but we found some trouble we are still trying to figure out how to solve (unstable OSDs disconnections).
The Pool is defined as "replicated" with 3 copies (2 needed to keep running). Now the total amount of disk space is 305 Tb (72% used), reweight is in use as some OSDs were getting much more data than others.
Virtual machines run on the same 9 nodes, most are not CPU intensive:
Avg. VM CPU Usage < 6%
Avg. Node CPU Usage < 4.5%
Peak VM CPU Usage 40%
Peak Node CPU Usage 30%
But I/O Wait is a different story:
Avg. Node IO Delay 11
Max. Node IO delay 38
Disk writing load is around 4 Mbytes/sec average, with peaks up to 20 Mbytes/sec.
Anyone with experience in getting better Proxmox+CEPH performance?
Thank you all in advance for taking the time to read,
Ruben.
Got some Ceph pointers that you could follow...
get some good NVMEs (one or two per server but if you have 8HDDs per server 1 should be enough) and put those as DB/WALL (make sure they have power protection)
the ceph tell osd.* bench is not that relevant for real world, I suggest to try some FIO tests see here
set OSD osd_memory_target to at 8G or RAM minimum.
in order to save some write on your HDD (data is not replicated X times) create your RBD pool as EC (erasure coded pool) but please do some research on that because there are some tradeoffs. Recovery takes some extra CPU calculations
All and all, hype-converged clusters are good for training, small projects and medium projects with not such a big workload on them... Keep in mind that planning is gold
Just my 2 cents,
B.

Cluster Configuration - Worker Nodes

I am beginner in cluster configuration. I know in our cluster we have types of worker nodes:
16 x 4TB Disks
128 RAM
2 x 8 Core CPUs
12 x 1.2 TB Disks
256 RAM
2 x 10 Core CPUs
I am confused about the configuration. What does mean 2 x 8 cores? It means 2 processor with 8 physical core each? So if my processor are hyperthreading i will have 2 X 8 X 2 = 32 virtual cores?
And 12 x 1.2 TB means, 12 disks with 1.2 TB each?
Usually 2x 8 Core CPUs, means, that you have 2 physical chips on your motherboard, each having 8 Cores. If you enable hyperthreading, you then have 32 virtual cores.
The amount of disks is either the way, like you stated it, or its the number of nodes. Then you have 16 nodes with 4TB disk.... and 12 nodes with 1.2TB disk ....
I am just wondering, how someone can get to this hardware, not knowing what it means. Can you send me some nodes? :)

Cassandra Amazon EC2 , lots of IOWait

We have the following stats on single node cassandra on Amazon EC2/Rightscale m1.large instance with 2 ephemeral disks with raid0. (7.6 GB Total Memory)
4 GB RAM is allocated to cassandra Heap, 800MB is Heap NEW size.
following stats are from OpsCenter community 2.0
Read Requests 285 to 340 per second
Write Requests 257 to 720 per second
OS Load 15.15 to 17.15
Write Request Latency 293 to 685 micros
OS Sent Network Traffic 18 MB to 30 MB per second
OS Recieved Network Traffic 22 MB to 34 MB per second
OS Disk Queue Size 23 to 26 requests
Read Requests Pending 8 to 20
Read Request Latency 69140 to 92885 micros
OS Disk latency 37 to 42 ms
OS Disk Throughput 12 to 14 Mb per second
Disk IOPs Reads 600 to 740 per second
Disk IOPs Writes 2 to 7 per second
IOWait 60 to 70 % CPU avg
Idle 24 to 30 % CPU avg
Rowcache is disabled.
Are the above stats are satisfying with the provided configuration....OR how could we tweak it more to get less IOWait..........because we think that we are experiencing lots of IOWait.....how could we tweak it to get the best.
Read Requests are mixed.........some are from one super column family and one standard having more than million keys......and varying no. of super columns max 14 with varying no. of subcolumns from 1 to 10000 and varying no. of columns max 14 in standard column family...............subcolumns are very thin in nature with 0 bytes value....8 bytes for name.
Process is removing the data from super column family and writing the processed data on standard one.
Would EBS Disks work better....on Amazon EC2
I'm not positive whether you can tweak your config easily to get more disk performance, but using Snappy compression could help a good deal in making your app need to read less overall. It may also help to use the new composite key layout instead of supercolumns.
One thing I can say for sure: EBS will NOT work better. Stay away from that at all costs if you care about latency.

Task Manager: CPU usage history

I bougth recently a server with 2 x X5550, they are quad (4 cores each) total 8 cores
If I check the task manager it shows in the CPU usage history 16 diagrams,
Should't it be 8 cause I have 2 processors with quad?
or the diagrams maybee shows the Threads of the CPU?
The CPUs have support for HyperThreading, so each core x2 logical CPUs.
You can always lookup the chip specs on Intel's site

Cache bandwidth per tick for modern CPUs

What is a speed of cache accessing for modern CPUs? How many bytes can be read or written from memory every processor clock tick by Intel P4, Core2, Corei7, AMD?
Please, answer with both theoretical (width of ld/sd unit with its throughput in uOPs/tick) and practical numbers (even memcpy speed tests, or STREAM benchmark), if any.
PS it is question, related to maximal rate of load/store instructions in assembler. There can be theoretical rate of loading (all Instructions Per Tick are widest loads), but processor can give only part of such, a practical limit of loading.
For nehalem: rolfed.com/nehalem/nehalemPaper.pdf
Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache.
128 bit = 16 bytes / clock read
AND
128 bit = 16 bytes / clock write
(can I combine read and write in single cycle?)
The L2 and L3 caches each have a 256-bit port for reading or writing,
but the L3 cache must share its port with three other cores on the chip.
Can L2 and L3 read and write ports be used in single clock?
Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.
Latency (clock ticks), some measured by CPU-Z's latencytool or by lmbench's lat_mem_rd - both uses long linked list walk to correctly measure modern out-of-order cores like Intel Core i7
L1 L2 L3, cycles; mem link
Core 2 3 15 -- 66 ns http://www.anandtech.com/show/2542/5
Core i7-xxx 4 11 39 40c+67ns http://www.anandtech.com/show/2542/5
Itanium 1 5-6 12-17 130-1000 (cycles)
Itanium2 2 6-10 20 35c+160ns http://www.7-cpu.com/cpu/Itanium2.html
AMD K8 12 40-70c +64ns http://www.anandtech.com/show/2139/3
Intel P4 2 19 43 200-210 (cycles) http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k 3 20 180 (cycles) --//--
AthlonFX-51 3 13 125 (cycles) --//--
POWER4 4 12-20 ?? hundreds cycles --//--
Haswell 4 11-12 36 36c+57ns http://www.realworldtech.com/haswell-cpu/5/
And good source on latency data is 7cpu web-site, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html
More about lat_mem_rd program is in its man page or here on SO.
Widest read/writes are 128 bit (16 byte) SSE load/store. L1/L2/L3 caches have different bandwidths and latencies and these are of course CPU-specific. Typical L1 latency is 2 - 4 clocks on modern CPUs but you can usually issue 1 or 2 load instructions per clock.
I suspect there's a more specific question lurking here somewhere - what is it that you are actually trying to achieve ? Do you just want to write the fastest possible memcpy ?

Resources