Postgres configuration for better performance - performance

We have done a PostreSql database Based ERP project. I have 32 GB RAM Windows Server 2012 R2 system. Out of 32 GB, I have used 8 GB for JVM and assuming 4 GB for OS, I have tried to tune the postgres with 20 GB RAM.
I have find out the configuration from the below link:
https://www.pgconfig.org/#/tuning?total_ram=20&max_connections=300&environment_name=OLTP&pg_version=9.2&os_type=Windows&arch=x86-64&share_link=true
But the performance goes down after the change. What could be the reason. As I am less knowledge in the postgres server maintenance, if anything more required for you to assess/answer let me know.
UPDATE
shared_buffers (integer) : 512 MB
effective_cache_size (integer) : 15 GB
work_mem (integer): 68 MB
maintenance_work_mem (integer): 1 GB
checkpoint_segments (integer): 96
checkpoint_completion_target (floating): 0.9
wal_buffers (integer): 16 MB

Related

Performance of my MPI code does not improve when I use two NUMA nodes (dual Xeon chips)

I have a computer, Precision-Tower-7810 dual Xeon E5-2680v3 #2.50GHz × 48 threads.
Here is result of $lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2680 v3 # 2.50GHz
Stepping: 2
CPU MHz: 1200.000
CPU max MHz: 3300,0000
CPU min MHz: 1200,0000
BogoMIPS: 4988.40
Virtualization: VT-x
L1d cache: 768 KiB
L1i cache: 768 KiB
L2 cache: 6 MiB
L3 cache: 60 MiB
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
My MPI code is based on basic MPI (Isend, Irecv, Wait, Bcast). Fundamentally, the data will be distributed and sent to all processors. On each processor, data is used to calculate something and its value is changed. After the above procedure, the amount of data on each processor is exchanged between all processors. This work is repeated to a limit.
Now, the main issue is that when I increase the number of processors within the limit of one chip (24 threads), performance increases. However, performance does not improve while the number of processors > 24 threads.
An example:
$mpiexec -n 6 ./mywork : 72s
$mpiexec -n 12 ./mywork : 46s
$mpiexec -n 24 ./mywork : 36s
$mpiexec -n 32 ./mywork : 36s
$mpiexec -n 48 ./mywork : 35s
I have tried on the both OpenMPI and MPICH, obtained result is the same. So, I think issue of physical connect type (NUMA nodes) of two chips. It is assumption of mine, I have never used a really supercomputer. I hope anyone know this issue and help me. Thank you for reading.

strange CPU binding/pining result within OpenMPI

I have tried to evaluate an OpenMPI program with Matrix Multiplication algorithm, the written code scales very well on a single thread per core machine in our Laboratory (close to ideal speedup within 48 and 64 cores), However, on some other machines which are hyperthreaded there is strange behavior, as you can see in the screenshot from htop I realized the CPU utilization when I run the same experiment with the same command is different and strange, I executed the program with
mpirun --bind-to hwthread--use-hwthread-cpus -n 2 ...
Here I bind the MPI workers to each hwthread, and can be seen with -n 2 which means I overwrite the variable in such a way to bind the execution on two processors (here hwthreads), however, seems it uses another hwthread with more or less 50% of utilization as well! I found this strange because there is not any extra CPU utilization on other machines, I tried this experiment many times and I'm sure this is not a temporary check or sth by OS and is due to the execution model of OpenMPI.
I appreciate it if someone could explain this behavior and extra CPU utilization when I execute this on the hyper-threaded machine.
The output of lscpu is as below:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6786.36
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 1 MiB
L2 cache: 8 MiB
L3 cache: 32 MiB
The version of OpenMPI for all machines is the same 2.1.1.
Maybe Hyperthreading is not the case and I was misled by this, but the only big difference between these environments are 1) the Hyperthreading and 2) Clock Frequency of the processors which is based on different CPUs is different between 2200 MHz to 4.8 GHz.

Spark Scratch Space

I have a cluster of 13 machines with 4 physical CPUs and 24 G of RAM.
I started a spark cluster with one driver and 12 slaves.
I set the number of cores by slaves to 12 cores, meaning I have a cluster as foloowing :
Alive Workers: 12
Cores in use: 144 Total, 110 Used
Memory in use: 263.9 GB Total, 187.0 GB Used
I started an application with the folowing configuration :
[('spark.driver.cores', '4'),
('spark.executor.memory', '15G'),
('spark.executor.id', 'driver'),
('spark.driver.memory', '5G'),
('spark.python.worker.memory', '1042M'),
('spark.cores.max', '96'),
('spark.rdd.compress', 'True'),
('spark.serializer.objectStreamReset', '100'),
('spark.executor.cores', '8'),
('spark.default.parallelism', '48')]
I understand there are 15G of RAM by executor with 8 task slot and a parallelism of 48 (48 = 6 task slot * 12 slaves).
then I have two big files on HDFS : 6 G each, (from a directory of 12 files of 5 blocks of 128 Mb each) , with a 3x replication factor.
I union these two files => I get one dataframe of 12 GB I think but I see a 37 G reading input through the IHM :
That could be the first question : Why 37 Gb ?
Then as the execution time is too long for me, I try to cache the data so that I can go faster. But the caching method never finishes, here you can see it is already 45 minutes before the end (Vs 6 min not cached !):
So I try to understand why, and I see the usage of Memory/Disk on the storage section of the ihm :
So there are some part of the RDD that are staying on disk.
Furthemore I see the executors may still have free memory :
And I notice on the same "storage" page that the size of the RDD has jumped :
Storage Level: Disk Serialized 1x Replicated
Cached Partitions: 72
Total Partitions: 72
Memory Size: 42.7 GB
Disk Size: 73.3 GB
=> I understand : Memory Size: 42.7 GB + Disk Size: 73.3 GB = 110 G !
=> So my 6 G file has transformed on 37 G and then on 110 G ???
But i try to understand why is there still some memory left on my executor, and I go to the "err" dump of one, and I see :
18/02/08 11:04:08 INFO MemoryStore: Will not store rdd_50_46
18/02/08 11:04:09 WARN MemoryStore: Not enough space to cache rdd_50_46 in memory! (computed 1134.1 MB so far)
18/02/08 11:04:09 INFO MemoryStore: Memory use = 1641.6 KB (blocks) + 7.7 GB (scratch space shared across 6 tasks(s)) = 7.7 GB. Storage limit = 7.8 GB.
18/02/08 11:04:09 WARN BlockManager: Persisting block rdd_50_46 to disk instead.
And Here I see that the executor want to cache a 1641.6 KB block (only 1Mo !) and I can't because there is a ["scratch space"] of 7.7 Gb "shared across 6 tasks".
=> What is a "scratch space" ? ?
=> The 6 tasks => comes from the parallelism of 48 / 12 = 6
And then I come back to the app information, and I see that the count that lasted 48 min read only 37 Gb of data ! (The 48 min are clearly used to cache the data too)
When I do a count on the cached dataframe I have a 116G input read :
And at the end of the day, the time saved by the cached count is not that impressive, here are 3 duration :
4.8 ' : count on cached df
48' : count while caching
5.8' : count on not cached df (read directly from hdfs)
So why is it so ?
Because the cached df is not that much cached :
Meaning more or less 40 Gb in memory and 60 Gb on disk.
I am surprised because at 15G / executor * 12 slaves => 180 Gb of memory, and I can cache only 40 Gb ... But in fact I remember that the memory is splitted :
30% for spark
54% for storage
16% for shuffle
So I understand that I do have 54% * 15G for storage, ie 8.1 G, meaning that on my 180 Gb, I only have 97 Gb for storage. Why do I have 90 - 40 = 50 G not used then ?
Oups... This is a long post !
Plenty of questions... Sorry...

postgresql cache and swap memory increasing

I have postgresql(Master and Slave) infrastructure. The postgresql server have 32 GB RAM and 600 MB swap space. We are using java(tomcat) applications. I have performed postgresql tuning using http://pgtune.leopard.in.ua/. But currently my issue is, my postgresql server caching and swap memory increasing periodically and certain point we forced to clear the cache or restart the postgresql. Could you please let me with the reason behind it. Below are my performance tuning parameter. Rest of the parameters are default. Also we are using pgbarman to take the point of time backup which configure on other server.
postgresql.conf
===============
max_connections = 300
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 27962kB
maintenance_work_mem = 2GB
checkpoint_segments = 32
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
/etc/sysctl.conf
================
kernel.shmmax=17179869184
kernel.shmall=4194304

How to Resolve this Out of Memory Issue for a Small Variable in Matlab?

I am running a 32-bit version of Matlab R2013a on my computer (4GB RAM, and 32-bit Windows 7).
I have dataset (~ 60 MB) and I want to read it using
ds = dataset('File', myFile, 'Delimiter', ',');
And each time I face Out of Memory error. Theoretically, I should be able to use 2GB of RAM, so there should be no problem reading such small files.
Here is what I got when typed memory
Maximum possible array: 36 MB (3.775e+07 bytes) *
Memory available for all arrays: 421 MB (4.414e+08 bytes) **
Memory used by MATLAB: 474 MB (4.969e+08 bytes)
Physical Memory (RAM): 3317 MB (3.478e+09 bytes)
* Limited by contiguous virtual address space available.
** Limited by virtual address space available.
I followed every instructions I found (this is not a new issue), but for my case it seems rather weird, because I cannot run a simple program now.
System: Windows 7 32 bit
Matlab: R2013a
RAM: 4 GB
Clearly your issue is right here.
Maximum possible array: 36 MB (3.775e+07 bytes) *
You are either using a lot of memory in your system and/or you have a very low swap space.

Resources