I've seen a good bit about Docker setups and the like using unprivileged containers running ES. Basically, I want to set up a simple "prod cluster" with a total of two nodes: one physical (for data) and one for ingest/master (an LXD container).
The issue I've run into is using bootstrap.memory_lock: true as a config option to lock memory (avoid swapping) on my containerized master/ingest node.
[2018-02-07T23:28:51,623][WARN ][o.e.b.JNANatives ] Unable to lock JVM Memory: error=12, reason=Cannot allocate memory
[2018-02-07T23:28:51,624][WARN ][o.e.b.JNANatives ] This can result in part of the JVM being swapped out.
[2018-02-07T23:28:51,625][WARN ][o.e.b.JNANatives ] Increase RLIMIT_MEMLOCK, soft limit: 65536, hard limit: 65536
[2018-02-07T23:28:51,625][WARN ][o.e.b.JNANatives ] These can be adjusted by modifying /etc/security/limits.conf, for example:
# allow user 'elasticsearch' mlockall
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
...
[1]: memory locking requested for elasticsearch process but memory is not locked
Now, this makes sense given that the ES user can't adjust ulimits on the host. Given that I know just enough about this to be dangerous: what is the right way to ensure that my unprivileged container can lock the memory it needs, given that there is no ES user on the host?
I'll just call this resolved - I set swapoff on the parent and left that setting at its default in the container. Not what I would call "the right way" as asked in my question, but good/close enough.
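For reference, a sketch of what "the right way" might look like - hedged, since the exact keys depend on your LXD and Elasticsearch versions. Raise RLIMIT_MEMLOCK for the container itself (LXD exposes per-container kernel limits via limits.kernel.*; the container name es-master is just a placeholder):

    lxc config set es-master limits.kernel.memlock unlimited
    lxc restart es-master

Then, if Elasticsearch runs under systemd inside the container, add an override so the service inherits the higher limit:

    # /etc/systemd/system/elasticsearch.service.d/override.conf
    [Service]
    LimitMEMLOCK=infinity

With both in place, bootstrap.memory_lock: true should no longer trip the RLIMIT_MEMLOCK check; disabling swap on the host remains a reasonable fallback.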
Related
I recently pushed a new container image to one of my GKE deployments and noticed that API latency went up and requests started returning 502s.
Looking at the logs I found that the container started crashing because of OOM:
Memory cgroup out of memory: Killed process 2774370 (main) total-vm:1801348kB, anon-rss:1043688kB, file-rss:12884kB, shmem-rss:0kB, UID:0 pgtables:2236kB oom_score_adj:980
Looking at the memory usage graph it didn't look like the pods were using more than 50MB of memory combined. My original resource requests were:
...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: api-server
        ...
        resources:
          # You must specify requests for CPU to autoscale
          # based on CPU utilization
          requests:
            cpu: "150m"
            memory: "80Mi"
          limits:
            cpu: "1"
            memory: "1024Mi"
      - name: cloud-sql-proxy
        # It is recommended to use the latest version of the Cloud SQL proxy
        # Make sure to update on a regular schedule!
        image: gcr.io/cloudsql-docker/gce-proxy:1.17
        resources:
          # You must specify requests for CPU to autoscale
          # based on CPU utilization
          requests:
            cpu: "100m"
...
Then I tried bumping the request for the API server to 1GB, but it did not help. In the end, what helped was reverting the container image to the previous version.
Looking through the changes in the golang binary, there are no obvious memory leaks. When I run it locally it uses at most 80MB of memory, even under load from the same requests as in production.
And the graph above, which I got from the GKE console, also shows the pod using far less than the 1GB memory limit.
So my question is: what could cause GKE to kill my process for OOM when both GKE monitoring and running it locally show it using only 80MB out of the 1GB limit?
=== EDIT ===
Adding another graph of the same outage. This time splitting the two containers in the pod. If I understand correctly, the metric here is non-evictable container/memory/used_bytes:
container/memory/used_bytes (GA)
Memory usage
GAUGE, INT64, By, k8s_container
Memory usage in bytes. Sampled every 60 seconds.
memory_type: Either `evictable` or `non-evictable`. Evictable memory is memory that can be easily reclaimed by the kernel, while non-evictable memory cannot.
Edit Apr 26 2021
I tried updating the resources field in the deployment YAML to 1GB RAM requested and 1GB RAM limit, as suggested by Paul and Ryan:
resources:
  # You must specify requests for CPU to autoscale
  # based on CPU utilization
  requests:
    cpu: "150m"
    memory: "1024Mi"
  limits:
    cpu: "1"
    memory: "1024Mi"
Unfortunately it had the same result after updating with kubectl apply -f api_server_deployment.yaml:
{
insertId: "yyq7u3g2sy7f00"
jsonPayload: {
apiVersion: "v1"
eventTime: null
involvedObject: {
kind: "Node"
name: "gke-api-us-central-1-e2-highcpu-4-nod-dfe5c3a6-c0jy"
uid: "gke-api-us-central-1-e2-highcpu-4-nod-dfe5c3a6-c0jy"
}
kind: "Event"
message: "Memory cgroup out of memory: Killed process 1707107 (main) total-vm:1801412kB, anon-rss:1043284kB, file-rss:9732kB, shmem-rss:0kB, UID:0 pgtables:2224kB oom_score_adj:741"
metadata: {
creationTimestamp: "2021-04-26T23:13:13Z"
managedFields: [
0: {
apiVersion: "v1"
fieldsType: "FieldsV1"
fieldsV1: {
f:count: {
}
f:firstTimestamp: {
}
f:involvedObject: {
f:kind: {
}
f:name: {
}
f:uid: {
}
}
f:lastTimestamp: {
}
f:message: {
}
f:reason: {
}
f:source: {
f:component: {
}
f:host: {
}
}
f:type: {
}
}
manager: "node-problem-detector"
operation: "Update"
time: "2021-04-26T23:13:13Z"
}
]
name: "gke-api-us-central-1-e2-highcpu-4-nod-dfe5c3a6-c0jy.16798b61e3b76ec7"
namespace: "default"
resourceVersion: "156359"
selfLink: "/api/v1/namespaces/default/events/gke-api-us-central-1-e2-highcpu-4-nod-dfe5c3a6-c0jy.16798b61e3b76ec7"
uid: "da2ad319-3f86-4ec7-8467-e7523c9eff1c"
}
reason: "OOMKilling"
reportingComponent: ""
reportingInstance: ""
source: {
component: "kernel-monitor"
host: "gke-api-us-central-1-e2-highcpu-4-nod-dfe5c3a6-c0jy"
}
type: "Warning"
}
logName: "projects/questions-279902/logs/events"
receiveTimestamp: "2021-04-26T23:13:16.918764734Z"
resource: {
labels: {
cluster_name: "api-us-central-1"
location: "us-central1-a"
node_name: "gke-api-us-central-1-e2-highcpu-4-nod-dfe5c3a6-c0jy"
project_id: "questions-279902"
}
type: "k8s_node"
}
severity: "WARNING"
timestamp: "2021-04-26T23:13:13Z"
}
Kubernetes seems to have almost immediately killed the container for using 1GB of memory. But again, the metrics show the container using only 2MB of memory.
And again I am stumped because even under load this binary does not use more than 80MB when I run it locally.
I also tried running go tool pprof <url>/debug/pprof/heap. It showed several different values as Kubernetes kept thrashing the container, but none higher than ~20MB, and no memory usage out of the ordinary.
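For context, here is roughly how the heap endpoint queried above is usually exposed in a Go binary - a minimal sketch, assuming the API server wires pprof this way (the actual server may differ):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Serve the profiling endpoints on a side port so they are not exposed
        // through the public API listener; `go tool pprof http://host:6060/debug/pprof/heap`
        // can then sample the live heap.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // ... start the actual API server here ...
        select {}
    }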
Edit 04/27
I tried setting request=limit for both containers in the pod:
requests:
  cpu: "1"
  memory: "1024Mi"
limits:
  cpu: "1"
  memory: "1024Mi"
...
requests:
  cpu: "100m"
  memory: "200Mi"
limits:
  cpu: "100m"
  memory: "200Mi"
But it didn't work either:
Memory cgroup out of memory: Killed process 2662217 (main) total-vm:1800900kB, anon-rss:1042888kB, file-rss:10384kB, shmem-rss:0kB, UID:0 pgtables:2224kB oom_score_adj:-998
And the memory metrics still show usage in the single digit MBs.
Update 04/30
I pinpointed the change that seemed to cause this issue by painstakingly checking out my latest commits one by one.
In the offending commit I had a couple of lines like
type Pic struct {
    image.Image
    Proto *pb.Image
}
...
pic.Image = picture.Resize(pic, sz.Height, sz.Width)
...
Where picture.Resize eventually calls resize.Resize.
I changed it to:
type Pic struct {
    Img   image.Image
    Proto *pb.Image
}
...
pic.Img = picture.Resize(pic.Img, sz.Height, sz.Width)
This solves my immediate problem and the container runs fine now. But it does not answer my original question:
Why did these lines cause GKE to OOM my container?
And why did the GKE memory metrics show that everything was fine?
I guess it was caused by the Pod QoS class.
When the system is overcommitted, the QoS classes determine which pod gets killed first so the freed resources can be given to higher priority pods.
In your case, the QoS of your pod would be Burstable.
Each running process has an OutOfMemory (OOM) score. The system selects the process to kill by comparing the OOM scores of all running processes. When memory needs to be freed, the process with the highest score gets killed. For details of how the score is calculated, please refer to How is kernel oom score calculated?.
Which pod will be killed first if both in the Burstable class?
In short, the system will kill the one that is using a larger percentage of its requested memory.
Pod A
  used: 90Mi
  requests: 100Mi
  limits: 200Mi

Pod B
  used: 150Mi
  requests: 200Mi
  limits: 400Mi
Pod A will get killed before Pod B because it uses 90% of its requested memory while Pod B uses only 75% of its requested memory.
Ensuring a QoS class of "Guaranteed" won't help in your scenario. One of your processes causes the parent cgroup to go over its memory limit - in turn set by the memory limit value you specify for the respective container - and the OOM killer terminates it. It's not a pod eviction, as you can clearly see the trademark message of the OOM killer in the logs. The scenario where a "Guaranteed" QoS class would help is if another pod allocated so much memory that it brought the node under memory pressure - in that case your "Guaranteed" pod would be spared. But in your case, the Kubelet never gets a word in all this - as in deciding to evict the pod altogether - because the OOM killer acts faster.
Burak Serdar has a good point in his comments - temporary allocations of large memory chunks. That could very well be the case, given that the data-collection resolution in your case is 60s, judging from the messages you pasted. That's a lot of time. One can easily fill GBs of RAM in less than 1s. My assumption is that the memory "spikes" never get rendered because the metrics are never collected in time (even if you queried cAdvisor directly it would be tricky, since it has a resolution of 10-15 seconds for collecting its metrics).
How to see more about what goes on? A couple of ideas:
There are tools that show how much an application actually allocates, down to the framework level. In .NET, dotMemory is a commonly used tool that can run inside a container and capture what goes on. There is probably an equivalent for Go. The problem with this approach is that when the container gets OOMKilled, the tool gets taken down with it.
Write details about memory usage from within your own app (a rough Go sketch of this idea follows this list). Here you'll find a movie that captures a process allocating memory until its parent container got OOM killed. The respective .NET app periodically writes to the console the quantity of memory it uses, which the Kubernetes logs retain even after the container is gone, thus allowing you to see what happened.
Throttle the app so that it processes a small amount of data (e.g. temporarily look at what happens from a memory standpoint if you process just 1 picture per minute)
Look at the detailed OOM killer kernel logs to see all the processes in the cgroup. It's perfectly valid to have multiple processes inside a container (as in other processes aside from the one with PID 1 in that container), and the OOM killer could very well kill any one of them. You can stumble upon unexpected twists in this case. Yet in your scenario it's the main process that appears to have been terminated, otherwise the container wouldn't have gotten OOMKilled, so this scenario is unlikely.
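A rough Go sketch of the second idea above, assuming the API server is the Go binary from the question; the interval, log fields, and function name (logMemStats) are illustrative, not from the original post:

    package main

    import (
        "log"
        "runtime"
        "time"
    )

    // logMemStats periodically writes runtime memory statistics to stdout so the
    // numbers survive in the container logs even after an OOM kill.
    func logMemStats() {
        var m runtime.MemStats
        for range time.Tick(10 * time.Second) {
            runtime.ReadMemStats(&m)
            // HeapAlloc: live heap objects; Sys: total memory obtained from the OS.
            log.Printf("heap_alloc=%dMiB heap_sys=%dMiB sys=%dMiB num_gc=%d",
                m.HeapAlloc/1024/1024, m.HeapSys/1024/1024, m.Sys/1024/1024, m.NumGC)
        }
    }

    func main() {
        go logMemStats()
        // ... start the actual API server here ...
        select {}
    }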
Just for completeness' sake: the underlying framework can enforce lower limits than the memory limit of your container(s). E.g. in .NET this is 75% when running in a container with a memory limit. In other words, a .NET app allocating memory inside a container with a 2,000 MiB limit will error out at 1,500 MiB. Yet in that case you get an exit code of 139 (SIGSEGV). This doesn't appear to apply here, since the OOM killer terminates the process, and it's clearly visible from the kernel logs that all of the 1 GiB is actually used (anon-rss:1043688kB). To my knowledge, Go doesn't have a similar setting yet, although the community has repeatedly asked for it.
The resource spec here is the root cause for the OOM.
In Kubernetes, requested and limit memory are defined differently. Requested memory is the memory the container is guaranteed to have. Limit memory is the amount the container can burst up to, but the limit does not guarantee that the container can actually get those resources.
In most production systems, it is not recommended that the limit and the requested resources differ too much. For example, in your case,
requests:
  cpu: "150m"
  memory: "80Mi"
limits:
  cpu: "1"
  memory: "1024Mi"
The container is only guaranteed 80Mi of memory, but it can burst up to 1024Mi. The node may not have enough memory for that burst, and the container itself will then go OOM.
So, if you want to improve this situation, you need to configure the resource to be something like this.
requests:
  cpu: "150m"
  memory: "1024Mi"
limits:
  cpu: "1"
  memory: "1024Mi"
Please note that CPU is fine as it is, because a process does not get killed under CPU pressure - it just gets throttled. An OOM, however, leads to the process being killed.
As the answer above mentioned, this is related to the quality of service of the pod. In general, for most end users, you should configure your container as the Guaranteed class, i.e. requested == limit. You may need some justification before configuring it as the Burstable class.
Dear Stackoverflowians,
I’m having a problem with a Spring Cloud Stream app using a Kafka Streams binder. The issue occurs only in our own Pivotal Cloud Foundry (CF) environment. I have kind of hit a wall at this point, so I turn to you and your wisdom!
When the application starts up, I see the following error:
<snip>
2019-08-07T15:17:58.36-0700 [APP/PROC/WEB/0]OUT current active tasks: [0_3, 1_3, 2_3, 3_3, 4_3, 0_7, 1_7, 5_3, 2_7, 3_7, 4_7, 0_11, 1_11, 5_7, 2_11, 3_11, 4_11, 0_15, 1_15, 5_11, 2_15, 3_15, 4_15, 0_19, 1_19, 5_15, 2_19, 3_19, 4_19, 0_23, 1_23, 5_19, 2_23, 3_23, 4_23, 5_23]
2019-08-07T15:17:58.36-0700 [APP/PROC/WEB/0]OUT current standby tasks: []
2019-08-07T15:17:58.36-0700 [APP/PROC/WEB/0]OUT previous active tasks: []
2019-08-07T15:18:02.67-0700 [API/0] OUT Updated app with guid 2db4a719-53ee-4d4a-9573-fe958fae1b4f ({"state"=>"STOPPED"})
2019-08-07T15:18:02.64-0700 [APP/PROC/WEB/0]ERR terminate called after throwing an instance of 'std::system_error'
2019-08-07T15:18:02.64-0700 [APP/PROC/WEB/0]ERR what(): Resource temporarily unavailable
2019-08-07T15:18:02.67-0700 [CELL/0] OUT Stopping instance 516eca4f-ea73-4684-7e48-e43c
2019-08-07T15:18:02.67-0700 [CELL/SSHD/0]OUT Exit status 0
2019-08-07T15:18:02.71-0700 [APP/PROC/WEB/0]OUT Exit status 134
2019-08-07T15:18:02.71-0700 [CELL/0] OUT Destroying container
2019-08-07T15:18:03.62-0700 [CELL/0] OUT Successfully destroyed container
The key here being the line with
what(): Resource temporarily unavailable
The error is related to the number of partitions. If I set the partition count to 12 or less things work. If I double it, the process fails to start with this error.
This doesn’t happen on my local Windows dev machine. It also doesn’t happen in my local Docker environment when I wrap this app in a Docker image and run it. But whether I take the same image and push it to CF or push the app as a Java app, I get this error.
Here is some information about the Kafka Streams app. We have an input topic with a number of partitions. The topic is the output of a Debezium connector and is basically a change log of a bunch of database tables. The topology is not super complex, but it’s not trivial. Its job is to aggregate the table update information back into our aggregates. We end up with 17 local stores in the topology. I have a strong suspicion this issue has something to do with RocksDB and the resources available to the CF container the app is in. But I have not the faintest idea what the resource is that’s “temporarily unavailable”.
As I mentioned, I tried deploying it as a Docker container with various JDK 8 JVMs and different base images (CentOS, Debian), I tried various CF Java buildpacks, and I tried limiting the Java heap in relation to the max container memory size (thinking that maybe it has something to do with native memory allocation) - all to no avail.
I’ve also asked our ops folks to raise some limits on the containers; the open files limit was changed from the initial 16k to 500k+. I saw some file-lock-related errors like the one below, but they went away after this change.
2019-08-01T15:46:23.69-0700 [APP/PROC/WEB/0]ERR Caused by: org.rocksdb.RocksDBException: lock : /home/vcap/tmp/kafka-streams/cms-cdc/0_7/rocksdb/input/LOCK: No locks available
2019-08-01T15:46:23.69-0700 [APP/PROC/WEB/0]ERR at org.rocksdb.RocksDB.open(Native Method)
However, the what(): Resource temporarily unavailable error with a higher number of partitions persists.
ulimit -a on the container looks like this
~$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1007531
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 524288
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
I really do need to understand what the root of this error is. It’s hard to plan in this case not knowing what limit we’re hitting here.
Hope to hear your ideas. Thanks!
Edit:
Is there maybe some way to get more verbose error messages from the rocksdb library or a way to maybe build it so it outputs more info?
Edit 2
I have also tried to customize the rocksdb memory settings using org.apache.kafka.streams.state.RocksDBConfigSetter
The defaults are in org.apache.kafka.streams.state.internals.RocksDBStore#openDB(org.apache.kafka.streams.processor.ProcessorContext)
First I made sure the Java heap settings were well below the container process size limit and left nothing to the memory calculator by setting
JAVA_OPTS: -XX:MaxDirectMemorySize=100m -XX:ReservedCodeCacheSize=240m -XX:MaxMetaspaceSize=145m -Xmx1000m
With this I tried:
1. Lowering the write buffer size:
   org.rocksdb.Options#setWriteBufferSize(long)
   org.rocksdb.Options#setMaxWriteBufferNumber(int)
2. Setting max_open_files to half the limit for the container (the total across all DB instances):
   org.rocksdb.Options#setMaxOpenFiles(int)
3. Turning off the block cache altogether:
   org.rocksdb.BlockBasedTableConfig#setNoBlockCache
4. Setting cache_index_and_filter_blocks = true after re-enabling the block cache:
   https://github.com/facebook/rocksdb/wiki/Block-Cache#caching-index-and-filter-blocks
All to no avail. The above issue still happens when I set a higher number of partitions (24) on the input topic. Now that I have a RocksDBConfigSetter with logging in it, I can see that the error happens exactly when RocksDB is configured.
Edit 3
I still haven't gotten to the bottom of this. I have asked the question on https://www.facebook.com/groups/rocksdb.dev and was advised to trace system calls with strace or similar, but I was not able to obtain the required permissions to do that in our environment.
It has eaten up so much time that I had to settle for a workaround for now. What I ended up doing was refactoring the topology to:
1) minimize the number of materialized KTables (and the number of resulting RocksDB instances), and
2) break up the topology among multiple processes.
This allowed me to turn topology parts on and off in separate deployments with Spring profiles, and it has given me some limited way forward for now.
I am running systemd version 219.
root#EVOvPTX1_RE0-re0:/var/log# systemctl --version
systemd 219
+PAM -AUDIT -SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP -GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID -ELFUTILS +KMOD -IDN
I have a service, let's call it foo.service, which has the following:
[Service]
MemoryLimit=1G
I have deliberately added code that allocates 1M of memory 4096 times, i.e. a 4G memory allocation, when a certain event is received. The idea is that after the process consumes 1G of address space, memory allocation would start failing. However, this does not seem to be the case: I am able to allocate 4G of memory without any issues. This tells me that the memory limit specified in the service file is not enforced.
Can anyone let me know what I am missing?
I looked at the proc file system - the file named limits. It shows that the max address space is unlimited, which also confirms that the memory limit is not getting enforced.
The distinction is that you have allocated memory, but you haven't actually used it. In the output of top, this is the difference between the "VIRT" memory column (allocated) and the "RES" column (actually used).
Try modifying your experiment to assign values to elements of a large array instead of just allocating memory and see if you hit the memory limit that way.
Reference: Resident and Virtual memory on Linux: A short example
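To illustrate, a minimal sketch in Go - a hypothetical test program, not the service from the question: the first loop only reserves address space (VIRT grows, RES stays low), while the second loop writes to every page, making the memory resident so that MemoryLimit=1G can actually trigger.

    package main

    import "time"

    func main() {
        chunks := make([][]byte, 0, 4096)

        // Allocate 4096 x 1 MiB. The pages are not touched yet, so resident
        // memory stays small even though ~4 GiB of address space is reserved.
        for i := 0; i < 4096; i++ {
            chunks = append(chunks, make([]byte, 1<<20))
        }

        // Now touch every page: resident memory grows, and once usage crosses
        // the cgroup limit the process should be OOM-killed.
        for _, c := range chunks {
            for i := 0; i < len(c); i += 4096 {
                c[i] = 1
            }
        }

        time.Sleep(time.Hour) // keep the process alive for observation in top
    }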
When I try to start the Redis service I keep getting this error:
"The Redis service on Local Computer started and then stopped. Some services stop automatically if they are not in use by other services or programs".
The only thing that works is restarting my computer, then the redis service is running on startup.
Is there any configuration I need to set up in order for it to work better?
I installed redis using the .msi, version 2.8.2104.
All help would be very appreciated! Thanks
Right-click on the service in Windows Services and go to Properties. Then go to the Log On tab and select "Local System account". Click the OK button and start the service.
For those that may have a similar problem (like we did), I found another solution.
The machine we're running on (TEST) only had 7GB of free space on the drive, but we have 16GB of RAM. In our redis.windows.conf file, there is a setting called maxheap that was NOT set.
According to the documentation on maxheap:
# The maxheap flag controls the maximum size of this memory mapped file,
# as well as the total usable space for the Redis heap. Running Redis
# without either maxheap or maxmemory will result in a memory mapped file
# being created that is equal to the size of physical memory. During
# fork() operations the total page file commit will max out at around:
#
# (size of physical memory) + (2 * size of maxheap)
#
# For instance, on a machine with 8GB of physical RAM, the max page file
# commit with the default maxheap size will be (8)+(2*8) GB , or 24GB. The
# default page file sizing of Windows will allow for this without having
# to reconfigure the system. Larger heap sizes are possible, but the maximum
# page file size will have to be increased accordingly.
#
# The Redis heap must be larger than the value specified by the maxmemory
# flag, as the heap allocator has its own memory requirements and
# fragmentation of the heap is inevitable. If only the maxmemory flag is
# specified, maxheap will be set at 1.5*maxmemory. If the maxheap flag is
# specified along with maxmemory, the maxheap flag will be automatically
# increased if it is smaller than 1.5*maxmemory.
#
# maxheap <bytes>
So I set it to a reasonable value and the service started right up.
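For example - the value is only an illustration, pick whatever fits your disk and page-file budget, and check whether your build accepts unit suffixes or only plain bytes:

    # 2 GiB, for example
    maxheap 2147483648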
I found a read/write error in the configuration (INI) file. Please check all the files and directories specified in the INI.
all:
Here is my server memory info from 'free -m':
             total       used       free     shared    buffers     cached
Mem:         64433      49259      15174          0          3         31
-/+ buffers/cache:       49224      15209
Swap:         8197        184       8012
My redis-server has used 46G of memory, and there is almost 15G of memory left free.
As far as I know, fork is copy-on-write; it should not fail when there is 15G of free memory, which is enough to malloc the necessary kernel structures.
Besides, when redis-server used 42G of memory, bgsave was OK and fork was OK too.
Is there any VM parameter I can tune to make fork succeed?
More specifically, from the Redis FAQ:
Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can't tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.
Setting overcommit_memory to 1 says Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.
Redis doesn't need as much memory as the OS thinks it does to write to disk, so the OS may pre-emptively fail the fork.
Modify /etc/sysctl.conf and add:
vm.overcommit_memory=1
Then reload the sysctl settings with:
On FreeBSD:
sudo /etc/rc.d/sysctl reload
On Linux:
sudo sysctl -p /etc/sysctl.conf
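The same setting can also be applied immediately at runtime, without waiting for the file-based reload (standard sysctl usage):

    sudo sysctl -w vm.overcommit_memory=1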
From proc(5) man pages:
/proc/sys/vm/overcommit_memory
This file contains the kernel virtual memory accounting mode. Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
In mode 0, calls of mmap(2) with MAP_NORESERVE set are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed". Under Linux 2.4 any non-zero value implies mode 1. In mode 2 (available since Linux 2.6), the total virtual address space on the system is limited to (SS + RAM*(r/100)), where SS is the size of the swap space, and RAM is the size of the physical memory, and r is the contents of the file /proc/sys/vm/overcommit_ratio.
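To make the mode 2 formula concrete with the numbers from the question (roughly 64 GB of RAM and 8 GB of swap) and assuming the default overcommit_ratio of 50: the system-wide commit limit would be about 8 + 64*0.5 ≈ 40 GB, which is already less than what a ~46 GB redis-server needs on its own, let alone a fork of it. That is why mode 1 (always overcommit) is the setting usually recommended for Redis.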
Redis's fork-based snapshotting method can effectively double physical memory usage and easily cause OOM in cases like yours. Relying on Linux virtual memory for snapshotting is problematic, because Linux has no visibility into Redis data structures.
Recently, a new Redis-compatible project, Dragonfly, was released. Among other things, it solves the OOM problem entirely. (Disclosure - I am the author of this project.)