I'm trying to limit resources using cgroups. It works fine until I reboot the instance.
I checked and found that the cgroup had been removed for some reason. These are my steps for creating the cgroup:
# Create a cgroup
mkdir /sys/fs/cgroup/memory/my_cgroup
# Add the process to it
echo $PID > /sys/fs/cgroup/memory/my_cgroup/cgroup.procs
# Set the limit to 40MB
echo $((40 * 1024 * 1024)) > /sys/fs/cgroup/memory/my_cgroup/memory.limit_in_bytes
I'm using AMI RHEL-7.5_HVM-20180813-x86_64, kernel version 3.10.0-862.11.6.el7.x86_64.
Could you guys help me out with this problem?
Thanks in advance.
It seems like the cgroup configuration is not persistent across reboots. I'm personally not very familiar with this and can't test, but you can have a look at this: you need to configure cgconfig to persist your changes.
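For example, on RHEL 7 something like the following should recreate the group on every boot. This is a minimal sketch assuming the libcgroup-tools package (which provides the cgconfig service) is installed; the group name and limit mirror the commands above:

# /etc/cgconfig.conf
group my_cgroup {
    memory {
        memory.limit_in_bytes = 41943040;    # 40 MB
    }
}

Then enable the service so the configuration is re-applied at boot:

systemctl enable cgconfig
systemctl start cgconfig

Note that you would still have to re-add the PID after each boot (or use cgred with /etc/cgrules.conf to assign processes automatically).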
OK, so to start, here are all the things I've tried so far:
Set vm.max_map_count in:
The host, in /etc/sysctl.conf
The host, in /etc/sysctl.d/99-sysctl.conf
The LXD container, in /etc/sysctl.conf
The LXD container, in /etc/sysctl.d/99-sysctl.conf
According to the official LXD production setup documentation, this setting can be applied with LXD:
source: https://linuxcontainers.org/lxd/docs/master/production-setup
According to multiple resources online, this is the accepted fix for the error, because the default value is 65530.
I've checked the host, it says this:
cmd: sysctl vm.max_map_count
output: vm.max_map_count = 262144
I've checked the lxd container, it says this:
cmd: sysctl vm.max_map_count
output: vm.max_map_count = 65530
I also verified the configuration file again in the LXD container in /etc/sysctl.conf, and it shows the setting as vm.max_map_count=262144.
I've rebooted the container, I've stopped and restarted the container, and I've even built a new test container. All of them keep saying 65530. What can I do here to close this out?
So I figured out two ways to solve this problem:
Apply the solution above, and then go through an incredibly lengthy and painful process of disabling AppArmor just to change the one setting, then re-enable AppArmor.
Build Elasticsearch on another box, and bypass the entire process.
I took a quick three-minute assessment and figured it wasn't worth the time and frustration of dealing with all the AppArmor pains, so I built it elsewhere.
But to answer the question, in case anyone is willing to eat the time and pain of doing it in LXD: disable AppArmor, apply the vm.max_map_count setting, and then turn AppArmor back on.
As of 5-19-2022, I had good luck simply adding vm.max_map_count = 262144 to /etc/sysctl.conf on the host and rebooting the host.
The host is Ubuntu 22.04, as is the LXD container. The Elasticsearch process came up without an issue.
No need to mess with AppArmor, thankfully!
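For reference, a minimal sketch of that host-side fix (the sysctl.d file name and container name are placeholders; vm.max_map_count is not namespaced, so the container inherits the host's value):

# On the host:
echo 'vm.max_map_count = 262144' | sudo tee /etc/sysctl.d/99-elasticsearch.conf
sudo sysctl --system                 # apply without rebooting
# Restart the container so Elasticsearch re-checks the limit:
lxc restart mycontainer              # placeholder container name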
While trying to compile/build and boot a custom kernel inside VMware Workstation, booting the new kernel fails and drops to a shell with the error "failed to find disk by uuid".
I tried this with both Ubuntu and CentOS.
Things I tried that didn't help:
Checked the UUID mapping in the boot entry and that the UUID exists in the by-uuid directory.
Ran update-initramfs.
Replaced root=uuid=<> with /dev/disk/sda3.
Is this an issue with VMware Workstation? How can it be rectified?
I had a similar fault with my own attempts to bootstrap Fedora 22 onto a blank partition using a CentOS install on another partition. I never did solve it completely, but I did find that the problem was in my initrd rather than the kernel.
The problem is the initrd isn't starting LVM because dracut didn't tell the initrd that it needs LVM. Therefore if you start LVM manually you should be able to boot into your system to fix it.
I believe this is the sequence of commands I ran from the emergency shell to start LVM:
vgscan
vgchange -ay
lvs
this link helped me remember
Followed by exit to resume normal boot.
You might have to mount your LVM /etc/fstab entries manually, I don't recall whether I did or not.
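If the root filesystem is on LVM, here is a sketch of the permanent fix once you are booted (assuming a dracut-based distro such as CentOS/Fedora; the image path and kernel version are the usual defaults):

# Rebuild the initramfs so it includes the LVM dracut module
dracut --force --add lvm /boot/initramfs-$(uname -r).img $(uname -r)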
Try this:
sudo update-grub
Then:
mkinitcpio -p linux
It won't hurt to check your fstab file. There, you should find the UUID of your drive. Make sure you have the proper flags set in the fstab.
Also, there's a setting in grub.cfg that has GRUB use the old style of hexadecimal UUIDs. Check that out as well!
The issue is with the creation of the initramfs. After doing a
make oldconfig
and choosing the defaults for new options, make sure that enough disk space is available for the image to be created.
In my case the image created was not correct, and hence it was failing to mount the image at boot time.
When compared, the new image was quite a bit smaller than the existing image from the lower version, so I added another disk with more than sufficient size, and then
make bzImage
make modules
make modules_install
make install
everything started working like a charm.
I wonder why the image creation completed earlier and resulted in a corrupt (smaller) image without throwing any error, every single time.
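A couple of quick sanity checks along those lines (paths are the usual defaults for a CentOS-style layout; on Ubuntu the images are named initrd.img-* instead):

df -h /boot /                      # confirm there is enough free space before building
ls -lh /boot/initramfs-*.img       # compare the new image size against the previous one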
Elasticsearch won't start using ./bin/elasticsearch.
It raises the following exception:
- ElasticsearchIllegalStateException[Failed to obtain node lock, is the following location writable?: [/home/user1/elasticsearch-1.4.4/data/elasticsearch]
I checked the permissions on the same location and the location has 777 permissions on it and is owned by user1.
ls -al /home/user1/elasticsearch-1.4.4/data/elasticsearch
drwxrwxrwx 3 user1 wheel 4096 Mar 8 13:24 .
drwxrwxrwx 3 user1 wheel 4096 Mar 8 13:00 ..
drwxrwxrwx 52 user1 wheel 4096 Mar 8 13:51 nodes
What is the problem?
I'm trying to run Elasticsearch 1.4.4 on Linux without root access.
I had an orphaned Java process related to Elasticsearch. Killing it solved the lock issue.
ps aux | grep 'java'
kill -9 <PID>
I got this same error message, but things were mounted fine and the permissions were all correctly assigned.
Turns out that I had an 'orphaned' Elasticsearch process that was not being killed by the normal stop command.
I had to kill the process manually, and then restarting Elasticsearch worked again.
The reason is that another instance is already running!
First, find the PID of the running Elasticsearch instance:
ps aux | grep 'elastic'
Then kill it using kill -9 <PID_OF_RUNNING_ELASTIC>.
There were some answers suggesting removing the node.lock file, but that didn't help, since the running instance will just create it again!
In my situation I had the wrong permissions on the Elasticsearch data directory. Setting the correct owner solved it.
# change owner
chown -R elasticsearch:elasticsearch /data/elasticsearch/
# to validate
ls /data/elasticsearch/ -la
# prints
# drwxr-xr-x 2 elasticsearch elasticsearch 4096 Apr 30 14:54 CLUSTER_NAME
After I upgraded the Elasticsearch Docker image from version 5.6.x to 6.3.y, the container would not start anymore because of the aforementioned error:
Failed to obtain node lock
In my case the root cause of the error was missing file permissions.
The data folder used by Elasticsearch was mounted from the host system into the container (declared in the docker-compose.yml):
volumes:
  - /var/docker_folders/common/experimental-upgrade:/usr/share/elasticsearch/data
This folder could not be accessed by Elasticsearch anymore, for reasons I did not understand at all. After I set very permissive file permissions on this folder and all sub-folders, the container started again.
I do not want to reproduce the command that sets those very permissive access rights on the mounted Docker folder, because it is most likely very bad practice and a security issue. I just wanted to share the fact that it might not be a second Elasticsearch process running, but actually just missing access rights on the mounted folder.
Maybe someone could elaborate on the appropriate rights to set for a folder mounted into a Docker container?
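As a rough answer to that question: the official Elasticsearch image runs as uid 1000, so giving that uid ownership of the bind-mounted folder is usually enough and is far less permissive than opening it up to everyone. This is a sketch assuming the default image user and the path from the compose file above:

sudo chown -R 1000:1000 /var/docker_folders/common/experimental-upgrade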
As with many others here replying, this was caused by wrong permissions on the directory (not owned by the elasticsearch user). In our case it was caused by uninstalling Elasticsearch and reinstalling it (via yum, using the official repositories).
As of this moment, the repos do not delete the nodes directory when they are uninstalled, but they do delete the elasticsearch user/group that owns it. So then when Elasticsearch is reinstalled, a new, different elasticsearch user/group is created, leaving the old nodes directory still present, but owned by the old UID/GID. This then conflicts and causes the error.
A recursive chown, as mentioned by @oleksii, is the solution.
You already have ES running. To prove that, type:
curl 'localhost:9200/_cat/indices?v'
If you want to run another instance on the same box you can set node.max_local_storage_nodes in elasticsearch.yml to a value larger than 1.
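For example, in elasticsearch.yml (only do this on a development box where you intentionally want several nodes sharing one data path):

node.max_local_storage_nodes: 2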
Try the following:
1. Find what is listening on port 9200, e.g.: lsof -i:9200
This will show you which processes are using port 9200.
2. Kill the PID(s), e.g. repeat kill -9 <pid> for each PID shown in the lsof output from step 1.
3. Restart Elasticsearch, e.g. elasticsearch
I had another Elasticsearch instance running on the same machine.
Command to check: netstat -nlp | grep 9200 (9200 - Elastic port)
Result: tcp 0 0 :::9210 :::* LISTEN 27462/java
Kill the process with:
kill -9 27462
(27462 is the PID of the Elasticsearch instance.)
Start Elasticsearch again and it should run now.
In my case, this error was caused by the devices used for the configured data directories not having been mounted (with sudo mount).
The error directly shows that Elasticsearch doesn't have permission to obtain the lock, so you need to grant permissions:
chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
In my case the /var/lib/elasticsearch was the dir with missing permissions (CentOS 8):
error: java.io.IOException: failed to obtain lock on /var/lib/elasticsearch/nodes/0
To fix it, use:
chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
To add to the above answers, there are some other scenarios in which you can get this error. In fact, I had done an upgrade from 5.5 to 6.3 for Elasticsearch. I have been using a Docker Compose setup with named volumes for the data directories. I had to do a docker volume prune to remove the stale ones. After doing that, I was no longer facing the issue.
If anyone is seeing this being caused by:
Caused by: java.lang.IllegalStateException: failed to obtain node locks, tried [[/docker/es]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
The solution is to set max_local_storage_nodes in your elasticsearch.yml
node.max_local_storage_nodes: 2
The docs say to set this to a number greater than one on your development machine:
By default, Elasticsearch is configured to prevent more than one node from sharing the same data path. To allow for more than one node (e.g., on your development machine), use the setting node.max_local_storage_nodes and set this to a positive integer larger than one.
I think that Elasticsearch needs to have a second node available so that a new instance can start. This happens to me whenever I try to restart Elasticsearch inside my Docker container. If I relaunch my container then Elasticsearch will start properly the first time without this setting.
Mostly this error occurs when you kill the process abruptly. When you do that, the node.lock file may not be cleared. You can manually remove the node.lock file and start the process again; it should work.
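A sketch of that, assuming the data path from the question and the default node directory layout (the nodes/0 part may differ on your install); only remove the file after confirming nothing is still running:

ps aux | grep elasticsearch        # make sure no instance is still alive
rm /home/user1/elasticsearch-1.4.4/data/elasticsearch/nodes/0/node.lock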
For me the error was a simple one: I created a new data directory /mnt/elkdata and changed the ownership to the elastic user. I then copied the files and forgot to change the ownership afterwards again.
After doing that and restarting the elastic node it worked.
Check these options:
sudo chown 1000:1000 <directory you wish to mount>
# With docker
sudo chown 1000:1000 /data/elasticsearch/
OR
# With VM
sudo chown elasticsearch:elasticsearch /data/elasticsearch/
If you are on Windows, then try this:
Kill any Java processes.
If the start batch script is interrupted in between, press Ctrl+C to properly stop the Elasticsearch service before you exit, rather than just closing the terminal.
I did ulimit -c unlimited (or some number), /proc/sys/kernel/core_pattern is core, and my rootfs and the apps are all debug versions (not the kernel, though).
Any idea why I am unable to get core dumps on kill -SIGABRT/SEGV <pid>?
Thanks,
Furion.
Can you try to create the core using gdb as follows?
$ gdb --pid=1234
(gdb) gcore
Saved corefile core.1234
(gdb) detach
gdb doesn't care about the settings.
If you are wondering what detach is: since you have attached the process to gdb, detach it from gdb's control using the detach command.
Check to see if core dumps are enabled for your kernel:
CONFIG_ELF_CORE=y
Here's some documentation of the configuration item.
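Depending on the distro, one of these should show whether the option is set:

grep CONFIG_ELF_CORE /boot/config-$(uname -r)
zcat /proc/config.gz | grep CONFIG_ELF_CORE     # if the kernel exposes its config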
Ensure the current directory (getcwd()) of the process is writable by the process and contains enough space to hold the core dump file.
Maybe the application in question itself changes the core dump size limit?
I used prctl in the program to explicitly enable core dumps (it sounds like a script was disabling them), and all's good now.
Have a few applications where EC2 small instances are, well, too large. So the announcement of micro instances is just what the doctor ordered.
I'd like to take a small instance's EBS volume, detach it, and pair it up with a micro instance. At some point it might be great to go the other way and upsize a micro instance to a small or beyond.
For this failed experiment I tried:
Creating a new small instance with the Alestic Ubuntu 10.04 32 bit AMI (ami-1234de7b). Boots like a charm.
Power down my freshly minted micro instance, detach the volume that was created for me in the previous step.
Attach the small instance's volume to the micro instance.
Power up.
Nada.
What's odd is there is no console log output until I power down. Then I see it all.
[ 0.000000] Reserving virtual address space above 0xf5800000
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
...
[ 1.221261] VFS: Mounted root (ext3 filesystem) readonly on device 8:1.
[ 1.221261] VFS: Mounted root (ext3 filesystem) readonly on device 8:1.
[ 1.222164] devtmpfs: mounted
[ 1.222202] Freeing unused kernel memory: 216k freed
[ 1.223409] Write protecting the kernel text: 4328k
[ 1.223760] Write protecting the kernel read-only data: 1336k
init: console-setup main process (63) terminated with status 1
init: plymouth main process (45) killed by SEGV signal
init: plymouth-splash main process (196) terminated with status 2
cloud-init running: Thu, 09 Sep 2010 17:37:54 +0000. up 2.61 seconds
mountall: Disconnected from Plymouth
init: hwclock-save main process (291) terminated with status 1
Checking for running unattended-upgrades: * Asking all remaining processes to terminate... [ OK ]
 * All processes ended within 1 seconds.... [ OK ]
 * Deconfiguring network interfaces... [ OK ]
 * Deactivating swap... [ OK ]
 * Unmounting local filesystems... [ OK ]
 * Will now halt
[ 185.599636] System halted.
This method of swapping has worked well between same-sized instances in the past, and this is my first attempt at doing it between different sizes. Is this just not possible, or am I missing something fundamental in my EC2 knowledge?
Even though this will probably be migrated to Server Fault, I ran into the exact same problem with this instance earlier today.
It appears that this image assumes that there will be ephemeral storage present, when there is none on the micro instances. To work around this, comment out the following line in /etc/fstab:
/dev/sda2 /mnt auto defaults,comment=cloudconfig 0 0
This should prevent the instance from hanging on startup, or at least it did for me with ami-1234de7b.
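For example, one way to comment it out in place (this keeps a backup as /etc/fstab.bak and assumes the line appears exactly as shown above):

sudo sed -i.bak 's|^/dev/sda2 /mnt auto|#&|' /etc/fstab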
I created a new micro instance using an Alestic AMI (ami-2c354b7e). I was able to log in to the system normally the first time, but once I rebooted the system, I was not able to log in again.
Commenting out the line indicated above worked for me: "/dev/sda2 /mnt auto defaults,comment=cloudconfig 0 0"
Commenting the line out doesn't fix it fully. If you reboot, it will write the same line back in. You need to:
$ l="deb http://archive.ubuntu.com/ubuntu lucid-proposed main"
$ echo "$l" | sudo tee -a /etc/apt/sources.list
$ sudo apt-get update && sudo apt-get install cloud-init
$ dpkg-query --show cloud-init
I'm assuming this will be fixed in the official Ubuntu release soon and you won't have to do this, but for now...
Source: https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/634102
Also, we have a couple of images based off the official Ubuntu AMIs that work on Micros: http://blog.simpledeployr.com/2010/09/new-ruby-amis-with-latest-ubuntu-lucid.html
I don't see a problem on your side. This could be a problem in Amazon's infrastructure.