Datadog monitoring Disk usage - amazon-ec2

I want to use Datadog to monitor my EC2 instance's disk utilization and create alerts for it. I am using the system.disk.in_use metric, but my root mount point does not appear in the "from" section of the query avg:system.disk.in_use{device:/dev/loop0} by {host}; my root device is /dev/root. I can see every loop device in the list but not the root device. Because of this, the data I get in the monitor differs from the actual server: for example, df -hT shows the root filesystem at 99% on the server, but Datadog shows 60%.
I am not very familiar with Datadog; can someone please help?
I have tried researching this but was not able to resolve the issue.

You can also try using the device label to read in only the root volume, such as:
avg:system.disk.in_use{device_label:/} by {host}
I personally found the system.disk.in_use metric to simply equal the total, so instead I added a formula that calculates the utilization from system.disk.total and system.disk.free, which was more accurate.
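A minimal sketch of such a formula as a monitor query, assuming the device_label:/ tag matches your root volume (this query is an illustration, not part of the original answer):
(avg:system.disk.total{device_label:/} by {host} - avg:system.disk.free{device_label:/} by {host}) / avg:system.disk.total{device_label:/} by {host} * 100
In the monitor editor you would typically add system.disk.total and system.disk.free as queries a and b and apply a formula like (a - b) / a * 100.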

Related

Write to and read from free disk space using Windows API

Is it possible to write to free clusters on disk or read data from them using Windows APIs? I found Defrag API: https://learn.microsoft.com/en-gb/windows/desktop/FileIO/defragmenting-files
FSCTL_GET_VOLUME_BITMAP can be used to obtain the allocation state of each cluster, and FSCTL_MOVE_FILE can be used to move clusters around. But I couldn't find a way of reading data from free clusters or writing data to them.
Update: one workaround that comes to mind is creating a small new file, writing some data to it, relocating it to the desired position and then deleting the file (the data will remain in the freed cluster). But that still doesn't solve the reading problem.
What I'm trying to do is some sort of transparent cache, so the user could still use their NTFS partition as usual and still see these clusters as free space, while I store some data in them. Data safety is not a concern; it can be overwritten by user actions and will just be regenerated / redownloaded later when the clusters become free again.
There is no easy solution here.
First of all, you should create your own partition on the drive. This prevents accidental access to your data by the OS or any other process. Then call CreateFileA() with the name of the partition; you will get raw access to the data. Bear in mind that the function will fail for any partition that is in use by the OS.
You can perform the same trick with a physical drive, too.
See the docs for CreateFileA.
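For illustration only, raw access to such a dedicated partition might look roughly like the sketch below; the drive letter D: and the read size are assumptions, and it must run with administrator rights:
/* Sketch, not production code: open a dedicated partition D: raw and read its first sectors. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* \\.\D: opens the volume itself; writes generally require the volume
       to be unmounted or locked, as noted in the answer above. */
    HANDLE h = CreateFileA("\\\\.\\D:", GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        printf("CreateFileA failed: %lu\n", GetLastError());
        return 1;
    }
    BYTE buf[4096];                 /* read size must be a multiple of the sector size */
    DWORD read = 0;
    if (ReadFile(h, buf, sizeof(buf), &read, NULL))
        printf("read %lu raw bytes from the partition\n", read);
    CloseHandle(h);
    return 0;
}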
One way could be to open the volume directly by using CreateFile with the volume's UNC path as the filename argument (e.g. \\.\C:).
You can then read from and write to the volume directly.
So you may be able to achieve your desired goal with the following steps (a rough C sketch follows after the notes below):
get the cluster size in bytes with GetDiskFreeSpace
get the map of free clusters with DeviceIoControl and FSCTL_GET_VOLUME_BITMAP
open the volume with CreateFile with its UNC path \\.\F:
(take a careful look at the documentation, especially the Remarks section's part about opening drives and volumes)
seek to the offset of a free cluster (clusterIndex * clusterByteSize) by using SetFilePointer
write/read your data with WriteFile/ReadFile on the handle retrieved by the CreateFile call above
(Also note that read/write access has to be sector aligned, otherwise the ReadFile/WriteFile calls fail)
Please note:
this is only meant as a starting point for your own research; it is not a bullet-proof recipe.
Back up your data before messing with the file system!!!
Also keep in mind that the free cluster bitmap will be outdated as soon as you get it (especially if using the system volume).
So I would strongly advise against use of such techniques in production or customer environments.
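To make those steps more concrete, here is a rough, untested C sketch that only reads the first free cluster of a hypothetical F: volume; the volume letter, the fixed 64 KB bitmap buffer and the minimal error handling are illustrative assumptions, and it must be run with administrator rights:
/* Rough sketch, not production code: find the first free cluster of volume F: and read it. */
#include <windows.h>
#include <winioctl.h>
#include <malloc.h>
#include <stdio.h>

int main(void)
{
    DWORD sectorsPerCluster, bytesPerSector, freeClusters, totalClusters;
    if (!GetDiskFreeSpaceA("F:\\", &sectorsPerCluster, &bytesPerSector,
                           &freeClusters, &totalClusters))
        return 1;
    DWORD clusterSize = sectorsPerCluster * bytesPerSector;

    /* Open the volume itself: \\.\F: (no trailing backslash). */
    HANDLE hVol = CreateFileA("\\\\.\\F:", GENERIC_READ,
                              FILE_SHARE_READ | FILE_SHARE_WRITE,
                              NULL, OPEN_EXISTING, 0, NULL);
    if (hVol == INVALID_HANDLE_VALUE)
        return 1;

    /* Fetch (part of) the allocation bitmap, starting at cluster 0.
       ERROR_MORE_DATA just means the bitmap did not fit into our buffer. */
    STARTING_LCN_INPUT_BUFFER in = { 0 };
    static BYTE outBuf[64 * 1024];
    DWORD returned = 0;
    if (!DeviceIoControl(hVol, FSCTL_GET_VOLUME_BITMAP, &in, sizeof(in),
                         outBuf, sizeof(outBuf), &returned, NULL)
        && GetLastError() != ERROR_MORE_DATA) {
        CloseHandle(hVol);
        return 1;
    }
    VOLUME_BITMAP_BUFFER *bmp = (VOLUME_BITMAP_BUFFER *)outBuf;

    /* A zero bit means the cluster is free; scan only the bits we actually received. */
    LONGLONG bits = ((LONGLONG)returned - FIELD_OFFSET(VOLUME_BITMAP_BUFFER, Buffer)) * 8;
    if (bits > bmp->BitmapSize.QuadPart)
        bits = bmp->BitmapSize.QuadPart;
    LONGLONG freeCluster = -1;
    for (LONGLONG i = 0; i < bits; i++) {
        if (!(bmp->Buffer[i / 8] & (1 << (i % 8)))) {
            freeCluster = bmp->StartingLcn.QuadPart + i;
            break;
        }
    }
    if (freeCluster < 0) { CloseHandle(hVol); return 1; }

    /* Volume I/O must be sector aligned, so seek to the cluster and read a whole cluster. */
    LARGE_INTEGER offset;
    offset.QuadPart = freeCluster * (LONGLONG)clusterSize;
    SetFilePointerEx(hVol, offset, NULL, FILE_BEGIN);

    BYTE *data = (BYTE *)_aligned_malloc(clusterSize, bytesPerSector);
    DWORD read = 0;
    if (ReadFile(hVol, data, clusterSize, &read, NULL))
        printf("read %lu bytes from free cluster %lld\n", read, (long long)freeCluster);

    _aligned_free(data);
    CloseHandle(hVol);
    return 0;
}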

Nifi failed to write to FileSystemRepository Stream

I have a flow where I am using the GetFile processor. The input directory is a network mount point. When I test the flow on small files (less than 1 GB), it works well. When I test it on bigger files (more than 1 GB), I get the following error:
GetFile[id=f1a533fd-1959-16d3-9579-64e64fab1ac6] Failed to retrieve
files due to
org.apache.nifi.processor.exception.FlowFileAccessException: Failed to
import data from /path/to/directory for
StandardFlowFileRecord[uuid=f8389032-c6f5-43b9-a0e3-7daab3fa115a,claim=,offset=0,name=490908299598990,size=0]
due to java.io.IOException: Failed to write to FileSystemRepository
Stream [StandardContentClaim
[resourceClaim=StandardResourceClaim[id=1486976827205-28,
container=default, section=28], offset=0, length=45408256]]
Do you have any idea about the origin of this error?
Thank you for your answers.
Based on the comments, the answer in this specific case was found by Andy and confirmed by the asker:
The content repositories were too small in proportion to the file size.
Another thing for future readers to look at is whether the memory of the NiFi node is large enough to hold individual messages.
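For reference (these property names come from a standard nifi.properties; the values shown are typical defaults, not part of the original answer), the content repository location and its archive limits are configured like this, so it is worth checking how much free space the configured directory actually has:
nifi.content.repository.directory.default=./content_repository
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%
nifi.content.repository.archive.enabled=true
A quick df -h on that directory shows whether the backing volume can hold the largest files you expect to ingest.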
Although the answer provided by Dennis is correct in analysing the root cause, it does not include a solution, so let me provide one.
Answer/Solution for Containers
Since you can't specify a size for a Docker volume, we can't use one for this task if you are lacking the required space for your flowfile content.
Instead, I recommend using bind mounts. This way, you could use up to (theoretically) all of your machine's disk.
# Create a folder on the host to hold the NiFi flowfile content
mkdir -p /tmp/data/nifi/nifi-data-content_repository
Then, modify your docker-compose.yml file to change the type of storage used. Specifically, you have to look for the content repository volume:
- type: volume
source: nifi-data-content_repository
target: /opt/nifi/nifi-current/content_repository
And replace it with the bind mount targeting the folder that we've just created above:
- type: bind
source: /tmp/data/nifi/nifi-data-content_repository
target: /opt/nifi/nifi-current/content_repository
That's it: now you can re-deploy your NiFi with this mount, which is able to use your host's disk space.
Read more on bind mounts in the official docs.
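For context, a minimal sketch of how that bind mount might sit inside the service definition; the image tag, port and everything except the volume entry are assumptions, not part of the original answer:
services:
  nifi:
    image: apache/nifi:latest
    ports:
      - "8443:8443"
    volumes:
      # Content repository backed by a host directory instead of a named volume
      - type: bind
        source: /tmp/data/nifi/nifi-data-content_repository
        target: /opt/nifi/nifi-current/content_repository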

Expanding root partition on AWS EC2

I created a public VPC and then added a bunch of nodes to it so that I can use it for a spark cluster. Unfortunately, all of them have a partition setup that looks like the following:
ec2-user@sparkslave1: lsblk
/dev/xvda 100G
/dev/xvda1 5.7G /
I set up a cloud manager on top of these machines, and all of the nodes only have 1G left for HDFS. How do I extend the partition so that it takes up all of the 100G?
I tried creating /dev/xvda2, then created a volume group and added all of /dev/xvda* to it, but /dev/xvda1 doesn't get added as it's mounted. I cannot boot from a live CD in this case since it's on AWS. I also tried resize2fs, but it says that the root partition already takes up all of the available blocks, so it cannot be resized. How do I solve this problem, and how do I avoid it in the future?
Thanks!
I don't think you can just resize the running root volume. This is how you'd go about increasing the root size (an AWS CLI sketch follows the list):
create a snapshot of your current root volume
create a new volume from this snapshot of the size that you want (100G?)
stop the instance
detach the old small volume
attach the new bigger volume
start instance
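A rough sketch of those steps with the AWS CLI; all IDs, the availability zone and the size are placeholders, not real values:
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "root backup"
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --size 100 --availability-zone us-east-1a
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 --instance-id i-0123456789abcdef0 --device /dev/xvda
aws ec2 start-instances --instance-ids i-0123456789abcdef0
After booting you may still need to grow the partition and filesystem on the new volume (see the next answer).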
I had the same problem before, but I can't remember the solution. Did you try to run
e2resize /dev/xvda1
*This is when you're using ext3, which is usually the default. The e2resize command will "grow" the ext3 filesystem to use the remaining free space.
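On more recent AMIs a common equivalent, offered here only as a general sketch (device and partition names are illustrative), is to grow the partition first and then the filesystem, which also addresses the "already takes up all available blocks" message:
# grow partition 1 of /dev/xvda to fill the disk (growpart comes from the cloud-utils / cloud-guest-utils package)
sudo growpart /dev/xvda 1
# then grow the ext3/ext4 filesystem to the new partition size
sudo resize2fs /dev/xvda1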

What is OS Load in Elasticsearch node stat?

In Elasticsearch node stat API when I send a query for OS stat:
curl -XGET "http://esls1.ping-service.com:9200/_nodes/stats/os"
In the response I get a metric load_average:
"load_average": [0,0.04,0.13]
What does it mean?
That is the currently calculated average load of the system, and how this is obtained is specific to the operating system Elasticsearch is installed on.
ES uses Sigar to get this kind of information. The three numbers represent average loads calculated for 1 minute, 5 minutes and 15 minutes intervals.
For linux, for example, Sigar uses /proc/loadavg to get this information from the system. You can find more about this specific calculation in this SO post.
For AIX, Sigar uses the perfstat_cpu_total subroutine, if I'm not mistaken, to get the same information.
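For illustration (the numbers below are sample output, not from the original post), the raw Linux source looks like this:
cat /proc/loadavg
0.00 0.04 0.13 1/123 4567
The first three fields are the 1-, 5- and 15-minute load averages that end up in the load_average array; the remaining fields are the running/total process counts and the most recently assigned PID.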
Sigar has not been used in Elasticsearch since the first beta of 2.0.0: github.com/elastic/elasticsearch/pull/12010 github.com/elastic/elasticsearch/issues/11034
Since then, they have switched to generic OS load metrics, similar to what you see with the top command. See here for an explanation of what this means: https://askubuntu.com/questions/532845/what-is-system-load
Beware: this means that if you run ES in a Docker container, the load shown will actually be that of the host machine, not of the Docker container alone!

PVCS service goes down once the server's physical memory usage becomes high. What's the issue and how do I resolve it?

Our PVCS service goes down once the physical memory usage of the server gets high. Once the server is restarted (not recommended), the service comes back up. Is there any permanent fix for this?
I resolved this issue by increasing the heap size parameters... :-)
1. On the server system, open the following file in a text editor:
Windows as of VM 8.4.6: VM_Install\vm\common\bin\pvcsrunner.bat
Windows prior to VM 8.4.6: VM_Install\vm\common\bin\pvcsstart.bat
UNIX/Linux: VM_Install/vm/common/bin/pvcsstart.sh
2. Find the following line:
set JAVA_OPTS=
And set the value of the following parameters as needed:
-Xms<value>m -Xmx<value>m
3. If you are running a VM release prior to 8.4.3, make sure -Dpvcs.mx= is followed by the same value shown after -Xmx.
4. Save the file and restart the server.
The following is a rule of thumb when increasing the values for -Xmx:
•256m -> 512m
•512m -> 1024m
•1024m -> 1280m
As Riant points out above, adjusting the HEAP size is your best course of action here. I actually supported PVCS for nine years until this time in 2014 when I jumped ship. Riant's numbers are exactly what I would recommend.
I would actually counsel a lot of customers to set -Xms and -Xmx to the same value (basically start it at 1024) because if your PDBs and/or your user community are large you're going to hit the ceiling quicker than you might realize.
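For example (illustrative values following the advice above, not an official setting), the JAVA_OPTS line would then look something like:
set JAVA_OPTS=-Xms1024m -Xmx1024m
On UNIX/Linux the corresponding line in pvcsstart.sh would typically use the shell's own assignment syntax rather than set.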
