How to get the Uptime Percentage from Uptime check - google-cloud-stackdriver

I want to fetch the Uptime percentage from the above image to a third party application I am making. I have gone through various documentation but couldn't find anything!
Can anyone guide me through?

It's currently not possible to generate such a metric/graph showing the uptime percentage of the VMs.
You may try to extract that information using the Cloud Trace API.
Another approach is to use an MQL query to create a custom metric showing total uptime aggregated by hours/days etc.:
fetch gce_instance
| metric 'compute.googleapis.com/instance/uptime'
| filter (metric.instance_name == 'instance-1')
| align delta(1d)
| every 1d
| group_by [], [value_uptime_mean: mean(value.uptime)]
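For a third-party application, once you have per-day uptime totals (in seconds) from a query like the one above, converting them to a percentage is simple arithmetic. A minimal Python sketch — the function name and the sample values are assumptions for illustration:

```python
SECONDS_PER_DAY = 24 * 3600

def uptime_percentage(uptime_seconds, window_seconds=SECONDS_PER_DAY):
    """Convert an uptime total (seconds) within a window to a percentage."""
    return 100.0 * uptime_seconds / window_seconds

# Hypothetical per-day uptime totals (seconds), e.g. from the MQL query above.
daily_uptime = [86400, 82800, 86400]
percentages = [uptime_percentage(s) for s in daily_uptime]
```

A day with 82800 seconds of uptime comes out at roughly 95.8%.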
Additionally you may find those answers helpful:
How to generate uptime reports through Google Cloud Stackdriver?
How to get uptime total and percentage of GCP compute vm instance through MQL?
Get google cloud uptime history to a third party application

How do I run these queries in Prometheus?

I'm trying to set up some graphs using Prometheus with Grafana in a home lab, on a single-node Kubernetes cluster deployed using minikube. I also have some stress tests to run on the cluster. I want to measure the results of the stress tests using Prometheus, so I need help with the following queries:
CPU usage of the node/cluster and of an individual pod by given name, over a period of time (i.e. 5 min).
Memory usage of the node/cluster and of an individual pod by given name, over a period of time (i.e. 5 min).
Disk or file system usage of the node/cluster and of an individual pod by given name, over a period of time (i.e. 5 min).
Latency of an individual pod by given name, over a period of time (i.e. 5 min).
If anyone can help with that, knows a Grafana dashboard for it (I've already tried 737 and 6417), or can give a hint on which metrics I need to query, I'd appreciate it (I've tried rate(container_cpu_usage_seconds_total[5m]), which gives me some sort of result for the CPU usage query for the whole node).
You can use Prometheus's labels to get metrics for a specific pod:
CPU (you don't have to provide all labels; you can select only one if it's unique):
sum(rate(container_cpu_usage_seconds_total{pod=~"<your_pod_name>", container=~"<your_container_name>", kubernetes_io_hostname=~"<your_node_name>"}[5m])) by (pod,kubernetes_io_hostname)
Memory:
sum(container_memory_working_set_bytes{pod=~"<your_pod_name>", container=~"<your_container_name>", kubernetes_io_hostname=~"<your_node_name>"}) by (pod,kubernetes_io_hostname)
Disk:
kubelet_volume_stats_used_bytes{kubernetes_io_hostname=~"<your_node_name>$", persistentvolumeclaim=~".*<your_pod_name>"}
Latency:
You can collect it in your application (web server) itself via a Prometheus client library (application level).
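If you need these numbers in code rather than in Grafana, Prometheus also exposes an HTTP API at /api/v1/query. A minimal Python sketch, assuming Prometheus is reachable at localhost:9090; the pod_cpu_query helper is a hypothetical wrapper around a simplified form of the CPU query above:

```python
import json
import urllib.parse
import urllib.request

def pod_cpu_query(pod, window="5m"):
    """Build a per-pod CPU usage query (hypothetical helper)."""
    return (
        f'sum(rate(container_cpu_usage_seconds_total{{pod=~"{pod}"}}[{window}]))'
        " by (pod)"
    )

def query_prometheus(promql, base_url="http://localhost:9090"):
    """Run an instant query against Prometheus's HTTP API (/api/v1/query)."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["result"]

# Example (requires a reachable Prometheus):
# result = query_prometheus(pod_cpu_query("my-pod"))
```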

prometheus query for continuous uptime

I'm a prometheus newbie and have been trying to figure out the right query to get the last continuous uptime for my service.
For example, if the present time is 0:01:20, my service was up at 0:00:00, went down at 0:01:01, and came back up at 0:01:10, I'd like to see an uptime of "10 seconds".
I'm mainly looking at the "up{}" metric, possibly combined with functions (changes(), rate(), etc.), but no luck so far. I don't see any other Prometheus metric similar to "up" either.
The problem is that you need something which tells you when your service was actually up, as opposed to whether the node was up :)
We use the following (I hope one of them, or the general idea of each, will help):
1. When we look at a host we use node_time{...} - node_boot_time{...}
2. When we look at a specific process / container (Docker via cAdvisor in our case) we use max(node_time{...} - on(instance) group_right container_start_time_seconds{name=~"..."}) by (name, instance)
The following PromQL query can be used for calculating the application uptime in seconds:
time() - process_start_time_seconds
This query works for all applications written in Go that use either the github.com/prometheus/client_golang or github.com/VictoriaMetrics/metrics client library, both of which expose the process_start_time_seconds metric by default. This metric contains the Unix timestamp of the application start time.
Kubernetes exposes the container_start_time_seconds metric for each started container by default. So the following query can be used for tracking uptimes for containers in Kubernetes:
time() - container_start_time_seconds{container!~"POD|"}
The container!~"POD|" filter is needed in order to filter aux time series:
Time series with container="POD" label reflect e.g. pause containers - see this answer for details.
Time series without container label correspond to e.g. cgroups hierarchy. See this answer for details.
If you need to calculate the overall per-target uptime over a given time range, it is possible to estimate it with the up metric. Prometheus automatically generates an up metric for each scrape target, setting it to 1 on each successful scrape and to 0 otherwise. See these docs for details. So the following query can be used for estimating the total uptime in seconds per scrape target during the last 24 hours:
avg_over_time(up[24h]) * (24*3600)
See avg_over_time docs for details.
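For reference, both calculations — the last continuous uptime the question asks for, and the avg_over_time-style total — can be mirrored in plain Python over raw up samples, which can help sanity-check the PromQL. A sketch assuming a fixed scrape interval; the function names are illustrative:

```python
def last_continuous_uptime(samples, scrape_interval):
    """Length (seconds) of the trailing run of up==1 samples.

    `samples` is a list of 0/1 values of the `up` metric, oldest first,
    scraped every `scrape_interval` seconds.
    """
    run = 0
    for value in reversed(samples):
        if value != 1:
            break
        run += 1
    return run * scrape_interval

def total_uptime(samples, window_seconds):
    """Mirror avg_over_time(up[...]) * window: mean of the samples times the window."""
    return sum(samples) / len(samples) * window_seconds

# Up at start, down once, back up for the last two scrapes (10s apart):
samples = [1, 1, 1, 0, 1, 1]
```

With a 10-second scrape interval, the trailing run of two 1s gives 20 seconds of continuous uptime, and 5 of 6 samples up over a 60-second window gives 50 seconds total.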

Questions Nagios Monitoring

I am a beginner for Nagios and have read the documentation here.
However, I still have some questions and appreciate if you can please help me:
1) What are the different metrics that Nagios captures? I know it captures CPU, network, disk metrics etc., but I am looking for more detailed information, like CPU idle time, CPU busy time etc.
2) If say for CPU, Nagios captures 5 metrics, where can I get the meaning of each metric captured by Nagios?
3) Can I export the metrics captured by Nagios in a CSV file or to an external database?
4) Can we collect custom metrics?
5) How are these metrics captured by Nagios, i.e. what is the mechanism by which Nagios works?
Any help is appreciated. Thanks!
I feel that your questions are somewhat generic. Nagios captures snapshots of data through frequent sampling. You can always access the Nagios logs and build your own set of detailed information if you are looking to mine historical data like busy time and idle time. The Logstash/Elasticsearch/Kibana stack springs to mind.
As far as your third question is concerned, yes, it is possible to get CSV output for availability. In the nagios/cgi-bin/avail.cgi GET request, pass &csvoutput as a parameter and you will get CSV-format availability reports for all hosts.
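A small Python sketch of building such a request URL — the host name and the host/timeperiod parameter values here are assumptions; check your own avail.cgi for the exact parameters it accepts:

```python
import urllib.parse

# Hypothetical Nagios base URL; adjust to your installation.
BASE = "http://nagios.example.com/nagios/cgi-bin/avail.cgi"

def availability_csv_url(host="all", timeperiod="last7days"):
    """Build the avail.cgi request URL with csvoutput set, as described above."""
    params = {"host": host, "timeperiod": timeperiod, "csvoutput": ""}
    return BASE + "?" + urllib.parse.urlencode(params)
```

Fetching that URL (with your site's authentication) should return the availability report as CSV rather than HTML.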

How to get the max count of requests over time in Splunk

Hi, I'm developing a Rails web application with the Solr search engine inside. The path to get search results is '/search/results'.
Users make many requests when searching for something, and I need to get the max count of search requests per time interval over all time (to check whether I need to do some optimization, increase RAM, etc.). I know that there are peak times, when the load is critical and search works slowly.
I use the Splunk service to collect app logs, and it's possible to get this request count from the logs, but I don't know how to write the correct Splunk query to get the data I need.
So, how can I get the max number of requests per hour to the '/search/results' path for a date range?
Thanks kindly!
If you can post your example data and/or your sample search, it's much easier to figure out. I'll just post a few examples that I think might lead you in the right direction.
Let's say the '/search/results' is in a field called "uri_path".
earliest=-2w latest=-1w sourcetype=app_logs uri_path="/search/results"
| stats count(uri_path) by date_hour
would give you a count per hour over the last week.
earliest=-2w latest=-1w sourcetype=app_logs uri_path=*
| stats count by uri_path, date_hour
would split the table (you can think 'group by') by the different uri_paths.
You can use the time-range picker on the right side of the search bar to use a GUI to select your time if you don't want to use the time range abbreviations, (w=week, mon=month, m=minute, and so on).
After that, all you need to do is | pipe to the stats command where you can count by date_hour (which is an automatically generated field).
NOTE:
If you don't have the uri_path field already extracted, you can do it really easily with the rex command.
... | rex "matching stuff before uri path (?<uri_path>\/\w+\/\w+) stuff after"
| search uri_path="/search/results"
| stats count(uri_path) by date_hour
In case you want to learn more:
Stats Functions (in Splunk)
Field Extractor - for permanent extractions
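Outside Splunk, the same rex-then-count logic can be sketched in plain Python, which may help validate the extraction before committing to an SPL query. The log line format and the regex below are assumptions for illustration:

```python
import re
from collections import Counter

# Hypothetical access-log lines; the timestamp/path layout is an assumption.
LOG_LINES = [
    "2024-05-01T10:03:12 GET /search/results?q=a 200",
    "2024-05-01T10:45:01 GET /search/results?q=b 200",
    "2024-05-01T11:02:55 GET /home 200",
    "2024-05-01T11:10:42 GET /search/results?q=c 200",
]

# Rough equivalent of `rex` extracting the hour and a two-segment uri_path.
PATTERN = re.compile(r"\S+T(?P<hour>\d{2}):\d{2}:\d{2} \w+ (?P<uri_path>/\w+/\w+)")

def count_per_hour(lines, path="/search/results"):
    """Mirror `search uri_path=... | stats count by date_hour`."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m and m.group("uri_path") == path:
            counts[m.group("hour")] += 1
    return counts

peak = max(count_per_hour(LOG_LINES).values())  # max hourly request count
```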

does testing a website through JMeter actually overload the main server

I am using JMeter to test my web server https://buyandbrag.in .
I have tested it with 100 users, but the main server doesn't show whether it is under load or not.
I want to know whether the test really puts pressure on the main server (a cloud server I am using), or just uses the resources of the client machine where the tool is installed.
Yes, as mentioned, you should be monitoring both servers to see how they handle the load. The simplest way to do this is with top (if your server OS is *nix). You should also be watching the network activity, i.e. bandwidth and connection status (TIME_WAIT, CLOSE_WAIT and so on).
Also, if you're using Apache, keep an eye on the logs; you should see the requests being logged there.
Good luck with the tests.
I want to know "how many users my website can handele ?",when I tested with 50 threads ,the cpu usage of my server increased but not the connections log(It showed just 2 connections).also the bandwidth usage is not that much
Firstly, what connections are you referring to? Apache, DB etc.?
Secondly, if you want to see how many users your current setup can handle, you need to create a profile or traffic model of what an average user will do on your site.
For example:
Say 90% of the time they will search for something
5% of the time they will purchase x
5% of the time they login.
Once you have your "Traffic Model" defined, implement it in JMeter, then start increasing your load in increments, i.e. run your load test for 10 mins with x users, and after 10 mins increment that number, and so on until you find your breaking point.
If you graph your responses you should see two main things:
1) The optimum response time / number of users before the service degrades
2) The tipping point, i.e. at what point you start returning 503s etc.
Now you'll have enough data to scale your site or to start making performance improvements from a code point of view.
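When analysing those runs, JMeter's CSV results file (.jtl) can be post-processed to spot the tipping point. A Python sketch, assuming JMeter's default CSV column names; the sample rows are invented:

```python
import csv
import io

# Invented sample rows using JMeter's default CSV (.jtl) column layout.
JTL = """timeStamp,elapsed,label,responseCode,success
1700000000000,120,Search,200,true
1700000001000,150,Search,200,true
1700000002000,900,Search,503,false
1700000003000,130,Login,200,true
"""

def error_rate(jtl_text):
    """Return (fraction of failed samples, count of HTTP 503 responses)."""
    rows = list(csv.DictReader(io.StringIO(jtl_text)))
    failures = sum(1 for r in rows if r["success"] == "false")
    http_503 = sum(1 for r in rows if r["responseCode"] == "503")
    return failures / len(rows), http_503

rate, n503 = error_rate(JTL)
print(rate, n503)  # 0.25 1
```

Re-running this per load increment shows at which user count the 503s start appearing.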
