Questions Nagios Monitoring - performance

I am a beginner with Nagios and have read the documentation here.
However, I still have some questions and would appreciate your help:
1) What are the different metrics that Nagios captures? I know it captures CPU, network, disk metrics, etc., but I am looking for more detailed information such as CPU idle time, CPU busy time, and so on.
2) If say for CPU, Nagios captures 5 metrics, where can I get the meaning of each metric captured by Nagios?
3) Can I export the metrics captured by Nagios in a CSV file or to an external database?
4) Can we collect custom metrics?
5) How are these metrics captured by Nagios, i.e. what is the mechanism by which Nagios works?
Any help is appreciated. Thanks!

I feel that your questions are somewhat generic. Nagios captures snapshots of data by frequent sampling. If you want to mine historical data such as busy time and idle time, you can always access the Nagios logs and build your own set of detailed information from them. The Logstash, Elasticsearch and Kibana stack springs to mind for that.
As far as your third question is concerned: yes, it is possible to get CSV output for availability. Pass &csvoutput as a parameter in the GET request to nagios/cgi-bin/avail.cgi and you will get the availability report for all hosts in CSV format.
Please check this link for more info :
Link
and also:
Link 2
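As a rough illustration of the CSV export described above, here is a minimal Python sketch that only builds the avail.cgi URL. The server address nagios.example.com is hypothetical, and the host=all and timeperiod=last7days parameters are assumptions that should be checked against your own Nagios installation; only the csvoutput parameter comes from the answer itself.

```python
from urllib.parse import urlencode


def availability_csv_url(base_url, host="all", timeperiod="last7days"):
    """Build an avail.cgi URL that asks for the availability report as CSV.

    base_url is the root of the Nagios web interface (hypothetical here).
    host and timeperiod are assumed parameter names; verify them locally.
    """
    params = {
        "host": host,
        "timeperiod": timeperiod,
        "csvoutput": "",  # the parameter named in the answer; its presence requests CSV
    }
    return f"{base_url.rstrip('/')}/cgi-bin/avail.cgi?{urlencode(params)}"


url = availability_csv_url("http://nagios.example.com/nagios")
print(url)
```

The resulting URL could then be fetched with any HTTP client (for example urllib.request.urlopen), normally with the same credentials you use for the Nagios web interface.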

Related

Long duration soak tests in jmeter

JMeter tests are run in master-slave fashion with around 8 slave machines. However, with the remote batching mode set to MODE_STRIPPED_BATCH, I am not able to run tests for more than 64 hours. Throughput is around 450 requests per minute, and on each slave machine this results in JTL files of around 1.5 GB. All 8 slaves send this to the master (1.5 GB x 8), and the I/O probably gets too much for the master to handle. The master machine has 16 GB of RAM and around 250 GB of disk storage. I was wondering whether the JMeter distributed architecture has any provision to make long-running soak tests possible without unexplained stress on the master machine. Obviously I have the option to abandon the master-slave setup and go for 8 independent nodes, but in that case I'll run into complications with serving the data CSV files (which I currently serve from the master using the Simple Table Server plugin) and with aggregating the result files. Any suggestions, please? It would be great to be able to run tests for at least around 4 days (96 hours or so).
I would suggest going for an independent JMeter workers + external data collector setup.
Actually, JMeter's right-out-of-the-box "distributed scaling" abilities are weak, way outdated & overall pretty ridiculous, as are its data collection/aggregation/processing abilities.
This situation actually puzzles me a lot - mind you, rivals are even worse, so there's literally NOTHING in the field (except, perhaps, for some SaaS solutions trying to monetize this gap).
But it is what it is...
So that's about why-s, now to how-s.
If I were you, I would:
Containerize the JMeter worker
Equip each container with a watchdog to quickly restart the worker if things go south locally (or probably even on schedule to refresh it ultimately). Be that an internal one, or external like cloud services have - doesn't matter.
Set up a timeseries database - I recommend InfluxDB, it's an excellent product & it's free in basic version (which is going to be enough for your purposes).
Flow your test results/metrics into that DB - do not collect them locally! You can do it right from your tests with a pretty simple custom listener (the Influx line protocol is ridiculously simple & fast), or you can have an external agent watching the result files as they are written. I just suggest not using the so-called Backend Listener for the job - it's garbage, it won't shape your data right, so you'd have to do additional ops to bring it to order.
If you shape your test result/metrics data properly, you'll get 'em already time-synced into a single set - and the further processing options are amazingly powerful!
My expectation is that you're looking for the StrippedAsynch sampler sender mode.
As per the documentation:
Asynch
samples are temporarily stored in a local queue. A separate worker thread sends the samples. This allows the test thread to continue without waiting for the result to be sent back to the client. However, if samples are being created faster than they can be sent, the queue will eventually fill up, and the sampler thread will block until some samples can be drained from the queue. This mode is useful for smoothing out peaks in sample generation. The queue size can be adjusted by setting the JMeter property asynch.batch.queue.size (default 100) on the server node.
StrippedAsynch
remove responseData from successful samples, and use Async sender to send them.
So on each slave node, add the following line to the user.properties file:
mode=StrippedAsynch
and define asynch.batch.queue.size (per the documentation quoted above, on the server node, i.e. the slave) high enough that it does not slow down JMeter's throughput, yet low enough not to overwhelm the master. I would start with 1000.
Another option is using StrippedDiskStore, but you will have to manually collect the serialized results after test completion (make sure that the slave processes do not shut down, because the results will be deleted when the slave process finishes).
You could use JMeter PerfMon Plugin to monitor memory and network usage on master and slaves.

prometheus query for continuous uptime

I'm a Prometheus newbie and have been trying to figure out the right query to get the last continuous uptime for my service.
For example, if the present time is 0:01:20 and my service was up at 0:00:00, went down at 0:01:01 and came up again at 0:01:10, I'd like to see an uptime of "10 seconds".
I'm mainly looking at the "up{}" metric, possibly combined with functions (changes(), rate(), etc.), but no luck so far. I don't see any other Prometheus metric similar to "up" either.
The problem is that you need something which tells when your service was actually up vs. whether the node was up :)
We use the following (I hope one will help or the general idea of each):
1. When we look at a host we use node_time{...} - node_boot_time{...}
2. When we look at a specific process / container (docker via cadvisor in our case) we use node_time{...} - on(instance) group_right container_start_time_seconds{name=~"..."}) by(name,instance)
The following PromQL query can be used for calculating the application uptime in seconds:
time() - process_start_time_seconds
This query works for all the applications written in Go, which use either github.com/prometheus/client_golang or github.com/VictoriaMetrics/metrics client libraries, which expose the process_start_time_seconds metric by default. This metric contains unix timestamp for the application start time.
Kubernetes exposes the container_start_time_seconds metric for each started container by default. So the following query can be used for tracking uptimes for containers in Kubernetes:
time() - container_start_time_seconds{container!~"POD|"}
The container!~"POD|" filter is needed in order to filter aux time series:
Time series with container="POD" label reflect e.g. pause containers - see this answer for details.
Time series without container label correspond to e.g. cgroups hierarchy. See this answer for details.
If you need to calculate the overall per-target uptime over a given time range, it is possible to estimate it with the up metric. Prometheus automatically generates the up metric for each scrape target. It sets it to 1 for each successful scrape and to 0 otherwise. See these docs for details. So the following query can be used for estimating the total uptime in seconds for each scrape target during the last 24 hours:
avg_over_time(up[24h]) * (24*3600)
See avg_over_time docs for details.
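The arithmetic behind these queries can be sketched in plain Python. This is only an illustration working on a hand-made list of samples, not something Prometheus runs: estimated_uptime_seconds mirrors avg_over_time(up[24h]) * (24*3600), and last_continuous_uptime reproduces the example from the question (up at 0:00:00, down at 0:01:01, up again at 0:01:10, now 0:01:20). Both function names and the sample data are hypothetical.

```python
def estimated_uptime_seconds(up_samples, window_seconds):
    """Average of 0/1 scrape results times the window length in seconds."""
    if not up_samples:
        return 0.0
    return sum(up_samples) / len(up_samples) * window_seconds


def last_continuous_uptime(samples, now):
    """Seconds since the most recent 0 -> 1 transition, or 0 if currently down.

    samples is a list of (timestamp_seconds, up_value) pairs sorted by timestamp.
    """
    if not samples or samples[-1][1] == 0:
        return 0
    start = samples[0][0]  # up since the first sample unless a later restart is found
    for (_, prev_up), (ts, cur_up) in zip(samples, samples[1:]):
        if prev_up == 0 and cur_up == 1:
            start = ts
    return now - start


# Three of four scrapes succeeded over a 24-hour window.
total = estimated_uptime_seconds([1, 1, 1, 0], 24 * 3600)

# The scenario from the question, with times expressed in seconds.
streak = last_continuous_uptime([(0, 1), (61, 0), (70, 1)], now=80)
print(total, streak)
```

Note that the first function is only an estimate: its resolution is limited by the scrape interval, exactly as with the up-based query.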

Apache NiFi GetMongo Processor

I am new to NiFi. I am using GetMongo to extract documents from MongoDB, but the same result keeps coming again and again, even though the query should return only 2 documents. The query is {"qty":{$gt:10}}
There is a similar question regarding this. Let me quote what I had said there:
"GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection, Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is changing the Run Schedule and/or Scheduling Strategy. You can find them by right clicking on the processor and clicking Configure. By default, Run Schedule will be 0 sec which means running continuously. Changing it to, say, 60 min will make the processor run every one hour. This will still read the same documents from MongoDB again every one hour but since you have mentioned that you just want to run it only once, I'm suggesting this approach."
The question can be found here.

Jmeter how do I get the right number for load?

I have recorded a script against the application that I want to test. Now I am having a hard time deciding how many users the application can handle without any issue, and finding out the maximum number of users. Here is what I have done:
I have run the JMeter script for 10, 50, 100, 150 users
Until 50 users, it runs like a charm. After about 80 users the throughput starts to come down and some samples do not show up in the Aggregate Report.
I see heap memory problems in my console for about 150 users over a period of time. Is it an application problem or a problem with my machine?
Do you have an article where I could read about how to come to a conclusion about THE number?
UPDATE- after increasing the heap size, it is running smoothly for 100 users. I am even more confused now
Thank you
The problem can be anywhere!
Server Performance Metrics collector:
First you need an agent running in the application server to monitor the server performance while you are running the test.
This link will give you an idea about the set up.
JMeter Best Practices:
I think that you are running your test in GUI mode with listeners. Most likely the problem is with your machine/your test. Ensure that you follow this.
Samples not showing in aggregate Report:
You already asked a question on this in SO. Do not select 'Successes' in the Log/display only section of the listener while writing the results to the jtl file; otherwise it will not write the details of failed requests. You might need all the results. Once the jtl is created, you can always filter 'Success' only results as and when you want.

does testing a website through JMeter actually overload the main server

I am using JMeter to test my web server https://buyandbrag.in.
I have tested it with 100 users, but the main server does not show any sign of being under load.
I want to know whether the test is really putting pressure on the main server (a cloud server I am using), or whether it just uses the resources of the client where the tool is installed.
Yes, as mentioned, you should be monitoring both servers to see how they handle the load. The simplest way to do this is with top (if your server OS is *NIX). You should also be watching the network activity, i.e. bandwidth and connection status (TIME_WAIT, CLOSE_WAIT and so on).
Also, if you're using Apache, keep an eye on the logs; you should see the requests being logged there.
Good luck with the tests
I want to know how many users my website can handle. When I tested with 50 threads, the CPU usage of my server increased but the connection log did not (it showed just 2 connections). The bandwidth usage is also not that high.
Firstly, what connections are you referring to? Apache, DB, etc.?
Secondly, if you want to see how many users your current setup can handle, you need to create a profile or traffic model of what an average user will do on your site.
For example:
Say 90% of the time they will search for something
5% of the time they will purchase x
5% of the time they login.
Once you have your "Traffic Model" defined, implement it in JMeter, then start increasing your load in increments, i.e. run your load test for 10 minutes with x users, increase that number after each 10 minutes, and so on until you find your breaking point.
If you graph your responses you should see two main things:
1) The optimum response time / number of users before the service degrades
2) The tipping point, i.e. the point at which you start returning 503s etc.
Now you'll have enough data to scale your site or to start making performance improvements from a code point of view.
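The traffic model and stepped load described above can be sketched in a few lines of Python. This is only a planning aid under stated assumptions, not a JMeter test plan: the 90/5/5 split comes from the example, while the function names, the step sizes and the 1% error-rate threshold used to mark the tipping point are all hypothetical.

```python
import random

# Traffic model from the example above: share of each user action in percent.
TRAFFIC_MODEL = {"search": 90, "purchase": 5, "login": 5}


def pick_action(rng):
    """Pick the next simulated user action according to the traffic model weights."""
    actions = list(TRAFFIC_MODEL)
    weights = list(TRAFFIC_MODEL.values())
    return rng.choices(actions, weights=weights, k=1)[0]


def step_schedule(start_users, step, max_users, step_minutes=10):
    """Return (start_minute, users) pairs for a load that grows in fixed steps."""
    schedule = []
    minute = 0
    users = start_users
    while users <= max_users:
        schedule.append((minute, users))
        minute += step_minutes
        users += step
    return schedule


def tipping_point(results, max_error_rate=0.01):
    """Return the first user count whose error rate exceeds the threshold, else None.

    results is a list of (users, error_rate) pairs measured at each load step.
    """
    for users, error_rate in sorted(results):
        if error_rate > max_error_rate:
            return users
    return None


plan = step_schedule(start_users=10, step=20, max_users=70)
breaking = tipping_point([(10, 0.0), (30, 0.0), (50, 0.002), (70, 0.08)])
print(plan, breaking, pick_action(random.Random(0)))
```

In practice the error rates per step would come from your own test results (for example the share of 503 responses in each 10-minute window), and the same table can be used to read off the last step before response times start to degrade.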
