How to get the max count of requests per hour in Splunk - performance

Hi, I'm developing a Rails web application with a Solr search engine inside. The path to get search results is '/search/results'.
Users make many requests when searching for something, and I need to get the max count of search requests in a time window over all time (to check whether I need to do some optimization, increase RAM, etc.). I know that there are peak times, when load is critical and search works slowly.
I use the Splunk service to collect app logs, and it's possible to get this request count from the logs, but I don't know how to write a correct Splunk query to get the data I need.
So, how can I get the max number of requests per hour to the '/search/results' path for a date range?
Thanks kindly!

If you can post your example data and/or your sample search, it's much easier to figure out. I'll just post a few examples that I think might lead you in the right direction.
Let's say the '/search/results' path is in a field called "uri_path".
earliest=-2w latest=-1w sourcetype=app_logs uri_path="/search/results"
| stats count(uri_path) by date_hour
would give you a count per hour over that one-week window (date_hour is the hour of the day, 0-23, so the same hour from different days is combined).
earliest=-2w latest=-1w sourcetype=app_logs uri_path=*
| stats count by uri_path, date_hour
would split the table (you can think 'group by') by the different uri_paths.
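If you then want the busiest paths first, the grouped results can be sorted and truncated with standard SPL commands - a sketch, with the same assumed sourcetype:
earliest=-2w latest=-1w sourcetype=app_logs uri_path=*
| stats count by uri_path
| sort -count
| head 10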
You can use the time-range picker on the right side of the search bar if you'd rather use a GUI to select your time than the time-range abbreviations (w=week, mon=month, m=minute, and so on).
After that, all you need to do is | pipe to the stats command, where you can count by date_hour (which is an automatically generated field).
NOTE:
If you don't have the uri_path field already extracted, you can do it really easily with the rex command.
... | rex "matching stuff before uri path (?<uri_path>\/\w+\/\w+) stuff after"
| search uri_path="/search/results"
| stats count(uri_path) by date_hour
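To answer the original question directly - the max number of requests in any single hour over a date range - one option is to bucket by real one-hour spans with timechart and then take the maximum. A sketch, assuming the same sourcetype and field name as above:
earliest=-2w latest=-1w sourcetype=app_logs uri_path="/search/results"
| timechart span=1h count
| stats max(count) AS max_hourly_requests
Unlike date_hour, timechart span=1h keeps each calendar hour separate, so the max is taken over every individual hour in the range.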
In case you want to learn more:
Stats Functions (in Splunk)
Field Extractor - for permanent extractions

Related

Prometheus query for continuous uptime

I'm a Prometheus newbie and have been trying to figure out the right query to get the last continuous uptime for my service.
For example, if the present time is 0:01:20, and my service was up at 0:00:00, went down at 0:01:01, and came up again at 0:01:10, I'd like to see an uptime of "10 seconds".
I'm mainly looking at the "up{}" metric, possibly combined with functions (changes(), rate(), etc.), but no luck so far. I don't see any other Prometheus metric similar to "up" either.
The problem is that you need something which tells you when your service was actually up, as opposed to when the node was up :)
We use the following (I hope one of these, or the general idea of each, will help):
1. When we look at a host, we use node_time{...} - node_boot_time{...}
2. When we look at a specific process / container (Docker via cAdvisor in our case), we use node_time{...} - on(instance) group_right container_start_time_seconds{name=~"..."}
The following PromQL query can be used for calculating the application uptime in seconds:
time() - process_start_time_seconds
This query works for all applications written in Go that use either the github.com/prometheus/client_golang or the github.com/VictoriaMetrics/metrics client library; both expose the process_start_time_seconds metric by default. This metric contains the Unix timestamp of the application start time.
Kubernetes exposes the container_start_time_seconds metric for each started container by default. So the following query can be used for tracking uptimes for containers in Kubernetes:
time() - container_start_time_seconds{container!~"POD|"}
The container!~"POD|" filter is needed in order to filter out auxiliary time series:
Time series with the container="POD" label reflect e.g. pause containers - see this answer for details.
Time series without a container label correspond to e.g. the cgroups hierarchy. See this answer for details.
If you need to calculate the overall per-target uptime over a given time range, it is possible to estimate it with the up metric. Prometheus automatically generates the up metric for each scrape target: it is set to 1 on every successful scrape and to 0 otherwise. See these docs for details. So the following query can be used for estimating the total uptime in seconds per scrape target during the last 24 hours:
avg_over_time(up[24h]) * (24*3600)
See avg_over_time docs for details.
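As a quick sanity check on the arithmetic (hypothetical numbers): a target that answered 90% of its scrapes over the last day gives avg_over_time(up[24h]) = 0.9, and 0.9 * 24 * 3600 = 77760 seconds, i.e. roughly 21.6 hours of uptime. The complementary downtime estimate is simply:
(1 - avg_over_time(up[24h])) * (24*3600)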

Apache NiFi GetMongo Processor

I am new to NiFi. I am using GetMongo to extract documents from MongoDB, but the same result keeps coming back again and again, even though the result of the query is only 2 documents. The query is {"qty":{$gt:10}}.
There is a similar question regarding this. Let me quote what I had said there:
"GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection, Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is changing the Run Schedule and/or Scheduling Strategy. You can find them by right clicking on the processor and clicking Configure. By default, Run Schedule will be 0 sec which means running continuously. Changing it to, say, 60 min will make the processor run every one hour. This will still read the same documents from MongoDB again every one hour but since you have mentioned that you just want to run it only once, I'm suggesting this approach."
The question can be found here.

Correct method to calculate "Total wait time" of a session in Oracle

I need to find out the total time a session has spent waiting while it is active.
For this I used a query like the one below...
SELECT (SUM (wait_time + time_waited) / 1000000)
FROM v$active_session_history
WHERE session_id = 614
But I feel I'm not getting what I wanted from this query.
The first time I ran it I got 145.980962, the second time 145.953926, and the third time 127.706429.
Ideally, the time should stay the same or increase. But, as you can see, the value returned is decreasing every time.
Please correct me where I'm going wrong.
v$active_session_history does not contain the whole history; it "forgets" older rows. Think of it as a ring of buffers: once all buffers are written, it restarts from the first buffer.
To get the events of a session, look at v$session_event. To get the current (active) event of an active session, use v$session_wait (in recent Oracle versions you can find this info in v$session as well).
NOTE: the v$session_event view will not show you CPU time (which is not an event, but can be seen in v$active_session_history). You can add it, for example, from v$sesstat if needed...
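A minimal sketch of that per-session aggregate, using the standard v$session_event columns (614 is the SID from the question; note that in this view the column is sid rather than session_id):
SELECT event, total_waits, time_waited_micro / 1e6 AS seconds_waited
FROM v$session_event
WHERE sid = 614
ORDER BY time_waited_micro DESC
Because this view is a running aggregate per session rather than a sample, its totals only ever grow while the session lives.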
Your bloomer is that you have not understood the nature of v$active_session_history: it is a sample, not a log. That is, each record in ASH is a point in time and doesn't refer back to previous records.
Don't worry, it's a common mistake.
This is a particular problem with WAIT_TIME. This is the total time waited for that specific occurrence of that event. So if the wait event stretches across two samples, in the first record WAIT_TIME will be 1 (one second) and in the next sample it will be 2 (two seconds). However, a SUM(WAIT_TIME) would produce a total of 3, which is too much. Of course this is an arithmetic progression, so if the wait event stretches to ten samples (ten seconds), a SUM(WAIT_TIME) would produce a total of 55.
Basically, WAIT_TIME is a flag - if it is 0 the session is ON CPU and if it's greater than zero it is WAITING.
TIME_WAITED is only populated once the event has stopped waiting, so a SUM(TIME_WAITED) won't give an inflated value. In fact, just the opposite: it is only populated for wait events which were in progress at the moment a sample was taken. Lots of waits fall entirely between samples and won't show up in that SUM at all.
This is why ASH is good for highlighting big performance issues and bad for identifying background niggles.
So why doesn't the total time increase each time you run your query? Because ASH is a circular buffer. Older records get aged out to make way for new samples. AWR stores a percentage of the ASH records on disk; they are accessible through DBA_HIST_ACTIVE_SESSION_HIST (the default is one record in ten). So probably ASH purged some samples with high wait times between the second and third times you ran your query. You could check that by including MIN(SAMPLE_TIME) in the select list.
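For example, a sketch with the same filter as the original query, to see how far back the buffer currently reaches:
SELECT MIN(sample_time), SUM(wait_time + time_waited) / 1000000 AS seconds
FROM v$active_session_history
WHERE session_id = 614
If MIN(sample_time) moves forward between runs, older samples have been aged out, which explains the shrinking total.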
Finally, bear in mind that SIDs get reused. The primary key for identifying a session is (SID, SERIAL#), and your query only filters by SID, so it may use data from several different sessions.
There is a useful presentation by Graham Wood, one of the Oracle gurus who worked on ASH, called "Sifting through the ASHes". Although it would be better to hear Graham speaking, the slide deck on its own still provides some useful insights. Find it here.
tl;dr
ASH is a sample not a log. Use it for COUNTs not SUMs.
"Anything wrong in the way query these tables? "
As I said above, but perhaps didn't make clear enough, DBA_HIST_ACTIVE_SESSION_HIST only holds a fraction of the records from ASH. So it is even less meaningful to run SUM() on its columns than on the live ASH.
Whereas V$SESSION_EVENT is an actual log of events. Its wait times are reliable and accurate. That's why you pay the overhead of enabling timed statistics. Having said which, V$SESSION_EVENT only gives us aggregated values per session, so it's not particularly useful in diagnosis.

sorting algorithm issue

I need help with a problem in my server application. The thing is:
I need to count the 'top urls' hitting my web server within e.g. one minute. How do I do that?
By 'top urls' I mean the top 10 or so.
Suppose in one minute I got:
1 request with url 'http://localhost/10.jpg',
2 requests each with urls 'http://localhost/1.jpg' and 'http://localhost/12.jpg',
4 requests each with urls 'http://localhost/2.jpg' and 'http://localhost/3.jpg',
and 10 requests for 'http://localhost/13.jpg'.
Should I add all requests to a table and then, after the given time, sort them, or is there another, simpler way?
Thanks for all the help!
If you are keeping a temporary hit counter for each page, you don't really need to sort. When you want to start tracking, reset all the temporary counters to 0 and initialize a top-ten list of pages. Every time a page is fetched, increment its count, then check the value against the top-ten list. If the count is greater than the next higher count on the list, move it up a rank. A sketch of this idea follows.
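A minimal sketch of the counting side in Python (hypothetical class and method names; instead of maintaining the top-ten list incrementally, this recomputes it on demand with Counter.most_common, which only has to rank the distinct URLs rather than every request):

from collections import Counter

class TopUrls:
    def __init__(self):
        self.counts = Counter()

    def reset(self):
        # Call at the start of each one-minute window.
        self.counts.clear()

    def record(self, url):
        # Called once per request; O(1) per hit, no sorting.
        self.counts[url] += 1

    def top(self, n=10):
        # most_common ranks only the distinct URLs seen in the window.
        return self.counts.most_common(n)

With the request mix from the question, top(3) would return 'http://localhost/13.jpg' (10 hits) first, followed by the two 4-hit URLs.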

Why is Cacti showing an empty graph, even though the rrd file is created?

I have developed my own SNMP service, and I want to plot a graph of a provided OID.
So, I have created a graph in Cacti.
-) It is showing the device as up.
-) It is creating the rrd file (RRDTool says OK).
-) It is showing the graph, but it's empty.
But when I check it, say
rrdtool fetch <rrd file> AVERAGE
it shows me nan for all the values. The monitored OID has the value 47, and I have set min=0 and max=100.
I am using the Cacti appliance by rPath:
http://www.rpath.org/ui/#/appliances?id=http://www.rpath.org/api/products/cacti-appliance
Still, I can't get the value to show on the graph.
Where is the problem? Can anyone please tell me?
First of all, use Cacti's "Rebuild Poller Cache" function under the Utilities menu.
If that didn't work, check whether the RRD file is actually updating with new data.
To do this, use the command:
rrdtool last [filename.rrd]
This will output the last time (as a Unix timestamp) that a new value was inserted into the RRD file, which you can compare to the current time that date +%s will output.
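For example (hypothetical file name), the age of the newest sample in seconds, in one line of shell:
echo $(( $(date +%s) - $(rrdtool last myhost_traffic.rrd) ))
If that number keeps growing past your polling interval, no new data is arriving.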
If it's not updating with data, then you should change the Cacti log level to DEBUG via the settings page in Cacti's web UI and look for relevant messages.
If the poller couldn't get the data, then it's usually an issue relating to connectivity/SNMP.
You can check such issues further by manually polling the specific OID on that host:
snmpwalk -c[SNMP COMMUNITY] -v2c [HOSTNAME OR IP ADDRESS] 1.3.6.1.2.1
You can use the above command and OID (1.3.6.1.2.1) just to see if you're getting a reply.
If that worked, then you should change the command from snmpwalk to snmpget and the OID to the actual OID you're trying to poll, and retry.
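That is (same placeholders as above, with [YOUR OID] being the OID bound to the data source):
snmpget -c[SNMP COMMUNITY] -v2c [HOSTNAME OR IP ADDRESS] [YOUR OID]
A single value coming back here means the host side is fine and the problem is on the Cacti/RRD side.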
If the RRD is updating with new data but you're still getting NaN in your graphs, then I suggest looking into the heartbeat and step values of the data source (via the data template) in relation to your polling interval and poller cronjob interval.
These values determine how many times the RRD file can miss data before a NaN is inserted.
The cronjob calls the Cacti poller to start its polling cycle.
The poller interval is the actual time that the poller will wait between two polling cycles, if it was indeed invoked in time by the cronjob.
So for 1-minute polling (on the poller and the cronjob) you will have to use a step of 60 (seconds) and a heartbeat of 120.
For 5-minute polling, the step will be 300 and the heartbeat will be 600.
A mismatch here is mainly caused by someone changing the poller interval on the settings page; you can verify the values on the RRD file itself, as shown below.
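A sketch for inspecting and fixing those values on an existing file (hypothetical file and data-source names):
rrdtool info myhost_traffic.rrd | grep -E 'step|minimal_heartbeat'
rrdtool tune myhost_traffic.rrd --heartbeat ds_name:600
rrdtool info prints the step and the per-data-source minimal_heartbeat lines; rrdtool tune rewrites the heartbeat in place without touching the stored data.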
Gandalf from the Cacti forums wrote a nice guide that you can use, and further help can be found on the Cacti forums.
Good luck! :)
Maybe Cacti doesn't have the permissions needed to access the rrd file, and your test was done with a user who has the required permissions, for example root?
Are you sure you have collected enough data?
If your RRD has a step of 1 minute, and your first RRA has a consolidation count of 1 (1cdp=1pdp), then you should collect data for at least (step x (count + 1)) seconds - at least 120 seconds in this case - before you expect to see any data in the graph. Make sure you are collecting data at least as often as the step size.
If you collect data for 10 minutes and nothing shows up, then make sure you are actually collecting the data, make sure the values you get are within range and that they are being used. Check the last modification time on the RRD file. Print out the values before you update, to verify they are what you think they are.
You should also double-check the range Cacti is plotting over. I moved the values in the graph filter and spotted a little chunk of data in the graphs; after that you just have to adjust it.
