Why is Cacti showing an empty graph, even though the rrd file is created? - snmp

I have developed my own SNMP service, and i want to plot a graph of an OID provided.
So, i have created a graph in Cacti.
-) It is showing device up.
-) It is creating rrd file. (RRDTool says OK).
-) Showing the graph, but it's empty.
But when I check it, say
rrdtool fetch <rrd file> AVERAGE
it shows me nan for all the values. The monitored OID has value 47 and i have set min=0 and max=100.
I am using Cacti appliance by rpath:
http://www.rpath.org/ui/#/appliances?id=http://www.rpath.org/api/products/cacti-appliance
Still, I can't show value on graph..
Where is the problem? Can anyone please tell me?

First of all, use Cacti's "Rebuild Poller Cache" function under the Utilities menu.
If that didn't work ,check if the RRD file is actually updating with new data.
To do this use the command:
rrdtool last [filename.rrd]
This will output the last time (in unix timestamp) that a new value has been inserted into the RRA file which you can compare to the current time that date +%s will output.
If it's not updating with data then you should change the cacti log level to DEBUG via the settings page on Cacti's web UI and look for appropriate messages.
If the poller couldn't get the data then it's usually an issue relating to connectiviy/SNMP.
You can further check issues as such by manually polling the specific OID on that host:
snmpwalk -c[SNMP COMMUNITY] -v2c [HOSTNAME OR IP ADDRESS] 1.3.6.1.2.1
You can use the above command and OID (1.3.6.1.2.1) just to see if you're getting a reply.
If that worked then you should change the command from snmpwalk to snmpget and the OID to the actual OID you're trying to poll and retry.
If the RRD is updating with new data but you're still getting NaN in your graphs then I suggest looking into the heartbeat and step values of the data source (via the data template) in relation to your polling interval and poller cronjob interval.
These values determine how many times the RRD file will miss data before inserting a NaN.
The cronjob calls the cacti poller to start performing it's polling cycle.
The poller interval is the actual time that the poller will wait between two polling cycles if it was indeed invoked in time by the cronjob.
So for 1 minute polling (on the poller and the cronjob) you will have to use a step of 60 (seconds) and a heartbeat of 120.
For 5 minutes polling, the step will be 300 and the heartbeat will be 600.
This is mainly caused by someone changing the poller interval on the settings page.
Gandalf from the Cacti forums wrote a nice Guide that you can use and further help can be found on Cacti forums.
Good luck! :)

Maybe cacti doesn't have the needed permissions to access the rrd file and your test was done with a user who has the required permissions, for example root?

Are you sure you have collected enough data?
If your RRD has a step of 1 minute, and your first RRA has a consolidated count of 1 (1cdp=1pdp), then you should collect data for at least (step x ( count + 1 )) seconds before you expect to see any data in the graph. Make sure you are collecting data at least as often as the step size.
If you collect data for 10 min and nothing shows up, then make sure you are actually collecting the data, make sure the values you get are within range, and that they are being used. Check the last modification time on the RRD file. Print out the values before you update to verify they are what you think they are.

You should double check the range Cacti is plotting in. I moved the values in the graph filter and spotted a little chunk of data in the graphs, then you just have to adjust it.

Related

Nifi DetectDuplicate Not Detecting Duplicates

I am using the DetectDuplicate processor within a flow but am seeing some confusing behavior. The processor is configured as follows:
Cache Entry Identifier: ${rk.id}
FlowFile Description: Empty string set
Age Off Duration: 10s
Distributed Cache Service: DistributedMapCacheClientService
Cache The Entry Identifier: true
The "duplicate" relationship is automatically terminated. Concurrency is set to 1.
However, I'm seeing multiple copies of flowfiles on the output queue with the same rk.id that were run through the processor less than 2 seconds apart. How is this possible? I even tried increasing the age off to 5m and it made no difference. I also tried setting the processor to only run every 500ms, thinking there may be some delay in writing to the cache, and 2 flowfiles that were processed 1s apart with the same rk.id showed up in the output queue. What am I missing?
I think I figured this out. It looks like the cache was full and not accepting new values? Because we had a lot less traffic this morning and it seems to have properly run the deduplication.

Nifi - Process the files based on count or time elapsed?

I have a following flow,
ListFile ---> FetchFile ---> ? ExecuteScript (maybe) ---> Notify
Basically, I want to go to Notify, if
Total flowfiles (from fetch files) is say 200; OR
Time elapsed (from last signal) is say 3 hours.
I think the 1st condition is easy to achieve. I can have a groovy script which can read number of flowfiles, if 200 go to SUCCESS or else ROLLBACK the session.
But I want to know how to also check the time elapsed for n (number can be less than 200) flowfiles in queue is more than 3 hours or so?
Update
Here is the problem: We have a batch processing (~200 files and can increase based on business in future) currently. We have a NiFi pipeline, i.e. List, Fetch, Basic validation on checksum, etc and process (call the SQL) which is working fine.
As per the business, throughout the day we can have the correction to data so that we can get all or some of the files to "re-process". That is also fine and working.
Now, as per new requirements, we need to build the process after this "batch" is completed. So in the best case, I can have the MergeContent processor with max bin of n and give the signal or notify to my new processor.
However, as explained above, throughout that day we can get few or all files processed again. So now my "n" may not match the new "number" of files re-processed. Hence, even in this case if we have elapsed say 3 hours, then irrespective of "n" not equal to new number of files reprocessed, I should notify the new process to run again.
Hence, I am looking for n files OR m hours elapsed check.
I think this may be an example of an XY problem -- you're trying to solve a problem and believe that counting the number of files fetched or time elapsed will help, but this pattern is usually discouraged in Apache NiFi and there are other solutions to the original problem. I would encourage you to describe more fully the higher level problem you are trying to solve to see if there is a better solution.
I will answer the question though (none of these are ideal solutions).
You can use a MergeContent processor with a minimum bin count of 200
You can use an ExecuteScript processor as you noted
You can write a value (the current timestamp) to a DistributedCacheMapServer when the Notify processor executes, and check that value with a FetchDistributedCacheMap processor against the current timestamp and use a simple Expression Language statement to compare the timestamp values
I think you may also want to read some examples of Wait/Notify logic, because creating thresholds like "200 incoming flowfiles || 3 hours elapsed time" is what the Wait processor does.
"How to wait for all fragments to be processed, then do something?" by Koji Kawamura
"NiFi workflow monitoring – Wait/Notify pattern with split and merge" by Pierre Villard
"Simple NiFi Wait/Notify Example" answer by Abdelkrim Hadjidj

Apache Niffi getMongo Processor

I am new in niffi i am using getMongo to extract document from mongodb but same result is coming again and again but the result of query is only 2 document the query is {"qty":{$gt:10}}
There is a similar question regarding this. Let me quote what I had said there:
"GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection, Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is changing the Run Schedule and/or Scheduling Strategy. You can find them by right clicking on the processor and clicking Configure. By default, Run Schedule will be 0 sec which means running continuously. Changing it to, say, 60 min will make the processor run every one hour. This will still read the same documents from MongoDB again every one hour but since you have mentioned that you just want to run it only once, I'm suggesting this approach."
The question can be found here.

Correct method to calculate "Total wait time" of a session in oracle

I need to find out the total time a session is waiting when its is active.
For this i used the query like below...
SELECT (SUM (wait_time + time_waited) / 1000000)
FROM v$active_session_history
WHERE session_id = 614
But, i feel i'm not getting what i wanted using this query.
Like, first time when i ran this query i got 145.980962, # second time=145.953926and #3rd time i got 127.706429.
Ideally, the time should be same or increase. But, as you see, the value returned is reducing everytime.
Please correct me where i'm doing wrong.
It does not contain whole history, v$active_session_history "forgets" older lines. Think about it as a ring of buffers. Once all buffers are written, it restarts from 1st buffer.
To get events of some session, look v$session_event. To get current (active) event of active session: v$session_wait (In recent Oracle versions, you can find this info also in v$session)
NOTE: v$session_event view will not show you CPU time (which is not event but can be seen in v$active_session_history). You can add it, for example, from v$sesstat if needed...
Your bloomer is that you have not understood the nature of v$active_session_history: it is a sample not a log. That is, each record in ASH is a point in time, and doesn't refer back to previous records.
Don't worry, it's a common mistake.
This is a particular problem with WAIT_TIME. This is the total time waited for that specfic occurence of that event. So if the wait event stretches across two samples, in the first record WAIT_TIME will be 1 (one second) and in the next sample it will be 2 (two seconds). However, a SUM(WAIT_TIME) would produce a total of 3 which is too much. Of course this is an arithmetic proghression so if the wait event stretches to ten samples (ten seconds) a SUM(WAIT_TIME) would produce a total of 55.
Basically, WAIT_TIME is a flag - if it is 0 the session is ON CPU and if it's greater than zero it is WAITING.
TIME_WAITED is only populated when the event has stopped waiting. So a SUM(TIME_WAITED) wouldn't give an inflated value. In fact just the opposite: it will only be populated for wait events which were ongoing at the sample time. So there can be lots of waits which fall between the interstices of the samples which won't show up in that SUM.
This is why ASH is good for highlighting big performance issues and bad for identifying background niggles.
So why doesn't the total time doesn't increase each time you run your query? Because ASH is a circular buffer. Older records get aged out to make way for new samples. AWR stores a percentage of the ASH records on disk; they are accessible through DBA_HIST_ACTIVE_SESSION_HIST (the default is one record in ten). So probably ASH purged some samples with high wait times between the second and third times you ran your queries. You could check that by including MIN(SAMPLE_TIME) in the select list.
Finally, bear in mind that SIDs get reused. The primary key for identifying a session is (SID, Serial#), Your query only grouops by SID, so it may use data from several different sessions.
There is a useful presentation by Graham Woods, on of the Oracle gurus who worked on ASH called "Shifting through the ASHes". Altough if would be better to hear Graham speaking, the slide deck on its own still provides some useful insights. Find it here.
tl;dr
ASH is a sample not a log. Use it for COUNTs not SUMs.
"Anything wrong in the way query these tables? "
As I said above, but perhaps didn't make clear enough, DBA_HIST_ACTIVE_SESSION_HIST only holds a fraction of the records from ASH. So it is even less meaningful to run SUM() on its columns than on the live ASH.
Whereas V$SESSION_EVENT is an actual log of events. Its wait times are reliable and accurate. That's why you pay the overhead of enabling timed statistics. Having said which, V$SESSION_EVENT only gives us aggregated values per session, so it's not particularly useful in diagnosis.

no response from the host :snmpwalk

I have implemented AgentX using mib2c.create-dataset.conf ( with cache enabled)
In my snmd.conf :: agentXTimeout 15
In testtable.h file I have changed cache value as below...
#define testTABLE_TIMEOUT 60
According to my understanding It loads data every 60 second.
Now my issue is if the data in data table is exceeds some amount it takes some amount of time to load it.
As in between If I fired SNMPWALK it gives me “no response from the host” If I use SNMPWALK for whole table and in between testTABLE_TIMEOUT occurs it stops in between and shows following error (no response from the host).
Please tell me how to solve it ? In my table large amount of data is present and changing frequently.
I read some where:
(when the agent receives a request for something in this table and the cache is older than the defined timeout (12s > 10s), then it does re-load the data. This is the expected behaviour.
However the agent does not automatically release the local cache (i.e. call the 'free' routine) as soon as the timeout has expired.
Instead this is handled by a regular "garbage collection" run (once a minute), which will free any stale caches.
In the meantime, a request that tries to use that cache will spot that it's expired, and reload the data.)
Is there any connection between these two ?? I can’t get this... How to resolve my problem ???
Unfortunately, if your data set is very large and it takes a long time to load then you simply need to suffer the slow load and slow response. You can try and load the data on a regular basis using snmp_alarm or something so it's immediately available when a request comes in, but that doesn't really solve the problem either since the request could still come right after the alarm is triggered and the agent will still take a long time to respond.
So... the best thing to do is optimize your load routine as much as possible, and possibly simply increase the timeout that the manager uses. For snmpwalk, for example, you might add -t 30 to the command line arguments and I bet everything will suddenly work just fine.

Resources