Elastalert rule for disk space - elasticsearch

I am monitoring disk space usage using metricbeat.
Now I want alerts via Elastalert depending on the disk size.
Alert when disk space used crosses 50%
Alert when disk space used crosses 70%
Alert when disk space used crosses 80%
Alert when disk space used crosses 95%
Alert when disk space used crosses 100%
Now the catch here is that alerts should be raised only once when usage crosses each threshold (50, 70, 80, 95, 100).
So, if an alert has already been sent for crossing the 50% mark, it should not send alerts for 50.1% / 50.2% / ... / 69.9%.
The next alert should only be raised when usage crosses 70%.
Initial Approach:
if (diskUsedPct == 50 || diskUsedPct == 70 || diskUsedPct == 80 || diskUsedPct == 95 || diskUsedPct == 100)
    alert
I planned to use the "any" rule type to match the disk usage field against these values and alert. But this may generate false alerts too: if usage is stuck at exactly 50.0% (say no new data is written to the DB) for the last hour and rules are evaluated every 10 minutes, it will raise the alert 6 times in that hour. Also, I don't want to use realert as I don't know how long to wait.
Approach v1:
Make n number of rule configs where n is the number of different conditions
use a realert setting that is so long it's effectively "never"
realert:
  weeks: 9999
This approach is not ideal as we need repeated alerts.
Example: when usage drops below 50% and then crosses 50% again, an alert is required.
Approach v2:
A combination of two rules can be used (consider the 50% threshold only):
Rule 1: check disk usage >= 50%, send mail, enable Rule 2 and disable itself using command.
Rule 2: check disk usage < 50%, enable Rule 1 and disable itself using command.
Any better approach?

Created a custom rule. For more details check this post:
Using Elastalert to monitor disk growth
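For anyone who can't open the link: a custom Elastalert rule type is just a small Python class. Below is a rough, untested sketch of the threshold-crossing idea described above (the Metricbeat field names, the host key, and the default threshold list are assumptions, and the linked post's actual rule may look different):

# Hypothetical sketch only: field names and defaults are assumptions, not the rule
# from the linked post. Metricbeat percentages are assumed to be in the 0.0-1.0 range.
from elastalert.ruletypes import RuleType

class DiskThresholdRule(RuleType):
    def __init__(self, rules, args=None):
        super(DiskThresholdRule, self).__init__(rules, args)
        self.thresholds = sorted(rules.get('thresholds', [50, 70, 80, 95, 100]))
        self.last_level = {}  # per host: highest threshold already alerted on

    def add_data(self, data):
        for doc in data:
            host = doc.get('host.name', 'unknown')                 # assumed field name
            used_pct = doc.get('system.filesystem.used.pct', 0) * 100
            # Highest configured threshold at or below current usage (None if below all)
            crossed = max((t for t in self.thresholds if used_pct >= t), default=None)
            previous = self.last_level.get(host)
            if crossed is not None and (previous is None or crossed > previous):
                self.add_match(doc)          # alert once per newly crossed threshold
            self.last_level[host] = crossed  # dropping back below re-arms the threshold

    def get_match_str(self, match):
        return "Disk usage crossed a configured threshold"

The rule YAML would then point its type at this class (under whatever module path you install it) and, optionally, list the thresholds.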

Related

How to optimally use the NiFi Wait processor

I am currently creating a flow where I will be merging the results of 10K HTTP responses. I have a couple of questions (please refer to the image below; I am numbering my questions as per the image).
1) As the queue is becoming too long, is it OK to set "Concurrent Tasks" to 10 for InvokeHTTP? What should drive this? The number of cores on the server?
2) Wait is showing quite a big number. Is this just the number of bytes it is writing, or is it using that much memory? If this is just a write, then I might be OK... but if it is some internal queue, then I may soon run out of memory.
Does it make sense to reduce this number by increasing "Run Schedule" from 0 to, say, 20 sec?
3) What exactly is "Back Pressure Data Size Threshold"? The value is set at 1 GB. Does it mean that if the size of the flow files in the queue exceeds that, NiFi will start dropping them, or will it somehow stop the upstream processor?
1) Yes, increasing concurrent tasks on InvokeHTTP would probably make sense. I wouldn't jump right to 10, but would test increasing from 1 to 2, 2 to 3, etc. until it seems to be working better. Concurrent tasks is the number of threads that can concurrently execute the processor. The total number of threads for your NiFi instance is defined in the controller settings (top-right menu) under Timer Driven threads; you should set the timer driven threads based on the number of CPUs/cores you have.
2) The stats on the processor are totals for the last 5 mins, so "In" is the total size of all the flow files that have come in to the processor in the last 5 mins. You can see "Out" is almost the same # which means almost all the flow files in have also been transferred out.
3) Back-pressure stops the upstream processor from executing until the back pressure threshold is reduced. The data size threshold is saying "when the total size of all flow files in the queue exceeds 1GB, then stop executing the upstream processor so that no more data enters the queue while the downstream processor works on the queue". In the case of a self-loop connection, I think back-pressure won't stop the processor from executing otherwise it will end up in a dead-lock where it can't produce more data but also can't work off the queue. In any case, data is never dropped unless you set flow file expiration on the queue.
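As a loose analogy (plain Python, not NiFi), back-pressure behaves like a bounded queue whose producer blocks when the queue is full, rather than dropping items:

# Rough analogy for back-pressure: a bounded queue pauses the producer instead of
# dropping data. Not NiFi code; it just illustrates the behaviour described above.
import queue
import threading
import time

q = queue.Queue(maxsize=10)          # analogue of the back-pressure threshold

def upstream():
    for i in range(100):
        q.put(i)                     # blocks while the queue is full (back-pressure)

def downstream():
    while True:
        item = q.get()
        time.sleep(0.05)             # slow consumer; the producer waits, nothing is lost
        q.task_done()

threading.Thread(target=downstream, daemon=True).start()
upstream()
q.join()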

Max scrollable time for elasticsearch

What is the max scrollable time that can be set for a scrolling search?
Documentation:
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/api-reference.html#api-scroll
If you're asking this kind of question, you're probably not using Scroll in ES the way it was intended. You want to use scroll when you know for sure that you need to return ALL matching records.
Great use case for Scroll
I want to pull back 1,000,000 records from ES to be written to a CSV file. This is a perfect use case for scroll. You need to return 1M rows, but you don't want to return them all as 1 chunk from the database. Instead you can chunk them into ~1000 record chunks, write the chunk to the CSV file, then get the next chunk. Your scroll keep alive can be set to 1 minute and you'll have no problems.
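A minimal sketch of that pattern with the Python Elasticsearch client (the index name, query, and field names are placeholders):

# Sketch: stream a large result set to CSV in ~1000-document chunks via scroll.
# Index name, query and fields are placeholders; adjust for your own data.
import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

with open("export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "timestamp", "message"])
    # helpers.scan wraps the scroll API; scroll="1m" is the keep-alive per batch
    for hit in helpers.scan(es, index="my-index",
                            query={"query": {"match_all": {}}},
                            size=1000, scroll="1m"):
        src = hit["_source"]
        writer.writerow([hit["_id"], src.get("timestamp"), src.get("message")])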
Bad use case for Scroll
A user is viewing the first 50 records and at some time in the future, they may or may not want to view the next 50 records.
For a use case like this, you want to use the Search After API.
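For contrast, a rough sketch of the search_after pattern for that paging case (index name, sort fields, and client setup are placeholders; search_after needs a deterministic sort with a unique tiebreak field):

# Sketch: fetch the "next 50" with search_after instead of holding a scroll open.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def next_page(after=None, page_size=50):
    body = {
        "size": page_size,
        "query": {"match_all": {}},
        # "serial_id" stands in for any unique field in your mapping, used as a tiebreaker
        "sort": [{"timestamp": "asc"}, {"serial_id": "asc"}],
    }
    if after is not None:
        body["search_after"] = after      # sort values of the last hit of the previous page
    hits = es.search(index="my-index", body=body)["hits"]["hits"]
    last_sort = hits[-1]["sort"] if hits else None
    return hits, last_sort

page1, cursor = next_page()
page2, cursor = next_page(after=cursor)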
There is no one-size-fits-all max scroll time.
Scan & Scroll is meant to scan through a large number of records in chunks. The max value for each chunk has to be found by incremental increases until you hit the breaking point, as it depends on your cluster resources, network latency and cluster load.
We had a 3-node test setup with about 1 billion records and 1 TB of data. I was able to scroll through the entire index with a scroll size of 5000 and a timeout of 5m. However, there were lots of timeouts with those values. From our analysis, we observed that scroll timeouts were heavily dependent on cluster load and network latency. So we finally settled on a size of 3500 and a 4m timeout.
So I would recommend the following:
Incrementally increase the size and timeout values to get the max value for your network.
Once you have the max value, reduce it a notch to allow for failures due to cluster load and latency.

How do I weight my rate by sample size (in Datadog)?

So I have an ongoing metric of events. They are either tagged as success or fail. So I have 3 numbers: failed, completed, total. This is easily illustrated (in Datadog) using a stacked bar graph like so:
So the dark parts are the failures. By looking at the y scale and the dashed red line for scale, this easily tells a human whether the rate is a problem and significant. Which to me means that I have a failure rate in excess of 60%, over at least some time (10 minutes?), and that there are enough events in this period to consider the rate exceptional.
So I am looking for some sort of formula that starts with failures divided by total (giving me a score between 0 and 1) and then weights this somehow by the total, with some threshold that I decide means the total is high enough for me to get an automated alert.
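As a plain-Python illustration of the formula being described (the minimum-event count is a placeholder you would pick yourself, not anything Datadog-specific):

# Sketch: failure ratio that only counts when there is enough volume to matter.
def alert_score(failed, total, min_events=100):
    if total < min_events:
        return 0.0          # low traffic: a high ratio is not evidence of a problem
    return failed / total

# e.g. alert when alert_score(failed, total) > 0.75 over the 15-minute window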
For extra credit, here is the actual Datadog metric that I am trying to get to work:
(sum:event{status:fail}.rollup(sum, 300) / sum:event{}.rollup(sum, 300))
And I am watching for 15 minutes and alerting on a score above 0.75. But I am not sure about sum, count, avg, or rollup. And of course this alert will send me mail during the night, when the total number of events drops low enough that a high failure rate isn't proof of any problem.

How much load can cassandra handle on m1.xlarge instance?

I setup 3 nodes of Cassandra (1.2.10) cluster on 3 instances of EC2 m1.xlarge.
Based on default configuration with several guidelines included, like:
datastax_clustering_ami_2.4
not using EBS; RAID 0 XFS on ephemeral disks instead,
commit logs on separate disk,
RF=3,
6GB heap, 200MB new size (also tested with greater new size/heap values),
enhanced limits.conf.
With 500 writes per second, the cluster works for only a couple of hours. After that time it seems unable to respond because of CPU overload (mainly GC + compactions).
Nodes remain Up, but their load is huge and the logs are full of GC info and messages like:
ERROR [Native-Transport-Requests:186] 2013-12-10 18:38:12,412 ErrorMessage.java (line 210) Unexpected exception during request java.io.IOException: Broken pipe
nodetool shows many dropped mutations on each node:
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 7
BINARY 0
READ 2
MUTATION 4072827
_TRACE 0
REQUEST_RESPONSE 1769
Is 500 wps too much for 3-node cluster of m1.xlarge and I should add nodes? Or is it possible to further tune GC somehow? What load are you able to serve with 3 nodes of m1.xlarge? What are your GC configs?
Cassandra is perfectly able to handle tens of thousands small writes per second on a single node. I just checked on my laptop and got about 29000 writes/second from cassandra-stress on Cassandra 1.2. So 500 writes per second is not really an impressive number even for a single node.
However, beware that there is also a limit on how fast data can be flushed to disk, and you definitely don't want your incoming data rate to be close to the physical capabilities of your HDDs. Therefore 500 writes per second can be too much, if those writes are big enough.
So first: what is the average size of a write? What is your replication factor? Multiply the number of writes by the replication factor and by the average write size, and you'll know approximately what write throughput the cluster requires. But you should keep some safety margin for other I/O-related tasks like compaction. There are various benchmarks on the Internet suggesting a single m1.xlarge instance should be able to write anywhere between 20 MB/s and 100 MB/s...
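As a back-of-the-envelope illustration (the 500 writes/s and RF=3 come from the question above; the 10 KB average write size is an assumed number):

# Rough required write throughput: writes/s * replication factor * average write size.
writes_per_sec = 500
replication_factor = 3
avg_write_size_bytes = 10 * 1024                 # assumed; measure your real payloads

required_bytes_per_sec = writes_per_sec * replication_factor * avg_write_size_bytes
print(required_bytes_per_sec / (1024 * 1024))    # ~14.6 MB/s across the cluster,
                                                 # before compaction and other I/O overhead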
If your cluster has sufficient I/O throughput (e.g. 3x more than needed), yet you observe OOM problems, you should try to:
reduce memtable_total_space_mb (this will cause C* to flush smaller memtables, more often, freeing heap earlier)
lower write_request_timeout to e.g. 2 seconds instead of 10 (if you have big writes, you don't want to keep too many of them in the incoming queues, which reside on the heap)
turn off row_cache (if you ever enabled it)
lower size of the key_cache
consider upgrading to Cassandra 2.0, which moved quite a lot of things off-heap (e.g. bloom filters and index-summaries); this is especially important if you just store lots of data per node
add more HDDs and set multiple data directories, to improve flush performance
set larger new generation size; I usually set it to about 800M for a 6 GB heap, to avoid pressure on the tenured gen.
if you're sure memtable flushing lags behind, make sure sstable compression is enabled - this will reduce the amount of data physically saved to disk, at the cost of additional CPU cycles

How to take latency differences into consideration when verifying location differences with timestamps (anti-cheating)?

When you have a multiplayer game where the server is receiving movement (location) information from the client, you want to verify this information as an anti-cheating measure.
This can be done like this:
maxPlayerSpeed = 300; // = 300 pixels every 1 second
if ((1000 / (getTime() - oldTimestamp) * (newPosX - oldPosX)) > maxPlayerSpeed)
{
    disconnect(player); // this is illegal!
}
This is a simple example, only taking the X coords into consideration. The problem here is that oldTimestamp is stored as soon as the last location update was received by the server. This means that if there was a lag spike at that time, the old update will have reached the server relatively late compared to the new location update, so the measured time difference will not be accurate.
Example:
Client says: I am now at position 5x10
Lag spike: server receives this message at timestamp 500 (it should normally arrive at like 30)
....1 second movement...
Client says: I am now at position 20x15
No lag spike: server receives message at timestamp 1530
The server will now think that the time difference between these two locations is 1030. However, the real time difference is 1500. This could cause the anti-cheating detection to think that 1030 is not long enough, thus kicking the client.
Possible solution: let the client send a timestamp while sending, so that the server can use these timestamps instead
Problem: the problem with that solution is that the player could manipulate the client to send a timestamp that is not legal, so the anti-cheating system won't kick in. This is not a good solution.
It is also possible to simply allow maxPlayerSpeed * 2 speed (for example), however this basically allows speed hacking up to twice as fast as normal. This is not a good solution either.
So: do you have any suggestions on how to fix this "server timestamp & latency" issue in order to make my anti-cheating measures worthwhile?
No no no.. with all due respect this is all wrong, and how NOT to do it.
The remedy is not trusting your clients. Don't make the clients send their positions, make them send their button states! View the button states as requests where the clients say "I'm moving forwards, unless you object". If the client sends a "moving forward" message and can't move forward, the server can ignore that or do whatever it likes to ensure consistency. In that case, the client only fools itself.
As for speed-hacks made possible by packet flooding, keep a packet counter. Eject clients who send more packets within a certain timeframe than the settings allow. Clients should send one packet per tick/frame/world timestep. It's handy to name the packets based on time in whole timestep increments. Excessive packets for the same timestep can then be identified and ignored. Note that sending the same packet several times is a good idea when using UDP, to prevent packet loss.
Again, never trust the client. This can't be emphasized enough.
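A minimal sketch of such a per-client packet counter (the window length and limit are made-up numbers, not tied to any particular engine):

# Sketch: eject clients that send more packets per second than the tick rate allows.
import time
from collections import defaultdict, deque

MAX_PACKETS_PER_SECOND = 40          # made-up limit, e.g. ~2x the expected tick rate
recent_packets = defaultdict(deque)  # client_id -> receive times of recent packets

def on_packet(client_id):
    now = time.monotonic()
    window = recent_packets[client_id]
    window.append(now)
    while window and now - window[0] > 1.0:   # keep only the last second
        window.popleft()
    if len(window) > MAX_PACKETS_PER_SECOND:
        eject(client_id)                      # hypothetical disconnect hook

def eject(client_id):
    print("kicking", client_id)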
Smooth out lag spikes by filtering. Or to put this another way, instead of always comparing their new position to the previous position, compare it to the position of several updates ago. That way any short-term jitter is averaged out. In your example the server could look at the position before the lag spike and see that overall the player is moving at a reasonable speed.
For each player, you could simply hold the last X positions, or you might hold a lot of recent positions plus some older positions (e.g. 2, 3, 5, 10 seconds ago).
Generally you'd be performing interpolation/extrapolation on the server anyway within the normal movement speed bounds to hide the jitter from other players - all you're doing is extending this to your cheat checking mechanism as well. All legitimate speed-ups are going to come after an apparent slow-down, and interpolation helps cover that sort of error up.
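A small sketch of that history-based check (the names and window size are illustrative; in practice you would keep one history per player):

# Sketch: compare the newest position against one from several updates ago,
# so single-packet jitter does not trip the speed check.
from collections import deque

MAX_SPEED = 300.0          # pixels per second, as in the question
HISTORY = 20               # compare against ~20 updates ago (illustrative)

positions = deque(maxlen=HISTORY)   # (server_receive_time_seconds, x) for one player

def check(recv_time, x):
    if len(positions) == positions.maxlen:
        old_time, old_x = positions[0]
        elapsed = recv_time - old_time
        if elapsed > 0 and abs(x - old_x) / elapsed > MAX_SPEED:
            return "cheating"       # still too fast even averaged over the window
    positions.append((recv_time, x))
    return "ok"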
Regardless of opinions on the approach, what you are looking for is the speed threshold that is considered "cheating".
Given a distance and a time increment, you can trivially see if they moved "too far" based on your cheat threshold.
time = thisTime - lastTime;
speed = distance / time;
if (speed > threshold) dudeIsCheating();
The times used for measurement are the server's packet-receipt times. While it seems trivial, it means calculating distance for every character movement, which can end up very expensive. The best route is for the server to calculate position based on velocity, and that is the character's position. The client never communicates a position or absolute velocity; instead, the client sends a "percent of max" velocity.
To clarify:
This was just for the cheating check. With your code, lag or long processing on the server can affect the outcome. The formula should be:
maxPlayerSpeed = 300; // = 300 pixels every 1 second
// receiveNewest() / receiveLast() are the server receive times of the two packets,
// in seconds, to match the pixels-per-second threshold
if (maxPlayerSpeed <
    (distanceTraveled(oldPos, newPos) / (receiveNewest() - receiveLast())))
{
    disconnect(player); // this is illegal!
}
This compares the player's rate of travel against the maximum rate of travel. The timestamps are determined by when you receive the packet, not when you process the data. You can use whichever method you like to determine the updates to send to the clients, but for the threshold check you want for detecting cheating, the above will not be impacted by lag.
Receive packet 1 at second 1: Character at position 1
Receive packet 2 at second 100: Character at position 3000
distance traveled = 2999
time = 99
rate = 30
No cheating occurred.
Receive packet 3 at second 101: Character at position 3301
distance traveled = 301
time = 1
rate = 301
Cheating detected.
What you are calling a "lag spike" is really high latency in packet delivery. But it doesn't matter, since you aren't going by when the data is processed; you go by when each packet was received. If you keep the time calculations independent of your game tick processing (as they should be, since the movement happened during that "tick"), high and low latency only affect how sure the server is of the character's position, which you use interpolation + extrapolation to resolve.
If the client is so far out of sync that it hasn't received any corrections to its position and is wildly out of sync with the server, there is significant packet loss and high latency, which your cheating check will not be able to account for. You need to handle that at a lower layer, in the actual network communications.
For any game data, the ideal method is for all systems except the server to run behind by 100-200ms. Say you have an intended update every 50ms. The client receives the first and second. The client doesn't have any data to display until it receives the second update. Over the next 50 ms, it shows the progression of changes as it has already occurred (ie, it's on a very slight delayed playback). The client sends its button states to the server. The local client also predicts the movement, effects, etc. based on those button presses but only sends the server the "button state" (since there are a finite number of buttons, there are a finite number of bits necessary to represent each state, which allows for a more compact packet format).
The server is the authoritative simulation, determining the actual outcomes. The server sends updates every, say, 50ms to the clients. Rather than interpolating between two known frames, the server instead extrapolates positions, etc. for any missing data. The server knows what the last real position was. When it receives an update, the next packet sent to each of the clients includes the updated information. The client should then receive this information prior to reaching that point in time and the players react to it as it occurs, not seeing any odd jumping around because it never displayed an incorrect position.
It's possible to have the client be authoritative for some things, or to have a client act as the authoritative server. The key is determining how much impact trust in the client is there.
The client should be sending updates regularly, say, every 50 ms. That means that with a 500 ms "lag spike" (delay in packet reception), either all packets sent within the delay period will be delayed by a similar amount, or the packets will be received out of order. The underlying networking should handle these delays gracefully (by discarding packets that have an overly large delay, enforcing in-order packet delivery, etc.). The end result is that with proper packet handling, the issues you anticipate should not occur. Additionally, not receiving explicit character locations from the client, and instead having the server explicitly correct the client and only receive control states from it, would prevent this issue.
