Bigdesk charts explanation - elasticsearch

I don't understand what Search time per second (Δ) means. Is it the delta of the number of milliseconds that search requests took between the previous and the current refresh interval? There are also Query and Fetch times below the chart; I'm not sure what those represent.

A query in Elasticsearch is actually a two-phase process:
Query Phase:
During the initial query phase, the query is broadcast to a shard copy (a primary or replica shard) of every shard in the index. Each shard executes the search locally and builds a priority queue of matching documents.
And
Fetch Phase:
The query phase identifies which documents satisfy the search request, but we still need to retrieve the documents themselves. This is the job of the fetch phase.
And this mail explains the "Search time per second (Δ)" part in detail:
Here is an example for "Search requests per second (Δ)":
- You do some "_search" request
- It hits 15 shards of some indices on that node, so the value of indices -> search -> "query_total" in the nodes stats API response
increases by 15
- Bigdesk refresh value is 5000 (5 sec)
As a result the chart should display a peak of 3 (15/5) on the Query
line. So if the value is ~1500 in your case, it means that on average
~1500 shards are hit by search requests per second (the raw
"query_total" delta per refresh interval is 1500 * refresh). Does it make sense?
You can see the chart is really only informative (it depends on
refresh interval and number of shards). But there is the cumulative
"query_total" value displayed as well in the web UI.
Similarly, the second chart "Search time per second (Δ)" displays the
average time (in milliseconds) spent in the query or fetch phase on the node.
Again, this value includes all involved shards on that node.
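
To make the computation concrete, here is a minimal sketch of how such per-second deltas can be derived from the nodes stats API. This is not Bigdesk's actual code; the localhost URL and refresh value are assumptions, while the field names are those of the nodes stats response:

```python
import time
import requests

NODE_STATS_URL = "http://localhost:9200/_nodes/stats/indices/search"  # assumed local cluster
REFRESH_SECONDS = 5  # Bigdesk's refresh interval from the example above

def search_stats():
    """Sum query/fetch counters and times (ms) over all nodes in the response."""
    stats = requests.get(NODE_STATS_URL).json()
    totals = {"query_total": 0, "query_time_in_millis": 0,
              "fetch_total": 0, "fetch_time_in_millis": 0}
    for node in stats["nodes"].values():
        search = node["indices"]["search"]
        for key in totals:
            totals[key] += search[key]
    return totals

prev = search_stats()
while True:
    time.sleep(REFRESH_SECONDS)
    cur = search_stats()
    # "Search requests per second (Δ)": shard-level queries per second
    qps = (cur["query_total"] - prev["query_total"]) / REFRESH_SECONDS
    # "Search time per second (Δ)": milliseconds spent in each phase per
    # wall-clock second, summed over all involved shards on the node
    q_ms = (cur["query_time_in_millis"] - prev["query_time_in_millis"]) / REFRESH_SECONDS
    f_ms = (cur["fetch_time_in_millis"] - prev["fetch_time_in_millis"]) / REFRESH_SECONDS
    print(f"query: {qps:.1f}/s, query time: {q_ms:.1f} ms/s, fetch time: {f_ms:.1f} ms/s")
    prev = cur
```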

Search time per second (Δ) is based on two series, series1 and series2; they are explained here.
It looks like the chart shows these metrics per unit of time.

Related

Get the last value of a metric in a Datadog dashboard

I'm trying to display the last value of a metric in a QueryValue field in my Datadog dashboard.
For the moment, I'm using
"queries": [
{
"query": "max:blabla.mycount{$env}",
"data_source": "metrics",
"name": "query1",
"aggregator": "last"
}
]
Is this the right way to do that? For this series of mycount, [20, 1, 5, 3, 2], which number will be taken? Is it really the last one in the series (2) or the biggest one in the series (20)?
So there's going to be 3 levels of aggregation to consider: the Time Aggregation and Space Aggregation of your query, and then the aggregation of the query value widget on the frontend (which is what you're asking about). For now, let's understand time aggregation by thinking of a time series widget, and then we'll see what happens with the query value widget after.
Space aggregation is the simplest one. The idea is that you have multiple time series being submitted from multiple applications/servers. If 20 computers send a metric all at the same time, which value should we pick to display? You decide that with the aggregation chunk of your query; yours is currently set to max.
The idea is that you have to decide which out of the dozens or hundreds of instances of your metric is the one you want to display.
If you don't want to worry about space aggregation, you have to make your query specific enough that only one time series exists for that metric. For example a CPU metric will need to be scoped to at least the hostname. For a container metric, hostname isn't enough, you would need at least the container_id. For a database there should be a db_identifier or something that gets you just one result back.
Now for time aggregation, let's look at the docs a bit:
As Datadog stores data at a 1 second granularity, it cannot display all real data on graphs. See How data is aggregated in graphs for more details.
For a graph on a 1-week time window, it would require sending hundreds of thousands of values to your browser—and besides, not all these points could be graphed on a widget occupying a small portion of your screen.
...
The Datadog backend tries to keep the number of intervals to a number below ~300.
https://docs.datadoghq.com/dashboards/guide/query-to-the-graph/#proceed-to-time-aggregation
So for example if you are looking at a 5 minute window, the time aggregation will be as granular as possible. There are 300 seconds in 5 minutes, so every interval on the graph will represent 1 second. If we zoomed out to 10 minutes (600 seconds), we can only show data every 2 seconds. So each bucket will represent 2 data points (assuming the metric is submitted every second).
In most scenarios your metrics are being submitted at a 15 second interval. So you won't notice any time aggregation rollups until 15*300=4500 seconds (a bit over an hour).
You control this with the rollup function, as described in the docs. If you don't want to worry about time aggregation, just make sure your time range is zoomed in enough to not have any bucketing.
And now for the last level of aggregation, the query value widget. You have now obtained a set of up to ~300 points from the backend; space and time aggregation have already been applied. Out of those points, which one do you want to display? You could choose the last point, or a sum of the points, or whatever.
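To make the three levels concrete, here is a toy sketch in plain Python (hypothetical data, not actual Datadog code) of what happens to the datapoints before the query value widget renders a single number:

```python
# Toy illustration of the three aggregation levels (hypothetical data).

# 1. Space aggregation: several hosts report the metric at the same
#    timestamps; "max:" picks the largest value per timestamp.
per_host = {
    "host-a": [20, 1, 5, 3, 2],
    "host-b": [4, 9, 5, 8, 7],
}
space_aggregated = [max(vals) for vals in zip(*per_host.values())]
# -> [20, 9, 5, 8, 7]

# 2. Time aggregation: the backend rolls points up into <= ~300 intervals.
#    With a 2:1 rollup (e.g. a wider time window), pairs of points
#    collapse into one bucket; here the rollup method is also max.
bucket = 2
time_aggregated = [
    max(space_aggregated[i:i + bucket])
    for i in range(0, len(space_aggregated), bucket)
]
# -> [20, 8, 7]

# 3. Widget aggregation: "aggregator": "last" picks the final datapoint
#    of whatever series survived steps 1 and 2.
print(time_aggregated[-1])  # -> 7, not the overall max (20)
```

So for the series [20, 1, 5, 3, 2] from the question, assuming a single time series and a window small enough that no rollup happens, "aggregator": "last" returns 2, not the maximum 20.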
Hopefully that helps!

Elasticsearch - Search within near real time (1 sec)

I came across the following phrase in https://www.elastic.co/guide/en/elasticsearch/reference/6.8/documents-indices.html
When a document is stored, it is indexed and fully searchable in near real-time, within 1 second.
Assuming the 1 sec is subjective and depends on various factors, can we safely assume it is at least 1 sec? Also, I see different time intervals that kick in as part of indexing, like the refresh interval, etc. Is this 1 sec approximately the sum of all those intermediate intervals?
How real-time is it when we say Elasticsearch is a (near) real-time search engine?
The default refresh interval (controlled by the index setting index.refresh_interval) is one second. The sentence you cite means exactly that. By default, a document you index will be available for search within at most one second, but it can be less than that.
If a refresh happens at instant T and you index a document at that same moment, then the underlying segments will be refreshed in pretty much exactly one second and your document will be searchable after that refresh.
If a refresh happens at instant T, and you index your document 500ms after that instant, then it will be available for search just 500ms after being indexed.
That also means your document could be available just a few milliseconds (say 10ms) after being indexed if you index it at instant T+990ms after the last refresh that happened at instant T.
It's not exact science, so that one second should be taken with a grain of salt, sometimes it could last a tad longer, say 10xx ms, where xx depends on various factors. You should not rely on that duration being nano-exact, though.
So near-real time simply means the duration of that refresh interval (which you can modify).
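As a quick illustration of that behaviour, here is a minimal sketch against a local cluster (the URL and index name are assumptions; it talks to the plain HTTP API via Python's requests):

```python
import requests

ES = "http://localhost:9200"
INDEX = "nrt-demo"  # hypothetical index name

# Index a document without waiting for a refresh.
requests.put(f"{ES}/{INDEX}/_doc/1", json={"msg": "hello"})

# Searching immediately may return 0 hits: the document sits in the
# indexing buffer until the next refresh (every 1s by default).
r = requests.get(f"{ES}/{INDEX}/_search", json={"query": {"match_all": {}}})
print(r.json()["hits"]["total"])  # possibly 0

# Forcing a refresh makes the document searchable right away.
requests.post(f"{ES}/{INDEX}/_refresh")
r = requests.get(f"{ES}/{INDEX}/_search", json={"query": {"match_all": {}}})
print(r.json()["hits"]["total"])  # 1 (on 7.x+ this is an object with a "value" field)
```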

How to accommodate minutely and hourly data in the same visualisation?

Current Scenario -
The current dashboard is set to Sum aggregation at the minutely level. My dashboard currently works only when the interval is set to minutely; if I change the interval, the graph shows incorrect values. This happens because more than one document is generated per minute, and the correct value per minute is the sum of the field values at the minutely level.
So even today we are obliged to use the minute interval, but I'm fine with this.
Now the hourly documents are designed to ingest data after doing all the math (and we have validated the ingestion logic), so there is one doc per hour. This is the reason the visualisation is not able to accommodate both types of data.
If I had a scenario with one document per minute and one document per hour, I could have used an average or perhaps a max metric. But at present I have to sum the doc values for each minute (mandatory), so whatever logic applies to the minutely data also gets applied to the hourly data.
Is there a way where I can show both types of data in the same graph?
Mathematically, the approach is wrong.
Having n documents per minute (where n depends on the number of hosts in that cluster) and then one document per hour per type is illogical from a visualisation perspective, because the value actually needed is the sum of all n documents generated per minute, and the sum metric applied at the minutely level was also getting applied to the hourly data. To accommodate both types of data in the same graph, the data needs to be uniform: aggregate it at the minutely level on the producing side and then send the aggregated data to Elasticsearch.
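A minimal sketch of that fix, with hypothetical document shapes: collapse the n per-host documents into a single document per minute before indexing, so both series end up with exactly one document per bucket:

```python
from collections import defaultdict

# Raw documents: several per minute (one per host); hypothetical shape.
raw_docs = [
    {"@timestamp": "2024-01-01T10:00:05", "host": "a", "value": 3},
    {"@timestamp": "2024-01-01T10:00:40", "host": "b", "value": 4},
    {"@timestamp": "2024-01-01T10:01:10", "host": "a", "value": 5},
]

# Sum values per minute so exactly one document exists per bucket,
# matching the shape of the one-document-per-hour series.
per_minute = defaultdict(float)
for doc in raw_docs:
    minute = doc["@timestamp"][:16]  # truncate to YYYY-MM-DDTHH:MM
    per_minute[minute] += doc["value"]

aggregated_docs = [
    {"@timestamp": f"{minute}:00", "value": total}
    for minute, total in sorted(per_minute.items())
]
# With one doc per bucket, an avg/max metric then works at any interval.
print(aggregated_docs)
```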

Elasticsearch - Inconsistent results while using Native Script

ES Version 2.1.2
I am using NativeScript to filter documents by term value. The script returns true/false based on the lookup of the term value from the document. The number of results returned by Elasticsearch grows steadily for the same query until it reaches the actual count. I am not sure if it is because the fielddata is cached progressively. This always happens when ES is restarted. If the script query is replaced with terms query, the results are accurate in the first search. I also notice my custom script gets initialized multiple times and the number varies for every search request. What is wrong here?
I am running ES on a single node with 3 shards and 0 replicas. The native script gets initialized 6 times for the first query. The number of initializations increases to around 18 the second time for the same query and reaches 22 the third time. It stays at 22 for subsequent searches. As the number of native script initializations increases I get more results for the same query, and once it reaches the final count, the actual number of search results is returned. I can't understand this inconsistency in the total count of search results for the same query.
Found the issue. The search timeout was set to 500ms, and during the first query (after bouncing ES), ES tries to load fielddata into memory from all active segments. Since the initial population of fielddata took more time than the timeout threshold, fielddata from some segments was not loaded into memory. During subsequent searches, fielddata gets loaded completely, hence the actual count appears after a few searches.
Increasing the timeout to 5s gave enough time to load fielddata from all segments, thereby returning the actual total count in the first search itself.
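For illustration, the timeout can be raised per search request like this (a sketch with a hypothetical index name; "timeout" and "timed_out" are standard fields of the search request body and response):

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Per-request search timeout raised from 500ms to 5s so the first
# query after a restart has time to load fielddata from all segments.
body = {
    "timeout": "5s",
    "query": {"match_all": {}},  # the actual script query would go here
}
resp = requests.post(f"{ES}/myindex/_search", json=body).json()

# "timed_out": true means some shards gave up early, so the hit count
# can be lower than the real total, which is exactly the symptom seen.
print(resp["timed_out"], resp["hits"]["total"])
```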

How does Elasticsearch handle skip requests (from/size parameter)?

I am deploying an approach which uses the from parameter a lot. I wish to understand how 'skip' works in Elasticsearch, and in other such systems in general, to judge what performance loss it incurs.
It depends on search type. If you use the default, i.e. query then fetch, then to fetch page 20 with size 10 (from: 190, size: 10), elasticsearch will:
ask each primary shard for ids and relevance scores of top 200 documents (which are selected from all docs matching the query, so this means searching the whole index, but this is the same as with fetching only the first page)
merge the results, sorting by relevance, and skip 190 top hits of such merged list, taking those 10 that follow
fetch actual docs (i.e. 10 of them) from relevant shards
It means that if you have e.g. 3 primary shards, then the Elasticsearch nodes need to exchange information about 3 * 200 = 600 docs. There are some optimizations to make obtaining particularly 'distant' pages more efficient, but in a nutshell, you need to process more and more documents each time you fetch the next page.
If your use case involves going through a result set sequentially, consider scrolling.
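A minimal scroll sketch (hypothetical index name, plain HTTP via Python's requests): each batch is served from a cursor the shards keep open, so no growing from + size top list has to be rebuilt and skipped for every page:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Open a scroll: results come back in fixed-size batches.
page = requests.post(
    f"{ES}/myindex/_search?scroll=1m",
    json={"size": 100, "query": {"match_all": {}}},
).json()
scroll_id = page["_scroll_id"]

while page["hits"]["hits"]:
    for hit in page["hits"]["hits"]:
        print(hit["_id"])  # handle the hit
    # Fetch the next batch; the scroll cursor advances server-side.
    page = requests.post(
        f"{ES}/_search/scroll",
        json={"scroll": "1m", "scroll_id": scroll_id},
    ).json()
    scroll_id = page["_scroll_id"]
```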
