Kibana: Show values on Y axis as percentage - elasticsearch

I want to visualize how many correct auto-responses my system sent relative to the percentage of questions it has already learned.
My idea was to filter all my test results where a boolean field didSendCorrectAutoResponse is true, bucket the x-axis on a field called learnPercentage, and simply use a count metric on the y-axis.
The only problem with this is that the y-axis values are absolute counts of the responses sent, but I want them shown as a percentage of the total number of tests per percentage learned.
Here is how I defined my chart:
I can calculate the total number of test cases for each percentage learned with this query: learnPercentage: 100 && strategy.keyword: "sum" (it only counts them at 100% of questions learned, but the number of tests is the same for every percentage).
So what I want on the y-axis is not the plain count but count / totalNumberOfTestCases.
Edit:
To better explain what I need, here is what I do with my system:
Let's say I have 100 known questions my system can learn, and I have 2500 test questions. Now I do the following:
Let my system learn none of the known questions
Ask the 2500 test questions
Save how many questions have been correctly answered (let's say 600)
Save this test result in elastic
Repeat with 10 questions learned:
Let my system learn 10% of the known questions
Ask the 2500 test questions
Save how many questions have been correctly answered (let's say 590)
Save this result in elastic
Repeat with 20 questions learned...
Now I want to plot how many questions have been correctly answered in each learning step:
600 at 0%
590 at 10%
900 at 20%
...
But instead of showing these absolute numbers I want 600/2500 (24%), 590/2500 (23.6%), etc. on the y-axis.

To visualize your Y axis as a percentage when it is not one already, first create a scripted field for the column you care about and then visualize that scripted field in Kibana.
Check the photos; in the scripted field code, the removed part is your column name.
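As a concrete illustration (my own sketch, not taken from the screenshots): assuming the boolean field didSendCorrectAutoResponse from the question, a scripted field such as

    doc['didSendCorrectAutoResponse'].value ? 1 : 0

emits 1 for a correct response and 0 otherwise. Bucketing the X-axis on learnPercentage and using an Average metric on that scripted field then gives exactly count / totalNumberOfTestCases per bucket, which you can format as a percentage.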

Related

Display count for a day using counter metrics in data dog

We have a counter metric in one of our microservices which pushes data to DataDog. I want to display the total count for a given time frame, and also the count per day (the X axis would show the date and the Y axis the count). How do we achieve this?
I tried using sum by and diff with the Query Value representation. It gives the total count for the given time frame, but I would like a bar graph with the date on the X axis and the count on the Y axis. Is this possible in DataDog?
It seems like there are 2 main questions here:
1. display the total count for a given time frame
2. the count per day
I think the rollup method is going to be your friend for both questions.
For #1 you need to pass in the time frame you want a total over: sum:<metric_name>.rollup(sum, <time_frame>) and the single value can be displayed using the Query Value visualization.
For #2, the Datadog docs say you can get per-day metrics when they are
"graphed using a day-long rollup with .rollup(avg,86400)"
So this would look something like sum:<metric_name>.rollup(sum, 86400) and can be displayed as a Timeseries with bars.
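If you also want to pull the same numbers programmatically rather than only in a dashboard widget, a minimal sketch with Datadog's Python client might look like this (the metric name my.metric.count and the one-week window are placeholders of mine):

    # Minimal sketch using the `datadog` Python client (assumptions: placeholder
    # API/app keys, a hypothetical metric name `my.metric.count`, 7-day window).
    import time
    from datadog import initialize, api

    initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

    now = int(time.time())
    week_ago = now - 7 * 86400

    # #1: one total over the whole time frame (Query Value equivalent)
    total = api.Metric.query(start=week_ago, end=now,
                             query="sum:my.metric.count{*}.rollup(sum, 604800)")

    # #2: one point per day (Timeseries-with-bars equivalent)
    per_day = api.Metric.query(start=week_ago, end=now,
                               query="sum:my.metric.count{*}.rollup(sum, 86400)")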

How to create value over time line chart in Kibana 4?

I'm facing the following problem: in Kibana 4 I've created a line chart based on my input from Elasticsearch, but I can only display the average, min, or max of a field per time bucket instead of its actual value, e.g. sent bytes.
Most answers to this question on Stack Overflow are about Kibana 3 (How to create value over time chart with Kibana 3?) and seem to involve a Histogram on the X axis, yet I can't seem to apply them to Kibana 4. I was unable to find the histogram panel, and once I click on the Discover tab it is stuck on a constant "Searching..." loading state.
If I have the following fields in my _source:
{"timestamp":"2015-06-02T10:16:44.0855","time":587,"threadName":"Thread Group 1-957","byte":1372,"status":"false","latence":306,"registerCall":"404"}
and I would like to have the number of bytes on the Y-axis and my timestamp on the X-axis.
Any help in the right direction will be appreciated :)
To create a value over time line chart in Kibana, follow these steps:
Go to the Visualize tab and select Line chart.
For the X-axis, select the X-Axis bucket, choose Date Histogram as the aggregation, and select your timestamp field as the date field.
Next, for the Y-axis, select Sum as the aggregation and then byte as the field.
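Under the hood, that chart is just a date_histogram bucket with a sum sub-aggregation. For reference, here is a rough sketch of the equivalent raw query via the Python Elasticsearch client; the index name "logs", the 1-minute interval, and the older body-style API (matching the Kibana 4 / Elasticsearch 1.x era) are my assumptions, while the timestamp and byte fields come from the sample document above:

    # Rough equivalent of the Kibana line chart as a raw aggregation query.
    # Assumptions: the `elasticsearch` Python client with the older body-style
    # API, an index called "logs", and the timestamp/byte fields shown above.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    resp = es.search(index="logs", body={
        "size": 0,
        "aggs": {
            "per_minute": {
                "date_histogram": {"field": "timestamp", "interval": "1m"},
                "aggs": {"bytes_sent": {"sum": {"field": "byte"}}}
            }
        }
    })
    for bucket in resp["aggregations"]["per_minute"]["buckets"]:
        print(bucket["key_as_string"], bucket["bytes_sent"]["value"])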
For the X axis, what Alcanzar said is good, but as you noticed, the Y axis is problematic.
Sum (suggested by "Limit") works, but since it's aggregated, it shows the total used in each aggregated bucket, which may be meaningless depending on what you are trying to show. Your question isn't clear on what you want, so I'm just guessing here. One hour of requests, each of which ran for one minute and sent 1 megabyte, is indeed 60 megabyte-minutes if you are trying to show total capacity used over that hour (maybe you are paying a bill based on usage per time). On the other hand, if you are trying to show peak usage at each point in time, it would be wrong.
You said you already looked at Max and Min and they don't meet your needs. I don't suppose Standard Deviation would be any better?
I have the same concern. The best I've been able to do so far is to display Min and Max simultaneously on the Y axis. When they diverge, I know I'm zoomed out too far, so I zoom in until they align.
This is how I know I'm seeing individual events.
In any case, I share your frustration. I too would like to be able to show time series as easily as I can in, say, Excel.

Estimating number of results in Google App Engine Query

I'm attempting to estimate the total amount of results for app engine queries that will return large amounts of results.
In order to do this, I assigned a random floating point number between 0 and 1 to every entity. Then I executed the query for which I wanted to estimate the total results with the following 3 settings:
* I ordered by the random numbers that I had assigned in ascending order
* I set the offset to 1000
* I fetched only one entity
I then plugged the entity's random value (assigned for this purpose) into the following equation to estimate the total results (since I used 1000 as the offset above, OFFSET would be 1000 in this case):
1 / RANDOM * OFFSET (i.e., OFFSET / RANDOM)
The idea is that since each entity has a random number assigned to it and I am sorting by that random number, the random value of the entity at a given offset should be roughly proportional to that offset divided by the total number of results, so dividing the offset by that random value should estimate the total.
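If it helps to see the mechanics, here is a small standalone simulation of that estimator (plain Python, no App Engine calls; the 60000 total and the offset of 1000 are just example numbers):

    # Standalone simulation of the estimator described above (my own sketch,
    # not App Engine code): every "entity" gets a Uniform(0, 1) value, we sort
    # by it, read the value at the offset, and estimate the total from it.
    import random

    ACTUAL_TOTAL = 60000
    OFFSET = 1000

    values = sorted(random.random() for _ in range(ACTUAL_TOTAL))
    sample = values[OFFSET]        # the single entity fetched at offset=OFFSET
    estimate = OFFSET / sample     # same as 1 / RANDOM * OFFSET
    print(estimate)                # lands near 60000 on average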
The problem I am having is that the results are giving me low estimates, and the lower the offset, the lower the estimate. I had anticipated that a lower offset would make the estimate less accurate, but I thought the margin of error would fall both above and below the actual number of results.
Below is a chart demonstrating what I am talking about. As you can see, the predictions get more consistent (accurate) as the offset increases from 1000 to 5000, and then they closely follow a fourth-degree polynomial (y = -5E-15x^4 + 7E-10x^3 - 3E-05x^2 + 0.3781x + 51608).
Am I making a mistake here, or does the standard python random number generator not distribute numbers evenly enough for this purpose?
Thanks!
Edit:
It turns out that this problem was due to my own mistake. In another part of the program, I was grabbing entities from the beginning of the series, doing an operation, and then re-assigning the random number. This resulted in a denser distribution of random numbers towards the end.
I did a little more digging into this concept, fixed the problem, and tried it again on a different query (so the number of results differs from above). I found that this idea can indeed be used to estimate the total results for a query. One thing of note is that the "error" is very similar for offsets that are close together. When I made a scatter chart in Excel, I expected the accuracy of the predictions at each offset to form a "cloud": offsets at the very beginning would produce a larger, less dense cloud that would converge to a very tiny, dense cloud around the actual value as the offsets got larger. That is not what happened, as you can see below in the chart of how far off the predictions were at each offset. Where I thought there would be a cloud of dots, there is a line instead.
This is a chart of the maximum error after each offset. For example, the maximum error for any offset after 10000 was less than 1%:
When using GAE it makes a lot more sense not to try to do large amounts of work on reads - it's built and optimized for very fast request turnarounds. In this case it's actually more efficient to maintain a count of your results as and when you create the entities.
If you have a standard query, this is fairly easy - just use a sharded counter when creating the entities. You can seed this using a MapReduce job to get the initial count.
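For reference, a minimal sharded-counter sketch using the legacy ndb client library might look like the following; the model name, shard count, and key scheme are my own arbitrary choices:

    # Minimal sharded counter sketch (assumes the legacy App Engine ndb library;
    # the model name, shard count and key naming are arbitrary choices of mine).
    import random
    from google.appengine.ext import ndb

    NUM_SHARDS = 20

    class ResultCounterShard(ndb.Model):
        count = ndb.IntegerProperty(default=0)

    @ndb.transactional
    def increment():
        # Pick a random shard so concurrent writes don't all hit the same entity.
        shard_id = "shard-%d" % random.randint(0, NUM_SHARDS - 1)
        shard = ResultCounterShard.get_by_id(shard_id)
        if shard is None:
            shard = ResultCounterShard(id=shard_id)
        shard.count += 1
        shard.put()

    def total_count():
        # Sum all shards to get the overall count.
        return sum(s.count for s in ResultCounterShard.query().fetch(NUM_SHARDS))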
If you have queries that might be dynamic, this is more difficult. If you know the range of possible queries that you might perform, you'd want to create a counter for each query that might run.
If the range of possible queries is infinite, you might want to think of aggregating counters or using them in more creative ways.
If you tell us the query you're trying to run, there might be someone who has a better idea.
Some quick thought:
Have you tried the Datastore Statistics API? It may provide fast and accurate results if you don't update your entity set very frequently.
http://code.google.com/appengine/docs/python/datastore/stats.html
[EDIT1.]
I did some math, and I think the estimation method you proposed here can be rephrased as an "order statistic" problem.
http://en.wikipedia.org/wiki/Order_statistic#The_order_statistics_of_the_uniform_distribution
For example:
If the actual number of entities is 60000, the question is equivalent to: what is the probability that your 1000th [2000th, 3000th, ...] sample falls in an interval [l, u] such that the estimated total number of entities based on that sample is within an acceptable error of 60000?
If the acceptable error is 5%, the interval [l, u] will be [0.015873015873015872, 0.017543859649122806]
I think the probability won't be very large.
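For what it's worth, that probability can be computed directly: the k-th order statistic of n Uniform(0, 1) samples follows a Beta(k, n - k + 1) distribution. A quick sketch, using the example numbers above (these are assumptions from this example, not from the question):

    # The offset-th order statistic of n Uniform(0, 1) samples is distributed
    # Beta(k, n - k + 1), so the probability of it landing in [l, u] is just a
    # difference of Beta CDFs. n, k, l and u are the example numbers above.
    from scipy.stats import beta

    n, k = 60000, 1000
    l, u = 1000.0 / 63000, 1000.0 / 57000   # the interval for a 5% error band
    p = beta.cdf(u, k, n - k + 1) - beta.cdf(l, k, n - k + 1)
    print(p)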
This doesn't directly deal with the calculations aspect of your question, but would using the count attribute of a query object work for you? Or have you tried that out and it's not suitable? As per the docs, it's only slightly faster than retrieving all of the data, but on the plus side it would give you the actual number of results.
http://code.google.com/appengine/docs/python/datastore/queryclass.html#Query_count

How does RetailMeNot calculate its success rate trends?

I am developing a rails application where I need a "success rate" system similar to RetailMeNot. I noticed that they use jQuery Sparkline library (http://omnipotent.net/jquery.sparkline/) to generate a success rate trend for each coupon.
For example, in their source code:
<em>84%</em> Success<br/><span class="trend">14,18,18,22,19,16,15,28,21,17</span>
<em>20%</em> Success<br/><span class="trend">-1,1,-1,-1,-2,-2,1,-1,1,-1</span>
Can someone explain to me the best way to develop a similar trending system for success rate?
A trend is just a number calculated at regular intervals. In this case it looks like the site is just binning the data they get from the "Did this coupon work for you?" question, and then plotting those values in the chart. In other words, they take the number of (successes - failures) in some time interval (e.g. 12 hours) and plot that number for each interval.
As time passes, they probably rebin to keep the number of bars on the x axis acceptable. For example, if they only want to show 8 bars on the plot, then after 4 hours they'll have to widen the bins.
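A rough sketch of that binning in Python (entirely my own illustration; the vote format, the 12-hour bin width, and the 10-bin history are arbitrary choices, not anything RetailMeNot actually does):

    # Rough sketch of the binning idea: count (successes - failures) per
    # fixed-width time bin and hand the resulting list to the sparkline.
    from datetime import datetime, timedelta

    BIN_WIDTH = timedelta(hours=12)   # arbitrary choice
    NUM_BINS = 10                     # matches the 10 values in the spans above

    def trend(votes, now=None):
        """votes: iterable of (timestamp, worked) pairs, worked being True/False."""
        now = now or datetime.utcnow()
        bins = [0] * NUM_BINS
        for ts, worked in votes:
            idx = NUM_BINS - 1 - int((now - ts) // BIN_WIDTH)  # newest -> last bin
            if 0 <= idx < NUM_BINS:
                bins[idx] += 1 if worked else -1
        return bins   # e.g. ",".join(map(str, bins)) goes into the trend span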

Statistical estimation algorithm

I'm not sure if this question is appropriate for Stack Overflow but I'll give it a try anyway.
I have some data as follows:
I also have another set of data that I believe follows a similar distribution, but I only know the total percent (e.g., 30% rather than 17%). Can anyone suggest an algorithm to estimate the percentages for each individual tier based on the new total percentage and the original distribution?
Your question is unclear. If you want to estimate a new total percent by including the additional data you are getting, you must have a quantity associated with each percentage so that you can compute a meaningful weighted average.
If you want to determine whether the new set of data has a different distribution than the historical data, there are several tests, mostly doing obtuse calculations on cumulative actual vs. expected percentages of values falling below a particular value. There is a lot of literature on comparing the distributions of two populations.
For paired samples, the Wilcoxon signed-rank test is a standard method if you can make no assumptions about the distribution of the data. For non-paired data, non-parametric statistics exist but they require some in-depth study.
Step 1: If your overall percentage goes from 17% to 30%, then the Actual (total) goes from 105 to ~189.
Step 2: This number needs to be distributed over all the elements in the Actual column.
From here things become non-linear, and we need some formula for arriving at Actual from Possible. And this needs to be a function of the total,
i.e., f(possible, total actual) = actual.
If we can arrive at the above, then it might work ;)
If your new total is x, then put (22/627)*x as possible for tier 1, and (21/627)*x as actual for tier 1, which will give you the same percentage as before for tier 1. Then do the same thing for the other tiers (so possible for tier 2 is (45/627)*x, etc.).
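In code form, that proportional scaling is just the following (a tiny sketch; the 22/21-out-of-627 numbers are the ones quoted in this answer, and the new total of 1000 is a placeholder for whatever x you observe):

    # Tiny sketch of the proportional scaling described above. The per-tier
    # counts and the 627 total come from this answer; further tiers would be
    # filled in the same way.
    def scale_tiers(old_tiers, old_total, new_total):
        """old_tiers: {tier: (possible, actual)} from the original distribution."""
        return {tier: (possible * new_total / old_total,
                       actual * new_total / old_total)
                for tier, (possible, actual) in old_tiers.items()}

    new_tiers = scale_tiers({1: (22, 21)}, old_total=627, new_total=1000)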
