Best way to calculate a trend - algorithm

I am coding an app that lets users vote (e.g. market_trend_up += 1). The app then fetches the accumulated data (trend_up_votes = 632; trend_down_votes = 236), analyzes it, and displays the resulting trend (if up_votes > down_votes { trend = up }).
How would you advise refreshing the trends regularly? I thought about resetting the votes every 6 hours, for instance, but then the first voter after each reset would decide the trend single-handedly.
Would letting the votes accumulate always provide the current trend? Thank you!
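One way to refresh without a hard reset is to derive the trend from a rolling window of recent votes, so old votes age out gradually instead of being wiped all at once. A minimal in-memory sketch in Java, assuming votes are timestamped server-side (the class and method names are hypothetical):

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: derives the trend from votes cast within a rolling window.
class RollingTrend {
    private final Deque<Instant> upVotes = new ArrayDeque<>();
    private final Deque<Instant> downVotes = new ArrayDeque<>();
    private final Duration window = Duration.ofHours(6);

    synchronized void voteUp() { upVotes.addLast(Instant.now()); }
    synchronized void voteDown() { downVotes.addLast(Instant.now()); }

    // Drop votes older than the window, then compare what is left.
    synchronized String currentTrend() {
        Instant cutoff = Instant.now().minus(window);
        while (!upVotes.isEmpty() && upVotes.peekFirst().isBefore(cutoff)) upVotes.removeFirst();
        while (!downVotes.isEmpty() && downVotes.peekFirst().isBefore(cutoff)) downVotes.removeFirst();
        if (upVotes.size() == downVotes.size()) return "flat";
        return upVotes.size() > downVotes.size() ? "up" : "down";
    }
}

With this, a single vote after a quiet period does not control the trend for long; the window length (6 hours here, matching the reset interval mentioned above) is the main tuning knob.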

Related

Dataflow job has high data freshness and events are dropped due to lateness

I deployed an Apache Beam pipeline to GCP Dataflow in a DEV environment and everything worked well. Then I deployed it to a production environment in Europe (to be specific - job region: europe-west1, worker location: europe-west1-d), where we get high data velocity, and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger that fires every 30 seconds to release events sooner than the end of the session (writing them to BigQuery).
The problem appears to happen in the EventToSession/GroupPairsByKey step. In this step there are thousands of events under the droppedDueToLateness counter, and the dataFreshness keeps increasing (it has been increasing since I deployed the job). All steps before this one operate fine, and all steps after it are affected by it but don't seem to have any other problems.
I looked into some metrics and see that the EventToSession/GroupPairsByKey step is processing between 100K and 200K keys per second (depending on the time of day), which seems like quite a lot to me. CPU utilization doesn't go over 70%, and I am using Streaming Engine. The number of workers is 2 most of the time. Max worker memory capacity is 32 GB, while max worker memory usage currently stands at 23 GB. I am using the e2-standard-8 machine type.
I don't have any hot keys, since each session contains at most a few dozen events.
My biggest suspicion is the huge number of keys being processed in the EventToSession/GroupPairsByKey step. But on the other hand, a session is usually related to a single customer, so Google should expect to handle this number of keys per second, no?
I would like to get suggestions on how to solve the dataFreshness and droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input.apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event)))
        .withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
    .apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event))).setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
    .apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
        .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.standardDays(30)))
    .apply("GroupPairsByKey", GroupByKey.create())
    .apply("CreateCollectionOfValuesOnly", Values.create())
    .apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
Regarding the constantly increasing data freshness: as long as late data is allowed to arrive at a session window, that specific window will persist in memory. This means that allowing data to be 30 days late will keep every session in memory for at least 30 days, which obviously can overload the system. Moreover, I found we had some everlasting sessions created by bots visiting and taking actions on the websites we monitor. These bots can hold sessions open forever, which can also overload the system. The solution was decreasing the allowed lateness to 2 days and using bounded sessions (search for "bounded sessions").
Regarding events dropped due to lateness: these are events that, at the time of arrival, belong to an expired window, i.e. a window whose end the watermark has already passed (see the documentation for the droppedDueToLateness counter). These events are dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data, so the solution was to check each event's timestamp before it enters the sessions part and to stream to the session part only events that won't be dropped - events that meet this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest are written to BigQuery without the session data. (Apparently Apache Beam drops an event if its timestamp is before event_arrival_time - (gap_duration + allowed_lateness), even if there is a live session the event belongs to...) A sketch of such a pre-filter is shown below.
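A minimal sketch of that pre-filter, assuming the same Joda-Time types, TableRow events, and getEventTimestamp helper used in the pipeline above; the step names and the 2-day lateness value mirror the description and are otherwise illustrative:

Duration gapDuration = Duration.standardMinutes(30);
Duration allowedLateness = Duration.standardDays(2);

// Keep only events that could still land in a live (or allowably late) session window.
// Instant.now() stands in for the event's arrival/processing time.
PCollection<TableRow> sessionable = input.apply("KeepSessionableEvents",
    Filter.by((TableRow event) -> !Instant.parse(getEventTimestamp(event))
        .isBefore(Instant.now().minus(gapDuration).minus(allowedLateness))));

// Everything else skips the session windowing and is written to BigQuery without session data.
PCollection<TableRow> tooLate = input.apply("KeepTooLateEvents",
    Filter.by((TableRow event) -> Instant.parse(getEventTimestamp(event))
        .isBefore(Instant.now().minus(gapDuration).minus(allowedLateness))));

In a real pipeline the two branches could come from a single ParDo with multiple outputs so the predicate is evaluated only once, with the tooLate branch going to the existing BigQuery write without the session fields.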
P.S. - In the bounded-sessions part, where the author demonstrates how to implement a time-bounded session, I believe there is a bug that allows a session to grow beyond the provided max size. Once a session has exceeded the max size, one can send late data that intersects the session and precedes it, making the session's start time earlier and thereby expanding the session. Furthermore, once a session has exceeded the max size, events that belong to it but don't extend it can no longer be added.
To fix that, I swapped the order of the current window span and the if-statement, and edited the if-statement (the one checking for the session max size) in the mergeWindows function in the window-spanning part, so a session can't exceed the max size and only data that doesn't extend it beyond the max size can be added. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
    List<IntervalWindow> sortedWindows = new ArrayList<>();
    for (IntervalWindow window : c.windows()) {
        sortedWindows.add(window);
    }
    Collections.sort(sortedWindows);
    List<MergeCandidate> merges = new ArrayList<>();
    MergeCandidate current = new MergeCandidate();
    for (IntervalWindow window : sortedWindows) {
        MergeCandidate next = new MergeCandidate(window);
        if (current.intersects(window)) {
            if ((current.union == null || new Duration(current.union.start(), window.end()).getMillis() <= maxSize.plus(gapDuration).getMillis())) {
                current.add(window);
                continue;
            }
        }
        merges.add(current);
        current = next;
    }
    merges.add(current);
    for (MergeCandidate merge : merges) {
        merge.apply(c);
    }
}

Using grafana counter to visualize weather data

I'm trying to visualize my weather data using Grafana. I've already done the Prometheus part, and now I face an issue that has haunted me for quite a while.
I created a counter that adds the indoor temperature every five minutes.
var tempIn = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "tempin",
    Help: "Temperature indoor",
})

for {
    tempIn.Add(station.Body.Devices[0].DashboardData.Temperature)
    time.Sleep(time.Second * 300)
}
How can I now visualize this data so that it shows the current temperature and stores it indefinitely, so I can look at it even a year later like a normal graph?
tempin{instance="localhost:9999"} will only display the added-up temperature, so it's useless for me. I need the current temperature, not the accumulated one. I also tried rate(tempin{instance="localhost:9999"}[5m]).
How can I solve this issue?
Although a counter is not the best metric type for this use case (a gauge, which can go up and down, would fit a temperature better), you can use the increase() function:
increase(tempin{instance="localhost:9999"}[5m])
This will tell you how much the counter increased in the last five minutes.

Is there a way to have a moving average in Grafana?

I didn't find a 'moving average' feature and I'm wondering if there's a workaround.
I'm using influxdb as the backend.
Grafana supports adding a movingAverage(). I also had a hard time finding it in the docs, but you can (somewhat hilariously) see its usage on the feature intro page:
As usual, click on the graph title, choose edit, and add the metric movingAverage() as described in the Graphite documentation:
movingAverage(seriesList, windowSize)
Graphs the moving average of a metric (or metrics) over a fixed number of past points, or a time interval.
Takes one metric or a wildcard seriesList followed by a number N of datapoints or a quoted string with a length of time like ‘1hour’ or ‘5min’ (see from / until in the Render API for examples of time formats). Graphs the average of the preceding datapoints for each point on the graph. All previous datapoints are set to None at the beginning of the graph.
Example:
&target=movingAverage(Server.instance01.threads.busy,10)
&target=movingAverage(Server.instance*.threads.idle,'5min')
Grafana does no calculations itself; it just queries a backend and draws nice charts. So aggregation abilities depend solely on your backend. While Graphite supports windowing functions such as moving average, InfluxDB currently doesn't support them.
There are quite a lot of requests for a moving average in InfluxDB on the web. You can leave your "+1" and track progress in this ticket: https://github.com/influxdb/influxdb/issues/77
A possible (yet not so easy) workaround is to create a custom script (cron, daemon, whatever) that pre-calculates the moving average and saves it in a separate InfluxDB series; a sketch of the calculation is shown below.
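For illustration only, a minimal sketch of that pre-calculation in Java; reading the raw series and writing the averaged one back are left out because the client calls depend on your setup, and the method name and window size are placeholders:

import java.util.ArrayList;
import java.util.List;

// Simple moving average over the last `windowSize` points.
// Input: raw values ordered by time (e.g. fetched from the source series);
// output: one averaged value per input point, to be written to a separate series.
static List<Double> movingAverage(List<Double> values, int windowSize) {
    List<Double> averaged = new ArrayList<>();
    double sum = 0.0;
    for (int i = 0; i < values.size(); i++) {
        sum += values.get(i);
        if (i >= windowSize) {
            sum -= values.get(i - windowSize); // drop the value that just left the window
        }
        averaged.add(sum / Math.min(i + 1, windowSize));
    }
    return averaged;
}

A cron job or daemon would run this periodically and write the results to a separate series (the series name is up to you) that Grafana can then chart directly.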
I found myself here trying to do a moving average in Grafana with a PostgreSQL database, so I'll just add a way to do it with a SQL query:
SELECT
    date AS time,
    AVG(daily_average_column)
        OVER (ORDER BY date ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS value,
    '5 Day Moving Average' AS metric
FROM daily_average_table
ORDER BY time ASC;
This uses a "window" function to average the last 4 rows (plus the current row).
I'm sure there are ways to do this with MySQL as well.
The method and capability for this depend on your data source.
You specified InfluxDB, so your query will need to wrap an 'Aggregation function' [ such as mean($field) ] within the moving_average($aggregation_function, $num_of_points) 'Transformation Function'.
In the 'Metrics' tab, you will find both the 'Aggregation' and 'Transformation' functions in the 'select' portion of the menu.
Craft your query with the 'Aggregation function' (mean, min, max, etc.) first -- this way you can make sure the data looks as you expect it.
After this, just click the '+' button next to the 'Aggregation function', and under the menu 'Transformations', select 'moving_average'.
The number in brackets will be the number of points you want the average taken over.
try avg_over_time(mymetric[5m])
InfluxDB 2 allows you to calculate the moving average in the query, e.g.:
from(bucket: "iot")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "PoolWeather")
|> filter(fn: (r) => r["_field"] == "batteryvoltage")
|> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
|> movingAverage(n: 10)
|> yield(name: "average")
Another option is to report the data as "timing" metrics rather than counts.
This is easy to do, especially with StatsD in your stack.
Plotting timing data (coming from StatsD) as the average of the reported data points is already built in.

D3 ticks() does not return value if provided scale has only 1 result

I have an x-axis that displays the days that my data occurs on. The data is dynamic and sometimes I have data for only 1 day, 2 days, n days, etc.
Here is my code for displaying the days on the x-axis:
chart.x = d3.time.scale()
.range([0, chart.w]);
chart.xAxis = d3.svg.axis()
.scale(chart.x)
.orient("bottom")
.ticks(d3.time.day) // --- TODO : this is not showing the current day, for some reason...
.tickFormat(d3.time.format("%b %-d %p"));
If my data is spread over 2 days (e.g. Tuesday and Wednesday), this will only display a tick for the second day (Wednesday), i.e. when the day "changes" from one to another.
I want to also display a tick for the first day (Tuesday).
Even if there is only data on 1 day, I still want to display a tick for it.
Thanks, guys!
To extend the domain so that the scale starts and ends at a tick mark you use the .nice() method, as #meetamit suggested -- but "nicing" only works if you call that method after you set the domain, so that's why you might not have noticed any change. The API doesn't really make that clear, although since the method alters the domain I suppose it makes sense that changing the domain later would over-ride the effect of a previous nice() call.
Also, be sure to use the time-scale version of the method: .nice(d3.time.day) to get a domain rounded off to the nearest day as opposed to just the nearest hour.
Here's a fiddle:
http://fiddle.jshell.net/4rGQq/
The key code is simply:
xScale.domain(d3.extent(d))
// d3.extent() returns the min and max of the array, which become the basic domain
.nice(d3.time.day);
// nice() extends the domain to the nearest start/end of a day
Compare what happens if you comment out the .nice() call after setting the domain, even with the other .nice() call during initialization of the scale. Also compare what happens if you don't specify the day-interval as a parameter to the nice method.
Can you show how chart.x is set up? Hard to tell without seeing it, but you may be able to fix it by calling chart.x.nice() (see documentation).
Otherwise, seems like you'll need to manually check the extents of its domain, and adjust them in the case of single day.
Clarification
Your code shows how you call range() but not how you call domain(), which is the important one.
It seems to me that if you do
var domain = chart.x.domain();
console.log(domain[0] == domain[1]);
you'll see true getting logged whenever the data is for only one day. If so, it means you're dealing with a single point in time rather than a time range. In that case, you'll need to adjust the domain to be a longer range.
Really hard to know without even seeing an image of what you're working on.
.ticks() should be used to set the number of ticks you'd like to have on your axis, not the kind of data that should be in them. So try setting it like .ticks(3) and it should show a few ticks.
From the wiki:
.ticks([count])
Returns approximately count representative values from the scale's input domain. If count is not specified, it defaults to 10. The returned tick values are uniformly spaced, have human-readable values (such as multiples of powers of 10), and are guaranteed to be within the extent of the input domain. Ticks are often used to display reference lines, or tick marks, in conjunction with the visualized data. The specified count is only a hint; the scale may return more or fewer values depending on the input domain.

MongoDB Geospatial Load More Between HTTP Requests

AcaniUsers loads the first 20 users in MongoDB (on Heroku via Sinatra) closest to me from my iPhone. I want to add a Load More button that will load the next 20 users closest to me. Keep in mind, my location and the locations of the users on my phone may have changed. I was thinking of switching from Sinatra to Node.js and opening a WebSocket, so I could have realtime updates of the presences & locations of the users on my phone, but think I should save that challenge for a next iteration. Basically, how should I implement the load more functionality?
To paginate queries in MongoDB you can use a combination of limit() and skip().
So, the first query will be:
your_query.limit(20)
Then if you want to load the second 20 (you will have to remember the first query somewhere):
your_query.skip(20).limit(20)
By the way, I suggest executing the first query with a limit higher than 20 and putting the results you don't display in a cache. When the next page is requested, just get them from the cache (you can store it in the user session). If the position changes, restart from scratch, re-query the db, and invalidate the cache. A rough sketch of this idea is shown below.
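A rough sketch of that caching idea, assuming the MongoDB Java driver; the collection layout, the "location" field, the batch size, and the helper names are hypothetical, and the page size of 20 is taken from the question:

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import java.util.ArrayList;
import java.util.List;

// Fetch one bigger batch of nearby users up front; store it in the user session.
List<Document> fetchNearbyBatch(MongoCollection<Document> users,
                                double lon, double lat, int batchSize) {
    return users.find(Filters.near("location", lon, lat, null, null))
                .limit(batchSize)
                .into(new ArrayList<>());
}

// Serve page `page` (0-based) of 20 users from the cached batch;
// if the location has changed meanwhile, discard the cache and fetch a new batch instead.
List<Document> pageFromCache(List<Document> cachedBatch, int page) {
    int from = Math.min(page * 20, cachedBatch.size());
    int to = Math.min(from + 20, cachedBatch.size());
    return cachedBatch.subList(from, to);
}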
Think of it more as a client-side question: use subscriptions based on the current group - encode the group into a geo-square if possible (more efficient than a circle, I think?) - and periodically (every t) execute an operation that checks the location of each user and simply sends them out with a group id to match the subscriptions.
Actually... to build your subscription groups, just use the geoNear command on all of your subscribers:
- build a hash of your subscribers and their groups
- each subscriber is subscribed to one group and to themselves (for targeted communication, e.g. to indicate that a specific subscriber should change their subscription)
- iterate through the results i times, where i is the number of individuals in an update group
- execute an action that checks the current value of j, the group number for a specific subscriber, against the new j value - if there is a change, notify the subscriber on the subscriber's private channel
- notifications synchronously follow subscriber adjustments
something like:
var pageSize;
// assign pageSize in method call
var documents = collection.Find(query);
var max = documents.Size();
for (int i = 0; i * pageSize < max; i++)
{
    var page = collection.Find(query)
                         .Skip(i * pageSize)
                         .Limit(pageSize);
    // process the current page of documents here
}
:)
