MongoDB Geospatial Load More Between HTTP Requests - ruby

AcaniUsers loads the first 20 users in MongoDB (on Heroku via Sinatra) closest to me from my iPhone. I want to add a Load More button that will load the next 20 users closest to me. Keep in mind that my location and the locations of the users on my phone may have changed. I was thinking of switching from Sinatra to Node.js and opening a WebSocket, so I could have realtime updates of the presences & locations of the users on my phone, but I think I should save that challenge for a later iteration. Basically, how should I implement the load more functionality?

To paginate queries in MongoDB you can use a combination of limit() and skip().
So, the first query will be:
your_query.limit(20)
Then if you want to load the second 20 (you will have to remember the first query somewhere):
your_query.skip(20).limit(20)
By the way, I suggest running the first query with a limit higher than 20 and caching the results you don't display. When more are requested, just get them from the cache (you can store it in the user session). If the position changes, start from scratch: invalidate the cache and re-query the database.
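The skip/limit arithmetic and the prefetch cache described above can be sketched in Python. This is only a minimal sketch under stated assumptions, not the AcaniUsers code: `fetch` is a hypothetical stand-in for the real geospatial query (it receives a location, a skip and a limit), and `page_bounds` and `PrefetchPager` are names invented for this example.

```python
from typing import Callable, List, Optional, Tuple

PAGE_SIZE = 20
Location = Tuple[float, float]


def page_bounds(page: int, page_size: int = PAGE_SIZE) -> Tuple[int, int]:
    """Return (skip, limit) for a zero-based page number."""
    return page * page_size, page_size


class PrefetchPager:
    """Fetch several pages per query and serve later pages from a cache."""

    def __init__(self, fetch: Callable[[Location, int, int], List[dict]],
                 prefetch_pages: int = 3):
        self.fetch = fetch                  # hypothetical stand-in for the MongoDB query
        self.prefetch_pages = prefetch_pages
        self.location: Optional[Location] = None
        self.cache: List[dict] = []
        self.cache_start = 0                # skip offset of the first cached row

    def get_page(self, location: Location, page: int) -> List[dict]:
        skip, limit = page_bounds(page)
        moved = location != self.location
        cache_end = self.cache_start + len(self.cache)
        cached = self.cache_start <= skip and skip + limit <= cache_end
        if moved or not cached:
            # Position changed or cache miss: invalidate and re-query.
            self.location = location
            self.cache_start = skip
            self.cache = self.fetch(location, skip, limit * self.prefetch_pages)
        offset = skip - self.cache_start
        return self.cache[offset:offset + limit]
```

With `prefetch_pages=3`, the first request loads 60 rows, so the next two "Load More" taps are served without touching the database as long as the location stays the same. Whether the client should also jump back to page 0 after moving is left to the caller.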

Think of it more as a client-side question: use subscriptions based on the current group. Encode the group as a geo-square if possible (more efficient than a circle, I think?). Periodically (every t), run an operation that checks the location of each user and simply sends the locations out with a group id that matches the subscriptions.
Actually, to build your subscription groups, just use the geoNear command on all of your subscribers:
- build a hash of your subscribers and their groups
- each subscriber is subscribed to one group and to themselves (for targeted communication, e.g. to tell a specific subscriber to change their subscription)
- iterate through the results i times, where i is the number of individuals in an update group
- execute an action that checks the current value of j, the group number for a specific subscriber, against the new j value; if there is a change, notify the subscriber on the subscriber's private channel
- notifications synchronously follow subscriber adjustments
something like:
// pageSize is assigned in the method call
var documents = collection.Find(query);
var max = documents.Size();
// walk the result set one page at a time
for (int offset = 0; offset < max; offset += pageSize)
{
    // the last page may hold fewer than pageSize documents
    var page = documents.Skip(offset).Limit(pageSize);
}
:)
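The group bookkeeping in the list above can be sketched in Python. This is a rough sketch under assumptions, not the answerer's code: it uses the geo-square idea (a fixed grid) rather than a geoNear query, and `CELL_DEGREES`, `cell_for` and `group_changes` are hypothetical names.

```python
import math
from typing import Dict, List, Tuple

CELL_DEGREES = 0.01  # hypothetical grid cell size, in degrees
Cell = Tuple[int, int]


def cell_for(lat: float, lon: float, size: float = CELL_DEGREES) -> Cell:
    """Map a coordinate to the square grid cell that contains it."""
    return math.floor(lat / size), math.floor(lon / size)


def group_changes(
    old_groups: Dict[str, Cell],
    locations: Dict[str, Tuple[float, float]],
) -> Tuple[Dict[str, Cell], List[str]]:
    """Return the new group per subscriber and the subscribers whose group changed."""
    new_groups = {user: cell_for(lat, lon) for user, (lat, lon) in locations.items()}
    changed = [user for user, group in new_groups.items() if old_groups.get(user) != group]
    return new_groups, changed
```

Each subscriber in `changed` would then be notified on their private channel so they can switch subscriptions.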

Related

Dataflow job has high data freshness and events are dropped due to lateness

I deployed an Apache Beam pipeline to GCP Dataflow in a DEV environment and everything worked well. Then I deployed it to production in a Europe environment (to be specific: job region europe-west1, worker location europe-west1-d), where we get high data velocity, and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger to emit events every 30 seconds, to release events sooner than the end of the session (writing them to BigQuery).
The problem appears to happen in the EventToSession/GroupPairsByKey step. In this step there are thousands of events under the droppedDueToLateness counter, and the dataFreshness keeps increasing (it has been increasing since I deployed). All steps before this one operate well, and all steps after it are affected by it but don't seem to have any other problems.
I looked into some metrics and see that the EventToSession/GroupPairsByKey step is processing between 100K and 200K keys per second (depending on the time of day), which seems quite a lot to me. The CPU utilization doesn't go over 70% and I am using Streaming Engine. The number of workers is 2 most of the time. Max worker memory capacity is 32GB, while the max worker memory usage currently stands at 23GB. I am using the e2-standard-8 machine type.
I don't have any hot keys, since each session contains at most a few dozen events.
My main suspicion is the huge number of keys being processed in the EventToSession/GroupPairsByKey step. On the other hand, a session is usually related to a single customer, so Google should expect to handle this number of keys per second, no?
I would like suggestions on how to solve the dataFreshness and droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input
    .apply("SetEventTimestamp",
        WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event)))
            .withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
    .apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event)))
    .setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
    .apply("CreatingWindow",
        Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
            .discardingFiredPanes()
            .withAllowedLateness(Duration.standardDays(30)))
    .apply("GroupPairsByKey", GroupByKey.create())
    .apply("CreateCollectionOfValuesOnly", Values.create())
    .apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
Regarding the constantly increasing data freshness: as long as late data is allowed to arrive in a session window, that window persists in memory. Allowing 30 days of late data therefore keeps every session in memory for at least 30 days, which can obviously overload the system. Moreover, I found that we had some everlasting sessions caused by bots visiting and taking actions on the websites we monitor. These bots can hold sessions open forever, which can also overload the system. The solution was to decrease the allowed lateness to 2 days and to use bounded sessions (look for "bounded sessions").
Regarding events dropped due to lateness: these are events that, at the time of arrival, belong to an expired window, i.e. a window whose end the watermark has already passed (see the documentation for droppedDueToLateness here). These events are dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data, so the solution was to check each event's timestamp before it reaches the sessions part and to stream into it only the events that won't be dropped, i.e. events that meet this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest are written to BigQuery without the session data. (Apparently Apache Beam drops an event if its timestamp is before event_arrival_time - (gap_duration + allowed_lateness), even if there is a live session the event belongs to...)
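The routing condition above can be written out as a small Python predicate. This is only an illustration of the inequality, not Beam code: `can_enter_sessions` is a hypothetical name, and the constants mirror the 30-minute gap and 2-day allowed lateness mentioned in this answer.

```python
from datetime import datetime, timedelta, timezone

GAP_DURATION = timedelta(minutes=30)
ALLOWED_LATENESS = timedelta(days=2)


def can_enter_sessions(
    event_timestamp: datetime,
    arrival_time: datetime,
    gap: timedelta = GAP_DURATION,
    lateness: timedelta = ALLOWED_LATENESS,
) -> bool:
    """True if the event is recent enough to be streamed into the session step."""
    return event_timestamp >= arrival_time - (gap + lateness)
```

Events for which this returns False would take the side path that writes them to BigQuery without session data.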
P.S. In the bounded sessions part, where he demonstrates how to implement a time-bounded session, I believe he has a bug that allows a session to grow beyond the provided max size. Once a session has exceeded the max size, one can send late data that intersects the session and precedes it, which moves the session's start time earlier and thereby expands the session. Furthermore, once a session has exceeded the max size, events that belong to it but don't extend it can no longer be added.
To fix that, I swapped the order of the current window span and the if-statement, and edited the if-statement (the one checking the session max size) in the mergeWindows function in the window spanning part, so that a session can't exceed the max size and can only receive data that doesn't extend it beyond the max size. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
    List<IntervalWindow> sortedWindows = new ArrayList<>();
    for (IntervalWindow window : c.windows()) {
        sortedWindows.add(window);
    }
    Collections.sort(sortedWindows);
    List<MergeCandidate> merges = new ArrayList<>();
    MergeCandidate current = new MergeCandidate();
    for (IntervalWindow window : sortedWindows) {
        MergeCandidate next = new MergeCandidate(window);
        if (current.intersects(window)) {
            if ((current.union == null || new Duration(current.union.start(), window.end()).getMillis() <= maxSize.plus(gapDuration).getMillis())) {
                current.add(window);
                continue;
            }
        }
        merges.add(current);
        current = next;
    }
    merges.add(current);
    for (MergeCandidate merge : merges) {
        merge.apply(c);
    }
}
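To make the merge rule easier to follow, here is the same idea sketched in plain Python on (start, end) tuples instead of Beam windows. It is a simplified model, not a port: `merge_bounded` is a hypothetical name, windows are treated as half-open intervals, and the bound is `max_size + gap` as in the Java check above.

```python
from typing import List, Optional, Tuple

Window = Tuple[int, int]  # (start, end), half-open, in any consistent time unit


def merge_bounded(windows: List[Window], max_size: int, gap: int) -> List[Window]:
    """Merge overlapping windows, but never let a merged span exceed max_size + gap."""
    merged: List[Window] = []
    current: Optional[Window] = None
    for start, end in sorted(windows):
        if current is not None:
            cur_start, cur_end = current
            overlaps = start < cur_end and cur_start < end
            if overlaps and end - cur_start <= max_size + gap:
                # The window fits: extend the current session.
                current = (cur_start, max(cur_end, end))
                continue
            # Disjoint, or merging would exceed the bound: close the session.
            merged.append(current)
        current = (start, end)
    if current is not None:
        merged.append(current)
    return merged
```

For example, with a 30-minute gap and a 60-minute max size, five overlapping 30-minute windows starting at minutes 0, 20, 45, 70 and 95 end up as two sessions, because adding the fourth window to the first session would push its span past 90 minutes.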

Get current no from prooph event store

I am trying to update a projection from the event store. The following line loads all events:
$events = $this->eventStore->load(new StreamName('mystream'));
Currently I try to load only the unhandled events by passing the fromNumber parameter:
$events = $this->eventStore->load(new StreamName('mystream'), 10);
This will load all events, e.g. from 15 to 40. But I found no way to figure out the current/highest "no" of the results, which I need in order to load only from that entry onward the next time.
If the database is truncated (with restarted sequences) this is not a real problem, because I know that the events will start with 1. But if the primary key starts with a number higher than 1, I cannot figure out which event has which number in the event store.
When you are using pdo-event-store, you have a key _position in the event metadata after loading, so your read model can track which position was the last one you were working on. Other than that, if you are working with prooph's event-store projections, you don't need to take care of that at all. The projector will track the current event position for all needed streams internally; you just need to provide callbacks for each event where you need to do something.
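The position tracking described in the answer can be sketched in Python. This is a language-neutral illustration, not prooph code: events are modelled as plain dicts with a `metadata` key, `highest_position` is a hypothetical helper, and passing `highest + 1` as the next fromNumber assumes that fromNumber is inclusive.

```python
from typing import Iterable


def highest_position(events: Iterable[dict], last_position: int = 0) -> int:
    """Return the largest `_position` seen, or last_position if nothing was loaded."""
    return max(
        (event["metadata"]["_position"] for event in events),
        default=last_position,
    )


# Example: remember the position after handling a batch.
batch = [
    {"name": "UserRegistered", "metadata": {"_position": 15}},
    {"name": "UserRenamed", "metadata": {"_position": 16}},
    {"name": "UserDeleted", "metadata": {"_position": 17}},
]
last = highest_position(batch)
next_from_number = last + 1  # assumption: fromNumber is inclusive
```

The read model would persist `last` alongside its own data so that the next run can resume from `next_from_number`.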

Odata filter on cached data

Say, for instance, we want to pull back a list of 50 individuals. When using OData with the filter $top=2&$skip=4, it returns those 2 records, but what we want is to fetch the 50 individuals once and then be able to run the filter against that cached set.
When I debug my program, I run it to get the 50, and every subsequent call brings the 50 back from the cache. When I run the same thing but add $top=2&$skip=4, it runs through my code, gets all the records, and returns the two objects from the code, not from the cache.
GetIndividuals(ODataQueryOptions<Individuals> opts)
{
    .....
    var results = opts.ApplyTo(objAll.AsQueryable(), settings);
    return new PageResult<Individual>(
        results as IEnumerable<Individual>,
        Request.ODataProperties().NextLink,
        objAll.AsQueryable().Count());
}
I hope this is clear. Any ideas on how to return a larger group of data and then run OData on it after the fact?
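One way to read the desired behaviour is: fetch the full set once, keep it in a cache, and apply $skip and then $top to the cached list in memory. The Python sketch below shows only that slicing step; `apply_paging` is a hypothetical helper and is not part of any OData library.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def apply_paging(cached: Sequence[T], skip: int = 0, top: Optional[int] = None) -> List[T]:
    """Apply $skip first, then $top, to an already-cached result set."""
    end = None if top is None else skip + top
    return list(cached[skip:end])


cached_individuals = list(range(50))  # stand-in for the 50 cached records
page = apply_paging(cached_individuals, skip=4, top=2)  # the records at indexes 4 and 5
```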

how to get total number of mixpanel events via API

Can I get total number of events (=data points) for a time period?
The 'events' method (http://mixpanel.com/api/2.0/events/) seems to be almost what I need, but it requires a list of event names; I need the total count of my events, and I do not have the names.
I could not find this in the API.
You can first hit the events by name API to return the list of your events at http://mixpanel.com/api/2.0/events/names/. Then, you can pump the return as a list into your request to http://mixpanel.com/api/2.0/events/ to get the count for each of your events.
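Once the per-event counts have come back, adding them up is a short step. The Python sketch below assumes the response has been parsed into a dict shaped like `{"data": {"values": {event_name: {date: count}}}}`; that shape is an assumption made for this example and should be checked against a real response, and `total_event_count` is a hypothetical helper.

```python
def total_event_count(response: dict) -> int:
    """Sum every per-date count of every event in the (assumed) response shape."""
    values = response["data"]["values"]
    return sum(count for per_date in values.values() for count in per_date.values())


# Hand-written sample in the assumed shape, not real Mixpanel output.
sample_response = {
    "data": {
        "series": ["2017-02-01", "2017-02-02"],
        "values": {
            "Signup": {"2017-02-01": 3, "2017-02-02": 5},
            "Login": {"2017-02-01": 10, "2017-02-02": 12},
        },
    }
}
total = total_event_count(sample_response)
```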
Depending on your use case, it may make more sense to use a URL hack on the main segmentation report instead of hitting the API. If you add union:1 as a new parameter to the URL (the parameters are comma separated at the end), the report will display the union of all your events over a time period. If you are viewing totals, this will be the total event count.
You can use the JQL console within Mixpanel, under the "Applications" menu in the left-nav. Just run the following, and it'll count the total number of events. See the JQL API reference here: https://mixpanel.com/help/reference/jql/api-reference#api/concepts
function main() {
  return Events({
    from_date: '2010-02-02',
    to_date: '2017-02-03'
  }).reduce(mixpanel.reducer.count());
}
// 989322

How to Spam Filter Gmail Messages by Recipient Address?

I use the dot feature (m.yemail@gmail.com instead of myemail@gmail.com) when giving my email to questionable sites, so that I can easily spot spam that results from my address being sold.
I made this function and set it to trigger every 30 minutes to filter these automatically.
function moveSpamByAddress(){
  var addresses = ["m.yemail@gmail.com"];
  var threads = GmailApp.getInboxThreads();
  for (var i = 0; i < threads.length; i++){
    var messages = threads[i].getMessages();
    for (var ii = 0; ii < messages.length; ii++){
      for (var iii = 0; iii < addresses.length; iii++){
        if (messages[ii].getTo().indexOf(addresses[iii]) > -1){
          threads[i].moveToSpam();
        }
      }
    }
  }
}
This works, but I noticed that this runs slower than I would expect it to (but my expectation may be unreasonable) given that my inbox only contains 50 messages and I am only currently filtering one address. Is there a way to increase execution speed?
Also are there any penalties for running scripts too often? I see that I have the option to trigger a script every minute, and that would increase the likelihood of filtering a message before I see it, but it would also run the scripts uselessly significantly more times.
You can do this using native Gmail filters plus Apps Script.
Script runtime quotas vary from 1 to 6 hours depending on account type.
To improve performance, first check getInboxUnreadCount and return immediately if it is zero.
If you use a 1-minute trigger, make sure to use a lock to avoid one run starting while another is still running. If the lock is in use, simply return.
First, make a Gmail filter so that when "to" matches your special address, a special label like "mySpam" is applied.
Second, make an Apps Script with my suggestions above. Your code no longer needs to search so much; it just needs to find the emails with that label (a single API call) and call .moveToSpam on them.
There shouldn't be that many in the label at any time if the script runs often.
