How to structure an efficient GeoLocation and Time query - rethinkdb

Hey rethinkers!
I've got a query optimization question that I can't quite figure out. It deals with geolocation and time. I've got a ton of events that all have a startTime, an endTime (indexed), and a location (indexed). If I want to get the events near a certain location that haven't ended yet, I can do it one of two ways:
I can filter all the events that haven't ended yet based on the end time, then calculate the location of each of those events and only return the ones within the specified radius.
I can use the getNearest() command (which would also return all the expired events) and then filter out the events that have already ended. My one worry with this approach is that getNearest() specifies how many documents to return, but I essentially need all of them within the given radius so I don't miss any upcoming events.
I'm just unsure how to figure out the fastest/most efficient query for this.
The best option to me would seem to be to filter down to all events that haven't ended yet, then use getNearest() to take advantage of the indexes. But I can't call getNearest() on a filtered set. Please help!?!?!

For getting all events within a radius, I recommend using getIntersecting() together with r.circle.
That is not only more efficient than getNearest(), but also doesn't have any limit on the number of returned documents.
You might need to make the radius you pass into r.circle slightly larger, to account for the fact that the generated polygon approximates the circle from the inside: between its vertices it falls slightly short of the specified radius.
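Putting that together with the time filter from the question, a minimal sketch (table, field, and index names are taken from the question; the coordinates and the 50 m of padding are arbitrary examples):

    // Upcoming events within ~5 km of a point, index-accelerated.
    var r = require('rethinkdb');

    var nearbyUpcoming = r.table('events')
      .getIntersecting(
        // Pad the radius slightly, since r.circle's polygon edges
        // fall just inside the true circle between vertices.
        r.circle(r.point(-122.42, 37.77), 5050, { unit: 'm' }),
        { index: 'location' }
      )
      // The geo index narrows the set first; the time predicate
      // then runs only over the nearby events.
      .filter(r.row('endTime').gt(r.now()));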

Related

Parse: limitations of count()

Anyone who's read the Parse documentation has stumbled upon this caveat:
Count queries are rate limited to a maximum of 160 requests per minute. They can also return inaccurate results for classes with more than 1,000 objects. Thus, it is preferable to architect your application to avoid this sort of count operation (by using counters, for example.)
Why is there such a limitation, and why the inaccuracy?
To quote the Parse Engineering Blog Post: Building Scalable Apps on Parse
Suppose you are building a product catalog. You might want to display the count of products in each category on the top-level navigation screen. If you run a count query for each of these UI elements, they will not run efficiently on large data sets because MongoDB does not use counting B-trees. Instead, we recommend that you use a separate Parse Object to keep track of counts for each category. Whenever a product gets added or deleted, you can increment or decrement the counts in an afterSave or afterDelete Cloud Code handler.
To add on to this, here is another quote by Hector Ramos from the Parse Developers Google Group
Count queries have always been expensive once you throw some constraints in. If you only care about the total size of the collection, you can run a count query without any constraints and that one should be pretty fast, as getting the total number of records is a different problem than counting how many of these match an arbitrary list of constraints. This is just the reality of working with database systems.
The inaccuracy is not due to the 1,000-object request limit. The count query will try to get the total number of records regardless of size, but since the operation may take a long time to complete, it is possible that the database has changed during that window and the count value that is returned may no longer be valid.
The recommended way to handle counts is essentially to maintain your own counter using before/after save hooks. However, this is also a non-ideal solution, because save hooks can fail partway through and (worse) afterSave hooks have no error propagation.
The limitation exists simply to stop people from using counts too much; in effect they are just as costly at runtime as full queries.
The inaccuracy is because queries are limited to 1000 result objects (100 by default) and counts have the same hard limit.
You can run a recursive query to build up a count, but it's a crappy option. Hence the only really good option at this point in time (and as far as we can see into the future) is to keep an index of the things you're interested in counting and update the counts when anything changes. You would usually do that with save hooks in Cloud Code, as sketched below.
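For concreteness, a minimal sketch of that pattern in Parse Cloud Code (the "Product" and "CategoryCount" class names and their fields are assumptions; bootstrapping the counter row and error handling are omitted):

    // Keep a per-category count up to date from save/delete hooks.
    Parse.Cloud.afterSave('Product', function (request) {
      // existed() is false on the first save, so plain updates
      // don't inflate the count.
      if (!request.object.existed()) {
        var query = new Parse.Query('CategoryCount');
        query.equalTo('category', request.object.get('category'));
        query.first().then(function (counter) {
          counter.increment('count');
          return counter.save();
        });
      }
    });

    Parse.Cloud.afterDelete('Product', function (request) {
      var query = new Parse.Query('CategoryCount');
      query.equalTo('category', request.object.get('category'));
      query.first().then(function (counter) {
        counter.increment('count', -1);
        return counter.save();
      });
    });

As noted above, afterSave/afterDelete hooks have no error propagation, so a counter maintained this way can drift and may warrant an occasional reconciliation pass.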

Best Workaround for Geopoint SubQuery

I have a class whose objects contain a geopoint type. Each object also has a boolean named "goodPoint".
I need to return these objects sorted by distance from the user's location.
I have been using the "near" function to accomplish this.
I would also like to grab only the objects whose boolean "goodPoint" is set to "true".
I would typically use "isEqual" or "whereKey".
However, it is my understanding that Parse does not support combining other query constraints with a geopoint query (like "near").
What is the best workaround for effectively achieving my desired result without the use of the unsupported combined query?
Possible thoughts:
I would like to get all 1000 points. I could filter client-side (see the sketch after this list), but I'm afraid this won't scale in the long run, as I anticipate around 10%-20% of my points being "bad" (goodPoint=false), where the worst case would limit me to 800 points.
I could create a "graveyard" to send bad points to so that they don't show up in the nearest 1000, but I'm not sure exactly what latitude and longitude to park them at.
I could move the "bad" points to another class, but Parse doesn't seem to let you move objects across classes.
I could also just delete the points, but I need to keep them for user feedback purposes.
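For the client-side option from the first thought above, a minimal sketch with the Parse JavaScript SDK (the "Place" class, its fields, and userGeoPoint are assumptions; it inherits the 1,000-result ceiling):

    // Fetch the nearest 1000 points, then drop the "bad" ones locally.
    // userGeoPoint is assumed to be a Parse.GeoPoint for the user.
    var query = new Parse.Query('Place');
    query.near('location', userGeoPoint); // sorted by distance to the user
    query.limit(1000);                    // Parse's per-query maximum
    query.find().then(function (results) {
      var goodPoints = results.filter(function (place) {
        return place.get('goodPoint') === true;
      });
      // goodPoints preserves the distance ordering of the near query.
    });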

Counting events

I'm using Cube and Cubism. They're perfect, except for one thing... I need to display the total number of events numerically. E.g. I have a metric showing API calls per 10 seconds; I need to know the total number of API calls.
Is there anything built-in that I'm missing?
I thought about adding a (Mongo) count in the evaluator, but events expire, so that wouldn't work.
Keeping track of the running total client-side and including it in the event could be an option, but the sources are distributed and the events are not monotonic, so a simple sum over the last 10 seconds won't work. I would need to be able to query 'get the last event for each distinct source'. Is that possible?
I have a lot of metrics, so I really want to keep the number of client requests to a minimum. If I could get e.g. cumulative alongside value in the standard metric query I'd be happy.
EDIT
I was missing something... using sum and a large (e.g. 1 day) step works perfectly.
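For reference, a sketch of what that metric query against Cube's evaluator could look like over HTTP (host, port, event type, and time window are all assumptions):

    // Ask Cube's evaluator for one-day totals of "request" events.
    var http = require('http');

    var query = 'expression=' + encodeURIComponent('sum(request)') +
        '&start=2013-01-01T00:00:00Z' +
        '&stop=2013-01-02T00:00:00Z' +
        '&step=86400000'; // 1 day, one of Cube's fixed step sizes

    http.get('http://localhost:1081/1.0/metric?' + query, function (res) {
      var body = '';
      res.on('data', function (chunk) { body += chunk; });
      res.on('end', function () {
        // One element per step: [{ time: ..., value: <total for that day> }]
        console.log(JSON.parse(body));
      });
    });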

MongoDB geospatial query

I use Mongo's "$near" query; it works as expected and saves me a lot of time.
Now I need to perform something more complicated. Imagine we have a collection of "checkins" (to use Foursquare notation) that contains geospatial information (nothing unusual: just lat and lng) plus a time. Given the checkins of two people, how do I find the checkins where they "were near each other"? I mean, e.g.: "on 1/23/12 you were 100 meters apart".
The easiest solution is to select all the checkins of the first user and find the nearest checkin to each of them on the framework side (I use Ruby). But is that the most efficient solution?
Do you have better ideas? Maybe I need some kind of special index?
Best,
Roman
The MongoDB GeoSpatial indexes provide two types of queries: $near and $within. The $near query returns all points in the database that are within a certain range of a requested point, while the $within query lists all points in the database that are inside of a particular area (box, circle, or arbitrary polygon).
MongoDB does not currently provide a query that will return all points that are within a certain distance of any member of another set of points, which is what you seem to want.
You could conceivably use the point data from user1 to build a polygon describing the "area of interest" and then use the $within query to see if there were any checkins by other people inside of that area. If you use a compound index on location & date, you could even restrict the query to folks who were inside of that area on a particular day.
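Along those lines, a sketch in the mongo shell using the legacy 2d index and $within from the docs below, run per user1 checkin rather than with a single polygon (collection and field names are assumptions, and the metres-to-degrees conversion is rough):

    // Compound geo index: the 2d location first, then the date.
    db.checkins.ensureIndex({ loc: '2d', date: 1 });

    // One of user1's checkins, and ~100 m expressed in degrees
    // (one degree of latitude is roughly 111 km).
    var point = [ -73.99, 40.73 ];
    var radius = 0.1 / 111;

    // Did anyone else check in inside that circle on 1/23/12?
    db.checkins.find({
      user: { $ne: 'user1' },
      loc: { $within: { $center: [ point, radius ] } },
      date: { $gte: ISODate('2012-01-23'), $lt: ISODate('2012-01-24') }
    });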
References:
http://docs.mongodb.org/manual/core/indexes/#geospatial-indexes
http://docs.mongodb.org/manual/reference/operators/#geospatial

Random noise in Solr score

I am looking for a way of introducing random noise into my scoring function, and I'm at a loss on how to best proceed.
Some background:
We use Solr for a web application that manages large-ish sets of photos for agencies.
One customer has an interesting requirement for scoring:
'quality' field, maintained by editors, from 1 (highest) to 3 (lowest);
'date' field, boosting more recent photos; I would probably use a logarithmic function;
However, due to how the stock photo market works, this will likely result in many similar photos appearing together.
Their request is to give 'quality' a large boost, but introduce some randomness so that photos will not appear in a strict date order.
Any ideas?
EDITED: a key requirement is to have "stable" query results: if I search twice for "tropical island" I can get a slightly different result set, but if I ask for the first page, then the second, then the first, I'd better get the same results :)
You could do this with FunctionQueries. For each photo add a field with a random number close to 1 (e.g. 0.99, 1.02) and use it in a product function query to alter the "natural" score.
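As an illustration, with the edismax parser that could look something like this (noise_f is the per-document factor described above, the recip(...) term is the usual Solr date-boost recipe, and all field names are assumptions):

    q=tropical island
    &defType=edismax
    &boost=product(noise_f, recip(ms(NOW,date_d), 3.16e-11, 1, 1))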
Turns out my first approach to solving the problem was the correct one, and I had a trivial implementation bug. In case it helps others:
RandomSortField does have the characteristics I need (that is, returning repeatable results for the same query).
Leaving aside the FunctionQuery for a moment, even something trivial like:
sort=quality_i asc, date_d desc, random_12345 desc
will approximate my requirements.
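The seed is embedded in the dynamic field name itself, so keeping that suffix constant across requests is what makes the ordering stable (the seeds below are arbitrary examples):

    sort=quality_i asc, date_d desc, random_12345 desc    (same seed, same order)
    sort=quality_i asc, date_d desc, random_67890 desc    (different seed, new shuffle)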
However, when using the Sunspot Ruby gem there is no way to pass the seed, and that's what was tricking me earlier: I ended up with a different seed on every request, and thus got truly random (unstable) results.
