How to efficiently search large datasets by location and date range?

I have a MongoDB collection containing attributes such as:
longitude, latitude, start_date, end_date, price
I have over 500 million documents.
My question is how to search by lat/long, date range and price as efficiently as possible?
As I see it, my options are:
1. Create a geospatial index on lat/long and use MongoDB's proximity search, then filter the results by date range and price.
I have yet to test this, but I am worried that there would be too much data to search quickly, given that we have around 1 search a second.
Have you had experience with how MongoDB would react under these circumstances?
2. Split the data into multiple collections by location, i.e. by cities, such as london_collection, paris_collection, new_york_collection.
I would then have to query by lat/long first, find the nearest city collection, and then do a MongoDB spatial search on the subset of data in that collection, with date and price filters.
I would have an uneven distribution of documents, as some cities would have more documents than others.
3. Create collections by dates instead of location. Same as above, but each document is allocated to a collection based on its date range.
The problem is with searches whose date range straddles multiple collections.
4. Create unique ids based on city_start_date_end_date for each document.
Again, I would have to use my lat/long query to find the nearest city and append the date range to access the key. This seems to be pretty fast, but I don't really like the city lookup aspect... it seems a bit ugly.
I am in the process of experimenting with option 1, but would really like to hear your ideas before I go too far down one particular path.
How do search engines split up and manage their data... this must be a similar kind of problem?
Also I do not have to use MongoDB, I'm open to other options?
Many thanks.

Indexing and data access performance is a deep and complex subject. A lot of factors can affect the most efficient solution, including the size of your data sets, the read-to-write ratio, the relative performance of your IO and backing store, etc.
While I can't give you a concrete answer, I can suggest investigating Morton numbers as an efficient way of combining multiple related numeric values, such as lat/longs, into a single value that can be indexed.
Morton number
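To make the Morton number idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration: the 16-bit resolution per axis and the function names are my own assumptions, not something from the question. The coordinates are first snapped to an integer grid, then the bits of the two integers are interleaved so that points close together on the grid tend to get nearby codes.

```python
BITS = 16  # assumed resolution per axis; pick whatever your data needs


def quantize(value, lo, hi, bits=BITS):
    """Map a float in [lo, hi] onto an integer grid cell in [0, 2**bits - 1]."""
    cells = (1 << bits) - 1
    return int((value - lo) / (hi - lo) * cells)


def interleave(x, y, bits=BITS):
    """Interleave the bits of x and y: x takes even positions, y odd ones."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def deinterleave(code, bits=BITS):
    """Inverse of interleave(): recover (x, y) from a Morton code."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y


def morton_from_lat_lon(lat, lon, bits=BITS):
    """Hypothetical helper: one sortable integer for a lat/long pair."""
    x = quantize(lon, -180.0, 180.0, bits)
    y = quantize(lat, -90.0, 90.0, bits)
    return interleave(x, y, bits)
```

The resulting integer can be stored in an ordinary indexed column. A bounding-box search then becomes one or more range scans over that column, with a final exact check on the real coordinates.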

Why do you think option 1 would be too slow? Is this the result of a real world test or is this merely an assumption that it might eventually not work out?
MongoDB has native support for geohashing and turns coordinates into a single number which can then be searched by a BTree traversal. This should be reasonably fast. Messing around with multiple collections does not seem like a very good idea to me. All it does is replace one level of BTree traversal on the database with some code you still need to write, test and maintain.
Don't reinvent the wheel, but try to optimize the most obvious path (1) first:
Set up geo indexes
Use explain to make sure your queries actually use the index
Make sure your indexes fit into RAM
Profile the database using the built-in profiler
Don't measure performance on a 'cold' system where the indexes didn't have a chance to go to RAM yet
If possible, try not to use geoNear, and stick to the faster (but not perfectly spherical) near queries
If you're still hitting limits, look at sharding to distribute reads and writes to multiple machines.
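As a rough illustration of what a query for option 1 could look like, here is a sketch that only builds the filter document in Python. Several things are assumptions on my part: that longitude and latitude are stored together as a GeoJSON point in a field called location with a 2dsphere index on it, that "date range" means the document's range overlaps the searched range, and that price is an upper bound. The function name is made up.

```python
from datetime import datetime


def build_search_filter(lon: float, lat: float, max_meters: float,
                        start: datetime, end: datetime, max_price: float) -> dict:
    """Build a MongoDB filter: near a point, overlapping a date range, under a price."""
    return {
        # Assumed field: a GeoJSON point with a 2dsphere index on it.
        # GeoJSON coordinates are ordered [longitude, latitude].
        "location": {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": max_meters,
            }
        },
        # Assumed overlap semantics: the document starts before the search
        # ends and ends after the search starts.
        "start_date": {"$lte": end},
        "end_date": {"$gte": start},
        "price": {"$lte": max_price},
    }
```

Passing such a filter to find() and running explain on the result, as suggested in the list above, should show whether the geo index is actually being used.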

Related

Is ElasticSearch suited for retrieving a very large number of search records?

So, our production environment has an ES cluster that contains all our products inventory (ID and attributes) where each product is mapped to one document. Internally, one of our use cases is to create a logical grouping of these products based on text matching on a bunch of these product attributes.
Often times, it's possible that a product set could contain a very large number of products, say, 5 million. That is, the query to create a product set could match about 5 million documents.
Now, my question is, is ES capable of handling such large retrievals of documents, or is it recommended to use a backing store like Cassandra or HBase to fetch a huge number of documents? Note that I'm not concerned about realtime use cases - I'm okay with having an asynchronous execution of the product set creation, so latency isn't a major concern for me. From what I understand, ES provides the Scroll API to retrieve a large number of documents, but, I'm approaching the problem more from a school of thought perspective.
Is it fine to use ES to fetch very large documents, in the range of 5-10 million? Or should we use a parallel DB with big data capabilities to fetch the data and use ES only as the search store?
TL;DR: no, it is not meant to retrieve large sets of documents, although you could work around that with different approaches.
Note that the Scroll API might not be suitable for purposes other than re-indexing:
Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one data stream or index into a new data stream or index with a different configuration.
Another way to achieve it would be the search_after parameter:
search_after is not a solution to jump freely to a random page but rather to scroll many queries in parallel. It is very similar to the scroll API but unlike it, the search_after parameter is stateless, it is always resolved against the latest version of the searcher. For this reason the sort order may change during a walk depending on the updates and deletes of your index.
Rethink whether your use case really needs to exhaustively paginate over large sets of documents, since Elasticsearch's strength doesn't lie in large result sets.
Consult the documentation:
Paginate search results
index.max_result_window
Track total hits
Scroll API
Search after parameter
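For what it's worth, the search_after approach quoted above can be sketched roughly like this. The search argument is a hypothetical callable that takes a request body and returns the parsed Elasticsearch response (in practice a thin wrapper around whatever client you use), and product_id is an assumed unique field used as the sort key so that paging is stable.

```python
from typing import Callable, Iterator


def iterate_all_hits(search: Callable[[dict], dict], query: dict,
                     page_size: int = 1000) -> Iterator[dict]:
    """Yield every hit for a query, one page at a time, using search_after."""
    body = {
        "size": page_size,
        "query": query,
        # Assumed field name; the sort key should be unique per document.
        "sort": [{"product_id": "asc"}],
    }
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Resume after the sort values of the last hit on this page.
        body = {**body, "search_after": hits[-1]["sort"]}
```

Since latency isn't a concern in the question, a loop like this can run asynchronously and hand the ids to whatever store is used for the heavy document fetch.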

MongoDB text index search slow for common words in large table

I am hosting a mongodb database for a service that supports full text searching on a collection with 6.8 million records.
Its text index includes ten fields with varying weights.
Most searches take less than a second. Some searches take two to three seconds. However, some searches take 15 - 60 seconds! The 15-60 second search cases are unacceptable for my application. I need to find a way to speed those up.
Searching takes 15-60 seconds when words that are very common in the index are used in the search query.
It seems that the text search feature does not support lazy parameters. My first thought was to cache a list of the 50 most common words in my text index and then ask mongodb to evaluate those last (lazily), on top of the filtered results returned by the less common parameters. Hopefully people are still with me. For example, say I have a query "products chocolate", where products is common and chocolate is uncommon. I would like to be able to ask mongodb to evaluate "chocolate" first, and then filter those results with the "products" term. Does anyone know of a way to achieve this?
I can achieve the above scenario by omitting the most common words (i.e. "products") from the db query and then reapplying the common term filter on the application side after it has received the records found by the db. It is preferable for all query logic to happen on the database, but I am open to application-side processing for a speed payout.
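To illustrate the application-side workaround described above, here is a minimal sketch. The list of common terms, the text field name and the function names are all hypothetical, and the substring check is deliberately naive: it does no stemming, so it will not match exactly the way the database's text search does.

```python
COMMON_TERMS = {"products"}  # hypothetical cached list of very common index terms


def plan_search(query, common=COMMON_TERMS):
    """Split a query into (terms sent to the database, terms applied afterwards)."""
    terms = query.lower().split()
    rare = [t for t in terms if t not in common]
    frequent = [t for t in terms if t in common]
    if not rare:
        # Only common terms were given: nothing to defer, send them all.
        return terms, []
    return rare, frequent


def apply_common_filter(records, frequent, field="text"):
    """Keep only the records whose text contains every deferred common term."""
    return [
        r for r in records
        if all(term in r[field].lower() for term in frequent)
    ]
```

For the query "products chocolate", only "chocolate" would be sent to the database, and "products" would be checked against the (much smaller) result set afterwards.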
There are still some holes in this design. If a user only searches common terms, I have no choice but to hit the database with all the terms. From preliminary reading, I gather that it is not recommended (or not supported) to have multiple text indexes (with different names) on the same collection. My plan is to create two identical tables, each with my 6.8M records, with different indexes - one for common words and one for uncommon words. This feels kludgy and clunky, but I am willing to do this for a speed increase.
Does anyone have any insight and/or advice on how to speed up this system? I'd like as much processing to happen on the database as possible to keep it fast. I'm sure my little 6.8M record table is not the largest that mongodb has seen. Thanks!
Well I worked around these performance issues by allowing MongoDB full text search to search in OR based format. I'm prioritizing my results by fine tuning the weights on my indexed fields and just ordering by rank. I do get more results than desired, but that's not a huge problem because my weighted results that appear at the top will most likely be consumed before my user gets to less relevant results at the bottom.
If anyone is struggling with MongoDB text search performance using AND searching only, just switch back to OR and control your results using weights. It performs leaps better.
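A sketch of what the OR-style search ordered by rank could look like, expressed as the pieces you would hand to a driver. The function name is made up, and the field weights are assumed to have been set already when the text index was created.

```python
def build_text_search(query: str, limit: int = 20):
    """Build filter, projection and sort for an OR-style $text search ranked by score."""
    # Space-separated terms in $search are ORed together by MongoDB.
    filter_ = {"$text": {"$search": query}}
    # Expose the relevance score computed from the index weights.
    projection = {"score": {"$meta": "textScore"}}
    # Most relevant documents first.
    sort = [("score", {"$meta": "textScore"})]
    return filter_, projection, sort, limit
```

With pymongo this would be used along the lines of collection.find(filter_, projection).sort(sort).limit(limit), so only the top-ranked slice is ever sent to the user.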
hth
This is the exact same issue as $all versus $in. $all only uses the index for the first keyword in the array. I believe you're seeing the same issue here, which is why the OR (a.k.a. IN) works for you.

Caching sortable/filterable data in Redis

I have a variety of data that I've got cached in a standard Redis hashmap, and I've run into a situation where I need to respond to client requests for ordering and filtering. Order rankings for name, average rating, and number of reviews can change regularly (multiple times a minute, possibly). Can anyone advise me on a proper strategy for attacking this problem? Consider the following example to help understand what I'm looking for:
Client makes an API request to /api/v1/cookbooks?orderBy=name&limit=20&offset=0
I should respond with the first 20 entries, ordered by name
Strategies I've considered thus far:
for each type of hashmap store (cookbooks, recipes, etc), creating a sorted set for each ordering scheme (alphabetical, average rating, etc) from a Postgres ORDER BY; then pulling out ZRANGE slices based on limit and offset
storing ordering data directly into the JSON string data for each key.
hitting Postgres with a SELECT id FROM table ORDER BY _, and using the ids to pull directly from the hashmap store
Any additional thoughts or advice on how to best address this issue? Thanks in advance.
So, as mentioned in a comment below, Sorted Sets are a great way to implement sorting and filtering functionality in cache. Take the following example as an idea of how one might solve the issue of needing to order objects in a hash:
Given a hash called "movies" with the scheme of bucket:objectId -> object, which is a JSON string representation (read about "bucketing" your hashes for performance here).
Create a sorted set called "movieRatings", where each member is an objectId from your "movies" hash, and its score is an average of all rating values (computed by the database). Just use a numerical representation of whatever you're trying to sort, and Redis gives you a lot of flexibility on how you can extract the slices you need.
This simple scheme has a lot of flexibility in what can be achieved - you simply ask your sorted set for a set of keys that fit your requirements, and look up those keys with HMGET from your "movies" hash. Two swift Redis calls, problem solved.
Rinse and repeat for whatever type of ordering you need, such as "number of reviews", "alphabetically", "actor count", etc. Filtering can also be done in this manner, but normal sets are probably quite sufficient for that purpose.
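Roughly, the two calls could look like the sketch below. It assumes a redis-py style client (anything with zrange, zrevrange and hmget methods will do), a single "movies" hash rather than the bucketed scheme mentioned above, and JSON strings as values. The function name is made up.

```python
import json


def fetch_page(client, zset_key, hash_key, limit, offset, descending=True):
    """Return one page of objects, ordered by their score in a sorted set."""
    # ZRANGE/ZREVRANGE bounds are inclusive, hence the -1.
    start, stop = offset, offset + limit - 1
    if descending:
        ids = client.zrevrange(zset_key, start, stop)
    else:
        ids = client.zrange(zset_key, start, stop)
    if not ids:
        return []
    # Second call: fetch the JSON blobs for exactly those ids.
    raw = client.hmget(hash_key, ids)
    # Skip ids that are in the sorted set but missing from the hash.
    return [json.loads(item) for item in raw if item is not None]
```

A request like ?orderBy=rating&limit=20&offset=0 would then map to fetch_page(client, "movieRatings", "movies", 20, 0).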
This depends on your needs. Each of your strategies could work.
Your first approach of storing an auxiliary sorted set for each way you want to order is the best way to do this if you have a very big hash and/or you run your order queries frequently. This approach will require a lot of RAM if your hash is big, but it will also scale well in terms of time complexity as your hash gets bigger and you start running order queries more frequently. On the other hand, it introduces complexity in your data structures, and feels like you're trying to use Redis for something a typical DB like Postgres, MySQL, or Mongo would be better at.
Storing ordering data directly into your keys means you need to pull your entire hash every time you do an order query. Maybe that's not so bad if your hash is very small, or you don't do ordered queries very often, but this won't scale at all.
If you're already hitting Postgres to get keys, why not just store the values in Postgres as well? That would be much cheaper than hitting Postgres and then hitting Redis, and would have your code depend on fewer things. IMO, this is probably your best option and would work most naturally. Do this, unless you have some really good reason to not store values in Postgres, or some really big speed concerns, in which case go with your first strategy.

Why is Solr so much faster than Postgres?

I recently switched from Postgres to Solr and saw a ~50x speed up in our queries. The queries we run involve multiple ranges, and our data is vehicle listings. For example: "Find all vehicles with mileage < 50,000, $5,000 < price < $10,000, make=Mazda..."
I created indices on all the relevant columns in Postgres, so it should be a pretty fair comparison. Looking at the query plan in Postgres though it was still just using a single index and then scanning (I assume because it couldn't make use of all the different indices).
As I understand it, Postgres and Solr use vaguely similar data structures (B-trees), and they both cache data in-memory. So I'm wondering where such a large performance difference comes from.
What differences in architecture would explain this?
First, Solr doesn't use B-trees. A Lucene (the underlying library used by Solr) index is made of read-only segments. For each segment, Lucene maintains a term dictionary, which consists of the list of terms that appear in the segment, lexicographically sorted. Looking up a term in this term dictionary is done using a binary search, so the cost of a single-term lookup is O(log(t)) where t is the number of terms. By contrast, using the index of a standard RDBMS costs O(log(d)) where d is the number of documents. When many documents share the same value for some field, this can be a big win.
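A toy version of such a segment, just to show the shape of the lookup. This is a simplified sketch in Python, not Lucene's actual data structures (which are more elaborate); the class and method names are made up.

```python
from bisect import bisect_left


class Segment:
    """Toy read-only segment: a sorted term dictionary plus postings lists."""

    def __init__(self, docs):
        # docs maps doc id -> list of terms appearing in that document.
        postings = {}
        for doc_id, terms in docs.items():
            for term in set(terms):
                postings.setdefault(term, []).append(doc_id)
        # Terms are kept lexicographically sorted, one postings list per term.
        self.terms = sorted(postings)
        self.postings = [sorted(postings[t]) for t in self.terms]

    def lookup(self, term):
        """Binary search over t distinct terms, regardless of how many docs match."""
        i = bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            return self.postings[i]
        return []
```

With vehicle listings, a field such as make has only a handful of distinct terms, so the search runs over a very short list and then returns all matching doc ids in one go.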
Moreover, Lucene committer Uwe Schindler added support for very performant numeric range queries a few years ago. For every value of a numeric field, Lucene stores several values with different precisions. This allows Lucene to run range queries very efficiently. Since your use-case seems to leverage numeric range queries a lot, this may explain why Solr is so much faster. (For more information, read the javadocs which are very interesting and give links to relevant research papers.)
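To give a feel for the "several values with different precisions" idea, here is a simplified sketch. It is not Lucene's implementation: the step of 4 bits, the maximum shift, the class name and the greedy block decomposition are my own assumptions, and it only handles non-negative integers. Each value is indexed once per precision level (with more and more low bits dropped), and a range is then covered by a small number of aligned blocks instead of every distinct value in it.

```python
from collections import defaultdict

STEP = 4        # assumed: number of bits dropped per precision level
MAX_SHIFT = 28  # assumed: coarsest precision level


def cover(lo, hi):
    """Decompose [lo, hi] into aligned blocks, returned as (shift, prefix) terms."""
    terms = []
    while lo <= hi:
        shift = 0
        # Grow the block while it stays aligned and still fits in the range.
        while (shift + STEP <= MAX_SHIFT
               and lo % (1 << (shift + STEP)) == 0
               and lo + (1 << (shift + STEP)) - 1 <= hi):
            shift += STEP
        terms.append((shift, lo >> shift))
        lo += 1 << shift
    return terms


class MultiPrecisionIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # (shift, prefix) -> doc ids

    def add(self, doc_id, value):
        # Index the value once at every precision level.
        for shift in range(0, MAX_SHIFT + 1, STEP):
            self.postings[(shift, value >> shift)].add(doc_id)

    def range_query(self, lo, hi):
        # Union the postings of the few terms that cover the range.
        result = set()
        for term in cover(lo, hi):
            result |= self.postings.get(term, set())
        return result
```

In this sketch a price range like 5,000 to 10,000 decomposes into a few dozen terms, rather than one lookup per distinct price in the range.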
But Solr can only do this because it doesn't have all the constraints that an RDBMS has. For example, Solr is very bad at updating a single document at a time (it prefers batch updates).
You didn't really say much about what you did to tune your PostgreSQL instance or your queries. It's not unusual to see a 50x speed up on a PostgreSQL query through tuning and/or restating your query in a format which optimizes better.
Just this week there was a report at work which someone had written using Java and multiple queries in a way which, based on how far it had gotten in four hours, was going to take roughly a month to complete. (It needed to hit five different tables, each with hundreds of millions of rows.) I rewrote it using several CTEs and a window function so that it ran in less than ten minutes and generated the desired results straight out of the query. That's a 4400x speed up.
Perhaps the best answer to your question has nothing to do with the technical details of how searches can be performed in each product, but more to do with ease of use for your particular use case. Clearly you were able to find the fast way to search with Solr with less trouble than PostgreSQL, and it may not come down to anything more than that.
I am including a short example of how text searches for multiple criteria might be done in PostgreSQL, and how a few little tweaks can make a large performance difference. To keep it quick and simple I'm just running War and Peace in text form into a test database, with each "document" being a single text line. Similar techniques can be used for arbitrary fields using the hstore type or JSON columns, if the data must be loosely defined. Where there are separate columns with their own indexes, the benefits to using indexes tend to be much bigger.
-- Create the table.
-- In reality, I would probably make tsv NOT NULL,
-- but I'm keeping the example simple...
CREATE TABLE war_and_peace
(
  lineno serial PRIMARY KEY,
  linetext text NOT NULL,
  tsv tsvector
);
-- Load from downloaded data into database.
COPY war_and_peace (linetext)
  FROM '/home/kgrittn/Downloads/war-and-peace.txt';
-- "Digest" data to lexemes.
UPDATE war_and_peace
  SET tsv = to_tsvector('english', linetext);
-- Index the lexemes using GiST.
-- To use GIN just replace "gist" below with "gin".
CREATE INDEX war_and_peace_tsv
  ON war_and_peace
  USING gist (tsv);
-- Make sure the database has statistics.
VACUUM ANALYZE war_and_peace;
Once set up for indexing, I show a few searches with row counts and timings with both types of indexes:
-- Find lines with "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
  WHERE tsv @@ to_tsquery('english', 'gentlemen');
84 rows, gist: 2.006 ms, gin: 0.194 ms
-- Find lines with "ladies".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
  WHERE tsv @@ to_tsquery('english', 'ladies');
184 rows, gist: 3.549 ms, gin: 0.328 ms
-- Find lines with "ladies" and "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
  WHERE tsv @@ to_tsquery('english', 'ladies & gentlemen');
1 row, gist: 0.971 ms, gin: 0.104 ms
Now, since the GIN index was about 10 times faster than the GiST index you might wonder why anyone would use GiST for indexing text data. The answer is that GiST is generally faster to maintain. So if your text data is highly volatile the GiST index might win on overall load, while the GIN index would win if you are only interested in search time or for a read-mostly workload.
Without the index the above queries take anywhere from 17.943 ms to 23.397 ms since they must scan the entire table and check for a match on each row.
The GIN indexed search for rows with both "ladies" and "gentlemen" is over 172 times faster than a table scan in exactly the same database. Obviously the benefits of indexing would be more dramatic with bigger documents than were used for this test.
The setup is, of course, a one-time thing. With a trigger to maintain the tsv column, any changes made would instantly be searchable without redoing any of the setup.
With a slow PostgreSQL query, if you show the table structure (including indexes), the problem query, and the output from running EXPLAIN ANALYZE of your query, someone can almost always spot the problem and suggest how to get it to run faster.
UPDATE (Dec 9 '16)
I didn't mention what I used to get the prior timings, but based on the date it probably would have been the 9.2 major release. I just happened across this old thread and tried it again on the same hardware using version 9.6.1, to see whether any of the intervening performance tuning helps this example. The queries for only one argument only increased in performance by about 2%, but searching for lines with both "ladies" and "gentlemen" about doubled in speed to 0.053 ms (i.e., 53 microseconds) when using the GIN (inverted) index.
Solr is designed primarily for searching data, not for storage. This enables it to discard much of the functionality required from an RDBMS. So it (or rather Lucene) concentrates purely on indexing data.
As you've no doubt discovered, Solr lets you both search and retrieve data from its index. It's the latter (optional) capability that leads to the natural question... "Can I use Solr as a database?"
The answer is a qualified yes, and I refer you to the following:
https://stackoverflow.com/questions/5814050/solr-or-database
Using Solr search index as a database - is this "wrong"?
For the Guardian, Solr is the new database
My personal opinion is that Solr is best thought of as a searchable cache between my application and the data mastered in my database. That way I get the best of both worlds.
The biggest difference is that a Lucene/Solr index is like a single-table database without any support for relational queries (JOINs). Remember that an index is usually only there to support search and not to be the primary source of the data. So your database may be in "third normal form" but the index will be completely denormalized and contain mostly just the data needed to be searched.
Another possible reason is that databases generally suffer from internal fragmentation, so they need to perform too many semi-random I/O tasks on huge requests.
What that means is, for example, considering the index architecture of a database, the query leads to the indexes, which in turn lead to the data. If the data to recover is widely spread, the result will take a long time, and that seems to be what happens in databases.
Please read this and this.
Solr (Lucene) creates an inverted index, which makes retrieving data much faster. I read that PostgreSQL also has a similar facility, but I'm not sure whether you used it.
The performance differences that you observed can also be attributed to what is being searched for and what the user queries are.

Best approaches to reduce the number of searches between the filenet object stores to find a document based on the time of the document creation?

For example, there are 5 object stores. I am thinking of inserting documents into them, but not in sequential order. Initially it might be sequential, but if I could insert by using some ranking method it would be easier to know which object store to search to find the document. The goal is to reduce the number of object store searches. This can only be achieved if the insertion uses some intelligent algorithm.
One method I found useful is using the current year MOD N (number of object stores) to determine where a document goes. Could we have some better approaches to this?
If you want fast access there are a couple of criteria:
The hash function has to be reproducible based on the data which is queried. This means, a lot depends on the queries you expect.
You usually want to distribute your objects as evenly as possible across stores. If you want to go parallel, you want to access the documents for a given query from different stores, so they will not block each other. Hence your hashing function should spread similar documents over as many different stores as possible. If you expect documents related to the same query to be from the same year, do not use the year directly.
This assumes you want to be able to have fast queries which can be parallelized. If you instead have a system in which you first have to open a potentially expensive connection to the store, then most documents related to the same query should go in the same store and you should not take my advice above.
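The two placement strategies can be compared with a small sketch. It is plain Python with made-up function names, not FileNet API calls, and it assumes each document has some stable string id that is known at query time.

```python
import hashlib


def store_by_year(year, n_stores):
    """The scheme from the question: every document of a year lands in one store."""
    return year % n_stores


def store_by_hash(document_id, n_stores):
    """Spread documents evenly, reproducibly, based on an assumed stable id."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then pick a store.
    return int.from_bytes(digest[:8], "big") % n_stores
```

With year MOD N, a query for one year touches exactly one store, which suits the expensive-connection case. With the hash, documents from the same year are spread over all stores, which suits parallel access, but a query that does not know the id has to ask every store.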
Your criterion for "what goes in a FileNet object store?" is basically "what documents logically belong together?".

Resources