I recently switched from Postgres to Solr and saw a ~50x speed up in our queries. The queries we run involve multiple ranges, and our data is vehicle listings. For example: "Find all vehicles with mileage < 50,000, $5,000 < price < $10,000, make=Mazda..."
I created indices on all the relevant columns in Postgres, so it should be a pretty fair comparison. Looking at the query plan in Postgres, though, it was still just using a single index and then scanning (I assume because it couldn't make use of all the different indices).
As I understand it, Postgres and Solr use vaguely similar data structures (B-trees), and they both cache data in-memory. So I'm wondering where such a large performance difference comes from.
What differences in architecture would explain this?
First, Solr doesn't use B-trees. A Lucene index (Lucene is the underlying library used by Solr) is made up of read-only segments. For each segment, Lucene maintains a term dictionary, which consists of the list of terms that appear in the segment, lexicographically sorted. Looking up a term in this dictionary is done with a binary search, so the cost of a single-term lookup is O(log(t)), where t is the number of terms. By contrast, using the index of a standard RDBMS costs O(log(d)), where d is the number of documents. When many documents share the same value for some field, this can be a big win.
Moreover, Lucene committer Uwe Schindler added support for very performant numeric range queries a few years ago. For every value of a numeric field, Lucene stores several values with different precisions. This allows Lucene to run range queries very efficiently. Since your use-case seems to leverage numeric range queries a lot, this may explain why Solr is so much faster. (For more information, read the javadocs which are very interesting and give links to relevant research papers.)
But Solr can only do this because it doesn't have all the constraints that an RDBMS has. For example, Solr is very bad at updating a single document at a time (it prefers batch updates).
You didn't really say much about what you did to tune your PostgreSQL instance or your queries. It's not unusual to see a 50x speed up on a PostgreSQL query through tuning and/or restating your query in a format which optimizes better.
Just this week there was a report at work which someone had written using Java and multiple queries in a way which, based on how far it had gotten in four hours, was going to take roughly a month to complete. (It needed to hit five different tables, each with hundreds of millions of rows.) I rewrote it using several CTEs and a window function so that it ran in less than ten minutes and generated the desired results straight out of the query. That's a 4400x speed up.
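The report itself isn't shown here, but for illustration, the general pattern was to replace many per-row queries issued from application code with a single set-based query. A minimal sketch of that shape, with hypothetical table and column names:

WITH monthly_totals AS (
    -- One pass over the detail table instead of one query per account.
    SELECT account_id,
           date_trunc('month', posted_at) AS month,
           sum(amount) AS total
    FROM   transactions
    GROUP  BY account_id, date_trunc('month', posted_at)
)
SELECT account_id,
       month,
       total,
       -- Window function: compare each month to the previous one per account.
       total - lag(total) OVER (PARTITION BY account_id ORDER BY month) AS change
FROM   monthly_totals
ORDER  BY account_id, month;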
Perhaps the best answer to your question has nothing to do with the technical details of how searches can be performed in each product, but more to do with ease of use for your particular use case. Clearly you were able to find the fast way to search with Solr with less trouble than PostgreSQL, and it may not come down to anything more than that.
I am including a short example of how text searches for multiple criteria might be done in PostgreSQL, and how a few little tweaks can make a large performance difference. To keep it quick and simple I'm just running War and Peace in text form into a test database, with each "document" being a single text line. Similar techniques can be used for arbitrary fields using the hstore type or JSON columns, if the data must be loosely defined. Where there are separate columns with their own indexes, the benefits to using indexes tend to be much bigger.
-- Create the table.
-- In reality, I would probably make tsv NOT NULL,
-- but I'm keeping the example simple...
CREATE TABLE war_and_peace
(
  lineno serial PRIMARY KEY,
  linetext text NOT NULL,
  tsv tsvector
);
-- Load from downloaded data into database.
COPY war_and_peace (linetext)
FROM '/home/kgrittn/Downloads/war-and-peace.txt';
-- "Digest" data to lexemes.
UPDATE war_and_peace
SET tsv = to_tsvector('english', linetext);
-- Index the lexemes using GiST.
-- To use GIN just replace "gist" below with "gin".
CREATE INDEX war_and_peace_tsv
ON war_and_peace
USING gist (tsv);
-- Make sure the database has statistics.
VACUUM ANALYZE war_and_peace;
With the indexing set up, here are a few searches with row counts and timings for both types of indexes:
-- Find lines with "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv @@ to_tsquery('english', 'gentlemen');
84 rows, gist: 2.006 ms, gin: 0.194 ms
-- Find lines with "ladies".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv @@ to_tsquery('english', 'ladies');
184 rows, gist: 3.549 ms, gin: 0.328 ms
-- Find lines with "ladies" and "gentlemen".
EXPLAIN ANALYZE
SELECT * FROM war_and_peace
WHERE tsv @@ to_tsquery('english', 'ladies & gentlemen');
1 row, gist: 0.971 ms, gin: 0.104 ms
Now, since the GIN index was about 10 times faster than the GiST index you might wonder why anyone would use GiST for indexing text data. The answer is that GiST is generally faster to maintain. So if your text data is highly volatile the GiST index might win on overall load, while the GIN index would win if you are only interested in search time or for a read-mostly workload.
Without the index the above queries take anywhere from 17.943 ms to 23.397 ms since they must scan the entire table and check for a match on each row.
The GIN indexed search for rows with both "ladies" and "gentlemen" is over 172 times faster than a table scan in exactly the same database. Obviously the benefits of indexing would be more dramatic with bigger documents than were used for this test.
The setup is, of course, a one-time thing. With a trigger to maintain the tsv column, any changes made would instantly be searchable without redoing any of the setup.
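One way to set up such a trigger (a sketch using the built-in trigger function; an explicit PL/pgSQL trigger calling to_tsvector would work just as well):

-- Keep tsv in sync with linetext on every insert or update.
CREATE TRIGGER war_and_peace_tsv_update
  BEFORE INSERT OR UPDATE ON war_and_peace
  FOR EACH ROW
  EXECUTE PROCEDURE
    tsvector_update_trigger(tsv, 'pg_catalog.english', linetext);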
With a slow PostgreSQL query, if you show the table structure (including indexes), the problem query, and the output from running EXPLAIN ANALYZE of your query, someone can almost always spot the problem and suggest how to get it to run faster.
UPDATE (Dec 9 '16)
I didn't mention what I used to get the prior timings, but based on the date it probably would have been the 9.2 major release. I just happened across this old thread and tried it again on the same hardware using version 9.6.1, to see whether any of the intervening performance work helps this example. The single-term queries only improved by about 2%, but searching for lines with both "ladies" and "gentlemen" about doubled in speed, to 0.053 ms (i.e., 53 microseconds), when using the GIN (inverted) index.
Solr is designed primarily for searching data, not for storage. This enables it to discard much of the functionality required of an RDBMS, so it (or rather Lucene) concentrates purely on indexing data.
As you've no doubt discovered, Solr lets you both search and retrieve data from its index. It's the latter (optional) capability that leads to the natural question... "Can I use Solr as a database?"
The answer is a qualified yes, and I refer you to the following:
https://stackoverflow.com/questions/5814050/solr-or-database
Using Solr search index as a database - is this "wrong"?
For the guardian solr is the new database
My personal opinion is that Solr is best thought of as a searchable cache between my application and the data mastered in my database. That way I get the best of both worlds.
The biggest difference is that a Lucene/Solr index is like a single-table database without any support for relational queries (JOINs). Remember that an index is usually only there to support search, not to be the primary source of the data. So your database may be in "third normal form", but the index will be completely de-normalized and contain mostly just the data needed to be searched.
Another possible reason is that databases generally suffer from internal fragmentation; they need to perform too many semi-random I/O operations on large requests.
What that means is that, given the index architecture of a database, a query first hits the indexes, which in turn point to the data. If the data to be retrieved is widely scattered, assembling the result takes a long time, and that seems to be what happens in databases.
Solr (Lucene) builds an inverted index, which is why retrieving data is much faster. I have read that PostgreSQL has a similar facility, but I'm not sure whether you used it.
The performance differences you observed can also be attributed to "what is being searched for?" and "what are the user queries?"
I use Tableau and have a table with 140 fields. Due to the size/width of the table, performance is poor. I would like to remove fields to increase reading speed, but my user base is so large that at least one person uses each of the fields, while 90% use the same ~20 fields.
What is the best solution to this issue? (Tableau is our BI tool, BigQuery is our database)
What I have done thus far:
In Tableau, it isn't clear how to use dynamic data sources that change based on the field selected. Ideally, I would like to have smaller views OR denormalized tables, and as users make their selections in Tableau, the underlying data source would switch to the table or view containing that field.
I have tried a simple version of a large view, but it performed worse than my large table and read significantly more data (remember, I am on BigQuery, so I care very much about bytes read due to costs).
Suggestion 1: Extract your data.
Especially when it comes to data sources that charge per byte queried (BigQuery, Athena, etc.), extracts make a great deal of sense. How far you can take this depends on how 'fresh' the data must be for the users. (Of course all users will say 'live is the only way to go', but dig into this a little and see what is actually needed.) Refreshes can be scheduled as often as every 15 minutes. The real power of refreshes comes in the form of 'incremental refreshes', whereby only new records are added (along an index of int or date). This is a great way to reduce costs, provided your BigQuery tables are partitioned (which they should be). Since Tableau extracts are contained within .hyper files, a structure of Tableau's own design/control, they are extremely fast and optimized perfectly for use in Tableau.
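As an illustration of the partitioning point, a date-partitioned BigQuery table (all names here are hypothetical) might be declared like this, so that incremental refreshes and date-filtered queries only touch recent partitions:

-- Hypothetical wide table, partitioned by load date.
CREATE TABLE analytics.wide_fact_table
(
  record_id   INT64,
  loaded_date DATE,
  field_001   STRING
  -- ... the remaining ~140 fields
)
PARTITION BY loaded_date;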
Suggestion 2: Create 3 data sources (or more). Certify these data sources after validating that they provide correct information, and provide users with clear descriptions:
1. Original large dataset.
2. Subset of ~20 fields for the 90%.
3. Remainder of fields for the 10%.
4. Extract of 1.
5. Extract of 2.
6. Extract of 3.
Importantly, if field names match in each data source (i.e. never changed manually), then it should be easy for a user to 'scale up' to larger datasets as needed. This means they could generally start out with a small subset of data for their exploration, and then use the 'replace data source' feature to switch to a larger data source while keeping their same views. (This wouldn't work as well, if at all, for scaling down, though.)
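If the smaller data sources are backed by views in BigQuery rather than by the full table, the column pruning alone cuts the bytes scanned, since BigQuery bills by the columns actually read. A sketch with hypothetical names:

-- View exposing only the ~20 commonly used fields; most users point
-- Tableau (or an extract) at this instead of the 140-field table.
CREATE VIEW analytics.core_fields AS
SELECT record_id,
       loaded_date,
       field_001,
       field_002
       -- ... the rest of the ~20 common fields
FROM   analytics.wide_fact_table;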
I have 4 text columns of interest.
Each column is up to about 100 characters.
The text in 3 of the columns is mostly Latin words. (The data is a biological catalog, and these are names of things.)
The data is currently about 500 rows. I don't expect this to grow beyond 1000.
A small number of users (under 10) will have editing privileges to add, update, and delete data. I do not expect these users to put a heavy load on the database.
So all this suggests a pretty small data set to consider.
I need to perform a search on all 4 columns for rows where at least 1 column contains the search text (case insensitive). The query will be issued (and the results served) via a web application. I'm a bit lost about how to approach it.
PostgreSQL offers a few options for improving text searching speed. The possible options built into PostgreSQL I've been considering are
1. Don't try to index this at all. Just use ILIKE, LIKE on lower, or similar. (Without an index?)
2. Index with pg_trgm to improve search speed. I would assume that I would need to index the concatenation somehow.
3. Full text searching. I assume this would involve concatenating for the index also.
Unfortunately, I'm not really familiar with the expected performance of any of these or their benefits and trade-offs, so it's hard to know what I should try first and what I shouldn't even consider. Some things I have read suggest that building the indexes for options 2 and 3 is pretty slow, which sits badly with the occasional modifications I'll have going on. And the mixed languages make full text search seem unattractive, since it appears to be language-based, unless it can handle multiple languages simultaneously. Should I expect that for data this small, a simple ILIKE or maybe a LIKE on lower is probably fast enough? Or is the indexing fast enough for the low rate of modifications on data this small? Would I be better off looking for something outside the database?
Granted, I would have to actually benchmark all of these to know for sure what's fastest, but unfortunately I don't have much time for this project. So what are the benefits and trade-offs of these methods? Which of these options are not appropriate for solving this type of problem? What are some other types of solutions (including potentially outside the database) worth considering?
(I suppose I might find some kind of beginner's tutorial on text searching in PG useful, but my searches mostly turn up Full Text Search, and I don't even know whether that's useful for me.)
I'm on PG 9.2.4, so any goodies pre-9.3 are an option.
Update: I've expanded this answer into a detailed blog post.
Rather than focusing purely on speed, please consider search semantics first. Define your requirements.
For example, do users need to be able to differentiate based on the order of terms? Should a search for "radiata pinus" find "pinus radiata"? Does the same rule apply to words within a column as between columns?
Are spaces always word separators, or are spaces within a column part of the search term?
Do you need wildcards? If so, do you need only left-anchored wildcards (think staph%), or do you need right-anchored or infix wildcards too (%ccus, p%s)? Only pg_trgm will help you with infix wildcards. Suffix wildcards can be handled by an index on the reverse() of a word, but that gets clumsy quickly, so in practice pg_trgm is the best option there.
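A sketch of the pg_trgm route, with hypothetical table and column names (your data would need one such index per searched column, or an index on a concatenation):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GIN index; supports LIKE/ILIKE with leading, trailing or infix wildcards.
CREATE INDEX species_name_trgm_idx
  ON species USING gin (name gin_trgm_ops);

SELECT * FROM species WHERE name ILIKE '%ccus%';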
If you're mostly searching for discrete words and word-order isn't important, Pg's full-text search with to_tsvector and to_tsquery will be desirable. It supports left-anchored wildcard searches, weighting, categories, etc.
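For the four-column, mixed-language case in the question, one option (a sketch only; table and column names are hypothetical) is an expression index over the concatenated columns using the 'simple' configuration, which skips language-specific stemming:

-- GIN index over all four columns; coalesce() guards against NULLs.
CREATE INDEX species_fts_idx ON species USING gin
  (to_tsvector('simple',
     coalesce(genus, '')       || ' ' || coalesce(epithet, '') || ' ' ||
     coalesce(common_name, '') || ' ' || coalesce(notes, '')));

-- The query must repeat the same expression for the index to be used.
SELECT *
FROM   species
WHERE  to_tsvector('simple',
         coalesce(genus, '')       || ' ' || coalesce(epithet, '') || ' ' ||
         coalesce(common_name, '') || ' ' || coalesce(notes, ''))
       @@ to_tsquery('simple', 'pinus & radiata');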
If you're mostly doing prefix searches of discrete columns then simple LIKE queries on a regular b-tree index per column will be the way to go.
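A sketch of that option (hypothetical names again); text_pattern_ops lets LIKE 'prefix%' use the b-tree regardless of locale, and lower() keeps it case-insensitive:

CREATE INDEX species_genus_prefix_idx
  ON species (lower(genus) text_pattern_ops);

-- Left-anchored search that can use the index above.
SELECT * FROM species WHERE lower(genus) LIKE 'staph%';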
So. Figure out what you need, then how to do it. Your current uncertainty probably stems partly from not really knowing quite what you want.
For 1000 rows, I would guess that LIKE together with lower() should be fast enough. After a couple of queries the table will most probably be completely cached.
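At that size, something along these lines (hypothetical table and column names, with 'radiata' standing in for the user's search text) may well be all that's needed, even as a sequential scan:

-- Case-insensitive substring match across all four columns.
SELECT *
FROM   species
WHERE  genus       ILIKE '%radiata%'
   OR  epithet     ILIKE '%radiata%'
   OR  common_name ILIKE '%radiata%'
   OR  notes       ILIKE '%radiata%';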
Regarding the indexing using pg_trgm: you are talking about "occasional" updates/inserts to the table. I would think that the additional costs of using a trigram index would only show up when you update/insert that table a lot - like several times a second.
If "occasional" only means several times an hour (or even less), then I doubt you'd see the difference in real live. I think somewhere in Depesz's blob there was also an article that compared the insert speed with and without a trigram index, but I can't find it anymore.
I am hosting a mongodb database for a service that supports full text searching on a collection with 6.8 million records.
Its text index includes ten fields with varying weights.
Most searches take less than a second. Some searches take two to three seconds. However, some searches take 15 - 60 seconds! The 15-60 second search cases are unacceptable for my application. I need to find a way to speed those up.
Searching takes 15-60 seconds when words that are very common in the index are used in the search query.
It seems that the text search feature does not support lazy parameters. My first thought was to cache a list of the 50 most common words in my text index and then ask MongoDB to evaluate those last (lazily), on top of the filtered results returned by the less common parameters. Hopefully people are still with me. For example, say I have a query "products chocolate", where products is common and chocolate is uncommon. I would like to be able to ask MongoDB to evaluate "chocolate" first, and then filter those results with the "products" term. Does anyone know of a way to achieve this?
I can achieve the above scenario by omitting the most common words (i.e. "products") from the db query and then reapplying the common-term filter on the application side after it has received the records found by the db. It is preferable for all query logic to happen on the database, but I am open to application-side processing for a speed payoff.
There are still some holes in this design. If a user only searches common terms, I have no choice but to hit the database with all the terms. From preliminary reading, I gather that it is not recommended (or not supported) to have multiple text indexes (with different names) on the same collection. My plan is to create two identical collections, each with my 6.8M records, but with different indexes: one for common words and one for uncommon words. This feels kludgy and clunky, but I am willing to do it for a speed increase.
Does anyone have any insight and/or advice on how to speed up this system? I'd like as much processing as possible to happen on the database, to keep it fast. I'm sure my little 6.8M record collection is not the largest that MongoDB has seen. Thanks!
Well, I worked around these performance issues by allowing the MongoDB full text search to run in an OR-based format. I'm prioritizing my results by fine-tuning the weights on my indexed fields and simply ordering by rank. I do get more results than desired, but that's not a huge problem, because the weighted results that appear at the top will most likely be consumed before my user gets to the less relevant results at the bottom.
If anyone is struggling with MongoDB text search performance using AND-only searching, just switch back to OR and control your results using weights. It performs leaps and bounds better.
Hope that helps.
This is the exact same issue as $all versus $in. $all only uses the index for the first keyword in the array. I believe you're seeing the same issue here, which is why the OR (a.k.a. IN) form works for you.
I have a MongoDB collection containing attributes such as:
longitude, latitude, start_date, end_date, price
I have over 500 million documents.
My question is how to search by lat/long, date range and price as efficiently as possible?
As I see it my options are:
1. Create a geospatial index on lat/long and use MongoDB's proximity search... and then filter this based on date range and price.
I have yet to test this, but I am worried that the amount of data would be too much to search quickly when we have around one search a second. Have you had experience with how MongoDB would react under these circumstances?
2. Split the data into multiple collections by location, i.e. by cities like london_collection, paris_collection, new_york_collection.
I would then have to query by lat/long first to find the nearest city collection, and then do a MongoDB spatial search on the subset of data in that collection with date and price filters.
I would have an uneven distribution of documents, as some cities would have more documents than others.
3. Create collections by dates instead of location. Same as above, but each document is allocated to a collection based on its date range.
The problem here is searches whose date range straddles multiple collections.
4. Create unique ids based on city_start_date_end_date for each document.
Again I would have to use my lat/long query to find the nearest city and append the date range to access the key. This seems to be pretty fast, but I don't really like the city lookup aspect... it seems a bit ugly.
I am in the process of experimenting with option 1, but would really like to hear your ideas before I go too far down one particular path.
How do search engines split up and manage their data... this must be a similar kind of problem?
Also I do not have to use MongoDB, I'm open to other options?
Many thanks.
Indexing and data access performance is a deep and complex subject. A lot of factors can affect the most efficient solution, including the size of your data sets, the read-to-write ratio, the relative performance of your IO and backing store, etc.
While I can't give you a concrete answer, I can suggest investigating Morton numbers as an efficient way of combining multiple similar numeric values, like lat/longs.
Morton number
Why do you think option 1 would be too slow? Is this the result of a real world test or is this merely an assumption that it might eventually not work out?
MongoDB has native support for geohashing and turns coordinates into a single number which can then be searched by a BTree traversal. This should be reasonably fast. Messing around with multiple collections does not seem like a very good idea to me. All it does is replace one level of BTree traversal on the database with some code you still need to write, test and maintain.
Don't reinvent the wheel, but try to optimize the most obvious path (1) first:
Set up geo indexes
Use explain to make sure your queries actually use the index
Make sure your indexes fit into RAM
Profile the database using the built-in profiler
Don't measure performance on a 'cold' system where the indexes didn't have a chance to go to RAM yet
If possible, avoid geoNear and stick to the faster (but not perfectly spherical) near queries
If you're still hitting limits, look at sharding to distribute reads and writes to multiple machines.
We have a table with, say, 5 indices (one clustered).
Question: will it somehow negatively affect optimizer performance - either speed or accuracy of index picks - if all 5 indices start with the same exact field? (all other things being equal).
It was suggested by someone at the company that it may have a detrimental effect on performance, and thus one of the indices needs to have its first two fields switched.
I would prefer to avoid change if it is not necessary, since they didn't back up their assertion with any facts/reasoning, but the guy is senior and smart enough that I'm inclined to seriously consider what he suggests.
NOTE1: The basic answer "tailor the index to the where clauses and overall queries" is not going to help me - the index that would be changed is a covered index for the only query using it and thus the order of the fields in it would not affect the IO amount. I have asked a separate SO question just to confirm that assertion.
NOTE2: That field is a date when the records are inserted, and the table is pretty big, if this matters. It has data for ~100 days, about equal # of rows per date, and the first index is a clustered index starting with that date field.
The optimizer has to think more about which if any of the indexes to use if there are five. That cost is usually not too bad, but it depends on the queries you're asking of it. In principle, once the query is optimized, the time taken to execute it should be about the same. If you are preparing SELECT statements for multiple uses, that won't matter much. If every query is prepared afresh and never reused, then the overhead may become a drag on the system performance - particularly if it turns out that it really doesn't matter which of the indexes is actually used for most queries (a moderately strong danger when five indexes all share the same leading columns).
There is also the maintenance cost when the data changes: updating five indexes takes noticeably longer than updating just one, plus you are using roughly five times as much disk storage for five indexes as for one.
I do not wish to speak for your senior colleague but I believe you have misinterpreted what he said, or he has not expressed himself explicitly enough for you to understand.
One of the things that stands out about poorly designed, and therefore poorly performing, tables is that they have many indices on them, and the leading columns of those indices are all the same. Every single time.
So it is pointless debating (the debate is too isolated) whether there is a server cost for indices which all have the same leading columns; the problem is the poorly designed table which exposes itself in myriad ways. That is a massive server cost on every access. I suspect that that is where your esteemed colleague was coming from.
A monotonic column is a very poor choice for an index (understood, you need at least one). But when you use that monotonic column to force uniqueness in some other index, which would otherwise be irrelevant (due to low cardinality, such as SexCode), that is another red flag to me. You've merely forced an irrelevant index to be slightly relevant; the queries, except for the single covered query, perform poorly on anything beyond the simplest select via primary key.
There is no such thing as a "covered index", but I understand what you mean: you have added an index so that a certain query will execute as a covered query. Another flag.
I am with Mitch, but I am not sure you get his drift.
Last, responding to your question in isolation: having five indices with the leading columns all the same would not cause a "performance problem" beyond the one you already have due to the poor table design, but it will cause angst and unnecessary manual labour for the developers chasing down weird behaviour, such as "how come the optimiser used index_1 for my query, but today it is using index_4?".
Your language consistently (and particularly in the comments) displays a manner of dealing with issues in isolation. The concept of a server and a database, is that it is a shared central resource, the very opposite of isolation. A problem that is "solved" in isolation will usually result in negative performance impact for everyone outside that isolated space.
If you really want the problem dealt with, fully, post the CREATE TABLE statement.
I doubt it would have any major impact on SELECT performance.
BUT it probably means you could reorganise those indexes (based on a representative query workload) to serve queries more efficiently.
I'm not familiar with the recent versions of Sybase, but in general, with all SQL servers, the main (and almost only) performance impact of indexes is on INSERT, DELETE and UPDATE queries. Basically, each change to the database requires the data table per se (or the clustered index) to be updated, as well as all the indexes.
With regard to SELECT queries, having "too many" indexes may have a minor performance impact, for example by introducing competition for cache pages. But I doubt this would be a significant issue in most cases.
The fact that the first column in all these indexes is the date, assuming a generally monotonic progression of the date values, is a positive thing (with regard to CRUD operations), for it will keep the need for splitting/balancing the indexes to a minimum (since most inserts are at the end of the indexes).
Also, this table appears to be small enough ("big" is a relative word ;-) ) that some experimentation with it, to assess performance issues in a more systematic fashion, could probably be done relatively safely and easily without interfering much with production. (Unless the 10k or so records are very wide, or the queries-per-second rate is high, etc.)