I currently have a MongoDB setup with a fairly large database (about 250 million documents). At present, I have one main collection that holds the majority of the data, with a single index on time. This gives acceptable query times when only time appears in the query criteria (the index is used).
The problem is when I need to use a compound key - the time index uses about 2.5GB of memory, and I only have 4GB on the server, so I don't want to create a compound key index since that will prevent all indexes from fitting in memory and thus slow things down a lot.
So my question is this: can I query first for time, and then query that subset for the other variables?
I should point out that I am using the Ruby driver.
At the moment, my query looks like this (this is very slow):
# find_one returns a single document rather than a cursor, so sort has to be
# applied to a find() cursor; take the first document after sorting by time
trade_stop_loss_time = ticks.find(
  "time" => { "$gt" => trade_time_open, "$lte" => trade_time_close },
  "bid" => { "$lte" => stop_loss_price }
).sort("time" => 1).first
Thanks!
If you simply perform the query you present, the database should be smart enough to do exactly that.
The query you have should basically filter down the candidate set using the time index, then scan the remaining objects for the bid parameter. This should be a lot more efficient than doing the scan on the client.
You should definitely run explain() on your query to find out what it's doing. If it uses an index (BtreeCursor) and the number of scanned objects is just the number of items in the given time frame, it's doing fine. I don't think there's a better way than that, given your constraints. Doing the same operation on the client will definitely be slower.
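From the Ruby driver, explain() can be called on the cursor. A quick sketch (the field names shown are from the older explain output that reports BtreeCursor, so adjust for your driver/server version):

# run the same query and ask the server how it was executed
plan = ticks.find(
  "time" => { "$gt" => trade_time_open, "$lte" => trade_time_close },
  "bid" => { "$lte" => stop_loss_price }
).explain
puts plan["cursor"]   # e.g. "BtreeCursor time_1" when the time index is used
puts plan["nscanned"] # how many objects were examined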
Of course, a limit and a small time frame will help to make your query faster, but these might be external factors. mongostat might also help to find the problem.
However, if your documents and/or time spans are large, it might still be better to add the compound index: loading a lot of large documents from disk (since your RAM is already full) will take some time. Paging the index from disk is also slow, but it's much less data.
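Should you decide to try the compound index, creating it from the Ruby driver would look roughly like this (a sketch using the 1.x-style call; method names vary between driver versions):

# compound index on time then bid, so both the time range and the bid filter
# can be answered from the index
ticks.create_index([["time", 1], ["bid", 1]])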
A good answer can be given only by experiment.
You could return the results using just the time index, then filter them further client-side? Other than that, I think you're pretty much out of luck.
I want to save a large graph in Redis and was trying to accomplish this using RedisGraph. To test this, I first created a test graph to check the performance characteristics.
The graph is rather small for the purposes we need.
Vertices: about 3.5 million
Edges: about 18 million
And this is very limited for our purposes; we would need to be able to scale this to hundreds of millions of edges in a single database.
In any case, I was checking space and performance requirements, but stopped after loading in only the vertices, because the performance of:
GRAPH.QUERY gid 'MATCH (t:token {token: "some-string"}) RETURN t'
is over 300 milliseconds for just this retrieval, which is absolutely unacceptable.
Am I missing an obvious way to improve the retrieval performance, or is that currently the limit of RedisGraph?
Thanks
Adding an index will speed things up a lot when matching.
CREATE INDEX ON :token(token)
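In RedisGraph the index is created through an ordinary GRAPH.QUERY call, so with the graph key from the question it would be:

GRAPH.QUERY gid 'CREATE INDEX ON :token(token)'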
From my investigations, I think that at least one instance of the item must exist for an index to be created, but I haven't measured the extra overhead of creating the index early and then adding most of the new nodes, versus creating it only after all items are in the tree so they can be indexed en masse.
If all nodes are labeled "token", then without an index RedisGraph has to scan 3.5 million entities, comparing each entity's "token" attribute against the value you've provided ("some-string").
To speed this up, I would recommend either adding an index or limiting the number of results you would like to receive using LIMIT.
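For example, if a single matching node is enough, a LIMIT caps the number of results the query has to produce:

GRAPH.QUERY gid 'MATCH (t:token {token: "some-string"}) RETURN t LIMIT 1'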
Also worth mentioning is that the first query to be served might take a while longer than the following queries, due to internal memory management.
For Oracle, and specifically in relation to application tuning, when might it make sense not to have an index on a table, and why?
There is a cost associated to having an index:
it takes up disk space
it slows down updates (index needs to be updated as well)
it makes query planning more complex (slightly slower, but more importantly increased potential for bad decisions)
These costs are supposed to be offset by the benefit of more efficient query processing (faster, less I/O).
If the index is not used enough to justify the cost, then having the index will be negative.
In particular, if a column has low cardinality (think flags like 'Y' and 'N'), an index on it won't help much: with only a handful of distinct values, the optimizer will probably choose not to use the index at all. An interesting aside concerns NULLs: entirely NULL keys are not stored in a single-column B-tree index, so if the column is mostly NULL and your query criteria ask for actual values, the index can be very effective, because it contains only the non-NULL rows and most of the table is never evaluated. The reverse also holds: for the "is null" case the index will never be used, so if you have a query with a WHERE clause like "where mytable.mycolumn is null", abandon all indexes ye who enter here.
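A hypothetical illustration (the table and column names are made up):

-- FLAG holds 'Y'/'N' for most rows and NULL for the rest
CREATE INDEX orders_flag_idx ON orders (flag);

-- low cardinality: only two distinct values, so the optimizer will likely prefer a full scan
SELECT * FROM orders WHERE flag = 'N';

-- entirely NULL keys are not stored in a single-column B-tree index,
-- so this predicate can never be answered from orders_flag_idx
SELECT * FROM orders WHERE flag IS NULL;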
If a table has very little data (a small number of rows), then an index doesn't serve you. An index makes it quick to search on a specific attribute, and if the application you are working with doesn't need a fast lookup, an index does very little for you.
I'm working on a simple index containing one million docs with 30 fields each.
a match-all query (q=*:*) with a very low start value (0 for instance) takes only a few milliseconds (~1 ms actually)
the higher the start value gets, the slower Solr becomes...
start=100000 => 171 ms
start=500000 => 844 ms
start=1000000 => 1274 ms
I'm a bit surprised by this performance degradation, and I'm worried, since the index is expected to grow to over a hundred million documents within a few months.
Did I maybe do something wrong in the schema? Or is this normal behavior, given that paging deep beyond the first few hundred documents should usually not happen? :)
EDIT
Thanks guys for those explanations - I was guessing something like that, but I prefer to be sure that this was not related to the way the schema has been defined. So the question is solved for me.
Every time you make a search query to Solr, it collects all the documents matching the query. It then skips documents until the start value is reached, and only then returns the results.
Another point to note is that every time you make the same search query with a higher start value, those documents are likely not in the cache either, so the cache might be refreshed as well (depending on the size and type of cache you have configured).
Pagination naively works by retrieving all the documents up to the cut-off point, throwing them away, then fetching enough documents to satisfy the number of documents requested, and returning those.
If you're doing deep paging (going far into the dataset), this becomes expensive, which is why cursorMark support was implemented (see "Fetching A Large Number of Sorted Results: Cursors") to support near-instant pagination deep into a large set of documents.
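A sketch of the cursor-style request (the sort must end on the uniqueKey field, assumed here to be id; the first request passes cursorMark=*, and each response returns a nextCursorMark value to send as cursorMark in the next request):

q=*:*&rows=100&sort=timestamp asc,id asc&cursorMark=*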
Yonik also has a good blog post about deep pagination in Solr.
Right now, we are using Solr as a full-text index, where all fields of the documents are indexed but not stored.
There are some million documents, index-size is 50 GB. Average query-time is around 100ms.
To use features like highlighting, we are thinking about additionally storing the text. But that could double the size of the index files.
I know there is no strictly linear relation between index size and query time; increasing the number of documents by a factor of 10 makes nearly no difference in query time.
But still, the system (Solr/Lucene/Linux/...) has to handle more information - the index files, for example, use many more inodes, and so on.
So I'm sure there is some impact of index size on query time. (But is it noticeable?)
1st:
Do you think I'm right?
Do you have any experience of how index size and search speed compare with and without stored text?
Is it smart and reasonable to blow up the index by storing the documents?
2nd:
Do you know how Solr/Lucene handles stored text? Maybe in separate files? (So that there is no impact on simple searches where no stored text is needed?)
Thank you.
Yes, it's absolutely true that the index grows if you make big fields stored, but if you want to highlight them, you have no other way. I don't think speed will decrease that much - perhaps a little, because you need to transfer more data when retrieving results, but that's not very relevant.
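On the schema side this is just a matter of flipping stored on the relevant field, roughly like this (the field and type names here are only examples):

<field name="text" type="text_general" indexed="true" stored="true"/>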
Regarding the Lucene index format and the different files within the index, you can have a look at the Lucene file formats documentation: the stored fields are kept in their own dedicated file.
Performance question about indexing large amounts of data. I have a large table (~30 million rows), with 4 of the columns indexed to allow for fast searching. Currently I set the indexes up first, then import my data. This takes roughly 4 hours, depending on the speed of the db server. Would it be quicker/more efficient to import the data first, and then build the indexes?
I'd temper af's answer by saying that "index first, insert after" would probably be slower than "insert first, index after" where you are inserting records into a table with a clustered index, but not inserting them in the natural order of that index. The reason is that for each insert, the data rows themselves would have to be reordered on disk.
As an example, consider a table with a clustered primary key on a uniqueidentifier field. The (nearly) random nature of a GUID means that one row may be added at the top of the data, causing all data in the current page (and maybe data in lower pages too) to be shuffled along, while the next row is added at the bottom. If the clustering were on, say, a datetime column, and you happened to be adding rows in date order, the records would naturally be inserted in the correct order on disk, and expensive data sorting/shuffling operations would not be needed.
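As a rough T-SQL sketch of the two situations (hypothetical table names):

-- random GUID clustering key: each insert can land anywhere, forcing page splits and shuffling
CREATE TABLE load_guid (
    id UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY CLUSTERED,
    payload VARCHAR(100)
);

-- datetime clustering key with rows arriving in date order: inserts simply append at the end
CREATE TABLE load_time (
    created_at DATETIME NOT NULL,
    payload VARCHAR(100)
);
CREATE CLUSTERED INDEX ix_load_time ON load_time (created_at);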
I'd back up Winston Smith's answer of "it depends", but suggest that your clustered index may be a significant factor in determining which strategy is faster for your current circumstances. You could even try not having a clustered index at all, and see what happens. Let me know?
Inserting data while indexes are in place forces the DBMS to update them after every row. Because of this, it's usually faster to insert the data first and create the indexes afterwards, especially if there is that much data.
(However, it's always possible there are special circumstances which may cause different performance characteristics. Trying it is the only way to know for sure.)
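A rough sketch of the "insert first, index after" pattern (generic SQL, hypothetical names):

-- 1. load the bulk data into the table while it has no secondary indexes
--    (bulk insert / LOAD DATA / COPY the ~30 million rows here)

-- 2. build the search indexes once, in a single pass over the data
CREATE INDEX ix_big_table_col1 ON big_table (col1);
CREATE INDEX ix_big_table_col2 ON big_table (col2);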
It will depend entirely on your particular data and indexing strategy. Any answer you get here is really a guess.
The only way to know for sure, is to try both and take appropriate measurements, which won't be difficult to do.