Optimizing RedisGraph query performance (MATCH)

I want to store a large graph in Redis and am trying to accomplish this using RedisGraph. To test this, I first created a test graph to check the performance characteristics.
The graph is rather small for the purposes we need.
Vertices: about 3.5 million
Edges: about 18 million
And this is very limited for our purposes; we would need to be able to increase this to hundreds of millions of edges in a single database.
In any case, I was checking space and performance requirements but stopped after loading only the vertices, because the query
GRAPH.QUERY gid 'MATCH (t:token {token: "some-string"}) RETURN t'
takes over 300 milliseconds for just this retrieval, which is absolutely unacceptable.
Am I missing an obvious way to improve the retrieval performance, or is that currently the limit of RedisGraph?
Thanks

Adding an index will speed things up a lot when matching.
CREATE INDEX ON :token(token)
From my investigations, I think at least one node with the label must exist before an index can be created on it, but I haven't measured the extra overhead of creating the index early and then adding most of the new nodes, versus loading all items first and indexing them en masse.

If all nodes are labeled "token", then without an index RedisGraph has to scan all 3.5 million entities, comparing each entity's "token" attribute against the value you provided ("some-string").
To speed this up, I recommend either adding an index or limiting the number of results you want to receive using LIMIT.
Also worth mentioning: the first query to be served might take a while longer than subsequent queries due to internal memory management.
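As a rough illustration of both suggestions, here is a minimal Python sketch using the redis-py client (the connection details and the graph key gid are assumptions based on the question):

import redis

# Assumed connection details and graph key; adjust to your setup.
r = redis.Redis(host="localhost", port=6379)

# Create an exact-match index on the "token" property of :token nodes,
# so MATCH can use an index lookup instead of scanning every labeled node.
r.execute_command("GRAPH.QUERY", "gid", "CREATE INDEX ON :token(token)")

# Same MATCH as in the question; LIMIT additionally caps the work per query.
result = r.execute_command(
    "GRAPH.QUERY", "gid",
    'MATCH (t:token {token: "some-string"}) RETURN t LIMIT 1',
)
print(result)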

Related

Parse: limitations of count()

Anyone who's read the Parse documentation has stumbled upon this:
Caveat: Count queries are rate limited to a maximum of 160 requests per minute. They can also return inaccurate results for classes with more than 1,000 objects. Thus, it is preferable to architect your application to avoid this sort of count operation (by using counters, for example.)
Why's there such limitation and inaccuracy?
To quote the Parse Engineering Blog Post: Building Scalable Apps on Parse
Suppose you are building a product catalog. You might want to display the count of products in each category on the top-level navigation screen. If you run a count query for each of these UI elements, they will not run efficiently on large data sets because MongoDB does not use counting B-trees. Instead, we recommend that you use a separate Parse Object to keep track of counts for each category. Whenever a product gets added or deleted, you can increment or decrement the counts in an afterSave or afterDelete Cloud Code handler.
To add on to this, here is another quote by Hector Ramos from the Parse Developers Google Group
Count queries have always been expensive once you throw some constraints in. If you only care about the total size of the collection, you can run a count query without any constraints and that one should be pretty fast, as getting the total number of records is a different problem than counting how many of these match an arbitrary list of constraints. This is just the reality of working with database systems.
The inaccuracy is not due to the 1000 request object limit. The count query will try to get the total number of records regardless of size, but since the operation may take a large amount of time to complete, it is possible that the database has changed during that window and the count value that is returned may no longer be valid.
The recommended way to handle counts is to essentially maintain your own index using before/after save hooks. However, this is also a non-ideal solution because save hooks can arbitrarily fail part way through and (worse) afterSave hooks have no error propagation.
The limitation is simply to stop people using counts too much; in effect they're just as costly at runtime as full queries.
The inaccuracy is because queries are limited to 1000 result objects (100 by default) and counts have the same hard limit.
You can run a recursive query to build up a count, but it's a crappy option. Hence the only really good option at this point in time (and as far as we can see in the future) is to keep an index of the things you're interested in counting and update the counts when anything changes. You would usually do that with save hooks in cloud code.
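As a minimal sketch of that counter-maintenance pattern, here it is written directly against MongoDB with pymongo rather than Parse Cloud Code (the database, collection, and field names are made up for illustration):

from pymongo import MongoClient

client = MongoClient()              # assumes a local mongod
db = client["catalog_demo"]         # hypothetical database name

def add_product(product):
    # Write the product, then bump the per-category counter atomically.
    db.products.insert_one(product)
    db.category_counts.update_one(
        {"_id": product["category"]},
        {"$inc": {"count": 1}},
        upsert=True,
    )

def remove_product(product_id):
    product = db.products.find_one_and_delete({"_id": product_id})
    if product:
        db.category_counts.update_one(
            {"_id": product["category"]},
            {"$inc": {"count": -1}},
        )

# Reading a count is now a single document fetch instead of a count query.
doc = db.category_counts.find_one({"_id": "shoes"})
print(doc["count"] if doc else 0)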

SolR queries on simple index are getting slower as start param is higher

I'm working on a simple index containing one million docs with 30 fields each.
A q=*:* query with a very low start value (0, for instance) takes only a few milliseconds (~1, actually).
The higher the start value is, the slower Solr gets:
start=100000 => 171 ms
start=500000 => 844 ms
start=1000000 => 1274 ms
I'm a bit surprised by this performance degradation, and I'm worried because the index is supposed to grow to over a hundred million documents within a few months.
Did I maybe do something wrong in the schema? Or is this normal behavior, given that paging deep past the first few hundred docs should usually not happen? :)
EDIT
Thanks guys for those explanations - I was guessing something like that; however, I'd prefer to be sure that this was not related to the way the schema has been described. So the question is solved for me.
Every time you make a search query to Solr, it collects all the documents matching the query, skips documents until the start value is reached, and then returns the results.
Another point to note is that when you repeat the same search query with a higher start value, those documents may not be in the cache either, so the cache might get refreshed as well (depending on the size and type of cache you have configured).
Naive pagination works by retrieving all the documents up to the cut-off point, throwing them away, then fetching enough documents to satisfy the requested page size and returning those.
If you're doing deep paging (going far into the result set), this becomes expensive, which is why cursorMark support was implemented (see "Fetching A Large Number of Sorted Results: Cursors") to support near-instant pagination deep into a large set of documents.
Yonik also has a good blog post about deep pagination in Solr.
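Here is a rough sketch of cursorMark paging over Solr's HTTP API in Python; the Solr URL, collection name, and the id uniqueKey field are assumptions, and a cursor requires a sort that ends on the uniqueKey:

import requests

SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"  # hypothetical URL

params = {
    "q": "*:*",
    "rows": 100,
    "sort": "id asc",    # a cursor needs a total order ending on the uniqueKey
    "cursorMark": "*",   # "*" starts at the beginning of the result set
    "wt": "json",
}

while True:
    resp = requests.get(SOLR_SELECT, params=params).json()
    for doc in resp["response"]["docs"]:
        pass  # process each document here
    next_cursor = resp["nextCursorMark"]
    if next_cursor == params["cursorMark"]:
        break  # Solr repeats the cursor when the results are exhausted
    params["cursorMark"] = next_cursor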

Mongodb Performance issue

I am using MongoDB and I need to update my documents, say 1,000 in total. A document has a basic structure like:
{
  "People": [1, 2, 3, 4],
  "Place": "Auckland",
  "Event": "Music Show"
}
I have 10,000 threads running concurrently in another VM. Each thread looks at these 1,000 documents, checks which of them match its query, and pushes a number into the People array. For example, if thread 100 finds 500 of these 1,000 documents relevant, it pushes the number 100 into the People array of all 500 documents.
For this, each of the 10,000 threads runs the following (Java driver):
update.append("$push", new BasicDBObject("People", serial_number));
collection.updateMulti(query, update);
I observe poor performance for these in-place updates (multi-query).
Is this a problem due to a write lock?
Every one of the 10,000 threads updates the documents relevant to its query, so there seems to be a lot of "waiting".
Is there a more efficient way to do these "push" operations?
Is "UpdateMulti" the right way to approach this?
Thank you for a great response - editing and adding some more information.
Some design background :
Yes, your reading of our problem is correct. We have 10,000 threads, each representing one "actor", updating up to 1,000 entities (based on the appropriate query) at a time with a $push.
Inverting the model breaks a few use cases (from our domain perspective) because it forces joins across "states" of the primary entity, which would now be spread across many collections. For example, each of these actions is a state change for that entity: E has states (e1, e2, e3, e4, e5), so e1 to e5 is represented as an aggregate array which gets updated by the 10,000 threads/processes that represent actions of external apps.
We need close to real-time aggregation as another set of "actors" look at these "states" of e1 to e5 and then respond appropriately via another channel to the "elements in the array".
What would be the "ideal" design strategy in such a case to speed up the writes?
Will sharding help? Is there a "magnitude" heuristic for this - at what lock percentage should we shard, etc.?
This is a problem because of your schema design.
It is extremely inefficient to $push multiple values to multiple documents, especially from multiple threads. It's not so much that the write lock is the problem, it's that your design made it the problem. In addition, you are continuously growing documents which means that the updates are not "in place" and your collection is quickly getting fragmented.
It seems like your schema is "upside down". You have 10,000 threads looking to add numbers representing people (I assume a very large number of people) to a small number of documents (1000) which will grow to be huge. It seems to me that if you want to embed something in something else, you might consider collections representing people and then embedding events that those people are found at - at least then you are limiting the size of the array for each person to 1,000 at most, and the updates will be spread across a much larger number of documents, reducing contention significantly.
Another option is simply to record the event/person in attendance and then do aggregation over the raw data later, but without knowing exactly what your requirements for this application are, it's hard to know which way will produce the best results - the way you have picked is definitely one that's unlikely to give you good performance.
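A minimal pymongo sketch of the inverted model suggested above (database, collection, and field names are illustrative, not taken from the question): each actor gets its own document and appends the events it matched, so writes are spread across many small documents instead of 1,000 ever-growing ones.

from pymongo import MongoClient

client = MongoClient()          # assumes a local mongod
db = client["events_demo"]      # hypothetical database name

def record_match(actor_id, event_id):
    # One document per actor, embedding the events it matched. Each $push
    # touches a different, small document, so 10,000 concurrent writers
    # contend far less than when they all grow the same 1,000 documents.
    db.actors.update_one(
        {"_id": actor_id},
        {"$push": {"events": event_id}},
        upsert=True,
    )

# Example: actor 100 matched events 1 and 7.
record_match(100, 1)
record_match(100, 7)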

Mongo db query subset

I currently have a MongoDB setup with a fairly large database (about 250m documents). At present, I have one main collection that has the majority of the data, which has a single index (time). This results in acceptable query times when only the time is in the where part of the query (the index is used).
The problem is when I need to use a compound key - the time index uses about 2.5GB of memory, and I only have 4GB on the server, so I don't want to create a compound key index since that will prevent all indexes from fitting in memory and thus slow things down a lot.
So my question is this: can I query first for time, and then query that subset for the other variables?
I should point out that I am using the Ruby driver.
At the moment, my query looks like this (this is very slow):
trade_stop_loss_time = ticks.find({
  "time" => { "$gt" => trade_time_open, "$lte" => trade_time_close },
  "bid"  => { "$lte" => stop_loss_price }
}).sort({"time" => 1}).limit(1).first
Thanks!
If you simply perform the query you present, the database should be smart enough to do exactly that.
The query you have should basically filter down the candidate set using the time index, then scan the remaining objects for the bid parameter. This should be a lot more efficient than doing the scan on the client.
You should definitely run explain() on your query to find out what it's doing. If it uses an index (BtreeCursor) and the number of scanned objects is just the number of items in the given time frame, it's doing fine. I don't think there's a better way than that, given your constraints. Doing the same operation on the client will definitely be slower.
Of course, a limit and a small time frame will help to make your query faster, but these might be external factors. mongostat might also help to find the problem.
However, if your documents and/or time spans are large, it might still be better to add the compound index: loading a lot of large documents from disk (since your RAM is already full) will take some time. Paging the index from disk is also slow, but it's much less data.
A good answer can only be given by experiment.
You could return the results using just the time index then filter them further client side? Other than that I think you're pretty much out of luck.
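As a rough way to check what the answers describe (whether the time index is used and how many documents get scanned), here is a pymongo sketch of the same query plus explain(); the connection details and example values are assumptions:

from datetime import datetime, timedelta
from pymongo import MongoClient, ASCENDING

client = MongoClient()               # assumes a local mongod
ticks = client["market"]["ticks"]    # hypothetical database/collection names

# Example values standing in for the question's variables.
trade_time_open = datetime(2024, 1, 1, 9, 0)
trade_time_close = trade_time_open + timedelta(minutes=5)
stop_loss_price = 1.2345

query = {
    "time": {"$gt": trade_time_open, "$lte": trade_time_close},
    "bid": {"$lte": stop_loss_price},
}

# explain() shows the winning plan; an index scan on time plus a filter on bid
# means the server, not the client, is narrowing down the candidates.
plan = ticks.find(query).sort("time", ASCENDING).limit(1).explain()
print(plan.get("queryPlanner", plan))

# If the scan is still too broad and RAM allows, a compound index would help:
# ticks.create_index([("time", ASCENDING), ("bid", ASCENDING)])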

Getting the most frequent items without counting every item

I was wondering if there is an algorithm for finding the "most frequent items" without having to keep a count of each item. For example, let's say I am a search engine and want to keep track of the 10 most popular searches. What I don't want to do is keep a counter for every query, since there could be too many queries for me to count (and most of them will be singletons). Is there a simple algorithm for this? Maybe something probabilistic? Thanks!
Well, if you have a very large number of queries (like a search engine presumably would), then you could just do "sampling" of queries. So you might be getting 1,000 queries per second, but if you only count one of them per second, then over a longish period of time you'd get an answer that is relatively close to the "real" answer.
This is how, for example, a "sampling" profiler works. Every n milliseconds it looks at what function is currently being executed. Over a long period of time (several seconds) you get a good idea of the "expensive" functions, because they're the ones that appear in your samples more often.
You still have to do "counting", but by taking periodic samples instead of counting every single query, you put an upper bound on the amount of data you actually have to store (e.g. at most one query per second).
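A small Python sketch of that sampling idea (the one-sample-per-second rate and the query stream are assumptions for illustration):

import time
from collections import Counter

def top_queries_by_sampling(query_stream, sample_interval=1.0, top_n=10):
    # Count only roughly one query per sample_interval seconds instead of all of them.
    counts = Counter()
    next_sample = time.monotonic()
    for query in query_stream:
        now = time.monotonic()
        if now >= next_sample:
            counts[query] += 1
            next_sample = now + sample_interval
    return counts.most_common(top_n)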
If you want the most frequent searches at any given time, you don't need endless counters keeping track of every query ever submitted. Instead, you need an algorithm that measures the number of submissions for a given query over a set period of time. This is a pretty simple algorithm. Any search submitted to your search engine, for example the word "cache", is stored for a fixed period of time called a refresh rate (the length of your refresh rate depends on the kind of traffic your search engine is getting and the number of "top results" you want to keep track of). If the refresh-rate period expires and searches for the word "cache" have not persisted, the query is deleted from memory. If searches for the word "cache" do persist, your algorithm only needs to keep track of the rate at which the word "cache" is being searched. To do this, store all searches on a "leaky counter": every entry is pushed onto the counter with an expiration date, after which the query is deleted. Your active counters are the indicators of your top queries.
Storing each and every query would be expensive, yet necessary to ensure the top 10 are actually the top 10. You'll have to cheat.
One idea is to store a table of URLs, hit counters, and timestamps, indexed by count and then timestamp. When the table reaches some arbitrary near-maximum size, start removing low-end entries that are older than a given number of days. Although old, infrequent queries won't be counted, the queries likely to make the top 10 should make it onto the table because of their faster query rate.
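A rough Python sketch of that bounded table (the size limit and age cutoff are arbitrary illustrative values):

import time

class BoundedCounter:
    # Approximate hit counts in a table of limited size; when the table is full,
    # old low-count entries are pruned first, as described above.

    def __init__(self, max_size=10_000, max_age_seconds=7 * 24 * 3600):
        self.max_size = max_size
        self.max_age = max_age_seconds
        self.table = {}  # query -> [count, last_seen]

    def hit(self, query):
        now = time.time()
        entry = self.table.get(query)
        if entry:
            entry[0] += 1
            entry[1] = now
        else:
            if len(self.table) >= self.max_size:
                self._prune(now)
            self.table[query] = [1, now]

    def _prune(self, now):
        # Prefer dropping low-count entries that are also old; if nothing is old
        # enough yet, drop the single lowest-count entry instead.
        old = [q for q, (c, seen) in self.table.items() if now - seen > self.max_age]
        victims = sorted(old, key=lambda q: self.table[q][0])
        if not victims:
            victims = [min(self.table, key=lambda q: self.table[q][0])]
        for q in victims[: max(1, len(victims) // 2)]:
            del self.table[q]

    def top(self, n=10):
        return sorted(self.table.items(), key=lambda kv: kv[1][0], reverse=True)[:n]

# Usage sketch:
# counter = BoundedCounter()
# for q in incoming_queries:
#     counter.hit(q)
# print(counter.top(10))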
Another idea would be to write a 16-bit (or more) hash function for search queries. Have a 65536-entry table holding counters and URLs. When a search is performed, increment the respective table entry and set the URL if necessary. However, this approach has a major drawback. A spam bot could make repeated queries like "cheap viagra", possibly making legitimate queries increment the spam query counters instead, placing their messages on your main page.
You want a cache, of which there are many kinds; see the Wikipedia articles on cache algorithms and on page replacement algorithms (e.g. aging).
