Mongodb Performance issue - performance

I am using mongodb and I need to update my documents say total 1000 are there. My document has a basic structure like:
{
People:[1,2,3,4],
Place:"Auckland"
Event:"Music Show"
}
I have 10,000 threads running concurrently in another VM. Each thread looks for these documents(1000), see if these 1000 documents matches the query and push a number in People array . Suppose if thread 100 found say 500 out of these 1000 documents relevant, then it pushes the number 100 in People array of all the 500 documents.
For this,
I am using for each thread(10000) the command
update.append("$push",new BasicDBObject("People",serial_number));
updateMulti(query,update);
I observe poor performance for these in-place updates (multi-query).
Is this a problem due to a write lock?
Every thread(10000) updates the document that is relevant to the query ? - so there seems to be a lot of "waiting"
Is there a more efficient way to do these "push" operations?
Is "UpdateMulti" the right way to approach this?
Th‎ank you for a great response - Editing and Adding some more information
Some design background :
Yes your reading of our problem is correct. We have 10000 threads each representing one "actor" updating upto 1000 entities ( based on the appropriate query ) at a time with a $push .
Inverting the model leads us to a few broken usecases ( from our domain perspective ) leading us to joins across "states" of the primary entity ( which will now be spread across many collections ) - ex: each of these actions is a state change for that entity - E has states ( e1,e2,e3,e4,e5 ) - So e1 to e5 is represented as an aggregate array which gets updated by the 10,000 threads/processes which represent actions of external apps.
We need close to real-time aggregation as another set of "actors" look at these "states" of e1 to e5 and then respond appropriately via another channel to the "elements in the array".
What should be the "ideal" design strategy in such a case - to speed up the writes.
Will sharding help - is there a "magnitude" heuristic for this - at what lock% should we shard etc..

This is a problem because of your schema design.
It is extremely inefficient to $push multiple values to multiple documents, especially from multiple threads. It's not so much that the write lock is the problem, it's that your design made it the problem. In addition, you are continuously growing documents which means that the updates are not "in place" and your collection is quickly getting fragmented.
It seems like your schema is "upside down". You have 10,000 threads looking to add numbers representing people (I assume a very large number of people) to a small number of documents (1000) which will grow to be huge. It seems to me that if you want to embed something in something else, you might consider collections representing people and then embedding events that those people are found at - at least then you are limiting the size of the array for each person to 1,000 at most, and the updates will be spread across a much larger number of documents, reducing contention significantly.
Another option is simply to record the event/person in attendance and then do aggregation over the raw data later, but without knowing exactly what your requirements for this application are, it's hard to know which way will produce the best results - the way you have picked is definitely one that's unlikely to give you good performance.

Related

Optimizing Redis-Graph query performance (match)

I want to save a large graph in Redis and was trying to accomplish this using RedisGraph. To test this I was creating a test-graph first to check the performance characteristics.
The graph is rather small for the purposes we need.
Vertices: about 3.5 million
Edges: about 18 million
And this is very limited for our purposes, we would need to be able to increase this to 100's of millions of edges in a single database.
In any case, I was checking space and performance requirements buit stopped after only loading in the vertices and seeing that the performance for a:
GRAPH.QUERY gid 'MATCH (t:token {token: "some-string"}) RETURN t'
Is over 300 milliseconds for just this retrieval which is absolutely unacceptable.
Am I missing an obvious way to improve the retrieval performance, or is that currently the limit of RedisGraph?
Thanks
Adding an index will speed things up a lot when matching.
CREATE INDEX ON :token(token)
From my investigations, I think that at least one instance of the item must exist for an index to be created, but I've not done any numbers on extra overhead of creating the index early and then adding most of the new nodes, rather than after all items are in the tree and they can be indexed en-mass.
In case all nodes are labeled as "token" then redisgraph will have to scan 3.5 million entities, comparing each entity "token" attribute against the value you've provided ("some-string")
for speed up I would recommend either adding an index, or limiting the number of results you would like to receive using LIMIT.
Also worth mentioning is that the first query to be served might take awhile longer then following queries due to internal memory management.

Parse: limitations of count()

Anyone who's read Parse documentation has stumbled upon this
Caveat: Count queries are rate limited to a maximum of 160 requests per minute. They can also return inaccurate results for classes with more than 1,000 objects. Thus, it is preferable to architect your application to avoid this sort of count operation (by using counters, for example.)
Why's there such limitation and inaccuracy?
To quote the Parse Engineering Blog Post: Building Scalable Apps on Parse
Suppose you are building a product catalog. You might want to display
the count of products in each category on the top-level navigation
screen. If you run a count query for each of these UI elements, they
will not run efficiently on large data sets because MongoDB does not
use counting B-trees. Instead, we recommend that you use a separate
Parse Object to keep track of counts for each category. Whenever a
product gets added or deleted, you can increment or decrement the
counts in an afterSave or afterDelete Cloud Code handler.
To add on to this, here is another quote by Hector Ramos from the Parse Developers Google Group
Count queries have always been expensive once you throw some
constraints in. If you only care about the total size of the
collection, you can run a count query without any constraints and that
one should be pretty fast, as getting the total number of records is a
different problem than counting how many of these match an arbitrary
list of constraints. This is just the reality of working with database
systems.
The inaccuracy is not due to the 1000 request object limit. The count query will try to get the total number of records regardless of size, but since the operation may take a large amount of time to complete, it is possible that the database has changed during that window and the count value that is returned may no longer be valid.
The recommended way to handle counts is to essentially maintain your own index using before/after save hooks. However, this is also a non-ideal solution because save hooks can arbitrarily fail part way through and (worse) postSave hooks have no error propagation.
The limitation is simply to stop people using counts too much, they're just as runtime costly as full queries in effect.
The inaccuracy is because queries are limited to 1000 result objects (100 by default) and counts have the same hard limit.
You can run a recursive query to build up a count, but it's a crappy option. Hence the only really good option at this point in time (and as far as we can see in the future) is to keep an index of the things you're interested in counting and update the counts when anything changes. You would usually do that with save hooks in cloud code.

Is there a probabilistic data structure for storing relations?

I have database with user subscriptions to topics.
There is currently about 20 000 topics,
20 mln users and 200 mln subscriptions stored in SQL database.
Because of its size, the database is partitioned by topics,
so I can't get the info in one database query.
There are couple of topics with 10 mln subscriptions, couple with 100 000 and others have hundreds or less.
When an event occurs, it usually matches couple of topics, so to inform users, I need to perform query like "give me all users subscribed to topics x, y, z and perform union of sets", so that one user gets the news once even if he subscribed both topics x and z.
The constraints are:
There must be no duplicates in the union set. (users can't get the content twice)
There can be bounded amount of users missing from the union set. (if sometimes user doesn't get the content, it is not that bad, but it can't be always the same user for the same topic)
It is possible to subscribe to new topic without rebuilding whole thing.
I thought about using set of bloom filters for every topic, but they constraints are the other way round: "user either not subscribed for sure or probably subscribed". I need something like "user subscribed for sure or probably not".
Lossy hash tables might be good idea, but I am not sure, if they can be as memory efficient as bloom filters and I am afraid, that it would be always the same user, that is missing the content in his topic.
Do you know any other data structures, that mey be good for solving this problem?
What if each user record had a BIT FIELD representing all of the topics.
TABLE Users(ID INT, UserName VARCHAR(16), Topics BINARY(8000))
A binary 8k would allow you to have 64000 topics. I would probably use multiple columns of BINARY(1024) each so I could add more topics easily.
Now when an event comes in that's tagged for topics 1, 10, 20, 30, 40.
I have to search every User, but this can be parallelized and will always be N complexity where N is the number of total users.
SELECT ID
FROM Users (READPAST)
WHERE
SUBSTRING(Topics, 1 / 8, 1) & (1 * POWER(2, (1 % 8))) > 0
OR
SUBSTRING(Topics, 10 / 8, 1) & (1 * POWER(2, (10 % 8))) > 0
OR
SUBSTRING(Topics, 20 / 8, 1) & (1 * POWER(2, (20 % 8))) > 0
OR
SUBSTRING(Topics, 30 / 8, 1) & (1 * POWER(2, (30 % 8))) > 0
OR
SUBSTRING(Topics, 40 / 8, 1) & (1 * POWER(2, (40 % 8))) > 0
OPTION (MAXDOP = 64)
No duplicates we're scanning Users once so we don't have o worry about unions
Some users missing the READPAST hint will skip any rows that are currently locked (being updated), so some users may be missing from the result.
SUbscribe You can [un]subscribe to a topic simply by toggling the topics bit in the Topics column.
As I said in comments, a memory-based exact solution is certainly feasible.
But if you really want an approximate data structure, then what you're looking for a size-limited set (of users for each topic) with random eviction.
You also need to compute unions rapidly on the fly when queries arrive. There's no helpful pre-computation here. If topic sets tend to repeat, you can look at caching the frequently used unions.
All the usual methods of representing a set apply. Hash tables (both closed and open), trees, and skip lists (all containing user id keys; no values required) are most likely.
If you use a closed hash table with a good hash function, pseudo-random eviction happens automatically. On collision, just replace the previous value. The problem with closed hashes is always picking a good table size for the set you need to represent. Remember that to recover set elements, you'll have to traverse the whole open table including null entries, so starting with a huge table isn't a good idea; rather start with a small one and reorganize, growing by a factor each time so reorganization amortizes to constant time overhead per element stored.
With the other schemes, you can literally do pseudo-random eviction when the table gets too big. The easiest way to evict fairly is store the user id's an a table and have the size-limited set store indices. Evict by generating a random index into the table and removing that id before adding a new one.
It's also possible to evict fairly from a BST set representation by using an order statistic tree: store the number of descendants in each node. Then you can always find the n'th element in key sorted order, where n is pseudo-random, and evict it.
I know you were looking for the bitwise space efficiency of a Bloom filter, but guaranteeing no false positives seems to rule that out.
This might not be the solution you were looking for, but you could utilize ElasticSearch's terms filter and to have one document like this for each user:
{
"id": 12345,
"topics": ["Apache", "GitHub", "Programming"]
}
Terms filters directly responds to the query "which users subscribe to at least one of these topics" and ES is very smart on how to cache and re-utilize filters.
It wouldn't be a probabilistic data structure but would very efficiently solve this problem. You'd need to use scan api for serializing retrieving potentially large JSON responses. If necessary you can scale this solution to billions of users spread on multiple computers and have response times like 10 - 100 milliseconds. You could also find correlations between topics (significant terms aggregation) and use ES as an engine for further analysis.
Edit: I implemented searching and scan / sroll API usage in Python and got some interesting results. I did "users who subscribe to any three of these topics" queries with that 20m users and 200m subscriptions dataset, and in general the search itself finishes in 4 - 8 milliseconds. Queries return 350.000 - 750.000 users.
Problems arise from getting user ids out of ES, even with the scan/scroll API. On Core i5 I seems to get only 8200 users / second so it is less than 0.5 million / minute (with "_source": false). The query itself looks like this:
{
"filtered": {
"filter": {
"terms": {
"topics": [ 123, 234, 345 ],
"execution": "plain",
"_cache": false
}
}
}
}
In production I would use "execution": "bool" so that partial query results can be cached and re-utilized at other queries. I don't know what is the bottle-neck with getting results out, server's CPU usage is 50% and I run the client's python script at the same machine, utilizing elasticsearch.helpers.scan.
[This solution is similar to Louis Ricci's, except inverted to the Topics table - which could make subscription updates less practical, be warned!]
(The probabilistic data structure approach is cool, but unnecessary for your current data-size. I was initially looking at compressed bitsets for a non-probabilistic solution, as they are great at performing set operations in-memory, but I think that's overkill as well. Here is a good implementation for this type of use-case. if you're interested.)
But looking at the sparsity of your data, bitsets waste space over integer-arrays. And even with integer-arrays, the union operation is still pretty inexpensive given that you only have an average of 10,000 subscriptions per topic.
So maybe, just maybe, a dead-simple data-structure given your use-case is simply:
Topic 1 => [array of subscriber IDs]
Topic 2 => [array of subscriber IDs]
...
Topic 20,000 => [array of subscriber IDs]
Storing (avg) 10,000 subscriber IDs (assuming 32-bit integers) only requires about 40kb of space per topic.
[In an array-type or BLOB, depending on your database]
With 20,000 topics, this adds only 800mb of data to your topic table ... and very little of this (~200kb avg) needs to be loaded to memory when a notification event occurs!
Then when an average event (affecting 5 topics) occurs, all that needs to happen is:
Query / Pull the data for the relevant topics (avg 5 records) into memory
(avg ~200kb of I/O)
Dump them into a Set data structure (de-duplicates subscriber list)
Alert the subscribers in the Set.

What is the best way to analyse a large dataset with similar records?

Currently I am loooking for a way to develop an algorithm which is supposed to analyse a large dataset (about 600M records). The records have parameters "calling party", "called party", "call duration" and I would like to create a graph of weighted connections among phone users.
The whole dataset consists of similar records - people mostly talk to their friends and don't dial random numbers but occasionaly a person calls "random" numbers as well. For analysing the records I was thinking about the following logic:
create an array of numbers to indicate the which records (row number) have already been scanned.
start scanning from the first line and for the first line combination "calling party", "called party" check for the same combinations in the database
sum the call durations and divide the result by the sum of all call durations
add the numbers of summed lines into the array created at the beginning
check the array if the next record number has already been summed
if it has already been summed then skip the record, else perform step 2
I would appreciate if anyone of you suggested any improvement of the logic described above.
p.s. the edges are directed therefore the (calling party, called party) is not equal to (called party, calling party)
Although the fact is not programming related I would like to emphasize that due to law and respect for user privacy all the informations that could possibly reveal the user identity have been hashed before the analysis.
As always with large datasets the more information you have about the distribution of values in them the better you can tailor an algorithm. For example, if you knew that there were only, say, 1000 different telephone numbers to consider you could create a 1000x1000 array into which to write your statistics.
Your first step should be to analyse the distribution(s) of data in your dataset.
In the absence of any further information about your data I'm inclined to suggest that you create a hash table. Read each record in your 600M dataset and calculate a hash address from the concatenation of calling and called numbers. Into the table at that address write the calling and called numbers (you'll need them later, and bear in mind that the hash is probably irreversible), add 1 to the number of calls and add the duration to the total duration. Repeat 600M times.
Now you have a hash table which contains the data you want.
Since there are 600 M records, it seems to be large enough to leverage a database (and not too large to require a distributed Database). So, you could simply load this into a DB (MySQL, SQLServer, Oracle, etc) and run the following queries:
select calling_party, called_party, sum(call_duration), avg(call_duration), min(call_duration), max (call_duration), count(*) from call_log group by calling_party, called_party order by 7 desc
That would be a start.
Next, you would want to run some Association analysis (possibly using Weka), or perhaps you would want to analyze this information as cubes (possibly using Mondrian/OLAP). If you tell us more, we can help you more.
Algorithmically, what the DB is doing internally is similar to what you would do yourself programmatically:
Scan each record
Find the record for each (calling_party, called_party) combination, and update its stats.
A good way to store and find records for (calling_party, called_party) would be to use a hashfunction and to find the matching record from the bucket.
Althought it may be tempting to create a two dimensional array for (calling_party, called_party), that will he a very sparse array (very wasteful).
How often will you need to perform this analysis? If this is a large, unique dataset and thus only once or twice - don't worry too much about the performance, just get it done, e.g. as Amrinder Arora says by using simple, existing tooling you happen to know.
You really want more information about the distribution as High Performance Mark says. For starters, it's be nice to know the count of unique phone numbers, the count of unique phone number pairs, and, the mean, variance and maximum of the count of calling/called phone numbers per unique phone number.
You really want more information about the analysis you want to perform on the result. For instance, are you more interested in holistic statistics or identifying individual clusters? Do you care more about following the links forward (determining who X frequently called) or following the links backward (determining who X was frequently called by)? Do you want to project overviews of this graph into low-dimensional spaces, i.e. 2d? Should be easy to indentify indirect links - e.g. X is near {A, B, C} all of whom are near Y so X is sorta near Y?
If you want fast and frequently adapted results, then be aware that a dense representation with good memory & temporal locality can easily make a huge difference in performance. In particular, that can easily outweigh a factor ln N in big-O notation; you may benefit from a dense, sorted representation over a hashtable. And databases? Those are really slow. Don't touch those if you can avoid it at all; they are likely to be a factor 10000 slower - or more, the more complex the queries are you want to perform on the result.
Just sort records by "calling party" and then by "called party". That way each unique pair will have all its occurrences in consecutive positions. Hence, you can calculate the weight of each pair (calling party, called party) in one pass with little extra memory.
For sorting, you can sort small chunks separately, and then do a N-way merge sort. That's memory efficient and can be easily parallelized.

Getting the most frequent items without counting every item

I was wondering if there was an algorithm for counting "most frequent items" without having to keep a count of each item? For example, let's say I was a search engine and wanted to keep track of the 10 most popular searches. What I don't want to do is keep a counter of every query since there could be too many queries for me to count (and most them will be singletons). Is there a simple algorithm for this? Maybe something that is probabilistic? Thanks!
Well, if you have a very large number of queries (like a search engine presumably would), then you could just do "sampling" of queries. So you might be getting 1,000 queries per second, but if you just keep a count one per second, then over a longish period of time, you'd get an answer that would be relatively close to the "real" answer.
This is how, for example, a "sampling" profiler works. Every n mililiseconds it looks at what function is currently being executed. Over a long period of time (several seconds) you get a good idea of the "expensive" functions, because they're the ones that appear in your samples more often.
You still have to do "counting" but by doing periodic samples, instead of counting every single query you can get an upper bound on the amount of data that you actually have to store (e.g. max of one query per second, etc)
If you want the most frequent searches at any given time, you don't need to have endless counters keeping track of each query submitted. Instead, you need an algorithm to measure the amount of submissions for any given query divided by a set period of time. This is a pretty simple algorithm. Any search submitted to your search engine, for example the word “cache,” is stored for a fixed period of time called a refresh rate, (the length of your refresh rate depends on the kind of traffic your search engine is getting and the amount of “top-results” you want to keep track of). If the refresh rate time period expires and searches for the word “cache” have not persisted, the query is deleted memory. If searches for the word “cache” do persist, your algorithm only needs to keep track of the rate at which the word “cache” is being searched. To do this, simply store all searches on a “leaky-counter.” Every entry is pushed onto the counter with an expiration date after which the query is deleted. Your active counters are the indicators of your top queries.
Storing each and every query would be expensive, yet necessary to ensure the top 10 are actually the top 10. You'll have to cheat.
One idea is to store a table of URLs, hit counters, and timestamp indexed by count, then timestamp. When the table reaches some arbitrary near-maximum size, start removing low-end entries that are older than a given number of days. Although old, infrequent queries won't be counted, the queries likely to make the top 10 should make it on the table because of the faster query rate.
Another idea would be to write a 16-bit (or more) hash function for search queries. Have a 65536-entry table holding counters and URLs. When a search is performed, increment the respective table entry and set the URL if necessary. However, this approach has a major drawback. A spam bot could make repeated queries like "cheap viagra", possibly making legitimate queries increment the spam query counters instead, placing their messages on your main page.
You want a cache, of which there are many kinds; see Wikipedia
Cache algorithms and
Page replacement algorithm Aging.

Resources