Datastore fetch VS fetch(keys_only=True) then get_multi - performance

I am fetching multiple entities (100+) from the Datastore using the query below:
return entity.query(ancestor=ancestorKey).filter(entity.year == myStartYear).order(entity.num).fetch()
This was taking a long time (on the order of a few seconds) to load.
Trying to find an optimal approach, I created exactly 100 entities and found that it takes anywhere between 750 ms and 1000 ms to fetch those 100 entities on the local development server, which is of course a lot. I am not sure how to make this single-line fetch more efficient!
In a desperate attempt to optimize, I tried
Removing the order part, still got the same results
Removing the filter part, still got the same results
Removing the order & filter part, still got the same results
So apparently it is something else. As a last resort, I tried fetching keys only and then passing the keys to ndb.get_multi():
qKeys = entity.query(ancestor=ancestorKey).filter(entity.year == myStartYear).order(entity.num).fetch(keys_only=True)
return ndb.get_multi(qKeys)
To my surprise I got better throughput! The query results now load in 450-550 ms, which is around 40% better performance on average!
I am not sure why this happens; I would have thought that the fetch function already retrieves entities in the most efficient way.
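For reference, the two approaches boil down to something like the sketch below (the model class, property names and timing harness are illustrative, not the actual application code):

import time
from google.appengine.ext import ndb

class Entity(ndb.Model):          # stand-in for the real model
    year = ndb.IntegerProperty()
    num = ndb.IntegerProperty()

def compare(ancestor_key, my_start_year):
    q = Entity.query(ancestor=ancestor_key).filter(
        Entity.year == my_start_year).order(Entity.num)

    start = time.time()
    ents_direct = q.fetch()                 # single-call fetch
    t_fetch = time.time() - start

    start = time.time()
    keys = q.fetch(keys_only=True)          # keys-only query...
    ents_via_keys = ndb.get_multi(keys)     # ...then a batch get by key
    t_keys_then_get = time.time() - start

    return t_fetch, t_keys_then_get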
Question:
Any idea how I can optimize the single query line to load faster?
Side Question:
Does anyone know what the underlying mechanism of the fetch function is, and why fetching keys only and then using ndb.get_multi() is faster?

FWIW, you shouldn't expect meaningful results from Datastore performance tests performed locally, using either the development server or the datastore emulator: they're just emulators and they don't have the same performance (or even 100% equivalent functionality) as the real Datastore.
Credit goes to @snakecharmerb, who correctly identified the culprit, as confirmed by the OP:
Be aware that performance characteristics in the cloud may differ from those on your local machine. You really want to be running these tests in the cloud. – snakecharmerb
@snakecharmerb you were right! I just tested on the cloud and it's actually the other way around in terms of performance: fetch() was ~550 ms, while fetch(keys_only=True) followed by get_multi was ~700 ms. It seems that fetch() works better on the cloud! – Khaled

Related

Potential data loss when using InMemoryMessageDataRepository in production with MassTransit?

We are hoping to convert to MassTransit, but due to the size of the data we need to store in the message data repository, we need to keep it in memory. The data is sometimes over 15 GB, and the SQL query to fetch it takes up to 300 seconds. We don't want to have to fetch the data potentially hundreds of times (which is how the consumers would be logically broken up); we need to continue to fetch it from SQL once and keep it in memory, or performance will be crippled. We would ultimately like to stop needing to have so much data in memory, but that is not something we could feasibly fix in the short term. We are hoping to convert to MassTransit now, though, as we see significant benefits in doing so, and the InMemoryMessageDataRepository seems like the only path that would allow us to do this right now. However, the documentation says only to use that for unit testing.
I am trying to find out if this is something we can use in production code. What are the pitfalls of using the InMemoryMessageDataRepository in production? I can't find much about this, other than this blog post, https://kevsoft.net/2017/05/18/transferring-big-payloads-with-masstransit-and-azure-blob-storage.html, which says:
It is however advised not to use in-memory storage due to potential data loss.
Why would there be potential data loss? What is MassTransit doing with the InMemoryMessageDataRepository that would make using it in production risky?

Tabular Model Not Caching Results

Can anyone point me in the direction of how to troubleshoot why a Tabular model I have built does not seem to cache query results?
It is my understanding that MDX queries against a Tabular model will be cached; however, with our model they never seem to be, and I can't figure out why.
My best guess is that it's memory pressure and the system is clearing down the RAM, but even that is a guess.
Are there any counters, DMVs, or other Perfmon stats that I can use to actually see what is going on?
Thanks.
Plenty of places to look, but I'd recommend starting with a Profiler/xEvent trace. As an example, compare two runs of the same MDX query:
The first run is on a cold cache.
The second run is on a warm cache, and in the trace you can see the query being resolved from the cache.
This is much easier to see if you can isolate the query on a non-production server (e.g. a test/dev environment). There are quite a few reasons why a particular query may not be taking advantage of the cache, but you first need to confirm whether it is using the cache at all.

How to deactivate safe mode in the mongo shell?

The short question is in the title: I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
The long question, for those who want the context:
I am working on a huge set of data like
{
    _id: ObjectId("azertyuiopqsdfghjkl"),
    stringdate: "2008-03-08 06:36:00"
}
plus some other fields. There are about 250M documents like that (the whole database, with its indexes, weighs 36 GB). I want to convert the date into a real ISODate field. I searched a bit for how to write an update query like
db.data.update({},{$set:{date:new Date("$stringdate")}},{multi:true})
but could not find a way to make this work, so I resigned myself to writing a script that takes the documents one after the other and issues an update to set a new field whose value is new Date(stringdate). The query uses the _id, so the default index is used.
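For illustration, that kind of one-document-at-a-time conversion loop looks roughly like the sketch below, written with PyMongo rather than the mongo shell (the connection, database/collection names and date format are assumptions; only stringdate comes from the example document above):

from datetime import datetime
from pymongo import MongoClient

client = MongoClient()                      # assumed local connection
coll = client.mydb.data                     # assumed database/collection names

cursor = coll.find({"date": {"$exists": False}},   # only documents not yet converted
                   {"stringdate": 1},
                   no_cursor_timeout=True)
try:
    for doc in cursor:
        parsed = datetime.strptime(doc["stringdate"], "%Y-%m-%d %H:%M:%S")
        coll.update_one({"_id": doc["_id"]},        # _id lookup uses the default index
                        {"$set": {"date": parsed}})
finally:
    cursor.close()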
The problem is that it takes a very long time. I have already figured out that if only I had inserted empty date objects when I created the database, I would now get better performance, since adding a new field forces documents to be relocated. I also set an index on a relevant field so I can process the database chunk by chunk. Finally, I ran several concurrent mongo clients on both the server and my workstation to make sure the limiting factor is write-lock availability and not some other factor like CPU or network cost.
I monitored the whole thing with mongotop, mongostat and the web monitoring interface, which confirmed that the write lock is taken 70% of the time. I am a bit disappointed that MongoDB does not have more precise granularity on its write lock: why not allow concurrent write operations on the same collection as long as there is no risk of interference? Now that I think about it, I should have split the collection across a dozen shards, even while staying on the same server, because each shard would have had its own lock.
But since I can't change the current database structure right now, I looked for a way to improve performance so that at least 90% of my time is spent writing to mongo (up from 70% currently). I figured out that since I run my script in the default mongo shell, every update is followed by a getLastError() call, which I don't want: there is a 99.99% chance of success, and even in case of failure I can still run an aggregation query after the big process finishes to retrieve the few exceptions.
I don't expect a huge performance gain from deactivating the getLastError calls, but I think it is worth trying.
I took a look at the documentation and found confirmation of the default behavior, but not the procedure for changing it. Any suggestions?
I work with the mongo shell, which is in safe mode by default, and I want to gain better performance by deactivating this behaviour.
You can use db.getLastError({w:0}) (http://docs.mongodb.org/manual/reference/method/db.getLastError/) to do what you want, but it won't help.
This is because, for one:
a script that takes the documents one after the other and issues an update to set a new field whose value is new Date(stringdate)
When the shell is used in a non-interactive mode, such as within a loop, it doesn't actually call getLastError() after each operation. As such, lowering your write concern to 0 will do nothing.
I have already figured out that if only I had inserted empty date objects when I created the database, I would now get better performance, since adding a new field forces documents to be relocated.
I did tell people, when they asked about this stuff, to add those fields in case of movement, but instead they listened to the guy who said "leave them out! They use space!".
I shouldn't feel smug but I do. That's an unfortunate side effect of being right when you were told you were wrong.
mongostat and the web monitoring interface, which confirmed that the write lock is taken 70% of the time
That's because of all the movement in your documents, kinda hard to fix that.
I am a bit disappointed that MongoDB does not have more precise granularity on its write lock
The write lock doesn't actually determine the concurrency of MongoDB; this is another common misconception that stems from transactional SQL technologies.
For one thing, write locks in MongoDB are mutexes.
Not only that, but there are numerous rules which dictate that operations will yield to queued operations under certain circumstances: one being how many operations are waiting, another being whether the data is in RAM or not, and more.
Unfortunately, I believe you have got yourself stuck between a rock and a hard place, and there is no easy way out. This does happen.

RavenDB - slow write/save performance?

I started porting a simple ASP.NET MVC web app from SQL to RavenDB. I noticed that the pages were faster on SQL than on RavenDB.
Drilling down with MiniProfiler, it seems the culprit is the time it takes to run session.SaveChanges (150-220 ms). The code for saving in RavenDB looks like:
var btime = new TimeData() { Time1 = DateTime.Now, TheDay = new DateTime(2012, 4, 3), UserId = 76 };
session.Store(btime);
session.SaveChanges();
Authentication Mode: When RavenDB is running as a service, I assume it is using Windows Authentication. When deployed as an IIS application I just used the defaults, which was Windows Authentication.
Background: The database machine is separate from my development machine, which acts as the web server. Both databases are running on the same database machine. The test data is quite small, say 100 rows. The queries are simple, returning an object with 12 properties, 48 bytes in size. Using Fiddler to run a WCAT test against RavenDB generated higher utilization on the database machine (vs SQL) and far fewer pages. I tried running Raven as a service and as an IIS application, but didn't see a noticeable difference.
Edit
I wanted to ensure it wasn't a problem with a) one of my machines or b) the solution I created. So I decided to test it on AppHarbor using another solution created by Michael Friis, the RavenDB sample app, and simply add MiniProfiler to that solution. Michael is one of the awesome guys at AppHarbor, and you can download the code here if you want to look at it.
Results from Appharbor
You can try it here (for now):
Read: (7-12ms with a few outliers at 100+ms).
Write/Save: (197-312ms) * WOW that's a long time to save *. To test the save, just create a new "thingy" and save it. You might want to do it at least twice since the first one usually takes longer as the application warms up.
Unless we're both doing something wrong, RavenDB is very slow to save - around 10-20x slower to save than read. Given that it re-indexes asynchronously, this seems very slow.
Are there ways to speed it up or is this to be expected?
First - Ayende is "the man" behind RavenDB (he wrote it). I have no idea why he's not addressing the question; even in the Google Groups he seems to chime in once to ask some pointed questions, but rarely comes back to provide a complete answer. Maybe he's working hard to get RavenHQ off the ground?!?
Second - We experienced a similar problem. Below's a link to a discussion on Google Groups that may be the cause:
RavenDB Authentication and 401 Response.
A reasonable question might be: "If these recommendations fix the problem, why doesn't RavenDB work that way out of the box?" or at least provide documentation about how to get decent write performance.
We played for a while with the suggestions that were made in the thread above, and the response time did improve. In the end, though, we switched back to MySQL because it's well tested, because we ran into this problem early (luckily), which caused concern that we might hit more problems, and, finally, because we did not have the time to:
fully test whether it fixed the performance problems we saw on the RavenDB Server
investigate and test the implications of using UnsafeAuthenticatedConnectionSharing & pre-authentication.
To summarize Ayende's response: you're actually measuring the sum of network latency and authentication chatter. As Joe pointed out, there are ways you can optimize the authentication to be less chatty. This does, however, arguably reduce security; clearly Microsoft built the security to be secure first and performant second. You, as the user of RavenDB, can decide whether the default security model is more robust than you need, as it arguably is for protected server-to-server communication.
RavenDB is clearly designed to be READ oriented. Being 10-20x slower for writes than reads is entirely acceptable because writes are fully ACID and transactional.
If write speed is your limiting factor with RavenDB, you've likely not modeled your transaction boundaries properly: you are saving documents that are too similar to RDBMS table rows rather than actual, well-modeled documents.
Edit: Reading your question again and looking at the background section, you explicitly define your test conditions to be an optimal scenario for SQL Server and one of the least efficient access patterns for RavenDB. For data of that size, it would almost certainly be a single document in real-world usage.

Efficient server-side autocomplete

First of all, I know:
Premature optimization is the root of all evil
But I think the wrong autocomplete implementation can really blow up your site.
I would like to know if there are any libraries out there which can do autocomplete efficiently (server-side), preferably ones that fit into RAM (for best performance). So no browser-side JavaScript autocomplete (YUI/jQuery/Dojo); I think there are enough topics about that on Stack Overflow. But I could not find a good thread about this on Stack Overflow (maybe I did not look hard enough).
For example autocomplete names:
names:[alfred, miathe, .., ..]
What I can think of:
A simple SQL LIKE query, for example: SELECT name FROM users WHERE name LIKE 'al%'.
I think this implementation will blow up with a lot of simultaneous users or a large data set, but maybe I am wrong, so numbers (of what it can handle) would be cool.
Using something like Solr's terms component, for example: http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=al&wt=json&omitHeader=true.
I don't know the performance of this, so users with big sites, please tell me.
Maybe something like an in-memory Redis trie, which I also haven't tested the performance of.
I also read in this thread about how to implement this in Java (Lucene and some library created by Shilad).
What I would like to hear about are implementations used by real sites and numbers for how well they handle load, preferably with:
Link to implementation or code.
Numbers to which you know it can scale.
It would be nice if it could be accessed over HTTP or sockets.
Many thanks,
Alfred
Optimising for Auto-complete
Unfortunately, the resolution of this issue will depend heavily on the data you are hoping to query.
LIKE queries will not put too much strain on your database, as long as you spend time using 'EXPLAIN' or the profiler to show you how the query optimiser plans to perform your query.
Some basics to keep in mind:
Indexes: Ensure that you have indexes set up. (Yes, in many cases LIKE does use indexes. There is an excellent article on the topic at myitforum: SQL Performance - Indexes and the LIKE clause.)
Joins: Ensure your JOINs are in place and are optimized by the query planner. SQL Server Profiler can help with this. Look out for full index or full table scans.
Auto-complete sub-sets
Auto-complete queries are a special case, in that they usually work as ever-decreasing subsets.
'name' LIKE 'a%' (may return 10000 records)
'name' LIKE 'al%' (may return 500 records)
'name' LIKE 'ala%' (may return 75 records)
'name' LIKE 'alan%' (may return 20 records)
If you return the entire result set for query 1, then there is no need to hit the database again for the following result sets, as they are subsets of your original query.
Depending on your data, this may open a further opportunity for optimisation.
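A tiny sketch of that idea in Python: once the complete result for the one-letter prefix is in memory, every narrower prefix can be answered by filtering it without touching the database (the name list is illustrative):

al_names = ["alan", "albert", "alfred", "alison"]            # full result of LIKE 'al%'
ala_names = [n for n in al_names if n.startswith("ala")]     # LIKE 'ala%'  -> ["alan"]
alan_names = [n for n in ala_names if n.startswith("alan")]  # LIKE 'alan%' -> ["alan"]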
I won't fully comply with your requirements: obviously the scaling numbers will depend on hardware, the size of the DB, the architecture of the app, and several other factors. You must test it yourself.
But I will tell you the method I've used with success (see the sketch after these steps):
Use a simple SQL LIKE query, for example SELECT name FROM users WHERE name LIKE 'al%', but use TOP 100 to limit the number of results.
Cache the results and maintain a list of terms that are cached
When a new request comes in, first check the list to see whether you have the term (or part of the term) cached.
Keep in mind that your cached results are limited to 100 rows, so sometimes you may still need to run a SQL query: if the last result in the cached set still matches the term, the set was probably truncated and there may be more matches in the database.
Hope it helps.
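A short sketch of that method in Python (the SQL text, the 100-row limit and the run_sql helper are assumptions, not code from the answer):

LIMIT = 100
cache = {}   # term -> list of up to LIMIT names previously returned for that term

def lookup(term, run_sql):
    if term in cache:
        return cache[term]
    # A shorter cached term whose result set was NOT truncated contains every
    # possible match, so it can be filtered in memory; a truncated set may be
    # missing rows, so fall through to SQL in that case.
    for i in range(len(term) - 1, 0, -1):
        shorter = term[:i]
        if shorter in cache and len(cache[shorter]) < LIMIT:
            result = [n for n in cache[shorter] if n.startswith(term)]
            cache[term] = result
            return result
    # Otherwise run the limited query and cache its result for later requests.
    result = run_sql("SELECT TOP 100 name FROM users WHERE name LIKE ?", term + "%")
    cache[term] = result
    return result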
Using SQL versus Solr's terms component is really not a comparison. At their core they solve the problem the same way by making an index and then making simple calls to it.
What I would want to know is: what are you trying to autocomplete?
Ultimately, the easiest and most surefire way to scale a system is to make a simple solution and then just scale it by replicating data. Trying to cache calls or predict results just makes things complicated and doesn't get to the root of the problem (i.e. you can only take those approaches so far, for example when each request misses the cache).
Perhaps a little more info about how your data is structured and how you want to see it extracted would be helpful.
