SQLite not utilizing full system resources during slow SELECT queries

I'm performing some select queries with SQLite. The columns are already indexed.
Queries are in the format:
SELECT stuff
FROM table
WHERE haystack LIKE '%needle%'
I'm also loading the db into memory by running RESTORE FROM my_db.db, and this seems to be working (used memory goes up).
The problem is that queries like the above still take a long time (close to 100 ms), and I need to run a lot of them (thousands at a time), so the full results can take several minutes to arrive.
However, while the queries are running, I see very low CPU usage: around 25-35% on average across all cores.
I think SQLite might be throttling its own performance somehow and not using all of the CPU.
Is there a way to get it to not throttle, and to utilize the full system resources so queries are faster?
Thanks.
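For reference, a minimal sketch of one way to use more cores, assuming Python's sqlite3 module and the my_db.db file from the question (table and column names are placeholders). SQLite runs each query on a single core, but it allows many concurrent readers, and the sqlite3 C module releases the GIL while a statement executes, so fanning the queries out over per-thread connections can raise CPU utilization. Note that a LIKE pattern with a leading wildcard cannot use an ordinary B-tree index, so each query scans the column regardless.

import sqlite3
from concurrent.futures import ThreadPoolExecutor

DB_PATH = "my_db.db"              # file from the question
NEEDLES = ["needle1", "needle2"]  # placeholder search terms

def search(needle):
    # Each worker opens its own read-only connection; after the first
    # pass the OS page cache keeps the file in memory anyway.
    con = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)
    try:
        cur = con.execute(
            "SELECT stuff FROM my_table WHERE haystack LIKE ?",  # my_table is a placeholder
            (f"%{needle}%",),
        )
        return cur.fetchall()
    finally:
        con.close()

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(search, NEEDLES))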

Related

How should PostgreSQL be configured for this setup?

I would like to tweak my PostgreSQL server, but even after reading a few tutorials online I am not getting good performance out of the database.
I've got a server with the following specs:
Windows Server 2012 R2 Datacenter
Intel CPU E5-2670 v2 @ 2.50 GHz
64-bit Operating System
512 GB RAM
PostgreSQL 9.3
I would like to use postgres as a data storage / aggregation system for the following tasks:
Read data from various data sources (mostly flat files) (volumes between 100GB and 1TB)
Pre-process / clean data
Aggregate data
Feed aggregated or sampled data into R or python for modelling
Up to 10 concurrent users only
This means I do not really care about the following:
Update speeds (I only bulk-load data)
Failure resistance (in the unlikely event that things break, I can always reload everything from my input files)
Currently, load speeds are fine, but creating indexes and aggregating data takes a very long time and barely uses any memory.
Here is my current postgres.config: http://pastebin.com/KpSi2zSd
I think the obvious step here is to increase the work_mem and maintenance_work_mem considerably, with the fine detail being "how much"?
If you have control over how many aggregation queries and/or index creations run at a time, you can be pretty aggressive with these settings, but you face the risk that with 10 concurrent users and a 30GB setting you could put your server under memory pressure.
It would really benefit you to get execution plans for the slow-running queries: they will tell you, for example, how much memory a step needs for a "Sort Method: external merge Disk", and you can then adjust your settings while keeping an eye on total memory usage on the server.
I wouldn't rule out having to re-jig your loads so that the most resource-intensive ones run on their own, while the less resource-intensive operations run alongside each other.
However, I think at the moment you are lacking the hard metrics that will let you make a good choice on memory allocation.
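To make the plan-driven approach concrete, here is a minimal sketch, assuming psycopg2 and a hypothetical aggregation query: it raises work_mem for the current session only and captures the execution plan, so you can look for a "Sort Method: external merge Disk" line and size work_mem against what the plan actually needs.

import psycopg2

conn = psycopg2.connect("dbname=analytics user=postgres")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("SET work_mem = '1GB'")  # session-local; does not touch the server config
    cur.execute(
        "EXPLAIN (ANALYZE, BUFFERS) "
        "SELECT region, sum(amount) FROM sales GROUP BY region"  # placeholder query
    )
    for (line,) in cur.fetchall():
        print(line)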

How to compare the performance of Neo4j queries without the cache?

I've been trying to compare query performance in Neo4j.
To make the queries more efficient, I added an index, analysed the results using PROFILE, and then tried the same queries with a USING INDEX hint.
On most queries, DB hits were much better with the second option (the USING INDEX hint) and rows were the same or fewer, but the timings do not seem reliable: on several queries adding USING INDEX was slower despite the better metrics (DB hits and rows), and times improved considerably when a query was re-executed.
To stop the cache from interfering, I changed cache_type to none in the neo4j.properties file and restarted Neo4j, but the same query still seems to return faster each time it is run (up to a point).
What will be the best way to test it?
Neo4j (up to 2.2.x) has a two-layered cache architecture. With cache_type=none you switch off just the object cache. To disable the page cache as well, you can set dbms.pagecache.memory=0. However, if all caches are disabled, you are basically measuring the speed of your IO subsystem, since every query goes down to the bare metal and reads from disk.
I recommend a different approach: enable both caches and run the queries you want to compare multiple times to warm up the caches. Take measurements on the warmed cache, since this is much closer to a real production scenario.
On a side note: in Neo4j 2.3 the object cache goes away and only the page cache remains.
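As a sketch of that warm-cache approach, assuming the official neo4j Python driver against a modern server (the question's 2.2.x would go through the REST API instead, but the warm-up idea is the same; query, credentials, and URL are placeholders): run the query several times, discard the first runs, and compare only the warmed timings.

import time
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
QUERY = "MATCH (p:Person {name: $name}) RETURN p"  # placeholder query

with driver.session() as session:
    timings = []
    for _ in range(10):
        start = time.perf_counter()
        session.run(QUERY, name="Alice").consume()  # consume() forces full execution
        timings.append(time.perf_counter() - start)
driver.close()

warm = sorted(timings[3:])  # drop the first runs, which warm the caches
print(f"warm median: {warm[len(warm) // 2] * 1000:.1f} ms")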

Bulk insert performance in MongoDB for large collections

I'm using the BulkWriteOperation (java driver) to store data in large chunks. At first it seems to be working fine, but when the collection grows in size, the inserts can take quite a lot of time.
Currently for a collection of 20M documents, bulk insert of 1000 documents could take about 10 seconds.
Is there a way to make inserts independent of collection size?
I don't have any updates or upserts, it's always new data I'm inserting.
Judging from the log, there doesn't seem to be any issue with locks.
Each document has a time field which is indexed, but it grows linearly, so I don't see any need for Mongo to spend time reorganizing the index.
I'd love to hear some ideas for improving the performance
Thanks
You believe that the indexing does not require any document reorganisation, and the way you describe the index suggests that a right-handed (append-only) index is OK, so indexing seems to be ruled out as an issue. You could of course, as suggested above, rule this out definitively by dropping the index and re-running your bulk writes.
Aside from indexing, I would …
Consider whether your disk can keep up with the volume of data you are persisting. More details on this in the Mongo docs
Use profiling to understand what’s happening with your writes
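As a minimal sketch of that profiling step, assuming pymongo rather than the question's Java driver (the database name is a placeholder): enable the profiler, run the bulk writes, then read the slowest operations back from system.profile.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.mydb  # placeholder database

db.command("profile", 2)  # level 2 profiles every operation; level 1 only slow ops

# ... run the bulk writes here ...

for op in db["system.profile"].find().sort("millis", -1).limit(5):
    print(op.get("op"), op.get("ns"), op.get("millis"))

db.command("profile", 0)  # switch the profiler off again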
Do you have any indexes on your collection?
If yes, inserts have to take time to maintain the index tree.
Is the data time-series?
If yes, use updates rather than inserts. This blog post suggests in-place updates are more efficient than inserts for time-series data: https://www.mongodb.com/blog/post/schema-design-for-time-series-data-in-mongodb
Do you have the ability to set up sharded collections?
If yes, sharding would reduce the time (tested with 3 shard servers and 15 million IP-geo records).
Disk utilization & CPU: Check the disk utilization and CPU and see if any of these are maxing out.
Most likely it is the disk that is causing this issue for you.
Mongo log:
Also, if a 1000-document bulk write is taking 10 seconds, check the mongod log to see whether a few inserts within the batch account for most of the time. If there are any such queries, you can narrow down your analysis.
Another thing that's not clear is the mix of operations hitting your Mongo instance. Are inserts the only operation, or do find queries run at the same time? If so, you should look at scaling up whichever resource is maxing out.
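For reference, a minimal sketch of the same bulk insert with pymongo (the question uses the Java driver's BulkWriteOperation, but the shape is the same; database, collection, and documents are placeholders). ordered=False lets the server apply the batch without stopping at the first error and is usually faster for pure inserts.

from pymongo import MongoClient, InsertOne

client = MongoClient("mongodb://localhost:27017")
coll = client.mydb.events  # placeholder database/collection

batch = [InsertOne({"time": i, "value": i * 2}) for i in range(1000)]  # placeholder documents
result = coll.bulk_write(batch, ordered=False)
print(result.inserted_count)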

Why does the same query on CouchDB take different amounts of time?

I have a CouchDB application, and for most of the views I notice that the time the server takes to return a response varies from 10 ms to 100 ms. I do not have any concurrent write operations on the server, and there are at most 10 concurrent read requests.
How should I diagnose the problem? Where should I look?
I am running it on a Rackspace cloud machine with 1 GB RAM.
From the CouchDB Guide:
If you read carefully over the last few paragraphs, one part stands out: “When you query your view, CouchDB takes the source code and runs it for you on every document in the database.” If you have a lot of documents, that takes quite a bit of time and you might wonder if it is not horribly inefficient to do this. Yes, it would be, but CouchDB is designed to avoid any extra costs: it only runs through all documents once, when you first query your view. If a document is changed, the map function is only run once, to recompute the keys and values for that single document.
Most likely you are seeing the views being regenerated and re-cached.
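One way to check that, sketched with Python's requests library against a hypothetical view: query it once normally, which may trigger an index rebuild, and once with stale=ok, which serves the last built index without updating it. A large gap between the two timings points at view regeneration.

import requests

BASE = "http://localhost:5984/mydb"          # placeholder database URL
VIEW = f"{BASE}/_design/app/_view/by_date"   # placeholder design doc / view

fresh = requests.get(VIEW)                           # may rebuild the view index first
cached = requests.get(VIEW, params={"stale": "ok"})  # serves the last built index as-is
print(fresh.elapsed.total_seconds(), cached.elapsed.total_seconds())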

Does it make sense to optimize queries for less I/O pressure?

I have a read-only database (product) that resides on its own SQL Server 2008 instance.
I have already optimized queries by looking at the most expensive queries in the Activity Monitor report, ordered by CPU cost. I now have something like 50 queries/second, and no query takes longer than 300 ms.
CPU time is OK (30%) and only 20% of memory is used (out of 64 GB).
There is one issue: disk time is at a steady 100% (I looked at the idle-time performance counter and used Idera's SQL Diagnostic Manager). I can see that the product DB behaves differently from my order DB, which is on a different machine and has smaller tables: in a profiler trace, queries in the product DB show values in the "reads" column higher than 50,000, while in the order DB these values never exceed 1,000. The product-DB queries use a lot of common table expressions and work on large tables (some around 5 million rows).
I am not sure whether I should invest time in optimizing queries for I/O performance or just add a server. By optimizing for query duration I have already added the missing indexes. Is optimizing for I/O something that is usually done?
In short, yes. Optimize for both CPU and IO.
Queries with high CPU tend to be doing unnecessary in-memory sorts, (sometimes inefficient) hash joins, or complex logic.
Queries with high IO (Page Reads) tend to be doing full table scans or working in other inefficient ways.
9 times out of 10, the same queries will be near the top of the list, but if you've worked on the high-CPU queries and are still unhappy with performance, then by all means work on the high-IO procs next.
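A minimal sketch of how you might find those high-IO queries, assuming pyodbc and SQL Server's standard query-stats DMVs (the connection string is a placeholder): it lists the cached statements with the most logical reads, which is where the full scans described above usually surface.

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=product;Trusted_Connection=yes"  # placeholder connection string
)
cur = conn.cursor()
cur.execute("""
    SELECT TOP 10
           qs.total_logical_reads,
           qs.execution_count,
           SUBSTRING(st.text, 1, 200) AS query_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_logical_reads DESC
""")
for row in cur.fetchall():
    print(row.total_logical_reads, row.execution_count, row.query_text)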
"There's always a next bottleneck," they say.
Now that you've tuned CPU usage, it's only natural that I/O load emerges as the dominant cost. Is your performance already acceptable? If yes, stop; if not, estimate how many hours of further tuning it will take and whether buying another server or more hard disks might be cheaper.
Regarding the I/O tuning again, try to see what you can achieve with easy measures. Sometimes you can trade CPU for I/O and vice versa; compression is an example of this. You would then tune whichever component is your current bottleneck.
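As a sketch of that CPU-for-IO trade, again assuming pyodbc and a hypothetical large table: page compression (an Enterprise feature in SQL Server 2008) reduces the number of pages read from disk at the cost of CPU spent decompressing them.

import pyodbc

conn = pyodbc.connect("DSN=product")  # placeholder DSN
conn.cursor().execute(
    "ALTER TABLE dbo.big_table REBUILD WITH (DATA_COMPRESSION = PAGE)"  # placeholder table
)
conn.commit()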
Before you seek to make the I/O faster try to reduce the I/O that is generated.
Look for obvious IO performance improvements for your query, but more importantly, look at how you can improve your IO performance at the server level.
If your other resources (CPU and memory) aren't overloaded, you probably don't need a new server. Consider adding an SSD for logs and temp files, and/or consider if you can affordably fit your whole DB onto an array of SSDs.
Of course, clearing out your disk IO bottleneck is likely to raise CPU usage, but if your performance is close to acceptable, this will probably improve things to the point that you can stop optimizing for now.
Unless you are using SSDs or a DB-optimized SAN, IO is almost always the limiting factor in database applications.
So yes, optimize to get rid of it as much as possible.
Table indexes are the first thing to do.
Then, add as much RAM as you possibly can, up to the complete size of your DB files.
Then partition your data tables (if that is a reasonable thing to do) so that any necessary table or index scans are done on only one or two table partitions.
Then I suppose you either buy bigger machines with even more RAM and/or buy SSDs or a SAN or a SAN with SSDs.
Alternatively, you rebuild your entire database application to use something like NoSQL or database sharding, and implement all your relations, joins, constraints, etc. in a middle interface layer.
