tldr: Big Neo4j dataset that I am trying to load into the page cache, but only part of the db is being loaded. Queries with count are very slow. sysinfo and memrec are showing different db data volumes. How do I load the entire graph into memory and reduce db hit counts?
I have a 25G Neo4j 3.5.12 Enterprise database running on a 52G server. Currently I have set the page cache to 28G and the heap size to 12G in the conf file. I want to load the entire dataset into the cache for fast querying.
The problem:
1. Even after warming up with apoc.warmup.run(true, true, true), my queries are still hitting the database millions of times. I have also tried the basic MATCH (n)-[r]-(m) RETURN count(r), count(n.name) etc., but the same problem persists.
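For reference, here is roughly how I issue the warm-up calls; this is a minimal sketch using the official Python driver, and the bolt URI and credentials are placeholders, not my actual setup:

```python
# Minimal warm-up sketch with the official Neo4j Python driver.
# The bolt URI and credentials are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Pull node, relationship and property stores into the page cache via APOC.
    session.run("CALL apoc.warmup.run(true, true, true)").consume()

    # A plain count also touches every node/relationship store page.
    record = session.run(
        "MATCH (n)-[r]-(m) RETURN count(r) AS rels, count(n.name) AS names"
    ).single()
    print(record["rels"], record["names"])

driver.close()
```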
Here is the post apoc warmup result:
This is a sample profile query:
2. Another thing that is very hard to understand is that memrec and sysinfo seem to be showing different results.
memrec says I have 7.7G of data volume plus indexes, while sysinfo and du say I have 25G.
sysinfo:
memrec:
du -ach:
Exactly as the numbers indicated by memrec, the system is only using approx. 8G + 12G (heap) for Neo4j:
Can someone please help me understand what I am doing wrong here? How can I get all the graph nodes and relationships into the page cache?
I'm trying to upload about 7 million documents to ES 6.3 and I've been running into an issue where the bulk upload slows to a crawl at about 1 million docs (I had no documents in the index before this).
I have a 3-node ES setup with 16GB of RAM per node and an 8GB JVM heap, 1 index, 5 shards.
I have turned off refresh ("-1"), set replicas to 0, and increased the index buffer size to 30%.
On my upload side I have 22 threads running 150 docs per request of bulk insert. This is just a basic Ruby script using PostgreSQL, ActiveRecord, Net/HTTP (for the network call), and the ES Bulk API (no gem).
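For reference, the equivalent flow looks roughly like this sketch with the official Python client; the host, index name, and document shape are placeholders (my actual script is the Ruby one described above):

```python
# Rough sketch of the bulk flow with the official Python client.
# Host, index name, and the row source are placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

# Same idea as refresh "-1" / replicas 0 during the initial load.
# (The 30% index buffer is a node-level setting, indices.memory.index_buffer_size,
#  configured in elasticsearch.yml rather than per index.)
es.indices.put_settings(
    index="documents",
    body={"index": {"refresh_interval": "-1", "number_of_replicas": 0}},
)

def actions(rows):
    # One action per document; chunk_size below gives ~150 docs per bulk request.
    for row in rows:
        yield {"_index": "documents", "_type": "_doc", "_id": row["id"], "_source": row}

rows = [{"id": 1, "title": "example"}]  # stand-in for the ActiveRecord batch
helpers.bulk(es, actions(rows), chunk_size=150)

# Re-enable refresh once the load is done.
es.indices.put_settings(index="documents", body={"index": {"refresh_interval": "1s"}})
```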
On all of my nodes and upload machines, CPU, memory, and SSD disk IO are low.
I've been able to get about 30k-40k inserts per minute (roughly 500-670 per second), but that seems really slow to me since others have been able to do 2k-3k per second. My documents do have nested JSON, but they don't seem to be very large to me (is there a way to check the size of a single doc, or the average?).
I would like to be able to bulk upload these documents in less than 12-24 hours, and it seems like ES should handle that, but once I get to 1 million documents it slows to a crawl.
I'm pretty new to ES so any help would be appreciated. I know this seems like a question that has already been asked, but I've tried just about everything I could find and wonder why my upload speed is several times slower than what others report.
I've also checked the logs and only saw some errors about a mapping field that couldn't be changed, but nothing about running out of memory or anything like that.
ES 6.3 is great, but I'm also finding that the API has changed a bunch in 6.x, and settings that people were using are no longer supported.
I think I found a bottleneck in the active connections to my original database; increasing that connection pool helped, but it still slows to a crawl at about 1 million records, though it got to 2 million over about 8 hours of running.
I also tried an experiment on a big machine that is used to run the upload job, running 80 threads at 1,000 documents per bulk request. I did some calculations and found that my documents are about 7-10KB each, so each bulk index request is roughly 7-10MB. This reached the 1M document count faster, but once you get there everything slows to a crawl. The machine stats are still really low. I do see output from the threads in the job logs about every 5 minutes or so, around the same time I see the ES count change.
The ES machines still have low CPU and memory usage. Disk IO is around 3.85MB/s, and network bandwidth was at 55MB/s and drops to about 20MB/s.
Any help would be appreciated. I'm not sure if I should try the ES gem and use its bulk insert, which maybe keeps a connection open, or try something totally different to insert.
ES 6.3 is great, but I'm also finding that the API has changed a bunch in 6.x, and settings that people were using are no longer supported.
Could you give an example of a breaking change between 6.0 and 6.3 that is a problem for you? We're really trying to avoid those and I can't recall anything off the top of my head.
I've started profiling that DB and noticed that once you use an offset of about 1 million, the queries start to take a long time.
Deep pagination is terrible performance-wise. There is the great blog post no-offset, which explains
why it's bad: to get results 1,000 to 1,010 you sort the first 1,010 records, throw away 1,000, and then send 10. The deeper the pagination, the more expensive it gets.
how to avoid it: make a unique order for your entries (for example by ID, or by combining date and ID, but something that is absolute) and add a condition on where to start. For example, order by ID, fetch the first 10 entries, and keep the ID of the 10th entry for the next iteration. In that iteration, order by ID again, but with the condition that the ID must be greater than the last one from your previous run, fetch the next 10 entries, and again remember the last ID. Repeat until done.
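A rough sketch of that keyset approach against the source database, here with psycopg2 and an illustrative documents table with an integer id (all names are placeholders):

```python
# Keyset ("seek") pagination sketch with psycopg2; table and column names
# are illustrative, not taken from the original setup.
import psycopg2

def index_batch(rows):
    # placeholder for the bulk-index call into Elasticsearch
    pass

conn = psycopg2.connect("dbname=source_db")
last_id = 0
page_size = 1000

while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload FROM documents "
            "WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, page_size),
        )
        rows = cur.fetchall()
    if not rows:
        break
    index_batch(rows)
    last_id = rows[-1][0]  # remember the last id instead of using OFFSET
```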
Generally, with your setup you really shouldn't have a problem inserting more than 1 million records. I'd look into the part that is fetching the data first.
Just out of curiosity, does anybody know whether Neo4j and OrientDB implement caching of query results, that is, storing a query in a cache together with its result, so that subsequent requests for the same query are served without actually computing the result?
Note that this is different from caching part of the DB, since in that case the query would still be executed (possibly using only data taken from memory instead of from disk).
Starting from release v2.2 (not in SNAPSHOT, but it will be RC in a few days), OrientDB supports caching of command results. Caching command results has been used by other DBMSs and has proven to dramatically improve performance in the following use cases:
the database is read much more often than it is written
there are a few heavy queries that produce a small result set
you have RAM available to use for caching results
By default, the command cache is disabled. To enable it, set command.cache.enabled=true.
For more information: http://orientdb.com/docs/last/Command-Cache.html.
There are a couple of layers where you can put the caching. You can put it at the highest level behind Varnish ( https://www.varnish-cache.org ) or some other high-level cache. You can use a KV store like Redis ( http://redis.io ) and store a result with an expiration. You can also cache within Neo4j using extensions, anything from simple things like index look-ups to partial traversals or complete results. See http://maxdemarzi.com/2014/03/23/caching-partial-traversals/ or http://maxdemarzi.com/2015/02/27/caching-immutable-id-lookups-in-neo4j/ for some ideas.
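As a small illustration of the Redis option, here is a sketch that stores a serialized result under a hashed key with an expiration; the key prefix, the 30-minute TTL, and the run_query callback are all illustrative:

```python
# Tiny result-cache sketch over Redis: the query text is hashed into a key,
# and the serialized result is stored with an expiration. All names and the
# TTL are illustrative.
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def cached_query(query_text, run_query, ttl_seconds=1800):
    key = "neo4j:" + hashlib.sha1(query_text.encode("utf-8")).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)          # serve from cache, no DB work
    result = run_query(query_text)      # callback that actually hits the database
    r.setex(key, ttl_seconds, json.dumps(result))
    return result
```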
I need a Mongo cluster doing 2 operations:
get/update a single document - Mongo is great for realtime changes, excellent speed.
export all documents into JSON files (one file per category; there are about 15 categories) - this is very slow when I use a regular query. Maybe I do not know what command or options to use ... or I would need to fit the whole thing into RAM, which is expensive. Even replication to a new mongo instance is much faster (takes hours) than a query plus writing the data to disk (takes days).
I have about 10M documents. The Mongo data on disk is 250GB. There are about 15 categories for which I need separate files (at the moment all documents are in one collection regardless of category).
Which command should I use to export all data into files in a couple of hours?
How large an AWS instance should I use to speed it up without paying too much for RAM? Would it help? Operation 2) must not cause a performance hit for operation 1) -- I cannot stop Mongo and use mongoexport.
I am not sure what kind of servers you are using, but this may provide some further insight into the export/file-creation performance without shutting off Mongo. I presume you are working with a sharded and replicated cluster.
In my case I am on Azure VMs running Windows Server in a replicated and sharded cluster, so I would take a copy of the Azure blobs associated with the data disks on a secondary in each replica set. You should stop your balancer and lock the db on the secondary to do this. Copying only 250GB should take a couple of minutes at most. Then I would restore the blobs to disks on a new VM.
Then you could query data out of this VM without affecting your cluster's performance. You may additionally add indexes for this export process, since you are on a separate instance now.
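For the export itself, a rough sketch of streaming one category from the restored instance into a JSON-lines file with pymongo; the host, database, collection, and the category field are placeholders for whatever your real schema uses:

```python
# Sketch of exporting one category from the restored (separate) instance to a
# JSON-lines file. Host, database, collection, and the "category" field are
# placeholders.
from bson import json_util
from pymongo import MongoClient

client = MongoClient("mongodb://restored-copy:27017")
coll = client["mydb"]["documents"]

def export_category(category, path):
    with open(path, "w") as out:
        # no_cursor_timeout keeps the long-running cursor alive; documents
        # stream in batches instead of needing the whole category in RAM.
        cursor = coll.find({"category": category}, no_cursor_timeout=True)
        try:
            for doc in cursor:
                out.write(json_util.dumps(doc) + "\n")
        finally:
            cursor.close()

export_category("books", "books.json")
```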
Personally I use PowerShell to do this in Azure. Golang may be a better choice to write your queries in due to its parallel capabilities if JavaScript via the mongo shell fails you. I've had JS work faster than python code but it also depends on what you know.
This is just one way but it does address some of the criteria you posted.
I looked around, and apparently Infinispan has a limit on the number of keys you can store when persisting data to the FileStore. I get the "too many open files" exception.
I love the idea of TorqueBox and was anxious to slim down the stack and just use Infinispan instead of Redis. I have an app that needs to cache a lot of data. The queries are computationally expensive and need to be re-computed daily (phone and other productivity metrics by agent in a call center).
I don't run a cluster, though I understand the cache would persist if I had at least one app running. I would still prefer to persist the cache. Has anybody run into this issue and found a workaround?
Yes, Infinispan's FileCacheStore used to have an issue with opening too many files. The new SingleFileStore in 5.3.x solves that problem, but it looks like Torquebox still uses Infinispan 5.1.x (https://github.com/torquebox/torquebox/blob/master/pom.xml#L277).
I am also using the Infinispan cache in a live application.
Basically, we are storing database queries and their results in the cache, for tables which are not updatable and are smaller in data size.
There are two approaches to design it:
Use the query as the key and its data as the value.
This leads to too many entries in the cache when many different queries are placed into it.
Use a single key, say xyz, with a Map as the value (the Map contains the queries as keys and their data as values).
This leads to a single entry in the cache; whenever data is needed from this cache (I call it the query cache), retrieve the Map first using the key xyz, then look up the query in the Map itself.
We are using the second approach.
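As a language-neutral sketch of the two layouts (plain dictionaries standing in for the Infinispan cache, with illustrative queries and the key xyz):

```python
# The two layouts sketched with plain dicts standing in for the cache;
# "xyz" and the queries/results are illustrative.

# Approach 1: one cache entry per query.
cache_per_query = {
    "SELECT * FROM country": [("US",), ("DE",)],
    "SELECT * FROM currency": [("USD",), ("EUR",)],
}

# Approach 2: a single entry keyed by "xyz" whose value is a map of
# query -> result; lookups fetch the map first, then the query inside it.
cache_single_entry = {
    "xyz": {
        "SELECT * FROM country": [("US",), ("DE",)],
        "SELECT * FROM currency": [("USD",), ("EUR",)],
    }
}

result = cache_single_entry["xyz"]["SELECT * FROM country"]
```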
I want MongoDB to hold query results in RAM for a longer period of time (say 30 minutes, if memory is available). Is that possible? Or is there any way I can make sure that the data is pre-loaded into RAM before subsequent queries hit it?
In fact, I am wondering about simple query performance in MongoDB. I have a dedicated server with 10GB of RAM, and my db.stats() output is as follows:
db.stats();
{
    "db" : "test",
    "collections" : 16,
    "objects" : 625690,
    "avgObjSize" : 68.90,
    "dataSize" : 43061996,
    "storageSize" : 1121402888,
    "numExtents" : 74,
    "indexes" : 25,
    "indexSize" : 28207200,
    "fileSize" : 469762048,
    "nsSizeMB" : 16,
    "ok" : 1
}
Now, when I query a single document (as mentioned here) from a web service, it loads in 1.3 seconds. Subsequent calls of the same query respond in 400ms, and then after a few seconds it starts taking 1.3 seconds again. It looks like MongoDB has dropped the previously queried document from memory, even though there are no other queries asking for data mapped into RAM.
Please explain this and let me know of any way to make subsequent queries respond faster.
Your observed performance problem on an initial query is likely one of the following issues (in rough order of likelihood):
1) Your application / web service has some overhead to initialize on first request (i.e. allocating memory, setting up connection pools, resolving DNS, ...).
2) Indexes or data you have requested are not yet in memory, so need to be loaded.
3) The Query Optimizer may take a bit longer to run on the first request, as it is comparing the plan execution for your query pattern.
It would be very helpful to test the query via the mongo shell, and isolate whether the overhead is related to MongoDB or your web service (rather than timing both, as you have done).
Following are some notes related to MongoDB.
Caching
MongoDB doesn't have a "caching" time for documents in memory. It uses memory-mapped files for disk I/O and the documents in memory are based on your active queries (documents/indexes you've recently loaded) as well as the available memory. The operating system's virtual memory manager is in charge of caching, and typically will follow a Least-Recently Used (LRU) algorithm to decide which pages to swap out of memory.
Memory Usage
The expected behaviour is that over time MongoDB will grow to use all free memory to store your active working data set.
Looking at your provided db.stats() numbers (and assuming that is your only database), it looks like your database size is currently about 1GB, so you should be able to keep everything within your 10GB of total RAM unless:
there are other processes competing for memory
you have restarted your mongod server and those documents/indexes haven't been requested yet
In MongoDB 2.2, there is a new touch command you can use to load indexes or documents into memory after a server restart. This should only be used on initial startup to "warm up" the server, as otherwise you could be unhelpfully forcing actual "active" data out of memory.
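For example, a sketch of issuing touch from pymongo; the database and collection names are placeholders, and the command only exists in these older MongoDB releases:

```python
# Sketch of the MongoDB 2.2 touch command issued from pymongo; database and
# collection names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["test"]

# Load both the documents and the indexes of one collection into memory.
db.command({"touch": "mycollection", "data": True, "index": True})
```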
On a linux system, for example, you can use the top command and should see that:
virtual bytes/VSIZE will tend to be the size of the entire database
if the server doesn't have other processes running, resident bytes/RSIZE will be the total memory of the machine (this includes file system cache contents)
mongod should not use swap (since the files are memory-mapped)
You can use the mongostat tool to get a quick view of your mongod activity .. or more usefully, use a service like MMS to monitor metrics over time.
Query Optimizer
The MongoDB Query Optimizer compares plan execution for a query pattern every ~1,000 write operations, and then caches the "winning" query plan until the next time the optimizer runs .. or you explicitly call an explain() on that query.
This should be a straightforward one to test: run your query in the mongo shell with .explain() and look at the ms timings, and also the number of index entries and documents scanned. The timing for an explain() isn't the actual time the queries will take to run, as it includes the cost of comparing the plans. The typical execution will be much faster .. and you can look for slow queries in your mongod log.
By default MongoDB will log all queries slower than 100ms, so this provides a good starting point to look for queries to optimize. You can adjust the slow ms value with the --slowms config option, or using the Database Profiler commands.
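A short sketch of both checks from pymongo; the collection, the query, and the 100ms threshold are placeholders:

```python
# Sketch of the two checks described above, via pymongo. Collection, query,
# and the slowms threshold are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["test"]

# Look at the chosen plan and how many index entries / documents were scanned.
plan = db["mycollection"].find({"status": "active"}).explain()
print(plan)

# Turn on the profiler for operations slower than 100 ms (level 1 = slow ops only).
db.command("profile", 1, slowms=100)
```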
Further reading in the MongoDB documentation:
Caching
Checking Server Memory Usage
Database Profiler
Explain
Monitoring & Diagnostics