We will be using Virtuoso to store RDF data; the triple count will be 100 million to start with. I need to know the typical RAM, CPU, disk, etc. requirements for this. Querying will be with SPARQL, and some of the queries will be fairly complex.
Kindly provide your inputs.
The average size of a Virtuoso version 6.x triple (quad) is about 30 bytes, so 100 million triples need about 3 GB of RAM. Memory is the most critical component: for best performance the database working set should fit in memory, so that data does not need to be loaded from disk once the database is "warmed up". This matters especially when running complex queries. In terms of disk, the faster the disks are, the quicker the database can be loaded into memory, checkpoints performed, etc., so SSDs or similar devices are recommended where possible, especially if memory is limited and reading data from disk is at times unavoidable. In terms of processor, any standard commodity 64-bit processor available today will suffice, typically running a Linux x86_64 system of your choice. As said, though, memory is always the most critical component.
See the following Virtuoso FAQ and performance tuning documents for more details:
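The sizing arithmetic above can be sketched as follows. This is only a back-of-the-envelope estimate: the function name is made up for illustration, the 30 bytes per triple is the average quoted in this answer, and real memory use will also depend on indexes and configuration.

```python
def estimate_ram_gb(triple_count: int, bytes_per_triple: int = 30) -> float:
    """Rough working-set size in GB (decimal, 10**9 bytes).

    bytes_per_triple defaults to the ~30 byte average quoted for
    Virtuoso 6.x; treat it as an assumption, not a guarantee.
    """
    return triple_count * bytes_per_triple / 10**9


if __name__ == "__main__":
    # 100 million triples at ~30 bytes each
    print(estimate_ram_gb(100_000_000))
```

Leaving headroom above this figure for the operating system and query execution is a sensible precaution.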
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFPerformanceTuning
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/#FAQ
Related
I have written a script to bulk index a dataset with elasticsearch. It is working as intended, however, if I run the same script on the same dataset on different servers the execution time varies. In the server equipped with SSD, the 2 million documents are done indexing within 10 minutes, however on the one with normal hard disk, it takes up to an hour to complete. Is there a diagnostic tool I can make use of to figure out what causes the slow down?
Some additional information:
The script is written for Python 3 and uses the elasticsearch-py module for the bulk indexing.
Both servers run the same operating system (Ubuntu 14.04 LTS). The one with the slower hard drive has 64 GB of RAM, while the one with the SSD has half that.
You will run into index merges when a large number of records is ingested. Merging depends heavily on the speed of the underlying storage. RAM is not that significant here; it matters more for query performance. Disk latencies add up and cause a slowdown compared to the SSD platform.
Therefore, I am not surprised by the SSD speedup. SSD storage is faster than HDD by a factor of 3-8, depending on the manufacturer. If you take into account that HDDs also need to perform positioning (seek) operations to access different parts of the storage, it is clear that simply using an SSD instead of an HDD can accelerate disk-bound applications by a factor of 10 or more.
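One rough way to check whether storage is the limiting factor is to time synced writes on each server. The sketch below uses only the Python standard library; the function name and the default sizes are arbitrary choices for illustration, and it measures raw write-plus-fsync time, not Elasticsearch itself.

```python
import os
import tempfile
import time


def fsync_write_seconds(directory, chunk_bytes=1 << 20, chunks=16):
    """Write `chunks` blocks of `chunk_bytes` bytes into a temporary file
    in `directory`, forcing each block to disk with fsync, and return the
    elapsed wall-clock time in seconds."""
    payload = os.urandom(chunk_bytes)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            start = time.perf_counter()
            for _ in range(chunks):
                f.write(payload)
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # ask the OS to push it to the device
            elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the temporary file
    return elapsed
```

Calling it with a directory on the same disk as the Elasticsearch data path on both servers gives two comparable numbers; a large gap points at storage rather than at the script.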
Suppose you had a server with 24G RAM at your disposal, how much memory would you allocate to (Tomcat to run) eXist?
I'm setting up our new webserver, with an Intel Xeon E5649 (2.53GHz) processor, running Ubuntu 12.04 64-bit. eXist is running as a webapp inside Tomcat, and the db is only used for querying 'stable' collections --that is, no updates are being executed to the resources inside eXist.
I've been experimenting with different heap sizes (via -Xms and -Xmx settings when starting the Tomcat process), and so far haven't noticed much difference in response time for queries against eXist. In other words, it doesn't seem to matter much whether the JVM is allocated 4G or 16G. I have also upped the #cachesize and #collectionCache in eXist's WEB-INF/conf.xml file to e.g. 8192M, but this doesn't seem to have much effect. I suppose these settings /do/ have an influence when eXist is running inside Tomcat?
I know each situation is different (and I know there's a Tomcat server involved), but are there some rules of thumb for eXist performance w.r.t. the memory it is allocated? I'd like to get at a sensible memory configuration for a setup with a larger amount of RAM available.
This question was asked and answered on the exist-open mailing list. The answer from wolfgang#exist-db.org was:
Giving more memory to eXist will not necessarily improve response times. "Bad"
queries may consume lots of RAM, but the better your queries are optimized, the
less RAM they need: most of the heavy processing will be done using index
lookups and the optimizer will try to reduce the size of the node sets to be
passed around. Caching memory thus has to be large enough to hold the most
relevant index pages. If this is already the case, increasing the caching space
will not improve performance anymore. On the other hand, a too small cacheSize
or collectionCache will result in a recognizable bottleneck. For example, a
batch upload of resources or creating a backup can take several hours (instead
of e.g. minutes) if #collectionCache is too small.
If most of your queries are optimized to use indexes, 8gb RAM for eXist does
usually give you enough room to handle the occasional high load. Ideally you
could run some load tests to see what the maximum memory use actually is. For
#cacheSize, I rarely have to go beyond 512m. The setting for #collectionCache
depends on the number of collections and documents in the database. If you have
tens or hundreds of thousands of collections, you may have to increase it up to
768m or more. As I said above, you will recognize a sudden breakdown in
performance during uploads or backups if the collectionCache becomes too small.
So to summarize, a reasonable setting for me would be: -Xmx8192m,
#cacheSize="512m", #collectionCache="768m". If you can afford giving 16G main
memory it certainly won’t hurt. Also, if you are using the lucene index or the
new range index, you should consider increasing the #buffer setting in the
corresponding index module configurations in conf.xml as well:
<module id="lucene-index" buffer="256" class="org.exist.indexing.lucene.LuceneIndex" />
<module id="range-index" buffer="256" class="org.exist.indexing.range.RangeIndex"/>
How much data can the free version of Datomic handle in terms of storage and throughput? As far as I know, the free transactor uses H2 to store a local embedded database.
What stops me from using this in production, ignoring the obvious lack of storage redundancy and limited number of peers (1)?
Datomic Pro Starter Edition is probably more of a go-to than the free edition, since it supports the same storage services as the licensed Pro version. In terms of local storage, I'd say it can store as much data as there is space available on the disk. The transactor handles all writes, so I'd imagine your only potential bottlenecks for throughput would be the hard disk or, if you're throwing tons of transactions at it, waiting for the indexing jobs to complete as your data grows. Indexing can get computationally expensive over time, and it's best to have GC and excision jobs in place if you're handling that much volume.
We're about to buy new hardware to run our analyses and are wondering if we're making the right decisions.
The setting:
We're a bioinformatics lab that will be handling DNA sequencing data. The biggest issue in our field is the amount of data rather than the compute. A single experiment will quickly run into the tens to hundreds of GB, and we typically run several experiments at the same time. Obviously, MapReduce approaches are interesting (see also http://abhishek-tiwari.com/2010/08/mapreduce-and-hadoop-algorithms-in-bioinformatics-papers.html), but not all of our software uses that paradigm. Also, some software uses ASCII files as input/output while other software works with binary files.
What we might be buying:
The machine that we might be buying would be a server with 32 cores and 192 GB of RAM, linked to NAS storage (>20 TB). This seems a very interesting setup for many of our (non-MapReduce) applications, but will such a configuration prevent us from implementing Hadoop/MapReduce/HDFS in a meaningful way?
Many thanks,
jan.
You have an interesting configuration. What is the disk I/O throughput of your NAS storage?
Make your decision based on the following:
The MapReduce paradigm is used to solve the problem of handling large amounts of data. Basically, RAM is more expensive than disk storage, and you cannot hold all the data in RAM. Disk storage lets you store large amounts of data at lower cost, but the speed at which you can read data from a disk is not very high. How does MapReduce solve this problem? It distributes the data over multiple machines, so that the speed at which you can read data in parallel is greater than what a single disk can deliver. Suppose the disk I/O speed is 100 Mbps. With 100 machines you can read the data at 100*100 Mbps = 10 Gbps.
Typically, processor speed is not the bottleneck. Rather, disk I/O is the big bottleneck when processing large amounts of data.
I have a feeling that running Hadoop on a single server backed by NAS storage may not be very efficient.
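The parallel-read arithmetic can be written out as a small sketch. The function names are illustrative only, and the model assumes the data is spread perfectly evenly and ignores network and coordination overhead.

```python
def aggregate_read_mbps(machines: int, per_disk_mbps: float) -> float:
    """Ideal combined read rate when every machine reads its own disk
    in parallel (no network or scheduling overhead assumed)."""
    return machines * per_disk_mbps


def scan_seconds(data_megabits: float, machines: int, per_disk_mbps: float) -> float:
    """Ideal time to read `data_megabits` of evenly distributed data once."""
    return data_megabits / aggregate_read_mbps(machines, per_disk_mbps)


if __name__ == "__main__":
    # 100 machines at 100 Mbps each, as in the example above
    print(aggregate_read_mbps(100, 100))  # in Mbps
```

With a single server reading from one NAS, `machines` is effectively 1, so the scan time is bounded by the NAS throughput alone.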
Suppose I wanted to develop the Stack Overflow website. How do I estimate the amount of commodity hardware required to support it, assuming 1 million requests per day? Are there any case studies that explain the performance improvements possible in this situation?
I know the I/O bottleneck is the major bottleneck in most systems. What are the possible options to improve I/O performance? A few that I know of are:
caching
replication
You can improve I/O performance in several ways depending upon what you use for your storage setup:
Increase filesystem block size if your app displays good spatial locality in its I/Os or uses large files.
Use RAID 10 (striping + mirroring) for performance + redundancy (disk failure protection).
Use fast disks (Performance Wise: SSD > FC > SATA).
Segregate workloads at different times of day. e.g. Backup during night, normal app I/O during day.
Turn off atime updates in your filesystem.
Cache NFS file handles a.k.a. Haystack (Facebook), if storing data on NFS server.
Combine small files into larger chunks, a.k.a BigTable, HBase.
Avoid very large directories i.e. lots of files in the same directory (instead divide files between different directories in a hierarchy).
Use a clustered storage system (yeah not exactly commodity hardware).
Optimize/design your application for sequential disk accesses whenever possible.
Use memcached. :)
You may want to look at "Lessons Learned" section of StackOverflow Architecture.
Check out this handy tool:
http://www.sizinglounge.com/
And another guide from Dell:
http://www.dell.com/content/topics/global.aspx/power/en/ps3q01_graham?c=us&l=en&cs=555
If you want your own Stack Overflow-like community, you can sign up with StackExchange.
You can read some case studies here:
High Scalability - How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
http://www.gear6.com/gear6-downloads?fid=56&dlt=case-study&ls=Veoh-Case-Study
1 million requests per day is about 12 per second. Stack Overflow is small enough that you could (with interesting normalization and compression tricks) fit it entirely in the RAM of a 64 GB Dell PowerEdge 2970. I'm not sure where caching and replication should play a role.
If you would rather not think that hard about normalization, a PowerEdge R900 with 256 GB is available.
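The conversion from requests per day to requests per second is simple enough to check in a few lines. The function name is illustrative, and the result is an average that says nothing about peaks.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(requests_per_day: float) -> float:
    """Average requests per second, assuming load is spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(round(average_rps(1_000_000)))  # rounds to 12
```

Real traffic is bursty, so capacity should be planned against the expected peak rather than this average.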
If you don't like a single point of failure, you can connect a few of those and just push updates over a socket (preferably on a separate network card). Even a peak load of 12K/second should not be a problem for a main-memory system.
The best way to avoid the I/O bottleneck is to not do I/O (as much as possible). That means a prevayler-like architecture with batched writes (no problem to lose a few seconds of data), basically a log file, and for replication also write them out to a socket.
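A minimal sketch of that idea, assuming a plain key-value workload: all reads are served from memory, and writes are appended to a log file in batches. The class and method names are invented for illustration and this is not Prevayler's API. Replication over a socket is left out, and a crash loses whatever is still in the unflushed batch, which is the trade-off described above.

```python
import json
import os


class LoggedStore:
    """In-memory dict whose updates are appended to a log file in batches."""

    def __init__(self, log_path, batch_size=100):
        self.log_path = log_path
        self.batch_size = batch_size
        self.data = {}      # the whole data set lives in RAM
        self.pending = []   # serialized updates not yet on disk
        self._replay()

    def _replay(self):
        """Rebuild the in-memory state from the log at startup."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                key, value = json.loads(line)
                self.data[key] = value

    def put(self, key, value):
        """Apply the update in memory and queue it for the next batch write."""
        self.data[key] = value
        self.pending.append(json.dumps([key, value]))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def get(self, key, default=None):
        """Reads never touch the disk."""
        return self.data.get(key, default)

    def flush(self):
        """Append all queued updates with one sequential write and one fsync."""
        if not self.pending:
            return
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write("\n".join(self.pending) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.pending.clear()
```

Calling `flush()` on a timer as well as on batch size would bound the amount of data at risk to a few seconds of writes.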