advice for choosing linux filesystem for neo4j - amazon-ec2

I'm looking for advice for choosing and configuring Linux filesystem for storing neo4j database.
Of course one should always test with ones own workload, but in general, is there any advice regarding which filesystem would perform best?
Based on this page http://grokbase.com/t/gg/neo4j/131grvg09k/best-filesystem-for-new-neo4j-persistant-storage, and if I understand neo's write patterns correctly, xfs would be preferred.
I've also read the two Linux-specific pages from the manual, but neither seems to give guidance for choosing the filesystem.
docs.neo4j.org/chunked/stable/configuration-linux-notes.html
docs.neo4j.org/chunked/stable/linux-performance-guide.html
If there is a difference in choosing filesystem for HDD vs SSD, please mention your thoughts on both.
If there is a special consideration for EC2 / EBS, I would also love to know, as this is where I'm running currently.
If it matters, here's information on the type of graph and workload I'm planning to house:
* product catalog-style graph, with 100s of millions of nodes with large and small properties, and billions of relationships.
* main use - traversals of 100-500K node subsets to answer queries (desired sub-second response)
* periodic updates of 100-500K nodes via bulk uploads (20-30 minutes is ok for this)
Thanks so much!

If your application is mainly read-driven, I wouldn't agonize too much over choosing the right filesystem. Focus instead on sizing the MMIO caches to fit your file storage; once the cache is warmed up, a read operation will not touch the IO subsystem at all.
When it comes to write operations, however, SSDs generally perform far better than HDDs. Ext4 seems to be the most widely used filesystem for Neo4j. On EC2 you might benefit from SSD-backed instances, though this depends on your volume of write operations.
Generally speaking, it's best practice to generate a graph database of approximately the size of the intended production system and run your checks against that beforehand. Premature optimization is usually a mistake.
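To put numbers behind a filesystem choice before committing, a crude throughput check is easy to script. This is only a minimal Python sketch (a stand-in for a proper tool like fio); the file size, block size, and file name are arbitrary assumptions:

```python
import os
import tempfile
import time


def bench_write(path, size_mb=64, block_kb=256):
    """Measure sequential write throughput (MB/s) for a file under `path`."""
    block = os.urandom(block_kb * 1024)
    blocks = size_mb * 1024 // block_kb
    fname = os.path.join(path, "bench.dat")  # illustrative file name
    start = time.perf_counter()
    with open(fname, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # include the time for data to reach the device
    elapsed = time.perf_counter() - start
    os.remove(fname)
    return size_mb / elapsed


if __name__ == "__main__":
    # Point this at a directory on the mount you want to test.
    with tempfile.TemporaryDirectory() as d:
        print(f"write: {bench_write(d):.1f} MB/s")
```

Run it against a directory on each candidate filesystem (and with a file larger than RAM if you want to defeat the page cache) to get a first-order comparison.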

I'm not sure of the specifics for Neo4j, but MongoDB works much better on XFS. Ext3/4 did not handle sparse file allocation well (30s vs 0.1s on XFS), nor deleting many files quickly.
That said, the advice about benchmarking is good. I wouldn't consider any filesystems besides ext4 and XFS until Btrfs is production-ready.
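The sparse-file behavior mentioned above is easy to observe directly. Here is a minimal Python sketch, assuming a filesystem that supports sparse files (the file name and sizes are illustrative; `st_blocks` is POSIX-specific):

```python
import os
import tempfile
import time


def time_sparse_create(path, logical_bytes=1 << 30):
    """Time creating a sparse file of `logical_bytes` (1 GiB by default)."""
    fname = os.path.join(path, "sparse.dat")
    start = time.perf_counter()
    with open(fname, "wb") as f:
        f.truncate(logical_bytes)  # no data written; blocks allocated lazily
    elapsed = time.perf_counter() - start
    logical = os.path.getsize(fname)            # reports the full logical size
    physical = os.stat(fname).st_blocks * 512   # actual bytes on disk
    os.remove(fname)
    return elapsed, logical, physical


elapsed, logical, physical = time_sparse_create(tempfile.gettempdir())
print(f"1 GiB sparse file in {elapsed:.4f}s, {physical} bytes actually on disk")
```

On a filesystem that handles sparse files well, the physical size stays near zero and the creation time is negligible regardless of the logical size.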

Related

Loading petabytes of data at scale

I need to load petabytes of text data into storage (RAM/SSD) within a second.
Below are some of the questions involved in solving this problem:
1) Is it practically/theoretically possible to load petabytes of data in a second?
2) What would be the best design approach to achieve loading of petabyte-scale data in under a second?
3) Are there any benchmarking approaches available?
I am open to implementing this with any kind of technology, such as Hadoop, Spark, HPCC, etc.
"Petabytes ... within a second" -- seriously? Please check the Wikipedia article on Petabyte: that is 1,000,000 GB!
Also check Wikipedia on memory bandwidth. Even the fastest RAM cannot handle more than a few tens of GB/s (in practice this is far lower).
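A back-of-the-envelope calculation makes the point concrete. Assuming a generous 10 GB/s of sustained effective bandwidth (an optimistic figure, as noted above):

```python
PETABYTE = 10**15        # bytes (decimal petabyte)
bandwidth = 10 * 10**9   # assumed sustained bandwidth: 10 GB/s

seconds = PETABYTE / bandwidth
hours = seconds / 3600
print(seconds)  # 100000.0 seconds
print(hours)    # roughly 27.8 hours, not one second
```

Even under this optimistic assumption, moving one petabyte takes on the order of a day, about five orders of magnitude off the one-second target.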
Just curious: what is your use-case?
No, it is not technically possible at this time. Not even RAM is fast enough (not to mention the obvious capacity constraints). The fastest SSDs (M.2 drives) reach write speeds around 1.2 GB/s, and with RAID 0 you might achieve perhaps 3 GB/s at most. There are also economic constraints, as those drives are quite expensive by themselves. So to answer your question: those speeds are technically impossible at the current time.
From HPCC perspective...
Thor is designed to load data and supports multiple servers; however, the biggest cluster I have heard about is around 4000 servers. Thor is designed to load a lot of data over a long time (even a week).
Roxie, on the other hand, is designed to serve data quickly, but it is not what you are asking for... nor could it serve petabytes in under a second.

Comparing Druid and PipelineDB

I have been working on aggregation of streaming data and found two tools for the purpose: Druid and PipelineDB. I have understood the implementation and architecture of both, but I couldn't figure out a way to benchmark them against each other. Is there an existing benchmark test? Or, if I want to do my own benchmarking, what factors should I consider apart from speed and scalability? Any ideas, links and help would be really appreciated. Also, do share your own experience with PipelineDB and Druid.
Thanks
UPD:
After reading the PipelineDB pages, I only wonder why you need to compare such different things.
Druid is quite complex to install and maintain; it requires several external dependencies (such as ZooKeeper and HDFS/Amazon, which must be maintained too).
For that price you get the key features of Druid: column-oriented, distributed storage and processing. That also implies horizontal scalability out of the box, and it is completely automatic; you don't even have to think about it.
So if you don't need its distributed nature, I'd say you don't need Druid at all.
FIRST VERSION:
I have no experience with PipelineDB (what is it? Google shows nothing, please share a link), but I have a lot of experience with Druid. So I would consider (apart from [query] speed and scalability):
* ingestion performance (how many rows per sec/min/hour/... can be inserted?)
* RAM consumption of ingestion (how much RAM does it need to ingest at the target speed?)
* compression level (how much disk space does one hour/day/month/... of data need?)
* fault tolerance (what happens when some of the components fail? is that critical for my business?)
* caching (just keep it in mind)
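As a sketch of how the first factor might be measured: the snippet below times bulk ingestion and reports rows/sec, using an in-memory SQLite table purely as a stand-in target (neither Druid nor PipelineDB is involved; the schema and row count are arbitrary assumptions):

```python
import sqlite3
import time


def measure_ingest(rows=100_000):
    """Return rows/sec for a bulk insert into an illustrative stand-in store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts REAL, value INTEGER)")
    data = [(float(i), i % 100) for i in range(rows)]  # synthetic event rows
    start = time.perf_counter()
    conn.executemany("INSERT INTO events VALUES (?, ?)", data)
    conn.commit()
    elapsed = time.perf_counter() - start
    conn.close()
    return rows / elapsed


if __name__ == "__main__":
    print(f"{measure_ingest():,.0f} rows/sec")
```

For a real comparison you would point the same harness at each system's ingestion API and also record RAM use and on-disk size per unit of data, per the list above.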

Best practices for use of Neo4j on Google Compute Engine / Amazon EC2 instances

There is a very nice guide on optimizing a Linux machine for Neo4j, but it assumes the typical characteristics of a physical hard drive. I am running my Neo4j instances on Google Compute Engine and Amazon EC2, and I am unable to find any document detailing an optimal setup for these virtual machines. What resources do I need in terms of memory (for heap or extended use) and disk speed / IOPS to get optimal performance? I currently have a couple of million nodes and about ten million relationships (2 GB), and the data size is increasing with imports.
On EC2 I used to rely on SSD scratch disks and then make regular backups to permanent disks. There is no such thing available on Compute Engine, and the write speeds don't seem very high to me, at least at normal disk sizes (because speed changes with size). Is there any way to get reasonable performance on my import/index operations? Or do these operations depend more on memory and compute capacity?
Any additional reading is welcome...
Use local disks whenever possible; SSDs are better than spinning disks; try provisioned IOPS on AWS.
EBS is not a good fit: it is slow and jittery.
No idea for Compute Engine, though; there you might want to use more RAM and try to load larger parts of the graph into memory.
Additional reading: http://structr.org/blog/neo4j-performance-on-ext4
You should still check the other things mentioned in that blog post, like the Linux scheduler, write barriers, etc.
It's better to set the memory-mapping settings manually. And for the second-level cache, probably check out the enterprise version with the HPC cache.
See also this webinar on hardware sizing: https://vimeo.com/46049647
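The "use more RAM" advice above can be turned into a rough sizing sketch. The growth factor, heap size, and OS reserve below are illustrative assumptions for a back-of-envelope estimate, not official Neo4j guidance:

```python
def suggest_memory_gb(store_gb, growth_factor=2.0, heap_gb=8, os_gb=2):
    """Rough RAM estimate: cache the whole store plus growth headroom.

    growth_factor: assumed headroom for data growth (hypothetical).
    heap_gb:       assumed JVM heap for the workload (hypothetical).
    os_gb:         reserve for the OS and other processes (hypothetical).
    """
    page_cache_gb = store_gb * growth_factor
    return page_cache_gb + heap_gb + os_gb


# For the 2 GB store mentioned in the question:
print(suggest_memory_gb(2))  # 14.0 GB total RAM under these assumptions
```

The point of the exercise: with a store this small, a single instance sized so the whole graph fits in memory sidesteps most of the EBS/persistent-disk IOPS concerns for reads; disk speed then mainly affects imports and checkpointing.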

Oracle (RAC) vs NoSQL

I am curious if anybody did benchmarks for accessing of data in NoSQL databases vs Oracle (particularly I am talking about Oracle RAC)?
The project requires working with at least 10 million+ records and searching among them (though not necessarily in real time). Read speed is very important, and it's also very important to guarantee HA and reliability (we can't lose records!!!).
I can see for myself how, say, Cassandra/MongoDB might be a better fit (because a key-value store will provide faster reads than SQL once you go over 10 million records), but I find it difficult to articulate all of this nicely. Any links? Suggestions? Bullet points?
Thanks!
10 million records. Assume 250 bytes per record. That is about 2.5 GB of data, which is well within the capacity of a basic desktop / laptop PC. The data volumes are insignificant (unless each record is sized in MB, such as a picture or audio).
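The arithmetic, spelled out (the 250-byte average is the assumption stated above):

```python
records = 10_000_000
bytes_per_record = 250  # assumed average record size

total_gb = records * bytes_per_record / 10**9
print(total_gb)  # 2.5 GB -- trivially small for any modern database
```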
What you do need to talk about is transaction volumes (separated into read and write) and what you consider HA. Read-only HA is easy relative to "Read-write HA". It can be trivial to replicate a read-only data set off to multiple servers at different geographic locations and distribute a query workload on them.
It's much harder to scale out an update heavy workload, which is why you often hear about systems going into meltdown when tickets for a big concert are released. Quite simply there's a fixed number of seats and you can't have ten duplicated systems each selling what they think is available. There has to be a single source of truth, which means a bottleneck (and potentially a single point of failure).
On the HA aspect, RAC is a shared storage technology which generally means your RAC nodes are in close proximity. That can make them vulnerable to localized events such as a building fire or telecoms breakdown. Data Guard is the Oracle technology that relates to off-site replication and failover.
When you come to a comparison of NoSQL vs SQL, you have to understand a very important difference between them: data in NoSQL may be inconsistent, and that is the cost of achieving HA.
What do I mean by inconsistent? It depends, but usually it takes around 3-5 seconds to propagate data around the nodes. NoSQL databases provide mechanisms to manage and eliminate that, but if you want all your data to be consistent in real time, then you simply use classic SQL, like Oracle RAC.
Coming back to the speed comparison: it's simply not possible to say which one is faster, because it depends on factors like network infrastructure, computing power, the data model, etc. But the important thing is that at some point you may reach the moment where SQL is economically inefficient to maintain, and you have to switch to NoSQL.

Storing images in NoSQL stores

Our application will be serving a large number of small, thumbnail-size images (about 6-12 KB each) over HTTP. I've been asked to investigate whether using a NoSQL data store is a viable solution for data storage. Ideally, we would like our data store to be fault-tolerant and distributed.
Is it a good idea to store blobs in NoSQL stores, and which one is good for it? Also, is NoSQL a good solution for our problem, or would we be better served storing the images in the file system and serving them directly from the web server (as an aside, CDN is currently not an option for us)?
Whether to store images in a DB or the filesystem is sometimes one of those "holy war" type debates; each side feels their way of doing things is the one right way. In general:
To store in the DB:
Easier to manage back-up/replicate everything at once in one place.
Helps with your data consistency and integrity. You can set the BLOB field to disallow NULLs, but you're not going to be able to prevent an external file from being deleted. (Though this isn't applicable to NoSQL since there aren't the traditional constraints).
To store on the filesystem:
A filesystem is designed to serve files. Let it do its job.
The DB is often your bottleneck in an application. Whatever load you can take off it, the better.
Easier to serve on a CDN (which you mentioned isn't applicable in your situation).
I tend to come down on the side of the filesystem because it scales much better. But depending on the size of your project, either choice will likely work fine. With NoSQL, the differences are even less apparent.
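If you go the filesystem route, the usual trick for millions of small files is hash-based directory sharding, so that no single directory grows huge. A minimal sketch (the two-level layout, SHA-1, and the .jpg extension are illustrative conventions, not requirements):

```python
import hashlib
import os


def thumbnail_path(root, image_id):
    """Map an image ID to a two-level sharded path under `root`.

    Example layout: root/ab/cd/<full-hash>.jpg, so each directory holds
    at most a small fraction of the total file population.
    """
    h = hashlib.sha1(image_id.encode("utf-8")).hexdigest()
    return os.path.join(root, h[:2], h[2:4], h + ".jpg")


print(thumbnail_path("/srv/thumbs", "product-42"))
```

With 2-hex-digit shards that is 256 x 256 = 65,536 buckets; 100 million thumbnails averages about 1,500 files per directory, which any modern filesystem handles comfortably, and the scheme maps directly onto web-server document roots or rsync-based replication.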
Mongo DB should work well for you. I haven't used it for blobs yet, but here is a nice FLOSS Weekly podcast interview with Michael Dirolf from the Mongo DB team where he addresses this use case.
I was looking for a similar solution for a personal project and came across Riak, which, to me, seems like an amazing solution to this problem. Basically, it distributes a specified number of copies of each file to the servers in the network. It is designed such that a server coming or going is no big deal. All the copies on a server that leaves are distributed amongst the others.
With the right configuration, Riak can deal with an entire datacenter crashing.
Oh, and it has commercial support available.
Well CDN would be the obvious choice. Since that's out, I'd say your best bet for fault tolerance and load balancing would be your own private data center (whatever that means to you) behind 2 or more load balancers like an F5. This will be your easiest management system and you can get as much fault tolerance as your hardware budget allows. You won't need any new software expertise, just XCOPY.
For true fault tolerance you're going to need geographic dispersion or you're subject to anyone with a backhoe.
(Gravatars?)
If you are in a Python environment, consider the y_serial module: http://yserial.sourceforge.net/
In under 10 minutes, you will be able to store and access your images (in fact, any arbitrary Python object including webpages) -- in compressed form; NoSQL.
