I'm importing a fairly large file as a graph into OrientDB: 11M edges and 20,000 nodes, and it is taking far too much time.
Is there a way to optimize the graph load, or to get the most out of a machine with 16 GB of RAM?
My first question is: why is it taking so much time?
Second, how can I optimize it?
Some advice for a fast import:
use a plocal connection if you can
use a transactional connection and commit in batches of ~500 records
try to avoid reloading vertices frequently. Most of the time, the biggest part of the time needed to insert a new edge is spent looking up the two vertices.
if your graph is not huge and the use case is simple enough, you can have a look at http://orientdb.com/docs/2.2.x/Graph-Batch-Insert.html
if your main concern is insertion speed, OrientDB ETL is not the best choice; use some custom Java code instead (a rough sketch follows below)
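As a rough, untested sketch of what such a custom Java loader could look like, using the OrientDB 2.2.x Blueprints API (the database path, the Node class, the LINKS label, the nodeId property and the readEdges() parser are all placeholders): with only ~20,000 vertices you can keep every vertex in a map, so inserting the 11M edges never has to look the endpoints up again, and you commit every ~500 records.

import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;

import java.util.HashMap;
import java.util.Map;

public class EdgeLoader {
    public static void main(String[] args) {
        // plocal avoids the network round trip of a remote connection
        OrientGraphFactory factory = new OrientGraphFactory("plocal:/data/mygraph"); // example path
        OrientGraph graph = factory.getTx();

        // With only ~20K vertices, keep them all in memory so each edge insert
        // does not have to look both endpoints up again (the usual bottleneck).
        Map<String, Vertex> vertexCache = new HashMap<>();

        int batchSize = 500;   // commit in batches of ~500 records
        int count = 0;
        for (String[] edge : readEdges()) {          // readEdges() stands in for your file parser
            Vertex from = vertexCache.computeIfAbsent(edge[0],
                    key -> graph.addVertex("class:Node", "nodeId", key));
            Vertex to = vertexCache.computeIfAbsent(edge[1],
                    key -> graph.addVertex("class:Node", "nodeId", key));
            graph.addEdge(null, from, to, "LINKS");

            if (++count % batchSize == 0) {
                graph.commit();                      // keep transactions small
            }
        }
        graph.commit();
        graph.shutdown();
        factory.close();
    }

    // Placeholder: yield [fromId, toId] pairs parsed from the input file.
    private static Iterable<String[]> readEdges() {
        throw new UnsupportedOperationException("parse your input file here");
    }
}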
I want to retrieve a large number of items using a limit clause:
g.V().hasLabel('foo').as('f').limit(5000).order().by('f_Id',incr).by('f_bar',incr).select('f').unfold().dedup()
This query takes a very long time and consumes about 800 MB of memory to download the collection.
When I use the query below:
g.V().hasLabel('foo').as('f').has('propA','ValueA').has('propB','ABC').limit(5000).order().by('f_Id',incr).by('f_bar',incr).select('f').unfold().dedup()
it is faster and consumes less memory, around 500 MB, to download the collection, but that is still high.
My question is how to optimize the first query, with just the limit, if I do not want to filter by properties A and B.
Second question: why is there such a difference in memory usage between those two results? In both queries I download 5000 items into memory. What would be a possible way to reduce this consumption?
I use the Gremlin driver for .NET.
I'm not an expert at CosmosDB optimization, but from a Gremlin perspective, when I look at this traversal:
g.V().hasLabel('foo').as('f').
limit(5000).order().by('f_Id',incr).by('f_bar',incr).
select('f').unfold().dedup()
I wonder why you wouldn't just write it as:
g.V().hasLabel('foo').limit(5000).order().by('f_Id',incr).by('f_bar',incr)
Meaning, you want 5000 "foo" vertices ordered a certain way. The "f" step label and unfold() seem unnecessary, and I don't see how you could end up with duplicates, so you can drop dedup(). I'm not sure if those changes will make any difference to how CosmosDB processes things, but they certainly remove some unneeded processing.
I'd also wonder if you need to pare down the data returned in your vertices. Right now you're returning all the properties for each vertex. If you don't need all of those, it might be better to be more specific and transform the data to the form your application requires:
g.V().hasLabel('foo').limit(5000).order().by('f_Id',incr).by('f_bar',incr).
valueMap('name','age')
That should help reduce serialization costs.
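For what it's worth, here is roughly how that trimmed-down traversal could be submitted from a JVM client with Apache TinkerPop's gremlin-driver (the question uses the .NET driver, whose API differs; the endpoint, credentials and the 'name'/'age' properties are placeholders, and CosmosDB specifics such as the GraphSON serializer are omitted):

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.Result;

import java.util.List;

public class FooQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials; a CosmosDB Gremlin endpoint
        // also needs the matching GraphSON serializer, omitted here for brevity.
        Cluster cluster = Cluster.build("your-endpoint.example.com")
                .port(443)
                .enableSsl(true)
                .credentials("/dbs/yourdb/colls/yourgraph", "your-key")
                .create();
        Client client = cluster.connect();

        // Only the two properties we actually need are serialized back.
        String gremlin = "g.V().hasLabel('foo')" +
                ".limit(5000).order().by('f_Id',incr).by('f_bar',incr)" +
                ".valueMap('name','age')";

        List<Result> results = client.submit(gremlin).all().get();
        System.out.println("fetched " + results.size() + " rows");

        cluster.close();
    }
}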
I am using Postgres, and I have a Ruby task that updates the contents of an entire table every hour. Currently this is achieved by updating the table in batches. However, I am not exactly sure how to find an optimal batch size. Is there a formula or standard for determining an appropriate batch size?
In my opinion there is no theoretical optimal batch size. The optimal batch size will surely depend on your application model, the internal structure of the accessed tables, the query structure and so on. The only reliable way I see to determine it is benchmarking.
There are some optimization tips that can help you build a faster application, but these tips cannot be followed blindly, because many of them have corner cases where they cannot be applied successfully. Again, the only way to determine whether a change (adding an index, changing the batch size, enabling the query cache...) improves performance is to benchmark before and after every single change.
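The question is about a Ruby task, but as a language-neutral sketch of what "benchmark it" can look like, here is a rough JDBC example that times the same update workload at several candidate batch sizes (the connection string, the items table and the workload itself are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchSizeBenchmark {
    public static void main(String[] args) throws Exception {
        int[] candidateSizes = {100, 500, 1_000, 5_000, 10_000};   // sizes to compare

        for (int batchSize : candidateSizes) {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "secret")) {  // placeholder DSN
                conn.setAutoCommit(false);
                long start = System.nanoTime();

                try (PreparedStatement ps = conn.prepareStatement(
                        "UPDATE items SET price = ? WHERE id = ?")) {         // hypothetical table
                    int inBatch = 0;
                    for (int id = 1; id <= 100_000; id++) {                   // sample workload
                        ps.setBigDecimal(1, java.math.BigDecimal.valueOf(id * 1.1));
                        ps.setInt(2, id);
                        ps.addBatch();
                        if (++inBatch % batchSize == 0) {
                            ps.executeBatch();
                            conn.commit();
                        }
                    }
                    ps.executeBatch();
                    conn.commit();
                }

                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("batch size %d -> %d ms%n", batchSize, elapsedMs);
            }
        }
    }
}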
I have an OrientDB graph database. Some of the classes are pretty large (i.e. > 60M records); some are somewhat smaller. OrientDB is pretty fast at searching when indexed.
Some of the properties I indexed run at nearly 100K records/s, while some others start at 5K/s and then slow down towards 10/s.
This is partly due to the maxHeap and bufferSize settings.
I could not find a helpful page on how to compute the server.bat/sh settings for indexing certain types of data.
Does anybody have experience indexing large (>>10M) sets of items?
Do I have to restart the server each time I index a large set or start creating edges? Does the index type matter with respect to indexing speed?
I just started learning Hadoop. The official guide mentions that doubling the number of machines in the cluster lets you query twice as much data just as fast as before, whereas a traditional RDBMS would still spend twice as much time on the query. I cannot grasp the relation between the cluster and processing data. I hope someone can give me some idea.
It's the basic idea of distributed computing.
If you have one server working on data of size X, it will spend time Y on it.
If you have 2X data, the same server will (roughly) spend 2Y time on it.
But if you have 10 servers working in parallel (in a distributed fashion) and they all have the entire data (X), then they will spend Y/10 time on it. You would gain the same effect by having 10 times more resources on the one server, but usually that is not feasible (increasing CPU power 10-fold is not very realistic, for example).
This is of course a very rough simplification, and Hadoop doesn't store the entire dataset on all of the servers - just the needed parts. Hadoop keeps a subset of the data on each server, and the servers work on the data they have to produce one "answer" in the end. This requires communication and protocols to agree on what data to share, how to share it, how to distribute it and so on - this is what Hadoop does.
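As a concrete illustration of "each server works on the data it has and the partial results are combined into one answer", here is roughly the classic word-count MapReduce job from the Hadoop documentation: every mapper processes only the input split stored on (or near) its own node, and the reducers merge the per-node partial counts into the final result.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each mapper sees only the block(s) of input stored on (or near) its node.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);      // emit a partial result: (word, 1)
            }
        }
    }

    // Reducers combine the partial counts from all mappers into the final answer.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // pre-aggregate on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}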
I'm using a batch inserter to create a database with about 1 billion nodes and 10 billion relationships. I've read in multiple places that it is preferable to sort the relationships in order min(from, to) (which I didn't do), but I haven't grasped why this practice is optimal. I originally thought this only aided insertion speed, but when I turned the database on, traversal was very slow. I realize there can be many reasons for that, especially with a database this size, but I want to be able to rule out the way I'm storing relationships.
Main question: does it kill traversal speed to insert relationships in a very "random" order because of where they will be stored on disk? I'm thinking that maybe when it tries to traverse nodes, the relationships are too fragmented. I hope someone can enlighten me about whether this would be the case.
UPDATES:
Use-case is pretty much the basic Neo4j friends of friends example using Cypher via the REST API for querying.
Each node (person) is unique and has a bunch of "knows" relationships for the people they know. Although I have a billion nodes, all of the 10 billion relationships come from about 30 million of the nodes. So any starting node I use in my query has an average of about 330 relationships coming from it.
In my initial tests, even getting 4 unordered friends-of-friends results was incredibly slow (100+ seconds on average). Of course, after the cache was warmed up for each query it was fairly quick, but the graph is pretty random and I can't have the whole relationship store in memory.
Some of my system details, if that's needed:
- Neo4j 1.9.RC1
- Running on a Linux server, 128 GB of RAM, 8 cores, non-SSD HD
I have not worked with Neo4j on such a large scale, but as far as I know this won't make much difference to the speed. Could you provide any links which state that the order of insertion matters?
What matters in this case is whether the relationships are cached or not. Until the cache is fairly well populated, performance will be on the slower side. You should also set an appropriate cache size as soon as the index is created.
You should read this link regarding Neo4j performance.
Read the Neo4j documentation on batch insertion and these SO questions for help with bulk inserts, if you haven't already read them.
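For reference, a stripped-down sketch of what the batch inserter API looks like in the Neo4j 1.9 era (the store path, memory-mapping sizes, property names and the KNOWS type are illustrative; whether pre-sorting the relationship list by min(from, to) actually helps is exactly the open question above):

import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class FriendLoader {
    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("neostore.nodestore.db.mapped_memory", "1G");          // example sizes only,
        config.put("neostore.relationshipstore.db.mapped_memory", "16G"); // tune to your store files

        BatchInserter inserter = BatchInserters.inserter("/data/graph.db", config);
        RelationshipType KNOWS = DynamicRelationshipType.withName("KNOWS");

        try {
            // createNode returns the node id used later when wiring relationships
            Map<String, Object> props = new HashMap<>();
            props.put("name", "person-0");
            long first = inserter.createNode(props);

            props.put("name", "person-1");
            long second = inserter.createNode(props);

            // If you want to test the "sorted by min(from, to)" advice, sort your
            // relationship list before this loop; the API call itself is the same.
            inserter.createRelationship(first, second, KNOWS, null);
        } finally {
            inserter.shutdown();   // flushes the store files; required before normal startup
        }
    }
}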