TinkerGraph g.V(ids).drop().iterate() is confusingly slow - performance

I have run into an issue using plain old TinkerGraph to drop a moderately sized number of vertices. In total, about 1250 vertices and 2500 edges will be dropped.
When running the following:
g.V(ids).drop().iterate()
It takes around 20-30 seconds. This seems excessive, and as far as I can tell it is not caused by anything other than the removal of the vertices themselves.
I'm hoping there is some key piece that I am missing or an area I have yet to explore that will help me out here.
The environment is not memory or CPU constrained in any way. I've profiled the code and see that the majority of the time is spent in the TinkerVertex.remove method. This is doubly strange because creating these vertices takes less than a second.
I've been able to optimize this a bit by using a batching and separate-threads solution like the one described in Improve performance removing TinkerGraph vertices.
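Roughly, that approach looks like the sketch below (simplified; the batch size and thread count are arbitrary, ids is the list of vertex ids to remove, and g is the TinkerGraph traversal source):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

// Simplified sketch of the batched, multi-threaded drop.
void dropInBatches(GraphTraversalSource g, List<Object> ids) throws Exception {
    int batchSize = 250;                                     // arbitrary
    ExecutorService pool = Executors.newFixedThreadPool(4);  // arbitrary
    List<Future<?>> futures = new ArrayList<>();
    for (int i = 0; i < ids.size(); i += batchSize) {
        List<Object> batch = ids.subList(i, Math.min(i + batchSize, ids.size()));
        futures.add(pool.submit(() -> g.V(batch.toArray()).drop().iterate()));
    }
    for (Future<?> f : futures) {
        f.get();  // wait for every batch to finish
    }
    pool.shutdown();
}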
However, 10-15 seconds is still too long as I'm hoping to have this be a synchronous operation.
I've considered following something like this but that feels like overkill for dropping less than 5k elements...
To note, the size of the graph is around 110k vertices and 150k edges.
I've tried to profile the Gremlin query, but it seems that you can't profile through the JVM using:
g.V(ids).drop().iterate().profile()
I've tried various ways of writing the query for profiling but was unable to get it to work.
I'm hoping there is just something I'm missing that will help get this resolved.

As mentioned in comments, it definitely seems unusual that this operation is taking so long, unless the machine being used is very busy performing other tasks. Using my laptop (16GB RAM, modest CPU and other specs) I can drop the air-routes graph (3,747 nodes and 57,660 edges) in milliseconds from the Gremlin Console.
gremlin> Gremlin.version
==>3.6.0
gremlin> g
==>graphtraversalsource[tinkergraph[vertices:3747 edges:57660], standard]
gremlin> g.V().drop().profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
TinkerGraphStep(vertex,[])                                          3747        3747           6.226     7.52
DropStep                                                                                      76.587    92.48
                                            >TOTAL                     -           -          82.813        -
gremlin> g
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
I also tried dropping a list of 1,000 nodes as follows, but still saw only millisecond times.
gremlin> g
==>graphtraversalsource[tinkergraph[vertices:3747 edges:57660], standard]
gremlin> a=[] ; for (x in (1..1000)) {a << x}
==>null
gremlin> a.size()
==>1000
gremlin> g.V(a).drop().profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
TinkerGraphStep(vertex,[1, 2, 3, 4, 5, 6, 7, 8,...                  1000        1000           2.677    13.87
DropStep                                                                                      16.626    86.13
                                            >TOTAL                     -           -          19.304        -
gremlin> g
==>graphtraversalsource[tinkergraph[vertices:2747 edges:9331], standard]
Perhaps see if you can get a profile from your Java code using a query without iterate (it's not needed, as profile is a terminal step). Also check for any unusual GC activity, and see whether you hit the same issue from the Gremlin Console. Something is definitely odd here. If none of these investigations bear fruit, perhaps update the question to show the exact Java code you are using.
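From Java, that would look something like the following (assuming g is your GraphTraversalSource and ids is your id collection); profile() takes the place of iterate(), and next() materializes the metrics:

import org.apache.tinkerpop.gremlin.process.traversal.util.TraversalMetrics;

// Profile the drop from Java: profile() replaces iterate() as the final step.
TraversalMetrics metrics = g.V(ids.toArray()).drop().profile().next();
System.out.println(metrics);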

Related

Orientdb taking forever to load a graph (Large set)

I'm importing a fairly large file as a graph on OrientDB: 11M edges and 20,000 nodes.
It is taking far too long.
Is there a way to optimize the graph load, or to get the most out of a machine with 16 GB of RAM?
My first question is: why is it taking so much time?
Second, how can I optimize it?
Some advice for a fast import:
use plocal connection if you can
use a transactional connection and commit in batches of ~500 records (see the sketch after this list)
try to avoid reloading vertices frequently. Most of the time, the biggest part of the time to insert a new edge is spent seeking the two vertices.
if your graph is not huge and the use case is simple enough, you can try to have a look at this http://orientdb.com/docs/2.2.x/Graph-Batch-Insert.html
if your main concern is insertion speed, OrientDB ETL is not the best choice, use some custom Java code instead
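To make the batching advice concrete, here is a rough sketch against the OrientDB 2.2 Graph (Blueprints) API; EdgeRecord and findOrCreateVertex are hypothetical stand-ins for your own input parsing and vertex-lookup cache:

import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;

// Rough sketch only: EdgeRecord and findOrCreateVertex are placeholders.
public static void loadEdges(Iterable<EdgeRecord> edgeRecords) {
    OrientGraph graph = new OrientGraph("plocal:/data/mygraph", "admin", "admin");  // plocal connection
    try {
        int counter = 0;
        for (EdgeRecord rec : edgeRecords) {
            Vertex out = findOrCreateVertex(graph, rec.fromId);  // cache these lookups if possible
            Vertex in = findOrCreateVertex(graph, rec.toId);
            out.addEdge("linked", in);
            if (++counter % 500 == 0) {
                graph.commit();  // commit in batches of ~500 records
            }
        }
        graph.commit();
    } finally {
        graph.shutdown();
    }
}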

Pipeline inputs 8 billion lines from GCS and does a GroupByKey to prevent fusion, group step running very slow

I read 8 billion lines from GCS, do processing on each line, then output. My processing step can take a little time, and to avoid worker leases expiring and getting the error below, I do a GroupByKey on the 8 billion elements, grouped by id, to prevent fusion.
A work item was attempted 4 times without success. Each time the
worker eventually lost contact with the service. The work item was
attempted on:
The problem is that the GroupByKey step is taking forever to complete for 8 billion lines, even on 1000 high-mem-2 nodes.
I looked into one possible cause of slow processing: a large number of values generated per key by GroupByKey. I don't think that is possible here, because out of 8 billion inputs one input id cannot appear in the set more than 30 times. So the problem is clearly not hot keys; something else is going on.
Any ideas on how to optimize this are appreciated. Thanks.
I did manage to solve this problem. There were a number of incorrect assumptions on my part about Dataflow wall times. I was looking at my pipeline and assumed the step with the highest wall time (which was measured in days) was the bottleneck. But in Apache Beam a step is usually fused together with the steps downstream of it in the pipeline, and will only run as fast as the downstream step runs. So a significant wall time is not enough to conclude that a step is the bottleneck in the pipeline.
The real solution to the problem stated above came from this thread. I reduced the number of nodes my pipeline runs on and changed the node type from high-mem-2 to high-mem-4. I wish there were an easy way to get memory usage metrics for a Dataflow pipeline; I had to SSH into the VMs and use jmap.
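As an aside, for anyone reproducing the fusion-breaking step described in the question: Beam's built-in Reshuffle does the same job as a hand-rolled GroupByKey. A minimal Java sketch, where the bucket path and MyExpensiveFn are placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

public class FusionBreakSketch {

    // Placeholder for the expensive per-line processing described in the question.
    static class MyExpensiveFn extends DoFn<String, String> {
        @ProcessElement
        public void process(@Element String line, OutputReceiver<String> out) {
            out.output(line);  // real processing would go here
        }
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        PCollection<String> lines = p.apply(TextIO.read().from("gs://my-bucket/input-*"));

        // Reshuffle materializes the collection, breaking fusion between the read
        // and the expensive ParDo so they can be scaled independently.
        PCollection<String> reshuffled = lines.apply(Reshuffle.viaRandomKey());

        reshuffled.apply(ParDo.of(new MyExpensiveFn()));

        p.run();
    }
}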

Spark MLLib's LassoWithSGD doesn't scale?

I have code similar to what follows:
val fileContent = sc.textFile("file:///myfile")
val dataset = fileContent.map(row => {
  val explodedRow = row.split(",").map(s => s.toDouble)
  new LabeledPoint(explodedRow(13), Vectors.dense(
    Array(explodedRow(10), explodedRow(11), explodedRow(12))
  ))
})
val algo = new LassoWithSGD().setIntercept(true)
val lambda = 0.0
algo.optimizer.setRegParam(lambda)
algo.optimizer.setNumIterations(100)
algo.optimizer.setStepSize(1.0)
val model = algo.run(dataset)
I'm running this in the cloud on my virtual server with 20 cores. The file is a "local" (i.e. not in HDFS) file with a few million rows. I run this in local mode, with sbt run (i.e. I don't use a cluster, I don't use spark-submit).
I would have expected this to get increasingly faster as I increase the spark.master=local[*] setting from local[8] to local[40]. Instead, it takes the same amount of time regardless of what setting I use (though I notice from the Spark UI that my executor has a maximum number of Active Tasks at any given time that is equal to the expected amount, i.e. ~8 for local[8], ~40 for local[40], etc. -- so it seems that the parallelization works).
By default the number of partitions of my dataset RDD is 4. I tried forcing the number of partitions to 20, without success -- in fact it slows the Lasso algorithm down even more...
Is my expectation of the scaling process incorrect? Can somebody help me troubleshoot this?
Is my expectation of the scaling process incorrect?
Well, kind of. I hope you don't mind if I use a little bit of Python to prove my point.
Let's be generous and say a few million rows is actually ten million. With 50 000 000 values (intercept + 3 features + label per row) that gives around 380 MB of data (a Java Double is a 64-bit IEEE 754 double-precision floating point value). Let's create some dummy data:
import numpy as np
n = 10 * 1000**2
X = np.random.uniform(size=(n, 4)) # Features
y = np.random.uniform(size=(n, 1)) # Labels
theta = np.random.uniform(size=(4, 1)) # Estimated parameters
Each step of gradient descent (since the default miniBatchFraction for LassoWithSGD is 1.0, it is not really stochastic), ignoring regularization, requires an operation like this:
def step(X, y, theta):
    return ((X.dot(theta) - y) * X).sum(0)
So let's see how long it takes locally on our data:
%timeit -n 15 step(X, y, theta)
## 15 loops, best of 3: 743 ms per loop
Less than a second per step, without any additional optimizations. Intuitively this is pretty fast, and it won't be easy to match. Just for fun, let's see how long it takes to get a closed-form solution for data like this:
%timeit -n 15 np.linalg.inv(X.transpose().dot(X)).dot(X.transpose()).dot(y)
## 15 loops, best of 3: 1.33 s per loop
Now let's go back to Spark. Residuals for a single point can be computed in parallel, so this is the part which scales linearly when you increase the number of partitions processed in parallel.
The problem is that you have to aggregate the data locally, serialize it, transfer it to the driver, deserialize it and reduce it locally to get the final result after each step. Then you have to compute the new theta, serialize it, send it back, and so on.
All of that can be improved by proper usage of mini-batches and some further optimizations, but at the end of the day you are limited by the latency of the whole system. It is worth noting that when you increase parallelism on the worker side you also increase the amount of work that has to be performed sequentially on the driver, and the other way round. One way or another, Amdahl's law will bite you.
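(For reference, Amdahl's law caps the speedup on N parallel workers at S(N) = 1 / ((1 - p) + p / N), where p is the fraction of the work that can actually run in parallel, so the serial, driver-side fraction quickly dominates.)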
Also, all of the above ignores the actual implementation.
Now let's perform another experiment. First, some dummy data:
nCores = 8 # Number of cores on local machine I use for tests
rdd = sc.parallelize([], nCores)
and benchmark:
%timeit -n 40 rdd.mapPartitions(lambda x: x).count()
## 40 loops, best of 3: 82.3 ms per loop
It means that with 8 cores, without any real processing or network traffic, we get to the point where we cannot do much better by increasing parallelism in Spark (743 ms / 8 = 92.875 ms per partition, assuming linear scalability of the parallelized part).
Just to summarize above:
if the data can easily be processed locally with a closed-form solution, using gradient descent is just a waste of time; if you want to increase parallelism / reduce latency, use a good linear algebra library
Spark is designed to handle large amounts of data, not to reduce latency. If your data fits in the memory of a few-years-old smartphone, it is a good sign that Spark is not the right tool
if computations are cheap then constant costs become a limiting factor
Side notes:
a relatively large number of cores per machine is, generally speaking, not the best choice unless you can match it with IO throughput

Practical Parallel Efficiency % in Teradata

Teradata is built for parallelism.
I believe that with the query below we can measure the Parallel Efficiency of a user's queries:
SELECT
    USERNAME,
    NumOfActiveAMPs,
    ((SUM(AMPCPUTime)) / 1024) / ((SUM(MaxAmpCPUTime) * NumOfActiveAMPs) / 1024) * 100 AS Parallel_Efficiency,
    COUNT(1)
FROM dbc.qrylog
WHERE MaxAmpCPUTime > 0
GROUP BY 1, 2
In an ideal situation, I believe PE can be 100%.
But for various reasons, I see that most PE (rolled up) is usually less than 50%.
What, in your view, is a good Parallel Efficiency % that we should try to achieve?
I was told that trying to achieve a high PE (like 60% or more) is also not good for the state of the system; I'm not sure of the reason, though. Is this true? Your thoughts?
Thanks for sharing your thoughts!
Parallel Efficiency for a given query can be calculated as AMPCPUTime / (MaxAMPCPUTime * (HASHAMP() + 1)), where (MaxAMPCPUTime * (HASHAMP() + 1)) is the ImpactCPU measure, representing the highest CPU consumed by a participating AMP in the query multiplied by the number of AMPs in the configuration. You may find individual workloads are all over the board on their parallel efficiency.
I sometimes wonder if PE for an individual query would be more accurate if you replaced the number of AMPs in the system with the number of AMPs used by the query. This metric is available in DBQL and may help balance queries that use PI or USI access paths and are not all-AMP operations.
Parallel efficiency for your overall system can be obtained using ResUsage metrics by dividing the average node utilization by the maximum node utilization. This helps you understand how evenly the system is processing a given workload, but does not consider how "heavy" that workload might be. Here you are looking for the overall efficiency to be greater than 60%; the closer to 100%, the better the nodes are working together.
I know your inquiry was about individual queries, but I thought sharing details about the PE of your environment would be beneficial as well.

Optimum number of threads for a highly parallelizable problem

I parallelized a simulation engine into 12 threads to run it on a cluster of 12 nodes (each node running one thread). Since the chance of 12 systems being available is generally low, I also tweaked it for 6 threads (to run on 6 nodes), 4 threads (to run on 4 nodes), 3 threads (to run on 3 nodes), and 2 threads (to run on 2 nodes). I have noticed that the more nodes/threads I use, the greater the speedup. But obviously, the more nodes I use, the more expensive (in terms of cost and power) the execution becomes.
I want to publish these results in a journal so I want to know if there are any laws/theorems which will help me to decide the optimum number of nodes on which I should run this program?
Thanks,
Akshey
How have you parallelised your program, and what is inside each of your nodes?
For instance, on one of my clusters I have several hundred nodes each containing 4 dual-core Xeons. If I were to run an OpenMP program on this cluster I would place a single execution on one node and start up no more than 8 threads, one for each processor core. My clusters are managed by Grid Engine and used for batch jobs, so there is no contention while a job is running. In general there is no point in asking for more than one node on which to run an OpenMP job since the shared-memory approach doesn't work on distributed-memory hardware. And there's not much to be gained by asking for fewer than 8 threads on an 8-core node, I have enough hardware available not to have to share it.
If you have used a distributed-memory programming approach, such as MPI, then you are probably working with a number of processes (rather than threads) and may well be executing these processes on cores on different nodes, and be paying the costs in terms of communications traffic.
As @Blank has already pointed out, the most efficient way to run a program, if by efficiency one means 'minimising total cpu-hours', is to run the program on 1 core. Only. However, for jobs of mine which can take, say, a week on 256 cores, waiting 128 weeks for one core to finish its work is not appealing.
If you are not already familiar with the following terms, Google around for them or head for Wikipedia:
Amdahl's Law
Gustafson's Law
weak scaling
strong scaling
parallel speedup
parallel efficiency
scalability.
"if there are any laws/theorems which will help me to decide the optimum number of nodes on which I should run this program?"
There are no such general laws, because every problem has slightly different characteristics.
You can make a mathematical model of the performance of your problem on different numbers of nodes, knowing how much computational work has to be done, how much communication has to be done, and how long each takes. (The communication times can be estimated from the amount of communication and typical latency/bandwidth numbers for your nodes' type of interconnect.) This can guide you towards good choices.
These models can be valuable for understanding what is going on, but to actually determine the right number of nodes to run on for your code for some given problem size, there's really no substitute for running a scaling test - running the problem on various numbers of nodes and actually seeing how it performs. The numbers you want to see (computed in the small sketch after this list) are:
Time to completion as a function of number of processors: T(P)
Speedup as a function of number of processors: S(P) = T(1)/T(P)
Parallel efficiency: E(P) = S(P)/P
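As a concrete illustration, here is a small sketch (in Java, with made-up timings) of turning the raw wall-clock times from a scaling test into speedup and efficiency figures:

import java.util.LinkedHashMap;
import java.util.Map;

public class ScalingTest {
    public static void main(String[] args) {
        // Hypothetical wall-clock times (seconds) measured at different processor counts.
        Map<Integer, Double> timeByProcs = new LinkedHashMap<>();
        timeByProcs.put(1, 1000.0);
        timeByProcs.put(8, 140.0);
        timeByProcs.put(16, 80.0);
        timeByProcs.put(32, 64.0);

        double t1 = timeByProcs.get(1);
        for (Map.Entry<Integer, Double> e : timeByProcs.entrySet()) {
            int p = e.getKey();
            double tp = e.getValue();
            double speedup = t1 / tp;          // S(P) = T(1) / T(P)
            double efficiency = speedup / p;   // E(P) = S(P) / P
            System.out.printf("P=%2d  T=%7.1fs  S=%5.2f  E=%3.0f%%%n",
                    p, tp, speedup, efficiency * 100);
        }
    }
}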
How do you choose the "right" number of nodes? It depends on how many jobs you have to run, and what's an acceptable use of computational resources.
So for instance, in plotting your timing results you might find that you have a minimum time to completion T(P) at some number of processors -- say, 32. So that might seem like the "best" choice. But when you look at the efficiency numbers, it might become clear that the efficiency started dropping precipitously long before that; and you only got (say) a 20% decrease in run time over running at 16 processors - that is, for 2x the amount of computational resources, you only got a 1.25x increase in speed. That's usually going to be a bad trade, and you'd prefer to run at fewer processors - particularly if you have a lot of these simulations to run. (If you have 2 simulations to run, for instance, in this case you could get them done in 1.25 time units instead of 2 time units by running the two simulations each on 16 processors simultaneously rather than running them one at a time on 32 processors).
On the other hand, sometimes you only have a couple runs to do and time really is of the essence, even if you're using resources somewhat inefficiently. Financial modelling can be like this -- they need the predictions for tomorrow's markets now, and they have the money to throw at computational resources even if they're not used 100% efficiently.
Some of these concepts are discussed in the "Introduction to Parallel Performance" section of any parallel programming tutorials; here's our example, https://support.scinet.utoronto.ca/wiki/index.php/Introduction_To_Performance
Increasing the number of nodes leads to diminishing returns. Two nodes is not twice as fast as one node; four nodes even less so than two. As such, the optimal number of nodes is always one; it is with a single node that you get the most work done per node.
