I am trying to compute betweenness centrality on a graph with 1M nodes and 3M edges. I am using graph-tool with the following lines of code:
from graph_tool.all import *

g = load_graph("youtube.graphml")
# betweenness() returns a pair of property maps: (vertex betweenness, edge betweenness)
scores = betweenness(g)
On its performance comparison page it is reported that computing betweenness on a directed graph with 40k vertices and 300k edges takes graph-tool about 4 minutes: https://graph-tool.skewed.de/performance
Since graph-tool uses Brandes' algorithm, which has O(VE) complexity, I was expecting an approximate running time of:
(1M / 40k) * (3M / 300k) * 4 min = 25 * 10 * 4 min = 1000 min ≈ 17 h
I found this estimate consistent with the following Stack Overflow post, where for a graph with 2M vertices and 5M edges a user gave an estimated running time of about 6 months using NetworkX, which is roughly 180x slower than graph-tool. Hence:
6 months ≈ 180 days (NetworkX) ≈ about 1 day (graph-tool)
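For reference, here is the back-of-the-envelope extrapolation above written out as a small Python sketch (the 4-minute baseline and the O(VE) scaling are just the assumptions stated above, nothing more):

# Back-of-the-envelope extrapolation assuming Brandes' O(VE) scaling.
# Baseline: ~4 minutes for a (40k, 300k) directed graph (graph-tool benchmark page).
baseline_minutes = 4
v_ratio = 1_000_000 / 40_000     # 25x more vertices
e_ratio = 3_000_000 / 300_000    # 10x more edges
estimate_minutes = baseline_minutes * v_ratio * e_ratio
print(f"estimated running time: {estimate_minutes:.0f} min (~{estimate_minutes / 60:.0f} h)")
# -> estimated running time: 1000 min (~17 h)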
The point is that my program has been running on a 4-core machine for 2 days now, so I am starting to wonder whether my reasoning makes any sense.
Moreover, graph-tool's benchmarks are performed on a directed graph, for which Brandes' algorithm has a complexity of O(VE + V(V+E)log V). Given this, shouldn't the expected running time be even smaller than the estimate above? And, more importantly, is it feasible at all to compute betweenness centrality on a (1M, 3M) network using graph-tool and a 4-core machine?
I am using an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz.
Eventually the algorithm stopped: it took slightly more than 2 days (a couple of hours more). However, I was not able to get the result: during the computation the UI crashed and I could not reconnect to the IPython notebook kernel that was running the program (which nevertheless kept running). I am going to update this post as soon as I get the actual result.
UPDATE
I recomputed the betweenness using a 16-core machine and it took about 18 hours.
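One practical note, in case it helps anyone reproducing this: graph-tool's betweenness() is parallelized with OpenMP, so it is worth checking that your build actually has OpenMP enabled and that the thread count matches your cores. A minimal check might look like this (assuming a reasonably recent graph-tool version):

import graph_tool as gt
from graph_tool.all import load_graph, betweenness

print(gt.openmp_enabled())         # True if this build was compiled with OpenMP support
gt.openmp_set_num_threads(4)       # match this to the number of physical cores available

g = load_graph("youtube.graphml")  # same file as above
vb, eb = betweenness(g)            # (vertex betweenness, edge betweenness)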
Related
I executed a linear search on an array containing all unique elements in the range [1, 10000], sorted in increasing order, using every search value from 1 to 10000, and plotted runtime vs. search value.
Upon closely analysing a zoomed-in version of the plot, I found that the runtime for some larger search values is smaller than for some lower search values, and vice versa.
My best guess is that this is related to how the CPU processes the data through main memory and its caches, but I don't have a firm, quantifiable explanation.
Any hint would be greatly appreciated.
PS: The code was written in C++ and executed on a Linux virtual machine with 4 vCPUs on Google Cloud. The runtime was measured using the C++ chrono library.
CPU cache size depends on the CPU model and there are several cache levels, so your experiment should take all of those factors into account. The L1 data cache of a typical modern x86 core is 32 KiB, which is comparable to your 10000-element int array (about 40 KB). But I don't think this is cache misses: an L2 hit costs tens of nanoseconds and even a main-memory access is on the order of 100 ns, which is much smaller than the gap between the lowest line and the second line, which is about 5 µs. I suppose that second line (and its cloud) comes from context switching: the longer the task, the more likely a context switch is to occur, which is why the cloud on the right side is thicker.
Now for the zoomed-in figure. As Linux is not a real-time OS, its time measurement is not entirely reliable; IIRC its minimal reporting unit is a microsecond. Now, if a certain task takes exactly 15.45 microseconds, then its reported duration depends on when it started. If the task started exactly on a clock tick, the time reported would be 15 microseconds; if it started when the internal clock was 0.1 microseconds into a tick, you would get 16 microseconds. What you see on the graph is the projection of the analogue straight line onto a discrete-valued axis. So the task duration you get is not the actual duration, but the real value plus the task's start offset within the microsecond (which is roughly uniformly distributed, ~U[0,1)), all rounded to the nearest integer value.
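To illustrate the rounding argument, here is a toy Python simulation of the model described above (the 2 ns per element cost is an arbitrary placeholder, not a measurement):

import random

def reported_runtime_us(search_value, ns_per_element=2.0):
    # "Analogue" runtime: linear in the search value (placeholder cost per element).
    true_us = search_value * ns_per_element / 1000.0
    start_offset = random.random()   # start position within the microsecond tick, ~U[0, 1)
    # Model from above: (true duration + start offset) rounded to whole microseconds.
    return round(true_us + start_offset)

samples = [(v, reported_runtime_us(v)) for v in range(1, 10001)]
# Plotting samples should reproduce the stepped/banded pattern seen in the zoomed-in plot.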
I have a simple question, a bit silly, but I need some clarification for an upcoming exam so I don't make a stupid mistake. I am currently taking a class in computer organization and design and am learning about execution time, CPI, clock cycles, etc.
For a problem, I have to calculate the number of cycles for 2 compilers and find out which one is faster, and by how much, given the number of instructions and the cycles for each instruction. My main problem is figuring out how much faster the faster compiler is.
For example, let's say there are two compilers:
Compiler 1 has 3 load instructions, 4 store instructions, and 5 add instructions.
Compiler 2 has 5 load instructions, 4 store instructions, and 3 add instructions.
A load instruction takes 2 cycles, a store instruction takes 3 cycles, and an add instruction takes 1 cycle.
So what I would do is add up the instructions, (3+4+5) and (5+4+3), which both equal 12 instructions.
I'd then calculate the cycles by multiplying the number of each instruction by its cycle count and adding them all together, like this:
Compiler 1: (3*2)+(4*3)+(5*1) = 23 cycles
Compiler 2: (5*2)+(4*3)+(3*1) = 25 cycles
So obviously compiler 1 is faster because it requires fewer cycles. To find out how much faster compiler 1 is than compiler 2, would I just take the ratio of the cycles?
My calculation was 23/25 = 0.92, so compiler 1 is 0.92 times faster than compiler 2 (92% faster).
A classmate of mine was discussing this with me and claims that it would be 25/23 which would mean it is 1.08 times faster.
I know I can also calculate this by dividing the cycles by the instructions like:
23 cycles / 12 instructions ≈ 1.92 (CPI for compiler 1)
25 cycles / 12 instructions ≈ 2.08 (CPI for compiler 2)
and then 1.92 / 2.08 ≈ 0.92, which is the same as the answer above.
I'm not sure which way would be correct.
I was also wondering: if the number of instructions were different for the second compiler, say 15 instructions, would taking the ratio of the cycles still be sufficient? Or would I have to divide the cycles by the instructions (cycles/instructions), but use 15 instructions for both (e.g. 23/15 and 25/15), and then divide the two quotients to get how many times faster it is? I also get the same number (0.92) in that case.
Thank you for any clarification.
The first compiler would be 25/23 ≈ 1.087 times the speed of the second compiler, which is about 8.7% faster (a ratio of 1.0 means equal speed, so the amount above 1.0 is how much faster it is).
Probably both calculations are inaccurate anyway: with modern multi-core processors, a compiler that generates more instructions may actually produce faster code.
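For what it's worth, the whole calculation fits in a few lines of Python (instruction mixes and per-instruction cycle counts taken from the question):

# Per-instruction cycle costs and instruction counts from the question.
cycles = {"load": 2, "store": 3, "add": 1}
compiler1 = {"load": 3, "store": 4, "add": 5}
compiler2 = {"load": 5, "store": 4, "add": 3}

total1 = sum(n * cycles[op] for op, n in compiler1.items())   # 23 cycles
total2 = sum(n * cycles[op] for op, n in compiler2.items())   # 25 cycles

# Speedup of compiler 1 over compiler 2 is time2 / time1; at the same clock rate
# that is just the cycle ratio.
print(total1, total2, total2 / total1)   # 23 25 1.0869...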
From what I understand, to calculate CPI you multiply the percentage of each instruction type by its cycle count and add them up, right? Does the machine itself (e.g. its clock rate) play any part in this calculation whatsoever?
I have a problem that asks me if a change should be recommended.
Machine 1: 40% R - 5 cycles, 30% lw - 6 cycles, 15% sw - 6 cycles, 15% beq - 3 cycles, on a 2.5 GHz machine
Machine 2: 40% R - 5 cycles, 30% lw - 6 cycles, 15% sw - 6 cycles, 15% beq - 4 cycles, on a 2.7 GHz machine
By my calculations, machine 1 has a CPI of 5.15 while machine 2 has a CPI of 5.3. Is it okay to ignore the GHz of the machines and say that the change would not be a good idea, or do I have to factor the clock speed in?
I think the point is to evaluate a design change that makes an instruction take more clocks but allows you to raise the clock frequency (i.e. leaning towards a speed-demon design like the Pentium 4, instead of a brainiac design like Apple's A7/A8 ARM cores: http://www.lighterra.com/papers/modernmicroprocessors/).
So you need to calculate instructions per second to see which one will get more work done in the same amount of real time, i.e. (clocks/sec) / (clocks/insn) = insn/sec, cancelling the clocks out of the units.
Your CPI calculation looks OK; I didn't check the arithmetic, but yes, it's a weighted average of the cycle counts according to the instruction mix.
These numbers are obviously super simplified; any CPU worth building at 2.5 GHz would have some kind of branch prediction, so the cost of a branch isn't just a 3- or 4-cycle bubble. And taking ~5 cycles per instruction on average is pathetic. (Most pipelined designs aim for at least 1 instruction per clock.)
Caches and superscalar CPUs also lead to complex interactions between instructions depending on whether they depend on earlier results or not.
But this is sort of like what you might do if considering increasing the L1d cache load-use latency by 1 cycle (for example), if that took it off the critical path and let you raise the clock frequency. Or vice versa, tightening up the latency or reducing the number of pipeline stages on something at the cost of reducing frequency.
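A minimal Python sketch of that instructions-per-second comparison, using the instruction mix and clock rates from the question:

# Weighted-average CPI and instructions/second for the two machines in the question.
mix = {"R": 0.40, "lw": 0.30, "sw": 0.15, "beq": 0.15}
machine1 = {"R": 5, "lw": 6, "sw": 6, "beq": 3}   # 2.5 GHz
machine2 = {"R": 5, "lw": 6, "sw": 6, "beq": 4}   # 2.7 GHz

def cpi(cycles_per_insn):
    return sum(mix[op] * cycles_per_insn[op] for op in mix)

for name, cycles, ghz in (("machine 1", machine1, 2.5), ("machine 2", machine2, 2.7)):
    c = cpi(cycles)
    insn_per_sec = ghz * 1e9 / c      # (cycles/sec) / (cycles/insn) = insn/sec
    print(f"{name}: CPI = {c:.2f}, {insn_per_sec / 1e9:.3f} G-instructions/sec")
# machine 1: CPI = 5.15, 0.485 G-instructions/sec
# machine 2: CPI = 5.30, 0.509 G-instructions/sec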
Cycles per instruction is a count of cycles; GHz doesn't matter as far as that average goes. But from your numbers we can see that one instruction takes more clocks, while the two processors run at different speeds.
So although it takes more cycles to do the same job on the faster processor, the higher clock speed DOES compensate for that; the question is clearly about whether the processor speed accounts for the extra clock.
5.15 cycles/instruction divided by 2.5 gigacycles/second: the cycles cancel out and you get
2.06 seconds per giga-instruction, i.e. 2.06 nanoseconds per instruction.
5.30 / 2.7 = 1.96 nanoseconds per instruction.
The faster machine takes slightly less time per instruction, so it will run the program faster.
Another way to see this and to check the math: look at 100 instructions on each machine. 15% of them are beq, so that is 15 beq instructions, which take 45 clocks on the slower machine and 60 clocks on the faster one; all the other instructions cost the same on both. That comes to 515 clocks total on the slower machine versus 530 clocks on the faster machine for the same work.
515 cycles at 2.5 GHz vs 530 cycles at 2.7 GHz.
We want the amount of time.
Hz is cycles/second and we want seconds on top,
so we want
cycles / (cycles/second), letting the cycles cancel out and leaving seconds on top.
1/2.5 = 0.400 (400 picoseconds per cycle)
1/2.7 = 0.3704
0.400 * 515 = 206.0 units of time
0.3704 * 530 = 196.3 units of time
So despite taking 15 more cycles per 100 instructions, the processor speed difference is enough to compensate.
2.7/2.5 = 1.08
530/515 ≈ 1.029
2.5 * 1.029 ≈ 2.57, so a processor at roughly 2.57 GHz or faster would run this program faster.
Now, what were the rules for changing computers? Is less time by itself a reason to change? What is the definition of "better"? How much more power does the faster one consume? It might take less time, but power consumption may not scale linearly, so it may burn more energy despite finishing sooner. I assume the question is not that detailed, which makes it vague (a poorly written question on its own), so it comes down to what the textbook or lecture defined as the threshold for changing to the other processor.
Disclaimer: don't blame me if you miss this question on your homework/test.
Outside an academic exercise like this, the real world is full of pipelined processors (not all processors, but most of the ones people write programs for), and you basically can't put a single number on clock cycles per instruction type in a way that supports this kind of calculation, because of a laundry list of factors. Make sure you understand that: it is a nice exercise, but that specific exercise is difficult and dangerous to attempt on real-world processors. Dangerous in that, however hard you work, you may be measuring the wrong thing, jumping to the wrong conclusions, and as a result making bad recommendations. At the same time, it is very much true that a faster clock improves some percentage of the execution while another percentage suffers, and the question is whether there is a net gain or loss. Or a newer processor design, faster or slower, may have features that perform better than an older processor, but not all features will be better; there is a trade-off, and then we get into what "better" means.
I am running a Python program that calls H2O for deep learning (training and testing). The program runs in a loop of 20 iterations, and each iteration calls H2ODeepLearningEstimator() 4 times, plus the associated predict() and model_performance(). I am calling h2o.remove_all() and cleaning up all data-related Python objects after each iteration.
Data size: training set of 80,000 rows with 122 features (all float), of which 20% is used for validation (10-fold CV); test set of 20,000 rows. This is binary classification.
Machine 1: Windows 7, 4-core Xeon, 3.5 GHz per core, 32 GB memory
Takes about 24 hours to complete
Machine 2: CentOS 7, 20-core Xeon, 2.0 GHz per core, 128 GB memory
Takes about 17 hours to complete
I am using h2o.init(nthreads=-1, max_mem_size = 96)
So, the speed-up is not that much.
My questions:
1) Is the speed-up typical?
2) What can I do to achieve substantial speed-up?
2.1) Will adding more cores help?
2.2) Are there any H2O configuration or tips that I am missing?
Thanks very much.
- Mohammad,
Graduate student
If the training time is the main effort, and you have enough memory, then the speed-up will be roughly proportional to cores times core speed. So you might have expected a (20 x 2.0) / (4 x 3.5) = 40/14 ≈ 2.85x speed-up (i.e. your 24 hours coming down to the 8-10 hour range).
There is a typo in your h2o.init(): 96 should be "96g". However, I think that was a typo when writing the question, as h2o.init() would return an error message. (And H2O would fail to start if you'd tried "96", with the quotes but without the "g".)
You didn't show your h2o.deeplearning() command, but I am guessing you are using early stopping. And that can be unpredictable. So, what might have happened is that your first 24hr run did, say, 1000 epochs, but your second 17hr run did 2000 epochs. (1000 vs. 2000 would be quite an extreme difference, though.)
It might be that you are spending too much time scoring. If you've not touched the defaults, this is unlikely. But you could experiment with train_samples_per_iteration (e.g. set it to 10 times the number of your training rows).
What can I do to achieve substantial speed-up?
Stop using cross-validation. That might be a bit controversial, but personally I think 80,000 training rows is going to be enough to do an 80%/10%/10% split into train/valid/test. That will be 5-10 times quicker.
If it is for a paper, and you want to show more confidence in the results, once you have your final model, and you've checked that test score is close to valid score, then rebuild it a couple of times using a different seed for the 80/10/10 split, and confirm you end up with the same metrics. (*)
*: By the way, take a look at the score for each of the 10 cv models you've already made; if they are fairly close to each other, then this approach should work well. If they are all over the place, you might have to re-consider the train/valid/test splits - or just think about what it is in your data that might be causing that sensitivity.
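As a rough illustration only (an untested sketch; the file path and the "response" column name are placeholders, not from the original program), a single 80/10/10 split in the H2O Python API might look like this:

# Untested sketch: replace 10-fold CV with a single 80/10/10 train/valid/test split.
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init(nthreads=-1, max_mem_size="96g")

data = h2o.import_file("training_data.csv")                 # placeholder path
train, valid, test = data.split_frame(ratios=[0.8, 0.1], seed=1234)

y = "response"                                              # placeholder response column
x = [c for c in data.columns if c != y]

model = H2ODeepLearningEstimator(stopping_rounds=5,         # early stopping, if you use it
                                 stopping_metric="logloss")
model.train(x=x, y=y, training_frame=train, validation_frame=valid)

print(model.model_performance(test_data=test))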
A program runs in 100 s, and multiply instructions account for 80% of that time. We need to make the program 2 times faster. By how much must the multiply instructions be sped up to achieve that overall speedup?
Please help me solve this question using Amdahl's law.
Hint:
The running time is 80 s for the multiplies and 20 s for everything else.
To make the program twice as fast (50 s total), you need to bring the multiply time down from 80 s to 30 s, while the other 20 s stay unchanged.
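Working the hint through as a tiny Python sketch (the 80 s / 20 s split comes straight from the problem statement):

# Amdahl's law: new_time = (multiply_time / s) + rest_time,
# where s is the speedup applied only to the multiply instructions.
total_time = 100.0                    # seconds
mult_time = 0.8 * total_time          # 80 s spent in multiplies
rest_time = total_time - mult_time    # 20 s untouched

target_total = total_time / 2         # 2x overall speedup -> 50 s
# Solve mult_time / s + rest_time = target_total for s:
s = mult_time / (target_total - rest_time)
print(s)                              # 80 / 30 = 2.666... -> multiplies must be ~2.67x faster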