I have a Racket program that will be long-running. Executing many instances of the same program helps find the answer faster (it depends on randomness), so I execute 10 instances of the same program from the command line on a 24-core machine. The average throughput when executing one instance (on one core) is 500 iterations/s. The average throughput when executing 10 instances (on 10 cores) drops to 100 iterations/s per core. I expected similar throughput per core, because the executions do not interact with each other at all. Does anyone else experience this behavior? What is happening, and how can I fix it?
--------------------------- Additional information -----------------------------
OS: Ubuntu 13.10
cores: 24
Each instance writes its own output file. Approximately once per minute, each instance replaces its output file with the updated result, which is about 10 lines of text. So I don't think they are I/O-bound.
According to top, each core uses 1.5-2.5% of memory. When running 10 cores, 16 GB is used and 9 GB is free. With nothing running, 11 GB is used and 14 GB are free.
There is no network request.
The following are (current-memory-use) values divided by 1,000,000 (i.e. MB), sampled over 12 minutes on 3 of the 10 cores:
core 3: 313, 48, 73, 154, 292, 242
core 4: 56, 245, 261, 106, 229, 190
core 6: 55, 238, 66, 229, 275, 207
When I run (current-memory-use) without anything else, it returns 29 MB.
I found the issue. My program indeed used too much memory. So when I run multiple instances at the same time, either the working sets no longer fit in cache (probably the L3 cache) or they exceed the memory bandwidth.
I then tried to track down why my program used so much memory. By putting (current-memory-use) in many places in the program, I found that the issue came from arithmetic-shift. Because of that one operation, memory usage immediately doubled.
The problem occurred when executing (arithmetic-shift x y) with a big x and a positive y. In that case, I believe the result is represented as a heap-allocated "bignum" (boxed) instead of a "fixnum" (unboxed).
Even though I masked the result down to 32 bits later, something prevented Racket from optimizing that away (likely first-order functions). I fixed it by masking x before passing it to arithmetic-shift, so that the result never exceeds a 32-bit number. Now my program uses 80 MB instead of 300 MB, and I get the speedup I expect!
I suppose this isn't truly an answer; it's more like a guess and advice that doesn't fit in a comment.
From the list @MarkSetchell gave, the most obvious place to start is I/O -- do the processes make network requests, or share an input file?
Slightly less obvious (but, wild guess, more likely in your case) is memory. A sole instance could use all available RAM if needed. Does it? With 10 instances sharing the same RAM, they'd probably garbage-collect more often, which would be slower.
Try adding something like
(thread
 (λ ()
   (let loop ()
     (displayln (current-memory-use))
     (sleep 5)
     (loop))))
and see how that plots over time. For one instance, does it top out at a value? How does that compare to RAM in the system?
And/or, use racket -W "error debug@GC" <your-program> to show debug-level log info from the GC.
I am running a Python program that calls H2O for deep learning (training and testing). The program runs a loop of 20 iterations, and each iteration calls H2ODeepLearningEstimator() 4 times along with the associated predict() and model_performance() calls. I call h2o.remove_all() and clean up all data-related Python objects after each iteration.
Data size: training set of 80,000 rows with 122 features (all float), with 20% used for validation (10-fold CV); test set of 20,000 rows. Binary classification.
Machine 1: Windows 7, 4 core, Xeon, each core 3.5GHz, Memory 32 GB
Takes about 24 hours to complete
Machine 2: CentOS 7, 20 core, Xeon, each core 2.0GHz, Memory 128 GB
Takes about 17 hours to complete
I am using h2o.init(nthreads=-1, max_mem_size = 96)
So, the speed-up is not that much.
My questions:
1) Is the speed-up typical?
2) What can I do to achieve substantial speed-up?
2.1) Will adding more cores help?
2.2) Are there any H2O configuration or tips that I am missing?
Thanks very much.
- Mohammad,
Graduate student
If the training time is the main effort, and you have enough memory, then the speed-up will be proportional to cores times core speed. So you might have expected a 40/14 ≈ 2.86× speed-up (i.e. your 24 hrs coming down to the 8-10 hour range).
There is a typo in your h2o.init(): 96 should be "96g". However, I think that was a typo when writing the question, as h2o.init() would return an error message. (And H2O would fail to start if you'd tried "96", with the quotes but without the "g".)
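In other words, something like this (assuming you do want to give H2O 96 GB):

import h2o
h2o.init(nthreads=-1, max_mem_size="96g")  # "96g" - quotes plus the trailing "g"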
You didn't show your h2o.deeplearning() command, but I am guessing you are using early stopping. And that can be unpredictable. So, what might have happened is that your first 24hr run did, say, 1000 epochs, but your second 17hr run did 2000 epochs. (1000 vs. 2000 would be quite an extreme difference, though.)
It might be that you are spending too much time scoring. If you've not touched the defaults, this is unlikely. But you could experiment with train_samples_per_iteration (e.g. set it to 10 times the number of your training rows).
What can I do to achieve substantial speed-up?
Stop using cross-validation. That might be a bit controversial, but personally I think 80,000 training rows is going to be enough to do an 80%/10%/10% split into train/valid/test. That will be 5-10 times quicker.
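For illustration, here is a hedged sketch of what that could look like with the Python API (the file path and response-column name are hypothetical, and the parameter values are illustrative; it also folds in the earlier train_samples_per_iteration suggestion):

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init(nthreads=-1, max_mem_size="96g")

data = h2o.import_file("my_data.csv")        # hypothetical path
y = "label"                                  # hypothetical response column
x = [c for c in data.columns if c != y]
data[y] = data[y].asfactor()                 # binary classification target

# 80/10/10 train/valid/test split instead of 10-fold CV
train, valid, test = data.split_frame(ratios=[0.8, 0.1], seed=42)

dl = H2ODeepLearningEstimator(
    train_samples_per_iteration=800000,      # e.g. ~10x the 80,000 training rows
)
dl.train(x=x, y=y, training_frame=train, validation_frame=valid)
print(dl.model_performance(test_data=test))  # check the test score is close to the valid score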
If it is for a paper, and you want to show more confidence in the results, once you have your final model, and you've checked that test score is close to valid score, then rebuild it a couple of times using a different seed for the 80/10/10 split, and confirm you end up with the same metrics. (*)
*: By the way, take a look at the score for each of the 10 cv models you've already made; if they are fairly close to each other, then this approach should work well. If they are all over the place, you might have to re-consider the train/valid/test splits - or just think about what it is in your data that might be causing that sensitivity.
I'm trying to determine the ideal number of cores for my model (in this example I'm using XBeach, but I have the same issue when running other models such as SWAN) by doing a speedup test, i.e. I'm running the same code, mpiexec -n <np> mymodel.exe, with <np> = 2, 4, 8, 12, 16, 18, 20, 24, 32, and 36 cores/processes and seeing how long it takes. I ran the test multiple times, each time with different results - see image.
To my understanding, MPI tries to distribute all tasks evenly across all processors (so having other applications running in the background might slow things down), but I'm still confused why I get SUCH a large range of computing times for the same code with the same number of cores/tasks. Why could it be that my model takes 18 minutes to run on 18 cores one time and then 128 minutes the next time I run it?
I'm using MPICH2 1.4.1p1 on Windows 10 on a machine with 2 NUMA nodes with 36 cores each so any potential background tasks should easily be handled on the remaining cores/node. It also doesn't look like I'm running out of memory.
To summarize my questions are:
Is it normal to have such a large range in computation time when using MPI?
If it is - why?
Is there a better way than mine to determine the ideal number of cores for an MPI program?
I've done a cross-validated SVC analysis. Reading the scikit-learn docs for SVC, I see this:
"Kernel cache size: For SVC, SVR, nuSVC and NuSVR, the size of the kernel cache has a strong impact on run times for larger problems. If you have enough RAM available, it is recommended to set cache_size to a higher value than the default of 200(MB), such as 500(MB) or 1000(MB)."
http://scikit-learn.org/stable/modules/svm.html
So I re-ran my analysis several times and timed the results using several different values for cache_size (50, 100, 200, 800, 1200, 2000, 4000, 8000).
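Roughly, the timing loop looked like the following (a minimal sketch on a synthetic stand-in dataset, since the real data isn't shown here):

import time
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for cache_size in (50, 100, 200, 800, 1200, 2000, 4000, 8000):
    clf = SVC(kernel="rbf", cache_size=cache_size)   # cache_size is in MB
    start = time.perf_counter()
    cross_val_score(clf, X, y, cv=5)
    print(f"cache_size={cache_size} MB: {time.perf_counter() - start:.1f} s")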
My full analysis takes about 11.2 seconds when the cache_size is below 2000, and the time jumps to 40 seconds when the cache_size is greater than 2000.
The analysis takes place on a modern computer with 16 GB of RAM.
I'm wondering if anybody knows possible reasons why the processing time doesn't change at all for any cache value below 2000, and why it gets longer with higher values. Again, the computer has 16 GB of RAM, and no other signs of slowdown occur at any value of cache_size.
Thanks for any thoughts.
The slow-down you noticed for cache > 2000 MB may be the consequence of this bug: https://github.com/scikit-learn/scikit-learn/issues/8012 (signed 32-bit integer overflow - presumably the cache size in bytes overflows once it reaches 2048 MB = 2^31 bytes, which matches the ~2000 MB threshold you observed).
fio -numjobs=8 -directory=/mnt -iodepth=64 -direct=1 -ioengine=libaio -sync=1 -rw=randread -bs=4k
FioTest: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
iops: (8 threads and iodepth=64)-> 356, 397, 399, 396, ...
But when -numjobs=1 and iodepth=64, the iops -> 15873.
I'm a little confused. Why do the IOPS get smaller as -numjobs gets larger?
It's hard to make a general statement because the correct answer depends on a given setup.
For example, imagine I have a cheap spinning SATA disk whose sequential speed is fair but whose random access is poor. The more random I make the accesses, the worse things get (because of the latency involved in servicing each I/O - https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html suggests 3 ms is the cost of a seek). So 64 simultaneous random accesses are bad, because the disk head has to seek to 64 different locations before the last I/O is serviced. If I now bump the number of jobs up to 8, then 64 * 8 = 512 means even MORE seeking. Worse, there is a limit to how many simultaneous I/Os can actually be serviced at any given time, so the disk's queue of in-flight I/Os can become completely full, other queues start backing up, latency goes up again, and IOPS start tumbling. Also note this is compounded because sync=1 prevents the disk from saying "it's in my cache, you can carry on": the I/O has to be on non-volatile media before it is marked as done.
This may not be what is happening in your case but is an example of a "what if" scenario.
I think you should add '--group_reporting' to your fio command.
group_reporting
If set, display per-group reports instead of per-job when numjobs is specified.
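For example, the command from the question with just that flag added (everything else unchanged):

fio -numjobs=8 -directory=/mnt -iodepth=64 -direct=1 -ioengine=libaio -sync=1 -rw=randread -bs=4k -group_reporting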
I’m working on tuning performance on a high-performance, high-capacity data engine which ultimately services an end-user web experience. Specifically, the piece delegated to me revolves around characterizing multi-threaded file I/O and memory mapping of the data to local cache. In writing test applications to isolate the timing tall poles, several questions have been exposed. The code has been minimized to perform only a system file open (open(O_RDONLY)) call. I’m hoping that the result of this query helps us understand the fundamental low-level system processes so that a complete predictive (or at least relational) timing model can be developed. Suggestions are always welcome. We seem to have hit a timing barrier, and would like to understand the behavior and determine whether that barrier can be broken.
The test program:
Is written in C, compiled using the gnu C compiler as noted below;
Is minimally written to isolate the discovered issues to a single system file “open()”;
Is configurable to simultaneously launch a requested number of pthreads;
Loads a list of 1000 text files of ~8 KB size;
Creates the threads (simply) with no attribute modifications;
Each thread performs multiple, sequential file open() calls on the next available file from the pre-determined list of files until the list is exhausted, such that a single thread should open all 1000 files, 2 threads should theoretically open ~500 files each (not yet verified), etc.
We’ve run tests multiple times, parametrically varying the thread count, file sizes, and whether the files are located on a local or remote server. Several questions have come up.
Observed results (opening remote files):
File open times are higher the first time through (as expected, due to file caching);
Running the test app with one thread to load all the remote files takes X seconds;
It appears that running the app with a thread count between 1 and the number of available CPUs on the machine results in run times roughly proportional to the number of threads (nX seconds).
Running the app with a thread count > #CPUs results in run times that seem to level out at approximately the same value as the time it takes to run with #CPUs threads (is this coincidental, a systematic limit, or something else?).
Running multiple, concurrent processes (for example, 25 concurrent instances of the same test app) results in the times being approximately linear with the number of processes for a selected thread count.
Running the app on different servers shows similar results.
Observed results (opening files residing locally):
Orders of magnitude faster times (as to be expected);
As the thread count increases, the times reach a low inflection point at around 4-5 active threads, then increase again until the number of threads equals the CPU count, then level off;
Running multiple, concurrent processes (same test) results in the times being approximately linear with the number of processes for a constant thread count (the same result as the multiple-process observation for remote files above).
Also, we noticed that local opens take about 0.01 ms while sequential network opens are 100x slower, at about 1 ms. Opening network files, we get a linear throughput increase up to 8x with 8 threads, but 9+ threads do nothing. The network open calls seem to block after more than 8 simultaneous requests. What we expected was an initial delay equal to the network round trip, and then approximately the same throughput as local. Perhaps there is extra mutex locking done on the local and remote systems that takes 100x longer. Perhaps there is some internal queue of remote calls that only holds 8.
Expected results and questions to be answered either by test or by answers from forums like this one:
Running multiple threads would result in the same work done in shorter time;
Is there an optimal number of threads?
Is there a relationship between the number of threads and CPUs available?
Is there some other systematic reason that an 8-10 file limit is observed?
How does the system call to “open()” work in a multi-threading process?
Each thread gets its context-switched time-slice;
Does the open() call block and wait until the file is open/loaded into file cache? Or does the call allow context switching to occur while the operation is in progress?
When the open() completes, does the scheduler reprioritize that thread to execute sooner, or does the thread have to wait for its turn in a round-robin fashion?
Would having the mounted volume on which the 1000 files reside set as read-only or read/write make a difference?
When open() is called with a full path, is each element in the path stat()ed? Would it make more sense to open() a common directory in the list of files tree, and then open() the files under that common directory by relative path?
Development test setup:
Red Hat Enterprise Linux Server release 5.4 (Tikanga)
8 CPUs, each with the characteristics shown below:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU X5460 @ 3.16GHz
stepping : 6
cpu MHz : 1992.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm
bogomips : 6317.47
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
GNU C compiler, version:
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
Not sure if this is one of your issues, but it may be of use.
The one thing that struck me, while optimizing thousands of random reads on a single SATA disk, was that performing non-blocking I/O isn't so easy to do in Linux in a clean way without extra threads.
It is (currently) impossible to issue a non-blocking read() on a block device; i.e. it will block for the 5 ms seek time the disk needs (and 5 ms is an eternity at 3 GHz). Specifying O_NONBLOCK to open() only served some purpose for backward compatibility, with CD burners or something (this was a rather vague issue). Normally, open() doesn't block or cache anything; it's mostly just there to get a handle on a file for doing some data I/O later.
For my purposes, mmap() seemed to get me as close to the kernel's handling of the disk as possible. Using madvise() and mincore() I was able to fully exploit the NCQ capabilities of the disk, which I verified simply by varying the queue depth of outstanding requests: the total time taken to issue 10k reads turned out to be roughly inversely proportional to that depth.
Thanks to 64-bit memory addressing, using mmap() to map an entire disk into memory is no problem at all. (On 32-bit platforms, you would need to map the parts of the disk you need using mmap64().)
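For what it's worth, here is a rough sketch of the same madvise-then-read idea in Python (3.8+ on Linux) rather than C; mincore() isn't exposed in the Python standard library, and the file path is just a placeholder:

import mmap
import os
import random

PAGE = mmap.PAGESIZE
fd = os.open("/path/to/large/file", os.O_RDONLY)   # placeholder path; assumes a large file
size = os.fstat(fd).st_size
mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)

# Pick some page-aligned random offsets to read.
offsets = [random.randrange(0, size // PAGE) * PAGE for _ in range(10000)]

# Queue up the readahead hints first, so the kernel/disk (NCQ) can reorder the seeks...
for off in offsets:
    mm.madvise(mmap.MADV_WILLNEED, off, PAGE)

# ...then actually touch the pages; many should already be in flight or cached.
total = sum(mm[off] for off in offsets)

mm.close()
os.close(fd)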