Question
Hello, I have a question about thread pools and simultaneous HDD reads/writes. It's my first time posting a question, so I apologize in advance for the lengthy write-up...
On one PC, the following are running:
an image processing and image storage program,
and an image loading program.
When image storage and image loading run simultaneously on one HDD, the image processing operation seems to slow down.
An HDD has only one set of read/write heads, so I know it is fastest to do only one operation at a time... There is nothing I can do about that part, so I want to minimize the slowdown.
Next, the development environment and the current implementation.
I am working with MFC + OpenCV (Windows 10.0.19044).
The image processing program runs every time an instruction is received, and it operates 24 hours a day.
Each image is 16384 * 40000 pixels at 1 byte per pixel, and there are two such images.
Since the images are very large, both the image processing and the image storage (after splitting the image into regions) are performed in a thread pool.
The image loading program runs when the user needs it.
On a query, it looks up the image information in the DB and retrieves the images from the HDD.
The PC is equipped with an SSD and two 13 TB HDDs.
The processor is an i9-12900KF (16 cores, 24 threads).
Jobs are dispatched through a queue, and both image processing jobs and image storage jobs run on the same thread pool.
Because they share one thread pool, I suspect that while images are being stored, fewer threads are left for image processing.
I set the number of threads to 40 in both programs, for no particular reason. I have heard the count should be chosen based on the number of cores, and I am still considering that.
I store each image in both PNG and JPG format.
By default, image loading opens the small JPG file; a separate function lets the user load the PNG directly if necessary.
When saving a split image:
the image encoding work runs concurrently in the thread pool,
and the memory -> HDD writes are performed sequentially, one at a time, in a single thread.
For image loading, the HDD -> memory reads are performed sequentially, one at a time,
and the image decoding work runs concurrently in the thread pool.
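For reference, here is a minimal C++ sketch of that save path: a single writer thread drains a queue of already-encoded buffers. The type and class names are made up for illustration; in the real program the buffers would come from the encoding jobs running in the thread pool (e.g. the output of cv::imencode).

    // Minimal sketch: encoding runs on pool threads, disk writes are drained
    // by one dedicated writer thread (types/names are illustrative).
    #include <condition_variable>
    #include <fstream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    struct EncodedTile {
        std::string path;
        std::vector<unsigned char> bytes;   // e.g. PNG/JPG buffer from cv::imencode
    };

    class SingleWriter {
    public:
        SingleWriter() : worker_(&SingleWriter::run, this) {}
        ~SingleWriter() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_one();
            worker_.join();
        }
        void push(EncodedTile t) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(t)); }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&] { return done_ || !q_.empty(); });
                if (q_.empty()) return;           // only reached when done_
                EncodedTile t = std::move(q_.front());
                q_.pop();
                lk.unlock();                      // write without holding the lock
                std::ofstream f(t.path, std::ios::binary);
                f.write(reinterpret_cast<const char*>(t.bytes.data()),
                        static_cast<std::streamsize>(t.bytes.size()));
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<EncodedTile> q_;
        bool done_ = false;
        std::thread worker_;
    };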
The image processing result must be stored in the DB, and that result must be delivered quickly.
It does not matter if image storage is slowed down.
Image loading speed is not satisfying the user, but some compromise is acceptable there. (Still, I would like to return the result as quickly as possible...)
So here is what I have considered:
If the image storage/loading threads run at a lower thread priority, will the image processing threads get more CPU time? (A sketch of this idea follows at the end of the question.)
Is it meaningful to split image storage and image processing into separate thread pools instead of sharing one?
What about saving the image to the SSD first and having a separate service program move it slowly to the HDD?
Or is the disk itself actually the problem?
Ideas 1 and 2 will be developed and released. (It is difficult to reproduce the problem in the office...)
With the third method, writing to the SSD first and then flushing to the HDD in one pass would still overlap with the HDD reads,
so I think it mainly complicates the development. However, the SSD is significantly faster than the HDD when storing images.
As for number 4: loading the JPGs is not slow, because the files are small... it is the decoding that is slow, and I assumed that from the decoding stage onward the HDD is no longer involved.
Also, both programs originally had 40 threads in their pools. I reduced the image loading program to two threads and shipped an update, but it was reported that only the image loading operation became slower and the issue remained.
The situation is complicated and there are many suspicious points, but I am asking because I think there are things I don't know or have gotten wrong...
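Here is the sketch mentioned in idea 1. It assumes the storage/loading jobs run on threads the program owns (the names are illustrative), and simply lowers their priority with the Win32 API:

    // Sketch for idea 1: lower the priority of the storage/loading worker
    // threads so the image-processing threads win when competing for CPU time.
    #include <windows.h>
    #include <thread>

    void StartLowPriorityIoWorker()
    {
        std::thread io_worker([] {
            // Lower this thread's scheduling priority; BELOW_NORMAL is usually
            // enough, THREAD_MODE_BACKGROUND_BEGIN also lowers I/O priority.
            ::SetThreadPriority(::GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
            // ... dequeue and run storage/loading jobs here ...
        });
        io_worker.detach(); // for illustration only; a real pool would join
    }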
First of all, you use a thread pool with far more threads than the number of cores of the i9-12900KF processor. Having two threads running on the same physical core generally causes them to be slower. If they run on the same logical core, then they cannot run simultaneously (they will be constantly interrupted). In fact, even if they run on different physical cores, one thread can significantly slow down another if it makes intensive use of the L3 cache or of memory, which is likely your case. Operating on a large buffer can cause cache lines of other cores' caches to be evicted and thus reloaded later. This is known as cache thrashing. The problem can become critical with non-contiguous loads/stores.
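As a hedge against this oversubscription, the pool size can be derived from the hardware instead of being hard-coded to 40; a minimal sketch (the two-thread headroom is an arbitrary choice):

    // Size the pool from the hardware instead of a hard-coded 40.
    #include <thread>

    unsigned PoolSize()
    {
        unsigned hw = std::thread::hardware_concurrency(); // 24 on an i9-12900KF
        if (hw == 0) hw = 8;                               // the call may return 0
        // Leave a couple of logical CPUs for the OS and the dedicated I/O thread.
        return hw > 2 ? hw - 2 : 1;
    }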
The target processor is a hybrid (big/little) design, so scheduling threads on it is more complex than usual. In fact, many libraries do not yet support such architectures well (they do not run efficiently on them), and even OS schedulers are barely suited to this kind of architecture (at least on Windows and Linux). The number of threads per core is not the same for all cores: a big core can execute 2 threads simultaneously (sharing the available resources), while a little core can only execute 1 thread at a time. It is also worth noting that the little cores do not run at the same frequency as the big cores (2.4 GHz vs 3.2 GHz base frequency, and 3.9 GHz vs 5.1 GHz turbo frequency). Depending on which core a thread is scheduled on, its performance can therefore change.
The frequency of the cores running your threads depends on the number of active cores and on the work done on each core. For example, running computationally intensive code that uses the FP AVX2 units (or the not-officially-supported AVX-512 units) on one core can significantly reduce the frequency of the other cores. The higher the number of active cores, the lower the frequency. Dynamic frequency scaling hurts the scalability of applications, but this scaling is necessary for the processor to stay within its power budget (and not melt).
Caching also matters a lot. Mainstream OSes tend to keep data read from or written to the HDD in memory so that later accesses are faster. This uses additional memory that is not counted as allocated. When a process requests a large amount of memory, the OS flushes/invalidates part of the I/O cache to make room, and later accesses then cause the data to be reloaded from the storage device (much slower). The solution is to check the amount of truly available memory (the part not used by the cache) and not to use so much memory that the storage cache gets evicted.
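On Windows such a check could look roughly like the sketch below, using GlobalMemoryStatusEx. Note that ullAvailPhys also counts standby (cached) pages, so it is only an approximation of the "not cached" amount, and the 1 GiB headroom is an arbitrary example:

    // Check available physical memory before allocating large image buffers,
    // so the allocation does not evict the OS file cache wholesale.
    #include <windows.h>
    #include <cstdint>

    bool EnoughFreePhysicalMemory(std::uint64_t wanted_bytes)
    {
        MEMORYSTATUSEX st{};
        st.dwLength = sizeof(st);
        if (!::GlobalMemoryStatusEx(&st))
            return true;                           // if the query fails, don't block
        // ullAvailPhys is approximate: it also includes standby (cached) pages.
        const std::uint64_t headroom = 1ull << 30; // 1 GiB, arbitrary example
        return st.ullAvailPhys > wanted_bytes + headroom;
    }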
Having two threads doing I/O operations is generally not faster than one thread on an HDD (especially with one head). Some OS storage stacks use locks, sometimes even a single giant lock. Because of that, one loading thread issuing asynchronous I/O can be faster than blocking I/O on one or several threads: the OS can reorder the requests so they are more contiguous, reducing seek time by picking up data along the way.
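For illustration, a minimal sketch of "one loading thread with asynchronous I/O" on Windows, using overlapped ReadFile so that several requests are in flight from a single thread (chunk size and error handling are simplified):

    // One thread, several outstanding reads: overlapped (asynchronous) I/O lets
    // the OS reorder/merge requests instead of blocking on them one by one.
    #include <windows.h>
    #include <vector>

    bool ReadTwoChunksAsync(const wchar_t* path)
    {
        HANDLE f = ::CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                                 OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
        if (f == INVALID_HANDLE_VALUE) return false;

        const DWORD kChunk = 4 * 1024 * 1024;            // 4 MiB, arbitrary
        std::vector<char> buf0(kChunk), buf1(kChunk);

        OVERLAPPED ov0{}, ov1{};
        ov0.Offset = 0;
        ov1.Offset = kChunk;                              // second chunk's file offset
        ov0.hEvent = ::CreateEventW(nullptr, TRUE, FALSE, nullptr);
        ov1.hEvent = ::CreateEventW(nullptr, TRUE, FALSE, nullptr);

        // Both requests are issued before either completes (ReadFile returns
        // FALSE with ERROR_IO_PENDING while a request is still in flight).
        ::ReadFile(f, buf0.data(), kChunk, nullptr, &ov0);
        ::ReadFile(f, buf1.data(), kChunk, nullptr, &ov1);

        DWORD got0 = 0, got1 = 0;
        bool ok = ::GetOverlappedResult(f, &ov0, &got0, TRUE) &&
                  ::GetOverlappedResult(f, &ov1, &got1, TRUE);

        ::CloseHandle(ov0.hEvent);
        ::CloseHandle(ov1.hEvent);
        ::CloseHandle(f);
        return ok;
    }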
Related
First I will describe my PC environment, the background of my question, and my problem; then I will explain my exact question.
Environment:
OS: Ubuntu 16.04
Kernel: 4.17.1
CPU: i7-6700k
Memory: 8GB DRAM
Storage: SSD 120GB
Background:
I'm trying to optimize the Linux kernel for my specific application. The following is the abstract logic of this application:
1. Call malloc to allocate a memory region whose size is exactly 4 KB (the page size).
2. Copy predefined data (also 4 KB) into the allocated memory.
3. Do some computation.
4. Free the allocated memory.
This sequence occurs several thousand to ten thousand times per second.
So I think that copying the predefined data into the allocated memory with memcpy() thousands of times every second is very inefficient, but I cannot change the application's code.
My problem:
I want to perform these copies asynchronously from a kernel module, using as few CPU cycles as possible. So I'm trying to implement a kernel module that asynchronously copies this predefined data onto free page frames inside the kernel, and that manages a pool of page frames which already hold the predefined data. When my application requests a page frame, the kernel hands out one from this pool.
To copy the data asynchronously, I first considered DMA, but my CPU's Intel idma64 engine cannot copy memory to memory asynchronously. Now I'm trying to copy this data from secondary storage (SSD) to memory instead. I found that Linux has a library for asynchronous I/O called libaio.
My question:
1. Can I use the libaio library from a kernel module? If not, what library or API should I use to copy asynchronously inside my kernel module?
2. Will libaio (or something else) really perform the copies without consuming CPU cycles?
I don't think you need to write a kernel module. A user-space thread pool of CPU-pinned threads working with a collection of memory maps of files will be as efficient as it is possible to implement. Just be careful of "TLB shootdown", i.e. avoid modifying the address space of the process, and throw as much virtual address space as you can at the problem to avoid that. Perhaps add a little hinting to the kernel, via madvise(), about which written pages will never be used again, and you should be close to optimal: enough threads will maximise the queue depth to the SSD (aim for QD8 to QD16), and you should easily saturate an NVMe link while keeping CPU usage below 100%.
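A rough sketch of the mapping-plus-hinting idea on Linux (error handling is minimal, and note that MADV_DONTNEED has different semantics for file-backed shared mappings than for anonymous memory):

    // Sketch: map a file, do the work in place, then hint to the kernel that
    // the pages will not be needed again.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstring>

    bool ProcessMappedFile(const char* path)
    {
        int fd = ::open(path, O_RDWR);
        if (fd < 0) return false;

        struct stat st{};
        if (::fstat(fd, &st) != 0) { ::close(fd); return false; }
        size_t len = static_cast<size_t>(st.st_size);

        void* p = ::mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        ::close(fd);
        if (p == MAP_FAILED) return false;

        std::memset(p, 0, len);                 // stand-in for the real work

        // Hint: we will not touch these pages again, so the kernel may drop
        // them instead of keeping them cached on our account.
        ::madvise(p, len, MADV_DONTNEED);

        ::munmap(p, len);
        return true;
    }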
Things get harder if you have many NVMe-attached SSDs; you may need to consider replacing Linux with something with more scalable storage I/O, but there is a throughput vs scalability tradeoff there. Windows and FreeBSD will scale better with lots of devices if you partition the work up right, but Linux will do much better with a few devices. Good luck!
After some theoretical discussion today I decided to do some research, but I did not find anything conclusive.
Here's the problem:
We have written a tool that reads around 10 GB of image files from a data set of several terabytes. We want to speed up the execution time by minimizing I/O overhead. The idea would be to "pre-warm" the disk cache, as we know beforehand which directory we will be reading from as the tool executes. Is there any API or method to give this hint to Windows so that it can start pre-warming the disk cache, speeding up future disk access because the files are already in RAM (of which there is plenty on the machines we run the tool on)?
I know Windows does readahead on a single file, but what if I have a directory with thousands of files?
I haven't found any direct win32 APIs or command line tools to do this directly.
What if I start a low priority background thread, opening all the files for reading and closing them?
I could of course memory map all the files and pin them in RAM, but that would probably run the risk of starving the main worker thread of I/O.
The general idea here is that the tool "bursts" I/O requests, as each thread will do I/O and CPU processing in sequence, hence we could use the "idle" I/O time to preload the remaining files into RAM.
(I could of course benchmark, and I will, but I would like to understand a bit more of how this works in order to be more scientific and less cargo culty).
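For what it's worth, the "low priority background thread that opens and reads every file" idea could be sketched like this (the std::filesystem enumeration and the 1 MiB buffer are arbitrary choices, not a recommendation):

    // Sketch: sequentially stream every file in the directory once from a
    // low-priority thread so it lands in the OS file cache.
    #include <windows.h>
    #include <filesystem>
    #include <fstream>
    #include <string>
    #include <thread>
    #include <vector>

    void PrewarmDirectory(const std::wstring& dir)
    {
        std::thread([dir] {
            ::SetThreadPriority(::GetCurrentThread(), THREAD_PRIORITY_LOWEST);
            std::vector<char> buf(1 << 20);                 // 1 MiB read buffer
            for (const auto& e : std::filesystem::recursive_directory_iterator(dir)) {
                if (!e.is_regular_file()) continue;
                std::ifstream f(e.path(), std::ios::binary);
                while (f.read(buf.data(), static_cast<std::streamsize>(buf.size())))
                    ;                                       // discard, cache-warm only
            }
        }).detach();
    }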
I have a high throughput low latency application (3000 Request/Sec, 100ms per request), and we heavily use Java 8 ConcurrentHashMap for performing lookups. Usually these maps are updated by a single background thread and multiple threads read from these maps.
I am seeing a performance bottleneck, and on profiling I find ConcurrentHashMap.get as being the hotspot and taking majority of the time.
In another case, I see ConcurrentHashMap.computeIfAbsent being the hotspot, even though the mapping function has very low latency; the profile shows computeIfAbsent spending 90% of the time executing itself and very little time executing the mapping function.
My question: is there any way I could improve the performance? I have around 80 threads concurrently reading from the CHM.
I have around 80 threads concurrently reading from CHM.
The simplest things to do are
If you have a CPU-bound process, don't have more active threads than you have CPUs; otherwise you only add overhead, and if those extra threads hold a lock while they are not running, it really will not help.
Increase the number of partitions. You will want at least 4x as many segments/partitions as you have threads accessing a single map (a sketch of this partitioning idea follows at the end of this answer). However, you will get strange behaviour in CHM if you access it with more than about 40 threads, due to the way cache coherency works, so I suggest using a more efficient data structure for higher degrees of concurrency. In Java 8 the concurrencyLevel is only a hint, but it is still better than leaving the default initial size of 16.
Don't spend so much time in the CHM. Find a way to do useful work without hitting a shared resource and your threads will run much more efficiently.
If you have any latencies you can see in a low latency system, you have a problem IMHO.
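The question is about Java's ConcurrentHashMap, but the partitioning advice above generalises; here is a C++ sketch of the same idea, spreading one hot map across independent shards, each with its own lock:

    // Illustration of "increase the number of partitions": split one hot map
    // into N independent shards so readers/writers rarely contend on one lock.
    #include <array>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <unordered_map>

    template <typename K, typename V, std::size_t N = 64>
    class ShardedMap {
    public:
        void put(const K& k, V v) {
            Shard& s = shard(k);
            std::lock_guard<std::mutex> lk(s.m);
            s.map[k] = std::move(v);
        }
        std::optional<V> get(const K& k) const {
            const Shard& s = shard(k);
            std::lock_guard<std::mutex> lk(s.m);
            auto it = s.map.find(k);
            if (it == s.map.end()) return std::nullopt;
            return it->second;
        }
    private:
        struct Shard {
            mutable std::mutex m;
            std::unordered_map<K, V> map;
        };
        Shard& shard(const K& k) { return shards_[std::hash<K>{}(k) % N]; }
        const Shard& shard(const K& k) const { return shards_[std::hash<K>{}(k) % N]; }
        std::array<Shard, N> shards_;
    };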
Lots of personal experience, anecdotal evidence, and some rudimentary analysis suggests that a Java server (running, typically, Oracle's 1.6 JVM) has faster response times when it's under a decent amount of load (only up to a point, obviously).
I don't think this is purely hotspot, since response times slow down a bit again when the traffic dies down.
In a number of cases we can demonstrate this by averaging response times from server logs ... in some cases it's as high as 20% faster, on average, and with a smaller standard deviation.
Can anyone explain why this is so? Is it likely a genuine effect, or are the averages simply misleading? I've seen this for years now, through several jobs, and tend to state it as a fact, but have no explanation for why.
Thanks,
Eric
A few thoughts:
Hotspot kicks in when a piece of code is being executed significantly more than other pieces (it's the hot spot of the program). This makes that piece of code significantly faster (for the normal path) from that point forward. The rate of call after the hotspot compilation is not important, so I don't think this is causing the effect you are mentioning.
Is the effect real? It's very easy to trick yourself with statistics. Not saying you are, but be sure that all your runs are included in the result, and that all other effects (such as other programs, background activity, and your monitoring program) are the same in all cases. I have more than once had my monitoring program, such as top, cause a difference in behaviour. On one occasion, the performance of the application went up appreciably when the caches warmed up on the database - there was memory pressure from other applications on the same DB instance.
The Operating System and/or CPU may well be involved. The OS and CPU both actively and passively do things to improve the responsiveness of the main program as it moves from being mainly running to being mainly waiting for I/O and vice versa, including:
OS paging memory to disk while it's not being used, and back to RAM when the program is running
OS will cache frequently used disk blocks, which again may improve the application performance
CPU instruction and memory caches fill with the active program's instruction and data
Java applications are particularly sensitive to memory paging effects because:
A typical Java application server will pre-allocate almost all free memory to Java. The large memory makes the application inherently more sensitive to memory effects
The generational garbage collector used to manage Java memory ends up creating new objects across a lot of pages, so each request to the application will need more page requests than in other languages. (This is true principally for 'new' objects that have not survived many garbage collections; objects promoted to the old, tenured generation are actually very compactly stored.)
As most available physical memory is allocated on the system, there is always pressure on memory, and the largest, least recently run application is a perfect candidate to be paged out.
With these considerations, there is much more probability that there will be page misses and therefore a performance hit than environments with smaller memory requirements. These will be particularly manifest after Java has been idle for some time.
If you use Solaris or Mac, the excellent dTrace can trace memory and disk paging specific to an application. The JVM has numerous dTrace hooks that can be used as triggers to start and stop page monitoring.
On Solaris, you can use large memory pages (even over 1GB in size) and pin them to RAM so they will never be paged out. This should eliminate the memory page problem stated above. Remember to leave a good chunk of free memory for disk caching and for other system/maintenance/backup/management apps. I am sure that other OSes support similar features.
TL/DR: The currently running program in modern operating systems will appear to run faster after a few seconds as the OS brings the program and data pages back from disk, places frequently used disk pages in disk cache and the OS instruction and data caches will tend to be "warmer" for the main program. This effect is not unique to the JVM but is more visible due to the memory requirements of typical Java applications and the garbage collection memory model.
We've just bought a 32-core Opteron machine, and the speedups we get are a little disappointing: beyond about 24 threads we see no speedup at all (actually gets slower overall) and after about 6 threads it becomes significantly sub-linear.
Our application is very thread-friendly: our job breaks down into about 170,000 little tasks which can each be executed separately, each taking 5-10 seconds. They all read from the same memory-mapped file of size about 4Gb. They make occasional writes to it, but it might be 10,000 reads to each write - we just write a little bit of data at the end of each of the 170,000 tasks. The writes are lock-protected. Profiling shows that the locks are not a problem. The threads use a lot of JVM memory each in non-shared objects and they make very little access to shared JVM objects and of that, only a small percentage of accesses involve writes.
We're programming in Java, on Linux, with NUMA enabled. We have 128 GB RAM. We have 2 Opteron CPUs (model 6274) of 16 cores each. Each CPU has 2 NUMA nodes. The same job running on an Intel quad-core (i.e. 8 cores) scaled nearly linearly up to 8 threads.
We've tried replicating the read-only data to have one-per-thread, in the hope that most lookups can be local to a NUMA node, but we observed no speedup from this.
With 32 threads, 'top' shows the CPU's 74% "us" (user) and about 23% "id" (idle). But there are no sleeps and almost no disk i/o. With 24 threads we get 83% CPU usage. I'm not sure how to interpret 'idle' state - does this mean 'waiting for memory controller'?
We tried turning NUMA on and off (I'm referring to the Linux-level setting that requires a reboot) and saw no difference. When NUMA was enabled, 'numastat' showed only about 5% of 'allocation and access misses' (95% of cache misses were local to the NUMA node). [Edit:] But adding "-XX:+UseNUMA" as a java command-line flag gave us a 10% boost.
One theory we have is that we're maxing out the memory controllers, because our application uses a lot of RAM and we think there are a lot of cache misses.
What can we do to either (a) speed up our program to approach linear scalability, or (b) diagnose what's happening?
Also: (c) how do I interpret the 'top' result - does 'idle' mean 'blocked on memory controllers'? And (d) is there any difference in the characteristics of Opterons vs Xeons?
I also have a 32 core Opteron machine, with 8 NUMA nodes (4x6128 processors, Magny-Cours, not Bulldozer), and I have faced similar issues.
I think the answer to your problem is hinted at by the 2.3% "sys" time shown in top. In my experience, this sys time is the time the system spends in the kernel waiting for a lock. When a thread can't get a lock it then sits idle until it makes its next attempt. Both the sys and idle time are a direct result of lock contention. You say that your profiler is not showing locks to be the problem. My guess is that for some reason the code causing the lock in question is not included in the profile results.
In my case a significant cause of lock contention was not the processing I was actually doing but the work scheduler that was handing out the individual pieces of work to each thread. This code used locks to keep track of which thread was doing which piece of work. My solution to this problem was to rewrite my work scheduler avoiding mutexes, which I have read do not scale well beyond 8-12 cores, and instead use gcc builtin atomics (I program in C on Linux). Atomic operations are effectively a very fine grained lock that scales much better with high core counts. In your case if your work parcels really do take 5-10s each it seems unlikely this will be significant for you.
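The answer above uses GCC builtin atomics in C; the same hand-out scheme in C++ with std::atomic looks roughly like this (a shared counter that each worker fetch-adds to claim the next task index):

    // Lock-free work hand-out: each worker claims the next task index with a
    // single atomic fetch_add instead of taking a mutex in the scheduler.
    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    void RunTasks(std::size_t task_count, unsigned num_threads,
                  void (*do_task)(std::size_t))
    {
        std::atomic<std::size_t> next{0};
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < num_threads; ++t) {
            workers.emplace_back([&] {
                for (;;) {
                    std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
                    if (i >= task_count) break;      // no more work to claim
                    do_task(i);
                }
            });
        }
        for (auto& w : workers) w.join();
    }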
I also had problems with malloc, which suffers horrible lock issues in high core count situations, but I can't, off the top of my head, remember whether this also led to sys & idle figures in top, or whether it just showed up using Mike Dunlavey's debugger profiling method (How can I profile C++ code running in Linux?). I suspect it did cause sys & idle problems, but I draw the line at digging through all my old notes to find out :) I do know that I now avoid runtime mallocs as much as possible.
My best guess is that some piece of library code you are using implements locks without your knowledge, is not included in your profiling results, and is not scaling well to high core-count situations. Beware memory allocators!
I'm sure the answer will lie in a consideration of the hardware architecture. You have to think of multi core computers as if they were individual machines connected by a network. In fact that's all that Hypertransport and QPI are.
I find that to solve these scalability problems you have to stop thinking in terms of shared memory and start adopting the philosophy of Communicating Sequential Processes. It means thinking very differently, ie imagine how you would write the software if your hardware was 32 single core machines connected by a network. Modern (and ancient) CPU architectures are not designed to give unfettered scaling of the sort you're after. They are designed to allow many different processes to get on with processing their own data.
Like everything else in computing these things go in fashions. CSP dates back to the 1970s, but the very modern and Java derived Scala is a popular embodiment of the concept. See this section on Scala concurrency on Wikipedia.
What the philosophy of CSP does is force you to design a data distribution scheme that fits your data and the problem you're solving. That's not necessarily easy, but if you manage it then you have a solution that will scale very well indeed. Scala may make it easier to develop.
Personally I do everything in CSP and in C. It's allowed me to develop a signal processing application that scales perfectly linearly from 8 cores to several thousand cores (the limit being how big my room is).
The first thing you're going to have to do is actually use NUMA. It isn't a magic setting that you turn on, you have to exploit it in your software's architecture. I don't know about Java, but in C one would bind a memory allocation to a specific core's memory controller (aka memory affinity), and similarly for threads (core affinity) in cases where the OS doesn't get the hint.
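In C or C++ on Linux, the memory- and core-affinity binding mentioned above looks roughly like the sketch below (libnuma, linked with -lnuma; the CPU and node numbers are placeholders chosen by the caller):

    // Pin the calling thread to one core and allocate its working buffer from
    // the local NUMA node, so accesses stay on that node's memory controller.
    // Build with: g++ numa_sketch.cpp -lnuma -pthread
    #include <numa.h>
    #include <pthread.h>
    #include <sched.h>
    #include <cstddef>

    void* AllocOnLocalNode(std::size_t bytes, int cpu, int node)
    {
        // Core affinity: restrict this thread to the chosen CPU.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        // Memory affinity: allocate from the memory attached to that CPU's node.
        if (numa_available() < 0)
            return nullptr;                       // no NUMA support on this system
        return numa_alloc_onnode(bytes, node);    // free later with numa_free()
    }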
I presume that your data doesn't break down into 32 neat, discrete chunks? It's difficult to give advice without knowing exactly the data flows implicit in your program. But think about it in terms of data flow. Draw it out even; Data Flow Diagrams are useful for this (another ancient graphical formal notation). If your picture shows all your data going through a single object (eg through a single memory buffer) then it's going to be slow...
I assume you have optimized your locks and kept synchronization to a minimum. In such a case, it still depends a lot on what libraries you are using to program in parallel.
One issue that can happen even if you have no synchronization issue, is memory bus congestion. This is very nasty and difficult to get rid of.
All I can suggest is somehow make your tasks bigger and create fewer tasks. This depends highly on the nature of your problem. Ideally you want as many tasks as the number of cores/threads, but this is not easy (if possible) to achieve.
Something else that can help is to give more heap to your JVM. This will reduce the need to run the Garbage Collector frequently and speed things up a little.
does 'idle' mean 'blocked on memory controllers'
No. You don't see that in top. I mean if the CPU is waiting for memory access, it will be shown as busy. If you have idle periods, it is either waiting for a lock, or for IO.
I'm the Original Poster. We think we've diagnosed the issue, and it's not locks, not system calls, not memory bus congestion; we think it's level 2/3 CPU cache contention.
To reiterate, our task is embarrassingly parallel so it should scale well. However, one thread has a large amount of CPU cache it can access, but as we add more threads, the amount of CPU cache each process can access gets lower and lower (the same amount of cache divided by more processes). Some levels on some architectures are shared between cores on a die, some are even shared between dies (I think), and it may help to get "down in the weeds" with the specific machine you're using, and optimise your algorithms, but our conclusion is that there's not a lot we can do to achieve the scalability we thought we'd get.
We identified this as the cause by using 2 different algorithms. The one which accesses more level 2/3 cache scales much worse than the one which does more processing with less data. They both make frequent accesses to the main data in main memory.
If you haven't tried that yet: look at hardware-level profilers like Oracle Studio (for CentOS, Red Hat, and Oracle Linux) or, if you are stuck with Windows, Intel VTune. Then start looking at operations with suspiciously high clocks-per-instruction metrics. Suspiciously high means a lot higher than the same code on a single-NUMA, single-L3-cache machine (like current Intel desktop CPUs).