How to handle mass database manipulation every second - threading? - performance

I have a very hard problem:
I have roughly 20-50 objects which I MUST (that is a given for the problem, please don't spend time thinking around it) put through some logic EVERY SECOND.
The logic itself needs roughly 200-600 milliseconds (200 ms in 90% of cases, 600 ms in 10%).
I have tried to find a way to make it faster, but there isn't one. I must get an object from the DB, I must run a lot of if-else, and I must update it. Even if I reduce it to 50 ms or less, the variable number of objects (up to 50) will break my neck with the 1-second timer, because 50 x 50 ms = 2.5 seconds. So a tick takes longer than the tick interval.
So my only idea, not a very smart one I think, is to open a separate thread for every object and have a main thread for coordination. So the main thread opens x other threads, and only this opening must take under 1 second. After its logic has run, the thread can kill itself and we are all happy, aren't we?
In light of the latest answers, I will explain my problem:
I am trying to build an auction site. So I have up to 50 auctions running at the same moment, nothing special. So every single second I need to look at the auction list, see whether the remaining time is 00:00:01, and if it is, bid automatically (it's a feature that users can set up).
So: get 50 objects in a list, iterate through them, check whether an automatic bid is needed, and place it.

With 50 objects and the processing time you've given, on average you are doing 12 seconds' worth of processing every second (50 x 240 ms). Assuming you have 4 cores, threading can get this down to an execution time of about 3 seconds at best (more once threading overhead is counted). Every second. This means that you're going to start off behind and slip further behind as time goes on.
I know you said you tried to think of a way to make it more efficient, but couldn't, but I fear you're going to have to. The problem as stated now is computationally intractable. You're either going to have to process the objects in a rotating window (so each object gets hit once every 4th cycle or so), or you need to make your processing run faster.
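The arithmetic above can be checked with a quick sketch. The 90%/10% split comes from the question and the 4-core figure from this answer; the function name is made up for illustration, and the per-core figure is an ideal lower bound that ignores threading overhead.

```python
def tick_budget_ms(objects, fast_ms=200, slow_ms=600, slow_pct=10, cores=4):
    """Work per tick in ms: average per object, total, and ideal share per core."""
    # Integer math keeps the weighted average exact
    avg_ms = ((100 - slow_pct) * fast_ms + slow_pct * slow_ms) // 100
    total_ms = objects * avg_ms
    return avg_ms, total_ms, total_ms / cores


avg_ms, total_ms, per_core_ms = tick_budget_ms(50)
print(avg_ms, total_ms, per_core_ms)  # 240 12000 3000.0
```

Anything above 1000 ms in the last column means a tick cannot finish inside its own second.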

First: Profile, if you haven't already. Figure out which sections of your code are taking the time. I'd go after that database: how long is the I/O of the objects from the database taking? Can you cache that I/O? (If you're manipulating the same 50 objects, don't load them every second.)
Let's address your threads idea: If you want multiple threads, don't create and destroy them every second. Create your X threads and leave them be -- creating and destroying them are expensive operations. You might find that fewer threads work better, such as 1 or 2 per core, because you spend less time on context switches.
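A minimal sketch of the "create once, reuse every tick" idea, using Python's standard ThreadPoolExecutor. The process_auction function is a hypothetical stand-in for the load / if-else / update work, and the worker count is a placeholder to tune by measurement.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process_auction(auction_id):
    """Hypothetical stand-in for the real per-object logic."""
    time.sleep(0.01)  # pretend to wait on the database
    return auction_id


def run_ticks(auction_ids, ticks, workers=8, tick_seconds=1.0):
    """Process all auctions once per tick, reusing the same worker threads."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:  # threads created once
        for _ in range(ticks):
            started = time.monotonic()
            results.append(list(pool.map(process_auction, auction_ids)))
            elapsed = time.monotonic() - started
            # Sleep only for whatever is left of this tick
            time.sleep(max(0.0, tick_seconds - elapsed))
    return results
```

If elapsed regularly exceeds tick_seconds, the pool is not keeping up and the work per object has to shrink.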

To expand on Jonathan Leffler's comment on the question, as the OP requested: (This answer is a wiki)
Say you have these three things being auctioned, ending at the times indicated:
10 Apples - ends at 1:05:00 PM
20 Blueberries - ends at 2:00:00 PM
15 Pears - ends at 3:50:00 PM
If the current time is 1:00:00 PM, then sleep for 4 minutes, 58 seconds (since the closest item ends in 5 minutes). We then use the remaining 2 seconds for processing; adjust that threshold as needed. Once we're done with the apples, we'll sleep for (2 PM - now() - 2s), for the blueberries.
Note that when we wake up at 1:04:58 PM to process the apples auction, we do not touch the blueberries or the pears -- we know that they're still way out in the future, so we don't care.
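The sleep calculation from this example could be sketched as follows. The heap layout, the function name and the calendar date are assumptions made for illustration; only the end times and the 2-second threshold come from the example.

```python
import heapq
from datetime import datetime, timedelta

LEAD = timedelta(seconds=2)  # processing threshold from the example; adjust as needed


def time_to_sleep(heap, now, lead=LEAD):
    """How long to sleep until the soonest auction is `lead` away from ending."""
    end_time, _item = heap[0]  # heap root is always the earliest end time
    return max(timedelta(0), end_time - lead - now)


day = datetime(2024, 1, 1)  # arbitrary date, only the times matter
auctions = [
    (day.replace(hour=13, minute=5), "10 Apples"),
    (day.replace(hour=14, minute=0), "20 Blueberries"),
    (day.replace(hour=15, minute=50), "15 Pears"),
]
heapq.heapify(auctions)

print(time_to_sleep(auctions, day.replace(hour=13)))  # 0:04:58
heapq.heappop(auctions)  # the apples auction is done
print(time_to_sleep(auctions, day.replace(hour=13, minute=5)))  # 0:54:58
```

Only the heap root is inspected on each wake-up, so the later auctions cost nothing until their turn comes.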


How to get interval of time without using timestamp?

I want a voting system in which the voting phase ends after a week. I don't want to use block_timestamp. Which near_sdk env function should I use?
pub fn block_index() -> BlockHeight
There are epoch_height and block_index, what should I use? I have heard that block_index may not be continuous and can have missing numbers. Is that true?
https://docs.rs/near-sdk/3.1.0/near_sdk/env/
Generally, block_timestamp is used in most applications.
In a few cases block_index is used, but there are indeed no guarantees that blocks will be produced every second or that the block time won't change in some way in the future.
block_index can indeed not have all consecutive numbers when some blocks were not produced, but that also means more time elapsed as there is target time period per each block (which is 1 second right now). E.g. if there is block 1111 and next block is 1113 - that means roughly ~2 seconds have passed between these two blocks.
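The estimate described above is plain arithmetic, sketched here in Python rather than as contract code. The 1-second target is taken from this answer and, as noted, is not guaranteed; the function names are made up for illustration.

```python
TARGET_BLOCK_SECONDS = 1  # target block time quoted in the answer; may change


def estimated_elapsed_seconds(start_height, current_height,
                              block_seconds=TARGET_BLOCK_SECONDS):
    """Rough elapsed time between two block heights, skipped heights included."""
    return (current_height - start_height) * block_seconds


def blocks_for(duration_seconds, block_seconds=TARGET_BLOCK_SECONDS):
    """Approximate number of block heights that span the given duration."""
    return duration_seconds // block_seconds


print(estimated_elapsed_seconds(1111, 1113))  # 2
print(blocks_for(7 * 24 * 60 * 60))  # 604800
```

A voting deadline expressed as a height difference therefore drifts with the real block time, which is the trade-off against block_timestamp.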

WinDbg runaway command output explained

I have a production CPU issue: after days of regular activity, the CPU suddenly starts to peak. I've saved a dump file and run the !runaway command to get the list of threads that consumed the most CPU time. The output is below:
User Mode Time
 Thread     Time
 21:110     0 days 10:51:39.781
 19:f84     0 days 10:41:59.671
  5:cc4     0 days 0:53:25.343
 48:74      0 days 0:34:20.140
 47:1670    0 days 0:34:09.812
 13:460     0 days 0:32:57.640
  8:14d4    0 days 0:19:30.546
  7:d90     0 days 0:03:15.000
 23:1520    0 days 0:02:21.984
 22:ca0     0 days 0:02:08.375
 24:72c     0 days 0:02:01.640
 29:10ac    0 days 0:01:58.671
 27:1088    0 days 0:01:44.390
As you can see, the output shows 2 threads, 21 and 19, that together consumed more than 20 hours of CPU time. I was able to get the callstack of one of those threads like so:
~21s
!CLRStack
The output doesn't matter at the moment; let's call it the "X callstack".
What I would like is an explanation of the !runaway command output. From what I understand, a dump file is a snapshot of the current state of the application. So my questions are:
How can the runaway command shows 10:51 hours value for thread 21, when the dumping process only took a few seconds?
Does it mean that the specific "instance" of the X callstack I've found with the !CLRStack command is hang more than 10 hours? or it's the total time the 21 thread executed his whole X callstacks executions? If so, it seems strange that thread 21 is responsible for so many executions of the X callstack. As far as I know, the origin is a web request (the runtime should assign a random thread for each call).
I have a guess that may answer both questions:
Maybe WinDbg calculates the time by taking the thread callstack's actual time and dividing it by the scope of the dumping process, so if, for example, the specific execution of the X callstack took 1 second and the whole dumping process took 3 seconds (33%), while the process had been running for a total of 24 hours, the output will show:
8 hours (33% of 24 hours)
Am I right, or have I got it completely wrong?
This answer is intended to be comprehensible for the OP. It's not intended to be correct down to every bit and byte.
[...] and dividing it by the scope of the dumping process [...]
This understanding is probably the root of all evil: dumping a process only gives you the state of the process at a certain point in time. From the process's point of view, the duration of the dump is 0.0 seconds, since all threads are suspended during the operation. (So, in time relative to your process, nothing has changed and time is standing still; wall clock time does advance, of course.)
You are thinking of dumping a process as monitoring it over a longer period of time, which is not the case. Dumping a process takes time only because it involves disk activity etc.
So no, there is no "scope", and thus you cannot measure performance issues with crash dumps (or only with great difficulty).
How can the runaway command shows 10:51 hours value for thread 21, [...]
How can your C# program know how long the program is running if you only have a timer event that fires every second? The answer is: it uses a variable and increases the value.
That's roughly how Windows does it. Windows is responsible for thread scheduling and each time it re-schedules threads, it updates a variable that contains the thread time.
When the crash dump is written, this information, which the OS collected long before, is simply included in the dump.
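The counter idea can be shown with a toy model. This is not how Windows is implemented internally; the class, the 15 ms slice length and the slice counts are all made up, and the thread IDs are borrowed from the question only as labels.

```python
class ToyScheduler:
    """Toy model, not Windows internals: each time slice is added to a per-thread counter."""

    def __init__(self):
        self.cpu_ms = {}

    def run_slice(self, thread_id, slice_ms):
        # Bookkeeping happens at scheduling time, long before any dump is taken
        self.cpu_ms[thread_id] = self.cpu_ms.get(thread_id, 0) + slice_ms

    def dump(self):
        # A dump copies the counters as they are; it measures nothing itself
        return dict(self.cpu_ms)


scheduler = ToyScheduler()
for _ in range(1000):
    scheduler.run_slice("21:110", 15)  # hypothetical 15 ms slices
scheduler.run_slice("7:d90", 15)
print(scheduler.dump())  # {'21:110': 15000, '7:d90': 15}
```

The dump is instantaneous, yet it reports totals accumulated over the whole lifetime of each thread.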
[...] when the dumping process only took a few seconds?
Since the crash dump is taken by a thread of WinDbg, the time for that is charged to that WinDbg thread. You would need to debug WinDbg and run !runaway on a WinDbg thread to see how much CPU time that took. It is potentially a nice exercise, and the .dbgdbg (debug the debugger) command may be new to you; other than that, this particular case is not really helpful.
Does it mean that the specific "instance" of the X callstack I've found with the !CLRStack command is hang more than 10 hours?
No. It means that at the point in time when you created the crash dump, that specific method was executing. No more, no less.
This information is unrelated to !runaway, because the thread may have been doing something totally different for a long time, but that ended just a moment ago.
or it's the total time the 21 thread executed his whole X callstacks executions?
No. A crash dump does not contain such detailed performance data. You need a performance profiler like JetBrains dotTrace to get that information. A profiler samples the call stacks very often, then aggregates identical call stacks and derives the CPU time per call stack.
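The aggregation step a sampling profiler performs can be sketched in a few lines. This is a generic illustration, not dotTrace's actual implementation; the frame names and the 10 ms sampling interval are invented.

```python
from collections import Counter


def cpu_time_per_stack(samples, interval_ms):
    """Aggregate identical sampled call stacks into estimated CPU time (ms)."""
    counts = Counter(samples)  # identical stacks collapse into one key
    return {stack: hits * interval_ms for stack, hits in counts.items()}


# Hypothetical samples taken every 10 ms; each stack is a tuple of frame names
samples = [
    ("Main", "HandleRequest", "ParseJson"),
    ("Main", "HandleRequest", "ParseJson"),
    ("Main", "HandleRequest", "QueryDb"),
    ("Main", "HandleRequest", "ParseJson"),
]
for stack, ms in cpu_time_per_stack(samples, interval_ms=10).items():
    print(" > ".join(stack), ms, "ms")
```

A single dump corresponds to exactly one such sample per thread, which is why it cannot attribute the 10 hours to any particular call stack.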

Python 3 multiprocessing: optimal chunk size

How do I find the optimal chunk size for multiprocessing.Pool instances?
I used this before to create a generator of n sudoku objects:
import multiprocessing
# create_sudoku and n are defined elsewhere in my program
processes = multiprocessing.cpu_count()
worker_pool = multiprocessing.Pool(processes)
sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)
To measure the time, I use time.time() before the snippet above, then I initialize the pool as described, then I convert the generator into a list (list(sudokus)) to trigger generating the items (only for time measurement, I know this is nonsense in the final program), then I take the time using time.time() again and output the difference.
I observed that the chunk size of n // processes + 1 results in times of around 0.425 ms per object. But I also observed that the CPU is only fully loaded during the first half of the process; towards the end the usage goes down to 25% (on an i3 with 2 cores and hyper-threading).
If I use a smaller chunk size of int(n // (processes**2) + 1) instead, I get times of around 0.355 ms and the CPU load is much better distributed. It just has some small dips down to ca. 75%, but stays high for a much longer part of the process time before it goes down to 25%.
Is there an even better formula to calculate the chunk size, or an otherwise better method to use the CPU most effectively? Please help me to improve this multiprocessing pool's effectiveness.
This answer provides a high level overview.
Going into details, each worker is sent a chunk of chunksize tasks at a time for processing. Every time a worker completes that chunk, it needs to ask for more input via some type of inter-process communication (IPC), such as multiprocessing.Queue (queue.Queue only works between threads). Each IPC request requires a system call; due to the context switch it costs anywhere in the range of 1-10 μs, let's say 10 μs. Due to shared caching, a context switch may hurt (to a limited extent) all cores. So, extremely pessimistically, let's estimate the maximum possible cost of an IPC request at 100 μs.
You want the IPC overhead to be immaterial, let's say <1%. You can ensure that by making chunk processing time >10 ms if my numbers are right. So if each task takes say 1 μs to process, you'd want chunksize of at least 10000.
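The rule of thumb above can be written down as a small helper. The 100 μs IPC cost and the 1% overhead target are this answer's pessimistic guesses, not measured values, and the function name is made up.

```python
import math


def min_chunksize(task_us, ipc_cost_us=100.0, overhead_pct=1.0):
    """Smallest chunksize that keeps IPC overhead at or below overhead_pct."""
    # Chunk processing time needed so that one IPC request is <= overhead_pct of it
    target_chunk_us = ipc_cost_us * 100.0 / overhead_pct
    return max(1, math.ceil(target_chunk_us / task_us))


print(min_chunksize(1.0))    # 10000  (1 μs tasks, as in the text)
print(min_chunksize(400.0))  # 25     (roughly 0.4 ms per task, as in the question)
```

For tasks that already take longer than the target chunk time, the helper returns 1.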
The main reason not to make chunksize arbitrarily large is that at the very end of the execution, one of the workers might still be running while everyone else has finished, which needlessly increases the time to completion. I suppose in most cases a delay of 10 ms is not a big deal, so my recommendation of targeting a 10 ms chunk processing time seems safe.
Another reason a large chunksize might cause problems is that preparing the input may take time, wasting the workers' capacity in the meantime. Presumably input preparation is faster than processing (otherwise it should be parallelized as well, using something like RxPY). So again, targeting a processing time of ~10 ms seems safe (assuming you don't mind a startup delay of under 10 ms).
Note: the context switches happen every ~1-20 ms or so for non-real-time processes on modern Linux/Windows - unless of course the process makes a system call earlier. So the overhead of context switches is no more than ~1% without system calls. Whatever overhead you're creating due to IPC is in addition to that.
Nothing will replace the actual time measurements. I wouldn't bother with a formula and try a constant such as 1, 10, 100, 1000, 10000 instead and see what works best in your case.
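Trying the constants one by one could look like the sketch below. It assumes the caller passes in an already created pool; the function name is made up, and each timing includes whatever else the machine is doing, so repeat the run a few times.

```python
import time


def time_chunksizes(pool, func, items, chunksizes=(1, 10, 100, 1000, 10000)):
    """Return {chunksize: seconds} for draining pool.imap_unordered over items."""
    timings = {}
    for chunksize in chunksizes:
        started = time.perf_counter()
        # list(...) forces every task to finish before the clock stops
        list(pool.imap_unordered(func, items, chunksize))
        timings[chunksize] = time.perf_counter() - started
    return timings
```

With multiprocessing.Pool, create the pool under an if __name__ == "__main__": guard and pass a module-level function such as create_sudoku; min(timings, key=timings.get) then gives the fastest chunksize of that run.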

Testing Erlang function performance with timer

I'm testing the performance of a function in a tight loop (say 5000 iterations) using timer:tc/3:
{Duration_us, _Result} = timer:tc(M, F, [A])
This returns both the duration (in microseconds) and the result of the function. For argument's sake the duration is N microseconds.
I then perform a simple average calculation on the results of the iterations.
If I place a timer:sleep(1) function call before the timer:tc/3 call, the average duration for all the iterations is always > the average without the sleep:
timer:sleep(1),
timer:tc(M, F, [A]).
This doesn't make much sense to me as the timer:tc/3 function should be atomic and not care about anything that happened before it.
Can anyone explain this strange functionality? Is it somehow related to scheduling and reductions?
Do you mean like this:
4> foo:foo(10000).
Where:
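-module(foo).
-export([foo/1, baz/1]).
%% Run N timed iterations; return the timings and their mean (microseconds).
foo(N) ->
    TL = bar(N),
    {TL, sum(TL) / N}.
%% Sleep 1 ms, then time baz(1000); repeat N times.
bar(0) -> [];
bar(N) ->
    timer:sleep(1),
    {D, _} = timer:tc(?MODULE, baz, [1000]),
    [D | bar(N - 1)].
%% Countdown loop used as the workload.
baz(0) -> ok;
baz(N) -> baz(N - 1).
sum([]) -> 0;
sum([H | T]) -> H + sum(T).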
I tried this, and it's interesting. With the sleep statement the mean time returned by timer:tc/3 is 19 to 22 microseconds, and with the sleep commented out, the average drops to 4 to 6 microseconds. Quite dramatic!
I notice there are artefacts in the timings, so events like this (these numbers being the individual microsecond timings returned by timer:tc/3) are not uncommon:
---- snip ----
5,5,5,6,5,5,5,6,5,5,5,6,5,5,5,5,4,5,5,5,5,5,4,5,5,5,5,6,5,5,
5,6,5,5,5,5,5,6,5,5,5,5,5,6,5,5,5,6,5,5,5,5,5,5,5,5,5,5,4,5,
5,5,5,6,5,5,5,6,5,5,7,8,7,8,5,6,5,5,5,6,5,5,5,5,4,5,5,5,5,
14,4,5,5,4,5,5,4,5,4,5,5,5,4,5,5,4,5,5,4,5,4,5,5,5,4,5,5,4,
5,5,4,5,4,5,5,4,4,5,5,4,5,5,4,4,4,4,4,5,4,5,5,4,5,5,5,4,5,5,
4,5,5,4,5,4,5,5,5,4,5,5,4,5,5,4,5,4,5,4,5,4,5,5,4,4,4,4,5,4,
5,5,54,22,26,21,22,22,24,24,32,31,36,31,33,27,25,21,22,21,
24,21,22,22,24,21,22,21,24,21,22,22,24,21,22,21,24,21,22,21,
23,27,22,21,24,21,22,21,24,22,22,21,23,22,22,21,24,22,22,21,
24,21,22,22,24,22,22,21,24,22,22,22,24,22,22,22,24,22,22,22,
24,22,22,22,24,22,22,21,24,22,22,21,24,21,22,22,24,22,22,21,
24,21,23,21,24,22,23,21,24,21,22,22,24,21,22,22,24,21,22,22,
24,22,23,21,24,21,23,21,23,21,21,21,23,21,25,22,24,21,22,21,
24,21,22,21,24,22,21,24,22,22,21,24,22,23,21,23,21,22,21,23,
21,22,21,23,21,23,21,24,22,22,22,24,22,22,41,36,30,33,30,35,
21,23,21,25,21,23,21,24,22,22,21,23,21,22,21,24,22,22,22,24,
22,22,21,24,22,22,22,24,22,22,21,24,22,22,21,24,22,22,21,24,
22,22,21,24,21,22,22,27,22,23,21,23,21,21,21,23,21,21,21,24,
21,22,21,24,21,22,22,24,22,22,22,24,21,22,22,24,21,22,21,24,
21,23,21,23,21,22,21,23,21,23,22,24,22,22,21,24,21,22,22,24,
21,23,21,24,21,22,22,24,21,22,22,24,21,22,21,24,21,22,22,24,
22,22,22,24,22,22,21,24,22,21,21,24,21,22,22,24,21,22,22,24,
24,23,21,24,21,22,24,21,22,21,23,21,22,21,24,21,22,21,32,31,
32,21,25,21,22,22,24,46,5,5,5,5,5,4,5,5,5,5,6,5,5,5,5,5,5,4,
6,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,
5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,4,6,4,6,5,5,5,5,5,5,4,6,5,5,5,
5,4,5,5,5,5,5,5,6,5,5,5,5,4,5,5,5,5,5,5,6,5,5,5,5,5,5,5,6,5,
5,5,5,4,5,5,6,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,6,5,5,5,5,5,5,5,
6,5,5,5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,5,4,5,4,5,5,5,5,5,6,5,5,
5,5,4,5,4,5,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,5,5,5,6,5,5,5,5,
---- snip ----
I assume this is the effect you are referring to, though when you say always > N, is it always, or just mostly? Not always for me anyway.
The extract above was recorded without the sleep. Typically, without the sleep, timer:tc/3 returns low times like 4 or 5 most of the time, but sometimes big times like 22; with the sleep in place it's usually big times like 22, with occasional batches of low times.
It's certainly not obvious why this would happen, since sleep really just means yield. I wonder if all this is not down to the CPU cache. After all, especially on a machine that's not busy, one might expect the case without the sleep to execute most of the code all in one go without it getting moved to another core, without doing so much else with the core, thus making the most out of the caches... but when you sleep, and thus yield, and come back later, the chances of cache hits might be considerably less.
Measuring performance is a complex task, especially on new HW and a modern OS. There are many things which can fiddle with your result. First, you are not alone. When you measure on your desktop or notebook, there can be other processes, including system ones, which interfere with your measurement. Second, there is the HW itself. Modern CPUs have many cool features which control performance and power consumption. They can boost performance for a short time before overheating, and they can boost performance when there is no work on other CPUs on the same chip or on the other hyper-thread of the same CPU. On the other hand, they can enter a power-saving mode when there is not enough work, and the CPU doesn't react fast enough to a sudden change. It is hard to tell if this is your case, but it is naive to think that previous work, or the lack of it, can't affect your measurement. You should always take care to measure in a steady state for a long enough time (seconds at least) and remove as many other things as possible which could affect your measurement. (And do not forget GC in Erlang as well.)
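The advice about warm-up and steady state is not specific to Erlang. As a rough sketch in Python (the function name and the default counts are arbitrary), one can warm up first, take many samples, and look at the median next to the mean so that occasional batches of slow runs stand out:

```python
import statistics
import time


def measure(func, warmup=1000, runs=5000):
    """Time func() after a warm-up phase; return summary statistics in ns."""
    for _ in range(warmup):
        func()  # let caches and CPU frequency settle before sampling
    samples = []
    for _ in range(runs):
        started = time.perf_counter_ns()
        func()
        samples.append(time.perf_counter_ns() - started)
    return {
        "runs": len(samples),
        "median_ns": statistics.median(samples),
        "mean_ns": statistics.mean(samples),
        "max_ns": max(samples),
    }
```

A mean far above the median is a hint that the run contains batches of slow samples like the ones in the extract above.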

Solaris prstat - definition of "recent" time used in percentages

The man page for prstat (on Solaris 10 in my case) notes that the CPU % output is the "percentage of recent CPU time". I am trying to understand in more depth what "recent" means in this context: is it a defined amount of time prior to the sample, does it relate to the sampling interval, etc.? I'd appreciate any insights, particularly with references to supporting documentation. I've searched but haven't been able to find a good answer. Thanks!
Adrian
The kernel maintains the data that you see at the bottom, those three numbers, and it keeps similar data for each process.
uptime shows you what those numbers are. Those are the 'recent' times for load average - the line at the bottom of prstat. 1 minute, 5 minutes, and 15 minutes.
Recent == 1 minute's worth of sampling (the last 60 seconds). Those numbers are averages, which is why the numbers and processes usually change shortly after you first start prstat.
On the first pass you may see processes like nscd that have lots of cpu but have been up for a long time. The first display iteration is completely historical. After that the numbers reflect recent == last one minute average.
You should consider enabling sar sampling to get a much better picture.
If you want a reference, try:
http://www.amazon.com/Solaris-Internals-OpenSolaris-Architecture-Edition/dp/0131482092
