Memory consumption of Spring State Machine - spring-statemachine

I have used Spring State Machine to develop a game. At peak hours it creates about 5,000 state machines, each with about 25 states. A performance test showed that a large number of objects are created, which consumes memory and degrades GC performance.
In my opinion, some of these objects, such as ObjectState, are immutable and could be reused across all the state machines, but they are created for each one.
Is this a performance issue in Spring State Machine, or am I misusing it?
rank: #instances #bytes class name
10: 104192 10002432 org.springframework.statemachine.state.ObjectState
15: 208398 6668736 org.springframework.statemachine.support.OrderedCompositeItem
17: 397839 6365424 org.springframework.core.annotation.AnnotationAwareOrderComparator
18: 189440 6062080 org.springframework.statemachine.listener.OrderedComposite
26: 71040 4546560 org.springframework.statemachine.trigger.TimerTrigger
27: 89984 4319232 org.springframework.statemachine.transition.DefaultExternalTransition
30: 203648 3258368 org.springframework.statemachine.action.CompositeActionListener
42: 71040 1704960 org.springframework.statemachine.support.DefaultStateMachineExecutor$2
45: 104192 1667072 org.springframework.statemachine.support.AbstractStateMachine$4
46: 104192 1667072 org.springframework.statemachine.state.CompositeStateListener

You're right, those are created for every machine. It's an interesting idea: if those objects could be created once and then somehow shared between instances, it would definitely reduce garbage. Would you mind creating a GitHub issue for this?

Related

Why is stmt.Close slow?

We implemented a ClickHouse driver based on database/sql/driver with the ConnPrepareContext interface, which uses a stmt to do the real query work.
For some queries, our driver spends most of the time in stmt.Close(). I can find the query execution time in ClickHouse's system.query_log.
For 10 out of 4,000 queries in our production environment, I found that they executed very fast in ClickHouse but took far too long in the driver.
Take one query as an example: it ran for 100 ms in ClickHouse and returned 0 rows, but the driver spent 10 s in the Close() method.
This issue is not easy to reproduce, so I want to ask for help. If this were a database/sql package issue, it should show up not only in my driver but in other drivers as well. Has anyone hit the same issue?
Can you give some suggestions on how to debug this?
I found some hints.
In one sql.DB.QueryContext() call, our driver will:
step 1: call sql.DB.PrepareContext() to prepare a sql.Stmt,
step 2: call sql.Stmt.QueryContext(), and
step 3: run sql.Stmt.finalClose() as a result of Rows.Close().
In both step 1 and step 2, the sql.Stmt acquires a sql.driverConn and appends it to sql.Stmt.css.
In step 3, the sql.Stmt is removed from each sql.driverConn in sql.Stmt.css, which acquires each sql.driverConn's lock. This causes the contention.
In our case, we don't care so much about the security benefits of prepared statements, so we will implement the QueryerContext interface from Go's sql/driver package to avoid the unnecessary PrepareContext call.
See http://go-database-sql.org/prepared.html

Dask data loading on local cluster: "Worker exceeded 95% memory budget". Restarting and then "KilledWorker"

I know similar questions have been asked before but their solutions were not very helpful. I guess the best solution is probably more specific to each cluster configuration, so I'm putting more details here about my cluster and my error.
import dask.dataframe as dd
import dask.bag as db
import json
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
Here are my cluster settings
cluster.scheduler
#output:
Scheduler: tcp://127.0.0.1:35367 workers: 8 cores: 48 tasks: 0
cluster.workers
#output:
{0: <Nanny: tcp://127.0.0.1:43789, threads: 6>,
1: <Nanny: tcp://127.0.0.1:41375, threads: 6>,
2: <Nanny: tcp://127.0.0.1:42577, threads: 6>,
3: <Nanny: tcp://127.0.0.1:40171, threads: 6>,
4: <Nanny: tcp://127.0.0.1:32867, threads: 6>,
5: <Nanny: tcp://127.0.0.1:46529, threads: 6>,
6: <Nanny: tcp://127.0.0.1:41535, threads: 6>,
7: <Nanny: tcp://127.0.0.1:39645, threads: 6>}
client
#output
Client
Scheduler: tcp://127.0.0.1:35367
Dashboard: http://127.0.0.1:8787/status
Cluster
Workers: 8
Cores: 48
Memory: 251.64 GiB
Here is my data loading code:
b = db.read_text('2019-12-16-latest-level.json').map(json.loads)
def flatten(record):
    return {
        'uuid': record['uuid'],
        'stored_at': record['stored_at'],
        'duration': record['duration']
    }
All the code above runs fine. Here is the step that ran into trouble:
df = b.map(flatten).to_dataframe()
df.head()
The code ran for around 1 day and then gave the warnings below:
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
Then, after around another day, the program stopped and gave me the error below:
---------------------------------------------------------------------------
KilledWorker Traceback (most recent call last)
<ipython-input-10-84f98622da69> in <module>
1 df = b.map(flatten).to_dataframe()
----> 2 df.head()
And here are the last few lines of the error report:
KilledWorker: ("('bag-from-delayed-file_to_blocks-list-loads-flatten-0daa9cba16c635566df6215c209f653c', 0)", <WorkerState 'tcp://127.0.0.1:41535', name: 6, memory: 0, processing: 1>)
A screenshot of the full error report was also attached.
Any suggestions on how to deal with this issue? Thanks.
I have been using Dask, with mixed results, for about a month now. My personal belief is that the software has some kind of deadly embrace in its memory management when executing task graphs. A typical modus operandi of Dask is to compute 95% of a large computation in just a few minutes, then spend the next 8 hours crunching away at the last 5%, appearing to do nothing, before it crashes or I run out of compute budget. This is very frustrating.
That said, I have had some limited success using fewer workers or confining workers to processes rather than threads. So, on a 16 core machine I might do:
client = Client(processes=True, threads_per_worker=1)
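Spelled out with an explicit LocalCluster, that is roughly the following (the worker count and per-worker memory limit here are illustrative values, not something taken from this thread):

from dask.distributed import Client, LocalCluster

# Fewer process-based workers, one thread each; memory_limit caps each
# worker explicitly instead of relying on the default split of host RAM.
cluster = LocalCluster(n_workers=4, threads_per_worker=1,
                       processes=True, memory_limit='24GiB')
client = Client(cluster)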
Another important thing to do is persist intelligently. Persisting causes there to be fewer tasks in the graph (and, as a result, in memory) at a given time. So if I want to read a bag from a JSON file, I would persist the bag before converting it to a dataframe; otherwise the reading and the conversion both happen at the compute() step, and I find that is a recipe for failure.
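As a rough sketch of that ordering, reusing the file name and flatten function from the question:

import json
import dask.bag as db
from dask.distributed import Client

client = Client(processes=True, threads_per_worker=1)

def flatten(record):
    return {
        'uuid': record['uuid'],
        'stored_at': record['stored_at'],
        'duration': record['duration']
    }

b = db.read_text('2019-12-16-latest-level.json').map(json.loads)

# Persist after flattening: the read/parse/flatten work runs now and its
# results stay on the workers, so to_dataframe()/head() start from
# already-materialised partitions instead of re-running the whole graph.
flat = b.map(flatten).persist()

df = flat.to_dataframe()
df.head()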
However, as I said, I have found dask quite disappointing, considering all it appears on the surface to be capable of. I'm switching to vaex instead.
Sorry I was unable to help more.

Hazelcast ensure Near Cache preload on Member

I have a use case where I need to ensure that the data is loaded into the Near Cache before working with it. I am using Hazelcast as the cache manager provider. I store data in the Near Cache in OBJECT in-memory format, and I cache local entries. On application startup I warm the cache by calling a cached service method:
@Override
public void onApplicationEvent(final ApplicationReadyEvent applicationReadyEvent) {
    applicationReadyEvent.getApplicationContext().getBean(MyService.class).myCachedMethod(); // 1: warm cache
    applicationReadyEvent.getApplicationContext().getBean(MyService.class).myCachedMethod(); // 2: warm near cache here
    MyData myData = applicationReadyEvent.getApplicationContext().getBean(MyService.class).myCachedMethod(); // 3: warm near cache here
    // 4: work with data from near cache here
    myData.doStuff();
}
After that I call the same method several times to load the data into the Near Cache. The issue is that sometimes the Near Cache is not populated by the time I try to work with the method's data. It looks like the Near Cache is loaded asynchronously, and at step 4 I keep receiving data that is not from the Near Cache. So my question is: is there any way to preload the Near Cache, or to ensure that at a particular point the Near Cache is populated, even if that means waiting some time for it to be populated before using it? Adding Thread.sleep() does the trick for me, but that is by no means the way to go. The Hazelcast version is 3.11.4.
Any help appreciated.
You can preload the Near Cache using the <preloader> configuration element. See the following example config similar to one from the Reference Manual:
<near-cache name="myDataStructure">
<in-memory-format>OBJECT</in-memory-format>
<preloader enabled="true"
directory="nearcache-example"
store-initial-delay-seconds="600"
store-interval-seconds="600"/>
</near-cache>

Run is slow between components in Test Lab

I'm using UFT/BPT for API and GUI testing and everything works fine. I have business components inside flows, which are used in a Business Process. I run the Business Process from the Test Lab in ALM, and there I have a problem with long run times.
EX: Business-Process Test
Component 1:
Start: 18:17:48
End: 18:17:48
Component 2:
Start: 18:18:00
End: 18:18:01
Component 3:
Start: 18:18:12
End: 18:18:13
Component 4:
Start: 18:18:24
End: 18:18:24
Conclusion:
Between the end of Component 1 and the start of Component 2 there are 12 seconds.
Between Component 2 and Component 3: 11 seconds.
Between Component 3 and Component 4: 11 seconds.
Why is there so much idle time between components?
I am experiencing the same thing starting with UFT 15+. I got the official reply that since the new version has sooo many new powerful features, it is to be considered normal for some things to take more time with the new version.
Which of course is not a good reply, but that's what I have been officially told by my support source.
So if you have a short test calling 10 components which each need 2 seconds to do their job, you used to execute such a test in 10*2 = 20 seconds, and now (assuming roughly 20 seconds of overhead per component call) it will take 10*2 + 10*20 = 220 seconds. For that additional zero (from 20 to about 200 seconds), you get absolutely nothing in return. Great deal? Oooookay...
I think this should be fixed, naturally. There is no good reason why calling a component should have an overhead of 9-15 seconds (SECONDS!).
But what can you do if you are using standard software packages that get messed up by the devs over the years? You can waste your time with support, but you will achieve nothing. So what can you do?
Nothing, except for migrating elsewhere :(
And I think it's a shame. I wish SO user Motti (an ex-UFT dev) would see this and clarify, but, well, he cannot be everywhere :)

How to detect maxlag exception in pywikibot

I am developing a Wikipedia bot to analyze editing contributions. Unfortunately, it takes hours to complete a single run, and during that time Wikipedia's database replication delay is sure to exceed 5 seconds (the default maxlag value) at some point. The recommendation for the API's maxlag parameter is to detect the lag error, pause for X seconds and retry.
But all I am doing is reading contributions with:
usrpg = pywikibot.Page(site, 'User:' + username)
usr = pywikibot.User(usrpg)
for contrib in usr.contributions(total=max_per_user_contribs):
    # (analyzes the contribution here)
How can I detect the error and resume the iteration? This is the error:
WARNING: API error maxlag: Waiting for 10.64.32.21: 7.1454429626465 seconds lagged
Traceback (most recent call last):
File ".../bot/core/pwb.py", line 256, in <module>
if not main():
File ".../bot/core/pwb.py", line 250, in main
run_python_file(filename, [filename] + args, argvu, file_package)
File ".../bot/core/pwb.py", line 121, in run_python_file
main_mod.__dict__)
File "analyze_activity.py", line 230, in <module>
attrs = usr.getprops()
File ".../bot/core/pywikibot/page.py", line 2913, in getprops
self._userprops = list(self.site.users([self.username, ]))[0]
File ".../bot/core/pywikibot/data/api.py", line 2739, in __iter__
self.data = self.request.submit()
File ".../bot/core/pywikibot/data/api.py", line 2183, in submit
raise APIError(**result['error'])
pywikibot.data.api.APIError: maxlag: Waiting for 10.64.32.21:
7.1454 seconds lagged [help:See https://en.wikipedia.org/w/api.php for API usage]
<class 'pywikibot.data.api.APIError'>
CRITICAL: Closing network session.
It occurs to me to catch the exception thrown in that line of code:
raise APIError(**result['error'])
But then restarting the contributions for the user seems terribly inefficient. Some users have 400,000 edits, so rerunning that from the beginning is a lot of backsliding.
I have googled for examples of doing this (detecting the error and retrying) but I found nothing useful.
Converting the previous conversation in comments into an answer.
One possible method to resolve this is to try/catch the error and redo the piece of code which caused the error.
But pywikibot already does this internally for us! By default, pywikibot retries every failed API call 2 times if you're using the default user-config.py it generates. I found that increasing the following config values does the trick in my case:
maxlag = 20
retry_wait = 20
max_retries = 8
maxlag is the parameter that the documentation of the maxlag parameter recommends increasing, especially if you're doing a large number of writes in a short span of time. But the retry_wait and max_retries settings are useful in case someone else is writing a lot (as in my case: my scripts just read from the wiki).
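For completeness, here is an untested sketch of the manual catch-and-resume approach discussed in the question, layered on top of the raised config values. It assumes contributions() forwards start= to Site.usercontribs() (which lists edits newest-first) and that APIError exposes the error code as .code; the username and max_per_user_contribs values are placeholders standing in for the question's variables:

import time
import pywikibot
from pywikibot.data.api import APIError

# Equivalent to raising the values in user-config.py.
pywikibot.config.maxlag = 20
pywikibot.config.retry_wait = 20
pywikibot.config.max_retries = 8

username = 'Example'           # placeholder; the user being analyzed
max_per_user_contribs = 500    # placeholder; as in the question

site = pywikibot.Site('en', 'wikipedia')
usr = pywikibot.User(site, username)

last_seen = None  # timestamp of the last contribution we processed
while True:
    try:
        # start= resumes from where the previous attempt stopped instead of
        # re-reading all 400,000 edits from the top.
        for page, revid, timestamp, comment in usr.contributions(
                total=max_per_user_contribs, start=last_seen):
            last_seen = timestamp
            # (analyze the contribution here)
        break  # finished without an unrecovered maxlag error
    except APIError as err:
        if err.code != 'maxlag':
            raise
        time.sleep(pywikibot.config.retry_wait)  # pause, then resume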
