H2 database: Does a 60 second write delay have adverse effects on db health?

We're currently using H2 version 1.4.199 in embedded mode with the default nio file protocol and the MVStore storage system. The write_delay parameter is set to 60 seconds.
We run a batch insert/update/delete of about 30,000 statements within 2 seconds (in one transaction), followed by another batch of a couple of hundred statements only 30 seconds later (in a second transaction). The next attempt to open a db connection (only 2 minutes later) shows that the DB is corrupt:
File corrupted while reading record: null. Possible solution: use the recovery tool [90030-199]
Since the transactions occur within a minute, we wonder whether the write_delay of 60 seconds might be contributing to the issue.

Changing write_delay to 60s (from the default of 0.5s) will definitely increase your risk of lost transactions, and I do not see a good reason for doing it. It should not cause database corruption, though. More likely some thread interruption is doing that, since you are running a web server and who knows what else in the same JVM. Using the async file store might help in that area, and yes, it is stable enough (how much worse can it get for your app than a database corruption, anyway?).
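A minimal sketch of both suggestions, assuming an H2 1.4.x build (the database path and credentials are placeholders): revert WRITE_DELAY to its default and open the file through H2's async file system prefix instead of nio.

import java.sql.Connection;
import java.sql.DriverManager;

public class H2AsyncExample {
    public static void main(String[] args) throws Exception {
        // WRITE_DELAY is in milliseconds; 500 is the default. The async:
        // prefix selects H2's asynchronous file store instead of nio:.
        String url = "jdbc:h2:async:/path/to/data/mydb;WRITE_DELAY=500";
        try (Connection conn = DriverManager.getConnection(url, "sa", "")) {
            // the setting can also be changed at runtime:
            conn.createStatement().execute("SET WRITE_DELAY 500");
        }
    }
}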

Related

SOS-Berlin JobScheduler process queue logic

We're running into an issue with the SOS-Berlin JobScheduler running on Windows that is difficult to diagnose*, and I would appreciate any guidance.
*Difficult because I don't know Scala (though I do know C++ and Java), and it's difficult to navigate this code base (some of it is in German).
We have a process class called Foo that will sometimes burst beyond the limit of how many processes can run. So, for example, we limit the process class to 30 processes and 60 want to run. This leaves 30 running and 30 "waiting for process."
The problem is that JobScheduler doesn't seem to prioritize the 30 that are waiting for a process. Instead, any new job that gets fired after the burst receives a process, leaving some jobs waiting indefinitely. Once the number of jobs "waiting for process" hits zero, the jobs clear out immediately.
Further, it seems that when there are a large number of jobs "waiting for process," the run time for tasks doubles or triples. A job that normally takes 20 seconds to run will spike to 1-2 minutes, further amplifying the issue as processes are not released back to the pool.
Admittedly, we're running an older version of JobScheduler, which we're planning to upgrade this week or next. However, I'm wondering if there is something fundamental we're missing. We've turned down the logging, looked for DB locks, added memory to the heap, and shut down some other processes on the server. We've also increased the process pool, but we don't want to push it too far, lest we crush the server. Nothing seems to alleviate the issue.
Any tuning help would be appreciated!
As a follow-up, we determined the cause of the issue.
Another user had been using the temp directory to store intermediate generated files and was not clearing them out, resulting in hundreds of thousands of files in the directory. They were not very large, so we didn't notice. For some reason JobScheduler started to choke on this; I'm not clear on the reasons.
Clearing the temp directory, scolding the user, and fixing his script fixed the issue.
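In case it helps anyone else, here is a minimal housekeeping sketch in Java (the directory argument and the one-day cutoff are assumptions, not part of the original fix) that deletes stale regular files from a temp directory:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class TempSweeper {
    public static void main(String[] args) throws IOException {
        // Directory to sweep: first argument, or the JVM's temp dir.
        Path tempDir = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        Instant cutoff = Instant.now().minus(1, ChronoUnit.DAYS); // keep files younger than a day
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tempDir)) {
            for (Path p : files) {
                if (Files.isRegularFile(p)
                        && Files.getLastModifiedTime(p).toInstant().isBefore(cutoff)) {
                    Files.deleteIfExists(p);
                }
            }
        }
    }
}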

How to optimize berkeley-DB-JE Environment for fast recovery?

We are running a simple multi-threaded Java application which uses Berkeley DB JE databases for its storage. There are about 500 threads, each thread has its own Berkeley DB database, and each database holds about 100K key-value pairs. All databases are transactional, and each transaction has a maximum of about 1,000 operations. There are no long-running transactions.
The problem is that, occasionally, Berkeley DB recovery takes a very long time when we restart our application. During recovery (opening the environment) we see the Java process reading from disk at ~100MB/s. No writes, just reads.
Our setup is like this:
je.env.runCheckpointer=true
je.env.runCleaner=true
je.checkpointer.highPriority=true
je.cleaner.threads=256
je.cleaner.maxBatchFiles=10
je.log.checksumRead=false
je.lock.nLockTables=353
je.maxMemory=16106127360
je.log.nDataDirectories=256
We also tried running a checkpoint manually every 15 minutes (assuming that maybe the checkpointer stops or something), along the lines of the sketch below. We also set setMinimizeRecoveryTime(true). Neither helped.
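A minimal sketch of that manual checkpoint, assuming the standard JE API (the environment path is a placeholder):

import com.sleepycat.je.CheckpointConfig;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import java.io.File;

public class ForcedCheckpoint {
    public static void main(String[] args) throws Exception {
        // Open the environment (properties can also come from je.properties).
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setTransactional(true);
        envConfig.setConfigParam("je.checkpointer.highPriority", "true");
        Environment env = new Environment(new File("/path/to/env"), envConfig);

        // Force a checkpoint so recovery has less log to replay.
        CheckpointConfig ckpt = new CheckpointConfig();
        ckpt.setForce(true); // checkpoint even if little data has changed
        ckpt.setMinimizeRecoveryTime(true);
        env.checkpoint(ckpt);
        env.close();
    }
}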
We assume the problem is some Java or Berkeley DB configuration.
Is there a way to ensure a faster recovery time while sacrificing the speed of puts into the database?

Does Redis persistence block read and write requests?

I am using Redis and saving data to disk at a certain time interval. Normally Redis read and write times are on the order of 0.2 milliseconds, but I see a few peaks on the order of 30 milliseconds. I have read that Redis forks a background process to write data to disk; does that fork happen on the same thread that serves read and write requests (Redis uses a single thread to serve all requests)?
If so, I want a solution such that persistence does not increase the latency of read and write requests.
If you issue a BGSAVE, the save happens in a forked background process. The OS needs to have an idle CPU core available, of course, for this not to impact the Redis server's main thread. If you configure save points in redis.conf, a BGSAVE is basically what happens. I would turn that off and issue BGSAVE manually while troubleshooting.
If you issue a SAVE, saving will be synchronous, and other clients will have to wait.
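A minimal sketch of that troubleshooting setup, assuming the Jedis Java client and a local server (both assumptions; the question doesn't name a client):

import redis.clients.jedis.Jedis;

public class ManualBgsave {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.configSet("save", "");   // disable automatic RDB save points
            String reply = jedis.bgsave(); // forks; returns "Background saving started"
            System.out.println(reply + "; last completed save: " + jedis.lastsave());
        }
    }
}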
See also here: you might want to skip RDB snapshotting altogether and rely on AOF.
Also see my remark on sensitive data: SO comment. There are many ways to make sure your data is safe; disk persistence is only one of them.
Hope this helps, TW

Repeated tasks - spawn new processes or run continuously?

We have about 10 different Python scripts that download data from the web, read data from a database and write data back to that database. They do so repeatedly every 10 seconds (or 10 seconds after the last task has completed).
The question is, what is the best approach at running these tasks? I can think of a few ways:
a while-True loop that runs the task, then sleeps for the interval. It could be guarded by a watchdog like supervisord, making sure it is always up (see the sketch after this list).
having the script execute the task just once, and invoking the script externally every 10 seconds from another process.
having the script execute the task for, let's say, 1 hour (every 10 seconds for an hour), and having a watchdog make sure the task runs again once the hour is over.
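Here is the first option sketched in Java for concreteness (runTask and the interval are stand-ins for whatever the real script does; a fixed delay matches the "10 seconds after the last task has completed" variant):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RepeatingTask {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Fixed delay: the next run starts 10 seconds after the previous one completes.
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                runTask(); // hypothetical: download, compute, write to the DB
            } catch (Exception e) {
                e.printStackTrace(); // an uncaught exception would cancel the schedule
            }
        }, 0, 10, TimeUnit.SECONDS);
    }
    private static void runTask() { /* ... */ }
}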
I would like to avoid long-running processes that actually do something, because I don't want to deal with memory problems etc. over long periods of time.
Additional Information
The scripts are different because they each retrieve data from a different source, and query, calculate and insert different data into the database.
The tasks are performed every 10 seconds because the data being retrieved is real-time, and we need to not only keep updating it very frequently but also keep all the historical data in the database.
The scripts use a lot of resources: MySQL connections, HTTP connections, Redis connections, etc. We have encountered issues with the long-running approach before, specifically with MySQL connections (errors like "MySQL server has gone away", even though all connections had been closed). Hence the inclination toward having the scripts run for shorter periods of time.
What are some common approaches at this?
Unless your scripts somehow leak memory (quite unlikely), they should all be the same. So, for sheer simplicity (your time programming/debugging is much more expensive than a few milliseconds of the machine's time, even every 10 seconds!), I'd go for the single script that runs every 10 seconds.
OTOH, polling every 10 seconds sounds like busywork. Can't you set things up so that whatever you are monitoring tells you when there are changes? Or batch the records up so you can retrieve, say, a day's worth at a time?
If you are running on Linux, cron has a granularity of one minute. We have processes we run constantly; rather than watch them, the script takes a lock that gets released when the program finishes, normally or not. That way, if a run takes too long and cron fires the script again, the new copy exits when it can't get the lock, so you can call it as often as you need without it stepping on a possibly still-running copy (see the sketch below).
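The lock idea is language-agnostic; here is a minimal sketch in Java (the lock-file path and runTask body are placeholders):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Paths;
import static java.nio.file.StandardOpenOption.CREATE;
import static java.nio.file.StandardOpenOption.WRITE;

public class SingleInstanceTask {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("/tmp/mytask.lock"), CREATE, WRITE)) {
            FileLock lock = channel.tryLock(); // null if another process holds it
            if (lock == null) {
                return; // a previous run is still going; exit quietly
            }
            runTask(); // hypothetical: the actual work goes here
        } // closing the channel releases the lock, even if runTask() throws
    }
    private static void runTask() { /* ... */ }
}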

snmpwalk: no response from the host

I have implemented an AgentX subagent using mib2c.create-dataset.conf (with cache enabled).
In my snmpd.conf: agentXTimeout 15
In the testtable.h file I have changed the cache timeout as below:
#define testTABLE_TIMEOUT 60
According to my understanding, it reloads the data every 60 seconds.
My issue is that once the table holds more than a certain amount of data, it takes a while to load.
If I run snmpwalk while that load is in progress, it gives me "no response from the host". Likewise, if I walk the whole table and testTABLE_TIMEOUT expires partway through, the walk stops with the same error.
Please tell me how to solve this. My table holds a large amount of data, and it changes frequently.
I read somewhere:
(When the agent receives a request for something in this table and the cache is older than the defined timeout (12s > 10s), then it does re-load the data. This is the expected behaviour.
However, the agent does not automatically release the local cache (i.e. call the 'free' routine) as soon as the timeout has expired.
Instead this is handled by a regular "garbage collection" run (once a minute), which will free any stale caches.
In the meantime, a request that tries to use that cache will spot that it's expired, and reload the data.)
Is there any connection between these two? I can't quite see it. How can I resolve my problem?
Unfortunately, if your data set is very large and takes a long time to load, then you simply have to suffer the slow load and the slow response. You can try loading the data on a regular basis using snmp_alarm or something similar so it's immediately available when a request comes in, but that doesn't really solve the problem either, since a request could still arrive right after the alarm is triggered and the agent would still take a long time to respond.
So... the best thing to do is optimize your load routine as much as possible, and possibly increase the timeout that the manager uses. For snmpwalk, for example, you might add -t 30 to the command-line arguments, and I bet everything will suddenly work just fine.
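For example (the host, community string, and table name here are placeholders; -t sets the per-request timeout in seconds):

snmpwalk -v2c -c public -t 30 agent-host testTable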
