stream-state-metrics bytes-written-total is always 0 - apache-kafka-streams

kafka_streams_stream_state_metrics_bytes_written_rate is always 0 in my case, but I see kafka_streams_stream_rocksdb_state_metrics_put_total reflecting the total number of records. What may be the reason? When I set the Statistics in the options through a rocksdb.config.setter class, it works. I am using kafka-streams-2.4.0. In addition to setting metrics.recording.level to debug, do I need to do anything else to make the RocksDBMetricsRecorder record Statistics?

I assume you hit a known bug: https://issues.apache.org/jira/browse/KAFKA-9675
It's fixed already but the fix is not released yet. You can either build from the corresponding branches yourself to get the fix, or you need to wait until 2.4.2, 2.5.1, or 2.6.0 is released.


"Select jobs to execute..." runs literally forever

I have a rather complex workflow with 750 samples and roughly 18,000 jobs. At first snakemake runs just fine, but after around 4,000 jobs it suddenly freezes, and upon restart it hangs at "Select jobs to execute..." (I terminated it after 24 hours). The initial DAG building takes only about 2-3 minutes, though.
When I run snakemake (v5.32.0 and v5.32.1) with the --verbose option, I get tons of lines similar to this one:
Cbc0010I After 600 nodes, 304 on tree, -52534.791 best solution, best possible -52538.194 (7.08 seconds
I tried deleting the .snakemake folder in the hope that something had gone wrong there, but unfortunately that wasn't the case. To me it seems that the CBC MILP solver somehow does not converge and keeps running to bring the best solution and the best possible solution closer together!?
I no longer have any idea how to proceed and fix the problem. The possible solutions I see are to change the convergence criteria or the solver itself. In the manual I found the option --scheduler-ilp-solver, but it apparently has only one possible value, the default COIN_CMD.
After terminating a (shorter) run, I get this verbose output:
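To put a number on how close the two values already are, the progress line can be parsed with a few lines of Python. This is only a sketch: the regular expression is an assumption based on the single "Cbc0010I" line quoted above, and the function name is made up for illustration.

```python
import re

# Pattern assumed from the single progress line quoted above.
PROGRESS = re.compile(
    r"(-?\d+(?:\.\d+)?) best solution, best possible (-?\d+(?:\.\d+)?)"
)


def solver_gap(line):
    """Return (absolute, relative) gap between incumbent and bound, or None."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    best, bound = (float(value) for value in match.groups())
    absolute = abs(best - bound)
    # Guard against division by zero for an incumbent of exactly 0.
    relative = absolute / max(abs(best), 1e-12)
    return absolute, relative


line = ("Cbc0010I After 600 nodes, 304 on tree, -52534.791 best solution, "
        "best possible -52538.194 (7.08 seconds")
absolute, relative = solver_gap(line)
print(f"absolute gap: {absolute:.3f}, relative gap: {relative:.2e}")
```

For the quoted line, the absolute gap is about 3.4 on an objective of roughly 52,500, which is a relative gap well below 0.01%. The solver appears to be spending its time closing a very small remaining difference.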
Result - User ctrl-cuser ctrl-c
Objective value: 52534.79114334
Upper bound: 52538.202
Gap: -0.00
Enumerated nodes: 186926
Total iterations: 1807277
Time (CPU seconds): 1181.97
Time (Wallclock seconds): 1188.11
Next I will try to limit the number of samples in the workflow and see if this has any impact. For other datasets with 500 samples, it ran without any problems with snakemake version 5.24, but there the DAG building took some hours, so I am not very eager to try the old version.
So any idea how to fix the problem is highly appreciated. I do not even know if this is a bug!?
EDIT: Actually, I believe this is a bug in the current version. I downgraded Snakemake to version 5.24, and it built the DAG within 10 minutes and started running the pipeline. So apparently there is a bug in the latest version. I will post this as an answer to my own question, since downgrading to an older version solved the problem.
I also ran into this issue with a smaller workflow (~1500 jobs total) and snakemake version 6.0.2. About half the jobs had run when the workflow got stuck, and refused to run any more jobs. Looks like it's a problem specific to the ILP solver, because when I re-ran with --scheduler greedy, it worked fine.

Oracle 12c startup error: ORA-00093: _shared_pool_reserved_min_alloc must be between 4000 and 0

We have a number of databases at our company. Among them is an Oracle 12c database (12.2.0.1.0 to be precise), but we have no (qualified) Oracle DBAs. The performance has deteriorated drastically in the last 6 months or so, and I now have the task of finding out why. My research suggested that I should increase some memory parameters in the initDBN.ora file. Here's what the original looked like:
DBN.__data_transfer_cache_size=0
DBN.__db_cache_size=50331648
DBN.__inmemory_ext_roarea=0
DBN.__inmemory_ext_rwarea=0
DBN.__java_pool_size=79691776
DBN.__large_pool_size=8388608
DBN.__oracle_base='/orabin/app/oracle'#ORACLE_BASE set from environment
DBN.__pga_aggregate_target=197132288
DBN.__sga_target=734003200
DBN.__shared_io_pool_size=12582912
DBN.__shared_pool_size=536870912
DBN.__streams_pool_size=4194304
*.audit_file_dest='/orabin/app/oracle/admin/tmf/adump'
*.audit_trail='db'
*.compatible='12.2.0'
*.control_files='/orabin/app/oracle/oradata/tmf/control01.ctl','/orabin/app/oracle/fast_recovery_area/tmf/control02.ctl'
*.db_16k_cache_size=8388608
*.db_32k_cache_size=8388608
*.db_4k_cache_size=8388608
*.db_block_size=8192
*.db_domain='ubs-hainer.com'
*.db_name='tmf'
*.db_recovery_file_dest='/orabin/app/oracle/fast_recovery_area/tmf'
*.db_recovery_file_dest_size=4096m
*.diagnostic_dest='/orabin/app/oracle'
*.dispatchers='(PROTOCOL=TCP) (SERVICE=TMFXDB)'
*.local_listener='LISTENER_TMF'
*.memory_max_target=0
*.nls_language='GERMAN'
*.nls_territory='GERMANY'
*.open_cursors=300
*.pga_aggregate_target=188m
*.processes=300
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=700m
*.shared_pool_size=536870912
*.streams_pool_size=4194304
*.undo_tablespace='UNDOTBS1'
Please don't blame me for this, because I did not write it. It certainly doesn't look like the sample init.ora and I am not at all certain where the syntax came from. The values I changed were:
DBN.__sga_target=1024m
*.sga_target=1024m
*.memory_max_target=1408m
DBN.__pga_aggregate_target=384m and *.pga_aggregate_target=384m
That's the order in which I made the changes. After each change I used sqlplus to first recreate the spfile with:
create spfile='spfileDBN.ora' from pfile='initDBN.ora';
This was followed by an attempt to start the database with startup nomount. In each case I got an error message, which led me to make the next change.
Finally I got the error which is in the title of this post. When I tried to search for information on this, the findings were grim. Mostly the information dealt with other parameters and did not explain what this error actually meant. The only thing that gave any real background was this link from Burleson Consulting. It didn't really help me solve the problem, so I decided to revert the initDBN.ora file and do some more research. A slow database is generally better than no database.
But hey! I still get that same error, even after reverting to the original init file. I'm getting desperate now. I have no idea how to fix this. From what I've read to date, setting "underscore variables" in your init file is a "NO NO".
Can anybody provide me with some helpful tips as to how to get rid of this error?
We don't know if the apps running on this database need specific block sizes, but if the priority is getting the database open, you can shrink the init.ora down to the smallest, simplest set of parameters that gets you moving forward.
*.audit_file_dest='/orabin/app/oracle/admin/tmf/adump'
*.audit_trail='db'
*.compatible='12.2.0'
*.control_files='/orabin/app/oracle/oradata/tmf/control01.ctl','/orabin/app/oracle/fast_recovery_area/tmf/control02.ctl'
*.db_block_size=8192
*.db_domain='ubs-hainer.com'
*.db_name='tmf'
*.db_recovery_file_dest='/orabin/app/oracle/fast_recovery_area/tmf'
*.db_recovery_file_dest_size=4096m
*.diagnostic_dest='/orabin/app/oracle'
*.dispatchers='(PROTOCOL=TCP) (SERVICE=TMFXDB)'
*.local_listener='LISTENER_TMF'
*.nls_language='GERMAN'
*.nls_territory='GERMANY'
*.open_cursors=300
*.pga_aggregate_target=188m
*.processes=300
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=1000m
*.undo_tablespace='UNDOTBS1'
should get you an open database. Notice I have bumped up the sga_target to 1000m which is about the minimum you need to get a database started. The true values for sga_target and pga_aggregate_target really need to be set based on your expected usage, and the server configuration. But the init.ora above should get your database running.
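The reduction above can also be scripted. The following Python sketch drops the instance-specific double-underscore entries (such as DBN.__sga_target) and the parameters that are missing from the reduced listing when compared with the original file. The DROPPED set comes only from comparing the two listings in this thread, not from any Oracle recommendation, and the function name is hypothetical.

```python
import re

# Parameters present in the original file but absent from the reduced
# listing above (assumption: derived by comparing the two listings).
DROPPED = {
    "db_16k_cache_size", "db_32k_cache_size", "db_4k_cache_size",
    "memory_max_target", "shared_pool_size", "streams_pool_size",
}

# Matches "<sid>.<parameter>=" where <sid> is "*" or an instance name.
ENTRY = re.compile(r"^(?P<sid>[^.=\s]+)\.(?P<name>[A-Za-z0-9_]+)\s*=")


def reduce_pfile(lines):
    """Drop double-underscore entries and the parameters listed in DROPPED."""
    kept = []
    for raw in lines:
        line = raw.strip()
        match = ENTRY.match(line)
        if match is None:
            continue  # blank or unparsable line
        name = match["name"]
        if name.startswith("__") or name in DROPPED:
            continue
        kept.append(line)
    return kept


sample = [
    "DBN.__sga_target=734003200",
    "*.db_block_size=8192",
    "*.memory_max_target=0",
    "*.shared_pool_size=536870912",
    "*.sga_target=700m",
]
for entry in reduce_pfile(sample):
    print(entry)
```

The sketch only removes lines. It does not change any values, so sga_target would still have to be raised by hand as described above.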
I am not sure that this really qualifies as a "solution", but it does fix the initial problem. As mentioned in my reply to Connor McDonald, I set the parameter _shared_pool_reserved_min_alloc to 3000 in the initDBN.ora file, which I copied from Connor's example (thanks for that). After recreating the spfile and trying to restart the database, I got the following error:
ORA-00093: _shared_pool_reserved_min_alloc must be between 4000 and 11953766
That got me thinking that the value 0 in the original error was probably a stand-in value which really means "the maximum allowed". By actually setting the parameter, I have apparently managed to generate an error which is more meaningful.
The value of _shared_pool_reserved_min_alloc is now set to 4200, which is a value I recall reading in one of the less helpful posts. (No, that post did not say that this is a value that should be used, just that it could be used.) This time, after re-creating the spfile I was able to start the database.
Before I do any more fiddling with parameters, I will do a bit more research... or maybe a lot more.

Unexpected update behaviour of versions maven plugin

The goal versions:update-properties produced the following output:
10:52:25,255 INFO - --- versions-maven-plugin:2.7:update-properties (default-cli) # release-plugin-test-new-bo ---
10:52:32,605 INFO - artifact de.continentale.muv:coutil: checking for updates from nexus
10:52:32,666 INFO - Subincremental version changes allowed
10:52:32,682 INFO - Updated ${coutil.version} from 7.0.0-SNAPSHOT to 7.0.1-RC0002
I set the parameters -DallowIncrementalUpdates=false, and also -DallowMinorUpdates=false and -DallowMajorUpdates=false, which is reflected in the line "Subincremental version changes allowed". Nevertheless, the version was upgraded by changing the third number in the version.
This behaviour is unexpected and also not idempotent (the next run replaces 7.0.1-RC0002 by 7.0.1).
I tried to figure out why that happens from the documentation and also from the Javadoc and source code but got lost somewhere in Maven version comparison.
Can someone enlighten me? Is this a bug, or do I need to configure things differently to avoid the updates on the third number?
Some debugging led to the conclusion that for 7.0.0-SNAPSHOT, the goal versions:update-properties with the parameters as above does roughly the following:
Create an upper bound by incrementing the third number (in this case, the upper bound is 7.0.1-SNAPSHOT).
Look for the largest version below that bound (for Maven, 7.0.1-RC0002 is smaller than 7.0.1-SNAPSHOT).
IMHO, the code does not behave properly because there is actually an incremental change in the version number although I set the respective property to false.
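The two steps described above can be sketched in Python. This is a deliberately simplified model, not the plugin's or Maven's actual comparison code: the qualifier ranking covers only the three cases that matter here, the function names are made up, and the list of available versions is hypothetical.

```python
import re

# Simplified qualifier ranking (assumption): release candidates sort below
# snapshots, and a version without a qualifier (a release) sorts above both.
QUALIFIER_RANK = {"rc": 0, "snapshot": 1, "": 2}

VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([A-Za-z]+)(\d*))?$")


def sort_key(version):
    """Turn 'major.minor.incremental[-qualifier[number]]' into a sortable tuple."""
    match = VERSION.match(version)
    if match is None:
        raise ValueError(f"unsupported version format: {version}")
    major, minor, incremental, qualifier, number = match.groups()
    qualifier = (qualifier or "").lower()
    if qualifier not in QUALIFIER_RANK:
        raise ValueError(f"unsupported qualifier: {qualifier}")
    return (int(major), int(minor), int(incremental),
            QUALIFIER_RANK[qualifier], int(number or 0))


def upper_bound(current):
    """Step 1: increment the third number and append -SNAPSHOT."""
    major, minor, incremental = sort_key(current)[:3]
    return f"{major}.{minor}.{incremental + 1}-SNAPSHOT"


def pick_update(current, available):
    """Step 2: pick the largest available version strictly below the bound."""
    low, high = sort_key(current), sort_key(upper_bound(current))
    candidates = [v for v in available if low < sort_key(v) < high]
    return max(candidates, key=sort_key, default=current)


# Hypothetical repository contents.
available = ["7.0.0", "7.0.1-RC0001", "7.0.1-RC0002", "7.0.1", "7.1.0"]
print(upper_bound("7.0.0-SNAPSHOT"))
print(pick_update("7.0.0-SNAPSHOT", available))
```

With this hypothetical list, the sketch selects 7.0.1-RC0002, the same jump seen in the log output: the release candidate falls below the 7.0.1-SNAPSHOT bound even though its third number differs from the current version.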

How to investigate reasons for failed writes to Proficy Historian 4.5

I have about 25000 failed writes every 30 minutes. Previously I had an issue with timestamps ahead of the current server clock, but this was addressed. The collected data looks OK: there are no gaps and all values are good.
Is there any way I can tell why those writes fail?
Try checking the log files. For version 4.5 they should be in the default path "C:\Proficy Historian Data\LogFiles".

Elasticsearch get works half of the time

I recently ran into a problem with elasticsearch, versions 1.0.1, 1.2.2, 1.2.4, and 1.4.1.
When I try to get a document by ID with GET http://localhost:9200/thing/otherthing/700254a4-4e72-46b9-adeb-d498159652cb, it returns the document half the time, and the other half I get a "found" : false response. (These alternate literally every other time: I do a get and it works, do another get and it doesn't.)
These documents have no custom routing.
I have tried completely uninstalling elasticsearch and removing all files related to it, then reinstalling from the official repo, to no avail. Google doesn't turn up any similar problems or ideas on how to solve this.
The only thing I would think of that would cause a repeatable failure like this would be unassigned shards/replica sets which contain this information.
Do you know how many replica sets you have?
I believe the read is round-robin, so if you only have 2 replicas of the data (1 master + a replica set), and 1 has become unassigned (after being written to), then you might see a failure like this.
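The alternating pattern this would produce can be illustrated with a toy Python model. It is not Elasticsearch's actual routing code. It simply assumes two copies of one shard, one of which is missing the document, with reads alternating strictly between them. The document body and all names other than the ID are invented.

```python
from itertools import cycle

# Toy model (assumption): two copies of one shard, where the second copy
# never received the document, and reads alternate strictly between them.
DOC_ID = "700254a4-4e72-46b9-adeb-d498159652cb"
primary = {DOC_ID: {"name": "example"}}
stale_copy = {}

copies = cycle([primary, stale_copy])


def get(doc_id):
    """Serve a GET from the next shard copy in round-robin order."""
    shard_copy = next(copies)
    if doc_id in shard_copy:
        return {"found": True, "_source": shard_copy[doc_id]}
    return {"found": False}


results = [get(DOC_ID)["found"] for _ in range(4)]
print(results)  # [True, False, True, False]
```

Every second read lands on the copy without the document, which matches the "works, then doesn't" behaviour described in the question.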
