Elasticsearch get works half of the time - elasticsearch

I recently ran into a problem with elasticsearch, versions 1.0.1, 1.2.2, 1.2.4, and 1.4.1.
When I try to get a document by ID GET http://localhost:9200/thing/otherthing/700254a4-4e72-46b9-adeb-d498159652cb It will return the document half the time, and the other half I will get a "found" : false error. (These switch off literally every other time, I do a get and it works, do another get and it doesn't).
These documents have no custom routing.
I have tried completely uninstalling elasticsearch and removing all files related to it, then re installing from the official repo to no avail, and google doesn't give me any similar problems or ideas on how to solve this.

The only thing I would think of that would cause a repeatable failure like this would be unassigned shards/replica sets which contain this information.
Do you know how many replica sets you have?
I believe the read is round-robin, so if you only have 2 replicas of the data (1 master + a replica set), and 1 has become unassigned (after being written to), then you might see a failure like this.

Related

"Select jobs to execute..." runs literally forever

I have a rather complex workflow with 750 samples and roughly 18.000 jobs, at first snakemake runs just fine but then after around 4.000 jobs it suddenly freezes and upon restart it hangs with "Select jobs to execute..." for 24h, after that I terminated it. The initial DAG building takes roughly 2-3 minutes, though.
When I run snakemake (v5.32.0 and v5.32.1) with the --verbose option, I get tons of lines similar to this one:
Cbc0010I After 600 nodes, 304 on tree, -52534.791 best solution, best possible -52538.194 (7.08 seconds
I tried to delete the .snakemake folder in the hope that something went riot there, but that wasn't the case, unfortunately. To me it seems that the CBC MILP Solver somehow does not converge, and it keeps going and going to bring the best and the best possible solution closer together!?
Now I do not have any idea anymore, how to proceed and fix the problem. My possible solutions are somehow to change the convergence criteria or the solver itself. In the manual I found the option --scheduler-ilp-solver but it has apparently only one option, the default COIN_CMD.
After terminating a (shorter) run, I get this verbose output
Result - User ctrl-cuser ctrl-c
Objective value: 52534.79114334
Upper bound: 52538.202
Gap: -0.00
Enumerated nodes: 186926
Total iterations: 1807277
Time (CPU seconds): 1181.97
Time (Wallclock seconds): 1188.11
Next I will try to limit the number of samples in the workflow and see if this has any impact (for other datasets with 500 samples, it ran without any problems (with snakemake version 5.24), but there the DAG building took some hours. Hence, I am not very eager to try the old version.)
So, any idea how to fix the problem is highly appreciated. Also, I do not even know, if this is a bug!?
EDIT Actually, I believe it is a bug in the current version, I downgraded Snakemake back to version 5.24, it created the DAG within 10 minutes and started to run the pipeline. So, apparently there is some bug with the latest version. I will make this an answer to my own question, as the downgrading to an older version solved the problem...
I also ran into this issue with a smaller workflow (~1500 jobs total) and snakemake version 6.0.2. About half the jobs had run when the workflow got stuck, and refused to run any more jobs. Looks like it's a problem specific to the ILP solver, because when I re-ran with --scheduler greedy, it worked fine.
Actually, I believe it is a bug in the current snakemake version, I downgraded Snakemake back to version 5.24, it created the DAG within 10 minutes and started to run the pipeline. So, apparently there is some bug with the latest version. I will make this an answer to my own question, as the downgrading to an older version solved the problem...

Oracle 12c startup error: ORA-00093: _shared_pool_reserved_min_alloc must be between 4000 and 0

We have a number of databases at our company. Among them an oracle 12c (12.2.0.1.0 to be precise), but we have no (qualified) oracle DBAs. The performance has deteriorated drastically in the last 6 months or so and I now have the task of finding out why. My research suggested that I should up some memory parameters in the initDBN.ora file. Here's what the original looked like:
DBN.__data_transfer_cache_size=0
DBN.__db_cache_size=50331648
DBN.__inmemory_ext_roarea=0
DBN.__inmemory_ext_rwarea=0
DBN.__java_pool_size=79691776
DBN.__large_pool_size=8388608
DBN.__oracle_base='/orabin/app/oracle'#ORACLE_BASE set from environment
DBN.__pga_aggregate_target=197132288
DBN.__sga_target=734003200
DBN.__shared_io_pool_size=12582912
DBN.__shared_pool_size=536870912
DBN.__streams_pool_size=4194304
*.audit_file_dest='/orabin/app/oracle/admin/tmf/adump'
*.audit_trail='db'
*.compatible='12.2.0'
*.control_files='/orabin/app/oracle/oradata/tmf/control01.ctl','/orabin/app/oracle/fast_recovery_area/tmf/control02.ctl'
*.db_16k_cache_size=8388608
*.db_32k_cache_size=8388608
*.db_4k_cache_size=8388608
*.db_block_size=8192
*.db_domain='ubs-hainer.com'
*.db_name='tmf'
*.db_recovery_file_dest='/orabin/app/oracle/fast_recovery_area/tmf'
*.db_recovery_file_dest_size=4096m
*.diagnostic_dest='/orabin/app/oracle'
*.dispatchers='(PROTOCOL=TCP) (SERVICE=TMFXDB)'
*.local_listener='LISTENER_TMF'
*.memory_max_target=0
*.nls_language='GERMAN'
*.nls_territory='GERMANY'
*.open_cursors=300
*.pga_aggregate_target=188m
*.processes=300
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=700m
*.shared_pool_size=536870912
*.streams_pool_size=4194304
*.undo_tablespace='UNDOTBS1'
Please don't blame me for this, because I did not write it. It certainly doesn't look like the sample init.ora and I am not at all certain where the syntax came from. The values I changed were:
DBN.__sga_target=1024m
*.sga_target=1024m
*.memory_max_target=1408m
DBN.__pga_aggregate_target=384m and *.pga_aggregate_target=384m
That's the order in which I made the changes. After each change I used sqlplus to firstly recreate the spffile with:
create spfile='spfileDBN.ora' from pfile='initDBN.ora';
This was followed by an attempt to startup the database with startup nomount. In each case I got an error message which lead me to make the next change.
Finally I got the error which is in the title of this post. When I tried to search for information on this, the findings were grim. Mostly the information dealt with other parameters and did not explain what this error actually meant. The only thing that gave any real background was this link from Burleson Consulting. It didn't really help me solve the problem, so I decided to revert the initDBN.ora file and do some more research. A slow database is generally better than no database.
But Hey! I still get that same error, even after reerting to the original init file. I'm getting desperate now. I have no idea how to fix this. From what I've read to date, setting "underscore variables" in your init file is a "NO NO".
Can anybody provide me with some helpful tips as to how to get rid of this error?
We don't know if the apps running on this database need specific block sizes, but if the priority is getting the database open, you can shrink the init.ora down the smallest, simplest set of parameters that gets you moving forward.
*.audit_file_dest='/orabin/app/oracle/admin/tmf/adump'
*.audit_trail='db'
*.compatible='12.2.0'
*.control_files='/orabin/app/oracle/oradata/tmf/control01.ctl','/orabin/app/oracle/fast_recovery_area/tmf/control02.ctl'
*.db_block_size=8192
*.db_domain='ubs-hainer.com'
*.db_name='tmf'
*.db_recovery_file_dest='/orabin/app/oracle/fast_recovery_area/tmf'
*.db_recovery_file_dest_size=4096m
*.diagnostic_dest='/orabin/app/oracle'
*.dispatchers='(PROTOCOL=TCP) (SERVICE=TMFXDB)'
*.local_listener='LISTENER_TMF'
*.nls_language='GERMAN'
*.nls_territory='GERMANY'
*.open_cursors=300
*.pga_aggregate_target=188m
*.processes=300
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=1000m
*.undo_tablespace='UNDOTBS1'
should get you an open database. Notice I have bumped up the sga_target to 1000m which is about the minimum you need to get a database started. The true values for sga_target and pga_aggregate_target really need to be set based on your expected usage, and the server configuration. But the init.ora above should get your database running.
I am not sure that this really qualifies as a "solution", but it does fix the initial problem. As mentioned in my reply to Connor McDonald, I set the parameter _shared_pool_reserved_min_alloc to 3000 in the initDBN.ora file, which I copied from Connor's example (thanks for that). After recreating the spfile and trying to restart the database, I got the following error:
ORA-00093: _shared_pool_reserved_min_alloc must be between 4000 and 11953766
That got me thinking that the value 0 in the original error was probably a standin value which really means "the maximum allowed". By actually setting the parameter, I have apparently managed to generate an error which is more meaningful.
The value of _shared_pool_reserved_min_alloc is now set to 4200, which is a value I recall reading in one of the less helpful posts. (No, that post did not say that this is a value that should be used, just that it could be used.) This time, after re-creating the spfile I was able to start the database.
Before I do any more fiddling with parameters, I will do a bit more research... or maybe a lot more.

stream-state-metrics bytes-written-total is always 0

kafka_streams_stream_state_metrics_bytes_written_rate is always 0 in my case. But i see kafka_streams_stream_rocksdb_state_metrics_put_total reflecting the total number of records. What may be the reason? I tried to set the Statistics in options through rocksdb.config.setter class and then it works. I am using kafka-streams-2.4.0. In addition to setting the metrics.recording.level to debug do i need to do anything to make the RocksDBMetricsRecorder to record Statistics ?
I assume you hit a known bug: https://issues.apache.org/jira/browse/KAFKA-9675
It's fixed already but the fix is not released yet. You can either build from the corresponding branches yourself to get the fix, or you need to wait until 2.4.2, 2.5.1, or 2.6.0 is released.

SphinxSearch - Different Nodes using shared data

We are in the process of building a SphinxSearch Cluster using Amazon EC2 instances. We did a sample test like several instances using the same shared file system (Elastic File System). Our idea is, in a cluster we might have more than 10 nodes, But we can use a single instance to index documents and keep it in Elastic File System and can shared by multiple nodes for reading.
Our test worked fine, But technically any problem with this approach? (Like locking issue etc)
Can someone please suggest on this
Thanks in Advance
If you're ok with having N copies of the index you can do as follows:
build an index in one place in a temp folder
rename the files so they include .new.
distribute the index to all the other places using rsync or whatever you like. Some even do broadcasting with UFTP
rotate the indexes at once in all the places by sending HUP to the searchds or better by doing RELOAD INDEX (http://docs.manticoresearch.com/latest/html/sphinxql_reference/reload_index_syntax.html), it normally takes only few ms so we can say that your new index replaces the previous one simultaneously on all the nodes
previously (and perhaps still in Sphinx) there was an issue with rotating the index (either by --rotate or RELOAD) in case it was processing a long query (the rotate just had to wait). It was fixed in Manticoresearch recently.
This is tried'n'true solution people use in production for years, but if you really want to share the same files among multiple searchd instances you can softlink all the files except .spl, but then to rotate the index in the searchd instances using the links (not the actual files) you'll need to restart the searchd instances which doesn't look good in general, but in some special cases may be still a good solution.

Neo4j in Docker - Max Heap Size Causes Hard crash 137

I'm trying to spin up a Neo4j 3.1 instance in a Docker container (through Docker-Compose), running on OSX (El Capitan). All is well, unless I try to increase the max-heap space available to Neo above the default of 512MB.
According to the docs, this can be achieved by adding the environment variable NEO4J_dbms_memory_heap_maxSize, which then causes the server wrapper script to update the neo4j.conf file accordingly. I've checked and it is being updated as one would expect.
The problem is, when I run docker-compose up to spin up the container, the Neo4j instance crashes out with a 137 status code. A little research tells me this is a linux hard-crash, based on heap-size maximum limits.
$ docker-compose up
Starting elasticsearch
Recreating neo4j31
Attaching to elasticsearch, neo4j31
neo4j31 | Starting Neo4j.
neo4j31 exited with code 137
My questions:
Is this due to a Docker or an OSX limitation?
Is there a way I can modify these limits? If I drop the requested limit to 1GB, it will spin up, but still crashes once I run my heavy query (which is what caused the need for increased Heap space anyway).
The query that I'm running is a large-scale update across a lot of nodes (>150k) containing full-text attributes, so that they can be syncronised to ElasticSearch using the plug-in. Is there a way I can get Neo to step through doing, say, 500 nodes at a time, using only cypher (I'd rather avoid writing a script if I can, feels a little dirty for this).
My docker-compose.yml is as follows:
---
version: '2'
services:
# ---<SNIP>
neo4j:
image: neo4j:3.1
container_name: neo4j31
volumes:
- ./docker/neo4j/conf:/var/lib/neo4j/conf
- ./docker/neo4j/mnt:/var/lib/neo4j/import
- ./docker/neo4j/plugins:/plugins
- ./docker/neo4j/data:/data
- ./docker/neo4j/logs:/var/lib/neo4j/logs
ports:
- "7474:7474"
- "7687:7687"
environment:
- NEO4J_dbms_memory_heap_maxSize=4G
# ---<SNIP>
Is this due to a Docker or an OSX limitation?
NO Increase the amount of available RAM to Docker to resolve this issue.
Is there a way I can modify these limits? If I drop the requested
limit to 1GB, it will spin up, but still crashes once I run my heavy
query (which is what caused the need for increased Heap space
anyway).
The query that I'm running is a large-scale update across a lot of
nodes (>150k) containing full-text attributes, so that they can be
syncronised to ElasticSearch using the plug-in. Is there a way I can
get Neo to step through doing, say, 500 nodes at a time, using only
cypher (I'd rather avoid writing a script if I can, feels a little
dirty for this).
N/A This is a NEO4J specific question. It might be better to seperate this from the Docker questions listed above.
3.The query that I'm running is a large-scale update across a lot of nodes (>150k) containing full-text attributes, so that they can be syncronised to ElasticSearch using the plug-in. Is there a way I can get Neo to step through doing, say, 500 nodes at a time, using only cypher (I'd rather avoid writing a script if I can, feels a little dirty for this).
You can do this with the help of apoc plugin for neo4j, more specifically apoc.periodic.iterate
or apoc.periodic.commit
.
If you will use apoc.periodic.commit your first match should be specific like in example you mark which nodes have you already synced, because it sometimes fall in the loop:
call apoc.periodic.commit("
match (user:User) WHERE user.synced = false
with user limit {limit}
MERGE (city:City {name:user.city})
MERGE (user)-[:LIVES_IN]->(city)
SET user.synced =true
RETURN count(*)
",{limit:10000})
If you use apoc.periodic.iterate you can run it in parallel mode:
CALL apoc.periodic.iterate(
"MATCH (o:Order) WHERE o.date > '2016-10-13' RETURN o",
"with {o} as o MATCH (o)-[:HAS_ITEM]->(i) WITH o, sum(i.value) as value
CALL apoc.es.post(host-or-port,index-or-null,type-or-null,
query-or-null,payload-or-null) yield value return *", {batchSize:100, parallel:true})
Note that there is no need for second MATCH clause and apoc.es.post is a function for apoc that can send post requests to elastic search.
see documentation for more info

Resources