CouchDB 3-node cluster (Windows) - multiple Erlang errors

I'm receiving multiple Erlang errors in my CouchDB 2.1.1 cluster (3 nodes/Windows); see the errors and node configuration below:
3 nodes (10.0.7.4 - 10.0.7.6); an Azure Application Gateway is used as the load balancer.
Why do these errors appear? The system resources of the nodes are far from overloaded.
I would be thankful for any help - thanks in advance.
Errors:
rexi_server: from: couchdb@10.0.7.4(<0.14976.568>) mfa: fabric_rpc:changes/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream_last,2,[{file,"src/rexi.erl"},{line,224}]},{fabric_rpc,changes,4,[{file,"src/fabric_rpc.erl"},{line,86}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
rexi_server: from: couchdb@10.0.7.6(<13540.24597.655>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,308}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,642}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
rexi_server: from: couchdb@10.0.7.6(<13540.5991.623>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,308}]},{couch_mrview,map_fold,3,[{file,"src/couch_mrview.erl"},{line,511}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,848}]},{couch_btree,fold,4,[{file,"src/couch_btree.erl"},{line,222}]},{couch_db,enum_docs,5,[{file,"src/couch_db.erl"},{line,1450}]},{couch_mrview,all_docs_fold,4,[{file,"src/couch_mrview.erl"},{line,425}]}]
req_err(3206982071) unknown_error : normal [<<"mochiweb_request:recv/3 L180">>,<<"mochiweb_request:stream_unchunked_body/4 L540">>,<<"mochiweb_request:recv_body/2 L214">>,<<"chttpd:body/1 L636">>,<<"chttpd:json_body/1 L649">>,<<"chttpd:json_body_obj/1 L657">>,<<"chttpd_db:db_req/2 L386">>,<<"chttpd:process_request/1 L295">>]
** System running to use fully qualified hostnames **
** Hostname localhost is illegal **
Compaction errors:
Supervisor couch_secondary_services had child compaction_daemon started with couch_compaction_daemon:start_link() at <0.18509.478> exit with reason {compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}} in context child_terminated
CRASH REPORT Process couch_compaction_daemon (<0.18509.478>) with 0 neighbors exited with reason: {compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}} at gen_server:terminate/7(line:826) <= proc_lib:init_p_do_apply/3(line:240); initial_call: {couch_compaction_daemon,init,['Argument__1']}, ancestors: [couch_secondary_services,couch_sup,<0.200.0>], messages: [], links: [<0.12665.492>], dictionary: [], trap_exit: true, status: running, heap_size: 987, stack_size: 27, reductions: 3173
gen_server couch_compaction_daemon terminated with reason: {compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}} last msg: {'EXIT',<0.23195.476>,{timeout,{gen_server,call,[couch_server,get_server]}}} state: {state,<0.23195.476>,[]}
Error in process <0.16890.22> on node 'couchdb@10.0.7.4' with exit value: {{rexi_DOWN,{'couchdb@10.0.7.5',noproc}},[{mem3_rpc,rexi_call,2,[{file,"src/mem3_rpc.erl"},{line,269}]},{mem3_rep,calculate_start_seq,1,[{file,"src/mem3_rep.erl"},{line,194}]},{mem3_rep,repl,2,[{file,"src/mem3_rep.erl"},{line,175}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,81}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,208}]}]}
#vm.args
-name couchdb@10.0.7.4
-setcookie monster
-kernel error_logger silent
-sasl sasl_error_logger false
+K true
+A 16
+Bd -noinput
+Q 134217727
local.ini
[fabric]
request_timeout = infinity
[couchdb]
max_dbs_open = 10000
os_process_timeout = 20000
uuid =
[chttpd]
port = 5984
bind_address = 0.0.0.0
[httpd]
socket_options = [{recbuf, 262144}, {sndbuf, 262144}, {nodelay, true}]
enable_cors = true
[couch_httpd_auth]
secret =
[daemons]
compaction_daemon={couch_compaction_daemon, start_link, []}
[compactions]
_default = [{db_fragmentation, "50%"}, {view_fragmentation, "50%"}, {from, "23:00"}, {to, "04:00"}]
[compaction_daemon]
check_interval = 300
min_file_size = 100000
[vendor]
name = COUCHCLUSTERNODE0X
[admins]
adminuser =
[cors]
methods = GET, PUT, POST, HEAD, DELETE
headers = accept, authorization, content-type, origin, referer
origins = *
credentials = true
[query_server_config]
os_process_limit = 2000
os_process_soft_limit = 1000
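Since the rexi timeouts and the rexi_DOWN/noproc error point at node-to-node communication rather than client load, one quick sanity check is whether every node is healthy and sees the same cluster membership. A minimal sketch (assuming the node addresses from the question, port 5984 from local.ini, and placeholder admin credentials; the requests library is used only for brevity):

import requests

NODES = ["10.0.7.4", "10.0.7.5", "10.0.7.6"]  # node addresses from the question
AUTH = ("adminuser", "password")              # placeholder credentials

for ip in NODES:
    base = f"http://{ip}:5984"
    # /_up reports the health of this node only
    up = requests.get(f"{base}/_up", auth=AUTH, timeout=5)
    # /_membership lists the Erlang nodes this node believes belong to the cluster
    membership = requests.get(f"{base}/_membership", auth=AUTH, timeout=5)
    print(ip, up.status_code, membership.json())

Every node should answer 200 on /_up and list all three couchdb@10.0.7.x names in both all_nodes and cluster_nodes; if one node is missing or unreachable, fabric RPCs to it will time out exactly as in the rexi_server messages above.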

Related

Rllib & Remote Desktop Connection: [Errno 10054] An existing connection was forcibly closed by the remote host

The general setting:
I am currently running several rllib processes on the machine at my workplace. The goal is to run multi-agent reinforcement learning simulations with a varying number of agents, differing start states and three different configurations. The computer only has the capacity to run one process at a time, which is why I loop through all configurations to allow for overnight and weekend simulations.
The Problem:
To oversee the process while away from work (e.g. to check for problems with disk space), I have configured a Remote Desktop connection. Now, after approximately 6-8 hours of running, the following error message is produced:
2022-06-08 08:01:04,008 ERROR worker.py:1259 -- listen_error_messages_raylet: [WinError 10054] Eine vorhandene Verbindung wurde vom Remotehost geschlossen
2022-06-08 08:01:04,008 ERROR worker.py:488 -- print_logs: [WinError 10054] Eine vorhandene Verbindung wurde vom Remotehost geschlossen
[2022-06-08 08:02:04,321 C 4548 1908] gcs_client.cc:343: Couldn't reconnect to GCS server. The last attempted GCS server address was 127.0.0.1:61369
*** StackTrace Information ***
Windows fatal exception: access violation
Translation: Eine vorhandene Verbindung wurde vom Remotehost geschlossen => An existing connection was forcibly closed by the remote host
The code:
Here is my code, even though I believe it must have something to do with the Windows Remote Desktop connection or with the GCS server (I have no idea what this is or what I actually need it for):
def main(debug, framework="tf"):
n_agents = [8, 16, 32, 64]
init_state = [0.1, 0.25, 0.5, 1., 2.]
tax = ["none", "vote", "central"]
for n in n_agents:
for s in init_state:
for t in tax:
shutdown()
register_env(args.env_name, lambda cnfg: ParallelPettingZooEnv(env_creator(cnfg)))
train_n_replicates = 1 if debug else 10
seeds = list(range(train_n_replicates))
ray.init(num_cpus=os.cpu_count(), num_gpus=0, local_mode=debug)
rllib_config, stop_config = get_rllib_config(seeds=seeds,
n_agents=n,
init_state=s,
tax=t,
debug=debug,
framework=framework)
# Define logger to use (e.g. which output formats)
custom_logger = LifecycleLoggerCallback(
logger_classes=[CSVLogger, TBXLogger],
)
log_dir = os.path.join(os.getcwd(), "run_configurations/checkpoints")
tune_analysis = tune.run(
args.run,
config=rllib_config,
stop=stop_config,
checkpoint_freq=0,
checkpoint_at_end=True,
name=args.experiment_name,
local_dir=log_dir,
callbacks=[custom_logger],
trial_name_creator=trial_str_creator,
raise_on_failed_trial=False,
)
ray.shutdown()
for i in range(len(tune_analysis.trials)):
results_path = os.path.join(log_dir, args.experiment_name, str(tune_analysis.trials[i].logdir))
results_to_csv(results_path)
if __name__ == "__main__":
debug_mode = False
args = parser.parse_args()
main(debug_mode, args.framework)
P.S.: I run the PPO RLlib-registered algorithm.
Grateful for any tips. Cheers :)
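Not a confirmed fix for the GCS disconnect, but one common way to make a long sweep like this more robust is to isolate each ray.init()/ray.shutdown() cycle in its own OS process, so a crashed GCS only loses one configuration instead of the whole overnight run. A sketch under the assumption that the inner-loop body is moved into a separate script (run_one.py here is hypothetical):

import itertools
import subprocess
import sys

n_agents = [8, 16, 32, 64]
init_state = [0.1, 0.25, 0.5, 1., 2.]
tax = ["none", "vote", "central"]

for n, s, t in itertools.product(n_agents, init_state, tax):
    # run_one.py (hypothetical) would contain register_env, ray.init, tune.run,
    # ray.shutdown and the CSV export for a single configuration
    result = subprocess.run(
        [sys.executable, "run_one.py",
         "--n-agents", str(n), "--init-state", str(s), "--tax", t],
        check=False,  # keep sweeping even if one configuration crashes
    )
    if result.returncode != 0:
        print(f"configuration n={n}, s={s}, t={t} exited with code {result.returncode}")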

Eclipse Milo: handle missing ServerNonce in ActivateSessionResponse

I use Eclipse Milo (0.2.3) in my project for OPC UA communication. The OPC UA participants are a client (written using Eclipse Milo) and a server (running on a remote machine and not implemented using Milo).
I can connect the client to the server normally, and if the remote server is shut down, the client reconnects automatically as soon as the server is accessible again.
However, after updating the server software, the client can't reconnect any more and floods the server with the following messages:
Create Session Request
The server is able to create a session
Activate Session Request
The server sends an Activate Session Response, in which the ServerNonce is missing and the service result is "bad"
This causes the client to send a new Create Session Request. This all happens multiple times within a second, which makes it impossible for the server to execute any other tasks than trying to create this session.
Are there any settings in Milo to specify the reconnection delay? Or is there any setting for specifying what should happen when receiving an empty ServerNonce?
The server's responses are as follows:
If the session can be activated:
OpcUa Binary Protocol
Message Type: MSG
Chunk Type: F
Message Size: 96
SecureChannelId: 1599759116
Security Token Id: 1
Security Sequence Number: 53
Security RequestId: 3
OpcUa Service : Encodeable Object
TypeId : ExpandedNodeId
NodeId EncodingMask: Four byte encoded Numeric (0x01)
NodeId Namespace Index: 0
NodeId Identifier Numeric: ActivateSessionResponse (470)
ActivateSessionResponse
ResponseHeader: ResponseHeader
Timestamp: Nov 16, 2018 14:05:47.974000000
RequestHandle: 1
ServiceResult: 0x00000000 [Good]
ServiceDiagnostics: DiagnosticInfo
EncodingMask: 0x00
.... ...0 = has symbolic id: False
.... ..0. = has namespace: False
.... .0.. = has localizedtext: False
.... 0... = has locale: False
...0 .... = has additional info: False
..0. .... = has inner statuscode: False
.0.. .... = has inner diagnostic info: False
StringTable: Array of String
ArraySize: 0
AdditionalHeader: ExtensionObject
TypeId: ExpandedNodeId
EncodingMask: 0x00
ServerNonce: ab...
Results: Array of StatusCode
ArraySize: 0
DiagnosticInfos: Array of DiagnosticInfo
ArraySize: 0
If the session can't be activated (after updating the server's software):
OpcUa Binary Protocol
Message Type: MSG
Chunk Type: F
Message Size: 64
SecureChannelId: 1599759041
Security Token Id: 1
Security Sequence Number: 61
Security RequestId: 11
OpcUa Service : Encodeable Object
TypeId : ExpandedNodeId
ActivateSessionResponse
ResponseHeader: ResponseHeader
Timestamp: Nov 16, 2018 12:49:08.235000000
RequestHandle: 222
ServiceResult: 0x80000000 [Bad]
ServiceDiagnostics: DiagnosticInfo
EncodingMask: 0x00
.... ...0 = has symbolic id: False
.... ..0. = has namespace: False
.... .0.. = has localizedtext: False
.... 0... = has locale: False
...0 .... = has additional info: False
..0. .... = has inner statuscode: False
.0.. .... = has inner diagnostic info: False
StringTable: Array of String
ArraySize: 0
AdditionalHeader: ExtensionObject
TypeId: ExpandedNodeId
EncodingMask: 0x00
ServerNonce: <MISSING>[OpcUa Null ByteString]
Results: Array of StatusCode
ArraySize: 0
DiagnosticInfos: Array of DiagnosticInfo
ArraySize: 0
Thank you in advance for your help.
The corner case you described, where there is no delay between a failed re-activation and the subsequent re-creation, is addressed on the dev/0.3 branch in this commit.
I might be able to backport it to 0.2.x next week if I have some spare time.
I don't think there are any workarounds you can use.
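For reference, the behaviour described above amounts to putting a (preferably growing) delay between a failed re-activation and the next Create Session attempt, instead of retrying immediately. A minimal sketch of that backoff idea, shown in Python purely for illustration (it is not Milo API code; with Milo 0.2.3 the only real option is the fix mentioned above):

import time

def reconnect_with_backoff(try_reconnect, base_delay=0.5, max_delay=30.0):
    """Call try_reconnect() until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    while not try_reconnect():
        time.sleep(delay)                  # avoid flooding the server with session requests
        delay = min(delay * 2, max_delay)  # cap the exponential backoff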

MapR installation failing for single node cluster

I was following the quick installation guide for a single-node cluster. For this I used a 20 GB storage file for MapR-FS, but during installation it gives 'Unable to find disks: /maprfs/storagefile'.
Here is my configuration file.
# Each Node section can specify nodes in the following format
# Hostname: disk1, disk2, disk3
# Specifying disks is optional. If not provided, the installer will use the values of 'disks' from the Defaults section
[Control_Nodes]
maprlocal.td.td.com: /maprfs/storagefile
#control-node2.mydomain: /dev/disk3, /dev/disk9
#control-node3.mydomain: /dev/sdb, /dev/sdc, /dev/sdd
[Data_Nodes]
#data-node1.mydomain
#data-node2.mydomain: /dev/sdb, /dev/sdc, /dev/sdd
#data-node3.mydomain: /dev/sdd
#data-node4.mydomain: /dev/sdb, /dev/sdd
[Client_Nodes]
#client1.mydomain
#client2.mydomain
#client3.mydomain
[Options]
MapReduce1 = true
YARN = true
HBase = true
MapR-DB = true
ControlNodesAsDataNodes = true
WirelevelSecurity = false
LocalRepo = false
[Defaults]
ClusterName = my.cluster.com
User = mapr
Group = mapr
Password = mapr
UID = 2000
GID = 2000
Disks = /maprfs/storagefile
StripeWidth = 3
ForceFormat = false
CoreRepoURL = http://package.mapr.com/releases
EcoRepoURL = http://package.mapr.com/releases/ecosystem-4.x
Version = 4.0.2
MetricsDBHost =
MetricsDBUser =
MetricsDBPassword =
MetricsDBSchema =
Below is the error that I am getting.
2015-04-16 08:18:03,659 callbacks 42 [INFO]: Running task: [Verify Pre-Requisites]
2015-04-16 08:18:03,661 callbacks 87 [ERROR]: maprlocal.td.td.com: Unable to find disks: /maprfs/storagefile from /maprfs/storagefile remove disks: /dev/sda,/dev/sda1,/dev/sda2,/dev/sda3 and retry
2015-04-16 08:18:03,662 callbacks 91 [ERROR]: failed: [maprlocal.td.td.com] => {"failed": true}
2015-04-16 08:18:03,667 installrunner 199 [ERROR]: Host: maprlocal.td.td.com has 1 failures
2015-04-16 08:18:03,668 common 203 [ERROR]: Control Nodes have failures. Please fix the failures and re-run the installation. For more information refer to the installer log at /opt/mapr-installer/var/mapr-installer.log
Please help me here.
Thanks
Shashi
The error is resolved by adding the --skip-checks option when running the installer:
/opt/mapr-installer/bin/install --skip-checks new

Cannot connect to WebSocket server using WebSocket4Net

I have mochiweb as a WebSocket server; connectivity using JavaScript from the Chrome browser as a WS client went smoothly (open, send message, close). However, when I try to connect from C# using WebSocket4Net, I always get the error below from mochiweb.
=CRASH REPORT==== 30-Jan-2013::16:57:41 ===
crasher:
initial call: mochiweb_acceptor:init/3
pid: <0.228.0>
registered_name: []
exception error: no case clause matching {error,timeout}
in function mochiweb_http:websocket_init_with_origin_validated/4 (mochiweb_http.erl, line 292)
in call from mochiweb_http:headers_ws_upgrade/4 (mochiweb_http.erl, line 192)
ancestors: [cim_https,<0.166.0>]
messages: []
links: [<0.167.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 1597
stack_size: 24
reductions: 1585
my C# snippet:
webSocketClient = new WebSocket("wss://localhost:8080/login");
webSocketClient.Error += new EventHandler<SuperSocket.ClientEngine.ErrorEventArgs>(webSocketClient_Error) ;
webSocketClient.AllowUnstrustedCertificate = true;
webSocketClient.Opened += new EventHandler(webSocketClient_Opened);
webSocketClient.Closed += new EventHandler(webSocketClient_Closed);
webSocketClient.MessageReceived += new EventHandler<MessageReceivedEventArgs>(webSocketClient_MessageReceived);
webSocketClient.Open();
Is there any parameter that I've missed? Any idea on how to trace this?
Found the issue. Apparently, mochiweb only supports what WebSocket4Net knows as Hybi00; there is no support for RFC 6455 yet.
It seems I now have to patch my mochiweb.

Disabling echo from WEBrick

How can I disable messages from WEBrick echoed to the terminal? For the INFO messages that appear at the beginning, I was able to disable them by setting the Logger parameter as follows:
s = WEBrick::HTTPServer.new(
Port: 3000,
BindAddress: "localhost",
Logger: WEBrick::Log.new("/dev/null"),
)
But I further want to disable the messages that look like:
localhost - - [17/Jun/2011:10:01:38 EDT] "GET .... HTTP/1.1" 200 0
http://localhost:3000/ -> .....
when a request is made from the web browser.
Following the link to the source and the suggestion provided by Yet Another Geek, I was able to figure out a way: set the AccessLog parameter to [] (changed from [nil, nil] following the suggestion by Robert Watkins).
s = WEBrick::HTTPServer.new(
Port: 3000,
BindAddress: "localhost",
Logger: WEBrick::Log.new("/dev/null"),
AccessLog: [],
)
