Error in setting up genesis transactions -- what am I doing wrong? - cosmos-sdk

I'm trying to go through the "Running a node" tutorial here: https://github.com/cosmos/cosmos-sdk/blob/master/docs/run-node/run-node.md
I seem to be running into an issue, though: the genesis transactions don't set up a validator, so the validator set is empty and the app stops. Am I missing something?
I'm running script.sh (below) and getting the error shown in error.log.
simd version: goz-phase-1-1119-g8572a84eb
script.sh
#!/bin/bash
set -eu
PATH=build:$PATH
MONIKER=foobar
simd init $MONIKER --chain-id my-test-chain
simd keys add my_validator --keyring-backend test
# Put the generated address in a variable for later use.
MY_VALIDATOR_ADDRESS=$(simd keys show my_validator -a --keyring-backend test)
simd add-genesis-account $MY_VALIDATOR_ADDRESS 100000000stake
# Create a gentx.
simd gentx my_validator 100000stake --chain-id my-test-chain --keyring-backend test
# Add the gentx to the genesis file.
simd collect-gentxs
simd start
error.log:
5:09PM INF starting ABCI with Tendermint
5:09PM INF Starting multiAppConn service impl={"Logger":{}} module=proxy
5:09PM INF Starting localClient service connection=query impl="marshaling error: json: unsupported type: abcicli.Callback" module=abci-client
5:09PM INF Starting localClient service connection=snapshot impl="marshaling error: json: unsupported type: abcicli.Callback" module=abci-client
5:09PM INF Starting localClient service connection=mempool impl="marshaling error: json: unsupported type: abcicli.Callback" module=abci-client
5:09PM INF Starting localClient service connection=consensus impl="marshaling error: json: unsupported type: abcicli.Callback" module=abci-client
5:09PM INF Starting EventBus service impl={"Logger":{}} module=events
5:09PM INF Starting PubSub service impl={"Logger":{}} module=pubsub
5:09PM INF Starting IndexerService service impl={"Logger":{}} module=txindex
5:09PM INF ABCI Handshake App Info hash= height=0 module=consensus protocol-version=0 software-version=
5:09PM INF ABCI Replay Blocks appHeight=0 module=consensus stateHeight=0 storeHeight=0
5:09PM INF asserting crisis invariants inv=0/11 module=x/crisis
5:09PM INF asserting crisis invariants inv=1/11 module=x/crisis
5:09PM INF asserting crisis invariants inv=2/11 module=x/crisis
5:09PM INF asserting crisis invariants inv=3/11 module=x/crisis
5:09PM INF asserting crisis invariants inv=4/11 module=x/crisis
5:09PM INF asserting crisis invariants inv=5/11 module=x/crisis
5:09PM INF asserting crisis invariants inv=6/11 module=x/crisis
5:09PM INF asserting crisis invariants inv=7/11 module=x/crisis
5:09PM INF asserting crisis invariants inv=8/11 module=x/crisis
5:09PM INF asserting crisis invariants inv=9/11 module=x/crisis
5:09PM INF asserting crisis invariants inv=10/11 module=x/crisis
5:09PM INF asserted all invariants duration=0.844065 height=0 module=x/crisis
5:09PM INF created new capability module=ibc name=ports/transfer
5:09PM INF port binded module=x/ibc/port port=transfer
5:09PM INF claimed capability capability=1 module=transfer name=ports/transfer
Error: error during handshake: error on replay: validator set is nil in genesis and still empty after InitChain
Usage:
simd start [flags]
Flags:
--abci string specify abci transport (socket | grpc) (default "socket")
... [other usage info]

I tried it myself and saw the same error, but was able to fix it by increasing the amount of stake in the simd gentx command to 100000000stake. It now works as follows:
#!/bin/bash
set -eu
PATH=build:$PATH
MONIKER=foobar
simd init $MONIKER --chain-id my-test-chain
simd keys add my_validator --keyring-backend test
# Put the generated address in a variable for later use.
MY_VALIDATOR_ADDRESS=$(simd keys show my_validator -a --keyring-backend test)
simd add-genesis-account $MY_VALIDATOR_ADDRESS 100000000stake
# Create a gentx.
simd gentx my_validator 100000000stake --chain-id my-test-chain --keyring-backend test
# Add the gentx to the genesis file.
simd collect-gentxs
# simd start
Where did you get the script.sh file from? I didn't see it in https://github.com/cosmos/cosmos-sdk/blob/master/docs/run-node/run-node.md

The answer from okwme is correct, but the initial balance (set with simd add-genesis-account) should also be increased if you want to play around more with that account. Later in the tutorial, that account sends some stake to another account; with the above script, however, the full balance of $MY_VALIDATOR_ADDRESS is delegated, so none is left to send.
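A minimal sketch of that adjustment (the amounts here are illustrative, not the tutorial's exact values): fund the genesis account with more than you delegate, so a spendable balance remains for the later send steps.
simd add-genesis-account $MY_VALIDATOR_ADDRESS 200000000stake
# Delegate only part of the balance in the gentx; the remainder stays liquid
# for the later send-tokens steps of the tutorial.
simd gentx my_validator 100000000stake --chain-id my-test-chain --keyring-backend test
simd collect-gentxs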

Related

etcd v2: etcd-server is healthy but etcd-events is not joining ("cluster ID mismatch" and "unmatched member while checking PeerURLs" errors)

I have a legacy Kubernetes cluster running etcd v2 with 3 masters (etcd-a, etcd-b, etcd-c). We attempted an upgrade to etcd v3, but this broke the first master (etcd-a) and it was no longer able to join the cluster. After some time I was able to restore it:
removed etcd-a from the etcd cluster with etcdctl member rm
added a new etcd-a1 with a clean state and joined it to the cluster with etcdctl member add
started kubelet with ETCD_INITIAL_CLUSTER_STATE set to existing, then started protokube. At this point the master was able to join the cluster.
At the beginning I thought the cluster was healthy:
/ # etcdctl member list
a4***b2: name=etcd-c peerURLs=http://etcd-c.internal.mydomain.com:2380 clientURLs=http://etcd-c.internal.mydomain.com:4001
cf***97: name=etcd-a1 peerURLs=http://etcd-a1.internal.mydomain.com:2380 clientURLs=http://etcd-a1.internal.mydomain.com:4001
d3***59: name=etcd-b peerURLs=http://etcd-b.internal.mydomain.com:2380 clientURLs=http://etcd-b.internal.mydomain.com:4001
/ # etcdctl cluster-health
member a4***b2 is healthy: got healthy result from http://etcd-c.internal.mydomain.com:4001
member cf***97 is healthy: got healthy result from http://etcd-a1.internal.mydomain.com:4001
member d3***59 is healthy: got healthy result from http://etcd-b.internal.mydomain.com:4001
cluster is healthy
Yet the status of etcd-events is not great; etcd-events for a1 is not running:
etcd-server-events-ip-a1 0/1 CrashLoopBackOff 430
etcd-server-events-ip-b 1/1 Running 3
etcd-server-events-ip-c 1/1 Running 0
Logs from etcd-events-a1:
flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://etcd-events-a1.internal.mydomain.com:4002
flags: recognized and used environment variable ETCD_DATA_DIR=/var/etcd/data-events
flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://etcd-events-a1.internal.mydomain.com:2381
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd-events-a1=http://etcd-events-a1.internal.mydomain.com:2381,etcd-events-b=http://etcd-events-b.internal.mydomain.com:2381,etcd-events-c=http://etcd-events-c.internal.mydomain.com:2381
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-token-etcd-events
flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:4002
flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2381
flags: recognized and used environment variable ETCD_NAME=etcd-events-a1
etcdmain: etcd Version: 2.2.1
etcdmain: Git SHA: 75f8282
etcdmain: Go Version: go1.5.1
etcdmain: Go OS/Arch: linux/amd64
etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
etcdmain: the server is already initialized as member before, starting as etcd member...
etcdmain: listening for peers on http://0.0.0.0:2381
etcdmain: listening for client requests on http://0.0.0.0:4002
netutil: resolving etcd-events-b.internal.mydomain.com:2381 to 10.15.***:2381
netutil: resolving etcd-events-a1.internal.mydomain.com:2381 to 10.15.***:2381
etcdmain: stopping listening for client requests on http://0.0.0.0:4002
etcdmain: stopping listening for peers on http://0.0.0.0:2381
etcdmain: error validating peerURLs {ClusterID:5a***b3 Members:[&{ID:a7***32 RaftAttributes:{PeerURLs:[http://etcd-events-b.internal.mydomain.com:2381]} Attributes:{Name:etcd-events-b ClientURLs:[http://etcd-events-b.internal.mydomain.com:4002]}} &{ID:cc***b3 RaftAttributes:{PeerURLs:[https://etcd-events-a.internal.mydomain.com:2381]} Attributes:{Name:etcd-events-a ClientURLs:[https://etcd-events-a.internal.mydomain.com:4002]}} &{ID:7f***2ca RaftAttributes:{PeerURLs:[http://etcd-events-c.internal.mydomain.com:2381]} Attributes:{Name:etcd-events-c ClientURLs:[http://etcd-events-c.internal.mydomain.com:4002]}}] RemovedMemberIDs:[]}: unmatched member while checking PeerURLs
# restart
...
etcdserver: restarting member eb***3a in cluster 96***07 at commit index 3
raft: eb***a3a became follower at term 12407
raft: newRaft eb***3a [peers: [], term: 12407, commit: 3, applied: 0, lastindex: 3, lastterm: 1]
etcdserver: starting server... [version: 2.2.1, cluster version: to_be_decided]
etcdserver: added local member eb***3a [http://etcd-events-a1.internal.mydomain.com:2381] to cluster 96***07
etcdserver: added member 7f***ca [http://etcd-events-c.internal.mydomain.com:2381] to cluster 96***07
rafthttp: request sent was ignored (cluster ID mismatch: remote[7f***ca]=5a***b3, local=96***07)
rafthttp: request sent was ignored (cluster ID mismatch: remote[7f***ca]=5a***3, local=96***07)
rafthttp: failed to dial 7f***ca on stream Message (cluster ID mismatch)
rafthttp: failed to dial 7f***ca on stream MsgApp v2 (cluster ID mismatch)
etcdserver: added member a7***32 [http://etcd-events-b.internal.mydomain.com:2381] to cluster 96***07
rafthttp: request sent was ignored (cluster ID mismatch: remote[a7***32]=5a***b3, local=96***07)
rafthttp: failed to dial a7***32 on stream MsgApp v2 (cluster ID mismatch)
...
rafthttp: request sent was ignored (cluster ID mismatch: remote[a7***32]=5a***b3, local=96***07)
osutil: received terminated signal, shutting down...
etcdserver: aborting publish because server is stopped
Logs from etcd-events-b:
rafthttp: streaming request ignored (cluster ID mismatch got 96***07 want 5a***b3)
rafthttp: the connection to peer cc***b3 is unhealthy
Logs from etcd-events-c:
etcdserver: failed to reach the peerURL(https://etcd-events-a.internal.mydomain.com:2381) of member cc***b3 (Get https://etcd-events-a.internal.mydomain.com:2381/version: dial tcp 10.15.131.7:2381: i/o timeout)
etcdserver: cannot get the version of member cc***b3 (Get https://etcd-events-a.internal.mydomain.com:2381/version: dial tcp 10.15.131.7:2381: i/o timeout)
From the logs I see two problems:
etcd-events on a1 seems to ignore the existing cluster (hence the cluster IDs don't match).
the other nodes (b and c) still somehow remember the removed old node a.
I'm short of ideas on how to fix this. Any suggestions?
Thanks!
If you tried to upgrade etcd2 and did not restart all the masters at the same time, you will fail the upgrade.
Make sure you read through https://kops.sigs.k8s.io/etcd3-migration/
I also strongly recommend using the latest possible version of kOps as there are quite a few migration bugs fixed along the way.
There may be multiple reasons why the cluster ID changed, but if I remember correctly, replacing members like that was never really supported, and with etcd2 your options are limited. Trying to get to etcd-manager and etcd v3 may be the best way to get your cluster back into a working state.
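For completeness, a rough sketch of the same member-replacement procedure the poster used for the main cluster, applied to the events cluster (which listens on 2381/4002 per the logs). The endpoint and member ID below are illustrative, and as noted above this kind of member surgery was never really supported on etcd2, so treat it as a last resort before the etcd3/etcd-manager migration:
# On a healthy events member (b or c): remove the stale member that still
# advertises the old etcd-events-a peer URL (use the ID shown by member list).
etcdctl --peers http://etcd-events-b.internal.mydomain.com:4002 member remove <old-etcd-events-a-id>

# On a1: stop etcd-events, wipe its data dir so it forgets the cluster ID it
# bootstrapped on its own (ETCD_DATA_DIR=/var/etcd/data-events in the logs above),
# then register it as a fresh member.
rm -rf /var/etcd/data-events
etcdctl --peers http://etcd-events-b.internal.mydomain.com:4002 \
    member add etcd-events-a1 http://etcd-events-a1.internal.mydomain.com:2381

# Finally start etcd-events-a1 with ETCD_INITIAL_CLUSTER_STATE=existing, as in the logs.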

Global variable that count number of errors

I'm new to Ansible. I'm using playbooks/roles on OpenStack and I'm looking for a global variable (or similar) that counts the number of errors in the playbook, or at least records that an error occurred at some point.
Here is why (let's say this variable is called GLOB):
In the tasks file:
---
# tasks file
- name: testing block
  block:
    # List of tests to run in this test suite:
    - include: ../tests/DoThings.yml   # API calls, HTTP 50x errors possible: YES
  always:   # Final tasks, always executed
    - include: ../tests/Clear.yml      # API calls, HTTP 50x errors possible: YES
    - include: ../tests/Report.yml     # no API calls, HTTP 50x errors possible: NO, just a log checker
So if an error occurs in DoThings.yml, I want to clean everything up with Clear.yml and AFTER that run Report.yml, where I will check whether the global variable recorded at least one failure (one pattern for this is sketched after the list below). This matters because a failure can be:
An HTTP 50x error, which is impossible to predict and simply has to be retried (this is the case I need to detect and distinguish from other errors)
A normal error caused by a code mistake or similar, which is easy to fix (not important in this case)
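There is no built-in global error counter, but a common pattern is to catch failures with a rescue section and record them with set_fact, then read that fact back in Report.yml. A minimal sketch under the layout above; glob_failed is a name introduced here purely for illustration:
---
# tasks file (illustrative sketch)
- name: testing block
  block:
    - include: ../tests/DoThings.yml
  rescue:
    # Record that at least one task in the block failed.
    - set_fact:
        glob_failed: true
  always:
    - include: ../tests/Clear.yml
    - include: ../tests/Report.yml

# Inside Report.yml, react to the recorded failure, e.g.:
# - debug:
#     msg: "At least one failure occurred in DoThings.yml"
#   when: glob_failed | default(false)
Distinguishing an HTTP 50x from other failures would additionally require registering the results of the individual API tasks and inspecting their status codes inside the rescue section.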

Unknown processors type "resourcedetection" for "resourcedetection"

Running the OpenTelemetry Collector with image ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector:0.58.0.
In config.yaml I have:
processors:
  batch:
  resourcedetection:
    detectors: [ env ]
    timeout: 2s
    override: false
The collector is deployed as a sidecar but it keeps failing with
collector server run finished with error: failed to get config: cannot unmarshal the configuration: unknown processors type "resourcedetection" for "resourcedetection" (valid values: [resource span probabilistic_sampler filter batch memory_limiter attributes])
Any idea what is causing this? I haven't found any relevant documentation or existing question.
The Resource Detection Processor is part of the otelcol-contrib distribution upstream, so you would need to use otel/opentelemetry-collector-contrib:0.58.0 (or the equivalent on your container registry of choice) for this processor to be available in your collector.
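As a rough illustration, with the contrib image the same processors block can then be wired into a pipeline. The service section below is an assumption about the rest of the config (the otlp receiver and logging exporter are placeholders), not something taken from the question:
processors:
  batch:
  resourcedetection:
    detectors: [ env ]
    timeout: 2s
    override: false

service:
  pipelines:
    traces:
      receivers: [ otlp ]                       # assumes an otlp receiver is defined elsewhere
      processors: [ resourcedetection, batch ]
      exporters: [ logging ]                    # assumes a logging exporter is defined elsewhere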

BYFN.sh failed with error when setting up MSP of type bccsp

I am trying the first-network demo on OS X and am getting the following error. I have tried searching for an answer; I did find one here, but it appears to be for Ubuntu and none of those commands worked on OS X.
Can anyone suggest a solution for OS X? Thanks!
2018-11-02 03:13:45.696 UTC [main] main -> ERRO 001 Cannot run peer
because error when setting up MSP of type bccsp from directory
/opt/gopath/src/github.com/hyperledger/fabric/peer/crypto/peerOrganizations/org1.example.com/users/Admin@org1.example.com/msp:
could not initialize BCCSP Factories: Failed initializing PKCS11.BCCSP
%!s(): Could not initialize BCCSP PKCS11 [Failed to initialize
software key store: An invalid KeyStore path provided. Path cannot be
an empty string.] !!!!!!!!!!!!!!! Channel creation failed
!!!!!!!!!!!!!!!!
========= ERROR !!! FAILED to execute End-2-End Scenario ==========
The error message is not very specific.
In my case, I was trying to connect to a peer from my fabric-tools container, and there was a TLS configuration mismatch (fabric-tools had TLS enabled while fabric-peer had TLS disabled). Aligning the TLS configuration made the error go away.
Might help someone out...
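A minimal sketch of what aligning the TLS configuration can look like from the tools/CLI container. CORE_PEER_TLS_ENABLED is the standard peer CLI setting; the channel name, orderer address, and ORDERER_CA variable are illustrative assumptions, not values from the question:
# If the peer and orderer run with TLS enabled, enable it in the CLI too and
# pass the orderer's TLS CA cert on commands that talk to the orderer:
export CORE_PEER_TLS_ENABLED=true
peer channel create -o orderer.example.com:7050 -c mychannel \
    -f ./channel-artifacts/channel.tx --tls --cafile "$ORDERER_CA"

# If the peer runs without TLS, disable it in the CLI instead:
export CORE_PEER_TLS_ENABLED=false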

H2O cluster startup frequently timing out

Trying to start an h2o cluster on (MapR) hadoop via python
# startup hadoop h2o cluster
import os
import subprocess
import shlex
import re
import sys  # needed for sys.exit() below
from Queue import Queue, Empty
from threading import Thread

import h2o


def enqueue_output(out, queue):
    """
    Function for communicating streaming text lines from a separate thread.
    See https://stackoverflow.com/questions/375427/non-blocking-read-on-a-subprocess-pipe-in-python
    """
    for line in iter(out.readline, b''):
        queue.put(line)
    out.close()

# clear legacy temp. dir.
hdfs_legacy_dir = '/mapr/clustername/user/mapr/hdfsOutputDir'
if os.path.isdir(hdfs_legacy_dir):
    print subprocess.check_output(shlex.split('rm -r %s' % hdfs_legacy_dir))

# start h2o service in background thread
local_h2o_start_path = '/home/mapr/h2o-3.18.0.2-mapr5.2/'
startup_p = subprocess.Popen(
    shlex.split('/bin/hadoop jar {}h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir'.format(local_h2o_start_path)),
    shell=False,
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# setup message passing queue
q = Queue()
t = Thread(target=enqueue_output, args=(startup_p.stdout, q))
t.daemon = True  # thread dies with the program
t.start()

# read lines without blocking
h2o_url_out = ''
while True:
    try:
        line = q.get_nowait()  # or q.get(timeout=.1)
    except Empty:
        continue
    else:  # got line
        print line
        # check for first instance of the connection url output
        if re.search('Open H2O Flow in your web browser', line) is not None:
            h2o_url_out = line
            break
        if re.search('Error', line) is not None:
            print 'Error generated: %s' % line
            sys.exit()

print 'Connection url output line: %s' % h2o_url_out
h2o_cnxn_ip = re.search('(?<=Open H2O Flow in your web browser: http:\/\/)(.*?)(?=:)', h2o_url_out).group(1)
print 'H2O connection ip: %s' % h2o_cnxn_ip
frequently throws a timeout error
Waiting for H2O cluster to come up...
H2O node 172.18.4.66:54321 requested flatfile
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Error generated: ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Shutting down h2o cluster
Looking at the docs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/general-troubleshooting.html) (and just doing a word search for "timeout"), I was unable to find anything that helped (e.g. extending the timeout via hadoop jar h2odriver.jar -timeout <some time> did nothing but extend the time until the timeout error popped up).
I have noticed that this often happens when another H2O cluster instance is already up and running (which I don't understand, since I would think YARN could support multiple instances), yet it also sometimes happens when no other cluster has been initialized.
Does anyone know anything else I could try to solve this problem, or a way to get more debugging info beyond the error message thrown by H2O?
UPDATE:
Trying to recreate the problem from the command line, I get:
[me#mnode01 project]$ /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 172.18.4.62]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.62:29388
(You can override these with -driverif and -driverport/-driverportrange.)
Memory Settings:
mapreduce.map.java.opts: -Xms6g -Xmx6g -XX:PermSize=256m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10
mapreduce.map.memory.mb: 6758
18/08/15 09:18:46 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: number of splits:4
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523404089784_7404
18/08/15 09:18:48 INFO security.ExternalTokenManagerFactory: Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager
18/08/15 09:18:48 INFO impl.YarnClientImpl: Submitted application application_1523404089784_7404
18/08/15 09:18:48 INFO mapreduce.Job: The url to track the job: https://mnode03.cluster.local:8090/proxy/application_1523404089784_7404/
Job name 'H2O_66888' submitted
JobTracker job ID is 'job_1523404089784_7404'
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
Waiting for H2O cluster to come up...
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
H2O node 172.18.4.66:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
Killed.
18/08/15 09:23:54 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
----- YARN cluster metrics -----
Number of YARN worker nodes: 6
----- Nodes -----
Node: http://mnode03.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 7.0 GB used, 0 / 2 vcores used
Node: http://mnode05.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode06.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode01.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 5.0 GB used, 0 / 2 vcores used
Node: http://mnode04.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 7.0 / 10.4 GB used, 1 / 2 vcores used
Node: http://mnode02.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 2.0 / 8.7 GB used, 1 / 2 vcores used
----- Queues -----
Queue name: root.default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 0.00
Maximum capacity: -1.00
Application count: 0
Queue 'root.default' approximate utilization: 0.0 / 0.0 GB used, 0 / 0 vcores used
----------------------------------------------------------------------
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
and noticing the later output
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations
I am confused by the reported 0.0 GB of memory and 0 vcores, because there are no other applications running on the cluster, and the cluster details in the YARN RM web UI show available capacity on each node (shared as an image, since I could not find a unified place in the log files for this info; why the memory availability is so uneven despite there being no other running applications, I do not know). At this point I should mention that I don't have much experience tinkering with or examining YARN configs, so it's difficult for me to find the relevant information.
Could it be that I am starting the H2O cluster with -mapperXmx 6g, but (as shown in the image) one of the nodes only has 5 GB of memory available, so if that node is selected to contribute to the H2O application it does not have enough memory to support the requested mapper memory? Changing the startup command to /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 5g -timeout 300 -output hdfsOutputDir and starting/stopping multiple times without error seems to support this theory (though I need to check further to determine if I'm interpreting things correctly).
This is most likely because your Hadoop cluster is busy and there just isn't room to start new YARN containers.
If you ask for N nodes, then you either get all N nodes or the launch process times out, as you are seeing. You can optionally use the -timeout command line flag to increase the timeout.
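Building on the questioner's own update, here is a sketch of the workaround under the assumption that the root.default queue simply cannot fit four ~6.6 GB mappers at that moment: shrink the per-mapper memory (or the node count) so the request fits the free capacity, keeping -timeout as a fallback. Paths and sizes are illustrative:
# Smaller per-mapper heap (fits the 5 GB node visible in the RM UI).
/bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar \
    -nodes 4 -mapperXmx 5g -timeout 600 -output hdfsOutputDir

# Or request fewer nodes if the queue can only place 3 containers right now.
/bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar \
    -nodes 3 -mapperXmx 6g -timeout 600 -output hdfsOutputDir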
