Trino server is accepting connections, but is still initializing. How to wait until Trino is ready to receive queries?

For some background, we have included a Trino server as part of our CI setup, and tests currently fail while the server is still adding all of the catalogs. I have set up our CI to retry the curl command below, but it does not wait until the server is fully started.
docker run appropriate/curl --retry 60 --retry-delay 1 --retry-connrefused http://trino:8080/
Trino responds before it is fully initialized, so the tests start failing with the Trino server error: Trino server is still initializing.

To check whether a Trino server is still initializing, you can query information about the cluster by connecting to the coordinator. The following command returns information about the cluster.
docker run appropriate/curl --retry 60 --retry-delay 1 --retry-connrefused http://trino:8080/v1/info
Depending on the state of the server, it will return something like the following JSON.
{
  "nodeVersion": {
    "version": "358"
  },
  "environment": "test",
  "coordinator": true,
  "starting": false,
  "uptime": "7.56m"
}
You can use the $.starting JSON field to determine whether the server is up and ready for queries. A simple Ruby script that tries 60 times before giving up is provided below.
cat <<"RUBY" | ruby
60.times do
  exit if `docker run appropriate/curl http://trino:8080/v1/info | jq .starting`.strip == 'false'
  sleep 1
end
exit!
RUBY
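If you would rather not depend on Ruby, a plain shell loop does the same job. This is a minimal sketch that assumes curl (via the same appropriate/curl image) and jq are available in the CI environment; it exits 0 as soon as the coordinator reports starting: false, and fails after 60 attempts:
for i in $(seq 1 60); do
  if [ "$(docker run appropriate/curl -s http://trino:8080/v1/info | jq -r .starting)" = "false" ]; then
    exit 0
  fi
  sleep 1
done
exit 1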

Related

puppet-mongodb module doesn't show Mongo shell errors in puppet-agent run

We use the voxpupuli/puppet-mongodb module to create and manage Mongo databases and users.
As usual, we add the necessary data to manifests and Hiera, and after that the module forms a command which it runs on the client side during a puppet-agent run.
We can see the executed command in the puppet-agent debug logs; for instance, this command creates a new DB with the name unixtest_db:
Executing: '/usr/bin/mongo unixtest_db2 --quiet --host 127.0.0.1:27017 --eval load('/root/.mongorc.js'); db.dummyData.insert({"created_by_puppet": 1})'
The problem is that the module never reports any errors that occur during command execution.
For example, consider the following Hiera code:
tele2_mongodb::mongodb_db:
  'unixtest_db': # DB name
    user: unixtest
    password: >
      ENC[PKCS7,....]
    auth_mechanism: scram_sha_1
    roles:
      - dbOwner
As a result, we get the following command execution and results:
Executing: '/usr/bin/mongo unixtest_db2 --quiet --host 127.0.0.1:27017 --eval load('/root/.mongorc.js'); db.dummyData.insert({"created_by_puppet": 1})'
Notice: /Stage[main]/Tele2_mongodb/Mongodb::Db[unixtest_db]/Mongodb_database[unixtest_db]/ensure: created
Now we can connect to the DB with the user credentials mentioned in Hiera:
# mongo -u unixtest -p password unixtest_db
MongoDB shell version v5.0.9 connecting to: mongodb://127.0.0.1:27017/unixtest_db?
...
unixtesttst:PRIMARY>
Next we change the DB name in Hiera to all capital letters:
tele2_mongodb::mongodb_db:
  'UNIXTEST_DB': # DB name
In the puppet-agent output we see the same thing as before:
Executing: '/usr/bin/mongo UNIXTEST_DB --quiet --host 127.0.0.1:27017 --eval load('/root/.mongorc.js'); db.dummyData.insert({"created_by_puppet": 1})'
Notice: /Stage[main]/Tele2_mongodb/Mongodb::Db[UNIXTEST_DB]/Mongodb_database[UNIXTEST_DB]/ensure: created
But the database is not created; it does not appear in the db.adminCommand('listDatabases') output (the first one, in all lowercase, still exists).
And if we run the command manually in the OS console, we see the error message:
# /usr/bin/mongo UNIXTEST_DB --host 127.0.0.1:27017 --eval " load('/root/.mongorc.js'); db.dummyData.insert({"created_by_puppet": 1})"
MongoDB shell version v5.0.9
...
WriteResult({
  "nInserted" : 0,
  "writeError" : {
    "code" : 13297,
    "errmsg" : "db already exists with different case already have: [unixtest_db] trying to create [UNIXTEST_DB]"
  }
})
So my question is: how can I make Puppet report errors when command execution fails, instead of a false notification that everything went smoothly?
I have the same problem with DB changes, users, and passwords: if something goes wrong, we never learn it from the Puppet output.
Any ideas how to fix it?
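For what it's worth, the most likely reason Puppet reports success is that the legacy mongo shell exits with status 0 even when the evaluated script produces a writeError (the insert returns a WriteResult rather than throwing), so the module presumably only ever sees a successful exit code. A quick diagnostic sketch to confirm that on the node, reusing the command from above:
/usr/bin/mongo UNIXTEST_DB --quiet --host 127.0.0.1:27017 --eval 'db.dummyData.insert({"created_by_puppet": 1})'
echo $?    # prints 0 even though the insert failed with error code 13297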

telegraf output to Elasticsearch: "health check timeout: no Elasticsearch node available"

I'm having trouble connecting to an Elasticsearch instance with a Telegraf output plugin.
I created an Elasticsearch deployment via the hosted Elasticsearch Service, and created a user and password (connected to a role) for it in Kibana.
Then I set up a Telegraf output for it:
[[outputs.elasticsearch]]
urls = [ "https://hostname:port" ] # required.
timeout = "5s"
enable_sniffer = false
health_check_interval = "10s"
## HTTP basic authentication details.
username = "my_username"
password = "my_password"
index_name = "device_logs" # required.
insecure_skip_verify = true
manage_template = true
template_name = "telegraf"
overwrite_template = false
But when I try to start Telegraf with this, it just gives the error,
[agent] Failed to connect to [outputs.elasticsearch], retrying in 15s, error was 'health check timeout: no Elasticsearch node available'
The connection failure seems to originate deep in the bowels of Go's net/http library, and I don't know how to get more useful output at this point.
Things I've tried:
Thing #1: I tested cURL:
curl -u my_username:my_password -X POST "https://hostname:port/device_logs/_doc" -H 'Content-Type: application/json' -d'
{
"name": "John Doe"
}'
This works fine.
Thing #2: I created a simple Go program to connect to Elasticsearch from Go:
package main

import (
    "log"

    "gopkg.in/olivere/elastic.v3"
)

func main() {
    // configure connection to ES
    client, err := elastic.NewClient(elastic.SetURL("https://hostname:port"))
    if err != nil {
        panic(err)
    }
    log.Printf("client.running? %v", client.IsRunning())
    if !client.IsRunning() {
        panic("Could not make connection, not running")
    }
}
...and it hits the first panic with the same "no Elasticsearch node available" error.
Thing #3: I tried running gdb on that Go program to debug into it.
It jumps down to assembly as soon as I call NewClient, so I can't really learn what is happening in the bowels of net/http.
I've never used Go before, so I'm hoping to avoid hours of learning Go, spelunking, and debugging to get around what hopefully is a simple issue here.
Any ideas on how to get more info here or why this is failing? Are there build or runtime flags for Go that I can use? gdb-with-Go debugging tips so I can step down into the Go library code? Elasticsearch client know-how?
To answer my own question: the problem here turned out to be the role's permissions. The Telegraf output plugin for Elasticsearch needs both the monitor and the manage_index_templates permissions to be enabled, or else it will fail to connect to the Elasticsearch server without printing any information about why.
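For reference, a role carrying those two cluster privileges (plus write access to the target index) can be created through the Elasticsearch security API roughly like the sketch below; the admin credentials, role name, and index patterns here are placeholders, not values from my setup:
curl -u admin_user:admin_password -X PUT "https://hostname:port/_security/role/telegraf_writer" \
  -H 'Content-Type: application/json' -d'
{
  "cluster": ["monitor", "manage_index_templates"],
  "indices": [
    { "names": ["device_logs*", "telegraf*"], "privileges": ["create_index", "index"] }
  ]
}'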
BTW: to build Go code so that you can debug into the libraries it calls:
go build -gcflags=all="-N -l"
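If gdb keeps dropping you into assembly, Delve is usually the friendlier debugger for stepping through Go code; a rough sketch, assuming a reasonably recent Go toolchain:
go install github.com/go-delve/delve/cmd/dlv@latest
dlv debug .    # then, at the (dlv) prompt: break main.main, continue, next, step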

Putting to local DynamoDB table with Python boto3 times out

I am attempting to programmatically put data into a locally running DynamoDB container by triggering a Python Lambda function.
I'm trying to follow the template provided here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStarted.Python.03.html
I am using the amazon/dynamodb-local image, which you can download here: https://hub.docker.com/r/amazon/dynamodb-local
Using Ubuntu 18.04.2 LTS to run the container and the Lambda server
AWS SAM CLI to run my Lambda API
Docker version 18.09.4
Python 3.6 (you can see this in the SAM logs below)
The startup command for the Python Lambda is just "sam local start-api"
First, my Lambda code:
import json
import boto3

def lambda_handler(event, context):
    print("before grabbing dynamodb")
    # dynamodb = boto3.resource('dynamodb', endpoint_url="http://localhost:8000", region_name='us-west-2', aws_access_key_id='RANDOM', aws_secret_access_key='RANDOM')
    dynamodb = boto3.resource('dynamodb', endpoint_url="http://localhost:8000")
    table = dynamodb.Table('ContactRequests')
    try:
        response = table.put_item(
            Item={
                'id': "1234",
                'name': "test user",
                'email': "testEmail@gmail.com"
            }
        )
        print("response: " + str(response))
    except Exception as e:
        # surface any DynamoDB errors in the logs instead of failing silently
        print("put_item failed: " + str(e))
        raise
    return {
        "statusCode": 200,
        "body": json.dumps({
            "message": "hello world"
        }),
    }
I know that I should have this ContactRequests table available at localhost:8000, because I can run the command below to list the DynamoDB tables in my Docker container.
I have tested this with a variety of values in the boto3.resource call, including access keys, region names, and secret keys, with no improvement in the result:
dev#ubuntu:~/Projects$ aws dynamodb list-tables --endpoint-url http://localhost:8000
{
  "TableNames": [
    "ContactRequests"
  ]
}
I am also able to successfully hit the localhost:8000/shell endpoint that DynamoDB Local offers.
Unfortunately, when I hit the endpoint that triggers this handler, I get a timeout that logs like so:
Fetching lambci/lambda:python3.6 Docker container image......
2019-04-09 15:52:08 Mounting /home/dev/Projects/sam-app/.aws-sam/build/HelloWorldFunction as /var/task:ro inside runtime container
2019-04-09 15:52:12 Function 'HelloWorldFunction' timed out after 3 seconds
2019-04-09 15:52:13 Function returned an invalid response (must include one of: body, headers or statusCode in the response object). Response received:
2019-04-09 15:52:13 127.0.0.1 - - [09/Apr/2019 15:52:13] "GET /hello HTTP/1.1" 502 -
Notice that none of my print statements are reached; if I remove the call to table.put_item, the print statements run successfully.
I've seen similar questions on Stack Overflow, such as "lambda python dynamodb write gets timeout error", stating that the problem is that I am using a local DB. But shouldn't I still be able to write to a local DB with boto3 if I point it at my locally running DynamoDB instance?
Your Docker container running the Lambda function can't reach DynamoDB at 127.0.0.1, because inside that container 127.0.0.1 refers to the Lambda container itself. Instead, use the name of your DynamoDB Local Docker container as the hostname in the endpoint:
dynamodb = boto3.resource('dynamodb', endpoint_url="http://<DynamoDB_LOCAL_NAME>:8000")
You can use docker ps to find the <DynamoDB_LOCAL_NAME> or give it a name:
docker run --name dynamodb amazon/dynamodb-local
and then connect:
dynamodb = boto3.resource('dynamodb', endpoint_url="http://dynamodb:8000")
Found the solution to the problem here: connecting AWS SAM Local with dynamodb in docker
The question's asker noted that they had seen online that they might need to connect both containers to the same Docker network, created with:
docker network create lambda-local
So I created this network, then updated my SAM command and my Docker commands to use this network, like so:
docker run --name dynamodb -p 8000:8000 --network lambda-local amazon/dynamodb-local
sam local start-api --docker-network lambda-local
After that, I no longer experienced the timeout issue.
I'm still working on understanding exactly why this was the issue.
To be fair, though, it was also important that I use the dynamodb container name as the host in my boto3 resource call.
So in the end, it was a combination of the solution above and the answer provided by "Reto Aebersold" that created the final solution:
dynamodb = boto3.resource('dynamodb', endpoint_url="http://<DynamoDB_LOCAL_NAME>:8000")
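Putting the two answers together, the working combination looked roughly like this (a sketch using the container and network names from above):
docker network create lambda-local
docker run --name dynamodb -p 8000:8000 --network lambda-local amazon/dynamodb-local
sam local start-api --docker-network lambda-local
# and inside the Lambda handler:
# dynamodb = boto3.resource('dynamodb', endpoint_url="http://dynamodb:8000")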

Recovering from Consul "No Cluster leader" state

I have:
one mesos-master, on which I configured a Consul server;
one mesos-slave, on which I configured a Consul client; and
one bootstrap server for Consul.
When I start them up, I see the following errors:
2016/04/21 19:31:31 [ERR] agent: failed to sync remote state: rpc error: No cluster leader
2016/04/21 19:31:44 [ERR] agent: coordinate update error: rpc error: No cluster leader
How do I recover from this state?
Did you look at the Consul docs?
It looks like you have performed an ungraceful stop and now need to clean your raft/peers.json file by removing all entries there to perform an outage recovery. See the link above for more details.
As of Consul 0.7, things work differently from Keyan P's answer. raft/peers.json (in the Consul data dir) has become a manual recovery mechanism. It doesn't exist unless you create it, and when Consul starts it loads the file and deletes it from the filesystem so that it won't be read on future starts. There are instructions in raft/peers.info. Note that if you delete raft/peers.info, Consul won't read raft/peers.json, but it will still delete it, and it will recreate raft/peers.info. The log will indicate separately when it reads and when it deletes the file.
Assuming you've already tried the bootstrap or bootstrap_expect settings, that file might help. The Outage Recovery guide in Keyan P's answer is a helpful link. You create raft/peers.json in the data dir and start Consul, and the log should indicate that it's reading/deleting the file and then it should say something like "cluster leadership acquired". The file contents are:
[ { "id": "<node-id>", "address": "<node-ip>:8300", "non_voter": false } ]
where <node-id> can be found in the node-id file in the data dir.
If your Raft protocol version is greater than 2, the format is:
[
  {
    "id": "e3a30829-9849-bad7-32bc-11be85a49200",
    "address": "10.88.0.59:8300",
    "non_voter": false
  },
  {
    "id": "326d7d5c-1c78-7d38-a306-e65988d5e9a3",
    "address": "10.88.0.45:8300",
    "non_voter": false
  },
  {
    "id": "a8d60750-4b33-99d7-1185-b3c6d7458d4f",
    "address": "10.233.103.119",
    "non_voter": false
  }
]
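In practice, a single-node recovery with this file looks roughly like the following sketch (paths assume the /opt/consul/data data dir mentioned elsewhere in this thread; adjust for your -data-dir):
sudo service consul stop
cat /opt/consul/data/node-id    # note the node's id for peers.json
sudo tee /opt/consul/data/raft/peers.json <<'EOF'
[ { "id": "<node-id>", "address": "<node-ip>:8300", "non_voter": false } ]
EOF
sudo service consul start
consul operator raft list-peers    # the node should now be listed as leader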
In my case, I had 2 worker nodes in the k8s cluster; after adding another node, the Consul servers could elect a leader and everything came up and running.
I will describe what I did:
A little background: we scaled down the AWS Auto Scaling group and so lost the leader, but we still had one server running, just without any leader.
What I did was:
Scaled up to 3 servers (not 2 or 4; keep an odd number for quorum).
Stopped Consul on all 3 servers: sudo service consul stop (you can do status/stop/start).
Created the peers.json file and put it on the old server (/opt/consul/data/raft).
Started the 3 servers (peers.json should be placed on 1 server only).
For the other 2 servers, joined them to the leader using consul join 10.201.8.XXX.
Checked that the peers are connected to the leader using consul operator raft list-peers.
Sample peers.json file:
[
  {
    "id": "306efa34-1c9c-acff-1226-538vvvvvv",
    "address": "10.201.n.vvv:8300",
    "non_voter": false
  },
  {
    "id": "dbeeffce-c93e-8678-de97-b7",
    "address": "10.201.X.XXX:8300",
    "non_voter": false
  },
  {
    "id": "62d77513-e016-946b-e9bf-0149",
    "address": "10.201.X.XXX:8300",
    "non_voter": false
  }
]
These IDs can be found on each server in /opt/consul/data/:
[root@ip-10-20 data]# ls
checkpoint-signature  node-id  raft  serf
[root@ip-10-1 data]# cat node-id
Some useful commands:
consul members
curl http://ip:8500/v1/status/peers
curl http://ip:8500/v1/status/leader
consul operator raft list-peers
cd opt/consul/data/raft/
consul info
sudo service consul status
consul catalog services
You may also ensure that the bootstrap parameter is set in your Consul configuration file, config.json, on the first node:
# /etc/consul/config.json
{
  "bootstrap": true,
  ...
}
or start the Consul agent with the -bootstrap=1 option, as described in the official "Failure of a Single Server Cluster" Consul documentation.
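A minimal sketch of the equivalent command-line form, reusing the paths already mentioned in this thread:
consul agent -server -bootstrap -data-dir=/opt/consul/data -config-dir=/etc/consul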

Unresponsive socket after x time (puma - ruby)

I'm experiencing an unresponsive socket with my Puma setup after a random amount of time. Up to this point I don't have a clue what's causing the issue. I was hoping somebody here could help me with some answers or point me in the right direction. I have the following setup:
I'm using the official Docker ruby-2.2.3-slim image together with the latest Puma release, 2.15.3, and I've also installed Nginx as a reverse proxy. But I'm already sure Nginx isn't the problem here, because I've tried to verify whether the socket was working using this script, and the socket wasn't working; I got a timeout there as well, so I can rule out Nginx.
This is a testing environment, so the server isn't experiencing any extreme load. I've also checked memory consumption; it still has several GB free, so that couldn't be the issue either.
What triggered me to look at the Puma socket was the error message I got in my Nginx error log:
upstream timed out (110: Connection timed out) while reading response header from upstream
Also, I couldn't find anything in the Puma logs indicating what is going wrong. Here is my Puma configuration:
threads 0, 16
app_dir = ENV.fetch('APP_HOME')
environment ENV['RAILS_ENV']
daemonize
bind "unix://#{app_dir}/sockets/puma.sock"
stdout_redirect "#{app_dir}/log/puma.stdout.log", "#{app_dir}/log/puma.stderr.log", true
pidfile "#{app_dir}/pids/puma.pid"
state_path "#{app_dir}/pids/puma.state"
activate_control_app
on_worker_boot do
  require 'active_record'
  ActiveRecord::Base.connection.disconnect! rescue ActiveRecord::ConnectionNotEstablished
  ActiveRecord::Base.establish_connection(YAML.load_file("#{app_dir}/config/database.yml")[ENV['RAILS_ENV']])
end
And this is the output in my Puma state file:
---
pid: 43
config: !ruby/object:Puma::Configuration
  cli_options:
  conf:
  options:
    :min_threads: 0
    :max_threads: 16
    :quiet: false
    :debug: false
    :binds:
    - unix:///APP/sockets/puma.sock
    :workers: 1
    :daemon: true
    :mode: :http
    :before_fork: []
    :worker_timeout: 60
    :worker_boot_timeout: 60
    :worker_shutdown_timeout: 30
    :environment: staging
    :redirect_stdout: "/APP/log/puma.stdout.log"
    :redirect_stderr: "/APP/log/puma.stderr.log"
    :redirect_append: true
    :pidfile: "/APP/pids/puma.pid"
    :state: "/APP/pids/puma.state"
    :control_url: unix:///tmp/puma-status-1449260516541-37
    :config_file: config/puma.rb
    :control_url_temp: "/tmp/puma-status-1449260516541-37"
    :control_auth_token: cda8879717be7a645ea323d931b88d4b
    :tag: APP
The application itself is a Rails app on the latest version, 4.2.5; it's deployed on GCE (Google Container Engine).
If somebody could give me some pointers on how to debug this further, it would be very much appreciated, because right now I don't see any output anywhere that could help me.
EDIT
I replaced the Unix socket with a TCP connection to Puma, with the same result: it still hangs after some amount of time.
I'd start with:
How many requests get processed successfully per instance of puma?
Make sure you log the beginning and end of each request, along with the ID of the thread executing it. What do you see?
Not knowing more about your application, I'd say it's likely the threads get stuck doing some long/blocking calls without timeouts or spinning on some computation until the whole thread pool gets depleted.
We'll see.
I finally found out why my application was behaving the way it was.
After trying a TCP connection and switching to Unicorn, I started looking into other possible sources.
That's when I thought maybe my connection to Google Cloud SQL could be the problem. When I read the Cloud SQL FAQ, it mentioned that you have to tweak your Compute Engine instances to ensure they keep your DB connection open. So I performed the steps they recommend, and that solved the problem for me. I've added them here just in case:
# Display the current tcp_keepalive_time value.
$ cat /proc/sys/net/ipv4/tcp_keepalive_time
# Set tcp_keepalive_time to 60 seconds and make it permanent across reboots.
$ echo 'net.ipv4.tcp_keepalive_time = 60' | sudo tee -a /etc/sysctl.conf
# Apply the change.
$ sudo /sbin/sysctl --load=/etc/sysctl.conf
# Display the tcp_keepalive_time value to verify the change was applied.
$ cat /proc/sys/net/ipv4/tcp_keepalive_time
