We are using a JDBC URL like "jdbc:vertica://80.90..:***/". How can I set a second Vertica host for a separate cluster in this URL? Both clusters have the same table, username and password. The only difference is the host IP.
I have tried to set the URL as shown below but it doesn't work.
jdbc:vertica://00.00.00.2:1111,00.00.00.1:1111/vertica
url = "jdbc:vertica://****:***/"
url1 = "jdbc:vertica://***:****/"
properties = {
    "user": "****",
    "password": "*****",
    "driver": "com.vertica.jdbc.Driver"
}
df = spark.read.format("jdbc").options(
    url=url and url1,
    query="SELECT COUNT(*) from traffic.stats where date(time_stamp) between '2019-03-16 ' and '2019-03-17' ",
    **properties
).load().show()
Note: PySpark 2.4, Vertica JDBC jar 9.1.1.
One way to do this is to specify a backup host.
url = "jdbc:vertica://00.00.00.2:1111/vertica"
properties = {
    "user": "****",
    "password": "*****",
    "driver": "com.vertica.jdbc.Driver",
    "ConnectionLoadBalance": 1,
    "BackupServerNode": "00.00.00.1:1111"
}
This will try the host specified in the URL (00.00.00.2:1111). If that host is unavailable it will try the BackupServerNode. You can specify multiple backup server nodes separated by commas.
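For completeness, the read itself then looks the same as in the question, just pointing at a single URL (a minimal sketch that reuses the properties above and the query from the question):

df = spark.read.format("jdbc").options(
    url="jdbc:vertica://00.00.00.2:1111/vertica",
    query="SELECT COUNT(*) FROM traffic.stats WHERE date(time_stamp) BETWEEN '2019-03-16' AND '2019-03-17'",
    **properties  # user, password, driver, ConnectionLoadBalance, BackupServerNode
).load()
df.show()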
Note that the above approach only switches to the backup node when the original host is unavailable.
Another option, if you want a host selected at random, is to do that logic in Python itself:
import random

host_list = ["00.00.00.2:1111", "00.00.00.1:1111"]
host = random.choice(host_list)  # pick one of the two cluster hosts at random
url = "jdbc:vertica://{0}/vertica".format(host)
Note: The connection property BackupServerNode is named such because it is usually used to specify an alternate node within the same database cluster, but if—like yourself—you have two databases with the same username, password, etc., it will also work for connecting to a separate database cluster host.
I am using the Glue console, not a dev endpoint. The Glue job is able to access the Glue catalog and table using the code below:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database =
"glue-db", table_name = "countries")
print "Table Schema:", datasource0.schema()
print "datasource0", datasource0.show()
Now I want to get the metadata for all tables in the Glue database glue-db.
I could not find a suitable function in the awsglue.context API, so I am using boto3:
import boto3

client = boto3.client('glue', 'eu-central-1')
responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']
for databaseDict in databaseList:
    databaseName = databaseDict['Name']
    print("databaseName:{}".format(databaseName))
    responseGetTables = client.get_tables(DatabaseName=databaseName,
                                          MaxResults=123)
    print("responseGetTables:{}".format(responseGetTables))
    tableList = responseGetTables['TableList']
    for tableDict in tableList:
        tableName = tableDict['Name']
        print("-- tableName:{}".format(tableName))
The code runs fine in a Lambda function, but fails inside the Glue ETL job with the following error:
botocore.vendored.requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='glue.eu-central-1.amazonaws.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(, 'Connection to glue.eu-central-1.amazonaws.com timed out. (connect timeout=60)'))
The problem seems to be in the environment configuration. The Glue VPC has two subnets:
private subnet: has an S3 endpoint for Glue and allows inbound traffic from the RDS security group.
public subnet: in the Glue VPC, with a NAT gateway; the private subnet routes out through this NAT gateway.
I am not sure what I am missing here.
Try using a proxy while creating the boto3 client:
import boto3
from botocore.config import Config
from pyhocon import ConfigFactory

service_name = 'glue'
region = 'eu-central-1'  # region used in the question

default = ConfigFactory.parse_file('glue-default.conf')
override = ConfigFactory.parse_file('glue-override.conf')
host = override.get('proxy.host', default.get('proxy.host'))
port = override.get('proxy.port', default.get('proxy.port'))

config = Config()
if host and port:
    config.proxies = {'https': '{}:{}'.format(host, port)}

client = boto3.Session(region_name=region).client(service_name=service_name, config=config)
glue-default.conf and glue-override.conf are deployed to the cluster by Glue during spark-submit, into the /tmp directory.
I had a similar issue and solved it the same way, using the public library from Glue:
s3://aws-glue-assets-eu-central-1/scripts/lib/utils.py
Can you please try creating the boto3 client as below, specifying the region explicitly?
client = boto3.client('glue',region_name='eu-central-1')
I had a similar problem when I was running this command from a Glue Python Shell job.
So I created a VPC endpoint (VPC -> Endpoints) for the Glue service (service name: "com.amazonaws.eu-west-1.glue") and assigned it to the same subnet and security group as the Glue connection that was used in the Glue Python Shell job.
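If you would rather create that endpoint from code than through the console, a rough boto3 sketch could look like the following (the VPC, subnet and security group IDs are placeholders, and the service name must match your region, e.g. eu-central-1 from the question):

import boto3

# Sketch only: an interface-type VPC endpoint for the Glue API, attached to the
# same subnet and security group as the Glue connection, so boto3 calls from the
# private subnet can reach Glue without going through the NAT gateway.
ec2 = boto3.client('ec2', region_name='eu-central-1')
ec2.create_vpc_endpoint(
    VpcId='vpc-00000000',                  # placeholder
    ServiceName='com.amazonaws.eu-central-1.glue',
    VpcEndpointType='Interface',
    SubnetIds=['subnet-00000000'],         # placeholder: the Glue connection's subnet
    SecurityGroupIds=['sg-00000000'],      # placeholder: the Glue connection's security group
    PrivateDnsEnabled=True,
)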
I am facing a problem while connecting to an SSL-enabled PostgreSQL server from Windows. I am getting the following error:
Error:
Error in postgresqlNewConnection(drv, …) :
RS-DBI driver: (could not connect ip:80 on dbname "all": sslmode value "require" invalid when SSL support is not compiled in.
Commands I have used:
install.packages("RPostgreSQL")
install.packages("rstudioapi")
require("RPostgreSQL")
require("rstudioapi")
drv <- dbDriver("PostgreSQL")
pg_dsn = paste0(
'dbname=', "all", ' ',
'sslmode=require')
con <- dbConnect(drv,
dbname = pg_dsn,
host = "ip",
port = 80,
user = "abcd",
password = rstudioapi::askForPassword("Database password"))
You need to use a PostgreSQL client shared library (libpq.dll) that was built with SSL support; the error means the libpq that RPostgreSQL loaded was compiled without it.
I am trying to set up a remote deployment with Capistrano on the Amazon cloud.
The idea: I SSH to a random machine of the autoscaling group and deploy to all the other machines from there. To do that I need to get the names of the other instances so that I can define the Capistrano servers I want to deploy to.
I have installed the Ruby SDK but I cannot figure out the best way to retrieve the instance names (taking advantage of the fact that I am on the VPN).
I actually have two possibilities: find the instances either by tags (I have tagged them with "production") or by the ID of the autoscaling group.
I don't want to use other "big guns" like Chef, etc.
After reading too much documentation, here is what I came up with.
Two strategies: retrieve the DNS names by autoscaling group or by tags.
By Tags
ec2 = Aws::EC2::Client.new
instances_tagged = ec2.describe_instances(
  dry_run: false,
  filters: [
    {
      name: 'tag:environment',
      values: ['production'],
    },
    {
      name: 'tag:stack',
      values: ['rails'],
    }
  ],
)
dns_tagged = instances_tagged.reservations[0].instances.map(&:private_dns_name)
By Autoscaling group
as = Aws::AutoScaling::Client.new
instances_of_as = as.describe_auto_scaling_groups(
  auto_scaling_group_names: ['Autoscaling-Group-Name'],
  max_records: 1,
).auto_scaling_groups[0].instances

if instances_of_as.empty?
  autoscaling_dns = []
else
  instance_ids = instances_of_as.map(&:instance_id)
  autoscaling_dns = ec2.describe_instances(instance_ids: instance_ids)
                       .reservations
                       .flat_map(&:instances)
                       .map(&:private_dns_name)
end
Let's say I'm using Terraform to provision two machines inside AWS:
An EC2 Machine running NodeJS
An RDS instance
How does the NodeJS code obtain the address of the RDS instance?
You've got a couple of options here. The simplest one is to create a CNAME record in Route53 for the database and then always point to that CNAME in your application.
A basic example would look something like this:
resource "aws_db_instance" "mydb" {
allocated_storage = 10
engine = "mysql"
engine_version = "5.6.17"
instance_class = "db.t2.micro"
name = "mydb"
username = "foo"
password = "bar"
db_subnet_group_name = "my_database_subnet_group"
parameter_group_name = "default.mysql5.6"
}
resource "aws_route53_record" "database" {
zone_id = "${aws_route53_zone.primary.zone_id}"
name = "database.example.com"
type = "CNAME"
ttl = "300"
records = ["${aws_db_instance.default.endpoint}"]
}
Alternative options include taking the endpoint output from the aws_db_instance and passing that into a user data script when creating the instance or passing it to Consul and using Consul Template to control the config that your application uses.
You may try Sparrowform, a lightweight provisioning tool for Terraform-based instances. It can build an inventory of your Terraform resources and provision the related hosts, passing them all the necessary data:
$ terraform apply # bootstrap infrastructure
$ cat sparrowfile # this scenario
# fetches the DB address from the Terraform cache
# and populates a configuration file
# on the server with the Node.js code:
#!/usr/bin/env perl6
use Sparrowform;
my $rdb-address;

for tf-resources() -> $r {
  my $r-id = $r[0];  # resource id
  if ( $r-id eq 'aws_db_instance.mydb' ) {
    my $r-data = $r[1];
    $rdb-address = $r-data<address>;
    last;
  }
}
# For instance, we can install a configuration file.
# The next chunk of code will be applied to
# the server with the Node.js code:
template-create '/path/to/config/app.conf', %(
  source    => ( slurp 'app.conf.tmpl' ),
  variables => %(
    rdb-address => $rdb-address
  ),
);
$ sparrowform --ssh_private_key=~/.ssh/aws.pem --ssh_user=ec2 # run provisioning
PS. Disclosure: I am the tool author.
Using Ruby I'm making a call like:
client = SoftLayer::Client.new(:username => user, :api_key => api_key, :timeout => 999999)
client['Account'].object_mask("mask[id, hostname, fullyQualifiedDomainName, provisionDate, datacenter[name], billingItem[recurringFee, associatedChildren[recurringFee], orderItem[description, order[userRecord[username], id]]], tagReferences[tagId, tag[name]], primaryIpAddress, primaryBackendIpAddress]").getHardware
But only some machines return a provisionDate and only some return orderItem information. How can I consistently get this information for each machine? What would cause one machine to return this data and another machine to not?
Example output:
{"fullyQualifiedDomainName"=>"<removed_by_me>",
"hostname"=>"<removed_by_me>",
"id"=>167719,
"provisionDate"=>"",
"primaryBackendIpAddress"=>"<removed_by_me>",
"primaryIpAddress"=>"<removed_by_me>",
"billingItem"=>
{"recurringFee"=>"506.78",
"associatedChildren"=>
[<removed_by_me>]},
"datacenter"=>{"name"=>"dal09"},
"tagReferences"=>
[{"tagId"=>139415, "tag"=>{"name"=>"<removed_by_me>"}},
{"tagId"=>139417, "tag"=>{"name"=>"<removed_by_me>"}},
{"tagId"=>140549, "tag"=>{"name"=>"<removed_by_me>"}}]}
To be clear, most machines return this data so I'm trying to understand why some do not.
Please see the following provisioning steps; below is a little flow to consider:
1. Order a server
Result:
* An orderId is assigned to the server
* The createDate has a new value
* activeTransaction value = Null
* provisionDate value = Null
2. The order is approved
Result:
* activeTransaction value <> Null
* provisionDate value = Null
3. The server is provisioned
Result:
* activeTransaction value = Null
* provisionDate value has a new value
* billingItem property has a new value
To see if your machines still have an "activeTransaction", please execute:
https://[username]:[apikey]@api.softlayer.com/rest/v3/SoftLayer_Hardware_Server/[server_id]/getActiveTransaction
Method: GET
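If you would rather script that check than paste the URL into a browser, here is a small sketch using Python and the requests library (the username and API key are placeholders; 167719 is the id from the example output above):

import requests

# Sketch only: the same getActiveTransaction REST call as above, using HTTP
# basic auth instead of embedding the credentials in the URL.
username = 'SL_USERNAME'   # placeholder
api_key = 'SL_API_KEY'     # placeholder
server_id = 167719         # id taken from the example output above

resp = requests.get(
    'https://api.softlayer.com/rest/v3/SoftLayer_Hardware_Server/{}/getActiveTransaction'.format(server_id),
    auth=(username, api_key),
)
print(resp.json())  # null (None) once provisioning has finished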
Now, after reviewing your example response: this server had some problems when completing the provisioning, so that step was completed manually, but the provisionDate was never set (please open a ticket if you want the provisionDate to be set). This is a special case, and I can see that another server shows similar behavior. The other servers that don't have a provisionDate still have "activeTransaction <> null", which means those servers are not provisioned yet.
EDIT:
Another property that can help you determine whether your machine has already been provisioned, even while another kind of transaction is being executed, is "hardwareStatus"; it should have the value "ACTIVE".
https://[username]:[apikey]@api.softlayer.com/rest/v3/SoftLayer_Account/getHardware?objectMask=mask[id,hostname,fullyQualifiedDomainName,provisionDate,hardwareStatus]
Method: GET
The response should be something like this:
{
  "fullyQualifiedDomainName": "myhostname.softlayer.com",
  "hostname": "myhostname",
  "id": 1234567,
  "provisionDate": "2015-06-29T00:21:39-05:00",
  "hardwareStatus": {
    "id": 5,
    "status": "ACTIVE"
  }
}