glue job times out when calling aws boto3 client api - etl

I am using the Glue console, not a dev endpoint. The Glue job can access the Glue catalog and table using the code below:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database =
"glue-db", table_name = "countries")
print "Table Schema:", datasource0.schema()
print "datasource0", datasource0.show()
Now I want to get the metadata for all tables in the Glue database glue-db.
I could not find a function for this in the awsglue.context API, so I am using boto3:
import boto3
client = boto3.client('glue', 'eu-central-1')
responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']
for databaseDict in databaseList:
    databaseName = databaseDict['Name']
    print("databaseName:{}".format(databaseName))
    responseGetTables = client.get_tables(DatabaseName=databaseName,
                                          MaxResults=123)
    print("responseGetTables{}".format(responseGetTables))
    tableList = responseGetTables['TableList']
    print("response Object{0}".format(responseGetTables))
    for tableDict in tableList:
        tableName = tableDict['Name']
        print("-- tableName:{}".format(tableName))
The code runs in a Lambda function, but fails within the Glue ETL job with the following error:
botocore.vendored.requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='glue.eu-central-1.amazonaws.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(, 'Connection to glue.eu-central-1.amazonaws.com timed out. (connect timeout=60)'))
The problem seems to be in the environment configuration. The Glue VPC has two subnets:
private subnet: with an S3 endpoint for Glue; allows inbound traffic from the RDS security group. It has
public subnet: in the Glue VPC, with a NAT gateway. The private subnet is reachable through the NAT gateway. I am not sure what I am missing here.
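A connect timeout like the one above suggests that the job cannot open a TCP connection to the Glue endpoint at all. As a rough diagnostic, a sketch like the following could be run from inside the job to check reachability. The function name can_reach and its defaults are made up for illustration; only the standard library is used.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests TCP reachability,
    not TLS or AWS credentials.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures and timeouts.
        return False
```

For example, can_reach('glue.eu-central-1.amazonaws.com') returning False from the job would point at routing, NAT or security group rules rather than at boto3 itself.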

Try using a proxy while creating the boto3 client:
import boto3
from botocore.config import Config
from pyhocon import ConfigFactory
service_name = 'glue'
region = 'eu-central-1'  # the region used in the question
default = ConfigFactory.parse_file('glue-default.conf')
override = ConfigFactory.parse_file('glue-override.conf')
host = override.get('proxy.host', default.get('proxy.host'))
port = override.get('proxy.port', default.get('proxy.port'))
config = Config()
if host and port:
    config.proxies = {'https': '{}:{}'.format(host, port)}
client = boto3.Session(region_name=region).client(service_name=service_name, config=config)
glue-default.conf and glue-override.conf are deployed by Glue to the /tmp directory on the cluster during spark-submit.
I had a similar issue and I did the same by using the public library from glue:
s3://aws-glue-assets-eu-central-1/scripts/lib/utils.py

Can you please try creating the boto3 client as below, specifying the region explicitly?
client = boto3.client('glue',region_name='eu-central-1')

I had a similar problem when I was running this command from Glue Python Shell.
So I created a VPC endpoint (VPC -> Endpoints) for the Glue service (service name: "com.amazonaws.eu-west-1.glue") and assigned it to the same subnet and security group as the Glue connection used by the Glue Python Shell job.
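The same endpoint could probably also be created from code instead of the console. Below is a minimal sketch using the EC2 create_vpc_endpoint call; the helper name and all IDs are placeholders, and enabling private DNS assumes the VPC has DNS support and DNS hostnames turned on.

```python
def create_glue_interface_endpoint(ec2_client, vpc_id, subnet_ids,
                                   security_group_ids, region):
    """Create an interface VPC endpoint for the Glue API.

    Hypothetical helper: ec2_client is expected to be a boto3 EC2 client,
    e.g. boto3.client('ec2', region_name=region).
    """
    return ec2_client.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=vpc_id,
        ServiceName="com.amazonaws.{}.glue".format(region),
        SubnetIds=list(subnet_ids),
        SecurityGroupIds=list(security_group_ids),
        # With private DNS, glue.<region>.amazonaws.com resolves to the
        # endpoint, so existing boto3 code needs no endpoint_url override.
        PrivateDnsEnabled=True,
    )
```

The subnet and security group passed in would be the ones used by the Glue connection, as described above. The security group also has to allow inbound HTTPS (443) from the job.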

Related

Executing Powershell script on remote Windows EC2 instance in Terraform

I am starting a Windows EC2 instance in AWS. After the server has been created, I want to install software such as OpenSSH and perform other tasks such as creating a user. If I have a PowerShell script, how do I execute it on the remote instance?
I have a local PowerShell script, install_sft.ps1, that I want to execute on the remote EC2 instance in AWS.
I know I need to use a "provisioner", but I cannot work out how to use it for Windows.
resource "aws_instance" "win-master" {
  provider                    = aws.lmedba-dc
  ami                         = data.aws_ssm_parameter.WindowsAmi.value
  instance_type               = var.instance-type
  key_name                    = "RPNVirginia"
  associate_public_ip_address = true
  vpc_security_group_ids      = [aws_security_group.windows-sg.id]
  subnet_id                   = aws_subnet.dc1.id
  tags = {
    Name = "Win server"
  }
  depends_on = [aws_main_route_table_association.set-master-default-rt-assoc]
}
You can do this by making use of the user_data parameter of the aws_instance resource:
resource "aws_instance" "win-master" {
  ...
  user_data_base64 = base64encode(file("install_sft.ps1"))
  ...
}
Just ensure that install_sft.ps1 is in the same directory as your Terraform code.
An EC2 instance's User Data script executes when it starts up for the first time. See the AWS documentation here for more details.
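One caveat: on Windows instances, user data is only run as PowerShell when the script is wrapped in <powershell>...</powershell> tags. If install_sft.ps1 does not already contain them, the payload could be prepared along these lines. This is a Python sketch for illustration only; the function name is made up and the same wrapping could be done directly in Terraform.

```python
import base64


def build_windows_user_data(script_text):
    """Wrap a PowerShell script in <powershell> tags and base64-encode it.

    Hypothetical helper: the result is the kind of string that a
    base64 user data field (such as user_data_base64) expects.
    """
    wrapped = "<powershell>\n{}\n</powershell>".format(script_text.strip())
    return base64.b64encode(wrapped.encode("utf-8")).decode("ascii")
```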

Create CloudWatch alarm that sets an instance to standby via SNS/Lambda

What I am looking to do is set an instance to standby mode when it hits an alarm state. I already have an alarm set up to detect when my instance hits 90% CPU for a while. The alarm currently sends a Slack and text message via SNS calling a Lambda function. What I would like to add is having the instance go into standby mode. The instances are in an autoscaling group.
I found that you can perform this through the CLI using the command :
aws autoscaling enter-standby --instance-ids i-66b4f7d5be234234234 --auto-scaling-group-name my-asg --should-decrement-desired-capacity
You can also do this with boto3 :
response = client.enter_standby(
    InstanceIds=[
        'string',
    ],
    AutoScalingGroupName='string',
    ShouldDecrementDesiredCapacity=True|False
)
I assume I need to write another Lambda function that will be triggered by SNS that will use the boto3 code to do this?
Is there a better/easier way before I start?
I already have the InstanceId passed into the event to the Lambda so I will have to add the ASG name in the event.
Is there a way to get the ASG name in the Lambda function when I already have the Instance ID? Then I do not have to pass it in with the event.
Thanks!
Your question has a couple sub-parts, so I'll try to answer them in order:
I assume I need to write another Lambda function that will be triggered by SNS that will use the boto3 code to do this?
You don't need to, you could overload your existing function. I could see a valid argument for either separate functions (separation of concerns) or one function (since "reacting to CPU hitting 90%" is basically "one thing").
Is there a better/easier way before I start?
I don't know of any other way you could do it, other than Cloudwatch -> SNS -> Lambda.
Is there a way to get the ASG name in the Lambda function when I already have the Instance ID?
Yes, see this question for an example. It's up to you whether it looks like doing it in the Lambda or passing an additional parameter is the cleaner option.
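As one possible way to do the lookup inside the Lambda, the Auto Scaling API can be asked directly which group an instance belongs to. A minimal sketch, assuming asg_client is a boto3 'autoscaling' client (the helper name is made up):

```python
def find_asg_name(asg_client, instance_id):
    """Return the Auto Scaling group name for an instance, or None.

    Hypothetical helper built on describe_auto_scaling_instances, which
    only returns entries for instances that belong to a group.
    """
    response = asg_client.describe_auto_scaling_instances(
        InstanceIds=[instance_id]
    )
    instances = response.get("AutoScalingInstances", [])
    if not instances:
        return None
    return instances[0]["AutoScalingGroupName"]
```

Usage would be something like find_asg_name(boto3.client('autoscaling'), 'i-66b4f7d5be234234234'). The function's role needs permission for autoscaling:DescribeAutoScalingInstances.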
For anyone interested, here is what I came up with for the Lambda function (in Python) :
# Puts the instance in the standby mode which takes it off the load balancer
# and a replacement unit is spun up to take its place
#
import json
import boto3
ec2_client = boto3.client('ec2')
asg_client = boto3.client('autoscaling')
def lambda_handler(event, context):
    # Get the id from the event JSON
    msg = event['Records'][0]['Sns']['Message']
    msg_json = json.loads(msg)
    id = msg_json['Trigger']['Dimensions'][0]['value']
    print("Instance id is " + str(id))
    # Capture all the info about the instance so we can extract the ASG name later
    response = ec2_client.describe_instances(
        Filters=[
            {
                'Name': 'instance-id',
                'Values': [str(id)]
            },
        ],
    )
    # Get the ASG name from the response JSON
    #autoscaling_name = response['Reservations'][0]['Instances'][0]['Tags'][1]['Value']
    tags = response['Reservations'][0]['Instances'][0]['Tags']
    autoscaling_name = next(t["Value"] for t in tags if t["Key"] == "aws:autoscaling:groupName")
    print("Autoscaling name is - " + str(autoscaling_name))
    # Put the instance in standby
    response = asg_client.enter_standby(
        InstanceIds=[
            str(id),
        ],
        AutoScalingGroupName=str(autoscaling_name),
        ShouldDecrementDesiredCapacity=False
    )
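The handler above takes the first entry in Dimensions by position. If the alarm ever has more than one dimension, picking the entry by name may be safer. A small sketch of that idea follows; the helper name is made up, and it assumes the alarm message uses the lowercase "name" and "value" keys that the handler above already relies on.

```python
import json


def extract_instance_id(event):
    """Return the InstanceId dimension from a CloudWatch alarm SNS event.

    Hypothetical helper: returns None when the alarm has no InstanceId
    dimension instead of guessing from the list position.
    """
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    for dimension in message["Trigger"]["Dimensions"]:
        if dimension.get("name") == "InstanceId":
            return dimension["value"]
    return None
```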

Airflow SimpleHttpOperator for HTTPS

I'm trying to use SimpleHttpOperator to consume a RESTful API. But, as the name suggests, it only supports the HTTP protocol, whereas I need to consume an HTTPS URI. So now I have to either use the "requests" library from Python or handle the invocation from within the application code, which may not be a standard way. So I'm looking for any other options available to consume an HTTPS URI from within Airflow. Thanks.
I dove into this and am pretty sure that this behavior is a bug in airflow. I have created a ticket for it here:
https://issues.apache.org/jira/browse/AIRFLOW-2910
For now, the best you can do is override SimpleHttpOperator as well as HttpHook in order to change the way that HttpHook.get_conn works (to accept https). I may end up doing this, and if I do I'll post some code.
Update:
Operator override:
from airflow.operators.http_operator import SimpleHttpOperator
from airflow.exceptions import AirflowException
from operators.https_support.https_hook import HttpsHook
class HttpsOperator(SimpleHttpOperator):
    def execute(self, context):
        http = HttpsHook(self.method, http_conn_id=self.http_conn_id)
        self.log.info("Calling HTTP method")
        response = http.run(self.endpoint,
                            self.data,
                            self.headers,
                            self.extra_options)
        if self.response_check:
            if not self.response_check(response):
                raise AirflowException("Response check returned False.")
        if self.xcom_push_flag:
            return response.text
Hook override
from airflow.hooks.http_hook import HttpHook
import requests
class HttpsHook(HttpHook):
    def get_conn(self, headers):
        """
        Returns http session for use with requests. Supports https.
        """
        conn = self.get_connection(self.http_conn_id)
        session = requests.Session()
        if "://" in conn.host:
            self.base_url = conn.host
        elif conn.schema:
            self.base_url = conn.schema + "://" + conn.host
        elif conn.conn_type:  # https support
            self.base_url = conn.conn_type + "://" + conn.host
        else:
            # schema defaults to HTTP
            self.base_url = "http://" + conn.host
        if conn.port:
            self.base_url = self.base_url + ":" + str(conn.port) + "/"
        if conn.login:
            session.auth = (conn.login, conn.password)
        if headers:
            session.headers.update(headers)
        return session
Usage:
Drop-in replacement for SimpleHttpOperator.
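To make the effect of the hook override easier to see, its base URL logic can be pulled out into a standalone function. This is only a sketch that mirrors the branches in get_conn above; build_base_url is not part of Airflow, and the parameters stand in for the connection's host, schema, conn_type and port fields.

```python
def build_base_url(host, schema=None, conn_type=None, port=None):
    """Mirror of the base_url branches in the HttpsHook override above."""
    if "://" in host:
        # Host already carries its own scheme, use it unchanged.
        base_url = host
    elif schema:
        base_url = schema + "://" + host
    elif conn_type:
        # The added branch: fall back to the connection type (e.g. https).
        base_url = conn_type + "://" + host
    else:
        base_url = "http://" + host
    if port:
        base_url = base_url + ":" + str(port) + "/"
    return base_url
```

With this, a connection with host example.com and connection type https gives https://example.com, whereas without the conn_type branch it would fall through to http://example.com.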
This is a couple of months old now, but for what it is worth I did not have any issue with making an HTTPS call on Airflow 1.10.2.
In my initial test I was making a request for templates from sendgrid, so the connection was set up like this:
Conn Id : sendgrid_templates_test
Conn Type : HTTP
Host : https://api.sendgrid.com/
Extra : { "authorization": "Bearer [my token]"}
and then in the dag code:
get_templates = SimpleHttpOperator(
    task_id='get_templates',
    method='GET',
    endpoint='/v3/templates',
    http_conn_id='sendgrid_templates_test',
    trigger_rule="all_done",
    xcom_push=True,
    dag=dag,
)
and that worked. Also notice that my request happens after a Branch Operator, so I needed to set the trigger rule appropriately (to "all_done" to make sure it fires even when one of the branches is skipped), which has nothing to do with the question, but I just wanted to point it out.
Now to be clear, I did get an Insecure Request warning as I did not have certificate verification enabled. But you can see the resulting logs below
[2019-02-21 16:15:01,333] {http_operator.py:89} INFO - Calling HTTP method
[2019-02-21 16:15:01,336] {logging_mixin.py:95} INFO - [2019-02-21 16:15:01,335] {base_hook.py:83} INFO - Using connection to: id: sendgrid_templates_test. Host: https://api.sendgrid.com/, Port: None, Schema: None, Login: None, Password: XXXXXXXX, extra: {'authorization': 'Bearer [my token]'}
[2019-02-21 16:15:01,338] {logging_mixin.py:95} INFO - [2019-02-21 16:15:01,337] {http_hook.py:126} INFO - Sending 'GET' to url: https://api.sendgrid.com//v3/templates
[2019-02-21 16:15:01,956] {logging_mixin.py:95} WARNING - /home/csconnell/.pyenv/versions/airflow/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
[2019-02-21 16:15:05,242] {logging_mixin.py:95} INFO - [2019-02-21 16:15:05,241] {jobs.py:2527} INFO - Task exited with return code 0
I was having the same problem with HTTP/HTTPS when trying to set the connections using environment variables (although it works when I set the connection in the UI).
I checked the issue that @melchoir55 opened (https://issues.apache.org/jira/browse/AIRFLOW-2910), and you don't need a custom operator for that. The problem is not that HttpHook or HttpOperator can't use HTTPS; the problem is the way get_hook parses the connection string when dealing with HTTP: it treats the first part (http:// or https://) as the connection type.
In summary, you don't need a custom operator; you can just set the connection in your environment as follows:
AIRFLOW_CONN_HTTP_EXAMPLE=http://https%3a%2f%2fexample.com/
Instead of:
AIRFLOW_CONN_HTTP_EXAMPLE=https://example.com/
Or set the connection on the UI.
It is not an intuitive way to set up a connection, but I think they are working on a better way to parse connections for Airflow 2.0.
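The percent-encoded value does not have to be written by hand. A small sketch using the standard library to build it (the function name is made up for illustration):

```python
from urllib.parse import quote


def airflow_http_conn_uri(base_url):
    """Build an AIRFLOW_CONN_* value whose host part is a full https URL.

    The leading http:// is read as the connection type, and the real URL
    is percent-encoded so that it survives as the host.
    """
    return "http://" + quote(base_url, safe="") + "/"
```

airflow_http_conn_uri("https://example.com") gives http://https%3A%2F%2Fexample.com/, which matches the value above apart from the case of the hex digits (percent-decoding treats both the same).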
In Airflow 2.x you can use https URLs by passing https as the schema value when setting up your connection, and you can still use SimpleHttpOperator as shown below.
my_api = SimpleHttpOperator(
    task_id="my_api",
    http_conn_id="YOUR_CONN_ID",
    method="POST",
    endpoint="/base-path/end-point",
    data=get_data,
    headers={"Content-Type": "application/json"},
)
Instead of implementing HttpsHook, we could just add one line to the HttpsOperator(SimpleHttpOperator) above, as follows:
...
self.extra_options['verify'] = True
response = http.run(self.endpoint,
                    self.data,
                    self.headers,
                    self.extra_options)
...
In Airflow 2, the problem has been resolved.
Just check that:
the host name in the Connection UI form doesn't end with /
the 'endpoint' parameter of SimpleHttpOperator starts with /
I am using Airflow 2.1.0, and the following setting works for an HTTPS API.
In the connection UI, set the host name as usual; there is no need to specify 'https' in the schema field. Don't forget to set the login account and password if your API server requires them.
Connection UI Setting
When constructing your task, add the extra_options parameter to SimpleHttpOperator and put the path to your CA bundle certificate file as the value for the key verify. If you don't have a certificate file, use False to skip verification.
Task definition
Reference: here
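As a rough sketch of the extra_options value described above (the helper name and the bundle path are placeholders, not taken from the original task definition):

```python
def build_extra_options(ca_bundle_path=None):
    """Return an extra_options dict for SimpleHttpOperator.

    verify is the CA bundle path when one is given, otherwise False,
    which skips certificate verification.
    """
    return {"verify": ca_bundle_path if ca_bundle_path else False}


# Hypothetical use inside a DAG file (needs Airflow and a DAG to run):
# SimpleHttpOperator(
#     task_id="my_api",
#     http_conn_id="YOUR_CONN_ID",
#     endpoint="/base-path/end-point",
#     extra_options=build_extra_options("/path/to/ca_bundle.pem"),
# )
```

Skipping verification is best kept to test setups, since it leads to the InsecureRequestWarning shown in the logs earlier.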

When provisioning with Terraform, how does code obtain a reference to machine IDs (e.g. database machine address)

Let's say I'm using Terraform to provision two machines inside AWS:
An EC2 Machine running NodeJS
An RDS instance
How does the NodeJS code obtain the address of the RDS instance?
You've got a couple of options here. The simplest one is to create a CNAME record in Route53 for the database and then always point to that CNAME in your application.
A basic example would look something like this:
resource "aws_db_instance" "mydb" {
  allocated_storage    = 10
  engine               = "mysql"
  engine_version       = "5.6.17"
  instance_class       = "db.t2.micro"
  name                 = "mydb"
  username             = "foo"
  password             = "bar"
  db_subnet_group_name = "my_database_subnet_group"
  parameter_group_name = "default.mysql5.6"
}
resource "aws_route53_record" "database" {
  zone_id = "${aws_route53_zone.primary.zone_id}"
  name    = "database.example.com"
  type    = "CNAME"
  ttl     = "300"
  # address is the hostname only; endpoint also includes the port
  records = ["${aws_db_instance.mydb.address}"]
}
Alternative options include taking the endpoint output from the aws_db_instance and passing that into a user data script when creating the instance or passing it to Consul and using Consul Template to control the config that your application uses.
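For the option of handing the endpoint to the application, one possible flow is to expose it as a Terraform output and read it from the JSON that terraform output -json prints. The sketch below is in Python for illustration (the application in the question is NodeJS), and the output name db_endpoint is an assumption. It also splits off the port, since the endpoint attribute has the form host:port.

```python
import json


def read_db_endpoint(terraform_output_json, output_name="db_endpoint"):
    """Extract (host, port) from `terraform output -json` text.

    Hypothetical helper: assumes an output called output_name holds the
    aws_db_instance endpoint, which looks like "host:port".
    """
    outputs = json.loads(terraform_output_json)
    endpoint = outputs[output_name]["value"]
    host, _, port = endpoint.partition(":")
    return host, (int(port) if port else None)
```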
You may try Sparrowform, a lightweight provisioning tool for Terraform-based instances. It can build an inventory of Terraform resources and provision the related hosts, passing all the necessary data:
$ terraform apply # bootstrap infrastructure
$ cat sparrowfile # this scenario
# fetches DB address from terraform cache
# and populate configuration file
# at server with node js code:
#!/usr/bin/env perl6
use Sparrowform;
my $rdb-address;
for tf-resources() -> $r {
  my $r-id = $r[0]; # resource id
  if ( $r-id eq 'aws_db_instance.mydb') {
    my $r-data = $r[1];
    $rdb-address = $r-data<address>;
    last;
  }
}
# For instance, we can
# Install configuration file
# Next chunk of code will be applied to
# The server with node-js code:
template-create '/path/to/config/app.conf', %(
  source => ( slurp 'app.conf.tmpl' ),
  variables => %(
    rdb-address => $rdb-address
  ),
);
$ sparrowform --ssh_private_key=~/.ssh/aws.pem --ssh_user=ec2 # run provisioning
PS. disclosure - I am the tool author

EC2 - Connect to running instance by using the API

I create an EC2 instance via the provided interface, and I am using the AWS API to connect to the existing running instance, but when I run the following code I get "You have 0 Amazon EC2 instance(s) running.":
DescribeAvailabilityZonesResult availabilityZonesResult = ec2.describeAvailabilityZones();
System.out.println("You have access to " + availabilityZonesResult.getAvailabilityZones().size() +
" Availability Zones.");
DescribeInstancesResult describeInstancesRequest = ec2.describeInstances();
List<Reservation> reservations = describeInstancesRequest.getReservations();
Set<Instance> instances = new HashSet<Instance>();
for (Reservation reservation : reservations) {
instances.addAll(reservation.getInstances());
}
System.out.println("You have " + instances.size() + " Amazon EC2 instance(s) running.");
Do you have any ideas about what might be the problem?
If you have double-checked that your instances are actually up and running, they are most likely not in the "us-east-1" region (the default that the AWS SDK assumes).
So set your AmazonEC2Client instance to point to the correct endpoint and everything should be fine, e.g. for Europe (Ireland):
ec2.setEndpoint("ec2.eu-west-1.amazonaws.com");
More details, as well as links to where you can find the endpoint strings, in this SO answer.
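The same region pitfall applies to other SDKs. For comparison, here is a rough Python (boto3) sketch of the count from the question, where the region is chosen when the client is created. The helper name is made up; ec2_client is expected to be something like boto3.client('ec2', region_name='eu-west-1').

```python
def count_running_instances(ec2_client):
    """Count running EC2 instances visible to the given regional client.

    Hypothetical helper: uses the describe_instances paginator so that
    accounts with many instances are counted across all result pages.
    """
    paginator = ec2_client.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    return sum(
        len(reservation["Instances"])
        for page in pages
        for reservation in page["Reservations"]
    )
```

A result of 0 from a client pointed at one region, with instances visible in the console, usually means the console is showing a different region.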
