HttpOperator or HttpHook for HTTPS in Airflow

I'm working on a little proof of concept about Airflow on Google Cloud.
Essentially, I want to create a workflow that downloads data from a REST API (HTTPS), transforms the data into JSON format, and uploads it to a Google Cloud Storage bucket.
I've already done this with pure Python code and it works. Pretty straightforward! But because I want to schedule it and there are some dependencies, Airflow should be the ideal tool for this.
After a careful reading of the Airflow documentation, I've seen that the HttpOperator and/or HttpHook can do the trick for the download part.
I've created my HTTP connection in the web UI, with my email/password for authorization, as follows:
{Conn Id: "atlassian_marketplace", Conn Type: "HTTP", Host: "https://marketplace.atlassian.com/rest/2", Schema: None/Blank, Login: "my username", Password: "my password", Port: None/Blank, Extra: None/Blank}
First question:
-When should I use the SimpleHttpOperator versus the HttpHook?
Second question:
-How do I use SimpleHttpOperator or HttpHook with HTTPS calls?
Third question:
-How do I access the data returned by the API call?
In my case, the XCom feature will not do the trick, because these API calls can return a lot of data (100-300 MB)!
I've looked on Google for example code showing how to use the operator/hook for my use case, but I haven't found anything useful yet.
Any ideas?
I put here the skeleton of my code so far.
# Usual Airflow imports
# DAG creation
dag = DAG(
    'get_reporting_links',
    default_args=default_args,
    description='Get reporting links',
    schedule_interval=timedelta(days=1))

# Task 1: Dummy start
start = DummyOperator(task_id="Start", retries=2, dag=dag)

# Task 2: Connect to Atlassian Marketplace
get_data = SimpleHttpOperator(
    task_id="get_data",
    http_conn_id="atlassian_marketplace",
    endpoint="/vendors/{vendor_id}/reporting".format(vendor_id="some number"),
    method="GET",
    dag=dag)

# Task 3: Save JSON data locally
# TODO: transform_json: transform to JSON get_data.json()?
# Task 4: Upload data to GCP
# TODO: upload_gcs: use Airflow GCS connection
# Task 5: Stop
stop = DummyOperator(task_id="Stop", retries=2, dag=dag)

# Dependencies
start >> get_data >> transform_json >> upload_gcs >> stop

Look at the following example:
# Usual Airflow imports
# DAG creation
dag = DAG(
    'get_reporting_links',
    default_args=default_args,
    description='Get reporting links',
    schedule_interval=timedelta(days=1))

# Task 1: Dummy start
start = DummyOperator(task_id="Start", retries=2, dag=dag)

# Task 2: Connect to Atlassian Marketplace
get_data = SimpleHttpOperator(
    task_id="get_data",
    http_conn_id="atlassian_marketplace",
    endpoint="/vendors/{vendor_id}/reporting".format(vendor_id="some number"),
    method="GET",
    xcom_push=True,
    dag=dag)

def transform_json(**kwargs):
    ti = kwargs['ti']
    pulled_value_1 = ti.xcom_pull(key=None, task_ids='get_data')
    ...
    # transform the JSON here and save the content to a file

# Task 3: Save JSON data locally
save_and_transform = PythonOperator(
    task_id="save_and_transform",
    python_callable=transform_json,
    provide_context=True,
    dag=dag)

# Task 4: Upload data to GCP
upload_to_gcs = FileToGoogleCloudStorageOperator(...)

# Task 5: Stop
stop = DummyOperator(task_id="Stop", retries=2, dag=dag)

# Dependencies
start >> get_data >> save_and_transform >> upload_to_gcs >> stop
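Given the 100-300 MB responses, one way to sidestep XCom entirely is to have a PythonOperator stream the response straight to a local file and push only the file path downstream. A minimal sketch using the standard library (the URL and output path are placeholders; in a real DAG you would build the URL from the HTTP connection, or use an HttpHook for the same purpose):

```python
import shutil
import urllib.request

def download_to_file(url, output_path, chunk_size=1024 * 1024):
    """Stream an HTTP(S) response to disk in chunks instead of holding it in memory."""
    with urllib.request.urlopen(url) as resp, open(output_path, "wb") as out:
        shutil.copyfileobj(resp, out, chunk_size)
    # A PythonOperator callable returning this pushes only the small path string to XCom
    return output_path
```

This also hints at the operator-versus-hook split: SimpleHttpOperator is convenient for one-shot requests whose small response can live in XCom, while HttpHook is what you reach for inside your own Python callables when you need custom handling like this.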


How can I pass JSON data to a webhook Chainlink job?

I am trying to get the "Pipeline Input" to somehow be passed to an external adapter via the $(jobRun.requestBody) pipeline variable, then parsed by a jsonparse task, and then sent via a fetch task. I am not sure what format the input should be in when running a webhook job on a Chainlink node. I keep getting this and other errors no matter what I try:
data: key requestBody (segment 1 in keypath jobRun.requestBody): keypath not found
Here is the closest thing I have found in the documentation:
- https://docs.chain.link/chainlink-nodes/oracle-jobs/job-types/webhook
Here is the job definition if useful:
type = "webhook"
schemaVersion = 1
name = "account-balance-webhook"
forwardingAllowed = false
observationSource = """
parse_request [type="jsonparse" path="data,address" data="$(jobRun.requestBody)"]
fetch [type=bridge name="test" requestData="{\\"id\\": \\"0\\", \\"data\\": { \\"address\\": \\"$(parse_request)\\"}}"]
parse [type=jsonparse path="data,free" data="$(fetch)"]
parse_request -> fetch -> parse
"""
I am running Chainlink in a Docker container with this image: smartcontract/chainlink:1.11.0-root
Some background: I am working on developing an external adapter and want to be able to easily and quickly test.
We use the following webhook job to quickly verify there are no syntax errors, etc. in the bridge to an EA.
type = "webhook"
schemaVersion = 1
name = "[WH-username] cbor0-v0"
observationSource = """
fetch [type="bridge" name="bridge_name" requestData="{ \\"id\\": $(jobSpec.externalJobID), \\"input1\\": \\"value1\\" }"]
parse [type=jsonparse path="data,results" data="$(fetch)"]
fetch -> parse
"""
In general, it's best to quickly test an EA directly through curl GET/POST. If the curl works, then the bridge will work as long as you named the bridge correctly in the job-spec.toml.
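If you would rather script that smoke test than retype curl commands, a small Python helper can POST the same kind of body. The URL and the {"id": ..., "data": {...}} payload shape below mirror the job specs above and are assumptions about your particular EA:

```python
import json
import urllib.request

def call_adapter(url, payload):
    """POST a JSON request body to an external adapter and return the parsed JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Hypothetical EA running locally:
# call_adapter("http://localhost:8080", {"id": "0", "data": {"address": "0xabc..."}})
```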

Vault Error, Server gave HTTP response to HTTPS client

I'm using HashiCorp Vault as a secrets store, installed via the apt repository on Ubuntu 20.04.
After that, I added the root key to access the UI, and I'm able to add or delete secrets using the UI.
But whenever I try to add or get a secret using the command line, I get the following error:
jarvis@saki:~$ vault kv get secret/vault
Get "https://127.0.0.1:8200/v1/sys/internal/ui/mounts/secret/vault": http: server gave HTTP response to HTTPS client
My Vault config looks like this:
# Full configuration options can be found at https://www.vaultproject.io/docs/configuration
ui = true
#mlock = true
#disable_mlock = true
storage "file" {
path = "/opt/vault/data"
}
#storage "consul" {
# address = "127.0.0.1:8500"
# path = "vault"
#}
# HTTP listener
#listener "tcp" {
# address = "127.0.0.1:8200"
# tls_disable = 1
#}
# HTTPS listener
listener "tcp" {
address = "0.0.0.0:8200"
tls_cert_file = "/opt/vault/tls/tls.crt"
tls_key_file = "/opt/vault/tls/tls.key"
}
# Example AWS KMS auto unseal
#seal "awskms" {
# region = "us-east-1"
# kms_key_id = "REPLACE-ME"
#}
# Example HSM auto unseal
#seal "pkcs11" {
# lib = "/usr/vault/lib/libCryptoki2_64.so"
# slot = "0"
# pin = "AAAA-BBBB-CCCC-DDDD"
# key_label = "vault-hsm-key"
# hmac_key_label = "vault-hsm-hmac-key"
#}
I fixed the problem. Though this exception can be common to more than one underlying issue, in my case the fix was to export the address and root token printed after running this command:
vault server -dev
The output looks like this:
...
You may need to set the following environment variable:
$ export VAULT_ADDR='http://127.0.0.1:8200'
The unseal key and root token are displayed below in case you want to
seal/unseal the Vault or re-authenticate.
Unseal Key: 1+yv+v5mz+aSCK67X6slL3ECxb4UDL8ujWZU/ONBpn0=
Root Token: s.XmpNPoi9sRhYtdKHaQhkHP6x
Development mode should NOT be used in production installations!
...
Then just export these variables by running the following commands:
export VAULT_ADDR='http://127.0.0.1:8200'
export VAULT_TOKEN="s.XmpNPoi9sRhYtdKHaQhkHP6x"
Note: Replace "s.XmpNPoi9sRhYtdKHaQhkHP6x" with your token received as output from the above command.
Then run the following command to check the status:
vault status
Again, the error message can be similar for many different problems.
In PowerShell on Windows 10, I was able to set it this way:
$Env:VAULT_ADDR='http://127.0.0.1:8200'
Then
vault status
returned correctly. This was on Vault 1.7.3 in dev mode.
You can echo VAULT_ADDR by entering it on the command line and pressing Enter; this is the same as the set line above, but without the = sign and everything after it:
$Env:VAULT_ADDR
Output:
Key             Value
---             -----
Seal Type       shamir
Initialized     true
Sealed          false
Total Shares    1
Threshold       1
Version         1.7.3
Storage Type    inmem
Cluster Name    vault-cluster-80649ba2
Cluster ID      2a35e304-0836-2896-e927-66722e7ca488
HA Enabled      false
Try using a new terminal window. This worked for me.

Run a specific task in the end of play

How do I run a specific task after all other tasks in a playbook have completed? The problem is that this needs to happen in every playbook, and just adding the task to each one is not a good idea; I need to make it common for all of them. There is one common role in every playbook, but it runs at the beginning. Is it possible to add a task to it that would run at the very end? Or is there some other way to do this, so that it is always done at the end without editing each playbook?
You could do it by writing a Callback Plugin. This is Python code that executes pre-defined functions when an (Ansible-internal) event occurs.
Interesting for you would be the v2_playbook_on_stats method, which is one of the last steps executed.
For this, please check out the basic Developer Guidelines page of Ansible:
https://docs.ansible.com/ansible/latest/dev_guide/index.html
But more importantly the Plugins Guide:
https://docs.ansible.com/ansible/latest/dev_guide/developing_plugins.html
The basic structure as outlined in the document is:
from ansible.plugins.callback import CallbackBase

class CallbackModule(CallbackBase):
    pass
They even provide a proper example implementing the v2_playbook_on_stats method:
# Make coding more python3-ish, this is required for contributions to Ansible
from __future__ import (absolute_import, division, print_function)
__metaclass__ = type

# not only visible to ansible-doc, it also 'declares' the options the plugin requires and how to configure them.
DOCUMENTATION = '''
    callback: timer
    callback_type: aggregate
    requirements:
      - whitelist in configuration
    short_description: Adds time to play stats
    version_added: "2.0"
    description:
      - This callback just adds total play duration to the play stats.
    options:
      format_string:
        description: format of the string shown to user at play end
        ini:
          - section: callback_timer
            key: format_string
        env:
          - name: ANSIBLE_CALLBACK_TIMER_FORMAT
        default: "Playbook run took %s days, %s hours, %s minutes, %s seconds"
'''

from datetime import datetime
from ansible.plugins.callback import CallbackBase


class CallbackModule(CallbackBase):
    """
    This callback module tells you how long your plays ran for.
    """
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'aggregate'
    CALLBACK_NAME = 'namespace.collection_name.timer'
    # only needed if you ship it and don't want to enable by default
    CALLBACK_NEEDS_WHITELIST = True

    def __init__(self):
        # make sure the expected objects are present, calling the base's __init__
        super(CallbackModule, self).__init__()
        # start the timer when the plugin is loaded, the first play should start a few milliseconds after.
        self.start_time = datetime.now()

    def _days_hours_minutes_seconds(self, runtime):
        ''' internal helper method for this callback '''
        minutes = (runtime.seconds // 60) % 60
        r_seconds = runtime.seconds - (minutes * 60)
        return runtime.days, runtime.seconds // 3600, minutes, r_seconds

    # this is the only event we care about for display, when the play shows its summary stats; the rest are ignored by the base class
    def v2_playbook_on_stats(self, stats):
        end_time = datetime.now()
        runtime = end_time - self.start_time
        # Shows the usage of a config option declared in the DOCUMENTATION variable. Ansible will have set it when it loads the plugin.
        # Also note the use of the display object to print to screen. This is available to all callbacks, and you should use this over printing yourself
        self._display.display(self._plugin_options['format_string'] % (self._days_hours_minutes_seconds(runtime)))
I also want to highlight the importance of the DOCUMENTATION string. I first thought that it was only for generating the doc help page. But no. Check out this example:
options:
  format_string:
    description: format of the string shown to user at play end
    ini:
      - section: callback_timer
        key: format_string
    env:
      - name: ANSIBLE_CALLBACK_TIMER_FORMAT
    default: "Playbook run took %s days, %s hours, %s minutes, %s seconds"
In there you have ini, env, and default sections; these are actually used to inject options into your callback plugin, which you can read via self._plugin_options['format_string'] or self.get_option("format_string"). For a list of all callback methods that can be overridden, please refer to https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/callback/__init__.py
For you, the methods starting with v2_ are the interesting ones, because these are for Ansible 2+.
Check out https://github.com/ansible/ansible/tree/devel/lib/ansible/plugins/callback for more examples.
But it seems that they are cleaning up quite a lot at the moment.
Therefore, I would say, please check out a version tag, like:
https://github.com/ansible/ansible/tree/v2.9.6/lib/ansible/plugins/callback
Hope this helps.
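One practical note: because of CALLBACK_NEEDS_WHITELIST, a plugin like this must be enabled explicitly. Assuming the file is dropped into a callback_plugins/ directory next to the playbook (a conventional but not mandatory location), something like this in ansible.cfg should activate it; the plugin name must match its CALLBACK_NAME:

```ini
[defaults]
# look for callback plugins next to the playbook
callback_plugins = ./callback_plugins
# enable the whitelisted callback by name (newer Ansible spells this callbacks_enabled)
callback_whitelist = namespace.collection_name.timer
```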

Create CloudWatch alarm that sets an instance to standby via SNS/Lambda

What I am looking to do is set an instance to standby mode when it hits an alarm state. I already have an alarm set up to detect when my instance hits 90% CPU for a while. The alarm currently sends a Slack and a text message via SNS, which calls a Lambda function. What I would like to add is to have the instance go into standby mode. The instances are in an autoscaling group.
I found that you can do this through the CLI using the following command:
aws autoscaling enter-standby --instance-ids i-66b4f7d5be234234234 --auto-scaling-group-name my-asg --should-decrement-desired-capacity
You can also do this with boto3:
response = client.enter_standby(
    InstanceIds=[
        'string',
    ],
    AutoScalingGroupName='string',
    ShouldDecrementDesiredCapacity=True|False
)
I assume I need to write another Lambda function that will be triggered by SNS that will use the boto3 code to do this?
Is there a better/easier way before I start?
I already have the InstanceId passed into the Lambda's event, so I would have to add the ASG name to the event as well.
Is there a way to get the ASG name in the Lambda function when I already have the Instance ID? Then I do not have to pass it in with the event.
Thanks!
Your question has a couple of sub-parts, so I'll try to answer them in order:
I assume I need to write another Lambda function that will be triggered by SNS that will use the boto3 code to do this?
You don't need to, you could overload your existing function. I could see a valid argument for either separate functions (separation of concerns) or one function (since "reacting to CPU hitting 90%" is basically "one thing").
Is there a better/easier way before I start?
I don't know of any other way you could do it, other than Cloudwatch -> SNS -> Lambda.
Is there a way to get the ASG name in the Lambda function when I already have the Instance ID?
Yes, see this question for an example. It's up to you whether doing it in the Lambda or passing an additional parameter is the cleaner option.
For anyone interested, here is what I came up with for the Lambda function (in Python):
# Puts the instance in standby mode, which takes it off the load balancer;
# a replacement unit is spun up to take its place
import json
import boto3

ec2_client = boto3.client('ec2')
asg_client = boto3.client('autoscaling')

def lambda_handler(event, context):
    # Get the instance id from the SNS message in the event JSON
    msg = event['Records'][0]['Sns']['Message']
    msg_json = json.loads(msg)
    instance_id = msg_json['Trigger']['Dimensions'][0]['value']
    print("Instance id is " + str(instance_id))

    # Describe the instance so we can extract the ASG name from its tags later
    response = ec2_client.describe_instances(
        Filters=[
            {
                'Name': 'instance-id',
                'Values': [str(instance_id)]
            },
        ],
    )

    # Get the ASG name from the aws:autoscaling:groupName tag in the response JSON
    tags = response['Reservations'][0]['Instances'][0]['Tags']
    autoscaling_name = next(t["Value"] for t in tags if t["Key"] == "aws:autoscaling:groupName")
    print("Autoscaling name is - " + str(autoscaling_name))

    # Put the instance in standby
    response = asg_client.enter_standby(
        InstanceIds=[
            str(instance_id),
        ],
        AutoScalingGroupName=str(autoscaling_name),
        ShouldDecrementDesiredCapacity=False
    )

When provisioning with Terraform, how does code obtain a reference to machine IDs (e.g. database machine address)

Let's say I'm using Terraform to provision two machines inside AWS:
An EC2 Machine running NodeJS
An RDS instance
How does the NodeJS code obtain the address of the RDS instance?
You've got a couple of options here. The simplest one is to create a CNAME record in Route53 for the database and then always point to that CNAME in your application.
A basic example would look something like this:
resource "aws_db_instance" "mydb" {
  allocated_storage    = 10
  engine               = "mysql"
  engine_version       = "5.6.17"
  instance_class       = "db.t2.micro"
  name                 = "mydb"
  username             = "foo"
  password             = "bar"
  db_subnet_group_name = "my_database_subnet_group"
  parameter_group_name = "default.mysql5.6"
}

resource "aws_route53_record" "database" {
  zone_id = "${aws_route53_zone.primary.zone_id}"
  name    = "database.example.com"
  type    = "CNAME"
  ttl     = "300"
  records = ["${aws_db_instance.mydb.endpoint}"]
}
Alternative options include taking the endpoint output from the aws_db_instance and passing that into a user data script when creating the instance or passing it to Consul and using Consul Template to control the config that your application uses.
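Another option, assuming you declare a Terraform output for the endpoint (the rds_endpoint name below is a made-up example), is to have your deploy script read terraform output -json and hand the address to the application. A sketch:

```python
import json
import subprocess

def get_terraform_output(name, output_json=None):
    """Return the value of a Terraform output, reading `terraform output -json` if not supplied."""
    if output_json is None:
        output_json = subprocess.check_output(["terraform", "output", "-json"])
    outputs = json.loads(output_json)
    return outputs[name]["value"]

# e.g. get_terraform_output("rds_endpoint") might return
# "mydb.xxxxxx.us-east-1.rds.amazonaws.com:3306", which the deploy script
# could write into the Node.js app's config or environment
```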
You may try Sparrowform, a lightweight provisioning tool for Terraform-based instances; it can take an inventory of Terraform resources and provision the related hosts, passing in all the necessary data:
$ terraform apply # bootstrap infrastructure
$ cat sparrowfile # this scenario
# fetches the DB address from the Terraform cache
# and populates a configuration file
# on the server with the Node.js code:
#!/usr/bin/env perl6

use Sparrowform;

my $rdb-address;
for tf-resources() -> $r {
  my $r-id = $r[0]; # resource id
  if ( $r-id eq 'aws_db_instance.mydb' ) {
    my $r-data = $r[1];
    $rdb-address = $r-data<address>;
    last;
  }
}

# For instance, we can
# install a configuration file.
# The next chunk of code will be applied to
# the server with the Node.js code:
template-create '/path/to/config/app.conf', %(
  source => ( slurp 'app.conf.tmpl' ),
  variables => %(
    rdb-address => $rdb-address
  ),
);

# sparrowform --ssh_private_key=~/.ssh/aws.pem --ssh_user=ec2 # run provisioning
PS. Disclosure: I am the tool author.
