Setting proxy environment variables when running a DAG in Apache Airflow

I need to run Apache Airflow on a corporate network. For that I need to set "http_proxy", "https_proxy" and "no_proxy" on any machine that needs internet access.
Right now, the VM I'm using to run Airflow sets these environment variables in /etc/profile.
Python scripts that make HTTP requests to external websites run fine when I launch them from the terminal, but when I run them inside a DAG they break because the address can't be resolved/accessed.
It seems that Airflow runs scripts in an isolated environment. I am currently using CeleryExecutor.
First, I printed all the environment variables with print(environ). I got this:
environ({'LANG': 'en_US.UTF-8', 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin', 'HOME': '/home/airflow', 'LOGNAME': 'airflow', 'USER': 'airflow', 'SHELL': '/bin/bash', 'INVOCATION_ID': '5c777ce3b07748309b972d877a0545ea', 'JOURNAL_STREAM': '9:37430', 'AIRFLOW_CONFIG': '/opt/airflow/airflow.cfg', 'AIRFLOW_HOME': '/opt/airflow', '_MP_FORK_LOGLEVEL_': '20', '_MP_FORK_LOGFILE_': '', '_MP_FORK_LOGFORMAT_': '[%(asctime)s: %(levelname)s/%(processName)s] %(message)s', 'CELERY_LOG_LEVEL': '20', 'CELERY_LOG_FILE': '', 'CELERY_LOG_REDIRECT': '1', 'CELERY_LOG_REDIRECT_LEVEL': 'WARNING', 'AIRFLOW_CTX_DAG_OWNER': 'airflow', 'AIRFLOW_CTX_DAG_ID': 'primeiro-teste', 'AIRFLOW_CTX_TASK_ID': 'extract', 'AIRFLOW_CTX_EXECUTION_DATE': '2022-12-13T16:18:17.185417+00:00', 'AIRFLOW_CTX_DAG_RUN_ID': 'manual__2022-12-13T16:18:17.185417+00:00'})
There are no proxy variables, so the script cannot reach anything outside the network.
I even checked from within a DAG which DNS servers were being used, to see if they were correct. They were.
The only way I got the script to work was by defining these environment variables right before making the HTTP request:
import os

os.environ['HTTP_PROXY'] = os.environ['http_proxy'] = os.environ['HTTPS_PROXY'] = os.environ['https_proxy'] = "PROXY STRING"
I was hoping to find a way to define these variables for all DAGs, but when I set them the way Tomasz suggested, I can't seem to use them unless they start with the "AIRFLOW" prefix.

Creating an environment file and putting it in some location is not sufficient. You have to tell Airflow about the location of that file when it starts, however you start it (e.g. via systemd).
Airflow gets its environment variables very specifically. When Airflow starts, you need to reference the environment file created for Airflow. When you run Airflow using systemd, you can specify which EnvironmentFile you would like Airflow to use under the [Service] section of the unit file. Environment variables not defined within that file will not be picked up by Airflow. Your unit files may look different from mine, but here is mine as an example:
[Unit]
Description=Airflow webserver daemon
After=network.target mysqld.service rabbitmq-server.service
Wants=mysqld.service rabbitmq-server.service
[Service]
EnvironmentFile=/prod/airflow/airflow.env
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/bin/bash -c "source /prod/airflow/airflow_38_venv/bin/activate ; /prod/airflow/airflow_38_venv/bin/airflow webserver -p 7635 --pid /prod/airflow/run/webserver.pid"
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target
EnvironmentFile can point to any location/filename that the user running Airflow has read access to. The suggested filename and location is /etc/sysconfig/airflow, but as you can see mine is different from what is recommended.
Here is what the body of my EnvironmentFile looks like, edited to remove specific details. Again, yours will probably look different.
$ cat /prod/airflow/airflow.env
# This file is the environment file for Airflow. Put this file in /etc/sysconfig/airflow per default
# configuration of the systemd unit files.
#
AIRFLOW_CONFIG=/prod/airflow/airflow.cfg
AIRFLOW_HOME=/prod/airflow
http_proxy=http://something.proxyserver.com:80
https_proxy=http://something.proxyserver.com:80
no_proxy=*.google.com,127.0.0.1
HTTP_PROXY=http://something.proxyserver.com:80
HTTPS_PROXY=http://something.proxyserver.com:80
NO_PROXY=*.google.com,127.0.0.1
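
One practical note: with CeleryExecutor the tasks run inside the celery worker, so whichever unit file starts the worker needs the same EnvironmentFile= line as well, and systemd only picks up unit file changes after a daemon-reload and a restart of the services. To confirm the variables actually reach task processes, a throwaway DAG like the sketch below can help (this is not from the answer above; it assumes Airflow 2.x, and the DAG/task ids are made up):

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def show_proxy():
    # Print whatever proxy settings the worker process actually sees.
    for name in ("http_proxy", "https_proxy", "no_proxy",
                 "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"):
        print(f"{name}={os.environ.get(name)}")


with DAG(
    dag_id="check_proxy_env",    # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,      # trigger manually
    catchup=False,
) as dag:
    PythonOperator(task_id="show_proxy", python_callable=show_proxy)

Trigger it manually and read the task log; if the proxy variables show up there, plain HTTP requests in your real DAGs should pick them up automatically.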

Related

Starting an opensplice publisher via systemd does not publish data

I have an opensplice publisher on Ubuntu 20.04 that is started via systemd.
If the publisher starts via systemd then the data is not published, but no errors are reported or present in the opensplice log files.
The publisher works if I run it from a command line or if I stop and restart the service.
The QoS are the same for the publisher and subscriber.
The publisher and subscriber applications are running on different machines.
There are no other participants on the network. All the machines are rebooted and the order of reboot does not change the observed behaviour.
The systemd service is:
[Unit]
Description=Publisher Process
Documentation=
After=network.target
StartLimitIntervalSec=0
[Service]
Type=simple
WorkingDirectory=/opt/publisher/bin
ExecStart=/opt/publisher/bin/publisher.sh
Restart=always
RestartSec=2
[Install]
WantedBy=multi-user.target
The publisher.sh is:
#!/bin/bash
cd /opt/publisher/bin
source bashrc_local
# We just keep running the application (in case of a crash)
while true; do
./publisher
sleep 15
done
I have a workaround that feels a little bit naff.
#!/bin/bash
cd /opt/publisher/bin
source bashrc_local
timeout 30 ./remote_processor
killall remote_processor
# We just keep running the application (in case of a crash)
while true; do
./publisher
sleep 15
done
Any ideas on how I can remove my work around?
Edit 16 Sept 22
The issue appears to be systemd start order and dependencies, as I have run into the same issue with a program publishing data via UDP that does not use DDS.
Changing the dependencies so the services are started just before the user login does not help.
Check your environment variables, as systemd will not run with the same environment as your bash console.
In particular, have you set the OSPL_URI variable to point at the config?
If you are using the commercial version, OSPL_HOME and ADLINK_LICENSE will also need to be set.
Does the PATH variable include your OSPL shared libraries?
These are all set up by running the $OSPL_HOME/release.com script in your bash session.
I tend to manually add the required ones to the service file, e.g.:
Environment=OSPL_URI=file:///opt/ospl.xml

Volume mount and environment variable access for a Window service running in a container in kubernetes

An application that runs as a Windows service in a Windows container (without any run-as user overrides - a New-Service -Name x -BinaryPath someservice.exe step in the container build) needs to read an environment variable set on the container as well as a file mounted into the container. I know that applications run directly via the entrypoint can read the env variable and the mounted file, but as a service I am getting errors indicating it cannot.
Are environment variables scoped to the user by default, would something like a RunAs configuration in the security context be needed, or some other mechanism? Or would there be any limitations on the service's access to file mounts?
Edit
I investigated the environment variables a bit more; this seems to be the part that's missing. I tried to echo a variable at each scope:
PS C:\dir> echo ([System.Environment]::GetEnvironmentVariable("varname","User"))
PS C:\dir> echo ([System.Environment]::GetEnvironmentVariable("varname","Machine"))
PS C:\dir> echo ([System.Environment]::GetEnvironmentVariable("varname","Process"))
expected_value
So I suspect the service doesn't have access to the Process scope. I am going to try rescoping the variable:
[System.Environment]::SetEnvironmentVariable('varname',$env:varname,[System.EnvironmentVariableTarget]::Machine)
The issue was due to environment variable scoping: the environment variables passed to the container are Process-scoped. However, a Windows service started in the container does not have access to that scope (the entrypoint process does).
To resolve this, I set the entrypoint to a PowerShell wrapper for the app that copies the Process-scoped environment variable to the Machine scope before starting the Windows service:
[System.Environment]::SetEnvironmentVariable('varname',$env:varname,[System.EnvironmentVariableTarget]::Machine)
# start service command.

systemd prepending /bin to Environment PATH

I'm trying to set up my Bamboo agents as a systemd service. The service file looks like this:
[Unit]
Description=Atlassian Bamboo Agent
After=syslog.target network.target
[Service]
Type=forking
User=bamboo
Group=bamboo
ExecStart=/opt/bamboo-1/bin/bamboo-agent.sh start
ExecStop=/opt/bamboo-1/bin/bamboo-agent.sh stop
Environment="PATH=/opt/rh/devtoolset-3/root/bin/:/usr/local/bin:/usr/bin"
[Install]
WantedBy=multi-user.target
When I check the process environment, the PATH is set to what I expect, with the only exception that /bin is prepended to it.
cat /proc/12345/environ <--- 12345 is my Bamboo PID
...
PATH=/bin:/opt/rh/devtoolset-3/root/bin/:/usr/local/bin:/usr/bin
...
That means my builds will use the wrong gcc, cmake, etc.
Is there any way to prevent /bin from being prepended to the PATH?
I created a test service that just printed out the PATH after setting Environment= with a new path, and found it worked as expected on Ubuntu 16.04 with systemd 229.
I conclude that something in your script is prepending /bin to your PATH.
Nothing in the systemd.exec man page suggests that systemd is designed to behave the way you observe.
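For what it's worth, the test can be as simple as a small script that dumps the environment it was started with, run once via a throwaway unit's ExecStart and once from a shell, and comparing the two outputs (a minimal sketch; the file name is made up):

# dump_env.py - print the environment this process actually received,
# so the systemd-started copy can be compared with a login shell's.
import os

for key in sorted(os.environ):
    print(f"{key}={os.environ[key]}")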

When running from systemd unit file, unable to open directory

I have a strange problem with Ubuntu 16 and a systemd unit file. I have a service which reads a directory from the local filesystem; the directory path comes from an environment variable. When I start the service manually (i.e. in an SSH session), everything works fine. But when I start the service with the unit file below, the service is unable to open the storage directory. The error I get is: could not read contents of storage" message="open /srv/services/poddy/storage: no such file or directory.
Now my question is: does systemd kind of "sandbox" the services?
[Unit]
Description=Poddy service
After=network.target
[Service]
Type=simple
User=myusername
Group=myusername
WorkingDirectory=/srv/services/poddy
ExecStart=/srv/services/poddy/poddy
Restart=always
RestartSec=5
StartLimitInterval=60s
StartLimitBurst=3
Environment=PODDY_STORAGE="/srv/services/poddy/storage"
Environment=PODDY_PORT=8085
[Install]
WantedBy=multi-user.target
Well, I solved it myself. It turns out that quoting the value of an environment variable in the systemd unit file caused the quotes to end up in the value itself.
So, changing this:
Environment=PODDY_STORAGE="/srv/services/poddy/storage"
into:
Environment=PODDY_STORAGE=/srv/services/poddy/storage
solved my problem :).

Puppet agent daemon not reading a facter fact (EC2, cloud-init)

I am using puppet to read a fact from facter, and based on that I apply a different configuration to my modules.
Problem:
The puppet agent isn't seeing this fact. Running puppet agent --test interactively works as expected. Even running it non-interactively from a script seems to work fine. Only the agent daemon itself has the problem.
Process:
I am deploying an Ubuntu-based app stack on EC2. Using userdata (#cloud-config), I set an environment variable in /etc/environment:
export FACTER_tl_role=development
Then, immediately in #cloud-config, I source /etc/environment.
Only then do I apt-get install puppet (I moved away from using package: puppet to eliminate ambiguity in the sequence of #cloud-config steps).
Once the instance boots, I confirm that the fact is available: running facter tl_role returns "development". I then check /var/log/syslog, and apparently the puppet agent is not seeing this fact - I know this because it is unable to compile the catalog, and there is nothing (blank) where I should be seeing the value of the variable that is set based on this fact.
However, running puppet agent --test interactively compiles and runs the catalog just fine.
Even running it from the #cloud-config script (immediately after installing puppet) also works just fine.
How do I make this fact available to the puppet agent? Restarting the agent service makes no difference; it remains unaware of the custom fact. Rebooting the instance also makes no difference.
here's some code:
EC2 userdata:
#cloud-config
puppet:
  conf:
    agent:
      server: "puppet.foo.bar"
      certname: "%i.%f"
      report: "true"
runcmd:
  - sleep 20
  - echo 'export FACTER_tl_role=development' >> /etc/environment
  - . /etc/environment
  - apt-get install puppet
  - puppet agent --test
Main puppet manifest:
# /etc/puppet/manifests/site.pp
node default {
  case $tl_role {
    'development': { $sitedomain = "dev.foo.bar" }
    'production':  { $sitedomain = "new.foo.bar" }
  }
  class {"code" : sitedomain => $sitedomain}
  class {"apache::site" : sitedomain => $sitedomain}
  class {"nodejs::grunt-daemon" : sitedomain => $sitedomain}
}
And then I see failures where $sitedomain is supposed to appear, so $tl_role appears not to be set.
Any ideas? This is exploding my brain....
Another easy option would be to drop the fact in as an external fact.
Dropping a file into /etc/facter/facts.d/* is fairly easy, and you can use a text file, YAML, JSON, or an executable to do it.
http://docs.puppetlabs.com/guides/custom_facts.html#external-facts
*That's on open source Puppet, on unix-y machines. See the link for the full docs.
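As a concrete illustration (not from the answer above; the file name and value are made up): an executable external fact is just a file in that directory, marked executable, that prints key=value pairs to stdout. A plain text file containing tl_role=development would work just as well.

#!/usr/bin/env python
# /etc/facter/facts.d/tl_role (hypothetical name; must be chmod +x)
# Facter runs executables in this directory and reads "key=value" lines from stdout.
print("tl_role=development")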
Thank you, @christopher. This may be a good solution; I will test it and possibly move to it from my current horrible hack.
The answer I got in the Puppet Users Google Group was that I should not assume that the Puppet agent process has the environment of a login shell, nor that Facter will have that environment when it is run by the Puppet agent.
Here is the way I solved it (admittedly, by brute force):
runcmd:
- echo 'export FACTER_tl_role=development' >> /etc/environment
- . /etc/environment
- apt-get install puppet
- service puppet stop
- sed -i '/init-functions/a\. \/etc\/environment' /etc/init.d/puppet
- puppet agent --test
- service puppet start
As you can see, after installing Puppet, I stop the agent, and add a line to /etc/init.d/puppet to source /etc/environment. Then I start the agent. NOT ideal... but it works!
I don't think . /etc/environment is going to work properly the way cloud-init executes runcmd. Two possible solutions:
Export the variable with the puppet agent command:
export FACTER_tl_role=development && puppet agent --test
If that doesn't work:
Just drop the commands into a user-data script and wire them together as a "multipart input" (described in the cloud-init docs).
The second solution executes the commands as a proper shell script, and would likely fix the problem. If the first works, though, it's easier to do with what you have.
