Sagemaker Train Job can't connect to ec2 instance


I have an MLflow server running on an EC2 instance, port 5000.
This EC2 instance's security group allows inbound TCP traffic on port 5000 from another security group designated for SageMaker.
ec2 instance inbound rules:
SageMaker outbound rules:
These two security groups are in the same VPC.
Now I try to run a SageMaker training job with the designated security group, so that the training script logs metrics to the EC2 server via its internal IP address (as answered here), but the connection fails.
SageMaker job init:
role = "ml_sagemaker"
security_group_ids = ['sg-04868acca16e81183']
bucket = sagemaker_session.default_bucket()
out_path = f"s3://{bucket}/{project_name}"
estimator = PyTorch(entry_point='run_train.py',
                    source_dir='.',
                    sagemaker_session=sagemaker_session,
                    instance_type=instance_type,
                    instance_count=1,
                    framework_version='1.5.0',
                    py_version='py3',
                    role=role,
                    security_group_ids=security_group_ids,
                    hyperparameters={},
                    )
....
Inside run_train.py:
import mlflow
tracking_uri = "http://172.31.77.137:5000" # <- this is internal ec2 IP
mlflow.set_tracking_uri(tracking_uri)
mlflow.log_param("test_param", 3)
Error:
File "/opt/conda/lib/python3.6/site-packages/urllib3/util/connection.py", line 74, in create_connection
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
However, when I create a SageMaker Notebook instance with the same security group and the same IAM role, I am able to connect to the EC2 instance and log metrics from within the notebook.
Here are the SageMaker Notebook configurations:
How can I connect to the EC2 instance from a SageMaker training job?

Your estimator creates a standalone instance, so it does not matter that you can reach MLflow from the notebook. If you want to use a subnet/security group configuration with the PyTorch estimator and still have internet access, you need to set up the VPC resources for it.
I had this same issue: SageMaker plus an MLflow server on another EC2 instance. The first instinct is to assign the estimator the same VPC and security groups as the EC2 instance (the MLflow server). They should then be able to connect to each other, since they are within the same private network. This raises another problem: the instance that SageMaker spins up cannot reach the internet to download the libraries/packages you specify in requirements.txt (e.g. mlflow). So the problem becomes how to connect to the internet.
The only way to provide internet access when subnets are used for estimators is to run the estimator in a subnet that has a NAT gateway configured:
1) Create a NAT gateway in one of your public subnets, such as subnet-axxxx.
2) Create a new route table, e.g. “NAT_Route_Table”.
3) Edit its routes: add Destination 0.0.0.0/0 with the newly created NAT gateway as Target (add other routes if needed).
4) Create a new subnet named “NAT_Subnet” and associate it with the newly created “NAT_Route_Table”.
Traffic from that subnet will then go through the NAT gateway to the internet.
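For what it's worth, here is a minimal sketch of how the estimator from the question might be placed inside the VPC. The VpcConfig of a training job needs both subnets and security group IDs, while the code in the question passes only security_group_ids, so the job may not be running in the VPC at all. The helper names vpc_estimator_kwargs and make_estimator are made up for this example, and any subnet ID you pass (ideally the NAT-routed subnet described above) is your own value, not something from the question.

```python
def vpc_estimator_kwargs(subnets, security_group_ids):
    """Return the VPC-related keyword arguments for a SageMaker estimator.

    Both lists are required together; this helper (hypothetical, not part of
    the SDK) fails early if either one is missing.
    """
    if not subnets:
        raise ValueError("at least one subnet ID is required to run the job inside the VPC")
    if not security_group_ids:
        raise ValueError("at least one security group ID is required")
    return {"subnets": list(subnets), "security_group_ids": list(security_group_ids)}


def make_estimator(role, instance_type, subnets, security_group_ids):
    """Build the PyTorch estimator from the question with VPC settings added."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from sagemaker.pytorch import PyTorch

    return PyTorch(
        entry_point="run_train.py",
        source_dir=".",
        instance_type=instance_type,
        instance_count=1,
        framework_version="1.5.0",
        py_version="py3",
        role=role,
        **vpc_estimator_kwargs(subnets, security_group_ids),
    )
```

You would call make_estimator("ml_sagemaker", instance_type, [your_subnet_id], ['sg-04868acca16e81183']) and then estimator.fit(...) as before.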

Related

Unable to connect to Airflow server running on EC2

I am trying to set up an Apache Airflow server on EC2. I managed to get it running and verified its status by hitting the /health endpoint with curl on http://localhost:8989. Airflow listens on port 8989 here.
Next, I want to be able to connect to the admin dashboard/UI from a browser using the EC2 instance's public IP. So I added an inbound rule to the AWS security group the EC2 instance belongs to.
While connecting to Airflow, I am getting the following error
Failed to connect to ec2-XX-XX-XXX-XXX.compute-1.amazonaws.com port 8989: Operation timed out
Not sure what else I need to do to reach the server running on EC2.

If you can SSH to an EC2 instance, you've added a security group rule for ingress on another port, but can't reach the instance on that port, here are some other things to check:
Firewall running on the instance. Amazon Linux and recent official Ubuntu AMIs shouldn't have iptables or some other firewall running on them by default, but if you're using another AMI or someone else has configured the EC2 instance, it's possible to have iptables/ufw or some other firewall running. Check processes on your instance to make sure you don't have a firewall.
Network ACL on the VPC subnet. The default ACL will permit traffic on all ports. It's possible that the default has been changed to allow traffic only on selected ports.
Multiple security groups assigned to the EC2 instance. It's possible to assign more than one security group to the instance. Check to make sure you don't have a rule in some other security group that's blocking the port.
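Before going through that list, it can help to see how the connection fails. Below is a small sketch using only the Python standard library; probe_tcp is a name I made up. As a rough rule of thumb (an assumption, not a guarantee), a timeout suggests packets are being dropped somewhere on the way (security group, network ACL, or a firewall), while a refusal suggests the host was reached but nothing accepted the connection on that port or interface.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a TCP connection and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer at all within the timeout: traffic is likely filtered.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on this port/interface.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and so on.
        return f"error: {exc}"
```

Run it from your own machine against the instance's public DNS name and port 8989, e.g. print(probe_tcp("ec2-XX-XX-XXX-XXX.compute-1.amazonaws.com", 8989)).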

How can I connect to AWS Documentdb with Robo 3T?

Using the latest Robo 3T and the command line provided by AWS
mongodb://<dbname>:<insertYourPassword>@example-db.cluster-c2e1234stuff0e.eu-west-2.docdb.amazonaws.com:27017
I get this Error:
Reason:
SSL tunnel failure: Network is unreachable or SSL connection rejected by server.
Reason: Connect failed
I have also tried following this walkthrough but had no joy.
I have read that it is possible to SSH to an EC2 instance in the same VPC and access DocumentDB that way, but ideally I would like to access it directly and not pay for an extra EC2 instance. Do I have that right?
I have tried via Mongo shell too and get the following response:
Error: couldn't connect to server example-db.cluster-c2eblahblaho0e.eu-west-2.docdb.amazonaws.com:27017, connection attempt failed: NetworkTimeout: Error connecting to example-db.cluster-c2eblahblaho0e.eu-west-2.docdb.amazonaws.com:27017 (<IP address>) :: caused by :: Socket operation timed out :
connect@src/mongo/shell/mongo.js:344:17
@(connect):2:6
exception: connect failed

What I suspect is happening is that either you do not have an EC2 instance in the same VPC as your DocumentDB cluster or that EC2 instance is not reachable from your laptop. I'd first connect to the EC2 instance with SSH to establish connectivity and then use that EC2 instance to SSH proxy from Robo3T.
For context, Amazon DocumentDB clusters deployed within a VPC can be accessed directly by EC2 instances or other AWS services that are deployed in the same VPC. Additionally, Amazon DocumentDB can be accessed by EC2 instances or other AWS services in different VPCs in the same region or other regions via VPC peering.
The advantage of deploying clusters within a VPC is that VPCs provide a strong network boundary to the Internet. A common way to connect to DocumentDB from your laptop is to create an EC2 instance within the same VPC as your DocumentDB cluster and SSH tunnel through that EC2 instance to your cluster: https://docs.aws.amazon.com/documentdb/latest/developerguide/connect-from-outside-a-vpc.html
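As a rough illustration of the SSH tunnel, the sketch below only assembles the ssh command line (standard OpenSSH flags: -i for the key file, -N for no remote command, -L for local port forwarding). The function name, key path, user name, and bastion host are placeholders I invented; substitute your own.

```python
import shlex


def build_tunnel_command(key_path, bastion_user, bastion_host, cluster_host,
                         local_port=27017, cluster_port=27017):
    """Return the ssh argument list that forwards local_port to the cluster."""
    # local_port on your laptop -> cluster_host:cluster_port, via the EC2 instance.
    forward = f"{local_port}:{cluster_host}:{cluster_port}"
    return ["ssh", "-i", key_path, "-N", "-L", forward,
            f"{bastion_user}@{bastion_host}"]


if __name__ == "__main__":
    cmd = build_tunnel_command(
        key_path="my-key.pem",                       # placeholder key file
        bastion_user="ec2-user",                     # placeholder user
        bastion_host="ec2-host.example.com",         # placeholder EC2 host
        cluster_host="example-db.cluster-c2e1234stuff0e.eu-west-2.docdb.amazonaws.com",
    )
    # Print a copy-pasteable command; run it in a terminal to open the tunnel.
    print(shlex.join(cmd))
```

With the tunnel open, the client connects to localhost:27017 instead of the cluster endpoint.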
To minimize costs for local development, start with the smallest EC2 instance size and utilize the start/stop functionality when not using the cluster.
The same can be done with DocumentDB. When you are developing, you can save on instance costs by stopping the cluster when it is no longer needed: https://docs.aws.amazon.com/documentdb/latest/developerguide/db-cluster-stop-start.html
An alternative is to use AWS Cloud9: https://docs.aws.amazon.com/documentdb/latest/developerguide/connect-with-cloud9.html. This solution still requires an EC2 instance in the same VPC as your Amazon DocumentDB cluster. What is useful about it is that Cloud9 provides a mechanism to automatically shut down the EC2 instance after it has been idle for, say, 30 minutes, which helps save costs.

Setup VPN to connect VPC to home network?

I'm not clear if this is possible, but here is what I'd like to do:
Goal:
Set up a VPN between my home network and my AWS VPC. A use case I'd like to have working:
Have a Lambda function write to a database, e.g. Postgres running on my home network behind my router. Think of some machine with 192.168.. address on my home network running Postgres
I have read the documentation and I wanted to confirm what it would require to make this happen. Assume I have a VPC with a Lambda deployed to it.
Create a Virtual Private Gateway for the VPC
Create a Customer Gateway for my home network.
Configure the Customer Gateway machine in my home network (e.g. Raspberry PI) after downloading the vpn connection file from AWS.
I'm looking at this article for reference:
setup raspberry PI3 as AWS VPN Customer Gateway
Is this all that I would need to do? Do I need to use some 3rd party software in addition to this? Or is this not even possible?
Thanks

You can set up an OpenVPN server on an EC2 instance and change the security groups of the resources inside your VPC to only allow access from your VPC CIDR block.
AWS provides an AMI for OpenVPN server: https://aws.amazon.com/marketplace/pp/B00MI40CAE/ref=mkt_wir_openvpn_byol

Able to ping EC2 from on-premises through VPN. But, unable to ping DMS replication instance

I have set up a VPN and am able to ping the private IP of the EC2 instance from on-premises and vice versa. However, I am unable to ping the private IP of the DMS replication instance.
I have created an endpoint pointing to the DB on EC2, and the endpoint test connection succeeds. However, the endpoint test connection fails for the on-premises DB.
The EC2 instance and the DMS replication instance use the same subnet, security group, etc. The details are given in the image below.
May I know
1) why the DMS instance is not communicating with on-premises (and vice-versa)
2) why EC2 works fine in VPN but not DMS instance?
EDIT:
Details of Security Group associated with the DMS instance:
vpc - the same default vpc used by EC2
inbound rules - all traffic, all protocol, all port range, source = 192.168.0.0/24
outbound rules - all traffic, all protocol, all port range, source = 0.0.0.0/0
Route table:
destination - 10.0.0.0/16, target = local
destination - 0.0.0.0/0, target = internet gateway
destination - 192.168.0.0/24, target = virtual private gateway used in VPN
This is the error message I get when I try to test the DMS DB endpoint connection:
Test Endpoint failed: Application-Status: 1020912, Application-Message: Failed to connect Network error has occurred, Application-Detailed-Message: RetCode: SQL_ERROR SqlState: HYT00 NativeError: 0 Message: [unixODBC][Microsoft][ODBC Driver 13 for SQL Server]Login timeout expired ODBC general error.

You might need to describe/provide your full network topology for a more precise answer, but my best guess, based on AWS' documentation on "Network Security for AWS Database Migration Service", is that you're missing source and target database configuration:
Database endpoints must include network ACLs and security group rules that allow incoming access from the replication instance. You can achieve this using the replication instance's security group, the private IP address, the public IP address, or the NAT gateway’s public address, depending on your configuration.
Also, is this EC2 you mentioned a NAT instance? Just in case:
If your network uses a VPN tunnel, the Amazon EC2 instance acting as the NAT gateway must use a security group that has rules that allow the replication instance to send traffic through it.
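To sanity-check the routing side, here is a small sketch of how a VPC route table picks a target by the most specific (longest-prefix) match, using the three routes quoted in the question. It only models route selection, not security groups or network ACLs, and the function name route_for is made up for the example.

```python
import ipaddress

# Route table as listed in the question: (destination CIDR, target).
ROUTES = [
    ("10.0.0.0/16", "local"),
    ("0.0.0.0/0", "internet gateway"),
    ("192.168.0.0/24", "virtual private gateway"),
]


def route_for(destination, routes=ROUTES):
    """Return the target of the most specific route matching the address."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if addr in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None
    # The longest prefix (most specific route) wins.
    return max(matches)[1]
```

For an on-premises address in 192.168.0.0/24 this picks the virtual private gateway rather than the internet gateway, so if the replication instance's subnet really uses this table, the routes themselves look plausible and the security group/ACL rules on the database side are the next thing to check.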

connect lambda to another vpc via an EC2 vpn tunnel

We have two separate VPCs and don't want to set up any peering. One VPC runs OpenVPN software for VPN purposes, and a Lambda in the other VPC needs access to a resource in the OpenVPN VPC. Can this be done by creating a tunnel from an EC2 instance running in the Lambda's VPC that is connected to the other VPC via a VPN client? Would this work in this scenario, or are there other alternatives? The Lambda needs to reach the Elasticsearch service running in the other VPC via the VPN client running on the EC2 instance.

Please create a VPC peering connection between these two VPCs and configure the route tables of both.
If you need further help, please let me know.
