How to run DBT in AWS Lambda?

I have dockerized my DBT solution and I launch it in AWS Fargate (triggered from Airflow). However, Fargate needs about 1 minute to start running (image pull, resource provisioning, etc.), which is fine for long-running executions (hours), but not for short ones (1-5 minutes).
I'm trying to run my Docker container in AWS Lambda instead of AWS Fargate for short executions, but I encountered several problems during this migration.
The one I cannot fix is related to the message below, which appears when running dbt deps --profiles-dir . && dbt run -t my_target --profiles-dir . --select my_model:
Running with dbt=0.21.0
Encountered an error:
[Errno 38] Function not implemented
It says a function is not implemented, but I cannot see anywhere which function that is. Since the error appears while installing the dbt packages (redshift and dbt_utils), I tried downloading them and including them in the Docker image (setting local paths in packages.yml), but nothing changed. Moreover, DBT writes no logs at this phase (I set the log-path to /tmp in dbt_project.yml so that it has write permissions within the Lambda), so I'm flying blind.
Digging into this problem, I've found that it can be related to multiprocessing issues within AWS Lambda (my Docker image contains Python scripts), as stated in https://github.com/dbt-labs/dbt-core/issues/2992. I run DBT from Python using the subprocess library.
Since it may be a multiprocessing issue, I have also tried setting threads: 1 in profiles.yml, but it did not solve the problem.
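For reference, the Lambda handler shells out to dbt roughly like this (a simplified sketch; the working directory, target, and model names here are illustrative):

import subprocess

def handler(event, context):
    # Simplified version of what the container does today; /var/task and the
    # target/model names are illustrative placeholders.
    cmd = (
        "dbt deps --profiles-dir . && "
        "dbt run -t my_target --profiles-dir . --select my_model"
    )
    completed = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, cwd="/var/task"
    )
    print(completed.stdout)
    print(completed.stderr)
    return {"returncode": completed.returncode}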
Has anyone succeeded in deploying DBT in AWS Lambda?

I've recently been trying to do this, and the summary of what I've found is that it seems to be possible, but isn't worth it.
You can pretty easily build a Lambda Layer that includes dbt & the provider you want to use, but you'll also need to patch the multiprocessing behavior and invoke dbt.main from within the Lambda code. Once you've jumped through all those hoops, you're left with a dbt instance that is limited to a relatively small upper bound on memory, a 15-minute maximum runtime, and a single thread.
This discussion gives a rough example of what's needed to get it running in Lambda: https://github.com/dbt-labs/dbt-core/issues/2992#issuecomment-919288906
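For illustration, here is a very rough sketch of that approach (untested here, and the exact patch in the linked comment differs in detail): swap the semaphore-backed multiprocessing primitives that fail on Lambda for threading equivalents, then call dbt's Python entry point directly instead of shelling out. It assumes dbt 0.21, which exposes handle_and_check in dbt.main and keeps its multiprocessing context in dbt.flags.MP_CONTEXT:

import threading
import queue

class ThreadingContext:
    # Hypothetical stand-in for the multiprocessing context: Lambda has no
    # working sem_open, so semaphore-backed Lock/Queue raise [Errno 38];
    # thread-based equivalents are enough when dbt runs with a single thread.
    Lock = staticmethod(threading.Lock)
    RLock = staticmethod(threading.RLock)
    Queue = staticmethod(queue.Queue)

import dbt.flags
dbt.flags.MP_CONTEXT = ThreadingContext()  # assumption: dbt 0.21 reads its context from here

from dbt.main import handle_and_check

def handler(event, context):
    # Run dbt in-process; the profiles dir, target, and model are illustrative.
    results, succeeded = handle_and_check(
        ["run", "-t", "my_target", "--profiles-dir", "/var/task", "--select", "my_model"]
    )
    return {"success": succeeded}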
All that said, I'd love to put dbt on a Lambda and I hope dbt's multiprocessing will one day support it.

Related

Can I specify a timeout for a GCP ai-platform training job?

I recently submitted a training job with a command that looked like:
gcloud ai-platform jobs submit training foo --region us-west2 --master-image-uri us.gcr.io/bar:latest -- baz qux
(more on how this command works here: https://cloud.google.com/ml-engine/docs/training-jobs)
There was a bug in my code which caused the job to just keep running rather than terminate. Two weeks and $61 later, I discovered my error and cancelled the job. I want to make sure I don't make that kind of mistake again.
I'm considering using the timeout command within the training container to kill the process if it takes too long (typical runtime is about 2 or 3 hours), but rather than trust the container to kill itself, I would prefer to configure GCP to kill it externally.
Is there a way to achieve this?
As a workaround, you could write a small script that runs your command, sleeps for however long you want to allow, and then runs a cancel-job command.
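A minimal sketch of that workaround in Python, reusing the job name and image from the question (the four-hour timeout is just an example):

import subprocess
import time

JOB_NAME = "foo"                  # job name from the question
TIMEOUT_SECONDS = 4 * 60 * 60     # allow up to 4 hours, then cancel

# Submit the training job (same command as in the question).
subprocess.run(
    ["gcloud", "ai-platform", "jobs", "submit", "training", JOB_NAME,
     "--region", "us-west2", "--master-image-uri", "us.gcr.io/bar:latest",
     "--", "baz", "qux"],
    check=True,
)

# Sleep through the allowed window, then cancel unconditionally.
# If the job already finished, the cancel call just errors and is ignored here.
time.sleep(TIMEOUT_SECONDS)
subprocess.run(["gcloud", "ai-platform", "jobs", "cancel", JOB_NAME])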
As a timeout definition is not available in the AI Platform training service, I took the liberty of opening a Public Issue with a Feature Request to record the lack of this option. You can track the PI progress here.
Besides the script mentioned above, you can also try:
TimeOut Keras callback, or timeout= Optuna param (depending on which library you actually use)
Cron-triggered Lambda (Cloud Function)

Run TPCC on local cockroachdb cluster

I have a local cockroachdb up and running by following instructions from https://www.cockroachlabs.com/docs/stable/start-a-local-cluster.html
I am trying to run the tpcc benchmark following the instructions from https://www.cockroachlabs.com/docs/stable/performance-benchmarking-with-tpc-c.html
It looks like the TPCC binary workload.LATEST assumes the cluster is on Google Cloud, and so it issues the following error:
$ ./workload.LATEST fixtures load tpcc --warehouses=1000 "postgres://root@localhost:26257?sslmode=disable"
Error: failed to create google cloud client (You may need to setup the GCS application default credentials: 'gcloud auth application-default login --project=cockroach-shared'): dialing: google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
What can I change to run the benchmark?
If you upgrade to v2.1, workload is a built-in command that you can run against your cluster, and it does not make the Google Cloud assumption: https://www.cockroachlabs.com/docs/stable/cockroach-workload.html
It's not nearly as fast as using the fixtures stored in Google Cloud, but you can load the data into your cluster using normal SQL statements by running something like:
workload init tpcc --warehouses=1000
Note that while I'm not sure exactly how long it will take to load 1000 warehouses in this way locally, I expect it will take quite some time.

Downloading and Transferring Files Simultaneously with Limited EC2 Storage

I am using an EC2 instance in AWS to run a bash script that downloads files from a server using a CLI while simultaneously moving them into S3 using the AWS CLI (aws s3 mv). However, I usually run out of storage before I can do this because the download speeds are faster than the transfer speeds to S3. The files, which are downloaded daily, are usually hundreds of GB and I do not want to upgrade storage capacity if at all possible.
The CLI I am using for the downloads runs continuously until success/failure but outputs status messages to the console as it goes (when I run it from the command line instead of a .sh). I am looking for a way to run this script within the constraints described above. My most recent attempt was to use something along the lines of:
until (CLI_is_downloading) | grep -m 1 "download complete"; do aws s3 mv --recursive path/local_memory s3://path/s3; done
But that ran out of memory and the download failed well before the move was finished.
Some possible solutions I thought of are to somehow run the download CLI until available storage drops to a certain point, switch to the transfer, and then alternate back and forth. Also, I am not too experienced with AWS, so I am not sure this would work, but could I limit the download speed to match the transfer speed (like network throttling)? Any advice on the practicality of my ideas or other suggestions on how to implement this would be greatly appreciated.
EDIT: I checked my console output again and it seems that the aws s3 mv --recursive only moved the files that were currently there when the function was first called and then stopped. I believe if I called it repeatedly until I got my "files downloaded" message from my other CLI command, it might work. I am not sure exactly how to do this yet so suggestions would still be appreciated but otherwise, this seems like a job for tomorrow.
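Roughly what I have in mind for that, sketched in Python (untested; the download CLI, paths, and bucket are placeholders, and it assumes the downloader only leaves a file in place once it is fully written):

import subprocess
import time

LOCAL_DIR = "path/local_memory"            # placeholder local staging directory
S3_DEST = "s3://path/s3"                   # placeholder bucket/prefix
DOWNLOAD_CMD = ["download-cli", "..."]     # placeholder for the vendor download CLI

# Start the download in the background.
download = subprocess.Popen(DOWNLOAD_CMD)

# While it is still running, keep sweeping completed files into S3 so local
# disk usage stays bounded; aws s3 mv deletes local copies after upload but
# only sees files that exist when it runs, hence the loop.
while download.poll() is None:
    subprocess.run(["aws", "s3", "mv", "--recursive", LOCAL_DIR, S3_DEST])
    time.sleep(30)

# One final sweep for anything written after the last pass.
subprocess.run(["aws", "s3", "mv", "--recursive", LOCAL_DIR, S3_DEST], check=True)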

Serverless Detect Running Locally

I am running a command like the following.
serverless invoke local --function twilio_incoming_call
When I run locally, I plan to detect this in my code and, instead of looking for POST variables, look for a MOCK file I'll be giving it.
However, I don't know how to detect whether I'm running via this serverless local command.
How do you do this?
I looked around on the serverless website and could find lots of info about running locally, but nothing about detecting whether you are running locally.
I found out the answer. process.env.IS_LOCAL will detect if you are running locally. Missed this on their website somehow...
If you're using AWS Lambda, it has some built-in environment variables. In the absence of those variables, you can conclude that your function is running locally.
https://docs.aws.amazon.com/lambda/latest/dg/lambda-environment-variables.html
const isRunningLocally = !process.env.AWS_EXECUTION_ENV
This method works regardless of the framework you use, whether that is Serverless, Apex Up, AWS SAM, etc.
You can also check what is in process.argv:
process.argv[1] will equal '/usr/local/bin/sls'
process.argv[2] will equal 'invoke'
process.argv[3] will equal 'local'

EC2 init.d script - what's the best practice

I'm creating an init.d script that will run a couple of tasks when the instance starts up.
it will create a new volume with our code repository and mount it if it doesn't exist already.
it will tag the instance
Completing the tasks above is crucial for our site (i.e. without the code repository mounted, the site won't work). How can I make sure that the server doesn't end up being publicly visible before they finish? Should I start my init.d script by de-registering the instance from the ELB (I'm not even sure it will be registered at that point), and then register it again when all the tasks have finished successfully?
What is the best practice?
Thanks!
You should have a health check on your ELB. So your server shouldn't get in unless it reports as happy. And it shouldn't report happy if the boot script errors out.
(Also, you should look into using cloud-init. That way you can change the boot script without making a new AMI.)
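For example, a minimal sketch with boto3, assuming a classic ELB and a /health endpoint that only starts returning 200 once the volume is mounted (the load balancer name and path here are made up):

import boto3

# Point the classic ELB health check at an endpoint that only returns 200
# once the boot script has mounted the code volume.
elb = boto3.client("elb")
elb.configure_health_check(
    LoadBalancerName="my-load-balancer",   # made-up name
    HealthCheck={
        "Target": "HTTP:80/health",        # made-up health endpoint
        "Interval": 30,
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 3,
    },
)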
I suggest you use CloudFormation instead. You can bring up the full stack of your system by representing it in a JSON-format template.
For example, you can create an Auto Scaling group whose instances have unique tags and an additional volume attached (which presumably holds your code).
Here's a sample JSON template attaching an EBS volume to an instance:
https://s3.amazonaws.com/cloudformation-templates-us-east-1/EC2WithEBSSample.template
And here are many other JSON templates that you can use for guidance to deploy your specific stack and application:
http://aws.amazon.com/cloudformation/aws-cloudformation-templates/
Of course, you can accomplish the same using an init.d script or the rc.local file in your instance, but I believe CloudFormation is a cleaner solution because it works from the outside (not inside your instance).
You can also write your own script that brings up your stack from the outside, but why reinvent the wheel?
Hope this helps.
