I am a beginner in AWS (coming from the Microsoft stack). I want to run a SQL query against Redshift tables on a daily basis to find duplicates in a table and send the results out in an email to a Prod Support group.
Please advise on the right way to proceed with this.
I recommend doing this with either AWS Lambda or AWS Batch. Use one of these services to run a short query on a schedule and send out the results if required.
Lambda is ideal for simple tasks that complete quickly. https://aws.amazon.com/lambda/ Note that Lambda charges by duration and has very tight limits on how long a function can run. A basic skeleton for connecting to Redshift from Lambda is provided in this S.O. answer: Using psycopg2 with Lambda to Update Redshift (Python)
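To make the shape of that concrete, here is a minimal sketch of such a Lambda function in Python, assuming psycopg2 is bundled in the deployment package (or a layer) and the results go out through Amazon SES; the table name, key column, environment variables, and email addresses are placeholders you would replace with your own. The function would be triggered on a schedule (e.g. a CloudWatch Events/EventBridge rule).

```python
import os
import boto3
import psycopg2  # must be bundled with the deployment package or provided by a layer

# Hypothetical duplicate check: table and key column are placeholders.
DUPLICATE_QUERY = """
    SELECT order_id, COUNT(*) AS dup_count
    FROM   public.orders
    GROUP  BY order_id
    HAVING COUNT(*) > 1
    ORDER  BY dup_count DESC;
"""

def lambda_handler(event, context):
    # Connect to the Redshift cluster using credentials from environment variables.
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            cur.execute(DUPLICATE_QUERY)
            rows = cur.fetchall()
    finally:
        conn.close()

    # Send the result to the Prod Support group via SES (addresses must be SES-verified).
    body = "\n".join(f"{key}: {count} copies" for key, count in rows) or "No duplicates found."
    boto3.client("ses").send_email(
        Source="reports@example.com",
        Destination={"ToAddresses": ["prod-support@example.com"]},
        Message={
            "Subject": {"Data": "Daily Redshift duplicate check"},
            "Body": {"Text": {"Data": body}},
        },
    )
    return {"duplicates": len(rows)}
```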
Batch is useful for more complex or long-running tasks that need to run in sequence. https://aws.amazon.com/batch/
There is no built-in capability in Amazon Redshift to do this for you (e.g. no stored procedures).
The right way is to write a program that queries Redshift and then sends an email.
I see that you tagged your question with aws-lambda. I'd say that a Lambda function would not be suitable here because it can only run for a maximum of 5 minutes, and your analysis might need longer than that to complete.
Instead, you could run the program from an Amazon EC2 instance, or from any computer connected to the Internet.
I have a Java application in which I am using GCP to create VM instances from images.
In this application, I would like to allow the user to view the VM creation logs in order to stay updated on the status of the creation and to see failure points in detail.
I am sure such logs exist in GCP, but I have been unable to find specific APIs that let me see a specific action, for example the creation of instance "X".
Thanks for the help
When you create a VM, what you get back is an operation ID (the creation takes time, so the Compute Engine API answers immediately). To know the status of the VM creation (and start), you have to poll this operation regularly.
In the logs, you can also filter on this operation ID to select and view only the entries you want on the Compute Engine API side (create/start errors).
If you want to see the logs of the VM itself, filter not on the operation ID but on the name of the VM and its zone.
In Java, there are client libraries that help you achieve this.
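For a rough illustration of the flow, here is a sketch in Python (the Java client library exposes equivalent insert/wait and log-listing calls), assuming the google-cloud-compute and google-cloud-logging packages; the project, zone, image, and instance names are placeholders.

```python
from google.cloud import compute_v1
from google.cloud import logging as gcp_logging

PROJECT, ZONE, NAME = "my-project", "europe-west1-b", "instance-x"  # placeholders

# 1. insert() returns an operation immediately; wait on it to learn the creation status.
instances = compute_v1.InstancesClient()
instance = compute_v1.Instance(
    name=NAME,
    machine_type=f"zones/{ZONE}/machineTypes/e2-small",
    disks=[compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            source_image="projects/debian-cloud/global/images/family/debian-12",
        ),
    )],
    network_interfaces=[compute_v1.NetworkInterface(name="global/networks/default")],
)
operation = instances.insert(project=PROJECT, zone=ZONE, instance_resource=instance)
operation.result(timeout=300)            # blocks until the operation finishes
if operation.error_code:                 # detailed failure information lives on the operation
    print(operation.error_code, operation.error_message)

# 2. Audit log entries for that specific instance can be pulled from Cloud Logging,
#    filtered by resource type, method name, and instance name.
log_filter = (
    'resource.type="gce_instance" '
    'AND protoPayload.methodName:"compute.instances.insert" '
    f'AND protoPayload.resourceName:"{NAME}"'
)
for entry in gcp_logging.Client(project=PROJECT).list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)
```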
Very new to Datadog and need some help. I have crafted 2 SQL queries (one for an on-prem database and one for a cloud database) and I would like to run those queries through Datadog, display the query results, and validate that the daily results fall within an expected variance between the two systems.
I have already set up Datadog on the cloud environment and believe I should use DogStatsD to create a custom metric, but I am pretty lost as to how I can incorporate the necessary SQL queries into the code to create the metric for eventual display on a dashboard. Any help will be greatly appreciated!
You probably want to be using the MySQL integration, and configure the 'custom queries' option: https://docs.datadoghq.com/integrations/faq/how-to-collect-metrics-from-custom-mysql-queries
You can follow those instructions after you configure the base integration: https://docs.datadoghq.com/integrations/mysql/#pagetitle (This will give you a lot of useful metrics in addition to the custom queries you want to run.)
As you mentioned, DogStatsD is a library you can import into whatever script or application in order to submit metrics. But it really isn't common practice to modify the underlying code of your database. So instead it makes more sense to externally run a query on the database, take those results, and send them to Datadog. You could totally write a Python script or something to do this. However, the Datadog agent already has this capability built in, so it's probably easier to just use that.
I am also just assuming SQL refers to MySQL; there are other integrations for things like SQL Server and PostgreSQL, and pretty much every implementation of SQL. The same pattern applies: you configure the integration, then add an extra entry to the config file where you have the check run your queries.
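If you did go the script route mentioned above, a rough sketch could look like the following, assuming the datadog and mysql-connector-python packages, an agent with DogStatsD listening on the default port, and placeholder hostnames, credentials, queries, and metric names. The agent's custom_queries option does essentially the same thing from config instead of code.

```python
import mysql.connector
from datadog import initialize, statsd

# Point DogStatsD at the local Datadog agent (default endpoint).
initialize(statsd_host="127.0.0.1", statsd_port=8125)

def report_row_count(host, user, password, database, tag):
    """Run a validation query and submit the result as a gauge metric."""
    conn = mysql.connector.connect(host=host, user=user, password=password, database=database)
    try:
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM orders WHERE created_at >= CURDATE()")  # your query here
        (count,) = cur.fetchone()
    finally:
        conn.close()
    statsd.gauge("reconciliation.daily_row_count", count, tags=[f"system:{tag}"])
    return count

onprem = report_row_count("onprem-db.internal", "dd_check", "secret", "sales", "onprem")
cloud = report_row_count("cloud-db.example.com", "dd_check", "secret", "sales", "cloud")

# The variance between the two systems can also be submitted, graphed, and alerted on.
statsd.gauge("reconciliation.variance", abs(onprem - cloud))
```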
I have turned on database activity events, which I think are some kind of log on AWS Aurora. They are currently being passed through AWS Kinesis into S3 via AWS Firehose. A log record in S3 looks like this:
{"type":"DatabaseActivityMonitoringRecords","version":"1.0","databaseActivityEvents":"AYADeOC+7S/mFpoYLr17gZCXuq8AXwABABVhd3MtY3J5cHRvLXB1YmxpYy1rZXkAREFvbjhIZ01uQTVpVHlyS0l3NnVIOS9xdXF3OWEza0xZV0c2QXYzQmtWUFI2alpIK2hsczNwalAyTTIzYnpPS2RXUT09AAEAAkJDABtEYXRhS2V5AAAAgAAAAAwzb2YKNe4h6b2CpykAMLzY7gDftUKUr3QxmxSzylw9qCRxnGW9Fn1qL4uKnbDV/PE44WyOQbXKGXv9s8BxEwIAAAAADAAAEAAAAAAAAAAAAAAAAAC+gU55u4hvWxW1RG/FNNSJ/////wAAAAEAAAAAAAAAAAAAAAEAAACtbmBmDwZw2/1rKiwA4Nyl7cm19/RcHhCpMMwbOFFkZHKL/bvsohf5T+yM9vNxCgAi2qTUIEe17VA5bJ0eCcNAA9mb6Ys+PR1w7QhKrQsHHTBC2dhJ4ELwpXamGRmPLga5Dml2rOveA59YefcJ4PhrqztZXfrS8fBYJ3HgBWHY9nPh1jdyinjQAl61hQrz2LPII85zlqAWTNeL2pXwaRdtGdYeIXXoh4VsoV3Q18Hj/uOQzTIbT8EJvwnk0gj8AGcwZQIxAJNuoCJhHPUfbkk0fHF6HYz1STIc4HX2HOl0qSIHqwpgtQK6BMa3YlPI9hNwhB8x+AIwWDY0bMjuLRGQgjjBv5z1xPpZQ+pMZ4K6m9JaNBFVKxZTvqDL1z7lrV0rlbZThad+","key":"AQIDAHhQgnMAiP8TEQ3/r+nxwePP2VOcLmMGvmFXX8om3hCCugE7IUxSH/eJBEKvnkYoNIqFAAAAfjB8BgkqhkiG9w0BBwagbzBtAgEAMGgGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMQIX97gE5ioBR1+nnAgEQgDuDX2B2T7nOxjKDyL31+wHJb0pwkCeaU7CwA6BwIkiT7FmhMB71XgvCVrY9C9ABUtc1e5J7QIfsVB214w=="}
I think a KMS key is being used to encrypt that log record. How do I decrypt it? Is there working sample code somewhere? Also, more importantly, the Aurora database I'm using is a test database with no activity (no inserts, selects, updates). Why are there so many logs? Why are there so many databaseActivityEvents? They seem to be getting written to S3 every minute of the day.
Yes, it uses the RDS activity stream KMS key (ActivityStreamKmsKeyId) to encrypt the log events, which are also base64 encoded. You will have to make use of the AWS cryptographic SDKs to decrypt the data key and the log events.
For reference, see the sample Java and Python versions here:
Processing a Database Activity Stream using the AWS SDK
In your Firehose pipeline you can add a transformation step with Lambda and do this decryption in your Lambda function.
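For a rough idea, here is a condensed sketch based on the Python version of that sample, assuming the aws-encryption-sdk and boto3 packages; the region and the cluster's DbClusterResourceId are placeholders.

```python
import base64
import json
import zlib
import boto3
import aws_encryption_sdk
from aws_encryption_sdk import CommitmentPolicy
from aws_encryption_sdk.internal.crypto import WrappingKey
from aws_encryption_sdk.key_providers.raw import RawMasterKeyProvider
from aws_encryption_sdk.identifiers import WrappingAlgorithm, EncryptionKeyType

REGION = "us-east-1"                      # placeholder
RESOURCE_ID = "cluster-ABCD1234EXAMPLE"   # the cluster's DbClusterResourceId (placeholder)

enc_client = aws_encryption_sdk.EncryptionSDKClient(
    commitment_policy=CommitmentPolicy.FORBID_ENCRYPT_ALLOW_DECRYPT)


class MyRawMasterKeyProvider(RawMasterKeyProvider):
    """Wraps the plaintext data key returned by KMS so the Encryption SDK can use it."""
    provider_id = "BC"

    def __new__(cls, *args, **kwargs):
        return super(RawMasterKeyProvider, cls).__new__(cls)

    def __init__(self, plain_key):
        RawMasterKeyProvider.__init__(self)
        self.wrapping_key = WrappingKey(
            wrapping_algorithm=WrappingAlgorithm.AES_256_GCM_IV12_TAG16_NO_PADDING,
            wrapping_key=plain_key,
            wrapping_key_type=EncryptionKeyType.SYMMETRIC)

    def _get_raw_key(self, key_id):
        return self.wrapping_key


def decrypt_record(record):
    """Decrypt one DatabaseActivityMonitoringRecords document like the one shown above."""
    payload = base64.b64decode(record["databaseActivityEvents"])
    encrypted_data_key = base64.b64decode(record["key"])

    # 1. Decrypt the per-record data key with the activity stream's KMS key.
    data_key = boto3.client("kms", region_name=REGION).decrypt(
        CiphertextBlob=encrypted_data_key,
        EncryptionContext={"aws:rds:dbc-id": RESOURCE_ID})["Plaintext"]

    # 2. Decrypt the payload with the Encryption SDK, then gunzip the result.
    key_provider = MyRawMasterKeyProvider(data_key)
    key_provider.add_master_key("DataKey")
    decrypted, _header = enc_client.decrypt(source=payload, key_provider=key_provider)
    return json.loads(zlib.decompress(decrypted, zlib.MAX_WBITS + 16))
```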
Why are there so many events in an idle Postgres RDS cluster? They are heartbeat events.
When you decrypt and look at the actual activity event JSON, it has a type field which can be either record or heartbeat. Events with type record are the ones generated by user activity.
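So after decrypting, the heartbeat noise can be filtered out like this (reusing the hypothetical decrypt_record() from the sketch above):

```python
# Keep only genuine user activity; drop the once-a-minute heartbeats.
events = decrypt_record(record)["databaseActivityEventList"]
user_activity = [e for e in events if e.get("type") == "record"]
```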
I want to know the best approach for MySQL connection creation and termination against an external MySQL instance, to allow more than 1000 users to access an AWS Lambda function at the same time.
The best option is to configure MySQL on AWS RDS.
Accessing MySQL from AWS Lambda is no different from accessing it from any native code (Java, Python, Node, or C#). Be sure to configure proper roles so that MySQL can be accessed from Lambda (details: https://docs.aws.amazon.com/lambda/latest/dg/vpc-rds.html).
1000 concurrent executions is the default limit for Lambda in a region. From the console you can request an increase to that limit, or reserve concurrency for a specific function. -> https://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html
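One common pattern (sketched below with the pymysql package; the host, credentials, and query are placeholders) is to create the connection outside the handler so warm Lambda containers reuse it instead of opening and closing a new connection on every one of those concurrent requests.

```python
import os
import pymysql

# Created once per container, then reused across invocations while the container stays warm.
connection = pymysql.connect(
    host=os.environ["DB_HOST"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
    connect_timeout=5,
)

def lambda_handler(event, context):
    connection.ping(reconnect=True)  # re-open if the server closed the idle connection
    with connection.cursor() as cur:
        cur.execute("SELECT id, name FROM users WHERE id = %s", (event["user_id"],))
        row = cur.fetchone()
    return {"user": row}
```

Keep in mind that each concurrently running container holds its own connection, so the MySQL server's max_connections setting needs to accommodate your peak Lambda concurrency.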
Please help me answer the questions below.
What is the deployment strategy for Hive-related scripts? For SQL we have DACPAC; is there any such component?
Is there any API to get the status of a job submitted through ODBC?
Have you looked at Azure Data Factory? http://azure.microsoft.com/en-us/services/data-factory/
Regarding your question on APIs to check job status, here are a few PowerShell cmdlets. Do these help you?
“Start-AzureHDInsightJob” (https://msdn.microsoft.com/en-us/library/dn593743.aspx) starts the job and returns a job object which can be used to track/kill the job.
“Wait-AzureHDInsightJob” (https://msdn.microsoft.com/en-us/library/dn593748.aspx) uses the job object to check the status of the job. It will wait until the job completes or the wait time is exceeded.
“Stop-AzureHDInsightJob” (https://msdn.microsoft.com/en-us/library/dn593754.aspx) stops the job.