Datadog data validation between on-prem and cloud database - validation

Very new to Datadog and need some help. I have crafted 2 SQL queries (one for on-prem database and one for cloud database) and I would like to run those queries through Datadog and be able display the query results and validate that the daily results fall within an expected variance between the two systems.
I have already set up Datadog on the cloud environment and believe I should use DogStatsD to create a custom metric but I am pretty lost with how I can incorporate my necessary SQL queries in the code to create the metric for eventual display on a dashboard. Any help will be greatly appreciated!!!

You probably want to be using the MySQL integration, and configure the 'custom queries' option: https://docs.datadoghq.com/integrations/faq/how-to-collect-metrics-from-custom-mysql-queries
You can follow those instructions after you configure the base integration https://docs.datadoghq.com/integrations/mysql/#pagetitle (This will give you a lot of use metrics in addition to the custom queries you want to run)
As you mentioned, DogStatsD is a library you can import to whatever script or application in order to submit metrics. But it really isn't a common practice in the slightest to modify the underlying code of your database. So instead it makes more sense to externally run a query on the database, take those results, and send them to datadog. You could totally write a python script or something to do this. However the Datadog agent already has this capability built in, so it's probably easier to just use that.
I am also just assuming SQL refers to MySQL, there are other integration for things like SQL Server, and PostgreSQL, and pretty much every implementation of sql. And the same pattern applies where you would configure the integration, and then add an extra line to the config file where you have the check run your queries.

Related

Azure Blob Storage lifecycle management - send report or log after run

I am considering using Azure Blob Storage's build-in lifecycle management feature for deleting blobs of a certain age.
However, due to a business requirement, it must be possible to generate a report or log statement after each daily execution of the defined ruleset. The report or log must state the number of blob blocks that were affected, e.g. deleted during the run.
I have read through the documentation and Googled to see if others have had similar inquiries, but so far without any luck.
So my question: Does any of you know if and how I can get a build-in Lifecycle management system to do one of the following after each daily run:
Add a log statement to the storage account containing the Blob storage.
Generate and send a report to an endpoint I define.
If the above can't be done I will have to code the daily deletion job and report generation myself, which surely I can do, but I would like to use the built-in feature if possible.
I summarize the solution as below.
If you want to know which blobs are deleted every day, we can configure Diagnostics settings in the storqge account. After doing that, we will get the logs for read, write, and delete requests for the blob. For more detail, please refer to here and here
Regarding how to enable it, we can use PowerShell command Set-AzStorageServiceLoggingProperty.

How to plug in a process of identifying sensitive information somewhere in ETL pipeline?

Hope you are doing well !
We have already developed ETL pipeline using apache NiFi. Which gets trigger only when client uploads source data file from portal.After that, the data present inside source file goes through various layers,gets transformed and stored back to warehouse(i.e. hive).
Goal : To identify sensitive information and mask it so that end user won't see actual data.
Identify Sensitive data & masking strategy : We will make use of open source tool to achieve this goal as follow.
Data steward studio : This tool allow me to identify sensitive information and tag it properly.
Apache Atlas : Once data steward user has confirmed the tag then that tag will be pushed into Apache atlas.
Apache ranger : At the final, we can define tag based-masking policy using Apache ranger which will allow or deny to specific user. 
For more details on above solution , please visit link.
https://www.youtube.com/watch?v=RzEfLwJaLsc
Problem : In order to feed the data to DSS tool, it should be loaded first in hive table. That is fine. But we cannot stop the existing ETL flow in-between and then start identification process of sensitive information. The above solution must require some manual process which i want to get rid of and make it automated.that is, it should be plugged in somewhere within NiFi pipeline.But so far, as per my understanding DSS do not allow us to do something like that.
Manual Process :
Create Asset collection
Accept/Reject suggested tags within DSS.
If we cannot plug identification process in pipeline, then client sensitive data will be exposed to everyone and visible to everyone in team. I want something where we can de-identify sensitive data before it actually get loaded into HDFS or hive tables.
Please write your response to me on the same problem, if anyone has already worked into this particular area.
  
I did not test it, but here are my thoughts on this challenge.
Set up the system such that data is NOT visible to everyone(or anyone) by default
Load the data into hive
Let the profilers run and accept its suggestions
Open up the data to those who should have access (except for the things found by the profiler)
There are still some implementation details to work out (e.g. How to automate step 3/4 and whether you can just solve this with tags or whether the data needs to sit in a staging area first). But I hope this steers you in a good direction.
One idea might be to use EncryptContent of nifi (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.EncryptContent/). Then the values loaded into Hive will be encrypted in the first place and would not be visible to the stewards. Once the tagging has been done - then in the subsequent part of the pipeline (where I'm assuming you're using nifi as well) - you can decrypt back content as required.

Logging for Talend job running within spring-boot

We have talend-jobs triggered within Spring-boot application. Is there any way to configure the output of talend-jobs to the application log files?
One workaround we find is to write logs directly to an external file (filePath passed as context-param). But wanted to find if there is a better way to configure this seamlessly.
Not sure if I understood the question correctly, but I guess your concerns might be on what might have happened to the triggered Jobs.
Logging
With Respect to Logging for Talend, You could configure using Log4j,
https://help.talend.com/reader/5DC~TBhDsBie5JTXyVLW4g/QSGCZJKXo~uhKvZDq1DxUg
Monitoring
Regarding the Status of the Job Executed, you could get the execution details retrieved using REST Call(Talend Metaservlet API).
getTaskExecutionStatus
https://help.talend.com/reader/oYf9gKhmYrkWCiSua4qLeg/SLiAyHyDTjuznLR_F~MiQQ
By Modifying the Existing Talend Job,You could also design a like a feedback loop, ie Trigger a REST Call back to your application. With the details of Execution from Talend Job.

Automate SQL Query to send email on Redshift

I am beginner in AWS (from Microsoft domain). I want to run a SQL query against Redshift tables to view duplicates in table on daily basis and send results out in email to a Prod Support group.
Please advise, what is right way to proceed on this.
Recommend doing this with either AWS Lambda or AWS Batch. Use one of these services to issue a short query on a schedule and send the results if required.
Lambda is ideally for simple tasks that complete quickly. https://aws.amazon.com/lambda/ Note that Lambda charges by duration has very tight limits on how long a step can run. A basic skeleton for connecting to Redshift in Lambda is provided in this S.O. answer: Using psycopg2 with Lambda to Update Redshift (Python)
Batch is useful for more complex or long running tasks that need to complete in a sequence. https://aws.amazon.com/batch/
There is no in-built capability with Amazon Redshift to do this for you (eg no stored procedures).
The right way is to write a program that queries Redshift and then sends an email.
I see that you tagged your question with aws-lambda. I'd say that a Lambda function would not be suitable here because it can only run for a maximum of 5 minutes and that might be longer than you need your analysis to run.
Instead, you could run the program from an Amazon EC2 instance, or from any computer connected to the Internet.

Spring Data Couchbase - Search without having admin rights on the cluster

I'm currently working on a POC with Couchbase, using Spring Data to put & get documents on/off a bucket on a cluster.
As I'm working in a big company, I'm lucky they gave me a bucket, but still I don't have the admin rights on the cluster, so I only have access to the bucket.
But as I'm digging into the Spring Data documentation, I'm not able to find a way to retrieve documents without creating views on the server. (I'm getting errors like "Unknown query param" ). Nevertheless with couchbase java sdk i'm able to, through n1ql queries, but the use of the Spring data layer is mandatory.
The answers I found always point me to the server-side function direction
ex : https://stackoverflow.com/a/30928169/3744307
What I would like to find, is a way to add a repository method like
List findReceiptByAccount(String Account)
without having to specificly declare the function server-side.
Is this possible, or have I to send a request to the administrators to create functions for me everytime I have to add a findByX method?
Thanks for your time,
What version of CB is it ?
I think that prior to 4.5, a n1ql access (which you seems to have) is enough to build your index yourself !
With Spring Data Couchbase 2.x that would use a N1QL index in the background, and it would work with a single primary index (although having 1 index per repository entity class would be best for performance). Maybe you can ask your admin to create that index once?

Resources